Building LLMs
CS229: Machine Learning
Yann Dubois | Aug. 13th 2024
LLMs
• Architecture (Transformer)
• Training algorithm/loss
• Data
• Evaluation
• Systems
The first two (the model) are where most of academia focuses; data, evaluation, and systems are what matters in practice.
Overview
Pretraining -> GPT3
• Task & loss
Language Modeling
• LM: probability distribution over sequences of tokens/words p(x_1, …, x_L)
P(the, mouse, ate, the, cheese) = 0.02
P(the, the, mouse, ate, cheese) = 0.0001 (syntactic knowledge: ungrammatical orderings get lower probability)
• Chain rule: p(x_1, …, x_L) = p(x_1) p(x_2|x_1) ⋯ p(x_L|x_1, …, x_{L−1})
=> You only need a model that can predict the next token given past context!
AR Language Models
https://lena-voita.github.io/nlp_course/language_modeling.html#intro
Loss
• Classify the next token's index in the vocabulary
• => cross-entropy loss
• => equivalent to maximizing the text's log-likelihood
https://lena-voita.github.io/nlp_course/language_modeling.html#intro
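A minimal PyTorch sketch of this loss; shapes and tensors are made up, and a real model would produce `logits` from the shifted input tokens:

```python
# A language model classifies the next token's index: cross-entropy between
# the predicted distribution and the actual next token, averaged over positions.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 50_000, 128, 4
tokens = torch.randint(vocab_size, (batch, seq_len + 1))  # token ids (made up)
logits = torch.randn(batch, seq_len, vocab_size)          # model outputs for tokens[:, :-1]

targets = tokens[:, 1:]                                   # predict token t+1 from prefix <= t
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
# loss = mean over positions of -log p(x_t | x_<t); minimizing it maximizes log-likelihood
print(loss)
```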
Tokenizer
• tokenizer: text -> tokens
• Why tokens?
• More general than words (e.g. robust to typos)
• Shorter sequences than with characters
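A toy byte-pair-encoding (BPE) trainer sketches how such tokenizers are built; real tokenizers (e.g. tiktoken, SentencePiece) operate on bytes and are heavily optimized:

```python
# Minimal BPE training sketch in plain Python: repeatedly merge the most
# frequent adjacent token pair, starting from characters.
from collections import Counter

def train_bpe(text: str, num_merges: int):
    tokens = list(text)                      # start from characters (real BPE starts from bytes)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((a, b))
        out, i = [], 0                       # merge every occurrence of the pair
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b); i += 2
            else:
                out.append(tokens[i]); i += 1
        tokens = out
    return merges, tokens

merges, toks = train_bpe("the mouse ate the cheese", num_merges=10)
print(toks)  # shorter than characters, more general than words
```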
Evaluation: perplexity
• Perplexity = exp(average per-token negative log-likelihood): the number of tokens the model is effectively "hesitating" between at each step
• Between 2017 and 2023, models went from "hesitating" between ~70 tokens to <10 tokens
• Perplexity is no longer used for academic benchmarks but is still important during development
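A one-line illustration of that interpretation (the loss value here is made up):

```python
# Perplexity = exp(mean negative log-likelihood per token): the effective
# number of tokens the model "hesitates" between.
import math

mean_nll = 2.0             # assumed average -log p(x_t | x_<t), in nats
print(math.exp(mean_nll))  # ~7.4: like guessing uniformly among ~7 tokens
```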
HELM-lite
[Liang+ 2022]
MMLU
[Hendrycks+ 2020]
Evaluation: challenges
• Sensitivity to prompting/inconsistencies
• Train & test contamination (~not important for development)
Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
Data
• Idea: use all of the clean internet
• Note: the internet is dirty & not representative of what we want. In practice:
1. Download all of the internet. Common Crawl: 250 billion pages, >1 PB (>1e6 GB)
4. Deduplicate (URL/document/line). E.g. headers/footers/menus in forums are always the same
5. Heuristic filtering: remove low-quality documents (e.g. # words, word length, outlier tokens, dirty tokens); a minimal sketch follows below
7. Data mix: classify data categories (code/books/entertainment); reweight domains using scaling laws to get high downstream performance
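A minimal sketch of step 5; every threshold here is invented for illustration (real pipelines use many more rules):

```python
# Toy heuristic document filter: drop documents with too few/many words,
# outlier average word lengths, or markup-like noise.
def keep_document(text: str) -> bool:
    words = text.split()
    if len(words) < 50 or len(words) > 100_000:     # too short / too long
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):                   # outlier word lengths
        return False
    if text.count("{") / len(words) > 0.1:          # likely code/markup noise
        return False
    return True

docs = ["the mouse ate the cheese " * 20, "{{{{ }}}} " * 30]
print([keep_document(d) for d in docs])             # [True, False]
```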
Data
• Collecting data well is a huge part of practical LLM work (~the key)
• Lots of research still to be done!
• How do you process data well and efficiently?
• Synthetic data?
• A lot of secrecy:
• Competitive dynamics
• Copyright liability
• Closed: LLaMA 2 (2T tokens), LLaMA 3 (15T tokens), GPT-4 (~13T tokens?)
Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws
Scaling laws
• Empirically: more data and larger models => better performance
• Larger models do not imply overfitting!
• Old pipeline:
• Tune hyperparameters directly on big models (e.g. 30 models)
• Pick the best => the final model was only trained for as long as each filtered-out one (e.g. 1 day)
• New pipeline:
• Find scaling recipes (e.g. how the learning rate should decrease with size)
• Tune hyperparameters on small models of different sizes (e.g. for <3 days)
• Extrapolate to larger models using scaling laws (see the fitting sketch below)
• Train the final huge model (e.g. >27 days)
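A sketch of the extrapolation step: fit a power law on the small runs (a straight line in log-log space), then predict the big run. All numbers here are invented:

```python
# Scaling-law fit L(C) = a * C**b from small training runs, then extrapolation.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs of small runs
loss    = np.array([3.2, 2.9, 2.65, 2.45])    # their final validation losses (made up)

b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)  # log L = b log C + log a
a = np.exp(log_a)

def predict(c):
    return a * c ** b   # b < 0: loss falls as a power law in compute

print(predict(1e24))    # extrapolated loss for the final huge run
```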
[Figure: isoflop analysis. Left: vary tokens & parameters at a fixed FLOP budget. Right: best tokens & parameters for each isoflop.]
• Resource allocation:
• Train models longer vs train bigger models?
• Collect more data vs get more GPUs?
• Data:
• Data repetition / multiple epochs?
• Data mixture weighting?
• Algorithm:
• Arch: LSTMs vs transformers?
• Size: width vs depth?
Bitter lesson
• Don't spend time overcomplicating: do the simple things and scale them!
Compute example (LLaMA 3 405B):
• FLOPs: 6NP = 6 × 15.6e12 tokens × 405e9 params ≈ 3.8e25 FLOPs (~2x below the executive order's 1e26 reporting threshold)
• Time: 3.8e25 / (400e12 FLOP/s × 3600) ≈ 26M GPU-hours; 26e6 / (16e3 GPUs × 24 h) ≈ 70 days. The paper reports ~30M GPU-hours.
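The same back-of-the-envelope arithmetic as a script; the 400e12 FLOP/s per-GPU throughput assumes roughly 40% utilization of an H100's bf16 peak:

```python
# Training-cost estimate: ~6 FLOPs per parameter per token.
tokens, params = 15.6e12, 405e9
flops = 6 * tokens * params              # ~3.8e25 training FLOPs
gpu_hours = flops / (400e12 * 3600)      # ~26M GPU-hours at 400 TFLOP/s per GPU
days = gpu_hours / (16_000 * 24)         # ~70 days on 16k GPUs
print(f"{flops:.2e} FLOPs, {gpu_hours/1e6:.0f}M GPU-hours, {days:.0f} days")
```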
Task: "alignment"
• Goal: make the LLM follow user instructions and the designer's desires (e.g. moderation)
• Background:
• Data of desired behaviors is what we want, but it is scarce and expensive
• Pretraining data scales but is not what we want
OpenAssistant
[Kopf+ 2023]
Alpaca
[Taori+ 2023]
Alpaca started as an academic replication of ChatGPT, but "synthetic data generation" is now a hot topic!
LIMA
[Zhou+ 2023]
If the LLM doesn't know [Bivens 2013], SFT teaches the model to make up plausible-sounding references
RLHF
• Idea: maximize human preference rather than clone human behavior
• Pipeline:
1. For each instruction: generate 2 answers from a pretty good (SFT) model
2. Ask a human which answer they prefer
3. Fine-tune the model to produce more of the preferred answers. How??
RLHF: PPO
• Idea: use reinforcement learning
• What is the reward?
• Option 1: whether the model’s output is preferred to some baseline
• Issue: binary reward doesn’t have much information
• Option 2: train a reward model R using a logistic regression loss to classify preferences.
p(i > j) = exp(R(x, ŷ_i)) / (exp(R(x, ŷ_i)) + exp(R(x, ŷ_j)))    [Bradley-Terry 1952]
• Use logits R(·) as reward => continuous information => information heavy!
• Optimize E_{ŷ∼p_θ(ŷ|x)} [ R(x, ŷ) − β log(p_θ(ŷ|x) / p_ref(ŷ|x)) ] using PPO
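A minimal PyTorch version of that logistic-regression loss; the scalar rewards are assumed to come precomputed from the reward model R:

```python
# Reward-model training loss from the Bradley-Terry formula:
# p(i > j) = sigmoid(R(x, y_i) - R(x, y_j)).
import torch
import torch.nn.functional as F

def reward_model_loss(r_pref: torch.Tensor, r_rej: torch.Tensor) -> torch.Tensor:
    # logistic regression on preferences: -log sigmoid(reward margin)
    return -F.logsigmoid(r_pref - r_rej).mean()

r_i = torch.tensor([1.3, 0.2])       # rewards of the preferred answers
r_j = torch.tensor([0.9, 0.4])       # rewards of the rejected answers
print(reward_model_loss(r_i, r_j))   # lower when the margin is larger
```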
RLHF
[Ouyang+ 2022]
AlpacaFarm
[Dubois+ 2023]
RLHF: DPO
• Idea: maximize the probability of the preferred output and minimize that of the rejected one (a sketch follows below)
DPO
[Rafailov+ 2023]
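A minimal sketch of the DPO loss from [Rafailov+ 2023]; the per-sequence log-probabilities (summed over tokens) are assumed to be precomputed:

```python
# DPO: push up the likelihood margin of the preferred answer over the
# rejected one, relative to the reference model, through a logistic loss.
import torch
import torch.nn.functional as F

def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    margin = (logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej)
    return -F.logsigmoid(beta * margin).mean()

# toy numbers: the policy already slightly prefers the chosen answer
print(dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
               torch.tensor([-11.0]), torch.tensor([-11.5])))
```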
RLHF: gains
[Figure: win rates improve at each stage: Pretrain -> SFT -> PPO/DPO]
RLHF: challenges
• Hard to focus on correctness rather than form (e.g. length): "a long way to go" [Singhal+ 2024]
• The annotator distribution shifts the model's behavior (whose opinions do LLMs reflect, pretrain vs post-train?) [Santurkar+ 2023]
• Crowdsourcing ethics
AlpacaFarm
[Dubois+ 2023]
Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws
• Challenges:
• Can't use validation loss to compare different methods
• Can't use perplexity: not calibrated (some aligned LLMs are policies, not distributions!)
• Large diversity of tasks
• Open-ended tasks => hard to automate
InstructGPT
[Ouyang+ 2022]
• Steps:
• For each instruction: generate an output with a baseline model and with the model to evaluate
• Ask an LLM judge which output it prefers => win rate (a sketch follows below)
AlpacaEval LC
[Dubois+ 2023]
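A sketch of that evaluation loop; `generate_model`, `generate_baseline`, and `judge_prefers_model` are hypothetical stand-ins for the two generators and the LLM judge (the LC variant additionally corrects for length bias):

```python
# LLM-as-judge win rate: fraction of instructions where the judge prefers
# the evaluated model's output over the baseline's.
def win_rate(instructions, generate_model, generate_baseline, judge_prefers_model):
    wins = 0
    for x in instructions:
        y_model, y_base = generate_model(x), generate_baseline(x)
        wins += judge_prefers_model(x, y_model, y_base)  # 1 if the judge picks the model
    return wins / len(instructions)

# toy demo: a "judge" that prefers longer answers, illustrating the length bias
demo = win_rate(["q1", "q2"],
                lambda x: x + " long detailed answer",
                lambda x: x + " short",
                lambda x, ym, yb: int(len(ym) > len(yb)))
print(demo)  # 1.0
```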
Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws
• Systems
Systems
• GPUs consist of many SMs (Streaming Multiprocessors)
[Figure: runtime breakdown of a BERT transformer: matmul vs data movement vs activations]
[Ivanov+ 2020]
• 50% hardware utilization is great!
Systems: tiling
• Idea: group and order threads to minimize global memory access (slow)
• E.g. matrix multiplication (a sketch follows below)
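A NumPy sketch of blocked matrix multiplication; on a GPU each tile would be staged in fast shared memory, which is what cuts global-memory traffic:

```python
# Tiled matmul: compute C block by block so each tile of A and B is reused
# for a whole block of output. Dimensions assumed divisible by the tile size.
import numpy as np

def tiled_matmul(A, B, tile=64):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for l in range(0, k, tile):  # accumulate over the shared dimension
                C[i:i+tile, j:j+tile] += A[i:i+tile, l:l+tile] @ B[l:l+tile, j:j+tile]
    return C

A, B = np.random.randn(256, 256), np.random.randn(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)
```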
Systems: eg FlashAttention
• Idea: kernel fusion, tiling, recomputation for attention!
FlashAttention
[Dao+ 2022]
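A NumPy sketch of the streaming-softmax recurrence at FlashAttention's core, for a single query; the real kernel fuses this into one tiled GPU kernel and adds backward-pass recomputation:

```python
# Online softmax attention: process K/V in blocks while tracking a running
# max, normalizer, and output, so the full attention row never materializes.
import numpy as np

def streaming_attention(q, K, V, block=64):
    m, l = -np.inf, 0.0                # running max and softmax normalizer
    o = np.zeros(V.shape[1])           # running (unnormalized) output
    for s in range(0, K.shape[0], block):
        scores = K[s:s+block] @ q      # attention scores for this block
        m_new = max(m, scores.max())
        p = np.exp(scores - m_new)
        scale = np.exp(m - m_new)      # rescale old stats to the new max
        l = l * scale + p.sum()
        o = o * scale + p @ V[s:s+block]
        m = m_new
    return o / l

q = np.random.randn(32)
K, V = np.random.randn(256, 32), np.random.randn(256, 64)
w = np.exp(K @ q - (K @ q).max())
assert np.allclose(streaming_attention(q, K, V), (w / w.sum()) @ V)
```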
Systems: parallelization
• Problem:
• The model is very big => can't fit on one GPU
• Want to use as many GPUs as possible
• Background: to naively train a model with P billion parameters you need at least 16P GB of DRAM:
• 4P GB for the model weights (fp32)
• 2 × 4P GB for the optimizer state (Adam: momentum + variance)
• 4P GB for the gradients
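The rule of thumb as a function (assumes fp32 throughout; mixed-precision recipes shift these numbers):

```python
# Naive training memory for P billion parameters with Adam: 16 bytes/param.
def training_dram_gb(params_billion: float) -> float:
    weights   = 4 * params_billion   # fp32 weights
    optimizer = 8 * params_billion   # Adam: momentum + variance, fp32
    grads     = 4 * params_billion   # fp32 gradients
    return weights + optimizer + grads  # in GB, ignoring activations

print(training_dram_gb(405))  # ~6500 GB just for model state
```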
2. Split data
• Idea: each GPU updates a subset of the weights and gathers them before the next step => sharding (toy sketch below)
ZeRO
[Rajbhandari+ 2019]
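A toy, single-process imitation of ZeRO-style sharding; real implementations run the shard updates in parallel and overlap communication with compute:

```python
# Each "GPU" owns one shard of the weights (and would own the matching
# optimizer state), updates only that shard, then shards are all-gathered.
import numpy as np

n_gpus, lr = 4, 0.1
weights = np.arange(16, dtype=float)
grads = np.ones_like(weights)                # pretend these were already all-reduced

w_shards = np.split(weights.copy(), n_gpus)  # each rank stores only its shard
g_shards = np.split(grads, n_gpus)
for rank in range(n_gpus):                   # runs in parallel on real hardware
    w_shards[rank] -= lr * g_shards[rank]    # SGD step on this rank's shard only

weights = np.concatenate(w_shards)           # "all-gather" the updated weights
print(weights[:4])                           # [-0.1  0.9  1.9  2.9]
```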
• Idea: have every GPU take care of applying specific parameters (rather than updating)
• E.g. pipeline parallel: every GPU holds a different layer
GPipe
[Huang+ 2018]
• E.g. tensor parallel: split a single matrix across GPUs and use partial sums (toy sketch below)
Megatron-LM
[Shoeybi+ 2019]
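A toy NumPy version of the partial-sum trick; each list element stands in for one GPU, and the final sum is an all-reduce on real hardware:

```python
# Tensor parallelism: split the inner dimension of x @ W across "GPUs",
# compute partial matmuls, then sum the partial results.
import numpy as np

x = np.random.randn(8, 512)
W = np.random.randn(512, 512)

n_gpus = 4
x_parts = np.split(x, n_gpus, axis=1)  # each GPU gets a slice of the inner dim
W_parts = np.split(W, n_gpus, axis=0)  # ...and the matching rows of W
partials = [xp @ wp for xp, wp in zip(x_parts, W_parts)]
y = sum(partials)                      # partial-sum "all-reduce"

assert np.allclose(y, x @ W)
```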
Outlook
Haven’t touched upon:
Going further:
• CS224N: more of the background and historical context. Some adjacent material.
• CS324: more in-depth reading and lectures.