
Introduction to

Building LLMs
CS229: Machine Learning
Yann Dubois | Aug. 13th 2024

Slides partially based on CS336, CS224N, CS324


2

LLMs

• LLMs & chatbots took over the world

• How do they work?


3

What matters when training LLMs

• Architecture (Transformer)
• Training algorithm/loss
• Data
• Evaluation
• Systems

Most of academia focuses on the architecture and the training algorithm/loss; what matters in practice is the data, evaluation, and systems.
Overview
Pretraining -> GPT3
• Task & loss

Post-training -> ChatGPT


5

Language Modeling
• LM: probability distribution over sequences of tokens/words p(x_1, …, x_L)
  P(the, mouse, ate, the, cheese) = 0.02
  P(the, the, mouse, ate, cheese) = 0.0001   (syntactic knowledge)
  P(the, cheese, ate, the, mouse) = 0.001    (semantic knowledge)

• LMs are generative models: x_{1:L} ~ p(x_1, …, x_L)

• Autoregressive (AR) language models:

  p(x_1, …, x_L) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) … = ∏_i p(x_i | x_{1:i−1})

  No approximation: this is just the chain rule of probability.

=> You only need a model that can predict the next token given past context!
6

AR Language Models

• Task: predict the next word

• Steps (inference only; a small sketch follows this slide):
  1. tokenize
  2. forward pass through the model
  3. predict the probability of the next token
  4. sample
  5. detokenize

(Figure: the prompt "She likely prefers" is fed to the model, which outputs a next-token distribution, e.g. "dogs".)
7
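To make the steps concrete, here is a minimal sketch of the inference loop in Python. The model and tokenizer (and their encode/decode/forward interfaces) are hypothetical stand-ins, not a specific library's API:

import torch

def generate(model, tokenizer, prompt, max_new_tokens=20, temperature=1.0):
    # 1. tokenize: text -> list of token ids (hypothetical encode())
    ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # 2. forward: logits for the next token given the past context
        logits = model(torch.tensor(ids)[None, :])[0, -1]        # shape: (vocab_size,)
        # 3. turn logits into a probability distribution over the vocabulary
        probs = torch.softmax(logits / temperature, dim=-1)
        # 4. sample the next token (argmax instead would be greedy decoding)
        next_id = torch.multinomial(probs, num_samples=1).item()
        ids.append(next_id)
    # 5. detokenize: token ids -> text (hypothetical decode())
    return tokenizer.decode(ids)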

AR Neural Language Models

https://lena-voita.github.io/nlp_course/language_modeling.html#intro
8

Loss
• Classify the next token's index
• => cross-entropy loss
• => maximize the text's log-likelihood (a PyTorch sketch follows this slide):

  max ∏_i p(x_i | x_{1:i−1}) = min − Σ_i log p(x_i | x_{1:i−1}) = min ℒ(x_{1:L})

https://lena-voita.github.io/nlp_course/language_modeling.html#intro
9
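A minimal sketch of this loss in PyTorch, assuming a model that outputs one logit vector per position; the only subtlety is the shift by one so that position i predicts token i+1:

import torch
import torch.nn.functional as F

def lm_loss(logits, tokens):
    # logits: (batch, seq_len, vocab) predictions; tokens: (batch, seq_len) token ids
    pred = logits[:, :-1, :]      # predictions for x_{i+1} made from positions <= i
    target = tokens[:, 1:]        # the actual next tokens x_{i+1}
    # cross-entropy = negative log-likelihood of the observed next tokens
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))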

Tokenizer
• Why?
  • More general than words (e.g., handles typos)
  • Shorter sequences than with characters

• Idea: tokens as common subsequences (~3 letters)

• E.g., Byte Pair Encoding (BPE). Training steps (a small sketch follows this slide):
  1. Take a large corpus of text
  2. Start with one token per character
  3. Merge the most common pair of tokens into a new token
  4. Repeat until the desired vocab size (or everything is merged)

(tokenizer: text → token index)
10
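A minimal (and deliberately slow) sketch of the BPE training loop above; real tokenizers work on bytes and are heavily optimized, but the idea is the same:

from collections import Counter

def merge_pair(word, a, b):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(word[i]); i += 1
    return out

def train_bpe(corpus, vocab_size):
    # 2. start with one token per character (within whitespace-split words)
    words = [list(w) for w in corpus.split()]
    vocab = set(ch for w in words for ch in w)
    merges = []
    while len(vocab) < vocab_size:
        # count all adjacent token pairs in the corpus
        pairs = Counter((w[i], w[i + 1]) for w in words for i in range(len(w) - 1))
        if not pairs:
            break                                   # everything is merged
        # 3. merge the most common pair into a single new token
        a, b = pairs.most_common(1)[0][0]
        merges.append((a, b))
        vocab.add(a + b)
        words = [merge_pair(w, a, b) for w in words]
        # 4. repeat until the desired vocab size (or all merged)
    return vocab, merges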

Overview
Pretraining -> GPT3
• Task & loss
• Evaluation

Post-training -> ChatGPT


17

LLM evaluation: Perplexity

• Idea: validation loss
• To be more interpretable: use perplexity (a computation sketch follows this slide)

  PPL(x_{1:L}) = 2^{ℒ(x_{1:L}) / L} = ∏_i p(x_i | x_{1:i−1})^{−1/L}

  • Average per token (~independent of length)
  • Exponentiate => units independent of the log base
  • Perplexity is between 1 and |Vocab|

• Intuition: the number of tokens the model is "hesitating" between
18
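A small sketch of the computation, assuming you already have the probability the model assigned to each observed token; perplexity is the exponentiated average negative log-probability:

import math

def perplexity(token_probs):
    # token_probs: p(x_i | x_{1:i-1}) for each token in the sequence
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# a model that puts probability 1/10 on every observed token
# "hesitates" between ~10 tokens:
print(perplexity([0.1] * 50))   # ~10.0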

LLM evaluation: Perplexity

Between 2017 and 2023, models went from "hesitating" between ~70 tokens to <10 tokens.
Perplexity is no longer used for academic benchmarks, but it is still important during development.
19

LLM Evaluation: agg. std NLP benchmarks


Collect many benchmarks that can be evaluated automatically, and evaluate models across all of them.
Examples: Holistic Evaluation of Language Models (HELM), Hugging Face Open LLM Leaderboard.
20

LLM Evaluation: agg. std NLP benchmarks


• Mix of things that can be "easily" evaluated

• Typically there is a "gold" answer
  => evaluate the likelihood the LLM assigns to it vs. the other options

HELM-lite
[Liang+ 2022]
21

LLM Evaluation: eg MMLU


• Example: MMLU
• ~Most trusted pretraining benchmark

MMLU
[Hendrycks+ 2020]
22

Evaluation: challenges
• Sensitivity to prompting/inconsistencies
23

Evaluation: challenges
• Sensitivity to prompting/inconsistencies
• Train & test contamination (~not important for development)
Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data

Post-training -> ChatGPT


25

Data
• Idea: use all of the clean internet
• Note: the internet is dirty & not representative of what we want. In practice:
  1. Download all of the internet. Common Crawl: 250 billion pages, >1PB (>1e6 GB)
  2. Text extraction from HTML (challenges: math, boilerplate)
  3. Filter undesirable content (e.g. NSFW, harmful content, PII)
  4. Deduplicate (URL/document/line). E.g. the headers/footers/menus in forums are always the same
  5. Heuristic filtering. Remove low-quality documents (e.g. # words, word length, outlier tokens, dirty tokens); a small sketch follows this slide
  6. Model-based filtering. Predict whether the page could be referenced by Wikipedia
  7. Data mix. Classify data into categories (code/books/entertainment). Reweight domains using scaling laws to get high downstream performance

• Also: learning-rate annealing on high-quality data, continual pretraining with longer context


26
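As a rough illustration of step 5 (heuristic filtering), here is a sketch of a document filter; the thresholds below are made up for illustration, not the rules any particular lab uses:

def keep_document(text):
    words = text.split()
    if not (50 <= len(words) <= 100_000):           # too short / absurdly long
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):                   # outlier word lengths
        return False
    alpha_frac = sum(w.isalpha() for w in words) / len(words)
    if alpha_frac < 0.8:                            # mostly symbols/numbers: likely boilerplate
        return False
    if "lorem ipsum" in text.lower():               # obvious junk marker
        return False
    return True

docs = ["some extracted web page text ...", "..."]
clean = [d for d in docs if keep_document(d)]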

Data
• Collecting data well is a huge part of practical LLM building (~the key)
• Lots of research to be done!
  • How do you process data well and efficiently?
  • How do you balance domains?
  • Synthetic data?
  • Multi-modal data?

• A lot of secrecy:
  • Competitive dynamics
  • Copyright liability

• Common academic datasets:
  • C4 (150B tokens | 800GB)
  • The Pile (280B tokens)
  • Dolma (3T tokens)
  • FineWeb (15T tokens)

• Closed: LLaMA 2 (2T tokens), LLaMA 3 (15T tokens), GPT-4 (~13T tokens?)
Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws

Post-training -> ChatGPT


28

Scaling laws
• Empirically: more data and larger models => better performance
• Larger models do not necessarily overfit

• Idea: predict model performance based on the amount of data & number of parameters (a small curve-fitting sketch follows this slide)

It works for many things!

Scaling laws
[Kaplan+ 2020]
29
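The basic trick behind these plots: a power law L ≈ a·N^(−b) is a straight line in log-log space, so you can fit it on a few small runs and extrapolate. A sketch with made-up numbers:

import numpy as np

# made-up (model size, validation loss) pairs from small training runs
sizes  = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.2, 3.9, 3.6, 3.35, 3.1])

# fit log(loss) = log(a) - b * log(size)   (assumes the irreducible loss term is ~0)
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
a, b = np.exp(intercept), -slope

# extrapolate to a model 100x larger than anything we trained
print(a * (1e11) ** (-b))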

Scaling laws: tuning


• You have 10K GPUs for a month, what model do you train?

• Old pipeline:
  • Tune hyperparameters on big models (e.g. 30 models)
  • Pick the best => the final model only gets as much training as each filtered-out one (e.g. 1 day)

• New pipeline:
  • Find scaling recipes (e.g. how the learning rate should decrease with size)
  • Tune hyperparameters on small models of different sizes (e.g. for <3 days)
  • Extrapolate to larger models using scaling laws
  • Train the final huge model (e.g. >27 days)
30

Scaling laws: eg LSTM


• Q: Should we use transformers or LSTMs?

A: Transformers have a better constant and a better scaling rate (slope)


Scaling laws
[Kaplan+ 2020]
31

Scaling laws: eg Chinchilla


• Q: How do we optimally allocate training* resources (size vs data)?

(Figures: isoflop curves, varying tokens & parameters; best number of tokens and of parameters for each isoflop budget.)

A: Use ~20 tokens for each parameter (20:1)

Chinchilla [Hoffmann+ 2022]
*doesn't consider inference cost => in practice use a larger ratio (>150:1)
32

Scaling laws: tuning


• Many questions you can try to answer with scaling laws

• Resource allocation:
  • Train models longer vs. train bigger models?
  • Collect more data vs. get more GPUs?

• Data:
  • Data repetition / multiple epochs?
  • Data mixture weighting?

• Algorithm:
  • Architecture: LSTMs vs. transformers?
  • Size: width vs. depth?
33

Bitter lesson

• Bitter lesson: models improve with scale & Moore’s Law


=> “only thing that matters in the long run is the leveraging of computation.”

The Bitter Lesson [Sutton 2019] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

• Don't spend time overcomplicating: do the simple things and scale them!
34

Training a SOTA model


• Example of current SOTA: LLaMA 3 400B
  • Data: 15.6T tokens; Parameters: 405B (~40 tokens/parameter => trained ~compute-optimally)

• FLOPs: 6·N·P = 6 × 15.6e12 × 405e9 ≈ 3.8e25 FLOPs (~2x below the executive-order reporting threshold)

• Compute: 16K H100s with an average throughput of 400 TFLOPS

• Time: 3.8e25 / (400e12 × 3600) ≈ 26M GPU-hours; / (16e3 × 24) ≈ 70 days (the paper reports ~30M GPU-hours)

• Cost: rented compute + salaries ≈ $2/h × 26M h + $500k/y × 50 employees = $52M + $25M ≈ $75M ($65-85M)

• Carbon emitted: 26M h × 0.7 kW × 0.24 kg CO2eq/kWh ≈ 4,400 tCO2eq (~2k return flights JFK-LHR)

• Next model? ~10x more FLOPs
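The same back-of-the-envelope arithmetic in one place, using the assumptions from this slide (6·N·P FLOPs, 400 TFLOPS per GPU, rough cost and carbon rates):

tokens       = 15.6e12                           # training tokens
params       = 405e9                             # model parameters
flops        = 6 * tokens * params               # ~3.8e25 FLOPs
gpu_flops    = 400e12                            # achieved throughput per H100 (FLOP/s)
gpu_hours    = flops / gpu_flops / 3600          # ~26M GPU-hours
days         = gpu_hours / (16_000 * 24)         # ~70 days on 16K GPUs
compute_cost = gpu_hours * 2                     # ~$52M at ~$2 per GPU-hour
salary_cost  = 50 * 500_000                      # ~$25M for ~50 people for a year
co2_tonnes   = gpu_hours * 0.7 * 0.24 / 1000     # ~4,400 tCO2eq
print(f"{flops:.1e} FLOPs, {gpu_hours/1e6:.0f}M GPU-hours, {days:.0f} days, "
      f"${(compute_cost + salary_cost)/1e6:.0f}M, {co2_tonnes:.0f} tCO2eq")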


Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws
• Systems

Post-training -> ChatGPT


Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws
• Systems

Post-training -> ChatGPT


• Task
37

Language Modeling ≠ assisting users


• Problem: language modeling is not what we want
38

Task: “alignment”
• Goal: the LLM should follow user instructions and the designer's desires (e.g. moderation)

• Background:
  • Data of desired behaviors is what we want, but it is scarce and expensive
  • Pretraining data scales, but it is not what we want

• Idea: finetune the pretrained LLM on a little desired data => "post-training"


Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws
• Systems

Post-training -> ChatGPT


• Task
• SFT: data & loss
40

Supervised finetuning (SFT)


• Idea: finetune the LLM with the language-modeling loss on the desired answers
  ("supervised" next-word prediction; a loss sketch follows this slide)
• How do we collect the data? Ask humans

OpenAssistant
[Kopf+ 2023]

This was ~the key step from GPT-3 to the ChatGPT model!


41
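A sketch of the SFT loss: it is exactly the pretraining next-token loss applied to (instruction, desired answer) sequences, usually with the prompt tokens masked out so that only the answer is supervised. The model and pre-tokenized inputs below are assumed, not a specific library's API:

import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    # input_ids: (1, seq_len) = prompt tokens followed by the desired answer tokens
    logits = model(input_ids)                                  # (1, seq_len, vocab)
    pred, target = logits[:, :-1], input_ids[:, 1:]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           target.reshape(-1), reduction="none")
    # mask out positions that predict prompt tokens: only the answer is supervised
    answer_mask = (torch.arange(target.size(1)) >= prompt_len - 1).float()
    return (loss * answer_mask).sum() / answer_mask.sum()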

Scalable data for SFT: eg Alpaca


• Problem: human data is slow to collect and expensive
• Idea: use LLMs to scale data collection

Alpaca
[Taori+ 2023]

Started as an academic replication of ChatGPT, but "synthetic data generation" is now a hot topic!
43

Scalable data for SFT: quantity?


• You need very little data for SFT! (~a few thousand examples)

LIMA
[Zhou+ 2023]

• Just learns the format of desired answers (length, bullet points, …)


• The knowledge is already in the pretrained LLM!
• Specializes to one “type of user”
Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws
• Systems

Post-training -> ChatGPT


• Task
• SFT: data & loss
• RLHF : data & loss
45

RL from Human Feedback (RLHF)


• Problem: SFT is behavior cloning of humans
1. Bound by human abilities: humans may prefer things that they are not able to generate
2. Hallucination: cloning a correct answer teaches the LLM to hallucinate if it didn't know about it!

   If the LLM doesn't know [Bivens 2013], this teaches the model to make up plausible-sounding references

3. Price: collecting ideal answers is expensive


46

RLHF
• Idea: maximize human preference rather than clone their behavior
• Pipeline:
1. For each instruction: generate 2 answers from a pretty good model (SFT)

2. Ask labelers to select their preferred answers

3. Finetune the model to generate more preferred answers

How??

47

RLHF: PPO
• Idea: use reinforcement learning
• What is the reward?
• Option 1: whether the model’s output is preferred to some baseline
• Issue: binary reward doesn’t have much information

• Option 2: train a reward model R using a logistic regression loss to classify preferences.
  p(i > j) = exp(R(x, ŷ_i)) / (exp(R(x, ŷ_i)) + exp(R(x, ŷ_j)))     [Bradley-Terry 1952]

  • Use the logits R(·) as the reward => continuous, information-rich signal! (a loss sketch follows this slide)

• Optimize, using PPO:

  E_{ŷ ~ p_θ(·|x)} [ R(x, ŷ) − β log( p_θ(ŷ|x) / p_ref(ŷ|x) ) ]

  -> the regularization term avoids over-optimization of the reward model

• Note: the LM here is a policy, not a model of some distribution


48
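A sketch of the reward-model training loss implied by the Bradley-Terry formula: maximizing log p(i > j) is a logistic loss on the reward difference between the preferred and rejected answers:

import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # r_chosen, r_rejected: (batch,) scalar rewards R(x, y_chosen), R(x, y_rejected)
    # -log p(chosen > rejected) = -log sigmoid(R_chosen - R_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()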

RLHF: PPO -> ChatGPT

RLHF
[Ouyang+ 2022]
49

RLHF: PPO challenges


• Problem: RL is simple in theory, messy in practice (clipping, rollouts, outer loops, …)

(Figure: idealized PPO in the LM setting, with rollouts.)

AlpacaFarm
[Dubois+ 2023]


50

RLHF: DPO
• Idea: maximize probability of preferred output, minimize the other

DPO
[Rafailov+ 2023]

• This is ~equivalent (same global minima) to RLHF/PPO


• Much simpler than PPO and performs as well => the standard in the open-source community
51
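A sketch of the DPO loss from [Rafailov+ 2023]; it only needs the sequence-level log-probabilities of the preferred and rejected answers under the policy and under the frozen reference model:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # each log-prob is the sum of per-token log-probs of the full answer given the prompt
    chosen_ratio   = logp_chosen   - ref_logp_chosen      # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = logp_rejected - ref_logp_rejected    # log pi(y_l|x) - log pi_ref(y_l|x)
    # maximize the margin between the preferred and the rejected answer
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()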

RLHF: gains

(Figures: gains going from the pretrained model to SFT to PPO/DPO, on Learning to Summarize [Stiennon+ 2020] and AlpacaFarm [Dubois+ 2023].)
52

RLHF: human data


• Data: human crowdsourcing
(Figure: example annotation guidelines.)
53

RLHF: challenges of human data


• Slow & expensive

• Hard to focus on correctness rather than form (e.g. length) ("A Long Way to Go" [Singhal+ 2024])

• The annotator distribution shifts the model's behavior (LLM opinions, pretrain vs. post-train) [Santurkar+ 2023]

• Crowdsourcing ethics
54

RLHF: LLM data


• Idea: replace human preferences with LLM preferences

Works surprisingly well!


=> Standard in open community

AlpacaFarm
[Dubois+ 2023]
Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws

Post-training -> ChatGPT


• Task
• SFT: data & loss
• RLHF : data & loss
• Evaluation
56

Evaluation: aligned LLM


• How do we evaluate something like ChatGPT?

• Challenges:
  • Can't use validation loss to compare different methods
  • Can't use perplexity: not calibrated (and some aligned LLMs are policies!)
  • Large diversity of tasks
  • Open-ended tasks => hard to automate

• Idea: ask annotators for their preference between answers

InstructGPT
[Ouyang+ 2022]
57

Human evaluation: eg ChatBot Arena


• Idea: have users interact (blinded) with two chatbots, rate which is better.

• Problem: cost & speed!


ChatBot Arena
[Chiang+ 2024]
58

LLM evaluation: eg AlpacaEval


• Idea: use an LLM instead of a human

• Steps (a sketch follows this slide):
  • For each instruction: generate an output with the baseline and with the model to evaluate
  • Ask GPT-4 which output is better
  • Average win-probability => win rate

• Benefits:
  • 98% correlation with ChatBot Arena
  • < 3 min and < $10

• Challenge: spurious correlations

AlpacaEval
[Li+ 2023]
59
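A sketch of the evaluation loop; judge_prefers_model is a hypothetical function that queries the judge LLM (e.g. GPT-4) and returns the probability that it prefers the model's output over the baseline's:

def win_rate(instructions, model_generate, baseline_outputs, judge_prefers_model):
    # for each instruction: generate with the model being evaluated,
    # then ask the judge which of the two outputs is better
    prefs = []
    for instr, baseline_out in zip(instructions, baseline_outputs):
        model_out = model_generate(instr)
        prefs.append(judge_prefers_model(instr, model_out, baseline_out))  # in [0, 1]
    # average win-probability over instructions => win rate
    return sum(prefs) / len(prefs)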

LLM evaluation: spurious correlation


• e.g. LLM prefers longer outputs

• Possible solution: regression analysis / causal inference to "control" for length

AlpacaEval LC
[Dubois+ 2023]
Overview
Pretraining -> GPT3
• Task & loss
• Evaluation
• Data
• Scaling laws
• Systems

Post-training -> ChatGPT


61

Systems

• Problem: everyone is bottlenecked by compute!


• Why not buy more GPUs?
• GPUs are expensive and scarce!

• Physical limitations (eg communication between GPUs)

• => importance of resource allocation (scaling laws) and optimized pipelines


62

Systems 101: GPUs


• Massively parallel: the same instruction is applied on all threads, but on different inputs.
  => Optimized for throughput!

(Figure: GPU layout with many Streaming Multiprocessors (SMs).)
63

Systems 101: GPUs


• Massively parallel
• Fast matrix multiplication: special cores >10x faster than other fp ops
64

Systems 101: GPUs


• Massively parallel
• Fast matrix multiplication

• Compute > memory & communication:


• Hard to keep processors fed with data

(Figure: runtime breakdown of a BERT transformer into data movement, matmul, and activation time [Ivanov+ 2020].)
65

Systems 101: GPUs


• Massively parallel
• Fast matrix multiplication

• Compute > memory & communication


• Memory hierarchy:
• Closer to cores => faster but less memory
• Further from cores => more memory but slower
66

Systems 101: GPUs


• Massively parallel
• Fast matrix multiplication

• Compute > memory & communication


• Memory hierarchy

• Metric: Model Flop Utilization (MFU)


• Ratio: observed throughput / theoretical best for that GPU

• 50% is great!
68

Systems: low precision


• Fewer bits => faster communication & lower memory consumption
• For deep learning: decimal precision ~doesn't matter, except for exponents & weight updates
• Matrix multiplications can use bf16 instead of fp32

• For training: Automatic Mixed Precision (AMP)


• Weights stored in fp32, but converted to bf16 before computation
• Activations in bf16 => main memory gains
• (Only) matrix multiplications in bf16 => speed gains
• Gradients in bf16 => memory gains
• Master weights updated in fp32 => full precision
(a minimal training-step sketch follows this slide)
69
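A minimal sketch of a mixed-precision training step in PyTorch (assumes a CUDA GPU); with bf16 autocast you typically do not need the loss scaling that fp16 requires:

import torch

model = torch.nn.Linear(1024, 1024).cuda()        # master weights kept in fp32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    # matmuls and activations run in bf16 inside this context
    loss = model(x).square().mean()
loss.backward()    # backward pass; the fp32 parameters keep fp32 state
opt.step()         # optimizer update in full precision
opt.zero_grad()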

Systems: operator fusion


• Problem:
• communication is slow
• every new PyTorch line moves variables to global memory

• Idea: communicate once


• torch.compile (a small sketch follows this slide)

(Figure: data moving between DRAM and SRAM & compute.)
70
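A sketch of the idea: in eager PyTorch each elementwise line below launches its own kernel and round-trips through global memory; torch.compile can fuse the chain into a single kernel:

import torch

def scaled_tanh(x):
    # three elementwise ops; eager PyTorch launches a kernel (and a
    # round-trip to global memory) for each one
    y = x * 0.5
    z = torch.tanh(x)
    return y * (1 + z)

fused = torch.compile(scaled_tanh)   # fuses the elementwise chain into one kernel
x = torch.randn(4096, 4096)
out = fused(x)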

Systems: tiling
• Idea: group and order threads to minimize global memory access (slow)

• Eg matrix multiplication

• Compute matrix multiplications in subphases to reuse memory


1. Load M_00 and N_00 tiles into SM

2. Compute partial sums for P


3. Load M_00 and N_20 into SM
4. …

• => reuse reads (~cache); a numpy sketch follows this slide

• => ~T× reduction of global reads (for T×T tiles)

(Figure: e.g. assume a thread can only keep 8 values in memory; without tiling it has to reread all values, so there are no cache hits!)
71
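A sketch of the tiling idea in plain numpy (on a GPU this happens inside a kernel, with tiles staged in the SM's shared memory): each loaded tile is reused for a whole block of partial sums.

import numpy as np

def tiled_matmul(M, N, tile=64):
    # P = M @ N computed tile by tile, reusing each loaded tile many times
    m, k = M.shape
    _, n = N.shape
    P = np.zeros((m, n))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for t in range(0, k, tile):
                # "load" one tile of M and one tile of N (into fast memory on a GPU)
                M_tile = M[i:i+tile, t:t+tile]
                N_tile = N[t:t+tile, j:j+tile]
                # accumulate partial sums for the P[i:i+tile, j:j+tile] block
                P[i:i+tile, j:j+tile] += M_tile @ N_tile
    return P

# sanity check against the untiled product
A, B = np.random.randn(256, 256), np.random.randn(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)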

Systems: eg FlashAttention
• Idea: kernel fusion, tiling, recomputation for attention!

• 1.7x end to end speed up!

FlashAttention
[Dao+ 2022]
72

Systems: parallelization
• Problem:
• model very big => can’t fit on one GPU
• Want to use as many GPUs as possible

• Idea: split memory and compute across GPUs

• Background: to naively train a model with P billion parameters you need at least 16P GB of DRAM (~16 bytes per parameter)
  • 4P GB for the model weights (fp32)
  • 2 × 4P GB for the optimizer states (Adam moments)
  • 4P GB for the gradients

• E.g. for a 7B model you need 112 GB!


73

Systems: data parallelism


• Goal: use more GPUs
• Naïve data parallelization:
1. Copy model & optimizer on each GPU

2. Split data

3. Communicate and reduce (sum) gradients

• Pro: uses GPUs in parallel

• Con: no memory gains!


74

Systems: data parallelism


• Goal: split up memory

• Idea: each GPU updates a subset of the weights and communicates them before the next step => sharding

ZeRO
[Rajbhandari+ 2019]
75

Systems: model parallelism


• Problem: data parallelism only works if batch size >= # GPUS

• Idea: have every GPU take care of applying specific parameters (rather than updating)
• Eg pipeline parallel: every GPU has different layer

GPipe
[Huang+ 2018]
76

Systems: model parallelism


• Problem: data parallelism only works if batch size >= # GPUS

• Idea: have every GPU take care of applying specific parameters (rather than updating)
• Eg pipeline parallel: every GPU has different layer

• Eg tensor parallel: split single matrix across GPUs and use partial sum

Megatron-LM:
[Shoeybi+ 2019]
77

Systems: architecture sparsity


• Idea: models are huge => not every datapoint needs to go through every parameter
• E.g. Mixture of Experts: use a selector (router) layer so that fewer parameters are "active" per token => more parameters at the same FLOPs (a routing sketch follows below)

Sparse Expert Models
[Fedus+ 2022]
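A sketch of a token-level mixture-of-experts layer with top-1 routing, where the "selector" is just a small linear layer; real implementations add load-balancing losses and capacity limits:

import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)        # the selector layer
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)])

    def forward(self, x):                                  # x: (n_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)            # (n_tokens, n_experts)
        best = scores.argmax(dim=-1)                       # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = best == e
            if mask.any():                                 # only routed tokens hit this expert
                out[mask] = expert(x[mask]) * scores[mask][:, e:e+1]
        return out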
Wrap-up
79

Outlook
Haven’t touched upon:

• Architecture: MoE & SSM
• Decoding & inference
• UI & tools: ChatGPT
• Multimodality
• Misuse
• Context size
• Data wall
• Legality of data collection

Going further:
• CS224N: more of the background and historical context. Some adjacent material.
• CS324: more in-depth reading and lectures.

• CS336: you actually build your LLM. Heavy workload!


Questions?
