This course covers techniques for training and deploying large language models efficiently: scaling laws for choosing model size and training data given a compute budget, efficient attention implementations, float precision and memory-efficient optimizers, multi-GPU training with DDP, FSDP and DeepSpeed, low-rank adaptation (LoRA), quantization, KV-cache management, and model reduction methods such as DistilBERT and Sheared Llama. The goal throughout is to maximize performance while minimizing training and inference cost.


Course 4: Efficient NLP

Course 4: Efficient NLP 1
The cost of pre-training LMs

Course 4: Efficient NLP 2


The cost of using LMs

Course 4: Efficient NLP 3


Efficient training

Course 4: Efficient NLP 4


Scaling Laws
Scaling Laws for Neural Language Models (Kaplan et al. 2020)

Course 4: Efficient NLP 5


Chinchilla Scaling Laws
Refinement of Kaplan et al. using more data points & a better training recipe
Given a FLOPs budget, what model size and how many training tokens should one use?

Course 4: Efficient NLP 6


Chinchilla Scaling Laws
Propose a form for the final loss, with N = model parameters and D = training tokens:

L(N, D) = E + A / N^α + B / D^β

Fit it on data points:

E = 1.69 (the "entropy of natural language")
A = 406.4, B = 410.7, α = 0.34, β = 0.28
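
As a quick illustration, a minimal Python sketch of this fitted loss (constants as above; N = parameters, D = training tokens):

# Sketch of the fitted Chinchilla loss L(N, D) = E + A / N**alpha + B / D**beta,
# using the constants reported above (N = model parameters, D = training tokens).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted final pre-training loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Example: a 70B-parameter model trained on 1.4T tokens (roughly Chinchilla's setup).
print(chinchilla_loss(70e9, 1.4e12))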

Course 4: Efficient NLP 7


Chinchilla Scaling Laws

Compute budget: C ≈ 6 · N · D FLOPs
For a given compute level C, there exists an optimal N and an optimal D
Training a bigger model on less data => worse
Training a smaller model on more data => worse
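
A hedged sketch (assuming the C ≈ 6·N·D approximation and the fit above) of the resulting compute-optimal allocation:

# Hedged sketch: closed-form compute-optimal N (parameters) and D (tokens) for a FLOPs
# budget C, derived from L(N, D) = E + A/N**alpha + B/D**beta under the constraint C ≈ 6*N*D.
A, B, ALPHA, BETA = 406.4, 410.7, 0.34, 0.28

def compute_optimal(c_flops: float) -> tuple[float, float]:
    """Return (N_opt, D_opt) minimizing the fitted loss at fixed compute C ≈ 6*N*D."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))  # prefactor from the first-order condition
    n_opt = g * (c_flops / 6.0) ** (BETA / (ALPHA + BETA))  # N_opt grows roughly as C**0.45
    d_opt = (c_flops / 6.0) / n_opt                         # D_opt grows roughly as C**0.55
    return n_opt, d_opt

n, d = compute_optimal(1e24)  # hypothetical 1e24-FLOP budget
print(f"N ≈ {n:.3g} parameters, D ≈ {d:.3g} tokens")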

Course 4: Efficient NLP 8


Chinchilla Scaling Laws - In practice
Train for longer than prescribed by the scaling laws
Why?
Over-train smaller models
Trade training compute for inference compute
Example: Mistral-7B model

Course 4: Efficient NLP 9


Chinchilla Scaling Laws - In practice

Course 4: Efficient NLP 10


Training LMs
Batch size matters:
No bigger than 1-2M tokens (Kaplan et al., 2020)
Maximize parallel computation
Float precision:
float16 : reduces memory usage, works well on V100-gen GPUs
bfloat16 : more stable, but only supported on A100-gen GPUs and newer
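
A minimal sketch of bfloat16 mixed precision with PyTorch autocast (placeholder model and loss; float16 would additionally need a GradScaler):

import torch

# Minimal mixed-precision sketch: run the forward pass under autocast so matmuls use
# bfloat16 while master weights stay in float32. Model and loss are placeholders.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()  # dummy loss
loss.backward()
optimizer.step()
optimizer.zero_grad()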

Course 4: Efficient NLP 11


Training LMs - Efficient implementations
FlashAttention (Dao et al. 2022)

Course 4: Efficient NLP 12


Training LMs - Efficient implementations
FlashAttention2 (Dao et al. 2023)

Course 4: Efficient NLP 13


Training LMs - Efficient implementations
xFormers & Memory-efficient attention (Rabe et al. 2021)
Classical implementation: softmax(Q Kᵀ / √d) V, which materializes the full n × n attention matrix

Memory-efficient implementation: compute the same result over chunks, never storing the full n × n matrix
~SOTA on V100-gen GPUs
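
A simplified sketch of the idea, chunking over queries only (the actual method also chunks over keys/values with a running softmax):

import torch

def chunked_attention(q, k, v, chunk_size=1024):
    """Naive memory-efficient attention sketch: iterate over query chunks so that only a
    (chunk_size x n) score block exists at any time, instead of the full n x n matrix."""
    scale = q.shape[-1] ** -0.5
    outputs = []
    for start in range(0, q.shape[-2], chunk_size):
        q_chunk = q[..., start:start + chunk_size, :]
        scores = (q_chunk @ k.transpose(-2, -1)) * scale   # (..., chunk, n)
        outputs.append(torch.softmax(scores, dim=-1) @ v)  # (..., chunk, d)
    return torch.cat(outputs, dim=-2)

q = k = v = torch.randn(2, 8, 4096, 64)  # (batch, heads, seq, head_dim)
out = chunked_attention(q, k, v)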

Course 4: Efficient NLP 14


Training LMs - Efficient variants
Linear-complexity attention, e.g. sliding-window attention (Beltagy et al. 2020)

Can be used to adapt a model for efficient inference

Sliding-window attention is used in Mistral-7B
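
A hedged sketch of the sliding-window idea: a causal mask restricting each token to the previous w positions (mask construction only, not an optimized kernel):

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may attend to positions i-window+1 .. i.
    Per-token cost is O(window) instead of O(seq_len)."""
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]   # distance between query i and key j
    return (rel >= 0) & (rel < window)  # causal and within the window

mask = sliding_window_causal_mask(seq_len=8, window=3)
print(mask.int())
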
Course 4: Efficient NLP 15
Training LMs - Large-scale training
Dream scenario:
Model fits on the GPU
Forward + backward passes fit in memory at the target batch_size
Optimizer states fit in memory

Course 4: Efficient NLP 16


Training LMs - Large-scale training
Optimization OOM scenario
Model fits on the GPU
Forward + backward passes fit in memory at the target batch_size
Optimizer states saturate GPU memory

Use memory-efficient optimizers (see the sketch below)


Adafactor: factorizes Adam's second-moment statistics into row/column factors
CAME: adds a confidence-guided correction on top of Adafactor
LION: only tracks momentum
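
A hedged sketch of swapping AdamW for Adafactor via Hugging Face transformers (constant-learning-rate settings; exact arguments may vary across versions):

import torch
from transformers.optimization import Adafactor

# Hedged sketch: replace AdamW (two full-size state tensors per parameter) with Adafactor,
# which stores factored second-moment statistics instead. The model is a placeholder.
model = torch.nn.Linear(4096, 4096)

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,              # fixed learning rate ...
    relative_step=False,  # ... instead of Adafactor's internal schedule
    scale_parameter=False,
    warmup_init=False,
)
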
Course 4: Efficient NLP 17
Training LMs - Large-scale training
Forward/backward OOM scenario
Model fits on the GPU
Forward + backward OOM at the target batch_size

Use gradient accumulation (see the sketch below)


Compute forward/backward passes with micro_batch_size < batch_size

Accumulate (sum) batch_size // micro_batch_size micro-batch gradients


Take a single optimizer step per effective batch
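
A minimal sketch of the accumulation loop (placeholder model and data):

import torch

# Gradient accumulation sketch: effective batch_size = micro_batch_size * accum_steps.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8  # batch_size // micro_batch_size

for step, micro_batch in enumerate(torch.randn(64, 4, 512).cuda()):
    loss = model(micro_batch).pow(2).mean()  # dummy loss on one micro-batch
    (loss / accum_steps).backward()          # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                     # one optimizer step per effective batch
        optimizer.zero_grad()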

Course 4: Efficient NLP 18


Training LMs - Multi-GPU training
Distributed Data Parallel (DDP) with multiple GPUs
Copy the model onto each GPU
Send a different micro-batch to each GPU
Compute forward/backward in parallel on each GPU
Average the gradients across GPUs (all-reduce)
Each GPU applies the same optimizer step, keeping the weights in sync
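
A minimal DDP sketch (assumes a launch via torchrun, which sets LOCAL_RANK):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal DDP sketch: one process per GPU, gradients averaged by all-reduce during backward.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")  # each rank gets its own micro-batch
loss = model(x).pow(2).mean()
loss.backward()                          # gradients are all-reduced here
optimizer.step()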

Course 4: Efficient NLP 19


Training LMs - Multi-GPU training
Model OOM scenario
Model does not fit on one GPU (e.g. micro_batch_size=1 fails)
Model parallelism

Course 4: Efficient NLP 20


Training LMs - FSDP

Course 4: Efficient NLP 21


Training LMs - DeepSpeed
Similar to FSDP:
Shards model weights...
but also optimizer states...
and gradients
For relevant model sizes: not that different from FSDP in speed
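
To illustrate the sharding idea, a hedged sketch using PyTorch FSDP (DeepSpeed would be driven by a JSON/dict config instead):

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Hedged FSDP sketch: parameters, gradients, and optimizer states are sharded across ranks
# and gathered on the fly for each forward/backward. Assumes a launch via torchrun.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()
model = FSDP(model)  # default: full sharding of weights, gradients, and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)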

Course 4: Efficient NLP 22


Adapting LMs - Adapters
Parameter-Efficient Transfer Learning for NLP (Houlsby et al. '19)

Course 4: Efficient NLP 23


Adapting LMs - LoRA
Low-Rank Adaptation of Large Language Models (Hu et al. '21)
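
A minimal LoRA sketch (illustrative only, not the paper's or peft's implementation): freeze W and learn a low-rank update B·A:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: y = W x + (alpha / r) * B A x, with W frozen and only A, B trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))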

Course 4: Efficient NLP 24


Adapting LMs - LoRA vs Adapters
Better + more stable results across hyper-parameters

Course 4: Efficient NLP 25


Efficient inference

Course 4: Efficient NLP 26


Previous methods still apply
Efficient attention implementations & variants
FlashAttention / xFormers
Linear attention
Model parallelism (FSDP & DeepSpeed)
LoRA weights for fast model "switching"
Keep the big base model in memory
Load task-specific LoRA weights when required

Course 4: Efficient NLP 27


Quantization
Changes the data type of a model's weights (e.g. float32 -> int4 )
Models are usually trained in float16 or bfloat16 for stability
Needs rescaling: the float range must be mapped onto the integer grid with a scale factor (see the sketch below)
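
A hedged sketch of the rescaling step for simple symmetric (absmax) int8 quantization (int4 / GPTQ schemes are more involved):

import torch

def quantize_absmax_int8(w: torch.Tensor):
    """Symmetric (absmax) quantization sketch: map floats to int8 via a single scale factor."""
    scale = w.abs().max() / 127.0  # rescaling: float range -> [-127, 127]
    w_q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_q.float() * scale     # approximate reconstruction of the weights

w = torch.randn(256, 256)
w_q, scale = quantize_absmax_int8(w)
print((w - dequantize(w_q, scale)).abs().max())  # worst-case quantization error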

Course 4: Efficient NLP 28


LM quantization
GPTQ (Frantar et al. 2023)

Course 4: Efficient NLP 29


LM quantization - GPTQ
Consider quantization as an optimization problem:

argmin_Ŵ ‖W X − Ŵ X‖₂²

where W is the weight matrix to quantize, Ŵ its quantized version, and X are data points (e.g. token sequences)

Course 4: Efficient NLP 30


LM quantization - GPTQ
For each row, quantize one weight at a time by solving the quadratic problem, and adjust the
remaining non-quantized coefficients to minimize the impact on the output
Empirical finding: the update order does not matter, so weights can simply be taken left-to-right
Apply the adjustments lazily at a smaller block scale, and batch them into whole-matrix updates
Precompute the Hessian information (needed for the adjustments) on the non-quantized
coefficients, which is possible because they are taken left-to-right

Course 4: Efficient NLP 31


LM quantization - GPTQ
A matter of minutes/hours (on a single A100 GPU)

Course 4: Efficient NLP 32


LM quantization - GPTQ
Inference speed is greatly increased and memory use greatly reduced:

While performance is maintained (OPT perplexity stays close to the full-precision model)

Course 4: Efficient NLP 33


Managing KV cache - vLLM
Paged Attention (Kwon et al. 2023)
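
For context, a hedged sketch of a plain (non-paged) KV cache during autoregressive decoding; vLLM's contribution is allocating these growing buffers in fixed-size pages:

import torch

# Hedged sketch of a KV cache for one attention head during greedy decoding:
# keys/values of past tokens are kept so each step only processes the newest token.
d_head, k_cache, v_cache = 64, [], []

def decode_step(q_new, k_new, v_new):
    """Append the new token's key/value, then attend the new query over the whole cache."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = torch.stack(k_cache)   # (t, d_head): grows with the generated sequence length
    V = torch.stack(v_cache)
    attn = torch.softmax(q_new @ K.T / d_head**0.5, dim=-1)
    return attn @ V            # (d_head,) output for the new token

for _ in range(5):             # 5 decoding steps with random projections
    out = decode_step(torch.randn(d_head), torch.randn(d_head), torch.randn(d_head))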

Course 4: Efficient NLP 34


Managing KV cache - vLLM
Better throughput + parallelization across requests

Course 4: Efficient NLP 35


Long KV cache - StreamingLLM
Efficient Streaming Language Models with Attention Sinks (Xiao et al. 2023)

Course 4: Efficient NLP 36


Model reduction

Course 4: Efficient NLP 37


DistilBERT (Sanh et al. 2019)

Course 4: Efficient NLP 38


DistilBERT (Sanh et al. 2019)
Distillation can be expensive if the teacher is big
The student still loses some performance
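
A hedged sketch of the core distillation objective (temperature-softened KL plus hard-label cross-entropy; weights and temperature are illustrative):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hedged sketch: KL between temperature-softened teacher and student distributions,
    mixed with standard cross-entropy on the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep a comparable magnitude across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 100), torch.randn(8, 100), torch.randint(0, 100, (8,)))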

Course 4: Efficient NLP 39


Sheared Llama (Xia et al. 2023)
Remove (prune) the weights whose removal least increases the loss

Continue pre-training the obtained reduced model

Course 4: Efficient NLP 40


Sheared Llama (Xia et al. 2023)
Get a good model with much less data/compute

Course 4: Efficient NLP 41
