This course covers techniques for training and deploying large language models efficiently: scaling laws for choosing model size and training data given a compute budget, efficient attention implementations, float precision and memory-efficient optimizers, multi-GPU training with DDP, FSDP and DeepSpeed, low-rank adaptation (LoRA), quantization, KV-cache management, and model reduction methods such as DistilBERT and Sheared Llama. The goal throughout is to maximize performance while minimizing training and inference cost.


Course 4: Efficient NLP

Course 4: Efficient NLP 1
The cost of pre-training LMs

Course 4: Efficient NLP 2


The cost of using LMs

Course 4: Efficient NLP 3


Efficient training

Course 4: Efficient NLP 4


Scaling Laws
Scaling Laws for Neural Language Models (Kaplan et al. 2020)

Course 4: Efficient NLP 5


Chinchilla Scaling Laws
Refinement of Kaplan et al. using more data points & a better training recipe
Given a FLOPs budget, what model size and how many training tokens should one use?

Course 4: Efficient NLP 6


Chinchilla Scaling Laws
Propose a form for the final loss, with N = model parameters and D = training tokens:

L(N, D) = E + A / N^α + B / D^β

Fit it on data points:

E = 1.69 (the "entropy of natural language")
A = 406.4, B = 410.7, α = 0.34, β = 0.28
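
As a quick illustration, a minimal Python sketch of this fitted loss (constants as above; N = parameters, D = training tokens):

# Sketch of the fitted Chinchilla loss L(N, D) = E + A / N**alpha + B / D**beta,
# using the constants reported above (N = model parameters, D = training tokens).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted final pre-training loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Example: a 70B-parameter model trained on 1.4T tokens (roughly Chinchilla's setup).
print(chinchilla_loss(70e9, 1.4e12))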

Course 4: Efficient NLP 7


Chinchilla Scaling Laws

Compute budget: C ≈ 6 · N · D FLOPs
For a given compute level C, there exists an optimal N and an optimal D
Training a bigger model on less data => worse
Training a smaller model on more data => worse
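
A hedged sketch (assuming the C ≈ 6·N·D approximation and the fit above) of the resulting compute-optimal allocation:

# Hedged sketch: closed-form compute-optimal N (parameters) and D (tokens) for a FLOPs
# budget C, derived from L(N, D) = E + A/N**alpha + B/D**beta under the constraint C ≈ 6*N*D.
A, B, ALPHA, BETA = 406.4, 410.7, 0.34, 0.28

def compute_optimal(c_flops: float) -> tuple[float, float]:
    """Return (N_opt, D_opt) minimizing the fitted loss at fixed compute C ≈ 6*N*D."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))  # prefactor from the first-order condition
    n_opt = g * (c_flops / 6.0) ** (BETA / (ALPHA + BETA))  # N_opt grows roughly as C**0.45
    d_opt = (c_flops / 6.0) / n_opt                         # D_opt grows roughly as C**0.55
    return n_opt, d_opt

n, d = compute_optimal(1e24)  # hypothetical 1e24-FLOP budget
print(f"N ≈ {n:.3g} parameters, D ≈ {d:.3g} tokens")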

Course 4: Efficient NLP 8


Chinchilla Scaling Laws - In practice
Train for longer than prescribed by the scaling laws
Why?
Over-train smaller models
Trade training compute for inference compute
Example: Mistral-7B model

Course 4: Efficient NLP 9


Chinchilla Scaling Laws - In practice

Course 4: Efficient NLP 10


Training LMs
Batch size matters:
No bigger than 1-2M tokens (Kaplan et al., 2020)
Maximize parallel computation
Float precision:
float16 : reduces memory usage, works well on V100-gen GPUs
bfloat16 : more stable, but only supported on A100-gen GPUs and newer
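
A minimal sketch of bfloat16 mixed precision with PyTorch autocast (placeholder model and loss; float16 would additionally need a GradScaler):

import torch

# Minimal mixed-precision sketch: run the forward pass under autocast so matmuls use
# bfloat16 while master weights stay in float32. Model and loss are placeholders.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()  # dummy loss
loss.backward()
optimizer.step()
optimizer.zero_grad()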

Course 4: Efficient NLP 11


Training LMs - Efficient implementations
FlashAttention (Dao et al. 2022)

Course 4: Efficient NLP 12


Training LMs - Efficient implementations
FlashAttention2 (Dao et al. 2023)

Course 4: Efficient NLP 13


Training LMs - Efficient implementations
xFormers & Memory-efficient attention (Rabe et al. 2021)
Classical implementation: softmax(Q Kᵀ / √d) V, which materializes the full n × n attention matrix

Memory-efficient implementation: compute the same result over chunks, never storing the full n × n matrix
~SOTA on V100-gen GPUs
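
A simplified sketch of the idea, chunking over queries only (the actual method also chunks over keys/values with a running softmax):

import torch

def chunked_attention(q, k, v, chunk_size=1024):
    """Naive memory-efficient attention sketch: iterate over query chunks so that only a
    (chunk_size x n) score block exists at any time, instead of the full n x n matrix."""
    scale = q.shape[-1] ** -0.5
    outputs = []
    for start in range(0, q.shape[-2], chunk_size):
        q_chunk = q[..., start:start + chunk_size, :]
        scores = (q_chunk @ k.transpose(-2, -1)) * scale   # (..., chunk, n)
        outputs.append(torch.softmax(scores, dim=-1) @ v)  # (..., chunk, d)
    return torch.cat(outputs, dim=-2)

q = k = v = torch.randn(2, 8, 4096, 64)  # (batch, heads, seq, head_dim)
out = chunked_attention(q, k, v)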

Course 4: Efficient NLP 14


Training LMs - Efficient variants
Linear-complexity attention, e.g. sliding-window attention (Beltagy et al. 2020)

Can be used to adapt a model for efficient inference

Sliding-window attention is used in Mistral-7B
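
A hedged sketch of the sliding-window idea: a causal mask restricting each token to the previous w positions (mask construction only, not an optimized kernel):

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may attend to positions i-window+1 .. i.
    Per-token cost is O(window) instead of O(seq_len)."""
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]   # distance between query i and key j
    return (rel >= 0) & (rel < window)  # causal and within the window

mask = sliding_window_causal_mask(seq_len=8, window=3)
print(mask.int())
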
Course 4: Efficient NLP 15
Training LMs - Large-scale training
Dream scenario:
Model fits on the GPU
Forward + backward passes fit in memory at the target batch_size
Optimizer states fit in memory

Course 4: Efficient NLP 16


Training LMs - Large-scale training
Optimization OOM scenario
Model fits on the GPU
Forward + backward passes fit in memory at the target batch_size
Optimizer states saturate GPU memory

Use memory-efficient optimizers (see the sketch below)


Adafactor: factorizes Adam's second-moment statistics into row/column factors
CAME: adds a confidence-guided correction on top of Adafactor
LION: only tracks momentum
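
A hedged sketch of swapping AdamW for Adafactor via Hugging Face transformers (constant-learning-rate settings; exact arguments may vary across versions):

import torch
from transformers.optimization import Adafactor

# Hedged sketch: replace AdamW (two full-size state tensors per parameter) with Adafactor,
# which stores factored second-moment statistics instead. The model is a placeholder.
model = torch.nn.Linear(4096, 4096)

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,              # fixed learning rate ...
    relative_step=False,  # ... instead of Adafactor's internal schedule
    scale_parameter=False,
    warmup_init=False,
)
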
Course 4: Efficient NLP 17
Training LMs - Large-scale training
Forward/backward OOM scenario
Model fits on the GPU
Forward + backward OOM at the target batch_size

Use gradient accumulation (see the sketch below)


Compute forward/backward passes with micro_batch_size < batch_size

Accumulate (sum) batch_size // micro_batch_size micro-batch gradients


Take a single optimizer step per effective batch
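
A minimal sketch of the accumulation loop (placeholder model and data):

import torch

# Gradient accumulation sketch: effective batch_size = micro_batch_size * accum_steps.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8  # batch_size // micro_batch_size

for step, micro_batch in enumerate(torch.randn(64, 4, 512).cuda()):
    loss = model(micro_batch).pow(2).mean()  # dummy loss on one micro-batch
    (loss / accum_steps).backward()          # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                     # one optimizer step per effective batch
        optimizer.zero_grad()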

Course 4: Efficient NLP 18


Training LMs - Multi-GPU training
Distributed Data Parallel (DDP) with multiple GPUs
Copy the model onto each GPU
Send a different micro-batch to each GPU
Compute forward/backward in parallel on each GPU
Average the gradients across GPUs (all-reduce)
Each GPU applies the same optimizer step, keeping the weights in sync
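
A minimal DDP sketch (assumes a launch via torchrun, which sets LOCAL_RANK):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal DDP sketch: one process per GPU, gradients averaged by all-reduce during backward.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")  # each rank gets its own micro-batch
loss = model(x).pow(2).mean()
loss.backward()                          # gradients are all-reduced here
optimizer.step()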

Course 4: Efficient NLP 19


Training LMs - Multi-GPU training
Model OOM scenario
Model does not fit on one GPU (e.g. micro_batch_size=1 fails)
Model parallelism

Course 4: Efficient NLP 20


Training LMs - FSDP

Course 4: Efficient NLP 21


Training LMs - DeepSpeed
Similar to FSDP:
Shards model weights...
but also optimizer states...
and gradients
For relevant model sizes: not that different from FSDP in speed
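
To illustrate the sharding idea, a hedged sketch using PyTorch FSDP (DeepSpeed would be driven by a JSON/dict config instead):

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Hedged FSDP sketch: parameters, gradients, and optimizer states are sharded across ranks
# and gathered on the fly for each forward/backward. Assumes a launch via torchrun.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()
model = FSDP(model)  # default: full sharding of weights, gradients, and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)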

Course 4: Efficient NLP 22


Adapting LMs - Adapters
Parameter-Efficient Transfer Learning for NLP (Houlsby et al. '19)

Course 4: Efficient NLP 23


Adapting LMs - LoRA
Low-Rank Adaptation of Large Language Models (Hu et al. '21)
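
A minimal LoRA sketch (illustrative only, not the paper's or peft's implementation): freeze W and learn a low-rank update B·A:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: y = W x + (alpha / r) * B A x, with W frozen and only A, B trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))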

Course 4: Efficient NLP 24


Adapting LMs - LoRA vs Adapters
Better + more stable results across hyper-parameters

Course 4: Efficient NLP 25


Efficient inference

Course 4: Efficient NLP 26


Previous methods still apply
Efficient attention implementations & variants
FlashAttention / xFormers
Linear attention
Model parallelism (FSDP & DeepSpeed)
LoRA weights for fast model "switching"
Keep the big base model in memory
Load task-specific LoRA weights when required

Course 4: Efficient NLP 27


Quantization
Changes the data type of a model's weights (e.g. float32 -> int4 )
Models are usually trained in float16 or bfloat16 for stability
Needs rescaling: the float range must be mapped onto the integer grid with a scale factor (see the sketch below)
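
A hedged sketch of the rescaling step for simple symmetric (absmax) int8 quantization (int4 / GPTQ schemes are more involved):

import torch

def quantize_absmax_int8(w: torch.Tensor):
    """Symmetric (absmax) quantization sketch: map floats to int8 via a single scale factor."""
    scale = w.abs().max() / 127.0  # rescaling: float range -> [-127, 127]
    w_q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_q.float() * scale     # approximate reconstruction of the weights

w = torch.randn(256, 256)
w_q, scale = quantize_absmax_int8(w)
print((w - dequantize(w_q, scale)).abs().max())  # worst-case quantization error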

Course 4: Efficient NLP 28


LM quantization
GPTQ (Frantar et al. 2023)

Course 4: Efficient NLP 29


LM quantization - GPTQ
Consider quantization as an optimization problem:

argmin_Ŵ ‖W X − Ŵ X‖₂²

where W is the weight matrix to quantize, Ŵ its quantized version, and X are data points (e.g. token sequences)

Course 4: Efficient NLP 30


LM quantization - GPTQ
For each row, quantize one weight at a time by solving the quadratic problem, and adjust the
remaining non-quantized coefficients to minimize the impact on the output
Empirical finding: the update order does not matter, so weights can simply be taken left-to-right
Apply the adjustments lazily at a smaller block scale, and batch them into whole-matrix updates
Precompute the Hessian information (needed for the adjustments) on the non-quantized
coefficients, which is possible because they are taken left-to-right

Course 4: Efficient NLP 31


LM quantization - GPTQ
A matter of minutes/hours (on a single A100 GPU)

Course 4: Efficient NLP 32


LM quantization - GPTQ
Inference speed is greatly increased and memory use greatly reduced:

While performance is maintained (OPT perplexity stays close to the full-precision model)

Course 4: Efficient NLP 33


Managing KV cache - vLLM
Paged Attention (Kwon et al. 2023)
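
For context, a hedged sketch of a plain (non-paged) KV cache during autoregressive decoding; vLLM's contribution is allocating these growing buffers in fixed-size pages:

import torch

# Hedged sketch of a KV cache for one attention head during greedy decoding:
# keys/values of past tokens are kept so each step only processes the newest token.
d_head, k_cache, v_cache = 64, [], []

def decode_step(q_new, k_new, v_new):
    """Append the new token's key/value, then attend the new query over the whole cache."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = torch.stack(k_cache)   # (t, d_head): grows with the generated sequence length
    V = torch.stack(v_cache)
    attn = torch.softmax(q_new @ K.T / d_head**0.5, dim=-1)
    return attn @ V            # (d_head,) output for the new token

for _ in range(5):             # 5 decoding steps with random projections
    out = decode_step(torch.randn(d_head), torch.randn(d_head), torch.randn(d_head))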

Course 4: Efficient NLP 34


Managing KV cache - vLLM
Better throughput + parallelization across requests

Course 4: Efficient NLP 35


Long KV cache - StreamingLLM
Efficient Streaming Language Models with Attention Sinks (Xiao et al. 2023)

Course 4: Efficient NLP 36


Model reduction

Course 4: Efficient NLP 37


DistilBERT (Sanh et al. 2019)

Course 4: Efficient NLP 38


DistilBERT (Sanh et al. 2019)
Distillation can be expensive if the teacher is big
The student still loses some performance
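
A hedged sketch of the core distillation objective (temperature-softened KL plus hard-label cross-entropy; weights and temperature are illustrative):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hedged sketch: KL between temperature-softened teacher and student distributions,
    mixed with standard cross-entropy on the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep a comparable magnitude across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 100), torch.randn(8, 100), torch.randint(0, 100, (8,)))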

Course 4: Efficient NLP 39


Sheared Llama (Xia et al. 2023)
Remove (prune) the weights whose removal least increases the loss

Continue pre-training the obtained reduced model

Course 4: Efficient NLP 40


Sheared Llama (Xia et al. 2023)
Get a good model with much less data/compute

Course 4: Efficient NLP 41
