
LoRA: Low-Rank Adaptation of

Large Language Models

Umar Jamil
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0):
https://creativecommons.org/licenses/by-nc/4.0/legalcode

Not for commercial use

Umar Jamil - https://github.com/hkproj/pytorch-lora


How do neural networks work?
[Diagram: Input → Hidden Layer 1 → Hidden Layer 2 → Output; the Output is compared against the Target to compute the Loss.]


Fine Tuning
[Diagram: the same network as before: Input → Hidden Layer 1 → Hidden Layer 2 → Output, compared against the Target to compute the Loss.]

Fine-tuning means training a pre-trained network on new data to improve its performance on a specific task. For example, we may take an LLM that was pre-trained on many programming languages and fine-tune it for a new dialect of SQL.
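The full-network training described above can be sketched in a few lines of PyTorch. This is a toy stand-in model for illustration, not code from the pytorch-lora repo; note that every parameter of the "pre-trained" network receives gradients and is updated:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a pre-trained network.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# "New data" for the specific task we are fine-tuning on.
x, target = torch.randn(8, 4), torch.randn(8, 2)

losses = []
for _ in range(20):
    opt.zero_grad()
    loss = loss_fn(model(x), target)  # compare output against target
    loss.backward()                   # gradients flow to ALL parameters
    opt.step()                        # ALL parameters are updated
    losses.append(loss.item())
```

The fact that every weight is trained and every weight must be checkpointed is exactly what the next slide identifies as the problem.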
Problems with fine-tuning

1. We must train the full network, which is computationally expensive for the average user
when dealing with Large Language Models like GPT.
2. Storage requirements for the checkpoints are expensive, as we need to save the entire
model on disk for each checkpoint. If we also save the optimizer state (which we
usually do), the situation gets even worse!
3. If we have multiple fine-tuned models, we need to reload all the weights of the model
every time we want to switch between them, which can be expensive and slow. For
example, we may have one model fine-tuned for helping users write SQL queries and
another for helping users write JavaScript code.


Introducing LoRA
[Diagram: the pre-trained weights W ∈ ℝ^(d×k) of a hidden layer are frozen 🥶; a trainable 🦾 pair of low-rank matrices B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with r ≪ min(d, k), is added alongside W. Input → Output, compared against the Target to compute the Loss.]
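The slide above can be sketched as a small PyTorch module. This is a minimal illustration, not the repo's actual implementation; the name `LoRALinear` and the layout (layer computes x·Wᵀ, B initialized to zero so training starts from the pre-trained behavior) are assumptions for this sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        # 🥶 frozen pre-trained weights W ∈ ℝ^(d×k)
        self.W = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # 🦾 trainable low-rank factors: B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), r ≪ min(d, k)
        self.B = nn.Parameter(torch.zeros(d, r))  # zero init: B @ A = 0 at the start
        self.A = nn.Parameter(torch.randn(r, k))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., k) → output: (..., d), using W + BA instead of W
        return x @ (self.W + self.B @ self.A).T

layer = LoRALinear(d=8, k=4, r=2)
x = torch.randn(3, 4)
print(layer(x).shape)  # torch.Size([3, 8])
```

Only A and B appear in the gradient computation; W stays untouched, which is where the savings in the next slide come from.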


What are the benefits?

1. Fewer parameters to train and store: if 𝑑 = 1000 and 𝑘 = 5000, then (𝑑 × 𝑘) = 5,000,000; using
𝑟 = 5, we get (𝑑 × 𝑟) + (𝑟 × 𝑘) = 5,000 + 25,000 = 30,000, less than 1% of the
original.
2. Fewer parameters = lower storage requirements.
3. Faster backpropagation, as we do not need to evaluate the gradient for most of the
parameters.
4. We can easily switch between two different fine-tuned models (one for SQL generation
and one for JavaScript code generation) just by swapping the parameters of the A and B
matrices instead of reloading the W matrix.
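The arithmetic in point 1 can be checked directly:

```python
d, k, r = 1000, 5000, 5

full = d * k           # parameters in the full W matrix
lora = d * r + r * k   # parameters in B (d×r) plus A (r×k)

print(full)            # 5000000
print(lora)            # 30000
print(lora / full)     # 0.006, i.e. 0.6% of the original
```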


Why does it work?

It basically means that the W matrix of a pre-trained model contains many parameters that convey
the same information as others (i.e. they can be obtained as a combination of the other weights), so
we can get rid of them without decreasing the performance of the model. Matrices of this kind
are called rank-deficient: they do not have full rank.
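Rank deficiency can be seen concretely with NumPy. This is a toy example (not from the slides): the third row is just the sum of the first two, so it conveys no new information and the matrix does not have full rank:

```python
import numpy as np

# 3×3 matrix whose third row equals row 1 + row 2.
M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])

print(np.linalg.matrix_rank(M))  # 2, not the full rank of 3
```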


A brief tutorial on the rank of a matrix…

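As a taste of what such a tutorial covers: for a rank-deficient matrix, the SVD recovers exactly the kind of thin B and A factors that LoRA trains. This is a sketch with made-up sizes, not material from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 10, 8, 2

# Build a matrix W of rank r as a product of two thin matrices.
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, k))

# Decompose it and keep only the top-r singular values.
U, S, Vt = np.linalg.svd(W)
B = U[:, :r] * S[:r]  # d×r
A = Vt[:r, :]         # r×k

# W is recovered exactly from d*r + r*k = 36 numbers instead of d*k = 80.
print(np.allclose(B @ A, W))  # True
```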


Thanks for watching!
Don’t forget to subscribe for
more amazing content on AI
and Machine Learning!

