ExLlamaV2 - The Fastest Library To Run LLMs
Maxime Labonne
Image by author
In this article, we will see how to quantize base models in the EXL2 format
and how to run them. As usual, the code is available on GitHub and Google
Colab.
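Before quantizing, we need the ExLlamaV2 repository cloned locally (the script is invoked as exllamav2/convert.py below) and the base model downloaded to a base_model directory. A minimal sketch of the download step, assuming we work with HuggingFaceH4/zephyr-7b-beta (the model quantized later in this article):

from huggingface_hub import snapshot_download

# Download the full-precision model into a local base_model/ directory.
# Repo ID assumed here: HuggingFaceH4/zephyr-7b-beta.
snapshot_download(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    local_dir="base_model",
)

We also need a calibration dataset to measure the error introduced by quantization. Here, we download the test split of wikitext in Parquet format: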
wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2
Once it’s done, we can leverage the convert.py script provided by the
ExLlamaV2 library. We're mostly concerned with four arguments:
-i : Path of the base model to convert (in HF format).
-o : Path of the working directory with temporary files and final output.
-c : Path of the calibration dataset (in Parquet format).
-b : Target average number of bits per weight (bpw). For example, 4.0
bpw will store weights in 4-bit precision.
The complete list of arguments is available on this page. Let’s start the
quantization process using the convert.py script with the following
arguments:
mkdir quant
python exllamav2/convert.py \
-i base_model \
-o quant \
-c wikitext-test.parquet \
-b 5.0
Note that you will need a GPU to quantize this model. The official
documentation specifies that you need approximately 8 GB of VRAM for a 7B
model, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2
hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.
Under the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the
precision of the weights while minimizing the impact on the output. You can
find more details about the GPTQ algorithm in this article.
So why are we using the “EXL2” format instead of the regular GPTQ format?
EXL2 comes with a few new features:
It supports different levels of quantization: it’s not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.
It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.
ExLlamaV2 uses this flexibility during quantization: it tries different combinations of quantization parameters and measures the error they introduce while aiming for the target average bitrate. Here is an excerpt of one of these trials:
"key": "model.layers.0.self_attn.q_proj",
"numel": 16777216,
"options": [
{
"desc": "0.05:3b/0.95:2b 32g s4",
"bpw": 2.1878662109375,
"total_bits": 36706304.0,
"err": 0.011161142960190773,
"qparams": {
"group_size": 32,
"bits": [
3,
2
],
"bits_prop": [
0.05,
0.95
],
"scale_bits": 4
}
},
In this trial, ExLlamaV2 used 5% of 3-bit and 95% of 2-bit precision for an
average value of 2.188 bpw and a group size of 32. This introduced a
noticeable error that is taken into account to select the best parameters.
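We can verify the reported bpw directly from the numbers in this measurement entry:

# bpw for this option = total storage divided by the number of weights
numel = 16777216
total_bits = 36706304.0
print(total_bits / numel)  # 2.1878662109375

# The weights alone account for 0.05 * 3 + 0.95 * 2 = 2.05 bpw; the 4-bit scale
# stored for each group of 32 weights adds 4 / 32 = 0.125 bpw, and the small
# remainder is other quantization metadata.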
Now that our model is quantized, we want to run it to see how it performs.
Before that, we need to copy essential config files from the base_model
directory to the new quant directory. Basically, we want every file that is not
hidden ( .* ) or a safetensors file. Additionally, we don’t need the out_tensor
directory that was created during quantization.
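A minimal sketch of this cleanup and copy step in Python (the same can be done with shell commands):

import shutil
from pathlib import Path

# Drop the temporary out_tensor directory produced during quantization
shutil.rmtree("quant/out_tensor", ignore_errors=True)

# Copy every file from base_model that is neither hidden nor a safetensors file
for f in Path("base_model").iterdir():
    if f.is_file() and not f.name.startswith(".") and f.suffix != ".safetensors":
        shutil.copy(f, "quant")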
Our EXL2 model is ready and we have several options to run it. The most
straightforward method consists of using the test_inference.py script in the
ExLlamaV2 repo (note that I don’t use a chat template here):
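The call takes the quantized model directory and a prompt, along these lines (the prompt is just an example):

python exllamav2/test_inference.py -m quant/ -p "Once upon a time,"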
-- Model: quant/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...
Alternatively, you can use a chat version with the chat.py script for more
flexibility:
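In the ExLlamaV2 repo, this script lives under examples/ and takes a -mode argument that selects the prompt format, for example:

python exllamav2/examples/chat.py -m quant -mode llama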
Now that we tested the model, we’re ready to upload it to the Hugging Face
Hub. You can change the name of your repo in the following code snippet and
simply run it.
from huggingface_hub import HfApi, notebook_login

notebook_login()
api = HfApi()
api.create_repo(
repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
repo_type="model"
)
api.upload_folder(
repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
folder_path="quant",
)
Great, the model can be found on the Hugging Face Hub. The code in the
notebook is quite general and allows you to quantize different models with
different values of bpw. This is ideal for creating models dedicated to
your hardware.
Conclusion
In this article, we saw how to quantize a base model in the EXL2 format using ExLlamaV2, how to run the quantized model, and how to upload it to the Hugging Face Hub.
Learn more about machine learning and support my work with one click — become a Medium member here.