Maxime Labonne

ExLlamaV2: The Fastest Library to Run LLMs

Quantize and run EXL2 models


Quantizing Large Language Models (LLMs) is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it has recently been directly integrated into the transformers library.

ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to new kernels, it's optimized for (blazingly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility to how weights are stored.

In this article, we will see how to quantize base models in the EXL2 format
and how to run them. As usual, the code is available on GitHub and Google
Colab.

⚡ Quantize EXL2 models

To start our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, which is why we will install it from source as follows:

git clone https://github.com/turboderp/exllamav2
pip install exllamav2

Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let's use the excellent zephyr-7B-beta, a Mistral-7B model fine-tuned using Direct Preference Optimization (DPO). It claims to outperform Llama-2 70b chat on the MT bench, which is an impressive result for a model that is ten times smaller. You can try out the base Zephyr model using this space.

We download zephyr-7B-beta using the following command (this can take a while since the model is about 15 GB):

git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

GPTQ also requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download the test file as follows:

wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2
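
If you want to sanity-check the calibration data before running the quantization, a quick sketch with pandas can help. This assumes pandas and pyarrow are installed and that the downloaded file was saved as wikitext-test.parquet (the filename used below); the "text" column name is an assumption based on the usual wikitext layout:

import pandas as pd

# Load the calibration split and inspect a few rows
df = pd.read_parquet("wikitext-test.parquet")
print(df.shape)   # number of rows and columns
print(df.head())  # raw text is expected in a "text" column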

Once it’s done, we can leverage the convert.py script provided by the ExLlamaV2 library. We're mostly concerned with four arguments:

-i : Path of the base model to convert in HF format (FP16).

-o : Path of the working directory with temporary files and final output.

-c : Path of the calibration dataset (in Parquet format).

-b : Target average number of bits per weight (bpw). For example, 4.0 bpw will store weights in 4-bit precision.

The complete list of arguments is available on this page. Let’s start the
quantization process using the convert.py script with the following
arguments:

mkdir quant
python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0

Note that you will need a GPU to quantize this model. The official
documentation specifies that you need approximately 8 GB of VRAM for a 7B
model, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2
hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.

Under the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the
precision of the weights while minimizing the impact on the output. You can
find more details about the GPTQ algorithm in this article.

So why are we using the “EXL2” format instead of the regular GPTQ format?
EXL2 comes with a few new features:

It supports different levels of quantization: it’s not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.

It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.

ExLlamaV2 uses this additional flexibility during quantization. It tries different quantization parameters and measures the error they introduce. On top of trying to minimize the error, ExLlamaV2 also has to achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5, for example.

The benchmark of different parameters it creates is saved in the measurement.json file. The following JSON shows the measurement for one layer:

"key": "model.layers.0.self_attn.q_proj",
"numel": 16777216,
"options": [
{
"desc": "0.05:3b/0.95:2b 32g s4",
"bpw": 2.1878662109375,
"total_bits": 36706304.0,
"err": 0.011161142960190773,
"qparams": {
"group_size": 32,
"bits": [
3,
2
],
"bits_prop": [
0.05,
0.95
],
"scale_bits": 4
}
},

In this trial, ExLlamaV2 used 5% of 3-bit and 95% of 2-bit precision for an
average value of 2.188 bpw and a group size of 32. This introduced a
noticeable error that is taken into account to select the best parameters.
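
To make the bookkeeping concrete, here is a small Python sketch that roughly reproduces the reported bpw from the qparams above. Only the total_bits / numel ratio comes directly from the file; attributing the remaining gap to per-group metadata is my own assumption:

# Reproduce the average bits per weight reported for this trial
numel = 16777216
total_bits = 36706304.0
bits = [3, 2]
bits_prop = [0.05, 0.95]
group_size = 32
scale_bits = 4

weight_bpw = sum(b * p for b, p in zip(bits, bits_prop))  # 2.05 bpw for the weights themselves
scale_bpw = scale_bits / group_size                       # 0.125 bpw for the 4-bit group scales
reported_bpw = total_bits / numel                         # 2.1879 bpw as stored in measurement.json

print(weight_bpw + scale_bpw)  # ~2.175; the small remainder is other per-group metadata
print(reported_bpw)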

🦙 Running ExLlamaV2 for Inference

Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy essential config files from the base_model directory to the new quant directory. Basically, we want every file that is not hidden ( .* ) or a safetensors file. Additionally, we don't need the out_tensor directory that was created by ExLlamaV2 during quantization.

In bash, you can implement this as follows:

!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
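
If rsync is not available (on Windows, for instance), a rough Python equivalent of the same copy could look like the sketch below. It only handles top-level files, which is enough here since the configs and tokenizer files sit directly in base_model; the directory names match the ones used above:

import shutil
from pathlib import Path

src, dst = Path("base_model"), Path("quant")
for f in src.iterdir():
    # Skip hidden files and the FP16 weight shards; copy everything else (configs, tokenizer)
    if f.name.startswith(".") or f.suffix == ".safetensors":
        continue
    if f.is_file():
        shutil.copy2(f, dst / f.name)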

Our EXL2 model is ready and we have several options to run it. The most
straightforward method consists of using the test_inference.py script in the
ExLlamaV2 repo (note that I don’t use a chat template here):

python exllamav2/test_inference.py -m quant/ -p "I have a dream"

The generation is very fast (56.44 tokens/second on a T4 GPU), even compared to other quantization techniques and tools like GGUF/llama.cpp or GPTQ. You can find an in-depth comparison between different solutions in this excellent article from oobabooga.

In my case, the LLM returned the following output:

-- Model: quant/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...

I have a dream. <|user|>
Wow, that's an amazing speech! Can you add some statistics or examples to sup

Absolutely! Here's your updated speech:

Dear fellow citizens,

Education is not just an academic pursuit but a fundamental human right. It

-- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (incl

Alternatively, you can use a chat version with the chat.py script for more
flexibility:

python exllamav2/examples/chat.py -m quant -mode llama
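
You can also drive the quantized model directly from Python. The following is a minimal sketch based on the ExLlamaV2 Python API as it existed around the time of writing; class names and the generate_simple signature may differ in newer releases, so treat it as a starting point rather than the canonical usage:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Load the quantized model from the quant/ directory
config = ExLlamaV2Config()
config.model_dir = "quant"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# Generate with simple sampling settings
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

print(generator.generate_simple("I have a dream", settings, num_tokens=128))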

If you’re planning to use an EXL2 model more regularly, ExLlamaV2 has been integrated into several backends like oobabooga’s text generation web UI. Note that it requires FlashAttention 2 to work as efficiently as possible, which requires CUDA 12.1 on Windows at the moment (something you can configure during the installation process).

Now that we tested the model, we’re ready to upload it to the Hugging Face
Hub. You can change the name of your repo in the following code snippet and
simply run it.

from huggingface_hub import notebook_login, HfApi

# Log in to the Hub, create the target repo, and upload the quantized folder
notebook_login()
api = HfApi()
api.create_repo(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    repo_type="model"
)
api.upload_folder(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    folder_path="quant",
)

Great, the model can be found on the Hugging Face Hub. The code in the notebook is quite general and lets you quantize different models with different values of bpw. This is ideal for creating models dedicated to your hardware.
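
Conversely, anyone who wants to reuse this quantized model later can pull it back from the Hub with huggingface_hub. Here is a short sketch; the repo_id matches the one created above, and the local directory name is arbitrary:

from huggingface_hub import snapshot_download

# Download the EXL2 weights and configs into a local folder
local_dir = snapshot_download(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    local_dir="zephyr-7b-beta-5.0bpw-exl2",
)
print(local_dir)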

Conclusion

In this article, we presented ExLlamaV2, a powerful library to quantize LLMs. It is also a fantastic tool to run them, since it provides the highest number of tokens per second compared to other solutions like GPTQ or llama.cpp. We applied it to the zephyr-7B-beta model to create a 5.0 bpw version of it, using the new EXL2 format. After quantization, we tested our model to see how it performs. Finally, it was uploaded to the Hugging Face Hub and can be found here.

If you’re interested in more technical content around LLMs, follow me on Medium.

Articles about quantization

Introduction to Weight Quantization: Reducing the size of Large Language Models with 8-bit quantization (towardsdatascience.com)

4-bit Quantization with GPTQ: Quantize your own LLMs using AutoGPTQ (towardsdatascience.com)

