
Model Quantization

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Why Model Quantization?

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Representing Numeric Values

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Representing Numeric Values

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Representing Numeric Values

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Common Data Types

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Common Data Types

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Common Data Types

Note: the bit pattern shown is just an illustrative example, not the actual representation of the value 3

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Symmetric Quantization
➢ In symmetric quantization, the range of the original floating-point values is mapped
to a symmetric range around zero in the quantized space.

➢ In the previous examples, notice how the ranges before and after quantization
remain centered around zero.

➢ This means that the quantized value for zero in the floating-point space is exactly
zero in the quantized space.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Symmetric Quantization
➢ A common form of symmetric quantization is absolute maximum (absmax) quantization.
➢ Given a list of values, we take the highest absolute value (α) as the range to perform
the linear mapping.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Symmetric Quantization

Scale factor: s = (2^(b-1) - 1) / α, where b is the number of bits and α is the highest absolute value
Quantization: x_quantized = round(s · x)
Dequantization: x_dequantized = x_quantized / s

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
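As a rough illustration of these formulas, here is a minimal NumPy sketch of absmax quantization and dequantization; the function names and example values are mine, not from the slides.

```python
import numpy as np

def absmax_quantize(x, bits=8):
    # Symmetric (absmax) quantization: map [-alpha, alpha] onto the signed integer grid.
    alpha = np.max(np.abs(x))                 # highest absolute value
    scale = (2 ** (bits - 1) - 1) / alpha     # s = (2^(b-1) - 1) / alpha, i.e. 127 / alpha for INT8
    x_q = np.round(scale * x).astype(np.int8)
    return x_q, scale

def absmax_dequantize(x_q, scale):
    return x_q.astype(np.float32) / scale     # x ~ x_q / s

# Round-trip example: the dequantized values are close to the originals,
# off only by the quantization error.
w = np.array([1.6, -0.3, 0.0, 2.9, -3.2], dtype=np.float32)
w_q, s = absmax_quantize(w)
print(w_q, absmax_dequantize(w_q, s))
```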
Symmetric Quantization

Generally, the lower the number of bits, the more quantization error we tend to have.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Asymmetric Quantization

Notice how the 0 has shifted position? That’s why it’s called asymmetric quantization: the min/max values have different distances to 0 in the range [-7.59, 10.8].

Due to this shifted position, we have to calculate a zero-point for the INT8 range to perform the linear mapping. As before, we also have to calculate a scale factor (s), but using the width of INT8’s range [-128, 127] instead.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
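A hedged sketch of that calculation in NumPy, assuming INT8 and the notation from the previous slides (β = minimum, α = maximum); the function name is mine.

```python
import numpy as np

def asymmetric_quantize(x, bits=8):
    beta, alpha = float(x.min()), float(x.max())            # float range, e.g. [-7.59, 10.8]
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1     # INT8 range [-128, 127]
    scale = (qmax - qmin) / (alpha - beta)                   # s uses the width of the INT8 range
    zero_point = int(round(-scale * beta)) + qmin            # z maps beta onto qmin (-128)
    x_q = np.clip(np.round(scale * x) + zero_point, qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point
```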
Asymmetric Quantization

Scale factor: s = (127 - (-128)) / (α - β), where β and α are the minimum and maximum of the floating-point range
Zero-point: z = round(-s · β) - 128
Quantization: x_quantized = round(s · x + z)

To dequantize the quantized values from INT8 back to FP32, we will need to use the previously calculated scale factor (s) and zero-point (z). Other than that, dequantization is straightforward:

Dequantization: x_dequantized = (x_quantized - z) / s

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
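And the corresponding dequantization, continuing the sketch above (it reuses asymmetric_quantize from the previous block):

```python
import numpy as np

def asymmetric_dequantize(x_q, scale, zero_point):
    # Undo the linear mapping: x ~ (x_q - z) / s
    return (x_q.astype(np.float32) - zero_point) / scale

x = np.array([-7.59, 0.0, 3.2, 10.8], dtype=np.float32)
x_q, s, z = asymmetric_quantize(x)
print(asymmetric_dequantize(x_q, s, z))   # close to the original values
```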
Symmetric vs Asymmetric Quantization

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Outliers – Range Mapping & Clipping

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Outliers – Range Mapping & Clipping

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Calibration
Calibration involves determining the scale and zero-point parameters, essential for
mapping the floating-point values to the integer range.

These parameters are directly derived from the minimum and maximum values of the
activation ranges.

Thus, calibration often includes finding the optimal min and max values as well, as they
are used to calculate the scale and zero-point.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Calibration
➢ The choice of calibration method depends on the model, the data distribution, and
the hardware constraints. Generally:
➢ Min-Max is suitable for models with well-behaved distributions or where
simplicity is needed.
➢ Percentile-based is good for handling outliers.
➢ KL-Divergence is favored for complex distributions, especially in NLP and
certain vision tasks where precision is critical.
➢ EMA/Moving Average is beneficial when data varies over time.
➢ Entropy-Based can be used when the model has high-variance activation
patterns but isn’t as commonly implemented due to its complexity.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
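To make the first two options concrete, here is a small sketch contrasting min-max and percentile calibration of an activation range; the 99.9th percentile is an arbitrary illustrative choice.

```python
import numpy as np

def calibrate_minmax(activations):
    # Min-Max calibration: take the observed extremes directly.
    return float(activations.min()), float(activations.max())

def calibrate_percentile(activations, pct=99.9):
    # Percentile calibration: ignore extreme outliers so they do not inflate the range.
    return float(np.percentile(activations, 100 - pct)), float(np.percentile(activations, pct))

acts = np.random.randn(10_000).astype(np.float32)
acts[:5] = 40.0                       # a few outliers
print(calibrate_minmax(acts))         # range dominated by the outliers
print(calibrate_percentile(acts))     # tighter range that clips them
# Either (min, max) pair then feeds the scale/zero-point formulas from earlier.
```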
Quantization Methods
➢ Broadly, there are two methods for calibrating the quantization of the weights and activations:

➢ Post-Training Quantization (PTQ): quantization after training

➢ Quantization Aware Training (QAT): quantization during training/fine-tuning

➢ Since there are significantly fewer biases (millions) than weights (billions), the biases are often kept in higher precision (such as INT16), and the main effort of quantization is put towards the weights.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ One of the most popular quantization techniques is post-training quantization (PTQ).

➢ It involves quantizing a model’s parameters (both weights and activations) after training the model.

➢ Quantization of the weights is performed using either symmetric or asymmetric quantization.

➢ Quantization of the activations, however, requires inference with the model to collect their potential distribution, since we do not know their range in advance.

➢ There are two forms of quantization of the activations:

➢ Dynamic Quantization
➢ Static Quantization

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ Dynamic Quantization
➢ After data passes a hidden layer, its activations are collected:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ Dynamic Quantization
➢ This distribution of activations is then used to calculate the zero-point (z) and scale factor (s) values needed to quantize the output:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ Dynamic Quantization
➢ The process is repeated each time data passes through a new layer, so each layer has its own separate z and s values and therefore its own quantization scheme.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
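In practice this is usually handled by the framework; for example, a sketch using PyTorch’s dynamic quantization API (assuming a recent PyTorch version; the toy model is a placeholder, and only nn.Linear layers are converted here).

```python
import torch
import torch.nn as nn

# Toy FP32 model standing in for the linear layers of an LLM.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Dynamic quantization: weights are converted to INT8 up front, while the
# activation scale and zero-point are computed on the fly per layer at runtime.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 512))
```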
Post-Training Quantization (PTQ)
➢ Static Quantization
➢ In contrast to dynamic quantization, static quantization does not calculate the zero-point (z) and scale factor (s) during inference but beforehand.
➢ To find those values, a calibration dataset is passed through the model to collect the potential distributions of the activations.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ Static Quantization
➢ After these values have been collected, we can calculate the necessary s and z
values to perform quantization during inference.

➢ When you are performing actual inference, the s and z values are not
recalculated but are used globally over all activations to quantize them.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
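A conceptual sketch of that calibration pass, following the asymmetric formulas from earlier; model_layer is a stand-in for any callable layer, not a real API.

```python
def calibrate_static(model_layer, calibration_batches, bits=8):
    # Record activation extremes over the whole calibration dataset.
    mins, maxs = [], []
    for batch in calibration_batches:
        acts = model_layer(batch)
        mins.append(float(acts.min()))
        maxs.append(float(acts.max()))
    beta, alpha = min(mins), max(maxs)
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (qmax - qmin) / (alpha - beta)          # fixed once
    zero_point = int(round(-scale * beta)) + qmin   # fixed once
    return scale, zero_point                        # reused unchanged at inference time
```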
Post-Training Quantization (PTQ)
➢ Static vs Dynamic Quantization
➢ In general, dynamic quantization tends to be a bit more accurate since it only attempts to calculate the s and z values per hidden layer. However, it might increase compute time as these values need to be calculated at inference time.

➢ In contrast, static quantization is less accurate but faster, as it already knows the s and z values used for quantization.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ The Realm of 4-bit Quantization
➢ Going below 8-bit quantization has proved to be a difficult task, as the quantization error increases with every bit that is removed.

➢ Fortunately, there are several smart ways to reduce the bits to 6, 4, and even 2 bits (although going lower than 4 bits with these methods is typically not advised).

➢ We will explore two methods that are commonly shared on HuggingFace:

➢ GPTQ (post-training quantization of GPT-style models; runs the full model on the GPU)
➢ GGUF (a file format that allows layers to be offloaded to the CPU). GGUF is the successor to the GGML format; GGML (by Georgi Gerganov) is a machine learning library designed for lightweight inference, especially on CPUs, and is highly optimized for running large language models with lower memory and computational requirements.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ The Realm of 4-bit Quantization
➢ GPTQ
➢ GPTQ is arguably one of the most well-known methods used in practice for quantization to 4 bits.
➢ It uses asymmetric quantization and does so layer by layer such that each
layer is processed independently before continuing to the next:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ The Realm of 4-bit Quantization
➢ GGUF
➢ While GPTQ is a great quantization method to run your full LLM on a GPU, you might not always have that capacity. Instead, we can use GGUF to offload any layer of the LLM to the CPU.

➢ This allows you to use both the CPU and GPU when you do not have enough VRAM.

➢ The GGUF quantization schemes are updated frequently and might depend on the level of bit quantization.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
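As an illustration, a hedged sketch of CPU/GPU offloading with the llama-cpp-python bindings; the model path, layer count, and context size are placeholders, and the exact options may differ between versions.

```python
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers are offloaded to the GPU;
# the remaining layers run on the CPU, so a model larger than your VRAM can still run.
llm = Llama(
    model_path="models/llama-7b.Q4_K_M.gguf",  # placeholder path to a 4-bit GGUF file
    n_gpu_layers=20,                            # offload 20 layers to the GPU, rest on CPU
    n_ctx=2048,
)

result = llm("Explain quantization in one sentence:", max_tokens=64)
print(result["choices"][0]["text"])
```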
Quantization Aware Training (QAT)
➢ Earlier, we saw how we could quantize a model after training. A downside to this
approach is that this quantization does not consider the actual training process.

➢ This is where Quantization Aware Training (QAT) comes in. Instead of quantizing a
model after it was trained with post-training quantization (PTQ), QAT aims to learn
the quantization procedure during training.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Quantization Aware Training (QAT)
➢ QAT tends to be more accurate than PTQ since the quantization was already
considered during training. It works as follows:

➢ During training, so-called “fake” quants are introduced. This is the process of first
quantizing the weights to, for example, INT4 and then dequantizing back to FP32:

➢ This process allows the model to consider the quantization process during training,
the calculation of loss, and weight updates.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
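A minimal sketch of such a “fake” quant, assuming symmetric absmax quantization to INT4 and a straight-through estimator for the backward pass; the names are mine, not the slide’s.

```python
import torch

def fake_quant(w, bits=4):
    # Quantize to a signed integer grid, then immediately dequantize back to FP32.
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_dq = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass uses the quantized weights, while
    # gradients flow to the underlying FP32 weights as if rounding were the identity.
    return w + (w_dq - w).detach()

# Inside a QAT training step, the loss is computed with the fake-quantized weights:
w = torch.randn(64, 64, requires_grad=True)
loss = (fake_quant(w) ** 2).mean()
loss.backward()                     # gradients reach the FP32 copy of the weights
```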
Quantization Aware Training (QAT)
➢ QAT attempts to explore the loss landscape for “wide” minima to minimize the quantization errors, as “narrow” minima tend to result in larger quantization errors.

➢ For example, imagine we did not consider quantization during the backward pass. We would choose the weight with the smallest loss according to gradient descent. However, that would introduce a larger quantization error if it sits in a “narrow” minimum.
➢ In contrast, if we consider quantization, a different updated weight will be selected in a “wide” minimum with a much lower quantization error.
➢ As such, although PTQ has a lower loss in high precision (e.g., FP32), QAT results in a lower loss in lower precision (e.g., INT4), which is what we aim for.
Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
The Era of 1-bit LLMs: BitNet
➢ Going to 4 bits, as we saw before, is already quite small, but what if we were to reduce it even further?

➢ This is where BitNet comes in: it represents each weight of the model with a single bit, as either -1 or 1.

➢ It does so by injecting the quantization process directly into the Transformer architecture.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ Remember that the Transformer architecture is used as the foundation of most
LLMs and is composed of computations that involve linear layers:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ These linear layers are generally represented with higher precision, like FP16, and
are where most of the weights reside.

➢ BitNet replaces these linear layers with something they call the BitLinear:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ A BitLinear layer works the same as a regular linear layer and calculates the
output/activation based on the weights multiplied by the activation.

➢ However, it represents the weights of a model using 1-bit and activations using
INT8:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ A BitLinear layer, like Quantization-Aware Training (QAT), performs a form of “fake” quantization during training to analyze the effect of quantizing the weights and activations:

- Gamma (γ) is the absolute maximum (absmax) of the activations
- Beta (β) is the average of the absolute weights

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ Weight Quantization
➢ While training, the weights are stored in INT8 and then quantized to 1-bit using a basic strategy called the signum (sign) function.
➢ In essence, it shifts the distribution of weights to be centered around 0 and then assigns everything at or to the left of 0 the value -1 and everything to the right of 0 the value 1:

(Figure: the four quantization steps; notice the change in Step 3.)

➢ Additionally, it tracks a value β (the average absolute value of the weights) that we will use later on for dequantization.
Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
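A sketch of this binarization under the assumptions above: center by the mean, map everything at or below zero to -1 and everything above to +1, and keep β for later rescaling.

```python
import torch

def bitnet_weight_quantize(w):
    alpha = w.mean()                          # center the weight distribution around 0
    w_bin = torch.where(w - alpha > 0,
                        torch.ones_like(w),   # right of 0  -> +1
                        -torch.ones_like(w))  # at/left of 0 -> -1
    beta = w.abs().mean()                     # average absolute weight, used for dequantization
    return w_bin, beta
```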
The Era of 1-bit LLMs: BitNet
➢ Activation Quantization
➢ To quantize the activations, BitLinear makes use of absmax quantization to convert the
activations from FP16 to INT8 as they need to be in higher precision for the matrix
multiplication (×).

Step 1 (quantization): x_q = Quant(x) = Clip(x · Qb / γ, -Qb + ε, Qb - ε), where γ is the absmax of the activations and Qb = 2^(b-1) is the scale factor.

Step 2 (dequantization): y = y_q · (β · γ) / Qb, where y_q = W_q · x_q is the INT8 output, W_q are the quantized (1-bit) weights, x_q are the quantized activations, x is the activation before quantization, and y is the dequantized activation.

Here ε is a small floating-point number that prevents overflow when performing the clipping. To preserve the variance of the output after quantization, they introduced a LayerNorm before the activation quantization.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
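A sketch of both steps under these assumptions (ε, the bit width, and the tensor shapes are illustrative):

```python
import torch

def bitnet_activation_quantize(x, bits=8, eps=1e-5):
    # Step 1: absmax-quantize the activations into the INT8 range [-Qb, Qb].
    Qb = 2 ** (bits - 1)                 # scale factor, 128 for 8 bits
    gamma = x.abs().max()                # absmax of the activations
    x_q = torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)
    return x_q, gamma

def bitnet_dequantize(y_q, beta, gamma, bits=8):
    # Step 2: rescale the output of (binary weights x quantized activations) back to float.
    Qb = 2 ** (bits - 1)
    return y_q * (beta * gamma) / Qb
```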
The Era of 1-bit LLMs: BitNet
➢ This procedure is relatively straightforward and allows models to be represented with only two
values, either -1 or 1. (They also implemented Model parallelism with Group Quantization and
Normalization)

➢ Using this procedure, the authors observed that as the model size grows, the performance gap between a 1-bit model and an FP16-trained model becomes smaller.

➢ However, this only holds for larger models (>30B parameters); for smaller models the gap is still quite large.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
BitNet 1.58 Bits
➢ In this new method, every single weight of the model is not just -1 or 1, but can now
also take 0 as a value, making it ternary. Interestingly, adding just the 0 greatly
improves upon BitNet and allows for much faster computation.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
BitNet 1.58 Bits
➢ The Power of 0
➢ So why is adding 0 such a major improvement?

➢ It has everything to do with matrix multiplication!

➢ First, let’s explore how matrix multiplication in general works. When calculating the output, we
multiply a weight matrix by an input vector. Below, the first multiplication of the first layer of a
weight matrix is visualized:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
BitNet 1.58 Bits
➢ The Power of 0
➢ Note that this multiplication involves two actions, multiplying individual weights with the input
and then adding them all together.

➢ BitNet 1.58b, in contrast, manages to forego the act of multiplication since ternary weights
essentially tell you the following:

➢ 1: I want to add this value

➢ 0: I do not want this value

➢ -1: I want to subtract this value

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
BitNet 1.58 Bits
➢ The Power of 0
➢ As a result, you only need to perform addition if your weights are quantized to 1.58 bits (log2 3):

➢ Not only can this speed up computation significantly, but it also allows for feature filtering.

➢ By setting a given weight to 0, you can now ignore it instead of either adding or subtracting the
weights as is the case with 1-bit representations.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
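A toy NumPy sketch of why this helps: with ternary weights, the dot product reduces to selective addition and subtraction (illustrative only, not the paper’s kernel).

```python
import numpy as np

def ternary_matvec(w_ternary, x):
    # w_ternary contains only {-1, 0, +1}: +1 adds the input, -1 subtracts it, 0 skips it.
    out = np.empty(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # no multiplications needed
    return out

w = np.array([[1, 0, -1], [0, 1, 1]], dtype=np.int8)
x = np.array([0.5, -2.0, 3.0], dtype=np.float32)
print(ternary_matvec(w, x))          # matches w.astype(np.float32) @ x
```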
BitNet 1.58 Bits
➢ Weight Quantization
➢ To perform weight quantization, BitNet 1.58b uses absmean quantization, which is a variation of the absmax quantization we saw before.

➢ It first scales the weight matrix by its average absolute value and then rounds each value to the nearest integer among {-1, 0, +1}.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
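A sketch of absmean quantization as described above (ε is added to avoid division by zero; the names are mine).

```python
import torch

def absmean_quantize(w, eps=1e-8):
    gamma = w.abs().mean()                                            # average absolute value
    w_ternary = torch.clamp(torch.round(w / (gamma + eps)), -1, 1)    # nearest of {-1, 0, +1}
    return w_ternary, gamma                                           # gamma rescales the output later
```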
BitNet 1.58 Bits
➢ Activation Quantization
➢ The quantization function for activations follows the same implementation in BitNet, except
that we do not scale the activations before the non-linear functions to the range [0, Qb].

➢ Instead, the activations are all scaled to [−Qb, Qb] per token to get rid of the zero-point
quantization (because of symmetry).

➢ This is more convenient and simpler for both implementation and system-level optimization, while introducing negligible effects on performance in their experiments.

➢ And that’s it! 1.58-bit quantization required (mostly) two tricks:

➢ Adding 0 to create ternary representations [-1, 0, 1]

➢ absmean quantization for weights

➢ “A 13B BitNet b1.58 is more efficient, in terms of latency, memory usage, and energy consumption, than a 3B FP16 LLM.”

➢ As a result, we get lightweight models with only 1.58 computationally efficient bits per weight!

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
Disclaimer
➢ The content of this presentation is not original, and it has been
prepared from various sources for teaching purposes.
