
Model Quantization

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Why Model Quantization?

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Representing Numeric Values

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Representing Numeric Values

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Representing Numeric Values

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Common Data Types

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Common Data Types

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Common Data Types

Note: the bit pattern shown is just an illustrative example, not the actual representation of the value 3

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Symmetric Quantization
➢ In symmetric quantization, the range of the original floating-point values is mapped
to a symmetric range around zero in the quantized space.

➢ In the previous examples, notice how the ranges before and after quantization
remain centered around zero.

➢ This means that the quantized value for zero in the floating-point space is exactly
zero in the quantized space.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Symmetric Quantization
➢ A common form of symmetric quantization is absolute maximum (absmax) quantization.
➢ Given a list of values, we take the highest absolute value (α) as the range to perform
the linear mapping.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Symmetric Quantization

Scale factor: s = (2^(b-1) - 1) / α, where b is the number of bits and α is the highest absolute value
Quantization: x_quantized = round(s · x)
Dequantization: x_dequantized = x_quantized / s

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
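As a rough illustration of these formulas, here is a minimal NumPy sketch of absmax quantization and dequantization; the function names and example values are mine, not from the slides.

```python
import numpy as np

def absmax_quantize(x, bits=8):
    # Symmetric (absmax) quantization: map [-alpha, alpha] onto the signed integer grid.
    alpha = np.max(np.abs(x))                 # highest absolute value
    scale = (2 ** (bits - 1) - 1) / alpha     # s = (2^(b-1) - 1) / alpha, i.e. 127 / alpha for INT8
    x_q = np.round(scale * x).astype(np.int8)
    return x_q, scale

def absmax_dequantize(x_q, scale):
    return x_q.astype(np.float32) / scale     # x ~ x_q / s

# Round-trip example: the dequantized values are close to the originals,
# off only by the quantization error.
w = np.array([1.6, -0.3, 0.0, 2.9, -3.2], dtype=np.float32)
w_q, s = absmax_quantize(w)
print(w_q, absmax_dequantize(w_q, s))
```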
Symmetric Quantization

Generally, the lower the number of bits, the more quantization error we tend to have.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Asymmetric Quantization

Notice how the 0 has shifted position? That’s why it’s called asymmetric quantization: the min/max values have different distances to 0 in the range [-7.59, 10.8].

Due to this shifted position, we have to calculate a zero-point for the INT8 range to perform the linear mapping. As before, we also have to calculate a scale factor (s), but using the width of INT8’s range [-128, 127] instead.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
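A hedged sketch of that calculation in NumPy, assuming INT8 and the notation from the previous slides (β = minimum, α = maximum); the function name is mine.

```python
import numpy as np

def asymmetric_quantize(x, bits=8):
    beta, alpha = float(x.min()), float(x.max())            # float range, e.g. [-7.59, 10.8]
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1     # INT8 range [-128, 127]
    scale = (qmax - qmin) / (alpha - beta)                   # s uses the width of the INT8 range
    zero_point = int(round(-scale * beta)) + qmin            # z maps beta onto qmin (-128)
    x_q = np.clip(np.round(scale * x) + zero_point, qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point
```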
Asymmetric Quantization

Scale factor: s = (127 - (-128)) / (α - β), where β and α are the minimum and maximum of the floating-point range
Zero-point: z = round(-s · β) - 128
Quantization: x_quantized = round(s · x + z)

To dequantize the quantized values from INT8 back to FP32, we will need to use the previously calculated scale factor (s) and zero-point (z). Other than that, dequantization is straightforward:

Dequantization: x_dequantized = (x_quantized - z) / s

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
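And the corresponding dequantization, continuing the sketch above (it reuses asymmetric_quantize from the previous block):

```python
import numpy as np

def asymmetric_dequantize(x_q, scale, zero_point):
    # Undo the linear mapping: x ~ (x_q - z) / s
    return (x_q.astype(np.float32) - zero_point) / scale

x = np.array([-7.59, 0.0, 3.2, 10.8], dtype=np.float32)
x_q, s, z = asymmetric_quantize(x)
print(asymmetric_dequantize(x_q, s, z))   # close to the original values
```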
Symmetric vs Asymmetric Quantization

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Outliers – Range Mapping & Clipping

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Outliers – Range Mapping & Clipping

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Calibration
Calibration involves determining the scale and zero-point parameters, essential for
mapping the floating-point values to the integer range.

These parameters are directly derived from the minimum and maximum values of the
activation ranges.

Thus, calibration often includes finding the optimal min and max values as well, as they
are used to calculate the scale and zero-point.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Calibration
➢ The choice of calibration method depends on the model, the data distribution, and
the hardware constraints. Generally:
➢ Min-Max is suitable for models with well-behaved distributions or where
simplicity is needed.
➢ Percentile-based is good for handling outliers.
➢ KL-Divergence is favored for complex distributions, especially in NLP and
certain vision tasks where precision is critical.
➢ EMA/Moving Average is beneficial when data varies over time.
➢ Entropy-Based can be used when the model has high-variance activation
patterns but isn’t as commonly implemented due to its complexity.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
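To make the first two options concrete, here is a small sketch contrasting min-max and percentile calibration of an activation range; the 99.9th percentile is an arbitrary illustrative choice.

```python
import numpy as np

def calibrate_minmax(activations):
    # Min-Max calibration: take the observed extremes directly.
    return float(activations.min()), float(activations.max())

def calibrate_percentile(activations, pct=99.9):
    # Percentile calibration: ignore extreme outliers so they do not inflate the range.
    return float(np.percentile(activations, 100 - pct)), float(np.percentile(activations, pct))

acts = np.random.randn(10_000).astype(np.float32)
acts[:5] = 40.0                       # a few outliers
print(calibrate_minmax(acts))         # range dominated by the outliers
print(calibrate_percentile(acts))     # tighter range that clips them
# Either (min, max) pair then feeds the scale/zero-point formulas from earlier.
```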
Quantization Methods
➢ Broadly, there are two methods for calibrating the quantization of the weights and activations:

➢ Post-Training Quantization (PTQ): quantization after training

➢ Quantization Aware Training (QAT): quantization during training/fine-tuning

➢ Since there are significantly fewer biases (millions) than weights (billions), the biases are often kept in higher precision (such as INT16), and the main effort of quantization is put towards the weights.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ One of the most popular quantization techniques is post-training quantization (PTQ).

➢ It involves quantizing a model’s parameters (both weights and activations) after training the model.

➢ Quantization of the weights is performed using either symmetric or asymmetric quantization.

➢ Quantization of the activations, however, requires inference with the model to collect their potential distribution, since we do not know their range in advance.

➢ There are two forms of quantization of the activations:

➢ Dynamic Quantization
➢ Static Quantization

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ Dynamic Quantization
➢ After data passes a hidden layer, its activations are collected:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ Dynamic Quantization
➢ This distribution of activations is then used to calculate the zero-point (z) and scale factor (s) values needed to quantize the output:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ Dynamic Quantization
➢ The process is repeated each time data passes through a new layer, so each layer has its own separate z and s values and therefore its own quantization scheme.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
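In practice this is usually handled by the framework; for example, a sketch using PyTorch’s dynamic quantization API (assuming a recent PyTorch version; the toy model is a placeholder, and only nn.Linear layers are converted here).

```python
import torch
import torch.nn as nn

# Toy FP32 model standing in for the linear layers of an LLM.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Dynamic quantization: weights are converted to INT8 up front, while the
# activation scale and zero-point are computed on the fly per layer at runtime.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 512))
```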
Post-Training Quantization (PTQ)
➢ Static Quantization
➢ In contrast to dynamic quantization, static quantization does not calculate the zero-point (z) and scale factor (s) during inference but beforehand.
➢ To find those values, a calibration dataset is passed through the model to collect the potential distributions of the activations.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ Static Quantization
➢ After these values have been collected, we can calculate the necessary s and z
values to perform quantization during inference.

➢ When you are performing actual inference, the s and z values are not
recalculated but are used globally over all activations to quantize them.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
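A conceptual sketch of that calibration pass, following the asymmetric formulas from earlier; model_layer is a stand-in for any callable layer, not a real API.

```python
def calibrate_static(model_layer, calibration_batches, bits=8):
    # Record activation extremes over the whole calibration dataset.
    mins, maxs = [], []
    for batch in calibration_batches:
        acts = model_layer(batch)
        mins.append(float(acts.min()))
        maxs.append(float(acts.max()))
    beta, alpha = min(mins), max(maxs)
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (qmax - qmin) / (alpha - beta)          # fixed once
    zero_point = int(round(-scale * beta)) + qmin   # fixed once
    return scale, zero_point                        # reused unchanged at inference time
```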
Post-Training Quantization (PTQ)
➢ Static vs Dynamic Quantization
➢ In general, dynamic quantization tends to be a bit more accurate since it only attempts to calculate the s and z values per hidden layer. However, it might increase compute time as these values need to be calculated at inference time.

➢ In contrast, static quantization is less accurate but faster, as it already knows the s and z values used for quantization.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ The Realm of 4-bit Quantization
➢ Going below 8-bit quantization has proved to be a difficult task, as the quantization error increases with every bit that is removed.

➢ Fortunately, there are several smart ways to reduce the bits to 6, 4, and even 2 bits (although going lower than 4 bits with these methods is typically not advised).

➢ We will explore two methods that are commonly shared on HuggingFace:

➢ GPTQ (post-training quantization of GPT-style models; runs the full model on the GPU)
➢ GGUF (a file format that allows layers to be offloaded to the CPU). GGUF is the successor to the GGML format; GGML (by Georgi Gerganov) is a machine learning library designed for lightweight inference, especially on CPUs, and is highly optimized for running large language models with lower memory and computational requirements.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ The Realm of 4-bit Quantization
➢ GPTQ
➢ GPTQ is arguably one of the most well-known methods used in practice for quantization to 4 bits.
➢ It uses asymmetric quantization and does so layer by layer such that each
layer is processed independently before continuing to the next:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Post-Training Quantization (PTQ)
➢ The Realm of 4-bit Quantization
➢ GGUF
➢ While GPTQ is a great quantization method to run your full LLM on a GPU, you might not always have that capacity. Instead, we can use GGUF to offload any layer of the LLM to the CPU.

➢ This allows you to use both the CPU and GPU when you do not have enough VRAM.

➢ The GGUF quantization schemes are updated frequently and might depend on the level of bit quantization.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
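As an illustration, a hedged sketch of CPU/GPU offloading with the llama-cpp-python bindings; the model path, layer count, and context size are placeholders, and the exact options may differ between versions.

```python
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers are offloaded to the GPU;
# the remaining layers run on the CPU, so a model larger than your VRAM can still run.
llm = Llama(
    model_path="models/llama-7b.Q4_K_M.gguf",  # placeholder path to a 4-bit GGUF file
    n_gpu_layers=20,                            # offload 20 layers to the GPU, rest on CPU
    n_ctx=2048,
)

result = llm("Explain quantization in one sentence:", max_tokens=64)
print(result["choices"][0]["text"])
```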
Quantization Aware Training (QAT)
➢ Earlier, we saw how we could quantize a model after training. A downside to this
approach is that this quantization does not consider the actual training process.

➢ This is where Quantization Aware Training (QAT) comes in. Instead of quantizing a
model after it was trained with post-training quantization (PTQ), QAT aims to learn
the quantization procedure during training.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Quantization Aware Training (QAT)
➢ QAT tends to be more accurate than PTQ since the quantization was already
considered during training. It works as follows:

➢ During training, so-called “fake” quants are introduced. This is the process of first
quantizing the weights to, for example, INT4 and then dequantizing back to FP32:

➢ This process allows the model to consider the quantization process during training,
the calculation of loss, and weight updates.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
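A minimal sketch of such a “fake” quant, assuming symmetric absmax quantization to INT4 and a straight-through estimator for the backward pass; the names are mine, not the slide’s.

```python
import torch

def fake_quant(w, bits=4):
    # Quantize to a signed integer grid, then immediately dequantize back to FP32.
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_dq = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass uses the quantized weights, while
    # gradients flow to the underlying FP32 weights as if rounding were the identity.
    return w + (w_dq - w).detach()

# Inside a QAT training step, the loss is computed with the fake-quantized weights:
w = torch.randn(64, 64, requires_grad=True)
loss = (fake_quant(w) ** 2).mean()
loss.backward()                     # gradients reach the FP32 copy of the weights
```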
Quantization Aware Training (QAT)
➢ QAT attempts to explore the loss landscape for “wide” minima to minimize the quantization errors, as “narrow” minima tend to result in larger quantization errors.

➢ For example, imagine we did not consider quantization during the backward pass. We would choose the weight with the smallest loss according to gradient descent. However, that would introduce a larger quantization error if it sits in a “narrow” minimum.
➢ In contrast, if we consider quantization, a different updated weight will be selected in a “wide” minimum with a much lower quantization error.
➢ As such, although PTQ has a lower loss in high precision (e.g., FP32), QAT results in a lower loss in lower precision (e.g., INT4), which is what we aim for.
Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
The Era of 1-bit LLMs: BitNet
➢ Going to 4 bits, as we saw before, is already quite small, but what if we were to reduce it even further?

➢ This is where BitNet comes in: it represents each weight of the model with a single bit, as either -1 or 1.

➢ It does so by injecting the quantization process directly into the Transformer architecture.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ Remember that the Transformer architecture is used as the foundation of most
LLMs and is composed of computations that involve linear layers:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ These linear layers are generally represented with higher precision, like FP16, and
are where most of the weights reside.

➢ BitNet replaces these linear layers with something they call the BitLinear:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ A BitLinear layer works the same as a regular linear layer and calculates the
output/activation based on the weights multiplied by the activation.

➢ However, it represents the weights of a model using 1-bit and activations using
INT8:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ A BitLinear layer, like Quantization-Aware Training (QAT), performs a form of “fake” quantization during training to analyze the effect of quantizing the weights and activations:

- Gamma (γ) is the absolute maximum (absmax) of the activations
- Beta (β) is the average of the absolute weights

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
The Era of 1-bit LLMs: BitNet
➢ Weight Quantization
➢ While training, the weights are stored in INT8 and then quantized to 1-bit using a basic strategy called the signum (sign) function.
➢ In essence, it shifts the distribution of weights to be centered around 0 and then assigns everything at or to the left of 0 the value -1 and everything to the right of 0 the value 1:

(Figure: the four quantization steps; notice the change in Step 3.)

➢ Additionally, it tracks a value β (the average absolute value of the weights) that we will use later on for dequantization.
Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
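A sketch of this binarization under the assumptions above: center by the mean, map everything at or below zero to -1 and everything above to +1, and keep β for later rescaling.

```python
import torch

def bitnet_weight_quantize(w):
    alpha = w.mean()                          # center the weight distribution around 0
    w_bin = torch.where(w - alpha > 0,
                        torch.ones_like(w),   # right of 0  -> +1
                        -torch.ones_like(w))  # at/left of 0 -> -1
    beta = w.abs().mean()                     # average absolute weight, used for dequantization
    return w_bin, beta
```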
The Era of 1-bit LLMs: BitNet
➢ Activation Quantization
➢ To quantize the activations, BitLinear makes use of absmax quantization to convert the
activations from FP16 to INT8 as they need to be in higher precision for the matrix
multiplication (×).

Step 1 (quantization): x_q = Quant(x) = Clip(x · Qb / γ, -Qb + ε, Qb - ε), where γ is the absmax of the activations and Qb = 2^(b-1) is the scale factor.

Step 2 (dequantization): y = y_q · (β · γ) / Qb, where y_q = W_q · x_q is the INT8 output, W_q are the quantized (1-bit) weights, x_q are the quantized activations, x is the activation before quantization, and y is the dequantized activation.

Here ε is a small floating-point number that prevents overflow when performing the clipping. To preserve the variance of the output after quantization, they introduced a LayerNorm before the activation quantization.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
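A sketch of both steps under these assumptions (ε, the bit width, and the tensor shapes are illustrative):

```python
import torch

def bitnet_activation_quantize(x, bits=8, eps=1e-5):
    # Step 1: absmax-quantize the activations into the INT8 range [-Qb, Qb].
    Qb = 2 ** (bits - 1)                 # scale factor, 128 for 8 bits
    gamma = x.abs().max()                # absmax of the activations
    x_q = torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)
    return x_q, gamma

def bitnet_dequantize(y_q, beta, gamma, bits=8):
    # Step 2: rescale the output of (binary weights x quantized activations) back to float.
    Qb = 2 ** (bits - 1)
    return y_q * (beta * gamma) / Qb
```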
The Era of 1-bit LLMs: BitNet
➢ This procedure is relatively straightforward and allows models to be represented with only two
values, either -1 or 1. (They also implemented Model parallelism with Group Quantization and
Normalization)

➢ Using this procedure, the authors observed that as the model size grows, the performance gap between a 1-bit model and an FP16-trained model becomes smaller.

➢ However, this only holds for larger models (>30B parameters); for smaller models the gap is still quite large.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2310.11453
BitNet 1.58 Bits
➢ In this new method, every single weight of the model is not just -1 or 1, but can now
also take 0 as a value, making it ternary. Interestingly, adding just the 0 greatly
improves upon BitNet and allows for much faster computation.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
BitNet 1.58 Bits
➢ The Power of 0
➢ So why is adding 0 such a major improvement?

➢ It has everything to do with matrix multiplication!

➢ First, let’s explore how matrix multiplication in general works. When calculating the output, we
multiply a weight matrix by an input vector. Below, the first multiplication of the first layer of a
weight matrix is visualized:

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
BitNet 1.58 Bits
➢ The Power of 0
➢ Note that this multiplication involves two actions, multiplying individual weights with the input
and then adding them all together.

➢ BitNet 1.58b, in contrast, manages to forego the act of multiplication since ternary weights
essentially tell you the following:

➢ 1: I want to add this value

➢ 0: I do not want this value

➢ -1: I want to subtract this value

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
BitNet 1.58 Bits
➢ The Power of 0
➢ As a result, you only need to perform addition if your weights are quantized to 1.58 bits (log2 3):

➢ Not only can this speed up computation significantly, but it also allows for feature filtering.

➢ By setting a given weight to 0, you can now ignore it instead of either adding or subtracting the
weights as is the case with 1-bit representations.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
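A toy NumPy sketch of why this helps: with ternary weights, the dot product reduces to selective addition and subtraction (illustrative only, not the paper’s kernel).

```python
import numpy as np

def ternary_matvec(w_ternary, x):
    # w_ternary contains only {-1, 0, +1}: +1 adds the input, -1 subtracts it, 0 skips it.
    out = np.empty(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # no multiplications needed
    return out

w = np.array([[1, 0, -1], [0, 1, 1]], dtype=np.int8)
x = np.array([0.5, -2.0, 3.0], dtype=np.float32)
print(ternary_matvec(w, x))          # matches w.astype(np.float32) @ x
```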
BitNet 1.58 Bits
➢ Weight Quantization
➢ To perform weight quantization, BitNet 1.58b uses absmean quantization, which is a variation of the absmax quantization we saw before.

➢ It first scales the weight matrix by its average absolute value and then rounds each value to the nearest integer among {-1, 0, +1}.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
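A sketch of absmean quantization as described above (ε is added to avoid division by zero; the names are mine).

```python
import torch

def absmean_quantize(w, eps=1e-8):
    gamma = w.abs().mean()                                            # average absolute value
    w_ternary = torch.clamp(torch.round(w / (gamma + eps)), -1, 1)    # nearest of {-1, 0, +1}
    return w_ternary, gamma                                           # gamma rescales the output later
```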
BitNet 1.58 Bits
➢ Activation Quantization
➢ The quantization function for activations follows the same implementation in BitNet, except
that we do not scale the activations before the non-linear functions to the range [0, Qb].

➢ Instead, the activations are all scaled to [−Qb, Qb] per token to get rid of the zero-point
quantization (because of symmetry).

➢ This is more convenient and simpler for both implementation and system-level optimization, while introducing negligible effects on performance in their experiments.

➢ And that’s it! 1.58-bit quantization required (mostly) two tricks:

➢ Adding 0 to create ternary representations [-1, 0, 1]

➢ absmean quantization for weights

➢ “A 13B BitNet b1.58 is more efficient, in terms of latency, memory usage, and energy consumption, than a 3B FP16 LLM.”

➢ As a result, we get lightweight models with only 1.58 computationally efficient bits per weight!

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
https://arxiv.org/pdf/2402.17764
Disclaimer
➢ The content of this presentation is not original, and it has been
prepared from various sources for teaching purposes.
