tization algorithm into a heuristic algorithm to solve the optimization problem. We also analyze and propose a brief and effective codebook initialization algorithm to reduce the extra overhead of centroid training and updates. Experiments show that VPTQ only requires 10.4-18.6% of the quantization algorithm execution time compared to existing SOTA results.

Figure 1: Vector Quantization in Weight Quantization (❶ reshape & group, ❷ clustering, ❸ VPTQ).
3. VPTQ has low dequantization overhead. The VPTQ algorithm quantizes all the weights of every Linear Operator in the model into an index matrix and codebooks. During model inference, we only need to dequantize the weight matrix by reading centroids from the codebook according to the index before executing the operator. The models quantized by VPTQ achieve a 1.6-1.8× improvement in inference throughput compared to SOTA.

2 Background and Motivation

2.1 Post Training Quantization in LLM

Post-Training Quantization (PTQ) (LeCun et al., 1989; Hassibi et al., 1993; Hassibi and Stork, 1992; Frantar et al., 2023; Singh and Alistarh, 2020) aims to decrease model weight size by simplifying the numerical representation while seeking to maintain the model's accuracy without retraining. We can formulate PTQ as the following optimization problem:

arg min E[L(X, W + ∆W) − L(X, W)] ≈ ∆W^T · g(W) + (1/2) ∆W^T · H(W) · ∆W

where W is the original model weights, Ŵ is the quantized weights, and ∆W = Ŵ − W represents the weight quantization error. The loss of the model task is L. The optimization objective is to minimize the impact of model quantization on the model task, which means minimizing the expected deviation of the loss function.

PTQ typically employs a concise and accurate method for analyzing the above optimization problem: Second-Order Optimization. Following a Taylor series expansion, this method breaks the optimization goal down into first-order, second-order, and higher-order terms. g(W) and H(W) represent the gradient and Hessian of the task loss L, respectively. It is often assumed that the model has already reached a local optimum before quantization, which means the first-order term is nearly zero. Higher-order terms exert a minor effect on the optimization goal, and we typically disregard interactions among weights between different layers. Consequently, we can simplify the optimization problem by focusing on the second-order term and define the following optimization problem:

arg min_∆W ∆W^T · H(W) · ∆W,  s.t. ∆W = 0    (1)

The objective is to minimize the second-order error in model quantization, subject to the constraint that the change in model weights is as small as possible, i.e., ∆W = 0.

2.2 Vector Quantization in Neural Networks

VQ is a key method for efficient lossy data compression (Gersho, 1979). Its objective is to reduce distortion by mapping high-dimensional original data to a lower-dimensional space represented by a lookup table (Eq. 2). VQ maps original vectors (W′) from the vector space to a finite set of vectors, commonly referred to as a codebook (lookup table, C). Each vector in the original space is approximated by the closest vector (centroid C_i) in the codebook:

arg min_{i∈k} ∥v − C_i∥²,  ∀v ∈ W′    (2)

VQ assigns each input vector v the nearest centroid C_i in the lookup table, i.e., the optimization problem finds the index i that minimizes the Euclidean distance to v. Thus, each input vector is represented by its most similar centroid, minimizing the total distortion.
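To make Eq. 2 concrete, the assignment step can be written in a few lines of NumPy. This is a minimal sketch of plain nearest-centroid VQ, not VPTQ's implementation; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def vq_assign(vectors, codebook):
    """Nearest-centroid assignment (Eq. 2).

    vectors:  (num_vectors, v) reshaped weight vectors W'
    codebook: (k, v) centroids C
    Returns the closest-centroid index for each vector and the
    dequantized (reconstructed) vectors.
    """
    # Squared Euclidean distance between every vector and every centroid.
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)          # arg min_i ||v - C_i||^2
    return idx, codebook[idx]
```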
Recent research has explored the use of VQ for model weight quantization (Chen et al., 2020; Cho et al., 2022; Stock et al., 2020, 2021). These studies attempt to compress the embedding layer, the convolution layer, and the classification layer of neural networks using VQ. Figure 1 illustrates an example of applying VQ to compress a weight matrix. For a weight matrix W with dimensions M × N, we reshape W into vectors of length v as W′ (step ❶). The number of reshaped vectors is (M × N)/v. Next, we employ k-means or another clustering algorithm to build a codebook (step ❷). The constructed codebook contains k centroid vectors, each with v dimensions. Applying the VQ algorithm directly often does not yield acceptable accuracy; typically, PTQ algorithms adjust the index and centroids to enhance the accuracy of the quantized model (step ❸).

During model inference, each operator in the model first dequantizes the original weight matrix from the lookup table (codebook) using the index and centroids. Unlike scalar quantization, VQ keeps the index and centroids as the quantized weight. The equivalent compression ratio of VQ can be formulated as: total original model bits / (codebook bits + index bits). The equivalent quantization bitwidth is: original bitwidth / compression ratio. For example, for a 4096 × 4096 FP16 weight matrix with vectors of length v = 8 and 256 centroids, the compression ratio is (16 × 4096 × 4096) / (8 × 256 × 16 + log2(256) × 4096 × 4096 / 8) = 15.97, and the equivalent bitwidth is 1.002 bits.
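The bookkeeping in this running example can be reproduced with a few lines of Python; this is a sketch of the arithmetic only, not part of VPTQ's code.

```python
import math

# Running example: a 4096 x 4096 FP16 weight matrix, vector length v = 8,
# and a codebook with k = 256 centroids.
M, N, v, k, fp_bits = 4096, 4096, 8, 256, 16

original_bits = fp_bits * M * N               # 16 * 4096 * 4096
codebook_bits = v * k * fp_bits               # centroids stored in FP16
index_bits = math.log2(k) * (M * N / v)       # one log2(k)-bit index per vector

compression_ratio = original_bits / (codebook_bits + index_bits)
equivalent_bitwidth = fp_bits / compression_ratio
print(f"{compression_ratio:.2f}x, {equivalent_bitwidth:.3f} bits")  # ~15.97x, ~1.002 bits
```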
2.3 Vector Quantization in LLMs

While VQ has been applied to weight quantization, the following significant challenges persist when quantizing LLMs. We summarize the benefits and weaknesses of recent techniques (Egiazarian et al., 2024; Tseng et al., 2024; van Baalen et al., 2024) in Table 1.

The number of parameters in LLMs is enormous, which requires quantizing the model with lightweight methods to avoid excessive resource consumption. AQLM (Egiazarian et al., 2024) utilizes gradient descent to train each layer of the VQ-quantized model and simultaneously trains across multiple layers using calibration data. It achieves effective compression through additive quantization and joint optimization of the codebook, which can achieve high accuracy. However, because AQLM relies on backpropagation for model training, significant GPU hours and memory are required to achieve better accuracy, especially when dealing with LLMs with massive parameter counts.

GPTVQ (van Baalen et al., 2024) utilizes the Second-Order Optimization method to implement PTQ. However, GPTVQ accumulates quantization errors within vector quantization, leading to an inevitable increase in quantization error as the vector length increases. This prevents the use of longer vectors and consequently limits the compression ratio.

QuIP# (Tseng et al., 2024) introduces incoherence processing using the randomized Hadamard transform on the weight matrix before VQ. The processed weight matrix approximates a sub-Gaussian distribution, allowing compression with a tiny codebook. Although QuIP# can compress LLMs to extremely low bitwidths with a small accuracy drop, incoherence processing requires a significant amount of computation. It requires significantly more computation for inference compared to the original LLM, resulting in low inference throughput.
3 Vector Post-Training Quantization

3.1 VPTQ Algorithm

VPTQ leverages Second-Order Optimization and solves the optimization problem in Eq. 1 to achieve extreme low-bit quantization. Assume that a weight matrix is W ∈ R^{M×N} and that the Hessian matrix collected from the current layer is H ∈ R^{N×N}. We denote the q-th column of the weight matrix as W_{:,q}. The quantized column Ŵ_{:,q} can be represented as the transpose of concatenated centroid vectors:

Ŵ_{:,q} = (C_0, C_1, ..., C_{M/v})^T

When the weight matrix of the model is large, we can first split the weight matrix into multiple groups, each with its own independent codebook. This method allows us to flexibly divide the weight matrix into a number of submatrices (Ŵ_{:,q:q+(M/group num)}) equal to the group number. For clarity, we describe only one group in the following algorithm description.

Unlike GPTVQ, we quantize each column of the matrix independently, which we refer to as Channel-Independent Second-Order Optimization. It greatly simplifies the complexity of VQ in Second-Order Optimization. GPTVQ, on the other hand, quantizes v columns of the matrix (Ŵ_{M,v}) at once, leading to larger errors and more complex transformations for problem optimization.
We use the Lagrange Method to transform the optimization problem in Eq. 1 into an unconstrained optimization problem, with Lagrangian function L(∆W) and Lagrangian multiplier λ:

L(∆W) = ∆W^T H(W) ∆W + λ∆W

The dual function g(λ) can be represented as:

g(λ) = −H^{-1}_{qq} λλ^T − λ(Ŵ_{:,q} − W_{:,q})

Differentiating g(λ) with respect to λ and setting it to 0,

g′(λ) = −H^{-1}_{qq} λ^T − (Ŵ_{:,q} − W_{:,q}) = 0,

we find that the problem reaches an optimal solution when λ^T = −(Ŵ_{:,q} − W_{:,q}) / H^{-1}_{qq}.

By substituting λ^T into the optimization problem, we find that to minimize the error introduced by quantization, we need to minimize the impact on the Lagrangian function. Therefore, we can transform the quantization problem into minimizing:

∆L(∆Ŵ) = Σ ∥v − C∥² / (2 H^{-1}_{qq})

We find that when quantizing one column vector at a time, we only need to consider minimizing Σ ∥v − C∥², i.e., finding the closest centroid in Euclidean distance. This precisely aligns with the optimization objective of VQ. Moreover, since VPTQ quantizes the weight matrix column by column, H^{-1}_{qq} is constant while quantizing each column, so we do not need to consider the Hessian when finding the centroid.
After quantizing a column of the weight matrix, we need to propagate the current quantization error to the unquantized part through:

∆W = ((Ŵ_{:,q} − W_{:,q}) / H^{-1}_{qq}) · H_{q,:}

This transforms the current quantization error onto the following unquantized columns. Since GPTVQ quantizes v columns at the same time, its quantization error can only spread to other unquantized columns once all v columns have been quantized. This causes more error to accumulate during quantization, resulting in a decrease in model accuracy; Table 2 supports the same conclusion. Algorithm 1 provides a detailed description of the steps to solve the optimization problem and quantize the weights according to the above analysis.

Algorithm 1 VPTQ Algorithm
Input: W ∈ R^{M×N}    ▷ Input weight matrix
Input: H ∈ R^{N×N}    ▷ Hessian matrix
Output: Ŵ ∈ R^{M×N}   ▷ Quantized weight matrix
  E ← 0^{M×N}    ▷ Initialize quantization errors
  for s = 0, B, 2B, ... do    ▷ Column blocks
    for n = s, s+1, ..., s+B−1 do    ▷ Quantize a single column n, fundamentally different from AQLM
      for m = 0, V, 2V, ..., M do    ▷ Parallel (residual) vector quantization Q_V(v) of the vectors in column n
        Ŵ_{m:m+V,n} ← Q_V(W_{m:m+V,n})
      end for
      E_{:,n} ← (W_{:,n} − Ŵ_{:,n}) / H^{-1}_{n,n}    ▷ Update quantization error
      W_{:,n:s+B} ← W_{:,n:s+B} − E_{:,n} H^{-1}_{n,n:s+B}    ▷ Merge quantization error into the block's weights
    end for
    W_{:,s+B:} ← W_{:,s+B:} − E_{:,s:s+B} H^{-1}_{s:s+B,s+B:}    ▷ Update all remaining weights
  end for
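The column-wise loop of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration of channel-independent second-order quantization for one column block; the function name, the precomputed inverse Hessian `Hinv`, and the fixed `codebook` are illustrative assumptions, not the released implementation.

```python
import numpy as np

def vptq_quantize_block(W, Hinv, codebook, V, s, B):
    """Quantize columns s .. s+B-1 of W in place and return their quantized values.

    W:        (M, N) weight matrix, updated in place as errors are propagated
    Hinv:     (N, N) inverse Hessian proxy
    codebook: (k, V) centroids; V is the vector length (assumes M % V == 0)
    """
    M, _ = W.shape
    W_hat = np.empty((M, B))
    E = np.zeros((M, B))                                   # per-column scaled errors
    for j, n in enumerate(range(s, s + B)):
        # Vector-quantize column n: every length-V sub-vector picks its nearest centroid.
        vecs = W[:, n].reshape(-1, V)
        dist = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        W_hat[:, j] = codebook[dist.argmin(axis=1)].reshape(-1)
        # Channel-independent second-order update (cf. Algorithm 1): spread this
        # column's error onto the not-yet-quantized columns of the block.
        E[:, j] = (W[:, n] - W_hat[:, j]) / Hinv[n, n]
        W[:, n + 1:s + B] -= np.outer(E[:, j], Hinv[n, n + 1:s + B])
    # Propagate the accumulated block error to all remaining columns.
    W[:, s + B:] -= E @ Hinv[s:s + B, s + B:]
    return W_hat
```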
Distinguish VPTQ from GPTQ and GPTVQ: Compared with GPTQ, VPTQ employs vector representations in the quantization, choosing the vector closest to the original matrix to represent the original data. Because VQ can use a larger codebook to store the quantized data, it covers a wider range of numerical distributions than the scalar quantization of GPTQ, thereby achieving better accuracy. Table 2 reveals that VPTQ significantly outperforms GPTQ under extremely low-bit quantization.

Moreover, since GPTVQ quantizes multiple columns simultaneously, propagating quantization errors to unquantized columns is more challenging. Furthermore, the quantization errors in GPTVQ accumulate as the vector length increases, hindering GPTVQ from using longer vector lengths for weight compression (limited to only 1-4 bits), which significantly reduces the compression ratio of VQ. VPTQ, on the other hand, is capable of compressing weights using longer vectors (> 8 bits) and representing data with a larger codebook. Table 2 shows the better accuracy achieved by VPTQ over GPTVQ.

3.2 Optimization in VPTQ

3.2.1 Hessian-Weighted Centroid Initialization

The VPTQ algorithm requires the initialization of the centroids in the codebooks prior to quantization. Properly initialized centroids reduce quantization error and improve model accuracy. A straightforward method is to perform K-means clustering on the weight matrix to obtain centroids (Eq. 2). However, this does not consider the optimization objective in Eq. 1, leading to a significant accuracy drop (van Baalen et al., 2024; Egiazarian et al., 2024).
We can transform the optimization objective by leveraging the cyclic property of matrix traces and the Hadamard product. We refine the optimization objective as:

∆W^T ∆W ⊙ H = Σ_{i=0}^{n−1} h_{i,i} ∥∆W_{:,i}∥² + Σ_{i=0}^{n−1} Σ_{j=0, j≠i}^{n−1} h_{i,j} (∆W_{:,i} ∆W_{:,j})

Because the Hessian matrix is predominantly diagonal (Dong et al., 2020), this guides us to split the proxy error into two terms. The first term collects the dominant diagonal elements of the initial error matrix, which significantly impact the quantization error. The second term captures the interaction of a single weight's quantization with the others. Since the Hessian matrix is predominantly diagonal, we can prioritize optimizing the first term through centroid initialization. We can view the first term as a Weighted K-means Clustering problem (Cordeiro de Amorim and Mirkin, 2012; Kerdprasop et al., 2005; Liu et al., 2017). Since this problem is well studied, we can solve it directly to achieve efficient and accurate centroid initialization.
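As a sketch, this Hessian-weighted initialization can be expressed as a weighted k-means over the reshaped weight vectors, with each vector weighted by the Hessian diagonal of its source column. The use of scikit-learn here is an illustrative assumption rather than the paper's implementation, and the helper assumes M is divisible by v.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_centroids(W, h_diag, v, k):
    """Hessian-weighted centroid initialization (sketch of Sec. 3.2.1).

    W:      (M, N) weight matrix
    h_diag: (N,) diagonal of the Hessian, h_ii
    v:      vector length
    k:      number of centroids
    """
    M, N = W.shape
    # Reshape the weights into length-v vectors, column by column.
    vecs = W.T.reshape(-1, v)                       # (M*N/v, v)
    # Weight each vector by the Hessian diagonal of its source column,
    # so columns with larger h_ii pull the centroids harder.
    weights = np.repeat(h_diag, M // v)
    km = KMeans(n_clusters=k, n_init=10).fit(vecs, sample_weight=weights)
    return km.cluster_centers_                      # (k, v) codebook
```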
3.2.2 Residual Vector Quantization

We enable Residual Vector Quantization (RVQ) (Barnes et al., 1996; Wei et al., 2014) in VPTQ. RVQ improves vector quantization (VQ) by breaking the compression of a weight matrix down into two (or more) stages. Each stage further compresses the residual error v_res = v − Q(v) from the previous quantization stage:

Q(v_res) = arg min_i ∥(v − Q(v)) − C_i^{res}∥²

Unlike GPTVQ, VPTQ enables RVQ, which quantizes the VQ quantization error with a separate lookup table for better representation and quantization. By partitioning the encoding into multiple stages and reducing quantization error, RVQ not only achieves superior compression efficiency but also balances quantization error, lookup table size, and the memory requirements for indices. During the decoding phase, VPTQ simply reads the centroids from these multiple lookup tables and combines them to reconstruct the original weight matrix.
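A minimal two-stage RVQ sketch (illustrative only): the residual of the first-stage quantization is quantized again with a second, independent codebook, and decoding simply sums the two looked-up centroids.

```python
import numpy as np

def nearest(vecs, codebook):
    dist = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)

def rvq_encode(vecs, C1, C2):
    """Two-stage residual vector quantization."""
    idx1 = nearest(vecs, C1)                 # first stage: Q(v)
    residual = vecs - C1[idx1]               # v_res = v - Q(v)
    idx2 = nearest(residual, C2)             # second stage: Q(v_res)
    return idx1, idx2

def rvq_decode(idx1, idx2, C1, C2):
    # Dequantization is just two table lookups and an add.
    return C1[idx1] + C2[idx2]
```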
3.2.3 Outlier Elimination

Recent studies on quantization in LLMs have consistently observed a significant presence of outliers in activations (Xiao et al., 2023; Lin et al., 2023; Lee et al., 2024). Although outliers make up only a small portion (~1%) of the matrix, they heavily affect the quantization error and thus model accuracy. Outliers typically result in large values in the diagonal elements of the Hessian matrix. During centroid initialization in Sec. 3.2.1, VPTQ already considers these Hessian diagonals as weights in K-means, allowing VPTQ to better quantize the error introduced by outliers:

Q(v_{outlier}) = arg min_i ∥v_{outlier} − C_i^{outlier}∥²

Furthermore, VPTQ flexibly partitions the weight matrix and uses a separate outlier lookup table to quantize the matrix tiles most affected by outliers. This allows us to effectively trade off model accuracy and quantization overhead.
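A sketch of how outlier columns might be selected from the Hessian diagonal before quantization; the helper and its 1% default follow the proportion quoted above and are illustrative, not the paper's API.

```python
import numpy as np

def select_outlier_columns(H, fraction=0.01):
    """Return the indices of the columns with the largest Hessian diagonals.

    These columns contribute most to the proxy quantization error and are
    quantized with a separate outlier codebook (Sec. 3.2.3).
    """
    n = H.shape[0]
    n_outliers = max(1, int(round(fraction * n)))
    return np.argsort(np.diag(H))[-n_outliers:]
```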
4 End-to-End Quantization Algorithm

In this section, we detail the end-to-end model quantization algorithm (Algorithm 2). The algorithm takes the original model, the vector length v, the centroid number k, and the Hessian matrices H as inputs. It iterates over each layer l of the model. Because each layer's quantization only depends on the current layer and its Hessian matrix, we can fully parallelize the quantization of the layers on GPUs.

Algorithm 2 End-to-End Quantization Algorithm
Require: original model, vector length v, centroid number k, Hessian matrices H
Ensure: quantized model
  for each layer l do    ▷ Fully parallelized across layers on GPUs
    for each Linear operator do
      if outlier is enabled then
        Initialize outlier centroids C_{outlier}
        W′_{outlier} ← VPTQ(W_{outlier}, C_{outlier})
      end if
      Initialize centroids C
      W′ ← VPTQ(W, C)
      if residual is enabled then
        Initialize residual centroids C_{res}
        W′′ ← VPTQ(W − W′, C_{res})
      end if
    end for
    if finetune layer is enabled then
      Finetune layer l
    end if
  end for

In each layer, we first quantize the weight of each Linear Operator (the matrix multiplication of input and weight). If the outlier option is enabled, the algorithm first selects outlier columns following Section 3.2 and initializes the outlier centroids C_{outlier}. VPTQ is then applied to the outlier weights W_{outlier} using the outlier centroids, generating the quantized weights W′_{outlier}. Next, the algorithm initializes the centroids C for the remaining columns and applies VPTQ to the weights W using these centroids to produce the quantized weights W′. Lastly, if residual quantization is enabled, the algorithm initializes the residual centroids C_{res} and applies VPTQ to the residual error between the original and quantized weights (W − W′), using the residual centroids; the quantized weight is updated as W′′.

After processing all the operators, the algorithm fine-tunes layer l if layer fine-tuning is enabled. The loss function is the Mean Squared Error (MSE) between the original and quantized computations. In layer-wise fine-tuning, we only update the normalization operator (e.g., RMSNorm) and the centroids. These parameters comprise only a small fraction of the entire layer, so the fine-tuning completes quickly with limited memory. After each layer finishes quantization and fine-tuning, we can further fine-tune the entire model, as other PTQ methods do (Tseng et al., 2024; Chee et al., 2023; Egiazarian et al., 2024). Once the algorithm has processed all layers, it outputs the quantized model. The end-to-end VPTQ algorithm quantizes all the weights of every Linear Operator in the model into an index and a codebook (C). During model inference, we only need to dequantize the weight matrix by reading centroids from the codebook according to the index before executing the operator.
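The control flow for the Linear operators of one layer can be summarized in a short Python sketch. `init_centroids` and `vptq` stand for the routines of Sections 3.2.1 and 3.1 and are passed in as callables; outlier handling and layer-wise fine-tuning are omitted for brevity. This is an assumption-laden illustration, not the released pipeline.

```python
import numpy as np

def quantize_layer(weights, hessians, init_centroids, vptq, use_residual=True):
    """Sketch of the per-Linear-operator loop inside Algorithm 2 for one layer.

    weights:        {name: (M, N) array}, one entry per Linear operator
    hessians:       {name: (N, N) array}, the matching Hessian proxies
    init_centroids: callable for Hessian-weighted initialization (Sec. 3.2.1)
    vptq:           callable implementing Algorithm 1, returns the quantized matrix
    """
    quantized = {}
    for name, W in weights.items():
        H = hessians[name]
        C = init_centroids(W, np.diag(H))          # main codebook
        W_q = vptq(W, H, C)
        if use_residual:
            # Residual stage (Sec. 3.2.2): quantize W - W_q with its own codebook
            # and add the two reconstructions at decode time.
            C_res = init_centroids(W - W_q, np.diag(H))
            W_q = W_q + vptq(W - W_q, H, C_res)
        quantized[name] = W_q                      # layer-wise fine-tuning would follow
    return quantized
```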
5 Experiments and Evaluations

5.1 Settings

Algorithm Baseline: We focus on weight-only quantization. The detailed quantization parameters (such as vector length and codebook numbers) and the fine-tuning parameters of VPTQ are shown in Appendix B. Following (Frantar et al., 2023), our calibration data consists of 128 random segments of the C4 dataset (Raffel et al., 2020).

Models and Datasets: We benchmark accuracy on the LLaMA-2 (Touvron et al., 2023) and LLaMA-3 (Meta, 2024) families and on Mistral (Jiang et al., 2023). Following previous work (Frantar et al., 2023), we report perplexity on language modeling tasks (WikiText-2 (Merity et al., 2016), C4 (Raffel et al., 2020)). We also employ lm-eval-harness (Gao et al., 2021) to perform zero-shot evaluations on common sense QA benchmarks (PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC (Clark et al., 2018)). Detailed configuration is in Appendix A.

Baselines: For LLaMA-2 and Mistral models, we compare VPTQ against GPTQ, GPTVQ, DB-LLM, QuIP#, and AQLM. To account for the different overheads resulting from varying codebook constructions, we provide results with comparable bitwidths to facilitate a fair comparison. For LLaMA-3 models, we use the results of (Huang et al., 2024). However, due to alignment issues with the C4 dataset, we only show results for WikiText and QA tasks. Because the LLaMA-3 models are new and running the quantization ourselves is costly, we do not have results for QuIP# and AQLM.

5.2 Accuracy Evaluation

Results on the LLaMA-2 model: We compare VPTQ with QuIP#, AQLM, GPTVQ, DB-LLM, and GPTQ on the LLaMA-2 model. First, we discuss the results of 2-bit quantization. As shown in Table 2, GPTQ, as a scalar quantization method, performs poorly with unusable accuracy. While DB-LLM and GPTVQ perform better, they still experience significant performance drops, with WikiText-2 perplexity increasing by 2. The significant accuracy drop in GPTVQ, despite being a vector quantization algorithm, is due to two factors: the use of shorter vector lengths, which introduces higher quantization loss, and the choice to update weights every v columns, which leads to cumulative errors. Therefore, we primarily focus on comparing VPTQ with the state-of-the-art QuIP# and AQLM, which both choose longer vector lengths.

Table 2 includes the average scores for the five QA tasks mentioned in Section 5.1. VPTQ outperforms QuIP# and AQLM on the 7B and 13B models. For the 7B model, VPTQ further reduces WikiText-2 perplexity by 0.5 and 0.3 compared to the previous best results at 2-2.02 bits and 2.26-2.29 bits, respectively. In QA tasks, the VPTQ 2.26-bit model surpasses the AQLM 2.29-bit model with an average accuracy increase of 1%. For the 13B model, the VPTQ 2.02-bit model shows a slight improvement over QuIP#, and the 2.18-bit model outperforms AQLM in QA accuracy by 1.5%. On the LLaMA-2-70B model, we achieve similar perplexity (< 0.02) and comparable QA results (< 0.4%).
Table 2: LLaMA-2 2-bit quantization results. "N/A" stands for "not available"; further explanation is provided in Appendix A.1. ¹ We use naive Torch and Triton kernels for the inference performance evaluation, without optimizations such as CUDA graphs, FlashAttention, or Torch compile; the inference numbers for QuIP# and AQLM therefore do not represent their performance with all optimizations enabled, and both can achieve high performance when fully optimized.

(a) 7B results
Method   bit    W2↓    C4↓    AvgQA↑  tok/s↑  mem(GB)↓  cost(h)↓
FP16     16     5.12   6.63   62.2    38.32   27.22     N/A
GPTQ     2.125  50.75  36.76  39.16   19.59   4.42      0.2
GPTVQ    2.25   6.71   9.9    56.14   N/A     N/A       1.5
DB-LLM   2.01   7.23   9.62   55.1    N/A     N/A       N/A
QuIP#¹   2      6.19   8.16   58.2    4.4     2.25      N/A
AQLM¹    2.02   6.64   8.56   56.5    19.4    2.16      N/A
AQLM¹    2.29   6.29   8.11   58.6    19.6    2.4       11.07
VPTQ     2.02   6.13   8.07   58.2    39.9    2.28      2
VPTQ     2.26   5.95   7.87   59.4    35.7    2.48      2.2
Table 3: LLaMA-3 and Mistral-7B 2-, 3-, and 4-bit quantization results: LLaMA-3 WikiText-2 perplexity (context length 2048) and average zero-shot QA accuracy; Mistral-7B WikiText-2 and C4 perplexity (context length 8192) and average zero-shot QA accuracy. Detailed per-task scores are in Table 6 and Table 7.

LLaMA-3 8B / LLaMA-3 70B
Method   bit    W2↓       AvgQA↑  |  bit    W2↓   AvgQA↑
FP16     16     6.14      68.6    |  16     2.9   75.3
QuIP     4      6.5       67.1    |  4      3.4   74.5
GPTQ     4      6.5       67.3    |  4      3.3   74.9
VPTQ     4.03   6.42      68.1    |  4.05   3.15  74.7
QuIP     3      7.5       63.7    |  3      4.7   72.6
GPTQ     3      8.2       61.7    |  3      5.2   70.6
VPTQ     3.03   6.97      66.7    |  3.01   3.81  73.7
QuIP     2      85.1      36.8    |  2      13    48.7
DB-LLM   2      13.6      51.7    |  N/A    N/A   N/A
GPTQ     2      2.10E+02  36.2    |  2      11.9  45.4
VPTQ     2.08   9.29      60.2    |  2.02   5.6   70.9
VPTQ     2.24   9.19      62.7    |  2.07   5.66  70.7

Mistral 7B
Method   bit    W2↓   C4↓   AvgQA↑
FP16     16.0   4.77  5.71  68.6
QuIP#    4.01   4.85  5.79  68.7
AQLM     4.02   4.85  5.79  68.0
GPTQ     4.125  4.83  5.74  68.4
VPTQ     4.03   4.81  5.72  68.2
AQLM     3.0    5.07  5.97  67.3
VPTQ     3.03   4.96  5.84  67.3
QuIP#    2.01   6.02  6.84  62.2
AQLM     2.01   6.32  6.93  62.2
GPTQ     2.125  1535  164   44.5
GPTVQ    2.25   8.99  18.6  57.7
VPTQ     2.04   5.64  6.43  63.2
The results for 3- and 4-bit quantization shown in Table 5 are without end-to-end fine-tuning but are still comparable to AQLM and QuIP#, which include end-to-end fine-tuning. The ablation study of the quantization parameters is in Appendix C.

Results on the LLaMA-3 and Mistral models: Table 3 presents VPTQ results on the LLaMA-3 and Mistral-7B models. In all 2-, 3-, and 4-bit quantizations of LLaMA-3 models, we significantly outperform GPTQ, DB-LLM, and QuIP, whose accuracy drops to unusable levels. VPTQ keeps the accuracy drop below 8% for the 8B model and below 5% for the 70B model. On the Mistral-7B model, our 2-bit result surpasses both QuIP# and AQLM by 0.8% in QA accuracy. In 3-bit quantization, our perplexity is lower. At 4-bit, results are comparable overall. More detailed results are in Table 7. As the bitwidth increases, the advantage of vector quantization diminishes, with GPTQ showing a similar WikiText-2 perplexity at 4-bit.

Inference throughput and quantization cost: In Table 2, the 'tok/s' column indicates the number of tokens generated per second during the decode phase of inference. VPTQ achieves a 2-9× speedup compared to QuIP#, because QuIP# applies the Hadamard Transform during decoding, which introduces O(n²) multiplications and additions and significantly slows inference. Compared to AQLM, VPTQ uses a smaller codebook, resulting in lower decoding overhead; therefore, our inference throughput for the 7B and 13B models is 1.6-1.8× higher than AQLM's. As the model size increases, our codebook size becomes comparable to theirs, leading to similar inference throughput for the 70B model. The 'mem(GB)' column reports the GPU memory usage at runtime. The 'cost(h)' column reports the hours required for model quantization on 4× 80GB A100 GPUs. We achieve comparable or even better results than AQLM in only 10.4-18.6% of the quantization algorithm execution time.

6 Conclusion

In this paper, we propose Vector Post-Training Quantization (VPTQ), a novel approach to achieving extremely low-bit quantization of LLMs through Vector Quantization. Through the application of Second-Order Optimization, we have formulated the LLM Vector Quantization problem and directed the design of our quantization algorithm. By further refining the weights via Channel-Independent Second-Order Optimization, we have enabled a more granular VQ. VPTQ also includes a brief and effective codebook initialization algorithm, obtained by decomposing the optimization problem. We have extended VPTQ to support residual and outlier quantization, which not only improves model accuracy but also further compresses the model size.

Our experimental results demonstrate the effectiveness and efficiency of VPTQ. The perplexity of the quantized model is reduced by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, and 11-22% on LLaMA-3 on QA tasks. Furthermore, we achieved these results using only 10.4-18.6% of the quantization algorithm execution time, leading to a 1.6-1.8× increase in inference throughput compared to SOTA. These results underscore the potential of VPTQ as an efficient and powerful solution for the deployment and inference of LLMs, particularly in resource-constrained settings.

7 Limitations

Related research on PTQ (Egiazarian et al., 2024; Tseng et al., 2024; van Baalen et al., 2024) has adopted end-to-end model fine-tuning after the PTQ phase. Compared to other related works, VPTQ quantizes the model better in the PTQ phase itself, and it simplifies and reduces the cost and overhead of model fine-tuning.

Due to GPU resource constraints, we cannot fine-tune larger models (70B) for more iterations and more tokens. This limits our experimental results, which only reach parity with the baselines on 70B models, and restricts the demonstration of VPTQ's advantages and potential on large models in this paper. We will strive for more GPU resources to fine-tune the VPTQ model for longer and with more tokens in the future, allowing for a fair comparison.

Additionally, since the LLaMA-3 models are the most recently released models, there is a lack of baselines from related works, making it difficult for us to fully demonstrate our performance improvements. We will continue to add more baselines in the future to highlight the advantages of VPTQ.

In this paper, we only use AI tools for grammar checking and code completion.
Acknowledgement

We thank James Hensman for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research. We also thank the QuIP# and AQLM authors for inspiring our paper and for their guidance on implementation.

References

C.F. Barnes, S.A. Rizvi, and N.M. Nasrabadi. 1996. Advances in residual vector quantization: a review. IEEE Transactions on Image Processing, 5(2):226–262.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439.

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. 2023. QuIP: 2-bit quantization of large language models with guarantees.

Ting Chen, Lala Li, and Yizhou Sun. 2020. Differentiable product quantization for end-to-end embedding compression. In International Conference on Machine Learning, pages 1617–1626. PMLR.

Minsik Cho, Keivan Alizadeh-Vahid, Saurabh Adya, and Mohammad Rastegari. 2022. DKM: Differentiable k-means clustering layer for neural network compression. In The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

Renato Cordeiro de Amorim and Boris Mirkin. 2012. Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering. Pattern Recognition, 45(3):1061–1075.

Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. 2024. Extreme compression of large language models via additive quantization.

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. OPTQ: Accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net.

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, et al. 2021. A framework for few-shot language model evaluation. Version v0.0.1, September.

A. Gersho. 1979. Asymptotically optimal block quantization. IEEE Transactions on Information Theory, 25(4):373–380.

Babak Hassibi and David G. Stork. 1992. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5 (NIPS 1992), pages 164–171. Morgan Kaufmann.

Babak Hassibi, David G. Stork, and Gregory J. Wolff. 1993. Optimal brain surgeon and general network pruning. In Proceedings of the IEEE International Conference on Neural Networks, pages 293–299. IEEE.

Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, and Michele Magno. 2024. How good are low-bit quantized LLaMA3 models? An empirical study.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.

Kittisak Kerdprasop, Nittaya Kerdprasop, and Pairote Sattayatham. 2005. Weighted k-means for density-biased clustering. In International Conference on Data Warehousing and Knowledge Discovery, pages 488–497. Springer.

Yann LeCun, John S. Denker, and Sara A. Solla. 1989. Optimal brain damage. In Advances in Neural Information Processing Systems 2 (NIPS 1989), pages 598–605. Morgan Kaufmann.

Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. 2024. OWQ: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2024), pages 13355–13364. AAAI Press.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. AWQ: Activation-aware weight quantization for LLM compression and acceleration. CoRR, abs/2306.00978.

Hongfu Liu, Junjie Wu, Tongliang Liu, Dacheng Tao, and Yun Fu. 2017. Spectral ensemble clustering via weighted k-means: Theoretical and practical evidence. IEEE Transactions on Knowledge and Data Engineering, 29(5):1129–1143.

Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. 2024. The era of 1-bit LLMs: All large language models are in 1.58 bits. CoRR, abs/2402.17764.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

AI Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. Meta AI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1):5485–5551.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.

Sidak Pal Singh and Dan Alistarh. 2020. WoodFisher: Efficient second-order approximation for neural network compression. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. 2021. Training with quantization noise for extreme model compression. In The Ninth International Conference on Learning Representations, ICLR 2021. OpenReview.net.

Pierre Stock, Armand Joulin, Rémi Gribonval, Benjamin Graham, and Hervé Jégou. 2020. And the bit goes down: Revisiting the quantization of neural networks. In The Eighth International Conference on Learning Representations, ICLR 2020. OpenReview.net.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. 2024. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks.

Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. 2024. GPTVQ: The blessing of dimensionality for LLM quantization.

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. 2023. BitNet: Scaling 1-bit transformers for large language models. CoRR, abs/2310.11453.

Benchang Wei, Tao Guan, and Junqing Yu. 2014. Projected residual vector quantization for ANN search. IEEE MultiMedia, 21(3):41–51.

Guangxuan Xiao, Ji Lin, Mickaël Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, ICML 2023, Proceedings of Machine Learning Research 202, pages 38087–38099. PMLR.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

A Appendix: All Experiments Results

A.1 Supplementary Explanation for Main Results Table 2

Table 2 shows our main results. Here we explain the "N/A" entries relative to other works.

DB-LLM: Since the authors did not open-source their code, we use the AvgQA results from their paper. However, this number does not align with our FP16 results.

GPTQ: We reproduce the 2-bit results using the official GPTQ repository. As GPTQ quantizes each layer in sequential order, the 'cost(h)' represents the time taken to quantize on a single A100 GPU.

GPTVQ: They do not release their 2-bit quantized models. We reproduce the LLaMA-2, LLaMA-3 7B and 13B, and Mistral-7B 2-bit results using their released GPTVQ code, which only supports single-GPU execution. Therefore, the quantization cost reflects the execution time on a single A100 GPU. Due to the lack of logic for loading their quantizers in the released code, we were unable to measure throughput and runtime memory.
AQLM: Their 1.97-bit LLaMA-2-13B model has not been open-sourced, so we are unable to measure its inference throughput and runtime memory.

QuIP#: Due to recent changes in the libraries they depend on, the quantization cost is not measured. The quantization time for the 70B model is estimated from their original paper.

A.2 All Experimental Results

In this section, we present all our experimental results, including the perplexity of the quantized models at different context lengths on two datasets, WikiText-2 and C4, and the accuracy on five commonsense QA tasks (abbreviated AE for Arc_easy, AC for Arc_challenge, HE for HellaSwag, QA for PIQA, and WI for WinoGrande). Table 4 displays all results for LLaMA-2 at 2-bit quantization. Table 5 presents results for LLaMA-2 at 3- and 4-bit quantization. Table 6 displays all results for LLaMA-3 at 2-, 3-, and 4-bit quantization. Table 7 shows all results for Mistral-7B at 2-, 3-, and 4-bit quantization.
B Quantitative Analysis of Quantization Parameter Settings

Quantization configuration: The quantization parameters of all VPTQ 2-bit models are shown in Table 8.

Layer-wise fine-tuning parameters: Layer-wise fine-tuning trains the centroids and layer norms using the input and output of each layer when 128 samples of the C4 training set are fed into the full-precision model. We train each layer for 100 iterations. Table 9 shows the learning rate and batch size used for each model.

C Ablation Study

Table 10 shows results for LLaMA-2-13B on WikiText-2 and C4 (sequence length = 4096) under different quantization parameters. We discuss the impact of techniques such as vector length, channel-independent optimization, residual vector quantization, outlier elimination, layer-wise fine-tuning, and end-to-end fine-tuning on the quantization results.

C.1 Parameter Description

When performing N% outlier elimination, N% of outliers are quantized using a codebook with vector length v0 and k0 centroids. For the remaining (100−N)% of parameters, the vector length is v1. k1 represents the number of centroids in the first codebook, while k2 represents the number of centroids in the second codebook for residual vector quantization; k2 = −1 indicates no residual vector quantization.

C.2 Vector Length and Residual Vector Quantization

Compression Ratio Calculation: The average bitwidth per element of the index matrix obtained through vector quantization is:

Average index bitwidth = log2(k1)/v1 + log2(k2)/v1

The compression ratio is calculated as:

Compression ratio = Total original model bits / (Codebook bits + Index bits)

For an original linear weight matrix with M parameters,

Codebook bits = (v0 × k0 + v1 × (k1 + k2)) × 16

Index bits = M × N% × log2(k0)/v0 + M × (100 − N)% × (log2(k1)/v1 + log2(k2)/v1)

The total bitwidth in the table is calculated per transformer block, which for LLaMA-2 includes 4 attention linear layers and 3 FFN linear layers.
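The bit accounting above can be sketched with a small helper. The helper is illustrative (it uses k2 = 0 rather than −1 to mean "no residual codebook"), and the example call at the bottom plugs in a hypothetical 4096 × 4096 matrix with the 2-bit main codebook of Table 8 rather than reproducing a specific table entry.

```python
import math

def vptq_bits(M, outlier_pct, v0, k0, v1, k1, k2=0):
    """Bit accounting for one weight matrix with M parameters (Appendix C.2).

    outlier_pct: N, the percentage of parameters quantized with the outlier codebook
    v0, k0:      outlier vector length and codebook size
    v1, k1, k2:  main vector length and first/second (residual) codebook sizes
                 (k2 = 0 means no residual quantization in this sketch)
    Returns (codebook_bits, index_bits, equivalent bits per parameter).
    """
    res_bits = math.log2(k2) if k2 else 0.0
    codebook_bits = (v0 * k0 + v1 * (k1 + k2)) * 16            # centroids stored in FP16
    index_bits = (M * outlier_pct / 100 * math.log2(k0) / v0
                  + M * (100 - outlier_pct) / 100 * (math.log2(k1) + res_bits) / v1)
    return codebook_bits, index_bits, (codebook_bits + index_bits) / M

print(vptq_bits(M=4096 * 4096, outlier_pct=1, v0=4, k0=8192, v1=12, k1=4096, k2=4096))
```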
Table 4: LLaMA-2 2-bit quantization results. ¹ We use naive Torch and Triton kernels for the inference performance evaluation, without optimizations such as CUDA graphs, FlashAttention, or Torch compile; the inference numbers for QuIP# and AQLM do not represent their performance with all optimizations enabled.
Impact of Vector Length: First, we discuss the impact of vector length on accuracy. In Table 10, rows #2, #3, #4, and #6 show results for v1 = 2, 4, 6, 8, keeping the average index bitwidth at 2 (i.e., log2(k1)/v1 = 2). As v1 increases, the perplexity on WikiText-2 and C4 decreases, but the codebook size also increases exponentially. For v1 = 8 and k1 = 65536, the codebook overhead introduces an additional 0.19 bits. Then, we evaluate the model inference throughput in Table 11. Since we employ weight-only quantization, the main additional overhead of quantized-model inference comes from the lookup table for model weights. Table 11 shows the throughput of 2-bit models under various settings. As the vector length increases (from 2 to 6), the granularity of memory access for reading the lookup table during dequantization increases, which allows memory access to match the GPU's cache line (128 bytes @ L1). This reduces memory access transactions and decreases cache misses. As the vector length increases further (from 8 to 12) along with the size and number of levels of the codebook, the codebook no longer fits in the L1 cache, thereby reducing the model's inference speed. Additionally, we find that a reasonable setting (e.g., v = 6, k = 4096) can achieve throughput similar to the original model for the quantized model, demonstrating the efficiency of the VPTQ design.

Residual Vector Quantization: Without any fine-tuning, rows #4 and #7 show similar perplexities for v1 = 6, k1 = 4096 and v1 = 12, k1 = k2 = 4096, with the latter even higher. However, after layer-wise fine-tuning, comparing rows #11 and #13, residual vector quantization (RVQ) reduces the perplexity by 0.3 compared to plain vector quantization (VQ) due to the increased number of finetunable centroids, showing a significant improvement.

Channel-Independent Second-Order Optimization: The comparison against row #5, which disables it, indicates that channel-independent second-order optimization effectively mitigates quantization error accumulation.

C.4 Outlier Elimination

Rows #4, #8, #9, and #10 represent the results for eliminating 0%, 1%, 2%, and 5% outliers, respectively. We used a codebook with v0 = 4 and k0 = 4096 to quantize N% of outliers, achieving an effective average index bitwidth of 3 bits, while the other parameters were 2 bits. A higher N% means more parameters are quantized with 3 bits, leading to a larger total bitwidth and lower perplexity.

C.5 Fine-tuning

Rows #4, #11, and #12 show results without any fine-tuning, with layer-wise fine-tuning, and with end-to-end fine-tuning, respectively. Adding fine-tuning reduced the perplexity on WikiText-2 from 6.29 to 6.07 and further to 5.32.
Table 6: LLaMA-3 WikiText-2 perplexity (context length 2048) and zero-shot QA accuracy.

LLaMA-3 8B
Method   bit    W2↓       AC↑   AE↑   HE↑   QA↑   WI↑
FP16     16     6.14      50.3  80.1  60.2  79.6  73.1
QuIP     4      6.5       47.4  78.2  58.6  78.2  73.2
GPTQ     4      6.5       47.7  78.8  59.0  78.4  72.6
VPTQ     4.03   6.42      49.1  78.8  59.3  78.7  74.8
QuIP     3      7.5       41.0  72.9  55.4  76.8  72.5
GPTQ     3      8.2       37.7  70.5  54.3  74.9  71.1
VPTQ     3.03   6.97      45.8  77.5  58.4  78.2  73.4
QuIP     2      85.1      21.3  29.0  29.2  52.9  51.7
DB-LLM   2      13.6      28.2  59.1  42.1  68.9  60.4
GPTQ     2      2.10E+02  19.9  28.8  27.7  53.9  50.5
VPTQ     2.08   9.29      36.9  71.0  52.2  75.1  65.9
VPTQ     2.24   9.19      42.6  73.2  53.1  75.4  69.1

LLaMA-3 70B
Method   bit    W2↓   AC↑   AE↑   HE↑   QA↑   WI↑
FP16     16     2.9   60.1  87.0  66.3  82.4  80.8
QuIP     4      3.4   58.7  86.0  65.7  82.5  79.7
GPTQ     4      3.3   58.4  86.3  66.1  82.9  80.7
VPTQ     4.05   3.15  59.0  86.1  66.2  82.4  79.8
QuIP     3      4.7   54.9  83.3  63.9  82.3  78.4
GPTQ     3      5.2   52.1  79.6  63.5  80.6  77.1
VPTQ     3.01   3.81  57.3  84.7  65.5  81.7  79.2
QuIP     2      13    26.5  48.9  40.9  65.3  61.7
DB-LLM   N/A    N/A   N/A   N/A   N/A   N/A   N/A
GPTQ     2      11.9  24.6  38.9  41.0  62.7  59.9
VPTQ     2.02   5.6   52.5  81.8  61.7  80.4  77.9
VPTQ     2.07   5.66  54.2  83.6  61.8  80.1  74.0
Table 7: Mistral-7B-v0.1 WikiText-2 and C4 perplexity (context lengths 2048 and 8192) and zero-shot QA accuracy.

Method   bit    W2(2k)  W2(8k)  C4(8k)  AC     AE     HE     QA     WI
FP16     16     5.25    4.77    5.71    48.89  78.87  61.12  80.3   73.88
GPTVQ    4.125  5.38    4.87    6.13    50     80.43  60.36  79.65  73.4
QuIP#    4      —       4.85    5.79    49.4   78.96  60.62  80.41  73.95
AQLM     4.02   —       4.85    5.79    48.21  77.86  60.27  79.71  73.8
GPTQ     4.125  5.36    4.83    5.74    49.57  79.5   60.38  79.54  72.85
VPTQ     4.03   5.36    4.81    5.72    48.12  77.82  60.61  80.14  74.19
GPTVQ    3.125  6.42    6.8     13.28   40.78  75.67  54.18  77.42  67.4
AQLM     3.04   —       5.07    5.97    46.67  77.61  59.31  80.14  72.69
GPTQ     3.125  6.02    5.88    6.86    47.35  77.86  58.84  79.82  71.74
VPTQ     3.03   5.53    4.96    5.84    46.67  77.95  59.91  79.49  72.45
QuIP#    2      —       6.02    6.84    39.76  72.14  52.95  76.71  69.3
AQLM     2.01   —       6.32    6.93    40.44  73.65  52.13  76.01  68.75
GPTVQ    2.25   8.2     8.99    18.6    37.37  71     45.43  70.18  64.33
GPTQ     2.125  280     1535    164     24.49  44.91  36.56  63.33  52.96
VPTQ     2.04   6.32    5.64    6.43    41.13  72.22  56.1   77.91  68.67
Table 8: Parameters for 2-bit quantization of the LLaMA and Mistral models. v denotes the vector length, k the codebook size (k1 and k2 correspond to the two codebooks), and group num the number of groups into which product quantization (PQ) divides the matrix.

Model        bit    Outlier N%  Outlier v  Outlier k  v   k1    k2    group num
LLaMA2-7b    2.02   0           -          -          6   4096  -     1
LLaMA2-7b    2.26   1           4          8192       12  4096  4096  4
LLaMA2-13b   2.02   0           -          -          6   4096  -     1
LLaMA2-13b   2.18   2           4          8192       12  4096  4096  4
LLaMA2-70b   2.07   1           4          8192       12  4096  4096  4
LLaMA2-70b   2.11   1           4          8192       12  4096  4096  8
LLaMA3-8b    2.08   1           4          4096       12  4096  4096  1
LLaMA3-8b    2.24   1           4          8192       6   4096  -     16
LLaMA3-70b   2.02   0           -          -          12  4096  4096  1
LLaMA3-70b   2.07   1           4          4096       6   4096  -     16
Group number: We also evaluate splitting the weight matrix into 1, 2, 4, and 8 groups, where each group has its own independent codebook. When divided into 1, 2, and 4 groups, the perplexity on WikiText-2 does not change much, likely because the distribution of the remaining parameters (after removing 1% of outliers) is relatively uniform. This is likely because the distributions of different groups overlap after grouping, so the benefit of increasing the group number is limited.

D Inference Evaluation

D.1 Throughput Measurement Process

We follow the throughput measurement method used in AQLM (Egiazarian et al., 2024). During the prompt phase, we provide 1 token and then have the model generate 256 tokens, measuring the generation time per output token to determine the throughput in tokens per second (tok/s).
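A sketch of this measurement loop with Hugging Face Transformers is given below. The model identifier is a placeholder, a CUDA GPU is assumed, and the snippet mirrors the protocol described above rather than reproducing the exact benchmarking harness.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/quantized-model"          # placeholder path
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

# 1-token prompt, then generate 256 tokens and time the decode phase.
inputs = tok("a", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"{256 / elapsed:.1f} tok/s")           # decode-phase throughput
```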
Table 10: Ablation Study on Different Quantization Techniques for LLaMA-2 13B