tization algorithm into a heuristic algorithm to solve the optimization problem. We also analyze and propose a brief and effective codebook initialization algorithm to reduce the extra overhead of centroid training and updates. Experiments show that VPTQ only requires 10.4-18.6% of the quantization algorithm execution time compared to existing SOTA results.

Figure 1: Vector Quantization in Weight Quantization (❶ reshape & group, ❷ clustering, ❸ VPTQ).
3. VPTQ has low dequantization overhead. The VPTQ algorithm quantizes all the weights of every Linear Operator in the model into an index matrix and codebooks. During model inference, we only need to dequantize the weight matrix by reading centroids from the codebook according to the index before executing the operator. The models quantized by VPTQ achieve a 1.6-1.8× improvement in inference throughput compared to SOTA.

2 Background and Motivation

2.1 Post Training Quantization in LLM

Post-Training Quantization (PTQ) (LeCun et al., 1989; Hassibi et al., 1993; Hassibi and Stork, 1992; Frantar et al., 2023; Singh and Alistarh, 2020) aims to decrease model weight size by simplifying the numerical representation while seeking to maintain the model's accuracy without retraining. We can formulate PTQ as the following optimization problem:

arg min E[L(X, W + ∆W) − L(X, W)] ≈ ∆W^T · g(W) + (1/2) ∆W^T · H(W) · ∆W

where W is the original model weights, Ŵ is the quantized weights, and ∆W = Ŵ − W represents the weight quantization error. The loss of the model task is L. The optimization objective is to minimize the impact of model quantization on the model task, which means minimizing the expected deviation of the loss function.

PTQ typically employs a concise and accurate method for analyzing the above optimization problem: Second-Order Optimization. Following a Taylor series expansion, this method breaks the optimization goal down into first-order, second-order, and higher-order terms. g(W) and H(W) represent the gradient and Hessian of the task loss L, respectively. It is often assumed that the model has already reached a local optimum before quantization, which means the first-order term is nearly zero. Higher-order terms exert a minor effect on the optimization goal, and we typically disregard interactions among weights between different layers. Consequently, we can simplify the optimization problem by focusing on the second-order term and define the following optimization problem:

arg min_∆W ∆W^T · H(W) · ∆W,  s.t. ∆W = 0    (1)

The objective is to minimize the second-order error in model quantization, subject to the constraint that the change in model weights is as small as possible, i.e., ∆W = 0.

2.2 Vector Quantization in Neural Networks

VQ is a key method for efficient lossy data compression (Gersho, 1979). Its objective is to reduce distortion by mapping high-dimensional original data to a lower-dimensional space represented by a lookup table (Eq. 2). VQ maps original vectors (W′) from the vector space to a finite set of vectors, commonly referred to as a codebook (lookup table, C). Each vector in the original space is approximated by the closest vector (centroid C_i) in the codebook:

arg min_{i∈k} ∥v − C_i∥²,  ∀v ∈ W′    (2)

VQ assigns each input vector v the nearest centroid C_i in the lookup table, i.e., the optimization problem finds the index i that minimizes the Euclidean distance to v. Thus, each input vector is represented by its most similar centroid, minimizing the total distortion.
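To make Eq. 2 concrete, the assignment step can be written in a few lines of NumPy. This is a minimal sketch of plain nearest-centroid VQ, not VPTQ's implementation; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def vq_assign(vectors, codebook):
    """Nearest-centroid assignment (Eq. 2).

    vectors:  (num_vectors, v) reshaped weight vectors W'
    codebook: (k, v) centroids C
    Returns the closest-centroid index for each vector and the
    dequantized (reconstructed) vectors.
    """
    # Squared Euclidean distance between every vector and every centroid.
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)          # arg min_i ||v - C_i||^2
    return idx, codebook[idx]
```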
Recent research has explored the use of VQ for model weight quantization (Chen et al., 2020; Cho et al., 2022; Stock et al., 2020, 2021). These studies attempt to compress the embedding layer, the convolution layer, and the classification layer of neural networks using VQ. Figure 1 illustrates an example of applying VQ to compress a weight matrix. For a weight matrix W with dimensions M × N, we reshape W into vectors of length v as W′ (step ❶). The number of reshaped vectors is (M × N)/v. Next, we employ k-means or another clustering algorithm to build a codebook (step ❷). The constructed codebook contains k centroid vectors, each with v dimensions. Applying the VQ algorithm directly often does not yield acceptable accuracy; typically, PTQ algorithms adjust the index and centroids to enhance the accuracy of the quantized model (step ❸).

During model inference, each operator in the model first dequantizes the original weight matrix from the lookup table (codebook) using the index and centroids. Unlike scalar quantization, VQ keeps the index and centroids as the quantized weight. The equivalent compression ratio of VQ can be formulated as: total original model bits / (codebook bits + index bits). The equivalent quantization bitwidth is: original bitwidth / compression ratio. For example, for a 4096 × 4096 FP16 weight matrix with vectors of length v = 8 and 256 centroids, the compression ratio is (16 × 4096 × 4096) / (8 × 256 × 16 + log2(256) × 4096 × 4096 / 8) = 15.97, and the equivalent bitwidth is 1.002 bits.
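The bookkeeping in this running example can be reproduced with a few lines of Python; this is a sketch of the arithmetic only, not part of VPTQ's code.

```python
import math

# Running example: a 4096 x 4096 FP16 weight matrix, vector length v = 8,
# and a codebook with k = 256 centroids.
M, N, v, k, fp_bits = 4096, 4096, 8, 256, 16

original_bits = fp_bits * M * N               # 16 * 4096 * 4096
codebook_bits = v * k * fp_bits               # centroids stored in FP16
index_bits = math.log2(k) * (M * N / v)       # one log2(k)-bit index per vector

compression_ratio = original_bits / (codebook_bits + index_bits)
equivalent_bitwidth = fp_bits / compression_ratio
print(f"{compression_ratio:.2f}x, {equivalent_bitwidth:.3f} bits")  # ~15.97x, ~1.002 bits
```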
2.3 Vector Quantization in LLMs

While VQ has been applied to weight quantization, the following significant challenges persist when quantizing LLMs. We summarize the benefits and weaknesses of recent techniques (Egiazarian et al., 2024; Tseng et al., 2024; van Baalen et al., 2024) in Table 1.

The number of parameters in LLMs is enormous, which requires quantizing the model with lightweight methods to avoid excessive resource consumption. AQLM (Egiazarian et al., 2024) utilizes gradient descent to train each layer of the VQ-quantized model and simultaneously trains across multiple layers using calibration data. It achieves effective compression through additive quantization and joint optimization of the codebook, which can achieve high accuracy. However, because AQLM relies on backpropagation for model training, significant GPU hours and memory are required to achieve better accuracy, especially when dealing with LLMs with massive parameter counts.

GPTVQ (van Baalen et al., 2024) utilizes the Second-Order Optimization method to implement PTQ. However, GPTVQ accumulates quantization errors within vector quantization, leading to an inevitable increase in quantization error as the vector length increases. This prevents the use of longer vectors and consequently limits the compression ratio.

QuIP# (Tseng et al., 2024) introduces incoherence processing using the randomized Hadamard transform on the weight matrix before VQ. The processed weight matrix approximates a sub-Gaussian distribution, allowing compression with a tiny codebook. Although QuIP# can compress LLMs to extremely low bitwidths with a small accuracy drop, incoherence processing requires a significant amount of computation. It requires significantly more computation for inference compared to the original LLM, resulting in low inference throughput.
3 Vector Post-Training Quantization

3.1 VPTQ Algorithm

VPTQ leverages Second-Order Optimization and solves the optimization problem in Eq. 1 to achieve extreme low-bit quantization. Assume that a weight matrix is W ∈ R^{M×N} and that the Hessian matrix collected from the current layer is H ∈ R^{N×N}. We denote the q-th column of the weight matrix as W_{:,q}. The quantized column Ŵ_{:,q} can be represented as the transpose of concatenated centroid vectors:

Ŵ_{:,q} = (C_0, C_1, ..., C_{M/v})^T

When the weight matrix of the model is large, we can first split the weight matrix into multiple groups, each with its own independent codebook. This method allows us to flexibly divide the weight matrix into a number of submatrices (Ŵ_{:,q:q+(M/group num)}) equal to the group number. For clarity, we describe only one group in the following algorithm description.

Unlike GPTVQ, we quantize each column of the matrix independently, which we refer to as Channel-Independent Second-Order Optimization. It greatly simplifies the complexity of VQ in Second-Order Optimization. GPTVQ, on the other hand, quantizes v columns of the matrix (Ŵ_{M,v}) at once, leading to larger errors and more complex transformations for problem optimization.
We use the Lagrange Method to transform the optimization problem in Eq. 1 into an unconstrained optimization problem, with Lagrangian function L(∆W) and Lagrangian multiplier λ:

L(∆W) = ∆W^T H(W) ∆W + λ∆W

The dual function g(λ) can be represented as:

g(λ) = −H^{-1}_{qq} λλ^T − λ(Ŵ_{:,q} − W_{:,q})

Differentiating g(λ) with respect to λ and setting it to 0,

g′(λ) = −H^{-1}_{qq} λ^T − (Ŵ_{:,q} − W_{:,q}) = 0,

we find that the problem reaches an optimal solution when λ^T = −(Ŵ_{:,q} − W_{:,q}) / H^{-1}_{qq}.

By substituting λ^T into the optimization problem, we find that to minimize the error introduced by quantization, we need to minimize the impact on the Lagrangian function. Therefore, we can transform the quantization problem into minimizing:

∆L(∆Ŵ) = Σ ∥v − C∥² / (2 H^{-1}_{qq})

We find that when quantizing one column vector at a time, we only need to consider minimizing Σ ∥v − C∥², i.e., finding the closest centroid in Euclidean distance. This precisely aligns with the optimization objective of VQ. Moreover, since VPTQ quantizes the weight matrix column by column, H^{-1}_{qq} is constant while quantizing each column, so we do not need to consider the Hessian when finding the centroid.
After quantizing a column of the weight matrix, we need to propagate the current quantization error to the unquantized part through:

∆W = ((Ŵ_{:,q} − W_{:,q}) / H^{-1}_{qq}) · H_{q,:}

This transforms the current quantization error onto the following unquantized columns. Since GPTVQ quantizes v columns at the same time, its quantization error can only spread to other unquantized columns once all v columns have been quantized. This causes more error to accumulate during quantization, resulting in a decrease in model accuracy; Table 2 supports the same conclusion. Algorithm 1 provides a detailed description of the steps to solve the optimization problem and quantize the weights according to the above analysis.

Algorithm 1 VPTQ Algorithm
Input: W ∈ R^{M×N}    ▷ Input weight matrix
Input: H ∈ R^{N×N}    ▷ Hessian matrix
Output: Ŵ ∈ R^{M×N}   ▷ Quantized weight matrix
  E ← 0^{M×N}    ▷ Initialize quantization errors
  for s = 0, B, 2B, ... do    ▷ Column blocks
    for n = s, s+1, ..., s+B−1 do    ▷ Quantize a single column n, fundamentally different from AQLM
      for m = 0, V, 2V, ..., M do    ▷ Parallel (residual) vector quantization Q_V(v) of the vectors in column n
        Ŵ_{m:m+V,n} ← Q_V(W_{m:m+V,n})
      end for
      E_{:,n} ← (W_{:,n} − Ŵ_{:,n}) / H^{-1}_{n,n}    ▷ Update quantization error
      W_{:,n:s+B} ← W_{:,n:s+B} − E_{:,n} H^{-1}_{n,n:s+B}    ▷ Merge quantization error into the block's weights
    end for
    W_{:,s+B:} ← W_{:,s+B:} − E_{:,s:s+B} H^{-1}_{s:s+B,s+B:}    ▷ Update all remaining weights
  end for
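The column-wise loop of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration of channel-independent second-order quantization for one column block; the function name, the precomputed inverse Hessian `Hinv`, and the fixed `codebook` are illustrative assumptions, not the released implementation.

```python
import numpy as np

def vptq_quantize_block(W, Hinv, codebook, V, s, B):
    """Quantize columns s .. s+B-1 of W in place and return their quantized values.

    W:        (M, N) weight matrix, updated in place as errors are propagated
    Hinv:     (N, N) inverse Hessian proxy
    codebook: (k, V) centroids; V is the vector length (assumes M % V == 0)
    """
    M, _ = W.shape
    W_hat = np.empty((M, B))
    E = np.zeros((M, B))                                   # per-column scaled errors
    for j, n in enumerate(range(s, s + B)):
        # Vector-quantize column n: every length-V sub-vector picks its nearest centroid.
        vecs = W[:, n].reshape(-1, V)
        dist = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        W_hat[:, j] = codebook[dist.argmin(axis=1)].reshape(-1)
        # Channel-independent second-order update (cf. Algorithm 1): spread this
        # column's error onto the not-yet-quantized columns of the block.
        E[:, j] = (W[:, n] - W_hat[:, j]) / Hinv[n, n]
        W[:, n + 1:s + B] -= np.outer(E[:, j], Hinv[n, n + 1:s + B])
    # Propagate the accumulated block error to all remaining columns.
    W[:, s + B:] -= E @ Hinv[s:s + B, s + B:]
    return W_hat
```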
Distinguish VPTQ from GPTQ and GPTVQ: Compared with GPTQ, VPTQ employs vector representations in the quantization, choosing the vector closest to the original matrix to represent the original data. Because VQ can use a larger codebook to store the quantized data, it covers a wider range of numerical distributions than the scalar quantization of GPTQ, thereby achieving better accuracy. Table 2 reveals that VPTQ significantly outperforms GPTQ under extremely low-bit quantization.

Moreover, since GPTVQ quantizes multiple columns simultaneously, propagating quantization errors to unquantized columns is more challenging. Furthermore, the quantization errors in GPTVQ accumulate as the vector length increases, hindering GPTVQ from using longer vector lengths for weight compression (limited to only 1-4 bits), which significantly reduces the compression ratio of VQ. VPTQ, on the other hand, is capable of compressing weights using longer vectors (> 8 bits) and representing data with a larger codebook. Table 2 shows the better accuracy achieved by VPTQ over GPTVQ.

3.2 Optimization in VPTQ

3.2.1 Hessian-Weighted Centroid Initialization

The VPTQ algorithm requires the initialization of the centroids in the codebooks prior to quantization. Properly initialized centroids reduce quantization error and improve model accuracy. A straightforward method is to perform K-means clustering on the weight matrix to obtain centroids (Eq. 2). However, this does not consider the optimization objective in Eq. 1, leading to a significant accuracy drop (van Baalen et al., 2024; Egiazarian et al., 2024).
We can transform the optimization objective by leveraging the cyclic property of matrix traces and the Hadamard product. We refine the optimization objective as:

∆W^T ∆W ⊙ H = Σ_{i=0}^{n−1} h_{i,i} ∥∆W_{:,i}∥² + Σ_{i=0}^{n−1} Σ_{j=0, j≠i}^{n−1} h_{i,j} (∆W_{:,i} ∆W_{:,j})

Because the Hessian matrix is predominantly diagonal (Dong et al., 2020), this guides us to split the proxy error into two terms. The first term collects the dominant diagonal elements of the initial error matrix, which significantly impact the quantization error. The second term captures the interaction of a single weight's quantization with the others. Since the Hessian matrix is predominantly diagonal, we can prioritize optimizing the first term through centroid initialization. We can view the first term as a Weighted K-means Clustering problem (Cordeiro de Amorim and Mirkin, 2012; Kerdprasop et al., 2005; Liu et al., 2017). Since this problem is well studied, we can solve it directly to achieve efficient and accurate centroid initialization.
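As a sketch, this Hessian-weighted initialization can be expressed as a weighted k-means over the reshaped weight vectors, with each vector weighted by the Hessian diagonal of its source column. The use of scikit-learn here is an illustrative assumption rather than the paper's implementation, and the helper assumes M is divisible by v.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_centroids(W, h_diag, v, k):
    """Hessian-weighted centroid initialization (sketch of Sec. 3.2.1).

    W:      (M, N) weight matrix
    h_diag: (N,) diagonal of the Hessian, h_ii
    v:      vector length
    k:      number of centroids
    """
    M, N = W.shape
    # Reshape the weights into length-v vectors, column by column.
    vecs = W.T.reshape(-1, v)                       # (M*N/v, v)
    # Weight each vector by the Hessian diagonal of its source column,
    # so columns with larger h_ii pull the centroids harder.
    weights = np.repeat(h_diag, M // v)
    km = KMeans(n_clusters=k, n_init=10).fit(vecs, sample_weight=weights)
    return km.cluster_centers_                      # (k, v) codebook
```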
3.2.2 Residual Vector Quantization

We enable Residual Vector Quantization (RVQ) (Barnes et al., 1996; Wei et al., 2014) in VPTQ. RVQ improves vector quantization (VQ) by breaking the compression of a weight matrix down into two (or more) stages. Each stage further compresses the residual error v_res = v − Q(v) from the previous quantization stage:

Q(v_res) = arg min_i ∥(v − Q(v)) − C_i^{res}∥²

Unlike GPTVQ, VPTQ enables RVQ, which quantizes the VQ quantization error with a separate lookup table for better representation and quantization. By partitioning the encoding into multiple stages and reducing quantization error, RVQ not only achieves superior compression efficiency but also balances quantization error, lookup table size, and the memory requirements for indices. During the decoding phase, VPTQ simply reads the centroids from these multiple lookup tables and combines them to reconstruct the original weight matrix.
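A minimal two-stage RVQ sketch (illustrative only): the residual of the first-stage quantization is quantized again with a second, independent codebook, and decoding simply sums the two looked-up centroids.

```python
import numpy as np

def nearest(vecs, codebook):
    dist = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)

def rvq_encode(vecs, C1, C2):
    """Two-stage residual vector quantization."""
    idx1 = nearest(vecs, C1)                 # first stage: Q(v)
    residual = vecs - C1[idx1]               # v_res = v - Q(v)
    idx2 = nearest(residual, C2)             # second stage: Q(v_res)
    return idx1, idx2

def rvq_decode(idx1, idx2, C1, C2):
    # Dequantization is just two table lookups and an add.
    return C1[idx1] + C2[idx2]
```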
3.2.3 Outlier Elimination

Recent studies on quantization in LLMs have consistently observed a significant presence of outliers in activations (Xiao et al., 2023; Lin et al., 2023; Lee et al., 2024). Although outliers make up only a small portion (~1%) of the matrix, they heavily affect the quantization error and thus model accuracy. Outliers typically result in large values in the diagonal elements of the Hessian matrix. During centroid initialization in Sec. 3.2.1, VPTQ already considers these Hessian diagonals as weights in K-means, allowing VPTQ to better quantize the error introduced by outliers:

Q(v_{outlier}) = arg min_i ∥v_{outlier} − C_i^{outlier}∥²

Furthermore, VPTQ flexibly partitions the weight matrix and uses a separate outlier lookup table to quantize the matrix tiles most affected by outliers. This allows us to effectively trade off model accuracy and quantization overhead.
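A sketch of how outlier columns might be selected from the Hessian diagonal before quantization; the helper and its 1% default follow the proportion quoted above and are illustrative, not the paper's API.

```python
import numpy as np

def select_outlier_columns(H, fraction=0.01):
    """Return the indices of the columns with the largest Hessian diagonals.

    These columns contribute most to the proxy quantization error and are
    quantized with a separate outlier codebook (Sec. 3.2.3).
    """
    n = H.shape[0]
    n_outliers = max(1, int(round(fraction * n)))
    return np.argsort(np.diag(H))[-n_outliers:]
```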
4 End-to-End Quantization Algorithm

In this section, we detail the end-to-end model quantization algorithm (Algorithm 2). The algorithm takes the original model, the vector length v, the centroid number k, and the Hessian matrices H as inputs. It iterates over each layer l of the model. Because each layer's quantization only depends on the current layer and its Hessian matrix, we can fully parallelize the quantization of the layers on GPUs.

Algorithm 2 End-to-End Quantization Algorithm
Require: original model, vector length v, centroid number k, Hessian matrices H
Ensure: quantized model
  for each layer l do    ▷ Fully parallelized across layers on GPUs
    for each Linear operator do
      if outlier is enabled then
        Initialize outlier centroids C_{outlier}
        W′_{outlier} ← VPTQ(W_{outlier}, C_{outlier})
      end if
      Initialize centroids C
      W′ ← VPTQ(W, C)
      if residual is enabled then
        Initialize residual centroids C_{res}
        W′′ ← VPTQ(W − W′, C_{res})
      end if
    end for
    if finetune layer is enabled then
      Finetune layer l
    end if
  end for

In each layer, we first quantize the weight of each Linear Operator (the matrix multiplication of input and weight). If the outlier option is enabled, the algorithm first selects outlier columns following Section 3.2 and initializes the outlier centroids C_{outlier}. VPTQ is then applied to the outlier weights W_{outlier} using the outlier centroids, generating the quantized weights W′_{outlier}. Next, the algorithm initializes the centroids C for the remaining columns and applies VPTQ to the weights W using these centroids to produce the quantized weights W′. Lastly, if residual quantization is enabled, the algorithm initializes the residual centroids C_{res} and applies VPTQ to the residual error between the original and quantized weights (W − W′), using the residual centroids; the quantized weight is updated as W′′.

After processing all the operators, the algorithm fine-tunes layer l if layer fine-tuning is enabled. The loss function is the Mean Squared Error (MSE) between the original and quantized computations. In layer-wise fine-tuning, we only update the normalization operator (e.g., RMSNorm) and the centroids. These parameters comprise only a small fraction of the entire layer, so the fine-tuning completes quickly with limited memory. After each layer finishes quantization and fine-tuning, we can further fine-tune the entire model, as other PTQ methods do (Tseng et al., 2024; Chee et al., 2023; Egiazarian et al., 2024). Once the algorithm has processed all layers, it outputs the quantized model. The end-to-end VPTQ algorithm quantizes all the weights of every Linear Operator in the model into an index and a codebook (C). During model inference, we only need to dequantize the weight matrix by reading centroids from the codebook according to the index before executing the operator.
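The control flow for the Linear operators of one layer can be summarized in a short Python sketch. `init_centroids` and `vptq` stand for the routines of Sections 3.2.1 and 3.1 and are passed in as callables; outlier handling and layer-wise fine-tuning are omitted for brevity. This is an assumption-laden illustration, not the released pipeline.

```python
import numpy as np

def quantize_layer(weights, hessians, init_centroids, vptq, use_residual=True):
    """Sketch of the per-Linear-operator loop inside Algorithm 2 for one layer.

    weights:        {name: (M, N) array}, one entry per Linear operator
    hessians:       {name: (N, N) array}, the matching Hessian proxies
    init_centroids: callable for Hessian-weighted initialization (Sec. 3.2.1)
    vptq:           callable implementing Algorithm 1, returns the quantized matrix
    """
    quantized = {}
    for name, W in weights.items():
        H = hessians[name]
        C = init_centroids(W, np.diag(H))          # main codebook
        W_q = vptq(W, H, C)
        if use_residual:
            # Residual stage (Sec. 3.2.2): quantize W - W_q with its own codebook
            # and add the two reconstructions at decode time.
            C_res = init_centroids(W - W_q, np.diag(H))
            W_q = W_q + vptq(W - W_q, H, C_res)
        quantized[name] = W_q                      # layer-wise fine-tuning would follow
    return quantized
```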
5 Experiments and Evaluations

5.1 Settings

Algorithm Baseline: We focus on weight-only quantization. The detailed quantization parameters (such as vector length and codebook numbers) and the fine-tuning parameters of VPTQ are shown in Appendix B. Following (Frantar et al., 2023), our calibration data consists of 128 random segments of the C4 dataset (Raffel et al., 2020).

Models and Datasets: We benchmark accuracy on the LLaMA-2 (Touvron et al., 2023) and LLaMA-3 (Meta, 2024) families and on Mistral (Jiang et al., 2023). Following previous work (Frantar et al., 2023), we report perplexity on language modeling tasks (WikiText-2 (Merity et al., 2016), C4 (Raffel et al., 2020)). We also employ lm-eval-harness (Gao et al., 2021) to perform zero-shot evaluations on common sense QA benchmarks (PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC (Clark et al., 2018)). Detailed configuration is in Appendix A.

Baselines: For LLaMA-2 and Mistral models, we compare VPTQ against GPTQ, GPTVQ, DB-LLM, QuIP#, and AQLM. To account for the different overheads resulting from varying codebook constructions, we provide results with comparable bitwidths to facilitate a fair comparison. For LLaMA-3 models, we use the results of (Huang et al., 2024). However, due to alignment issues with the C4 dataset, we only show results for WikiText and QA tasks. Because the LLaMA-3 models are new and running the quantization ourselves is costly, we do not have results for QuIP# and AQLM.

5.2 Accuracy Evaluation

Results on the LLaMA-2 model: We compare VPTQ with QuIP#, AQLM, GPTVQ, DB-LLM, and GPTQ on the LLaMA-2 model. First, we discuss the results of 2-bit quantization. As shown in Table 2, GPTQ, as a scalar quantization method, performs poorly with unusable accuracy. While DB-LLM and GPTVQ perform better, they still experience significant performance drops, with WikiText-2 perplexity increasing by 2. The significant accuracy drop in GPTVQ, despite being a vector quantization algorithm, is due to two factors: the use of shorter vector lengths, which introduces higher quantization loss, and the choice to update weights every v columns, which leads to cumulative errors. Therefore, we primarily focus on comparing VPTQ with the state-of-the-art QuIP# and AQLM, which both choose longer vector lengths.

Table 2 includes the average scores for the five QA tasks mentioned in Section 5.1. VPTQ outperforms QuIP# and AQLM on the 7B and 13B models. For the 7B model, VPTQ further reduces WikiText-2 perplexity by 0.5 and 0.3 compared to the previous best results at 2-2.02 bits and 2.26-2.29 bits, respectively. In QA tasks, the VPTQ 2.26-bit model surpasses the AQLM 2.29-bit model with an average accuracy increase of 1%. For the 13B model, the VPTQ 2.02-bit model shows a slight improvement over QuIP#, and the 2.18-bit model outperforms AQLM in QA accuracy by 1.5%. On the LLaMA-2-70B model, we achieve similar perplexity (< 0.02) and comparable QA results (< 0.4%).
Table 2: LLaMA-2 2-bit quantization results. "N/A" stands for "not available"; further explanation is provided in Appendix A.1. ¹ We use naive Torch and Triton kernels for the inference performance evaluation, without optimizations such as CUDA graphs, FlashAttention, or Torch compile; the inference numbers for QuIP# and AQLM therefore do not represent their performance with all optimizations enabled, and both can achieve high performance when fully optimized.

(a) 7B results
Method   bit    W2↓    C4↓    AvgQA↑  tok/s↑  mem(GB)↓  cost(h)↓
FP16     16     5.12   6.63   62.2    38.32   27.22     N/A
GPTQ     2.125  50.75  36.76  39.16   19.59   4.42      0.2
GPTVQ    2.25   6.71   9.9    56.14   N/A     N/A       1.5
DB-LLM   2.01   7.23   9.62   55.1    N/A     N/A       N/A
QuIP#¹   2      6.19   8.16   58.2    4.4     2.25      N/A
AQLM¹    2.02   6.64   8.56   56.5    19.4    2.16      N/A
AQLM¹    2.29   6.29   8.11   58.6    19.6    2.4       11.07
VPTQ     2.02   6.13   8.07   58.2    39.9    2.28      2
VPTQ     2.26   5.95   7.87   59.4    35.7    2.48      2.2
Table 3: LLaMA-3 and Mistral-7B 2-, 3-, and 4-bit quantization results: LLaMA-3 WikiText-2 perplexity (context length 2048) and average zero-shot QA accuracy; Mistral-7B WikiText-2 and C4 perplexity (context length 8192) and average zero-shot QA accuracy. Detailed per-task scores are in Table 6 and Table 7.

LLaMA-3 8B / LLaMA-3 70B
Method   bit    W2↓       AvgQA↑  |  bit    W2↓   AvgQA↑
FP16     16     6.14      68.6    |  16     2.9   75.3
QuIP     4      6.5       67.1    |  4      3.4   74.5
GPTQ     4      6.5       67.3    |  4      3.3   74.9
VPTQ     4.03   6.42      68.1    |  4.05   3.15  74.7
QuIP     3      7.5       63.7    |  3      4.7   72.6
GPTQ     3      8.2       61.7    |  3      5.2   70.6
VPTQ     3.03   6.97      66.7    |  3.01   3.81  73.7
QuIP     2      85.1      36.8    |  2      13    48.7
DB-LLM   2      13.6      51.7    |  N/A    N/A   N/A
GPTQ     2      2.10E+02  36.2    |  2      11.9  45.4
VPTQ     2.08   9.29      60.2    |  2.02   5.6   70.9
VPTQ     2.24   9.19      62.7    |  2.07   5.66  70.7

Mistral 7B
Method   bit    W2↓   C4↓   AvgQA↑
FP16     16.0   4.77  5.71  68.6
QuIP#    4.01   4.85  5.79  68.7
AQLM     4.02   4.85  5.79  68.0
GPTQ     4.125  4.83  5.74  68.4
VPTQ     4.03   4.81  5.72  68.2
AQLM     3.0    5.07  5.97  67.3
VPTQ     3.03   4.96  5.84  67.3
QuIP#    2.01   6.02  6.84  62.2
AQLM     2.01   6.32  6.93  62.2
GPTQ     2.125  1535  164   44.5
GPTVQ    2.25   8.99  18.6  57.7
VPTQ     2.04   5.64  6.43  63.2
The results for 3- and 4-bit quantization shown in Table 5 are without end-to-end fine-tuning but are still comparable to AQLM and QuIP#, which include end-to-end fine-tuning. The ablation study of the quantization parameters is in Appendix C.

Results on the LLaMA-3 and Mistral models: Table 3 presents VPTQ results on the LLaMA-3 and Mistral-7B models. In all 2-, 3-, and 4-bit quantizations of LLaMA-3 models, we significantly outperform GPTQ, DB-LLM, and QuIP, whose accuracy drops to unusable levels. VPTQ keeps the accuracy drop below 8% for the 8B model and below 5% for the 70B model. On the Mistral-7B model, our 2-bit result surpasses both QuIP# and AQLM by 0.8% in QA accuracy. In 3-bit quantization, our perplexity is lower. At 4-bit, results are comparable overall. More detailed results are in Table 7. As the bitwidth increases, the advantage of vector quantization diminishes, with GPTQ showing a similar WikiText-2 perplexity at 4-bit.

Inference throughput and quantization cost: In Table 2, the 'tok/s' column indicates the number of tokens generated per second during the decode phase of inference. VPTQ achieves a 2-9× speedup compared to QuIP#, because QuIP# applies the Hadamard Transform during decoding, which introduces O(n²) multiplications and additions and significantly slows inference. Compared to AQLM, VPTQ uses a smaller codebook, resulting in lower decoding overhead; therefore, our inference throughput for the 7B and 13B models is 1.6-1.8× higher than AQLM's. As the model size increases, our codebook size becomes comparable to theirs, leading to similar inference throughput for the 70B model. The 'mem(GB)' column reports the GPU memory usage at runtime. The 'cost(h)' column reports the hours required for model quantization on 4× 80GB A100 GPUs. We achieve comparable or even better results than AQLM in only 10.4-18.6% of the quantization algorithm execution time.

6 Conclusion

In this paper, we propose Vector Post-Training Quantization (VPTQ), a novel approach to achieving extremely low-bit quantization of LLMs through Vector Quantization. Through the application of Second-Order Optimization, we have formulated the LLM Vector Quantization problem and directed the design of our quantization algorithm. By further refining the weights via Channel-Independent Second-Order Optimization, we have enabled a more granular VQ. VPTQ also includes a brief and effective codebook initialization algorithm, obtained by decomposing the optimization problem. We have extended VPTQ to support residual and outlier quantization, which not only improves model accuracy but also further compresses the model size.

Our experimental results demonstrate the effectiveness and efficiency of VPTQ. The perplexity of the quantized model is reduced by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, and 11-22% on LLaMA-3 on QA tasks. Furthermore, we achieved these results using only 10.4-18.6% of the quantization algorithm execution time, leading to a 1.6-1.8× increase in inference throughput compared to SOTA. These results underscore the potential of VPTQ as an efficient and powerful solution for the deployment and inference of LLMs, particularly in resource-constrained settings.

7 Limitations

Related research on PTQ (Egiazarian et al., 2024; Tseng et al., 2024; van Baalen et al., 2024) has adopted end-to-end model fine-tuning after the PTQ phase. Compared to other related works, VPTQ quantizes the model better in the PTQ phase itself, and it simplifies and reduces the cost and overhead of model fine-tuning.

Due to GPU resource constraints, we cannot fine-tune larger models (70B) for more iterations and more tokens. This limits our experimental results, which only reach parity with the baselines on 70B models, and restricts the demonstration of VPTQ's advantages and potential on large models in this paper. We will strive for more GPU resources to fine-tune the VPTQ model for longer and with more tokens in the future, allowing for a fair comparison.

Additionally, since the LLaMA-3 models are the most recently released models, there is a lack of baselines from related works, making it difficult for us to fully demonstrate our performance improvements. We will continue to add more baselines in the future to highlight the advantages of VPTQ.

In this paper, we only use AI tools for grammar checking and code completion.
Acknowledgement

We thank James Hensman for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research. We also thank the QuIP# and AQLM authors for inspiring our paper and for their guidance on implementation.

References

C.F. Barnes, S.A. Rizvi, and N.M. Nasrabadi. 1996. Advances in residual vector quantization: a review. IEEE Transactions on Image Processing, 5(2):226–262.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439.

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. 2023. QuIP: 2-bit quantization of large language models with guarantees.

Ting Chen, Lala Li, and Yizhou Sun. 2020. Differentiable product quantization for end-to-end embedding compression. In International Conference on Machine Learning, pages 1617–1626. PMLR.

Minsik Cho, Keivan Alizadeh-Vahid, Saurabh Adya, and Mohammad Rastegari. 2022. DKM: Differentiable k-means clustering layer for neural network compression. In The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

Renato Cordeiro de Amorim and Boris Mirkin. 2012. Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering. Pattern Recognition, 45(3):1061–1075.

Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. 2024. Extreme compression of large language models via additive quantization.

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. OPTQ: Accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net.

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, et al. 2021. A framework for few-shot language model evaluation. Version v0.0.1, September.

A. Gersho. 1979. Asymptotically optimal block quantization. IEEE Transactions on Information Theory, 25(4):373–380.

Babak Hassibi and David G. Stork. 1992. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5 (NIPS 1992), pages 164–171. Morgan Kaufmann.

Babak Hassibi, David G. Stork, and Gregory J. Wolff. 1993. Optimal brain surgeon and general network pruning. In Proceedings of the IEEE International Conference on Neural Networks, pages 293–299. IEEE.

Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, and Michele Magno. 2024. How good are low-bit quantized LLaMA3 models? An empirical study.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.

Kittisak Kerdprasop, Nittaya Kerdprasop, and Pairote Sattayatham. 2005. Weighted k-means for density-biased clustering. In International Conference on Data Warehousing and Knowledge Discovery, pages 488–497. Springer.

Yann LeCun, John S. Denker, and Sara A. Solla. 1989. Optimal brain damage. In Advances in Neural Information Processing Systems 2 (NIPS 1989), pages 598–605. Morgan Kaufmann.

Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. 2024. OWQ: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2024), pages 13355–13364. AAAI Press.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. AWQ: Activation-aware weight quantization for LLM compression and acceleration. CoRR, abs/2306.00978.

Hongfu Liu, Junjie Wu, Tongliang Liu, Dacheng Tao, and Yun Fu. 2017. Spectral ensemble clustering via weighted k-means: Theoretical and practical evidence. IEEE Transactions on Knowledge and Data Engineering, 29(5):1129–1143.

Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. 2024. The era of 1-bit LLMs: All large language models are in 1.58 bits. CoRR, abs/2402.17764.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

AI Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. Meta AI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1):5485–5551.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.

Sidak Pal Singh and Dan Alistarh. 2020. WoodFisher: Efficient second-order approximation for neural network compression. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. 2021. Training with quantization noise for extreme model compression. In The Ninth International Conference on Learning Representations, ICLR 2021. OpenReview.net.

Pierre Stock, Armand Joulin, Rémi Gribonval, Benjamin Graham, and Hervé Jégou. 2020. And the bit goes down: Revisiting the quantization of neural networks. In The Eighth International Conference on Learning Representations, ICLR 2020. OpenReview.net.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. 2024. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks.

Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. 2024. GPTVQ: The blessing of dimensionality for LLM quantization.

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. 2023. BitNet: Scaling 1-bit transformers for large language models. CoRR, abs/2310.11453.

Benchang Wei, Tao Guan, and Junqing Yu. 2014. Projected residual vector quantization for ANN search. IEEE MultiMedia, 21(3):41–51.

Guangxuan Xiao, Ji Lin, Mickaël Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, ICML 2023, Proceedings of Machine Learning Research 202, pages 38087–38099. PMLR.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

A Appendix: All Experiments Results

A.1 Supplementary Explanation for Main Results Table 2

Table 2 shows our main results. Here we explain the "N/A" entries relative to other works.

DB-LLM: Since the authors did not open-source their code, we use the AvgQA results from their paper. However, this number does not align with our FP16 results.

GPTQ: We reproduce the 2-bit results using the official GPTQ repository. As GPTQ quantizes each layer in sequential order, the 'cost(h)' represents the time taken to quantize on a single A100 GPU.

GPTVQ: They do not release their 2-bit quantized models. We reproduce the LLaMA-2, LLaMA-3 7B and 13B, and Mistral-7B 2-bit results using their released GPTVQ code, which only supports single-GPU execution. Therefore, the quantization cost reflects the execution time on a single A100 GPU. Due to the lack of logic for loading their quantizers in the released code, we were unable to measure throughput and runtime memory.
AQLM: Their 1.97-bit LLaMA-2-13B model has not been open-sourced, so we are unable to measure its inference throughput and runtime memory.

QuIP#: Due to recent changes in the libraries they depend on, the quantization cost is not measured. The quantization time for the 70B model is estimated from their original paper.

A.2 All Experimental Results

In this section, we present all our experimental results, including the perplexity of the quantized models at different context lengths on two datasets, WikiText-2 and C4, and the accuracy on five commonsense QA tasks (abbreviated AE for Arc_easy, AC for Arc_challenge, HE for HellaSwag, QA for PIQA, and WI for WinoGrande). Table 4 displays all results for LLaMA-2 at 2-bit quantization. Table 5 presents results for LLaMA-2 at 3- and 4-bit quantization. Table 6 displays all results for LLaMA-3 at 2-, 3-, and 4-bit quantization. Table 7 shows all results for Mistral-7B at 2-, 3-, and 4-bit quantization.
B Quantitative Analysis of Quantization Parameter Settings

Quantization configuration: The quantization parameters of all VPTQ 2-bit models are shown in Table 8.

Layer-wise fine-tuning parameters: Layer-wise fine-tuning trains the centroids and layer norms using the input and output of each layer when 128 samples of the C4 training set are fed into the full-precision model. We train each layer for 100 iterations. Table 9 shows the learning rate and batch size used for each model.

C Ablation Study

Table 10 shows results for LLaMA-2-13B on WikiText-2 and C4 (sequence length = 4096) under different quantization parameters. We discuss the impact of techniques such as vector length, channel-independent optimization, residual vector quantization, outlier elimination, layer-wise fine-tuning, and end-to-end fine-tuning on the quantization results.

C.1 Parameter Description

When performing N% outlier elimination, N% of outliers are quantized using a codebook with vector length v0 and k0 centroids. For the remaining (100−N)% of parameters, the vector length is v1. k1 represents the number of centroids in the first codebook, while k2 represents the number of centroids in the second codebook for residual vector quantization; k2 = −1 indicates no residual vector quantization.

C.2 Vector Length and Residual Vector Quantization

Compression Ratio Calculation: The average bitwidth per element of the index matrix obtained through vector quantization is:

Average index bitwidth = log2(k1)/v1 + log2(k2)/v1

The compression ratio is calculated as:

Compression ratio = Total original model bits / (Codebook bits + Index bits)

For an original linear weight matrix with M parameters,

Codebook bits = (v0 × k0 + v1 × (k1 + k2)) × 16

Index bits = M × N% × log2(k0)/v0 + M × (100 − N)% × (log2(k1)/v1 + log2(k2)/v1)

The total bitwidth in the table is calculated per transformer block, which for LLaMA-2 includes 4 attention linear layers and 3 FFN linear layers.
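The bit accounting above can be sketched with a small helper. The helper is illustrative (it uses k2 = 0 rather than −1 to mean "no residual codebook"), and the example call at the bottom plugs in a hypothetical 4096 × 4096 matrix with the 2-bit main codebook of Table 8 rather than reproducing a specific table entry.

```python
import math

def vptq_bits(M, outlier_pct, v0, k0, v1, k1, k2=0):
    """Bit accounting for one weight matrix with M parameters (Appendix C.2).

    outlier_pct: N, the percentage of parameters quantized with the outlier codebook
    v0, k0:      outlier vector length and codebook size
    v1, k1, k2:  main vector length and first/second (residual) codebook sizes
                 (k2 = 0 means no residual quantization in this sketch)
    Returns (codebook_bits, index_bits, equivalent bits per parameter).
    """
    res_bits = math.log2(k2) if k2 else 0.0
    codebook_bits = (v0 * k0 + v1 * (k1 + k2)) * 16            # centroids stored in FP16
    index_bits = (M * outlier_pct / 100 * math.log2(k0) / v0
                  + M * (100 - outlier_pct) / 100 * (math.log2(k1) + res_bits) / v1)
    return codebook_bits, index_bits, (codebook_bits + index_bits) / M

print(vptq_bits(M=4096 * 4096, outlier_pct=1, v0=4, k0=8192, v1=12, k1=4096, k2=4096))
```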
Table 4: LLaMA-2 2-bit quantization results. ¹ We use naive Torch and Triton kernels for the inference performance evaluation, without optimizations such as CUDA graphs, FlashAttention, or Torch compile; the inference numbers for QuIP# and AQLM do not represent their performance with all optimizations enabled.
Impact of Vector Length: First, we discuss the impact of vector length on accuracy. In Table 10, rows #2, #3, #4, and #6 show results for v1 = 2, 4, 6, 8, keeping the average index bitwidth at 2 (i.e., log2(k1)/v1 = 2). As v1 increases, the perplexity on WikiText-2 and C4 decreases, but the codebook size also increases exponentially. For v1 = 8 and k1 = 65536, the codebook overhead introduces an additional 0.19 bits. Then, we evaluate the model inference throughput in Table 11. Since we employ weight-only quantization, the main additional overhead of quantized-model inference comes from the lookup table for model weights. Table 11 shows the throughput of 2-bit models under various settings. As the vector length increases (from 2 to 6), the granularity of memory access for reading the lookup table during dequantization increases, which allows memory access to match the GPU's cache line (128 bytes @ L1). This reduces memory access transactions and decreases cache misses. As the vector length increases further (from 8 to 12) along with the size and number of levels of the codebook, the codebook no longer fits in the L1 cache, thereby reducing the model's inference speed. Additionally, we find that a reasonable setting (e.g., v = 6, k = 4096) can achieve throughput similar to the original model for the quantized model, demonstrating the efficiency of the VPTQ design.

Residual Vector Quantization: Without any fine-tuning, rows #4 and #7 show similar perplexities for v1 = 6, k1 = 4096 and v1 = 12, k1 = k2 = 4096, with the latter even higher. However, after layer-wise fine-tuning, comparing rows #11 and #13, residual vector quantization (RVQ) reduces the perplexity by 0.3 compared to plain vector quantization (VQ) due to the increased number of finetunable centroids, showing a significant improvement.

Channel-Independent Second-Order Optimization: The comparison against row #5, which disables it, indicates that channel-independent second-order optimization effectively mitigates quantization error accumulation.

C.4 Outlier Elimination

Rows #4, #8, #9, and #10 represent the results for eliminating 0%, 1%, 2%, and 5% outliers, respectively. We used a codebook with v0 = 4 and k0 = 4096 to quantize N% of outliers, achieving an effective average index bitwidth of 3 bits, while the other parameters were 2 bits. A higher N% means more parameters are quantized with 3 bits, leading to a larger total bitwidth and lower perplexity.

C.5 Fine-tuning

Rows #4, #11, and #12 show results without any fine-tuning, with layer-wise fine-tuning, and with end-to-end fine-tuning, respectively. Adding fine-tuning reduced the perplexity on WikiText-2 from 6.29 to 6.07 and further to 5.32.
Table 6: LLaMA-3 WikiText-2 perplexity (context length 2048) and zero-shot QA accuracy.

LLaMA-3 8B
Method   bit    W2↓       AC↑   AE↑   HE↑   QA↑   WI↑
FP16     16     6.14      50.3  80.1  60.2  79.6  73.1
QuIP     4      6.5       47.4  78.2  58.6  78.2  73.2
GPTQ     4      6.5       47.7  78.8  59.0  78.4  72.6
VPTQ     4.03   6.42      49.1  78.8  59.3  78.7  74.8
QuIP     3      7.5       41.0  72.9  55.4  76.8  72.5
GPTQ     3      8.2       37.7  70.5  54.3  74.9  71.1
VPTQ     3.03   6.97      45.8  77.5  58.4  78.2  73.4
QuIP     2      85.1      21.3  29.0  29.2  52.9  51.7
DB-LLM   2      13.6      28.2  59.1  42.1  68.9  60.4
GPTQ     2      2.10E+02  19.9  28.8  27.7  53.9  50.5
VPTQ     2.08   9.29      36.9  71.0  52.2  75.1  65.9
VPTQ     2.24   9.19      42.6  73.2  53.1  75.4  69.1

LLaMA-3 70B
Method   bit    W2↓   AC↑   AE↑   HE↑   QA↑   WI↑
FP16     16     2.9   60.1  87.0  66.3  82.4  80.8
QuIP     4      3.4   58.7  86.0  65.7  82.5  79.7
GPTQ     4      3.3   58.4  86.3  66.1  82.9  80.7
VPTQ     4.05   3.15  59.0  86.1  66.2  82.4  79.8
QuIP     3      4.7   54.9  83.3  63.9  82.3  78.4
GPTQ     3      5.2   52.1  79.6  63.5  80.6  77.1
VPTQ     3.01   3.81  57.3  84.7  65.5  81.7  79.2
QuIP     2      13    26.5  48.9  40.9  65.3  61.7
DB-LLM   N/A    N/A   N/A   N/A   N/A   N/A   N/A
GPTQ     2      11.9  24.6  38.9  41.0  62.7  59.9
VPTQ     2.02   5.6   52.5  81.8  61.7  80.4  77.9
VPTQ     2.07   5.66  54.2  83.6  61.8  80.1  74.0
Table 7: Mistral-7B-v0.1 WikiText-2 and C4 perplexity (context lengths 2048 and 8192) and zero-shot QA accuracy.

Method   bit    W2(2k)  W2(8k)  C4(8k)  AC     AE     HE     QA     WI
FP16     16     5.25    4.77    5.71    48.89  78.87  61.12  80.3   73.88
GPTVQ    4.125  5.38    4.87    6.13    50     80.43  60.36  79.65  73.4
QuIP#    4      —       4.85    5.79    49.4   78.96  60.62  80.41  73.95
AQLM     4.02   —       4.85    5.79    48.21  77.86  60.27  79.71  73.8
GPTQ     4.125  5.36    4.83    5.74    49.57  79.5   60.38  79.54  72.85
VPTQ     4.03   5.36    4.81    5.72    48.12  77.82  60.61  80.14  74.19
GPTVQ    3.125  6.42    6.8     13.28   40.78  75.67  54.18  77.42  67.4
AQLM     3.04   —       5.07    5.97    46.67  77.61  59.31  80.14  72.69
GPTQ     3.125  6.02    5.88    6.86    47.35  77.86  58.84  79.82  71.74
VPTQ     3.03   5.53    4.96    5.84    46.67  77.95  59.91  79.49  72.45
QuIP#    2      —       6.02    6.84    39.76  72.14  52.95  76.71  69.3
AQLM     2.01   —       6.32    6.93    40.44  73.65  52.13  76.01  68.75
GPTVQ    2.25   8.2     8.99    18.6    37.37  71     45.43  70.18  64.33
GPTQ     2.125  280     1535    164     24.49  44.91  36.56  63.33  52.96
VPTQ     2.04   6.32    5.64    6.43    41.13  72.22  56.1   77.91  68.67
Table 8: Parameters for 2-bit quantization of the LLaMA and Mistral models. v denotes the vector length, k the codebook size (k1 and k2 correspond to the two codebooks), and group num the number of groups into which product quantization (PQ) divides the matrix.

Model        bit    Outlier N%  Outlier v  Outlier k  v   k1    k2    group num
LLaMA2-7b    2.02   0           -          -          6   4096  -     1
LLaMA2-7b    2.26   1           4          8192       12  4096  4096  4
LLaMA2-13b   2.02   0           -          -          6   4096  -     1
LLaMA2-13b   2.18   2           4          8192       12  4096  4096  4
LLaMA2-70b   2.07   1           4          8192       12  4096  4096  4
LLaMA2-70b   2.11   1           4          8192       12  4096  4096  8
LLaMA3-8b    2.08   1           4          4096       12  4096  4096  1
LLaMA3-8b    2.24   1           4          8192       6   4096  -     16
LLaMA3-70b   2.02   0           -          -          12  4096  4096  1
LLaMA3-70b   2.07   1           4          4096       6   4096  -     16
Group number: We also evaluate splitting the weight matrix into 1, 2, 4, and 8 groups, where each group has its own independent codebook. When divided into 1, 2, and 4 groups, the perplexity on WikiText-2 does not change much, likely because the distribution of the remaining parameters (after removing 1% of outliers) is relatively uniform. This is likely because the distributions of different groups overlap after grouping, so the benefit of increasing the group number is limited.

D Inference Evaluation

D.1 Throughput Measurement Process

We follow the throughput measurement method used in AQLM (Egiazarian et al., 2024). During the prompt phase, we provide 1 token and then have the model generate 256 tokens, measuring the generation time per output token to determine the throughput in tokens per second (tok/s).
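A sketch of this measurement loop with Hugging Face Transformers is given below. The model identifier is a placeholder, a CUDA GPU is assumed, and the snippet mirrors the protocol described above rather than reproducing the exact benchmarking harness.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/quantized-model"          # placeholder path
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

# 1-token prompt, then generate 256 tokens and time the decode phase.
inputs = tok("a", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"{256 / elapsed:.1f} tok/s")           # decode-phase throughput
```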
Table 10: Ablation Study on Different Quantization Techniques for LLaMA-2 13B