LLM Compression Techniques
Xunyu Zhu¹,², Jian Li¹, Yong Liu³, Can Ma¹,², Weiping Wang¹,²
¹ Institute of Information Engineering, Chinese Academy of Sciences
² School of Cyber Security, University of Chinese Academy of Sciences
³ Gaoling School of Artificial Intelligence, Renmin University of China
{zhuxunyu, lijian9026, macan, wangweiping}@iie.ac.cn, [email protected]
[Taxonomy of quantization and low-rank factorization methods for LLMs]
- Quantization-Aware Training: LLM-QAT [Liu et al., 2023]
- Quantization-Aware Fine-tuning: PEQA [Kim et al., 2023], QLORA [Dettmers et al., 2023a]
- Post-Training Quantization
  - Weight Quantization: LUT-GEMM [Park et al., 2022], LLM.int8() [Dettmers et al., 2022], ZeroQuant [Yao et al., 2022], GPTQ [Frantar et al., 2022], AWQ [Lin et al., 2023], OWQ [Lee et al., 2023], SpQR [Dettmers et al., 2023b]
  - Weight and Activation Quantization: SmoothQuant [Xiao et al., 2022], RPTQ [Yuan et al., 2023], OliVe [Guo et al., 2023], Outlier Suppression+ [Wei et al., 2023], MoFQ [Zhang et al., 2023c], ZeroQuant-FP [Wu et al., 2023]
- Low-Rank Factorization: LoRAPrune (Low-Rank Factorization + Pruning) [Zhang et al., 2023a], ZeroQuant-FP (Low-Rank Factorization + Quantization) [Wu et al., 2023]
[Figure, panels (b) and (c): knowledge distillation via prompting, where In-Context Learning, Chain-of-Thought, and Instruction Following prompts elicit explanations from an LLM over raw data, and the generated explanations are then used to train an SLM]
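For concreteness, a minimal sketch of the generate-explanations-then-train pipeline depicted above is given below; `call_llm`, the prompt template, and the data format are hypothetical placeholders rather than the interface of any particular system described in this survey.

```python
# Minimal sketch of the "generate explanations, then train an SLM" pipeline
# shown in the figure above. `call_llm` is a hypothetical stand-in for
# whatever LLM client is available; the prompt template and record format
# are illustrative assumptions.
from typing import Callable, Dict, List

COT_TEMPLATE = "Question: {question}\nLet's think step by step."

def build_distillation_set(raw_questions: List[str],
                           call_llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Prompt the teacher LLM for chain-of-thought explanations and collect
    (input, explanation) pairs that a smaller student model can be fine-tuned on."""
    examples = []
    for question in raw_questions:
        explanation = call_llm(COT_TEMPLATE.format(question=question))
        examples.append({"input": question, "target": explanation})
    return examples

# Usage (with any LLM client plugged in as `call_llm`):
#   data = build_distillation_set(["What is 17 * 4?"], call_llm=my_llm_client)
# The SLM is then fine-tuned on `data` with a standard seq2seq or causal-LM objective.
```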
Table 1: A summary of quantization methods for LLMs. We divide them into 8-bit quantization and lower-bit quantization based on the number of bits (i.e., precision) used for the weights of the LLM.
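To make the notion of weight precision in Table 1 concrete, the following is a minimal sketch of round-to-nearest, weight-only 8-bit quantization; the function names and the symmetric per-row scaling scheme are illustrative assumptions, not the procedure of any specific method cited above.

```python
# Minimal sketch of round-to-nearest, weight-only 8-bit quantization.
# This is a generic illustration of the weight-quantization idea summarized
# in Table 1, not the exact algorithm of GPTQ, AWQ, or any other cited method.
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-output-row quantization of a 2-D weight matrix."""
    # One scale per output row, chosen so the largest magnitude maps to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)      # avoid division by zero
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_weights(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix for inspection."""
    return q.astype(np.float32) * scales

if __name__ == "__main__":
    w = np.random.randn(4, 8).astype(np.float32)     # toy weight matrix
    q, s = quantize_weights_int8(w)
    w_hat = dequantize_weights(q, s)
    print("max reconstruction error:", np.abs(w - w_hat).max())
```

Lower-bit variants follow the same recipe with a smaller integer range (e.g., [-7, 7] for 4-bit weights), which is the regime where the lower-bit methods listed above introduce additional error-compensation or scaling techniques.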