
A Survey on Model Compression for Large Language Models

Xunyu Zhu1,2, Jian Li1, Yong Liu3, Can Ma1,2, Weiping Wang1,2
1 Institute of Information Engineering, Chinese Academy of Sciences
2 School of Cyber Security, University of Chinese Academy of Sciences
3 Gaoling School of Artificial Intelligence, Renmin University of China
{zhuxunyu, lijian9026, macan, wangweiping}@iie.ac.cn, [email protected]

arXiv:2308.07633v2 [cs.CL] 17 Aug 2023

Abstract

Large Language Models (LLMs) have revolutionized natural language processing tasks with remarkable success. However, their formidable size and computational demands present significant challenges for practical deployment, especially in resource-constrained environments. As these challenges become increasingly pertinent, the field of model compression has emerged as a pivotal research area to alleviate these limitations. This paper presents a comprehensive survey that navigates the landscape of model compression techniques tailored specifically for LLMs. Addressing the imperative need for efficient deployment, we delve into various methodologies, encompassing quantization, pruning, knowledge distillation, and more. Within each of these techniques, we highlight recent advancements and innovative approaches that contribute to the evolving landscape of LLM research. Furthermore, we explore benchmarking strategies and evaluation metrics that are essential for assessing the effectiveness of compressed LLMs. By providing insights into the latest developments and practical implications, this survey serves as an invaluable resource for both researchers and practitioners. As LLMs continue to evolve, this survey aims to facilitate enhanced efficiency and real-world applicability, establishing a foundation for future advancements in the field.

1 Introduction

Large Language Models (LLMs) [Zhao et al., 2023; Huang and Chang, 2023; Chang et al., 2023] consistently exhibit remarkable performance across various tasks. Nevertheless, their exceptional capabilities come with significant challenges stemming from their extensive size and computational requirements. For instance, the GPT-175B model [Brown et al., 2020], with an impressive 175 billion parameters, demands a minimum of 320GB (using multiples of 1024) of storage in half-precision (FP16) format. Furthermore, deploying this model for inference necessitates at least five A100 GPUs, each featuring 80GB of memory, to efficiently manage operations. To tackle these issues, a prevalent approach known as model compression [Deng et al., 2020; He et al., 2018] offers a solution. Model compression involves transforming a large, resource-intensive model into a compact version suitable for storage on constrained mobile devices. Additionally, it can involve optimizing the model for faster execution with minimal latency or achieving a balance between these objectives.

Apart from their technical aspects, LLMs have triggered discussions on environmental and ethical matters. These models pose significant challenges for engineers and researchers in developing nations, where limited resources can impede access to essential hardware for model execution [Lin et al., 2023]. Additionally, the substantial energy consumption of LLMs contributes to carbon emissions, underscoring the significance of sustainable practices in AI research. A promising solution to these challenges lies in utilizing model compression techniques, which have showcased the potential to reduce emissions without substantially compromising performance [Luccioni et al., 2022]. By implementing model compression, we can tackle environmental concerns, enhance accessibility, and promote inclusivity in LLM deployment.

In our paper, our primary objective is to illuminate the recent strides made in the domain of model compression techniques tailored specifically for LLMs. Our work entails an exhaustive survey of methodologies, metrics, and benchmarks, which we meticulously organize into an innovative taxonomy. As illustrated in Figure 1, our proposed taxonomy provides a structured framework for understanding the landscape of model compression methods for LLMs. This exploration encompasses a thorough examination of well-established techniques, including but not limited to pruning, knowledge distillation, quantization, and low-rank factorization. Furthermore, our study sheds light on prevailing challenges and offers a glimpse into potential future research trajectories in this evolving field. We advocate for collaborative efforts within the community to pave the way for an ecologically conscious, all-encompassing, and sustainable future for LLMs. Notably, our work stands as the inaugural survey specifically addressing the realm of model compression for LLMs.
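The storage figure quoted above for GPT-175B follows directly from parameter count times bytes per parameter. A minimal sketch of that arithmetic (illustrative only; the model size and the precisions shown are assumptions taken from the discussion above, not output of any cited method):

```python
def param_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the parameters alone, in GiB (multiples of 1024)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / (1024 ** 3)

# GPT-175B example from the text: 175 billion parameters stored in FP16 (16 bits).
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {param_memory_gib(175e9, bits):.0f} GiB")
# 16-bit gives roughly 326 GiB, i.e. the "minimum of 320GB" cited above, and the 8-bit
# and 4-bit rows hint at why the quantization methods surveyed in Section 2.3 are attractive.
```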
Model Compression for Large Language Models
- Pruning
  - Unstructured Pruning: SparseGPT [Frantar and Alistarh, 2023], LoRAPrune [Zhang et al., 2023a], Wanda [Sun et al., 2023]
  - Structured Pruning: LLM-Pruner [Ma et al., 2023]
- Knowledge Distillation
  - Standard KD: MINILLM [Gu et al., 2023], GKD [Agarwal et al., 2023]
  - EA-based KD
    - In-Context Learning: In-Context Learning distillation [Huang et al., 2022]
    - Chain-of-Thought: MT-COT [Li et al., 2022], Fine-tune-CoT [Ho et al., 2023], DISCO [Chen et al., 2023], SCOTT [Wang et al., 2023a], SOCRATIC CoT [Shridhar et al., 2023]
    - Instruction Following: Lion [Jiang et al., 2023]
- Quantization
  - Quantization-Aware Training: LLM-QAT [Liu et al., 2023]
  - Quantization-Aware Fine-tuning: PEQA [Kim et al., 2023], QLORA [Dettmers et al., 2023a]
  - Post-Training Quantization
    - Weight Quantization: LUT-GEMM [Park et al., 2022], LLM.int8() [Dettmers et al., 2022], ZeroQuant [Yao et al., 2022], GPTQ [Frantar et al., 2022], AWQ [Lin et al., 2023], OWQ [Lee et al., 2023], SpQR [Dettmers et al., 2023b]
    - Weight and Activation Quantization: SmoothQuant [Xiao et al., 2022], RPTQ [Yuan et al., 2023], OliVe [Guo et al., 2023], Outlier Suppression+ [Wei et al., 2023], MoFQ [Zhang et al., 2023c], ZeroQuant-FP [Wu et al., 2023]
- Low-Rank Factorization: LoRAPrune (Low-Rank Factorization + Pruning) [Zhang et al., 2023a], ZeroQuant-FP (Low-Rank Factorization + Quantization) [Wu et al., 2023]

Figure 1: Taxonomy of Model Compression methods for Large Language Models.
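Several of the pruning criteria listed first in the taxonomy (and detailed in Section 2.1) reduce to scoring individual weights and zeroing out the lowest-scoring ones. The following is a minimal illustrative sketch, not the reference implementation of any cited method, comparing plain magnitude scoring with a Wanda-style score (weight magnitude times input-activation norm); the tensor shapes and calibration statistics are hypothetical:

```python
from typing import Optional
import torch

def unstructured_prune(weight: torch.Tensor, sparsity: float,
                       act_norm: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Zero out the lowest-scoring weights of a linear layer (out_features x in_features).

    score = |W|            -> plain magnitude pruning
    score = |W| * ||x||_2  -> Wanda-style score, where act_norm holds per-input-channel
                              activation norms estimated on a small calibration set.
    """
    score = weight.abs()
    if act_norm is not None:
        score = score * act_norm                      # broadcast: column j scaled by ||x_j||
    k = int(sparsity * weight.numel())
    threshold = score.flatten().kthvalue(k).values    # k-th smallest score
    mask = score > threshold
    return weight * mask

# Toy usage with hypothetical shapes: a 512x512 layer pruned to 50% unstructured sparsity.
W = torch.randn(512, 512)
x_norm = torch.rand(512)                              # stand-in for calibration activation norms
W_pruned = unstructured_prune(W, sparsity=0.5, act_norm=x_norm)
print((W_pruned == 0).float().mean())                 # ~0.5
```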

2 Methods

2.1 Pruning
Pruning is a powerful technique to reduce the size or complexity of a model by removing unnecessary or redundant components [LeCun et al., 1989; Han et al., 2015; Li et al., 2017]. Many parameters are redundant and have little or no effect on the performance of the model, so the model's performance drops only minimally when these redundant parameters are pruned directly. At the same time, pruning can make the model more storage-friendly [Ardakani et al., 2019], memory-efficient [Han et al., 2015; Yang et al., 2017], and computation-efficient [Li et al., 2017]. Pruning can be divided into Unstructured Pruning [Zhang et al., 2018; Gordon et al., 2020] and Structured Pruning [Anwar et al., 2017; Fang et al., 2023]. The main difference between structured pruning and unstructured pruning lies in the pruning targets and the resulting network structure. Structured pruning removes connections or hierarchical structures based on specific rules while preserving the overall network structure. On the other hand, unstructured pruning prunes individual parameters, resulting in an irregular sparse structure. Recent research efforts have been devoted to combining LLMs with pruning techniques, aiming to tackle the substantial size and computational costs associated with LLMs. In this section, we systematically categorize these works based on whether they employ structured or unstructured pruning strategies.

Unstructured Pruning
Unstructured pruning simplifies an LLM by removing specific parameters without considering its internal structure. This approach targets individual weights or neurons in the LLM, usually by applying a threshold to zero out parameters below it. However, this method disregards the overall LLM structure, resulting in an irregular sparse model composition. Such irregularity demands specialized compression techniques for efficient storage and computation of the pruned model. Unstructured pruning often involves substantial retraining of the LLM to regain accuracy, which is especially expensive for LLMs. An innovative approach in this domain is SparseGPT [Frantar and Alistarh, 2023]. It introduces a one-shot pruning strategy that doesn't require retraining. The method frames pruning as an extensive sparse regression problem and solves it using an approximate sparse regression solver. SparseGPT achieves significant unstructured sparsity, even up to 60% on the largest GPT models like OPT-175B and BLOOM-176B, with minimal increase in perplexity. Contrasting this, Syed et al. propose an iterative pruning technique that fine-tunes the model during pruning with minimal training steps. Another advancement is LoRAPrune [Zhang et al., 2023a], which combines parameter-efficient tuning (PEFT) methods with pruning to enhance performance on downstream tasks. It introduces a unique parameter importance criterion using values and gradients from Low-Rank Adaption (LoRA) [Hu et al., 2022]. In response to the resource-intensive weight update process still required by SparseGPT, Wanda [Sun et al., 2023] presents a new pruning metric. Wanda evaluates each weight based on the product of its magnitude and the norm of the corresponding input activations, approximated using a small calibration dataset. This metric is employed for local comparisons within linear layer outputs, enabling the removal of lower-priority weights from LLMs.

Structured Pruning
Structured pruning simplifies an LLM by removing entire structural components, such as neurons, channels, or layers. This approach targets whole sets of weights at once, offering the advantage of reducing model complexity and memory usage while maintaining the overall LLM structure intact. An example in this realm is LLM-Pruner [Ma et al., 2023],
which takes a versatile approach to compressing LLMs while safeguarding their multi-task solving and language generation capabilities. LLM-Pruner also tackles the challenges that arise from the substantial training data used for LLMs, which can lead to significant data transfers and post-training model sizes. To overcome these challenges, LLM-Pruner incorporates a dependency detection algorithm to pinpoint interdependent structures within the model. It also implements an efficient importance estimation method that considers both first-order information and approximated Hessian information. This strategy aids in selecting optimal groups for pruning, thereby improving the compression process.

Figure 2: A brief classification of knowledge distillation for LLMs.

2.2 Knowledge Distillation
Knowledge Distillation (KD) [Hinton et al., 2015; Kim and Rush, 2016; Tung and Mori, 2019] is a valuable machine learning technique aimed at improving model performance and generalization. It achieves this by transferring knowledge from a complex model, referred to as the teacher model, to a simpler counterpart known as the student model. The core idea behind KD involves transforming the comprehensive knowledge of the teacher model into a more streamlined and effective representation. In this section, we offer an overview of distillation methods that employ LLMs as teachers. We classify these methods based on whether their emphasis is on distilling the Emergent Abilities (EA) of LLMs into small language models (SLMs). Consequently, we divide these methods into two distinct categories: Standard KD and EA-based KD. For a visual representation, Figure 2 provides a brief classification of knowledge distillation for LLMs.

Standard KD
Standard KD focuses on enabling student models to learn the common knowledge possessed by LLMs, such as output distributions and feature information. This approach is similar to vanilla KD [Gou et al., 2021; Park et al., 2019; Zhao et al., 2022; Liu et al., 2021a], but with the distinction that the teacher models are LLMs. An illustrative example is MINILLM [Gu et al., 2023], which delves into distillation from white-box generative LLMs. It observes a challenge with minimizing forward Kullback-Leibler divergence (KLD): this can lead to overly high probabilities in unlikely areas of the teacher's distribution, causing improbable samples during free-run generation. To address this, MINILLM opts for minimizing reverse KLD. This approach prevents the student from overestimating low-probability regions within the teacher's distribution, thereby refining the quality of generated samples. In contrast, GKD [Agarwal et al., 2023] explores distillation from auto-regressive models, where white-box generative LLMs are a subset. This method identifies two key issues: a distribution mismatch between output sequences during training and those generated by the student during deployment, and model under-specification, where the student model might lack the expressive power to match the teacher's distribution. GKD handles the distribution mismatch by sampling output sequences from the student during training. It also tackles model under-specification by optimizing alternative divergences like reverse KL.

Figure 3: Overview of EA-based KD. (a) In-Context Learning distillation, (b) Chain-of-Thought distillation, (c) Instruction Following distillation.

EA-based KD
EA-based KD goes beyond transferring common knowledge from LLMs to also encompass distilling their distinctive emergent abilities. Recent research [Wei et al., 2022a; Schaeffer et al., 2023; Zhao et al., 2023] underscores that despite the emphasis on augmenting model size, LLMs like GPT-3 (175B parameters) and PaLM (540B parameters) showcase unique behaviors when compared to smaller models like BERT (330M parameters) and GPT-2 (1.5B parameters). These LLMs exhibit surprising capabilities, referred to as Emergent Abilities, when tackling intricate tasks. Emergent Abilities encompass several intriguing facets, including In-Context Learning (ICL) [Dong et al., 2023; Wang et al., 2023b], Chain-of-Thought (CoT) [Wei et al., 2022b; Wang et al., 2023c; Shi et al., 2023], and Instruction Following (IF) [Ouyang et al., 2022; Brooks et al., 2023]. For a visual overview, refer to Figure 3, which provides a concise representation of the EA-based Knowledge Distillation concept.

ICL employs a structured natural language prompt that contains task descriptions and possibly a few task examples as demonstrations. Through these task examples, LLMs can grasp and perform new tasks without necessitating explicit gradient updates. The work by Huang et al. introduces ICL distillation, which transfers in-context few-shot learning and language modeling capabilities from LLMs to SLMs.
This is accomplished by combining in-context learning objectives with traditional language modeling objectives. To achieve this, they explore ICL distillation under two few-shot learning paradigms: Meta In-context Tuning (Meta-ICT) and Multitask In-context Tuning (Multitask-ICT). In Meta-ICT, the language model undergoes meta-training across diverse tasks using in-context learning objectives. This equips it to adapt to unseen tasks through in-context learning, thereby extending its problem-solving capabilities. On the other hand, Multitask-ICT fine-tunes the model using ICL objectives and a handful of examples from target tasks. Subsequently, it employs in-context learning for making predictions on these tasks. Comparing the two paradigms, Multitask-ICT exhibits superior performance over Meta-ICT. However, it does demand greater computational resources during task adaptation, making it computationally more intensive.

CoT takes a different approach compared to ICL by incorporating intermediate reasoning steps, which can lead to the final output, into the prompts instead of using simple input-output pairs. MT-COT [Li et al., 2022] aims to leverage the explanations produced by LLMs to enhance the training of smaller reasoners. It utilizes a multi-task learning framework to empower smaller models with strong reasoning capabilities alongside the ability to generate explanations. Fine-tune-CoT [Ho et al., 2023] takes a step further by generating multiple reasoning solutions from LLMs through stochastic sampling. This augmentation of training data aids student models in their learning process. Researchers like Fu et al. identify a trade-off between the multi-dimensional capabilities of language models and propose fine-tuning an instruction-tuned model. They distill CoT reasoning paths from a large teacher model to improve out-of-distribution generalization. Hsieh et al. employ LLM rationales as additional guidance for training smaller models within a multi-task framework. SOCRATIC CoT [Shridhar et al., 2023] trains two distilled models: a problem decomposer and a subproblem solver. The decomposer breaks down an original problem into a sequence of subproblems, while the subproblem solver handles solving these subproblems. DISCO [Chen et al., 2023] introduces a fully-automatic counterfactual knowledge distillation approach based on LLMs. It engineers prompts to generate phrasal perturbations using LLMs, then filters these through a task-specific teacher model to extract high-quality counterfactual data. For rationale faithfulness, SCOTT [Wang et al., 2023a] employs contrastive decoding, which links each rationale to the answer. It encourages relevant rationales from the teacher. Additionally, the student is guided to engage in counterfactual reasoning and predict based on rationales that lead to different answers.

IF endeavors to enhance the competence of language models in executing new tasks solely based on reading task descriptions, without relying on few-shot examples. By undergoing fine-tuning using an array of tasks expressed as instructions, language models showcase the capacity to accurately execute tasks described in previously unseen instructions. For instance, Lion [Jiang et al., 2023] harnesses the adaptable nature of LLMs to improve student model performance. It prompts the LLM to identify and generate the "hard" instructions, which are then utilized to enhance the student model's capabilities. This approach taps into the versatility of LLMs to guide the learning of student models in addressing complex instructions and tasks.

2.3 Quantization
In the domain of model compression, quantization has emerged as a widely embraced technique to alleviate the storage and computational overhead of deep learning models [Liu et al., 2021b; Gholami et al., 2022; Guo et al., 2020]. While traditional representation employs floating-point numbers, quantization converts them to integers or other discrete forms. This transformation significantly reduces storage requirements and computational complexity. Although some precision loss is inherent, careful quantization techniques can achieve substantial model compression with only minimal accuracy degradation. Quantization can be categorized into three main approaches: quantization-aware training (QAT) [Tailor et al., 2021; Kim et al., 2022; Ding et al., 2022], quantization-aware fine-tuning (QAF) [Cai et al., 2019; Dong et al., 2019], and post-training quantization (PTQ) [Liu et al., 2021b; Nagel et al., 2020; Fang et al., 2020]. The primary distinction among these approaches lies in when quantization is applied to compress the model. QAT employs quantization during the model's training process, QAF applies quantization during fine-tuning of a pretrained model, and PTQ quantizes a model after it has completed its training. Recent research endeavors have harnessed quantization to compress LLMs, yielding impressive outcomes. These efforts are classified into the three mentioned approaches: Quantization-Aware Training, Quantization-Aware Fine-tuning, and Post-Training Quantization. Furthermore, Table 1 serves as a summarized reference for quantization methods applied to LLMs. The table classifies these works into 8-bit quantization and lower-bit quantization, based on the number of bits (precision) in the weights of the LLM.

Quantization-Aware Training
In QAT, the quantization objective is seamlessly integrated into the model's training process. This approach enables the LLM to adapt to low-precision representations during training, enhancing its capacity to handle precision loss caused by quantization. This adaptation aims to preserve higher performance even after the quantization process. For instance, LLM-QAT [Liu et al., 2023] delves into the challenges of acquiring training data for LLMs. Given that gathering training data for LLMs can be demanding, LLM-QAT proposes an innovative solution. It leverages generations produced by a pretrained model to achieve data-free distillation. This approach significantly aids in circumventing the data collection challenge. Additionally, LLM-QAT goes a step further by quantizing not only weights and activations but also key-value (KV) caches. This strategy aims to enhance throughput and support longer sequence dependencies. A noteworthy achievement of LLM-QAT is its ability to distill large LLaMA models with quantized weights and KV caches down to just 4 bits. This groundbreaking result demonstrates the feasibility of producing accurate 4-bit quantized LLMs.
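The mechanic shared by QAT approaches is simulating quantization in the forward pass while letting gradients flow as if the rounding step were the identity (a straight-through estimator), so the model learns to tolerate the precision loss. The following is a minimal sketch of that idea only, not LLM-QAT's actual training recipe; the bit-width, layer size, and loss are illustrative assumptions:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulate symmetric per-tensor quantization in the forward pass.

    Rounding has zero gradient almost everywhere, so the straight-through estimator
    (w + (q - w).detach()) returns the quantized value forward but passes gradients
    through unchanged, letting training adapt to the precision loss.
    """
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for signed 4-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax      # symmetric (absmax) scale
    q = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return w + (q - w).detach()                       # forward: q, backward: identity

# Toy training step: the layer is optimized *through* its quantized weights.
layer = torch.nn.Linear(16, 16)
x, target = torch.randn(8, 16), torch.randn(8, 16)
out = torch.nn.functional.linear(x, fake_quantize(layer.weight, bits=4), layer.bias)
loss = torch.nn.functional.mse_loss(out, target)
loss.backward()                                       # layer.weight.grad is populated
```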
Precision: Methods

8-bit quantization: LUT-GEMM [Park et al., 2022], LLM.int8() [Dettmers et al., 2022], ZeroQuant [Yao et al., 2022], SmoothQuant [Xiao et al., 2022]

lower-bit quantization: LLM-QAT [Liu et al., 2023], PEQA [Kim et al., 2023], QLORA [Dettmers et al., 2023a], GPTQ [Frantar et al., 2022], AWQ [Lin et al., 2023], SpQR [Dettmers et al., 2023b], RPTQ [Yuan et al., 2023], OliVe [Guo et al., 2023], Outlier Suppression+ [Wei et al., 2023], OWQ [Lee et al., 2023], ZeroQuant-FP [Wu et al., 2023]

Table 1: A summary of quantization methods for LLMs. We divide them into 8-bit quantization and lower-bit quantization based on the number of bits (i.e., precision) in the weights of the LLM.
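The 8-bit versus lower-bit split in Table 1 ultimately comes down to how coarsely weights are mapped onto an integer grid. As a rough illustration of that trade-off, a generic absmax round-to-nearest scheme (not the algorithm of any specific method in the table; the matrix size is an arbitrary assumption) and the reconstruction error it introduces:

```python
import torch

def quantize_dequantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-output-channel round-to-nearest quantization of a weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax  # one scale per row
    w_int = torch.round(w / scale).clamp(-qmax, qmax)                 # integers one would store
    return w_int * scale                                              # dequantized view used at inference

W = torch.randn(4096, 4096)
for bits in (8, 4, 3):
    err = (quantize_dequantize(W, bits) - W).pow(2).mean().sqrt()
    print(f"INT{bits}: RMS reconstruction error = {err:.4f}")
# The error grows as the bit-width shrinks, which is why the sub-8-bit methods below
# (e.g. GPTQ, AWQ, SpQR) add error-compensating updates or protect salient/outlier weights.
```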

Quantization-Aware Fine-tuning
QAF involves quantizing the LLM during the fine-tuning process. The primary goal is to ensure that the fine-tuned LLM sustains its performance even after quantization to lower bit-widths. By integrating quantization awareness into fine-tuning, the LLM aims to strike a balance between model compression and retaining its performance. PEQA [Kim et al., 2023] and QLORA [Dettmers et al., 2023a] both fall under the category of quantization-aware Parameter-Efficient Fine-Tuning (PEFT) techniques [Liu et al., 2022a; Ding et al., 2023; Fu et al., 2023b]. These techniques focus on facilitating model compression and accelerating inference. PEQA employs a dual-stage process. In the first stage, each fully-connected layer's parameter matrix is quantized into a matrix of low-bit integers and a scalar vector. In the second stage, fine-tuning occurs on the scalar vector for each specific downstream task. QLORA introduces innovative concepts like a new data type, double quantization, and paged optimizers. These ideas are aimed at conserving memory without compromising performance. QLORA enables large models to undergo fine-tuning on a single GPU while achieving state-of-the-art results on the Vicuna benchmark [Chiang et al., 2023].

Post-Training Quantization
PTQ involves quantizing the parameters of an LLM after the completion of the LLM's training phase. The primary objective of PTQ is to diminish the storage and computational complexity of the LLM, all without necessitating modifications to the LLM architecture or requiring a retraining process. PTQ's key advantage is its simplicity and efficiency in achieving model compression. However, it's important to note that PTQ can introduce a certain degree of precision loss due to the quantization procedure. This method serves as a straightforward way to enhance the efficiency of an LLM without significant alterations or extensive training efforts.

In PTQ, certain approaches focus on quantizing only the weights of LLMs to enhance efficiency and reduce computational demands. Specifically, LUT-GEMM [Park et al., 2022] optimizes matrix multiplications within LLMs using weight-only quantization and the BCQ format [Rastegari et al., 2016], enhancing latency reduction and performance by improving computational efficiency. LLM.int8() [Dettmers et al., 2022] employs 8-bit quantization for matrix multiplication in LLM transformers, effectively halving GPU memory usage during inference while maintaining performance precision. This method employs vector-wise quantization and mixed-precision decomposition to handle outliers for efficient inference. Remarkably, LLM.int8() enables inference in models with up to 175 billion parameters without performance compromise. ZeroQuant [Yao et al., 2022] integrates a hardware-friendly quantization scheme, layer-by-layer knowledge distillation, and optimized quantization support to reduce weight and activation precision in Transformer-based models to INT8 with minimal accuracy impact. GPTQ [Frantar et al., 2022] acknowledges that the methods mentioned above work well for low compression targets like 8-bit weights, but face challenges in maintaining accuracy at higher rates. To tackle these challenges, GPTQ proposes a novel layer-wise quantization technique based on approximate second-order information. The result is a bitwidth reduction to 3 or 4 bits per weight, with minimal accuracy loss compared to the uncompressed version. Dettmers and Zettlemoyer delve into the trade-off between model size and bit precision in LLMs concerning zero-shot performance by analyzing inference scaling laws. Their extensive experimentation across various LLM families reveals that 4-bit precision is nearly universally optimal for achieving the right balance between total model bits and zero-shot accuracy. AWQ [Lin et al., 2023] finds that weights are not equally important for LLMs' performance, and protecting only 1% of salient weights can greatly reduce quantization error. Building on this insight, AWQ employs an activation-aware approach by considering the significance of weight channels corresponding to larger activation magnitudes, which play a pivotal role in processing vital features. The approach incorporates a per-channel scaling technique to identify optimal scaling factors that minimize quantization errors while quantizing all weights. OWQ [Lee et al., 2023] provides a theoretical analysis of how activation outliers can amplify the error in weight quantization. Drawing insights from this analysis, OWQ introduces a mixed-precision quantization scheme, which applies higher precision to the weights that are vulnerable to quantization due to activation outliers. To further compress accurate LLMs to 3-4 bits per parameter while staying near-lossless, SpQR [Dettmers et al., 2023b] identifies and isolates outlier weights, storing them in higher precision, and compresses all other weights to 3-4 bits.

Beyond the above works that quantize only the weights of LLMs, many PTQ works try to quantize both the weights and activations of LLMs. Specifically, SmoothQuant [Xiao et al., 2022] addresses the challenge of quantizing activations, which is often more complex due to the presence of outliers. Observing that different tokens exhibit similar variations across their channels, SmoothQuant introduces a
per-channel scaling transformation that effectively smooths the magnitudes, rendering the model more amenable to quantization. Recognizing the complexity of quantizing activations in LLMs, RPTQ [Yuan et al., 2023] sheds light on the challenge stemming from the uneven ranges across different channels, in addition to the presence of outliers. To address this, RPTQ strategically arranges channels into clusters for quantization, effectively mitigating the discrepancies in channel ranges. Moreover, it integrates the channel reordering into the layer norm operation and linear layer weights to minimize the associated overhead. OliVe [Guo et al., 2023] further adopts an outlier-victim pair (OVP) quantization and handles outlier values locally with low hardware overheads and high performance gains, because it finds that outliers are important while the normal values next to them are not. Outlier Suppression+ [Wei et al., 2023] extends this understanding by confirming that harmful outliers within activations exhibit an asymmetric distribution, predominantly concentrating in specific channels. It introduces a novel strategy involving channel-wise shifting and scaling operations to rectify the asymmetric presentation of outliers and mitigate the impact of problematic channels, and it quantitatively analyzes the optimal values for shifting and scaling, taking into account both the asymmetric nature of the outliers and the quantization errors stemming from the weights of the next layers. ZeroQuant-FP [Wu et al., 2023] explores the applicability of floating-point (FP) quantization, specifically focusing on FP8 and FP4 formats. The study reveals that for LLMs, FP8 activation consistently outperforms its integer counterpart (INT8), while in terms of weight quantization, FP4 demonstrates comparable, if not superior, performance compared to INT4. To address the challenges arising from the divergence between weights and activations, ZeroQuant-FP mandates that all scaling factors be powers of 2 and confines the scaling factors within a single compute group. Notably, ZeroQuant-FP also integrates the Low Rank Compensation (LoRC) strategy to further enhance the effectiveness of its quantization approach.

2.4 Low-Rank Factorization
Low-Rank Factorization [Cheng et al., 2017; Povey et al., 2018; Idelbayev and Carreira-Perpiñán, 2020] is a model compression technique that aims to approximate a given weight matrix by decomposing it into two or more smaller matrices with significantly lower dimensions. The core idea behind low-rank factorization involves finding a factorization of a large weight matrix W into two matrices U and V such that W ≈ UV, where U is an m × k matrix and V is a k × n matrix, with k much smaller than m and n. The product of U and V approximates the original weight matrix, leading to a substantial reduction in the number of parameters and computational overhead. In the field of LLM research, low-rank factorization has been widely adopted to fine-tune LLMs efficiently, e.g., LoRA [Hu et al., 2022] and its variants [Valipour et al., 2023; Zhang et al., 2023b; Chavan et al., 2023]. Different from those works, we focus on works that use low-rank factorization to compress LLMs. In the field of model compression for LLM research, researchers often combine multiple techniques with low-rank factorization, including pruning, quantization, and so on, e.g., LoRAPrune [Zhang et al., 2023a] and ZeroQuant-FP [Wu et al., 2023], to achieve more effective compression while maintaining performance. As research in this area continues, there may be further developments in applying low-rank factorization to compressing LLMs, but further exploration and experimentation are still required to fully harness its potential for LLMs.

3 Metrics and Benchmarks

3.1 Metrics
Inference efficiency of LLMs can be measured using various metrics, which capture different aspects of performance. These metrics are commonly presented alongside accuracy and zero-shot ability to comprehensively evaluate the LLM.

Number of Parameters
Number of parameters [Ma et al., 2023; Dasgupta et al., 2023] in an LLM refers to the total count of learnable weights or variables that the LLM needs to optimize during training. In LLMs, parameters represent the weights in the connections between neurons or attention layers. In general, the more parameters an LLM has, the more expressive it can be, but it also requires more computational resources and memory for both training and inference.

Model Size
Model size [Shridhar et al., 2023; Li et al., 2022; Magister et al., 2023] typically refers to the disk space or memory footprint required to store the entire LLM, including weights, biases, and other necessary components. The model size is closely related to the number of parameters, as more parameters usually lead to a larger model size. However, other factors, like the data type used to represent the parameters and the model architecture, can also influence the overall size.

Compression Ratio
Compression ratio [Frantar and Alistarh, 2023; Tao et al., 2023] represents the ratio between the original size of the uncompressed LLM and the size of the compressed LLM. A higher compression ratio indicates a more efficient compression, as the LLM has been significantly reduced in size while preserving its functionality and performance.

Inference time
Inference time (i.e., latency) [Kurtic et al., 2023; Frantar et al., 2022] measures the time taken by the LLM to process and generate responses for input data during inference or prediction. Inference time is particularly crucial for real-world applications where the LLM needs to respond to user queries or process large amounts of data in real time.

Floating point operations (FLOPs)
FLOPs [Dettmers and Zettlemoyer, 2022; Yuan et al., 2023; Wei et al., 2023] measure the number of arithmetic operations involving floating-point numbers (typically 32-bit or 16-bit) that the LLM performs when processing input data. FLOPs provide a useful way to estimate the computational requirements of an LLM and compare the efficiency of different LLMs or compression techniques.
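A short sketch tying Section 2.4 and the metrics above together: truncated SVD is one standard way to obtain U and V factors of the form described in Section 2.4, and comparing parameter counts before and after factorization yields the compression ratio metric. This is an illustration under assumed sizes and rank, not the procedure of any cited method:

```python
from typing import Tuple
import torch

def low_rank_factorize(W: torch.Tensor, rank: int) -> Tuple[torch.Tensor, torch.Tensor]:
    """Approximate W (m x n) as U_k @ V_k with U_k (m x k) and V_k (k x n) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_k = U[:, :rank] * S[:rank]          # fold singular values into the left factor
    V_k = Vh[:rank, :]
    return U_k, V_k

m, n, k = 4096, 4096, 256                 # illustrative sizes and rank only
W = torch.randn(m, n)
U_k, V_k = low_rank_factorize(W, k)

orig_params = m * n
factored_params = m * k + k * n
print(f"compression ratio: {orig_params / factored_params:.1f}x")     # 8.0x for these sizes
print(f"relative error: {(W - U_k @ V_k).norm() / W.norm():.3f}")
# A random matrix has essentially no low-rank structure, so the error here is large; real
# weight matrices are usually far more compressible, which is what low-rank methods exploit.
```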
3.2 Benchmarks
The main goal of these benchmarks is to measure the effectiveness, efficiency, and accuracy of compressed LLMs in comparison to their uncompressed counterparts. These benchmarks typically consist of diverse tasks and datasets that cover a range of natural language processing challenges.

Common Benchmarks
The majority of research evaluates compressed LLMs on well-established NLP benchmarks. For instance, GLUE [Wang et al., 2019b] and SuperGLUE [Wang et al., 2019a] are designed for evaluating the performance of language models on a wide range of natural language understanding (NLU) tasks. LAMBADA [Paperno et al., 2016] is designed to evaluate the context-dependent understanding of language models. LAMA [Petroni et al., 2019] and StrategyQA [Geva et al., 2021] are both designed to evaluate the reasoning ability of language models. SQuAD [Rajpurkar et al., 2016] is designed for machine reading comprehension (MRC) tasks.

HULK
The HULK benchmark [Zhou et al., 2021] comprehensively assesses energy efficiency in Pre-trained Language Models (PLMs). It employs classic datasets widely used in the research community, evaluating efficiency across tasks like MNLI [Williams et al., 2018], SST-2 [Socher et al., 2013], and CoNLL-2003 [Sang and Meulder, 2003]. This multi-task approach quantifies energy efficiency in the pretraining, fine-tuning, and inference stages, considering the time and cost needed to achieve specific performance levels. HULK sheds light on energy consumption in pretrained models, enhancing practical deployment understanding.

ELUE
The ELUE framework [Liu et al., 2022b] enables comprehensive method comparison with a performance-efficiency trade-off analysis. ELUE integrates six NLP datasets, covering sentiment analysis, natural language inference, similarity, and paraphrase tasks. With four evaluation tracks based on parameter counts (40M, 55M, 70M, and 110M), ELUE employs the ELUE score to measure a model's performance advantage over ElasticBERT across various FLOPs settings. This approach provides a multi-dimensional perspective on model performance and efficiency. ELUE facilitates insightful evaluation and method assessment.

4 Challenges and Future Directions

Specialized Benchmarks
Despite the benchmarks introduced earlier for evaluating model compression, these benchmarks still suffer from several drawbacks. First, the evaluation of model compression lacks a universally accepted standard setting. Different studies often produce models with varying speed-up ratios, parameter counts, and accuracy levels. As a result, direct comparisons between these studies can be challenging, further complicated by hardware differences. Second, common benchmarks, such as LAMA [Petroni et al., 2019] and StrategyQA [Geva et al., 2021], may not be the most suitable representation of typical tasks on a mobile device. Third, benchmarks designed for pretrained models may also not be the most suitable for LLMs. In general, it is very important to design specialized benchmarks for LLMs.

Performance-Size Trade-offs
Prior research [Magister et al., 2023; Dettmers and Zettlemoyer, 2022] highlights the delicate balance between Large Language Model (LLM) performance and model size. Analyzing this trade-off allows for optimal performance within hardware constraints. However, current work lacks theoretical and empirical insights into this trade-off. Future LLM compression research should conduct comprehensive analyses to guide advanced techniques. Understanding the relationship between performance and size empowers researchers to develop tailored compression methods, navigating the design space effectively for efficient solutions.

Dynamic LLM Compression
Despite the advancements in current compression methods, they still rely on manual design to determine the compressed size and structure of LLMs. This often involves a trial-and-error approach based on input data or task requirements. This process becomes particularly challenging in scenarios like knowledge distillation, where several trials are necessary to find suitable student models within computational constraints. This manual effort presents a practical hindrance. A promising solution emerges in the integration of Neural Architecture Search (NAS) techniques [Elsken et al., 2019; Zoph and Le, 2016; Zhu et al., 2021; Zhu et al., 2023] into the realm of compressing LLMs. NAS holds the potential to reduce the dependence on human-designed architectures, potentially revolutionizing LLM compression for improved efficiency and effectiveness.

Explainability
Earlier research [Stanton et al., 2021; Xu et al., 2021] has raised significant concerns regarding the explainability of compression techniques applied to Pre-trained Language Models (PLMs). Notably, these same challenges extend to LLM compression methods as well. Consequently, the integration of explainable compression approaches emerges as a crucial necessity for the progression of LLM compression applications. Moreover, the adoption of explainable compression not only addresses the issue of interpretability but also simplifies the evaluation procedure for compressed models. This, in turn, enhances the reliability and predictability of the models throughout the production phase.

5 Conclusion
In this thorough survey, we've explored model compression techniques for large language models (LLMs). Our coverage spanned compression methods, evaluation metrics, and benchmark datasets. By diving into LLM compression, we've highlighted its challenges and opportunities. As LLM compression advances, there's a clear call for research into advanced methodologies specifically for LLMs, unlocking their potential across applications. This survey aims to be a valuable reference, providing insights into the current landscape and promoting ongoing exploration of this pivotal topic.
References DISCO: distilling counterfactuals with large language
[Agarwal et al., 2023] Rishabh Agarwal, Nino Vieillard, Pi- models. In Anna Rogers, Jordan L. Boyd-Graber, and
otr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Naoaki Okazaki, editors, Proceedings of the 61st Annual
Bachem. GKD: generalized knowledge distillation for Meeting of the Association for Computational Linguistics
auto-regressive sequence models. CoRR, abs/2306.13649, (Volume 1: Long Papers), ACL 2023, Toronto, Canada,
2023. July 9-14, 2023, pages 5514–5528. Association for Com-
putational Linguistics, 2023.
[Anwar et al., 2017] Sajid Anwar, Kyuyeon Hwang, and
Wonyong Sung. Structured pruning of deep convolutional [Cheng et al., 2017] Yu Cheng, Duo Wang, Pan Zhou, and
neural networks. ACM J. Emerg. Technol. Comput. Syst., Tao Zhang. A survey of model compression and acceler-
13(3):32:1–32:18, 2017. ation for deep neural networks. CoRR, abs/1710.09282,
2017.
[Ardakani et al., 2019] Arash Ardakani, Zhengyun Ji,
Sean C. Smithson, Brett H. Meyer, and Warren J. Gross. [Chiang et al., 2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin,
Learning recurrent binary/ternary weights. In 7th In- Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
ternational Conference on Learning Representations, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez,
ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Ion Stoica, and Eric P. Xing. Vicuna: An open-source chat-
OpenReview.net, 2019. bot impressing gpt-4 with 90%* chatgpt quality, March
2023.
[Brooks et al., 2023] Tim Brooks, Aleksander Holynski, and
Alexei A Efros. Instructpix2pix: Learning to follow im- [Dasgupta et al., 2023] Sayantan Dasgupta, Trevor Cohn,
age editing instructions. In Proceedings of the IEEE/CVF and Timothy Baldwin. Cost-effective distillation of large
Conference on Computer Vision and Pattern Recognition, language models. In Anna Rogers, Jordan L. Boyd-Graber,
pages 18392–18402, 2023. and Naoaki Okazaki, editors, Findings of the Associa-
[Brown et al., 2020] Tom B. Brown, Benjamin Mann, Nick tion for Computational Linguistics: ACL 2023, Toronto,
Canada, July 9-14, 2023, pages 7346–7354. Association
Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
for Computational Linguistics, 2023.
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, [Deng et al., 2020] Lei Deng, Guoqi Li, Song Han, Luping
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Shi, and Yuan Xie. Model compression and hardware ac-
Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Win- celeration for neural networks: A comprehensive survey.
ter, Christopher Hesse, Mark Chen, Eric Sigler, Ma- Proc. IEEE, 108(4):485–532, 2020.
teusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, [Dettmers and Zettlemoyer, 2022] Tim Dettmers and Luke
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Zettlemoyer. The case for 4-bit precision: k-bit inference
Sutskever, and Dario Amodei. Language models are few- scaling laws. CoRR, abs/2212.09720, 2022.
shot learners. In Hugo Larochelle, Marc’Aurelio Ran-
zato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien [Dettmers et al., 2022] Tim Dettmers, Mike Lewis, Younes
Lin, editors, Advances in Neural Information Processing Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit ma-
Systems 33: Annual Conference on Neural Information trix multiplication for transformers at scale. CoRR,
Processing Systems 2020, NeurIPS 2020, December 6-12, abs/2208.07339, 2022.
2020, virtual, 2020. [Dettmers et al., 2023a] Tim Dettmers, Artidoro Pagnoni,
[Cai et al., 2019] Han Cai, Tianzhe Wang, Zhanghao Wu, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient
Kuan Wang, Ji Lin, and Song Han. On-device image finetuning of quantized llms. CoRR, abs/2305.14314,
classification with proxyless neural architecture search and 2023.
quantization-aware fine-tuning. In 2019 IEEE/CVF In- [Dettmers et al., 2023b] Tim Dettmers, Ruslan Svirschevski,
ternational Conference on Computer Vision Workshops, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh
ICCV Workshops 2019, Seoul, Korea (South), October 27- Ashkboos, Alexander Borzunov, Torsten Hoefler, and
28, 2019, pages 2509–2513. IEEE, 2019. Dan Alistarh. Spqr: A sparse-quantized representa-
[Chang et al., 2023] Yupeng Chang, Xu Wang, Jindong tion for near-lossless LLM weight compression. CoRR,
Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xi- abs/2306.03078, 2023.
aoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue [Ding et al., 2022] Shaojin Ding, Phoenix Meadowlark,
Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. Yanzhang He, Lukasz Lew, Shivani Agrawal, and Oleg
A survey on evaluation of large language models. CoRR, Rybakov. 4-bit conformer with native quantization aware
abs/2307.03109, 2023. training for speech recognition. In Hanseok Ko and
[Chavan et al., 2023] Arnav Chavan, Zhuang Liu, Deepak K. John H. L. Hansen, editors, Interspeech 2022, 23rd An-
Gupta, Eric P. Xing, and Zhiqiang Shen. One-for-all: Gen- nual Conference of the International Speech Communica-
eralized lora for parameter-efficient fine-tuning. CoRR, tion Association, Incheon, Korea, 18-22 September 2022,
abs/2306.07967, 2023. pages 1711–1715. ISCA, 2022.
[Chen et al., 2023] Zeming Chen, Qiyue Gao, Antoine [Ding et al., 2023] Ning Ding, Yujia Qin, Guang Yang,
Bosselut, Ashish Sabharwal, and Kyle Richardson. Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu,
Yulin Chen, Chi-Min Chan, Weize Chen, Jing Yi, Weilin implicit reasoning strategies. Trans. Assoc. Comput. Lin-
Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei guistics, 9:346–361, 2021.
Chen, Yang Liu, Jie Tang, Juanzi Li, and Maosong Sun. [Gholami et al., 2022] Amir Gholami, Sehoon Kim, Zhen
Parameter-efficient fine-tuning of large-scale pre-trained
Dong, Zhewei Yao, Michael W Mahoney, and Kurt
language models. Nat. Mac. Intell., 5(3):220–235, 2023.
Keutzer. A survey of quantization methods for efficient
[Dong et al., 2019] Zhen Dong, Zhewei Yao, Amir Gholami, neural network inference. In Low-Power Computer Vision,
Michael W. Mahoney, and Kurt Keutzer. HAWQ: hes- pages 291–326. Chapman and Hall/CRC, 2022.
sian aware quantization of neural networks with mixed-
precision. In 2019 IEEE/CVF International Conference [Gordon et al., 2020] Mitchell A. Gordon, Kevin Duh, and
on Computer Vision, ICCV 2019, Seoul, Korea (South), Nicholas Andrews. Compressing BERT: studying the ef-
October 27 - November 2, 2019, pages 293–302. IEEE, fects of weight pruning on transfer learning. In Spandana
2019. Gella, Johannes Welbl, Marek Rei, Fabio Petroni, Patrick
S. H. Lewis, Emma Strubell, Min Joon Seo, and Han-
[Dong et al., 2023] Qingxiu Dong, Lei Li, Damai Dai, naneh Hajishirzi, editors, Proceedings of the 5th Workshop
Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing on Representation Learning for NLP, RepL4NLP@ACL
Xu, Lei Li, and Zhifang Sui. A survey for in-context learn- 2020, Online, July 9, 2020, pages 143–155. Association
ing. CoRR, abs/2301.00234, 2023. for Computational Linguistics, 2020.
[Elsken et al., 2019] Thomas Elsken, Jan Hendrik Metzen, [Gou et al., 2021] Jianping Gou, Baosheng Yu, Stephen J.
and Frank Hutter. Neural architecture search: A survey. Maybank, and Dacheng Tao. Knowledge distillation: A
J. Mach. Learn. Res., 20:55:1–55:21, 2019. survey. Int. J. Comput. Vis., 129(6):1789–1819, 2021.
[Fang et al., 2020] Jun Fang, Ali Shafiee, Hamzah Abdel- [Gu et al., 2023] Yuxian Gu, Li Dong, Furu Wei, and Minlie
Aziz, David Thorsley, Georgios Georgiadis, and Joseph Huang. Knowledge distillation of large language models.
Hassoun. Post-training piecewise linear quantization for CoRR, abs/2306.08543, 2023.
deep neural networks. In Andrea Vedaldi, Horst Bischof,
Thomas Brox, and Jan-Michael Frahm, editors, Computer [Guo et al., 2020] Ruiqi Guo, Philip Sun, Erik Lindgren,
Vision - ECCV 2020 - 16th European Conference, Glas- Quan Geng, David Simcha, Felix Chern, and Sanjiv Ku-
gow, UK, August 23-28, 2020, Proceedings, Part II, vol- mar. Accelerating large-scale inference with anisotropic
ume 12347 of Lecture Notes in Computer Science, pages vector quantization. In Proceedings of the 37th Interna-
69–86. Springer, 2020. tional Conference on Machine Learning, ICML 2020, 13-
[Fang et al., 2023] Gongfan Fang, Xinyin Ma, Mingli Song, 18 July 2020, Virtual Event, volume 119 of Proceedings
of Machine Learning Research, pages 3887–3896. PMLR,
Michael Bi Mi, and Xinchao Wang. Depgraph: Towards
2020.
any structural pruning. CoRR, abs/2301.12900, 2023.
[Frantar and Alistarh, 2023] Elias Frantar and Dan Alistarh. [Guo et al., 2023] Cong Guo, Jiaming Tang, Weiming Hu,
Sparsegpt: Massive language models can be accurately Jingwen Leng, Chen Zhang, Fan Yang, Yunxin Liu, Minyi
pruned in one-shot. CoRR, abs/2301.00774, 2023. Guo, and Yuhao Zhu. Olive: Accelerating large language
models via hardware-friendly outlier-victim pair quantiza-
[Frantar et al., 2022] Elias Frantar, Saleh Ashkboos, Torsten tion. In Yan Solihin and Mark A. Heinrich, editors, Pro-
Hoefler, and Dan Alistarh. GPTQ: accurate post- ceedings of the 50th Annual International Symposium on
training quantization for generative pre-trained transform- Computer Architecture, ISCA 2023, Orlando, FL, USA,
ers. CoRR, abs/2210.17323, 2022. June 17-21, 2023, pages 3:1–3:15. ACM, 2023.
[Fu et al., 2023a] Yao Fu, Hao Peng, Litu Ou, Ashish Sab- [Han et al., 2015] Song Han, Jeff Pool, John Tran, and
harwal, and Tushar Khot. Specializing smaller lan- William J. Dally. Learning both weights and connections
guage models towards multi-step reasoning. CoRR, for efficient neural network. In Corinna Cortes, Neil D.
abs/2301.12726, 2023. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman
[Fu et al., 2023b] Zihao Fu, Haoran Yang, Anthony Man- Garnett, editors, Advances in Neural Information Process-
Cho So, Wai Lam, Lidong Bing, and Nigel Collier. On ing Systems 28: Annual Conference on Neural Information
the effectiveness of parameter-efficient fine-tuning. In Processing Systems 2015, December 7-12, 2015, Mon-
Brian Williams, Yiling Chen, and Jennifer Neville, edi- treal, Quebec, Canada, pages 1135–1143, 2015.
tors, Thirty-Seventh AAAI Conference on Artificial Intelli- [He et al., 2018] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang,
gence, AAAI 2023, Thirty-Fifth Conference on Innovative Li-Jia Li, and Song Han. AMC: automl for model com-
Applications of Artificial Intelligence, IAAI 2023, Thir- pression and acceleration on mobile devices. In Vitto-
teenth Symposium on Educational Advances in Artificial rio Ferrari, Martial Hebert, Cristian Sminchisescu, and
Intelligence, EAAI 2023, Washington, DC, USA, February Yair Weiss, editors, Computer Vision - ECCV 2018 - 15th
7-14, 2023, pages 12799–12807. AAAI Press, 2023. European Conference, Munich, Germany, September 8-
[Geva et al., 2021] Mor Geva, Daniel Khashabi, Elad Segal, 14, 2018, Proceedings, Part VII, volume 11211 of Lec-
Tushar Khot, Dan Roth, and Jonathan Berant. Did aris- ture Notes in Computer Science, pages 815–832. Springer,
totle use a laptop? A question answering benchmark with 2018.
[Hinton et al., 2015] Geoffrey E. Hinton, Oriol Vinyals, and [Kim et al., 2022] Minsoo Kim, Sihwa Lee, Sukjin Hong,
Jeffrey Dean. Distilling the knowledge in a neural net- Du-Seong Chang, and Jungwook Choi. Understanding and
work. CoRR, abs/1503.02531, 2015.
[Ho et al., 2023] Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 14852–14882. Association for Computational Linguistics, 2023.
[Hsieh et al., 2023] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 8003–8017. Association for Computational Linguistics, 2023.
[Hu et al., 2022] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[Huang and Chang, 2023] Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 1049–1065. Association for Computational Linguistics, 2023.
[Huang et al., 2022] Yukun Huang, Yanda Chen, Zhou Yu, and Kathleen R. McKeown. In-context learning distillation: Transferring few-shot learning ability of pre-trained language models. CoRR, abs/2212.10670, 2022.
[Idelbayev and Carreira-Perpiñán, 2020] Yerlan Idelbayev and Miguel Á. Carreira-Perpiñán. Low-rank compression of neural nets: Learning the rank of each layer. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8046–8056. Computer Vision Foundation / IEEE, 2020.
[Jiang et al., 2023] Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. Lion: Adversarial distillation of closed-source large language model. CoRR, abs/2305.12870, 2023.
[Kim and Rush, 2016] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1317–1327. The Association for Computational Linguistics, 2016.
improving knowledge distillation for quantization aware training of large transformer encoders. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 6713–6725. Association for Computational Linguistics, 2022.
[Kim et al., 2023] Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. CoRR, abs/2305.14152, 2023.
[Kurtic et al., 2023] Eldar Kurtic, Elias Frantar, and Dan Alistarh. Ziplm: Hardware-aware structured pruning of language models. CoRR, abs/2302.04089, 2023.
[LeCun et al., 1989] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989], pages 598–605. Morgan Kaufmann, 1989.
[Lee et al., 2023] Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. OWQ: lessons learned from activation outliers for weight quantization in large language models. CoRR, abs/2306.02272, 2023.
[Li et al., 2017] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
[Li et al., 2022] Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, Wenhu Chen, and Xifeng Yan. Explanations from large language models make small reasoners better. CoRR, abs/2210.06726, 2022.
[Lin et al., 2023] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: activation-aware weight quantization for LLM compression and acceleration. CoRR, abs/2306.00978, 2023.
[Liu et al., 2021a] Yuanxin Liu, Fandong Meng, Zheng Lin, Weiping Wang, and Jie Zhou. Marginal utility diminishes: Exploring the minimum knowledge for BERT knowledge distillation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 2928–2941. Association for Computational Linguistics, 2021.
[Liu et al., 2021b] Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. Post-training quantization for vision transformer. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 28092–28103, 2021.
[Liu et al., 2022a] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In NeurIPS, 2022.
[Liu et al., 2022b] Xiangyang Liu, Tianxiang Sun, Junliang He, Jiawen Wu, Lingling Wu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, and Xipeng Qiu. Towards efficient NLP: A standard evaluation and A strong baseline. In Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 3288–3303. Association for Computational Linguistics, 2022.
[Liu et al., 2023] Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: data-free quantization aware training for large language models. CoRR, abs/2305.17888, 2023.
[Luccioni et al., 2022] Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Estimating the carbon footprint of bloom, a 176b parameter language model. CoRR, abs/2211.02001, 2022.
[Ma et al., 2023] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. CoRR, abs/2305.11627, 2023.
[Magister et al., 2023] Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adámek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 1773–1781. Association for Computational Linguistics, 2023.
[Nagel et al., 2020] Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 7197–7206. PMLR, 2020.
[Ouyang et al., 2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
[Paperno et al., 2016] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016.
[Park et al., 2019] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 3967–3976. Computer Vision Foundation / IEEE, 2019.
[Park et al., 2022] Gunho Park, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. nuqmm: Quantized matmul for efficient inference of large-scale generative language models. CoRR, abs/2206.09557, 2022.
[Petroni et al., 2019] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. Language models as knowledge bases? In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 2463–2473. Association for Computational Linguistics, 2019.
[Povey et al., 2018] Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur. Semi-orthogonal low-rank matrix factorization for deep neural networks. In B. Yegnanarayana, editor, Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pages 3743–3747. ISCA, 2018.
[Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392. The Association for Computational Linguistics, 2016.
[Rastegari et al., 2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 525–542. Springer, 2016.
[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 142–147. ACL, 2003.
[Schaeffer et al., 2023] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? CoRR, abs/2304.15004, 2023.
[Shi et al., 2023] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
[Shridhar et al., 2023] Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 7059–7073. Association for Computational Linguistics, 2023.
[Socher et al., 2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1631–1642. ACL, 2013.
[Stanton et al., 2021] Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, and Andrew Gordon Wilson. Does knowledge distillation really work? In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 6906–6919, 2021.
[Sun et al., 2023] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. CoRR, abs/2306.11695, 2023.
[Syed et al., 2023] Aaquib Syed, Phillip Huang Guo, and Vijaykaarti Sundarapandiyan. Prune and tune: Improving efficient pruning techniques for massive language models. In Krystal Maughan, Rosanne Liu, and Thomas F. Burns, editors, The First Tiny Papers Track at ICLR 2023, Tiny Papers @ ICLR 2023, Kigali, Rwanda, May 5, 2023. OpenReview.net, 2023.
[Tailor et al., 2021] Shyam Anil Tailor, Javier Fernández-Marqués, and Nicholas Donald Lane. Degree-quant: Quantization-aware training for graph neural networks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[Tao et al., 2023] Chaofan Tao, Lu Hou, Haoli Bai, Jiansheng Wei, Xin Jiang, Qun Liu, Ping Luo, and Ngai Wong. Structured pruning for efficient generative pre-trained language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 10880–10895. Association for Computational Linguistics, 2023.
[Tung and Mori, 2019] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 1365–1374. IEEE, 2019.
[Valipour et al., 2023] Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. Dylora: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 3266–3279. Association for Computational Linguistics, 2023.
[Wang et al., 2019a] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 3261–3275, 2019.
[Wang et al., 2019b] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[Wang et al., 2023a] Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, and Xiang Ren. SCOTT: self-consistent chain-of-thought distillation. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 5546–5558. Association for Computational Linguistics, 2023.
[Wang et al., 2023b] Xinyi Wang, Wanrong Zhu, and William Yang Wang. Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning. CoRR, abs/2301.11916, 2023.
[Wang et al., 2023c] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
[Wei et al., 2022a] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022, 2022.
[Wei et al., 2022b] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
[Wei et al., 2023] Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. CoRR, abs/2304.09145, 2023.
[Williams et al., 2018] Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics, 2018.
[Wu et al., 2023] Xiaoxia Wu, Zhewei Yao, and Yuxiong He. Zeroquant-fp: A leap forward in llms post-training W4A8 quantization using floating-point formats. CoRR, abs/2307.09782, 2023.
[Xiao et al., 2022] Guangxuan Xiao, Ji Lin, Mickaël Seznec, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. CoRR, abs/2211.10438, 2022.
[Xu et al., 2021] Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian J. McAuley, and Furu Wei. Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 10653–10659. Association for Computational Linguistics, 2021.
[Yang et al., 2017] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6071–6079. IEEE Computer Society, 2017.
[Yao et al., 2022] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. In NeurIPS, 2022.
[Yuan et al., 2023] Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, and Bingzhe Wu. RPTQ: reorder-based post-training quantization for large language models. CoRR, abs/2304.01089, 2023.
[Zhang et al., 2018] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic DNN weight pruning framework using alternating direction method of multipliers. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, volume 11212 of Lecture Notes in Computer Science, pages 191–207. Springer, 2018.
[Zhang et al., 2023a] Mingyang Zhang, Hao Chen, Chunhua Shen, Zhen Yang, Linlin Ou, Xinyi Yu, and Bohan Zhuang. Pruning meets low-rank parameter-efficient fine-tuning. CoRR, abs/2305.18403, 2023.
[Zhang et al., 2023b] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
[Zhang et al., 2023c] Yijia Zhang, Lingran Zhao, Shijie Cao, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, and Ningyi Xu. Integer or floating point? new outlooks for low-bit quantization on large language models. CoRR, abs/2305.12356, 2023.
[Zhao et al., 2022] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 11943–11952. IEEE, 2022.
[Zhao et al., 2023] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. CoRR, abs/2303.18223, 2023.
[Zhou et al., 2021] Xiyou Zhou, Zhiyu Chen, Xiaoyong Jin, and William Yang Wang. HULK: an energy efficiency benchmark platform for responsible natural language processing. In Dimitra Gkatzia and Djamé Seddah, editors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, EACL 2021, Online, April 19-23, 2021, pages 329–336. Association for Computational Linguistics, 2021.
[Zhu et al., 2021] Xunyu Zhu, Jian Li, Yong Liu, Jun Liao, and Weiping Wang. Operation-level progressive differentiable architecture search. In James Bailey, Pauli Miettinen, Yun Sing Koh, Dacheng Tao, and Xindong Wu, editors, IEEE International Conference on Data Mining, ICDM 2021, Auckland, New Zealand, December 7-10, 2021, pages 1559–1564. IEEE, 2021.
[Zhu et al., 2023] Xunyu Zhu, Jian Li, Yong Liu, and Weiping Wang. Improving differentiable architecture search via self-distillation. CoRR, abs/2302.05629, 2023.
[Zoph and Le, 2016] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.