
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Zhuoshi Pan1†, Qianhui Wu2‡, Huiqiang Jiang2, Menglin Xia2, Xufang Luo2, Jue Zhang2,
Qingwei Lin2, Victor Rühle2, Yuqing Yang2, Chin-Yew Lin2,
H. Vicky Zhao1, Lili Qiu2, Dongmei Zhang2
1Tsinghua University, 2Microsoft Corporation
{qianhuiwu, hjiang, xufang.luo}@microsoft.com

arXiv:2403.12968v1 [cs.CL] 19 Mar 2024

Abstract

This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective.

To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meanwhile introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.

We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.¹

† Work during internship at Microsoft.
‡ Corresponding author.
¹ Code: https://aka.ms/LLMLingua-2

1 Introduction

Recent years have witnessed the emergence of various prompting techniques for large language models (LLMs), such as Chain-of-Thought (CoT) (Wei et al., 2022), In-Context Learning (ICL) (Dong et al., 2023), and Retrieval-Augmented Generation (RAG) (Lewis et al., 2020). These techniques empower LLMs to handle complex and varied tasks through rich and informative prompts that may exceed tens of thousands of tokens. However, the benefits of such lengthy prompts come at the cost of increased computational and financial overhead, as well as degraded information perception ability of LLMs. Prompt compression is a straightforward solution to these issues: it attempts to shorten the original prompts without losing essential information.

Several methods have been proposed to compress prompts in a task-aware manner (Jiang et al., 2023b; Xu et al., 2024; Jung and Kim, 2023; Huang et al., 2023). These techniques aim to generate compressed prompts tailored to the specific task or query, typically resulting in enhanced performance on downstream tasks, particularly in question answering. However, the dependency on task-specific features presents challenges in terms of efficiency and generalizability when deploying these methods. For example, in RAG-style applications, task-aware compression may require compressing the same documents multiple times, depending on the associated queries. More details are discussed in Sec. 2.

Some works have explored task-agnostic prompt compression methods for better generalizability and efficiency (Jiang et al., 2023a; Li et al., 2023). The underlying assumption is that natural language contains redundancy (Shannon, 1951) that may be useful for human understanding but might not be necessary for LLMs. Therefore, they propose to compress prompts by removing tokens (Jiang et al., 2023a) or lexical units (Li et al., 2023) according to their information entropy obtained from a causal small language model (SLM), regardless of the downstream task or question information. However, these task-agnostic methods face two challenges: (i) information entropy is an empirical metric for prompt compression, and relying on it for prompt trimming may be suboptimal because it is not aligned with the prompt compression objective; (ii) causal LMs only leverage unidirectional context, which may fail to capture all essential information needed for prompt compression.

These challenges lead to the following research questions:

Q1. How can we identify or build a suitable dataset to align the SLM towards effective prompt compression?

Q2. How can we design a compression algorithm that effectively leverages the full bidirectional context for better performance?

For Q1, most text compression datasets are abstractive (Toutanova et al., 2016; Koupaee and Wang, 2018; Kim et al., 2019), meaning that they treat prompt compression as a generative task where the original prompts are rephrased into condensed ones. However, this autoregressive generation process is slow and may produce hallucinated content (Zhao et al., 2020). On the other hand, extractive compression datasets such as SentComp (Filippova and Altun, 2013) and DebateSum (Roush and Balaji, 2020) are usually created for the summarization task and often lack detailed information. In the case of prompt compression, this hurts the performance of LLM inference in downstream applications such as QA (see Appendix G for examples). Therefore, it is necessary to construct an extractive text compression dataset that retains essential information.

Contributions. We present this paper to address the above challenges for task-agnostic prompt compression. We make the following contributions:

• We propose a data distillation procedure to derive knowledge from an LLM (GPT-4) to compress prompts without losing crucial information. We introduce an extractive text compression dataset containing pairs of original texts from MeetingBank (Hu et al., 2023) and their compressed versions, and we publicly release the dataset.

• We approach prompt compression as a token classification task (i.e., preserve or discard) and take the predicted probability of each token being labeled preserve as the compression metric. The benefits are threefold: (1) it can capture all essential information needed for prompt compression from the full bidirectional context by using a Transformer encoder for feature extraction; (2) it leads to lower latency, owing to smaller models that explicitly learn the compression objective; (3) it guarantees faithfulness of the compressed prompt to the original content.

• We conduct extensive experiments and analysis on both in-domain (i.e., MeetingBank) and out-of-domain datasets (i.e., LongBench, ZeroScrolls, GSM8K, and Big Bench Hard). Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability from GPT-3.5-Turbo to Mistral-7B. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.

2 Related Works

Depending on whether task information is used for compression, prompt compression methods can be categorized into task-aware and task-agnostic approaches.

Task-aware compression compresses the context based on the downstream task or the current query. For example, LongLLMLingua (Jiang et al., 2023b) applies a question-aware coarse-to-fine compression approach to estimate the information entropy of the tokens and adapts the estimation according to the question. Reinforcement learning (RL) based methods (Jung and Kim, 2023; Huang et al., 2023) usually train a model for prompt compression with reward signals from downstream tasks. Soft prompt tuning methods (Wingate et al., 2022; Mu et al., 2023) typically require fine-tuning for the specific task. Xu et al. (2024) trains a summarization model to compress the context depending on the question. Task-aware compression approaches are usually tailored to specific tasks and compression ratios, which may limit their generalizability in real-world applications.
Figure 1: Overview of LLMLingua-2. The pipeline consists of five steps: (1) data distillation with an LLM, (2) data annotation based on the distilled pairs, (3) quality control and filtering, (4) training a token classifier as the compressor, and (5) prompt compression based on p_preserve. The original figure illustrates the process on a MeetingBank transcript and its compressed version.

Task-agnostic methods compress the prompt without considering the specific task, making them more adaptable to a range of applications and black-box LLMs. However, producing compressed text that generalizes well to different tasks is not trivial. Typical methods use information entropy-based metrics to remove redundant information from the prompt (Li et al., 2023; Jiang et al., 2023a), employing a small language model to estimate token importance. Despite being training-free, these methods may not effectively capture the token importance distribution optimized for specific LLMs and often entail high computation overhead. Summarization-based methods have also been leveraged for task-agnostic compression (Chen et al., 2023; Packer et al., 2023), but they often omit crucial details and do not generalize well. An alternative approach is to compress or trim the hidden states or KV caches of the context (Chevalier et al., 2023; Ge et al., 2023; Zhang et al., 2023; Liu et al., 2023b; Xiao et al., 2024). However, this is orthogonal to our work and cannot be easily applied to black-box LLMs.

3 Dataset Construction

In this section, we outline the process of dataset construction for prompt compression. We first introduce our data distillation procedure, which extracts knowledge from an LLM (GPT-4) to compress texts without losing crucial information or introducing hallucinated content (Sec. 3.1). Leveraging the distilled knowledge from the LLM, we explain our data annotation algorithm, which assigns a label to each word in the original text to indicate whether it should be preserved after compression (Sec. 3.2). To ensure the dataset's quality, we propose two quality control metrics for filtering low-quality samples (Sec. 3.3).

3.1 Data Distillation

To extract knowledge from the LLM for effective prompt compression, our goal is to prompt GPT-4 to generate compressed texts from original texts that meet the following criteria: (i) Token reduction: compressed prompts should be short, to reduce cost and speed up inference. (ii) Informativeness: essential information should be retained. (iii) Faithfulness: compressed prompts should remain faithful and avoid introducing hallucinated content, to ensure accuracy when prompting LLMs in downstream tasks.

However, distilling such data from GPT-4 is challenging, as it does not consistently follow instructions. For instance, Jiang et al. (2023a) experimented with different prompts for compression and found that GPT-4 struggles to retain essential information from the original texts. In our preliminary experiments, we also observed that GPT-4 tends to modify expressions used in the original texts and sometimes generates hallucinated content. To address this challenge, we propose the following dataset distillation procedure.

Instruction Design. A well-crafted instruction is the key to unveiling the compression capabilities of GPT-4. To ensure that the generated texts stay faithful to the original, we explicitly instruct GPT-4 to compress the text by discarding unimportant words from the original text only, without adding any new words during generation.

To ensure token reduction and informativeness, previous studies (Jiang et al., 2023a; Huang et al., 2023) have specified either a compression ratio or a target number of compressed tokens in the instructions. However, GPT-4 often fails to adhere to these restrictions.
Additionally, the information density of text can vary significantly depending on its genre, style, etc. For instance, news articles typically contain denser information than meeting transcripts. Furthermore, even within the domain of meeting transcripts, the information density of different speakers may vary. These factors suggest that a fixed compression ratio may not be optimal. Therefore, we remove the compression ratio restriction from our instructions and instead prompt GPT-4 to compress the original text as short as possible while retaining as much information as possible. As shown in Fig. 3, GPT-4 assigns varying compression ratios to different sentences and discards some sentences entirely. For a comparison between our instruction and those of Jiang et al. (2023a), please refer to Table 7.

Our Instruction for Compression:
Compress the given text to short expressions, and such that you (GPT-4) can reconstruct it as close as possible to the original. Unlike the usual text compression, I need you to comply with the 5 conditions below:
1. You can ONLY remove unimportant words.
2. Do not reorder the original words.
3. Do not change the original words.
4. Do not use abbreviations or emojis.
5. Do not add new words or symbols.
Compress the origin aggressively by removing words only. Compress the origin as short as you can, while retaining as much information as possible. If you understand, please compress the following text: {text to compress}
The compressed text is:

Figure 2: Our instruction used for data distillation.

Figure 3: Distribution of compression ratios (x-axis: compression ratio, 1-20; y-axis: ratio of sentences) after chunk-wise compression on MeetingBank.

Figure 4: Compression ratio w.r.t. original context length (0-40,000 tokens) on MeetingBank. We use GPT-4-32k with the output token limit set to 4096.

Chunk-Wise Compression. Empirically, we have found that the length of the original text has a notable influence on the compression performance. As shown in Fig. 4, GPT-4 tends to apply a high compression ratio when processing very long contexts, which might be due to GPT-4's limited ability to handle long contexts. This aggressive compression leads to substantial information loss, significantly impacting the performance of downstream tasks. To mitigate this issue, we first segment each long context into multiple chunks, each containing no more than 512 tokens and ending with a period. We then instruct GPT-4 to compress each chunk individually.
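A minimal sketch of this chunking step is shown below. It assumes the tiktoken library for token counting and a naive period-based sentence split; these choices, and the function name, are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch: split a long transcript into chunks of <= 512 tokens that end on a period.
# Assumptions: tiktoken for token counting, naive period-based sentence splitting.
import tiktoken

def chunk_text(text: str, max_tokens: int = 512, encoding_name: str = "cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    # Keep the trailing period with each sentence.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n_tokens = len(enc.encode(sent))
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each resulting chunk is then compressed by GPT-4 independently, following the instruction in Fig. 2.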
3.2 Data Annotation

Having obtained pairs of original texts and their compressed versions from data distillation (Sec. 3.1), the goal of data annotation is to assign a binary label to each token in the original texts to determine whether it should be preserved or discarded after compression. Fig. 5 describes the three primary obstacles encountered here, which arise from GPT-4's inability to precisely comply with the instruction in Fig. 9. Alg. 1 outlines the overall procedure of the proposed annotation algorithm designed to deal with these obstacles. For more details, please refer to Appendix B.

Original Texts:
Item 15, report from City Manager Recommendation to adopt three resolutions. First, to join the Victory Pace program. Second, to join the California first program. And number three, consenting to to inclusion of certain properties within the jurisdiction in the California Hero program.
Compressed Texts:
City Manager Recommendation adopt three resolutions. Join California first program. Consent properties inclusion jurisdiction California Hero program.

Figure 5: Challenges in data annotation.
(i) Ambiguity: a word in the compressed texts may appear multiple times in the original content.
(ii) Variation: GPT-4 may modify the original words in tense, plural form, etc. during compression.
(iii) Reordering: the order of words may be changed after compression.

Algorithm 1: Data Annotation
Input: original string S_ori, compressed string S_comp, window size s.
  Split the original string S_ori into the word list S_ori.
  Split the compressed string S_comp into the word list S_comp.
  Initialize the labels L of all original words to False.
  Initialize the previous match index prev to 0.
  for w in S_comp do
      for i = 1, 2, ..., 2s do
          right = min(|S_ori|, prev + i)
          if fuzzy_match(w, S_ori[right]) then
              L[right] = True; prev = right; break
          end
          left = max(0, prev - i)
          if fuzzy_match(w, S_ori[left]) then
              L[left] = True; break
          end
      end
  end
Output: labels L(S_ori) of the original words.

3.3 Quality Control

We introduce two quality control metrics to assess the quality of the compressed texts generated by GPT-4 distillation, as well as the quality of the automatically annotated labels. We then filter the examples by their scores.

Variation Rate. As GPT-4 may fail to follow the instructions, we introduce the Variation Rate (VR) to evaluate the quality of the compressed texts generated from data distillation. VR measures the proportion of words in the compressed text that are absent from the original text. Specifically, let S_comp be the set of words in the compressed text and S_ori be that of the original text. VR is defined as:

    VR = (1 / |S_comp|) Σ_{w ∈ S_comp} I(w ∉ S_ori),    (1)

where |·| is the cardinality of a set. A higher variation rate implies a higher likelihood of encountering hallucinated content. Therefore, we exclude the examples with the top 5% highest variation rates.

Alignment Gap. We propose the Alignment Gap (AG) to evaluate the quality of the automatically annotated labels. Let l(·) denote the annotation function, where l(w) = True signifies that word w ∈ S_ori corresponds to a word in S_comp. We first define the matching rate (MR) as:

    MR = (1 / |S_ori|) Σ_{w ∈ S_ori} I(l(w) = True).    (2)

Since there exists a many-to-one word mapping from S_ori to S_comp (i.e., the "Ambiguity" challenge presented in Sec. 3.2), we further present a hitting rate (HR) as a regularization term that measures the proportion of words in S_comp that are found in S_ori. HR is defined as:

    HR = (1 / |S_ori|) Σ_{w ∈ S_comp} I(w ∈ S_ori).    (3)

Finally, the Alignment Gap (AG) is defined as:

    AG = HR - MR.    (4)

The alignment gap of a perfect annotation should be 0. A large AG indicates a high hitting rate with a poor matching rate, implying low-quality annotation for that example. Therefore, we discard the examples with the highest 10% alignment gap to ensure the quality of the dataset.
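To make the annotation and filtering steps concrete, the sketch below implements Alg. 1 and the metrics of Eqs. (1)-(4) for a single (original, compressed) pair. Whitespace word splitting, the difflib-based fuzzy_match (a stand-in for the lemmatization-based matching of Appendix B), and the window size are simplifying assumptions rather than the released implementation.

```python
# Sketch: Alg. 1 (sliding-window annotation) plus the quality-control metrics of Eqs. (1)-(4).
from difflib import SequenceMatcher

def fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # Placeholder for the lemmatization-based matching described in Appendix B.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def annotate(original_words, compressed_words, window: int = 200):
    """Return per-word True/False labels for the original words (Alg. 1)."""
    labels = [False] * len(original_words)
    prev = 0  # index of the previous match in the original word list
    for w in compressed_words:
        for i in range(1, 2 * window + 1):
            right = min(len(original_words) - 1, prev + i)
            if fuzzy_match(w, original_words[right]):
                labels[right] = True
                prev = right
                break
            left = max(0, prev - i)
            if fuzzy_match(w, original_words[left]):
                labels[left] = True
                break
    return labels

def quality_metrics(original_words, compressed_words, labels):
    """Variation rate, matching rate, hitting rate, and alignment gap (Eqs. 1-4)."""
    ori_set = {w.lower() for w in original_words}
    vr = sum(w.lower() not in ori_set for w in compressed_words) / len(compressed_words)
    mr = sum(labels) / len(original_words)
    hr = sum(w.lower() in ori_set for w in compressed_words) / len(original_words)
    return {"VR": vr, "MR": mr, "HR": hr, "AG": hr - mr}

# Examples in the top 5% by VR and the top 10% by AG are then filtered out.
```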
4 Compressor

We formulate prompt compression as a binary token classification problem (i.e., preserve or discard) to guarantee the faithfulness of the compressed prompt to the original content, and meanwhile ensure the low latency of the compression model itself. For the token classification model, we employ a Transformer encoder as the feature extractor to leverage information from the bidirectional context of each token. We train the classification model on the dataset constructed in Sec. 3 from MeetingBank (Hu et al., 2023). During inference, we determine whether to preserve or discard each token in the original prompt based on its probability calculated by our classification model.

4.1 Token Classification Model

Architecture. We utilize a Transformer encoder (Devlin et al., 2019) as the feature encoder fθ and add a linear classification layer on top. Given an original prompt consisting of N words x = {x_i} (i = 1, ..., N), this can be formulated as:

    h = fθ(x),    (5)
    p(x_i, Θ) = softmax(W h_i + b),    (6)

where h = {h_i} (i = 1, ..., N) denotes the feature vectors of all words, p(x_i, Θ) ∈ R^2 denotes the probability distribution over the labels {preserve, discard} for the i-th word x_i, and Θ = {θ, W, b} denotes all the trainable parameters.

Training. Let y = {y_i} (i = 1, ..., N) denote the corresponding labels of all words in x. We employ the cross-entropy loss to train the model. The loss function L w.r.t. x is:

    L(Θ) = (1/N) Σ_{i=1}^{N} CrossEntropy(y_i, p(x_i, Θ)).    (7)

4.2 Compression Strategy

Our approach to compressing an original prompt x = {x_i} (i = 1, ..., N) with a target compression ratio 1/τ involves a three-step process, where τ is defined as the quotient of the number of words in the compressed prompt and the number of words in the original prompt x. First, we derive the target number of tokens to be preserved in the compressed prompt x̃: Ñ = τN. Next, we use the token classification model to predict the probability p_i of each word x_i being labeled preserve.² Finally, we retain the top-Ñ words in the original prompt x with the highest p_i and maintain their original order to form the compressed prompt x̃.
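The model of Eqs. (5)-(7) and the three-step selection above amount to a token-classification head on a bidirectional encoder plus a top-Ñ filter. A minimal sketch with Hugging Face Transformers and PyTorch follows; the checkpoint name matches Sec. 5, while the training-loop details, the assumption that label index 0 means preserve, and the word-level probability averaging (cf. footnote 2) are illustrative simplifications rather than the released code.

```python
# Sketch: bidirectional encoder + linear token-classification head (Eqs. 5-7),
# and inference-time compression keeping the top tau*N words by p(preserve).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-large"  # "bert-base-multilingual-cased" for the small variant
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def training_step(words, labels):
    """One cross-entropy update on a single example (Eq. 7); labels are 0/1 per word."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    max_length=512, return_tensors="pt")
    # Each subword token inherits its word's label; special tokens are ignored (-100).
    token_labels = [labels[w] if w is not None else -100 for w in enc.word_ids(0)]
    out = model(**enc, labels=torch.tensor([token_labels]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def compress(words, tau: float) -> str:
    """Keep the top tau*N words ranked by mean subword p(preserve), in original order."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    max_length=512, return_tensors="pt")
    probs = torch.softmax(model(**enc).logits[0], dim=-1)[:, 0]  # assume label 0 = preserve
    per_word = [[] for _ in words]
    for tok_idx, w_idx in enumerate(enc.word_ids(0)):
        if w_idx is not None:
            per_word[w_idx].append(probs[tok_idx].item())
    scores = [sum(p) / len(p) if p else 0.0 for p in per_word]
    n_keep = max(1, int(tau * len(words)))
    keep = set(sorted(range(len(words)), key=lambda i: scores[i], reverse=True)[:n_keep])
    return " ".join(w for i, w in enumerate(words) if i in keep)
```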
It is worth noting that our approach can be readily integrated into the coarse-to-fine framework proposed in LLMLingua (Jiang et al., 2023a), allowing for a higher compression ratio of ~15x for tasks involving multiple demonstrations or documents. In particular, we can replace the perplexity-based iterative token compression module in LLMLingua with our token-classification-based compressor, while keeping the budget controller unchanged.

² To address tokenization-related challenges that arise when applying our approach across various LLMs and SLMs, we preserve the integrity of multi-token words and represent the probability of a word by averaging over the predicted probabilities of all its subword tokens.

Methods             QA F1   BLEU   Rouge1  Rouge2  RougeL  BERTScore  Tokens  1/τ
Selective-Context   66.28   10.83  39.21   18.73   27.67   84.48      1,222   2.5x
LLMLingua           67.52    8.94  37.98   14.08   26.58   86.42      1,176   2.5x
LLMLingua-2-small   85.82   17.41  48.33   23.07   34.36   88.77        984   3.0x
LLMLingua-2         86.92   17.37  48.64   22.96   34.24   88.27        970   3.1x
Original            87.75   22.34  47.28   26.66   35.15   88.96      3,003   1.0x

Table 1: In-domain evaluation of different methods on MeetingBank. QA is measured by F1; the summarization metrics are BLEU, Rouge1/2/L, and BERTScore.

5 Experiment

Implementation Details. We construct our extractive text compression dataset using training examples from MeetingBank (Hu et al., 2023), with implementation details in Appendix A. Our approach is implemented using Huggingface's Transformers and PyTorch 2.0.1 with CUDA 11.7. We use xlm-roberta-large (Conneau et al., 2020) and multilingual-BERT (Devlin et al., 2019) as the feature encoder fθ in our compressor, which we refer to as LLMLingua-2 and LLMLingua-2-small, respectively. We fine-tune both models for 10 epochs, using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-5 and a batch size of 10. Unless specified otherwise, all reported metrics use GPT-3.5-Turbo-0613³ as the target LLM for downstream tasks, with greedy decoding at a temperature of 0 for enhanced stability across experiments.

³ https://platform.openai.com/

Datasets & Evaluation Metrics. We conduct five groups of experiments to evaluate the compressed prompts on two groups of datasets.

(i) In-Domain: As we train our compressor on the dataset built from MeetingBank training examples (Hu et al., 2023), we use the MeetingBank test examples for in-domain evaluation. In addition to the summarization task, we further introduce a QA task by prompting GPT-4 to generate 3 question-answer pairs for each example, distributed across the whole context (see Appendix F for more details). For the summarization task, we use the same evaluation metric as in LLMLingua (Jiang et al., 2023a). For the QA task, we use the metrics and scripts provided for LongBench (Bai et al., 2023) Single Document QA.

(ii) Out-of-Domain: For long-context scenarios, we use LongBench (Bai et al., 2023) and ZeroSCROLLS (Shaham et al., 2023), and we employ the same evaluation metric as in LongLLMLingua (Jiang et al., 2023b). For reasoning and in-context learning, we use GSM8K (Cobbe et al., 2021) and Big Bench Hard (BBH) (bench authors, 2023), with evaluation metrics consistent with LLMLingua (Jiang et al., 2023a).
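As a usage illustration of this evaluation setup (a compressed prompt fed to the target LLM with greedy decoding), a hedged sketch is given below; the openai v1 client, the model string, and the prompt template are assumptions, not the authors' evaluation scripts.

```python
# Sketch: query the target LLM on a compressed context at temperature 0 (greedy decoding).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_compressed_context(compressed_prompt: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",  # assumed deployment name
        messages=[{"role": "user",
                   "content": f"{compressed_prompt}\n\nQuestion: {question}\nAnswer:"}],
        temperature=0,
    )
    return response.choices[0].message.content
```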
LongBench ZeroSCROLLS
Methods
SingleDoc MultiDoc Summ. FewShot Synth. Code AVG Tokens 1/τ AVG Tokens 1/τ
2,000-token constraint
Task(Question)-Aware Compression
SBERT† 33.8 35.9 25.9 23.5 18.0 17.8 25.8 1,947 5x 20.5 1,773 6x
OpenAI† 34.3 36.3 24.7 32.4 26.3 24.8 29.8 1,991 5x 20.6 1,784 5x
LongLLMLingua† 39.0 42.2 27.4 69.3 53.8 56.6 48.0 1,809 6x 32.5 1,753 6x
Task(Question)-Agnostic Compression
Selective-Context† 16.2 34.8 24.4 15.7 8.4 49.2 24.8 1,925 5x 19.4 1,865 5x
LLMLingua† 22.4 32.1 24.5 61.2 10.4 56.8 34.6 1,950 5x 27.2 1,862 5x
LLMLingua-2-small 29.5 32.0 24.5 64.8 22.3 56.2 38.2 1,891 5x 33.3 1,862 5x
LLMLingua-2 29.8 33.1 25.3 66.4 21.3 58.9 39.1 1,954 5x 33.4 1,898 5x
3,000-tokens constraint
Task(Question)-Aware Compression
SBERT† 35.3 37.4 26.7 63.4 51.0 34.5 41.4 3,399 3x 24.0 3,340 3x
OpenAI† 34.5 38.6 26.8 63.4 49.6 37.6 41.7 3,421 3x 22.4 3,362 3x
LongLLMLingua† 40.7 46.2 27.2 70.6 53.0 55.2 48.8 3,283 3x 32.8 3,412 3x
Task(Question)-Agnostic Compression
Selective-Context† 23.3 39.2 25.0 23.8 27.5 53.1 32.0 3,328 3x 20.7 3,460 3x
LLMLingua† 31.8 37.5 26.2 67.2 8.3 53.2 37.4 3,421 3x 30.7 3,366 3x
LLMLingua-2-small 35.5 38.1 26.2 67.5 23.9 60.0 41.9 3,278 3x 33.4 3,089 3x
LLMLingua-2 35.5 38.7 26.3 69.6 21.4 62.8 42.4 3,392 3x 33.5 3,206 3x
Original Prompt 39.7 38.7 26.5 67.0 37.8 54.2 44.0 10,295 - 34.7 9,788 -
Zero-Shot 15.6 31.3 15.6 40.7 1.6 36.2 23.5 214 48x 10.8 32 306x

Table 2: Out-of-domain evaluation on general long-context scenarios. † : numbers reported in Jiang et al. (2023b).

GSM8K BBH
Methods
1-shot constraint half-shot constraint 1-shot constraint half-shot constraint
EM Tokens 1/τ EM Tokens 1/τ EM Tokens 1/τ EM Tokens 1/τ
Selective-Context† 53.98 452 5x 52.99 218 11x 54.27 276 3x 54.02 155 5x
LLMLingua† 79.08 446 5x 77.41 171 14x 70.11 288 3x 61.60 171 5x
LLMLingua-2-small 78.92 437 5x 77.48 161 14x 69.54 263 3x 60.35 172 5x
LLMLingua-2 79.08 457 5x 77.79 178 14x 70.02 269 3x 61.94 176 5x
Full-Shot 78.85 2,366 - 78.85 2,366 - 70.07 774 - 70.07 774 -
Zero-Shot 48.75 11 215x 48.75 11 215x 32.32 16 48x 32.32 16 48x

Table 3: Out-of-domain evaluation on reasoning and in-context learning. † : numbers reported in Jiang et al. (2023b).

Baselines. We take two state-of-the-art prompt compression methods as primary baselines for comparison: Selective-Context (Li et al., 2023) and LLMLingua (Jiang et al., 2023a), both of which are based on LLaMA-2-7B. Additionally, we compare our approach with task-aware prompt compression methods, such as retrieval-based methods and LongLLMLingua (Jiang et al., 2023b).
MeetingBank LongBench-SingleDoc
Methods
QA Summ. Tokens 1/τ 2,000-token cons. Tokens 1/τ 3,000-token cons. Tokens 1/τ
Selective-Context 58.13 26.84 1,222 2.5x 22.0 2,038 7.1x 26.0 3,075 4.7x
LLMLingua 50.45 23.63 1,176 2.5x 19.5 2,054 7.1x 20.8 3,076 4.7x
LLMLingua-2-small 75.97 29.93 984 3.0x 25.3 1,949 7.4x 27.9 2,888 5.0x
LLMLingua-2 76.22 30.18 970 3.0x 26.8 1,967 7.4x 27.3 2,853 5.1x
Original Prompt 66.95 26.26 3,003 - 24.5 14,511 - 24.5 14,511 -

Table 4: Evaluation with Mistral-7B as the Target LLM on MeetingBank and LongBench single doc QA task. We
report Rouge1(Lin, 2004) for summary.

Results on In-Domain Benchmark. In Table 1, we first present the results of our proposed method compared to the strong baselines on MeetingBank. Despite the fact that our compressors are much smaller than the LLaMA-2-7B used in the baselines, our approach achieves significantly better performance on both the QA and summarization tasks, and comes close to matching the performance of the original prompt. This demonstrates the effectiveness of our constructed dataset and highlights the benefit of optimizing the compression model with prompt compression knowledge.

Results on Out-of-Domain Benchmarks. As our model is trained on meeting transcript data from MeetingBank, we explore here its generalization ability across various benchmarks of long-context scenarios, reasoning, and in-context learning. Tables 2 and 3 show the results on LongBench, ZeroSCROLLS, GSM8K, and BBH: our model demonstrates superior performance compared to other task-agnostic baselines. Even our smaller model, which is of BERT-base size, achieves comparable, and in some cases even slightly higher, performance than the original prompt. While our approach shows promising results, it falls short when compared to task-aware compression methods like LongLLMLingua (Jiang et al., 2023b) on LongBench. We attribute this performance gap to the additional information that they leverage from the question. However, the task-agnostic nature of our model makes it an efficient option with good generalizability when deployed across different scenarios.

Mistral-7B as the Target LLM. Table 4 presents the results of different methods using Mistral-7B-v0.1⁴ as the target LLM. Our method demonstrates significant performance gains over the other baselines, showcasing its good generalization ability across target LLMs. Notably, LLMLingua-2 yields even better performance than the original prompt. We speculate that Mistral-7B might be less adept at managing long contexts than GPT-3.5-Turbo; our method, by offering shorter prompts with higher information density, effectively improves Mistral-7B's final inference performance.

⁴ https://mistral.ai/

Latency Evaluation. Table 5 shows the latency of different systems on a V100-32G GPU with different compression ratios. LLMLingua-2 has a much smaller computation overhead than the other compression methods and achieves an end-to-end speedup ranging from 1.6x to 2.9x. Additionally, our method reduces GPU memory costs by 8x, lowering the demand on hardware resources. For details, see Appendix I.

1/τ                        1x     2x          3x          5x
End2End w/o Compression    14.9
End2End w/ LLMLingua-2     -      9.4 (1.6x)  7.5 (2.1x)  5.2 (2.9x)
Selective-Context          -      15.9        15.6        15.5
LLMLingua                  -      2.9         2.1         1.5
LLMLingua-2                -      0.5         0.4         0.4

Table 5: Latency (s) comparison on MeetingBank.

Observation on Context Awareness. We observe that LLMLingua-2 effectively retains the most informative words with respect to the full context as the compression ratio increases. We attribute this to the adoption of the bidirectional context-aware feature extractor, as well as the strategy of explicitly optimizing toward the prompt compression objective. See Figure 6 for more details.

Prompt Reconstruction. We have conducted experiments prompting GPT-4 to reconstruct the original prompt from the LLMLingua-2 compressed prompt. The results show that GPT-4 can effectively reconstruct the original prompt, suggesting that there is no essential information loss during compression with LLMLingua-2. Figures 7 and 8 in Appendix E present some examples.
LongBench ZeroSCROLLS
Methods
SingleDoc MultiDoc Summ. FewShot Synth. Code AVG Tokens 1/τ AVG Tokens 1/τ
LLMLingua-2-small 29.5 32.0 24.5 64.8 22.3 56.2 38.2 1,891 5x 33.3 1,862 5x
LLMLingua-2 29.8 33.1 25.3 66.4 21.3 58.9 39.1 1,954 5x 33.4 1,898 5x
LLMLingua-2‡ 30.7 33.9 25.4 66.6 22.6 58.1 39.5 1,853 5x 33.4 1,897 5x
Original Prompt 39.7 38.7 26.5 67.0 37.8 54.2 44.0 10,295 - 34.7 9,788 -
Zero-Shot 15.6 31.3 15.6 40.7 1.6 36.2 23.5 214 48x 10.8 32 306x

Table 6: Out-of-domain evaluation on general long-context benchmarks with the 2,000-token constraint.
LLMLingua-2‡ : We expand the constructed text compression dataset using 50k examples from TriviaQA-wiki.
Then train an LLMLingua-2 compressor with the expanded dataset.

Instruction               1/τ    VR ↓   QA F1 ↑
Instruction1              123x   13.7   19.1
Instruction2              27x    7.8    26.1
Instruction3              78x    9.6    23.7
Instruction4              49x    9.4    24.9
LLMLingua-2 w/o Chunk     21x    6.0    27.9
LLMLingua-2               2.6x   2.2    36.7

Table 7: Ablation study on chunk-wise compression and instruction design. We report the compression ratio, variation rate, and QA performance on LongBench Single Document QA. See Fig. 10 in the Appendix for the details of Instruction1-Instruction4.

Ablation Study on Chunk-Wise Compression and Instruction Design. Table 7 shows that both the designed instruction and the chunk-wise compression strategy proposed in this paper contribute significantly to the success of LLMLingua-2.

6 Conclusion

This paper targets task-agnostic prompt compression for better generalizability and efficiency. We identify the challenges encountered by existing methods and address them accordingly. We conduct extensive experiments and analysis on five benchmarks across different tasks and domains. Our model shows superiority over strong baselines in terms of both performance and compression latency. We publicly release the text compression dataset introduced in this paper, which preserves the essential information of the original texts.

Limitations

Our text compression dataset was constructed using only training examples from MeetingBank, a dataset of summarization over meeting transcripts. This raises concerns about the generalization ability of our compressor. Here we discuss this question from two perspectives.

Firstly, we have conducted extensive out-of-domain evaluation on four benchmarks in the paper, including LongBench (Bai et al., 2023), ZeroSCROLLS (Shaham et al., 2023), GSM8K (Cobbe et al., 2021), and Big Bench Hard (BBH) (bench authors, 2023), which cover multiple tasks from document QA to math problems and in-context learning. The experimental results show that even our LLMLingua-2-small model, which is of BERT-base size, achieves superior performance to the two LLaMA-2-7B based baselines, Selective-Context (Li et al., 2023) and LLMLingua (Jiang et al., 2023a). This demonstrates that our learned prompt compression model has good generalization ability to data from different domains.

Secondly, we expand the constructed text compression dataset with 50k examples from TriviaQA-wiki and train an LLMLingua-2 compressor on the expanded dataset to see whether there is a further performance gain. Table 6 shows the results under the 2,000-token constraint. We can see that training the compressor with more data does bring a further performance gain (LLMLingua-2‡). However, the improvement does not seem significant. We conjecture that this is because, although the semantics of texts from different domains may vary a lot, their redundancy patterns might be similar: such patterns or knowledge may be learned during in-domain training and then act as an anchor that transfers across different domains. We leave this for future work.
References

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2023. LongBench: A bilingual, multitask benchmark for long context understanding. ArXiv preprint, abs/2308.14508.

BIG bench authors. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.

Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. 2023. Walking down the memory maze: Beyond context limit through interactive reading. ArXiv preprint, abs/2310.05029.

Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapting language models to compress contexts. ArXiv preprint, abs/2305.14788.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2023. A survey for in-context learning. ArXiv preprint, abs/2301.00234.

Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1481-1491, Seattle, Washington, USA. Association for Computational Linguistics.

Tao Ge, Jing Hu, Xun Wang, Si-Qing Chen, and Furu Wei. 2023. In-context autoencoder for context compression in a large language model. ArXiv preprint, abs/2307.06945.

Yebowen Hu, Tim Ganter, Hanieh Deilamsalehy, Franck Dernoncourt, Hassan Foroosh, and Fei Liu. 2023. MeetingBank: A benchmark dataset for meeting summarization. ArXiv preprint, abs/2305.17529.

Xijie Huang, Li Lyna Zhang, Kwang-Ting Cheng, and Mao Yang. 2023. Boosting LLM reasoning: Push the limits of few-shot learning with reinforced in-context pruning. ArXiv preprint, abs/2312.08901.

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023a. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358-13376, Singapore. Association for Computational Linguistics.

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023b. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. ArXiv preprint, abs/2310.06839.

Hoyoun Jung and Kyung-Joong Kim. 2023. Discrete prompt compression with reinforcement learning. ArXiv preprint, abs/2308.08758.

Byeongchang Kim, Hyunwoo Kim, and Gunhee Kim. 2019. Abstractive summarization of Reddit posts with multi-level memory networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2519-2531, Minneapolis, Minnesota. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Mahnaz Koupaee and William Yang Wang. 2018. WikiHow: A large scale text summarization dataset. ArXiv preprint, abs/1810.09305.

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342-6353, Singapore. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81, Barcelona, Spain. Association for Computational Linguistics.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023a. Lost in the middle: How language models use long contexts. ArXiv preprint, abs/2307.03172.

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2023b. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. In Thirty-seventh Conference on Neural Information Processing Systems.

Jesse Mu, Xiang Lisa Li, and Noah Goodman. 2023. Learning to compress prompts with gist tokens. In Thirty-seventh Conference on Neural Information Processing Systems.

Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as operating systems. ArXiv preprint, abs/2310.08560.

Allen Roush and Arvind Balaji. 2020. DebateSum: A large-scale argument mining and summarization dataset. In Proceedings of the 7th Workshop on Argument Mining, pages 1-7, Online. Association for Computational Linguistics.

Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. 2023. ZeroSCROLLS: A zero-shot benchmark for long text understanding. ArXiv preprint, abs/2305.14196.

Claude E. Shannon. 1951. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50-64.

Kristina Toutanova, Chris Brockett, Ke M. Tran, and Saleema Amershi. 2016. A dataset and evaluation metrics for abstractive compression of sentences and short paragraphs. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 340-350, Austin, Texas. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

David Wingate, Mohammad Shoeybi, and Taylor Sorensen. 2022. Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5621-5634, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations.

Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2024. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations.

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Re, Clark Barrett, Zhangyang Wang, and Beidi Chen. 2023. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Thirty-seventh Conference on Neural Information Processing Systems.

Zheng Zhao, Shay B. Cohen, and Bonnie Webber. 2020. Reducing quantity hallucinations in abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2237-2249, Online. Association for Computational Linguistics.

A Details of Data Distillation

To construct the extractive compression dataset, we use GPT-4-32k to compress the original meeting transcripts. Each transcript is first divided into chunks, with each chunk terminating at the end of a complete sentence and not exceeding 512 tokens. We employ the default parameter settings with a temperature of 0.3 and a top_p of 1.0. The maximum number of generated tokens is set to 4096. Transcripts exceeding 28K tokens are truncated, allowing a 4K token budget for generation. Fig. 9 presents the full instruction used in GPT-4 compression. Tab. 8 shows the statistics of our MeetingBank compression dataset.

Data Part    Data Size   Chunk    Sentence (Avg)   Token (Avg)   1/τ
Original     5,169       41,746   232              3,635         -
Compressed   5,169       41,746   132              1,415         2.57x

Table 8: Statistics of the MeetingBank compression dataset.
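To make the distillation setup concrete, a sketch of one chunk-level compression call is shown below. It uses the openai Python client (v1 style) with the sampling parameters stated above (temperature 0.3, top_p 1.0, max_tokens 4096) and the system/user prompts from Fig. 9; the model name, client setup, and placeholder handling are assumptions that would need to match one's own GPT-4-32k access, not the authors' exact scripts.

```python
# Sketch: one distillation call that compresses a single <=512-token chunk with GPT-4.
# Assumptions: openai>=1.0 client and an accessible "gpt-4-32k" model/deployment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an excellent linguist and very good at compressing passages into short "
    "expressions by removing unimportant words, while retaining as much information as possible."
)

def compress_chunk(chunk: str, user_instruction: str) -> str:
    # user_instruction is the full instruction from Fig. 9 with the chunk filled in
    # for the {text to compress} placeholder (placeholder name assumed here).
    response = client.chat.completions.create(
        model="gpt-4-32k",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_instruction.replace("{text to compress}", chunk)},
        ],
        temperature=0.3,
        top_p=1.0,
        max_tokens=4096,
    )
    return response.choices[0].message.content
```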
B Details of Data Annotation

Based on the compressed prompt, we design a word annotation algorithm that automatically assigns each word in the original prompt a label indicating whether it should be retained. Initially, all labels of the original words are set to False. Then, for every word in the compressed prompt, we search for its corresponding word in the original prompt, which is then assigned a True label.
Figure 6: LLMLingua-2 performs context-aware compression. In the original figure, two example passages (a MeetingBank transcript excerpt and a short project-planning dialogue) are shown with color highlighting: dark red marks words preserved at a 5x compression ratio, medium red at 3x, light red at 2x, and gray marks words discarded during compression.

Sliding Window: To assign labels to the appropriate words in the original prompt, we utilize a sliding window approach, constraining the search scope to a local window centered on the previously matched word in the original prompt. The search starts from the last matching position, and the True label is assigned to the first matched word in the original prompt. Furthermore, the search is bidirectional to prevent mismatches caused by GPT-4's reordering, as shown in Fig. 5. Moreover, if GPT-4 introduces new words during compression, the sliding window restricts the search scope, preventing mismatches between the newly added words in the compressed prompt and words in the original prompt.

Fuzzy Matching: Another challenge arises from the fact that GPT-4 may alter the original words in tense, voice, and singular/plural form during compression, even when we request GPT-4 to compress by discarding words only. To address this issue, we first apply lemmatization to reduce words to their base form using Spacy⁵, and then perform word matching using the sliding window approach.

⁵ https://spacy.io/api/lemmatizer

C Context Aware Compression

Fig. 6 presents some compression results of our LLMLingua-2 under different compression ratios. Our method effectively maintains the most meaningful words as the compression ratio increases.

D Comparison with Baselines

In Fig. 11 and Fig. 12, we qualitatively compare the compressed prompts of our method with those of the baseline methods on the GSM8K and BBH datasets. Note that our LLMLingua-2 is trained only on MeetingBank, yet it still yields more reasonable compressed prompts than the baseline methods on this transferred domain data.

E Prompt Reconstruction

Fig. 7 and Fig. 8 show two prompts reconstructed from the compressed prompts using GPT-4. Specifically, we prepend a simple reconstruction instruction, "I have asked you to compress a meeting transcript by dropping word only. Now, reconstruct the original meeting transcript based on the following compressed transcript.", to the compressed prompt. With the key information preserved in the compressed prompt, the reconstructed prompt closely resembles the original prompt.

F Details of MeetingBank QA and MeetingBank Summary

The MeetingBank QA dataset consists of 862 meeting transcripts from the MeetingBank test set. Initially, we generate 10 question-answer pairs for each meeting transcript using GPT-4-32K. The instruction used in generating QA pairs is: "Create 10 questions/answer pairs from the given meeting transcript. The answer should be short and concise. The question should start with Q: and answer should start with A: . The meeting transcript is as follows.". To ensure the quality of the generated QA pairs, we discard question-answer pairs with answer lengths exceeding 50 tokens. Subsequently, we carefully examine the remaining QA pairs to ensure that the answers actually appear in the original transcripts, rather than being products of GPT-4's hallucinations. After the aforementioned filtering process, we retain 3 high-quality question-answer pairs for each meeting transcript. Additionally, we instruct GPT-4-32K to summarize each meeting transcript. The summaries generated by GPT-4 are used as ground truth to evaluate the summarization performance.
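For illustration, the automatic part of this filtering could be sketched as below; the tiktoken-based token counting and the substring containment check are assumptions standing in for the manual examination described above, and the pair format is hypothetical.

```python
# Sketch: filter generated QA pairs (answers <= 50 tokens and grounded in the transcript).
# Assumptions: pairs come as (question, answer) tuples; tiktoken for token counting.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def filter_qa_pairs(pairs, transcript: str, max_answer_tokens: int = 50, keep: int = 3):
    kept = []
    transcript_lower = transcript.lower()
    for question, answer in pairs:
        if len(enc.encode(answer)) > max_answer_tokens:
            continue  # answer too long
        if answer.lower() not in transcript_lower:
            continue  # possible hallucination: answer not found in the transcript
        kept.append((question, answer))
    return kept[:keep]
```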
Original Prompt (200 Tokens):
Item 15, report from City Manager Recommendation to adopt three resolutions. First, to join the Victory Pace program.
Second, to join the California first program. And number three, consenting to to inclusion of certain properties within the
jurisdiction in the California Hero program. It was emotion, motion, a second and public comment. CNN. Please cast your
vote. Oh. Was your public comment? Yeah. Please come forward. I thank you, Mr. Mayor. Thank you. Members of the
council. My name is Alex Mitchell. I represent the hero program. Just wanted to let you know that the hero program. Has
been in California for the last three and a half years. We’re in. Over 20. We’re in 28 counties, and we’ve completed over
29,000 energy efficient projects to make homes. Greener and more energy efficient. And this includes anything. From solar
to water. Efficiency. We’ve done. Almost.$550 million in home improvements.
Compressed Prompt (98 Tokens, 2x Compression):
Item 15 report City Manager Recommendation adopt three resolutions. join Victory Pace program. Second join California
first program. three inclusion properties jurisdiction California Hero program. emotion motion second public comment.
CNN. cast vote. public comment?. come forward. thank you Mr. Mayor. Members council. Alex Mitchell. represent hero
program. California three and a half years. 20. 28 counties completed 29,000 energy efficient projects. includes. solar water.
Efficiency Almost $550 million home improvements.
Reconstructed prompt (198 Tokens):
Item number 15 on the agenda is the report from the City Manager. The Recommendation is to adopt three resolutions. The
first is to join the Victory Pace program. The second resolution is to join the California First program. The third resolution is
for the inclusion of properties within our jurisdiction into the California Hero program. There is an emotion for the motion
and a second. We will now open the floor for public comment. We have a representative from CNN here. Are there any other
individuals who would like to cast a vote or make a public comment? If so, please come forward.
Thank you, Mr. Mayor and Members of the council. My name is Alex Mitchell and I represent the Hero program in
California. I have been with the program for three and a half years. As of now, 20 out of 28 counties have completed 29,000
energy efficient projects. This includes projects related to solar water and energy efficiency. Almost $550 million has been
invested in home improvements.

Figure 7: Prompt reconstruction results of our LLMLingua-2 on the MeetingBank example.

Original Prompt (160 Tokens):


Okay. Thank you. Next step is we’re going to do item number, is it that was 16. So I could do item 16. We’ll try to get
through these as expeditiously as possible. And there’s going to be a a motion that’s ready to go here. So can we the the
the item please. Report from city clerk recommendation to receive and file the certification of the petition regarding the
regulation of medical marijuana businesses and approve one of the following three alternative actions adopt the initiative
ordinance without alteration to submit the initiative ordinance without alteration to the voters to be held on November 8th,
2016 or three. Adopt a report pursuant to California State Elections Code. Thank you. There’s a motion and a second device.
Marie Lowenthal. Thank you.
Compressed Prompt (83 Tokens, 2x compression):
Okay. Thank you. Next step do item number, 16. item 16. try get through expeditiously. motion ready go here. item. Report
from city clerk receive file certification of petition regulation medical marijuana businesses approve three alternative actions
adopt initiative ordinance without alteration submit to voters November 8th, 2016 or three. Adopt report California State
Elections Code. motion second device. Marie Lowenthal.
Reconstructed Prompt (123 Tokens):
Okay. Thank you. The next step is to do item number 16. For item 16, we will try to get through this expeditiously. The
motion is ready to go from here. The item is a report from the city clerk to receive and file the certification of a petition for
the regulation of medical marijuana businesses. We can approve three alternative actions: adopt the initiative ordinance
without alteration, submit it to the voters on November 8th, 2016, or three, adopt the report as per the California State
Elections Code. The motion is seconded by the device. Marie Lowenthal.

Figure 8: Prompt reconstruction results of our LLMLingua-2 on the MeetingBank example.

by GPT-4 are used as ground truth to evaluate the summary performance.

G Drawback of Existing Text Compression Datasets

Existing extractive compression datasets such as SentComp (Filippova and Altun, 2013) and DebateSum (Roush and Balaji, 2020) are mainly created for the summarization task. The compressed texts provided in these datasets are usually too concise, maintaining only the main idea of the original text and lacking detailed information. This information loss inevitably hinders downstream tasks such as document-based QA, as illustrated in Fig. 13 and Fig. 14.

H Model Size and Training Details

We use xlm-roberta-large, which has 355M parameters, as the feature encoder fθ in LLMLingua-2. The training process takes approximately 23 hours on our MeetingBank compression dataset. For LLMLingua-2-small, the feature encoder is multilingual-BERT, which has 110M parameters. It takes roughly 16 hours to train the multilingual-BERT model.
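As a rough sketch of the setup described above (not our exact training script), the feature encoder can be instantiated as a token-classification model with two labels (preserve / discard) using Hugging Face Transformers. The preservation_probs helper below is hypothetical, and the classification head is randomly initialized until it is fine-tuned on the distilled compression data.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# "xlm-roberta-large" for LLMLingua-2; a multilingual BERT checkpoint such as
# "bert-base-multilingual-cased" would play the role of the small variant.
MODEL_NAME = "xlm-roberta-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Two labels: index 1 = preserve the token, index 0 = discard it.
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)

def preservation_probs(text: str) -> list[tuple[str, float]]:
    """Return each token together with its predicted probability of being preserved."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits                  # shape: (1, seq_len, 2)
    probs = torch.softmax(logits, dim=-1)[0, :, 1]    # P(preserve) for every token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, probs.tolist()))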
Our GPT-4 Instruction for Compression:
System Prompt:
You are an excellent linguist and very good at compressing passages into short expressions by removing unimportant words,
while retaining as much information as possible.
User Prompt:
Compress the given text to short expressions, and such that you (GPT-4) can reconstruct it as close as possible to the original.
Unlike the usual text compression, I need you to comply with the 5 conditions below:
1. You can ONLY remove unimportant words.
2. Do not reorder the original words.
3. Do not change the original words.
4. Do not use abbreviations or emojis.
5. Do not add new words or symbols.
Compress the origin aggressively by removing words only. Compress the origin as short as you can, while retaining as much
information as possible. If you understand, please compress the following text: {text to compress}
The compressed text is:

Figure 9: The instruction we used in GPT-4 compression.
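For reference, below is a minimal sketch of how this instruction could be issued during data distillation. It assumes the OpenAI Python client (openai>=1.0); the model name, decoding parameters, and the compress_with_gpt4 wrapper are illustrative rather than the exact configuration of our pipeline.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are an excellent linguist and very good at compressing passages into "
    "short expressions by removing unimportant words, while retaining as much "
    "information as possible."
)

USER_TEMPLATE = (
    "Compress the given text to short expressions, and such that you (GPT-4) can "
    "reconstruct it as close as possible to the original. Unlike the usual text "
    "compression, I need you to comply with the 5 conditions below:\n"
    "1. You can ONLY remove unimportant words.\n"
    "2. Do not reorder the original words.\n"
    "3. Do not change the original words.\n"
    "4. Do not use abbreviations or emojis.\n"
    "5. Do not add new words or symbols.\n"
    "Compress the origin aggressively by removing words only. Compress the origin "
    "as short as you can, while retaining as much information as possible. "
    "If you understand, please compress the following text: {text}\n"
    "The compressed text is:"
)

def compress_with_gpt4(text: str) -> str:
    """Ask GPT-4 for an extractive (word-dropping) compression of `text`."""
    response = client.chat.completions.create(
        model="gpt-4",        # illustrative model name
        temperature=0.0,      # deterministic output for distillation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(text=text)},
        ],
    )
    return response.choices[0].message.content.strip()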

Instruction1:
Could you please rephrase the paragraph to make it short, and keep 5% tokens?
Instruction2:
Summarize the provided examples in a few sentences, maintaining all essential reasoning aspects.
Instruction3:
Remove redundancy and express the text concisely in English, ensuring that all key information and reasoning processes are
preserved.
Instruction4:
Follow these steps to shorten the given text content: 1. First, calculate the amount of information contained in each sentence,
and remove sentences with less information. 2. Next, further condense the text by removing stop words, unnecessary
punctuation, and redundant expressions. Refine the content while ensuring that all key information is retained. Let’s do it
step by step.

Figure 10: Other instructions we evaluated, which are proposed in LLMLingua (Jiang et al., 2023a).

I GPU Memory Usage

LLMLingua-2 enjoys a smaller GPU memory overhead because it is lightweight. The peak GPU memory usage of LLMLingua-2 on MeetingBank is only 2.1GB, while LLMLingua and Selective-Context, which utilize LLAMA-2-7B as the SLM, consume 16.6GB and 26.5GB of GPU memory, respectively.

J Multilingual Generalization Ability

In Table 9, we assess the performance of LLMLingua-2 on the Chinese benchmarks of LongBench, comprising 5 tasks with a total of 1000 samples. Despite being trained solely on the MeetingBank data, which consists of an English corpus only, LLMLingua-2 also outperforms LLMLingua on the Chinese benchmarks. We attribute this performance gain to the multilingual capabilities of the xlm-roberta-large or multilingual-BERT compressor acquired from the pre-training phase.

K Integration with LongLLMLingua

In retrieval-augmented generation (RAG) and multi-document question-answering (MDQA) scenarios, the primary challenge is to identify the document that contains the key information relevant to the question. In these scenarios, LongLLMLingua improves key information preservation by utilizing the information provided in the question. While LLMLingua-2 is designed for question-agnostic compression, it can also be integrated with LongLLMLingua to preserve more key information relevant to the question. Specifically, we utilize LongLLMLingua's coarse-grained compression to assign varying compression ratios to different documents based on the question's perplexity conditioned on each document. Consequently, it allocates larger token budgets to the documents that are more relevant to the question. As illustrated in Table 11, LLMLingua-2 with LongLLMLingua coarse-grained compression achieves an average performance gain of 25.3% on NaturalQuestions (Liu et al., 2023a) compared to LLMLingua-2 alone.
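A simplified sketch of this coarse-grained budget allocation is shown below. It assumes a hypothetical ppl_fn helper that scores the question's perplexity conditioned on each document with a small language model; the softmax-over-negative-perplexity rule is illustrative and not the exact formula used by LongLLMLingua.

import math

def allocate_token_budgets(docs, question, total_budget, ppl_fn, temperature=1.0):
    """Split a total token budget across documents, giving larger budgets to
    documents on which the question has lower conditional perplexity."""
    ppls = [ppl_fn(question, doc) for doc in docs]   # lower perplexity = more relevant
    scores = [-p / temperature for p in ppls]
    z = sum(math.exp(s) for s in scores)
    weights = [math.exp(s) / z for s in scores]
    return [max(1, round(w * total_budget)) for w in weights]

# Each document is then compressed by the LLMLingua-2 token classifier under its
# own budget, e.g. by keeping the tokens with the highest preservation probability.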
Original Prompt (139 tokens):
Q: I have a blackberry, a clarinet, a nectarine, a plum, a strawberry, a banana, a flute, an orange, and a violin. How many
fruits do I have?
A: Let’s think step by step.
We first identify the fruits on the list and include their quantity in parentheses:
- blackberry (1) - nectarine (1) - plum (1) - strawberry (1) - banana (1) - orange (1)
Now, let’s add the numbers in parentheses: 1 + 1 + 1 + 1 + 1 + 1 = 6. So the answer is 6.
Compressed prompt (57 tokens) by LLMLingua:
: a blackberry, a a ne a a a a, many have
:’s think
We first theruits the list and include their in - (–
’s the numbers in parentheses:1 + 1 = 6. So the answer is 6.
Compressed prompt (54 tokens) by LLMLingua-2:
Q: clarinet, nectarine, strawberry, violin.
How many fruits
think step by step.
identify fruits include quantity parentheses:
blackberry nectarine plum strawberry banana orange add numbers parentheses: 1 + 1 = 6.
answer is 6.

Figure 11: Comparison with the baseline. LLMLingua-2 here is trained only on MeetingBank, yet it yields a more reasonable compressed prompt than LLMLingua on BBH.


L Sample-Wise Dynamic Compression Ratio

By default, LLMLingua-2 applies a fixed compression rate to all samples in the benchmark. However, this approach may not be optimal due to variations in the density of key information across different samples. To address this problem, we allow LLMLingua-2 to dynamically adjust the compression rate for each sample under the overall compression rate constraint. Specifically, we employ the compressor to predict the preservation probability of every token in all samples, and then set a single probability threshold that achieves the overall compression rate constraint. For all samples, tokens with preservation probabilities higher than this threshold are retained.
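The selection rule can be sketched as follows, assuming the preservation probability of every token in every sample has already been predicted by the compressor; this is an illustration of the thresholding idea rather than our exact implementation.

import numpy as np

def dynamic_compress(samples_probs, target_rate):
    """samples_probs: one 1-D array per sample holding each token's predicted
    preservation probability. target_rate: overall fraction of tokens to keep
    (e.g. 0.2 for 5x compression). Returns a boolean keep-mask per sample."""
    all_probs = np.concatenate(samples_probs)
    # Global threshold: the (1 - target_rate) quantile over the whole corpus,
    # so roughly target_rate of all tokens are kept in total, while each sample
    # ends up with its own effective compression rate.
    threshold = np.quantile(all_probs, 1.0 - target_rate)
    return [probs > threshold for probs in samples_probs]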
Table 12 presents the performance of
LLMLingua-2 using the sample-wise dynamic
compression ratio, showcasing a 4.4% and 4.5%
performance improvement under 7x and 5x
compression ratios, respectively, compared to
LLMLingua-2 with a fixed compression ratio.
Original Prompt (249 tokens):
Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He rearranged five of these
boxes into packages of six highlighters each and sold them for $3 per package. He sold the rest of the highlighters separately
at the rate of three pens for $2. How much profit did he make in total, in dollars?
Let’s think step by step
Sam bought 12 boxes x $10 = $120 worth of highlighters.
He bought 12 * 30 = 360 highlighters in total.
Sam then took 5 boxes × 6 highlighters/box = 30 highlighters.
He sold these boxes for 5 * $3 = $15
After selling these 5 boxes there were 360 - 30 = 330 highlighters remaining.
These form 330 / 3 = 110 groups of three pens.
He sold each of these groups for $2 each, so made 110 * 2 = $220 from them.
In total, then, he earned $220 + $15 = $235.
Since his original cost was $120, he earned $235 - $120 = $115 in profit.
The answer is 115
Compressed prompt (144 tokens) by LLMLingua:
: Sam bought a dozen boxes each 30 highl pens inside, $10 each. He reanged five of boxes into of
six each $3 per. He sold the thelters separately at the of three $2. much make total,
Lets think step
bought boxes x0 oflters
He 2 3ters in
Sam then boxes 6lters/box 0ters
He sold these boxes 5
Afterelling these boxes there 36030lters
ese00 of three
sold groups2 each so made *2 $20 from
In total, he015
Since his he $ - $120 = $115 in profit.
The answer is 115
Compressed prompt (138 tokens) by LLMLingua-2:
Sam bought dozen 30 highlighter pens $10 rearranged five boxes into six highlighters sold $3 per sold rest three pens profit ?
Sam bought 12 boxes x $10 = $120
12 * 30 = 360 highlighters
5 boxes × 6 highlighters/box = 30
sold 5 * $3 = $15
5 360 - 30 = 330 highlighters
330 / 3 = 110 groups three
sold $2 110 * 2 = $220
earned $220 + $15 = $235. original cost earned $235 - $120 = $115
The answer is 115

Figure 12: Comparison with the baseline. LLMLingua-2 here is trained only on MeetingBank, yet it yields a more reasonable compressed prompt than LLMLingua on GSM8K.

Document:
Chinese government is to open more museums, memorial halls and national patriotism education bases to the public for free
amid efforts to upgrade cultural services.All national museums and provincial comprehensive museums will stop charging
entry fees this year, says a government circular. Museums and memorial halls listed as national patriotism education bases
will open for free, adds the circular, jointly issued by the Publicity Department of the Communist Party of China Central
Committee, the ministries of finance and culture, and the State Administration of Cultural Heritage on Janyary 23. Free
entry is also available to museums above county level in Zhejiang, Fujian, Hubei, Jiangxi, Anhui and Gansu provinces and
Xinjiang Uygur Autonomous Region. Other provinces, autonomous regions and municipalities are encouraged cut or abolish
entry fees according to their circumstances, the circular says. All museums, memorial halls and national patriotism education
bases will be free to visit by 2009 except cultural relics and historical sites, which will have cheap rates for minors, the
elderly, soldiers, the disabled and low-income families, says the circular. For special or guest exhibitions, museums and
memorial halls can charge fees, the circular says, and museums are encouraged to have cheap tickets and flexible plans, such
as regular free entry, and cheap tickets for groups and families.
Question:
In which provinces will museums above country level be open for free?

Figure 13: An example from the SentComp dataset (Filippova and Altun, 2013). The compressed text is highlighted
in blue. The provided compressed text fails to cover the question references which are highlighted in red.
Document:
The overall results regarding the long-term effects of exchange rate volatility are highly informative in relation to the exports
and imports of an LDC. Mexico’s exports of agricultural goods are clearly depressed by uncertainty: Table 3 shows that no
unprocessed agricultural good responds positively, while various animal, vegetable, and wood products make up 6 of the 21
industries with negative effects. Imports are also affected. While the category of Oil-seeds, oil nuts, and oil kernels does
seem to increase because of uncertainty, 6 of the 21 industries in which volatility reduces import flows are agricultural in
nature. Mexican textile exports also show clear negative effects due to uncertainty, not only for the category of Clothing
except fur clothing, but also for the inputs of Textile and leather machinery and Textile yarn and thread (in Table 4).
Question:
Which industries of textile suffer from negative effects due to the exchange rate uncertainty?

Figure 14: An example from the DebateSum dataset (Roush and Balaji, 2020). The compressed text is highlighted
in blue. The provided compressed text fails to cover the question references which are highlighted in red.

LongBench-Zh
Methods            SingleDoc   MultiDoc   Summ.   FewShot   Synth.   AVG     Tokens   1/τ
Task(Question)-Agnostic Compression
LLMLingua          35.2        20.4       11.8    24.3      51.4     28.6     3060    5x
LLMLingua-2        46.7        23.0       15.3    32.8      72.6     38.1     3023    5x
Original Prompt    61.2        28.7       16.0    29.2      77.5     42.5    14940    -

Table 9: Out-of-domain evaluation on LongBench Chinese benchmarks.

                        QA         Summary                                              Length
Data Type               F1 Score   BLEU    Rouge1   Rouge2   RougeL   BERTScore   # Tokens   1/τ
Annotated   Filtered    58.71      17.74   48.42    23.71    34.36    88.99       1629       3.3x
            Kept        92.82      19.53   50.24    25.16    36.38    89.05       855        2.9x
            All         86.30      19.17   49.89    24.90    35.97    89.04       1003       3.0x
Original    Filtered    59.65      20.53   46.39    25.31    34.17    88.91       5298       -
            Kept        94.41      23.05   47.73    27.20    35.74    88.99       2461       -
            All         87.75      22.34   47.28    26.66    35.15    88.96       3003       -

Table 10: Ablation study of the filtering process in dataset construction. Annotated gathers all words that are assigned a True label by our annotation algorithm into the input prompt. Filtered denotes the samples discarded by the filtering process in Sec. 3.3, while Kept represents the retained samples.
Methods               1st    5th    10th   15th   20th   Reorder   Tokens   1/τ
4x constraint
Question-Aware Compression
BM25†                 40.6   38.6   38.2   37.4   36.6   36.3      798      3.7x
Gzip†                 63.1   61.0   59.8   61.1   60.1   62.3      824      3.6x
SBERT†                66.9   61.1   59.0   61.2   60.3   64.4      808      3.6x
OpenAI†               63.8   64.6   65.4   64.1   63.7   63.7      804      3.7x
LLMLingua-2+          74.0   70.4   67.0   66.9   65.3   71.9      739      3.9x
LongLLMLingua†        75.0   71.8   71.2   71.2   74.7   75.5      748      3.9x
Question-Agnostic Compression
Selective-Context†    31.4   19.5   24.7   24.1   43.8   -         791      3.7x
LLMLingua†            25.5   27.5   23.5   26.5   30.0   27.0      775      3.8x
LLMLingua-2           48.6   44.5   43.6   40.9   39.9   46.2      748      3.9x
Original Prompt       75.7   57.3   54.1   55.4   63.1   -         2,946    -
Zero-shot             56.1                                         15       196x

Table 11: Performance comparison on NaturalQuestions (20 documents) (Liu et al., 2023a). LLMLingua-2+ denotes LLMLingua-2 with LongLLMLingua (Jiang et al., 2023b) coarse-grained compression. †: numbers reported in Jiang et al. (2023b).

LongBench-SingleDoc
Methods                          QA Score   Tokens   1/τ     QA Score   Tokens   1/τ
Target Token Constraint          2000 Tokens                 3000 Tokens
LLMLingua-2                      29.8       1954     7.4x    35.5       3392     4.3x
Compression Ratio Constraint     7x                          5x
LLMLingua-2 FR†                  25.1       2131     6.8x    27.4       3185     4.5x
LLMLingua-2 DCR‡                 29.5       2125     6.8x    32.2       3164     4.5x
Original Prompt                  39.7       14,511   1x      39.7       14,511   1x

Table 12: Evaluation of LLMLingua-2 sample-wise dynamic compression on the LongBench single-document QA task. FR† assigns the same fixed compression rate to every example. DCR‡ assigns a dynamic compression rate to each example under the corpus-level constraint.
