LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Abstract

This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective.

To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meanwhile introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.

We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.[1]

† Work during internship at Microsoft.
‡ Corresponding author.
[1] Code: https://aka.ms/LLMLingua-2

1 Introduction

Recent years have witnessed the emergence of various prompting techniques for large language models (LLMs), such as Chain-of-Thought (CoT) (Wei et al., 2022), In-Context Learning (ICL) (Dong et al., 2023), and Retrieval-Augmented Generation (RAG) (Lewis et al., 2020). These techniques empower LLMs to handle complex and varied tasks through rich and informative prompts that may exceed tens of thousands of tokens. However, the benefits of such lengthy prompts come at the cost of increased computational and financial overhead, as well as degraded information perception ability of LLMs. Prompt compression is a straightforward solution to these issues: it attempts to shorten the original prompts without losing essential information.

Several methods have been proposed to compress prompts in a task-aware manner (Jiang et al., 2023b; Xu et al., 2024; Jung and Kim, 2023; Huang et al., 2023). These techniques aim to generate compressed prompts tailored to the specific task or query, typically resulting in enhanced performance on downstream tasks, particularly in question answering. However, the dependency on task-specific features presents challenges in terms of efficiency and generalizability when deploying these methods. For example, in RAG-style applications, it may become necessary to compress the same documents multiple times depending on the associated queries with task-aware prompt compression. More details are discussed in Sec. 2.
Some works have explored task-agnostic prompt compression methods for better generalizability and efficiency (Jiang et al., 2023a; Li et al., 2023). The underlying assumption is that natural language contains redundancy (Shannon, 1951) that may be useful for human understanding but might not be necessary for LLMs. Therefore, they propose to compress prompts by removing tokens (Jiang et al., 2023a) or lexical units (Li et al., 2023) according to their information entropy obtained from a causal small language model (SLM), regardless of the downstream task or question information. However, these task-agnostic methods face two challenges: (i) information entropy is an empirical metric for prompt compression, and relying on it for prompt trimming may be suboptimal, as it is not aligned with the prompt compression objective; (ii) causal LMs only leverage unidirectional context, which may fail to capture all essential information needed for prompt compression within the context.

These challenges lead to the following research questions:

Q1. How can we identify or build a suitable dataset to align the SLM towards effective prompt compression?

Q2. How can we design a compression algorithm that effectively leverages the full bidirectional context for better performance?

For Q1, most text compression datasets are abstractive (Toutanova et al., 2016; Koupaee and Wang, 2018; Kim et al., 2019), meaning that they treat prompt compression as a generative task where the original prompts are rephrased into condensed ones. However, this autoregressive generation process is slow and may produce hallucinated content (Zhao et al., 2020). On the other hand, extractive compression datasets such as SentComp (Filippova and Altun, 2013) and DebateSum (Roush and Balaji, 2020) are usually created for the summarization task and often lack detailed information. In the case of prompt compression, this hurts the performance of LLM inference in downstream applications such as QA (see Appendix G for some examples). Therefore, it is necessary to construct an extractive text compression dataset that retains essential information.

Contributions. We present this paper to address the above challenges for task-agnostic prompt compression. We make the following contributions:

• We propose a data distillation procedure to derive knowledge from an LLM (GPT-4) to compress prompts without losing crucial information. We introduce an extractive text compression dataset containing pairs of original texts from MeetingBank (Hu et al., 2023) and their compressed versions, and we publicly release the dataset.

• We approach prompt compression as a token classification task (i.e., preserve or discard) and take the predicted probability of each token being labeled preserve as the compression metric. The benefits are threefold: (1) it captures all essential information needed for prompt compression from the full bidirectional context by using a Transformer encoder for feature extraction; (2) it leads to lower latency, owing to the use of smaller models that explicitly learn the compression objective; (3) it guarantees faithfulness of the compressed prompt to the original content.

• We conduct extensive experiments and analysis on both in-domain (i.e., MeetingBank) and out-of-domain datasets (i.e., LongBench, ZeroScrolls, GSM8K, and Big Bench Hard). Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability from GPT-3.5-Turbo to Mistral-7B. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.

2 Related Works

Depending on whether task information is used for compression, prompt compression methods can be categorized into task-aware and task-agnostic approaches.

Task-aware compression compresses the context based on the downstream task or the current query. For example, LongLLMLingua (Jiang et al., 2023b) applies a question-aware coarse-to-fine compression approach that estimates the information entropy of tokens and adapts the estimation according to the question. Reinforcement learning (RL) based methods (Jung and Kim, 2023; Huang et al., 2023) usually train a model for prompt compression with reward signals from downstream tasks. Soft prompt tuning methods (Wingate et al., 2022; Mu et al., 2023) typically require fine-tuning for the specific task. Xu et al. (2024) train a summarization model to compress the context depending on the question. Task-aware compression approaches are usually tailored to specific tasks and compression ratios, which may limit their generalizability in real-world applications.

Task-agnostic methods compress the prompt without considering the specific task, making them more generalizable and efficient to deploy.
Figure 1: Overview of our framework. Step 1: Data Distillation (an LLM, GPT-4, compresses the original text); Step 2: Data Annotation (assigning preserve/discard labels); Step 3: Quality Control & Filtering; Step 4: Train Compressor (token classifier); Step 5: Prompt Compression based on p_preserve. The figure illustrates the pipeline with an original MeetingBank transcript and its compressed version.
3 Dataset Construction

In this section, we outline the process of dataset construction for prompt compression. We first introduce our data distillation procedure, which involves extracting knowledge from an LLM (GPT-4) to compress texts without losing crucial information or introducing hallucinated content (Sec. 3.1). Leveraging the distilled knowledge from the LLM, we then explain our data annotation algorithm, which assigns labels to each word in the original text to indicate whether it should be preserved after compression (Sec. 3.2). To ensure the dataset's quality, we propose two quality control metrics for filtering low-quality samples (Sec. 3.3).

3.1 Data Distillation

Instruction Design A well-crafted instruction is the key to unveiling the compression capabilities of GPT-4. To ensure that the generated texts stay faithful to the original, we explicitly instruct GPT-4 to compress the text by discarding unimportant words in the original texts only and not adding any new words during generation.
To ensure token reduction and informativeness, previous studies (Jiang et al., 2023a; Huang et al., 2023) have specified either a compression ratio or a target number of compressed tokens in the instructions. However, GPT-4 often fails to adhere to these restrictions. Additionally, the information density of text can vary significantly depending on its genre, style, etc. For instance, news articles typically contain denser information than meeting transcripts. Furthermore, even within the domain of meeting transcripts, the information density of different speakers may vary. These factors suggest that a fixed compression ratio may not be optimal. Therefore, we remove the compression-ratio restriction from our instructions and instead prompt GPT-4 to compress the original text as short as possible while retaining as much information as possible. As shown in Fig. 3, GPT-4 assigns varying compression ratios to different sentences and discards some sentences entirely. For a comparison between our instruction and those of Jiang et al. (2023a), please refer to Table 7.

Our Instruction for Compression:
Compress the given text to short expressions, and such that you (GPT-4) can reconstruct it as close as possible to the original. Unlike the usual text compression, I need you to comply with the 5 conditions below:
1. You can ONLY remove unimportant words.
2. Do not reorder the original words.
3. Do not change the original words.
4. Do not use abbreviations or emojis.
5. Do not add new words or symbols.
Compress the origin aggressively by removing words only. Compress the origin as short as you can, while retaining as much information as possible. If you understand, please compress the following text: {text to compress}
The compressed text is:

Figure 2: Our instruction used for data distillation.

Figure 3: Distribution of compression ratio after chunk-wise compression on MeetingBank.

Chunk-Wise Compression Empirically, we have found that the length of the original text has a notable influence on the compression performance. As shown in Fig. 4, GPT-4 tends to apply a high compression ratio when processing very long contexts, which might be due to GPT-4's limited ability to handle long contexts. This aggressive compression leads to substantial information loss, significantly impacting the performance of downstream tasks. To mitigate this issue, we first segment each long context into multiple chunks, each containing no more than 512 tokens and ending with a period. We then instruct GPT-4 to compress each chunk individually.

Figure 4: Illustration of compression ratio w.r.t. original context length on MeetingBank. We use GPT-4-32k with the output token limit set to 4096.
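The chunk-wise distillation step above can be sketched as follows. This is a minimal illustration assuming the OpenAI Python SDK and tiktoken are available; the helper names (`split_into_chunks`, `distill_chunk`), the sentence-splitting heuristic, the temperature, and the model identifier are our own choices, not the released implementation.

```python
# Sketch of chunk-wise data distillation (Sec. 3.1), assuming the OpenAI SDK.
# Helper names, the sentence splitter, and the model id are illustrative.
import tiktoken
from openai import OpenAI

INSTRUCTION = (
    "Compress the given text to short expressions, and such that you (GPT-4) can "
    "reconstruct it as close as possible to the original. Unlike the usual text "
    "compression, I need you to comply with the 5 conditions below:\n"
    "1. You can ONLY remove unimportant words.\n2. Do not reorder the original words.\n"
    "3. Do not change the original words.\n4. Do not use abbreviations or emojis.\n"
    "5. Do not add new words or symbols.\n"
    "Compress the origin aggressively by removing words only. Compress the origin as "
    "short as you can, while retaining as much information as possible. "
    "If you understand, please compress the following text: {text}\nThe compressed text is:"
)

enc = tiktoken.get_encoding("cl100k_base")
client = OpenAI()

def split_into_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Greedily pack sentences (ending with a period) into chunks of <= max_tokens."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if current and len(enc.encode(candidate)) > max_tokens:
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def distill_chunk(chunk: str, model: str = "gpt-4") -> str:
    """Ask GPT-4 to compress one chunk by deleting words only."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": INSTRUCTION.format(text=chunk)}],
    )
    return response.choices[0].message.content

def distill(text: str) -> list[tuple[str, str]]:
    """Return (original chunk, compressed chunk) pairs for annotation (Sec. 3.2)."""
    return [(c, distill_chunk(c)) for c in split_into_chunks(text)]
```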
3.2 Data Annotation

Having obtained pairs of original texts and their compressed versions from data distillation (Sec. 3.1), the goal of data annotation is to assign a binary label to each token in the original texts to determine whether it should be preserved or discarded after compression. Fig. 5 describes the three primary obstacles encountered here, which arise from GPT-4's inability to precisely comply with the instruction in Fig. 9. Alg. 1 outlines the overall procedure of the proposed annotation algorithm designed to deal with these obstacles. For more detailed information, please refer to Appendix B.

Original Texts:
Item 15, report from City Manager Recommendation to adopt three resolutions. First, to join the Victory Pace program. Second, to join the California first program. And number three, consenting to to inclusion of certain properties within the jurisdiction in the California Hero program.

Compressed Texts:
City Manager Recommendation adopt three resolutions. Join California first program. Consent properties inclusion jurisdiction California Hero program.

Figure 5: Challenges in data annotation.
(i) Ambiguity: a word in the compressed texts may appear multiple times in the original content.
(ii) Variation: GPT-4 may modify the original words in tense, plural form, etc. during compression.
(iii) Reordering: the order of words may be changed after compression.
Algorithm 1: Data Annotation
Input: original string S_ori, compressed string S_comp, window size s.
  Split original string S_ori into word list S_ori.
  Split compressed string S_comp into word list S_comp.
  Initialize labels of original words to False.
  Initialize previous match index prev to 0.
  for w ∈ S_comp do
      for i = 1, 2, ..., 2s do
          right = min(|S_ori|, prev + i)
          if fuzzy_match(w, S_ori[right]) then
              L[right] = True
              prev = right
              break
          end
          left = max(0, prev − i)
          if fuzzy_match(w, S_ori[left]) then
              L[left] = True
              break
          end
      end
  end
Output: labels of original words L(S_ori).
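Below is a direct Python transcription of Algorithm 1, intended only as a sketch: `fuzzy_match` is approximated here with difflib, and the window size and similarity threshold are illustrative values rather than the settings used in the paper.

```python
# Sketch of the annotation procedure in Algorithm 1 (Sec. 3.2).
# The fuzzy matching rule (handling tense/plural 'Variation') is approximated with difflib.
from difflib import SequenceMatcher

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two words as a match if they are sufficiently similar."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def annotate(original: str, compressed: str, window: int = 150) -> list[bool]:
    """Label each original word True if it survives in the compressed text."""
    ori_words = original.split()
    comp_words = compressed.split()
    labels = [False] * len(ori_words)
    prev = 0  # index of the previous match in the original word list
    for w in comp_words:
        # Search outward from the previous match to resolve the 'Ambiguity' challenge.
        for i in range(1, 2 * window):
            right = min(len(ori_words) - 1, prev + i)
            if fuzzy_match(w, ori_words[right]):
                labels[right] = True
                prev = right
                break
            left = max(0, prev - i)
            if fuzzy_match(w, ori_words[left]):
                labels[left] = True
                break
    return labels
```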
leverage information from the bidirectional con-
texts of each token. We train the classification
absent in the original text. Specifically, let Scomp model on the dataset constructed in Sec. 3 from
be the set of words in the compressed text and Sori MeetingBank (Hu et al., 2023). During inference,
be that of the original text. VR is defined as: we determine whether to preserve or discard each
token in the original prompt based on its probability
1 X
calculated by our classification model.
VR = I(w ∈
/ Sori ), (1)
|Scomp |
w∈Scomp
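A small sketch of the two quality-control metrics (Eqs. 1-4) is given below. It simplifies word splitting to whitespace tokens and uses exact set membership; in the actual pipeline the labels l(·) come from the fuzzy annotation of Algorithm 1.

```python
# Sketch of the quality-control metrics in Sec. 3.3 (Eqs. 1-4).
# Word splitting and matching are simplified to whitespace tokens and exact membership.

def variation_rate(original: str, compressed: str) -> float:
    """Eq. (1): share of compressed words that never appear in the original text."""
    ori = set(original.split())
    comp = compressed.split()
    return sum(w not in ori for w in comp) / max(len(comp), 1)

def alignment_gap(original: str, compressed: str, labels: list[bool]) -> float:
    """Eq. (4): AG = HR - MR, where `labels` comes from the annotation algorithm."""
    ori_words = original.split()
    ori_set = set(ori_words)
    n_ori = max(len(ori_words), 1)
    mr = sum(labels) / n_ori                                    # Eq. (2)
    hr = sum(w in ori_set for w in compressed.split()) / n_ori  # Eq. (3)
    return hr - mr

# Filtering as described above: drop the top 5% VR and top 10% AG examples.
```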
4 Compressor

We formulate prompt compression as a binary token classification problem (i.e., preserve or discard) to guarantee the faithfulness of the compressed prompt to the original content, and meanwhile ensure the low latency of the compression model itself. For the token classification model, we employ a Transformer encoder as the feature extractor to leverage information from the bidirectional context of each token. We train the classification model on the dataset constructed in Sec. 3 from MeetingBank (Hu et al., 2023). During inference, we determine whether to preserve or discard each token in the original prompt based on the probability calculated by our classification model.
Methods | QA F1 | BLEU | Rouge1 | Rouge2 | RougeL | BERTScore | Tokens | 1/τ
Selective-Context | 66.28 | 10.83 | 39.21 | 18.73 | 27.67 | 84.48 | 1,222 | 2.5x
LLMLingua | 67.52 | 8.94 | 37.98 | 14.08 | 26.58 | 86.42 | 1,176 | 2.5x
LLMLingua-2-small | 85.82 | 17.41 | 48.33 | 23.07 | 34.36 | 88.77 | 984 | 3.0x
LLMLingua-2 | 86.92 | 17.37 | 48.64 | 22.96 | 34.24 | 88.27 | 970 | 3.1x
Original | 87.75 | 22.34 | 47.28 | 26.66 | 35.15 | 88.96 | 3,003 | 1.0x

Table 1: In-domain evaluation on MeetingBank. QA is measured by F1 score; summarization by BLEU, Rouge, and BERTScore; Tokens and 1/τ report the compressed prompt length and compression ratio.
4.1 Token Classification Model

Architecture We utilize a Transformer encoder (Devlin et al., 2019) as the feature encoder f_θ and add a linear classification layer on top. Given an original prompt consisting of N words x = {x_i}_{i=1}^N, this can be formulated as:

    h = f_θ(x),    (5)
    p(x_i, Θ) = \mathrm{softmax}(W h_i + b),    (6)

where h = {h_i}_{i=1}^N denotes the feature vectors for all words, p(x_i, Θ) ∈ R^2 denotes the probability distribution over the labels {preserve, discard} for the i-th word x_i, and Θ = {θ, W, b} represents all the trainable parameters.

Training Let y = {y_i}_{i=1}^N denote the corresponding labels for all words in x; we then employ the cross-entropy loss to train the model. The loss function L w.r.t. x is:

    \mathcal{L}(Θ) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CrossEntropy}(y_i, p(x_i, Θ)).    (7)
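The token classification model and its training objective (Eqs. 5-7) can be sketched in PyTorch as follows. The encoder choice (xlm-roberta-large) and the Adam learning rate follow Sec. 5; the class name, the assumption that label index 1 means preserve, and the omission of word-to-subword label alignment are ours.

```python
# Minimal PyTorch sketch of the token-classification compressor (Eqs. 5-7).
# Data loading and word-to-subword label alignment are omitted for brevity.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TokenClassifierCompressor(nn.Module):
    def __init__(self, model_name: str = "xlm-roberta-large", num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # bidirectional feature encoder f_theta
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)  # W, b

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state  # Eq. (5)
        return self.classifier(h)  # logits; softmax gives p(x_i, Theta) as in Eq. (6)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = TokenClassifierCompressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 masks padding / special tokens

def training_step(batch):
    """One optimization step; batch['labels'] holds 1 = preserve, 0 = discard per subword."""
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = loss_fn(logits.view(-1, 2), batch["labels"].view(-1))  # Eq. (7)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```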
4.2 Compression Strategy

Our approach to compressing the original prompt x = {x_i}_{i=1}^N with a target compression ratio 1/τ involves a three-step process, where τ is defined as the quotient of the number of words in the compressed prompt and the number of words in the original prompt x. First, we derive the target number of tokens to be preserved in the compressed prompt x̃: Ñ = τN. Next, we use the token classification model to predict the probability p_i of each word x_i being labeled preserve.[2] Finally, we retain the top Ñ words in the original prompt x with the highest p_i and maintain their original order to form the compressed prompt x̃.

It is worth noting that our approach can be readily integrated into the coarse-to-fine framework proposed in LLMLingua (Jiang et al., 2023a), allowing for a higher compression ratio of ~15x for tasks involving multiple demonstrations or documents. In particular, we can replace the perplexity-based iterative token compression module in LLMLingua with our token-classification-based compressor, while keeping the budget controller unchanged.

[2] To address tokenization-related challenges that arise when applying our approach across various LLMs and SLMs, we preserve the integrity of multi-token words and represent the probability of a word by averaging over the predicted probabilities of all its subword tokens.
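The inference-time strategy, including the subword-averaging trick from footnote 2, might look like the sketch below. It reuses `model` and `tokenizer` from the previous sketch; `compress_prompt` is an illustrative helper rather than the interface of the released package, and it again assumes label index 1 corresponds to preserve.

```python
# Sketch of the inference-time compression strategy (Sec. 4.2).
import torch

@torch.no_grad()
def compress_prompt(prompt: str, tau: float = 1 / 3) -> str:
    words = prompt.split()
    target_n = int(tau * len(words))  # N_tilde = tau * N

    # Tokenize word-by-word so each word keeps the indices of its subword tokens.
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)
    logits = model(enc["input_ids"], enc["attention_mask"])
    p_preserve = torch.softmax(logits, dim=-1)[0, :, 1]  # probability of the `preserve` label

    # Average subword probabilities to get one score per word (footnote 2).
    word_ids = enc.word_ids(batch_index=0)
    scores = [[] for _ in words]
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            scores[wid].append(p_preserve[pos].item())
    word_scores = [sum(s) / len(s) if s else 0.0 for s in scores]

    # Keep the top-N_tilde words by score, preserving their original order.
    keep = sorted(range(len(words)), key=lambda i: word_scores[i], reverse=True)[:target_n]
    return " ".join(words[i] for i in sorted(keep))
```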
5 Experiment

Implementation Details We construct our extractive text compression dataset using training examples from MeetingBank (Hu et al., 2023), with implementation details in Appendix A. Our approach is implemented with Huggingface Transformers and PyTorch 2.0.1 with CUDA 11.7. We use xlm-roberta-large (Conneau et al., 2020) and multilingual BERT (Devlin et al., 2019) as the feature encoder f_θ in our compressor, which we refer to as LLMLingua-2 and LLMLingua-2-small, respectively. We fine-tune both models for 10 epochs, using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-5 and a batch size of 10. Unless specified otherwise, all reported metrics use GPT-3.5-Turbo-0613[3] as the target LLM for downstream tasks, with greedy decoding at a temperature of 0 for enhanced stability across experiments.

[3] https://platform.openai.com/
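For completeness, a sketch of how this fine-tuning setup could be expressed with the Huggingface Trainer is shown below; the output directory, the absence of a learning-rate scheduler, and the dataset plumbing are assumptions, while the epoch count, batch size, and Adam learning rate follow the paragraph above.

```python
# Sketch of the fine-tuning configuration described above, using the HF Trainer.
import torch
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

def build_trainer(train_dataset):
    """Assemble a Trainer for the compressor; `train_dataset` is the tokenized
    MeetingBank compression dataset (construction omitted here)."""
    model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large", num_labels=2)
    args = TrainingArguments(
        output_dir="llmlingua2-compressor",  # illustrative path
        num_train_epochs=10,
        per_device_train_batch_size=10,
        learning_rate=1e-5,
        save_strategy="epoch",
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # plain Adam, as stated above
    return Trainer(model=model, args=args, train_dataset=train_dataset,
                   optimizers=(optimizer, None))
```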
Methods | SingleDoc | MultiDoc | Summ. | FewShot | Synth. | Code | AVG | Tokens | 1/τ | ZS AVG | ZS Tokens | ZS 1/τ

2,000-token constraint — Task(Question)-Aware Compression
SBERT† | 33.8 | 35.9 | 25.9 | 23.5 | 18.0 | 17.8 | 25.8 | 1,947 | 5x | 20.5 | 1,773 | 6x
OpenAI† | 34.3 | 36.3 | 24.7 | 32.4 | 26.3 | 24.8 | 29.8 | 1,991 | 5x | 20.6 | 1,784 | 5x
LongLLMLingua† | 39.0 | 42.2 | 27.4 | 69.3 | 53.8 | 56.6 | 48.0 | 1,809 | 6x | 32.5 | 1,753 | 6x

2,000-token constraint — Task(Question)-Agnostic Compression
Selective-Context† | 16.2 | 34.8 | 24.4 | 15.7 | 8.4 | 49.2 | 24.8 | 1,925 | 5x | 19.4 | 1,865 | 5x
LLMLingua† | 22.4 | 32.1 | 24.5 | 61.2 | 10.4 | 56.8 | 34.6 | 1,950 | 5x | 27.2 | 1,862 | 5x
LLMLingua-2-small | 29.5 | 32.0 | 24.5 | 64.8 | 22.3 | 56.2 | 38.2 | 1,891 | 5x | 33.3 | 1,862 | 5x
LLMLingua-2 | 29.8 | 33.1 | 25.3 | 66.4 | 21.3 | 58.9 | 39.1 | 1,954 | 5x | 33.4 | 1,898 | 5x

3,000-token constraint — Task(Question)-Aware Compression
SBERT† | 35.3 | 37.4 | 26.7 | 63.4 | 51.0 | 34.5 | 41.4 | 3,399 | 3x | 24.0 | 3,340 | 3x
OpenAI† | 34.5 | 38.6 | 26.8 | 63.4 | 49.6 | 37.6 | 41.7 | 3,421 | 3x | 22.4 | 3,362 | 3x
LongLLMLingua† | 40.7 | 46.2 | 27.2 | 70.6 | 53.0 | 55.2 | 48.8 | 3,283 | 3x | 32.8 | 3,412 | 3x

3,000-token constraint — Task(Question)-Agnostic Compression
Selective-Context† | 23.3 | 39.2 | 25.0 | 23.8 | 27.5 | 53.1 | 32.0 | 3,328 | 3x | 20.7 | 3,460 | 3x
LLMLingua† | 31.8 | 37.5 | 26.2 | 67.2 | 8.3 | 53.2 | 37.4 | 3,421 | 3x | 30.7 | 3,366 | 3x
LLMLingua-2-small | 35.5 | 38.1 | 26.2 | 67.5 | 23.9 | 60.0 | 41.9 | 3,278 | 3x | 33.4 | 3,089 | 3x
LLMLingua-2 | 35.5 | 38.7 | 26.3 | 69.6 | 21.4 | 62.8 | 42.4 | 3,392 | 3x | 33.5 | 3,206 | 3x

Original Prompt | 39.7 | 38.7 | 26.5 | 67.0 | 37.8 | 54.2 | 44.0 | 10,295 | - | 34.7 | 9,788 | -
Zero-Shot | 15.6 | 31.3 | 15.6 | 40.7 | 1.6 | 36.2 | 23.5 | 214 | 48x | 10.8 | 32 | 306x

Table 2: Out-of-domain evaluation on general long-context scenarios. The first nine result columns report LongBench (task scores, AVG, Tokens, 1/τ); the last three (ZS) report ZeroSCROLLS. †: numbers reported in Jiang et al. (2023b).
Methods | GSM8K 1-shot EM | Tokens | 1/τ | GSM8K half-shot EM | Tokens | 1/τ | BBH 1-shot EM | Tokens | 1/τ | BBH half-shot EM | Tokens | 1/τ
Selective-Context† | 53.98 | 452 | 5x | 52.99 | 218 | 11x | 54.27 | 276 | 3x | 54.02 | 155 | 5x
LLMLingua† | 79.08 | 446 | 5x | 77.41 | 171 | 14x | 70.11 | 288 | 3x | 61.60 | 171 | 5x
LLMLingua-2-small | 78.92 | 437 | 5x | 77.48 | 161 | 14x | 69.54 | 263 | 3x | 60.35 | 172 | 5x
LLMLingua-2 | 79.08 | 457 | 5x | 77.79 | 178 | 14x | 70.02 | 269 | 3x | 61.94 | 176 | 5x
Full-Shot | 78.85 | 2,366 | - | 78.85 | 2,366 | - | 70.07 | 774 | - | 70.07 | 774 | -
Zero-Shot | 48.75 | 11 | 215x | 48.75 | 11 | 215x | 32.32 | 16 | 48x | 32.32 | 16 | 48x

Table 3: Out-of-domain evaluation on reasoning and in-context learning. †: numbers reported in Jiang et al. (2023b).
Datasets & Evaluation Metrics We conduct five groups of experiments to evaluate the compressed prompts on two groups of datasets. (i) In-Domain: As we train our compressor on the dataset built from the training examples of MeetingBank (Hu et al., 2023), we use the MeetingBank test examples for in-domain evaluation. In addition to the summarization task, we further introduce a QA task by prompting GPT-4 to generate 3 question-answer pairs for each example, distributed across the whole context (see Appendix F for more details). For the summarization task, we use the same evaluation metric as in LLMLingua (Jiang et al., 2023a). For the QA task, we use the metrics and scripts provided in LongBench (Bai et al., 2023) Single Document QA for evaluation. (ii) Out-of-Domain: For long-context scenarios, we use LongBench (Bai et al., 2023) and ZeroSCROLLS (Shaham et al., 2023), and we employ the same evaluation metric as in LongLLMLingua (Jiang et al., 2023b). For reasoning and in-context learning, we use GSM8K (Cobbe et al., 2021) and Big Bench Hard (BBH) (bench authors, 2023), with evaluation metrics consistent with LLMLingua (Jiang et al., 2023a).

Baselines We take two state-of-the-art prompt compression methods as the primary baselines for comparison: Selective-Context (Li et al., 2023) and LLMLingua (Jiang et al., 2023a), both based on LLaMA-2-7B. Additionally, we compare our approach with several task-aware prompt compression methods, such as retrieval-based methods and LongLLMLingua (Jiang et al., 2023b).

Results on In-Domain Benchmark In Table 1, we first present the results of our proposed method compared to the strong baselines on MeetingBank. Despite the fact that our compressors are much smaller than the LLaMA-2-7B used in the baselines, our approach achieves significantly better performance on both the QA and summarization tasks, and comes close to matching the performance of the original prompt. This demonstrates the effectiveness of our constructed dataset and highlights the importance and benefit of optimizing the compression model with prompt compression knowledge.
Methods | MeetingBank QA | MeetingBank Summ. | Tokens | 1/τ | LongBench-SingleDoc (2,000-token cons.) | Tokens | 1/τ | LongBench-SingleDoc (3,000-token cons.) | Tokens | 1/τ
Selective-Context | 58.13 | 26.84 | 1,222 | 2.5x | 22.0 | 2,038 | 7.1x | 26.0 | 3,075 | 4.7x
LLMLingua | 50.45 | 23.63 | 1,176 | 2.5x | 19.5 | 2,054 | 7.1x | 20.8 | 3,076 | 4.7x
LLMLingua-2-small | 75.97 | 29.93 | 984 | 3.0x | 25.3 | 1,949 | 7.4x | 27.9 | 2,888 | 5.0x
LLMLingua-2 | 76.22 | 30.18 | 970 | 3.0x | 26.8 | 1,967 | 7.4x | 27.3 | 2,853 | 5.1x
Original Prompt | 66.95 | 26.26 | 3,003 | - | 24.5 | 14,511 | - | 24.5 | 14,511 | -

Table 4: Evaluation with Mistral-7B as the target LLM on MeetingBank and the LongBench single-doc QA task. We report Rouge1 (Lin, 2004) for the summarization task.
Results on Out-of-Domain Benchmarks As our model is trained on meeting transcripts from MeetingBank, we here explore its generalization ability across various benchmarks covering long-context scenarios, reasoning, and in-context learning. Tables 2 and 3 show the results on LongBench, ZeroSCROLLS, GSM8K, and BBH. Our model demonstrates superior performance compared to other task-agnostic baselines. Even our smaller model, which is of BERT-base size, achieves comparable, and in some cases even slightly higher, performance than the original prompt. While our approach shows promising results, it falls short when compared to task-aware compression methods like LongLLMLingua (Jiang et al., 2023b) on LongBench. We attribute this performance gap to the additional information that they leverage from the question. However, the task-agnostic characteristics of our model make it an efficient option with good generalizability when deployed across different scenarios.

Mistral-7B as the Target LLM Table 4 presents the results of different methods using Mistral-7B-v0.1[4] as the target LLM. Our method demonstrates significant performance gains over the other baselines, showcasing its good generalization ability across target LLMs. Notably, LLMLingua-2 yields even better performance than the original prompt. We speculate that Mistral-7B might be less adept at managing long contexts than GPT-3.5-Turbo. Our method, by offering shorter prompts with higher information density, effectively improves Mistral-7B's final inference performance.

[4] https://mistral.ai/

Latency Evaluation Table 5 shows the latency of different systems on a V100-32G GPU with different compression ratios. LLMLingua-2 has a much smaller computation overhead than other compression methods, and achieves an end-to-end speedup ranging from 1.6x to 2.9x. Additionally, our method can reduce GPU memory costs by 8x, lowering the demand on hardware resources. For details, see Appendix I.

1/τ | 1x | 2x | 3x | 5x
End2End w/o Compression | 14.9 | - | - | -
End2End w/ LLMLingua-2 | - | 9.4 (1.6x) | 7.5 (2.1x) | 5.2 (2.9x)
Selective-Context | - | 15.9 | 15.6 | 15.5
LLMLingua | - | 2.9 | 2.1 | 1.5
LLMLingua-2 | - | 0.5 | 0.4 | 0.4

Table 5: Latency (s) comparison on MeetingBank.

Observation on Context Awareness We observe that LLMLingua-2 effectively maintains the most informative words with respect to the full context as the compression ratio increases. We owe this to the adoption of the bidirectional context-aware feature extractor, as well as the strategy of explicitly optimizing toward the prompt compression objective. See Figure 6 for more details.

Prompt Reconstruction We have conducted experiments prompting GPT-4 to reconstruct the original prompt from the LLMLingua-2 compressed prompt. The results show that GPT-4 can effectively reconstruct the original prompt, suggesting that there is no essential information loss during compression with LLMLingua-2. Figures 7 and 8 in Appendix E present some examples.
Methods | SingleDoc | MultiDoc | Summ. | FewShot | Synth. | Code | AVG | Tokens | 1/τ | ZS AVG | ZS Tokens | ZS 1/τ
LLMLingua-2-small | 29.5 | 32.0 | 24.5 | 64.8 | 22.3 | 56.2 | 38.2 | 1,891 | 5x | 33.3 | 1,862 | 5x
LLMLingua-2 | 29.8 | 33.1 | 25.3 | 66.4 | 21.3 | 58.9 | 39.1 | 1,954 | 5x | 33.4 | 1,898 | 5x
LLMLingua-2‡ | 30.7 | 33.9 | 25.4 | 66.6 | 22.6 | 58.1 | 39.5 | 1,853 | 5x | 33.4 | 1,897 | 5x
Original Prompt | 39.7 | 38.7 | 26.5 | 67.0 | 37.8 | 54.2 | 44.0 | 10,295 | - | 34.7 | 9,788 | -
Zero-Shot | 15.6 | 31.3 | 15.6 | 40.7 | 1.6 | 36.2 | 23.5 | 214 | 48x | 10.8 | 32 | 306x

Table 6: Out-of-domain evaluation on general long-context benchmarks (LongBench, with ZeroSCROLLS in the ZS columns) under the 2,000-token constraint. LLMLingua-2‡: we expand the constructed text compression dataset with 50k examples from TriviaQA-wiki and train an LLMLingua-2 compressor on the expanded dataset.
Figure 6: LLMLingua-2 performs context-aware compression. Dark red highlights the words preserved at a 5x compression ratio, medium red denotes a 3x compression ratio, and light red represents a 2x compression ratio. Gray indicates words discarded during compression.

... by GPT-4 are used as ground truth to evaluate the summary performance.

G Drawback of Existing Text Compression Datasets

Existing extractive compression datasets such as SentComp (Filippova and Altun, 2013) and DebateSum (Roush and Balaji, 2020) are mainly created for the summarization task. The compressed texts provided in these datasets are usually too concise, only maintaining the main idea of the original text and lacking detailed information. This information loss inevitably hinders downstream tasks such as document-based QA, as illustrated in Fig. 13 and Fig. 14.

H Model Size and Training Details

We use xlm-roberta-large, which has 355M parameters, as the feature encoder f_θ in LLMLingua-2. The training process takes approximately 23 hours on our MeetingBank compression dataset. For LLMLingua-2-small, we use multilingual BERT as the feature encoder (see Sec. 5).
Our GPT-4 Instruction for Compression:

System Prompt:
You are an excellent linguist and very good at compressing passages into short expressions by removing unimportant words, while retaining as much information as possible.

User Prompt:
Compress the given text to short expressions, and such that you (GPT-4) can reconstruct it as close as possible to the original. Unlike the usual text compression, I need you to comply with the 5 conditions below:
1. You can ONLY remove unimportant words.
2. Do not reorder the original words.
3. Do not change the original words.
4. Do not use abbreviations or emojis.
5. Do not add new words or symbols.
Compress the origin aggressively by removing words only. Compress the origin as short as you can, while retaining as much information as possible. If you understand, please compress the following text: {text to compress}
The compressed text is:

Figure 9: Our complete GPT-4 instruction (system and user prompts) used for data distillation.
Instruction1:
Could you please rephrase the paragraph to make it short, and keep 5% tokens?
Instruction2:
Summarize the provided examples in a few sentences, maintaining all essential reasoning aspects.
Instruction3:
Remove redundancy and express the text concisely in English, ensuring that all key information and reasoning processes are
preserved.
Instruction4:
Follow these steps to shorten the given text content: 1. First, calculate the amount of information contained in each sentence,
and remove sentences with less information. 2. Next, further condense the text by removing stop words, unnecessary
punctuation, and redundant expressions. Refine the content while ensuring that all key information is retained. Let’s do it
step by step.
Figure 10: Other instructions we evaluated, which are proposed in LLMLingua (Jiang et al., 2023a).
Figure 11: Comparison with the baseline. Although LLMLingua-2 is trained only on MeetingBank, it still yields a more reasonable compressed prompt than LLMLingua on BBH.

Figure 12: Comparison with the baseline. Although LLMLingua-2 is trained only on MeetingBank, it still yields a more reasonable compressed prompt than LLMLingua on GSM8K.
Document:
Chinese government is to open more museums, memorial halls and national patriotism education bases to the public for free
amid efforts to upgrade cultural services.All national museums and provincial comprehensive museums will stop charging
entry fees this year, says a government circular. Museums and memorial halls listed as national patriotism education bases
will open for free, adds the circular, jointly issued by the Publicity Department of the Communist Party of China Central
Committee, the ministries of finance and culture, and the State Administration of Cultural Heritage on Janyary 23. Free
entry is also available to museums above county level in Zhejiang, Fujian, Hubei, Jiangxi, Anhui and Gansu provinces and
Xinjiang Uygur Autonomous Region. Other provinces, autonomous regions and municipalities are encouraged cut or abolish
entry fees according to their circumstances, the circular says. All museums, memorial halls and national patriotism education
bases will be free to visit by 2009 except cultural relics and historical sites, which will have cheap rates for minors, the
elderly, soldiers, the disabled and low-income families, says the circular. For special or guest exhibitions, museums and
memorial halls can charge fees, the circular says, and museums are encouraged to have cheap tickets and flexible plans, such
as regular free entry, and cheap tickets for groups and families.
Question:
In which provinces will museums above country level be open for free?
Figure 13: An example from the SentComp dataset (Filippova and Altun, 2013). The compressed text is highlighted
in blue. The provided compressed text fails to cover the question references which are highlighted in red.
Document:
The overall results regarding the long-term effects of exchange rate volatility are highly informative in relation to the exports
and imports of an LDC. Mexico’s exports of agricultural goods are clearly depressed by uncertainty: Table 3 shows that no
unprocessed agricultural good responds positively, while various animal, vegetable, and wood products make up 6 of the 21
industries with negative effects. Imports are also affected. While the category of Oil-seeds, oil nuts, and oil kernels does
seem to increase because of uncertainty, 6 of the 21 industries in which volatility reduces import flows are agricultural in
nature. Mexican textile exports also show clear negative effects due to uncertainty, not only for the category of Clothing
except fur clothing, but also for the inputs of Textile and leather machinery and Textile yarn and thread (in Table 4).
Question:
Which industries of textile suffer from negative effects due to the exchange rate uncertainty?
Figure 14: An example from the DebateSum dataset (Roush and Balaji, 2020). The compressed text is highlighted
in blue. The provided compressed text fails to cover the question references which are highlighted in red.
LongBench-Zh — Task(Question)-Agnostic Compression

Methods | SingleDoc | MultiDoc | Summ. | FewShot | Synth. | AVG | Tokens | 1/τ
LLMLingua | 35.2 | 20.4 | 11.8 | 24.3 | 51.4 | 28.6 | 3,060 | 5x
LLMLingua-2 | 46.7 | 23.0 | 15.3 | 32.8 | 72.6 | 38.1 | 3,023 | 5x
Original Prompt | 61.2 | 28.7 | 16.0 | 29.2 | 77.5 | 42.5 | 14,940 | -
Data Type | Split | QA F1 | BLEU | Rouge1 | Rouge2 | RougeL | BERTScore | # Tokens | 1/τ
Annotated | Filtered | 58.71 | 17.74 | 48.42 | 23.71 | 34.36 | 88.99 | 1,629 | 3.3x
Annotated | Kept | 92.82 | 19.53 | 50.24 | 25.16 | 36.38 | 89.05 | 855 | 2.9x
Annotated | All | 86.30 | 19.17 | 49.89 | 24.90 | 35.97 | 89.04 | 1,003 | 3.0x
Original | Filtered | 59.65 | 20.53 | 46.39 | 25.31 | 34.17 | 88.91 | 5,298 | -
Original | Kept | 94.41 | 23.05 | 47.73 | 27.20 | 35.74 | 88.99 | 2,461 | -
Original | All | 87.75 | 22.34 | 47.28 | 26.66 | 35.15 | 88.96 | 3,003 | -

Table 10: Ablation study of the filtering process in dataset construction. Annotated gathers all words assigned a True label by our annotation algorithm as the input prompt. Filtered denotes the samples discarded by the filtering process in Sec. 3.3, while Kept denotes the retained samples.
Methods | 1st | 5th | 10th | 15th | 20th | Reorder | Tokens | 1/τ

4x constraint — Question-Aware Compression
BM25† | 40.6 | 38.6 | 38.2 | 37.4 | 36.6 | 36.3 | 798 | 3.7x
Gzip† | 63.1 | 61.0 | 59.8 | 61.1 | 60.1 | 62.3 | 824 | 3.6x
SBERT† | 66.9 | 61.1 | 59.0 | 61.2 | 60.3 | 64.4 | 808 | 3.6x
OpenAI† | 63.8 | 64.6 | 65.4 | 64.1 | 63.7 | 63.7 | 804 | 3.7x
LLMLingua-2+ | 74.0 | 70.4 | 67.0 | 66.9 | 65.3 | 71.9 | 739 | 3.9x
LongLLMLingua† | 75.0 | 71.8 | 71.2 | 71.2 | 74.7 | 75.5 | 748 | 3.9x

4x constraint — Question-Agnostic Compression
Selective-Context† | 31.4 | 19.5 | 24.7 | 24.1 | 43.8 | - | 791 | 3.7x
LLMLingua† | 25.5 | 27.5 | 23.5 | 26.5 | 30.0 | 27.0 | 775 | 3.8x
LLMLingua-2 | 48.6 | 44.5 | 43.6 | 40.9 | 39.9 | 46.2 | 748 | 3.9x

Original Prompt | 75.7 | 57.3 | 54.1 | 55.4 | 63.1 | - | 2,946 | -
Zero-Shot | 56.1 | - | - | - | - | - | 15 | 196x

Table 11: Performance comparison on NaturalQuestions (20 documents) (Liu et al., 2023a). LLMLingua-2+ denotes LLMLingua-2 with LongLLMLingua (Jiang et al., 2023b) coarse-level compression. †: numbers reported in Jiang et al. (2023b).
LongBench-SingleDoc

Methods | QA Score | Tokens | 1/τ | QA Score | Tokens | 1/τ

Target token constraint (left group: 2,000 tokens; right group: 3,000 tokens)
LLMLingua-2 | 29.8 | 1,954 | 7.4x | 35.5 | 3,392 | 4.3x

Compression ratio constraint (left group: 7x; right group: 5x)
LLMLingua-2 FR† | 25.1 | 2,131 | 6.8x | 27.4 | 3,185 | 4.5x
LLMLingua-2 DCR‡ | 29.5 | 2,125 | 6.8x | 32.2 | 3,164 | 4.5x

Original Prompt | 39.7 | 14,511 | 1x | 39.7 | 14,511 | 1x

Table 12: Evaluation of LLMLingua-2 sample-wise dynamic compression on the LongBench single-doc QA task. FR† assigns each example the same fixed compression rate. DCR‡ assigns a dynamic compression rate to different examples under a corpus-level constraint.