
Synopsis

LLMZip: Lossless Text Compression using Large Language Models

Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, Srinivas Shakkottai

Abstract
We provide new estimates of an asymptotic upper bound on the entropy of English using the large
language model LLaMA-7B as a predictor for the next token given a window of past tokens. This
estimate is significantly smaller than currently available estimates. A natural by-product is an
algorithm for lossless compression of English text which combines the prediction from the large
language model with a lossless compression scheme. Preliminary results from limited experiments
suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ,
and paq8h.

Introduction
The paper explores the connection between learning, prediction, and compression. It uses LLaMA-7B
as a predictor for text compression and achieves better compression ratios than existing methods.
The estimated entropy of the English language is lower than previous estimates. This suggests that
large language models can be effectively used for text compression.

INTUITIVE EXPLANATION OF THE MAIN IDEA


This section gives an intuitive description of how text is compressed with a language model. The main idea is to predict the next word in a sentence from the previous words and then encode the rank of the actual next word in the model's list of predictions, ordered from most to least likely. The resulting sequence of ranks is then compressed with a standard algorithm. Because a good predictor assigns small ranks most of the time, the sequence of ranks is highly skewed and compresses well, so using the probabilities produced by the language model directly leads to better compression.
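A minimal sketch of this rank-encoding step is given below. It uses a small, generic causal language model through the Hugging Face transformers library purely for illustration; the model name, the example sentence, and the variable names are placeholders rather than the authors' code, and a real compressor would also have to transmit the first token separately.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" stands in for LLaMA-7B to keep the example small.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "compression is the art of finding structure in data"
ids = tokenizer(text, return_tensors="pt").input_ids[0]        # token ids of the text

with torch.no_grad():
    logits = model(ids.unsqueeze(0)).logits[0]                 # (sequence length, vocabulary size)

ranks = []
for t in range(1, len(ids)):                                   # predict token t from the tokens before it
    order = torch.argsort(logits[t - 1], descending=True)      # candidate tokens, most likely first
    rank = (order == ids[t]).nonzero(as_tuple=True)[0].item()  # position of the actual token in that list
    ranks.append(rank)

print(ranks)  # well-predicted text gives mostly small ranks (often 0), which compress well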

COMPRESSION USING LLMS


This section describes the compression pipeline in more detail. The text is first parsed into tokens, and the language model predicts the next token from the previous tokens. The rank of the actual next token in the predicted (most-likely-first) list is recorded, and the resulting sequence of ranks is compressed. The compression ratio is defined as the number of compressed bits divided by the number of characters in the original text.
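Written out, with notation that is illustrative rather than the paper's, the compression ratio reported throughout is

\rho \;=\; \frac{\text{total number of compressed bits}}{\text{number of characters in the original text}} \qquad \text{(bits/character)}.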
 Entropy bounds
This section derives a relationship between the entropy of a language sequence and the entropy of its tokenized representation. The key result is that the entropy rate of the character sequence equals the entropy rate of the token sequence divided by the average number of characters per token; an illustrative formula is given after this list. This result is used to establish an asymptotic upper bound on the entropy of the language sequence.
 Encoding schemes
 Compressing the ranks using zlib
The first scheme uses the zlib compression algorithm to encode the sequence of ranks. The paper refers to this scheme as LLaMA+zlib and denotes its compression ratio by ρLLaMA+zlib; a minimal sketch of this step appears after this list.
 Token-by-Token Compression
The second scheme, LLaMA+TbyT, is a token-by-token lossless compression scheme: each token is encoded with a prefix-free code whose code length is derived from the predicted probability distribution. The compression ratio for this scheme is obtained by summing the code lengths of all tokens and dividing by the total number of characters; see the code-length sketch after this list.
 Arithmetic Coding
The third scheme, LLaMA+AC, uses arithmetic coding to combine the LLM's predicted probabilities with a lossless compression scheme. Arithmetic coding is well suited to time-varying probabilities and can achieve near-optimal compression ratios; the sketch after this list contrasts its ideal code length with the token-by-token scheme.
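The entropy-rate relationship summarized under "Entropy bounds" can be written compactly as follows; the notation is illustrative rather than the paper's.

% H_S: entropy rate of the character (language) sequence, in bits/character
% H_T: entropy rate of the token sequence, in bits/token
% \bar{\ell}: average number of characters per token
H_S \;=\; \frac{H_T}{\bar{\ell}},
\qquad \text{so any upper bound } \hat{H}_T \ge H_T \text{ yields } \; H_S \le \frac{\hat{H}_T}{\bar{\ell}}.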
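For the LLaMA+zlib scheme, a minimal sketch of the rank-compression step is shown below, reusing the hypothetical ranks and text variables from the earlier sketch. The two-byte packing is a simplification chosen so that any rank in the model's vocabulary fits losslessly; it is not the authors' encoding.

import zlib

# Pack each rank into two little-endian bytes (any rank below 65536 fits, which covers
# typical LLM vocabularies), then deflate the resulting byte stream.
rank_bytes = b"".join(r.to_bytes(2, "little") for r in ranks)
compressed = zlib.compress(rank_bytes, 9)

# Compressed bits per original character (ignoring the first token and other overheads).
ratio = 8 * len(compressed) / len(text)
print(f"zlib-on-ranks ratio: {ratio:.2f} bits/character")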
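For the token-by-token and arithmetic-coding schemes, the sketch below estimates the corresponding code lengths directly from the model's predicted probabilities: a prefix-free (Shannon-style) code spends roughly ceil(-log2 p) bits per token, while an ideal arithmetic coder spends about -log2 p bits per token in total, which is why it handles time-varying probabilities with almost no per-token rounding loss. This is an idealized calculation, not an actual encoder, and it reuses the hypothetical ids, logits, and text from the first sketch.

import math
import torch

probs = torch.softmax(logits, dim=-1)            # one predicted distribution per position

tbyt_bits = 0.0    # token-by-token: roughly ceil(-log2 p) bits per token (prefix-free code)
ac_bits = 0.0      # arithmetic coding: roughly -log2 p bits per token, summed over the sequence
for t in range(1, len(ids)):
    p = probs[t - 1, ids[t]].item()              # model probability assigned to the true token
    tbyt_bits += math.ceil(-math.log2(p))
    ac_bits += -math.log2(p)

print(f"token-by-token : {tbyt_bits / len(text):.2f} bits/character")
print(f"arithmetic code: {ac_bits / len(text):.2f} bits/character (ideal, ignoring small overhead)")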
Conclusion and Limitations
This research investigates the compression performance of a large language model (LLM) called
LLaMA-7B. The authors compare it against state-of-the-art text compression algorithms on the text8
dataset.
They find that LLaMA-7B combined with arithmetic coding achieves a compression ratio of 0.71 bits/character, compared with 1.4 bits/character for ZPAQ and 1.2 bits/character for paq8h. This suggests that LLaMA-7B-based compression is substantially more efficient than these methods on this dataset.
However, the authors acknowledge limitations. The results might be biased because the text8 dataset is derived from Wikipedia, which was likely part of LLaMA-7B's training data. To address this, they also tested LLaMA-7B on a different book and observed similar compression improvements.
Overall, the study highlights the potential of LLMs for text compression, achieving better results than
traditional algorithms.

References
• Thomas Cover and Roger King, “A convergent gambling estimate of the entropy of English,” IEEE Transactions on Information Theory, vol. 24, no. 4, pp. 413–421, 1978.
• Shahar Lutati, Itamar Zimerman, and Lior Wolf, “Focus your attention (with adaptive IIR filters),” 2023.
• Claude E. Shannon, “Prediction and entropy of printed English,” Bell System Technical Journal, vol. 30, no. 1, pp. 50–64, 1951.
• John Cleary and Ian Witten, “Data compression using adaptive coding and partial string matching,” IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402, 1984.
• Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, and Idoia Ochoa, “DeepZip: Lossless data compression using recurrent neural networks,” arXiv preprint arXiv:1811.08162, 2018.
• Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample, “LLaMA: Open and efficient foundation language models,” 2023.
• J. Frank Dobie, Legends of Texas, Texas Folk-Lore Society, 1924; Project Gutenberg, May 25, 2023, https://www.gutenberg.org/ebooks/70859.
• Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Wiley, New York, 1999.
• Timothy Bell, Ian H. Witten, and John G. Cleary, “Modeling for text compression,” ACM Computing Surveys (CSUR), vol. 21, no. 4, pp. 557–591, 1989.
• David J. C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003.
• Taku Kudo and John Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” CoRR, vol. abs/1808.06226, 2018.
