Synopsis: LLMZIP Research Paper 2
Abstract
We provide new estimates of an asymptotic upper bound on the entropy of English using the large
language model LLaMA-7B as a predictor for the next token given a window of past tokens. This
estimate is significantly smaller than currently available estimates. A natural by-product is an
algorithm for lossless compression of English text which combines the prediction from the large
language model with a lossless compression scheme. Preliminary results from limited experiments
suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ,
and paq8h.
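The entropy estimate described above is presumably obtained from the average log-loss of the model's next-token predictions; a minimal sketch of that bound in LaTeX notation, assuming tokens x_1, ..., x_N, a context window of M past tokens, and model probabilities q(. | .):

    \hat{H} = -\frac{1}{N} \sum_{i=1}^{N} \log_2 q\left(x_i \mid x_{i-M}, \ldots, x_{i-1}\right)

Asymptotically, no lossless code can spend fewer bits per token than the entropy rate of the source, so this average code length serves as an upper bound on it; dividing by the average number of characters per token converts the figure to bits per character, the unit usually quoted for the entropy of English.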
Introduction
The paper explores the connection between learning, prediction, and compression. It uses the next-token predictions of LLaMA-7B to drive a lossless text compressor, achieving better compression ratios than existing methods, and the resulting estimate of the entropy of the English language is lower than previous estimates. These results suggest that large language models can be used effectively for text compression; a sketch of the basic idea follows.
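A minimal Python sketch of the predict-then-compress idea (my own illustration, not the authors' code): a toy character-level bigram model stands in for LLaMA-7B, each symbol is replaced by its rank under the predictor's distribution for that position, and the rank stream is handed to zlib as a stand-in for the entropy coder. The file name legends_of_texas.txt is hypothetical, and the decoder (which reverses the rank mapping with the same predictor) is omitted.

    import zlib
    from collections import Counter, defaultdict

    def build_bigram_predictor(training_text):
        """Toy character-level bigram model; in the paper the predictor is LLaMA-7B."""
        counts = defaultdict(Counter)
        for prev, cur in zip(training_text, training_text[1:]):
            counts[prev][cur] += 1
        fallback = Counter(training_text)

        def ranked_symbols(prev_char):
            # Candidate next symbols, most likely first, given the previous character.
            table = counts[prev_char] if prev_char in counts else fallback
            ordered = [s for s, _ in table.most_common()]
            ordered += [s for s in fallback if s not in table]  # unseen continuations last
            return ordered

        return ranked_symbols

    def rank_encode(text, ranked_symbols):
        # Replace each character by its rank under the predictor; a good predictor
        # places the true next symbol at or near rank 0 most of the time.
        ranks = []
        prev = None
        for ch in text:
            ranks.append(ranked_symbols(prev).index(ch))
            prev = ch
        # Assumes fewer than 256 distinct characters so each rank fits in one byte;
        # zlib stands in for the entropy coder used in the paper.
        return zlib.compress(bytes(ranks), level=9)

    if __name__ == "__main__":
        sample = open("legends_of_texas.txt", encoding="utf-8").read()  # hypothetical input file
        predictor = build_bigram_predictor(sample)  # trained on the text itself, for illustration only
        compressed = rank_encode(sample, predictor)
        print(f"{len(sample)} characters -> {len(compressed)} compressed bytes")

The point of the rank transform is that a strong predictor makes the rank sequence highly skewed toward zero, so it compresses far better than the raw text; the quality of the predictor, rather than the back-end compressor, dominates the compression ratio.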