"Low-Resource" Text Classification: A Parameter-Free Classification Method With Compressors
"Low-Resource" Text Classification: A Parameter-Free Classification Method With Compressors
"Low-Resource" Text Classification: A Parameter-Free Classification Method With Compressors
Table 3: Test accuracy compared with gzip, with red highlighting the results that gzip outperforms. We report results obtained from our own implementation. We also include previously reported results for reference in Appendix E.
Table 5: Test accuracy on OOD datasets with 95% confidence intervals over five trials in the five-shot setting.
Without any pre-training or fine-tuning, our method outperforms both BERT and mBERT on all five datasets. In fact, our experiments show that our method outperforms both pretrained and non-pretrained deep learning methods on OOD datasets, which backs our claim that our method is universal in terms of dataset distributions. Put simply, our method is designed to handle unseen datasets: the compressor is data-type-agnostic by nature, and non-parametric methods introduce no inductive bias during training.

5.3 Few-Shot Learning

We further compare our method with deep learning methods under the few-shot setting. We carry out experiments on AG News, DBpedia, and SogouNews across both non-pretrained deep neural networks and pretrained ones. We use n-shot labeled examples per class from the training dataset, where n ∈ {5, 10, 50, 100}. We choose these three datasets because their scale is large enough to cover the 100-shot setting and because they vary in both text length and language. We choose methods whose trainable parameters range from zero, like word2vec and Sentence-BERT, to hundreds of millions, like BERT, covering both a word-based model (HAN) and an n-gram one (fastText).

We plot the results in Figure 2 (detailed numbers are shown in Appendix D). As shown, gzip outperforms the non-pretrained models in the 5, 10, and 50-shot settings on all three datasets. When the number of shots is as few as n = 5, gzip outperforms non-pretrained models by a large margin: gzip is 115% better in accuracy than fastText in the AG News 5-shot setting. In the 100-shot setting, gzip also outperforms non-pretrained models on AG News and SogouNews but slightly underperforms on DBpedia.

Previous work (Nogueira et al., 2020; Zhang et al., 2021) shows that pretrained models are excellent few-shot learners, which is reflected in the consistently high accuracy of BERT and SentBERT on in-distribution datasets like AG News and DBpedia under few-shot settings.³ It's worth noting, though, that gzip outperforms SentBERT for 50 and 100 shots. However, as the SogouNews results show, when a dataset is distinctively different from the pre-training data, the inductive bias introduced by pre-training leads to low accuracy for BERT and SentBERT in the 10, 50, and 100-shot settings, and especially in the 5-shot setting. In general, as the shot number increases, the accuracy difference between gzip and the deep learning methods becomes smaller. W2V is an exception, with a large variance in accuracy. This is due to its vectors being trained for a limited set of words, meaning that numerous tokens in the test set are unseen and hence out-of-vocabulary.

We further investigate the quality of DNNs and our method in the 5-shot setting on five OOD datasets, tabulating the results in Table 5. Under the 5-shot setting on OOD datasets, our method surpasses all the deep learning methods by a huge margin: it exceeds the accuracy of BERT by 91%, 40%, 59%, 58%, and 194%, and exceeds mBERT's accuracy by 100%, 67%, 40%, 12%, and 130% on the corresponding five datasets.⁴ The reason behind this outperformance is the compressors' excellent ability to capture regularity, which becomes prominent when training is moot for DNNs given very few labeled examples.

³ BERT reaches almost perfect accuracy on DBpedia, probably because the data is extracted from Wikipedia, on which BERT is pretrained.
⁴ mBERT has much higher accuracy than BERT in the few-shot setting on Filipino and Swahili, languages that mBERT was pretrained on.
Figure 2: Comparison among different methods using different shots, with 95% confidence intervals over five trials.

Figure 3: Comparison among different compressors on three datasets, with 95% confidence intervals over five trials.
6 Analyses

6.1 Using Other Compressors

As the compressor in our method can actually be replaced by any other compressor, we evaluate the […] information distance E(x, y) better. But in bz2's case, its accuracy is always lower than the regression line (Figure 4). We conjecture that this is because the Burrows-Wheeler algorithm used by bz2 discards information about character order by permuting characters during compression.

We investigate the correlation between accuracy and compression ratio across compressors and find a moderate monotonic correlation, as shown in Figure 4. As the shot number increases, the linear correlation becomes more evident, with Spearman's rs = 0.605 over all shot settings and Pearson correlations rp = 0.575, 0.638, 0.691, and 0.719 on the 5, 10, 50, and 100-shot settings respectively, across four compressors. We have also found that, for a single compressor, the more easily a dataset can be compressed, the higher the accuracy gzip achieves (details are in Appendix F.1). Combining these findings, we can see that a compressor performs best when it has a high compression ratio on datasets that are highly compressible, unless crucial information is disregarded by its compression algorithm.
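Because the method only touches a compressor through the length of its output, trying another compressor amounts to swapping a single function. Below is a minimal sketch in Python; the helper names and the space-joined concatenation are our own conventions, and zlib stands in for gzip since both implement DEFLATE:

```python
import bz2
import lzma
import zlib

# Each compressor is reduced to one interface: the compressed length of
# the input in bytes, our computable stand-in for Kolmogorov complexity.
COMPRESSED_LEN = {
    "gzip": lambda data: len(zlib.compress(data)),  # DEFLATE, as used by gzip
    "bz2": lambda data: len(bz2.compress(data)),    # Burrows-Wheeler based
    "lzma": lambda data: len(lzma.compress(data)),  # large dictionary, slower
}

def ncd(x: bytes, y: bytes, compressor: str = "gzip") -> float:
    """Normalized Compression Distance under the chosen compressor."""
    c = COMPRESSED_LEN[compressor]
    cx, cy = c(x), c(y)
    return (c(x + b" " + y) - min(cx, cy)) / max(cx, cy)
```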
6.2 Using Other Compressor-Based Methods

A majority of previous compressor-based text classification methods are built on estimating the cross entropy between the probability distribution built on class c and a given document d, Hc(d), as we mention in Section 2.1. As summarized in Russell (2010), the procedure for using a compressor to estimate Hc(d) is as follows (a minimal sketch in code follows the list):

1. For each class c, concatenate all samples dc in the training set belonging to c.

2. Compress dc as one long document to get the compressed length C(dc).

3. Concatenate the given test sample du with dc and compress to get C(dc du).

4. The predicted class is arg min_c C(dc du) − C(dc).
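A minimal sketch of these four steps with gzip as the compressor; the variable and function names are ours, and the per-class compressed lengths are cached since they do not depend on the test sample:

```python
import gzip

def compressed_len(data: bytes) -> int:
    return len(gzip.compress(data))

def fit(train_samples: dict) -> tuple:
    """Steps 1-2: per class, concatenate all training samples and
    record the compressed length C(d_c)."""
    class_docs = {
        c: " ".join(texts).encode("utf-8") for c, texts in train_samples.items()
    }
    class_len = {c: compressed_len(doc) for c, doc in class_docs.items()}
    return class_docs, class_len

def predict(test_sample: str, class_docs: dict, class_len: dict) -> str:
    """Steps 3-4: argmin over classes of C(d_c d_u) - C(d_c)."""
    d_u = test_sample.encode("utf-8")
    return min(
        class_docs,
        key=lambda c: compressed_len(class_docs[c] + b" " + d_u) - class_len[c],
    )
```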
The distance metric used by previous work (Marton et al., 2005; Russell, 2010) is mainly C(dc du) − C(dc). Although this metric is faster to compute than a pairwise distance matrix on small datasets, it has several drawbacks. (1) Most compressors have a limited "size": for gzip it is the sliding window within which repeated strings can be found, while for lzma it is the dictionary size it can keep a record of. This means that even with a large number of training samples, the compressor cannot take full advantage of them. (2) When dc is large, compressing dc du can be slow, which parallelization cannot solve. These two main drawbacks prevent this method from being applied to really large datasets. Thus, to compare our method with it, we limit the dataset to 1,000 randomly picked test samples and 100 shots per class from the training set.

Method AGNews SogouNews DBpedia YahooAnswers
gzip (ce) 0.739±0.046 0.741±0.076 0.880±0.010 0.408±0.012
gzip (kNN) 0.752±0.041 0.862±0.033 0.852±0.008 0.352±0.014

Table 6: Comparison with other compressor-based methods under the 100-shot setting.

In Table 6, "gzip (ce)" refers to using the cross entropy C(dc du) − C(dc), while "gzip (kNN)" refers to our method. We run each experiment five times and report the mean and 95% confidence interval. Our method outperforms the cross-entropy method on AGNews and SogouNews. The large accuracy gap between the two methods on SogouNews is probably because each instance in SogouNews is very long: a single sample can be 11.2K, so the concatenated dc grows larger than 1,000K under the 100-shot setting, while gzip typically has only a 32K window. When the search space is tremendously smaller than the size of dc, the compressor fails to take advantage of all the information in the training set, which renders the compression ineffective. The cross-entropy method does perform very well on YahooAnswers. This might be because, on a divergent dataset like YahooAnswers, which is created by numerous online users, concatenating all the samples in a class allows the cross-entropy method to take full advantage of all the information in that class.

We also test the compressor-based cross-entropy method on the full AGNews dataset, as it is a relatively small one with short single instances. The accuracy is 0.745, not much higher than in the 100-shot setting, which further confirms that using C(dc du) − C(dc) as a distance metric cannot take full advantage of large datasets. In general, the results suggest that the compressor-based cross-entropy method is not as advantageous as ours on large datasets.

7 Conclusions and Future Work

In this paper, we use gzip with a compressor-based distance metric to perform text classification.
Our method achieves an accuracy comparable to non-pretrained neural network classifiers on in-distribution datasets and outperforms both pretrained and non-pretrained models on out-of-distribution datasets. We also find that our method has greater advantages under few-shot settings.

For future work, we will extend this work by generalizing gzip to neural compressors on text, as recent studies (Jiang et al., 2022) show that combining neural compressors derived from deep latent variable models with compressor-based distance metrics can even outperform semi-supervised methods for image classification.

Limitations

As the computational complexity of kNN is O(n²), speed becomes one of the limitations of our method when the dataset gets really big. Multi-threading and multi-processing can greatly boost the speed. Lempel-Ziv Jaccard Distance (LZJD) (Raff and Nicholas, 2017), a more efficient version of NCD, can also be explored to alleviate the inefficiency problem. In addition, as our purpose is to highlight the trade-off between the simplicity of a model and its performance, we focus on the vanilla version of DNNs, which is already complex enough compared with our method, without add-ons like pretrained embeddings (Pennington et al., 2014). This means we do not exhaust all the techniques one can use to improve DNNs, and neither do we exhaust all the text classification methods in the literature. Furthermore, our work only covers traditional compressors. As traditional compressors are only able to capture orthographic similarity, they may not be sufficient for harder classification tasks like emotion classification. Fortunately, the ability to compress redundant semantic information may be made possible by neural compressors built on latent variable models (Townsend et al., 2018).

Ethics

Being parameter-free, our method relies only on CPU resources rather than GPUs; thus, it does not bring the negative environmental impacts that revolve around GPU use. In terms of overgeneralization, we conduct our experiments on both in-distribution and out-of-distribution datasets, covering six languages. As compressors are data-type agnostic, they are more inclusive of datasets, which allows us to classify low-resource languages like Kinyarwanda, Kirundi, and Swahili and to mitigate the underexposure problem (Hovy and Spruit, 2016). However, as our method has not been fully explored on datasets other than topic classification, it is very possible that it makes unexpected classification mistakes on tasks like emotion classification. We encourage real-world usage of this method to be limited to topic classification and hope that future work can explore more diverse tasks.

Acknowledgement

This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada, and in part by the Global Water Futures program funded by the Canada First Research Excellence Fund (CFREF).

References

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019a. DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398.

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019b. Rethinking complex neural network architectures for document classification. In Proceedings of the 2019 Conference of NAACL-HLT, Volume 1 (Long and Short Papers), pages 4046–4051.

Charles H Bennett, Péter Gács, Ming Li, Paul MB Vitányi, and Wojciech H Zurek. 1998. Information distance. IEEE Transactions on Information Theory, 44(4):1407–1423.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Michael Burrows. 1994. A block-sorting lossless data compression algorithm. SRC Research Report, 124.

Xin Chen, Brent Francia, Ming Li, Brian Mckinnon, and Amit Seker. 2004. Shared information and program plagiarism detection. IEEE Transactions on Information Theory, 50(7):1545–1551.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107–1116.

David Pereira Coutinho and Mario AT Figueiredo. 2015. Text classification using compression-based dissimilarity measures. International Journal of Pattern Recognition and Artificial Intelligence, 29(05):1553004.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jarek Duda. 2009. Asymmetric numeral systems. arXiv preprint arXiv:0902.0271.

Eibe Frank, Chang Chui, and Ian H Witten. 2000. Text categorization using compression models.

William Hersh, Chris Buckley, TJ Leone, and David Hickam. 1994. Ohsumed: An interactive retrieval evaluation and new large test collection for research. In SIGIR'94, pages 192–201. Springer.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598.

David A Huffman. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101.

Zhiying Jiang, Yiqin Dai, Ji Xin, Ming Li, and Jimmy Lin. 2022. Few-shot non-parametric learning with deep latent variable model. Advances in Neural Information Processing Systems (NeurIPS).

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137–142. Springer.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. EACL 2017, page 427.

Alexandros Kastanos and Tyler Martin. 2021. Graph convolutional network for Swahili news classification. arXiv preprint arXiv:2103.09325.

Nitya Kasturi and Igor L Markov. 2022. Text ranking and classification using data compression. In I (Still) Can't Believe It's Not Better! Workshop at NeurIPS 2021, pages 48–53. PMLR.

Kazuya Kawakami. 2008. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis.

Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana. 2004. Towards parameter-free data mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 206–215.

Dmitry V Khmelev and William J Teahan. 2003. A repetition based measure for verification of text collections and for text categorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104–110.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).

Andrei N Kolmogorov. 1963. On tables of random numbers. Sankhyā: The Indian Journal of Statistics, Series A, pages 369–376.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.

Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul MB Vitányi. 2004. The similarity metric. IEEE Transactions on Information Theory, 50(12):3250–3264.

Qian Li, Hao Peng, Jianxin Li, Congying Xia, Renyu Yang, Lichao Sun, Philip S Yu, and Lifang He. 2022. A survey on text classification: From traditional to deep learning. ACM Transactions on Intelligent Systems and Technology (TIST), 13(2):1–41.

Xien Liu, Song Wang, Xiao Zhang, Xinxin You, Ji Wu, and Dejing Dou. 2020. Label-guided learning for text classification. arXiv preprint arXiv:2002.10772.

Evan Dennison Livelo and Charibeth Cheng. 2018. Intelligent dengue infoveillance using gated recurrent neural learning and cross-label frequencies. In 2018 IEEE International Conference on Agents (ICA), pages 2–7. IEEE.

Yuval Marton, Ning Wu, and Lisa Hellerstein. 2005. On compression-based text classification. In European Conference on Information Retrieval, pages 300–314. Springer.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Rubungo Andre Niyongabo, Qu Hong, Julia Kreutzer, and Li Huang. 2020. KinNews and KirNews: Benchmarking cross-lingual text classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5507–5521.

Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 708–718.

Antoine Nzeyimana and Andre Niyongabo Rubungo. 2022. KinyaBERT: A morphology-aware Kinyarwanda language model. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5347–5363.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Edward Raff and Charles Nicholas. 2017. An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1007–1015.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.

Stuart J Russell. 2010. Artificial Intelligence: A Modern Approach. Pearson Education, Inc.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450.

Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov, and Noam Slonim. 2022. Cluster & tune: Boost cold start performance in text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7639–7653.

William J Teahan and David J Harper. 2003. Using compression-based language models for text categorization. In Language Modeling for Information Retrieval, pages 141–165. Springer.

James Townsend, Thomas Bird, and David Barber. 2018. Practical lossless compression with latent variables using bits back coding. In International Conference on Learning Representations.

Paul MB Vitányi, Frank J Balbach, Rudi L Cilibrasi, and Ming Li. 2009. Normalized information distance. In Information Theory and Statistical Learning, pages 45–82. Springer.

Canhui Wang, Min Zhang, Shaoping Ma, and Liyun Ru. 2008. Automatic online news issue construction in web environment. In Proceedings of the 17th International Conference on World Wide Web, pages 457–466.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the NAACL-HLT, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.

Yaodong Yu, Heinrich Jiang, Dara Bahri, Hossein Mobahi, Seungyeon Kim, Ankit Singh Rawat, Andreas Veit, and Yi Ma. 2021. An empirical study of pre-trained vision models on out-of-distribution generalization. In NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications.

Haode Zhang, Yuwei Zhang, Li-Ming Zhan, Jiaxin Chen, Guangyuan Shi, Xiao-Ming Wu, and Albert YS Lam. 2021. Effectiveness of pre-training for few-shot intent classification. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1114–1120.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28.

Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343.
A Derivation of NCD

Recall that the information distance E(x, y) is:

E(x, y) = max{K(x|y), K(y|x)}    (4)
        = K(xy) − min{K(x), K(y)}    (5)

E(x, y) measures the similarity between two objects by the program that converts one into the other: the simpler the converting program, the more similar the objects. For example, the negative of an image is very similar to the original, as the transformation can be simply described as "invert the color of the image".

To compare similarity, a relative distance is preferred. Vitányi et al. (2009) propose a normalized version of E(x, y) called the Normalized Information Distance (NID).

Definition 1 (NID) NID is a function Ω × Ω → [0, 1], where Ω is a non-empty set, defined as:

NID(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}.    (6)

Equation (6) can be interpreted as follows: given two sequences x, y with K(y) ≥ K(x):

NID(x, y) = (K(y) − I(x : y)) / K(y) = 1 − I(x : y) / K(y),    (7)

where I(x : y) = K(y) − K(y|x) is the mutual algorithmic information. I(x : y) / K(y) means the shared information (in bits) per bit of information contained in the most informative sequence, and Equation (7) here is a specific case of Equation (6).

Normalized Compression Distance (NCD) is a computable version of NID based on real-world compressors. In this context, K(x) can be viewed as the length of x after being maximally compressed. Suppose we have C(x) as the length of compressed x produced by a real-world compressor; then NCD is defined as:

NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}.    (8)

NCD is thus computable in that it not only uses the compressed length to approximate K(x), but also replaces the conditional Kolmogorov complexity with C(xy), which needs only a simple concatenation of x and y.
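As a concrete reference, here is a minimal sketch of how Equation (8) is computed with gzip standing in for the compressor C, together with the kNN classification our method builds on top of it; the helper names and the space-joined concatenation are our own conventions:

```python
import gzip
from collections import Counter

def C(text: str) -> int:
    """Compressed length of text: the computable stand-in for K(x)."""
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance, Equation (8)."""
    cx, cy = C(x), C(y)
    return (C(x + " " + y) - min(cx, cy)) / max(cx, cy)

def knn_predict(test_text: str, train_set: list, k: int = 2) -> str:
    """Predict the label of test_text by majority vote among the
    k training pairs (text, label) nearest under NCD."""
    nearest = sorted(train_set, key=lambda pair: ncd(test_text, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```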
KirundiNews (KirNews) and KinyarwandaNews
K(y) − I(x : y) I(x : y) (KinNews) are introduced in (Niyongabo et al.,
NID(x, y) = =1− , 2020), collected as a benchmark for text classifica-
K(y) K(y)
(7) tion on two low-resource African languages, which
where I(x : y) = K(y) − K(y|x) means the can be freely downloaded from the repository.
mutual algorithmic information. I(x:y) SwahiliNews (Swahili)8 is a news dataset in
K(y) means the
shared information (in bits) per bit of information Swahili. It’s spoken by 100-150 million people
contained in the most informative sequence, and across East Africa, and the dataset is created to
Equation (7) here is a specific case of Equation (6). help leverage NLP techniques across the African
Normalized Compression Distance (NCD) is a continent, which can be freely downloaded from
computable version of NID based on real-world huggingface datasets.
compressors. In this context, K(x) can be viewed DengueFilipino (Filipino) (Livelo and Cheng,
as the length of x after being maximally com- 2018) is a multi-label low-resource classification
pressed. Suppose we have C(x) as the length of dataset, which can be freely downloaded from hug-
compressed x produced by a real-world compres- gingface datasets. We process it as a single-label
sor, then NCD is defined as: classification task — we randomly select a label if
an instance have multiple labels and use the same
C(xy) − min{C(x), C(y)} processed dataset for every model.
NCD(x, y) = . (8)
max{C(x), C(y)} SogouNews is collected by Wang et al. (2008),
segmented and labeled by Zhang et al. (2015). We
NCD is thus computable in that it not only uses use the version that’s publicly available on torch-
compressed length to approximate K(x) but also text.
replaces conditional Kolmogorov complexity with 7
http://groups.di.unipi.it/g̃ulli/AG_corpus_of_news
C(xy) that only needs a simple concatenation of _articles.html
8
x, y. https://doi.org/10.5281/zenodo.5514203
Dataset Sample Text
AGNews “Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street’s dwindling band of ultra-cynics, are seeing green again.”
DBpedia “European Association for the Study of the Liver”, “The European Association for the Study of the Liver (EASL) is a European professional association for liver disease.”
YahooAnswers “Is a transponder required to fly in class C airspace?”, “I’ve heard that it may not be for some aircraft. What are the rules?”, “the answer is that you must have a transponder in order to fly in a class C airspace.”
20News “Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks,- IL —- brought to you by your neighborhood Lerxst —-”
Ohsumed “Protection against allergen-induced asthma by salmeterol.The effects of the long-acting beta 2-agonist salmeterol on early and late phase airways events provoked by inhaled allergen were assessed in a group of atopic asthmatic patients.In a placebo-controlled study, salmeterol 50 micrograms inhaled before allergen challenge ablated both the early and late phase of allergen-induced bronchoconstriction over a 34 h time period.Salmeterol also completely inhibited the allergen-induced rise in non-specific bronchial responsiveness over the same time period.These effects were shown to be unrelated to prolonged bronchodilatation or functional antagonism.These data suggest novel actions for topically active long-acting beta 2-agonists in asthma that extend beyond their protective action on airways smooth muscle.”
R8 “champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter ”
R52 “january housing sales drop realty group says sales of previously owned homes dropped pct in january to a seasonally adjusted annual rate of mln units the national association of realtors nar said but the december rate of mln units had been the highest since the record mln unit sales rate set in november the group said the drop in january is not surprising considering that a significant portion of december s near record pace was made up of sellers seeking to get favorable capital gains treatment under the old tax laws said the nar s john tuccillo reuter”
KinNews “mutzig beer fest itegerejwe n’abantu benshi kigali mutzig beer fest thedition izabera juru parki rebero hateganyijwe imodoka zizajya zifata abantu buri minota zibakura sonatubei remera stade kumarembo areba miginai remera mugiporoso hamwe mumujyi rond point nini kigali iki gitaramo kizaba cyatumiwemo abahanzi batandukanye harimo kizigenza mugihugu cy’u burundi uzwi izina kidum benshi bakaba bamuziho gucuranga neza live music iki gitaramo kikazatangira isaha saa kumi n’ebyiri z’umugoroba taliki kugeza saa munani mugitondo taliki kwinjira bizasaba amafaranga y’u rwanda kubafite mutzig golden card aha niho tike zigurirwa nakumat la gallette simba super market flurep”
KirNews “sentare yiyungurizo ntahangwa yagumije munyororo abamenyeshamakuru bane abo bamenyeshamakuru bakaba bakorera ikinyamakuru iwacu bakaba batawe mvuto kwezi kw’icumi umwaka bakaba bagiye ntara bubanza kurondera amakuru yavuga hari abagwanya leta binjiye gihugu abajejwe umutekano baciye babafata bagishika komine bukinanyana ahavugwa bagwanyi bakaba baciye bashikirizwa sentare nkuru bubanza umushikirizamanza akaba yaciye abagiriza icaha co kwifatanya n’abagwanyi gutera igihugu icaha cahavuye gihindurwa citwa icaha co gushaka guhungabanya umutekano w’igihugu iyo sentare yaciye ibacira imyaka ibiri nusu n’amande y’amafaranga umuriyoni umwe umwe icabafashe cane n’ubutumwe bwafatanwe umwe muribo buvuga ’bagiye i bubanza gufasha abagwanyi” ababuranira bakaba baragerageje kwerekana kwabo bamenyeshamakuru ataco bapfana n’abagwanyi ikinyamakuru iwacu kikaba carunguruje sentare yiyungurizo ntahangwa ariko sentare yafashe ingingo kubagumiza mumunyororo ikinyamakuru iwacu kikavuga kigiye kwitura sentare ntahinyuzwa”
Filipino “Kung hindi lang absent yung ibang pipirma sa thesis namen edi sana tapos na hardbound”
SwahiliNews “TIMU ya taifa ya Tanzania, Serengeti Boys jana ilijiweka katika nafasi finyu katika mashindano ya Mataifa ya Afrika kwa wachezaji wenye umri chini ya miaka 17 baada ya kuchapwa mabao 3-0 na Uganda kwenye Uwanja wa Taifa, Dar es Salaam.Uganda waliandika bao lao la kwanza katika dakika ya 15 lililofungwa na Kawooya Andrew akiunganisha wavuni krosi ya Najibu Viga huku lile la pili likifungwa na Asaba Ivan katika dakika ya 27 Najib Yiga.Serengeti Boys iliendelea kulala, Yiga aliifungia Uganda bao la tatu na la ushindi na kuifanya Serengeti kushika mkia katika Kundi A na kuacha simanzi kwa wapenzi wa soka nchini. Serengeti Boys inasubiri mchezo wa mwisho dhidi ya Senegal huku Nigeria ikisonga mbele baada ya kushinda mchezo wake wa awali kwenye uwanja huo na kufikisha pointi sita baada ya kushinda ule wa ufunguzi dhidi ya Tanzania.”
SogouNews “2008 di4 qi1 jie4 qi1ng da3o guo2 ji4 che1 zha3n me3i nv3 mo2 te4 ”, “2008di4 qi1 jie4 qi1ng da3o guo2 ji4 che1 zha3n yu2 15 ri4 za4i qi1ng da3o guo2 ji4 hui4 zha3n zho1ng xi1n she4ng da4 ka1i mu4 . be3n ci4 che1 zha3n jia1ng chi2 xu4 da4o be3n yue4 19 ri4 . ji1n nia2n qi1ng da3o guo2 ji4 che1 zha3n shi4 li4 nia2n da3o che2ng che1 zha3n gui1 mo2 zui4 da4 di2 yi1 ci4 , shi3 yo4ng lia3o qi1ng da3o guo2 ji4 hui4 zha3n zho1ng xi1n di2 qua2n bu4 shi4 ne4i wa4i zha3n gua3n . yi3 xia4 we2i xia4n cha3ng mo2 te4 tu2 pia4n .”

Table 7: One example from each dataset.
Paper Model Emb AGNews DBpedia YahooAnswers 20News Ohsumed R8 R52 SogouNews
Zhang et al. (2015) LSTM ✓ 0.860 0.985 0.708 - - - - 0.951
Zhang et al. (2015) charCNN ✗ 0.914 0.985 0.680 - - - - 0.956
Yang et al. (2016) HAN ✓ - - 0.758 - - - - -
Joulin et al. (2017) charCNN ✗ 0.872 0.983 0.712 - - - - 0.951
Joulin et al. (2017) VDCNN ✗ 0.913 0.987 0.734 - - - - 0.968
Joulin et al. (2017) fastText ✗ 0.915 0.981 0.720 - - - - 0.939
Conneau et al. (2017) VDCNN ✗ 0.908 0.986 0.724 - - - - 0.962
Yao et al. (2019) LSTM ✗ - - - 0.657 0.411 0.937 0.855 -
Yao et al. (2019) fastText ✓ - - - 0.797 0.557 0.947 0.909 -
Liu et al. (2020) fastText ✓ 0.925 0.986 0.723 0.114 0.146 0.860 0.716 -
Liu et al. (2020) BiLSTM ✓ - - - 0.732 0.493 0.963 0.905 -
Liu et al. (2020) BERT ✗ - - - 0.679 0.512 0.960 0.897 -

Table 8: Results reported in previous works on datasets with abundant resources, with embedding (Emb) information.
Table 9: Results reported in previous works on low-resource languages, with embedding (Emb) and pre-training (PT) information.
Paper Model AGNews DBpedia
Shnarch et al. (2022) BERT 0.619 0.312
Shnarch et al. (2022) BERT_IT:CLUSTER 0.807 0.670

Table 10: Results reported in previous works on 64-sample learning, corresponding to 14-shot for AGNews and ≈5-shot for DBpedia.

C Implementation Details

We use different hyper-parameters for the full-dataset and few-shot settings.

For LSTM, Bi-LSTM+Attn, and fastText, we use embedding size = 256 and dropout rate = 0.3. For the full-dataset setting, the learning rate is set to 0.001 with decay rate = 0.9 for the Adam optimizer (Kingma and Ba, 2015), number of epochs = 20, and batch size = 64; for the few-shot setting, the learning rate = 0.01, the decay rate = 0.99, and batch size = 1, with number of epochs = 50 for 50-shot and 100-shot and 80 for 5-shot and 10-shot. For LSTM and Bi-LSTM+Attn, we use one RNN layer with hidden size = 64. For fastText, we use one hidden layer with dimension 10.

For HAN, we use one layer for both the word-level RNN and the sentence-level RNN; the hidden size of both is set to 50, and the hidden sizes of both attention layers are set to 100. It is trained with batch size = 256 and decay rate = 0.5 for 6 epochs.

For BERT, the learning rate is set to 2e−5 and the batch size to 128 for English and SogouNews, while for low-resource languages we set the learning rate to 1e−5 with a batch size of 16 for 5 epochs. We use the publicly available transformers library (Wolf et al., 2020) for BERT; specifically, we use the bert-base-uncased checkpoint for BERT and bert-base-multilingual-uncased for mBERT.

For charCNN and textCNN, we use the same hyper-parameter settings as Adhikari et al. (2019b), except that in the few-shot learning setting we reduce the batch size to 1, reduce the learning rate to 1e−4, and increase the number of epochs to 60. We also use their open-source hedwig repo for the implementation. For VDCNN, we use the shallowest 9-layer version with embedding size 16, batch size 64, and learning rate 1e−4 for the full-dataset setting, and batch size = 1 and epochs = 60 for the few-shot setting. For RCNN, we use embedding size = 256, RNN hidden size = 256, learning rate = 1e−3, and the same batch size and epoch settings as VDCNN for the full-dataset and few-shot settings.

In general, we perform a grid search over hyper-parameters for all the neural network models, and we validate on the test set, which can only overestimate their accuracy.
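For reference, loading the two BERT checkpoints mentioned above with the transformers library looks as follows; this is a sketch, and num_labels is our placeholder that depends on the dataset:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# bert-base-uncased for English and SogouNews; swap in
# bert-base-multilingual-uncased for mBERT on the low-resource languages.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=4,  # e.g., 4 classes for AG News; set per dataset
)
```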
For preprocessing, we don't use any pretrained word embeddings for any word-based models. The reason is that we keep a strict distinction between "training" and "pre-training": involving pretrained embeddings would make the DNNs' categorization ambiguous. Neither do we use data augmentation during training. The procedures of tokenization, at both word level and character level, and of padding for batch processing are, however, inevitable.

For all non-parametric methods, the only hyper-parameter is k. We set k = 2 for all methods on all datasets, and we report the maximum possible accuracy obtained from the experiments for each method. For Sentence-BERT, we use the paraphrase-MiniLM-L6-v2 checkpoint.
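One way to read "maximum possible accuracy" at k = 2: the two nearest neighbors either agree, or tie 1-1, and in the tie case a prediction is counted as correct whenever the gold label is present. The sketch below follows that reading, which is an assumption on our part, and reuses the ncd helper sketched in Appendix A:

```python
def max_possible_accuracy(test_set, train_set, k=2):
    """Accuracy of kNN when ties among the k nearest labels are broken
    in favor of the gold label (our reading of 'maximum possible').
    With k = 2, this counts a sample correct whenever the gold label
    appears among its two nearest neighbors."""
    correct = 0
    for text, gold in test_set:
        nearest = sorted(train_set, key=lambda pair: ncd(text, pair[0]))[:k]
        if gold in {label for _, label in nearest}:
            correct += 1
    return correct / len(test_set)
```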
Our method only requires CPUs, and we use 8-core CPUs to take advantage of multi-processing. Calculating the distance matrix with gzip takes about half an hour on AGNews, two days on DBpedia and SogouNews, and six days on YahooAnswers.
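The distance matrix is embarrassingly parallel across test samples, which is what the multi-processing above refers to. A sketch, assuming the ncd helper from Appendix A and module-level train_set and test_set lists of (text, label) pairs:

```python
from multiprocessing import Pool

def ncd_row(test_text: str) -> list:
    # One row of the distance matrix: NCD from a single test sample
    # to every training sample.
    return [ncd(test_text, train_text) for train_text, _ in train_set]

if __name__ == "__main__":
    with Pool(processes=8) as pool:  # matches the 8-core CPUs we use
        distance_matrix = pool.map(ncd_row, [text for text, _ in test_set])
```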
D Few-Shot Results

The exact numerical values of the accuracy shown in Figure 2 are listed in the three tables below.

AGNews
#Shot 5 10 50 100
fastText 0.273±0.021 0.329±0.036 0.550±0.008 0.684±0.010
Bi-LSTM+Attn 0.269±0.022 0.331±0.028 0.549±0.028 0.665±0.019
HAN 0.274±0.024 0.289±0.020 0.340±0.073 0.548±0.031
W2V 0.388±0.186 0.546±0.162 0.531±0.272 0.395±0.089
BERT 0.803±0.026 0.819±0.019 0.869±0.005 0.875±0.005
SentBERT 0.716±0.032 0.746±0.018 0.818±0.008 0.829±0.004
gzip (ours) 0.587±0.048 0.610±0.034 0.699±0.017 0.741±0.007

Table 11: Few-shot results on AG News.

DBpedia
#Shot 5 10 50 100
fastText 0.475±0.041 0.616±0.019 0.767±0.041 0.868±0.014
Bi-LSTM+Attn 0.506±0.041 0.648±0.025 0.818±0.008 0.862±0.005
HAN 0.350±0.012 0.484±0.010 0.501±0.003 0.835±0.005
W2V 0.325±0.113 0.402±0.123 0.675±0.05 0.787±0.015
BERT 0.964±0.041 0.979±0.007 0.986±0.002 0.987±0.001
SentBERT 0.730±0.008 0.746±0.018 0.819±0.008 0.829±0.004
gzip (ours) 0.622±0.022 0.701±0.021 0.825±0.003 0.857±0.004

Table 12: Few-shot results on DBpedia.

SogouNews
#Shot 5 10 50 100
fastText 0.545±0.053 0.652±0.051 0.782±0.034 0.809±0.012
Bi-LSTM+Attn 0.534±0.042 0.614±0.047 0.771±0.021 0.812±0.008
HAN 0.425±0.072 0.542±0.118 0.671±0.102 0.808±0.020
W2V 0.141±0.005 0.124±0.048 0.133±0.016 0.395±0.089
BERT 0.221±0.041 0.226±0.060 0.392±0.276 0.679±0.073
SentBERT 0.485±0.043 0.501±0.041 0.565±0.013 0.572±0.003
gzip (ours) 0.649±0.061 0.741±0.017 0.833±0.007 0.867±0.016

Table 13: Few-shot results on SogouNews.

E Other Reported Results

In Table 3 and Table 5, we report the results from our own hyper-parameter settings and implementation. However, we find that we could not replicate previously reported results in some cases: we get higher or lower results than previously reported ones, which may be due to different experimental settings (e.g., they may use pretrained word embeddings while we don't) or different hyper-parameter settings. Thus, we provide results reported by some previous papers for reference in Table 8, Table 9, and Table 10. Note that SogouNews is listed in the first table, as it has abundant resources and is commonly used as a benchmark for DNNs that excel at large datasets. As studies on low-resource languages and few-shot learning scenarios are insufficient, in Table 9 and Table 10 we also report results for model variants like BiGRU using Kinyarwanda embeddings (Kin. W2V) and BERT_MORPHO incorporating morphology and pretrained on a Kinyarwanda corpus (Kin. Corpus), in addition to the models we use in the paper. We don't find any result reported for DengueFilipino, as previous works' evaluations use multi-label metrics.

F Performance Analysis

To understand the merits and shortcomings of using gzip for classification, we evaluate gzip's performance in terms of both absolute accuracy and relative performance compared with the neural methods. A low absolute accuracy with a high relative performance suggests that the dataset itself is difficult, while a high accuracy with a low relative performance means the dataset is better solved by a neural network. As our method performs well on OOD datasets, we are more interested in analyzing the ID cases. We carry out experiments on seven in-distribution datasets and one out-of-distribution dataset across fourteen models to account for different ranks. We analyze both the relative performance and the absolute accuracy with regard to the vocabulary size and the compression rate of both datasets (i.e., how easily a dataset can be compressed) and compressors (i.e., how well a compressor can compress).

To represent the relative performance with regard to the other methods, we use the normalized rank percentage, computed as (rank of gzip) / (total number of methods); the lower the score, the better gzip is. We use "bits per character" (bpc) to evaluate the compression rate. The procedure is to randomly sample a thousand instances […]
[Figure: per-dataset results for DBpedia, YahooAnswers, 20News, Ohsumed, R8, R52, and SogouNews plotted against Bits per Character.]
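A sketch of the bpc computation as we read the truncated procedure above: sample a thousand instances, compress each, and divide the total compressed size in bits by the total number of characters. Whether the sampled instances are compressed individually or as one concatenation is cut off above, so compressing them individually is our assumption:

```python
import gzip
import random

def bits_per_character(texts: list, n_samples: int = 1000) -> float:
    """Estimate a dataset's compression rate in bits per character (bpc):
    total compressed size in bits over total number of characters,
    averaged over a random sample of instances."""
    sample = random.sample(texts, min(n_samples, len(texts)))
    total_bits = sum(8 * len(gzip.compress(t.encode("utf-8"))) for t in sample)
    total_chars = sum(len(t) for t in sample)
    return total_bits / total_chars
```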
✗ A4. Have you used AI writing assistants when working on this paper?
Left blank.

✓ B. Did you use or create scientific artifacts?
Section 3.

✓ B1. Did you cite the creators of artifacts you used?
Appendix B and C.

✓ B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
Appendix B and C.

✓ B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
Appendix B and C.

✓ B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
Appendix B.

✓ B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
Section 4.1 and Appendix B.

✓ B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
In Section 4.1, Table 1.
✓ C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
Appendix C.

✓ C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
Section 4.3, 4.4, 4.5.

C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)?
Not applicable. Left blank.

✗ D. Did you use human annotators (e.g., crowdworkers) or research with human participants?
Left blank.

D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?
No response.

D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?
No response.

D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?
No response.

D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
No response.

D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?
No response.