"Low-Resource" Text Classification: A Parameter-Free Classification Method With Compressors

Download as pdf or txt
Download as pdf or txt
You are on page 1of 19

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors


Zhiying Jiang1,2, Matthew Y.R. Yang1, Mikhail Tsirlin1, Raphael Tang1, Yiqin Dai2 and Jimmy Lin1
1 University of Waterloo   2 AFAIK
{zhiying.jiang, m259yang, mtsirlin, r33tang}@uwaterloo.ca
[email protected] [email protected]

Abstract

Deep neural networks (DNNs) are often used for text classification due to their high accuracy. However, DNNs can be computationally intensive, requiring millions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that's easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training parameters, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also excels in the few-shot setting, where labeled data are too scarce to train DNNs effectively. Code is available at https://github.com/bazingagin/npc_gzip.

1 Introduction

Text classification, as one of the most fundamental tasks in natural language processing (NLP), has improved substantially with the help of neural networks (Li et al., 2022). However, most neural networks are data-hungry, the degree of which increases with the number of parameters. Hyperparameters must be carefully tuned for different datasets, and the preprocessing of text data (e.g., tokenization, stop word removal) needs to be tailored to the specific model and dataset. Despite their ability to capture latent correlations and recognize implicit patterns (LeCun et al., 2015), complex deep neural networks may be overkill for simple tasks such as topic classification, and lighter alternatives are usually good enough. For example, Adhikari et al. (2019b) find that a simple long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997) with appropriate regularization can achieve competitive results. Shen et al. (2018) further show that even word-embedding-based methods can achieve results comparable to convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

Among all the endeavors for a lighter alternative to DNNs, one stream of work focuses on using compressors for text classification. There have been several studies in this field (Teahan and Harper, 2003; Frank et al., 2000), most of them based on the intuition that the minimum cross entropy between a document and a language model of a class built by a compressor indicates the class of the document. However, previous works fall short of matching the quality of neural networks.

Addressing these shortcomings, we propose a text classification method that combines a lossless compressor and a compressor-based distance metric with a k-nearest-neighbor (kNN) classifier. It uses compressors to capture regularity, which is then translated into similarity scores by a compressor-based distance metric. With the resulting distance matrix, we use kNN to perform classification. We carry out experiments on seven in-distribution datasets and five out-of-distribution ones. With a simple compressor like gzip, our method achieves results competitive with those of DNNs on six out of seven datasets and outperforms all methods, including BERT, on all OOD datasets. It also surpasses all models by a large margin under few-shot settings.

Our contributions are as follows: (1) we are the first to use NCD with kNN for topic classification, allowing us to carry out comprehensive experiments on large datasets with compressor-based methods; (2) we show that our method achieves results comparable to non-pretrained DNNs on six out of seven in-distribution datasets; (3) on OOD datasets, we show that our method outperforms all methods, including pretrained models such as BERT; and (4) we demonstrate that our method excels in the few-shot setting of scarce labeled data.
2 Related Work

2.1 Compressor-Based Text Classification

Text classification using compressors can be divided into two main approaches: (1) using a compressor to estimate entropy based on Shannon information theory; (2) using a compressor to approximate Kolmogorov complexity and information distance.1

1 This doesn't indicate these two lines of work are completely parallel. In fact, the expected value of Kolmogorov complexity equals Shannon entropy, up to a constant.

The first approach mainly employs a text compression technique called Prediction by Partial Matching (PPM)2 for topic classification. This approach estimates the cross entropy between the probability distribution of a specific class c and a given document d: Hc(d) (Frank et al., 2000; Teahan and Harper, 2003). The intuition is that the lower the cross entropy, the more likely it is that d belongs to c. Marton et al. (2005), Coutinho and Figueiredo (2015), and Kasturi and Markov (2022) further improve the final accuracy by improving the representation to better cope with the compressor.

2 PPM is a text compression scheme utilizing language modeling to estimate cross entropy.

Another line of compressor-based methods (Khmelev and Teahan, 2003; Keogh et al., 2004) takes advantage of the information distance (Bennett et al., 1998), a distance metric derived from Kolmogorov complexity. The intuition of information distance is that for two similar objects, there exists a simple program to convert one to the other. However, most previous works focus on clustering (Vitányi et al., 2009), plagiarism detection (Chen et al., 2004), and time series classification (Keogh et al., 2004). Few (Marton et al., 2005; Coutinho and Figueiredo, 2015) explore its application to topic classification, and none applies the combination of information distance and a k-nearest-neighbor (kNN) classifier with k > 1 to topic classification. Besides, to the best of our knowledge, all the previous works use relatively small datasets like 20News and Reuters-10. There is neither a comparison between compressor-based methods and deep learning methods nor a comprehensive study of large datasets.

2.2 Deep Learning for Text Classification

The deep learning methods used for text classification can be divided into two: transductive learning, represented by Graph Convolutional Networks (GCNs) (Yao et al., 2019), and inductive learning, dominated by recurrent neural networks (RNNs) and convolutional neural networks (CNNs). We focus on inductive learning in this paper, as transductive learning assumes the test dataset is present during training, which is not a common scenario in practice.

Zhang et al. (2015) use a character-based CNN with millions of parameters for text classification. Conneau et al. (2017) extend the idea with more layers. Along the line of RNNs, Kawakami (2008) introduces a method that uses LSTMs (Hochreiter and Schmidhuber, 1997) to learn sequential information for classification. To better capture important information regardless of position, Wang et al. (2016) incorporate the attention mechanism into relation classification. Yang et al. (2016) include a hierarchical structure for sentence-level attention. As parameter counts and model complexity increase, Joulin et al. (2017) turn to a simple linear model with a hidden layer that handles n-gram features and hierarchical softmax to improve efficiency.

The landscape of classification has been further transformed by the widespread use of pretrained models like BERT (Kenton and Toutanova, 2019), with hundreds of millions of parameters pretrained on a corpus containing billions of tokens. BERT achieves the state of the art on text classification (Adhikari et al., 2019a). Built on BERT, Reimers and Gurevych (2019) calculate semantic similarity between pairs of sentences efficiently by using a siamese network architecture and fine-tuning on multiple NLI datasets (Bowman et al., 2015; Williams et al., 2018). We compare gzip with these deep learning models.
3 Our Approach

Our approach consists of a lossless compressor, a compressor-based distance metric, and a k-nearest-neighbor classifier. Lossless compressors aim to represent information using as few bits as possible by assigning shorter codes to symbols with higher probability. The intuition of using compressors for classification is that (1) compressors are good at capturing regularity, and (2) objects from the same category share more regularity than those from different categories. For example, x1 below belongs to the same category as x2, but a different category from x3. If we use C(·) to represent compressed length, we will find C(x1x2) − C(x1) < C(x1x3) − C(x1), where C(x1x2) means the compressed length of the concatenation of x1 and x2. In other words, C(x1x2) − C(x1) can be interpreted as how many bytes we still need to encode x2 given the information of x1:

x1 = Japan's Seiko Epson Corp. has developed a 12-gram flying microrobot.
x2 = The latest tiny flying robot has been unveiled in Japan.
x3 = Michael Phelps won the gold medal in the 400 individual medley.

[Figure 1: Our approach overview.]

This intuition can be formalized as a distance metric derived from Kolmogorov complexity (Kolmogorov, 1963). Kolmogorov complexity K(x) characterizes the length of the shortest binary program that can generate x; K(x) is theoretically the ultimate lower bound for information measurement. To measure the information content shared between two objects, Bennett et al. (1998) define the information distance E(x, y) as the length of the shortest binary program that converts x to y:

E(x, y) = max{K(x|y), K(y|x)}            (1)
        = K(xy) − min{K(x), K(y)}        (2)

As the incomputable nature of Kolmogorov complexity renders E(x, y) incomputable, Li et al. (2004) propose a normalized and computable version of the information distance, the Normalized Compression Distance (NCD), which uses the compressed length C(x) to approximate the Kolmogorov complexity K(x). Formally, it is defined as follows (a detailed derivation is shown in Appendix A):

NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}        (3)

The intuition behind using compressed length is that the length of x after being maximally compressed by a compressor is close to K(x). Generally, the higher the compression ratio, the closer C(x) is to K(x).
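The inequality on x1, x2, x3 above can be checked directly with gzip standing in for C(·). The snippet below is a minimal sketch for illustration only: the three sentences are the examples given above, the space-separated concatenation mirrors Listing 1, and the exact byte counts are implementation-dependent and not reported in the paper.

import gzip

def C(s: str) -> int:
    # compressed length in bytes, used as a stand-in for K(x)
    return len(gzip.compress(s.encode()))

x1 = "Japan's Seiko Epson Corp. has developed a 12-gram flying microrobot."
x2 = "The latest tiny flying robot has been unveiled in Japan."
x3 = "Michael Phelps won the gold medal in the 400 individual medley."

# extra bytes needed to encode x2 (resp. x3) given x1
d12 = C(" ".join([x1, x2])) - C(x1)
d13 = C(" ".join([x1, x3])) - C(x1)

# the same-topic pair should need fewer extra bytes than the cross-topic pair
print(d12, d13, d12 < d13)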
pressors are data-type agnostic, and non-parametric
As our main experimental results use gzip as the compressor, C(x) here means the length of x after being compressed by gzip, and C(xy) is the compressed length of the concatenation of x and y. With the distance matrix that NCD provides, we can then use k-nearest-neighbor to perform classification.

Our method can be implemented with the 14 lines of Python code below. The inputs are training_set and test_set, both consisting of an array of (text, label) tuples, and k, as shown in Listing 1.

import gzip
import numpy as np

for (x1, _) in test_set:
    Cx1 = len(gzip.compress(x1.encode()))
    distance_from_x1 = []
    for (x2, _) in training_set:
        Cx2 = len(gzip.compress(x2.encode()))
        x1x2 = " ".join([x1, x2])
        Cx1x2 = len(gzip.compress(x1x2.encode()))
        # NCD(x1, x2) from Equation (3), with gzip approximating K(.)
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
        distance_from_x1.append(ncd)
    sorted_idx = np.argsort(np.array(distance_from_x1))
    # labels of the k nearest training examples; list() so .count works below
    top_k_class = list(training_set[sorted_idx[:k], 1])
    predict_class = max(set(top_k_class), key=top_k_class.count)

Listing 1: Python code for text classification with gzip.

Our method is a simple, lightweight, and universal alternative to DNNs. It's simple because it doesn't require any preprocessing or training. It's lightweight in that it classifies without the need for parameters or GPU resources. It's universal as compressors are data-type agnostic, and non-parametric methods do not bring underlying assumptions.
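For concreteness, the snippet below shows one way to prepare the inputs that Listing 1 expects. It is a toy illustration, not from the paper: the two-column object array makes the fancy indexing training_set[sorted_idx[:k], 1] work, and the category names and sentences are invented.

import numpy as np

# (text, label) pairs as a 2-D object array: column 0 = text, column 1 = label
training_set = np.array([
    ("the team won the championship final", "sports"),
    ("the striker scored twice in the derby", "sports"),
    ("the central bank raised interest rates", "business"),
    ("shares fell after the earnings report", "business"),
], dtype=object)

test_set = [("the goalkeeper saved a late penalty", "sports")]
k = 2
# ... then run the loop from Listing 1 to obtain predict_class for each test text.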
4 Experimental Setup

4.1 Datasets

We choose a variety of datasets to investigate the effects of the number of training samples, the number of classes, the length of the text, and the difference in distribution on accuracy. The details of each dataset are listed in Table 1. Previous works on text classification have two disjoint preferences when choosing evaluation datasets: CNN- and RNN-based methods favor large-scale datasets (AG News, DBpedia, YahooAnswers), whereas transductive methods like graph convolutional neural networks focus on smaller ones (20News, Ohsumed, R8, R52) (Li et al., 2022).
We include datasets on both sides in order to investigate how our method performs in both situations. Apart from dataset sizes, we also take the number of classes into account by intentionally including datasets like R52 to evaluate performance on datasets with a large number of classes. We also include the text length of each dataset in Table 1, as previous works (Marton et al., 2005) indicate that it affects the accuracy of compressor-based methods.

Generalizing to out-of-distribution datasets has always been a challenge in machine learning. Even with the success of pretrained models, this problem is not alleviated. In fact, Yu et al. (2021) have shown that improved in-distribution accuracy of pretrained models may lead to poor OOD performance in image classification. In order to compare our method with pretrained models on OOD datasets, we choose five datasets that are unseen in BERT's pretraining corpus: Kinyarwanda news, Kirundi news, Filipino dengue, Swahili news, and Sogou news. Those datasets are chosen to use Latin script, which means they have a very similar alphabet to English. For example, Swahili has the same vowels as English but doesn't have q, x as consonants; Sogou news is in Pinyin, a phonetic romanization of Chinese. Therefore, those datasets can be viewed as permutations of the English alphabet (see Table 7 for text examples).

Dataset            Ntrain   Ntest   C    W     L      V
AG News            120K     7.6K    4    44    236    128K
DBpedia            560K     70K     14   54    301    1M
YahooAnswers       1.4M     60K     10   107   520    1.5M
20News             11K      7.5K    20   406   1902   277K
Ohsumed            3.4K     4K      23   212   1273   55K
R8                 5.5K     2.2K    8    102   587    24K
R52                6.5K     2.6K    52   110   631    26K
KinyarwandaNews    17K      4.3K    14   232   1872   240K
KirundiNews        3.7K     923     14   210   1722   63K
DengueFilipino     4K       500     5    10    62.7   13K
SwahiliNews        22.2K    7.3K    6    327   2.2K   570K
SogouNews          450K     60K     5    589   2.8K   611K

Table 1: Details of datasets used for evaluation. Ntrain and Ntest denote the number of training and test set examples, C is the number of classes, W the average number of words per example, L the average number of characters, and V the vocabulary size.

4.2 Baselines

We compare our result with (1) neural network methods that require training and (2) non-parametric methods that use the kNN classifier directly, with or without pre-training on external data. Specifically, we choose mainstream architectures for text classification, like logistic regression, fastText (Joulin et al., 2017), RNNs with or without attention (vanilla LSTM (Hochreiter and Schmidhuber, 1997), bidirectional LSTMs (Schuster and Paliwal, 1997) with attention (Wang et al., 2016), hierarchical attention networks (Yang et al., 2016)), CNNs (character CNNs (Zhang et al., 2015), recurrent CNNs (Lai et al., 2015), very deep CNNs (Conneau et al., 2017)), and BERT (Devlin et al., 2019). We also include three other non-parametric methods: word2vec (W2V) (Mikolov et al., 2013), pretrained sentence BERT (SentBERT) (Reimers and Gurevych, 2019), and the length of the instance (TextLength), all using a kNN classifier. "TextLength" is a baseline where the text length of the instance is the only input to a kNN classifier; its result rules out the impact of text length in classification.
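A minimal sketch of the TextLength baseline is shown below, assuming the same (text, label) object arrays used with Listing 1. The implementation details here (absolute difference in character length as the distance) are our illustration of the idea, not a specification from the paper.

import numpy as np

def textlength_knn_predict(test_texts, training_set, k=1):
    # training_set: 2-D object array, column 0 = text, column 1 = label
    train_lengths = np.array([len(t) for t in training_set[:, 0]])
    predictions = []
    for text in test_texts:
        # distance = absolute difference in character length only
        dist = np.abs(train_lengths - len(text))
        top_k = list(training_set[np.argsort(dist)[:k], 1])
        predictions.append(max(set(top_k), key=top_k.count))
    return predictions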
We present details of the models in Table 2. Here we use AG News as an example to estimate the model sizes, as the number of parameters is affected by the number of classes and the vocabulary size. This dataset has a relatively small vocabulary size and number of classes, making the estimated number of parameters a lower bound for the studied datasets. Some methods require pre-training, either on the target dataset or on other external datasets.

We also list the preprocessing required by the models in Table 2, including tokenization ("tok"), building vocabulary dictionaries and mapping tokens ("dict"), using pretrained word embeddings ("emb"), lowercasing words ("lower"), and padding sequences to a certain length ("pad"). Other model-specific preprocessing includes an extra bag of n-grams ("ngram") for fastText and positional embeddings ("pe") for BERT. Note that for models that only require training, we do not use pretrained word embeddings; otherwise, the boundary between pre-training and training would become ambiguous.

Model          # Par.   PT   TT   ED   Preprocessing Details
TFIDF+LR       260K     ✗    ✓    ✗    tok+tfidf+dict (+lower)
LSTM           5.2M     ✗    ✓    ✗    tok+dict (+emb+lower+pad)
Bi-LSTM+Attn   8.2M     ✗    ✓    ✗    tok+dict (+emb+lower+pad)
HAN            30M      ✗    ✓    ✗    tok+dict (+emb+lower+pad)
charCNN        2.7M     ✗    ✓    ✗    dict (+lower+pad)
textCNN        31M      ✗    ✓    ✗    tok+dict (+emb+lower+pad)
RCNN           19M      ✗    ✓    ✗    tok+dict (+emb+lower+pad)
VDCNN          14M      ✗    ✓    ✗    dict (+lower+pad)
fastText       8.2M     ✗    ✓    ✗    tok+dict (+lower+pad+ngram)
BERT-base      109M     ✓    ✓    ✓    tok+dict+pe (+lower+pad)
W2V            0        ✓    ✗    ✗    tok+dict (+lower)
SentBERT       0        ✓    ✗    ✓    tok+dict (+lower)
TextLength     0        ✗    ✗    ✗    ✗
gzip (ours)    0        ✗    ✗    ✗    ✗

Table 2: Models with their respective number of training parameters, whether they use pre-training (PT), task-specific training (TT)/fine-tuning for BERT, and external data (ED), as well as text preprocessing details.
Model Pre-training Training AGNews DBpedia YahooAnswers 20News Ohsumed R8 R52
TFIDF+LR ✗ ✓ 0.898 0.982 0.715 0.827 0.549 0.949 0.874
LSTM ✗ ✓ 0.861 0.985 0.708 0.657 0.411 0.937 0.855
Bi-LSTM+Attn ✗ ✓ 0.917 0.986 0.732 0.667 0.481 0.943 0.886
HAN ✗ ✓ 0.896 0.986 0.745 0.646 0.462 0.960 0.914
charCNN ✗ ✓ 0.914 0.986 0.712 0.401 0.269 0.823 0.724
textCNN ✗ ✓ 0.817 0.981 0.728 0.751 0.570 0.951 0.895
RCNN ✗ ✓ 0.912 0.984 0.702 0.716 0.472 0.810 0.773
VDCNN ✗ ✓ 0.913 0.987 0.734 0.491 0.237 0.858 0.750
fastText ✗ ✓ 0.911 0.978 0.702 0.690 0.218 0.827 0.571
BERT ✓ ✓ 0.944 0.992 0.768 0.868 0.741 0.982 0.960
W2V ✓ ✗ 0.892 0.961 0.689 0.460 0.284 0.930 0.856
SentBERT ✓ ✗ 0.940 0.937 0.782 0.778 0.719 0.947 0.910
TextLength ✗ ✗ 0.275 0.093 0.105 0.053 0.090 0.455 0.362
gzip (ours) ✗ ✗ 0.937 0.970 0.638 0.685 0.521 0.954 0.896

Table 3: Test accuracy compared with gzip; red highlights the results that are outperformed by gzip. We report results obtained from our own implementations. We also include previously reported results for reference in Appendix E.

Dataset         average   gzip
AGNews          0.901     0.937
DBpedia         0.978     0.970
YahooAnswers    0.726     0.638
20News          0.678     0.685
Ohsumed         0.470     0.521
R8              0.914     0.954
R52             0.838     0.896

Table 4: Test accuracy comparison between the average of all baseline models (excluding TextLength) and gzip.

5 Results

5.1 Results on In-Distribution Datasets

We train all baselines on seven datasets (training details are in Appendix C) using their full training sets. The results are shown in Table 3. Our method performs particularly well on AG News, R8, and R52. On the AG News dataset, fine-tuning BERT yields the highest performance among all methods, while our method, without any pre-training, achieves competitive results, only 0.007 points lower than BERT. On both R8 and R52, the only non-pretrained neural network that outperforms our method is HAN. For YahooAnswers, the accuracy of gzip is about 7% lower than the average of the neural methods. This may be due to the large vocabulary size of YahooAnswers, which makes it hard for the compressor to compress (a detailed discussion is in Appendix F).

Overall, BERT-based models are robust to the size of in-distribution datasets. Character-based models like charCNN and VDCNN perform badly when the dataset is small and the vocabulary size is large (e.g., 20News). Word-based models are better at handling big vocabulary sizes. The result of TextLength is extremely low, indicating that the compressed length used in NCD does not benefit from the length distribution of different classes.

gzip does not perform well on extremely large datasets (e.g., YahooAnswers), but is competitive on medium and small datasets. Performance-wise, the only non-pretrained deep learning model that's competitive with gzip is HAN, which surpasses gzip on four datasets and still achieves relatively high accuracy when it's beaten by gzip, unlike textCNN. The difference is that gzip doesn't require training.

We list the average test accuracy of all baseline models (except TextLength, for its very low accuracy) in Table 4. We observe that our method is either higher than or close to the average on all but the YahooAnswers dataset.

5.2 Results on Out-of-Distribution Datasets

On five OOD datasets (Kinyarwanda news, Kirundi news, Filipino dengue, Swahili news and Sogou news), we also select DNNs to cover a wide range of parameter counts. We discard CNN-based methods due to their inferiority when datasets are small, as shown in both Section 5.1 and Zhang et al. (2015). In addition, we also add BERT pretrained on 104 languages (mBERT). We can see in Table 5 that on languages that mBERT has not been pretrained on (Kinyarwanda, Kirundi, or Pinyin), it is worse than BERT. Compared with non-pretrained models, pretrained models do not hold their advantage on low-resource languages with smaller data sizes, except for Filipino, which shares a large vocabulary with English. On large OOD datasets (i.e., SogouNews), BERT achieves results competitive with the non-pretrained neural networks.
Model/Dataset KinyarwandaNews KirundiNews DengueFilipino SwahiliNews SogouNews
Shot# Full 5-shot Full 5-shot Full 5-shot Full 5-shot Full 5-shot
Bi-LSTM+Attn 0.843 0.253±0.061 0.872 0.254±0.053 0.948 0.369±0.053 0.863 0.357±0.049 0.952 0.534±0.042
HAN 0.820 0.137±0.033 0.881 0.190±0.099 0.981 0.362±0.119 0.887 0.264±0.042 0.957 0.425±0.072
fastText 0.869 0.170±0.057 0.883 0.245±0.242 0.870 0.248±0.108 0.874 0.347±0.255 0.930 0.545±0.053
W2V 0.874 0.281±0.236 0.904 0.288±0.189 0.993 0.481±0.158 0.892 0.373±0.341 0.943 0.141±0.005
SentBERT 0.788 0.292±0.062 0.886 0.314±0.060 0.992 0.629±0.143 0.822 0.436±0.081 0.860 0.485±0.043
BERT 0.838 0.240±0.060 0.879 0.386±0.099 0.979 0.409±0.058 0.897 0.396±0.096 0.952 0.221±0.041
mBERT 0.835 0.229±0.066 0.874 0.324±0.071 0.983 0.465±0.048 0.906 0.558±0.169 0.953 0.282±0.060
gzip (ours) 0.891 0.458±0.065 0.905 0.541±0.056 0.998 0.652±0.048 0.927 0.627±0.072 0.975 0.649±0.061

Table 5: Test accuracy on OOD datasets with 95% confidence interval over five trials in five-shot setting.

Without any pre-training or fine-tuning, our method outperforms both BERT and mBERT on all five datasets. In fact, our experiments show that our method outperforms both pretrained and non-pretrained deep learning methods on OOD datasets, which backs our claim that our method is universal in terms of dataset distributions. To put it simply, our method is designed to handle unseen datasets: the compressor is data-type-agnostic by nature, and non-parametric methods do not introduce inductive bias during training.

5.3 Few-Shot Learning

We further compare our method with deep learning methods under the few-shot setting. We carry out experiments on AG News, DBpedia, and SogouNews across both non-pretrained deep neural networks and pretrained ones. We use n-shot labeled examples per class from the training dataset, where n = {5, 10, 50, 100}. We choose these three datasets as their scale is large enough to cover the 100-shot setting and they vary in text length as well as language. We choose methods whose trainable parameters range from zero parameters, like word2vec and sentence BERT, to hundreds of millions of parameters, like BERT, covering both word-based models (HAN) and an n-gram one (fastText).

We plot the results in Figure 2 (detailed numbers are shown in Appendix D). As shown, gzip outperforms non-pretrained models in the 5, 10, and 50-shot settings on all three datasets. When the number of shots is as few as n = 5, gzip outperforms non-pretrained models by a large margin: gzip is 115% better in accuracy than fastText in the AG News 5-shot setting. In the 100-shot setting, gzip also outperforms non-pretrained models on AG News and SogouNews but slightly underperforms on DBpedia.

[Figure 2: Comparison among different methods using different shots with 95% confidence interval over five trials. Panels show test accuracy on AGNews, DBpedia, and SogouNews versus the number of shots (5, 10, 50, 100) for fastText, Bi-LSTM+Attn, HAN, W2V, SentBERT, BERT, and gzip.]

Previous work (Nogueira et al., 2020; Zhang et al., 2021) shows that pretrained models are excellent few-shot learners, which is reflected in the consistently high accuracy of BERT and SentBERT on in-distribution datasets like AG News and DBpedia under few-shot settings.3 It's worth noting, though, that gzip outperforms SentBERT for 50 and 100 shots. However, as shown in the SogouNews results, when the dataset is distinctly different from the pretraining data, the inductive bias introduced by the pre-training data leads to low accuracy for BERT and SentBERT in the 10, 50, and 100-shot settings, and especially in the 5-shot setting. In general, when the shot number increases, the accuracy difference between gzip and deep learning methods becomes smaller. W2V is an exception, with a large variance in accuracy. This is due to the vectors being trained for a limited set of words, meaning that numerous tokens in the test set are unseen and hence out-of-vocabulary.

3 BERT reaches almost perfect accuracy on DBpedia probably because the data is extracted from Wikipedia, which BERT is pretrained on.

We further investigate the quality of DNNs and our method in the 5-shot setting on the five OOD datasets, tabulating results in Table 5. Under the 5-shot setting on OOD datasets, our method exceeds all the deep learning methods by a huge margin: it surpasses the accuracy of BERT by 91%, 40%, 59%, 58%, and 194%, and surpasses mBERT's accuracy by 100%, 67%, 40%, 12%, and 130% on the corresponding five datasets.4 The reason behind the outperformance of our method is compressors' excellent ability to capture regularity, which is prominent when training becomes moot with very few labeled data for DNNs.

4 mBERT has much higher accuracy than BERT in the few-shot setting on Filipino and Swahili, which mBERT was pretrained on.

6 Analyses

6.1 Using Other Compressors

As the compressor in our method can be replaced by any other compressor, we evaluate the performance of three other lossless compressors: bz2, lzma, and zstandard.

[Figure 3: Comparison among different compressors on three datasets with 95% confidence interval over five trials. Panels show test accuracy on AGNews, DBpedia, and SogouNews versus the number of shots (5, 10, 50, 100) for bz2, lzma, zstd, and gzip.]

Due to the low compression speed of lzma, we randomly select 1,000 test samples from the whole test set to evaluate and conduct our experiments under the 5, 10, 50, and 100-shot settings. We repeat the experiments under each setting five times to calculate the mean and the 95% confidence interval.

Each of the three compressors that we choose has a different underlying algorithm from gzip. bz2 uses the Burrows-Wheeler algorithm (Burrows, 1994) to permute the order of characters in the string to create more repeated "substrings" that can be compressed, giving it a higher compression ratio (e.g., it can achieve 2.57 bits per character (bpc) on AGNews while gzip achieves only 3.38 bpc). lzma is similar to gzip in that both are based on LZ77 (Ziv and Lempel, 1977), a dictionary-based compression algorithm that uses (offset, length) pairs to represent an n-gram that has previously appeared in the search buffer.5 zstandard (zstd) is a newer compression algorithm built on LZ77, Huffman coding, and Asymmetric Numeral Systems (ANS) (Duda, 2009). We pick zstd because of its high compression speed and a compression ratio close to gzip's. A competitive result would suggest that zstd might be an alternative to gzip that speeds up classification.

5 gzip uses the DEFLATE algorithm, which uses Huffman coding (Huffman, 1952) to further encode the (offset, length) pairs, whereas lzma uses range coding to do so; as a result, lzma has a higher compression ratio but a slower compression speed.
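To make the comparison concrete, the sketch below measures compressed length and compression ratio for the same text under gzip, bz2, and lzma from the Python standard library; swapping the compress function into Listing 1 is all that is needed to change compressors. zstd is omitted here because it requires a third-party package, and everything in this snippet is an illustrative assumption rather than the paper's benchmarking code.

import bz2
import gzip
import lzma

def compression_ratio(data: bytes, compress) -> float:
    # original size / compressed size: larger means the compressor compresses more
    return len(data) / len(compress(data))

sample = ("the quick brown fox jumps over the lazy dog " * 50).encode()

for name, compress in [("gzip", gzip.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)]:
    # each compress function is a drop-in replacement for C(.) in Listing 1
    print(name, len(compress(sample)), round(compression_ratio(sample, compress), 2))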
In Figure 4, we plot the test accuracy and compression ratio of each compressor. The compression ratio is calculated as original size / compressed size, so the larger the compression ratio, the more a compressor can compress.6 Each marker type represents a dataset, with '+' representing the mean of each compressor's test accuracy across different shot settings.

6 We use compression ratio instead of bpc here as the bpc values are too close to each other to be differentiated.

[Figure 4: Compression ratio vs. test accuracy across different compressors on three datasets under different shot settings.]

In general, gzip achieves relatively high and stable accuracy across the three datasets. lzma is competitive with gzip, but its speed is much slower. Despite its high compression ratio, bz2 performs the worst on AGNews and DBpedia. Normally, a higher compression ratio suggests that the NCD based on that compressor approximates the information distance E(x, y) better.
But in bz2's case, its accuracy is always lower than the regression line (Figure 4). We conjecture this may be because the Burrows-Wheeler algorithm used by bz2 dismisses the information of character order by permuting characters during compression.

We investigate the correlation between accuracy and compression ratio across compressors and find that they have a moderate monotonic linear correlation, as shown in Figure 4. As the shot number increases, the linear correlation becomes more obvious, with rs = 0.605 over all shot settings and Pearson correlations rp = 0.575, 0.638, 0.691, 0.719 on the 5, 10, 50, and 100-shot settings respectively across the four compressors. We have also found that for a single compressor, the easier a dataset can be compressed, the higher the accuracy gzip can achieve (details are in Appendix F.1). Combining our findings, we can see that a compressor performs best when it has a high compression ratio on datasets that are highly compressible, unless crucial information is disregarded by its compression algorithm.

6.2 Using Other Compressor-Based Methods

The majority of previous compressor-based text classification is built on estimating the cross entropy between the probability distribution built on class c and the document d, Hc(d), as we mention in Section 2.1. As summarized in Russell (2010), the procedure for using a compressor to estimate Hc(d) is:

1. For each class c, concatenate all samples dc in the training set belonging to c.
2. Compress dc as one long document to get the compressed length C(dc).
3. Concatenate the given test sample du with dc and compress to get C(dc du).
4. The predicted class is arg minc C(dc du) − C(dc).
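The sketch below is our minimal rendering of this four-step procedure with gzip (the "gzip (ce)" variant evaluated in Table 6). It assumes the same (text, label) tuples as Listing 1 and joins samples with spaces, which is an implementation choice on our part rather than a detail specified above.

import gzip
from collections import defaultdict

def cross_entropy_predict(test_text, training_set):
    # step 1: concatenate all training samples of each class c into d_c
    class_docs = defaultdict(list)
    for text, label in training_set:
        class_docs[label].append(text)
    best_label, best_score = None, float("inf")
    for label, texts in class_docs.items():
        dc = " ".join(texts)
        # step 2: C(d_c); step 3: C(d_c d_u); step 4: argmin of the difference
        Cdc = len(gzip.compress(dc.encode()))
        Cdcdu = len(gzip.compress(" ".join([dc, test_text]).encode()))
        if Cdcdu - Cdc < best_score:
            best_label, best_score = label, Cdcdu - Cdc
    return best_label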
The distance metric used by previous work (Marton et al., 2005; Russell, 2010) is mainly C(dc du) − C(dc). Although using this distance metric is faster than pair-wise distance matrix computation on small datasets, it has several drawbacks. (1) Most compressors have a limited "size": for gzip it is the sliding window within which repeated strings can be searched, while for lzma it is the dictionary size it can keep a record of. This means that even with a large number of training samples, the compressor cannot take full advantage of those samples. (2) When dc is large, compressing dc du can be slow, which parallelization cannot solve. These two main drawbacks stop this method from being applied to really large datasets. Thus, we limit the size of the dataset to 1,000 randomly picked test samples and 100 shots per class from the training set to compare our method with this method.

Method       AGNews        SogouNews     DBpedia       YahooAnswers
gzip (ce)    0.739±0.046   0.741±0.076   0.880±0.010   0.408±0.012
gzip (kNN)   0.752±0.041   0.862±0.033   0.852±0.008   0.352±0.014

Table 6: Comparison with other compressor-based methods under the 100-shot setting.

In Table 6, "gzip (ce)" means using the cross entropy C(dc du) − C(dc), while "gzip (kNN)" refers to our method. We carry out each experiment five times and calculate the mean and 95% confidence interval. Our method outperforms the cross-entropy method on AGNews and SogouNews. The reason for the large accuracy gap between the two methods on SogouNews is probably that each instance in SogouNews is very long: a single sample can be 11.2K, which, when concatenated, makes dc larger than 1,000K under the 100-shot setting, while gzip typically has only a 32K window size. When the search space is tremendously smaller than the size of dc, the compressor fails to take advantage of all the information in the training set, which renders the compression ineffective. The cross-entropy method does perform very well on YahooAnswers. This might be because on a divergent dataset like YahooAnswers, which is created by numerous online users, concatenating all the samples in a class allows the cross-entropy method to take full advantage of all the information from a single class.

We also test the performance of the compressor-based cross-entropy method on the full AGNews dataset, as it is a relatively small one with short single instances. The accuracy is 0.745, not much higher than in the 100-shot setting, which further confirms that using C(dc du) − C(dc) as a distance metric cannot take full advantage of large datasets. In general, the results suggest that the compressor-based cross-entropy method is not as advantageous as ours on large datasets.

7 Conclusions and Future Work

In this paper, we use gzip with a compressor-based distance metric to perform text classification.
Our method achieves accuracy comparable to non-pretrained neural network classifiers on in-distribution datasets and outperforms both pretrained and non-pretrained models on out-of-distribution datasets. We also find that our method has greater advantages under few-shot settings. For future work, we will extend this work by generalizing gzip to neural compressors on text, as recent studies (Jiang et al., 2022) show that combining neural compressors derived from deep latent variable models with compressor-based distance metrics can even outperform semi-supervised methods for image classification.

Limitations

As the computational complexity of kNN is O(n²), speed becomes one of the limitations of our method when the size of a dataset gets really big. Multi-threading and multi-processing can greatly boost the speed, and the Lempel-Ziv Jaccard Distance (LZJD) (Raff and Nicholas, 2017), a more efficient version of NCD, can also be explored to alleviate the inefficiency problem. In addition, as our purpose is to highlight the trade-off between the simplicity of a model and its performance, we focus on the vanilla versions of DNNs, which are already complex enough compared with our method, without add-ons like pretrained embeddings (Pennington et al., 2014). This means we do not exhaust all the techniques one can use to improve DNNs, and neither do we exhaust all the text classification methods in the literature. Furthermore, our work only covers traditional compressors. As traditional compressors are only able to capture orthographic similarity, they may not be sufficient for harder classification tasks like emotion classification. Fortunately, the ability to compress redundant semantic information may be made possible by neural compressors built on latent variable models (Townsend et al., 2018).

Ethics

Being parameter-free, our method doesn't rely on GPUs but on CPU resources only. Thus, it does not bring the negative environmental impacts revolving around GPUs. In terms of overgeneralization, we conduct our experiments on both in-distribution and out-of-distribution datasets, covering six languages. As compressors are data-type agnostic, they are more inclusive of datasets, which allows us to classify low-resource languages like Kinyarwanda, Kirundi, and Swahili and to mitigate the underexposure problem (Hovy and Spruit, 2016). However, as our method has not been fully explored on datasets other than topic classification, it is very possible that our method makes unexpected classification mistakes on tasks like emotion classification. We encourage usage of this method in the real world to be limited to topic classification, and we hope that future work can explore more diverse tasks.

Acknowledgement

This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada, and in part by the Global Water Futures program funded by the Canada First Research Excellence Fund (CFREF).

References

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019a. DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398.

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019b. Rethinking complex neural network architectures for document classification. In Proceedings of the 2019 Conference of NAACL-HLT, Volume 1 (Long and Short Papers), pages 4046–4051.

Charles H Bennett, Péter Gács, Ming Li, Paul MB Vitányi, and Wojciech H Zurek. 1998. Information distance. IEEE Transactions on Information Theory, 44(4):1407–1423.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Michael Burrows. 1994. A block-sorting lossless data compression algorithm. SRC Research Report, 124.

Xin Chen, Brent Francia, Ming Li, Brian Mckinnon, and Amit Seker. 2004. Shared information and program plagiarism detection. IEEE Transactions on Information Theory, 50(7):1545–1551.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107–1116.

David Pereira Coutinho and Mario AT Figueiredo. 2015. Text classification using compression-based dissimilarity measures. International Journal of Pattern Recognition and Artificial Intelligence, 29(05):1553004.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jarek Duda. 2009. Asymmetric numeral systems. arXiv preprint arXiv:0902.0271.

Eibe Frank, Chang Chui, and Ian H Witten. 2000. Text categorization using compression models.

William Hersh, Chris Buckley, TJ Leone, and David Hickam. 1994. Ohsumed: An interactive retrieval evaluation and new large test collection for research. In SIGIR '94, pages 192–201. Springer.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598.

David A Huffman. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101.

Zhiying Jiang, Yiqin Dai, Ji Xin, Ming Li, and Jimmy Lin. 2022. Few-shot non-parametric learning with deep latent variable model. Advances in Neural Information Processing Systems (NeurIPS).

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137–142. Springer.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. EACL 2017, page 427.

Alexandros Kastanos and Tyler Martin. 2021. Graph convolutional network for Swahili news classification. arXiv preprint arXiv:2103.09325.

Nitya Kasturi and Igor L Markov. 2022. Text ranking and classification using data compression. In I (Still) Can't Believe It's Not Better! Workshop at NeurIPS 2021, pages 48–53. PMLR.

Kazuya Kawakami. 2008. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis.

Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana. 2004. Towards parameter-free data mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 206–215.

Dmitry V Khmelev and William J Teahan. 2003. A repetition based measure for verification of text collections and for text categorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104–110.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).

Andrei N Kolmogorov. 1963. On tables of random numbers. Sankhyā: The Indian Journal of Statistics, Series A, pages 369–376.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.

Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul MB Vitányi. 2004. The similarity metric. IEEE Transactions on Information Theory, 50(12):3250–3264.

Qian Li, Hao Peng, Jianxin Li, Congying Xia, Renyu Yang, Lichao Sun, Philip S Yu, and Lifang He. 2022. A survey on text classification: From traditional to deep learning. ACM Transactions on Intelligent Systems and Technology (TIST), 13(2):1–41.

Xien Liu, Song Wang, Xiao Zhang, Xinxin You, Ji Wu, and Dejing Dou. 2020. Label-guided learning for text classification. arXiv preprint arXiv:2002.10772.

Evan Dennison Livelo and Charibeth Cheng. 2018. Intelligent dengue infoveillance using gated recurrent neural learning and cross-label frequencies. In 2018 IEEE International Conference on Agents (ICA), pages 2–7. IEEE.

Yuval Marton, Ning Wu, and Lisa Hellerstein. 2005. On compression-based text classification. In European Conference on Information Retrieval, pages 300–314. Springer.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Rubungo Andre Niyongabo, Qu Hong, Julia Kreutzer, and Li Huang. 2020. KINNEWS and KIRNEWS: Benchmarking cross-lingual text classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5507–5521.

Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 708–718.

Antoine Nzeyimana and Andre Niyongabo Rubungo. 2022. KinyaBERT: a morphology-aware Kinyarwanda language model. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5347–5363.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Edward Raff and Charles Nicholas. 2017. An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1007–1015.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.

Stuart J Russell. 2010. Artificial Intelligence: A Modern Approach. Pearson Education, Inc.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450.

Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov, and Noam Slonim. 2022. Cluster & tune: Boost cold start performance in text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7639–7653.

William J Teahan and David J Harper. 2003. Using compression-based language models for text categorization. In Language Modeling for Information Retrieval, pages 141–165. Springer.

James Townsend, Thomas Bird, and David Barber. 2018. Practical lossless compression with latent variables using bits back coding. In International Conference on Learning Representations.

Paul MB Vitányi, Frank J Balbach, Rudi L Cilibrasi, and Ming Li. 2009. Normalized information distance. In Information Theory and Statistical Learning, pages 45–82. Springer.

Canhui Wang, Min Zhang, Shaoping Ma, and Liyun Ru. 2008. Automatic online news issue construction in web environment. In Proceedings of the 17th International Conference on World Wide Web, pages 457–466.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the NAACL-HLT, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.

Yaodong Yu, Heinrich Jiang, Dara Bahri, Hossein Mobahi, Seungyeon Kim, Ankit Singh Rawat, Andreas Veit, and Yi Ma. 2021. An empirical study of pre-trained vision models on out-of-distribution generalization. In NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications.

Haode Zhang, Yuwei Zhang, Li-Ming Zhan, Jiaxin Chen, Guangyuan Shi, Xiao-Ming Wu, and Albert YS Lam. 2021. Effectiveness of pre-training for few-shot intent classification. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1114–1120.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28.

Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343.
A Derivation of NCD

Recall that the information distance E(x, y) is:

E(x, y) = max{K(x|y), K(y|x)}            (4)
        = K(xy) − min{K(x), K(y)}        (5)

E(x, y) equates the similarity between two objects with the length of a program that can convert one to the other. The simpler the converting program is, the more similar the objects are. For example, the negative of an image is very similar to the original one, as the transformation can be simply described as "inverting the color of the image".

In order to compare similarity, a relative distance is preferred. Vitányi et al. (2009) propose a normalized version of E(x, y) called the Normalized Information Distance (NID).

Definition 1 (NID) NID is a function Ω × Ω → [0, 1], where Ω is a non-empty set, defined as:

NID(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}.        (6)

Equation (6) can be interpreted as follows. Given two sequences x, y with K(y) ≥ K(x):

NID(x, y) = (K(y) − I(x : y)) / K(y) = 1 − I(x : y) / K(y),        (7)

where I(x : y) = K(y) − K(y|x) is the mutual algorithmic information. I(x : y)/K(y) means the shared information (in bits) per bit of information contained in the most informative sequence, and Equation (7) is a specific case of Equation (6).

Normalized Compression Distance (NCD) is a computable version of NID based on real-world compressors. In this context, K(x) can be viewed as the length of x after being maximally compressed. Suppose we have C(x) as the length of compressed x produced by a real-world compressor; then NCD is defined as:

NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}.        (8)

NCD is thus computable in that it not only uses compressed length to approximate K(x) but also replaces the conditional Kolmogorov complexity with C(xy), which only needs a simple concatenation of x and y.

B Dataset Details

In addition to the statistics of the datasets we use, we also include one example from each dataset in Table 7. We then briefly introduce what each dataset is about and how it was collected.

AG News7 contains more than 1 million news articles from the academic news search engine ComeToMyHead and was collected for research purposes.

DBpedia (Lehmann et al., 2015) is extracted from Wikipedia as a crowd-sourced project; we use the version in torchtext version 0.11.

YahooAnswers is introduced in Zhang et al. (2015) through the Yahoo! Webscope program and uses the 10 largest main categories as the topic classification corpus.

20News (Lang, 1995) was originally collected by Ken Lang and is widely used to evaluate text classification; we use the version in scikit-learn.

Ohsumed (Hersh et al., 1994) is collected from 270 medical journals over a five-year period (1987-1991) with 23 cardiovascular diseases. We use the subset introduced in Yao et al. (2019) to create a single-label classification task.

R8 and R52 are two subsets of the Reuters-21578 collection (Joachims, 1998), which can be downloaded from Text Categorization Corpora.

KirundiNews (KirNews) and KinyarwandaNews (KinNews) are introduced in Niyongabo et al. (2020), collected as a benchmark for text classification on two low-resource African languages; they can be freely downloaded from the authors' repository.

SwahiliNews (Swahili)8 is a news dataset in Swahili, a language spoken by 100-150 million people across East Africa; the dataset was created to help leverage NLP techniques across the African continent and can be freely downloaded from Hugging Face datasets.

DengueFilipino (Filipino) (Livelo and Cheng, 2018) is a multi-label low-resource classification dataset, which can be freely downloaded from Hugging Face datasets. We process it as a single-label classification task: we randomly select a label if an instance has multiple labels and use the same processed dataset for every model.

SogouNews is collected by Wang et al. (2008), segmented and labeled by Zhang et al. (2015). We use the version that's publicly available on torchtext.

7 http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
8 https://doi.org/10.5281/zenodo.5514203
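For reference, the 20News version mentioned above can be loaded directly with the standard scikit-learn API, as sketched below. This snippet is illustrative only; it covers just the 20News case and does not claim to reproduce the authors' exact preprocessing.

import numpy as np
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# (text, label) tuples as a 2-D object array, matching the format used with Listing 1
training_set = np.array(
    [(text, train.target_names[i]) for text, i in zip(train.data, train.target)], dtype=object)
test_set = [(text, test.target_names[i]) for text, i in zip(test.data, test.target)]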
Dataset Sample Text
AGNews    “Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street’s dwindling band
of ultra-cynics, are seeing green again.”
DBpedia    “European Association for the Study of the Liver”, “The European Association for the Study of the Liver
(EASL) is a European professional association for liver disease.”
YahooAnswers    “Is a transponder required to fly in class C airspace?”,“I’ve heard that it may not be for some aircraft.
What are the rules?”,“the answer is that you must have a transponder in order to fly in a class C airspace.”
20News    “Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of
Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I
saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a
Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the
body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where
this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks,- IL
—- brought to you by your neighborhood Lerxst —-”
Ohsumed    “Protection against allergen-induced asthma by salmeterol.The effects of the long-acting beta 2-agonist
salmeterol on early and late phase airways events provoked by inhaled allergen were assessed in a group
of atopic asthmatic patients.In a placebo-controlled study, salmeterol 50 micrograms inhaled before
allergen challenge ablated both the early and late phase of allergen-induced bronchoconstriction over a
34 h time period.Salmeterol also completely inhibited the allergen-induced rise in non-specific bronchial
responsiveness over the same time period.These effects were shown to be unrelated to prolonged
bronchodilatation or functional antagonism.These data suggest novel actions for topically active long-
acting beta 2-agonists in asthma that extend beyond their protective action on airways smooth muscle.”
R8    “champion products ch approves stock split champion products inc said its board of directors approved a
two for one stock split of its common shares for shareholders of record as of april the company also said
its board voted to recommend to shareholders at the annual meeting april an increase in the authorized
capital stock from five mln to mln shares reuter ”
R52    “january housing sales drop realty group says sales of previously owned homes dropped pct in january
to a seasonally adjusted annual rate of mln units the national association of realtors nar said but the
december rate of mln units had been the highest since the record mln unit sales rate set in november the
group said the drop in january is not surprising considering that a significant portion of december s near
record pace was made up of sellers seeking to get favorable capital gains treatment under the old tax
laws said the nar s john tuccillo reuter”
“mutzig beer fest itegerejwe n’abantu benshi kigali mutzig beer fest thedition izabera juru parki rebero
hateganyijwe imodoka zizajya zifata abantu buri minota zibakura sonatubei remera stade kumarembo
areba miginai remera mugiporoso hamwe mumujyi rond point nini kigali iki gitaramo kizaba cyatumi-
KinNews wemo abahanzi batandukanye harimo kizigenza mugihugu cy’u burundi uzwi izina kidum benshi bakaba
bamuziho gucuranga neza live music iki gitaramo kikazatangira isaha saa kumi n’ebyiri z’umugoroba
taliki kugeza saa munani mugitondo taliki kwinjira bizasaba amafaranga y’u rwanda kubafite mutzig
golden card aha niho tike zigurirwa nakumat la gallette simba super market flurep”
“sentare yiyungurizo ntahangwa yagumije munyororo abamenyeshamakuru bane abo bamenyeshamakuru
bakaba bakorera ikinyamakuru iwacu bakaba batawe mvuto kwezi kw’icumi umwaka bakaba bagiye
ntara bubanza kurondera amakuru yavuga hari abagwanya leta binjiye gihugu abajejwe umutekano
baciye babafata bagishika komine bukinanyana ahavugwa bagwanyi bakaba baciye bashikirizwa sentare
nkuru bubanza umushikirizamanza akaba yaciye abagiriza icaha co kwifatanya n’abagwanyi gutera
KirNews igihugu icaha cahavuye gihindurwa citwa icaha co gushaka guhungabanya umutekano w’igihugu iyo
sentare yaciye ibacira imyaka ibiri nusu n’amande y’amafaranga umuriyoni umwe umwe icabafashe
cane n’ubutumwe bwafatanwe umwe muribo buvuga ’bagiye i bubanza gufasha abagwanyi” ababuranira
bakaba baragerageje kwerekana kwabo bamenyeshamakuru ataco bapfana n’abagwanyi ikinyamakuru
iwacu kikaba carunguruje sentare yiyungurizo ntahangwa ariko sentare yafashe ingingo kubagumiza
mumunyororo ikinyamakuru iwacu kikavuga kigiye kwitura sentare ntahinyuzwa”
Filipino “Kung hindi lang absent yung ibang pipirma sa thesis namen edi sana tapos na hardbound”
“TIMU ya taifa ya Tanzania, Serengeti Boys jana ilijiweka katika nafasi fi nyu katika mashindano
ya Mataifa ya Afrika kwa wachezaji wenye umri chini ya miaka 17 baada ya kuchapwa mabao 3-0
na Uganda kwenye Uwanja wa Taifa, Dar es Salaam.Uganda waliandika bao lao la kwanza katika
dakika ya 15 lililofungwa na Kawooya Andrew akiunganisha wavuni krosi ya Najibu Viga huku lile la
SwahiliNews pili likifungwa na Asaba Ivan katika dakika ya 27 Najib Yiga.Serengeti Boys iliendelea kulala, Yiga
aliifungia Uganda bao la tatu na la ushindi na kuifanya Serengeti kushika mkia katika Kundi A na kuacha
simanzi kwa wapenzi wa soka nchini. Serengeti Boys inasubiri mchezo wa mwisho dhidi ya Senegal
huku Nigeria ikisonga mbele baada ya kushinda mchezo wake wa awali kwenye uwanja huo na kufikisha
pointi sita baada ya kushinda ule wa ufunguzi dhidi ya Tanzania.”
“2008 di4 qi1 jie4 qi1ng da3o guo2 ji4 che1 zha3n me3i nv3 mo2 te4 ”,“2008di4 qi1 jie4 qi1ng da3o
guo2 ji4 che1 zha3n yu2 15 ri4 za4i qi1ng da3o guo2 ji4 hui4 zha3n zho1ng xi1n she4ng da4 ka1i mu4
. be3n ci4 che1 zha3n jia1ng chi2 xu4 da4o be3n yue4 19 ri4 . ji1n nia2n qi1ng da3o guo2 ji4 che1
SogouNews
zha3n shi4 li4 nia2n da3o che2ng che1 zha3n gui1 mo2 zui4 da4 di2 yi1 ci4 , shi3 yo4ng lia3o qi1ng
da3o guo2 ji4 hui4 zha3n zho1ng xi1n di2 qua2n bu4 shi4 ne4i wa4i zha3n gua3n . yi3 xia4 we2i xia4n
cha3ng mo2 te4 tu2 pia4n .”

Table 7: Sample text for each dataset.

Paper                  Model     Emb  AGNews  DBpedia  YahooAnswers  20News  Ohsumed  R8     R52    SogouNews
Zhang et al. (2015)    LSTM      ✓    0.860   0.985    0.708         -       -        -      -      0.951
Zhang et al. (2015)    charCNN   ✗    0.914   0.985    0.680         -       -        -      -      0.956
Yang et al. (2016)     HAN       ✓    -       -        0.758         -       -        -      -      -
Joulin et al. (2017)   charCNN   ✗    0.872   0.983    0.712         -       -        -      -      0.951
Joulin et al. (2017)   VDCNN     ✗    0.913   0.987    0.734         -       -        -      -      0.968
Joulin et al. (2017)   fastText  ✗    0.915   0.981    0.720         -       -        -      -      0.939
Conneau et al. (2017)  VDCNN     ✗    0.908   0.986    0.724         -       -        -      -      0.962
Yao et al. (2019)      LSTM      ✗    -       -        -             0.657   0.411    0.937  0.855  -
Yao et al. (2019)      fastText  ✓    -       -        -             0.797   0.557    0.947  0.909  -
Liu et al. (2020)      fastText  ✓    0.925   0.986    0.723         0.114   0.146    0.860  0.716  -
Liu et al. (2020)      BiLSTM    ✓    -       -        -             0.732   0.493    0.963  0.905  -
Liu et al. (2020)      BERT      ✗    -       -        -             0.679   0.512    0.960  0.897  -

Table 8: Results reported in previous works on datasets with abundant resources with embedding (Emb) information.

Paper                         Model        Emb           PT              KinyarwandaNews  KirundiNews  SwahiliNews  DengueFilipino
Niyongabo et al. (2020)       charCNN      ✗             ✗               0.717            0.692        -            -
Niyongabo et al. (2020)       BiGRU        ✓(Kin. W2V)   ✗               0.887            0.859        -            -
Niyongabo et al. (2020)       CNN          ✓(Kin. W2V)   ✗               0.875            0.857        -            -
Kastanos and Martin (2021)    fastText     ✗             ✗               -                -            0.675        -
Nzeyimana and Rubungo (2022)  BERT_BPE     ✗             ✓(Kin. Corpus)  0.883            -            -            -
Nzeyimana and Rubungo (2022)  BERT_MORPHO  ✗             ✓(Kin. Corpus)  0.869            -            -            -
Nzeyimana and Rubungo (2022)  KinyaBERT    ✗             ✓(Kin. Corpus)  0.880            -            -            -

Table 9: Results reported in previous works on low resource languages with embedding (Emb) and pre-training (PT)
information.

Paper                  Model            AGNews  DBpedia
Shnarch et al. (2022)  BERT             0.619   0.312
Shnarch et al. (2022)  BERT_IT:CLUSTER  0.807   0.670

Table 10: Results reported in previous works on 64-sample learning, corresponding to 14-shot for AGNews and ≈5-shot for DBpedia.

C Implementation Details

We use different hyper-parameters for the full-dataset and few-shot settings.

For LSTM, Bi-LSTM+Attn and fastText, we use embedding size = 256 and dropout rate = 0.3. For the full-dataset setting, the learning rate is set to 0.001 and the decay rate to 0.9 for the Adam optimizer (Kingma and Ba, 2015), with number of epochs = 20 and batch size = 64; for the few-shot setting, the learning rate = 0.01, the decay rate = 0.99, batch size = 1, and the number of epochs = 50 for 50-shot and 100-shot and 80 for 5-shot and 10-shot. For LSTM and Bi-LSTM+Attn, we set the number of RNN layers to 1 and the hidden size to 64. For fastText, we use one hidden layer whose dimension is set to 10.

For HAN, we use one layer for both the word-level RNN and the sentence-level RNN; the hidden sizes of both are set to 50, and the hidden sizes of both attention layers are set to 100. It is trained with batch size = 256 and a decay rate of 0.5 for 6 epochs.

For BERT, the learning rate is set to 2e-5 and the batch size to 128 for English and SogouNews, while for the low-resource languages we set the learning rate to 1e-5 with batch size 16 for 5 epochs. We use the publicly available transformers library (Wolf et al., 2020) for BERT; specifically, we use the bert-base-uncased checkpoint for BERT and bert-base-multilingual-uncased for mBERT.

For charCNN and textCNN, we use the same hyper-parameter settings as Adhikari et al. (2019b), except that in the few-shot learning setting we reduce the batch size to 1, reduce the learning rate to 1e-4 and increase the number of epochs to 60. We also use their open-source hedwig repository for the implementation. For VDCNN, we use the shallowest 9-layer version with embedding size set to 16, batch size set to 64 and learning rate set to 1e-4 for the full-dataset setting, and batch size = 1 and epoch number = 60 for the few-shot setting. For RCNN, we use embedding size = 256, hidden size of RNN = 256, learning rate = 1e-3, and the same batch size and epoch settings as VDCNN for the full-dataset and few-shot settings.

In general, we perform grid search for hyper-parameters on all the neural network models, and we use a test set to validate, which can only overestimate the accuracy.
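As an illustration of the BERT fine-tuning configuration described above, here is a minimal sketch using the transformers library; the toy train_texts/train_labels, the number of labels, and the tiny batch size are placeholders rather than part of our pipeline (the full-dataset English setting uses batch size 128 and learning rate 2e-5).

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder data; in practice the texts and labels come from the dataset loaders.
train_texts = ["example document one", "example document two"]
train_labels = [0, 1]
num_labels = 2  # placeholder; e.g., AGNews has 4 classes

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)

enc = tokenizer(train_texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(train_labels)),
                    batch_size=2, shuffle=True)  # 128 in the full-dataset English setting

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # 1e-5 for the low-resource languages
model.train()
for epoch in range(5):
    for input_ids, attention_mask, labels in loader:
        loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()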
For preprocessing, we don't use any pretrained word embeddings for any word-based models. The reason is that we keep a strict categorization between "training" and "pre-training"; involving pretrained embeddings would make DNNs' categories ambiguous. Neither do we use data augmentation during training. The procedures of tokenization (both word-level and character-level) and of padding for batch processing are, however, inevitable.

For all non-parametric methods, the only hyper-parameter is k. We set k = 2 for all the methods on all the datasets, and we report the maximum possible accuracy obtained from the experiments for each method. For Sentence-BERT, we use the paraphrase-MiniLM-L6-v2 checkpoint.

Our method only requires CPUs, and we use 8-core CPUs to take advantage of multi-processing. Calculating the distance matrix with gzip takes about half an hour on AGNews, two days on DBpedia and SogouNews, and six days on YahooAnswers.

D Few-Shot Results

The exact numerical values of the accuracies shown in Figure 2 are listed in the three tables below.
Dataset: AGNews
#Shot          5            10           50           100
fastText       0.273±0.021  0.329±0.036  0.550±0.008  0.684±0.010
Bi-LSTM+Attn   0.269±0.022  0.331±0.028  0.549±0.028  0.665±0.019
HAN            0.274±0.024  0.289±0.020  0.340±0.073  0.548±0.031
W2V            0.388±0.186  0.546±0.162  0.531±0.272  0.395±0.089
BERT           0.803±0.026  0.819±0.019  0.869±0.005  0.875±0.005
SentBERT       0.716±0.032  0.746±0.018  0.818±0.008  0.829±0.004
gzip (ours)    0.587±0.048  0.610±0.034  0.699±0.017  0.741±0.007

Table 11: Few-shot results on AGNews.
Dataset: DBpedia
#Shot          5            10           50           100
fastText       0.475±0.041  0.616±0.019  0.767±0.041  0.868±0.014
Bi-LSTM+Attn   0.506±0.041  0.648±0.025  0.818±0.008  0.862±0.005
HAN            0.350±0.012  0.484±0.010  0.501±0.003  0.835±0.005
W2V            0.325±0.113  0.402±0.123  0.675±0.05   0.787±0.015
BERT           0.964±0.041  0.979±0.007  0.986±0.002  0.987±0.001
SentBERT       0.730±0.008  0.746±0.018  0.819±0.008  0.829±0.004
gzip (ours)    0.622±0.022  0.701±0.021  0.825±0.003  0.857±0.004

Table 12: Few-shot results on DBpedia.
Dataset: SogouNews
#Shot          5            10           50           100
fastText       0.545±0.053  0.652±0.051  0.782±0.034  0.809±0.012
Bi-LSTM+Attn   0.534±0.042  0.614±0.047  0.771±0.021  0.812±0.008
HAN            0.425±0.072  0.542±0.118  0.671±0.102  0.808±0.020
W2V            0.141±0.005  0.124±0.048  0.133±0.016  0.395±0.089
BERT           0.221±0.041  0.226±0.060  0.392±0.276  0.679±0.073
SentBERT       0.485±0.043  0.501±0.041  0.565±0.013  0.572±0.003
gzip (ours)    0.649±0.061  0.741±0.017  0.833±0.007  0.867±0.016

Table 13: Few-shot results on SogouNews.
E Other Reported Results

In Table 3 and Table 5, we report the results obtained with our hyper-parameter settings and implementation. However, we find that we couldn't replicate previously reported results in some cases — we get higher or lower results than previously reported ones, which may be due to different experimental settings (e.g., they may use pretrained word embeddings while we don't) or different hyper-parameter settings. Thus, we provide results reported by some previous papers for reference in Table 8, Table 9 and Table 10. Note that SogouNews is listed in the first table as it has abundant resources and is commonly used as a benchmark for DNNs that excel at large datasets. As studies on low-resource languages and few-shot learning scenarios are insufficient, in Table 9 and Table 10 we also report results of model variants such as BiGRU using Kinyarwanda embeddings (Kin. W2V) and BERT_MORPHO, which incorporates morphology and is pretrained on a Kinyarwanda corpus (Kin. Corpus), in addition to the models we use in the paper. We don't find any result reported for DengueFilipino, as previous works' evaluation uses multi-label metrics.

F Performance Analysis

To understand the merits and shortcomings of using gzip for classification, we evaluate gzip's performance in terms of both the absolute accuracy and the relative performance compared to the neural methods. A low absolute accuracy with a high relative performance suggests that the dataset itself is difficult, while a high accuracy with a low relative performance means the dataset is better solved by a neural network. As our method performs well on OOD datasets, we are more interested in analyzing the ID cases. We carry out experiments on seven in-distribution datasets and one out-of-distribution dataset across fourteen models to account for different ranks. We analyze both the relative performance and the absolute accuracy with regard to the vocabulary size and the compression rate of the datasets (i.e., how easily a dataset can be compressed) and of the compressors (i.e., how well a compressor can compress).

To represent the relative performance with regard to other methods, we use the normalized rank percentage, computed as (rank of gzip) / (total number of methods); the lower the score, the better gzip is. We use "bits per character" (bpc) to evaluate the compression rate. The procedure is to randomly sample a thousand instances from the training and test set respectively, calculate the compressed length, and divide by the number of characters. Sampling is used to keep the size of the dataset constant.

[Figure 5: Relative performance v.s. vocabulary size (top) and compression rate in bits per character (bottom). Both panels plot the normalized rank percentage of gzip for AGNews, DBpedia, YahooAnswers, 20News, Ohsumed, R8, R52 and SogouNews.]
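As a small illustration of the two quantities defined above, the sketch below computes bits per character over a random sample and the normalized rank percentage from a table of per-method accuracies; the sampling seed and the exact set of methods included in the ranking are assumptions of this sketch.

import gzip
import random

def bits_per_character(texts, n_samples=1000, seed=0):
    # Randomly sample instances, compress each with gzip, and divide
    # the total compressed size in bits by the total number of characters.
    random.seed(seed)
    sample = random.sample(texts, min(n_samples, len(texts)))
    total_bits = sum(8 * len(gzip.compress(t.encode("utf-8"))) for t in sample)
    total_chars = sum(len(t) for t in sample)
    return total_bits / total_chars

def normalized_rank_percentage(accuracy_by_method, method="gzip"):
    # Rank of the given method among all methods on one dataset
    # (1 = highest accuracy), divided by the total number of methods.
    ranked = sorted(accuracy_by_method, key=accuracy_by_method.get, reverse=True)
    return (ranked.index(method) + 1) / len(ranked)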

F.1 Relative Performance


Combining Table 1 and Table 3, we see that accuracy is largely unaffected by the average length of a single sample (Spearman coefficient rs = −0.220). The relative performance, however, is more correlated with the vocabulary size (rs = 0.561), as we can see in Figure 5. SogouNews is an outlier in the first plot: on a dataset with a fairly large vocabulary, gzip ranks first. The second plot may provide an explanation — the compression ratio for SogouNews is high, which means that even with a relatively large vocabulary size there is still repetitive information that can be squeezed out. With rs = 0.785 for the correlation between the normalized rank percentage and the compression rate, we can see that when a dataset is easier to compress, our method may be a strong candidate as a classifier.
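For reference, the Spearman coefficients reported above can be computed with scipy once the per-dataset values are collected; the arrays below are placeholders, not the actual measurements.

from scipy.stats import spearmanr

# Placeholder per-dataset values; one entry per dataset, in the same order.
rank_percentage = [0.07, 0.36, 0.57, 0.21, 0.43, 0.64, 0.50, 0.14]
bits_per_char = [2.3, 2.1, 2.6, 3.1, 3.0, 2.4, 2.5, 2.2]

rho, p_value = spearmanr(rank_percentage, bits_per_char)
print(f"Spearman rs = {rho:.3f} (p = {p_value:.3f})")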

F.2 Absolute Accuracy


Similarly, we evaluate the accuracy of classification with respect to the vocabulary size, and we find there is almost no monotonic relation (rs = 0.071). With regard to bpc, the monotonic relation is not as strong as the one with the rank percentage (rs = −0.56). Considering the effect that vocabulary size has on the relative performance, our method with gzip may be more susceptible to the vocabulary size than neural network methods.

To distinguish between a "hard" dataset and an "easy" one, we average all models' accuracies. The datasets with the lowest accuracies are 20News and Ohsumed, which are the two datasets with the longest average text length.
ACL 2023 Responsible NLP Checklist
A For every submission:
✓ A1. Did you describe the limitations of your work?
  Section 7.
✓ A2. Did you discuss any potential risks of your work?
  Section 8.
✓ A3. Do the abstract and introduction summarize the paper's main claims?
  Section 1.
✗ A4. Have you used AI writing assistants when working on this paper?
  Left blank.
B ✓ Did you use or create scientific artifacts?
  Section 3.
✓ B1. Did you cite the creators of artifacts you used?
  Appendix B and C.
✓ B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
  Appendix B and C.
✓ B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
  Appendix B and C.
✓ B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
  Appendix B.
✓ B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
  Section 4.1 and Appendix B.
✓ B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
  In Section 4.1, Table 1.

C ✓ Did you run computational experiments?
  Section 4.
✓ C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
  Section 4 and Appendix C.

✓ C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
  Appendix C.
✓ C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
  Section 4.3, 4.4, 4.5.
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)?
  Not applicable. Left blank.

D ✗ Did you use human annotators (e.g., crowdworkers) or research with human participants?
  Left blank.
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?
  No response.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?
  No response.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?
  No response.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
  No response.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?
  No response.

