A Little Goes a Long Way: Improving Toxic Language Classification despite Data Scarcity
Mika Juuti¹, Tommi Gröndahl², Adrian Flanagan³, N. Asokan¹,²
¹University of Waterloo
²Aalto University
³Huawei Technologies Oy (Finland) Co Ltd
[email protected], [email protected]
[email protected], [email protected]
Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2991–3009, November 16–20, 2020. © 2020 Association for Computational Linguistics
2 Preliminaries

Data augmentation arises naturally from the problem of filling in missing values (Tanner and Wong, 1987). In classification, data augmentation is applied to available training data. Classifier performance is measured on a separate (non-augmented) test set (Krizhevsky et al., 2012). Data augmentation can decrease overfitting (Wong et al., 2016; Shorten and Khoshgoftaar, 2019), and broaden the input feature range by increasing the vocabulary (Fadaee et al., 2019).

Simple oversampling is the most basic augmentation technique: copying minority class datapoints to appear multiple times. This increases the relevance of minority class features for computing the loss during training (Chawla et al., 2002).

EDA is a prior technique combining four text transformations to improve classification with CNN and RNN architectures (Wei and Zou, 2019). It uses (i) synonym replacement from WordNet (§3.2.1), (ii) random insertion of a synonym, (iii) random swap of two words, and (iv) random word deletion.

Word replacement has been applied in several data augmentation studies (Zhang et al., 2015; Wang and Yang, 2015; Xie et al., 2017; Wei and Zou, 2019; Fadaee et al., 2019). We compared four techniques, two based on semantic knowledge bases (§3.2.1) and two on pre-trained (sub)word embeddings (§3.2.2).

Pre-trained Transformer networks feature prominently in state-of-the-art NLP research. They are able to learn contextual embeddings, which depend on neighboring subwords (Devlin et al., 2019). Fine-tuning – adapting the weights of a pre-trained Transformer to a specific corpus – has been highly effective in improving classification performance (Devlin et al., 2019) and language modeling (Radford et al., 2019; Walton; Branwen, 2019). State-of-the-art networks are trained on large corpora: GPT-2's corpus contains 8M web pages, while BERT's training corpus contains 3.3B words.

… different classes of toxic language. The median length of a document is three sentences, but the distribution is heavy-tailed (Table 1).

Mean  Std.  Min  Max  25%  50%  75%
4     6     1    683  2    3    5

Table 1: Document lengths (number of sentences; tokenized with NLTK sent_tokenize (Bird et al., 2009)).

Some classes are severely under-represented: e.g., 478 examples of threat vs. 159093 non-threat examples. Our experiments concern binary classification, where one class is the minority class and all remaining documents belong to the majority class. We focus on threat as the minority class, as it poses the most challenge for automated analysis in this dataset (van Aken et al., 2018). To confirm our results, we also applied the best-performing techniques on a different type of toxic language, the identity-hate class (§4.6).

Our goal is to understand how data augmentation improves performance under extreme data scarcity in the minority class (threat). To simulate this, we derive our seed dataset (SEED) from the full data set (GOLD STANDARD) via stratified bootstrap sampling (Bickel and Freedman, 1984) to reduce the dataset size k-fold. We replaced newlines, tabs and repeated spaces with single spaces, and lower-cased each dataset. We applied data augmentation techniques on SEED with k-fold oversampling of the minority class, and compared each classifier architecture (§3.3) trained on SEED, GOLD STANDARD, and the augmented datasets. We used the original test dataset (TEST) for evaluating performance. We detail the dataset sizes in Table 2.

           GOLD STD.  SEED  TEST
Minority   478        25    211
Majority   159,093    7955  63,767

Table 2: Number of documents (minority: threat).
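The down-sampling step can be illustrated with a minimal sketch. It assumes the Kaggle data is loaded into a pandas DataFrame with a text column comment_text and a binary label column is_threat (column names, and the 5% fraction mentioned in Appendix B, are illustrative; they are not the authors' exact code).

import re
import pandas as pd

def preprocess(text: str) -> str:
    # Replace newlines, tabs and repeated spaces with single spaces; lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()

def make_seed(gold: pd.DataFrame, frac: float = 0.05, random_state: int = 0) -> pd.DataFrame:
    # Stratified down-sampling: draw the same fraction from each class separately,
    # so the minority/majority ratio of GOLD STANDARD is preserved in SEED.
    gold = gold.assign(comment_text=gold["comment_text"].map(preprocess))
    return (gold.groupby("is_threat", group_keys=False)
                .apply(lambda g: g.sample(frac=frac, random_state=random_state)))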
…mentation technique (below) and classifier (§3.3). For comparison, we also evaluated simple oversampling (COPY) and EDA (Wei and Zou, 2019), both reviewed in §2. Following the recommendation of Wei and Zou (2019) for applying EDA to small seed datasets, we used 5% augmentation probability, whereby each word has a 1 − 0.95⁴ ≈ 19% probability of being transformed by at least one of the four EDA techniques.

Four of the six techniques are based on replacing words with semantically close counterparts; two using semantic knowledge bases (§3.2.1) and two pre-trained embeddings (§3.2.2). We applied 25% of all possible replacements with these techniques, which is close to the recommended substitution rate in EDA. For short documents we ensured that at least one substitution is always selected. We also added majority class material to minority class documents (§3.2.3), and generated text with the GPT-2 language model fine-tuned on SEED (§3.2.4).

3.2.1 Substitutions from a knowledge base

WordNet is a semantic knowledge base containing various properties of word senses, which correspond to word meanings (Miller, 1995). We augmented SEED by replacing words with random synonyms. While EDA also uses WordNet synonyms (§2), we additionally applied word sense disambiguation (Navigli, 2009) and inflection.

For word sense disambiguation we used simple Lesk from PyWSD (Tan, 2014). As a variant of the Lesk algorithm (Lesk, 1986) it relies on overlap in definitions and example sentences (both provided in WordNet), compared between each candidate sense and words in the context.

Word senses appear as uninflected lemmas, which we inflected using a dictionary-based technique. We lemmatized and annotated a large corpus with NLTK (Bird et al., 2009), and mapped each <lemma, tag> combination to its most common surface form. The corpus contains 8.5 million short sentences (≤ 20 words) from multiple open-source corpora (see Appendix E). We designed it to have both a large vocabulary for wide coverage (371125 lemmas), and grammatically simple sentences to maximize correct tagging.
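A minimal sketch of this substitution step is given below, assuming PyWSD and NLTK (with the WordNet and tagger data) are installed. INFLECT stands in for the dictionary described in Appendix E, and the helper name is ours, not the authors' code.

import random
from nltk import pos_tag, word_tokenize
from pywsd.lesk import simple_lesk

# Hypothetical inflection dictionary: (lemma, Penn POS tag) -> most common surface form.
INFLECT = {("kill", "VBD"): "killed"}

def wordnet_substitute(sentence: str, target: str) -> str:
    tokens = word_tokenize(sentence)
    tags = dict(pos_tag(tokens))
    # Disambiguate the target word in context with simple Lesk.
    sense = simple_lesk(sentence, target)
    if sense is None:
        return sentence
    synonyms = [l.replace("_", " ") for l in sense.lemma_names()
                if l.lower() != target.lower()]
    if not synonyms:
        return sentence
    new = random.choice(synonyms)
    # Re-inflect the chosen lemma with the original word's POS tag, if available.
    new = INFLECT.get((new, tags.get(target, "")), new)
    return " ".join(new if tok == target else tok for tok in tokens)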
Paraphrase Database (PPDB) was collected from bilingual parallel corpora on the premise that English phrases translated identically to another language tend to be paraphrases (Ganitkevitch et al., 2013; Pavlick et al., 2015). We used phrase pairs tagged as equivalent, constituting 245691 paraphrases altogether. We controlled substitution by grammatical context as specified in PPDB. In single words this is the part-of-speech tag; whereas in multi-word paraphrases it also contains the syntactic category that appears after the original phrase in the PPDB training corpus. We obtained grammatical information with the spaCy parser.³

³ https://spacy.io/

3.2.2 Embedding neighbour substitutions

Embeddings can be used to map units to others with a similar occurrence distribution in a training corpus (Mikolov et al., 2013). We considered two alternative pre-trained embedding models. For each model, we produced top-10 nearest embedding neighbours (cosine similarity) of each word selected for replacement, and randomly picked the new word from these.

Twitter word embeddings (GLOVE) (Pennington et al., 2014) were obtained from a Twitter corpus,⁴ and we deployed these via Gensim (Řehůřek and Sojka, 2010).

⁴ We use 25-dimensional GloVe-embeddings from: https://nlp.stanford.edu/projects/glove/

Subword embeddings (BPEMB) have emerged as a practical pre-processing tool for overcoming the challenge of low-prevalence words (Sennrich et al., 2016). They have been applied in Transformer algorithms, including WordPiece (Wu et al., 2016) for BERT (Devlin et al., 2019), and BPE (Sennrich et al., 2016) for GPT-2 (Radford et al., 2019). BPEMB (Heinzerling and Strube, 2018) provides pre-trained GloVe embeddings, constructed by applying SentencePiece (Kudo and Richardson, 2018) on the English Wikipedia. We use 50-dimensional BPEMB-embeddings with vocabulary size 10,000.
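The neighbour lookup itself is a short operation; the sketch below loads the 25-dimensional Twitter GloVe vectors through gensim's downloader (we assume the gensim-data model name glove-twitter-25) and picks a replacement uniformly from the top-10 cosine neighbours. BPEmb subword vectors could be plugged in the same way through their gensim KeyedVectors interface.

import random
import gensim.downloader as api

# Pre-trained 25-dimensional Twitter GloVe vectors (assumed gensim-data name).
glove = api.load("glove-twitter-25")

def embedding_substitute(word: str, topn: int = 10) -> str:
    if word not in glove:
        return word  # out-of-vocabulary words are left unchanged
    neighbours = [w for w, _ in glove.most_similar(word, topn=topn)]
    return random.choice(neighbours)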
3.2.3 Majority class sentence addition (ADD)

Adding unrelated material to the training data can be beneficial by making relevant features stand out (Wong et al., 2016; Shorten and Khoshgoftaar, 2019). We added a random sentence from a majority class document in SEED to a random position in a copy of each minority class training document.
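A sketch of ADD, assuming documents are plain strings and using NLTK's sentence tokenizer (the function name is ours):

import random
from nltk import sent_tokenize

def add_majority_sentence(minority_doc: str, majority_docs: list) -> str:
    # Pick one sentence from a random majority-class document ...
    donor = random.choice(majority_docs)
    sentence = random.choice(sent_tokenize(donor))
    # ... and splice it into a copy of the minority-class document at a random position.
    sentences = sent_tokenize(minority_doc)
    position = random.randint(0, len(sentences))
    sentences.insert(position, sentence)
    return " ".join(sentences)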
3.2.4 GPT-2 conditional generation

GPT-2 is a Transformer language model pre-trained on a large collection of Web documents. We used the 110M parameter GPT-2 model from the Transformers library (Wolf et al., 2019). We discuss parameters in Appendix F. We augmented as follows (N-fold oversampling):

1. Ĝ ← briefly train GPT-2 on minority class documents in SEED.
2. Generate N − 1 novel documents x̂ ← Ĝ(x) for all minority class samples x in SEED (see the sketch after this list).
3. Assign the minority class label to all documents x̂.
4. Merge x̂ with SEED.
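Step 2 can be sketched with the Transformers library as below. Ĝ is assumed to be a GPT-2 model already fine-tuned on the SEED threat documents; the prompt cutoff, temperature and top-p values follow Appendix F, while the model path and helper name are placeholders, not the authors' script.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("path/to/finetuned-gpt2")  # Ĝ (placeholder path)
model.eval()

def generate_variant(seed_doc: str, max_new_subwords: int = 100) -> str:
    # Condition on (at most) the first 100 characters, cut at the last full word.
    prompt = seed_doc[:100].rsplit(" ", 1)[0]
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            input_ids,
            do_sample=True,          # nucleus sampling
            top_p=0.9,
            temperature=1.0,
            max_length=input_ids.shape[1] + max_new_subwords,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)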
Augmentation  Type                   Unit      #Parameters  Pre-training Corpus
ADD           Non-toxic corpus       Sentence  NA           NA
PPDB          Knowledge Base         N-gram    NA           NA
WORDNET       Knowledge Base         Word      NA           NA
GLOVE         GloVe                  Word      30M          Twitter
BPEMB         GloVe                  Subword   0.5M         Wikipedia
GPT-2         Transformer            Subword   117M         WebText

Classifier    Model Type             Unit       #Parameters  Pre-training Corpus
Char-LR       Logistic regression    Character  30K          -
Word-LR       Logistic regression    Word       30K          -
CNN           Convolutional network  Word       3M           -
BERT          Transformer            Subword    110M         Wikipedia & BookCorpus

3.3 Classifiers

Char-LR and Word-LR. We adapted the logistic regression pipeline from the Wiki-detox project (Wulczyn et al., 2017).⁵ We allowed n-grams in the range 1–4, and kept the default parameters: TF-IDF normalization, vocabulary size at 10,000 and parameter C = 10 (inverse regularization strength).

⁵ https://github.com/ewulczyn/wiki-detox/blob/master/src/modeling/get_prod_models.py

CNN. We applied a word-based CNN model with 10 kernels of sizes 3, 4 and 5. Vocabulary size was 10,000 and embedding dimensionality 300. For training, we used the dropout probability of 0.1, and the Adam optimizer (Kingma and Ba, 2014) with the learning rate of 0.001.

BERT. We used the pre-trained Uncased BERT-Base and trained the model with the training script from Fast-Bert.⁶ We set maximum sequence length to 128 and mixed precision optimization level to O1.

⁶ https://github.com/kaushaltrivedi/fast-bert/blob/master/sample_notebooks/new-toxic-multilabel.ipynb
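A minimal scikit-learn reconstruction of the two logistic-regression baselines is shown below. The original pipeline comes from the Wiki-detox script linked above; the exact vectorizer settings there may differ, so this is only a sketch with the parameters stated in the paragraph on Char-LR and Word-LR.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def make_lr(analyzer: str):
    # analyzer="char" gives Char-LR, analyzer="word" gives Word-LR.
    return make_pipeline(
        TfidfVectorizer(analyzer=analyzer, ngram_range=(1, 4), max_features=10_000),
        LogisticRegression(C=10),
    )

char_lr = make_lr("char")
word_lr = make_lr("word")
# Usage: char_lr.fit(train_texts, train_labels); char_lr.predict_proba(test_texts)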
4 Results

We compared precision and recall for the minority class (threat), and the macro-averaged F1-score for each classifier and augmentation technique. (For brevity, we use "F1-score" from now on.) The majority class F1-score remained 1.00 (two digit rounding) across all our experiments. All classifiers are binary, and we assigned predictions to the class with the highest conditional probability. We relax this assumption in §4.4, to report area under the curve (AUC) values (Murphy, 2012).

To validate our results, we performed repeated experiments with the common random numbers technique (Glasserman and Yao, 1992), by which we controlled the sampling of SEED, initial random weights of classifiers, and the optimization procedure. We repeated the experiments 30 times, and report confidence intervals.

4.1 Results without augmentation

We first show classifier performance on GOLD STANDARD and SEED in Table 4. van Aken et al. (2018) reported F1-scores for logistic regression and CNN classifiers on GOLD STANDARD. Our results are comparable. We also evaluate BERT, which is noticeably better on GOLD STANDARD, particularly in terms of threat recall.

All classifiers had significantly reduced F1-scores on SEED, due to major drops in threat recall. In particular, BERT was degenerate, assigning all documents to the majority class in all 30 repetitions. Devlin et al. (2019) report that such behavior may occur on small datasets, but random restarts may help. In our case, random restarts did not impact BERT performance on SEED.

GOLD STANDARD
            Char-LR  Word-LR  CNN   BERT
Precision   0.61     0.43     0.60  0.54
Recall      0.34     0.36     0.33  0.54
F1          0.72     0.69     0.71  0.77

SEED
            Char-LR  Word-LR  CNN   BERT
Precision   0.64     0.47     0.41  0.00
Recall      0.03     0.04     0.09  0.00
F1          0.52     0.53     0.57  0.50

Table 4: Classifier performance on GOLD STANDARD and SEED. Precision and recall for threat; F1-score macro-averaged from both classes.

4.2 Augmentations

We applied all eight augmentation techniques (§3.2) to the minority class of SEED (threat). Each
technique retains one copy of each SEED document, and adds 19 synthetically generated documents per SEED document. Table 5 summarizes augmented dataset sizes. We present our main results in Table 6. We first discuss classifier-specific observations, and then make general observations on each augmentation technique.

           SEED   Augmented
Minority   25     25→500
Majority   7955   7955

Table 5: Number of documents in augmented datasets. We retained original SEED documents and expanded the dataset with additional synthetic documents (minority: threat).

We compared the impact of augmentations on each classifier, and therefore our performance comparisons below are local to each column (i.e., classifier). We identify the best performing technique for the three metrics and report the p-value when its effect is significantly better than the other techniques (based on one-sided paired t-tests, α = 5%).⁷

⁷ The statistical significance results apply to this dataset, but are indicative of the behavior of the techniques in general.
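The significance test can be reproduced roughly as follows: a one-sided paired t-test over the paired repetitions. This is only a sketch; scipy's ttest_rel is two-sided, so we halve the p-value after checking the sign of the statistic, and the function name is ours.

from scipy.stats import ttest_rel

def one_sided_paired_t(best_scores, other_scores):
    # H1: the best technique's mean score is greater than the other technique's.
    t, p_two_sided = ttest_rel(best_scores, other_scores)
    return p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2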
BERT. COPY and ADD were successful on BERT, raising the F1-score up to 21 percentage points above SEED to 0.71. But their impacts on BERT were different: ADD led to increased recall, while COPY resulted in increased precision. PPDB precision and recall were statistically indistinguishable from COPY, which indicates that it did few alterations. GPT-2 led to significantly better recall (p < 10⁻⁵ for all pairings), even surpassing GOLD STANDARD. Word substitution methods like EDA, WORDNET, GLOVE, and BPEMB improved on SEED, but were less effective than COPY in both precision and recall. Park et al. (2019) found that BERT may perform poorly on out-of-domain samples. BERT is reportedly unstable on adversarially chosen subword substitutions (Sun et al., 2020). We suggest that non-contextual word embedding schemes may be sub-optimal for BERT since its pre-training is not conducted with similarly noisy documents. We verified that reducing the number of replaced words was indeed beneficial for BERT (Appendix G).

Char-LR. BPEMB and ADD were effective at increasing recall, and reached similar increases in F1-score. GPT-2 raised recall to GOLD STANDARD level (p < 10⁻⁵ for all pairings), but precision remained 16 percentage points below GOLD STANDARD. It led to the best increase in F1-score: 16 percentage points above SEED (p < 10⁻³ for all pairings).

Word-LR. Embedding-based BPEMB and GLOVE increased recall by at least 13 percentage points, but the conceptually similar PPDB and WORDNET were largely unsuccessful. We suggest this discrepancy may be due to WORDNET and PPDB relying on written standard English, whereas toxic language tends to be more colloquial. GPT-2 increased recall and F1-score the most: 15 percentage points above SEED (p < 10⁻¹⁰ for all pairings).

CNN. GLOVE and ADD increased recall by at least 10 percentage points. BPEMB led to a large increase in recall, but with a drop in precision, possibly due to its larger capacity to make changes in text – GLOVE can only replace entire words that exist in the pre-training corpus. GPT-2 yielded the largest increases in recall and F1-score (p < 10⁻⁴ for all pairings).

We now discuss each augmentation technique.

COPY emphasizes the features of original minority documents in SEED, which generally resulted in fairly high precision. On Word-LR, COPY is analogous to increasing the weight of words that appear in minority documents.

EDA behaved similarly to COPY on Char-LR, Word-LR and CNN; but markedly worse on BERT.

ADD reduces the classifier's sensitivity to irrelevant material by adding majority class sentences to minority class documents. On Word-LR, ADD is analogous to reducing the weights of majority class words. ADD led to a marginally better F1-score than any other technique on BERT.
Augmentation                        Metric      Char-LR      Word-LR      CNN          BERT
SEED (No Oversampling)              Precision   0.68 ± 0.22  0.43 ± 0.27  0.45 ± 0.14  0.00 ± 0.00
                                    Recall      0.03 ± 0.02  0.04 ± 0.02  0.08 ± 0.05  0.00 ± 0.00
                                    F1 (macro)  0.53 ± 0.02  0.54 ± 0.02  0.56 ± 0.03  0.50 ± 0.00
COPY (Simple Oversampling)          Precision   0.67 ± 0.07  0.38 ± 0.24  0.40 ± 0.08  0.49 ± 0.07
                                    Recall      0.16 ± 0.03  0.03 ± 0.02  0.07 ± 0.03  0.36 ± 0.09
                                    F1 (macro)  0.63 ± 0.02  0.53 ± 0.02  0.56 ± 0.02  0.70 ± 0.03
EDA (Wei and Zou, 2019)             Precision   0.66 ± 0.06  0.36 ± 0.19  0.26 ± 0.09  0.21 ± 0.03
                                    Recall      0.13 ± 0.03  0.08 ± 0.04  0.07 ± 0.01  0.06 ± 0.01
                                    F1 (macro)  0.61 ± 0.02  0.56 ± 0.03  0.55 ± 0.01  0.54 ± 0.01
ADD (Add Majority-class Sentence)   Precision   0.58 ± 0.07  0.36 ± 0.21  0.45 ± 0.07  0.36 ± 0.04
                                    Recall      0.24 ± 0.04  0.06 ± 0.04  0.19 ± 0.07  0.52 ± 0.07
                                    F1 (macro)  0.67 ± 0.03  0.55 ± 0.03  0.63 ± 0.04  0.71 ± 0.01
PPDB (Phrase Substitutions)         Precision   0.16 ± 0.08  0.41 ± 0.27  0.37 ± 0.09  0.48 ± 0.06
                                    Recall      0.10 ± 0.03  0.04 ± 0.02  0.08 ± 0.04  0.34 ± 0.08
                                    F1 (macro)  0.56 ± 0.02  0.53 ± 0.02  0.57 ± 0.02  0.70 ± 0.03
WORDNET (Word Substitutions)        Precision   0.16 ± 0.06  0.36 ± 0.24  0.41 ± 0.08  0.47 ± 0.08
                                    Recall      0.11 ± 0.03  0.05 ± 0.03  0.11 ± 0.05  0.29 ± 0.07
                                    F1 (macro)  0.56 ± 0.02  0.54 ± 0.02  0.58 ± 0.03  0.68 ± 0.03
GLOVE (Word Substitutions)          Precision   0.15 ± 0.04  0.39 ± 0.12  0.38 ± 0.08  0.43 ± 0.11
                                    Recall      0.14 ± 0.03  0.16 ± 0.05  0.18 ± 0.06  0.18 ± 0.06
                                    F1 (macro)  0.57 ± 0.02  0.61 ± 0.03  0.62 ± 0.03  0.62 ± 0.03
BPEMB (Subword Substitutions)       Precision   0.56 ± 0.07  0.33 ± 0.07  0.25 ± 0.07  0.38 ± 0.12
                                    Recall      0.22 ± 0.03  0.22 ± 0.04  0.37 ± 0.08  0.16 ± 0.04
                                    F1 (macro)  0.66 ± 0.02  0.63 ± 0.02  0.64 ± 0.03  0.61 ± 0.03
GPT-2 (Conditional Generation)      Precision   0.45 ± 0.08  0.35 ± 0.07  0.31 ± 0.08  0.15 ± 0.05
                                    Recall      0.33 ± 0.04  0.42 ± 0.05  0.46 ± 0.10  0.62 ± 0.09
                                    F1 (macro)  0.69 ± 0.02  0.69 ± 0.02  0.68 ± 0.02  0.62 ± 0.03

Table 6: Comparison of augmentation techniques for 20x augmentation on SEED/threat: means for precision, recall and macro-averaged F1-score shown with standard deviations (30 paired repetitions). Precision and recall for threat; F1-score macro-averaged from both classes. Bold figures represent techniques that are either best, or not significantly different (α = 5%) from this best technique. Double underlines indicate the best technique (for a given metric and classifier) significantly better (α = 1%) than all other techniques.
Word replacement was more effective with GLOVE and BPEMB than with PPDB or WORDNET. PPDB and WORDNET generally replace few words per document, which often resulted in similar performance to COPY. BPEMB was generally the most effective among these techniques.

GPT-2 had the best improvement overall, leading to significant increases in recall across all classifiers, and the highest F1-score on all but BERT. The increase in recall can be attributed to GPT-2's capacity for introducing novel phrases. We corroborated this hypothesis by measuring the overlap between the original and augmented test sets and an offensive/profane word list from von Ahn.⁸ GPT-2 augmentations increased the intersection cardinality by 260% from the original; compared to only 84% and 70% with the next-best performing augmentation techniques (ADD and BPEMB, respectively). This demonstrates that GPT-2 significantly increased the vocabulary range of the training set, specifically with offensive words likely to be relevant for toxic language classification. However, there is a risk that human annotators might not label GPT-2-generated documents as toxic. Such label noise may decrease precision. (See Appendix H, Table 22 for example augmentations that display the behavior of GPT-2 and other techniques.)

⁸ https://www.cs.cmu.edu/~biglou/resources/
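The overlap measurement itself amounts to a set intersection between a dataset's vocabulary and the profane-word list, roughly as sketched below (whitespace tokenization and the file name are placeholders, not the authors' exact procedure).

def profane_overlap(documents, wordlist_path="bad-words.txt"):
    with open(wordlist_path, encoding="utf-8") as f:
        profane = {line.strip().lower() for line in f if line.strip()}
    vocabulary = {token for doc in documents for token in doc.lower().split()}
    # Intersection cardinality: how many profane words the corpus covers.
    return len(vocabulary & profane)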
4.3 Mixed augmentations

In §4.2 we saw that the effect of augmentations differ across classifiers. A natural question is whether it is beneficial to combine augmentation techniques. For all classifiers except BERT, the best performing techniques were GPT-2, ADD, and BPEMB (Table 6). They also represent each of our augmentation types (§3.2), BPEMB having the highest performance among the four word replacement techniques (§3.2.1–§3.2.2) in these classifiers.

We combined the techniques by merging augmented documents in equal proportions. In ABG, we included documents generated by ADD, BPEMB or GPT-2. Since ADD and BPEMB impose significantly lower computational and memory requirements than GPT-2, and require no access to a GPU (Appendix C), we also evaluated combining only ADD and BPEMB (AB).

ABG outperformed all other techniques (in F1-score) on Char-LR and CNN with statistical significance, while being marginally better on Word-LR. On BERT, ABG achieved a better F1-score and precision than GPT-2 alone (p < 10⁻¹⁰), and a better recall (p < 0.05). ABG was better than AB in recall on Word-LR and CNN, while the precision was comparable.

Augmenting with ABG resulted in similar performance as GOLD STANDARD on Word-LR, Char-LR and CNN (Table 4). Comparing Tables 6 and 7, it is clear that much of the performance improvement came from the increased vocabulary coverage of GPT-2-generated documents. Our results suggest that in certain types of data like toxic language, consistent labeling may be more important than wide coverage in dataset collection, since automated data augmentation can increase the coverage of language. Furthermore, Char-LR trained with ABG was comparable (no statistically significant difference) to the best results obtained with BERT (trained with ADD, p > 0.2 on all metrics).

Table 7: Effects of mixed augmentation (20x) on SEED/threat (Annotations as in Table 6). Precision and recall for threat; F1-score macro-averaged from both classes.

4.4 Average classification performance

The results in Tables 6 and 7 focus on precision, recall and the F1-score of different models and augmentation techniques where the probability threshold for determining the positive or negative class is 0.5. In general the level of precision and recall are adapted based on the use case for the classifier. Another general evaluation of a classifier is based on the ROC-AUC metric, which is the area under the curve for a plot of true-positive rate versus the false-positive rate for a range of thresholds varying over [0, 1]. Table 8 shows the ROC-AUC scores for each of the classifiers for the best augmentation techniques from Tables 6 and 7.

BERT with ABG gave the best ROC-AUC value of 0.977 which is significantly higher than BERT with any other augmentation technique (p < 10⁻⁶). CNN exhibited a similar pattern: ABG resulted in the best ROC-AUC compared to the other augmentation techniques (p < 10⁻⁶). For Word-LR, ROC-AUC was highest for ABG, but the difference to GPT-2 was not statistically significant (p > 0.05). In the case of Char-LR, none of the augmentation techniques improved on SEED (p < 0.05). Char-LR produced a more consistent averaged performance across all augmentation methods with ROC-AUC values varying between (0.958, 0.973), compared to variations across all augmentation techniques of (0.792, 0.962) and (0.816, 0.977) for CNN and BERT respectively.
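The threshold-free evaluation can be computed directly with scikit-learn from the predicted minority-class probabilities, e.g. (toy values shown for illustration only):

from sklearn.metrics import roc_auc_score

# Toy example: true minority-class labels and predicted minority-class probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.65]
print(roc_auc_score(y_true, y_score))  # area under the TPR-vs-FPR curve over all thresholds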
Our results highlight a difference between the results in Tables 6 and 7: while COPY reached a high F1-score on BERT, our results on ROC-AUC highlight that such performance may not hold while
varying the decision threshold. We observe that a combined augmentation method such as ABG provides an increased ability to vary the decision threshold for the more complex classifiers such as CNN and BERT. Simpler models performed consistently across different augmentation techniques.

4.5 Computational requirements

BERT has significant computational requirements (Table 9). Deploying BERT on common EC2 instances requires 13 GB GPU memory. ABG on EC2 requires 4 GB GPU memory for approximately 100s (for 20x augmentation). All other techniques take only a few seconds on ordinary desktop computers (see Appendices C–D for additional data on computational requirements).

       ADD   BPEMB   GPT-2   ABG
CPU    -     100     3,600   3,600
GPU    -     -       3,600   3,600

       Char-LR   Word-LR   CNN   BERT
CPU    100       100       400   13,000
GPU    100       100       400   13,000

Table 9: Memory (MB) required for augmentation techniques and classifiers. Rounded to nearest 100 MB.

4.6 Alternative toxic class

In order to see whether our results described so far generalize beyond threat, we repeated our experiments using another toxic language class, identity-hate, as the minority class. Our results for identity-hate are in line with those for threat. All classifiers performed poorly on SEED due to very low recall. Augmentation with simple techniques helped BERT gain more than 20 percentage points for the F1-score. Shallow classifiers approached BERT-like performance with appropriate augmentation. We present further details in Appendix B.

5 Related work

Toxic language classification has been conducted in a number of studies (Schmidt and Wiegand, 2017; Davidson et al., 2017; Wulczyn et al., 2017; Gröndahl et al., 2018; Qian et al., 2019; Breitfeller et al., 2019). NLP applications of data augmentation include text classification (Ratner et al., 2017; Wei and Zou, 2019; Mesbah et al., 2019), user behavior categorization (Wang and Yang, 2015), dependency parsing (Vania et al., 2019), and machine translation (Fadaee et al., 2019; Xia et al., 2019). Related techniques are also used in automatic paraphrasing (Madnani and Dorr, 2010; Li et al., 2018) and writing style transfer (Shen et al., 2017; Shetty et al., 2018; Mahmood et al., 2019).

Hu et al. (2017) produced text with controlled target attributes via variational autoencoders. Mesbah et al. (2019) generated artificial sentences for adverse drug reactions using Reddit and Twitter data. Similarly to their work, we generated novel toxic sentences from a language model. Petroni et al. (2019) compared several pre-trained language models on their ability to understand factual and commonsense reasoning. BERT models consistently outperformed other language models. Petroni et al. suggest that large pre-trained language models may become alternatives to knowledge bases in the future.

6 Discussion and conclusions

Our results highlight the relationship between classification performance and computational overhead. Overall, BERT performed the best with data augmentation. However, it is highly resource-intensive (§4.5). ABG yielded almost BERT-level F1- and ROC-AUC scores on all classifiers. While using GPT-2 is more expensive than other augmentation techniques, it has significantly lower requirements than BERT. Additionally, augmentation is a one-time upfront cost in contrast to ongoing costs for classifiers. Thus, the trade-off between performance and computational resources can influence which technique is optimal in a given setting.

We identify the following further topics that we leave for future work.

SEED coverage. Our results show that data augmentation can increase coverage, leading to better toxic language classifiers when starting with very small seed datasets. The effects of data augmentation will likely differ with larger seed datasets.

Languages. Some augmentation techniques are limited in their applicability across languages. GPT-2, WORDNET, PPDB and GLOVE are available for certain other languages, but with less coverage than in English. BPEMB is nominally available in 275 languages, but has not been thoroughly tested on less prominent languages.

Transformers. BERT has inspired work on other pre-trained Transformer classifiers, leading to better classification performance (Liu et al., 2019;
Lewis et al., 2019) and better trade-offs between memory consumption and classification performance (Sanh et al., 2019; Jiao et al., 2019). Exploring the effects of augmentation on these Transformer classifiers is left for future work.

Attacks. Training classifiers with augmented data may influence their vulnerability for model extraction attacks (Tramèr et al., 2016; Krishna et al.), model evasion (Gröndahl et al., 2018), or backdoors (Schuster et al., 2020). We leave such considerations for future work.

Acknowledgments

We thank Jonathan Paul Fernandez Strahl, Mark van Heeswijk, and Kuan Eeik Tan for valuable discussions related to the project, and Karthik Ramesh for his help with early experiments. We also thank Prof. Yaoliang Yu for providing compute resources for early experiments. Tommi Gröndahl was funded by the Helsinki Doctoral Education Network in Information and Communications Technology (HICT).

References

Betty van Aken, Julian Risch, Ralf Krestel, and Alexander Löser. 2018. Challenges for toxic comment classification: An in-depth error analysis. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 33–42.

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760.

Peter J. Bickel and David A. Freedman. 1984. Asymptotic normality and the bootstrap in stratified sampling. The Annals of Statistics, 12(2):470–482.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly, Beijing.

Gwern Branwen. 2019. GPT-2 neural network poetry. https://www.gwern.net/GPT-2. Last accessed May 2020.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the 11th Conference on Web and Social Media, pages 512–515.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2019. Data augmentation for low resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 567–573.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 758–764.

Paul Glasserman and David D. Yao. 1992. Some guidelines and guarantees for common random numbers. Management Science, 38(6):884–908.

Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N. Asokan. 2018. All you need is "love": Evading hate speech detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security (AISec'11), pages 2–12.

Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 2989–2993.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, pages 1587–1596.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Jigsaw. 2018. Toxic comment classification challenge: Identify and classify toxic online comments. Available in https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge, accessed last time in May 2020.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Kalpesh Krishna, Gaurav Singh Tomar, Ankur Parikh, Nicolas Papernot, and Mohit Iyyer. Thieves of Sesame Street: Model extraction on BERT-based APIs. In Proceedings of the International Conference on Learning Representations (ICLR).

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of Neural Information Processing Systems (NIPS), pages 1097–1105.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3865–3878.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Nitin Madnani and Bonnie Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Journal of Computational Linguistics, 36(3):341–387.

Asad Mahmood, Faizan Ahmad, Zubair Shafiq, Padmini Srinivasan, and Fareed Zaffar. 2019. A girl has no name: Automated authorship obfuscation using Mutant-X. In Proceedings on Privacy Enhancing Technologies (PETS), pages 54–71.

Binny Mathew, Ritam Dutt, Pawan Goyal, and Animesh Mukherjee. 2019. Spread of hate speech in online social media. In Proceedings of the 10th ACM Conference on Web Science (WebSci '19), pages 173–182.

Sepideh Mesbah, Jie Yang, Robert-Jan Sips, Manuel Valle Torre, Christoph Lofi, Alessandro Bozzon, and Geert-Jan Houben. 2019. Training data augmentation for detecting adverse drug reactions in user-generated content. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2349–2359.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), pages 3111–3119.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):1–69.

Casey Newton. 2020. Facebook will pay $52 million in settlement with moderators who developed PTSD on the job. The Verge. https://www.theverge.com/2020/5/12/21255870/facebook-content-moderator-settlement-scola-ptsd-mental-health/. Last accessed May 2020.

Cheoneum Park, Juae Kim, Hyeon-gu Lee, Reinald Kim Amplayo, Harksoo Kim, Jungyun Seo, and Changki Lee. 2019. ThisIsCompetition at SemEval-2019 Task 9: BERT is unstable for out-of-domain samples. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 1254–1261.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pages 425–430.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.
Jing Qian, Anna Bethke, Yinyin Liu, Elizabeth Belding, and William Yang Wang. 2019. A benchmark dataset for learning to intervene in online hate speech. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4757–4766.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Alexander J. Ratner, Henry R. Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. 2017. Learning to compose domain-specific transformations for data augmentation. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017).

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10.

Roei Schuster, Tal Schuster, Yoav Meri, and Vitaly Shmatikov. 2020. Humpty Dumpty: Controlling word meanings via corpus poisoning. arXiv preprint arXiv:2001.04935.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Proceedings of Neural Information Processing Systems (NIPS).

Rakshith Shetty, Bernt Schiele, and Mario Fritz. 2018. A4NT: Author attribute anonymity by adversarial training of neural machine translation. In Proceedings of the 27th USENIX Security Symposium, pages 1633–1650.

Connor Shorten and Taghi M. Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data, 6.

Lichao Sun, Kazuma Hashimoto, Wenpeng Yin, Akari Asai, Jia Li, Philip Yu, and Caiming Xiong. 2020. Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT. arXiv preprint arXiv:2003.04985.

Liling Tan. 2014. Pywsd: Python implementations of word sense disambiguation (WSD) technologies [software]. https://github.com/alvations/pywsd.

Martin A. Tanner and Wing Hung Wong. 1987. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540.

Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. 2016. Stealing machine learning models via prediction APIs. In Proceedings of the 25th USENIX Security Symposium, pages 601–618.

Clara Vania, Yova Kementchedjhieva, Anders Sogaard, and Adam Lopez. 2019. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1105–1116.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), pages 5998–6008.

Nick Walton. AI Dungeon 2. https://aidungeon.io/. Last accessed May 2020.

William Yang Wang and Diyi Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2557–2563.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Sebastien C. Wong, Adam Gatt, Victor Stamatescu, and Mark D. McDonnell. 2016. Understanding data augmentation for classification: When to warp? In Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–6.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
A Class overlap and interpretation of "toxicity"

Kaggle's toxic comment classification challenge dataset⁹ contains six classes, one of which is called toxic. But all six classes represent examples of toxic speech: toxic, severe toxic, obscene, threat, insult, and identity-hate. Of the threat documents in the full training dataset (GOLD STANDARD), 449/478 overlap with toxic. For identity-hate, overlap with toxic is 1302/1405. Therefore, in this paper, we use the term toxic more generally, subsuming threat and identity-hate as particular types of toxic speech. To confirm that this was a reasonable choice, we manually examined the 29 threat datapoints not overlapping with toxic. All of these represent genuine threats, and are hence toxic in the general sense.

⁹ https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

B The "Identity hate" class

To see if our results generalize beyond threat, we experimented on the identity-hate class in Kaggle's toxic comment classification dataset. Again, we used a 5% stratified sample of GOLD STANDARD as SEED. We first show the number of samples in GOLD STANDARD, SEED and TEST in Table 10. There are approximately 3 times more minority-class samples in identity-hate than in threat. Next, we show classifier performance on GOLD STANDARD/identity-hate in Table 11. The results closely resemble those on GOLD STANDARD/threat in Table 4 (§4.1).

           GOLD STD.  SEED   TEST
Minority   1,405      75     712
Majority   158,166    7,910  63,266

Table 10: Corpus size for identity-hate (minority) and non-identity-hate (majority).

GOLD STANDARD
             Char   Word   CNN    BERT
Precision    0.64   0.54   0.70   0.55
Recall       0.40   0.31   0.20   0.62
F1 (macro)   0.74   0.69   0.65   0.79

Table 11: Classifier performance on GOLD STANDARD. Precision and recall for identity-hate; F1-score macro-averaged from both classes.

We compared SEED and COPY with the techniques that had the highest performance on threat: ADD, BPEMB, GPT-2, and their combination ABG. Table 12 shows the results.

Like in threat, BERT performed the poorest on SEED, with the lowest recall (0.06). All techniques decreased precision from SEED, and all increased recall except COPY with CNN. With COPY, the F1-score increased with Char-LR (0.12) and BERT (0.21), but not Word-LR (0.01) or CNN (−0.04). This is in line with corresponding results from threat (§4.2, Table 6): COPY did not help either of the word-based classifiers (Word-LR, CNN) but helped the character- and subword-based classifiers (Char-LR, BERT).

Of the individual augmentation techniques, ADD increased the F1-score the most with Char-LR (0.15) and BERT (0.20); and GPT-2 increased it the most with Word-LR (0.07) and CNN (0.07). Here again we see the similarity between the two word-based classifiers, and the two that take inputs below the word-level. Like in threat, COPY and ADD achieved close F1-scores with BERT, but with different relations between precision and recall. BPEMB was not the best technique with any classifier, but increased F1-score everywhere except in CNN, where precision dropped drastically.

In the combined ABG technique, Word-LR and CNN reached their highest F1-score increases (0.08 and 0.07, respectively). With Char-LR, the F1-score was also among the highest, but did not reach ADD. Like with threat, ABG increased precision and recall more than GPT-2 alone.

Overall, our results on identity-hate closely resemble those we received in threat, resulting in more than 20 percentage point increases in the F1-score for BERT on augmentations with COPY and ADD. Like in threat, the impact of most augmentations was greater on Char-LR than on Word-LR or CNN. Despite their similar F1-scores in SEED, Char-LR exhibited much higher precision, which decreased but remained generally higher than with other classifiers. Combined with an increase in recall to similar or higher levels than with other classifiers, Char-LR reached BERT-level performance with proper data augmentation.
Augmentation                        Metric      Char-LR      Word-LR      CNN          BERT
SEED (No Oversampling)              Precision   0.85 ± 0.04  0.59 ± 0.05  0.52 ± 0.08  0.65 ± 0.46
                                    Recall      0.11 ± 0.04  0.12 ± 0.03  0.11 ± 0.04  0.06 ± 0.10
                                    F1 (macro)  0.60 ± 0.03  0.60 ± 0.02  0.59 ± 0.02  0.54 ± 0.08
COPY (Simple Oversampling)          Precision   0.61 ± 0.02  0.54 ± 0.04  0.27 ± 0.06  0.52 ± 0.06
                                    Recall      0.34 ± 0.04  0.14 ± 0.03  0.07 ± 0.01  0.50 ± 0.06
                                    F1 (macro)  0.72 ± 0.02  0.61 ± 0.02  0.55 ± 0.01  0.75 ± 0.01
ADD (Add Majority-class Sentence)   Precision   0.54 ± 0.04  0.54 ± 0.05  0.43 ± 0.05  0.43 ± 0.05
                                    Recall      0.47 ± 0.05  0.21 ± 0.03  0.21 ± 0.04  0.58 ± 0.08
                                    F1 (macro)  0.75 ± 0.01  0.65 ± 0.01  0.64 ± 0.02  0.74 ± 0.01
BPEMB (Subword Substitutions)       Precision   0.43 ± 0.04  0.30 ± 0.03  0.15 ± 0.05  0.29 ± 0.06
                                    Recall      0.38 ± 0.04  0.29 ± 0.01  0.32 ± 0.05  0.23 ± 0.03
                                    F1 (macro)  0.70 ± 0.01  0.64 ± 0.01  0.59 ± 0.02  0.62 ± 0.02
GPT-2 (Conditional Generation)      Precision   0.41 ± 0.05  0.30 ± 0.03  0.33 ± 0.08  0.22 ± 0.05
                                    Recall      0.34 ± 0.04  0.39 ± 0.03  0.34 ± 0.09  0.59 ± 0.06
                                    F1 (macro)  0.68 ± 0.01  0.67 ± 0.01  0.66 ± 0.01  0.65 ± 0.02
ABG (ADD, BPEMB, GPT-2 Mix)         Precision   0.41 ± 0.04  0.32 ± 0.03  0.28 ± 0.06  0.27 ± 0.05
                                    Recall      0.50 ± 0.04  0.41 ± 0.02  0.46 ± 0.05  0.62 ± 0.07
                                    F1 (macro)  0.72 ± 0.01  0.68 ± 0.01  0.66 ± 0.02  0.68 ± 0.02

Table 12: Comparison of augmentation techniques for 20x augmentation on SEED/identity-hate: means for precision, recall and macro-averaged F1-score shown with standard deviations (10 repetitions). Precision and recall for identity-hate; F1-score macro-averaged from both classes.
Library                                             Version
eda_nlp (https://github.com/jasonwei20/eda_nlp)     Nov 8, 2019
apex                                                0.1
bpemb                                               0.3.0
fast-bert                                           1.6.5
gensim                                              3.8.1
nltk                                                3.4.5
numpy                                               1.17.2
pywsd                                               1.2.4
scikit-learn                                        0.21.3
scipy                                               1.4.1
spacy                                               2.2.4
torch                                               1.4.0
transformers                                        2.8.0

Table 14: Library versions required for replicating this study. Date supplied if no version applicable.

            Training
            Memory (MB)       Runtime (s)
            GPU     CPU       GPU    CPU
Char-LR     -       100       -      4
Word-LR     -       100       -      3
CNN         400     400       -      13
BERT        3800    1500      757    -

            Prediction
            Memory (MB)       Runtime (s)
            GPU     CPU       GPU    CPU
Char-LR     -       100       -      25
Word-LR     -       100       -      5
CNN         400     400       -      42
BERT        4600    4200      464    -

Table 15: Computational resources (MB and seconds) required for training classifiers on the SEED dataset and test dataset. Note that BERT results here were calculated with mixed precision arithmetic (currently supported by Nvidia Turing architecture). We measured memory usage close to 13 GB in the general case.
D Classifier training and testing performance

Table 15 specifies the system resources that training and prediction required on our setup (Appendix C). The SEED dataset has 8,955 documents and the test dataset 63,978 documents. We used the 12-layer, 768-hidden, 12-heads, 110M parameter BERT-Base, Uncased model.¹⁴

E Lemma inflection in WORDNET

Lemmas appear in uninflected form in WordNet. To mitigate this limitation, we used a dictionary-based method for mapping lemmas to surface manifestations with NLTK part-of-speech (POS) tags. For deriving the dictionary, we used 8.5 million short sentences (≤ 20 words) from seven corpora: Stanford NMT,¹⁵ OpenSubtitles 2018,¹⁶ Tatoeba,¹⁷ SNLI,¹⁸ SICK,¹⁹ Aristo-mini (December 2016 release),²⁰ and WordNet example sentences.²¹ The rationale for the corpus was to have a large vocabulary along with relatively simple grammatical structures, to maximize both coverage and the correctness of POS-tagging. We mapped each lemma-POS pair to its most common inflected form in the corpus. When performing synonym replacement in WORDNET augmentation, we lemmatized and POS-tagged the original word with NLTK, chose a random synonym for it, and then inflected the synonym with the original POS-tag if it was present in the inflection dictionary.

¹⁴ https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
¹⁵ https://nlp.stanford.edu/projects/nmt/
¹⁶ http://opus.nlpl.eu/OpenSubtitles2018.php
¹⁷ https://tatoeba.org
¹⁸ https://nlp.stanford.edu/projects/snli/
¹⁹ http://clic.cimec.unitn.it/composes/sick.html
²⁰ https://www.kaggle.com/allenai/aristo-mini-corpus
²¹ http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html
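The dictionary construction described in Appendix E can be sketched as follows, assuming NLTK with its tagger and WordNet lemmatizer are available; the sentence iterator is a placeholder and the crude POS mapping is our simplification, not the authors' exact procedure.

from collections import Counter, defaultdict
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

def build_inflection_dict(sentences):
    lemmatizer = WordNetLemmatizer()
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in pos_tag(word_tokenize(sentence)):
            # Crude Penn-to-WordNet POS mapping for the lemmatizer.
            wn_pos = {"V": "v", "N": "n", "J": "a", "R": "r"}.get(tag[:1], "n")
            lemma = lemmatizer.lemmatize(word.lower(), pos=wn_pos)
            counts[(lemma, tag)][word.lower()] += 1
    # Map each <lemma, tag> pair to its most frequent surface form.
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}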
F GPT-2 parameters

Table 16 shows the hyperparameters we used for fine-tuning our GPT-2 models, and for generating outputs. Our fine-tuning follows the transformers examples with default parameters.²²

²² https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py

For generation, we trimmed input to be at most 100 characters long, further cutting off the input at the last full word or punctuation to ensure
generated documents start with full words. Our generation script follows transformers examples.²³

²³ https://github.com/huggingface/transformers/blob/818463ee8eaf3a1cd5ddc2623789cbd7bb517d02/examples/run_generation.py

Fine-tuning
  Batch size            1
  Learning rate         2e-5
  Epochs                2
Generation
  Input cutoff          100 characters
  Temperature           1.0
  Top-p                 0.9
  Repetition penalty    1
  Output cutoff         100 subwords or EOS generated

Table 16: GPT-2 parameters.

In §4.2 – §4.4, we generated novel documents with GPT-2 fine-tuned on threat documents in SEED for 2 epochs. In Table 17, we show the impact of changing the number of fine-tuning epochs for GPT-2. Precision generally increased as the number of epochs was increased. However, recall simultaneously decreased.
G Ablation study

In §4.2 – §4.4, we investigated several word replacement techniques with a fixed change rate. In those experiments, we allowed 25% of possible replacements. Here we study each augmentation technique's sensitivity to the replacement rate. As done in previous experiments, we ensured that at least one augmentation is always performed. Experiments are shown in Tables 18–21.

Interestingly, all word replacements decreased classification performance with BERT. We suspect this occurred because of the pre-trained weights in BERT.

We show threat precision, recall and macro-averaged F1-scores for PPDB in Table 18. Changing the substitution rate had very little impact on the performance of any classifier. This indicates that there were very few n-gram candidates that could be replaced. We show results on WORDNET in Table 19. As exemplified for substitution rate 25% in Appendix H, PPDB and WORDNET substitutions replaced very few words. Both results were close to COPY (§4.2, Table 6).

We show results for GLOVE in Table 20. Word-LR performed better with higher substitution rates (increased recall). Interestingly, Char-LR performance (particularly precision) dropped with GLOVE compared to using COPY. For CNN, smaller substitution rates seem preferable, since precision decreased quickly as the number of substitutions increased.

BPEMB results in Table 21 are consistent across the classifiers Char-LR, Word-LR and CNN. Substitutions in the range 12%–37% increased recall over COPY. However, precision dropped at different points, depending on the classifier. CNN precision dropped earlier than on other classifiers, already at 25% change rate.

H Augmented threat examples

We provide examples of augmented documents in Table 22. We picked a one-sentence document as the seed. We remark that augmented documents created by GPT-2 have the highest novelty, but may not always be considered threat (see example GPT-2 #1 in Table 22).
                          Fine-tuning epochs on GPT-2
Classifier  Metric        1     2     3     4     5     6     7     8     9     10
Char-LR     Precision     0.38  0.43  0.45  0.49  0.51  0.49  0.52  0.50  0.51  0.51
            Recall        0.34  0.34  0.32  0.31  0.31  0.29  0.28  0.28  0.27  0.28
            F1 (macro)    0.68  0.69  0.68  0.68  0.69  0.68  0.68  0.68  0.68  0.68
Word-LR     Precision     0.30  0.33  0.34  0.34  0.36  0.35  0.35  0.34  0.34  0.34
            Recall        0.47  0.45  0.43  0.40  0.40  0.38  0.37  0.36  0.35  0.35
            F1 (macro)    0.68  0.69  0.69  0.68  0.68  0.68  0.67  0.67  0.67  0.67
CNN         Precision     0.26  0.28  0.30  0.32  0.33  0.32  0.31  0.31  0.31  0.32
            Recall        0.49  0.50  0.47  0.50  0.48  0.48  0.48  0.46  0.47  0.46
            F1 (macro)    0.66  0.67  0.68  0.69  0.69  0.68  0.68  0.68  0.68  0.68
BERT        Precision     0.11  0.14  0.15  0.15  0.16  0.17  0.17  0.19  0.17  0.17
            Recall        0.62  0.66  0.67  0.64  0.65  0.62  0.62  0.62  0.61  0.61
            F1 (macro)    0.59  0.61  0.62  0.62  0.62  0.63  0.63  0.64  0.63  0.62

Table 17: Impact of changing number of fine-tuning epochs on GPT-2-augmented datasets. Mean results for 10 repetitions. Highest numbers highlighted in bold.
Table 18: Impact of changing the proportion of substituted words on PPDB-augmented datasets. Mean results for 10 repetitions. Classifier's highest numbers highlighted in bold.

Table 19: Impact of changing the proportion of substituted words on WORDNET-augmented datasets. Mean results for 10 repetitions. Classifier's highest numbers highlighted in bold.
                      GLOVE: Word substitution rate
            Metric    0     12    25    37    50    100
Char-LR     Pre.      0.16  0.15  0.14  0.14  0.14  0.32
            Rec.      0.11  0.12  0.13  0.13  0.13  0.05
            F1 ma.    0.56  0.56  0.57  0.57  0.57  0.54
Word-LR     Pre.      0.31  0.37  0.35  0.33  0.33  0.30
            Rec.      0.07  0.10  0.16  0.19  0.19  0.09
            F1 ma.    0.55  0.58  0.61  0.62  0.62  0.57
CNN         Pre.      0.41  0.44  0.39  0.35  0.28  0.15
            Rec.      0.13  0.18  0.19  0.20  0.17  0.06
            F1 ma.    0.59  0.62  0.62  0.62  0.60  0.54
BERT        Pre.      0.44  0.43  0.40  0.36  0.33  0.13
            Rec.      0.35  0.27  0.16  0.13  0.11  0.03
            F1 ma.    0.69  0.66  0.61  0.59  0.58  0.52
# Document sample
SEED: No Oversampling
if you do not stop, the wikapidea nijas will come to your house and kill you
COPY: Simple Oversampling
1. if you do not stop, the wikapidea nijas will come to your house and kill you
2. if you do not stop, the wikapidea nijas will come to your house and kill you
3. if you do not stop, the wikapidea nijas will come to your house and kill you
EDA: Easy Data Augmentation 16
1. if you do put up not stop the wikapidea nijas will come to your house and kill you
2. if you do not stopover the wikapidea nijas will come to your house and kill you
3. if you do not break the wikapidea nijas will come to your house and kill you
ADD: Add Majority-class Sentence
1. We thank you both for your contributions to Wikipedia at-large and your use of the tool. if you
do not stop, the wikapidea nijas will come to your house and kill you
2. if you do not stop, the wikapidea nijas will come to your house and kill you Honest! ))
3. *** username, I am on a shared IP address. if you do not stop, the wikapidea nijas will come
to your house and kill you
PPDB Phrase Substitutions
1. if you do not be halted , the wikapidea nijas will come to your home and kill you
2. if you do not stop , the wikapidea nijas comes along to your house and been murdered you
3. if you do not stop , the wikapidea nijas will arrive to your home and kill you
WORDNET Word Substitutions
1. if you do not stop , the wikapidea nijas will amount to your house and kill you
2. if you do not stop , the wikapidea nijas will come to your house and down you
3. if you do not stop , the wikapidea nijas will come to your theatre and kill you
GLOVE Word Substitutions
1. if you so not stop , the wikapidea nijas will come to yours house have kill you
2. mean know do not stop ? the wikapidea nijas will cos to your house and kill you
3. if you do not thinking , and wikapidea nijas will come to your going and kill you
BPEMB Subword Substitutions
1. whether you know not stop, theldikapideina niji give come to your house and kill you
2. shouldnever do not stop, the wikamyla nijag will come to your house and kill want
3. if you let not train, the wkanapidei nijas will come tonever house and dead you
GPT-2 Conditional Generation
1. for what you have done, you don’t have a choice. You are too big and weak to stop.
2. ! So if you don’t stop this, you will be murdered and your people will be ripped apart. I hope
you die and die and die. No, you will die and die and die. Just stop and kill yourself for it.
Advertisements
3. . You will be hanged like a rat. The next day I will get a letter telling you that you were
kicked out of your home by the wikapidea nijas and that you will die there. You are an idiot
for calling the wikapidea nijas to death and you should die.
ABG: Mixed Augmentation (ADD, BPEMB & GPT-2)
1. Support - The other article is minor, and not worth a separate entry. if you do not stop, the
wikapidea nijas will come to your house and kill you
2. . You will be hanged like a rat. The next day I will get a letter telling you that you were
kicked out of your home by the wikapidea nijas and that you will die there. You are an idiot
for calling the wikapidea nijas to death and you should die.
3. if you let not train, the wkanapidei nijas will come tonever house and dead you
Table 22: Documents generated by selected augmentation techniques in this study. Changes to original seed highlighted. The selected sample is shorter than average (see §3.1, Table 1). We anonymized the username in ADD (#3). Three samples generated by each technique shown.
¹⁶ https://github.com/jasonwei20/eda_nlp