A Little Goes a Long Way: Improving Toxic Language Classification despite Data Scarcity
Mika Juuti¹, Tommi Gröndahl², Adrian Flanagan³, N. Asokan¹,²
¹University of Waterloo
²Aalto University
³Huawei Technologies Oy (Finland) Co Ltd
[email protected], [email protected]
[email protected], [email protected]
Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2991–3009, November 16–20, 2020. © 2020 Association for Computational Linguistics
2 Preliminaries

Data augmentation arises naturally from the problem of filling in missing values (Tanner and Wong, 1987). In classification, data augmentation is applied to available training data. Classifier performance is measured on a separate (non-augmented) test set (Krizhevsky et al., 2012). Data augmentation can decrease overfitting (Wong et al., 2016; Shorten and Khoshgoftaar, 2019), and broaden the input feature range by increasing the vocabulary (Fadaee et al., 2019).

Simple oversampling is the most basic augmentation technique: copying minority class datapoints to appear multiple times. This increases the relevance of minority class features for computing the loss during training (Chawla et al., 2002).

EDA is a prior technique combining four text transformations to improve classification with CNN and RNN architectures (Wei and Zou, 2019). It uses (i) synonym replacement from WordNet (§3.2.1), (ii) random insertion of a synonym, (iii) random swap of two words, and (iv) random word deletion.

Word replacement has been applied in several data augmentation studies (Zhang et al., 2015; Wang and Yang, 2015; Xie et al., 2017; Wei and Zou, 2019; Fadaee et al., 2019). We compared four techniques, two based on semantic knowledge bases (§3.2.1) and two on pre-trained (sub)word embeddings (§3.2.2).

Pre-trained Transformer networks feature prominently in state-of-the-art NLP research. They are able to learn contextual embeddings, which depend on neighboring subwords (Devlin et al., 2019). Fine-tuning – adapting the weights of a pre-trained Transformer to a specific corpus – has been highly effective in improving classification performance (Devlin et al., 2019) and language modeling (Radford et al., 2019; Walton; Branwen, 2019). State-of-the-art networks are trained on large corpora: GPT-2's corpus contains 8M web pages, while BERT's training corpus contains 3.3B words.

… different classes of toxic language. The median length of a document is three sentences, but the distribution is heavy-tailed (Table 1).

Mean  Std.  Min  Max  25%  50%  75%
4     6     1    683  2    3    5

Table 1: Document lengths (number of sentences; tokenized with NLTK sent_tokenize (Bird et al., 2009)).

Some classes are severely under-represented: e.g., 478 examples of threat vs. 159093 non-threat examples. Our experiments concern binary classification, where one class is the minority class and all remaining documents belong to the majority class. We focus on threat as the minority class, as it poses the most challenge for automated analysis in this dataset (van Aken et al., 2018). To confirm our results, we also applied the best-performing techniques on a different type of toxic language, the identity-hate class (§4.6).

Our goal is to understand how data augmentation improves performance under extreme data scarcity in the minority class (threat). To simulate this, we derive our seed dataset (SEED) from the full data set (GOLD STANDARD) via stratified bootstrap sampling (Bickel and Freedman, 1984) to reduce the dataset size k-fold. We replaced newlines, tabs and repeated spaces with single spaces, and lower-cased each dataset. We applied data augmentation techniques on SEED with k-fold oversampling of the minority class, and compared each classifier architecture (§3.3) trained on SEED, GOLD STANDARD, and the augmented datasets. We used the original test dataset (TEST) for evaluating performance. We detail the dataset sizes in Table 2.

           GOLD STD.  SEED  TEST
Minority   478        25    211
Majority   159,093    7955  63,767

Table 2: Number of documents (minority: threat).
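The down-sampling step can be illustrated with a minimal sketch. It assumes the Kaggle data is loaded into a pandas DataFrame with a text column comment_text and a binary label column is_threat (column names, and the 5% fraction mentioned in Appendix B, are illustrative; they are not the authors' exact code).

import re
import pandas as pd

def preprocess(text: str) -> str:
    # Replace newlines, tabs and repeated spaces with single spaces; lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()

def make_seed(gold: pd.DataFrame, frac: float = 0.05, random_state: int = 0) -> pd.DataFrame:
    # Stratified down-sampling: draw the same fraction from each class separately,
    # so the minority/majority ratio of GOLD STANDARD is preserved in SEED.
    gold = gold.assign(comment_text=gold["comment_text"].map(preprocess))
    return (gold.groupby("is_threat", group_keys=False)
                .apply(lambda g: g.sample(frac=frac, random_state=random_state)))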
…mentation technique (below) and classifier (§3.3). For comparison, we also evaluated simple oversampling (COPY) and EDA (Wei and Zou, 2019), both reviewed in §2. Following the recommendation of Wei and Zou (2019) for applying EDA to small seed datasets, we used 5% augmentation probability, whereby each word has a 1 − 0.95⁴ ≈ 19% probability of being transformed by at least one of the four EDA techniques.

Four of the six techniques are based on replacing words with semantically close counterparts; two using semantic knowledge bases (§3.2.1) and two pre-trained embeddings (§3.2.2). We applied 25% of all possible replacements with these techniques, which is close to the recommended substitution rate in EDA. For short documents we ensured that at least one substitution is always selected. We also added majority class material to minority class documents (§3.2.3), and generated text with the GPT-2 language model fine-tuned on SEED (§3.2.4).

3.2.1 Substitutions from a knowledge base

WordNet is a semantic knowledge base containing various properties of word senses, which correspond to word meanings (Miller, 1995). We augmented SEED by replacing words with random synonyms. While EDA also uses WordNet synonyms (§2), we additionally applied word sense disambiguation (Navigli, 2009) and inflection.

For word sense disambiguation we used simple Lesk from PyWSD (Tan, 2014). As a variant of the Lesk algorithm (Lesk, 1986) it relies on overlap in definitions and example sentences (both provided in WordNet), compared between each candidate sense and words in the context.

Word senses appear as uninflected lemmas, which we inflected using a dictionary-based technique. We lemmatized and annotated a large corpus with NLTK (Bird et al., 2009), and mapped each <lemma, tag> combination to its most common surface form. The corpus contains 8.5 million short sentences (≤ 20 words) from multiple open-source corpora (see Appendix E). We designed it to have both a large vocabulary for wide coverage (371125 lemmas), and grammatically simple sentences to maximize correct tagging.
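A minimal sketch of this substitution step is given below, assuming PyWSD and NLTK (with the WordNet and tagger data) are installed. INFLECT stands in for the dictionary described in Appendix E, and the helper name is ours, not the authors' code.

import random
from nltk import pos_tag, word_tokenize
from pywsd.lesk import simple_lesk

# Hypothetical inflection dictionary: (lemma, Penn POS tag) -> most common surface form.
INFLECT = {("kill", "VBD"): "killed"}

def wordnet_substitute(sentence: str, target: str) -> str:
    tokens = word_tokenize(sentence)
    tags = dict(pos_tag(tokens))
    # Disambiguate the target word in context with simple Lesk.
    sense = simple_lesk(sentence, target)
    if sense is None:
        return sentence
    synonyms = [l.replace("_", " ") for l in sense.lemma_names()
                if l.lower() != target.lower()]
    if not synonyms:
        return sentence
    new = random.choice(synonyms)
    # Re-inflect the chosen lemma with the original word's POS tag, if available.
    new = INFLECT.get((new, tags.get(target, "")), new)
    return " ".join(new if tok == target else tok for tok in tokens)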
Paraphrase Database (PPDB) was collected from bilingual parallel corpora on the premise that English phrases translated identically to another language tend to be paraphrases (Ganitkevitch et al., 2013; Pavlick et al., 2015). We used phrase pairs tagged as equivalent, constituting 245691 paraphrases altogether. We controlled substitution by grammatical context as specified in PPDB. In single words this is the part-of-speech tag; whereas in multi-word paraphrases it also contains the syntactic category that appears after the original phrase in the PPDB training corpus. We obtained grammatical information with the spaCy parser.³

³ https://spacy.io/

3.2.2 Embedding neighbour substitutions

Embeddings can be used to map units to others with a similar occurrence distribution in a training corpus (Mikolov et al., 2013). We considered two alternative pre-trained embedding models. For each model, we produced top-10 nearest embedding neighbours (cosine similarity) of each word selected for replacement, and randomly picked the new word from these.

Twitter word embeddings (GLOVE) (Pennington et al., 2014) were obtained from a Twitter corpus,⁴ and we deployed these via Gensim (Řehůřek and Sojka, 2010).

⁴ We use 25-dimensional GloVe-embeddings from: https://nlp.stanford.edu/projects/glove/

Subword embeddings (BPEMB) have emerged as a practical pre-processing tool for overcoming the challenge of low-prevalence words (Sennrich et al., 2016). They have been applied in Transformer algorithms, including WordPiece (Wu et al., 2016) for BERT (Devlin et al., 2019), and BPE (Sennrich et al., 2016) for GPT-2 (Radford et al., 2019). BPEMB (Heinzerling and Strube, 2018) provides pre-trained GloVe embeddings, constructed by applying SentencePiece (Kudo and Richardson, 2018) on the English Wikipedia. We use 50-dimensional BPEMB-embeddings with vocabulary size 10,000.
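The neighbour lookup itself is a short operation; the sketch below loads the 25-dimensional Twitter GloVe vectors through gensim's downloader (we assume the gensim-data model name glove-twitter-25) and picks a replacement uniformly from the top-10 cosine neighbours. BPEmb subword vectors could be plugged in the same way through their gensim KeyedVectors interface.

import random
import gensim.downloader as api

# Pre-trained 25-dimensional Twitter GloVe vectors (assumed gensim-data name).
glove = api.load("glove-twitter-25")

def embedding_substitute(word: str, topn: int = 10) -> str:
    if word not in glove:
        return word  # out-of-vocabulary words are left unchanged
    neighbours = [w for w, _ in glove.most_similar(word, topn=topn)]
    return random.choice(neighbours)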
3.2.3 Majority class sentence addition (ADD)

Adding unrelated material to the training data can be beneficial by making relevant features stand out (Wong et al., 2016; Shorten and Khoshgoftaar, 2019). We added a random sentence from a majority class document in SEED to a random position in a copy of each minority class training document.
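A sketch of ADD, assuming documents are plain strings and using NLTK's sentence tokenizer (the function name is ours):

import random
from nltk import sent_tokenize

def add_majority_sentence(minority_doc: str, majority_docs: list) -> str:
    # Pick one sentence from a random majority-class document ...
    donor = random.choice(majority_docs)
    sentence = random.choice(sent_tokenize(donor))
    # ... and splice it into a copy of the minority-class document at a random position.
    sentences = sent_tokenize(minority_doc)
    position = random.randint(0, len(sentences))
    sentences.insert(position, sentence)
    return " ".join(sentences)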
3.2.4 GPT-2 conditional generation

GPT-2 is a Transformer language model pre-trained on a large collection of Web documents. We used the 110M parameter GPT-2 model from the Transformers library (Wolf et al., 2019). We discuss parameters in Appendix F. We augmented as follows (N-fold oversampling):

1. Ĝ ← briefly train GPT-2 on minority class documents in SEED.
2. Generate N − 1 novel documents x̂ ← Ĝ(x) for all minority class samples x in SEED (see the sketch after this list).
3. Assign the minority class label to all documents x̂.
4. Merge x̂ with SEED.
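Step 2 can be sketched with the Transformers library as below. Ĝ is assumed to be a GPT-2 model already fine-tuned on the SEED threat documents; the prompt cutoff, temperature and top-p values follow Appendix F, while the model path and helper name are placeholders, not the authors' script.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("path/to/finetuned-gpt2")  # Ĝ (placeholder path)
model.eval()

def generate_variant(seed_doc: str, max_new_subwords: int = 100) -> str:
    # Condition on (at most) the first 100 characters, cut at the last full word.
    prompt = seed_doc[:100].rsplit(" ", 1)[0]
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            input_ids,
            do_sample=True,          # nucleus sampling
            top_p=0.9,
            temperature=1.0,
            max_length=input_ids.shape[1] + max_new_subwords,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)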
Augmentation  Type                   Unit      #Parameters  Pre-training Corpus
ADD           Non-toxic corpus       Sentence  NA           NA
PPDB          Knowledge Base         N-gram    NA           NA
WORDNET       Knowledge Base         Word      NA           NA
GLOVE         GloVe                  Word      30M          Twitter
BPEMB         GloVe                  Subword   0.5M         Wikipedia
GPT-2         Transformer            Subword   117M         WebText

Classifier    Model Type             Unit       #Parameters  Pre-training Corpus
Char-LR       Logistic regression    Character  30K          -
Word-LR       Logistic regression    Word       30K          -
CNN           Convolutional network  Word       3M           -
BERT          Transformer            Subword    110M         Wikipedia & BookCorpus

3.3 Classifiers

Char-LR and Word-LR. We adapted the logistic regression pipeline from the Wiki-detox project (Wulczyn et al., 2017).⁵ We allowed n-grams in the range 1–4, and kept the default parameters: TF-IDF normalization, vocabulary size at 10,000 and parameter C = 10 (inverse regularization strength).

⁵ https://github.com/ewulczyn/wiki-detox/blob/master/src/modeling/get_prod_models.py

CNN. We applied a word-based CNN model with 10 kernels of sizes 3, 4 and 5. Vocabulary size was 10,000 and embedding dimensionality 300. For training, we used the dropout probability of 0.1, and the Adam optimizer (Kingma and Ba, 2014) with the learning rate of 0.001.

BERT. We used the pre-trained Uncased BERT-Base and trained the model with the training script from Fast-Bert.⁶ We set maximum sequence length to 128 and mixed precision optimization level to O1.

⁶ https://github.com/kaushaltrivedi/fast-bert/blob/master/sample_notebooks/new-toxic-multilabel.ipynb
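A minimal scikit-learn reconstruction of the two logistic-regression baselines is shown below. The original pipeline comes from the Wiki-detox script linked above; the exact vectorizer settings there may differ, so this is only a sketch with the parameters stated in the paragraph on Char-LR and Word-LR.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def make_lr(analyzer: str):
    # analyzer="char" gives Char-LR, analyzer="word" gives Word-LR.
    return make_pipeline(
        TfidfVectorizer(analyzer=analyzer, ngram_range=(1, 4), max_features=10_000),
        LogisticRegression(C=10),
    )

char_lr = make_lr("char")
word_lr = make_lr("word")
# Usage: char_lr.fit(train_texts, train_labels); char_lr.predict_proba(test_texts)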
4 Results

We compared precision and recall for the minority class (threat), and the macro-averaged F1-score for each classifier and augmentation technique. (For brevity, we use "F1-score" from now on.) The majority class F1-score remained 1.00 (two digit rounding) across all our experiments. All classifiers are binary, and we assigned predictions to the class with the highest conditional probability. We relax this assumption in §4.4, to report area under the curve (AUC) values (Murphy, 2012).

To validate our results, we performed repeated experiments with the common random numbers technique (Glasserman and Yao, 1992), by which we controlled the sampling of SEED, initial random weights of classifiers, and the optimization procedure. We repeated the experiments 30 times, and report confidence intervals.

4.1 Results without augmentation

We first show classifier performance on GOLD STANDARD and SEED in Table 4. van Aken et al. (2018) reported F1-scores for logistic regression and CNN classifiers on GOLD STANDARD. Our results are comparable. We also evaluate BERT, which is noticeably better on GOLD STANDARD, particularly in terms of threat recall.

All classifiers had significantly reduced F1-scores on SEED, due to major drops in threat recall. In particular, BERT was degenerate, assigning all documents to the majority class in all 30 repetitions. Devlin et al. (2019) report that such behavior may occur on small datasets, but random restarts may help. In our case, random restarts did not impact BERT performance on SEED.

GOLD STANDARD
            Char-LR  Word-LR  CNN   BERT
Precision   0.61     0.43     0.60  0.54
Recall      0.34     0.36     0.33  0.54
F1          0.72     0.69     0.71  0.77

SEED
            Char-LR  Word-LR  CNN   BERT
Precision   0.64     0.47     0.41  0.00
Recall      0.03     0.04     0.09  0.00
F1          0.52     0.53     0.57  0.50

Table 4: Classifier performance on GOLD STANDARD and SEED. Precision and recall for threat; F1-score macro-averaged from both classes.

4.2 Augmentations

We applied all eight augmentation techniques (§3.2) to the minority class of SEED (threat). Each
technique retains one copy of each SEED document, and adds 19 synthetically generated documents per SEED document. Table 5 summarizes augmented dataset sizes. We present our main results in Table 6. We first discuss classifier-specific observations, and then make general observations on each augmentation technique.

           SEED   Augmented
Minority   25     25→500
Majority   7955   7955

Table 5: Number of documents in augmented datasets. We retained original SEED documents and expanded the dataset with additional synthetic documents (minority: threat).

We compared the impact of augmentations on each classifier, and therefore our performance comparisons below are local to each column (i.e., classifier). We identify the best performing technique for the three metrics and report the p-value when its effect is significantly better than the other techniques (based on one-sided paired t-tests, α = 5%).⁷

⁷ The statistical significance results apply to this dataset, but are indicative of the behavior of the techniques in general.
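The significance test can be reproduced roughly as follows: a one-sided paired t-test over the paired repetitions. This is only a sketch; scipy's ttest_rel is two-sided, so we halve the p-value after checking the sign of the statistic, and the function name is ours.

from scipy.stats import ttest_rel

def one_sided_paired_t(best_scores, other_scores):
    # H1: the best technique's mean score is greater than the other technique's.
    t, p_two_sided = ttest_rel(best_scores, other_scores)
    return p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2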
BERT. COPY and ADD were successful on BERT, raising the F1-score up to 21 percentage points above SEED to 0.71. But their impacts on BERT were different: ADD led to increased recall, while COPY resulted in increased precision. PPDB precision and recall were statistically indistinguishable from COPY, which indicates that it did few alterations. GPT-2 led to significantly better recall (p < 10⁻⁵ for all pairings), even surpassing GOLD STANDARD. Word substitution methods like EDA, WORDNET, GLOVE, and BPEMB improved on SEED, but were less effective than COPY in both precision and recall. Park et al. (2019) found that BERT may perform poorly on out-of-domain samples. BERT is reportedly unstable on adversarially chosen subword substitutions (Sun et al., 2020). We suggest that non-contextual word embedding schemes may be sub-optimal for BERT since its pre-training is not conducted with similarly noisy documents. We verified that reducing the number of replaced words was indeed beneficial for BERT (Appendix G).

Char-LR. BPEMB and ADD were effective at increasing recall, and reached similar increases in F1-score. GPT-2 raised recall to GOLD STANDARD level (p < 10⁻⁵ for all pairings), but precision remained 16 percentage points below GOLD STANDARD. It led to the best increase in F1-score: 16 percentage points above SEED (p < 10⁻³ for all pairings).

Word-LR. Embedding-based BPEMB and GLOVE increased recall by at least 13 percentage points, but the conceptually similar PPDB and WORDNET were largely unsuccessful. We suggest this discrepancy may be due to WORDNET and PPDB relying on written standard English, whereas toxic language tends to be more colloquial. GPT-2 increased recall and F1-score the most: 15 percentage points above SEED (p < 10⁻¹⁰ for all pairings).

CNN. GLOVE and ADD increased recall by at least 10 percentage points. BPEMB led to a large increase in recall, but with a drop in precision, possibly due to its larger capacity to make changes in text – GLOVE can only replace entire words that exist in the pre-training corpus. GPT-2 yielded the largest increases in recall and F1-score (p < 10⁻⁴ for all pairings).

We now discuss each augmentation technique.

COPY emphasizes the features of original minority documents in SEED, which generally resulted in fairly high precision. On Word-LR, COPY is analogous to increasing the weight of words that appear in minority documents.

EDA behaved similarly to COPY on Char-LR, Word-LR and CNN; but markedly worse on BERT.

ADD reduces the classifier's sensitivity to irrelevant material by adding majority class sentences to minority class documents. On Word-LR, ADD is analogous to reducing the weights of majority class words. ADD led to a marginally better F1-score than any other technique on BERT.
Augmentation                        Metric      Char-LR      Word-LR      CNN          BERT
SEED (No Oversampling)              Precision   0.68 ± 0.22  0.43 ± 0.27  0.45 ± 0.14  0.00 ± 0.00
                                    Recall      0.03 ± 0.02  0.04 ± 0.02  0.08 ± 0.05  0.00 ± 0.00
                                    F1 (macro)  0.53 ± 0.02  0.54 ± 0.02  0.56 ± 0.03  0.50 ± 0.00
COPY (Simple Oversampling)          Precision   0.67 ± 0.07  0.38 ± 0.24  0.40 ± 0.08  0.49 ± 0.07
                                    Recall      0.16 ± 0.03  0.03 ± 0.02  0.07 ± 0.03  0.36 ± 0.09
                                    F1 (macro)  0.63 ± 0.02  0.53 ± 0.02  0.56 ± 0.02  0.70 ± 0.03
EDA (Wei and Zou, 2019)             Precision   0.66 ± 0.06  0.36 ± 0.19  0.26 ± 0.09  0.21 ± 0.03
                                    Recall      0.13 ± 0.03  0.08 ± 0.04  0.07 ± 0.01  0.06 ± 0.01
                                    F1 (macro)  0.61 ± 0.02  0.56 ± 0.03  0.55 ± 0.01  0.54 ± 0.01
ADD (Add Majority-class Sentence)   Precision   0.58 ± 0.07  0.36 ± 0.21  0.45 ± 0.07  0.36 ± 0.04
                                    Recall      0.24 ± 0.04  0.06 ± 0.04  0.19 ± 0.07  0.52 ± 0.07
                                    F1 (macro)  0.67 ± 0.03  0.55 ± 0.03  0.63 ± 0.04  0.71 ± 0.01
PPDB (Phrase Substitutions)         Precision   0.16 ± 0.08  0.41 ± 0.27  0.37 ± 0.09  0.48 ± 0.06
                                    Recall      0.10 ± 0.03  0.04 ± 0.02  0.08 ± 0.04  0.34 ± 0.08
                                    F1 (macro)  0.56 ± 0.02  0.53 ± 0.02  0.57 ± 0.02  0.70 ± 0.03
WORDNET (Word Substitutions)        Precision   0.16 ± 0.06  0.36 ± 0.24  0.41 ± 0.08  0.47 ± 0.08
                                    Recall      0.11 ± 0.03  0.05 ± 0.03  0.11 ± 0.05  0.29 ± 0.07
                                    F1 (macro)  0.56 ± 0.02  0.54 ± 0.02  0.58 ± 0.03  0.68 ± 0.03
GLOVE (Word Substitutions)          Precision   0.15 ± 0.04  0.39 ± 0.12  0.38 ± 0.08  0.43 ± 0.11
                                    Recall      0.14 ± 0.03  0.16 ± 0.05  0.18 ± 0.06  0.18 ± 0.06
                                    F1 (macro)  0.57 ± 0.02  0.61 ± 0.03  0.62 ± 0.03  0.62 ± 0.03
BPEMB (Subword Substitutions)       Precision   0.56 ± 0.07  0.33 ± 0.07  0.25 ± 0.07  0.38 ± 0.12
                                    Recall      0.22 ± 0.03  0.22 ± 0.04  0.37 ± 0.08  0.16 ± 0.04
                                    F1 (macro)  0.66 ± 0.02  0.63 ± 0.02  0.64 ± 0.03  0.61 ± 0.03
GPT-2 (Conditional Generation)      Precision   0.45 ± 0.08  0.35 ± 0.07  0.31 ± 0.08  0.15 ± 0.05
                                    Recall      0.33 ± 0.04  0.42 ± 0.05  0.46 ± 0.10  0.62 ± 0.09
                                    F1 (macro)  0.69 ± 0.02  0.69 ± 0.02  0.68 ± 0.02  0.62 ± 0.03

Table 6: Comparison of augmentation techniques for 20x augmentation on SEED/threat: means for precision, recall and macro-averaged F1-score shown with standard deviations (30 paired repetitions). Precision and recall for threat; F1-score macro-averaged from both classes. Bold figures represent techniques that are either best, or not significantly different (α = 5%) from this best technique. Double underlines indicate the best technique (for a given metric and classifier) significantly better (α = 1%) than all other techniques.
Word replacement was more effective with GLOVE and BPEMB than with PPDB or WORDNET. PPDB and WORDNET generally replace few words per document, which often resulted in similar performance to COPY. BPEMB was generally the most effective among these techniques.

GPT-2 had the best improvement overall, leading to significant increases in recall across all classifiers, and the highest F1-score on all but BERT. The increase in recall can be attributed to GPT-2's capacity for introducing novel phrases. We corroborated this hypothesis by measuring the overlap between the original and augmented test sets and an offensive/profane word list from von Ahn.⁸ GPT-2 augmentations increased the intersection cardinality by 260% from the original; compared to only 84% and 70% with the next-best performing augmentation techniques (ADD and BPEMB, respectively). This demonstrates that GPT-2 significantly increased the vocabulary range of the training set, specifically with offensive words likely to be relevant for toxic language classification. However, there is a risk that human annotators might not label GPT-2-generated documents as toxic. Such label noise may decrease precision. (See Appendix H, Table 22 for example augmentations that display the behavior of GPT-2 and other techniques.)

⁸ https://www.cs.cmu.edu/~biglou/resources/
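The overlap measurement itself amounts to a set intersection between a dataset's vocabulary and the profane-word list, roughly as sketched below (whitespace tokenization and the file name are placeholders, not the authors' exact procedure).

def profane_overlap(documents, wordlist_path="bad-words.txt"):
    with open(wordlist_path, encoding="utf-8") as f:
        profane = {line.strip().lower() for line in f if line.strip()}
    vocabulary = {token for doc in documents for token in doc.lower().split()}
    # Intersection cardinality: how many profane words the corpus covers.
    return len(vocabulary & profane)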
4.3 Mixed augmentations

In §4.2 we saw that the effect of augmentations differ across classifiers. A natural question is whether it is beneficial to combine augmentation techniques. For all classifiers except BERT, the best performing techniques were GPT-2, ADD, and BPEMB (Table 6). They also represent each of our augmentation types (§3.2), BPEMB having the highest performance among the four word replacement techniques (§3.2.1–§3.2.2) in these classifiers.

We combined the techniques by merging augmented documents in equal proportions. In ABG, we included documents generated by ADD, BPEMB or GPT-2. Since ADD and BPEMB impose significantly lower computational and memory requirements than GPT-2, and require no access to a GPU (Appendix C), we also evaluated combining only ADD and BPEMB (AB).

ABG outperformed all other techniques (in F1-score) on Char-LR and CNN with statistical significance, while being marginally better on Word-LR. On BERT, ABG achieved a better F1-score and precision than GPT-2 alone (p < 10⁻¹⁰), and a better recall (p < 0.05). ABG was better than AB in recall on Word-LR and CNN, while the precision was comparable.

Augmenting with ABG resulted in similar performance as GOLD STANDARD on Word-LR, Char-LR and CNN (Table 4). Comparing Tables 6 and 7, it is clear that much of the performance improvement came from the increased vocabulary coverage of GPT-2-generated documents. Our results suggest that in certain types of data like toxic language, consistent labeling may be more important than wide coverage in dataset collection, since automated data augmentation can increase the coverage of language. Furthermore, Char-LR trained with ABG was comparable (no statistically significant difference) to the best results obtained with BERT (trained with ADD, p > 0.2 on all metrics).

Table 7: Effects of mixed augmentation (20x) on SEED/threat (Annotations as in Table 6). Precision and recall for threat; F1-score macro-averaged from both classes.

4.4 Average classification performance

The results in Tables 6 and 7 focus on precision, recall and the F1-score of different models and augmentation techniques where the probability threshold for determining the positive or negative class is 0.5. In general the level of precision and recall are adapted based on the use case for the classifier. Another general evaluation of a classifier is based on the ROC-AUC metric, which is the area under the curve for a plot of true-positive rate versus the false-positive rate for a range of thresholds varying over [0, 1]. Table 8 shows the ROC-AUC scores for each of the classifiers for the best augmentation techniques from Tables 6 and 7.

BERT with ABG gave the best ROC-AUC value of 0.977 which is significantly higher than BERT with any other augmentation technique (p < 10⁻⁶). CNN exhibited a similar pattern: ABG resulted in the best ROC-AUC compared to the other augmentation techniques (p < 10⁻⁶). For Word-LR, ROC-AUC was highest for ABG, but the difference to GPT-2 was not statistically significant (p > 0.05). In the case of Char-LR, none of the augmentation techniques improved on SEED (p < 0.05). Char-LR produced a more consistent averaged performance across all augmentation methods with ROC-AUC values varying between (0.958, 0.973), compared to variations across all augmentation techniques of (0.792, 0.962) and (0.816, 0.977) for CNN and BERT respectively.
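The threshold-free evaluation can be computed directly with scikit-learn from the predicted minority-class probabilities, e.g. (toy values shown for illustration only):

from sklearn.metrics import roc_auc_score

# Toy example: true minority-class labels and predicted minority-class probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.65]
print(roc_auc_score(y_true, y_score))  # area under the TPR-vs-FPR curve over all thresholds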
Our results highlight a difference between the results in Tables 6 and 7: while COPY reached a high F1-score on BERT, our results on ROC-AUC highlight that such performance may not hold while
varying the decision threshold. We observe that a combined augmentation method such as ABG provides an increased ability to vary the decision threshold for the more complex classifiers such as CNN and BERT. Simpler models performed consistently across different augmentation techniques.

4.5 Computational requirements

BERT has significant computational requirements (Table 9). Deploying BERT on common EC2 instances requires 13 GB GPU memory. ABG on EC2 requires 4 GB GPU memory for approximately 100s (for 20x augmentation). All other techniques take only a few seconds on ordinary desktop computers (see Appendices C–D for additional data on computational requirements).

       ADD   BPEMB   GPT-2   ABG
CPU    -     100     3,600   3,600
GPU    -     -       3,600   3,600

       Char-LR   Word-LR   CNN   BERT
CPU    100       100       400   13,000
GPU    100       100       400   13,000

Table 9: Memory (MB) required for augmentation techniques and classifiers. Rounded to nearest 100 MB.

4.6 Alternative toxic class

In order to see whether our results described so far generalize beyond threat, we repeated our experiments using another toxic language class, identity-hate, as the minority class. Our results for identity-hate are in line with those for threat. All classifiers performed poorly on SEED due to very low recall. Augmentation with simple techniques helped BERT gain more than 20 percentage points for the F1-score. Shallow classifiers approached BERT-like performance with appropriate augmentation. We present further details in Appendix B.

5 Related work

Toxic language classification has been conducted in a number of studies (Schmidt and Wiegand, 2017; Davidson et al., 2017; Wulczyn et al., 2017; Gröndahl et al., 2018; Qian et al., 2019; Breitfeller et al., 2019). NLP applications of data augmentation include text classification (Ratner et al., 2017; Wei and Zou, 2019; Mesbah et al., 2019), user behavior categorization (Wang and Yang, 2015), dependency parsing (Vania et al., 2019), and machine translation (Fadaee et al., 2019; Xia et al., 2019). Related techniques are also used in automatic paraphrasing (Madnani and Dorr, 2010; Li et al., 2018) and writing style transfer (Shen et al., 2017; Shetty et al., 2018; Mahmood et al., 2019).

Hu et al. (2017) produced text with controlled target attributes via variational autoencoders. Mesbah et al. (2019) generated artificial sentences for adverse drug reactions using Reddit and Twitter data. Similarly to their work, we generated novel toxic sentences from a language model. Petroni et al. (2019) compared several pre-trained language models on their ability to understand factual and commonsense reasoning. BERT models consistently outperformed other language models. Petroni et al. suggest that large pre-trained language models may become alternatives to knowledge bases in the future.

6 Discussion and conclusions

Our results highlight the relationship between classification performance and computational overhead. Overall, BERT performed the best with data augmentation. However, it is highly resource-intensive (§4.5). ABG yielded almost BERT-level F1- and ROC-AUC scores on all classifiers. While using GPT-2 is more expensive than other augmentation techniques, it has significantly lower requirements than BERT. Additionally, augmentation is a one-time upfront cost in contrast to ongoing costs for classifiers. Thus, the trade-off between performance and computational resources can influence which technique is optimal in a given setting.

We identify the following further topics that we leave for future work.

SEED coverage. Our results show that data augmentation can increase coverage, leading to better toxic language classifiers when starting with very small seed datasets. The effects of data augmentation will likely differ with larger seed datasets.

Languages. Some augmentation techniques are limited in their applicability across languages. GPT-2, WORDNET, PPDB and GLOVE are available for certain other languages, but with less coverage than in English. BPEMB is nominally available in 275 languages, but has not been thoroughly tested on less prominent languages.

Transformers. BERT has inspired work on other pre-trained Transformer classifiers, leading to better classification performance (Liu et al., 2019;
Lewis et al., 2019) and better trade-offs between memory consumption and classification performance (Sanh et al., 2019; Jiao et al., 2019). Exploring the effects of augmentation on these Transformer classifiers is left for future work.

Attacks. Training classifiers with augmented data may influence their vulnerability for model extraction attacks (Tramèr et al., 2016; Krishna et al.), model evasion (Gröndahl et al., 2018), or backdoors (Schuster et al., 2020). We leave such considerations for future work.

Acknowledgments

We thank Jonathan Paul Fernandez Strahl, Mark van Heeswijk, and Kuan Eeik Tan for valuable discussions related to the project, and Karthik Ramesh for his help with early experiments. We also thank Prof. Yaoliang Yu for providing compute resources for early experiments. Tommi Gröndahl was funded by the Helsinki Doctoral Education Network in Information and Communications Technology (HICT).

References

Betty van Aken, Julian Risch, Ralf Krestel, and Alexander Löser. 2018. Challenges for toxic comment classification: An in-depth error analysis. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 33–42.

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760.

Peter J. Bickel and David A. Freedman. 1984. Asymptotic normality and the bootstrap in stratified sampling. The Annals of Statistics, 12(2):470–482.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly, Beijing.

Gwern Branwen. 2019. GPT-2 neural network poetry. https://www.gwern.net/GPT-2. Last accessed May 2020.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the 11th Conference on Web and Social Media, pages 512–515.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2019. Data augmentation for low resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 567–573.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 758–764.

Paul Glasserman and David D. Yao. 1992. Some guidelines and guarantees for common random numbers. Management Science, 38(6):884–908.

Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N. Asokan. 2018. All you need is "love": Evading hate speech detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security (AISec'11), pages 2–12.

Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 2989–2993.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, pages 1587–1596.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Jigsaw. 2018. Toxic comment classification challenge: Identify and classify toxic online comments. Available in https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge, accessed last time in May 2020.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Kalpesh Krishna, Gaurav Singh Tomar, Ankur Parikh, Nicolas Papernot, and Mohit Iyyer. Thieves of Sesame Street: Model extraction on BERT-based APIs. In Proceedings of the International Conference on Learning Representations (ICLR).

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of Neural Information Processing Systems (NIPS), pages 1097–1105.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3865–3878.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Nitin Madnani and Bonnie Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Journal of Computational Linguistics, 36(3):341–387.

Asad Mahmood, Faizan Ahmad, Zubair Shafiq, Padmini Srinivasan, and Fareed Zaffar. 2019. A girl has no name: Automated authorship obfuscation using Mutant-X. In Proceedings on Privacy Enhancing Technologies (PETS), pages 54–71.

Binny Mathew, Ritam Dutt, Pawan Goyal, and Animesh Mukherjee. 2019. Spread of hate speech in online social media. In Proceedings of the 10th ACM Conference on Web Science (WebSci '19), pages 173–182.

Sepideh Mesbah, Jie Yang, Robert-Jan Sips, Manuel Valle Torre, Christoph Lofi, Alessandro Bozzon, and Geert-Jan Houben. 2019. Training data augmentation for detecting adverse drug reactions in user-generated content. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2349–2359.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), pages 3111–3119.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):1–69.

Casey Newton. 2020. Facebook will pay $52 million in settlement with moderators who developed PTSD on the job. The Verge. https://www.theverge.com/2020/5/12/21255870/facebook-content-moderator-settlement-scola-ptsd-mental-health/. Last accessed May 2020.

Cheoneum Park, Juae Kim, Hyeon-gu Lee, Reinald Kim Amplayo, Harksoo Kim, Jungyun Seo, and Changki Lee. 2019. ThisIsCompetition at SemEval-2019 Task 9: BERT is unstable for out-of-domain samples. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 1254–1261.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pages 425–430.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.
Jing Qian, Anna Bethke, Yinyin Liu, Elizabeth Belding, and William Yang Wang. 2019. A benchmark dataset for learning to intervene in online hate speech. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4757–4766.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Alexander J. Ratner, Henry R. Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. 2017. Learning to compose domain-specific transformations for data augmentation. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017).

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10.

Roei Schuster, Tal Schuster, Yoav Meri, and Vitaly Shmatikov. 2020. Humpty Dumpty: Controlling word meanings via corpus poisoning. arXiv preprint arXiv:2001.04935.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Proceedings of Neural Information Processing Systems (NIPS).

Rakshith Shetty, Bernt Schiele, and Mario Fritz. 2018. A4NT: Author attribute anonymity by adversarial training of neural machine translation. In Proceedings of the 27th USENIX Security Symposium, pages 1633–1650.

Connor Shorten and Taghi M. Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of Big Data, 6.

Lichao Sun, Kazuma Hashimoto, Wenpeng Yin, Akari Asai, Jia Li, Philip Yu, and Caiming Xiong. 2020. Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT. arXiv preprint arXiv:2003.04985.

Liling Tan. 2014. Pywsd: Python implementations of word sense disambiguation (WSD) technologies [software]. https://github.com/alvations/pywsd.

Martin A. Tanner and Wing Hung Wong. 1987. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540.

Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. 2016. Stealing machine learning models via prediction APIs. In Proceedings of the 25th USENIX Security Symposium, pages 601–618.

Clara Vania, Yova Kementchedjhieva, Anders Sogaard, and Adam Lopez. 2019. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1105–1116.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), pages 5998–6008.

Nick Walton. AI Dungeon 2. https://aidungeon.io/. Last accessed May 2020.

William Yang Wang and Diyi Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2557–2563.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Sebastien C. Wong, Adam Gatt, Victor Stamatescu, and Mark D. McDonnell. 2016. Understanding data augmentation for classification: When to warp? In Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–6.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
A Class overlap and interpretation of "toxicity"

Kaggle's toxic comment classification challenge dataset⁹ contains six classes, one of which is called toxic. But all six classes represent examples of toxic speech: toxic, severe toxic, obscene, threat, insult, and identity-hate. Of the threat documents in the full training dataset (GOLD STANDARD), 449/478 overlap with toxic. For identity-hate, overlap with toxic is 1302/1405. Therefore, in this paper, we use the term toxic more generally, subsuming threat and identity-hate as particular types of toxic speech. To confirm that this was a reasonable choice, we manually examined the 29 threat datapoints not overlapping with toxic. All of these represent genuine threats, and are hence toxic in the general sense.

⁹ https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

B The "Identity hate" class

To see if our results generalize beyond threat, we experimented on the identity-hate class in Kaggle's toxic comment classification dataset. Again, we used a 5% stratified sample of GOLD STANDARD as SEED. We first show the number of samples in GOLD STANDARD, SEED and TEST in Table 10. There are approximately 3 times more minority-class samples in identity-hate than in threat. Next, we show classifier performance on GOLD STANDARD/identity-hate in Table 11. The results closely resemble those on GOLD STANDARD/threat in Table 4 (§4.1).

           GOLD STD.  SEED   TEST
Minority   1,405      75     712
Majority   158,166    7,910  63,266

Table 10: Corpus size for identity-hate (minority) and non-identity-hate (majority).

GOLD STANDARD
             Char   Word   CNN    BERT
Precision    0.64   0.54   0.70   0.55
Recall       0.40   0.31   0.20   0.62
F1 (macro)   0.74   0.69   0.65   0.79

Table 11: Classifier performance on GOLD STANDARD. Precision and recall for identity-hate; F1-score macro-averaged from both classes.

We compared SEED and COPY with the techniques that had the highest performance on threat: ADD, BPEMB, GPT-2, and their combination ABG. Table 12 shows the results.

Like in threat, BERT performed the poorest on SEED, with the lowest recall (0.06). All techniques decreased precision from SEED, and all increased recall except COPY with CNN. With COPY, the F1-score increased with Char-LR (0.12) and BERT (0.21), but not Word-LR (0.01) or CNN (−0.04). This is in line with corresponding results from threat (§4.2, Table 6): COPY did not help either of the word-based classifiers (Word-LR, CNN) but helped the character- and subword-based classifiers (Char-LR, BERT).

Of the individual augmentation techniques, ADD increased the F1-score the most with Char-LR (0.15) and BERT (0.20); and GPT-2 increased it the most with Word-LR (0.07) and CNN (0.07). Here again we see the similarity between the two word-based classifiers, and the two that take inputs below the word-level. Like in threat, COPY and ADD achieved close F1-scores with BERT, but with different relations between precision and recall. BPEMB was not the best technique with any classifier, but increased F1-score everywhere except in CNN, where precision dropped drastically.

In the combined ABG technique, Word-LR and CNN reached their highest F1-score increases (0.08 and 0.07, respectively). With Char-LR, the F1-score was also among the highest, but did not reach ADD. Like with threat, ABG increased precision and recall more than GPT-2 alone.

Overall, our results on identity-hate closely resemble those we received in threat, resulting in more than 20 percentage point increases in the F1-score for BERT on augmentations with COPY and ADD. Like in threat, the impact of most augmentations was greater on Char-LR than on Word-LR or CNN. Despite their similar F1-scores in SEED, Char-LR exhibited much higher precision, which decreased but remained generally higher than with other classifiers. Combined with an increase in recall to similar or higher levels than with other classifiers, Char-LR reached BERT-level performance with proper data augmentation.
Augmentation                        Metric      Char-LR      Word-LR      CNN          BERT
SEED (No Oversampling)              Precision   0.85 ± 0.04  0.59 ± 0.05  0.52 ± 0.08  0.65 ± 0.46
                                    Recall      0.11 ± 0.04  0.12 ± 0.03  0.11 ± 0.04  0.06 ± 0.10
                                    F1 (macro)  0.60 ± 0.03  0.60 ± 0.02  0.59 ± 0.02  0.54 ± 0.08
COPY (Simple Oversampling)          Precision   0.61 ± 0.02  0.54 ± 0.04  0.27 ± 0.06  0.52 ± 0.06
                                    Recall      0.34 ± 0.04  0.14 ± 0.03  0.07 ± 0.01  0.50 ± 0.06
                                    F1 (macro)  0.72 ± 0.02  0.61 ± 0.02  0.55 ± 0.01  0.75 ± 0.01
ADD (Add Majority-class Sentence)   Precision   0.54 ± 0.04  0.54 ± 0.05  0.43 ± 0.05  0.43 ± 0.05
                                    Recall      0.47 ± 0.05  0.21 ± 0.03  0.21 ± 0.04  0.58 ± 0.08
                                    F1 (macro)  0.75 ± 0.01  0.65 ± 0.01  0.64 ± 0.02  0.74 ± 0.01
BPEMB (Subword Substitutions)       Precision   0.43 ± 0.04  0.30 ± 0.03  0.15 ± 0.05  0.29 ± 0.06
                                    Recall      0.38 ± 0.04  0.29 ± 0.01  0.32 ± 0.05  0.23 ± 0.03
                                    F1 (macro)  0.70 ± 0.01  0.64 ± 0.01  0.59 ± 0.02  0.62 ± 0.02
GPT-2 (Conditional Generation)      Precision   0.41 ± 0.05  0.30 ± 0.03  0.33 ± 0.08  0.22 ± 0.05
                                    Recall      0.34 ± 0.04  0.39 ± 0.03  0.34 ± 0.09  0.59 ± 0.06
                                    F1 (macro)  0.68 ± 0.01  0.67 ± 0.01  0.66 ± 0.01  0.65 ± 0.02
ABG (ADD, BPEMB, GPT-2 Mix)         Precision   0.41 ± 0.04  0.32 ± 0.03  0.28 ± 0.06  0.27 ± 0.05
                                    Recall      0.50 ± 0.04  0.41 ± 0.02  0.46 ± 0.05  0.62 ± 0.07
                                    F1 (macro)  0.72 ± 0.01  0.68 ± 0.01  0.66 ± 0.02  0.68 ± 0.02

Table 12: Comparison of augmentation techniques for 20x augmentation on SEED/identity-hate: means for precision, recall and macro-averaged F1-score shown with standard deviations (10 repetitions). Precision and recall for identity-hate; F1-score macro-averaged from both classes.
Library                                             Version
eda_nlp (https://github.com/jasonwei20/eda_nlp)     Nov 8, 2019
apex                                                0.1
bpemb                                               0.3.0
fast-bert                                           1.6.5
gensim                                              3.8.1
nltk                                                3.4.5
numpy                                               1.17.2
pywsd                                               1.2.4
scikit-learn                                        0.21.3
scipy                                               1.4.1
spacy                                               2.2.4
torch                                               1.4.0
transformers                                        2.8.0

Table 14: Library versions required for replicating this study. Date supplied if no version applicable.

            Training
            Memory (MB)       Runtime (s)
            GPU     CPU       GPU    CPU
Char-LR     -       100       -      4
Word-LR     -       100       -      3
CNN         400     400       -      13
BERT        3800    1500      757    -

            Prediction
            Memory (MB)       Runtime (s)
            GPU     CPU       GPU    CPU
Char-LR     -       100       -      25
Word-LR     -       100       -      5
CNN         400     400       -      42
BERT        4600    4200      464    -

Table 15: Computational resources (MB and seconds) required for training classifiers on the SEED dataset and test dataset. Note that BERT results here were calculated with mixed precision arithmetic (currently supported by Nvidia Turing architecture). We measured memory usage close to 13 GB in the general case.
D Classifier training and testing performance

Table 15 specifies the system resources that training and prediction required on our setup (Appendix C). The SEED dataset has 8,955 documents and the test dataset 63,978 documents. We used the 12-layer, 768-hidden, 12-heads, 110M parameter BERT-Base, Uncased model.¹⁴

E Lemma inflection in WORDNET

Lemmas appear in uninflected form in WordNet. To mitigate this limitation, we used a dictionary-based method for mapping lemmas to surface manifestations with NLTK part-of-speech (POS) tags. For deriving the dictionary, we used 8.5 million short sentences (≤ 20 words) from seven corpora: Stanford NMT,¹⁵ OpenSubtitles 2018,¹⁶ Tatoeba,¹⁷ SNLI,¹⁸ SICK,¹⁹ Aristo-mini (December 2016 release),²⁰ and WordNet example sentences.²¹ The rationale for the corpus was to have a large vocabulary along with relatively simple grammatical structures, to maximize both coverage and the correctness of POS-tagging. We mapped each lemma-POS pair to its most common inflected form in the corpus. When performing synonym replacement in WORDNET augmentation, we lemmatized and POS-tagged the original word with NLTK, chose a random synonym for it, and then inflected the synonym with the original POS-tag if it was present in the inflection dictionary.

¹⁴ https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
¹⁵ https://nlp.stanford.edu/projects/nmt/
¹⁶ http://opus.nlpl.eu/OpenSubtitles2018.php
¹⁷ https://tatoeba.org
¹⁸ https://nlp.stanford.edu/projects/snli/
¹⁹ http://clic.cimec.unitn.it/composes/sick.html
²⁰ https://www.kaggle.com/allenai/aristo-mini-corpus
²¹ http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html
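The dictionary construction described in Appendix E can be sketched as follows, assuming NLTK with its tagger and WordNet lemmatizer are available; the sentence iterator is a placeholder and the crude POS mapping is our simplification, not the authors' exact procedure.

from collections import Counter, defaultdict
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

def build_inflection_dict(sentences):
    lemmatizer = WordNetLemmatizer()
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in pos_tag(word_tokenize(sentence)):
            # Crude Penn-to-WordNet POS mapping for the lemmatizer.
            wn_pos = {"V": "v", "N": "n", "J": "a", "R": "r"}.get(tag[:1], "n")
            lemma = lemmatizer.lemmatize(word.lower(), pos=wn_pos)
            counts[(lemma, tag)][word.lower()] += 1
    # Map each <lemma, tag> pair to its most frequent surface form.
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}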
F GPT-2 parameters

Table 16 shows the hyperparameters we used for fine-tuning our GPT-2 models, and for generating outputs. Our fine-tuning follows the transformers examples with default parameters.²²

²² https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py

For generation, we trimmed input to be at most 100 characters long, further cutting off the input at the last full word or punctuation to ensure
generated documents start with full words. Our generation script follows transformers examples.²³

²³ https://github.com/huggingface/transformers/blob/818463ee8eaf3a1cd5ddc2623789cbd7bb517d02/examples/run_generation.py

Fine-tuning
  Batch size            1
  Learning rate         2e-5
  Epochs                2
Generation
  Input cutoff          100 characters
  Temperature           1.0
  Top-p                 0.9
  Repetition penalty    1
  Output cutoff         100 subwords or EOS generated

Table 16: GPT-2 parameters.

In §4.2 – §4.4, we generated novel documents with GPT-2 fine-tuned on threat documents in SEED for 2 epochs. In Table 17, we show the impact of changing the number of fine-tuning epochs for GPT-2. Precision generally increased as the number of epochs was increased. However, recall simultaneously decreased.
G Ablation study

In §4.2 – §4.4, we investigated several word replacement techniques with a fixed change rate. In those experiments, we allowed 25% of possible replacements. Here we study each augmentation technique's sensitivity to the replacement rate. As done in previous experiments, we ensured that at least one augmentation is always performed. Experiments are shown in Tables 18–21.

Interestingly, all word replacements decreased classification performance with BERT. We suspect this occurred because of the pre-trained weights in BERT.

We show threat precision, recall and macro-averaged F1-scores for PPDB in Table 18. Changing the substitution rate had very little impact on the performance of any classifier. This indicates that there were very few n-gram candidates that could be replaced. We show results on WORDNET in Table 19. As exemplified for substitution rate 25% in Appendix H, PPDB and WORDNET substitutions replaced very few words. Both results were close to COPY (§4.2, Table 6).

We show results for GLOVE in Table 20. Word-LR performed better with higher substitution rates (increased recall). Interestingly, Char-LR performance (particularly precision) dropped with GLOVE compared to using COPY. For CNN, smaller substitution rates seem preferable, since precision decreased quickly as the number of substitutions increased.

BPEMB results in Table 21 are consistent across the classifiers Char-LR, Word-LR and CNN. Substitutions in the range 12%–37% increased recall over COPY. However, precision dropped at different points, depending on the classifier. CNN precision dropped earlier than on other classifiers, already at 25% change rate.

H Augmented threat examples

We provide examples of augmented documents in Table 22. We picked a one-sentence document as the seed. We remark that augmented documents created by GPT-2 have the highest novelty, but may not always be considered threat (see example GPT-2 #1 in Table 22).
                          Fine-tuning epochs on GPT-2
Classifier  Metric        1     2     3     4     5     6     7     8     9     10
Char-LR     Precision     0.38  0.43  0.45  0.49  0.51  0.49  0.52  0.50  0.51  0.51
            Recall        0.34  0.34  0.32  0.31  0.31  0.29  0.28  0.28  0.27  0.28
            F1 (macro)    0.68  0.69  0.68  0.68  0.69  0.68  0.68  0.68  0.68  0.68
Word-LR     Precision     0.30  0.33  0.34  0.34  0.36  0.35  0.35  0.34  0.34  0.34
            Recall        0.47  0.45  0.43  0.40  0.40  0.38  0.37  0.36  0.35  0.35
            F1 (macro)    0.68  0.69  0.69  0.68  0.68  0.68  0.67  0.67  0.67  0.67
CNN         Precision     0.26  0.28  0.30  0.32  0.33  0.32  0.31  0.31  0.31  0.32
            Recall        0.49  0.50  0.47  0.50  0.48  0.48  0.48  0.46  0.47  0.46
            F1 (macro)    0.66  0.67  0.68  0.69  0.69  0.68  0.68  0.68  0.68  0.68
BERT        Precision     0.11  0.14  0.15  0.15  0.16  0.17  0.17  0.19  0.17  0.17
            Recall        0.62  0.66  0.67  0.64  0.65  0.62  0.62  0.62  0.61  0.61
            F1 (macro)    0.59  0.61  0.62  0.62  0.62  0.63  0.63  0.64  0.63  0.62

Table 17: Impact of changing number of fine-tuning epochs on GPT-2-augmented datasets. Mean results for 10 repetitions. Highest numbers highlighted in bold.
Table 18: Impact of changing the proportion of substituted words on PPDB-augmented datasets. Mean results for 10 repetitions. Classifier's highest numbers highlighted in bold.

Table 19: Impact of changing the proportion of substituted words on WORDNET-augmented datasets. Mean results for 10 repetitions. Classifier's highest numbers highlighted in bold.
                      GLOVE: Word substitution rate
            Metric    0     12    25    37    50    100
Char-LR     Pre.      0.16  0.15  0.14  0.14  0.14  0.32
            Rec.      0.11  0.12  0.13  0.13  0.13  0.05
            F1 ma.    0.56  0.56  0.57  0.57  0.57  0.54
Word-LR     Pre.      0.31  0.37  0.35  0.33  0.33  0.30
            Rec.      0.07  0.10  0.16  0.19  0.19  0.09
            F1 ma.    0.55  0.58  0.61  0.62  0.62  0.57
CNN         Pre.      0.41  0.44  0.39  0.35  0.28  0.15
            Rec.      0.13  0.18  0.19  0.20  0.17  0.06
            F1 ma.    0.59  0.62  0.62  0.62  0.60  0.54
BERT        Pre.      0.44  0.43  0.40  0.36  0.33  0.13
            Rec.      0.35  0.27  0.16  0.13  0.11  0.03
            F1 ma.    0.69  0.66  0.61  0.59  0.58  0.52
# Document sample
SEED: No Oversampling
if you do not stop, the wikapidea nijas will come to your house and kill you
COPY: Simple Oversampling
1. if you do not stop, the wikapidea nijas will come to your house and kill you
2. if you do not stop, the wikapidea nijas will come to your house and kill you
3. if you do not stop, the wikapidea nijas will come to your house and kill you
EDA: Easy Data Augmentation 16
1. if you do put up not stop the wikapidea nijas will come to your house and kill you
2. if you do not stopover the wikapidea nijas will come to your house and kill you
3. if you do not break the wikapidea nijas will come to your house and kill you
ADD: Add Majority-class Sentence
1. We thank you both for your contributions to Wikipedia at-large and your use of the tool. if you
do not stop, the wikapidea nijas will come to your house and kill you
2. if you do not stop, the wikapidea nijas will come to your house and kill you Honest! ))
3. *** username, I am on a shared IP address. if you do not stop, the wikapidea nijas will come
to your house and kill you
PPDB Phrase Substitutions
1. if you do not be halted , the wikapidea nijas will come to your home and kill you
2. if you do not stop , the wikapidea nijas comes along to your house and been murdered you
3. if you do not stop , the wikapidea nijas will arrive to your home and kill you
WORDNET Word Substitutions
1. if you do not stop , the wikapidea nijas will amount to your house and kill you
2. if you do not stop , the wikapidea nijas will come to your house and down you
3. if you do not stop , the wikapidea nijas will come to your theatre and kill you
GLOVE Word Substitutions
1. if you so not stop , the wikapidea nijas will come to yours house have kill you
2. mean know do not stop ? the wikapidea nijas will cos to your house and kill you
3. if you do not thinking , and wikapidea nijas will come to your going and kill you
BPEMB Subword Substitutions
1. whether you know not stop, theldikapideina niji give come to your house and kill you
2. shouldnever do not stop, the wikamyla nijag will come to your house and kill want
3. if you let not train, the wkanapidei nijas will come tonever house and dead you
GPT-2 Conditional Generation
1. for what you have done, you don’t have a choice. You are too big and weak to stop.
2. ! So if you don’t stop this, you will be murdered and your people will be ripped apart. I hope
you die and die and die. No, you will die and die and die. Just stop and kill yourself for it.
Advertisements
3. . You will be hanged like a rat. The next day I will get a letter telling you that you were
kicked out of your home by the wikapidea nijas and that you will die there. You are an idiot
for calling the wikapidea nijas to death and you should die.
ABG: Mixed Augmentation (ADD, BPEMB & GPT-2)
1. Support - The other article is minor, and not worth a separate entry. if you do not stop, the
wikapidea nijas will come to your house and kill you
2. . You will be hanged like a rat. The next day I will get a letter telling you that you were
kicked out of your home by the wikapidea nijas and that you will die there. You are an idiot
for calling the wikapidea nijas to death and you should die.
3. if you let not train, the wkanapidei nijas will come tonever house and dead you
Table 22: Documents generated by selected augmentation techniques in this study. Changes to original seed highlighted. The selected sample is shorter than average (see §3.1, Table 1). We anonymized the username in ADD (#3). Three samples generated by each technique shown.
¹⁶ https://github.com/jasonwei20/eda_nlp