Data Augmentation With Transformers For Text Classification
1 Introduction
Data Augmentation (DA) is the process of obtaining/generating additional data
for training machine learning models. Through the DA process, one is able to
reduce the risk of overfitting and increase the robustness of the machine learning
models when not enough data is available. DA has emerged in the context of
deep learning because these models require large amounts of data in order
to learn adequately. These techniques are commonly used in computer vision
where considerable improvements in performance are reported, see e.g., [5,13].
Despite this success, DA has not been thoroughly explored in the context of
Natural Language Processing (NLP), where there are plenty of domains in which
collecting manually labeled documents is complicated.
In this paper we explore DA in the context of NLP, specifically for text clas-
sification. We rely on the success that transformer-based models (e.g., Bert [6]
and GPT [12]) have achieved in different NLP tasks and use them to generate
synthetic documents that are in turn used to augment the initial training set
associated with text classification tasks. Transformers are powerful models implementing self-attention mechanisms that allow them to capture long-range dependencies among words in a sequence. Outstanding results in a wide variety of
tasks have been reported with these methods when used as both (pretrained)
feature extractors and end-to-end learners [6,8,12]. Since transformers are, in essence, language models, they can be sampled conditioned on certain inputs and thus used as text generators. We exploit this feature of transformers as a data augmentation mechanism. In a nutshell, we sample pretrained transform-
ers conditioned on training documents of a text classification task, and use the
outputs as augmented training instances. Four variants for generating artificial
samples are proposed and evaluated in benchmark data. Experimental results
show the usefulness of the augmented samples and motivate further research on
data augmentation for NLP.
The contributions of this paper are as follows:
– We explore the suitability of using transformers as data augmenters in the
context of text classification.
– We propose four variants to augment a training set of documents with the
outputs of transformers conditioned on the documents.
– We show the augmentation process is promising, motivating further research.
The remainder of this paper is organized as follows. The next section reviews
related work on data augmentation for NLP. Next Sect. 3 describes the proposed
methodology for data augmentation. Then Sect. 4 presents an experimental eval-
uation of the augmentation procedures. Finally, Sect. 5 outlines conclusions and
future work directions.
2 Related Work
This section briefly reviews related work on data augmentation for NLP tasks.
The idea of augmenting the available training documents for NLP tasks is not
new; see, for instance, [1,7,11]. However, early approaches for data augmentation
(either term or document expansion methods) mainly dealt with the task of
identifying words associated with the content of documents that could be added to them. These methods mostly relied on thesauri, semantic nets like WordNet [1,7], or co-occurrences [4,11]. In this way, a document could include more related terms, eventually addressing issues associated with synonymy and polysemy. Unlike early expansion strategies, modern data augmentation aims at generating artificial instances (instead of extending the content of available instances) based
on the available ones. In this sense, data augmentation is closer to oversampling
(e.g., see [3]) than to classical expansion methodologies.
Recent data augmentation efforts for NLP have adopted quite diverse methodologies; in the following we summarize the main paradigms. Wang and Yang generate artificial (word-embedding) representations for documents by using the most similar word embeddings to the words appearing in the initial document [15]. In this way, the artificial representations resemble documents with meaning related to the original ones. One should note, however, that no document is actually generated, but only the embedding-based representations. Wei et al. introduce four Easy Data Augmentation (EDA) strategies for generating modified versions of the available training sentences [16].
3.1 Transformers
Although there are several transformer-based models available, in this work we rely on two of the most effective and popular ones, namely BERT and GPT-2. BERT (Bidirectional Encoder Representations from Transformers) is a language model based on transformers [6]. It implements bidirectional self-attention mechanisms and has been trained with term-masking strategies. GPT-2, on the other hand, implements a unidirectional language model trained under a predict-next-word objective [12]. Both transformer models were trained on huge corpora and under a variety of settings. Pretrained models are publicly available, so anyone can use them as a starting point for their research. Using such pretrained models for solving a variety of NLP tasks is straightforward. However, the benefits of these methods for data augmentation have not been deeply explored so far (see [2,10]); we hope our study helps to better understand the capabilities and limitations of transformers for data augmentation.
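As a minimal sketch of this starting point (assuming the Hugging Face transformers library, which the paper does not name), both pretrained models can be loaded as ready-to-use pipelines:

```python
from transformers import pipeline

# Masked language model (BERT): fills in [MASK] tokens with candidate words.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Causal language model (GPT-2): continues a given prompt.
generator = pipeline("text-generation", model="gpt2")

print(unmasker("The movie was really [MASK]."))
print(generator("The movie was really", max_length=20))
```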
Fig. 1. Left: Single masking augmentation method. Words in a sentence are masked
(one at a time) and artificial sentences are produced by asking BERT to unmask the
masked words. Right: Examples for one, three, five and ten sentences created from the
original sentence using Single Masking Augmentation
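The following is a minimal sketch of the Single Masking step illustrated in Fig. 1, again assuming the Hugging Face fill-mask pipeline; the function name is ours and, since the paper does not detail how the unmasked word is chosen, we simply take BERT's top prediction:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def single_masking_augmentation(sentence, n_new=3):
    """Mask one word at a time and let BERT unmask it, collecting
    up to n_new artificial sentences."""
    words = sentence.split()
    augmented = []
    for i in range(len(words)):
        masked = words[:i] + [unmasker.tokenizer.mask_token] + words[i + 1:]
        prediction = unmasker(" ".join(masked), top_k=1)[0]  # BERT's top guess
        candidate = " ".join(words[:i] + [prediction["token_str"]] + words[i + 1:])
        if candidate != sentence:
            augmented.append(candidate)
        if len(augmented) == n_new:
            break
    return augmented

print(single_masking_augmentation("the movie was really good", n_new=3))
```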
process are now masked in a second and different random position; see the top of Fig. 2 for an illustration. Finally, a token is provided based on the order of the respective sentences. The final result is a set of sentences with two words changed by BERT. As with Single Masking Augmentation, we apply this method to produce one, three, five and ten sentences, as shown in the bottom of Fig. 2.
Triple Masking Augmentation. The third method corresponds to Triple
Masking Augmentation. The idea is the same as in the previous two methods,
but we now mask three words of the sentence. As we can see in Fig. 3, the Single
Masking Augmentation procedure is applied three times in series, thus creating
the set of new sentences with three different words in them. As in the previous
two methods, we run experiments for the generation of one, three, five and ten
sentences, as shown in Fig. 4.
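Under the same assumptions as before, the Double and Triple Masking variants can be sketched as repeated applications of the masking step, each pass masking a different random position of the already modified sentence (two passes for Double Masking, three for Triple Masking); the helper below is illustrative, not the authors' implementation:

```python
import random
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def multi_masking_augmentation(sentence, n_passes=2, n_new=3, seed=0):
    """Apply the single-masking step n_passes times in series, each pass
    replacing a word at a new random position, to build n_new sentences."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        words = sentence.split()
        positions = rng.sample(range(len(words)), k=min(n_passes, len(words)))
        for i in positions:
            words[i] = unmasker.tokenizer.mask_token
            prediction = unmasker(" ".join(words), top_k=1)[0]
            words[i] = prediction["token_str"]
        augmented.append(" ".join(words))
    return augmented

# Double Masking changes two words, Triple Masking changes three.
print(multi_masking_augmentation("the movie was really good", n_passes=2))
print(multi_masking_augmentation("the movie was really good", n_passes=3))
```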
Augmented Sentence. The fourth proposed variant for data augmentation
consists in augmenting sentences with GPT-2. The procedure is straightforward: as input we take a sentence that is fed into the GPT-2 model; then we ask the model to predict an augmented sentence with up to fifty words, as shown in Fig. 5. One should note that this procedure is very similar to the one described in [2,10].
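A minimal sketch of this Augmented Sentence variant, assuming the Hugging Face text-generation pipeline; the decoding settings are our assumptions, since the paper only states a limit of roughly fifty words:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def augment_sentence(sentence, max_extra_tokens=50):
    """Feed the sentence to GPT-2 and keep the original text plus the
    continuation it generates (roughly up to fifty extra words/tokens)."""
    prompt_tokens = len(generator.tokenizer(sentence)["input_ids"])
    output = generator(
        sentence,
        max_length=prompt_tokens + max_extra_tokens,
        num_return_sequences=1,
        do_sample=True,
        top_k=50,
    )
    return output[0]["generated_text"]

print(augment_sentence("the movie was really good"))
```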
4 Experimental Results
This section presents an experimental evaluation of the data augmentation
strategies described in Sect. 3. We first describe the experimental settings and
then we present the results obtained by each of the developed strategies.
For the experiments we used four benchmark datasets for text classification,
namely: SST-2 (Stanford Sentiment Treebank), IMDB (sentiment-related movie reviews), Spam, and Sentence Type (sentences classified as command, question, or statement). Each of the considered datasets has 600 instances; for the experimental evaluation we used 80% of the instances for training and the remainder
for testing. In a preliminary stage we used 20% of the training data as validation
set for hyperparameter tuning.
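As a rough sketch of this splitting protocol (scikit-learn is our choice, and `texts`/`labels` are assumed to hold the 600 documents and labels of one dataset):

```python
from sklearn.model_selection import train_test_split

# texts, labels: the 600 documents of one dataset and their class labels.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.20, random_state=0)

# In a preliminary stage, 20% of the training portion served as a
# validation set for hyperparameter tuning.
tr_texts, val_texts, tr_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.20, random_state=0)
```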
As classification models we used straightforward, state-of-the-art models based on deep learning: a Convolutional Neural Network (CNN) and a Long Short-Term Memory Network (LSTM-RNN). Both were implemented following the implementations of Wei et al. [16]. The CNN has the following layers: embedding, one-dimensional convolution (ReLU activation), dropout, one-dimensional global max pooling, dense (ReLU), and dense (sigmoid); it was trained for ten epochs with a binary cross-entropy loss and the Adam optimizer. The LSTM-RNN has the following layers: embedding, LSTM, a fully connected layer (FC1) with ReLU activation, dropout, and a sigmoid output layer; it was also trained for ten epochs with a binary cross-entropy loss and the RMSprop optimizer.
Fig. 4. Examples for one, three, five and ten sentences created from the original sentence using Triple Masking Augmentation
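Since the paper gives only the layer order, losses, and optimizers of the two classifiers, the following Keras sketch fills in assumed values for the vocabulary size, embedding dimension, and layer widths:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, EMB_DIM = 10000, 100  # assumed values, not given in the paper

def build_cnn():
    # embedding -> 1D convolution (ReLU) -> dropout -> global max pooling
    # -> dense (ReLU) -> dense (sigmoid); binary cross-entropy + Adam.
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM),
        layers.Conv1D(128, 5, activation="relu"),
        layers.Dropout(0.5),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

def build_lstm():
    # embedding -> LSTM -> fully connected (ReLU) -> dropout -> sigmoid output;
    # binary cross-entropy + RMSprop.
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM),
        layers.LSTM(64),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
    return model

# Both models were trained for ten epochs, e.g.:
# build_cnn().fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val))
```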
4.2 Results
Each of the four strategies described above was evaluated using the datasets and classifiers just mentioned, with test-set accuracy as the leading evaluation measure. The augmentation methods based on BERT (i.e., single, double, and triple masking) were tested when augmenting one, three, five and ten sentences, in order to determine whether there was a relationship between the number of augmented sentences and the resulting classification performance.
Table 1. Classification performance with Single Masking for one, three, five and ten sentences augmented. Plain indicates the performance obtained by each of the models without performing any augmentation. The best result in each row is shown in bold.

Dataset        Model  Plain  Plain + Single Masking
                              1     3     5     10
Spam           RNN    0.79   0.86  0.84  0.86  0.92
Spam           CNN    0.59   0.68  0.67  0.65  0.63
IMDB           RNN    0.71   0.72  0.77  0.75  0.71
IMDB           CNN    0.79   0.76  0.77  0.72  0.74
Sentence Type  RNN    0.35   0.42  0.46  0.53  0.56
Sentence Type  CNN    0.40   0.43  0.44  0.40  0.40
SST2           RNN    0.59   0.63  0.68  0.66  0.67
SST2           CNN    0.68   0.63  0.66  0.65  0.71
Average        RNN    0.61   0.65  0.68  0.70  0.71
Average        CNN    0.61   0.62  0.63  0.60  0.62
Table 2 shows the results obtained with the Double Masking Augmentation strategy. From this table it can be seen that, on average, this strategy is able to improve the performance of the baseline.
Table 2. Results for DA with Double Masking for one, three, five and ten sentences augmented.

Dataset        Model  Plain  Plain + Double Masking
                              1     3     5     10
Spam           RNN    0.79   0.82  0.77  0.78  0.86
Spam           CNN    0.59   0.65  0.57  0.66  0.74
IMDB           RNN    0.71   0.71  0.69  0.76  0.70
IMDB           CNN    0.79   0.76  0.76  0.74  0.70
Sentence Type  RNN    0.35   0.37  0.44  0.35  0.50
Sentence Type  CNN    0.40   0.43  0.40  0.35  0.50
SST2           RNN    0.59   0.64  0.66  0.69  0.66
SST2           CNN    0.68   0.68  0.65  0.70  0.70
Average        RNN    0.61   0.63  0.64  0.64  0.68
Average        CNN    0.61   0.63  0.59  0.61  0.66
Augmenting ten sentences gave better results for both models, the LSTM-RNN and the CNN. This result seems to corroborate the hypothesis that a larger number of augmented sentences yields better classification performance.
Table 3 shows the results obtained with the Triple Masking Augmentation
strategy. Based on the results from Table 3 it can be seen that this strategy
also improves the performance of the baseline for both considered classification
models. Augmenting ten sentences resulted in better classification performance
for the LSTM-RNN and one sentence worked better for the CNN.
Table 3. Results for DA with Triple Masking for one, three, five and ten sentences augmented.

Dataset        Model  Plain  Plain + Triple Masking
                              1     3     5     10
Spam           RNN    0.79   0.74  0.77  0.67  0.85
Spam           CNN    0.59   0.69  0.65  0.62  0.57
IMDB           RNN    0.71   0.72  0.70  0.70  0.77
IMDB           CNN    0.79   0.75  0.76  0.70  0.76
Sentence Type  RNN    0.35   0.43  0.38  0.47  0.51
Sentence Type  CNN    0.40   0.42  0.35  0.36  0.50
SST2           RNN    0.59   0.64  0.69  0.70  0.63
SST2           CNN    0.68   0.70  0.71  0.70  0.63
Average        RNN    0.61   0.63  0.63  0.63  0.69
Average        CNN    0.61   0.64  0.62  0.59  0.61
Last but not least, the results for the Augmented Sentence strategy are pre-
sented in Table 4. From this table it can be observed that there were improve-
ments mostly for the LSTM-RNN model, whereas the augmentation did not
seem to improve the performance of the CNN classifier.
Table 4. Results for DA with the Augmented Sentence strategy.

Dataset        Model  Plain  Plain + Augmentation
Spam           RNN    0.79   0.75
Spam           CNN    0.59   0.61
IMDB           RNN    0.71   0.75
IMDB           CNN    0.79   0.76
Sentence Type  RNN    0.35   0.46
Sentence Type  CNN    0.40   0.41
SST2           RNN    0.59   0.62
SST2           CNN    0.68   0.67
Average        RNN    0.61   0.64
Average        CNN    0.61   0.61
Table 5. Summary of the results for the four methods along with their average and standard deviation.
To get a better insight into how the proposed methods work when using different amounts of training data, we subsampled the training set and evaluated the performance of the augmentation for the LSTM-RNN classifier. For this
experiment we considered only the Single Masking method. Our hypothesis was
that the smaller the size of the training set, the larger the impact of the different
augmentation strategies in the classification performance. Figure 6 shows the
result of the experiment for the four datasets.
From Fig. 6, we cannot see the behavior we were expecting. Still, it can be seen that the augmentation strategy outperforms the baseline when half of the training set (and more) is used. The fact that we could not obtain larger improvement margins with fewer documents could be due to the training data being very limited; also, one should note that we trained the models for only ten epochs.
Fig. 6. Performance on text classification tasks with and without augmented data for Single Masking with three augmented sentences
5 Conclusions
The curves obtained with augmented data for text classification are, in general, on top of the curves corresponding to the plain data, thus indicating that the augmentation methods do improve accuracy.
As future work, we would first like to work with more data, which would allow us to produce better graphs and a second round of average results for the methods. Second, we would like to incorporate a class-label-related mechanism into our model, in order to preserve the label through the text augmentation process. Finally, we would like to run experiments on combinations of the methods developed in this work. This would shed light on which of them could be the best tool for enhancing our training data.
References
1. Agirre, E., Arregi, X., Otegi, A.: Document expansion based on WordNet for robust
IR. In: Coling 2010: Posters, pp. 9–17 (2010)
2. Anaby-Tavor, A., et al.: Not enough data? Deep learning to the rescue! arXiv
preprint 1911.03118 (2019)
3. Bowyer, K.W., Chawla, N.V., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic
minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)
4. Cabrera, J.M., Escalante, H.J., Montes-y-Gómez, M.: Distributional term repre-
sentations for short-text categorization. In: Gelbukh, A. (ed.) CICLing 2013, Part
II. LNCS, vol. 7817, pp. 335–346. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37256-8_28
5. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation policies from data. arXiv preprint 1805.09501 (2018)
6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidi-
rectional transformers for language understanding. In: Proceedings of the 2019
Conference of the NAACL, pp. 4171–4186, June 2019
7. Gong, Z., Cheang, C.W., Hou U, L.: Multi-term web query expansion using Word-
Net. In: Bressan, S., Küng, J., Wagner, R. (eds.) DEXA 2006. LNCS, vol. 4080,
pp. 379–388. Springer, Heidelberg (2006). https://doi.org/10.1007/11827405_37
8. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification.
arXiv preprint 1801.06146 (2018)
9. Kobayashi, S.: Contextual augmentation: data augmentation by words with
paradigmatic relations. In: Proceedings of the 2018 Conference of NAACL, pp.
452–457 (2018)
10. Kumar, V., Choudhary, A., Cho, E.: Data augmentation using pre-trained trans-
former models. arXiv preprint 2003.02245 (2020)
11. Lavelli, A., Sebastiani, F., Zanoli, R.: Distributional term representations: an
experimental comparison. In: Proceedings of the 13th ACM International Con-
ference on Information and Knowledge Management, pp. 615–624. ACM (2004)
12. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language
models are unsupervised multitask learners. Technical report (2019)
13. Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep
learning. J. Big Data 6, 60 (2019). https://doi.org/10.1186/s40537-019-0197-0
14. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017)
15. Wang, W.Y., Yang, D.: That’s so annoying!!!: A lexical and frame-semantic embed-
ding based data augmentation approach to automatic categorization of annoy-
ing behaviors using #petpeeve tweets. In: Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing, pp. 2557–2563 (2015)
16. Wei, J., Zou, K.: EDA: easy data augmentation techniques for boosting perfor-
mance on text classification tasks. In: Proceedings of 2019 Conference on Empirical
Methods in Natural Language Processing, pp. 6382–6388. Association for Compu-
tational Linguistics (2019)