Data Augmentation With Transformers For Text Classification
1 Introduction
Data Augmentation (DA) is the process of obtaining/generating additional data
for training machine learning models. Through the DA process, one is able to
reduce the risk of overfitting and increase the robustness of the machine learning
models when not enough data is available. DA has emerged in the context of
deep learning because these models require large amounts of data in order
to learn adequately. These techniques are commonly used in computer vision
where considerable improvements in performance are reported, see e.g., [5,13].
Despite this success, DA has not been thoroughly explored in the context of
Natural Language Processing (NLP), where there are plenty of domains in which
collecting manually labeled documents is complicated.
In this paper we explore DA in the context of NLP, specifically for text clas-
sification. We rely on the success that transformer-based models (e.g., Bert [6]
and GPT [12]) have achieved in different NLP tasks and use them to generate
synthetic documents that are in turn used to augment the initial training set
associated with text classification tasks. Transformers are powerful models implementing self-attention mechanisms that allow them to capture long-range dependencies among words in a sequence. Outstanding results in a wide variety of
tasks have been reported with these methods when used as both (pretrained)
feature extractors and end-to-end learners [6,8,12]. Since transformers are, in essence, language models, they can be sampled conditioned on certain inputs and thus used as text generators. We exploit this feature of transformers as a data augmentation mechanism. In a nutshell, we sample pretrained transform-
ers conditioned on training documents of a text classification task, and use the
outputs as augmented training instances. Four variants for generating artificial
samples are proposed and evaluated in benchmark data. Experimental results
show the usefulness of the augmented samples and motivate further research on
data augmentation for NLP.
The contributions of this paper are as follows:
– We explore the suitability of using transformers as data augmenters in the
context of text classification.
– We propose four variants to augment a training set of documents with the
outputs of transformers conditioned on the documents.
– We show the augmentation process is promising, motivating further research.
The remainder of this paper is organized as follows. The next section reviews
related work on data augmentation for NLP. Next Sect. 3 describes the proposed
methodology for data augmentation. Then Sect. 4 presents an experimental eval-
uation of the augmentation procedures. Finally, Sect. 5 outlines conclusions and
future work directions.
2 Related Work
This section briefly reviews related work on data augmentation for NLP tasks.
The idea of augmenting the available training documents for NLP tasks is not
new; see, for instance, [1,7,11]. However, early approaches for data augmentation
(either term or document expansion methods) mainly dealt with the task of
identifying words associated with the content of documents that could be added to them. These methods mostly relied on thesauri, semantic nets like WordNet [1,7], or co-occurrences [4,11]. In this way, a document could include more related terms, eventually addressing issues associated with synonymy and polysemy. Unlike early expansion strategies, modern data augmentation aims at generating artificial instances (instead of extending the content of available instances) based
on the available ones. In this sense, data augmentation is closer to oversampling
(e.g., see [3]) than to classical expansion methodologies.
Recent data augmentation efforts for NLP have adopted quite diverse methodologies; in the following we summarize the main paradigms. Wang and Yang generate artificial (word-embedding) representations for documents by using the most similar word embeddings to the words appearing in the initial document [15]. In this way, the artificial representations resemble documents with meaning related to the original ones. One should note, however, that no document is actually generated, but only the embedding-based representations. Wei et al. introduce four Easy Data Augmentation (EDA) strategies for generating modified versions of the available training sentences [16].
3.1 Transformers
Although there are several transformer-based models available, in this work we rely on two of the most effective and popular ones, namely BERT and GPT-2. BERT (Bidirectional Encoder Representations from Transformers) is a language model based on transformers [6]. It implements bidirectional self-attention mechanisms and has been trained with term-masking strategies. GPT-2, on the other hand, implements a unidirectional language model trained under a predict-next-word objective [12]. Both transformer models were trained on huge corpora and under a variety of settings. Pretrained models are publicly available, so anyone can use them as a starting point for their research. Using such pretrained models for solving a variety of NLP tasks is straightforward. However, the benefits of these methods for data augmentation have not been deeply explored so far (see [2,10]); we hope our study helps to better understand the capabilities and limitations of transformers for data augmentation.
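As a minimal sketch of this starting point (assuming the Hugging Face transformers library, which the paper does not name), both pretrained models can be loaded as ready-to-use pipelines:

```python
from transformers import pipeline

# Masked language model (BERT): fills in [MASK] tokens with candidate words.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Causal language model (GPT-2): continues a given prompt.
generator = pipeline("text-generation", model="gpt2")

print(unmasker("The movie was really [MASK]."))
print(generator("The movie was really", max_length=20))
```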
Fig. 1. Left: Single masking augmentation method. Words in a sentence are masked
(one at a time) and artificial sentences are produced by asking BERT to unmask the
masked words. Right: Examples for one, three, five and ten sentences created from the
original sentence using Single Masking Augmentation
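The following is a minimal sketch of the Single Masking step illustrated in Fig. 1, again assuming the Hugging Face fill-mask pipeline; the function name is ours and, since the paper does not detail how the unmasked word is chosen, we simply take BERT's top prediction:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def single_masking_augmentation(sentence, n_new=3):
    """Mask one word at a time and let BERT unmask it, collecting
    up to n_new artificial sentences."""
    words = sentence.split()
    augmented = []
    for i in range(len(words)):
        masked = words[:i] + [unmasker.tokenizer.mask_token] + words[i + 1:]
        prediction = unmasker(" ".join(masked), top_k=1)[0]  # BERT's top guess
        candidate = " ".join(words[:i] + [prediction["token_str"]] + words[i + 1:])
        if candidate != sentence:
            augmented.append(candidate)
        if len(augmented) == n_new:
            break
    return augmented

print(single_masking_augmentation("the movie was really good", n_new=3))
```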
process are now masked in a second and different random position; see the top of Fig. 2 for an illustration. Finally, a token is provided based on the order of the respective sentences. The final result is a set of sentences with two words changed by BERT. As with Single Masking Augmentation, we apply this method to produce one, three, five and ten sentences, as shown in the bottom of Fig. 2.
Triple Masking Augmentation. The third method corresponds to Triple
Masking Augmentation. The idea is the same as in the previous two methods,
but we now mask three words of the sentence. As we can see in Fig. 3, the Single
Masking Augmentation procedure is applied three times in series, thus creating
the set of new sentences with three different words in them. As in the previous
two methods, we run experiments for the generation of one, three, five and ten
sentences, as shown in Fig. 4.
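Under the same assumptions as before, the Double and Triple Masking variants can be sketched as repeated applications of the masking step, each pass masking a different random position of the already modified sentence (two passes for Double Masking, three for Triple Masking); the helper below is illustrative, not the authors' implementation:

```python
import random
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def multi_masking_augmentation(sentence, n_passes=2, n_new=3, seed=0):
    """Apply the single-masking step n_passes times in series, each pass
    replacing a word at a new random position, to build n_new sentences."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        words = sentence.split()
        positions = rng.sample(range(len(words)), k=min(n_passes, len(words)))
        for i in positions:
            words[i] = unmasker.tokenizer.mask_token
            prediction = unmasker(" ".join(words), top_k=1)[0]
            words[i] = prediction["token_str"]
        augmented.append(" ".join(words))
    return augmented

# Double Masking changes two words, Triple Masking changes three.
print(multi_masking_augmentation("the movie was really good", n_passes=2))
print(multi_masking_augmentation("the movie was really good", n_passes=3))
```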
Augmented Sentence. The fourth proposed variant for data augmentation
consists in augmenting sentences with GPT-2. The procedure is straightforward: as input we take a sentence that is fed into the GPT-2 model; then we ask the model to predict an augmented sentence with up to fifty words, as shown in Fig. 5. One should note that this procedure is very similar to the one described in [2,10].
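A minimal sketch of this Augmented Sentence variant, assuming the Hugging Face text-generation pipeline; the decoding settings are our assumptions, since the paper only states a limit of roughly fifty words:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def augment_sentence(sentence, max_extra_tokens=50):
    """Feed the sentence to GPT-2 and keep the original text plus the
    continuation it generates (roughly up to fifty extra words/tokens)."""
    prompt_tokens = len(generator.tokenizer(sentence)["input_ids"])
    output = generator(
        sentence,
        max_length=prompt_tokens + max_extra_tokens,
        num_return_sequences=1,
        do_sample=True,
        top_k=50,
    )
    return output[0]["generated_text"]

print(augment_sentence("the movie was really good"))
```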
4 Experimental Results
This section presents an experimental evaluation of the data augmentation
strategies described in Sect. 3. We first describe the experimental settings and
then we present the results obtained by each of the developed strategies.
For the experiments we used four benchmark datasets for text classification,
namely: SST-2 (Stanford Sentiment Treebank), IMDB (sentiment-related movie reviews), Spam, and Sentence Type (sentences classified as command, question, or statement). Each of the considered datasets has 600 instances; for the experimental evaluation we used 80% of the instances for training and the remainder
for testing. In a preliminary stage we used 20% of the training data as validation
set for hyperparameter tuning.
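As a rough sketch of this splitting protocol (scikit-learn is our choice, and `texts`/`labels` are assumed to hold the 600 documents and labels of one dataset):

```python
from sklearn.model_selection import train_test_split

# texts, labels: the 600 documents of one dataset and their class labels.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.20, random_state=0)

# In a preliminary stage, 20% of the training portion served as a
# validation set for hyperparameter tuning.
tr_texts, val_texts, tr_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.20, random_state=0)
```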
As classification models we used straightforward, state-of-the-art models based on deep learning: a Convolutional Neural Network (CNN) and a Long Short-Term Memory Network (LSTM-RNN). Both were implemented following the implementations of Wei et al. [16]. The CNN has the following layers: embedding, one-dimensional convolution (ReLU activation), dropout, one-dimensional global max pooling, dense (ReLU), and dense (sigmoid); it was trained for ten epochs with a binary cross-entropy loss and the Adam optimizer. The LSTM-RNN has the following layers: embedding, LSTM, a fully connected layer (FC1) with ReLU activation, dropout, and a sigmoid output layer; it was also trained for ten epochs with a binary cross-entropy loss and the RMSprop optimizer.
Fig. 4. Examples for one, three, five and ten sentences created from the original sentence using Triple Masking Augmentation
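Since the paper gives only the layer order, losses, and optimizers of the two classifiers, the following Keras sketch fills in assumed values for the vocabulary size, embedding dimension, and layer widths:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, EMB_DIM = 10000, 100  # assumed values, not given in the paper

def build_cnn():
    # embedding -> 1D convolution (ReLU) -> dropout -> global max pooling
    # -> dense (ReLU) -> dense (sigmoid); binary cross-entropy + Adam.
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM),
        layers.Conv1D(128, 5, activation="relu"),
        layers.Dropout(0.5),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

def build_lstm():
    # embedding -> LSTM -> fully connected (ReLU) -> dropout -> sigmoid output;
    # binary cross-entropy + RMSprop.
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM),
        layers.LSTM(64),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
    return model

# Both models were trained for ten epochs, e.g.:
# build_cnn().fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val))
```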
4.2 Results
Each of the four strategies described above was evaluated using the datasets and classifiers just mentioned, with test-set accuracy as the leading evaluation measure. The augmentation methods based on BERT (i.e., single, double, and triple masking) were tested when augmenting one, three, five and ten sentences, in order to determine whether there was a relationship between the number of augmented sentences and the resulting classification performance.
Table 1. Classification performance with Single Masking for one, three, five and ten sentences augmented. Plain indicates the performance obtained by each of the models without performing any augmentation. The best result in each row is shown in bold.

Dataset        Model  Plain  Plain + Single Masking
                              1     3     5     10
Spam           RNN    0.79   0.86  0.84  0.86  0.92
Spam           CNN    0.59   0.68  0.67  0.65  0.63
IMDB           RNN    0.71   0.72  0.77  0.75  0.71
IMDB           CNN    0.79   0.76  0.77  0.72  0.74
Sentence Type  RNN    0.35   0.42  0.46  0.53  0.56
Sentence Type  CNN    0.40   0.43  0.44  0.40  0.40
SST2           RNN    0.59   0.63  0.68  0.66  0.67
SST2           CNN    0.68   0.63  0.66  0.65  0.71
Average        RNN    0.61   0.65  0.68  0.70  0.71
Average        CNN    0.61   0.62  0.63  0.60  0.62
Table 2 shows the results obtained with the Double Masking Augmentation strategy. From this table it can be seen that, on average, this strategy is able to improve the performance of the baseline.
Table 2. Results for DA with Double Masking for one, three, five and ten sentences augmented.

Dataset        Model  Plain  Plain + Double Masking
                              1     3     5     10
Spam           RNN    0.79   0.82  0.77  0.78  0.86
Spam           CNN    0.59   0.65  0.57  0.66  0.74
IMDB           RNN    0.71   0.71  0.69  0.76  0.70
IMDB           CNN    0.79   0.76  0.76  0.74  0.70
Sentence Type  RNN    0.35   0.37  0.44  0.35  0.50
Sentence Type  CNN    0.40   0.43  0.40  0.35  0.50
SST2           RNN    0.59   0.64  0.66  0.69  0.66
SST2           CNN    0.68   0.68  0.65  0.70  0.70
Average        RNN    0.61   0.63  0.64  0.64  0.68
Average        CNN    0.61   0.63  0.59  0.61  0.66
Augmenting ten sentences gave better results for both models, the LSTM-RNN and the CNN. This result seems to corroborate the hypothesis that a larger number of augmented sentences yields better classification performance.
Table 3 shows the results obtained with the Triple Masking Augmentation
strategy. Based on the results from Table 3 it can be seen that this strategy
also improves the performance of the baseline for both considered classification
models. Augmenting ten sentences resulted in better classification performance
for the LSTM-RNN and one sentence worked better for the CNN.
Table 3. Results for DA with Triple Masking for one, three, five and ten sentences augmented.

Dataset        Model  Plain  Plain + Triple Masking
                              1     3     5     10
Spam           RNN    0.79   0.74  0.77  0.67  0.85
Spam           CNN    0.59   0.69  0.65  0.62  0.57
IMDB           RNN    0.71   0.72  0.70  0.70  0.77
IMDB           CNN    0.79   0.75  0.76  0.70  0.76
Sentence Type  RNN    0.35   0.43  0.38  0.47  0.51
Sentence Type  CNN    0.40   0.42  0.35  0.36  0.50
SST2           RNN    0.59   0.64  0.69  0.70  0.63
SST2           CNN    0.68   0.70  0.71  0.70  0.63
Average        RNN    0.61   0.63  0.63  0.63  0.69
Average        CNN    0.61   0.64  0.62  0.59  0.61
Last but not least, the results for the Augmented Sentence strategy are pre-
sented in Table 4. From this table it can be observed that there were improve-
ments mostly for the LSTM-RNN model, whereas the augmentation did not
seem to improve the performance of the CNN classifier.
Table 4. Results for DA with the Augmented Sentence strategy.

Dataset        Model  Plain  Plain + Augmentation
Spam           RNN    0.79   0.75
Spam           CNN    0.59   0.61
IMDB           RNN    0.71   0.75
IMDB           CNN    0.79   0.76
Sentence Type  RNN    0.35   0.46
Sentence Type  CNN    0.40   0.41
SST2           RNN    0.59   0.62
SST2           CNN    0.68   0.67
Average        RNN    0.61   0.64
Average        CNN    0.61   0.61
Table 5. Summary of the results for the four methods along with their average and standard deviation.
To get a better insight into how the proposed methods work when using different amounts of training data, we subsampled the training set and evaluated the performance of the augmentation for the LSTM-RNN classifier. For this
experiment we considered only the Single Masking method. Our hypothesis was
that the smaller the size of the training set, the larger the impact of the different
augmentation strategies in the classification performance. Figure 6 shows the
result of the experiment for the four datasets.
From Fig. 6, we cannot see the behavior we were expecting. Still, it can be seen that the augmentation strategy outperforms the baseline when half of the training set (and more) is used. The fact that we could not obtain larger improvement margins with fewer documents could be due to the training data being very limited; also, one should note that we trained the models for only ten epochs.
Fig. 6. Performance on text classification tasks with and without augmented data for Single Masking with three augmented sentences
5 Conclusions
The curves obtained with augmented data for text classification are, in general, on top of the curves corresponding to the plain data, thus indicating that the augmentation methods do improve accuracy.
As future work, we would first like to work with more data, which would allow us to produce better graphs and a second round of average results for the methods. Second, we would like to incorporate a class-label-related mechanism into our model, in order to preserve the label through the text augmentation process. Finally, we would like to run experiments on combinations of the methods developed in this work. This would shed light on which of them could be the best tool for enhancing our training data.
References
1. Agirre, E., Arregi, X., Otegi, A.: Document expansion based on WordNet for robust
IR. In: Coling 2010: Posters, pp. 9–17 (2010)
2. Anaby-Tavor, A., et al.: Not enough data? Deep learning to the rescue! arXiv
preprint 1911.03118 (2019)
3. Bowyer, K.W., Chawla, N.V., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic
minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)
4. Cabrera, J.M., Escalante, H.J., Montes-y-Gómez, M.: Distributional term repre-
sentations for short-text categorization. In: Gelbukh, A. (ed.) CICLing 2013, Part
II. LNCS, vol. 7817, pp. 335–346. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37256-8_28
5. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation policies from data. arXiv preprint 1805.09501 (2018)
6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidi-
rectional transformers for language understanding. In: Proceedings of the 2019
Conference of the NAACL, pp. 4171–4186, June 2019
7. Gong, Z., Cheang, C.W., Hou U, L.: Multi-term web query expansion using Word-
Net. In: Bressan, S., Küng, J., Wagner, R. (eds.) DEXA 2006. LNCS, vol. 4080,
pp. 379–388. Springer, Heidelberg (2006). https://doi.org/10.1007/11827405_37
8. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification.
arXiv preprint 1801.06146 (2018)
9. Kobayashi, S.: Contextual augmentation: data augmentation by words with
paradigmatic relations. In: Proceedings of the 2018 Conference of NAACL, pp.
452–457 (2018)
10. Kumar, V., Choudhary, A., Cho, E.: Data augmentation using pre-trained trans-
former models. arXiv preprint 2003.02245 (2020)
11. Lavelli, A., Sebastiani, F., Zanoli, R.: Distributional term representations: an
experimental comparison. In: Proceedings of the 13th ACM International Con-
ference on Information and Knowledge Management, pp. 615–624. ACM (2004)
12. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language
models are unsupervised multitask learners. Technical report (2019)
13. Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep
learning. J. Big Data 6, 60 (2019). https://doi.org/10.1186/s40537-019-0197-0
14. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017)
15. Wang, W.Y., Yang, D.: That’s so annoying!!!: A lexical and frame-semantic embed-
ding based data augmentation approach to automatic categorization of annoy-
ing behaviors using #petpeeve tweets. In: Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing, pp. 2557–2563 (2015)
16. Wei, J., Zou, K.: EDA: easy data augmentation techniques for boosting perfor-
mance on text classification tasks. In: Proceedings of 2019 Conference on Empirical
Methods in Natural Language Processing, pp. 6382–6388. Association for Compu-
tational Linguistics (2019)