Data Augmentation with Transformers for Text Classification

José Medardo Tapia-Téllez and Hugo Jair Escalante

Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
{tapiatellez,hugojair}@inaoep.mx

Abstract. The current deep learning revolution has established transformer-based architectures as the state of the art in several natural language processing (NLP) tasks. However, it is not clear whether such models can also be used to enhance other aspects of the learning pipeline in the NLP context. This paper presents a study in that direction: in particular, we explore the suitability of transformer models as a data augmentation mechanism for text classification. We introduce four ways of using transformer models to augment data for text classification. Each variant takes the outputs of a transformer model, fed with training documents, and uses those outputs as additional training data. The proposed strategies are evaluated on benchmark data using CNN- and LSTM-based classifiers. Experimental results are promising: improvements over a model trained on the plain documents are consistent.

Keywords: Text classification · Data Augmentation · Transformers

1 Introduction
Data Augmentation (DA) is the process of obtaining/generating additional data for training machine learning models. Through DA, one can reduce the risk of overfitting and increase the robustness of machine learning models when not enough data is available. DA has emerged in the context of deep learning because these models require large amounts of data in order to learn adequately. Such techniques are commonly used in computer vision, where considerable improvements in performance have been reported, see e.g., [5,13]. Despite this success, DA has not been thoroughly explored in the context of Natural Language Processing (NLP), where there are plenty of domains in which collecting manually labeled documents is complicated.
In this paper we explore DA in the context of NLP, specifically for text classification. We rely on the success that transformer-based models (e.g., BERT [6] and GPT [12]) have reported in different NLP tasks and use them to generate synthetic documents that are in turn used to augment the initial training set associated with text classification tasks. Transformers are powerful models implementing self-attention mechanisms that allow them to capture long-range dependencies among words in a sequence. Outstanding results in a wide variety of
tasks have been reported with these methods when used both as (pretrained) feature extractors and as end-to-end learners [6,8,12]. Since transformers are in essence language models, they can be sampled conditioned on certain inputs and used as text generators. We exploit this feature of transformers as a data augmentation mechanism. In a nutshell, we sample pretrained transformers conditioned on training documents of a text classification task, and use the outputs as augmented training instances. Four variants for generating artificial samples are proposed and evaluated on benchmark data. Experimental results show the usefulness of the augmented samples and motivate further research on data augmentation for NLP.
The contributions of this paper are as follows:
– We explore the suitability of using transformers as data augmenters in the
context of text classification.
– We propose four variants to augment a training set of documents with the
outputs of transformers conditioned on the documents.
– We show the augmentation process is promising, motivating further research.
The remainder of this paper is organized as follows. The next section reviews related work on data augmentation for NLP. Next, Sect. 3 describes the proposed methodology for data augmentation. Then, Sect. 4 presents an experimental evaluation of the augmentation procedures. Finally, Sect. 5 outlines conclusions and future work directions.

2 Related Work
This section briefly reviews related work on data augmentation for NLP tasks. The idea of augmenting the available training documents for NLP tasks is not new, see for instance [1,7,11]. However, early approaches to data augmentation (either term or document expansion methods) mainly dealt with the task of identifying words associated with the content of documents that could be added to it. These methods mostly relied on thesauri, semantic nets like WordNet [1,7], or co-occurrences [4,11]. In this way, a document could include more related terms, eventually addressing issues associated with synonymy and polysemy. Differently from early expansion strategies, modern data augmentation aims at generating artificial instances (instead of extending the content of available instances) based on the available ones. In this sense, data augmentation is closer to oversampling (e.g., see [3]) than to classical expansion methodologies.
Recent data augmentation efforts for NLP have adopted quite diverse methodologies; in the following we summarize the main paradigms. Wang and Yang generate artificial (word-embedding) representations for documents by using the word embeddings most similar to the words appearing in the initial document [15]. In this way, the artificial representations resemble documents with meaning related to the original ones. One should note, however, that no document is actually generated, only the embedding-based representations. Wei et al. introduce four Easy Data Augmentation (EDA) strategies for generating
artificial documents: synonym replacement (randomly choosing n words from a sentence and replacing them with a synonym); random insertion (inserting a random synonym of a random word at a random position in the sentence); random swap (randomly choosing two words in the sentence and swapping their positions); and random deletion (randomly removing each word in the sentence with probability p). EDA improves performance for both convolutional and recurrent neural networks, and is particularly helpful for smaller data sets. Kobayashi proposes a method for augmenting labeled sentences [9], called contextual augmentation: words are stochastically replaced with other words predicted by a bidirectional language model at the corresponding positions. It is important to emphasize that the language model uses a label-conditional architecture, which allows it to augment sentences without breaking label compatibility. Results on six different classification tasks show improvements when using a convolutional neural network.
In parallel to our work, Kumar et al. proposed a method that uses transformers for data augmentation in NLP [10]. The authors feed transformer models with documents and label information and synthesize new samples. Although they also use transformers, the way new instances are generated is different: they feed the transformer models with a combination of the source document and its label, and ask the model to predict the most probable sequence following the fed input. This synthesis strategy was first proposed in [2], obtaining satisfactory results. Compared to these related efforts, our proposal adopts a different approach to generate synthetic documents: we mask words and ask the model to predict the missing words; the predicted words are then used to generate new documents. We also developed a data augmentation procedure based on sentences, which resembles to some extent the method in [10].

3 Data Augmentation with Transformers


This section describes the proposed methodology for data augmentation in text classification. We first briefly introduce transformers, then we describe the augmentation process and the four considered variants.

3.1 Transformers

A transformer is a deep network architecture for processing sequential data that is based entirely on attention mechanisms [14]. These models do not aim to explicitly model sequential information as classical recurrent neural networks (RNNs) do; instead, they implicitly capture such information via self-attention layers. An attention mechanism indicates which parts of a document (words, sentences) should have more weight in the modeling process. A self-attention module thus learns which parts of an input document should receive higher weight in the layers at upper levels. Transformers learn multiple self-attention mechanisms in parallel in an end-to-end way and rely on an encoder-decoder architecture to solve a variety of NLP tasks.
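To make the notion of self-attention concrete, the following is a minimal Python sketch of scaled dot-product self-attention over a sequence of word vectors. It is an illustration, not the implementation used in this work; all variable names and dimensions are assumptions.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors.

    X : (seq_len, d_model) input word representations
    Wq, Wk, Wv : (d_model, d_k) learned projection matrices
    Returns (seq_len, d_k) context vectors, one per input position.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise relevance of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                        # weighted sum of value vectors

# Toy example: 5 "words", 8-dimensional embeddings, 4-dimensional projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 4)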
Although there are several transformer-based models, in this work we rely on two of the most effective and popular ones, namely BERT and GPT-2. BERT (Bidirectional Encoder Representations from Transformers) is a language model based on transformers [6]. It implements bidirectional self-attention mechanisms and has been trained using term-masking strategies. GPT-2, on the other hand, implements a unidirectional language model trained under a predict-next-word objective [12]. Both transformer models were trained on huge corpora and under a variety of settings. Pretrained models are publicly available, so anyone can use them as a starting point for their research, and using such pretrained models to solve a variety of NLP tasks is straightforward. However, the benefits of these methods for data augmentation have not been explored in depth so far (see [2,10]); we hope our study helps to better understand the capabilities and limitations of transformers for data augmentation.

3.2 Generation of Synthetic Documents


The goal of this work is to explore the benefits of using transformers for data augmentation in the context of text classification. As previously mentioned, we propose four variants to generate synthetic documents. In all of them, the idea is to feed a transformer (either BERT or GPT-2) with training documents of the classification task at hand and use different strategies to generate artificial documents. Each synthetically generated document is assigned the same label as the source document used for the generation process. Synthetic and original training documents are both used for training a classification model. In the remainder of this section we detail the four variants; the first three mask words using BERT, and the remaining one expands sentences using GPT-2.
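As a high-level sketch of this label-preserving augmentation loop, the snippet below shows how synthetic documents inherit the label of their source; the generate_variants helper is a hypothetical placeholder for any of the four variants described next, not part of the original method.

from typing import Callable, List, Tuple

def augment_training_set(
    docs: List[str],
    labels: List[str],
    generate_variants: Callable[[str, int], List[str]],
    n_per_doc: int = 3,
) -> Tuple[List[str], List[str]]:
    """Return original plus synthetic documents; each synthetic copy
    inherits the label of the source document it was generated from."""
    aug_docs, aug_labels = list(docs), list(labels)
    for doc, label in zip(docs, labels):
        for synthetic in generate_variants(doc, n_per_doc):
            aug_docs.append(synthetic)
            aug_labels.append(label)  # label preserved by construction
    return aug_docs, aug_labels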
Single Masking Augmentation. The first method is Single Masking Augmentation. A given sentence is first tokenized. Then a random word in the sentence is selected and masked; this prevents BERT from taking that word into account and lets it predict new words based on the other words in the sentence and the position of the masked item. We then regroup the sentence, now containing the masked item, and feed this masked sentence into BERT, which predicts a series of tokens (words) for unmasking the masked word. Based on a predefined number of augmented sentences per sentence in the training set, we produce new sentences with the respective tokens in the masked position. For example, if we want to generate three new sentences, the first three predicted tokens are used to create them. Figure 1 graphically depicts this variant: the left plot shows the procedure and the right plot shows sample generated sentences.

Fig. 1. Left: Single Masking Augmentation method. Words in a sentence are masked (one at a time) and artificial sentences are produced by asking BERT to unmask the masked words. Right: Examples of one, three, five and ten sentences created from the original sentence using Single Masking Augmentation.
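The following is a minimal sketch of Single Masking Augmentation using the Hugging Face transformers fill-mask pipeline; it assumes bert-base-uncased and simple whitespace tokenization, and is an illustration rather than the exact code used in the experiments.

import random
from transformers import pipeline

# fill-mask returns the top-ranked predictions for the [MASK] token
unmasker = pipeline("fill-mask", model="bert-base-uncased")

def single_masking_augment(sentence: str, n_sentences: int = 3) -> list:
    """Mask one random word and let BERT propose replacement sentences."""
    tokens = sentence.split()
    if len(tokens) < 2:
        return []
    idx = random.randrange(len(tokens))  # random position to mask
    masked = tokens[:idx] + [unmasker.tokenizer.mask_token] + tokens[idx + 1:]
    predictions = unmasker(" ".join(masked), top_k=n_sentences)
    # one new sentence per predicted token; the label is inherited from the source
    return [p["sequence"] for p in predictions]

print(single_masking_augment("the movie was surprisingly good", n_sentences=3))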
Double Masking Augmentation. The second method is Double Masking Augmentation. The idea is similar to Single Masking Augmentation, but now we mask two random words of the original sentence. To do this, we apply the Single Masking Augmentation process in series: the sentences obtained from the Single Masking Augmentation process are masked again at a second, different random position (see the top of Fig. 2 for an illustration). Finally, a token is provided based on the order of the respective sentences. The final results are sentences with two words changed by BERT. As with Single Masking Augmentation, we apply this method to produce one, three, five and ten sentences, as shown in the bottom of Fig. 2.

Fig. 2. Top: Double Masking Augmentation: we apply Single Masking Augmentation in series to generate sentences with two modified words. Bottom: Examples of one, three, five and ten sentences created from the original sentence using Double Masking Augmentation.
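Continuing the sketch above, the double (and triple) variants simply chain the single-masking step; the helper below reuses single_masking_augment from the previous snippet and is again illustrative only.

def multi_masking_augment(sentence: str, n_masks: int = 2, n_sentences: int = 3) -> list:
    """Chain single_masking_augment so that each pass masks and replaces one more word.
    Note: the method described here masks a *different* random position on every pass;
    this simple sketch does not enforce that constraint."""
    candidates = [sentence]
    for _ in range(n_masks):
        next_round = []
        for cand in candidates:
            next_round.extend(single_masking_augment(cand, n_sentences))
        candidates = next_round[:n_sentences]  # keep only the requested number
    return candidates

# n_masks=2 corresponds to Double Masking, n_masks=3 to Triple Masking
print(multi_masking_augment("the movie was surprisingly good", n_masks=2))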
Triple Masking Augmentation. The third method is Triple Masking Augmentation. The idea is the same as in the previous two methods, but we now mask three words of the sentence. As shown in Fig. 3, the Single Masking Augmentation procedure is applied three times in series, thus creating a set of new sentences with three different words in them. As in the previous two methods, we run experiments generating one, three, five and ten sentences, as shown in Fig. 4.

Fig. 3. Triple Masking Augmentation: we apply Single Masking Augmentation three times in series to generate sentences with three modified words.

Fig. 4. Examples of one, three, five and ten sentences created from the original sentence using Triple Masking Augmentation.
Augmented Sentence. The fourth proposed variant for data augmentation consists in augmenting sentences with GPT-2. The procedure is straightforward: as input we take a sentence and feed it into the GPT-2 model, then we ask the model to predict an augmented sentence of up to fifty words, as shown in Fig. 5. One should note that this procedure is very similar to the one described in [2,10].

Fig. 5. Example of an augmented sentence produced by the Augmented Sentence method.
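A minimal sketch of the Augmented Sentence idea using the Hugging Face text-generation pipeline with GPT-2 is shown below; the 50-token limit approximates the fifty-word cap described above, and the decoding settings are assumptions rather than the configuration used in the experiments.

from transformers import pipeline

# GPT-2 continues the input sentence; the continuation becomes a synthetic document
generator = pipeline("text-generation", model="gpt2")

def augmented_sentence(sentence: str, max_new_tokens: int = 50) -> str:
    """Extend a sentence with GPT-2; the result inherits the source label."""
    out = generator(
        sentence,
        max_new_tokens=max_new_tokens,  # roughly the fifty-word cap
        do_sample=True,                 # sample instead of greedy decoding
        num_return_sequences=1,
    )
    return out[0]["generated_text"]

print(augmented_sentence("the movie was surprisingly good"))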

4 Experimental Results
This section presents an experimental evaluation of the data augmentation
strategies described in Sect. 3. We first describe the experimental settings and
then we present the results obtained by each of the developed strategies.

4.1 Experimental Settings

For the experiments we used four benchmark datasets for text classification,
namely: SST-2 (Stanford Sentiment Treebak), IMDB (Sentiment-related movie
reviews), Spam, and Sentence Type (sentences classified as command, question
and statement). Each of the considered datasets have 600 instances, for the
experimental evaluation we used 80% of instances for training and the remainder
for testing. In a preliminary stage we used 20% of the training data as validation
set for hyperparameter tuning.
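As a sketch of this evaluation protocol (dataset loading is omitted and the random seed is an assumption; only the split proportions follow the description above):

from sklearn.model_selection import train_test_split

def make_splits(texts, labels, seed=0):
    """80% train / 20% test, with 20% of the training portion held out for validation."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed)
    x_tr, x_val, y_tr, y_val = train_test_split(
        x_train, y_train, test_size=0.2, random_state=seed)
    return (x_tr, y_tr), (x_val, y_val), (x_test, y_test)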
As classification models we used straightforward state-of-the-art models based on deep learning: a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM-RNN). The CNN and LSTM-RNN were based on the implementations of Wei et al. [16]. The CNN has the following layers: embedding, one-dimensional convolution (ReLU activation), dropout, one-dimensional global max pooling, dense (ReLU), and dense (sigmoid); it was trained for ten epochs with a binary cross-entropy loss and the Adam optimizer. The LSTM-RNN has the following layers: embedding, LSTM, fully connected, activation (ReLU), dropout, output layer, and activation (sigmoid); it was also trained for ten epochs with a binary cross-entropy loss and the RMSprop optimizer.
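As an illustration of the two classifiers described above, here is a Keras sketch reconstructed from that description; layer sizes, vocabulary size and embedding dimension are assumptions, not values reported in the paper.

import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMB_DIM = 10000, 128  # assumed values

def build_cnn() -> tf.keras.Model:
    """Embedding -> Conv1D (ReLU) -> dropout -> global max pooling
    -> dense (ReLU) -> dense (sigmoid); binary cross-entropy + Adam."""
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM),
        layers.Conv1D(128, 5, activation="relu"),
        layers.Dropout(0.5),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def build_lstm() -> tf.keras.Model:
    """Embedding -> LSTM -> dense (ReLU) -> dropout -> sigmoid output;
    binary cross-entropy + RMSprop."""
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM),
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Both models are trained for ten epochs, e.g.:
# build_cnn().fit(x_train, y_train, epochs=10, validation_split=0.2)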

4.2 Results

Each of the four strategies described above was evaluated using the datasets and classifiers just mentioned, with test-set accuracy as the leading evaluation measure. The augmentation methods using BERT (i.e., single, double, and triple masking) were tested when augmenting one, three, five and ten sentences, in order to determine whether there was a relationship between the number of augmented sentences and accuracy. The augmentation method using GPT-2 (i.e., Augmented Sentence) always produced a sentence of the specified length; our goal was to use it as a reference, as this method resembles state-of-the-art augmentation methods introduced elsewhere [2,10].
Table 1 shows the results obtained with the Single Masking Augmentation strategy for different numbers of augmented sentences. Based on these results we can conclude that, on average, the original training data plus the data generated by the proposed augmentation method improves accuracy with respect to the baseline (Plain, which refers to the model trained only on the original data) for both of the considered classification models. We can also conclude that, on average, Single Masking with ten augmented sentences achieves the best accuracy with the LSTM-RNN; this was not the case for the CNN, where the best result was obtained when three sentences were augmented.

Table 1. Classification performance with Single Masking for one, three, five and ten sentences augmented. Plain indicates the performance obtained by each of the models without performing any augmentation. The best result in each row is shown in bold.

               Model  Plain  Plain + Single Masking
                             1     3     5     10
Spam           RNN    0.79   0.86  0.84  0.86  0.92
               CNN    0.59   0.68  0.67  0.65  0.63
IMDB           RNN    0.71   0.72  0.77  0.75  0.71
               CNN    0.79   0.76  0.77  0.72  0.74
Sentence Type  RNN    0.35   0.42  0.46  0.53  0.56
               CNN    0.40   0.43  0.44  0.40  0.40
SST2           RNN    0.59   0.63  0.68  0.66  0.67
               CNN    0.68   0.63  0.66  0.65  0.71
Average        RNN    0.61   0.65  0.68  0.70  0.71
               CNN    0.61   0.62  0.63  0.60  0.62

Table 2 shows the results obtained with the Double Masking Augmentation strategy. From this table it can be seen that, on average, this strategy is able to improve the performance of the baseline.
Table 2. Results for DA with Double Masking for one, three, five and ten sentences augmented

               Model  Plain  Plain + Double Masking
                             1     3     5     10
Spam           RNN    0.79   0.82  0.77  0.78  0.86
               CNN    0.59   0.65  0.57  0.66  0.74
IMDB           RNN    0.71   0.71  0.69  0.76  0.70
               CNN    0.79   0.76  0.76  0.74  0.70
Sentence Type  RNN    0.35   0.37  0.44  0.35  0.50
               CNN    0.40   0.43  0.40  0.35  0.50
SST2           RNN    0.59   0.64  0.66  0.69  0.66
               CNN    0.68   0.68  0.65  0.70  0.70
Average        RNN    0.61   0.63  0.64  0.64  0.68
               CNN    0.61   0.63  0.59  0.61  0.66

Augmenting ten sentences gave better results for both models, the LSTM-RNN and the CNN. This result seems to corroborate the hypothesis that a larger number of augmented sentences yields better classification performance.
Table 3 shows the results obtained with the Triple Masking Augmentation strategy. Based on these results it can be seen that this strategy also improves the performance of the baseline for both considered classification models. Augmenting ten sentences resulted in better classification performance for the LSTM-RNN, while one sentence worked better for the CNN.

Table 3. Results for DA with Triple Masking for one, three, five and ten sentences

               Model  Plain  Plain + Triple Masking
                             1     3     5     10
Spam           RNN    0.79   0.74  0.77  0.67  0.85
               CNN    0.59   0.69  0.65  0.62  0.57
IMDB           RNN    0.71   0.72  0.70  0.70  0.77
               CNN    0.79   0.75  0.76  0.70  0.76
Sentence Type  RNN    0.35   0.43  0.38  0.47  0.51
               CNN    0.40   0.42  0.35  0.36  0.50
SST2           RNN    0.59   0.64  0.69  0.70  0.63
               CNN    0.68   0.70  0.71  0.70  0.63
Average        RNN    0.61   0.63  0.63  0.63  0.69
               CNN    0.61   0.64  0.62  0.59  0.61
Last but not least, the results for the Augmented Sentence strategy are presented in Table 4. From this table it can be observed that there were improvements mostly for the LSTM-RNN model, whereas the augmentation did not seem to improve the performance of the CNN classifier.

Table 4. Results obtained for the Augmented Sentence method.

               Model  Plain  Plain + Augmentation
Spam           RNN    0.79   0.75
               CNN    0.59   0.61
IMDB           RNN    0.71   0.75
               CNN    0.79   0.76
Sentence Type  RNN    0.35   0.46
               CNN    0.40   0.41
SST2           RNN    0.59   0.62
               CNN    0.68   0.67
Average        RNN    0.61   0.64
               CNN    0.61   0.61

Finally, Table 5 summarizes the average performance across datasets obtained by the different variants. It is clear from this table that only in two out of the 24 configurations we tested did the use of data augmentation decrease the performance of the baseline model, and in most cases an improvement was reported. It can also be seen that better results were obtained with the masking strategies than with the sentence augmentation method, which can be seen as a reference for state-of-the-art augmentation techniques [2,10].

Table 5. Summary of the results for the four methods along with their average and standard deviation.

           Plain  Plain + Single Masking   Plain + Double Masking   Plain + Triple Masking   Plain + A. S.
                  1     3     5     10     1     3     5     10     1     3     5     10
Avg. RNN   0.61   0.65  0.68  0.70  0.71   0.63  0.64  0.64  0.68   0.63  0.63  0.63  0.69   0.64
Avg. CNN   0.61   0.62  0.63  0.60  0.62   0.63  0.59  0.61  0.66   0.64  0.62  0.59  0.61   0.61
Tot. avg.  0.61   0.63  0.65  0.65  0.66   0.63  0.61  0.62  0.67   0.63  0.62  0.61  0.65   0.62
Std. dev.  0.0    0.02  0.03  0.07  0.06   0.0   0.03  0.02  0.01   0.01  0.01  0.02  0.05   0.02
To get better insight into how the proposed methods work with different amounts of training data, we subsampled the training set and evaluated the performance of the augmentation for the LSTM-RNN classifier. For this experiment we considered only the Single Masking method. Our hypothesis was that the smaller the size of the training set, the larger the impact of the different augmentation strategies on classification performance. Figure 6 shows the result of the experiment for the four datasets.
From Fig. 6, we cannot see the behavior we were expecting. However, it can still be seen that the augmentation strategy outperforms the baseline when half of the training set or more is used. The fact that we could not obtain larger improvement margins with fewer documents could be due to the training data being very limited; also, one should note that we trained the model for only 10 epochs.

Fig. 6. Performance on text classification tasks with and without augmented data, for Single Masking with three augmented sentences.

5 Conclusions

We proposed four data augmentation strategies for text classification based on transformers (BERT and GPT-2), namely: Single Masking, Double Masking, Triple Masking, and Augmented Sentence, the latter resembling a state-of-the-art solution. On average, the four evaluated methods improved the performance of two classification models, a CNN and an LSTM-RNN. Regarding the masking methods, Single Masking obtained the best performance, although the Double and Triple Masking methods also improved over the baseline. In general, better performance was observed when more sentences were added.
Experimental results with different amounts of training data were also reported. These results seem to indicate two things: first, we need more data in order to perform more experiments and to validate the tabulated results through graphs; and second, the curves for classification with augmented data are, in general, on top of the curves for the plain data, thus indicating that the augmentation methods do improve accuracy.
As future work, we would first like to work with more data, with which we could produce better graphs and a second round of average results for the methods. Second, we would like to incorporate class-label conditioning into our model in order to preserve the label during the text augmentation process. Finally, we would like to run experiments on combinations of the methods developed in this work. This would shed light on which could be the best tool to enhance our training data.

Acknowledgements. This work was partially supported by CONACyT under project grant A1-S-26314, Integración de Visión y Lenguaje mediante Representaciones Multimodales Aprendidas para Clasificación y Recuperación de Imágenes y Videos.

References
1. Agirre, E., Arregi, X., Otegi, A.: Document expansion based on WordNet for robust IR. In: Coling 2010: Posters, pp. 9–17 (2010)
2. Anaby-Tavor, A., et al.: Not enough data? Deep learning to the rescue! arXiv preprint arXiv:1911.03118 (2019)
3. Bowyer, K.W., Chawla, N.V., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)
4. Cabrera, J.M., Escalante, H.J., Montes-y-Gómez, M.: Distributional term representations for short-text categorization. In: Gelbukh, A. (ed.) CICLing 2013, Part II. LNCS, vol. 7817, pp. 335–346. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37256-8_28
5. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018)
6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the NAACL, pp. 4171–4186, June 2019
7. Gong, Z., Cheang, C.W., Hou U, L.: Multi-term web query expansion using WordNet. In: Bressan, S., Küng, J., Wagner, R. (eds.) DEXA 2006. LNCS, vol. 4080, pp. 379–388. Springer, Heidelberg (2006). https://doi.org/10.1007/11827405_37
8. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
9. Kobayashi, S.: Contextual augmentation: data augmentation by words with paradigmatic relations. In: Proceedings of the 2018 Conference of NAACL, pp. 452–457 (2018)
10. Kumar, V., Choudhary, A., Cho, E.: Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245 (2020)
11. Lavelli, A., Sebastiani, F., Zanoli, R.: Distributional term representations: an experimental comparison. In: Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pp. 615–624. ACM (2004)
12. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. Technical report (2019)
13. Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep learning. J. Big Data 6, 60 (2019). https://doi.org/10.1186/s40537-019-0197-0
14. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017)
15. Wang, W.Y., Yang, D.: That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2557–2563 (2015)
16. Wei, J., Zou, K.: EDA: easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 6382–6388. Association for Computational Linguistics (2019)
