Performance of Data Augmentation Methods For Brazilian Portuguese Text Classification
Abstract

…researchers. Data augmentation techniques are often used towards achieving this target, and most of their evaluation is done using English corpora. In this work, we took advantage of different existing data augmentation methods to analyze their performance when applied to text classification problems on Brazilian Portuguese corpora. As a result, our analysis shows some putative improvements from using some of these techniques; however, it also suggests further investigation of language bias and non-English text data scarcity.

1 Introduction

Text classification is a common and essential task in natural language processing (NLP). Much work has been done in the area, and state-of-the-art results reach high accuracy on several related tasks such as sentiment analysis (Thongtan and Phienthrakul, 2019; Jiang et al., 2020) and topic classification (Kesiraju et al., 2019; Meng et al., 2020). Still, high performance often depends on the size, quality, and, perhaps most importantly, the availability of training data. Gathering data can quickly become a tedious assignment, and it is especially challenging for non-English languages, which likely have fewer resources since most current research uses English corpora. In such a scarce scenario, data augmentation techniques are even handier. Data augmentation is already widely used in computer vision (Simard et al., 2012; Szegedy et al., 2015; Krizhevsky et al., 2017) and speech (Cui et al., 2015; Ko et al., 2015), where it boosts performance, especially on smaller datasets.

Text data augmentation techniques use various strategies, such as applying a set of universal functions to quickly and easily introduce diversity into the dataset (Wei and Zou, 2019), generating new sentences with contextual language models (Kobayashi, 2018), and others. Thus, implementation cost versus performance gain varies from technique to technique. Still, all of the methods rely on at least one kind of language resource, which may be a WordNet dictionary, a word embedding model, datasets with specific formats, or another type of dependency closely tied to a single language.

Most of these text augmentation techniques were originally developed using English corpora. However, some recent works extend their application scenarios, applying a given technique either to various languages (Ciolino et al., 2021) or to Brazilian Portuguese corpora as the evaluation language (Veríssimo and Costa, 2020; Venturott and Ciarelli, 2020). Each work uses distinct processes and datasets and, given some limitations, data augmentation improved their results.

In this paper, we revisit different existing text augmentation methods, gathering and reconstructing the necessary resources to reproduce each technique. Then, by applying them to Brazilian Portuguese corpora, we attempt to expand and validate these techniques in a more generic way. Using McNemar's statistical test to compare the classification models, we obtain a set of results showing that text augmentation methods are particularly useful, although language-specific fine-tuning should be considered to ensure significant positive gains.

2 Experimental Setup

Initially, we cluster the text augmentation methods into three main groups.

(1) Easy Data Augmentation (EDA): based on Wei and Zou (2019), it consists of a collection of four functions (synonym replacement, random insertion, random swap, and
random deletion). The first two rely on a map of synonyms; in this case, we use the PPDB Portuguese paraphrase pack (available at http://paraphrase.org). For the other parameters of the technique, we use the same values as in the original paper.
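As an illustration, below is a minimal Python sketch of the four EDA operations. The synonym map is a toy stand-in for the PPDB pack, and the function names and default parameters are ours, not part of the original implementation.

```python
import random

# Toy synonym map standing in for the PPDB Portuguese paraphrase pack.
SYNONYMS = {"bom": ["ótimo", "excelente"], "ruim": ["péssimo", "horrível"]}

def synonym_replacement(words, n=1):
    """Replace up to n words that have an entry in the synonym map."""
    words = list(words)
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    random.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = random.choice(SYNONYMS[words[i]])
    return words

def random_insertion(words, n=1):
    """Insert a synonym of a random known word at a random position."""
    words = list(words)
    known = [w for w in words if w in SYNONYMS]
    for _ in range(n):
        if not known:
            break
        syn = random.choice(SYNONYMS[random.choice(known)])
        words.insert(random.randrange(len(words) + 1), syn)
    return words

def random_swap(words, n=1):
    """Swap the positions of two randomly chosen words, n times."""
    words = list(words)
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(list(words))]
```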
(2) Synonym (Syn): many text augmentation methods are based on some kind of synonym or allonym replacement. They primarily use language models to replace words effectively, synthetically creating diversity in the training set. Here we use the nlpaug library (https://github.com/makcedward/nlpaug) to produce a pipeline of word replacements (a sequential flow) that uses the PPDB Portuguese paraphrase pack, the Fasttext Portuguese Word Embedding model (https://fasttext.cc/docs/en/crawl-vectors.html; Grave et al., 2018), and the Portuguese BERT model (https://huggingface.co/neuralmind/bert-large-portuguese-cased; Souza et al., 2019). Combined, the three resources provide a smart replacement of similar words. We generate one new sentence per sample in the training set using this method.
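A sketch of how such a sequential flow can be wired with nlpaug follows; the local file names for the PPDB pack and the fasttext vectors are placeholders, and the exact constructor arguments may vary across nlpaug versions.

```python
import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

# Sequential word-replacement flow over the three resources; the two
# local model_path values are placeholders for the downloaded files.
flow = naf.Sequential([
    naw.SynonymAug(aug_src="ppdb", model_path="ppdb-2.0-s-all-pt.txt"),
    naw.WordEmbsAug(model_type="fasttext",
                    model_path="cc.pt.300.vec", action="substitute"),
    naw.ContextualWordEmbsAug(
        model_path="neuralmind/bert-large-portuguese-cased",
        action="substitute"),
])

# One augmented sentence per training sample.
augmented = flow.augment("o produto chegou rápido e funciona bem")
```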
(3) Back Translation (BT): many translation APIs are publicly available and free up to reasonable usage. For this method, we generate one sentence per sample in the training set using the AWS Amazon Translate service, chosen for convenience (Microsoft and Google offer similar APIs, and it is also possible to use pre-trained translation models to perform the "back translation" technique).
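The round trip can be scripted against Amazon Translate via boto3, as in the sketch below; the pivot language (English) and the region are our assumptions.

```python
import boto3

# Amazon Translate client; the region is an assumption.
translate = boto3.client("translate", region_name="us-east-1")

def back_translate(text, source="pt", pivot="en"):
    """Translate source -> pivot -> source to obtain a paraphrase."""
    forward = translate.translate_text(
        Text=text, SourceLanguageCode=source, TargetLanguageCode=pivot)
    backward = translate.translate_text(
        Text=forward["TranslatedText"],
        SourceLanguageCode=pivot, TargetLanguageCode=source)
    return backward["TranslatedText"]
```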
2.1 Benchmark Datasets

We conduct experiments on three publicly available Brazilian Portuguese text classification datasets.

(1) Tweets: TweetSentBR (Brum and das Graças Volpe Nunes, 2017) is a corpus of 10,648 tweets manually annotated with one of three sentiment classes: Positive, where the user meant a positive reaction or evaluation about the main topic of the post; Negative, where the user meant a negative reaction or evaluation about the main topic of the post; and Neutral, for tweets not belonging to either of the previous classes, usually not making a point, off topic, irrelevant, confusing, or containing only objective data.
(2) B2W: B2W Open Product Reviews (available at https://github.com/b2wdigital/b2w-reviews01) is a binary classification corpus of 132,373 product reviews, whose labels represent the willingness of the customer to recommend the product to someone else.

(3) Mercadolibre: the Mercado Libre Data Challenge 2019 corpus (available at https://ml-challenge.mercadolibre.com) contains 693,318 purchase histories, where the goal is to predict the next item bought by the user.
As demonstrated by Wei and Zou (2019), some text augmentation techniques have a more significant impact on smaller datasets; for that reason, we randomly resample the datasets into subsets with different sizes N = {500, 1000, 2000, 5000, 10000}. For each subset N, we use a 75% train split rate. Also, we use different percentages p = {0%, 5%, 10%, 20%} of the train set in the augmentation process. Finally, we run 15 rounds of the whole experiment for each dataset, totaling 2,700 trained models (3 datasets × 3 augmentation groups × 5 subset sizes × 4 augmentation percentages × 15 rounds).
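The experimental grid can be expressed compactly as below; resample and train_test_split come from scikit-learn, while the loop structure and helper names are our own reading of the setup (the augmentation-group dimension is handled outside this sketch).

```python
from itertools import product

from sklearn.model_selection import train_test_split
from sklearn.utils import resample

SUBSET_SIZES = [500, 1000, 2000, 5000, 10000]
AUG_PERCENTAGES = [0.0, 0.05, 0.10, 0.20]  # 0.0 is the baseline
ROUNDS = 15

def experiment_grid(texts, labels):
    """Yield one (train, test, p) task per subset size, augmentation
    percentage, and round; the caller augments p * len(train) samples
    before fitting a classifier."""
    for size, p, rnd in product(SUBSET_SIZES, AUG_PERCENTAGES, range(ROUNDS)):
        X, y = resample(texts, labels, n_samples=size,
                        replace=False, random_state=rnd)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.75, random_state=rnd)
        yield X_tr, X_te, y_tr, y_te, p
```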
2.2 Text Classification Models

Due to the large number of models, we opted to use non-deep-learning classifiers to perform the benchmark, based on their ease of use and usually faster training. Many popular algorithms for text classification are not based on neural networks; one of the most prominent is the Support Vector Machine (SVM) algorithm (Kowsari et al., 2019). We use the scikit-learn implementation of SVM (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) with C=10, kernel=rbf, and gamma=scale for all trained models (other parameters set to default). As the featurizer, we use the Fasttext Portuguese Word Embedding model (https://fasttext.cc/docs/en/crawl-vectors.html; Grave et al., 2018) to extract the sentence vector for each sample.
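A sketch of this classifier and featurizer, assuming the pre-trained cc.pt.300.bin vectors have been downloaded from fasttext.cc (the toy training pair is ours):

```python
import fasttext
import numpy as np
from sklearn.svm import SVC

# Pre-trained Portuguese vectors downloaded from fasttext.cc.
ft = fasttext.load_model("cc.pt.300.bin")

def featurize(sentences):
    """Map each sentence to its fasttext sentence vector."""
    return np.array([ft.get_sentence_vector(s) for s in sentences])

# Hyperparameters as stated above; everything else left at defaults.
clf = SVC(C=10, kernel="rbf", gamma="scale")

X_tr, y_tr = ["produto excelente, recomendo", "péssimo, não recomendo"], [1, 0]
clf.fit(featurize(X_tr), y_tr)
```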
2.3 Model Evaluation

We use the F1-score (weighted F1-score for multi-class datasets) as the evaluation metric. The F1-score is the harmonic mean of precision and recall, and it was applied as a filter, leaving only the best 180 models for each experiment round and parameter combination. After the models' filtering, we use the F1-score of each baseline model (p = 0%) to compute the model gain, i.e., the difference between the F1-score of each augmented model (p > 0%) and the baseline F1-score. Also, to determine whether a model's performance is significantly different, we use the continuity-corrected version of McNemar's test. This nonparametric statistical test applies to paired nominal data and has been used to compare NLP models (Jiang et al., 2021; Chen et al., 2021).
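A sketch of the gain computation and the continuity-corrected McNemar's test, assuming the 2×2 contingency table is built from the paired correct/incorrect outcomes of the baseline and augmented models on the same test set:

```python
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(y_true, pred_base, pred_aug):
    """Return the weighted-F1 gain of the augmented model over the
    baseline and the continuity-corrected McNemar p-value."""
    y_true, pred_base, pred_aug = map(np.asarray,
                                      (y_true, pred_base, pred_aug))
    gain = (f1_score(y_true, pred_aug, average="weighted")
            - f1_score(y_true, pred_base, average="weighted"))
    base_ok, aug_ok = pred_base == y_true, pred_aug == y_true
    # 2x2 table of paired outcomes: both right, only baseline right, etc.
    table = [[np.sum(base_ok & aug_ok), np.sum(base_ok & ~aug_ok)],
             [np.sum(~base_ok & aug_ok), np.sum(~base_ok & ~aug_ok)]]
    result = mcnemar(table, exact=False, correction=True)
    return gain, result.pvalue
```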
3 Results
All details regarding baseline F1-scores, F1-score
gains and p-values can be found in Appendices A,
B and C, respectively.
Figure 1: F1-score average gains for each augmentation group on the Tweets dataset.

3.1 Tweets Dataset

McNemar's test results show no significant difference between the baseline and augmented models (no p-value below 0.05). The test was applied only to models with positive F1-score gains and is depicted in Figure 2.
3.2 B2W Dataset

As with Tweets, the analysis of the B2W dataset resulted in the highest F1-score gain belonging to the EDA augmentation group. Also, the dataset subset sizes that achieved the best results were 500 and 2000, the same pattern observed in the Tweets dataset. Figure 3 shows the F1-score average gains for each augmentation group.

The statistical significance analysis was applied following the same scheme as for the Tweets dataset. Figure 4 shows the p-values of the models with positive gains. The B2W dataset analysis resulted in one significant model, as shown in Figure 4: this best model, trained with a subset size of 2000 and 0.05 augmentation, reached a p-value of 0.042957.

Figure 4: -log(p-values) for each augmentation group on the B2W dataset. Dashed red line: alpha = 0.05.

3.3 Mercado Libre Dataset

Finally, in the Mercado Libre dataset, the EDA augmentation group still achieved the best F1-score gain. Figure 5 shows the F1-score average gain for each group.

All p-values of positive F1-score gains resulting from McNemar's test are depicted in Figure 6. Although no significant text classification model was found (Figure 6), the models at dataset subset size 5000 obtained a satisfactory performance.

Figure 6: -log(p-values) for each augmentation group on the Mercado Libre dataset. Dashed red line: alpha = 0.05.

3.4 Augmentation Group Performance

Combining all results across the three augmentation groups, Figure 7 shows the F1-score gains for each dataset subset size. The Syn augmentation group achieved the best
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90.

Yu Meng, Yunyi Zhang, Jiaxin Huang, Yu Zhang, Chao Zhang, and Jiawei Han. 2020. Hierarchical topic mining via joint spherical tree and text embedding.

Patrice Y. Simard, Yann A. LeCun, John S. Denker, and Bernard Victorri. 2012. Transformation invariance in pattern recognition – tangent distance and tangent propagation.

A Detailed Baseline F1-scores

A.1 Tweets Dataset

Table 1 shows all baseline F1-scores for the Tweets dataset.

A.2 B2W Dataset

Table 2 shows all baseline F1-scores for the B2W dataset.

A.3 Mercado Libre Dataset

Table 3 shows all baseline F1-scores for the Mercado Libre dataset.
Augm. Type              EDA                 Syn                 BT
Augm. Perc.      0.05  0.1   0.2     0.05  0.1   0.2     0.05  0.1   0.2
Subset Size
500              0.83  0.82  0.83    0.83  0.83  0.83    0.87  0.87  0.87
1000             0.80  0.79  0.80    0.80  0.80  0.80    0.80  0.80  0.80
2000             0.81  0.81  0.81    0.79  0.79  0.79    0.78  0.78  0.78
5000             0.79  0.80  0.80    0.78  0.78  0.78    0.79  0.79  0.79
10000            0.78  0.78  0.77    0.77  0.78  0.77    0.78  0.78  0.78

Table 1: Baseline F1-scores for the Tweets dataset.
Augm. Type              EDA                 Syn                 BT
Augm. Perc.      0.05  0.1   0.2     0.05  0.1   0.2     0.05  0.1   0.2
Subset Size
500              0.92  0.92  0.91    0.94  0.94  0.94    0.92  0.92  0.92
1000             0.95  0.95  0.95    0.94  0.93  0.94    0.94  0.93  0.93
2000             0.93  0.94  0.93    0.93  0.93  0.93    0.94  0.94  0.93
5000             0.93  0.94  0.93    0.93  0.93  0.93    0.94  0.94  0.93
10000            0.94  0.94  0.93    0.94  0.94  0.94    0.93  0.93  0.93

Table 2: Baseline F1-scores for the B2W dataset.
C Detailed p-values
C.1 Tweets Dataset
Table 4 shows all p-values for the Tweets dataset.
Augm. Type              EDA                 Syn                 BT
Augm. Perc.      0.05  0.1   0.2     0.05  0.1   0.2     0.05  0.1   0.2
Subset Size
500              0.30  0.89  1.00    1.00  0.67  0.60    0.70  0.24  0.58
1000             0.81  0.61  0.62    0.74  0.68  0.87    0.93  1.00  0.86
2000             0.43  0.43  0.63    0.60  0.95  0.95    0.63  0.48  0.32
5000             0.53  0.97  0.24    0.54  0.78  0.67    0.55  0.81  0.50
10000            0.61  0.58  0.52    0.91  0.81  0.35    0.98  0.88  0.45

Table 4: p-values for the Tweets dataset.
C.2 B2W Dataset

Table 5 shows all p-values for the B2W dataset.

Augm. Type              EDA                 Syn                 BT
Augm. Perc.      0.05  0.1   0.2     0.05  0.1   0.2     0.05  0.1   0.2
Subset Size
500              1.00  0.23  0.21    1.00  1.00  0.61    0.54  1.00  0.70
1000             0.09  0.73  0.21    1.00  0.61  0.32    0.74  0.30  0.47
2000             0.14  0.89  0.81    0.04  0.39  1.00    1.00  0.55  0.61
5000             0.92  0.78  0.35    0.92  1.00  0.93    1.00  0.48  0.79
10000            0.56  0.89  0.12    0.77  0.10  0.15    0.60  0.42  0.79

Table 5: p-values for the B2W dataset.
C.3 Mercado Libre Dataset

Table 6 shows all p-values for the Mercado Libre dataset.

Augm. Type              EDA                 Syn                 BT
Augm. Perc.      0.05  0.1   0.2     0.05  0.1   0.2     0.05  0.1   0.2
Subset Size
500              0.93  0.93  0.80    1.00  0.93  0.50    0.60  0.44  0.81
1000             0.79  1.00  0.36    1.00  0.95  0.66    0.69  0.80  0.95
2000             0.52  0.55  0.54    0.75  0.67  0.56    0.67  0.67  0.96
5000             0.63  0.62  1.00    0.72  0.13  0.93    0.56  0.44  0.81
10000            0.51  0.66  0.23    0.89  0.72  0.60    0.23  0.38  0.21

Table 6: p-values for the Mercado Libre dataset.