
arXiv:2304.02785v1 [cs.CL] 5 Apr 2023

Performance of Data Augmentation Methods for Brazilian Portuguese Text Classification

Marcellus Amadeus and Paulo Branco
Alana AI
{marcellus}@alana.ai

Abstract

Improving machine learning performance while increasing model generalization has been a constantly pursued goal of AI researchers. Data augmentation techniques are often used towards achieving this target, and most of their evaluation is made using English corpora. In this work, we took advantage of different existing data augmentation methods to analyze their performance applied to text classification problems using Brazilian Portuguese corpora. As a result, our analysis shows some putative improvements from using some of these techniques; however, it also suggests that language bias and non-English text data scarcity deserve further exploration.

1 Introduction

Text classification is a common and essential task in natural language processing (NLP). Much work has been done in the area. State-of-the-art results achieved high accuracy on several related tasks such as sentiment analysis (Thongtan and Phienthrakul, 2019; Jiang et al., 2020) and topic classification (Kesiraju et al., 2019; Meng et al., 2020). Still, high performance often depends on the size, quality, and, perhaps most important, availability of training data. Gathering data can quickly become a tedious assignment, and it is especially challenging for non-English languages, which likely have fewer resources since most current research uses English corpora. In such a scarce scenario, data augmentation techniques are even handier. Data augmentation is already widely used in computer vision (Simard et al., 2012; Szegedy et al., 2015; Krizhevsky et al., 2017) and speech (Cui et al., 2015; Ko et al., 2015), where it boosts performance, especially on smaller datasets.

Text data augmentation techniques use various strategies, such as applying a set of universal functions to quickly and easily introduce diversity into the dataset (Wei and Zou, 2019), generating new data by translating sentences into another language and back into English (Yu et al., 2018) (also referred to as "back translation"), using predictive language models for synonym replacement (Kobayashi, 2018), and others. Thus, implementation cost versus performance gain varies from technique to technique. Still, all of the methods rely on at least one kind of language resource, which may be a WordNet dictionary, a word embedding model, datasets with specific formats, or another type of dependency closely tied to a single language.

Most of these text augmentation techniques were originally developed using English corpora. However, some recent works extend their application scenarios, applying a specific technique either to various languages (Ciolino et al., 2021) or to Brazilian Portuguese corpora as the evaluation language (Veríssimo and Costa, 2020; Venturott and Ciarelli, 2020). Each work uses distinct processes and datasets and, given some limitations, data augmentation improved their results.

In this paper, we revisit several existing text augmentation methods, gathering and reconstructing the resources necessary to reproduce each technique. Then, by applying them to Brazilian Portuguese corpora, we attempt to expand and validate these techniques in a more generic way. Using McNemar's statistical test to compare the classification models, we obtain a set of results showing that text augmentation methods are particularly useful, although language-specific fine-tuning should be considered to ensure significant positive gains.

2 Experimental Setup

Initially, we cluster the text augmentation methods into three main groups: (1) Easy Data Augmentation (EDA): based on Wei and Zou (2019), it consists of a collection of four functions (synonym replacement, random insertion, random swap, and random deletion). The first two rely on a map of synonyms; in this case, we use the PPDB Portuguese paraphrase pack (available at http://paraphrase.org). For the other parameters of the technique, we use the same ones as in the original paper.
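For concreteness, here is a minimal Python sketch of two of the four EDA operations, random swap and random deletion. This is our illustration, not the authors' implementation; the function names and parameters (n, p) are ours and only loosely follow Wei and Zou (2019).

```python
import random

def random_swap(words, n=1):
    """Swap the positions of two randomly chosen words, n times."""
    words = words.copy()
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word independently with probability p, keeping at least one."""
    if len(words) == 1:
        return words
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "o produto chegou rápido e bem embalado".split()
print(" ".join(random_swap(sentence, n=2)))
print(" ".join(random_deletion(sentence, p=0.2)))
```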
(2) Synonym (Syn): many text augmentation methods are based on some kind of synonym or allonym replacement. They primarily use language models to replace words, synthetically creating diversity in the training set. Here we use the nlpaug library (https://github.com/makcedward/nlpaug) to produce a pipeline of word replacement (a sequential flow) that uses the PPDB Portuguese paraphrase pack, the Fasttext Portuguese Word Embedding model (https://fasttext.cc/docs/en/crawl-vectors.html) (Grave et al., 2018), and the Portuguese BERT model (https://huggingface.co/neuralmind/bert-large-portuguese-cased) (Souza et al., 2019). Combined, the three resources provide a smart replacement for similar words. We generate one new sentence per sample in the training set using this method.
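Sketched below is what such a sequential nlpaug flow can look like. This is our reconstruction, not the authors' code: the PPDB and fastText file paths are placeholders for locally downloaded resources, and the augmenter parameters are left at nlpaug's defaults since the paper does not state them.

```python
import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

# Sequential flow: PPDB synonyms -> fastText embedding neighbors -> BERT substitution.
pipeline = naf.Sequential([
    naw.SynonymAug(aug_src='ppdb', model_path='ppdb-pt.txt'),      # placeholder path to the PPDB pack
    naw.WordEmbsAug(model_type='fasttext', model_path='cc.pt.300.vec',
                    action='substitute'),                           # word embedding neighbors
    naw.ContextualWordEmbsAug(model_path='neuralmind/bert-large-portuguese-cased',
                              action='substitute'),                 # masked-LM replacement
])

# Returns the augmented sentence (a list in recent nlpaug versions).
print(pipeline.augment("o produto chegou rápido e bem embalado"))
```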
(3) Back translation (BT): many translation APIs are publicly available and free (up to a reasonable usage). So, for this method, we generate one sentence per sample in the training set using the AWS Amazon Translate service, chosen for convenience (Microsoft and Google also offer that kind of API, and it is also possible to use pre-trained models to perform the "back translation" technique).
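As a hedged illustration of the round trip through Amazon Translate (not the authors' script), the following sketch uses the boto3 TranslateText call; the region, credentials setup, and the English pivot language are our assumptions, as the paper does not state which intermediate language was used.

```python
import boto3

translate = boto3.client("translate", region_name="us-east-1")  # assumed region

def back_translate(text: str, pivot: str = "en") -> str:
    """Translate pt -> pivot -> pt to obtain a paraphrased variant of the input."""
    forward = translate.translate_text(
        Text=text, SourceLanguageCode="pt", TargetLanguageCode=pivot
    )["TranslatedText"]
    backward = translate.translate_text(
        Text=forward, SourceLanguageCode=pivot, TargetLanguageCode="pt"
    )["TranslatedText"]
    return backward

print(back_translate("o produto chegou rápido e bem embalado"))
```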
2.1 Benchmark Datasets

We conduct experiments on three publicly available Brazilian Portuguese text classification datasets: (1) Tweets: TweetSentBR (Brum and das Graças Volpe Nunes, 2017) is a corpus of 10,648 tweets manually annotated with one of three sentiment classes: Positive - the user meant a positive reaction or evaluation about the main topic of the post; Negative - the user meant a negative reaction or evaluation about the main topic of the post; Neutral - tweets not belonging to either of the previous classes, usually not making a point, off topic, irrelevant, confusing, or containing only objective data.
(2) B2W: B2W Open Product Reviews (available at https://github.com/b2wdigital/b2w-reviews01) is a binary classification corpus of 132,373 product reviews; the labels represent the willingness of the customer to recommend the product to someone else. (3) Mercadolibre: the Mercado Libre Data Challenge 2019 (available at https://ml-challenge.mercadolibre.com) is a corpus of 693,318 purchase histories where the goal is to predict the next item bought by the user.

As demonstrated by Wei and Zou (2019), some text augmentation techniques have a more significant impact on smaller datasets; for that reason, we randomly resample the datasets into subsets of different sizes N = {500, 1000, 2000, 5000, 10000}. For each subset N, we use a 75% train split rate. Also, we use different percentages p = {0%, 5%, 10%, 20%} of the train set in the augmentation process. Finally, we run 15 rounds of the whole experiment for each dataset, totaling 2,700 trained models (3 datasets × 3 augmentation groups × 5 subset sizes × 4 augmentation percentages × 15 rounds).
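As a sanity check on the experiment count, the grid of configurations can be enumerated directly; this sketch (with variable names of our choosing) reproduces the 2,700-model arithmetic under the splits described above.

```python
import itertools

datasets = ["tweets", "b2w", "mercadolibre"]
groups = ["EDA", "Syn", "BT"]
subset_sizes = [500, 1000, 2000, 5000, 10000]
aug_percentages = [0.00, 0.05, 0.10, 0.20]  # 0% is the non-augmented baseline
rounds = range(15)

grid = list(itertools.product(datasets, groups, subset_sizes, aug_percentages, rounds))
assert len(grid) == 2700  # 3 * 3 * 5 * 4 * 15 trained models

for dataset, group, n, p, r in grid[:3]:
    n_train = int(n * 0.75)   # 75% train split
    n_aug = int(n_train * p)  # sentences fed to the augmenter
    print(dataset, group, n, p, r, n_train, n_aug)
```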
2.2 Text Classification Models

Due to the large number of models, we opted to use non-deep-learning classifiers to perform the benchmark, based on their ease of use and usually faster training. Many popular algorithms for text classification are not based on neural networks; one of the most prominent is the Support Vector Machine (SVM) algorithm (Kowsari et al., 2019). We use the scikit-learn implementation of SVM (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) with C=10, kernel=rbf, and gamma=scale for all trained models (other parameters set to default). As the featurizer, we use the Fasttext Portuguese Word Embedding model (https://fasttext.cc/docs/en/crawl-vectors.html) (Grave et al., 2018) to extract the sentence vector for each sample.
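A minimal sketch of this classifier setup follows, assuming the official fasttext Python bindings and the pre-trained cc.pt.300.bin file from the crawl-vectors page; the toy training texts are ours.

```python
import fasttext
import numpy as np
from sklearn.svm import SVC

ft = fasttext.load_model("cc.pt.300.bin")  # pre-trained Portuguese vectors

def featurize(texts):
    """Map each sentence to its 300-d fastText sentence vector."""
    return np.vstack([ft.get_sentence_vector(t) for t in texts])

# Toy stand-ins for a resampled train split.
train_texts = ["produto excelente, recomendo", "péssimo, chegou quebrado"]
train_labels = [1, 0]

clf = SVC(C=10, kernel="rbf", gamma="scale")  # other parameters left at defaults
clf.fit(featurize(train_texts), train_labels)
print(clf.predict(featurize(["gostei muito do produto"])))
```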
2.3 Model Evaluation

We use the F1-score (the weighted F1-score for the multi-label datasets) as the evaluation metric. The F1-score is the harmonic mean of precision and recall, and it was applied as a filter, leaving only the best 180 models for each experiment round and parameter combination.
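In scikit-learn terms, the metric is a single call; the sketch below (with toy labels and variable names of our own) also anticipates the gain defined in the next paragraph, computed here as augmented minus baseline so that positive values indicate improvement.

```python
from sklearn.metrics import f1_score

y_true          = [0, 1, 2, 1, 0, 2, 1, 0]  # toy gold labels
baseline_preds  = [0, 1, 1, 1, 0, 2, 0, 0]
augmented_preds = [0, 1, 2, 1, 0, 2, 0, 0]

baseline_f1  = f1_score(y_true, baseline_preds, average="weighted")
augmented_f1 = f1_score(y_true, augmented_preds, average="weighted")
gain = augmented_f1 - baseline_f1  # positive when augmentation helped
print(round(baseline_f1, 3), round(augmented_f1, 3), round(gain, 3))
```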
After the models' filtering, we use the F1-score of each baseline model (p = 0%) to compute the model gain, i.e. the difference between the baseline F1-score and the F1-score of each augmented model (p > 0%). Also, to determine whether a model's performance is significantly different, we use the continuity-corrected version of McNemar's test. This nonparametric statistical method is used on paired nominal data and has been used to compare NLP models (Jiang et al., 2021; Chen et al., 2021).
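A hedged sketch of this test using statsmodels (our library choice; the paper does not name an implementation): the 2×2 contingency table pairs the two models' per-sample correctness on the same test set.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Boolean correctness of each model on the same test samples (toy values).
baseline_correct  = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
augmented_correct = np.array([1, 1, 1, 1, 0, 1, 1, 1, 0, 1], dtype=bool)

# 2x2 table of paired outcomes: the off-diagonal cells count disagreements.
table = [
    [np.sum(baseline_correct & augmented_correct),  np.sum(baseline_correct & ~augmented_correct)],
    [np.sum(~baseline_correct & augmented_correct), np.sum(~baseline_correct & ~augmented_correct)],
]

# exact=False with correction=True gives the continuity-corrected chi-square version.
result = mcnemar(table, exact=False, correction=True)
print(result.statistic, result.pvalue)
```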

3 Results

All details regarding baseline F1-scores, F1-score gains, and p-values can be found in Appendices A, B, and C, respectively.

3.1 Tweets Dataset

As a result of the Tweets dataset analysis, the highest F1-score gains were obtained on the smallest dataset subset for the EDA and Syn augmentation groups. Figure 1 shows the F1-score average gains for each augmentation group.

[Figure 1: F1-score average gains for each augmentation group on the Tweets dataset.]

McNemar's test results show no significant difference between the baseline models and the augmented models (no p-value fell below the 0.05 threshold). The test was applied only to models with positive F1-score gains and is depicted in Figure 2.

[Figure 2: -log(p-values) for each augmentation group on the Tweets dataset. Dashed red line: alpha = 0.05.]

3.2 B2W Dataset

As with Tweets, the analysis of the B2W dataset resulted in the highest F1-score gain belonging to the EDA augmentation group. Also, the dataset subset sizes that achieved the best results were 500 and 2000, the same pattern observed in the Tweets dataset. Figure 3 shows the F1-score average gains for each augmentation group.

[Figure 3: F1-score average gains for each augmentation group on the B2W dataset.]

The statistical significance analysis was applied following the same scheme as for the Tweets dataset. Figure 4 shows the p-values of the positive-gain models. The B2W dataset analysis resulted in one significant model, as shown in Figure 4: this best model, trained with subset size 2000 and 0.05 augmentation percentage, reached a p-value of 0.042957.

[Figure 4: -log(p-values) for each augmentation group on the B2W dataset. Dashed red line: alpha = 0.05.]

3.3 Mercado Libre Dataset

Finally, in the Mercado Libre dataset, the EDA augmentation group still achieved the best F1-score gain. Figure 5 shows the F1-score average gain for each group.

[Figure 5: F1-score average gains for each augmentation group on the Mercado Libre dataset.]

The p-values of all positive F1-score gains resulting from McNemar's test are depicted in Figure 6. Although no significant text classification model was found (Figure 6), the subset size 5000 models obtained a satisfactory performance.

[Figure 6: -log(p-values) for each augmentation group on the Mercado Libre dataset. Dashed red line: alpha = 0.05.]

3.4 Augmentation Group Performance

Combining all results across the three augmentation groups, Figure 7 shows the F1-score gains for each dataset subset size. The Syn augmentation group achieved the best overall performance.

[Figure 7: F1-score gains for each augmentation group, combining all datasets.]

4 Conclusions

We grouped and compared three different text augmentation techniques that have been reported as significant additions to solving text classification tasks. These techniques were initially developed using English corpora; we applied them using publicly available Brazilian Portuguese language resources.

We trained 2,700 text classification models using three different datasets. For each dataset, the models were generated using combinations of the following attributes: dataset subset size, percentage of augmentation, and group of augmentation technique.

Comparing the best augmented and non-augmented models, our analysis showed a slight upward trend in the F1-score gain, although no expressive statistical significance was found. This result might be caused by the model choice, i.e. the SVM is considered a low-sensitivity method regarding the train set size, and by the nature of the datasets, i.e. the colloquialism inherent in the Tweets texts and the large number of targets in the Mercado Libre classification task, which add noise to the trained models. Also, since all the augmentation techniques were developed using English corpora, some putative language dependency might occur, biasing the final results.

In future work, we plan to adjust the model choice and expand the analysis of the Brazilian Portuguese language. Besides, in order to increase the available Brazilian Portuguese data volume, we plan to curate and annotate a new corpus.
References

Henrico Bertini Brum and Maria das Graças Volpe Nunes. 2017. Building a sentiment corpus of tweets in Brazilian Portuguese.

Timothy L. Chen, Max Emerling, Gunvant R. Chaudhari, Yeshwant R. Chillakuru, Youngho Seo, Thienkhai H. Vu, and Jae Ho Sohn. 2021. Domain specific word embeddings for natural language processing in radiology. Journal of Biomedical Informatics, 113:103665.

Matthew Ciolino, David Noever, and Josh Kalin. 2021. Multilingual augmenter: The model chooses.

Xiaodong Cui, Vaibhava Goel, and Brian Kingsbury. 2015. Data augmentation for deep neural network acoustic modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(9):1469–1477.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2177–2190, Online. Association for Computational Linguistics.

Tao Jiang, Jian Ping Li, Amin Ul Haq, Abdus Saboor, and Amjad Ali. 2021. A novel stacking approach for accurate detection of fake news. IEEE Access, 9:22626–22639.

Santosh Kesiraju, Oldřich Plchot, Lukáš Burget, and Suryakanth V Gangashetty. 2019. Learning document embeddings along with their uncertainties.

T. Ko, Vijayaditya Peddinti, D. Povey, and S. Khudanpur. 2015. Audio augmentation for speech recognition. In INTERSPEECH.

Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics.

Kowsari, Jafari Meimandi, Heidarysafa, Mendu, Barnes, and Brown. 2019. Text classification algorithms: A survey. Information, 10(4):150.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90.

Yu Meng, Yunyi Zhang, Jiaxin Huang, Yu Zhang, Chao Zhang, and Jiawei Han. 2020. Hierarchical topic mining via joint spherical tree and text embedding.

Patrice Y. Simard, Yann A. LeCun, John S. Denker, and Bernard Victorri. 2012. Transformation invariance in pattern recognition – tangent distance and tangent propagation. In Lecture Notes in Computer Science, pages 235–269. Springer Berlin Heidelberg.

Fábio Souza, Rodrigo Nogueira, and Roberto Lotufo. 2019. Portuguese named entity recognition using BERT-CRF.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

Tan Thongtan and Tanasanee Phienthrakul. 2019. Sentiment classification using document embeddings trained with cosine similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 407–414, Florence, Italy. Association for Computational Linguistics.

Lígia Iunes Venturott and Patrick Marques Ciarelli. 2020. Data augmentation for improving hate speech detection on social networks. In Proceedings of the Brazilian Symposium on Multimedia and the Web. ACM.

Vinícius Veríssimo and Rostand Costa. 2020. Using data augmentation and neural networks to improve the emotion analysis of Brazilian Portuguese texts. In Proceedings of the Brazilian Symposium on Multimedia and the Web. ACM.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics.

Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. 2018. Fast and accurate reading comprehension by combining self-attention and convolution. In International Conference on Learning Representations.

A Detailed Baseline F1-scores

A.1 Tweets Dataset
Table 1 shows all baseline F1-scores for the Tweets dataset.

A.2 B2W Dataset
Table 2 shows all baseline F1-scores for the B2W dataset.

A.3 Mercado Libre Dataset
Table 3 shows all baseline F1-scores for the Mercado Libre dataset.
Augm. Type:          EDA                 Syn                 BT
Augm. Perc.:    0.05  0.1   0.2    0.05  0.1   0.2    0.05  0.1   0.2
Subset Size
500             0.83  0.82  0.83   0.83  0.83  0.83   0.87  0.87  0.87
1000            0.80  0.79  0.80   0.80  0.80  0.80   0.80  0.80  0.80
2000            0.81  0.81  0.81   0.79  0.79  0.79   0.78  0.78  0.78
5000            0.79  0.80  0.80   0.78  0.78  0.78   0.79  0.79  0.79
10000           0.78  0.78  0.77   0.77  0.78  0.77   0.78  0.78  0.78

Table 1: Baseline F1-scores for the Tweets dataset (columns are augmentation percentages within each augmentation type).

Augm. Type:          EDA                 Syn                 BT
Augm. Perc.:    0.05  0.1   0.2    0.05  0.1   0.2    0.05  0.1   0.2
Subset Size
500             0.92  0.92  0.91   0.94  0.94  0.94   0.92  0.92  0.92
1000            0.95  0.95  0.95   0.94  0.93  0.94   0.94  0.93  0.93
2000            0.93  0.94  0.93   0.93  0.93  0.93   0.94  0.94  0.93
5000            0.93  0.94  0.93   0.93  0.93  0.93   0.94  0.94  0.93
10000           0.94  0.94  0.93   0.94  0.94  0.94   0.93  0.93  0.93

Table 2: Baseline F1-scores for the B2W dataset.

B Detailed F1-score Gains

B.1 Tweets Dataset
Figure 8 shows all F1-score gains for the Tweets dataset.

B.2 B2W Dataset
Figure 9 shows all F1-score gains for the B2W dataset.

B.3 Mercado Libre Dataset
Figure 10 shows all F1-score gains for the Mercado Libre dataset.

C Detailed p-values

C.1 Tweets Dataset
Table 4 shows all p-values for the Tweets dataset.

C.2 B2W Dataset
Table 5 shows all p-values for the B2W dataset.

C.3 Mercado Libre Dataset
Table 6 shows all p-values for the Mercado Libre dataset.
Augm. Type:          EDA                 Syn                 BT
Augm. Perc.:    0.05  0.1   0.2    0.05  0.1   0.2    0.05  0.1   0.2
Subset Size
500             0.62  0.61  0.63   0.64  0.62  0.62   0.65  0.66  0.66
1000            0.73  0.73  0.74   0.69  0.69  0.69   0.71  0.72  0.71
2000            0.80  0.80  0.80   0.79  0.78  0.79   0.77  0.78  0.77
5000            0.85  0.85  0.85   0.85  0.84  0.85   0.84  0.84  0.84
10000           0.88  0.88  0.88   0.88  0.88  0.88   0.89  0.89  0.89

Table 3: Baseline F1-scores for the Mercado Libre dataset.

Augm. Type:          EDA                 Syn                 BT
Augm. Perc.:    0.05  0.1   0.2    0.05  0.1   0.2    0.05  0.1   0.2
Subset Size
500             0.30  0.89  1.00   1.00  0.67  0.60   0.70  0.24  0.58
1000            0.81  0.61  0.62   0.74  0.68  0.87   0.93  1.00  0.86
2000            0.43  0.43  0.63   0.60  0.95  0.95   0.63  0.48  0.32
5000            0.53  0.97  0.24   0.54  0.78  0.67   0.55  0.81  0.50
10000           0.61  0.58  0.52   0.91  0.81  0.35   0.98  0.88  0.45

Table 4: McNemar's test p-values for the Tweets dataset.

Augm. Type:          EDA                 Syn                 BT
Augm. Perc.:    0.05  0.1   0.2    0.05  0.1   0.2    0.05  0.1   0.2
Subset Size
500             1.00  0.23  0.21   1.00  1.00  0.61   0.54  1.00  0.70
1000            0.09  0.73  0.21   1.00  0.61  0.32   0.74  0.30  0.47
2000            0.14  0.89  0.81   0.04  0.39  1.00   1.00  0.55  0.61
5000            0.92  0.78  0.35   0.92  1.00  0.93   1.00  0.48  0.79
10000           0.56  0.89  0.12   0.77  0.10  0.15   0.60  0.42  0.79

Table 5: McNemar's test p-values for the B2W dataset.

Augm. Type:          EDA                 Syn                 BT
Augm. Perc.:    0.05  0.1   0.2    0.05  0.1   0.2    0.05  0.1   0.2
Subset Size
500             0.93  0.93  0.80   1.00  0.93  0.50   0.60  0.44  0.81
1000            0.79  1.00  0.36   1.00  0.95  0.66   0.69  0.80  0.95
2000            0.52  0.55  0.54   0.75  0.67  0.56   0.67  0.67  0.96
5000            0.63  0.62  1.00   0.72  0.13  0.93   0.56  0.44  0.81
10000           0.51  0.66  0.23   0.89  0.72  0.60   0.23  0.38  0.21

Table 6: McNemar's test p-values for the Mercado Libre dataset.


[Figure 8: F1-score gains for each augmentation group on the Tweets dataset.]

[Figure 9: F1-score gains for each augmentation group on the B2W dataset.]

[Figure 10: F1-score gains for each augmentation group on the Mercado Libre dataset.]
