SRL-ACO A Text Augmentation Framework Based On Semantic Role
SRL-ACO A Text Augmentation Framework Based On Semantic Role
a r t i c l e i n f o a b s t r a c t
Article history: The process of creating high-quality labeled data is crucial for training machine-learning models, but it
Received 31 March 2023 can be a time-consuming and labor-intensive process. Moreover, manual annotation by human annota-
Revised 2 May 2023 tors can lead to varying degrees of competency, training, and experience, which can result in inconsistent
Accepted 3 June 2023
labeling and arbitrary standards. To address these challenges, researchers have been exploring automated
Available online 9 June 2023
methods for enhancing training and testing datasets. This paper proposes SRL-ACO, a novel text augmen-
tation framework that leverages Semantic Role Labeling (SRL) and Ant Colony Optimization (ACO) tech-
Keywords:
niques to generate additional training data for natural language processing (NLP) models. The framework
Data augmentation
Text classification
uses SRL to identify the semantic roles of words in a sentence and ACO to generate new sentences that
Deep learning preserve these roles. SRL-ACO can enhance the accuracy of NLP models by generating additional data
Sarcasm identification without requiring manual data annotation. The paper presents experimental results demonstrating the
effectiveness of SRL-ACO on seven text classification datasets for sentiment analysis, toxic text detection
and sarcasm identification. The results show that SRL-ACO improves the performance of a classifier on
different NLP tasks. These results demonstrate that SRL-ACO has the potential to enhance the quality
and quantity of training data for various NLP tasks.
Ó 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction (Deng et al., 2009; LeCun et al., 2015). Machine learning models
require a substantial amount of labeled data to learn and general-
Deep learning have revolutionized a wide range of fields, ize effectively (Goodfellow et al., 2016). Deep learning, in particu-
including computer vision, natural language processing, speech lar, relies on vast amounts of training data to learn the complex
recognition, robotics, and more (Shinde and Shah, 2018). The abil- patterns and relationships within the data. Without sufficient
ity of these algorithms to learn from data and improve their perfor- training data, the performance and robustness of deep learning
mance over time has led to breakthroughs in image and speech systems can be severely compromised, leading to inaccurate or
recognition, language translation, autonomous driving, and many unreliable predictions. Therefore, it is crucial to have access to
other areas. As the amount of available data continues to grow high-quality, diverse, and representative training datasets to
and the power of computing resources increases, the potential ensure that deep learning models can generalize well to new,
applications of deep learning are virtually limitless. From health- and unseen data (Bengio et al., 2013). Moreover, the availability
care to finance, transportation to education, these algorithms are of training data is also critical for the development of new and
transforming the way we live and work, and are poised to have innovative applications of deep learning in various domains, such
an even greater impact in the future (Dargan et al., 2020). as healthcare, finance, and natural language processing
The availability of large-scale training data is essential for the (Rajkomar et al., 2018; Heaton et al., 2017; Vaswani et al., 2017).
design and development of reliable systems for deep learning Despite the importance of training data, many technical and
business constraints prevent organizations from obtaining enough
high-quality data to build reliable deep learning models. For
E-mail address: [email protected]
Peer review under responsibility of King Saud University.
instance, sensitive data privacy concerns, data access restrictions,
and high data acquisition costs can limit the availability of diverse
and representative training datasets (Munappy et al., 2019). Addi-
tionally, the datasets available are quite often unreliable, which
Production and hosting by Elsevier can significantly affect the quality and accuracy of machine
https://doi.org/10.1016/j.jksuci.2023.101611
1319-1578/Ó 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Aytuğ Onan Journal of King Saud University – Computer and Information Sciences 35 (2023) 101611
learning models. Not only are these datasets used for training, but Introducing a rigorous and systematic approach to incorporat-
they are also used to test and evaluate models to assure their func- ing augmented data into an existing dataset for machine learn-
tionality and establish their overall efficacy. Hence, the quality of ing tasks,
the data has a significant effect on the quality of the models. It is
possible to train machine learning models using unreliable data, The rest of this paper is structured as follows. In Section 2, we
but doing so could lead to predictions that do not match reality provide an overview of related work in the field of text data aug-
because of biases or mistakes (Whang and Lee, 2020). Ensuring mentation, including previous approaches to data augmentation,
the quality and integrity of training data is crucial in deep learning and recent advances in deep learning models for natural language
applications. Developing appropriate techniques for handling and processing tasks. Section 3 describes the proposed approach in
mitigating data-related issues is also essential. detail, including the key stages of SRL-based document modeling,
Text classification is a fundamental task in natural language ACO-based text data augmentation, document evaluation, the inte-
processing that involves assigning one or more predefined cate- gration of new documents, and iterative refinement. In Section 4,
gories to a piece of text based on its content (Kowsari et al., we discuss experimental results on various datasets and demon-
2019). The importance of text classification stems from its wide strate the effectiveness of the proposed approach. Finally, in
range of applications, including spam detection, sentiment analy- Section 5, we conclude the paper and discuss future directions
sis, document classification, and language identification, among for research in text data augmentation.
others (Aggarwal and Zhai, 2012). Sentiment analysis, in particu-
lar, has received significant attention in recent years due to the
growing popularity of social media platforms and the need to 2. Related work
understand public opinion and sentiment towards various prod-
ucts, services, and events (Medhat et al., 2014). Sarcasm identifi- Text data augmentation is a well-studied area in natural lan-
cation is the task of detecting instances of sarcasm in written or guage processing and machine learning. The goal of text data aug-
spoken language. Sarcasm is a form of verbal irony that is charac- mentation is to increase the size and diversity of training data by
terized by the use of language that expresses the opposite of what generating new samples that are similar to the original data (Liu
is actually meant, often for humorous or satirical effect (Eke et al., et al., 2020). This can help improve the accuracy and robustness
2020). Sarcasm identification is another important application of of machine learning models, particularly for tasks such as text clas-
text classification, as sarcasm and irony are prevalent in online sification and sentiment analysis. A wide range of techniques have
communication and can have significant social and political been proposed for text data augmentation, including unsupervised,
implications. and supervised models (Shorten et al., 2021; Feng et al., 2021).
The problem of incorrect labels and problematic annotations is In recent years, several studies have presented unsupervised
a common challenge in the field of machine learning and natural approaches to text data augmentation for various NLP tasks.
language processing (Schwartz et al., 2011). In many cases, training Wang and Yang (2015) proposed a novel data augmentation
datasets are manually labeled by human annotators, and errors or approach to enhance sentiment analysis using social media text.
inconsistencies can occur due to subjective interpretations, bias, or They collected a Twitter corpus of descriptions of annoying behav-
insufficient training of the annotators. These incorrect labels or iors using petpeeve hashtags and studied the language use in these
problematic annotations can lead to inaccurate or biased models, tweets, with a special focus on fine-grained categories and geo-
affecting their generalization ability and overall performance graphic variation of language. They showed that lexical and syntac-
(Ahmed et al., 2023). One approach to address this issue is to use tic features are useful for automatic categorization of annoying
quality control measures such as inter-annotator agreement, behaviors, and frame-semantic features further improve the per-
where multiple annotators label the same data to assess the con- formance. Their work highlights the importance of incorporating
sistency and quality of the annotations (Artstein, 2017). However, large lexical embeddings to create additional training instances
this can be time-consuming and resource-intensive. Another for text classification tasks.
approach is to use active learning, where the machine learning In another study, easy data augmentation technique (EDA) has
model iteratively selects the most informative examples for human been introduced for boosting the performance of text classification
annotation, reducing the potential for errors and inconsistencies tasks (Wei and Zou, 2019). EDA consists of four simple operations,
(Ringger et al., 2007). In addition to the challenges of incorrect namely, synonym replacement, random insertion, random swap,
labels and problematic annotations, another challenge is the lack and random deletion.
of diverse and representative training datasets. This can lead to Similarly, Haralabopoulos et al. (2021) presented a novel text
bias and limited generalization ability of machine learning models, data augmentation approach using sentence permutations to cre-
especially in cases where the training data does not adequately ate synthetic data that retains key statistical properties of the data-
represent the target population or application domain (Bansal set. The method has been evaluated on eight different datasets and
et al., 2022). has shown significant improvements in classification accuracy by
The proposed scheme SRL-ACO introduces a novel approach to an average of 4.1%.
text data augmentation that leverages the power of Semantic Role The use of monolingual data to improve neural machine trans-
Labeling (SRL) and Ant Colony Optimization (ACO) algorithms. SRL- lation through back-translation of target language sentences is a
ACO overcomes the key limitations of existing text augmentation well-established method (Edunov et al., 2018). The study con-
models, such as limited diversity and lack of semantic coherence, tributes to the understanding of back-translation by investigating
by generating high-quality, semantically meaningful, and diverse various methods for generating synthetic source sentences. The
augmented documents. The key contributions of the study can be authors found that back-translations obtained via sampling or
summarized as follows: noised beam outputs are the most effective, particularly in
resource-poor settings.
Introducing a novel approach to text data augmentation that In another study, Body et al. (2021) presented a novel data aug-
leverages the power of SRL and ACO algorithms, mentation method called back-and-forth translation to artificially
Improving the diversity and quality of the augmented docu- increase the size of any natural language dataset. The proposed
ments by incorporating semantic coherence and readability method is shown to reduce the error rate in binary sentiment clas-
criteria, sification models by up to 3.4%, and this decrease in error rate
2
Aytuğ Onan Journal of King Saud University – Computer and Information Sciences 35 (2023) 101611
scales non-linearly with sample size. The method is particularly lems, resulting in a performance improvement of at least 5% on
effective at smaller sample sizes compared to larger ones. In recent two datasets.
years, semi-supervised learning has been explored to improve In another work, Dai et al. (2023) proposed AugGPT, a text data
deep learning models when labeled data is limited. augmentation approach based on the ChatGPT language model.
One popular approach is to use consistency training on a large The authors leveraged the recent success of large language models,
amount of unlabeled data to regularize model predictions. Xie particularly ChatGPT’s improved language comprehension abili-
et al. (2020) proposed a novel perspective on noising unlabeled ties, to generate multiple conceptually similar but semantically dif-
examples and argues that the quality of noising plays a crucial role ferent samples for each sentence in the training dataset. The
in semi-supervised learning. The study introduced advanced data augmented samples were then used for downstream model
augmentation methods, such as random augment and back- training.
translation to substitute simple noising operations, and showed In another paper, Ubani et al. (2023) proposed ZeroShotDa-
that this method brings substantial improvements across six lan- taAug, a method that uses the large generative language model,
guage and three vision tasks. Recently, Onan (2023) presented an ChatGPT, to generate synthetic training data for data augmentation
ensemble text data augmentation scheme based on three universal in low-resource scenarios. The authors show that with task-
and three task-specific transformation functions for text specific ChatGPT prompts, their approach outperforms existing
classification. data augmentation techniques.
There are also several supervised text data augmentation Kwon and Lee (2023) proposed a text data augmentation tech-
schemes. Most of these methods obtain the synthetic words and nique based on explainability for the mix-up method of swapping
sentences for augmentation through training or fine-tuning a neu- words between sentences. While mix-up is a widely used approach
ral language model. This allows for the creation of more diverse for data augmentation, it does not consider the importance of the
and relevant synthetic data that can improve model performance manipulated word, which can have a critical effect on classification
on specific tasks. Kobayashi (2018) introduced a supervised results. To address this limitation, the proposed approach explicitly
method, called contextual augmentation. The method utilizes a calculates the importance of each word, which is reflected in the
bi-directional language model to predict appropriate words to labeling process of the augmented data. The experiment results
replace the original words, assuming the invariance of sentence demonstrate that the proposed approach significantly outperforms
naturalness even with paradigmatic replacements. Additionally, existing methods when the importance of the manipulated word is
the language model is retrofitted with a label-conditional architec- considered in the labeling.
ture to ensure label-compatibility of the augmented sentences. The paper by Bayer et al. (2023) presents and evaluates a novel
In another study, a new method called conditional BERT contex- text generation approach suitable for improving the performance
tual augmentation has been proposed (Wu et al., 2019). This of classifiers for long and short texts. The authors achieved promis-
method retrofits BERT to a new conditional masked language ing improvements when evaluating short and long text tasks with
model task and applies label-conditional constraints. The well- the enhancement by their text generation method. The results also
trained conditional BERT is then used to enhance contextual aug- showed significant additive accuracy gains in a constructed low
mentation. Experimental results on various text classification tasks data regime, compared to the no augmentation baseline and
show that the proposed method is effective in improving the per- another data augmentation technique.
formance of both convolutional and recurrent neural networks The proposed SRL-ACO algorithm for data augmentation in
classifiers. In another study, a novel neural generative model has semantic role labeling (SRL) adopts a novel approach by leveraging
been proposed that combines variational auto-encoders (VAEs) the inherent structure of the task to generate high-quality syn-
and holistic attribute discriminators for effective imposition of thetic data. By incorporating Ant Colony Optimization (ACO) and
semantic structures. This model enhances VAEs with the wake- Semantic Role Labeling (SRL) models, SRL-ACO generates new
sleep algorithm for leveraging fake samples as extra training data. examples by performing targeted transformations on the original
The proposed model learns interpretable representations from sentences. The algorithm focuses on modifying words that are cru-
even only word annotations and produces short sentences with cial to the underlying semantic structure, ensuring that the aug-
desired attributes of sentiment and tenses (Hu et al., 2017). mented examples are not only plausible but also semantically
In a similar way, Moreno-Barea et al. (2020) employed varia- consistent. Furthermore, SRL-ACO provides a principled way of
tional autoencoders and generative adversarial networks for text controlling the complexity of the generated examples, which can
data augmentation. In another study, Ng et al. (2020) introduced be useful for improving the generalization capability of SRL mod-
a data augmentation method (SSMBA) that uses a pair of corrup- els. Table 1.
tion and reconstruction functions to generate synthetic training
examples by randomly moving on a data manifold. SSMBA lever-
3. Theoretical foundations
ages the manifold assumption to reconstruct corrupted text with
masked language models.
This section briefly explains the theoretical foundations of the
In another study, Feng et al. (2022) presented a novel data aug-
study, i.e., data augmentation for text, ant colony optimization,
mentation technique, named tailored text augmentation (TTA) for
and deep neural network architectures have been briefly
sentiment analysis. TTA uses probabilistic word sampling based on
presented.
the discriminative power and relevance of the word to sentiment,
as well as the identification of words irrelevant to sentiment but
discriminative for the training data, and applies zero masking or 3.1. Data augmentation
contextual replacement to these words. These operations are
designed to expand the coverage of discriminative words and alle- To achieve accurate and robust machine learning models, it is
viate the problem of overfitting, improving the model’s generaliza- essential to have high-quality training data that is diverse, repre-
tion capability. sentative, and correctly labeled. However, in practice, acquiring
Recently, Ahmed et al. (2023) proposed a novel text augmenta- such datasets can be challenging due to limited resources, biases,
tion method which employs the clonal selection algorithm (CLO- noise, and errors in annotations. As a result, models trained on
NALG) and abstract meaning representation (AMR) graphs to imperfect data can suffer from poor generalization, low perfor-
improve the quality and quantity of data in cybersecurity prob- mance, and bias towards certain groups or features. To overcome
3
Aytuğ Onan Journal of King Saud University – Computer and Information Sciences 35 (2023) 101611
Table 1
The summary of the related work on text data augmentation.
these challenges, researchers have developed various techniques, moves through the search space by depositing and following
such as data augmentation, transfer learning, and ensemble learn- pheromone trails, which are updated based on the quality of the
ing, to improve the quality and diversity of training datasets and solutions found. By exploiting the collective intelligence and
reduce the impact of incorrect labels and problematic annotations. trail-following behavior of the ants, ACO can efficiently search
In this section, we will review these techniques and their applica- large and complex solution spaces, and converge to high-quality
tions in machine learning, highlighting their benefits, limitations, solutions even in the presence of noise, nonlinearity, and uncer-
and best practices. tainty. ACO has been successfully applied to a wide range of opti-
Data augmentation is a popular technique in deep learning, mization problems, including the traveling salesman problem,
which involves generating new training data by transforming the vehicle routing problem, and the scheduling problem, among
existing data. In the context of natural language processing, text others (Dorigo and Stützle, 2019).
data augmentation techniques aim to increase the size and diver-
sity of text datasets by creating new samples that are similar to
the original data but have different characteristics, such as para- 3.3. Deep neural networks
phrasing or adding noise (Shorten et al., 2021). Text data augmen-
tation can help overcome the limitations of small or imbalanced Numerous machine learning models have been developed for
datasets and improve the performance and robustness of deep text classification, ranging from traditional algorithms such as
learning models for various tasks, including text classification, sen- Naïve Bayes and Support Vector Machines to deep neural networks
timent analysis, and natural language generation. such as Convolutional Neural Networks (CNNs) and Recurrent Neu-
ral Networks (RNNs). These models have shown impressive perfor-
mance in various text classification tasks, demonstrating the
3.2. Ant colony optimization potential of deep learning for natural language processing (Zhang
et al., 2018).
Ant Colony Optimization (ACO) is a metaheuristic optimization Convolutional Neural Networks (CNNs) are a type of deep learn-
algorithm inspired by the behavior of ants searching for food ing algorithm that has been traditionally used for image recogni-
(Dorigo et al., 2006). In this algorithm, a colony of artificial ants tion tasks, but have recently gained popularity in natural
is used to explore a search space and find the optimal solution to language processing for text classification tasks (Chai and
a given problem. Each ant represents a candidate solution and Lee, 2019). CNNs for text involve applying one-dimensional
4
Aytuğ Onan Journal of King Saud University – Computer and Information Sciences 35 (2023) 101611
convolutional filters to sliding windows of text, followed by pool- Algorithm 1. The general structure of the SRL-ACO algorithm.
ing operations to reduce the dimensionality of the output. This
approach allows the network to automatically learn features of dif-
ferent scales and complexities from the text input, without the
need for hand-crafted feature engineering (Onan, 2021). CNNs Input: Text dataset D, ACO parameters a, b, c, d, e, number of
have shown to be effective for a range of text classification tasks, iterations T, quality threshold q, T=10
including sentiment analysis, topic classification, and spam detec- Output: Augmented dataset D’
tion, among others. 1. SRL-based document modeling:
Long Short-Term Memory (LSTM) networks are a type of recur- 1.1. For each document d in D, tokenize d into a sequence of
rent neural network (RNN) that have been widely used in natural words or sub-word units.
language processing for text generation, language modeling, and 1.2. Parse the tokenized sequence using a syntactic parser to
sequence classification tasks (Kowsari et al., 2019). LSTMs are par- obtain the grammatical structure of the sentence.
ticularly effective in modeling sequences that have long-term 1.3. Apply a pre-trained SRL model to the parsed sequence
dependencies, which is a common challenge in natural language to obtain the SRL graph G.
processing. In contrast to traditional RNNs, which tend to suffer 1.4. Store G in a list Glist.
from the vanishing gradient problem, LSTMs use a set of gating 2. ACO-based text data augmentation:
mechanisms to selectively forget or remember information at each 2.1. For each SRL graph G in a list Glist, perform ACO-based
time step, allowing them to maintain information over longer text data augmentation to obtain a set of candidate
sequences. LSTMs have been used to generate realistic text, such solutions x.
as in chatbots and virtual assistants, as well as for sentiment anal- 2.2. Evaluate each candidate solution x using the objective
ysis and language translation, among other applications function f(x).
(Sendhilkumar, 2023). 2.3. Use the heuristic function h(x) to guide the search for
Gated Recurrent Units (GRUs) are another type of recurrent high-quality solutions.
neural network (RNN) that has been used for natural language pro- 3. Document evaluation:
cessing tasks, particularly in cases where long-term dependencies 3.1. For each candidate solution x, evaluate its quality based
need to be modeled (Onan, 2020). Like LSTMs, GRUs use gating on the criteria of semantic coherence, readability, and
mechanisms to selectively retain or discard information at each grammatical correctness.
time step, allowing them to capture long-term dependencies while 3.2. Select only the solutions that meet or exceed the
avoiding the vanishing gradient problem. However, unlike LSTMs, quality threshold q.
GRUs have a simpler architecture with fewer parameters, making 3.3. Store the selected solutions in a list xlist.
them easier to train and faster to execute. GRUs have been shown 4. Integration of new documents:
to be effective for tasks such as sentiment analysis, language mod- 4.1. For each selected solution x in xlist, convert x to a text
eling, and named entity recognition, among others. document using a graph-to-text conversion algorithm.
Transformers are a type of deep learning architecture that have 4.2. Add the generated document to the augmented dataset
revolutionized natural language processing tasks such as language D’.
translation, summarization, and question answering (Wolf et al., 5. Iterative refinement:
2020). Unlike traditional recurrent neural networks (RNNs), which 5.1. Set i=1.
process input sequentially and suffer from issues such as vanishing 5.2. While i T:
gradients, Transformers use a self-attention mechanism to process 5.2.1. Use the augmented dataset D’ to train a new SRL
the entire input sequence in parallel. This allows them to model model.
long-term dependencies more efficiently and effectively than RNNs 5.2.2. Repeat steps 2-4 for each document in D.
(Devlin et al., 2018). The Transformer architecture is based on an 5.2.3. Increment i.
encoder-decoder framework, in which the encoder reads the input 6. Output: the augmented dataset D’.
sequence and produces a hidden representation, and the decoder
generates the output sequence based on the encoder’s hidden rep-
resentation and an attention mechanism over the encoder’s input
sequence. Transformers have achieved state-of-the-art perfor- 4.1. SRL-based document modeling
mance on many natural language processing benchmarks, and
have become a key technology in the field (Singh and Mahmood, The algorithm starts by using Semantic Role Labeling (SRL) to
2021). identify the semantic roles of words in a given text document.
The SRL model analyzes the document and generates an SRL graph
that represents the semantic structure of the document. The goal of
4. Proposed method this phase is to analyze the input text document and identify the
semantic roles of the words in the document. This is done by
This section describes the proposed text augmentation model, applying a pre-trained SRL model to the document, which gener-
which involves five stages, namely, SRL-based document mod- ates an SRL graph that represents the semantic structure of the
elling, ACO-based text data augmentation, the document evalua- document. In the proposed scheme, AllenNLP has been utilized as
tion stage, the integration of new documents, and the iterative a pre-trained SRL model (Gardner et al., 2018). AllenNLP provides
refinement. The general structure of the proposed scheme has been a pre-trained SRL model that uses a transition-based neural net-
depicted in Fig. 1 and the stages of the proposed scheme has been work to identify the semantic roles of the words in the input sen-
outlined in Algorithm 1. tence. The model is based on the PropBank dataset and can be used
5
Aytuğ Onan Journal of King Saud University – Computer and Information Sciences 35 (2023) 101611
out of the box for English text. The SRL model used in our study is resented in an SRL graph, which is a directed graph that cap-
based on a neural network architecture that jointly predicts syn- tures the semantic structure of the document. The nodes in
tactic and semantic information from input sentences, using a the graph represent the words or concepts in the sentence,
combination of BiLSTM and self-attention layers. This architecture while the edges represent the semantic relationships between
has been shown to achieve state-of-the-art performance on various them. Let GSRL be the SRL graph that represents the semantic
SRL benchmarks. Specifically, we utilized the pre-trained SRL structure of the document. GSRL is a directed graph with nodes
model provided by the AllenNLP library, which implements this V and edges ESRL , where ESRL represents the semantic relation-
architecture and has been fine-tuned on the OntoNotes 5.0 corpus. ships between the nodes. The edges in ESRL are labeled with
The steps involved in the SRL-based document modeling phase can the corresponding semantic role labels assigned by the SRL
be outlined as follows: model, as follows: ðv i ; r j ; v k Þ 2 ESRL , where v i and v k are nodes
representing words or phrases in the document, and rj is the
Tokenization: The input text document is first tokenized into a semantic role label assigned by the SRL model to v i .
sequence of words or sub-word units, such as morphemes or
characters. This step is important because the SRL model oper- The SRL-based document modeling phase is a critical step in the
ates on a sequence of discrete tokens, rather than a continuous proposed approach for text data augmentation, as it provides a
string of text. Let D be the input text document, consisting of a structured and semantically meaningful representation of the
sequence of N tokens. The study uses D ¼ ðw1 ; w2 ; . . . ; wN Þ, input document that can be used as a basis for generating new,
where each wi is a token. semantically coherent documents using ACO.
Parsing: The tokenized sequence is then parsed using a syntac- Let we have the sentence ‘‘The dog chased the cat around the
tic parser, which analyzes the grammatical structure of the sen- tree.” The operations on the SRL-based document modeling has
tence and identifies the relationships between the words. This been employed as follows:
step is important because it provides the context for the SRL
model to identify the semantic roles of the words. Let G be Tokenization: The sentence is tokenized into a sequence of
the syntactic parse tree of the input document, which captures words: [‘‘The”, ‘‘dog”, ‘‘chased”, ‘‘the”, ‘‘cat”, ‘‘around”, ‘‘the”,
the grammatical structure of the sentence. We can represent ‘‘tree”, ‘‘.”]
this as G = (V, E), where V is the set of nodes representing words Parsing: The sequence is then parsed to identify the grammati-
or phrases in the document, and E is the set of edges represent- cal structure of the sentence. The resulting parse tree has been
ing the relationships between the nodes. given in Fig. 2.
SRL labeling: Once the parsing is complete, the SRL model is SRL labeling: The SRL model is then applied to the parsed
applied to the parsed sequence to identify the semantic roles of sequence to identify the semantic roles of the words. For exam-
the words. The SRL model typically uses a set of pre-defined roles, ple, the SRL model might assign the following semantic role
such as Agent, Patient, Time, Location, and Instrument, to label labels to the words:
the words in the sentence. Let R be the set of pre-defined semantic
roles that the SRL model can assign to each word in the document.
We can represent this as R ¼ ðr 1 ; r 2 ; . . . ; rK Þ. For each word wi in
the document, the SRL model assigns a semantic role label rj from
R to indicate the semantic function of the word in the sentence.
We can represent this as S ¼ ðs1 ; s2 ; . . . ; sN Þ, where si is the seman-
tic role label assigned to wi .
SRL graph construction: The output of the SRL labeling step is a
set of labeled words, along with their corresponding semantic
roles. These labeled words and their relationships are then rep- Fig. 2. The resulting parse tree.
6
Aytuğ Onan Journal of King Saud University – Computer and Information Sciences 35 (2023) 101611
o dog: Agent Pheromone trail updating: The third step is to simulate the
o chased: Verb behavior of ant colonies by using pheromone trail updating to
o cat: Patient guide the search for high-quality solutions. In this step, the
o tree: Location algorithm uses a set of pheromone trails that represent the
SRL graph construction: Finally, the labeled words and their quality of each candidate solution. The pheromone trails are
relationships are represented in an SRL graph. The resulting updated at each iteration of the algorithm based on the quality
graph has been given in Fig. 3. of the candidate solutions. Let s(x) be the pheromone trail that
represents the quality of the candidate solution x. The phero-
This graph represents the semantic structure of the sentence, mone trail is updated at each iteration of the algorithm based
with the labeled words as nodes and the labeled relationships as on the quality of the candidate solutions. We can represent
edges. the pheromone trail updating as follows: s(x) (1 - q)s(x) +
qDs(x), where q is the evaporation rate of the pheromone trail,
4.2. ACO-based text data augmentation and Ds(x) is the amount of pheromone deposited by the ants
that generated the solution x.
The algorithm applies ant colony optimization (ACO) to gener- Heuristic information: The fourth step is to use heuristic infor-
ate new text documents that are similar in meaning to the original mation to guide the search. Let h(x) be the heuristic information
document, but with variations in syntax and structure. In this case, that guides the search for high-quality solutions. The heuristic
the algorithm uses ACO to explore the search space of possible information can include prior knowledge about the language
variations of the original document by iteratively building and or domain, as well as statistical or linguistic features of the doc-
updating a set of candidate solutions. The steps involved in the ument. We can represent the heuristic information as
ACO-based text data augmentation stage can be outlined as hðxÞ ¼ dh1 ðxÞ þ eh2 ðxÞ, where d and e are the weights that bal-
follows: ance the contribution of each type of heuristic information,
and h1 ðxÞ and h2 ðxÞ are the heuristic functions that encode the
Solution representation: The first step in applying ACO to text linguistic and statistical features of the document, respectively.
data augmentation is to represent each candidate solution as a In this scheme, the density of nouns has been utilized has been
sequence of operations that modify the original document. utilized as h1 ðxÞ, and the number of named entities has been
These operations include inserting, deleting, or replacing words utilized as h2 ðxÞ:
or phrases, as well as reordering or shuffling parts of the docu- Stochastic decision-making: The fifth step is to use stochastic
ment. Let x be a candidate solution that represents a sequence decision-making to balance exploration and exploitation in
of L operations that modify the original document. We can rep- the search for high-quality solutions. The algorithm makes ran-
resent this as x ¼ ðx1 ; x2 ; . . . ; xL Þ, where xi is an operation that dom decisions based on the pheromone trails and heuristic
modifies the document. Let we have the sentence ‘‘The dog information to explore new areas of the search space, while also
chased the cat around the tree.” The first step is to represent exploiting promising solutions. The algorithm makes random
each candidate solution as a sequence of operations that modify decisions based on the pheromone trails and heuristic informa-
the original document. In this case, the candidate solutions tion to explore new areas of the search space, while also
could be represented as sequences of operations that insert, exploiting promising solutions. The study uses the stochastic
delete, or replace words or phrases, as well as reorder or shuffle decision-making as a probability distribution P(x) over the set
parts of the sentence. For example, a candidate solution might of candidate solutions, which is determined by the pheromone
be to replace ‘‘chased” with ‘‘followed”, resulting in the sen- trails and the heuristic information. The probability distribution
tence ‘‘The dog followed the cat around the tree.” can be expressed as P(x) = x(x) / Rx(x’), where x(x) is the
Objective function: The second step is to define an objective weight of the candidate solution , which is determined by
function that measures the quality of each candidate solution. the pheromone trail and the heuristic information, and the
Let f ðxÞ be the objective function that measures the quality of sum is over all candidate solutions x’.
the candidate solution x. We can represent the objective func-
tion as f ðxÞ ¼ af 1 ðxÞ þ bf 2 ðxÞ þ cf 3 ðxÞ, where a, b, and c are The ACO-based text data augmentation stage uses a combina-
weights that balance the contribution of each sub-function, tion of stochastic search, pheromone trail updating, and heuristic
and f 1 ðxÞ; f 2 ðxÞ, and f 3 ðxÞ are the sub-functions that measure information to explore the space of possible variations of the orig-
the semantic, readability, and grammatical quality of the mod- inal document and generate high-quality augmented documents.
ified document, respectively. The semantic quality sub-function
f 1 ðxÞ measures the similarity between the modified document 4.3. The document evaluation stage
and the original document in terms of their semantic structure
based on the cosine similarity. The readability quality sub- The goal of this stage is to evaluate the quality of the new doc-
function f 2 ðxÞ measures how easy the modified document is uments generated by the ACO algorithm based on criteria such as
to read and understand, using the Gunning Fox Index. The semantic coherence, readability, and grammatical correctness. This
grammatical quality sub-function f 3 ðxÞ measures the grammat- is done to ensure that the generated documents are of high quality
icality of the modified document, using the number of gram- and can be used for training machine learning models. The steps
matical errors. involved in the document evaluation stage:
7
Aytuğ Onan Journal of King Saud University – Computer and Information Sciences 35 (2023) 101611
G’ using the cosine similarity. We can represent the semantic the same format as the original dataset. We can represent the
similarity as sim(G, G’). The similarity between G and G’ can preprocessed documents as D’’ = p(D’).
be represented as sim(G, G’) = cos(G, G’) = (G. G’) / (||G|| * || Data merging: The second step is to merge the augmented doc-
G’||), where G. G’ represents the dot product between the two uments with the existing dataset. This may involve appending
SRL graphs, and ||G|| and ||G’|| represent the Euclidean norms the new documents to the end of the existing dataset, or insert-
of the two graphs. The cosine similarity measures the cosine ing them at random positions in the dataset to create a more
of the angle between the two SRL graphs, and ranges from 1 diverse and balanced dataset. Let D be the original dataset,
(completely dissimilar) to 1 (completely similar). A higher and let D’’ be the preprocessed augmented documents. We
cosine similarity indicates a greater degree of semantic coher- can merge the two datasets using a function m() that appends
ence between the original and generated documents. or inserts the augmented documents into the original dataset.
Readability: The second step is to assess the readability of the We can represent the merged dataset as D’ = m(D, D’’)
generated documents. Let R be the readability score of the gen- Data splitting: The third step is to split the merged dataset into
erated document. We can represent the readability score using training, validation, and testing sets. This is done using a strat-
the readability metric, the Gunning Fog index. We can represent ified sampling technique to ensure that the distribution of
the readability score as R = f(D’), where f is the readability met- labels is consistent across the different sets. Let D’ be the
ric and D’ is the generated document. merged dataset. We can split the dataset into training, valida-
Grammatical correctness: The third step is to assess the gram- tion, and testing sets using a function s() that performs strati-
matical correctness of the generated documents. Let Gc be the fied sampling. We can represent the split datasets as Dtrain,
grammatical correctness score of the generated document. Dval, and Dtest, respectively: Dtrain, Dval, Dtest = s(D’).
The study uses the grammatical correctness score using a gram-
mar checker or language model that detects grammatical errors 4.5. Iterative refinement
and provides suggestions for correction. In this scheme, the
GPT-3 language model has been utilized for grammatical error The goal of this stage is to further refine the augmented dataset
detection. The grammatical correctness function f 3 ðxÞ can be by repeating the ACO-based text data augmentation process and
defined as a function of the grammatical correctness score the document evaluation stage multiple times to generate addi-
(Gc) generated by the GPT-3 language model. For example, tional high-quality, semantically meaningful documents. The steps
f 3 ðxÞ ¼ expðGcÞ can be used to penalize solutions that contain involved in the iterative refinement stage include:
grammatical errors, with a higher penalty for a higher number
of errors. Alternatively, f 3 ðxÞ ¼ 1 if Gc is close to 1 (i.e., the gen- 1. ACO-based text data augmentation: The first step is to repeat
erated document is grammatically correct), and f 3 ðxÞ ¼ 0 other- the ACO-based text data augmentation process multiple times,
wise, can be used to only accept solutions that are each time generating a new set of augmented documents based
grammatically correct according to the GPT-3 language model. on the existing dataset. This may involve adjusting the param-
Quality threshold: The final step is to set a quality threshold for eters of the ACO algorithm, such as the number of ants, the
the generated documents, based on the criteria of semantic pheromone update rate, or the heuristic information, to explore
coherence, readability, and grammatical correctness. Only doc- different areas of the search space and generate more diverse
uments that meet or exceed the quality threshold are selected and high-quality documents. Let D be the original dataset and
for integration into the augmented dataset. We can represent Di be the augmented dataset generated in iteration i. We can
the quality threshold as q ¼ ðsimmin ; Rmin ; Gcmin Þ, where simmin , use the ACO algorithm with parameters Pi to generate a new
Rmin , and Gcmin are the minimum values for the semantic simi- set of augmented documents D’i. We can represent this as: D’i =-
larity, readability, and grammatical correctness scores, ACO(Di, Pi).
respectively. 2. Document evaluation: The second step is to evaluate the qual-
ity of the new documents generated in each iteration using the
Based on these definitions, the document evaluation stage can same criteria as in the previous document evaluation stage. This
be expressed as follows: is done to ensure that the new documents are of high quality
If sim(G, G’) simmin , R(D’) Rmin , and Gc(D’) Gcmin , then the and can be used effectively for training machine learning mod-
generated document D’ meets the quality threshold q and is els. Let D’i be the set of new augmented documents generated in
selected for integration into the augmented dataset. In this iteration i. We can evaluate the quality of the new documents
scheme, the desired level of quality for the augmented dataset is using the same document evaluation criteria as in the previous
high, and we have set simmin to 0.9, Rmin to 0.7, and Gcmin to 0.9. stage. We can represent this as: eval_metrics_i = E(D’i) where E()
is the evaluation function that computes the quality metrics.
4.4. The integration of new documents 3. New document integration: The third step is to integrate the
new, high-quality documents generated in each iteration into
The goal of this stage is to integrate the high-quality, aug- the existing dataset, following the same data preprocessing,
mented documents generated in the previous stages into the exist- merging, splitting, and training and evaluation steps as in
ing dataset for use in machine learning tasks. The steps involved in the previous integration of new documents stage. Let D’’i be
the integration of new documents: the set of high-quality, semantically meaningful documents
generated in iteration i. We can integrate the new documents
Data preprocessing: The first step is to preprocess the aug- into the existing dataset using the same merging, splitting,
mented documents to ensure that they are in the same format and training steps as in the previous stage. We can represent
as the original documents in the dataset. This may involve con- this as: Di+1 = merge(D, D’’i); Dtrain, Dval, Dtest = split(Di+1);
verting the documents to a common encoding, removing any Mi+1 = train(Dtrain) where merge() is the function that combi-
extraneous information or metadata, and standardizing the for- nes the original and new datasets, split() is the function that
matting of the text. Let D’ be the set of augmented documents splits the dataset into training, validation, and testing sets,
generated in the previous stages. We can preprocess the docu- and train() is the function that trains the machine learning
ments using a function p() that transforms the documents into model.
8
Aytuğ Onan Journal of King Saud University – Computer and Information Sciences 35 (2023) 101611
4. Performance monitoring: The fourth step is to monitor the per- Yelp Dataset: The third dataset, Yelp, is a collection of reviews
formance of the machine learning models trained on the aug- from the Yelp Dataset Challenge 2015, annotated as positive or
mented dataset in each iteration, using the same evaluation negative (Zhang et al., 2015).
metrics as in the previous stage. This is done to assess the effec- US Airline Dataset: The fourth dataset, US Airline, is a collection
tiveness of the iterative refinement process in improving the of tweets about major U.S. airlines classified as positive, nega-
performance and robustness of the models. Let Mi+1 be the tive, or neutral (Feng et al., 2022).
machine learning model trained on the augmented dataset in Toxic Dataset: The Toxic dataset is a commonly used dataset for
iteration i + 1. We can evaluate the performance of the model the task of detecting toxic or abusive language in online com-
on the validation and testing sets using the same evaluation ments. The dataset contains comments from Wikipedia’s talk
metrics as in the previous stage. We can represent this as: page edits, with each comment labeled as toxic or non-toxic
eval_metrics_val_i + 1 = E(Mi+1, Dval); eval_metrics_test_i + 1 = E (Haralabopoulos et al., 2020).
(Mi+1, Dtest). Semeval Dataset: The Semaval Dataset consists of 4,999 tweets
5. Iteration stopping criterion: The final step is to define a stop- that have been manually annotated for the presence of emotion,
ping criterion for the iterative refinement process, based on including anger, fear, joy, and sadness. The tweets were col-
the performance of the machine learning models or the quality lected from various sources, including news organizations,
of the augmented dataset. This criterion may involve stopping blogs, and social media platforms. Each tweet is annotated by
the iterative refinement process when the performance of the multiple annotators to ensure high inter-annotator agreement
models reaches a certain threshold or when the quality of the (Mohammad et al., 2018).
augmented dataset no longer improves. In the proposed Sarcasm Dataset: The sarcasm dataset is a collection of Twitter
scheme, we have defined a convergence threshold for the itera- messages that have been annotated as sarcastic or non-
tive refinement process, based on the improvement in the per- sarcastic. The dataset was collected using a methodology based
formance or quality of the augmented dataset between on self-annotation by Twitter users. Specifically, tweets con-
consecutive iterations. The iterative refinement process can be taining the hashtags ‘‘sarcasm” or ‘‘ironic” were considered as
terminated when the improvement falls below the convergence sarcastic, while tweets containing hashtags indicating positive
threshold. The convergence threshold has been set to 0.05. or negative emotions were considered non-sarcastic. In total,
approximately 40,000 tweets in English were collected using
5. Experimental results and discussion this method. The sarcasm dataset is our own collected corpus
and has been used in empirical evaluations of sarcasm detec-
To evaluate the effectiveness of the proposed SRL-ACO approach tion algorithms (Onan and Toçoğlu, 2021).
for text data augmentation, extensive experiments have been con-
ducted on several benchmark datasets. In this section, we first Table 2 presents the number of sentences or tweets for each of
introduce the datasets used in the experiments, followed by a the seven datasets before and after applying the SRL-ACO method
description of the baseline methods and experimental settings. proposed in our study. The before column shows the number of
Specifically, we provide details on the low-data regime setting sentences or tweets in the original dataset, while the after (SRL-
and model training. Next, we present the experimental results ACO) column displays the number of sentences or tweets after
and comparisons between the proposed approach and the baseline applying the SRL-ACO method.
methods.
5.2. Baseline models
5.1. Datasets
The proposed approach (i.e., SRL-ACO) for text data augmenta-
To ensure fair comparison and benchmarking, we selected the tion has been compared to sixteen state-of-the-art techniques.
same datasets as those used in recently proposed text data aug- These techniques include:
mentation frameworks (Haralabopoulos et al., 2021; Feng et al.,
2022). This approach allows us to evaluate the effectiveness and 1. Backtranslation: Backtranslation (Edunov et al., 2018)
efficiency of our proposed technique in a consistent and standard- involves translating the original language into another lan-
ized way, and to compare its performance to that of other state-of- guage and then translating it back to the original language
the-art methods. By using the same datasets, we can isolate the using pre-trained translation models such as EN-DE, DE-
impact of our data augmentation technique on the performance EN, EN-RU, and RU-EN. By applying these models, each sen-
of the model, and assess its robustness across different datasets tence can generate two synthetic sentences.
and tasks. Moreover, using publicly available datasets that are 2. EDA: EDA is a simple but effective data augmentation tech-
commonly used in the literature increases the reproducibility nique consisting of four primary operations: synonym
and generalizability of our results, allowing other researchers to replacement, random insertion, random swap, and random
validate and build upon our findings. In summary, using the same deletion. All these operations are easy to implement to pro-
datasets as recently proposed text data augmentation frameworks
is a common practice in the field, and helps to ensure fair compar-
ison and rigorous evaluation of new techniques. The paper evalu-
ates the proposed sentiment analysis model on seven widely- Table 2
used benchmark datasets on sentiment analysis and sarcasm The parameters of the SRL-ACO scheme.
identification:
Dataset Before After (SRL-ACO)
SST-2 Dataset: The first dataset, SST-2, is the Stanford Senti- SST-2 11,855 35,565
Senti140 1,600,000 3,200,000
ment Treebank, which consists of movie reviews annotated as Yelp 5,6 33,6
positive or negative (Usama et al., 2020). US Airline 14,64 87,84
Senti140 Dataset: The second dataset, Senti140, contains Toxic 159,571 319,142
1,600,000 labeled tweets covering various brands and products SemEval 4,999 29,994
Sarcasm 40 120
on Twitter (Go et al., 2009; Onan, 2022).
9
Aytuğ Onan Journal of King Saud University – Computer and Information Sciences 35 (2023) 101611
duce a large number of synthetic sentences (Wei and Zou, 2019). For this case, an ensemble of these operations has been performed.
3. CBERT: CBERT retrofits the deep bidirectional language model BERT with a label-conditional constraint to produce synthetic sentences. The pre-trained BERT is first fine-tuned by predicting the masked word based on its context and label, and the well-trained CBERT is then used to predict various words in the masked places to generate new sentences (Wu et al., 2019).
4. TF-IDF replacing: The model attempts to produce diverse and valid samples by preserving keywords and replacing uninformative words with low TF-IDF scores. It determines the importance of each word by calculating its frequency and TF-IDF score and then replaces the low-scored words with each other (Xie et al., 2020).
5. SSMBA: The model is a self-supervised technique aiming to generate out-of-domain samples. SSMBA first employs a corruption function to perturb samples off the data manifold and then projects these corrupted samples back based on a reconstruction function, using the pre-trained BERT model. The merit of SSMBA is its simplicity, without requiring task-specific knowledge or any fine-tuning (Ng et al., 2020).
6. BF-Translation: This is a technique specially designed and tested for sentiment analysis. It uses German as an intermediate language and the Google Translation API as a translation tool to enhance data for sentiment analysis (Body et al., 2021).
7. The Tailored Text Augmentation (TTA): The technique proposes probabilistic word sampling for synonym replacement based on word relevance to sentiment and identifies words irrelevant to sentiment but discriminative for training data to improve the model's generalization capability in sentiment analysis (Feng et al., 2022).
8. Random Deletion: This technique randomly removes one word from each sentence (Wei and Zou, 2019).
9. Synonym Replacement: This technique replaces one word with a randomly chosen synonym (Wei and Zou, 2019).
10. Random Synonym Insertion: This technique inserts a random synonym into a random position in the sentence (Wei and Zou, 2019).
11. Generative Adversarial Network (GAN): This is a deep learning technique that generates synthetic data by training a generator model to produce fake data that resembles the real data, and a discriminator model to distinguish between real and fake data, leading to improved performance of machine learning models (Moreno-Barea et al., 2020).
12. Permutation: This technique rearranges each sentence n! times, ensuring every sentence is equally permutated while maintaining statistical properties (Haralabopoulos et al., 2021).
13. Antonym: This technique involves replacing a verb, adjective, or noun in a sentence with its antonym, resulting in the opposite meaning of the sentence. The classification of the sentence is also reversed accordingly. This method is suitable for cases where classes have comparable polarities (Haralabopoulos et al., 2021).
14. Negation: This technique adds a negation adverb to negate the sentence meaning and create an opposite classified copy (Haralabopoulos et al., 2021).
15. TDA: TDA employs the permutation technique to rearrange each sentence in the corpus n! times, the antonym technique to replace a verb, adjective or noun with its antonym, and the negation technique to add a negation adverb to the sentence to create an opposite classified copy (Haralabopoulos et al., 2021).
16. AMR-CLONALG: This method uses CLONALG and AMR graphs to improve data quality and volume. The approach preserves domain-specific keywords, controls text mutations, and ensures label preservation while allowing traceability and interpretability (Ahmed et al., 2023).

5.3. Model variations

The study considered three different variations of models to assess the effectiveness of the proposed SRL-ACO framework. The model variations used in the empirical analysis are briefly outlined below:

AMR-ACO: A variant of SRL-ACO that uses abstract meaning representation (AMR) graphs instead of SRL.
SRL-PSO: A variant of SRL-ACO that uses the particle swarm optimization algorithm instead of ant colony optimization.
SRL-ACO-woIR: In this scheme, SRL-ACO has been employed, but without the iterative refinement stage.

5.4. Experimental settings

For the text data augmentation methods, default model parameters were used in our experiments. This means that the methods were implemented according to the standard settings provided by the original authors or the software packages used. The parameter values of the proposed scheme (SRL-ACO) have been presented in Table 3. The values for the parameters in the SRL-ACO scheme were determined based on extensive empirical analysis and optimization. Specifically, we conducted experiments on several benchmark datasets to evaluate the performance of the method under different parameter settings. We started with a range of possible values for each parameter and systematically varied them to identify the best-performing combinations. We used a grid search approach coupled with a validation set to find the optimal values. The specific values chosen for each parameter were those that produced the best performance across all datasets and were thus deemed to be the most appropriate for the task at hand. We also conducted a sensitivity analysis to ensure that the chosen values were robust to changes in the dataset and problem setting. The parameter values reported in Table 3 are the result of careful empirical analysis and optimization to ensure the best possible performance of the SRL-ACO scheme. The performance of the text data augmentation approaches has been evaluated in conjunction with three classifiers, namely, Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Unit (GRU) networks. These architectures were selected based on their effectiveness in text classification tasks and their popularity in the literature. CNNs are widely used for text classification, particularly for their ability to capture local and global dependencies in text data. LSTM and GRU networks are also popular choices for text classification, especially when dealing with sequential data. Table 4 provides the hyperparameters of the selected DNN models. These parameters were determined based on an extensive empirical analysis and optimization process. We first used a grid search approach to explore a range of hyperparameters and selected the best performing ones. We then fine-tuned these parameters using a random search method to further improve the performance of the models.
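To illustrate the tuning procedure described above, a minimal sketch of a grid search over augmentation parameters with a held-out validation split is given below. The parameter names (n_ants, evaporation_rate, n_iterations) and the augment_fn and train_and_score callables are illustrative assumptions introduced only for this sketch; the actual parameter names and values are those reported in Tables 3 and 4.

from itertools import product

# Hypothetical SRL-ACO parameter grid; the real parameter names and
# ranges are the ones reported in Table 3, not the ones assumed here.
param_grid = {
    "n_ants": [10, 20, 50],
    "evaporation_rate": [0.1, 0.3, 0.5],
    "n_iterations": [50, 100],
}

def tune(train_texts, train_labels, val_texts, val_labels,
         augment_fn, train_and_score):
    # Augment the training split with each parameter combination, train a
    # classifier, and keep the combination with the best validation accuracy.
    best_params, best_acc = None, -1.0
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        aug_texts, aug_labels = augment_fn(train_texts, train_labels, **params)
        acc = train_and_score(aug_texts, aug_labels, val_texts, val_labels)
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc

A sensitivity analysis of the kind described above can then be carried out by re-running the search on different datasets and inspecting how stable the selected combination remains.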
Table 3
The parameters of the SRL-ACO scheme.
Table 5
The accuracy values obtained by the augmentation methods on CNN.
Negation, antonym, and permutation schemes also yield promising results. The performance of traditional augmentation methods, such as synonym replacement and random deletion, was not as good as that of the proposed methods, although they still achieved higher accuracy than the original (non-augmented) dataset.

Tables 6 and 7 show the accuracy values obtained by the various augmentation methods on the LSTM and GRU models, respectively. For the LSTM model, the highest accuracy scores were achieved by the SRL-ACO augmentation method on all datasets. For the GRU model, the SRL-ACO method outperformed all other methods. The results suggest that data augmentation techniques can improve the accuracy of text classification models, and that the SRL-ACO and SRL-ACO-woIR methods are effective in augmenting text data for both LSTM and GRU models.

The results in Table 8 show that data augmentation techniques generally improve the accuracy of the BERT classifier, with the SRL-ACO technique achieving the highest accuracy on most of the datasets. The performance of the different techniques varies depending on the dataset, with some techniques performing better on certain datasets than others. Among the techniques, the SRL-ACO technique achieves the highest accuracy on the majority of the datasets, with an average accuracy of 88.69%. This technique uses swarm intelligence algorithms to select the most informative samples for augmentation, which may explain its effectiveness. Similarly, the results presented in Table 9 indicate that almost all augmentation methods have improved the performance of the classifiers compared to the no-augmentation baseline. The SRL-ACO method achieved the highest accuracy scores on all datasets, with significant improvements over the no-augmentation baseline. The AMR-ACO and SRL-PSO methods also performed well, showing consistent improvements across all datasets. The results suggest that data augmentation methods, particularly the SRL-ACO method, can be effective in improving the performance of text classifiers, and this can be particularly useful when working with limited amounts of training data.

In Fig. 4, the bar chart for the classifiers in terms of classification accuracy has been presented. Looking at the results in Tables 3, 4 and 5, and Fig. 4, it is interesting to note that the CNN classifier achieved the lowest predictive performance across all augmentation methods and datasets.
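For reference, a minimal sketch of the traditional word-level baselines discussed here (random deletion and synonym replacement, in the spirit of Wei and Zou, 2019) is given below; the deletion probability and the use of NLTK's WordNet are illustrative assumptions rather than the exact settings used in the experiments.

import random
from nltk.corpus import wordnet  # requires nltk and the 'wordnet' corpus

def random_deletion(tokens, p=0.1):
    # Drop each token with probability p, keeping at least one token.
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def synonym_replacement(tokens, n=1):
    # Replace up to n tokens that have WordNet synsets with a random synonym.
    out = tokens[:]
    candidates = [i for i, t in enumerate(tokens) if wordnet.synsets(t)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wordnet.synsets(out[i])
                    for lemma in synset.lemmas()}
        synonyms.discard(out[i])
        if synonyms:
            out[i] = random.choice(sorted(synonyms))
    return out

print(synonym_replacement("the movie was surprisingly good".split()))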
Table 6
The accuracy values obtained by the augmentation methods on LSTM.
Table 7
The accuracy values obtained by the augmentation methods on GRU.
Table 8
The accuracy values obtained by the augmentation methods on GPT.
On the other hand, the GRU and LSTM classifiers obtained higher results, with LSTM consistently outperforming GRU. In fact, BERT achieved the highest average predictive performance across all datasets and augmentation methods. These results suggest that the choice of the deep learning text classifier has a significant impact on the effectiveness of text data augmentation methods.

In Fig. 5, the bar chart for accuracy based on the state-of-the-art augmentation methods has been presented. As can be observed from Fig. 5, the augmentation methods outperform the case where no augmentation has been employed. In addition, the highest average results in terms of classification accuracy have been achieved by SRL-ACO.

We carried out experiments to evaluate the performance of the SRL-ACO approach using limited training data, with the number of training samples ranging from 10 to 500. Our findings demonstrate that increasing the training data size significantly enhances the performance of deep neural learning classifiers. Increasing the amount of training data allows deep neural learning classifiers to better learn patterns and generalize to new, unseen examples. When a model is trained on a larger and more diverse set of examples, it has the opportunity to learn a wider range of features and correlations between them, leading to improved performance on test data. This is because the model has a better understanding of the underlying structure of the data and can make more accurate predictions for previously unseen examples. Additionally, increasing the training data size can help prevent overfitting, where the model becomes too specialized to the training data and performs poorly on new data. Fig. 6 summarizes the experimental results on different training sizes. In Fig. 7, the interaction plot for accuracy for different datasets and training sizes has been presented. As the size of the training data is increased, the predictive performance of both text data augmentation methods and deep neural networks improves across all datasets. The same is also valid for the different augmentation models, as summarized in Fig. 8.

To further examine the performance of the proposed scheme on the full dataset, Table 10 shows the results of the experiments conducted with different configurations of BERT, CNN, GPT, GRU, and LSTM classifiers, both with and without data augmentation using the SRL-ACO scheme. In general, the SRL-ACO scheme leads to significant improvements in the classification accuracy of all
Table 9
The accuracy values obtained by the augmentation methods on BERT.
Fig. 4. The bar chart for accuracy based on classifiers.
Fig. 6. The box-plot for accuracy based on different training sizes.
Fig. 7. The interaction plot for accuracy based on different datasets and training sizes.
Fig. 8. The interaction plot for accuracy based on different augmentation methods.
Table 10
The comparison of classifiers on full dataset.
deep neural learning classifiers, and their interactions. This information can be used to optimize the performance of the classifiers by selecting the most effective combination of factors. In addition, Fig. 9 presents the interval plot of accuracy for the augmentation methods. As can be observed from Fig. 9, the higher predictive performance values obtained by the proposed scheme (i.e., SRL-ACO) lie above the right dashed line, indicating that the results are statistically significant.
Table 11
Two-way ANOVA test results.
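To indicate how a test of this kind can be reproduced, a minimal sketch of a two-way ANOVA with statsmodels is given below, assuming augmentation method and classifier as the two factors; the accuracy values generated here are placeholder data for illustration only, whereas the actual test in Table 11 is based on the accuracies reported in the preceding tables.

import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Placeholder long-format results: one accuracy value per
# (augmentation method, classifier, dataset) cell.
rng = np.random.default_rng(0)
base = {"None": 0.82, "EDA": 0.85, "SRL-ACO": 0.89}
bump = {"CNN": 0.00, "LSTM": 0.02, "GRU": 0.015, "BERT": 0.03}
rows = [{"method": m, "classifier": c, "dataset": d,
         "accuracy": base[m] + bump[c] + rng.normal(0, 0.01)}
        for m, c, d in itertools.product(base, bump, ["SST-2", "Senti140", "Toxic"])]
results = pd.DataFrame(rows)

# Two-way ANOVA: main effects of method and classifier plus their interaction.
model = smf.ols("accuracy ~ C(method) * C(classifier)", data=results).fit()
print(anova_lm(model, typ=2))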
Some insights and discussion points based on the empirical results for the augmentation methods are given below:

Performance improvement: The empirical results indicate that augmentation methods can significantly improve the predictive performance of deep learning models on various text classification tasks. This suggests that data augmentation can be an effective strategy to overcome the challenges of limited training data.
Method effectiveness: Among the augmentation methods evaluated, SRL-ACO, AMR-ACO, and SRL-PSO consistently achieved the highest performance across multiple datasets and deep learning architectures. This suggests that these methods are particularly effective for improving the performance of text classifiers. However, it is worth noting that different datasets and architectures may have different optimal augmentation methods.
Importance of dataset and architecture: The results also highlight the importance of choosing the appropriate dataset and architecture for a particular task. For example, while LSTM achieved the highest average performance across all datasets, the performance of GRU was particularly high on the Yelp dataset. Similarly, while SRL-ACO was the most effective augmentation method overall, it was outperformed by other methods on some datasets and architectures.
Impact of training size: As expected, increasing the amount of training data generally leads to improved performance for all augmentation methods and architectures. However, the impact of training size varies depending on the dataset and architecture. For example, the performance of GRU and LSTM was relatively stable as the training size increased on the SST-2 dataset, while the performance of CNN improved significantly with larger training sizes.

Even for large datasets, the proposed technique may still be beneficial in certain ways. Firstly, while large datasets may have many instances, the feature space can still be quite limited, which can result in overfitting and poor generalization. By using data augmentation techniques, the feature space can be expanded, allowing the model to better capture the underlying patterns in the data and improve generalization. Additionally, data augmentation can also help to address class imbalance, which is often a challenge in large datasets where some classes may be overrepresented while others may be underrepresented. By artificially increasing the size of the minority classes through data augmentation, the model can better learn to distinguish between them and improve performance. Lastly, data augmentation can also be used to improve the robustness of the model to variations in the input, such as different sentence structures or word orderings, which can be particularly important in tasks such as language translation or text generation. In summary,
even for large datasets, using data augmentation techniques can still provide benefits such as improving generalization, addressing class imbalance, and enhancing the robustness of the model.

The proposed work addresses a research gap in the domain of text data augmentation by combining two powerful techniques, SRL and ACO, in a novel way to generate high-quality synthetic text data. The proposed SRL-ACO approach is specifically designed to address the challenge of limited training data in natural language processing tasks, which is a common problem faced by researchers and practitioners alike. By using SRL-ACO to augment the training data, our proposed approach can improve the performance and generalizability of various NLP models, which is crucial for real-world applications. Therefore, this study makes a valuable contribution to the research field of text data augmentation and has the potential to benefit various NLP applications.

SRL-ACO is effective in generating high-quality synthetic data that can be used to enhance the performance of NLP models. The results show that SRL-ACO improves the performance of a classifier on different NLP tasks.
The proposed method outperforms other data augmentation techniques, such as back-translation and word replacement, indicating that the use of SRL in combination with ACO can be a promising approach for generating high-quality synthetic data for NLP tasks.
The effectiveness of the proposed method is more pronounced in datasets with smaller training sizes, indicating that the method can be especially useful in scenarios where obtaining large amounts of labeled data is challenging or expensive.
The performance gains achieved by SRL-ACO increase as the size of the training data increases, indicating that the method can benefit from larger datasets.
The proposed method is not significantly affected by the choice of deep learning architecture, as similar performance gains were observed across different models, including CNN, LSTM, and GRU.
The use of SRL-ACO can result in a substantial increase in the size of the training dataset, which can be useful in scenarios where large amounts of training data are needed to achieve high performance.
The proposed method can improve the interpretability of NLP models, as the generated synthetic data can provide insights into the underlying patterns and relationships in the data.

6. Conclusion

Training deep learning models for natural language processing tasks requires creating high-quality labeled data, which can be a time-consuming and labor-intensive process that involves manual annotation by human annotators. This can lead to inconsistent labeling and arbitrary standards due to variations in annotators' competency, training, and experience. To address these challenges, researchers have been exploring automated methods for enhancing training and testing datasets. In this paper, we propose a novel text augmentation framework called SRL-ACO that leverages Semantic Role Labeling (SRL) and Ant Colony Optimization (ACO) techniques to generate additional training data for NLP models. SRL-ACO uses SRL to identify the semantic roles of words in a sentence and ACO to generate new sentences that preserve these roles. This approach enhances the accuracy of NLP models by generating additional data without requiring manual data annotation. We tested the effectiveness of the SRL-ACO framework on seven text classification datasets, including sentiment analysis, toxic text detection, and sarcasm identification. The results show that SRL-ACO improves the performance of classifiers on different NLP tasks, achieving an average accuracy improvement of 3.23%, 1.68%, and 4.49% for sentiment analysis, toxic text detection, and sarcasm identification, respectively. These findings demonstrate that SRL-ACO has the potential to enhance the quality and quantity of training data for various NLP tasks. In future work, we plan to investigate the effectiveness of SRL-ACO on other NLP tasks and explore its combination with other data augmentation techniques to further improve the performance of NLP models.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Aggarwal, C.C., Zhai, C., 2012. A survey of text classification algorithms. Mining Text Data, 163–222.
Ahmed, H., Traore, I., Mamun, M., Saad, S., 2023. Text augmentation using a graph-based approach and clonal selection algorithm. Machine Learn. Appl. 11, 100452.
Artstein, R., 2017. Inter-annotator agreement. Handbook Linguist. Annot., 297–313.
Bansal, M.A., Sharma, D.R., Kathuria, D.M., 2022. A systematic review on data scarcity problem in deep learning: solution and applications. ACM Comput. Surv. (CSUR) 54 (10s), 1–29.
Bayer, M., Kaufhold, M.A., Buchhold, B., Keller, M., Dallmeyer, J., Reuter, C., 2023. Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers. Int. J. Mach. Learn. Cybern. 14 (1), 135–150.
Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), 1798–1828.
Body, T., Tao, X., Li, Y., Li, L., Zhong, N., 2021. Using back-and-forth translation to create artificial augmented textual data for sentiment analysis models. Expert Syst. Appl. 178, 115033.
Chai, J., Li, A., 2019. Deep learning in natural language processing: A state-of-the-art survey. In: 2019 International Conference on Machine Learning and Cybernetics (ICMLC). IEEE, pp. 1–6.
Dai, H., Liu, Z., Liao, W., Huang, X., Wu, Z., Zhao, L., et al., 2023. ChatAug: Leveraging ChatGPT for text data augmentation. arXiv preprint arXiv:2302.13007.
Dargan, S., Kumar, M., Ayyagari, M.R., Kumar, G., 2020. A survey of deep learning and its applications: a new paradigm to machine learning. Arch. Comput. Meth. Eng. 27, 1071–1092.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. https://doi.org/10.1109/CVPR.2009.5206848.
Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dorigo, M., Stützle, T., 2019. Ant Colony Optimization: Overview and Recent Advances. Springer International Publishing, pp. 311–351.
Dorigo, M., Birattari, M., Stutzle, T., 2006. Ant colony optimization. IEEE Comput. Intell. Mag. 1 (4), 28–39.
Edunov, S., Ott, M., Auli, M., Grangier, D., 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.
Eke, C.I., Norman, A.A., Shuib, L., Nweke, H.F., 2020. Sarcasm identification in textual data: systematic review, research challenges and open directions. Artif. Intell. Rev. 53, 4215–4258.
Feng, S.Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., Hovy, E., 2021. A survey of data augmentation approaches for NLP. arXiv preprint arXiv:2105.03075.
Feng, Z., Zhou, H., Zhu, Z., Mao, K., 2022. Tailored text augmentation for sentiment analysis. Expert Syst. Appl. 205, 117605.
Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., et al., 2018. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.
Go, A., Bhayani, R., Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N project report, Stanford 1 (12), 2009.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press, pp. 20–32.
Haralabopoulos, G., Anagnostopoulos, I., McAuley, D., 2020. Ensemble deep learning for multilabel binary classification of user-generated content. Algorithms 13 (4), 83.
Haralabopoulos, G., Torres, M.T., Anagnostopoulos, I., McAuley, D., 2021. Text data augmentations: permutation, antonyms and negation. Expert Syst. Appl. 177, 114769.
Heaton, J.B., Polson, N.G., Witte, J.H., 2017. Deep learning for finance: deep portfolios. Appl. Stoch. Model. Bus. Ind. 33 (1), 3–12.
Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., Xing, E.P., 2017. Toward controlled generation of text. In: International Conference on Machine Learning, PMLR, pp. 1587–1596.
Kobayashi, S., 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201.
Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D., 2019. Text classification algorithms: A survey. Information 10 (4), 150.
Kwon, S., Lee, Y., 2023. Explainability-based mix-up approach for text data augmentation. ACM Transactions on Knowledge Discovery from Data 17 (1), 1–14.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444.
Liu, P., Wang, X., Xiang, C., Meng, W., 2020. A survey of text data augmentation. In: 2020 International Conference on Computer Communication and Network Security (CCNS). IEEE, pp. 191–195.
Medhat, W., Hassan, A., Korashy, H., 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams Eng. J. 5 (4), 1093–1113.
Mohammad, S., Bravo-Marquez, F., Salameh, M., Kiritchenko, S., 2018. SemEval-2018 task 1: Affect in tweets. In: Proceedings of the 12th International Workshop on Semantic Evaluation, pp. 1–17.
Moreno-Barea, F.J., Jerez, J.M., Franco, L., 2020. Improving classification accuracy using data augmentation on small data sets. Expert Syst. Appl. 161, 113696.
Munappy, A., Bosch, J., Olsson, H.H., Arpteg, A., Brinne, B., 2019. Data management challenges for deep learning. In: 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, pp. 140–147.
Ng, N., Cho, K., Ghassemi, M., 2020. SSMBA: Self-supervised manifold based data augmentation for improving out-of-domain robustness. arXiv preprint arXiv:2009.10195.
Onan, A., 2020. Mining opinions from instructor evaluation reviews: a deep learning approach. Comput. Appl. Eng. Educ. 28 (1), 117–138.
Onan, A., 2021. Sentiment analysis on massive open online course evaluations: a text mining and deep learning approach. Comput. Appl. Eng. Educ. 29 (3), 572–589.
Onan, A., 2022. Bidirectional convolutional recurrent neural network architecture with group-wise enhancement mechanism for text sentiment classification. J. King Saud Univ.-Comput. Informat. Sci. 34 (5), 2098–2117.
Onan, A., 2023. Improving Turkish text sentiment classification through task-specific and universal transformations: an ensemble data augmentation approach. Appl. Sci.
Onan, A., Toçoğlu, M.A., 2021. A term weighted neural language model and stacked bidirectional LSTM based framework for sarcasm identification. IEEE Access 9, 7701–7722.
Rajkomar, A., Oren, E., Chen, K., Dai, A.M., Hajaj, N., Hardt, M., Dean, J., 2018. Scalable and accurate deep learning with electronic health records. npj Digital Med. 1 (1), 18.
Ringger, E., McClanahan, P., Haertel, R., Busby, G., Carmen, M., Carroll, J., et al., 2007. Active learning for part-of-speech tagging: Accelerating corpus annotation. In: Proceedings of the Linguistic Annotation Workshop, pp. 101–108.
Schwartz, R., Abend, O., Reichart, R., Rappoport, A., 2011. Neutralizing linguistically problematic annotations in unsupervised dependency parsing evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 663–672.
Sendhilkumar, S., 2023. Developing a conceptual framework for short text categorization using hybrid CNN-LSTM based Caledonian crow optimization. Expert Syst. Appl. 212, 118517.
Shinde, P.P., Shah, S., 2018. A review of machine learning and deep learning applications. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA). IEEE, pp. 1–6.
Shorten, C., Khoshgoftaar, T.M., Furht, B., 2021. Text data augmentation for deep learning. J. Big Data 8, 1–34.
Singh, S., Mahmood, A., 2021. The NLP cookbook: modern recipes for transformer based deep learning architectures. IEEE Access 9, 68675–68702.
Ubani, S., Polat, S.O., Nielsen, R., 2023. ZeroShotDataAug: Generating and augmenting training data with ChatGPT. arXiv preprint arXiv:2304.14334.
Usama, M., Ahmad, B., Song, E., Hossain, M.S., Alrashoud, M., Muhammad, G., 2020. Attention-based sentiment analysis using convolutional and recurrent neural network. Futur. Gener. Comput. Syst. 113, 571–578.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008.
Wang, W.Y., Yang, D., 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2557–2563.
Wei, J., Zou, K., 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196.
Whang, S.E., Lee, J.G., 2020. Data collection and quality challenges for deep learning. Proc. VLDB Endowment 13 (12), 3429–3432.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., et al., 2020. Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45.
Wu, X., Lv, S., Zang, L., Han, J., Hu, S., 2019. Conditional BERT contextual augmentation. In: Computational Science–ICCS 2019: 19th International Conference, Faro, Portugal, June 12–14, 2019, Proceedings, Part IV 19. Springer International Publishing, pp. 84–95.
Xie, Q., Dai, Z., Hovy, E., Luong, T., Le, Q., 2020. Unsupervised data augmentation for consistency training. Adv. Neural Inf. Proces. Syst. 33, 6256–6268.
Zhang, L., Wang, S., Liu, B., 2018. Deep learning for sentiment analysis: A survey. Wiley Interdiscip. Rev.: Data Mining Knowledge Discove. 8 (4), e1253.
Zhang, X., Zhao, J., LeCun, Y., 2015. Character-level convolutional networks for text classification. Adv. Neural Informat. Process Syst. 28.