A Neural-Based Architecture For Small Datasets Classification
A Neural-Based Architecture For Small Datasets Classification
A Neural-based Architecture
For Small Datasets Classification
Andi Rexha Mauro Dragoni Roman Kern
[email protected] [email protected] [email protected]
Know-Center Fondazione Bruno Kessler Graz University of Technology
Graz, Austria Trento, Italy Graz, Austria
ABSTRACT other hand, researchers started to deal with the problem of missing
Digital Libraries benefit from the use of text classification strategies data. Indeed, many domains suffer from the lack of datasets mainly,
since they are enablers for performing many document manage- due to the costs of manual labeling.
ment tasks like Information Retrieval. The effectiveness of such One of the literature’s proposed techniques for addressing this
classification strategies depends on the amount of available data challenge is based on data augmentation [6]. Briefly, it consists of
and the classifier used. The former leads to the design of data aug- expanding original datasets with new samples created by applying
mentation solutions where new samples are generated into small proper similarity metrics by taking into account the features used
datasets based on the semantic similarity between existing samples for building the classification model. Data augmentation is used in
and concepts defined within external linguistic resources. The lat- many tasks ranging from the Natural Language Processing (NLP)
ter relates to the capability of finding, which is the best learning ones (e.g., word sense disambiguation, sentiment analysis, etc.) to
principle to adopt for designing an effective classification strategy image and video recognition. Within the NLP domain, conven-
suitable for the problem. In this work, we propose a neural-based tional techniques are built upon the use of distributional semantic
architecture thought for addressing the text classification problem strategies applied for selecting similar terminology to include in
on small datasets. Our architecture is based on BERT equipped the automatically generated samples.
with one further layer using the sigmoid function. The hypothe- While data augmentation solutions demonstrated to be effective,
sis we want to verify is that by using a BERT-based architecture, the use of artificial samples may have detrimental effects on the
the vectors’ semantic learned by the BERT model can perform overall effectiveness of the generated classification model. This
effective classification on small datasets without the use of data problem mainly occurs since a domain expert does not supervise
augmentation strategies. We observed improvements up to 14% in the new samples. Hence, they can introduce errors in the generated
the accuracy and up to 23% in the f-score with respect to baseline models. This aspect has been demonstrated in the literature by
classifiers exploiting data augmentation. considering, for example, query expansion strategies [1, 4].
In this paper, we propose a neural-based architecture designed
KEYWORDS for addressing the challenge described above. Such an architecture
has been thought for learning effective models by working on small
Text Classification; Small Datasets; Data Augmentation
datasets. With this work, we want to answer the following research
ACM Reference Format: questions:
Andi Rexha, Mauro Dragoni, and Roman Kern. 2020. A Neural-based Archi-
tecture For Small Datasets Classification. In Proceedings of the ACM/IEEE 1. Is it possible to design a neural-based architecture able to
Joint Conference on Digital Libraries in 2020 (JCDL ’20), August 1–5, 2020, build effective models from small datasets, that outperforms
Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10. state of the art data augmented techniques?
1145/3383583.3398535 2. By using such an architecture, would the use of data augmen-
tation techniques have a detrimental effect on the overall
1 INTRODUCTION effectiveness or, at least, to have not statistically significant
The new spring of Artificial Intelligence (AI) opened the opportu- improvements?
nity of designing and implementing textual classification strategies Concerning the first research question, the proposed approach
in many research fields that can benefit, in particular, by the integra- has been validated, on four datasets publicly available, by compar-
tion of neural-based solutions into existing classifiers. On the one ing obtained results three baselines models trained by exploiting
hand, such an integration demonstrated to be suitable for improv- state of the art augmentation techniques for digital libraries. In
ing the overall effectiveness of existing textual classifiers. On the particular, in this work, we used as baselines the classifiers trained
with augmented datasets presented in [6]. While answering the
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed second research question, we applied three different data augmen-
for profit or commercial advantage and that copies bear this notice and the full citation tation techniques on all datasets and compared the results with
on the first page. Copyrights for components of this work owned by others than the respect to the models trained on the original datasets.
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission The remainder of the paper is structured as follows. Section 2
and/or a fee. Request permissions from [email protected]. surveys the most recent advances about strategies on building clas-
JCDL ’20, August 1–5, 2020, Virtual Event, China
sifiers for small textual datasets and on the main data augmentation
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-7585-6/20/08. . . $15.00 techniques by highlighting their limits. In Section 3, we present our
https://doi.org/10.1145/3383583.3398535 neural-based architecture and we introduce the state of the art data
319
EU-L-4: Neural Semantic Representation JCDL ’20, August 1–5, 2020, Virtual Event, China
augmentation strategies we implemented for demonstrating how Data Augmentation of Textual Dataset. To augment the textual
data augmentation may have a detrimental effect when applied dataset, [12] uses a method based on Latent Dirichlet Allocation
to classification strategies that are already effective. Section 4 pro- (LDA). The output of the LDA is employed as enrichment in the
vides the evaluation we performed on four state of the art datasets form of embeddings of documents to words. LDA is also used in [2]
and compares the obtained results with respect to the recent lit- as a keyword extractor, where the authors extract the top keywords
erature in data augmentation. Then, in Section 5, we discuss the for each class of the sentiment task. The keywords of each class are
results obtained by our neural-based architecture when trained on tested whether they are contained in the 3-grams of the instances of
augmented datasets. Finally, Section 6 concludes the paper. the class. In such a case, the instance owning the 3-gram is enriched
with the same 3 grams, thus boosting its importance.
2 RELATED WORK A different approach, which is based on the syntactic modifica-
This work considers the case of textual classification for small tion of the text, is presented in [24]. This work proposes a system
datasets. Given the two research questions introduced in Section 1, called EDA (Easy Data Augmentation), which implements four
we surveyed the main recent contributions related to text classifi- different methods for enriching the dataset. The first enrichment
cation on small datasets and about the impact of the most relevant substitutes “n" random words of the text with their synonyms, while
data augmentation techniques on textual datasets. the second inserts the synonym of a randomly selected word in
the instance. Random swap and deletion are the two other meth-
Small Dataset Textual Classification. One of the early works on ods proposed to change the text. The former approach swaps two
small datasets is the hierarchical document classification introduced words, while he latter deletes another one with a random selection.
in [20]. This work uses a Bayesian approach to classify documents The output of these methods is a new instance that is added to the
in a hierarchy of topics. Differently from previous attempts, the dataset.
inner nodes of the classifier “update” their class conditional proba- The study in [25] proposes a new system called Hierarchical Data
bility of each word. Thus, obtaining a differentiation of terms in the Augmentation (HDA) and compares it to EDA. The authors stack
hierarchy according to their level of generality/specificity. In a less two levels of bidirectional GRU, each with supported by an attention
complicated scenario, [17] focuses on the task of Intent Classifica- layer: the first layer used for word attention and the second used for
tion. The main issue with conventional methods is that they do not sentence attention. In the first step, the training set is passed to the
consider spelling errors and out of vocabulary words. The paper network responsible for the classification. The attention layers of
proposes the use of Semantic Hashing as embeddings. Precisely, the network are then used to create a new training document with
in the first step, the sentence is split in its words, and secondly, the most important (in terms of attention) words and sentences.
each word is divided into 3-grams. These 3-grams are then used In [15], the authors propose a new method to increase the number
as features, weighted with a Term Frequency - Inverse Document of instances for imbalanced classes. Words are substituted with syn-
Frequency (TF-IDF), normalized and passed to different classifiers. onyms that have the same part of speech with them. The synonyms
A topic modeling approach is used in [3] for classifying small are identified via cosine similarity on pre-trained word embeddings.
datasets. The method creates a graph where words are connected Another approach to creating new input for the training is shifting
to their topics with edges representing the probability of the word the 0 paddings for sentences that feed the Neural Network. The
being part of it. Topics are then connected via a probability model. authors also propose a generation of new instances by training a
Hence, a document is described as a probability model over the language model. A word is randomly selected as the first element
topics graph, where these probabilities represent the features. The of the new instance, and rest is predicted via a Long Short Term
authors achieve similar results to the state of the art of the time, Memory (LSTM) network augmented with a fully connected layer.
with around 1% of the data. Results between data augmentation and the original data do not
In [9], tackle the case of insufficient training data on a large show a significant benefit. The only benefit visible from the paper
number of categories. A novel architecture for multi-task learning is the increased recall of hate speech (smaller class) on the short
is proposed for such a scenario. A convolutional neural network is text.
created and trained small- and large-scale classification tasks that Enrichment on a tiny spam dataset for short-text messages is
are considered related. Their shared features are then combined presented in [11]. Three steps are applied to create new instances.
for improving the missing of the instances in the training set. A The first is the normalization of the text (by removing grammat-
further approach in the skewed dataset is over-sampling, which ical errors) via an English dictionary. Next, a semantic indexing
duplicates the under-represented classes’ instances until the dataset technique is used to get synonyms for words, with the synonyms
is balanced. filtered with a concept disambiguation tool. Thus, a new sample is
The goal of [10] is to identify the best feature selection meth- generated using one synonym from each set at each step.
ods (i.e., information gain, Gini index, etc.) for classification tasks Data enrichment also applies in other languages than English.
in small datasets. The authors also argue that it is necessary to A strategy of augmentation for Chinese is presented in [19] with
also value other evaluations rather than accuracy, like stability, two different enrichment strategies: 1) at the word level and 2) at
efficiency, etc. So they propose an evaluation based on multiple the phrase level. At the word level, synonyms are exchanged, and
criteria decision making(MCDM): TOPSIS, VIKOR, GRA, Weighted random meaningless words added as noise. While on the phrase
sum method(WSM), and PROMOTHEE. Ten datasets are picked for level, the substitute is done at the adverbial phrases by word2vec
choosing the best feature selector, with nine measures for binary- and thesaurus. The system assesses the results in sentiment analysis
and seven for multiclass- classification. for the publicly available hotel online evaluation dataset.
320
EU-L-4: Neural Semantic Representation JCDL ’20, August 1–5, 2020, Virtual Event, China
In [22] the authors explore how to overcome this data bottle- Thus, initially BERT is trained (i) with generic data for identifying
neck for Dutch, a low-resource language. The goal of the paper missing tokens, and then (ii) it is further fine-tuned for tasks like
is to normalize the text as defined in [16], going from noisy data question answering. With the goal of improving the performance
to standard text. Three manually annotated datasets are provided: (i.e. accuracy) of classification tasks in small datasets, we use BERT
Tweets, Message Board Posts, and Text Messages (SMS). The en- for transfer learning. Intuitively, BERT learns some distribution
richment is performed with a distance-based substitution on word over the generic corpora, and then adapts it to the distribution
embeddings. The unnormalized enriched instances are then passed of words in the current dataset. This type of architecture gives a
to Sequence-to-Sequence (seq2seq) classification (encoder-decoder lot of advantages for teams which do not have the computational
architecture) with the normalized instances as the target. power to train on huge dataset, and thus resulting in state of the
art performance on cascading tasks. In this paper we use BERT two
folds, as an input for a classification algorithm and for enriching
3 METHOD
small datasets. A detailed description of its usage can be found in
In this section, we present the proposed neural-based architecture. the next subsections.
We start to introduce in subsection 3.1 the word embedding we
adopted for the setup of our classifier. Then, in subsection 3.2, we 3.2 BERT-Based Classifier
describe how these embeddings have been exploited and how we
To improve the current baselines of classification tasks on small
configured our neural architecture. Finally, subsection 3.3, intro-
datasets, we use a BERT based classifier. As previously mentioned,
duces the state of the art data augmentation strategies we imple-
BERT is the encoder part of a Transformer architecture. This part
mented for validating the second research question presented in
is composed of 12 stacked (encoder) blocks 1 with each of them
Section 1 concerning the detrimental effects of augmented datasets
consisting of 3 components:
when adopted on already-effective classifiers.
• Multi-head attention is the first component of an encoder
block. It uses a set of parallel self-attention networks to
3.1 Word Embeddings learn the closeness relation of the word itself (hence “self”-
Word embeddings were introduced to avoid the curse of dimen- attention) to embeddings of the other words.
sionality for NLP tasks such as Text Classification, Relation Extrac- • The Feedforward component consists of two linear layers
tion, etc. In early systems, for each word in the training dataset, with a ReLU activation function connecting them. This part
a single multidimensional representation is learned. Among the returns an embedding representation for each word received
systems of this category, which we call non-contextual, one can from the output of the previous component.
find Word2Vec [13], GloVe [14], fastText [8], etc.. The learned em- • Two layer normalization [7] components, with the first
beddings from these systems are very powerful and have proven to located between the two previously described components
work quite well in different NLP tasks, but sometimes lack in the and the second located before the output of the block.
contextual representation of the word itself. For example, the word The position encoding is another peculiarity of BERT. For each
bank can have different senses. It could refer to a river bank or to a input word, a representation of its position in the text is encoded
financial institution. This ambiguity may mislead systems built for in order to inform the network about the distance of the words in
NLP tasks that are trained on the embeddings of the words. To over- the text. We combine such architecture with a sigmoid function
come such a drawback, a new family of embeddings with contextual as the last layer and a dropout rate of 0.1 to the input of this layer
representation has been proposed. The contextual word embedding for avoiding overfitting. The classification system is illustrated in
systems do not learn a single representation for each word. Instead, Figure 1. The output of the network is presented by:
the representation is dependent on the context in which the word
1
is embedded. Such systems are often used for transfer learning, ℎ(𝑥) = 𝜎 (𝜃 𝑇 𝑥 (𝛾) + 𝑏) =
1+𝑒 −(𝜃 𝑇 𝑥 (𝛾 )+𝑏)
where pre-trained models are made available to NLP practitioners.
Furthermore, the latest ones allow fine-tuning of their weights in where 𝑥 (𝛾) are the outputs of BERT for the text at hand, 𝛾 represent
the network for cascading tasks. This way, the distributions of the the weights of BERT and 𝜃 are the learned weights and 𝑏 is the bias
data in the embeddings can better represent the one at hand. term of the output neuron.
One of the best-known systems from this type of embeddings The expected output to minimize is the function that predicts the
is BERT (Bidirectional Encoder Representations from Transform- target labels in the best manner. These are the labels of each short
ers) [5]. BERT is based on the well-known Transformer [21] archi- text with the cost function of the network defined as following:
tecture (encoder-decoder) and uses only the encoder part of it. The 1 Õ (𝑖)
𝑚
main advantage of such architecture is exploiting the bidirectional 𝐽 =− 𝑦 log(ℎ (𝑖) ) + (1 − 𝑦 (𝑖) ) log(1 − ℎ (𝑖) )
𝑚 𝑖=1
structure of the attention mechanism. To train BERT, the authors
propose the following two steps: Where 𝑚 is the number of instances in the training set, 𝑦 (𝑖) is the
label of the 𝑖 − 𝑡ℎ instance and ℎ (𝑖) is the output of the sigmoid
(1) Mask tokens in a sentence with a 15% probability and try
layer for that instance. The goal of the algorithm is to find the 𝛾,
to predicting them (a modified version of Language Models
𝜃 , and 𝑏 that would minimize the cost function: 𝑎𝑟𝑔𝑠_𝑚𝑖𝑛𝛾,𝜃,𝑏 (𝐽 ).
called Masked Language Model).
(2) Predict whether two sentences are written one after each 1 The
number of layers is 6 in the pre-trained model that we chose to use called
other. BERT_Base, Uncased
321
EU-L-4: Neural Semantic Representation JCDL ’20, August 1–5, 2020, Virtual Event, China
𝑡 𝑓 𝑖𝑑 𝑓𝑖,𝑗 = 𝑡 𝑓𝑖,𝑗 · 𝑖𝑑 𝑓𝑖
where 𝑡 𝑓𝑖,𝑗 represents the term frequency of word 𝑖 in the document
𝑗 defined as:
𝑛𝑖,𝑗
𝑡 𝑓𝑖,𝑗 = Í
Figure 1: The architecture of our classifier. A BERT architec- 𝑘 𝑛𝑘,𝑗
ture with a sigmoid function on top for a two class classifi-
where 𝑛𝑖,𝑗 is the number of words 𝑖 in the text 𝑗 and 𝑖𝑑 𝑓 represents
cation.
the inverse document frequency of word 𝑖 calculated as:
|𝑡 |
𝑖𝑑 𝑓𝑖 = log
|𝑡 𝑗 : 𝑤𝑖 ∈ 𝑡 𝑗 |
Next, we select the most prominent candidates for word substi-
As can be noticed, the goal of the algorithm isn’t only to update
tution (𝑡𝑜𝑝_𝑎𝑟𝑔𝑠 ()) with three different configurations: 20%, 40%
the weights of the output neuron. During training, also the weights
and 60% of the top 𝑡 𝑓 𝑖𝑑 𝑓 weights. Also, we decide to use 4 possi-
of BERT are updated in order to make the network adapt to the
ble values for the maximum amount of substitutions: [3, 5, 7, 9].
distribution of the training set. We use this classifier as opposed to
This means that we select the top 3, 5, 7 or 9 closest terms with
previously proposed baselines and for assessing data augmentation
our candidate word returned from BERT. Furthermore, we do not
strategies. In our configurations, we use batches of 8 instances and
consider stopwords for this process. An example of the enrichment
train the dataset for 10 epochs.
technique described above is illustrated in Figure 2.
In this paper, we try three different filtering strategies for enrich-
3.3 Enriching the Dataset ing the text in small datasets. The generic logic of these filtering
With the goal to improve the performance (i.e. accuracy) of clas- strategies is presented in the Algorithm 1, with all of them differing
sification tasks on small datasets here we try different strategies from each other in the filter() function (bolded in the Algorithm 1).
for enriching. Traditionally, to augment textual datasets, new sam- In the first filtering strategy (raw strategy), the Naive BERT, the
ples with substituted words (usually synonyms) are introduced filter function does not remove any of the substitution candidates.
in the data. We use a similar approach for our experiments, but One of the questions that arises is about the amount of masked
instead of using traditional external resources or non-contextual tokens we need to extract for replacements. We use 4 different
word embeddings, we exploit BERT’s ability as a Masked Lan- values to get the maximum amount of replacement tokens: 3, 5, 7,
guage Model (MLM). We predict missing words by masking them 9.
in the training set and substituting them with their most promi- Once we have identified the technique to use for enriching the
nent candidates. More precisely, given a text 𝑡 = 𝑤 1𝑤 2 . . . 𝑤𝑛 , dataset, we have to decide which words to substitute. At the be-
with the word 𝑤𝑚 = [𝑀𝐴𝑆𝐾], we are interested in extracting ginning we use a TF-IDF weighting scheme for each word in the
𝑡𝑜𝑝_𝑎𝑟𝑔𝑠 (𝑃 (𝐵𝐸𝑅𝑇 (𝑤𝑚 | 𝑡))), where 𝑡𝑜𝑝_𝑎𝑟𝑔𝑠 returns the words candidate text. Then we select 3 different percentages from the
with the highest BERT probability. To achieve good results, we selected words: 20%; 40%; 60% of the best candidates.
need to carefully pick and define the following parameters: By analyzing the enriched text, we discovered that sometimes
out-of-context words are replaced. Other than that, as expected,
• Which tokens to substitute? BERT returns both, synonyms and antonyms. Obviously, for some
• How many substitutes to use? tasks (i.e. sentiment analysis), antonyms substitutions are counter-
• What kind of 𝑡𝑜𝑝_𝑎𝑟𝑔𝑠 () function to use? productive. Nevertheless, for some tasks (i.e. subjectivity) antonyms
322
EU-L-4: Neural Semantic Representation JCDL ’20, August 1–5, 2020, Virtual Event, China
Algorithm 1 The algorithm for enriching the training set Dataset Documents # of Words Average Positive Negative
Doc. Length Docs Docs
Require: 𝑡𝑠, the text of the training set
CR 3,772 6,596 20 2,406 1,366
1: for each 𝑡 ∈ 𝑡𝑠 do MPQA 10,624 6,298 3 3,316 7,308
2: 𝑡 𝑓 𝑖𝑑 𝑓 𝑠 ← [] Rt10k 10,662 20,621 21 5,331 5,331
3: for each 𝑤 ∈ 𝑡 do Subj 10,000 23,187 24 5,000 5,000
4: if 𝑤 ∉ 𝑠𝑡𝑜𝑝𝑤𝑜𝑟𝑑𝑠 then Table 1: Statistics of the dataset used for the evaluation. The
5: 𝑤𝑒𝑖𝑔ℎ𝑡 𝑤 ← 𝑡 𝑓 𝑖𝑑 𝑓 (𝑤) first two columns contain the number of documents and
6: 𝑡 𝑓 𝑖𝑑 𝑓 𝑠.𝑎𝑝𝑝𝑒𝑛𝑑 (𝑤, 𝑤𝑒𝑖𝑔ℎ𝑡 𝑤 ) the different words contained in each dataset, respectively.
7: end if In the third column the average number of words compos-
8: end for ing each sample of the dataset is depicted. The last two
9: 𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 ← 𝑏𝑒𝑠𝑡 − 𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 (𝑡 𝑓 𝑖𝑑 𝑓 𝑠) columns show the number of positive and negative samples
10: for each 𝑐 ∈ 𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠 do contained in each dataset, respectively.
11: 𝑠𝑢𝑏𝑠𝑡𝑖𝑡𝑢𝑡𝑒𝑠 = 𝑔𝑒𝑡_𝑠𝑢𝑏𝑠𝑡𝑖𝑡𝑢𝑡𝑒𝑠 (𝑐)
12: 𝑓 𝑖𝑛𝑎𝑙_𝑠𝑢𝑏𝑠𝑡𝑖𝑡𝑢𝑡𝑒𝑠 ← filter(substitutes)
13: for each 𝑠𝑢𝑏 ∈ 𝑓 𝑖𝑛𝑎𝑙_𝑠𝑢𝑏𝑠𝑡𝑖𝑡𝑢𝑡𝑒𝑠 do
14: 𝑛𝑒𝑤_𝑡𝑒𝑥𝑡 ← 𝑒𝑥𝑐ℎ𝑎𝑛𝑔𝑒 (𝑡, 𝑐, 𝑠𝑢𝑏) 4 VALIDATION OF THE PROPOSED NEURAL
15: 𝑡𝑠.𝑎𝑝𝑝𝑒𝑛𝑑 (𝑛𝑒𝑤_𝑡𝑒𝑥𝑡) ⊲ Append candidate
ARCHITECTURE
16: end for
17: end for In this section, we present the evaluation performed on four short-
18: end for
text datasets benchmarked in numerous NLP studies: Customer
Reviews (CR), MPQA, Short Movie Reviews (Rt10k) and Subjectivity
(Subj) 2 . A summary of additional information about the employed
substitutions are very beneficial and their substitution is semanti- datasets can be seen in Table 1. All these datasets are cases of binary
cally correct. In order to try only synonyms and remove the out- classification.
of-context substitutes, we propose the second filtering strategy We subdivide each dataset as follows: the test sets consist of
(synonym strategy) that accepts only synonyms as replacing to- 1000 samples held out from each dataset for later testing. To show
kens. To do so, we extract all synonyms of the candidate word from the effectiveness of our method with respect to the techniques
WordNet and accept them if and only if they have the same stem as presented in [6]. We applied the designed neural architecture to
the replacement returned by BERT. Hence, we have only synonym different training-set sizes: 500, 1000, 1500, 8500 for MPQA, Rt10k
replacements from a contextual language model and the replace- and Subj and 500, 1000, 1500, 2600 for CR. For each size, we sampled
ment is “natural”. The difference of this filtering strategy with the five training sets randomly, using stratified sampling. For each of
previous one is on the filter() method of the Algorithm 1. The ac- these sets, a 10-fold cross-validation is performed to find the best
ceptance of rejection of the candidate substitutions is presented in parameter combination, i.e., the combination that yields the highest
the Algorithm 2. average accuracy over all folds. The classifier was then trained on
the same dataset that is used for the cross-validation and tested on
the held-out test set (five times for each sample size and classifier
Algorithm 2 Whether to accept or reject a possible substitute with
combination).
WordNet
As baselines, we refer to three approaches used in recent state
Require: 𝑐, 𝑠𝑢𝑏, candidate to substitute and possible substitute of the art evaluation as discussed in [6] namely the Multinomial
1: 𝑠𝑦𝑛𝑠 ← 𝑤𝑜𝑟𝑑𝑛𝑒𝑡 .𝑠𝑦𝑛𝑠𝑒𝑡 (𝑐)
Naive Bayes (MNB), the Naive Bayes Support Vector Machine (NB-
2: 𝑠𝑡𝑒𝑚𝑠𝑦𝑛𝑠 ← 𝑠𝑡𝑒𝑚(𝑠𝑦𝑛𝑠)
SVM), and a Recursive Auto-Encoder (RAE). The MNB and NBSVM
3: if 𝑠𝑡𝑒𝑚(𝑠𝑢𝑏𝑠𝑡𝑖𝑡𝑢𝑡𝑒) ∈ 𝑠𝑡𝑒𝑚𝑠𝑦𝑛𝑠 then
classifiers have been parametrized as discussed in [23]. While, the
4: accept ⊲ Only accept if at least one stem matches word embeddings of the RAE classifier have been initialized as sug-
5: else
gested by the authors 3 [18]. For all the baselines and the proposed
6: reject approach we reported the accuracies in Table 2 and the precision,
7: end if
recall, and f-score in Table 3.
Results reported in Table 2 related to the accuracies shows how
One of the possible drawbacks of the “synonym” strategy is that the proposed approach outperforms all the baselines. The improve-
certain synonyms substitution might be part of the other class too. ments range from around 5% for the MPQA and Subj datasets to
By substituting words that occur in the other class (or the candidates around 10% for the CR and Rt10k ones. By analyzing these results
of the other class), we might introduce confusion to the classifier. To from the dataset statistics perspective, we did not find any specific
avoid such behavior, we propose a third filtering strategy (pure correlation. However, we observed that if we rank the datasets by
strategy) that further filters the possible synonyms (intersection the effectiveness of each classifier, we can notice that such a rank
between WordNet and BERT) by rejecting substitution candidates is the same for all classifier. Such results suggest that even if the
that are part of the words in the other class and/or even possible
substitution of the other class. Our goal in this filtering strategy is 2 All datasets are available at https://github.com/sidaw/nbsvm.
to enrich the training only with “more pure” information. 3A MATLAB implementation of the classifier is available at http://www.socher.org.
323
EU-L-4: Neural Semantic Representation JCDL ’20, August 1–5, 2020, Virtual Event, China
Dataset BERT MNB NBSVM RAE strategies is not particularly effective when the neural architecture
CR 500 0.906 0.743 0.759 0.758 has been already designed for addressing a specific task.
CR 1000 0.908 0.779 0.779 0.797
CR 1500 0.912 0.782 0.787 0.813
CR 2600 0.916 0.810 0.808 0.806
MPQA 500 0.870 0.778 0.769 0.775 5 EFFECTS OF DATA AUGMENTATION
MPQA 1000 0.882 0.804 0.804 0.834
MPQA 1500 0.876 0.824 0.812 0.832
In this Section, we present the results observed by applying the
MPQA 8500 0.902 0.888 0.877 0.879 three augmentation techniques described in Section 3.3 and we
Rt10k 500 0.828 0.674 0.646 0.671 discuss if the use of such techniques helps in improving the overall
Rt10k 1000 0.844 0.683 0.680 0.714
Rt10k 1500 0.852 0.705 0.715 0.695 effectiveness of the classifier or not.
Rt10k 8500 0.880 0.756 0.778 0.792 Table 4 shows the accuracies obtained by our approach (second
Subj 500 0.940 0.861 0.855 0.893 column) with respect to the three strategies (from third to fifth
Subj 1000 0.946 0.880 0.869 0.901
Subj 1500 0.960 0.897 0.881 0.911 columns). We can appreciate how our approach outperforms the
Subj 8500 0.966 0.922 0.916 0.957 models trained with the augmented datasets in 9 out of 16 cases.
Table 2: Summary of the accuracies observed for the pro- For what concerns the Subj dataset, our approach is less effective
posed approach and for the three baselines. than the augmented models. However, the gap is very limited and
is not statistical significant.
On the contrary, for the first three configurations of the MPQA
dataset, the raw augmented strategy significantly outperforms the
others
proposed strategy significantly improve the classification capabili- The same result if reflected on the precision, recall, and f-score
ties, there might be a subset of samples that are quite challenging reported in Table 5. Indeed, also here we can notice how, beside
due to their particular structure. In the future, we will focus on the first three configurations of the MPQA dataset, the proposed
the manual analysis of such samples, by extracting them from the approach outperforms the models trained with the augmented
errors analytics of each classifier, in order to understand which are datasets.
the reasons led to the errors. These results allow us to answer to the second research question
Besides the observation of the accuracies obtained by each classi- since the implementation of state of the art data augmentation
fier, we computed also the precision, recall, and f-score as reported techniques, in general, did not lead to a significant improvements
in Table 3. Here, we can notice a more interesting scenario. Indeed, of the classifier. However, it would be necessary to perform a deeper
while for the accuracies, our approach outperforms all baselines for analysis of the classifier errors in order to find which are the reasons
each dataset and training size, the situation is different by taking of such a detrimental effect. Finally, for having a more complete
into account the precision, recall, and f-score. The reader may no- picture of the classifier behavior, the system should be equipped
tice that for the MPQA dataset the RAE classifier obtained the best with an explainable model providing further details about why a
f-score due to the very high recall values. Similarly, it occurs also specific sample is classified in a specific way. All these aspects will
on the Subj dataset where in two out of four training size configu- be part of future work.
rations, the RAE classifier outperform our approach. The analysis
of the error matrices highlighted how the RAE classifier performed
very well in detecting false negative samples, while it has poor 6 CONCLUSION
performance on detecting the false positive ones. This behavior In this paper, we discussed how the use of a neural-based architec-
causes the low precision values in favor of the recall ones. Except ture designed for addressing the task of text classification on small
for the MPQA dataset, for the other configurations our approach datasets outperforms models trained with augmented datasets. We
outperforms all the baselines with a delta of more than 15%. provided two research questions which we positively answered.
Finally, we performed an error analysis on the obtained results The first research question was related to the comparison be-
in order to have a more deeper analysis of the behavior of the tween the results obtained by the proposed strategies and the ones
classifier. In particular, we want to observe if the classifier is biased obtained by other three baseline systems. By observing the results,
for specific classes in the case of unbalanced datasets. Confusion reported in Section 4, we can state that the approach presented in
matrices are shown in Figure 3. Concerning the balanced datasets this paper is suitable with respect to the use of classifier trained
(Rt10k and Subj), we can notice how the classifier encounter the with augmented dataset.
same error rate for both classes. Instead, for the unbalanced datasets The second research question, instead, was related to the pos-
(Cust and MPQA), we can appreciate that the higher error rate has sible detrimental effects that the use of augmented datasets can
been observed on the most represented classes. This means that the have on classifiers that have been already tuned for working on
classifier has good generalization capabilities since its performance small datasets. Results reported in Section 5 confirmed this hypoth-
are not affected by the number of samples contained in each class. esis since the effectiveness of the three state of the art strategies
In this section, we demonstrated how neural architecture de- we implemented for augmenting the dataset did not improve the
signed for addressing the challenge of managing small datasets is effectiveness of the classifier significantly.
able to obtain significant improvements with respect to data aug- Future work will focus on performing a deeper analysis of the
mentation solutions integrated into different type of classifiers. In errors we reported and discussed in Section 4 and 5. With the aim
the next section, we show how the integration of data augmentation of inferring if there are common characteristics among the samples
324
EU-L-4: Neural Semantic Representation JCDL ’20, August 1–5, 2020, Virtual Event, China
REFERENCES
[1] Ahmed Abdelali, Jim Cowie, and Hamdy S. Soliman. 2007. Improving query
precision using semantic expansion. Inf. Process. Manage. 43, 3 (2007), 705–716. Figure 3: Confusion matrices computed on the results ob-
https://doi.org/10.1016/j.ipm.2006.06.007 tained by the proposed classifier.
[2] Muhammad Abulaish and Amit Kumar Sah. 2019. A Text Data Augmentation Ap-
proach for Improving the Performance of CNN. In 11th International Conference on
Communication Systems & Networks, COMSNETS 2019, Bengaluru, India, January
7-11, 2019. IEEE, 625–630. https://doi.org/10.1109/COMSNETS.2019.8711054 Proceedings of the 2019 Conference of the North American Chapter of the Association
[3] Fabio Clarizia, Francesco Colace, Massimo De Santo, Luca Greco, and Paolo for Computational Linguistics: Human Language Technologies, Volume 1 (Long and
Napoletano. 2011. A new text classification technique using small training sets. In Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota,
11th International Conference on Intelligent Systems Design and Applications, ISDA 4171–4186. https://doi.org/10.18653/v1/N19-1423
2011, Córdoba, Spain, November 22-24, 2011, Sebastián Ventura, Ajith Abraham, [6] Ábel Elekes, Antonino Simone Di Stefano, Martin Schäler, Klemens Böhm, and
Krzysztof J. Cios, Cristóbal Romero, Francesco Marcelloni, José Manuel Benítez, Matthias Keller. 2019. Learning from Few Samples: Lexical Substitution with Word
and Eva Lucrecia Gibaja Galindo (Eds.). IEEE, 1038–1043. https://doi.org/10. Embeddings for Short Text Classification. In 19th ACM/IEEE Joint Conference on
1109/ISDA.2011.6121795 Digital Libraries, JCDL 2019, Champaign, IL, USA, June 2-6, 2019, Maria Bonn,
[4] Stephen Cronen-Townsend, Yun Zhou, and W. Bruce Croft. 2004. A frame- Dan Wu, J. Stephen Downie, and Alain Martaus (Eds.). IEEE, 111–119. https:
work for selective query expansion. In Proceedings of the 2004 ACM CIKM In- //doi.org/10.1109/JCDL.2019.00025
ternational Conference on Information and Knowledge Management, Washington, [7] Sheng Jia, Jamie Kiros, and Jimmy Ba. 2019. DOM-Q-NET: Grounded RL on
DC, USA, November 8-13, 2004, David A. Grossman, Luis Gravano, ChengXi- Structured Language. In 7th International Conference on Learning Representations,
ang Zhai, Otthein Herzog, and David A. Evans (Eds.). ACM, 236–237. https: ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. https://openreview.net/forum?
//doi.org/10.1145/1031171.1031220 id=HJgd1nAqFX
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: [8] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag
Pre-training of Deep Bidirectional Transformers for Language Understanding. In of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759 (2016).
325
EU-L-4: Neural Semantic Representation JCDL ’20, August 1–5, 2020, Virtual Event, China
[13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013.
Distributed Representations of Words and Phrases and their Compositionality. In
Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou,
M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates,
Inc., 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-
of-words-and-phrases-and-their-compositionality.pdf
[14] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove:
Global vectors for word representation. In In EMNLP.
[15] Georgios Rizos, Konstantin Hemker, and Björn W. Schuller. 2019. Augment
to Prevent: Short-Text Data Augmentation in Deep Learning for Hate-Speech
Classification. In Proceedings of the 28th ACM International Conference on In-
formation and Knowledge Management, CIKM 2019, Beijing, China, November
3-7, 2019, Wenwu Zhu, Dacheng Tao, Xueqi Cheng, Peng Cui, Elke A. Run-
densteiner, David Carmel, Qi He, and Jeffrey Xu Yu (Eds.). ACM, 991–1000.
https://doi.org/10.1145/3357384.3358040
[16] Sarah Schulz, Guy De Pauw, Orphée De Clercq, Bart Desmet, Véronique Hoste,
Walter Daelemans, and Lieve Macken. 2016. Multimodular Text Normalization
of Dutch User-Generated Content. ACM TIST 7, 4 (2016), 61:1–61:22. https:
//doi.org/10.1145/2850422
[17] Kumar Shridhar, Ayushman Dash, Amit Sahu, Gustav Grund Pihlgren, Pedro
Alonso, Vinaychandran Pondenkandath, György Kovács, Foteini Simistira, and
Marcus Liwicki. 2019. Subword Semantic Hashing for Intent Classification on
Small Datasets. In International Joint Conference on Neural Networks, IJCNN 2019
Budapest, Hungary, July 14-19, 2019. IEEE, 1–6. https://doi.org/10.1109/IJCNN.
Figure 4: Confusion matrices computed on the results ob- 2019.8852420
[18] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christo-
tained by the proposed classifier trained on the dataset en- pher D. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting
riched with the three described data augmentation strate- Sentiment Distributions. In Proceedings of the 2011 Conference on Empirical Meth-
ods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre
gies. Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group
of the ACL. ACL, 151–161. https://www.aclweb.org/anthology/D11-1014/
[19] Xiao Sun, Jiajin He, and Changqin Quan. 2017. A multi-granularity data augmen-
tation based fusion neural network model for short text sentiment analysis. In
[9] Kang-Min Kim, Yeachan Kim, Jungho Lee, Ji-Min Lee, and SangKeun Lee. 2019. Seventh International Conference on Affective Computing and Intelligent Interaction
From Small-scale to Large-scale Text Classification. In The World Wide Web Workshops and Demos, ACII Workshops 2017, San Antonio, TX, USA, October 23-26,
Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, Ling Liu, Ryen W. 2017. IEEE Computer Society, 12–17. https://doi.org/10.1109/ACIIW.2017.8272616
White, Amin Mantrach, Fabrizio Silvestri, Julian J. McAuley, Ricardo Baeza-Yates, [20] Kristina Toutanova, Francine Chen, Kris Popat, and Thomas Hofmann. 2001.
and Leila Zia (Eds.). ACM, 853–862. https://doi.org/10.1145/3308558.3313563 Text Classification in a Hierarchical Mixture Model for Small Training Sets.
[10] Gang Kou, Pei Yang, Yi Peng, Feng Xiao, Yang Chen, and Fawaz E. Alsaadi. 2020. In Proceedings of the 2001 ACM CIKM International Conference on Information
Evaluation of feature selection methods for text classification with small datasets and Knowledge Management, Atlanta, Georgia, USA, November 5-10, 2001. ACM,
using multiple criteria decision-making methods. Appl. Soft Comput. 86 (2020). 105–112. https://doi.org/10.1145/502585.502604
https://doi.org/10.1016/j.asoc.2019.105836 [21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
[11] Johannes V. Lochter, Renato Moraes Silva, Tiago A. Almeida, and Akebo Ya- Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All
makami. 2018. Semantic Indexing-Based Data Augmentation for Filtering Unde- you Need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V.
sired Short Text Messages. In 17th IEEE International Conference on Machine Learn- Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.).
ing and Applications, ICMLA 2018, Orlando, FL, USA, December 17-20, 2018, M. Arif Curran Associates, Inc., 5998–6008. http://papers.nips.cc/paper/7181-attention-
Wani, Mehmed M. Kantardzic, Moamar Sayed Mouchaweh, João Gama, and Ed- is-all-you-need.pdf
win Lughofer (Eds.). IEEE, 1034–1039. https://doi.org/10.1109/ICMLA.2018.00169 [22] Claudia Matos Veliz, Orphée De Clercq, and Véronique Hoste. 2019. Benefits
[12] Xinghua Lu, Bin Zheng, Atulya Velivelli, and ChengXiang Zhai. 2006. Research of Data Augmentation for NMT-based Text Normalization of User-Generated
Paper: Enhancing Text Categorization with Semantic-enriched Representation Content. In Proceedings of the 5th Workshop on Noisy User-generated Text, W-
and Training Data Augmentation. JAMIA 13, 5 (2006), 526–535. https://doi.org/ NUT@EMNLP 2019, Hong Kong, China, November 4, 2019, Wei Xu, Alan Ritter, Tim
10.1197/jamia.M2051
326
EU-L-4: Neural Semantic Representation JCDL ’20, August 1–5, 2020, Virtual Event, China
Baldwin, and Afshin Rahimi (Eds.). Association for Computational Linguistics, Conference on Empirical Methods in Natural Language Processing and the 9th
275–285. https://doi.org/10.18653/v1/D19-5536 International Joint Conference on Natural Language Processing, EMNLP-IJCNLP
[23] Sida I. Wang and Christopher D. Manning. 2012. Baselines and Bigrams: Simple, 2019, Hong Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng,
Good Sentiment and Topic Classification. In The 50th Annual Meeting of the and Xiaojun Wan (Eds.). Association for Computational Linguistics, 6381–6387.
Association for Computational Linguistics, Proceedings of the Conference, July 8-14, https://doi.org/10.18653/v1/D19-1670
2012, Jeju Island, Korea - Volume 2: Short Papers. The Association for Computer [25] Shujuan Yu, Jie Yang, Danlei Liu, Runqi Li, Yun Zhang, and Shengmei Zhao. 2019.
Linguistics, 90–94. https://www.aclweb.org/anthology/P12-2018/ Hierarchical Data Augmentation and the Application in Text Classification. IEEE
[24] Jason W. Wei and Kai Zou. 2019. EDA: Easy Data Augmentation Techniques for Access 7 (2019), 185476–185485. https://doi.org/10.1109/ACCESS.2019.2960263
Boosting Performance on Text Classification Tasks. In Proceedings of the 2019
327