0% found this document useful (0 votes)
79 views8 pages

Multiclass Text Classification On Unbalanced, Sparse and Noisy Data

This document discusses methods to improve multiclass text classification performance on unbalanced, sparse and noisy data, such as a dataset of song lyrics labeled by artist. It evaluates three neural network methods of increasing complexity - a perceptron, Doc2Vec, and multilayer perceptron - on the song lyrics dataset. Feature engineering is also explored to better represent the text for classification. The goal is to address challenges in the data like class imbalance and sparsity to fulfill the classification task.

Uploaded by

Aman Agnihotri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
79 views8 pages

Multiclass Text Classification On Unbalanced, Sparse and Noisy Data

This document discusses methods to improve multiclass text classification performance on unbalanced, sparse and noisy data, such as a dataset of song lyrics labeled by artist. It evaluates three neural network methods of increasing complexity - a perceptron, Doc2Vec, and multilayer perceptron - on the song lyrics dataset. Feature engineering is also explored to better represent the text for classification. The goal is to address challenges in the data like class imbalance and sparsity to fulfill the classification task.

Uploaded by

Aman Agnihotri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Multiclass Text Classification on Unbalanced, Sparse and Noisy Data

Tillmann Dönicke1 , Florian Lux2 , and Matthias Damaschk3

University of Stuttgart
Institute for Natural Language Processing
Pfaffenwaldring 5 b
D-70569 Stuttgart
{doenictn1 ,luxfn2 ,damascms3 }@ims.uni-stuttgart.de

Abstract predict something latent, since there is no direct


mapping between artists and their songtexts. A lot
This paper discusses methods to improve of the texts are not written by the singers them-
the performance of text classification on selves, but by professional songwriters (Smith
data that is difficult to classify due to a et al., 2019, p. 5). Hence the correlation that a
large number of unbalanced classes with classifier should capture is between songtexts that
noisy examples. A variety of features the writers think to fit a specific artist and the artist.
are tested, in combination with three dif- All these points make the task difficult; still, it is
ferent neural-network-based methods with a task needed for nowadays’ NLP systems, e.g., in
increasing complexity. The classifiers a framework that suggests new artists to a listener
are applied to a songtext–artist dataset based on the songtexts s/he likes. Thus, to tackle
which is large, unbalanced and noisy. We the challenges given is a helpful step for any task
come to the conclusion that substantial in the field of NLP that might come with similarly
improvement can be obtained by remov- difficult data.
ing unbalancedness and sparsity from the For the artist classification, we investigate three
data. This fulfils a classification task neural-network-based methods: a one-layer per-
unsatisfactorily—however, with contem- ceptron, a two-layer Doc2Vec model and a multi-
porary methods, it is a practical step to- layer perceptron. (A detailed description of the
wards fairly satisfactory results. models shall follow in section 2.) Besides the
model, the representation of the instances in a fea-
1 Introduction ture space is important for classification, thus we
Text classification tasks are omnipresent in natu- also aim to find expressive features for the partic-
ral language processing (NLP). Various classifiers ular domain of songtexts. (See section 4 for a list
may perform better or worse, depending on the of our features.)1
data they are given (eg. Uysal and Gunal, 2014).
However, there is data where one would expect to 2 Methods
find a correlation between (vectorised) texts and
The following sections shall provide an overview
classes, but the expectation is not met and the clas-
of our classifiers in order of increasing complexity.
sifiers achieve poor results. One example for such
data are songtexts with the corresponding artists 2.1 Perceptron
being classes. A classification task on this data is
especially hard due to multiple handicaps: A perceptron is a very simple type of neural net-
First, the number of classes is extraordinarily work that was first described in Rosenblatt (1958).
high (compared to usual text classification tasks). It contains only one layer, which is at the same
Second, the number of samples for a class varies time the input and output layer. The input is a
between a handful and more than a hundred. And feature vector ~x extracted from a data sample x.
third, songtexts are structurally and stylistically Every possible class y ∈ Y is associated with a
more diverse than, e.g., newspaper texts, as they weight vector w ~ y . A given sample x is classified
may be organised in blocks of choruses and verses, as ŷx as follows:
exhibit rhyme, make use of slang language etc. 1
The code will be made available at https://
(cf. Mayer et al., 2008). In addition, we try to github.com/ebaharilikult/DL4NLP.
Doc2Vec model learns vectors of each label in-
ŷx = arg max ~x · w
~y (1) stead of each document. If one wants to predict the
y∈Y label of an unseen document, a vector representa-
During training, every training sample is clas- tion for this document needs to be inferred first.
sified using the perceptron. If the classification is Therefore, a new column/row is added to the para-
incorrect, the weights of the incorrectly predicted graph matrix. Then the n-grams of the document
class are decreased, whereas the weights of the are iteratively fed to the network (as in training).
class that would have been correct are reinforced. However, the word matrix as well as the weight
Additions to this basic system include shuffling matrix of the output layer are kept fixed, and so
of the training data, batch learning and a dynamic only the paragraph matrix is updated. The result-
learning rate. The shuffling of the training data ing paragraph vector for the unseen document is
across multiple epochs shall prevent the order of finally compared to the paragraph vectors repre-
the samples from having an effect on the learn- senting labels; and the label of the most similar
ing process. Batch learning generally improves vector is returned as the prediction.
the performance of a perceptron (e.g. McDonald 2.3 MLP
et al., 2010); hereby all updates are jointly per-
A multi-layer perceptron (Beale and Jackson,
formed after each epoch instead of each sample.
1990), also referred to as feed forward neural net-
This removes a strong preference for the last seen
work, is a deep neural network consisting of mul-
class if the information in the features of multiple
tiple hidden layers with neurons which are fully
samples overlaps greatly. A dynamic learning rate
connected with the neurons of the next layer, and
improves the convergence of the weights. It gives
an output layer with as many neurons as classes.
updates that happen later during training less im-
The number of layers and the number of neurons
pact and allows a closer approximation to optimal
in each hidden layer depends on the classification
weights.
tasks and are therefore hyperparameters. For mul-
2.2 Doc2Vec ticlass classification, the softmax function is used
in the output layer. During training, the back-
Doc2Vec (Le and Mikolov, 2014) is a two-layer propagation learning algorithm (Rumelhart et al.,
neural network model which learns vector repre- 1985) based on the gradient descent algorithm, up-
sentations of documents, so-called paragraph vec- dates the weights and reduces the error of the cho-
tors. It is an extension of Word2Vec (Mikolov sen cost function, such as mean squared error or
et al., 2013) which learns vector representations cross-entropy.
of words. Thereby, context words, represented To prevent overfitting in neural networks,
as concatenated one-hot vectors, serve as input dropout (Srivastava et al., 2014) is commonly
and are used to predict one target word. After used. It omits hidden neurons with a certain prob-
training, the weight matrix of the hidden layer ability to generalise more during training and thus
becomes the word matrix, containing the well- enhances the model.3
known Word2Vec word embeddings.
The extension in Doc2Vec is that a unique doc- 3 Data
ument/paragraph id is appended to each input n-
We use a dataset of 57,647 English songtexts with
gram, as if it was a context word. Since paragraph
their corresponding artist and title, which has been
ids have to be different from all words in the vo-
downloaded from Kaggle4 . The data was shuffled
cabulary, the weight matrix can be separated into
uniformly and 10% were held out for validation
the word matrix and the paragraph matrix, the lat-
and test set, respectively.
ter containing document embeddings.2
There are 643 unique artists in the data. Table 1
In a document classification task, labels instead
shows the distribution of artists and songtexts for
of paragraph ids are used. By doing so, the
3
It should be noted here that advanced deep learning mod-
2 els such as CNNs and RNNs exist and have been successfully
The described version of Word2Vec and Doc2Vec is
commonly referred to as “continuous bag of words” (DBOW) used in text classification tasks (Lee and Dernoncourt, 2016),
model. If the input and output are swapped, i.e. a single word but have not been used in the context of this work and are
(Word2Vec) or document (Doc2Vec) is used to predict several therefore not explained in detail.
4
context words, the architecture is called “skip-gram” (SG) or https://www.kaggle.com/mousehead/
“distributed memory” (DM) model, respectively. songlyrics
Dataset Artists Songs Avg. songs Within a token, all non-initial letters are lower-
cased and sequences of repeating letters are short-
Training 642 46,120 71.8 (44.0)
ened to a maximum of three. To further re-
Validation 612 5,765 9.4 (5.7)
duce noise, punctuation is removed. The texts
Test 618 5,765 9.3 (6.0)
are tagged with parts-of-speech (POS) using the
Apache OpenNLP maximum entropy tagger5 .
Table 1: Number of unique artists (classes), num-
ber of songtexts and average number of songtexts Stylometric features Generic information
per artist (standard deviation in parentheses) for about the text, i.e. the number of lines, the number
each dataset. of tokens, the number of tokens that appear only
once, the number of types and the average number
of tokens per line.
Rhyme feature Number of unique line endings
(in terms of the last two letters), normalised by the
number of lines. This should serve as a simple
measure for how well the lines rhyme.
Word count vectors Every unique word in the
training corpus is assigned a unique dimension. In
each dimension, the number of occurrences of the
Figure 1: A histogram showing the distribution of
word in the sample are denoted. Term-frequency
songtexts in the training set.
(tf) weighting is implemented, but can be switched
on and off. As a minimal variant, only nouns can
each subset. The training set contains most of be taken into account; in this case we speak of
the unique artists (642) whereas less artists appear noun count vectors.
in the validation (612) and test set (618). Also,
POS count vectors The same as word count
the standard deviation of average songs per artists
vectors after replacing all words with their POS
is relatively high (i.e. more than half the aver-
tag.
age) which indicates that the number of songs per
artists is spread over a large range. Word2Vec embeddings 300-dimensional em-
The distribution of songs per artists in the train- beddings created from the Google news corpus.6
ing set can be seen in Figure 1. It shows similar The embedding of a text is hereby defined as the
counts for classes (artists) with many and classes average of the embeddings of all the words in the
with only a few samples (songtexts), i.e. unbal- text. As a minimal variant, only nouns can be
anced training data. The bandwidth goes from one taken into account.
artist with 159 songs to four artists with only one
Bias A feature with a constant value of 1 (to
song which also illustrates the sparsity for some
avoid zero vectors as feature vectors).
classes.
Besides the issues caused by the distribution 5 Experiments
of songtexts per artist, the quality of the texts is
less than perfect. Nonsense words and sequences, This section lists the concrete parameter settings
such as tu ru tu tu tu tu, as well as spelling varia- of the methods described in section 2. Since our
tions, such as Yeaaah, Hey-A-Hey and aaaaaaalll- models can only predict classes which have been
lllright!, are very common. encountered during training, only the 612 artists
occurring in all subsets are kept for all evalua-
4 Feature Extraction tions. For another series of experiments, only the
40 unique artists with more songs than 140 in the
For both the (single-layer) perceptron and the training set are kept to reduce the impact of unbal-
multi-layer perceptron we use the same prepro- ancedness and sparsity (numbers in Table 2).
cessing and features which are described below.
5
(The Doc2Vec model uses its own integrated to- https://opennlp.apache.org/docs/1.8.
0/manual/opennlp.html\#tools.postagger
keniser and Word2Vec-based features.) 6
GoogleNews-vectors-negative300.bin.gz from https:
The songtexts are tokenised by whitespace. //code.google.com/archive/p/word2vec/
Dataset Artists Songs Avg. songs
Training 40 5,847 146.2 (4.6)
Validation 40 646 16.2 (3.6)
Test 40 714 17.9 (5.0)

Table 2: Number of unique artists (classes), num-


ber of songtexts and average number of songtexts
per artist (standard deviation in parentheses) for
each dataset, keeping only the 40 unique artists
with more songs than 140 in the training set.

5.1 Experimental Settings


Perceptron We train and test two versions of
the perceptron. The minimal version (Perceptron)
only uses noun count vectors without tf-weighting
and the bias. The maximal version (Perceptron+)
uses all features. All add-ons described in sec-
tion 2.1 (shuffling, batch learning, dynamic learn-
ing rate) are used for both versions since they led
to an increasing performance on the validation set
in preliminary tests. For the decay of the learning
rate, a linear decrease starting from 1 and ending
1
at number of epochs is chosen.

Doc2Vec For the Doc2Vec implementation, we


use deeplearning4j7 version 0.9. The hyper-
paramters are a minimum word frequency of 10, Figure 2: Model of the MLP+ model: Layers
a hidden layer size of 300, a maximum window with corresponding number of neurons. The input
size of 8, and a learning rate starting at 0.025 and layers correspond to the following feature groups
decreasing to 0.001. Tokenisation is performed by (f.l.t.r.): word count vectors, stylometric features,
the incorporated UIMA tokeniser. The model is rhyme feature, POS count vectors, Word2Vec em-
trained for 100 epochs with batch learning.8 beddings.

MLP The implementation of MLP and MLP+ is


done with Keras (Chollet et al., 2015). MLP+ uses function with adaptive learning rate, a batch size
all feature groups from section 4 and one bias fea- of 32 and categorical cross-entropy as the loss
ture for every group. The MLP+ model is shown in function are used. For the activation functions,
Figure 2. Each feature group uses multiple stacked rectified linear units are used in the hidden layer
layers that are then merged with a concatenation and softmax in the output layer. The model trains
layer. The sizes of the dense layers are manually for 250 epochs and stores the weights that led to
selected by trial and error. Several dropout layers the best accuracy on the validation set through a
with a constant probability of 0.2 are included. In checkpoint mechanism.
contrast, MLP uses only the noun count vectors
(as Perceptron) and thus only one input branch. 5.2 Evaluation measures
For both models, Adadelta, an optimisation Given a (test) set X, each sample x ∈ X has a
7
See https://deeplearning4j.org/docs/ class yx and a prediction ŷx . Based on the predic-
latest/deeplearning4j-nlp-doc2vec for a tions, we can calculate class-wise precision (P ),
quick example. recall (R) and F -score as follows:
8
Since deeplearning4j does not document the possibility
to get intermediate evaluation results during training, 10 mod-
els are trained separately for 10, 20, 30 etc. epochs to obtain |{x ∈ X | yx = y ∧ yx = ŷx }|
data points for the learning progress analysis.
P (y) = (2)
|{x ∈ X | ŷx = y}|
final best
Model MST
|{x ∈ X | yx = y ∧ yx = ŷx }| F ep. F ep.
R(y) = (3)
|{x ∈ X | yx = y}| Perceptron 0 .066 100 .069 94
Perceptron+ 0 .003 100 .006 23
2 · P (y) · R(y) Doc2Vec 0 .032 100 .033 70
F (y) = (4)
P (y) + R(y) MLP 0 .023 97 .025 114
The macro-averaged F -score of all classes is MLP+ 0 .079 97 .089 80
the average of the class-wise F -scores: Perceptron 140 .146 100 .160 51
1 X Perceptron+ 140 .021 100 .055 54
Fmacro = · F (y) (5) Doc2Vec 140 .101 100 .120 20
|Y |
y∈Y MLP 140 .050 201 .088 46
For the overall precision and recall, the nom- MLP+ 140 .182 105 .193 109
inators and denominators are summed up for all
y ∈ Y , resulting in: Table 3: Micro-averaged F -score for each model
on the test set after training (final) and during
|{x ∈ X | yx = ŷx }| training (best), together with the corresponding
P =R= (6)
|{x ∈ X}| training epochs. Lower part: only artists with
And the formula for the micro-averaged F - more songs than (MST) 140 are kept in the train-
score: ing and the test set.

2·P ·R
Fmicro = =P =R (7) then reaches a plateau. Since the model uses that
P +R
The identity of overall P and R causes their number of epochs which works best on the valida-
identity with Fmicro . This measure is identical tion set (i.e. 105), it misses the best performance
with the overall accuracy (correct predictions di- at the 109th epoch but gets a final score close to it.
vided by all predictions). Since Fmacro gives ev-
5.4 Error Analysis
ery class the same weight, but we deal with an un-
balanced dataset, we choose Fmicro as evaluation In this section, we shall focus on the confu-
measure and only show Fmacro in some graphs for sion matrices produced by our three main models
comparison. which are depicted in Figure 6 and to be read as
follows: The x-axis and the y-axis represent artists
5.3 Results in the same order (labels are omitted due to leg-
Table 3 shows the micro-averaged F -score for all ibility). Each cell indicates how often the artist
models. MLP+ performs best, followed by Per- on the x-axis was classified as the artist on the
ceptron and Doc2Vec. The use of additional fea- y-axis (darker colours are higher numbers). The
tures significantly decreases the performance of cells on the main diagonal correspond to the cases
the perceptron (Perceptron+), but increases it for where a class was classified correctly hence a vis-
the multi-layer network (MLP+). This observation ible diagonal correlates with good results. Darker
is discussed in section 5.4. cells outside of the diagonal are significant mis-
Figures 3–5 show the performance of Percep- classifications9 and might be interesting to look
tron, Doc2Vec and MLP+ in dependence of the into.
number of training epochs. The Perceptron (Fig- As is clearly visible in Figure 6 (a), something is
ure 4) shows a generally increasing learning curve, wrong with the predictions of the Doc2Vec model.
i.e. more epochs lead to better results. Peaks like There are hints of a diagonal showing at the top-
the one after the 51st epoch are ignored since the left—however, there are entire columns of dark
model uses a fixed number of 100 training epochs. colour. This means that there are classes that are
Doc2Vec (Figure 5) reaches its best performance almost always predicted, no matter which class a
with 20 epochs and does not show any learning sample actually belongs to. Interestingly, we ob-
progress after that, even if trained for 100 epochs. 9
Such mis-classifications will be denoted as outliers in the
The MLP+ model (Figure 3) exhibits increasing following, since we are talking about the correlation of the
performance until around 100 training epochs and axes here.
Figure 3: Micro-averaged (above) and macro-averaged (below) F -score of the MLP+ model on the test
set (MST=140) after each training epoch.

Figure 4: Micro-averaged (above) and macro- Figure 5: Micro-averaged (above) and macro-
averaged (below) F -score of the Perceptron model averaged (below) F -score of the Doc2Vec model
on the test set (MST=140) after each training on the test set (MST=140) for different numbers
epoch. of training epochs.

served the same behaviour with our perceptron im- differently from what one might expect. Addi-
plementation in preliminary tests which changed tional stylometric features that were specifically
when we started using batch learning. However, designed for the task lead to significantly worse
our Doc2Vec implementation already uses batch results than the simple noun count vectors (Per-
learning and mini-batch learning did not improve ceptron+ vs. Perceptron). However, this does not
the performance in postliminary tests. tell anything about the perceptron’s performance
Figure 6 (b) shows that the perceptron performs when using other feature combinations.
well on most classes. However, there are very few The error distribution of the multi-layer percep-
classes where it actually recognises (almost) all of tron is displayed in Figure 6 (c). Compared to the
the samples. There are three major outliers and perceptron, most of the classes are predicted much
many outliers overall. This explains the rather low better and there are less outliers overall. However,
scores the model achieves, even though the diag- there are more major outliers which explains the
onal is clearly visible. Looking at the three big still rather low performance. In further contrast to
outliers and engineering new features specifically the perceptron, the use of all features leads to a
for those could improve the performance. How- performance increase (MLP+ vs. MLP). This is
ever, the informativeness of features can behave probably the case because the multi-layer percep-
(a) Doc2Vec (b) Perceptron (c) MLP+

Figure 6: Confusion matrices of different models on the test set (MST=140).

tron can learn individual weights for the different gineering good features.
feature groups through multiple branches and lay-
ers much better than the single-layer perceptron. 7 Future Work
A general clustering-based topic model encoded
6 Summary & Conclusion
in new features could potentially improve the per-
Songtext–artist classification is an example of formance of songtext classifiers. Looking at our
multiclass text classification on unbalanced, multi-layer perceptron, new features seem to be a
sparse and noisy data. Three neural network mod- good way to handle such difficult data. Other net-
els have been investigated on this specific task. 1) work architectures such as CNNs and RNNs can
A single-layer perceptron which can be used for be considered worth a look as well since they im-
all kinds of classification on vectorised data. 2) proved (noisy) text classification in previous stud-
Doc2Vec which is a contemporary tool for text ies (e.g. Lai et al., 2015; Apostolova and Kreek,
classification. And 3) an extended multi-layer per- 2018).
ceptron which we designed specifically for this Another way of dealing with imbalanced data is
task. While the third and most complex model to apply oversampling to raise the number of sam-
achieves the best results, it becomes also visible ples for sparse classes, or undersampling to reduce
that the choice of features has a significant ef- the number of samples for frequent classes.
fect on the classification performance. Here, too,
the multi-layer network with its advanced combi- 8 Acknowledgements
nation of different, stylometric as well as count/ We thank the three anonymous reviewers for their
embedding-based, feature groups outperforms the valuable comments. This work emerged from a
other models. course at the Institute for Natural Language Pro-
We come to the conclusion that a vast number cessing of the University of Stuttgart; we thank our
of and unbalanced classes as well as sparse and supervisors Evgeny Kim and Roman Klinger. We
not directly correlated data do not allow for a per- further thank Sebastian Padó for the organisational
fect performance. Thus, given a text classification support.
task where the data is as difficult, it makes sense
to reduce the data to something that is manage-
able and meaningful. Sparse classes in a noisy References
sample space are little more than guesswork which Emilia Apostolova and R. Andrew Kreek. 2018. Train-
might confuse the classifier and decrease the per- ing and prediction data discrepancies: Challenges of
formance on more important classes. While it is text classification with noisy, historical data. arXiv
preprint arXiv:1809.04019.
somewhat obvious that removing difficult cases
from the data improves the overall results, it is Russell Beale and Tom Jackson. 1990. Neural Com-
not something that one would usually do in a real- puting: An Introduction, pages 67–73. CRC Press.
world application. We argue, that it can be a prac- François Chollet et al. 2015. Keras: Deep learning li-
tical step to approach such a classification task, for brary for theano and tensorflow. URL: https://keras.
elaborating the complexity of the classifier and en- io/k, 7(8):T1.
Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015.
Recurrent convolutional neural networks for text
classification. In Twenty-ninth AAAI conference on
artificial intelligence.
Quoc Le and Tomas Mikolov. 2014. Distributed repre-
sentations of sentences and documents. In Interna-
tional conference on machine learning, pages 1188–
1196.
Ji Young Lee and Franck Dernoncourt. 2016. Se-
quential short-text classification with recurrent and
convolutional neural networks. arXiv preprint
arXiv:1603.03827.
Rudolf Mayer, Robert Neumayer, and Andreas Rauber.
2008. Rhyme and style features for musical genre
classification by song lyrics. In Ismir, pages 337–
342.
Ryan McDonald, Keith Hall, and Gideon Mann. 2010.
Distributed training strategies for the structured per-
ceptron. In Human Language Technologies: The
2010 Annual Conference of the North American
Chapter of the Association for Computational Lin-
guistics, pages 456–464. Association for Computa-
tional Linguistics.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jef-
frey Dean. 2013. Efficient estimation of word
representations in vector space. arXiv preprint
arXiv:1301.3781.
Frank Rosenblatt. 1958. The perceptron: a probabilis-
tic model for information storage and organization
in the brain. Psychological review, 65(6):386.
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J.
Williams. 1985. Learning internal representations
by error propagation. Technical report, Institute for
Cognitive Science, University of California.
Stacy L. Smith, Marc Choueiti, Katherine Pieper, Han-
nah Clark, Ariana Case, and Sylvia Villanueva.
2019. Inclusion in the recording studio? Gender
and race/ethnicity of artists, songwriters & produc-
ers across 700 popular songs from 2012-2018. USC
Annenberg Inclusion Initiative.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
Dropout: a simple way to prevent neural networks
from overfitting. The journal of machine learning
research, 15(1):1929–1958.

Alper Kursat Uysal and Serkan Gunal. 2014. The im-


pact of preprocessing on text classification. Infor-
mation Processing & Management, 50(1):104–112.

You might also like