Multiclass Text Classification On Unbalanced, Sparse and Noisy Data

This document discusses methods to improve multiclass text classification performance on unbalanced, sparse and noisy data, such as a dataset of song lyrics labeled by artist. It evaluates three neural network methods of increasing complexity - a perceptron, Doc2Vec, and multilayer perceptron - on the song lyrics dataset. Feature engineering is also explored to better represent the text for classification. The goal is to address challenges in the data like class imbalance and sparsity to fulfill the classification task.

Uploaded by

Aman Agnihotri

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

79 views8 pages

Multiclass Text Classification On Unbalanced, Sparse and Noisy Data

Uploaded by

Aman Agnihotri

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Multiclass Text Classification on Unbalanced, Sparse and Noisy Data

Tillmann Dönicke1 , Florian Lux2 , and Matthias Damaschk3

University of Stuttgart
Institute for Natural Language Processing
Pfaffenwaldring 5 b
D-70569 Stuttgart
{doenictn1 ,luxfn2 ,damascms3 }@ims.uni-stuttgart.de

Abstract predict something latent, since there is no direct

mapping between artists and their songtexts. A lot
This paper discusses methods to improve of the texts are not written by the singers them-
the performance of text classification on selves, but by professional songwriters (Smith
data that is difficult to classify due to a et al., 2019, p. 5). Hence the correlation that a
large number of unbalanced classes with classifier should capture is between songtexts that
noisy examples. A variety of features the writers think to fit a specific artist and the artist.
are tested, in combination with three dif- All these points make the task difficult; still, it is
ferent neural-network-based methods with a task needed for nowadays’ NLP systems, e.g., in
increasing complexity. The classifiers a framework that suggests new artists to a listener
are applied to a songtext–artist dataset based on the songtexts s/he likes. Thus, to tackle
which is large, unbalanced and noisy. We the challenges given is a helpful step for any task
come to the conclusion that substantial in the field of NLP that might come with similarly
improvement can be obtained by remov- difficult data.
ing unbalancedness and sparsity from the For the artist classification, we investigate three
data. This fulfils a classification task neural-network-based methods: a one-layer per-
unsatisfactorily—however, with contem- ceptron, a two-layer Doc2Vec model and a multi-
porary methods, it is a practical step to- layer perceptron. (A detailed description of the
wards fairly satisfactory results. models shall follow in section 2.) Besides the
model, the representation of the instances in a fea-
1 Introduction ture space is important for classification, thus we
Text classification tasks are omnipresent in natu- also aim to find expressive features for the partic-
ral language processing (NLP). Various classifiers ular domain of songtexts. (See section 4 for a list
may perform better or worse, depending on the of our features.)1
data they are given (eg. Uysal and Gunal, 2014).
However, there is data where one would expect to 2 Methods
find a correlation between (vectorised) texts and
The following sections shall provide an overview
classes, but the expectation is not met and the clas-
of our classifiers in order of increasing complexity.
sifiers achieve poor results. One example for such
data are songtexts with the corresponding artists 2.1 Perceptron
being classes. A classification task on this data is
especially hard due to multiple handicaps: A perceptron is a very simple type of neural net-
First, the number of classes is extraordinarily work that was first described in Rosenblatt (1958).
high (compared to usual text classification tasks). It contains only one layer, which is at the same
Second, the number of samples for a class varies time the input and output layer. The input is a
between a handful and more than a hundred. And feature vector ~x extracted from a data sample x.
third, songtexts are structurally and stylistically Every possible class y ∈ Y is associated with a
more diverse than, e.g., newspaper texts, as they weight vector w ~ y . A given sample x is classified
may be organised in blocks of choruses and verses, as ŷx as follows:
exhibit rhyme, make use of slang language etc. 1
The code will be made available at https://
(cf. Mayer et al., 2008). In addition, we try to github.com/ebaharilikult/DL4NLP.
Doc2Vec model learns vectors of each label in-
ŷx = arg max ~x · w
~y (1) stead of each document. If one wants to predict the
y∈Y label of an unseen document, a vector representa-
During training, every training sample is clas- tion for this document needs to be inferred first.
sified using the perceptron. If the classification is Therefore, a new column/row is added to the para-
incorrect, the weights of the incorrectly predicted graph matrix. Then the n-grams of the document
class are decreased, whereas the weights of the are iteratively fed to the network (as in training).
class that would have been correct are reinforced. However, the word matrix as well as the weight
Additions to this basic system include shuffling matrix of the output layer are kept fixed, and so
of the training data, batch learning and a dynamic only the paragraph matrix is updated. The result-
learning rate. The shuffling of the training data ing paragraph vector for the unseen document is
across multiple epochs shall prevent the order of finally compared to the paragraph vectors repre-
the samples from having an effect on the learn- senting labels; and the label of the most similar
ing process. Batch learning generally improves vector is returned as the prediction.
the performance of a perceptron (e.g. McDonald 2.3 MLP
et al., 2010); hereby all updates are jointly per-
A multi-layer perceptron (Beale and Jackson,
formed after each epoch instead of each sample.
1990), also referred to as feed forward neural net-
This removes a strong preference for the last seen
work, is a deep neural network consisting of mul-
class if the information in the features of multiple
tiple hidden layers with neurons which are fully
samples overlaps greatly. A dynamic learning rate
connected with the neurons of the next layer, and
improves the convergence of the weights. It gives
an output layer with as many neurons as classes.
updates that happen later during training less im-
The number of layers and the number of neurons
pact and allows a closer approximation to optimal
in each hidden layer depends on the classification
weights.
tasks and are therefore hyperparameters. For mul-
2.2 Doc2Vec ticlass classification, the softmax function is used
in the output layer. During training, the back-
Doc2Vec (Le and Mikolov, 2014) is a two-layer propagation learning algorithm (Rumelhart et al.,
neural network model which learns vector repre- 1985) based on the gradient descent algorithm, up-
sentations of documents, so-called paragraph vec- dates the weights and reduces the error of the cho-
tors. It is an extension of Word2Vec (Mikolov sen cost function, such as mean squared error or
et al., 2013) which learns vector representations cross-entropy.
of words. Thereby, context words, represented To prevent overfitting in neural networks,
as concatenated one-hot vectors, serve as input dropout (Srivastava et al., 2014) is commonly
and are used to predict one target word. After used. It omits hidden neurons with a certain prob-
training, the weight matrix of the hidden layer ability to generalise more during training and thus
becomes the word matrix, containing the well- enhances the model.3
known Word2Vec word embeddings.
The extension in Doc2Vec is that a unique doc- 3 Data
ument/paragraph id is appended to each input n-
We use a dataset of 57,647 English songtexts with
gram, as if it was a context word. Since paragraph
their corresponding artist and title, which has been
ids have to be different from all words in the vo-
downloaded from Kaggle4 . The data was shuffled
cabulary, the weight matrix can be separated into
uniformly and 10% were held out for validation
the word matrix and the paragraph matrix, the lat-
and test set, respectively.
ter containing document embeddings.2
There are 643 unique artists in the data. Table 1
In a document classification task, labels instead
shows the distribution of artists and songtexts for
of paragraph ids are used. By doing so, the
3
It should be noted here that advanced deep learning mod-
2 els such as CNNs and RNNs exist and have been successfully
The described version of Word2Vec and Doc2Vec is
commonly referred to as “continuous bag of words” (DBOW) used in text classification tasks (Lee and Dernoncourt, 2016),
model. If the input and output are swapped, i.e. a single word but have not been used in the context of this work and are
(Word2Vec) or document (Doc2Vec) is used to predict several therefore not explained in detail.
4
context words, the architecture is called “skip-gram” (SG) or https://www.kaggle.com/mousehead/
“distributed memory” (DM) model, respectively. songlyrics
Dataset Artists Songs Avg. songs Within a token, all non-initial letters are lower-
cased and sequences of repeating letters are short-
Training 642 46,120 71.8 (44.0)
ened to a maximum of three. To further re-
Validation 612 5,765 9.4 (5.7)
duce noise, punctuation is removed. The texts
Test 618 5,765 9.3 (6.0)
are tagged with parts-of-speech (POS) using the
Apache OpenNLP maximum entropy tagger5 .
Table 1: Number of unique artists (classes), num-
ber of songtexts and average number of songtexts Stylometric features Generic information
per artist (standard deviation in parentheses) for about the text, i.e. the number of lines, the number
each dataset. of tokens, the number of tokens that appear only
once, the number of types and the average number
of tokens per line.
Rhyme feature Number of unique line endings
(in terms of the last two letters), normalised by the
number of lines. This should serve as a simple
measure for how well the lines rhyme.
Word count vectors Every unique word in the
training corpus is assigned a unique dimension. In
each dimension, the number of occurrences of the
Figure 1: A histogram showing the distribution of
word in the sample are denoted. Term-frequency
songtexts in the training set.
(tf) weighting is implemented, but can be switched
on and off. As a minimal variant, only nouns can
each subset. The training set contains most of be taken into account; in this case we speak of
the unique artists (642) whereas less artists appear noun count vectors.
in the validation (612) and test set (618). Also,
POS count vectors The same as word count
the standard deviation of average songs per artists
vectors after replacing all words with their POS
is relatively high (i.e. more than half the aver-
tag.
age) which indicates that the number of songs per
artists is spread over a large range. Word2Vec embeddings 300-dimensional em-
The distribution of songs per artists in the train- beddings created from the Google news corpus.6
ing set can be seen in Figure 1. It shows similar The embedding of a text is hereby defined as the
counts for classes (artists) with many and classes average of the embeddings of all the words in the
with only a few samples (songtexts), i.e. unbal- text. As a minimal variant, only nouns can be
anced training data. The bandwidth goes from one taken into account.
artist with 159 songs to four artists with only one
Bias A feature with a constant value of 1 (to
song which also illustrates the sparsity for some
avoid zero vectors as feature vectors).
classes.
Besides the issues caused by the distribution 5 Experiments
of songtexts per artist, the quality of the texts is
less than perfect. Nonsense words and sequences, This section lists the concrete parameter settings
such as tu ru tu tu tu tu, as well as spelling varia- of the methods described in section 2. Since our
tions, such as Yeaaah, Hey-A-Hey and aaaaaaalll- models can only predict classes which have been
lllright!, are very common. encountered during training, only the 612 artists
occurring in all subsets are kept for all evalua-
4 Feature Extraction tions. For another series of experiments, only the
40 unique artists with more songs than 140 in the
For both the (single-layer) perceptron and the training set are kept to reduce the impact of unbal-
multi-layer perceptron we use the same prepro- ancedness and sparsity (numbers in Table 2).
cessing and features which are described below.
5
(The Doc2Vec model uses its own integrated to- https://opennlp.apache.org/docs/1.8.
0/manual/opennlp.html\#tools.postagger
keniser and Word2Vec-based features.) 6
GoogleNews-vectors-negative300.bin.gz from https:
The songtexts are tokenised by whitespace. //code.google.com/archive/p/word2vec/
Dataset Artists Songs Avg. songs
Training 40 5,847 146.2 (4.6)
Validation 40 646 16.2 (3.6)
Test 40 714 17.9 (5.0)

Table 2: Number of unique artists (classes), num-

ber of songtexts and average number of songtexts
per artist (standard deviation in parentheses) for
each dataset, keeping only the 40 unique artists
with more songs than 140 in the training set.

5.1 Experimental Settings

Perceptron We train and test two versions of
the perceptron. The minimal version (Perceptron)
only uses noun count vectors without tf-weighting
and the bias. The maximal version (Perceptron+)
uses all features. All add-ons described in sec-
tion 2.1 (shuffling, batch learning, dynamic learn-
ing rate) are used for both versions since they led
to an increasing performance on the validation set
in preliminary tests. For the decay of the learning
rate, a linear decrease starting from 1 and ending
1
at number of epochs is chosen.

Doc2Vec For the Doc2Vec implementation, we

use deeplearning4j7 version 0.9. The hyper-
paramters are a minimum word frequency of 10, Figure 2: Model of the MLP+ model: Layers
a hidden layer size of 300, a maximum window with corresponding number of neurons. The input
size of 8, and a learning rate starting at 0.025 and layers correspond to the following feature groups
decreasing to 0.001. Tokenisation is performed by (f.l.t.r.): word count vectors, stylometric features,
the incorporated UIMA tokeniser. The model is rhyme feature, POS count vectors, Word2Vec em-
trained for 100 epochs with batch learning.8 beddings.

MLP The implementation of MLP and MLP+ is

done with Keras (Chollet et al., 2015). MLP+ uses function with adaptive learning rate, a batch size
all feature groups from section 4 and one bias fea- of 32 and categorical cross-entropy as the loss
ture for every group. The MLP+ model is shown in function are used. For the activation functions,
Figure 2. Each feature group uses multiple stacked rectified linear units are used in the hidden layer
layers that are then merged with a concatenation and softmax in the output layer. The model trains
layer. The sizes of the dense layers are manually for 250 epochs and stores the weights that led to
selected by trial and error. Several dropout layers the best accuracy on the validation set through a
with a constant probability of 0.2 are included. In checkpoint mechanism.
contrast, MLP uses only the noun count vectors
(as Perceptron) and thus only one input branch. 5.2 Evaluation measures
For both models, Adadelta, an optimisation Given a (test) set X, each sample x ∈ X has a
7
See https://deeplearning4j.org/docs/ class yx and a prediction ŷx . Based on the predic-
latest/deeplearning4j-nlp-doc2vec for a tions, we can calculate class-wise precision (P ),
quick example. recall (R) and F -score as follows:
8
Since deeplearning4j does not document the possibility
to get intermediate evaluation results during training, 10 mod-
els are trained separately for 10, 20, 30 etc. epochs to obtain |{x ∈ X | yx = y ∧ yx = ŷx }|
data points for the learning progress analysis.
P (y) = (2)
|{x ∈ X | ŷx = y}|
final best
Model MST
|{x ∈ X | yx = y ∧ yx = ŷx }| F ep. F ep.
R(y) = (3)
|{x ∈ X | yx = y}| Perceptron 0 .066 100 .069 94
Perceptron+ 0 .003 100 .006 23
2 · P (y) · R(y) Doc2Vec 0 .032 100 .033 70
F (y) = (4)
P (y) + R(y) MLP 0 .023 97 .025 114
The macro-averaged F -score of all classes is MLP+ 0 .079 97 .089 80
the average of the class-wise F -scores: Perceptron 140 .146 100 .160 51
1 X Perceptron+ 140 .021 100 .055 54
Fmacro = · F (y) (5) Doc2Vec 140 .101 100 .120 20
|Y |
y∈Y MLP 140 .050 201 .088 46
For the overall precision and recall, the nom- MLP+ 140 .182 105 .193 109
inators and denominators are summed up for all
y ∈ Y , resulting in: Table 3: Micro-averaged F -score for each model
on the test set after training (final) and during
|{x ∈ X | yx = ŷx }| training (best), together with the corresponding
P =R= (6)
|{x ∈ X}| training epochs. Lower part: only artists with
And the formula for the micro-averaged F - more songs than (MST) 140 are kept in the train-
score: ing and the test set.

2·P ·R
Fmicro = =P =R (7) then reaches a plateau. Since the model uses that
P +R
The identity of overall P and R causes their number of epochs which works best on the valida-
identity with Fmicro . This measure is identical tion set (i.e. 105), it misses the best performance
with the overall accuracy (correct predictions di- at the 109th epoch but gets a final score close to it.
vided by all predictions). Since Fmacro gives ev-
5.4 Error Analysis
ery class the same weight, but we deal with an un-
balanced dataset, we choose Fmicro as evaluation In this section, we shall focus on the confu-
measure and only show Fmacro in some graphs for sion matrices produced by our three main models
comparison. which are depicted in Figure 6 and to be read as
follows: The x-axis and the y-axis represent artists
5.3 Results in the same order (labels are omitted due to leg-
Table 3 shows the micro-averaged F -score for all ibility). Each cell indicates how often the artist
models. MLP+ performs best, followed by Per- on the x-axis was classified as the artist on the
ceptron and Doc2Vec. The use of additional fea- y-axis (darker colours are higher numbers). The
tures significantly decreases the performance of cells on the main diagonal correspond to the cases
the perceptron (Perceptron+), but increases it for where a class was classified correctly hence a vis-
the multi-layer network (MLP+). This observation ible diagonal correlates with good results. Darker
is discussed in section 5.4. cells outside of the diagonal are significant mis-
Figures 3–5 show the performance of Percep- classifications9 and might be interesting to look
tron, Doc2Vec and MLP+ in dependence of the into.
number of training epochs. The Perceptron (Fig- As is clearly visible in Figure 6 (a), something is
ure 4) shows a generally increasing learning curve, wrong with the predictions of the Doc2Vec model.
i.e. more epochs lead to better results. Peaks like There are hints of a diagonal showing at the top-
the one after the 51st epoch are ignored since the left—however, there are entire columns of dark
model uses a fixed number of 100 training epochs. colour. This means that there are classes that are
Doc2Vec (Figure 5) reaches its best performance almost always predicted, no matter which class a
with 20 epochs and does not show any learning sample actually belongs to. Interestingly, we ob-
progress after that, even if trained for 100 epochs. 9
Such mis-classifications will be denoted as outliers in the
The MLP+ model (Figure 3) exhibits increasing following, since we are talking about the correlation of the
performance until around 100 training epochs and axes here.
Figure 3: Micro-averaged (above) and macro-averaged (below) F -score of the MLP+ model on the test
set (MST=140) after each training epoch.

Figure 4: Micro-averaged (above) and macro- Figure 5: Micro-averaged (above) and macro-
averaged (below) F -score of the Perceptron model averaged (below) F -score of the Doc2Vec model
on the test set (MST=140) after each training on the test set (MST=140) for different numbers
epoch. of training epochs.

served the same behaviour with our perceptron im- differently from what one might expect. Addi-
plementation in preliminary tests which changed tional stylometric features that were specifically
when we started using batch learning. However, designed for the task lead to significantly worse
our Doc2Vec implementation already uses batch results than the simple noun count vectors (Per-
learning and mini-batch learning did not improve ceptron+ vs. Perceptron). However, this does not
the performance in postliminary tests. tell anything about the perceptron’s performance
Figure 6 (b) shows that the perceptron performs when using other feature combinations.
well on most classes. However, there are very few The error distribution of the multi-layer percep-
classes where it actually recognises (almost) all of tron is displayed in Figure 6 (c). Compared to the
the samples. There are three major outliers and perceptron, most of the classes are predicted much
many outliers overall. This explains the rather low better and there are less outliers overall. However,
scores the model achieves, even though the diag- there are more major outliers which explains the
onal is clearly visible. Looking at the three big still rather low performance. In further contrast to
outliers and engineering new features specifically the perceptron, the use of all features leads to a
for those could improve the performance. How- performance increase (MLP+ vs. MLP). This is
ever, the informativeness of features can behave probably the case because the multi-layer percep-
(a) Doc2Vec (b) Perceptron (c) MLP+

Figure 6: Confusion matrices of different models on the test set (MST=140).

tron can learn individual weights for the different gineering good features.
feature groups through multiple branches and lay-
ers much better than the single-layer perceptron. 7 Future Work
A general clustering-based topic model encoded
6 Summary & Conclusion
in new features could potentially improve the per-
Songtext–artist classification is an example of formance of songtext classifiers. Looking at our
multiclass text classification on unbalanced, multi-layer perceptron, new features seem to be a
sparse and noisy data. Three neural network mod- good way to handle such difficult data. Other net-
els have been investigated on this specific task. 1) work architectures such as CNNs and RNNs can
A single-layer perceptron which can be used for be considered worth a look as well since they im-
all kinds of classification on vectorised data. 2) proved (noisy) text classification in previous stud-
Doc2Vec which is a contemporary tool for text ies (e.g. Lai et al., 2015; Apostolova and Kreek,
classification. And 3) an extended multi-layer per- 2018).
ceptron which we designed specifically for this Another way of dealing with imbalanced data is
task. While the third and most complex model to apply oversampling to raise the number of sam-
achieves the best results, it becomes also visible ples for sparse classes, or undersampling to reduce
that the choice of features has a significant ef- the number of samples for frequent classes.
fect on the classification performance. Here, too,
the multi-layer network with its advanced combi- 8 Acknowledgements
nation of different, stylometric as well as count/ We thank the three anonymous reviewers for their
embedding-based, feature groups outperforms the valuable comments. This work emerged from a
other models. course at the Institute for Natural Language Pro-
We come to the conclusion that a vast number cessing of the University of Stuttgart; we thank our
of and unbalanced classes as well as sparse and supervisors Evgeny Kim and Roman Klinger. We
not directly correlated data do not allow for a per- further thank Sebastian Padó for the organisational
fect performance. Thus, given a text classification support.
task where the data is as difficult, it makes sense
to reduce the data to something that is manage-
able and meaningful. Sparse classes in a noisy References
sample space are little more than guesswork which Emilia Apostolova and R. Andrew Kreek. 2018. Train-
might confuse the classifier and decrease the per- ing and prediction data discrepancies: Challenges of
formance on more important classes. While it is text classification with noisy, historical data. arXiv
preprint arXiv:1809.04019.
somewhat obvious that removing difficult cases
from the data improves the overall results, it is Russell Beale and Tom Jackson. 1990. Neural Com-
not something that one would usually do in a real- puting: An Introduction, pages 67–73. CRC Press.
world application. We argue, that it can be a prac- François Chollet et al. 2015. Keras: Deep learning li-
tical step to approach such a classification task, for brary for theano and tensorflow. URL: https://keras.
elaborating the complexity of the classifier and en- io/k, 7(8):T1.
Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015.
Recurrent convolutional neural networks for text
classification. In Twenty-ninth AAAI conference on
artificial intelligence.
Quoc Le and Tomas Mikolov. 2014. Distributed repre-
sentations of sentences and documents. In Interna-
tional conference on machine learning, pages 1188–
1196.
Ji Young Lee and Franck Dernoncourt. 2016. Se-
quential short-text classification with recurrent and
convolutional neural networks. arXiv preprint
arXiv:1603.03827.
Rudolf Mayer, Robert Neumayer, and Andreas Rauber.
2008. Rhyme and style features for musical genre
classification by song lyrics. In Ismir, pages 337–
342.
Ryan McDonald, Keith Hall, and Gideon Mann. 2010.
Distributed training strategies for the structured per-
ceptron. In Human Language Technologies: The
2010 Annual Conference of the North American
Chapter of the Association for Computational Lin-
guistics, pages 456–464. Association for Computa-
tional Linguistics.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jef-
frey Dean. 2013. Efficient estimation of word
representations in vector space. arXiv preprint
arXiv:1301.3781.
Frank Rosenblatt. 1958. The perceptron: a probabilis-
tic model for information storage and organization
in the brain. Psychological review, 65(6):386.
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J.
Williams. 1985. Learning internal representations
by error propagation. Technical report, Institute for
Cognitive Science, University of California.
Stacy L. Smith, Marc Choueiti, Katherine Pieper, Han-
nah Clark, Ariana Case, and Sylvia Villanueva.
2019. Inclusion in the recording studio? Gender
and race/ethnicity of artists, songwriters & produc-
ers across 700 popular songs from 2012-2018. USC
Annenberg Inclusion Initiative.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
Dropout: a simple way to prevent neural networks
from overfitting. The journal of machine learning
research, 15(1):1929–1958.

Alper Kursat Uysal and Serkan Gunal. 2014. The im-

pact of preprocessing on text classification. Infor-
mation Processing & Management, 50(1):104–112.

NLP Module 3
No ratings yet
NLP Module 3
66 pages
Full Text 01
No ratings yet
Full Text 01
54 pages
Dynamic Embedding Projection-Gated
No ratings yet
Dynamic Embedding Projection-Gated
10 pages
Winter Semester 2023-24 CSE3015 ETH AP2023246000714 Quiz-I-Question-Paper
No ratings yet
Winter Semester 2023-24 CSE3015 ETH AP2023246000714 Quiz-I-Question-Paper
74 pages
What Is Text Classification - Exxact
No ratings yet
What Is Text Classification - Exxact
12 pages
Text-Based Classification
No ratings yet
Text-Based Classification
7 pages
Generative and Discriminative Text Classification With Recurrent Neural Networks
No ratings yet
Generative and Discriminative Text Classification With Recurrent Neural Networks
9 pages
A Survey On Text Classification From Shallow To Deep Learning
No ratings yet
A Survey On Text Classification From Shallow To Deep Learning
21 pages
Text Classification
No ratings yet
Text Classification
24 pages
03 ML Essentials
No ratings yet
03 ML Essentials
52 pages
Text Classification Using NLP
No ratings yet
Text Classification Using NLP
28 pages
Impact of Convolutional Neural Network and Fasttext Embedding On Text Classification
No ratings yet
Impact of Convolutional Neural Network and Fasttext Embedding On Text Classification
17 pages
ML Notes - 2025
No ratings yet
ML Notes - 2025
145 pages
ML Unit-Ii
No ratings yet
ML Unit-Ii
37 pages
CNN vs. LSTM For Turkish Text Classification
No ratings yet
CNN vs. LSTM For Turkish Text Classification
6 pages
L2 Cse256 Fa24 TC
No ratings yet
L2 Cse256 Fa24 TC
65 pages
Deep Neural Network Language Models - W12-2703
No ratings yet
Deep Neural Network Language Models - W12-2703
9 pages
Machine Learning, NLP - Text Classification Using Scikit-Learn, Python and NLTK
No ratings yet
Machine Learning, NLP - Text Classification Using Scikit-Learn, Python and NLTK
9 pages
L5 TextClassification Updated
No ratings yet
L5 TextClassification Updated
179 pages
ITD253 L6 TextClassificationClustering
No ratings yet
ITD253 L6 TextClassificationClustering
39 pages
NLP m4
No ratings yet
NLP m4
97 pages
Lect 05
No ratings yet
Lect 05
17 pages
Lecture 6 - Word2Vec and Text Classification
No ratings yet
Lecture 6 - Word2Vec and Text Classification
66 pages
Classification
No ratings yet
Classification
21 pages
Product Classification in E-Commerce Using Distributional Semantics
No ratings yet
Product Classification in E-Commerce Using Distributional Semantics
17 pages
Distributed Representations of Sentences and Documents: Quoc Le Tomas Mikolov
No ratings yet
Distributed Representations of Sentences and Documents: Quoc Le Tomas Mikolov
9 pages
Research Paper 3
No ratings yet
Research Paper 3
7 pages
Paragraph Vector PDF
No ratings yet
Paragraph Vector PDF
9 pages
Unit 2
No ratings yet
Unit 2
26 pages
Shallow Parsing With Conditional Random Fields
No ratings yet
Shallow Parsing With Conditional Random Fields
8 pages
Pattern Recognition in Neural Networks: T. Muthya Mounika, V.V. Vishnu Prabhakar
No ratings yet
Pattern Recognition in Neural Networks: T. Muthya Mounika, V.V. Vishnu Prabhakar
3 pages
Text Classification
No ratings yet
Text Classification
3 pages
Challenges in ML&DM
No ratings yet
Challenges in ML&DM
12 pages
NLP NB
No ratings yet
NLP NB
52 pages
Perceptrons: Fundamentals and Applications for The Neural Building Block
From Everand
Perceptrons: Fundamentals and Applications for The Neural Building Block
Fouad Sabry
No ratings yet
Machine Learning Algorithm Cheat Sheet
No ratings yet
Machine Learning Algorithm Cheat Sheet
1 page
Non Linear 1704955560
No ratings yet
Non Linear 1704955560
50 pages
AI Lab 1
No ratings yet
AI Lab 1
11 pages
NLP-NeuralNetworks Reading Notes
No ratings yet
NLP-NeuralNetworks Reading Notes
13 pages
CS585 Lecture October01st
No ratings yet
CS585 Lecture October01st
158 pages
NLP Short
No ratings yet
NLP Short
5 pages
Hidden Markov Support Vector Machines
No ratings yet
Hidden Markov Support Vector Machines
8 pages
Short Course On Deep Learning: Welcome!!
No ratings yet
Short Course On Deep Learning: Welcome!!
57 pages
M2 Transcript
No ratings yet
M2 Transcript
11 pages
ML7 - Text Classification
No ratings yet
ML7 - Text Classification
13 pages
ML 4
No ratings yet
ML 4
32 pages
ML Lecture 2 Supervised Learning Setup
No ratings yet
ML Lecture 2 Supervised Learning Setup
38 pages
Pattern Recognition Ai Notes
No ratings yet
Pattern Recognition Ai Notes
5 pages
CS464 Chapter 4: Naïve Bayes: (Slides Based On The Slides Provided by Öznur Taştan and Mehmet Koyutürk)
No ratings yet
CS464 Chapter 4: Naïve Bayes: (Slides Based On The Slides Provided by Öznur Taştan and Mehmet Koyutürk)
55 pages
127 1498038923 - 21-06-2017 PDF
No ratings yet
127 1498038923 - 21-06-2017 PDF
9 pages
Text Classification Research Based On Bert Model and Bayesian Network
No ratings yet
Text Classification Research Based On Bert Model and Bayesian Network
5 pages
Deep Learning For NLP
No ratings yet
Deep Learning For NLP
78 pages
NLP 160709201345
No ratings yet
NLP 160709201345
61 pages
Building Machine Learning Systems With Python - Second Edition - Sample Chapter
100% (2)
Building Machine Learning Systems With Python - Second Edition - Sample Chapter
32 pages
Natural Language Processing-Section
No ratings yet
Natural Language Processing-Section
38 pages
Naive Bayes and Sentiment Classification
No ratings yet
Naive Bayes and Sentiment Classification
23 pages
Mihai Surdeanu, Marco Antonio Valenzuela-Escarcega - Deep Learning For Natural Language Processing - A Gentle Introduction-Cambridge University Press (2024)
No ratings yet
Mihai Surdeanu, Marco Antonio Valenzuela-Escarcega - Deep Learning For Natural Language Processing - A Gentle Introduction-Cambridge University Press (2024)
345 pages
Chap 1 Introduction To ML
No ratings yet
Chap 1 Introduction To ML
33 pages
Ai 2
No ratings yet
Ai 2
19 pages
Multilayer Perceptron: Fundamentals and Applications for Decoding Neural Networks
From Everand
Multilayer Perceptron: Fundamentals and Applications for Decoding Neural Networks
Fouad Sabry
No ratings yet
Zaid-LAB - 09 - Jupyter Notebook
No ratings yet
Zaid-LAB - 09 - Jupyter Notebook
3 pages
Vishek Sahu Desertation
No ratings yet
Vishek Sahu Desertation
84 pages
EC8393 QB 01 2MARKS - by LearnEngineering - in
No ratings yet
EC8393 QB 01 2MARKS - by LearnEngineering - in
9 pages
An Introduction To Information Theory: Adrish Banerjee
No ratings yet
An Introduction To Information Theory: Adrish Banerjee
16 pages
Hackerrank Test
0% (7)
Hackerrank Test
3 pages
MHF4U Characteristics of Polynomial Functions
No ratings yet
MHF4U Characteristics of Polynomial Functions
3 pages
Deep Learning
No ratings yet
Deep Learning
4 pages
Flat Course File 1
No ratings yet
Flat Course File 1
19 pages
Compiler Design Introduction
No ratings yet
Compiler Design Introduction
23 pages
Compiler-Group Assignment
No ratings yet
Compiler-Group Assignment
15 pages
Introducere Calculator Cuantic
No ratings yet
Introducere Calculator Cuantic
103 pages
A Universal Turing Machine: Fall 2006 Costas Busch - RPI 1
No ratings yet
A Universal Turing Machine: Fall 2006 Costas Busch - RPI 1
61 pages
Machine Learning (Open Elective) B.Tech. ECE LTPC 3 0 0 3 Objectives
No ratings yet
Machine Learning (Open Elective) B.Tech. ECE LTPC 3 0 0 3 Objectives
2 pages
Bits F232 Foundations of Data Structures and Algorithms - Handout
No ratings yet
Bits F232 Foundations of Data Structures and Algorithms - Handout
5 pages
7.3, 7.4 Dijkastra and Bellman Ford
No ratings yet
7.3, 7.4 Dijkastra and Bellman Ford
14 pages
Report Homework 1: 1 Openmp Experiment
No ratings yet
Report Homework 1: 1 Openmp Experiment
8 pages
R22 ML Question Bank For It and CSM
No ratings yet
R22 ML Question Bank For It and CSM
4 pages
Project #1: Maze Solving Task 03 - Coding The Solution Documentation
No ratings yet
Project #1: Maze Solving Task 03 - Coding The Solution Documentation
2 pages
Lessn Plan FLA
No ratings yet
Lessn Plan FLA
3 pages
12th PYTHON Chapter 1 To 16
No ratings yet
12th PYTHON Chapter 1 To 16
271 pages
Unit 4 Classification (1) (P)
No ratings yet
Unit 4 Classification (1) (P)
50 pages
Session 2: Section A, 3pm To 5:00 PM Data Types in Java: Class Notes by DR B P Sharma - 880060025
No ratings yet
Session 2: Section A, 3pm To 5:00 PM Data Types in Java: Class Notes by DR B P Sharma - 880060025
6 pages
Discrete Structure Syllabus
100% (1)
Discrete Structure Syllabus
2 pages
Lecture 1 PDF
No ratings yet
Lecture 1 PDF
23 pages
Artificial Intelligence Mca-401 Index: Unit Ii
No ratings yet
Artificial Intelligence Mca-401 Index: Unit Ii
8 pages
Algo - CH - 5 and 6
No ratings yet
Algo - CH - 5 and 6
46 pages
Magic Composite
No ratings yet
Magic Composite
4 pages
System of Inequalities Word Problems: D. I. R. T
No ratings yet
System of Inequalities Word Problems: D. I. R. T
2 pages
Chapter 5
No ratings yet
Chapter 5
38 pages
Cosc 340
No ratings yet
Cosc 340
3 pages

Multiclass Text Classification On Unbalanced, Sparse and Noisy Data

Uploaded by

Multiclass Text Classification On Unbalanced, Sparse and Noisy Data

Uploaded by

Multiclass Text Classification on Unbalanced, Sparse and Noisy Data

Tillmann Dönicke1 , Florian Lux2 , and Matthias Damaschk3

Abstract predict something latent, since there is no direct

Table 2: Number of unique artists (classes), num-

5.1 Experimental Settings

Doc2Vec For the Doc2Vec implementation, we

MLP The implementation of MLP and MLP+ is

Figure 6: Confusion matrices of different models on the test set (MST=140).

Alper Kursat Uysal and Serkan Gunal. 2014. The im-

You might also like