NN MLC Ecml2014 Camera Ready
Abstract. Neural networks have recently been proposed for multi-label classi-
fication because they are able to capture and model label dependencies in the
output layer. In this work, we investigate limitations of BP-MLL, a neural net-
work (NN) architecture that aims at minimizing pairwise ranking error. Instead,
we propose to use a comparably simple NN approach with recently proposed
learning techniques for large-scale multi-label text classification tasks. In partic-
ular, we show that BP-MLL’s ranking loss minimization can be efficiently and
effectively replaced with the commonly used cross entropy error function, and
demonstrate that several advances in neural network training that have been de-
veloped in the realm of deep learning can be effectively employed in this setting.
Our experimental results show that simple NN models equipped with advanced
techniques such as rectified linear units, dropout, and AdaGrad perform as well as
or even outperform state-of-the-art approaches on six large-scale textual datasets
with diverse characteristics.
1 Introduction
Multi-label classification is an automatic approach that learns to assign a suitable
subset of categories from an established classification system to a given text. In the
literature, one can find a number of multi-label classification
approaches for a variety of tasks in different domains such as bioinformatics [1], music
[28], and text [9]. In the simplest case, multi-label classification may be viewed as a set
of binary classification tasks that decides for each label independently whether it should
be assigned to the document or not. However, this so-called binary relevance approach
ignores dependencies between the labels, so that current research in multi-label classifi-
cation concentrates on the question of how such dependencies can be exploited [23, 3].
One such approach is BP-MLL [32], which formulates multi-label classification prob-
lems as a neural network with multiple output nodes, one for each label. The output
layer is able to model dependencies between the individual labels.
This is the author's personal copy. The final publication is available at Springer via
http://dx.doi.org/10.1007/978-3-662-44851-9_28
2 J. Nam, J. Kim, E. Loza, I. Gurevych, and J. Fürnkranz
In this work, we directly build upon BP-MLL and show how a simple, single hidden
layer NN may achieve a state-of-the-art performance in large-scale multi-label text clas-
sification tasks. The key modifications that we suggest are (i) more efficient and more
effective training by replacing BP-MLL’s pairwise ranking loss with cross entropy and
(ii) the use of recent developments in the area of deep learning such as rectified linear
units (ReLUs), Dropout, and AdaGrad.
Even though we employ techniques that have been developed in the realm of deep
learning, we nevertheless stick to single-layer NNs. The motivation behind this is
twofold: first, a simple network configuration allows better scalability of the model and
is more suitable for large-scale tasks on textual data.³ Second, as has been shown in the
literature [15], popular feature representation schemes for textual data such as variants
of tf-idf term weighting already incorporate a certain degree of higher dimensional
features, and we speculate that even a single-layer NN model can work well with text data.
This paper provides empirical evidence that a simple NN model equipped with recent
advanced techniques for training NNs performs as well as or even outperforms
state-of-the-art approaches on large-scale datasets with diverse characteristics.
2 Multi-label Classification
Formally, multi-label classification may be defined as follows: $\mathcal{X} \subset \mathbb{R}^D$ is a set of $M$
instances, each being a $D$-dimensional feature vector, and $\mathcal{L}$ is a set of $L$ labels. Each
instance $x$ is associated with a subset of the $L$ labels, the so-called relevant labels;
all other labels are irrelevant for this example. The task of the learner is to learn a
mapping function $f : \mathbb{R}^D \to 2^{\mathcal{L}}$ that assigns a subset of labels to a given instance. An
alternative view is that we have to predict an $L$-dimensional target vector $y \in \{0, 1\}^L$,
where $y_i = 1$ indicates that the $i$-th label is relevant, whereas $y_i = 0$ indicates that it is
irrelevant for the given instance.
Many algorithms have been developed for tackling this type of problem. The most
straightforward way is binary relevance (BR) learning; it constructs L binary classifiers,
which are trained on the L labels independently. Thus, the prediction of the label set is
composed of independent predictions for individual labels. However, labels often occur
together, that is, the presence of a specific label may lower or raise the likelihood
of other labels.
To address this limitation of BR, pairwise decomposition (PW) and label powerset
(LP) approaches consider label dependencies during the transformation by either gener-
ating pairwise subproblems [9, 20] or the powerset of possible label combinations [29].
Classifier chains [23, 3] are another popular approach that extend BR by including pre-
vious predictions into the predictions of subsequent labels. [7] present a large-margin
classifier, RankSVM, that minimizes a ranking loss by penalizing incorrectly ordered
pairs of labels. This setting can be used for multi-label classification by assuming that
the ranking algorithm has to rank each relevant label before each irrelevant label. In
order to make a prediction, the ranking has to be calibrated [9], i.e., a threshold has
to be found that splits the ranking into relevant and irrelevant labels. Similarly, Zhang
and Zhou [32] introduced a framework that learns ranking errors in neural networks via
backpropagation (BP-MLL).
³ Deep NNs, in fact, scale well and work effectively in other domains by learning features from
raw inputs which are usually smaller than hand-crafted features extracted from the raw inputs.
However, in our case, the dimensions of raw inputs are relatively large and the computational cost
of training deep NNs depends highly on the number of input features.
Large-scale Multi-label Text Classification 3
Fig. 1. (a) a neural network with a single hidden layer of two units and multiple output units,
one for each possible label. (b) shows how the threshold for a training example is estimated based
on the prediction output $o$ of the network. Consider nine possible labels, of which $o_1$, $o_4$ and $o_6$ are
relevant labels (blue) and the rest are irrelevant (red). The figure shows three exemplary threshold
candidates (dashed lines), of which the middle one is the best choice because it gives the highest
F1 score. See Section 3.3 for more details.
The most prominent learning method for multi-label text classification is to use a BR
approach with strong binary classifiers such as SVMs [24, 30] despite its simplicity.
It is well known that characteristics of high-dimensional and sparse data, such as text
data, make decision problems linearly separable [15], and this characteristic suits the
strengths of SVM classifiers well. Unlike benchmark datasets, real-world text collections
consist of a large number of training examples represented in a high-dimensional
space with a large number of labels. To handle such datasets, researchers have derived
efficient linear SVMs [16, 8] that can handle large-scale problems. However, their per-
formance decreases as the number of labels grows and the label frequency distribution
becomes skewed [19, 24]. In such cases, it is also intractable to employ methods that
minimize ranking errors among labels [7, 32] or that learn joint probability distributions
of labels [11, 3].
where $w(y)$ is a normalization factor, $I(\cdot)$ is the indicator function, and $f_i(\cdot)$ is a
prediction score for a label $i$. Unfortunately, the rank loss is hard to minimize directly
because the loss function is non-convex. Therefore, convex surrogate losses have been
proposed as alternatives to rank loss [26, 7, 32].
where $\Theta = \{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}\}$, and $f_o$ and $f_h$ are element-wise activation
functions in the output layer and the hidden layer, respectively. Specifically, the function
$f_\Theta(x)$ can be re-written as follows:
$$z^{(1)} = W^{(1)} x + b^{(1)}, \quad h = f_h\left(z^{(1)}\right)$$
$$z^{(2)} = W^{(2)} h + b^{(2)}, \quad o = f_o\left(z^{(2)}\right)$$
where $z^{(1)}$ and $z^{(2)}$ denote the weighted sum of inputs and hidden activations,
respectively. Our aim is to find a parameter vector $\Theta$ that minimizes a cost function $J(\Theta; x, y)$.
The cost function measures the discrepancy between predictions of the network and given
targets $y$.
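The two-step computation above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the helper names (`affine`, `forward`), the toy sizes (3 inputs, 2 hidden units, 4 labels) and all weight values are invented for the example, and ReLU and sigmoid are picked as $f_h$ and $f_o$ because they are the choices discussed later in the paper.

```python
import math

def affine(W, v, b):
    """Return W v + b, with the matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def relu(z):
    """Element-wise max(0, z)."""
    return [max(0.0, v) for v in z]

def sigmoid(z):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def forward(x, W1, b1, W2, b2, f_h=relu, f_o=sigmoid):
    """z1 = W1 x + b1, h = f_h(z1), z2 = W2 h + b2, o = f_o(z2)."""
    z1 = affine(W1, x, b1)
    h = f_h(z1)
    z2 = affine(W2, h, b2)
    o = f_o(z2)
    return h, o

# Toy values (assumed for illustration only).
x = [1.0, 0.0, 2.0]
W1 = [[0.5, -1.0, 0.25], [-0.5, 0.3, -0.25]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [-1.0, 0.5], [0.0, 0.0], [2.0, 1.0]]
b2 = [0.0, 0.0, 0.0, -2.0]

h, o = forward(x, W1, b1, W2, b2)
print(h)  # hidden activations
print(o)  # one score per label
```

Each entry of `o` is a score for one label, so the output layer has as many units as there are labels.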
BP-MLL [32] minimizes errors induced by incorrectly ordered pairs of labels, in
order to exploit dependencies among labels. To this end, it introduces a pairwise error
function (PWE), which is defined as follows:
$$J_{PWE}(\Theta; x, y) = \frac{1}{|y||\bar{y}|} \sum_{(p,n) \in y \times \bar{y}} \exp\left(-(o_p - o_n)\right) \quad (3)$$
where $p$ and $n$ are the indices of a positive and a negative label associated with training
example $x$, $\bar{y}$ represents the set of negative labels, and $|\cdot|$ stands for the cardinality.
The PWE is a relaxation of the loss function in Equation 1 that we want to minimize.
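As a small illustration of Equation 3, the sketch below evaluates the pairwise error for a single example. The function name `pairwise_error`, the toy scores, and the choice to raise an error when either label set is empty are assumptions made for this example; they are not taken from BP-MLL's reference implementation.

```python
import math

def pairwise_error(o, positives, negatives):
    """PWE for one example: mean of exp(-(o_p - o_n)) over all (p, n) pairs."""
    if not positives or not negatives:
        # Assumption: the normalizer |y||y_bar| would be zero, so we refuse to score.
        raise ValueError("PWE needs at least one positive and one negative label")
    total = sum(math.exp(-(o[p] - o[n])) for p in positives for n in negatives)
    return total / (len(positives) * len(negatives))

o = [0.9, 0.2, 0.8, 0.1]  # toy network outputs (assumed)
well_ordered = pairwise_error(o, positives=[0, 2], negatives=[1, 3])
badly_ordered = pairwise_error(o, positives=[1, 3], negatives=[0, 2])
print(well_ordered, badly_ordered)
```

When every positive label scores above every negative label the value falls below 1, and it rises above 1 when the order is reversed. Note that the double loop visits all $|y||\bar{y}|$ pairs.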
As no closed-form solution exists to minimize the cost function, we use a gradient-
based optimization method.
$$\Theta^{(\tau+1)} = \Theta^{(\tau)} - \eta \nabla_{\Theta^{(\tau)}} J\left(\Theta^{(\tau)}; x, y\right) \quad (4)$$
The parameter $\Theta$ is updated by taking a small step in the direction of the negative gradient
of the cost function $J(\Theta^{(\tau)}; x, y)$ with respect to $\Theta$ at step $\tau$. The parameter $\eta$, called
the learning rate, determines the step size of updates.
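The update rule of Equation 4 can be tried out on a toy problem. The sketch below is illustrative only: the quadratic cost, its gradient, the starting point and the learning rate are all assumed, and stand in for the network cost $J(\Theta; x, y)$.

```python
def gradient_step(theta, grad, eta):
    """One update: theta <- theta - eta * grad (element-wise)."""
    return [t - eta * g for t, g in zip(theta, grad)]

def cost(theta):
    """Toy cost (assumed): sum of squares, minimized at the origin."""
    return sum(t * t for t in theta)

def cost_grad(theta):
    """Gradient of the toy cost."""
    return [2.0 * t for t in theta]

theta = [1.0, -2.0]
history = [cost(theta)]
for _ in range(20):
    theta = gradient_step(theta, cost_grad(theta), eta=0.1)
    history.append(cost(theta))

print(history[0], history[-1])
```

With this cost and step size every update shrinks the parameters by a constant factor, so the recorded cost decreases monotonically.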
3.3 Thresholding
Once training of the neural network is finished, its output can be used to rank labels, but
additional measures are needed in order to split the ranking into relevant and irrelevant
labels. For transforming the ranked list of labels into a set of binary predictions, we train
a multi-label threshold predictor from training data. This kind of thresholding method
is also used in [7, 32].
For each document xm , labels are sorted by the probabilities in decreasing order.
Ideally, if NNs successfully learn a mapping function fΘ , all correct (positive) labels
will be placed on top of the sorted list and there should be a large margin between the set
of positive labels and the set of negative labels. Using F1 score as a reference measure,
we calculate classification performances at every pair of successive positive labels and
choose a threshold value tm that produces the best performance (Figure 1 (b)).
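The per-document threshold search can be sketched as follows. This is a simplified illustration: candidate thresholds are taken as midpoints between all successive scores in the ranked list, which is an assumption, and the names `f1_score` and `best_threshold` as well as the nine toy scores are invented for the example (three relevant labels out of nine, mirroring Figure 1 (b)).

```python
def f1_score(predicted, relevant):
    """F1 between a predicted label set and the set of relevant labels."""
    tp = len(predicted & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, relevant):
    """Try midpoints between successive ranked scores; keep the one with the best F1."""
    ranked = sorted(scores, reverse=True)
    candidates = [(a + b) / 2.0 for a, b in zip(ranked, ranked[1:])]
    best_t, best_f1 = None, -1.0
    for t in candidates:
        predicted = {i for i, s in enumerate(scores) if s > t}
        f1 = f1_score(predicted, relevant)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Nine labels; labels 0, 3 and 5 are the relevant ones (toy data).
scores = [0.9, 0.2, 0.3, 0.7, 0.1, 0.6, 0.25, 0.15, 0.05]
relevant = {0, 3, 5}
t_m, f1 = best_threshold(scores, relevant)
print(t_m, f1)
```

The returned `t_m` plays the role of the target value $t_m$ for document $x_m$ when the threshold predictor is trained afterwards.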
Afterwards, we can train a multi-label thresholding predictor $\hat{t} = T(x; \theta)$ to learn
$t$ as target values from input pattern $x$. We use linear regression with $\ell_2$-regularization
to learn $\theta$:
$$J(\theta) = \frac{1}{2M} \sum_{m=1}^{M} \left( T(x_m; \theta) - t_m \right)^2 + \frac{\lambda}{2} \|\theta\|_2^2 \quad (5)$$
BP-MLL is supposed to perform better in multi-label problems than the standard form of
NN, since it takes label correlations into consideration whereas the standard form does
not. However, we have found in our preliminary experiments that BP-MLL does not perform
as expected, particularly on datasets in the textual domain.
Consistency w.r.t. Rank Loss Recently, it has been claimed that no convex loss
function, including BP-MLL's loss function (Equation 3), is consistent with respect to
rank loss, which is non-convex and discontinuous [2, 10]. However, univariate
surrogate loss functions such as log loss have been shown to be consistent with rank loss [4].
$$J_{log}(\Theta; x, y) = w(y) \sum_{l} \log\left(1 + e^{-\dot{y}_l z_l}\right)$$
where $w(y)$ is a weighting function that normalizes the loss in terms of $y$, and $z_l$
indicates the prediction for label $l$. Please note that the log loss is often used for logistic
regression, in which $\dot{y} \in \{-1, 1\}$ is the target and $z_l$ is the output of a linear function
$z_l = \sum_k W_{lk} x_k + b_l$, where $W_{lk}$ is a weight from input $x_k$ to output $z_l$ and $b_l$ is the bias for
label $l$. A typical choice is, for instance, $w(y) = (|y||\bar{y}|)^{-1}$ as in BP-MLL. In this work,
we set $w(y) = 1$; the log loss above is then equivalent to cross entropy (CE), which is
commonly used to train neural networks for classification tasks, if we use the sigmoid
transfer function in the output layer, i.e. $f_o(z) = 1/(1 + \exp(-z))$, or simply $f_o(z) = \sigma(z)$:
$$J_{CE}(\Theta; x, y) = -\sum_{l} \left( y_l \log o_l + (1 - y_l) \log(1 - o_l) \right) \quad (6)$$
where ol and yl are the prediction and the target for label l, respectively. Let us verify
the equivalence between the log loss and the CE. Consider the log loss function for only
label l.
$$J_{log}(\Theta; x, y_l) = \log\left(1 + e^{-\dot{y}_l z_l}\right) = -\log\left(\frac{1}{1 + e^{-\dot{y}_l z_l}}\right) \quad (7)$$
As noted, ẏ in the log loss takes either −1 or 1, which allows us to split the above
equation as follows:
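$$-\log\left(\frac{1}{1 + e^{-\dot{y}_l z_l}}\right) = \begin{cases} -\log\left(\sigma(z_l)\right) & \text{if } \dot{y} = 1 \\ -\log\left(\sigma(-z_l)\right) & \text{if } \dot{y} = -1 \end{cases} \quad (8)$$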
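The equivalence can also be checked numerically. The sketch below compares the per-label log loss with targets in $\{-1, 1\}$ against the per-label cross entropy with targets in $\{0, 1\}$ and a sigmoid output; it relies on the identity $\sigma(-z) = 1 - \sigma(z)$. The function names and the handful of test values for $z$ are assumptions made for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_pm, z):
    """Per-label log loss with target y_pm in {-1, +1}."""
    return math.log(1.0 + math.exp(-y_pm * z))

def cross_entropy(y01, o):
    """Per-label cross entropy with target y01 in {0, 1} and prediction o in (0, 1)."""
    return -(y01 * math.log(o) + (1 - y01) * math.log(1.0 - o))

pairs = []
for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    for y01 in (0, 1):
        y_pm = 2 * y01 - 1  # map {0, 1} to {-1, +1}
        pairs.append((log_loss(y_pm, z), cross_entropy(y01, sigmoid(z))))

for a, b in pairs:
    print(round(a, 6), round(b, 6))
```

Both columns agree up to floating-point rounding for every combination of $z$ and target.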
Whereas the computation of $\delta^{(2)}_l = \left( -y_l / o_l + (1 - y_l)/(1 - o_l) \right) f_o'\left(z^{(2)}_l\right)$ for the CE can be
performed efficiently, obtaining error terms $\delta^{(2)}_l$ for the PWE is $L$ times more expensive
than in an ordinary NN utilizing the cross entropy error function. This also shows that
BP-MLL scales poorly w.r.t. the number of unique labels.
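To make the cost difference concrete, the sketch below computes the CE error term for one output unit and, for comparison, the gradient of the PWE of Equation 3 with respect to the outputs. With a sigmoid output, $f_o'(z) = o(1 - o)$, so the CE error term reduces to $o_l - y_l$ and needs only the unit's own values. The PWE gradient, in contrast, loops over every (positive, negative) pair. Function names and toy values are assumptions, and the PWE gradient shown here is with respect to $o$ only (it would still have to be multiplied by $f_o'$).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ce_delta(y, z):
    """CE error term for one sigmoid output unit with target y in {0, 1}."""
    o = sigmoid(z)
    f_prime = o * (1.0 - o)  # derivative of the sigmoid
    return (-y / o + (1 - y) / (1.0 - o)) * f_prime

def pwe_output_grads(o, y):
    """Gradient of the PWE w.r.t. each output, plus the number of pairs visited."""
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    norm = len(pos) * len(neg)
    grads = [0.0] * len(o)
    for p in pos:
        for n in neg:
            e = math.exp(-(o[p] - o[n])) / norm
            grads[p] -= e  # raising a positive score lowers the error
            grads[n] += e  # raising a negative score increases the error
    return grads, norm

delta = ce_delta(1, 0.3)
grads, n_pairs = pwe_output_grads([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0])
print(delta, sigmoid(0.3) - 1)
print(grads, n_pairs)
```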
Plateaus To get an idea of how differently both objective functions behave as a function
of parameters to be optimized, let us draw graphs containing cost function values. Note
that it has been pointed out that the slope of the cost function as a function of the
parameters plays an important role in learning parameters of neural networks [27, 12].
[Figure 2: two surface plots of the cost function over $W^{(1)}$ and $W^{(2)}_1$, each axis ranging from −4 to 4. Panel (a) legend: CE w/ tanh, PWE w/ tanh. Panel (b) legend: CE w/ ReLU, CE w/ tanh.]
(a) Comparison of CE and PWE (b) Comparison of tanh and ReLU, both for CE
Fig. 2. Landscape of cost functions and a type of hidden units. $W^{(1)}$ represents a
weight connecting an input unit to a hidden unit. Likewise, $W^{(2)}_1$ denotes a weight
from the hidden unit to output unit 1. The z-axis stands for a value for the cost
function $J(W^{(1)}, W^{(2)}_1; x, y, W^{(2)}_2, W^{(2)}_3, W^{(2)}_4)$ where instances $x$, targets $y$ and weights
$W^{(2)}_2, W^{(2)}_3, W^{(2)}_4$ are fixed.
Consider two-layer neural networks consisting of $W^{(1)} \in \mathbb{R}$ for the first layer and
$W^{(2)} \in \mathbb{R}^{4 \times 1}$ for the second, that is, the output layer. Since we are interested in
function values with respect to two parameters $W^{(1)}$ and $W^{(2)}_1$ out of 5 parameters, $W^{(2)}_{\{2,3,4\}}$
is set to a fixed value $c$. In this paper we use $c = 0$.⁴ Figure 2 (a) shows different shapes
of the functions and slope steepness. In Figure 2 (a) both curves have similar shapes, but
the curve for PWE has plateaus in which gradient descent can be very slow in comparison
with the CE. Figure 2 (b) shows that CE with ReLUs, which are explained in the next
section, has a very steep slope compared to CE with tanh. Such a slope can accelerate
convergence in learning parameters using gradient descent. We conjecture that
these properties might explain why our set-up converges faster than the other
configurations, and why BP-MLL performs poorly in most cases in our experiments.
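A cost surface of this kind can be tabulated with a short script. The sketch below is not a reproduction of Figure 2: the input value, the target vector, the absence of bias terms, the sigmoid output units used for both cost functions, and the grid spacing are all assumptions, and the remaining output weights are fixed at a constant `c` (0.0 here).

```python
import math

def relu(v):
    return max(0.0, v)

def outputs(w1, w2_1, x=1.0, c=0.0, hidden=math.tanh):
    """Tiny network: one input, one hidden unit, four sigmoid outputs."""
    h = hidden(w1 * x)
    z = [w2_1 * h, c * h, c * h, c * h]  # output weights 2..4 are fixed to c
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def ce(o, y):
    """Cross entropy summed over labels."""
    return -sum(t * math.log(v) + (1 - t) * math.log(1.0 - v) for v, t in zip(o, y))

def pwe(o, y):
    """Pairwise error averaged over (positive, negative) label pairs."""
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    total = sum(math.exp(-(o[p] - o[n])) for p in pos for n in neg)
    return total / (len(pos) * len(neg))

y = [1, 0, 0, 0]                        # assumed targets
grid = [i * 0.5 for i in range(-8, 9)]  # -4.0 ... 4.0 on both axes

surface_ce_tanh = [[ce(outputs(a, b), y) for b in grid] for a in grid]
surface_pwe_tanh = [[pwe(outputs(a, b), y) for b in grid] for a in grid]
surface_ce_relu = [[ce(outputs(a, b, hidden=relu), y) for b in grid] for a in grid]

print(max(map(max, surface_ce_tanh)), max(map(max, surface_ce_relu)))
```

Each nested list can be passed to any plotting tool as a height map over the two free weights. Under these assumptions the ReLU surface reaches much larger cost values than the tanh surface on the same grid, because the hidden activation is not bounded above.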
In recent neural network and deep learning literature, a number of techniques were
proposed to overcome the difficulty of learning neural networks efficiently. In particular,
we make use of ReLUs, AdaGrad, and Dropout training, which are briefly discussed in
the following.
Rectified Linear Units Rectified linear units (ReLUs) have been proposed as activation
units on the hidden layer and shown to yield better generalization performance [22, 13,
31]. A ReLU disables negative activation ($\mathrm{ReLU}(x) = \max(0, x)$) so that the number
of parameters to be learned decreases during the training. This sparsity characteristic
makes ReLUs advantageous over the traditional activation units such as sigmoid and
tanh in terms of the generalization performance.
⁴ The shape of the functions is not changed even if we set $c$ to an arbitrary value since it is drawn
by function values in the z-axis with respect to only $W^{(1)}$ and $W^{(2)}_1$.
4 Experimental Setup
In the previous sections, we have shown why the structure of NNs needs to be
reconsidered. In this section, we describe evaluation measures to show how effectively
NNs perform on large-scale textual datasets when recent developments in learning neural
networks are combined with a univariate loss, which is consistent with respect to rank
loss.
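$$P_{macro} = \frac{1}{L} \sum_{l=1}^{L} \frac{tp_l}{tp_l + fp_l}, \quad R_{macro} = \frac{1}{L} \sum_{l=1}^{L} \frac{tp_l}{tp_l + fn_l}, \quad F_{1\text{-}macro} = \frac{1}{L} \sum_{l=1}^{L} \frac{2\,tp_l}{2\,tp_l + fp_l + fn_l}$$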
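The macro-averaged measures can be computed directly from per-label counts, as in the sketch below. The function name `macro_scores`, the toy counts, and the convention of scoring an undefined ratio (zero denominator) as 0 are assumptions of this example; the text does not say how such labels are handled.

```python
def macro_scores(counts):
    """counts: one (tp, fp, fn) tuple per label. Returns macro P, R and F1."""
    def ratio(num, den):
        # Assumption: a label with a zero denominator contributes 0.
        return num / den if den else 0.0

    n_labels = len(counts)
    p = sum(ratio(tp, tp + fp) for tp, fp, fn in counts) / n_labels
    r = sum(ratio(tp, tp + fn) for tp, fp, fn in counts) / n_labels
    f = sum(ratio(2 * tp, 2 * tp + fp + fn) for tp, fp, fn in counts) / n_labels
    return p, r, f

# Toy counts for three labels: a frequent one, a rare one, and one never seen.
counts = [(8, 2, 0), (1, 0, 3), (0, 0, 0)]
p_macro, r_macro, f_macro = macro_scores(counts)
print(p_macro, r_macro, f_macro)
```

Because every label contributes equally to the average, the rare and unseen labels pull the scores down even though the frequent label is predicted well.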
Datasets Our main interest is in large-scale text classification, for which we se-
lected six representative domains, whose characteristics are summarized in Table 1.
For Reuters21578, we used the same training/test split as previous works [30]. Training
and test data were switched for RCV1-v2 [18] which originally consists of 23,149 train
and 781,265 test documents. The EUR-Lex, Delicious and Bookmarks datasets were
taken from the MULAN repository.⁶ Except for Delicious and Bookmarks, all documents
are represented with tf-idf features with cosine normalization such that the length of
the document vector is 1 in order to account for the different document lengths.
In addition to these standard benchmark datasets, we prepared a large-scale dataset
from documents of the German Education Index (GEI).⁷ The GEI is a database of links
to more than 800,000 scientific articles with metadata, e.g. title, authorship, language
of an article and index terms. We consider a subset of the dataset consisting of
approximately 300,000 documents which have an abstract as well as the metadata. Each
document has multiple index terms which are carefully hand-labeled by human experts
with respect to the content of the articles. We processed plain text by removing
stopwords and stemming each token. To avoid the computational bottleneck from a large
number of labels, we chose the 1,000 most common labels out of about 50,000. We
then randomly split the dataset into 90% for training and 10% for test.
⁵ Note that scores computed by micro-averaged measures might be much higher than those by
macro-averaged measures if there are many rarely-occurring labels for which the classification
system does not perform well. This is because macro-averaging weighs each label equally,
whereas micro-averaged measures are dominated by the results of frequent labels.
⁶ http://mulan.sourceforge.net/datasets.html
⁷ http://www.dipf.de/en/portals/portals-educational-information/german-education-index
Table 1. Number of documents (M), size of vocabulary (D), total number of labels (L) and
average number of labels per instance (C) for the six datasets used in our study.
Dataset                        M      D     L      C
Reuters-21578              10789  18637    90   1.13
RCV1-v2                   804414  47236   103   3.24
EUR-Lex                    19348   5000  3993   5.31
Delicious                  16105    500   983  19.02
Bookmarks                  87856   2150   208   2.03
German Education Index    316061  20000  1000   7.16
Algorithms Our main goal is to compare our NN-based approach to BP-MLL. NNA
stands for a single hidden layer neural network which has ReLUs in its hidden layer
and which is trained with SGD where each parameter of the network has its
own learning rate using AdaGrad. NNAD additionally employs Dropout based on the
same settings as NNA. T and R following BP-MLL indicate tanh and ReLU as the transfer
function in the hidden layer. For both NN and BP-MLL, we used 1000 units in the
hidden layer over all datasets.⁸ As Dropout works well as a regularizer, no additional
regularization to prevent overfitting was incorporated. The base learning rate $\eta_0$ was
also determined among [0.001, 0.01, 0.1] using validation data.
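The per-parameter learning rates mentioned above can be illustrated with the usual AdaGrad update [6], in which the base learning rate is divided by the square root of the accumulated squared gradients of each parameter. The sketch below is a minimal version under assumptions: the function name `adagrad_step`, the small constant `eps` guarding against division by zero, and the toy gradients are not taken from the paper.

```python
import math

def adagrad_step(theta, grad, accum, eta0, eps=1e-8):
    """One AdaGrad update; accum holds the running sum of squared gradients."""
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_theta = [
        t - eta0 / (math.sqrt(a) + eps) * g  # per-parameter step size
        for t, g, a in zip(theta, grad, new_accum)
    ]
    return new_theta, new_accum

theta = [1.0, 1.0]
accum = [0.0, 0.0]
grad = [10.0, 0.1]  # toy gradients of very different magnitude

theta, accum = adagrad_step(theta, grad, accum, eta0=0.1)
print(theta, accum)
```

On the first update both parameters move by roughly the base learning rate even though their gradients differ by two orders of magnitude, which is the normalizing effect that gives each parameter its own effective step size.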
We also compared the NN-based algorithms to binary relevance (BR) using SVMs
(Liblinear) as a base learner, as a representative of the state-of-the-art. The penalty
parameter $C$ was optimized in the range of $[10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}]$ based on either
the average of micro- and macro-averaged F1 or rank loss on the validation set. BRB refers to
linear SVMs where $C$ is optimized with bipartition measures on the validation dataset.
BR models whose penalty parameter is optimized on ranking measures are indicated as
BRR. In addition, we apply the same thresholding technique which we utilize in our
NN approach (Section 3.3) on a ranked list produced by BR models (BRR). Given a
document, the distance of each predicted label to the hyperplane is used to determine
the position of the label in the ranked list.
5 Results
We evaluate our proposed models and other baseline systems on datasets with varying
statistics and characteristics. We first show experimental results that confirm that the
techniques discussed in Section 3.5 actually contribute to an increased performance of
NN-based multi-label classification, and then compare all algorithms on the six
above-mentioned datasets in order to get an overall impression of their performance.
⁸ The optimal number of hidden units of BP-MLL and NN was tested among 20, 50, 100, 500,
1000 and 4000 on validation datasets. Usually, the more units there are in the hidden layer, the
better the networks perform. We chose the number with computational efficiency in mind.
[Figure 3: rank loss (y-axis) against the number of parameter updates (x-axis, 1 to 10000). Left panel, Reuters-21578: ReLU, tanh and sigm hidden units, each with AdaGrad and with momentum. Right panel, EUR-Lex: with and without Dropout, for 1000 and 4000 hidden units.]
Fig. 3. (left) effects of AdaGrad and momentum on three types of transfer functions in the
hidden layers in terms of rank loss on Reuters-21578. The number of parameter updates on the
x-axis corresponds to the number of evaluations of Eq. (4). (right) effects of dropout with two
different numbers of hidden units in terms of rank loss on EUR-Lex.
Decorrelating Hidden Units While Output Units Remain Correlated One major
goal of multi-label learners is to minimize rank loss by leveraging inherent correlations
in a label space. However, we conjecture that these correlations also may cause overfit-
ting because if groups of hidden units specialize in predicting particular label subsets
that occur frequently in the training data, it will become harder to predict novel label
combinations that only occur in the test set. Dropout effectively fights this by randomly
dropping individual hidden units, so that it becomes harder for groups of hidden units to
specialize in the prediction of particular output combinations, i.e., it decorrelates the
hidden units, whereas the correlation of output units still remains. In particular, a subset
of output activations $o$ and hidden activations $h$ would be correlated through $W^{(2)}$.
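Random dropping of hidden units can be sketched as a mask applied to the hidden activations during training. The example below is an illustration under assumptions: it uses the "inverted" scaling variant, in which kept activations are divided by the keep probability so that nothing changes at test time, and a drop probability of 0.5. Neither choice is stated in the text, and the function name `dropout` is hypothetical.

```python
import random

def dropout(h, p_drop, rng, train=True):
    """Zero each hidden activation with probability p_drop during training."""
    if not train:
        return list(h)  # no units are dropped at test time
    keep = 1.0 - p_drop
    # Kept units are rescaled so the expected activation stays the same.
    return [v / keep if rng.random() < keep else 0.0 for v in h]

rng = random.Random(0)   # fixed seed so the run is repeatable
h = [1.0] * 1000         # toy hidden activations
h_train = dropout(h, 0.5, rng)
h_test = dropout(h, 0.5, rng, train=False)

print(sum(1 for v in h_train if v == 0.0), "units dropped out of", len(h))
```

Since a different random subset of hidden units is active for each training example, no fixed group of hidden units can rely on always being present together, while the output layer weights $W^{(2)}$ are still shared across all labels.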
9
However, unlike the results of [31], in our preliminary experiments adding more hidden layers
did not further improve generalization performance.
[Figure 4: two plots against the number of parameter updates (x-axis, log scale, 1 to 100000); the y-axis of the left plot is rank loss. Curves: PWE and CE, each with η = 0.01 and η = 0.1, with and without Dropout (D).]
Fig. 4. Rankloss (left) and mean average precision (right) on the German Education Index test
data for the different cost functions. η denotes the base learning rate and D indicates that Dropout
is applied. Note that x-axis is in log scale.
We observed overfitting across all datasets except for Reuters-21578 and RCV1-v2
under our experimental settings. The right part of Figure 3 shows how well Dropout pre-
vents NNs from overfitting on the test data of EUR-Lex. In particular, we can see that
with increasing numbers of parameter updates, the performance of regular NNs
eventually got worse in terms of rank loss. On the other hand, when dropout is employed,
convergence is initially slower, but overfitting is eventually prevented effectively.
Limiting Small Learning Rates in BP-MLL The learning rate strongly influences
convergence and learning speed [17]. As we have already seen in Figure 2, the
slope of PWE is less steep than that of CE, which implies that smaller learning rates should
be used. Specifically, we observed that PWE allows only the smaller learning rate 0.01 (blue
markers), in contrast with CE, which works well with the relatively larger learning rate 0.1 (red
markers) in Figure 4. In the case of PWE with the larger learning rate (green markers),
interestingly, dropout (rectangle markers in green) makes it converge towards much
better local minima, yet it is still worse than the other configurations. It seems that the
weights of BP-MLL oscillate in the vicinity of local minima and, indeed, converge to
worse local minima. However, the smaller learning rate makes the learning procedure of
BP-MLL slow compared to NNs with CE, which make bigger steps for parameter updates.
With respect to Dropout, Figure 4 also shows that for the same learning rates,
networks without Dropout converge much faster than ones working with Dropout in terms
of both rank loss and MAP. Regardless of the cost function, overfitting arises in
the networks without Dropout, whereas it is likely that Dropout avoids overfitting
effectively, as discussed earlier.
                 Ranking                              Bipartition
Eval. measures   rankloss  oneError  Coverage  MAP    miP   miR   miF   maP   maR   maF
Average Ranks
NNA              2.2       2.4       2.6       2.2    2.0   6.0   2.4   1.8   5.6   2.0
NNAD             1.2       1.4       1.2       1.6    2.0   5.8   1.8   2.0   5.6   2.2
BP-MLLTA         5.2       7.2       6.0       6.4    7.0   3.2   7.0   6.2   2.0   5.6
BP-MLLTAD        4.1       6.0       4.4       5.9    7.4   2.8   7.4   7.2   3.2   7.0
BP-MLLRA         5.9       6.7       5.6       6.4    5.2   3.2   4.6   5.6   3.8   4.8
BP-MLLRAD        4.0       6.0       3.6       5.6    5.6   3.6   5.4   5.4   4.4   5.8
BRB              7.4       3.3       6.9       4.3    3.2   6.8   4.6   4.4   6.8   5.6
BRR              6.0       3.0       5.7       3.6    3.6   4.6   2.8   3.4   4.6   3.0
6 Conclusion
Table 3. Results on ranking and bipartition measures. Results for BP-MLL on EUR-Lex are
missing because the runs could not be completed in a reasonably short time.
Ranking Bipartition
Eval. measures
rankloss oneError Coverage MAP miP miR miF maP maR maF
Reuters-21578
NNA 0.0037 0.0706 0.7473 0.9484 0.8986 0.8357 0.8660 0.6439 0.4424 0.4996
NNAD 0.0031 0.0689 0.6611 0.9499 0.9042 0.8344 0.8679 0.6150 0.4420 0.4956
BP-MLLTA 0.0039 0.0868 0.8238 0.9400 0.7876 0.8616 0.8230 0.5609 0.4761 0.4939
BP-MLLTAD 0.0039 0.0808 0.8119 0.9434 0.7945 0.8654 0.8284 0.5459 0.4685 0.4831
BP-MLLRA 0.0039 0.0808 1.0987 0.9431 0.8205 0.8582 0.8389 0.5303 0.4364 0.4624
BP-MLLRAD 0.0063 0.0719 1.2037 0.9476 0.8421 0.8416 0.8418 0.5510 0.4292 0.4629
BRB 0.0040 0.0613 0.8092 0.9550 0.9300 0.8096 0.8656 0.6050 0.3806 0.4455
BRR 0.0040 0.0613 0.8092 0.9550 0.8982 0.8603 0.8789 0.6396 0.4744 0.5213
RCV1-v2
NNA 0.0040 0.0218 3.1564 0.9491 0.9017 0.7836 0.8385 0.7671 0.5760 0.6457
NNAD 0.0038 0.0212 3.1108 0.9500 0.9075 0.7813 0.8397 0.7842 0.5626 0.6404
BP-MLLTA 0.0058 0.0349 3.7570 0.9373 0.6685 0.7695 0.7154 0.4385 0.5803 0.4855
BP-MLLTAD 0.0057 0.0332 3.6917 0.9375 0.6347 0.7497 0.6874 0.3961 0.5676 0.4483
BP-MLLRA 0.0058 0.0393 3.6730 0.9330 0.7712 0.8074 0.7889 0.5741 0.6007 0.5823
BP-MLLRAD 0.0056 0.0378 3.6032 0.9345 0.7612 0.8016 0.7809 0.5755 0.5748 0.5694
BRB 0.0061 0.0301 3.8073 0.9375 0.8857 0.8232 0.8533 0.7654 0.6342 0.6842
BRR 0.0051 0.0287 3.4998 0.9420 0.8156 0.8822 0.8476 0.6961 0.7112 0.6923
EUR-Lex
NNA 0.0195 0.2016 310.6202 0.5975 0.6346 0.4722 0.5415 0.3847 0.3115 0.3256
NNAD 0.0164 0.1681 269.4534 0.6433 0.7124 0.4823 0.5752 0.4470 0.3427 0.3687
BRB 0.0642 0.1918 976.2550 0.6114 0.6124 0.4945 0.5471 0.4260 0.3643 0.3752
BRR 0.0204 0.2088 334.6172 0.5922 0.0329 0.5134 0.0619 0.2323 0.3063 0.2331
German Education Index
NNA 0.0350 0.2968 138.5423 0.4828 0.4499 0.4200 0.4345 0.4110 0.3132 0.3427
NNAD 0.0352 0.2963 138.3590 0.4797 0.4155 0.4472 0.4308 0.3822 0.3216 0.3305
BP-MLLTA 0.0386 0.8309 150.8065 0.3432 0.1502 0.6758 0.2458 0.1507 0.5562 0.2229
BP-MLLTAD 0.0371 0.7591 139.1062 0.3281 0.1192 0.5056 0.1930 0.1079 0.4276 0.1632
BP-MLLRA 0.0369 0.4221 143.4541 0.4133 0.2618 0.4909 0.3415 0.3032 0.3425 0.2878
BP-MLLRAD 0.0353 0.4522 135.1398 0.3953 0.2400 0.5026 0.3248 0.2793 0.3520 0.2767
BRB 0.0572 0.3052 221.0968 0.4533 0.5141 0.2318 0.3195 0.3913 0.1716 0.2319
BRR 0.0434 0.3021 176.6349 0.4755 0.4421 0.3997 0.4199 0.4361 0.2706 0.3097
Delicious
NNA 0.0860 0.3149 396.4659 0.4015 0.3637 0.4099 0.3854 0.2488 0.1721 0.1772
NNAD 0.0836 0.3127 389.9422 0.4075 0.3617 0.4399 0.3970 0.2821 0.1777 0.1824
BP-MLLTA 0.0953 0.4967 434.8601 0.3288 0.1829 0.5857 0.2787 0.1220 0.2728 0.1572
BP-MLLTAD 0.0898 0.4358 418.3618 0.3359 0.1874 0.5884 0.2806 0.1315 0.2427 0.1518
BP-MLLRA 0.0964 0.6157 427.0468 0.2793 0.2070 0.5894 0.3064 0.1479 0.2609 0.1699
BP-MLLRAD 0.0894 0.6060 411.5633 0.2854 0.2113 0.5495 0.3052 0.1650 0.2245 0.1567
BRB 0.1184 0.4355 496.7444 0.3371 0.1752 0.2692 0.2123 0.0749 0.1336 0.0901
BRR 0.1184 0.4358 496.8180 0.3371 0.2559 0.3561 0.2978 0.1000 0.1485 0.1152
Bookmarks
NNA 0.0663 0.4924 22.1183 0.5323 0.3919 0.3907 0.3913 0.3564 0.3069 0.3149
NNAD 0.0629 0.4828 20.9938 0.5423 0.3929 0.3996 0.3962 0.3664 0.3149 0.3222
BP-MLLTA 0.0684 0.5598 23.0362 0.4922 0.0943 0.5682 0.1617 0.1115 0.4743 0.1677
BP-MLLTAD 0.0647 0.5574 21.7949 0.4911 0.0775 0.6096 0.1375 0.0874 0.5144 0.1414
BP-MLLRA 0.0707 0.5428 23.6088 0.5049 0.1153 0.5389 0.1899 0.1235 0.4373 0.1808
BP-MLLRAD 0.0638 0.5322 21.5108 0.5131 0.0938 0.5779 0.1615 0.1061 0.4785 0.1631
BRB 0.0913 0.5318 29.6537 0.4868 0.2821 0.2546 0.2676 0.1950 0.1880 0.1877
BRR 0.0895 0.5305 28.7233 0.4889 0.2525 0.4049 0.3110 0.2259 0.3126 0.2569
method for the multi-label text classification task. Also, we have conducted extensive
analysis to characterize the effectiveness of combining ReLUs with AdaGrad for fast
convergence rate, and utilizing Dropout to prevent overfitting which results in better
generalization.
Acknowledgments The authors would like to thank the anonymous reviewers for their valu-
able comments. This work has been supported by the German Institute for Educational Research
(DIPF) under the Knowledge Discovery in Scientific Literature (KDSL) program.
References
[1] Bi, W., Kwok, J.T.: Multi-label classification on tree-and dag-structured hierar-
chies. In: Proceedings of the 28th International Conference on Machine Learning.
pp. 17–24 (2011)
[2] Calauzènes, C., Usunier, N., Gallinari, P.: On the (non-)existence of convex, cali-
brated surrogate losses for ranking. In: Advances in Neural Information Process-
ing Systems 25. pp. 197–205 (2012)
[3] Dembczyński, K., Cheng, W., Hüllermeier, E.: Bayes optimal multilabel classifi-
cation via probabilistic classifier chains. In: Proceedings of the 27th International
Conference on Machine Learning. pp. 279–286 (2010)
[4] Dembczyński, K., Kotłowski, W., Hüllermeier, E.: Consistent multilabel ranking
through univariate losses. In: Proceedings of the 29th International Conference on
Machine Learning. pp. 1319–1326 (2012)
[5] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The Jour-
nal of Machine Learning Research 7, 1–30 (2006)
[6] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research 12, 2121–
2159 (2011)
[7] Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In:
Advances in Neural Information Processing Systems 14. pp. 681–687 (2001)
[8] Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A li-
brary for large linear classification. Journal of Machine Learning Research 9,
1871–1874 (2008)
[9] Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multilabel classifi-
cation via calibrated label ranking. Machine Learning 73(2), 133–153 (Jun 2008)
[10] Gao, W., Zhou, Z.H.: On the consistency of multi-label learning. Artificial Intel-
ligence 199–200, 22–44 (2013)
[11] Ghamrawi, N., McCallum, A.: Collective multi-label classification. In: Proceed-
ings of the 14th ACM International Conference on Information and Knowledge
Management. pp. 195–200 (2005)
[12] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In: Proceedings of the 13th International Conference on Artificial
Intelligence and Statistics, JMLR W&CP. pp. 249–256 (2010)
[13] Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In:
Proceedings of the 14th International Conference on Artificial Intelligence and
Statistics, JMLR W&CP. pp. 315–323 (2011)
[14] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
Improving neural networks by preventing co-adaptation of feature detectors. arXiv
preprint arXiv:1207.0580 (2012)
[15] Joachims, T.: Text categorization with support vector machines: Learning with
many relevant features. In: Proceedings of the 10th European Conference on Ma-
chine Learning (ECML-98). pp. 137–142 (1998)
[16] Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the 12th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining.
pp. 217–226 (2006)
[17] LeCun, Y., Bottou, L., Orr, G.B., Müller, K.R.: Efficient backprop. In: Neural
Networks: Tricks of the Trade, pp. 9–48 (2012)
[18] Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for
text categorization research. Journal of Machine Learning Research 5, 361–397
(2004)
[19] Liu, T.Y., Yang, Y., Wan, H., Zeng, H.J., Chen, Z., Ma, W.Y.: Support vector ma-
chines classification with a very large-scale taxonomy. SIGKDD Explorations
7(1), 36–43 (2005)
[20] Loza Mencía, E., Park, S.H., Fürnkranz, J.: Efficient voting prediction for pairwise
multilabel classification. Neurocomputing 73(7-9), 1164–1176 (2010)
[21] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press (2008)
[22] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines.
In: Proceedings of the 27th International Conference on Machine Learning.
pp. 807–814 (2010)
[23] Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label
classification. Machine Learning 85(3), 333–359 (2011)
[24] Rubin, T.N., Chambers, A., Smyth, P., Steyvers, M.: Statistical topic models for
multi-label document classification. Machine Learning 88(1-2), 157–208 (2012)
[25] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Nature 323(6088), 533–536 (1986)
[26] Schapire, R.E., Singer, Y.: BoosTexter: A boosting-based system for text catego-
rization. Machine Learning 39(2/3), 135–168 (2000)
[27] Solla, S.A., Levin, E., Fleisher, M.: Accelerated learning in layered neural net-
works. Complex Systems 2(6), 625–640 (1988)
[28] Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I.: Multi-label classification
of music into emotions. In: Proceedings of the 9th International Conference on
Music Information Retrieval. pp. 325–330 (2008)
[29] Tsoumakas, G., Katakis, I., Vlahavas, I.P.: Random k-labelsets for multilabel clas-
sification. IEEE Transactions on Knowledge and Data Engineering 23(7), 1079–
1089 (2011)
[30] Yang, Y., Gopal, S.: Multilabel classification with meta-level features in a
learning-to-rank framework. Machine Learning 88(1-2), 47–68 (2012)
[31] Zeiler, M.D., Ranzato, M., Monga, R., Mao, M.Z., Yang, K., Le, Q.V., Nguyen,
P., Senior, A., Vanhoucke, V., Dean, J., Hinton, G.E.: On rectified linear units for
speech processing. In: Acoustics, Speech and Signal Processing (ICASSP), 2013
IEEE International Conference on. pp. 3517–3521 (2013)
[32] Zhang, M.L., Zhou, Z.H.: Multilabel neural networks with applications to func-
tional genomics and text categorization. IEEE Transactions on Knowledge and
Data Engineering 18, 1338–1351 (2006)