ABSTRACT

Extreme multi-label text classification (XMTC) refers to the problem of assigning to each document its most relevant subset of class labels from an extremely large label collection, where the number of labels could reach hundreds of thousands or millions. The huge label space raises research challenges such as data sparsity and scalability. Significant progress has been made in recent years by the development of new machine learning methods, such as tree induction with large-margin partitions of the instance spaces and label-vector embedding in the target space. However, deep learning has not been explored for XMTC, despite its big successes in other related areas. This paper presents the first attempt at applying deep learning to XMTC, with a family of new Convolutional Neural Network (CNN) models which are tailored for multi-label classification in particular. With a comparative evaluation of 7 state-of-the-art methods on 6 benchmark datasets where the number of labels is up to 670,000, we show that the proposed CNN approach successfully scaled to the largest datasets and consistently produced the best or the second best results on all the datasets. On the Wikipedia dataset with over 2 million documents and 500,000 labels in particular, it outperformed the second best method by 11.7%∼15.3% in precision@K and by 11.5%∼11.7% in NDCG@K for K = 1, 3, 5.

1 INTRODUCTION

Extreme multi-label text classification (XMTC), the problem of finding for each document its most relevant subset of labels from an extremely large space of categories, becomes increasingly important due to the fast growth of Internet content and the urgent need for organizational views of big data. For example, Wikipedia has over a million curator-generated category labels, and an article often has more than one relevant label: the web page of "potato" would be tagged with the class labels of "Solanum", "Root vegetables", "Crops originating from South America", etc. The classification system for Amazon shopping items, as another example, uses a large hierarchy of over one million categories for the organization of shopping items, and each item typically belongs to more than one relevant category. Solving such multi-label classification problems at an extremely large scale presents open challenges for machine learning research.

Multi-label classification is fundamentally different from the traditional binary or multi-class classification problems which have been intensively studied in the machine learning literature. Binary classifiers treat class labels as independent target variables, which is clearly sub-optimal for multi-label classification as the dependencies among class labels cannot be leveraged. Multi-class classifiers rely on the mutually exclusive assumption about class labels (i.e., one document should have one and only one class label), which is wrong in multi-label settings. Addressing the limitations of those traditional classification methods by explicitly modeling the dependencies or correlations among class labels has been the major focus of multi-label classification research [7, 11, 13, 15, 42, 48]; however, scalable solutions for problems with hundreds of thousands or even millions of labels have become available only in the past few years [5, 36]. Part of the difficulty in solving XMTC problems is due to the extremely severe data sparsity issue. XMTC datasets typically exhibit a power-law distribution of labels, which means a substantial proportion of the labels have very few training instances associated with them. It is, therefore, difficult to learn the dependency patterns among labels reliably. Another significant challenge in XMTC is that the computational costs in both training and testing of mutually independent classifiers would be practically prohibitive when the number of labels reaches hundreds of thousands or even millions.

Significant progress has been made in XMTC recently. Several approaches have been proposed to deal with the huge label space and to address the scalability and data sparsity issues. Most of these approaches fall into two categories: target-embedding methods and tree-based ensemble methods. Let us briefly outline below the key ideas of these two categories and discuss a third category in addition, i.e., the deep learning methods which have made a significant impact in multi-class classification problems but have not been explored in XMTC. Comparing the accomplishments and limitations of these three categories of work leads to the design of our proposed work in this paper and its aimed contributions.

1.1 Target-Embedding Methods
Target-embedding methods aim to address the data sparsity issue in training XMTC classifiers by finding a set of low-dimensional embeddings of the label vectors for training instances in the target space. Suppose the training data is given as $n$ pairs of feature vectors and label vectors $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^D$ and $y_i \in \{0, 1\}^L$, $D$ is the number of features and $L$ is the number of labels.
Notice that $L$ can be extremely large in XMTC, which means that learning a reliable mapping from an arbitrary $x$ to the relevant $y$ is difficult from limited training data. Instead, if we can effectively compress the label vectors from $L$ dimensions to $\hat{L}$ dimensions via a linear or nonlinear projection, then we can use standard classifiers (such as Support Vector Machines) to efficiently learn a reliable mapping from the feature space to the compressed label space. For classifying a new document we also need to project the predicted embedding of $y$ back to the original high-dimensional space. The projection of target label vectors to their low-dimensional embeddings is called the compression process, and the projection back to the high-dimensional space is called the decompression process. Many variants of the target-embedding methods have been proposed [3, 6, 9, 10, 14, 19, 20, 24, 40, 46, 50]. These methods mainly differ in their choices of compression and decompression techniques [36], such as compressed sensing [19, 24], Bloom filters [10], Singular Value Decomposition [39], landmark labels [3], output codes [50], etc. Among those methods, SLEEC [5] is considered representative as it outperformed competing methods on several benchmark datasets. We will provide a more precise description of this method in Section 2, as it is a strong baseline in our comparative evaluation (Section 4) in this paper.

1.2 Tree-based Ensemble Methods
Another category of efforts that has improved the state of the art of XMTC in recent years is the family of new tree-based ensemble methods [1, 36, 41]. Similar to classical decision-tree learning, the new methods induce a tree structure which recursively partitions the instance space or sub-spaces at each non-leaf node, and has a base classifier at each leaf node which only focuses on a few active labels in that node. Different from traditional decision trees, on the other hand, the new methods learn a hyperplane (equivalent to using a weighted combination of all features) to split the current instance space at each node, instead of selecting a single feature based on information gain (as in classical decision trees) for the splitting. The hyperplane-based induction is much less greedy than the single-feature-based induction of decision trees, and hence potentially more robust for extreme classification with a vast feature space. Another advantage of the tree-based methods is that the prediction time complexity is typically sublinear in the training-set size, and would be logarithmic (in the best case) if the induced tree is balanced. To enhance the robustness of predictions, most tree-based methods learn an ensemble of trees, each of which is induced based on a randomly selected subset of features at each level of the tree. The top-performing method in this category is FastXML [36], for which we will provide a detailed description in Section 2, as it is a strong baseline in our comparative evaluation (Section 4).

1.3 Deep Learning for Text Classification
It should be noted that all the aforementioned methods are based on bag-of-word representations of documents. That is, words are treated as independent features out of context, which is a fundamental limitation of those methods. How to overcome such a limitation of existing XMTC methods is an open question that has not been studied in sufficient depth. Deep learning models, on the other hand, have achieved great successes recently in other related domains by automatically extracting context-sensitive features from raw text. Those areas include various tasks in natural language understanding [37], language modeling [33], machine translation [38], and more. In multi-class text classification in particular, which is closely related to multi-label classification but restricts each document to having only one label, deep learning approaches have recently outperformed linear predictors (e.g., linear SVM) with bag-of-word-based features as input, and have become the new state of the art. The strong deep learning models in multi-class text classification include the convolutional neural network by [25] (CNN), the recurrent neural network by [27] (RNN), the combination of CNN and RNN by [49], the CNN with attention mechanism by [2, 43] and the Bow-CNN model by [21, 22]. Although some of those deep learning models were also evaluated on multi-label classification datasets [21], those methods are designed for multi-class settings and do not take multi-label settings into account in model optimization. We will provide more details about the CNN-Kim [25] and Bow-CNN [21] models in Section 2 as the representative deep learning models (which are applied to, but not tailored for, XMTC) in our comparative evaluation (Section 4).

The great successes of deep learning in multi-class classification and other related areas raise an important question for XMTC research, i.e., can we use deep learning to advance the state of the art in XMTC? Specifically, how can we make deep learning both effective and scalable when both the feature space and the label space are extremely large? Existing work in XMTC has not offered the answer; only limited efforts have been reported in this direction, and the solutions have not scaled to extremely large problems [26, 34, 44, 47].

1.4 Our New Contributions
Our primary goal in this paper is to use deep learning to enhance the state of the art of XMTC. We accomplish this goal with the following contributions:
• We re-examine the state of the art of XMTC by conducting a comparative evaluation of 7 methods which are most representative of the target-embedding, tree-based ensembling and deep learning approaches to XMTC, on 6 benchmark datasets where the label space sizes are up to 670,000.
• We propose a new deep learning method, namely XML-CNN, which combines the strengths of existing CNN models and goes beyond them by taking multi-label co-occurrence patterns into account in both the optimization objective and the design of the neural network architecture, and which scales successfully to the largest XMTC benchmark datasets.
• Our extensive experiments show that XML-CNN consistently produces the best or the second best results among all the competing methods on all of the 6 benchmark datasets. On the Wikipedia dataset with over 2 million documents and 500,000 labels in particular, our proposed method outperforms the second best method by 11.7%∼15.3% in precision@K and by 11.5%∼11.7% in NDCG@K for K = 1, 3, 5.

The rest of the paper is organized as follows. Section 2 outlines six existing competitive methods which will be used (together with our XML-CNN) in our comparative evaluation. Section 3 describes the new XML-CNN method. Section 4 reports our extensive experiments and results, followed by our conclusion in Section 5.
2 EXISTING COMPETITIVE METHODS
We outline six methods, including the most representative methods in XMTC and some successful deep learning methods which are designed for multi-class text classification but are also applicable to XMTC with minor adaptations. Later, in Section 4, we will compare those methods empirically against our proposed XML-CNN method (Section 3).

2.1 SLEEC
SLEEC [5] is the most representative of the target-embedding methods in XMTC. It consists of two steps: learning embeddings, and kNN classification. SLEEC learns $\hat{L}$-dimensional embeddings $z_i \in \mathbb{R}^{\hat{L}}$ for the original $L$-dimensional label vectors $y_i \in \{0,1\}^L$ that non-linearly capture label correlations by preserving pairwise distances between only the closest label vectors, i.e., $d(z_i, z_j) \approx d(y_i, y_j)$ only if $i$ is among $j$'s $k$ nearest neighbors under some distance metric $d(\cdot,\cdot)$. Then regressors $V \in \mathbb{R}^{\hat{L} \times D}$ are learned such that $V x_i \approx z_i$, with $\ell_1$ regularization on $V x_i$, which results in sparse solutions.

At prediction time, for a novel document $x^* \in \mathbb{R}^D$, SLEEC performs a kNN search for its projection $z^* = V x^*$ in the $\hat{L}$-dimensional embedding space. To speed up the kNN search, SLEEC groups the training data into many clusters and learns embeddings for each separate cluster; the kNN search is then performed only within the cluster the novel document belongs to. Since clustering high-dimensional data is usually unstable, an ensemble of SLEEC models is induced with different clusterings to boost prediction accuracy.
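To make the two-step pipeline concrete, below is a simplified Python sketch of a SLEEC-style predictor. It is not the authors' implementation: a truncated SVD stands in for SLEEC's learned local embeddings, ridge regression replaces the $\ell_1$-regularized regressors, and the clustering ensemble is omitted; all names and hyper-parameters are illustrative.

```python
# Simplified SLEEC-style pipeline (illustrative stand-ins, not the released SLEEC code).
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge
from sklearn.neighbors import NearestNeighbors

def fit_sleec_like(X, Y, emb_dim=100, n_neighbors=10):
    # 1) Compress label vectors y_i in {0,1}^L to embeddings z_i in R^{L_hat}.
    svd = TruncatedSVD(n_components=emb_dim)
    Z = svd.fit_transform(Y)                      # n x L_hat label embeddings
    # 2) Learn regressors V such that V x_i is close to z_i.
    reg = Ridge(alpha=1.0).fit(X, Z)
    # Index the training embeddings for kNN decompression.
    knn = NearestNeighbors(n_neighbors=n_neighbors).fit(Z)
    return reg, knn

def predict_sleec_like(x_new, reg, knn, Y, top_k=5):
    z_star = reg.predict(x_new.reshape(1, -1))    # project the new document
    _, idx = knn.kneighbors(z_star)               # nearest training embeddings
    scores = Y[idx[0]].mean(axis=0)               # aggregate the neighbors' label vectors
    return np.argsort(-scores)[:top_k]            # top-k label indices
```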
2.2 FastXML
FastXML [36] is considered the state-of-the-art tree-based method for XMTC. It learns a hierarchy over the training instances and optimizes an NDCG-based objective at each node of the hierarchy. Specifically, a hyperplane parameterized by $w \in \mathbb{R}^D$ is induced at each node, which splits the set of documents in the current node into two subsets; the rankings of the labels in the two subsets are jointly learned. The key idea is to have the documents in each subset share a similar label distribution, and to characterize that distribution using a set-specific ranked list of labels. This is achieved by jointly maximizing the NDCG scores of the ranked label lists in the two sibling subsets. In practice, an ensemble of multiple induced trees is learned to improve the robustness of predictions.

At prediction time, each test document is passed from the root to a leaf node in each induced tree, and the label distributions in all the reached leaves are aggregated for the test document. Suppose $T$ trees are induced, $H$ is the average height of the trees, and $\hat{L}$ is the average number of labels per leaf node. The prediction cost is approximately $O(TDH + T\hat{L} + \hat{L}\log\hat{L})$, which is dominated by $O(TDH)$ when $\hat{L}$ is small. If the trees are nearly balanced, then $H = \log N \approx \log L$, and the prediction cost is approximately $O(TD\log L)$, which is logarithmic in the number of labels.
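The prediction path just described can be sketched as follows; the `Node` class, leaf score dictionaries and traversal rule are hypothetical illustrations of a hyperplane-split tree ensemble, not the released FastXML code.

```python
# Illustrative tree-ensemble prediction: non-leaf nodes hold a learned hyperplane w,
# leaves hold a ranked label distribution, and the ensemble averages the reached leaves.
import numpy as np

class Node:
    def __init__(self, w=None, left=None, right=None, leaf_label_scores=None):
        self.w = w                                   # hyperplane for the split (non-leaf)
        self.left, self.right = left, right
        self.leaf_label_scores = leaf_label_scores   # dict {label: score} at a leaf

def predict_tree(node, x):
    while node.leaf_label_scores is None:            # descend until a leaf is reached
        node = node.left if x @ node.w <= 0 else node.right
    return node.leaf_label_scores

def predict_ensemble(trees, x, top_k=5):
    scores = {}
    for root in trees:                               # roughly O(T * D * H) overall
        for label, s in predict_tree(root, x).items():
            scores[label] = scores.get(label, 0.0) + s / len(trees)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```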
2.3 FastText
FastText [23] is a simple yet effective deep learning method for multi-class text classification. A document representation is constructed by averaging the embeddings of the words that appear in the document, upon which a softmax layer is applied to map the document representation to class labels. This approach was inspired by recent work on efficient word representation learning, such as skip-gram and CBOW [32]. It ignores word order in the construction of document representations and uses a linear softmax classifier. This simplicity makes FastText very efficient to train, yet it achieves state-of-the-art performance on several multi-class classification benchmarks and is often several orders of magnitude faster than other competing methods [23]. However, simply averaging input word embeddings with such a shallow architecture for document-to-label mapping might limit its success in XMTC, since in XMTC document representations need to capture much richer information in order to successfully predict multiple correlated labels and discriminate them from enormous numbers of irrelevant labels.
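A minimal PyTorch-style sketch of this averaged-embedding-plus-linear-softmax design is given below; the class and parameter names are illustrative and do not correspond to the official fastText library.

```python
# Averaged word embeddings followed by a linear softmax classifier (FastText-style).
import torch
import torch.nn as nn

class FastTextLike(nn.Module):
    def __init__(self, vocab_size, emb_dim, num_classes):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, emb_dim, mode="mean")  # average embeddings
        self.fc = nn.Linear(emb_dim, num_classes)                       # linear classifier

    def forward(self, token_ids, offsets):
        doc_vec = self.embed(token_ids, offsets)   # one averaged vector per document
        return self.fc(doc_vec)                    # class logits (softmax applied in the loss)

# Usage: logits = model(flat_token_ids, doc_offsets)
#        loss = nn.CrossEntropyLoss()(logits, class_labels)
```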
117
Session 1C: Document Representation and Content Analysis 1 SIGIR’17, August 7-11, 2017, Shinjuku, Tokyo, Japan
2.4 CNN-Kim
CNN-Kim [25] is one of the first attempts at applying convolutional neural networks to text classification. CNN-Kim constructs a document vector by concatenating its word embeddings; $t$ filters are then applied to this concatenated vector in the convolution layer to produce $t$ feature maps, which are in turn fed to a max-over-time pooling layer to construct a $t$-dimensional document representation. This is followed by a fully connected layer with $L$ softmax outputs, corresponding to the $L$ labels. In practice CNN-Kim has shown excellent performance in multi-class text classification, and it is a strong baseline in our comparative evaluation.

2.5 Bow-CNN
Bow-CNN [21] (Bag-of-word CNN) is another strong method in multi-class classification. It represents each small text region (several consecutive words) using a bag-of-word indicator vector (called the one-hot vector). Denoting by $D$ the size of the feature space (the vocabulary), a $D$-dimensional binary vector is constructed for each region, where the $i$-th entry is 1 iff the $i$-th word in the vocabulary appears in that text region. The embeddings of all regions are passed through a convolutional layer, followed by a special dynamic pooling layer that aggregates the embedded regions into a document representation, which is then fed to a softmax output layer.

2.6 PD-Sparse
PD-Sparse [45] is a recently proposed max-margin method designed for extreme multi-label classification. It does not fall into the three categories in Section 1 (target-embedding methods, tree-based methods, and deep learning methods). In PD-Sparse, a linear classifier is learned for each label with $\ell_1$ and $\ell_2$ penalties on the weight matrix associated with that label. This results in a solution that is extremely sparse in both the primal and dual spaces, which is desirable in terms of both time and memory efficiency in XMTC. PD-Sparse proposes a Fully-Corrective Block-Coordinate Frank-Wolfe training algorithm that exploits sparsity in the solution and achieves sub-linear training time w.r.t. the number of primal and dual variables, but prediction time is still linear w.r.t. the number of labels. [45] has shown that PD-Sparse outperforms one-vs-all SVM and logistic regression on multi-label classification, with significantly reduced training time and model size.

3 PROPOSED METHOD
Our model architecture (XML-CNN), shown in Figure 1, is based on CNN-Kim [25], the CNN model for multi-class text classification described in Section 2.4.

Similar to other CNNs [21, 25], our model learns a rich set of feature representations by passing the document through various convolutional filters. The key attributes of our model lie in the layers that follow. Specifically, our model adopts a dynamic max pooling scheme that captures more fine-grained features from different regions of the document. We furthermore utilize a binary cross-entropy loss over sigmoid outputs, which is better tailored to XMTC. An additional hidden bottleneck layer is inserted between the pooling and output layers to learn compact document representations, which reduces the model size as well as boosts model performance. In the following we describe our model in detail.

Let $e_i \in \mathbb{R}^k$ be the $k$-dimensional word embedding corresponding to the $i$-th word in the current document, $i = 1, \ldots, m$. The whole document is represented by the concatenation of its word embeddings $e_{1:m} = [e_1, \ldots, e_m] \in \mathbb{R}^{km}$. In general, the text region from the $i$-th word to the $j$-th word is represented by $e_{i:j} = [e_i, \ldots, e_j] \in \mathbb{R}^{k(j-i+1)}$. A convolution filter $v \in \mathbb{R}^{kh}$ is applied to a text region of $h$ words $e_{i:i+h-1}$ to produce a new feature:
$$c_i = g_c(v^T e_{i:i+h-1})$$
where $g_c$ is the nonlinear activation function for this convolution layer, such as sigmoid or ReLU. We omit all bias terms for simplicity. All $c_i$'s together form a feature map $c = [c_1, \ldots, c_m] \in \mathbb{R}^m$ associated with the applied filter $v$. Here we pad the end of the document to obtain $m$ features if the document is short. Multiple filters with different window sizes are used in the convolution layer to capture rich semantic information. Suppose $t$ filters are used and the $t$ resulting feature maps are $c^{(1)}, \ldots, c^{(t)}$. Then a pooling operation $P(\cdot)$ is applied to each of these $t$ feature maps to produce $t$ $p$-dimensional vectors $P(c^{(i)}) \in \mathbb{R}^p$. We will discuss the choice of $P(\cdot)$ in Section 3.1. The output of the pooling layer is followed by a fully connected bottleneck layer with $h$ hidden units and then an output layer with $L$ units corresponding to the scores assigned to each label, denoted by $f \in \mathbb{R}^L$:
$$f = W_o\, g_h(W_h [P(c^{(1)}), \ldots, P(c^{(t)})]) \qquad (1)$$
Here $W_h \in \mathbb{R}^{h \times tp}$ and $W_o \in \mathbb{R}^{L \times h}$ are the weight matrices associated with the bottleneck layer and the output layer; $g_h$ is the element-wise activation function applied to the bottleneck layer. The key attributes that make our model especially suited for XMTC are the pooling operation, the loss function, and the hidden bottleneck layer between the pooling and output layers. We will verify that each of these three components contributes to performance improvement in XMTC through an ablation test in Section 4.4. In the remainder of this section, we introduce these three key attributes of our model.

3.1 Dynamic Max Pooling
In previous CNN models for text classification, including CNN-Kim, a max-over-time [12] pooling scheme is usually adopted. This simply means taking the maximum element of a feature map: $P(c) = \hat{c} = \max\{c\}$. The idea is to capture the most important feature, i.e. the entry with the largest value, in each feature map. Using max-over-time pooling, each filter generates a single feature, so the output of the pooling layer is $[P(c^{(1)}), \ldots, P(c^{(t)})] = [\hat{c}^{(1)}, \ldots, \hat{c}^{(t)}] \in \mathbb{R}^t$. However, one drawback of max-over-time pooling is that, for each filter, only one value, i.e. the largest value in the feature map, is chosen to carry information to subsequent layers, which is especially problematic when the document is long. Moreover, this pooling scheme does not capture any information about the position of the largest value.

In our model, we adopt a dynamic max pooling scheme, similar to [8, 21]. Instead of generating only one feature per filter, $p$ features are generated to capture richer information. For a document with $m$ words, we evenly divide its $m$-dimensional feature map into $p$ chunks; each chunk is pooled to a single feature by taking the largest value within that chunk, so that information about different parts of the document can be received by the top layers. Under this pooling scheme, each filter produces a $p$-dimensional feature (assuming $m$ is divisible by $p$):
$$P(c) = \left[\max\{c_{1:\frac{m}{p}}\}, \ldots, \max\{c_{m-\frac{m}{p}+1:m}\}\right] \in \mathbb{R}^p$$
which captures both important features and position information about these important features.

3.2 Loss Function
The most straightforward adaptation from multi-class classification problems to multi-label ones would be to extend the traditional cross-entropy loss. Specifically, [16–18, 25] consider the extended cross-entropy loss function
$$\min_{\Theta}\; -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{L} y_{ij}\log(\hat{p}_{ij}) \;=\; -\frac{1}{n}\sum_{i=1}^{n}\frac{1}{|y_i^+|}\sum_{j\in y_i^+}\log(\hat{p}_{ij}),$$
where $\Theta$ denotes the model parameters, $y_i^+$ denotes the set of relevant labels of instance $i$, and $\hat{p}_{ij}$ is the model prediction for instance $i$ on label $j$, obtained through a softmax activation:
$$\hat{p}_{ij} = \frac{\exp(f_j(x_i))}{\sum_{j'=1}^{L}\exp(f_{j'}(x_i))}.$$
Another intuitively reasonable objective for XMTC is the rank loss [47], which minimizes the number of mis-ordered pairs of relevant and irrelevant labels, i.e. it aims to assign relevant labels higher scores than irrelevant labels. However, rank loss has been shown to be inferior to the binary cross-entropy (BCE) loss over sigmoid activation when applied to multi-label classification datasets in a simple feed-forward neural network [34]. The binary cross-entropy objective can be formulated as:
$$\min_{\Theta}\; -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{L}\left[ y_{ij}\log(\sigma(f_{ij})) + (1-y_{ij})\log(1-\sigma(f_{ij}))\right] \qquad (2)$$
where $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$.
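The following PyTorch-style sketch summarizes the architecture and loss described in Sections 3–3.2, together with the hidden bottleneck layer discussed further in Section 3.3. It is an illustrative reimplementation under assumed hyper-parameters, not the authors' original Theano code.

```python
# Illustrative XML-CNN-style model: convolutions with several window sizes, dynamic max
# pooling into p chunks, a hidden bottleneck layer, and sigmoid outputs trained with BCE.
import torch
import torch.nn as nn
import torch.nn.functional as F

class XMLCNNSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, num_labels=1000,
                 filter_sizes=(2, 4, 8), filters_per_size=128,
                 pool_chunks=8, bottleneck=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, filters_per_size, kernel_size=w, padding=w // 2)
            for w in filter_sizes)
        self.p = pool_chunks
        t = filters_per_size * len(filter_sizes)
        # The bottleneck keeps the output-side parameters at O(h*(p*t + L)) rather than
        # O(p*t*L), which matters when L reaches hundreds of thousands (Section 3.3).
        self.bottleneck = nn.Linear(t * pool_chunks, bottleneck)
        self.out = nn.Linear(bottleneck, num_labels)

    def forward(self, token_ids):                       # token_ids: (batch, m)
        x = self.embed(token_ids).transpose(1, 2)       # (batch, emb_dim, m)
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x))                         # feature maps for this window size
            # Dynamic max pooling: p maxima per feature map instead of one (Section 3.1).
            pooled.append(F.adaptive_max_pool1d(c, self.p).flatten(1))
        h = F.relu(self.bottleneck(torch.cat(pooled, dim=1)))
        return self.out(h)                              # raw label scores f in R^L

# Training with the BCE-over-sigmoid objective of Eq. (2):
# loss = nn.BCEWithLogitsLoss()(model(token_ids), y_multi_hot.float())
```

With the illustrative sizes above ($p = 8$, $t = 384$, $h = 512$) and $L = 670{,}091$ labels, the output side of the network has roughly $h \cdot (pt + L) \approx 3.4 \times 10^8$ parameters, versus $pt \cdot L \approx 2.1 \times 10^9$ if the pooling layer were wired directly to the output layer – about six times fewer, illustrating the saving discussed in Section 3.3.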
[Figure 1: The XML-CNN architecture on an example sentence ("wait for the video and do n't rent it"): representations of documents with word embeddings, a convolutional layer with multiple filter widths and feature maps, a dynamic max pooling layer for a compact representation, and a fully connected layer with sigmoid output for the large label space and the binary cross-entropy loss.]
We find that the BCE loss is better suited for multi-label problems and outperforms the cross-entropy loss in our experiments (Section 4.4). Therefore, we adopt the BCE loss for our final model.

3.3 Hidden Bottleneck Layer
Unlike CNN-Kim [25] or Bow-CNN [21], where the pooling layer and output layer are directly connected, we propose to add a fully connected hidden layer with $h$ units between the pooling and output layers, referred to as the hidden bottleneck layer, since the number of its hidden units is far less than that of the pooling and output layers. The reason for this is two-fold. First, with the pooling layer directly connected to the output layer, the number of parameters between these two layers is $O(pt \times L)$. When documents are long and the number of labels is large, more filters and more chunks in the pooling operation are needed to obtain good document representations, and in the XMTC setting $L$ can be up to millions, so a model size of $O(pt \times L)$ might not fit into a common GPU memory. With an additional hidden bottleneck layer inserted between the pooling and output layers, the number of parameters reduces to $O(h \times (pt + L))$, which can be an order of magnitude less than without this bottleneck layer. Second, without this hidden bottleneck layer, the model only has one hidden layer of non-linearity, which is not expressive enough to learn good document representations and classifiers. Experiments in Section 4 show that this hidden bottleneck layer does help to learn better document representations and improves prediction accuracy.

4 EXPERIMENTS
In this section we report our evaluation of the proposed method and the six competing methods introduced in Section 2 on XMTC benchmark datasets, and compare both the effectiveness and scalability of those methods. We also analyze the contribution of each component of our method via an ablation test.

4.1 Datasets
We used six benchmark datasets: two small-scale datasets, RCV1 [29] (103 labels) and EUR-Lex [31] (3,865 labels); two medium-scale datasets, Amazon-12K [30] (12,277 labels) and Wiki-30K [51] (29,947 labels); and two large-scale datasets, Amazon-670K [28] (670,091 labels) and Wiki-500K (501,069 labels). The dataset statistics are summarized in Table 1.

The TF-IDF features and the class labels for all of these datasets are available in the Extreme Classification Repository (https://manikvarma.github.io). In our experiments, the non-deep learning methods (FastXML, SLEEC and PD-Sparse) used the vectors of TF-IDF features to represent input documents, while the deep learning methods (FastText, Bow-CNN, CNN-Kim and XML-CNN) require the raw text of documents as input, which we obtained from the authors of the repository. We removed words that do not have corresponding TF-IDF features from the raw text so that the feature set (vocabulary) of each dataset is the same in the experiments for both the deep learning and non-deep learning methods. Each dataset has an official training set and test set, which we used as they are. We reserved 25% of the training data as the validation set for model selection (i.e., for tuning hyper-parameters); the remaining 75% was used for training the models.

4.2 Evaluation Metrics
In extreme multi-label classification datasets, even though the label spaces are very large, each instance has only very few relevant labels (see Table 1). This means that it is important to present a short ranked list of potentially relevant labels for each test instance for the user's attention, and to evaluate the quality of such ranked lists with an emphasis on the relevance of the top portion of each list. For these reasons, rank-based evaluation metrics have been commonly used for comparing XMTC methods, including the precision at top K (P@K) and the Normalized Discounted Cumulated Gain (NDCG@K) [1, 5, 36, 41, 45].
119
Session 1C: Document Representation and Content Analysis 1 SIGIR’17, August 7-11, 2017, Shinjuku, Tokyo, Japan
datasets N M D L L̄ L̃ W̄ Ŵ
RCV1 23,149 781,265 47,236 103 3.18 729.67 259.47 269.23
EUR-Lex 15,449 3,865 171,120 3,956 5.32 15.59 1,225.20 1,248.07
Amazon-12K 490,310 152,981 135,895 12,277 5.37 214.45 230.25 224.67
Amazon-670K 490,449 153,025 135,895 670,091 5.45 3.99 230.20 224.62
Wiki-30K 12,959 5,992 100,819 29,947 18.74 8.11 2,246.58 2,210.10
Wiki-500K 1,646,302 711,542 2,381,304 501,069 4.87 16.33 750.64 751.42
Table 1: Data Statistics: N is the number of training instances, M is the number of test instances, D is the total number of features, L is the total number of class labels, L̄ is the average number of labels per document, L̃ is the average number of documents per label, W̄ is the average number of words per document in the training set, and Ŵ is the average number of words per document in the test set.
We follow this convention and use these two metrics in our evaluation in this paper, with $k = 1, 3, 5$. Denoting by $y \in \{0,1\}^L$ the vector of true labels of a document, and by $\hat{y} \in \mathbb{R}^L$ the system-predicted score vector for the same document, the metrics are defined as:
$$P@k = \frac{1}{k}\sum_{l \in r_k(\hat{y})} y_l$$
$$DCG@k = \sum_{l \in r_k(\hat{y})} \frac{y_l}{\log(l+1)}$$
$$NDCG@k = \frac{DCG@k}{\sum_{l=1}^{\min(k, \|y\|_0)} \frac{1}{\log(l+1)}}$$
where $r_k(\hat{y})$ is the set of rank indices of the truly relevant labels among the top-$k$ portion of the system-predicted ranked list for a document, and $\|y\|_0$ counts the number of relevant labels in the ground-truth label vector $y$. P@K and NDCG@K are calculated for each test document and then averaged over all the documents.
4.3 Main Results
The performance of all methods in P@K and NDCG@K on all six datasets is summarized in Tables 2 and 3. Each line compares all the methods on a specific dataset, where the best score is in boldface. We conducted paired t-tests to compare the performance score of each method against the best one in the same line, where the number of documents in each test set is the number of trials.

As we can see in Table 2, XML-CNN has the best score in 11 out of the 18 lines of the table, and each of the 11 scores is statistically significantly better than the 2nd best score in the corresponding line. On the other hand, the target-embedding method SLEEC is the best in 4 lines, the deep learning method Bow-CNN is the best in 2 lines, and the tree-based ensemble method FastXML is the best in only 1 line. A similar observation can be made in Table 3 for the NDCG@K scores, where XML-CNN has the best results in 12 out of the 18 lines. On the Wiki-500K dataset with over 2 million documents and 500,000 labels in particular, XML-CNN outperformed the second best method by 11.7%∼15.3% in P@K and by 11.5%∼11.7% in NDCG@K for K = 1, 3, 5. In the lines where XML-CNN does not have the best score, it has the 2nd best score.

Why did XML-CNN perform particularly strongly on the datasets RCV1, Amazon-12K and Wiki-500K? We notice that these datasets have a higher number of training instances per class than the other datasets (see the column L̃ in Table 1). This confirms our intuition about deep learning: it typically requires more training data than other methods in order to achieve advantageous performance, and XML-CNN successfully took this advantage on these datasets. As for why the other deep learning methods, especially CNN-Kim, performed much worse, we analyze this via the ablation test in Section 4.4.

As for why SLEEC, the leading target-embedding method, had the strongest performance on EUR-Lex and Wiki-30K in particular, we notice that these two datasets have much longer documents than the other datasets (see the column W̄ in Table 1). In other words, the feature vectors of documents in these two datasets are denser and possibly information-richer than those in the other datasets. As we described in Section 2, SLEEC uses a linear regression model to establish the mapping from documents to label vectors in a dimension-reduced target space. But why such a relatively simple linear mapping should lead to better predictions for long documents than the more complex deep learning or tree-based models is not clear at this point – answering this question requires future research.

An interesting observation is that all the methods performed relatively well on RCV1, with each method performing much better on this dataset than on the other datasets. Notice that this dataset has a much smaller number of labels (103 only) than the other datasets, and the smallest average number of labels per document (3.18). This means that this dataset is the least challenging one among the six included in this study.

Another interesting observation is that FastText and CNN-Kim had significantly worse results than the other methods on Amazon-670K (with 670K labels) and Wiki-500K (with 500K labels). The extremely huge label spaces mean that the data sparsity issue would be severe, especially for the long tail of rare labels. Both FastText and CNN-Kim are designed for multi-class classification and not for modeling label correlations, and hence suffered significantly in XMTC when the label spaces are extremely large. Other models like PD-Sparse and Bow-CNN would have a similar weakness, but we cannot examine them empirically, as they failed to scale up on these two datasets.
4.4 Ablation Test
Given that CNN-Kim and XML-CNN are closely coupled, in the sense that the latter extends the former by adding a few components that especially suit XMTC, it is informative to empirically examine the impact of each new component in XML-CNN via an ablation test. The results are shown in Figure 2, where CNN-Kim is the original model designed for multi-class problems; suffix v1 denotes the model obtained by replacing the original loss function of CNN-Kim with the binary cross-entropy loss over sigmoid output (Section 3.2); suffix v2 denotes the model obtained by inserting the hidden bottleneck layer into model v1; and suffix v3 denotes the model obtained by adding the dynamic pooling layer to model v2. That is, v3 is the full version of our XML-CNN model. We show results on three datasets that are representative in terms of their category sizes (small, medium and large scale). Results on the other three datasets demonstrate similar trends and we omit them here due to the space limit.

Overall, we can see that each of the three new components improved the performance of XML-CNN. First, the new loss function certainly is more suitable for dealing with multi-label problems. We also found that the hidden bottleneck layer not only consistently boosted the performance of XML-CNN, but also improved its scalability, as it compresses the output of the pooling layer into a smaller one and makes XML-CNN scalable on the largest dataset. Last but not least, the dynamic max-pooling layer plays a crucial role in our model, leading to significant improvement of XML-CNN over CNN-Kim and indicating its effectiveness in extracting richer information from the context of documents.

For SLEEC on Wiki-500K, we could only use a subset of the full dimension of D = 2,381,304; otherwise it could not fit into our memory. Thus the listed CPU seconds (209,735) in this table do not reflect the full complexity of SLEEC in training. Combining Table 4 and Figure 3, we have the following points:
(1) For scalability comparison, time complexities on the largest datasets, e.g., Wiki-500K and Amazon-670K, are most informative. Only four methods, i.e., XML-CNN, CNN-Kim, FastXML and FastText, successfully scaled to these extremely large problems; the remaining methods failed due to memory issues;
(2) Both XML-CNN and FastXML (the 4th best method on average in this study) scaled well with both extremely large feature spaces and label spaces;
(3) SLEEC (the 2nd best method on average in this study) scaled well with extremely large label spaces (e.g., on Amazon-670K) but suffered from extremely large feature spaces (e.g., on Wiki-500K);
(4) CNN-Kim has time complexities comparable to XML-CNN in both training and testing, but CNN-Kim performed substantially worse on all of the datasets;
(5) FastText is less efficient on EUR-Lex and Wiki-30K, as its training cost is proportional to the average document length;
(6) Bow-CNN is relatively fast in training when it does not run into memory issues. This is partly due to its implementation; also, it has a dynamic pooling layer directly connected to the output layer, resulting in $O(ptL)$ memory (as we discussed in Section 3.3), where $t$ and $p$ are the numbers of filters and pooling chunks, which can be an order of magnitude larger than that of XML-CNN ($O(h(pt + L))$) and CNN-Kim ($O(tL)$).
Methods
Datasets Metrics FastXML SLEEC PD-Sparse FastText Bow-CNN CNN-Kim XML-CNN
P@1 94.62 95.35 95.16 95.40 96.40 93.54 96.86
RCV1 P@3 78.40 79.51 79.46 79.96 81.17 76.15 81.11
P@5 54.82 55.06 55.61 55.64 56.74 52.94 56.07
P@1 68.12 78.21 72.10 71.51 64.99 42.84 76.38
EUR-Lex P@3 57.93 64.33 57.74 60.37 51.68 34.92 62.81
P@5 48.97 52.47 47.48 50.41 42.32 29.01 51.41
P@1 94.58 93.61 88.80 82.11 92.77 90.31 95.06
Amazon-12K P@3 78.69 79.13 70.69 71.94 69.85 74.34 79.86
P@5 62.26 63.54 55.70 58.96 48.74 58.78 63.91
P@1 35.59 34.54 - 8.93 - 15.19 35.39
Amazon-670K P@3 31.87 30.89 - 9.07 - 13.78 31.93
P@5 28.96 28.25 - 8.83 - 12.64 29.32
P@1 83.26 85.96 82.32 66.19 81.09 78.93 84.06
Wiki-30K P@3 68.74 73.13 66.96 55.44 50.64 55.48 73.96
P@5 58.84 62.73 55.60 48.03 35.98 45.06 64.11
P@1 49.27 53.60 - 8.66 - 23.38 59.85
Wiki-500K P@3 33.30 34.51 - 7.32 - 11.95 39.28
P@5 25.63 25.85 - 6.54 - 8.59 29.81
Table 2: Results in P@K – bold face indicates the best method in each line; * denotes the method (if any) whose score is not
statistically significantly different from the best one; ‘-’ denotes the methods which failed to scale due to memory issues.
Methods
Datasets Metrics FastXML SLEEC PD-Sparse FastText Bow-CNN CNN-Kim XML-CNN
G@1 94.62 95.35 95.16 95.40 96.40 93.54 96.88
RCV1 G@3 89.21 90.45 90.29 90.95 92.04∗ 87.26 92.22
G@5 90.27 90.97 91.29 91.68 92.89 88.20 92.63
G@1 68.12 78.21 72.10 71.51 64.99 42.84 76.38
EUR-Lex G@3 60.66 67.86 61.33 63.32 55.03 36.95 66.28
G@5 56.42 61.57 55.93 58.56 49.92 33.83 60.32
G@1 94.58 93.61 88.80 82.11 92.77 90.31 95.06
Amazon-12K G@3 88.26 88.61 80.48 79.90 80.96 83.87 89.48
G@5 84.89 86.46 77.81 79.36 73.01 81.21 87.06
G@1 35.59 34.54 - 8.93 - 15.19 35.39
Amazon-670K G@3 33.36 32.70 - 9.51 - 14.60 33.74
G@5 31.76 31.54 - 9.74 - 14.12 32.64
G@1 83.26 85.96 82.32 66.19 81.09 78.93 84.06
Wiki-30K G@3 72.07 76.10∗ 70.54 57.87 57.26 60.52 76.35
G@5 64.34 68.14 61.74 52.12 45.28 51.96 68.94
G@1 49.27 53.60 - 8.66 - 23.38 59.85
Wiki-500K G@3 41.64 43.64 - 8.43 - 15.45 48.67
G@5 40.16 41.31 - 8.86 - 13.64 46.12
Table 3: Results in NDCG@K – bold face indicates the best method in each line; * denotes the method (if any) whose score is
not statistically significantly different from the best one; ‘-’ denotes the methods which failed to scale due to memory issues.
[Figure 2: P@1, P@3 and P@5 (y-axis: value) of CNN-Kim, CNN-Kim+v1, CNN-Kim+v2 and CNN-Kim+v3 on EUR-Lex, Wiki-30K and Amazon-670K.]
Figure 2: Result of the ablation test – v1 uses the new loss function for multi-label classification over CNN-Kim; v2 denotes
the insertion of the additional hidden layers over v1, and v3 denotes the insertion of the dynamic max pooling layer over v2
(hence the full version of XML-CNN).
dataset FastXML (CPU) SLEEC (CPU) PD-Sparse (CPU) FastText (CPU) Bow-CNN (CPU) CNN-Kim (GPU) XML-CNN (GPU)
train test train test train test train test train test train test train test
RCV1 552 615 888 1,764 20 15 858 184 804 85 3,720 92 2,340 89
EUR-Lex 756 55 4,617 16 322 0.4 19,518 7 1,665 4 660 0.6 1,020 0.7
Amazon-12K 27,949 1,015 90,925 2,506 2,399 15 20,790 499 11,097 135 41,460 34 48,000 45
Amazon-670K 47,342 1,300 63,991 2,856 - - 19,353 26,746 - - 138,060 889 188,040 2,476
Wiki-30K 2,047 21 2,646 0.6 238 60 6,354 20 2,377 20 1,020 3 5,280 9
Wiki-500K 192,741 8,972 209,735 20,002 - - 721,444 84,442 - - 494,400 4,534 422,040 16,511
Table 4: Training and testing time (seconds) of 7 methods on 6 datasets
[Figure 3: Scalability on Wiki-500K – training time in seconds (log scale) versus the percentage of the sampled category space (1%, 5%, 10%, 25%, 50%).]

5 CONCLUSION
This paper presented a new deep learning approach to extreme multi-label text classification, and evaluated the proposed method in comparison with other state-of-the-art methods on six benchmark datasets. [...] We hope this study is informative for XMTC research.
REFERENCES
[1] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web. ACM, 13–24.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Krishnakumar Balasubramanian and Guy Lebanon. 2012. The landmark selection method for multiple output prediction. arXiv preprint arXiv:1206.6479 (2012).
[4] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf. 1–7.
[5] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. 2015. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems. 730–738.
[6] Wei Bi and James Tin-Yau Kwok. 2013. Efficient Multi-label Classification with Many Labels. In ICML (3). 405–413.
[7] Matthew R Boutell, Jiebo Luo, Xipeng Shen, and Christopher M Brown. 2004. Learning multi-label scene classification. Pattern Recognition 37, 9 (2004), 1757–1771.
[8] Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks. In ACL (1). 167–176.
[9] Yao-Nan Chen and Hsuan-Tien Lin. 2012. Feature-aware label space dimension reduction for multi-label classification. In Advances in Neural Information Processing Systems. 1529–1537.
[10] Moustapha M Cisse, Nicolas Usunier, Thierry Artieres, and Patrick Gallinari. 2013. Robust bloom filters for large multilabel classification tasks. In Advances in Neural Information Processing Systems. 1851–1859.
[11] Amanda Clare and Ross D King. 2001. Knowledge discovery in multi-label phenotype data. In European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 42–53.
[12] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493–2537.
[13] André Elisseeff and Jason Weston. 2001. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems. 681–687.
[14] Chun-Sung Ferng and Hsuan-Tien Lin. 2011. Multi-label Classification with Error-correcting Codes. In ACML. 281–295.
[15] Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker. 2008. Multilabel classification via calibrated label ranking. Machine Learning 73, 2 (2008), 133–153.
[16] Sayan Ghosh, Eugene Laksana, Stefan Scherer, and Louis-Philippe Morency. 2015. A multi-label convolutional neural network approach to cross-domain action unit detection. In Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on. IEEE, 609–615.
[17] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and Sergey Ioffe. 2013. Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894 (2013).
[18] Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, and Cordelia Schmid. 2009. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 309–316.
[19] Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang. 2009. Multi-Label Prediction via Compressed Sensing. In NIPS, Vol. 22. 772–780.
[20] Shuiwang Ji, Lei Tang, Shipeng Yu, and Jieping Ye. 2008. Extracting shared subspace for multi-label classification. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 381–389.
[21] Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 103–112.
[22] Rie Johnson and Tong Zhang. 2015. Semi-supervised convolutional neural networks for text categorization via region embedding. In Advances in Neural Information Processing Systems. 919–927.
[23] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016).
[24] Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. 2012. Multilabel classification using Bayesian compressed sensing. In Advances in Neural Information Processing Systems. 2645–2653.
[25] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1746–1751.
[26] Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In Proceedings of NAACL-HLT. 521–526.
[27] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In AAAI. 2267–2273.
[28] Jure Leskovec and Andrej Krevl. 2015. SNAP Datasets: Stanford Large Network Dataset Collection. (2015).
[29] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5, Apr (2004), 361–397.
[30] Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 165–172.
[31] Eneldo Loza Mencía and Johannes Fürnkranz. 2008. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 50–65.
[32] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[33] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech, Vol. 2. 3.
[34] Jinseok Nam, Jungi Kim, Eneldo Loza Mencía, Iryna Gurevych, and Johannes Fürnkranz. 2014. Large-scale multi-label text classification – revisiting neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 437–452.
[35] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP, Vol. 14. 1532–1543.
[36] Yashoteja Prabhu and Manik Varma. 2014. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 263–272.
[37] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Vol. 1631. Citeseer, 1642.
[38] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[39] Farbound Tai and Hsuan-Tien Lin. 2012. Multilabel classification with principal label space transformation. Neural Computation 24, 9 (2012), 2508–2542.
[40] Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. (2011).
[41] Jason Weston, Ameesh Makadia, and Hector Yee. 2013. Label Partitioning For Sublinear Ranking. In ICML (2). 181–189.
[42] Yiming Yang and Siddharth Gopal. 2012. Multilabel classification with meta-level features in a learning-to-rank framework. Machine Learning 88, 1-2 (2012), 47–68.
[43] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[44] Chih-Kuan Yeh, Wei-Chieh Wu, Wei-Jen Ko, and Yu-Chiang Frank Wang. 2017. Learning Deep Latent Spaces for Multi-Label Classification. (2017).
[45] Ian E.H. Yen, Xiangru Huang, Kai Zhong, Pradeep Ravikumar, and Inderjit S Dhillon. 2016. PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. (2016).
[46] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S Dhillon. 2014. Large-scale Multi-label Learning with Missing Labels. In Proceedings of the 31st International Conference on Machine Learning. 593–601.
[47] Min-Ling Zhang and Zhi-Hua Zhou. 2006. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18, 10 (2006), 1338–1351.
[48] Min-Ling Zhang and Zhi-Hua Zhou. 2007. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40, 7 (2007), 2038–2048.
[49] Rui Zhang, Honglak Lee, and Dragomir Radev. 2016. Dependency sensitive convolutional neural networks for modeling sentences and documents. arXiv preprint arXiv:1611.02361 (2016).
[50] Yi Zhang and Jeff G Schneider. 2011. Multi-Label Output Codes using Canonical Correlation Analysis. In AISTATS. 873–882.
[51] Arkaitz Zubiaga. 2012. Enhancing navigation on Wikipedia with social tags. arXiv preprint arXiv:1202.5469 (2012).