Extreme Multi-Label Learning With Label Features For Warm-Start Tagging, Ranking & Recommendation
Extreme Multi-Label Learning With Label Features For Warm-Start Tagging, Ranking & Recommendation
net/publication/322972095
CITATIONS READS
7 229
7 authors, including:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Kunal Dahiya on 06 December 2018.
Manik Varma∗†
[email protected]
ABSTRACT KEYWORDS
The objective in extreme multi-label learning is to build classifiers Extreme multi-label learning, Sponsored search, Large scale recom-
that can annotate a data point with the subset of relevant labels from mender systems with user and item features
an extremely large label set. Extreme classification has, thus far,
only been studied in the context of predicting labels for novel test ACM Reference Format:
points. This paper formulates the extreme classification problem Yashoteja Prabhu, Anil Kag, Shilpa Gopinath, Kunal Dahiya, Shrutendra Har-
when predictions need to be made on training points with partially sola, Rahul Agrawal, and Manik Varma. 2018. Extreme Multi-label Learning
revealed labels. This allows the reformulation of warm-start tag- with Label Features for Warm-start Tagging, Ranking & Recommendation.
ging, ranking and recommendation problems as extreme multi-label In WSDM 2018: WSDM 2018: The Eleventh ACM International Conference on
learning with each item to be ranked/recommended being mapped Web Search and Data Mining , February 5–9, 2018, Marina Del Rey, CA, USA.
ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3159652.3159660
onto a separate label. The SwiftXML algorithm is developed to
tackle such warm-start applications by leveraging label features.
SwiftXML improves upon the state-of-the-art tree based extreme
classifiers by partitioning tree nodes using two hyperplanes learnt 1 INTRODUCTION
jointly in the label and data point feature spaces. Optimization is
Objective: This paper studies extreme multi-label problems where
carried out via an alternating minimization algorithm allowing
predictions need to be made on training points with partially re-
SwiftXML to efficiently scale to large problems.
vealed labels rather than on previously unseen test points. The
Experiments on multiple benchmark tasks, including tagging on
SwiftXML algorithm is developed for learning in such scenarios
Wikipedia and item-to-item recommendation on Amazon, reveal
by exploiting both label and data point features and applied to
that SwiftXML’s predictions can be up to 14% more accurate as
warm-start tagging, ranking and recommendation applications.
compared to leading extreme classifiers. SwiftXML also demon-
Extreme classification: Extreme multi-label learning addresses
strates the benefits of reformulating warm-start recommendation
the problem of learning a classifier that can annotate a data point
problems as extreme multi-label learning tasks by scaling beyond
with the most relevant subset of labels from an extremely large
classical recommender systems and achieving prediction accuracy
label set. Note that multi-label learning is distinct from multi-class
gains of up to 37%. Furthermore, in a live deployment for sponsored
learning which aims to predict a single mutually exclusive label.
search on Bing, it was observed that SwiftXML could increase the
Extreme classification has opened up a new paradigm for think-
relative click-through-rate by 10% while simultaneously reducing
ing about applications such as tagging, ranking and recommenda-
the bounce rate by 30%.
tion. In general, one can reformulate these problems as extreme
∗ Indian classification tasks by treating each item to be ranked/recommended
Institute of Technology Delhi
† Microsoft
Research and AI as a separate label, learning an extreme multi-label classifier that
‡ Samsung Research maps a user’s feature vector to a set of relevant labels, and then
using the classifier to predict the subset of items that should be
Permission to make digital or hard copies of all or part of this work for personal or ranked/recommended to each user. Data points are therefore re-
classroom use is granted without fee provided that copies are not made or distributed ferred to as users and labels as items throughout this paper.
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the Unfortunately, extreme classification has thus far been studied
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or only in the cold-start context of recommending items to new users
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from [email protected].
(predicting labels for novel test points). As a result, existing extreme
WSDM 2018, February 5–9, 2018, Marina Del Rey, CA, USA classification algorithms have been based on user features alone
© 2018 Copyright held by the owner/author(s). Publication rights licensed to Associa- and have not exploited item features. They have therefore been
tion for Computing Machinery. unable to leverage information about the revealed item preferences
ACM ISBN 978-1-4503-5581-0/18/02. . . $15.00
https://doi.org/10.1145/3159652.3159660 of existing users available in warm-start scenarios.
Motivation and label features: This paper is motivated by the hyperplane based on user features alone in the following ways.
applications of tagging on Wikipedia, item-to-item recommenda- First, users who were incorrectly partitioned based on user features
tion on Amazon and ranking queries for a given ad-landing page alone might now be partitioned correctly using the additional item
for sponsored search on Bing. In the case of tagging Wikipedia arti- features. Second, users who like similar items can now be parti-
cles, each Wikipedia tag is treated as a separate label and the tag’s tioned together as their item features will be similar. Predictions
word2vec embedding is treated as its label feature vector. SwiftXML are made exactly as in existing tree-based extreme classifiers but
is then used to predict new tags for an existing article leveraging in- for the fact that trees are now traversed based on the sign of the
formation about not only the article’s text but also the existing tags average prediction of both hyperplanes at a node. This improves
for the article. For item-to-item recommendation on Amazon, each prediction accuracy by allowing SwiftXML to recommend items
item is treated as a separate label and bag-of-words label features that are similar to the items that are already liked by a user not just
are extracted based on the product description. SwiftXML is then in terms of ratings similarity as in the case of existing extreme clas-
used to recommend other items that might be bought along with a sifiers but also in terms of item feature similarity. Finally, prediction
given item leveraging not only the given item’s product description accuracy also improves over the traditional extreme classification
but also information about existing item recommendations that had formulation as more information is now available at test time.
been adopted by users in the past. Finally, for sponsored search Contributions: This paper extends the extreme classification
advertising, each of the top monetizable queries on Bing is treated formulation to handle warm-start scenarios and develops, to the
as a separate label and label features are extracted based on CDSSM best of our knowledge, the first extreme classification algorithm
embeddings [25] for each query. SwiftXML is then used to rank the incorporating label features. The key technical contributions are
relevant queries for a given ad landing page leveraging not only a novel tree node splitting function based on both user and item
the page’s content but also information about existing queries that features and a scalable algorithm for optimizing the function. Ex-
had led to clicks on the ad in the past as well as existing queries perimental results reveal that SwiftXML’s predictions can be as
that the advertiser had bid on for that ad. This allows SwiftXML to much as 14% more accurate as compared to state-of-the-art extreme
make significantly more accurate predictions as compared to pre- classifiers and 37% more accurate as compared to classical methods
vious extreme classification algorithms which could not leverage for warm-start recommendation. Another significant contribution
information about existing tags, previously recommended items is the live deployment of SwiftXML to real users for sponsored
and relevant queries. search on Bing. It is demonstrated that SwiftXML can increase the
Limitations of current recommender systems: Warm-start relative click-through-rate by 10% while simultaneously reducing
recommendation is a very well studied problem and a number of the bounce rate by 30%.
algorithms have been proposed outside the extreme classification
literature including those based on matrix factorization [12, 14, 16,
2 RELATED WORK
24], Inductive Matrix Completion [20], Local Collaborative Filter-
ing [17], etc. Unfortunately, most of these algorithms cannot scale Much progress has recently been made in developing extreme learn-
to the extreme setting as they cannot train on large item and user ing algorithms based on trees [5, 15, 22, 29], embeddings [7, 10,
feature matrices involving millions of items and users in reasonable 13, 19, 27, 28, 34] and linear approaches [6, 31, 32]. While these
time on commodity hardware. Furthermore, some of these algo- methods might potentially be used to make warm-start recommen-
rithms are unable to handle implicit ratings (where missing labels dations, their performance degrades in this setting since they do
are not necessarily irrelevant) and, given millions of items, are not leverage item feature information during training and do not
unable to make predictions in milliseconds, both of which are criti- exploit information about a user’s revealed item preferences during
cal features for deploying in real world applications. While some prediction.
heuristics have been proposed to address these limitations, such Direct approaches to extreme learning, such as DiSMEC [6] learn
as adding a tree based search over the items as a post-processing and apply a separate linear classifier per label. As a result, they
step [33], principled solutions which directly incorporate implicit have the highest prediction accuracies but can also take months for
ratings and logarithmic prediction costs into the learning algorithm training and prediction at extreme scales on commodity hardware.
have not been proposed for large scale recommender systems based Tree-based extreme classifiers, such as PfastreXML [15] have lower
on user and item features. Finally, many of these algorithms, with prediction accuracies and larger model sizes but can train in hours
the notable exception of Local Collaborative Ranking [17], are based on a small server and can make predictions in milliseconds per
on the assumption that the ratings matrix is low-rank and therefore test point. Scaling to large datasets is a critical requirement for
have poor prediction accuracies as this assumption is violated in deployment in Bing and one of the major advantages that extreme
the extreme setting [7]. classifiers enjoy over traditional warm-start recommendation al-
SwiftXML: SwiftXML addresses these limitations by efficiently gorithms based on user and item features. SwiftXML is therefore
learning an ensemble of tree classifiers at the extreme scale. As developed as an extension of PfastreXML. SwiftXML enjoys all the
with other extreme classification algorithms, SwiftXML can train scaling properties of PfastreXML while having significantly higher
on implicit ratings without the low-rank assumption and can make prediction accuracies than PfastreXML, DiSMEC and state-of-the-
predictions in milliseconds. Unlike other extreme classification art recommender systems.
algorithms, however, SwiftXML grows its trees by recursively parti- Warm-start prediction has been well-studied in the recommender
tioning nodes using two hyperplanes learned jointly in the user and systems literature. Matrix factorization techniques such as WRMF [14],
item feature spaces. This improves training over learning a single SVD++ [16], BPR [24] and WBPR [12], were specifically designed
to address this scenario but factorize the ratings matrix without training allows information sharing between the 2 hyperplanes,
exploiting user or item feature information. This limitation has leading to better partitions. The tree growth is terminated when
been addressed in IMC [20], Matchbox [26], LCR [17] and other all the leaf nodes contain less than a specified number of users
methods [4]. Unfortunately, none of these scale to the extreme which is a hyperparameter of the algorithm. Each leaf node stores
setting involving a large number of users and items with implicit a probability distribution over the items which are proportional
ratings. Furthermore, with the exception of LCR, each of these meth- to number of users in the leaf that prefer the respective items. To
ods assumes that the ratings matrix is low-rank, which does not rectify the mistakes committed by a single, deep tree, SwiftXML
hold in the extreme setting [7]. Consequently these methods have trains multiple trees which differ only in the random seed values
poor prediction accuracies as demonstrated in the Experiments used to initialize the node partitioning algorithm. SwiftXML tree
Section. While some heuristics such as tree based post-processing prediction involves passing the test user down each of the trees,
search [33] have been proposed to speed-up prediction, they are starting from the root node until it reaches a leaf node. At each
not specialized for implicit rating scenarios and don’t address the node, the test point is sent to left (right) child if the average of
training time and accuracy concerns. the scores predicted by the 2 hyperplanes is positive (negative).
The probability scores of all the reached leaf nodes are averaged
3 A MOTIVATING EXAMPLE to obtain the label probability predictions. The SwiftXML trees are
prone to predicting tail labels with low probabilities as partitioning
Consider the problem of tagging an existing Wikipedia article –
errors in the internal nodes disproportionately reduce the support
such as that of Albert Einstein which has already been annotated
of tail labels in the leaf node distributions. SwiftXML addresses
with about 60 tags. State-of-the-art extreme classifiers suffer from
this limitation by re-ranking the tree predictions using classifiers
the following three limitations. First, during training, they learn
designed specifically for tail labels. SwiftXML follows the same
impoverished models from article text features alone and are unable
training and prediction procedure for tail label classifiers as [15].
to leverage features from tags such as "Recipients of the Pour le
SwiftXML trees were empirically observed to be well-balanced,
Mérite" and "Members of the Lincean Academy" which contain
leading to logarithmic training and prediction complexities.
information not found in the article’s text. SwiftXML avoids this
Item-set features: SwiftXML nodes learn a hyperplane in the
issue by learning from word2vec features extracted from the tags
item-set feature space, jointly with a hyperplane in the user fea-
in addition to the article text features. This allows SwiftXML to
ture space. The item-set features encode semantic information
predict "Recipients of civil awards and decorations" for the article
about a user’s revealed item preferences. The following linear
which could not be predicted by the traditional methods. Second,
formula is used for deriving the item-set features of an ith user:
again during training, existing classifiers learn that Einstein and
zi = ( j yirj xj′ )/∥ j yirj xj′ ∥, where yri ∈ {0, 1} L denotes the re-
P P
Newton’s articles are very different as they share very few tags.
vealed item preference vector for the ith user, and xj′ ∈ RD denotes
′
However, SwiftXML learns that the two are similar as the word2vec
embeddings of "American physicists" and "English physicists" as the feature vector for the jth item. This formulation ensures that
well as other corresponding tags are similar. This allows SwiftXML users with overlapping item preferences have similar item-set fea-
to annotate Einstein’s article with Newton’s tag "Geometers" which tures, which helps to retain those users together in the SwiftXML
could not have been predicted from the article text directly. Finally, trees. The item-set features provide information complementary
during prediction, SwiftXML can confidently annotate Einstein’s to the user features, and allow SwiftXML to leverage correlations
article with a novel tag "Astrophysicists" by relying on its sim- that exist between the revealed and the test items. To account for
ilarity to the existing tag "Cosmologists" in terms of their label varying number of revealed items across users, the item-set features
features. Existing extreme classifiers are unable to leverage such are normalized to unit norm.
label correlations. Node partitioning objective function: SwiftXML optimizes
a novel node partitioning objective which is designed to ensure
4 SWIFTXML both purity as well as generalizability of the learned partitions:
This section develops the SwiftXML algorithm as an extension of
X X
PfastreXML for addressing warm start classification tasks. SwiftXML Min ∥wx ∥1 + C x Lreg (δi wx⊤ xi ) + ∥wz ∥1 + Cz Lreg (δi wz⊤ zi )
retains all the scaling properties of PfastreXML while being sig- i i
nificantly more accurate owing to the effective use of additional X 1 + δi 1 − δi
+
information about revealed item preferences and item features. + Cr Lrank (r , yri ) + Lrank (r− , yri )
i
2 2
SwiftXML node partitioning optimizes a novel objective which
w.r.t. wx ∈ R D , wz ∈ R D ,δ ∈ {−1, +1} L , r+ , r− ∈ Π(1,L)
′
jointly learns two separating hyperplanes in the user and item
feature spaces respectively. PL yl
Classifier architecture: SwiftXML trees are grown by recur- l =1 pl log(r l +1)
where Lreg (x ) = log(1 + e −x ) , Lrank (r, y) = − PL 1
sively partitioning the users into 2 child nodes. Each internal node l =1 log(l +1)
stores 2 separating hyperplanes in the user and the item feature (1)
spaces respectively, which jointly learn the user partitioning. Since
learning an optimum user partition is a computationally hard prob- Here, i enumerates the training users; δi ∈ {−1, +1} indicates the
lem, SwiftXML proposes an efficient alternating minimization al- user assignment to either negative (right) or positive (left) partition;
gorithm which converges to a locally optimal solution. The joint wx , wz represent the separating hyperplanes learned in the user
and item-set feature spaces; r+ and r− represent the item rank- randomly. When a leaf node is reached, the leaf’s item distribution
ing variables for positive and negative partitions; Π(1,L) denotes is used as the item preference scores. Thereupon, the individual
the space of all possible rankings over the L items; C x ,Cz ,Cr are tree scores are averaged to obtain the ensemble scores, E j (x) =
PT t t
SwiftXML hyper-parameters; pl are the item propensity scores [15]. t =1 P j (x, z)/T for the j item, where P j (x, z) is the probability
The first line in (1) promotes model sparsity as well as gener- assigned by tree t to item j for the test point. These ensemble scores
alizability to test points by learning sparse logistic regressors in are further reranked by combining with the tail label classifier
user and item-set feature spaces. The second line maximizes node scores B j as s j = α log E j (x) + (1 − α ) log B j (x) ∀j ∈ {1, ..,L}.
purity by ranking the preferred items of each user as highly as Finally the top scoring items are recommended to the test user.
possible within its partition. Concretely, this is achieved by maxi- The SwiftXML trees are well-balanced with O (log(N )) depth.
mizing the Propensity-scored Normalized Discounted Cumulative Consequently, tree prediction is efficient and its cost for a test user
Gain (PSnDCG) metric. PSnDCG is unbiased to missing items and is O (log(N ) ∗ (D̂ + Dˆ ′ )). The overall complexity of the SwiftXML
boosts accuracy over rare and informative tail labels which are prediction algorithm is O ((T log(N ) + c) ∗ (D̂ + Dˆ ′ )), where c is the
also frequently missing owing to their unfamiliarity to the users. number of top-scoring items being reranked by the base-classifiers.
Lrank (r− , yri ) is the loss form of PSnDCG which is minimized by (1). Theoretical analysis of SwiftXML algorithm and optimization
The tight coupling between δi and the two regressors as well as are beyond the scope of this work. Detailed derivations of the
the ranking terms help to learn node partitions that are both pure optimization steps and pseudocodes for SwiftXML training and
and accurate. Upon joint optimization, the two separators bring prediction algorithms are presented in supplementary .
together those users having both similar user descriptions as well
as similar revealed item preferences.
The time complexity for training SwiftXML is O (N (T log(N ) +
5 SPONSORED SEARCH ADVERTISING
L̂)(D̂ + Dˆ ′ )), where N ,T are the number of training instances and Objective and motivation: Search engines generate most of their
trained trees respectively; and L̂, D̂, Dˆ ′ are the average number revenue via sponsored advertisements which are displayed along-
of revealed items, non-zero user features and non-zero item-set side the organic search results. While creating an ad campaign, an
features of a user, respectively. The above complexity includes both advertiser usually provides minimalistic information such as a short
tree training times as well as tail label classifier training times, description of the ad, url of the ad landing page, a few bid-phrases
and is dominated by the former. Due to small values of L̂,T and a along with their corresponding bids. Bid-phrases are a set of search
tractable log(N ) ∗ N dependence on the number of users, SwiftXML queries that are judged to be relevant to the ad, and hence bidded
training is sufficiently fast and scalable. on by the advertisers. Since the potential monetizable queries are
Optimization: The discrete objective in (1) cannot be straight in the order of millions, the advertiser won’t be able to provide
forwardly optimized with usual gradient descent techniques. There- exhaustive list of all relevant bid-phrases. To address this limitation,
fore, an efficient, iterative, alternating minimization algorithm is an ad retrieval system resorts to machine learning-based extended
adopted which alternately optimizes over one of the four classes match techniques [9, 23] which suggest additional relevant search
of variables (δ, r± , wx , wz ) at a time with the others held constant. queries to bid on by leveraging advertiser provided information.
Optimization over δ with other variables held constant reduces (1) This ad retrieval application is a natural fit for warm-start recom-
to N separate problems over individual δi variables which have mendations where the given bid phrases and historically known
simple closed form solutions: relevant search queries can be used to predict new search queries
for the ad more accurately.
δi = Sign C x wx⊤ xi +Cz wz⊤ zi +Cr (Lrank (r+ , yri )) − Lrank (r− , yri ))
Ranking queries for ad retrieval: Ad retrieval task can be
formulated as extreme multi-label learning problem by treating
Optimization over r± with δ, wx , wz fixed also has a closed form ad landing pages as data points and search queries as labels. Data
solution r± = rank i:δi =± I L (yri )yri , where rank(v) returns the
P
point feature vectors are created by extracting bag-of-words repre-
indices of v in their descending order, and I L (yri ) is a user-specific sentation from the ad landing pages. Relevant labels are generated
constant. Optimization over wx or wz while fixing the remaining from the historical click logs i.e. a search query is tagged to be
variables reduces to standard L1 regularized logistic regression relevant to an ad if the ad had received clicks when displayed for
problems which can be efficiently solved using Liblinear [11]. In the query in the past. For the warm-start scenario, the advertiser
practice, the algorithm alternates between r± ,wx and wz variables, provided bid-phrases as well as the queries that had led to clicks
interleaved with efficient δ optimizations. Early stopping with just on the ad in the past are considered as revealed labels for the ad.
one iteration over all variables was found to be sufficient in practice. Label features are extracted based on the CDSSM embeddings [25]
Optimization time is empirically dominated by learning two L1 of the bid-phrases or the search query phrases.
regularized logistic regressions, which have a combined cost of Label relevance weights: Not all ads which were clicked for a
O (N nodet (D̂ + Dˆ ′ )), where N node is the number of training users query in past are equally relevant or click-yielding to the query. One
in the node and t (usually set to 1) is the number of iterations. such example is the query ”marco polo” which had received clicks
Prediction: SwiftXML tree prediction involves routing the test mostly from hotel related ads along with a few clicks from book
user down the tree, starting from the root node until it reaches a related ads. Treating all the clicked ads for this query as equally
leaf. At each visited internal node, the user is sent to left (right) relevant results in the query "marco polo" being recommended for
child if the combined scores of the two separating hyperplanes, i.e. book related ads too, which in turn generates lower click through
Cx wx⊤ x + Cz wz⊤ z, is greater (lesser) than 0, with ties being broken rates on the actual search engine traffic. To address this problem,
Table 1: Dataset statistics Table 2: SwiftXML can be up to 14% and 37% more accurate as
compared to state-of-the-art extreme classifiers and warm-
Dataset
Train Features Labels Test Avg. labels Avg. points start recommendation algorithms respectively according to
N D L M per point per label
unbiased propensity-scored Precision@5 (PSP5). Results for
EURLex-4K 15,539 5,000 3,993 3,809 5.31 25.73
Wiki10-31K 14,149 101,938 30,935 6,613 17.25 11.58
PSP1, PSP3 and biased Precision@k are presented in the sup-
AmazonCat-13K 1,186,239 203,882 13,330 306,782 5.05 566.01 plementary .
CitationNetwork-36K 62,503 39,539 36,211 15,467 3.07 6.61
Amazon-79K 490,449 135,909 78,957 153,025 2.08 4.06
Wikipedia-500K 1,813,391 2,381,304 501,070 783,743 4.77 24.75
Revealed Label Percentages
Dataset Algorithm
20% 40% 60% 80%
differential weights are assigned to each (query, ad) pair in the train- WRMF 11.05 16.58 19.77 22.85
ing data based on the ad’s click generating potential for the query as SVD++ 0.41 0.51 0.61 0.60
measured by the normalized point-wise mutual information (nPMI). BPR 1.13 1.01 0.86 2.24
Let c i j be number of historical clicks between ith ad and jth query. PfastreXML 48.21 48.64 51.13 52.46
log(p (i )p (j )) EURLex SLEEC 42.72 46.31 48.56 51.64
Normalized PMI is defined as nPMIi j = log p (i,j ) − 1, where
PDSparse 43.79 45.72 46.02 49.87
p(i) = j c i j / i,j c i j and p(i, j) = c i j / i,j c i j
P P P
DiSMEC 47.03 48.30 50.54 51.63
The nPMI values are higher for more relevant query-ad pairs IMC 11.23 11.45 11.45 11.72
in the training data and always lie between 0 and 1. SwiftXML is Matchbox 0.50 – 1.00 1.09
learned on the weighted training data, which helped to improve SwiftXML 48.45 49.72 53.12 55.70
offline accuracy on the holdout data by 1% over the model trained
WRMF 5.27 6.01 6.34 6.33
on the unweighted data.
PfastreXML 19.80 18.17 16.31 14.77
Inverted ad-index: Trained SwiftXML model was used to make
SLEEC 12.42 12.57 12.28 12.14
recommendations for all the ads in the system. For every ad, the Wiki10
PDSparse 8.02 6.78 5.72 4.73
SwiftXML recommended queries having a score of above a thresh- DiSMEC 15.47 15.19 14.53 13.87
old are inserted into an inverted ad index, to be retrieved when the IMC 2.36 3.42 3.87 3.98
corresponding query is entered in the search engine. A relatively SwiftXML 19.92 19.07 17.06 16.23
high threshold of 0.7 was set on the SwiftXML scores to maxi-
PfastreXML 76.32 75.80 75.17 76.30
mize precision. Additionally, the top two SwiftXML recommended
AmazonCat PDSparse 65.25 61.61 58.37 57.47
queries were also inserted into the inverted ad index, since they
SwiftXML 77.17 81.10 83.77 87.83
were found to be mostly relevant despite scoring lower than the
set threshold sometimes. This step improved recall, while main- PfastreXML 15.39 15.19 15.30 15.24
taining the precision value, thus achieving 2% improvement in the SLEEC 7.41 7.65 7.37 6.40
CitationNetwork
offline F1-score on the holdout data, as compared to the global PDSparse 14.65 14.48 14.05 14.21
threshold-only approach. DiSMEC 17.84 17.78 18.06 18.49
SwiftXML 16.92 17.84 19.44 19.34
Mapping onto bid-phrases: SwiftXML recommended queries
need to be associated with one of the advertiser provided bid- PfastreXML 36.39 36.14 36.61 35.40
phrases in order to choose appropriate bid value for the query, SLEEC 23.83 28.30 32.24 31.33
Amazon
and also to allow the advertisers to track the campaign perfor- PDSparse 34.12 33.57 33.54 32.85
mance at the bid-phrase level. A query to bid-phrase click model DiSMEC 41.89 41.94 42.54 41.86
is learned for this purpose, which assigns a click probability to a SwiftXML 37.69 42.80 51.33 49.44
given (query, bid-phrase) pair. Training data of form (query, bid- PfastreXML 33.34 33.35 33.22 35.22
Wikipedia
phrase) is generated by considering historically clicked (query, ad) SwiftXML 34.76 35.31 35.68 38.07
pairs and extracting the corresponding bid-phrase from the ad. A
gradient boosted decision trees (GBDT) model is trained over the
following features extracted for each (query, bid-phrase) pair: syn- the scale of state-of-the-art warm-start recommendation algorithms
tactic similarity features, historical click features (click counts at that train on both document and label features and (c) deploying
token level) and semantic similarity features (Word2vec [18] and SwiftXML in a live system for sponsored search advertising on Bing
CDSSM [25] embeddings). SwiftXML recommended query for an ad, led to significant gains in the click-through rate and quality of ad
is associated with the bid-phrase with maximum click probability, recommendations as well as simultaneous reductions in the bounce
as predicted by the click model. rate.
Datasets and features: Experiments were carried out on bench-
6 EXPERIMENTS mark datasets containing up to 1.8 million training points, 2.3 mil-
Experiments were carried out on benchmark datasets with up to lion dimensional features and 0.5 million labels (see Table 1 for
half a million labels demonstrating that: (a) SwiftXML could be as dataset statistics). The applications considered range from tagging
much as 14% and 37% more accurate as compared to state-of-the-art Wikipedia articles (Wikipedia-500K), cataloging Amazon items into
extreme classification and warm-start recommendation algorithms, multiple Amazon product categories (AmazonCat-13K), item-to-
respectively; (b) SwiftXML could scale to extreme datasets beyond item recommendation of Amazon products (Amazon-79K), paper
Table 3: SwiftXML can be up to 14% and 4% more accu- are collaborative filtering algorithms based on factorizing the label
rate according to unbiased propensity scored Precision@5 (ratings) matrix alone and do not leverage user or item features. Sec-
as compared to baseline extensions of PfastreXML incorpo- ond, SwiftXML was compared to state-of-the-art extreme classifiers
rating label features via early and late fusion respectively. based on trees (PfastreXML [15]), embeddings (SLEEC [7]) and lin-
Results for other metrics, including biased Precision@k, are ear methods that learn a separate classifier per label (PDSparse [32]
reported in the supplementary . and DiSMEC [6]). The extreme classifiers improve upon the col-
laborative filtering methods by training on the user features along
Revealed Label Percentages with the label (ratings) matrix. Third, SwiftXML was also com-
Dataset Algorithm
20% 40% 60% 80% pared to state-of-the-art recommender systems such as Inductive
PfastreXML-early 47.29 49.82 52.04 54.40 Matrix Completion (IMC) [20] and Matchbox [26] which extend
EURLex PfastreXML-late 48.21 49.55 52.25 46.05 collaborative filtering and extreme classification methods by lever-
SwiftXML 48.45 49.72 53.12 55.70 aging user features, label (item) features and the label (ratings)
PfastreXML-early 19.59 18.17 16.31 14.94 matrix during both training and prediction. Finally, SwiftXML was
Wiki10 PfastreXML-late 19.80 18.03 16.31 14.65 compared to two alternate ways of extending the state-of-the-art
SwiftXML 19.92 19.07 17.06 16.23 tree based PfastreXML extreme classifier [15] to handle label fea-
PfastreXML-early 74.56 78.53 79.77 81.52 tures. In particular, PfastreXML-early uses early fusion to train
AmazonCat PfastreXML-late 76.23 78.20 81.43 85.40 PfastreXML on concatenated label and document features with
SwiftXML 77.17 81.10 83.77 87.83 the relative weighting of the two feature types being determined
PfastreXML-early 15.04 15.08 15.43 15.54 through validation. In contrast, PfastreXML-late uses late fusion
CitationNetwork PfastreXML-late 15.73 17.12 19.11 19.86 to learn separate PfastreXML classifiers in the document and label
SwiftXML 16.92 17.84 19.44 19.34 feature spaces and then combines the two scores during prediction
PfastreXML-early 36.45 36.18 36.64 35.43 with relative weighting being again determined through validation.
Amazon PfastreXML-late 36.71 40.42 46.71 45.98 Results are reported for the Mrec [2] recommender system li-
SwiftXML 37.69 42.80 51.33 49.44 brary implementation of WRMF, the Mahout [1] implementation
PfastreXML-early 34.23 34.44 34.43 36.47 of SVD++ and the Matchbox implementation available on the Mi-
Wikipedia PfastreXML-late 33.78 34.26 34.88 37.66 crosoft Azure cloud computing platform. The implementation of all
SwiftXML 34.76 35.31 35.68 38.07 the other algorithms was provided by the authors. Unfortunately,
some algorithms do not scale to large datasets and results have
Table 4: SwiftXML could increase the relative click-through- therefore been reported for only those datasets to which an imple-
rate (CTR) and relative quality of ad recommendation (QOA) mentation scales. The relative performance of all the methods can
by 10% while simultaneously reducing the bounce rate (BR) be compared on the small scale EURLex dataset.
by 30% on sponsored search on Bing. Hyper-parameters: In addition to the hyper-parameters of
PfastreXML, SwiftXML has two extra hyper-parameters Cz ,λz
which weight the loss incurred over the label features in the node
Algorithm Relative Relative Relative partitioning objective and the base classifiers respectively . Unfortu-
CTR (%) QOA (%) BR (%) nately, the Wikipedia-500K dataset is too large for hyper-parameter
Bing 100 100 100 tuning through validation and therefore all algorithms were run
PfastreXML 102 103 76 with default values, with the default SwiftXML values kept same
SwiftXML 110 112 69 as in PfastreXML along with Cz = λz = 1. On the other datasets,
the hyper-parameters for all the algorithms were tuned using fine
grained validation so as to maximize the prediction accuracy on
citation recommendation (CitationNetwork-39K) and document tag- the validation set.
ging (EURLex-4K, Wiki10-31K). All the datasets can be publically Evaluation metrics: Performance evaluation was done using
downloaded from The Extreme Classification Repository [8]. Bag- precision@k and nDCG@k (with k = 1, 3 and 5) which are widely
of-words TF-IDF features provided on the Repository were used as used metrics for extreme classification. Performance was also evalu-
the document (or user) features for each dataset. 500-dimensional ated using propensity scored precision@k and nDCG@k (PSPk and
word2vec embeddings [18] were used to generate the label features PSNk with k = 1, 3 and 5) which have recently been shown to be
as these led to better results as compared to other word embedding unbiased, and more suitable, metrics [15] for extreme classification,
models including glove [21] and phrase2vec [3]. Algorithms were tagging, ranking, recommendation, etc. The propensity model and
evaluated under various warm-start conditions as more and more values available on The Extreme Classification Repository were
of a user’s item preferences were revealed. This was simulated by used. Results for all metrics apart from PSP5 are reported in the
randomly sampling 20%, 40%, 60% and 80% of the test labels and supplementary due to space limitations.
revealing them during training while the remaining labels were Results - prediction accuracy: Tables 2 and 3 compare predic-
used for evaluation purposes as ground-truth. tion accuracy of SwiftXML to the various baseline algorithms as the
Baseline algorithms: SwiftXML was compared to four types percentage of labels revealed during training is varied from 20% to
of algorithms for warm-start recommendation. First, SwiftXML 80%. As can be seen in Table 2, SwiftXML’s predictions can be up to
was compared to WRMF [14], SVD++ [16] and BPR [24] which 14% and 37% more accurate as compared to state-of-the-art extreme
classifiers and warm-start recommendation algorithms respectively. labelled data. Table 4 compares SwiftXML to PfastreXML as well
The largest gains over existing extreme classifiers were observed as a large ensemble of state-of-the-art methods for query ranking
for item-to-item recommendation on Amazon. This was because which are currently in production, referred to as Bing-ensemble. As
many Amazon products had unhelpfully brief product descriptions, can be seen, SwiftXML leads to upto 10% higher CTR and QOA, as
translating into poor user features for existing extreme classifiers. well as upto 10% lower QBR as compared to both PfastreXML and
In fact, in some extreme cases, some Amazon products had no prod- Bing-ensemble. The training times of SwiftXML and PfastreXML
uct description whatsoever apart from the product name. However, algorithms were 140 and 200 hours respectively, while their model
the very same products had a number of other products that had sizes were 3.7 and 3.8 GB respectively, with prediction times being
frequently been bought together with them which contained some 1.7 and 1.8 milliseconds per test point respectively.
useful information. This was leveraged by SwiftXML as item fea- Qualitative examples: Figure (1) illustrates the advantages of
tures to make significantly better recommendations. SwiftXML was SwiftXML over PfastreXML through some representative exam-
also able to make better predictions by leveraging such features ples. PfastreXML suffers from several limitations due to its sole
even when a sufficiently verbose product description was available reliance on user features, which are overcome by SwiftXML by
(see figure 1 for qualitative examples). leveraging revealed items and item feature information. First, the
Table 3 also illustrates that SwiftXML can be more accurate user features often fail to provide sufficient information for classifi-
as compared to early and late fusion methods for incorporating cation, with an extreme case being "(a) AmazonCat: Kvutzat Yavne
label features into PfastreXML by as much as 14% and 4% respec- pickles" whose web page contains limited text. In this case, Pfas-
tively. Early fusion has several limitations such as its tendency to treXML recommends popular labels such as "Books" and "Literature
overfit and an inherent bias towards the dense label features over & fiction" which are unfortunately irrelevant. On the other hand,
sparse document features [30]. PfastreXML-late learns independent SwiftXML leverages information about revealed tags such as "Pick-
classifiers over the document and label features and therefore has les" and "Pickles & relishes" to recommend relevant tags such as
suboptimal performance as compared to SwiftXML’s which learns "Dill pickles", "Grocery & gourmet food". Second, PfastreXML some-
node separators in both spaces jointly. times puts greater emphasis on the irrelevant, but frequent user
Results - scaling: As Tables 2 and 3 show, SwiftXML can effi- features such as "language" and "Barron" in "(d) Amazon: Barron’s
ciently scale to large datasets beyond the scale of warm-start recom- IELTS ..." leading to irrelevant recommendations, such as "More
mendation algorithms such as IMC and Matchbox which also train useful french words" and "Spanish the Easy Way (Barron’s E-Z)". In
on both document and label features. SwiftXML’s training time is contrast, the information about revealed items such as "toefl" and
comparable to that of PfastreXML-early and PfastreXML-late, other "academic english" inform SwiftXML to emphasize more on the
techniques for handling warm-start problems based on both docu- relevant user features such as "ielts" and "testing", resulting in use-
ment and label features, but it’s prediction time and model size can ful IELTS preparation books being recommended to the customer.
be lower. As compared to PfastreXML, SwiftXML’s training time is Third, PfastreXML is unable to disambiguate between homonyms
3-4x more in general but its prediction time and model size might be such as "Fo" in the book title "(c) Amazon: Xsl Fo" versus "Fo" in
sometimes lower as it learns shorter better quality trees due to the the surname of dramatist "Dario Fo", both of which are among the
extra information available. For instance, on the AmazonCat-13K user features. Hence, PfastreXML incorrectly recommends Dario
dataset with 1.1 million training points and 13.3K labels, SwiftXML, Fo’s plays instead of books on Xsl. SwiftXML resolves this ambi-
PfastreXML-early, PfastreXML-late and PfastreXML,’s trained in guity by using the context information from revealed books about
25, 19, 29 and 7 hours respectively on a single core of Intel Xeon 2.6 "xsl" to recommend relevant books such as "Definitive XSL-FO" and
GHz server, while their model sizes were 10, 15, 19 and 16 GB respec- "Learning XSLT". SwiftXML also leverages correlations between
tively with prediction times being 4.2, 6.6, 4.7 and 4.3 milliseconds the revealed items and the relevant test items, to make accurate
respectively. Note that SwiftXML training can be easily parallelized predictions. For example, the strong correlation between the re-
by growing each tree on a separate core, unlike algorithms such as vealed label "Marx" and novel label "Karl" in "(b) AmazonCat: Rosa
Matchbox whose implementations run on only a single core. More Luxemburg..." is used by SwiftXML to correctly recommend "Karl".
parallelization can be attained in SwiftXML training by growing Furthermore, the revealed bid-phrases also help SwiftXML to ac-
each node with the same tree depth on a separate core. curately resolve advertiser intents such as selling archery items in
Sponsored Search Advertising: SwiftXML was used to rank "(e) Bing Ads: 3RiversArchery archery supplies" and selling pest
the queries that might lead to a click on a given ad shown on Bing. bird control products in "(f) Bing Ads: Arcat pest control".
While PfastreXML could only rank the queries on the basis of the
text present on the ad-landing page, SwiftXML was able to leverage
information about other queries that had already led to a click on 7 CONCLUSIONS
the ad as well as queries that had been bid on by the advertiser This paper extended the extreme classification formulation to han-
for that page. Performance is measured in terms of click-through dle warm-start scenarios by leveraging item features which can pro-
rate (CTR), bounce rate (QBR) and quality of ad recommendations vide a rich, and complementary, source of information to the user
(QOA). The bounce rate is defined as the percentage of times a features relied on by traditional extreme classifiers. The SwiftXML
user returns back immediately after viewing an ad landing page, algorithm was developed for exploiting item features and label
indicating user dissatisfaction. The quality of ad recommendations correlations as a simple, easy to implement and reproduce exten-
measures the relevance of ad recommendations to a search query sion of the PfastreXML extreme classifier. Despite its simplicity,
and is estimated by a query-ad relevance model trained on human
(a) AmazonCat: Kvutzat Yavne pickles (b) AmazonCat: Rosa Luxemburg, Women’s Liberation, and Marx’s
Philosophy of Revolution book
(c) Amazon: Xsl Fo book (d) Amazon: Barron’s IELTS with Audio CD book
(e) Bing Ads: 3RiversArchery archery supplies (f) Bing Ads: Arcat pest control products
Figure 1: Item recommendations by PfastreXML and SwiftXML on AmazonCat, Amazon and Bing Ads: PfastreXML predictions
are frequently irrelevant due to lack of informative user features (e.g. (a)), emphasis on the wrong features (e.g. (d)) and inability
to disambiguate homonyms (e.g. (e)). SwiftXML leverages item correlations (e.g. "Marx" => "Karl" in (b)) and helpful information
from revealed items and their features (e.g. (a)-(f)) to make much more accurate predictions. See text for more details. Figure
best viewed under high magnification.
SwiftXML was shown to improve prediction accuracy on the Ama- could increase the click-through rate and quality of ad recommen-
zon item-to-item recommendation task by as much as 37% and dations by 10%, and reduce the bounce rate by 31% as compared to
14% over state-of-the-art warm start recommendation algorithms a large ensemble of state-of-the-art algorithms in production. The
and extreme classifiers respectively. Furthermore, live deployment SwiftXML code will be made publically available.
for sponsored search advertising on Bing revealed that SwiftXML
REFERENCES [18] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed Repre-
[1] [n. d.]. Apache Mahout. https://mahout.apache.org. ([n. d.]). sentations of Words and Phrases and their Compositionality. CoRR abs/1310.4546
[2] [n. d.]. Mrec recommender systems library. http://mendeley.github.io/mrec. ([n. (2013).
d.]). [19] P. Mineiro and N. Karampatziakis. 2015. Fast Label Embeddings for Extremely
[3] [n. d.]. Phrase2vec. https://github.com/zseymour/phrase2vec. ([n. d.]). Large Output Spaces. In ECML.
[4] D. Agarwal and B. C. Chen. 2009. Regression-based Latent Factor Models. In [20] N. Natarajan and I. Dhillon. 2014. Inductive Matrix Completion for Predicting
KDD. 19–28. Gene-Disease Associations. In Bioinformatics.
[5] R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma. 2013. Multi-label Learning with [21] J. Pennington, R. Socher, and C. D. Manning. 2014. Glove: Global Vectors for
Millions of Labels: Recommending Advertiser Bid Phrases for Web Pages. In Word Representation. In EMNLP. 1532–1543.
WWW. [22] Y. Prabhu and M. Varma. 2014. FastXML: A fast, accurate and stable tree-classifier
[6] R. Babbar and B. Schölkopf. 2017. DiSMEC: Distributed Sparse Machines for for extreme multi-label learning. In KDD.
Extreme Multi-label Classification. In WSDM. [23] H. Raghavan and R. Iyer. 2008. Evaluating vector-space and probabilistic mod-
[7] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain. 2015. Sparse Local Embeddings els for query to ad matching. In SIGIR Workshop on Information Retrieval in
for Extreme Multi-label Classification. In NIPS. Advertising.
[8] K. Bhatia, H. Jain, Y. Prabhu, and M. Varma. [n. d.]. The Extreme Classification [24] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. 2009. BPR:
Repository: Multi-label Datasets & Code. ([n. d.]). http://manikvarma.org/ Bayesian Personalized Ranking from Implicit Feedback. In UAI.
downloads/XC/XMLRepository.html [25] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. 2014. Learning semantic represen-
[9] Y. Choi, M. Fontoura, E. Gabrilovich, V. Josifovski, M. Mediano, and B. Pang. 2010. tations using convolutional neural networks for web search. In WWW.
Using landing pages for sponsored search ad selection. In WWW. [26] D. H. Stern, R. Herbrich, and T. Graepel. 2009. Matchbox: large scale online
[10] M. Cissé, N. Usunier, T. Artières, and P. Gallinari. 2013. Robust Bloom Filters for bayesian recommendations. In WWW.
Large MultiLabel Classification Tasks. In NIPS. [27] Y. Tagami. 2017. AnnexML: Approximate Nearest Neighbor Search for Extreme
[11] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin. 2008. LIBLINEAR: Multi-label Classification. In KDD. 455–464.
A library for large linear classification. JMLR (2008). [28] J. Weston, S. Bengio, and N. Usunier. 2011. Wsabie: Scaling Up To Large Vocabu-
[12] Z. Gantner, L. Drumond, C. Freudenthaler, and L. Schmidt-Thieme. 2012. Person- lary Image Annotation. In IJCAI.
alized Ranking for Non-Uniformly Sampled Items. In Proceedings of KDD Cup [29] J. Weston, A. Makadia, and H. Yee. 2013. Label Partitioning For Sublinear Ranking.
2011. 231–247. In ICML.
[13] D. Hsu, S. Kakade, J. Langford, and T. Zhang. 2009. Multi-Label Prediction via [30] C. Xu, D. Tao, and C. Xu. 2013. A Survey on Multi-view Learning. CoRR (2013).
Compressed Sensing. In NIPS. [31] I. E. H. Yen, X. Huang, W. Dai, P. Ravikumar, I. Dhillon, and E. Xing. 2017.
[14] Y. Hu, Y. Koren, and C. Volinsky. 2008. Collaborative Filtering for Implicit PPDsparse: A Parallel Primal-Dual Sparse Method for Extreme Classification. In
Feedback Datasets. In ICDM. KDD. 545–553.
[15] H. Jain, Y. Prabhu, and M. Varma. 2016. Extreme Multi-label Loss Functions for [32] I. E. H. Yen, X. Huang, K. Zhong, P. Ravikumar, and I. S. Dhillon. 2016. PD-
Recommendation, Tagging, Ranking & Other Missing Label Applications. In Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel
KDD. Classification. In ICML.
[16] Y. Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collabo- [33] F. Zhang, T. Gong, Victor E. Lee, G. Zhao, C. Rong, and G. Qu. 2016. Fast
rative Filtering Model. In KDD. Algorithms to Evaluate Collaborative Filtering Recommender Systems. Know.-
[17] J. Lee, S. Bengio, S. Kim, G. Lebanon, and Y. Singer. 2014. Local collaborative Based Syst. (2016), 96–103.
ranking. In WWW. [34] Y. Zhang and J. G. Schneider. 2011. Multi-Label Output Codes using Canonical
Correlation Analysis. In AISTATS.