
SLIM Sparse Linear Methods For Top-N Recommender Systems

The document summarizes a new method called SLIM (Sparse Linear Method) for generating top-N recommendations in recommender systems. SLIM learns a sparse coefficient matrix solely from user purchase/rating profiles by solving a regularized optimization problem. It produces high-quality recommendations efficiently by aggregating from user profiles using this sparse coefficient matrix. Experiments show SLIM achieves better recommendation quality than other state-of-the-art methods while requiring much less time.

2011 11th IEEE International Conference on Data Mining

SLIM: Sparse Linear Methods for Top-N Recommender Systems
Xia Ning and George Karypis
Computer Science & Engineering
University of Minnesota, Minneapolis, MN
Email: {xning,[email protected]}

Abstract—This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse LInear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned by SLIM by solving an $\ell_1$-norm and $\ell_2$-norm regularized optimization problem. W is demonstrated to produce high-quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.

Keywords-Top-N Recommender Systems, Sparse Linear Methods, $\ell_1$-norm Regularization

I. INTRODUCTION

The emergence and fast growth of E-commerce have significantly changed people's traditional perspective on purchasing products by providing huge amounts of products and detailed product information, thus making online transactions much easier. However, as the number of products conforming to the customers' desires has dramatically increased, the problem has become how to effectively and efficiently help customers identify the products that best fit their personal tastes. In particular, given the user purchase/rating profiles, recommending a ranked list of items for the user so as to encourage additional purchases has the most application scenarios. This leads to the widely used top-N recommender systems.

In recent years, various algorithms for top-N recommendation have been developed [1]. These algorithms can be categorized into two classes: neighborhood-based collaborative filtering methods and model-based methods. Among the neighborhood-based methods, those based on item neighborhoods can generate recommendations very fast, but they achieve this at the expense of recommendation quality. On the other hand, model-based methods, particularly those based on latent factor models, incur a higher cost while generating recommendations, but the quality of these recommendations is higher, and they have been shown to achieve the best performance especially on large recommendation tasks.

In this paper, we propose a novel Sparse LInear Method (SLIM) for top-N recommendation that is able to make high-quality recommendations fast. SLIM learns a sparse coefficient matrix for the items in the system solely from the user purchase/rating profiles by solving a regularized optimization problem. Sparsity is introduced into the coefficient matrix, which allows it to generate recommendations efficiently. Feature selection methods allow SLIM to substantially reduce the amount of time required to learn the coefficient matrix. Furthermore, SLIM can be used to do top-N recommendations from ratings, which is a less exploited direction in recommender system research.

The SLIM method addresses the demands for high quality and efficiency in top-N recommender systems concurrently, so it is better suited for real-time applications. We conduct a comprehensive set of experiments on various datasets from different real applications. The results show that SLIM produces better recommendations than the state-of-the-art methods at a very high speed. In addition, it achieves good performance in using ratings to do top-N recommendation.

The rest of this paper is organized as follows. In Section II, a brief review on related work is provided. In Section III, definitions and notations are introduced. In Section IV, the methods are described. In Section V, the materials used for experiments are presented. In Section VI, the results are presented. Finally, Section VII presents the discussions and conclusions.

II. RELATED WORK

Top-N recommender systems are used in E-commerce applications to recommend size-N ranked lists of items that users may like the most, and they have been intensively studied during the last few years. The methods for top-N recommendation can be broadly classified into two categories. The first category is the neighborhood-based collaborative filtering methods [2]. For a certain user, user-based k-nearest-neighbor (userkNN) collaborative filtering methods first identify a set of similar users, and then recommend top-N items based on what items those similar users have purchased. Similarly, item-based k-nearest-neighbor (itemkNN) collaborative filtering methods first identify a set of similar items for each of the items that the user has purchased, and then recommend top-N items based on those similar items. The user/item similarity is calculated from the user-item purchase/rating matrix in a collaborative filtering fashion with some similarity measures (e.g., Pearson correlation, cosine similarity) applied. One advantage of the item-based methods is that they are efficient to generate recommendations due to the fact that the item neighborhood

1550-4786/11 $26.00 © 2011 IEEE 497


DOI 10.1109/ICDM.2011.134

Authorized licensed use limited to: b-on: UNIVERSIDADE DO PORTO. Downloaded on October 18,2022 at 14:47:49 UTC from IEEE Xplore. Restrictions apply.
is sparse. However, they suffer from low accuracy since there is essentially no knowledge learned about item characteristics so as to produce accurate top-N recommendations.

The second category is model-based methods, particularly latent factor models as they have achieved the state-of-the-art performance on large-scale recommendation tasks. The key idea of latent factor models is to factorize the user-item matrix into (low-rank) user factors and item factors that represent user tastes and item characteristics in a common latent space, respectively. The prediction for a user on an item can be calculated as the dot product of the corresponding user factor and item factor. There are various Matrix Factorization (MF)-based methods proposed in recent years for building such latent factor models. Cremonesi et al [3] proposed a simple Pure Singular-Value-Decomposition-based (PureSVD) matrix factorization method, which describes users and items by the most principal singular vectors of the user-item matrix. Pan et al [4] and Hu et al [5] proposed a Weighted Regularized Matrix Factorization (WRMF) method formulated as a regularized Least-Squares (LS) problem, in which a weighting matrix is used to differentiate the contributions from observed purchase/rating activities and unobserved ones. Rennie [6] and Srebro [7] proposed a Max-Margin Matrix Factorization (MMMF) method, which requires a low-norm factorization of the user-item matrix and allows unbounded dimensionality for the latent space. This is implemented by minimizing the trace-norm of the reconstructed user-item matrix from the factors. Sindhwani et al [8] proposed a Weighted Non-Negative Matrix Factorization (WNNMF) method, in which they enforce non-negativity on the user and item factors so as to lend “part-based” interpretability to the model. Hofmann [9] applied Probabilistic Latent Semantic Analysis (PLSA) technique for collaborative filtering, which has been shown equivalent to non-negative matrix factorization. PLSA introduces a latent space such that the co-occurrence of users and items (i.e., a certain user has purchased a certain item) can be rendered conditionally independent. Koren [10] proposed an intersecting approach between neighborhood-based method and MF. In his approach, item similarity is learned simultaneously with a matrix factorization so as to take advantage of both methods.

Top-N recommendation has also been formulated as a ranking problem. Rendle et al [11] proposed a Bayesian Personalized Ranking (BPR) criterion, which is the maximum posterior estimator from a Bayesian analysis and measures the difference between the rankings of user-purchased items and the rest of the items. BPR can be readily adopted for the item kNN method (BPRkNN) and MF methods (BPRMF) as a general objective function.

III. DEFINITIONS AND NOTATIONS

In this paper, the symbols $u$ and $t$ will be used to denote the users and items, respectively. Individual users and items will be denoted using different subscripts (i.e., $u_i$, $t_j$). The set of all users and items in the system will be denoted by $U$ ($|U| = m$) and $T$ ($|T| = n$), respectively. The entire set of user-item purchases/ratings will be represented by a user-item purchase/rating matrix $A$ of size $m \times n$, in which the $(i, j)$ entry (denoted by $a_{ij}$) is 1 or a positive value if user $u_i$ has ever purchased/rated item $t_j$, otherwise the entry is marked as 0. The $i$-th row of $A$ represents the purchase/rating history of user $u_i$ on all items $T$, and this row is denoted by $\mathbf{a}_i^T$. The $j$-th column of $A$ represents the purchase/rating history of all users $U$ on item $t_j$ and this column is denoted by $\mathbf{a}_j$.

In this paper, all vectors (e.g., $\mathbf{a}_i^T$ and $\mathbf{a}_j$) are represented by bold lower-case letters and all matrices (e.g., $A$) are represented by upper-case letters. Row vectors are represented by having the transpose superscript $T$, otherwise by default they are column vectors. A predicted/approximated value is denoted by having a $\sim$ head. We will use corresponding matrix/vector notations instead of user/item purchase/rating profiles if no ambiguity is raised.

IV. SPARSE LINEAR METHODS FOR Top-N RECOMMENDATION

A. SLIM for Top-N Recommendation

In this paper, we propose a Sparse LInear Method (SLIM) to do top-N recommendation. In the SLIM method, the recommendation score on an un-purchased/-rated item $t_j$ of a user $u_i$ is calculated as a sparse aggregation of items that have been purchased/rated by $u_i$, that is,

$$\tilde{a}_{ij} = \mathbf{a}_i^T \mathbf{w}_j, \qquad (1)$$

where $a_{ij} = 0$ and $\mathbf{w}_j$ is a sparse size-$n$ column vector of aggregation coefficients. Thus, the model utilized by SLIM can be presented as

$$\tilde{A} = AW, \qquad (2)$$

where $A$ is the binary user-item purchase matrix or the user-item rating matrix, $W$ is an $n \times n$ sparse matrix of aggregation coefficients, whose $j$-th column corresponds to $\mathbf{w}_j$ as in Equation 1, and each row $\tilde{\mathbf{a}}_i^T$ of $\tilde{A}$ ($\tilde{\mathbf{a}}_i^T = \mathbf{a}_i^T W$) represents the recommendation scores on all items for user $u_i$. Top-N recommendation for $u_i$ is done by sorting $u_i$'s non-purchased/-rated items based on their recommendation scores in $\tilde{\mathbf{a}}_i^T$ in decreasing order and recommending the top-N items.

B. Learning W for SLIM

We view the purchase/rating activity of user $u_i$ on item $t_j$ in $A$ (i.e., $a_{ij}$) as the ground-truth item recommendation score. Given a user-item purchase/rating matrix $A$ of size $m \times n$, we learn the sparse $n \times n$ matrix $W$ in Equation 2 as the minimizer for the following regularized optimization problem:

$$\begin{aligned} \underset{W}{\text{minimize}} \quad & \frac{1}{2}\|A - AW\|_F^2 + \frac{\beta}{2}\|W\|_F^2 + \lambda\|W\|_1 \\ \text{subject to} \quad & W \ge 0, \\ & \operatorname{diag}(W) = 0, \end{aligned} \qquad (3)$$

where $\|W\|_1 = \sum_{i=1}^{n}\sum_{j=1}^{n} |w_{ij}|$ is the entry-wise $\ell_1$-norm of $W$, and $\|\cdot\|_F$ is the matrix Frobenius norm. In Equation 3, $AW$ is the estimated matrix of recommendation scores (i.e., $\tilde{A}$) by the sparse linear model as in Equation 2. The first term $\frac{1}{2}\|A - AW\|_F^2$ (i.e., the residual sum of squares) measures how well the linear model fits the training data, and $\|W\|_F^2$ and $\|W\|_1$ are $\ell_F$-norm and $\ell_1$-norm regularization

terms, respectively. The constants $\beta$ and $\lambda$ are regularization parameters. The larger the parameters are, the more severe the regularizations are. The non-negativity constraint is applied on $W$ such that the learned $W$ represents positive relations between items, if any. The constraint $\operatorname{diag}(W) = 0$ is also applied so as to avoid trivial solutions (i.e., the optimal $W$ is an identity matrix such that an item always recommends itself so as to minimize $\frac{1}{2}\|A - AW\|_F^2$). In addition, the constraint $\operatorname{diag}(W) = 0$ ensures that $a_{ij}$ is not used to compute $\tilde{a}_{ij}$.

1) $\ell_1$-norm and $\ell_F$-norm Regularizations for SLIM: In order to learn a sparse $W$, we introduce the $\ell_1$-norm of $W$ as a regularizer in Equation 3. It is well known that $\ell_1$-norm regularization introduces sparsity into the solutions [12].

Besides the $\ell_1$-norm, we have the $\ell_F$-norm of $W$ as another regularizer, which leads the optimization problem to an elastic net problem [13]. The $\ell_F$-norm measures model complexity and prevents overfitting (as in ridge regression). Moreover, $\ell_1$-norm and $\ell_F$-norm regularization together implicitly group correlated items in the solutions [13].

2) Computing W: Since the columns of $W$ are independent, the optimization problem in Equation 3 can be decoupled into a set of optimization problems of the form

$$\begin{aligned} \underset{\mathbf{w}_j}{\text{minimize}} \quad & \frac{1}{2}\|\mathbf{a}_j - A\mathbf{w}_j\|_2^2 + \frac{\beta}{2}\|\mathbf{w}_j\|_2^2 + \lambda\|\mathbf{w}_j\|_1 \\ \text{subject to} \quad & \mathbf{w}_j \ge 0, \\ & w_{j,j} = 0, \end{aligned} \qquad (4)$$

which allows each column of $W$ to be solved independently. In Equation 4, $\mathbf{w}_j$ is the $j$-th column of $W$ and $\mathbf{a}_j$ is the $j$-th column of $A$, $\|\cdot\|_2$ is the $\ell_2$-norm of vectors, and $\|\mathbf{w}_j\|_1 = \sum_{i=1}^{n} |w_{ij}|$ is the entry-wise $\ell_1$-norm of vector $\mathbf{w}_j$. Due to the column-wise independence property of $W$, learning $W$ can be easily parallelized. The optimization problem of Equation 4 can be solved using coordinate descent and soft thresholding [14].

3) SLIM with Feature Selection: The estimation of $\mathbf{w}_j$ in Equation 4 can be considered as the solution to a regularized regression problem in which the $j$-th column of $A$ is the dependent variable to be estimated as a function of the remaining $n - 1$ columns of $A$ (independent variables). This view suggests that feature selection methods can potentially be used to reduce the number of independent variables prior to computing $\mathbf{w}_j$. The advantage of such feature selection methods is that they reduce the number of columns in $A$, which can substantially decrease the overall amount of time required for SLIM learning.

Motivated by these observations, we extended the SLIM method to incorporate feature selection. We will refer to these methods as fsSLIM. Even though many feature selection approaches can be used, in this paper we only investigated one approach, inspired by the itemkNN top-N recommendation algorithms. Specifically, since the goal is to learn a linear model to estimate the $j$-th column of $A$ (i.e., $\mathbf{a}_j$), then the columns of $A$ that are the most similar to $\mathbf{a}_j$ can be used as the selected features. As our experiments will later show, using the cosine similarity and this feature selection approach leads to a method that has considerably lower computational requirements with minimal quality degradation.

C. Efficient Top-N Recommendation from SLIM

The SLIM method in Equation 2 and the sparsity of $W$ enable significantly faster top-N recommendation. In Equation 2, $\mathbf{a}_i^T$ is always very sparse (i.e., the user usually purchased/rated only a small fraction of all items), and when $W$ is also sparse, the calculation of $\tilde{\mathbf{a}}_i^T$ can be very fast by exploiting $W$'s sparse structure (i.e., applying a “gather” operation along $W$ columns on its non-zero values in the rows corresponding to non-zero values in $\mathbf{a}_i^T$). Thus, the computational complexity to do recommendations for user $u_i$ is $O(n_{a_i} \times n_w + N \log(N))$, where $n_{a_i}$ is the number of non-zero values in $\mathbf{a}_i^T$, and $n_w$ is the average number of non-zero values in the rows of $W$. The $N \log(N)$ term is for sorting the highest scoring $N$ items, which can be selected from the potentially $n_{a_i} \times n_w$ non-zero entries in $\tilde{\mathbf{a}}_i^T$ in linear time using linear selection.

D. SLIM vs Existing Linear Methods

Linear methods have already been used for top-N recommendation. For example, the itemkNN method in [2] has a linear model similar to that of SLIM. The model of itemkNN is a kNN item-item cosine similarity matrix $S$, that is, each row $\mathbf{s}_i^T$ has exactly $k$ nonzero values representing the cosine similarities between item $t_j$ and its $k$ most similar neighbors. The fundamental difference between itemkNN and SLIM's linear models is that the former is highly dependent on the pre-specified item-item similarity measure used to identify the neighbors, whereas the latter generates $W$ by solving the optimization problem of Equation 3. In this way, $W$ can potentially encode rich and subtle relations across items that may not be easily captured by conventional item-item similarity metrics. This is validated by the experimental results in Section VI that show that $W$ substantially outperforms $S$.

Rendle et al [11] discussed an adaptive k-Nearest-Neighbor method, which used the same model as in itemkNN in [2] but adaptively learned the item-item similarity matrix. However, the item-item similarity matrix in [11] is fully dense, symmetric and has negative values. $W$ is different from Rendle et al's item-item similarity matrix in that, in addition to its sparsity which leads to fast recommendation and low requirement for storage, $W$ is not necessarily symmetric due to the optimization process and thus allows more flexibility for recommendation.

Paterek [15] introduced a linear model for each item for rating prediction, in which the rating of a user $u_i$ on an item $t_j$ is calculated as the aggregation of the ratings of $u_i$ on all other items. They learned the aggregation coefficients (equivalent to $W$) by solving an $\ell_2$-norm regularized least squares problem for each item. The learned coefficients are fully dense. The advantage of SLIM over Paterek's method is that $\ell_1$-norm regularization is incorporated during learning, which enforces $W$ to be sparse, and thus, the most informative signals are captured in $W$ while noise is discarded. In addition, SLIM learns $W$ from all purchase/rating activities so as to better fuse information, compared to Paterek's method, which only uses a certain set of purchase/rating activities.
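The per-column problem in Equation 4 can be illustrated with a small coordinate descent routine. This is a minimal sketch under stated assumptions, not the authors' C implementation: the function name `learn_slim_column`, the dense list-of-columns layout, and the fixed number of sweeps are all hypothetical choices made for readability. With the non-negativity constraint, the soft-thresholding step for a single coefficient reduces to clipping $(\rho_k - \lambda) / (\|\mathbf{a}_k\|_2^2 + \beta)$ at zero, where $\rho_k$ is the correlation of column $k$ with the residual computed without column $k$'s own contribution.

```python
def learn_slim_column(A_cols, j, beta, lam, n_sweeps=50):
    """Approximately solve Equation 4 for column j of W.

    A_cols is a list of n columns of A, each a list of m numbers
    (a dense layout assumed here only to keep the sketch short).
    Returns w_j as a list of n non-negative floats with w_j[j] == 0.
    """
    n = len(A_cols)
    w = [0.0] * n
    residual = [float(x) for x in A_cols[j]]  # a_j - A w, with w = 0
    sq_norms = [sum(x * x for x in col) for col in A_cols]

    for _ in range(n_sweeps):
        for k in range(n):
            if k == j:
                continue  # enforces the constraint w_{j,j} = 0
            denom = sq_norms[k] + beta
            if denom == 0.0:
                continue  # empty column and no ridge term: nothing to fit
            col = A_cols[k]
            # Correlation of column k with the residual that excludes
            # column k's current contribution.
            rho = sum(c * r for c, r in zip(col, residual)) + sq_norms[k] * w[k]
            # Soft thresholding combined with the constraint w >= 0.
            new_wk = max(0.0, rho - lam) / denom
            delta = new_wk - w[k]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[k] = new_wk
    return w


# Toy data: items 0 and 1 are always bought together, item 2 is unrelated.
A_columns = [
    [1, 1, 0],  # item 0 across three users
    [1, 1, 0],  # item 1
    [0, 0, 1],  # item 2
]
w0 = learn_slim_column(A_columns, j=0, beta=0.5, lam=0.1)
```

In this toy case only the coefficient linking item 0 to item 1 is non-zero, which is the sparsity effect the $\ell_1$ term is meant to produce. Because each column is solved independently, calls for different `j` could be distributed across workers.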

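The “gather” computation described in Section IV-C can be sketched as follows. The data layout is an assumption made for illustration: `W_rows` maps each item to a dictionary of its non-zero coefficients along that row of $W$, and a user profile is a dictionary of purchased items. The function name `recommend_top_n` is hypothetical, and `heapq.nlargest` stands in for the linear selection step mentioned in the text.

```python
import heapq


def recommend_top_n(user_profile, W_rows, n_top):
    """Score unpurchased items as in Equation 1 and return the best n_top.

    user_profile: dict item -> purchase/rating value (non-zeros of a_i).
    W_rows: dict item k -> dict {item j: w_kj} holding non-zeros of row k of W.
    Returns a list of (item, score) pairs in decreasing score order.
    """
    scores = {}
    for k, a_ik in user_profile.items():
        # Only rows of W for items the user has purchased contribute.
        for j, w_kj in W_rows.get(k, {}).items():
            if j in user_profile:
                continue  # recommend only un-purchased/-rated items
            scores[j] = scores.get(j, 0.0) + a_ik * w_kj
    return heapq.nlargest(n_top, scores.items(), key=lambda kv: kv[1])


# Hypothetical sparse W over four items and a user who bought items 0 and 2.
W_rows = {
    0: {1: 0.5, 2: 0.2},
    1: {0: 0.5, 3: 0.4},
    2: {3: 0.3},
}
top = recommend_top_n({0: 1.0, 2: 1.0}, W_rows, n_top=2)
```

The work done is proportional to the number of non-zeros touched in the selected rows of $W$, which mirrors the $n_{a_i} \times n_w$ term in the stated complexity.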
TABLE I: The Datasets Used in Evaluation

dataset   #users   #items   #trns        rsize    csize    density   ratings
ccard     42,067   18,004   308,420      7.33     17.13    0.04%     -
ctlg2     22,505   17,096   1,814,072    80.61    106.11   0.47%     -
ctlg3     58,565   37,841   453,219      7.74     11.98    0.02%     -
ecmrc     6,594    3,972    50,372       7.64     12.68    0.19%     -
BX        3,586    7,602    84,981       23.70    11.18    0.31%     1-10
ML10M     69,878   10,677   10,000,054   143.11   936.60   1.34%     1-10
Netflix   39,884   8,478    1,256,115    31.49    148.16   0.37%     1-5
Yahoo     85,325   55,371   3,973,104    46.56    71.75    0.08%     1-5

Columns corresponding to #users, #items and #trns show the number of users, number of items and number of transactions, respectively, in each dataset. Columns corresponding to rsize and csize show the average number of transactions for each user and on each item (i.e., row density and column density of the user-item matrix), respectively, in each dataset. Column corresponding to density shows the density of each dataset (i.e., density = #trns/(#users × #items)). Column corresponding to ratings shows the rating range of each dataset with granularity 1.

E. Relations between SLIM and MF Methods

MF methods for top-N recommendation have a model

$$\tilde{A} = U V^T, \qquad (5)$$

where $U$ and $V^T$ are the user and item factors, respectively. Comparing the MF model in Equation 5 and the SLIM method in Equation 2, we can see that SLIM's model can be considered as a special case of the MF model (i.e., $A$ is equivalent to $U$ and $W$ is equivalent to $V^T$).

$U$ and $V^T$ in Equation 5 are in a latent space, whose dimensionality is usually specified as a parameter. The “latent” space becomes exactly the item space in Equation 2, and therefore, in SLIM there is no need to learn user representation in the “latent” space and thus the learning process is simplified. On the other hand, $U$ and $V^T$ are typically of low dimensionality, and thus useful information may potentially get lost during the low-rank approximation of $A$ from $U$ and $V^T$. On the contrary, in SLIM, since information on users is fully preserved in $A$ and the counterpart on items is optimized via the learning, SLIM can potentially give better recommendations than MF methods.

In addition, since both $U$ and $V^T$ in Equation 5 are typically dense, the computation of $\tilde{\mathbf{a}}_i^T$ requires the calculation of each $\tilde{a}_{ij}$ from its corresponding dense vectors in $U$ and $V^T$. This results in a high computational complexity for MF methods to do recommendations, that is, $O(k^2 \times n)$ for each user, where $k$ is the number of latent factors, and $n$ is the number of items. The computational complexity can be potentially reduced by utilizing the sparse matrix factorization algorithms developed in [16], [17], [18]. However, none of these sparse matrix factorization algorithms have been applied to solve the top-N recommendation problem due to their high computational costs.

V. MATERIALS

A. Datasets

We evaluated the performance of SLIM methods on eight different real datasets whose characteristics are shown in Table I. These datasets can be broadly classified into two categories.

The first category (containing ccard, ctlg2, ctlg3 and ecmrc [2]) was derived from customer purchasing transactions. Specifically, the ccard dataset corresponds to credit card purchasing transactions of a major department store, in which each card has at least 5 transactions. The ctlg2 and ctlg3 datasets correspond to the catalog purchasing transactions of two major mail-order catalog retailers. The ecmrc dataset corresponds to web-based purchasing transactions of an e-commerce site. These four datasets have only binary purchase information.

The second category (containing BX, ML10M, Netflix and Yahoo) contains multi-value ratings. All the ratings are converted to binary indications if needed. In particular, the BX dataset is a subset from the Book-Crossing dataset¹, in which each user has rated at least 20 items and each item has been rated by at least 5 users and at most 300 users. The ML10M dataset corresponds to movie ratings and was obtained from the MovieLens² research projects. The Netflix dataset is a subset extracted from the Netflix Prize dataset³ such that each user has rated 20 – 250 movies, and each movie is rated by 20 – 50 users. Finally, the Yahoo dataset is a subset extracted from Yahoo! Music user ratings of songs, provided as part of the Yahoo! Research Alliance Webscope program⁴. In the Yahoo dataset, each user has rated 20 – 200 songs, and each song has been rated by at least 10 users and at most 5000 users.

¹ http://www.informatik.uni-freiburg.de/~cziegler/BX/
² http://www.grouplens.org/node/12
³ http://www.netflixprize.com/
⁴ http://research.yahoo.com/Academic_Relations

B. Evaluation Methodology & Metrics

We applied 5-time Leave-One-Out cross validation (LOOCV) to evaluate the performance of SLIM methods. In each run, each of the datasets is split into a training set and a testing set by randomly selecting one of the non-zero entries of each user and placing it into the testing set. The training set is used to train a model, then for each user a size-N ranked list of recommended items is generated by the model. The evaluation is conducted by comparing the recommendation list of each user and the item of that user in the testing set. In the majority of the results reported in Section VI, N is equal to 10. However, we also report some limited results for different values of N.

The recommendation quality is measured by the Hit Rate (HR) and the Average Reciprocal Hit-Rank (ARHR) [2]. HR is defined as follows,

$$\text{HR} = \frac{\#\text{hits}}{\#\text{users}}, \qquad (6)$$

where #users is the total number of users, and #hits is the number of users whose item in the testing set is recommended (i.e., hit) in the size-N recommendation list. A second measure for evaluation is ARHR, which is defined as follows:

$$\text{ARHR} = \frac{1}{\#\text{users}} \sum_{i=1}^{\#\text{hits}} \frac{1}{p_i}, \qquad (7)$$

where if an item of a user is hit, $p$ is the position of the item in the ranked recommendation list. ARHR is a weighted version

of HR and it measures how strongly an item is recommended, in which the weight is the reciprocal of the hit position in the recommendation list.

For the experiments utilizing ratings, we evaluate the performance of the methods by looking at how well they can recommend the items that have a particular rating value. For this purpose, we define the per-rating Hit Rate (rHR) and cumulative Hit Rate (cHR). The rHR is calculated as the hit rate on items that have a certain rating value. cHR is calculated as the hit rate on items that have a rating value which is no lower than a certain rating threshold.

Note that in the top-N recommendation literature, there exist other metrics for evaluation. Such metrics include area under the ROC curve (AUC), which measures relative positions of true positives and false positives in an entire ranked list. Variants of AUC can measure the positions in the top part of a ranked list. Another popular metric is recall. However, in the top-N recommendation scenario, we believe HR and ARHR are the most direct and meaningful measures, since the users only care if a short recommendation list has the items of interest or not rather than a very long recommendation list. Due to this we use HR and ARHR in our evaluation.

All the algorithms compared in Section VI are implemented in C. All the experiments are done on a Linux cluster with 6-core Intel Xeon X7542 “Westmere” processors at 2.66 GHz.

VI. EXPERIMENTAL RESULTS

In this section, we present the performance of SLIM methods and compare them with other popular top-N recommendation methods. We present the results from two sets of experiments. In the first set of experiments, all the top-N recommendation methods use binary user-item purchase information during learning, and thus all the methods are appended by -b to indicate binary data used (e.g., SLIM-b) if there is confusion. In the second set of experiments, all the top-N recommendation methods use user-item rating information during learning, and correspondingly they are appended by -r if there is confusion.

We optimized all the C implementations of the algorithms to make sure that any time difference in performance is due to the algorithms themselves, and not due to the implementation. For all the methods, we conducted an exhaustive grid search to identify the best parameters to use. We only report the performance corresponding to the parameters that lead to the best results in this section.

A. SLIM Performance on Binary data

1) Comparison Algorithms: We compare the performance of SLIM with another three categories of top-N recommendation algorithms. The first category of algorithms is the item/user neighborhood-based collaborative filtering methods itemkNN, itemprob and userkNN. Methods itemkNN and userkNN are as discussed in Section II and Section IV-E. Method itemprob is similar to itemkNN except that instead of item-item cosine similarity, it uses modified item-item transition probabilities. These methods are engineered with various heuristics for better performance⁵.

⁵ http://glaros.dtc.umn.edu/gkhome/suggest/overview

The second category of algorithms is the MF methods, including PureSVD and WRMF as discussed in Section II. Note that both PureSVD and WRMF use 0 values in the user-item matrix during learning. PureSVD is demonstrated to outperform other MF methods in top-N recommendation using ratings [3], including the MF methods which treat 0s as missing data. WRMF represents the state-of-the-art matrix factorization methods for top-N recommendation using binary information.

The third category of algorithms is the methods which rely on ranking/retrieval criteria, including BPRMF and BPRkNN as discussed in Section II and Section IV-E. It is demonstrated in [11] that BPRMF outperforms other methods in top-N recommendation in terms of AUC measure using binary information.

2) Top-N Recommendation Performance: Table II shows the overall performance of different top-N recommendation algorithms. These results show that SLIM produces recommendations that are consistently better (in terms of HR and ARHR) than other methods over all the datasets except ML10M (SLIM has HR 0.311 on ML10M, which is only worse than BPRkNN with HR 0.327). In terms of HR, SLIM is on average 19.67%, 12.91%, 22.41%, 50.80%, 13.42%, 14.32% and 12.95% better than itemkNN, itemprob, userkNN, PureSVD, WRMF, BPRMF and BPRkNN, respectively, over all the eight datasets. Similar performance gains can also be observed with respect to ARHR. Among the three MF-based models, WRMF and BPRMF have similar performance, which is substantially better than PureSVD on all datasets except ML10M and Netflix. BPRkNN has better performance than MF methods on large datasets (i.e., ML10M, Netflix and Yahoo) but worse than MF methods on small datasets.

In terms of recommendation efficiency, SLIM is comparable to itemkNN and itemprob (i.e., the required times range in seconds), but considerably faster than the other methods (i.e., the required times range in minutes). The somewhat worse efficiency of SLIM compared to itemkNN is due to the fact that the resulting best $W$ matrix is denser than the best performing item-item cosine similarity matrix from itemkNN. PureSVD, WRMF and BPRMF have worse computational complexities (i.e., linear to the product of the number of items and the dimensionality of the latent space), which is validated by their high recommendation run time. BPRkNN produces a fully dense item-item similarity matrix, which is responsible for its high recommendation time.

In terms of the amount of time required to learn the models, we see that the time required by itemkNN/itemprob is substantially smaller than the rest of the methods. The amount of time required by SLIM to learn its model, relative to PureSVD, WRMF, BPRMF and BPRkNN, varies depending on the datasets. However, even though SLIM is slower on some datasets (e.g., ML10M and Yahoo), this situation can be easily remediated by the feature-selection-based fsSLIM as will be discussed later in Section VI-A3.

One thing that is surprising with the results shown in Table II is that the MF-based methods are sometimes even worse than the simple itemkNN, itemprob and userkNN in terms of HR. For example, BPRMF performs worse for BX, ML10M, Netflix and Yahoo. This may be because in the BPRMF

TABLE II: Comparison of Top-N Recommendation Algorithms
ccard (left block of columns) and ctlg2 (right block of columns)
method params HR ARHR mt tt params HR ARHR mt tt
itemkNN 50 - 0.195 0.145 0.54(s) 1.34(s) 10 - 0.222 0.108 33.78(s) 1.19(s)
itemprob 50 0.2 0.226 0.154 0.97(s) 1.24(s) 10 0.5 0.222 0.105 47.86(s) 0.99(s)
userkNN 150 - 0.189 0.122 0.06(s) 14.84(s) 50 - 0.204 0.106 0.37(s) 48.45(s)
PureSVD 3500 10 0.101 0.058 42.89(m) 2.65(h) 1300 10 0.196 0.099 3.95(m) 18.46(m)
WRMF 250 15 0.230 0.150 4.01(h) 9.14(m) 300 10 0.235 0.114 20.42(h) 5.49(s)
BPRMF 350 0.3 0.238 0.157 1.29(h) 6.64(m) 400 0.1 0.249 0.123 9.72(h) 3.14(m)
BPRkNN 1e-4 0.01 0.208 0.145 2.38(m) 8.15(m) 0.001 0.001 0.224 0.104 1.28(h) 22.03(m)
SLIM 5 0.5 0.246 0.170 17.24(m) 13.57(s) 5 2.0 0.272 0.140 7.24(h) 26.98(s)
fsSLIM 100 0.5 0.243 0.168 4.97(m) 4.45(s) 100 1.0 0.282 0.149 10.92(m) 5.00(s)
fsSLIM 50 0.5 0.244 0.169 2.40(m) 3.34(s) 10 0.5 0.262 0.138 4.21(m) 2.01(s)

ctlg3 (left block of columns) and ecmrc (right block of columns)
method params HR ARHR mt tt params HR ARHR mt tt
itemkNN 300 - 0.544 0.313 0.55(s) 6.66(s) 300 - 0.218 0.125 0.06(s) 0.54(s)
itemprob 400 0.3 0.558 0.322 0.87(s) 7.62(s) 30 0.2 0.245 0.138 0.09(s) 0.12(s)
userkNN 350 - 0.492 0.285 0.11(s) 19.18(s) 400 - 0.212 0.119 0.01(s) 0.78(s)
PureSVD 3000 10 0.373 0.210 1.11(h) 4.28(h) 1900 10 0.186 0.110 3.67(m) 3.22(m)
WRMF 420 20 0.543 0.308 14.42(h) 50.67(m) 270 15 0.242 0.133 3.22(h) 13.60(s)
BPRMF 300 0.5 0.541 0.283 1.49(h) 13.66(m) 350 0.1 0.249 0.128 4.00(m) 12.76(s)
BPRkNN 0.001 1e-4 0.542 0.304 6.20(m) 20.28(m) 1e-5 0.010 0.242 0.130 1.02(m) 13.53(s)
SLIM 3 0.5 0.579 0.347 1.02(h) 16.23(s) 5 0.5 0.255 0.149 11.10(s) 0.51(s)
fsSLIM 100 0.0 0.546 0.292 12.57(m) 9.62(s) 100 0.5 0.252 0.147 16.89(s) 0.32(s)
fsSLIM 400 0.5 0.570 0.339 14.27(m) 12.52(s) 30 0.5 0.252 0.147 5.41(s) 0.16(s)

BX (left block of columns) and ML10M (right block of columns)
method params HR ARHR mt tt params HR ARHR mt tt
itemkNN 10 - 0.085 0.044 1.34(s) 0.08(s) 20 - 0.238 0.106 1.97(m) 8.93(s)
itemprob 30 0.3 0.103 0.050 2.11(s) 0.22(s) 20 0.5 0.237 0.106 1.88(m) 7.49(s)
userkNN 100 - 0.083 0.039 0.01(s) 1.49(s) 50 - 0.303 0.146 2.26(s) 34.42(m)
PureSVD 1500 10 0.072 0.037 1.91(m) 2.57(m) 170 10 0.294 0.139 1.68(m) 1.72(m)
WRMF 400 5 0.086 0.040 12.01(h) 29.77(s) 100 2 0.306 0.139 16.27(h) 1.59(m)
BPRMF 350 0.1 0.089 0.040 8.95(m) 12.44(s) 350 0.1 0.281 0.123 4.77(h) 5.20(m)
BPRkNN 1e-4 0.010 0.082 0.035 5.16(m) 42.23(s) 0.001 1e-4 0.327 0.156 15.78(h) 1.08(h)
SLIM 3 0.5 0.109 0.055 5.51(m) 1.39(s) 1 2.0 0.311 0.153 50.98(h) 41.59(s)
fsSLIM 100 0.5 0.109 0.053 36.26(s) 0.63(s) 100 0.5 0.311 0.152 37.12(m) 17.97(s)
fsSLIM 30 1.0 0.105 0.055 16.07(s) 0.18(s) 20 1.0 0.298 0.145 14.26(m) 8.87(s)

Netflix (left block of columns) and Yahoo (right block of columns)
method params HR ARHR mt tt params HR ARHR mt tt
itemkNN 150 - 0.178 0.088 24.53(s) 13.17(s) 400 - 0.107 0.041 21.54(s) 2.25(m)
itemprob 10 0.5 0.177 0.083 30.36(s) 1.01(s) 350 0.5 0.107 0.041 34.23(s) 1.90(m)
userkNN 200 - 0.154 0.077 0.33(s) 1.04(m) 50 - 0.107 0.041 18.46(s) 3.26(m)
PureSVD 3500 10 0.182 0.092 29.86(m) 21.29(m) 170 10 0.074 0.027 53.05(s) 11.18(m)
WRMF 350 10 0.184 0.085 22.47(h) 2.63(m) 200 8 0.090 0.032 16.23(h) 50.05(m)
BPRMF 400 0.1 0.156 0.071 43.55(m) 3.56(m) 400 0.1 0.093 0.033 10.36(h) 47.28(m)
BPRkNN 0.01 0.01 0.188 0.092 10.91(m) 6.12(m) 0.01 0.001 0.104 0.038 2.60(h) 4.11(h)
SLIM 5 1.0 0.200 0.102 7.85(h) 9.84(s) 5 0.5 0.122 0.047 21.30(h) 5.69(m)
fsSLIM 100 0.5 0.202 0.104 6.43(m) 5.73(s) 100 0.5 0.124 0.048 1.39(m) 41.24(s)
fsSLIM 150 0.5 0.202 0.104 9.09(m) 7.47(s) 400 0.5 0.123 0.048 2.41(m) 1.72(m)
Columns labeled params present the parameters for the corresponding method. For methods itemkNN and userkNN, the parameter is the number of neighbors. For method itemprob, the parameters are the number of neighbors and the transition parameter α. For method PureSVD, the parameters are the number of singular values and the number of iterations during SVD. For method WRMF, the parameters are the dimension of the latent space and the weight on purchases. For method BPRMF, the parameters are the dimension of the latent space and the learning rate, respectively. For method BPRkNN, the parameters are the learning rate and the regularization parameter λ. For method SLIM, the parameters are the ℓ2-norm regularization parameter β and the ℓ1-norm regularization parameter λ. For method fsSLIM, the parameters are the number of neighbors and the ℓ1-norm regularization parameter λ. Columns labeled HR and ARHR present the hit rate and the average reciprocal hit-rank, respectively. Columns labeled mt and tt present the time used by model learning and recommendation, respectively. The mt/tt numbers marked (s), (m) and (h) are times in seconds, minutes and hours, respectively. Bold numbers are the best performance in terms of HR for each dataset.
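As an illustration of the two quality measures reported in Table II, the following minimal sketch computes HR and ARHR for a leave-one-out evaluation. It assumes the usual definitions, HR = #hits / #users and ARHR = (1 / #users) Σ 1/p_i, where p_i is the 1-based position of the hit in the top-N list. The function name hr_arhr and the toy data are hypothetical and are not taken from the evaluated implementations.

```python
from typing import Dict, List, Tuple


def hr_arhr(top_n: Dict[int, List[int]], held_out: Dict[int, int]) -> Tuple[float, float]:
    """Hit rate and average reciprocal hit-rank for leave-one-out testing.

    top_n[u] is the ranked top-N list for user u; held_out[u] is the single
    test item hidden from user u during training (assumed evaluation setup).
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for user, test_item in held_out.items():
        ranked = top_n.get(user, [])
        if test_item in ranked:
            hits += 1
            # Positions are 1-based, so a hit at the top contributes 1.0.
            reciprocal_rank_sum += 1.0 / (ranked.index(test_item) + 1)
    n_users = len(held_out)
    return hits / n_users, reciprocal_rank_sum / n_users


# Toy data (hypothetical): four users with top-3 lists and one test item each.
recs = {0: [5, 2, 9], 1: [1, 7, 3], 2: [4, 8, 6], 3: [2, 0, 1]}
test = {0: 2, 1: 1, 2: 7, 3: 1}
hr, arhr = hr_arhr(recs, test)
print(f"HR={hr:.3f} ARHR={arhr:.3f}")
```

Because ARHR weights each hit by the reciprocal of its position, two methods with the same HR can differ in ARHR when one of them places its hits nearer the top of the list.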

paper [11], the authors evaluated the entire AUC curve to measure whether the items of interest are ranked higher than the rest. However, a good AUC value does not necessarily lead to good performance at the top-N positions of the ranked list. In addition, in the case of PureSVD, the best performance is achieved when a rather large number of singular values is used (e.g., ccard, ctlg3, BX and Netflix).

3) fsSLIM Performance: Table II also presents the results for the SLIM version that utilizes feature selection (rows labeled fsSLIM). In the first set of experiments we used the item-item cosine similarity (as in itemkNN) and, for each column of A, selected its 100 most similar other columns and used the resulting m × 100 matrix to estimate the coefficient matrix in Equation 3. The results are shown in the first fsSLIM row in Table II. In the second set of experiments we selected the most similar columns of A based on item-item cosine similarity or item-item probability similarity (as in itemprob), whichever performs best, together with the corresponding number of columns. The results of these experiments are shown in the second fsSLIM row.
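The feature-selection step described above can be sketched as follows. This is a minimal NumPy illustration, not the C implementation that was timed in Table II. It assumes the per-column objective 1/2 ||a_j - A w_j||^2 + β/2 ||w_j||^2 + λ ||w_j||_1 with w_j ≥ 0 and w_jj = 0, solved by a plain coordinate descent loop; the function names cosine_top_k and fs_slim_column and the toy matrix are hypothetical.

```python
import numpy as np


def cosine_top_k(A: np.ndarray, j: int, k: int) -> np.ndarray:
    """Indices of the k columns of A most cosine-similar to column j (j excluded)."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for empty columns
    sims = (A.T @ A[:, j]) / (norms * norms[j])
    sims[j] = -np.inf  # never select the target column itself
    return np.argsort(-sims)[:k]


def fs_slim_column(A: np.ndarray, j: int, k: int, beta: float, lam: float,
                   n_iter: int = 100) -> np.ndarray:
    """Estimate column j of W using only its k most similar columns as features."""
    cols = cosine_top_k(A, j, k)
    X, y = A[:, cols], A[:, j]
    w = np.zeros(len(cols))
    sq = (X ** 2).sum(axis=0)          # squared norm of each selected column
    residual = y.astype(float).copy()  # equals y - X @ w for w = 0
    for _ in range(n_iter):
        for f in range(len(cols)):
            denom = sq[f] + beta
            if denom == 0:
                continue
            # Correlation of feature f with the residual that excludes f itself.
            rho = X[:, f] @ residual + sq[f] * w[f]
            # Non-negative soft-thresholding step for the l1 + l2 penalty.
            new_w = max(0.0, rho - lam) / denom
            residual -= X[:, f] * (new_w - w[f])
            w[f] = new_w
    w_full = np.zeros(A.shape[1])
    w_full[cols] = w  # entries outside the selected neighbors (and w_jj) stay 0
    return w_full


# Toy binary purchase matrix (hypothetical): 5 users x 4 items.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
w = fs_slim_column(A, j=0, k=2, beta=0.5, lam=0.1)
print(w)
```

Restricting each regression to k candidate columns shrinks the problem from all items to k features per column, which is consistent with the much shorter learning times reported in the fsSLIM rows.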

There are three important observations that can be made from the fsSLIM results. First, the performance of fsSLIM is comparable to that of SLIM for nearly all the datasets. Second, the amount of time required by fsSLIM to learn the model is much smaller than that required by SLIM. Third, using fsSLIM to estimate W, whose sparsity structure is constrained by the itemkNN/itemprob neighbors, leads to significantly better recommendation performance than itemkNN/itemprob itself. This suggests that we can utilize feature selection to improve the learning time without decreasing the performance.

4) Regularization Effects in SLIM: Figure 1 shows the effects of ℓ1-norm and ℓ2-norm regularization in terms of recommendation time (which directly depends on how sparse W is) and HR on the BX dataset (similar results are observed for all the other datasets). Figure 1 demonstrates that as greater ℓ1-norm regularization (i.e., larger λ in Equation 3) is applied, lower recommendation time is achieved, indicating that the learned W is sparser. Figure 1 also shows the joint effects of ℓ1-norm and ℓ2-norm regularization on recommendation quality. The best recommendation quality is achieved when both of the regularization parameters β and λ are non-zero. In addition, the recommendation quality changes smoothly as the regularization parameters β and λ change.

Fig. 1: ℓ1-Norm and ℓ2-Norm Regularization Effects on BX. Panels: (a) Recommendation time (s), (b) HR. Each panel is plotted over λ (horizontal axis) and β (vertical axis), with values 0.0, 0.5, 1.0, 2.0, 3.0 and 5.0.

5) SLIM for the Long-Tail Distribution: The long-tail effect, which refers to the fact that a disproportionately large number of purchases/ratings are concentrated in a small number of items (popular items), has been a concern for recommender systems. Popular items tend to dominate the recommendations, making it difficult to produce novel and diverse recommendations. Because we eliminated the items that are purchased/rated by many users while constructing the BX, Netflix and Yahoo datasets, these datasets do not suffer from long-tail effects. The results presented in Table II for these datasets demonstrate that SLIM is superior to the other methods in producing non-trivial top-N recommendations when no significantly popular items are present.

The plot in Figure 2 demonstrates the long-tail distribution of the items in the ML10M dataset, in which only 1% of the items contribute 20% of the ratings. We eliminated these 1% most popular items and used the remaining ratings for all the top-N methods during learning. The results are presented in Table III. These results show that the performance of all methods is notably worse than the corresponding performance in Table II, in which the “short head” (i.e., the most popular items) is present. However, SLIM outperforms the rest of the methods. In particular, SLIM outperforms BPRkNN, even though BPRkNN does better than SLIM in Table II, where the popular items are present in ML10M. This conforms to the observations based on the BX, Netflix and Yahoo results, namely that SLIM is resistant to the long-tail effects.

Fig. 2: Purchase/Rating Distribution in ML10M. Axes: % of purchases/ratings (horizontal) and % of items (vertical); the short head (popular items) and the long tail (unpopular items) are marked.

TABLE III: Performance on the Long Tail of ML10M
ML10M long tail
method params HR ARHR mt tt
itemkNN 10 - 0.130 0.052 1.59(m) 4.62(s)
itemprob 10 0.5 0.126 0.051 1.65(m) 4.04(s)
userkNN 50 - 0.162 0.069 2.10(s) 20.43(m)
PureSVD 350 70 0.224 0.096 2.98(m) 10.45(m)
WRMF 100 2 0.232 0.097 23.15(h) 1.74(m)
BPRMF 300 0.01 0.240 0.102 22.63(h) 8.56(m)
BPRkNN 0.001 1e-4 0.239 0.098 15.72(h) 36.42(m)
SLIM 1 5.0 0.256 0.106 57.55(h) 47.69(s)
fsSLIM 10 5.0 0.255 0.105 25.37(m) 9.57(s)
fsSLIM 100 4.0 0.255 0.105 58.32(m) 19.32(s)
The 1% most popular items are eliminated from ML10M. Params have the same meanings as those in Table II.

6) Recommendation for Different Top-N: Figure 3 shows the performance of the methods for different values of N (i.e., 5, 10, 15, 20 and 25) for the BX, ML10M, Netflix and Yahoo datasets. Table IV presents the performance difference between SLIM and the best of the other methods in terms of HR on the four datasets. For example, 0.012 in Table IV for BX when N = 5 is calculated as the difference between SLIM’s HR and the best HR from all the other methods on BX when the top-5 items are recommended. The performance difference between SLIM and the best of the other methods is higher for smaller values of N across the datasets BX, ML10M and Netflix. Figure 3 and Table IV demonstrate that SLIM produces better recommendations than the other methods when a smaller number of items is recommended. This indicates that SLIM tends to rank the most relevant items higher than the other methods do.

7) Sparsity Pattern of W: We use ML10M as an example to illustrate what SLIM is learning. The item-item similarity matrix S constructed from itemkNN and the W from SLIM are shown in Figure 4. Note that in Figure 4, the S matrix is obtained using 100 nearest neighbors. The density of the matrices produced by itemkNN and SLIM is 0.936% and 0.935%, respectively. However, their sparsity patterns are

different. First, the S matrix has non-zero item-item similarity values that are clustered towards the diagonal, while W has non-zero values that are more evenly distributed. Second, during recommendation, on average 53.60 non-zero values in S contribute to the recommendation score calculation on one item for one user, whereas in the case of W, on average 14.79 non-zero values make contributions, which is roughly 1/3 of that in S. W recovers 31.8% of the non-zero entries in S (those entries have larger values than average) and it also discovers new non-zero entries that are not in S. The newly discovered item-item similarities contribute to 37.1% of the hits from W. This suggests that W, even though it is also very sparse, recovers some subtle relations that are not captured by the item-item cosine similarity measure, which brings performance improvement. In SLIM, an item tk that is co-purchased with item tj also contributes to the similarity between item tj and another item ti, even if tk has never been co-purchased with ti. Furthermore, treating missing values as 0s helps to generalize. Including all missing values as 0s in the wj vector in Equation 4 helps smooth out item similarities and helps incorporate the impacts from dissimilar/un-co-purchased items. The above can be shown theoretically from the coordinate descent updates (proof omitted here).

Fig. 3: Recommendations for Different Values of N. Panels: (a) BX, (b) ML10M, (c) Netflix, (d) Yahoo. Each panel plots HR against N (5, 10, 15, 20, 25) for itemkNN, itemprob, userkNN, PureSVD, WRMF, BPRMF, BPRkNN and SLIM.

TABLE IV: Performance Difference on Top-N Recommendations
dataset N=5 N=10 N=15 N=20 N=25
BX 0.012 0.006 0.000 0.000 0.001
ML10M 0.000 -0.016 -0.013 -0.018 -0.021
Netflix 0.013 0.012 0.008 0.005 0.003
Yahoo 0.009 0.015 0.015 0.016 0.017
Columns corresponding to N show the performance difference (in terms of HR) between SLIM and the best of the other methods on the corresponding top-N recommendations.

Fig. 4: Sparsity Patterns for ML10M (dark for non-zeros). Panels: (a) S from itemkNN, (b) W from SLIM.

8) Matrix Reconstruction: We compare SLIM with MF methods by looking at how it reconstructs the user/item purchase matrix. We use BPRMF as an example of MF methods since it has the typical properties that most of the state-of-the-art MF methods have. We focused on ML10M, whose matrix A has a density of 1.3%. The reconstructed matrix Ã_SLIM = AW from SLIM has a density of 25.1%, and its non-zero values have a mean of 0.0593. For the 1.3% non-zero entries in A, Ã_SLIM recovered 99.1% of them, and their mean value is 0.4489 (i.e., 7.57 times the non-zero mean). The reconstructed matrix Ã_BPRMF = UV^T is fully dense, with 13.1% of its values greater than 0 with a mean of 1.8636, and 86.9% of its values less than 0 with a mean of -2.4718. For the 1.3% non-zero entries in A, Ã_BPRMF has 97.3% of them as positive values with a mean of 4.7623 (i.e., 2.56 times the positive mean). This suggests that SLIM recovers A better than BPRMF since SLIM recovers more non-zero entries with relatively much larger values.

B. SLIM Performance on Ratings

1) Comparison Algorithms: We compare the performance of SLIM with PureSVD, WRMF and BPRkNN. In SLIM, the W matrix is learned by using the user-item rating matrix A as in Equation 2. PureSVD also uses the user-item rating matrix for the SVD calculation. In WRMF, the ratings are used as weights following the approach suggested in [5]. We modified BPRkNN such that, in addition to ranking rated items higher than non-rated items, it also ranks high-rated items higher than low-rated items. We will use the suffix -r after each method to explicitly denote that a method utilizes rating information during the model construction. Similarly, we will use the suffix

-b in this section to denote that a method utilizes binary information as in Section VI-A, for comparison purposes.

2) Top-N Recommendation Performance on Ratings: We compare SLIM-r with PureSVD-r, WRMF-r and BPRkNN-r on the BX, ML10M, Netflix and Yahoo datasets, for which rating information is available. In addition, we also evaluated SLIM-b, PureSVD-b, WRMF-b and BPRkNN-b on the four datasets, for which the models are still learned from the binary user-item purchase matrix but the recommendations are evaluated based on ratings.

Figure 5 presents the results of these experiments. The first column of figures shows the rating distribution of the four datasets. The second column shows the per-rating hit rates (rHR) for the four datasets. Finally, the third column shows the cumulative hit rates (cHR) for the four datasets. In Figure 5, the binary top-N models correspond to the best performing models of each method as in Table II. The results from top-N models using ratings are selected based on the cHR performance on rating 6, 6, 3 and 3 for the datasets, respectively.

The results in Figure 5 show that all the -r methods tend to produce higher hit rates on items with higher ratings. However, the per-rating hit rates of the -b methods vary less across different ratings. This is because during learning, high-rated items and low-rated items are not differentiated in the -b methods. In addition, the -r methods outperform the -b methods in terms of rHR on high-rated items. In particular, the -r methods consistently outperform the -b methods on items with ratings above the average across all the datasets.

Figure 5 also shows that SLIM-r consistently outperforms the other methods in terms of both rHR and cHR on items with higher ratings over all the four datasets. In particular, it outperforms PureSVD-r in terms of cHR, which is demonstrated in [3] to be the best performing method for top-N recommendation using ratings. This indicates that incorporating rating information during learning allows the SLIM methods to identify more highly-rated items.

VII. DISCUSSION & CONCLUSIONS

A. Observed Data vs Missing Data

In the user-item purchase/rating matrix A, the non-zero entries represent purchase/rating activities. However, the entries with “0” value can be ambiguous. They may represent that the users will never purchase the items, that the users may purchase the items but have not done so, or that we do not know whether the users have purchased the items or whether they will. This is the typical “missing data” setting and it has been well studied in recommender systems [4], [8].

In SLIM, we treated all missing data in aj and A in Equation 4 as true negatives (i.e., the users will never purchase the items). Differentiation of observed data and missing data in Equation 4 is under development.

B. Conclusions

In this paper, we proposed a sparse linear method for top-N recommendation, which is able to generate high-quality top-N recommendations fast. SLIM employs a sparse linear model in which the recommendation score for a new item can be calculated as an aggregation of other items. A sparse aggregation coefficient matrix W is learned for SLIM to make the aggregation very fast. W is learned by solving an ℓ1-norm and ℓ2-norm regularized optimization problem such that sparsity is introduced into W.

We conducted a comprehensive set of experiments and compared SLIM with other state-of-the-art top-N recommendation algorithms. The results showed that SLIM achieves prediction quality better than that of the state-of-the-art MF-based methods. Moreover, SLIM generates recommendations very fast. The experimental results also demonstrated the good properties of SLIM compared to other methods. One such property is that SLIM achieves a significant speedup if feature selection is applied prior to learning. SLIM is also resistant to long-tail effects in top-N recommendation problems. In addition, when trained using ratings, SLIM tends to produce recommendations that are also potentially highly rated. Due to these properties, SLIM is very suitable for real-time top-N recommendation tasks.

ACKNOWLEDGEMENT

This work was supported in part by NSF (IIS-0905220, OCI-1048018, and IOS-0820730), the DOE grant USDOE/DE-SC0005013 and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute.

REFERENCES

[1] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds., Recommender Systems Handbook. Springer, 2011.
[2] M. Deshpande and G. Karypis, “Item-based top-n recommendation algorithms,” ACM Transactions on Information Systems, vol. 22, pp. 143–177, January 2004.
[3] P. Cremonesi, Y. Koren, and R. Turrin, “Performance of recommender algorithms on top-n recommendation tasks,” in Proceedings of the fourth ACM conference on Recommender systems, ser. RecSys ’10. New York, NY, USA: ACM, 2010, pp. 39–46.
[4] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang, “One-class collaborative filtering,” in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, 2008, pp. 502–511.
[5] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, 2008, pp. 263–272.
[6] J. D. M. Rennie and N. Srebro, “Fast maximum margin matrix factorization for collaborative prediction,” in Proceedings of the 22nd international conference on Machine learning, ser. ICML ’05. New York, NY, USA: ACM, 2005, pp. 713–719.
[7] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola, “Maximum-margin matrix factorization,” in Advances in Neural Information Processing Systems 17. MIT Press, 2005, pp. 1329–1336.
[8] V. Sindhwani, S. S. Bucak, J. Hu, and A. Mojsilovic, “One-class matrix completion with low-density factorizations,” in Proceedings of the 2010 IEEE International Conference on Data Mining, ser. ICDM ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1055–1060.
[9] T. Hofmann, “Latent semantic models for collaborative filtering,” ACM Trans. Inf. Syst., vol. 22, pp. 89–115, January 2004.
[10] Y. Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” in Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’08. New York, NY, USA: ACM, 2008, pp. 426–434.

Fig. 5: Top-N Recommendation Performance on Ratings. Panels: (a) BX distribution, (b) BX rHR, (c) BX cHR, (d) ML10M distribution, (e) ML10M rHR, (f) ML10M cHR, (g) Netflix distribution, (h) Netflix rHR, (i) Netflix cHR, (j) Yahoo distribution, (k) Yahoo rHR, (l) Yahoo cHR. The horizontal axis of every panel is the rating (1 to 10 for BX and ML10M, 1 to 5 for Netflix and Yahoo). The rHR and cHR panels compare PureSVD, WRMF, BPRkNN and SLIM in their -r and -b variants.

[11] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: Bayesian personalized ranking from implicit feedback,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, ser. UAI ’09. Arlington, Virginia, United States: AUAI Press, 2009, pp. 452–461.
[12] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society (Series B), vol. 58, pp. 267–288, 1996.
[13] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society (Series B), vol. 67, no. 2, pp. 301–320, 2005.
[14] J. H. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software, vol. 33, no. 1, pp. 1–22, 2010.
[15] A. Paterek, “Improving regularized singular value decomposition for collaborative filtering,” Statistics, pp. 2–5, 2007.
[16] F. Bach, J. Mairal, and J. Ponce, “Convex sparse matrix factorizations,” CoRR, vol. abs/0812.1869, 2008.
[17] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, March 2010.
[18] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, vol. 5, pp. 1457–1469, December 2004.
