A Generalized Probabilistic Framework and its Variants for Training Top-k Recommender Systems


Harald Steck
Bell Labs, Alcatel-Lucent
Murray Hill, NJ
[email protected]

Yu Xin*
CSAIL, MIT
Cambridge, MA
[email protected]
ABSTRACT

Accounting for missing ratings in available training data was recently shown [3, 17] to lead to large improvements in the top-k hit rate of recommender systems, compared to state-of-the-art approaches optimizing the popular root-mean-square error (RMSE) on the observed ratings. In this paper, we take a Bayesian approach, which lends itself naturally to incorporating background knowledge concerning the missing-data mechanism. The resulting log posterior distribution is very similar to the objective function in [17]. We conduct elaborate experiments with real-world data, testing several variants of our approach under different hypothetical scenarios concerning the missing ratings. In the second part of this paper, we provide a generalized probabilistic framework for dealing with possibly multiple observed rating values for a user-item pair. Several practical applications are subsumed by this generalization, including aggregate recommendations (e.g., recommending artists based on ratings concerning their songs) as well as collaborative filtering of sequential data (e.g., recommendations based on TV consumption over time). We present promising preliminary experimental results on IP-TV data.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - Data Mining

General Terms

Algorithms

Keywords

Recommender Systems

* This work was done while an intern at Bell Labs, Alcatel-Lucent.

Copyright is held by the author/owner(s). Workshop on the Practical Use of Recommender Systems, Algorithms and Technologies (PRSAT 2010), held in conjunction with RecSys 2010. September 30, 2010, Barcelona, Spain.

1. INTRODUCTION

The idea of recommender systems is to automatically suggest items to each user that s/he may find appealing. The quality of recommender systems can be assessed with respect to various criteria, including accuracy, diversity, surprise/serendipity, and explainability of recommendations.

This paper is concerned with accuracy. The root mean squared error (RMSE) has become the most popular accuracy measure in the literature on recommender systems, for both training and testing. Its computational efficiency is one of its main advantages. Impressive progress has been made in predicting rating values with small RMSE, and it is impossible to name all approaches here, e.g., [4, 6, 7, 11, 13]. There is, however, also some work on optimizing the ranking of items, e.g., measured in terms of normalized Discounted Cumulative Gain (nDCG) [18]. Despite their differences, these approaches have in common that they were trained and tested on observed ratings only. Obviously, these measures cannot immediately be evaluated if some items have unobserved ratings.

In this paper, we consider the top-k hit rate, based on all (unrated) items, as the natural accuracy measure for recommender systems, as only a few out of all unrated items can be recommended to a user in practice (see Section 3 for an exact definition of the top-k hit rate). While this measure is computationally tractable for testing the predictions of recommender systems, it is unfortunately computationally very costly for training recommender systems. For training, we thus resort to appropriate surrogate objective functions that are computationally efficient.

In recent work [3, 17], it was shown that the top-k hit rate can be significantly improved on large real-world data by accounting for the fact that the observed ratings provide a skewed picture of the (unknown) distribution concerning all (unrated) items and users.

Motivated by the results of [17], as the first contribution of this paper, in Section 2 we present a probabilistic approach that allows us to naturally include background knowledge concerning the (unknown) distribution of all items and users in our training objective function, the posterior probability of the model.

As our second contribution, we conduct elaborate experiments on the Netflix Prize data [1] and test several models under different hypothetical scenarios concerning the missing ratings in Section 4. These different scenarios serve as a sensitivity analysis, as the ground truth of the missing-data mechanism is unknown due to lack of data. These experiments are based on our popularity-stratified recall measure, which we define in Section 3.
As the third contribution of this paper, we generalize this probabilistic approach so as to account for possibly multiple observed rating values for a user-item pair in Section 5. This general framework subsumes several applications in addition to the one outlined in Section 2. Two of these are outlined in Section 5: while the training objective function for a recommender system concerning TV programs was motivated in an ad-hoc manner in [5], we show that it can be understood and improved naturally in a Bayesian framework; apart from that, we also provide a Bayesian approach for making aggregate recommendations, e.g., recommending an artist or concert to a user based on the ratings given to individual songs.

2. MODEL TRAINING

In this section, we outline a probabilistic framework that allows us to incorporate background knowledge when training recommender systems on available data. The use of background knowledge in addition to the available training data can significantly improve the accuracy of recommender systems on performance measures like top-k hit rate, recall, precision, area under the ROC curve, etc. This was demonstrated for implicit feedback data in [5, 10], and for explicit feedback data in [3, 17]. Like in [17], we use the background knowledge that missing rating values tend to reflect negative feedback, as experimentally observed in [9, 8]; i.e., negative feedback tends to be missing from the available data with a larger probability than positive feedback does.

The Bayesian approach lends itself naturally to this task. We consider the rating matrix R as a matrix of random variables: each element R_{i,u} concerning item i = 1, ..., i_0 and user u = 1, ..., u_0 is a normally distributed random variable, where i_0 denotes the number of items and u_0 the number of users.

2.1 Model

We take a collaborative filtering approach, and use a low-rank matrix-factorization model, which has proven successful in many publications. Like the rating matrix, we consider our model as a matrix of random variables, M. Each random variable M_{i,u} corresponds to the rating of item i assigned by user u. In matrix notation, it reads

    M = r^{offset} + P Q^\top    (1)

where r^{offset} \in \mathbb{R} is an offset value, and P, Q are low-rank matrices of random variables with dimensions i_0 \times d_0 and u_0 \times d_0, respectively, where rank d_0 \ll i_0, u_0. We use upper-case symbols to denote random variables (with a Gaussian distribution), and lower-case symbols to denote values.

2.2 Prior over Matrix Elements

In our Bayesian approach, we first define the usual prior over model parameters, concerning each entry of the low-rank matrices P and Q (see also [12]):

    p(M | \sigma_P^2, \sigma_Q^2) = \Big\{ \prod_i \prod_d N(P_{i,d} \mid 0, \sigma_{P,i}^2) \Big\} \cdot \Big\{ \prod_u \prod_d N(Q_{u,d} \mid 0, \sigma_{Q,u}^2) \Big\}    (2)

The vectors of variances \sigma_Q^2 = (\sigma_{Q,u}^2)_{u=1,...,u_0} and \sigma_P^2 = (\sigma_{P,i}^2)_{i=1,...,i_0} for all users u = 1, ..., u_0 and items i = 1, ..., i_0 are free parameters of the zero-mean normal prior distribution, denoted by N. There are several ways of defining the standard deviations in Eq. 2, eventually resulting in different kinds of regularization. The obvious choice is to assume that \sigma_{P,i} = \sigma_{Q,u} = 1/\sqrt{2\lambda_0} for all i, u, with \lambda_0 \in \mathbb{R}. This results in the regularization term

    \log p(M | \sigma_P^2, \sigma_Q^2) = -\lambda_0 \left( ||P||_2^2 + ||Q||_2^2 \right) + c_1,

where || \cdot ||_2 denotes the Frobenius norm of a matrix, and c_1 is an irrelevant constant when training our model.

When optimizing root mean square error on observed data (like in the Netflix Prize competition [1]), however, numerous experimental works reported significant improvements by using a different regularization. This is obtained by choosing the standard deviations \sigma_P and \sigma_Q as follows: \sigma_{P,i} = 1/\sqrt{2\lambda_0 \cdot u_0(i)}, \sigma_{Q,u} = 1/\sqrt{2\lambda_0 \cdot i_0(u)}, where i_0(u) denotes the number of items rated by user u, and u_0(i) is the number of users who rated item i. This results in the popular regularization term

    \log p(M | \sigma_P^2, \sigma_Q^2) = -\lambda_0 \left( \sum_i u_0(i) \sum_d P_{i,d}^2 + \sum_u i_0(u) \sum_d Q_{u,d}^2 \right) + c_2
                                       = -\lambda_0 \sum_{\mathrm{observed}\,(i,u)} \left( \sum_d P_{i,d}^2 + Q_{u,d}^2 \right) + c_2,    (3)

where c_2 denotes again an irrelevant constant when training our model. Note that this choice increasingly regularizes the model parameters related to the items and users with a larger number of observed ratings. This may seem counter-intuitive at first glance. A theoretical explanation for this empirical finding was recently provided in [14].

2.3 Informative Background Knowledge

We now incorporate the following background knowledge into our sequential Bayesian approach: absent rating values tend to be lower than the observed ratings on average (see [17]). We insert this knowledge into our approach by means of a virtual data point for each pair (i, u): a virtual rating value r^{prior}_{i,u} with small confidence (i.e., large variance \sigma^2_{prior,i,u}). Then the likelihood of our model in light of these virtual data points reads (assuming i.i.d. data):

    p(r^{prior} | M, \sigma^2_{prior}) = \prod_{\mathrm{all}\,i} \prod_{\mathrm{all}\,u} p(R_{i,u} = r^{prior}_{i,u} | M_{i,u}, \sigma^2_{prior,i,u}),    (4)

where r^{prior} denotes the matrix of virtual data points r^{prior}_{i,u}, and \sigma^2_{prior} the matrix with elements \sigma^2_{prior,i,u}. We assume that the probabilities in this likelihood are determined by normal distributions with mean M_{i,u} and variance \sigma^2_{prior,i,u}. The log likelihood then reads

    \log p(r^{prior} | M, \sigma^2_{prior}) = -\sum_{\mathrm{all}\,i} \sum_{\mathrm{all}\,u} w^{prior}_{i,u} \left( r^{prior}_{i,u} - M_{i,u} \right)^2 + c_3    (5)

where we define the weights of the virtual data points as w^{prior}_{i,u} = 1/(2\sigma^2_{prior,i,u}); c_3 is again an irrelevant constant when training our model.
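The construction above, observed ratings plus one weakly weighted virtual rating per (i, u) pair, can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation; the tiny random matrices and the particular parameter values are our own assumptions, with variable names following the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_users, rank = 4, 5, 2

# Low-rank model M = r_offset + P Q^T (Eq. 1); r_offset = r_prior (Sec. 2.5)
r_prior = 2.0
P = rng.normal(scale=0.1, size=(n_items, rank))
Q = rng.normal(scale=0.1, size=(n_users, rank))
M = r_prior + P @ Q.T

# A few observed ratings with weight w_obs = 1; weight 0 elsewhere
R_obs = np.zeros((n_items, n_users))
W_obs = np.zeros((n_items, n_users))
R_obs[0, 1], W_obs[0, 1] = 5.0, 1.0
R_obs[2, 3], W_obs[2, 3] = 3.0, 1.0

# Virtual data points: the same low rating r_prior with a small,
# identical weight w_prior for EVERY (i, u) pair (Eq. 5)
w_prior = 0.005

log_lik_obs = -np.sum(W_obs * (R_obs - M) ** 2)        # Eq. 7, up to c_4
log_lik_prior = -w_prior * np.sum((r_prior - M) ** 2)  # Eq. 5, up to c_3

lam = 0.04  # absorbs the weights and lambda_0, as in Eq. 9 below
log_reg = -lam * (np.sum(P ** 2) + np.sum(Q ** 2))     # cf. Eq. 3

log_posterior = log_lik_obs + log_lik_prior + log_reg  # cf. Eq. 9
```

Since r^{offset} = r^{prior} here, the virtual-data term reduces to -w^{prior} \sum_{i,u} ((PQ^\top)_{i,u})^2, which is what keeps the computation sparse in practice.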
With Bayes' rule, we obtain the posterior distribution of the model in light of these virtual data points:

    p(M | r^{prior}, w^{prior}) = \frac{p(r^{prior}, w^{prior} | M)\, p(M)}{p(r^{prior}, w^{prior})}.    (6)

This equation combines our prior concerning the elements in the matrices P and Q (for regularization) with our background knowledge on the expected rating values. This serves as our prior when observing the actual rating values in the training data.

2.4 Training Data

Now we use the rating values actually observed in the training data. The likelihood of the model in light of observed rating values r^{obs}_{i,u} reads (assuming i.i.d. data):

    p(r^{obs} | M, \sigma^2_{obs}) = \prod_{\mathrm{observed}\,(i,u)} p(R_{i,u} = r^{obs}_{i,u} | M_{i,u}, \sigma^2_{obs,i,u})

Again assuming a normal distribution, the log likelihood reads:

    \log p(r^{obs} | M, \sigma^2_{obs}) = -\sum_{\mathrm{observed}\,(i,u)} w^{obs}_{i,u} \left( r^{obs}_{i,u} - M_{i,u} \right)^2 + c_4    (7)

where we define the weights of the observed rating values as w^{obs}_{i,u} = 1/(2\sigma^2_{obs,i,u}); c_4 is again an irrelevant constant when training our model.

2.5 Posterior

The posterior after seeing the observed ratings is again obtained by Bayes' rule (we omit the weights w^{obs}, w^{prior} for brevity of notation here):

    p(M | r^{obs}, r^{prior}) = \frac{p(r^{obs} | M, r^{prior})\, p(M | r^{prior})}{p(r^{obs})} \propto p(r^{obs} | M, r^{prior})\, p(r^{prior} | M)\, p(M)    (8)

where we assume in the denominator that the observed ratings are independent of the chosen prior ratings, i.e., p(r^{obs} | r^{prior}) = p(r^{obs}). Substituting Eqs. 3, 5, 6 and 7 into Eq. 8, we obtain the following log posterior:

    \log p(M | r^{obs}, r^{prior}, w^{obs}, w^{prior}, \lambda) =
        - \sum_{\mathrm{obs.}\,(i,u)} \Big\{ w^{obs}_{i,u} (r^{obs}_{i,u} - M_{i,u})^2 + \lambda \sum_d \big[ P_{i,d}^2 + Q_{u,d}^2 \big] \Big\}
        - \sum_{\mathrm{all}\,(i,u)} \Big\{ w^{prior}_{i,u} (r^{prior}_{i,u} - M_{i,u})^2 + \lambda \sum_d \big[ P_{i,d}^2 + Q_{u,d}^2 \big] \Big\} + c_5    (9)

We found that using a prior that also involves the regularization term of P and Q in the third line of Eq. 9 leads to a slight improvement in our experimental results. The weights w^{prior}_{i,u} and w^{obs}_{i,u} as well as \lambda_0 are absorbed in \lambda; this is a slight but straightforward generalization of the prior in Section 2.2; c_5 is again an irrelevant constant for training.

Eq. 9 serves as our training objective function. For simplicity, we choose the same value for all virtual rating values r^{prior}. For computational efficiency, we choose our model offset to equal the prior rating values: r^{offset} = r^{prior}. Its main effect is that this retains the sparsity of the observed rating matrix. Apart from that, it also leads to the simplification (r^{prior}_{i,u} - M_{i,u})^2 = ((P Q^\top)_{i,u})^2. For simplicity, we set w^{obs}_{i,u} = 1 for all observed pairs (i, u), and also choose all prior weights to be identical: w^{prior} = w^{prior}_{i,u} for all (i, u). In summary, the three tuning parameters in Eq. 9 are w^{prior}, r^{prior} and \lambda, which can be chosen so as to optimize the performance measure on cross-validation data.

2.6 MAP Estimate of Model

For computational efficiency, our training aims to find the maximum-a-posteriori (MAP) parameter estimate of our model, i.e., the MAP estimates \hat{P} and \hat{Q} of the matrices P and Q. We use the alternating least squares approach. The idea is that one matrix can be optimized exactly while the other one is held fixed. A local maximum of the log posterior can be found by alternating between the matrices \hat{P} and \hat{Q}. While local optima exist [16], we did not find this to cause major computational problems in our experiments. The update equation for each row i of \hat{P} is (for fixed \hat{Q}):

    \hat{P}_{i,\cdot} = (\bar{r}_{i,\cdot} - r^{prior}) (\tilde{W}^{(i)} + w^{prior} I) \hat{Q} \cdot \Big[ \hat{Q}^\top (\tilde{W}^{(i)} + w^{prior} I) \hat{Q} + \lambda \big( \mathrm{tr}(\tilde{W}^{(i)}) + w^{prior} u_0 \big) I \Big]^{-1}    (10)

where \bar{r}_{i,u} = (r^{obs}_{i,u} w^{obs}_{i,u} + r^{prior} w^{prior}_{i,u}) / (w^{obs}_{i,u} + w^{prior}_{i,u}) denotes the average rating; we define w^{obs}_{i,u} = 0 if the rating at (i, u) is missing; note that \bar{r}_{i,\cdot} - r^{prior} = 0 if the rating is missing for (i, u); \tilde{W}^{(i)} is a diagonal matrix containing the ith column of the weight matrix w^{obs}; the trace is \mathrm{tr}(\tilde{W}^{(i)}) = \sum_{u \in S_i} w^{obs}_{i,u}, where S_i is the set of users who rated item i; I denotes the identity matrix; and u_0 is the number of users. This equation can be re-written for efficient computation, e.g., see [17]. The update equation for \hat{Q} is analogous.

3. MODEL TESTING

A key challenge in testing recommender systems is that the observed ratings in the available data typically provide a skewed picture of the (unknown) true distribution concerning all ratings [9, 8]. This may be caused by the fact that users are free to choose which items to rate, and they tend not to rate items that would otherwise receive a low rating. If the ratings are missing not at random (MNAR), it is not guaranteed that correct or meaningful results are obtained from testing a recommender system on the observed ratings only. The latter is, however, common practice in the literature on recommender systems, using measures like root mean square error or nDCG [4, 6, 7, 11, 13, 18].

The top-k hit rate / recall of relevant items is a particularly useful performance measure for assessing the accuracy of recommender systems [17]. An item i is relevant to user u if s/he finds this item interesting or appealing [17]. For instance, in the Netflix data [1] we consider items i with a 5-star rating, r^{obs}_{i,u} = 5, as relevant to user u.

Recall can be calculated for a user by ranking the items according to their scores predicted by the recommender system, and determining the fraction of relevant items that are among the top-k items, i.e., the k items with the highest scores. The value of k \in \mathbb{N} has to be chosen, e.g., as the number of items that can be recommended to a user. Only small values of k are important in practice. The goal is to maximize recall for the chosen value of k.
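The per-user computation just described can be sketched as follows; this is a minimal illustration of ours, not code from the paper, with `scores` holding the predicted scores over all items and `relevant` the user's set of relevant items:

```python
import numpy as np

def recall_at_k(scores, relevant, k):
    """Fraction of the user's relevant items that appear among the
    k items with the highest predicted scores."""
    top_k = set(np.argsort(-scores)[:k])      # indices of the top-k items
    hits = sum(1 for i in relevant if i in top_k)
    return hits / len(relevant)

# Example: items 0 and 3 are relevant; the top-2 list contains item 0 only
scores = np.array([0.9, 0.1, 0.4, 0.3, 0.8])
print(recall_at_k(scores, relevant={0, 3}, k=2))   # 0.5
```

In practice, k would be set to the number of items that can actually be shown to the user, matching the use case described above.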
Recall has two interesting properties in this context [17]: it is proportional to precision on fixed data and fixed k when comparing different recommender systems with each other. In other words, the recommender system with the larger recall also has the larger precision. More interestingly, however, recall can be calculated from the available MNAR data and provides an unbiased estimate of the recall concerning the (unknown) complete data (which comprises all rating values of all users) under the following assumption: the relevant ratings are missing at random, while an arbitrary missing-data mechanism may apply to all other rating values (as long as they are missing with a larger probability than the relevant ones) [17]. This assumption is much milder than the one underlying the popular approach of ignoring missing ratings, i.e., assuming that all ratings are missing at random.

The expected performance on the (unknown) complete data is important because it is directly related to user experience: the recommender system has to pick a few items from among all items the user has not rated yet (and which may hence be new to the user); one can expect the distribution on all unrated items to be well approximated by the distribution on all (rated and unrated) items under the (mild) assumption that only a small fraction of the relevant ratings has been observed.

Given that ground truth (i.e., the complete data) is typically not available (at low cost), the validity of the assumption in [17] cannot be verified in practice. For this reason, we carry out a sensitivity analysis in the following. We relax this assumption even further and determine its effect on the recall test-results for different models. Note that the assumption in [17] allows for an arbitrary missing-data mechanism concerning all ratings, except for the relevant ratings; only the latter are assumed to be missing at random. For this reason, the following is concerned with the relevant ratings only.

We consider the case that the probability of observing a relevant rating depends on the popularity of items. We define the popularity of an item by the number N^{+}_{complete,i} of relevant ratings it obtained in the (unknown) complete data. Let N^{+}_{obs,i} be the number of relevant ratings observed in the available data; then the probability of observing a relevant rating regarding item i is

    p_{obs}(i) = \frac{N^{+}_{obs,i}}{N^{+}_{complete,i}}.    (11)

Assuming that there are no additional (possibly hidden) factors underlying the missing-data mechanism concerning the relevant ratings, we obviously obtain an unbiased estimate of recall on the (unknown) complete data by calculating the popularity-stratified recall (for user u),

    \mathrm{recall}_u(k) = \frac{\sum_{i \in S_u^{+,k}} s_i}{\sum_{i \in S_u^{+}} s_i}    (12)

on the available MNAR data; S_u^{+} denotes the set of relevant items of user u; S_u^{+,k} is the subset of relevant items ranked in the top k items based on the predictions of the recommender system; the popularity-stratification weight for each item i is

    s_i = \frac{1}{p_{obs}(i)}.

If we choose the stratification weights s_i = 1 for all items i, we obtain the usual recall measure. Given that complete data is unavailable (at low cost), p_{obs} in Eq. 11 cannot be calculated in practice. For this reason, we now examine different choices for p_{obs} and their effects on recall in Eq. 12.

In the first scenario, p_{obs} may take only two values: it is small for unpopular items, and large for popular items. If the ratio of these two values approaches infinity, this results in s_i \to 0 for popular items. Removing the relevant ratings of the most popular items from the test set is indeed common practice, e.g., in [3]. In the second scenario, we consider the case where p_{obs} is a smooth function of the items' popularities. We assume the polynomial relationship

    p_{obs}(i) \propto \left( N^{+}_{complete,i} \right)^{\gamma}    (13)

with \gamma \in \mathbb{R}. This is consistent with the power-law behavior of the observed relevant ratings (see Figure 1 (a)) in the sense that the (unknown) complete data then also follows a power-law distribution concerning the relevant ratings. This results in the stratification weights

    s_i \propto 1 / \left( N^{+}_{obs,i} \right)^{\gamma/(\gamma+1)}.

Note that the unknown proportionality factor cancels out in Eq. 12, so that it provides an unbiased estimate of the recall concerning the complete data for the correct polynomial degree \gamma (and assuming that there are no additional factors underlying the missing-data mechanism). If \gamma = 0, the relevant ratings are missing at random; if \gamma = 1, the probability of observing relevant ratings increases linearly with item popularity (N^{+}_{complete,i}). The extreme case \gamma \to \infty has several interesting properties: first, one would observe in the available data only relevant ratings of the item with the largest popularity in the complete data, which does not agree with empirical evidence. Second, as \gamma/(\gamma+1) \to 1, the stratification weights s_i become inversely proportional to the number of observed relevant ratings N^{+}_{obs,i}; this means that every item has the same weight in the recall measure in Eq. 12, independent of its number of observed relevant ratings. This means that, once an item obtains its first relevant rating, it is weighted the same as all other items, which may have obtained thousands of relevant ratings. This obviously entails low robustness against statistical noise as well as against manipulation and attacks. As this is the limiting case of \gamma, we nevertheless provide experimental results for this extreme scenario in Section 4. The (unknown) realistic case can be expected to be less extreme.

If the test set is a random subset of the observed data, then N^{+}_{obs,i} can be determined from the test set. Given that the training set is typically much larger than the test set, it might be more robust to determine N^{+}_{obs,i} based on all the available data, i.e., the training and test set combined. We use the latter in our experiments reported in the next section, but the former choice of N^{+}_{obs,i} leads to very similar results.

Finally, we define recall(k) = \sum_u w_u \mathrm{recall}_u(k) as the average recall over all users, with normalized weights \sum_u w_u = 1, like in [17]. In our experiments, we choose w_u \propto \sum_{i \in S_u^{+}} s_i, as a generalization of the definition in [7, 17].

Obviously, stratification like in Eq. 12 carries over analogously to other measures, like ATOP [17] or the area under the ROC curve.
[Figure 1: six panels. (a) log-log histogram of the number of 5-star ratings per movie; (b) model legend: MF with (w^{prior}, \lambda) in {(0.005, 0.04), (0.0005, 0.04), (0.00005, 0.06), (0, 0.07)}, SVD, and BS; (c)-(f) recall as a function of k for k = 0, ..., 20.]

Figure 1: Netflix data [1]: (a) the number of relevant (i.e., 5-star) ratings per movie in the training data shows a close to power-law distribution (i.e., a straight line in the log-log plot); (b) legend of models (see text for details); (c)-(f) show recall on the probe set for different hypothetical missing-data mechanisms concerning the relevant (i.e., 5-star) ratings (while an arbitrary missing-data mechanism is allowed for the other ratings): (c) relevant ratings are missing at random (\gamma = 0 in Eq. 13); (d) relevant ratings observed with probability increasing linearly with item popularity (\gamma = 1 in Eq. 13); (e) unrealistic extreme case where \gamma \to \infty in Eq. 13; (f) relevant ratings of the 10% most popular items removed. As a result, for all these missing-data mechanisms, recall test-results are improved by using an appropriate small prior weight w^{prior} > 0 during training, compared to the popular approach of ignoring the missing-data mechanism (w^{prior} = 0) during training.

4. EXPERIMENTS

This section summarizes our results on the Netflix Prize data [1]. These data contain 17,770 movies and almost half a million users. About 100 million ratings are available. Ratings are observed for about 1% of all possible movie-user pairs. The ratings take integer values from 1 (worst) to 5 (best). The provided data are already split into a training and a probe set. We removed the probe set from the provided data so as to obtain our training set.

We consider 5-star ratings as relevant to a user (as defined above), and use the popularity-stratified recall, as outlined in Section 3, as our performance measure on the Netflix probe set. For all experiments, we chose rank d = 50 for our low-rank matrix factorization (MF) model (Eq. 1). In the following, we compare our MF model, trained with different prior weights w^{prior} > 0, against the popular MF approach that ignores the missing-data mechanism (i.e., w^{prior} = 0 in our notation). The latter achieved a root mean square error of 0.922 on the observed ratings in the probe set. Additionally, we compare to singular value decomposition (SVD), where we used the svds function of Matlab, which implicitly imputes a zero value (with unit weight) for all missing ratings; and to the bestseller list (BS), which ranks the items according to the number of ratings in the training set. The values of the tuning parameters in our training objective function (the log posterior in Eq. 9) are summarized in Figure 1 (b). Like in [17], we chose the prior rating value r^{prior} = 2 in Eq. 9.

Figures 1 (c)-(f) show the performance of these models under different test scenarios, concerning our popularity-stratified recall measure for the practically important range of small k values. For computational efficiency, we computed recall by randomly sampling, for a user, 1,000 unrated items for each relevant rating in the test set, like in [3]. The only difference to the test procedure used in [7, 17] is that we sample from unrated items only, rather than from all items. This is more realistic. It also results in slightly higher recall values compared to the procedure used in [7, 17].
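The sampling-based test procedure just described can be sketched as follows; this is an illustration under a toy setup of our own, not the authors' evaluation code. For each relevant test rating (i, u), item i is ranked against up to 1,000 randomly drawn items the user has not rated, and a hit is counted if it lands in the top k:

```python
import numpy as np

def sampled_recall(model_scores, rated_mask, test_relevant, k=10,
                   n_sample=1000, seed=0):
    """model_scores: (n_items, n_users) predicted scores;
    rated_mask: boolean, True where a training rating exists;
    test_relevant: iterable of (item, user) relevant test pairs."""
    rng = np.random.default_rng(seed)
    hits = 0
    for i, u in test_relevant:
        unrated = np.flatnonzero(~rated_mask[:, u])
        unrated = unrated[unrated != i]             # competitor items only
        m = min(n_sample, unrated.size)
        sample = rng.choice(unrated, size=m, replace=False)
        # rank of item i among the sampled unrated items
        rank = 1 + np.sum(model_scores[sample, u] > model_scores[i, u])
        hits += int(rank <= k)
    return hits / len(test_relevant)
```

Sampling competitors from the unrated items only, rather than from all items, mirrors the choice made in the experiments above.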
When comparing the different graphs, it is obvious that the performance of all the models depends on the (unknown) missing-data mechanism concerning the relevant ratings. In particular, when p_{obs} of the relevant ratings is assumed to increase more rapidly with growing popularity (N^{+}_{complete,i}), the expected recall on the (unknown) complete data decreases for all models, cf. Figures 1 (c)→(d)→(e), and (c)→(f).

As p_{obs} of relevant ratings increases more rapidly with item popularity (compare Figures 1 (c)→(d)→(e)), the difference in recall among the various MF models decreases. Training with smaller but positive weights w^{prior} > 0 results in the best recall on the test set, even in the unrealistic extreme limit in Figure 1 (e). This suggests that, compared to the popular approach of ignoring the missing-data mechanism when training MF models, recall can be improved by using a small prior weight w^{prior} > 0; its value is upper bounded by the value that optimizes the (usual) recall measure on the test set, i.e., under the assumption that the relevant ratings are missing at random, like in [17].

The bestseller list (BS) and SVD perform surprisingly well if relevant ratings are missing at random, see Figure 1 (c), while the popular MF model with w^{prior} = 0 has low recall in comparison. This was also found in [17, 3]. BS and SVD perform rather poorly, however, if p_{obs} increases rapidly with item popularity, as shown in the extreme scenarios in Figures 1 (e) and (f). This suggests that not only BS, but also SVD tends to recommend items that are popular in the available training data. Their recommendations may hence result in a relatively low degree of serendipity or surprise, relative to our MF models trained with a small positive prior weight.

5. GENERALIZED APPROACH

This section outlines a generalization of the Bayesian approach given above. In our probabilistic approach, we consider the rating matrix R as a matrix of random variables. As each entry R_{i,u} is a random variable (rather than a value), this naturally allows for possibly multiple values concerning each pair (i, u) in the data. This has several advantages over a matrix of values, which has typically been considered in the literature on recommender systems. After developing our generalized probabilistic framework, we outline three special cases / applications in Section 5.2.

Let the given data set be D = \{r_{i,u,j}\}_{i,u,j}, where i = 1, ..., i_0 is the index concerning items, u = 1, ..., u_0 is the index regarding users, and j = 1, ... is the index over possibly multiple observed ratings for the same pair (i, u). The likelihood of the model in light of i.i.d. data reads

    p(D | M) = \prod_{i,u,j} p(R_{i,u} = r_{i,u,j} | M).    (14)

Assuming again a normal distribution of the ratings (with standard deviations \sigma_{i,u,j}, or equivalently weights w_{i,u,j} = 1/(2\sigma^2_{i,u,j})), the log likelihood of the model is

    \log p(D | M) = -\sum_{i,u} \sum_j w_{i,u,j} (r_{i,u,j} - M_{i,u})^2 + c_5
                  = -\sum_{i,u} \sum_v w_{i,u,v} (v - M_{i,u})^2 + c_5    (15)

where the second line is obtained by switching, for each pair (i, u), from the index j (over multiple ratings in the data) to the actual rating values v; the cumulative weight is w_{i,u,v} = \sum_j w_{i,u,j} I_{r_{i,u,j} = v}, where the indicator function I_{r_{i,u,j} = v} = 1 if r_{i,u,j} = v and 0 otherwise.

Combining this likelihood with the same kind of prior over the model parameters as in Eq. 2, we obtain the log posterior of our model:

    \log p(M | D) = -\sum_{i,u} \sum_v w_{i,u,v} (v - M_{i,u})^2 - \sum_i \frac{1}{2\sigma^2_{P,i}} \sum_d P^2_{i,d} - \sum_u \frac{1}{2\sigma^2_{Q,u}} \sum_d Q^2_{u,d} + c_6    (16)

The standard deviations \sigma_{Q,u} and \sigma_{P,i} may be chosen so as to achieve the desired variant of regularization, as discussed in Section 2.2.

5.1 MAP Estimate of Model

For computational efficiency, we focus on optimizing the log posterior in Eq. 16. The maximum-a-posteriori (MAP) parameter estimate of our model, i.e., the MAP estimates \hat{P} and \hat{Q} of the matrices P and Q, can be determined by alternating least squares, which alternately optimizes one matrix while the other one is held fixed. Using the usual necessary condition for the optimum of Eq. 16, we equate its partial derivative to zero, and obtain the following update equation for each row i of \hat{P} (for fixed \hat{Q}):

    \hat{P}_{i,\cdot} = (\bar{v}_{i,\cdot} - r^{offset}) \tilde{W}^{(i)} \hat{Q} \cdot \Big[ \hat{Q}^\top \tilde{W}^{(i)} \hat{Q} + \frac{1}{2\sigma^2_{P,i}} I \Big]^{-1},    (17)

where I denotes the identity matrix, and \tilde{W}^{(i)} is a diagonal matrix containing the ith row of the aggregate weight matrix with elements

    w_{i,u} = \sum_v w_{i,u,v},

and \bar{v}_{i,u} is the average rating value

    \bar{v}_{i,u} = \Big( \sum_v v\, w_{i,u,v} \Big) / w_{i,u}.

Analogously, the update equation for each row u of \hat{Q} is:

    \hat{Q}_{u,\cdot} = (\bar{v}_{\cdot,u} - r^{offset}) \tilde{W}^{(u)} \hat{P} \Big[ \hat{P}^\top \tilde{W}^{(u)} \hat{P} + \frac{1}{2\sigma^2_{Q,u}} I \Big]^{-1},    (18)

where \tilde{W}^{(u)} is the diagonal matrix containing the uth column of the aggregate weight matrix.

This derivation shows that optimizing Eq. 16 is equivalent to optimizing

    \log p(M | D) = -\sum_{i,u} w_{i,u} (\bar{v}_{i,u} - M_{i,u})^2 - \sum_i \frac{1}{2\sigma^2_{P,i}} \sum_d P^2_{i,d} - \sum_u \frac{1}{2\sigma^2_{Q,u}} \sum_d Q^2_{u,d} + c_6    (19)

where multiple rating values for an item-user pair are replaced by their mean value \bar{v}_{i,u} and their aggregate weight w_{i,u}.

5.2 Applications

This generalized probabilistic approach subsumes several applications as special cases. In addition to the use of virtual ratings in the prior, as outlined in Section 2, we present two additional applications in the following.
5.2.1 Aggregate Recommendations
The universe of all items available for recommendation may have a structure that goes beyond a flat list. Items can often be arranged in a hierarchical manner. For instance, songs may be grouped by artist, album, genre, etc. Possibly, there are several layers of hierarchy. Now let us consider the problem that data are available where users have rated individual songs, but the task is to recommend an artist to a user. This problem arises in several situations. For instance, the recommender system may want to suggest a concert to the user, based on the data on individual songs. Another scenario is the release of a new song by an artist: this cold-start problem may be overcome by recommending the new song to users who like the artist.
The ratings matrix concerning songs and users can be used to construct a rating matrix regarding artists and users by aggregating the songs of each artist. As a user may have rated several songs of an artist, we now have possibly multiple rating values for an entry in the artist-user matrix. This is exactly the problem solved by our general approach in Section 5. Our framework also shows that, for each user, the rating of each artist can be determined as the weighted average of the ratings of his/her songs, and the weight is the sum of the weights of the songs, as one may have intuitively expected.

5.2.2 Recommendation of TV Shows
IP-TV is much more interactive than traditional TV. Concerning recommender systems, it allows one to collect information on the users' TV consumption. This can be used to learn the preferences of users for TV programs, so as to make accurate recommendations of TV shows. Unlike the previous applications described in this paper, we now use implicit feedback data (time spent watching a TV show) in place of the ratings. Our approach carries over immediately. This problem can be cast in our general probabilistic framework as follows: we divide the length of each show into n_max time intervals (of equal length), where n_max is the same large integer for all shows. We consider each time interval associated with a random experiment; a show hence comprises n_max repetitions of the experiment. The random variable R_{i,u} takes the value 1 if user u watched show i for a time interval, and 0 otherwise. So, if user u watches n_{i,u} ∈ {1, ..., n_max} out of n_max intervals of show i, then we have n_{i,u} observations of value 1 for the random variable R_{i,u}, and n_max − n_{i,u} observations of value 0; as shown in Section 5, these multiple observations can be substituted equivalently in our least-squares objective function in Eq. 16 by the aggregate weight n_max and the averaged value of the observations: r̄_{i,u} = n_{i,u}/n_max, i.e., the fraction of the show watched. In addition, if a show comprises several (e.g., weekly) episodes, then we assume that the above applies to each episode; the total weight of the show is then t_{i,u} n_max, where t_{i,u} ∈ N is the number of episodes of show i watched by user u. Then the averaged observed value is r̄_{i,u} = Σ_{j=1}^{t_{i,u}} r̄_{i,u,j} / t_{i,u}, where r̄_{i,u,j} is the fraction of episode j watched (analogous to above). This results in the log likelihood (with an irrelevant constant c7 in our optimization problem):

log p(D|M) = −n_max Σ_{(i,u)∈S} t_{i,u} (r̄_{i,u} − M_{i,u})² + c7,

where the set S contains all pairs (i, u) with shows i that have been watched at least partially by user u. Concerning all shows, we additionally incorporate background knowledge that users tend to not like shows, with some small confidence / weight. This weight is small, as it can also be interpreted as the variance of our prior, which is large, as there are many reasons for not watching a show. Analogously to Section 2, we use virtual observations of value 0 with weight w_prior. This results in the log likelihood in light of the virtual data points D_prior:

log p(D_prior|M) = −n_max Σ_{all (i,u)} w_prior (0 − M_{i,u})² + c8

Combining these two likelihoods, together with the prior over the model parameters in Eq. 2 (first variant), we obtain the log posterior

log p(M|D, D_prior) ∝ − Σ_{(i,u)∈S} t_{i,u} (r̄_{i,u} − M_{i,u})² − Σ_{all (i,u)} w_prior (0 − M_{i,u})² − λ Σ_d [ Σ_i P²_{i,d} + Σ_u Q²_{u,d} ] + c9,    (20)

where we replaced the standard deviations in the prior by λ ∈ R; the proportionality is due to omitting n_max, which is an irrelevant constant when optimizing the posterior; c9 is an irrelevant additive constant in our optimization problem.
In [5], the following objective function was experimentally found to result in the best recommendation accuracy from among several variants (re-written, but equivalent to Eq. 3 in [5]):

Σ_{(i,u)∈S} (1 + α nt_{i,u}) (1 − M_{i,u})² + Σ_{(i,u)∉S} (0 − M_{i,u})² + λ Σ_d [ Σ_i P²_{i,d} + Σ_u Q²_{u,d} ],    (21)

where nt_{i,u} = Σ_j n_{i,u,j} is the total time spent by user u watching show i (including all episodes j); S denotes again the set of pairs (i, u) where show i is at least partially watched by user u; α takes essentially the role of w_prior.
Interestingly, their objective function is similar to ours. The main difference, however, is in the least-squares term, where we fit our model to the fraction r̄_{i,u} of the show watched, while the indicator value 1 is used in [5]. We attribute to this difference the fact that our model performs slightly better in our preliminary experiments on our (proprietary) IP-TV data [2], see below. Besides the experimental improvement, our theoretical framework also provides a clear understanding of the assumptions underlying our approach, while the approach in [5] appears to have been found experimentally in a somewhat ad-hoc manner.
Preliminary Experiments: We used a (proprietary) IP-TV data set [2] concerning TV consumption of N = 25,777 different shows by 14,731 users (living in 6,423 households) in the UK over a 6-month period in 2008/2009 (see also [15] for a more detailed description). In our collaborative filtering approach, we used only implicit feedback data concerning TV consumption (user ID, program ID, the length of the program, and the time the user spent on this program), and ignored the available explicit user profiles and content information for simplicity. We randomly split these data into a training set and a disjoint test set, with 10 shows per user assigned to the test set.
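The top-k evaluation by recall used in these experiments can be sketched as follows; this is a minimal illustration with made-up scores and a hypothetical `recall_at_k` helper, not the authors' evaluation code (the paper's recall(k0) additionally normalizes k by the number N of shows):

```python
import numpy as np

def recall_at_k(scores_u, relevant_items, k):
    """Fraction of a user's relevant held-out items that rank in the top k.

    scores_u: predicted scores for all items for one user.
    relevant_items: indices of held-out items the user found relevant.
    """
    top_k = np.argsort(-scores_u)[:k]  # indices of the k highest scores
    hits = len(set(top_k.tolist()) & set(relevant_items))
    return hits / len(relevant_items)

# Toy example: 10 items; the user's relevant test items are 2 and 7.
scores_u = np.array([0.1, 0.9, 0.8, 0.2, 0.3, 0.05, 0.6, 0.7, 0.4, 0.0])
print(recall_at_k(scores_u, [2, 7], k=3))  # top-3 items are 1, 2, 7 -> 1.0
```

Averaging this quantity over all test users gives the recall values reported in Table 1.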
In the test set, we considered shows interesting or relevant to users if they watched at least 80% of them. We used these shows to evaluate the recommender systems w.r.t. the performance measure recall (as defined in Section 3 with γ = 0). Table 1 summarizes our preliminary results in terms of recall(k0), where k0 = k/N is normalized regarding the number N of available shows.

model   k0 = 1%   k0 = 2%   k0 = 3%   k0 = 4%   k0 = 5%
ours    0.671     0.763     0.819     0.856     0.882
[5]     0.624     0.723     0.785     0.828     0.857
Nbr     0.547     0.642     0.704     0.760     0.804

Table 1: Recall(k0) test results on IP-TV data.

Again, we used rank d = 50 for the matrix factorization models. We find that our new approach (with λ = 0.005) and the approach in [5] (with λ = 0.02) give similar results, compared to the neighborhood model (Nbr), with similarities s_{ij} = r_i^T r_j / (||r_i|| ||r_j||) and predictions r̂_{iu} = Σ_j s_{ij} r_{uj}, which was also used in [5] for comparison. Concerning recall, small values of k0 are particularly important in practice, as only a small number of shows can be recommended to a user; in this regime, our new approach seems to perform better than the approach of [5]. We are currently running more refined experiments to confirm this finding.

6. CONCLUSIONS
This paper provides three contributions. First, we outlined a Bayesian framework that naturally allowed us to insert background knowledge concerning the missing-data mechanism underlying the observed rating data. The obtained log posterior probability of the model is very similar to the training objective function outlined in [17].
In our second contribution, we conducted experiments where we considered several hypothetical missing-data mechanisms underlying the observed real-world data. Given that the true missing-data mechanism is unknown in the absence of ground-truth data, this sensitivity analysis provided valuable information. Our key insight based on these experiments is that the top-k hit rate or recall can be improved considerably by training recommender systems with an appropriately chosen small positive prior weight concerning background knowledge on the missing-data mechanism. This is in contrast to the popular approach in the literature, which only considers observed ratings.
As a third contribution, we provided a generalized probabilistic framework for factorizing a user-item matrix that possibly has multiple observed rating values associated with each user-item pair. We discussed three important special cases / applications: besides training a top-k recommender system by using virtual data points, we outlined how ratings can be aggregated when items are grouped in a hierarchical manner rather than in a flat list, and how recommendations can be made using this hierarchical structure. Additionally, this framework enabled us to derive the training objective function for a recommender system on sequential data concerning TV consumption. This derivation not only provides a clear understanding of the assumptions underlying the training objective function, but also led to improvements in the top-k hit rate over state-of-the-art approaches in our preliminary experiments.

Acknowledgements
We are greatly indebted to Tin Ho for her encouragement and support of this work. We are also very grateful to the anonymous reviewers for their valuable feedback.

7. REFERENCES
[1] J. Bennet and S. Lanning. The Netflix Prize. In Workshop at SIGKDD-07, ACM Conference on Knowledge Discovery and Data Mining, 2007.
[2] BARB: Broadcaster Audience Research Board. http://www.barb.co.uk.
[3] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-N recommendation tasks. In ACM Conference on Recommender Systems, 2010.
[4] S. Funk. Netflix update: Try this at home, 2006. http://sifter.org/~simon/journal/20061211.html.
[5] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, 2008.
[6] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries, 2009. arXiv:0906.2027.
[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD, 2008.
[8] B. Marlin and R. Zemel. Collaborative prediction and ranking with non-random missing data. In ACM Conference on Recommender Systems (RecSys), 2009.
[9] B. Marlin, R. Zemel, S. Roweis, and M. Slaney. Collaborative filtering and the missing at random assumption. In Conference on Uncertainty in Artificial Intelligence (UAI), 2007.
[10] R. Pan, Y. Zhou, B. Cao, N. Liu, R. Lukose, M. Scholz, and Q. Yang. One-class collaborative filtering. In ICDM, 2008.
[11] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In KDDCup, 2007.
[12] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2008.
[13] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, 2007.
[14] R. Salakhutdinov and N. Srebro. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm, 2010. arXiv:1002.2780.
[15] C. Senot, D. Kostadinov, M. Bouzid, J. Picault, A. Aghasaryan, and C. Bernier. Analysis of strategies for building group profiles. In Conference on User Modeling, Adaptation and Personalization (UMAP), 2010.
[16] N. Srebro and T. Jaakkola. Weighted low-rank approximations. In ICML, pages 720-7, 2003.
[17] H. Steck. Training and testing of recommender systems on data missing not at random. In KDD, 2010.
[18] M. Weimer, A. Karatzoglou, Q. Le, and A. Smola. CofiRank - maximum margin matrix factorization for collaborative ranking. In Advances in Neural Information Processing Systems (NIPS), 2008.
