A Generalized Probabilistic Framework and Its Variants For Training Top-K Recommender Systems
Harald Steck (Bell Labs, Alcatel-Lucent, Murray Hill, NJ; [email protected])
Yu Xin∗ (CSAIL MIT, Cambridge, MA; [email protected])

∗ This work was done while an intern at Bell Labs, Alcatel-Lucent.
ABSTRACT

Accounting for missing ratings in the available training data was recently shown [3, 17] to lead to large improvements in the top-k hit rate of recommender systems, compared to state-of-the-art approaches that optimize the popular root-mean-square error (RMSE) on the observed ratings. In this paper, we take a Bayesian approach, which lends itself naturally to incorporating background knowledge concerning the missing-data mechanism. The resulting log posterior distribution is very similar to the objective function in [17]. We conduct elaborate experiments with real-world data, testing several variants of our approach under different hypothetical scenarios concerning the missing ratings. In the second part of this paper, we provide a generalized probabilistic framework for dealing with possibly multiple observed rating values for a user-item pair. Several practical applications are subsumed by this generalization, including aggregate recommendations (e.g., recommending artists based on ratings concerning their songs) as well as collaborative filtering of sequential data (e.g., recommendations based on TV consumption over time). We present promising preliminary experimental results on IP-TV data.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications–Data Mining

General Terms
Algorithms

Keywords
Recommender Systems

1. INTRODUCTION

The idea of recommender systems is to automatically suggest items to each user that s/he may find appealing. The quality of recommender systems can be assessed with respect to various criteria, including accuracy, diversity, surprise/serendipity, and explainability of recommendations.

This paper is concerned with accuracy. The root mean squared error (RMSE) has become the most popular accuracy measure in the literature of recommender systems, for both training and testing; its computational efficiency is one of its main advantages. Impressive progress has been made in predicting rating values with small RMSE, and it is impossible to name all approaches here (e.g., [4, 6, 7, 11, 13]). There is, however, also some work on optimizing the ranking of items, e.g., as measured in terms of normalized Discounted Cumulative Gain (nDCG) [18]. Despite their differences, these approaches have in common that they were trained and tested on observed ratings only. Obviously, such measures cannot immediately be evaluated if some items have unobserved ratings.

In this paper, we consider the top-k hit rate, based on all (unrated) items, as the natural accuracy measure for recommender systems, as only a few out of all unrated items can be recommended to a user in practice (see Section 3 for an exact definition of the top-k hit rate). While this measure is computationally tractable for testing the predictions of recommender systems, it is unfortunately computationally very costly for training them. For training, we thus resort to appropriate surrogate objective functions that are computationally efficient.

In recent work [3, 17], it was shown that the top-k hit rate can be significantly improved on large real-world data by accounting for the fact that the observed ratings provide a skewed picture of the (unknown) distribution concerning all (unrated) items and users.

Motivated by the results of [17], as the first contribution of this paper, in Section 2 we present a probabilistic approach that allows us to naturally include background knowledge concerning the (unknown) distribution of all items and users into our training objective function, the posterior probability of the model.

In our second contribution, we conduct elaborate experiments on the Netflix Prize data [1] and test several models under different hypothetical scenarios concerning the missing ratings in Section 4. These different scenarios serve as a sensitivity analysis, as the ground truth of the missing-data mechanism is unknown due to lack of data. These experiments are based on our popularity-stratified recall measure, which we define in Section 3.
As the third contribution of this paper, in Section 5 we generalize this probabilistic approach so as to account for possibly multiple observed rating values for a user-item pair. This general framework subsumes several applications in addition to the one outlined in Section 2, two of which are discussed in Section 5: while the training objective function for a recommender system concerning TV programs appears to be motivated in an ad-hoc manner in [5], we show that it can be understood and improved naturally in a Bayesian framework; apart from that, we also provide a Bayesian approach for making aggregate recommendations, e.g., recommending an artist or concert to a user based on the ratings given to individual songs.

2. MODEL TRAINING

In this section, we outline a probabilistic framework that allows us to incorporate background knowledge when training recommender systems on available data. The use of background knowledge in addition to the available training data can significantly improve the accuracy of recommender systems on performance measures like the top-k hit rate, recall, precision, and area under the ROC curve. This was demonstrated for implicit feedback data in [5, 10], and for explicit feedback data in [3, 17]. Like in [17], we use the background knowledge that missing rating values tend to reflect negative feedback, as experimentally observed in [9, 8]; i.e., negative feedback tends to be missing from the available data with a larger probability than positive feedback does.

The Bayesian approach lends itself naturally to this task. We consider the rating matrix R as a matrix of random variables: each element R_{i,u} concerning item i = 1, ..., i_0 and user u = 1, ..., u_0 is a random variable with a normal distribution, where i_0 denotes the number of items and u_0 the number of users.
2.1 Model
We take a collaborative filtering approach, and use a low-rank matrix-factorization model, which has proven successful in many publications. Like the rating matrix, we consider our model as a matrix of random variables, M. Each random variable M_{i,u} corresponds to the rating of item i assigned by user u. In matrix notation, it reads

M = r^{offset} + P Q^⊤    (1)

where r^{offset} ∈ R is an offset value, and P, Q are low-rank matrices of random variables with dimensions i_0 × d_0 and u_0 × d_0, respectively, where rank d_0 ≪ i_0, u_0. We use upper-case symbols to denote random variables (with a Gaussian distribution), and lower-case symbols to denote values.
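To make the model concrete, the following minimal sketch (in Python/NumPy; our illustration, not the authors' code, with all variable names and sizes being our assumptions) shows how Eq. 1 produces predicted ratings and how top-k recommendations would be read off them:

```python
import numpy as np

# Minimal sketch of the low-rank model of Eq. 1 (illustration only).
i0, u0, d0 = 1000, 500, 50                 # items, users, rank (illustrative sizes)
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(i0, d0))   # item factors
Q = rng.normal(scale=0.1, size=(u0, d0))   # user factors
r_offset = 2.0                             # global offset r^offset

M_hat = r_offset + P @ Q.T                 # Eq. 1: predicted rating matrix (i0 x u0)

def top_k_items(u, k=10):
    """Rank all items for user u by predicted rating (filtering of already
    rated items omitted in this sketch)."""
    return np.argsort(-M_hat[:, u])[:k]
```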
2.2 Prior over Matrix Elements

In our Bayesian approach, we first define the usual prior over the model parameters, i.e., over each entry of the low-rank matrices P and Q (see also [12]):

p(M | σ_P^2, σ_Q^2) = { ∏_i ∏_d N(P_{i,d} | 0, σ_{P,i}^2) } · { ∏_u ∏_d N(Q_{u,d} | 0, σ_{Q,u}^2) }    (2)

The vectors of variances σ_Q^2 = (σ_{Q,u}^2)_{u=1,...,u_0} and σ_P^2 = (σ_{P,i}^2)_{i=1,...,i_0} for all users u = 1, ..., u_0 and items i = 1, ..., i_0 are free parameters of the zero-mean normal prior distribution, denoted by N. There are several ways of defining the standard deviations in Eq. 2, eventually resulting in different kinds of regularization. The obvious choice is to assume that σ_{P,i} = σ_{Q,u} = 1/√(2λ_0) for all i, u, with λ_0 ∈ R. This results in the regularization term

log p(M | σ_P^2, σ_Q^2) = −λ_0 ( ‖P‖_2^2 + ‖Q‖_2^2 ) + c_1,

where ‖·‖_2 denotes the Frobenius norm of a matrix, and c_1 is an irrelevant constant when training our model.

When optimizing the root mean square error on observed data (like in the Netflix Prize competition [1]), however, numerous experimental works reported significant improvements from using a different regularization. It is obtained by choosing the standard deviations σ_P and σ_Q as follows: σ_{P,i} = 1/√(2λ_0 · u_0(i)) and σ_{Q,u} = 1/√(2λ_0 · i_0(u)), where i_0(u) denotes the number of items rated by user u, and u_0(i) is the number of users who rated item i. This results in the popular regularization term

log p(M | σ_P^2, σ_Q^2) = −λ_0 ( Σ_i u_0(i) Σ_d P_{i,d}^2 + Σ_u i_0(u) Σ_d Q_{u,d}^2 ) + c_2
                        = −λ_0 Σ_{observed (i,u)} Σ_d ( P_{i,d}^2 + Q_{u,d}^2 ) + c_2,    (3)

where c_2 again denotes an irrelevant constant when training our model. Note that this choice increasingly regularizes the model parameters related to the items and users with a larger number of observed ratings. This may seem counterintuitive at first glance; a theoretical explanation for this empirical finding was recently provided in [14].

2.3 Informative Background Knowledge

We now incorporate the following background knowledge into our sequential Bayesian approach: absent rating values tend to be lower than the observed ratings on average (see [17]). We insert this knowledge into our approach by means of a virtual data point for each pair (i, u): a virtual rating value r^{prior}_{i,u} with small confidence (i.e., large variance σ^2_{prior,i,u}). The likelihood of our model in light of these virtual data points then reads (assuming i.i.d. data):

p(r^{prior} | M, σ^2_{prior}) = ∏_{all i} ∏_{all u} p(R_{i,u} = r^{prior}_{i,u} | M_{i,u}, σ^2_{prior,i,u}),    (4)

where r^{prior} denotes the matrix of virtual data points r^{prior}_{i,u}, and σ^2_{prior} the matrix with elements σ^2_{prior,i,u}. We assume that the probabilities in this likelihood are determined by normal distributions with mean M_{i,u} and variance σ^2_{prior,i,u}. The log likelihood then reads

log p(r^{prior} | M, σ^2_{prior}) = −Σ_{all i} Σ_{all u} w^{prior}_{i,u} ( r^{prior}_{i,u} − M_{i,u} )^2 + c_3,    (5)

where we defined the weights of the virtual data points as w^{prior}_{i,u} = 1/(2σ^2_{prior,i,u}); c_3 is again an irrelevant constant when training our model.

With Bayes rule, we obtain the posterior distribution of the model in light of these virtual data points:

p(M | r^{prior}, w^{prior}) = p(r^{prior}, w^{prior} | M) p(M) / p(r^{prior}, w^{prior}).    (6)

This equation combines our prior concerning the elements of the matrices P and Q (for regularization) with our background knowledge on the expected rating values. It serves as our prior when observing the actual rating values in the training data.

2.4 Training Data

Now we use the rating values actually observed in the training data. The likelihood of the model in light of the observed rating values r^{obs}_{i,u} reads (assuming i.i.d. data):

p(r^{obs} | M, σ^2_{obs}) = ∏_{observed (i,u)} p(R_{i,u} = r^{obs}_{i,u} | M_{i,u}, σ^2_{obs,i,u}).

Again assuming a normal distribution, the log likelihood reads

log p(r^{obs} | M, σ^2_{obs}) = −Σ_{observed (i,u)} w^{obs}_{i,u} ( r^{obs}_{i,u} − M_{i,u} )^2 + c_4,    (7)

where we defined the weights of the observed rating values as w^{obs}_{i,u} = 1/(2σ^2_{obs,i,u}); c_4 is again an irrelevant constant when training our model.

2.5 Posterior

The posterior after seeing the observed ratings is again obtained by Bayes rule (we omit the weights w^{obs}, w^{prior} for brevity of notation here):

p(M | r^{obs}, r^{prior}) ∝ p(r^{obs} | M) p(M | r^{prior}).    (8)

Taking the logarithm and collecting the terms of Eqs. 3, 5 and 7, with a common regularization parameter λ in place of λ_0 and the regularization applied with the total weight of the data points, the log posterior reads

log p(M | r^{obs}, r^{prior}) = −Σ_{observed (i,u)} w^{obs}_{i,u} ( r^{obs}_{i,u} − M_{i,u} )^2 − Σ_{all (i,u)} w^{prior}_{i,u} ( r^{prior}_{i,u} − M_{i,u} )^2 − λ Σ_{i,u} ( w^{obs}_{i,u} + w^{prior}_{i,u} ) Σ_d ( P_{i,d}^2 + Q_{u,d}^2 ) + c_5,    (9)

where c_5 is an irrelevant constant when training our model. For simplicity, we choose all prior weights to be identical: w^{prior} = w^{prior}_{i,u} for all (i, u). In summary, the three tuning parameters in Eq. 9 are w^{prior}, r^{prior} and λ, which can be chosen so as to optimize the performance measure on cross-validation data.

2.6 MAP Estimate of Model

For computational efficiency, our training aims to find the maximum-a-posteriori (MAP) parameter estimate of our model, i.e., the MAP estimates P̂ and Q̂ of the matrices P and Q. We use the alternating least squares approach: one matrix can be optimized exactly while the other one is held fixed, and a local maximum of the log posterior can be found by alternating between the matrices P̂ and Q̂. While local optima exist [16], we did not find this to cause major computational problems in our experiments. The update equation for each row i of P̂ is (for fixed Q̂):

P̂_{i,·} = ( r̄_{i,·} − r^{prior} ) ( W̃^{(i)} + w^{prior} I ) Q̂ · [ Q̂^⊤ ( W̃^{(i)} + w^{prior} I ) Q̂ + λ ( tr(W̃^{(i)}) + w^{prior} u_0 ) I ]^{−1}    (10)

where r̄_{i,u} = ( r^{obs}_{i,u} w^{obs}_{i,u} + r^{prior} w^{prior}_{i,u} ) / ( w^{obs}_{i,u} + w^{prior}_{i,u} ) denotes the average rating; we define w^{obs}_{i,u} = 0 if the rating at (i, u) is missing; note that r̄_{i,u} − r^{prior} = 0 if the rating at (i, u) is missing; W̃^{(i)} is a diagonal matrix containing the ith column of the weight matrix w^{obs}; the trace is tr(W̃^{(i)}) = Σ_{u∈S_i} w^{obs}_{i,u}, where S_i is the set of users who rated item i; I denotes the identity matrix; and u_0 is the number of users. This equation can be rewritten for efficient computation, see, e.g., [17]. The update equation for Q̂ is analogous.
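The following sketch (Python/NumPy) spells out the update of Eq. 10 for a single row of P̂. It is our illustration under one reading of the notation above, not the authors' implementation; the dense array layout and all names are our assumptions.

```python
import numpy as np

def update_P_row(i, Q_hat, R_obs, W_obs, r_prior, w_prior, lam):
    """One alternating-least-squares step for row i of P_hat, following Eq. 10.
    A sketch under our reading of the notation, not the authors' implementation.
    R_obs, W_obs: i0 x u0 arrays with zeros at missing (i, u) pairs."""
    u0, d0 = Q_hat.shape
    w_obs = W_obs[i, :]                               # observed weights for item i
    # Weighted average of observed and virtual prior rating (r_bar in Eq. 10);
    # at missing entries r_bar equals r_prior, so (r_bar - r_prior) vanishes there.
    r_bar = np.full(u0, float(r_prior))
    rated = w_obs > 0
    r_bar[rated] = (R_obs[i, rated] * w_obs[rated] + r_prior * w_prior) \
                   / (w_obs[rated] + w_prior)
    w_tilde = w_obs + w_prior                         # diagonal of W~(i) + w_prior*I
    A = (Q_hat * w_tilde[:, None]).T @ Q_hat \
        + lam * (w_obs.sum() + w_prior * u0) * np.eye(d0)
    b = ((r_bar - r_prior) * w_tilde) @ Q_hat
    return np.linalg.solve(A, b)                      # new P_hat[i, :] (A symmetric)
```

Since r̄ − r^{prior} vanishes at missing entries, the right-hand side only touches the observed ratings of item i, which is what makes the update tractable despite the virtual data point at every (i, u) pair.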
[Figure 1 (plots omitted). Legend from panel (b): MF with w^{prior} = 0.005, λ = 0.04; MF with w^{prior} = 0.00005, λ = 0.06; MF with w^{prior} = 0, λ = 0.07; SVD; BS. Panel (a) plots # movies against # 5-star ratings (log-log); panels (c)-(f) plot recall against k.]

Figure 1: Netflix data [1]: (a) the number of relevant (i.e., 5-star) ratings per movie in the training data shows a close-to-power-law distribution (i.e., a straight line in the log-log plot); (b) legend of the models (see text for details); (c)-(f) show recall on the probe set for different hypothetical missing-data mechanisms concerning the relevant (i.e., 5-star) ratings (while an arbitrary missing-data mechanism is allowed for the other ratings): (c) relevant ratings are missing at random (γ = 0 in Eq. 13); (d) relevant ratings observed with probability increasing linearly with item popularity (γ = 1 in Eq. 13); (e) the unrealistic extreme case where γ → ∞ in Eq. 13; (f) relevant ratings of the 10% most popular items removed. As a result, for all these missing-data mechanisms, recall test results are improved by using an appropriately small prior weight w^{prior} > 0 during training, compared to the popular approach of ignoring the missing-data mechanism (w^{prior} = 0) during training.
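Eq. 13 itself is not part of this excerpt. Under the reading suggested by the caption, where a relevant rating of item i is observed with probability proportional to the item's popularity raised to the power γ, the scenarios of panels (c)-(e) could be simulated along the following lines; this is a hedged sketch, and the function name, tuple layout, and normalization are all our assumptions.

```python
import numpy as np

def subsample_relevant(relevant_pairs, popularity, gamma, keep_fraction=0.5,
                       rng=np.random.default_rng(0)):
    """Hedged sketch of the hypothetical missing-data scenarios of Fig. 1 (c)-(e):
    keep each relevant (item, user) test pair with probability proportional to
    popularity[item]**gamma. gamma=0 is missing at random; larger gamma favors
    popular items. relevant_pairs: array of (item, user) index pairs."""
    items = relevant_pairs[:, 0]
    p = popularity[items].astype(float) ** gamma
    p *= keep_fraction * len(p) / p.sum()      # scale to the desired keep rate
    keep = rng.random(len(p)) < np.clip(p, 0.0, 1.0)
    return relevant_pairs[keep]
```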
user-pairs. The ratings take integer values from 1 (worst) to 5 (best). The provided data are already split into a training and a probe set. We removed the probe set from the provided data so as to obtain our training set.

We consider 5-star ratings as relevant to a user (as defined above), and use the popularity-stratified recall, as outlined in Section 3, as our performance measure on the Netflix probe set. For all experiments, we chose rank d = 50 for our low-rank matrix factorization (MF) model (Eq. 1). In the following, we compare our MF model, trained with different prior weights w^{prior} > 0, against the popular MF approach that ignores the missing-data mechanism (i.e., w^{prior} = 0 in our notation); the latter achieved a root mean square error of 0.922 on the observed ratings in the probe set. Additionally, we compare to singular value decomposition (SVD), for which we used the svds function of Matlab, which implicitly imputes a zero value (with unit weight) for all missing ratings; and to the bestseller list (BS), which ranks the items according to the number of ratings in the training set. The values of the tuning parameters in our training objective function (the log posterior in Eq. 9) are summarized in Figure 1 (b). Like in [17], we chose the prior rating value r^{prior} = 2 in Eq. 9.

Figures 1 (c)-(f) show the performance of these models under different test scenarios, in terms of our popularity-stratified recall measure for the practically important range of small k values. For computational efficiency, we computed recall by randomly sampling, for a user, 1,000 unrated items for each relevant rating in the test set, like in [3] (see the sketch below). The only difference to the test procedure used in [7, 17] is that we sample from unrated items only, rather than from all items. This is more realistic; it also results in slightly higher recall values compared to the procedure used in [7, 17].
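A minimal sketch of this sampled test protocol follows (our illustration of the procedure described above, following [3]; the data structures and names are ours):

```python
import numpy as np

def sampled_recall_at_k(scores, rated_mask, test_relevant, k=20, n_sample=1000,
                        rng=np.random.default_rng(0)):
    """Recall@k estimated by sampling: each relevant (5-star) test rating is
    ranked against 1,000 randomly drawn unrated items of the same user.
    scores[u]: predicted scores for all items; rated_mask[u]: boolean array,
    True where user u has a training rating; test_relevant[u]: relevant items."""
    hits, total = 0, 0
    for u, items in test_relevant.items():
        unrated = np.flatnonzero(~rated_mask[u])
        for i in items:
            sample = rng.choice(unrated[unrated != i], size=n_sample, replace=False)
            n_better = np.sum(scores[u][sample] >= scores[u][i])
            hits += int(n_better < k)   # item i lands in the top k of the list
            total += 1
    return hits / total
```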
When comparing the different graphs, it is obvious that the performance of all the models depends on the (unknown) missing-data mechanism concerning the relevant ratings. In particular, when p^{obs} of the relevant ratings is assumed to increase more rapidly with growing popularity (N^{+}_{complete,i}), the expected recall on the (unknown) complete data decreases for all models, cf. Figures 1 (c)→(d)→(e), and (c)→(f).

As p^{obs} of the relevant ratings increases more rapidly with item popularity (compare Figures 1 (c)→(d)→(e)), the difference in recall among the various MF models decreases. Training with smaller but positive weights w^{prior} > 0 results in the best recall on the test set, even in the unrealistic extreme limit in Figure 1 (e). This suggests that, compared to the popular approach of ignoring the missing-data mechanism when training MF models, recall can be improved by using a small prior weight w^{prior} > 0; its value is upper-bounded by the value that optimizes the (usual) recall measure on the test set, i.e., under the assumption that the relevant ratings are missing at random, like in [17].

The bestseller list (BS) and SVD perform surprisingly well if the relevant ratings are missing at random, see Figure 1 (c), while the popular MF model with w^{prior} = 0 has low recall in comparison. This was also found in [17, 3]. BS and SVD perform rather poorly, however, if p^{obs} increases rapidly with item popularity, as shown in the extreme scenarios in Figures 1 (e) and (f). This suggests that not only BS but also SVD tends to recommend items that are popular in the available training data. Their recommendations may hence result in a relatively low degree of serendipity or surprise, relative to our MF models trained with a small positive prior weight.

5. GENERALIZED APPROACH

This section outlines a generalization of the Bayesian approach given above. In our probabilistic approach, we consider the rating matrix R as a matrix of random variables. As each entry R_{i,u} is a random variable (rather than a value), this naturally allows for possibly multiple observed values for each pair (i, u) in the data. This has several advantages over a matrix of values, which has typically been considered in the literature on recommender systems. After developing our generalized probabilistic framework, we outline three special cases / applications in Section 5.2.

Let the given data set be D = {r_{i,u,j}}_{i,u,j}, where i = 1, ..., i_0 is the index concerning items, u = 1, ..., u_0 is the index regarding users, and j = 1, ... is the index over the possibly multiple observed ratings for the same pair (i, u). The likelihood of the model in light of i.i.d. data reads

p(D | M) = ∏_{i,u,j} p(R_{i,u} = r_{i,u,j} | M).    (14)

Assuming again a normal distribution of the ratings (with standard deviations σ_{i,u,j}, or equivalently weights w_{i,u,j} = 1/(2σ^2_{i,u,j})), the log likelihood of the model is obtained by grouping the observations for each pair (i, u) by their rating value v, with aggregate weights w_{i,u,v} = Σ_{j: r_{i,u,j}=v} w_{i,u,j}; combined with the log prior of Section 2.2, this yields the log posterior of our model:

log p(M | D) = −Σ_{i,u} Σ_v w_{i,u,v} ( v − M_{i,u} )^2 − Σ_i ( 1/(2σ^2_{P,i}) ) Σ_d P^2_{i,d} − Σ_u ( 1/(2σ^2_{Q,u}) ) Σ_d Q^2_{u,d} + c_6    (16)

The standard deviations σ_{Q,u} and σ_{P,i} may be chosen so as to achieve the desired variant of regularization, as discussed in Section 2.2.

5.1 MAP Estimate of Model

For computational efficiency, we focus on optimizing the log posterior in Eq. 16. The maximum-a-posteriori (MAP) parameter estimate of our model, i.e., the MAP estimates P̂ and Q̂ of the matrices P and Q, can be determined by alternating least squares, which alternately optimizes one matrix while the other one is held fixed. Using the usual necessary condition for the optimum of Eq. 16, we equate its partial derivative to zero, and obtain the following update equation for each row i of P̂ (for fixed Q̂):

P̂_{i,·} = ( v̄_{i,·} − r^{offset} ) W̃^{(i)} Q̂ · [ Q̂^⊤ W̃^{(i)} Q̂ + ( 1/(2σ^2_{P,i}) ) I ]^{−1},    (17)

where I denotes the identity matrix, and W̃^{(i)} is a diagonal matrix containing the ith row of the aggregate weight matrix with elements

w_{i,u} = Σ_v w_{i,u,v},

and v̄_{i,u} is the average rating value

v̄_{i,u} = ( Σ_v v · w_{i,u,v} ) / w_{i,u}.

Analogously, the update equation for each row u of Q̂ is

Q̂_{u,·} = ( v̄_{·,u} − r^{offset} ) W̃^{(u)} P̂ · [ P̂^⊤ W̃^{(u)} P̂ + ( 1/(2σ^2_{Q,u}) ) I ]^{−1},    (18)

where W̃^{(u)} is the diagonal matrix containing the uth column of the aggregate weight matrix.

This derivation shows that optimizing Eq. 16 is equivalent to optimizing

log p(M | D) = −Σ_{i,u} w_{i,u} ( v̄_{i,u} − M_{i,u} )^2 − Σ_i ( 1/(2σ^2_{P,i}) ) Σ_d P^2_{i,d} − Σ_u ( 1/(2σ^2_{Q,u}) ) Σ_d Q^2_{u,d} + c_6    (19)
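The equivalence between Eqs. 16 and 19 amounts to collapsing all observations for a pair (i, u) into a single effective rating v̄_{i,u} with weight w_{i,u}. A minimal sketch of this aggregation step (our illustration; the tuple format is an assumption):

```python
from collections import defaultdict

def aggregate_observations(obs):
    """Collapse multiple observations per (item, user) pair, cf. Eqs. 16-19.
    obs: iterable of (i, u, value, weight) tuples, one per observed rating
    r_{i,u,j} with weight w_{i,u,j} = 1 / (2 * sigma_{i,u,j}**2).
    Returns w[(i, u)] = sum_j w_{i,u,j} and v_bar[(i, u)], the weighted mean."""
    w = defaultdict(float)
    wv = defaultdict(float)
    for i, u, v, weight in obs:
        w[(i, u)] += weight
        wv[(i, u)] += weight * v
    v_bar = {pair: wv[pair] / w[pair] for pair in w}
    return dict(w), v_bar

# e.g. aggregate_observations([(0, 3, 5.0, 1.0), (0, 3, 4.0, 1.0)])
# -> weight 2.0 and effective rating 4.5 for pair (0, 3)
```

After this reduction, the ALS updates of Eqs. 17 and 18 have exactly the structure of standard weighted matrix factorization, with v̄ and the aggregate weights in place of single observed ratings.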