0% found this document useful (0 votes)
188 views

Kernel Matrix Factorization Models

kernel matrix models

Uploaded by

normal lochuhsin
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
188 views

Kernel Matrix Factorization Models

kernel matrix models

Uploaded by

normal lochuhsin
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Online-Updating Regularized Kernel Matrix Factorization

Models for Large-Scale Recommender Systems

Steffen Rendle, Lars Schmidt-Thieme


Machine Learning Lab
Institute for Computer Science
University of Hildesheim, Germany
{srendle, schmidt-thieme}@ismll.uni-hildesheim.de

ABSTRACT tinction between training and prediction. In fact after the


Regularized matrix factorization models are known to gen- training phase visitors will generate new feedback (e.g. rate
erate high quality rating predictions for recommender sys- items). For users that have already rated a lot the user’s pro-
tems. One of the major drawbacks of matrix factorization file will not change much, but for a new user each feedback
is that once computed, the model is static. For real-world that he gives will result in much change in his inferable taste.
applications dynamic updating a model is one of the most Therefore adding ratings to a user with a small rating profile
important tasks. Especially when ratings on new users or should result in a much better model and thus to better pre-
new items come in, updating the feature matrices is crucial. dictions for this user. We will refer to this as the ‘new-user
In this paper, we generalize regularized matrix factoriza- problem’. Symmetrically for items, the ‘new-item problem’
tion (RMF) to regularized kernel matrix factorization is formulated. The scenario of new items and new users is
(RKMF). Kernels provide a flexible method for deriving important in almost any real-world recommender and is es-
new matrix factorization methods. Furthermore with ker- pecially crucial in domains with changing content e.g. on
nels nonlinear interactions between feature vectors are pos- a news website, TV program, etc. The new-user problem
sible. We propose a generic method for learning RKMF occurs in almost every application from online shopping to
models. From this method we derive an online-update algo- personalized websites.
rithm for RKMF models that allows to solve the new-user/ In this work, first we generalize regularized matrix fac-
new-item problem. Our evaluation indicates that our pro- torization (RMF) to regularized kernel matrix factorization
posed online-update methods are accurate in approximat- (RKMF). A kernel function allows to transform the prod-
ing a full retrain of a RKMF model while the runtime of uct of the factor matrices. Kernels like the s-shaped logistic
online-updating is in the range of milliseconds even for huge function allow to impose bounds on the prediction (e.g. one
datasets like Netflix. to five stars) while still being differentiable. We will pro-
pose a gradient descent based algorithm for learning RKMF
models. We will show that best k-rank SVD approximations
Categories and Subject Descriptors are different from RKMF models and that for recommender
I.2.6 [Learning]: Parameter learning systems RKMF clearly outperforms SVD as RKMF models
do not need imputation and have a regularization term.
General Terms Secondly, we show how online-updates can be applied
on a learned RKMF model without having to retrain the
Algorithms, Experimentation, Measurement, Performance whole model. Our online-update methods especially tar-
gets the new-user and new-item problem. The proposed
1. INTRODUCTION update methods are generic and applicable for all RKMF
Regularized matrix factorization is known to be one of models. They are derived directly from the gradient de-
the most successful methods for rating prediction outper- scent method that we propose for learning RKMF models.
forming other methods like pearson-correlation based kNN In the evaluation we will see that the online-updates for
or co-clustering [1, 14, 3]. One drawback is that the model new-user/new-item problems approximate the prediction of
(the factorization) is learned in batch mode. After having retraining the whole model very well. Both theoretical and
learned the model, it is applied for prediction. For lab eval- empirical results show that the learning complexity of our
uation this works fine. But in real-world scenarios like an online-updates is low. That makes RKMF models an ideal
online shop or a video rental service there is not such a dis- choice for real-world applications both in terms of runtime
complexity and prediction quality.
In all, our contributions are as follows: (i) We introduce
the model class of regularized kernel based matrix factor-
Permission to make digital or hard copies of all or part of this work for ization for rating prediction in recommender systems. We
personal or classroom use is granted without fee provided that copies are provide a generic learning method and present three con-
not made or distributed for profit or commercial advantage and that copies crete instances of this class. (ii) We show how the major
bear this notice and the full citation on the first page. To copy otherwise, to problem of updating features of new users and new items
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. can be solved efficiently for the whole model class of regu-
RecSys’08, October 23–25, 2008, Lausanne, Switzerland. larized KMF. An extensive evaluation substantiates the ef-
Copyright 2008 ACM 978-1-60558-093-7/08/10 ...$5.00.
fectiveness of our generic online-update-rules both in terms the rows specify objects and the columns specify attributes.
of prediction quality and runtime. They map the entries within the attribute dimension in a
higher dimension and use kernels to perform multiplications
2. RECOMMENDER SYSTEMS between two attribute vectors (columns). Our approach dif-
fers as we use the kernel between user and item vectors (row
The central task in recommender systems is to predict the
and column). Furthermore we use regularization to learn
taste of a user. In this paper, we deal with the rating pre-
the model. In [14] Takacs et al. propose to apply a round-
diction problem that tries to predict how much a user likes
ing function on parts of the output of MF. If their rounding
a particular item. The estimated ratings can be used to rec-
function is applied to the whole output, i.e. the dot prod-
ommend items to the user, e.g. what movies he might want
uct, this can be seen as a special case of a kernel. Similarly
to watch. The prediction is based on the user’s feedback
Salakhutdinov and Mnih [9] suggest to use the logistic func-
and feedback of other users in the past. Both feedback and
tion to bound the output of the dot product. The kernel
prediction are supposed to be numerical values, e.g. one to
approach that we suggest is more general.
five stars.
Updating k-rank SVD models has already been studied in
2.1 Rating Prediction the field of recommender systems [2, 10]. As we will see,
best k-rank SVD and regularized MF are different. For the
The problem of rating prediction can be seen as a matrix task of rating prediction where there is a huge number of
completion task, where a rating matrix R should be com- missing values, SVD suffers from imputation and overfitting.
pleted. The rows of R correspond to the users U and the A detailed discussion on the differences between SVD and
columns to the items I. Thus the matrix has dimension regularized (K)MF can be found in section 4.5.
|U | × |I|. The entry ru,i of the matrix R contains the rating Online updates for kernel methods like SVMs or kernel
value of user u for item i. R is sparse, that means many regression have been studied e.g. by Kivinen et al. [7]. In
values are unobserved (missing) values. We denote the set contrast to this work we present online updates for regular-
of observed ratings by S which contains triples (u, i, v) or ized kernel matrix factorization.
ru,i of feedback from the user. The task of rating prediction If there are attributes on users (e.g. demographic data) or
is to complete R with an estimation R̂. attributes on items (e.g. genre, actors) this information can
be used for recommendation. Especially when little rating
2.2 New-User Problem / New-Item Problem information on a specific user or item is present like in the
Typically, applications of recommender systems are dy- new-user and new-item problem (aka ‘cold start problem’),
namic. That means that new users sign in, new items are attribute information can improve the prediction quality [12,
added and new ratings are given. On the other hand, many 5]. In this paper we deal with problems without any at-
large-scale recommender models are static, e.g. correlation tribute information.
coefficients or nearest neighbors often are precomputed for
collaborative filtering or factor matrices have been computed
for matrix factorization. If after training these models, a 4. REGULARIZED KERNEL MATRIX FAC-
new user signs in and gives some ratings, the recommender TORIZATION
system has not adapted and gives poor recommendations.
In this section, we first introduce matrix factorization and
That means from the perspective of the user, the recom-
then we generalize this to kernel matrix factorization. As
mendations do not improve when he gives additional rat-
MF models have a huge number of parameters, for opti-
ings. This is especially disappointing at the beginning when
mizing a regularization term is added to prevent overfitting.
one expects large improvements from giving ratings. In this
We propose a stochastic gradient descent approach to learn
paper we deal with this problem both for new items and new
RKMF models in general and provide update rules for three
users.
models, i.e. linear, logistic and linear with non-negative con-
We define the ‘new-user problem’ as follows: The profile
straints. At the end of this section we compare RKMF to
C(u, ·) of a user u grows from 0 ratings to k ratings. Where
best k-rank SVD. Sometimes people refer to RMF as regu-
C is defined as:
larized SVD which is not accurate as this is not a singular
C(u, i) := {ru0 ,i0 ∈ S|u0 = u ∧ i0 = i} (1) value decomposition. In this paper we use the term of SVD
for real singular value decomposition and MF for matrix fac-
Symmetrically, we define the ‘new-item problem’ as follow- torization in two matrices. We will see that our proposed
ing: The profile C(·, i) of a item i grows from 0 ratings to k RKMF models are better suited and provide better results
ratings. for rating prediction than best k-rank SVD. The reasons are
that (1) SVD tends to overfit because of missing regulariza-
3. RELATED WORK tion and (2) SVD needs imputation of missing values.
There are many approaches for rating prediction. Long-
established is collaborative filtering based on the k-nearest- 4.1 Matrix Factorization (MF)
neighbor method (kNN) [4, 11, 1]. Other approaches use la- Matrix factorization is the task of approximating the true
tent semantic models [6], classifiers [13], etc. Another class unobserved ratings-matrix R by R̂ : |U | × |I|. With R̂ being
of models is based on matrix factorization (MF). MF tech- the product of two feature matrices W : |U | × k and H :
niques have shown to be very effective on several datasets |I| × k, where the u-th row wu of W contains the k features
including Netflix [15, 14, 1]. that describe the u-th user and the i-th row hi of H contains
There is already research in kernels for matrix factoriza- k corresponding features for the i-th item.
tion. Zhang et al. [16] present non-negative MF on ker-
nels. Their motivation is to approximate a matrix where R̂ = W · H t (2)
Or equivalently: Netflix: RMSE

k
Train

1.05
X
r̂u,i = hwu , hu i = wu,f hi,f (3) Test

f =1

0.95
Often bias terms are added which are equivalent to center-

RMSE
ing the approximization, so that only residuals have to be

0.85
learned:
k

0.75
X
r̂u,i = bu,i + wu,f hi,f (4)
f =1
0 50 100 150 200

Normally the bias term bu,i is something like the global av- Iterations

erage, user average or item average, but also could be the


result of another prediction algorithm. In our experiments
Figure 1: Fit on training and evaluation set (‘probe’)
we set bu,i to the global average rating, i.e. avg ru,i .
ru,i ∈S of regularized linear MF on Netflix.

4.2 Kernel Matrix Factorization (KMF)


4.4 Learning Matrix Factorization Models
Like matrix factorization, kernel matrix factorization (KMF)
uses two feature matrices that contain the features for users The most important part for learning MF models is deal-
and items, respectively. But the interactions between the ing with overfitting. Factorization models have a huge num-
feature vector wu of a user and the feature vector hi of an ber of parameters, i.e. O(k · (|U | + |I|)), that should approx-
item are kernelized: imate a matrix with lots of missing values. An illustrative
example is the Netflix datasets with about 480, 000 users and
r̂u,i = a + c · K(wu , hi ) (5) 17, 000 items which leads to about 50 million free parame-
ters for MF with k = 100. For Netflix the number of given
The terms a and c are introduced to allow rescaling the ratings is about 100 million, that means R has about 8, 060
predictions. For the kernel K : Rk × Rk → R one can use million missing values. It is obvious that learning 50 million
one of the following well-known kernels: parameters from 100 million ratings will lead to overfitting.
So strategies for overfitting play a central role for learning
Kl (wu , hi ) = hwu , hu i linear (6)
algorithms. For MF usually two strategies are proposed: (1)
d
Kp (wu , hi ) = (1 + hwu , hu i) polynomial (7) regularization and (2) early stopping. A second problem is

||wu − hu ||2
 the high number of missing values. Thus the optimization of
Kr (wu , hi ) = exp − RBF (8) model parameters is done only with respect to the observed
2σ 2
values S of R. For optimizing wrt RMSE, the task is:
Ks (wu , hi ) = φs (bu,i + hwu , hu i) logistic (9)
1 argmin E(S, W, H) (11)
with φs (x) := W,H
1 + e−x
with
It is obvious that normal matrix factorization like in equa- X
tion (4) can be expressed with a = bu,i and c = 1 and the E(S, R̂) := E(S, W, H) := (ru,i − r̂u,i )2 (12)
linear kernel Kl . ru,i ∈S

One benefit from using a kernel like the logistic one is that
values are naturally bound to the application domain and do 4.4.1 Regularization
not have to be cut, e.g. a prediction of 6.2 stars in a sce- Instead of learning the optimal fit of W · H t on the ob-
nario with at most 5 stars is not possible with the proposed served values, a regularization term is added to the opti-
logistic kernel. Secondly, kernels can provide non-linear in- mization task. Usually Tikhonov regularization aka ridge
teractions between user and item vectors. Finally, another regression is used where a parameter λ controls the regular-
benefit is that kernels lead to different models that can be ization. Hence, the optimization task is:
combined in an ensemble. The Netflix challenge has shown
that ensembling (e.g. blending) many models achieves the argmin Opt(S, W, H) (13)
W,H
best prediction quality [1, 15, 14].
with
4.3 Non-negative Matrix Factorization Opt(S, W, H) := E(S, W, H) + λ ||W ||2F + ||H||2F

(14)
Non-negative matrix factorization is similar to matrix fac-
torization but poses additional constraints on the feature
matrices W and H. It is required that all elements of both 4.4.2 Optimization by Gradient Descent
matrices are non-negative. The motivation is to eliminate in- For optimizing formula (14) different techniques might be
teractions between negative correlations which has been suc- used. For normal matrix factorization often stochastic gradi-
cessfully applied in some collaborative filtering algorithms. ent descent is used. We propose to use this also for KMF, so
With the linear kernel reasonable meta parameters are: for optimizing only the partial derivative of K has to calcu-
lated. Also minimizing another (differentiable, monotonic)
a := rmin , c := rmax − rmin (10) loss function than RMSE is easy, because only E(S, W, H)
1: procedure Optimize(S, W, H) Again, the features can be initialized with small random
2: initialize W , H values around 0 when the following bias term is used:
3: repeat  
4: for ru,i ∈ S do c
bu,i = − ln (20)
5: for f ← 1, . . . , k do g−a
6: wu,f ← wu,f − α ∂w∂u,f Opt({ru,i }, W, H)
The other hyper-parameters for the logistic kernel are a =
7: hi,f ← hi,f − α ∂h∂i,f Opt({ru,i }, W, H)
rmin and c = rmax − rmin .
8: end for
9: end for 4.4.5 Non-Negative constraints
10: until Stopping criteria met With a linear kernel and non-negative constraints on both
11: return (W, H) W and H, the update rules can use a projection step to
12: end procedure ensure non-negative elements:
 
Figure 2: Learning KMF by gradient descent. ∂
wu,f ← max 0, wu,f − α Opt({ru,i }, W, H) (21)
∂wu,f
 

has to be differentiated. In all, the generic learning algo- hi,f ← max 0, hi,f − α Opt({ru,i }, W, H) (22)
rithm is outlined in Figure 2. The parameter α is called ∂hi,f
the learning rate or step size. With the regularization term, The derivations are the same as in formula (15), (16) and
avoiding overfitting by early stopping becomes less impor- (17). In contrast to unconstrained KMF, when dealing with
tant. Nevertheless, early stopping can be applied to speed non-negative constraints, the non-negative model cannot be
up the training process. In the simple case, the stopping cri- centered around the global average with a bias term because
terion is a fixed number of iterations that could be optimized otherwise no rating below the average would be possible.
by a holdout method on the training set. Thus, a initialization around 0 would lead to predictions
In the following, we give detailed optimization rules for around rmin . A better initialization for non-negative ma-
three variants of RKMF, i.e. a linear kernel, a logistic kernel trix factorization is to set the values of both W and H s.t.
and a linear kernel with non-negative constraints. In the the predictionsqare near the global average, which leads to
derivation, we will discard all positive constants that can
be integrated in the learning rate α or the regularization λ. wu,f = hi,f = k·(rg−r min
max −rmin )
+ noise
The partial derivations of Opt({ru,i }, W, H) are:
∂Opt({ru,i }, W, H) ∂
4.5 SVD versus Regularized KMF
∝ (r̂u,i − ru,i ) · K(wu , hi ) + λwu,f Singular Value Decomposition (SVD) is a technique for
∂wu,f ∂wu,f
(15) decomposing a matrix into three matrices. With our no-
tation that would be R = W 0 ΣH 0 with W 0 : |U | × |U |,
∂Opt({ru,i }, W, H) ∂ Σ : |U | × |I| and H 0 : |I| × |I| where Σ is a diagonal ma-
∝ (r̂u,i − ru,i ) · K(wu , hi ) + λhi,f
∂hi,f ∂hi,f trix containing the singular values. One can show, that the
(16) best k-rank approximation R̂ of R is given by using only the
k largest singular values and setting the remaining singular
In all, only the partial derivative of K(wu , hi ) is kernel spe-
values to zero. This means that one could reduce the number
cific.
of columns of W 0 and H 0 to k and obtaining two matrices
4.4.3 Linear kernel W : |U | × k and H : |I| × k that give the best approximation
of W · H t to R. As SVD is a well-studied technique in many
Optimizing formula (14) with gradient descent (figure 2)
fields, e.g. image analysis or numerics, the question arises
for a KMF with a linear kernel corresponds to normal regu-
why not to use it in recommender systems. As we indicated
larized matrix factorization. For the complete update rules
before the task of matrix approximation in recommender
(15) and (16) we need the partial derivation of Kl (wu , hi ):
systems differs from other fields like image analysis.
∂ ∂ First of all, in recommender systems we deal with a huge
K(wu , hi ) = hi,f , K(wu , hi ) = wu,f (17)
∂wu,f ∂hi,f amount of missing/ unobserved values. E.g. for Netflix
the sparsity factor is about 99%. Take care that sparsity
For initialization of both feature matrices wu,f and hi,f , in recommender systems means missing/ unobserved values
small random values around 0 should be used. This way, whereas in SVD literature often zero values are meant. Be-
in the beginning r̂u,i is near the bias term. fore an SVD can be calculated, the missing values have to be
4.4.4 Logistic kernel estimated (‘imputation’). Choosing 0 as missing value is ob-
viously not a good idea because then most of the predicted
For the logistic kernel (9), the update rules require to ratings in R̂ would be around 0 as well. A better idea is
differentiate Ks (wu , hi ) which includes the logistic function: to use another prediction algorithm to estimate the missing
∂ values. In a simple case that could be the column or row
K(wu , hi ) ∝ hi,f · φ2s (bu,i + hwu , hu i) · e−bu,i −hwu ,hu i mean. The SVD can then be calculated on this full matrix.
∂wu,f
(18) An efficient implementation should recenter the matrix (e.g.
around the column mean) before applying a standard SVD

K(wu , hi ) ∝ wu,f · φ2s (bu,i + hwu , hu i) · e−bu,i −hwu ,hu i algorithm. A second problem for both MF and SVD is over-
∂hi,f fitting. As indicated before, in regularized MF usually regu-
(19) larization and early stopping is applied to avoid overfitting.
SVD Regularized MF 1: procedure UserUpdate(S, W, H, ru,i )
linear logistic lin. non-neg. 2: S ← S ∪ {ru,i }
Netflix RMSE 0.946 [8] 0.915 0.918 0.914 3: return UserRetrain(S, W, H, u)
4: end procedure
Table 1: RMSE results on Netflix probe for RKMF
and k-rank SVD. 5: procedure UserRetrain(S, W, H, u∗ )
6: initialize u∗ -th row in W
7: repeat
In contrast to this, choosing the best k-rank approximation 8: for ru,i ∈ C(u∗ , ·) do
of an SVD will lead to overfitting of R̂. 9: for f ← 1, . . . , f do
As we have seen, in the domain of recommender systems 10: wu,f ← wu,f − α ∂w∂u,f Opt(S, W, H)
SVD has several drawbacks in contrast to RMF. First of all 11: end for
the high number of missing values that have to be estimated 12: end for
and the lack of regularization leads to overfitting. Table 1 13: until Stopping criteria met
shows a comparison of SVD to several regularized matrix 14: return (W, H)
factorization methods. The evaluation has been done on 15: end procedure
the Netflix dataset, where Netflix ‘probe’ was used as test
dataset. We compare the RKMF prediction quality to the
Figure 3: Online updates for new-user problem.
best SVD results reported by [8]. Their best SVD model uses
the Lanczos algorithm with imputation by EM. This SVD
model has best quality on the ‘probe’ set with 10 dimensions
– for more dimensions they report overfitting. Our regular-
ized KMF methods use 40 dimensions (k = 40) and we do in stochastic gradient descent the sequence of how ratings
not observe overfitting (see fig. 1 for the linear case). In in S are visited is important and (2) information between
fact, even if we enlarge the number of dimensions in RKMF iterations propagates through the matrices. The best that
(e.g. k = 100) the quality still increases. This evaluation can be done is to approximate R̂S∪{ru,i } from R̂S .
has shown, that regularization is important for successfully We propose the algorithm UserUpdate (see fig. 3) for
learning a model and avoiding overfitting. In the rest of this solving the new-user problem. This algorithm retrains the
paper we will not deal with SVD any more, as regularized whole feature vector for this user and keeps all other entries
matrix factorization is obviously the better method for the in the matrix fixed. The motivation for this algorithm is the
task of recommender systems. assumption that the model build from S and the model build
from S ∪{ru,i } is mostly the same from a global perspective.
But if user u is a new-user, his (local) features might change
5. ONLINE UPDATES a lot from the new rating ru,i . That is why we fully retrain
In this section we provide methods for solving the new- this user and keep the other features fixed, as we assume
user and new-item problem. That means it is assumed that them to be already the best guess.
an existing factorization, i.e. (W, H), is given and then a The time complexity of UserRetrain is O(|C(u, ·)|·k ·i).
new rating comes in. We provide methods for updating the In our evaluation we will see, that retraining users for each
factorization (W, H) both in case for a user with a small rat- incoming rating becomes less important when the user pro-
ing profile and an item with a small profile. These methods file increases. That is why |C(u, ·)| usually is small. Besides
will use the same update rules as in the last section, that time complexity and good quality (see evaluation), one of
means the derivations for training the model can be reused the major advantages of UserUpdate is that it is generic
for online-updates. We will describe all methods in terms of and applicable to any RKMF model. That means that no
the new-user problem. Of course everything can be applied kernel specific algorithm or additional update formulas have
to the new-item problem as well because KMF models are to be designed.
symmetric.

5.1 Training KMF models


5.3 Further Speedup
Obviously, retraining the whole KMF model with the al-
gorithm in fig. 2 after a new rating comes in is not applicable We have argued that retraining a user u on a new in-
at all as it has complexity O(|S| · k · i) where i is the number coming rating ru,i is very important for a user with a small
of iterations before early stopping is applied. In the Net- profile. With growing profiles, the update is less important
flix use-case with k = 40, i = 120 and |S| = 100, 000, 000 (see fig. 5). In cases where an additional speedup is needed,
this would lead to about 480 billion feature updates. In this one can apply several rules that determine if user-retrain
paper, we propose an approximization method that updates can be skipped. We will present approaches that define a
the matrices of an existing model and that has complexity probability whether online-updates are performed or not.
O(|C(u, ·)| · k · i) where C(u, ·) is the current profile of the Depending on this probability the recommender can skip
user. some online-update steps.

5.2 Approximating Updates


Let R̂S denote the factorization calculated from the ob- 5.3.1 Profile Size
served values S of R by the algorithm in fig. 2. Afterwards A reasonable assumption is that the larger the profile of a
a new rating ru,i comes in. First of all, an exact recon- specific user is, the less important is retraining his profile on
struction of R̂S∪{ru,i } from R̂S cannot be calculated as (1) each new rating. Thus, the probability of retraining a user
1: procedure AddRating(S, W, H, ru,i ) 6. EVALUATION
2: S ← S ∪ {ru,i }
In the evaluation we want to examine how our proposed
3: return UpdateRating(S, W, H, ru,i )
online-update methods perform on real world problems. The
4: end procedure
goal of the update methods is (1) to approximate the quality
of fully retraining as good as possible and (2) to have low
5: procedure RemoveRating(S, W, H, ru,i )
runtime.
6: S ← S \ {ru,i }
7: return UpdateRating(S, W, H, ru,i ) 6.1 Evaluation Protocol
8: end procedure
We simulate the new-user problem the following way:
9: procedure UpdateRating(S,
W, H, ru,i ) 1. Create a new-user-scenario:
10: if Pu (train ru,i ) > random then
11: (W, H) ← UserRetrain(S, W, H, u) (a) Pick n% of the users and put them in Ut .
12: end if (b) for each unknown user u ∈ Ut do
13: if Pi (train ru,i ) > random then
14: (W, H) ← ItemRetrain(S, W, H, i) i. Split the ratings in C(u, ·) in two disjoint sets
15: end if Tu and Vu . The size of Tu is min{m, |C(u,·)|
2
}
16: return (W, H) and so Vu = C(u, ·) \ Tu .
17: end procedure ii. Remove all of the ratings C(u, ·) from the set
of known ratings S: S ← S \ C(u, ·)
Figure 4: General algorithm for online-updates.
2. Train the model on S: (W, H) ← Optimize(S, W, H)
3. Evaluate the new-user-scenario:
could decay with increasing profile size: for j = 1, . . . , m do
(a) for each unknown user u ∈ Ut do
Pu (train|ru,i ) = γ |C(u,·)| , γ ∈ (0, 1) (23)
  i. add one rating ru,i ∈ Tu to S
m
Pu (train|ru,i ) = max 1, , m ∈ N+ (24) ii. update the model:
|C(u, ·)|
(W, H) ← UserUpdate(S, W, H, ru,i )
With (24) the average/expected runtime complexity is O(m· iii. calculate error seju := E(Vu , W, H) on Vu
k · i) and thus independent of the profile size.
v
u 1 X
j
(b) calculate rmse := u u X · seju
|Vu | u:|Tu |≥j
5.3.2 Expected Impact
t
u:|Tu |≥j
Another approach would be retraining if the new rating
ru,i is expected to change the features. To measure the With this protocol, one can examine how well a model can
expected impact of the new rating on a user’s profile one adapt to a user’s taste with an increasing size of the user’s
could use the error between the prediction of this rating and rating profile. It is obvious that the protocol simulates a
the true value. The larger this error is the more the rating true online scenario where the rating profile of several new
is expected to change the features. To map the error to users increase from one up to m ratings. Please note that
the interval [0, 1] we propose to use a smooth monotonically we update and evaluate the models after adding each single
increasing function like tanh: rating and not only after having added one rating for all
users. Thus we can see the quality of the immediate response
P (train|ru,i ) = tanh (r̂u,i − ru,i )2

(25) on a user’s recommendations. The evaluation protocol for
the new-item problem is symmetric to the protocol of new-
5.4 General Update Problem users.
In the previous part of this section the algorithms and 6.2 Datasets
methods have been described from the perspective of the
In our experiments we evaluate on two movie recommen-
new-user problem. However, all methods and algorithms
dation datasets, i.e. Netflix1 and Movielens2 . Netflix is
can be directly transfered to the new-item problem by ex-
the state-of-the-art evaluation dataset for recommender sys-
changing users with items.
tems with over 100 million ratings of about 480,000 users
In general, a new rating ru,i might influence the features of
and 17,000 items. There are two Movielens datasets, we
both user u and item i. If both profiles are small, one could
choose the larger one which contains about 1 million ratings
update both feature vectors. The outcome of this is a gen-
by 6040 users and 3706 items.
eral update algorithm (fig. 4) that first performs an online
update of the user’s features and then an online update of 6.3 Methodology
the item’s features. Both online updates are only executed if
We run the proposed evaluation protocol (see section 6.1)
they are not pruned by one of the rules P (train|ru,i ). This
for both the new-user problem and the new-item problem on
way, it is more likely to perform an update on a small profile
Netflix and Movielens. For Movielens we choose n = 10%
than on a large one.
and for Netflix n = 1%. The profile size for each user and
The proposed algorithm UpdateRating is not limited to
1
new ratings, but can also be applied for removing ratings or http://www.netflixprize.com
2
changing a rating. A sketch for this task is found in fig. 4. http://www.grouplens.org
item resp. is grown from 0 to m = 50 ratings as indicated Dataset Training Regularized MF
in the evaluation protocol. Each Movielens experiment is linear logistic non-neg.
run 10 times where the evaluation folds (the unknown users Movielens Online Update 0-1 ms 0-10 ms 0-1 ms
and items resp.) do not overlap (cv-style). For the Netflix Retrain 18 s 3.5 min 25 s
dataset we rerun the experiment four times on non over- Features k 10 10 10
lapping user/ item sets. We report the mean of the RMSE # Iterations 30 270 50
results as described in the evaluation protocol. For each ex- Netflix Online Update 0-15 ms 0-15 ms 0-18 ms
periment we run three types of RKMF, i.e. a linear kernel, Retrain 11.6 h 10.2 h 13.8 h
a logistic kernel and a linear kernel with non-negative con- Features k 40 40 40
straints. For Netflix each RKMF model has k = 40 features # Iterations 120 120 120
and for Movielens k = 10 (see table 2).
As we want to examine how good online-updates approx- Table 2: Runtime of full retrain and runtime of pro-
imate a full retrain, we train and evaluate a second model posed online updates wrt to profile size C(u, ·) (for
(W ∗ , H ∗ ) at the end of the j-loop (3.). This model (W ∗ , H ∗ ) ‘new-user’ problem).
is generated by a full retrain, i.e. by Optimize(S, W ∗ , H ∗ ),
on all ratings S. Thus (W ∗ , H ∗ ) is the model that the
update costs. It is obvious, that online-updates are clearly
online-updates try to approximate. To measure how good
faster than retraining the whole model. E.g. with linear
the approximation is in terms of prediction quality, we cal-
RKMF on Netflix, retraining costs about 12 hours whereas
culate the RMSESon the same evaluation set as the online
an update as described in algorithm (3) takes only 0 to 15 ms
updates, i.e. on u:|Tu |≥j Vu . For Netflix we measure the
depending on the profile size |C(u, ·)| of the user/ the item.
quality of a full retrain only for j = {10, 25, 50} because These empirical results match to the theoretical complexity
for Netflix fully retraining a second model in each itera- that is O(|C(u, ·)|·k·i) for online-updates instead of O(|S|·k·
tion (m=50), for each kernel (3), each run (4) and both i) for full retraining, where S is the set of all ratings whereas
new-user and new-item problem (2) would have cost about C(u, ·) are the ratings of a specific user.
50·4·3·2·10 hours = 500 days of CPU runtime. Nevertheless,
our evaluation is computationally expensive. In total all ex-
periments took 54 days of CPU runtime. All experiments 7. CONCLUSION
were run on machines of the same type (CPU, RAM). In this paper we have proposed the class of regularized
kernel matrix factorization (RKMF) and a generic learn-
6.4 Quality ing algorithm based on gradient descent. We have provided
Figure 5 shows the evaluation of the new-user- and the generic online-update methods for RKMF models that are
new-item-problem on both Netflix and Movielens. As one based on the same gradient descent step that are also used
can see, the error curves of the full retrain and the online for training the model. For static rating prediction, RMF
updates of all three kernel methods are quite similar. Espe- are known to be one of the best models with regard to pre-
cially for non-negative RMF the online-updates are almost diction quality. The drawback of RMF models is that once
the same as the full retrain. For linear and logistic RMF the factorization is computed, they cannot handle updates.
the difference between online-update and full retrain are Our online-updates make RMF and the more general RKMF
about 1% in the worst case. This shows that the proposed class applicable for dynamic real-world scenarios where new
online-updates approximate the quality of fully retraining users sign in and item catalogs are extended. We have shown
the model very well. that the proposed online-updates approximate the quality
Secondly, all three factorization methods show promising of fully retraining the model very well. On the other hand,
quality results on the datasets. E.g. with non-negative RMF both empirical and theoretical results for runtime complex-
a user profile size of 7 ratings is enough to obtain a RMSE ity of online-updates make RKMF models feasible for huge
of below 95% which means beating the overall RMSE of datasets like Netflix and other dynamic real-world applica-
Netflix’s Cinematch system. With a user-profile size of 22 tions.
ratings, a mean RMSE of bellow 90% is achieved. For new Acknowledgements
items the prediction task on Netflix is harder as there are
much more users than items. But also in this case 18 ratings The authors gratefully acknowledge the partial co-funding of
on a new item’s profile would beat the overall RMSE of their work through the European Commission FP7 project
Cinematch. And about 50 ratings are sufficient to break the MyMedia (www.mymediaproject.org) under the grant agree-
90% barrier. ment no. 215006. For your inquiries please contact
Furthermore, the evaluation shows that for the new-user/ [email protected].
new-item problem it is important that models use new rat-
ing information. A static model without updates would not 8. REFERENCES
improve with new ratings and thus the error for all profile [1] R. M. Bell and Y. Koren. Scalable collaborative
sizes would be as worse as at the beginning. filtering with jointly derived neighborhood
interpolation weights. In ICDM, pages 43–52. IEEE
6.5 Speedup Computer Society, 2007.
As we have already seen, the quality of the approximation [2] M. Brand. Fast online svd revisions for lightweight
with online-updates is almost as good as a full retrain of recommender systems. In SIAM International
the model. Now we compare the runtimes between online- Conference on Data Mining, 2003.
updates and full retrains. Table 2 shows an overview of [3] T. George and S. Merugu. A scalable collaborative
training times of several RKMF methods and the online- filtering framework based on co-clustering. In ICDM
Netflix: new user Netflix: new item

1.00

1.00
Linear update Linear update
Logistic update Logistic update
Non−neg. update Non−neg. update

0.98

0.98
Linear retrain Linear retrain
● Logistic retrain ● Logistic retrain

Non−neg. retrain Non−neg. retrain
0.96

0.96
0.94
RMSE

RMSE

0.94
0.92



0.90

0.92
0.88

0.90

0 10 20 30 40 50 0 10 20 30 40 50

Profile size Profile size

Movielens: new user Movielens: new item


1.00

1.00
Linear update Linear update
Logistic update Logistic update
Non−neg. update Non−neg. update
0.98

0.98
Linear retrain Linear retrain
Logistic retrain Logistic retrain
Non−neg. retrain Non−neg. retrain
0.96

0.96
RMSE

RMSE
0.94

0.94
0.92

0.92
0.90

0.90
0.88

0.88

0 10 20 30 40 50 0 10 20 30 40 50

Profile size Profile size

Figure 5: New-user/ new-item problem on Movielens and Netflix. Curves show the RMSE of online-updates
(see protocol) compared to a full retrain.

’05: Proceedings of the Fifth IEEE International for highly scalable recommender systems. In
Conference on Data Mining, pages 625–628, Proceedings of the 5th International Conference in
Washington, DC, USA, 2005. IEEE Computer Society. Computers and Information Technology, 2002.
[4] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. [11] B. M. Sarwar, G. Karypis, J. A. Konstan, and
Using collaborative filtering to weave an information J. Reidl. Item-based collaborative filtering
tapestry. Communications of the ACM, 35(12):61–70, recommendation algorithms. In World Wide Web,
1992. pages 285–295, 2001.
[5] S. Hauger, K. Tso, and L. Schmidt-Thieme. [12] A. Schein, A. Popescul, L. Ungar, and D. Pennock.
Comparison of recommender system algorithms Generative models for cold-start recommendations. In
focusing on the new-item and user-bias problem. In Proceedings of the 2001 SIGIR Workshop on
Proceedings of 31th Annual Conference of the Recommender Systems, 2001.
Gesellschaft fuer Klassifikation (GfKl), Freiburg, 2007. [13] L. Schmidt-Thieme. Compound classification models
[6] T. Hofmann. Latent semantic models for collaborative for recommender systems. In Proceedings of the Fifth
filtering. ACM Trans. Inf. Syst., 22(1):89–115, 2004. IEEE International Conference on Data Mining
[7] J. Kivinen, A. Smola, and R. Williamson. Online (ICDM 2005), pages 378–385, 2005.
learning with kernels. Signal Processing, IEEE [14] G. Takacs, I. Pilaszy, B. Nemeth, and D. Tikk. On the
Transactions on, 52(8):2165–2176, Aug. 2004. gravity recommendation system. In KDDCup 2007,
[8] M. Kurucz, A. A. Benczúr, and B. Torma. Methods 2007.
for large scale svd with missing values. In KDDCup [15] M. Wu. Collaborative filtering via ensembles of matrix
2007, 2007. factorization. In KDDCup 2007, pages 43–47, 2007.
[9] R. Salakhutdinov and A. Mnih. Probabilistic matrix [16] D. Zhang, Z.-H. Zhou, and S. Chen. Non-negative
factorization. In Advances in Neural Information matrix factorization on kernels. In Q. Yang and G. I.
Processing Systems, volume 20, 2008. Webb, editors, PRICAI, volume 4099 of Lecture Notes
[10] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. in Computer Science, pages 404–412. Springer, 2006.
Incremental singular value decomposition algorithms

You might also like