Kernel Matrix Factorization Models

[Figure 1: train and test RMSE over the training iterations (Train vs. Test curves).]

r̂_{u,i} = ⟨w_u, h_i⟩ = Σ_{f=1}^{k} w_{u,f} · h_{i,f}    (3)

Often bias terms are added, which are equivalent to centering the approximation, so that only residuals have to be learned:

r̂_{u,i} = b_{u,i} + Σ_{f=1}^{k} w_{u,f} · h_{i,f}    (4)

Normally the bias term b_{u,i} is something like the global average.
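To make eqs. (3) and (4) concrete, here is a minimal sketch of the biased prediction in Python/NumPy; the function name and passing the bias in explicitly are illustrative choices, not prescribed by the text above.

import numpy as np

def predict_biased_mf(W, H, u, i, b_ui=0.0):
    """eq. (4): r_hat_{u,i} = b_{u,i} + <w_u, h_i>; b_ui = 0 reduces to plain MF, eq. (3)."""
    # W: |U| x k user feature matrix, H: |I| x k item feature matrix
    return b_ui + float(np.dot(W[u], H[i]))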
One benefit of using a kernel like the logistic one is that predicted values are naturally bounded to the application domain and do not have to be clipped; e.g., a prediction of 6.2 stars in a scenario with at most 5 stars is not possible with the proposed logistic kernel. Secondly, kernels can provide non-linear interactions between user and item vectors. Finally, another benefit is that kernels lead to different models that can be combined in an ensemble. The Netflix challenge has shown that ensembling (e.g. blending) many models achieves the best prediction quality [1, 15, 14].

4.3 Non-negative Matrix Factorization

Non-negative matrix factorization is similar to matrix factorization but poses additional constraints on the feature matrices W and H: all elements of both matrices are required to be non-negative. The motivation is to eliminate interactions between negative correlations; this constraint has been applied successfully in some collaborative filtering algorithms. With the linear kernel, reasonable meta parameters are:

a := r_min,   c := r_max − r_min    (10)

4.4.1 Regularization

Instead of learning the optimal fit of W · H^t on the observed values, a regularization term is added to the optimization task. Usually Tikhonov regularization, also known as ridge regression, is used, where a parameter λ controls the regularization. Hence, the optimization task is:

argmin_{W,H} Opt(S, W, H)    (13)

with

Opt(S, W, H) := E(S, W, H) + λ · (||W||²_F + ||H||²_F)    (14)
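As an illustration of eq. (14), the sketch below evaluates the regularized objective for a single observed rating, assuming a squared error for E, a plain linear-kernel prediction, and an L2 penalty applied only to the factor rows the rating touches; this per-rating form is an assumption made for the stochastic setting, not something spelled out above.

import numpy as np

# Pointwise version of the regularized objective of eq. (14):
#   (r_hat - r)^2 + lambda * (||w_u||^2 + ||h_i||^2)
def opt_single_rating(w_u, h_i, r_ui, lam):
    err = float(np.dot(w_u, h_i)) - r_ui                              # squared-error part E
    reg = lam * (float(np.dot(w_u, w_u)) + float(np.dot(h_i, h_i)))   # Tikhonov penalty
    return err ** 2 + reg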
4.4.2 Optimization by Gradient Descent

For optimizing formula (14), different techniques might be used. For normal matrix factorization, stochastic gradient descent is often used. We propose to use it for KMF as well, so for the optimization only the partial derivative of K has to be calculated. Minimizing another (differentiable, monotonic) loss function than RMSE is also easy, because only E(S, W, H) has to be differentiated. In all, the generic learning algorithm is outlined in Figure 2. The parameter α is called the learning rate or step size. With the regularization term, avoiding overfitting by early stopping becomes less important. Nevertheless, early stopping can be applied to speed up the training process. In the simple case, the stopping criterion is a fixed number of iterations that can be optimized by a holdout method on the training set.

 1: procedure Optimize(S, W, H)
 2:   initialize W, H
 3:   repeat
 4:     for r_{u,i} ∈ S do
 5:       for f ← 1, …, k do
 6:         w_{u,f} ← w_{u,f} − α · ∂/∂w_{u,f} Opt({r_{u,i}}, W, H)
 7:         h_{i,f} ← h_{i,f} − α · ∂/∂h_{i,f} Opt({r_{u,i}}, W, H)
 8:       end for
 9:     end for
10:   until stopping criteria met
11:   return (W, H)
12: end procedure

Figure 2: Learning KMF by gradient descent.
In the following, we give detailed optimization rules for three variants of RKMF, i.e. a linear kernel, a logistic kernel and a linear kernel with non-negative constraints. In the derivation, we discard all positive constants, as they can be integrated into the learning rate α or the regularization λ. The partial derivatives of Opt({r_{u,i}}, W, H) are:

∂/∂w_{u,f} Opt({r_{u,i}}, W, H) ∝ (r̂_{u,i} − r_{u,i}) · ∂/∂w_{u,f} K(w_u, h_i) + λ · w_{u,f}    (15)

∂/∂h_{i,f} Opt({r_{u,i}}, W, H) ∝ (r̂_{u,i} − r_{u,i}) · ∂/∂h_{i,f} K(w_u, h_i) + λ · h_{i,f}    (16)

In all, only the partial derivative of K(w_u, h_i) is kernel specific.
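Combining the generic procedure of Figure 2 with the derivatives (15) and (16), a runnable sketch in Python/NumPy could look as follows; the squared-error loss, the whole-row (instead of per-feature) updates and the names predict_fn / grad_fn are assumptions made for brevity, with the kernel-specific partials passed in as a function.

import numpy as np

def optimize(S, W, H, predict_fn, grad_fn, alpha=0.005, lam=0.05, iterations=50):
    """Stochastic gradient descent over the observed ratings, following Figure 2.

    S          -- iterable of (u, i, r_ui) triples
    W, H       -- user (|U| x k) and item (|I| x k) factor matrices, updated in place
    predict_fn -- predict_fn(w_u, h_i) -> r_hat
    grad_fn    -- grad_fn(w_u, h_i) -> (dK/dw_u, dK/dh_i), the kernel-specific partials
    """
    for _ in range(iterations):                      # fixed number of iterations as stopping criterion
        for u, i, r_ui in S:
            err = predict_fn(W[u], H[i]) - r_ui      # (r_hat_{u,i} - r_{u,i})
            dK_dw, dK_dh = grad_fn(W[u], H[i])
            w_new = W[u] - alpha * (err * dK_dw + lam * W[u])   # eq. (15)
            h_new = H[i] - alpha * (err * dK_dh + lam * H[i])   # eq. (16)
            W[u], H[i] = w_new, h_new
    return W, H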
4.4.3 Linear kernel

Optimizing formula (14) with gradient descent (Figure 2) for a KMF with a linear kernel corresponds to normal regularized matrix factorization. For the complete update rules (15) and (16) we need the partial derivatives of K_l(w_u, h_i):

∂/∂w_{u,f} K(w_u, h_i) = h_{i,f},    ∂/∂h_{i,f} K(w_u, h_i) = w_{u,f}    (17)

For the initialization of both feature matrices W and H, small random values around 0 should be used. This way, r̂_{u,i} is near the bias term in the beginning.
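For the linear kernel, the hypothetical optimize routine from the sketch above can be instantiated with the partial derivatives of eq. (17) and a small random initialization; the helper names are illustrative and the bias term is left at 0 for brevity.

import numpy as np

def linear_predict(w_u, h_i, b_ui=0.0):
    # eq. (4): bias plus inner product; b_ui = 0 gives plain MF as in eq. (3)
    return b_ui + float(np.dot(w_u, h_i))

def linear_grad(w_u, h_i):
    # eq. (17): dK/dw_u = h_i and dK/dh_i = w_u
    return h_i.copy(), w_u.copy()

def init_small_random(n_users, n_items, k, scale=0.01, seed=42):
    # both feature matrices start with small random values around 0
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, scale, (n_users, k)), rng.normal(0.0, scale, (n_items, k))

# Example: W, H = init_small_random(5, 7, k=4)
#          optimize([(0, 1, 4.0)], W, H, linear_predict, linear_grad)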
4.4.4 Logistic kernel

For the logistic kernel (9), the update rules require differentiating K_s(w_u, h_i), which includes the logistic function:

∂/∂w_{u,f} K(w_u, h_i) ∝ h_{i,f} · φ²_s(b_{u,i} + ⟨w_u, h_i⟩) · e^{−b_{u,i} − ⟨w_u, h_i⟩}    (18)

∂/∂h_{i,f} K(w_u, h_i) ∝ w_{u,f} · φ²_s(b_{u,i} + ⟨w_u, h_i⟩) · e^{−b_{u,i} − ⟨w_u, h_i⟩}    (19)

Again, the features can be initialized with small random values around 0 when the following bias term is used (g denotes the global rating average):

b_{u,i} = −ln(c / (g − a) − 1)    (20)

The other hyper-parameters for the logistic kernel are a = r_min and c = r_max − r_min.
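A corresponding sketch for the logistic kernel is given below. It assumes the prediction takes the form a + c · φ(b_{u,i} + ⟨w_u, h_i⟩), which is consistent with the meta parameters a, c and the bias term of eq. (20); the positive constants dropped from eqs. (18)-(19) are absorbed into the learning rate, and the factory function is an illustrative construct, not the paper's API.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_logistic_kernel(r_min, r_max, g):
    """g is the global rating average; a, c and b_ui follow the text and eq. (20)."""
    a, c = r_min, r_max - r_min
    b_ui = -np.log(c / (g - a) - 1.0)           # eq. (20): prediction starts near g for small factors

    def predict(w_u, h_i):
        return a + c * sigmoid(b_ui + float(np.dot(w_u, h_i)))

    def grad(w_u, h_i):
        x = b_ui + float(np.dot(w_u, h_i))
        d = sigmoid(x) ** 2 * np.exp(-x)        # phi(x)^2 * e^{-x}, cf. eqs. (18)-(19)
        return d * h_i, d * w_u

    return predict, grad

# Example: predict, grad = make_logistic_kernel(1.0, 5.0, g=3.6)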
4.4.5 Non-negative constraints

With a linear kernel and non-negative constraints on both W and H, the update rules can use a projection step to ensure non-negative elements:

w_{u,f} ← max(0, w_{u,f} − α · ∂/∂w_{u,f} Opt({r_{u,i}}, W, H))    (21)

h_{i,f} ← max(0, h_{i,f} − α · ∂/∂h_{i,f} Opt({r_{u,i}}, W, H))    (22)

The derivatives are the same as in formulas (15), (16) and (17). In contrast to unconstrained KMF, the non-negative model cannot be centered around the global average with a bias term, because otherwise no rating below the average would be possible. Thus, an initialization around 0 would lead to predictions around r_min. A better initialization for non-negative matrix factorization is to set the values of both W and H such that the predictions are near the global average, which leads to:

w_{u,f} = h_{i,f} = √((g − r_min) / (k · (r_max − r_min))) + noise
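The projected update of eqs. (21)-(22) and the initialization near the global average can be sketched as follows; the whole-row update, the uniform noise, and the use of the meta parameters a = r_min, c = r_max − r_min from eq. (10) are assumptions made for illustration.

import numpy as np

def nonneg_sgd_step(W, H, u, i, r_ui, alpha, lam, r_min, r_max):
    c = r_max - r_min                            # linear kernel with a = r_min, no bias term (see text)
    err = (r_min + c * float(np.dot(W[u], H[i]))) - r_ui
    w_new = W[u] - alpha * (err * H[i] + lam * W[u])
    h_new = H[i] - alpha * (err * W[u] + lam * H[i])
    W[u] = np.maximum(0.0, w_new)                # projection step of eq. (21)
    H[i] = np.maximum(0.0, h_new)                # projection step of eq. (22)

def nonneg_init(n_users, n_items, k, g, r_min, r_max, noise=0.01, seed=42):
    # constant base value chosen so that initial predictions are near the global average g
    rng = np.random.default_rng(seed)
    base = np.sqrt((g - r_min) / (k * (r_max - r_min)))
    return base + noise * rng.random((n_users, k)), base + noise * rng.random((n_items, k))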
4.5 SVD versus Regularized KMF

Singular Value Decomposition (SVD) is a technique for decomposing a matrix into three matrices. With our notation that would be R = W′ Σ H′ with W′: |U| × |U|, Σ: |U| × |I| and H′: |I| × |I|, where Σ is a diagonal matrix containing the singular values. One can show that the best k-rank approximation R̂ of R is given by using only the k largest singular values and setting the remaining singular values to zero. This means that one can reduce the number of columns of W′ and H′ to k, obtaining two matrices W: |U| × k and H: |I| × k that give the best approximation of W · H^t to R. As SVD is a well-studied technique in many fields, e.g. image analysis or numerics, the question arises why not to use it in recommender systems. As we indicated before, the task of matrix approximation in recommender systems differs from other fields like image analysis.

First of all, in recommender systems we deal with a huge amount of missing/unobserved values; e.g., for Netflix the sparsity factor is about 99%. Note that sparsity in recommender systems means missing/unobserved values, whereas in the SVD literature often zero values are meant. Before an SVD can be calculated, the missing values have to be estimated (‘imputation’). Imputing 0 for missing values is obviously not a good idea, because then most of the predicted ratings in R̂ would be around 0 as well. A better idea is to use another prediction algorithm to estimate the missing values; in a simple case that could be the column or row mean. The SVD can then be calculated on this full matrix. An efficient implementation should recenter the matrix (e.g. around the column mean) before applying a standard SVD algorithm. A second problem for both MF and SVD is overfitting. As indicated before, in regularized MF usually regularization and early stopping are applied to avoid overfitting. In contrast to this, choosing the best k-rank approximation of an SVD will lead to overfitting of R̂.
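The imputation-based SVD baseline described above could be sketched roughly as follows; filling missing entries with the item (column) mean and recentering before a standard SVD are the simple choices mentioned in the text, while the NaN encoding of missing ratings and the function name are assumptions.

import numpy as np

def svd_rank_k_with_imputation(R, k):
    """R: |U| x |I| rating matrix with np.nan marking missing/unobserved values."""
    col_mean = np.nanmean(R, axis=0)                # item (column) means used for imputation
    R_filled = np.where(np.isnan(R), col_mean, R)   # impute missing entries
    R_centered = R_filled - col_mean                # recenter around the column mean
    U, s, Vt = np.linalg.svd(R_centered, full_matrices=False)
    R_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]          # best k-rank approximation
    return R_hat + col_mean                         # undo the centering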
As we have seen, in the domain of recommender systems SVD has several drawbacks compared to RMF: first of all, the high number of missing values that have to be estimated, and the lack of regularization, which leads to overfitting. Table 1 shows a comparison of SVD to several regularized matrix factorization methods. The evaluation has been done on the Netflix dataset, where the Netflix ‘probe’ set was used as test dataset. We compare the RKMF prediction quality to the best SVD results reported by [8]. Their best SVD model uses the Lanczos algorithm with imputation by EM. This SVD model has its best quality on the ‘probe’ set with 10 dimensions – for more dimensions they report overfitting. Our regularized KMF methods use 40 dimensions (k = 40) and we do not observe overfitting (see Figure 1 for the linear case). In fact, even if we enlarge the number of dimensions in RKMF (e.g. k = 100), the quality still increases. This evaluation has shown that regularization is important for successfully learning a model and avoiding overfitting. In the rest of this paper we will not deal with SVD any more, as regularized matrix factorization is obviously the better method for the task of recommender systems.

                SVD          Regularized MF
                             linear    logistic    lin. non-neg.
Netflix RMSE    0.946 [8]    0.915     0.918       0.914

Table 1: RMSE results on Netflix probe for RKMF and k-rank SVD.
5. ONLINE UPDATES

In this section we provide methods for solving the new-user and the new-item problem. That means it is assumed that an existing factorization, i.e. (W, H), is given and then a new rating comes in. We provide methods for updating the factorization (W, H) both for a user with a small rating profile and for an item with a small profile. These methods use the same update rules as in the last section, which means the derivations for training the model can be reused for the online updates. We describe all methods in terms of the new-user problem. Of course, everything can be applied to the new-item problem as well, because KMF models are symmetric.

Updating the factorization exactly for an incoming rating is not trivial, because (1) in stochastic gradient descent the sequence in which the ratings in S are visited is important and (2) information between iterations propagates through the matrices. The best that can be done is to approximate R̂_{S ∪ {r_{u,i}}} from R̂_S.

We propose the algorithm UserUpdate (see Figure 3) for solving the new-user problem. This algorithm retrains the whole feature vector of this user and keeps all other entries in the matrices fixed. The motivation for this algorithm is the assumption that the model built from S and the model built from S ∪ {r_{u,i}} are mostly the same from a global perspective. But if user u is a new user, his (local) features might change a lot from the new rating r_{u,i}. That is why we fully retrain this user and keep the other features fixed, as we assume them to be already the best guess.

 1: procedure UserUpdate(S, W, H, r_{u,i})
 2:   S ← S ∪ {r_{u,i}}
 3:   return UserRetrain(S, W, H, u)
 4: end procedure
 5: procedure UserRetrain(S, W, H, u*)
 6:   initialize the u*-th row of W
 7:   repeat
 8:     for r_{u,i} ∈ C(u*, ·) do
 9:       for f ← 1, …, k do
10:         w_{u,f} ← w_{u,f} − α · ∂/∂w_{u,f} Opt({r_{u,i}}, W, H)
11:       end for
12:     end for
13:   until stopping criteria met
14:   return (W, H)
15: end procedure

Figure 3: Online updates for the new-user problem.

The time complexity of UserRetrain is O(|C(u, ·)| · k · i). In our evaluation we will see that retraining a user for each incoming rating becomes less important as the user profile grows; that is why |C(u, ·)| usually is small. Besides the low time complexity and good quality (see the evaluation), one of the major advantages of UserUpdate is that it is generic and applicable to any RKMF model. That means that no kernel-specific algorithm or additional update formulas have to be designed.
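A runnable sketch of UserUpdate / UserRetrain from Figure 3 for the linear kernel; storing the ratings as a list of (u, i, r) triples, rebuilding C(u*, ·) by filtering that list, and using a fixed iteration count are assumptions made for illustration.

import numpy as np

def user_update(S, W, H, new_rating, alpha=0.005, lam=0.05, iterations=20):
    """Add a new rating (u, i, r_ui) and retrain only that user's feature vector."""
    S.append(new_rating)
    return user_retrain(S, W, H, new_rating[0], alpha, lam, iterations)

def user_retrain(S, W, H, u_star, alpha=0.005, lam=0.05, iterations=20):
    k = W.shape[1]
    W[u_star] = np.random.default_rng(0).normal(0.0, 0.01, k)    # re-initialize the u*-th row of W
    C_u = [(i, r) for (u, i, r) in S if u == u_star]              # C(u*, .): all ratings of this user
    for _ in range(iterations):
        for i, r_ui in C_u:
            err = float(np.dot(W[u_star], H[i])) - r_ui
            W[u_star] -= alpha * (err * H[i] + lam * W[u_star])   # only the user's features move
    return W, H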
[Figure 5 plots omitted: four RMSE panels; legend: Linear/Logistic/Non-neg. update and Linear/Logistic/Non-neg. retrain.]

Figure 5: New-user/new-item problem on Movielens and Netflix. Curves show the RMSE of online updates (see protocol) compared to a full retrain.
… ’05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 625–628, Washington, DC, USA, 2005. IEEE Computer Society.
[4] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61–70, 1992.
[5] S. Hauger, K. Tso, and L. Schmidt-Thieme. Comparison of recommender system algorithms focusing on the new-item and user-bias problem. In Proceedings of the 31st Annual Conference of the Gesellschaft fuer Klassifikation (GfKl), Freiburg, 2007.
[6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89–115, 2004.
[7] J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, Aug. 2004.
[8] M. Kurucz, A. A. Benczúr, and B. Torma. Methods for large scale SVD with missing values. In KDDCup 2007, 2007.
[9] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.
[10] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Incremental singular value decomposition algorithms for highly scalable recommender systems. In Proceedings of the 5th International Conference on Computers and Information Technology, 2002.
[11] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In World Wide Web, pages 285–295, 2001.
[12] A. Schein, A. Popescul, L. Ungar, and D. Pennock. Generative models for cold-start recommendations. In Proceedings of the 2001 SIGIR Workshop on Recommender Systems, 2001.
[13] L. Schmidt-Thieme. Compound classification models for recommender systems. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM 2005), pages 378–385, 2005.
[14] G. Takacs, I. Pilaszy, B. Nemeth, and D. Tikk. On the gravity recommendation system. In KDDCup 2007, 2007.
[15] M. Wu. Collaborative filtering via ensembles of matrix factorization. In KDDCup 2007, pages 43–47, 2007.
[16] D. Zhang, Z.-H. Zhou, and S. Chen. Non-negative matrix factorization on kernels. In Q. Yang and G. I. Webb, editors, PRICAI, volume 4099 of Lecture Notes in Computer Science, pages 404–412. Springer, 2006.