8. Recommender Systems
Alex Smola
Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012
Stat 260 SP 12
http://www.igvita.com/2006/10/29/dissecting-the-netflix-dataset/
Netflix competition yardstick
• Least mean squares prediction error
• Easy to define
$$\mathrm{rmse}(S) = \sqrt{\frac{1}{|S|}\sum_{(i,u)\in S}\left(\hat r_{ui} - r_{ui}\right)^2}$$
• Wrong measure for composing sessions!
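As a quick sanity check of the definition, a minimal numpy computation (the rating arrays are made up):

```python
import numpy as np

# r_hat: predicted ratings, r: observed ratings over the test set S (made up)
r_hat = np.array([3.5, 4.0, 2.0])
r = np.array([4.0, 4.0, 1.0])
rmse = np.sqrt(np.mean((r_hat - r) ** 2))
print(rmse)  # ~0.645
```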
[Figure: a ranked list of recommendations #1–#4 presented to user Joe.]
Basic Idea
• (user,user) similarity to recommend items
  • good if the item base is smaller than the user base
  • good if the item base changes rapidly
• (item,item) similarity to recommend new items that were also liked by the same users
  • good if the user base is small
• traverse the bipartite user-item graph

Neighborhood-based CF
• Earliest and most popular collaborative filtering method
• Derive unknown ratings from those of “similar” items (item-item variant)
• A parallel user-user flavor: rely on ratings of like-minded users
• Oldest known CF method
Neighborhood-based CF: example
• Ratings matrix over items 1–6 (rows) and users 1–12 (columns); most entries are unknown. Known ratings per item:
  item 1: 1 3 ? 5 5 4    item 2: 5 4 4 2 1 3    item 3: 2 4 1 2 3 4 3 5
  item 4: 2 4 5 4 2      item 5: 4 3 4 2 2 5    item 6: 1 3 3 2 4
• Goal: estimate the unknown rating “?” of item 1 by some user u.
• Among the items u rated, items 3 and 6 are similar to item 1, with similarities s13 = 0.2 and s16 = 0.3; u rated item 3 with 2 and item 6 with 3.
• Predict with the similarity-weighted average (spelled out in the sketch below):

$$\frac{0.2 \cdot 2 + 0.3 \cdot 3}{0.2 + 0.3} = 2.6$$
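A minimal sketch of this weighted-average prediction; `ratings` and `sim` are hypothetical containers for known ratings and precomputed item-item similarities:

```python
def predict(user, item, ratings, sim, items):
    """Similarity-weighted average of the user's ratings on items similar to `item`."""
    num = den = 0.0
    for j in items:
        if j != item and (user, j) in ratings:
            s = sim.get((item, j), 0.0)
            if s > 0:
                num += s * ratings[(user, j)]
                den += s
    return num / den if den > 0 else None

# The example above: s13 = 0.2, s16 = 0.3; the user rated item 3 -> 2, item 6 -> 3
ratings = {("u", 3): 2, ("u", 6): 3}
sim = {(1, 3): 0.2, (1, 6): 0.3}
print(predict("u", 1, ratings, sim, items=range(1, 7)))  # 2.6
```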
Properties
• Intuitive
• No (substantial) training required
• Handles new users / items
• Easy to explain to the user

Computing similarities
• Pearson correlation coefficient over shared support:

$$s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}]}$$

• nonuniform support: compute only over the shared support
• shrinkage towards 0 to address the problem of small support (typically few items in common)
(item,item) similarity
• Empirical Pearson correlation coefficient, computed over the set U(i,j) of users who rated both items, with baseline estimates b_ui:

$$\hat\rho_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}$$
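A sketch combining this correlation with the shrinkage from the previous slide; `r_i` and `r_j` map users to their ratings of items i and j, `b` maps (user, item) pairs to baseline estimates, and the shrinkage constant is illustrative:

```python
import numpy as np

def item_similarity(i, j, r_i, r_j, b, shrinkage=25.0):
    """Shrunk Pearson correlation between items i and j over shared support."""
    common = sorted(set(r_i) & set(r_j))          # users who rated both: U(i, j)
    if len(common) < 2:
        return 0.0
    di = np.array([r_i[u] - b[(u, i)] for u in common])
    dj = np.array([r_j[u] - b[(u, j)] for u in common])
    denom = np.sqrt((di ** 2).sum() * (dj ** 2).sum())
    if denom == 0.0:
        return 0.0
    rho = float((di * dj).sum() / denom)
    return rho * len(common) / (len(common) + shrinkage)  # shrink towards 0
```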
Latent factor models

$$M \approx U \cdot V$$

[Figure: movies such as Braveheart, The Color Purple, Amadeus, and Dave embedded in a two-dimensional latent space spanned by a serious–escapist axis.]
Basic matrix factorization model
• Approximate the (items × users) ratings matrix by a low-rank product: each item i gets a factor vector q_i, each user u a factor vector p_u, and every rating, known or unknown, is modeled as the inner product

$$\hat r_{ui} = \langle p_u, q_i \rangle$$

[Figure: the 6 × 12 ratings matrix from the neighborhood example factored into a rank-3 item-factor matrix times a rank-3 user-factor matrix; the unknown entry “?” is filled in by the inner product of its item and user factors, here 2.4.]
Properties
• SVD is undefined for missing entries; use specialized methods instead:
  • stochastic gradient descent (faster)
  • alternating optimization
• Very powerful model: can easily overfit without regularization, particularly if there are fewer reviews than dimensions
• Probably the most popular model among Netflix contestants
  • 12/11/2006: Simon Funk describes an SVD-based method
  • 12/29/2006: free implementation at timelydevelopment.com
[Figure: factor models, RMSE vs. millions of parameters (log scale) for NMF, BiasSVD, SVD v.2, and SVD++ at ranks 40–200; Netflix baseline: 0.9514, Grand Prize target: 0.8563.]
Risk Minimization View
• Objective Function
$$\underset{p,q}{\text{minimize}} \;\sum_{(u,i) \in S} \left(r_{ui} - \langle p_u, q_i\rangle\right)^2 + \lambda\left[\|p\|_{\mathrm{Frob}}^2 + \|q\|_{\mathrm{Frob}}^2\right]$$
Bias
• Objective function

$$\underset{p,q,b,\mu}{\text{minimize}} \;\sum_{(u,i) \in S} \left(r_{ui} - (\mu + b_u + b_i + \langle p_u, q_i\rangle)\right)^2 + \lambda\left[\|p\|_{\mathrm{Frob}}^2 + \|q\|_{\mathrm{Frob}}^2 + \|b_{\mathrm{users}}\|^2 + \|b_{\mathrm{items}}\|^2\right]$$

• Stochastic gradient descent updates

$$\begin{aligned}
p_u &\leftarrow (1 - \eta_t \lambda)\, p_u + \eta_t \rho_{ui}\, q_i \\
q_i &\leftarrow (1 - \eta_t \lambda)\, q_i + \eta_t \rho_{ui}\, p_u \\
b_u &\leftarrow (1 - \eta_t \lambda)\, b_u + \eta_t \rho_{ui} \\
b_i &\leftarrow (1 - \eta_t \lambda)\, b_i + \eta_t \rho_{ui} \\
\mu &\leftarrow (1 - \eta_t \lambda)\, \mu + \eta_t \rho_{ui}
\end{aligned}$$

where $\rho_{ui} = r_{ui} - (\mu + b_i + b_u + \langle p_u, q_i \rangle)$
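A minimal training loop implementing these updates; `data` is assumed to be a list of (user, item, rating) triples with integer ids, the hyperparameters are illustrative, and the global mean µ is kept fixed here instead of being updated as on the slide:

```python
import numpy as np

def train(data, n_users, n_items, rank=20, lam=0.05, eta=0.01, epochs=30):
    """SGD for r_ui ~ mu + b_u + b_i + <p_u, q_i> with weight decay."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, rank))  # user factors p_u
    Q = 0.1 * rng.standard_normal((n_items, rank))  # item factors q_i
    bu, bi = np.zeros(n_users), np.zeros(n_items)   # user / item biases
    mu = np.mean([r for _, _, r in data])           # global mean, kept fixed
    for _ in range(epochs):
        for u, i, r in data:
            rho = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])  # residual rho_ui
            P[u], Q[i] = ((1 - eta * lam) * P[u] + eta * rho * Q[i],
                          (1 - eta * lam) * Q[i] + eta * rho * P[u])
            bu[u] = (1 - eta * lam) * bu[u] + eta * rho
            bi[i] = (1 - eta * lam) * bi[i] + eta * rho
    return mu, bu, bi, P, Q
```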
[Figure repeated: the same RMSE vs. parameters plot, annotating that SVD++ additionally exploits the implicit “who rated what” signal.]
Ratings are not given at random!

[Figure: distribution of ratings in Netflix ratings, Yahoo! music ratings, and Yahoo! survey answers; observed ratings are far from a random sample.]

• Keep two matrices: the ratings R = (r_ui) and the binary matrix B = (b_ui) recording which (user, movie) pairs were rated at all
• Characterize users by which movies they rated: edge attributes (observed, rating)
• Adding features to the recommender system (regression):

$$r_{ui} = \mu + b_u + b_i + \langle p_u, q_i\rangle + \langle c_u, x_i\rangle$$
Alternative integration
• Key idea: use related ratings to average (Salakhutdinov & Mnih, 2007)

$$q_i \leftarrow q_i + \sum_u c_{ui}\, p_u$$
Something Happened in Early 2004…

[Figure: Netflix ratings by date; the mean rating jumps in early 2004, when Netflix changed its rating labels.]

• Are movies getting better with time?
Sources of temporal change
• Items
• Seasonal effects
(Christmas, Valentine’s day, Holiday movies)
• Public perception of movies (Oscar etc.)
• Users
• Changed labeling of reviews
• Anchoring (relative to previous movie)
• Change of rater in household
• Selection bias for time of viewing
Modeling temporal change
• Time-dependent bias
• Time-dependent user preferences
$$r_{ui}(t) = \mu + b_u(t) + b_i(t) + \langle q_i, p_u(t) \rangle$$
• Parameterize functions b and p
• Slow changes for items
• Fast sudden changes for users
• Good parametrization is key
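One concrete parametrization along these lines, following Koren's temporal-dynamics paper from the Further Reading list: bin the slowly varying item bias by date, and let the user bias drift smoothly away from the user's mean rating date. A sketch with illustrative parameter containers:

```python
import numpy as np

def item_bias(i, t, b_i, b_i_bin, bin_size=70):
    """Slowly varying item bias: static part plus a coarse time-bin part."""
    return b_i[i] + b_i_bin[i][t // bin_size]

def user_bias(u, t, b_u, alpha_u, t_mean, beta=0.4):
    """User bias with gradual drift: b_u + alpha_u * dev_u(t),
    where dev_u(t) = sign(t - t_u) * |t - t_u|^beta."""
    dev = np.sign(t - t_mean[u]) * abs(t - t_mean[u]) ** beta
    return b_u[u] + alpha_u[u] * dev
```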
Variance decomposition of the ratings

  0.732 (unexplained, 57%)
+ 0.415 (biases, 33%)
+ 0.129 (personalization, 10%)
= 1.276 (total variance)
Open problems
• Explain factorizations
• Cold start (new users)
• Different regularization for different parameter
groups / different users
• Sharing of statistical strength between users
• Hierarchical matrix co-clustering / factorization
(write a paper on that)
Session Modeling
Motivation
User interaction
• Explicit search query
• Search engine
• Genre selection on movie site
• Implicit search query
• News site
• Priority inbox
• Comments on article
• Viewing specific movie (see also ...)
• Sponsored search (advertising)
[Figure: news-page screenshot with annotations: collapsing a story and hovering on a link are implicit signals of user interest, so log them!]
Response is conditioned on available options
• User searches for ‘chocolate’
• Relevance score per document
• Coverage over different aspects
• Position dependent score
• Score depends on the number of previous clicks

The score adds a submodular gain per additional document: a correction coefficient captures the effect of having visited |c^{i-1}| links previously, and λ_i captures the position-specific effect, i.e. how clickable different positions in the result set are. Furthermore, note that for b = 0 we are back to a modular relevance ranking model where each article is scored regardless of previously shown content; in other words, we recover the vector space model as a special case.
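The coverage term makes greedy list construction natural. Since the exact scoring formula did not survive extraction, the sketch below assumes a simple relevance-plus-aspect-coverage score to illustrate submodular session composition:

```python
def greedy_session(candidates, relevance, aspects, k=5, alpha=0.5):
    """Greedily build a k-item result list; the marginal gain of a document
    shrinks once its aspects are already covered (submodularity)."""
    chosen, covered, pool = [], set(), set(candidates)
    for _ in range(k):
        def gain(d):
            return relevance[d] + alpha * len(aspects[d] - covered)
        best = max(pool, key=gain)
        chosen.append(best)
        covered |= aspects[best]
        pool.remove(best)
    return chosen
```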
Optimization
• Latent variables: given the feature vectors U_i for the user and V_j for the movie, the distribution of the corresponding rating is Gaussian around their inner product:

$$p(R_{ij} \mid U_i, V_j, \sigma^2) = \mathcal{N}(R_{ij} \mid U_i^T V_j, \sigma^2)$$

• The user and movie feature vectors are given zero-mean spherical Gaussian priors:

$$p(U \mid \sigma_U^2) = \prod_{i=1}^{N} \mathcal{N}(U_i \mid 0, \sigma_U^2 I), \qquad p(V \mid \sigma_V^2) = \prod_{j=1}^{M} \mathcal{N}(V_j \mid 0, \sigma_V^2 I)$$

[Figure: PMF graphical model with plates i = 1,…,N over users and j = 1,…,M over movies; R_ij observed, hyperparameters σ, σ_U, σ_V.]
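Maximizing the log posterior under this model recovers the regularized risk minimization seen earlier; a short derivation (standard for PMF, not spelled out on the slide):

$$-\log p(U, V \mid R, \sigma^2, \sigma_U^2, \sigma_V^2) = \frac{1}{2\sigma^2}\sum_{(i,j) \in S} (R_{ij} - U_i^T V_j)^2 + \frac{1}{2\sigma_U^2}\sum_{i=1}^{N} \|U_i\|^2 + \frac{1}{2\sigma_V^2}\sum_{j=1}^{M} \|V_j\|^2 + \text{const}$$

so MAP estimation is squared-error minimization with Frobenius regularizers $\lambda_U = \sigma^2/\sigma_U^2$ and $\lambda_V = \sigma^2/\sigma_V^2$.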
Bayesian PMF
• Needs more flexible regularization than simple penalization of the Frobenius norm of the feature matrices
• The Wishart prior is conjugate to the Gaussian, hence use it for the hyperparameters Θ_U, Θ_V
• Allows us to adapt the variance automatically: priors with diagonal or full covariance matrices and adjustable means, or even mixture-of-Gaussians priors
• Using spherical Gaussian priors for the feature vectors leads to standard PMF with λ_U and λ_V chosen automatically
• Inference (Gibbs sampler):
  • sample user factors (parallel)
  • sample movie factors (parallel)
  • sample hyperparameters (parallel)

[Figure: BPMF graphical model; hyperparameter nodes Θ_U, Θ_V above U_i and V_j, plates i = 1,…,N and j = 1,…,M, rating noise σ.]
Making it fancier: constrained PMF
• Constrain the user factors through the latent “who rated what” matrix: with I_ik indicating whether user i rated movie k and latent item offsets W_k, the effective user factor becomes

$$U_i = Y_i + \frac{\sum_{k=1}^{M} I_{ik} W_k}{\sum_{k=1}^{M} I_{ik}}$$

[Figure: constrained PMF graphical model; W_k (k = 1,…,M) and the indicator I_i feed into U_i alongside Y_i, with hyperparameters σ_U, σ_W, σ_V.]
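Returning to the BPMF Gibbs sampler above: a sketch of one (embarrassingly parallel) sweep over the user factors, using the Gaussian conditionals from the BPMF paper. Here `rated[i]` lists (movie, rating) pairs for user i, `alpha` is the rating precision, and (`mu_U`, `Lambda_U`) are the current hyperparameter draws; all names are illustrative:

```python
import numpy as np

def sample_user_factors(rated, V, mu_U, Lambda_U, alpha, rng):
    """One Gibbs draw of all user factors given movie factors V and the
    current Gaussian-Wishart hyperparameters (mu_U, Lambda_U)."""
    n, d = len(rated), V.shape[1]
    U = np.empty((n, d))
    for i in range(n):                        # embarrassingly parallel over users
        js = [j for j, _ in rated[i]]         # movies user i rated
        rs = np.array([r for _, r in rated[i]], dtype=float)
        Vi = V[js]
        Lam = Lambda_U + alpha * Vi.T @ Vi    # posterior precision
        mu = np.linalg.solve(Lam, Lambda_U @ mu_U + alpha * Vi.T @ rs)
        U[i] = rng.multivariate_normal(mu, np.linalg.inv(Lam))
    return U
```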
Results (Mnih & Salakhutdinov)

[Figure: RMSE as a function of the number of observed ratings per user (buckets 1–5 through >641) for the movie-average baseline, PMF, and constrained PMF, alongside the percentage of users in each bucket; constrained PMF helps most for users with few observed ratings.]
Social Network Data
• Data: users, connections, features

$$p(x, y, e) = \prod_{i \in \text{Users}} p(y_i)\, p(x_i \mid y_i) \prod_{i,j \in \text{Users}} p(e_{ij} \mid x_i, y_i, x_j, y_j)$$

[Figure: graphical model with observed user features y, u, latent features x, v, social edges e, and app installs a.]

Model
• Social interaction

$$x_i \sim p(x \mid y_i), \quad x_j \sim p(x \mid y_j), \quad e_{ij} \sim p(e \mid x_i, y_i, x_j, y_j)$$

• App install

$$v_j \sim p(v \mid u_j), \quad a_{ij} \sim p(a \mid x_i, y_i, u_j, v_j)$$
Model
• Cold start latent features:

$$x_i = A y_i + \epsilon_i, \qquad v_j = B u_j + \tilde\epsilon_j$$

• Bilinear features:

$$e_{ij} \sim p(e \mid x_i^\top x_j + y_i^\top W y_j) \qquad \text{(social interaction)}$$
$$a_{ij} \sim p(a \mid x_i^\top v_j + y_i^\top M u_j) \qquad \text{(app install)}$$
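A sketch of these bilinear scores; array names mirror the slide notation and shapes are illustrative:

```python
import numpy as np

def social_score(x_i, x_j, y_i, y_j, W):
    """Score of a social edge (i, j): x_i^T x_j + y_i^T W y_j."""
    return x_i @ x_j + y_i @ W @ y_j

def app_score(x_i, v_j, y_i, u_j, M):
    """Score of an app install (i, j): x_i^T v_j + y_i^T M u_j."""
    return x_i @ v_j + y_i @ M @ u_j

def cold_start_x(y_i, A):
    """New user with no learned latent factor: fall back to x_i = A y_i."""
    return A @ y_i
```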
Optimization Problem

$$\begin{aligned}
\underset{x,v,W,M,A,B}{\text{minimize}} \;
& \Theta_e \sum_{(i,j)} l(e_{ij},\, x_i^\top x_j + y_i^\top W y_j) && \text{(social)} \\
+\; & \Theta_a \sum_{(i,j)} l(a_{ij},\, x_i^\top v_j + y_i^\top M u_j) && \text{(app)} \\
+\; & \Theta_x \sum_i \Omega(x_i \mid y_i) + \Theta_v \sum_i \Omega(v_i \mid u_i) && \text{(reconstruction)} \\
+\; & \Theta_W \|W\|^2 + \Theta_M \|M\|^2 + \Theta_A \|A\|^2 + \Theta_B \|B\|^2 && \text{(regularizer)}
\end{aligned}$$

Here the reconstruction terms tie the latent features to their cold-start predictions A y_i and B u_i (the penalty symbol Ω is reconstructed; the slide leaves it implicit).
Loss Function

[Figure: loss curves used for application recommendation and social recommendation.]
Optimization
• Nonconvex optimization problem
• Large set of variables: x_i = A y_i + ε_i, v_j = B u_j + ε̃_j
• Update the variable blocks associated with each interaction type in turn: (user, user), (user, app), (app, advertisement)

[Figure: the graphical model with the variables touched by each interaction type highlighted in turn.]
Collaborative Filtering with hashing
• Hashing compression of the factor matrices:

$$u_i = \sum_{j,k \,:\, h(k,j)=i} \xi(k,j)\, U_{kj} \qquad \text{and} \qquad v_i = \sum_{j,k \,:\, h'(k,j)=i} \xi'(k,j)\, V_{kj}$$

  where h, h' are hash functions into n bins and ξ, ξ' are Rademacher (±1) hashes
• The approximation error is O(1/n)
• To show that the estimate is unbiased, take the expectation over the Rademacher hash
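A sketch of this compression, storing a factor matrix U in a single length-n weight array; Python's built-in `hash` stands in for the hash functions h and ξ (a real implementation would use fixed, seeded hashes):

```python
import numpy as np

def slot(k, j, n, salt="h"):
    return hash((salt, k, j)) % n                   # position hash h(k, j)

def sign(k, j, salt="xi"):
    return 1.0 if hash((salt, k, j)) % 2 else -1.0  # Rademacher hash xi(k, j)

def compress(U, n):
    """Fold the K x d factor matrix U into n weights:
    w_i = sum of xi(k, j) * U[k, j] over all (k, j) with h(k, j) = i."""
    w = np.zeros(n)
    for k in range(U.shape[0]):
        for j in range(U.shape[1]):
            w[slot(k, j, n)] += sign(k, j) * U[k, j]
    return w

def lookup(w, k, j):
    """Approximate U[k, j]; unbiased because colliding entries enter with
    independent random signs that cancel in expectation."""
    return sign(k, j) * w[slot(k, j, len(w))]
```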
[Figure: RMSE vs. memory footprint (thousands of elements and rows kept in U and M) on EachMovie and MovieLens; RMSE rises gradually as the hashed storage shrinks below the size of the original factor matrices.]
Summary
• Neighborhood methods
• User / movie similarity
• Iteration on graph
• Matrix Factorization
• Singular value decomposition
• Convex reformulation
• Ranking and Session Modeling
• Ordinal regression
• Session models
• Features
• Latent dense (Bayesian Probabilistic Matrix Factorization)
• Latent sparse (Dirichlet process factorization)
• Cold-start problem (inferring features)
• Hashing
Further reading
• Collaborative Filtering with temporal dynamics
http://research.yahoo.com/files/kdd-fp074-koren.pdf
• Neighborhood factorization
http://research.yahoo.com/files/paper.pdf
• Matrix Factorization for recommender systems
http://research.yahoo.com/files/ieeecomputer.pdf
• CoFi Rank (collaborative filtering & ranking)
http://www.cofirank.org/
• Yehuda Koren’s papers
http://research.yahoo.com/Yehuda_Koren