
Scalable Machine Learning

8. Recommender Systems

Alex Smola
Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012
Stat 260 SP 12

Significant content courtesy of Yehuda Koren



Outline
• Neighborhood methods
• User / movie similarity
• Iteration on graph
• Matrix Factorization
• Singular value decomposition
• Convex reformulation
• Ranking and Session Modeling
• Ordinal regression
• Session models
• Features
• Latent dense (Bayesian Probabilistic Matrix Factorization)
• Latent sparse (Dirichlet process factorization)
• Coldstart problem (inferring features)
• Hashing
Why?
• Netflix: personalized content
  • adapt to general popularity
  • pick based on user preferences
• Spam filtering

A more formal view
• User u (requests content)
• Objects o (that can be displayed)
• Context c (device, location, time)
• Interface (mobile browser, tablet, viewport)
Task: recommend relevant objects o to user u in context c through the interface.
Examples
• Movie recommendation (Netflix)
• Related product recommendation (Amazon)
• Web page ranking (Google)
• Social recommendation (Facebook)
• News content recommendation (Yahoo)
• Priority inbox & spam filtering (Google)
• Online dating (OK Cupid)
• Computational Advertising (Yahoo)
Running Example: Netflix Movie Recommendation

Rating data

Training data                         Test data
user  movie  date      score         user  movie  date      score
1     21     5/7/02    1             1     62     1/6/05    ?
1     213    8/2/04    5             1     96     9/13/04   ?
2     345    3/6/01    4             2     7      8/18/05   ?
2     123    5/1/05    4             2     3      11/22/05  ?
2     768    7/15/02   3             3     47     6/13/02   ?
3     76     1/22/01   5             3     15     8/12/01   ?
4     45     8/3/00    4             4     41     9/1/00    ?
5     568    9/10/05   1             4     28     8/27/05   ?
5     342    3/5/03    2             5     93     4/4/05    ?
5     234    12/28/00  2             5     74     7/16/03   ?
6     76     8/11/02   5             6     69     2/14/04   ?
6     56     6/15/03   4             6     83     10/3/03   ?
Challenges
• Scalability
• Millions of objects
• 100s of millions of users
• Cold start
• Changing user base
• Changing inventory (movies, stories, goods)
• Attributes
• Imbalanced dataset
  • User activity / item reviews are power-law distributed

http://www.igvita.com/2006/10/29/dissecting-the-netflix-dataset/
Netflix competition yardstick
• Least mean squares prediction error
• Easy to define:
  \mathrm{rmse}(S) = \sqrt{ |S|^{-1} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2 }
• Wrong measure for composing sessions!
• Consistent (in the large sample size limit this converges to the minimizer)
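A minimal sketch of this metric, assuming the predicted and true ratings of the set S are given as plain arrays (function name and example values are illustrative):

import numpy as np

def rmse(r_hat, r):
    # rmse(S) = sqrt(|S|^-1 * sum_{(i,u) in S} (r_hat_ui - r_ui)^2)
    r_hat, r = np.asarray(r_hat, dtype=float), np.asarray(r, dtype=float)
    return np.sqrt(np.mean((r_hat - r) ** 2))

# example: rmse([3.8, 2.1, 4.5], [4, 2, 5]) ~= 0.32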
1 Neighborhood Methods

Part I: Basic Idea

Basic neighborhood methods
[Figure: a user (Joe) and his rated movies #1-#3; movie #4 is recommended based on similar users]
Basic Idea: Neighborhood-based CF
• Earliest and most popular collaborative filtering approach:
  derive unknown ratings from those of "similar" items (item-item variant);
  a parallel user-user flavor relies on the ratings of like-minded users
• (user,user) similarity to recommend items
  • oldest known CF method
  • good if the user base is small
  • good if the item base changes rapidly
• (item,item) similarity to recommend new items that were also liked by the same users
  • good if the item base is smaller than the user base
• Both traverse the bipartite (user, item) rating graph via similarities
Neighborhood-based CF: example

             users 1 ... 12
item 1 :  1  3  ?  5  5  4
item 2 :  5  4  4  2  1  3
item 3 :  2  4  1  2  3  4  3  5
item 4 :  2  4  5  4  2
item 5 :  4  3  4  2  2  5
item 6 :  1  3  3  2  4

(? = unknown rating; other entries are ratings between 1 and 5; unrated cells are blank)

To estimate the unknown rating of item 1 for the target user, use the items that
user rated which are most similar to item 1:
  similarity:        s13 = 0.2, s16 = 0.3
  weighted average:  (0.2 · 2 + 0.3 · 3) / (0.2 + 0.3) = 2.6
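A small sketch of this weighted-average prediction, assuming item-item similarities and the user's known ratings are stored in plain dictionaries; the names and toy values are illustrative and reproduce the 2.6 example:

import numpy as np

def predict_item_knn(u, i, ratings, sim, k=20):
    # predict r_ui as the similarity-weighted average of u's ratings of the
    # items most similar to i; ratings: {(user, item): rating}, sim: {(i, j): s_ij}
    rated = [j for (v, j) in ratings if v == u and j != i]
    neigh = sorted(rated, key=lambda j: sim.get((i, j), 0.0), reverse=True)[:k]
    num = sum(sim.get((i, j), 0.0) * ratings[(u, j)] for j in neigh)
    den = sum(sim.get((i, j), 0.0) for j in neigh)
    return num / den if den > 0 else np.nan

ratings = {("u5", "item3"): 2, ("u5", "item6"): 3}
sim = {("item1", "item3"): 0.2, ("item1", "item6"): 0.3}
print(predict_item_knn("u5", "item1", ratings, sim))  # 2.6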


Properties
• Intuitive
• No (substantial) training
• Handles new users / items
• Easy to explain to the user
• Accuracy & scalability questionable

Normalization / Bias
• Problem
  • Some items are rated significantly higher
  • Some users rate substantially lower
  • Ratings change over time
• Bias correction is crucial for nearest-neighbor recommender algorithms
• Baseline model: b_{ui} = \mu + b_u + b_i
  (global mean \mu, offset b_u per user, offset b_i per movie; time effects can be added)

Bell & Koren, ICDM 2007
http://public.research.att.com/~volinsky/netflix/BellKorICDM07.pdf
Baseline estimation
• Mean rating is \mu = 3.7
• Troll Hunter is rated 0.7 above the mean
• The user rates 0.2 below the mean
• Baseline estimate: 3.7 + 0.7 - 0.2 = 4.2 stars
• Least mean squares problem
  \min_b \sum_{(u,i)} (r_{ui} - \mu - b_u - b_i)^2 + \lambda \Big[ \sum_u b_u^2 + \sum_i b_i^2 \Big]
• Jointly convex. Alternatively remove the mean & iterate
  b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu - b_u)}{\lambda + |R(i)|}
  \quad \text{and} \quad
  b_u = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{\lambda + |R(u)|}
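A minimal sketch of alternating these closed-form updates, assuming ratings come as a dict keyed by (user, item); the regularizer lam and iteration count are illustrative choices:

import numpy as np
from collections import defaultdict

def fit_biases(ratings, lam=10.0, iters=20):
    # alternate b_i = sum_{u in R(i)} (r_ui - mu - b_u) / (lam + |R(i)|)
    # and the analogous update for b_u
    mu = float(np.mean(list(ratings.values())))
    by_item, by_user = defaultdict(list), defaultdict(list)
    for (u, i), r in ratings.items():
        by_item[i].append((u, r))
        by_user[u].append((i, r))
    bu, bi = defaultdict(float), defaultdict(float)
    for _ in range(iters):
        for i, urs in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in urs) / (lam + len(urs))
        for u, irs in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in irs) / (lam + len(irs))
    return mu, bu, bi   # baseline prediction: mu + bu[u] + bi[i]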
Parzen Windows style CF
• Similarity measure s_{ij} between items
• Find the set s_k(i,u) of k nearest neighbors of i that were rated by user u
• Weighted average over this set:
  \hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in s_k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in s_k(i,u)} s_{ij}}
  \quad \text{where } b_{ui} = \mu + b_u + b_i
• How to compute s_{ij}?

(item,item) similarity measures
• Common practice: rely on the Pearson correlation coefficient
• Challenge: non-uniform user support of item ratings; each item is rated by a distinct set of users
  User ratings for item i:  1 ? ? 5 5 3 ? ? ? 4 2 ? ? ? ? 4 ? 5 4 1 ?
  User ratings for item j:  ? ? 4 2 5 ? ? 1 2 5 ? ? 2 ? ? 3 ? ? ? 5 4
• Compute the correlation over the shared support only:
  s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}]}
• Shrink towards 0 to address the problem of small support
  (typically the two items have few users in common)
(item,item) similarity
• Empirical Pearson correlation coefficient
  \hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}
                         {\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}
• Smoothing towards 0 for small support
  s_{ij} = \frac{|U(i,j)| - 1}{|U(i,j)| - 1 + \lambda}\, \hat{\rho}_{ij}
• Make the neighborhood more peaked: raise the similarity to a power, s_{ij} \to s_{ij}^{\alpha}
• Shrink towards the baseline for small neighborhoods
  \hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in s_k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\lambda + \sum_{j \in s_k(i,u)} s_{ij}}
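A sketch of the shrunk similarity, assuming per-item rating dictionaries keyed by user and precomputed baselines b_ui; the shrinkage constant lam is an illustrative value:

import numpy as np

def shrunk_similarity(ri, rj, bi, bj, lam=100.0):
    # Pearson correlation over the shared support of items i and j,
    # shrunk towards 0 when few users rated both
    shared = sorted(set(ri) & set(rj))
    if len(shared) < 2:
        return 0.0
    di = np.array([ri[u] - bi[u] for u in shared])
    dj = np.array([rj[u] - bj[u] for u in shared])
    denom = np.sqrt((di ** 2).sum() * (dj ** 2).sum())
    rho = (di * dj).sum() / denom if denom > 0 else 0.0
    n = len(shared)
    return (n - 1) / (n - 1 + lam) * rho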
Similarity for binary data
• Pearson correlation is meaningless for binary events (views, purchase behavior, clicks)
• Counts: m_i users acting on i, m_{ij} users acting on both i and j, m total users
• Jaccard similarity (intersection vs. union):
  s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}}
• Observed/expected ratio:
  s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}
• Improve by counting per user (many users are better evidence than a few heavy users)
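Two small helpers corresponding to the formulas above; the smoothing constant alpha and the example counts are illustrative:

def jaccard_similarity(m_i, m_j, m_ij, alpha=5.0):
    # smoothed Jaccard similarity: intersection over union of the user sets
    return m_ij / (alpha + m_i + m_j - m_ij)

def observed_expected(m_i, m_j, m_ij, m, alpha=5.0):
    # observed co-occurrences vs. the count expected under independence
    return m_ij / (alpha + m_i * m_j / m)

# example: items seen by 1000 and 2000 of 1e6 users, 50 users saw both;
# under independence we expect 1000 * 2000 / 1e6 = 2, so the ratio is large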
2 Matrix Factorization

Part II: Matrix factorization techniques

Basic Idea
  M \approx U \cdot V

Latent factor view
[Figure: movies and users embedded in a two-dimensional latent space, one axis running
from "geared towards females" to "geared towards males", the other from "serious" to
"escapist"; e.g. Braveheart, Amadeus, The Color Purple, Sense and Sensibility,
Lethal Weapon, Ocean's 11, The Lion King, Dumb and Dumber, The Princess Diaries,
Independence Day, and users Dave and Gus]
Basic matrix factorization model

Approximate the (sparse) item x user rating matrix by a product of low-rank factors,
e.g. a rank-3 SVD-style approximation:

ratings (items x users)
  1  3  ?  5  5  4
  5  4  4  2  1  3
  2  4  1  2  3  4  3  5
  2  4  5  4  2
  4  3  4  2  2  5
  1  3  3  2  4

item factors (items x 3)       user factors (3 x users)
  .1  -.4   .2                 1.1  -.2   .3   .5  -2   -.5   .8  -.4   .3  1.4  2.4  -.9
 -.5   .6   .5                 -.8   .7   .5  1.4   .3  -1   1.4  2.9  -.7  1.2  -.1  1.3
 -.2   .3   .5                 2.1  -.4   .6  1.7  2.4   .9  -.3   .4   .8   .7  -.6   .1
 1.1  2.1   .3
 -.7  2.1  -2
 -1    .7   .3

Estimate unknown ratings as inner products of the latent factors: the missing "?" for
item 1 is estimated as the inner product of item 1's factor row with the corresponding
user's factor column (≈ 2.4 in the slide example).

A rank-3 SVD approximation
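A minimal sketch of the prediction step, assuming factor matrices of the shapes shown above; the random values stand in for learned factors (the slide's numbers are illustrative rather than an exact factorization):

import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 3))    # 6 items, rank-3 item factors
P = rng.normal(size=(3, 12))   # 12 users, rank-3 user factors
R_hat = Q @ P                  # every entry is a rating estimate,
                               # including the never-observed ones
print(R_hat[0, 2])             # estimate for item 1, user 3 (the "?" in the slide)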


Properties of the matrix factorization model
• SVD isn't defined when entries are unknown — use specialized methods:
  • stochastic gradient descent (faster)
  • alternating optimization
• Very powerful model: it can easily overfit without regularization,
  particularly if there are fewer reviews than latent dimensions
• Probably the most popular model among Netflix contestants
  – 12/11/2006: Simon Funk describes an SVD-based method
  – 12/29/2006: free implementation at timelydevelopment.com
[Figure: factor models, error vs. number of parameters — RMSE of NMF, BiasSVD, SVD++
and SVD v.2–v.4 with 40–1500 factors plotted against millions of parameters;
Netflix's own system: 0.9514, Grand Prize target: 0.8563]
Risk Minimization View
• Objective function
  \min_{p,q} \sum_{(u,i) \in S} (r_{ui} - \langle p_u, q_i \rangle)^2
             + \lambda \big[ \|p\|_{\mathrm{Frob}}^2 + \|q\|_{\mathrm{Frob}}^2 \big]
• Alternating least squares (good for MapReduce)
  p_u \leftarrow \Big[ \lambda \mathbf{1} + \sum_{i | (u,i) \in S} q_i q_i^\top \Big]^{-1} \sum_{i | (u,i) \in S} q_i r_{ui}
  q_i \leftarrow \Big[ \lambda \mathbf{1} + \sum_{u | (u,i) \in S} p_u p_u^\top \Big]^{-1} \sum_{u | (u,i) \in S} p_u r_{ui}
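A sketch of one ALS sweep implementing these updates, assuming the observed ratings are given as a dense matrix plus a 0/1 mask (a real system would use sparse storage); shapes and the regularizer value are illustrative:

import numpy as np

def als_step(R, mask, P, Q, lam=0.1):
    # rows of P are p_u, rows of Q are q_i; minimize
    # sum_(u,i) (r_ui - <p_u, q_i>)^2 + lam (||P||^2 + ||Q||^2)
    k = P.shape[1]
    for u in range(P.shape[0]):                 # user updates (parallelizable)
        idx = np.where(mask[u] > 0)[0]
        if len(idx) == 0:
            continue
        Qi = Q[idx]
        P[u] = np.linalg.solve(lam * np.eye(k) + Qi.T @ Qi, Qi.T @ R[u, idx])
    for i in range(Q.shape[0]):                 # item updates (parallelizable)
        idx = np.where(mask[:, i] > 0)[0]
        if len(idx) == 0:
            continue
        Pu = P[idx]
        Q[i] = np.linalg.solve(lam * np.eye(k) + Pu.T @ Pu, Pu.T @ R[idx, i])
    return P, Q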
Risk Minimization View
• Same objective function as above
• Stochastic gradient descent (much faster)
  p_u \leftarrow (1 - \eta_t \lambda) p_u + \eta_t q_i (r_{ui} - \langle p_u, q_i \rangle)
  q_i \leftarrow (1 - \eta_t \lambda) q_i + \eta_t p_u (r_{ui} - \langle p_u, q_i \rangle)
• No need for locking
• Multicore updates run asynchronously
  (Recht, Re, Wright, 2012 - Hogwild)
Theoretical Motivation: de Finetti's Theorem
• Independent random variables:
  p(X) = \prod_{i=1}^m p(x_i)
• Exchangeable random variables:
  p(X) = p(x_1, \ldots, x_m) = p(x_{\pi(1)}, \ldots, x_{\pi(m)})
• There exists a conditionally independent representation of exchangeable random variables:
  p(X) = \int dp(\theta) \prod_{i=1}^m p(x_i | \theta)
This motivates latent variable models.
Aldous-Hoover Factorization
• Matrix-valued set of random variables.
  Example — Erdos-Renyi graph model: p(E) = \prod_{i,j} p(E_{ij})
• Independently exchangeable on the matrix:
  p(E) = p(E_{11}, E_{12}, \ldots, E_{mn}) = p(E_{\pi(1)\rho(1)}, E_{\pi(1)\rho(2)}, \ldots, E_{\pi(m)\rho(n)})
• Aldous-Hoover theorem:
  p(E) = \int dp(\theta) \int \prod_{i=1}^m dp(u_i) \prod_{j=1}^n dp(v_j) \prod_{i,j} p(E_{ij} | u_i, v_j, \theta)
Aldous-Hoover Factorization
[Figure: a sparsely observed matrix with column variables u_1..u_6, row variables
v_1..v_5 and observed entries e_ij]
• The rating matrix is (row, column) exchangeable
• Draw latent variables per row and per column
• Draw matrix entries independently given the pairs
• Absence / presence of a rating is a signal
• Can be extended to graphs with vertex attributes
Aldous Hoover variants
• Jointly exchangeable matrix
• Social network graphs
• Draw vertex attributes first, then edges
• Cold start problem
• New user appears
• Attributes (age, location, browser)
• Can estimate latent variables from that
• User and item factors in matrix factorization
problem can be viewed as AH-factors
Improvements
[Figure: the same error vs. #parameters plot — adding biases (BiasSVD) improves over
plain NMF at equal parameter count]
Bias
• Objective function
  \min_{p,q,b} \sum_{(u,i) \in S} \big(r_{ui} - (\mu + b_u + b_i + \langle p_u, q_i \rangle)\big)^2
               + \lambda \big[ \|p\|_{\mathrm{Frob}}^2 + \|q\|_{\mathrm{Frob}}^2 + \|b_{\mathrm{users}}\|^2 + \|b_{\mathrm{items}}\|^2 \big]
• Stochastic gradient descent
  p_u \leftarrow (1 - \eta_t \lambda) p_u + \eta_t \rho_{ui} q_i
  q_i \leftarrow (1 - \eta_t \lambda) q_i + \eta_t \rho_{ui} p_u
  b_u \leftarrow (1 - \eta_t \lambda) b_u + \eta_t \rho_{ui}
  b_i \leftarrow (1 - \eta_t \lambda) b_i + \eta_t \rho_{ui}
  \mu \leftarrow (1 - \eta_t \lambda) \mu + \eta_t \rho_{ui}
  where \rho_{ui} = r_{ui} - (\mu + b_i + b_u + \langle p_u, q_i \rangle)
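A sketch of one SGD epoch for the biased model, assuming ratings arrive as (user, item, rating) triples; for simplicity the global mean mu is kept fixed here, whereas the slide also shrinks and updates it:

import numpy as np

def sgd_epoch(triples, P, Q, bu, bi, mu, lam=0.05, eta=0.01):
    # biased model r_ui ~ mu + b_u + b_i + <p_u, q_i>
    for u, i, r in triples:
        rho = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])        # residual rho_ui
        P[u], Q[i] = ((1 - eta * lam) * P[u] + eta * rho * Q[i],
                      (1 - eta * lam) * Q[i] + eta * rho * P[u])
        bu[u] = (1 - eta * lam) * bu[u] + eta * rho
        bi[i] = (1 - eta * lam) * bi[i] + eta * rho
    return P, Q, bu, bi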
[Figure: the same error vs. #parameters plot — SVD++ exploits the "who rated what"
signal and improves further]
Ratings are not given at random!
[Figure: distribution of ratings in Netflix ratings, Yahoo! music ratings, and Yahoo!
survey answers]
• B. Marlin et al., "Collaborative Filtering and the Missing at Random Assumption", UAI 2007
• Characterize users by which movies they rated, rather than how they rated them
Movie rating matrix
• A dense binary representation of the data: alongside the sparse rating matrix
  R = {r_ui} keep the indicator matrix B = {c_ui}, with c_ui = 1 iff user u rated
  movie i (the "who rated what" matrix)
• Characterize users by which movies they rated
• Edge attributes: (observed, rating)
• Adding features to the recommender system (regression term):
  r_{ui} = \mu + b_u + b_i + \langle p_u, q_i \rangle + \langle c_u, x_i \rangle
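A one-line sketch of this feature-augmented prediction, assuming c_u is the binary who-rated-what indicator vector of user u and x_i a weight vector of the same length attached to item i (names and shapes are assumptions for illustration):

import numpy as np

def predict_with_features(mu, b_u, b_i, p_u, q_i, c_u, x_i):
    # r_ui = mu + b_u + b_i + <p_u, q_i> + <c_u, x_i>
    return mu + b_u + b_i + np.dot(p_u, q_i) + np.dot(c_u, x_i)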
Alternative integration
• Key idea: use the related ratings ("who rated what") to offset the factors
• Salakhutdinov & Mnih, 2007:
  q_i \leftarrow q_i + \sum_u c_{ui} p_u
• Koren et al., 2008: the same idea with a second, dedicated set of factors x in place
  of p, overparametrizing items by q and x
[Figure: the same error vs. #parameters plot — adding temporal effects gives a further
improvement]
Something happened in early 2004…
[Figure: Netflix ratings by date — a jump in early 2004, when Netflix changed the
rating labels]
Are movies getting better with time?
Sources of temporal change
• Items
• Seasonal effects
(Christmas, Valentine’s day, Holiday movies)
• Public perception of movies (Oscar etc.)
• Users
• Changed labeling of reviews
• Anchoring (relative to previous movie)
• Change of rater in household
• Selection bias for time of viewing
Modeling temporal change
• Time-dependent bias
• Time-dependent user preferences
  r_{ui}(t) = \mu + b_u(t) + b_i(t) + \langle q_i, p_u(t) \rangle
• Parameterize the functions b and p
  • Slow changes for items
  • Fast, sudden changes for users
• Good parametrization is key

Koren et al., KDD 2009 (CF with temporal dynamics)
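One concrete parametrization of a slowly drifting user bias, in the spirit of the cited paper (the deviation function and beta = 0.4 follow Koren, KDD 2009; treat the exact form as an example rather than the only choice):

import numpy as np

def user_bias_t(b_u, alpha_u, t, t_u, beta=0.4):
    # b_u(t) = b_u + alpha_u * dev_u(t),  dev_u(t) = sign(t - t_u) * |t - t_u|**beta
    # b_u, alpha_u are learned per user; t_u is the user's mean rating date
    dev = np.sign(t - t_u) * np.abs(t - t_u) ** beta
    return b_u + alpha_u * dev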


Biases matter!
Sources of variance in the Netflix data:
  unexplained       0.732   (57%)
  biases            0.415   (33%)
  personalization   0.129   (10%)
  total variance    1.276
[Figure: the same error vs. #parameters plot, from the plain inner-product model
r_{ui} \approx q_i^\top p_u to the full temporal model
r_{ui}(t) = \mu + b_u(t) + b_i(t) + q_i^\top (p_u(t) + \sum_j b_{uj} x_j);
Netflix's own system: 0.9514, Grand Prize target: 0.8563]
More ideas

• Explain factorizations
• Cold start (new users)
• Different regularization for different parameter
groups / different users
• Sharing of statistical strength between users
• Hierarchical matrix co-clustering / factorization
(write a paper on that)
3 Session Modeling
Motivation
User interaction
• Explicit search query
• Search engine
• Genre selection on movie site
• Implicit search query
• News site
• Priority inbox
• Comments on article
• Viewing specific movie (see also ...)
• Sponsored search (advertising)

Space, users' time and attention are limited.
[Figure: a content page annotated with "session? models?" and "Did the user scroll down?"]
Bad ideas ...
• Show items based on relevance alone
  • Yes, this user likes Die Hard.
  • But he likes other movies, too.
• Show items only for the majority of users
  • 'apple' vs. 'Apple'
User response
[Figure: a page layout where the user collapses a module or hovers on a link —
implicit signals of user interest; log them!]
Response is conditioned on the available options
• The user searches for 'chocolate' and picks one of the displayed results
• What the user really would have wanted may not be shown:
  the user can only pick from the available items
• Preferences are often relative
Models

Independent click model
• Each object has its own click probability
• Each object is viewed independently
• Used in computational advertising (with some position correction)
• Horribly wrong assumption — OK only if the click probability is very small
  (which holds for ads)
  p(x|s) = \prod_{i=1}^n \frac{1}{1 + e^{-x_i s_i}}
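A minimal sketch of this likelihood, encoding a click at position i as x_i = +1 and no click as x_i = -1 (names and example scores are illustrative):

import numpy as np

def independent_click_loglik(x, s):
    # sum_i log sigma(x_i * s_i), with sigma(z) = 1 / (1 + exp(-z))
    x, s = np.asarray(x, dtype=float), np.asarray(s, dtype=float)
    return -np.sum(np.log1p(np.exp(-x * s)))

print(np.exp(independent_click_loglik([1, -1, -1], [2.0, -1.0, 0.5])))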
Logistic click model
• The user picks at most one object (or no click)
• Exponential family model for the click:
  p(x|s) = \frac{e^{s_x}}{e^{s_0} + \sum_{x'} e^{s_{x'}}} = \exp(s_x - g(s))
• Ignores the order of the objects
• Assumes that the user looks at everything before taking action
Sequential click model
• The user traverses the list from the top
• At each position there is some probability of clicking
• When the user reaches the end of the list he aborts
  p(x = j | s) = \Bigg[ \prod_{i=1}^{j-1} \frac{1}{1 + e^{s_i}} \Bigg] \frac{1}{1 + e^{-s_j}}
• This assumes that a patient user viewed all items up to the click
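A sketch of the click probability for a 1-indexed position j, directly mirroring the product above (the scores are illustrative):

import numpy as np

def sequential_click_prob(j, s):
    # no click at positions 1..j-1 (prob 1/(1+exp(s_i)) each),
    # then a click at position j (prob 1/(1+exp(-s_j)))
    s = np.asarray(s, dtype=float)
    no_click = 1.0 / (1.0 + np.exp(s[:j - 1]))
    click = 1.0 / (1.0 + np.exp(-s[j - 1]))
    return np.prod(no_click) * click

print(sequential_click_prob(2, [0.5, 1.0, -0.3]))  # click on the 2nd result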


Skip click model
• The user traverses the list
• At each position there is some probability of clicking
• At each position the user may abandon the process
• This assumes that the user traverses the list sequentially
Context skip click model

• User traverses list


• At each position some probability of clicking which depends on previous content
• At each position the user may abandon the process
• User may click more than once
Context skip click model
• Examination (viewing) probability: if a user did not examine a result he will not
  click on it. Distinguish whether the previous item was clicked:
    p(v_i = 1 | v_{i-1} = 0) = 0
    p(v_i = 1 | v_{i-1} = 1, c_{i-1} = 0) = \frac{1}{1 + e^{-\alpha_i}}
    p(v_i = 1 | v_{i-1} = 1, c_{i-1} = 1) = \frac{1}{1 + e^{-\beta_i}}
  Different coefficients \alpha_i, \beta_i make sense since the propensity to click
  varies with the location on the results page (e.g. whether the user must scroll).
• Click probability (only if viewed), with a logistic dependence on the prior context:
  the number of previous clicks |c^{i-1}| and the relevance of the current article d_i
  given the previously displayed articles d^{i-1} (the particular order of the latter
  is irrelevant):
    p(c_i = 1 | v_i = 1, c^{i-1}, d^i) = \frac{1}{1 + e^{-f(|c^{i-1}|, d_i, d^{i-1})}}
• The relevance score f is built from a submodular coverage function, so covering an
  already-covered topic has a diminishing-return effect; this also allows greedy
  maximization when composing the result set.

Incremental gains score
• Given this definition of coverage, convert it into the relevance score used in the
  click model by defining
    f(|c^{i-1}|, d_i, d^{i-1}) := \rho(S, d^i | a, b) - \rho(S, d^{i-1} | a, b)
                                  + \gamma |c^{i-1}| + \beta_i
  where \rho(S, d | a, b) is a (submodular) coverage score of the displayed documents
  over the different aspects of the session S.
• Submodular gain per additional document
• Relevance score per document
• Coverage over different aspects
• Position-dependent score
• Score depends on the number of previous clicks
• Here \gamma |c^{i-1}| is a correction which captures the effect of having visited
  |c^{i-1}| links previously, and \beta_i captures the position-specific effect, i.e.
  how clickable different positions in the result set are. For b = 0 we are back to a
  modular relevance ranking model where each article is scored regardless of previously
  shown content — the vector space model as a special case.
Optimization
• Latent variables: we don't know v, i.e. whether the user actually viewed a result
• Use variational inference to integrate out v (more next week in graphical models):
  -\log p(c) \le -\log p(c) + D(q(v) \| p(v|c))
             = E_{v \sim q(v)} [-\log p(c) + \log q(v) - \log p(v|c)]
             = E_{v \sim q(v)} [-\log p(c, v)] - H(q(v))
  Here q(v) is a distribution over the latent variables and H(q) its entropy.
Optimization
• Compute the latent viewing probability given the clicks
  • Easy, since there is only one transition from views to no views (no dynamic
    programming needed)
• Expected log-likelihood under the viewing model
  • Convex expected log-likelihood
  • Stochastic gradient descent
• The parametrization uses personalization, too (user, position, viewport, browser)
4 Feature Representation
Bayesian Probabilistic Matrix Factorization

Statistical Model: Probabilistic Matrix Factorization (PMF)
• PMF is a simple probabilistic linear model with Gaussian observation noise
  (an Aldous-Hoover factorization with normally distributed latent factors)
• Given the feature vectors for the user and the movie, the distribution of the
  corresponding rating is
  p(R_{ij} | U_i, V_j, \sigma^2) = \mathcal{N}(R_{ij} | U_i^\top V_j, \sigma^2),
  \quad i = 1, \ldots, N,\; j = 1, \ldots, M
• The user and movie feature vectors are given zero-mean spherical Gaussian priors:
  p(U | \sigma_U^2) = \prod_{i=1}^N \mathcal{N}(U_i | 0, \sigma_U^2 I), \qquad
  p(V | \sigma_V^2) = \prod_{j=1}^M \mathcal{N}(V_j | 0, \sigma_V^2 I)

Salakhutdinov & Mnih, ICML 2008 (BPMF)
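A short generative sketch of PMF under the assumptions above — factors drawn from zero-mean spherical Gaussians and ratings from a Gaussian around the inner products (sizes and noise levels are illustrative):

import numpy as np

def sample_pmf(n_users, n_items, k, sigma=0.5, sigma_u=1.0, sigma_v=1.0, seed=0):
    # draw U, V from the spherical Gaussian priors, then R_ij ~ N(U_i^T V_j, sigma^2)
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, sigma_u, size=(n_users, k))
    V = rng.normal(0.0, sigma_v, size=(n_items, k))
    R = U @ V.T + rng.normal(0.0, sigma, size=(n_users, n_items))
    return U, V, R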
Details: automatic complexity control for PMF (BPMF)
• Place priors on all factor hyperparameters instead of simply penalizing the Frobenius
  norm of the feature matrices
• The Wishart prior is conjugate to the Gaussian, hence use it
• Allows adapting the variance automatically: priors with diagonal or full covariance
  matrices and adjustable means, or even mixture-of-Gaussians priors
• Using spherical Gaussian priors for the feature vectors recovers standard PMF, with
  \lambda_U and \lambda_V chosen automatically
• Inference (Gibbs sampler)
  • Sample user factors (parallel)
  • Sample movie factors (parallel)
  • Sample hyperparameters (parallel)
Making it fancier (constrained BPMF)
[Graphical model: each user's latent vector is offset by a latent matrix of per-item
vectors W_k, weighted by the indicator I_i of which items the user rated
("who rated what")]
Results (Salakhutdinov & Mnih)
[Figure — left panel: RMSE of constrained PMF, PMF and the movie-average algorithm
(which always predicts each movie's average rating) as a function of the number of
observed ratings per user; right panel: percentage of users per rating-count bucket.
Constrained PMF helps most for infrequent users.]
Multiple Sources: Social Network Data
Data: users, connections, features (observed user features y, latent user features x, edges e)
Goal: model / suggest connections
  p(x, y, e) = \prod_{i \in \mathrm{Users}} p(y_i)\, p(x_i | y_i)
               \prod_{i,j \in \mathrm{Users}} p(e_{ij} | x_i, y_i, x_j, y_j)
A direct application of the Aldous-Hoover theorem: edges are conditionally independent
given the (latent and observed) vertex attributes.
Applications
social network = friendship + interests
• Recommend users based on friendship & interests
  • boost traffic
  • make the user graph more dense
  • increase the user population
  • stickiness
• Recommend apps based on friendship & interests
  • boost traffic
  • increased revenue
  • increased user participation
  • make the app graph more dense
... usually addressed by separate tools ...
Homophily
• Users with similar interests are more likely to connect
• Friends install similar applications
These signals are highly correlated — estimate both jointly.
Model
[Graphical model: latent user features x (derived from observed user features y) and
latent app features v (derived from observed app features u); social edges e depend on
pairs of users, app installs a depend on a (user, app) pair]
Model
• Social interaction
  x_i \sim p(x | y_i), \quad x_j \sim p(x | y_j), \quad
  e_{ij} \sim p(e | x_i, y_i, x_j, y_j, \theta)
• App install
  x_i \sim p(x | y_i), \quad v_j \sim p(v | u_j), \quad
  a_{ij} \sim p(a | x_i, y_i, u_j, v_j, \theta)
• Cold start / latent features
  x_i = A y_i + \epsilon_i, \qquad v_j = B u_j + \tilde{\epsilon}_j
• Bilinear features
  e_{ij} \sim p(e | x_i^\top x_j + y_i^\top W y_j), \qquad
  a_{ij} \sim p(a | x_i^\top v_j + y_i^\top M u_j)
Optimization Problem
  minimize  \lambda_e \sum_{(i,j)} l(e_{ij}, x_i^\top x_j + y_i^\top W y_j)              [social]
          + \lambda_a \sum_{(i,j)} l(a_{ij}, x_i^\top v_j + y_i^\top M u_j)              [app]
          + \lambda_x \sum_i \Omega(x_i | y_i) + \lambda_v \sum_i \Omega(v_i | u_i)      [reconstruction]
          + \lambda_W \|W\|^2 + \lambda_M \|M\|^2 + \lambda_A \|A\|^2 + \lambda_B \|B\|^2   [regularizer]
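A sketch of evaluating this joint objective, assuming logistic losses l, squared reconstruction penalties Omega(x|y) = ||x - A y||^2 (matching x_i = A y_i + eps_i), and a single weight for the four matrix regularizers; every name, shape and weight here is illustrative:

import numpy as np

def logloss(label, score):
    # label in {-1, +1}; log(1 + exp(-label * score))
    return np.logaddexp(0.0, -label * score)

def joint_objective(E, A_obs, X, Y, V, U, W, M, A, B, lam):
    # X: latent user features, Y: observed user features,
    # V: latent app features,  U: observed app features
    social = sum(logloss(e, X[i] @ X[j] + Y[i] @ W @ Y[j]) for i, j, e in E)
    app = sum(logloss(a, X[i] @ V[j] + Y[i] @ M @ U[j]) for i, j, a in A_obs)
    recon_x = np.sum((X - Y @ A.T) ** 2)
    recon_v = np.sum((V - U @ B.T) ** 2)
    reg = np.sum(W ** 2) + np.sum(M ** 2) + np.sum(A ** 2) + np.sum(B ** 2)
    return (lam["e"] * social + lam["a"] * app +
            lam["x"] * recon_x + lam["v"] * recon_v + lam["reg"] * reg)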
Loss Function
• Much more evidence of application non-installs (i.e. many more negative examples)
• Few links between vertices in the friendship network (even within short graph distance)
• Generate ranking problems (link, non-link) with non-links drawn from a background set
[Figure: ranking losses for application recommendation and social recommendation]
Optimization
• Nonconvex optimization problem
• Large set of variables: x_i = A y_i + \epsilon_i, v_j = B u_j + \tilde{\epsilon}_j, with
  e_{ij} \sim p(e | x_i^\top x_j + y_i^\top W y_j) and
  a_{ij} \sim p(a | x_i^\top v_j + y_i^\top M u_j)
• Stochastic gradient descent on x, v, \epsilon for speed
• Use hashing to reduce the memory load, i.e. x_{ij} = \sigma(i,j)\, X[h(i,j)]
  (binary sign hash \sigma, bucket hash h)
Y! Pulse
Data: 1.2M users, 386 items, 6.1M friend connections, 29M interest indications

App Recommendation
[Figure: comparison of SIM (similarity based model), RLFM (regression based latent
factor model, Chen & Agarwal) and NLFM (SIM & RLFM combined)]

Social recommendation
[Figure: social and app recommendation performance as a function of the L2 penalty]
Extensions
• Multiple relations: (user, user), (user, app), (app, advertisement)
• Users visiting several properties: news, mail, frontpage, social network, etc.
• Different statistical models
  • Latent Dirichlet Allocation for latent factors
  • Indian Buffet Process
More strategies
Multiple factor LDA
• Discrete set of preferences
(Porteous, Bart, Welling, 2008)
• User picks one to assess movie
• Movie represented by a discrete attribute
• Inference by Gibbs sampler
• Works fairly well

• Extension by Lester Mackey and coworkers to


combine with BPMF model
More state representations

• Indian Buffet Process


(Griffiths & Ghahramani, 2005)
• Attribute vector is binary string
• Models preferences naturally & very compact
(Inference is costly)
• Hierarchical attribute representation and
clustering over users ... TO DO
5 Hashing

Parameter Storage
• We have millions of users
• We have millions of products
• Storage: for 100 factors this requires 10^6 x 10^6 x 8 bytes = 8 TB
• We want a model that can be kept in RAM (< 16 GB)
• Instant response for each user
• Disks manage 20 IOP/s at best (SSDs are much better)
• Privacy (what if the parameter vector leaks?)
Recall - Hash Kernels
• Instance x_i \in \mathbb{R}^{N(U+1)}; store weights in a compressed vector via
  \bar{w}[h(i)] \mathrel{+}= \sigma(i)\, x_i with random signs \sigma(i) \in \{-1, 1\}
[Figure: a spam email ("Hey, please mention subtly during your talk that people should
use Yahoo mail more often. Thanks, Someone") whose global features, e.g. h('mention'),
and per-user features, e.g. h('mention_barney') for user barney, are hashed with signs
s(·) into a shared weight vector]
• Similar to the count hash (Charikar, Chen, Farach-Colton, 2003)
Collaborative Filtering with hashing
• Hashing compression: store the factor matrices U and V in arrays u, v of fixed size,
  u_i = \sum_{j,k : h(j,k) = i} \xi(j,k)\, U_{jk}
  \quad \text{and} \quad
  v_i = \sum_{j,k : h'(j,k) = i} \xi'(j,k)\, V_{jk},
  and reconstruct entries of X = U^\top V as
  X_{ij} := \sum_k \xi(k,i)\, \xi'(k,j)\, u_{h(k,i)}\, v_{h'(k,j)}
• The approximation error is O(1/n)
• To show that the estimate is unbiased, expand
  X_{ij} = \sum_k \xi(k,i)\, \xi'(k,j)
           \sum_{l,k' : h(k',l) = h(k,i)} \; \sum_{o,k'' : h'(k'',o) = h'(k,j)}
           \xi(k',l)\, \xi'(k'',o)\, U_{k'l} V_{k''o}
  and take expectations over the Rademacher hash signs: all cross terms vanish.
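A small sketch of the compression and the readout, with stand-in hash and sign functions (a real system would use a proper hash family); U and V are assumed to be stored as (k_dims x n_users) and (k_dims x n_items) arrays:

import numpy as np

def hash_idx(i, j, n, seed=0):
    # stand-in bucket hash h(i, j) -> {0, ..., n-1}
    return hash((seed, i, j)) % n

def rademacher(i, j, seed=0):
    # stand-in Rademacher sign xi(i, j) in {-1, +1}
    return 1 if hash((seed ^ 0x9E3779B9, i, j)) % 2 == 0 else -1

def compress(F, n, seed=0):
    # store a factor matrix F via u[h(j, k)] += xi(j, k) * F[j, k]
    u = np.zeros(n)
    for j in range(F.shape[0]):
        for k in range(F.shape[1]):
            u[hash_idx(j, k, n, seed)] += rademacher(j, k, seed) * F[j, k]
    return u

def estimate_entry(u, v, i, j, k_dims, n, seed_u=0, seed_v=1):
    # X_ij = sum_k xi(k, i) xi'(k, j) u[h(k, i)] v[h'(k, j)]
    return sum(rademacher(k, i, seed_u) * rademacher(k, j, seed_v) *
               u[hash_idx(k, i, n, seed_u)] * v[hash_idx(k, j, n, seed_v)]
               for k in range(k_dims))

# usage sketch:
# U = np.random.randn(10, 1000); V = np.random.randn(10, 500)
# u, v = compress(U, 4096, seed=0), compress(V, 4096, seed=1)
# estimate_entry(u, v, i=3, j=7, k_dims=10, n=4096)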
Collaborative Hashing

• Combine with stochastic gradient descent


• Random access in memory is expensive
(we now have to do k lookups per pair)
• Feistel networks can accelerate this

• Distributed optimization without locking


Examples
[Figure: test error as a function of the compressed parameter budget (thousands of
hashed elements vs. rows in U and M) on the EachMovie and MovieLens data]
Summary
• Neighborhood methods
• User / movie similarity
• Iteration on graph
• Matrix Factorization
• Singular value decomposition
• Convex reformulation
• Ranking and Session Modeling
• Ordinal regression
• Session models
• Features
• Latent dense (Bayesian Probabilistic Matrix Factorization)
• Latent sparse (Dirichlet process factorization)
• Coldstart problem (inferring features)
• Hashing
Further reading
• Collaborative Filtering with temporal dynamics
  http://research.yahoo.com/files/kdd-fp074-koren.pdf
• Neighborhood factorization
  http://research.yahoo.com/files/paper.pdf
• Matrix Factorization for recommender systems
  http://research.yahoo.com/files/ieeecomputer.pdf
• CoFi Rank (collaborative filtering & ranking)
  http://www.cofirank.org/
• Yehuda Koren's papers
  http://research.yahoo.com/Yehuda_Koren
