13 Recsys 2

The document outlines announcements regarding homework deadlines and class schedules for a course on Recommender Systems and Big Data Analytics. It covers key concepts such as UV decomposition, latent factor models, and the importance of regularization to prevent overfitting in model training. Additionally, it discusses optimization techniques like gradient descent and the use of biases in modeling user-item interactions.


Recommender Systems 2
EE412: Foundation of Big Data Analytics
Fall 2024
Jaemin Yoo

Announcements

• Homeworks:
  • HW2 (due date extended to 11/08)
    • Due to maintenance at Haedong Lounge (11/01 18:00 – 11/04 09:00).
    • Enjoy the Netflix challenge!
  • HW3 (will be posted on 11/06)
• Midterm:
  • Claim finished; thank you for your hard work!
• Classes:
  • No in-person class on 11/04; video will be uploaded.

Recap

• Recommender Systems
• Content-based Recommendation
• Collaborative Filtering
• The Netflix Challenge

[Figure: collaborative-filtering example with the items "Touching the Void" and "Into Thin Air"; users, items, and like/similar/recommend relations, contrasted with recommending by popularity.]

Outline

1. UV Decomposition
2. UV Decomposition: Computation
3. UV Decomposition: Variants


Goal of Recommender Systems

• Recommendation is to fill in the blanks in the utility matrix 𝑅.
• The core operation is how to get user and item representations.
  • The content-based approach creates user/item profiles.
  • Collaborative filtering uses the rows and columns of 𝑅.

[Utility matrix example (mostly blank): movies HP1, HP2, HP3, TW, SW1, SW2, SW3 as columns; user A has ratings 4, 5, 1; user B has 5, 5, 4; user C has 2, 4, 5; user D has 3, 3.]

Latent Factor Models

• We'll learn latent factor models, which assume:
  • There are latent factors that can represent users and items well.
  • Such latent factors can be extracted from the utility matrix.
• Many people consider latent factor models as a part of CF.
  • Since they share the same philosophy.
  • CF uses the rows and columns of 𝑅 without modification.
  • Latent factor models extract (better) latent factors from 𝑅.

Latent Factor Models

• Idea: Consider a utility matrix as the product of factor matrices.
• Latent factors are underlying concepts/topics; same as in SVD.
  • E.g., users react to certain genres, famous actors, or directors.
• UV decomposition decomposes a utility matrix into 𝑈 and 𝑉.
  • Each user and movie is summarized as a low-dimensional vector.

[Figure: the $m \times n$ matrix $R$ is approximated by the product of an $m \times k$ matrix $U$ and a $k \times n$ matrix $V^\top$.]

UV Decomposition

• Given an $m \times n$ utility matrix $R$ (i.e., $m$ users and $n$ items).
• Find an $m \times k$ matrix $U$ and an $n \times k$ matrix $V$ such that:
  • $UV^\top$ closely approximates $R$ for the non-blank entries.
  • The elements of $UV^\top$ estimate the blank entries in $R$.
• Compute $\hat{r}_{xi} = u_x^\top v_i$ to predict $r_{xi}$.
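As a minimal sketch of this prediction step (Python with NumPy; the sizes and random factor matrices below are purely illustrative, not from the slides):

```python
import numpy as np

m, n, k = 4, 7, 2                 # users, items, latent factors (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(m, k))       # one k-dimensional row u_x per user
V = rng.normal(size=(n, k))       # one k-dimensional row v_i per item

# Predicted rating of user x for item i: r_hat_xi = u_x^T v_i
x, i = 0, 3
r_hat = U[x] @ V[i]

# Equivalently, U @ V.T reconstructs the full m x n prediction matrix,
# including estimates for the blank entries of R.
R_hat = U @ V.T
assert np.isclose(r_hat, R_hat[x, i])
```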


Error Function

• Root-mean-square error (RMSE) measures the difference.
• Let $E$ be the set of non-blank entries.
• $\mathrm{RMSE}(R, \hat{R}) = \sqrt{\tfrac{1}{|E|} \sum_{(x,i) \in E} (\hat{r}_{xi} - r_{xi})^2}$.
• Minimizing RMSE is equivalent to minimizing the sum of squared errors (SSE).
  • $\mathrm{SSE}(R, \hat{R}) = |E| \cdot \mathrm{RMSE}^2(R, \hat{R})$.

[Figure: RMSE computed between $R$ and $UV^\top$.]
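A small sketch of computing these errors over the observed entries only (Python with NumPy; the rating dictionary and factor matrices are made-up examples):

```python
import numpy as np

# Observed entries E as {(user, item): rating}; values are illustrative.
E = {(0, 0): 4.0, (0, 3): 5.0, (1, 0): 5.0, (2, 4): 4.0}

def rmse(E, U, V):
    """Root-mean-square error over the observed (non-blank) entries only."""
    errors = [(U[x] @ V[i] - r) ** 2 for (x, i), r in E.items()]
    return np.sqrt(np.mean(errors))

def sse(E, U, V):
    """Sum of squared errors; equals |E| * RMSE^2."""
    return sum((U[x] @ V[i] - r) ** 2 for (x, i), r in E.items())

U = np.random.default_rng(0).normal(size=(3, 2))
V = np.random.default_rng(1).normal(size=(5, 2))
assert np.isclose(sse(E, U, V), len(E) * rmse(E, U, V) ** 2)
```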

Recap: Singular Value Decomposition

• Decomposition of any matrix into a product of three matrices.
• Choose any number $r$ of intermediate concepts (= latent factors).
  • In a way that minimizes the reconstruction error.
  • The error is zero when $r \geq \mathrm{rank}(M)$.

[Figure: $M \approx U \Sigma V^\top$, with $U$ of size $m \times r$, $\Sigma$ of size $r \times r$, and $V^\top$ of size $r \times n$.]

UV Decomposition vs. SVD

• Note: $U$ and $\Sigma V^\top$ from SVD are also factor matrices.
• Differences of UV decomposition from SVD:
  • UV ignores the missing entries, not treating them as zero.
  • UV minimizes RMSE on only the training portion.
    • Larger $k$ is not necessarily better.
    • Larger $k$ does not guarantee decreasing RMSE on the test data.
  • $U$ and $V$ need not be orthonormal matrices.

Outline

1. UV Decomposition
2. UV Decomposition: Computation
3. UV Decomposition: Variants


Objective Function

• Goal: Find matrices $U$ and $V$ such that:

  $U^*, V^* = \operatorname{argmin}_{U,V} J(R, U, V)$

• $J(R, U, V) = \sum_{(x,i) \in E} (r_{xi} - u_x^\top v_i)^2$ measures SSE on the training data $E$.
• We call $U$, $V$ parameters.
  • Other choices are hyperparameters.

[Figure: $R$ ($m \times n$) approximated by $U$ ($m \times k$) times $V^\top$ ($k \times n$).]

Mismatch in the Objective Function

• Our (true) goal is to minimize SSE for unseen test data.
  • However, our objective function only considers training data.
• Increasing $k$ always decreases our objective function.
  • Since the error decreases with more factors.
• However, SSE on test data can begin to rise with large $k$.

Overfitting

• This is a classical example of overfitting:
  • Model starts fitting noise with too many free parameters.
  • Model is not generalizing well to unseen test data.
• We should carefully control the model complexity.
  • E.g., the number of clusters in $k$-means.

[Figure: overfitting illustration. Source: Medium]

Regularization

• Regularization is a possible way to prevent overfitting:
  • Allows a rich model when there is sufficient data.
  • Shrinks the model aggressively where data is scarce.
• The new objective function with regularization is

  $J(\cdot) = \sum_{(x,i) \in E} (r_{xi} - u_x^\top v_i)^2 + \lambda_1 \sum_x \|u_x\|^2 + \lambda_2 \sum_i \|v_i\|^2$

• $\lambda_1$ and $\lambda_2$ are hyperparameters; they limit the lengths of $u_x$ and $v_i$.
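A minimal sketch of evaluating this regularized objective (Python with NumPy; the function name and arguments are my own, not from the slides):

```python
import numpy as np

def objective(E, U, V, lam1, lam2):
    """SSE over the observed entries plus L2 penalties on all factor vectors."""
    sse = sum((r - U[x] @ V[i]) ** 2 for (x, i), r in E.items())
    reg = lam1 * np.sum(U ** 2) + lam2 * np.sum(V ** 2)  # sum_x ||u_x||^2, sum_i ||v_i||^2
    return sse + reg
```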


[Figure: movies placed in a two-dimensional latent factor space. Factor 1 ranges from "geared towards females" to "geared towards males", Factor 2 from "serious" to "funny"; titles shown include The Color Purple, Amadeus, Braveheart, Sense and Sensibility, Lethal Weapon, Ocean's 11, The Lion King, Dumb and Dumber, and Independence Day.]

• Effect of regularization: a movie goes to the center unless the signal is really strong.

Validation Data

• Introducing validation data is also important.
  • Split the training data into (new) training and validation data.
  • Check the performance on validation data during training.
  • Use the validation performance as a proxy of the test performance.

[Figure: the data split into training, validation, and test datasets.]
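A sketch of such a split on the observed ratings (plain Python; the split fraction and function name are illustrative choices, not from the slides):

```python
import random

def split_ratings(E, val_frac=0.1, seed=0):
    """Hold out a fraction of the observed (user, item) ratings for validation."""
    entries = list(E.items())
    random.Random(seed).shuffle(entries)
    n_val = int(len(entries) * val_frac)
    return dict(entries[n_val:]), dict(entries[:n_val])  # (train, validation)

# During training, track RMSE on the validation dictionary as a proxy for
# test performance, e.g., to pick hyperparameters or to stop early.
```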

Gradient Descent

• Gradient descent (GD) aims to find an input $x$ that minimizes $f(x)$.
  • Compute the derivative $\nabla f$.
  • Start at some point $y$ and evaluate $\nabla f(y)$.
  • Make a step in the reverse direction of the gradient: $y \leftarrow y - \nabla f(y)$.
  • Repeat until $f$ is sufficiently small.

[Figure: one step from $y$ along $-\nabla f(y)$ on the curve of $f$.]

Gradient Descent for UV Decomposition

• How to use GD to find $U^*$ and $V^*$ for UV decomposition:
  • Step 1: Initialize $U$ and $V$ using SVD, treating missing entries as 0.
  • Step 2: Update $U$ and $V$ to minimize the objective function $J(\cdot)$:
    • $U \leftarrow U - \eta \cdot \nabla_U J(\cdot)$
    • $V \leftarrow V - \eta \cdot \nabla_V J(\cdot)$
• $\eta$ is a hyperparameter (called a step size or a learning rate).


Gradient Descent for UV Decomposition

• Perform the update step on every entry independently.
  • Since $U$ and $V$ are matrices.
• For each entry at row $x$, column $c$ of matrix $U$, we update:

  $u_{xc} \leftarrow u_{xc} - \eta \cdot \nabla_{u_{xc}} J(U, V)$,

  where $\nabla_{u_{xc}} J(U, V) = \sum_{i:(x,i) \in E} -2 v_{ic} (r_{xi} - u_x^\top v_i) + 2 \lambda_1 u_{xc}$.

Stochastic Gradient Descent

• Observation: The gradient can be decomposed over the ratings:

  $\nabla J(R, U, V) = \sum_{(x,i) \in E} \nabla J(r_{xi}, U, V)$

• Stochastic gradient descent (SGD):
  • Evaluate the gradient on each (not all) rating and make a step.
  • Needs more steps until convergence, but each step is much faster.
  • GD: $U \leftarrow U - \eta \cdot \sum_{(x,i) \in E} \nabla J(r_{xi})$.
  • SGD: $U \leftarrow U - \eta \cdot \nabla J(r_{xi})$.

Convergence of SGD vs. GD

• GD improves the value of the objective function at every step.
• SGD improves the value but in a "noisy" way.
• GD takes fewer steps to converge, but each step takes much longer.

[Figure: objective function vs. iteration/step for GD and SGD.]

SGD for UV Decomposition

• How to use SGD to find $U^*$ and $V^*$ for UV decomposition:
  • Step 1: Initialize $U$ and $V$ using SVD, treating missing ratings as 0.
  • Step 2: Iterate over the ratings and update the factors until convergence:

    for each $r_{xi} \in E$:
      $\epsilon_{xi} \leftarrow 2 (r_{xi} - u_x^\top v_i)$
      $u_x \leftarrow u_x + \mu_1 (\epsilon_{xi} v_i - \lambda_1 u_x)$
      $v_i \leftarrow v_i + \mu_2 (\epsilon_{xi} u_x - \lambda_2 v_i)$

  • $\mu_1$ and $\mu_2$: learning rates.
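A minimal NumPy sketch of this SGD loop, following the update rules above (it uses random initialization rather than the SVD-based initialization from Step 1, and all constants are illustrative):

```python
import numpy as np

def sgd_uv(E, m, n, k=10, mu1=0.01, mu2=0.01, lam1=0.1, lam2=0.1,
           epochs=20, seed=0):
    """SGD for UV decomposition; E maps (user, item) to an observed rating."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(m, k))
    V = 0.1 * rng.normal(size=(n, k))
    entries = list(E.items())
    for _ in range(epochs):
        for idx in rng.permutation(len(entries)):   # visit ratings in random order
            (x, i), r = entries[idx]
            eps = 2 * (r - U[x] @ V[i])             # eps_xi = 2 (r_xi - u_x^T v_i)
            u_old = U[x].copy()                     # use the old u_x when updating v_i
            U[x] += mu1 * (eps * V[i] - lam1 * U[x])
            V[i] += mu2 * (eps * u_old - lam2 * V[i])
    return U, V

# Example: four made-up ratings from 3 users on 5 items.
E = {(0, 0): 4.0, (0, 3): 5.0, (1, 0): 5.0, (2, 4): 4.0}
U, V = sgd_uv(E, m=3, n=5, k=2)
print(U[0] @ V[3])  # prediction for an observed pair
```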
SGD with Mini-batches

• In practice, people do not apply SGD to individual samples.
• Instead, they create (mini-)batches of several samples.
  • GD: 1 step using $N$ samples.
  • (True) SGD: $N$ steps, each using 1 sample.
  • Batch SGD: $N/B$ steps, each using a batch of $B$ samples.
• This makes a better balance between speed and stability in training.
• $B$ is a hyperparameter to choose; a sketch of one batch step follows below.
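A sketch of one mini-batch step (Python with NumPy; it averages the per-rating gradients within the batch, which is one common choice, and the constants are illustrative):

```python
import numpy as np

def minibatch_step(batch, U, V, eta=0.01, lam1=0.1, lam2=0.1):
    """One step of batch SGD: accumulate gradients over B ratings, then update."""
    dU, dV = np.zeros_like(U), np.zeros_like(V)
    for (x, i), r in batch:                      # batch: list of ((user, item), rating)
        err = r - U[x] @ V[i]
        dU[x] += -2 * err * V[i] + 2 * lam1 * U[x]
        dV[i] += -2 * err * U[x] + 2 * lam2 * V[i]
    U -= eta * dU / len(batch)
    V -= eta * dV / len(batch)
```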

Outline

1. UV Decomposition
2. UV Decomposition: Computation
3. UV Decomposition: Variants

Modeling Biases and Interactions

• There are global effects (biases) of users and movies.
• Rating $r_{xi}$ is not only about the interaction between $x$ and $i$.
  • E.g., a critical reviewer $x_1$ vs. a generous person $x_2$.
  • E.g., The Godfather $i_1$ vs. some bad movie $i_2$ on Netflix.

Model with Biases

• Let's predict a rating as $\hat{r}_{xi} = \mu + b_x + b_i + u_x^\top v_i$.
  • $\mu$ is the overall mean rating.
  • $b_x$ and $b_i$ are biases for user $x$ and movie $i$, respectively (user bias and movie bias).
  • $u_x^\top v_i$ is their interaction modeled by factor matrices (user-movie interaction).
• Example:
  • Mean rating of training data is $\mu = 3.7$.
  • You are a critical reviewer: $b_x = -1$.
  • Star Wars is favored by many people: $b_i = +0.5$.
  • Final score is $3.7 - 1 + 0.5 + u_x^\top v_i$.
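A small sketch of this prediction with biases (Python with NumPy; the sizes, indices, and variable names are my own, with the bias values taken from the Star Wars example above):

```python
import numpy as np

m, n, k = 100, 50, 10                      # illustrative sizes
rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(m, k))
V = 0.1 * rng.normal(size=(n, k))
mu = 3.7                                   # global mean rating of the training data
b_user = np.zeros(m)                       # per-user bias b_x
b_item = np.zeros(n)                       # per-item bias b_i

def predict(x, i):
    """r_hat_xi = mu + b_x + b_i + u_x^T v_i."""
    return mu + b_user[x] + b_item[i] + U[x] @ V[i]

# A critical reviewer (b_x = -1) rating a well-liked movie (b_i = +0.5):
# the baseline part of the prediction is 3.7 - 1 + 0.5 = 3.2.
b_user[0], b_item[7] = -1.0, 0.5
print(predict(0, 7))
```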


Fitting the New Model

• Update the parameters $\{U, V, B_{\mathrm{user}}, B_{\mathrm{item}}\}$ minimizing $J'$:

  $J'(\cdot) = \sum_{(x,i) \in E} \big( r_{xi} - (\mu + b_x + b_i + u_x^\top v_i) \big)^2 + \text{regularizer}$

• There are 4 regularization hyperparameters: $\lambda_1, \lambda_2, \lambda_3, \lambda_4$.
• $\mu$ is the simple average of ratings; it need not be learned.

Further Improvements

• Any idea for modifying the three components is okay to try:
  • Set $\theta$ of learnable parameters.
  • Objective function $J$ to minimize.
  • (Optional) regularizer on $\theta$ with $\lambda$.
• SGD (or GD) will take care of finding the optimal parameters.
  • We believe in the power of gradient-based optimization!

Hyperparameter Search

• How can we search for the optimal hyperparameters $\lambda_1, \lambda_2, \lambda_3, \lambda_4$?
  • We pick the combination with the best validation performance.
• Grid search: Create a set of values for each and try every combination.
  • E.g., $\lambda_1 \in \{0.001, 0.01, 0.1, 1, 10\}$.
• Random search: Try randomly selected combinations.
  • E.g., $\lambda_1$ drawn from $[0.001, 10]$ (but on a log scale).
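A sketch of grid search over such values, picking the combination with the best validation error (plain Python; the fit and evaluate callables stand in for the SGD routine and validation RMSE and are assumptions, not slide material):

```python
import itertools

# Candidate values for two regularization weights (illustrative grid).
grid = {
    "lam1": [0.001, 0.01, 0.1, 1, 10],
    "lam2": [0.001, 0.01, 0.1, 1, 10],
}

def grid_search(train, val, fit, evaluate):
    """Try every combination; return the one with the lowest validation error.

    fit(train, **params) trains a model and evaluate(val, model) returns its
    validation error; both are supplied by the caller."""
    best_params, best_score = None, float("inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = evaluate(val, fit(train, **params))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Random search would instead sample each value, e.g., lam1 = 10 ** uniform(-3, 1).
```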

Dealing with Implicit Feedback

• What if the utility matrix $R$ contains only implicit feedback?
  • Each entry is either 0 (not watched) or 1 (watched).
• Using the same model with the RMSE objective $J$ leads to some limitations.
  • Limitation 1: Our prediction can be any number, e.g., $u_x^\top v_i > 1$.
  • Limitation 2: We may have a better gradient curve of $J$.
  • Limitation 3: Our loss $J$ assumes 0 means a dislike, not "not watched."


Idea 1: Sigmoid Function

• We want to limit the predictions of our model to $(0, 1)$.
• A simple solution is to use $\sigma(u_x^\top v_i)$ instead of $u_x^\top v_i$ as the output.
• The sigmoid function $\sigma$ is defined as follows:

  $\sigma(x) = \dfrac{1}{1 + e^{-x}}$

• Maps $(-\infty, \infty)$ to $(0, 1)$ with $\sigma(0) = 0.5$.
• Monotonically increasing over the whole range of $x$.

Implication of the Sigmoid

• Without sigmoid: Signals are mixed regardless of $r_{xi}$.
  • If $r_{xi} = 1$ and $u_x^\top v_i > 1$, the model is updated to decrease $u_x^\top v_i$.
  • If $r_{xi} = 1$ and $u_x^\top v_i < 1$, the model is updated to increase $u_x^\top v_i$.
  • If $r_{xi} = 0$ and $u_x^\top v_i > 0$, the model is updated to decrease $u_x^\top v_i$.
  • If $r_{xi} = 0$ and $u_x^\top v_i < 0$, the model is updated to increase $u_x^\top v_i$.
• With sigmoid: Signals are consistent with $r_{xi}$.
  • If $r_{xi} = 1$, the model is always updated to increase $u_x^\top v_i$.
  • If $r_{xi} = 0$, the model is always updated to decrease $u_x^\top v_i$.

Idea 2: Binary Cross Entropy

• RMSE is mainly designed for continuous, unbounded targets.
• We may use binary cross entropy (BCE) instead of RMSE:

  $J_{\mathrm{BCE}}(u_x, v_i, r_{xi}) = -\big[ r_{xi} \log \sigma(u_x^\top v_i) + (1 - r_{xi}) \log(1 - \sigma(u_x^\top v_i)) \big]$

• If $r_{xi} = 1$, we minimize $-\log \sigma(u_x^\top v_i)$ by pushing $\sigma(u_x^\top v_i)$ toward 1.
• If $r_{xi} = 0$, we minimize $-\log(1 - \sigma(u_x^\top v_i))$ by pushing $\sigma(u_x^\top v_i)$ toward 0.

Gradient Curves

• BCE has a gradient curve different from that of RMSE.
• Suppose that $r_{xi} = 0$ and $a = \sigma(u_x^\top v_i)$ for simplicity.
  • (RMSE) $\nabla_a\, a^2 = 2a$: the gradient increases linearly with $a$.
  • (BCE) $\nabla_a\, (-\log(1 - a)) = 1/(1 - a)$: the gradient increases rapidly.
• BCE makes the model focus more on wrong samples.
  • Not on already accurate ones.
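A sketch of the BCE loss for one implicit-feedback entry, with the gradient comparison written as comments (Python with NumPy; written directly from the formulas above, with a small eps added for numerical safety):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(u_x, v_i, r_xi, eps=1e-12):
    """-[ r log(sigma(u^T v)) + (1 - r) log(1 - sigma(u^T v)) ]."""
    a = sigmoid(u_x @ v_i)
    return -(r_xi * np.log(a + eps) + (1 - r_xi) * np.log(1 - a + eps))

# Gradients w.r.t. a = sigma(u_x^T v_i) when r_xi = 0:
#   RMSE-style: d/da (a^2)         = 2a          (grows linearly with a)
#   BCE:        d/da (-log(1 - a)) = 1 / (1 - a) (grows rapidly as a -> 1)
```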


Idea 3: Ranking Losses

• Both RMSE and BCE assume 0 entries are "dislike," not "unknown."
  • Since they are computed for individual elements $r_{xi}$.
• Idea: Let's consider the task as ranking, not elementwise prediction.
  • Given a user $x$, suppose that $r_{xi} = 1$ while $r_{xj} = 0$.
  • Then, we can safely assume that user $x$ likes $i$ more than $j$.
    • If $x$ really liked $j$, they would have watched it before $i$.
  • Let's update the model by comparing items, so that $u_x^\top v_i > u_x^\top v_j$.

Bayesian Personalized Ranking

• We may use the Bayesian personalized ranking (BPR) loss:

  $J_{\mathrm{BPR}}(U, V) = \sum_{(x,i,j)} -\log \sigma(u_x^\top v_i - u_x^\top v_j)$

• Item $j$ is randomly selected from the negative samples $\{\, j \mid r_{xj} = 0 \,\}$.
• In BPR, we don't have to compare $\sigma(u_x^\top v_i)$ and $\sigma(u_x^\top v_j)$.
  • Since $\sigma$ is a monotonically increasing function.
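A minimal sketch of one SGD step on the BPR loss for a sampled triple (Python with NumPy; the negative item j would be sampled from the items user x has not watched, and all names and constants are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_step(U, V, x, i, j, eta=0.01):
    """One SGD step on -log sigma(u_x^T v_i - u_x^T v_j), ignoring regularization."""
    s = U[x] @ (V[i] - V[j])           # score gap between positive i and negative j
    g = 1.0 - sigmoid(s)               # gradient signal; small when i already ranks above j
    u_old = U[x].copy()
    U[x] += eta * g * (V[i] - V[j])    # move u_x toward v_i and away from v_j
    V[i] += eta * g * u_old            # raise the score of the watched item i
    V[j] -= eta * g * u_old            # lower the score of the unwatched item j
```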

Summary
1. UV Decomposition
• Latent factor models
2. UV Decomposition: Computation
• Overfitting and regularization
• Stochastic gradient descent
3. UV Decomposition: Variants
• Modeling biases
• Dealing with implicit feedback
• BCE and BPR losses

