PROJECT REPORT:
MOVIE RECOMMENDATION SYSTEM
Ha Noi 2023
Semester 2022.2 Course: Machine Learning
Abstract
In the era of Industry 4.0, recommender systems play an important role in numerous sectors. They are widely used in social media, e-commerce, and other online platforms to enhance the user experience by providing personalized recommendations based on each user's preferences, behavior, and previous interactions. The goal of this project is to explore two common approaches to movie recommendation: collaborative filtering and content-based filtering. We also investigate several variants of collaborative filtering, such as matrix factorization based on Singular Value Decomposition and on Gradient Descent. Finally, we discuss how the performance of these approaches was assessed, using a variety of evaluation metrics chosen with recommender systems in mind.
Contents
1 Introduction to Movie Recommendation System
1.1 Short description
1.2 Input and output
2 Dataset
2.1 About the data
2.2 Data pre-processing
2.3 Data exploration
3 Content-Based Filtering
3.1 Introduction to Content-Based Filtering
3.2 Content-Based Filtering Implementation
3.2.1 Utility Matrix
3.2.2 Items Profiles
3.2.3 Loss Function
4 Collaborative Filtering
4.1 Neighborhood-based Collaborative Filtering
4.2 Matrix Factorization Collaborative Filtering
4.2.1 Introduction to MF Collaborative Filtering
4.2.2 Matrix Factorization Implementation
7 References
8 Work distribution
9 Appendix
9.1 RMSE
9.2 MAE
9.3 Precision, Recall and F1
9.4 Normalized Discounted Cumulative Gain (NDCG)
9.5 Mean Average Precision (MAP)
When building this system, we have to deal with a lot of data describing the features of each movie (actors, rating, genre, ...) as well as users' past choices. The input may therefore be a string (actor name, genre, ...), a real value (rating, year, ...), or a boolean (like/dislike, ...).
2 Dataset
The primary dataset used in this project is MovieLens, created by GroupLens. It comes in two versions: the full version is a collection of 26 million ratings of over 45,000 movies by 270,000 users. Due to limited computing power, this project mainly uses the smaller version, which contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.
• users.dat: contains user information regarding User ID, Gender, Age, Occupation, and
Zip-code.
• movies.dat: contains movie information regarding Movie ID, Title, and Genres.
We divided the dataset into training and test sets on the basis of UserID, using 90% of each user's ratings for training the model and the remaining 10% for validating its accuracy. We accomplish this with the train_test_split function from the scikit-learn package; its stratify parameter specifies the feature on which the split between training and testing is stratified.
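As an illustration, the following is a minimal sketch of this per-user split, assuming the ratings have been loaded into a pandas DataFrame; the file path and column names follow the MovieLens ratings.dat convention and are our own choices, not necessarily those of the original code.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the MovieLens 1M ratings (ratings.dat uses "::" as separator).
ratings = pd.read_csv(
    "ml-1m/ratings.dat", sep="::", engine="python",
    names=["UserID", "MovieID", "Rating", "Timestamp"],
)

# 90/10 split stratified on UserID, so roughly 10% of each user's
# ratings end up in the test set.
train_ratings, test_ratings = train_test_split(
    ratings, test_size=0.1, random_state=42, stratify=ratings["UserID"]
)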
Due to unintentional duplicate entries and test entries, we discovered that several MovieIDs in this collection do not correspond to actual movies. To preserve consistency, we decided to treat these erroneous entries as movies with no ratings.
• Distribution of ratings
Ratings are concentrated toward the higher end (3, 4, or 5); the average rating is 3.58.
3 Content-Based Filtering
• Because the recommendations are tailored to an individual, the model does not require any information about other users. This makes it easier to scale to a large number of users.
• The model can capture a user's individual preferences and recommend niche items that only a few other users are interested in.
• New items can be recommended before they have been rated by a large number of users, in contrast to collaborative filtering.
In Recommender Systems, building the Utility Matrix is paramount. There are two main entities in a recommendation system: users and items. Each user has a degree of preference for each item, and if this interest level is known in advance, it is assigned a value for the corresponding user-item pair. Assuming that interest is measured by the rating a user gives an item, the collection of all ratings, including the unknown values that need to be predicted, forms a matrix called the Utility Matrix.
• The set of actors in the movie. Some viewers prefer movies with their favorite actors.
• The director. Some viewers have a preference for the work of certain directors.
• The year in which the movie was made. Some viewers prefer old movies; others watch
only the latest releases.
• The genre or general type of movie. Some viewers like only comedies, others dramas or
romances.
As computers can only process numeric data, generating these features is a crucial step for producing high-quality recommendations. In our project, we use binary genre indicators for each movie, where 1 represents the presence of a genre and 0 its absence. We then construct the feature vectors with TF-IDF (Term Frequency-Inverse Document Frequency).
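The sketch below shows one possible way to turn the pipe-separated genre strings from movies.dat into binary indicators and then into TF-IDF-weighted item profiles; the use of scikit-learn's TfidfTransformer and the variable names are our own assumptions.

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# Load movie metadata (movies.dat stores titles in latin-1).
movies = pd.read_csv(
    "ml-1m/movies.dat", sep="::", engine="python", encoding="latin-1",
    names=["MovieID", "Title", "Genres"],
)

# Binary genre matrix: one column per genre, 1 if the movie has that genre.
genre_binary = movies["Genres"].str.get_dummies(sep="|")

# TF-IDF weighting: genres that appear in few movies get larger weights.
tfidf = TfidfTransformer()
item_profiles = tfidf.fit_transform(genre_binary.values)  # shape (n_movies, n_genres)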
We construct a linear model following the Ridge Regression model. Assume that we have N users, M items, and a utility matrix Y. Furthermore, R is the rated-or-not matrix: $r_{mn}$ equals 1 if item m has been rated by user n, and 0 otherwise.
Linear model:
Assume that we find a model for each user, described by a weight vector $w_n$ and a bias $b_n$, so that the interest of the n-th user in the m-th item is given by a linear function:
$y_{mn} = x_m w_n + b_n$   (1)
Restricting attention to the items user n has actually rated (their profiles stacked in $\hat{X}_n$, their ratings in $\hat{y}_n$, with $s_n$ the number of such ratings and $e_n$ a vector of ones), the regularized loss for user n is
$\mathcal{L}_n = \frac{1}{2s_n} \left\| \hat{X}_n w_n + b_n e_n - \hat{y}_n \right\|_2^2 + \frac{\lambda}{2s_n} \left\| w_n \right\|_2^2$   (3)
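Minimizing (3) for each user amounts to fitting an L2-regularized linear regression on the profiles of that user's rated items. Below is a minimal per-user sketch using scikit-learn's Ridge regressor as a stand-in for this minimization; the variable names (item_profiles, train_ratings, movie_index) continue the earlier sketches and are assumptions on our part.

import numpy as np
from sklearn.linear_model import Ridge

def fit_user_model(user_id, train_ratings, item_profiles, movie_index, lam=10.0):
    """Fit w_n and b_n for one user from the items that user has rated."""
    rated = train_ratings[train_ratings["UserID"] == user_id]
    rows = rated["MovieID"].map(movie_index).values     # row positions of rated items
    X_hat = item_profiles[rows]                          # profiles of the rated items
    y_hat = rated["Rating"].values
    model = Ridge(alpha=lam)                             # L2 penalty plays the role of lambda in (3)
    model.fit(X_hat, y_hat)
    return model                                         # model.coef_ ~ w_n, model.intercept_ ~ b_n

# movie_index maps MovieID -> row index of item_profiles, e.g.
# movie_index = {mid: i for i, mid in enumerate(movies["MovieID"])}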
4 Collaborative Filtering
A collaborative filtering (CF) system provides recommendations based on similarity between users and between movies: the algorithm suggests items that similar users prefer. The benefits of collaborative filtering are numerous:
• Since users provide explicit ratings in CF, the assessment of item quality reflects actual user opinions.

4.1 Neighborhood-based Collaborative Filtering
A neighborhood-based CF system produces a prediction for the active user in three steps:
1. Give each user a weight based on how similar they are to the active user.
2. Select the k users most similar to the active user as the neighborhood.
3. Create a forecast using a weighted average of the ratings of the chosen neighbors.
• Similarity metrics
1. In the first stage, we must determine the similarity between two users. The Utility matrix (which contains all user ratings) is the only information we have, so we compare the two users' columns in this matrix to see how similar they are.
The weight sim(u_i, u_j) is used as a measure of how similar users $u_i$ and $u_j$ are. The Pearson correlation coefficient between the ratings of the two users is the most commonly used similarity measure and is defined as follows:
$\mathrm{sim}(u_i, u_j) = \dfrac{\sum_{m \in M} (r_{i,m} - \bar{r}_i)(r_{j,m} - \bar{r}_j)}{\sqrt{\sum_{m \in M} (r_{i,m} - \bar{r}_i)^2 \sum_{m \in M} (r_{j,m} - \bar{r}_j)^2}}$   (4)
where M is the set of movies that both users rated, $r_{i,m}$ is the rating user i gave to movie m, and $\bar{r}_i$ is the mean rating of user i.
2. In the next step, the prediction is computed as the weighted mean of the neighbors' deviations from their own mean ratings:
$p_{i,m} = \bar{r}_i + \dfrac{\sum_{j \in K} (r_{j,m} - \bar{r}_j)\,\mathrm{sim}(u_i, u_j)}{\sum_{j \in K} |\mathrm{sim}(u_i, u_j)|}$   (5)
where $\mathrm{sim}(u_i, u_j)$ is the degree of similarity between users i and j, $p_{i,m}$ is the prediction for the active user i for item m, and K is the neighborhood of most similar users. A minimal sketch of these two steps is given below.
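The following sketch implements (4) and (5) directly, assuming the utility matrix is a dense numpy array R of shape (n_users, n_items) in which 0 marks a missing rating; that masking convention and the function name are our own choices.

import numpy as np

def predict_rating(R, i, m, k=20):
    """User-user CF: predict user i's rating for item m via Pearson-weighted neighbors."""
    rated_i = R[i] > 0
    mean_i = R[i][rated_i].mean()

    sims, devs = [], []
    for j in range(R.shape[0]):
        if j == i or R[j, m] == 0:
            continue
        common = rated_i & (R[j] > 0)            # movies rated by both users
        if common.sum() < 2:
            continue
        mean_j = R[j][R[j] > 0].mean()
        a = R[i][common] - mean_i
        b = R[j][common] - mean_j
        denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
        if denom == 0:
            continue
        sims.append(a @ b / denom)               # Pearson similarity, eq. (4)
        devs.append(R[j, m] - mean_j)

    if not sims:
        return mean_i
    sims, devs = np.array(sims), np.array(devs)
    top = np.argsort(-np.abs(sims))[:k]          # keep the k most similar neighbors
    return mean_i + (devs[top] * sims[top]).sum() / np.abs(sims[top]).sum()   # eq. (5)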
Matrix factorization decomposes the user-item matrix into the product of an item feature matrix (M×K) and a user feature matrix (K×N), where M is the number of items, N is the number of users, and K is the number of latent features.
• Latent Features
In real life, each user's preferences are connected to other users' preferences through shared features; similarly, some movies share features with other movies. The problem is that we can only identify a limited number of explicit features, while many other connections remain unknown. Matrix factorization handles this very well.
The main idea behind matrix factorization is to represent users and items in a lower-dimensional latent space. That is, we express the original matrix as the product of two lower-dimensional matrices: a user matrix, which captures the relationships between users, and an item matrix, which captures the relationships between movies, through K latent features. We call them latent features because we do not know exactly what the K features are; we only know that they help express the relationships among users and among movies. Matrix factorization therefore identifies the connections between users and between movies from the known ratings and uses them to estimate the unknown ratings, which are then turned into recommendations. If an item has a high coefficient on a latent feature on which the user also has a high coefficient, the corresponding entry of the user-item matrix will be high; in other words, the item has a latent feature the user likes and will be suggested to that user.
• Memory Reduction
Matrix factorization decomposes the user-item matrix into two lower-dimensional matrices. This makes inference simpler, since we only need to take the product of two vectors of length K, which is much smaller than M and N, and storing the two matrices (item and user) requires less memory than storing the full user-item matrix. As a result, matrix factorization is found to be one of the most effective approaches for large and highly sparse datasets.
Figure 6: In this example, using the original matrix for recommendations would require storing 2 million entries. After decomposition, we only need to store two matrices, one with 100K entries and one with 200K entries, i.e., 300K entries in total.
There are many variations of the matrix factorization implementation. Because of the project's time limit, our team chose to implement two of them:
1. Matrix factorization with Gradient Descent.
2. Matrix factorization with Truncated Singular Value Decomposition.
In the gradient descent variant, we initialize the user and movie matrices randomly. At each iteration, we reduce the value of the loss function by updating the coefficients of the user and movie matrices.
• Loss Function
The loss minimizes the regularized squared error over the known ratings:
$\mathcal{L}(X, W) = \dfrac{1}{2s} \left\| R \odot (Y - XW) \right\|_F^2 + \dfrac{\lambda}{2} \left( \|X\|_F^2 + \|W\|_F^2 \right)$
where X is the item matrix, W is the user matrix, and $r_{mn} = 1$ if movie m is rated by user n. The notation $\| \cdot \|_F$ denotes the Frobenius norm, i.e., the square root of the sum of the squares of all entries of the matrix, and s is the number of known ratings.
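A minimal numpy sketch of this gradient descent procedure is given below; the learning rate, initialization scale, and function name are illustrative assumptions rather than the exact settings used in our experiments.

import numpy as np

def mf_gradient_descent(Y, R, K=10, lam=0.1, lr=0.01, n_iters=30, seed=0):
    """Factorize the MxN utility matrix Y (R marks known entries) as X (MxK) @ W (KxN)."""
    rng = np.random.default_rng(seed)
    M, N = Y.shape
    X = rng.normal(scale=0.1, size=(M, K))    # item matrix
    W = rng.normal(scale=0.1, size=(K, N))    # user matrix
    s = R.sum()                                # number of known ratings

    for _ in range(n_iters):
        E = R * (X @ W - Y)                    # error on the known ratings only
        grad_X = (E @ W.T) / s + lam * X       # gradient of the regularized loss w.r.t. X
        grad_W = (X.T @ E) / s + lam * W       # gradient w.r.t. W
        X -= lr * grad_X
        W -= lr * grad_W
    return X, W

# Predicted ratings for every user-item pair: Y_hat = X @ W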
• Singular Value Decomposition
SVD factorizes a rank-r matrix A as
$A = U \Sigma V^T$
where the columns of U and V are orthonormal and the matrix Σ is diagonal with positive real entries whose i-th diagonal entry equals the i-th singular value $\sigma_i$ for i = 1, ..., r; all other entries of Σ are zero.
• Truncated SVD
Truncated SVD is a lower-rank approximation: we use a low-rank matrix to approximate the original matrix. In the diagonal matrix Σ, the singular values $\sigma_i$ are non-negative and decreasing, $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \dots \ge \sigma_r \ge 0$, and they show how important the corresponding columns of U and rows of V are. Thus $U_1$ is more important than $U_2$, $U_2$ is more important than $U_3$, and so on; the same holds for $V_1, V_2, \dots$.
The word "important" here refers to how much a column or row contributes to the original matrix. The original matrix can be expressed as a sum of rank-1 matrices; moving to the right in this sum, the singular values decrease, so each term contributes less to the original matrix, and for the last terms $\sigma \approx 0$. Truncation means keeping only the r most important terms (with r < rank(A)), counted from the left, and discarding all terms after the r-th. By using the resulting lower-rank matrix, we still retain almost all of the important information about the original matrix.
• Matrix factorization with Truncated SVD
From the original ratings matrix, we use truncated SVD to generate a rank-r approximation that still retains almost all of the information, and then use it to make recommendations.
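One way to sketch this step uses scipy's sparse truncated SVD; here the unknown entries are simply filled with each user's mean rating before the decomposition, and that imputation choice, the default rank r = 10, and the variable names are our own assumptions.

import numpy as np
from scipy.sparse.linalg import svds

def svd_predict(Y, R, r=10):
    """Rank-r reconstruction of the MxN utility matrix Y (R marks known entries)."""
    # Fill unknown ratings with each user's mean so the SVD sees a complete matrix.
    counts = np.maximum(R.sum(axis=0), 1)                 # ratings per user (column)
    user_means = Y.sum(axis=0) / counts
    Y_filled = np.where(R > 0, Y, user_means)             # broadcast the means column-wise

    U, sigma, Vt = svds(Y_filled.astype(float), k=r)      # keep the r largest singular values
    Y_hat = U @ np.diag(sigma) @ Vt                       # rank-r approximation of Y
    return Y_hat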
Figure 8: RMSE and MAE as functions of λ. Figure 9: Precision, Recall, and F1 as functions of λ.
The error curve is fairly flat: the error remains stable as λ increases. This means the error is already near its minimum when the weights are close to zero, i.e., when the predicted ratings for each user converge to a single value. This may be caused by the skewness of the rating distribution: the majority of ratings are 3, 4, or 5, so the variance of each user's ratings is small, and the optimal solution may simply be to predict the same value for every item of a given user.
Based on the findings in the graph, we choose λ = 10 as the best option. We then examine our model with λ = 10 to determine how many movies we should suggest to the user in order to get the best result. In light of the outcome shown in the following graphs, we determined that 30 movies is the optimal choice.
Figure 10: MAP and NDCG as functions of the number of recommended movies. Figure 11: Precision, Recall, and F1 as functions of the number of recommended movies.
Figure 12: RMSE and MAE as functions of the number of neighbors. Figure 13: Precision, Recall, and F1 as functions of the number of neighbors.
Based on the results in the graph, we decide that 20 neighbors is the best option. To obtain the best outcome, we then analyze our model with 20 neighbors to decide how many movies we should recommend to the user.
Figure 14: MAP and NDCG as functions of the number of recommended movies. Figure 15: Precision, Recall, and F1 as functions of the number of recommended movies.
The results in this graph led us to conclude that the ranking metrics remain stable from 30 films onward. Hence, 30 is the best choice for the number of recommended movies.
Figure 16: RMSE and MAE as functions of the number of neighbors. Figure 17: Precision, Recall, and F1 as functions of the number of neighbors.
We determine that 50 neighbors is the best choice based on the results shown in the graph. To select how many movies to suggest to the user, we then assess our model with 50 neighbors.
Figure 18: MAP and NDCG as functions of the number of recommended movies. Figure 19: Precision, Recall, and F1 as functions of the number of recommended movies.
We deduced from the graph that the ranking measures are stable beyond 30 films. Therefore, the optimal number of recommended movies is 30.
We fix the number of latent features at K = 10 and the regularization weight at λ = 0.1, and then tune the number of Gradient Descent iterations. We conclude from the graph that RMSE and MAE remain stable after 30 iterations, so we evaluate our model with 30 iterations when determining how many movies to recommend to the user. Based on the results in the corresponding graph, 30 movies is again the ideal selection.
Figure 21: MAP and NDCG as functions of the number of recommended movies. Figure 22: Precision, Recall, and F1 as functions of the number of recommended movies.
Based on the findings in the graph, we determine that rank r = 10 is the most suitable choice. To get optimal results, we then examine our model with r = 10 to determine how many films we need to recommend to the user. The graph shows that the ranking metrics remain steady at 30 movies.
Figure 25: MAP and NDCG as functions of the number of recommended movies. Figure 26: Precision, Recall, and F1 as functions of the number of recommended movies.
5.5 Summary
After these experiments, we find that the most suitable number of recommendations is 30. In conclusion, the results of all models tuned with their optimal hyperparameters are shown in the table below.
Based on the table, we can infer that the item-item model delivers the most accurate predictions, but the efficiency of its suggestions is low. Matrix factorization using SVD offers somewhat lower accuracy but greater efficiency. Overall, we conclude that the matrix factorization approach with SVD provides the most efficient recommendations, and the ordering of its recommendations is quite good.
7 References
1. F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
2. Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4), 422-446.
3. Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, and Tie-Yan Liu. 2013. A Theoretical Analysis of Normalized Discounted Cumulative Gain (NDCG) Ranking Measures. In Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013).
6. Recommendation systems, https://machinelearningcoban.com
8 Work distribution
1. Proposal: Lâm Anh
9. Report:
11. Presentation
9 Appendix
9.1 RMSE
Root Mean Square Error (RMSE) is a common metric in prediction and recommendation problems, used to measure the average error between actual and predicted values. Here, the RMSE is determined by comparing the actual rating with the rating predicted by the model. RMSE in the movie recommendation problem is computed as follows:
$RMSE = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \hat{x}_i)^2}{N}}$   (11)
where N is the number of data observed, xi is the actual value, x̂i is the predicted value.
9.2 MAE
Mean Absolute Error (MAE) is a common metric in prediction and recommendation problems, used to measure the average absolute difference between actual and predicted values. In this case, the MAE is calculated from the difference between the actual rating and the rating predicted by the model. The formula for the MAE in the movie recommendation problem is:
$MAE = \dfrac{\sum_{i=1}^{N} |x_i - \hat{x}_i|}{N}$   (12)
where N is the number of data observed, xi is the actual value, x̂i is the predicted value.
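For reference, a small helper computing both metrics from arrays of actual and predicted ratings (the function and variable names are illustrative):

import numpy as np

def rmse(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))    # eq. (11)

def mae(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs(actual - predicted))             # eq. (12)

# Example: rmse([4, 5, 3], [3.5, 4.8, 3.2]) ~ 0.33 and mae([4, 5, 3], [3.5, 4.8, 3.2]) = 0.3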
9.3 Precision, Recall and F1
Precision: Precision measures the fraction of recommended movies that are actually relevant to the user:
$Precision = \dfrac{TP}{TP + FP}$   (13)
Recall: Recall is a crucial metric for assessing how well the model identifies all of the movies appropriate for the user. Recall focuses on finding as many appropriate movies as possible and avoiding missing important ones. To calculate Recall, we need the following values: True Positive (TP), the number of movies suggested by the model that are actually relevant to the user, and False Negative (FN), the number of movies that match the user but are not suggested by the model.
The formula for calculating Recall is:
$Recall = \dfrac{TP}{TP + FN}$   (14)
Where:
1. True Positive (TP): Number of movies suggested by the model and actually relevant to the user.
2. False Positive (FP): Number of movies suggested by the model but not suitable for the user.
3. False Negative (FN): Number of movies that match the user but are not suggested by the model.
F1-score: The F1-score combines Precision and Recall into a single statistic to assess how well the model balances accuracy and coverage. It helps evaluate the overall effectiveness of the movie recommendation model, ensuring that the model achieves both high accuracy and high coverage.
The formula for calculating F1-score is:
$F1\text{-}score = \dfrac{2 \cdot Precision \cdot Recall}{Precision + Recall}$   (15)
Note, however, that the F1-score is a composite metric and only gives an overview of performance; Precision and Recall should still be examined individually to obtain a more detailed view of the model's behavior.
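In our evaluation these quantities are computed on a top-K recommendation list. The sketch below shows one way to do this for a single user; representing the recommendations as a ranked list and the relevant movies as a set is our own choice.

def precision_recall_f1_at_k(recommended, relevant, k):
    """recommended: ranked list of movie IDs; relevant: set of movie IDs relevant to the user."""
    top_k = set(recommended[:k])
    tp = len(top_k & relevant)                         # recommended and relevant
    precision = tp / k if k else 0.0                   # eq. (13) restricted to the top-K list
    recall = tp / len(relevant) if relevant else 0.0   # eq. (14)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)          # eq. (15)
    return precision, recall, f1

# Example: precision_recall_f1_at_k([10, 4, 7, 1], {4, 1, 99}, k=3) -> (1/3, 1/3, 1/3)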
9.4 Normalized Discounted Cumulative Gain (NDCG)
Calculate Discounted Cumulative Gain (DCG): DCG accumulates the relevance of the recommended movies, discounted by their position in the list:
$DCG_r = \sum_{i=1}^{r} \dfrac{rel_i}{\log_2(i + 1)}$   (16)
where $rel_i$ is the actual rating of the movie at position i in the recommended list.
To place a stronger emphasis on retrieving relevant documents, we utilize an alternative formulation of DCG:
$DCG_r = \sum_{i=1}^{r} \dfrac{2^{rel_i} - 1}{\log_2(i + 1)}$   (17)
Calculate Ideal Discounted Cumulative Gain (IDCG): IDCG is the ideal DCG value calculated from a list
of actual ratings sorted in descending order of rating. The IDCG is the maximum DCG value that can be
obtained for an evaluation list.
Calculate Normalized Discounted Cumulative Gain (NDCG): To make DCGs directly comparable between
users, we need to normalize them. We divide the raw DCG by this ideal DCG to get NDCG, a number between
0 and 1.
$NDCG = \dfrac{DCG}{IDCG}$   (18)
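A small sketch of DCG and NDCG using the exponential gain from (17); it assumes the input is the list of actual ratings of the recommended movies, in recommended order, and the function names are ours.

import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(relevances) + 1)
    return np.sum((2 ** relevances - 1) / np.log2(positions + 1))   # eq. (17)

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))                   # IDCG: best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0            # eq. (18)

# Example: ndcg([3, 5, 4]) < 1 because the highest-rated movie is not ranked first.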
Average Precision@K (AP@K): The Average Precision@K is the sum of Precision@k over the ranks k at which the recommended item is relevant, divided by the total number of relevant items r in the top K recommendations:
$AP@K = \dfrac{1}{r} \sum_{k=1}^{K} P@k \cdot rel(k)$   (20)
$rel(k) = \begin{cases} 1, & \text{if the item at rank } k \text{ is relevant} \\ 0, & \text{otherwise} \end{cases}$   (21)
Average Precision@K is higher when the most relevant recommendations appear at the first ranks; hence, AP@K reflects the quality of the top recommendations.
Mean Average Precision@K (MAP@K): MAP@K is the mean value of AP@K over all users:
$MAP@K = \dfrac{1}{M} \sum_{i=1}^{M} AP@K_i$   (22)
where M here denotes the number of users.
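The sketch below computes AP@K and MAP@K under the same list-and-set representation used earlier; the normalization by the number of relevant items (capped at K) follows (20), and the function names are our own.

def average_precision_at_k(recommended, relevant, k):
    """AP@K as in eq. (20): average of P@k over the ranks holding relevant items."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank          # P@rank when the item at this rank is relevant
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0

def mean_average_precision_at_k(all_recommended, all_relevant, k):
    """MAP@K as in eq. (22): mean of AP@K over all users."""
    scores = [average_precision_at_k(rec, rel, k)
              for rec, rel in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores) if scores else 0.0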