Movie Recommendation System Using R
GROUP PROJECT
BACHELOR OF ENGINEERING IN COMPUTER SCIENCE ENGINEERING
Preface
The R Markdown code used to generate this report and the PDF version are available on GitHub.
The HTML version is available on RPubs.
1 Introduction
Recommendation systems play an important role in e-commerce and online streaming services,
such as Netflix, YouTube and Amazon. Making the right recommendation for the next product,
song or movie increases user retention and satisfaction, leading to sales and profit growth.
Companies competing for customer loyalty invest in systems that capture and analyze the
user’s preferences, and offer products or services with a higher likelihood of purchase.
The economic impact of this company-customer relationship is clear: Amazon is the largest
online retail company by sales, and part of its success comes from its recommendation system
and marketing based on user preferences. In 2006, Netflix offered a one-million-dollar prize [2] for
the person or group that could improve its recommendation system by at least 10%.
Recommendation systems are usually based on a rating scale from 1 to 5 grades or stars, with 1
indicating the lowest satisfaction and 5 the highest. Other indicators can also be used as
predictors, such as comments posted on previously used items; videos, music or links shared with
friends; the percentage of a movie watched or a song listened to; web pages visited and time spent
on each page; product category; and any other interaction with the company’s web site or
application.
The primary goal of a recommendation system is to help users find what they want based on their
preferences and previous interactions, and to predict the rating for a new item. In this document,
we create a movie recommendation system using the MovieLens dataset, applying the
lessons learned during HarvardX’s Data Science Professional Certificate [3] program.
This document is structured as follows. Chapter 1 describes the dataset and summarizes the
goal of the project and key steps that were performed. In chapter 2 we explain the process and
techniques used, such as data cleaning, data exploration and visualization, any insights gained,
and the modeling approach. In chapter 3 we present the modeling results and discuss the model
performance. We conclude in chapter 4 with a brief summary of the report, its limitations and
future work.
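$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|$$
where N is the number of observations, $\hat{y}_i$ is the predicted value and $y_i$ is the true value.
$$MSE = \frac{1}{N}\sum_{u,i}\left(\hat{y}_{u,i} - y_{u,i}\right)^2$$
$$RMSE = \sqrt{\frac{1}{N}\sum_{u,i}\left(\hat{y}_{u,i} - y_{u,i}\right)^2}$$
where N is the number of ratings, $y_{u,i}$ is the rating of movie i by user u and $\hat{y}_{u,i}$ is the
prediction of movie i by user u.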
Like the MSE, the RMSE penalizes large errors more heavily than small ones, so it is appropriate
when small errors are not relevant. Unlike the MSE, its error has the same unit as the
measurement.
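To make the difference between the metrics concrete, here is a minimal sketch in base R. The toy ratings and the helper names mae, mse and rmse are assumptions made for this illustration only. Both prediction sets have the same MAE, but the one with a single large error gets a higher MSE and RMSE.

```r
# Toy ratings (hypothetical values, not taken from MovieLens)
true_ratings <- c(4, 3, 5, 1)
pred_a <- c(3, 4, 4, 2)  # four small errors of size 1
pred_b <- c(4, 3, 5, 5)  # three exact hits and one error of size 4

# Hypothetical helpers implementing the three formulas above
mae  <- function(y, y_hat) mean(abs(y_hat - y))
mse  <- function(y, y_hat) mean((y_hat - y)^2)
rmse <- function(y, y_hat) sqrt(mse(y, y_hat))

# One row per prediction set, one column per metric
metrics <- rbind(
  a = c(MAE  = mae(true_ratings, pred_a),
        MSE  = mse(true_ratings, pred_a),
        RMSE = rmse(true_ratings, pred_a)),
  b = c(MAE  = mae(true_ratings, pred_b),
        MSE  = mse(true_ratings, pred_b),
        RMSE = rmse(true_ratings, pred_b))
)
print(metrics)
```

The MAE column is identical for both rows, while the MSE and RMSE columns separate the two cases.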
1. Data preparation: download, parse, import and prepare the data to be processed and
analysed.
2. Data exploration and visualization: explore data to understand the features and the
relationship between the features and the outcome.
3. Data cleaning: the dataset may contain unnecessary information that needs to be
removed.
4. Data analysis and modeling: create the model using the insights gained during
exploration. Also test and validate the model.
5. Communicate: create the report and publish the results.
First we download the dataset from the MovieLens website and split it into two subsets used for
training and validation. The training subset is called edx and the validation subset is
called validation. The edx set is split again into two subsets used for training and testing.
When the model reaches the RMSE target on the testing set, we train the model on the edx set
and use the validation set for final validation. We pretend the validation set is new data with
unknown outcomes.
In the next step we create charts, tables and summary statistics to understand how the features
can impact the outcome. The information and insights obtained during exploration will help to
build the machine learning model.
Creating a recommendation system involves identifying the most important features that
help to predict the rating any given user will give to any movie. We start by building a very simple
model, which is just the mean of the observed values. Then, the user and movie effects are
included in the linear model, improving the RMSE. Finally, the user and movie effects receive
a regularization parameter that penalizes estimates based on few ratings.
Although the linear model with regularization achieves the desired RMSE, matrix factorization
using the LIBMF [8] algorithm is also evaluated and provides a better prediction. LIBMF is available
through the R package recosystem [9].
# Install the required packages if they are missing
if(!require(tidyverse))
  install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret))
  install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(data.table))
  install.packages("data.table", repos = "http://cran.us.r-project.org")
# Download the MovieLens 10M dataset to a temporary file
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
# Read the ratings file, converting the "::" separator to tabs
ratings <- fread(text = gsub("::", "\t",
                             readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
                 col.names = c("userId", "movieId", "rating", "timestamp"))
# Make sure userId and movieId in 'validation' set are also in 'edx' set
validation <- temp %>%
semi_join(edx, by = "movieId") %>%
semi_join(edx, by = "userId")
# Add rows removed from 'validation' set back into 'edx' set
removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)
The edx set is used for training and testing, and the validation set is used for final validation to
simulate the new data.
Here, we split the edx set in 2 parts: the training set and the test set.
The model building is done in the training set, and the test set is used to test the model. When
the model is complete, we use the validation set to calculate the final RMSE.
We use the same procedure used to create edx and validation sets.
The training set will be 90% of edx data and the test set will be the remaining 10%.
set.seed(1, sample.kind="Rounding")
test_index <- createDataPartition(y = edx$rating, times = 1, p = 0.1, list
= FALSE)
train_set <- edx[-test_index,]
temp <- edx[test_index,]
# Make sure userId and movieId in test set are also in train set
test_set <- temp %>%
semi_join(train_set, by = "movieId") %>%
semi_join(train_set, by = "userId")
# Add rows removed from test set back into train set
removed <- anti_join(temp, test_set)
train_set <- rbind(train_set, removed)
str(edx)
## 'data.frame': 9000055 obs. of 6 variables:
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ movieId : num 122 185 292 316 329 355 356 362 364 370 ...
## $ rating : num 5 5 5 5 5 5 5 5 5 5 ...
## $ timestamp: int 838985046 838983525 838983421 838983392 838983392 838984474 838983653 838984885 838983707 838984596 ...
## $ title : chr "Boomerang (1992)" "Net, The (1995)" "Outbreak (1995)" "Stargate (1994)" ...
## $ genres : chr "Comedy|Romance" "Action|Crime|Thriller" "Action|Drama|Sci-Fi|Thriller" "Action|Adventure|Sci-Fi" ...
movieId integer
userId integer
rating numeric
timestamp numeric
title character
genres character
How many rows and columns are there in the edx dataset?
dim(edx)
## [1] 9000055 6
The next table shows the structure and content of the edx dataset.
The dataset is in tidy format, i.e. each row holds one observation and each column is a
feature. The rating column is the desired outcome. The user information is stored in userId;
the movie information is in both the movieId and title columns. The rating date is available
in timestamp, measured in seconds since January 1st, 1970. Each movie is tagged with one or
more genres in the genres column.
head(edx)
The next sections explore each feature and the outcome in more detail.
2.2.1 Genres
Along with the movie title, MovieLens provides the list of genres for each movie. Although this
information can be used to make better predictions, this research doesn’t use it. However it’s
worth exploring this information as well.
The data set contains 797 different combinations of genres. Here is the list of the first six.
Genres
Action
Action|Adventure
Action|Adventure|Animation|Children|Comedy
Action|Adventure|Animation|Children|Comedy|Fantasy
Action|Adventure|Animation|Children|Comedy|IMAX
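The table above shows that several movies are classified in more than one genre. The genre
combinations with the most genres are listed in this table, sorted in descending order. The count
column appears to hold the number of separators, so each value is one less than the number of
genres listed in that row.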
count genres
7 Action|Adventure|Comedy|Drama|Fantasy|Horror|Sci-Fi|Thriller
6 Adventure|Animation|Children|Comedy|Crime|Fantasy|Mystery
6 Adventure|Animation|Children|Comedy|Drama|Fantasy|Mystery
6 Adventure|Animation|Children|Comedy|Fantasy|Musical|Romance
5 Action|Adventure|Animation|Children|Comedy|Fantasy
5 Action|Adventure|Animation|Children|Comedy|IMAX
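In the table above, each count is one less than the number of genres in its row, which is what counting the "|" separators would produce. The following base R sketch shows both ways of counting on three genre strings copied from this section; the variable names are assumptions made for illustration, and this is not necessarily how the table was generated.

```r
# Genre strings copied from the tables in this section
genres <- c("Action",
            "Action|Adventure",
            "Action|Adventure|Animation|Children|Comedy|IMAX")

# Split on the literal "|" and count the pieces of each string
n_genres <- lengths(strsplit(genres, "|", fixed = TRUE))

# Counting the separators instead gives one less than the number of genres
n_separators <- lengths(regmatches(genres,
                                   gregexpr("|", genres, fixed = TRUE)))

print(data.frame(genres, n_genres, n_separators))
```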
2.2.2 Date
The ratings were collected over a period of almost 14 years.
library(lubridate)
tibble(`Initial Date` = date(as_datetime(min(edx$timestamp), origin = "1970-01-01")),
       `Final Date` = date(as_datetime(max(edx$timestamp), origin = "1970-01-01"))) %>%
  mutate(Period = duration(max(edx$timestamp) - min(edx$timestamp)))
if(!require(ggthemes))
  install.packages("ggthemes", repos = "http://cran.us.r-project.org")
if(!require(scales))
  install.packages("scales", repos = "http://cran.us.r-project.org")
edx %>% mutate(year = year(as_datetime(timestamp, origin = "1970-01-01"))) %>%
  ggplot(aes(x = year)) +
  geom_histogram(color = "white") +
  ggtitle("Rating Distribution Per Year") +
  xlab("Year") +
  ylab("Number of Ratings") +
  scale_y_continuous(labels = comma) +
  theme_economist()
Rating distribution per year.
The following table lists the days with the most ratings. Not surprisingly, the movies are
well-known blockbusters.
Date title
1999-12-11 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)
2005-03-22 Lord of the Rings: The Fellowship of the Ring, The (2001)
2.2.3 Ratings
Users can choose a rating value from 0.5 to 5.0, totaling 10 possible values. This
is an unusual scale, and most movies get a whole-number rating, as shown in the chart below.
Count the number of each rating:
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
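As a small illustration of how such a count can be produced, the sketch below tabulates a toy vector of ratings in base R and computes the share of whole-number ratings. The values are hypothetical and are not taken from the edx set.

```r
# Toy ratings (hypothetical values on the 0.5 to 5.0 scale)
ratings <- c(4, 3.5, 5, 3, 4, 4.5, 3, 4, 2, 5)

# Frequency of each rating value that occurs
rating_counts <- table(ratings)
print(rating_counts)

# Share of ratings that are whole numbers
whole_share <- mean(ratings %% 1 == 0)
print(whole_share)
```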
2.2.4 Movies
There are 10677 different movies in the edx set. We know from intuition that some of them are
rated more than others, since many movies are watched by few users and blockbusters tend to
have more ratings.
2.2.5 Users
There are 69878 different users in the edx set.
The majority of users rate few movies, while a few users rate more than a thousand movies.
5% of users rated fewer than 20 movies.
userId n
62516 10
22170 12
15719 13
50608 13
901 14
1833 14
As previously discussed, several features can be used to predict the rating for a given user. However,
many predictors increase the model complexity and require more computing resources, so in this
research the estimated rating uses only movie and user information.
2.4 Modeling
2.4.1 Random Prediction
A very simple model just randomly predicts the rating using the probability distribution observed
during the data exploration. For example, if we know that the probability of all users giving a movie a
rating of 3 is 10%, then we may guess that 10% of the ratings will have a rating of 3.
Such a prediction sets the worst error we may get, so any other model should provide a better result.
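The idea can be sketched in a few lines of base R. The toy ratings, the seed and the variable names below are assumptions made for illustration; the sketch estimates the probabilities directly from the toy training ratings rather than from the full data set.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Toy training and test ratings (hypothetical values)
train_ratings <- c(4, 3, 5, 3, 4, 4, 2, 5, 3.5, 4)
test_ratings  <- c(4, 3, 5, 2, 4)

# Observed probability of each rating value in the training data
prob   <- prop.table(table(train_ratings))
values <- as.numeric(names(prob))

# Draw one random prediction per test rating using those probabilities
y_hat_random <- sample(values, size = length(test_ratings),
                       replace = TRUE, prob = as.numeric(prob))

# RMSE of the random prediction against the true test ratings
rmse_random <- sqrt(mean((test_ratings - y_hat_random)^2))
print(rmse_random)
```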
The simplest model predicts all users will give the same rating to all movies and assumes the movie
to movie variation is the randomly distributed error. Although the predicted rating can be any value,
statistics theory says that the average minimizes the RMSE, so the initial prediction is just the
average of all observed ratings, as described in this formula:
$$\hat{Y}_{u,i} = \mu + \epsilon_{u,i}$$
where $\hat{Y}_{u,i}$ is the predicted rating, $\mu$ is the mean of the observed data and $\epsilon_{u,i}$ is the error distribution.
Any value other than the mean increases the RMSE, so this is a good initial estimation.
Part of the movie to movie variability can be explained by the fact that different movies have
different rating distributions. This is easy to understand, since some movies are more popular than
others and the public preference varies. This is called the movie effect or movie bias, and is expressed
as $b_i$ in this formula:
$$\hat{Y}_{u,i} = \mu + b_i + \epsilon_{u,i}$$
The movie effect can be estimated as the mean of the difference between the observed ratings of
movie i and the mean $\mu$:
$$\hat{b}_i = \frac{1}{n_i}\sum_{u=1}^{n_i}\left(y_{u,i} - \hat{\mu}\right)$$
where $n_i$ is the number of ratings of movie i.
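Similar to the movie effect, different users have different rating patterns or distributions. For example,
some users like most movies and consistently rate 4 or 5, while other users dislike most movies,
rating 1 or 2. This is called the user effect or user bias, $b_u$, and is expressed in this formula:
$$\hat{Y}_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}$$
The user effect is estimated from the ratings after removing the mean and the movie effect:
$$\hat{b}_u = \frac{1}{n_u}\sum_{i=1}^{n_u}\left(y_{u,i} - \hat{b}_i - \hat{\mu}\right)$$
where $n_u$ is the number of ratings given by user u.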
Movies can be grouped into categories or genres, with different distributions. In general, movies in
the same genre get similar ratings. In this project we won’t evaluate the genre effect.
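The mean, movie effect and user effect described above can be worked through on a tiny data set. The sketch below uses base R and a hypothetical table of six ratings; the data and variable names are assumptions made for illustration.

```r
# Toy ratings (hypothetical values): 3 users, 3 movies
ratings <- data.frame(
  userId  = c(1, 1, 2, 2, 3, 3),
  movieId = c(10, 20, 10, 30, 20, 30),
  rating  = c(5, 3, 4, 2, 5, 3)
)

# Overall mean rating
mu <- mean(ratings$rating)

# Movie effect: mean of (rating - mu) for each movie
b_i <- tapply(ratings$rating - mu, ratings$movieId, mean)

# Look up the movie effect of every row, then compute the user effect
# as the mean of (rating - mu - b_i) for each user
movie_bias <- as.numeric(b_i[as.character(ratings$movieId)])
b_u <- tapply(ratings$rating - mu - movie_bias, ratings$userId, mean)

# Predicted rating for every observed (user, movie) pair
user_bias <- as.numeric(b_u[as.character(ratings$userId)])
pred <- mu + movie_bias + user_bias
print(round(pred, 3))
```

In this toy data, user 3 rates above what the movie effects alone predict and therefore gets a positive user effect, while users 1 and 2 get negative ones.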
2.4.3 Regularization
The linear model provides a good estimation for the ratings, but doesn’t consider that many movies
have very few ratings, and some users rate very few movies. This means that the sample
size is very small for these movies and these users. Statistically, this leads to a large estimation error.
The estimated value can be improved by adding a factor that penalizes small sample sizes and has
little or no impact otherwise. Thus, the estimated movie and user effects can be calculated with
these formulas:
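$$\hat{b}_i = \frac{1}{n_i+\lambda}\sum_{u=1}^{n_i}\left(y_{u,i}-\hat{\mu}\right)$$
$$\hat{b}_u = \frac{1}{n_u+\lambda}\sum_{i=1}^{n_u}\left(y_{u,i}-\hat{b}_i-\hat{\mu}\right)$$
For values of $n_i$ and $n_u$ smaller than or similar to $\lambda$, $\hat{b}_i$ and $\hat{b}_u$ are smaller than the original values,
whereas for values much larger than $\lambda$, $\hat{b}_i$ and $\hat{b}_u$ change very little.
An effective method to choose the $\lambda$ that minimizes the RMSE is to run simulations with several
values of $\lambda$.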
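The shrinking behaviour can be checked numerically. The sketch below applies the regularized movie-effect formula to hypothetical ratings in base R; the data, the helper name regularized_b_i and the choice of lambda = 3 are assumptions made for illustration.

```r
# Toy ratings (hypothetical values); movie 30 has a single rating
ratings <- data.frame(
  movieId = c(10, 10, 10, 10, 20, 20, 30),
  rating  = c(4, 5, 4, 5, 3, 2, 5)
)
mu <- mean(ratings$rating)

# Regularized movie effect: sum of residuals divided by (n_i + lambda)
regularized_b_i <- function(lambda) {
  resid_sum <- tapply(ratings$rating - mu, ratings$movieId, sum)
  n_i <- tapply(ratings$rating, ratings$movieId, length)
  as.numeric(resid_sum / (n_i + lambda))
}

b_i_plain <- regularized_b_i(0)  # lambda = 0 gives the plain average
b_i_reg   <- regularized_b_i(3)  # lambda = 3 is an arbitrary choice

print(round(rbind(plain = b_i_plain, regularized = b_i_reg), 3))
```

The movie with four ratings keeps more than half of its effect, while the movie with a single rating keeps only a quarter of it.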
Matrix factorization is a widely used machine learning tool for predicting ratings in recommendation
systems. This method became widely known during the Netflix Prize challenge [10].
The data can be converted into a matrix such that each user is in a row, each movie is in a column
and the rating is in the cell; the algorithm then attempts to fill in the missing values. The table below
provides a simple example of a 4×5 matrix.
         movie 1  movie 2  movie 3  movie 4  movie 5
user 1      ?        ?        4        ?        3
user 2      2        ?        ?        4        ?
user 3      ?        3        ?        ?        5
user 4      3        ?        2        ?        ?
The concept is to approximate a large rating matrix $R_{m\times n}$ by the product of two lower-dimension
matrices $P_{k\times m}$ and $Q_{k\times n}$, such that
$$R \approx P'Q$$
The R recosystem package provides methods to decompose the rating matrix and estimate the user
rating, using parallel matrix factorization.
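The approximation of R by P′Q can be illustrated without the recosystem package. The following is a minimal sketch using plain gradient descent in base R on the 4×5 example matrix. It is not the LIBMF algorithm, and the number of factors k = 2, the learning rate, the number of iterations and the seed are assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility

# The 4 x 5 example matrix; NA marks a missing rating
R <- matrix(c(NA, NA,  4, NA,  3,
               2, NA, NA,  4, NA,
              NA,  3, NA, NA,  5,
               3, NA,  2, NA, NA),
            nrow = 4, byrow = TRUE)
observed <- !is.na(R)

k <- 2                                     # number of latent factors (assumed)
P <- matrix(runif(k * nrow(R)), nrow = k)  # k x m user factors
Q <- matrix(runif(k * ncol(R)), nrow = k)  # k x n movie factors

learning_rate <- 0.01                      # assumed step size
for (step in 1:5000) {
  E <- R - t(P) %*% Q      # residuals, NA where the rating is missing
  E[!observed] <- 0        # missing cells do not contribute to the gradient
  P_new <- P + learning_rate * Q %*% t(E)  # gradient step for P
  Q     <- Q + learning_rate * P %*% E     # gradient step for Q (uses old P)
  P     <- P_new
}

# The product fills every cell, including the ones that were missing
R_hat <- t(P) %*% Q
rmse_observed <- sqrt(mean((R - R_hat)[observed]^2))
print(round(R_hat, 2))
print(rmse_observed)
```

With only eight observed ratings the fit on the observed cells is very close, but the filled-in values depend on the random start, which is one reason practical implementations add regularization and tune their parameters.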
3 Results
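MAE <- function(true_ratings, predicted_ratings){
  mean(abs(true_ratings - predicted_ratings))
}
# Define Mean Squared Error (MSE)
MSE <- function(true_ratings, predicted_ratings){
  mean((true_ratings - predicted_ratings)^2)
}
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}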
The first model randomly predicts the ratings using the observed probabilities in the training set.
First, we calculate the probability of each rating in the training set, then we predict the rating for the
test set and compare with actual rating. Any model should be better than this one.
Since the training set is a sample of the entire population and we don’t know the real distribution of
ratings, the Monte Carlo simulation with replacement provides a good approximation of the rating
distribution.
B <- 10^3
M <- replicate(B, {
sapply(rating, p, y= s)
})
result <- tibble(Method = "Project Goal", RMSE = 0.8649, MSE = NA, MAE = NA)
result
$$\hat{y}_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}$$
$$\hat{y}_{u,i} = \mu + \epsilon_{u,i}$$
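mu <- mean(train_set$rating)
result <- bind_rows(result, tibble(Method = "Mean",
                                   RMSE = RMSE(test_set$rating, mu),
                                   MSE = MSE(test_set$rating, mu),
                                   MAE = MAE(test_set$rating, mu)))
result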
$$\hat{y}_{u,i} = \mu + b_i + \epsilon_{u,i}$$
bi <- train_set %>%
  group_by(movieId) %>%
  summarize(b_i = mean(rating - mu))
head(bi)
movieId b_i
1 0.4150040
2 -0.3064057
3 -0.3613952
4 -0.6372808
5 -0.4416058
6 0.3018943
xlab("Movie effect") +
ylab("Count") +
scale_y_continuous(labels = comma) +
theme_economist()
$$\hat{y}_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}$$
Predict the rating with mean + bi + bu
# Prediction
y_hat_bi_bu <- test_set %>%
left_join(bi, by='movieId') %>%
left_join(bu, by='userId') %>%
mutate(pred = mu + b_i + b_u) %>%
.$pred
# Update the results table
result <- bind_rows(result,
tibble(Method = "Mean + bi + bu",
RMSE = RMSE(test_set$rating, y_hat_bi_bu),
MSE = MSE(test_set$rating, y_hat_bi_bu),
MAE = MAE(test_set$rating, y_hat_bi_bu)))
# Keep only users with at least 100 ratings, then average their ratings
train_set %>%
  group_by(userId) %>%
  filter(n()>=100) %>%
  summarize(b_u = mean(rating)) %>%
  ggplot(aes(b_u)) +
  geom_histogram(color = "black") +
  ggtitle("User Effect Distribution") +
  xlab("User Bias") +
  ylab("Count") +
  scale_y_continuous(labels = comma) +
  theme_economist()
3.3.4 Evaluating the model result
The RMSE improved from the initial estimation based on the mean. However, we still need to
check whether the model makes good rating predictions.
Check the 10 largest residual differences
train_set %>%
left_join(bi, by='movieId') %>%
mutate(residual = rating - (mu + b_i)) %>%
arrange(desc(abs(residual))) %>%
slice(1:10)
title
title
Besotted (2001)
train_set %>%
left_join(bi, by = "movieId") %>%
arrange(desc(b_i)) %>%
group_by(title) %>%
summarise(n = n()) %>%
slice(1:10)
title
3.4 Regularization
Now, we regularize the user and movie effects by adding a penalty factor λ, which is a tuning
parameter. We define a number of values for λ and use the regularization function to pick
the value that minimizes the RMSE.
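regularization <- function(lambda, trainset, testset){
  # Mean
  mu <- mean(trainset$rating)
  # Movie effect (bi), shrunk by lambda
  b_i <- trainset %>%
    group_by(movieId) %>%
    summarize(b_i = sum(rating - mu)/(n()+lambda))
  # User effect (bu), shrunk by lambda
  b_u <- trainset %>%
    left_join(b_i, by = "movieId") %>%
    group_by(userId) %>%
    summarize(b_u = sum(rating - b_i - mu)/(n()+lambda))
  # Prediction: mu + bi + bu
  predicted_ratings <- testset %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    filter(!is.na(b_i), !is.na(b_u)) %>%
    mutate(pred = mu + b_i + b_u) %>%
    pull(pred)
  return(RMSE(predicted_ratings, testset$rating))
}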
# Tune lambda
rmses <- sapply(lambdas,
regularization,
trainset = train_set,
testset = test_set)
# Prediction
y_hat_reg <- test_set %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu + b_i + b_u) %>%
pull(pred)
if(!require(recosystem))
  install.packages("recosystem", repos = "http://cran.us.r-project.org")
set.seed(123, sample.kind = "Rounding") # This is a randomized algorithm
# Convert the train and test sets into recosystem input format
train_data <- with(train_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
test_data <- with(test_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
# Prediction
y_hat_edx <- validation %>%
left_join(b_i_edx, by = "movieId") %>%
left_join(b_u_edx, by = "userId") %>%
mutate(pred = mu_edx + b_i + b_u) %>%
pull(pred)
As expected, the RMSE calculated on the validation set (0.8648177) is lower than the target of
0.8649 and slightly higher than the RMSE of the test set (0.8641362).
Top 10 best movies
validation %>%
left_join(b_i_edx, by = "movieId") %>%
left_join(b_u_edx, by = "userId") %>%
mutate(pred = mu_edx + b_i + b_u) %>%
arrange(-pred) %>%
group_by(title) %>%
select(title) %>%
head(10)
title
validation %>%
left_join(b_i_edx, by = "movieId") %>%
left_join(b_u_edx, by = "userId") %>%
mutate(pred = mu_edx + b_i + b_u) %>%
arrange(pred) %>%
group_by(title) %>%
select(title) %>%
head(10)
title
Kazaam (1996)
Steel (1997)
The final RMSE with matrix factorization is 0.7826974, 9.5% better than the linear model with
regularization (0.8648177).
Now, let’s check the best and worst movies predicted with matrix factorization.
Top 10 best movies:
title
title
Bug (2007)
4 Conclusion
We started by collecting and preparing the dataset for analysis, then we explored the information
seeking insights that might help during the model building.
Next, we created a random model that predicts the rating based on the probability distribution of
each rating. This model gives the worst result.
We started the linear model with a very simple model, which is just the mean of the observed
ratings. From there, we added movie and user effects, which model the user behavior and movie
distribution. With regularization we added a penalty value for the movies and users with
few ratings. The linear model achieved an RMSE of 0.8648177, successfully passing the
target of 0.8649.
Finally, we evaluated the recosystem package, which implements the LIBMF algorithm, and
achieved an RMSE of 0.7826974.
4.1 Limitations
Some machine learning algorithms are too computationally expensive to run on a commodity laptop
and therefore could not be tested. The required amount of memory far exceeded what is available on
a commodity laptop, even with increased virtual memory.
Only two predictors are used, the movie and user information; other features are not considered.
Modern recommendation system models use many predictors, such as genres, bookmarks,
playlists, etc.
The model works only for existing users, movies and rating values, so the algorithm must run
every time a new user or movie is included, or when a rating changes. This is not an issue for
a small client base and a few movies, but may become a concern for large data sets. The model
should consider these changes and update the predictions as information changes.
There is no initial recommendation for a new user or for users that usually don’t rate movies.
Algorithms that use several features as predictors can overcome this issue.
References
1. Rafael A. Irizarry (2019), Introduction to Data Science: Data Analysis and Prediction
Algorithms with R
2. Yixuan Qiu (2017), recosystem: Recommender System Using Parallel Matrix
Factorization
3. Michael Hahsler (2019), recommenderlab: Lab for Developing and Testing
Recommender Algorithms. R package version 0.2-5.
4. Georgios Drakos, How to select the Right Evaluation Metric for Machine Learning
Models: Part 1 Regression Metrics
1. https://www.edx.org/professional-certificate/harvardx-data-science
2. https://www.netflixprize.com/
3. https://www.edx.org/professional-certificate/harvardx-data-science
4. https://grouplens.org/
5. https://movielens.org/
6. https://grouplens.org/datasets/movielens/latest/
7. https://grouplens.org/datasets/movielens/10m/
8. A Matrix-factorization Library for Recommender Systems -
https://www.csie.ntu.edu.tw/~cjlin/libmf/
9. https://cran.r-project.org/web/packages/recosystem/index.html
10. https://www.netflixprize.com/
11. https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html
12. https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html
13. https://cran.r-project.org/web/packages/available_packages_by_name.html