
PROJECT REPORT

Movie Recommendation System in R

VAC (Data Science using R)

GROUP PROJECT

BACHELOR OF ENGINEERING IN COMPUTER SCIENCE ENGINEERING

SUBMITTED BY: Rahul Chauhan - 23BCS12035
Partner: Anustubh Mishra - 23BCS13106
SUPERVISED BY: Rashmi Dhawan
VAC Section: 3    Semester: 3rd    Section: 708-B

Movie Recommendation System in R

Preface
The R Markdown code used to generate this report and the PDF version are available on GitHub.
The HTML version is available on RPubs.

1 Introduction
Recommendation systems play an important role in e-commerce and online streaming services such as Netflix, YouTube and Amazon. Making the right recommendation for the next product, song or movie increases user retention and satisfaction, leading to sales and profit growth. Companies competing for customer loyalty invest in systems that capture and analyze their users' preferences, and offer products or services with a higher likelihood of purchase.
The economic impact of such a company-customer relationship is clear: Amazon is the largest online retail company by sales, and part of its success comes from its recommendation system and marketing based on user preferences. In 2006 Netflix offered a one-million-dollar prize [2] for the person or group that could improve their recommendation system by at least 10%.
Recommendation systems are usually based on a rating scale from 1 to 5 stars, with 1 indicating the lowest satisfaction and 5 the highest. Other indicators can also be used as predictors, such as comments posted on previously used items; videos, music or links shared with friends; the percentage of a movie watched or music listened to; web pages visited and time spent on each page; product categories; and any other interaction with the company's website or application.
The primary goal of recommendation systems is to help users find what they want based on their preferences and previous interactions, and to predict the rating a user would give to a new item. In this document, we create a movie recommendation system using the MovieLens dataset, applying the lessons learned during HarvardX's Data Science Professional Certificate [3] program.
This document is structured as follows. Chapter 1 describes the dataset and summarizes the
goal of the project and key steps that were performed. In chapter 2 we explain the process and
techniques used, such as data cleaning, data exploration and visualization, any insights gained,
and the modeling approach. In chapter 3 we present the modeling results and discuss the model
performance. We conclude in chapter 4 with a brief summary of the report, its limitations and
future work.

1.1 MovieLens Dataset


GroupLens [4] is a research lab at the University of Minnesota that has collected and made available rating data for movies on the MovieLens website [5].
The complete MovieLens dataset [6] consists of 27 million ratings of 58,000 movies by 280,000 users. The research presented in this report is based on a subset [7] of this dataset, with 10 million ratings of 10,000 movies by 72,000 users.
1.2 Model Evaluation
The evaluation of machine learning algorithms consists of comparing the predicted value with the actual outcome. A loss function measures the difference between the two values. Other metrics exist, but they are beyond the scope of this document.
The most common loss functions in machine learning are the mean absolute error (MAE), mean
squared error (MSE) and root mean squared error (RMSE).
Regardless of the loss function, when the predicted rating always matches the actual rating, the error is zero and the algorithm is perfect.
The goal of this project is to create a recommendation system with RMSE lower than 0.8649.
There’s no specific target for MSE and MAE.

1.2.1 Mean Absolute Error - MAE


The mean absolute error is the average of the absolute differences between the predicted and true values. The metric is linear, meaning all errors are weighted equally. Thus, when predicting ratings on a scale of 1 to 5, the MAE assumes that an error of 2 is twice as bad as an error of 1.
The MAE is given by this formula:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |\hat{y}_i - y_i|$$

where $N$ is the number of observations, $\hat{y}_i$ is the predicted value and $y_i$ is the true value.

1.2.2 Mean Squared Error - MSE


The MSE is the average of the squared prediction errors. Larger errors are weighted more heavily than smaller ones: if the error doubles, the MSE increases fourfold. Conversely, errors smaller than 1 shrink; for example, an error of 0.5 contributes a squared error of 0.25.

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

1.2.3 Root Mean Squared Error - RMSE


The root mean squared error, RMSE, is the square root of the MSE. It is the typical metric used to evaluate recommendation systems, and is defined by the formula:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{u,i} \left(\hat{y}_{u,i} - y_{u,i}\right)^2}$$

where $N$ is the number of ratings, $y_{u,i}$ is the rating of movie $i$ by user $u$ and $\hat{y}_{u,i}$ is the prediction of movie $i$ by user $u$.
Like the MSE, the RMSE penalizes large deviations and is appropriate when large errors are undesirable. Unlike the MSE, the RMSE has the same unit as the measured quantity.
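As a quick check, the sketch below computes the three metrics on a small vector of illustrative ratings (toy values, not from the dataset):

# Toy example: compare MAE, MSE and RMSE on illustrative ratings
true <- c(4.0, 3.5, 5.0, 2.0)
pred <- c(3.5, 3.5, 4.0, 3.0)
mean(abs(true - pred))          # MAE  = 0.625
mean((true - pred)^2)           # MSE  = 0.5625
sqrt(mean((true - pred)^2))     # RMSE = 0.75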

1.3 Process and Workflow


The main steps in a data science project include:

1. Data preparation: download, parse, import and prepare the data to be processed and analyzed.
2. Data exploration and visualization: explore the data to understand the features and the relationships between the features and predictors.
3. Data cleaning: the dataset may contain unnecessary information that needs to be removed.
4. Data analysis and modeling: create the model using the insights gained during exploration; also test and validate the model.
5. Communication: create the report and publish the results.

First, we download the dataset from the MovieLens website and split it into two subsets, one for training and one for validation. The training subset is called edx and the validation subset is called validation. The edx set is then split again into separate training and testing subsets. When the model reaches the RMSE target on the testing set, we train the model on the entire edx set and use the validation set for the final validation, treating the validation set as new data with unknown outcomes.
In the next step, we create charts, tables and summary statistics to understand how the features can impact the outcome. The information and insights obtained during exploration help to build the machine learning model.
Creating a recommendation system involves identifying the most important features that help predict the rating any given user will give to any movie. We start by building a very simple model, which is just the mean of the observed ratings. Then, the movie and user effects are added to the linear model, improving the RMSE. Finally, the movie and user effects receive a regularization parameter that penalizes samples with few ratings.
Although the linear model with regularization achieves the desired RMSE, matrix factorization using the LIBMF [8] algorithm is also evaluated and provides a better prediction. LIBMF is available through the R package recosystem [9].

2 Methods and Analysis


2.1 Data Preparation
In this section we download and prepare the dataset to be used in the analysis. We split the dataset into two parts: the training set called edx and the evaluation set called validation, with 90% and 10% of the original dataset respectively.
Then, we split the edx set into two parts: the train set and the test set, with 90% and 10% of the edx set respectively. The model is built and trained on the train set and tested on the test set until the RMSE target is achieved; then we train the model again on the entire edx set and validate it on the validation set. This is a standard hold-out validation approach.

if(!require(tidyverse))
  install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret))
  install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(data.table))
  install.packages("data.table", repos = "http://cran.us.r-project.org")

# MovieLens 10M dataset:
# https://grouplens.org/datasets/movielens/10m/
# http://files.grouplens.org/datasets/movielens/ml-10m.zip

dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)

ratings <- fread(text = gsub("::", "\t",
                             readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
                 col.names = c("userId", "movieId", "rating", "timestamp"))

movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
movies <- as.data.frame(movies) %>%
  mutate(movieId = as.numeric(levels(movieId))[movieId],
         title = as.character(title),
         genres = as.character(genres))

movielens <- left_join(ratings, movies, by = "movieId")

# 'validation' set will be 10% of the MovieLens data
set.seed(1, sample.kind = "Rounding")
test_index <- createDataPartition(y = movielens$rating,
                                  times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]

# Make sure userId and movieId in the 'validation' set are also in the 'edx' set
validation <- temp %>%
  semi_join(edx, by = "movieId") %>%
  semi_join(edx, by = "userId")

# Add rows removed from the 'validation' set back into the 'edx' set
removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)

rm(dl, ratings, movies, test_index, temp, movielens, removed)

The edx set is used for training and testing, and the validation set is used for the final validation, simulating new data.
Here, we split the edx set into two parts: the training set and the test set. The model building is done on the training set, and the test set is used to test the model. When the model is complete, we use the validation set to calculate the final RMSE. We use the same procedure used to create the edx and validation sets: the training set is 90% of the edx data and the test set is the remaining 10%.
set.seed(1, sample.kind = "Rounding")
test_index <- createDataPartition(y = edx$rating, times = 1, p = 0.1, list = FALSE)
train_set <- edx[-test_index,]
temp <- edx[test_index,]

# Make sure userId and movieId in the test set are also in the train set
test_set <- temp %>%
  semi_join(train_set, by = "movieId") %>%
  semi_join(train_set, by = "userId")

# Add rows removed from the test set back into the train set
removed <- anti_join(temp, test_set)
train_set <- rbind(train_set, removed)

rm(test_index, temp, removed)

2.2 Data Exploration


Before we start building the model, we need to understand the structure of the data, the distribution of ratings and the relationships between the predictors. This information helps build a better model.

str(edx)
## 'data.frame': 9000055 obs. of 6 variables:
##  $ userId   : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ movieId  : num 122 185 292 316 329 355 356 362 364 370 ...
##  $ rating   : num 5 5 5 5 5 5 5 5 5 5 ...
##  $ timestamp: int 838985046 838983525 838983421 838983392 838983392 838984474 838983653 838984885 838983707 838984596 ...
##  $ title    : chr "Boomerang (1992)" "Net, The (1995)" "Outbreak (1995)" "Stargate (1994)" ...
##  $ genres   : chr "Comedy|Romance" "Action|Crime|Thriller" "Action|Drama|Sci-Fi|Thriller" "Action|Adventure|Sci-Fi" ...

From this initial exploration, we discover that edx has 6 columns:

userId    integer
movieId   numeric
rating    numeric
timestamp integer
title     character
genres    character

How many rows and columns are there in the edx dataset?

dim(edx)
## [1] 9000055 6

The next table shows the structure and content of the edx dataset.
The dataset is in tidy format, i.e. each row has one observation and the column names are the features. The rating column is the desired outcome. The user information is stored in userId; the movie information is in both the movieId and title columns. The rating date is available in timestamp, measured in seconds since January 1st, 1970. Each movie is tagged with one or more genres in the genres column.
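For example, a minimal sketch converts the first timestamp above into a readable date (using lubridate, which is loaded later in this report):

library(lubridate)
as_datetime(838985046, origin = "1970-01-01")
## [1] "1996-08-02 11:24:06 UTC"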

head(edx)

userId movieId rating timestamp title genres

1 1 122 5 838985046 Boomerang (1992) Comedy|Romance

2 1 185 5 838983525 Net, The (1995) Action|Crime|Thriller

4 1 292 5 838983421 Outbreak (1995) Action|Drama|Sci-Fi|Thriller

5 1 316 5 838983392 Stargate (1994) Action|Adventure|Sci-Fi

6 1 329 5 838983392 Star Trek: Generations (1994) Action|Adventure|Drama|Sci-Fi

7 1 355 5 838984474 Flintstones, The (1994) Children|Comedy|Fantasy

The next sections explore each feature and the outcome in more detail.

2.2.1 Genres
Along with the movie title, MovieLens provides the list of genres for each movie. Although this information could be used to make better predictions, this research doesn't use it. It is still worth exploring.
The dataset contains 797 different combinations of genres. Here is the list of the first six.

edx %>% group_by(genres) %>%
  summarise(n = n()) %>%
  head()

Genres
(no genres listed)
Action
Action|Adventure
Action|Adventure|Animation|Children|Comedy
Action|Adventure|Animation|Children|Comedy|Fantasy
Action|Adventure|Animation|Children|Comedy|IMAX

The table above shows that several movies are classified in more than one genre. The number of genres per movie is listed in the following table, sorted in descending order.

tibble(count = str_count(edx$genres, fixed("|")), genres = edx$genres) %>%
  group_by(count, genres) %>%
  summarise(n = n()) %>%
  arrange(-count) %>%
  head()

count genres

7 Action|Adventure|Comedy|Drama|Fantasy|Horror|Sci-Fi|Thriller

6 Adventure|Animation|Children|Comedy|Crime|Fantasy|Mystery

6 Adventure|Animation|Children|Comedy|Drama|Fantasy|Mystery

6 Adventure|Animation|Children|Comedy|Fantasy|Musical|Romance

5 Action|Adventure|Animation|Children|Comedy|Fantasy

5 Action|Adventure|Animation|Children|Comedy|IMAX
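Since genres are pipe-separated, they can also be unnested into individual genres. A minimal sketch, assuming tidyr's separate_rows() (part of the tidyverse already loaded above):

# Count ratings per individual genre; a rating is counted once per genre tag
edx %>%
  separate_rows(genres, sep = "\\|") %>%
  count(genres, sort = TRUE) %>%
  head()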

2.2.2 Date
The ratings were collected over a period of almost 14 years.

library(lubridate)
tibble(`Initial Date` = date(as_datetime(min(edx$timestamp), origin = "1970-01-01")),
       `Final Date` = date(as_datetime(max(edx$timestamp), origin = "1970-01-01"))) %>%
  mutate(Period = duration(max(edx$timestamp) - min(edx$timestamp)))

Initial Date   Final Date   Period
1995-01-09     2009-01-05   441479727s
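The Period column above is printed in raw seconds; lubridate's duration() renders it in a more readable form:

duration(441479727)
## [1] "441479727s (~13.99 years)"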

if(!require(ggthemes))
  install.packages("ggthemes", repos = "http://cran.us.r-project.org")
if(!require(scales))
  install.packages("scales", repos = "http://cran.us.r-project.org")

edx %>% mutate(year = year(as_datetime(timestamp, origin = "1970-01-01"))) %>%
  ggplot(aes(x = year)) +
  geom_histogram(color = "white") +
  ggtitle("Rating Distribution Per Year") +
  xlab("Year") +
  ylab("Number of Ratings") +
  scale_y_continuous(labels = comma) +
  theme_economist()

Rating distribution per year.
The following table lists the days with the most ratings. Not surprisingly, the movies rated on those days are well-known blockbusters.

edx %>% mutate(date = date(as_datetime(timestamp, origin = "1970-01-01"))) %>%
  group_by(date, title) %>%
  summarise(count = n()) %>%
  arrange(-count) %>%
  head(10)

Date title

1998-05-22 Chasing Amy (1997)

2000-11-20 American Beauty (1999)

1999-12-11 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)

1999-12-11 Star Wars: Episode V - The Empire Strikes Back (1980)

1999-12-11 Star Wars: Episode VI - Return of the Jedi (1983)

2005-03-22 Lord of the Rings: The Two Towers, The (2002)

2005-03-22 Lord of the Rings: The Fellowship of the Ring, The (2001)

2000-11-20 Terminator 2: Judgment Day (1991)

1999-12-11 Matrix, The (1999)

2000-11-20 Jurassic Park (1993)

2.2.3 Ratings
Users can choose a rating from 0.5 to 5.0, in half-star increments, for a total of 10 possible values. Half-star ratings are less common, so most ratings use whole-star values, as shown in the chart below.
First, count the number of each rating:

edx %>% group_by(rating) %>% summarize(n = n())

rating
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
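The dominance of whole-star ratings can be quantified with a one-line check (a sketch; the exact share is not reported in this document):

# Proportion of ratings that are whole-star values
mean(edx$rating %% 1 == 0)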

The chart below shows the rating distribution on a log scale.

edx %>% group_by(rating) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = rating, y = count)) +
  geom_line() +
  geom_point() +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  ggtitle("Rating Distribution", subtitle = "Higher ratings are prevalent.") +
  xlab("Rating") +
  ylab("Count") +
  theme_economist()

Whole-star values receive more ratings than half-star values.

2.2.4 Movies
There are 10,677 different movies in the edx set. Intuitively, some movies are rated more than others: many movies are watched by few users, while blockbusters tend to have many ratings.

edx %>% group_by(movieId) %>%
  summarise(n = n()) %>%
  ggplot(aes(n)) +
  geom_histogram(color = "white") +
  scale_x_log10() +
  ggtitle("Distribution of Movies",
          subtitle = "The distribution is almost symmetric.") +
  xlab("Number of Ratings") +
  ylab("Number of Movies") +
  theme_economist()

10.7% of movies have fewer than 10 ratings.
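This figure can be checked directly from the data (a minimal sketch):

# Share of movies with fewer than 10 ratings
edx %>% count(movieId) %>% summarise(share = mean(n < 10))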

2.2.5 Users
There are 69,878 different users in the edx set. The majority of users rate few movies, while a few users rate more than a thousand movies; 5% of users rated fewer than 20 movies.

edx %>% group_by(userId) %>%
  summarise(n = n()) %>%
  arrange(n) %>%
  head()

userId     n
 62516    10
 22170    12
 15719    13
 50608    13
   901    14
  1833    14

The user distribution is right skewed.

edx %>% group_by(userId) %>%
  summarise(n = n()) %>%
  ggplot(aes(n)) +
  geom_histogram(color = "white") +
  scale_x_log10() +
  ggtitle("Distribution of Users",
          subtitle = "The distribution is right skewed.") +
  xlab("Number of Ratings") +
  ylab("Number of Users") +
  scale_y_continuous(labels = comma) +
  theme_economist()

5% of users rated fewer than 20 movies.


The heatmap below shows a sample of the user-movie matrix. This matrix is sparse, with the vast majority of cells empty. Notice that a few movies have many more ratings, and one or two users are more active.

users <- sample(unique(edx$userId), 100)

edx %>% filter(userId %in% users) %>%
  select(userId, movieId, rating) %>%
  mutate(rating = 1) %>%
  spread(movieId, rating) %>%
  select(sample(ncol(.), 100)) %>%
  as.matrix() %>% t(.) %>%
  image(1:100, 1:100, ., xlab = "Movies", ylab = "Users")
abline(h = 0:100 + 0.5, v = 0:100 + 0.5, col = "grey")
title("User x Movie Matrix")

2.3 Data Cleaning

As previously discussed, several features could be used to predict the rating for a given user. However, adding predictors increases the model complexity and requires more computing resources, so in this research the estimated rating uses only the movie and user information.

train_set <- train_set %>% select(userId, movieId, rating, title)

test_set <- test_set %>% select(userId, movieId, rating, title)

2.4 Modeling
2.4.1 Random Prediction

A very simple model is to randomly predict the rating using the probability distribution observed during data exploration. For example, if we know that the probability of all users giving a movie a rating of 3 is 10%, then we may guess that 10% of the ratings will have a rating of 3.

Such a prediction sets the worst error we may get, so any other model should provide a better result.

2.4.2 Linear Model

The simplest model predicts that all users will give the same rating to all movies and assumes that the movie-to-movie variation is randomly distributed error. Although the predicted rating could be any value, statistical theory says that the average minimizes the RMSE, so the initial prediction is just the average of all observed ratings, as described in this formula:

$$\hat{Y}_{u,i} = \mu + \epsilon_{u,i}$$

where $\hat{Y}_{u,i}$ is the predicted rating, $\mu$ is the mean of the observed data and $\epsilon_{u,i}$ is the error term. Any value other than the mean increases the RMSE, so this is a good initial estimate.
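A quick numeric check on toy values shows that, among constant predictions, the mean gives the lowest RMSE:

# RMSE of constant predictions over a grid; the minimum is at the mean
y <- c(2, 3, 4, 5)
cand <- seq(2, 5, 0.1)
rmse <- sapply(cand, function(m) sqrt(mean((y - m)^2)))
cand[which.min(rmse)]   # 3.5, which equals mean(y)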

Part of the movie-to-movie variability can be explained by the fact that different movies have different rating distributions. This is easy to understand, since some movies are more popular than others and public preference varies. This is called the movie effect or movie bias, and is expressed as $b_i$ in this formula:

$$\hat{Y}_{u,i} = \mu + b_i + \epsilon_{u,i}$$

The movie effect can be calculated as the mean of the differences between the observed ratings of movie $i$ and the overall mean $\mu$, where $n_i$ is the number of ratings of movie $i$:

$$\hat{b}_i = \frac{1}{n_i}\sum_{u=1}^{n_i} (y_{u,i} - \hat{\mu})$$

Similar to the movie effect, different users have different rating patterns or distributions. For example, some users like most movies and consistently rate them 4 or 5, while other users dislike most movies and rate them 1 or 2. This is called the user effect or user bias and is expressed in this formula, where $n_u$ is the number of ratings by user $u$:

$$\hat{b}_u = \frac{1}{n_u}\sum_{i=1}^{n_u} (y_{u,i} - \hat{b}_i - \hat{\mu})$$

The prediction model that includes the user effect becomes:

$$\hat{Y}_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}$$

Movies can also be grouped into categories or genres with different distributions; in general, movies in the same genre get similar ratings. In this project we won't evaluate the genre effect.
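To make the formulas concrete, here is a minimal sketch computing both effects on a tiny made-up set of ratings (illustrative values only, not from edx):

library(dplyr)

toy <- tibble(userId  = c(1, 1, 2, 2, 3),
              movieId = c(10, 20, 10, 20, 10),
              rating  = c(5, 3, 4, 2, 4))

mu_toy <- mean(toy$rating)   # overall mean: 3.6

bi_toy <- toy %>% group_by(movieId) %>%
  summarize(b_i = mean(rating - mu_toy))   # movie 10: +0.73; movie 20: -1.10

bu_toy <- toy %>% left_join(bi_toy, by = "movieId") %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating - mu_toy - b_i))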

2.4.3 Regularization

The linear model provides a good estimate of the ratings, but it doesn't consider that many movies have very few ratings and that some users rate very few movies. This means the sample size is very small for these movies and users; statistically, this leads to a large estimation error.

The estimate can be improved by adding a factor that penalizes small sample sizes and has little or no impact otherwise. Thus, the estimated movie and user effects can be calculated with these formulas:

$$\hat{b}_i = \frac{1}{n_i + \lambda}\sum_{u=1}^{n_i} (y_{u,i} - \hat{\mu})$$

$$\hat{b}_u = \frac{1}{n_u + \lambda}\sum_{i=1}^{n_u} (y_{u,i} - \hat{b}_i - \hat{\mu})$$

For values of $n_i$ (or $n_u$) smaller than or similar to $\lambda$, $\hat{b}_i$ and $\hat{b}_u$ are shrunk towards zero relative to the unregularized estimates, whereas for values much larger than $\lambda$ they change very little.

An effective way to choose the $\lambda$ that minimizes the RMSE is to run simulations with several values of $\lambda$.
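The shrinkage behavior is easy to see numerically: the regularized estimate scales the raw mean by $n/(n+\lambda)$. A quick illustration with an arbitrary penalty:

# Effect of the penalty for different sample sizes (lambda = 5, illustrative)
lambda_demo <- 5
n <- c(1, 10, 100, 1000)
round(n / (n + lambda_demo), 3)
## [1] 0.167 0.667 0.952 0.995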

2.4.4 Matrix Factorization

Matrix factorization is a widely used machine learning technique for predicting ratings in recommendation systems. The method became widely known during the Netflix Prize challenge [10].

The data can be arranged into a matrix with one user per row, one movie per column and the rating in the cell; the algorithm then attempts to fill in the missing values. The table below provides a simple example of a 4×5 matrix.

        movie 1  movie 2  movie 3  movie 4  movie 5
user 1     ?        ?        4        ?        3
user 2     2        ?        ?        4        ?
user 3     ?        3        ?        ?        5
user 4     3        ?        2        ?        ?

The concept is to approximate a large rating matrix $R_{m\times n}$ by the product of two lower-dimensional matrices $P_{k\times m}$ and $Q_{k\times n}$, such that

$$R \approx P'Q$$

The R package recosystem provides methods to decompose the rating matrix and estimate the user ratings using parallel matrix factorization.
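A minimal sketch of the idea with random factor matrices (illustrative only; recosystem learns P and Q from the observed ratings):

# Approximate a 4 x 5 rating matrix as the product of two rank-2 factors
set.seed(1)
k <- 2
P <- matrix(runif(k * 4), nrow = k)   # k x m latent user factors
Q <- matrix(runif(k * 5), nrow = k)   # k x n latent movie factors
R_hat <- t(P) %*% Q                   # m x n matrix of approximated ratings
dim(R_hat)
## [1] 4 5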

3 Results

This section presents the code and results of the models.

3.1 Model Evaluation Functions

Here we define the loss functions.

# Define Mean Absolute Error (MAE)
MAE <- function(true_ratings, predicted_ratings){
  mean(abs(true_ratings - predicted_ratings))
}

# Define Mean Squared Error (MSE)
MSE <- function(true_ratings, predicted_ratings){
  mean((true_ratings - predicted_ratings)^2)
}

# Define Root Mean Squared Error (RMSE)
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

3.2 Random Prediction

The first model randomly predicts the ratings using the probabilities observed in the training set. First, we calculate the probability of each rating in the training set; then we predict ratings for the test set and compare them with the actual ratings. Any model should do better than this one.

Since the training set is a sample of the entire population and we don't know the real distribution of ratings, a Monte Carlo simulation with replacement provides a good approximation of the rating distribution.

set.seed(4321, sample.kind = "Rounding")

# Probability of each rating within a sample
p <- function(x, y) mean(y == x)
rating <- seq(0.5, 5, 0.5)

# Estimate the probability of each rating with Monte Carlo simulation
B <- 10^3
M <- replicate(B, {
  s <- sample(train_set$rating, 100, replace = TRUE)
  sapply(rating, p, y = s)
})
prob <- sapply(1:nrow(M), function(x) mean(M[x,]))

# Predict random ratings
y_hat_random <- sample(rating, size = nrow(test_set),
                       replace = TRUE, prob = prob)

# Create a table with the error results
result <- tibble(Method = "Project Goal", RMSE = 0.8649, MSE = NA, MAE = NA)
result <- bind_rows(result,
                    tibble(Method = "Random prediction",
                           RMSE = RMSE(test_set$rating, y_hat_random),
                           MSE = MSE(test_set$rating, y_hat_random),
                           MAE = MAE(test_set$rating, y_hat_random)))

The RMSE of random prediction is very high.

result

Method               RMSE      MSE        MAE
Project Goal         0.864900  NA         NA
Random prediction    1.501245  2.253735   1.167597

3.3 Linear Model

We build the linear model based on the formula:

$$\hat{y}_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}$$

3.3.1 Initial Prediction

The initial prediction is just the mean of the ratings, $\mu$:

$$\hat{y}_{u,i} = \mu + \epsilon_{u,i}$$

# Mean of observed values
mu <- mean(train_set$rating)

# Update the error table
result <- bind_rows(result,
                    tibble(Method = "Mean",
                           RMSE = RMSE(test_set$rating, mu),
                           MSE = MSE(test_set$rating, mu),
                           MAE = MAE(test_set$rating, mu)))

# Show the RMSE improvement
result

Method               RMSE      MSE        MAE
Project Goal         0.864900  NA         NA
Random prediction    1.501245  2.253735   1.1675974
Mean                 1.060054  1.123714   0.8551636

3.3.2 Include Movie Effect (bi)

$b_i$ is the movie effect (bias) for movie $i$:

$$\hat{y}_{u,i} = \mu + b_i + \epsilon_{u,i}$$

# Movie effects (bi)
bi <- train_set %>%
  group_by(movieId) %>%
  summarize(b_i = mean(rating - mu))
head(bi)

movieId     b_i
      1   0.4150040
      2  -0.3064057
      3  -0.3613952
      4  -0.6372808
      5  -0.4416058
      6   0.3018943

The movie effect distribution is left skewed.

bi %>% ggplot(aes(x = b_i)) +
  geom_histogram(bins = 10, col = I("black")) +
  ggtitle("Movie Effect Distribution") +
  xlab("Movie effect") +
  ylab("Count") +
  scale_y_continuous(labels = comma) +
  theme_economist()

# Predict the rating with mean + bi
y_hat_bi <- mu + test_set %>%
  left_join(bi, by = "movieId") %>%
  .$b_i

# Calculate the RMSE
result <- bind_rows(result,
                    tibble(Method = "Mean + bi",
                           RMSE = RMSE(test_set$rating, y_hat_bi),
                           MSE = MSE(test_set$rating, y_hat_bi),
                           MAE = MAE(test_set$rating, y_hat_bi)))

# Show the RMSE improvement
result

Method               RMSE       MSE
Project Goal         0.8649000  NA
Random prediction    1.5012445  2.2537350
Mean                 1.0600537  1.1237139
Mean + bi            0.9429615  0.8891764

3.3.3 Include User Effect (bu)

$b_u$ is the user effect (bias) for user $u$:

$$\hat{y}_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}$$

Predict the rating with mean + bi + bu:

# User effect (bu)
bu <- train_set %>%
  left_join(bi, by = 'movieId') %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating - mu - b_i))

# Prediction
y_hat_bi_bu <- test_set %>%
  left_join(bi, by = 'movieId') %>%
  left_join(bu, by = 'userId') %>%
  mutate(pred = mu + b_i + b_u) %>%
  .$pred

# Update the results table
result <- bind_rows(result,
                    tibble(Method = "Mean + bi + bu",
                           RMSE = RMSE(test_set$rating, y_hat_bi_bu),
                           MSE = MSE(test_set$rating, y_hat_bi_bu),
                           MAE = MAE(test_set$rating, y_hat_bi_bu)))

# Show the RMSE improvement
result

Method               RMSE       MSE
Project Goal         0.8649000  NA
Random prediction    1.5012445  2.2537350
Mean                 1.0600537  1.1237139
Mean + bi            0.9429615  0.8891764
Mean + bi + bu       0.8646843  0.7476789

The user effect is approximately normally distributed.

train_set %>%
  group_by(userId) %>%
  filter(n() >= 100) %>%
  summarize(b_u = mean(rating)) %>%
  ggplot(aes(b_u)) +
  geom_histogram(color = "black") +
  ggtitle("User Effect Distribution") +
  xlab("User Bias") +
  ylab("Count") +
  scale_y_continuous(labels = comma) +
  theme_economist()
3.3.4 Evaluating the Model Result
The RMSE improved from the initial estimate based on the mean. However, we still need to check whether the model makes good rating predictions.
Check the 10 largest residuals:

train_set %>%
  left_join(bi, by = 'movieId') %>%
  mutate(residual = rating - (mu + b_i)) %>%
  arrange(desc(abs(residual))) %>%
  slice(1:10)

userId  movieId  rating  title                              b_i
 26423     6483     5.0  From Justin to Kelly (2003)       -2.638139
  5279     6371     5.0  Pokémon Heroes (2003)             -2.472133
 57863     6371     5.0  Pokémon Heroes (2003)             -2.472133
  2507      318     0.5  Shawshank Redemption, The (1994)   0.944111
  7708      318     0.5  Shawshank Redemption, The (1994)   0.944111
  9214      318     0.5  Shawshank Redemption, The (1994)   0.944111
  9568      318     0.5  Shawshank Redemption, The (1994)   0.944111
  9975      318     0.5  Shawshank Redemption, The (1994)   0.944111
 10749      318     0.5  Shawshank Redemption, The (1994)   0.944111
 13496      318     0.5  Shawshank Redemption, The (1994)   0.944111

titles <- train_set %>%
  select(movieId, title) %>%
  distinct()

The top movies ranked by bi are obscure titles:

bi %>%
  inner_join(titles, by = "movieId") %>%
  arrange(-b_i) %>%
  select(title) %>%
  head()

title

Hellhounds on My Trail (1999)

Satan’s Tango (Sátántangó) (1994)

Shadows of Forgotten Ancestors (1964)

Fighting Elegy (Kenka erejii) (1966)

Sun Alley (Sonnenallee) (1999)

Blue Light, The (Das Blaue Licht) (1932)


The worst movies ranked by bi are also obscure titles:

bi %>%
  inner_join(titles, by = "movieId") %>%
  arrange(b_i) %>%
  select(title) %>%
  head()

title

Besotted (2001)

Hi-Line, The (1999)

Accused (Anklaget) (2005)

Confessions of a Superhero (2007)

War of the Worlds 2: The Next Wave (2008)

SuperBabies: Baby Geniuses 2 (2004)

Number of ratings for 10 best movies:

train_set %>%
left_join(bi, by = "movieId") %>%
arrange(desc(b_i)) %>%
group_by(title) %>%
summarise(n = n()) %>%
slice(1:10)

title
’burbs, The (1989)
’night Mother (1986)
’Round Midnight (1986)
’Til There Was You (1997)
“Great Performances” Cats (1998)
*batteries not included (1987)
…All the Marbles (a.k.a. The California Dolls) (1981)
…And God Created Woman (Et Dieu… créa la femme) (1956)
…And God Spoke (1993)
…And Justice for All (1979)

train_set %>% count(movieId) %>%
  left_join(bi, by = "movieId") %>%
  arrange(desc(b_i)) %>%
  slice(1:10) %>%
  pull(n)
## [1] 1 1 1 1 1 1 4 2 4 4

3.4 Regularization
The best and worst movies ranked by bi were rated by very few users, which inflates the effect estimates; this is exactly the situation regularization addresses. We regularize the user and movie effects by adding a penalty factor λ, which is a tuning parameter. We define a range of values for λ and use the regularization function below to pick the value that minimizes the RMSE.
regularization <- function(lambda, trainset, testset){

  # Mean
  mu <- mean(trainset$rating)

  # Movie effect (bi)
  b_i <- trainset %>%
    group_by(movieId) %>%
    summarize(b_i = sum(rating - mu)/(n() + lambda))

  # User effect (bu)
  b_u <- trainset %>%
    left_join(b_i, by = "movieId") %>%
    filter(!is.na(b_i)) %>%
    group_by(userId) %>%
    summarize(b_u = sum(rating - b_i - mu)/(n() + lambda))

  # Prediction: mu + bi + bu
  predicted_ratings <- testset %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    filter(!is.na(b_i), !is.na(b_u)) %>%
    mutate(pred = mu + b_i + b_u) %>%
    pull(pred)

  return(RMSE(predicted_ratings, testset$rating))
}

# Define a set of lambdas to tune
lambdas <- seq(0, 10, 0.25)

# Tune lambda
rmses <- sapply(lambdas,
                regularization,
                trainset = train_set,
                testset = test_set)

# Plot lambda vs RMSE
tibble(Lambda = lambdas, RMSE = rmses) %>%
  ggplot(aes(x = Lambda, y = RMSE)) +
  geom_point() +
  ggtitle("Regularization",
          subtitle = "Pick the penalization that gives the lowest RMSE.") +
  theme_economist()

Choose the lambda that minimizes the RMSE.

Next, we apply the best λ to the linear model.

# Pick the lambda that returns the lowest RMSE
lambda <- lambdas[which.min(rmses)]

# Calculate the predicted rating using the best parameter from regularization
mu <- mean(train_set$rating)

# Movie effect (bi)
b_i <- train_set %>%
  group_by(movieId) %>%
  summarize(b_i = sum(rating - mu)/(n() + lambda))

# User effect (bu)
b_u <- train_set %>%
  left_join(b_i, by = "movieId") %>%
  group_by(userId) %>%
  summarize(b_u = sum(rating - b_i - mu)/(n() + lambda))

# Prediction
y_hat_reg <- test_set %>%
  left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  mutate(pred = mu + b_i + b_u) %>%
  pull(pred)

# Update the result table
result <- bind_rows(result,
                    tibble(Method = "Regularized bi and bu",
                           RMSE = RMSE(test_set$rating, y_hat_reg),
                           MSE = MSE(test_set$rating, y_hat_reg),
                           MAE = MAE(test_set$rating, y_hat_reg)))

# Regularization made a small improvement in RMSE
result

Method                  RMSE       MSE
Project Goal            0.8649000  NA
Random prediction       1.5012445  2.2537350
Mean                    1.0600537  1.1237139
Mean + bi               0.9429615  0.8891764
Mean + bi + bu          0.8646843  0.7476789
Regularized bi and bu   0.8641362  0.7467313

3.5 Matrix Factorization

Matrix factorization approximates the large user-movie matrix as the product of two smaller matrices. Information in the train set is stored in tidy format, with one observation per row, so it needs to be converted into a user-movie matrix before matrix factorization can be applied. This code would perform the transformation:

train_data <- train_set %>%
  select(userId, movieId, rating) %>%
  spread(movieId, rating) %>%
  as.matrix()

However, the code above uses more memory than a commodity laptop can handle, so we use an alternative: the recosystem package, which provides a complete recommendation-system solution based on matrix factorization.
The package vignette [12] describes how to use recosystem. Its usage is quite simple, mainly consisting of the following steps:

1. Create a model object (a Reference Class object in R) by calling Reco().
2. (Optionally) call the $tune() method to select the best tuning parameters from a set of candidate values.
3. Train the model by calling the $train() method. A number of parameters can be set inside the function, possibly coming from the result of $tune().
4. (Optionally) export the model via $output(), i.e. write the factorization matrices P and Q into files or return them as R objects.
5. Use the $predict() method to compute predicted values.

if(!require(recosystem))
  install.packages("recosystem", repos = "http://cran.us.r-project.org")
set.seed(123, sample.kind = "Rounding") # This is a randomized algorithm

# Convert the train and test sets into recosystem input format
train_data <- with(train_set, data_memory(user_index = userId,
                                          item_index = movieId,
                                          rating = rating))
test_data <- with(test_set, data_memory(user_index = userId,
                                        item_index = movieId,
                                        rating = rating))

# Create the model object
r <- recosystem::Reco()

# Select the best tuning parameters
opts <- r$tune(train_data, opts = list(dim = c(10, 20, 30),
                                       lrate = c(0.1, 0.2),
                                       costp_l2 = c(0.01, 0.1),
                                       costq_l2 = c(0.01, 0.1),
                                       nthread = 4, niter = 10))

# Train the algorithm
r$train(train_data, opts = c(opts$min, nthread = 4, niter = 20))
## iter tr_rmse obj
## 0 0.9825 1.1036e+007
## 1 0.8740 8.9621e+006
## 2 0.8419 8.3303e+006
## 3 0.8208 7.9554e+006
## 4 0.8047 7.6948e+006
## 5 0.7921 7.5024e+006
## 6 0.7817 7.3538e+006
## 7 0.7728 7.2387e+006
## 8 0.7652 7.1393e+006
## 9 0.7587 7.0615e+006
## 10 0.7530 6.9936e+006
## 11 0.7479 6.9347e+006
## 12 0.7435 6.8850e+006
## 13 0.7394 6.8436e+006
## 14 0.7356 6.8037e+006
## 15 0.7322 6.7692e+006
## 16 0.7292 6.7380e+006
## 17 0.7262 6.7104e+006
## 18 0.7235 6.6837e+006
## 19 0.7212 6.6625e+006
# Calculate the predicted values
y_hat_reco <- r$predict(test_data, out_memory())
head(y_hat_reco, 10)
## [1] 4.851838 3.720369 3.274899 2.896342 3.550605 3.879312 3.637041
## [8] 3.427570 4.030577 3.193461

Matrix factorization substantially improved the RMSE.

result <- bind_rows(result,
                    tibble(Method = "Matrix Factorization - recosystem",
                           RMSE = RMSE(test_set$rating, y_hat_reco),
                           MSE = MSE(test_set$rating, y_hat_reco),
                           MAE = MAE(test_set$rating, y_hat_reco)))
result

Method                              RMSE       MSE
Project Goal                        0.8649000  NA
Random prediction                   1.5012445  2.2537350
Mean                                1.0600537  1.1237139
Mean + bi                           0.9429615  0.8891764
Mean + bi + bu                      0.8646843  0.7476789
Regularized bi and bu               0.8641362  0.7467313
Matrix Factorization - recosystem   0.7851291  0.6164277

3.6 Final Validation

As the result table shows, both regularization and matrix factorization achieved the target RMSE on the test set. So, finally, we train each model on the complete edx set and calculate the RMSE on the validation set. The project goal is achieved if the final RMSE stays below the target.

3.6.1 Linear Model With Regularization

During the training and testing phases, the linear model with regularization achieved the target RMSE with only a small margin. Here we perform the final validation with the validation set.

mu_edx <- mean(edx$rating)

# Movie effect (bi)
b_i_edx <- edx %>%
  group_by(movieId) %>%
  summarize(b_i = sum(rating - mu_edx)/(n() + lambda))

# User effect (bu)
b_u_edx <- edx %>%
  left_join(b_i_edx, by = "movieId") %>%
  group_by(userId) %>%
  summarize(b_u = sum(rating - b_i - mu_edx)/(n() + lambda))

# Prediction
y_hat_edx <- validation %>%
  left_join(b_i_edx, by = "movieId") %>%
  left_join(b_u_edx, by = "userId") %>%
  mutate(pred = mu_edx + b_i + b_u) %>%
  pull(pred)

# Update the results table
result <- bind_rows(result,
                    tibble(Method = "Final Regularization (edx vs validation)",
                           RMSE = RMSE(validation$rating, y_hat_edx),
                           MSE = MSE(validation$rating, y_hat_edx),
                           MAE = MAE(validation$rating, y_hat_edx)))

# Show the RMSE improvement
result

Method                                      RMSE       MSE
Project Goal                                0.8649000  NA
Random prediction                           1.5012445  2.2537350
Mean                                        1.0600537  1.1237139
Mean + bi                                   0.9429615  0.8891764
Mean + bi + bu                              0.8646843  0.7476789
Regularized bi and bu                       0.8641362  0.7467313
Matrix Factorization - recosystem           0.7851291  0.6164277
Final Regularization (edx vs validation)    0.8648177  0.7479097

As expected, the RMSE calculated on the validation set (0.8648177) is below the target of 0.8649 and slightly higher than the RMSE on the test set (0.8641362).
Top 10 best movies:

validation %>%
left_join(b_i_edx, by = "movieId") %>%
left_join(b_u_edx, by = "userId") %>%
mutate(pred = mu_edx + b_i + b_u) %>%
arrange(-pred) %>%
group_by(title) %>%
select(title) %>%
head(10)

title

Usual Suspects, The (1995)

Shawshank Redemption, The (1994)

Shawshank Redemption, The (1994)

Shawshank Redemption, The (1994)

Eternal Sunshine of the Spotless Mind (2004)

Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)

Schindler’s List (1993)

Donnie Darko (2001)

Star Wars: Episode VI - Return of the Jedi (1983)

Schindler’s List (1993)

Top 10 worst movies

validation %>%
left_join(b_i_edx, by = "movieId") %>%
left_join(b_u_edx, by = "userId") %>%
mutate(pred = mu_edx + b_i + b_u) %>%
arrange(pred) %>%
group_by(title) %>%
select(title) %>%
head(10)

title
Battlefield Earth (2000)
Police Academy 4: Citizens on Patrol (1987)
Karate Kid Part III, The (1989)
Pokémon Heroes (2003)
Turbo: A Power Rangers Movie (1997)
Kazaam (1996)
Pokémon Heroes (2003)
Free Willy 3: The Rescue (1997)
Shanghai Surprise (1986)
Steel (1997)

3.6.2 Matrix Factorization

The initial test showed that matrix factorization gives the best RMSE. Now we validate it with the entire edx and validation sets.

set.seed(1234, sample.kind = "Rounding")

# Convert 'edx' and 'validation' sets to recosystem input format
edx_reco <- with(edx, data_memory(user_index = userId,
                                  item_index = movieId,
                                  rating = rating))
validation_reco <- with(validation, data_memory(user_index = userId,
                                                item_index = movieId,
                                                rating = rating))

# Create the model object
r <- recosystem::Reco()

# Tune the parameters
opts <- r$tune(edx_reco, opts = list(dim = c(10, 20, 30),
                                     lrate = c(0.1, 0.2),
                                     costp_l2 = c(0.01, 0.1),
                                     costq_l2 = c(0.01, 0.1),
                                     nthread = 4, niter = 10))

# Train the model
r$train(edx_reco, opts = c(opts$min, nthread = 4, niter = 20))
## iter tr_rmse obj
## 0 0.9728 1.2026e+007
## 1 0.8723 9.8942e+006
## 2 0.8404 9.1864e+006
## 3 0.8187 8.7792e+006
## 4 0.8026 8.4855e+006
## 5 0.7908 8.2918e+006
## 6 0.7809 8.1354e+006
## 7 0.7725 8.0127e+006
## 8 0.7654 7.9133e+006
## 9 0.7591 7.8295e+006
## 10 0.7538 7.7614e+006
## 11 0.7491 7.7028e+006
## 12 0.7448 7.6521e+006
## 13 0.7410 7.6053e+006
## 14 0.7374 7.5660e+006
## 15 0.7343 7.5314e+006
## 16 0.7314 7.5003e+006
## 17 0.7288 7.4738e+006
## 18 0.7263 7.4475e+006
## 19 0.7242 7.4260e+006
# Calculate the prediction
y_hat_final_reco <- r$predict(validation_reco, out_memory())

# Update the result table
result <- bind_rows(result,
                    tibble(Method = "Final Matrix Factorization - recosystem",
                           RMSE = RMSE(validation$rating, y_hat_final_reco),
                           MSE = MSE(validation$rating, y_hat_final_reco),
                           MAE = MAE(validation$rating, y_hat_final_reco)))

The final RMSE with matrix factorization is 0.7826974, about 9.5% better than the linear model with regularization (0.8648177).

# Show the RMSE improvement
result

Method                                      RMSE       MSE
Project Goal                                0.8649000  NA
Random prediction                           1.5012445  2.2537350
Mean                                        1.0600537  1.1237139
Mean + bi                                   0.9429615  0.8891764
Mean + bi + bu                              0.8646843  0.7476789
Regularized bi and bu                       0.8641362  0.7467313
Matrix Factorization - recosystem           0.7851291  0.6164277
Final Regularization (edx vs validation)    0.8648177  0.7479097
Final Matrix Factorization - recosystem     0.7826974  0.6126152

Now, let’s check the best and worst movies predicted with matrix factorization.
Top 10 best movies:

tibble(title = validation$title, rating = y_hat_final_reco) %>%
  arrange(-rating) %>%
  group_by(title) %>%
  select(title) %>%
  head(10)

title
Lord of the Rings: The Return of the King, The (2003)
Contempt (Mépris, Le) (1963)
Schindler’s List (1993)
Shawshank Redemption, The (1994)
Shawshank Redemption, The (1994)
Shawshank Redemption, The (1994)
Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)
Annie Hall (1977)
Boys Life 2 (1997)
Pulp Fiction (1994)

Top 10 worst movies:

tibble(title = validation$title, rating = y_hat_final_reco) %>%
  arrange(rating) %>%
  group_by(title) %>%
  select(title) %>%
  head(10)

title
Time Walker (a.k.a. Being From Another Planet) (1982)
Dead Girl, The (2006)
Giant Gila Monster, The (1959)
Phish: Bittersweet Motel (2000)
Pokémon Heroes (2003)
Bug (2007)
Blue Kite, The (Lan feng zheng) (1993)
Dirty Love (2005)
Mass Transit (1998)
Kiss Them for Me (1957)

4 Conclusion
We started by collecting and preparing the dataset for analysis, then explored the data for insights that might help during model building.
Next, we created a random model that predicts ratings based on the observed probability distribution of each rating. This model gives the worst result.
We then built a linear model, starting from the simplest possible prediction, the mean of the observed ratings. From there, we added movie and user effects, which model user behavior and movie rating distributions. With regularization, we added a penalty for movies and users with few ratings. The linear model achieved an RMSE of 0.8648177 on the validation set, successfully passing the target of 0.8649.
Finally, we evaluated the recosystem package, which implements the LIBMF algorithm, and achieved an RMSE of 0.7826974.

4.1 Limitations
Some machine learning algorithms are too computationally expensive to run on a commodity laptop and therefore could not be tested; the required amount of memory far exceeded what is available on a commodity laptop, even with increased virtual memory.
Only two predictors are used, the movie and user information; other features are not considered. Modern recommendation systems use many predictors, such as genres, bookmarks, playlists, etc.
The model works only for existing users, movies and rating values, so the algorithm must run every time a new user or movie is added, or when a rating changes. This is not an issue for a small client base and a few movies, but may become a concern for large datasets. The model should account for these changes and update the predictions as the information changes.
There is no initial recommendation for a new user or for users that rarely rate movies. Algorithms that use several features as predictors can overcome this cold-start issue.

4.2 Future Work

This report briefly describes simple models that predict ratings. There are two other widely adopted approaches not discussed here: content-based and collaborative filtering.
The recommenderlab package implements these methods and provides an environment to build and test recommendation systems. Besides recommenderlab, other packages for building recommendation systems are available on The Comprehensive R Archive Network (CRAN) website [13].

References
1. Rafael A. Irizarry (2019), Introduction to Data Science: Data Analysis and Prediction Algorithms with R
2. Yixuan Qiu (2017), recosystem: Recommender System Using Parallel Matrix Factorization
3. Michael Hahsler (2019), recommenderlab: Lab for Developing and Testing Recommender Algorithms. R package version 0.2-5.
4. Georgios Drakos, How to select the Right Evaluation Metric for Machine Learning Models: Part 1 Regression Metrics

1. https://www.edx.org/professional-certificate/harvardx-data-science
2. https://www.netflixprize.com/
3. https://www.edx.org/professional-certificate/harvardx-data-science
4. https://grouplens.org/
5. https://movielens.org/
6. https://grouplens.org/datasets/movielens/latest/
7. https://grouplens.org/datasets/movielens/10m/
8. A Matrix-factorization Library for Recommender Systems - https://www.csie.ntu.edu.tw/~cjlin/libmf/
9. https://cran.r-project.org/web/packages/recosystem/index.html
10. https://www.netflixprize.com/
11. https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html
12. https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html
13. https://cran.r-project.org/web/packages/available_packages_by_name.html
