
Volume 8, Issue 5, May – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Movie Recommender System using Content Based and Collaborative Filtering

Harshal Fulzele1, Mihir Bhoite2, Prajwal Kanfade3, Ashutosh Yadav4
1,2,3,4 Artificial Intelligence, G.H. Raisoni College of Engineering, Nagpur, India

Madhuri Sahu5, Achamma Thomas6
5,6 Assistant Professor, Artificial Intelligence, G.H. Raisoni College of Engineering, Nagpur, India

Abstract:- Technology has evolved a lot, from basic techniques to advanced ones such as machine learning, deep learning, the Internet of Things, data mining and many more. Recommender systems provide users with personalized suggestions for products or services, and most such systems rely only on collaborative filtering. Movies are a major source of entertainment, but finding the desired content is the problem. The aim of this paper is to improve the accuracy and performance of the regular filtering technique and also to recommend movies based on the content of the movies users have watched earlier. Collaborative filtering recommends movies to user A based on the interests of a similar user B. Netflix internally uses the Cinematch algorithm for collaborative filtering; we improve the accuracy and performance of this regular technique. Content based filtering can help Netflix boost its turnover by suggesting movies similar to those users have watched earlier on any OTT (Over The Top) platform. We have used the Surprise library along with the XGBoost regressor, which improves our model over the regular technique. We have also designed the frontend for the content based recommendation system for Netflix.

Keywords:- Content Based, Collaborative Filtering, Recommender System, Surprise-Library, User-Based Recommender, Item-Based Recommender.
I. INTRODUCTION

Netflix is basically an online repository where we can watch web series, movies, documentaries and so on. On a Netflix account, movies are recommended to us with a message such as "because you watched this TV show or series", e.g. Roman Empire, "you may like these movies or TV shows", e.g. Spartacus or The Last Kingdom. Netflix recommends movies or TV shows similar to the ones we watched previously. The question is: how does it understand what we will like? Denote the users as ui and the movies as mj, and suppose user i rates movie j as rij. The recommendation is based on the movies a user has watched and the ratings, between 1 and 5 stars, given for them:

Fig 1 Dataset

Using this rating information from thousands and thousands of users and tens of thousands of movies, we can predict the type of movie a user is likely to enjoy in the future. We use the predicted rating as the feature for recommending movies of a similar type. The higher the predicted rating, the higher the chance of recommending that movie to the user. We use the Surprise library for this approach, which is also called the simple Python recommendation system engine, and we improve on the Cinematch algorithm by combining collaborative filtering with the XGBoost regressor.

Netflix's own algorithm, the Cinematch system, has some errors, which are measured using root mean square error. We want to improve on the Cinematch algorithm and add a content based approach to the solution.

II. DATA

We got the data from Netflix. Movie ids range from 1 to 17770 sequentially, customer ids range from 1 to 2649429, and there are 480,189 users. Ratings are on a five star scale from 1 to 5, and dates have the format YYYY-MM-DD. The data consists of five text files: each movie block starts with the movie id followed by a colon, and each following line holds a customer id, rating and date, e.g. "1:" followed by "1488844,3,2005-09-06".
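A minimal sketch, assuming the raw files follow the "movie id:" block format shown above and are named like combined_data_*.txt as in the public Netflix Prize release (the file names are an assumption), of collecting all ratings into a single data.csv as done later in the exploratory analysis:

import csv
import glob

with open("data.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["movie", "user", "rating", "date"])
    for name in sorted(glob.glob("combined_data_*.txt")):   # assumed file names
        with open(name) as f:
            movie_id = None
            for line in f:
                line = line.strip()
                if line.endswith(":"):        # a new movie block starts, e.g. "1:"
                    movie_id = line[:-1]
                elif line:                     # "customer_id,rating,date"
                    user, rating, date = line.split(",")
                    writer.writerow([movie_id, user, rating, date])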

III. EXISTING SYSTEM

Netflix does not provide a content based approach across platforms, that is, it does not show similar movies based on what you watched elsewhere. If you watched some movies on other platforms such as Amazon Prime, Hotstar or Voot, it does not show the same type of movie. Apart from this, Netflix internally uses the Cinematch algorithm for collaborative filtering, which needs some improvement in accuracy.

IV. OBJECTIVES AND CONSTRAINTS

Assume there is a movie Mj that the user has not yet watched. The algorithm we build will try to predict how much user i will rate that movie; rij hat is this predicted rating, and based on it Netflix would recommend movies to us. Our objective is to minimize the difference between rij and rij hat, that is, between the actual and predicted ratings; we can measure this using root mean squared error (RMSE) or mean absolute percentage error (MAPE). There are also some constraints. One is interpretability: it matters that we can explain why a particular movie is recommended. We do not need a low latency system, because Netflix can precompute what to recommend to each user in a hash table and refresh it on a nightly basis, meaning that after 24 hours the recommendations may change and keep improving as there are more users, more movies and a better algorithm.
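Written out for clarity, the two error measures over the N rated (user, movie) pairs, with rij the actual rating and rij hat the prediction, are:

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(i,j)} \left( r_{ij} - \hat{r}_{ij} \right)^{2}}
\qquad
\mathrm{MAPE} = \frac{100}{N} \sum_{(i,j)} \left| \frac{r_{ij} - \hat{r}_{ij}}{r_{ij}} \right|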
V. PROPOSED SYSTEM

We build the front end for a content based approach so that users can find movies similar to the ones they have watched earlier on any other OTT platform. For this we have data from IMDb, which gives the genre of each movie, the actors and the production house, so we can recommend movies using the genres of the movies the user has watched before.
recommended which is very important we do not need low
latency system Netflix we would' pre compute what to Next part of preprocessing is if there is any NaN
recommend users in the hash table and precompute nightly values in the data using df.isnull().any() we found NaN
basis means after 24 hours recommendation may change and values in ourdataframe
always gets improved as more users more movies and better
the algorithm. Then we checked if there were anyduplicates by any
change using df.duplicated() And there were Zeroduplicates.
VI. IMPORTING LIBRARIES

We use Pandas, NumPy, Matplotlib, Seaborn, SciPy, scikit-learn and Surprise, along with the datetime library to measure how much time the code takes to run.

VII. EXPLORATORY DATA ANALYSIS

We want to build a CSV of triplets (ui, mj, rij) together with the date. For this we created a file data.csv: we read all the raw files and write everything into data.csv, then look at the rating field using df.describe().

We sort all the data by date using df.sort_values, so the oldest date is the first entry and dates keep increasing, and then call df.describe().

Fig 2 Describe

From this we see that the maximum rating is 5 and the 75th percentile is 4.

The next part of preprocessing is checking for NaN values using df.isnull().any(); we found NaN values in our dataframe. Then we checked for duplicates using df.duplicated(), and there were zero duplicates.

Next we computed some basic statistics to find how many ratings, users and movies we have, using np.unique on our dataframe; we found 17,770 movies in total.

Fig 3 Total Data

Splitting the data into train and test (80:20): how do we split it? Given the data, we make our model learn from it, and in the future we want to predict ratings by deploying the model into production, so it makes complete sense to respect the temporal structure of the data. We therefore took 80% of the data as train and 20% as test, split along the time axis. After this 80:20 split, the train data has 80,384,405 ratings, 405,041 users and 17,424 movies, while the unseen test data has 349,312 users and 17,577 movies; roughly, the 100 million ratings are broken into 80 million in train and 20 million in test.
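A compact sketch of these preprocessing and splitting steps (column names follow the data.csv layout assumed earlier):

import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["date"]).sort_values(by="date")  # oldest rating first
print(df["rating"].describe())          # max 5, 75th percentile 4
print(df.isnull().any())                # NaN check
print(df.duplicated().sum())            # duplicate rows
print(df["movie"].nunique(), df["user"].nunique())

# Time-based 80:20 split: the oldest 80% of ratings form the train set.
split_point = int(len(df) * 0.80)
train_df, test_df = df.iloc[:split_point].copy(), df.iloc[split_point:].copy()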

VIII. EXPLORATORY DATA ANALYSIS ON TRAIN DATA

First we looked at the distribution of the data across the ratings 1, 2, 3, 4 and 5.

Fig 4 Training Dataset

From this we found that 4 is the most frequent rating, which tells us that ratings tend to be high rather than low.

Next we added a column called day of week, which helps us analyze the data better, and plotted the number of ratings per month; we have data from 1999 up to 2006.

Fig 5 No. of Ratings

From this we see that the number of ratings increased sharply, reaching close to 4.5 million per month in 2004 and 2005. That is massive growth, and it also shows that the train data covers a wide span of time while the test data covers less. We then analyzed the number of ratings given per user: user id 305344 has given 17,112 ratings, and user id 1461435 has given 9,447 ratings.

Fig 6 Ratings by Users

Fig 7 Ratings by Users

Since we were curious about this, we plotted the PDF and CDF of ratings per user and found that most users give very few ratings while a few users give a very large number. We also found the average number of movies rated per user: the mean was 198, which shows that people who use Netflix rate a lot of movies.

Fig 8 Netflix Rate

Fig 9 No. of Ratings
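A short sketch of these per-month and per-user summaries (building on train_df from the split above; the plotting details are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Ratings per month across the train period (1999-2006).
per_month = train_df.resample("M", on="date")["rating"].count()
per_month.plot()
plt.ylabel("no. of ratings per month")
plt.show()

# Ratings per user: mean around 198, but heavily skewed, as the CDF shows.
ratings_per_user = train_df.groupby("user")["rating"].count().sort_values(ascending=False)
print(ratings_per_user.mean())
cdf = np.cumsum(ratings_per_user.values) / ratings_per_user.sum()
plt.plot(cdf)
plt.xlabel("users ranked by activity")
plt.ylabel("cumulative share of ratings")
plt.show()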

Then we zoomed into the range between the 75th percentile and the maximum and plotted the quantiles.

Fig 10 Ratings

We found that even the 95th percentile is quite low, so we zoomed into the 95th to 100th percentile values to see how many ratings the heaviest raters give; there were up to 20,305 ratings for a single user, which is acceptable.

The median user rated 89 movies, so 50 percent of users rated fewer than 89 movies and 50 percent rated more.

Then we plotted the number of ratings per weekday, and found that Saturday and Sunday traffic is much higher.

Fig 11 Total No. of Ratings

IX. CREATING SPARSE MATRIX FROM DATA FRAME

We have a table of movie id, user id and rating; we can now discard the date because it is of no further use.

Fig 12 Data Frame
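A sketch of the quantile and weekday summaries above (reusing ratings_per_user and train_df from the earlier sketches):

# Quantiles of ratings-per-user, then a zoom into the upper tail.
print(ratings_per_user.quantile(np.arange(0.0, 1.01, 0.05)))
print(ratings_per_user.quantile(np.arange(0.95, 1.001, 0.005)))
print(ratings_per_user.median())        # about 89 movies for the median user

# Number of ratings per weekday.
train_df["day_of_week"] = train_df["date"].dt.day_name()
print(train_df["day_of_week"].value_counts())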
Since a user rates only some movies, not all of them, most entries of the user-movie matrix do not exist: it is a sparse matrix. Using SciPy we converted the data to sparse matrices with the csr_matrix command, where CSR stands for compressed sparse row. If the sparsity of the matrix is 99 percent, then 99 percent of the cells hold no rating and only 1 percent hold a nonzero rating.

We computed the sparsity of the matrix on the train data and got about 99.8 percent sparsity; on the test data it is 99.95 percent. This shows the data is extremely sparse.
there might be new movies. For a person or user we have
1500 data where there is no data. This cold start problem in
a recommendation system. And when we looked for the cold
start problem with users it found that 15% of the users we
are not present to new users. Total movies are 17770 and we
have 17424 in trend data, that is 346 movies did not appear
in train data.

That is 1.95 % which is low. A cold start can kill our


recommendation system so we have to keep these in the
back of our mind.

XII. COMPILE SIMILARITY MATRIX


Fig 11 Total No. of ratings
The training data has 405 K rows and 17k columns
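A short sketch of the global, per-user and per-movie averages computed from the sparse matrix (zeros mean "no rating", so we divide by the count of stored entries, not by the matrix size):

global_avg = train_sparse.sum() / train_sparse.count_nonzero()

user_sums = train_sparse.sum(axis=1).A1          # total rating given by each user
user_counts = train_sparse.getnnz(axis=1)        # number of ratings per user
user_avg = {u: s / c for u, (s, c) in enumerate(zip(user_sums, user_counts)) if c}

movie_sums = train_sparse.sum(axis=0).A1
movie_counts = train_sparse.getnnz(axis=0)
movie_avg = {m: s / c for m, (s, c) in enumerate(zip(movie_sums, movie_counts)) if c}

print(global_avg)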
XII. COMPUTING THE SIMILARITY MATRIX

The training matrix has about 405k rows and 17k columns, where each row is a user and each column is a movie. For user-user similarity, each user ui is a sparse vector of 17k dimensions, and so is uj; we can compute their cosine similarity via the dot product uiT uj and use it to find the top similar users for each user.

Fig 13 Time Per User

- From the above plot, it took roughly 8.88 seconds to compute the similar users for one user.
- We have 405,041 users in the training set.
- 405041 x 8.88 = 3,596,764.08 sec = 59,946.07 min = 999.1 hours = 41.63 days.
- Even if we run on 4 cores in parallel (a typical system nowadays), it will still take almost 10 and a half days.
- Instead, we will try to reduce the dimensions using SVD, so that it might speed up the process.

Instead of 17k dimensions, we can use SVD or PCA as a dimensionality reduction technique and reduce each user vector to 500 dimensions. But this takes just as much time, because after SVD the matrices are dense, so PCA and SVD do not help here. We are stuck with a big problem.
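A sketch of the dimensionality-reduction attempt described above, assuming scikit-learn's TruncatedSVD (which accepts sparse input but returns a dense result, which is exactly why it did not help here):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=500, random_state=15)
user_factors = svd.fit_transform(train_sparse)   # shape (n_users, 500), dense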

As a workaround, we maintain a binary vector for users which tells us whether we have already computed the similar users for a given user. If not, we compute the top 1000 most similar users for that user and add them to our data structure, so that later we can access them without computing them again. In production we have to recompute similarities that were computed a long time ago, because user preferences change over time, so we could maintain some kind of timer. We have chosen to keep a dictionary of dictionaries, where the key is a user and the similar users are stored in the values. This is a software engineering hack which speeds things up.
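A minimal sketch of this on-demand computation with a dictionary cache (cosine similarity from scikit-learn on the CSR rows; the cache layout is an assumption):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

similar_users_cache = {}    # user id -> {similar user id: similarity}

def top_similar_users(user_id, k=1000):
    if user_id not in similar_users_cache:
        sims = cosine_similarity(train_sparse[user_id], train_sparse).ravel()
        top = np.argsort(-sims)[1:k + 1]            # skip the user itself
        similar_users_cache[user_id] = dict(zip(top.tolist(), sims[top].tolist()))
    return similar_users_cache[user_id]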

XIII. COMPUTING MOVIE : MOVIE SIMILARITY

Each movie vector has about 405k dimensions (one per user), which is very large, but the vectors are very sparse. So we can compute the movie-movie similarity matrix from the dot products mi transpose mj. This matrix is dense, with on the order of 144 million computations, which is manageable: it took nearly 10 minutes to compute, and to find the movies similar to a movie we care about we proceed as before with a dictionary. Even though we have a similarity measure of each movie with every other movie, we generally do not care about the least similar movies; most of the time only the top similar items matter, maybe 10 or 100.

We picked the movie Vampire Journals, computed its similarity row and listed the hundred most similar movies using cosine similarity. The top movies similar to Vampire Journals were Modern Vampires and other vampire titles, which are indeed very similar movies.

Fig 14 Similar Movies to Vampire Journals
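A sketch of the movie-movie similarity with a top-100 lookup (transposing the CSR matrix so that rows are movies, and keeping the result sparse):

from sklearn.metrics.pairwise import cosine_similarity

movie_sims = cosine_similarity(train_sparse.T, dense_output=False)   # movies x movies

def top_similar_movies(movie_id, top_n=100):
    row = movie_sims[movie_id].toarray().ravel()
    return np.argsort(-row)[1:top_n + 1]            # skip the movie itself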

XIV. MACHINE LEARNING MODEL

The Surprise library makes data handling very easy: if we give it data in triplet format, it handles everything for us. It provides various ready-to-use algorithms such as KNN, SVD, PMF and NMF. Surprise is a simple Python recommendation system engine; we installed the library and then imported it.

We designed our approach as follows: we took the data, sampled a subset of it for training our machine learning models, and at first came up with 13 features (a sketch of assembling them follows this list):

- GAvg: the average of all ratings.
- Similar users' ratings of this movie: sur1, sur2, sur3, sur4, sur5 (the ratings given to this movie by the top 5 similar users who rated it).
- Similar movies rated by this user: smr1, smr2, smr3, smr4, smr5 (the ratings given by this user to the top 5 similar movies).
- UAvg: the user's average rating.
- MAvg: the average rating of this movie.
- Rating: the rating of this movie by this user (the target).
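A sketch, using the hypothetical helpers from the earlier sketches (so the exact feature extraction here is an assumption), of assembling this feature row for one (user, movie, rating) triplet:

def make_features(user, movie, rating):
    # Top-5 similar users' ratings of this movie and this user's ratings of the
    # top-5 similar movies; in practice similar users are filtered to those who
    # actually rated the movie, with fallbacks when fewer than five exist.
    sur = [train_sparse[u, movie] for u in list(top_similar_users(user))[:5]]
    smr = [train_sparse[user, m] for m in top_similar_movies(movie)[:5]]
    u_row, m_col = train_sparse[user], train_sparse[:, movie]
    u_avg = u_row.sum() / max(u_row.count_nonzero(), 1)
    m_avg = m_col.sum() / max(m_col.count_nonzero(), 1)
    return [global_avg] + sur + smr + [u_avg, m_avg] + [rating]   # 13 features + target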

Fig 15 Movie Recommender System Approach

- We featurize the data for regression and use XGBoost regression with RMSE as the error. Then we format the data for the Surprise library and use its baseline model, again with RMSE. Using steps 1 and 2 we run XGBoost regression on these features with RMSE. Then we use Surprise KNN with user-user similarity and RMSE as the error, and Surprise KNN with item-item similarity as step 4. We combine steps 1, 2, 3 and 4 as features and use the XGBoost regressor with RMSE as the error. For feature 5 we use matrix factorization SVD, and for feature 6 matrix factorization SVD++; using all of the feature sets 1 to 6 we run XGBoost regression with RMSE as the error.

- Step 1 is to sample the data. Our train matrix is 405k x 17k and the test matrix is 349k x 17k, so we use 10k users and 1k movies as train data and 5k users and 500 movies as test data. We then work on the models, find which one is best, and use that model on all the data.

- Step 2: we use the 13 features for the XGBoost regression, such as the global average, similar users' ratings of this movie, the user's average rating, the average rating of the movie, and the rating of the movie by the user. We also need to transform the data for a Surprise model. We then applied an actual machine learning model, XGBoost, with the 13 features, and trained with two error metrics, RMSE and MAPE.
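A sketch of the XGBoost regression over the 13-feature table (X_train, y_train, X_test and y_test are assumed to be arrays built from the feature rows above; the hyperparameters are placeholders):

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

model = xgb.XGBRegressor(n_estimators=100, random_state=15)
model.fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
mape = np.mean(np.abs((y_test - pred) / y_test)) * 100
print(rmse, mape)
print(model.feature_importances_)    # user and movie averages ranked highest for us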

XV. XGBOOST WITH 13 FEATURES

We got an RMSE of 1.076 and a MAPE of 34.50; the most important feature is the user's average rating, and the second is the movie's average rating.

Fig 16 XGBoost with 13 Features

In Surprise, the baseline model is trained with SGD with a learning rate of 0.001. We want to minimize the difference between the actual and predicted ratings, and we use L2 regularization. The RMSE here is 1.073 and the MAPE 34.04; with the 13 features alone the RMSE was 1.076, so this is slightly better.

Now we mix the 13 features and the baseline model: the 14th feature, BSIPR, is the output of the baseline model, and we apply XGBoost on top of this. The RMSE is 1.076 and the MAPE 34.49, and BSIPR comes out among the important features.
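A sketch of the Surprise baseline model with SGD as described above (the data loading follows Surprise's standard Reader/Dataset API; the epoch count and regularization strength are placeholders):

from surprise import BaselineOnly, Dataset, Reader, accuracy

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(train_df[["user", "movie", "rating"]], reader)
trainset = data.build_full_trainset()

bsl_options = {"method": "sgd", "learning_rate": 0.001, "reg": 0.02, "n_epochs": 20}
algo = BaselineOnly(bsl_options=bsl_options)
algo.fit(trainset)

testset = list(zip(test_df["user"], test_df["movie"], test_df["rating"]))
predictions = algo.test(testset)
accuracy.rmse(predictions)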
Surprise KNN is our next step; it internally uses similar users and similar movies, taking the k users most similar to user u who rated movie i, measured with cosine similarity. Here, however, we use the baseline Pearson correlation coefficient, and we get an RMSE of 1.0726 and a MAPE of 35.02. Then we use the matrix factorization technique (SVD), for which the RMSE is 1.072 and the MAPE 35.01, the lowest test error among all the models trained so far. Finally we tried SVD++, for which the RMSE is also 1.072 and the MAPE is about the same.
use that model for allthe data.

XVI. RESULT

Fig 17 Content Based Movie Recommendation

We designed the front end for the content based recommendation system. We can select a movie from the dropdown or search for it by name; for example, if we select Spiderman and click on "Show Recommendations", it shows all the movies similar to Spiderman or of the same type.
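A minimal sketch of this Streamlit front end (movies and recommend are the hypothetical metadata table and similarity function from the content based sketch earlier):

import streamlit as st

st.title("Movie Recommender System")
choice = st.selectbox("Select or search for a movie", movies["title"].values)
if st.button("Show Recommendations"):
    for title in recommend(choice):
        st.write(title)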
Fig 18 Result

In collaborative filtering, SVD has the lowest RMSE among all the models.
XVII. CONCLUSION

In this paper we used content based filtering as well as collaborative filtering to improve the recommendation system.

Content based filtering works successfully with Streamlit and recommends movies of a similar type. For collaborative filtering, the lower the RMSE the better the model; we sorted all the models and the best among them was SVD, with an RMSE of about 1.0726. All the values are very close, but in terms of percentage difference this is a 0.35% improvement, which is still good across all the data; as soon as we train on the full data set we expect to get the best results.
