Design and Analysis of A Recommendation System Based On Collaborative Filtering Techniques For Big Data
Abstract: Online search has become very popular, and users can easily search for any movie title; however, to search effectively,
users have to select a title that suits their taste. Otherwise, people will have difficulty choosing
the film they want to watch. The process of choosing or searching for a film in a large film database is currently
time-consuming and tedious. Users spend extensive time on the internet or on several movie viewing sites without success
until they find a film that matches their taste. This happens especially because people are often undecided about what to choose
and quickly change their minds. Hence, the recommendation system becomes critical. This study aims to reduce
user effort and facilitate the movie search task. Further, we used the root mean square error to evaluate and
compare the different models adopted in this paper. These models were employed with the aim of developing a
classification model for predicting movies. Thus, we tested and evaluated several collaborative filtering techniques. We
used four approaches to implement sparse matrix completion algorithms: k-nearest neighbors, matrix factorization,
co-clustering, and slope-one.
Key words: recommendation system; machine learning; collaborative filtering (CF); decision support system; big data
Najia Khouibiri, Yousef Farhaoui, and Ahmad El Allaoui are with STI Laboratory, IDM, T-IDMS, Faculty of Sciences and Techniques Errachidia, Moulay Ismail University, Meknes 5003, Morocco. E-mail: [email protected]; [email protected]; [email protected].
* To whom correspondence should be addressed.
Manuscript received: 2023-04-25; accepted: 2023-06-21
© All articles included in the journal are copyrighted to the ITU and TUP. This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/.
1 Introduction
With the advent of big data and the technological developments that marked the end of the 20th century and the beginning of this century, the amount of data to be exploited or analyzed has become very large. Knowing what data to look for and where to find them is usually tedious. One such data searching process includes selecting or searching for an online film from a large film database, which makes users spend long hours on the internet or on many movie viewing sites without success until they find a film that suits their taste. Therefore, film recommendation systems aim to assist film lovers by suggesting which movie to watch without going through the lengthy film selection process from a huge series of movies that extends to thousands and millions, which is time-consuming and confusing[1].
In this study, we present a film recommendation system based on Collaborative Filtering (CF) techniques. To this end, we implemented, tested, and evaluated several machine learning algorithms to develop a predictive film provider rating model. The remainder of this paper is organized as follows. A literature review of movie recommendation systems is provided in Section 2. In Section 3, we present the methodology that is employed, along with a discussion on machine learning models and two evaluation metrics. Section 4 discusses the results obtained in this study. Finally, Section 5 presents the conclusions and future studies[2].
2 Research background
2.1 Related work
Several studies have been conducted to recommend films.
For example, Ref. [3] suggests a movie recommendation system that predicts the user preference for a film based on different parameters using the K-means clustering and k-nearest neighbor (KNN) algorithms.
Najia Khouibiri et al.: Design and analysis of a recommendation system based on collaborative filtering techniques... 297
A hybrid recommendation system proposed in Ref. [4] is built by combining two techniques, CF and content-based filtering (CB), to provide accurate recommendations for movies. The content filtering part of the system has been adopted to train neural networks representing individual user preferences. Filtering results were combined using Boolean and fuzzy aggregation operators. The data adopted in this model led to highly accurate predictions.
Another study constructs a recommendation system based on cosine similarity using KNN with the support of the CF technique to eliminate the disadvantages of CB filtering[5]. Some scholars suggested the development of a recommendation system based on multiple algorithms to obtain groupings, such as K-means, mini-batch K-means, birch, affinity propagation, and other algorithms[6]. Additionally, several approaches have been presented to improve K-means so that not every cluster can dramatically augment the variance. For movies, this system is restricted to the use of groups based on type and tags.
Most of the above studies employ CF approaches, such as matrix factorization and neighborhood-based algorithms[7]. Other methods can be employed to predict missing viewer evaluations and find the list of movies that the user would like to watch. The main contribution of our study is the testing and evaluation of several strategies, including co-clustering and slope-one methods[8].
2.2 Recommendation system
The recommendation system is a valuable tool that provides the user with a list of suggestions and directs them to a group of sources that may be useful and interesting to them, which can be difficult to reach in a short period of time within the big data space. For this purpose, one of the following methods is used: CB, CF, or hybrid approaches[9, 10].
2.2.1 Content-based filtering
The CB technique is a domain-dependent algorithm that places greater emphasis on the analysis of elements that contribute to generating predictions. For the CB technique, the recommendation is based on the user’s profile using features extracted from the content of items that the user has evaluated in the past[11, 12]. Subsequently, it builds a user interest profile (see Fig. 1).
Fig. 1 Content-based recommendation system. (Diagram labels: client; search, like, consult, …; similar articles; recommended.)
2.2.2 Collaborative filtering
CF is an approach based on the sharing of opinions among users. It follows the principle of “word of mouth” that people always practice to build an opinion on a product or service they do not know. The basic premise of this method is that another user’s viewpoint can be used to provide a reasonable forecast of preferences to an active user for an item that they have not yet evaluated. This method assumes that if users have the same preferences for a set of items, they will probably have the same preferences for another set of items that they have not evaluated yet[13, 14]. For example, imagine that Ahmed’s neighbors discover that a newly opened restaurant in their neighborhood is a success; he will decide to try it. However, if most of his neighbors consider it a failure, he may decide not to go there. Similarly, CF techniques recommend items to the current user that are appreciated by users with the same tastes (see Fig. 2).
2.2.3 Hybrid recommendation system
A hybrid recommendation system combines two or more different referral approaches (CF and CB). The earlier approaches had various drawbacks, such as cold start or data scarcity. These issues are frequently resolved by combining two or more techniques. Moreover, with this hybridization, it is feasible to
298 Intelligent and Converged Networks, 2023, 4(4): 296−304
Fig. 3 Most common datasets for studying recommendation systems.
† http://www.movielens.org/
§ http://www.grouplens.org/
¶ https://www.grouplens.org/datasets/movielens/25m/
The second file, ratings.csv, contains a table of user ratings for movies. It has four columns and 697 561 rows. The columns are movieId, rating (users can rate movies from 0.5 to 5), timestamp (the date and time of the rating), and userId. We removed the timestamp column because it serves no purpose for us. It can be seen that the movies data contain movieId, the title of the film, and its genre. We need a dataset containing the userId (to extract user data; thus, we will be able to use user data to increase the precision of recommendations because MovieLens does not offer a table relating to users), movie titles, and ratings. This information is included in two different data frame objects: df_ratings and df_movies. To obtain the desired information in a single data frame (Table 1), we can merge these two data frame objects on the movieId column, as it is common to both data frames. We can do this using the merge() function of the Pandas library.
Table 1 Statistic results on the final dataset.
Entry         Number
Movie         62 000
Unique user   100 000
Rating        400 000
In our study, finding the best machine learning model that can accurately predict the missing ratings is a difficult task. This is the reason we remove users with only one review (we only retain viewers who have more reviews than the average number of reviews per viewer). However, there is nearly 99% sparsity in the resulting movie rating matrix.
3.4 Suggested solution
In a study on the health care provider’s recommendation system[28], several methods are studied, e.g., the neighborhood method and latent factor models. Additionally, the proposed solution is applied in four stages. In our study, we worked on larger data (including the recommendation system to address the problem of user loss amid big data) and tried to work on some algorithms that are applied in the aforementioned study and test them on a large dataset and other data in a movie recommendation system[29, 30]. The following strategy is suggested for developing a movie recommendation system based on the movie rating model.
(1) Gather data in the form of explicit movie viewer ratings (user ratings) and then prepare and explore it beforehand.
(2) Test and evaluate various machine learning models on the prepared data using a cross-validation technique and then choose the model with the best performance.
(3) Deploy the trained model to develop the desired recommendation system.
Figure 4 shows the various phases of the suggested solution.
3.4.1 Machine learning models
Systems using the CF technique should compare objects that are significantly different from one another: items in relation to users. The neighborhood method and latent factor models are the two main strategies for facilitating such a comparative evaluation. Additionally, the co-clustering and slope-one methods have been suggested in the literature to deal with the recommendation issue. The machine learning models that we have employed to forecast missing ratings are presented in this part.
Neighbors-based models. There are two main stages for suggesting recommendations based on the neighbors-based model. The first stage is to establish the neighborhood, and the second stage is to make recommendations.
During the neighborhood build process, similarity between users (called a user-based approach) or elements (item-based approach) is measured. The two most widely used similarity measures are Pearson’s correlation (PC) coefficient (Eq. (1)) and cosine-based similarity (Eq. (2)).
$$\mathrm{PC}(x, y) = \frac{\sum_{i=1}^{n} (x_i - x')(y_i - y')}{\sqrt{\sum_{i=1}^{n} (x_i - x')^2}\,\sqrt{\sum_{i=1}^{n} (y_i - y')^2}} \tag{1}$$
where x and y are two vectors of length n. The average values of vectors x and y are represented by x' and y', respectively. PC determines the relationship between two sets of data, x and y.
Fig. 4 Phases of the suggested solution. (Diagram labels: preparation & exploration; cross validation of the machine learning models, namely the neighbors-based, co-clustering-based, and slope-one-based models; evaluation; best model; construction of our movie recommendation system.)
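The evaluation loop outlined in Fig. 4 (cross-validate several candidate models, compare their RMSE, keep the best one) can be sketched as follows. This is only a sketch under stated assumptions: the two candidates (`fit_global_mean`, `fit_user_mean`) are simple stand-in baselines rather than the models tested in this paper, the ratings are synthetic, and all function names are hypothetical.

```python
import math
import random


def rmse(pairs):
    """Root mean squared error over (predicted, actual) rating pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


def fit_global_mean(train):
    """Stand-in baseline: always predict the mean of all training ratings."""
    mu = sum(r for _, _, r in train) / len(train)
    return lambda user, item: mu


def fit_user_mean(train):
    """Stand-in baseline: predict each user's own mean training rating."""
    mu = sum(r for _, _, r in train) / len(train)
    totals = {}
    for user, _, r in train:
        s, c = totals.get(user, (0.0, 0))
        totals[user] = (s + r, c + 1)
    means = {u: s / c for u, (s, c) in totals.items()}
    # Fall back to the global mean for users unseen during training.
    return lambda user, item: means.get(user, mu)


def cross_validate(ratings, fit, k=5, seed=0):
    """Mean RMSE over k folds; with k=5 each fold trains on 80% and tests on 20%."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = fit(train)
        scores.append(rmse((predict(u, m), r) for u, m, r in test))
    return sum(scores) / k


# Synthetic (userId, movieId, rating) triples: each user has a personal
# rating level plus a small per-movie offset. Not real MovieLens data.
ratings = [
    (u, m, 1.0 + 0.35 * u + 0.5 * (m % 3 - 1))
    for u in range(10)
    for m in range(6)
]

candidates = {"global_mean": fit_global_mean, "user_mean": fit_user_mean}
results = {name: cross_validate(ratings, fit) for name, fit in candidates.items()}
best = min(results, key=results.get)  # candidate with the lowest mean RMSE
```

The same loop applies unchanged to any candidate that exposes a fit step returning a `predict(user, item)` function, so neighborhood, latent factor, co-clustering, or slope-one models could be plugged in as additional entries of `candidates`.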
contrast between the ratings of the two elements is the only free parameter. In some cases, it has proved to be considerably more precise than the linear regression of the ratings of one element on the ratings of another element.
Therefore, the prediction is calculated using the following relationship:
$$\hat{r}_{u,i} = \mu_u + \frac{1}{|R_i(u)|} \sum_{j \in R_i(u)} \mathrm{dev}(i, j) \tag{5}$$
where Ri(u) is the collection of pertinent elements (i.e., the collection of elements j rated by u that have at least one rating user in common with element i), and dev(i, j) represents the average difference between the ratings of elements i and j, which is calculated using Eq. (6):
$$\mathrm{dev}(i, j) = \frac{1}{|U_{ij}|} \sum_{u \in U_{ij}} (r_{u,i} - r_{u,j}) \tag{6}$$
where Uij represents all users that rated both items i and j.
Co-clustering-based model. In the field of data mining, the term “clustering” denotes the process of grouping similar objects into the same group or cluster. Clustering is an unsupervised learning technique. According to the type of data, different aggregation techniques could be applied. The user-element rating matrix is used as data in the case of CF. Users and elements are assigned to clusters Cu and Ci, respectively, and to co-clusters Cui using a bi-clustering technique. Clusters are selected using an uncomplicated optimization technique, similar to K-means[17]. We can calculate the predicted rating using Eq. (7):
$$\hat{r}_{ui} = C_{ui} + (\mu_u - C_u) + (\mu_i - C_i) \tag{7}$$
where Cui represents the average evaluation for the Cui co-cluster, Cu represents the average evaluation of the cluster to which the user u is considered to belong, and Ci represents the average evaluation of the cluster to which element i is considered to belong.
3.4.2 Evaluation metric
The most popular and widely used scales for evaluating recommendation systems are the root mean squared error (RMSE) and mean absolute error scales. In this study, we used the RMSE scale to evaluate our recommendation system, which can be calculated using Eq. (8).
$$\mathrm{RMSE} = \sqrt{\frac{\sum (\hat{r}_{ui} - r_{ui})^2}{n}} \tag{8}$$
where r̂ui is the predicted rating of user u for item i, rui is the rating that was actually given, and n is the size of the test set.
In this paper, on all of our samples, we compute the RMSE using 5-fold cross-validation. In each fold, we train our model using 80% of the data, and the remaining 20% is used for testing the accuracy.
4 Result
The outcomes of our testing and evaluation of various methods of the primary models mentioned above are summarized in Table 2.
It can be seen that the baseline user-based CF KNN is the best model in terms of RMSE value. Our aim is to find the best (optimal) metrics for each of the models. A detailed summary of the findings is provided in Table 2.
5 Conclusion and perspective
This study aimed to develop recommendation systems using the CF approach with several machine learning models. Our testing experiments proved that the
heart patients? Moreover, how many teenagers, due to the error of recommending films that do not agree with their age, have committed suicide?
In essence, the more information we collect, the greater the significance of similarity calculations, and the more accurate and safer the recommendation system is for the user’s life, because human life does not accept any room for error. Therefore, relying solely on machine learning techniques is insufficient. Instead, we must look forward to developing a hybrid recommendation system that adopts various deep learning techniques and integrates data mining techniques to eliminate the cold start problem.
References
Recommendation systems: Principles, methods and evaluation, Egypt. Inform. J., vol. 16, no. 3, pp. 261–273, 2015.
[8] I. Benouaret, Un système de recommandation contextuel et composite pour la visite personnalisée de sites culturels, (in French), Ph. D. dissertation, University of Technology of Compiègne, France, 2017, pp. 181.
[9] M. Baidada, K. Mansouri, and F. Poirier, Hybrid filtering recommendation system in an educational context, Int. J. Web Based Learn. Teach. Technol., vol. 17, no. 1, pp. 1–17, 2022.
[10] J. Beel and V. Brunel, Data pruning in recommender systems research: Best-practice or malpractice? in Proc. ACM RecSys 2019 Late-Breaking Results & 13th ACM Conf. Recommender Systems, Copenhagen, Denmark,