Final Report 18.7.24
INTRODUCTION
The explosive growth in the amount of available digital information and the number of visitors to
the Internet have created a challenge of information overload, which hinders timely access
to items of interest on the Internet. Information retrieval systems, such as Google, Devil Finder and
AltaVista, have partially solved this problem, but prioritization and personalization (where a system
maps available content to a user’s interests and preferences) of information were absent. This has
increased the demand for recommender systems more than ever before. Recommender systems are
information filtering systems that deal with the problem of information overload by filtering vital
information fragments out of a large amount of dynamically generated information according to the user’s
preferences, interests, or observed behavior about items. A recommender system has the ability to
predict whether a particular user would prefer an item or not based on the user’s profile.
Recommender systems are beneficial to both service providers and users. They reduce the transaction
costs of finding and selecting items in an online shopping environment. Recommendation systems have
also been shown to improve the decision-making process and its quality. In an e-commerce setting, recommender
systems enhance revenues because they are an effective means of selling more products. In
scientific libraries, recommender systems support users by allowing them to move beyond catalog
searches. Therefore, the need to use efficient and accurate recommendation techniques within a
system that will provide relevant and dependable recommendations for users cannot be
overemphasized.
Chapter 2
LITERATURE REVIEW
Kang et al. [1] have presented a recommender system for personalized advertisements in Online
the user preferences to minimize the overhead of preference prediction and using a HashMap along
with the tree characteristics. Ullah et al. [2] implemented an image-based service
recommendation model for online shopping based on random forest and Convolutional Neural
Networks (CNN). The model used JPEG coefficients to achieve an accurate prediction rate. Cai et al.
[3] proposed a new hybrid recommender model using a many-objective evolutionary algorithm
(MaOEA). Esteban et al. [4] implemented a hybrid multi-criteria recommendation system
concerned with students’ academic performance, personal interests, and course selection. The
system was developed using a Genetic Algorithm (GA) and aimed at helping university students. It
combined both course information and student information for increasing system performance and
the reliability of the recommendations. Mondal et al. [5] have built a multilayer, graph data model-
based doctor recommendation system by exploiting the trust concept between patient-doctor
relationships. The proposed system showed good results in practical applications. In 2021, Dhelim
et al. [6] developed a personality-based product recommending model using the techniques of
meta path discovery and user interest mining. This model showed better results when compared to
session-based and deep learning models. Bhalse et al. [7] proposed a web-based movie
(SVD), collaborative filtering and cosine similarity (CS) for addressing the sparsity problem of
of movies. Similarly, to solve both sparsity and cold-start problems Ke et al. [8] proposed a
dynamic goods recommendation system based on reinforcement learning. The proposed
system was capable of learning from the
reduced entropy loss error on real-time applications. Chen et al. [9] have presented a movie
recommender model combining various techniques like user interest with category-level
representation, neighbor-assisted representation, user interest with latent representation and item-
level representation using FNN. Knowledge representation learning [10] systems aim to simplify
the model development process by increasing the acquisition efficiency, inferential efficiency,
detecting user-related meta-data which is employed to increase the overall model performance.
Chapter 3
The use of efficient and accurate recommendation techniques is very important for a system that
will provide good and useful recommendations to its individual users. This explains the
analysis of the attributes of items in order to generate predictions. When documents such as web
pages, publications and news are to be recommended, the content-based filtering (CBF) technique is the
most successful. In content-based filtering, recommendations are made based on user
profiles, using features extracted from the content of the items the user has evaluated in the past.
Items that are most closely related to the positively rated items are recommended to the user. CBF
uses different types of models to find similarity between documents in order to generate meaningful
recommendations. It could use a Vector Space Model such as Term Frequency-Inverse Document
Frequency (TF-IDF), or probabilistic models such as a Naïve Bayes classifier, decision trees or
neural networks, to model the relationship between different documents within a corpus. These
techniques make recommendations by learning the underlying model with either statistical
analysis or machine learning techniques. The content-based filtering technique does not need the
profiles of other users since they do not influence recommendations. Also, if the user profile
changes, the CBF technique still has the potential to adjust its recommendations within a very short
period of time. The major disadvantage of this technique is the need to have an in-depth
CBF techniques overcome the challenges of CF. They have the ability to recommend new
items even if there are no ratings provided by users. So even if the database does not contain user
preferences, recommendation accuracy is not affected. Also, if the user preferences change, they
have the capacity to adjust their recommendations in a short span of time. They can manage
situations where different users do not share the same items, but only identical items according to
their intrinsic features. Users can get recommendations without sharing their profile, and this
ensures privacy. CBF technique can also provide explanations on how recommendations are
generated to users. However, the techniques suffer from various problems as discussed in the
literature. Content-based filtering techniques are dependent on items’ metadata. That is, they
require a rich description of items and a well-organized user profile before recommendations can
be made to users. This is called limited content analysis. So, the effectiveness of CBF depends
CBF technique. Users are restricted to getting recommendations similar to items already defined
in their profiles.
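The content-based approach described above can be illustrated with a minimal sketch. It assumes a toy catalogue of item descriptions, a hand-rolled TF-IDF weighting, and a user profile that consists of a single liked item; the function names and sample texts are hypothetical and are not taken from any of the systems discussed in this report.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using TF * IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(docs, liked_index, top_n=2):
    """Rank the other items by similarity to the one item the user liked."""
    vectors = tfidf_vectors(docs)
    profile = vectors[liked_index]
    scored = [(cosine(profile, vec), i)
              for i, vec in enumerate(vectors) if i != liked_index]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_n]]


# Hypothetical item descriptions (assumption, for illustration only).
items = [
    "space adventure with robots",
    "romantic comedy in paris",
    "robots fight in space battle",
    "cooking show with pasta",
]
print(recommend(items, liked_index=0, top_n=1))
```

Because only item descriptions are compared, the sketch needs no ratings from other users, which mirrors the privacy and new-item advantages noted above. It also shows the over-specialization limitation: items sharing no terms with the profile receive a score of zero.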
News Dude is a personal news system that utilizes synthesized speech to read news stories to
users. A TF-IDF model is used to describe news stories in order to determine the short-term
recommendations, which are then compared using the Cosine Similarity Measure and finally
supplied to a learning algorithm (NN). CiteSeer is an automatic citation indexing system that uses
various heuristics and machine learning algorithms to process documents. Today, CiteSeer is
among the largest and most widely used research paper repositories on the web.
LIBRA is a content-based book recommendation system that uses information about books
gathered from the Web. It implements a Naïve Bayes classifier on the information extracted from
the web to learn a user profile and produce a ranked list of titles based on training examples
recommendations made to users by listing the features that contribute to the highest ratings,
hence allowing users to have confidence in the recommendations provided by
the system.
easily and adequately be described by metadata such as movies and music. Collaborative
filtering technique works by building a database (user-item matrix) of preferences for items by
users. It then matches users with similar interests and preferences by calculating similarities
between their profiles to make recommendations. Such users form a group called a neighborhood.
A user gets recommendations for those items that he or she has not rated before but that were already
positively rated by users in his or her neighborhood. Recommendations that are produced by CF can be
predicted score of item j for the user i, while Recommendation is a list of top N items that the
user will like the most as shown in below figure. The technique of collaborative filtering can be
The items that were already rated by the user before play a relevant role in searching for a
neighbor that shares appreciation with him. Once a neighbor of a user is found, different
Due to the effectiveness of these techniques, they have achieved widespread success in real-life
applications. Memory-based CF can be achieved in two ways: through user-based and item-based
techniques. The user-based collaborative filtering technique calculates similarity between users by
comparing their ratings on the same items, and it then computes the predicted rating for an item
by the active user as a weighted average of the ratings of the item by users similar to the active
user, where the weights are the similarities of these users with the active user. Item-based filtering
techniques compute predictions using the similarity between items and not the similarity between
users. The technique builds a model of item similarities by retrieving all items rated by the active user from
the user-item matrix and determining how similar the retrieved items are to the target item. It then
selects the k most similar items, together with their corresponding similarities.
The prediction is made by taking a weighted average of the active user’s ratings on the k most similar items.
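The item-based procedure above can be sketched as follows. This is only an illustration under stated assumptions: the user-item matrix is a small hypothetical dictionary, item similarity is taken to be cosine similarity over users who rated both items, and the names `item_similarity` and `predict_item_based` are invented for this example.

```python
import math

# Hypothetical toy user-item matrix: ratings[user][item] on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 4},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
    "dan": {"A": 5, "B": 5, "D": 5},
}


def item_vector(ratings, item):
    """Collect the ratings given to one item, keyed by user."""
    return {user: r[item] for user, r in ratings.items() if item in r}


def item_similarity(ratings, i, j):
    """Cosine similarity between two items over users who rated both."""
    vi, vj = item_vector(ratings, i), item_vector(ratings, j)
    common = set(vi) & set(vj)
    if not common:
        return 0.0
    dot = sum(vi[u] * vj[u] for u in common)
    norm_i = math.sqrt(sum(vi[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(vj[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)


def predict_item_based(ratings, user, target, k=2):
    """Weighted average of the user's ratings on the k items most similar to target."""
    sims = [(item_similarity(ratings, target, item), item)
            for item in ratings[user] if item != target]
    top_k = sorted(sims, reverse=True)[:k]
    denom = sum(abs(s) for s, _ in top_k)
    if denom == 0:
        return None  # no usable neighbours for this item
    return sum(s * ratings[user][item] for s, item in top_k) / denom


print(predict_item_based(ratings, "ann", "D", k=2))
```

In this toy data, item D is rated almost identically to items A and B by the users who rated them together, so the prediction for "ann" on D lands close to her own ratings of A and B.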
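Several types of similarity measures are used to compute similarity between items or users. The two
most popular similarity measures are correlation-based and cosine-based. The Pearson correlation
coefficient is used to measure the extent to which two variables linearly relate with each other
and is defined as
$$S(a,u) = \frac{\sum_{i=1}^{n} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}{\sqrt{\sum_{i=1}^{n} (r_{a,i} - \bar{r}_a)^2}\,\sqrt{\sum_{i=1}^{n} (r_{u,i} - \bar{r}_u)^2}}$$
where $S(a,u)$ denotes the similarity between users a and u, $r_{a,i}$ is the rating given to item i by user a,
$\bar{r}_a$ is the mean rating given by user a, and n is the total number of items in the user-item
space. Also, prediction for an item is made from the weighted combination of the selected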
neighbors’ ratings, which is computed as the weighted deviation from the neighbors’ mean. The
cosine-based measure is a vector-space model which is based on linear algebra rather than a statistical approach. It measures the similarity
between two n-dimensional vectors based on the angle between them. The cosine-based measure is
widely used in the fields of information retrieval and text mining to compare two text
documents; in this case, documents are represented as vectors of terms. The similarity between
the scores that express how similar users or items are to each other. These scores can then be
context of use, similarity metrics can also be referred to as correlation metrics or distance
metrics.
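The correlation-based measure and the mean-centred prediction described above can be sketched in a few lines. The sketch assumes a hypothetical three-user rating dictionary, takes each user's mean over all of that user's ratings, and sums only over co-rated items; other formulations (for example, means over co-rated items only) are equally common, so this is one possible reading rather than a fixed definition.

```python
import math

# Hypothetical ratings: user "a" is the active user and has not rated "i4".
ratings = {
    "a": {"i1": 5, "i2": 3, "i3": 4},
    "u": {"i1": 4, "i2": 2, "i3": 3, "i4": 5},
    "v": {"i1": 1, "i2": 5, "i3": 2, "i4": 1},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(ratings, a, u):
    """Pearson correlation between users a and u over their co-rated items."""
    common = set(ratings[a]) & set(ratings[u])
    mean_a = mean(ratings[a].values())
    mean_u = mean(ratings[u].values())
    num = sum((ratings[a][i] - mean_a) * (ratings[u][i] - mean_u) for i in common)
    den = (math.sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common))
           * math.sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common)))
    return num / den if den else 0.0


def predict(ratings, active, item):
    """Active user's mean plus the similarity-weighted deviation of neighbours."""
    mean_active = mean(ratings[active].values())
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == active or item not in theirs:
            continue
        sim = pearson(ratings, active, other)
        num += sim * (theirs[item] - mean(theirs.values()))
        den += abs(sim)
    return mean_active if den == 0 else mean_active + num / den


print(pearson(ratings, "a", "u"), pearson(ratings, "a", "v"))
print(predict(ratings, "a", "i4"))
```

User "u" rates in the same direction as "a" and user "v" in the opposite direction, so both push the prediction for "i4" above the mean of "a". A raw prediction of this kind can exceed the rating scale, so a practical system would usually clip it to the valid range.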
This technique employs previous ratings to learn a model in order to improve the
performance of the collaborative filtering technique. The model-building process can be done using
machine learning or data mining techniques. These techniques can quickly recommend a set of
items because they use a pre-computed model, and they have proved to produce
Value Decomposition (SVD), Matrix Completion Technique, Latent Semantic methods, and
Regression and Clustering. Model-based techniques analyze the user-item matrix to identify
relations between items; they use these relations to compare the list of top-N recommendations.
Model based techniques resolve the sparsity problems associated with recommendation systems.
The use of learning algorithms has also changed the manner of recommendations from
recommender systems:
Association rule: Association rule mining algorithms extract rules that predict the occurrence
of an item based on the presence of other items in a transaction. For instance, given a set of
transactions, where each transaction is a set of items, an association rule takes the form A → B,
where A and B are two sets of items. Association rules can form a very compact representation
of preference data that may improve efficiency of storage as well as performance. Also, the
effectiveness of association rules for uncovering patterns and driving personalized marketing
decisions has been known for some time. However, although there is a clear relation between this method
and the goal of a recommendation system, association rules have not become mainstream.
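To make the form A → B concrete, the following sketch scores single-item rules by support and confidence, the two measures most commonly used in association rule mining. The transactions, thresholds and function names are hypothetical, and the brute-force enumeration of item pairs stands in for a real mining algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical shopping baskets (assumption, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Enumerate rules {a} -> {b} that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            c = confidence({lhs}, {rhs}, transactions)
            if s >= min_support and c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules


for lhs, rhs, s, c in pair_rules(transactions):
    print(f"{lhs} -> {rhs}  support={s:.2f}  confidence={c:.2f}")
```

A rule such as butter → bread can then be read as a recommendation: a user whose basket contains butter is shown bread.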
Clustering: Clustering techniques have been applied in different domains such as pattern
recognition, image processing, statistical data analysis and knowledge discovery. A clustering
algorithm tries to partition a set of data into a set of sub-clusters in order to discover meaningful
groups that exist within it. Once clusters have been formed, the opinions of other users in a
cluster can be averaged and used to make recommendations for individual users. A good
clustering method will produce high-quality clusters in which the intra-cluster similarity is high,
while the inter-cluster similarity is low. In some clustering approaches, a user can have partial
participation in different clusters, and recommendations are then based on the average across the
Organizing Map (SOM) are the most commonly used among the different clustering methods.
K-means takes an input parameter, K, and then partitions a set of n items into K clusters. The
Self-Organizing Map (SOM) is an unsupervised learning method based on an artificial neuron
clustering technique. Clustering techniques can be used to reduce the candidate set in
collaborative-based algorithms.
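A minimal K-means sketch is given below to illustrate the partitioning step. It assumes that each user is represented by a short tuple of ratings, that the initial centroids are supplied by the caller (so that the run is deterministic), and that squared Euclidean distance is used; the data and names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm; `centroids` holds the K initial cluster centres."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Hypothetical users described by their ratings of two items.
users = [(5, 1), (4, 2), (5, 2), (1, 5), (2, 4), (1, 4)]
centres, labels = kmeans(users, centroids=[(5, 1), (1, 5)])
print(labels)
print(centres)
```

Each final centroid is the average rating vector of its cluster, which corresponds to averaging the opinions of the users in a cluster before making recommendations.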
Decision tree: A decision tree is based on the methodology of tree graphs and is constructed by
analyzing a set of training examples for which the class labels are known. It is then applied
to classify previously unseen examples. If trained on very high-quality data, decision trees have the ability
to make very accurate predictions. Decision trees are more interpretable than other classifiers such
as Support Vector Machines (SVM) and neural networks because they combine simple questions
about data in an understandable manner. Decision trees are also flexible in handling items with a
mixture of real-valued and categorical features as well as items that have some specific missing
features.
Artificial Neural Network: An ANN is a structure of many connected neurons (nodes) which are
arranged in layers in systematic ways. The connections between neurons have weights associated
with them depending on the amount of influence one neuron has on another. There are some
advantages in using neural networks in some special problem situations. For example, because
it contains many neurons and assigns a weight to each connection, an artificial
neural network is quite robust with respect to noisy and erroneous data sets. ANNs have the ability
to estimate nonlinear functions and capture complex relationships in data sets. They can also
be efficient and even operate if part of the network fails. The major disadvantage is that it is hard
to come up with the ideal network topology for a given problem, and once the topology is
decided it will act as a lower bound for the classification error.
Link analysis: Link analysis is the process of building up networks of interconnected objects in
order to explore patterns and trends. It has shown great potential for improving the
performance of web search. Link analysis algorithms include PageRank and HITS. Most
link analysis algorithms handle a web page as a single node in the web graph.
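As an illustration of treating each page as a node in a graph, the sketch below runs a basic PageRank power iteration. The four-page link graph, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for the example, not values taken from this report.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a {page: [outgoing links]} graph."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical web graph: each key links to the pages in its list.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Page C is linked to by three other pages and therefore ends up with the highest score, while page D, which nobody links to, keeps only the random-jump share.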
Regression: Regression analysis is used when two or more variables are thought to be
analyzing associative relationships between dependent variable and one or more independent
variables. Uses of regression include curve fitting, prediction, and testing systematic hypotheses
about relationships between variables. The curve can be useful to identify a trend within a dataset,
Bayesian Classifiers: These are a probabilistic framework for solving classification problems
which is based on the definition of conditional probability and Bayes’ theorem. Bayesian
classifiers consider each attribute and class label as random variables. Given a record of N
features (A1, A2, …, AN), the goal of the classifier is to predict class Ck by finding the value of
Ck that maximizes the posterior probability of the class given the data, P(Ck|A1, A2, …, AN). By
applying Bayes’ theorem, P(Ck|A1, A2, …, AN) ∝ P(A1, A2, …, AN|Ck)P(Ck). The most
commonly used Bayesian classifier is known as the Naive Bayes Classifier. In order to estimate
the conditional probability, P(A1, A2, …, AN|Ck), a Naive Bayes Classifier assumes the
probabilistic independence of the attributes; that is, the presence or absence of a particular
attribute is unrelated to the presence or absence of any other. This assumption leads to P(A1, A2,
…, AN|Ck) = P(A1|Ck)P(A2|Ck)…P(AN|Ck). The main benefits of Naive Bayes classifiers
are that they are robust to isolated noise points and irrelevant attributes, and they handle missing
values by ignoring the instance during probability estimate calculations. However, the
independence assumption may not hold for some attributes as they might be correlated. In this
case, the usual approach is to use Bayesian Networks. Bayesian classifiers may prove practical
for environments in which knowledge of user preferences changes slowly with respect to the
time needed to build the model but are not suitable for environments in which user preference
Matrix completion techniques: The essence of the matrix completion technique is to predict the
unknown values within the user-item matrices. Correlation-based K-nearest neighbor is one of
the major techniques employed in collaborative filtering recommendation systems. Such techniques depend
largely on the historical rating data of users on items. Most of the time, the rating matrix is
very big and sparse because users do not rate most of the items represented
within the matrix. This problem often leads to the inability of the system to give reliable and
accurate recommendations to users. Different variations of low-rank models have been used in
practice for matrix completion, especially toward application in collaborative filtering. Formally,
the task of the matrix completion technique is to estimate the entries of a matrix, $M \in \mathbb{R}^{m \times n}$, when only a
subset $\Omega \subset \{(i,j) : 1 \le i \le m,\ 1 \le j \le n\}$ of its entries is observed.
The most widely used algorithm in practice for recovering M from a partially observed matrix
using the low-rank assumption is Alternating Least Squares (ALS) minimization, which involves
optimizing over the low-rank factors U and V in an alternating manner to minimize the squared error over the observed
entries while keeping the other factor fixed. Candes and Recht proposed the use of the matrix
completion technique in the Netflix problem as a practical example for the utilization of the
technique. Keshavan et al. used the SVD technique in an OptSpace algorithm to deal with the matrix
completion problem. The result of their experiment showed that SVD is able to provide a reliable
initial estimate for the spanning subspace, which can be further refined by gradient descent on a
Grassmannian manifold. Model-based techniques solve the sparsity problem. The major drawback of
these techniques is that the model-building process is computationally expensive and
memory-intensive. Also, they do not alleviate the cold-start problem.
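The alternating idea behind ALS can be shown with a deliberately simplified sketch. It assumes a rank-1 model, so that each factor is a plain vector and every update has a closed form; a practical implementation would use a higher rank and solve a small least-squares problem per row. The observed entries, the optional regularization parameter `lam` and the function name are hypothetical.

```python
def als_rank1(observed, n_rows, n_cols, iterations=100, lam=0.0):
    """Fit M[i][j] ~ u[i] * v[j] on observed entries by alternating updates.

    `observed` maps (row, col) -> value; `lam` is an optional ridge penalty.
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Keep v fixed and solve for each u[i] in closed form.
        for i in range(n_rows):
            num = sum(r * v[j] for (row, j), r in observed.items() if row == i)
            den = sum(v[j] ** 2 for (row, j) in observed if row == i) + lam
            if den > 0:
                u[i] = num / den
        # Keep u fixed and solve for each v[j] in closed form.
        for j in range(n_cols):
            num = sum(r * u[i] for (i, col), r in observed.items() if col == j)
            den = sum(u[i] ** 2 for (i, col) in observed if col == j) + lam
            if den > 0:
                v[j] = num / den
    return u, v


# Hypothetical 3x2 rating matrix with entry (2, 1) unobserved.
observed = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
u, v = als_rank1(observed, n_rows=3, n_cols=2)
print(round(u[2] * v[1], 3))  # estimate for the missing entry
```

The observed entries here are consistent with an exactly rank-1 matrix, so the missing entry is recovered closely; with real rating data the fit is only approximate and a positive `lam` is normally used to limit overfitting.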
Collaborative filtering has some major advantages over CBF in that it can perform in domains
where there is not much content associated with items and where content is difficult for a
computer system to analyze (such as opinions and ideals). Also, the CF technique has the ability to
provide serendipitous recommendations, which means that it can recommend items that are
relevant to the user even without the content being in the user’s profile. Despite the success of CF
techniques, their widespread use has revealed some potential problems, such as the following.
accurate and personalized recommendations to users. These systems aim to leverage the strengths
of different recommendation techniques while mitigating their weaknesses. There are typically
used together.
Recommendations from both techniques are combined using methods like weighted
This approach is effective in addressing the "cold start" problem, where new users or
attributes or features.
based filtering.
It is particularly useful when the item space is vast, and collaborative filtering alone
For example, a system might primarily use collaborative filtering, but when it lacks
dynamically, giving more weight to the model that is currently performing better.
can provide more accurate and diverse recommendations, reducing the risk of "filter
bubbles."
Addressing Cold Start Problems: Hybrid models can effectively handle new users or
Robustness: Hybrid models can adapt to changing user preferences and system conditions,
Examples of hybrid recommendation systems are commonly found in platforms like Netflix,
Amazon, and Spotify, where they utilize user-item interaction data, content features, and other
contextual information to generate personalized recommendations. These systems continuously
evolve and adapt to user behavior, providing a more engaging and satisfying user experience.
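Two of the hybridization ideas mentioned in this chapter, weighting the scores of two recommenders and switching to one of them when the other has no information, can be sketched as below. The function name, the default weight of 0.7 and the use of `None` to signal a missing score are assumptions made for illustration; they do not describe how any of the named platforms work.

```python
def hybrid_score(cf_score, cb_score, cf_weight=0.7):
    """Blend collaborative (cf) and content-based (cb) scores.

    A score of None means that recommender has nothing to say, for example
    collaborative filtering on a brand-new item with no ratings.
    """
    if cf_score is None:
        return cb_score  # switching: fall back to content-based
    if cb_score is None:
        return cf_score  # switching: fall back to collaborative
    # Weighted hybrid: linear combination of the two scores.
    return cf_weight * cf_score + (1.0 - cf_weight) * cb_score


print(hybrid_score(4.0, 2.0, cf_weight=0.5))  # both recommenders available
print(hybrid_score(None, 3.5))                # cold-start item, CF unavailable
```

The weight could also be adjusted over time in favour of whichever recommender is currently performing better.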
Chapter 4
Cold-start problem:
This refers to a situation where a recommender does not have adequate information about a user
or an item in order to make relevant predictions. This is one of the major problems that reduce
the performance of a recommendation system. The profile of such a new user or item will be empty
since no ratings are yet available; hence, the user’s taste is not known to the system [20].
Data sparsity:
This is the problem that occurs as a result of a lack of sufficient information, that is, when only a few
of the total number of items available in a database are rated by users. This always leads to a
sparse user-item matrix, an inability to locate successful neighbors and, finally, the generation of
weak recommendations. Also, data sparsity always leads to coverage problems, which is the
Scalability:
normally grows linearly with the number of users and items [5]. A recommendation technique
that is efficient when the dataset is small may be unable to generate a satisfactory
number of recommendations when the volume of the dataset increases. Thus, it is crucial to apply
number of dataset in a database increases. Methods used for solving scalability problem and
such as Singular Value Decomposition (SVD) method, which has the ability to produce reliable
and efficient recommendations.
Synonymy:
Synonymy is the tendency of very similar items to have different names or entries. Most
recommender systems find it difficult to handle closely related items with different names,
e.g. baby wear and baby cloth [10]. Collaborative filtering systems
usually find no match between the two terms to be able to compute their similarity. Different
methods, such as automatic term expansion, the construction of a thesaurus, and Singular
Value Decomposition (SVD), especially Latent Semantic Indexing, are capable of solving the
synonymy problem. The shortcoming of these methods is that some added terms may have
different meanings from what is intended, which sometimes leads to rapid degradation of
recommendation performance.
Chapter 5
APPLICATIONS
Ringo is a user-based CF system which makes recommendations of music albums and artists. In
Ringo, when a user initially enters the system, a list of 125 artists is given to the user to rate
according to how much he or she likes listening to them. The list is made up of two different sections.
The first section consists of the most often rated artists, and this affords the active user the
opportunity to rate artists which others have also rated, so that there is a level of similarity
between different users’ profiles. The second section is generated by a random selection of
items from the entire user-item matrix, so that all artists and albums are eventually rated at some
Usenet news which is a high volume discussion list service on the Internet. The short lifetime of
Netnews, and the underlying sparsity of the rating matrices are the two main challenges
item collaborative filtering techniques to recommend online products for different users. The
computational algorithm scales independently of the number of users and items within the
from users. The interface is made up of the following sections: your browsing history, rate these
items, improve your recommendations, and your profile. The system predicts the user’s interest
based on the items he or she has rated. The system then compares the user’s browsing pattern on the
system and decides the item of interest to recommend to the user. Amazon.com popularized the
feature of “people who bought this item also bought these items”.
CONCLUSION
Internet. It also helps to alleviate the problem of information overload which is a very common
phenomenon with information retrieval systems and enables users to have access to products and
services which are not readily available to users on the system. This report discussed the three
traditional recommendation techniques and highlighted their strengths and challenges, along with the
diverse kinds of hybridization strategies used to improve their performance. This knowledge will
empower researchers and serve as a road map to improve state-of-the-art recommendation
techniques.
References:
4. Esteban A, Zafra A, Romero C. Helping university students to choose elective courses by using a
2020;194:105385.
based on user interests mining and metapath discovery. IEEE Trans Comput Soc Syst. 2021;8:86–
98.
7. Bhalse N, Thakur R. Algorithm for movie recommendation system using collaborative filtering.
9. Chen X, Liu D, Xiong Z, Zha ZJ. Learning and fusing multiple user interest representations for
10. Afolabi AO, Toivanen P. Integration of recommendation systems into connected health for
13. Russell S, Yoon V. Applications of wavelet data reduction in a recommender system. Expert Syst
Appl. 2008;34:2316–25.
14. Campos LM, Fernández-Luna JM, Huete JF. A collaborative recommender system based on
15. Funk M, Rozinat A, Karapanos E, Medeiros AKA, Koca A. In situ evaluation of recommender
systems: Framework and instrumentation. Int J Hum Comput Stud. 2010;68:525–47.
J.A. Konstan, J. Riedl. Recommender systems: from algorithms to user experience. User Model User-Adapt
16. C. Pan, W. Li. Research paper recommendation with topic analysis. In Computer Design and
Proceedings of the fifth ACM conference on Recommender Systems (RecSys’11), ACM, New
Proceedings of ACM conference on recommender systems (RecSys’09), New York City, NY,
19. B. Pathak, R. Garfinkel, R. Gopal, R. Venkatesan, F. Yin. Empirical analysis of the impact of
recommender systems on sales. J Manage Inform Syst, 27 (2) (2010), pp. 159-188
20. Rashid AM, Albert I, Cosley D, Lam SK, McNee SM, Konstan JA et al. Getting to know
you: learning new user preferences in recommender systems. In: Proceedings of the
22. P. Resnick, H.R. Varian. Recommender systems. Commun ACM, 40 (3) (1997), pp. 56-58,
10.1145/245108.24512
23. A.M. Acilar, A. Arslan A collaborative filtering method based on Artificial Immune Network
[10] L.S. Chen, F.H. Hsu, M.C. Chen, Y.C. Hsu. Developing recommender systems with the
consideration of product profitability for sellers. Int J Inform Sci, 178 (4) (2008), pp. 1032-1048
system to predict user future movement. Exp Syst Applicat, 37 (9) (2010), pp. 6201-6212
25. G. Adomavicius, A. Tuzhilin. Toward the next generation of recommender systems: a survey of the
state-of-the-art and possible extensions. IEEE Trans Knowl Data Eng, 17 (6) (2005), pp. 734-749
26. Li C, Wang Z, Cao S, He L. WLRRS: A new recommendation system based on weighted linear
27. Mezei J, Nikou S. Fuzzy optimization to improve mobile health and wellness recommendation
28. Ayata D, Yaslan Y, Kamasak ME. Emotion based music recommendation system using wearable
30. Hammou BA, Lahcen AA, Mouline S. An effective distributed predictive model with matrix
factorization and random forest for big data recommendation systems. Expert Syst Appl.
2019;137:253–65.
(Mr. Prasad A. Lahare) (Dr. M.A. Wakchaure)
Name & Sign of Researcher Name & Sign of Research Guide