RECOMMENDATION SYSTEM USING COLLABORATIVE FILTERING

A Thesis
Presented to
The Faculty of the Department of Computer Science
San José State University

In Partial Fulfillment
of the Requirements for the Degree
Master of Science

by
Yunkyoung Lee
December 2015

©2015 Yunkyoung Lee
ABSTRACT

Collaborative filtering recommends items well when enough data is provided, because the technique is based on the user's past behavior, and predicting the customer's behavior is the most important part of a recommendation system. However, its widespread use has revealed some real challenges, such as data sparsity and data scalability, which grow with the gradually increasing numbers of users and items. This project proposes an item-based collaborative filtering approach with dimension reduction (R-IBCF), which can achieve better prediction quality and execution time for the recommendation system.
ACKNOWLEDGEMENTS

I am very thankful to my advisor Dr. Tsau Young Lin for his continuous guidance and support throughout the project and for his firm belief in me. I would also like to thank the committee members, Dr. H. Chris Tseng and Dr. Thomas Austin, for monitoring the progress of the project and for their valuable time.
Table of Contents

CHAPTER 1: Introduction
CHAPTER 2: Related Work
CHAPTER 3: Dimension Reduction
CHAPTER 4: Experiments
CHAPTER 5: Performance Results
CHAPTER 6: Conclusion
REFERENCES
LIST OF FIGURES

[Figure 9] The impact of the similarity computation on IBCF and R-IBCF
[Figure 13] Comparison of the prediction quality of IBCF, R-IBCF, and UBCF
LIST OF TABLES

[Table 1] User-Item ratings matrix
[Table 2] User-Item matrix before dimension reduction
[Table 3] User-Item matrix after dimension reduction
[Table 4] Raw dataset of MovieLens
[Table 5] User-Item matrix from the raw dataset
CHAPTER 1
Introduction
Electronic commerce has grown quickly, centering around mobile commerce since the advent of smart devices. Users have more opportunity to access diverse information, and the amount of information reachable through the World Wide Web has led to an information overload problem: it is difficult for users to quickly obtain what they want from massive amounts of information. In recent years, each customer can actively share reviews and get discounts based on purchase factors. E-commerce sites therefore try to collect various signals of users' interests, such as purchase history, product information in the cart, product ratings, and product reviews.
Collaborative filtering (CF) uses these collected preferences to recommend items. User-based collaborative filtering (UBCF) works by exploiting the intuition that a user will likely prefer the items preferred by similar users. Therefore, at first, the algorithm tries to find the user's neighbors based on user similarities and then combines the neighbor users' rating scores into a prediction score. Item-based collaborative filtering (IBCF) looks into a set of items instead of the nearest neighbor users: the target user has already rated some items, and the algorithm computes how similar those items are to the target item under recommendation [8, 9]. After that, it combines the target user's previous ratings, weighted by these item similarities, into a prediction. However, the widespread use of collaborative filtering has revealed some potential challenges, such as rating data sparsity, cold-start, and data scalability [2, 6, 8, 9]. To solve these problems, this project proposes an item-based collaborative filtering approach with dimension reduction (R-IBCF). Chapter 2 surveys related work with its capabilities and limitations. The proposed approach is described in Chapter 3, the experimental setup in Chapter 4, and the performance results in Chapter 5; Chapter 6 concludes.
CHAPTER 2
RELATED WORK
Since the advent of the information age, the immense growth of the World Wide Web has made it difficult for users to quickly find what they want given the sheer amount of information available. Recommendation systems have accordingly been deployed across many domains, such as travel guides, online dating, books, restaurants, e-commerce sites, and so forth. Content-based systems are used to recommend items based on a description of items the user used to like before, or on pre-defined attributes of the user, whereas collaborative filtering systems rely on items rated by all users. Hybrid techniques combine both of these approaches. In every case, the goal is to provide users with accurate recommendations that meet their needs and to increase revenue: Amazon increased sales by 29% [11], Netflix increased movie rentals by 60% [12], and Google News increased click-through rates with personalized news recommendation [13].
User-based collaborative filtering recommends items to the target user that are already items of interest for other users who are similar to the target user. For example, as seen in [Figure 2] [15], let User 1 and User 3 have very similar preference behavior. If User 1 likes Item A, UBCF can recommend Item A to User 3. UBCF needs the explicit rating scores of items rated by users [8] to find the nearest neighbors based on user similarities. And then, it generates predictions for items by combining the neighbor users' rating scores based on those user similarities.

[Figure 2] An example of user-based collaborative filtering
Item-based collaborative filtering, in contrast, looks into similarities between the items and other items that are already associated with the user. For example, as seen in [Figure 3] [15], let's say Item A and Item C are very similar. If a user likes Item A, IBCF can recommend Item C to the user. IBCF needs a set of items that the target user has already rated to calculate similarities between those items and a target item. And then, it generates a prediction for the target item by combining the target user's previous preferences based on these item similarities [9]. In IBCF, users' preference data can be collected in two ways. One is that the user explicitly gives a rating score to an item within a certain numerical scale. The other is that the system implicitly analyzes the user's purchase history. The collaborative filtering process is mainly divided into three steps: Step 1) collecting the user ratings data matrix, Step 2) selecting similar neighbors by measuring the rating similarity, and Step 3) generating a prediction by combining the selected neighbors' ratings.
The first step is to build the user-item ratings matrix shown in [Table 1], where m symbolizes the total number of users and n symbolizes the total number of items. R_{i,j} is the rating score of Item I_j given by User U_i.

Item      I1        I2        I3        ...   In
User
U1        R_{1,1}   R_{1,2}   R_{1,3}   ...   R_{1,n}
...       ...       ...       ...       ...   ...
Um        R_{m,1}   R_{m,2}   R_{m,3}   ...   R_{m,n}

[Table 1] User-Item ratings matrix
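As an illustration of how such a sparse matrix is held in code, here is a minimal sketch using the Apache Mahout classes that appear later in this project (the IDs and scores are invented for illustration):

import java.util.Arrays;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class RatingsMatrix {
    public static void main(String[] args) throws Exception {
        // One PreferenceArray per user = one row R_{u,*} of the matrix
        FastByIDMap<PreferenceArray> rows = new FastByIDMap<PreferenceArray>();
        rows.put(1L, new GenericUserPreferenceArray(Arrays.asList(
                new GenericPreference(1L, 1L, 4.0f),    // R_{1,1} = 4.0
                new GenericPreference(1L, 3L, 3.0f)))); // R_{1,3} = 3.0
        rows.put(2L, new GenericUserPreferenceArray(Arrays.asList(
                new GenericPreference(2L, 1L, 5.0f)))); // R_{2,1} = 5.0
        DataModel model = new GenericDataModel(rows);
        // Missing cells are simply absent: the matrix is stored sparsely
        System.out.println(model.getNumUsers() + " users x " + model.getNumItems() + " items");
    }
}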
The second step is to calculate the similarity between users and to form a set of users called neighbors. A set of the most similar users to the current target user is selected. For example, as seen in [Figure 5] [4], the distance between the target node (black node) and every other node is calculated by a similarity measure. And then, the 5 users in the center are selected by the k-nearest-neighbor algorithm (k = 5).
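In Mahout, this neighbor-selection step maps onto NearestNUserNeighborhood; the following sketch assumes a Pearson user similarity, though any UserSimilarity can back the neighborhood:

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class NeighborSelection {
    // Returns the IDs of the k users most similar to the target user
    static long[] findNeighbors(DataModel model, long targetUserID, int k)
            throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(k, similarity, model);
        return neighborhood.getUserNeighborhood(targetUserID); // k = 5 in the figure above
    }
}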
In IBCF, the neighborhood is formed after the similarity is calculated. This is because IBCF computes the similarity between co-rated items only, treating the two items' co-ratings as two vectors [9, 14]. For example, as seen in [Figure 6], the similarity between Item i and Item j is computed by looking into Item i and Item j as rated by User 2, User l, and User n. Each of these co-rated pairs contributes to the similarity, so IBCF does not need the user-neighborhood search required by UBCF.
Computing similarity over the co-rated items only also isolates the cases that carry statistical significance [9]. There are a couple of popular similarity algorithms that have been used in CF recommendation algorithms [8]. In this paper, I present four of them: cosine vector similarity, Pearson correlation, Euclidean distance, and the Tanimoto coefficient.
Because cosine vector similarity notionally considers only the angle of two vectors without the magnitude, it is a popular measure in text mining, where each document is a vector of terms and one can count the number of times that a term appears in the data [17].
In the following formula, the cosine vector similarity looks into the angle between two vectors (the target Item i and the other Item j) of ratings in n-dimensional item space. R_{k,i} is the rating of the target Item i by User k, R_{k,j} is the rating of the other Item j by User k, and n is the total number of users who rated both Item i and Item j.
sim(i,j) = \cos(\vec{i}, \vec{j}) = \frac{\vec{i} \cdot \vec{j}}{\|\vec{i}\| \times \|\vec{j}\|} = \frac{\sum_{k=1}^{n} R_{k,i} R_{k,j}}{\sqrt{\sum_{k=1}^{n} R_{k,i}^2} \sqrt{\sum_{k=1}^{n} R_{k,j}^2}}
When the angle between two vectors is near 0 degrees (they point in the same direction), the cosine similarity value sim(i,j) is 1, meaning very similar. When the angle between two vectors is near 90 degrees, sim(i,j) is 0, meaning irrelevant. When the angle between two vectors is near 180 degrees (they point in opposite directions), sim(i,j) is -1, meaning opposite. In information retrieval using CF, however, sim(i,j) ranges from 0 to 1, because rating scores are non-negative and the angle between two rating vectors therefore cannot exceed 90 degrees.
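As a plain-Java sketch of the formula above (ri and rj hold the co-rated scores R_{k,i} and R_{k,j}; this is illustrative, not the project's actual code):

public class CosineSimilarity {
    // sim(i,j) = (i . j) / (|i| * |j|), computed over the n co-rated users
    static double cosine(double[] ri, double[] rj) {
        double dot = 0.0, normI = 0.0, normJ = 0.0;
        for (int k = 0; k < ri.length; k++) {
            dot += ri[k] * rj[k];
            normI += ri[k] * ri[k];
            normJ += rj[k] * rj[k];
        }
        return dot / (Math.sqrt(normI) * Math.sqrt(normJ));
    }

    public static void main(String[] args) {
        // Ratings are non-negative, so the result stays in [0, 1]
        System.out.println(cosine(new double[] {5, 3, 4}, new double[] {4, 2, 5}));
    }
}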
The Pearson correlation coefficient measures how large a number in one series is, relative to the corresponding number in another series, that is, the tendency of two series to move together [14]. When two vectors have a high tendency to move together, the correlation sim(i,j) is close to 1. When two vectors have a low tendency to move together, sim(i,j) is close to -1. As mentioned above in [Figure 6], item-based similarity is computed over the co-rated items where users rated both Item i and Item j.

R_{k,i} is the rating of the target Item i given by User k, and R_{k,j} is the rating of the other Item j given by User k. A_i is the average rating of the target Item i over all the co-rated users, and A_j is the average rating of the other Item j over all the co-rated users. n is the total number of users who rated both Item i and Item j.

sim(i,j) = \frac{\sum_{k=1}^{n} (R_{k,i} - A_i)(R_{k,j} - A_j)}{\sqrt{\sum_{k=1}^{n} (R_{k,i} - A_i)^2} \sqrt{\sum_{k=1}^{n} (R_{k,j} - A_j)^2}}
Euclidean distance similarity treats the co-ratings of two items as points and computes the Euclidean distance between them. When the distance value between two points, sim(i,j), is large, it means the two points are not similar. When sim(i,j) is small, it means the two points are similar. The Euclidean distance formula is given below. R_{k,i} is the rating of the target Item i given by User k, R_{k,j} is the rating of the other Item j given by User k, and n is the total number of users who rated both Item i and Item j.

sim(i,j) = \sqrt{\sum_{k=1}^{n} (R_{k,i} - R_{k,j})^2}
The Tanimoto coefficient does not take into account the preference values of an item rated by a user; it only considers whether a preference exists. It is the ratio of the size of the intersection, or overlap, of two preferred-item sets to the size of their union [14]. When two items are completely overlapped, sim(i,j) is 1; when they are not overlapped at all, sim(i,j) is 0. The Tanimoto coefficient, also known as the Jaccard coefficient, measures the ratio between two sets to compare the similarity and diversity of the two sets. f_i is the set of users who express a preference for Item i, and f_j is the set of users who express a preference for Item j.
sim(i,j) = \frac{|f_i \cap f_j|}{|f_i| + |f_j| - |f_i \cap f_j|}
In its early days, a recommendation system might not have enough user rating information. This metric would be helpful when only Boolean preference data, rather than numerical ratings, is available.
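For reference, all four measures above have counterparts in Mahout's Taste API; a sketch (each class implements ItemSimilarity over the model's ratings):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilarityChoices {
    static ItemSimilarity[] allFour(DataModel model) throws TasteException {
        return new ItemSimilarity[] {
            new UncenteredCosineSimilarity(model),   // cosine vector similarity
            new PearsonCorrelationSimilarity(model), // Pearson correlation
            new EuclideanDistanceSimilarity(model),  // Euclidean distance, mapped into a similarity
            new TanimotoCoefficientSimilarity(model) // Tanimoto (Jaccard) coefficient
        };
    }
}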
Once CF computes the similarity between users (in UBCF) or items (in IBCF) and finds the set of most similar users or items, it generates a prediction of the target user's interest; this is the most significant step in CF. Since UBCF has obtained the neighborhood of users, it can calculate the predictive rating for the target User u on the target Item i. It is scaled by the weighted average of all neighbors' ratings on the target Item i as follows [2, 4]:
P_{u,i} = A_u + \frac{\sum_{w=1}^{n} (R_{w,i} - A_w) \times sim(u,w)}{\sum_{w=1}^{n} sim(u,w)}
A_u is the average rating of the target User u over all other rated items, and A_w is the average rating of the neighbor User w over all other rated items. R_{w,i} is the rating of the neighbor User w on the target Item i, sim(u,w) is the similarity of the target User u and the neighbor User w, and n is the total number of neighbors.
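A direct transcription of this formula as a plain-Java sketch; the absolute value in the denominator is a common guard against negative similarities and is an assumption here, since the formula above sums the raw similarities:

public class UserBasedPrediction {
    /**
     * P_{u,i} = A_u + sum_w (R_{w,i} - A_w) * sim(u,w) / sum_w |sim(u,w)|
     * userAvg = A_u; ratings[w] = R_{w,i}; avgs[w] = A_w; sims[w] = sim(u,w).
     */
    static double predict(double userAvg, double[] ratings, double[] avgs, double[] sims) {
        double num = 0.0, den = 0.0;
        for (int w = 0; w < sims.length; w++) {
            num += (ratings[w] - avgs[w]) * sims[w];
            den += Math.abs(sims[w]);
        }
        return den == 0.0 ? userAvg : userAvg + num / den;
    }
}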
Since IBCF has obtained the neighborhood of items, it estimates how the target user would rate items similar to the target item. To keep the prediction within the predefined rating range [8], the predictive rating for the target User u on the target Item i is scaled by the weighted average of the ratings the target user gave to the neighbor items:

P_{u,i} = \frac{\sum_{j=1}^{n} R_{u,j} \times sim(i,j)}{\sum_{j=1}^{n} |sim(i,j)|}

R_{u,j} is the rating of the target User u on the neighbor Item j, sim(i,j) is the weighted similarity of the target Item i and the neighbor Item j, and n is the total number of neighbor items.
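In Mahout, this weighted-average estimate is what a recommender's estimatePreference returns; a sketch (the choice of Pearson similarity is an assumption):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;

public class ItemBasedPrediction {
    // Predict the target user's rating for one item via item neighbors
    static float estimate(DataModel model, long userID, long itemID) throws TasteException {
        ItemBasedRecommender rec = new GenericItemBasedRecommender(
                model, new PearsonCorrelationSimilarity(model));
        return rec.estimatePreference(userID, itemID);
    }
}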
Since the number of users and items in each application has steadily increased along with the growth of the World Wide Web, the collected input data has become a big problem both for producing an accurate prediction and for running time.

Data sparsity: the user-item input data matrix may contain rating scores for only a small fraction of the total number of items available, even when users are very active. In addition, because users tend not to rate actively, calculating similarity over the co-rated set of items can be a challenge. These problems give rise to inaccurate predictions.

Cold-start: collaborative filtering predicts items based on a user's previous preference behavior. That is, it cannot predict recommendable items for new users until those users have rated many items. Likewise, new items cannot be considered for recommendation, because no users have rated them yet.

Data scalability: with millions of users and millions of items in the user-item input data matrix, the similarity computation becomes so expensive that recommendation systems cannot quickly react to online requirements and immediately make recommendations.
CHAPTER 3
DIMENSION REDUCTION
As discussed in Chapter 2, collaborative filtering faces two main challenges: data sparsity and data scalability. The data sparsity problem can lead to skewed predictions and low reliability of predictions. Besides, data scalability requires low operation time and reasonable memory use to scale to all users and items in the database.

Consider Amazon as an example: Amazon had over 244 million active customers, as GeekWire reported in 2014 [18], and had sold over 200 million products, as ExportX reported in 2013 [19]. As of 2015, Amazon is expected to have grown beyond these numbers.
Since Amazon must look into a dataset resembling a 244 million × 200 million matrix, it will encounter both data scalability and data sparsity issues. In UBCF, the more the numbers of users and items increase, the more the matrix dimensions increase, and the longer the runtime to find the nearest neighbors of users. Therefore, it is assumed that using denser data, carrying much more preference information per item, with IBCF effectively addresses the data scalability and data sparsity problems. The proposed approach focuses on active items, assuming that they have many ratings given by users, and removes passive items.
Item    I1     I2     I3     I4     I5
User
U1      2.0    4.0    3.0    3.0    3.0
U2      1.0                  3.0
U3      5.0    1.0    5.0    5.0
U4                    4.0    3.0

[Table 2] User-Item matrix before dimension reduction

Item    I1     I3     I4
User
U1      2.0    3.0    3.0
U2      1.0           3.0
U3      5.0    5.0    5.0
U4             4.0    3.0

[Table 3] User-Item matrix after dimension reduction
For instance, as seen in [Table 2], each item can get up to a maximum of 4 ratings by users. Item I2 has 2 ratings and Item I5 has 1 rating, which means the number of ratings for Item I2 and Item I5 is not bigger than half the maximum. We can assume that Item I2 and Item I5 do not carry much weight in this matrix. Hence, when the matrix keeps only impactful items like Item I1, Item I3, and Item I4, as seen in [Table 3], the running time of the recommendation system is expected to decrease.
The overall flow of the proposed approach is shown in [Figure 7] and is mainly divided into four steps. The approach is based on IBCF, but the algorithm uses optimized data produced by removing the dimensions of items whose number of ratings is less than a specific value. For example, if only items that have over 20 ratings from users need to be considered, it extracts data in terms of items having over 20 ratings; in other words, items rated by over 20 users, as sketched in the code below.
[Figure 7] The overall flow of the proposed approach
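A minimal sketch of this filtering step (the reduceDimensions helper is hypothetical, not the project's actual code; it assumes the raw preferences are available as a flat list and rebuilds a Mahout data model from the items that pass the threshold):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class DimensionReduction {
    /** Keep only items rated at least minRatings times, then rebuild the matrix. */
    static DataModel reduceDimensions(List<Preference> raw, int minRatings) {
        // Step 1: count the number of ratings per item
        Map<Long, Integer> counts = new HashMap<Long, Integer>();
        for (Preference p : raw) {
            counts.merge(p.getItemID(), 1, Integer::sum);
        }
        // Step 2: drop preferences on passive items, grouping the rest by user
        Map<Long, List<Preference>> byUser = new HashMap<Long, List<Preference>>();
        for (Preference p : raw) {
            if (counts.get(p.getItemID()) >= minRatings) {
                byUser.computeIfAbsent(p.getUserID(), u -> new ArrayList<Preference>()).add(p);
            }
        }
        // Step 3: rebuild the reduced user-item matrix
        FastByIDMap<PreferenceArray> rows = new FastByIDMap<PreferenceArray>();
        for (Map.Entry<Long, List<Preference>> e : byUser.entrySet()) {
            rows.put(e.getKey(), new GenericUserPreferenceArray(e.getValue()));
        }
        return new GenericDataModel(rows);
    }
}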
CHAPTER 4
EXPERIMENTS
This project uses the MovieLens dataset, in which 6,040 users rated movies. Each user has rated at least 20 movies [21]. The range of ratings is from 1 to 5. A few rows of the MovieLens dataset are shown below: it consists of User ID, Item ID, and Rating, as seen in [Table 4]. I consider it as the User-Item matrix seen in [Table 5].
User ID Item ID Rating
1 1035 5
1 1287 5
1 3408 4
6 1035 5
6 1380 5
6 3408 5
10 1035 5
10 1380 5
10 1287 3
10 3408 4
10 1201 2
26 1035 2
26 1380 4
26 3408 2
26 1201 2
… … …
[Table 4] Raw dataset of MovieLens
Item    1035   1380   1287   3408   1201   ...
User
1       5             5      4             ...
6       5      5             5             ...
10      5      5      3      4      2      ...
26      2      4             2      2      ...
...     ...    ...    ...    ...    ...    ...

[Table 5] User-Item matrix from the raw dataset
To measure the quality of prediction, I use statistical accuracy metrics. Mean Absolute Error (MAE) is a widely used metric in the recommendation field. N is the number of ratings in the test set, p_i is the prediction of a user's rating, and q_i is the corresponding real rating given by the user.

MAE = \frac{\sum_{i=1}^{N} |p_i - q_i|}{N}

MAE measures the average absolute deviation between prediction scores of users' ratings and actual user ratings for the user-item pairs in the test dataset [2]. The lower the MAE value, the better the recommendation quality.
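Mahout ships an evaluator that computes exactly this average absolute difference; a sketch (the item-based builder and Pearson similarity are assumptions, and 0.7 matches the training ratio selected in Chapter 5):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

public class MaeEvaluation {
    static double mae(DataModel model) throws TasteException {
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        RecommenderBuilder builder = m ->
                new GenericItemBasedRecommender(m, new PearsonCorrelationSimilarity(m));
        // Train on 70% of each user's ratings, measure MAE on the remaining 30%
        return evaluator.evaluate(builder, null, model, 0.7, 1.0);
    }
}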
• Language: Java
• Machine learning library: Apache Mahout, which can run on top of Hadoop

Apache Mahout is one of the most powerful open source platforms supporting scalable machine learning. Its recommender components hold users, items, and their associated preferences from any source, and users and items are identified solely by numeric IDs. The similarity computation is used to find the nearest neighbors of a target user in UBCF. Since IBCF begins with a list of a user's preferred items, it instead computes similarities between those items and candidate items.
The algorithm of the proposed approach proceeds as follows:

Input:
U: a target user,
M: minimum ratings per item
Output: recommended items for U

1) Parse raw dataset and count the number of ratings per item:
   For each preference P Do
     If C contains P[ItemID] Then
       C ← <P[ItemID], C[P[ItemID]] + 1>
     Else
       C ← <P[ItemID], 1>
2) Keep only preferences whose item has at least M ratings:
   For each preference P Do
     If C[P[ItemID]] ≥ M Then
       D ← P
3) Group the reduced preferences by user:
   For each P in D Do
     Append P to ArrayList<GenericPreference> R of P[UserID]
     O ← <P[UserID], R>
4) Build the reduced data model from O
5) Compute item-item similarities on the reduced model
6) Create GenericItemBasedRecommender and recommend items for U
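Put together in Mahout, the pipeline might look like the following sketch (not the project's verbatim code; ratings.csv is a hypothetical file with one userID,itemID,rating triple per line, and reduceDimensions is the helper sketched in Chapter 3):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class RIbcfPipeline {
    public static void main(String[] args) throws Exception {
        // Steps 1-2: parse the raw dataset (one "userID,itemID,rating" per line)
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // Steps 3-5: dimension reduction would go here, e.g.
        // model = DimensionReduction.reduceDimensions(prefs, 627);
        // Step 6: create GenericItemBasedRecommender and recommend for target user U
        ItemBasedRecommender rec = new GenericItemBasedRecommender(
                model, new PearsonCorrelationSimilarity(model));
        List<RecommendedItem> top = rec.recommend(1L, 10); // top-10 items for user 1
        for (RecommendedItem item : top) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}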
CHAPTER 5
PERFORMANCE RESULTS
The goal of the experiments is to show that the proposed approach achieves better quality of prediction in terms of the MAE measure and faster runtime. I first ran experiments to find an optimal similarity algorithm and an optimal training/test ratio for the dataset. I also selected an optimal value for the minimum number of ratings per item in R-IBCF by varying that value. Finally, I compared R-IBCF and IBCF against UBCF on the quality of prediction with the optimal parameters. To begin, I measured MAE to find an optimal similarity algorithm for IBCF and R-IBCF on this dataset.
[Figure 9] shows the impact of the similarity computation on IBCF and R-IBCF. One similarity algorithm has a clear advantage, as its MAE is the lowest on both IBCF and R-IBCF; therefore, I selected it for the remaining experiments.
If a user has not rated at least one item, the system cannot recommend any items to that user. In this dataset, each item needs to have at least 627 ratings in order to recommend items to all users. That is, each item has to be rated by at least 627 users. For example, if I use the optimized data with items having at least 628 ratings, then after dimension reduction one person among the 6,040 users cannot get any item recommendation, because that user gave no ratings to any of the remaining items. Therefore, to prevent such non-recommendation cases, I varied the minimum number of ratings per item, x, from 50 (similar to the raw data) to 627 (the smallest dataset that still covers all users) and measured MAE. When x is small, the quality of prediction is almost the same as IBCF (MAE = 0.786). On the other hand, when I reduce many dimensions (x = 627), the quality of prediction is the best.
Next, I ran experiments where I varied the value of the training/test ratio from 0.2 to 0.9 in increments of 0.1 and computed MAE. For instance, x = 0.2 means that my experiments ran with 20% of the dataset as training data and 80% of the dataset as test data.
The results are shown in [Figure 11]. I observed that applying dimension reduction to IBCF generally makes the quality of prediction better than plain IBCF. When the training ratio reaches 0.7, the MAE of R-IBCF tends to flatten out; hence, I selected 0.7 as the optimal choice of training ratio.
The neighborhood size also has an effect on the quality of prediction. The quality of prediction gets better as the number of neighbors increases, and when the size reaches 30, the quality of prediction levels off.
The last experiments compare the IBCF and R-IBCF approaches with the benchmark UBCF. The purpose of this experiment was to determine how each similarity algorithm influences the quality of prediction, since collaborative filtering computes the similarity between each item or each user in the dataset.
I present the results in [Figure 13]. I performed the experiments with the selected values: 0.7 as the optimal training ratio on all three CF approaches and 627 as the optimal minimum number of ratings per item for R-IBCF. Overall, R-IBCF provides better quality of predictions than IBCF and UBCF. It can be observed that the data sparsity and data scalability problems affect the quality of prediction, since reducing dimensions means that items having fewer ratings are not taken into account. The results show that reducing dimensions on IBCF contributes greatly to improving the quality of prediction.
[Figure 13] Comparison of the prediction quality of IBCF, R-IBCF, and UBCF
To compare runtime, I ran each approach with the four similarity algorithms 30 times and took the average of their runtimes.
These results are shown in [Figure 14]. Even though it takes extra time to filter the data based on the number of ratings per item, I observed that this is faster than computing similarity between all co-rated items or all users. Therefore, R-IBCF has an advantage in terms of data scalability. In addition, because IBCF and R-IBCF only consider co-rated items to compute similarity, they do not need the expensive search for nearest neighbor users that UBCF requires.
CHAPTER 6
CONCLUSION

Recommendation systems are widely used on the web to suggest items that customers would be interested in. With the growing numbers of users and items, however, collaborative filtering suffers from two main shortcomings: the data sparsity and data scalability problems. Because the proposed approach, R-IBCF, removes the noise of high-dimensional data, it focuses on typical and popular items to compute the similarity between them and to predict the most similar items to users. The experiments compared the proposed approach with traditional UBCF and IBCF, and the results show that it improves both the quality of prediction and the runtime. A potential limitation would be using this approach with a dataset whose ratings are spread widely across many unpopular items. To overcome this challenge, I propose mixing both explicit and implicit user preference data as future work.
REFERENCES

[1] Schafer, J. Ben, Joseph Konstan, and John Riedl. 1999. "Recommender Systems in E-Commerce." In Proceedings of the 1st ACM Conference on Electronic Commerce (EC '99): 158-166.
[2] Su, Xiaoyuan, and Taghi M. Khoshgoftaar. 2009. "A Survey of Collaborative Filtering Techniques." Advances in Artificial Intelligence. doi:10.1155/2009/421425.
[6] Sarwar, Badrul, George Karypis, Joseph Konstan, and John Riedl. 2000. "Application of Dimensionality Reduction in Recommender System: A Case Study." WebKDD Workshop.
[9] Gong, SongJie. 2010. "A Collaborative Filtering Recommendation Algorithm Based on User Clustering and Item Clustering." Journal of Software 5(7): 745-752. doi:10.4304/jsw.5.7.745-752.
[10] Shi, XiaoYan, HongWu Ye, and SongJie Gong. 2008. "A Personalized Recommender Integrating Item-Based and User-Based Collaborative Filtering." (2008): 264-267.
[11] "Amazon's Recommendation Secret." 2012. Fortune. http://fortune.com/2012/07/30/amazons-recommendation-secret/.
[12] Koren, Yehuda. 2009. "Collaborative Filtering with Temporal Dynamics." In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), ACM (2009): 447-455.
[13] Liu, Jiahui, Peter Dolan, and Elin Rønby Pedersen. 2010. "Personalized News Recommendation Based on Click Behavior." In Proceedings of the 14th International Conference on Intelligent User Interfaces (IUI '10).
[14] Owen, Sean, Robin Anil, Ted Dunning, and Ellen Friedman. 2011. Mahout in Action. Manning Publications.
[15] Walunj, Sachin, and Kishor Sadafale. 2013. "An Online Recommendation System for E-commerce Based on Apache Mahout Framework."
[17] "Cosine Similarity." Wikipedia. https://en.wikipedia.org/wiki/Cosine_similarity.
[18] Duryee, Tricia. 2014. "Amazon Adds 30 Million Customers in the Past Year." GeekWire. https://www.geekwire.com/2014/amazon-adds-30-million-customers-past-year/.
[19] Grey, Paul. 2013. "How Many Products Does Amazon Sell?" ExportX. https://export-x.com/2013/12/15/many-products-amazon-sell/.
[20] Resnick, Paul, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. "GroupLens: An Open Architecture for Collaborative Filtering of Netnews." In Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW '94): 175-186.
[21] MovieLens Dataset. GroupLens Research. https://grouplens.org/datasets/movielens/.