A Collaborative Filtering Recommendation Algorithm Based on User Clustering and Item Clustering
Abstract—Personalized recommendation systems help people find items of interest and have become widely used with the development of electronic commerce. Many recommendation systems employ collaborative filtering, which has proved to be one of the most successful techniques in recommender systems in recent years. With the gradual increase of customers and products in electronic commerce systems, the time-consuming nearest-neighbor search for the target customer over the total customer space fails to meet the real-time requirement of a recommender system. At the same time, recommendation quality degrades as the number of records in the user database increases; sparsity of the source data set is the major cause of this poor quality. To address the scalability and sparsity problems of collaborative filtering, this paper proposes a personalized recommendation approach that joins user clustering and item clustering. Users are clustered based on their ratings on items, and each user cluster has a cluster center. Based on the similarity between the target user and the cluster centers, the nearest neighbors of the target user can be found and the prediction smoothed where necessary. The proposed approach then uses item clustering collaborative filtering to produce the recommendations. The recommendation approach joining user clustering and item clustering collaborative filtering is more scalable and more accurate than the traditional one.

Index Terms—recommender systems, collaborative filtering, user clustering, item clustering, scalability, sparsity, mean absolute error

I. INTRODUCTION

With the development of the internet, intranets and electronic commerce systems, far more information arrives than we can reasonably deal with. Personalized recommendation services therefore exist to provide us with useful data by employing information filtering technologies. Information filtering has two main methods: content based filtering and collaborative filtering. Collaborative filtering (CF) has proved to be one of the most effective, owing to its simplicity in both theory and implementation [1,2].

Many researchers have proposed various kinds of CF technologies to produce quality recommendations. All of them make recommendations based on the same data structure, a user-item matrix whose entries are the rating scores given by users to items. There are two methods in CF: user based collaborative filtering and item based collaborative filtering [3,4]. User based CF assumes that a good way to find items of interest to a certain user is to find other users who have similar interests. It first finds the user's neighbors based on user similarities and then combines the neighbors' previously expressed rating scores by similarity-weighted averaging. Item based CF fundamentally follows the same scheme as user based CF. It looks at the set of items the target user has already rated and computes how similar they are to the target item under recommendation. It then combines the user's previous preferences based on these item similarities.

The challenges of these two CF methods are as follows [5,6]:

Sparsity: Even when users are very active, they rate only a few of the total number of items available in a user-item ratings database. Because most collaborative filtering algorithms are based on similarity measures computed over the co-rated set of items, high levels of sparsity can reduce accuracy.

Scalability: Collaborative filtering algorithms appear to be efficient at filtering in items that are interesting to users. However, they require computations that are very expensive and grow non-linearly with the number of users and items in a database.

Cold-start: An item cannot be recommended unless it has been rated by a number of users. This problem applies to new items and is particularly detrimental to users with eclectic interests. Likewise, a new user has to rate a sufficient number of items before the CF algorithm is able to provide accurate recommendations.

To address the scalability and sparsity problems of collaborative filtering, in this paper we propose a personalized recommendation approach that joins user clustering and item clustering. Users are clustered based on their ratings on items, and each user cluster has a cluster center. Based on the similarity between the target user and the cluster centers, the nearest neighbors of the target user can be found and the prediction smoothed where necessary. The proposed approach then uses item clustering collaborative filtering to produce the recommendations. The recommendation approach joining user clustering and item clustering collaborative filtering is more scalable and more accurate than the traditional one.
II. TRADITIONAL COLLABORATIVE FILTERING ALGORITHM

A. User Item Rating Content

The task of the traditional collaborative filtering recommendation algorithm is to predict the target user's rating for a target item that the user has not yet rated, based on the users' ratings on observed items. The user-item rating database is central to this task. Each user is represented by item-rating pairs, which can be summarized in a user-item table containing the rating $R_{ij}$ provided by the ith user for the jth item, as shown in Table I [7,8].

TABLE I
USER-ITEM RATINGS TABLE

User \ Item   Item1   Item2   ……   Itemn
User1         R11     R12     ……   R1n
User2         R21     R22     ……   R2n
……            ……      ……      ……   ……
Userm         Rm1     Rm2     ……   Rmn

Here $R_{ij}$ denotes the score of item j rated by an active user i. If user i has not rated item j, then $R_{ij} = 0$. The symbol m denotes the total number of users, and n denotes the total number of items.

B. Measuring the Rating Similarity

Collaborative filtering approaches have been popular among both researchers and practitioners, as evidenced by the abundance of publications and actual implementation cases. Although many algorithms exist, the basic common idea is to calculate the similarity among users using some measure and to recommend items based on that similarity. The collaborative filtering algorithms that use similarities among users are called user based collaborative filtering [9,10].

A similarity measure is a metric of relevance between two vectors. When the values of these vectors are associated with a user's model, the similarity is called user based similarity; when they are associated with an item's model, it is called item based similarity. The similarity measure can be used to weight the significance of ratings in a prediction algorithm and thereby improve accuracy.

Several similarity algorithms have been used in collaborative filtering recommendation algorithms [1,3]: Pearson correlation, cosine vector similarity, adjusted cosine vector similarity, mean-squared difference and Spearman correlation.

Pearson's correlation, given by the following formula, measures the linear correlation between two vectors of ratings.

\[ \mathrm{sim}(i,j) = \frac{\sum_{c \in I_{ij}} (R_{i,c} - A_i)(R_{j,c} - A_j)}{\sqrt{\sum_{c \in I_{ij}} (R_{i,c} - A_i)^2}\,\sqrt{\sum_{c \in I_{ij}} (R_{j,c} - A_j)^2}} \tag{1} \]

where $R_{i,c}$ is the rating of item c by user i, $A_i$ is the average rating of user i over all the co-rated items, and $I_{ij}$ is the set of items rated by both user i and user j.

The cosine measure, given by the following formula, looks at the angle between two vectors of ratings, where a smaller angle is regarded as implying greater similarity.

\[ \mathrm{sim}(i,j) = \frac{\sum_{k=1}^{n} R_{ik} R_{jk}}{\sqrt{\sum_{k=1}^{n} R_{ik}^2}\,\sqrt{\sum_{k=1}^{n} R_{jk}^2}} \tag{2} \]

where $R_{ik}$ is the rating of item k by user i and n is the number of items co-rated by both users. If a rating is null, it can be set to zero.

The adjusted cosine, given by the following formula, is used in some collaborative filtering methods for similarity among users where the difference in each user's use of the rating scale is taken into account.

\[ \mathrm{sim}(i,j) = \frac{\sum_{c \in I_{ij}} (R_{ic} - A_c)(R_{jc} - A_c)}{\sqrt{\sum_{c \in I_{ij}} (R_{ic} - A_c)^2}\,\sqrt{\sum_{c \in I_{ij}} (R_{jc} - A_c)^2}} \tag{3} \]

where $R_{ic}$ is the rating of item c by user i, $A_c$ is the average rating of user i for all the co-rated items, and $I_{ij}$ is the set of items rated by both user i and user j.

The literature provides rich evidence of the successful performance of collaborative filtering methods. However, the methods also have shortcomings. Collaborative filtering methods are known to be vulnerable to data sparsity and to have cold-start problems. Data sparsity refers to the problem of insufficient data, or sparseness. Cold-start problems refer to the difficulty of recommending new items, or recommending to new users, for which sufficient ratings are not yet available.

C. Selecting Neighbors

This step selects the neighbors who will serve as recommenders. Two techniques have been employed in collaborative filtering recommender systems.

Threshold-based selection: users whose similarity exceeds a certain threshold value are considered neighbors of the target user.

Top-n selection: the n best neighbors are selected, where n is given in advance.

D. Producing Prediction

Once the neighbors of the target user are known, the rating of the target user u for the target item t is predicted as the weighted average of the neighbors' ratings, weighted by their similarity to the target user.
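As a rough illustration of Section II, the sketch below implements the Pearson correlation of equation (1), the cosine measure of equation (2), the two neighbor selection techniques, and a similarity-weighted average prediction in plain Python. The function names and the toy rating matrix are assumptions made for this example, and 0 marks a missing rating as in Table I. The prediction is the plain weighted average described in the text, which may differ in detail from the formula used in the experiments.

```python
import math

def pearson_sim(ri, rj):
    """Equation (1): Pearson correlation over the items rated by both users."""
    items = [c for c in range(len(ri)) if ri[c] != 0 and rj[c] != 0]
    if not items:
        return 0.0
    ai = sum(ri[c] for c in items) / len(items)  # average over co-rated items
    aj = sum(rj[c] for c in items) / len(items)
    num = sum((ri[c] - ai) * (rj[c] - aj) for c in items)
    den = math.sqrt(sum((ri[c] - ai) ** 2 for c in items)) * math.sqrt(
        sum((rj[c] - aj) ** 2 for c in items))
    return num / den if den else 0.0

def cosine_sim(ri, rj):
    """Equation (2): cosine of the angle between two rating vectors.
    Missing ratings are stored as 0, so they add nothing to the sums."""
    num = sum(a * b for a, b in zip(ri, rj))
    den = math.sqrt(sum(a * a for a in ri)) * math.sqrt(sum(b * b for b in rj))
    return num / den if den else 0.0

def select_neighbors(sims, threshold=None, top_n=None):
    """sims maps a user index to its similarity with the target user.
    Applies threshold-based selection, top-n selection, or both."""
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(u, s) for u, s in ranked if s > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked

def predict_rating(ratings, neighbors, item):
    """Similarity-weighted average of the neighbors' ratings for one item.
    Neighbors who have not rated the item are skipped (an assumption)."""
    num = den = 0.0
    for u, s in neighbors:
        r = ratings[u][item]
        if r != 0 and s > 0:
            num += s * r
            den += s
    return num / den if den else 0.0

# Toy user-item matrix (rows are users, columns are items); hypothetical data.
ratings = [
    [5, 3, 0, 4],  # target user, item 2 not rated
    [4, 2, 5, 3],
    [1, 5, 4, 2],
]
sims = {u: pearson_sim(ratings[0], ratings[u]) for u in (1, 2)}
neighbors = select_neighbors(sims, threshold=0.0)
prediction = predict_rating(ratings, neighbors, item=2)
```

In this toy matrix only user 1 correlates positively with the target user, so threshold-based selection with a threshold of 0 keeps that single neighbor and the prediction for item 2 follows user 1's rating.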
J. Kelleher et al. [20] present a collaborative recommender that uses a user-based model to predict user ratings for specified items. The model comprises summary rating information derived from a hierarchical clustering of the users. They compare their algorithm with several others and show that its accuracy is good and its coverage is maximal. They also show that the proposed algorithm is very efficient: predictions can be made in time that is independent of the number of ratings and items and grows only logarithmically in the number of users.

Xue, G. et al. [21] present a novel approach that combines the advantages of memory based and model based collaborative filtering by introducing a smoothing-based method. In their approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. As a result, they provide higher accuracy as well as increased efficiency in recommendations. Their empirical studies on two datasets, EachMovie and MovieLens, show that the proposed approach consistently outperforms other traditional user based collaborative filtering algorithms.

George, T. et al. [22] consider a novel collaborative filtering approach based on a recently proposed weighted co-clustering algorithm that involves simultaneous clustering of users and items. They design incremental and parallel versions of the co-clustering algorithm and use them to build an efficient real-time collaborative filtering framework. Their empirical evaluation of the proposed approach on large movie and book rating datasets demonstrates that it is possible to obtain accuracy comparable to that of correlation and matrix factorization based approaches at a much lower computational cost.

Rashid, A.M. et al. [23] propose ClustKnn, a simple and intuitive algorithm that is well suited for large data sets. The proposed method first compresses the data tremendously by building a straightforward but efficient clustering model. Recommendations are then generated quickly using a simple nearest neighbor based approach. They demonstrate the feasibility of ClustKnn both analytically and empirically. They also show, by comparison with a number of other popular collaborative filtering algorithms, that apart from being highly scalable and intuitive, ClustKnn provides very good recommender accuracy as well.

Cantador, I. et al. [24] propose a multilayered semantic social network model that offers different views of the common interests underlying a community of people. The applicability of the proposed model to a collaborative filtering system is empirically studied. Starting from a number of ontology-based user profiles and taking into account their common preferences, they automatically cluster the domain concept space. With the obtained semantic clusters, similarities among individuals are identified at multiple semantic preference layers, and emergent, layered social networks are defined that are suitable for use in collaborative environments and content recommenders.

Panagiotis Symeonidis et al. [25, 26] use bi-clustering to disclose the duality between users and items by grouping them in both dimensions simultaneously. They propose a novel nearest bi-clusters collaborative filtering algorithm, which uses a new similarity measure that achieves partial matching of users' preferences. They apply nearest bi-clusters in combination with two different types of bi-clustering algorithms, Bimax and xMotif, for constant and coherent bi-clustering, respectively. Extensive performance evaluation results on three real-life data sets show that the proposed method substantially improves the performance of the CF process.

IV. RATING SMOOTHING BASED ON USER CLUSTERING

A. User Clustering

User clustering techniques work by identifying groups of users who appear to have similar ratings. Once the clusters are created, predictions for a target user can be made by averaging the opinions of the other users in that cluster. Some clustering techniques represent each user with partial participation in several clusters; the prediction is then an average across the clusters, weighted by degree of participation. Once the user clustering is complete, performance can be very good, since the size of the group that must be analyzed is much smaller [18].

The idea is to divide the users of a collaborative filtering system using a user clustering algorithm and to use the resulting partitions as neighborhoods, as Figure 1 shows. The clustering algorithm may generate fixed-size partitions or, based on some similarity threshold, it may generate a requested number of partitions of varying size.

Figure 1. Collaborative filtering based on user clustering.

Here $R_{ij}$ is the rating of user i for item j, $a_{ij}$ is the average rating of user center i for item j, m is the number of all users, n is the number of all items, and c is the number of user centers.
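The user clustering and smoothing idea of Section IV can be sketched as follows. This excerpt does not name the clustering algorithm, so the sketch assumes k-means with Euclidean distance and the first k users as seeds. The function names, the toy data, and the rule of averaging only over members who rated an item are likewise assumptions made for illustration. Each cluster center holds the per-item average rating of its members, and a user's missing ratings (stored as 0) are filled with the value from that user's cluster center.

```python
def distance(u, v):
    """Euclidean distance between two rating vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def cluster_center(members):
    """Per-item average over the members who rated the item, 0.0 if none did."""
    center = []
    for j in range(len(members[0])):
        rated = [u[j] for u in members if u[j] != 0]
        center.append(sum(rated) / len(rated) if rated else 0.0)
    return center

def kmeans_users(ratings, k, iterations=10):
    """Cluster users by their rating vectors; returns (labels, centers)."""
    # Assumed seeding: the first k users act as the initial centers.
    centers = [[float(x) for x in ratings[i]] for i in range(k)]
    labels = [0] * len(ratings)
    for _ in range(iterations):
        # Assignment step: each user joins the nearest center.
        labels = [min(range(k), key=lambda c: distance(u, centers[c]))
                  for u in ratings]
        # Update step: recompute each non-empty cluster's center.
        for c in range(k):
            members = [u for u, lab in zip(ratings, labels) if lab == c]
            if members:
                centers[c] = cluster_center(members)
    return labels, centers

def smooth(user, center):
    """Keep known ratings and fill missing ones (0) from the cluster center."""
    return [r if r != 0 else a for r, a in zip(user, center)]

# Toy user-item matrix with two visible taste groups; hypothetical data.
ratings = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
]
labels, centers = kmeans_users(ratings, k=2)
smoothed = [smooth(u, centers[lab]) for u, lab in zip(ratings, labels)]
```

Because the smoothed matrix has no missing entries, later similarity computations can run over all users or all items rather than only the co-rated ones.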
\[ P_{ut} = \frac{\sum_{i=1}^{c} R_{ui} \times \mathrm{sim}(t,i)}{\sum_{i=1}^{c} \mathrm{sim}(t,i)} \tag{6} \]

ratings as the target item t and the remaining item r.

\[ \mathrm{sim}(t,r) = \frac{\sum_{i=1}^{m} (R_{it} - A_t)(R_{ir} - A_r)}{\sqrt{\sum_{i=1}^{m} (R_{it} - A_t)^2}\,\sqrt{\sum_{i=1}^{m} (R_{ir} - A_r)^2}} \tag{8} \]
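Equations (6) and (8) can be illustrated with a short Python sketch. It assumes a dense rating matrix (for example, one whose missing entries have already been smoothed), reads c in equation (6) as the number of neighbor items, and keeps only items with positive similarity to the target item. These readings, the function names, and the toy data are assumptions of the example and are not taken from the source.

```python
import math

def item_similarity(ratings, t, r):
    """Equation (8): Pearson correlation between item columns t and r,
    summed over all m users of a dense rating matrix."""
    m = len(ratings)
    a_t = sum(row[t] for row in ratings) / m  # average rating of item t
    a_r = sum(row[r] for row in ratings) / m  # average rating of item r
    num = sum((row[t] - a_t) * (row[r] - a_r) for row in ratings)
    den = math.sqrt(sum((row[t] - a_t) ** 2 for row in ratings)) * math.sqrt(
        sum((row[r] - a_r) ** 2 for row in ratings))
    return num / den if den else 0.0

def predict_item_based(ratings, u, t, c):
    """Equation (6): weighted average of user u's ratings on the c items
    most similar to target item t, weighted by sim(t, i)."""
    sims = [(item_similarity(ratings, t, i), i)
            for i in range(len(ratings[u])) if i != t]
    # Assumed filter: only positively correlated items act as neighbors.
    neighbors = sorted([(s, i) for s, i in sims if s > 0], reverse=True)[:c]
    num = sum(ratings[u][i] * s for s, i in neighbors)
    den = sum(s for s, _ in neighbors)
    return num / den if den else 0.0

# Dense toy matrix (rows are users, columns are items); hypothetical data.
ratings = [
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
    [2, 1, 4],
]
p = predict_item_based(ratings, u=0, t=0, c=2)
```

In this toy matrix item 1 correlates positively with item 0 and item 2 correlates negatively, so only item 1 contributes to the prediction for user 0.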
deviation of the predicted ratings from the respective actual user ratings. Frequently used ones are mean absolute error (MAE), root mean squared error (RMSE) and the correlation between ratings and predictions. All of the above metrics were computed on the result data and generally led to the same conclusions. Mean absolute error is employed here as the statistical accuracy measure.

Formally, if n is the number of actual ratings in an item set, then MAE is defined as the average absolute difference between the n pairs of predicted and actual ratings. Assume that p1, p2, p3, ..., pn are the predictions of the users' ratings, and that the corresponding real ratings are q1, q2, q3, ..., qn. MAE is then defined as follows:

\[ \mathrm{MAE} = \frac{\sum_{i=1}^{n} |p_i - q_i|}{n} \tag{9} \]

The lower the MAE, the more accurate the predictions, allowing better recommendations to be formulated. MAE has been computed for different prediction algorithms and for different levels of sparsity.

C. Sensitivity of different training-test ratio x

To determine the sensitivity to the density of the dataset, we carried out an experiment in which we varied the value of x from 0.2 to 0.8 in increments of 0.1. For each of these training-test ratio values we ran our experiments using our proposed algorithm and the traditional CF algorithm. The results are shown in Figure 3. We observe that the quality of prediction increases as we increase x and that our proposed CF is better than the traditional one.

[Figure 3 plot. Vertical axis: MAE. Series: Traditional CF, Proposed CF.]

conclusion from Figure 4, which shows the mean absolute errors of the proposed algorithm and of traditional collaborative filtering for different numbers of neighbors, is that our proposed algorithm is better.

[Figure 4 plot. Vertical axis: MAE. Horizontal axis: Number of neighbours (20 to 50). Series: Traditional CF, Proposed CF.]

Figure 4. Comparing the proposed CF algorithm with the traditional CF algorithm.

VII. CONCLUSIONS

Recommender systems help people find items of interest and have become widely used in daily life with the development of electronic commerce. Many recommendation systems employ collaborative filtering, which has proved to be one of the most successful techniques in recommender systems in recent years. With the gradual increase of customers and products in electronic commerce systems, the time-consuming nearest-neighbor search for the target customer over the total customer space fails to meet the real-time requirement of a recommender system. At the same time, it suffers from its
A Project Supported by Scientific Research Fund of Zhejiang Provincial Education Department (Grant No. Y200806038).

REFERENCES

[1] Breese J, Heckerman D, Kadie C. Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI'98). 1998. 43-52.
[2] Chong-Ben Huang, Song-Jie Gong, Employing rough set theory to alleviate the sparsity issue in recommender system, In: Proceeding of the Seventh International Conference on Machine Learning and Cybernetics (ICMLC2008), IEEE Press, 2008, pp. 1610-1614.
[3] Sarwar B, Karypis G, Konstan J, Riedl J. Item-Based collaborative filtering recommendation algorithms. In: Proceedings of the 10th International World Wide Web Conference. 2001. 285-295.
[4] Manos Papagelis, Dimitris Plexousakis, Qualitative analysis of user-based and item-based prediction algorithms for recommendation agents, Engineering Application of Artificial Intelligence 18 (2005) 781-789.
[5] Hyung Jun Ahn, A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem, Information Sciences 178 (2008) 37-51.
[6] SongJie Gong, The Collaborative Filtering Recommendation Based on Similar-Priority and Fuzzy Clustering, In: Proceeding of 2008 Workshop on Power Electronics and Intelligent Transportation System (PEITS2008), IEEE Computer Society Press, 2008, pp. 248-251.
[7] SongJie Gong, GuangHua Cheng, Mining User Interest Change for Improving Collaborative Filtering, In: Second International Symposium on Intelligent Information Technology Application (IITA2008), IEEE Computer Society Press, 2008, Volume 3, pp. 24-27.
[8] Duen-Ren Liu, Ya-Yueh Shih, Hybrid approaches to product recommendation based on customer lifetime value and purchase preferences, The Journal of Systems and Software 77 (2005) 181-191.
[9] Yu Li, Liu Lu, Li Xuefeng, A hybrid collaborative filtering method for multiple-interests and multiple-content recommendation in E-Commerce, Expert Systems with Applications 28 (2005) 67-77.
[10] George Lekakos, George M. Giaglis, Improving the prediction accuracy of recommendation algorithms: Approaches anchored on human factors, Interacting with Computers 18 (2006) 410-431.
[11] L. H. Ungar and D. P. Foster. Clustering Methods for Collaborative Filtering. In Proc. Workshop on Recommendation Systems at the 15th National Conf. on Artificial Intelligence. Menlo Park, CA: AAAI Press. 1998.
[12] L. H. Ungar and D. P. Foster. A Formal Statistical Approach to Collaborative Filtering. Proceedings of Conference on Automated Learning and Discovery (CONALD), 1998.
[13] M. O. Conner and J. Herlocker. Clustering Items for Collaborative Filtering. In Proceedings of the ACM SIGIR Workshop on Recommender Systems, Berkeley, CA, August 1999.
[14] A. Kohrs and B. Merialdo. Clustering for Collaborative Filtering Applications. In Proceedings of CIMCA'99. IOS Press, 1999.
[15] Lee, WS. Online clustering for collaborative filtering. School of Computing Technical Report TRA8/00. 2000.
[16] K Honda, N Sugiura, H Ichihashi, S Araki. Collaborative Filtering Using Principal Component Analysis and Fuzzy Clustering, Lecture Notes in Computer Science, 2001.
[17] S.H.S. Chee, J Han, K. Wang. Rectree: An efficient collaborative filtering method. Lecture Notes in Computer Science, 2114, 2001.
[18] B. Sarwar, G. Karypis, J. Konstan and J. Riedl, Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering, Proceedings of the Fifth International Conference on Computer and Information Technology, 2002.
[19] D. Bridge and J. Kelleher, Experiments in sparsity reduction: Using clustering in collaborative recommenders, in Procs. of the Thirteenth Irish Conference on Artificial Intelligence and Cognitive Science, pp. 144-149. Springer, 2002.
[20] J. Kelleher and D. Bridge. Rectree centroid: An accurate, scalable collaborative recommender. In Procs. of the Fourteenth Irish Conference on Artificial Intelligence and Cognitive Science, pages 89-94, 2003.
[21] Xue, G., Lin, C., & Yang, Q., et al. Scalable collaborative filtering using cluster-based smoothing. In Proceedings of the ACM SIGIR Conference 2005, pp. 114-121.
[22] George, T., & Merugu, S. A scalable collaborative filtering framework based on co-clustering. In Proceedings of the IEEE ICDM Conference. 2005.
[23] Rashid, A.M.; Lam, S.K.; Karypis, G.; Riedl, J.; ClustKNN: A Highly Scalable Hybrid Model- & Memory-Based CF Algorithm. WEBKDD 2006.
[24] Cantador, I., Castells, P. Multilayered Semantic Social Networks Modelling by Ontology-based User Profiles Clustering: Application to Collaborative Filtering. EKAW 2006, pp. 334-349.
[25] Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos Papadopoulos, Yannis Manolopoulos, Nearest-Biclusters Collaborative Filtering, WEBKDD 2006.
[26] Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos N. Papadopoulos, Yannis Manolopoulos. Nearest-biclusters collaborative filtering based on constant and coherent values. Inf Retrieval 2007, DOI 10.1007/s10791-007-9038-4.
[27] Gao Fengrong, Xing Chunxiao, Du Xiaoyong, Wang Shan, Personalized Service System Based on Hybrid Filtering for Digital Library, Tsinghua Science and Technology, Volume 12, Number 1, February 2007, 1-8.
[28] Huang qin-hua, Ouyang wei-min, Fuzzy collaborative filtering with multiple agents, Journal of Shanghai University (English Edition), 2007, 11(3): 290-295.
[29] Songjie Gong, Chongben Huang, Employing Fuzzy Clustering to Alleviate the Sparsity Issue in Collaborative Filtering Recommendation Algorithms, In: Proceeding of 2008 International Pre-Olympic Congress on Computer Science, World Academic Press, 2008, pp. 449-454.

SongJie Gong was born in Cixi, Zhejiang Province, P.R. China, on July 1, 1979. He received the B.Sc. degree from Tongji University and the M.Sc. degree in computer application from Shanghai Jiaotong University, P.R. China, in 2003 and 2006 respectively. He is currently a teacher at Zhejiang Business Technology Institute, Ningbo, P.R. China. His research interests include data mining, information processing and intelligent computing. He has published more than 30 papers in journals and conferences.