Collaborative Filtering Recommendation Algorithm

Olgierd Unold
Wroclaw University of Science and Technology
Abstract—The aim of this work was to develop and compare recommendation systems which use the item-based collaborative filtering algorithm, based on Hadoop and Spark. Data for the research were gathered from a real social portal whose users can express their preferences regarding the applications on offer. The Hadoop version was implemented with the use of the Mahout library, which is an element of the Hadoop ecosystem. The authors' original solution was implemented with the use of the Apache Spark platform and the Scala programming language. The applied similarity measure was the Tanimoto coefficient, which provides the most precise results for the available data. The initial assumptions were confirmed, as the solution based on the Apache Spark platform turned out to be more efficient.

I. INTRODUCTION

Recommendation systems are among the most recognizable and most widely used machine learning techniques [1]. They have an undeniable impact on the personalization of content on the Internet, so prevalent nowadays. There is great interest in them, especially in e-commerce, as their implementation raises sales by an estimated 8 to 10 percent. During the last few years, as a huge amount of data has come to be used for generating recommendations, those systems have been equipped with solutions characteristic of Big Data problems.

The task of recommendation systems is to present users with information about the items in which they might be interested. The systems differ from one another in the way they analyze the available data, which, in turn, makes it possible to evaluate the degree to which users like particular products. The techniques used in personalized recommendation systems can be classified into three basic groups: content-based, those based on collaborative filtering (abbreviated to CF), and hybrid systems [2].

This work focuses on systems which use collaborative filtering, because of their universality. Note that in a CF algorithm it is expensive to compute the similarity of users (or items), as the algorithm is required to search the entire database to find the potential neighbors of a target user (item). This requires computation that increases linearly with the growth of both the number of users and the number of items (a CF algorithm has a worst-case complexity of O(UI), where U is the number of users and I is the number of items). Therefore, many algorithms are either slowed down or require additional resources such as computation power or memory. This is the so-called scalability problem, which is one of the key challenges for CF [3].

Recently published works proved that it is possible to parallelize collaborative filtering algorithms with Hadoop technology [4], [5], which is dedicated to solving complex and distributable problems. Admittedly, however, this approach, based on the MapReduce paradigm, does not offer favorable scalability and computation-cost efficiency as the data size increases.

This paper presents a new solution to item-based CF based on the Apache Spark platform, a new engine for large-scale data processing. We develop and compare item-based collaborative filtering algorithms using two cluster computing frameworks: Hadoop's disk-based MapReduce paradigm and Spark's in-memory RDD paradigm. The implementation of the CF algorithm based on the Hadoop platform was done using the Mahout library [6], [7]. The parallelized item-based CF algorithm for the Apache Spark platform was implemented in the Scala programming language.

II. THE MAPREDUCE PARADIGM

MapReduce is a paradigm for distributed programming, used for processing large amounts of data with the use of groups of connected computer units, i.e. a computer cluster. Each computer unit is called a node. The paradigm was introduced by Google in 2004 [8]. Its main asset is that it takes away from the programmer most problems of parallel programming, such as node-to-node communication, management of cluster resources, and resistance to node failures. The MapReduce model is inspired by the map and reduce functions commonly used in functional programming.

A programmer's task is to provide the implementation of the following two procedures (see Figure 1 for the MapReduce workflow):

Fig. 1. The MapReduce workflow.

• map() takes a fragment of the input data expressed as a set of pairs (key, value) and, on the basis of particular records, produces zero or more intermediate pairs. The MapReduce library groups all intermediate values related to the same intermediate key K. Next, the values are transferred to the reduce() function.

Map(k1, w1) → list(k2, w2)

• reduce() takes the intermediate key K and the set of values for that key. Next, the values are joined. Each call to the reduce() function usually returns one value, but it can also return zero or more values.

Reduce(k2, list(w2)) → list(w3)
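As an illustration of this two-procedure flow (ours, not part of the original paper), the following Scala sketch emulates MapReduce on local collections with the classic word-count example; mapper and reducer are hypothetical names standing in for the map() and reduce() procedures described above, and a real job would of course run on a cluster through a framework such as Hadoop.

// A minimal, single-machine emulation of the MapReduce flow
// using plain Scala collections (illustrative only).
object WordCountSketch {
  // map(): one input record -> zero or more intermediate (key, value) pairs
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1))

  // reduce(): an intermediate key and all its values -> output value
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val input = Seq("to be or not to be", "to see or not to see")
    val intermediate = input.flatMap(mapper)          // map phase
    val grouped = intermediate.groupBy(_._1)          // shuffle: group by key K
    val output = grouped.map { case (word, pairs) =>  // reduce phase
      reducer(word, pairs.map(_._2))
    }
    output.foreach(println)                           // e.g. (to,4), (be,2), ...
  }
}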
RDD is a partitioned collection of data, which means that particular elements of a collection can be divided among cluster nodes and processed in parallel. Additionally, we can distinguish the following characteristics of RDDs (illustrated by the sketch after this list):

• They can only be created by operating on other sets or while reading data from a file system, because RDD sets cannot be modified.

• They are processed lazily, and each set stores information about the operations made on the input data, which makes it possible to recover a lost part of an RDD in case of a node failure.

• Intermediate results of operations can be stored in cache memory, so computations for iterative algorithms can be made much faster.
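The following short Scala sketch (our illustration, not code from the paper; the input path is hypothetical) shows the three properties in code: RDDs arise from stable storage or from other RDDs, transformations are lazy and recorded as lineage, and cache() keeps intermediate results in memory for reuse.

import org.apache.spark.{SparkConf, SparkContext}

object RddPropertiesSketch {
  def main(args: Array[String]): Unit = {
    // Local master used only so the sketch runs standalone.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // RDDs are created from storage or from other RDDs; they are
    // immutable, so each transformation yields a new RDD.
    val lines   = sc.textFile("data/preferences.txt") // hypothetical path
    val lengths = lines.map(_.length)                 // lazy: nothing runs yet

    // Each RDD remembers its lineage (textFile -> map), which lets
    // Spark recompute lost partitions after a node failure.

    // Caching keeps the intermediate result in memory, speeding up
    // iterative computations that reuse it.
    lengths.cache()

    // Only an action such as count() or reduce() triggers execution.
    println(lengths.count())
    println(lengths.reduce(_ + _))

    sc.stop()
  }
}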
V. EXPERIMENTAL STUDIES

The aim of the work was to implement and compare systems which generate recommendations with the use of the distributed computing platforms Hadoop and Spark. Because of the limited time of access to the computing cluster, the implemented systems were limited to computing the similarity matrices of items.
A. A description of the data

The data for the work were taken from an Internet portal in which registered users can use the available entertainment applications, such as games, quizzes, or tests. The portal users can like a given application. The preferences used for providing recommendations for users are expressed as Boolean values. The set of data consists of 9,644,727 preferences (likes) expressed by 1,148,320 users with respect to 730 applications.

The data are stored in the form of a text file. Particular preferences are separated by the newline sign (\n). A preference consists of a user identifier and an application identifier, separated by the tab sign (\t). User and application identifiers are unique integers. The size of the data is 106.5 MB.
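For illustration only (the identifiers below are invented, not drawn from the portal's dataset), two consecutive preferences in such a file would look as follows, with a tab separating the user identifier from the application identifier:

1048	42
1051	42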
B. The implementation

First, the input file is transformed into an RDD of (user, item) pairs, which is then joined with itself (by user) to obtain the pairs of items liked by the same user:

// file is an RDD[String] obtained earlier, e.g. by reading the input text file
val ratings = file.map(line => {
  val fields = line.split("\t")
  (fields(0).toInt, fields(1).toInt)   // (user, item)
})
val ratings2 = ratings
val ratingPairs = ratings.join(ratings2)
  .filter { case (user, (item1, item2)) => item1 < item2 }  // keep each pair once

Then, each item is assigned the number of expressed preferences. For optimization reasons, this set is distributed among all nodes via a variable of type Broadcast.

val numRatersPerItem = ratings
  .groupBy { case (user, item) => item }
  .map { case (item, ratingsPerItem) => (item, ratingsPerItem.size) }
val numRatersPerItemBC =
  sc.broadcast(Map(numRatersPerItem.collect(): _*))

Each pair of co-occurring items is assigned the list of the proper users.

val usersForRatingPair = ratingPairs
  .map { case (user, (item1, item2)) => ((item1, item2), user) }
  .groupByKey()

Using the received RDDs, it is possible to calculate the similarity between items.

val similaritiesInput = usersForRatingPair.map { case (pair, users) =>
  (pair, (users.size,
          numRatersPerItemBC.value(pair._1),   // raters of item1
          numRatersPerItemBC.value(pair._2)))  // raters of item2
}
val similarities = similaritiesInput.map {
  case (pair, (size, numRaters1, numRaters2)) =>
    (pair, jaccardSimilarity(size, numRaters1, numRaters2))
}
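The helper jaccardSimilarity is not defined in the extracted listing. For Boolean "like" data the Tanimoto coefficient mentioned in the abstract coincides with the Jaccard index: the number of users who liked both items divided by the number who liked at least one of them. A minimal sketch consistent with that measure — our reconstruction under this assumption, not the authors' exact code — is:

// Tanimoto (Jaccard) coefficient for Boolean preferences:
// |A ∩ B| / (|A| + |B| - |A ∩ B|)
def jaccardSimilarity(usersInCommon: Int,
                      totalUsers1: Int,
                      totalUsers2: Int): Double = {
  val union = totalUsers1 + totalUsers2 - usersInCommon
  if (union == 0) 0.0 else usersInCommon.toDouble / union
}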
C. The results

Figure 2 shows the results of the research for a system which generates an item similarity matrix and is implemented in the Apache Spark technology, compared to the ItemSimilarityJob program from the Mahout library.

As the Spark platform had very limited access to the cluster resources and there was no possibility of modifying the cluster configuration, the experiments were not conducted with the full amount of possessed data. The most numerous dataset processed in its entirety was a set consisting of nearly 10 million preferences. The averaged time of performing the computations in the ItemSimilarityJob program was 3 minutes 19 seconds, and in the Spark implementation 1 minute 45 seconds, so the answer time was shortened almost by half in the case of the Spark platform. The lower the amount of input data, the greater the difference was.

Fig. 2. A breakdown of the dependence of the performance time of computations on the amount of input data for ItemSimilarityJob and the Spark implementation.
D. Conclusions

It ought to be noted that, despite the considerable improvement of efficiency when the Spark platform was used, the results do not represent the whole picture. The direct reason for this is that the amount of input data for which the research was done was significantly limited: the obtained characteristic of the Hadoop implementation does not show a linear dependence of time on the amount of data, and the increase of time between 700,000 and 10,000,000 preferences was only 20 seconds. Therefore, we can conclude that in the case of running a program consisting of five MapReduce tasks which processed a small amount of data, most of the time was consumed by actions such as starting particular MapReduce jobs, synchronizing them, node-to-node communication, and read and write operations, and not directly by the computations. In order to obtain fully credible results, the measurements would have to be repeated on a greater scale.
VI. CONCLUSIONS AND FUTURE WORK

Because of the limited time of access to the computer cluster used for the realization of the work, the item-based CF algorithm based on the Apache Spark technology was restricted to an item similarity matrix, which can be used for generating recommendations on the fly, separately for each user. The Mahout library provides an analogous program named ItemSimilarityJob, which was used for the comparative analysis.

Another problem encountered during the realization of the project was the limit of random access memory on the cluster for the Spark platform. Therefore, the measurements of the performance time of the computations dependent on the amount of input data were made for almost 10 million out of the 145 million preferences present in the available data.

The conducted analyses of the systems showed that the system based on the Spark platform was more efficient, as had been presumed. Although our implementation of Spark is still a prototype, early experience with the system is encouraging. We show that Spark can outperform Hadoop by up to 10x for a smaller number of examined preferences (ca. 2 million). For full credibility, the research should be conducted with the use of a greater amount of data (and available cluster memory).

In the course of this work it was noticed that the Apache Spark platform has features that justify the growing interest in it in the world of Big Data. The in-memory RDD paradigm allows for more efficient work allocation between nodes, reducing costly read/write operations on intermediate results stored on hard drives, which directly translates into increased job productivity. Furthermore, the implementation of algorithms on Spark is much more developer-friendly, because there is no requirement to express jobs as successive mapping and reduction functions. Instead, the developer uses RDD transformations and actions, which are largely based on Scala's collections API. In our opinion, the functional programming style seems more natural for parallel computing. Last but not least, Spark provides an interactive shell (REPL), which allows for quick job prototyping and encourages experimentation.

Further work on this topic should include completing the implementation of the collaborative filtering algorithm for the Spark platform and repeating the analyses with the use of a greater amount of data. Another possibility is to compare the obtained results with those of implementations of other algorithms used in recommendation systems.

REFERENCES

[1] C. Sammut and G. I. Webb, Encyclopedia of Machine Learning. Springer, 2011.
[2] (2014, November) IBM: What is big data? Bringing big data to the enterprise. [Online]. Available: http://www.ibm.com/big-data/us/en/
[3] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the 10th International Conference on World Wide Web. ACM, 2001, pp. 285–295.
[4] Z.-D. Zhao and M.-S. Shang, "User-based collaborative-filtering recommendation algorithms on Hadoop," in Knowledge Discovery and Data Mining, 2010. WKDD'10. Third International Conference on. IEEE, 2010, pp. 478–481.
[5] J. Jiang, J. Lu, G. Zhang, and G. Long, "Scaling-up item-based collaborative filtering recommendation algorithm based on Hadoop," in Services (SERVICES), 2011 IEEE World Congress on. IEEE, 2011, pp. 490–497.
[6] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action. Manning Publications Co., Greenwich, CT, USA, 2011.
[7] (2014, November) Apache Mahout documentation. [Online]. Available: http://mahout.apache.org/
[8] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[9] (2014, November) Hadoop 1.1.2 documentation. [Online]. Available: http://hadoop.apache.org/docs/stable/
[10] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[11] P. Resnick and H. R. Varian, "Recommender systems," Communications of the ACM, vol. 40, no. 3, pp. 56–58, 1997.
[12] H. Tan and H. Ye, "A collaborative filtering recommendation algorithm based on item classification," in Circuits, Communications and Systems, 2009. PACCS'09. Pacific-Asia Conference on. IEEE, 2009, pp. 694–697.
[13] S. Gong, H. Ye, and H. Tan, "Combining memory-based and model-based collaborative filtering in recommender system," in Circuits, Communications and Systems, 2009. PACCS'09. Pacific-Asia Conference on. IEEE, 2009, pp. 690–693.
[14] R. V. Tatiya and A. S. Vaidya, "A survey of recommendation algorithms," IOSR Journal of Computer Engineering, vol. 16, no. 6, pp. 16–19, 2014.
[15] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 2009, vol. 344.
[16] T. Kajdanowicz, W. Indyk, and P. Kazienko, "MapReduce approach to relational influence propagation in complex networks," Pattern Analysis and Applications, pp. 1–8, 2012.
[17] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012, pp. 2–2. [Online]. Available: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
[18] (2014, November) Spark documentation. [Online]. Available: http://spark.apache.org/documentation.html
[19] (2014, November) Cloudera official web site. [Online]. Available: http://www.cloudera.com/content/cloudera/en/home.html
[20] (2014, November) Scala programming language. [Online]. Available: http://www.scala-lang.org