
2016 International Conference on Next Generation Intelligent Systems (ICNGIS)

Recommendation System For Big Data Applications Based On Set Similarity Of User Preferences
Arpan V Dev Anuraj Mohan
PG Scholar Assistant Professor
Department of Computer Science Department of Computer Science
NSS College of Engineering NSS College of Engineering
Palakkad, Kerala Palakkad, Kerala
Email: [email protected] Email: [email protected]

Abstract—Recommender systems are software techniques that provide users with suggestions for items they may want to consume or use. The conventional approach is to treat this as a decision problem and to solve it using rule-based techniques or cluster analysis. But recommendation systems are mainly employed in applications such as online markets, which work with big data. Since performing data mining on big data is a tedious task due to its distributed nature and enormity, another method, known as set-similarity join, can be utilized instead. This paper proposes a solution for item recommendation in big data applications. The proposed work presents customized, personalized item recommendations and successfully prescribes the most suitable items to the users. In particular, key terms are used to indicate users' preferences, and a user-based collaborative filtering algorithm is employed to create suitable suggestions. The proposed work is designed to run on Hadoop, a widely adopted distributed computing platform, using the MapReduce framework.

Index Terms—Recommender system, big data, prefix filtering, MapReduce.

Symbol  Definition
K       The candidate key terms list, K = {k1, k2, …, kn}
PSC     The preference set of the current user, PSC = {psc1, psc2, …, psci}
PSP     The preference set of a previous user, PSP = {psp1, psp2, …, psph}
UP      The table User Preferences
IBP     The table Item Based Preferences
EPF     Extended prefix filtering
SSJR    Set similarity join based recommendation

TABLE I
BASIC SYMBOLS AND NOTATIONS

I. INTRODUCTION

Recommendation systems, which attempt to predict the 'rating' or 'preference' that a user would give to an item or social element, can be classified into three main categories: content-based, collaborative, and hybrid recommendation approaches [1]. Collaborative filtering methods, in which the suggestions for a user are based on other users who have comparable tastes and preferences, can be further divided into item-based systems and user-based systems [2]. Here we use an item-based collaborative filtering approach.

Most current systems work on data mining techniques such as rule-based techniques and cluster analysis. But over the last decade, the number of users, services, and the volume of online data have been growing quickly, causing an explosive increase in the amount of data and hence the evolution of an area called big data. Thus, traditional service recommender systems often suffer from scalability and inefficiency problems when processing or analyzing such large-scale data. This poses vital challenges for recommender systems, commonly known as the big data analysis problem for recommendation systems.

In addition, the majority of existing service recommender system frameworks present the same evaluations and rankings of services to different users without considering individual users' inclinations, and in this manner fail to meet users' customized requirements [3]. Since differences in preferences are definitely reflected in the reviews on items, it becomes difficult to arrive at a meaningful conclusion from these reviews. In order to address these challenges, this paper proposes a recommender system based on set similarity computation between users' preference sets. It is called set similarity computation because the values compared are sets of words, not strings.

The remainder of the paper is organized as follows: the set similarity based recommender system is described in Section 2. Section 3 presents the design of the system on MapReduce. In Section 4, experimental analysis is discussed. Related works are presented in Section 5. Section 6 concludes the paper. Table 1 defines the acronyms and symbols used in this paper.

II. RECOMMENDATION USING SET SIMILARITY JOIN

Converting reviews into preference sets before similarity computation reduces computation overhead and improves accuracy. These preference sets are sets of words that are important to the context. So, in our method, previous users' preference sets are extracted from their item reviews. The preference set given by the current user is then compared with these extracted sets of preferences to find the group of users who are similar to the current user. The personalized rating of each candidate item for the current user is then calculated from the ratings of similar users. These ratings are then used to create the recommendation list.
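As a concrete illustration, the review-to-preference-set step can be sketched in Python. The stop-word and key-term lists below are toy stand-ins, not the paper's actual lists (the paper additionally applies Porter stemming and a treasury of domain words):

```python
import re

# Toy stand-ins for the paper's stop-word list and candidate key terms list.
STOP_WORDS = {"the", "a", "is", "and", "was", "for"}
CANDIDATE_KEY_TERMS = {"battery", "camera", "cheap", "fast", "durable"}

def preference_set(review_html: str) -> set[str]:
    """Turn one raw review into a preference set: strip HTML tags,
    drop stop words, keep only candidate key terms."""
    text = re.sub(r"<[^>]+>", " ", review_html.lower())
    tokens = re.findall(r"[a-z]+", text)
    return {t for t in tokens if t not in STOP_WORDS and t in CANDIDATE_KEY_TERMS}
```

For example, `preference_set("<p>The battery is durable and the camera was fast</p>")` yields the preference set `{"battery", "durable", "camera", "fast"}`.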

978-1-5090-0870-4/16/$31.00 ©2016 IEEE



Fig. 1. Set Similarity Join Based Recommendation System (SSJR)

Our system recommends the most appropriate items to the current user as per this recommendation list.

Figure 1 shows the working of the proposed system. It has three main components: key terms extraction for collecting user preferences, computing similarity between user preferences and comparing, and calculating personalized ratings and producing recommendations.

A. Key terms extraction for collecting user preferences

This component formalizes the preferences of current and previous users into their corresponding preference key term sets. It consists of two modules.

Pre-process: This module removes HTML tags, stop words, and common morphological endings from reviews. The Porter stemmer algorithm can be used for this.

Key terms extraction: This module converts each review into a corresponding set of key terms using two different collections of words: the candidate key terms list and the treasury of domain words.

The candidate key terms list is a set of key terms about users' preferences and the quality of the candidate items, denoted K. The treasury of domain words is a reference work for the candidate key terms list. It lists words grouped together according to similarity of key term meaning, and also includes related and contrasting words and antonyms. If a review contains a word in the treasury of domain words, the corresponding key term is extracted from the candidate key terms list and added to the preference key term set of the user. These two collections of words are implemented using a single key-value table, in which the key set serves as the candidate key terms list and the value set plays the role of the treasury of domain words.

The preferences of the current user are given by the user by selecting key terms from the candidate key terms list. The preference set of the current user is denoted PSC and is stored in the table User Preferences (UP). The preference set of a previous user for some candidate item is extracted from his/her reviews for the item; it is denoted PSP and is stored in the table Item Based Preferences (IBP).

B. Computing similarity between user preferences and comparing

Extended prefix filtering (EPF) [4] is the technique used in this component. Extended prefix filtering can be formally stated using the Jaccard coefficient: if Jaccard(A, B) > θ for two sets A and B, then A's prefix of length (|A| − α + 1) and B's prefix of length (|B| − α + 1), where α = (|A| + |B|)θ / (1 + θ), must share at least β elements. The contrast between the extended prefix and the original prefix is β − 1 elements. This portion of extra elements is known as the extension, and β − 1 is called the extension length. When α < β for a pair of attribute values A and B, the extended prefix has to be longer than A (or B). To address this, a pseudo element, which comes last in any global order, is appended repeatedly to A (or B) up to (β − α) times.

The basic idea behind EPF is that if the intersection of two sets is below some lower threshold value, then that set pair can be avoided; i.e., in our context, if PSC and a PSP are similar, at least a minimum number of their elements must overlap. For this, the concept of a prefix is used. The prefix of a set is the subset formed by the first p elements of the set, where p is called the prefix length. The extended prefix (EP) is the prefix extended by some more elements to improve accuracy and eliminate unnecessary pairs; the number of additional elements is the extension length. Filtering using extended prefixes is known as extended prefix filtering.

Substituting preference sets with their extended prefixes reduces the amount of data. But EPF requires a pair of EPs to share at least β elements before the records are considered for comparison, where β − 1 is the extension length. So the pair will be considered at least β times for further comparison. This is called the problem of redundant pairs, and it is solved with the help of the concepts of inverted index and projection. For an indexing element h, the inverted index is the list of the records (actually, the record ids) containing h in their extended prefix. The projection of an extended prefix with respect to h is the set of all elements that come after h in the extended prefix. In our approach, we compare the two records in the inverted index once, when the projections have some c common elements. This avoids redundant comparisons in the other inverted indexes.
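A minimal sketch of the EPF pruning test in Python (a flat illustration of the bound above, not the paper's trie-based implementation; the lexicographic global order and the pseudo-element name are assumptions for the example):

```python
from math import ceil

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def extended_prefix(record: set, alpha: int, beta: int, order: list) -> set:
    """The first (|record| - alpha + 1) elements in the global order,
    extended by beta - 1 more; short records are padded with a pseudo
    element that sorts after every real element."""
    ordered = sorted(record, key=order.index)
    length = max(0, len(record) - alpha + beta)   # (|A| - α + 1) + (β - 1)
    ordered += ["<pseudo>"] * max(0, length - len(ordered))
    return set(ordered[:length])

def may_be_similar(a: set, b: set, theta: float, beta: int, order: list) -> bool:
    """EPF pruning test: if Jaccard(a, b) > theta, the extended prefixes
    must share at least beta elements, so pairs failing the test can be
    skipped without computing the full similarity."""
    alpha = ceil((len(a) + len(b)) * theta / (1 + theta))
    pa = extended_prefix(a, alpha, beta, order)
    pb = extended_prefix(b, alpha, beta, order)
    return len(pa & pb) >= beta
```

Pairs rejected by `may_be_similar` are guaranteed to fall below the threshold; pairs that survive still need the exact Jaccard check.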

A compact trie is used to implement the inverted index. The projections are used as keys, and the stored value of each key is the record id of the projection. The path from the root to a node represents the key associated with that node. The global order of elements in keys is used for the order of sibling nodes. The extended prefix algorithm is shown in Algorithms 1 to 4. It performs filtering by traversing the trie in pre-order.

Algorithm 1 ExtendedPrefixFilter(T, θ, E, S)
Input: T - The trie for indexing elements
       θ - The minimum similarity threshold
       E - The extension length
       S - A null set
Output: S - the list of users who are similar to the current user
 1: R ← root(T)
 2: S ← firstPhase(R, R, θ, E, S)
 3: Return S

Algorithm 2 firstPhase(nodex, nodey, θ, E, S)
Input: nodex - Pointer to a node in T
       nodey - Pointer to a node in T
       θ - The minimum similarity threshold
       E - The extension length
       S - A set
Output: The list of users who are similar to the current user
 1: if nodex contains some user ids then
 2:     secondPhase(nodex, nodey, θ, E, S)
 3: else
 4:     for each child of nodex do
 5:         nodex ← nodex→child
 6:         S ← firstPhase(nodex, nodey, θ, E, S)
 7:     end for
 8: end if
 9: Return S

Algorithm 3 secondPhase(nodex, nodey, θ, E, S)
Input: nodex - Pointer to a node in T
       nodey - Pointer to a node in T
       θ - The minimum similarity threshold
       E - The extension length
       S - A set
Output: A boolean value to indicate the end of the current phase
 1: if nodex = nodey then
 2:     if size(projection(nodey)) = E then
 3:         simPair(nodex, nodey, θ, S)
 4:     end if
 5:     Return TRUE
 6: else
 7:     if |projection(nodex) ∩ projection(nodey)| = E then
 8:         simPair(nodex, nodey, θ, S)
 9:     else if |projection(nodex) ∩ projection(nodey)| > E then
10:         if nodey = ancestor(nodex) then
11:             Return TRUE
12:         else
13:             Return FALSE
14:         end if
15:     else
16:         if lastElement(projection(nodex)) does not succeed lastElement(projection(nodey)) in the global ordering then
17:             Return FALSE
18:         end if
19:     end if
20:     for each child of nodey do
21:         nodey ← nodey→child
22:         if secondPhase(nodex, nodey, θ, E, S) is TRUE then
23:             Return TRUE
24:         end if
25:     end for
26:     Return FALSE
27: end if
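The redundant-pair elimination that Algorithms 1 to 3 perform over the trie can be illustrated with a flat inverted index: each candidate pair is emitted exactly once, in the posting list of the first element the two extended prefixes share. A sketch with made-up record ids, assuming a lexicographic global order:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(prefixes: dict) -> list:
    """prefixes maps record id -> extended prefix (a list in global order).
    Build an inverted index element -> record ids, then emit each pair
    only in the posting list of the first element both prefixes share,
    so no pair is compared more than once."""
    index = defaultdict(list)
    for rid, prefix in prefixes.items():
        for element in prefix:
            index[element].append(rid)
    pairs = []
    for element in sorted(index):                       # global order
        for a, b in combinations(index[element], 2):
            shared = set(prefixes[a]) & set(prefixes[b])
            if min(shared) == element:                  # first shared element
                pairs.append(tuple(sorted((a, b))))
    return pairs
```

A pair sharing β prefix elements appears in β posting lists, but the `min(shared) == element` check emits it from only one of them, which is exactly the redundancy the trie traversal avoids.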

While traversing the trie, each loop starts from the root node and visits the nodes of the trie in pre-order. Pointers nodex and nodey point to the node currently being visited by the outer loop (first phase) and the inner loop (second phase), respectively. The first phase terminates once all nodes of the trie have been traversed. In each iteration of the first phase, the second phase terminates when the pointers nodey and nodex point to the same node. A pair of records is considered for comparison only when a pair of nodes is found whose number of common projection elements equals the extension length. Here, the function simPair represents the procedure that performs the actual comparison. In simPair, we use the approximate similarity computation method, defined by the Jaccard coefficient [2] as

Approximate Similarity(S1, S2) = Jaccard(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|

where S1 and S2 can be any pair of sets for which similarity is computed. Our algorithm for simPair is given in Algorithm 4.

C. Calculating personalized ratings and producing recommendations

The overall idea of our system is wrapped up in Algorithm 5. The personalized rating of each candidate item for the current user is calculated as shown in the algorithm. Repeating this for every candidate item, we can calculate the personalized ratings of all candidate items for the current user. We then rank the items by their personalized ratings and present a personalized item recommendation list to the current user. Without loss of generality, we assume that items with higher ratings are more preferable to the user, so the items with the highest rating(s) will be recommended to the current user.

III. DESIGN FOR MAPREDUCE IMPLEMENTATION

We implement the system through three MapReduce phases on the Hadoop platform to improve the scalability and efficiency of our system in big data environments.
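The map-shuffle-reduce pattern these phases share can be simulated in-memory. The mapper and reducer below mimic the first phase's grouping of reviews by item id; the item ids and review texts are made up for illustration:

```python
from collections import defaultdict

def run_phase(records, mapper, reducer):
    """One simulated MapReduce phase: map each record to (key, value)
    pairs, shuffle values into per-key groups, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)     # shuffle: same key, same node
    return {key: reducer(key, values) for key, values in groups.items()}

# Phase-1-style usage: group (item, review) tuples by item id; a real
# Reduce-I would extract preference sets and average ratings here.
reviews = [("it1", "great battery"), ("it2", "slow"), ("it1", "cheap")]
by_item = run_phase(
    reviews,
    mapper=lambda rec: [(rec[0], rec[1])],
    reducer=lambda item, texts: sorted(texts),
)
```

The same `run_phase` skeleton, with different mapper/reducer pairs, captures the shape of all three phases described next.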

The implementation is motivated by the work of S. Meng et al. [3]. The first MapReduce phase is for key terms extraction, the second MapReduce phase for similarity computation, and the third MapReduce phase for personalized rank determination.

Step 1. The first step is to process the reviews for candidate items by previous users into their preference key term sets and to compute the average rating for each candidate item. Map-I: Map ⟨iti, ⟨j, Rij⟩⟩ on iti such that the tuples with the same iti are shuffled to the same node in the form of ⟨j, Rij⟩. Reduce-I: Take ⟨j, Rij⟩ as the input and emit ⟨iti, ⟨j, PSPij⟩⟩ for each input of Map-I. The output of Reduce-I will be used as the input of Map-II to calculate the similarity.

Step 2. The second step, which is performed on both the current user's preference set and previous users' preference sets, is to compute the similarity between the current user and previous users. Map-II: Map ⟨iti, ⟨j, PSPij⟩⟩ on iti, and tuples with the same iti are shuffled to the same node in the form of ⟨j, PSPij⟩. Reduce-II: Take ⟨j, PSPij⟩ and ⟨PSC⟩ as the input, build the trie, calculate the similarity, then emit ⟨iti, ⟨j, Si⟩⟩ as output.

Step 3. The third step aims to calculate the personalized rating of each candidate item and present a personalized recommendation list to the current user. Based on the output of this step, the recommendation can be obtained. Map-III: Map ⟨iti, ⟨j, rij, Si⟩⟩ on iti so that the tuples with the same iti are shuffled to the same node in the form of ⟨j, rij, Si⟩. Reduce-III: Take ⟨j, rij, Si⟩ as the input, and emit Ranking list = {⟨pri, iti⟩}, i = 1 to N, where pri is the personalized rating of the current user for item i.

Algorithm 4 simPair(nodex, nodey, θ, S)
Input: Two node pointers nodex, nodey
       The similarity threshold θ
       Set S
Output: S - the list of users who are similar to the current user
 1: for each record id Rx in nodex do
 2:     prefx ← preference(Rx)
 3:     for each record id Ry in nodey do
 4:         prefy ← preference(Ry)
 5:         sim ← |prefx ∩ prefy| / |prefx ∪ prefy|
 6:         if sim > θ then
 7:             if Rx ≠ PSC and Ry = PSC then
 8:                 if Rx ∉ S then
 9:                     S ← S ∪ Rx
10:                 end if
11:             end if
12:             if Ry ≠ PSC and Rx = PSC then
13:                 if Ry ∉ S then
14:                     S ← S ∪ Ry
15:                 end if
16:             end if
17:         end if
18:     end for
19: end for

Algorithm 5 Set Similarity Join Based Item Recommendation
Input: The preference key term set of the current user, PSC
       The set of candidate items, ItemSet = {it1, …, itn}
       The minimum similarity threshold θ
       The limit K
Output: The top K rated items
 1: for each item in ItemSet do
 2:     S ← ∅, sum ← 0, r ← 0
 3:     for each review Rj of the item do
 4:         if PSPj = ∅ then
 5:             for each text ti of length > 2 do
 6:                 if ti ∈ treasury then
 7:                     PSPj ← PSPj ∪ ti
 8:                 end if
 9:             end for
10:             Store PSPj in IBP
11:         else
12:             Retrieve PSPj from IBP
13:         end if
14:         Insert PSPj in CandidatePreferenceSet, CPS
15:     end for
16:     Insert PSC into CandidatePreferenceSet, CPS
17:     Build trie T for each indexing element in CPS
18:     S ← ExtendedPrefixFilter(CPS)
19:     k ← 0
20:     for each record id Rj ∈ S do
21:         k ← k + |PSC ∩ Rj.preference| / |PSC ∪ Rj.preference|
22:     end for
23:     k ← 1/k
24:     for each record id Rj ∈ S do
25:         sum ← sum + 1
26:         r ← r + Rj.rating
27:     end for
28:     r ← r / sum
29:     Pr ← r
30:     for each record id Rj ∈ S do
31:         Pr ← Pr + k × (|PSC ∩ Rj.preference| / |PSC ∪ Rj.preference|) × (Rj.rating − r)
32:     end for
33:     Sort candidate items based on Pr value
34:     Return the K items having the highest ratings
35: end for

IV. EXPERIMENTAL EVALUATION

Experiments are conducted to evaluate the accuracy and scalability of three algorithms: the user-based algorithm using the Pearson Correlation Coefficient (UPCC) [1], Keyword-aware service recommendation (KASR) [3], and set-similarity join based recommendation (SSJR). While taking values for KASR, KASR-ASC is taken into consideration. In the first part, we compare SSJR with UPCC and KASR in MAE (Mean Absolute Error) [5], MAP (Mean Average Precision)

Fig. 2. Comparison of MAE and NMAE values of UPCC, KASR, and SSJR

Fig. 3. Comparison of MAP values of UPCC, KASR, and SSJR in Top 3 and Top 5 recommendations
[2], and DCG (Discounted Cumulative Gain) [7] to evaluate the accuracy of SSJR. The scalability of SSJR is evaluated by executing the algorithm on different numbers of nodes. The experiments were performed on a multi-node Hadoop cluster equipped with 9 nodes of Intel(R) Core(TM) i3 3.40 GHz machines with 16 GB of main memory. We used Hadoop version 1.2.1 and HBase version 0.94.2 for the MapReduce framework and the NoSQL database, respectively. We used three data sets of sizes 1 GB, 512 MB, and 256 MB from an online resource [21].
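Minimal reference implementations of the accuracy metrics used below can be sketched as follows; the log2 position discount and the normalization of DCG to [0, 1] are common conventions assumed here, not details taken from the paper:

```python
from math import log2

def mae(predicted, actual):
    """Mean absolute error over a test set of ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def nmae(predicted, actual, r_min, r_max):
    """MAE normalized by the rating scale, making recommenders with
    different scales comparable."""
    return mae(predicted, actual) / (r_max - r_min)

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the ideal (sorted) DCG, so 1.0 is a perfect ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0
```

Lower MAE/NMAE is better; higher (N)DCG is better, with 1.0 meaning the recommended list is already in ideal order.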
A. Accuracy evaluation

This measures the difference between the system's estimated ratings and the real ratings. The most popular such metric is the MAE. The MAE measures the difference, as an absolute value, between the prediction of the algorithm and the real rating; it measures the overall error difference between predicted and real ratings over the total number of ratings in the test set. The Normalized Mean Absolute Error (NMAE) [6] metric is also used to measure prediction accuracy. NMAE normalizes the MAE to the range of the respective rating scale in order to make results comparable among recommenders.

Figure 2 shows the MAE and NMAE values of UPCC, KASR, and SSJR. The lower the MAE or NMAE, the more accurate the predictions. As shown in Figure 2, SSJR performs slightly better than KASR and shows significant improvement compared to UPCC.

To evaluate the quality of the Top-K service recommendation list, MAP and DCG are used. Average Precision is the average of the precision values obtained for the set of top-K documents after each relevant document is retrieved. DCG measures the performance of a recommendation system based on the graded relevance of the recommended entities. It varies from 0.0 to 1.0, with 1.0 representing the ideal ranking of the entities.

Fig. 4. Comparison of DCG values of UPCC, KASR, and SSJR in Top 3 and Top 5 recommendations

A higher MAP or DCG indicates a higher quality of the predicted service recommendation list. We can see in Figures 3 and 4 that the MAP and DCG values of SSJR are comparatively higher than those of KASR and UPCC. It can also be seen that the MAP values decrease when K increases, while the DCG values increase when K increases.

B. Scalability evaluation

Speedup [8] is adopted to measure the performance of SSJR in terms of scalability. Speedup refers to how much faster a distributed algorithm is than a corresponding sequential algorithm. To verify the scalability of SSJR, experiments were conducted on clusters of 1 to 9 nodes, with three data sets (256 MB, 512 MB, and 1 GB). From Figure 5 we can see that the speedup of SSJR increases relatively linearly with the growth of the number of nodes. The experimental results show that SSJR on MapReduce on the Hadoop platform has good scalability over big data and performs better with larger data sets.

V. RELATED WORKS

The first recommender system was developed by Goldberg, Nichols, Oki and Terry in 1992 [12]. Since then, many recommender systems have been developed in research because of their relevance to industry. Unlike content-based recommendation methods [2], collaborative recommender systems [1] (or collaborative filtering systems) attempt to anticipate and predict the utility of items for a particular user based on the items previously rated by other users. Using generalized stereotypes, the Grundy [13] system built individual user models and used them to recommend relevant books to each user. Later on, the Tapestry system relied on each user to identify like-minded users manually. GroupLens [9], Video

Fig. 5. Speedup values variation with increasing number of nodes

Recommender [10], and Ringo [11] were the first systems to use collaborative filtering algorithms to automate prediction. Another important system that used a CF-based approach is UPCC [1], which used an algorithm based on the covariance divided by the product of the standard deviations.

With the advancement of cloud computing software tools such as Apache Hadoop, MapReduce, and Mahout [14], it has become feasible to design and execute scalable recommender systems in big data environments. An example is the user-based collaborative filtering recommendation algorithm on Hadoop proposed by Z. D. Zhao et al. [15]. A parallel user profiling approach for a scalable recommender system using MapReduce was proposed by H. Liang et al. [16]. Later, Jin et al. [17] presented an item-based CF algorithm for large-scale video recommendation. Inspired by these, KASR was introduced by S. Meng et al. in 2014 [3].

On the other hand, a great deal of research has addressed the problem of similarity join, most of it using the MapReduce platform. The solution using prefix filtering developed by Vernica et al. [18], the document similarity self-join by R. Baraglia et al. [19], and V-SMART-Join [20] are worth mentioning among them. Then, C. Kim et al. [4] proposed a more efficient method using extended prefix filtering.

VI. CONCLUSION

In this work, a scalable method for building recommender systems based on similarity join is proposed. The system is designed using the MapReduce framework so as to work with big data applications. The system can significantly reduce unnecessary computation overhead, such as redundant comparisons in the similarity computing phase, using a method called extended prefix filtering. Experimental results show significant improvement in scalability and performance over the most efficient existing solutions for item recommendation.

In future, the efficiency of the proposed work can be improved by using a cosine-based weight vector approach for calculating similarity. In this approach, the number of occurrences of a word is assumed to be directly proportional to the weight of the word. Thus a weight vector corresponding to each preference set can be formed, the cross product of which will give a more accurate similarity value. This will give more reliable results.

REFERENCES

[1] G. Adomavicius & A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734-749, 2005.
[2] J. L. Herlocker, J. A. Konstan, L. G. Terveen, & J. T. Riedl, Evaluating collaborative filtering recommender systems, ACM Transactions on Information Systems, vol. 22, no. 1, pp. 5-53, 2004.
[3] S. Meng, W. Dou, X. Zhang, & J. Chen, KASR: A Keyword-Aware service recommendation method on MapReduce for big data applications, IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 12, pp. 3221-3231, 2014.
[4] C. Kim & K. Shim, Supporting set-valued joins in NoSQL using MapReduce, Information Systems, vol. 49, pp. 52-64, 2015.
[5] K. Lakiotaki, N. F. Matsatsinis, & A. Tsoukis, Multi-Criteria user modeling in recommender systems, IEEE Intelligent Systems, vol. 26, no. 2, pp. 64-76, 2011.
[6] S. Shinde & M. Potey, Survey on evaluation of recommender systems, International Journal of Engineering and Computer Science, vol. 4, no. 2, pp. 10351-10355, 2015.
[7] G. Shani & A. Gunawardana, Evaluating recommendation systems, in Recommender Systems Handbook, Springer, pp. 257-297, 2010.
[8] X. Yang, Y. Guo, & Y. Liu, Bayesian-Inference based recommendation in online social networks, IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 4, pp. 642-651, 2013.
[9] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, & J. Riedl, GroupLens: an open architecture for collaborative filtering of netnews, In Proc. of the ACM Conference on Computer Supported Cooperative Work, pp. 175-186, 1994.
[10] A. Borchers, J. Herlocker, J. Konstan, & J. Riedl, Ganging up on information overload, IEEE Computer, vol. 31, no. 4, pp. 106-108, 1998.
[11] U. Shardanand & P. Maes, Social information filtering: Algorithms for automating "Word of Mouth", In Proc. of CHI '95, Denver, CO, 1995.
[12] D. Goldberg, D. Nichols, B. Oki, & D. Terry, Using collaborative filtering to weave an information Tapestry, Communications of the ACM, vol. 35, no. 12, pp. 61-70, 1992.
[13] R. Burke, A. Felfernig, & M. Göker, Recommender systems: An overview, Aaai.org, 2016.
[14] S. G. Walunj & K. Sadafale, An online recommendation system for e-commerce based on Apache Mahout framework, In Proc. of the 2013 ACM SIGMIS International Conference on Computers and People Research, pp. 153-158, 2013.
[15] Z. D. Zhao & M. S. Shang, User-Based collaborative filtering recommendation algorithms on Hadoop, In Proc. of the Third International Workshop on Knowledge Discovery and Data Mining, pp. 478-481, 2010.
[16] H. Liang, J. Hogan, & Y. Xu, Parallel user profiling based on folksonomy for large scaled recommender systems: An implementation of cascading MapReduce, In Proc. of IEEE International Conference on Data Mining Workshops, pp. 156-161, 2010.
[17] Y. Jin, M. Hu, H. Singh, D. Rule, M. Berlyant, & Z. Xie, MySpace video recommendation with Map-Reduce on Qizmt, In Proc. of the IEEE Fourth International Conference on Semantic Computing, pp. 126-133, 2010.
[18] R. Vernica, M. J. Carey, & C. Li, Efficient parallel set-similarity joins using MapReduce, In Proc. of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 495-506, 2010.
[19] R. Baraglia, G. D. F. Morales, & C. Lucchese, Document similarity self-join with MapReduce, In Proc. of the IEEE International Conference on Data Mining, pp. 731-736, 2010.
[20] A. Metwally & C. Faloutsos, V-SMART-Join: A scalable MapReduce framework for all-pair similarity joins of multisets and vectors, In Proc. of the VLDB Endowment, pp. 704-715, 2012.
[21] Burrp. Available from: http://www.burrp.com/. [Jan 2016]
reliable results.
