61.a Big Data Analysis Method Based On Modified

This document discusses a big data analysis method based on a modified collaborative filtering recommendation algorithm. It first provides background on big data and recommendation systems. It then discusses related work in recommendation systems and collaborative filtering algorithms. Specifically, it notes that traditional collaborative filtering algorithms using Pearson correlation can have errors, so this study modifies the similarity value calculation using normal recovery similarity measure to address this issue. The document proposes implementing this modified collaborative filtering algorithm in a Hadoop cloud environment to improve efficiency for large-scale data. It measures execution times with different numbers of nodes and compares to a single machine to analyze speedup and efficiency.


Open Phys.

2019; 17:966–974

Research Article

Nan Yin*

A Big Data Analysis Method Based on Modified Collaborative Filtering Recommendation Algorithms
https://doi.org/10.1515/phys-2019-0102
Received Oct 18, 2019; accepted Nov 20, 2019
Abstract: With the rapid development of e-commerce, collaborative filtering recommendation systems have been widely used on various network platforms. Using a recommendation system to accurately predict customers' preferences for goods can solve the problem of information overload faced by users and increase users' dependence on the network platform. Because recommendation systems based on collaborative filtering can recommend goods that are abstract or difficult to describe in words, research on collaborative filtering has attracted more and more attention. According to past research, if the Pearson correlation coefficient is used in a collaborative filtering algorithm, errors occur under special circumstances. In this study, the normal recovery similarity measure is used to modify the similarity value and thereby correct the error value of the collaborative filtering recommendation algorithm. Based on this, a big data analysis method based on a modified collaborative filtering recommendation algorithm is proposed. The method was implemented in a cloud Hadoop environment, and its execution time was measured with 2, 5 and 8 nodes. The results were then compared with the execution time of a single machine, and the speedup ratio and efficiency were analyzed. The experimental results show that the execution time increases with the number of neighbors. When the number of nodes is 5 or 8, the execution time is greatly reduced, which improves the efficiency of the collaborative filtering algorithm and allows it to cope with massive data in the future.

Keywords: collaborative filtering; big data; cloud environment

PACS: 89.70.Eg, 83.85.Ns, 89.75.-k

*Corresponding Author: Nan Yin: Business School, Nanjing Xiaozhuang University, Nanjing 211171, China; Email: [email protected]

Open Access. © 2019 N. Yin, published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 License

1 Introduction

With the progress of information technology, big data, also called large data, has come to refer to a very large amount of information. When the data are so large and complex that a database system cannot store, calculate, process and analyze them into interpretable information within a reasonable time, they are called big data. These massive data contain useful information, such as unknown correlations, hidden patterns and potential market trends, which may hold unprecedented knowledge and applications waiting to be discovered [1]. However, due to the huge amount of data and the rapid flow of data, traditional technology is often unable to process and analyze them efficiently, prompting researchers to constantly develop new generations of data storage equipment and technology in the hope of extracting valuable information from large data. Many companies are committed to meeting the needs of consumers. To satisfy those needs, the researcher must first understand what users need. How to recommend what consumers need or like is the most important step in satisfying them. The researcher can make recommendations based on the habits and preferences of consumers. Quantitative data can be used as the basis for analysis, and big data analysis has become closely related to daily life.

Due to the explosive growth of digital information and the increasing number of visitors using the network, information overload has become a potential challenge. People want to get interesting information on the network in real time, which is also the main reason for the increasing demand for recommendation systems. A recommendation system can filter out important and useful information according to users' preferences and interests, and can therefore solve the problem of information overload. In addition, a recommendation system can predict products that may be of interest to a particular user, based on other users who have similar preferences to that user. That is to say, content-based filtering and collaborative filtering are common methods in recommendation systems. For users,
a recommendation system can greatly shorten the time needed to browse a large amount of information and quickly select suitable products; for service providers, introducing a recommendation system can help customers find products of interest in real time, so that more consumers will be willing to buy products on the service platform and become loyal customers.

With the rapid development of the Internet, there are many open resources on the network and a high proportion of new information, yet there is no way to compare and analyze the filtered information, which makes it difficult for users to distinguish and select the appropriate information. This is the commonly discussed problem of Internet information overload. People search for resources on the network with the help of search engines, whereas a recommendation system is a concept that actively provides the information users need [2].

In order to meet the needs of different users in big data, recommendation algorithms have been developed, among which is the collaborative filtering recommendation algorithm. Current collaborative recommendation algorithms are designed for personal computers. To cope with the trend of massive data, the system must know the user's interests at the moment and meet the user's needs in time, so the speed of data processing is the decisive factor. The execution speed of a PC cannot meet the real-time requirement, so combining the cloud with the collaborative filtering algorithm is worth implementing.

According to past research, if the Pearson correlation coefficient is used in a collaborative filtering algorithm, errors occur in special cases. In this study, the Normal Recovery Similarity Measure is used to modify the similarity value to correct the error value of the collaborative filtering recommendation algorithm, and it serves as the basis of the collaborative filtering algorithm.

There are two main purposes of this study. The first is to measure the running time with 2, 5 and 8 nodes in the cloud Hadoop environment, compare it with the running time of a single computer, and then analyze the acceleration and efficiency. The second is to analyze the prediction results obtained with three algorithms: the Jaccard similarity coefficient, Pearson similarity and the Normal recovery similarity measure.

2 Discussions on Related Literature

2.1 Recommendation System

A recommendation system provides consumers with suggestions on which goods to buy. These suggestions are based on many decisions, such as: What products do consumers buy? Which movies did the consumer see? What articles do consumers read online? Due to the explosive growth of digital information and the increasing number of visitors using the network, information overload has become a potential challenge. People want to get interesting information on the network in real time, which is also the main reason for the increasing demand for recommendation systems. A recommendation system can filter out important and useful information according to users' preferences and interests, and can therefore solve the problem of information overload. In addition, a recommendation system can predict products that may be of interest to a particular user, based on other users who have similar preferences to that user [3].

A recommendation system can greatly shorten the time users need to browse a large amount of information and quickly select suitable products. For service providers, introducing a recommendation system can help customers find products of interest in real time, so that more consumers will be willing to buy products on the service platform and become loyal customers. The operation process of a recommendation system is as follows: it first collects the user's information, including preferences and purchased products; the system then learns and builds models independently; finally, it predicts products that users may be interested in and recommends them, while collecting the user's selections and returning to the first stage for repeated execution [4].

In order to reduce the additional cost of searching for information, a recommendation system can recommend potential information, services or products that users may need according to their preferences, interests, behaviors or needs [5]. A recommendation system is a system that helps users filter information. Its core task is not only to filter information effectively, but also to find out users' preferences and give users the information they are interested in [6]. With the support of a recommender system, the flood of information and the complexity of online search can be reduced [7], and the convenience of searching and filtering network data can be improved.

According to the method used, common recommendation systems are divided into three types: collaborative
filtering, content filtering and knowledge-based recommendation. Content-based filtering represents the user's preferences by the characteristics of the items, summarizes those preferences from the user's click-through records or viewing times, and recommends the items that match them [8]. Collaborative filtering collects users' evaluations of items to estimate each user's preference model, and estimates the likely score of an item from the scores of the same user group. Finally, a knowledge-based recommender can explain the relationship between needs and recommended textbooks, and recommend specific textbooks to suitable users. In the process, learners contribute their own preference model, so that the recommendation system can interact with it [9].

2.2 Collaborative filtering algorithm for the recommendation system

Collaborative filtering recommends items to a user by referring to the past preferences of other users with similar interests. The similarity between two users is calculated from each user's past scores on items. Collaborative filtering can be divided into user-based filtering and item-based filtering. Collaborative filtering aims at identifying other users who have similar preferences to the target user, while Schafer et al. argue that people-to-people correlation recommendation refers to the relevance of users' purchases on e-commerce websites [10].

O'Donovan & Smyth [11] pointed out that collaborative filtering recommendation, also known as social filtering recommendation, mainly uses the experience or suggestions of users with similar attributes or interests as the basis for providing personalized information. By recording and comparing users' product or service preference data, users are divided into several communities with a high degree of internal relevance, which serve as the reference for collaborative recommendation. Herlocker, Konstan, & Riedl [12] also mentioned that a collaborative filtering system predicts the user's preference for a certain transaction or piece of information by connecting a group of people who share common interests with the user. Herlocker, et al. [13] pointed out that the operating principle of collaborative filtering is to automate the word-of-mouth effect, and that the suggestions made by the system are based on the preferences of other users with similar preferences.

Assume that there is a user set of N users $u_i$, $1 \le i \le N$, and an item set of M items $p_j$, $1 \le j \le M$. A user $u_i$ expresses an opinion of an item $p_j$ as a score $r_{ij}$, a positive integer. Usually, the higher the score, the more positive the user's feedback. If a user $u_i$ has not scored an item $p_j$, then $r_{ij} = 0$. The information is stored and expressed in the form of a matrix R:

$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1M} \\ r_{21} & r_{22} & \cdots & r_{2M} \\ r_{31} & r_{32} & \cdots & r_{3M} \\ \vdots & \vdots & & \vdots \\ r_{N1} & r_{N2} & \cdots & r_{NM} \end{pmatrix} \tag{1}$$

The main purpose of collaborative filtering is to generate an ordered list of recommended products for each user based on the information in the user-item score matrix. For this purpose, each collaborative filtering recommendation system has an algorithm to predict the score of each user for each item. The predicted ratings are used to generate the list of recommendations.

Traditional collaborative filtering recommendation finds similar items or users by comparing the similarity between users or between items. The most basic way is to add up and average the scores that similar users gave an item and use the result as the predicted score. Although this is reasonable and effective in theory, in actual recommendation system data severe sparsity makes the similarity comparison almost impossible to complete, and the large number of users and items leads to a very time-consuming computation.

2.3 User-based Collaborative Filtering Algorithms

The user-based collaborative filtering algorithm is suitable when the number of items is much larger than the number of users and the users change little; the item-based collaborative filtering algorithm is suitable when the number of users is much larger than the number of items and the items change little. Because the number of items in this experiment is large and fixed, the user-based collaborative filtering algorithm is adopted in this paper. User-based collaborative filtering, first proposed by Schafer et al. [14], refers to recommendation based on the similarity of preferences between users. For example, it recommends products that a consumer might like based on the relevance of goods purchased by other consumers on e-commerce websites. The algorithm uses the entire user and item database to predict a user's score for an item. The most commonly used technique is the Nearest Neighbor Method, which identifies the users who scored the same items similarly, i.e.
the user's neighbors. The algorithm then predicts the user's scores from the items scored by these neighbors and uses the Top-N recommendation method to recommend the first N items of interest.

The basic idea of the user-based collaborative filtering algorithm is as follows: if user A likes item a, user B likes items a, b and c, and user C likes items a and c, then user A is similar to users B and C because they all like a; since users who like a also like c, item c is recommended to user A. The algorithm uses the nearest-neighbor algorithm to find a user's neighbor set, whose members have preferences similar to the user's, and predicts the user's scores according to the neighbors' preferences.

The mathematical model of the collaborative filtering recommendation algorithm can be expressed as follows. For each user, the optimization goal is:

$$J^{(j)} = \min_{\theta^{(j)}} \frac{1}{2m^{(j)}} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2} + \frac{\lambda}{2m^{(j)}} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^{2} \tag{2}$$

Here $\theta^{(j)}$ denotes the preference characteristics of user j, $x^{(i)}$ denotes the characteristics of movie i, $y^{(i,j)}$ denotes the rating of user j for movie i, $i : r(i, j) = 1$ denotes that user j has rated movie i (the value is not missing), and $m^{(j)}$ denotes the number of movies rated by user j. Since both terms carry the factor $1/m^{(j)}$, which does not change the minimizer, the formula can also be written as follows:

$$J^{(j)} = \min_{\theta^{(j)}} \frac{1}{2} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2} + \frac{\lambda}{2} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^{2} \tag{3}$$

Then gradient descent is used to update $\theta^{(j)}$, the preference feature of user j:

$$J^{(j)} = \min_{\theta^{(j)}} \frac{1}{2} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2} + \frac{\lambda}{2} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^{2} \tag{4}$$

If the users' preferences $\theta$ are known, the same step can learn the movies' features x. For each movie, the optimization function is

$$J^{(i)} = \min_{x^{(i)}} \frac{1}{2} \sum_{j:r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2} + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^{2} \tag{5}$$

Then gradient descent is used to update $x^{(i)}$:

$$x^{(i)} = x^{(i)} - \alpha \left[ \sum_{j:r(i,j)=1} \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right) \theta^{(j)} + \lambda \sum_{k=1}^{n} x_k^{(i)} \right] \tag{6}$$

The resulting $x^{(i)}$ is the feature of movie i.

2.4 Collaborative Filtering Program

The first step is similarity calculation: calculating the similarity between users or between items is the key step of collaborative filtering. Common methods include cosine similarity, adjusted cosine similarity and the Pearson correlation coefficient.

The second step is neighbor selection: whenever different users join the neighborhood, the accuracy of prediction changes. Therefore, the method of selecting neighbors for the active user should be chosen carefully; the traditional approach is the Top-N algorithm, which uses the N nearest neighbors for prediction. In addition, people in different countries or regions are more likely to have different preferences, so the location of users should be considered when selecting neighbors for active users. Thanks to the development of mobile networks, location information can be obtained from the mobile client or IP address and sent to the server for further analysis. Usually, users can be divided into multiple partitions according to their location, and users in the same partition have priority in neighbor selection.

The third step is prediction: the score of each candidate item is predicted from the neighbors' similarities and scores.

The fourth step is item ranking: once the predictions are obtained, the recommendation system needs to rank all items according to their predicted scores. In order to improve the diversity of suggestions, items with larger predicted scores and lower popularity should rank higher.

The fifth step is selecting the first n items: after all the options are sorted, the first n items are provided to the user, where n is a parameter set before the recommendation task.

2.5 Computation of Similarity

The existing basic methods for calculating similarity are based on vectors: the distance between two vectors is calculated, and the closer the distance, the greater the similarity. In the two-dimensional user-item preference matrix of the recommendation scenario, the researcher
can use a user's preferences for all items as a vector to calculate the similarity between users, or use all users' preferences for one item as a vector to calculate the similarity between items.

2.5.1 Pearson correlation coefficient

The Pearson correlation coefficient conveys two things. One is size, or strength: the greater the absolute value, the higher the correlation between the two variables; the smaller the absolute value, the lower the correlation. The other is direction, given by the sign. A positive coefficient means that the two variables change in the same direction: when one becomes larger the other becomes larger, and when one becomes smaller the other becomes smaller. This is called positive correlation. A negative coefficient means that they change in opposite directions: when one becomes larger the other becomes smaller, and vice versa. This is called negative correlation. If the coefficient is zero, then when one variable changes the other may become larger, become smaller or stay unchanged; this is zero correlation.

The Pearson correlation coefficient is generally used to calculate how closely two interval-scale variables are related, and its value lies in [−1, +1]. Here $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$, $s_y$ are the standard deviations of the x and y samples.

$$P(x, y) = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{(n-1)\, s_x s_y} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} \tag{7}$$

2.5.2 Jaccard similarity coefficient

The Jaccard similarity coefficient is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{8}$$

If A and B are both empty, we define J(A, B) = 1.

$$0 \le J(A, B) \le 1 \tag{9}$$

The MinHash min-wise independent permutations locality sensitive hashing scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function [15].

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} \tag{10}$$

3 Improvement of Collaborative Filtering Algorithms by Normal Restoration Similarity Measure

There are many different similarity algorithms in collaborative filtering. The core concept of the Jaccard similarity coefficient can be seen from the following formula:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{11}$$

That is, the number of items scored by both user A and user B is divided by the number of items scored by user A or user B; the result falls between 0 and 1.

The Pearson correlation coefficient is the best-known similarity algorithm, and its value falls between −1 and 1. If user-based collaborative filtering is used, the formula is as follows:

$$\mathrm{Sim}(u, v) = \frac{\sum_{i \in I} \left( r_{u,i} - \bar{r}_u \right)\left( r_{v,i} - \bar{r}_v \right)}{\sqrt{\sum_{i \in I} \left( r_{u,i} - \bar{r}_u \right)^2}\,\sqrt{\sum_{i \in I} \left( r_{v,i} - \bar{r}_v \right)^2}} \tag{12}$$

I is the set of items scored by both user u and user v. $r_{u,i}$ is user u's score for item i, $r_{v,i}$ is user v's score for item i, and $\bar{r}_u$ and $\bar{r}_v$ are the average of all of user u's scores and the average of all of user v's scores. If the collaborative filtering is item-based, the formula is as follows:

$$\mathrm{Sim}(i, j) = \frac{\sum_{u \in U} \left( r_{u,i} - \bar{r}_i \right)\left( r_{u,j} - \bar{r}_j \right)}{\sqrt{\sum_{u \in U} \left( r_{u,i} - \bar{r}_i \right)^2}\,\sqrt{\sum_{u \in U} \left( r_{u,j} - \bar{r}_j \right)^2}} \tag{13}$$

U is the set of users who rated both item i and item j. $r_{u,i}$ is user u's rating of item i, $r_{u,j}$ is user u's rating of item j, and $\bar{r}_i$ and $\bar{r}_j$ are the average of all ratings of item i and the average of all ratings of item j.

These two collaborative filtering algorithms use the same form of prediction formula. The user-based collaborative filtering formula is:

$$\mathrm{Score}_{u,i} = \frac{\sum \mathrm{Rating}_{v,i} \cdot \mathrm{sim}(u, v)}{\sum \mathrm{sim}(u, v)} \tag{14}$$
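
The similarity measures of Eqs. (11) and (12) and the user-based prediction of Eq. (14) can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's R implementation: each user's ratings are assumed to be stored in a dictionary mapping item ids to scores, and the function names `jaccard`, `pearson` and `predict` as well as the sample ratings are hypothetical.

```python
import math

def jaccard(ratings_u, ratings_v):
    """Eq. (11): items rated by both users over items rated by either user."""
    a, b = set(ratings_u), set(ratings_v)
    if not a and not b:
        return 1.0  # convention stated after Eq. (8)
    return len(a & b) / len(a | b)

def pearson(ratings_u, ratings_v):
    """Eq. (12): sums run over co-rated items, means over all of a user's ratings."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0  # assumption: no co-rated items means no similarity
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common)) * \
          math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    return num / den if den else 0.0  # assumption: zero variance gives 0

def predict(ratings_u, neighbours, item, sim):
    """Eq. (14): similarity-weighted average of the neighbours' ratings of `item`."""
    num = den = 0.0
    for ratings_v in neighbours:
        if item in ratings_v:
            s = sim(ratings_u, ratings_v)
            num += ratings_v[item] * s
            den += s
    return num / den if den else None  # None when no neighbour rated the item

# Hypothetical ratings, for illustration only.
u = {"a": 5, "b": 3, "c": 4}
v = {"a": 4, "b": 2, "c": 3, "d": 5}
w = {"a": 5, "b": 3, "c": 4, "d": 3}
print(jaccard(u, v), pearson(u, v), predict(u, [v, w], "d", pearson))
```

Passing the similarity function as an argument keeps the prediction step identical for every measure, which mirrors the way the experiments compare several similarity measures under one prediction scheme.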

Eq. (14) means the following: for an item that user u has not scored but user v has, each such score is multiplied by the similarity between users u and v, the products are summed, and the result is divided by the sum of the similarities between u and v. The item-based collaborative filtering formula is:

$$\mathrm{Score}_{u,i} = \frac{\sum \mathrm{Rating}_{u,j} \cdot \mathrm{sim}(i, j)}{\sum \mathrm{sim}(i, j)} \tag{15}$$

However, in some cases, errors may occur in the calculation of the Pearson correlation coefficient. The following result can be obtained when calculating the similarities between user $u_1$ and user $u_2$, and between user $u_2$ and user $u_3$:

$$\mathrm{Sim}(u_1, u_2) > \mathrm{Sim}(u_3, u_2) \tag{16}$$

But in fact, the similarity between users $u_2$ and $u_3$ should be the higher one, because user $u_1$'s scores range from 1 to 5, while the scores of users $u_2$ and $u_3$ range from 2 to 4. This study proposes an improved approach: using the normal recovery similarity measure.

$$\mathrm{Sim}(u, v) = 1 - \frac{\mathrm{dist}(u, v)}{\mathrm{dist}_{\max}} = 1 - \frac{\sqrt{\sum_{i \in I} \left( nr_{u,i} - nr_{v,i} \right)^2}}{\sqrt{\sum_{k=1}^{|I|} (1 - 0)^2}} = 1 - \frac{\sqrt{\sum_{i \in I} \left( \frac{r_{u,i} - r_{u\min}}{r_{u\max} - r_{u\min}} - \frac{r_{v,i} - r_{v\min}}{r_{v\max} - r_{v\min}} \right)^2}}{\sqrt{\sum_{k=1}^{|I|} 1}} \tag{17}$$

The formula is simplified as follows:

$$\mathrm{Sim}(u, v) = 1 - \frac{\sqrt{\sum_{i \in I} \left( \frac{r_{u,i} - r_{u\min}}{r_{u\max} - r_{u\min}} - \frac{r_{v,i} - r_{v\min}}{r_{v\max} - r_{v\min}} \right)^2}}{\sqrt{|I|}} \tag{18}$$

With this measure, the similarity between users $u_1$ and $u_2$ is less than that between users $u_2$ and $u_3$, and the similarity between users $u_5$ and $u_6$ is 0. The prediction score is given by the normal recovery similarity prediction formula:

$$\hat{r}_{u,i} = r_{u\min} + \left( r_{u\max} - r_{u\min} \right) \frac{\sum_{u' \in U} \mathrm{sim}(u, u') \cdot nr_{u',i}}{\sum_{u' \in U} \mathrm{sim}(u, u')} \tag{19}$$

where the sums run over the neighbors $u'$ of user u, $r_{u\min}$ is the lowest score given by user u, $r_{u\max}$ is the highest score given by user u, and $\mathrm{sim}(u, u')$ is the similarity between user u and user $u'$. In this paper, the normal recovery similarity measure is used as the basis of the collaborative filtering algorithm.

4 Experimental environment and methods

The programming language used in this study is the R data analysis language. One server and four hosts were selected as the hardware cloud environment, and tests were run on 2, 5 and 8 nodes respectively. The data used in this study are from the IMDB film scoring website (http://www.imdb.com). A total of 224836 score records were used [16, 17]. Users who scored fewer than 20 items, and users who gave the same score to every item, were deleted from the data of this experiment. Because the accuracy of the collaborative filtering algorithm increases with the value of k, the number of neighbors k is tested from 1 to 10 in the experiment [18].

As a user-based collaborative filtering algorithm, the experimental procedure is divided into four parts: (1) calculating the maximum and minimum scores of all users; (2) calculating the similarity of all users; (3) calculating the prediction scores; (4) in another experiment, using the same data to recommend the item with the highest prediction score, with the number of neighbors set to 3. In this study, three different algorithms are used for prediction, namely, the Jaccard similarity coefficient, Pearson similarity and the Normal recovery similarity measure.

5 Research results and analysis

The experiment first measures the execution time on a single personal computer. As can be seen from Figure 1, where the abscissa k is the number of neighbors, the running time increases greatly as k increases because, according to the formula, the time grows exponentially with k.

Figure 1: The running time of a personal computer (the abscissa k is the number of neighbors)
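
The normal recovery similarity of Eq. (18) and the prediction of Eq. (19) in Section 3 can be illustrated with the following Python sketch. It is not the paper's R code: ratings are assumed to be dictionaries mapping item ids to scores, the names `normalize`, `nrs_similarity` and `nrs_predict` and the sample ratings are hypothetical, and the handling of users whose scores are all identical is an assumption (the experiment removes such users beforehand).

```python
import math

def normalize(ratings):
    """Min-max scale one user's ratings to [0, 1] using that user's own range."""
    lo, hi = min(ratings.values()), max(ratings.values())
    if hi == lo:
        # Assumption: a user with a constant score is mapped to 0.0.
        return {item: 0.0 for item in ratings}
    return {item: (r - lo) / (hi - lo) for item, r in ratings.items()}

def nrs_similarity(ratings_u, ratings_v):
    """Eq. (18): 1 minus the normalized Euclidean distance over co-rated items."""
    nu, nv = normalize(ratings_u), normalize(ratings_v)
    common = nu.keys() & nv.keys()
    if not common:
        return 0.0  # assumption: no co-rated items means no similarity
    dist = math.sqrt(sum((nu[i] - nv[i]) ** 2 for i in common))
    return 1.0 - dist / math.sqrt(len(common))

def nrs_predict(ratings_u, neighbours, item):
    """Eq. (19): weighted normalized rating, mapped back to user u's own range."""
    lo, hi = min(ratings_u.values()), max(ratings_u.values())
    num = den = 0.0
    for ratings_v in neighbours:
        if item in ratings_v:
            s = nrs_similarity(ratings_u, ratings_v)
            num += s * normalize(ratings_v)[item]
            den += s
    return lo + (hi - lo) * num / den if den else None

# Hypothetical ratings: u uses the range 1 to 5, w and v use the range 2 to 4.
u = {"a": 1, "b": 3, "c": 5}
w = {"a": 2, "b": 3, "c": 4, "d": 2}
v = {"a": 2, "b": 4, "c": 3, "d": 4}
print(nrs_similarity(u, w), nrs_similarity(u, v), nrs_predict(u, [v, w], "d"))
```

In this example users u and w rank the co-rated items identically but on different scales, so their normalized ratings coincide and the similarity is 1, which is the scale effect the measure is meant to remove.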

Table 1: Efficiency comparison of a personal computer with 2, 5 and 8 nodes

K    PC      2 nodes   Acceleration   5 nodes   Acceleration   8 nodes   Acceleration
                       ratio                    ratio                    ratio
1    721     1146      0.624          529       1.452          345       2.445
2    2245    3046      0.654          1391      1.539          879       2.489
3    4456    6234      0.691          3156      1.482          1846      2.546
4    7397    11862     0.663          5256      1.383          2875      2.583
5    10695   15672     0.647          7145      1.584          4489      2.498
6    12341   22478     0.586          8763      1.389          5446      2.437
7    14619   22478     0.642          12189     1.498          6450      2.510
8    12147   32458     0.545          12487     1.587          7215      2.674
9    22462   36542     0.629          15655     1.445          8889      2.348
10   25246   35425     0.542          14586     1.478          11241     2.457

From Figure 2, it can be seen that when the number of


hardware resources and nodes in cloud environment is too
low (Curve 1), it is not suitable for cloud execution. How-
ever, when the number of hardware resources and nodes
in cloud environment is increased (Curve 3, 4), the collabo-
rative filtering algorithm can effectively accelerate the cal-
culation.
The formula used for calculating the acceleration ratio
is speedup=T a /T b ,
T a represents the running time of a personal computer,
T b represents Hadoop runtime.
Figure 2: Comparisons of running time between PC and Hadoop with 2, 5 and 8 nodes

Because the Hadoop hardware environment consists of three hosts, the tested configurations correspond to two, five and eight nodes [19]. Table 1 compares the performance of the PC with that of two nodes as the k value is adjusted (k = 1-10). With two nodes, the job happens to run on a single host. Compared with running on the personal computer, it spends additional time on data transfer and configuration, so its execution efficiency is poor.
Table 1 also shows that, with 5 and 8 nodes, Hadoop performs significantly better than the PC as the k value is adjusted. With five nodes, the acceleration ratio is greater than 1: five nodes run about 0.5 times faster than a single computer, a ratio of roughly 1.5 [20]. With 8 nodes, the acceleration ratio rises above 2, to about 2.5, and reaches its maximum of 2.67 when the number of neighbors k equals 4.
Figure 2 shows the running time curves of the PC and of Hadoop with 2, 5 and 8 computing nodes.

In another experiment, three different algorithms are used to calculate the prediction results. The experimental results are completely consistent. Two reasons can be suggested for this. The first is the data set. Because the data source used in this experiment is website ratings, it depends on the raters' interests, so the rating matrix is sparse [21]. Users tend to rate only the items they like, so the similarities are positively correlated and the different similarity calculations give similar results. The second reason is that this study recommends only the highest-scoring item, so other possible items may be ignored.
Table 2 lists part of the recommendation results of the three algorithms, showing the first 10 of 100 users. The table entries are movie numbers. Each row corresponds to a user and each column to an algorithm. The table shows that each user receives the same recommendation from all three algorithms.
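As a rough illustration of how different similarity measures can end up recommending the same item, the sketch below compares the Jaccard coefficient and the Pearson correlation on a tiny rating set. The ratings, the user and movie identifiers, and the helper names (`jaccard`, `pearson`, `recommend_top1`) are hypothetical and are not taken from the experiment. The normal recovery similarity measure is left out because its formula is not restated in this section, and the similarity-weighted average used for prediction is an assumption, not necessarily the paper's predictor.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {movie_id: score}); not data from the paper.
RATINGS = {
    "u1": {1: 5, 2: 4, 3: 1},
    "u2": {1: 5, 2: 4, 3: 2, 4: 5, 5: 2},
    "u3": {1: 4, 2: 5, 3: 1, 5: 3},
}


def jaccard(a, b):
    """Overlap of the two users' rated-item sets; the scores are ignored."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def pearson(a, b):
    """Pearson correlation over co-rated items; 0.0 when it is undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


def recommend_top1(target, ratings, similarity):
    """Return the unseen item with the highest similarity-weighted score."""
    seen = ratings[target]
    weighted, weights = {}, {}
    for user, items in ratings.items():
        if user == target:
            continue
        sim = similarity(seen, items)
        if sim <= 0:  # keep only positively correlated neighbours
            continue
        for item, score in items.items():
            if item not in seen:
                weighted[item] = weighted.get(item, 0.0) + sim * score
                weights[item] = weights.get(item, 0.0) + sim
    predictions = {item: weighted[item] / weights[item] for item in weighted}
    return max(predictions, key=predictions.get) if predictions else None


for name, sim in (("Jaccard", jaccard), ("Pearson", pearson)):
    print(name, recommend_top1("u1", RATINGS, sim))
```

In this toy case the two measures weight the neighbors differently, so the predicted score for movie 5 differs, yet both select movie 4 as the single top recommendation. This is consistent with the second reason suggested in the text: when only the highest-scoring item is returned, differences between similarity measures may not change the result.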

Table 2: Top 10 Recommended Results of the Three Algorithms

Users   Jaccard similarity coefficient   Normal Restoration Similarity Measure   Pearson similarity
1       925468                           925468                                  925468
2       24589                            24589                                   24589
3       252465                           252465                                  252465
4       52245774                         52245774                                52245774
5       52547                            52547                                   52547
6       38625                            38625                                   38625
7       3545562                          3545562                                 3545562
8       2542588                          2542588                                 2542588
9       75225                            75225                                   75225
10      855265                           855265                                  855265

6 Research conclusions

With e-commerce transactions becoming increasingly frequent, more and more sellers choose to sell goods online, which brings a huge number of goods. In the past, a collaborative filtering recommendation system would treat each item as a feature in its calculations, but this is unrealistic given today's massive data. The large numbers of users and commodities also make the data extremely sparse, so the recommendation system runs too slowly or cannot work at all.
With the advent of the cloud era, data is growing very fast. In a massive data environment, execution speed is the key when researchers need to find solutions to problems. In this paper, a collaborative filtering algorithm modified by the normal recovery similarity measure is adopted, and a speedup of up to 2.67 times is achieved in the simulated cloud environment. As the amount of real data increases, computation on a personal computer takes more time, and its capacity to store data is limited. Using MapReduce on the Hadoop distributed platform to distribute computation and data to different hosts can save a great deal of time and reduce the data burden. The Hadoop Distributed File System (HDFS) guarantees the correctness of the data, and the modified normal recovery similarity measure improves prediction accuracy. The experimental results show that the execution time increases with the number of neighbors. With 5 and 8 nodes, the execution time is greatly reduced, which improves the efficiency of the collaborative filtering algorithm and makes it able to cope with massive data in the future.
However, this study has some shortcomings. For example, when the collaborative filtering algorithm faces a sparse matrix, prediction becomes difficult. In follow-up studies, researchers can look for other recommendation algorithms and directions for improvement, such as constructing a multi-agent model that combines a neural network with the collaborative filtering algorithm. With the increasing amount of data, using the R language to analyze massive data runs into successive obstacles such as excessive analysis time and insufficient memory. Using the Hadoop Distributed File System (HDFS) and MapReduce in the open source Apache Hadoop software can improve computing efficiency and storage space management and increase capacity.

References

[1] D’Angeac G.D., Big data: the management revolution, Harvard Business Review, 2012, 90(10), 60-68.
[2] Xu H.L., Wu X., Li X.D., Yan B.P., Comparison study of internet recommendation system: comparison study of internet recommendation system, Journal of Software, 2009, 20(2), 350-362.
[3] Feng L., Guo W., Yu D., Gao Q., Gao K., Xue Z., et al., Classification of different therapeutic responses of major depressive disorder with multivariate pattern analysis method based on structural MR scans, PLoS ONE, 2012, 7(7), 1-11.
[4] Karabadji N.E.I., Beldjoudi S., Seridi H., Aridhi S., Dhifli W., Improving memory-based user collaborative filtering with evolutionary multi-objective optimization, Expert System with Applications, 2018, 98, 153-165.
[5] Rashid A.M., Albert I., Cosley D., Lam S.K., McNee S.M., Konstan J.A., Riedl J., Getting to know you: learning new user preferences in recommender systems, Proceedings of the 7th International Conference on Intelligent User Interfaces (January 13 - 16, 2002, San Francisco, CA, USA), ACM New York, 2002, 127-134.
[6] Liang T.P., Lai H.J., Ku Y.C., Personalized content recommendation and user satisfaction: Theoretical synthesis and empirical findings, Journal of Management Information Systems, 2006, 23(3), 45-70.
[7] Xiao B., Benbasat I., E-commerce product recommendation agents: Use, characteristics, and impact, MIS Quarterly, 2007, 31(1), 137-209.
[8] De Meo P., Quattrone G., Terracina G., Ursino D., An XML-based multiagent system for supporting online recruitment services, Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on Systems Man & Cybernetics Part A Systems & Humans, 2007, 37(4), 464-480.
[9] Yoshii K., Goto M., Komatani K., Ogata T., Okuno H.G., An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model, IEEE T Audio Speech, 2008, 16(2), 435-447.
[10] Li J., Kai Z., Yang X., Peng W., Jie W., Mitra K., et al., Category preferred canopy-k-means based collaborative filtering algorithm, Future Generation Computer Systems, 2018, 93, 1046-1054.
[11] O’Donovan J., Smyth B., Trust in recommender systems, Proceedings of the 10th international conference on Intelligent user interfaces (January 09 - 12, 2005, San Diego, CA, USA), ACM New York, 2005, 167-174.
[12] Herlocker J.L., Konstan J.A., Riedl J., Explaining collaborative filtering recommendations, Proceedings of the 2000 ACM Conference on Computer Supported Cooperative work (December 02 - 06, 2000, Philadelphia, PA, USA), ACM New York, 2000, 241-250.
[13] Herlocker J.L., Konstan J.A., Borchers A., Riedl J., An algorithmic framework for performing collaborative filtering, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (August 15-19, 1999, Berkeley, CA, USA), ACM New York, 1999, 230-237.
[14] Good N., Schafer J.B., Konstan J.A., Borchers A., Sarwar B., Herlocker J. et al., Combining collaborative filtering with personal agents for better recommendations, 1999, 439-446.
[15] Edith C., Min-Hash Sketches, Springer New York, 2016.
[16] Cantero A., Crespo F., Ferrer S., The triaxiality role in the spin-orbit dynamics of a rigid body, Applied Mathematics & Nonlinear Sciences, 2018, 3, 187-208.
[17] Gao W., Wang W., A tight neighborhood union condition on fractional-critical deleted graphs, Colloquium Mathematicum, 2017, 149, 291-298.
[18] Gao W., Wang W., New isolated toughness condition for fractional-critical graph, Colloquium Mathematicum, 2017, 147, 55-65.
[19] Khalique C.M., Mhlanga I.E., Travelling waves and conservation laws of a dimensional coupling system with Korteweg-de Vries equation, Applied Mathematics & Nonlinear Sciences, 2018, 3, 241-254.
[20] Naeem M., Siddiqui M.K., Guirao J.L.G., Gao W., New and modified eccentric indices of octagonal grid O_n^m, Applied Mathematics & Nonlinear Sciences, 2018, 3, 209-228.
[21] Pandey P.K., A new computational algorithm for the solution of second order initial value problems in ordinary differential equations, Applied Mathematics & Nonlinear Sciences, 2018, 3, 167-174.
