A Big Data Analysis Method Based on Modified Collaborative Filtering Recommendation Algorithms
2019; 17:966–974
Research Article
Nan Yin*
Open Access. © 2019 N. Yin, published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 License
A Big Data Analysis Method Based on Modified Collaborative Filtering Recommendation Algorithms | 967
filtering, content filtering and knowledge-based recommendation. Content-based filtering represents the user's preferences by the characteristics of the items, summarizes those preferences from the user's click-through records or viewing times, and recommends the items that match them [8]. Collaborative filtering collects users' evaluations of items to build a preference model for each user, and estimates the score a user would give an item from the scores given by the same user group. Finally, a knowledge-based recommender can explain the relationship between needs and recommended textbooks, and recommend specific textbooks to suitable users. In the process, learners contribute their own preference model, so that the recommendation system can interact with it [9].

2.2 Collaborative filtering algorithm for the recommendation system

Collaborative filtering makes recommendations to a user by referring to the past preferences of other users with similar interests. The similarity between two users is calculated from each user's past scores on items. Collaborative filtering can be divided into user-based filtering and item-based filtering. Collaborative filtering aims at identifying other users who have similar preferences to the target user, while Schafer et al. argue that people-to-people correlation recommendation refers to the relevance of users' purchases on e-commerce websites [10].

O'Donovan & Smyth [11] pointed out that collaborative filtering recommendation, also known as social filtering recommendation, mainly uses the experience or suggestions of users with similar attributes or interests as the basis for providing personalized information. By recording and comparing users' product or service preference data, users are divided into several communities with a high degree of internal relevance, which serve as the reference for cooperative recommendation. Herlocker, Konstan, & Riedl [12] also mentioned that a collaborative filtering system predicts the user's preference for a certain transaction or piece of information by connecting a group of people who have common interests with the user. Herlocker et al. [13] pointed out that the operating principle of collaborative filtering is to automate the word-of-mouth process, and the suggestions made by the system are based on the preferences of other users with similar preferences.

Assuming that there is a user set of N users $u_i$, $1 \le i \le N$, and an item set of M items $p_j$, $1 \le j \le M$, a user $u_i$ expresses his or her opinion of an item $p_j$ as a score $r_{ij}$, which is a positive integer. Usually, the higher the score, the more positive the user's feedback. If a user $u_i$ has not scored an item $p_j$, then $r_{ij} = 0$. The information is stored and expressed in the form of the matrix R:

\[
R = \begin{pmatrix}
r_{11} & r_{12} & \cdots & r_{1M} \\
r_{21} & r_{22} & \cdots & r_{2M} \\
r_{31} & r_{32} & \cdots & r_{3M} \\
\vdots & \vdots & \ddots & \vdots \\
r_{N1} & r_{N2} & \cdots & r_{NM}
\end{pmatrix} \tag{1}
\]

The main purpose of collaborative filtering is to generate an ordered list of recommended products for each user based on the user-item score matrix. For this purpose, each collaborative filtering recommendation system has an algorithm that predicts the score of each user for each item. The predicted ratings are then used to generate the list of recommendations.

Traditional collaborative filtering recommendation finds similar items or users according to a similarity comparison between users or items. The most basic way is to average the scores that similar users gave to an item and use the result as the predicted score. Although this method is reasonable and well grounded in theory, in actual recommendation system data the severe sparsity makes the similarity comparison almost impossible to complete, and the large number of users and items leads to a very time-consuming computing process.

2.3 User-based Collaborative Filtering Algorithms

The user-based collaborative filtering algorithm is suitable when the number of items is much larger than that of users, and users change less; the item-based collaborative filtering algorithm is suitable when the number of users is much larger than that of items, and the number of items changes less. Because the number of items in this experiment is large and fixed, the user-based collaborative filtering algorithm is adopted in this paper. User-based collaborative filtering, first proposed by Schafer et al. [14], refers to recommendation based on the similarity of preferences between users. For example, it recommends products that a consumer might like based on the relevance of goods purchased by other consumers on e-commerce websites. The algorithm uses the whole user and item database to predict a user's score for an item. The most commonly used technique is the Nearest Neighbor Method, which identifies the users who scored the same items and scored them similarly, i.e.
the users' neighbors. The algorithm then predicts the user's scores for unrated items from the scores given by these neighbors, and uses the Top-N recommendation method to recommend the first N items of interest.

The basic idea of the user-based collaborative filtering algorithm is that if user A likes item a, user B likes items a, b and c, and user C likes items a and c, then user A is similar to users B and C because they all like a; since users who like a also like c, item c is recommended to user A. The algorithm uses the nearest-neighbor algorithm to find a user's neighbor set, whose members have similar preferences to the user. The algorithm then predicts the user's preferences according to the neighbors' preferences.

The mathematical model of the collaborative filtering recommendation algorithm can be expressed as follows: for each user, the optimization goal is:

\[
J^{(j)} = \min_{\theta^{(j)}} \frac{1}{2m^{(j)}} \sum_{i:r(i,j)=1} \left( \left(\theta^{(j)}\right)^{T} x^{(i)} - y^{(i,j)} \right)^{2} + \frac{\lambda}{2m^{(j)}} \sum_{k=1}^{n} \left(\theta_{k}^{(j)}\right)^{2} \tag{2}
\]

Here $\theta^{(j)}$ denotes the preference characteristics of user j, $x^{(i)}$ denotes the characteristics of movie i, $y^{(i,j)}$ denotes the rating of user j on movie i, $i : r(i,j) = 1$ denotes that user j has rated movie i (not a missing value), and $m^{(j)}$ denotes the number of movies rated by user j. Since both terms share the factor $1/m^{(j)}$, which does not change the minimizer, the above formula can also be written as follows:

\[
J^{(j)} = \min_{\theta^{(j)}} \frac{1}{2} \sum_{i:r(i,j)=1} \left( \left(\theta^{(j)}\right)^{T} x^{(i)} - y^{(i,j)} \right)^{2} + \frac{\lambda}{2} \sum_{k=1}^{n} \left(\theta_{k}^{(j)}\right)^{2} \tag{3}
\]

Then gradient descent is used to update $\theta^{(j)}$, the preference feature of user j:

\[
J^{(j)} = \min_{\theta^{(j)}} \frac{1}{2} \sum_{i:r(i,j)=1} \left( \left(\theta^{(j)}\right)^{T} x^{(i)} - y^{(i,j)} \right)^{2} + \frac{\lambda}{2} \sum_{k=1}^{n} \left(\theta_{k}^{(j)}\right)^{2} \tag{4}
\]

Then gradient descent is used to update $x^{(i)}$:

\[
x^{(i)} = x^{(i)} - \alpha \left[ \sum_{j:r(i,j)=1} \left( \left(\theta^{(j)}\right)^{T} x^{(i)} - y^{(i,j)} \right) \left(\theta^{(j)}\right)^{T} + \lambda \sum_{k=1}^{n} \left( x_{k}^{(i)} \right) \right] \tag{6}
\]

The resulting $x^{(i)}$ is the feature of movie i.

2.4 Collaborative Filtering Program

The first step is similarity calculation: calculating the similarity between users or between items is the key step of collaborative filtering. Common methods include cosine similarity, adjusted cosine similarity and the Pearson correlation coefficient.

The second step is neighbor selection: the accuracy of prediction changes depending on which users join the neighborhood. Therefore, the researcher should carefully choose the method for selecting the neighbors of the active user, such as the traditional Top-N algorithm, which predicts from the N nearest neighbors. In addition, people in different countries or regions are more likely to have different preferences. Therefore, when selecting neighbors for active users, it is necessary to consider the location of users. Because of the development of mobile networks, location information can be obtained by the mobile client or from the IP address and sent to the server for further analysis. Usually, users can be divided into multiple partitions according to their location. Users in the same partition have priority in neighbor selection.

The third step is prediction: the active user's scores are predicted from the similarities and scores of the neighborhood.

The fourth step is item ranking: once the predictions are obtained, the recommendation system needs to rank all items according to the predicted score. In order to improve the diversity of suggestions, items with larger predictions and lower popularity should rank higher.

The fifth step is selecting the first n items: after sorting all the options, the first n items are provided to the user, where n is a parameter set before the recommendation task.
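The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: cosine similarity is chosen for step 1, prediction is a similarity-weighted average of neighbor scores, and the location partitions and popularity weighting mentioned in steps 2 and 4 are left out. The names `cosine_similarity`, `recommend`, `k` and the toy matrix `R` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two rating rows (0 means 'not rated')."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def recommend(R, user, k=2, n=2):
    """Return the indices of the top-n unrated items for `user`."""
    # Step 1: similarity between the active user and every other user.
    sims = {v: cosine_similarity(R[user], R[v])
            for v in range(len(R)) if v != user}
    # Step 2: neighbor selection, keeping the k most similar users.
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    # Step 3: prediction as a similarity-weighted average of neighbor scores.
    preds = {}
    for item in range(len(R[user])):
        if R[user][item] != 0:
            continue  # the active user already rated this item
        rated = [v for v in neighbors if R[v][item] != 0 and sims[v] > 0]
        if not rated:
            continue  # no neighbor can vouch for this item
        total_weight = sum(sims[v] for v in rated)
        preds[item] = sum(sims[v] * R[v][item] for v in rated) / total_weight
    # Step 4: rank the candidate items by predicted score, highest first.
    ranked = sorted(preds, key=preds.get, reverse=True)
    # Step 5: return the first n items.
    return ranked[:n]


# Toy user-item score matrix in the form of Eq. (1): rows are users,
# columns are items, 0 marks a missing score.
R = [
    [5, 3, 0, 0],
    [5, 3, 4, 1],
    [4, 3, 5, 1],
    [1, 0, 1, 5],
]
print(recommend(R, user=0, k=2, n=1))  # → [2]
```

In the toy matrix, users 1 and 2 are the nearest neighbors of user 0, and both rate item 2 far above item 3, so item 2 is recommended first.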
can use a user's preference for all items as a vector to calculate the similarity between users, or use all users' preference for one item as a vector to calculate the similarity between items.

2.5.1 Pearson correlation coefficient

The Pearson correlation coefficient conveys two things. One is size or strength: the greater the absolute value, the higher the correlation between the two variables; the smaller the absolute value, the lower the correlation. The other is direction, given by the sign of the coefficient. When the coefficient is positive, the two variables change in the same direction: as one becomes larger the other becomes larger, and as one becomes smaller the other becomes smaller, which is called positive correlation. When the coefficient is negative, they change in opposite directions: as one becomes larger the other becomes smaller, which is called negative correlation. If it is zero, when one becomes smaller the other may become larger, smaller or stay unchanged, which is zero correlation.

The Pearson correlation coefficient is generally used to calculate the degree of tightness between two interval-scaled variables, and its value lies in [−1, +1]. Here $s_x$ and $s_y$ are the standard deviations of the x and y samples, and $\bar{x}$ and $\bar{y}$ are the sample means.

\[
P(x, y) = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{(n-1)\, s_x s_y}
= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\, \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} \tag{7}
\]

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

\[
d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} \tag{10}
\]

3 Improvement of Collaborative Filtering Algorithms by Normal Restoration Similarity Measure

There are many different similarity algorithms in collaborative filtering. The core concept of the Jaccard similarity coefficient can be seen from the following formula:

\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{11}
\]

That is, the number of items scored by both user A and user B is divided by the number of items scored by user A or user B; the value falls between 0 and 1.

The Pearson correlation coefficient is the most famous similarity algorithm, and its value falls between 1 and −1. If user-based collaborative filtering is used, the formula is as follows:

\[
Sim(u, v) = \frac{\sum_{i \in I} \left(r_{u,i} - \bar{r}_u\right)\left(r_{v,i} - \bar{r}_v\right)}{\sqrt{\sum_{i \in I} \left(r_{u,i} - \bar{r}_u\right)^2}\, \sqrt{\sum_{i \in I} \left(r_{v,i} - \bar{r}_v\right)^2}} \tag{12}
\]
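The Jaccard coefficient, the Jaccard distance and the user-based Pearson similarity of Eqs. (10) to (12) can be illustrated with a short Python sketch. Two assumptions are made here that the text does not state: each user's ratings are held in a dictionary mapping item to score, and in Eq. (12) the set I is taken to be the items rated by both users, with the means computed over I. The function names are hypothetical.

```python
import math


def jaccard(ratings_a, ratings_b):
    """Jaccard coefficient of the sets of items rated by two users, Eq. (11)."""
    a, b = set(ratings_a), set(ratings_b)
    union = a | b
    # Convention chosen for this sketch: two empty sets have similarity 0.
    return len(a & b) / len(union) if union else 0.0


def jaccard_distance(ratings_a, ratings_b):
    """Jaccard distance, Eq. (10): one minus the Jaccard coefficient."""
    return 1.0 - jaccard(ratings_a, ratings_b)


def pearson(ratings_u, ratings_v):
    """Pearson similarity of two users over their co-rated items, Eq. (12)."""
    common = set(ratings_u) & set(ratings_v)  # assumed meaning of the set I
    if len(common) < 2:
        return 0.0  # too few co-rated items to measure a correlation
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0  # a constant rating vector has no defined correlation
    return num / (den_u * den_v)


u = {"a": 5, "b": 3, "c": 1}
v = {"a": 4, "b": 2, "d": 5}
print(jaccard(u, v))           # 2 shared items out of 4 distinct items
print(jaccard_distance(u, v))
print(pearson(u, v))           # computed over the co-rated items "a" and "b"
```

The Jaccard coefficient only looks at which items were rated, while the Pearson similarity looks at how they were rated, which is why the two measures can rank the same pair of users differently.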
\[
\hat{r}_{u,i} = r_{u\min} + \left(r_{u\max} - r_{u\min}\right) \frac{\sum_{u' \in U} sim(u, u') \cdot nr_{u',i}}{\sum_{u' \in U} sim(u, u')} \tag{19}
\]
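A small Python sketch may help to read Eq. (19). The definition of the normalized rating $nr_{u',i}$ is not given in this part of the text, so the sketch assumes min-max normalization of each neighbor's score, $nr_{u',i} = (r_{u',i} - r_{u'\min}) / (r_{u'\max} - r_{u'\min})$; that assumption, the fallback value 0.5 for a neighbor whose scores are all equal, and the names `normalize` and `predict` are illustrative only.

```python
def normalize(rating, r_min, r_max):
    """Assumed min-max normalization of a neighbor's score to [0, 1]."""
    if r_max == r_min:
        return 0.5  # assumption: a constant rater carries no scale information
    return (rating - r_min) / (r_max - r_min)


def predict(user, item, ratings, sims):
    """Predicted score of `user` for `item` in the form of Eq. (19).

    ratings: dict mapping user -> dict mapping item -> score
    sims:    dict mapping neighbor -> sim(user, neighbor)
    Returns None when no neighbor with positive similarity rated the item.
    """
    u_min = min(ratings[user].values())
    u_max = max(ratings[user].values())
    num = 0.0
    den = 0.0
    for neighbor, sim in sims.items():
        if sim <= 0 or item not in ratings[neighbor]:
            continue
        n_min = min(ratings[neighbor].values())
        n_max = max(ratings[neighbor].values())
        num += sim * normalize(ratings[neighbor][item], n_min, n_max)
        den += sim
    if den == 0:
        return None
    # Map the weighted normalized score back onto the active user's own scale.
    return u_min + (u_max - u_min) * num / den


ratings = {
    "u": {"a": 2, "b": 4},
    "v": {"a": 1, "b": 5, "c": 5},
    "w": {"a": 1, "b": 5, "c": 3},
}
sims = {"v": 1.0, "w": 1.0}
print(predict("u", "c", ratings, sims))  # → 3.5
```

In the example the neighbors' normalized scores for item c are 1.0 and 0.5, their weighted mean is 0.75, and mapping this onto the active user's range [2, 4] gives 3.5.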
Table 2: Top 10 Recommended Results of the Three Algorithms

Users  Jaccard similarity coefficient  Normal Restoration Similarity Measure  Pearson similarity
1      925468                          925468                                 925468
2      24589                           24589                                  24589
3      252465                          252465                                 252465
4      52245774                        52245774                               52245774
5      52547                           52547                                  52547
6      38625                           38625                                  38625
7      3545562                         3545562                                3545562
8      2542588                         2542588                                2542588
9      75225                           75225                                  75225
10     855265                          855265                                 855265

6 Research conclusions

With e-commerce transactions becoming increasingly frequent, more and more sellers choose to sell goods online, which also brings a huge number of goods. In the past, collaborative filtering recommendation systems treated each item as a feature in the calculation, but with today's data this is unrealistic. The massive number of users and commodities also brings about the problem of extremely sparse data, so that the recommendation system runs too slowly or cannot work at all.

With the advent of the cloud era, data grows very fast. In a massive data environment, when researchers need to find solutions to problems, execution speed is the key. In this paper, a collaborative filtering algorithm modified by the normal restoration similarity measure is adopted, and the speed is improved by 2.67 times in the cloud environment simulation. As the amount of actual data increases, computation on personal computers will take more time, and the ability to store data will be limited to a certain extent. Using MapReduce on the Hadoop distributed platform to distribute computation and data to different hosts can save a lot of time and reduce the data burden. Hadoop's Distributed File System (HDFS) guarantees the correctness of the data, and with the modified normal restoration similarity measure the prediction accuracy is improved. The experimental results show that the execution time increases with the number of neighbors. When the number of nodes is 5 and 8, the execution time is greatly improved, which improves the efficiency of the collaborative filtering algorithm and makes it possible to cope with massive data in the future.

However, there are some shortcomings in this study. For example, when the collaborative filtering algorithm is faced with a sparse matrix, prediction becomes difficult. In follow-up studies, researchers can try other recommendation algorithms and directions of improvement, such as constructing a multi-agent model that combines a neural network with the collaborative filtering algorithm. Nowadays, with the increasing amount of data, using the R language to analyze massive data runs into a series of obstacles, such as overly long analysis time and insufficient memory. Using the Hadoop Distributed File System (HDFS) and MapReduce in the open-source Apache Hadoop software can improve computing efficiency and storage space management and increase capacity.

References

[1] D'Angeac G.D., Big data: the management revolution, Harvard Business Review, 2012, 90(10), 60-68.
[2] Xu H.L., Wu X., Li X.D., Yan B.P., Comparison study of internet recommendation system, Journal of Software, 2009, 20(2), 350-362.
[3] Feng L., Guo W., Yu D., Gao Q., Gao K., Xue Z., et al., Classification of different therapeutic responses of major depressive disorder with multivariate pattern analysis method based on structural MR scans, PLoS ONE, 2012, 7(7), 1-11.
[4] Karabadji N.E.I., Beldjoudi S., Seridi H., Aridhi S., Dhifli W., Improving memory-based user collaborative filtering with evolutionary multi-objective optimization, Expert Systems with Applications, 2018, 98, 153-165.
[5] Rashid A.M., Albert I., Cosley D., Lam S.K., McNee S.M., Konstan J.A., Riedl J., Getting to know you: learning new user preferences in recommender systems, Proceedings of the 7th International Conference on Intelligent User Interfaces (January 13 - 16, 2002, San Francisco, CA, USA), ACM New York, 2002, 127-134.
[6] Liang T.P., Lai H.J., Ku Y.C., Personalized content recommendation and user satisfaction: Theoretical synthesis and empirical findings, Journal of Management Information Systems, 2006, 23(3), 45-70.
[7] Xiao B., Benbasat I., E-commerce product recommendation agents: Use, characteristics, and impact, MIS Quarterly, 2007, 31(1), 137-209.
[8] De Meo P., Quattrone G., Terracina G., Ursino D., An XML-based multiagent system for supporting online recruitment services, IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 2007, 37(4), 464-480.
[9] Yoshii K., Goto M., Komatani K., Ogata T., Okuno H.G., An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model, IEEE T Audio Speech, 2008, 16(2), 435-447.
[10] Li J., Kai Z., Yang X., Peng W., Jie W., Mitra K., et al., Category preferred canopy-k-means based collaborative filtering algorithm, Future Generation Computer Systems, 2018, 93, 1046-1054.
[11] O'Donovan J., Smyth B., Trust in recommender systems, Proceedings of the 10th International Conference on Intelligent User Interfaces (January 09 - 12, 2005, San Diego, CA, USA), ACM New York, 2005, 167-174.
[12] Herlocker J.L., Konstan J.A., Riedl J., Explaining collaborative filtering recommendations, Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work (December 02 - 06, 2000, Philadelphia, PA, USA), ACM New York, 2000, 241-250.
[13] Herlocker J.L., Konstan J.A., Borchers A., Riedl J., An algorithmic framework for performing collaborative filtering, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (August 15-19, 1999, Berkeley, CA, USA), ACM New York, 1999, 230-237.
[14] Good N., Schafer J.B., Konstan J.A., Borchers A., Sarwar B., Herlocker J. et al., Combining collaborative filtering with personal agents for better recommendations, 1999, 439-446.
[15] Edith C., Min-Hash Sketches, Springer New York, 2016.
[16] Cantero A., Crespo F., Ferrer S., The triaxiality role in the spin-orbit dynamics of a rigid body, Applied Mathematics & Nonlinear Sciences, 2018, 3, 187-208.
[17] Gao W., Wang W., A tight neighborhood union condition on fractional-critical deleted graphs, Colloquium Mathematicum, 2017, 149, 291-298.
[18] Gao W., Wang W., New isolated toughness condition for fractional-critical graph, Colloquium Mathematicum, 2017, 147, 55-65.
[19] Khalique C.M., Mhlanga I.E., Travelling waves and conservation laws of a dimensional coupling system with Korteweg-de Vries equation, Applied Mathematics & Nonlinear Sciences, 2018, 3, 241-254.
[20] Naeem M., Siddiqui M.K., Guirao J.L.G., Gao W., New and modified eccentric indices of octagonal grid $O_n^m$, Applied Mathematics & Nonlinear Sciences, 2018, 3, 209-228.
[21] Pandey P.K., A new computational algorithm for the solution of second order initial value problems in ordinary differential equations, Applied Mathematics & Nonlinear Sciences, 2018, 3, 167-174.