Effective Garbage Data Filtering Algorithm for SNS Big Data Processing by Machine Learning
Abstract- Recently, as the use of social network services (SNS) increases in modern daily life, the amount of SNS data has become enormous. In addition, more and more efforts are being made to extract different pieces of information by collecting, processing, and analysing large amounts of SNS data. Although various pieces of information can be extracted from SNS data through big data processing, this is a resource-intensive task. Therefore, extracting information from SNS data requires considerable time and material resources. In this paper, we propose a data filtering algorithm that filters out junk data that carries no meaningful information in SNS data. The proposed algorithm improves the filtering accuracy by iterative learning based on the initial learning data. Experimental results show that the proposed algorithm has a filtering effect of more than 70% on experimental keywords.

Indexed Terms- Social network services, big data, machine learning, iterative learning.

I. INTRODUCTION

Due to the fast growth of social network services (SNS), the number of users has recently increased. As the number of mobile devices grows, so does the volume of data gathered on social networking sites. SNS is frequently used for friendship and social interactions, but in recent years, its secondary usage for collecting, analysing, and acquiring various bits of information from large datasets on SNS has significantly increased. Therefore, by examining the data on SNS, it is possible to deduce information about a variety of flows and opinions on topics such as society, the economy, and politics. However, because the data on SNS is a mixture of data that is beneficial to the analysis, advertisement data, and irrelevant data, it is highly difficult and time-consuming to analyse it successfully. Studies on stable data collection and storage as well as effective data processing with constrained computing resources have been done recently as interest in big-data processing has grown. The value of large data prior to processing, however, is the subject of less research and study [1].

Recently, the number of users of social network services (SNS) has been increasing due to the explosive growth of mobile devices, and the amount of data generated on SNS is increasing correspondingly. SNS is widely used for social relations and friendship, but recently, it has been increasingly used for the secondary purpose of gathering and analyzing large datasets on SNS and obtaining various pieces of information. The data on SNS includes content related to opinions being expressed in various fields such as economy, society, and culture [2]. Therefore, by analyzing the data on SNS, information on various flows and opinions on topics such as society, economy, and politics can be extracted. However, it is very difficult and time-consuming to accurately analyze the data on SNS, as it consists of a mix of positive data that is helpful to the actual analysis, advertisement data, and irrelevant data. In recent years, as interest in big-data processing has increased, studies have been conducted on collecting and storing big data in a stable manner and on processing data more efficiently using limited computing resources [3]. However, less research is available regarding the utility of big data before they are processed. Therefore, this study investigates how to effectively filter garbage data from big data, and thereby improve the accuracy and speed of the data analysis in real big-data processing, as shown in Figure 1. In particular, this study focuses on improving the filtering accuracy by including machine learning in the process of filtering garbage data. Therefore, in this study, we propose an algorithm that can improve the garbage data filtering accuracy of SNS big data by cyclic learning and demonstrate the effectiveness of the algorithm through experiments. As a result, this work
explores effective garbage data filtering from big data to enhance the correctness and speed of real-world big-data processing analysis. By introducing machine learning into the process of removing useless data, this study explicitly aims to increase filtering accuracy. Consequently, in this paper, we present a method that increases garbage data filtering accuracy using cyclic learning, and we use experiments to demonstrate the algorithm's efficacy.

II. LITERATURE SURVEY

Qiu et al. [1] note that big data are now rapidly expanding in all science and engineering domains. While the potential of these massive data is undoubtedly significant, fully making sense of them requires new ways of thinking and novel learning techniques to address the various challenges. Their paper presents a literature survey of the latest advances in research on machine learning for big data processing. It first reviews machine learning techniques and highlights some promising learning methods in recent studies, such as representation learning, deep learning, distributed and parallel learning, transfer learning, active learning, and kernel-based learning. It then analyses and discusses the challenges and possible solutions of machine learning for big data, and investigates the close connections of machine learning with signal processing techniques for big data processing. Finally, it outlines several open issues and research trends.

Jarrah et al. [3] studied effective machine learning methods for processing big data. They explored data modelling methods and analysed the efficiencies of the model and algorithm. Landset et al. [4] classified tools for machine learning as processing engines, machine learning frameworks, and learning algorithms, and analysed their association. Machine learning frameworks such as Mahout, MLlib, H2O, and Samoa were also examined side by side. Xing et al. [5] analyzed and compared implementation engines such as MapReduce, Spark, Flink, Storm, and H2O in the Hadoop ecosystem, a typical machine learning architecture. In addition, machine learning libraries and frameworks such as Mahout, MLlib, and Samoa were examined. Chen et al. [6] proposed an algorithm for disease prediction based on machine learning for healthcare big data and demonstrated the effectiveness of the proposed algorithm through experiments. In addition, there are further studies suggesting big data processing methods using various machine learning techniques.

Suthaharan [2] focuses on the specific problem of Big Data classification of network intrusion traffic. The paper discusses the system challenges presented by the Big Data problems associated with network intrusion prediction. The prediction of a possible intrusion attack in a network requires continuous collection of traffic data and learning of their characteristics on the fly. The continuous collection of traffic data by the network leads to Big Data problems that are caused by the volume, variety and velocity properties of Big Data. The learning of the network characteristics requires machine learning techniques that capture global knowledge of the traffic patterns. The Big Data properties will lead to significant system challenges to implement machine learning frameworks. The paper discusses the problems and challenges in handling Big Data classification using geometric representation-learning techniques and the modern Big Data networking technologies.

With the emerging technologies and all associated devices, it is predicted that a massive amount of data will be created in the next few years; in fact, as much as 90% of current data were created in the last couple of years, a trend that will continue for the foreseeable future. Sustainable computing studies the process by which computer engineers and scientists design computers and associated subsystems efficiently and effectively with minimal impact on the environment. However, current intelligent machine-learning systems are performance driven: the focus is on the predictive/classification accuracy, based on known properties learned from the training samples. For instance, most machine-learning-based nonparametric models are known to require high computational cost in order to find the global optima. With the learning task in a large dataset, the number of hidden nodes within the network will therefore increase significantly, which eventually leads to an exponential rise in computational complexity.

IV. RESULTS
Fig.2 This screen shows the code for SPARK and Naïve Bayes processing.
Fig.3 This screen shows the dataset weights being processed with SPARK and then passed to the Naïve Bayes algorithm for training.
Fig.4 On this screen, click the ‘Upload SNS Dataset’ button to load the dataset and open the next screen.
Fig.5 On this screen, select the ‘dataset.csv’ file and click the ‘Open’ button to load the dataset and obtain the output below.
Fig.6 The dataset is now loaded. In the graph, the x-axis represents the type of data (0, 1 or 2) and the y-axis represents the number of records in the dataset that belong to that group. Next, click ‘Dataset Classifier Generator’ to convert the dataset tweets into morphological weights and obtain the output below.
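The conversion of tweets into per-word weights can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and raw term counts as the weights; the paper does not specify the morphological analysis it uses, and the names `tokenize` and `build_weight_table` as well as the sample tweets are hypothetical.

```python
from collections import Counter

PUNCTUATION = ".,!?\"'"

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real morphological analysis)."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def build_weight_table(tweets):
    """Return (vocabulary, rows); rows[i][j] is the count of vocabulary[j] in tweets[i]."""
    counts = [Counter(tokenize(tweet)) for tweet in tweets]
    vocabulary = sorted(set().union(*counts)) if counts else []
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows

# Hypothetical sample data, not taken from the paper's dataset.
tweets = [
    "Buy cheap phones now",
    "Election results announced today",
    "cheap cheap deals",
]
vocabulary, rows = build_weight_table(tweets)
```

The resulting table has one column per word and one row of weights per tweet, which matches the layout described for the weight screen: a header row of words followed by rows of weights.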
Fig.7 On this screen, the first row lists the words and the remaining rows contain the weights of those words. Next, click the ‘Data Classifier using SPARK Naive Bayes’ button to train the Naïve Bayes algorithm and obtain the prediction accuracy below.
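The Naïve Bayes training step can be sketched in plain Python. This is not the SPARK implementation shown in the screens; it is a small multinomial Naïve Bayes with Laplace smoothing, written under the assumption that each record is a list of word tokens with one class label. The class name `NaiveBayesFilter`, the labels "garbage" and "useful", and the sample records are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesFilter:
    """Multinomial Naive Bayes over token lists with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return the unnormalised log posterior of every class for one record."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # class prior
            for tok in tokens:
                if tok in self.vocabulary:  # words never seen in training are skipped
                    count = self.word_counts[label][tok]
                    score += math.log(
                        (count + self.alpha) / (total_words + self.alpha * vocab_size)
                    )
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Hypothetical training records and labels.
train_docs = [
    ["cheap", "phones", "sale"],
    ["buy", "cheap", "deals"],
    ["election", "results", "today"],
    ["economy", "policy", "debate"],
]
train_labels = ["garbage", "garbage", "useful", "useful"]
model = NaiveBayesFilter().fit(train_docs, train_labels)
```

With this toy data, `model.predict(["cheap", "sale", "today"])` returns "garbage" and `model.predict(["election", "debate"])` returns "useful".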
Fig.8 On this screen, SPARK processing and Naïve Bayes training have started; the output below appears after some time.
Fig.9 On this screen, XGBOOST achieved 94% accuracy. Next, click the ‘Data Analyzer’ button to upload test data; the classifier then predicts the group of each test record.
Fig.10 In this graph, the x-axis represents the algorithm names and the y-axis represents the accuracy of those algorithms. The graph shows that all extension algorithms achieved higher accuracy than the proposed algorithm.
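The iterative (cyclic) learning idea proposed in this paper, in which filtered data is fed back as learning data, can be sketched as a self-training loop. This is an illustrative approximation and not necessarily the authors' exact procedure: it assumes a small Naïve Bayes classifier, a confidence threshold of 0.6, and a limit of five rounds, none of which are given in the paper. The function names and sample records are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(documents, labels):
    """Collect class and word counts for a multinomial Naive Bayes model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {tok for tokens in documents for tok in tokens}
    return class_counts, word_counts, vocabulary

def posterior(model, tokens):
    """Return normalised class probabilities for one record (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok in vocabulary:
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
                )
        log_scores[label] = score
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}

def iterative_filter(seed_docs, seed_labels, unlabeled, threshold=0.6, max_rounds=5):
    """Retrain repeatedly, adding confidently classified records to the learning data."""
    docs, labels = list(seed_docs), list(seed_labels)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        model = train(docs, labels)
        still_pending = []
        for tokens in pending:
            probs = posterior(model, tokens)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                docs.append(tokens)      # confident prediction becomes learning data
                labels.append(label)
            else:
                still_pending.append(tokens)
        if len(still_pending) == len(pending):
            break                        # nothing new was learned in this round
        pending = still_pending
    return train(docs, labels), pending

# Hypothetical seed (initial learning data) and unlabeled records.
seed_docs = [["cheap", "sale"], ["election", "results"]]
seed_labels = ["garbage", "useful"]
unlabeled = [["cheap", "deals"], ["deals", "discount"], ["results", "economy"]]
final_model, leftover = iterative_filter(seed_docs, seed_labels, unlabeled)
```

In this toy run, the record ["deals", "discount"] cannot be classified in the first round because neither word appears in the seed data. Once ["cheap", "deals"] has been added as garbage learning data, the second round classifies it as garbage as well, which is the effect the iterative learning is meant to produce.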
74.45% following a comparison with the correct answer set. Therefore, the proposed approach may be advantageous in a big data processing environment where a large amount of data must be processed quickly. Based on this, the contribution of this study is summarized as follows. First, this study proposed an effective garbage and advertisement data filtering system that can be used in big data processing systems. It is designed to enhance the efficiency of big data processing by selecting and processing only data that is worth processing from a large amount of data generated in daily life, such as SNS big data. Second, we introduced a recursive machine learning method for data filtering. We made initial learning data from SNS big data and used it for data filtering, and we improved the accuracy of filtering by using the filtered data as learning data through the proposed system.

REFERENCES

[1] J. Qiu, Q. Wu, G. Ding, Y. Xu, S. Feng, “A survey of machine learning for big data processing,” EURASIP J. Adv. Signal Process., vol. 2016, pp. 1-16, 2016.
[2] S. Suthaharan, “Big data classification: Problems and challenges in network intrusion prediction with machine learning,” ACM SIGMETRICS Perf. Eval. Rev., vol. 41, pp. 70-73, 2014.
[3] O. Jarrah, P. Yoo, S. Muhaidat, G. Karagiannidis, K. Taha, “Efficient Machine Learning for Big Data: A Review,” Big Data Res., vol. 2, pp. 87-93, 2015.
[4] S. Landset, T. Khoshgoftaar, A. Richter, T. Hasanin, “A survey of open source tools for machine learning with big data in the Hadoop ecosystem,” J. Big Data, vol. 2, pp. 1-36, 2015.
[5] E. Xing, Q. Ho, W. Dai, J. Kim, Y. Yu, “Petuum: A New Platform for Distributed Machine Learning on Big Data,” IEEE Trans. Big Data, vol. 1, pp. 49-67, 2015.
[6] M. Chen, Y. Hao, K. Hwang, L. Wang, L. Wang, “Disease Prediction by Machine Learning Over Big Data from Healthcare Communities,” IEEE Access, vol. 5, pp. 8869-8879, 2017.
[7] M. Gunasekaran, V. Vijayakumar, R. Varatharajan, K. Priyan, S. Revathi, H. Ching-Hsien, “Machine Learning Based Big Data Processing Framework for Cancer Diagnosis Using Hidden Markov Model and GM Clustering,” Wireless Personal Communications, vol. 102, pp. 2099-2116, 2018.
[8] W. Xiaofei, Z. Yuhua, L. Victor, G. Nadra, J. Tianpeng, “D2D Big Data: Content Deliveries over Wireless Device-to-Device Sharing in Large-Scale Mobile Networks,” IEEE Wireless Communications, vol. 25, pp. 32-38, 2018.
[9] Z. Zhenhua, H. Qing, G. Jing, N. Ming, “A deep learning approach for detecting traffic accidents from social media data,” Transportation Research Part C: Emerging Technologies, vol. 86, pp. 580-596, 2017.
[10] S. Ou, J. Lee, “Implementation of a Spam Message Filtering System using Sentence Similarity Measurements,” KIISE Trans. Comput. Pract. (KTCP), vol. 23, pp. 57-64, 2017.