
© JUN 2023 | IRE Journals | Volume 6 Issue 12 | ISSN: 2456-8880

Effective Garbage Data Filtering Algorithm for SNS Big Data Processing by Machine Learning

B. SRAVANI1, A. SHARANYA2, M. AKHILA3, G. SWATHI4
1, 2, 3 B. Tech Student, Dept. of CSE, Teegala Krishna Reddy Engineering College, Meerpet, Hyderabad
4 Assistant Professor, Dept. of CSE, Teegala Krishna Reddy Engineering College, Meerpet, Hyderabad

Abstract- Recently, as the use of social network services (SNS) has increased in modern daily life, the amount of SNS data has become enormous. In addition, more and more efforts are being made to extract different pieces of information by collecting, processing, and analysing large amounts of SNS data. Although various pieces of information can be extracted from SNS data through big data processing, this is a resource-intensive task. Therefore, extracting information from SNS data requires considerable time and material resources. In this paper, we propose a data filtering algorithm that filters out junk data that carries no meaning in SNS data. The proposed algorithm improves the filtering accuracy by iterative learning based on the initial learning data. Experimental results show that the proposed algorithm has a filtering effect of more than 70% on experimental keywords.

Indexed Terms- Social network services, big data, machine learning, iterative learning.

I. INTRODUCTION

Due to the fast growth of social network services (SNS), the number of users has recently increased. As the number of mobile devices grows, so does the volume of data gathered on social networking sites. SNS is frequently used for friendship and social interactions, but in recent years, its secondary usage for collecting and analysing large datasets on SNS and acquiring various pieces of information has significantly increased. Therefore, by examining the data on SNS, it is possible to deduce information about a variety of flows and opinions on topics such as society, the economy, and politics. However, because the data on SNS is a mixture of data that is beneficial to the analysis itself, advertisement data, and irrelevant data, it is highly difficult and time-consuming to analyse it successfully. Studies on stable data collection and storage, as well as on effective data processing with constrained computing resources, have been done recently as interest in big-data processing has grown. The value of big data prior to processing, however, has been the subject of less research and study [1].

Recently, the number of users of social network services (SNS) has been increasing due to the explosive growth of mobile devices, and the amount of data generated on SNS has been increasing correspondingly. SNS is widely used for social relations and friendship, but recently, it has been increasingly used for the secondary purpose of gathering and analyzing large datasets on SNS and obtaining various pieces of information. The data on SNS includes content related to opinions being expressed in various fields such as economy, society, and culture [2]. Therefore, by analyzing the data on SNS, information on various flows and opinions on topics such as society, economy, and politics can be extracted. However, it is very difficult and time-consuming to accurately analyze the data on SNS, as it consists of a mix of positive data that is helpful to the actual analysis, advertisement data, and irrelevant data. In recent years, as interest in big-data processing has increased, studies have been conducted on collecting and storing big data in a stable manner and on processing data more efficiently using limited computing resources [3]. However, less research and fewer studies are available regarding the utility of big data before they are processed. Therefore, this study investigates how to effectively filter garbage data from big data, and thereby improve the accuracy and speed of the data analysis in real big-data processing, as shown in Figure 1. In particular, this study focuses on improving the filtering accuracy by including machine learning in the process of filtering garbage data. Therefore, in this study, we propose an algorithm that can improve the garbage data filtering accuracy of SNS big data by cyclic learning and prove the effectiveness of the algorithm through experiments. As a result, this work

IRE 1704643 ICONIC RESEARCH AND ENGINEERING JOURNALS 350



explores effective garbage data filtering from big data to enhance the correctness and speed of real-world big-data processing analysis. By introducing machine learning into the process of removing useless data, this study explicitly aims to increase filtering accuracy. Consequently, in this paper, we present a method that increases garbage data filtering accuracy using cyclic learning, and we use experiments to demonstrate the method's efficacy.

II. LITERATURE SURVEY

Qiu et al. [4] note that big data are now rapidly expanding in all science and engineering domains. While the potential of these massive data is undoubtedly significant, fully making sense of them requires new ways of thinking and novel learning techniques to address the various challenges. They present a literature survey of the latest advances in research on machine learning for big data processing. First, they review machine learning techniques and highlight some promising learning methods in recent studies, such as representation learning, deep learning, distributed and parallel learning, transfer learning, active learning, and kernel-based learning. Next, they focus on the analysis and discussion of the challenges and possible solutions of machine learning for big data. Following that, they investigate the close connections of machine learning with signal processing techniques for big data processing. Finally, they outline several open issues and research trends.

Jarrah et al. [5] studied effective machine learning methods for processing big data. They explored data modelling methods and analysed the efficiencies of the model and algorithm. Landset et al. [6] classified tools for machine learning as processing engines, machine learning frameworks, and learning algorithms, and analysed their association. Machine learning frameworks such as Mahout, MLlib, H2O, and Samoa were also examined side by side. Xing et al. [7] analyzed and compared implementation engines such as MapReduce, Spark, Flink, Storm, and H2O in the Hadoop ecosystem, a typical machine learning architecture. In addition, machine learning libraries and frameworks such as Mahout, MLlib, and Samoa were examined.

Chen et al. [8] proposed an algorithm for disease prediction based on machine learning for healthcare big data and demonstrated the effectiveness of the proposed algorithm through experiments. In addition, we can find some studies suggesting big data processing methods that use various machine learning techniques.

Another surveyed work focuses on the specific problem of Big Data classification of network intrusion traffic. It discusses the system challenges presented by the Big Data problems associated with network intrusion prediction. The prediction of a possible intrusion attack in a network requires continuous collection of traffic data and learning of their characteristics on the fly. The continuous collection of traffic data by the network leads to Big Data problems that are caused by the volume, variety, and velocity properties of Big Data. The learning of the network characteristics requires machine learning techniques that capture global knowledge of the traffic patterns. The Big Data properties lead to significant system challenges in implementing machine learning frameworks. That work discusses the problems and challenges in handling Big Data classification using geometric representation-learning techniques and modern Big Data networking technologies.

With the emerging technologies and all associated devices, it is predicted that a massive amount of data will be created in the next few years. In fact, as much as 90% of current data were created in the last couple of years, a trend that will continue for the foreseeable future. Sustainable computing studies the process by which computer engineers and scientists design computers and associated subsystems efficiently and effectively with minimal impact on the environment. However, current intelligent machine-learning systems are performance driven: the focus is on the predictive/classification accuracy, based on known properties learned from the training samples. For instance, most machine-learning-based nonparametric models are known to require high computational cost in order to find the global optima. With the learning task in a large dataset, the number of hidden nodes within the network will therefore increase significantly, which eventually leads to an exponential rise in computational complexity.


III. DATA FILTERING SYSTEM

In this section, we introduce the data filtering system. The core building blocks of the system are explained, and the functional roles of the components are discussed.

The proposed system consists of various components including a data classifier generator, a data classifier, and a data analyzer. The data classifier generator is the unit that prepares data classification: the initial learning data are input, the learned data are generated through morphological analysis, weighting, and application of the classification algorithm, and a data classification module is created based on the generated data. The data classifier receives the target SNS data to be filtered, namely sentences, and classifies them into three groups: first, garbage data; second, advertisement data; and third, definite (positive) data. The first and second groups are input once again into the data classifier generator and used to generate the classification module, and sentences included in the second and third groups are entered into the pattern analyzer generator, which generates a pattern analysis module. This is accomplished through the sequence of morpheme analysis, application of the classification algorithm, and use of a vocabulary database by pattern type. The data analyzer receives the sentence data included in the second and third groups, as determined by the data classifier, and generates various pieces of information. The data analyzer is not implemented in this study because it is outside the scope of this study. Figure 2 shows the structure of the proposed SNS garbage data filtering system.

• SYSTEM ARCHITECTURE

Fig.1 System architecture

• B. Design of Data Classifier Generator

In the proposed system, the data classifier generator plays a key role. This module classifies initial sentence data into the first, second, or third group by referring to the learned data. In particular, in the proposed system, the machine learning system is implemented by executing the Mahout module in the Hadoop system.

• NAÏVE BAYES
It is a simple multiclass classification algorithm that assumes independence between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass over the training data, it computes the conditional probability distribution of each feature given the label, and then it applies Bayes’ theorem to compute the conditional probability distribution of the label given an observation and uses it for prediction.

• RANDOM FOREST
It is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.

• DECISION TREE
The decision tree classifier creates the classification model by building a decision tree. Each node in the tree specifies a test on an attribute, and each branch descending from that node corresponds to one of the possible values for that attribute.

• XGBoost Classifier
It is a machine learning algorithm that is applied to structured and tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
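
The feedback loop and the Naive Bayes description in this section can be illustrated with a minimal sketch. It is not the Mahout or Spark implementation used in the paper: the class `NaiveBayes`, the function `iterative_filter`, the `rounds` parameter, and the add-one smoothing are all assumptions made for illustration, and sentences are assumed to be already tokenized into word lists.

```python
import math
from collections import Counter, defaultdict

# Group ids; the 0/1/2 numbering is an assumption for this sketch.
GARBAGE, ADVERT, DEFINITE = 0, 1, 2


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, sentences, labels):
        # Single pass over the training data: count labels and per-label words.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for words, label in zip(sentences, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, words):
        # Bayes' theorem in log space: log P(label) + sum of log P(word | label).
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for w in words:
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def iterative_filter(seed_x, seed_y, stream, rounds=3):
    """Retrain the classifier using sentences it placed in groups 0 and 1."""
    model = NaiveBayes().fit(seed_x, seed_y)
    for _ in range(rounds):
        predictions = [model.predict(s) for s in stream]
        # Feed garbage and advertisement sentences back as learning data.
        fed_back = [(s, p) for s, p in zip(stream, predictions)
                    if p in (GARBAGE, ADVERT)]
        train_x = list(seed_x) + [s for s, _ in fed_back]
        train_y = list(seed_y) + [p for _, p in fed_back]
        model = NaiveBayes().fit(train_x, train_y)
    return model
```

Each round rebuilds the training set from the seed data plus the current feedback, so repeated rounds do not duplicate the same sentence. A real system would likely also require a confidence threshold before feeding a prediction back, which the paper does not specify.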


IV. RESULTS

Fig.2 Code for Spark and Naïve Bayes processing

Fig.3 The dataset weights are processed with Spark and then used to train the Naïve Bayes algorithm

Fig.4 Clicking the ‘Upload SNS Dataset’ button loads the dataset


Fig.5 Selecting the ‘dataset.csv’ file and clicking the ‘Open’ button loads the dataset

Fig.6 The dataset is loaded. In the graph, the x-axis represents the type of data (0, 1, or 2) and the y-axis represents the number of records in the dataset that belong to that group. Clicking ‘Dataset Classifier Generator’ converts the dataset tweets into morphological weights.


Fig.7 The first row contains the words and the remaining rows contain the weights of those words. Clicking the ‘Data Classifier using SPARK Naive Bayes’ button trains the Naïve Bayes algorithm and reports its prediction accuracy.
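
A word-weight table of the kind shown in Fig.7 can be sketched as follows. The paper does not state which weighting scheme is used, so TF-IDF with a smoothed inverse document frequency is an assumption here, as are the function names `tokenize` and `tfidf_table`. Whitespace splitting stands in for morphological analysis.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Stand-in for morphological analysis: lowercase and split on whitespace.
    return sentence.lower().split()


def tfidf_table(sentences):
    """Return (vocab, rows): one header of words and one weight row per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Document frequency: number of sentences that contain each word.
    doc_freq = {w: sum(1 for d in docs if w in d) for w in vocab}
    rows = []
    for d in docs:
        length = sum(d.values()) or 1  # avoid division by zero for empty sentences
        rows.append([
            (d[w] / length) * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
            for w in vocab
        ])
    return vocab, rows
```

Words that occur in few sentences receive a larger weight than words that occur in many, which is why such a table can help separate advertisement vocabulary from general vocabulary.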

Fig.8 Spark processing and Naïve Bayes training have started; the output appears after some time


Fig.9 XGBoost achieved 94% accuracy. Clicking the ‘Data Analyzer’ button uploads test data, and the classifier then predicts the group of each test record.

Fig.10 The x-axis represents the algorithm names and the y-axis represents the accuracy of those algorithms. All extension algorithms achieve higher accuracy than the proposed algorithm.
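
The accuracy comparison in Fig.10 amounts to scoring each algorithm's predictions against a correct answer set. A minimal sketch follows; the function names `accuracy` and `compare` are hypothetical, and any numbers fed to them would be illustrative rather than the paper's results.

```python
def accuracy(predicted, expected):
    """Fraction of predictions that match the correct answer set."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("expected must not be empty")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def compare(results, expected):
    """Map each algorithm name to its accuracy on the same answer set."""
    return {name: accuracy(preds, expected) for name, preds in results.items()}
```

Scoring every algorithm on the same held-out answer set keeps the comparison fair, since differences then reflect the classifiers rather than the test data.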

CONCLUSION

In this paper, we proposed and implemented an effective SNS garbage data filtering system through repetitive machine learning. We assume that the proposed system can improve the accuracy of the analysis of unstructured data in SNS by separating it into garbage, advertisement, and definite data through machine learning. In the accuracy experiment, data filtering showed an accuracy of up to 74.45% when compared with the correct answer set. Therefore, the system may be advantageous in a big data processing environment where a large amount of data must be processed quickly. Based on this, the contribution of this study is summarized as follows. First, this study proposed an effective garbage and advertisement data filtering system that can be used in big data processing systems. It is designed to enhance the efficiency of big data processing by selecting and processing only data that is worth processing from the large amount of data generated in daily life, such as SNS big data. Second, we introduced a recursive machine learning method for data filtering. We made initial learning data from SNS big data and used it for data filtering, and we improved the accuracy of filtering by using the filtered data as learning data through the proposed system.

REFERENCES

[1] J. Qiu, Q. Wu, G. Ding, Y. Xu, S. Feng, “A survey of machine learning for big data processing,” EURASIP J. Adv. Signal Process., vol. 2016, pp. 1-16, 2016.
[2] S. Suthaharan, “Big data classification: Problems and challenges in network intrusion prediction with machine learning,” ACM SIGMETRICS Perf. Eval. Rev., vol. 41, pp. 70-73, 2014.
[3] O. Jarrah, P. Yoo, S. Muhaidat, G. Karagiannidis, K. Taha, “Efficient Machine Learning for Big Data: A Review,” Big Data Res., vol. 2, pp. 87-93, 2015.
[4] S. Landset, T. Khoshgoftaar, A. Richter, T. Hasanin, “A survey of open source tools for machine learning with big data in the Hadoop ecosystem,” J. Big Data, vol. 2, pp. 1-36, 2015.
[5] E. Xing, Q. Ho, W. Dai, J. Kim, Y. Yu, “Petuum: A New Platform for Distributed Machine Learning on Big Data,” IEEE Trans. Big Data, vol. 1, pp. 49-67, 2015.
[6] M. Chen, Y. Hao, K. Hwang, L. Wang, L. Wang, “Disease Prediction by Machine Learning Over Big Data from Healthcare Communities,” IEEE Access, vol. 5, pp. 8869-8879, 2017.
[7] M. Gunasekaran, V. Vijayakumar, R. Varatharajan, K. Priyan, S. Revathi, H. Ching-Hsien, “Machine Learning Based Big Data Processing Framework for Cancer Diagnosis Using Hidden Markov Model and GM Clustering,” Wireless Personal Communications, vol. 102, pp. 2099-2116, 2018.
[8] W. Xiaofei, Z. Yuhua, L. Victor, G. Nadra, J. Tianpeng, “D2D Big Data: Content Deliveries over Wireless Device-to-Device Sharing in Large-Scale Mobile Networks,” IEEE Wireless Communications, vol. 25, pp. 32-38, 2018.
[9] Z. Zhenhua, H. Qing, G. Jing, N. Ming, “A deep learning approach for detecting traffic accidents from social media data,” Transportation Research Part C: Emerging Technologies, vol. 86, pp. 580-596, 2017.
[10] S. Ou, J. Lee, “Implementation of a Spam Message Filtering System using Sentence Similarity Measurements,” KIISE Trans. Comput. Pract. (KTCP), vol. 23, pp. 57-64, 2017.
