
2022 2nd International Conference on Innovative Practices in Technology and Management (ICIPTM)

Apache Spark based analysis on word count application in Big Data
K. Subha
Department of Computer Science and Engineering
SRM Institute of Science and Technology, Vadapalani Campus
Chennai, India
[email protected]

Dr. N. Bharathi
Department of Computer Science and Engineering
SRM Institute of Science and Technology, Vadapalani Campus
Chennai, India
[email protected]

978-1-6654-6643-1/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICIPTM54933.2022.9753879

Abstract—The rise in the volume of data, as well as the type of data and the rate at which data is produced, has led to the development of novel processing methods capable of working with such massive amounts of data, termed Big Data. The most difficult aspects of managing Big Data include its collection and storage, as well as search, sharing, analysis, and visualization. This paper explains the characteristics, processing, and applications of big data and its challenges. It also explains how big data supports the Internet of Things and cloud computing; combining big data, IoT, and the cloud provides a new architecture.

Keywords— Big Data (BD), Hadoop, Spark, data processing.

I. INTRODUCTION

With the spike in the use of the internet, social media, mobile phones, and IoT devices, the volume of generated data is growing exponentially. According to Statista, the total quantity of data consumed globally is expected to reach 64.2 zettabytes in 2020, 79 zettabytes in 2021, and over 180 zettabytes in 2025 [1]. In fact, data is rising at such a rapid rate that, if current trends continue, the data volume will soon exceed the yottabyte scale. Handling a high volume of data is always a challenging issue for any organization, and the standard database management system is incapable of storing and analyzing large amounts of complicated data. Cox and Ellsworth coined the term "big data": a huge volume and variety of data that increases exponentially with time. This type of data, collected from different origins such as Facebook, Twitter, Android mobile phones, various sensors, laboratory test reports, clinical notes from hospitals, demographics data, and a variety of omics data, can be classified into three groups based on how it is organized: structured, semi-structured, and unstructured, as shown in Fig. 1. Structured data maintains a proper structure, like tables consisting of rows and columns. Semi-structured data is partially organized; it is a bridge between structured and unstructured data, with examples such as JSON, CSV, and XML. Unstructured data is not organized in a predefined schema; examples are image files, audio files, log files, and video files.

Figure 1. Types of BD

The most common characteristics that define BD are volume, velocity, and variety, commonly called the 3 V's of BD; these were later extended to the 5 V's: volume, velocity, variety, value, and veracity [16]. Fig. 2 shows the 5 V's of BD.

Figure 2. 5 V's of BD

1. Volume: Big data itself contains the meaning of volume. It defines the size of the data. Nowadays the amount of created data is measured in petabytes. From this huge volume of data, we can find value and hidden patterns.

2. Velocity: The rate at which data is generated and processed to fulfil needs and problems. Data is being generated at an alarming rate.

3. Variety: The type or format of the data, which may be structured, semi-structured, or unstructured, collected from different sources (emails, PDFs, photos, videos, audio). Variety is one of the most important characteristics of BD.

4. Value: The valuable information that can be extracted from raw data. Organizations are starting to generate remarkable value from their BD.

5. Veracity: Defines the inconsistency of data, i.e., the quality level of the captured data, which can differ greatly in noise, inconsistency, and bias [16]. Accurate analysis depends on the veracity of the source data.

The remaining paper is organized as follows. Section 2 describes big data processing. The role of big data in IoT and the cloud and its applications are analyzed in Section 3.

Section 4 presents the experimental results, and the research challenges are explained in Section 5.

II. BIG DATA PROCESS

BD is a collection of large amounts of data, so BD tools are used to process this high volume of data with low computational time. BD frameworks such as Hadoop and Spark are explained in this section: Hadoop supports batch processing, while Spark supports both batch and stream processing.

A. Hadoop

Hadoop is a big data processing framework that permits users to store and process data sets that are very big in size (gigabytes to petabytes) by allowing a group of computers, or nodes, to solve huge and complex data problems. It is a very flexible and low-cost solution for storing, analyzing, and processing ordered, semi-ordered, and unordered data. It is highly scalable because it uses commodity hardware, and it achieves fault tolerance through the replication factor. The Hadoop replication factor is 3, but it is changeable: it can be lowered to 2 or increased beyond 3. Hadoop provides advanced analytics for stored data; generally, there are four types of data analytics, namely descriptive, diagnostic, predictive, and prescriptive analytics. Hadoop also supports data mining and machine learning (ML). The initial step in Hadoop data processing is to divide a vast volume of data into smaller tasks. These small jobs are distributed across a Hadoop cluster and completed in parallel using the MapReduce algorithm; MapReduce is responsible for the parallel processing that reduces the processing time for a large volume of data [11].
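The paper does not reproduce any MapReduce code, so the following is only a minimal Python sketch of the split-map-shuffle-reduce idea described above; the sample sentences are invented, and a real job would run the same mapper and reducer logic in parallel over HDFS blocks under the Hadoop framework rather than in a single process.

# Minimal single-process simulation of the MapReduce word-count flow:
# map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in one line of an input split.
    for word in line.strip().lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Reducer: sum all counts received for one key.
    return word, sum(counts)

lines = ["big data needs big tools",            # invented sample input
         "spark and hadoop process big data"]

groups = defaultdict(list)                      # shuffle: group pairs by key
for line in lines:
    for word, one in map_phase(line):
        groups[word].append(one)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)                                   # {'big': 3, 'data': 2, ...}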
B. Components of Hadoop

The term Hadoop is often used to refer to both the core components of Hadoop and the ecosystem of related projects. The core components of Hadoop are the Hadoop Distributed File System (HDFS), MapReduce, YARN, and Hadoop Common. Hadoop Common is an essential part of the Apache Hadoop framework that refers to the collection of common utilities and libraries supporting the other Hadoop modules in the ecosystem [13]. Large amounts of data are stored in HDFS, which handles large data sets running on commodity hardware (CH). CH is low-specification, industry-grade hardware; it scales a single Hadoop cluster to hundreds or even thousands of nodes, and BD supports horizontal scaling. The next component is MapReduce, which has two parts, the mapper and the reducer; it is the processing unit of Hadoop and a core component of the Hadoop framework. MapReduce processes data by splitting a huge volume of data into smaller units and processing them simultaneously. MapReduce was the only way to access the data stored in HDFS; other components of the system include Hive and Pig. Hive provides SQL-like querying and strong statistical functions for users, while Pig is popular for its multi-query approach, which reduces the number of times that the data is scanned. The last component is YARN, which is short for Yet Another Resource Negotiator. YARN is a very important component because it prepares the RAM and CPU for Hadoop to run over the data stored in HDFS.

C. Spark

Spark is a BD processing engine and also a cluster computing engine that is designed for fast execution and in-memory computation and supports different workloads. Spark manages batch processing, interactive querying, streaming, and iterative computations [9]. The Apache Spark system includes the Spark core and its libraries, namely Spark MLlib, GraphX, Spark Streaming, and Spark SQL. Machine learning is performed by Spark's MLlib, graph analysis is handled by GraphX, stream processing is done by Spark Streaming [17], and structured data processing is addressed by Spark SQL. The performance of Spark can be increased by resource management [18].
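The paper gives no code for Spark itself; as an illustration only, a word count written against the Spark core with in-memory RDD operations might look like the short PySpark sketch below, where the local master URL and the file name input.txt are assumptions rather than details taken from the paper.

from pyspark.sql import SparkSession

# Local session; on a cluster the master would point to YARN or a standalone master.
spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("input.txt")                   # hypothetical input file
            .flatMap(lambda line: line.split())      # split lines into words
            .map(lambda word: (word, 1))             # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))        # sum counts per word
counts.cache()                                       # keep the result in memory for reuse
print(counts.take(10))
spark.stop()

Spark SQL, MLlib, GraphX, and Spark Streaming are built on this same core engine, which is what lets one platform cover the batch, interactive, streaming, and iterative workloads listed above.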

III. APPLICATION OF BIG DATA IN IOT AND CLOUD

IoT sensor devices continuously generate data based on the application, so big data plays the major role in collecting, processing, and analyzing the large volume of data collected from IoT devices. The cloud is the internet of services: it is used for storage, as a platform for services that can be used by an organization or an individual, and to provide software services. IoT and the cloud, together with big data, play an important role in many applications. Table 1 summarizes recent applications of BD.

The authors described a new data storage framework that maintains high consistency. BD is a vast volume of information with a complicated structure, and traditional database management solutions are incapable of handling such a large amount of data. The rapidly growing number of BD applications requires an efficient database architecture to store and manage critical data and to support scalability and availability for data analytics. Parallel and distributed data processing needs to support consistency, availability, and partition tolerance, which play an important role in data processing. Like the ACID properties in relational database systems, the CAP theorem explains consistency, availability, and partition tolerance in BD management systems. To meet business needs, many NoSQL systems aim to achieve high consistency under the CAP theorem. The proposed system provides strong consistency through a Scalable Distributed Two-Layer Data Store (SD2DS) [1]. The work is divided into two parts: the first analyses all consistency issues, and the second designs a scheduling algorithm for supporting strong consistency. In that paper, high consistency is demonstrated for all basic operations.

Diabetes is a widespread disease, so the authors analyze diabetes datasets to predict optimal results for diabetes patients. Hadoop MapReduce with data mining algorithms, namely an attention network, Decision Tree (DT), outlier-based multiclass classification, and association rule techniques, was used to derive solutions for the patient. MapReduce is the processing engine for BD: the BD set is divided into manageable data sets with different attributes and then passed to the Decision Tree, the Apriori algorithm, and outlier-based multiclass classification. Using these algorithms, diabetes patients are classified and their insulin level is determined [2].

Smart meters are used to accurately read the energy consumption level of smart buildings and smart industry, and they play an important part in growing energy management systems. Velocity is one of the characteristics of big data: the sensors in smart meters generate data at an alarming rate, which defines the velocity. The authors designed a BD system that reads one million smart meters' data in near real time. The proposed system consists of four modules: data modelling, BD storage, a big data processing and querying block, and data visualization. The voluminous data is captured and modelled by an optimized query design, then stored in the BD storage, HDFS, which consists of master and slave nodes; the master nodes contain metadata, and the slave nodes are the actual workers. Data is fetched from the storage area and an Auto Regressive Integrated Moving Average (ARIMA) [3] model is applied for the analytics. Finally, the data visualization stage displays the energy consumption level on a device, which the house owner uses to manage devices for a lower energy bill.

With the increasing amount of data, storing BD in a single data centre is no longer feasible, and storing and accessing data is a challenging issue in BD. The authors store and analyze BD in multiple data centres located in different geographical areas. The Hadoop and Spark frameworks are designed to process data locally within the same data centre, so they must replicate all data to a single data centre before performing any operation in a locally distributed computation. Because of bandwidth constraints, communication costs, data privacy, and security, copying all data from several data centres to one data centre can become a bottleneck. The implemented Random Sample Partition (RSP) method divides the large data into packets of sample data blocks and distributes these data blocks across different data centres with or without duplication; the important data are replicated to increase availability [4].
TABLE I. RECENT APPLICATIONS IN BIG DATA

S. No. | Authors | Findings | Application Area | Year | Algorithm / Method
1 | Krechowicz, Adam; Deniziak, Stanisław; Łukawski, Grzegorz [1] | Provide strong consistency by using a novel data storage. | Data Storage | 2021 | SD2DS
2 | Jayasri, N.P.; Aruna, R. [2] | Evaluate diabetes patients to discover the optimal solution. | Healthcare | 2021 | DT, AA & outlier-based multiclass classification
3 | Gupta, Ragini, et al. [3] | Explained an efficient and accurate representation of energy consumption levels collected from smart meters. | Energy Management | 2020 | ARIMA
4 | Emara, T. Z.; Huang, J. Z. [4] | Analyzing big data in distributed data centres. | Data Storage | 2020 | RSP
5 | Shen, J.; Zhou, T.; Chen, L. [5] | Users' historical activity data and interests are used to recommend appropriate products to customers. | Recommendation System | 2020 | Collaborative filtering
6 | Silahtaroglu, Gokhan; Yılmazturk, Nevin [6] | Described a machine learning pre-diagnosis model for emergency departments. | Healthcare | 2019 | PNN
7 | Alam, Mahboob; Amjad, Mohd [7] | Forecast future weather conditions on available BD. | Science and Technology | 2019 | Parallel and distributed analytics
8 | Zhou, Shihao; Qiao, Zhilei; Du, Qianzhou; et al. [8] | BD text analytics to measure customer agility from online reviews. | Business | 2018 | SVD

In this work, the authors implement an efficient movie recommendation system in a BD environment. Recommendation systems are designed to recommend things to the user based on their interests; organizations like Netflix, Amazon, and others use recommendation frameworks to help their clients identify the right item, and such systems have to deal with a large volume of data. The proposed work implements the collaborative filtering algorithm to recommend appropriate products to customers [5].
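The cited implementation is not reproduced in the paper; purely as a generic sketch of collaborative filtering on Spark, MLlib's ALS recommender can be trained on (user, item, rating) triples, shown here with a tiny invented dataset.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("cf-sketch").getOrCreate()

# Invented (userId, movieId, rating) triples standing in for real activity data.
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 3.0), (1, 10, 5.0), (1, 12, 2.0), (2, 11, 4.5)],
    ["userId", "movieId", "rating"])

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=5, maxIter=10, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-3 item recommendations for every user.
model.recommendForAllUsers(3).show(truncate=False)
spark.stop()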
Ambulance services are the first service provided by the emergency department in hospitals. One of the most important aspects of an ambulance service is determining the severity of the situation; another important analysis is finding the closest hospital or clinic that is least crowded, so the patient can be admitted immediately. For emergency departments, the authors explained a machine learning pre-diagnosis system that uses Random Forest, Decision Tree, and Probabilistic Neural Network (PNN) models based on DDA (Dynamic Decay Adjustment); the models are trained separately on the dataset. This model understands the emergency level of the case and finds the most suitable and available health care [6].

To forecast future weather conditions in cloud computing, the authors designed an architecture for parallel and distributed big data analysis. Based on available data, weather forecasting is used to predict the atmosphere for a certain geographical area and period. The suggested system uses Hadoop in conjunction with the MapReduce engine to process large amounts of data. The final forecast contains maximum and minimum temperatures as well as rainfall for any future date [7].

Through big data analytics technology, the huge amount of data created by internet users can be used to develop new products with significant strategic value. The suggested model uses a semantic keyword similarity method based on Singular Value Decomposition (SVD) to examine the connection between review volume, customer agility, and product performance, which is achieved in two stages. The first stage examines how the volume of online reviews promotes consumer agility, while the subsequent stage explores the connection between customer agility and product performance [8].

IV. EXPERIMENTAL SETUP & RESULT

Spark 3.1.2 was installed on a standalone node with 2 cores and 8 GB of RAM. In this study, we ran the word count program over varied data sizes with different numbers of cores to analyse the speedup and processing time. Figs. 3, 4, and 5 show that the processing time decreases as the number of cores is increased for a given document size.

Figure 3. Big Data Processing Time
Figure 4. Spark Processing Time with Two Cores
Figure 5. Spark Processing Time with Single Core
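The exact program used for the measurements is not listed in the paper; a run of the kind described (a Spark word count on a standalone node, repeated with different core counts) could be scripted roughly as below, where the input path and the timing loop are illustrative assumptions.

import time
from pyspark.sql import SparkSession

def word_count_time(path, cores):
    # Restrict Spark to a given number of local cores, mirroring the runs
    # with one and two cores reported in Figs. 3-5.
    spark = (SparkSession.builder
             .master(f"local[{cores}]")
             .appName(f"wordcount-{cores}-cores")
             .getOrCreate())
    start = time.perf_counter()
    counts = (spark.sparkContext.textFile(path)
              .flatMap(lambda line: line.split())
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b))
    counts.count()                       # force execution of the lazy pipeline
    elapsed = time.perf_counter() - start
    spark.stop()
    return elapsed

for cores in (1, 2):                     # the experiment varied the core count
    print(cores, "core(s):", word_count_time("input.txt", cores), "seconds")

For a fixed document size, the measured time is expected to fall as cores are added, which is the trend the figures report.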
V. RESEARCH CHALLENGES

Big data research provides significant benefits to organizations, industries, and individuals in making better decisions, yet numerous issues remain to be solved. Addressing some BD research problems requires assistance from BD research groups, governments, and organizations. For most researchers, the characteristics of big data themselves pose a major challenge.

1. Volume

Volume is one of the challenging issues in BD. With increased internet usage and IoT devices, the volume of generated data grows day by day, but the percentage of data used for analysis is small. As the data flowing into a company increases, the percentage of data that can be processed decreases, and the challenge becomes evident. According to Statista, the total quantity of data consumed globally is expected to reach 64.2 zettabytes in 2020, 79 zettabytes in 2021, and over 180 zettabytes in 2025.

Figure 6. Data growth Rate

2. Variety

Handling different kinds of data, including structured, semi-structured, and unstructured data, is another issue in BD. Data locality is also a challenge in big data.

3. Velocity

The problem of data streaming falls under the velocity characteristic of BD. Data availability and concept drift are two BD challenges that must be addressed while data is in motion.
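Spark's streaming support, mentioned in Section II, targets exactly this data-in-motion case. Purely as an illustration (not taken from the paper), a Structured Streaming word count over a local socket source could be sketched as follows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

# Hypothetical unbounded source: text lines arriving on localhost:9999.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Split each arriving line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts as new data flows in.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()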
4. Veracity

Uncertainties, falsehood, and missing values are all examples of veracity issues. The quality of the data influences the quality of the research findings, and many experts believe that the largest difficulty in BD is truthfulness.

VI. CONCLUSION

The foundational concepts of BD are presented in this work, including BD characteristics, types of BD, BD processing, big data applications, and big data difficulties. Data analysis using the big data processing engine Spark shows that Spark is a better execution engine with high scalability, and that processing time decreases as the number of cores in Apache Spark is increased.

REFERENCES

[1] Krechowicz, Adam, Stanisław Deniziak, and Grzegorz Łukawski. "Highly Scalable Distributed Architecture for NoSQL Datastore Supporting Strong Consistency." IEEE Access 9 (2021): 69027-69043.
[2] Jayasri, N.P., and R. Aruna. "Big data analytics in health care by data mining and classification techniques." ICT Express, 2021.
[3] T. Z. Emara and J. Z. Huang. "Distributed Data Strategies to Support Large-Scale Data Analysis Across Geo-Distributed Data Centers." IEEE Access, vol. 8, pp. 178526-178538, 2020, doi: 10.1109/ACCESS.2020.3027675.
[4] R. Gupta, A. R. Al-Ali, I. A. Zualkernan, and S. K. Das. "Big Data Energy Management, Analytics and Visualization for Residential Areas." IEEE Access, vol. 8, pp. 156153-156164, 2020, doi: 10.1109/ACCESS.2020.3019331.
[5] Shen, J., Zhou, T., and Chen, L. "Collaborative filtering-based recommendation system for big data." International Journal of Computational Science and Engineering, vol. 21, p. 219, 2020, doi: 10.1504/ijcse.2020.105727.
[6] Gokhan Silahtaroglu and Nevin Yılmazturk. "Data analysis in health and big data: A machine learning medical diagnosis model based on patients' complaints." Communications in Statistics - Theory and Methods, 2019, doi: 10.1080/03610926.2019.1622728.
[7] Mahboob Alam and Mohd Amjad. "Weather forecasting using parallel and distributed analytics approaches on big data clouds." Journal of Statistics and Management Systems, vol. 22, no. 4, pp. 791-799, 2019, doi: 10.1080/09720510.2019.1609559.
[8] Shihao Zhou, Zhilei Qiao, Qianzhou Du, G. Alan Wang, Weiguo Fan, and Xiangbin Yan. "Measuring Customer Agility from Online Reviews Using Big Data Text Analytics." Journal of Management Information Systems, vol. 35, no. 2, pp. 510-539, 2018, doi: 10.1080/07421222.2018.1451956.
[9] Stoica, Ion. "Trends and challenges in big data processing." Proceedings of the VLDB Endowment 9.13 (2016): 1619.
[10] https://www.ibm.com/cloud/blog/hadoop-vs-spark
[11] Thakur, Bhupender Singh, and Kishori Lal Bansal. "Performance Evaluation of Apache Hadoop, Apache Spark, and Apache Flink." Advances in Management, Social Sciences and Technology: 93.
[12] Verma, Ankush, Ashik Hussain Mansuri, and Neelesh Jain. "Big data management processing with Hadoop MapReduce and Spark technology: A comparison." 2016 Symposium on Colossal Data Analysis and Networking (CDAN), IEEE, 2016.
[13] Lee, Jinbae, Bobae Kim, and Jong-Moon Chung. "Time estimation and resource minimization scheme for Apache Spark and Hadoop big data systems with failures." IEEE Access 7 (2019): 9658-9666.
[14] Nasser, T., and R. S. Tariq. "Big data challenges." J Comput Eng Inf Technol 4:3 (2015), doi: 10.4172/2324-9307.2.
[15] Salloum, S., Dautov, R., Chen, X., et al. "Big data analytics on Apache Spark." Int J Data Sci Anal 1, 145-164 (2016), https://doi.org/10.1007/s41060-016-0027-9.
[16] Hajjaji, Yosra, et al. "Big data and IoT-based applications in smart environments: A systematic review." Computer Science Review 39 (2021): 100318.
[17] Oussous, Ahmed, et al. "Big Data technologies: A survey." Journal of King Saud University - Computer and Information Sciences 30.4 (2018): 431-448.
[18] Aziz, Khadija, Dounia Zaidouni, and Mostafa Bellafkih. "Leveraging resource management for efficient performance of Apache Spark." Journal of Big Data 6.1 (2019): 1-23.
