Apache Spark Based Analysis On Word Count Application in Big Data
Department of Computer Science and Engineering
SRM Institute of Science and Technology, Vadapalani Campus
Chennai, India
[email protected]
I. INTRODUCTION
978-1-6654-6643-1/22/$31.00 ©2022 IEEE
2022 2nd International Conference on Innovative Practices in Technology and Management (ICIPTM)
Section 4 presents the experimental results. The research challenges are explained in Section 5.

II. BIG DATA PROCESS

BD is a collection of large amounts of data, so BD tools are used to process high volumes of data with low computational time. BD frameworks such as Hadoop and Spark are explained in this section. Hadoop supports batch processing, while Spark supports both batch and stream processing.

A. Hadoop

Hadoop is a big data processing framework that permits users to store and process very large data sets (gigabytes to petabytes) by allowing a group of computers or nodes to solve huge and complex data problems. It is a flexible and low-cost solution for storing, analyzing, and processing structured, semi-structured, and unstructured data. It is highly scalable because it uses commodity hardware, and it provides fault tolerance through the replication factor. The default Hadoop replication factor is 3, but it is configurable: it can be reduced to 2 or increased beyond 3. Hadoop provides advanced analytics for stored data. Generally, there are four types of data analytics: descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics. Hadoop also supports data mining and machine learning (ML). The initial step in Hadoop data processing is to divide a vast volume of data into smaller tasks. The small jobs are distributed across a Hadoop cluster and completed in parallel using the MapReduce algorithm. MapReduce is responsible for parallel processing to reduce the processing time of large volumes of data [11].

B. Components of Hadoop

The term Hadoop is often used to refer to both the core components of Hadoop and the ecosystem of related projects. The core components of Hadoop are the Hadoop Distributed File System (HDFS), MapReduce, YARN, and Hadoop Common. Hadoop Common is an essential part of the Apache Hadoop framework: the collection of common utilities and libraries that support the other Hadoop modules [13]. Large amounts of data are stored in HDFS, which handles large data sets running on Commodity Hardware (CH). CH is low-specification industry-grade hardware; it scales a single Hadoop cluster to hundreds or even thousands of nodes, so BD supports horizontal scaling. The next component is MapReduce, the processing unit of Hadoop and a core component of the Hadoop framework; it has two parts, the mapper and the reducer. MapReduce processes data by splitting a huge volume of data into smaller units and processing them simultaneously. Originally, MapReduce was the only way to access the data stored in HDFS; other components of the system, like Hive and Pig, came later. Hive provides SQL-like queries and strong statistical functions. Pig is popular for its multi-query approach, which reduces the number of times the data is scanned. The last component is YARN, short for Yet Another Resource Negotiator. YARN is a very important component because it allocates the RAM and CPU for Hadoop to process the data stored in HDFS.

C. Spark

Spark is a BD processing engine and also a cluster computing engine designed for fast execution and in-memory computation, and it supports different workloads. Spark manages batch processing, interactive querying, streaming, and iterative computations [9]. The Apache Spark system includes the Spark core and several libraries: Spark MLlib, GraphX, Spark Streaming, and Spark SQL. Machine learning is performed by Spark MLlib, graph analysis is handled by GraphX, stream processing is done by Spark Streaming [17], and structured data processing is addressed by Spark SQL. The performance of Spark can be increased by resource management [18].

III. APPLICATION OF BIG DATA IN IOT AND CLOUD

IoT sensor devices continuously generate data based on the application, so big data plays a major role in collecting, processing, and analyzing the large volumes of data gathered from IoT devices. The cloud is the internet of services: it is used for storage, provides a platform as a service for organizations and individuals, and provides software as a service. IoT and the cloud, together with big data, play an important role in many applications. Table 1 summarizes recent applications of BD.

The authors described a new data storage framework that maintains high consistency. BD is a vast volume of information with a complicated structure, and traditional database management solutions are incapable of handling such a large amount of data. The rapidly growing number of BD applications requires an efficient database architecture that stores and manages critical data and supports scalability and availability for data analytics. Parallel and distributed data processing needs to support consistency, availability, and partition tolerance, which play an important role in data processing. Like the ACID properties in relational database systems, the CAP theorem explains consistency, availability, and partition tolerance in BD management systems. To meet business needs, many NoSQL systems aim to achieve high consistency under the CAP theorem. The proposed system provides strong consistency through a Scalable Distributed Two-Layer Data Store (SD2DS) [1]. The work is divided into two parts: first it analyses all consistency issues, and second it designs a scheduling algorithm for supporting strong consistency. In this paper, high consistency is demonstrated for all basic operations.

Diabetes is a widespread disease, so the authors analyze diabetes datasets to predict optimal results for diabetes patients. Hadoop MapReduce applies data mining algorithms: an attention network, Decision Tree (DT), outlier-based multiclass classification, and association rule techniques were used to provide a solution for the patient. MapReduce is the processing engine for BD. Through MapReduce, the BD set is divided into manageable data sets with different attributes and then passed to the Decision Tree, the Apriori algorithm, and outlier-based multiclass classification. Using these algorithms, the diabetes patients are classified and their insulin levels are determined [2].

Smart meters are used to accurately read energy consumption levels from smart buildings and smart industry, and they play an important part in the growing energy management system. Velocity is one of the characteristics of big data: the sensors in smart meters generate data at an alarming rate, which defines the velocity. The authors designed a BD system that reads one million smart meters' data in near real time. The proposed system consists of four modules: a BD storage block, data modelling, a big data processing and querying block, and data visualization. The voluminous data is captured, modelled by an optimized query design, and then stored in the BD storage, HDFS, which consists of master and slave nodes. Master nodes contain metadata, and the slave nodes are the actual workers. Data is fetched from the storage area, and the Auto-Regressive Integrated Moving Average (ARIMA) [3] model is applied for the analytics. Finally, the results reach the data visualization stage: the energy consumption level visualized on the device is used by the house owner to manage devices for a lower energy bill.

With the ever-increasing amount of data, storing BD in a single data centre is no longer feasible, and storing and accessing data is a challenging issue in BD. The authors store and analyze BD in multiple data centres located in different geographical areas. The Hadoop and Spark frameworks are designed to process data locally within the same data centre, so they must replicate all data to a single data centre before performing any operation in a locally distributed computation. Because of bandwidth constraints, communication costs, data privacy, and security, copying all data from several data centres to one data centre can be a bottleneck. The implemented Random Sample Partition (RSP) method divides the large data into packets of sample data blocks and distributes these blocks across different data centres, with or without duplication. Important data is replicated to increase availability [4].

In this paper, the authors implement an efficient movie recommendation system in a BD environment. Recommendation systems are systems designed to recommend items to users based on their interests. Organizations like Netflix, Amazon, and others use recommendation frameworks to help their clients identify the right item. A recommendation system has to deal with large volumes of data. The proposed work implements the collaborative filtering algorithm to recommend appropriate products to customers [5].

Ambulance services are the first service provided by the emergency department of a hospital. One of the most important aspects of an ambulance service is determining the severity of the situation; another important analysis is finding the closest hospital or clinic that is less crowded, so the patient can be admitted immediately. For emergency departments, the authors explained a machine learning pre-diagnosis system that uses Random Forest, Decision Tree, and Probabilistic Neural Network (PNN) models based on DDA (Dynamic Decay Adjustment), trained separately on the dataset. This model assesses the emergency level of the case and finds the most suitable and available health care [6].

To forecast future weather conditions in cloud computing, the authors designed an architecture for parallel and distributed
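The word count workload named in this paper's title follows the mapper/shuffle/reducer pattern described in Section II-B. As a minimal, self-contained sketch (plain Python rather than a Hadoop or Spark cluster; all function names here are illustrative, not part of any framework API), the three phases can be written as:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key, as Hadoop does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: sum the partial counts for one word.
    return key, sum(values)

def word_count(lines):
    # Run map over every line, shuffle, then reduce each group.
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())

if __name__ == "__main__":
    corpus = ["big data needs big tools", "spark processes big data"]
    print(word_count(corpus))
    # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'spark': 1, 'processes': 1}
```

On a real cluster the mapper and reducer run as distributed tasks over HDFS splits; in Spark the same logic collapses into a flatMap over the lines followed by reduceByKey on the pairs, executed in memory, which is the behaviour the experiments in this paper measure.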
3. Velocity

The problem of data streaming falls under the velocity characteristic of BD. Data availability and concept drift are two BD challenges that must be addressed while data is in motion.

4. Veracity

Uncertainty, falsehood, and missing values are all examples of veracity issues. The quality of the data influences the quality of the research findings. Many experts believe that the biggest difficulty in BD is truthfulness.

VI. CONCLUSION

The foundational concepts of BD are presented in this work, including BD characteristics, types of BD, BD processing, big data applications, and big data challenges. Data analysis using the Spark big data processing engine shows that Spark is the better execution engine, offering high scalability and achieving lower processing time as the number of cores in Apache Spark increases.

REFERENCES

[12] Verma, Ankush, Ashik Hussain Mansuri, and Neelesh Jain. "Big data management processing with Hadoop MapReduce and Spark technology: A comparison." 2016 Symposium on Colossal Data Analysis and Networking (CDAN). IEEE, 2016.
[13] Lee, Jinbae, Bobae Kim, and Jong-Moon Chung. "Time estimation and resource minimization scheme for Apache Spark and Hadoop big data systems with failures." IEEE Access 7 (2019): 9658-9666.
[14] Nasser, T., and R. S. Tariq. "Big data challenges." J Comput Eng Inf Technol 4:3 (2015). doi:10.4172/2324-9307.2
[15] Salloum, S., Dautov, R., Chen, X., et al. "Big data analytics on Apache Spark." Int J Data Sci Anal 1 (2016): 145-164. https://doi.org/10.1007/s41060-016-0027-9
[16] Hajjaji, Yosra, et al. "Big data and IoT-based applications in smart environments: A systematic review." Computer Science Review 39 (2021): 100318.
[17] Oussous, Ahmed, et al. "Big Data technologies: A survey." Journal of King Saud University-Computer and Information Sciences 30.4 (2018): 431-448.
[18] Aziz, Khadija, Dounia Zaidouni, and Mostafa Bellafkih. "Leveraging resource management for efficient performance of Apache Spark." Journal of Big Data 6.1 (2019): 1-23.