
Proceedings of the Third International Conference on Intelligent Sustainable Systems [ICISS 2020]

IEEE Xplore Part Number: CFP20M19-ART; ISBN: 978-1-7281-7089-3

A Study of Big Data Analytics using Apache Spark


with Python and Scala
Yogesh Kumar Gupta 1, Surbhi Kumari 2
1 Assistant Professor, Department of Computer Science, Banasthali Vidyapith
2 M.Tech (CSE) Research Scholar, Banasthali Vidyapith
[email protected], [email protected]

DOI: 10.1109/ICISS49785.2020.9315863

Abstract— Data is generated by humans every day via various sources such as Instagram, Facebook, Twitter, Google, etc., at a rate of 2.5 quintillion bytes, with high volume, high speed, and high variety. When this huge volume of data with high velocity is handled by traditional approaches, processing becomes inefficient and time-consuming. Apache Spark, an open-source in-memory cluster computing system, has been used for fast processing. This paper presents a brief study of Big Data Analytics and Apache Spark, covering the characteristics (7V's) of big data, tools and application areas for big data analytics, as well as the Apache Spark Ecosystem, including its components, libraries, and cluster managers, that is, the deployment modes of Apache Spark. Furthermore, it also presents a comparative study of the Python and Scala programming languages across many parameters in the context of Apache Spark. This comparative study will help identify which programming language, Python or Scala, is suitable for Apache Spark technology. As a result, both Python and Scala are suitable for Apache Spark; however, the choice of language for programming in Apache Spark depends on the features that best suit the needs of the project, since each one has its own advantages and disadvantages. The main purpose of this paper is to make it easier for programmers to select a programming language for Apache Spark based on their project.

Keywords— Big Data, Apache Spark, Cluster Computing, Python, Scala.

I. INTRODUCTION

Big Data is a large set of data that can be in structured, semi-structured, or unstructured form, and is gathered from a variety of data sources like social media, cell phones, healthcare, e-commerce, etc. John Mashey coined the term Big Data in the 1990s, and it became popular in the 2000s. There are tools and techniques of big data analytics, such as Apache Hadoop, MapReduce, Apache Spark, NoSQL databases, and Apache Hive, that are used to manage and process massive amounts of big data. Collecting and processing huge amounts of big data helps organizations gain a better understanding of their data. Moreover, it also helps to find the information that is most important for future business decisions. There are three types of Big Data.

Structured Data:- Data that is already stored in databases in an ordered manner is called structured data. Nowadays, at least 20% of all data is structured data, which is generated from sensors, weblogs, machines, humans, etc. Examples of structured data are DBMS tables, MySQL, spreadsheets, etc.

Semi-structured Data:- Datasets that mix structured and unstructured formats are called semi-structured data; this is why developers face difficulty categorizing them. Moreover, semi-structured data can also be handled through the Hadoop system. Examples of semi-structured data are JSON documents, BibTeX files, CSV files, XML, etc.

Unstructured Data:- Unstructured data is data that does not have any fixed format and cannot be stored in row-column form. It can only be handled through the Hadoop system. At least 80% of the data in the world is unstructured. Examples of unstructured data are images, audio, video, text, PDFs, media posts, Word documents, log files, e-mail data, etc.

Hadoop is one of the popular open-source, scalable, fault-tolerant platforms for large-scale distributed batch processing using clusters of commodity servers. It was developed against common failure issues of execution in a distributed system. However, its performance compared with other technologies is not good, since data is accessed from disk for processing. Hadoop provides fault tolerance, so organizations do not need expensive products for processing tasks on large data sets. There are two key Hadoop building blocks: the Hadoop Distributed File System (HDFS), which can accommodate large datasets, and a MapReduce engine that evaluates results in batches. MapReduce is a distributed programming model for processing massive datasets across a large cluster. It has two functions, Map and Reduce, which help to utilize the available resources for parallel processing of large data. It is used for batch processing over persistent storage. However, MapReduce was not developed for real-time processing.

Apache Spark is a powerful, flexible, and user-friendly open-source parallel processing platform which is very appropriate for storing and performing big data analytics. It can run on vast cloud clusters, on a small cluster, or even locally on student computers with smaller datasets. Providers such as AWS and Google Cloud support it. With RDDs, it can quickly perform processing tasks on very large data sets stored in memory. The Apache Spark framework consists of several dominant components

978-1-7281-7089-3/20/$31.00 ©2020 IEEE 471


Authorized licensed use limited to: Universidade Tecnologica Federal do Parana. Downloaded on October 27,2024 at 21:23:43 UTC from IEEE Xplore. Restrictions apply.

which include Spark Core and upper-level libraries: Spark SQL, Spark Streaming, Spark MLlib, GraphX, and SparkR, which help to perform a wide range of workloads including batch processing, machine learning, interactive queries, streaming processing, etc. Apache Spark provides language-integrated APIs in SQL, Scala, Java, Python, and R. The major functions are accomplished on Spark Core, and the other components are tightly integrated with Spark Core, which provides one unified environment. It is more efficient than other technologies, especially for iterative or interactive applications. In this way, Apache Spark is better than other technologies.

A. Characteristics of Big Data

Fig. 1. 7Vs of Big Data

 Volume:- Big data implies enormous volumes (terabytes, petabytes) of data. Data is created by various sources, including the internet explosion, social media, healthcare, the Internet of Things, e-commerce, and other systems, which need large storage, so the volume of data is a big challenge.
 Velocity:- The velocity of data refers to the low latency, or how fast, data is generated by multiple sources at high velocity, viz. social media data, healthcare data, and retail data. Every second, 1.7 MB of data is produced per person during 2020.
 Variety:- Variety refers to the various types of data that can be structured, unstructured, or semi-structured, existing in different forms, for example text data, emails, tweets, log files, user reviews, photos, audio, videos, and sensor data. Example:- A high-variety data set would be the CCTV audio and video files that are produced at different places in a city.
 Veracity:- The veracity of data refers to noise and abnormality in big data. Not all data will be 100% correct when dealing with high volume, velocity, and variety of data, so there can be dirty data. If anyone wants to extract meaningful insights from data, they need to clean it up to minimize noise. Big data benefits can only come from applications when the data is meaningful and reliable. Therefore, data cleansing is necessary so that inaccurate and unreliable information can be filtered out. Example:- A data set of high veracity would be from a medical procedure or trial.
 Validity:- The validity of data refers to the accuracy and correctness of the data used to obtain the result in the form of information. It is very important for making decisions.
 Volatility:- The volatility of big data refers to how long stored data remains useful for future use, since data that is valid right now might not be valid just a few minutes or a few days later.
 Value:- The value of data is the most important element of the 7V's of big data. It is not just the amount of data that is stored or processed by individuals; in reality, it is the amount of precious, accurate, and trustworthy information that needs to be stored, processed, and analyzed to find insights.

B. Challenges in Big Data Analytics

There are many computational methods available that work well with small data but do not work well for data that is generated with high volume and velocity. Traditional tools are not efficient enough to process big data, and the complexity of processing big data is very high. The following challenges arise when big data is analyzed.

1) Data Heterogeneity and Incompleteness. A major problem for researchers is how to include all the data from various sources to discover patterns and trends. It can remain difficult to analyze unstructured and semi-structured data; hereby, data must be discreetly structured before analysis. Ontology matching is a common approach based on semantics that detects the similarities among ontologies of multiple sources. After data cleaning and error correction, certain incompleteness and errors can persist in datasets. During data processing, this incompleteness and these errors must be handled, and it is a challenge to do this in the right way.
2) Scalability and Storage. In data analytics, the management and analysis of massive volumes of data is a challenge. Storage systems are not adequately capable of storing rapidly increasing data sets, though such problems can be ameliorated by improving processor speed. Therefore, a processing system needs to be developed that will also meet the needs of the future.
3) Security and Privacy. A much more serious issue is how to find meaningful information from large and rapidly generated datasets. Researchers have many methods and techniques to access data from any data source to discover trends in the data, and they have ceased worrying about an individual's security and privacy.


When data is shared and agglomerated across dynamic or distributed computing systems, organizations have been using diverse de-identification approaches to maintain privacy and security.
4) Human Collaboration. Despite the enormous advancements made in computational analysis, there are still many patterns that humans can easily identify but computer algorithms have a difficult time identifying. The big data analysis framework must support input from diverse human experts and the sharing of results. These experts may be separated in space and time when it is too expensive to assemble an entire team in one room. This distributed expert input must be accepted by the data system, and their collaboration must be supported.

Other challenges: Data Replication, Data Locality, Combining Multiple Data Sets, Data Quality, Fault-Tolerance of Big Data Applications, Data Availability, Data Processing, and Data Management.

C. Tools of Big Data Analytics
1) Apache Hadoop:- Apache Hadoop is one of the most prominent big data frameworks and is written in Java. Hadoop was originally designed to continuously gather data from multiple sources, without worrying about the type of data, and store it across a distributed environment. Moreover, it can only perform batch processing.
2) MapReduce:- MapReduce is a programming model that processes and analyzes huge data sets. Google introduced it in December 2004. It is used for batch processing over persistent storage; however, MapReduce was not built for real-time processing.
3) Apache Hive:- Apache Hive provides a SQL-like query language and was established by Facebook. Hive is a data warehousing component that performs reading, writing, and managing of large datasets in a distributed environment.
4) Apache Pig:- Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop, originally developed by Yahoo in 2006. By using it, all data manipulation operations in Hadoop can be performed.
5) Apache HBase:- Apache HBase is a distributed, column-oriented database that runs on top of the HDFS file system. It is nothing but a NoSQL data store; it is similar to a database management system, but it provides quick random access to huge amounts of structured data.
6) Apache Storm:- Apache Storm is an open-source distributed real-time computation system. It is used wherever a lot of streaming data is generated. Twitter uses it for real-time data analysis.
7) Apache Cassandra:- Apache Cassandra is a free, open-source NoSQL database that was created by Facebook. It is very popular and very robust for handling huge amounts of data.
8) Apache Spark:- Apache Spark is one of the most prominent and highly valued big data frameworks. It was developed by people from the University of California and written in Scala. The performance of Apache Spark is fast because it has in-memory processing. It does real-time data processing as well as batch processing with huge amounts of data; it requires a lot of memory, but it can deal with standard speeds and amounts of disk.

D. Application Areas of Big Data Analytics
 Healthcare:- Big Data analytics is used with the aid of a patient's medical history data to determine how likely they are to have health problems. Furthermore, big data analytics is used in healthcare to minimize costs, predict epidemics, and prevent preventable diseases. The Electronic Health Record is one of the most popular applications of Big Data in the healthcare industry.
 Banking:- Banks use big data analytics to identify fraudulent activities in transactions. Because the analytics system stops fraud before it occurs, the bank improves profitability.
 Media and Entertainment:- The entertainment and media industries are using big data analytics to understand what content, products, and services people want.
 Telecom:- Telecoms are among the most relevant contributors to big data. They improve their services and route traffic more efficiently. Furthermore, the analytics system is used to recognize records of call details and fraudulent behavior, which also helps them take action immediately.
 Government:- The Indian government has used big data analytics to help law enforcement and to estimate trade in the country. Due to big data analytics, governmental procedures gain competencies in terms of expenditure, productivity, and innovation.
 Education:- Nowadays the education sector is gradually adopting big data analytics. As a result, big-data-powered technologies have improved learning tools. Besides, it is used to enhance and develop existing courses according to trade requirements.
 Retail:- Retail uses Big Data Analytics to optimize its business, including e-commerce and in-store operations, for example Amazon, Flipkart, Walmart, etc.

E. Overview of Apache Spark Technology

Apache Spark is an open-source, distributed, in-memory cluster computing framework designed to provide faster and easier-to-use analytics than Hadoop MapReduce. In 2009, the AMPLab at UC Berkeley designed Apache Spark; it was first released as open source in March 2010 and donated to the Apache Software Foundation in June 2013. This open-source framework stands out for its ability to process large volumes of data. Spark is up to 100 times faster than Hadoop's MapReduce, since no time is consumed in transferring data in or out of the disk for processing, because all of these processes are done in memory. It supports stream processing, known as real-time processing, which includes continuous input and output of data and is suitable for trivial operations and massive data processing on large clusters. Many organizations in healthcare, banking, telecom, e-commerce, etc., and all of the major gigantic technology companies, such as Apple, Facebook, IBM, and Microsoft, use Apache Spark to improve their business insights. These companies collect data in terabytes from various sources and process it to enhance consumer services.

The Apache Spark Ecosystem has various components, including Spark Core and upper-level libraries such as Spark SQL, Spark Streaming, Spark MLlib, GraphX, and SparkR, which are built atop Spark Core. Cluster managers, viz. Standalone, Hadoop YARN, Mesos, and Kubernetes, are operated by Spark Core.

Fig. 2. Apache Spark Ecosystem

a) Spark Core:- Spark Core is the main component of the Apache Spark tool; all processes of Apache Spark are handled by Spark Core. Apache Spark provides libraries such as Spark SQL, Spark Streaming, Spark MLlib, and GraphX that are built on top of Spark Core. It has RDDs, which stands for Resilient Distributed Datasets, which help to execute Spark's libraries in a distributed environment.
b) Spark SQL:- Spark SQL is a data processing framework in Apache Spark that is built on top of Spark Core. Both structured and semi-structured data can be accessed via Spark SQL. Spark SQL can read data in different formats such as text, CSV, JSON, Avro, etc. It can power interactive, analytical applications over both streaming and historical data.
c) Spark Streaming:- The Spark Streaming component is built on top of Spark Core and is used to process real-time and live data. It also allows us to perform batch and streaming processing of data in the same application.
d) Spark MLlib:- MLlib is a package of Apache Spark which accommodates multiple types of machine learning algorithms (classification, regression, and clustering) on top of Spark. It performs data processing on large datasets to recognize patterns and make decisions. Machine learning algorithms run with many iterations toward the desired objective in an adaptable manner.
e) GraphX:- Apache Spark comes with a module that allows distributed computation on graph data structures. A graph data structure represents a network of entities, such as a social network. GraphX is also related to Pregel, which was revealed by Google in 2010.
f) SparkR:- SparkR is a module of Apache Spark that provides a lightweight frontend for using Spark from R. SparkR is a cluster computing platform that allows the processing of structured data and ML tasks, although the R programming language was not designed to manage large datasets that cannot fit on a single machine.

The cluster manager manages a cluster of computers, which consists of the CPU, memory, storage, ports, and other resources available on a cluster of nodes. Spark supports cluster managers including Standalone, YARN, Mesos, and Kubernetes, each of which provides a script that can be used to deploy a Spark application. Apache Spark can be operated through many cluster managers. Currently, the following modes are available for the deployment of Spark.

1. Standalone:- The term standalone means that it does not need an external scheduler. The Spark standalone cluster manager provides everything needed to start Apache Spark, and it can run quickly with few dependencies or environmental considerations. Standalone is a cluster management technique that is responsible for managing the hardware and memory that run on a node. Spark applications are supported through it. Furthermore, it manages the Spark components and provides limited functionality. In the general sense, several desktop applications are also standalone, such as Microsoft Word, Autodesk 3ds Max, Adobe Photoshop, and Google Chrome. This cluster manager can be run on Linux, Mac, or Windows.
2. Hadoop YARN:- YARN (Yet Another Resource Negotiator) is also a generic open-source cluster manager that enables Spark applications to share cluster resources with Hadoop MapReduce applications. It is associated with a component called the JobTracker that provides features such as cluster management, job scheduling, and monitoring capabilities. YARN supports both client and cluster mode deployment of a Spark application. It can run on Linux or Windows.
3. Apache Mesos:- Apache Mesos is also an open-source cluster manager, and it is exquisitely scalable to


thousands of nodes. It is a master-slave-based system and is fault-tolerant. For a cluster of machines, it can be thought of as an operating system kernel. It pools computing resources together on a cluster of machines and allows those resources to be shared among different applications. Mesos was developed to support a diversity of distributed computing applications that can share both static and dynamic cluster resources. Some organizations, such as Twitter, Xogito, and Media Crossing, use Apache Mesos, and it can be run on Linux or Mac operating systems.
4. Kubernetes:- Spark runs on clusters that are organized through Kubernetes. As an open-source container management platform, it has been integrated into Spark. It came out of Google in 2014. Kubernetes brings advantages such as feasibility and stability, so a full Spark cluster can be run on it. Kubernetes is a portable and cost-effective platform that comes with self-healing abilities. It is designed for managing complex distributed systems while preserving the advantages that containers provide.

II. LITERATURE REVIEW

Table 1. Literature Survey (Sr. No. / Author Name / Algorithms or Techniques / Observation)

[1] Omar, H. K., & Jumaa, A. K. (2019). Techniques: Apache Spark, MLlib with Python and Scala. Observation: The authors presented a comparison between Java and Scala for evaluating time performance in Apache Spark with MLlib. They also explain the tools, APIs, programming languages, and Spark machine learning libraries in Apache Spark, and discuss the advantages of loading and accessing data from stored sources like Hadoop, HDFS, Cassandra, HBase, etc. The authors concluded that the performance of Scala is much better than that of Java.

[2] Van-Dai Ta et al. (2016). Techniques: Apache Spark, Streaming API, machine learning and data mining techniques. Observation: The paper creates a general architecture using the Spark Streaming method that can be implemented in the healthcare system for big data analytics, and explains how efficiency can be enhanced through machine learning and data mining techniques.

[3] Keerti, Singh, K., & Dhawan, S. (2016). Techniques: Hadoop MapReduce, Apache Spark. Observation: The researchers focus on a big data application model that can be used in real-time systems, the social network area, and the healthcare system. The paper gives an introduction to MapReduce, Hadoop, and Spark, and compares Spark with MapReduce. The three-layered model of big data performs operations such as calculating the average speed rate, code necessity, etc., with Spark. The application was conducted with data processing techniques and applied to the healthcare system.

[4] Ajaysinh, R. P., & Somani, H. (2016). Techniques: Apache Spark and machine learning algorithms. Observation: The authors implemented a healthcare model using different analysis and prediction techniques with machine learning algorithms for better predictions.

[5] Salwan, P., & Maan, V. K. (2020). Techniques: Apache Spark. Observation: The work focuses on an e-governance system built using Apache Spark for analyzing government-collected data. The authors give a brief explanation of the architecture of Apache Spark, including the core layer, ecosystem layer, resource management, and the methods used in Spark. A government department generates data of such high volume that it cannot be managed by a traditional database management system, so a more efficient system was built using big data analytics techniques. Furthermore, the work resolves main issues of traditional database management systems such as speed, mixed-type datasets, accuracy, etc.

[6] Bhattacharya, A., & Bhatnagar, S. (2016). Techniques: Hadoop MapReduce and Spark tools. Observation: The authors explain the concept of big data and Apache Spark, first introducing big data and a very important part of big data, the V's. Moreover, big data analytics, its security issues, and a variety of tools available in the market, like Hadoop MapReduce and Apache Spark, are explained. Furthermore, a comparison between Hadoop's MapReduce and Apache Spark is presented on features such as memory and competing products.

[7] Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Techniques: Apache Spark. Observation: The paper focuses on the basic components and unique features of Apache Spark for big data analytics. With the help of Apache Spark, ML pipeline APIs and distinct utilities are produced for design and implementation. The authors illustrate how the popularity of Apache Spark technology has grown in the big data analytics research field.

[8] Hongyong Yu & Deshuai Wang (2012). Techniques: MapReduce, Apache Hive. Observation: The authors explain how to improve the privacy and security of data in big data analytics. They discuss both the MapReduce and Apache Hive frameworks with their usages, ease of use, capability, and processing.

[9] Hussain et al. (2016). Techniques: Apache Spark and Hadoop MapReduce. Observation: The authors focus on a learning analytics model that can predict futures and trends for educational organizations. Furthermore, they explore usages and methodologies (Hadoop MapReduce) of big data and recognize issues such as data privacy, capacity, processing, and analysis. The authors conclude that the prediction model helps education authorities with learning activities and patterns that are in the trends.

[10] Shaikh, E., Mohiuddin, I., Alufaisan, Y., & Nahvi, I. (2019). Techniques: Apache Spark. Observation: The authors discuss how in-memory computing processes are performed in Apache Spark and compare Spark with other tools for fast computing. Furthermore, they explain batch processing and stream processing capabilities, and also discuss the multithreading and concurrency capabilities of Apache Spark.

[11] M. U. Bokhari et al. Techniques: Apache Spark, HDFS, and machine learning. Observation: The authors implemented a three-layered model of architecture for storage and data analysis, using HDFS for data storage and machine learning algorithms for data analysis.

[12] U. R. Pol (2016). Techniques: Apache Spark, Hadoop MapReduce. Observation: The author gives a brief explanation of big data and its analytics using Apache Spark, and explains how Apache Spark overcomes Hadoop, which is a good framework for data processing and open-source distributed computing for reliability and scalability in big data analytics.

[13] Sunil Kumar & Maninder Singh (2019). Techniques: Apache Hadoop. Observation: They discuss the impact of big data on the healthcare system and how to manage the different tools that are available in the Hadoop ecosystem. Moreover, they explore a conceptual architecture of big data analytics for healthcare systems.

[14] Aziz et al. (2018). Techniques: Apache Spark, MapReduce. Observation: The authors describe how to process real-time data using the Apache Spark and Hadoop tools in big data analytics, and compare Apache Spark and Hadoop for fast computing.

[15] Amol Bansod (2015). Techniques: Apache Spark, Hadoop MapReduce. Observation: The author describes the advantages of the Apache Spark framework for data processing in HDFS and compares it with Hadoop MapReduce and other data processing frameworks.

[16] Shoro, A. G., & Soomro, T. R. (2015). Techniques: Apache Spark and Twitter Stream API. Observation: The authors discuss the concept of big data, the characteristics of big data (V's), and big data analytics tools such as Hive, Pig, Zebra, HBase, Chukwa, Storm, and Spark. Moreover, they give some reasons for when Apache Spark technology should or should not be used. Furthermore, they perform data processing on Twitter data with Apache Spark and the Twitter Stream API.

[17-23] Gupta, Y., et al. Techniques: Apache Pig. Observation: The authors analyzed various datasets, such as stock exchange data, crime rates of India, the population of India, and healthcare data, using Apache Pig. They also elaborated on the various tools and techniques used to analyze massive volumes of data, i.e., big data, in the Hadoop distributed file system on a cluster of commodity hardware, and describe various image processing techniques.

[24] Shrutika Dhoka & R. A. Kudale (2016). Techniques: Apache Spark. Observation: The authors developed a conceptual architecture on the Apache Spark platform to overcome the problems encountered when processing big data in the healthcare system.

In this paper, big data analytics and Apache Spark are explained in various aspects. The authors focus on the characteristics (7V's) of big data, tools and application areas for big data analytics, as well as the Apache Spark Ecosystem, including components, libraries, and cluster managers, that is, the deployment modes of Apache Spark. Ultimately, they also present a comparative study of the Python and Scala programming languages with various parameters in the context of Apache Spark.

III. RESEARCH GAP

After reading all these research papers, it is clear that vast amounts of Big Data can be processed using the Python and Scala programming languages over Apache Spark. Also, the present comparisons between the two programming languages for fast data processing in Apache Spark define which programming language is best suited for Apache Spark and can give a better result in Big Data Analytics.

978-1-7281-7089-3/20/$31.00 ©2020 IEEE 476


IV. COMPARATIVE STUDY OF PYTHON AND SCALA IN SPARK

Python is an object-oriented, high-level programming language with functional features; it is interpreted rather than compiled ahead of time to machine code. PySpark is the Python API for Apache Spark; it lets Python programs create RDDs and apply the many operations defined on them. Python is a very powerful and widely preferred language because of the high availability of libraries.

Scala is an object-oriented and functional programming language that runs on the JVM (Java Virtual Machine) and helps developers to be more inventive. It is a great programming language that helps to write valid code without errors and to develop Big Data applications.

The learning curve of Scala is a bit steep compared to Python, whose syntax is simple; Scala has enigmatic syntax, and many operations can be combined into a single statement.

The availability of libraries is very rich in Python in contrast to Scala, and Scala's libraries are less convenient for data scientists, so Python is the preferable language in this respect. In Python, the testing process and its methodologies are very complex because the language is dynamically typed, whereas testing Scala code is less complex. Python is less verbose because of its dynamic typing, while Scala, as a statically typed language, can identify errors at compilation time, making it a better alternative for large-scale projects.
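The RDD operations that PySpark exposes are lazy transformations (such as map and filter) that only execute when an action like collect is called. Since a live Spark cluster is beyond the scope of this paper, the following standard-library-only sketch mimics that lazy chaining style; the LocalRDD class is a hypothetical stand-in for illustration, not part of PySpark.

```python
# Stdlib-only mock of the lazy transformation chaining that PySpark's
# RDD API provides; LocalRDD is an illustrative invention, not Spark code.
class LocalRDD:
    def __init__(self, data):
        self._data = data  # any iterable; nothing is computed yet

    def map(self, fn):
        # Lazily apply fn to every element, as RDD.map does.
        return LocalRDD(fn(x) for x in self._data)

    def filter(self, pred):
        # Lazily keep only elements satisfying pred, as RDD.filter does.
        return LocalRDD(x for x in self._data if pred(x))

    def collect(self):
        # The "action": only here is the whole chain actually evaluated.
        return list(self._data)

rdd = LocalRDD(range(10))
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())  # [0, 4, 16, 36, 64]
```

With a real Spark installation the same pipeline would read roughly `sc.parallelize(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()`, where `sc` is a SparkContext.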
Table 2. Comparison between Python and Scala

Sr. No  Parameter                   Python                         Scala
01      Performance                 Slower                         Faster
02      Learning curve              Easy                           Tough
03      Machine learning libraries  Rich                           Limited
04      Platform                    Interpreter                    Compiler
05      Visualization libraries     Rich                           Limited
06      Type safety                 Dynamically typed              Statically typed
07      Testing                     Very complex                   Less complex
08      Simplicity                  Easy to learn; simple code     May be harder to learn than Python
09      Verbosity                   Less verbose                   More verbose
10      Concurrency                 No true support (GIL)          Highly supported
11      IDE                         PyCharm, Jupyter               Eclipse, IntelliJ
12      Spark shell                 pyspark                        spark-shell
13      Community support           Much larger                    Smaller

Scala supports multithreading, so it can handle parallelism and concurrency, while Python's multithreading is limited: the Global Interpreter Lock (GIL) prevents CPU-bound threads from running in parallel. The Python community keeps organizing conferences and meetups and works on the code to develop the language; Python has much larger community support than Scala.

V. DISCUSSION

This discussion concerns which programming languages are appropriate for the big data field. Many programming languages are used to solve Big Data problems, and for big data professionals choosing a language is a most important decision. The chosen language must be suitable for performing analysis and manipulation of Big Data problems so that the desired output can be achieved. Omar, H. K., & Jumaa, A. K. (2019) [1] found that the performance of the Scala programming language is better than Java performance in Spark MLlib.

The authors have presented a comparative study of the Python and Scala programming languages across several parameters for Apache Spark. Scala is faster but a little difficult to learn, whereas Python is slower but very simple to use. Scala does not provide as many big data analytics tools and libraries as Python for machine learning and natural language processing, and there are no good visualization tools for Scala. Python's support for Spark Streaming is not as advanced and mature as Scala's; consequently, Scala is the best option for Spark Streaming functionality. The Apache Spark framework is itself written in Scala, so with Scala big data developers can easily dig into the source code. Scala is more engineering-oriented while Python is more analytics-oriented; both languages are excellent for developing big data applications. To exploit the full potential of Spark, Scala will be more helpful.
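The type-safety and testing rows of Table 2 come down to when type errors surface. This small standard-library-only sketch shows that Python only discovers a type mismatch at runtime, whereas the Scala compiler would reject the equivalent ill-typed call before the program ever runs.

```python
# Python is dynamically typed: this function loads fine for any arguments,
# and a type mismatch only surfaces when the call is actually executed.
def add(a, b):
    return a + b

print(add(2, 3))      # 5  - works for ints
print(add("2", "3"))  # '23' - silently means something different for strings

try:
    add(2, "3")       # mixing types fails, but only at runtime
except TypeError as exc:
    print(f"runtime failure: {exc}")

# The equivalent Scala, e.g. `def add(a: Int, b: Int): Int = a + b`,
# would make add(2, "3") a compile-time error instead.
```

This is why the table rates Python's testing burden as higher: checks that Scala's compiler performs for free must be covered by tests in Python.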
Performance is a very important factor: when Python and Scala are used with Spark, Scala can be up to 10 times faster than Python. Python is a dynamically typed language, so its speed is reduced. Scala runs on the Java Virtual Machine and is statically typed, which is the origin of its speed; a compiled language is faster than an interpreted one.

After exploring, the authors concluded that in Big Data Analytics both the Python and Scala programming languages are apt for Apache Spark technology. However, the language choice for programming in Apache Spark depends on the features that suit the needs of the project and can effectively solve the problem, as each language has its own benefits and drawbacks. If the programmer works on smaller projects with less experience, then Python is a good choice. If the programmer has large-scale projects that need many tools, techniques, and

multiprocessing, then the Scala programming language is the best alternative. The main objective of this paper is to make it easier for the programmer to select a programming language in Apache Spark according to their project and to achieve its desired objectives.

VI. CONCLUSION

This paper has explained big data analytics and Apache Spark from various aspects. The authors focused on the characteristics (7V's) of big data, tools and application areas for big data analytics, as well as the Apache Spark ecosystem, including components, libraries, and cluster managers, that is, the deployment modes of Apache Spark. Ultimately, they also presented a comparative study of the Python and Scala programming languages with various parameters in the context of Apache Spark, including a table comparing the two languages across those parameters. This comparative study concluded that both the Python and Scala programming languages are suitable for Apache Spark technology. However, the language choice depends on the features that best suit the needs of the project, as each language has its own advantages and disadvantages. The main purpose of this paper is to make it easier for the programmer to select a programming language in Apache Spark based on their project.

REFERENCES
[1] Omar, H. K., & Jumaa, A. K. (2019). Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java. Kurdistan Journal of Applied Research, 4(1), 7-14. https://doi.org/10.24017/science.2019.1.2
[2] Van-Dai Ta, Chuan-Ming Liu, & Goodwill Wandile Nkabinde (2016). Big Data Stream Computing in Healthcare Real-Time Analytics. IEEE, pp. 37-42. DOI: 10.1109/ICCCBDA.2016.7529531
[3] Keerti, Singh, K., & Dhawan, S. (2016). Future of Big Data Application & Apache Spark vs. Map Reduce. 1(6), 148-151.
[4] Ajaysinh, R. P., & Somani, H. (2016). A Survey on Machine Learning assisted Big Data Analysis for Health Care Domain. 4(4), 550-554.
[5] Salwan, P., & Maan, V. K. (2020). Integrating E-Governance with Big Data Analytics using Apache Spark. International Journal of Recent Technology and Engineering, 8(6), 1609-1615. https://doi.org/10.35940/ijrte.f7820.038620
[6] Bhattacharya, A., & Bhatnagar, S. (2016). Big Data and Apache Spark: A Review. International Journal of Engineering Research & Science (IJOER), 2(5), 206-210. https://ijoer.com/Paper-May-2016/IJOER-MAR-2016-9.pdf
[7] Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3-4), 145-164. https://doi.org/10.1007/s41060-016-0027-9
[8] Hongyong Yu & Deshuai Wang (2012). Research and Implementation of Massive Health Care Data Management and Analysis Based on Hadoop. IEEE, pp. 514-517. DOI: 10.1109/ICCIS.2012.225
[9] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., & Stoica, I. (2016). Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11), 56-65. https://doi.org/10.1145/2934664
[10] Shaikh, E., Mohiuddin, I., Alufaisan, Y., & Nahvi, I. (2019). Apache Spark: A Big Data Processing Engine. 2019 2nd IEEE Middle East and North Africa Communications Conference (MENACOMM). https://doi.org/10.1109/MENACOMM46666.2019.8988541
[11] Bokhari, M. U., Zeyauddin, M., & Siddiqui, M. A. (2016). An effective model for big data analytics. 3rd International Conference on Computing for Sustainable Global Development, pp. 3980-3982.
[12] Pol, U. R. (2016). Big Data Analysis: Comparison of Hadoop MapReduce and Apache Spark. International Journal of Engineering Science and Computing, 6(6), 6389-6391. https://doi.org/10.4010/2016.1535
[13] Kumar, S., & Singh, M. (2018). Big data analytics for the healthcare industry: impact, applications, and tools. Big Data Mining and Analytics, 2(1), 48-57. https://doi.org/10.26599/bdma.2018.9020031
[14] Aziz, K., Zaidouni, D., & Bellafkih, M. (2018). Real-time data analysis using Spark and Hadoop. 2018 4th International Conference on Optimization and Applications (ICOA). DOI: 10.1109/icoa.2018.8370593
[15] Amol Bansod (2015). Efficient Big Data Analysis with Apache Spark in HDFS. International Journal of Engineering and Advanced Technology, 6, 2249-8958.
[16] Shoro, A. G., & Soomro, T. R. (2015). Big data analysis: Apache Spark perspective. Global Journal of Computer Science and Technology.
[17] Gupta, Y. K., & Sharma, S. (2019). Impact of Big Data to Analyze Stock Exchange Data Using Apache PIG. International Journal of Innovative Technology and Exploring Engineering, ISSN: 2278-3075, 8(7), pp. 1428-1433.
[18] Gupta, Y. K., & Sharma, S. (2019). Empirical Aspect to Analyze Stock Exchange Banking Data Using Apache PIG in HDFS Environment. Proceedings of the Third International Conference on Intelligent Computing and Control Systems (ICICCS 2019).
[19] Gupta, Y. K., & Gunjan, B. (2019). Analysis of Crime Rates of Different States in India Using Apache Pig in HDFS Environment. Recent Patents on Engineering, Print ISSN: 1872-2121, Online ISSN: 2212-4047, 13:1. https://doi.org/10.2174/1872212113666190227162314
[20] Gupta, Y. K., & Choudhary, S. (2020). A Study of Big Data Analytics with Two Fatal Diseases Using Apache Spark Framework. International Journal of Advanced Science and Technology (IJAST), 29(5), pp. 2840-2851.
[21] Gupta, Y. K., Kamboj, S., & Kumar, A. (2020). Proportional Exploration of Stock Exchange Data Corresponding to Various Sectors Using Apache Pig. International Journal of Advanced Science and Technology (IJAST), 29(5), pp. 2858-2867.
[22] Gupta, Y. K., & Mittal, T. (2020). Empirical Aspects to Analyze Population of India using Apache Pig in Evolutionary of Big Data Environment. International Journal of Scientific & Technology Research (IJSTR), ISSN 2277-8616, 9(1), pp. 238-242.
[23] Gupta, Y. K., & Jha, C. K. (2016). A Review on the Study of Big Data with Comparison of Various Storage and Computing Tools and their Relative Capabilities. International Journal of Innovations in Engineering & Technology (IJIET), ISSN: 2319-1058, 7(1), pp. 470-477.
[24] Dhoka, Shrutika, & Kudale, R. A. (2016). Use of Big Data in Healthcare with Spark. Proceedings - International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), pp. 172-176. https://doi.org/10.1109/PAAP.2015.4
