A Study of Big Data Analytics Using Apache Spark With Python and Scala
Abstract— Data is generated every day via various sources such as Instagram, Facebook, Twitter, Google, etc., at a rate of 2.5 quintillion bytes, with high volume, high speed, and high variety. When this huge volume of high-velocity data is handled by traditional approaches, processing becomes inefficient and time-consuming. Apache Spark, an open-source in-memory cluster computing system, has been used for fast processing. This paper presents a brief study of Big Data Analytics and Apache Spark, covering the characteristics (7V's) of big data, tools and application areas for big data analytics, and the Apache Spark Ecosystem, including its components, libraries, and cluster managers, that is, the deployment modes of Apache Spark. Furthermore, it presents a comparative study of the Python and Scala programming languages across many parameters in the context of Apache Spark. This comparative study helps to identify which programming language, Python or Scala, is suitable for Apache Spark technology. As a result, both Python and Scala are suitable for Apache Spark; however, the language choice for programming in Apache Spark depends on the features that best suit the needs of the project, since each one has its own advantages and disadvantages. The main purpose of this paper is to make it easier for programmers to select a programming language for Apache Spark based on their project.

Keywords— Big Data, Apache Spark, Cluster Computing, Python, Scala.

I. INTRODUCTION

Big Data is a large set of data that can be in structured, semi-structured, or unstructured form and is gathered from a variety of data sources such as social media, cell phones, healthcare, e-commerce, etc. John Mashey coined the term Big Data in the 1990s, and it became popular in the 2000s. There are several tools and techniques for big data analytics, such as Apache Hadoop, MapReduce, Apache Spark, NoSQL databases, and Apache Hive, that are used to manage massive amounts of big data. Collecting and processing huge amounts of big data helps organizations understand their data better and find the information that matters most for future business decisions. There are three types of Big Data.

Structured Data:- Data that is already stored in databases in an ordered manner is called structured data. Nowadays, at least 20% of all data is structured, generated by sensors, weblogs, machines, humans, etc. Examples of structured data are DBMS tables, MySQL, spreadsheets, etc.

Semi-structured Data:- Datasets that mix structured and unstructured formats are called semi-structured data, which is why developers find them difficult to categorize. Semi-structured data can also be handled through the Hadoop system. Examples of semi-structured data are JSON documents, BibTeX files, CSV files, XML, etc.

Unstructured Data:- Unstructured data has no fixed format and cannot be stored in row-column form. It can be handled through the Hadoop system. At least 80% of the data in the world is unstructured. Examples of unstructured data are images, audio, video, text, PDFs, media posts, Word documents, log files, e-mail data, etc.

Hadoop is one of the most popular open-source, scalable, fault-tolerant platforms for large-scale distributed batch processing on clusters of commodity servers. It was developed to cope with the failures that commonly occur during execution in a distributed system. However, its performance compared with other technologies is not good, since data is accessed from disk for processing. Because Hadoop provides fault tolerance, organizations do not need expensive products for processing tasks on large datasets. There are two key Hadoop building blocks: the Hadoop Distributed File System (HDFS), which can accommodate large datasets, and a MapReduce engine that evaluates results in batches. MapReduce is a distributed programming model for processing massive datasets across a large cluster. It has two functions, Map and Reduce, which help to utilize the available resources for parallel processing of large data. It is used for batch processing over persistent storage; however, MapReduce was not designed for real-time processing.

Apache Spark is a powerful, flexible, and user-friendly open-source parallel processing platform that is very appropriate for storing and analyzing big data. It can run on vast cloud clusters as well as on a small cluster, and even locally on student computers with smaller datasets. Providers such as AWS and Google Cloud support it. With the RDD (Resilient Distributed Dataset), it can quickly perform processing tasks on very large datasets held in memory. The Apache Spark framework consists of several dominant components, namely Spark Core and the upper-level libraries Spark SQL, Spark Streaming, Spark MLlib, GraphX, and SparkR, which support a wide range of workloads including batch processing, machine learning, interactive queries, stream processing, etc. Apache Spark provides language-integrated APIs in SQL, Scala, Java, Python, and R. The major functions are accomplished by Spark Core, and the other components are tightly integrated with it, providing one unified environment. Spark is more efficient than other technologies, especially for iterative or interactive applications, and in this way Apache Spark is better than other technologies.

A. Characteristics of Big Data

Veracity:- When organizations want to extract meaningful insights from data, they need to clean it up to minimize noise. Big data can only benefit applications when the data is meaningful and reliable; therefore, data cleansing is necessary so that inaccurate and unreliable information can be filtered out. Example:- a dataset of high veracity would be one from a medical procedure or trial.

Validity:- The validity of data refers to the accuracy and correctness of the data used to obtain results in the form of information. It is very important for making decisions.

Volatility:- The volatility of big data concerns how long stored data remains useful for future use, since data that is valid right now might not be valid just a few minutes or a few days later.

Value:- The value of data is the most important element of the 7V's of big data. It is not just the amount of data that individuals store or process; in reality, it is the amount of precious, accurate, and trustworthy information that needs to be stored, processed, and analyzed to find insights.
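The Map and Reduce functions described in the Introduction can be sketched in plain Python. This is a conceptual illustration of the programming model only, not Hadoop's actual API: for a word count, the map phase emits (word, 1) pairs, a shuffle step groups the pairs by key, and the reduce phase sums each group.

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group the emitted counts by key (the word)
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    # Reduce: sum the counts for each word
    return {word: reduce(lambda a, b: a + b, c) for word, c in groups.items()}

lines = ["big data needs big tools", "spark processes big data"]
counts = reduce_phase(map_phase(lines))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real cluster, the map and reduce phases run in parallel on many machines, with the framework handling the shuffle between them; the per-record logic stays this simple.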
B. Challenges of Big Data Analytics

3) Privacy and Security: Concerns arise when data is shared and aggregated across dynamic or distributed computing systems. Organizations have been using diverse de-identification approaches to maintain privacy and security.

4) Human Collaboration: Despite the enormous advancements made in computational analysis, there are still many patterns that humans can easily identify but that computer algorithms have a difficult time identifying. A big data analysis framework must support input from diverse human experts and the sharing of results. These experts may be separated in space and time when it is too expensive to assemble an entire team in one room. This distributed expert input must be accepted by the data system, and their collaboration must be supported.

Other challenges include data replication, data locality, combining multiple datasets, data quality, fault tolerance of big data applications, data availability, data processing, and data management.

C. Tools of Big Data Analytics

1) Apache Hadoop:- Apache Hadoop is one of the most prominent big data frameworks and is written in Java. Hadoop was originally designed to continuously gather data from multiple sources, without worrying about the type of data, and store it across a distributed environment. However, it can only perform batch processing.

2) MapReduce:- MapReduce is a programming model that processes and analyzes huge datasets. Google introduced it in December 2004. It is used for batch processing over persistent storage; however, MapReduce was not built for real-time processing.

3) Apache Hive:- Apache Hive provides a SQL-like query language and was established by Facebook. Hive is a data warehousing component that performs reading, writing, and managing of large datasets in a distributed environment.

4) Apache Pig:- Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop; it was originally developed by Yahoo in 2006. Using Pig, all data manipulation operations in Hadoop can be performed.

5) Apache HBase:- Apache HBase is a distributed column-oriented database that runs on top of the HDFS file system. It is a NoSQL data store, similar to a database management system, but it provides quick random access to huge amounts of structured data.

6) Apache Storm:- Apache Storm is an open-source distributed real-time computation system. It is used wherever a lot of streaming data is generated; Twitter uses it for real-time data analysis.

7) Apache Cassandra:- Apache Cassandra is a free, open-source NoSQL database created by Facebook. It is very popular and very robust for handling huge amounts of data.

8) Apache Spark:- Apache Spark is one of the most prominent and highly valued big data frameworks. It was developed at the University of California, Berkeley, and is written in Scala. Apache Spark is fast because of its in-memory processing. It performs real-time data processing as well as batch processing on huge amounts of data; it requires a lot of memory, but it can work with a standard speed and amount of disk.

D. Application Areas of Big Data Analytics

Healthcare:- Big data analytics is used with a patient's medical history data to determine how likely they are to have health problems. Furthermore, big data analytics is used in healthcare to minimize costs, predict epidemics, and prevent preventable diseases. The Electronic Health Record is one of the most popular applications of big data in the healthcare industry.

Banking:- Banks use big data analytics to identify fraudulent activities in transactions. Because the analytics system stops fraud before it occurs, the bank improves its profitability.

Media and Entertainment:- The entertainment and media industries use big data analytics to understand what content, products, and services people want.

Telecom:- Telecom is one of the most significant contributors to big data. Telecom companies use analytics to improve services and route traffic more efficiently. Furthermore, the analytics system is used to examine call detail records and detect fraudulent behavior, which helps them take action immediately.

Government:- The Indian government has used big data analytics to help law enforcement and to estimate trade in the country. Thanks to big data analytics, governmental procedures gain competencies in terms of expenditure, productivity, and innovation.

Education:- Nowadays, the education sector is gradually adopting big data analytics. As a result, big-data-powered technologies have improved learning tools; analytics is also used to enhance existing courses and develop new ones according to market requirements.

Retail:- Retail uses big data analytics to optimize business, both in e-commerce and in stores; examples include Amazon, Flipkart, Walmart, etc.

E. Overview of Apache Spark Technology

Apache Spark is an open-source, distributed, in-memory cluster computing framework designed to provide faster and easier-to-use analytics than Hadoop MapReduce. In 2009, the AMPLab at UC Berkeley designed Apache Spark; it was first released as open source in March 2010 and donated to the Apache Software Foundation in June 2013. This open-source framework stands out for its ability to process large volumes of data. Spark can be up to 100 times faster than Hadoop's MapReduce, since no time is spent transferring data in and out of disk: all of this processing is done in memory. It supports stream processing, known as real-time processing, which involves continuous input and output of data, and it is suitable both for simple operations and for massive data processing on large clusters. Organizations in many sectors, such as healthcare, banking, telecom, and e-commerce, as well as major technology companies such as Apple, Facebook, IBM, and Microsoft, use Apache Spark to improve their business insights. These companies collect terabytes of data from various sources and process it to enhance consumer services.

The Apache Spark Ecosystem comprises various components, including Spark Core and upper-level libraries such as Spark SQL, Spark Streaming, Spark MLlib, GraphX, and SparkR, which are built on top of Spark Core. The cluster managers, viz. Standalone, Hadoop YARN, Mesos, and Kubernetes, are operated through Spark Core.

c) Spark Streaming:- It enables both batch and streaming processing of data in the application.

d) Spark MLlib:- This package of Apache Spark accommodates multiple types of machine learning algorithms (classification, regression, and clustering) on top of Spark. It processes large datasets to recognize patterns and make decisions; machine learning algorithms can run for many iterations toward the desired objective in a scalable manner.

e) GraphX:- Apache Spark ships with a module that allows distributed computation on graph data structures. A graph data structure models a network of relationships, as in online social networks. GraphX builds on the Pregel model revealed by Google in 2010.

f) SparkR:- This module of Apache Spark provides a lightweight frontend for using Spark from R. SparkR allows cluster-based processing of structured data and machine learning tasks, which matters because the R programming language was not designed to manage large datasets that do not fit on a single machine.
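The RDD abstraction at the heart of this ecosystem keeps a dataset in memory and builds results through chained, lazy transformations. A toy pure-Python analogue can illustrate the idea; note that `MiniRDD` is invented here for illustration and is not the real Spark API.

```python
class MiniRDD:
    """Toy analogue of a Spark RDD: lazy transformations plus an
    optional in-memory cache (illustration only, not the Spark API)."""

    def __init__(self, source_fn):
        self._source_fn = source_fn  # recomputes the data, like re-reading from disk
        self._cache = None

    def map(self, f):
        # Lazy: nothing is computed until collect() is called
        return MiniRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, p):
        return MiniRDD(lambda: [x for x in self._materialize() if p(x)])

    def cache(self):
        # Materialize once and keep the result in memory, like RDD persistence
        self._cache = list(self._source_fn())
        return self

    def _materialize(self):
        return self._cache if self._cache is not None else self._source_fn()

    def collect(self):
        return self._materialize()

numbers = MiniRDD(lambda: range(10)).cache()
evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens.collect())  # [0, 4, 16, 36, 64]
```

Because `cache()` materializes the source once in memory, later transformation chains avoid recomputing it, which is the essence of why Spark outperforms disk-based MapReduce on iterative workloads.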
3. Apache Mesos:- Mesos is a cluster manager that can scale to thousands of nodes. It is a master-slave-based system and is fault-tolerant. For a cluster of machines, it can be seen as an operating system kernel: it pools the computing resources of a cluster of machines and allows those resources to be shared among different applications. Mesos was developed to support a diversity of distributed computing applications that can share both static and dynamic cluster resources. Organizations such as Twitter, Xogito, and Media Crossing use Apache Mesos, and it can be run on Linux or Mac operating systems.

4. Kubernetes:- Spark can run on clusters that are organized through Kubernetes, an open-source container management platform that has been integrated with Spark. It came out of Google in 2014. Kubernetes brings advantages such as feasibility and stability, so a full Spark cluster can run on it. Kubernetes is a portable and cost-effective platform with self-healing abilities, developed for managing complex distributed systems without invalidating the containers that power them.

II. LITERATURE REVIEW

Table 1. Literature Survey

[1] Omar, H. K., & Jumaa, A. K. (2019)
    Techniques: Apache Spark, MLlib library with Python and Scala
    Observation: The authors presented a comparison between Java and Scala for evaluating time performance in Apache Spark MLlib. They also explain the tools, APIs, programming languages, and Spark machine learning libraries in Apache Spark, and discuss the advantages of loading and accessing data from stored sources like Hadoop HDFS, Cassandra, HBase, etc. The authors concluded that the performance of Scala is much better than that of Java.

[2] Van-Dai Ta et al. (2016)
    Techniques: Apache Spark, Streaming API, machine learning and data mining techniques
    Observation: In this paper, a general architecture was created using the Spark streaming method that can be implemented in healthcare systems for big data analytics. The authors also explain how efficiency can be enhanced through machine learning and data mining techniques.

[3] Keerti, Singh, K., & Dhawan, S. (2016)
    Techniques: Hadoop MapReduce, Apache Spark
    Observation: The researchers focus on a big data application model that can be used in real-time systems, the social network area, and healthcare systems. The paper gives an introduction to MapReduce, Hadoop, and Spark, and compares Spark with MapReduce. Their three-layered big data model performs operations such as calculating average speed rate, code necessity, etc., with Spark; the application was built with data processing techniques and applied to the healthcare system.

[4] Ajaysinh, R. P., & Somani, H. (2016)
    Techniques: Apache Spark and machine learning algorithms
    Observation: The authors implemented a healthcare model using different analysis and prediction techniques with machine learning algorithms for better predictions.

[5] Salwan, P., & Maan, V. K. (2020)
    Techniques: Apache Spark
    Observation: The work focuses on an e-governance system built using Apache Spark for analyzing government-collected data. The authors give a brief explanation of the architecture of Apache Spark, including the core layer, ecosystem layer, resource management, and the methods used in Spark. Government departments generate data of such high volume that it cannot be managed by a traditional database management system, so a more efficient system was built using big data analytics techniques, resolving the main issues of traditional database management systems such as speed, mixed-type datasets, accuracy, etc.

[6] Bhattacharya, A., & Bhatnagar, S. (2016)
    Techniques: Hadoop MapReduce and Spark tools
    Observation: The authors explained the concept of big data and Apache Spark. They first introduce big data and its very important V's. Moreover, big data analytics, security issues of big data analytics, and a variety of tools available in the market, like Hadoop MapReduce and Apache Spark, are explained. Furthermore, they presented a comparison between Hadoop's MapReduce and Apache Spark on features such as memory use and competing products.

[7] Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016)
    Techniques: Apache Spark
    Observation: This paper focuses on the basic components and unique features of Apache Spark for big data analytics. With the help of Apache Spark, ML pipeline APIs and distinct utilities are produced for design and implementation. The authors illustrate the growing popularity of Apache Spark technology in the big data analytics research field.

[12] Pol, U. R. (2016)
    Techniques: Apache Spark, Hadoop MapReduce
    Observation: The author gives a brief explanation of big data and its analytics using Apache Spark, and explains how Apache Spark overcomes Hadoop, being a good framework for data processing and open-source distributed computing with reliability and scalability in big data analytics.

[13] Sunil Kumar & Maninder Singh (2019)
    Techniques: Apache Hadoop
    Observation: They discuss the impact of big data on the healthcare system and how to manage the different tools available in the Hadoop ecosystem. Moreover, they explore a conceptual architecture of big data analytics for healthcare systems.

[14] Aziz et al. (2018)
    Techniques: Apache Spark, MapReduce
    Observation: The authors describe how to process real-time data using Apache Spark and Hadoop tools in big data analytics, and compare Apache Spark and Hadoop for fast computing.

In this paper, big data analytics and Apache Spark are explained from various aspects. The authors focus on the characteristics (7V's) of big data, tools and application areas for big data analytics, and the Apache Spark Ecosystem, including components, libraries, and cluster managers, that is, the deployment modes of Apache Spark. Ultimately, they also present a comparative study of the Python and Scala programming languages across various parameters in the context of Apache Spark.

III. RESEARCH GAP

After reading these research papers, it is clear that vast amounts of big data can be processed using the Python and Scala programming languages on Apache Spark. The present comparison between the two languages for fast data processing in Apache Spark identifies which programming language is best suited for Apache Spark and can give better results in big data analytics.
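Such performance comparisons ultimately reduce to timing the same workload under each language's API. A minimal pure-Python timing harness of the kind used for this sort of measurement is sketched below; it has no Spark dependency, and `word_count` is an invented stand-in for the benchmarked job.

```python
import time

def timed(fn, *args):
    # Run fn(*args) once and return (result, elapsed seconds)
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def word_count(lines):
    # Stand-in workload: count word occurrences across all lines
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

data = ["spark scala python"] * 10000
counts, elapsed = timed(word_count, data)
print(counts["spark"], f"{elapsed:.4f}s")  # counts["spark"] == 10000
```

A fair language comparison would run the equivalent Spark job from both PySpark and Scala on identical data and cluster settings, repeat the runs, and report the distribution of elapsed times rather than a single measurement.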
IV. COMPARATIVE STUDY OF PYTHON AND SCALA IN SPARK

Python is an object-oriented, high-level, functional, interpreted programming language. PySpark is the Python API for Apache Spark, which works with RDDs and exposes their many operations in Python. Python is a very powerful and widely preferred programming language because of the high availability of its libraries.

Scala is an object-oriented and functional programming language that runs on the JVM (Java Virtual Machine) and helps developers be more inventive. It is a great programming language that helps to write valid code with fewer errors and to develop big data applications.

Table 2. Comparison between Python and Scala

Sr. No  Parameter                   Python                           Scala
01      Performance                 Slower                           Faster
02      Learning Curve              Easy                             Tough
03      Machine Learning Libraries  Rich library support             Less library support
04      Platform                    Interpreted                      Compiled
05      Visualization Libraries     Rich library support             Less library support
06      Type Safety                 Dynamically typed                Statically typed
07      Testing                     Very complex                     Less complex
08      Simplicity                  Easy to learn; writing           May be more difficult
                                    code is simple                   to learn than Python
09      Ease of Use (Verbosity)     Less verbose                     More verbose
10      Concurrency                 Limited support                  Highly supported
11      IDE                         PyCharm, Jupyter                 Eclipse, IntelliJ
12      Spark Shell                 pyspark                          spark-shell
13      Community Support           Much larger community            Less community support

Performance is a very important factor: when Python and Scala are used with Spark, Scala can perform up to 10 times faster than Python. Python is a dynamically typed language, which reduces its speed, whereas Scala runs on the Java Virtual Machine and is statically typed, which is the source of its speed; a compiled language is faster than an interpreted one.

The learning curve of Scala is a bit tough compared to Python, which has a simple syntax; Scala has a more cryptic syntax, and many operations can be combined into a single statement.

The availability of libraries is much richer in Python than in Scala, and Scala's libraries are less convenient for data scientists, so Python is the preferable language in this respect. In Python, the testing process and its methodologies are more complex because the language is dynamically typed, while testing Scala code is less complex. Python is less verbose because it is dynamically typed; Scala, being statically typed, can identify errors at compilation time, so it is a better alternative for large-scale projects.

Scala supports multithreading, so it can handle parallelism and concurrency, while Python's thread-based parallelism is limited (in CPython, by the global interpreter lock). The Python community keeps organizing conferences and meetups and works on the code to develop the language; Python has much larger community support than Scala.

V. DISCUSSION

The discussion concerns which programming languages are appropriate for the big data field. Many programming languages are used to solve big data problems, and for big data professionals, choosing a language is a most important decision. The chosen programming language must be suitable for performing analysis and manipulation on big data problems so that the desired output can be achieved. Omar, H. K., & Jumaa, A. K. (2019) [1] found that the performance of the Scala programming language is better than Java's in Spark MLlib.

The authors have presented a comparative study of the Python and Scala programming languages across several parameters for Apache Spark. Scala is faster but a little difficult to learn, while Python is slower but very simple to use. Scala does not provide as many big data analytics tools and libraries as Python for machine learning and natural language processing, and there are no good visualization tools for Scala. Python's support for Spark Streaming is not as advanced and mature as Scala's; consequently, Scala is the best option for Spark Streaming functionality. The Apache Spark framework is written in Scala, so with Scala, big data developers can easily dig into the source code. Scala is more engineering-oriented and Python more analytics-oriented, but both languages are excellent for developing big data applications; to exploit the full potential of Spark, Scala is more helpful.

After this exploration, the authors conclude that in big data analytics, both the Python and Scala programming languages are apt for Apache Spark technology. However, the language choice for programming in Apache Spark depends on the features that suit the needs of the project and can effectively solve the problem, as each language has its own benefits and drawbacks. If the programmer works on smaller projects or with less experience, then Python is a good choice; if the programmer has large-scale projects that need many tools, techniques, and multiprocessing, then Scala is the best alternative. The main objective of this paper, making it easier for programmers to select a programming language for Apache Spark according to their project, has thus been achieved.
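The type-safety contrast in Table 2 can be made concrete: in a dynamically typed language like Python, a type mismatch that a statically typed language such as Scala would reject at compile time only surfaces at runtime, when the offending line actually executes. In this small self-contained sketch, `add_totals` and the record lists are invented for illustration.

```python
def add_totals(records):
    # Any type error here is detected only at runtime, when this line runs
    return sum(r["total"] for r in records)

good = [{"total": 10}, {"total": 32}]
print(add_totals(good))  # 42

bad = [{"total": 10}, {"total": "32"}]  # a string slips in: no error until called
try:
    add_totals(bad)
except TypeError as e:
    print("runtime TypeError:", e)
```

In Scala, an analogous `List[Int]` with a stray `String` element would fail to compile, which is why static typing is cited above as an advantage for large-scale projects, while Python defers such checks to tests and runtime.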
VI. CONCLUSION

This paper has explained big data analytics and Apache Spark from various aspects. The authors focused on the characteristics (7V's) of big data, tools and application areas for big data analytics, and the Apache Spark Ecosystem, including components, libraries, and cluster managers, that is, the deployment modes of Apache Spark. Ultimately, they also presented a comparative study of the Python and Scala programming languages, including a table comparing the two languages across various parameters in the context of Apache Spark. This comparative study concluded that both the Python and Scala programming languages are suitable for Apache Spark technology; however, the language choice depends on the features that best suit the needs of the project, as each language has its own advantages and disadvantages. The main purpose of this paper is to make it easier for programmers to select a programming language for Apache Spark based on their project.

REFERENCES

[1] Omar, H. K., & Jumaa, A. K. (2019). Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java. Kurdistan Journal of Applied Research, 4(1), 7–14. https://doi.org/10.24017/science.2019.1.2
[2] Van-Dai Ta, Chuan-Ming Liu, & Goodwill Wandile Nkabinde (2016). Big Data Stream Computing in Healthcare Real-Time Analytics. IEEE, pp. 37–42. https://doi.org/10.1109/ICCCBDA.2016.7529531
[3] Keerti, Singh, K., & Dhawan, S. (2016). Future of Big Data Application & Apache Spark vs. Map Reduce. 1(6), 148–151.
[4] Ajaysinh, R. P., & Somani, H. (2016). A Survey on Machine learning assisted Big Data Analysis for Health Care Domain. 4(4), 550–554.
[5] Salwan, P., & Maan, V. K. (2020). Integrating E-Governance with Big Data Analytics using Apache Spark. International Journal of Recent Technology and Engineering, 8(6), 1609–1615. https://doi.org/10.35940/ijrte.f7820.038620
[6] Bhattacharya, A., & Bhatnagar, S. (2016). Big Data and Apache Spark: A Review. International Journal of Engineering Research & Science (IJOER), 2(5), 206–210. https://ijoer.com/Paper-May-2016/IJOER-MAR-2016-9.pdf
[7] Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3–4), 145–164. https://doi.org/10.1007/s41060-016-0027-9
[8] Hongyong Yu & Deshuai Wang (2012). Research and Implementation of Massive Health Care Data Management and Analysis Based on Hadoop. IEEE, pp. 514–517. https://doi.org/10.1109/ICCIS.2012.225
[9] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., & Stoica, I. (2016). Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11), 56–65. https://doi.org/10.1145/2934664
[10] Shaikh, E., Mohiuddin, I., Alufaisan, Y., & Nahvi, I. (2019). Apache Spark: A Big Data Processing Engine. 2019 2nd IEEE Middle East and North Africa Communications Conference (MENACOMM). https://doi.org/10.1109/MENACOMM46666.2019.8988541
[11] Bokhari, M. U., Zeyauddin, M., & Siddiqui, M. A. (2016). An effective model for big data analytics. 3rd International Conference on Computing for Sustainable Global Development, pp. 3980–3982.
[12] Pol, U. R. (2016). Big Data Analysis: Comparison of Hadoop MapReduce and Apache Spark. International Journal of Engineering Science and Computing, 6(6), 6389–6391. https://doi.org/10.4010/2016.1535
[13] Kumar, S., & Singh, M. (2018). Big data analytics for healthcare industry: impact, applications, and tools. Big Data Mining and Analytics, 2(1), 48–57. https://doi.org/10.26599/bdma.2018.9020031
[14] Aziz, K., Zaidouni, D., & Bellafkih, M. (2018). Real-time data analysis using Spark and Hadoop. 2018 4th International Conference on Optimization and Applications (ICOA). https://doi.org/10.1109/icoa.2018.8370593
[15] Amol Bansod (2015). Efficient Big Data Analysis with Apache Spark in HDFS. International Journal of Engineering and Advanced Technology, 6, 2249–8958.
[16] Shoro, A. G., & Soomro, T. R. (2015). Big data analysis: Apache Spark perspective. Global Journal of Computer Science and Technology.
[17] Gupta, Y. K., & Sharma, S. (2019). Impact of Big Data to Analyze Stock Exchange Data Using Apache PIG. International Journal of Innovative Technology and Exploring Engineering, ISSN: 2278-3075, 8(7), pp. 1428–1433.
[18] Gupta, Y. K., & Sharma, S. (2019). Empirical Aspect to Analyze Stock Exchange Banking Data Using Apache PIG in HDFS Environment. Proceedings of the Third International Conference on Intelligent Computing and Control Systems (ICICCS 2019).
[19] Gupta, Y. K., & Gunjan, B. (2019). Analysis of Crime Rates of Different States in India Using Apache Pig in HDFS Environment. Recent Patents on Engineering, Print ISSN: 1872-2121, Online ISSN: 2212-4047, 13:1. https://doi.org/10.2174/1872212113666190227162314. Site: http://www.eurekaselect.com/node/170260/article
[20] Gupta, Y. K., & Choudhary, S. (2020). A Study of Big Data Analytics with Two Fatal Diseases Using Apache Spark Framework. International Journal of Advanced Science and Technology (IJAST), 29(5), pp. 2840–2851.
[21] Gupta, Y. K., Kamboj, S., & Kumar, A. (2020). Proportional Exploration of Stock Exchange Data Corresponding to Various Sectors Using Apache Pig. International Journal of Advanced Science and Technology (IJAST), 29(5), pp. 2858–2867.
[22] Gupta, Y. K., & Mittal, T. (2020). Empirical Aspects to Analyze Population of India using Apache Pig in Evolutionary of Big Data Environment. International Journal of Scientific & Technology Research (IJSTR), ISSN 2277-8616, 9(1), pp. 238–242.
[23] Gupta, Y. K., & Jha, C. K. (2016). A Review on the Study of Big Data with Comparison of Various Storage and Computing Tools and their Relative Capabilities. International Journal of Innovation in Engineering & Technology (IJIET), ISSN: 2319-1058, 7(1), pp. 470–477.
[24] Dhoka, Shrutika, & Kudale, R. A. (2016). Use of Big Data in Healthcare with Spark. Proceedings - International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), 2016, 172–176. https://doi.org/10.1109/PAAP.2015.4