A Review of Big Data and Its
Applications in Healthcare and Public
Sector
Abstract Big Data has been a buzzword in the IT sector for a few years now. It has
attracted attention from researchers, industry and academia around the world. This
chapter introduces Big Data and its related technologies and traces the associated
challenges. We discuss the applications of Big Data technologies in the fields of
healthcare and the public sector. Over the preceding few years, computing power
has increased substantially while storage costs have fallen significantly, enabling
businesses to produce and store huge volumes of data. The increasing penetration of
hand-held, internet-enabled devices has also led to an explosion in data generation;
social media exemplifies this phenomenon. Such huge volumes of data cannot be
handled using existing frameworks and require new and innovative techniques. In
this chapter, we briefly discuss the use of Big Data in healthcare and its potential
use cases, such as preventive healthcare planning and predictive analytics. We also
discuss the potential use of Big Data in the public sector and its applications, such
as urban management and inclusive decision making. We further highlight the
challenges that hinder the adoption of Big Data technologies in these areas.
1 Introduction
Everyone is talking about Big Data these days. Leading organisations have started to
recognise it as a strategic asset [1]. Let us start with a commonly agreed definition:
Big Data is any data so large and complex that it becomes difficult to process with
traditional storage and processing paradigms, loosely approximated by practitioners
as data-sets from around 30–50 terabytes up to petabytes [2].
A. Shastri (B)
Lovely Professional University, Phagwara 144411, Punjab, India
e-mail: [email protected]
Symbiosis Institute of Technology, Symbiosis International University, Pune 412115, India
M. Deshpande
School of Information Studies, Syracuse University, Syracuse, NY 13244-1190, USA
e-mail: [email protected]
© Springer Nature Switzerland AG 2020
A. J. Kulkarni et al. (eds.), Big Data Analytics in Healthcare,
Studies in Big Data 66, https://doi.org/10.1007/978-3-030-31672-3_4
For example, the Large Hadron Collider’s 150 million sensors generate a data flow
of about 15 petabytes (about 15,000,000 GB) per year [3]. Traditional tools and
techniques are therefore unable to store, process and visualize such data within a
stipulated amount of time and extract competitive insights. Big Data applications
can be seen everywhere, from the scientific community, marketing, banking and
telecom to healthcare, public services and beyond. Big Data has allowed organisations
to make informed decisions based on insights derived from transactional data created
at various points. Big Data has been described as a set of 3 V’s, 4 V’s and even 5 V’s
by various researchers, as shown in Fig. 1.
(i) Volume
It refers to the enormous scale of data. The amount of data being generated
has been increasing exponentially over the past few years and is expected to
continue to do so, partly due to falling storage costs. By 2020, there will be
around 6.1 billion smartphones and our accumulated digital universe will be
around 44 trillion gigabytes [5]. Google processes around 40,000 search queries
every single second [5]. The mammoth scale of data being generated requires
innovative data infrastructure, data management and processing techniques.
(ii) Velocity
Velocity measures the rate at which data flows. Facebook handles around 900
million photographs every day [6]; it must absorb, process and later be able to
retrieve them. A data management infrastructure that can keep up with such
high-speed data flows is a vital part of the Big Data paradigm. Time-sensitive
processes such as banking transactions or social media streams are examples
where data is generated, processed and stored in a matter of seconds.
(iii) Variety
Variety refers to the different forms in which data exists, ranging from tradi-
tional structured enterprise data to semi-structured and unstructured data such
as images, text, audio and video. A Big Data paradigm must deal with endless
heterogeneous data types and sources.
(iv) Veracity
Veracity concerns the quality of Big Data: data may be inconsistent, incom-
plete, deceptive or ambiguous. It is also concerned with the reliability and
authenticity of the data used for analyses.
(v) Value
Value is the intrinsic worth that Big Data holds relative to its size. Analyzing
large volumes of imprecise data yields low value, while analyzing large
volumes of precise data yields high value.
“Data, I look at it as the new oil. It’s going to change most industries across the board,”
said Intel CEO Brian Krzanich [7]. Plenty of tools and technologies for Big Data
processing and storage are available today. Early developments like the Google File
System, which allowed large-scale distributed data-intensive applications to run on
inexpensive commodity hardware using a fault-tolerant mechanism, paved the way
for further developments in distributed computing [8]. Later, Google developed
MapReduce, a programming model (commonly implemented in Java) for writing
applications that process huge amounts of data in parallel on clusters of commodity
hardware. Hadoop and Spark are the latest buzzwords in the Big Data universe
these days. The Apache Hadoop project develops open-source software for reliable,
scalable, distributed computing in a fault-tolerant manner. The Hadoop framework
is a collection of software that enables distributed processing of large data sets
across clusters of computers using simple programming models; it includes the
MapReduce model as one of its modules. Several related projects that run on top
of the Hadoop architecture have been developed and are available for use. Spark is
yet another distributed processing framework that shares similarities with Hadoop
but is generally considered much faster and more efficient, especially when dealing
with queries that are iterative in nature. A few of these technologies will be further
elaborated later in the chapter.
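The MapReduce model described above can be sketched in plain Python. This is a
minimal single-process illustration of the map, shuffle and reduce phases applied to
the classic word-count problem, not the Hadoop API itself; in a real cluster each
phase runs distributed across many nodes, and all function and variable names here
are illustrative only.

```python
# A minimal single-process sketch of the MapReduce word-count pattern.
# In a real Hadoop job, the map and reduce phases run on separate cluster
# nodes and the framework performs the shuffle between them.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs big infrastructure", "data drives decisions"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'needs': 1, ...}
```

Because each map call and each reduce call depends only on its own input, the
framework can parallelise both phases freely across commodity machines, which is
the property that makes the model scale.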