Big Data Systems
The right of Jawwad Ahmed Shamsi and Muhammad Ali Khojaye to be identified as authors of this work has been
asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright
holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
Preface
In simple terms, a system that handles big data is considered a big data system. This handling could include different services such as computation, storage, and networking. The term big data refers to data of enormous size, which can challenge conventional systems for storage and computation. While there is no standard threshold that classifies data as big data, systems that can handle a few terabytes of data or more are normally considered big data systems.
Large amounts of data are useful in many ways. Organizations can gain deep insights into many aspects of their operations, and end users can receive enhanced services. For instance, a large search engine can offer customized search results to its users. Similarly, a government can enhance its security system through smart surveillance systems based on big data. Both of these examples reflect the emerging potential of big data. As the benefits of big data continue to emerge, it is necessary to understand and study the different systems and platforms which can facilitate big data.
Objectives
The purpose of this book is to study different concepts related to big data. It has the following major objectives:
• To elucidate different platforms for processing big data.
• To highlight the role of cloud computing in computing and storing big data.
• To explain security and privacy issues in big data.
• To provide an overview of different networking technologies in big data systems.
• To describe the pros and cons of different computational platforms for big data.
• To elaborate on different case studies of big data.
• To explain programming and algorithmic techniques available for big data processing.
Organization
The book is organized into five sections and 14 chapters:
Section I: The first section covers introductory concepts. It consists of three chapters. Chapter 1 covers fundamental concepts of big data, including the difference between analytical and transactional systems and the requirements and challenges of big data. Chapter 2 elaborates on architectural and organizational concepts related to big data. It explains the difference between the lambda and kappa architectures, and also highlights cluster computing and different organization schemes for clusters. The last chapter in this section, chapter 3, is focused on cloud computing and virtualization. The cloud provides an important platform for big data, and the chapter is focused on elaborating this role.
Section II: The second section is focused on elaborating efficient platforms for storing and processing big data. Chapter 4 explains various topics related to Hadoop and MapReduce. This chapter also covers programming examples using MapReduce. In addition, two important components of Hadoop, i.e., HDFS and HBase, are explained. Chapter 5 explains
a few important limitations of Hadoop v1, and describes how Hadoop v2 addresses these
limitations. This chapter also covers YARN, Pig, Hive, Dremel, Impala, Drill, Sqoop, and
Ambari. Chapter 6 explains Spark and its different components. It elaborates on how Spark
is useful in solving a few major big data problems. The chapter also includes programming
examples. Chapter 7 describes NoSQL systems, including column-based, key-value, document-oriented, and graph databases. The chapter covers illustrative examples for enhanced explanation. Chapter 8 is focused on introducing the topic of NewSQL systems.
The chapter has introductory coverage of four different NewSQL systems. These include
VoltDB, NuoDB, Spanner, and HRDBMS.
Section III: Section III elaborates on networking, security, and privacy for big data
systems. There are several illustrative examples in this part. Chapter 9 explains various
topics related to networking for big data. This chapter highlights different requirements
for efficient networks for big data systems and provides an overview of existing networking
solutions. Chapter 10 explains security requirements and solutions for big data. It includes
various topics such as requirements, attack types and mechanisms, attack detection, and
prevention. The last chapter in this section, chapter 11, provides an overview of privacy concerns in big data and explains existing solutions for ensuring privacy.
Section IV: The fourth section of this book covers computation for big data. It contains chapters 12 and 13. Chapter 12 explains HPC solutions which are being utilized by big data systems, describing the functionality of GPUs, TPUs, and supercomputing. Chapter 13 introduces the topic of deep learning and covers various deep learning solutions, including feed-forward networks, RNNs, and CNNs. The chapter explains various examples of big data applications which can leverage deep learning solutions.
Section V: Section V contains chapter 14. This chapter covers various case studies on big data, including organizational solutions at Facebook, Uber, LinkedIn, Microsoft, and Google. The chapter also highlights a few open issues in big data systems.
Target Audience
The book adopts an example-centric approach, in which various concepts are explained using illustrative examples and code. The book can be used either as a textbook or as a reference book. It can be adopted for various courses such as cloud computing and big data systems. Moreover, individual chapters of the book can be used as references for different courses related to networking, security, privacy, high-performance computing, and deep learning.
Contact Us
If you have any questions or comments about this book, please send email to bigdataque-
[email protected]
We have a website for the book, where we list examples, errata, and any additional information. For access, please visit our web page at https://sites.google.com/view/bigdatasystems.
Jawwad Ahmed Shamsi and Muhammad Ali Khojaye
Author Bios
Jawwad A. Shamsi completed his B.E. (Electrical Engineering) from NED University of Engineering and Technology, Karachi, in 1998. He completed his MS in Computer and Information Sciences from the University of Michigan-Dearborn, MI, USA, in 2002. In 2009, he completed his PhD from Wayne State University, MI, USA. He also worked as a Programmer Analyst in the USA from 2000 to 2002. In 2009, he joined FAST-National University of Computer and Emerging Sciences (NUCES), Karachi. He served as the head of the computer science department from 2012 to 2017. Currently, he is serving as a Professor of Computer Science and Director of the Karachi Campus. He also leads a research group, syslab (http://syslab.khi.nu.edu.pk). His research is focused on developing systems which can meet the growing needs of scalability, security, high performance, robustness, and agility. His research has been funded by different international and national agencies, including NVIDIA and the Higher Education Commission, Pakistan.
Muhammad Ali Khojaye has more than a decade of industrial experience ranging from the cloud-native side of things to distributed systems design, CI/CD, and infrastructure. His current technical interests revolve around big data, cloud, containers, and large-scale systems design. Born in the mountain village of Chitral, Pakistan, Ali currently lives in Glasgow with his wife and son. When he is not at work, Ali enjoys cycling, travelling, and spending time with family and friends.
Acknowledgments
The authors would like to acknowledge the contributions from Muhammad Nouman Dur-
rani, Ali Akber, and Bara Raza Mangnani from National University of Computer and
Emerging Sciences. The authors are also thankful to Abid Hasnain from Visionet Systems
and Narmeen Bawany from Jinnah University for Women.
List of Examples
3.1 VM Advantages
3.2 Types of Virtualization
3.3 Containers vs VM
3.4 Configure AWS using CLI
3.5 Start and Stop an EC2 Instance
3.6 Describe an EC2 Instance
3.7 Reboot an EC2 Instance
3.8 List All S3 Buckets
3.9 Create an S3 Bucket
3.10 Upload a File to an S3 Bucket
3.11 List All the Container Images
3.12 Build a Container Image
3.13 Running a Container Application
3.14 Quitting a Container
3.15 Running a Container in Background
3.16 Attaching to a Container
3.17 Share your Container Image
3.18 Delete All Images or Containers
3.19 Delete a Container Image
Section I
Introduction
CHAPTER 1
Introduction to Big Data Systems
BIG DATA has been increasingly used in our daily lives. From social networks to mobile applications and internet search, a huge amount of data is being generated, collected, and processed. The purpose of this chapter is to explain the fundamental concepts related to big data that are useful in understanding the functionality and execution of big data systems.
Big data is commonly characterized through the following five V's:
1. Volume: Big data refers to a massive volume of data, such that the amount of data challenges storage and processing requirements. While there is no specific threshold on the volume of data, it normally varies from terabytes (10^12 bytes) to exabytes (10^18 bytes) and beyond.
2. Velocity: Data is being generated at a very fast pace. The high rate of data generation signifies the importance of data. The high velocity of data can be assessed by the fact that a large proportion of the data being used belongs to the recent past.
3. Variety: Data under consideration could be obtained from numerous sources such as web logs, Internet of Things (IoT) devices, URLs, user tweets, and search patterns. Similarly, data could have different formats such as Comma Separated Values (CSV), tables, text documents, and graphs. Further, it could be structured, semi-structured, or unstructured.
4. Veracity: Data may vary in terms of veracity; i.e., data under consideration may be
inconsistent or it may be highly consistent across all replicas; it may be useless or it
may be of high value. Veracity refers to the trustworthiness, accuracy, or authenticity
of data.
5. Value: Data must be of high value; i.e., stale data has limited value.
In 2013, a similar prediction proved erroneous. It was observed that the Google Flu Trends (GFT) prediction was overstated by a factor of more than two. The problem was that Google's algorithm was simply considering the search terms entered on the Google Search Engine. It was assumed that all the related searches made on Google were related to the spread of flu. The Google team was unable to find the correlation between search terms and flu.
In a similar context, a data sampling error could be induced during an election campaign. For instance, election-related tweets may favor a specific candidate. However, it may be the case that voters of that candidate are more active on social media than the voters of other candidates. Similarly, the sample size in any big data problem could have its own biases.
Determining the correct size of data for a given big data problem is not trivial. In
addition, collecting or gathering the complete data is also an issue. Many experts believe
that in the case of big data, N=ALL is a good reference point for data analysis [191]. That
is, all the data needs to be analyzed. However, collecting such a large amount of data or
determining what is included in N=ALL is not trivial. Therefore, in many cases, a big data problem is analyzed on found data, a term used to denote the data that happens to be available for analysis.
While collecting more data is often useful for analysis, more data does not necessarily yield improved results. In this context, the relevance of the data being collected is also important [350].
Broadly, data systems can be classified into two types:
1. Transactional Systems: These are the types of systems which support transaction processing. Consequently, these systems adhere to the ACID (Atomicity, Consistency, Isolation, and Durability) properties. They have a proper schema, and the data for each transaction is uniquely identified.
2. Analytical Systems: Such systems do not necessarily hold ACID properties. Consequently, data does not necessarily adhere to a proper schema; it may have duplicates, missing values, and so on. Such systems are more appropriate for analyzing data.
Traditionally, the term big data has been associated with analytical systems, specifically because such systems do not require strong consistency and can have schema-less data with duplicates, multiple formats, and missing values. However, as we will study in chapter 8, big data systems have evolved to include transactional systems bearing ACID properties.
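To make the distinction concrete, the following sketch (not one of the book's numbered examples) uses Python's built-in sqlite3 module. The same table is first updated in a transactional style, where a transfer must be atomic, and then queried in an analytical style, where an aggregate is computed over all rows; the table and values are invented for illustration.

import sqlite3

# Illustrative sketch: one table accessed in a transactional style (ACID,
# row-level updates) and in an analytical style (aggregation over all rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 300.0)])
conn.commit()

# Transactional access: a money transfer must be atomic, so either both
# updates succeed or neither does.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
except sqlite3.Error:
    pass  # the transfer is rolled back as a whole

# Analytical access: an aggregate query over the whole dataset; duplicates or
# missing values would reduce accuracy but not invalidate the analysis.
total, = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()
print("total balance:", total)

In a real big data setting the analytical query would run over a massive, distributed dataset, but the contrast between per-transaction guarantees and whole-dataset analysis remains the same.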
The CAP theorem states that when a network failure occurs and the network is partitioned, a distributed system can either offer consistency or availability but not both. Note that when there is no network failure, a distributed system can offer both availability and consistency together. Example 1.1 explains the concept of the CAP theorem.
CAP Theorem
Example 1.1 (CAP Theorem). Figure 1.1 explains the CAP theorem. Two dis-
tributed servers A and B are shown. In part (a), at time t1 , ClientA writes x=100
at ServerA . Assuming strict consistency, the value of x is immediately updated to
ServerB . Consequently, ClientB reads x=100.
At time t2, a network fault occurs in the link connecting the two servers. Consequently, either availability or consistency can be met, but not both.
In part (b), availability is maintained, while consistency is relaxed. ClientA writes
x=200 at ServerA . However, since the network link is down, the value could not be
immediately updated to ServerB . Therefore, ClientB still reads x=100.
In part (c), in order to meet consistency, availability is compromised. ClientA in-
tends to write to ServerA . However, since the link between the two servers is broken,
the value of x is not updated either to ServerA or to ServerB .
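A toy Python sketch of the scenario in Example 1.1 is given below; the two dictionaries stand in for ServerA and ServerB, and a flag models the failed link. The names and structure are purely illustrative and do not represent an actual replication protocol.

# Toy sketch: two replicas of the key "x" and a flag that models the network
# link between ServerA and ServerB.
class Replicas:
    def __init__(self):
        self.server_a = {"x": 100}
        self.server_b = {"x": 100}
        self.partitioned = False

    def write_availability_first(self, value):
        """Part (b): accept the write even when the link is down, so ServerB
        keeps serving a stale value (availability over consistency)."""
        self.server_a["x"] = value
        if not self.partitioned:
            self.server_b["x"] = value  # replicate immediately

    def write_consistency_first(self, value):
        """Part (c): refuse the write while the link is down, so the two
        replicas never diverge (consistency over availability)."""
        if self.partitioned:
            raise RuntimeError("write rejected: replicas cannot be kept consistent")
        self.server_a["x"] = value
        self.server_b["x"] = value

r = Replicas()
r.partitioned = True                      # the link fails at time t2
r.write_availability_first(200)           # ClientA's write is accepted
print(r.server_a["x"], r.server_b["x"])   # 200 100: ClientB still reads the old value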
Many big data systems exploit the CAP theorem to provide availability at the cost of consistency. However, consistency is not necessarily compromised only for availability; it can also be compromised for latency [74]. Systems which have relaxed consistency requirements are normally considered appropriate for data analytics, as they do not necessarily maintain ACID properties. Maintaining ACID properties in the context of big data is challenging due to the massive size and distributed nature of the data. Therefore, by and large, big data systems have normally been associated with analytical systems. Figure 1.2 illustrates the difference between transactional systems and analytical systems.
Big data systems have the following major requirements:
1. Scalability: The foremost requirement for big data systems is to provide massive capability for processing and storing huge amounts of data. Scalability should be achieved without any noticeable degradation in performance.
2. Availability and Fault Tolerance: An efficient big data system should be able to tolerate
faults. Faults could either be transient such as network congestion, CPU availability,
and packet loss, or they could be persistent such as disk failure, power faults, and
network outages.
3. Efficient Network Setup: As a big data system consists of a large number of machines and workstations, an efficient networking setup is an important requirement. The network should be capable of providing access to big data with low communication latency. The network setup should facilitate building big data systems over both Local Area Networks (LANs) and Wide Area Networks (WANs).
4. Flexibility: Big data systems may contain data from multiple sources such as textual data, images, videos, and graphs. Similarly, data can be accessed and analyzed through multiple means including visualizations, raw data, aggregated data, and queries. Big data systems should provide flexible mechanisms for accessing and storing big data.
5. Privacy and Access Control: As big data systems gather data from a large number of sources, privacy and access control are likely to be among the major concerns. Questions such as which data should be made public, what information should be accessed, and who has the ownership of data are important and need to be addressed.
6. Elasticity: In a big data system, the number of users varies over time. An efficient system should be able to meet users' needs. Elasticity refers to the capability of the system to meet these varying needs.
7. Batch Processing and Interactive Processing: With the passage of time, big data systems have expanded from batch processing to interactive processing. A capable big data system must be able to analyze and process big data in batch mode as well as in streaming mode.
8. Efficient Storage: As data is replicated in big data systems, efficient mechanisms for
replication and storage are significant in reducing the overall cost.
9. Multi-tenancy: Big data systems are accessed by multiple users at a time. Multi-
tenancy refers to the capability of the system in providing fair, persistent, and isolated
services to the users of big data.
10. Efficient Processing: As the data is massive, efficient algorithms, techniques, and hardware are needed for large-scale computation of big data. In this context, effective techniques for parallelization are also significant. Similarly, iterative computation for machine learning and data analytics is also important.
11. Efficient Scheduling: With multiple parallel tasks and concurrent users, methods and
techniques for efficient scheduling are needed.
The above set of requirements is significant for big data systems. Numerous solutions have been developed to cater to these needs. Over the course of this book, we will study various solutions pertaining to these challenges.
GLOSSARY
ACID: A set of properties which characterize transactional database systems. The acronym stands for Atomicity, Consistency, Isolation, and Durability.
Analytical Systems: These types of systems are used to provide analytics on large volumes of historical data.
Atomicity: It is a database concept that a transaction either succeeds or fails in its entirety.
BASE: It stands for Basically Available, Soft state, Eventual consistency. Such systems provide more relaxed consistency guarantees than ACID systems.
CAP Theorem: It is a theorem which identifies a design model for distributed systems.
It states that in case of a network partition (failure), a distributed system can either
provide consistency or availability.
Cluster computing: It is a type of computing which allows multiple computers to work to-
gether to either solve common computing problems or provide large storage. It requires
a cluster management layer to handle communication between the individual nodes and
work coordination.
Data Visualization: It presents meaningful data graphically (from raw data) in order to
understand complex big data.
Eventual Consistency: It is a relaxed consistency model which underlies the BASE model, as opposed to the ACID model.
Fault Tolerant: It is the property of a system to recover automatically even if certain parts of the system fail.
Processing: It refers to extracting valuable information from large datasets.
Reliability: The probability that a given system will perform its intended functions contin-
uously and correctly in a specified environment for a specified duration.
Resiliency: A resilient system is one that can gracefully handle unexpected situations and
bounce back from failures.
Transactional Systems: These are the types of systems which support transaction process-
ing.
CHAPTER 2
Architecture and
Organization of Big Data
Systems
CONTENTS
2.1 Architecture for Big Data Systems
2.1.1 Lambda Architecture
2.1.2 Kappa Architecture
2.2 Organization of Big Data Systems: Clusters
2.2.1 Introduction to Parallelism
2.2.2 Cluster Architecture
2.2.3 Services Offered by Clusters
2.2.4 Cluster Computing vs. Grid Computing
2.2.5 Flynn's Taxonomy
2.3 Classification of Clusters: Distributed Memory vs. Shared Memory
2.3.1 Classification of Shared Memory Systems
2.3.2 Classification of Distributed Memory Systems
2.3.3 Cluster Example: Distributed File System
2.3.4 Cluster Example: Message-Passing Interface (MPI)
2.4 Concluding Remarks
2.5 Further Reading
2.6 Exercise Questions
As we discussed in the previous chapter, big data systems require massive amounts of resources in order to meet the challenge of computing, storing, and processing big data in a timely manner. The purpose of this chapter is to elaborate on suitable architectures and organizations for big data systems, which can meet these challenges. We will begin the chapter with two possible architectures for big data systems. The chapter also elaborates on the organization of big data systems.
A big data system needs to support both batch and incremental updates. We now discuss the lambda and kappa architectures along with their benefits and limitations.
Example 2.1 (Two important characteristics of data). There are two inherent prop-
erties of data. First, data is time dependent. If Mike follows Cindy on a social net-
work, even if he stops following her, that does not change the fact that he used to
follow Cindy. Second, data is immutable. This is supported by the statement that the
fact that Mike was following Cindy cannot be changed.
Nathan Marz argues that if availability is traded off, then nothing can be done. However, it is possible to beat the CAP theorem if availability is chosen instead of consistency. This is because the system will be eventually consistent; that is, the updates will eventually be propagated. Moreover, since data is immutable, it cannot be updated; it can only be added. This holds because data is a function of time.
The lambda architecture is illustrated in figure 2.1. The main idea of the architecture is that a complex data system is composed of three components: a real-time (speed) layer, a batch layer, and a storage layer.
The batch layer manages and computes over historical data. For instance, it can compute machine learning models over a large dataset. It integrates the historical data and combines it with the newly arriving data. Results can be computed on the integrated dataset by going through iterations of computation. This allows higher accuracy at the cost of higher latency.
The speed layer provides results in real time. It utilizes the model updated by the batch layer and only the new data (not the complete dataset) to produce results in near real time. The computation cost at the speed (streaming) layer is significantly lower because approximation algorithms are utilized in the model. The purpose of the storage layer is to store the computed results for fast retrieval.
The lambda architecture offers various benefits:
1. It facilitates near real-time processing of data.
2. The master dataset serves as an immutable dataset, as it holds a complete record of all the data.
3. Reprocessing of data is easier. Reprocessing means processing the input data again to recompute the output. Reprocessing may be needed if the system evolves or if the code is modified.
While the lambda architecture has several benefits, it also has a major limitation: the code has to be maintained in two complex distributed systems (the batch layer and the speed layer). This requirement could create many issues.
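A highly simplified sketch of the idea, with made-up data, follows: the batch view is recomputed over the complete master dataset, the speed view covers only the data that arrived afterwards, and a query merges the two.

from collections import Counter

master_dataset = ["page_a", "page_b", "page_a"]   # immutable log of historical events
recent_events = ["page_a"]                        # arrived after the last batch run

def batch_layer(events):
    # slow but accurate: recompute from scratch over all historical data
    return Counter(events)

def speed_layer(events):
    # fast and incremental, possibly approximate: only the new data
    return Counter(events)

batch_view = batch_layer(master_dataset)
speed_view = speed_layer(recent_events)

def query(key):
    # the stored (serving) results merge the batch view and the speed view
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page_a"))   # 3 = 2 from the batch view + 1 from the speed view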
In the kappa architecture, the batch layer is removed and both real-time processing and reprocessing are handled by a single stream processing engine:
1. Use a stream processing engine such as Kafka (see section 5.8.2) to store logs of data from the required point. For instance, if data of the past 10 days is to be processed, then the system stores data from the past 10 days.
2. For reprocessing, another instance of the stream processing job is started, which processes the retained data again from the beginning, as sketched below.
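The sketch below illustrates this reprocessing idea using the kafka-python client; the topic name, broker address, and consumer group are assumptions made for illustration. A new instance of the stream job simply replays the retained log from the earliest offset and rebuilds its output.

from kafka import KafkaConsumer

def process(value):
    ...  # the (new version of the) stream-processing logic goes here

# Hypothetical topic and broker address; a fresh group id makes Kafka ignore
# any previously committed offsets, so consumption restarts from the beginning.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="reprocessing-v2",
)

for record in consumer:
    process(record.value)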
A cluster allows multiple computers to work together to either solve common computing problems or provide large storage. A cluster is a type of parallel and distributed system in which a collection of interconnected processors is connected through hardware, networks, and software.
Clustering allows accumulation of resources such as storage and processing. This distri-
bution facilitates parallelism and permits faster computation of data. Another advantage
of accumulation is reliability, i.e., in case of a fault, replicated execution or storage can be
utilized.
Clusters can be used for high availability, storage, load balancing, redundancy, and
parallel computation:
1. Storage: A large dataset can be stored in a cluster to increase data availability and fault tolerance through replication.
In the context of big data systems, clusters are used for processing as well as storage. A
cluster can provide high availability and load balancing as well.
Understanding Parallelism
Example 2.2 (Understanding Parallelism). Figure 2.3 explains the process of paral-
lelism through addition of 16 numbers. Suppose that the cost of adding two numbers
is 1 unit. A serial program would require 15 time units for addition. Comparatively,
a parallel version could reduce the cost to 4 time units.
Example 2.2 illustrates parallelism through the process of addition. Parallelization de-
creases execution time. Consequently, large complex problems may be solved in a shorter
time. A parallel program requires redundant resources.
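The reduction in Example 2.2 can be sketched with Python's multiprocessing module: the 16 numbers are split into chunks, the chunks are summed in parallel, and the partial sums are combined at the end. The chunk size and process count are illustrative choices.

from multiprocessing import Pool

numbers = list(range(1, 17))                         # the 16 numbers
chunks = [numbers[i:i + 4] for i in range(0, 16, 4)]

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        partial_sums = pool.map(sum, chunks)         # four chunks summed in parallel
    print(sum(partial_sums))                         # combine the partial results: 136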
Parallel computation can be implemented on a variety of architectures such as a desktop machine with a few cores, a cluster, or a supercomputer. Parallel computing beyond desktop computing is often termed High-Performance Computing (HPC).
Big data systems require redundancy, scalability, and effective task distribution and debugging. It is common for big data systems to experience failures; in this context, redundancy and effective task distribution are of utmost importance.
Hadoop/MapReduce is an example of a parallel programming environment with redun-
dancy, scalability, and task distribution. We will explain the architecture of Hadoop in
chapter 4.
3. Load-balancing clusters: These clusters distribute the incoming workload. For instance, a web server cluster may dispatch different service requests to different cluster nodes. These clusters offer cost-effective scalability, as the number of nodes required can be mapped according to the load requirements. Availability of load-balancing clusters is also important. If a node becomes inoperative, the load-balancing software detects the node failure and redirects the requests to other nodes.
4. High-performance clusters: High-performance clusters offer concurrent execution of
applications to perform heavy computation, thereby enhancing the performance of the
applications. Whenever required, nodes also communicate to execute their assigned
tasks [117, 333].
1. SISD: In the Single Instruction Single Data (SISD) stream, a single instruction is
executed on a single processing unit at a time. Uniprocessor PC and traditional main-
frame computers fall in this category.
2. SIMD: In the Single Instruction Multiple Data (SIMD) stream, a single instruction
stream is executed over multiple processing elements (or nodes). The control unit gen-
erates control signals for all the processing units, and the same operation is executed
on different data streams. In SIMD, each processing element has its own associated
data memory, so that different processors may execute each instruction on a different
set of data. Just like vector and array processors, data-level parallelism is achieved in this category.
3. MISD: In the Multiple Instruction Single Data (MISD) stream, a set of processing units executes different instructions on the same dataset. In fault-tolerant systems, the same instructions are redundantly executed to detect and mask errors.
4. MIMD: In the Multiple Instruction Multiple Data (MIMD) stream, a set of processing units simultaneously executes different sets of instruction sequences on different datasets. MIMD-based systems can either use distributed memory across heterogeneous networked computers in a distributed environment or shared memory within a single machine. MIMD is primarily used in computer-aided design/computer-aided manufacturing, simulation, modeling, communication switches, etc. Multi-core systems are examples of MIMD.
Big data systems such as MapReduce generally follow the SIMD model, in which the same program is executed on different worker nodes.
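The following sketch mimics this pattern on a single machine: the same word-count function (one instruction stream) is applied to different text fragments (multiple data streams), and the partial results are then reduced, much as a MapReduce job applies the same map program on every worker. The fragments and function are invented for illustration.

from collections import Counter
from multiprocessing import Pool

fragments = [
    "big data systems",
    "big data clusters",
    "data parallelism",
]

def count_words(fragment):
    # the single program applied to every piece of data
    return Counter(fragment.split())

if __name__ == "__main__":
    with Pool() as pool:
        partial_counts = pool.map(count_words, fragments)   # map over the data streams
    total = sum(partial_counts, Counter())                   # reduce the partial results
    print(total.most_common(3))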
Based on memory organization, clusters can be classified into three types:
1. Shared Memory: In a shared memory system, all nodes in the cluster share the same memory. As such, an explicit mechanism for message passing or communication between the cluster nodes is not needed. Due to this feature, shared memory systems are considered much faster than distributed memory systems. An explicit mechanism for locking (or mutual exclusion) may be needed to avoid dirty reads. Shared memory systems have limited scalability due to the shared memory requirement of all the cores. Further, performance is limited by the speed of the system bus.
Multi-core PCs, Graphics Processing Units (GPUs), and high-end computing nodes such as Xeon are examples of shared memory systems.
2. Distributed Memory: In a distributed memory system, nodes are connected via net-
work such that each node has a distinct memory. Network connection can be made
either over a Local Area Network (LAN) or through a Wide Area Network (WAN).
Communication between nodes is established through a dedicated communication
paradigm such as Message Passing Interface (MPI). Due to the adaptive nature of
distributed memory systems, they can achieve higher scalability. A Beowulf cluster is
an example of a distributed memory system.
3. Hybrid Memory: A hybrid memory system is a combination of a distributed mem-
ory and shared memory models. In such a system, many shared memory nodes are
connected via a distributed memory model.
In multi-processor shared memory systems, where each processor has its own cache, maintaining coherence among the contents of the caches is significant. Cache coherence is a mechanism which avoids such conflicts and ensures that data remains updated among
all the nodes in the parallel system. Parallel computing clusters incorporate increased fault
tolerance through redundancy. In that, the same job may be assigned to multiple nodes to
tolerate node failures.
The shared memory and distributed memory systems differ in the mechanism employed for communication among the processors.
Example 2.3 explains the difference between shared memory and distributed memory.
Example 2.3 (Distributed vs. Shared Memory). We can better understand the dif-
ference between distributed memory, shared memory, and hybrid memory systems
through figure 2.5. The figure shows four computers, which are connected over a net-
work. Each of these four computers has four cores (also called CPUs) and a memory.
This is equivalent to a hybrid model where message passing between the four com-
puters is established over a network, whereas communication between the four cores
of a PC is executed through the shared memory system.
Shared memory systems are faster as they avoid network communication and utilize
the shared memory medium for communication among the processors.
In comparison, distributed memory systems carry the potential of massive scalability
as a large number of computers can be connected through the network.
A hybrid model integrates the advantages of both the systems, where low
communication-latency is achieved through the shared memory model and scalability
can be increased through the distributed memory system.
2. Loosely Coupled: In a loosely coupled cluster, each computer consists of one processing unit and its own local private memory. In such an architecture, a communication network and a data distribution and accumulation mechanism are required. This results in efficient usage of each individual computer, but limits the amount of coordination and sharing of workloads. Hadoop is an example of a loosely coupled cluster. Figure 2.6(b) illustrates the architecture of a loosely coupled system.
3. Moderately Coupled: Moderately (or hybrid) coupled clusters build a loosely coupled system out of shared memory multiprocessors. In this configuration, from an operating system viewpoint, the programs and information needed during the boot process reside on the master node. The client node requires the master node for the boot process. Once the client node retrieves all the required file systems from the server, it acts like a stand-alone computer [302]. A hybrid of loosely and tightly coupled architecture is shown in figure 2.6(c).
Example 2.4 (SMP vs. Non-Uniform Memory). Shared memory systems can either have Uniform Memory Access (UMA) or Non-Uniform Memory Access (NUMA). The former is also called a Symmetric Multi-Processor (SMP). Such systems are tightly coupled multi-processor systems in which numerous processors are connected via a system bus. These processors are homogeneous and share common resources such as memory, I/O devices, and the interrupt system. These resources are accessed via the common system bus. Each processor may operate on a different set of data. A distinct feature is that the access time to shared memory is the same for all the processors.
On the other hand, in Non-Uniform Memory Access (NUMA) systems, access to memory is asymmetric. This leads to variable memory access times for different processors and different memory units. Note that for both UMA and NUMA, each processor has a separate cache.
Distributed memory systems can be classified into the following two architectures:
1. Shared Nothing Architecture: In the shared nothing architecture, each node has its
own memory and hard disk. These resources are not accessible over a network. Nodes
can communicate with each other over a network.
2. Shared Disk System Architecture: In the shared disk architecture, disks are shared
among the nodes of a cluster such that disks of all the nodes are accessible over a
network.
In comparison to the shared disk architecture, the shared nothing architecture provides
improved scalability. However, it requires high-speed networking infrastructure in order to
promote efficiency. Example 2.5 illustrates the difference between the two types of systems.
Example 2.5 (Shared Nothing vs. Shared Disk). Shared nothing systems provide
improved scalability; however they require faster network access.
Example 2.6 (Distributed File System). Figure 2.11 illustrates a Distributed File
System. A dedicated name server is used to store metadata and provide file lookup
service to the clients. Files are replicated across the distributed system. Not all the
files are available on all the servers. Clients access actual files through file servers.
As files are replicated, a client may select the nearest server in order to access files and reduce latency. In case of high load, the namespace may also be replicated.
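The lookup flow of Example 2.6 can be sketched in a few lines of Python; the server and file names below are invented, and the dictionary merely stands in for the name server's metadata.

# Toy sketch of a distributed file system lookup: the name server holds only
# metadata (which file servers hold a replica of each file); the client then
# contacts one of those file servers directly.
name_server = {
    "/logs/day1.log": ["fileserver-1", "fileserver-3"],
    "/logs/day2.log": ["fileserver-2"],
}

def open_file(path, preferred_order):
    replicas = name_server[path]          # metadata lookup at the name server
    for server in preferred_order:        # pick the nearest replica first
        if server in replicas:
            return f"reading {path} from {server}"
    raise FileNotFoundError(path)

print(open_file("/logs/day1.log", ["fileserver-3", "fileserver-1", "fileserver-2"]))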
MPI Cluster
Example 2.7 (MPI Cluster). Figure 2.12 illustrates an MPI cluster. A head node
is used to distribute work among worker nodes. Worker nodes are connected via a
LAN. All the worker nodes have access to a shared storage.
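A hedged sketch of this head-node/worker pattern using mpi4py (a Python binding for MPI) follows; it assumes mpi4py is installed and that the script is launched with mpirun (e.g., mpirun -n 4 python wordcount_mpi.py, where the script name is illustrative). Rank 0 plays the role of the head node and scatters text fragments to all ranks; the partial word counts are gathered back at rank 0.

from collections import Counter
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    lines = ["big data systems", "mpi clusters", "big clusters", "data"]
    chunks = [lines[i::size] for i in range(size)]   # one slice of work per rank
else:
    chunks = None

local_lines = comm.scatter(chunks, root=0)           # head node distributes the work
local_count = Counter(word for line in local_lines for word in line.split())

partial_counts = comm.gather(local_count, root=0)    # partial results return to rank 0
if rank == 0:
    print(sum(partial_counts, Counter()))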
Cluster architecture has been explained in references [290, 312]. Details about grid computing are provided in reference [170].
The use of parallelism in big data analytics is explained in references [216, 387].
2. Describe the limitations of the lambda architecture and explain how the kappa architecture addresses these issues.
3. Differentiate between cluster computing and grid computing.
4. Explain different types of clusters.
5. Describe Flynn’s taxonomy. How is it useful in the design of big data systems?
6. Differentiate between grid computing and cluster computing.
7. What is a distributed file system? How is it useful for big data systems?
8. Differentiate between shared nothing and shared disk architecture.
9. Explain the difference between shared memory and distributed memory systems. Which one of them is more useful for promoting scalability, and why?
10. Explain the functionality of a distributed file system.
GLOSSARY
Lambda: It is an architecture for big data systems which requires maintaining separate engines for stream processing and batch processing.
Kappa: An architecture for big data systems in which the stream and batch processing engines are merged together.
Flynn's Taxonomy: This taxonomy categorizes systems with parallel processing capability based on the number of instruction streams and the number of data streams they can handle at a time.
Clustering: It refers to the concept of organizing machines such that they collectively perform a coherent task. Machines in a cluster perform tasks in parallel.
Message Passing Interface (MPI): It is a communication paradigm used for distributed
memory model. It is built upon send and receive architecture.
NUMA: It stands for Non Uniform Memory Access. In such systems, access to memory
is asymmetric. That is, each processor may take variable time and delay to access the
memory.
Shared Memory Systems: These refer to the types of systems in which all processing nodes share the same logical address space and have direct access to the memory.
Distributed Memory Systems: In such systems, processing nodes have access to their local memory only.
CHAPTER 3
Cloud Computing for Big Data
Cloud computing provides the foundations for deploying big data systems. The purpose of this chapter is to elucidate the concept, the architecture, the service models, and the functionality of cloud computing in the context of big data systems.
1. Deployment: Architecturally, how will these clusters be deployed? Should each big
data application use a separate cluster? Or could there be sharing of clusters among
different applications and users?
2. Service Level: Big data applications require long and extended operational hours. An
important question is how to ensure a service level which meets a user’s needs of fault
tolerance and availability.
3. Workload Management: Workload for big data applications is expected to vary. A
significant question is how to ensure higher efficiency and effective utilization of re-
sources.
7. Agile: Cloud provides capabilities for developers to build systems much faster. For
instance, it can provision new IT resources or deploy application code in minutes
rather than days or weeks with a simple click or API call. This results in a substantial
increase in agility for the organizations.
On top of the layer-2 switches are aggregation switches, which are used to aggregate network packets and send them to access routers. At the topmost layer of the data center, border routers provide the data center's connection to the outside world.
1. Public Cloud: In the public cloud deployment model, a service provider provides
resources, such as infrastructure, application, and storage. These resources are made
available to the general public over the Internet. The service provider owns, operates,
and manages the cloud. These services are typically offered on a pay-per-usage model.
Infrastructure for public clouds is hosted on provider’s premises. In addition, the
environment is shared by multiple users. Therefore, deployment on public cloud may
raise privacy concerns.
Amazon EC2 [81], Microsoft Windows Azure [303], Google App Engine [384], and
RackSpace [301] are popular examples of public cloud.
2. Private Cloud: In a private cloud, the infrastructure is provisioned exclusively for a
single organization. It may be owned, managed, and operated by the organization
itself or by a third party. The private cloud is deployed inside a customer-owned facility, and the environment is dedicated to a single customer.
Internal hosting on a private cloud addresses privacy concerns at the expense of the cost of hosting and management. OpenStack [149], Xen Cloud Platform [71], Eucalyptus [286], and VMware [70] are popular platforms for building a private cloud.
3. Hybrid Cloud: This cloud infrastructure comprises both private and public clouds.
Public and private cloud remain distinct components. However, they are integrated
together to offer the benefits of multiple deployment models.
The deployed applications can utilize the benefits of both public and private cloud
seamlessly. For instance, an organization may store and process sensitive data in a
private cloud for security and privacy reasons and outsource the rest of the processing
to public cloud.
VMware [69] and Amazon VPC [2] are popular platforms for building hybrid clouds.
Each of these models provides an abstraction that masks the complexities of the underlying components and therefore allows users to focus on building applications while providing automation, scaling, and ease of management. These cloud service models can also be used with each of the cloud deployment models discussed in section 3.1.3.
When leveraging these cloud platforms, users share responsibility with the cloud service provider for securing the cloud services. As shown in figure 3.4, the further right users go, the more responsibility the cloud service provider takes on.
1. Infrastructure as a Service (IaaS)
(a) Responsibilities: In IaaS, a cloud provider manages the underlying cloud infras-
tructure on behalf of a user, while the user manages and configures operating
system and applications. A user may also have limited control of networking
components such as firewall.
(b) Examples: IBM, Amazon, Microsoft, Google are some high-profile vendors in the
IaaS market.
(c) Benefits, Concerns, and Considerations: IaaS is the most mature and well-understood paradigm, and it can be used to host several kinds of applications. These may include traditional or monolithic architecture applications and applications that require specific hardware or infrastructure for their workloads. It also provides complete control over infrastructure configurations and operating systems.
IaaS delegates responsibility for patching, scaling, configuration, and manage-
ment to the users. It supports limited portability and agility. The infrastructure
cost for running applications (e.g., web apps) in IaaS compared to other models
can vary significantly.
2. Containers as a Service (CaaS)
(a) Responsibilities: Unlike an IaaS model, where users have to manage the virtual
machine the application is deployed to, with CaaS, the cloud provider manages or
controls the underlying cloud infrastructure including network, physical servers,
operating systems, or storage, and virtualization. The user has control over the
deployed applications and container.
(b) Examples: Amazon’s Elastic Container Service (ECS), Kubernetes [64], and
Mesos [213] are some examples for container orchestration.
(c) Benefits, Concerns, and Considerations: CaaS provides increased server density and greater user flexibility. It also allows greater portability and efficiency across environments than the IaaS service model. However, as it provides a different layer of service, a user has less control and a higher level of abstraction. Traditional monolithic or legacy applications that have been modernized or containerized, and applications developed using a microservice architecture, are a good fit for deployment on a CaaS-based platform.
3. Platform as a Service (PaaS): This service model enables users to run applications without the need to manage the underlying infrastructure (e.g., operating systems, hardware). It layers on top of IaaS and provides users a platform, such as a set of tools, a programming and development environment (e.g., Java), marketplace services, and databases, to develop, build, and deploy their custom applications.
(a) Responsibilities: In PaaS, a cloud provider provides support for the tools and the development environment. The provider manages or controls the underlying cloud infrastructure including the network, physical servers, operating systems, patch updates, storage, containers, and virtualization, whereas the user has control over the deployed applications, is responsible for their configuration management, and may also configure the application hosting environment.
(b) Examples: Google App Engine and AWS Elastic Beanstalk are examples of PaaS products offered by cloud providers, while Pivotal App Platform and Heroku are popular PaaS platforms. Salesforce, Apprenda, and Jelastic are other proprietary PaaS offerings.
(c) Benefits, Concerns, and Considerations: PaaS provides a platform that lever-
ages and manages the complexity of underlying infrastructure and minimizes the
overall operational overhead. Similar to CaaS, a PaaS platform also provides portability across environments. PaaS also greatly simplifies the deployment process and experience.
PaaS platforms require a transformation of legacy applications in order to leverage their services. They also allow less developer control by providing prebuilt software services (e.g., database types), and can therefore lock code into a cloud vendor platform if the application relies on many of the vendor-specific tools and technologies.
4. Function as a Service (FaaS)
(a) Responsibilities: A user only manages and configures the input function, while the complete underlying cloud infrastructure, including the network, physical servers, operating systems, storage, virtualization, and container components, is managed and configured by the cloud provider on behalf of the user.
(b) Examples: FaaS is an emerging platform with growing adoption in both private and public clouds. Public cloud providers that offer serverless services include:
AWS Lambda from AWS (https://aws.amazon.com/lambda/)
Azure Functions from Microsoft (https://azure.microsoft.com/en-gb/services/functions/)
Cloud Functions from Google (https://cloud.google.com/functions/)
IBM Cloud Functions (https://www.ibm.com/cloud/functions)
(c) Benefits, Concerns, and Considerations: FaaS leads to increased server utilization efficiency, cost optimization, and a high level of abstraction. The FaaS-based model allows users to focus solely on business functionality and offers attractive economics; however, it can lock code into a cloud vendor platform, as the code is deployed to a vendor-specific runtime engine. Moreover, due to the full abstraction from the underlying infrastructure, concerns also exist regarding security, control, and predictability of performance.
Although the number of private and public cloud FaaS providers is increasing, the emergence of open standards and architectural patterns is still evolving. A minimal handler sketch in Python is shown after this list.
5. Software as a Service (SaaS): The application or software does not need to be installed locally on the customer's device and is accessible from various client devices through a thin client interface such as a web browser (e.g., web-based email).
SaaS applications can be categorized into two types. In the first type, the SaaS application is hosted on the SaaS provider's own infrastructure, while in the second type, the SaaS provider hosts the application on external outsourced infrastructure (for example, Azure).
(a) Responsibilities: A SaaS provider manages or controls the underlying cloud infrastructure including the network, servers, operating systems, storage, and even individual application capabilities. The only thing users or customers are responsible for is the configuration of the software, with the possible exception of limited user-specific application configuration settings such as creating or federating users, managing user access, and assigning proper access rights.
(b) Examples: Microsoft's Office 365, Google's Gmail, Dropbox, and Concur are some common examples. All these products are hosted in the cloud, with everything other than the configuration provided to the user.
(c) Benefits, Concerns, and Considerations: With a SaaS model, both maintenance and support become simplified for the user, since the software is not installed on the user's machine and the user has no responsibility to maintain the application infrastructure. However, this simplicity comes at the expense of optionality; for example, users have no control over the technology selection. With a SaaS-based model, data privacy can be a concern, since all data management is the responsibility of the cloud vendor.
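As an illustration of how little a FaaS user manages, the following is a minimal AWS Lambda handler sketch in Python; the function body and field names are illustrative. The user writes and configures only this function, while provisioning, scaling, and the runtime are handled by the provider.

import json

def handler(event, context):
    # `event` carries the trigger payload (e.g., an API Gateway request);
    # `context` exposes runtime information such as the remaining execution time.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }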
In short, the decision to choose a specific model should be based on the requirements and characteristics of a given application.
3.2 VIRTUALIZATION
In this section, we will study fundamental topics related to virtualization.
Example 3.1 explains the concepts of isolation, consolidation, and migration in VMs.
The x86 architecture provides different levels of privilege (also known as rings). The highest privilege mode is at ring 0. In a nonvirtualized environment, ring 0 hosts the OS and ring 3 hosts the applications, whereas rings 1 and 2 remain unoccupied.
These requirements may contradict each other. For instance, if all guest binaries are executed natively, then the overhead of the VMM is low. However, this cannot be fully implemented, as privileged instructions should not be handled by the guest OS.
Guest binaries should run on the CPU as fast as possible. Theoretically speaking, if all guest binaries could run natively, there would be no overhead at all. However, this is implausible because the guest OS cannot be allowed to handle everything; the VMM should be able to control all the hardware resources. The first effort at implementing virtualization was the "trap and emulate" model. Using this model, nonprivileged instructions are executed directly by the guest OS, whereas privileged instructions initiated by the guest OS are trapped and emulated by the VMM.
When virtualization is introduced, the VMM takes over the role of managing resources for VMs. It is hosted at ring 0, whereas the guest OS is hosted at ring 1. The guest OS executes user-level instructions directly, while privileged instructions are intercepted and translated by the VMM.
The "trap and emulate" model is limited because a few instructions cannot be trapped. This is because these instructions behave differently in user and kernel modes.
Example 3.2 explains the difference between the three types of virtualization techniques.
Types of Virtualization
2. Affinity of Resources: For instance, in a data center environment, two VMs, which
are being hosted on two different physical machines, are communicating with each
other via network. As such, there is a strong network affinity between the two VMs.
It would be beneficial to consolidate both the VMs on one physical machine in order
to reduce network latency and improve user experience. Similarly, affinity could exist
with respect to other attributes such as memory, CPU, and storage.
3. Conserve Power: VMs could also be migrated in order to conserve the power of the data center. This includes the power of servers and networking equipment [105, 359].
3.4 CONTAINERIZATION
Containerization is aimed to secure a portion of a device’s resources such as application,
storage, or network access from other applications and systems running on the same de-
vice [288].
Containerization permits running a multitude of processes on a single host. Each of the processes can be executed in an isolated environment. The technique is also called container-based virtualization [312].
Docker [23] and Linux Containers (LXC) [42] are the two popular platforms for con-
tainerization.
Unlike hypervisor-based virtualization, which requires hardware assistance, containers are lightweight and hardware independent. Containers can run their own operating system environment; however, they share the base OS kernel and are therefore lighter. LXC, a popular Linux-based container platform, has its own file system and network stack. Isolation is provided through Linux cgroups and namespaces: cgroups are used to manage and allocate resources, whereas namespaces are used to isolate users, processes, and files. Another important difference between a hypervisor and a container is that the latter cannot allocate resources on a per-core basis.
Containers (such as Docker) mount a writable file system on top of a read-only file system. Through this, multiple read-only file systems can be layered together, with the topmost layer serving as a writable layer. This concept is illustrated in figure 3.14.
Another major advantage of containers is that they can facilitate application clustering, a technique which is used to cluster applications together. This is useful for interoperability and scalability. Figure 3.13 illustrates the clustering architecture of containers.
Volumes provide persistent storage, while links are used to provide a communication channel between containers. Containers run their own OS environment, file system, and network stack, but they share the kernel [312].
Containers vs VM.
Example 3.3 (Containers vs VM.). Figure 3.15 illustrates the difference between a
VM and a Container. Each VM instance needs a guest OS. VMs store large isolated
files on their host and store the entire file system. In addition, application binaries
and libraries necessary for the applications are also stored. This yields large overhead
and slow startups for VMs. In comparison, a container offers lightweight virtualiza-
tion. A container holds self-contained and ready to deploy applications with binaries
and libraries [289].
While isolation in VMs is well established, container-based systems may suffer from degraded isolation [370].
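As a small illustration of how lightweight containers are to create programmatically, the sketch below uses the Docker SDK for Python (the docker package); it assumes a local Docker daemon is running, and the image and command are illustrative. The container starts from an existing image, runs a single isolated process that shares the host kernel, and is removed afterwards.

import docker

client = docker.from_env()                       # connect to the local Docker daemon

container = client.containers.run(
    "python:3.10-slim",                          # existing image with its own file system layers
    ["python", "-c", "print('hello from a container')"],
    detach=True,                                 # run in the background
)
container.wait()                                 # wait for the process to finish
print(container.logs().decode())                 # collect the container's output
container.remove()                               # clean up the stopped container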
Big data systems have benefited from the hypervisor-based virtualization techniques
in building scalable and reliable cloud systems. Containers have emerged as a competitive
technology in providing lightweight, robust, high performance, and scalable systems for big
data. Studies have shown that container-based big data cloud systems may perform better
than hypervisor-based cloud systems. However, performance may depend upon a number
of factors such as workload, platform, and hardware resources [98].
3.6.1 Borg
Borg has been developed by Google as a part of it’s internal requirements to manage large
clusters containing Linux containers [116, 358].
It caters for two kinds of jobs, i.e., long-running services and batch jobs. Long-running
service jobs belong to specific Google services such as Gmail and Google Docs and have
latency-sensitive requirements, whereas batch jobs have relaxed latency requirements and
may take from a few seconds to a few days to complete. Batch jobs are normally
CPU-intensive.
Google applications and services are run on Borg. Users, who are Google developers
and engineers, submit their jobs to Borg. Figure 3.16 illustrates the architecture of Borg.
We now define a few Borg terms.
1. Borg job: A set of tasks that all run the same program.
2. Borg cell: It is a set of machines that are managed as a unit; each Borg job runs within a single cell.
3. Borgmaster: It is a centralized controller, which is used to manage a borg cell.
4. Borglet: It is an agent process that runs on each machine in a cell. It is responsible
for starting and stopping tasks, failure recovery, and management of local resources.
The Borgmaster receives requests from Borglets and schedules tasks accordingly. For larger
cells, the Borgmaster can operate in parallel.
Borg supports a variety of tools for cluster management. These include predicting re-
source requirements, configuring jobs and submitting configuration files, load balancing,
auto-scaling, and machine life cycle management [116].
3.6.2 Kubernetes
Kubernetes (k8s for short) is a portable, extensible, open-source, robust, and feature-rich or-
chestration platform for managing containers that can operate at scale [64,255]. It provides
the platform to automate deployment, management, and scaling of containerized applica-
tions.
Kubernetes was originally developed by Google. However, its simplicity and flexibility
have quickly made Kubernetes a front runner for container orchestration and cluster
management. It inherited a lot of concepts and lessons learned from Google's internal large-
scale cluster orchestration systems, such as Borg and Omega, and was designed based on
the experience gained from these systems. Some of its key ideas are listed below.
1. Data center as a single virtual machine: Kubernetes embraces the idea of treating
our servers as a single unit and abstracts away the underlying hosting environment
and how individual compute resources operate. For instance, from a collection of five
servers, Kubernetes makes a single cluster that behaves like one. It allows us to deploy
and run our application components without having to know about the actual servers
underneath.
2. Portability: Kubernetes makes it possible to move workloads from an on-premise
cluster to Azure, AWS, or Google Kubernetes Engine (GKE) without risk of vendor lock-in.1
We will now provide an introduction to some of the basic concepts (or objects) of Ku-
bernetes and their role in the development and containerization of an application [38].
1. Nodes: The nodes can be either virtual machines (VMs) or physical servers.
2. Master and worker nodes: Kubernetes is composed of a master node and a set of
worker nodes. The master node is responsible for distributing tasks to other nodes. It
does not run any workload except in small clusters, e.g., Minikube2 [277]. It schedules
work across the worker nodes. The worker nodes run the actual applications we deploy.
3. Pods: A Pod is a collection of application containers that work as a single unit [39]. Pod
is the smallest deployable artifact in a Kubernetes cluster. This means all containers
in a Pod are always run on a single worker node in the cluster as shown in figure 3.17.
Pods also share many resources between their containers. For example, all containers within
a Pod share the Pod's network namespace and can share volumes with the other containers
in the same Pod.
5. Labels: A Kubernetes object such as a Pod can be attached with arbitrary key-value
pairs called labels. Labels are useful for organizing Kubernetes objects. An object
can have any number of labels.
1 Vendor lock-in is a situation where we have a dependency on a single service provider and migration
to another provider is difficult or costly.
2 Minikube is a tool that runs a single-node Kubernetes cluster on a local machine.
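To make Pods and labels concrete, the following sketch uses the official Kubernetes Python client to list the Pods in a namespace that carry a given label. The namespace, the label values, and the reliance on a local kubeconfig file are assumptions made only for this example.

from kubernetes import client, config  # pip install kubernetes

# Load credentials from ~/.kube/config (e.g., created by Minikube or a managed service).
config.load_kube_config()
v1 = client.CoreV1Api()

# List Pods in the 'default' namespace that carry the label app=web.
pods = v1.list_namespaced_pod(namespace="default", label_selector="app=web")
for pod in pods.items:
    print(pod.metadata.name, "->", pod.spec.node_name, pod.status.phase)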
Figure 3.18 explains the execution. Two layers of load balancers (internal and external)
are shown in the figure. The internal load balancer is called a Service, whereas the
external one is called an Ingress. A Deployment is used to create the Pods.
Kubernetes provides a variety of options to automate deployment, scaling, and the man-
agement of containerized applications. It strives for minimal disruption or zero downtime
during application updates.
Kubernetes is offered by all of the major cloud providers. It has become the standard
platform for building cloud-native applications3 . There are several managed Kubernetes
services being offered by cloud providers. These include:
3 Cloud-native applications are essentially applications that are designed as a set of lightweight, loosely
coupled services (i.e., microservices) and utilize the unique capabilities of the cloud.
1. AWS EKS (Elastic Kubernetes Service): It is the managed Kubernetes offering from
Amazon Web Services.
2. Azure AKS (Azure Kubernetes Service): It is the managed Kubernetes offering from
Microsoft Azure.
3. Google Kubernetes Engine (GKE): It is a fully managed Kubernetes service from
Google Cloud.
A few of the several benefits of Kubernetes include a common API for orchestration
and provisioning, greater application portability across platforms, business agility, and
improved elasticity. The Kubernetes ecosystem continues to experience explosive growth,
with new projects being launched frequently.
3.8 EXAMPLES
In this section, we will show a few examples of creating cloud instances. We will also
explain the process of creating container images. The purpose of these exercises is to enhance
readers' learning and hands-on capabilities. In these exercises, explanatory comments are
included with the code listings.
3.8.1 Creating Cloud Instances on AWS
AWS is one of the largest cloud providers. Among other services, it provides IaaS. It utilizes
public/private key pairs for access.
For IaaS, storage on AWS is provided through S3 (Simple Storage Service). S3 utilizes the
concept of buckets, which are analogous to directories in a file system. The namespace of S3
is universal, i.e., no two buckets can be created with an identical name. Buckets are used to
store objects.
There are three ways to access AWS components. These include AWS management
console, AWS command line interface (CLI), and an API (such as boto library).
Access and management through the AWS management console and the AWS CLI are
straightforward. Example 3.4 shows a sample command to configure AWS using the CLI.
aws configure
The following Python code (based on the boto3 library) starts or stops an EC2 instance depending on a command flag; the instance ID and the flag value shown here are placeholders.

import boto3
from botocore.exceptions import ClientError

client_name = boto3.client('ec2')          # EC2 client built from the configured credentials
instanceid = 'i-0123456789abcdef0'         # placeholder instance ID
command = 'ON'                             # placeholder flag: 'ON' starts, anything else stops

if command == 'ON':
    # Do a dry run first to verify permissions
    try:
        client_name.start_instances(InstanceIds=[instanceid], DryRun=True)
    except ClientError as e:
        if 'DryRunOperation' not in str(e):
            raise
    # Dry run succeeded, run start_instances without DryRun
    try:
        my_response = client_name.start_instances(InstanceIds=[instanceid], DryRun=False)
        print(my_response)
    except ClientError as e:
        print(e)
else:
    try:
        my_response = client_name.stop_instances(InstanceIds=[instanceid], DryRun=False)
        print(my_response)
    except ClientError as e:
        print(e)
The following snippet lists the existing S3 buckets and then uploads a local file to one of them; the file name and bucket name are placeholders.

import boto3

# Create an S3 client
s3 = boto3.client('s3')

# List all buckets and extract their names from the response
my_response = s3.list_buckets()
buckets = [bucket['Name'] for bucket in my_response['Buckets']]
print(buckets)

# Upload a local file to a bucket (file and bucket names are placeholders)
filename = 'file.txt'
my_bucket = 'my-bucket'
s3.upload_file(filename, my_bucket, filename)
Quitting a Container
Attaching to a Container
GLOSSARY
Borg: Borg is Google’s internal cluster management system which was developed as a part
of their internal requirements to manage large clusters containing Linux containers.
Cloud: It is an abstraction of network-based (or remote) service. The service can be pro-
vided at different layers such as hardware, platform, and software. Cloud provides high
availability and fault tolerance. It is based on a utility computing model in which users
are charged for the services they use. A cloud is an elastic computing model in which
resources can shrink and grow as per need.
Container: It is a lightweight virtual machine, which does not use any hardware virtual-
ization. It shares the host's kernel. It includes everything (code, libraries, dependencies,
configurations) required to run an application.
Containerization: It is a process of virtualizing OS such that multiple containers can be
created on a single OS.
Docker: It is an open-source container platform for building, managing, shipping, securing,
and running containers. It is a powerful and easy-to-use tool for running applications
in containers.
Docker Swarm: It is a clustering and scheduling tool for Docker containers.
Database-as-a-Service (DBaaS): It is a managed database service in the cloud that provides
a powerful on-demand scalable database platform.
Edge Computing: It is closely related to fog computing. It is the concept in which computing
is moved from data centers to edge devices or close to the user or data sources.
EKS: It is called Elastic Kubernetes Service (EKS) which is the managed Kubernetes of-
fering from Amazon Web Services.
etcd: It is a key-value store which is designed to run in distributed systems.
Fog Computing: It is an extension of cloud computing and services that brings storage and
processing to the edge of the network.
Full Virtualization: It is a type of virtualization, which involves binary translation of priv-
ileged instructions by the hypervisor.
Hybrid Cloud: It is a combination of both public and private clouds to offer the benefits of
multiple deployment models.
Hypervisor: It is a software which is used to provide processor virtualization. It is used to
manage hardware resources for different VMs.
Kubernetes: (k8s for short) It is an open-source, robust, and feature-rich orchestration
platform for managing containers that can operate at scale. It provides the platform to
automate the deployment, management, and scaling of containerized applications.
Lambda Function: It is an AWS implementation of Function as a Service (FaaS) model.
Mesos: It is an open-source cluster management framework for managing compute, memory,
storage, and other resources across a cluster.
Monolithic Applications: Applications packaged as a single unit are generally referred to as
monoliths. Monolithic applications are usually deployed as a single artifact. Changes to
one component of the application require a new version of the whole application to be
developed, tested, and deployed.
Microservices: The components that result from splitting a complex and static monolithic
application into a set of smaller, decoupled, and independent parts. These components can
be developed, deployed, tested, and scaled individually.
Multitenancy: It refers to the concept that multiple users (also called tenants) can be
served by the same physical infrastructure. Multitenancy is supported through virtualization,
such that multiple virtual machines can be hosted on a single physical machine.
Multicloud: It is the strategy of using the combination of services and technologies from
different cloud service providers to satisfy or achieve a business objective for an orga-
nization.
CHAPTER 4
HADOOP: An Efficient
Platform for Storing and
Processing Big Data
CONTENTS
4.1 Requirements for Processing and Storing Big Data . . . . . . . . . . . . . . . . . . . . 66
4.2 Hadoop – The Big Picture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Hadoop Distributed File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.1 Benefits of Using HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.2 Scalability of HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.3 Size of Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.4 Cluster Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.5 Read and Write Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.6 Checkpointing and Failure Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.7 HDFS Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.1 MapReduce Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.2 Input Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.3 The Partitioner Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.4 Sorting by Keys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.5 The Combiner Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4.6 Counting Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4.7 Secondary Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4.8 Inverted Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.9 Computing Inlinks and Outlinks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.10 Join Operations Using MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.11 MapReduce for Iterative Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 HBase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.1 HBase and Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.2 HBase Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.3 Installing HBase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.5.4 HBase and Relational Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.8 Exercise Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
In this chapter, we will study efficient platforms for storing and processing big data. Our
main focus will be on MapReduce/Hadoop – a widely used platform for batch processing
on big data. Finally, we will learn about HBase – a distributed, scalable, big data store built
on top of HDFS.
1. Scalability: A scalable solution is needed which can meet the growing demands of big
data systems.
2. Low Network Cost: The cost (time) to transfer data should be low. Ideally, the time
spent on computation should exceed the time spent on transferring the data.
3. Efficient Computation: Owing to a large amount of data, computation should be done
in parallel in order to reduce time of computation.
4. Fast and Rapid Retrieval: As big data systems are based on the principle of ‘write
once read many’, retrieval of data should be fast.
5. Fault Tolerance: Network and hardware failures are inevitable. A big data storage and
computation system should be fault tolerant.
The above mentioned requirements are specific to big data systems. In addition to these
requirements, a big data system should provide a flexible and adaptable platform for pro-
gramming and development.
Over the past two decades, big data systems have evolved on these principles. With
time, their capacity and performance have also increased.
We will now discuss Hadoop, one of the most widely used big data platforms, which can
cater to these challenges.
Component and purpose:
HDFS (Hadoop Distributed File System): Distributed file system that can support high-throughput access to data.
MapReduce: Programming platform which utilizes data stored in HDFS.
Hive: An infrastructure for querying and data warehousing solutions for Hadoop.
HBase: A distributed, scalable, NoSQL column family storage system built on top of HDFS.
Pig: Scripting language for accessing data stored on HDFS.
Mahout: Machine learning library that is built on top of MapReduce.
Oozie: Scheduler and workflow engine for creating MapReduce jobs.
Zookeeper: Tool to manage and synchronize configuration.
Ambari: Tool for monitoring the Hadoop cluster.
Impala: Massively parallel distributed database engine which can utilize Hadoop worker nodes for processing of queries.
HBase is a NoSQL column-oriented database, which is built on top of HDFS, whereas Oozie
is a workflow engine for MapReduce jobs. The purpose of Zookeeper is to manage and
synchronize configuration.
In chapter 5, we will study these components in detail.
Since HDFS has a single namenode, the performance of the HDFS cluster depends greatly
upon the capacity and capabilities of the namenode.
The figure shows four blocks which are distributed across the cluster. By default, in
HDFS, each block is replicated three times.
Example 4.1 (Effect of HDFS Block Size). If the block size is reduced, there will be
a larger number of blocks, which will increase the amount of metadata. This will eventually
increase the load on the namenode.
Suppose that in a cluster, 1280 MB of data is needed to be stored. For a 128 MB
block size, this will lead to 10 blocks. If we reduce the size of the block to 64 MB, the
number of blocks required to hold 1280 MB will be increased to 20. This will increase
load on the namenode because eventually, there will be more queries about metadata.
If the size of the block is increased, this will reduce the load on the namenode.
However, it will decrease the amount of parallelism, as there will be more data for
each data node to handle. Since Hadoop is built on the concept of data localization,
maintaining a sufficient degree of parallelism is also significant.
Both the size of the block and the replication factor can be customized by an adminis-
trator.
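The arithmetic in Example 4.1 can be verified with a few lines of Python; the data size and block sizes below are the ones used in the example, and the replication factor is the HDFS default.

import math

data_size_mb = 1280          # total data to store (from Example 4.1)
replication_factor = 3       # HDFS default

for block_size_mb in (128, 64):
    num_blocks = math.ceil(data_size_mb / block_size_mb)
    # Each block (and each of its replicas) contributes a metadata entry on the namenode.
    replica_entries = num_blocks * replication_factor
    print(f"block size {block_size_mb} MB -> {num_blocks} blocks, "
          f"{replica_entries} replica entries tracked by the namenode")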
HDFS namenode is considered a single point of failure. Further, it has a high startup
time. The bootstrapping mechanism for namenode consists of two steps [103]:
1. Reading information of the filesystem from the file fsimage. Once the information is
read, transaction logs are applied and the updated information is stored as a new file
system back to disk.
2. Block reports are read from data nodes to obtain information about all the known
block locations.
a) Create a file:
nano testfile.txt
Write some text in the file and save it.
b) Copy the file into HDFS (assuming the directory /user/inputhdfs exists):
$HADOOP_HOME/bin/hadoop fs -put testfile.txt /user/inputhdfs
c) List the contents of the directory:
$HADOOP_HOME/bin/hadoop fs -ls /user/inputhdfs
In the next section, we describe MapReduce, a popular programming model for Hadoop.
4.4 MAPREDUCE
MapReduce [154] is a programming framework based on a partition-aggregation model.
That is, a large problem over a large dataset is partitioned into smaller chunks. Each chunk
is processed independently and in parallel. The final result is then aggregated by combining
the intermediate results from each chunk.
The MapReduce framework computes the problem in two phases. During the first
phase, also called the Map phase, the data is divided into chunks. Each chunk
is computed independently and concurrently on a worker node called a mapper. There could
be hundreds to thousands of mappers in the map phase.
The second phase is called the Reduce phase. In this phase, the output of the mappers is sent
to a single node for aggregation. This process is also called reduction, and the node
performing the reduction is called the reducer. There could be one to a few reducer
nodes. MapReduce is based on the model of functional programming; that is, tasks for
both the mapper and the reducer are specified using separate functions.
A distinctive feature of the MapReduce framework is that the input to the Reduce phase
is sorted by keys. The framework, while copying data from the mappers to the reducer, sorts
it by keys. This operation is called the shuffle and merge phase.
Understanding MapReduce
Example 4.4 and figure 4.6 explain the execution of the wordcount problem on MapReduce.
A corpus of data is given in which our task is to count the occurrences of each unique word.
The corpus consists of three input documents. These are processed by three mappers by
incorporating the data localization concept we studied earlier. Each mapper processes a
document and extracts words. These words and their interim count (which is '1' in this
case) are emitted by the mappers. The data is emitted in the form of <key,value> pairs. A
developer can select <key,value> pairs according to her choice. For this example, words are
selected as keys and their counts are selected as values.
These <key,value> pairs are sent to the reducer, which performs the reduce operation.
For this example, the reduce operation involves aggregating the output from all the mappers.
Figure 4.10 illustrates the example of MapReduce operation for the wordcount problem.
We will now look at pseudocode for the wordcount problem. Example 4.5 shows sample
pseudocode. The code has been derived from reference [245]. The word 'Emit' is a general
term which refers to the operation of emitting the <key,value> pairs.
Class Reducer:
    method Reduce(word w, counts [c1, c2, ...])
        sum = 0
        for all count c belongs to counts [c1, c2, ...] do
            sum = sum + c
        Emit(word w, count sum)
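The same logic can also be expressed in Python using Hadoop Streaming conventions, where the mapper and the reducer read from standard input and write <key,value> pairs as tab-separated lines. This is a minimal sketch rather than the reference implementation from [245].

import sys
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Input arrives sorted by key (the shuffle/sort guarantee), so consecutive
    # lines with the same word can be aggregated.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    # Run as: wordcount.py map     (map phase)
    #         wordcount.py reduce  (reduce phase)
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)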
Once the process of copying is finished, the data is merged at the reducer side. This merging
allows a reducer to have an aggregated and sorted input from all the related mappers. The
process of merging at the reducer is called sorting [155].
Many organizations such as Facebook [353], Yahoo, and Adobe have implemented
Hadoop-based solutions. Google has implemented a proprietary version of MapReduce,
where it is used for the generation of data for production web search service, sorting, ma-
chine learning, data mining, and many other systems [154].
We will now study a few examples which will explain the use of MapReduce.
Example 4.7 (Counting Items from a list). We have seen an example of the Word-
Count problem, where MapReduce can be used to count occurrence of words in a
corpus. A similar approach can be applied to count items in a list.
For instance, for a grocery store it is important to determine the number of items
sold in a day. This information can be used to forecast sales and manage inventory.
MapReduce can be used for this purpose.
Figure 4.10 illustrates the usability of MapReduce in determining number of vegeta-
bles sold in a day.
By default, MapReduce does not sort the data by values. Sorting by values can be incorporated
by introducing a composite key, i.e., by integrating the key and the value together as a new key.
class Reducer
    procedure Reduce(term t, postings [(n1, f1), (n2, f2), (n3, f3) ...])
        P = new List
        for all posting (a, f) belongs to postings [(n1, f1), (n2, f2), (n3, f3) ...] do
            Append(P, (a, f))
        Sort(P)
        Emit(term t, postings P)
Example 4.9 (Inverted Indexing). Figure 4.11 shows inverted indexing over a small
corpus. Mappers emit keywords and a composite value consisting of document id and
no. of occurrences as <key,value> pairs. These are combined to produce aggregated
response at the reducer. In the figure, keys are represented as bold text.
Example 4.10 (Computing Inlinks and Outlinks). For search engines, the rank of a
web page depends upon a number of factors. One of the most prominent factors in
computing page rank is the number of incoming links to a web page. These are also
called inlinks. MapReduce can be used to compute inlinks.
Figure 4.12 shows a sample input of four webpages A,B,C, and D. The figure also
shows incoming links (inlinks) and outgoing links (outlinks) for each webpage.
In figure 4.13, two mappers are created. Each mapper computes outlinks for two web
pages. The mappers emit outlinks as <key,value> pairs.
Figure 4.14 shows the execution of the reducer. The <key,value> pairs are sent to the
reducer. The output of the reducer is the aggregated value against each URL. These
values represent the inlinks for the URL.
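One way to realize Example 4.10 in code is sketched below in plain Python: the mapper inverts each (page, outlinks) record into (target, source) pairs, and the reducer collects the sources for each target, which are exactly its inlinks. The four-page link structure is invented for illustration and does not claim to reproduce figure 4.12.

from collections import defaultdict

# Hypothetical link structure: page -> list of outlinks.
pages = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
}

def map_outlinks(page, outlinks):
    # Emit (target, source) for every outgoing link.
    for target in outlinks:
        yield target, page

def reduce_inlinks(mapped_pairs):
    # Group the sources by target page; the grouped sources are the inlinks.
    inlinks = defaultdict(list)
    for target, source in mapped_pairs:
        inlinks[target].append(source)
    return dict(inlinks)

pairs = [kv for page, links in pages.items() for kv in map_outlinks(page, links)]
print(reduce_inlinks(pairs))   # e.g., {'B': ['A'], 'C': ['A', 'B', 'D'], 'A': ['C', 'D']}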
1. Reducer Side Join: We should recall from our previous study that in Hadoop the
<key,value> pairs emitted by different mappers are merged and sorted during the shuffle
and merge phase. In this manner, values with identical keys are accumulated together.
Reducer side join operations are built on this characteristic.
Example 4.11 (Reducer Side Join). Figure 4.15 illustrates the reducer side
join operation on a students' dataset. The query is to select all students who
have taken the OS course and have CGPA <= 3.0. There are two mappers.
The first mapper selects the students whose CGPA <= 3.0. It emits the student
ID as the key and the CGPA as the value. The second mapper selects students and
their courses. It emits the student ID as the key and the course as the value.
At the reducer, it can easily be determined which students fulfill both
criteria (i.e., CGPA <= 3.0 and have taken the OS course).
Reducer side join operations are effective in joining two related datasets. However,
they may suffer from a large number of <key,value> pairs being generated; in this
case, the network may become the bottleneck. Example 4.11 explains the reducer side
join operation, and a small Python sketch of the same pattern is shown after this list.
2. Mapper Side Join: Mapper-side join operation can be implemented by merging the
two relations (or datasets) in the map phase. The advantage of map-side join is that
the overhead of a large number of keys being emitted during the shuffle phase can be
avoided.
To understand the map-side join, consider that two datasets are initially
sorted on the join key. During the map phase, the two datasets can be merged and
the output can be emitted.
3. In-Memory Join: If the dataset is not sorted on the join key and the size of the
data is small, then the join can be handled entirely in memory.
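The following plain-Python sketch mimics the reducer-side join of Example 4.11: records from the two datasets are tagged in the map phase, grouped by student ID in a simulated shuffle, and the reducer keeps only the students that satisfy both criteria. The student data is invented for illustration.

from collections import defaultdict

# Hypothetical input datasets.
cgpa_records = [("s1", 2.8), ("s2", 3.6), ("s3", 2.5)]        # (student_id, CGPA)
course_records = [("s1", "OS"), ("s2", "OS"), ("s3", "DB")]   # (student_id, course)

def map_phase():
    # Tag each record with its source so the reducer can tell them apart.
    for sid, cgpa in cgpa_records:
        if cgpa <= 3.0:                        # first mapper's filter
            yield sid, ("CGPA", cgpa)
    for sid, course in course_records:         # second mapper emits all courses
        yield sid, ("COURSE", course)

def shuffle(pairs):
    # Simulate the shuffle/merge phase: group values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Keep students that have both a qualifying CGPA record and the OS course.
    for sid, values in grouped.items():
        has_cgpa = any(tag == "CGPA" for tag, _ in values)
        has_os = any(tag == "COURSE" and v == "OS" for tag, v in values)
        if has_cgpa and has_os:
            yield sid

print(list(reduce_phase(shuffle(map_phase()))))   # ['s1']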
4.5 HBASE
HBase [176] is a NoSQL storage system built on top of HDFS, designed for fast, random
access to large volumes of data. It is a non-relational, open-source, distributed,
column-oriented data store, which is modeled after Google's Bigtable [128]. It is fault
tolerant and provides horizontal scalability. It is written in Java and is maintained by
engineers from diverse organizations under the framework of the Apache Software Foundation.
Hadoop/HDFS compared with HBase:
Hadoop/HDFS: Provides a file system for distributed storage. HBase: Provides a column-oriented data store capability.
Hadoop/HDFS: Data model is not flexible. HBase: Provides a flexible schema model.
Hadoop/HDFS: Optimized for workloads with sequential reads and writes, not for random access of files. HBase: Optimized for tabular data workloads with random reads and writes.
Hadoop/HDFS: Offline system optimized for throughput. HBase: Online system driven by low latency.
Hadoop/HDFS: Uses a write-once, read-many-times (WORM) access model. HBase: Optimized for both read-many and write-many workloads; supports real-time random read/write access.
Hadoop/HDFS: Mainly used for batch processing use cases in which data is accessed only in a sequential manner. HBase: Used for use cases where the system needs quick random access to huge data.
Hadoop/HDFS: Data is primarily accessed through MapReduce jobs. HBase: Provides a command-line shell and supports Java API, Thrift, and RESTful web services for programmatic access.
To overcome these challenges, we can layer HBase over HDFS for applications that require
real-time, random read/write access to very large datasets.
HBase is a database built on top of Hadoop and was developed to support the storage of
structured data, taking advantage of HDFS and other distributed file systems.
In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data
store which has the ability to store and retrieve both structured and semi-structured data
for use cases where response time is critical. It is also suitable for random reads and writes
and is known for high write throughput.
HDFS gives HBase a storage layer providing availability and reliability. Although
HBase primarily works with HDFS to store data, it can also work on other file systems.
Table 4.2 provides a comparison between HDFS and HBase:
1. HMaster: HMaster, which typically runs on the Hadoop NameNode, is mainly responsible for
managing the cluster state and performing administrative operations such as failover,
fallback, and load balancing. HMaster is also responsible for managing Region-
Servers, such as assigning regions to different region servers.
2. RegionServers: HBase tables are divided into small chunks, called Regions, which are
distributed across multiple servers. The servers that host these regions are known as
RegionServers. A table can have many regions across RegionServers and each Region-
Server hosts several regions. The RegionServers run on the machines where DataNodes
run.
1. Standalone: In the standalone mode, all HBase services run in a single Java process
and use a local disk to store tables and data. This mode doesn't require running any
other external service (HDFS, Zookeeper, etc.) and is suitable for exploration, testing,
and local development.
2. Pseudo-distributed: In the pseudo-distributed mode, all HBase services (HMaster,
HRegionServer, and Zookeeper) run on a single machine but as separate Java pro-
cesses. This mode uses the HDFS file system and is also suitable for testing and local
development.
3. Full-distributed: In the full-distributed mode, an HBase cluster is set up on multiple
nodes and all HBase services run under different JVMs on these cluster nodes. This
mode uses HDFS as an underlying file system and is designed for use in the production
environment.
In HBase, columns are grouped together into sets called column families, which are stored
together on HDFS. Unlike a relational database, HBase is schema-less and only requires
column families to be defined up front. Although the column families are fixed, new columns
can be added to a family at any time. This makes the schema flexible and allows HBase to
rapidly adapt to changing requirements.
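As a small illustration of column families in practice, the sketch below uses the third-party happybase library, which talks to HBase through its Thrift gateway, to create a table with one column family and to write and read a row. The table name, column names, and connection settings are assumptions made for this example, and the HBase Thrift server must be running.

import happybase  # pip install happybase; requires the HBase Thrift server

connection = happybase.Connection('localhost')          # assumed Thrift host
connection.create_table('students', {'info': dict()})   # one column family named 'info'

table = connection.table('students')
# Columns are addressed as 'family:qualifier'; new qualifiers need no schema change.
table.put(b'row-s1', {b'info:name': b'Alice', b'info:cgpa': b'2.8'})

row = table.row(b'row-s1')
print(row[b'info:name'], row[b'info:cgpa'])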
We compared the ACID properties model of RDBMS in section 1.3.2. HBase provides
row-level atomicity but doesn't provide transactions or ACID compliance over multiple rows,
and it is therefore not an ACID-compliant database. It does provide some guarantees, and
these ACID semantics are described in reference [10].
HBase became a top-level Apache project in 2010 and continues to flourish and grow;
it is widely used by many big organizations such as Google, Facebook, and Twitter.
Facebook used it for its messaging platform, Twitter used it for people searches, and
Apache used it for maintaining its wiki. Reference [11] lists a number of organizations
with their use cases.
Similar to HBase, Accumulo [1] is another high-performance, open-source, distributed
key-value data store based on Google's Bigtable design [128]. It was developed by the
National Security Agency (NSA) in 2008 and became an open-source project in 2011. It uses
HDFS as a storage layer, just like HBase, and is designed to store up to trillions of records.
Both HBase and Accumulo are column-based NoSQL systems. In chapter 7, we will
study further details about NoSQL systems.
7. What are the advantages of having multiple reducers? How can multiple reducers be
implemented using MapReduce?
8. What are the limitations of MapReduce in processing iterative operations?
9. Explain the pros and cons of having large block size in HDFS.
10. Explain the purpose of the combiner function in MapReduce. What are the advantages of
the combiner function? Will the combiner function always be an exact replica of the reducer function?
11. Explain data localization. How does Hadoop ensure data localization?
GLOSSARY
Accumulo: It is a high-performance open-source distributed <key,value> data store based
on Google's Bigtable design [128]. It uses HDFS as a storage layer and is designed to
store up to trillions of records.
Ambari: It is an open-source widely used graphical management tool, which has been de-
signed to manage and monitor Hadoop clusters.
Blocks: It is the minimum amount of data that HDFS can read or write at a time.
Chunk: It is a portion of data, which is allocated to a mapper. Chunks are computed in
parallel during the Map Phase.
Cluster: It is a collection of systems that work together to either solve
common computing problems or provide large storage.
Combiner: It is a process in the MapReduce model, which is used to perform local aggre-
gation at the mapper node.
Data Locality: It is a concept employed by the Hadoop system in which compute nodes are
selected based on the locality of data. This reduces time to transfer large amount of
data.
DataNode: It is the worker node controlled by NameNode. It is responsible for storing
application data.
HBase: It is a distributed, scalable, NoSQL column family storage system, which is built
on top of HDFS and provides schema flexibility.
HDFS: It stands for Hadoop Distributed File System. It is a distributed file system designed
for storing large datasets that run on commodity hardware.
Heartbeat: It is a special message in which each task tracker node periodically sends its
health report to the job tracker node.
Hive: It is an SQL-like language that presents data in tabular form for Hadoop. It en-
ables Hadoop to be used as a data warehouse by providing capability for querying and
managing large datasets.
Impala: It is an open-source massively parallel processing SQL query engine designed to
run on Hadoop platforms.
Job Tracker: It is a monitoring process which runs on the master and monitors the Task-
Tracker applications which are running tasks at each worker node.
Mahout: It is a scalable machine learning and data mining library that is built on top of
MapReduce.
Map: It is a phase in the MapReduce programming model in which each node computes
independently and in parallel to other mapper nodes on a portion of data.
MapReduce: It is a widely used general-purpose programming framework for processing
huge quantities of data in parallel on multiple nodes in a computer cluster.
NameNode: It is the master node which manages HDFS file namespaces and serves as a
single point for all types of coordination on HDFS data. It maintains the health of each
DataNode through the heartbeat mechanism.
Node: It refers to an individual system or a host within a cluster.
Oozie: It is a scheduler and workflow engine for creating MapReduce jobs in Hadoop.
Partitioner: It is a process in the MapReduce model which enables the use of multiple
reducers. It distributes <key,value> pairs by determining the specific reducer to which each pair is sent.
Phoenix: It provides a SQL layer over HBase for low-latency applications.
Pig: It is a high-level language that allows users to express their data processing and analysis
operations on top of the MapReduce framework using a simple scripting language.
Reduce: It is a phase in the MapReduce programming model in which reduce operation is
applied on one or a few nodes.
Secondary NameNode: It is a secondary node, which keeps a backup of active NameNode
data periodically.
Straggler Detection: It is a process in MapReduce programming model in which slow nodes
are detected through heartbeat messages.
Task Tracker: It is a process which runs on data nodes and is responsible for reporting the
progress of tasks assigned by the Job Tracker.
Zookeeper: It is a tool to manage and synchronize configuration and is a highly available
and reliable distributed configuration service.
CHAPTER 5
Enhancements in Hadoop
CONTENTS
5.1 Issues with Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 YARN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.2.1 YARN Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3 Pig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.1 Execution Scenarios and Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.2 Architecture of Pig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.3 Optimizations in Pig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.4 Running our First Pig Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5 Dremel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6 Impala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.7 Drill . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.8 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.8.1 Sqoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.8.2 Kafka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.9 Ambari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.10 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.11 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.12 Exercise Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
jobs. The lack of a dedicated resource scheduler turned out to be a major limitation of
Hadoop v1. Following are the major limitations of Hadoop 1.x:
The above mentioned limitations were the motivating factors behind Hadoop v2. Fig-
ure 5.1 illustrates a layered model of the Hadoop v2 ecosystem. Two distinctive features
of Hadoop v2 include the introduction of YARN, in order to decouple resource scheduling
from MapReduce, and the introduction of Spark. Spark and its components are explained in
chapter 6. We will now look into YARN components and its architecture.
5.2 YARN
To address the limitations of Hadoop 1.x, YARN (Yet Another Resource Negotiator) has
been introduced. Its main purpose is to decouple the process of resource monitoring and
resource scheduling from the MapReduce framework. It splits the resource management
and job scheduling/monitoring responsibilities of JobTracker into separate daemons. Un-
like JobTracker, it doesn’t care about the data flow of the application and only manages
resources.
With this new architecture, YARN provides a framework for managing both MapReduce
and non-MapReduce tasks. It can handle tasks of greater size and complexity. It can also
support simultaneous execution of different applications in a cluster (such as MapReduce
and streaming applications). YARN also improves efficiency of the Hadoop framework and
enhances its resource-sharing capabilities. Different distributed processing frameworks such
as MapReduce, Impala, Spark, and Tez can utilize YARN.
We will now study different components of YARN:
architecture includes a Resource Manager, a Node Manager, and Application Masters. Ex-
ample 5.1 explains the functionality of YARN and its components.
2. Application Master:
The Application Master is responsible for the life cycle management of every applica-
tion. It is the framework-specific process launched by the Resource Manager for each
application submitted by a client, and it coordinates the application's execution on the
cluster. Each application has its own unique Application Master, which is responsible
for negotiating appropriate resource containers from the Resource Manager on behalf of
the application and works with the Node Manager(s) to execute and monitor tasks.
3. Node Manager:
The per-node Node Manager is a worker daemon that runs on each worker node in
the cluster. It is responsible for managing the life cycle of containers: it launches,
monitors, and controls the application containers assigned to it by the Resource
Manager and tracks their resource usage. The Node Manager registers with
the Resource Manager on start-up and sends information about the resources avail-
able on the node. Each Node Manager also periodically sends its liveness information to
the Resource Manager.
4. Container:
A container represents a unit of resource allocation on a single node where the Applica-
tion Master and application-specific tasks run. There can be multiple containers on a
single node in a given cluster. A container is scheduled by the Resource Manager and
supervised by the Node Manager. A small sketch of querying the Resource Manager
programmatically follows this list.
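As a hedged illustration, the Resource Manager exposes a REST API that reports cluster state and running applications. The sketch below queries it with the requests library, assuming the Resource Manager web interface is reachable on the default port 8088 of a host named rm-host; both the host name and the port are assumptions for this example.

import requests  # pip install requests

RM = "http://rm-host:8088"   # assumed Resource Manager address

# Cluster-wide metrics (number of active nodes, memory, running applications, ...).
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()
print(metrics["clusterMetrics"]["activeNodes"])

# List applications currently in the RUNNING state.
apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["queue"])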
YARN is integrated with Hadoop v2 and is used to schedule jobs for various clients. We
can find the details of the number of projects that are powered by YARN in reference [174].
5.3 PIG
Apache Pig is an important component of Hadoop. It is used for analyzing big data using
scripting and easy to use language. An important characteristic of Pig is that it can incor-
porate parallelization and can support a very large dataset. A Pig program is written in a
scripting language called Pig Latin.
1. No. of MapReduce Jobs: A Pig script may generate a number of MapReduce jobs.
If a Hadoop cluster is available, each job may take 5 to 15 seconds to start. However,
if the cluster is busy or unavailable, the job start time may increase.
Further, MapReduce jobs may be dependent on each other. In case of dependency, a
MapReduce job may not start until all the jobs it depends on have finished [174].
Therefore, Pig strives to reduce the number of MapReduce jobs.
2. Key Skew: MapReduce can speed up a Pig program by running different tasks in
parallel. However, during the reduce phase, it is possible that the keys are
not well distributed. If keys are skewed, they may not be evenly distributed to different
reducers by the partition function. Pig solves this problem of key skew by using a skew
join. Using this join operation, a key with the highest number of records is split and
distributed to more than one reducer.
3. Data Shuffling: The amount of data transfer from mapper to reducer could consume
a lot of time. The size of data can be reduced by compression or by reducing the
number of <key,value> pairs emitted.
Function Purpose
AVG Computes the average of numbers.
CONCAT Concatenates two fields (either character or byte arrays).
COUNT Computes the number of elements in a bag, ignoring null values.
COUNT_STAR Computes the number of elements in a bag, including null values.
DIFF Compares two fields in a tuple and returns the values that appear in one but not the other.
IsEmpty Checks if a bag or map is empty.
MAX Computes the maximum value.
MIN Computes the minimum value.
SIZE Returns the number of elements.
SUM Computes the sum.
TOKENIZE Splits a string into tokens.
The main benefit of Pig is that it provides flexible and distributed data processing using
an easy-to-use programming environment. We will now study a sample code of Pig.
Pig has a strong library of built-in functions for many operations. Tables 5.1, 5.2,
and 5.3 list the built-in functions for evaluation, mathematics, and string computation,
respectively [49].
5.4 HIVE
Hive is a data warehousing system, which has been developed on top of Hadoop. Similar to Pig,
the original rationale behind Hive was to provide a translation layer over MapReduce. This
allows a developer to write a script for data analysis tasks; the script is translated to
MapReduce tasks, and therefore exploits all the benefits of Hadoop, including scalability.
Function Purpose
ABS Computes absolute value.
ACOS Computes inverse cosine.
ASIN Computes inverse sine.
ATAN Computes inverse tangent.
CBRT Computes cube root.
CEIL Computes ceiling value.
COS Computes cosine value.
COSH Computes hyperbolic cosine.
EXP Computes Euler's number e raised to the power x.
FLOOR Computes floor value.
LOG Computes natural log.
LOG10 Computes Log base 10 value.
RANDOM Generates a pseudo random number.
ROUND Computes rounded value.
SIN Computes the sine value.
SINH Computes hyperbolic sine.
SQRT Computes square root.
TAN Computes tangent.
TANH Computes hyperbolic tangent.
Function Purpose
INDEXOF Returns the index of the first occurrence of a character in a string.
LASTINDEXOF Returns the index of the last occurrence of a character in a string.
LCFIRST Converts the first character in a string to lower case.
LOWER Converts all characters in a string to lower case.
REGEX_EXTRACT Matches a regular expression and returns the matched group specified by the index parameter.
REGEX_EXTRACT_ALL Matches a regular expression and returns all matched groups.
REPLACE Replaces existing characters in a string with new characters.
STRSPLIT Splits a string according to the given regular expression.
SUBSTRING Returns a substring from a given string.
TRIM Trims a string to remove trailing and leading white spaces.
UCFIRST Returns a string with the first character converted to upper case.
UPPER Converts a string to upper case.
Both Hive and Pig provide alternative interfaces to MapReduce. Hive was developed
at Facebook about the same time Yahoo was developing Pig and was the first to offer a
SQL-like query language, called Hive Query Language (HiveQL) for querying structured
data stored in a Hadoop cluster.
There are two main goals for Hive [200]:
Figure 5.4 shows the architecture of Hive. We now assess the major components of the
architecture:
1. CLI and HiveServer2: CLI and HiveServer2 (HS2) provide the two interfaces through
which a user can submit Hive statements. The CLI is the command line interface, which
enables a user to write statements on the command line, whereas HS2 is a process
that enables multiple clients to submit their jobs (statements) to the Hive system. The
HS2 server implements Kerberos [283] for user authentication.
2. Driver: The Driver program is responsible for parsing the statement and generating
an Abstract Syntax Tree (AST). This tree contains a series of steps which are needed
to be executed in the statement.
3. Planner: Query planner analyzes queries which are used for query processing and data
retrieval. The planner assembles the operator tree to represent data operations of this
query. The planner also optimizes the query for faster execution.
4. Executor: The executor breaks the operator tree to multiple stages represented by
executable tasks.
5. MetaStore: Metastore stores system catalog and metadata about tables, columns, and
partitions. This information is stored in an RDBMS.
6. Data Processing Engine: The data processing engine is responsible for executing the
query. For instance, in case of MapReduce the generated task is executed as MapRe-
duce tasks.
7. SerDe: It is a serialization and deserialization library which is used to serialize and
deserialize data.
8. ORC File: Optimized Row Columnar file. It is a file format which provides higher
efficiency in reading, writing, and processing data. For instance, lightweight indexes
are stored within a file. Further, a single file is generated as the output of each task, which
reduces the load on the namenode.
9. Storage Handler: A storage handler is used to process data stored in other systems.
Hive implements a separate storage handler for each system. For instance, HBase
storage handler is used when data is read from or written to HBase.
HDFS and Hive are both not designed for OLTP workloads and therefore do not pro-
vide the critical features required for OLTP. If we need OLTP features for large-scale data,
we should consider a NoSQL database like HBase, which we have discussed in section 4.5.
On the other hand, HDFS and Hive are best suited for batch jobs over large datasets that
need to scan enormous amounts of data. Hive is also most suitable for data warehouse
applications where data is relatively fixed and fast response times are not required for
analysis.
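For programmatic access, HiveQL statements can also be submitted to HiveServer2 from Python, for example with the third-party PyHive library. The host, port, user, and table in this sketch are assumptions made only for the example.

from pyhive import hive  # pip install 'pyhive[hive]'

# Connect to an assumed HiveServer2 instance.
conn = hive.Connection(host="hs2-host", port=10000, username="analyst", database="default")
cursor = conn.cursor()

# HiveQL is translated into jobs on the underlying execution engine.
cursor.execute("SELECT product, SUM(quantity) FROM sales GROUP BY product")
for product, total in cursor.fetchall():
    print(product, total)

cursor.close()
conn.close()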
5.5 DREMEL
Dremel [264] is a large-scale platform for big data analytics, which supports interactive
querying. The system is scalable to thousands of commodity machines and can perform
interactive analysis in a short time.
Dremel is a columnar database; that is, it stores data in columns rather than rows.
In section 7.3.1, we discussed the difference between column-oriented and row-oriented
databases.
After the Google Dremel paper [264] was published in 2010, it inspired a number of tools
for supporting interactive queries, including:
1. BigQuery: It is a Google product built on Dremel, offered as a big data analysis tool [19].
It is a petabyte-scale, serverless, interactive, and ad hoc query engine that allows users to
conduct analysis of large datasets.
2. Impala: It is an open-source, Massively Parallel Processing (MPP) query engine, which
aims to provide fast, interactive queries over data stored in HDFS or Hive [28].
3. Drill: It is an open-source, low-latency distributed SQL query engine for interactive
analysis of large-scale datasets stored in HDFS, NoSQL databases, or cloud storage [5].
4. Presto: Developed by Facebook, it is an open-source, distributed SQL query engine
for analyzing large datasets [52]. It enables SQL access to many data sources, including
relational or NoSQL databases, distributed object storage (Amazon S3, Google Cloud
Storage), Hive, Kafka, and so on.
We will now discuss Impala and Drill in some more details.
5.6 IMPALA
Impala is a modern, open-source massively parallel processing (MPP) query execution en-
gine [100]. It is written in C++ and is inspired by Google’s Dremel [264], designed for
executing low-latency queries to data stored in Hadoop.
It is built on top of HDFS and inherits the attributes of fault tolerance, parallelism,
and data locality. It can be executed on many machines on an existing Hadoop cluster and
provides the ability to execute queries with low latency. It does not use MapReduce at all;
instead, it uses an in-memory processing engine and also utilizes the data locality feature of
Hadoop processing. It uses a columnar data format, similar to Dremel, for structuring data
at rest.
The Impala architecture consists of following services:
1. Impala daemon (Impalad): A daemon which is executed on all the worker nodes. It
accepts queries from client processes and orchestrates them on the cluster. A daemon
becomes the query coordinator for a query it initiates; a daemon can also execute query
fragments on behalf of other nodes. A daemon can exploit data locality, as a separate
instance of the daemon is deployed on each HDFS data node.
2. StateStore daemon (Statestored): The StateStore daemon is responsible for dissemi-
nating cluster-wide metadata to all the processes. This data is useful for implementing the
MPP architecture. The StateStore daemon manages topic-wise membership of nodes.
It sends two types of messages to nodes: keepalive messages to check the health of
nodes, and topic updates (including new entries, modified entries, and deleted entries)
to all the subscribed nodes.
3. Catalog daemon (Catalogd): The Catalog daemon is responsible for relaying metadata
updates to all the Impala daemons in the cluster.
Figure 5.5 illustrates the functionality of Impala. A query is sent to an Impala
daemon. The query planner receives the query and sends it to the query coordinator, which
orchestrates it to other nodes. The query executor executes the query and fetches the results
from other nodes in the cluster. It then aggregates the results and sends them back to
the client. Impala can store data on HDFS data nodes.
Impala supports a subset of the HiveQL language specification, therefore most Hive
queries can be executed using Impala.
Impala can utilize different components of Hadoop. For instance, similar to Hive, Impala
keeps its table definitions in metastore (traditional MySQL or PostgreSQL database). This
permits Impala to access tables loaded or defined by Hive, provided they use file formats
and data types supported by Impala.
Impala is well suited for use cases when we need a real-time interactive querying capa-
bility. It is also most suitable for ad-hoc query access to data in Hadoop.
5.7 DRILL
Apache Drill [5] is an open-source, low-latency query engine for Hadoop. It is a scalable
and distributed system to provide interactive query services using hundreds to thousands
of servers [193]. Apache Drill is inspired to respond to ad-hoc queries with low-latency and
is the direct open-source implementation of Dremel [264].
The high level architecture of Drill consists of following layers:
1. User Interface: This layer is responsible for connecting a user with the core process-
ing engine of Drill. The interface could be provided through different means such as
command line interface, JDBC connectivity, or REST API.
2. Processing: This layer is responsible for processing the query distributed over multiple
nodes.
3. Data Sources: This is the lowest layer of the system. It consists of the different sources
of data, such as HDFS, MongoDB, Cassandra, HBase, etc.
Figure 5.6 shows the flow of query execution in Drill. Drill supports multiple data sources,
including HDFS, HBase, and MongoDB, and various file formats, such as CSV, JSON, and
Parquet, as illustrated in figure 5.7. Moreover, a single query can access and join data from
multiple sources. Unlike relational databases, which require users to define a schema before
adding data, Drill does not require any schema to be defined before query execution.
FIGURE 5.7 SQL support for various data sources with Drill
Drill is easy to use and supports standard ANSI SQL syntax, which allows users to query
the data using standard SQL and BI tools such as Tableau or MicroStrategy. It is best
suited for use cases when querying data stored in Parquet format.
Tool Description
Flume [7] It is an open-source, distributed, reliable, and highly available service
for efficiently ingesting high-volume streaming data, particularly log
data, from multiple sources into data storage solutions like HDFS.
Sqoop [61] It is a popular tool for transferring data between structured data
stores such as relational databases and Hadoop. It is designed for an
efficient transfer of bulk data between Hadoop data stores like HDFS
or HBase, and relational databases.
Storm [16] It is an open-source, distributed, and real-time computational frame-
work for processing of unbounded streams of data.
Kafka [32] It is an open-source, distributed streaming platform, widely used
as publish/subscribe messaging system. It can be used to build a
high-performance, real-time robust data streaming pipeline between
applications.
NiFi [13] It is an open-source, real-time data ingestion tool, which enables the
automation of data flow between systems in an efficient and reliable
way. It is a data flow management system based on flow-based pro-
gramming and comes with an easy-to-use web user interface to manage
data flows.
Flink [6] It is an open-source distributed processor for stream and batch data
processing. The core of Apache Flink is a distributed streaming
dataflow engine which supports streaming one-at-a-time and pro-
vides fault tolerance capabilities. It has intuitive and expressive APIs
available in Java and Scala.
Samza [14] It is an open-source distributed stream processing framework that
provides scalable and durable stateful stream processing. It has a
simple API. It utilizes Kafka for messaging and YARN to provide
fault tolerance and task scheduling.
DistCp [8] Hadoop comes with a useful tool called DistCp, which provides an
easy and efficient way to copy large amounts of data to and from
Hadoop filesystems in parallel. It uses MapReduce underneath to
copy in a distributed manner. We can use DistCp for use cases such as disaster
recovery, backups, or data migration.
5.8.1 Sqoop
Originally developed by Cloudera, Sqoop [61] is an Apache Hadoop top-level project, which
provides a single command line interface application. It optimizes data transfers between
Hadoop and any other data stores, particularly relational databases.
Transferring data to and from any data store is a challenging and difficult job, and
requires careful testing and error handling. Sqoop, however, has become a useful tool in the
industry for this purpose. It is designed for efficient bulk transfer of data between Hadoop
data stores like HDFS or HBase, and relational databases. With Sqoop, we can extract data
from relational databases or data warehouses, process it using MapReduce, or higher-level
tools like Hive and Pig, and then export the data back to an RDBMS or a data warehouse
for further consumption by other clients. It supports all the leading relational database
systems, such as PostgreSQL, MySQL, Microsoft SQL Server, and Oracle, and can also
transfer data from a variety of enterprise data warehouses, such as Teradata and Netezza.
Since Sqoop uses MapReduce to import data and perform processing in the Hadoop
cluster, it exploits the full advantage of the parallel processing capabilities of
Hadoop. As a result, Sqoop provides fast data transfer performance by performing data
loading tasks in a parallel manner.
Sqoop is easy and relatively quick to configure and use. It can be downloaded from the
Apache site [62]. The distribution contains a detailed installation guide as well as code
and documentation. Sqoop can be installed on a user's computer or on any node in a
Hadoop cluster.
Having studied the basic functionality of Sqoop, we will now learn from example 5.3
some useful commands in order to understand its operations.
structured data source on an ad hoc or scheduled job basis into a Hadoop data store like
HDFS, Hive, or HBase data warehouse for further processing and analysis. More information
and extensive documentation on Sqoop is available at reference [61].
5.8.2 Kafka
Apache Kafka [12, 32, 229] is a distributed streaming platform. It is used to build real-
time data streaming pipeline between applications. Through this pipeline, applications can
produce and consume streams of data. In addition, it can also be used to develop real-time
streaming applications.
In this section, we will study a few important concepts, which will help us to understand
the functionality of Kafka and its usage.
1. Messages: A message (or record) represents a data item in Kafka. It consists of a key,
a value, and a timestamp.
2. Producer: A producer is an entity which publishes messages to topics.
3. Consumer: A consumer is an entity which consumes or processes messages. A consumer
can subscribe to multiple topics. Consumers can poll brokers to pull messages related
to topics.
4. Broker: In Kafka, records are stored in servers called brokers. Messages are replicated
on brokers.
5. Topic: A topic is a collection of messages; a group of similar messages is
called a topic. A topic can have any number of consumers: zero, one, or many. Topics
are stored on the brokers of a Kafka cluster. A topic may also be partitioned such that messages of a particular topic
could be stored on multiple brokers. Partitioning also promotes load balancing – a key factor
for performance.
Kafka supports publishing and retrieving messages simultaneously through multiple pro-
ducers and consumers. Kafka also maintains consumer groups. Each group includes a set
of consumers, which are working on related topics. Each consumer reads messages from a
unique set of partitions.
Kafka is based on a publish-subscribe model. That is, messages are broadcast to multiple
consumer groups. However, within the same group, a message is delivered and processed by
a single consumer only. This is implemented such that each consumer reads from a unique
set of partitions. Note that this feature is in contrast to other queuing-based systems, where
a message is only delivered to the consumer which reads and pops the message from the
queue.
By assigning partitions to consumers, Kafka allows messages to be processed or con-
sumed in order. Further, it also balances load among consumers. However, the number of
consumers cannot exceed the number of partitions.
Producers write data to topics which are read by consumers. Producers operate in a
push mode, whereas consumers operate in a pull mode. Unless a message is pulled by
a consumer, a message is stored at a broker server. Each message may have some time
duration associated with it. If a message is not pulled by a consumer during that time, then
it is discarded after the expiration of its time duration.
In Kafka, each partition has a set of records. Each record is uniquely identified by an
offset. This identifier is used to order messages within a partition. However, this ordering is
only applicable within the same partition.
Reads and writes in Kafka operate independently. A shared buffer is maintained at
each broker, where producers write data at the end of the buffer. In the buffer, each consumer
maintains its own offset for reading. The offset can be arbitrarily changed by the consumer
as well. That is, a consumer may read messages in any sequence. Figures 5.9 and 5.10
illustrate the concept of reads and writes in Kafka.
In Kafka, keys can be of great significance. Messages with the same key are guaranteed to
be delivered to the same partition. Consequently, messages with the same key arrive at the
same consumer.
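To make these concepts concrete, the following is a minimal sketch of a keyed producer and a consumer-group consumer using the third-party kafka-python client; the broker address, topic name, and group id are assumptions for illustration.

from kafka import KafkaProducer, KafkaConsumer

# Producer: messages with the same key are routed to the same partition
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("temperature-readings", key=b"sensor-42", value=b"27.5")
producer.flush()

# Consumer: part of a consumer group; reads from a unique set of partitions
consumer = KafkaConsumer(
    "temperature-readings",
    bootstrap_servers="localhost:9092",
    group_id="monitoring-group",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.partition, record.offset, record.key, record.value)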
Since Kafka provides replication and load balancing, it can be considered as a distributed
file system. Each write in Kafka is acknowledged – ensuring high reliability.
Kafka was originally developed at LinkedIn and became a top-level Apache project in 2011.
It is widely used as a messaging system by many organizations, such as LinkedIn [33, 227],
Twitter [26], Netflix [34], and Yahoo [30]. Use cases for Kafka include log aggregation,
website activity monitoring, and stream processing. Reference [31] lists a number of orga-
nizations with their use cases.
5.9 AMBARI
As the size of Hadoop clusters grew, the need for efficient monitoring arose. Apache Am-
bari [3, 363] is an open-source tool for monitoring and management of a cluster. It is the
most popular open-source tool for Hadoop monitoring and supports activities such as
provisioning, monitoring, and managing Hadoop clusters. Ambari is an open frame-
work, which can be integrated with different types of clusters including Hadoop. It uses
Ganglia [261] – a distributed monitoring tool for clusters of machines.
Ambari supports the management of the Hadoop components such as HDFS, MapRe-
duce, Pig, HBase, Zookeeper, and Hive. Figure 5.11 illustrates the fundamental architecture
of Ambari. It consists of an ambari server, ambari web, and a few ambari monitoring agents.
The main components are explained below:
1. Ambari Agents: The agents are deployed separately on each node of the cluster. They
are responsible for monitoring different attributes such as CPU usage, RAM, and
network characteristics and sending the measured attributes to the ambari server.
2. Ambari Server: The ambari server consists of the following components:
(a) An RDBMS to store the cluster configurations. Ambari supports the popular
databases such as PostgreSQL, Oracle, and MySQL.
(b) Nagios service to provide alerts and notifications. This service is optional.
(c) REST API, which integrates the ambari web server.
(d) An authorization provider which can control user access.
3. Ambari Web Server: It is integrated to provide web-based management and user
access. It integrates with the REST APIs provided by ambari server.
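Since the ambari web server and other clients talk to the ambari server through its REST API, cluster state can also be inspected programmatically. The sketch below is a minimal example using Python's requests library; the host, port 8080, credentials, and the cluster name hadoop_cluster are assumptions, and the /api/v1 paths follow Ambari's documented REST interface.

import requests

AMBARI = "http://ambari-server.example.com:8080/api/v1"
auth = ("admin", "admin")              # assumed default credentials
headers = {"X-Requested-By": "ambari"}

# List the clusters managed by this Ambari server
clusters = requests.get(AMBARI + "/clusters", auth=auth, headers=headers).json()
print(clusters)

# Fetch the services of an assumed cluster named "hadoop_cluster"
services = requests.get(
    AMBARI + "/clusters/hadoop_cluster/services", auth=auth, headers=headers
).json()
for item in services.get("items", []):
    print(item["ServiceInfo"]["service_name"])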
Similar to Ambari, Cloudera provides the Cloudera Manager tool, which performs similar
functions and can be used to monitor and manage Hadoop clusters.
Apache Pig, Apache Hive, Apache Impala, and Apache Drill along with their requirements.
We also covered two popular frameworks for data ingestion: Apache Sqoop and Apache
Kafka. We discussed their functionality, architecture, and use cases.
GLOSSARY
Ambari: It is a widely used open-source graphical management tool, which has been de-
signed to manage and monitor Hadoop clusters.
BigQuery: It is a petabyte scale, serverless, interactive, and ad-hoc query engine that allows
a user to conduct analysis of large datasets.
DistCp: It is an open-source tool to copy large amounts of data to and from Hadoop filesys-
tems in parallel.
Dremel: It is a distributed large-scale platform for big data analytics which supports inter-
active querying. It is a high-performance columnar database and can scale to thousands
of commodity machines to perform interactive analysis in a short time. It has inspired a
number of open-source tools that support interactive querying.
Flume: It is an open-source, distributed, reliable, and highly available service for efficiently
ingesting high-volume streaming data. It can ingest data from multiple sources into
data storage solutions like HDFS.
Hive: It is a platform built on top of MapReduce. It takes declarative SQL-like language
as an input from a user and presents data in the form of tables. It reads data from
HDFS and enables Hadoop as a data warehouse by providing capability for querying
and managing large datasets.
HiveQL: It is a SQL-like programming language used with Hive for querying data stored
in a Hadoop cluster.
Impala: It is an open-source, low-latency, distributed, and massively parallel processing
SQL query engine designed to run on Hadoop platforms.
Kafka: It is an open-source, high-throughput distributed streaming platform, which is
widely used as publish/subscribe messaging system. It can be used to build a high-
performance, real-time robust data streaming pipeline between applications.
Kudu: It is an open-source low-latency column-oriented data store, which supports fast
sequential and random access.
NiFi: It is an open-source real-time data ingestion tool, which enables automation of data
flow between systems in an efficient and reliable manner.
Pig: It is a high-level framework that allows a user to express data processing and analysis
operations on top of the MapReduce framework using a simple scripting language.
Pig Latin: It is a dataflow procedural scripting language, which is used by the Pig platform.
It is used to perform complex operations such as joins, sorts, and filtering across several
types of datasets loaded on Pig.
YARN: It stands for Yet Another Resource Negotiator. It is used to centrally manage a
cluster of resources.
Zookeeper: It is a highly reliable tool to manage and synchronize configuration on a Hadoop
cluster.
CHAPTER 6
Spark
CONTENTS
6.1 Limitations of MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.1.1 Spark Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2 Introduction to Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.3 Spark Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3.1 Spark- Modes of Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3.2 Operations on RDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.3.3 Lazy Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3.4 Persistence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3.5 Spark Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4 Spark SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.4.1 DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.5 Spark MLlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.5.1 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.6 Stream-Based System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.6.1 Issues and Challenges in Stream Processing . . . . . . . . . . . . . . . . . . . . 132
6.7 Spark Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.7.1 Types of Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.7.2 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.8 GraphX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.8.1 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.11 Exercise Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Hadoop/MapReduce has been the de-facto platform for big data computation for many
years. The wide-scale popularity of MapReduce has been driven by its simple pro-
gramming model and by its effectiveness in storing and processing big data. However, as
big data systems continue to grow, the applications of big data evolved from batch process-
ing to streaming. Further, continuous increase in the size of big data induced scalability
challenges. Spark is motivated to address these needs. In this chapter, we will study the
architecture of Spark. This will be followed by Spark programming. The chapter covers
various examples in order to explain the topics.
6. Fault Tolerance: The probability of a node failing on any given day is always high for
a large cluster of nodes. Spark provides fault tolerant capabilities by automatically
dealing with failed nodes. If a node running a partition of a map() operation crashes,
for instance, Spark will automatically execute it on another node. In addition,
Spark also provides capabilities to deal with slow nodes. Even if a node does not crash
but runs much slower than other nodes, Spark can proactively launch a speculative
copy of the task on another node, and use its result if that copy finishes first.
Since Spark automatically handles node failures, application developers do not have
to handle such failures in their application. This feature simplifies application code.
7. Unifying Big Data: As evident from figure 6.1, Spark unifies several types of big data
challenges and requirements, which include processing of batch data or streaming
data, machine learning algorithms, and graph processing.
In summary, there are many organizations which are currently leveraging Spark for big
data processing. Reference [51] lists a number of organizations with their use cases.
The second layer is the core of Spark. It is responsible for the overall functionality of
Spark. These functions include task management, task scheduling, memory management,
fault recovery, interaction with storage systems, and creation and management of RDDs.
The third layer describes the significant features of Spark. These include SQL, Streaming,
MLlib, and GraphX libraries:
1. Local: This is a non-distributed mode. Using this mode, Spark programs are executed
locally on a single JVM. Parallelism is achieved through a number of threads. Standard
options for the local mode are mentioned below:
(a) local: single thread.
(b) local[n]: “n” number of threads.
(c) local[*]: as many threads as there are processor cores available.
2. Standalone Cluster: This is the default cluster management setup in which Spark
itself manages the tasks on master and worker nodes [15].
3. YARN: Using this option, Spark utilizes the Yarn cluster manager. Details about
YARN are explained in section 5.2.
4. Mesos: Spark is executed on a Mesos cluster.
We should note an important point: in local mode, there should be at least two
threads in the Spark program. This is because Spark keeps separate threads for receiving
data and processing data. Spark can also be used to receive data from multiple input sources
at a time. In such a case, a separate thread is needed for each individual source. The same
notion applies when Spark is executed in a cluster mode: enough cores should be
allocated to Spark so that it can receive and process data simultaneously.
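A minimal sketch of starting Spark in local mode with two threads is shown below; the application name and master string are illustrative.

from pyspark import SparkConf, SparkContext

# "local[2]" runs Spark on a single JVM with two worker threads,
# enough to receive and process streaming data simultaneously
conf = SparkConf().setMaster("local[2]").setAppName("LocalModeExample")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())
sc.stop()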
Example 6.1 explains Spark submit options.
Spark Submit
Spark scripts can also be executed using shell commands. In such a case, Spark programs
are written in a scripting language such as Python or Scala. The script is executed by Spark.
Table 6.1 explains a few common functions.
Example 6.3 shows creation of RDD in Spark using Python. Spark shell can be obtained
using “pyspark”.
Creating an RDD
In this example, sc denotes the Spark context. The RDD lines contains all the text in the file
“myfile.txt”. Example 6.4 extends the program from example 6.3 by counting the number of lines.
An important point to note is that an existing RDD cannot be altered. A new RDD is
created for each modification in an existing RDD.
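A minimal sketch of what such a program looks like in PySpark is given below; the file name myfile.txt follows the text, and the filter step is an illustrative transformation showing that a new RDD is created rather than the original being altered.

# Example 6.3-style RDD creation from the pyspark shell (sc is provided)
lines = sc.textFile("myfile.txt")

# Example 6.4-style extension: count the number of lines
print(lines.count())

# Transformations never alter an existing RDD; they return a new one
error_lines = lines.filter(lambda line: "ERROR" in line)
print(error_lines.count())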
compute a result based on an RDD. The results are either returned to the driver program
or saved to an external storage system (e.g., HDFS).
6.3.4 Persistence
As we discussed earlier, Spark computes RDDs in a lazy fashion. This also implies that if
an action on an RDD is computed more than once, then Spark will perform computation
multiple times on the RDD. This behavior may lead to repetition of tasks. To avoid this
scenario, Spark provides a method persist.
If the persist method is called on an RDD then Spark will store its result in the cache
(of all the nodes) so that multiple calls to the RDD can be responded through cache. In
this manner, the task of repetitive computation can be avoided.
For instance, let us consider example 6.7. As the method persist has been called on
the RDD pythonLines, repetitive calls to the RDD will be answered through cache. The
nodes that compute the RDD store the partitions. In case of a failure of a node, Spark will
recompute the lost partitions.
Spark Persistence
Spark provides multiple levels of persistence, such as persistence in memory and persistence
on disk; further, persisted partitions can either be kept on a single node or replicated
across worker nodes.
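The sketch below illustrates persistence on an RDD named pythonLines, echoing example 6.7; the storage level shown is one possible choice and the input file is hypothetical.

from pyspark import StorageLevel

lines = sc.textFile("myfile.txt")
pythonLines = lines.filter(lambda line: "Python" in line)

# Cache the partitions so repeated actions do not recompute the RDD
pythonLines.persist(StorageLevel.MEMORY_AND_DISK)

print(pythonLines.count())   # first action: computes and caches the partitions
print(pythonLines.first())   # second action: served from the cache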
Example 6.8 explains the difference between map and flatmap functions. Example 6.9
explains the functionality of various Spark functions.
Spark Functions
# Distinct
RDD4 = RDD1.distinct()
# RDD4 = 2, 3, 4
# Union
RDD5 = RDD1.union(RDD2)
# RDD5 = 1, 2, 3, 4, 1, 2, 3, 5
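A minimal sketch of the difference between map and flatMap, discussed in example 6.8, on an illustrative RDD:

rdd = sc.parallelize(["big data systems", "spark streaming"])

# map applies the function to each element and keeps one output per input
print(rdd.map(lambda s: s.split(" ")).collect())
# [['big', 'data', 'systems'], ['spark', 'streaming']]

# flatMap flattens the per-element results into a single list of words
print(rdd.flatMap(lambda s: s.split(" ")).collect())
# ['big', 'data', 'systems', 'spark', 'streaming']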
WordCount on Spark
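A minimal PySpark word-count sketch in the spirit of example 6.10; the input and output paths are placeholders.

lines = sc.textFile("myfile.txt")
counts = (
    lines.flatMap(lambda line: line.split(" "))   # split lines into words
         .map(lambda word: (word, 1))             # emit <word, 1> pairs
         .reduceByKey(lambda a, b: a + b)         # sum the counts per word
)
counts.saveAsTextFile("wordcount_output")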
6.4.1 DataFrame
In Spark, besides RDDs, data can also be stored in a DataFrame, which is conceptually
equivalent to a table in a relational database. A DataFrame can be constructed from different
sources such as tables in Hive, structured data files, external databases, or existing RDDs.
Example 6.11 shows how DataFrames can be created [53]. Comments are listed in bold.
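A minimal sketch of DataFrame creation in the spirit of example 6.11, assuming a SparkSession named spark and a hypothetical JSON file people.json with name and age fields:

# Create a DataFrame from a JSON file and inspect it
df = spark.read.json("people.json")
df.printSchema()
df.show()

# Column-level operations
df.select("name").show()
df.filter(df.age > 21).show()

# Register the DataFrame as a temporary view for SQL queries
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people ORDER BY age").show()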
|age|name|
|27|Mike|
|38|Sara|
|11|Fischer|
topcustomers = hiveCtx.sql("SELECT ID, customerCount FROM customers ORDER BY customerCount LIMIT 5")
Example 6.13 shows sample code for K-means clustering using Spark RDDs. Comments
in the example are marked with #. The data is loaded from the file input.txt. The loaded
data is parsed and used to train a model with Spark's K-means method. The model is then
tested for prediction. The example also shows how the generated model can be saved and loaded.
RDD Purpose
Local Vector A vector for storing numeric values. This is useful for storing features. The vector could either be sparse or dense.
Labeled Point A local vector with an associated label (response). These are used for supervised learning.
Local Matrix A matrix which is stored locally on a single machine. Values can be stored in dense or sparse format.
Distributed Matrix A matrix which is distributed across the cluster.
Example 6.14 shows the same task of computing K-means clustering using DataFrame [53].
The RDD-based API is now in maintenance mode and no new features will be added
to it. Table 6.2 describes a few RDDs for MLlib.
All new development and features will be contributed to the DataFrame-based API,
which is now the primary API for machine learning [45].
parseddata = inputdata.map(lambda line: array([float(x) for x in line.split(',')]))
# Build the K-means model for training,
# choosing 3 leaf clusters and default values for
# maxIterations=20, minDivisibleClusterSize, and seed
model = KMeans.train(parseddata, 3)
# Print cost
print("Total Cost: " + str(model.computeCost(parseddata)))
# Predict the cluster of a sample point
p = array([0.0, 0.0])
model.predict(p)
# Make predictions (DataFrame-based API of example 6.14)
predictions = model.transform(inputdata)
6.5.1 Pipelining
Pipelining allows multiple stages to be combined into a workflow. Through this, multiple machine
learning tasks can be incorporated at different stages.
Each stage in a Pipeline is either a Transformer or an Estimator. A Transformer
converts a DataFrame to another DataFrame by using the transform() method, whereas an
Estimator is fit on a DataFrame to produce a Model; it implements the fit() method.
Pipeline stages are executed in order. Example 6.15 shows the execution of an ML pipeline.
It has been taken from the official documentation of Spark [54].
Spark ML Pipelining
Example 6.15 (Spark ML Pipeline). Figure 6.3 shows a pipeline with three stages.
In the top row, the first two stages, Tokenizer and HashingTF, are Transformers. The
third stage, Logistic Regression, is an Estimator.
The bottom row represents dataflow through each stage of the pipeline. Dataframes
are indicated through cylinders. These show output at each stage. The tokenizer
splits the input data (raw text) into words. It also adds a new column words to the
DataFrame. The HashingTF transforms the words into feature vectors. These vectors
are added to the DataFrame. The last stage of the pipeline, Logistic Regression,
creates a model.
# Logistic Regression estimator
logisticregression = LogisticRegression(maxIter=10, regParam=0.001)
# Create a Pipeline
pipeline = Pipeline(stages=[tokens, hashingvector, logisticregression])
model = pipeline.fit(training)
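For completeness, the stages tokens and hashingvector referenced above could be defined along the following lines. This is a hedged sketch based on the standard pyspark.ml feature transformers; the training and test DataFrames with text and label columns are assumed.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Transformer 1: split raw text into words
tokens = Tokenizer(inputCol="text", outputCol="words")
# Transformer 2: hash the words into feature vectors
hashingvector = HashingTF(inputCol="words", outputCol="features")
# Estimator: logistic regression on the feature vectors
logisticregression = LogisticRegression(maxIter=10, regParam=0.001)

pipeline = Pipeline(stages=[tokens, hashingvector, logisticregression])
model = pipeline.fit(training)          # produces a PipelineModel
prediction = model.transform(test)      # apply the fitted pipeline to test data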
1. Standing queries: These queries are stored and executed at regular intervals or sched-
uled times. For instance, for temperature readings from sensors, what is the maxi-
mum temperature recorded in a day?
2. Ad hoc queries: These are queries which need to be answered on an ad-hoc
basis. For instance, for network logs on a web server, what is the most accessed URL
in the past one hour?
For stream processing, we could meet the requirements of scalability and timeliness if we
had access to very large memory. However, as the size of data grows, meeting the requirement
of large memory becomes difficult. Therefore, we need to adopt efficient techniques to solve
these problems.
One possible approach for processing streams is to adopt approximation. For instance, if
we sample streams at a periodic rate over a long period of time, the sample is likely to capture
the randomness of the data, which will lead to an approximate but realistic answer.
Reference [57] provides the integration guide for Spark Streaming with Kafka. In this
chapter, we will focus on DStreams, an RDD-based API supporting real-time data
analytics using discretized streams. A DStream is implemented as a sequence of RDDs, each
of which holds the data of a particular time step.
Figure 6.5 illustrates the concept of sequence of RDDs. For instance, RDD1 holds the
data from time t0 to t1 .
DStreams can either be used to transform data in order to create a new DStream or
to write data to output. For DStreams, data can be obtained from various input sources,
such as Flume, Kafka, HDFS, or TCP sockets. Once the data is obtained, many Spark
functions such as map, filter, reduce, and join can be applied on the streams. Further, Spark
libraries such as MLlib and GraphX can also be applied on DStreams. The output of Spark
streaming can either be written to a file such as HDFS or can be sent to a database.
We will now review a sample example for reading data from a stream and analyzing
it using Spark StreamingContext. The StreamingContext is the entry point for a Spark
Streaming application and defines the time interval by which streams of data are discretized.
Example 6.17 shows Python code for reading input data from a network stream and applying
wordcount operation. The stream is created with a 1-second interval.
read_lines = spark_streaming_context.socketTextStream("127.0.0.1", 6000)
The rest of the code for parsing lines to get words and counting words to complete the
wordcount operation is similar to example 6.10.
We observe that example 6.17 shows two additional functions. The start function is
called to explicitly start the operation on the stream. The stream operation is initiated on a
separate Spark thread. The awaitTermination method is used to wait for the stream computation
to finish. As an optional step, a stop operation can also be applied on the stream
by calling the StreamingContext.stop method.
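Putting the pieces of example 6.17 together, a minimal end-to-end sketch looks like the following; the host, port, and batch interval mirror the values mentioned in the text.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one to receive the stream, one to process it
sc = SparkContext("local[2]", "NetworkWordCount")
spark_streaming_context = StreamingContext(sc, 1)   # 1-second batch interval

read_lines = spark_streaming_context.socketTextStream("127.0.0.1", 6000)
word_counts = (
    read_lines.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b)
)
word_counts.pprint()

spark_streaming_context.start()
spark_streaming_context.awaitTermination()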
Example 6.19 shows computation of stream over a windowed session. The countByWin-
dow operation performs count over the last 30 seconds of data (window length) every 10
seconds (slide interval).
The example also illustrates execution of reduceByKeyAndWindow, which performs re-
duction over <key,value> pairs. The first argument of the function is the reduction function,
which in this case is x+y. The second argument is the inverse function x-y. The inverse of the
reduction function is necessary because here computation is done by reducing the new data
that enters the sliding window, and inverse reducing the old data that leaves the window.
The third and fourth arguments of the function are window length and sliding interval. For
clarity, only a snippet of the code is shown. The rest of the code is similar to example 6.17.
windowedCounts = wordCounts.countByWindow(30, 10)
windowedreducebykey = key_value_pairs.reduceByKeyAndWindow(
    lambda x, y: x + y, lambda x, y: x - y, 30, 10)
6.7.2 Checkpointing
Spark is a fault-tolerant distributed computing system. The need for fault tolerance ex-
ists in stateful transformations, where a fault (such as a node failure) should not lead to re-
computation from scratch. For this purpose, Spark provides checkpointing – a feature which
enables Spark to checkpoint data to a reliable file system (such as HDFS). If checkpoint-
ing is enabled, recovery can easily be made using the last checkpoint.
We should note that checkpointing of data is only necessary in stateful transformations
because data in the current batch depends upon the data from previous batches. Many
Spark functions such as updateStateByKey () or reduceByKeyAndWindow () are used for
stateful transformations.
In addition to stateful transformations, checkpointing can also be used for metadata. This
includes data related to configuration and DStream operations. This information would be
needed in both stateless and stateful transformations, and it is useful for recovering from driver failures.
We can learn the process of checkpoint creation through example 6.20. In the
example, Spark is executed in local mode; “*” denotes creating as many worker threads
as there are logical cores on the machine.
ssc = StreamingContext(sc, 10)
ssc.start()
ssc.awaitTermination()
In a similar context, checkpointing can be used to load the data. Example 6.21 shows
that the StreamingContext.getOrCreate method can be used to load data from the checkpoint
directory.
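A minimal sketch of enabling checkpointing and recovering from it with getOrCreate is shown below; the checkpoint directory path, host, and port are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///user/spark/checkpoints"   # placeholder path

def create_streaming_context():
    # Called only when no checkpoint exists yet
    sc = SparkContext("local[*]", "CheckpointExample")
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint(CHECKPOINT_DIR)          # enable checkpointing
    lines = ssc.socketTextStream("127.0.0.1", 6000)
    lines.count().pprint()
    return ssc

# Recover from the checkpoint directory if it exists, otherwise create afresh
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_streaming_context)
ssc.start()
ssc.awaitTermination()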
6.8 GRAPHX
GraphX is the Spark library to process large-scale graphs. In this section, we will study
GraphX concepts, which are specific to Spark. However, to strengthen foundation, we will
also study some preliminaries about graphs. In section 7.7, we will study graph-based pro-
cessing for NoSQL databases.
6.8.1 Concepts
GraphX provides graph-based processing for computation of large-scale graphs. It extends
RDD concepts through a graph-based abstraction, in which a graph has properties asso-
ciated with vertices and edges. In the literature, such a graph is referred to as a property
graph.
Example 6.22 shows a property graph with edges and vertices, whereas example 6.23
shows Python code to construct such a graph using GraphFrames. It also shows code to
perform a few basic graph operations [65].
Property Graph
Example 6.22 (Property Graph). Figure 6.7 shows a graph with properties of edges
and vertices.
Spark – GraphX
Constructing Graph
g = GraphFrame(v, e)
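A fuller sketch of constructing and querying a property graph with the GraphFrames package is given below; the vertex and edge data are illustrative, and a SparkSession named spark is assumed.

from graphframes import GraphFrame

# Vertices: each row must have an "id" column; other columns are properties
v = spark.createDataFrame(
    [("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30)],
    ["id", "name", "age"],
)
# Edges: "src" and "dst" reference vertex ids; other columns are properties
e = spark.createDataFrame(
    [("a", "b", "friend"), ("b", "c", "follow"), ("c", "a", "follow")],
    ["src", "dst", "relationship"],
)

g = GraphFrame(v, e)
g.inDegrees.show()                               # in-degree of each vertex
print(g.edges.filter("relationship = 'follow'").count())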
GLOSSARY
Apache Spark: It is an open-source unified distributed computing engine across different
workloads and platforms. It is a cluster computing system for big data.
Apache Mesos: It is an open-source cluster management framework for managing compute,
memory, storage, and other resources across a cluster.
Apache Sqoop: It is one of the leading tools for transferring data between structured data
stores (such as relational databases) and Hadoop.
DataFrame: It is similar to a distributed table with named columns with a schema. It is an
immutable, distributed collection of data organized into rows, where each one consists
of a set of columns and each column has a specific data type: integer, string, etc.
DStream: It is used for stream analysis. DStream is implemented as a sequence of RDDs,
each of which can handle data in sequence or time step.
GraphX: It is Spark’s graph-processing API. It is used for data visualization and imple-
menting useful graph algorithms.
Property Graph: It is a graph which has properties associated with vertices and edges.
RDD: It stands for Resilient Distributed Dataset. It is an immutable, resilient, fault-
tolerant, and distributed collection of objects, which is partitioned across the cluster.
Spark Core: It contains the Spark core execution engine and a set of low-level functional
APIs for the Spark platform that all other functionality is built upon.
Spark ML: It is a Spark library for machine learning, which is supported through
DataFrame-based APIs.
Spark MLlib: It is a Spark library for machine learning. It is an RDD-based API.
Spark Pipelining: It is a sequence of stages which can be used to define a workflow.
Spark SQL: It implements the higher level Dataset and DataFrame APIs of Spark and adds
SQL support on top of arbitrary data sources.
Spark Streaming: It is a stream processing framework to support real-time data analytics
using discretized streams (or DStreams).
Structured Streaming: It is the second-generation Spark Streaming processing engine built
on top of the Spark SQL abstraction heavily based on DataFrame API with streaming
capabilities.
CHAPTER 7
NoSQL Systems
CONTENTS
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.2 Handling Big Data Systems – Parallel RDBMS . . . . . . . . . . . . . . . . . . . . . . . . 144
7.2.1 Shared Everything . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.2 Shared Disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.2.3 Shared Nothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.3 Emergence of NoSQL Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.3.1 Column-Oriented vs Row-Oriented Databases . . . . . . . . . . . . . . . . . 149
7.3.2 Types of NoSQL Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.4 Key-value Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.4.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.4.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.4.3 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.4.4 Amazon DynamoDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.5 Document-Oriented Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.5.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.5.3 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.5.4 MongoDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.6 Column-Oriented Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.6.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.6.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.6.3 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.6.4 Apache Cassandra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.7 Graph Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.7.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.7.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.7.3 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.7.4 Apache Giraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.9 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.10 Exercise Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Big data systems require massive amounts of storage, and this storage needs to
be spread across data centers in huge clusters of commodity servers. NoSQL Sys-
tems have been specifically introduced to cater these challenges of big data systems. These
databases were designed to deal with big data storage and processing demands that rela-
tional databases cannot meet. In this chapter, we will provide a detailed explanation of how
NoSQL databases work and the different categories of systems that fall under this definition.
The chapter covers various examples in order to explain the topics.
7.1 INTRODUCTION
In section 1.3, we discussed various types of big data systems.
Relational databases have been the primary model for database management for decades.
They are still a good choice for applications that require structured data with requirements
of ACID transactions. However, the need to handle massive volumes of data in large-scale
systems has led to the emergence of NoSQL databases.
These databases are being increasingly utilized and have become an alternative to re-
lational databases owing to their performance, scalability, availability, schema flexibility, and
analytical capabilities.
In section 1.3.2, we discussed that as compared to the ACID properties model of RDBMS,
big data systems follow the BASE model in which they relax consistency in order to achieve
higher availability.
In this chapter, we will explain different categories of NoSQL database systems and will
also describe their strengths and limitations.
4. Load Balancing: Large databases need to ensure load balancing in order to provide
high performance.
These requirements led to the need for parallel RDBMSs. A parallel RDBMS is
an RDBMS with multiple processes, which can support parallel access. There are
three major types of parallel databases. These types are similar to our earlier discussion in
section 2.3:
1. Shared Everything
2. Shared Disk
3. Shared Nothing
Example 7.1 illustrates the difference between the three types of RDBMS.
Parallel RDBMS
Example 7.1 (Parallel RDBMS). Figure 7.1 illustrates the difference between the
three types of a parallel RDBMS. In the shared everything architecture, both the
storage and memory are shared by all the processes. In the shared disk architecture,
only the storage is shared between the processes. Comparatively, in the shared nothing
architecture, each process has a separate disk and memory.
There are three major architectures for distributed database systems illustrated in Fig-
ure 7.1. We now discuss them in detail.
2. Parallelism: This architecture provides great support for parallel processing where
inter-query parallelism is an inherent property and can easily be achieved due to
the shared resources. Inter-transaction parallelism is relatively easy to implement as
there is no need for complex distributed locking or commit protocol and each new
transaction is a new process that is allocated to another processor.
3. Efficiency: It also provides better design of inter-transaction parallelism for read-only
queries since any query can be executed by any processor and therefore, costly rela-
tional operators like sort and join can be parallelized using classical parallel algorithms.
Shared everything bears following disadvantages:
1. Limited Scalability: Scalability is one of the main issues with shared everything archi-
tecture. It is limited due to the fact that every communication between each resource
goes over a shared interconnection network and can cause contention as the number
of processors (or resources) increases. As a result, adding new processors degrades the
system performance.
2. Memory Faults: With memory space shared by each processor, a memory fault affects
many processors and can potentially render the database unavailable.
processor with enough memory, which results in better elastic scalability and extensibility.
It also removes the need for re-balancing operations.
Shared disk provides the following advantages:
1. Scalability: It is very flexible in meeting a user’s needs. In case of increase in demand,
more resources (such as processors) can be added.
2. Fault Tolerance: Shared disk architecture improves extensibility by optimizing and
minimizing the interference on the disks, as each processor has its own memory. Isolating
memory-module faults from other nodes results in better availability.
3. Load Balancing: Load balancing feature can easily be achieved due to shared access
of all data for each node.
Shared disk has the following disadvantages:
1. Bottleneck: Sharing a resource over an interconnection network is always a potential
bottleneck; in this architecture, access to the shared disk can become one.
2. Increased Coordination: It requires increased coordination effort between the proces-
sors for implementation of distributed database system protocol.
3. Cache Consistency: One other challenge for Shared disk is the need to coordinate and
maintain cache consistency across nodes. Each node requires maintaining a consistent
view of data and therefore, needs to maintain a consistent cache. Maintaining this
cache consistency can incur high communication overhead among the nodes.
Shared nothing clusters have the advantage of a simpler consistency model; however,
the complete decoupling of all resources introduces following complexities and limitations
in the architecture:
1. Higher Coordination: It requires a two-phase commit protocol and may need to co-
ordinate transactions across multiple nodes. This coordination incurs complexity and
can result in transactional performance problems.
2. Difficulty in Balancing Load: Load balancing becomes difficult to implement for shared
nothing architecture.
3. Limited Extensibility: It also requires careful partitioning to ensure that the cluster
workload remains balanced. This limits extensibility as adding or removing nodes from
the cluster requires an expensive re-balancing.
Out of the three types of systems, shared disk and shared everything are more appropri-
ate for parallel databases. This is because of their greater infrastructural support to provide
ACID guarantees. In comparison, shared nothing architecture is a loosely connected system.
The cost to ensure ACID properties may be higher due to higher communication delay.
However, as we learned in chapter 1, many big data systems are analytical
systems and therefore do not require compliance with ACID properties.
In the next section, we will study reasons for emergence of NoSQL (Not Only SQL)
systems.
1. Key-value Database
2. Document-oriented Database
3. Column-oriented Database
4. Graph Database
RDBMS to <Key,Value>
Example 7.3 (RDBMS to <Key,Value>). Figure 7.2 shows <key,value> pairs re-
lated to an RDBMS.
can have a dynamic set of attributes attached to it. It has an ability to store an enormous
number of records with extremely low-latency and provides all the maintenance and failover
services.
Example 7.3 shows how an RDBMS system can be represented as a key-value system.
In that, primary key (unique key) of RDBMS has been mapped as a key in the key-value
storage, whereas all the other columns have been mapped as values.
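A minimal sketch of this mapping in Python is shown below; the table and column names are illustrative. The primary key becomes the lookup key, and the remaining columns are packed into an opaque value.

import json

# An illustrative RDBMS row: primary key plus remaining columns
customer_row = {"customer_id": 1001, "name": "Alice", "city": "Karachi", "balance": 2500}

# Key-value representation: key = primary key, value = everything else
kv_store = {}
key = "customer:%d" % customer_row["customer_id"]
value = json.dumps({k: v for k, v in customer_row.items() if k != "customer_id"})
kv_store[key] = value

# Retrieval is a single lookup by key; the value is opaque to the store
print(json.loads(kv_store["customer:1001"]))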
7.4.1 Strengths
There are numerous advantages of a key-value database:
7.4.2 Limitations
Key-value databases have a few limitations. These are described below:
1. It does not have the richness of a query language similar to SQL and does not support
range queries.
2. The values in the key-value database are opaque to the database which means that
we cannot index and query by the fields of values.
3. Transaction guarantees are generally available only in the context of a put operation on a
single key in a <key,value> store.
4. A relational database uses integrity constraints to prevent invalid data and ensure
that data at the lowest level have integrity. It means that data that violate integrity
constraints cannot exist in the database. However, a key-value database typically does
not support these constraints.
5. In comparison with other NoSQL databases, it provides a simple and limited set of
operations.
4. <key,value> pairs are useful for data where the relationship among multiple subsets
does not exist. For instance, bank, account, and transaction are all related and can
be saved in relational databases for a single application. On the contrary, log files and
cached application data are not related to each other. However, this data is required
to be stored for an application. Storing these unrelated data types in a key-value
database would be easier since no relations are ever going to be modeled between
them.
5. When large datasets are required to be stored. Such datasets are easier to store
in key-value databases as these can easily be spread out across multiple servers.
6. For rapid and agile development.
7. When aggregate information is to be computed, for instance, counts, sums, and similar
operations on normalized data. Finding the most popular items, votes,
ratings, session data, and simple statistics are a few useful examples.
8. For applications which need simpler storage and retrieval. This is because
it is easier to store and retrieve data from <key,value> pairs.
Some examples of this type of databases are Redis [121], Riak [90], Amazon Dy-
namoDB [332], and Voldemort [126].
DynamoDB is schema-less, i.e., in a table, different attributes are possible for two
different data items. However, each item in a table has a unique identifier, which is
denoted by a primary key.
We can better understand these concepts through example 7.4.
DynamoDB-Components
A simple primary key with one attribute also serves as a partition key for DynamoDB.
Partition key is used to compute a hash function which determines the actual place-
ment of physical storage where the item will be stored. In DynamoDB, a composite
primary key is also possible. In that, two attributes are used as a primary key. The
first attribute serves as a partition key, whereas the second attribute serves as a sort
key. All items with the same partition key are stored together in the location in the
order defined by the sort key. Example 7.5 illustrates the concept of composite key in
DynamoDB.
2. Read Operations: DynamoDB supports two types of read operations: an eventually
consistent read and a strongly consistent read. When a read operation is initiated in Dy-
namoDB, an eventually consistent read is performed by default. This implies
that the read operation may not return the most recent data; it may return stale data,
which could be up to 20 seconds stale. A strongly consistent read may not be available for some
data.
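A minimal sketch of these read modes using the boto3 client library is shown below; the table name, key schema, and region are illustrative, and the table is assumed to already exist.

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Students")          # assumed existing table

# Write an item; the partition key "StudentID" is part of the assumed schema
table.put_item(Item={"StudentID": 1, "Name": "Alice", "DoB": "2001-04-17"})

# Default read: eventually consistent (may return stale data)
eventual = table.get_item(Key={"StudentID": 1})

# Strongly consistent read: returns the most recent committed value
strong = table.get_item(Key={"StudentID": 1}, ConsistentRead=True)
print(eventual.get("Item"), strong.get("Item"))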
Document-Oriented Database
7.5.1 Strengths
Document databases can offer the following advantages for the users:
1. The price of scaling out with a document database is much less, compared to a SQL
database.
2. Since the data is transparent to the store, these databases can index the fields of
documents which allows the user to query not only by the primary key but also by a
document’s contents.
3. Provides the ability to do extremely rapid application development.
4. Schemaless, completely free to define the contents of a document.
7.5.2 Limitations
Besides the powerful query engines and indexing features, there are a few limitations:
1. This type of database is not suitable for business transaction applications.
2. It does not offer joins across collections.
7.5.4 MongoDB
MongoDB is a non-relational, open-source, NoSQL database, which relies on document
storage to store data. This implies that all the data in MongoDB is stored in documents.
MongoDB provides a few services of relational databases such as sorting, secondary indexing,
and range queries.
MongoDB is a scalable database which relies on scaling out. This means that more
computing nodes can be added easily.
2. Indexing and Sharding: MongoDB supports indexing of data for fast query processing.
Documents are indexed according to keywords for faster access and retrieval.
Understanding Sharding
A shard consists of one or more servers that contains the subset of data that it is
responsible for. For instance, if we had a cluster containing data of 100,000 users, one
shard may contain data of 20,000 users. If there are more than one servers in a shard
then a shard may also contain replicated data.
MongoDB uses keys (also called shard keys), which can be any field or combination
of fields. For partitioning, a shard key is specified, which identifies the fields upon which
data will be distributed. Because of the shard key, a chunk can be described as a triple of
<collection, keymin, keymax>.
Once a chunk grows to its maximum size of 200 MB, it is split into new chunks.
We can get the basic idea about the architecture of MongoDB replication from figure
7.6. A shard containing data “ABC” is represented as a cylinder, whereas, data is
replicated across multiple servers, which are represented as cubical boxes.
To evenly distribute data across shards, MongoDB may transfer chunks of data from
one shard to another shard. The data which may be copied depends upon the key of
data selected.
MongoDB automatically balances data across all shards. Each shard contains multiple
ranges in order to reduce the amount of transferred data. We can consider a range as
a collection of keys such that each shard could contain multiple ranges.
To understand the whole scenario of data splitting and auto-sharding, let us consider
example 7.9, which is based on a database of students containing their IDs, Names,
and DoB (Date of Birth).
MongoDB supports auto-sharding. The cluster manages the splitting of the data and
rebalancing automatically. This removes some of the administrative headaches of man-
ual sharding.
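A minimal sketch of working with MongoDB from Python using the pymongo driver, based on the students example discussed above; the connection string, database name, and field names are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["university"]                      # illustrative database name
students = db["students"]

# Documents are schema-less; fields can vary between documents
students.insert_one({"ID": 1, "Name": "Alice", "DoB": "2001-04-17"})
students.insert_one({"ID": 2, "Name": "Bob"})  # no DoB field

# Secondary index on Name for faster lookups
students.create_index("Name")

# Query by a document's contents, not only by its primary key
for doc in students.find({"Name": "Alice"}):
    print(doc)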
Column-Oriented Database
7.6.1 Strengths
Column-oriented database can offer the following advantages:
1. Because the data stored in columns is typically similar and the columns are stored
adjacent to one another, compression algorithms make use of this similarity of
adjacent data to achieve high data compression, helping storage capacity to be used more
efficiently.
2. The compression of data can greatly enhance the performance by reducing the disk
I/O for reading large amount of repeating values. This improves query performance
by greatly reducing the I/O required to execute analytical queries.
3. A columnstore index achieves high query performance on aggregation queries such as
AVG, SUM, MAX, MIN, and COUNT. This is due to the fact that the number of
hard disk seeks is reduced as compared to row-based databases.
4. Column-oriented databases are more efficient when inserting values for a single column at
once, as these can be written efficiently without affecting any other columns of the rows.
This is in contrast with row-oriented databases, which are more efficient for writing a
new row if all of the column data is supplied at the same time.
5. The quick searching, scanning, and aggregation abilities of column-oriented database
storage are highly efficient for analytics.
7.6.2 Limitations
Column-oriented databases have the following limitations:
1. Column-oriented databases are not useful for systems that require ACID transactions.
They are significantly slower at handling such transactions.
2. Write operations such as INSERT or UPDATE are relatively slow in comparison with
row-oriented (row store) databases due to the number of disk seeks required to insert
or update a row. Writing a row requires a seek to each column for an insert or
multi-column update, in contrast to a row-oriented database where the entire row can be
written at once.
3. Column-oriented databases are not useful for systems with wildly varying queries.
RDBMS Cassandra
Database Keyspace
Table Column Family (CF)
Primary Key Row Key
Column Name Column Name
Column Value Column Value
1. Data Model: A table in Cassandra can be distributed across various nodes. Data model
of Cassandra consists of keys and values. Each attribute is stored as a key, where the
value represents the actual value corresponding to that key. Cassandra implements a
flexible approach in which there may be a few optional values in the data, so some rows may
have more attributes than other rows.
Example 7.11 (Cassandra Data Model). Figure 7.9 explains data model of
Cassandra. Rows can have different attributes. Similar to an RDBMS, a pri-
mary key uniquely identifies a row.
concepts of hashing that in consistent hashing, nodes are organized in the form of a
circle. To find the mapping location of a data item, its hash value is used to find the
first node which maps the item. Scalability is achieved by adding more nodes in a
circle. Upon addition of a node, a portion of data items are remapped. The number of
data items which are needed to be remapped are determined using the two neighboring
nodes of the new node.
By employing consistent hashing, Cassandra achieves scalability.
Cassandra Hashing
In Cassandra, the primary key is used to determine a hash value. For this reason, the
primary key is also called the partition key. If the primary key consists of composite
entities, then the first component is used to determine the partition.
2. Replication: Cassandra replicates data items to achieve high scalability and durability.
Each data item is replicated N times, where N is the replication factor defined in
the system. Each key has a coordinator node, which ensures that the data item is
distributed to the remaining N-1 replicas.
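A minimal sketch of creating a keyspace and a partitioned, replicated table from Python with the DataStax cassandra-driver is shown below; the contact point, keyspace name, and table definition are illustrative.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Keyspace with a replication factor of 3 (each row stored on 3 replicas)
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS university "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
)

# The primary key (student_id) acts as the partition key for consistent hashing
session.execute(
    "CREATE TABLE IF NOT EXISTS university.students ("
    "student_id int PRIMARY KEY, name text, dob text)"
)
session.execute(
    "INSERT INTO university.students (student_id, name, dob) "
    "VALUES (1, 'Alice', '2001-04-17')"
)
for row in session.execute("SELECT * FROM university.students"):
    print(row.student_id, row.name)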
Graph-Oriented Database
Example 7.13 (Graph-Oriented Database). Figure 7.11 shows graph oriented orga-
nization related to an RDBMS.
A graph database is a NoSQL database system that uses a graph data model for
storage and processing of data. The data model is a graph with nodes, edges, and properties.
It is designed for lightning-fast access to complex data. A graph database enables us to model
a problem in a way that makes it much simpler and more expressive compared to
relational or other NoSQL databases. Elements of a graph database have direct references to their
adjacent elements, which provides an added advantage in traversing graphs.
As we can see in example 7.13, graphs model real-world relationships much more
efficiently than relational databases. Objects or entities form nodes in a graph
and can be added easily, with relationship connections linking them to other nodes.
There are graph database implementations that provide ACID guarantees [272], high-
availability, horizontal read scalability [326] along with storage capacity to store billions of
entities.
7.7.1 Strengths
The strengths of graph databases are as follows:
1. Graph databases have the ability to store various types of graphs: undirected graphs,
weighted graphs, hypergraphs, etc.
2. Graph data models perform well when creating rich and highly connected data
to represent real-world use cases and applications.
3. They have the ability to efficiently manage highly connected data and complex queries
irrespective of the size of dataset.
4. The underlying model of graph database provides the functionality of graph theory
and can easily handle complex and flexible data models.
5. Graph databases are optimized for local graph traversals and provide exceptional
performance for local reads by traversing the graph.
6. Graph databases are very good for storing entities and the information about the
relationships between these entities.
7.7.2 Limitations
Besides the numerous advantages, there are a few limitations of using graph databases:
1. Graph databases are not the best solution for applications with very large volumes of
data or for updating large sets of data.
2. The strength of a graph database is in traversing the connections between the nodes
and generally require all data to fit on one machine. However, this becomes the lim-
itation for graph database scalability. Also, many off-the-shelf graph databases are
also not horizontally scalable. Therefore, applications for graphs have limited usage
for large volume application.
3. Because the data is connected by nature, scaling a graph is more difficult than scaling
other NoSQL databases. Distributing graph data across multi-
ple machines is not easy, and relationships that span machines make graph
distribution much more difficult.
4. Graph databases are more suitable for visualization and understanding relationships
instead of storing transactions. For instance, using a graph database to store trans-
actions of a grocery store is not viable. However, a graph database can effectively
capture and present a customer’s spending patterns.
5. Large graphs can be partitioned to be deployed over a distributed cluster. However,
partitioning a graph is sometimes not easy because of factors including high net-
work latency, access patterns in the graph, and real-time evolution of the graph.
Table 7.3 Common applications of graph databases and their usage:

Social Networking: These applications allow organizations to identify direct and indirect
relationships between people, groups, and activities. Graph databases are typically the
go-to databases for social networking, since they make it simple to discover how a person
is connected to another person within a group or an organization. Many large and
successful social network organizations use graph databases at their core, which allows
them to quickly traverse a social graph and return query results at high speed.

Recommendation Systems: A recommendation system helps users by providing suggestions
for products or services based on user behavior or preferences. Graph databases are
naturally well suited for building recommendation engines and can make effective
recommendations. Although relational databases can represent this kind of data
structure, graph databases are built to solve exactly this kind of problem and are widely
used because of their performance and maintainability.

Authorization and Access Control: A graph database can manage access to content and
manage relationships between users, groups, assets, and collections. It can traverse
millions of relationships per second to execute queries in milliseconds and can perform
lookups over large and complex structures.
Some of the most common graph database use cases are mentioned in table 7.3. In
addition, a graph database is a great solution for many other use cases, including the following:
1. Query performance and responsiveness are generally the main concerns regarding the
data platforms of many organizations. Applications like online transactional systems
or enterprise web applications typically need to respond to user queries in millisec-
onds. However, as the size of the dataset increases, join operations become more
expensive and query performance deteriorates. Graph databases have no such
penalty: they turn complex joins into fast graph traversals, maintaining millisecond
query times irrespective of the overall size of the dataset. Therefore, if an application
suffers from join pain, that is another indicator that a graph database will be a better
choice than a complex data model in a relational database.
2. A graph database is best used for exploring data that is structured like a graph, tree,
or hierarchy, and where the relationships between entities are significant. Graph
databases are also useful for application domains with complex, highly connected data
models, for example social networks, healthcare, and real-time logistics.
3. Graph databases are also useful for applications which require a lot of graph-centric
querying. For example, they work well for queries that analyze how closely things
are related, or how many steps are required to get from one point to another.
4. Graph databases are also a good fit for data visualization, including showing how
things are connected. Analyzing relationships between people on social networking
sites such as Facebook, Twitter, and LinkedIn is a typical use case for graph databases.
A few of the well-known examples of graph databases are Apache Giraph [305],
Neo4j [362], OrientDB [351], HyperGraphDB [204], and AllegroGraph [72].
Example 7.14 (Finding maximum value in a graph). Figure 7.12 illustrates an ex-
ample of finding the maximum value in a graph consisting of four nodes. Solid arrows
indicate neighbor relationships, whereas dotted lines indicate communication. A shaded
node indicates that the node has decided to halt. A node can become active again after
deciding to halt.
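Example 7.14 follows the vertex-centric, bulk synchronous parallel style used by systems such as Apache Giraph. The following is a simplified single-process simulation of that style; the vertex values and neighbor lists are invented for illustration and do not correspond to figure 7.12.

```python
# Simplified vertex-centric computation of the maximum value in a graph.
# In each superstep, active vertices send their value to their neighbors; a
# halted vertex becomes active again if it receives a value larger than its own.
def max_value(values, neighbors):
    active = set(values)
    while active:
        messages = {}
        for v in active:
            for n in neighbors[v]:
                messages.setdefault(n, []).append(values[v])
        active = set()
        for v, incoming in messages.items():
            best = max(incoming)
            if best > values[v]:
                values[v] = best
                active.add(v)          # re-activate on a larger incoming value
    return values

values = {"a": 3, "b": 6, "c": 2, "d": 1}
neighbors = {"a": ["b"], "b": ["a", "d"], "c": ["b", "d"], "d": ["c"]}
print(max_value(values, neighbors))    # every vertex converges to 6
```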
Document storage system Riak has been explained in reference [385]. Riak and Redis
have been compared in reference [90].
MongoDB has been discussed in detail [86, 140, 141].
Couchbase has been explained in reference [112]. CouchDB has been explained in refer-
ences [82, 241].
HBase has been discussed in reference [176], whereas Amazon SimpleDB has been dis-
cussed in reference [125].
Details about Graph-based systems can be found in respective references. Apache Gi-
raph [138, 259, 305, 306], Neo4j [362], OrientDB [158, 221, 351], HyperGraphDB [72, 96, 204],
and Allegrograph [24].
2. Explain why a graph database may yield lower performance when it is deployed over
a distributed cluster.
3. Describe how the bulk synchronous parallel model is useful for graph databases.
4. Explain why RDBMSs are suitable for transactional systems.
5. Is it possible to ensure ACID guarantees for analytical systems?
6. Describe how column-oriented databases are useful for aggregation.
7. Explain the strengths and weaknesses of <key,value> storage.
8. Explain scenarios where document-oriented databases are useful.
9. Describe the driving factors behind the popularity of NoSQL systems.
10. Explain how MongoDB performs sharding.
11. How does Cassandra achieve scalability?
GLOSSARY
ArangoDB: It is an open-source, highly available, scalable, and multi-model database that
can handle key-values, graphs, and JSON documents.
Graph Based System: It is a type of NoSQL system in which data is represented as a graph.
Nodes represent the data items, whereas edges represent relationships between them.
Key-value: It is a type of NoSQL system in which data is stored in the form of <key,value>
pairs.
MongoDB: It is an open-source NoSQL document-oriented database that supports automatic
sharding and MapReduce operations.
NoSQL: (aka "not only SQL") It is a data storage system in which data is stored differently
than in the traditional relational form and which provides flexible schemas.
SQL: Structured query language; It is a language for accessing and manipulating data from
relational databases.
Sharding: It is a process of distributing a database over multiple nodes.
Shared Nothing Systems: It is a type of cluster arrangement in which participating nodes
do not share anything; neither the memory nor the storage.
Shared Disk Systems: It is a type of cluster arrangement in which participating nodes share
common storage.
Shared Memory Systems: In such a system participating nodes are arranged in the form of
a cluster such that they share memory.
Vertex Partitioning: It is a method of partitioning large graphs in which a graph is parti-
tioned on a vertex.
CHAPTER 8
NewSQL Systems
CONTENTS
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.2 Types of NewSQL Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
8.3 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
8.4 NewSQL Systems: Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.4.1 VoltDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.4.2 NuoDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.4.3 Spanner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.4.4 HRDBMS - NewSQL System for Data Analytics . . . . . . . . . . . . . . 201
8.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.6 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8.7 Exercise Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
NewSQL databases are designed for specific applications relying on different underlying
mathematical models while offering unique benefits. In this chapter, we will provide
a detailed explanation of NewSQL systems and their types. The chapter also includes a
few case studies of NewSQL systems. Various examples are also included for explanation.
8.1 INTRODUCTION
In chapter 7, we studied NoSQL systems. A NoSQL system is generally optimized to
meet the scalability needs of big data. Scalability is achieved at the expense of consistency,
such that eventual consistency techniques are implemented. Such models are suit-
able for OLAP (Online Analytical Processing). However, with the increased usage of big
data systems, the use of OLTP (Online Transaction Processing) systems has also increased.
NewSQL systems are a kind of big data DBMS which provide the consistency guarantees of
a traditional DBMS while being capable of meeting the scalability requirements of big data.
These databases are a set of new relational databases that provide improved perfor-
mance while incorporating the use of the SQL language. These systems intend to provide
the same performance and scalability improvements as NoSQL systems, while meeting the re-
lational and ACID (Atomicity, Consistency, Isolation, Durability) compliance of traditional
databases. In this chapter, we will study the generalized architecture of NewSQL systems and
their prominent types.
8.3 FEATURES
We will now study a few significant characteristics of NewSQL systems which enable them
to achieve ACID compliance while ensuring high speed and scalability [292].
2. Partitioning
NewSQL systems utilize partitioning to distribute large data over multiple par-
titions. To ensure consistency, NewSQL systems mainly utilize multi-version concurrency
control (MVCC).
4. Secondary Index
For faster searching, NewSQL systems utilize secondary indexes. A normal DBMS is
partitioned on primary indexes; partitioning on secondary indexes allows faster
searching. If the system resides on a single node, then secondary indexing is trivial.
However, if the system is distributed, then maintaining and searching secondary
indexes could be challenging. Two approaches are used to coordinate searching on
secondary indexes: a centralized coordinator approach and a distributed approach.
Example 8.3 explains the two mechanisms.
5. Replication
Distributed databases maintain a large number of replicas for load balancing and user
localization. A fundamental challenge in distributed database systems is to have an
efficient update mechanism among all the replicas.
There are two fundamental approaches for maintaining consistency among replicas in
NewSQL systems:
(a) Active-active approach: In the active-active approach, all replicas are immediately
updated upon execution of a query. This approach guarantees strict consistency;
however, it incurs higher overhead and delay in update transactions.
(b) Active-passive approach: In this approach, there is a master replica for each
object. For updates, the master is updated first. The master node then
distributes the update to the other replicas.
Most NewSQL systems use the active-passive approach. They follow a non-
deterministic concurrency control scheme. This implies that queries on the replicas
may get executed in a different order, because replicas may expe-
rience several factors such as network delays, cache stalls, and clock skew. This
requires additional coordination among replicas.
There are a few deterministic NewSQL systems. These include H-Store [214] and
VoltDB [343]. These systems guarantee that transactions execute in the same order
among all the replicas. Such systems have consistent states among all the replicas.
6. Crash Recovery
Crash recovery refers to the process of recovering from a failure. Normally,
this refers to failure recovery for the master node. When the master node fails, another
replica can take over as master. When the failed master comes back online, it has two
options for recovery:
(a) Checkpointing and logs: The first mechanism is to recover from the last check-
point and then apply all the logged updates. This mechanism requires that
all the logs be maintained.
(b) Copying from the replica: In this approach, the master node gets a copy from
a replica node. This approach is also useful if a new replica needs to be
initialized.
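A minimal sketch of recovery option (a) is shown below. It assumes a JSON checkpoint file that records the sequence number it covers and a newline-delimited update log with monotonically increasing sequence numbers; both file formats are assumptions made purely for illustration.

```python
import json

def recover(checkpoint_path, log_path):
    # Load the last durable snapshot of the database state.
    with open(checkpoint_path) as f:
        state = json.load(f)
    checkpoint_seq = state.pop("_seq", 0)      # sequence number covered by the checkpoint

    # Replay only the log entries recorded after the checkpoint.
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry["seq"] > checkpoint_seq:
                state[entry["key"]] = entry["value"]
    return state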
8.4.1 VoltDB
In VoltDB, data is distributed across partitions [343, 360]. In addition, stored procedures
(such as SQL Scripts) are also partitioned. Each stored procedure is a transaction. Each
transaction in VoltDB is executed in its own thread. This minimizes the individual latency
for each transaction, where latency is measured from the time when the transaction begins
until processing ends. Since each transaction has its own thread, there is no need for locking.
Example 8.4 explains the partitioning mechanism of VoltDB, whereas example 8.5 explains
execution of stored procedures.
Example 8.5 (Stored procedures in VoltDB). Figure 8.2 illustrates the procedure
for executing stored procedures when a stored procedure needs data from multiple
partitions. A coordinator is used to collect data from multiple partitions and compile
the aggregated result. This may slow down the VoltDB operation.
8.4.2 NuoDB
NuoDB is another useful NewSQL system [113]. It is based on a Peer-to-Peer (P2P) architec-
ture. Each peer follows a two-tier architecture: a transaction processing layer (also known
as the Transaction Processing Engine, or TE) and a storage layer (also known as the Storage
Engine, or SE). The two-layer approach separates transaction processing from storage.
Out of the four desirable ACID properties, atomicity, consistency, and isolation are
handled by the transaction processing layer, whereas durability is handled at the storage
layer. The transaction processing layer is in-memory and is responsible for managing
transactions. The storage management layer is responsible for ensuring durability.
Durability is achieved by making data durable on commit. The storage layer also provides
access to data when the TE does not find the data in memory. This is considered a cache
miss. A set of peer-to-peer coordination messages is used. Separating the transaction and
storage layers provides fast transaction processing.
In a NuoDB system, all processes are peers. This means that there is no need to have a
single coordinator. It also eliminates single point of failure.
TEs are responsible for accepting SQL client connections. Further, they are also used
for parsing and running SQL queries against cached data.
All processes, i.e., SEs and TEs, communicate through a P2P protocol. When a cache
miss is observed by a TE, it contacts its peers to fetch the data. This could either be another
TE with data in its cache or an SE with access to data.
NuoDB employs Brokers, which are peer processes responsible for management. Each
physical node that executes the database service runs a broker service. A Broker is re-
sponsible for starting and stopping NuoDB services and for monitoring process health. It is also
responsible for configuration management. Brokers coordinate with each other to maintain a
global view. Figure 8.3 illustrates the architecture.
8.4.3 Spanner
Spanner is Google's fault-tolerant and globally distributed database, which provides
the features of a relational database, such as SQL queries and ACID transactions, with the
horizontal scaling capabilities of a non-relational database. Figure 8.4 shows the architecture
of Spanner. It consists of multiple zones. Each zone has a zone master, a location proxy,
and one to several thousand spanservers. The zone master assigns data to spanservers. Clients
connect to the spanservers for data retrieval. Proxies are used to locate spanservers inside
zones. The placement driver is responsible for moving and replicating data in order to balance
load and maintain locality across zones, whereas the universe master provides a console that
is used for interactive debugging across all zones [148].
There are some important features of Spanner which help it achieve ACID guar-
antees at wide scale [110].
(c) Other Attributes: How many replicas are maintained to control durability, avail-
ability, and read performance?
2. External Consistency: External consistency implies that transactions appear to ex-
ecute in a serializable manner. It will appear as if all transactions were executed
sequentially, even though they may have executed across multiple servers in multiple
data centers. The system guarantees that if a transaction "A" commits before an-
other transaction "B" starts, then A's commit timestamp will be smaller than B's
commit timestamp.
3. Global Consistency: Global consistency implies that all replicas of the database are
consistent.
4. Timestamps: Spanner uses the TrueTime API, which enables it to assign globally mean-
ingful timestamps to transactions. Spanner depends on TrueTime to generate mono-
tonically increasing timestamps. These are used to generate proper timestamps for
write transactions without the need for global communication. Further, they are also
used for read operations.
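The commit-wait idea behind TrueTime can be sketched in a few lines; the uncertainty bound EPSILON and the use of the local clock below are illustrative stand-ins for TrueTime's GPS- and atomic-clock-based interval, and the sketch omits locking, replication, and consensus entirely.

```python
import time

EPSILON = 0.007   # assumed clock uncertainty bound (seconds), illustrative only

def tt_now():
    # TrueTime-style interval: the true time is guaranteed to lie within it.
    t = time.time()
    return t - EPSILON, t + EPSILON

def commit(apply_writes):
    _, latest = tt_now()
    commit_ts = latest                  # choose a timestamp not earlier than true time
    apply_writes(commit_ts)
    while tt_now()[0] <= commit_ts:     # commit wait: until commit_ts is surely in the past
        time.sleep(0.001)
    return commit_ts                    # only now is the commit reported to clients

ts = commit(lambda t: None)             # apply_writes would make the writes durable at t
```

The waiting step is what makes commit timestamps externally consistent: a later transaction cannot start before an earlier one's timestamp has already passed.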
written to HDFS, whereas the output of the map phase is written to the local disk. By
omitting the reduce phase, HRDBMS can utilize a series of map phases to be executed with
low cost. In such a manner, relational operations can be performed. While data is shuffled
between map phases, it is not sorted. The shuffle in HRDBMS only ensures that data with
the same keys is sent to the same node for processing, including joins and aggregation.
Data in HRDBMS is stored in local disk. Coordinator nodes contain information about
the stored data. The first map phase is always executed on the node where data is stored,
which allows data to be read efficiently.
The Optimizer in HRDBMS utilizes query execution information to build efficient work-
flows. Shuffle is the most costly step in HRDBMS. The number of shuffles is reduced by
efficiently partitioning the data.
HRDBMS aims to support relational operations while meeting high scalability
demands. However, it lacks the fault tolerance that is inherent in large-scale big data systems.
GLOSSARY
ACID Guarantees: Atomicity, Consistency, Isolation, and Durability are the four important
attributes of a DBMS.
Broker: It is a management service which runs on each physical node executing the NuoDB
system.
Clock Skew: The term refers to the deviation of time among different nodes of a dis-
tributed system.
CockroachDB: It is an open-source, globally distributed, horizontally scalable, and ACID-
compliant database. It provides high-availability and fault tolerance, and supports dis-
tributed SQL transactions.
Concurrency: It is a process of executing multiple computations at a time.
Database-as-a-Service (DBaaS): It is a managed database service in the cloud that provides
a powerful on-demand scalable database platform.
Indexing: It is a method of optimizing performance of a DBMS by minimizing the number
of disk accesses required when a query is processed. Indexing is supported through
a data structure technique which is used to quickly locate and access the data in a
database.
In-Memory Database: It is a type of database system which maintains and retrieves data
from memory for faster execution.
Load balancing: It is a process of distributing load or data across different replicas in order
to balance load across them.
MemSQL: It is a high-performance in-memory distributed database that provides high
availability, fault-tolerance, and ACID-compliant relational database management. It is
a cloud-native NewSQL database that is massively scalable across various geographic
regions.
Online Analytical Processing (OLAP): It is a type of system which is used for the analytical
processing of data.
Online Transactional Processing (OLTP): It is a type of system which is used to process
transactions. Normally, it guarantees ACID properties.
Replica: It is a backup node which is used to maintain replication and load balancing.
Zone: It is a unit of execution in Google Spanner.
III
Networking, Security, and Privacy for Big Data
CHAPTER 9
Networking for Big Data
Big data systems require high throughput and effective means of data transfer in or-
der to process data efficiently. In this chapter, we will study numerous network-related
technologies, which are being incorporated by big data systems to fulfill these requirements.
We will also discuss several issues and challenges encountered in this regard. Network perfor-
mance of a system would depend upon end-to-end network paths. However, for simplicity, we
will restrict our focus on networking within a cluster (or a data center), as this is considered
more relevant for the processing of big data [378].
In section 3.1.2, we studied data center architecture. Figure 3.2 is redrawn as figure 9.1.
It illustrates a typical data center architecture. In a data center, hundreds to thousands of
machines (called nodes or hosts) are installed. These nodes are organized into racks.
Worker nodes are organized at the edge layer. Data from the edge layer is aggregated
through the aggregation switch. At the top is the core layer, which is supported through a
border router and provides the interface between the data center and the outside world. In a
typical data center, there could be one to tens or hundreds of clusters.
Figure 9.2 illustrates a sample cluster which consists of numerous racks of servers. Each
rack may contain tens of servers and a Top of Rack (ToR) switch. These are considered to be at
the edge layer of the data center. Data from these racks is aggregated through aggregation
switches and is then sent to the IP router. We should note that servers in a cluster are
organized in the form of a virtual LAN (VLAN). This allows consistent routing policies
across the servers in a cluster. We will return to routing when we discuss
Software-Defined Networking (SDN).
1. Control Messages: These are latency-sensitive messages, which are used to send time-
sensitive information such as cluster control commands, abnormal-activity alerts, or configu-
ration updates. For instance, a head node could send "are you alive" messages to all the
worker nodes in a cluster on a routine basis. These messages are sent to detect node
failure and to inquire about the availability of resources. Similarly, routing updates can be
sent for network optimization. In the case of Hadoop and YARN, heartbeat messages can
be considered control messages. In general, control messages have a low data rate,
and their latency-sensitive requirement can be met through Ethernet.
2. Aggregation: This type of traffic is the most common for big data systems. For instance,
MapReduce-like systems have a common pattern of aggregation. We should recall from
our study in chapter 2 how a task is distributed and aggregated. A task is submitted
to the head node, which splits it into smaller tasks and distributes them among worker
nodes. The worker nodes compute the smaller tasks and send the results back to
the head node, where the partial results are aggregated to compute the final result.
Partition and aggregation patterns are used in many big data systems such as search
engines and querying systems.
3. Bulk Transfer: This type of traffic is usually required either for maintenance or for
management of a data center. Objectives for bulk data transfer could include replica-
tion, fault tolerance, recovery, and data migration. The size of data traffic could be
huge, involving petabytes to zettabytes of data. Generally, such type of transfer does
not have critical requirements of timeliness.
1. One-to-One: This is a fundamental traffic model in which data is transferred (or mes-
sages are communicated) between two hosts on a one-to-one mapping. High network
throughput is desired for this purpose. An efficient one-to-one communication system
would yield effective several-to-one and all-to-all support. Examples of the one-to-one
model include VM migration, data backup, and hotspot mitigation.
2. One-to-Many: This type of traffic pattern is desired when data is copied from one host
to many other hosts. For instance, by default, data is replicated in a distributed file
system (such as GFS [177] and HDFS [324]).
3. One-to-All: This type of traffic pattern includes patterns in which data is transferred
from one host to all the other hosts in a cluster. Examples of such traffic models
include data broadcast, system upgrades, and transfer of application binaries.
4. All-to-All: In such a model, each host transfers the data to every other host in the
cluster. Such type of traffic model is common in MapReduce [154], when during the
reduce phase data is shuffled across many servers [186].
1. Flexibility and Adaptability: Big data systems may contain heterogeneous applications
with varying requirements. An adaptable and flexible solution is needed which can
meet the requirements of these applications.
2. Low-Latency and High-Speed Data Transfer: Communication latency should be low.
If a large amount of data needs to be transferred, efficient mechanisms for data
transfer should be employed. The network should not become a bottleneck.
3. Avoiding TCP Incast – Achieving Low Latency and High Throughput: Big data sys-
tems have varying traffic requirements. For instance, control messages require low
latency, whereas bulk data transfer requires that the throughput remain high. A big
data network is vulnerable to congestion and packet loss as well. In addition, big data
systems may suffer from the TCP Incast problem, where the throughput of the network
may decrease abruptly due to a sudden increase in packet loss. The TCP Incast problem
is described in section 9.5.
4. Fault Tolerance: In a big data network, there should be enough redundant network
resources, as failure in the network could induce huge delay.
Table 9.1 shows challenges and their corresponding solutions discussed in the chapter.
In the remaining sections of this chapter, we will study different solutions to meet these
requirements and challenges.
1. Significance:
Significance of network programmability can be understood from example 9.1.
Network Programmability
2. Benefits:
Network programmability promotes application awareness – a key requirement for
big data systems. In order to appreciate the need for network programmability in big
data systems, we will now look at a few examples where application awareness could be
beneficial:
1. Significance:
Figure 9.4 illustrates the comparison between a traditional network and an SDN-enabled
network. In a traditional network, forwarding devices (switches or routers) are used
to forward packets. Each device can execute its own policy. This implies that the control
logic, or control plane, is integrated with each switch. In comparison, in an SDN,
a centralized controller provides the control logic. The rules and policies about
forwarding decisions are made by the centralized controller, whereas switches are mere
forwarding devices.
In an SDN, the control plane is separated from the data plane. That is, forwarding
decisions are made by the centralized controller (control plane), whereas forwarding
actions are performed by switches (data plane), which seek commands and instructions
from the controller. Upon the arrival of a flow of packets, a switch checks for an available
rule for the flow. If it does not find a rule for the flow, it asks the controller for one (a
minimal sketch of this interaction appears at the end of this section).
Figure 9.5 shows the working model, where numerous switches are connected to a con-
troller. The controller can communicate with applications to make application-aware
decisions. In an SDN, there are two categories of communication, i.e., communication
between the controller and the switches, and communication between the controller
and applications. Both these types of communication are handled through software,
or Application Programming Interfaces (APIs). The northbound API is used for communi-
cation between the controller and the applications, whereas the southbound API is used
for communication between the controller and the forwarding devices.
Use of a centralized controller provides greater flexibility, as policy decisions and for-
warding updates can be centrally controlled through software. Similarly, application-
aware decisions can also be made with greater ease. However, a centralized controller
is vulnerable as a single point of failure. To cater to this issue, SDN has evolved from a
single controller to a logically centralized control plane with multiple controllers. Multiple
controllers provide fault tolerance: in case a controller becomes unavailable, backup con-
trollers are available. Backup controllers, however, introduce the challenge of maintaining
consistency.
2. Important Features:
Following are the four prominent features of SDN:
(a) The control and data planes are decoupled. This decoupling allows adaptability
and flexibility.
(b) Decisions to forward packets are made on flows instead of destinations. Flow-
based decisions allow greater flexibility and application-specific routing. This
improves the efficacy of the network.
(c) Control logic is centralized. This implies that the network can be controlled cen-
trally. Routing rules, firewall policies, and decisions to forward flows can all be
controlled through a centralized controller or network operating system (NOS).
(d) The network is programmable; i.e., software applications can run on top of the NOS,
which interacts with the underlying data plane devices.
3. Benefits:
SDN can benefit cloud data centers in meeting Service Level Agreements (SLAs) for big
data applications. SLAs for big data applications include many attributes such as pro-
cessing time, data transfer time, security, and bandwidth. SDN can help in meeting
these SLAs by efficiently selecting routes in order to optimize network efficiency. SDN
can also be used for optimizing network bandwidth utilization. Big data applications
can have diverse traffic, ranging from streaming applications to file transfer. Meeting
SLA requirements for different types of applications can be challenging. SDN could
promote flexibility and adaptability by allowing application-specific traffic engineering.
For big data systems (such as Hadoop), data locality is essential to reduce network-
transfer cost [300]. A prime requirement for data locality is to have a global knowledge
of the network. SDN can take advantage of having global knowledge to promote data
locality and reduce network transfer time. In such a way, Hadoop jobs can be sched-
uled to promote data locality. Eventually, this would lead to reduced job completion
time in Hadoop.
4. Open Issues:
Although SDN can be exploited to handle networking issues related to big data, there
are many open challenges that need to be addressed [150]:
(a) Scalable Controller Management: Controllers in big data applications often re-
quire frequent flow table updates and involve a large amount of data transfer.
In general, current controllers can only handle on the order of 8,000 flow
table entries [230]. This is insufficient for big data applications. As big data systems are
being used increasingly, the need for scalable and robust control mechanisms is
growing.
A cluster of controllers can be set up, which would allow replication, fault toler-
ance, and load balancing. The idea of distributed controllers has been proposed
by many researchers. This allows distributing the logic of the control plane over
several hosts, while managing the network centrally through the controller. An-
other option is to incorporate multicore controllers, which would allow parallel
execution of the controller logic across all the cores.
(b) Intelligent Flow Table/Rule Management: While the separation of control and
data planes in SDN has simplified management of the network, it has also increased
the processing requirements for the controller. In an SDN-compliant big data system,
forwarding tables and rules are sent to the switches in the network. These tables
are utilized by the switches to forward packets of big data applications. Since
packets without a matching rule are sent to the controller without any pre-processing
at the switches, a controller may receive a large number of such packets, which may
create congestion in the network. Some pre-processing at the switches can reduce the
overhead at the controller.
(c) Highly Flexible Language Abstraction: Currently, there are several SDN controllers
written in Python, Java, and C. In a big data network, as the size of the network grows,
heterogeneous controllers may be employed. In such a scenario, a unified abstrac-
tion language is needed to support programmability for big data networks.
(d) Wireless Mobile Big Data: With advancements in wireless technologies and Inter-
net of Things (IoT), use of SDN-enabled wireless and mobile networks is expected
to increase. Such networks may encounter different challenges related to avail-
ability, connectivity, and bandwidth. In such a scenario, capable SDN solutions
need to be explored.
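The switch–controller interaction described in this section (flow-rule lookup at the switch, with a query to the controller on a table miss) can be sketched as follows; the rule format, the match fields, and the controller's policy function are illustrative assumptions rather than any particular SDN controller's API.

```python
# Minimal sketch of flow-based forwarding with a controller fallback on a table miss.
class Controller:
    def __init__(self, policy):
        self.policy = policy            # centralized, possibly application-aware policy

    def request_rule(self, match):
        return self.policy(match)       # return an output port for this flow

class Switch:
    def __init__(self, controller):
        self.flow_table = {}            # flow match -> output port
        self.controller = controller

    def handle_packet(self, packet):
        match = (packet["src"], packet["dst"])
        if match not in self.flow_table:
            # Table miss: ask the controller for a rule (southbound interaction).
            self.flow_table[match] = self.controller.request_rule(match)
        return self.flow_table[match]   # forward on the port given by the installed rule

ctrl = Controller(policy=lambda m: 1 if m[1].startswith("10.0.1.") else 2)
sw = Switch(ctrl)
print(sw.handle_packet({"src": "10.0.0.5", "dst": "10.0.1.7"}))   # port 1
```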
SDN provides several important features that are useful for big data applications. It is
expected that big data applications would continue to grow and benefit from SDN.
(a) InfiniBand Verbs Layer: The verbs layer provides the lowest-level access to an Infini-
Band system. Data transferred through verbs completely bypasses the OS,
i.e., there are no intermediate copies in the OS. The verbs layer incorporates a
queue pair model which implements communication end points. Work requests
are placed on a queue pair which are then processed by the Host Channel Adapter
(HCA). A completion queue is maintained to store notifications for the comple-
tion of work. The process of completion is detected through polling or interrupts.
(b) InfiniBand IP Layer: The purpose of the IP-based layer is to provide support
for communication with traditional socket-based systems. A driver is provided
for this purpose, which is responsible for communicating over socket API. For
such communication, the InfiniBand adapters appear as traditional socket-based
devices such as ib0, ib1, and so on. This is often referred to as IP-over-InfiniBand
(IPoIB). It supports both the unconnected datagram protocol and the connected
protocol; of the two, the latter is considered more reliable.
(c) InfiniBand Sockets Direct Protocol: The Sockets Direct Protocol (SDP) emulates
TCP sockets stream semantics. SDP is layered on top of IB message-oriented
transfer model. The SDP protocol achieves high-speed without requiring mod-
ifications to existing socket-based applications. For smaller messages, buffered
mode transfer is implemented, whereas for large messages, RDMA-based trans-
fer is incorporated. The use of RDMA allows SDP to offer better performance
than IPoIB. However, as RDMA is not utilized for all messages, the performance of
SDP is lower than that of verbs.
InfiniBand has been a popular framework for high-speed networks. The RDMA feature has
been instrumental in providing high-speed. Owing to its efficiency, the IB systems have been
deployed by many HPC clusters.
1. iWARP:
The Internet Wide Area RDMA Protocol (iWARP) is another protocol used to provide
high-speed data connections. Similar to IB, it also utilizes RDMA to achieve low
latency and high bandwidth. However, this is achieved over Ethernet. There are three
key iWARP components that deliver low latency [203]:
(a) Kernel Bypass: This feature removes the need for context switching from kernel-
space to user-space and allows an application to post commands and write mes-
sages directly to the server adapter. This is in contrast to traditional message
transfer where messages are transferred from user-space to kernel-space through
context switch.
(b) Direct Data Placement: Using Direct Data Placement (DDP), data can be copied
directly from the server adapter’s receive buffer to the application buffer. This is
achieved through RDMA. DDP eliminates the need of intermediate operations
which would otherwise be required to copy data to an application buffer and
conserves time and memory bandwidth.
(c) Transport Acceleration: In a traditional system, the processor maintains substan-
tial connection information and resources for maintaining network stack infor-
mation. This includes connection context and payload reassembly. This overhead
increases linearly with the speed of the wire. Using iWARP, this information is
maintained at the network controller. This frees up processor resources,
which are then available to other applications.
9.4.4 iSCSI
iSCSI (Internet Small Computer Systems Interface) is a protocol for linking data storage
facilities [254] through a Storage Area Network (SAN). It is one of the most widely used
protocols for remote data storage access. It encompasses two major protocols to facilitate
access: the Small Computer System Interface (SCSI) protocol is used for accessing storage,
and TCP is used for network transport. It enables applications like data mirroring,
remote backup, and remote management. Such features allow a big data system to have
management and data recovery capabilities in case of catastrophic events. These are
specifically needed for mission-critical systems such as business and banking data.
The iSCSI protocol supports many kinds of interconnects between a server and remote
storage. These include the parallel SCSI bus, fibre channel, and RDMA. Of these, the parallel
SCSI bus suffers from the limitation that it allows only a few meters of distance between the
server and the storage device, and the number of storage devices is limited to 16. Alternatively,
SCSI devices can be connected via fibre channel, which increases the allowable distance and
the number of devices. In this context, Fibre Channel-Arbitrated Loop (FC-AL) [159, 222]
can attach up to 126 nodes. Since multiple FC-ALs can be connected, the number of disks
could be considerably large. However, fibre channel does not solve the distance problem, and
access to the SCSI device remains confined to the LAN.
Traditional TCP-based solutions for network access solve the distance issue and allow
access to remote storage locations. However, the speed of data transfer remains limited.
The main reason for the slow speed is the out-of-order arrival and reassembly of TCP
segments. This reassembly is required because the iSCSI header is not likely to be present in
every TCP packet. Thus, TCP packets are stored and reassembled so that data can be
placed in iSCSI buffers. Reassembly wastes memory bandwidth and CPU cycles and affects
the overall speed [29].
iSCSI can also utilize the RDMA technique for its access protocol [310, 314]. iSCSI Extensions
for RDMA (iSER) is the generic term used for this purpose [124]. iSCSI can exploit direct
data placement from RDMA to achieve higher speed. Using DDP, the need for intermediate
operations for data transfer is eliminated and data is copied directly to an application
buffer. RDMA Write, RDMA Read Request, and RDMA Read Response messages are used
for data as well as control messages for the iSCSI data type.
1. Large Buffers in Switches: One possible solution is to increase the size of the buffers
in switches. In this way, a large queue can be maintained and packet drops can be
decreased. However, large queues increase the cost of switches and do not have a
significant impact on packet drops [136].
2. TCP's Minimum RTO: In TCP, the Retransmission Time Out (RTO) is the time during
which a sender must wait for an acknowledgement before re-sending a packet
to the receiver. A large value of RTO avoids unnecessary retransmission of packets and
increases TCP throughput; however, it slows down the system, as packet loss is detected
late. In comparison, a small value of RTO reduces the packet-loss detection time at
the cost of reducing the system throughput. As big data systems are connected via
high-speed networks in a cluster environment, a low value of RTO seems appropriate
in many cases. Many researchers have performed experiments demonstrating the effec-
tiveness of a low RTO value; typically, an RTO of 1 ms has been found to be effective.
Decreasing the TCP RTO may reduce latency; however, it may decrease throughput due
to the possibility of a large number of retransmissions.
3. DCTCP: Data Center TCP is a variation of TCP which employs Explicit Congestion
Notification (ECN). Using ECN, a switch can mark packets if its queue exceeds a predefined
threshold. This marking is then used by the recipient to notify the sender about
the possibility of congestion at the switch. The sender can then adjust its sending
rate to avoid packet loss. DCTCP carries the advantage of providing low latency and
high throughput.
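The DCTCP behavior just described can be summarized by its published window-update rule: a running estimate alpha of the fraction of ECN-marked packets is maintained, and the congestion window is cut in proportion to alpha rather than halved. The sketch below follows that rule; the gain g = 1/16 is a commonly used default and the numeric example is illustrative.

```python
# Sketch of DCTCP's window adjustment: alpha estimates the fraction of
# ECN-marked packets, and the window is reduced in proportion to alpha.
def dctcp_update(cwnd, alpha, acked, marked, g=1/16):
    frac_marked = marked / acked if acked else 0.0
    alpha = (1 - g) * alpha + g * frac_marked    # running congestion estimate
    if marked:
        cwnd = cwnd * (1 - alpha / 2)            # mild cut under light congestion
    return cwnd, alpha

cwnd, alpha = 100.0, 0.0
cwnd, alpha = dctcp_update(cwnd, alpha, acked=50, marked=5)
print(round(cwnd, 2), round(alpha, 4))
```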
9.6.1 Fat-Tree
The fat-tree architecture is motivated by the need for fault tolerance in data centers. A
fat-tree is a general-purpose interconnect that is effective in utilizing high-speed inter-
connect bandwidth. Hosts (or nodes) are connected in a tree-like structure such that link
bandwidth increases as we move upward from the leaves to the root. That is, as we traverse from
end hosts to the core of the network, the bandwidth of the network increases. This characteristic
reduces network congestion [246].
A typical fat-tree architecture is shown in figure 9.7. We can learn from the figure that
if a host wants to send a message to another host then there are multiple paths available.
A unique feature of a fat-tree is that, for any switch, there are the same number of links
going down to its children as there are links going up to its parents in the upper level.
In comparison to a traditional tree, in which the link bandwidth is fixed, a fat-tree has
increasing bandwidth as we traverse upward from the leaves to the root. In this manner,
the congestion problem is mitigated. In a fat-tree, leaf nodes represent processors,
whereas internal nodes are switches. Fat-trees also employ commodity hardware in order to
reduce cost.
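For the widely cited k-ary fat-tree built entirely from identical k-port commodity switches (the construction of Al-Fares et al., which the description above resembles but does not name), the element counts follow directly from k. The small calculation below assumes that specific construction.

```python
# Element counts for a k-ary fat-tree built from k-port commodity switches,
# assuming k pods, each with k/2 edge and k/2 aggregation switches, plus
# (k/2)^2 core switches and k/2 hosts per edge switch.
def fat_tree(k):
    edge = agg = k * (k // 2)        # per-layer switches across all k pods
    core = (k // 2) ** 2
    hosts = (k ** 3) // 4
    return {"edge": edge, "aggregation": agg, "core": core, "hosts": hosts}

print(fat_tree(4))    # {'edge': 8, 'aggregation': 8, 'core': 4, 'hosts': 16}
print(fat_tree(48))   # 48-port switches support 27,648 hosts
```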
9.6.2 BCube
BCube is a network architecture that takes a server-centric approach [186]. It is based on the
modular data center, in which the data center is treated as a portable module that can
be deployed easily. The BCube architecture contains both servers and switches. Each server
has multiple ports, and each switch connects to a fixed number of servers. Figure 9.8 shows a
sample BCube architecture with both servers and switches.
BCube is constructed recursively: BCube0 is simply n servers connected to an n-port
switch, and BCubek is constructed from n BCubek-1s together with n^k additional n-port
switches, each connecting the same-index server from each of the n BCubek-1s.
With 8-port mini-switches, BCube can support up to 4096 servers in a single BCube3 [186].
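The recursive construction fixes the server and switch counts of a BCube as a function of the switch port count n and the level k; the small calculation below assumes that construction and reproduces the 4096-server figure quoted for BCube3.

```python
# Server and switch counts for BCube_k built from n-port switches:
# BCube_0 is n servers on one switch; each subsequent level multiplies the
# server count by n, giving n^(k+1) servers and n^k switches per level.
def bcube(n, k):
    servers = n ** (k + 1)
    switches = (k + 1) * n ** k      # k+1 levels of switches, n^k at each level
    return servers, switches

print(bcube(8, 3))   # (4096, 2048) -> matches the 4096-server BCube3 above
```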
Both BCube and fat-tree utilize redundancy to achieve high fault tolerance. This fault
tolerance comes at a high cost, which can be mitigated through the use of commodity hardware.
high-speed data transfer, dealing with TCP Incast, and high fault tolerance. To provide an
effective networking solution for big data is a challenging task. New challenges and solutions
continue to emerge.
10. High-speed network incurs high cost. Do you think this is justified for all big data
systems?
GLOSSARY
Bulk Transfer: It is a type of network traffic, which is usually required either for mainte-
nance or for management of a data center.
Control Messages: These are latency-sensitive messages, which are used to send time-
sensitive information such as cluster control or to detect abnormal activity or con-
figuration updates.
Direct Data Placement: It is a technique through which data can be copied directly from
the server adapter’s receive buffer to the application buffer. DDP is built upon the
RDMA protocol.
Fibre Channel over Ethernet (FCoE): It is a network technology that enables Fibre Chan-
nel communications to run directly over Ethernet.
InfiniBand (IB): It is a type of high-speed connection link, which is used in HPC systems.
It facilitates high-speed network through RDMA.
RCaP: It is also referred to as an RDMA-Capable Protocol. It is a generic term used to refer to
protocol stacks that provide Remote Direct Memory Access (RDMA) functionality,
such as iWARP and InfiniBand.
RDMA over Converged Ethernet (RoCE): It is a type of network, which delivers many
advantages of RDMA while using the Ethernet-based implementation instead of
InfiniBand-based Adapters (IBA).
Remote Direct Memory Access (RDMA): It is a technique to copy data directly to the
remote server’s memory.
Software-Defined Networking: It is a technique of implementing a network in which the control
plane is separated from the data plane in order to have more flexible and adaptive control
of the network.
CHAPTER 10
Security for Big Data
Security is an important operational requirement for big data systems. A secured system
is expected to provide trustworthy and uninterrupted access to users. The purpose of
this chapter is to assess security requirements, identify challenges, and elucidate security-
based solutions for the storage and processing of big data systems.
10.1 INTRODUCTION
In order to understand security requirements and solutions, let us first discuss what is meant
by security.
The term security in big data can be defined as collective means and measures to safe-
guard the service layer, data storage, processing, and networking against malicious attacks
such as theft, leakages, undesired access, and improper computational output [349, 386].
Figure 10.1 illustrates the four layers of a big data system. However, it is understood that the
implementation of a big data system varies from system to system. For instance, a big data
storage server may have a negligible role in computation. Similarly, a standalone big data
server will have a minimal implementation of the networking layer.
4. Ensuring Privacy and Protection Against Data Leakage: The system should respect the
privacy of users. It should also ensure that data is not leaked or stolen from
the system.
The above-mentioned four requirements should be met at all layers of a system [215].
1. Espionage: An action in which an adversary spies on classified or confidential data. The
main goal here is to retrieve information that would not otherwise be available.
2. Data Theft: An action which involves the theft of confidential data.
3. Denial of Service: Also called a DoS attack – an attack which degrades or denies the
core service operation of a system. For instance, a storage system will be unable
to provide storage access if it encounters a denial-of-service attack. When the attack
is launched through coordinated multiple sources, it is called a DDoS (Distributed
Denial of Service) attack.
4. Fabricated Message: A fake message injected by an adversary into the system
to implicitly or explicitly introduce false information.
An attacker can either be an insider or an outsider. The former implies that the
attacker is from within the organization, whereas in an outsider attack, the attacker
is an external person.
In order to launch an attack, an attacker needs access to the system. Access can either
be physical or through software. Software that gives access to an attacker
is known as malware. Table 10.1 summarizes the methods of access along with their
descriptions [161].
Once access is gained, an attack can be launched by exploiting vulnerabilities.
1. Network Flooding: A large number of messages are sent to the victim, which be-
comes unable to respond to legitimate requests, leading to a denial-of-service attack.
There could be different types of attack messages such as connection requests, service
requests, or arbitrary network messages [129]. Example: Denial of Service.
Table 10.2 Types of malware:

Virus: It must be triggered by the activation of the host, such as human activity.
Worm: It does not require activation through a host; it can self-replicate and execute itself.
Rootkit: Malicious code that provides remote access to the target host without being detected.
Trojan horse: Malware which disguises itself as a normal program.
Spyware: A malicious program that spies on user activity without being detected.
Adware: A type of malware which automatically displays advertisements.
Ransomware: Software which holds resources of the system hostage while demanding a ransom.
Downloader: Malicious code or software that downloads and installs additional malware to
perform exploitation on the infected system.
2. Malware: Malicious code is injected into a system through some means. In table 10.1,
we studied various mechanisms for gaining access. Malware can lead to different types of
attacks such as data theft, DDoS, and espionage. There are different types of
malware; these are summarized in table 10.2. Example: Denial of Service, Espionage,
Data Theft, Fabricated Messages.
3. Botnet: An adversary can inject malicious code into a computing system and control the
system remotely. Such a compromised system is called a bot (derived from robot). A
network of bots is created when an attacker is able to control many nodes. A botnet can
be used to launch an attack on a target system; it can also be used to mount a flooding attack.
One of the most notable botnets, which hit millions of devices, especially Internet of
Things (IoT) appliances, and caused major outages in previous years, was the Mirai
botnet. The adversaries were able to compromise millions of devices to
perform massive DDoS (Distributed Denial of Service) attacks against well-known
service providers including GitHub, Netflix, PayPal, Twitter, and Reddit. Example:
Denial of Service, Espionage, Data Theft, Fabricated Messages.
4. DNS (Domain Name Service) Attack: An attacker exploits vulnerabilities in the DNS
entries of a system to make the DNS server holding the DNS information unavailable.
Consequently, the big data system becomes unavailable. Example: DDoS.
5. Man-in-the-Middle Attack: An adversary can intercept and fabricate communication
between two communication end points. Example: Espionage, Fabrication, DDoS, and
Data Theft.
6. SQL Injection Attack: Malicious SQL statements are executed to extract sensitive
information from a database. This is specifically true if the database is configured
incorrectly [189]. Example: DDoS and Data Theft.
7. Clickjacking: An adversary can hijack user clicks through different measures such as
sending spoofed emails or malicious links. Malware is then installed on the system, which
can be used to launch different attacks such as joining a botnet [318]. Example: Data Theft,
DDoS, and Espionage.
8. Drive-by Download Attack: An adversary can initiate the download of malware with-
out the user's knowledge, for instance through clickjacking. Example: Data Theft, DDoS,
and Espionage.
9. Cross-site scripting: An attacker can initiate execution of malicious script through
embedded HTML. Example: Data Theft, DDoS, and Espionage.
10. Data Breach: An adversary can gain access to confidential data in the system. This
can be achieved through various means such as malware, a man-in-the-middle attack, or
even misconfiguration of the system. Example: Data Theft, DDoS, and Espionage.
11. IP Prefix hijacking: The prefix of a group of IP addresses is hijacked by corrupting IP
routing tables in routing configuration. The hijacked group of addresses would expe-
rience reachability problems. This could lead to inaccessibility of services. Example:
DDoS and Espionage.
Example 10.1 (Attack Scenarios in Big Data Systems). Figure 10.2 illustrates dif-
ferent scenarios of security attacks in big data systems. Three clients are part of a
botnet and launch a DDoS attack. A man-in-the-middle attack is also shown in the
figure; it can be initiated either by an insider or by an outsider. In the figure, the
computation layer has been attacked through a virus and a worm. DDoS can be launched
through various means, including network flooding and DNS unavailability.
10.4.1 Encryption
A major challenge in ensuring security is that communication between two end points of a
network can be sniffed by a third party. We studied man-in-the-middle attack in the pre-
vious section, which allows a third entity to sniff messages between two hosts. To maintain
confidentiality, communication can be encrypted. Encryption allows a message to be en-
coded with a certain scheme such that it cannot be understood by any entity other than the
intended recipient. A message encrypted by the sender is decrypted by the recipient at the
receiver end. The encrypted text is also known as ciphertext.
Example 10.2 illustrates a scenario of encrypted communication between two end points.
Encryption can prevent the man-in-the-middle attack. Encryption is not limited to
communication; it can also be used to encrypt stored data. In such a scenario, encryption
can ensure data anonymity: even in the case of data theft, encrypted data prevents the
leakage of meaningful information.
Understanding Encryption
For encryption, two attributes are important, i.e., encryption function and encryption
key. The former specifies the algorithm which is used to encrypt a message, whereas the
latter determines the functional output of the encryption algorithm. Based on the type of
encryption keys, there are two types of techniques for encryption:
1. Symmetric Key: In this type of cryptography, the same key is used for encryption as
well as for decryption. Both the sender and the receiver must devise a mechanism to
exchange the key. Secrecy of the key must be maintained between the two parties.
Any entity with access to the secret key can either encrypt or decrypt the message.
Of the two types of encryption techniques, symmetric key encryption is lighter, i.e.,
it requires fewer computational resources. However, it is not scalable: for point-to-
point communication among n nodes, n(n−1)/2 distinct keys are needed, and exchanging
these keys among the n nodes remains a challenge.
In comparison, public/private key pair-based cryptography requires more computational
resources. However, it is scalable: for communication among n nodes, only n pairs of public
and private keys are needed.
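A quick calculation illustrates the scalability difference between the two schemes; the node counts below are arbitrary examples.

```python
# Number of keys needed for pairwise-secure communication among n nodes.
def symmetric_keys(n):
    return n * (n - 1) // 2        # one shared secret per pair of nodes

def asymmetric_keys(n):
    return n                       # one (public, private) key pair per node

for n in (10, 100, 1000):
    print(n, symmetric_keys(n), asymmetric_keys(n))
# 1000 nodes need 499,500 shared secrets but only 1,000 key pairs.
```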
Digital Signatures
10.4.3 Firewall
A firewall is a combination of software and hardware, which is placed at the entry point
of a network. Its purpose is to inspect all incoming and outgoing network packets in order
to prevent attacks against networks. A firewall can either inspect packets against different
rules or it can review packets against signatures of malware.
Example 10.4 illustrates that the firewall has a set of rules. For instance, DNS packets
(UDP port 53) is prohibited from outside network.
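A rule-based packet filter of this kind can be pictured as an ordered rule list matched against selected header fields of each packet, with a default action when nothing matches; the rules below, including the UDP port 53 example, are illustrative and not taken from any specific firewall product.

```python
# Minimal sketch of rule-based packet filtering: the first matching rule decides.
RULES = [
    {"proto": "UDP", "port": 53,  "source": "outside", "action": "DROP"},   # block external DNS
    {"proto": "TCP", "port": 443, "source": "any",     "action": "ALLOW"},  # allow HTTPS
]
DEFAULT_ACTION = "DROP"

def filter_packet(packet):
    for rule in RULES:
        if (rule["proto"] == packet["proto"]
                and rule["port"] == packet["port"]
                and rule["source"] in ("any", packet["source"])):
            return rule["action"]
    return DEFAULT_ACTION

print(filter_packet({"proto": "UDP", "port": 53, "source": "outside"}))   # DROP
```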
Example 10.5 (Access Control Mechanisms). Suppose Alice tries to access her bank
account through her mobile application. She logs on to the application using her creden-
tials. Her banking application allows her to log in after the necessary verification from
the bank. This process is considered Authentication. After authentication, Alice's
bank allows her to access all the relevant accounts. This process is called Au-
thorization. During the login session, all of Alice's activities are monitored and
tracked to ensure security. This process is called Accountability.
10.4.7 DNSSEC
A DNS server is used to resolve a host name to an IP address. If this resolution is altered
then it can be a security threat to a big data system. For instance, DNS record of a big
data system could be targeted – thereby making the big data system unavailable.
Traditional DNS servers are vulnerable to two kinds of attacks, i.e., the man-in-the-middle
attack and the DNS cache poisoning attack. In the former case, an intruder can listen to and alter
the DNS communication between a DNS client and a DNS server. In the latter case, the local
DNS resolver receives a flood of forged DNS record entries, redirecting DNS resolution to a
fake server.
The Domain Name System Security Extensions (DNSSEC) is a set of security protocols
which can be used to ensure protection from these attacks. DNSSEC ensures integrity
and security by sending digitally signed DNS records to the client. Signature ensures that
messages cannot be altered by an attacker.
10.4.9 Blockchain
We have studied digital signatures, which ensure the integrity of data. However, when the
integrity of a large number of transactions must be maintained, blockchain is a
useful solution.
Blockchain is a distributed ledger which is immutable [356]. This implies that data is
stored in a distributed manner among many nodes and cannot be altered. Since data is
distributed, there is no single point of failure. The major characteristic of a blockchain
system is its integrity. Transactions are arranged in the form of blocks such that each block
has many transactions. Blocks are arranged in the form of a chain. Each block’s header
has a hash which is computed from the hash of all the transactions in the block and the
hash of the previous block in the chain. For the computation of the hash, refer to example 10.3.
Since each block's hash covers all the transactions in the block, modifying any transaction
alters the hash of that block. Similarly, since each block is linked to its previous block through
a cryptographic hash, such a modification also invalidates the hashes of all subsequent blocks
in the chain.
Blockchain systems have been largely used to record transactions of cryptocurrencies.
However, they can also be used for other systems such as supply chains [197]. Example 10.8
explains a blockchain system.
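The following sketch illustrates how blocks can be chained through hashes. It assumes std::hash as a stand-in for a cryptographic hash such as SHA-256, and the transactions are illustrative.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A block holds transactions, the hash of the previous block, and its own hash.
struct Block {
    std::vector<std::string> transactions;
    std::size_t previousHash;
    std::size_t hash;
};

std::size_t computeHash(const Block &b) {
    std::string data;
    for (const std::string &tx : b.transactions) data += tx;   // hash covers all transactions
    data += std::to_string(b.previousHash);                    // plus the previous block's hash
    return std::hash<std::string>{}(data);
}

int main() {
    Block genesis{{"Alice pays Bob 5"}, 0, 0};
    genesis.hash = computeHash(genesis);

    Block next{{"Bob pays Carol 2", "Carol pays Dave 1"}, genesis.hash, 0};
    next.hash = computeHash(next);

    // Tampering with any transaction changes the block hash, which then no longer
    // matches the previousHash stored in the following block.
    genesis.transactions[0] = "Alice pays Bob 500";
    bool chainValid = (computeHash(genesis) == next.previousHash);
    std::cout << (chainValid ? "chain intact" : "tampering detected") << std::endl;
    return 0;
}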
Blockchain System
Classification is a type of process in which input data is grouped into different types or categories.
ML models have been used to detect attacks as well as to identify patterns of malware.
For instance, binary classification is used to classify events as either malicious or benign.
Similarly, multi-label classification is used to identify more than one type of malware.
Various classification methods such as Decision Trees, Support Vector Machines, and
Naive Bayes are used for classification. These algorithms require that the dataset
is labeled. That is, the categories which need to be identified are labeled in the dataset.
A major limitation of these techniques is that they are only useful for detecting attacks which
have already occurred.
6. Explain the use of access control mechanisms. How do they help to improve security?
7. Explain the difference between Denial of Service (DoS) and distributed DoS (DDoS)
attack.
GLOSSARY
Access Control Systems: It is a kind of system with formalized set of rules or policies for
granting or restricting user access to information, systems, and other resources.
Adversary: It refers to an entity who has malicious intent and aims to induce security
attacks.
Attack Vector: It refers to a path or method through which an attack can be launched against a system.
Botnet: It refers to an entire network of compromised devices (bots) without the knowledge
of their owners.
Bots: The term is inspired from robots. It refers to a set of automated programs which are
designated to perform tasks and respond to instructions. Auto registrations in online
services are one example.
Clickjacking: It is a process of misleading a user to click on a hidden malicious link or
button performing unintended actions without realizing it.
Denial of Service (DoS) attack: It is a kind of attack which disrupts the availability of ap-
plications, servers, and network resources to legitimate users.
Distributed Denial of Service (DDoS) attack: It refers to an attack which is launched from
a distributed set of sources each of which is focused on launching a denial of service
attack against the same target.
Espionage: It is a method to find out confidential information from non-disclosed sources.
Intrusion: It is a process of intruding into the system through some vulnerability.
Zero-Day Attack: It refers to an undiscovered or previously unknown attack.
Zero Trust Security Model: It is based on the design principle that all networks should be
considered "untrusted" and we should assume nothing (zero-assumption).
CHAPTER 11
11.1 INTRODUCTION
The term big data computing refers to massive amounts of data – whereby storing, analyzing,
and understanding such data can push the frontiers of technologies to the limit [317].
Massive data is being collected for big data systems by enterprises, organizations, and gov-
ernment agencies either with or without a user’s knowledge. When a huge amount of data is
collected from different sources, it brings the potential of solving many unanswered queries.
Meticulous analysis and methodological prediction of big data have revealed useful results
and solved unresolved queries for many applications.
While this meticulous analysis and methodological prediction have revealed useful results
and answered open questions, they have also raised privacy concerns. A user may be
totally unaware that privacy is being violated in the collection of data. Ensuring that the
privacy of a user is not violated and that confidential information is not shared in big data
systems is a massive challenge.
The purpose of this chapter is to provide an overview of privacy violations and concerns
in the context of big data. This chapter explains the categorization of different privacy
violations types which exist in big data systems and assesses the effectiveness and limitations
of their protection techniques. The chapter also reviews legal propositions, laws, and social
effects in privacy preservation and studies significant measures that can be implemented to
strengthen privacy of the users in big data systems.
privacy is not violated. It only provides a legal cover to the organization that it is legally
justified to collect the information.
Other techniques for inferring confidential information include collecting user search
results, collecting credit card shopping trends, and inferring electricity consumption through
average monthly bills. All such methods may reveal specific information about a user.
a) Tracking by Governments
b) Utilization of confidential information by service providers
c) Re-identification attacks through external sources
d) Data breach
These user profiles provide service providers with the ability to improve service to end users.
For instance, Facebook and LinkedIn exploit this information for insights and recommend
friends, jobs, groups, etc. Google, for instance, states that it scans users’ emails for virus
and spam protection, spell check, relevant search results, and features such as priority inbox
and auto-detection of calendar events [247].
This utilization of information, however, poses a significant threat to user privacy. Ser-
vice providers may utilize this confidential information without user approval, which can
result in disclosure of personal information. For instance, they can use the personal
information in user profiles for targeted advertising. They can sell personal data to third
parties for monetary or financial gains without users' knowledge or consent.
This information may then be used for commercial or even malicious purposes. In
recent years, this increased utilization of information has shown the importance of taking
privacy into consideration when contributing data to online services.
Re-identification attacks
1. Correlation attacks: This type of attack requires an adversary to correlate two or more
datasets to obtain a more clear and distinct record. A distinguishing
feature of the correlation attack is that it does not reveal information about any
particular person. Instead, sensitive information about a group of individuals is iden-
tified. The data breach caused in the Netflix dataset is an example of the correlation
attack [280].
2. Arbitrary Identification attacks: In this type of re-identification attack, the objective
is to relate at least one data entry in an accumulated dataset to the identity of a
particular individual, with an adequate level of likelihood. This leads to learning all
anonymously released information about that individual. The privacy breach caused by
the AOL search data [89] is an example of this attack, in which analyzing the anonymized
dataset resulted in identification of specific members.
Note that these solutions are inter-related. For instance, strong laws can compel providers
to implement measures against data breaches.
An example of K-anonymity
The K-anonymity technique suffers from the homogeneity attack and the back-
ground knowledge attack. Both attacks are caused by the fact that
there is not sufficient diversity in the set of sensitive values present in a QID group.
(a) Homogeneity Attack: The homogeneity attack occurs when each record, with
the same value for the quasi-identifier, also assumes the same value for the sen-
sitive attribute. Using homogeneity attack, an adversary can identify sensitive
information related to an individual if there is not much diversity in the data as
illustrated in example 11.3.
(b) Background Knowledge Attack: K-anonymity does not guarantee privacy against
attackers who have background knowledge about the target respondent. An ad-
versary with background knowledge about the gender, zip code, and physical condition of
the victim can also identify sensitive information as illustrated in example 11.4.
An example of L-diversity
Datasets using L-diversity techniques can suffer from skewness and similarity attacks.
(a) Skewness Attack: The skewness attack occurs when the distribution of the sensi-
tive attribute values within a given equivalence class is significantly different from
the frequency distribution of sensitive attribute values in the overall
dataset.
(b) Similarity Attack: The similarity attack is possible when the values of the sensitive
attribute in an equivalence class are distinct but semantically similar, so an adversary
can determine the likely possibilities of the disease. Under this condition, an adversary
can infer important information as illustrated in example 11.6.
4. Differential Privacy: Differential privacy [337] is a privacy model that offers strong privacy
guarantees. It aims to limit the disclosure of sensitive data by limiting the impact
of each individual on the answered query. This is achieved by adding appropriately
chosen noise, e.g., via the Laplace mechanism or the geometric mechanism, to the aggregate
results.
This technique protects individual privacy by adding sufficient noise to the query
result; however, the original data still resides at the server, where it is vulnerable
to data breaches. Improper disclosure of the original data can cause privacy
breaches.
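A minimal sketch of the Laplace mechanism is shown below; the query, sensitivity, and epsilon values are assumed for illustration.

#include <cmath>
#include <iostream>
#include <random>

// Laplace mechanism sketch: a count query is perturbed with noise drawn from
// Laplace(0, sensitivity/epsilon) before the result is released.
double laplaceNoise(double scale, std::mt19937 &rng) {
    std::uniform_real_distribution<double> uniform(-0.5, 0.5);
    double u = uniform(rng);
    // Inverse-CDF sampling of a Laplace distribution centered at 0
    return -scale * ((u < 0) ? -1.0 : 1.0) * std::log(1.0 - 2.0 * std::fabs(u));
}

int main() {
    std::mt19937 rng(std::random_device{}());
    double trueCount   = 42.0;   // e.g., number of patients with a given disease
    double sensitivity = 1.0;    // a count changes by at most 1 if one person is added or removed
    double epsilon     = 0.5;    // privacy budget: a smaller epsilon means more noise

    double noisyCount = trueCount + laplaceNoise(sensitivity / epsilon, rng);
    std::cout << "released count: " << noisyCount << std::endl;
    return 0;
}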
9. Elaborate on how information collected by service providers can lead to privacy viola-
tions.
10. Describe the significance of laws in protecting privacy of individuals.
GLOSSARY
Arbitrary Identification Attack: This type of attack attempts to relate with high prob-
ability at least one entry in an accumulated dataset to the identity of a particular
individual.
Correlation Attack: This attack involves correlating different datasets to obtain a more
distinct and cohesive set of database records.
Cyber Security: This refers to technologies, processes, and measures that are designed to
protect systems, networks and data from cybercrimes.
Data Breach: It refers to the act of security breach that leads to the accidental or unlawful
destruction, loss, alteration, unauthorized disclosure of, or access to protected data
transmitted, stored or otherwise processed.
Differential Privacy: This technique aims to limit the disclosure of sensitive data by limiting
the impact of each individual in the answered query.
K-anonymity: It is a technique of providing anonymity to data by requiring at least k
records in the dataset to share the same quasi-identifiers (QIDs); an individual is thus
hidden among at least k-1 others.
L-diversity: It is a technique, which relies on the distribution of a sensitive attribute such that
each equivalence class has at least l well-represented values.
PRISM: It is a program which allows the US government to collect data from major service
providers for intelligence purposes.
Re-identification Attacks: It is an attack which utilizes anonymized data to find individuals
in public datasets.
Service Provider: It refers to an organization that provides a service to a normal user, e.g.
The Internet Service Provider.
Surveillance: It refers to an act of closely observing a person or a group, especially one
under suspicion.
Target Identification Attack: It refers to initiating an intentional attack in order to target
a specific identity.
T-closeness: It refers to an anonymity technique in which the frequency distribution of
sensitive attributes within each equivalence class should be close (t-close, where t is
a fixed threshold value) to their distribution of the sensitive attributes in the entire
dataset.
IV
Computation for Big Data
CHAPTER 12
High-Performance
Computing for Big Data
CONTENTS
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.2 Scalability: Need for HPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.3 Graphic Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.3.1 CUDA Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
12.3.2 CUDA Programming Model Through an Example . . . . . . . . . . . . . 236
12.3.3 Usage and Advancements for Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.4 Tensor Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
12.5 High Speed Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
12.6 Message Passing Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
12.7 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.8 Other Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
12.8.1 Volunteer Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.8.2 Supercomputing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.11 Exercise Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Big data systems require massive amounts of computational and storage resources for
their proper execution. The notion of High-Performance Computing (HPC) is often
utilized to meet the large scalability demands of big data systems.
The purpose of this chapter is to explain the usage of HPC systems for Big Data.
12.1 INTRODUCTION
A fundamental requirement of big data systems is that they require large computational
resources for storing and processing massive volumes of data. High-Performance Computing
(HPC) is a type of computing in which large computational resources (computational power
or storage) are utilized to achieve higher performance. HPC involves the integration of hard-
ware and software. In the context of big data, HPC systems are integral as they provide the
necessary computational and storage resources for execution.
In chapter 2, we discussed organization of big data systems into clusters. We also studied
distributed and shared memory paradigms in section 2.3. These concepts are necessary to
understand organization of HPC systems.
FIGURE 12.1 Support for parallelism – Difference between GPU and CPU
1. Horizontal Scaling: This involves adding more computational resources such as inde-
pendent workstations or adding more processors. This type of scaling often requires
adding more resources without replacing existing resources. Therefore, commodity
hardware and workstations can also be deployed. Horizontal scaling employs dis-
tributed memory model for communication among the workstations. Google’s MapRe-
duce system has been based on horizontal scaling.
2. Vertical Scaling: Vertical scaling involves upgrading or replacing hardware with a more
powerful unit. This type of scaling often involves a single server with a more capable
execution unit. Vertical scaling often relies on a shared memory model.
A GPU program (also called a kernel) is organized into blocks of threads. Architec-
turally, a GPU consists of scores of Streaming Multiprocessors (SMs). Each SM consists of
a number of Streaming Processors (SPs). An SP is an execution mapping of a thread.
A GPU program is divided into execution units called threads, which are executed on SPs.
These threads are organized into thread blocks, which are mapped onto SMs. Each thread has
access to a small and fast cache memory as well as a larger and slower global memory. Threads
within the same block can be synchronized for execution. This implies that checkpoints can
be inserted during execution. They also share a memory block called shared memory. This is
smaller and faster than global memory. Thread communication across blocks is restricted.
That is, threads in one block are not allowed to communicate with the threads in another
block. This is because it is not guaranteed that all thread blocks will be executed at the
same time.
OpenCL (Open Computing Language) and CUDA (Compute Unified Device Architec-
ture) are two major programming platforms for GPUs. OpenCL is an open industry standard
and can be used to program GPUs of different vendors including AMD, Intel,
and NVIDIA. In comparison, CUDA is specific to NVIDIA GPUs.
It is comparatively easier to use and well-adapted for GPU programming. A CUDA program is
executed in two parts. It is invoked through the CPU, which is also called the host. The
parallel portion of the code is executed on the device, i.e., the GPU. The syntax of CUDA is
almost similar to C/C++.
1. Creating input arrays A and B and the resultant array C on the host.
2. Creating arrays on the device using cudaMalloc.
Example 12.2 shows CUDA code for the addition of two arrays A and B into a third array C.
The addition is performed in the kernel function "add". The function takes three parameters,
i.e., pointers to three memory locations a, b, and c. The function simply adds the values of
a and b and stores them in c. Since the function is called separately for each thread of the
GPU and each thread only adds one index element from arrays A and B, the program is
capable of adding the total number of elements declared by the variable arraySize.
The main function declares the two input arrays and the resultant array c. The function
cudaMalloc is responsible for allocating memory on the CUDA device. It takes two parameters.
The first parameter is the pointer, while the second parameter is the desired number of bytes
to allocate.
We should observe that cudaMalloc only allocates memory in the global memory; it
does not copy the contents. In order to load the memory with input values, we use the cu-
daMemcpy function. The function is used to copy data from the CPU (host) to the GPU (device)
and vice versa. It has four options for data copying: cudaMemcpyHostToHost, cu-
daMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice.
These options specify the direction of the copy from host to host, host to device, device to
host, and device to device, respectively. Subsequently, there are two calls to cudaMemcpy,
one before calling the kernel function for copying data from host to device, and the second after the
kernel function for copying data from device to host.
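The following is a minimal sketch in the spirit of example 12.2; the exact listing in the example may differ, and error checking is omitted for brevity.

#include <cstdio>

// Kernel: each GPU thread adds one element of a and b into c.
__global__ void add(const int *a, const int *b, int *c) {
    int i = threadIdx.x;            // index of this thread within the block
    c[i] = a[i] + b[i];
}

int main() {
    const int arraySize = 5;
    int A[arraySize] = {1, 2, 3, 4, 5};
    int B[arraySize] = {10, 20, 30, 40, 50};
    int C[arraySize] = {0};

    int *devA, *devB, *devC;
    cudaMalloc((void **)&devA, arraySize * sizeof(int));   // allocate device (global) memory
    cudaMalloc((void **)&devB, arraySize * sizeof(int));
    cudaMalloc((void **)&devC, arraySize * sizeof(int));

    cudaMemcpy(devA, A, arraySize * sizeof(int), cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(devB, B, arraySize * sizeof(int), cudaMemcpyHostToDevice);

    add<<<1, arraySize>>>(devA, devB, devC);                // one block of arraySize threads

    cudaMemcpy(C, devC, arraySize * sizeof(int), cudaMemcpyDeviceToHost);  // device -> host

    for (int i = 0; i < arraySize; ++i) printf("%d ", C[i]);
    cudaFree(devA); cudaFree(devB); cudaFree(devC);
    return 0;
}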
More details about CUDA programming can be obtained from references [223, 308].
1. MapReduce: A GPU can act either as a mapper or a reducer. Remember that during the map phase
of MapReduce, the SIMD (Single Instruction Multiple Data) architecture is followed. This
makes GPUs ideal for MapReduce tasks.
For GPU-compliant MapReduce, execution can follow two steps. During the first
step, map phase is executed and each GPU thread can individually execute mapper.
Alternatively, threads within a block can collaborate to combine map tasks operating
within a block. The resultant data from the map phase can be written back to the
CPU, where values are aggregated according to keys. A second kernel can be launched
for reduce task.
A core advantage of GPU-based MapReduce is that the map phase can utilize the
inherent parallelism of GPUs. Further, the communication bottleneck between mapper
and reducer can be reduced, as CPU-to-GPU bandwidth is higher than the network
bandwidth.
2. Spark: It is another popular big data platform and has been discussed in chapter 6.
Spark has been ported to GPUs [243, 380]. We should recall that Spark utilizes in-
memory computation to overcome the shortcomings of MapReduce for iterative tasks and
streaming. GPUs can meet this requirement through the availability of shared memory per
block. Further, stream processing and pipelining in GPUs can be utilized to develop
efficient Spark solutions. However, memory on graphics cards is smaller compared
to CPU memory; therefore, the capabilities of GPU-enabled Spark remain limited unless
multiple GPUs are connected together.
3. Machine Learning and Deep Learning: These are two important applications of big
data. Deep Learning will be discussed in chapter 13. GPUs can be greatly helpful
for both these applications. Machine learning requires iterative computing-based so-
lutions, where interim results need to be stored in memory instead of being written
to hard disk. In a related context, deep learning requires layers of computation for
feature extraction and testing. Both techniques require extensive parallelism for faster
execution. GPUs can be greatly useful in providing an efficient parallel solution.
1. Matrix Multiplier Unit (MXU): For matrix multiplication, there are 65,536 8-bit
multiply-and-add units.
2. Unified Buffer (UB): 24 MB of dedicated SRAM, which works as registers.
The process of running a trained model and classifying data or estimating missing labels
in the data is called inference. The process of inference using a Neural Network takes the
following steps:
1. Multiply input value (X) with input weights to produce the input signal.
2. The resultant values (i.e., input of all the features and their weights) are aggregated
to produce a single value.
3. An activation function (such as ReLU, Sigmoid, or tanh) is applied to assess the activity
of the neuron.
We can see that the process of inference is multiplication- and addition-intensive.
The process can also be described as matrix multiplication, as it represents the series of
steps involved in multiplying matrices.
For instance, given three inputs (with weights) and three neurons for a fully connected
neural network, there will be nine multiplications and six additions in total. In general, a
big data application could have hundreds to thousands of features. Therefore, the task of
inference is matrix multiplication intensive.
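The following sketch illustrates this multiply-and-add pattern for a single fully connected layer with three inputs and three neurons; all weights and inputs are assumed values chosen only for illustration.

#include <cstdio>

// Inference for one fully connected layer: three inputs and three neurons give
// nine multiplications and six additions, followed by an activation function.
float relu(float x) { return x > 0.0f ? x : 0.0f; }

int main() {
    float x[3] = {0.5f, -1.0f, 2.0f};               // input features
    float w[3][3] = {                               // weights[neuron][input]
        { 0.2f, -0.4f,  0.1f},
        { 0.7f,  0.3f, -0.6f},
        {-0.1f,  0.8f,  0.5f}
    };
    float bias[3] = {0.1f, -0.2f, 0.05f};
    float y[3];

    for (int n = 0; n < 3; ++n) {                   // one pass per neuron
        float z = bias[n];
        for (int i = 0; i < 3; ++i)
            z += w[n][i] * x[i];                    // multiply-and-add, as in the MXU
        y[n] = relu(z);                             // activation decides whether the neuron fires
    }
    for (int n = 0; n < 3; ++n) printf("%f\n", y[n]);
    return 0;
}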
Google observed that many of the applications being executed in Google data
centers carry a considerable load of matrix multiplication. If the process of matrix multipli-
cation could be made efficient with respect to energy and time, it could have a considerable
impact on the Google cloud.
Instruction and purpose:
Read_Host_Memory: Read data from memory.
Read_Weights: Read weights from memory.
MatrixMultiply/Convolve: Multiply or convolve the data with the weights and accumulate the results.
Activate: Apply activation functions.
Write_Host_Memory: Write results to memory.
TPU is an ASIC which has been built with two major modifications:
Figure 12.5 shows a simplified architecture of the TPU. The core of the TPU is the Matrix Multiplier
Unit (MXU), which has 65,536 8-bit multiply-and-add units for matrix operations. The MXU
receives input from two sources: the weight FIFO scheduler and the systolic data
setup. The former supplies the weights, while the latter feeds in the input data.
The TPU has special instructions to support these operations. Table 12.1 shows a few sample
instructions mentioned in Google's official blog [309].
Example 12.3 (WordCount using MPI). Figure 12.7 illustrates execution of Word-
Count problem using MPI. Five nodes are shown with separate storage for data and
memory. Ranks (IDs) of the nodes are marked from 0 to 4. The head node (rank 0)
has four files. It sends one file to each node to perform the WordCount operation.
This communication is done using the MPI_Send function. At the receiving end,
nodes 1 to 4 have corresponding MPI_Recv functions. Each node (1 to 4) receives
the file, performs the WordCount operation, and sends the result of local computation
back to the head node. The head node aggregates the result.
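A minimal sketch of this master/worker exchange is shown below; file handling and the actual word counting are omitted, and the partial counts are placeholders.

#include <mpi.h>
#include <stdio.h>

/* Master/worker pattern from example 12.3: the head node (rank 0) hands one
   work item to each worker and gathers the partial counts back. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        long total = 0;
        for (int worker = 1; worker < size; ++worker) {
            int fileId = worker;                        /* which file the worker should count */
            MPI_Send(&fileId, 1, MPI_INT, worker, 0, MPI_COMM_WORLD);
        }
        for (int worker = 1; worker < size; ++worker) {
            long partial;
            MPI_Recv(&partial, 1, MPI_LONG, worker, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += partial;                           /* aggregate results at the head node */
        }
        printf("total words: %ld\n", total);
    } else {
        int fileId;
        MPI_Recv(&fileId, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        long partial = 100 * fileId;                    /* placeholder for the local WordCount */
        MPI_Send(&partial, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}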
An MPI code needs to include the <mpi.h> header file. This provides an interface for MPI com-
mands. In MPI, processes can communicate with each other if they share a communicator.
A communicator can be considered as a group of processes which can communicate with
each other. MPI_COMM_WORLD is the standard communicator.
Table 12.2 lists standard MPI functions along with their description. “*” and “&” sym-
bols in the table denote the pointer and address notions in the C language.
Example 12.4 illustrates functionality of four MPI collaborative functions.
Functions and their descriptions:
MPI_Init(&argc, &argv): This function must be called only once in every MPI program. It initializes the MPI execution environment. Other MPI functions follow this function. It also marks the beginning of a parallel section of code.
MPI_Comm_size(comm, &size): It returns the total number of MPI processes in the specified communicator. The number of processes is stored in the "size" parameter.
MPI_Comm_rank(comm, &rank): It returns the rank of the current process. The rank is stored in the "rank" argument. The rank of each process varies from 0 to n-1, where n is the total number of processes.
MPI_Send(*buf, count, datatype, dest, tag, comm): It sends "count" elements of type "datatype", stored at location "buf", to the process with rank "dest". "comm" specifies the communicator, and "tag" can be used to distinguish different messages.
MPI_Recv(*buf, count, datatype, source, tag, comm, *status): It is a blocking call. It receives "count" elements of type "datatype" from the process "source" and stores them at "buf". "comm" is the communicator and "tag" is used to distinguish messages. "status" reports the success or error status.
MPI_Abort(comm, errorcode): It terminates all processes associated with the communicator.
MPI_Barrier(comm): It is a synchronization operation. It creates a barrier synchronization in the group. When a process reaches the MPI_Barrier call, it blocks until all the processes reach the synchronization point.
MPI_Bcast(&buffer, count, datatype, root, comm): This function broadcasts (sends) a message of type "datatype" to all the processes. "comm" specifies the communicator, "root" is the rank of the sending process, "buffer" is the storage location of the message, and "count" denotes the number of elements.
MPI_Scatter(&sendbuf, sendcnt, sendtype, &recvbuf, recvcnt, recvtype, root, comm): This function distributes chunks of a message to all the processes in the group. However, unlike the broadcast function, which sends the same message to all the processes, this function sends an individual chunk to each process.
MPI_Gather(&sendbuf, sendcnt, sendtype, &recvbuf, recvcount, recvtype, root, comm): This function gathers distinct messages from each process in the group and sends them to a single destination process specified by "root". It is the reverse of MPI_Scatter.
MPI_Reduce(&sendbuf, &recvbuf, count, datatype, op, root, comm): This is a collective computation operation. It applies a reduction operation across all processes in the group and places the result in one process.
MPI_Finalize(): It terminates the MPI execution environment. It is the last MPI function in the program.
Example 12.4 (Collaborative Functions using MPI). Figure 12.8 illustrates func-
tionality of four collaborative functions of MPI. In MPI_Reduce a reduce operation
(such as sum) is applied. The result is stored at the “root” process. In MPI_Bcast,
the intended value is sent to all the processes in the group. In MPI_Scatter, chunks
of values are passed to the processes such that each process receives its corre-
sponding value. In MPI_Gather, individual values from each process are gathered
at the root process.
We can learn MPI programming from example 12.5 [47]. Each process in the program
prints a Hello world message.
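A minimal sketch along the lines of example 12.5 is shown below; the exact listing in the referenced tutorial [47] may differ.

#include <mpi.h>
#include <stdio.h>

/* Every process prints its rank and the total number of processes. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                       /* start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);         /* total number of processes */
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();                               /* last MPI call in the program */
    return 0;
}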
In chapter 4, we studied MapReduce and solved the WordCount problem with it. A
major difference between MPI and MapReduce is that the former provides more flexibility
to a developer. In contrast, MapReduce has a stronger abstraction at the cost of less
flexibility.
12.7 OPENMP
OpenMP is a shared memory programming platform. It is based on a thread-based program-
ming model, where multiple threads execute on processors that share memory. Since memory
is shared between the processors, there is no need to send and receive messages. The major
primitives ensure that critical sections remain protected.
We will now learn OpenMP from example 12.6 [48].
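A minimal sketch in the spirit of example 12.6 is shown below; it parallelizes a simple summation and uses a critical section to protect the shared total. The exact listing in the referenced tutorial [48] may differ.

#include <omp.h>
#include <stdio.h>

/* A team of threads splits the loop iterations; the critical section protects
   the shared total. */
int main(void) {
    const int N = 1000;
    long total = 0;

    #pragma omp parallel
    {
        long local = 0;
        #pragma omp for
        for (int i = 1; i <= N; ++i)
            local += i;                 /* each thread sums its share of the range */

        #pragma omp critical
        total += local;                 /* protected update of the shared variable */
    }
    printf("sum = %ld\n", total);       /* expected: N*(N+1)/2 = 500500 */
    return 0;
}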
12.8.2 Supercomputing
Supercomputing is a form of computing which involves computation through extensive com-
putational power to solve computationally expensive problems. Supercomputing generally
refers to the integration of hardware, software, networking, and algorithms to solve an
expensive problem. Some of the applications of supercomputing include weather analysis,
astronomy, predictive analysis on items, sales, and stocks, protein analysis, and other experiments in-
volving large amounts of data. Supercomputing can be considered as a type of HPC with
enormously high computational power. Table 12.3 shows a list of the top 10 supercomputers
along with their speed and architecture.
Algorithms such as KNN [172] and Decision Trees [251, 322] have been effectively implemented on GPUs.
Similarly, TensorFlow [75] and Theano [92] are popular deep learning solutions.
GLOSSARY
CUDA: It is a programming platform specific to NVIDIA GPUs. It allows the
CPU to perform general tasks and the GPU to perform data-parallel tasks.
GPU: It is a processor designed for processing graphics. It has hundreds of cores, which
makes it efficient for data parallel processes.
Deep Learning is a popular research area. It refers to the class of machine learning
which involves deep neural networks, such that several layers of a neural network
are utilized for feature extraction, training, and transformation. With the availability of
big data, machine learning [206] and deep learning have been popular in providing new
dimensions for prediction, classification, recognition, and analysis [179]. There are many
applications of big data and deep learning [279]. These include computer vision [135], speech
recognition, image and video analysis, and network flow analysis [256]. The purpose of this
chapter is to introduce readers to the emerging topic of deep learning. We will only be
covering initial concepts of the topic.
13.1 INTRODUCTION
Deep learning is an emerging and active area of data analytics. It is a class of machine
learning which allows a machine to be fed with raw data and to automatically discover
the representations needed for detection or classification. Deep learning achieves this rep-
resentation through multiple layers in which each layer transforms the underlying layer to
provide a better abstraction than the preceding layer.
Deep learning models follow deep neural networks, a class of neural networks in which
there is more than one hidden layer. Hidden layers allow improved learning and pro-
mote deeper understanding. We shall return to a deeper discussion of deep neural
networks in later sections. For now, let's explore the impact of deep learning on big data.
13.2 FUNDAMENTALS
Having identified the usefulness of deep learning in extracting useful information from big
data, it is important to study a few fundamentals of deep learning.
Deep learning is inspired by the neural network (NN) – a computational model which
is modeled after the human brain. An NN consists of many simple, connected processors
called neurons, named after the specialized brain cell that transmits electrical impulses.
Each neuron produces a sequence of real-valued activations. Neurons at the input layer get
activated through sensors perceiving the environment. Neurons at other layers are activated
through weighted connections from previously active neurons [313].
In this chapter, we will mainly consider two types of neurons: the perceptron and the sigmoid
neuron.
A perceptron is a neuron which takes several binary inputs, x1, x2, ..., xn, and produces
a single binary output value. Here binary means that the value can either be 0 or 1.
Normally a weight wi is assigned to each input xi such that the output depends upon the
product of the inputs and their corresponding weights.
Further, a constant value, called the bias, is added to the weighted sum. We can define
the bias as a constant which is used to best fit the model according to the given data. The value
of the bias is changed along with the weights to fit the model to the input. However, the bias
differs from the weights as its value is independent of the output from previous layers.
z = \sum_{i=0}^{n} w_i x_i + \mathrm{bias} \qquad (13.1)
The above equation denotes the output of a neuron, as depicted in figure 13.1. However,
equation 13.1 does not denote a trigger value, i.e., the output could vary from −∞ to
+∞, without any indication of whether the neuron will be triggered.
Therefore, an activation function (also called a transfer function) is often associated
with the neuron. The purpose of the activation function is to decide whether the neuron
should be considered triggered (or activated). The activation function is actually a non-
linear function. Example 13.1 explains the reasons for non-linearity of an activation function.
Figure 13.3 shows the impact of an activation function on triggering a neuron.
Equation 13.1 can be re-written as follows:
\mathrm{output} = \mathrm{activation\_function}\left(\sum_{i=0}^{n} w_i x_i + \mathrm{bias}\right) \qquad (13.2)
A key point to consider is that both input and output are binary values in a perceptron,
i.e., they are either zero or one. This is considered a major limitation because continuous
values cannot be predicted or classified.
For continuous values, a sigmoid neuron is used. A sigmoid neuron is a special type of
neuron whose inputs can be fractional values. Similarly, the output can also be a fractional
value. Note that a sigmoid neuron is similar to a perceptron in that it can take multiple inputs
with distinct weights.
The sigmoid neuron has a specific type of activation function called sigmoid function.
This is governed by the following equation.
\mathrm{output} = \frac{1}{1 + e^{-z}} \qquad (13.3)

\mathrm{output} = \frac{1}{1 + e^{-\left(\sum_{i=0}^{n} w_i x_i + \mathrm{bias}\right)}} \qquad (13.4)

or

\mathrm{output} = \frac{1}{1 + \exp\left(-\left(\sum_{i=0}^{n} w_i x_i + \mathrm{bias}\right)\right)} \qquad (13.5)
In equation 13.3, if z is large, then e^{-z} tends to zero and the value of the sigmoid is equiv-
alent to the output of a perceptron. Figure 13.2 illustrates the output of the sigmoid
activation function.
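A small sketch of a single sigmoid neuron, implementing equations 13.1 and 13.3 with assumed inputs and weights, is shown below.

#include <cmath>
#include <cstdio>

// Single sigmoid neuron: z = sum(w_i * x_i) + bias, output = 1 / (1 + exp(-z)).
double sigmoidNeuron(const double *x, const double *w, int n, double bias) {
    double z = bias;
    for (int i = 0; i < n; ++i)
        z += w[i] * x[i];              // weighted sum of the inputs
    return 1.0 / (1.0 + std::exp(-z)); // squashes the output into (0, 1)
}

int main() {
    double x[3] = {0.8, 0.2, 0.5};     // fractional inputs are allowed
    double w[3] = {0.4, -0.7, 1.2};
    double out = sigmoidNeuron(x, w, 3, 0.1);
    std::printf("output = %f\n", out); // a fractional value between 0 and 1
    return 0;
}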
The accuracy of a DNN is measured through a cost function, which captures the difference between the
actual value and the estimated value. Equation 13.6 shows how the cost function can be computed.
The objective is to select the values of the weights (w's) and bias (b) that minimize the cost
function.

J(w, b) = \frac{1}{n} \sum_{i=0}^{n} \frac{1}{2} \left(\mathrm{actual\ value} - \mathrm{predicted\ value}\right)^2 \qquad (13.6)
There are two known algorithms which are used to minimize the cost function and
improve the accuracy.
1. Gradient Descent: The gradient descent technique utilizes the gradient, i.e., the slope
of the cost function. The slope is computed with respect to the input parameters,
i.e., w and b. The rationale is that the shortest path to the minimal
cost is through the steepest slope. The slope could be positive, negative, or zero.
A negative value of the slope implies that the direction is opposite and should be
changed. Based on the slope, the values of the weights and bias are adjusted. The
amount of adjustment is important, as the aim is to reach an optimal value and
the convergence point.
2. Back Propagation: The back propagation technique is used in conjunction with the
gradient descent algorithm to find an optimal point. To understand the idea, let us
reconsider figure 13.6. The cost function is not directly related to the input parameters
w and b. In order to adjust the values, the back propagation technique is used such that
the values are traced back from the output layer to the input layer in order to minimize
the error. Back propagation is based on the fact that in order to find the gradient
of the cost function with respect to both w and b, we need to find the partial derivatives
of the cost with respect to all the variables in the preceding layers. Using back propagation, the
error is propagated back through the network using derivatives. An iterative process is
followed such that the loss function reaches a minimum value and the derivative
becomes zero. A minimal sketch combining both steps for a single neuron is given below.
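The following is a minimal sketch of gradient descent for a single sigmoid neuron with one input; the training sample, initial parameters, and learning rate are assumed values, and the chain rule plays the role of back propagation for this one-layer case.

#include <cmath>
#include <cstdio>

// Gradient descent on J = 0.5 * (predicted - actual)^2 for one sigmoid neuron.
int main() {
    double x = 1.5, actual = 0.0;      // one training sample (assumed values)
    double w = 0.8, b = 0.1;           // parameters to learn
    double learningRate = 0.5;

    for (int iter = 0; iter < 100; ++iter) {
        double z = w * x + b;
        double predicted = 1.0 / (1.0 + std::exp(-z));     // forward pass
        double error = predicted - actual;                  // dJ/dpredicted
        double dSigmoid = predicted * (1.0 - predicted);    // derivative of the sigmoid
        double gradW = error * dSigmoid * x;                // dJ/dw via the chain rule
        double gradB = error * dSigmoid;                    // dJ/db
        w -= learningRate * gradW;                          // move against the gradient
        b -= learningRate * gradB;
    }
    std::printf("learned w = %f, b = %f\n", w, b);
    return 0;
}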
TABLE 13.1 Types of Deep Neural Networks and Their Mapping with Big Data
Applications
In a DNN, the weights are adjusted based on the gradient of the cost function. This adjustment is made in each iteration during
the training phase. In some cases, when the gradient is small, the training process nearly
stalls because the weights cannot be changed much (due to the low value of the gradient) to reduce
the cost function. This problem is referred to as the Vanishing Gradient Problem. It is a known
problem in DNNs. We shall study how the problem is tackled in other variants of DNNs.
Example 13.2 (Object Recognition through a Feed Forward Neural Network). Fig-
ure 13.6 shows the process of object recognition. The input layer is responsible for reading
input pixels from an image. The layer is also called the visible layer because all the
information provided to it is visible or observable. In contrast, the next three layers
are hidden layers, as the information presented to these layers is not visible and the
DNN model is expected to determine the related concepts with the assistance of the
preceding layers.
For instance, the output of the input layer (i.e., pixels) is presented to the first
hidden layer to determine edges. These are identified by comparing the brightness of
neighboring pixels. The second hidden layer computes corners and contours through
collection of edges. The third hidden layer detects object parts through contours and
corners. Finally, the output is classified by the output layer.
An RNN retains information from previous inputs using a simple loop. Using this loop,
information from the previous time stamp is added to the input of the current time stamp.
Figure 13.7 shows a neuron for an RNN. x_t is the input to the network and y_t is the output
of the network at time "t", whereas "h" is an RNN cell. In principle, the self-loop allows
the neuron to recall the output y_{t-1} to produce y_t.
An RNN can be considered as a sequence of RNN cells, each of which is connected with
its successor to produce a desired output. Figure 13.8 expands the idea from figure 13.7 to
develop a full RNN.
This characteristic of RNN has enabled their usage in a number of big data applications
including machine translation, speech recognition, and text summarization.
Figure 13.8 illustrates the most fundamental architecture of an RNN. Long Short-Term
Memory (LSTM) and Gated Recurrent Units (GRU) are two popular variations of RNNs.
They differ in how much memory is maintained by the RNN. A detailed explanation of these
networks is beyond the scope of this book. Readers are advised to refer to references [43]
and [142] for a detailed understanding.
Filter in CNN
Example 13.3 (Filter in CNN). A filter of size 3x3 is applied to the input. The process
is repeated for the remaining values of the input layer in order to produce a 3x3 hidden
layer.
Example 13.5 (The Max Pooling Function). The max pooling function is applied to
the above figure with a window size of 2x2. The maximum value from each window
is selected.
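The following sketch illustrates both operations on assumed values: a 3x3 filter convolved over a 4x4 input (producing a 2x2 feature map, smaller than in example 13.3) followed by 2x2 max pooling.

#include <algorithm>
#include <cstdio>

int main() {
    float input[4][4] = {
        {1, 0, 2, 1},
        {0, 1, 3, 0},
        {2, 1, 0, 1},
        {1, 0, 1, 2}
    };
    float filter[3][3] = {
        {1, 0, -1},
        {1, 0, -1},
        {1, 0, -1}
    };

    // Convolution (no padding, stride 1): output size is (4-3+1) x (4-3+1) = 2x2.
    float feature[2][2];
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c) {
            float sum = 0.0f;
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    sum += input[r + i][c + j] * filter[i][j];
            feature[r][c] = sum;
        }

    // Max pooling with a 2x2 window: keep the strongest response of the feature map.
    float pooled = std::max({feature[0][0], feature[0][1], feature[1][0], feature[1][1]});
    std::printf("pooled value = %f\n", pooled);
    return 0;
}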
We have studied a few fundamental types of DNNs. The next section is focused on
explaining the use of different types of DNNs for big data applications.
While statistical translation systems work better than rule-based systems, they require
extensive analysis and complicated computations.
Alternatively, encoding-based systems are gaining popularity. For sentences, an encoding
can differentiate one sentence from another using a set of features which are distinct to each
sentence. Example 13.6 explains the process of machine translation using deep learning.
The idea of sequence to sequence learning can be extended to many other applications.
For instance, sequence to sequence learning can be used to develop a chatbot. In this, queries
from users are considered as input text, whereas response from the chatbot is considered
as the translated text. Automated responses from chatbot can be very useful in managing
customer flows and enhancing support.
Sequence to sequence learning has also been used to build an image search engine. For
this, the first RNN in figure 13.14 is replaced with a CNN for image identification. An image
can be encoded through a CNN and then decoded into text through an RNN. Once trained,
the model can be used to generate description of images.
Sequence to sequence learning can also be used for generating summaries of articles [327].
Similarly, textual encodings can be used to find related images. This is useful for an image
search engine.
Example 13.7 explains the process of object identification in an image. The accuracy of
the DNN model depends upon the amount of training data used for classification of objects.
1. Instead of feeding the whole image directly to our neural network, we tile the
image into many smaller images of equal size. Tiling allows us to detect an
object irrespective of its placement in the image.
2. Tiled images are fed into the Conv layer. This is used to detect edges of objects.
3. The output of the conv layer is fed into the activation layer, which decides
which of these tiny and tiled images contain our desired object (or components
of an object).
4. The next step is downsampling using any of the appropriate techniques (such
as max pooling). The idea is to keep the most appropriate component of the
object.
5. The final output is produced by a fully connected neural network, which can clas-
sify whether an image contains a cat, a dog, or both.
The idea of detecting objects can be extended to detect handwritten objects as well.
Example 13.8 (Speech Recognition using Deep Learning). 1. The first step in
speech recognition is to sample the sound waves without losing data. Sampling
implies that readings of sound waves are taken at a rate which appropriately
retains enough information for recognizing human words. As per literature,
sampling at least twice the rate of desired frequency can lead to reconstruction
of the original input.
2. The next step is to group the samples in a way that helps in recognizing words.
This grouping is useful because the duration of the voice signal can vary.
3. These chunks of audio are then fed to an RNN for recognizing characters.
A big dataset is needed for training and testing. Remember that an RNN can
remember previous output. Therefore, each character can either initiate a new
word or be part of an existing word.
The three steps outlined here are the most basic and fundamental steps. However, in
reality, a voice signal can contain several background noises or can contain signals of
different variations. Fourier transform can be applied to separate signals of different
pitch.
Each tile of the image is fed to a CNN with an activation function to detect the occurrence of a digit. We need a
large dataset for training.
6. Explain the benefits of RNN. Which big data applications can be solved using RNN?
7. Describe the architecture of a CNN and explain how it can be useful for big data.
8. Identify how the accuracy of speech recognition can be improved using a DNN.
9. Describe and explain the process of machine translation using sentence encoding.
10. In section 13.5, we have studied three big data applications, which can benefit from
DNN. Identify another application, which can utilize deep learning for improved ac-
curacy.
GLOSSARY
Activation Function: It is a non-linear function whose purpose is to decide whether a
neuron should be considered triggered (activated) or deactivated. It is also called a
transfer function.
CHAPTER 14
Big Data is already being used to solve problems that were previously impossible due
to their massive computational requirements. This chapter will provide detailed case
studies selected from different industry domains to demonstrate the practical implementa-
tion of big data. It will also highlight a few approaches which tend to deviate from normal
trends.
14.2.2 Motivation
Hadoop is an efficient framework for large-scale batch processing of data. To use Hadoop
for real-time systems, there are many challenges.
First, HDFS was originally designed for batch processing rather than interactive use by
its users. It is optimized for high throughput of data access. Second, Hadoop has a single
point of failure, which is another reason it cannot be used for real-time applications. When
the master goes down, the entire HDFS cluster is unusable until the NameNode is back
up.
Third, real-time applications often require failover to be performed within seconds. How-
ever, with the existing BackupNode strategy, failover times could be high, which is not
acceptable for such applications.
And last, since every update from the NameNode to the BackupNode is synchronously trans-
ferred, the entire system is dependent upon the reliability of the NameNode.
1. New Nodes for High Availability: Two new nodes are introduced in the existing
HDFS cluster, called the Active and Standby AvatarNodes. These nodes are simply
wrappers around a normal NameNode.
The Active AvatarNode writes transaction logs to an NFS filesystem, whereas the
Standby AvatarNode keeps its namespace as close as possible to the primary by reading and
applying the transactions at the same time in its own namespace. The secondary NameN-
ode is not needed because the Standby AvatarNode can take care of the backup when
needed.
The DataNodes communicate with both the Active AvatarNode and the Standby
AvatarNode and send updates and heartbeat messages to both nodes. The
Standby AvatarNode is synchronous with the Active AvatarNode. In case of failure,
it can take up the desired role.
The above features allow enhancements through which Facebook was able to achieve
real-time behavior for its messaging application.
The research hypothesis has been driven by the fact that prices of memory are continuously
decreasing, such that it is cheaper to have one server with 192 GB of RAM instead of 24
servers with 8 GB of RAM each. Further, the hypothesis has been strengthened by the fact
that many analytical problems, such as K-means and clustering, are iterative. They require
iterative execution with many rounds, such that the output of one round is the input to the
next round. In such a scenario, shared memory systems are feasible as they limit network-based
communication; in-memory communication is much faster.
Researchers from Microsoft Cambridge have performed experimental analysis to evaluate
this hypothesis. The researchers used two algorithms, namely AdPredictor and Frequent
Itemset Mining (FIM) [304]. The former algorithm utilizes machine learning to predict
the click-through rate (CTR). It is used by the Bing search engine to accommodate
sponsored search on Bing. The latter algorithm is used to determine frequently
used items. For both algorithms, MR++, a modified version of MapReduce which
effectively retains state for iterative algorithms, was used. The authors observed that for
both algorithms, performance on a single server with large memory was substantially
better than on a distributed cluster. However, we must understand that this observation is
only valid for iterative algorithms which require retention of state and which become too
complex for distributed processing. Further, as the size of data is relatively small, analytics
can be performed on a single machine with high processing power and memory. However,
as the size increases, more computational power may be needed.
3. Updates and Deletes: Hadoop does not cater for updates and deletes. However, this was
a requirement because incremental updates needed to be processed.
4. Faster ETL and Modeling: ETL and modeling jobs were snapshot-based. This required
the platform to be rebuilt in every run. However, incremental processing was required,
which could pull only the changed data.
The Uber team built Hadoop Upserts anD Incrementals (Hudi) to cater to their big data system
needs. It is an open-source Spark library that utilizes HDFS and provides scalable support
for deletes as well as updates. Using Hudi, Uber was able to adapt to an incremental
ingestion model. Users (or applications) can retrieve all the records that have been updated
or added since a specified time. Users can also pass a checkpoint as a reference to retrieve
records. Example 14.1 illustrates the big data model at Uber.
Example 14.1 (Big Data at Uber). Different components of big data system are
shown.
The LinkedIn team observes that while the lambda architecture is operational, it has a few
drawbacks [41].
1. Data is required to be read from two different data sources, streaming layer as well as
batch layer. This implies that the speed will be restricted to the slower of the two.
2. Availability of the application is restricted to the availability of both the data sources.
3. Integration from both of the systems is required to get the computed results.
The LinkedIn team modified the lambda architecture to cater to these limitations. The
team introduced Venice, a distributed <key,value> storage system.
Venice solves the problem of keeping two different data stores. It does so by providing
a single database that accepts both batch and streaming ingestion. It reconciles them both
at the time of writing. Thus it avoids a complex, multi-system lambda architecture by using
one system to manage both.
The Venice data flow is shown in figure 14.3. The push job is responsible for reading data off the
Hadoop cluster. It runs as a MapReduce job and produces Venice data messages to the
source Kafka cluster.
The Kafka Mirror Maker (MM), deployed in the destination cluster, replicates all
the messages from the Kafka cluster in the source data center to the Kafka cluster in the
destination data center. The Venice Storage Node consumes the Venice data messages from
the local Kafka cluster and persists the data in a local database.
The core resource in Venice is a store. A store has schemas, owners, and is isolated from
other stores. A store represents a single use case for a team using Venice. When a team
pushes a dataset into their Venice store, they create a new version of that store. Every bulk
push of data into Venice is a complete snapshot of the dataset.
1. Availability: With large data, there are increasing expectations of replication and
availability such that faults can be isolated.
2. Parallelism: As data is being generated at an immensely higher rate, there are growing
needs for enhancing data parallelism such that speed of computation can be enhanced.
3. Scalability: As data is growing, scalable solutions for processing and storage are
needed.
4. Transaction Processing: OLTP solutions are needed which can process big data. We
have studied that there is a trade-off between scalability and the capability to process
transactions. This trade-off is likely to persist in the coming years. However, a balance
needs to be explored.
5. Stream Processing: Real-time processing of data is an ongoing challenge. With
increasing data arrival rates, new techniques and solutions are needed which can process
data in real time.
6. Power Efficiency: Big data systems employ massive computational power, which con-
sumes high electricity. Power efficient solutions are needed to conserve electrical power.
In this scenario, alternate energy solutions are also needed.
7. Heterogeneous Environments: Big data systems utilize heterogeneous computing en-
vironments, often including commodity hardware. Efficient solutions are
needed which can cater to the needs of heterogeneity while enhancing efficiency.
8. Efficient Network Processing: Network is an important requirement for big data pro-
cessing and storage. High-speed efficient network solutions are also needed for imple-
menting parallel computing framework.
GLOSSARY
Avatar Node: It is a node used as a backup for a NameNode in Hadoop.
Bigtable: It is a distributed storage system which is designed to scale into the petabyte
range. It is used as Google’s internal database system that runs on top of the Google File
System (GFS). It scales across thousands of commodity servers and is used extensively
to support many of its core services, such as Gmail, YouTube, and Google Maps. In
May 2015, Google released a public version of Bigtable as a managed NoSQL database
service in the Google Cloud Platform (GCP).
Click Through Rate: It is the ratio of users who clicked on a specified link to the
total number of users who viewed the page.
DAFS: It refers to the Distributed Avatar File System, which is a modified HDFS with an
Avatar node.
ETL: The term refers to Extract, Transform, and Load – the three fundamental functions
of a database.
Frequent Itemset Mining: It is an algorithm to determine frequent items from a dataset.
Graph Processing: It refers to the process of utilizing graphs to solve distributed system prob-
lems.
Google Earth Engine: It is a big data system for providing geospatial data. It has been
developed by Google.
Incremental Data: The term refers to the changed data, i.e., data which has been
changed with respect to a previous reference.
Bibliography
[43] Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale
Acoustic Modeling.
[44] Machine learning is fun. https://medium.com/.
[45] Machine Learning Library MLlib Guide. https://spark.apache.org/docs/
latest/ml-guide.html.
[46] MLlib: RDD-based API. https://spark.apache.org/docs/latest/
mllib-guide.html.
[47] MPI Hello World Tutorial. https://mpitutorial.com/tutorials/.
[48] OpenMP Hello World Tutorial. https://www.dartmouth.edu.
[49] Pig Guide. https://pig.apache.org/docs/r0.17.0/func.html.
[50] Pig Optimizations. https://cwiki.apache.org/confluence/display/
PIG/.
[51] Powered By Spark. http://spark.apache.org/powered-by.html.
[52] Presto – Distributed SQL Query Engine for Big Data. https://prestosql.io.
[53] Spark Documentation. https://spark.apache.org/docs.
[54] Spark ML online documentation. https://spark.apache.org/docs/latest/
ml-pipeline.html.
[55] Spark Programming Guide. https://spark.apache.org/docs/latest/
rdd-programming-guide.html.
[56] Spark SQL. https://spark.apache.org/docs/latest/
sql-programming-guide.html.
[57] Spark Streaming. https://spark.apache.org/docs/latest/
streaming-kafka-integration.html.
[58] Spark Streaming Example. http://spark.apache.org/.
[59] Spark Streaming Guide. https://spark.apache.org/docs/latest/
structured-streaming-programming-guide.html.
[60] Spark Structured Streaming Guide. https://spark.apache.org/docs/
latest/structured-streaming-programming-guide.html.
[61] Sqoop. https://sqoop.apache.org.
[62] Sqoop Docs. https://sqoop.apache.org/docs/.
[63] TensorFlow. https://www.tensorflow.org/.
[64] The Official Kubernetes Documentation. https://kubernetes.io/docs/
home/.
[65] The Official SPARK Graphx Documentation. https://spark.apache.org/
docs/latest/graphx-programming-guide.html.
[66] Theano Documentation. http://deeplearning.net/.
286 Bibliography
[83] Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K
Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. Spark
SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD
International Conference on Management of Data, page 1383–1394. ACM, 2015.
[84] Jason Arnold, Boris Glavic, and Ioan Raicu. Hrdbms: Combining the Best of Modern
and Traditional Relational Databases. arXiv preprint arXiv:1901.08666, 2019.
[85] Francesco Asnicar, Nadir Sella, Luca Masera, Paolo Morettin, Thomas Tolio, Stanis-
lau Semeniuta, Claudio Moser, Enrico Blanzieri, and Valter Cavecchia. TN-Grid and
Gene@ Home Project: Volunteer Computing for Bioinformatics. In BOINC: FAST
2015 International Conference BOINC: FAST 2015Second International Conference
BOINC-based High Performance Computing: Fundamental Research and Develop-
ment. Russian Academy of Sciences, 2015.
[86] Kyle Banker. MongoDB in Action. Manning Publications Co., 2011.
[87] Narsimha Banothu, ShankarNayak Bhukya, and K Venkatesh Sharma. Big-data: Acid
Versus Base for Database Transactions. In Electrical, Electronics, and Optimization
Techniques (ICEEOT), International Conference on, page 3704–3709. IEEE, 2016.
[88] Feng Bao and Robert H Deng. A Signcryption Scheme with Signature Directly Ver-
ifiable by Public Key. In International Workshop on Public Key Cryptography, page
55–59. Springer, 1998.
[89] Michael Barbaro, Tom Zeller, and Saul Hansell. A Face is Exposed for AOL Searcher
no. 4417749. New York Times, 9(2008):8For, 2006.
[90] Cristian Andrei Baron et al. NoSQL Key-Value DBs Riak and Redis. Database
Systems Journal, 4:3–10, 2016.
[91] Daniel C Barth-Jones. The ‘Re-identification’ of Governor William Weld’s Medical In-
formation: A Critical Re-examination of Health Data Identification Risks and Privacy
Protections, Then and Now. Then and Now (July 2012), 2012.
[92] Frederic Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfel-
low, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio.
Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.
[93] Narmeen Zakaria Bawany and Jawwad A Shamsi. SEAL: SDN Based Secure and Agile
Framework for Protecting Smart City Applications from DDoS Attacks. Journal of
Network and Computer Applications, 145:102381, 2019.
[94] Narmeen Zakaria Bawany, Jawwad A Shamsi, and Khaled Salah. DDoS Attack Detec-
tion and Mitigation Using SDN: Methods, Practices, and Solutions. Arabian Journal
for Science and Engineering, 42(2):425–441, 2017.
[95] Adam L Beberg, Daniel L Ensign, Guha Jayachandran, Siraj Khaliq, and Vijay S
Pande. Folding@ home: Lessons from Eight Years of Volunteer Distributed Comput-
ing. In 2009 IEEE International Symposium on Parallel & Distributed Processing,
page 1–8. IEEE, 2009.
[96] Andras Beleczki and Balint Molnar. Modeling Framework for Designing and Ana-
lyzing Document-Centric Information Systems based on HyperGraphDB. In CEUR
Workshop Proceedings (ISSN: 1613-0073), volume 2046, page 17–22, 2016.
288 Bibliography
[97] Andrew John Bernoth. Identifying Additional Firewall Rules that may be needed,
July 3 2018. US Patent 10,015,140.
[98] Janki Bhimani, Zhengyu Yang, Miriam Leeser, and Ningfang Mi. Accelerating Big
Data Applications Using Lightweight Virtualization Framework on Enterprise Cloud.
In High Performance Extreme Computing Conference (HPEC), 2017 IEEE, page 1–7.
IEEE, 2017.
[99] Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, and Erfan Zamanian.
The End of Slow Networks: It’s Time for a Redesign. Proceedings of the VLDB
Endowment, 9(7):528–539, 2016.
[100] MKABV Bittorf, Taras Bobrovytsky, CCACJ Erickson, Martin Grund Daniel Hecht,
MJIJL Kuff, Dileep Kumar Alex Leblang, NLIPH Robinson, David Rorke Silvius
Rus, John Russell Dimitris Tsirogiannis Skye Wanderman, and Milne Michael Yoder.
Impala: A Modern, Open-Source SQL Engine for Hadoop. In Proceedings of the 7th
Biennial Conference on Innovative Data Systems Research, 2015.
[101] Linda Camilla Boldt, Vinothan Vinayagamoorthy, Florian Winder, Melanie
Schnittger, Mats Ekran, Raghava Rao Mukkamala, Niels Buus Lassen, Benjamin
Flesch, Abid Hussain, and Ravi Vatrapu. Forecasting Nike’s Sales using Facebook
Data. In 2016 IEEE International Conference on Big Data (Big Data), page 2447–
2456. IEEE, 2016.
[102] Flavio Bonomi, Rodolfo Milito, Jiang Zhu, and Sateesh Addepalli. Fog Computing
and its Role in the Internet of Things. In Proceedings of the First Edition of the MCC
Workshop on Mobile Cloud Computing, page 13–16. ACM, 2012.
[103] Dhruba Borthakur, Jonathan Gray, Joydeep Sen Sarma, Kannan Muthukkaruppan,
Nicolas Spiegelberg, Hairong Kuang, Karthik Ranganathan, Dmytro Molkov, Aravind
Menon, Samuel Rash, et al. Apache Hadoop goes Realtime at Facebook. In Proceed-
ings of the 2011 ACM SIGMOD International Conference on Management of Data,
page 1071–1080. ACM, 2011.
[104] Roland N Boubela, Klaudius Kalcher, Wolfgang Huf, Christian Naısel, and Ewald
Moser. Big Data Approaches for the Analysis of Large-Scale fMRI Data using Apache
Spark and GPU Processing: A Demonstration on Resting-State fMRI Data from the
Human Connectome Project. Frontiers in Neuroscience, 9:492, 2016.
[105] David Bradley, Richard Harper, and Steven Hunter. Power-Aware Workload Balanc-
ing using Virtual Machines, March 17 2005. US Patent App. 10/663,285.
[106] Rodrigo Braga, Edjard Mota, and Alexandre Passito. Lightweight DDoS Flooding At-
tack Detection Using NOX/OpenFlow. In IEEE Local Computer Network Conference,
page 408–415. IEEE, 2010.
[107] Eric Brewer. A Certain Freedom: Thoughts on the Cap Theorem. In Proceedings of
the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing,
page 335–335. ACM, 2010.
[108] Eric Brewer. Cap Twelve Years Later: How the “Rules” Have Changed. Computer,
45(2):23–29, 2012.
[109] Eric Brewer. Pushing the cap: Strategies for Consistency and Availability. Computer,
45(2):23–29, 2012.
Bibliography 289
[110] Eric Brewer. Spanner, Truetime and the CAP Theorem. 2017.
[111] Eric A Brewer. Towards Robust Distributed Systems. In PODC, volume 7, 2000.
[112] Martin C Brown. Getting Started with Couchbase Server: Extreme Scalability at Your
Fingertips. “O’Reilly Media, Inc.”, 2012.
[113] Barbara Brynko. NuoDB: Reinventing the Database. Information Today, 29(9):9–9,
2012.
[114] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D Ernst. Haloop: Efficient
Iterative Data Processing on Large Clusters. Proceedings of the VLDB Endowment,
3(1-2):285–296, 2010.
[115] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D Ernst. The Haloop Ap-
proach to Large-Scale Iterative Data Analysis. The VLDB Journal The International
Journal on Very Large Data Bases, 21(2):169–190, 2012.
[116] Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes.
Borg, omega, and kubernetes. Queue, 14(1):70–93, 2016.
[117] Rajkumar Buyya et al. High Performance Cluster Computing: Architectures and
Systems (volume 1). Prentice Hall, Upper Saddle River, NJ, USA, 1:999, 1999.
[118] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona
Brandic. Cloud Computing and Emerging it Platforms: Vision, Hype, and Reality
for Delivering Computing as the 5th utility. Future Generation Computer Systems,
25(6):599–616, 2009.
[119] Christian Cachin et al. Architecture of the Hyperledger Blockchain Fabric. In Work-
shop on Distributed Cryptocurrencies and Consensus Ledgers, volume 310, page 4,
2016.
[120] Stefano Calzavara, Sebastian Roth, Alvise Rabitti, Michael Backes, and Ben Stock.
A Tale of Two Headers: A Formal Analysis of Inconsistent Click-jacking Protection
on the Web. 2020.
[121] Josiah L Carlson. Redis in Action. Manning Shelter Island, 2013.
[122] Rick Cattell. Scalable SQL and NoSQL Data Stores. Acm Sigmod Record, 39(4):12–27,
2011.
[123] Ugur Cetintemel, Nesime Tatbul, Kristin Tufte, Hao Wang, Stanley Zdonik, Jiang Du,
Tim Kraska, Samuel Madden, David Maier, John Meehan, et al. S-store: A Streaming
NewSQL System for Big Velocity Applications. 2014.
[124] Mallikarjun Chadalapaka, Hemal Shah, Uri Elzur, Patricia Thaler, and Michael Ko. A
Study of iSCSI extensions for RDMA (iSER). In Proceedings of the ACM SIGCOMM
Workshop on Network-I/O Convergence: Experience, Lessons, Implications, page 209–
219, 2003.
[125] Prabhakar Chaganti and Rich Helms. Amazon SimpleDB Developer Guide. Packt
Publishing Ltd, 2010.
[126] Swetha Prabha Chaganti. Voldemort NoSQL Database. 2016.
290 Bibliography
[127] Bill Chambers and Matei Zaharia. Spark: The Definitive Guide: Big Data Processing
Made Simple. “O’Reilly Media, Inc.”, 2018.
[128] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C Hsieh, Deborah A Wallach,
Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E Gruber. Bigtable: A
distributed Storage System for Structured Data. ACM Transactions on Computer
Systems (TOCS), 26(2):1–26, 2008.
[129] Rocky KC Chang. Defending Against Flooding-Based Distributed Denial-of-Service
Attacks: A Tutorial. IEEE Communications Magazine, 40(10):42–51, 2002.
[130] Jack Chen, Samir Jindel, Robert Walzer, Rajkumar Sen, Nika Jimsheleishvilli, and
Michael Andrews. The MemSQL Query Optimizer: A Modern Optimizer for Real-
Time Analytics in a Distributed Database. Proceedings of the VLDB Endowment,
9(13):1401–1412, 2016.
[131] Kai Chen, Ankit Singla, Atul Singh, Kishore Ramachandran, Lei Xu, Yueping Zhang,
Xitao Wen, and Yan Chen. Osa: An optical switching architecture for data center
networks with unprecedented flexibility. IEEE/ACM Transactions on Networking
(TON), 22(2):498–511, 2014.
[132] Qun Chen, Song Bai, Zhanhuai Li, Zhiying Gou, Bo Suo, and Wei Pan. GraphHP:
A Hybrid Platform for Iterative Graph Processing. arXiv preprint arXiv:1706.07221,
2017.
[133] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun
Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A Flexible and Efficient
Machine Learning Library for Heterogeneous Distributed Systems. arXiv preprint
arXiv:1512.01274, 2015.
[134] Wen Chen, Peng Cheng, Fengyuan Ren, Ran Shu, and Chuang Lin. Ease the Queue
Oscillation: Analysis and Enhancement of DCTCP. In 2013 IEEE 33rd International
Conference on Distributed Computing Systems, page 450–459. IEEE, 2013.
[135] Xue-Wen Chen and Xiaotong Lin. Big Data Deep Learning: Challenges and Perspec-
tives. IEEE access, 2:514–525, 2014.
[136] Yanpei Chen, Rean Griffit, David Zats, and Randy H Katz. Understanding TCP incast
and its implications for Big Data Workloads. University of California at Berkeley,
Tech. Rep, 2012.
[137] Yi Chen, Zhi Qiao, Hai Jiang, Kuan-Ching Li, and Won Woo Ro. MGMR: Multi-GPU
based MapReduce. In International Conference on Grid and Pervasive Computing,
page 433–442. Springer, 2013.
[138] Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi
Muthukrishnan. One trillion edges: Graph Processing at Facebook-Scale. Proceedings
of the VLDB Endowment, 8(12):1804–1815, 2015.
[139] Kristina Chodorow. Scaling MongoDB: Sharding, Cluster Setup, and Administration.
“O’Reilly Media, Inc.”, 2011.
[140] Kristina Chodorow. MongoDB: The Definitive Guide: Powerful and Scalable Data
Storage. “O’Reilly Media, Inc.”, 2013.
Bibliography 291
[141] Mrs Rupali M Chopade and Nikhil S Dhavase. MongoDB, Couchbase: Performance
Comparison for Image Dataset. In 2017 2nd International Conference for Convergence
in Technology (I2CT), page 255–258. IEEE, 2017.
[142] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated Feed-
back Recurrent Neural Networks. In International Conference on Machine Learning,
page 2067–2075, 2015.
[143] Taejoong Chung, Roland van Rijswijk-Deij, Balakrishnan Chandrasekaran, David
Choffnes, Dave Levin, Bruce M Maggs, Alan Mislove, and Christo Wilson. An End-
to-End View of DNSSEC Ecosystem Management. ; login:, 42(4), 2017.
[144] Dan Ciresan, Ueli Meier, Jonathan Masci, and Jurgen Schmidhuber. Multi-Column
Deep Neural Network for Traffic Sign Classification. Neural Networks, 32:333–338,
2012.
[145] Cisco. Data Center: Load Balancing Data Center Services.
[146] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian
Limpach, Ian Pratt, and Andrew Warfield. Live Migration of Virtual Machines. In
Proceedings of the 2nd conference on Symposium on Networked Systems Design &
Implementation-Volume 2, page 273–286. USENIX Association, 2005.
[147] Michael Colesky, Jaap-Henk Hoepman, and Christiaan Hillen. A Critical Analysis of
Privacy Design Strategies. In Security and Privacy Workshops (SPW), 2016 IEEE,
page 33–40. IEEE, 2016.
[148] James C Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost,
Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter
Hochschild, et al. Spanner: Googles Globally Distributed Database. ACM Transac-
tions on Computer Systems (TOCS), 31(3):1–22, 2013.
[149] Antonio Corradi, Mario Fanelli, and Luca Foschini. Vm consolidation: A Real Case
Based on Openstack Cloud. Future Generation Computer Systems, 32:118–127, 2014.
[150] Laizhong Cui, F Richard Yu, and Qiao Yan. When Big Data Meets Software-Defined
Networking: SDN for Big Data and Big Data for SDN. IEEE Network, 30(1):58–65,
2016.
[151] Ian Curry. An Introduction to Cryptography and Digital Signatures. Entrust Securing
Digital Identities and Information, 2001.
[152] Nhu-Ngoc Dao, Junho Park, Minho Park, and Sungrae Cho. A Feasible Method to
Combat Against DDoS Attack in SDN Network. In 2015 International Conference
on Information Networking (ICOIN), page 309–311. IEEE, 2015.
[153] Marieke De Goede. The politics of Privacy in the Age of Preemptive Security. Inter-
national Political Sociology, 8(1):100–104, 2014.
[154] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large
Clusters. Communications of the ACM, 51(1):107–113, 2008.
[155] Jeffrey Dean and Sanjay Ghemawat. MapReduce: A Flexible Data Processing Tool.
Communications of the ACM, 53(1):72–77, 2010.
[156] Casimer DeCusatis. Optical Interconnect Networks for Data Communications. Jour-
nal of Lightwave Technology, 32(4):544–552, 2014.
292 Bibliography
[157] Kevin Deierling. Ethernet Just Got a Big Performance Boost with Release of Soft
RoCE, 2015.
[158] OrientDB Developers. OrientDB. Hybrid Document-Store and Graph NoSQL
Database [online], 2012.
[159] David HC Du, Tai-Sheng Chang, Jenwei Hsieh, Sangyup Shim, and Yuewei Wang.
Two Emerging Serial Storage Interfaces for Supporting Digital Libraries: Serial Stor-
age Architecture (SSA) and Fiber Channel-Arbitrated Loop (FC-AL). Multimedia
Tools and Applications, 10(2):179–203, 2000.
[160] Muhammad Nouman Durrani and Jawwad A Shamsi. Volunteer Computing: Require-
ments, Challenges, and Solutions. Journal of Network and Computer Applications,
39:369–380, 2014.
[161] Michael Erbschloe. Trojans, Worms, and Spyware: A Computer Security Profes-
sional’s Guide to Malicious Code. Elsevier, 2004.
[162] Hamza Es-Samaali, Aissam Outchakoucht, and Jean Philippe Leroy. A Blockchain-
Based Access Control for Big Data. International Journal of Computer Networks and
Communications Security, 5(7):137, 2017.
[163] Christian Esposito, Aniello Castiglione, and Kim-Kwang Raymond Choo. Challenges
in Delivering Software in the Cloud as Microservices. IEEE Cloud Computing, (5):10–
14, 2016.
[164] Reza Farivar, Daniel Rebolledo, Ellick Chan, and Roy H Campbell. A Parallel Im-
plementation of K-means Clustering on GPUs. In PDPTA, volume 13, page 212–312,
2008.
[165] Nathan Farrington, George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali
Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat.
Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers.
ACM SIGCOMM Computer Communication Review, 40(4):339–350, 2010.
[166] Maria Fazio, Antonio Celesti, Rajiv Ranjan, Chang Liu, Lydia Chen, and Massimo
Villari. Open Issues in Scheduling Microservices in the Cloud. IEEE Cloud Computing,
3(5):81–88, 2016.
[167] M Fenn, MA Murphy, J Martin, and S Goasguen. An Evaluation of KVM for use in
Cloud Computing. In Proceedings of the 2nd International Conference on the Virtual
Computing Initiative, RTP, NC, USA, 2008.
[168] Michael J Flynn. Very High-Speed Computing Systems. Proceedings of the IEEE,
54(12):1901–1909, 1966.
[169] Julien Forgeat. Data Processing Architectures-Lambda and Kappa. Ericsson Research
Blog, 2015.
[170] Ian Foster, Yong Zhao, Ioan Raicu, and Shiyong Lu. Cloud Computing and Grid Com-
puting 360-Degree Compared. In Grid Computing Environments Workshop, 2008.
GCE’08, page 1–10. IEEE, 2008.
[171] Michael Frampton. Big Data Made Easy: A Working Guide to the Complete Hadoop
Toolset. Apress, 2014.
Bibliography 293
[172] Vincent Garcia, Eric Debreuve, and Michel Barlaud. Fast K Nearest Neighbor Search
using GPU. arXiv preprint arXiv:0804.1448, 2008.
[173] Alan Gates and Daniel Dai. Programming Pig: Dataflow Scripting with Hadoop.
“O’Reilly Media, Inc.”, 2016.
[174] Alan Gates, Jianyong Dai, and Thejas Nair. Apache Pig’s Optimizer. IEEE Data
Engineering Bulletin, 36(1):34–45, 2013.
[175] Alan F Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M
Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and
Utkarsh Srivastava. Building A High-Level Dataflow System on Top of Map-Reduce:
The Pig Experience. Proceedings of the VLDB Endowment, 2(2):1414–1425, 2009.
[176] Lars George. HBase: The Definitive Guide: Random Access to your Planet-Size Data.
“O’Reilly Media, Inc.”, 2011.
[177] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System.
In ACM SIGOPS Operating Systems Review, volume 37, page 29–43. ACM, 2003.
[178] Seth Gilbert and Nancy A Lynch. Perspectives on the Cap Theorem. Computer,
45(2):30–36, 2012.
[179] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
2016.
[180] Noel Gorelick, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and
Rebecca Moore. Google earth engine: Planetary-scale geospatial analysis for everyone.
Remote sensing of Environment, 202:18–27, 2017.
[181] Alex Graves, Santiago Fernandez, Faustino Gomez, and Jürgen Schmidhuber. Con-
nectionist Temporal Classification: Labelling Unsegmented Sequence Data with Re-
current Neural Networks. In Proceedings of the 23rd International Conference on
Machine learning, page 369–376. ACM, 2006.
[182] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech Recognition
with Deep Recurrent Neural Networks. In Acoustics, Speech and Signal Processing
(ICASSP), 2013 IEEE International Conference on, page 6645–6649. IEEE, 2013.
[183] Albert Greenberg, James R Hamilton, Navendu Jain, Srikanth Kandula, Changhoon
Kim, Parantap Lahiri, David A Maltz, Parveen Patel, and Sudipta Sengupta. Vl2: A
Scalable and Flexible Data Center Network. In ACM SIGCOMM Computer Commu-
nication Review, volume 39, page 51–62. ACM, 2009.
[184] Steven L Grobman. Server Pool Kerberos Authentication Scheme, March 21 2017.
US Patent 9,602,275.
[185] Katarina Grolinger, Wilson A Higashino, Abhinav Tiwari, and Miriam AM Capretz.
Data Management in Cloud Environments: NoSQL and NewSQL Data Stores. Journal
of Cloud Computing: Advances, Systems and Applications, 2(1):22, 2013.
[186] Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen
Tian, Yongguang Zhang, and Songwu Lu. Bcube: A High Performance, Server-Centric
Network Architecture for Modular Data Centers. In Proceedings of the ACM SIG-
COMM 2009 Conference on Data Communication, page 63–74, 2009.
294 Bibliography
[187] Himanshu Gupta, Subhash Mondal, Srayan Ray, Biswajit Giri, Rana Majumdar, and
Ved P Mishra. Impact of SQL Injection in Database Security. In 2019 International
Conference on Computational Intelligence and Knowledge Economy (ICCIKE), page
296–299. IEEE, 2019.
[188] Irfan Habib. Virtualization with KVM. Linux Journal, 2008(166):8, 2008.
[189] William G Halfond, Jeremy Viegas, Alessandro Orso, et al. A Classification of SQL-
Injection Attacks and Countermeasures. In Proceedings of the IEEE International
Symposium on Secure Software Engineering, volume 1, page 13–15. IEEE, 2006.
[190] Jing Han, E Haihong, Guan Le, and Jian Du. Survey on NoSQL Database. In 2011
6th International Conference on Pervasive Computing and Applications, page 363–
366. IEEE, 2011.
[191] T Harford. Big Data: Are We Making a Big Mistake? [internet]. London: Ft magazine;
c2014 [cited at 2015 sep 28].
[192] Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar,
Abdullah Gani, and Samee Ullah Khan. The Rise of “Big Data on Cloud Computing:
Review and Open Research Issues. Information Systems, 47:98–115, 2015.
[193] Michael Hausenblas and Jacques Nadeau. Apache Drill: Interactive Ad-hoc Analysis
at Scale. Big Data, 1(2):100–104, 2013.
[194] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K Govindaraju, and Tuyong Wang.
Mars: A MapReduce Framework on Graphics Processors. In Parallel Architectures
and Compilation Techniques (PACT), 2008 International Conference on, page 260–
269. IEEE, 2008.
[195] Ying He, F Richard Yu, Nan Zhao, Victor CM Leung, and Hongxi Yin. Software-
Defined Networks with Mobile Edge Computing and Caching for Smart Cities: A
Big Data Deep Reinforcement Learning Approach. IEEE Communications Magazine,
55(12):31–37, 2017.
[196] Thomas A Hengeveld. Multi-Tunnel Virtual Private Network, March 29 2016. US
Patent 9,300,570.
[197] Maurice Herlihy. Blockchains From a Distributed Computing Perspective. Commu-
nications of the ACM, 62(2):78–85, 2019.
[198] Bai Hong-Tao, He Li-li, Ouyang Dan-tong, Li Zhan-shan, and Li He. K-Means on
Commodity GPUs with CUDA. In 2009 World Congress on Computer Science and
Information Engineering, page 651–655. IEEE, 2009.
[199] Weisheng Hu, Weiqiang Sun, Yaohui Jin, Wei Guo, and Shilin Xiao. An Efficient
Transportation Architecture for Big Data Movement. In Information, Communica-
tions and Signal Processing (ICICS) 2013 9th International Conference on, page 1–5.
IEEE, 2013.
[200] Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N Hanson, Owen
O’Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang. Major
Technical Advancements in Apache Hive. In Proceedings of the 2014 ACM SIGMOD
International Conference on Management of Data, page 1235–1246. ACM, 2014.
Bibliography 295
[201] Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed.
Zookeeper: Wait-Free Coordination for Internet-Scale Systems. In USENIX Annual
Technical Conference, volume 8. Boston, MA, USA, 2010.
[202] Jalal B Hur and Jawwad A Shamsi. A Survey on Security Issues, Vulnerabilities
and Attacks in Android Based Smartphone. In 2017 International Conference on
Information and Communication Technologies (ICICT), page 40–46. IEEE, 2017.
[203] Intel. Understanding iWARP: Delivering Low Latency to Ethernet.
[216] Karthik Kambatla, Giorgos Kollias, Vipin Kumar, and Ananth Grama. Trends in
Big Data Analytics. Journal of Parallel and Distributed Computing, 74(7):2561–2573,
2014.
[217] Debabrata Kar, Suvasini Panigrahi, and Srikanth Sundararajan. SQLiGoT: Detecting
SQL Injection Attacks Using Graph of Tokens and SVM. Computers & Security,
60:206–225, 2016.
[218] Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark:
Lightning-Fast Big Data Analysis. “O’Reilly Media, Inc.”, 2015.
[219] Wayne Karpoff and Brian Lake. Storage Virtualization System and Methods, Au-
gust 18 2009. US Patent 7,577,817.
[220] Karambir Kaur and Monika Sachdeva. Performance Evaluation of NewSQL
Databases. In 2017 International Conference on Inventive Systems and Control
(ICISC), page 1–5. IEEE, 2017.
[221] Sawinder Kaur and Karamjit Guide Kaur. Visualizing Class Diagram using OrientDB
NoSQL Data-Store. PhD thesis, 2016.
[222] Robert W Kembel and Horst L Truestedt. Fibre Channel Arbitrated Loop. 1996.
[223] David B Kirk and W Hwu Wen-Mei. Programming Massively Parallel Processors: A
Hands-on Approach. Morgan Kaufmann, 2016.
[224] O Kolkman and R Gieben. DNSSEC Operational Practices. Technical report, RFC
4641, September, 2006.
[225] Richard T Kouzes, Gordon A Anderson, Stephen T Elbert, Ian Gorton, and Deb-
orah K Gracio. The Changing Paradigm of Data-Intensive Computing. Computer,
(1):26–34, 2009.
[226] Jay Kreps. Parallel Hardware Architecture. Oracle, Dec.
[227] Jay Kreps. The Log: What Every Software Engineer Should Know About Real-Time
Datas Unifying Abstraction. Linkedin. com, Dec, 16, 2013.
[228] Jay Kreps. Questioning the Lambda Architecture. Online Article, July, page 205,
2014.
[229] Jay Kreps, Neha Narkhede, Jun Rao, et al. Kafka: A Distributed Messaging System
for Log Processing. In Proceedings of the NetDB, page 1–7, 2011.
[230] Diego Kreutz, Fernando MV Ramos, Paulo Esteves Verissimo, Christian Esteve
Rothenberg, Siamak Azodolmolky, and Steve Uhlig. Software-Defined Networking:
A Comprehensive Survey. Proceedings of the IEEE, 103(1):14–76, 2015.
[231] Alexandros Labrinidis and Hosagrahar V Jagadish. Challenges and Opportunities
with Big Data. Proceedings of the VLDB Endowment, 5(12):2032–2033, 2012.
[232] Kashif Laeeq and Jawwad A Shamsi. A Study of Security Issues, Vulnerabilities and
Challenges in Internet of Things. Securing Cyber-Physical Systems, 10, 2015.
[233] Daniel Guimaraes do Lago, Edmundo RM Madeira, and Luiz Fernando Bittencourt.
Power-Aware Virtual Machine Scheduling on Clouds using Active Cooling Control
and DVFS. In Proceedings of the 9th International Workshop on Middleware for
Grids, Clouds and e-Science, page 2. ACM, 2011.
Bibliography 297
[234] Avinash Lakshman and Prashant Malik. Cassandra: A Decentralized Structured Stor-
age System. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.
[235] Phillip A Laplante. Who’s Afraid of Big Data? IT Professional, 15(5):6–7, 2013.
[236] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The Parable of
Google Flu: Traps in Big Data Analysis. Science, 343(14 March), 2014.
[237] Brian Lebiednik, Aman Mangal, and Niharika Tiwari. A Survey and Evaluation of
Data Center Network Topologies. arXiv preprint arXiv:1605.01701, 2016.
[238] Jae-Gil Lee and Minseo Kang. Geospatial Big Data: Challenges and Opportunities.
Big Data Research, 2(2):74–81, 2015.
[239] Ken Ka-Yin Lee, Wai-Choi Tang, and Kup-Sze Choi. Alternatives to Relational
Database: Comparison of NoSQL and XML Approaches for Clinical Data Storage.
Computer Methods and Programs in Biomedicine, 110(1):99–109, 2013.
[240] Sangdo Lee and Jun-Ho Huh. An Effective Security Measures for Nuclear Power Plant
Using Big Data Analysis Approach. The Journal of Supercomputing, 75(8):4267–4294,
2019.
[241] Joe Lennon. Beginning CouchDB. Apress, 2010.
[242] Hu Li, Tianjia Chen, and Wei Xu. Improving Spark Performance with Zero-Copy
Buffer Management and RDMA. In Computer Communications Workshops (INFO-
COM WKSHPS), 2016 IEEE Conference on, page 33–38. IEEE, 2016.
[243] Peilong Li, Yan Luo, Ning Zhang, and Yu Cao. Heterospark: A Heterogeneous
CPU/GPU Spark Platform for Machine Learning Algorithms. In Networking, Archi-
tecture and Storage (NAS), 2015 IEEE International Conference on, page 347–348.
IEEE, 2015.
[244] Qi Li, Jianfeng Ma, Rui Li, Ximeng Liu, Jinbo Xiong, and Danwei Chen. Secure,
Efficient and Revocable Multi-Authority Access Control System in Cloud Storage.
Computers & Security, 59:45–59, 2016.
[245] Jimmy Lin and Chris Dyer. Data Intensive Text processing with MapReduce. Morgan
Claypool Publishers, 2010.
[246] Xuan-Yi Lin, Yeh-Ching Chung, and Tai-Yi Huang. A Multiple LID Routing Scheme
for Fat-Tree-Based infiniband Networks. In Parallel and Distributed Processing Sym-
posium, 2004. Proceedings. 18th International, page 11. IEEE, 2004.
[247] Maria Lindh and Jan Nolin. Information We Collect: Surveillance and Privacy in
the Implementation of Google Apps for Education. European Educational Research
Journal, 15(6):644–663, 2016.
[248] Alex X Liu and Mohamed G Gouda. Diverse Firewall Design. IEEE Transactions on
Parallel and Distributed Systems, 19(9):1237–1251, 2008.
[249] Stephanie Q Liu and Anna S Mattila. Airbnb: Online Targeted Advertising, Sense of
Power, and Consumer Decisions. International Journal of Hospitality Management,
60:33–41, 2017.
298 Bibliography
[250] Yimeng Liu, Yizhi Wang, and Yi Jin. Research on the Improvement of MongoDB
Auto-Sharding in Cloud Environment. In Computer Science & Education (ICCSE),
2012 7th International Conference on, page 851–854. IEEE, 2012.
[251] Win-Tsung Lo, Yue-Shan Chang, Ruey-Kai Sheu, Chun-Chieh Chiu, and Shyan-Ming
Yuan. Cudt: A CUDA Based Decision Tree Algorithm. The Scientific World Journal,
2014, 2014.
[252] Noel Lopes and Bernardete Ribeiro. Gpumlib: An Efficient Open-Source GPU Ma-
chine Learning Library. International Journal of Computer Information Systems and
Industrial Management Applications, 3:355–362, 2011.
[253] Adam Lopez. Statistical Machine Translation. ACM Computing Surveys (CSUR),
40(3):8, 2008.
[254] Yingping Lu and David HC Du. Performance Study of iSCSI-Based Storage Subsys-
tems. IEEE Communications Magazine, 41(8):76–82, 2003.
[255] Marko Luksa. Kubernetes in action. Manning Publications Shelter Island, 2018.
[256] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. Traffic
Flow Prediction with Big Data: A Deep Learning Approach. IEEE Transactions on
Intelligent Transportation Systems, 16(2):865–873, 2015.
[257] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan
Venkitasubramaniam. l-diversity: Privacy Beyond K-anonymity. ACM Transactions
on Knowledge Discovery from Data (TKDD), 1(1):3, 2007.
[258] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn,
Naty Leiser, and Grzegorz Czajkowski. Pregel: A System for Large-Scale Graph
Processing. In Proceedings of the 2010 ACM SIGMOD International Conference on
Management of Data, page 135–146, 2010.
[259] Claudio Martella, Roman Shaposhnik, Dionysios Logothetis, and Steve Harenberg.
Practical Graph Analytics with Apache Giraph, volume 1. Springer, 2015.
[260] Nathan Marz. How to Beat the Cap Theorem. Thoughts from the Red Planet, 2011.
[261] Matthew L Massie, Brent N Chun, and David E Culler. The Ganglia Distributed
Monitoring System: Design, Implementation, and Experience. Parallel Computing,
30(7):817–840, 2004.
[262] Yuan Mei, Luwei Cheng, Vanish Talwar, Michael Y Levin, Gabriela Jacques-Silva,
Nikhil Simha, Anirban Banerjee, Brian Smith, Tim Williamson, Serhat Yilmaz, et al.
Turbine: Facebooks Service Management Platform for Stream Processing. In 2020
IEEE 36th International Conference on Data Engineering (ICDE), page 1591–1602.
IEEE, 2020.
[263] Peter Mell and Timothy Grance. The NIST Definition of Cloud Computing (draft).
NIST Special Publication, 800(145):7, 2011.
[264] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivaku-
mar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive Analysis of Web-Scale
Datasets. Proceedings of the VLDB Endowment, 3(1–2):330–339, 2010.
Bibliography 299
[280] Arvind Narayanan and Vitaly Shmatikov. Robust De-Anonymization of Large Sparse
Datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, page 111–
125. IEEE, 2008.
[281] Arvind Narayanan and Vitaly Shmatikov. Myths and Fallacies of Personally Identi-
fiable Information. Communications of the ACM, 53(6):24–26, 2010.
[282] Rimma Nehme and Nicolas Bruno. Automated Partitioning Design in Parallel
Database Systems. In Proceedings of the 2011 ACM SIGMOD International Con-
ference on Management of Data, page 1137–1148. ACM, 2011.
[283] B Clifford Neuman and Theodore Ts’o. Kerberos: An Authentication Service for
Computer Networks. IEEE Communications Magazine, 32(9):33–38, 1994.
[284] Krishna Nibhanupudi and Rimmi Devgan. Data Center Ethernet.
[285] Muhammad Nouman Durrani and Jawwad A Shamsi. Volunteer Computing: Require-
ments, Challenges, and Solutions. Journal of Network and Computer Applications,
2013.
[286] Daniel Nurmi, Richard Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil So-
man, Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus Open-Source Cloud-
Computing System. In Cluster Computing and the Grid, 2009. CCGRID’09. 9th
IEEE/ACM International Symposium on, page 124–131. IEEE, 2009.
[287] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew
Tomkins. Pig Latin: A Not-so-Foreign Language for Data Processing. In Proceedings
of the 2008 ACM SIGMOD International Conference on Management of Data, page
1099–1110. ACM, 2008.
[288] Oyindamola Oluwatimi, Daniele Midi, and Elisa Bertino. Overview of Mobile Con-
tainerization Approaches and Open Research Directions. IEEE Security & Privacy,
15(1):22–31, 2017.
[289] Claus Pahl. Containerization and the Paas Cloud. IEEE Cloud Computing, 2(3):24–
31, 2015.
[290] Claus Pahl, Antonio Brogi, Jacopo Soldani, and Pooyan Jamshidi. Cloud Container
Technologies: A State-of-the-Art Review. IEEE Transactions on Cloud Computing,
2017.
[291] Rakesh Patel, Mara Nicholl, and Lindsey Harju. Access Control System for Imple-
menting Access Restrictions of Regulated Database Records while Identifying and
Providing Indicators of Regulated Database Records Matching Validation Criteria,
September 19 2017. US Patent 9,767,309.
[292] Andrew Pavlo and Matthew Aslett. What’s Really New with newSQL? ACM Sigmod
Record, 45(2):45–55, 2016.
[293] Gregory F Pfister. An Introduction to the Infiniband Architecture. High Performance
Mass Storage and Parallel I/O, 42:617–632, 2001.
[294] Jaroslav Pokorny. NoSQL Databases: A Step to Database Scalability in Web Envi-
ronment. International Journal of Web Information Systems, 9(1):69–82, 2013.
Bibliography 301
[295] Lesandro Ponciano, Francisco Brasileiro, Robert Simpson, and Arfon Smith. Volun-
teers’ Engagement in Human Computation for Astronomy Projects. Computing in
Science & Engineering, 16(6):52–59, 2014.
[296] Andrea Possemato, Andrea Lanzi, Simon Pak Ho Chung, Wenke Lee, and Yanick
Fratantonio. Clickshield: Are You Hiding Something? Towards Eradicating Clickjack-
ing on Android. In Proceedings of the 2018 ACM SIGSAC Conference on Computer
and Communications Security, page 1120–1136, 2018.
[297] Steve Pousty and Katie Miller. Getting Started with OpenShift: A Guide for Impatient
Beginners. ” O’Reilly Media, Inc.”, 2014.
[298] Gil Press. A Very Short History of Big Data. FORBES. Recuperado May, 12:2014,
2013.
[299] Dan Pritchett. Base: An Acid Alternative. Queue, 6(3):48–55, 2008.
[300] Peng Qin, Bin Dai, Benxiong Huang, and Guan Xu. Bandwidth-Aware Scheduling
with SDN in Hadoop: A New Trend for Big Data. IEEE Systems Journal, 2015.
[301] US Rackspace. Inc.,The Rackspace Cloud, 2010.
[302] Peter Rausch, Alaa F Sheta, and Aladdin Ayesh. Business Intelligence and Perfor-
mance Management: Theory, Systems and Industrial Applications. Springer Publish-
ing Company, Incorporated, 2013.
[303] Tejaswi Redkar and Tony Guidici. Windows Azure Platform. Apress, 2011.
[304] Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, and Andrew
Douglas. Nobody Ever Got Fired for Using Hadoop on a cluster. In Proceedings of the
1st International Workshop on Hot Topics in Cloud Data Processing, page 2. ACM,
2012.
[305] Sherif Sakr, Faisal Moeen Orakzai, Ibrahim Abdelaziz, and Zuhair Khayyat. Large-
scale Graph Processing using Apache Giraph. Springer, 2016.
[306] Semih Salihoglu, Jaeho Shin, Vikesh Khanna, Ba Quan Truong, and Jennifer Widom.
Graft: A Debugging Tool for Apache Giraph. In Proceedings of the 2015 ACM SIG-
MOD International Conference on Management of Data, page 1403–1408. ACM, 2015.
[307] Juha Salo. Data Center Network Architectures.
[308] Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-
Purpose GPU Programming. Addison-Wesley Professional, 2010.
[309] Kaz Sato, Cliff Young, and David Patterson. An in-Depth Look at Googles First
Tensor Processing Unit (TPU). Google Cloud Big Data and Machine Learning Blog,
12, 2017.
[310] Julian Satran and Kalman Meth. Internet Small Computer Systems Interface (iSCSI).
2004.
[311] Gigi Sayfan. Mastering kubernetes. Packt Publishing Ltd, 2017.
[312] Mathijs Jeroen Scheepers. Virtualization and Containerization of Application Infras-
tructure: A Comparison. In 21st Twente Student Conference on IT, volume 1, page
1–7, 2014.
302 Bibliography
[313] Jürgen Schmidhuber. Deep Learning in Neural Networks: An Overview. Neural Net-
works, 61:85–117, 2015.
[314] Friedhelm Schmidt. The SCSI Bus and IDE Interface Protocols. Application and
Programming, Addison-Wesley, New York, 1995.
[315] Nicolas Seyvet and Ignacio Mulas Viela. Applying the Kappa Archi-
tecture in the Telco Industry. https://www.oreilly.com/ideas/
applying-the-kappa-architecture-in-the-telco-industry (visited:
2019-11-09).
[316] Jawad Ali Shah, Hassaan Haider, Kushsairy Abdul Kadir, and Sheroz Khan. Sparse
Signal Reconstruction of Compressively Sampled Signals Using Smoothed 0-Norm. In
Signal and Image Processing Applications (ICSIPA), 2017 IEEE International Con-
ference on, page 61–65. IEEE, 2017.
[317] Jawwad Shamsi, Muhammad Ali Khojaye, and Mohammad Ali Qasmi. Data-intensive
Cloud Computing: Requirements, Expectations, Challenges, and Solutions. Journal
of Grid Computing, 11(2):281–310, 2013.
[318] Jawwad A Shamsi, Sufian Hameed, Waleed Rahman, Farooq Zuberi, Kaiser Altaf,
and Ammar Amjad. Clicksafe: Providing Security Against Clickjacking Attacks. In
2014 IEEE 15th International Symposium on High-Assurance Systems Engineering,
page 206–210. IEEE, 2014.
[319] Jawwad A Shamsi and Muhammad Khojaye. Understanding Privacy Violations in
Big Data Systems. IT Professional.
[320] Jawwad A Shamsi, Sherali Zeadally, and Zafar Nasir. Interventions in Cyberspace:
Status and Trends. IT Professional, 18(1):18–25, 2016.
[321] Jawwad A Shamsi, Sherali Zeadally, Fareha Sheikh, and Angelyn Flowers. Attribution
in Cyberspace: Techniques and Legal Implications. Security and Communication
Networks, 9(15):2886–2900, 2016.
[322] Toby Sharp. Implementing Decision Trees and Forests on a GPU. In European
Conference on Computer Vision, page 595–608. Springer, 2008.
[323] Alexander Shpiner, Eitan Zahavi, Omar Dahley, Aviv Barnea, Rotem Damsker, Gen-
nady Yekelis, Michael Zus, Eitan Kuta, and Dean Baram. Roce Rocks without PFC:
Detailed Evaluation. In Proceedings of the Workshop on Kernel-Bypass Networks,
page 25–30. ACM, 2017.
[324] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The
Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST),
2010 IEEE 26th Symposium on, page 1–10. IEEE, 2010.
[325] Konstantin V Shvachko. HDFS Scalability: The Limits to Growth. ; login: The
Magazine of USENIX & SAGE, 35(2):6–16, 2010.
[326] Kamran Siddique, Zahid Akhtar, Edward J Yoon, Young-Sik Jeong, Dipankar Das-
gupta, and Yangwoo Kim. Apache Hama: An Emerging Bulk Synchronous Parallel
Computing Framework for Big Data Applications. IEEE Access, 4:8879–8887, 2016.
[327] Tooba Siddiqui and Jawwad Ahmed Shamsi. Generating Abstractive Summaries Us-
ing Sequence to Sequence Attention Model. In 2018 International Conference on
Frontiers of Information Technology (FIT), page 212–217. IEEE, 2018.
Bibliography 303
[328] Nanki Sidhu, Edzer Pebesma, and Gilberto Cßamara. Using google earth engine to
detect land cover change: Singapore as a use case. European Journal of Remote
Sensing, 51(1):486–500, 2018.
[329] Dilpreet Singgh and Chandan K Reddy. A Survey on Platforms for Big Data Analyt-
ics. Journal of Big Data, 1(8):1–20, 2014.
[330] Aameek Singh, Madhukar Korupolu, and Dushmanta Mohapatra. Server-Storage
Virtualization: Integration and Load Balancing in Data Centers. In Proceedings of
the 2008 ACM/IEEE Conference on Supercomputing, page 53. IEEE Press, 2008.
[331] Uthayasankar Sivarajah, Muhammad Mustafa Kamal, Zahir Irani, and Vishanth
Weerakkody. Critical Analysis of Big Data Challenges and Analytical Methods. Jour-
nal of Business Research, 70:263–286, 2017.
[332] Swaminathan Sivasubramanian. Amazon DynamoDB: A Seamlessly Scalable Non-
Relational Database Service. In Proceedings of the 2012 ACM SIGMOD International
Conference on Management of Data, page 729–730. ACM, 2012.
[333] Joseph D Sloan. High Performance Linux Clusters with OSCAR, Rocks, OpenMosix,
and MPI. O’reilly, 2009.
[334] Aleksander Slominski, Vinod Muthusamy, and Rania Khalaf. Building a Multi-tenant
Cloud Service from Legacy Code with Docker Containers. In 2015 IEEE International
Conference on Cloud Engineering (IC2E), page 394–396. IEEE, 2015.
[335] Daniel J Solove and Danielle Keats Citron. Risk and Anxiety: A Theory of Data-
Breach Harms. Texas Law Review, 96:737, 2017.
[336] Stephen Soltesz, Herbert Ptzl, Marc E Fiuczynski, Andy Bavier, and Larry Peterson.
Container-based Operating System Virtualization: A Scalable, High-Performance Al-
ternative to Hypervisors. In ACM SIGOPS Operating Systems Review, volume 41,
page 275–287. ACM, 2007.
[337] Jordi Soria-Comas and Josep Domingo-Ferrert. Differential Privacy Via T-closeness
in Data Publishing. In Privacy, Security and Trust (PST), 2013 Eleventh Annual
International Conference on, page 27–35. IEEE, 2013.
[338] Kristopher A Standish, Sam Amiri, Misbah Mubarak, Louisa J Bellis, Takanori Fuji-
wara, and John L Rayner. Advances in Supercomputing. Advances in Supercomputing,
page 157, 2020.
[339] Nick Steele, Stan Hawkins, Joe Maranville, and Andrew Bradnan. Single Sign-on for
Access to a Central Data Repository, March 27 2018. US Patent 9,928,508.
[340] Stergios Stergiou. Scaling Pagerank to 100 Billion Pages. In Proceedings of The Web
Conference 2020, page 2761–2767, 2020.
[341] Thomas Lawrence Sterling. Beowulf Cluster Computing with Linux. MIT press, 2002.
[342] Michael Stonebraker, Daniel Abadi, David J DeWitt, Sam Madden, Erik Paulson,
Andrew Pavlo, and Alexander Rasin. MapReduce and Parallel DBMSs: Friends or
Foes? Communications of the ACM, 53(1):64–71, 2010.
[343] Michael Stonebraker and Ariel Weisberg. The VoltDB Main Memory DBMS. IEEE
Data Engineering Bulletin, 36(2):21–27, 2013.
304 Bibliography
[344] Nikko Strom. Scalable Distributed DNN Training Using Commodity GPU Cloud
Computing. In Sixteenth Annual Conference of the International Speech Communi-
cation Association, 2015.
[345] Jeff A Stuart, Cheng-Kai Chen, Kwan-Liu Ma, and John D Owens. Multi-GPU
Volume Rendering using MapReduce. In Proceedings of the 19th ACM International
Symposium on High Performance Distributed Computing, page 841–848. ACM, 2010.
[346] Jeff A Stuart and John D Owens. Multi-GPU MapReduce on GPU Clusters. In
Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International,
page 1068–1079. IEEE, 2011.
[347] Michelle Suh, Sae Hyong Park, Byungjoon Lee, and Sunhee Yang. Building Firewall
over the Software-Defined Network Controller. In 16th International Conference on
Advanced Communication Technology, page 744–748. IEEE, 2014.
[348] Alexey Svyatkovskiy, Kosuke Imai, Mary Kroeger, and Yuki Shiraito. Large-scale text
processing pipeline with apache spark. In 2016 IEEE International Conference on Big
Data (Big Data), pages 3928–3935. IEEE, 2016.
[349] Colin Tankard. Big Data Security. Network Security, 2012(7):5–8, 2012.
[350] Linnet Taylor and Ralph Schroeder. Is Bigger Better? the Emergence of Big Data as
a Tool for International Development Policy. GeoJournal, 80(4):503–518, 2015.
[351] Claudio Tesoriero. Getting Started with OrientDB. Packt Publishing Ltd, 2013.
[352] D.J. Patil Thomas H. Davenport. Data Scientist: The Sexiest Job of the 21st Century.
[353] Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur, Namit Jain, Joy-
deep Sen Sarma, Raghotham Murthy, and Hao Liu. Data Warehousing and Analytics
Infrastructure at Facebook. In Proceedings of the 2010 ACM SIGMOD International
Conference on Management of Data, page 1013–1020. ACM, 2010.
[354] Muhammad Tirmazi, Adam Barker, Nan Deng, Md Ehtesam Haque, Zhijing Gene
Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. Borg: the Next Generation.
In EuroSys’20, Heraklion, Crete, 2020.
[355] Uchi Ugobame Uchibeke, Kevin A Schneider, Sara Hosseinzadeh Kassani, and Ralph
Deters. Blockchain Access Control Ecosystem for Big Data Security. In 2018 IEEE
International Conference on Internet of Things (iThings) and IEEE Green Computing
and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing
(CPSCom) and IEEE Smart Data (SmartData), page 1373–1378. IEEE, 2018.
[356] Sarah Underwood. Blockchain Beyond Bitcoin, 2016.
[357] Olivier Huynh Van and Jeff Gray. Systems and Methods for Determining Endpoint
Configurations for Endpoints of a Virtual Private Network (VPN) and Deploying the
Configurations to the Endpoints, April 19 2016. US Patent 9,319,300.
[358] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune,
and John Wilkes. Large-scale cluster management at google with borg. In Proceedings
of the Tenth European Conference on Computer Systems, pages 1–17, 2015.
Bibliography 305
[359] Akshat Verma, Puneet Ahuja, and Anindya Neogi. pMapper: Power and Migra-
tion Cost Aware Application Placement in Virtualized Systems. In Proceedings of
the 9th ACM/IFIP/USENIX International Conference on Middleware, page 243–264.
Springer-Verlag New York, Inc., 2008.
[360] LLC VoltDB. VoltDB Technical Overview, Whitepaper, 2010.
[361] Denny Vrandecic. Architecture for a Multilingual Wikipedia. Technical report,
Google, 2020.
[362] Aleksa Vukotic, Nicki Watt, Tareq Abedrabbo, Dominic Fox, and Jonas Partner.
Neo4j in Action. Manning Publications Co., 2014.
[363] Sameer Wadkar and Madhu Siddalingaiah. Apache Ambari. In Pro Apache Hadoop,
page 399–401. Springer, 2014.
[364] Guohui Wang, David G Andersen, Michael Kaminsky, Konstantina Papagiannaki,
TS Eugene Ng, Michael Kozuch, and Michael Ryan. C-Through: Part-time Optics in
Data Centers. In Proceedings of the ACM SIGCOMM 2010 conference, page 327–338,
2010.
[365] Guohui Wang, TS Eugene Ng, and Anees Shaikh. Programming Your Network at
Run-Time for Big Data Applications. In Proceedings of the First Workshop on Hot
Topics in Software Defined Networks, page 103–108, 2012.
[366] Rory Ward and Betsy Beyer. BeyondCorp: A New Approach to Enterprise Security.
2014.
[367] Eric W. Weisstein. “convolution.” from MathWorld – A Wolfram Web Resource.
http://mathworld.wolfram.com/Convolution.html".
[368] W Hwu Wen-mei. Programming Massively Parallel Processors. Morgan Kaufmann,
2010.
[369] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding. Data Mining With Big
Data. Knowledge and Data Engineering, IEEE Transactions on, 26(1):97–107, 2014.
[370] Miguel G Xavier, Israel C De Oliveira, Fabio D Rossi, Robson D Dos Passos, Kas-
siano J Matteussi, and Cesar AF De Rose. A Performance Isolation Analysis of
Disk-Intensive Workloads on Container-Based Clouds. In Parallel, Distributed and
Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on,
page 253–260. IEEE, 2015.
[371] Miguel Gomes Xavier, Marcelo Veiga Neves, and Cesar Augusto Fonticielha De Rose.
A Performance Comparison of Container-based Virtualization Systems for MapRe-
duce Clusters. In Parallel, Distributed and Network-Based Processing (PDP), 2014
22nd Euromicro International Conference on, page 299–306. IEEE, 2014.
[372] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Im-
age Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint
arXiv:1708.07747, 2017.
[373] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas
Stolcke, Dong Yu, and Geoffrey Zweig. The Microsoft 2016 Conversational Speech
Recognition System. In Acoustics, Speech and Signal Processing (ICASSP), 2017
IEEE International Conference on, page 5255–5259. IEEE, 2017.
306 Bibliography
[374] Qiang Xu, Xin Wang, Jianxin Li, Qingpeng Zhang, and Lele Chai. Distributed Sub-
graph Matching on Big Knowledge Graphs Using Pregel. IEEE Access, 7:116453–
116464, 2019.
[375] Xiaolong Xu, Qingxiang Liu, Yun Luo, Kai Peng, Xuyun Zhang, Shunmei Meng, and
Lianyong Qi. A Computation Offloading Method Over Big Data for IoT-Enabled
Cloud-Edge Computing. Future Generation Computer Systems, 95:522–533, 2019.
[376] Corinna Cortes Christopher J.C. Yann LeCun, Courant Institute. The MNIST
Database of Handwritten Digits.
[377] Fan Yao, Jingxin Wu, Guru Venkataramani, and Suresh Subramaniam. A Compar-
ative Analysis of Data Center Network Architectures. In 2014 IEEE International
Conference on Communications (ICC), page 3106–3111. IEEE, 2014.
[378] Xiaomeng Yi, Fangming Liu, Jiangchuan Liu, and Hai Jin. Building a Network High-
way for Big Data: Architecture and Challenges. IEEE Network, 28(4):5–13, 2014.
[379] Shui Yu, Meng Liu, Wanchun Dou, Xiting Liu, and Sanming Zhou. Networking for Big
Data: A Survey. IEEE Communications Surveys & Tutorials, 19(1):531–549, 2016.
[380] Yuan Yuan, Meisam Fathi Salmi, Yin Huai, Kaibo Wang, Rubao Lee, and Xiaodong
Zhang. Spark-GPU: An Accelerated in-Memory Data Processing Engine on Clusters.
In Big Data (Big Data), 2016 IEEE International Conference on, page 273–283. IEEE,
2016.
[381] Saima Zafar, Abeer Bashir, and Shafique Ahmad Chaudhry. On Implementation of
DCTCP on Three-Tier and Fat-Tree Data Center Network Topologies. SpringerPlus,
5(1):766, 2016.
[382] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion
Stoica. Spark: Cluster Computing with Working Sets. HotCloud, 10(10-10):95, 2010.
[383] Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Arm-
brust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J
Franklin, et al. Apache Spark: A Unified Engine for Big Data Processing. Communi-
cations of the ACM, 59(11):56–65, 2016.
[384] Alexander Zahariev. Google App Engine. Helsinki University of Technology, 2009.
[385] Zuzana Zatrochova. Analysis and Testing of Distributed NoSQL Datastore Riak. PhD
thesis, Masarykova univerzita, Fakulta informatiky, 2015.
[386] Dongpo Zhang. Big Data Security and Privacy Protection. In 8th International
Conference on Management and Computer Science (ICMCS 2018). Atlantis Press,
2018.
[387] Hao Zhang, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, and Meihui Zhang. In-Memory
Big Data Management and Processing: A survey. IEEE Transactions on Knowledge
and Data Engineering, 27(7):1920–1948, 2015.
[388] Xiang Zhang and Yann LeCun. Text Understanding from Scratch. arXiv preprint
arXiv:1502.01710, 2015.
[389] Ying Zhang, Mohammad Pezeshki, Philemon Brakel, Saizheng Zhang, Cesar Lau-
rent Yoshua Bengio, and Aaron Courville. Towards End-to-End Speech Recognition
with Deep Convolutional Neural Networks. arXiv preprint arXiv:1701.02720, 2017.
Bibliography 307
[390] Yu Zhang, William Chan, and Navdeep Jaitly. Very Deep Convolutional Networks
for End-to-End Speech Recognition. In Acoustics, Speech and Signal Processing
(ICASSP), 2017 IEEE International Conference on, page 4845–4849. IEEE, 2017.
Index
Hadoop
    Big Picture, 66
    Components, 66
    Data Transfer, 106
    Drill, 105
    HBase, 87
    HDFS, 67
    Hive, 100
    Issues, 93
    MapReduce, 72
    Pig, 98
    Spark, 119
    YARN, 94
HBase, 66, 87
    Architecture, 88
    Installation, 89
    Vs. Relational Database, 89
HDFS, 275
    Benefits, 68
    Examples, 71
    Failure, 71
    Read and Write, 69
    Scalability, 68
    Size of Block, 68, 69
High-Performance Computing, 233
    Interconnects, 241
    MPI, 241
    OpenMP, 247
    Scalability, 233
    Supercomputing, 249
    Volunteer Computing, 249
Hive, 66, 100
Homogeneity Attack, 226
HRDBMS, 178
Hybrid Memory, 18
IaaS, 34
InfiniBand, 192, 194
L-Diversity
    Similarity Attack, 228
    Skewness Attack, 228
L-diversity, 227
Lambda, 12, 278
LinkedIn, 278
Loosely Coupled, 19
Mahout, 66
MapReduce, 72, 279
    Combiner Function, 78
    Example, 81
    Input and Output, 74
    Inverted Indexing, 82
    Join Operations, 85
    Limitations, 117
    Map Phase, 74
    Operations, 73
    Partitioner Function, 76
    Reduce Phase, 74
    Secondary Sorting, 81
    Shuffle and Merge Phase, 74
    Sorting, 76
Mapreduce, 16
MIMD, 18
MISD, 18
Moderately Coupled, 20
MongoDB, 156
MPI, 24
    Code, 241
    Collaborative Functions, 246
    Commands, 244
    WordCount, 244
Network, 280
Neural Network, 257
NewSQL, 171
    Case Studies, 174
    Characteristics, 172
TPU, 239
Transactional Systems, 5
Uber, 277
Value, 4
Variety, 4
Velocity, 4
Venice, 279
Veracity, 4
Virtualization, 31, 39, 47
    Migration, 45
    Types, 41, 43
VMM, 42
VoltDB, 174
Volume, 4
Volunteer Computing, 249
YARN, 94
    Components, 94