20IT503 - Big Data Analytics - Unit 4


Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purposes of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only for the respective group /
learning community. If you are not the addressee, you should not
disseminate, distribute or copy this e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient, you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
20IT503
Big Data Analytics
Department: IT
Batch/Year: 2020-2024/ III

Created by: K. Selvi, AP/IT

Date: 30.07.2022
Table of Contents

S NO CONTENTS

1 Contents
2 Course Objectives
3 Pre Requisites (Course Names with Code)
4 Syllabus (With Subject Code, Name, LTPC details)
5 Course Outcomes
6 CO-PO/PSO Mapping
7 Lecture Plan
8 Activity Based Learning
9 Lecture Notes: Unit 4 – Introducing Hadoop
  4.1 Hadoop Overview
  4.2 HDFS (Hadoop Distributed File System)
  4.3 Processing Data with Hadoop – RDBMS versus Hadoop
  4.4 Components and Block Replication
  4.5 Introduction to MapReduce – Features of MapReduce
  4.6 Introduction to NoSQL: CAP theorem
  4.7 MongoDB: RDBMS vs. MongoDB
  4.8 MongoDB Database Model – Data Types and Sharding
  4.9 Introduction to Hive – Hive Architecture
  4.10 Hive Query Language (HQL)
10 Assignments
11 Part A (Questions & Answers)
12 Part B Questions
13 Supportive Online Certification Courses
14 Real time Applications
15 Assessment Schedule
16 Prescribed Text Books & Reference Books
17 Mini project Suggestions

Course Objectives

To Understand the Big Data Platform and its Use cases

To Provide an overview of Apache Hadoop

To Provide HDFS Concepts and Interfacing with HDFS

To Understand Map Reduce Jobs


Pre Requisites

CS8391 – Data Structures

CS8492 – Database Management System


Syllabus
20IT503 BIG DATA ANALYTICS
L T P C: 3 0 0 3
UNIT I INTRODUCTION TO BIG DATA 9

Data Science – Fundamentals and Components –Types of Digital Data –


Classification of Digital Data – Introduction to Big Data – Characteristics of
Data – Evolution of Big Data – Big Data Analytics – Classification of
Analytics – Top Challenges Facing Big Data – Importance of Big Data
Analytics.

UNIT II DESCRIPTIVE ANALYTICS USING STATISTICS 9

Mean, Median and Mode – Standard Deviation and Variance – Probability –


Probability Density Function – Percentiles and Moments – Correlation and
Covariance – Conditional Probability – Bayes’ Theorem – Introduction to
Univariate, Bivariate and Multivariate Analysis – Dimensionality Reduction
using Principal Component Analysis (PCA) and LDA.

UNIT III PREDICTIVE MODELING AND MACHINE LEARNING 9

Linear Regression – Polynomial Regression – Multivariate Regression –


Bias/Variance Trade Off – K Fold Cross Validation – Data Cleaning and
Normalization – Cleaning Web Log Data – Normalizing Numerical Data –
Detecting Outliers – Introduction to Supervised And Unsupervised Learning
– Reinforcement Learning – Dealing with Real World Data – Machine
Learning Algorithms –Clustering.
UNIT IV BIG DATA HADOOP FRAMEWORK 9

Introducing Hadoop –Hadoop Overview – RDBMS versus Hadoop – HDFS


(Hadoop Distributed File System): Components and Block Replication –
Processing Data with Hadoop – Introduction to MapReduce – Features of
MapReduce – Introduction to NoSQL: CAP theorem – MongoDB: RDBMS
Vs MongoDB – Mongo DB Database Model – Data Types and Sharding –
Introduction to Hive – Hive Architecture – Hive Query Language (HQL).

UNIT V PYTHON AND R PROGRAMMING 9

Python Introduction – Data types - Arithmetic - control flow – Functions -


args - Strings – Lists – Tuples – Sets – Dictionaries. Case study: Using R,
Python, Hadoop, Spark and Reporting tools to understand and analyze
real-world data sources in the following domains – financial,
insurance, healthcare – using Iris and UCI datasets.
Course Outcomes
CO# COs K Level

CO1 Identify Big Data and its Business Implications. K3

CO2 List the components of Hadoop and Hadoop Eco-System K4

CO3 Access and Process Data on Distributed File System K4

CO4 Manage Job Execution in Hadoop Environment K4

CO5 Develop Big Data Solutions using Hadoop Eco System K4


CO-PO/PSO Mapping

CO#  | PO1 | PO2 | PO3 | PO4 | PO5 | PO6 | PO7 | PO8 | PO9 | PO10 | PO11 | PO12 | PSO1 | PSO2 | PSO3
CO1  |  2  |  3  |  3  |  3  |  3  |  1  |  1  |  -  |  1  |  2   |  1   |  1   |  2   |  2   |  2
CO2  |  2  |  3  |  2  |  3  |  3  |  1  |  1  |  -  |  1  |  2   |  1   |  1   |  2   |  2   |  2
CO3  |  2  |  3  |  2  |  3  |  3  |  1  |  1  |  -  |  1  |  2   |  1   |  1   |  2   |  2   |  2
CO4  |  2  |  3  |  2  |  3  |  3  |  1  |  1  |  -  |  1  |  2   |  1   |  1   |  2   |  2   |  2
CO5  |  2  |  3  |  2  |  3  |  3  |  1  |  1  |  -  |  1  |  2   |  1   |  1   |  1   |  1   |  1
Lecture Plan
UNIT – IV
S No | Topics | No. of periods | Proposed date | Actual date | Pertaining CO | Taxonomy level | Mode of delivery
1 | Introducing Hadoop | 1 | | | CO2 | K4 | Chalk & Board
2 | Hadoop Overview, RDBMS versus Hadoop | 1 | | | CO2 | K4 | Chalk & Board
3 | HDFS (Hadoop Distributed File System) | 1 | | | CO2 | K4 | Chalk & Board
4 | Components and Block Replication | 1 | | | CO4 | K4 | Chalk & Board
5 | Processing Data with Hadoop | 1 | | | CO3 | K4 | Chalk & Board
6 | Introduction to MapReduce, Features of MapReduce | 1 | | | CO4 | K4 | Chalk & Board
7 | Introduction to NoSQL: CAP theorem | 1 | | | CO4 | K4 | Chalk & Board
8 | MongoDB: RDBMS Vs MongoDB | 1 | | | CO4 | K4 | Chalk & Board
9 | Mongo DB Database Model | 1 | | | CO4 | K4 | Chalk & Board
10 | Data Types and Sharding | 1 | | | CO4 | K4 | Chalk & Board
11 | Introduction to Hive – Hive Architecture – Hive Query Language (HQL) | 1 | | | CO4 | K4 | Chalk & Board
Lecture Plan
UNIT IV BIG DATA HADOOP FRAMEWORK 9
Introducing Hadoop –Hadoop Overview – RDBMS versus Hadoop – HDFS (Hadoop
Distributed File System): Components and Block Replication – Processing Data with
Hadoop – Introduction to MapReduce – Features of MapReduce – Introduction to
NoSQL: CAP theorem – MongoDB: RDBMS Vs MongoDB – Mongo DB Database
Model – Data Types and Sharding – Introduction to Hive – Hive Architecture – Hive
Query Language (HQL)
Session No. | Topics to be covered | Mode of delivery | Reference
1 | Introducing Hadoop | Chalk & Board | 1
2 | Hadoop Overview – RDBMS versus Hadoop | Chalk & Board | 1
3 | HDFS (Hadoop Distributed File System): Components and Block Replication | Chalk & Board | 1
4 | Processing Data with Hadoop | Chalk & Board | 1
5 | Introduction to MapReduce – Features of MapReduce | Chalk & Board | 1
6 | Introduction to NoSQL: CAP theorem | Chalk & Board | 1
7 | MongoDB: RDBMS Vs MongoDB | Chalk & Board | 1
8 | Mongo DB Database Model | Chalk & Board | 1
9 | Data Types and Sharding | Chalk & Board | 1
10 | Introduction to Hive – Hive Architecture – Hive Query Language (HQL) | Chalk & Board | 1

NPTEL / OTHER REFERENCES / WEBSITES:

1. EMC Education Services, "Data Science and Big Data Analytics:
Discovering, Analyzing, Visualizing and Presenting Data", Wiley Publishers,
2015.
NUMBER OF PERIODS : Planned: 9 Actual:9
DATE OF COMPLETION : Planned: Actual:
REASON FOR DEVIATION (IF ANY) :

CORRECTIVE MEASURES :

Signature of The Faculty


Signature Of HoD
ACTIVITY BASED LEARNING

Perform Sentiment Analysis for the following Tweets

Statement | Sentiment
The First #GOPDebate: Social Media Reaction and More http://t.co/X6KUVSkltF | ?
RT @RobGeorge: That Carly Fiorina is trending -- hours after HER debate -- above any of the men in just-completed #GOPdebate says she's on … | ?
RT @BrandonDHowell: Great way to start Friday. #CarlyFiorina @CarlyFiorina #GOPDebate http://t.co/1ai9CuZ8bY | ?
RT @LisaVikingstad: Ted Cruz at the #GOPDebate will be like Ted Cruz in bed. He will keep a confusingly sad look on his face and refuse to … | ?
RT @ariannahuff: The best and worst from the #GOPDebate http://t.co/YkI30j7hhO | ?
LECTURE NOTES
UNIT IV

4. INTRODUCING HADOOP:

4.1 HADOOP OVERVIEW:

Hadoop is an Apache open-source framework written in Java that
allows distributed processing of large datasets across clusters of computers using
simple programming models. The Hadoop framework application works in an
environment that provides distributed storage and computation across clusters of
computers. Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.

Apache Hadoop is an open-source framework that manages data
processing and storage for big data applications. It allows the clustering of
multiple computers to analyze massive datasets in parallel. Hadoop can process
structured data (data which can be stored in an SQL database in a table with rows and
columns), semi-structured data (data that does not reside in a relational database but
has organizational properties) and unstructured data (data which is not organized in a
predefined manner), and it can scale from a single server to thousands of machines.
It consists of three main components: HDFS for storage, MapReduce for processing,
and YARN for resource management.

Hadoop Architecture:
At its core, Hadoop has two major layers namely :
1.Processing/Computation layer (MapReduce), and
2.Storage layer (Hadoop Distributed File System).
(Figure: Hadoop Architecture)

4.2 Hadoop Distributed File System:


The Hadoop Distributed File System (HDFS) is based on the Google
File System (GFS) and provides a distributed file system that is designed to run
on commodity hardware. It has many similarities with existing distributed file
systems. However, the differences from other distributed file systems are
significant. It is highly fault-tolerant and is designed to be deployed on low-cost
hardware. It provides high throughput access to application data and is suitable
for applications having large datasets.
Apart from the above-mentioned two core components, Hadoop
framework also includes the following two modules
1.Hadoop Common: These are Java libraries and utilities required by other
Hadoop modules.
2.Hadoop YARN : This is a framework for job scheduling and cluster resource
management.
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy
configurations that handle large-scale processing. As an alternative, you can
tie together many single-CPU commodity computers into a single functional
distributed system; practically, the clustered machines can read the dataset
in parallel and provide a much higher throughput, and the cluster is cheaper than
one high-end server. This is the first motivation for using Hadoop:
it runs across clustered, low-cost machines.
Hadoop runs code across a cluster of computers. This process
includes the following core tasks that Hadoop performs
1.Data is initially divided into directories and files. Files are divided
into uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
2.These files are then distributed across various cluster nodes for
further processing.
3.HDFS, being on top of the local file system, supervises the
processing.
4.Blocks are replicated for handling hardware failure.
5.Checking that the code was executed successfully.
6.Performing the sort that takes place between the map and reduce
stages.
7.Sending the sorted data to a certain computer.
8.Writing the debugging logs for each job.
Advantages of Hadoop
Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically distributes the data and
work across the machines and, in turn, utilizes the underlying parallelism of the
CPU cores.
Hadoop does not rely on hardware to provide fault tolerance and
high availability (FTHA); rather, the Hadoop library itself has been designed to
detect and handle failures at the application layer.
Servers can be added or removed from the cluster dynamically
and Hadoop continues to operate without interruption.
Another big advantage of Hadoop is that, apart from being open
source, it is compatible with all platforms since it is Java based.
4.3 Processing Data with Hadoop :
Managing Resources and Applications with Hadoop YARN
YARN divides resource management and job
scheduling/monitoring into separate daemons. There is one global Resource
Manager and a per-application Application Master. An application can be either
a single job or a DAG of jobs.
The Resource Manager has two components – the Scheduler and the
Application Manager.
Scheduler:
The Scheduler is a pure scheduler, i.e., it does not track the status
of running applications. It only allocates resources to the various competing
applications, and it does not restart a job after failure due to hardware or
application faults. The Scheduler allocates resources based on an
abstract notion of a container. A container is nothing but a fraction of
resources such as CPU, memory, disk and network.
Application Manager:
The tasks of the Application Manager are:
1.Accepting job submissions from clients.
2.Negotiating the first container for the application-specific Application Master.
3.Restarting the Application Master container after application failure.
The responsibilities of the Application Master are:
1.Negotiating containers from the Scheduler.
2.Tracking container status and monitoring progress.
YARN supports the concept of resource reservation via the
Reservation System. With it, a user can reserve a set of resources for the
execution of a particular job over time, with temporal constraints. The
Reservation System makes sure that the resources are available to the job
until its completion, and it also performs admission control for reservations.
YARN can scale beyond a few thousand nodes via YARN
Federation. Federation allows multiple sub-clusters to be wired into a
single massive cluster, so that many independent clusters can be used together for a
single large job. It can be used to achieve a very large-scale system.
Let us summarize how Hadoop works step by step:
1.Input data is broken into blocks of size 128 MB and then the
blocks are moved to different nodes.
2.Once all the blocks of the data are stored on data-nodes, the
user can process the data.
3.Resource Manager then schedules the program (submitted by
the user) on individual nodes.
4.Once all the nodes process the data, the output is written back
to HDFS
Interacting with Hadoop Ecosystem
Hadoop has an ecosystem that has evolved from its three core
components processing, resource management, and storage. In this topic,
you will learn the components of the Hadoop ecosystem and how they
perform their roles during Big Data processing. The Hadoop ecosystem is
continuously growing to meet the needs of Big Data. It comprises the
following twelve components:
HDFS(Hadoop Distributed file system)
HBase
Sqoop
Flume
Spark
Hadoop MapReduce
Pig
Impala
Hive
Cloudera Search
Oozie
Hue.
Let us understand the role of each component of the Hadoop ecosystem.
Components of Hadoop Ecosystem:
HDFS (HADOOP DISTRIBUTED FILE SYSTEM):
--HDFS is a storage layer for Hadoop.
--HDFS is suitable for distributed storage and processing, that is, while
the data is being stored, it first gets distributed and then it is processed.
--HDFS provides streaming access to file system data.
--HDFS provides file permission and authentication.
--HDFS uses a command line interface to interact with Hadoop.
So what stores data in HDFS? It is HBase which stores data in HDFS.
Hbase:
--HBase is a NoSQL database or non-relational database.
--HBase is important and mainly used when you need random, real-time
read or write access to your Big Data.
--It provides support for a high volume of data and high throughput.
--In HBase, a table can have thousands of columns.

RDBMS VERSUS HADOOP:

RDBMS (Relational Database Management System):

RDBMS is an information management system which is based
on a data model. In RDBMS, tables are used for information storage. Each
row of a table represents a record and each column represents an attribute of
the data. The organization of data and its manipulation processes differ in
RDBMS from other databases. RDBMS ensures the ACID (atomicity, consistency,
isolation, durability) properties required for designing a database. The
purpose of an RDBMS is to store, manage, and retrieve data as quickly and
reliably as possible.

Hadoop:

It is an open-source software framework used for storing data
and running applications on clusters of commodity hardware. It provides large
storage capacity and high processing power, and it can manage many
concurrent processes. It is used in predictive analysis, data
mining and machine learning.
It can handle both structured and unstructured forms of data. It is more
flexible in storing, processing, and managing data than a traditional RDBMS.
Unlike traditional systems, Hadoop enables multiple analytical processes on
the same data at the same time. It supports scalability very flexibly.

Below is a table of differences between RDBMS and Hadoop:

S.No. | RDBMS | Hadoop
1 | Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | An open-source software framework used for storing data and running applications or processes concurrently.
2 | Mostly structured data is processed. | Both structured and unstructured data is processed.
3 | It is best suited for OLTP environments. | It is best suited for Big Data.
4 | It is less scalable than Hadoop. | It is highly scalable.
5 | Data normalization is required in RDBMS. | Data normalization is not required in Hadoop.
6 | It stores transformed and aggregated data. | It stores huge volumes of data.
7 | It has very low latency in response. | It has some latency in response.
8 | The data schema of RDBMS is static. | The data schema of Hadoop is dynamic.
9 | High data integrity available. | Lower data integrity than RDBMS.
10 | Cost is applicable for licensed software. | Free of cost, as it is open-source software.
4.4 HDFS: COMPONENTS AND BLOCK REPLICATION:
It is difficult to maintain huge volumes of data in a single
machine. Therefore, it becomes necessary to break down the data into
smaller chunks and store it on multiple machines.
File systems that manage the storage across a network of
machines are called distributed file systems.
Hadoop Distributed File System (HDFS) is the storage component
of Hadoop. All data stored on Hadoop is stored in a distributed manner
across a cluster of machines. But it has a few properties that define its
existence.
Huge volumes – Being a distributed file system, it is highly capable of
storing petabytes of data without any glitches.
Data access – It is based on the philosophy that “the most effective data
processing pattern is write-once, the read-many-times pattern”.
Cost-effective – HDFS runs on a cluster of commodity hardware. These are
inexpensive machines that can be bought from any vendor.
What are the components of the Hadoop Distributed File System (HDFS)?
Broadly speaking, HDFS has two main components – data blocks
and the nodes storing those data blocks. But there is more to it than meets the
eye, so let's look at these one by one to get a better understanding.
HDFS Blocks:
HDFS breaks down a file into smaller units. Each of these units is
stored on a different machine in the cluster. This, however, is transparent to
the user working on HDFS: to them, it seems as if all the data is stored on a
single machine.
These smaller units are the blocks in HDFS. The size of each of
these blocks is 128 MB by default, and you can easily change it according to
requirements. So, if you had a file of size 512 MB, it would be divided into 4
blocks storing 128 MB each.

But you must be wondering: why such a large amount in a
single block? Why not multiple blocks of 10 KB each? Well, the amount of
data we generally deal with in Hadoop is usually in the order of
petabytes or higher.
Therefore, if we create blocks of a small size, we would end up
with a colossal number of blocks. This would mean we would have to deal
with equally large metadata regarding the location of the blocks, which would
just create a lot of overhead. And we don't really want that!
There are several perks to storing data in blocks rather than
saving the complete file.
The file itself may be too large to store on any single disk
alone; therefore, it is prudent to spread it across different machines in the
cluster.
Storing in blocks also enables a proper spread of the workload and
prevents a single machine from becoming a bottleneck, by taking advantage of parallelism.
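Before moving on, here is a minimal sketch of how the block layout of a file can be inspected with the Hadoop Java FileSystem API. It assumes a reachable HDFS deployment; the path /user/data/input.txt is a hypothetical example, not one used elsewhere in these notes.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // handle to the configured file system (HDFS)

        Path file = new Path("/user/data/input.txt");  // hypothetical file path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per HDFS block: offset, length and the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}

For a 512 MB file with the default 128 MB block size, the loop above would print four entries, one per block, matching the arithmetic described earlier.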
Now, you must be wondering, what about the machines in the cluster? How
do they store the blocks and where is the metadata stored? Let’s find out.
Namenode in HDFS:
HDFS operates in a master-worker architecture; this means that
there is one master node and several worker nodes in the cluster. The master
node is the Namenode.
The Namenode is the master node that runs on a separate node in the
cluster. It manages the file system namespace, which is the file system tree or
hierarchy of files and directories, and it stores information like the owners of
files, file permissions, etc., for all the files. It is also aware of the locations of
all the blocks of a file and their sizes.
All this information is maintained persistently over the local disk in the form of
two files: Fsimage and Edit Log.
1.Fsimage:
stores the information about the files and directories in the file
system. For files, it stores the replication level, modification and access times,
access permissions, blocks the file is made up of, and their sizes. For
directories, it stores the modification time and permissions.
2.Edit log:
The Edit log, on the other hand, keeps track of all the write operations that the
client performs. These updates are also applied to the in-memory metadata so
that read requests can be served.
Whenever a client wants to write information to HDFS or read
information from HDFS, it connects with the Namenode. The Namenode
returns the location of the blocks to the client and the operation is carried out.
Yes, that’s right, the Namenode does not store the blocks. For that, we have
separate nodes.
(Figure: A file stored in HDFS)

Datanodes in HDFS:
Datanodes are the worker nodes. They are inexpensive
commodity hardware that can be easily added to the cluster.
Datanodes are responsible for storing, retrieving, replicating,
deletion, etc. of blocks when asked by the Namenode.
They periodically send heartbeats to the Namenode so that it is
aware of their health. With that, a DataNode also sends a list of blocks that
are stored on it so that the Namenode can maintain the mapping of blocks
to Datanodes in its memory.
But in addition to these two types of nodes in the cluster, there is
also another node called the Secondary Namenode. Let’s look at what that
is.
Secondary Namenode in HDFS:

Suppose we need to restart the Namenode, which can happen
in case of a failure. This would mean that we have to copy the Fsimage from
disk to memory. We would also have to apply the latest Edit Log
to the Fsimage to keep track of all the transactions. But if we restart the node
after a long time, the Edit log could have grown in size, which would
mean that it would take a lot of time to apply its transactions. And during
this time, the file system would be offline. Therefore, to
solve this problem, we bring in the Secondary Namenode.

Secondary Namenode:

It is another node present in the cluster whose main task is to
regularly merge the Edit log with the Fsimage and produce checkpoints of
the primary's in-memory file system metadata. This is also referred to
as Checkpointing.
The checkpointing procedure is computationally very
expensive and requires a lot of memory, which is why the Secondary
Namenode runs on a separate node in the cluster.
However, despite its name, the Secondary Namenode does not
act as a Namenode. It is merely there for Checkpointing and keeping a copy
of the latest Fsimage.

Replication of blocks:
HDFS is a reliable storage component of Hadoop. This is because
every block stored in the file system is replicated on different Data Nodes in
the cluster. This makes HDFS fault-tolerant.
The default replication factor in HDFS is 3. This means that every
block will have two more copies of it, each stored on a separate DataNode in
the cluster. However, this number is configurable.
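As a small sketch of how that setting can be changed for an individual file, the snippet below uses the same FileSystem API; the path is again a hypothetical example, and cluster-wide defaults are normally controlled through the dfs.replication property rather than per-file calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Ask the Namenode to keep 5 replicas of every block of this file
        // instead of the default 3. Returns true if the request was accepted.
        boolean ok = fs.setReplication(new Path("/user/data/input.txt"), (short) 5);
        System.out.println("Replication change accepted: " + ok);
        fs.close();
    }
}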

4.5 INTRODUCTION TO MAPREDUCE:


In Hadoop, MapReduce combines both job management
and the programming model for execution.

The MapReduce execution environment employs a master/slave


execution model, in which one master node (called the JobTracker) manages
a pool of slave computing resources (called TaskTrackers) that are called
upon to do the actual work.

The role of the JobTracker is to manage the resources with some


specific responsibilities, including managing the TaskTrackers, continually
monitoring their accessibility and availability, and the different aspects of job
management that include scheduling tasks, tracking the progress of
assigned tasks, reacting to identified failures, and ensuring fault tolerance
of the execution.
The role of the TaskTracker is much simpler: wait for a task
assignment, initiate and execute the requested task, and provide status
back to the JobTracker on a periodic basis.
Different clients can make requests to the JobTracker, which
becomes the sole arbitrator for allocation of resources.

Limitations within this existing MapReduce model.

First, the programming paradigm is nicely suited to applications


where there is locality between the processing and the data, but applications
that demand data movement will rapidly become bogged down by network
latency issues.

Second, not all applications are easily mapped to the MapReduce


model, yet applications developed using alternative programming methods
would still need the MapReduce system for job management.
Third, the allocation of processing nodes within the cluster is fixed
through the allocation of certain nodes as "map slots" versus "reduce slots." When the
computation is weighted toward one of the phases, the nodes assigned to
the other phase are largely unused, resulting in processor underutilization.
This is being addressed in future versions of Hadoop through the
segregation of duties within a revision called YARN. In this approach, overall
resource management has been centralized while management of resources at
each node is now performed by a local Node Manager.
YARN:
The fundamental idea of YARN is to split up the functionalities of
resource management and job scheduling/monitoring into separate daemons.
The idea is to have a global Resource Manager (RM) and per-application
Application Master (AM). An application is either a single job or a DAG of jobs.
The Resource Manager and the Node Manager form the data-
computation framework. The Resource Manager is the ultimate authority
that arbitrates resources among all the applications in the system. The
Node Manager is the per-machine framework agent that is responsible for
containers, monitoring their resource usage (CPU, memory, disk, network) and
reporting the same to the Resource Manager/Scheduler.
The per-application Application Master is a framework-specific
library and is tasked with negotiating resources from the Resource Manager
and working with the Node Manager(s) to execute and monitor the tasks.
The Resource Manager has two main components: Scheduler and
Applications Manager.

(Figure: The YARN Architecture)


The Scheduler is responsible for allocating resources to the various running
applications subject to familiar constraints of capacities, queues, etc. The
Scheduler is a pure scheduler in the sense that it performs no monitoring or
tracking of status for the application. Also, it offers no guarantees about
restarting failed tasks, whether they fail due to application failure or hardware failures.
The Scheduler performs its scheduling function based on the
resource requirements of the applications; it does so based on the abstract
notion of a resource Container, which incorporates elements such as memory,
CPU, disk and network.

The Applications Manager is responsible for accepting job
submissions, negotiating the first container for executing the application-specific
Application Master, and providing the service for restarting the
Application Master container on failure. The per-application Application
Master has the responsibility of negotiating appropriate resource containers
from the Scheduler, tracking their status and monitoring their progress.
Advantages of YARN
YARN introduces the concept of an Application Master associated
with each application, which directly negotiates with the central Resource
Manager for resources while taking over the responsibility for monitoring
progress and tracking status. Pushing this responsibility to the application
environment allows greater flexibility in the assignment of resources as
well as more effective scheduling to improve node utilization.
The YARN approach allows applications to be better aware of the
data allocation across the topology of the resources within a cluster. This
awareness allows for improved colocation of compute and data resources,
reducing data motion, and consequently, reducing delays associated with data
access latencies. The result should be increased scalability and performance.
THE MAPREDUCE PROGRAMMING MODEL
MapReduce, which can be used to develop applications that read,
analyze, transform, and share massive amounts of data, is not a database
system but rather a programming model introduced and described by
Google researchers for parallel, distributed computation involving massive
datasets (ranging from hundreds of terabytes to petabytes).
Application development in MapReduce is a combination of the
familiar procedural/imperative approaches used by Java or C++
programmers embedded within what is effectively a functional-language
programming model such as the one used within languages like Lisp and
APL.

MapReduce depends on two basic operations that are
applied to sets or lists of data value pairs:
1.Map, which describes the computation or analysis applied to a set of

input key/value pairs to produce a set of intermediate key/value pairs.


2.Reduce, in which the set of values associated with the intermediate

key/value pairs output by the Map operation are combined to provide the
results.
A MapReduce application is envisioned as a series of basic operations
applied in a sequence to small sets of many (millions, billions, or even
more) data items. These data items are logically organized in a way that
enables the MapReduce execution model to allocate tasks that can be
executed in parallel.
The data items are indexed using a defined key into (key, value)
pairs, in which the key represents some grouping criterion associated with
a computed value. With some applications applied to massive datasets, the
theory is that the computations applied during the Map phase to each
input key/value pair are independent from one another. Figure 1.7 shows
how Map and Reduce work.

Combining both data and computational independence means that both


the data and the computations can be distributed across multiple storage
and processing units and automatically parallelized. This parallelizability
allows the programmer to exploit scalable massively parallel processing
resources for increased processing speed and performance.

A SIMPLE EXAMPLE
In the canonical MapReduce example of counting the number of
occurrences of a word across a corpus of many documents, the key is the
word and the value is the number of times the word is counted at each
process node.
The process can be subdivided into much smaller sets of tasks. For
example: The total number of occurrences of each word in the entire collection
of documents is equal to the sum of the occurrences of each word in each
document. The total number of occurrences of each word in each document
can be computed as the sum of the occurrences of each word in each
paragraph.
The total number of occurrences of each word in each paragraph
can be computed as the sum of the occurrences of each word in each
sentence. In this example, the determination of the right level of
parallelism can be scaled in relation to the size of the "chunk" to be processed
and the number of computing resources available in the pool.
A single task might consist of counting the number of occurrences of

each word in a single document, or a paragraph, or a sentence, depending on


the level of granularity.
Each time a processing node is assigned a set of tasks in processing
different subsets of the data, it maintains interim results associated with each
key. This will be done for all of the documents, and interim results for each
word are created.
Once all the interim results are completed, they can be
redistributed so that all the interim results associated with a key can be
assigned to a specific processing node that accumulates the results into a final
result.

MORE ON MAP REDUCE


The MapReduce programming model consists of five basic
operations:
Input data: In which the data is loaded into the environment and is
distributed across the storage nodes, and distinct data artifacts are associated
with a key value.
Map, in which a specific task is applied to each artifact with the interim
result associated with a different key value. An analogy is that each
processing node has a bucket for each key, and interim results are put into

the bucket for that key.


Sort/shuffle, in which the interim results are sorted and redistributed so
that all interim results for a specific key value are located at one single-
processing node. To continue the analogy, this would be the process of
delivering all the buckets for a specific key to a single delivery point.

Reduce, in which the interim results are accumulated into a final result.
Output result, where the final output is sorted.
These steps are presumed to be run in sequence, and
applications developed using Map Reduce often execute a series of iterations
of the sequence, in which the output results from iteration n becomes the
input to iteration n+1.

Illustration of MapReduce
The simplest illustration of MapReduce is a word count example
in which the task is to simply count the number of times each word
appears in a collection of documents.
Each map task processes a fragment of the text, line by line,
parses a line into words, and emits <word, 1> for each word,
regardless of how many times word appears in the line of text.

In this example, the map step parses the provided text string
into individual words and emits a set of key/value pairs of the form
<word, 1>. For each unique key (in this example, word), the reduce step
sums the 1 values and outputs the <word, count> key/value pairs. Because
the word "each" appeared twice in the given line of text, the reduce step
provides a corresponding key/value pair of <each, 2>.
(Figure: Example of how map and reduce work)
It should be noted that, in this example, the original key, 1234, is
ignored in the processing. In a typical word count application, the map step
may be applied to millions of lines of text, and the reduce step will summarize
the key/value pairs generated by all the map steps.

MapReduce has the advantage of being able to distribute the


workload over a cluster of computers and run the tasks in parallel. In a
word count, the documents, or even pieces of the documents, could be
processed simultaneously during the map step. A key characteristic of
MapReduce is that the processing of one portion of the input can be carried
out independently of the processing of the other inputs. Thus, the workload
can be easily distributed over a cluster of machines.

Structuring a MapReduce Job in Hadoop


Hadoop provides the ability to specify details of how a MapReduce
job is run. A typical MapReduce program in Java consists of three
classes: the driver, the mapper, and the reducer.
The driver provides details such as input file locations, the
provisions for adding the input file to the map task, the names of the
mapper and reducer Java classes, and the location of the reduce task
output. Various job configuration options can also be specified in the driver.
For example, the number of reducers can be manually specified in the driver.
Such options are useful depending on how the MapReduce job output will
be used in later downstream processing.

The mapper provides the logic to be processed on each data


block corresponding to the specified input files in the driver code. For example,
in the word count MapReduce example provided earlier, a map task is
instantiated on a worker node where a data block resides. Each map task
processes a fragment of the text, line by line, parses a line into words, and
emits <word, 1> for each word, regardless of how many times word appears
in the line of text. The key/value pairs are stored temporarily in the worker
node's memory (or cached to the node's disk).
Next, the key/value pairs are processed by the built-in shuffle and
sort functionality based on the number of reducers to be executed. In this
simple example, there is only one reducer. So, all the intermediate data is
passed to it. From the various map task outputs, for each unique key, arrays
(lists in Java) of the associated values in the key/value pairs are
constructed. Also, Hadoop ensures that the keys are passed to each reducer in
sorted order.
<each,(1,1)> is the first key/value pair processed, followed
alphabetically by <For,(1)> and the rest of the key/value pairs until the last
key/value pair is passed to the reducer. The ( ) denotes a list of values which,
in this case, is just an array of ones.
(Figure: Shuffle and sort)
In general, each reducer processes the values for each key and
emits a key/value pair as defined by the reduce logic. The output is then
stored in HDFS like any other file in, say, 64 MB blocks replicated three times
across the nodes.
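The following is a sketch of how these three classes can look for the word count job, written against the org.apache.hadoop.mapreduce API; the class names and the input/output paths passed on the command line are illustrative choices, not taken from the text above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: parses each line into words and emits <word, 1> for every occurrence.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);      // emit <word, 1>
            }
        }
    }

    // Reducer: receives <word, (1, 1, ...)> after shuffle/sort and emits <word, count>.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);        // emit <word, total count>
        }
    }

    // Driver: names the mapper/reducer classes, the input/output paths and key/value types.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running it with an HDFS input directory and a new output directory as the two command-line arguments submits the job to the cluster in the way described above.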
Additional Considerations in Structuring a MapReduce Job
Several Hadoop features provide additional functionality to a
MapReduce job.
First, a combiner is a useful option to apply, when possible,
between the map task and the shuffle and sort. Typically, the combiner
applies the same logic used in the reducer, but it also applies this logic on the
output of each map task. In the word count example, a combiner sums up
the number of occurrences of each word from a mapper's output.
The figure below illustrates how a combiner processes a single string
in the simple word count example.
(Figure: MapReduce programming model example with a combiner)
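If a combiner is wanted in the word count sketch shown earlier, it is typically enabled with a single extra line in the driver, reusing the reducer class as the combiner:

// In the WordCount driver, before submitting the job:
job.setCombinerClass(IntSumReducer.class);  // apply the summing logic to each mapper's local output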
FEATURES OF MAPREDUCE:
1.Scalability:
Apache Hadoop is a highly scalable framework. This is because of
its ability to store and distribute huge data across plenty of servers. All these
servers are inexpensive and can operate in parallel. We can easily scale the
storage and computation power by adding servers to the cluster.
Hadoop MapReduce programming enables organizations to run
applications from large sets of nodes, which could involve the use of
thousands of terabytes of data.
2.Flexibility:
MapReduce programming enables companies to access new
sources of data. It enables companies to operate on different types of data. It
allows enterprises to access structured as well as unstructured data, and
derive significant value by gaining insights from the multiple sources of data.
Additionally, the MapReduce framework also provides support for
the multiple languages and data from sources ranging from email, social
media, to clickstream.
MapReduce processes data as simple key-value pairs and thus
supports many data types, including metadata, images, and large files. Hence,
MapReduce is more flexible in dealing with data than a traditional DBMS.
3. Security and Authentication:
The MapReduce programming model uses the HBase and HDFS
security platforms, which allow access only to authenticated users to operate
on the data. Thus, it protects against unauthorized access to system data and
enhances system security.
4.Cost-effective solution:
Hadoop's scalable architecture with the MapReduce programming
framework allows the storage and processing of large data sets in a very
affordable manner.
5.Fast:
Hadoop uses a distributed storage method called the Hadoop
Distributed File System, which basically implements a mapping system for
locating data in a cluster.
The tools used for data processing, such as MapReduce
programming, are generally located on the very same servers, which allows for
faster processing of data.
So, even when dealing with large volumes of unstructured
data, Hadoop MapReduce takes just minutes to process terabytes of data. It
can process petabytes of data in just an hour.

6. Simple model of programming:


Amongst the various features of Hadoop MapReduce, one of the
most important features is that it is based on a simple programming model.
Basically, this allows programmers to develop the MapReduce programs which
can handle tasks easily and efficiently.
The MapReduce programs can be written in Java, which is not
very hard to pick up and is also used widely. So, anyone can easily learn and
write MapReduce programs and meet their data processing needs.

7. Parallel Programming:
One of the major aspects of the working of MapReduce
programming is its parallel processing. It divides the tasks in a manner that
allows their execution in parallel.
The parallel processing allows multiple processors to execute
these divided tasks. So the entire program is run in less time.
8. Availability and resilient nature:
Whenever the data is sent to an individual node, the same set of
data is forwarded to some other nodes in a cluster. So, if any particular node
suffers from a failure, then there are always other copies present on other
nodes that can still be accessed whenever needed. This assures high
availability of data.
One of the major features offered by Apache Hadoop is its fault
tolerance. The Hadoop MapReduce framework has the ability to quickly
recognize faults that occur and then apply a quick, automatic recovery
solution. This feature makes it a game-changer in the world of big data processing.

4.6 INTRODUCTION TO NoSQL:
NoSQL (Not only Structured Query Language) is a term used to
describe those data stores that are applied to unstructured data. As described
earlier, HBase is such a tool that is ideal for storing key/values in column
families. In general, the power of NoSQL data stores is that as the size of the
data grows, the implemented solution can scale by simply adding additional
machines to the distributed system. Four major categories of NoSQL tools and
a few examples are provided next .
Key/value stores contain data (the value) that can be simply
accessed by a given identifier (the key). As described in the MapReduce
discussion, the values can be complex. In a key/value store, there is no stored
structure of how to use the data; the client that reads and writes to a
key/value store needs to maintain and utilize the logic of how to meaningfully
extract the useful elements from the key and the value. Here are some uses
for key/value stores:
• Using a customer’s login ID as the key, the value contains the customer’s
preferences
• Using a web session ID as the key, the value contains everything that was
captured during the session

Column family stores are useful for sparse datasets: records with
thousands of possible columns of which only a few have entries. The key/value
concept still applies, but in this case a key is associated with a collection of
columns. In this collection, related columns are grouped into column families.
For example, columns for age, gender, income, and education may be
grouped into a demographic family. Column family data stores are useful in
the following instances:
• To store and render blog entries, tags, and viewers’ feedback
• To store and update various web page metrics and counters
Graph databases are intended for use cases such as networks, where there
are items (people or web page links) and relationships between these items.
While it is possible to store graphs such as trees in a relational database, it
often becomes cumbersome to navigate, scale, and add new relationships.
Graph databases help to overcome these possible obstacles and can be
optimized to quickly traverse a graph (move from one item in the network to
another item in the network). Following are examples of graph database
implementations:
• Social networks such as Facebook and LinkedIn.
• Geospatial applications such as delivery and traffic systems to optimize
the time to reach one or more destinations.
The table below provides a few examples of NoSQL data stores. As is often the case,
the choice of a specific data store should be made based on the functional
and performance requirements. A particular data store may provide
exceptional functionality in one aspect, but that functionality may come at a
loss of other functionality or performance.

Category | Data Store | Website
Key/Value | Redis | redis.io
Key/Value | Voldemort | www.project-voldemort.com/voldemort
Document | CouchDB | couchdb.apache.org
Document | MongoDB | www.mongodb.org
Column family | Cassandra | cassandra.apache.org
Column family | HBase | hbase.apache.org
Graph | FlockDB | github.com/twitter/flockdb
Graph | Neo4j | www.neo4j.org
CAP THEOREM:
It is very important to understand the limitations of NoSQL
databases. A distributed NoSQL database cannot provide consistency and high
availability together while also tolerating network partitions. This was first
expressed by Eric Brewer in the CAP theorem.
The CAP theorem, or Eric Brewer's theorem, states that we can
achieve at most two out of three guarantees for a database:
1.Consistency
2.Availability and
3.Partition Tolerance.
Here, Consistency means that all nodes in the network see the same data
at the same time.
Availability :
It is a guarantee that every request receives a response about
whether it was successful or failed. However, it does not guarantee that a
read request returns the most recent write. The more users a
system can cater to, the better its availability.
Partition Tolerance:
It is a guarantee that the system continues to operate despite
arbitrary message loss or failure of part of the system. In other words, even
if there is a network outage in the data center and some of the computers
are unreachable, still the system continues to perform.
Out of these three guarantees, no system can provide more than
two. Since, in the case of distributed systems, network partitions must be
tolerated, the tradeoff is always between consistency and
availability.
As depicted in the Venn diagram, an RDBMS provides consistency
and availability but not partition tolerance, while HBase and Redis provide
consistency and partition tolerance. MongoDB, CouchDB, Cassandra and
Dynamo guarantee availability and partition tolerance but not strict consistency.
Such databases generally settle for eventual consistency, meaning that after a
while the system converges to a consistent state.
Let us take a look at various scenarios or architectures of systems to
better understand the CAP theorem.
• The first one is an RDBMS, where reading and writing of data happen
on the same machine. Such systems are consistent but not partition
tolerant, because if this machine goes down there is no backup. Also, if one
user is modifying a record, others would have to wait, thus compromising
high availability.
• The second diagram is of a system which has two machines. Only
one machine can accept modifications while the reads can be done from all
machines. In such systems, the modifications flow from that one machine
to the rest. Such systems are highly available as there are multiple
machines to serve. Also, such systems are partition tolerant because if one
machine goes down, there are other machines available to take up that
responsibility. Since it takes time for the data to reach the other machines from
node A, the other machines would be serving older data. This causes
inconsistency, though the data is eventually going to reach all machines and,
after a while, things are going to be okay. Therefore we call such systems
eventually consistent instead of strongly consistent. This kind of
architecture is found in Zookeeper and MongoDB.
• In the third design of any storage system, we have one machine similar
to our first diagram along with its backup. Every new change or
modification at A in the diagram is propagated to the backup machine B.
There is only one machine which is interacting with the readers and
writers. So, It is consistent but not highly available.
If A goes down, B can take A's place. Therefore this system is partition
tolerant.
Examples of such systems are HDFS with its secondary Namenode and
relational databases that keep regular backups.
4.7 MongoDB:
MongoDB is a cross-platform, document-oriented database that
provides high performance, high availability, and easy scalability. MongoDB
works on the concepts of collections and documents.
Database
Database is a physical container for collections. Each database
gets its own set of files on the file system. A single MongoDB server typically
has multiple databases.
Collection
Collection is a group of MongoDB documents. It is the equivalent
of an RDBMS table. A collection exists within a single database. Collections do
not enforce a schema. Documents within a collection can have different fields.
Typically, all documents in a collection are of similar or related purpose
Document
A document is a set of key-value pairs. Documents have dynamic
schema. Dynamic schema means that documents in the same collection do
not need to have the same set of fields or structure, and common fields in a
collection's documents may hold different types of data.
RDBMS Vs MongoDB:

RDBMS | MongoDB
Database | Database
Table | Collection
Tuple/Row | Document
Column | Field
Table Join | Embedded Documents
Primary Key | Primary Key (default key _id provided by MongoDB itself)
Database server: mysqld / Oracle | mongod
Database client: mysql / sqlplus | mongo

4.8 MongoDB Database Model:

MongoDB provides two types of data models — the embedded data
model and the normalized data model.
Based on the requirement, you can use either of the models while
preparing your document.
1.Embedded Data Model
In this model, you can have (embed) all the related data in a
single document; it is also known as the de-normalized data model.
For example, assume we are getting the details of employees in three different
documents, namely Personal_details, Contact and Address. You can embed all
three documents in a single one as shown below.
{
   _id: ,
   Emp_ID: "10025AE336",
   Personal_details: {
      First_Name: "Radhika",
      Last_Name: "Sharma",
      Date_Of_Birth: "1995-09-26"
   },
   Contact: {
      e-mail: "[email protected]",
      phone: "9848022338"
   },
   Address: {
      city: "Hyderabad",
      Area: "Madapur"
   }
}
Normalized Data Model
In this model, you refer to the sub-documents from the original
document using references. For example, you can re-write the above
document in the normalized model as:
Employee:
{
   _id: <ObjectId101>,
   Emp_ID: "10025AE336"
}
Personal_details:
{
   _id: <ObjectId102>,
   empDocID: "ObjectId101",
   First_Name: "Radhika",
   Last_Name: "Sharma",
   Date_Of_Birth: "1995-09-26"
}
Contact:
{
   _id: <ObjectId103>,
   empDocID: "ObjectId101",
   e-mail: "[email protected]",
   phone: "9848022338"
}
Address:
{
   _id: <ObjectId104>,
   empDocID: "ObjectId101",
   city: "Hyderabad",
   Area: "Madapur",
   State: "Telangana"
}

Considerations while designing a schema in MongoDB

• Design your schema according to user requirements.
• Combine objects into one document if you will use them together.
  Otherwise separate them (but make sure there is no need for joins).
• Duplicate the data (but within limits), because disk space is cheap compared
  to compute time.
• Do joins while writing, not while reading.
• Optimize your schema for the most frequent use cases.
• Do complex aggregation in the schema.
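As an illustrative sketch of the embedded model above, the snippet below inserts one such employee document using the MongoDB Java driver; the connection string, the database name hr and the collection name employees are assumptions made only for this example.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class EmbeddedEmployee {
    public static void main(String[] args) {
        // Connection string, database and collection names are illustrative assumptions.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> employees =
                    client.getDatabase("hr").getCollection("employees");

            // One document embedding personal details, contact and address (de-normalized model).
            Document employee = new Document("Emp_ID", "10025AE336")
                    .append("Personal_details", new Document("First_Name", "Radhika")
                            .append("Last_Name", "Sharma")
                            .append("Date_Of_Birth", "1995-09-26"))
                    .append("Contact", new Document("e-mail", "[email protected]")
                            .append("phone", "9848022338"))
                    .append("Address", new Document("city", "Hyderabad")
                            .append("Area", "Madapur"));

            employees.insertOne(employee);   // MongoDB adds the _id field automatically
        }
    }
}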
MongoDB Data Types and Sharding:
Data Types:
MongoDB supports many datatypes. Some of them are −
String − This is the most commonly used datatype to store the data. String
in MongoDB must be UTF-8 valid.
Integer − This type is used to store a numerical value. Integer can be 32
bit or 64 bit depending upon your server.
Boolean − This type is used to store a boolean (true/ false) value.
Double − This type is used to store floating point values.
Min/ Max keys − This type is used to compare a value against the lowest
and highest BSON elements.
Arrays − This type is used to store arrays or list or multiple values into one
key.
Timestamp − This type is used to store a timestamp. It can be handy for recording
when a document has been modified or added.
Object − This datatype is used for embedded documents.
Null − This type is used to store a Null value.
Symbol − This datatype is used identically to a string; however, it's generally
reserved for languages that use a specific symbol type.
Date − This datatype is used to store the current date or time in UNIX time
format. You can specify your own date time by creating object of Date and
passing day, month, year into it.
Object ID − This datatype is used to store the document’s ID.
Binary data − This datatype is used to store binary data.
Code − This datatype is used to store JavaScript code into the document.
Regular expression − This datatype is used to store regular expression.
MongoDB sharding:
Sharding is the process of storing data records across multiple
machines and it is MongoDB's approach to meeting the demands of data
growth. As the size of the data increases, a single machine may not be
sufficient to store the data nor provide an acceptable read and write
throughput. Sharding solves the problem with horizontal scaling. With
sharding, you add more machines to support data growth and the demands of
read and write operations.
Why Sharding?
• In replication, all writes go to master node
• Latency sensitive queries still go to master
• Single replica set has limitation of 12 nodes
• Memory can't be large enough when active dataset is big
• Local disk is not big enough
• Vertical scaling is too expensive.

Sharding in MongoDB:
The following diagram shows sharding in MongoDB using a sharded cluster.

In the following diagram, there are three main components −


Shards − Shards are used to store data. They provide high availability and
data consistency. In production environment, each shard is a separate replica
set.
Config Servers − Config servers store the cluster's metadata. This data
contains a mapping of the cluster's data set to the shards. The query router
uses this metadata to target operations to specific shards. In production
environment, sharded clusters have exactly 3 config servers.
Query Routers − Query routers are basically mongos instances that interface
with client applications and direct operations to the appropriate shard. The
query router processes and targets the operations to shards and then
returns results to the clients. A sharded cluster can contain more than one
query router to divide the client request load; a client sends requests to one
query router. Generally, a sharded cluster has many query routers.
4.9 INTRODUCTION TO HIVE:
What is Hive
Hive is a data warehouse infrastructure tool to process
structured data in Hadoop. It resides on top of Hadoop to summarize Big
Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache
Software Foundation took it up and developed it further as an open source
under the name Apache Hive. It is used by different companies. For
example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive
• It stores schema in a database and processed data into HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Hive Architecture:

This component diagram contains different units. The following table


describes each unit:
Unit Name | Operation
User Interface | Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HDInsight (on Windows Server).
Meta Store | Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine | HiveQL is similar to SQL for querying on the schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine | The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE | Hadoop Distributed File System or HBASE are the data storage techniques used to store data into the file system.
Working of Hive:
The following diagram depicts the workflow between Hive and Hadoop. The following steps define how Hive interacts with the Hadoop framework:
1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7(i). Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.
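To see the plan that the compiler produces for a query (steps 2 to 5 above), Hive offers the EXPLAIN command. A small illustrative example (not part of the original notes), assuming the employee table used later in this section:
hive> EXPLAIN SELECT * FROM employee WHERE salary > 30000;
The output lists the stages, including any MapReduce stage, that the execution engine will run for the query.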

4.10 Hive Query Language (HQL):

HiveQL (Hive Query Language), resembles Structured Query


Language (SQL) rather than a scripting language.
A Hive table structure consists of rows and columns. The rows
typically correspond to some record, transaction, or particular entity (for
example, customer) detail. The values of the corresponding columns
represent the various attributes or characteristics for each row. Hadoop and
its ecosystem are used to apply some structure to unstructured data.
Therefore, if a table structure is an appropriate way to view the restructured
data, Hive may be a good tool to use.
Additionally, a user may consider using Hive if the user has
experience with SQL and the data is already in HDFS. Another consideration
in using Hive may be how data will be updated or added to the Hive tables.
If data will simply be added to a table periodically, Hive works well, but if
there is a need to update data in place, it may be beneficial to consider
another tool, such as HBase.
Although Hive’s performance may be better in certain applications
than a conventional SQL database, Hive is not intended for real-time
querying. A Hive query is first translated into a MapReduce job, which is then
submitted to the Hadoop cluster. Thus, the execution of the query has to
compete for resources with any other submitted job. Like Pig, Hive is
intended for batch processing. Again, HBase may be a better choice for real-
time query needs.
To summarize the preceding discussion, consider using Hive when the following conditions exist:
• Data easily fits into a table structure.
• Data is already in HDFS. (Note: Non-HDFS files can be loaded into a Hive
table.)
• There is a desire to partition datasets based on time. (For example, daily
updates are added to the Hive table.)
• Batch processing is acceptable.

The remainder of the Hive discussion covers some HiveQL basics.


From the command prompt, a user enters the interactive Hive environment by
simply entering hive:
$ hive
hive>
From this environment, a user can define new tables, query them, or summarize their contents.
Hive Commands:
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions.
Hive DDL Commands
create database, drop database, create table, drop table,
alter table, create index, create view.
Hive DML Commands
Select, Where, Group By, Order By, Load Data, Join:
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
Hive DDL Commands:
Create Database Statement
A database in Hive is a namespace or a collection of tables.
hive> CREATE SCHEMA userdb;
hive> SHOW DATABASES;
Drop Database Statement
hive> DROP DATABASE IF EXISTS userdb;
Creating Hive Tables
Create a table called Sonoo with two columns, the first being an
integer and the other a string.
hive> CREATE TABLE Sonoo(foo INT, bar STRING);
Create a table called HIVE_TABLE with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (Ctrl-A).
hive> CREATE TABLE HIVE_TABLE (foo INT, bar STRING) PARTITIONED BY (ds STRING);
Browse the table
hive> Show tables;
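Because the default ^A delimiter is rarely convenient for plain text files, the field delimiter can also be declared explicitly when creating a table. The following is an illustrative sketch (not part of the original notes) using a hypothetical employee_csv table for comma-separated data:
hive> CREATE TABLE employee_csv (id INT, name STRING, salary INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;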
Altering and Dropping Tables
hive> ALTER TABLE Sonoo RENAME TO Kafka;
hive> ALTER TABLE Kafka ADD COLUMNS (col INT);
hive> ALTER TABLE HIVE_TABLE ADD COLUMNS (col1 INT COMMENT 'a
comment');
hive> ALTER TABLE HIVE_TABLE REPLACE COLUMNS (col2 INT, weight
STRING, baz INT COMMENT 'baz replaces new_col1');
Hive DML Commands:
To understand the Hive DML commands, let's see the employee and employee_department tables first.
LOAD DATA
hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO TABLE Employee;
SELECTS and FILTERS
hive> SELECT E.EMP_ID FROM Employee E WHERE E.Address='US';
GROUP BY
hive> SELECT E.Address, COUNT(*) FROM Employee E GROUP BY E.Address;
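The join types listed under the Hive DML commands are not demonstrated above. The following is a minimal sketch (not part of the original notes), assuming a hypothetical employee_department table with columns EMP_ID and DEPT_NAME:
Inner Join
hive> SELECT e.EMP_ID, d.DEPT_NAME FROM Employee e
JOIN employee_department d ON (e.EMP_ID = d.EMP_ID);
Left Outer Join
hive> SELECT e.EMP_ID, d.DEPT_NAME FROM Employee e
LEFT OUTER JOIN employee_department d ON (e.EMP_ID = d.EMP_ID);
A left outer join returns every employee row even when there is no matching department row, with NULL in the department columns.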
Adding a Partition
We can add partitions to a table by altering the table. Let us
assume we have a table called employee with fields such as Id, Name,
Salary, Designation, Dept, and yoj.
Syntax:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
(p_column = p_col_value, p_column = p_col_value, ...)
The following query is used to add a partition to the employee table.
hive> ALTER TABLE employee
ADD PARTITION (year='2012')
LOCATION '/2012/part2012';
Renaming a Partition
The syntax of this command is as follows
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION
partition_spec;
The following query is used to rename a partition:
hive> ALTER TABLE employee PARTITION (year='1203')
RENAME TO PARTITION (Yoj='1203');
Dropping a Partition
The following syntax is used to drop a partition
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec,
PARTITION partition_spec,...;
The following query is used to drop a partition:
hive> ALTER TABLE employee DROP [IF EXISTS]
PARTITION (year='1203');
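To verify which partitions currently exist on the table after these ALTER TABLE operations, Hive provides SHOW PARTITIONS (an illustrative addition, not part of the original notes):
hive> SHOW PARTITIONS employee;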
SELECT Statement with WHERE Clause
The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression which fulfils the condition.
Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING
having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] [LIMIT
number];
Example
Let us take an example for SELECT…WHERE clause. Assume we have
the employee table as given below, with fields named Id, Name, Salary,
Designation, and Dept. Generate a query to retrieve the details of employees who earn a salary of more than Rs 30000.
+------+-------------+--------+--------------------+-------+
| ID   | Name        | Salary | Designation        | Dept  |
+------+-------------+--------+--------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager  | TP    |
| 1202 | Manisha     | 45000  | Proofreader        | PR    |
| 1203 | Masthanvali | 40000  | Technical writer   | TP    |
| 1204 | Krian       | 40000  | Hr Admin           | HR    |
| 1205 | Kranthi     | 30000  | Op Admin           | Admin |
+------+-------------+--------+--------------------+-------+
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+-------------+--------+--------------------+-------+
| ID   | Name        | Salary | Designation        | Dept  |
+------+-------------+--------+--------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager  | TP    |
| 1202 | Manisha     | 45000  | Proofreader        | PR    |
| 1203 | Masthanvali | 40000  | Technical writer   | TP    |
| 1204 | Krian       | 40000  | Hr Admin           | HR    |
+------+-------------+--------+--------------------+-------+
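The SELECT syntax shown earlier also allows GROUP BY, ORDER BY and LIMIT clauses. A short illustrative sketch on the same employee table (not part of the original notes):
hive> SELECT Dept, COUNT(*) FROM employee GROUP BY Dept;
hive> SELECT * FROM employee ORDER BY Salary DESC LIMIT 3;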
Part-A Questions and Answers
1. What is Hadoop and what are its components? (CO2, K2)
When “Big Data” emerged as a problem, Apache Hadoop
evolved as a solution to it. Apache Hadoop is a framework which provides
us various services or tools to store and process Big Data. It helps in
analyzing Big Data and making business decisions out of it, which can’t be
done efficiently and effectively using traditional systems.
Storage unit – HDFS (NameNode, DataNode)
Processing framework – YARN (ResourceManager, NodeManager)
2. What is HDFS? (CO3,K2)
HDFS (Hadoop Distributed File System) is the storage unit of
Hadoop. It is responsible for storing different kinds of data as blocks in a
distributed environment. It follows master and slave topology.
NameNode: NameNode is the master node in the distributed environment
and it maintains the metadata information for the blocks of data stored in
HDFS like block location, replication factors etc.
DataNode: DataNodes are the slave nodes, which are responsible for
storing data in the HDFS. NameNode manages all the DataNodes.
3. What is YARN? (CO3, K2)
YARN (Yet Another Resource Negotiator) is the processing
framework in Hadoop, which manages resources and provides an execution
environment to the processes.
ResourceManager: It receives the processing requests, and then passes the
parts of requests to corresponding NodeManagers accordingly, where the
actual processing takes place. It allocates resources to applications based on
the needs.
NodeManager: NodeManager is installed on every DataNode and it is
responsible for the execution of the task on every single DataNode.
4. What is a checkpoint? (CO2,K2)
“Checkpointing” is a process that takes an FsImage and an edit log and compacts them into a new FsImage. Thus, instead of replaying an edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode start-up time. Checkpointing is performed by the Secondary NameNode.
5. How is HDFS fault tolerant? (CO3,K2)
When data is stored over HDFS, NameNode replicates the data to
several DataNode. The default replication factor is 3. You can change the
configuration factor as per your need. If a DataNode goes down, the
NameNode will automatically copy the data to another node from the replicas
and make the data available. This provides fault tolerance in HDFS.
6. Can NameNode and DataNode be commodity hardware? (CO3, K2)
DataNodes can be commodity hardware like personal computers and laptops, as they store data and are required in large numbers. The NameNode is the master node and it stores metadata about all the blocks stored in HDFS. It requires high memory (RAM) space, so the NameNode needs to be a high-end machine with good memory space.
7. What does a “MapReduce Partitioner” do? (CO4,K2)
A “MapReduce Partitioner” makes sure that all the values of a
single key go to the same “reducer”, thus allowing even distribution of the
map output over the “reducers”. It redirects the “mapper” output to the
“reducer” by determining which “reducer” is responsible for the particular
key.
8. What is the default location where Hive stores table data? (CO4, K2)
The default location where Hive stores table data is inside HDFS in /user/hive/warehouse.
9. What is MongoDB ? (CO4,K2)
• MongoDB is an open-source NoSQL database written in C++ language. It
uses JSON-like documents with optional schemas.
• It provides easy scalability and is a cross-platform, document-oriented
database.
• MongoDB works on the concept of Collection and Document.
• It combines the ability to scale out with features such as secondary
indexes, range queries, sorting, aggregations, and geospatial indexes.
• MongoDB is developed by MongoDB Inc. and licensed under the Server Side Public License (SSPL).
10. What are Databases in MongoDB? (CO4,K2)
MongoDB groups collections into databases. MongoDB can host
several databases, each grouping together collections.
Some reserved database names are: admin, local, config.
Part-B Questions
1. What is Hadoop? What are the key components of Hadoop architecture? Explain the Hadoop framework with a neat diagram. (CO4, K2)
2. Differentiate between traditional distributed system processing and Hadoop distributed system processing. (CO4, K3)
3. What is HDFS? Explain the HDFS architecture with a neat diagram. (CO4, K2)
4. What is YARN? Explain how resource management is carried out in the Hadoop framework using YARN. (CO4, K4)
5. Explain the MapReduce programming model with a neat diagram. (CO4, K4)
6. Explain in detail the features of MapReduce. (CO4, K2)
7. What is Hive? What are the key components of Hive architecture? (CO4, K2)
8. Explain the process of sharding in MongoDB. (CO4, K2)
9. Differentiate between traditional distributed management processing and MongoDB. (CO4, K2)
10. Explain in detail Hive Query Language (HQL). (CO4, K2)
Assignments
1. Given the following text data, made up of the following sentences, process it using the classical "word count" MapReduce program. Provide the output (K,V) pairs
i) at the output of the mappers,
ii) at the output of the combiners,
iii) at the input of the reducers, and
iv) at the output of the reducers.
Consider 3 mappers and 2 reducers, and the stop words {the, The, a, is, are, and}.
D1: The blue sky and bright sun are behind the grey cloud. The cloud is dark and the sun is bright.
D2: The cloud is bright and the sun is grey. The sky is bright. The sky is blue.
D3: The dark cloud is behind the sun and the blue sky. The sun is rising. The cloud is Grey.
(CO4, K4)
2. Consider an N × M matrix. Provide the output (K,V) pairs at the output of the mappers, the input of the reducers, and the output of the reducers during multiplication of two matrices, with 1 mapper task and 2 reducer tasks. Also, write the map function and the reduce function. (CO4, K4)
Supportive Online Courses
Sl. No.  Course                   Platform
1        MongoDB                  Udemy
2        Hive Query Language      SimpliLearn
REAL TIME APPLICATIONS IN DAY TO DAY LIFE
Apache Hadoop Applications:
1. Finance sectors:
Financial organizations use Hadoop for fraud detection and prevention. They use Apache Hadoop for reducing risk, identifying rogue traders, and analyzing fraud patterns. Hadoop helps them to precisely target their marketing campaigns on the basis of customer segmentation.
Hadoop helps financial agencies to improve customer satisfaction.
Credit card companies also use Apache Hadoop for finding out the exact
customer for their product.
2. Security and Law Enforcement:
The USA national security agency uses Hadoop in order to prevent
terrorist attacks and to detect and prevent cyber-attacks. Big Data tools are used
by the Police forces for catching criminals and even predicting criminal activity.
Hadoop is used by different public sector fields such as defense, intelligence,
research, cyber security, etc.
3. Companies use Hadoop for understanding customers' requirements:
The most important application of Hadoop is understanding customers' requirements.
Different companies, such as finance and telecom companies, use Hadoop to find out customers' requirements by examining large amounts of data and discovering useful information from these vast amounts of data. By understanding customers' behavior, organizations can improve their sales.
4. Hadoop Applications in Retail industry:
Retailers, both online and offline, use Hadoop for improving their sales. Many e-commerce companies use Hadoop for keeping track of the products bought together by customers.
5. Real-time analysis of customer data:
Hadoop can analyze customer data in real time. It can track clickstream data, as it is suited to storing and processing high volumes of clickstream data. When a visitor visits a website, Hadoop can capture information such as where the visitor originated from before reaching that website and the search term used for landing on it.
6. Uses of Hadoop in Government sectors:
The government uses Hadoop for the development of the country, states, and cities by analyzing vast amounts of data.
For example, they use Hadoop for managing traffic in the streets, for
the development of smart cities, or for improving transportation in the city.
7. Hadoop Uses in Advertisement Targeting Platforms:
Advertisement targeting platforms use Hadoop for capturing and analyzing clickstream, video, transaction, and social media data. They analyze the data generated by various social media websites such as Facebook, Twitter, and Instagram, and then target their interested audience.
They keep on posting advertisements on various social media sites to the target audience. Hadoop also finds use in managing content, images, posts, and videos on social media platforms.
8. Businesses use Hadoop for sentiment analysis:
Hadoop can capture data and perform analysis of the sentiment data.
Sentiment data are the unstructured bits of data, such as attitudes, opinions, and
emotions which you mostly see on social media platforms, blogs, customer
support interactions, online product reviews, etc.
Text & Reference Books
1. EMC Education Services, "Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data", Wiley Publishers, 2015. (Text Book)
https://bhavanakhivsara.files.wordpress.com/2018/06/data-science-and-big-data-analy-nieizv_book.pdf
2. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012. (Text Book)
http://infolab.stanford.edu/~ullman/mmds/bookL.pdf
3. "An Introduction to Statistical Learning: with Applications in R" (Springer Texts in Statistics), Hardcover. (Text Book)
4. Dietmar Jannach and Markus Zanker, "Recommender Systems: An Introduction", Cambridge University Press, 2010. (Reference Book)
https://drive.google.com/file/d/1Wr4fllOj03X72rL8CHgVJ1dGxG58N63S/view?usp=sharing
5. Kim H. Pries and Robert Dunnigan, "Big Data Analytics: A Practical Guide for Managers", CRC Press, 2015. (Reference Book)
6. Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with MapReduce", Synthesis Lectures on Human Language Technologies, Vol. 3, No. 1, Pages 1-177, Morgan & Claypool Publishers, 2010. (Reference Book)
Mini Project Suggestions
1. Implement MapReduce to illustrate the occurrence of each word in a text file called example.txt whose contents are as follows: Dear, Bear, River, Car, Car, River, Deer, Car and Bear. Divide the data into 3 splits, distribute the work on map nodes, and perform the sorting, shuffling and reduce operations. (CO4, K4)
Thank you