These slides follow the Solapur University syllabus for Big Data Analytics; the reference book is Big Data Analytics by Seema Acharya and Subhashini Chellappan.

Introduction to Hadoop

Unit-IV
Prof. P. R. Gadekar
INTRODUCING HADOOP
Data: The Treasure Trove
1. Provides business advantages such as generating product recommendations, inventing new products, analyzing the market, and many more.
2. Provides a few early key indicators that can turn the fortunes of a business.
3. Provides room for precise analysis: the more data we have for analysis, the greater the precision of the analysis.
Why Hadoop?
• Hadoop's key strength is its capability to handle massive amounts of data, and different categories of data, fairly quickly.
1. Low cost: Hadoop is an open-source framework and uses commodity hardware (relatively inexpensive, easily obtainable hardware) to store enormous quantities of data.
2. Computing power: Hadoop is based on a distributed computing model that processes very large volumes of data fairly quickly. The more computing nodes there are, the more processing power is at hand.
3. Scalability: This boils down to simply adding nodes as the system grows, and it requires very little administration.
4. Storage flexibility: Unlike traditional relational databases, data in Hadoop need not be pre-processed before being stored. Hadoop provides the convenience of storing as much data as one needs, along with the added flexibility of deciding later how to use the stored data. In Hadoop, one can store unstructured data such as images, videos, and free-form text.
5. Inherent data protection: Hadoop protects data and executing applications against hardware failure. If a node fails, it automatically redirects the jobs assigned to that node to other functional and available nodes, ensuring that the distributed computation does not fail. It goes a step further and stores multiple copies (replicas) of the data on various nodes across the cluster.
WHY NOT RDBMS?
• RDBMS is not suitable for storing and processing large files, images, and videos.
• RDBMS is not a good choice when it comes to advanced analytics involving machine learning.
• It calls for huge investment as the volume of data shows an upward trend.
RDBMS versus HADOOP
DISTRIBUTED COMPUTING CHALLENGES
Hardware Failure
• In a distributed system, several servers are networked together. This implies that, more often than not, there is a possibility of hardware failure.
• When such a failure does happen, how does one retrieve the data that was stored in the system?
• To explain further: a regular hard disk may fail once in 3 years. When you have 1,000 such hard disks, there is a possibility of at least a few being down every day.
• Hadoop answers this problem with the Replication Factor (RF). The Replication Factor denotes the number of copies of a given data item/data block stored across the network. Refer Figure 5.5.
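As a hedged illustration (not from the book): the replication factor is normally set cluster-wide through the dfs.replication property, but it can also be changed per file through the HDFS Java API. A minimal sketch, assuming a file already exists at the hypothetical path /data/sample.txt:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Ask HDFS to keep 3 replicas of this file's blocks
            // (the path is a hypothetical example).
            fs.setReplication(new Path("/data/sample.txt"), (short) 3);

            fs.close();
        }
    }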
How to Process This Gigantic Store of Data?
• In a distributed system, the data is spread across the network on several machines.
• A key challenge here is to integrate the data available on several machines prior to processing it.
• Hadoop solves this problem by using MapReduce programming: a programming model to process the data (MapReduce programming is discussed a little later).
5.6 HISTORY OF HADOOP
• Hadoop was created by Doug Cutting, the creator of Apache Lucene (a widely used text search library).
• Hadoop originated as part of the Apache Nutch project (an open-source web search engine), which is itself a part of the Lucene project. Refer Figure 5.6 for more details.
5.6.1 The Name “Hadoop”
• The name Hadoop is not an acronym; it is a made-up name.
• The project creator, Doug Cutting, explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”
HADOOP OVERVIEW
• An open-source software framework to store and process massive amounts of data in a distributed fashion on large clusters of commodity hardware.
• Basically, Hadoop accomplishes two tasks:
  1. Massive data storage.
  2. Faster data processing.
Key aspects of Hadoop
Hadoop Components
Hadoop Conceptual Layer
• Conceptually, Hadoop is divided into:
• a Data Storage Layer, which stores huge volumes of data, and
• a Data Processing Layer, which processes data in parallel to extract richer and more meaningful insights from the data (Figure 5.9).
High-Level Architecture of Hadoop
• Hadoop follows a distributed Master–Slave architecture.
• The master node is known as the NameNode and the slave nodes are known as DataNodes.
• Figure 5.10 depicts the Master–Slave architecture of the Hadoop framework.

Key components of the Master Node:
1. Master HDFS: Its main responsibility is partitioning the data storage across the slave nodes. It also keeps track of the locations of data on the DataNodes.
2. Master MapReduce: It decides and schedules computation tasks on the slave nodes.
Hadoop Distributors
Hadoop Distributed File System
Some key points of the Hadoop Distributed File System (HDFS) are as follows:
1. Storage component of Hadoop.
2. Distributed file system.
3. Modeled after the Google File System.
4. Optimized for high throughput (HDFS uses large block sizes and moves computation to where the data is stored).
5. A file can be replicated a configured number of times, which makes HDFS tolerant of both software and hardware failures.
6. Re-replicates data blocks automatically when nodes fail.
7. The power of HDFS is realized when you perform reads or writes on large files (gigabytes and larger).
8. Sits on top of a native file system such as ext3 or ext4, as described in Figure 5.13.
Distributed File System Architecture
HDFS Daemons
1. NameNode
2. DataNode
3. Secondary NameNode
NameNode
• HDFS breaks a large file into smaller pieces called blocks.
• The NameNode uses a rack ID to identify the DataNodes in a rack. A rack is a collection of DataNodes within the cluster.
• The NameNode keeps track of the blocks of a file as they are placed on various DataNodes.
• The NameNode manages file-related operations such as read, write, create, and delete. Its main job is managing the File System Namespace. A file system namespace is the collection of files in the cluster.
• The NameNode stores the HDFS namespace. The file system namespace includes the mapping of blocks to files and the file properties, and is stored in a file called FsImage. The NameNode uses an EditLog (transaction log) to record every transaction that happens to the file system metadata.
• Refer Figure 5.16. When the NameNode starts up, it reads the FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage.
• It then flushes a new version of the FsImage to disk and truncates the old EditLog, because its changes have been applied to the FsImage. There is a single NameNode per cluster.
DataNode
• There are multiple DataNodes per cluster. During pipeline reads and writes, DataNodes communicate with each other.
• A DataNode also continuously sends "heartbeat" messages to the NameNode to ensure connectivity between the NameNode and the DataNode.
• If there is no heartbeat from a DataNode, the NameNode re-replicates that DataNode's blocks on other nodes in the cluster and keeps on running as if nothing had happened.
• Let us explain the concept behind the heartbeat reports sent by the DataNodes to the NameNode.
Secondary NameNode
• The Secondary NameNode takes a snapshot of the HDFS metadata at intervals specified in the Hadoop configuration.
• Since the memory requirements of the Secondary NameNode are the same as those of the NameNode, it is better to run the NameNode and the Secondary NameNode on different machines.
• In case of failure of the NameNode, the Secondary NameNode can be configured manually to bring up the cluster.
• However, the Secondary NameNode does not record any real-time changes that happen to the HDFS metadata.
Anatomy of File Read
Anatomy of File Write
Replica Placement Strategy
Hadoop Default Replica Placement Strategy
• As per the Hadoop replica placement strategy, the first replica is placed on the same node as the client.
• The second replica is placed on a node on a different rack.
• The third replica is placed on the same rack as the second, but on a different node in that rack.
• Once the replica locations have been decided, a pipeline is built. This strategy provides good reliability.
Working with HDFS Commands
• Objective: To get the list of directories and files at the root of HDFS.
  Act: hadoop fs -ls /
• Objective: To get the complete list of directories and files of HDFS (recursive listing).
  Act: hadoop fs -ls -R /
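The same listing can also be done programmatically through the HDFS Java API. The following is a minimal sketch (not from the book), assuming a configured Hadoop client is on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsRoot {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS etc. from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Equivalent of "hadoop fs -ls /": list the entries at the HDFS root.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println((status.isDirectory() ? "d " : "- ") + status.getPath());
            }
            fs.close();
        }
    }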


Special Features of HDFS
1. Data Replication
2. Data Pipeline
Processing Data with Hadoop
• MapReduce programming is a software framework. It helps you process massive amounts of data in parallel.
• The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
• The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
Special Features of HDFS
1. Data Replication: There is absolutely no need for a client application to track all blocks. HDFS redirects the client to the nearest replica to ensure high performance.
2. Data Pipeline: A client application writes a block to the first DataNode in the pipeline. That DataNode then takes over and forwards the data to the next node in the pipeline.
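From the client's point of view the pipeline is transparent: the application simply writes to an output stream and HDFS handles block placement and replication behind the scenes. A minimal sketch (assumed example; the path is hypothetical):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The client just writes bytes; HDFS splits them into blocks and
            // pushes each block through the DataNode pipeline for replication.
            try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }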
Processing Data with Hadoop
• MapReduce programming is a software framework that helps you process massive amounts of data in parallel. In MapReduce programming, the input dataset is split into independent chunks.
• Map tasks process these independent chunks completely in parallel. The output produced by the map tasks serves as intermediate data and is stored on the local disk of the server where each task runs. The outputs of the mappers are automatically shuffled and sorted by the framework; the MapReduce framework sorts this output by key.
• This sorted output becomes the input to the reduce tasks. A reduce task produces the reduced output by combining the output of the various mappers. Job inputs and outputs are stored in a file system. The MapReduce framework also takes care of other tasks such as scheduling, monitoring, and re-executing failed tasks.
• The Hadoop Distributed File System and the MapReduce framework run on the same set of nodes. This configuration allows effective scheduling of tasks on the nodes where the data is present (data locality), which in turn results in very high throughput.
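To make the map and reduce phases concrete, the classic word-count example is sketched below using the org.apache.hadoop.mapreduce API. This is an illustrative sketch, not an example from the book: the mapper emits (word, 1) pairs, and after the shuffle/sort the reducer sums the counts for each key.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: break each input line into words and emit (word, 1).
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) after shuffle/sort and sums the counts.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }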
• There are two daemons associated with MapReduce programming: a single master JobTracker per cluster and one slave TaskTracker per cluster node.
• The JobTracker is responsible for scheduling tasks on the TaskTrackers, monitoring the tasks, and re-executing a task in case a TaskTracker fails. The TaskTracker executes the tasks. Refer Figure 5.21.
• The MapReduce functions and input/output locations are implemented via MapReduce applications. These applications use suitable interfaces to construct the job.
• The application and the job parameters together are known as the job configuration. The Hadoop job client submits the job (jar/executable, etc.) to the JobTracker.
• It is then the responsibility of the JobTracker to schedule tasks on the slaves. In addition to scheduling, it also monitors the tasks and provides status information to the job client.
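On the application side, the job configuration is typically assembled with the Job class and then submitted to the cluster. A minimal driver sketch for the word-count classes sketched earlier (assumed example; input and output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submits the job to the cluster and waits for it to finish.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }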
MapReduce Daemons
1. JobTracker
2. TaskTracker

1. JobTracker:
• It provides connectivity between Hadoop and your application.
• When you submit code to the cluster, the JobTracker creates the execution plan by deciding which task to assign to which node.
• It also monitors all running tasks. When a task fails, it automatically reschedules the task on a different node after a predefined number of retries.
• The JobTracker is the master daemon responsible for executing the overall MapReduce job. There is a single JobTracker per Hadoop cluster.
2. TaskTracker:
• This daemon is responsible for executing the individual tasks assigned to it by the JobTracker.
• There is a single TaskTracker per slave node; it spawns multiple Java Virtual Machines (JVMs) to handle multiple map or reduce tasks in parallel.
• The TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails to receive a heartbeat from a TaskTracker, it assumes that the TaskTracker has failed and resubmits the task to another available node in the cluster.
• Once a client submits a job to the JobTracker, the JobTracker partitions the job and assigns the resulting MapReduce tasks to the TaskTrackers in the cluster. Figure 5.22 depicts the JobTracker and TaskTracker interaction.
How Does MapReduce Work?
• MapReduce divides a data analysis task into two parts: map and reduce.
• Figure 5.23 depicts how MapReduce programming works.
• In this example, there are two mappers and one reducer. Each mapper works on the partial dataset stored on its node, and the reducer combines the output from the mappers to produce the reduced result set.
The following steps describe how MapReduce performs its task (Figure 5.24 describes the working model of MapReduce programming):
1. First, the input dataset is split into multiple pieces of data (several small subsets).
2. Next, the framework creates a master process and several worker processes and executes the worker processes remotely.
3. Several map tasks work simultaneously and read the pieces of data assigned to each map task. The map worker uses the map function to extract only the data present on its server and generates key/value pairs for the extracted data.
4. The map worker uses the partitioner function to divide the data into regions. The partitioner decides which reducer should get the output of a given mapper (a minimal partitioner sketch is given after these steps).
5. When the map workers complete their work, the master instructs the reduce workers to begin theirs. The reduce workers contact the map workers to get the key/value data for their partition. The data thus received is shuffled and sorted by key.
6. The reduce function is then called for every unique key. This function writes the output to the file.
7. When all the reduce workers complete their work, the master transfers control back to the user program.
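As a hedged illustration of step 4 (not from the book), a custom partitioner in the org.apache.hadoop.mapreduce API only has to map a key to one of the reducer indices; the default HashPartitioner does essentially the following:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each key to a reducer ("region") by hashing the key.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask off the sign bit so the result is a valid partition index.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class).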
SQL Versus MapReduce
MANAGING RESOURCES AND APPLICATIONS WITH HADOOP YARN
YARN: Yet Another Resource Negotiator
• Apache Hadoop YARN is a sub-project of Hadoop 2.x; Hadoop 2.x has a YARN-based architecture.
• YARN is a general processing platform and is not constrained to MapReduce only.
• You can run multiple applications in Hadoop 2.x, with all applications sharing a common resource management layer.
• Hadoop can now be used for various types of processing such as batch, interactive, online, streaming, graph, and others.
Limitations of Hadoop 1.0 Architecture
1. A single NameNode is responsible for managing the entire namespace of the Hadoop cluster.
2. It has a restricted processing model which is suitable for batch-oriented MapReduce jobs.
3. Hadoop MapReduce is not suitable for interactive analysis.
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other memory-intensive algorithms.
5. MapReduce is responsible for both cluster resource management and data processing. In this architecture, the map slots might be "full" while the reduce slots are empty, and vice versa. This causes resource utilization issues, which need to be addressed for proper resource utilization.
HDFS Limitation
• The NameNode saves all of its file metadata in main memory.
• Although main memory today is not as small or as expensive as it was two decades ago, there is still a limit on the number of objects one can hold in memory on a single NameNode.
• The NameNode can quickly become overwhelmed as the load on the system increases. In Hadoop 2.x, this is resolved with the help of HDFS Federation.
Hadoop 2: HDFS
• HDFS 2 consists of two major components: (a) namespace and (b) block storage service.
• The namespace service takes care of file-related operations such as creating and modifying files and directories. The block storage service handles DataNode cluster management and replication.
HDFS 2 Features
1. Horizontal scalability.
2. High availability.

• HDFS Federation uses multiple independent NameNodes for horizontal scalability. The NameNodes are independent of each other; they do not need any coordination with one another.
• The DataNodes are common storage for blocks and are shared by all NameNodes; every DataNode in the cluster registers with each NameNode in the cluster.
• High availability of the NameNode is obtained with the help of a Passive Standby NameNode. In Hadoop 2.x, the Active–Passive NameNode pair handles failover automatically.
• All namespace edits are recorded to shared NFS storage, and there is a single writer at any point in time. The Passive NameNode reads the edits from the shared storage and keeps its metadata up to date.
Hadoop 2 YARN: Taking Hadoop beyond Batch
• YARN helps us store all data in one place and interact with it in multiple ways, with predictable performance and quality of service.
• This architecture was originally developed at Yahoo. Refer Figure 5.28.
• In case of Active NameNode failure, the Passive NameNode automatically becomes the Active NameNode and starts writing to the shared storage. Figure 5.26 describes the Active–Passive NameNode interaction.
Fundamental Idea
• The fundamental idea behind this architecture is to split the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons.
Daemons that are Part of the YARN Architecture
1. A global ResourceManager:
• Its main responsibility is to distribute resources among the various applications in the system.
• It has two main components:
(a) Scheduler: The pluggable scheduler of the ResourceManager decides the allocation of resources to the various running applications. The Scheduler is just that, a pure scheduler, meaning it does NOT monitor or track the status of the application.
(b) ApplicationsManager: The ApplicationsManager does the following:
  • Accepts job submissions.
  • Negotiates the resources (container) for executing the application-specific ApplicationMaster.
  • Restarts the ApplicationMaster in case of failure.
2. NodeManager:
• This is a per-machine slave daemon. The NodeManager's responsibility is launching the application containers for application execution.
• The NodeManager monitors resource usage such as memory, CPU, disk, and network, and reports this usage to the global ResourceManager.
3. Per-application ApplicationMaster:
• This is an application-specific entity. Its responsibility is to negotiate the required resources for execution from the ResourceManager.
• It works along with the NodeManager for executing and monitoring component tasks.
Basic Concepts
Application:
1. An application is a job submitted to the framework.
2. Example: a MapReduce job.

Container:
1. Basic unit of allocation.
2. Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, etc.), for example:
   (a) container_0 = 2 GB, 1 CPU
   (b) container_1 = 1 GB, 6 CPUs
3. Replaces the fixed map/reduce slots. (A small sketch of how a container's resources are expressed in the YARN API follows.)
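As a hedged illustration (not from the book), the resource shape of a container such as container_0 above is expressed in the YARN Java API as a Resource object: memory in MB plus virtual cores.

    import org.apache.hadoop.yarn.api.records.Resource;

    public class ContainerResourceExample {
        public static void main(String[] args) {
            // container_0 from the example: 2 GB of memory and 1 virtual core.
            Resource containerResource = Resource.newInstance(2048, 1);
            System.out.println(containerResource); // prints something like <memory:2048, vCores:1>
        }
    }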


YARN Architecture:
The steps involved in the YARN architecture are as follows:
1. A client program submits the application, which includes the necessary specifications to launch the application-specific ApplicationMaster itself.
2. The ResourceManager launches the ApplicationMaster by assigning it a container.
3. The ApplicationMaster, on boot-up, registers with the ResourceManager. This allows the client program to query the ResourceManager directly for details.
4. During the normal course of operation, the ApplicationMaster negotiates appropriate resource containers via the resource-request protocol.
YARN Architecture
5. On successful container allocation, the ApplicationMaster launches the container by providing the container launch specification to the NodeManager.
6. The NodeManager executes the application code and provides necessary information such as progress and status to its ApplicationMaster via an application-specific protocol.
7. During the application execution, the client that submitted the job communicates directly with the ApplicationMaster to get status, progress updates, etc. via an application-specific protocol.
8. Once the application has been processed completely, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
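Step 1 above is usually performed through the YarnClient API. The following is a heavily simplified sketch (not from the book; it omits localizing the ApplicationMaster jar and the full launch environment, and uses a placeholder launch command) of how a client asks the ResourceManager to start an ApplicationMaster:

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SimpleYarnClient {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Step 1: ask the ResourceManager for a new application.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("simple-yarn-app");

            // Specification of the ApplicationMaster container: the command to run
            // (a placeholder here) and the resources it needs.
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),                 // local resources (jars, files)
                    Collections.emptyMap(),                 // environment variables
                    Collections.singletonList("sleep 60"),  // AM launch command (placeholder)
                    null, null, null);
            appContext.setAMContainerSpec(amContainer);
            appContext.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM

            // Step 2 is triggered here: the ResourceManager allocates a container
            // and launches the ApplicationMaster in it.
            ApplicationId appId = yarnClient.submitApplication(appContext);
            System.out.println("Submitted application " + appId);

            yarnClient.stop();
        }
    }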
INTERACTING WITH THE HADOOP ECOSYSTEM
1. Pig
2. Hive
3. Sqoop
4. HBase
Pig
• Pig is a data flow system for Hadoop. It uses Pig Latin to specify the data flow.
• Pig is an alternative to MapReduce programming. It abstracts away some details and allows you to focus on data processing.
• It consists of two components:
  1. Pig Latin: the data processing language.
  2. Compiler: translates Pig Latin into MapReduce programs.
Hive
• Hive is a data warehousing layer on top of Hadoop.
• Analysis and queries can be done using an SQL-like language.
• Hive can be used for ad-hoc queries, summarization, and data analysis.
Sqoop
• Sqoop is a tool which helps to transfer data between Hadoop and relational databases.
• With the help of Sqoop, you can import data from an RDBMS into HDFS and vice versa.
5.13.4 HBase
• HBase is a NoSQL database for Hadoop.
• HBase is a column-oriented NoSQL database, used to store billions of rows and millions of columns.
• HBase provides random read/write operations. It also supports record-level updates, which is not possible using HDFS alone.
• HBase sits on top of HDFS.
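As a hedged illustration of HBase's random read/write model (not from the book; the table, column family, and values are hypothetical), a put followed by a get through the HBase Java client API might look like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccessExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Random write / record-level update: set one cell of row "user1".
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Solapur"));
                table.put(put);

                // Random read: fetch the same row back by its key.
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
            }
        }
    }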


TEST ME
A. Fill Me
1. Hadoop is a ___________-based flat structure.
2. RDBMS is the best choice when ___________ is the main concern.
3. Hadoop supports ___________, ___________ and ___________ data formats.
4. RDBMS supports ___________ data formats.
5. In Hadoop, data is processed in ___________.
6. HDFS can be deployed on ___________.
7. NameNode uses ___________ to store the file system namespace.
8. NameNode uses ___________ to record every transaction.
9. Secondary NameNode is a ___________ daemon.
10. DataNode is responsible for ___________ file operation.
