
Introduction to Hadoop

Introducing Hadoop
● A large amount of data is generated every day, every minute, every second.
Data: The treasure trove
● Provides business advantages such as generating product recommendations, inventing new products, analyzing markets, etc.
● Provides a few key indicators that can turn the fortunes of a business.
● Provides room for precise analysis.
Why Hadoop

● Its capability to handle massive amounts of data, across different categories of data, fairly quickly.
● Other considerations are:
Hadoop Framework

1. Distributes the data and replicates chunks of each data file across several nodes.
2. Locally available compute resources are used to process each chunk of data in parallel.
3. The Hadoop framework handles failover smartly and automatically.
Why Not RDBMS
Distributed processing challenges
1. Hardware Failure:
○ Replication: HDFS copies each block of a file to several DataNodes.
○ Replication factor: the number of copies kept of each block (a sketch follows below).
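A minimal sketch of these two ideas using the standard Hadoop FileSystem API; the path /data/sample.txt is only an illustration:

// Sketch: controlling the HDFS replication factor from a Java client.
// "dfs.replication" is the standard Hadoop configuration key.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");  // keep 3 copies of each block (the default)

        FileSystem fs = FileSystem.get(conf);
        // setReplication changes the factor for an already-existing file.
        fs.setReplication(new Path("/data/sample.txt"), (short) 3);
        fs.close();
    }
}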
Distributed processing challenges
2. How to Process This Gigantic Store of Data?
● Integrate the data available across several servers before proceeding to processing.
● Hadoop solves this problem by using MapReduce programming, a programming model for processing the data in parallel.
Hadoop overview
1. Open-source software framework to store and process massive amounts of
data in a distributed fashion
2. Basically, Hadoop accomplishes two tasks:
a. Massive data storage.
b. Faster data processing.
Hadoop components
Hadoop core components
Hadoop ecosystem
● The Hadoop ecosystem consists of support projects that enhance the functionality of the Hadoop core components.
Hadoop conceptual layer
High-level architecture of Hadoop
● Hadoop has a distributed master-slave architecture.
● The master node is known as the NameNode.
● The slave nodes are known as DataNodes.
Use Case of Hadoop
ClickStream Data:
● It helps you understand the purchasing behaviour of customers.
● It helps online marketers optimize their product web pages, promotional content, etc.
HDFS
HDFS architecture - HDFS daemons
● The client application contacts the NameNode for metadata-related activity and communicates directly with the DataNodes to read and write files.
● DataNodes communicate with each other for pipelined reads and writes.
● For example, consider sample.txt with a size of 192 MB. With the default block size of 64 MB, it is split into 3 blocks, which are replicated across the nodes in the cluster as per the default replication factor.
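A small sketch of this metadata conversation, using the Java FileSystem API (the file path is hypothetical): the NameNode answers the block-location query, and the data itself is then read from the DataNodes.

// Sketch: listing the block locations of an HDFS file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
        // One BlockLocation per block; each lists the DataNodes holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block);  // prints offset, length, and hosts
        }
        fs.close();
    }
}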
HDFS architecture - NameNode
● HDFS breaks large files into smaller pieces called blocks.
● The NameNode uses a rackID to identify the DataNodes in a rack.
● A rack is a collection of DataNodes in the cluster.
● The NameNode keeps track of the blocks of each file as they are placed on the various DataNodes.
● The DataNodes handle the file operations: read, write, create, and delete.
● The job of the NameNode is to manage the File System Namespace.
● The File System Namespace is the collection of files in the cluster.
● The File System Namespace includes the mapping of blocks to files and the file properties, and is stored in a file called FsImage.
● The NameNode uses an EditLog (a transaction log) to record every transaction that happens to the file metadata.
HDFS architecture - DataNode
● There are multiple DataNodes per cluster.
● During pipelined reads and writes, the DataNodes communicate with each other.
● A DataNode continuously sends "heartbeat" messages to the NameNode, which confirms connectivity between the NameNode and the DataNode.
● If no heartbeat arrives from a DataNode, the NameNode re-replicates that DataNode's blocks elsewhere within the cluster.
HDFS architecture - Secondary NameNode
● The Secondary NameNode takes snapshots of the metadata at the interval specified in the configuration.
● The NameNode and the Secondary NameNode run on different machines, as their memory requirements are the same.
● In case of failure of the NameNode, the Secondary NameNode can be configured manually to bring up the cluster.
● The Secondary NameNode does not record the real-time changes that happen to HDFS.
Anatomy of file read
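The original slide shows the read path as a figure. As a rough stand-in, here is a minimal client-side sketch with the Java FileSystem API (path hypothetical); under the hood, the client fetches the block locations from the NameNode and streams each block from a nearby DataNode.

// Sketch: reading a file from HDFS line by line.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}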
Anatomy of file write
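Likewise for the write path: a minimal sketch (path hypothetical) in which the client writes to the first DataNode, and the DataNodes pipeline the data onward to the other replicas.

// Sketch: writing a file to HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.writeUTF("hello hadoop\n");
        }
        fs.close();
    }
}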
Replica placement strategy
Hadoop default replication strategy
Special features of Hadoop
Processing Data with Hadoop
● MapReduce programming is the software framework for processing the data stored in Hadoop.
● MapReduce enables processing of massive amounts of data in parallel.
● The dataset is split into independent chunks.
MapReduce Daemons
How does MapReduce work?
● MapReduce divides the data analysis task into two parts:
○ Map
○ Reduce
● Each mapper works on the small subset of the data that is stored on its node, and the reducers combine the output from the mappers to produce the reduced result set, as in the word-count sketch below.
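As a concrete illustration of the two parts, here is the classic word-count program sketched with the standard Hadoop MapReduce Java API: the mapper emits (word, 1) for every word in its input split, and the reducer sums the counts per word.

// Sketch: word count with Hadoop MapReduce.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}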
Working model of MapReduce programming
SQL vs MapReduce
Managing resources and applications with Hadoop
YARN
● Apache YARN is a sub-project of Hadoop 2.x.
● Hadoop 2.x has a YARN-based architecture.
● YARN enables sharing of resources among all applications running on Hadoop 2.x.
● Hadoop 2.x can therefore be used for many kinds of applications, such as batch, interactive, graph, and video-streaming workloads.
Limitations of Hadoop 1.0:

HDFS limitation:
There is a single namespace, and hence a single NameNode, for the whole cluster, which places an overwhelming load on that one NameNode. In Hadoop 2.x, this is resolved using HDFS federation.
Hadoop 2: HDFS

HDFS 2 consists of two major components:
● Namespace: takes care of file-related operations such as creating files and directories and modifying files.
● Block storage service: handles DataNode cluster management and replication.
Features:
● Horizontal scalability
● High availability
Horizontal scalability
● HDFS federation uses multiple independent NameNodes for horizontal scalability, as sketched below.
● The NameNodes are independent of each other, and no coordination among them is needed.
● The DataNodes provide common storage for blocks and are shared by all the NameNodes.
● Every DataNode in the cluster registers with each NameNode in the cluster.
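A minimal sketch of federation settings, expressed here through the Java Configuration API (these keys normally live in hdfs-site.xml; the nameservice IDs and hostnames are hypothetical):

// Sketch: pointing a cluster at two independent NameNodes (HDFS federation).
import org.apache.hadoop.conf.Configuration;

public class FederationConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "ns1,ns2");  // two independent namespaces
        conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");
        // Every DataNode registers with both NameNodes and stores blocks for both.
    }
}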
High availability
● High availability of the NameNode is obtained with the help of a passive standby NameNode.
● In Hadoop 2.x, this active-passive NameNode pair handles failover automatically.
● All edits made on the active NameNode are recorded in shared NFS storage.
● The passive NameNode reads the edits from the shared storage and keeps its metadata up to date (a configuration sketch follows).
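A minimal sketch of the NFS-based HA settings, again via the Java Configuration API (keys normally in hdfs-site.xml; nameservice IDs, hostnames, and the NFS path are hypothetical):

// Sketch: an active/standby NameNode pair sharing an edit log over NFS.
import org.apache.hadoop.conf.Configuration;

public class HaConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");  // active and standby
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // Shared NFS directory where the active NameNode writes its edits and
        // the standby reads them to keep its metadata current.
        conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/nfs/hadoop-ha/edits");
    }
}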
Hadoop 1.0 vs Hadoop 2.0
Hadoop 2 YARN: Taking Hadoop beyond batch

● YARN helps us store all the data in one place.
● Applications interact with YARN to get predictable performance and quality of service.
● YARN was originally architected by Yahoo.
Hadoop 2 YARN: Taking Hadoop beyond batch

Fundamental Idea:
● The fundamental idea behind the YARN architecture is to split the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons.
● The daemons of YARN are as follows:
1. Global ResourceManager: its responsibility is to distribute resources among the various applications in the system. It has two main components:
○ a) Scheduler: the pluggable scheduler of the ResourceManager decides the allocation of resources to the various running applications. It does not monitor the status of applications.
○ b) ApplicationsManager: it accepts job submissions, negotiates the resources for executing the application-specific ApplicationMaster, and restarts the ApplicationMaster in case of failure.
Hadoop 2 YARN: Taking Hadoop beyond batch

2. NodeManager: this is a per-machine slave daemon. The NodeManager's responsibility is to launch the application containers for application execution. It monitors resource usage such as memory, CPU, and network, and then reports the resource usage to the global ResourceManager.
3. Per-application ApplicationMaster: its job is to negotiate the resource requirements for execution from the ResourceManager.
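To make the client-to-ResourceManager interaction concrete, here is a minimal sketch, assuming default cluster configuration, that uses the YarnClient API to list the applications the global ResourceManager currently knows about:

// Sketch: querying the ResourceManager for applications via YarnClient.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnClientExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();
        // Each report describes one application the ResourceManager is tracking.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + " " + report.getName());
        }
        yarnClient.stop();
    }
}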
Hadoop 2 YARN: Taking Hadoop beyond batch

Basic Concepts:
