Unit 4: Big Data Technology Landscape: Two Important Technologies
3. Secondary NameNode
• It takes a snapshot of the HDFS metadata at
intervals specified in the configuration
• It occupies the same amount of memory as the namenode
• Therefore, the two are run on different machines
• On failure of the namenode, the secondary namenode can be
configured to take over
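The checkpoint interval mentioned above is set through configuration. A minimal sketch, assuming Hadoop 2.x property names (in practice these are usually set in hdfs-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;

public class CheckpointConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Checkpoint (snapshot) the HDFS metadata every hour (3600 s).
        conf.set("dfs.namenode.checkpoint.period", "3600");
        // Also checkpoint once 1,000,000 uncheckpointed transactions accumulate.
        conf.set("dfs.namenode.checkpoint.txns", "1000000");
        System.out.println(conf.get("dfs.namenode.checkpoint.period"));
    }
}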
4.2. File read, file write, replica processing of data with hadoop
• File read:
• 1. The client opens the file it wants to read by calling open()
on the DFS.
• 2. DFS communicates with the namenode to get the locations
of the data blocks.
• 3. The namenode returns the addresses of the datanodes
containing the data blocks.
• 4. DFS returns an FSDataInputStream to the client.
• 5. The client calls read() on the FSDataInputStream, which
holds the datanode addresses for the first few blocks of the
file, and connects to the nearest datanode for the 1st block
in the file.
• 6. The client calls read() repeatedly to get the data stream
from the datanode.
• 7. When the end of a block is reached, FSDataInputStream closes
the connection with the datanode.
• 8. These steps are repeated to find the best datanode for
the next block.
• 9. The client calls close().
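This sequence maps directly onto the Java HDFS client API. A minimal sketch follows; the namenode URI and file path are placeholders, and steps 2-4 (block location lookup) happen inside the client library:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Step 1: open() returns an FSDataInputStream (the namenode is
        // consulted for block locations behind this call).
        FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
        try {
            // Steps 5-8: read() streams data from the nearest datanode,
            // moving from block to block transparently.
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            in.close(); // step 9
        }
    }
}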
File write
• 1. The client calls create() to create the file.
• 2. An RPC call is initiated to the namenode.
• 3. The namenode creates the file after a few checks.
• 4. DFS returns an FSDataOutputStream for the client to write on.
• 5. As the client writes data, the data is split into packets, which are then
written to a data queue.
• 6. DataStreamer requests the namenode to allocate blocks by selecting a
list of suitable datanodes for storing the replicas (by default 3).
• 7. This list of datanodes forms a pipeline, with 3 nodes in the pipeline
for the 1st block.
File write….
• 8. DataStreamer streams the packets to the 1st datanode in the
pipeline, which stores them and then forwards them to the other
datanodes in the pipeline.
• 9. DFSOutputStream manages an "ack queue" of packets that are
waiting for acknowledgement; a packet is removed from the queue only
when it has been acknowledged by all the datanodes in the pipeline.
• 10. When the client finishes writing the file, it calls close() on the
stream.
• 11. This flushes all the remaining packets to the datanode pipeline
and waits for acknowledgements before communicating with the
NameNode to inform the client that the creation of the file is
complete.
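The write sequence can likewise be sketched with the Java HDFS API. In this minimal example the namenode URI and file path are placeholders; the data queue, DataStreamer, and ack queue all work behind the write() and close() calls:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Steps 1-4: create() asks the namenode to create the file and
        // returns an FSDataOutputStream for the client to write on.
        FSDataOutputStream out = fs.create(new Path("/data/output.txt"));
        try {
            // Steps 5-9: the data is split into packets, queued, and
            // streamed through the datanode pipeline behind this call.
            out.writeUTF("hello HDFS");
        } finally {
            // Steps 10-11: close() flushes the remaining packets and waits
            // for acknowledgements before reporting completion.
            out.close();
        }
    }
}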
Replica processing of data with hadoop
• Replica placement strategy:
• By default, 3 replicas are created for each data block:
The 1st replica is placed on the same node as the client.
The 2nd replica is placed on a node in a different rack.
The 3rd replica is placed on the same rack as the second, but on a different node in the rack.
• Then a data pipeline is built. The client application writes a block to the 1st
datanode in the pipeline.
• Next, this datanode takes over and forwards the data to the next node in the pipeline.
• This process continues for all the data blocks.
• Subsequently, all the data blocks are written to disk.
• The client application need not track all blocks of data. HDFS directs the
client to the nearest replica.
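The replication factor behind this strategy is configurable, cluster-wide and per file. A minimal sketch, assuming the standard dfs.replication property and a placeholder namenode URI and path:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default: 3 replicas per block (the HDFS default).
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Raise the replication factor of one existing file to 4.
        fs.setReplication(new Path("/data/sample.txt"), (short) 4);
    }
}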
Why hadoop 2.x?
• Because of the following limitations of hadoop 1.0:
• In hadoop 1.0, HDFS and MR are the core components, while other
components are built around them.
• 1. A single namenode serves the entire namespace of a cluster and keeps
all its file metadata in main memory. This puts a limit on the number of
objects stored in the NameNode.
• 2. Processing is restricted to batch-oriented MapReduce jobs.
• 3. MR handles both cluster resource management and data processing; it is not
suitable for interactive analysis.
• 4. hadoop 1.0 is not suitable for machine learning, graphs, and other
memory-intensive algorithms.
• 5. Map slots may become full while reduce slots are empty, and vice
versa, leading to inefficient resource utilization.
How hadoop 2.x?
• HDFS 2, used in hadoop 2.0, consists of 2 major components:
• 1. Namespace service: takes care of file-related operations (create,
read, write).
• 2. Block storage service: handles datanode cluster
management and replication.
• HDFS 2 uses:
• 1. Multiple independent namenodes: the datanodes act as
common block storage shared by all namenodes. All
datanodes register with every namenode in the cluster.
• 2. A passive standby namenode.
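Federation of multiple independent namenodes is driven by configuration. A hedged sketch of the relevant properties, normally set in hdfs-site.xml; the nameservice IDs and hostnames here are made-up examples:

import org.apache.hadoop.conf.Configuration;

public class FederationConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Two independent namenodes, each owning part of the namespace.
        conf.set("dfs.nameservices", "ns1,ns2");
        conf.set("dfs.namenode.rpc-address.ns1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "namenode2:8020");
        // Datanodes reading this configuration register with both namenodes.
        System.out.println(conf.get("dfs.nameservices"));
    }
}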
Managing resources and applications with hadoop YARN
• YARN is a sub-project of hadoop 2.x.
• It is a general processing platform.
• YARN is not constrained to MR alone.
• Multiple applications can be run in hadoop 2.x, with all
applications sharing the same resource management (memory,
CPU, network, etc.).
• With YARN, hadoop can do not only batch processing but
also interactive, online, streaming, graph, and other types of
processing.
Daemons of YARN
1. Global ResourceManager: distributes resources among the various
applications. It has 2 components:
1.1. Scheduler: decides the allocation of resources to running applications.
It does no monitoring.
1.2. ApplicationsManager: accepts jobs and negotiates resources for
executing the ApplicationMaster, which is specific to an application.
• 2. NodeManager: monitors the usage of resources and reports that usage
to the Global ResourceManager. It launches 'application containers' for
the execution of applications.
• Every machine has one NodeManager.
• 3. Per-application ApplicationMaster: every application has one. It
negotiates the required resources for execution from the ResourceManager
and works along with the NodeManager to execute and monitor
component tasks.
• Application: a job submitted to the
framework. Ex: a MapReduce job
• Container: a basic unit of allocation across multiple
resource types.
Ex: container_0 = 2 GB, 1 CPU
container_1 = 1 GB, 6 CPUs
• A container replaces the fixed map/reduce slots.
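A minimal sketch of how an ApplicationMaster would describe a container like container_0 above using the YARN records API; the priority value and the absence of node/rack constraints are illustrative assumptions:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerSpec {
    public static void main(String[] args) {
        // container_0 = 2 GB of memory, 1 virtual core.
        Resource capability = Resource.newInstance(2048, 1);
        Priority priority = Priority.newInstance(0);
        // No fixed map/reduce slot: null nodes/racks mean any node may
        // satisfy this request.
        ContainerRequest request =
                new ContainerRequest(capability, null, null, priority);
        System.out.println(request.getCapability());
    }
}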
YARN Architecture: steps
• 1. The client program submits the application, which contains the specifications to
launch the application-specific 'ApplicationMaster'.
• 2. The ResourceManager launches the 'ApplicationMaster' by assigning some
container.
• 3. The 'ApplicationMaster' registers with the ResourceManager so that the client
can query the ResourceManager for details.
• 4. The ApplicationMaster negotiates appropriate resource containers via the
resource-request protocol.
• 5. After container allocation, the ApplicationMaster launches the container
by providing the specs to the NodeManager.
• 6. The NodeManager executes the application code and provides status to the
ApplicationMaster via an application-specific protocol.
• 7. On completion of the application, the 'ApplicationMaster' deregisters with the
ResourceManager and shuts down. Its containers can then be reused.
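Step 1 of this flow can be sketched with the YarnClient API. This is a minimal, incomplete illustration: the ContainerLaunchContext that actually specifies how to launch the ApplicationMaster (command, jars, environment) is omitted, and the application name is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class SubmitApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Step 1: ask the ResourceManager for a new application and fill in
        // the submission context with the ApplicationMaster's launch specs.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("demo-app");
        // ... set the ContainerLaunchContext for the ApplicationMaster here ...

        // Step 2 onwards: the ResourceManager launches the ApplicationMaster.
        ApplicationId appId = yarnClient.submitApplication(context);
        System.out.println("Submitted " + appId);
    }
}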