Big Data Unit 2

Hadoop is an open-source framework designed for processing and storing large datasets in a distributed computing environment, utilizing components like HDFS for storage and MapReduce for processing. The Hadoop ecosystem includes various tools such as Pig, Hive, and YARN, which enhance data ingestion, analysis, and resource management. The document also discusses the evolution from Hadoop 1.0 to 2.0, highlighting improvements in scalability and support for diverse data processing methods.


Apex Institute of Technology

Department of Computer Science & Engineering


Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Hadoop

2
WHAT IS HADOOP
Hadoop is an open source, Java-based programming framework that supports the
processing and storage of extremely large data sets in a distributed computing
environment. It is part of the Apache project sponsored by the Apache Software
Foundation.
 Able to store huge amounts of data from a variety of sources

 Low-cost data storage, as data is stored on commodity hardware

 Interoperability with other platforms

 Bigger tasks are divided into smaller tasks, which are processed
and analyzed in parallel and later recombined for the final result.
Hadoop Ecosystem
• Hadoop Ecosystem is a collection of tools that
provides services like ingestion, storage, processing
and analysis of big data. The various components of
Hadoop are shown in the HADOOP-ECOSYSTEM diagram.
Core components of Hadoop
The core of Apache Hadoop consists of a storage part, known as Hadoop
Distributed File System (HDFS), and a processing part which is a MapReduce
programming model.
• Hadoop Distributed File System (HDFS) – a distributed file-system that
stores data on commodity machines, providing very high aggregate
bandwidth across the cluster.
• Unlike other distributed systems, HDFS is highly fault tolerant and
designed using low-cost hardware.
• HDFS holds a very large amount of data and provides easy access.
• To store such huge data, the files are stored across multiple machines.
• HDFS also makes data available to applications for parallel processing.
HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS
(a programmatic sketch follows this slide).
• HDFS provides file permissions and authentication.
• HDFS has two types of nodes:
• NameNode and
• DataNode.
• The NameNode acts as the master node and maintains all the
information about the DataNodes and the blocks of HDFS.
It regularly checks on the DataNodes to confirm whether they
are alive or not.
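As a hedged illustration (not part of the slides), the same basic operations available through the HDFS command interface can also be performed through Hadoop's Java FileSystem API. The NameNode URI and directory path below are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListExample {
    public static void main(String[] args) throws Exception {
        // Assumption: the NameNode is reachable at this placeholder URI.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/demo");   // hypothetical directory
            fs.mkdirs(dir);                      // create it if it does not exist

            // List the directory: each FileStatus exposes the permissions,
            // replication factor and path, as described on the slide.
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.printf("%s %d %s%n",
                        status.getPermission(), status.getReplication(), status.getPath());
            }
        }
    }
}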
Hadoop MapReduce –
• MapReduce is a software framework in which applications are
written in different programming languages to process large data sets.
It is based on the principle of Map (distribute) and Reduce (aggregate).
• An implementation of the MapReduce programming model for
large-scale data processing.
YARN (Yet Another Resource Negotiator):

• This is another core component of the Hadoop Ecosystem. It is
also called the brain of the Hadoop Ecosystem, as it manages all the
processing activities. It also acts as the resource allocator and
scheduler.
PIG

 Pig was initially developed by Yahoo. It performs the extraction, transformation, processing and
analysis of huge data sets. Pig has two components:

o Pig Latin, which is a scripting language similar to SQL.

o Pig Runtime, which provides the execution environment for Pig Latin scripts.
HIVE

 Hive is basically used for the analysis of big data sets using HQL (Hive
Query Language). HQL is a programming language for Hive and
supports all SQL data types. It supports both batch processing as
well as real-time applications. Hive has two components:
o Hive Command Line, used to execute Hive commands.
o JDBC/ODBC Driver, which provides the connection for data processing
(see the JDBC sketch below).
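A minimal sketch of connecting through the JDBC driver, assuming a HiveServer2 instance at a placeholder host and the default port 10000; the table and query are illustrative, not from the slides (the hive-jdbc driver jar must be on the classpath).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumption: HiveServer2 runs on this placeholder host.
        String url = "jdbc:hive2://hive-server:10000/default";

        try (Connection con = DriverManager.getConnection(url, "user", "");
             Statement stmt = con.createStatement();
             // Hypothetical HQL query against an illustrative 'employees' table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}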
Other Hadoop Components
 Mahout: It provides an environment for creating machine learning applications for analysis using
clustering, filtering and classification.

 HBase: It is an open-source NoSQL database which runs on top of HDFS. It is written in Java
(see the client sketch after this list).

 Zookeeper: It is a centralized service that provides coordination between different distributed
applications.

 Apache Flume: Flume is used to ingest unstructured as well as semi-structured data (for
example, streaming log data) into HDFS.

 Apache Sqoop: Sqoop is a data ingestion tool that is used to import and export structured data
between RDBMS/enterprise data warehouses and HDFS.
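Since HBase is written in Java and sits on top of HDFS, a minimal client sketch could look like the following. This is a hedged example: the table name, row key and column names are hypothetical, and a running HBase cluster reachable via an hbase-site.xml on the classpath is assumed.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Assumption: hbase-site.xml on the classpath points at a running cluster.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("web_logs"))) { // hypothetical table

            // Write one cell: row key "row-001", column family "info", qualifier "url".
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("url"))));
        }
    }
}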
HDFS Architecture
NAMENODE
HDFS follows the master-slave architecture and it has the following
elements.
• Namenode

• The NameNode runs on commodity hardware that contains an
operating system and the NameNode software.
• It is software that can be run on commodity hardware.
NAMENODE
The system having the namenode acts as the master server and it does
the following tasks:
• Manages the file system namespace.

• Regulates client’s access to files.

• It also executes file system operations such as renaming, closing,
and opening files and directories.
DATANODE
• The DataNode is commodity hardware having an operating
system and the DataNode software. For every node (commodity
hardware/system) in a cluster, there will be a DataNode. These
nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as
per client request.
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
BLOCK

• Generally, user data is stored in the files of HDFS. A file in the file
system is divided into one or more segments and stored in
individual DataNodes. These file segments are called blocks. In
other words, the minimum amount of data that HDFS can read or
write is called a block. The default block size is 64 MB in Hadoop 1.x
(128 MB in Hadoop 2.x and later), but it can be increased as needed
by changing the HDFS configuration.
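A hedged sketch of the configuration just mentioned: the block size can be changed cluster-wide (normally in hdfs-site.xml) or per file at creation time. The property name dfs.blocksize is the Hadoop 2.x name; the sizes and path below are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default (usually set in hdfs-site.xml): 256 MB, purely illustrative.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Per-file override: this create() overload takes an explicit block size.
            short replication = 3;
            long blockSize = 128L * 1024 * 1024;
            int bufferSize = 4096;
            try (FSDataOutputStream out = fs.create(new Path("/tmp/big-file.dat"),
                    true, bufferSize, replication, blockSize)) {
                out.writeBytes("data that will be split into 128 MB blocks\n");
            }
        }
    }
}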
Limitations

• Limited scalability.
• The utilization of computational resources is inefficient.
• The Hadoop framework was limited to the MapReduce
processing paradigm.
Hadoop version 2.0

• YARN was introduced in Hadoop version 2.0 in the year 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to
relieve MapReduce by taking over the responsibility of Resource
Management and Job Scheduling.
• YARN allows different data processing methods like graph
processing, interactive processing, stream processing as well as
batch processing to run and process data stored in HDFS.
Therefore YARN opens up Hadoop to other types of distributed
applications beyond MapReduce.
• YARN enabled the users to perform operations as per
requirement by using a variety of tools like Spark for real-time
processing, Hive for SQL, HBase for NoSQL and others.
Q/A

1. What is Hadoop?

• a) A programming language for Big Data analytics

• b) A data storage system

• c) An open-source framework for distributed processing of large datasets

• d) A cloud computing platform

Answer: c) An open-source framework for distributed processing of large
datasets

25
Q/A
4. Which component of the Hadoop ecosystem is an in-memory data processing engine?

• a) Hadoop Distributed File System (HDFS)

• b) MapReduce

• c) Apache Spark

• d) Apache Hive

Answer: c) Apache Spark

26
Q/A
6. Which component of the Hadoop ecosystem is used for batch processing and analysis of large datasets?

• a) Hadoop Distributed File System (HDFS)

• b) MapReduce

• c) Apache Spark

• d) Apache Kafka

Answer: b) MapReduce

27
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 28
THANK YOU

For queries
Email:
[email protected]
29
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Hadoop Ecosystem
Components
Flume vs Sqoop

• Flume only ingests unstructured data or semi-structured
data into HDFS.
• Sqoop, on the other hand, can import as well as export structured
data from an RDBMS or enterprise data warehouse to HDFS and
vice versa.
Q/A

Which component of the Hadoop ecosystem provides a high-level query and
data manipulation language similar to SQL?

• a) Hadoop Distributed File System (HDFS)

• b) MapReduce

• c) Apache Spark

• d) Apache Hive

Answer: d) Apache Hive

6
Q/A
7.Which type of data sources does Sqoop support?
• a) Only relational databases

• b) Only NoSQL databases

• c) Both relational and NoSQL databases

• d) Only streaming data sources

Answer: c) Both relational and NoSQL databases

7
Q/A
8.Which type of data sources does Flume support?
• a) Only relational databases

• b) Only NoSQL databases

• c) Both relational and NoSQL databases

• d) Only streaming data sources

Answer: d) Only streaming data sources

8
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 9
Q/A

10
THANK YOU

For queries
Email:
[email protected]
11
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Hadoop 1.0 Vs Hadoop 2.0

2
WHAT IS HADOOP
Hadoop is an open source, Java-based programming framework that supports the
processing and storage of extremely large data sets in a distributed computing
environment. It is part of the Apache project sponsored by the Apache Software
Foundation.
 Able to store huge amounts of data from a variety of sources

 Low-cost data storage, as data is stored on commodity hardware

 Interoperability with other platforms

 Bigger tasks are divided into smaller tasks, which are processed
and analysed in parallel and later recombined for the final result.
Hadoop Ecosystem
• Hadoop Ecosystem is a collection of tools that
provides services like ingestion, storage, processing
and analysis of big data. The various components of
Hadoop are shown in the HADOOP-ECOSYSTEM diagram.
Hadoop version 2.0

• YARN was introduced in Hadoop version 2.0 in the year 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to
relieve MapReduce by taking over the responsibility of Resource
Management and Job Scheduling.
• YARN allows different data processing methods like graph
processing, interactive processing, stream processing as well as
batch processing to run and process data stored in HDFS.
Therefore YARN opens up Hadoop to other types of distributed
applications beyond MapReduce.
• YARN enabled the users to perform operations as per
requirement by using a variety of tools like Spark for real-time
processing, Hive for SQL, HBase for NoSQL and others.
Q/A
What is the advantage of YARN in Hadoop 2.0?
• a) Improved fault tolerance

• b) Increased scalability and reliability

• c) Better support for real-time processing


• d) More efficient resource utilization and multi-tenancy

Answer: d) More efficient resource utilization and multi-tenancy

11
Q/A
What is the main difference between Hadoop 1.0 and Hadoop 2.0?
• a) Hadoop 2.0 introduced the concept of data replication.

• b) Hadoop 2.0 improved the scalability and reliability of Hadoop.

• c) Hadoop 2.0 introduced YARN (Yet Another Resource Negotiator)
for resource management.
• d) Hadoop 2.0 added support for real-time stream processing.

Answer: c) Hadoop 2.0 introduced YARN (Yet Another Resource
Negotiator) for resource management.

12
Q/A
Which version of Hadoop introduced support for Apache HBase?
• a) Hadoop 1.0

• b) Hadoop 2.0
• c) Hadoop 2.2

• d) Hadoop 3.0
Answer: b) Hadoop 2.0

13
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 14
THANK YOU

For queries
Email:
[email protected]
15
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
HDFS

2
HDFS
 In 2010, Facebook claimed to have one of the largest HDFS clusters,
storing 21 petabytes of data.

 In 2012, Facebook declared that they had the largest single HDFS cluster
with more than 100 PB of data.

 And Yahoo! has more than 100,000 CPUs in over 40,000 servers running
Hadoop, with its biggest Hadoop cluster running 4,500 nodes. All told,
Yahoo! stores 455 petabytes of data in HDFS.

 In fact, by 2013, most of the big names in the Fortune 50 started using
Hadoop.
HDFS
• Distributed File System talks about managing data,
i.e. files or folders across multiple computers or
servers.
• For example, for Windows you have NTFS (New Technology File
System) and for Mac you have HFS (Hierarchical File System).
The only difference is that, in the case of a Distributed File
System, you store data on multiple machines rather than a
single machine. Even though the files are stored across the
network, DFS organizes and displays data in such a
manner that a user sitting on a machine will feel like all the
data is stored on that very machine.
HDFS

• Hadoop Distributed File System, or HDFS, is a Java-based
distributed file system that allows you to
store large data across multiple nodes in a Hadoop
cluster. So, if you install Hadoop, you get HDFS as
an underlying storage system for storing the data in
the distributed environment.
Advantages Of HDFS

• Distributed Storage: You will
feel as if you have logged into a
single large machine which has a
storage capacity of 10 TB (total
storage over ten machines).
What does it mean? It means
that you can store a single large
file of 10 TB which will be
distributed over the ten
machines (1 TB each).
Distributed & Parallel Computation

• The work which was
taking 43 minutes before
gets finished in just 4.3
minutes now, as the
work got divided over
ten machines.
Horizontal Scalability:
Vertical scaling
• There are two types of scaling: vertical and horizontal.
• In vertical scaling (scale up), you increase the hardware capacity of your system. In other
words, you procure more RAM or CPU and add it to your existing system to make it more
robust and powerful. But there are challenges associated with vertical scaling or scaling
up:
 There is always a limit to which you can increase your hardware capacity. So, you can’t keep
on increasing the RAM or CPU of the machine.

 In vertical scaling, you stop your machine first. Then you increase the RAM or CPU to make it a
more robust hardware stack. After you have increased your hardware capacity, you restart
the machine. This down time when you are stopping your system becomes a challenge.
Horizontal scaling (scale out):

• Add more nodes to the existing cluster instead of
increasing the hardware capacity of individual
machines. And most importantly, you can add
more machines on the go, i.e. without stopping
the system. Therefore, while scaling out we don’t
have any downtime or green zone, nothing of that
sort.
Features/Advantages of HDFS
 Cost: HDFS, in general, is deployed on commodity hardware like the desktop/laptop
you use every day. So, it is very economical in terms of the cost of ownership of the
project. Since we are using low-cost commodity hardware, you don’t need to spend a huge
amount of money for scaling out your Hadoop cluster. In other words, adding more nodes
to your HDFS is cost effective.

 Variety and Volume of Data: When we talk about HDFS then we talk about storing huge data
i.e. Terabytes & petabytes of data and different kinds of data. So, you can store any type of
data into HDFS, be it structured, unstructured or semi structured.

 Reliability and Fault Tolerance: When you store data on HDFS, it internally divides the given
data into data blocks and stores it in a distributed fashion across your Hadoop cluster. The
information regarding which data block is located on which of the data nodes is recorded in
the metadata on Name Node.
Features/Advantages of HDFS
• Data Integrity: Data integrity talks about whether the data stored in HDFS is
correct or not. HDFS constantly checks the integrity of the data stored against its
checksum. If it finds any fault, it reports it to the NameNode. Then, the
NameNode creates additional new replicas and deletes the corrupted copies.
• High Throughput: Throughput is the amount of work done in a unit of time. It talks
about how fast you can access the data from the file system. Basically, it gives you
an insight into the system performance. As you have seen in the above example,
where we used ten machines collectively to enhance computation, we were
able to reduce the processing time from 43 minutes to a mere 4.3 minutes as all
the machines were working in parallel.
• Data Locality: Data locality talks about moving the processing unit to the data rather than
the data to the processing unit. In traditional systems, we used to bring the data to
the application layer and then process it.
Q/A
2.What is the primary storage component in HDFS?
• a) NameNode

• b) DataNode

• c) ResourceManager

• d) NodeManager

Answer: b) DataNode

13
Q/A
3.What is the role of the NameNode in HDFS?
• a) Storing and managing the actual data blocks of files.

• b) Distributing the workload and managing resources in a Hadoop
cluster.
• c) Coordinating the storage and retrieval of data in HDFS.

• d) Managing the metadata of files and directories in HDFS.

Answer: d) Managing the metadata of files and directories in HDFS.

14
Q/A
5.Which HDFS component is responsible for ensuring data replication
and fault tolerance?
• a) NameNode

• b) DataNode

• c) SecondaryNameNode

• d) CheckpointNode

Answer: a) NameNode

15
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 16
THANK YOU

For queries
Email:
[email protected]
17
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
HDFS Architecture:

2
HDFS Architecture:

• Hadoop Distributed File System is a block-structured file system where:
• Each file is divided into blocks of a pre-determined size.
• These blocks are stored across a cluster of one or several machines.
• HDFS Architecture follows a Master/Slave Architecture: the
cluster comprises a single NameNode (master node) and all the
other nodes are DataNodes (slave nodes).
• Though one can run several DataNodes on a single machine, in
the practical world these DataNodes are spread across various
machines.
Functions of NameNode:
NameNode is the master node in the Apache Hadoop HDFS Architecture that
maintains and manages the blocks present on the DataNodes (slave
nodes). NameNode is a very highly available server that manages the File System
Namespace and controls access to files by clients
 It is the master daemon that maintains and manages the DataNodes (slave nodes)
 It records the metadata of all the files stored in the cluster, e.g. The location of
blocks stored, the size of the files, permissions, hierarchy, etc. There are two files
associated with the metadata:
o FsImage: It contains the complete state of the file system namespace since the
start of the NameNode.
o EditLogs: It contains all the recent modifications made to the file system with
respect to the most recent FsImage. It records each change that takes place to
the file system metadata. For example, if a file is deleted in HDFS, the NameNode
will immediately record this in the EditLog.
Functions of NameNode:

 It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are live.

 It keeps a record of all the blocks in HDFS and in which nodes these blocks are
located.

 The NameNode is also responsible for maintaining the replication factor of all the
blocks, which is discussed in detail later.

 In case of a DataNode failure, the NameNode chooses new DataNodes for new
replicas, balances disk usage and manages the communication traffic to the
DataNodes.
DataNode

• DataNodes are the slave nodes in HDFS. Unlike the NameNode,
a DataNode is commodity hardware, that is, a non-
expensive system which is not of high quality or high
availability. The DataNode is a block server that stores the
data in the local file system.
Functions of DataNode:

 These are slave daemons or processes which run on each slave
machine.

 The actual data is stored on DataNodes.

 The DataNodes perform the low-level read and write requests from
the file system’s clients.

 They send heartbeats to the NameNode periodically to report the
overall health of HDFS; by default, this frequency is set to 3 seconds.
Secondary NameNode:

• Apart from these two daemons, there is a third daemon or process called the
Secondary NameNode. The Secondary NameNode works concurrently with the
primary NameNode as a helper daemon. Do not confuse the Secondary
NameNode with a backup NameNode, because it is not one.
Functions of Secondary NameNode:

 The Secondary NameNode is one which constantly reads all the file
systems and metadata from the RAM of the NameNode and writes it
into the hard disk or the file system.
 It is responsible for combining the EditLogs with FsImage from the
NameNode.
 It downloads the EditLogs from the NameNode at regular intervals
and applies to FsImage. The new FsImage is copied back to the
NameNode, which is used whenever the NameNode is started the
next time.
• Hence, Secondary NameNode performs regular checkpoints in HDFS.
Therefore, it is also called CheckpointNode.
Q/A
• In HDFS, large files are split into smaller blocks of fixed size. What is
the default block size in Hadoop 3.x?
• a) 64 MB

• b) 128 MB

• c) 256 MB

• d) 512 MB

Ans: b) 128 MB

11
• Which component of HDFS is responsible for storing the actual data
blocks?
• a) NameNode

• b) DataNode

• c) SecondaryNameNode

• d) ResourceManager

Ans : b) DataNode

12
•d) It assists the NameNode in checkpointing and logging operations.

• What is the purpose of the SecondaryNameNode in HDFS?


• a) It acts as a backup for the NameNode.
• b) It stores additional copies of data blocks for redundancy.
• c) It balances the load across multiple DataNodes.
• d) It assists the NameNode in checkpointing and logging operations.

• Ans d) It assists the NameNode in checkpointing and logging


operations.

13
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 14
THANK YOU

For queries
Email:
[email protected]
15
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Replication Management in HDFS

2
Block

• HDFS stores each file as blocks which are scattered throughout the
Apache Hadoop cluster. The default size of each block is 128 MB in
Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x) which you can
configure as per your requirement.
Why do we need such a huge block size, i.e. 128 MB?

• Well, whenever we talk about HDFS, we talk about huge data sets, i.e.
terabytes and petabytes of data. So, if we had a block size of, let’s say,
4 KB, as in a Linux file system, we would have too many blocks
and therefore too much metadata. Managing this number of blocks
and metadata would create huge overhead, which is something
we don’t want.
Replication Management:

• HDFS provides a reliable way to store huge data in a distributed environment as data blocks.
• The blocks are also replicated to provide fault tolerance.
• The default replication factor is 3, which is again configurable.
• Each block is replicated three times and stored
on different DataNodes.
• Therefore, if you are storing a file of 128 MB in HDFS using the default configuration, you will
end up occupying a space of 384 MB (3*128 MB), as the blocks will be replicated three times
and each replica will be residing on a different DataNode.

• Note: The NameNode collects a block report from each DataNode periodically to maintain the
replication factor. Therefore, whenever a block is over-replicated or under-replicated, the
NameNode deletes or adds replicas as needed.
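A hedged sketch of how the replication factor described above can be set, either as the cluster default (normally in hdfs-site.xml) or on an existing file; the path and values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster default; 3 is the standard HDFS default replication factor.
        conf.setInt("dfs.replication", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Change the replication factor of one existing (illustrative) file to 2.
            // The NameNode will then see the file as over-replicated and
            // schedule deletion of the extra replica, as the note above describes.
            fs.setReplication(new Path("/tmp/big-file.dat"), (short) 2);
        }
    }
}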
Rack Awareness:
• The Rack Awareness Algorithm says that
• the first replica of a block will be stored on a local rack and the next
two replicas will be stored on a different (remote) rack, but on
different DataNodes within that (remote) rack.
• If you have more replicas, the rest of the replicas will be placed on
random DataNodes, provided not more than two replicas reside on
the same rack.
Advantages of Rack Awareness:

• To improve the network performance
• To prevent loss of data
HDFS Read/ Write Architecture:

• HDFS follows a Write Once – Read Many philosophy.
• So, you can’t edit files already stored in HDFS. But you can append
new data by re-opening the file.
HDFS Write Architecture:

 At first, the HDFS client will reach out to the NameNode for a Write Request
against the two blocks, say, Block A & Block B.
 The NameNode will then grant the client the write permission and will
provide the IP addresses of the DataNodes where the file blocks will be
copied eventually.
 The selection of IP addresses of DataNodes is purely randomized based on
availability, replication factor and rack awareness that we have discussed
earlier.
 Let’s say the replication factor is set to default i.e. 3. Therefore, for each
block the NameNode will be providing the client a list of (3) IP addresses of
DataNodes. The list will be unique for each block.
HDFS Write Architecture:

 Suppose the NameNode provided the following lists of IP addresses to the
client:
 For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
 For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
 Each block will be copied to three different DataNodes to keep the
replication factor consistent throughout the cluster.
HDFS Write Architecture:

 Now the whole data copy process will happen in three stages:
 Set up of Pipeline
 Data streaming and replication
 Shutdown of Pipeline (Acknowledgement stage)
1. Set up of Pipeline:

Before writing the blocks, the client confirms whether the DataNodes,
present in each of the list of IPs, are ready to receive the data or not. In
doing so, the client creates a pipeline for each of the blocks by
connecting the individual DataNodes in the respective list for that block.
Let us consider Block A. The list of DataNodes provided by the
NameNode is:

• For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of
DataNode 6}.
Steps to create a pipeline:

 The client will choose the first DataNode in the list (DataNode IPs for
Block A), which is DataNode 1, and will establish a TCP/IP connection.

 The client will inform DataNode 1 to be ready to receive the block. It
will also provide the IPs of the next two DataNodes (4 and 6) to
DataNode 1, where the block is supposed to be replicated.

 DataNode 1 will connect to DataNode 4. DataNode 1 will
inform DataNode 4 to be ready to receive the block and will give it the
IP of DataNode 6. Then, DataNode 4 will tell DataNode 6 to be ready
for receiving the data.
Steps to create a pipeline:

 Next, the acknowledgement of readiness will follow the
reverse sequence, i.e. from DataNode 6 to 4 and then to
1.
 At last DataNode 1 will inform the client that all the
DataNodes are ready and a pipeline will be formed between
the client, DataNode 1, 4 and 6.
 Now pipeline set up is complete and the client will finally
begin the data copy or streaming process.
2. Data Streaming:

• As the pipeline has been created, the client will push the
data into the pipeline.
• Now, don’t forget that in HDFS, data is replicated based on
replication factor. So, here Block A will be stored to three
DataNodes as the assumed replication factor is 3.
• The client will copy the block (A) to DataNode 1 only. The
replication is always done by DataNodes sequentially.
Steps will take place during replication:

 Once the block has been written to DataNode 1 by the
client, DataNode 1 will connect to DataNode 4.
 Then, DataNode 1 will push the block in the pipeline and
data will be copied to DataNode 4.
 Again, DataNode 4 will connect to DataNode 6 and will copy
the last replica of the block.
3. Shutdown of Pipeline or Acknowledgement stage:

• Once the block has been copied into all the three DataNodes, a series of
acknowledgements will take place to ensure the client and NameNode that the
data has been written successfully. Then, the client will finally close the pipeline to
end the TCP session.

• The acknowledgement happens in the reverse
sequence, i.e. from DataNode 6 to 4 and then to 1.

• Finally, the DataNode 1 will push three acknowledgements (including its own) into
the pipeline and send it to the client.

• The client will inform NameNode that data has been written successfully.

• The NameNode will update its metadata and the client will shut down the pipeline.
Copying of Block B
• Similarly, Block B will also be copied into the DataNodes in parallel
with Block A. So, the following things are to be noticed here:
 The client will copy Block A and Block B to the first
DataNode simultaneously.
 Therefore, in our case, two pipelines will be formed for each of the
block and all the process discussed above will happen in parallel in
these two pipelines.
 The client writes the block into the first DataNode and then the
DataNodes will be replicating the block sequentially.
• There are two pipelines formed, one for each block (A and B).
Following is the flow of operations that takes place for each
block in its respective pipeline:
 For Block A: 1A -> 2A -> 3A -> 4A
 For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B
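A hedged client-side sketch of the write flow just described: the application simply writes to an output stream, while the pipeline setup, streaming and acknowledgements between DataNodes happen inside the HDFS client and cluster. The file name and contents are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the NameNode for write permission and the DataNode list,
             // then streams the blocks through the DataNode pipeline.
             FSDataOutputStream out = fs.create(new Path("/user/demo/example.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
        // By the time close() returns, the replicas have been acknowledged back
        // through the pipeline and the NameNode metadata has been updated.
    }
}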
HDFS Read Architecture:

HDFS Read architecture is comparatively easy to understand. Let’s take
the above example again, where the HDFS client now wants to read the
file “example.txt”. The following steps will take place while
reading the file:
reading the file:
 The client will reach out to NameNode asking for the block metadata
for the file “example.txt”.
 The NameNode will return the list of DataNodes where each block
(Block A and B) are stored.
 After that client, will connect to the DataNodes where the blocks are
stored.
HDFS Read Architecture:

 The client starts reading data parallel from the DataNodes (Block
A from DataNode 1 and Block B from DataNode 3).
 Once the client gets all the required file blocks, it will combine
these blocks to form a file.
 While serving read request of the client, HDFS selects the replica
which is closest to the client. This reduces the read latency and
the bandwidth consumption. Therefore, that replica is selected
which resides on the same rack as the reader node, if possible.
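And the matching read side, as a hedged sketch: open() fetches the block metadata from the NameNode, and the returned stream reads each block from the closest replica, so the caller just sees one continuous file. The path is illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             // open() asks the NameNode for the block locations, then streams
             // the blocks from the nearest DataNodes, as described in the slides.
             FSDataInputStream in = fs.open(new Path("/user/demo/example.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}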
Q/A
• In HDFS, how many replicas of a data block are typically
maintained?
• a) 1

• b) 2

• c) 3

• d) It depends on the configuration settings.

Ans: c) 3

29
Q/A
• Which factor(s) influence(s) the replication factor for a file in HDFS?
• a) The configured replication factor setting
• b) The size of the file

• c) The number of available DataNodes

• d) All of the above

Ans: d) All of the above

30
Q/A
What is the purpose of the Block Placement Policy in HDFS?
• a) To determine the number of replicas for each data block

• b) To determine the size of each data block

• c) To determine the storage location for each replica

• d) To determine the naming convention for data blocks

Ans: c) To determine the storage location for each replica

31
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 32
THANK YOU

For queries
Email:
[email protected]
33
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
HDFS Write Architecture

2
Q/A
• When a client wants to write a file to HDFS, which component does
it initially contact?
• a) NameNode

• b) DataNode

• c) ResourceManager

• d) SecondaryNameNode

Ans: NameNode

29
Q/A
What is the purpose of the pipeline concept in HDFS write
architecture?
• a) To ensure data replication across multiple DataNodes
• b) To minimize data transfer latency during write operations
• c) To manage metadata updates in the NameNode
• d) To encrypt data during write operations

Ans: b) To minimize data transfer latency during write operations

30
Q/A
• How many DataNodes are involved in the pipeline for writing a data
block in HDFS by default?
• a) 1

• b) 2

• c) 3

• d) It depends on the replication factor.

Ans: d) It depends on the replication factor.

31
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 32
THANK YOU

For queries
Email:
[email protected]
33
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
HDFS Read Architecture:

2
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 6
Q/A
• What is the role of the NameNode in the HDFS read architecture?
• a) Storing the actual data blocks
• b) Managing metadata and file system namespace
• c) Coordinating read requests from multiple clients
• d) Distributing data across multiple DataNodes
• Ans b) Managing metadata and file system namespace

7
Q/A
• In HDFS, what is the purpose of the read cache?
• a) To store frequently accessed file metadata for faster retrieval
• b) To buffer and temporarily store the read data blocks for
improved performance
• c) To maintain a copy of the data blocks for fault tolerance
• d) To encrypt the data blocks during read operations

Ans: b) To buffer and temporarily store the read data blocks for
improved performance

8
Q/A
• Which component in HDFS is responsible for maintaining the
mapping of file blocks to DataNodes?
• a) NameNode

• b) DataNode

• c) ResourceManager

• d) SecondaryNameNode

Ans: a) NameNode

9
THANK YOU

For queries
Email:
[email protected]
10
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
MapReduce
Mapreduce
• It is the processing layer of Hadoop.
• MapReduce is a programming model designed for processing large
volumes of data in parallel by dividing the work into a set of
independent chunks.
• There are two processes: one is the Mapper and the other is
the Reducer.
Mapreduce
• Map phase- It is the first phase of data processing. In this
phase, we specify all the complex logic/business rules/costly
code.
• Reduce phase- It is the second phase of processing. In this
phase, we specify light-weight processing like
aggregation/summation
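A minimal word-count sketch of the two phases just described (the standard introductory MapReduce example, not taken from the slides; class and path names are illustrative).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: light-weight aggregation, summing the counts per word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}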
Q/A
Which company originally developed the MapReduce framework?
• a) Google
• b) Microsoft

• c) Facebook

• d) Amazon

Ans: a) Google

8
Q/A
What is the purpose of the Map phase in MapReduce?
• a) To process and analyze input data in parallel

• b) To aggregate and summarize data

• c) To sort and partition data

• d) To generate output results

Ans: a) To process and analyze input data in parallel

9
Q/A
Which programming language is commonly used for writing
MapReduce programs?
• a) Java
• b) Python

• c) C++

• d) Ruby

Ans : a) Java

10
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 11
THANK YOU

For queries
Email:
[email protected]
12
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
YARN
Limitations

• Limited scalability.
• The utilization of computational resources is inefficient.
• The Hadoop framework was limited to the MapReduce
processing paradigm.
Hadoop version 2.0

• YARN was introduced in Hadoop version 2.0 in the year 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to
relieve MapReduce by taking over the responsibility of Resource
Management and Job Scheduling.
• YARN allows different data processing methods like graph
processing, interactive processing, stream processing as well as
batch processing to run and process data stored in HDFS.
Therefore YARN opens up Hadoop to other types of distributed
applications beyond MapReduce.
• YARN enabled the users to perform operations as per
requirement by using a variety of tools like Spark for real-time
processing, Hive for SQL, HBase for NoSQL and others.
Q/A
• What are the two main components of YARN?
• a) ResourceManager and NameNode

• b) ResourceManager and NodeManager

• c) TaskTracker and JobTracker

• d) DataNode and SecondaryNameNode

Ans: b

11
Q/A
• Which programming framework is commonly used with YARN for
distributed data processing?
• a) Hadoop MapReduce

• b) Apache Spark

• c) Apache Hive

• d) Apache Kafka

Ans: b) Apache Spark

12
Q/A
• What is the role of the NodeManager in YARN?
• a) Managing and allocating cluster resources

• b) Managing and storing metadata information

• c) Coordinating and scheduling MapReduce jobs

• d) Storing and managing data blocks

Ans: a

13
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 14
THANK YOU

For queries
Email:
[email protected]
15
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Examples of Mapreduce
Second Example
Hadoop version 1.0 which is also referred to as MRV1
(MapReduce Version 1)

• MapReduce performed both processing and resource
management functions. It consisted of a
• Job Tracker, which was the single master. The Job Tracker
allocated the resources, performed scheduling and
monitored the processing jobs. It assigned map and reduce
tasks to a number of subordinate processes called
Task Trackers. The Task Trackers periodically reported their
progress to the Job Tracker.
Limitations

• Limited scalability.
• The utilization of computational resources is inefficient.
• The Hadoop framework was limited to the MapReduce
processing paradigm.
Q/A
• Which phase in the MapReduce process shuffles and sorts the
intermediate key-value pairs?
• a) Map

• b) Reduce

• c) Shuffle

• d) Sort

Ans c) Shuffle

17
Q/A
• Which Hadoop component provides the runtime environment for
executing MapReduce jobs?
• a) Hadoop Distributed File System (HDFS)
• b) Hadoop YARN (Yet Another Resource Negotiator)
• c) Hadoop MapReduce Framework
• d) Hadoop Pig

18
Q/A
• Which programming language is commonly used to write
MapReduce jobs?
• a) Python

• b) Java

• c) C++

• d) JavaScript

Ans b) Java

19
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 20
THANK YOU

For queries
Email:
[email protected]
21
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
YARN
Components of YARN
1. Resource Manager: Runs on a master daemon and manages the resource
allocation in the cluster.

2. Node Manager: Runs on the slave machines and is responsible for
the execution of tasks on every single DataNode.

3. Application Master: Manages the user job lifecycle and resource needs of
individual applications. It works along with the Node Manager and monitors
the execution of tasks.

4. Container: A package of resources including RAM, CPU, network, HDD etc. on
a single node.
Components of YARN
Resource Manager
• It is the ultimate authority in resource allocation.
• On receiving the processing requests, it passes parts of
requests to corresponding node managers accordingly,
where the actual processing takes place.
• It is the arbitrator of the cluster resources and decides the
allocation of the available resources for competing
applications.
• Optimizes the cluster utilization like keeping all resources
in use all the time against various constraints such as
capacity guarantees, fairness, and SLAs.
• It has two major components: a) Scheduler b) Application
Manager
Scheduler
• The scheduler is responsible for allocating resources to the various
running applications subject to constraints of capacities, queues etc.
• It is called a pure scheduler because it performs no monitoring or
tracking of application status.
• It offers no guarantee of restarting failed tasks, whether the failure
is caused by the application or by the hardware.
• Performs scheduling based on the resource requirements of the
applications.
• It has a pluggable policy plug-in, which is responsible for partitioning
the cluster resources among the various applications. There are two
such plug-ins: Capacity Scheduler and Fair Scheduler, which are
currently used as Schedulers in ResourceManager.
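As a brief, hedged illustration of this pluggable policy: in standard Apache Hadoop distributions the scheduler plug-in is selected on the Resource Manager through yarn-site.xml. The property and class names below are the standard ones; which scheduler is the default depends on the distribution.

<!-- yarn-site.xml on the Resource Manager: choose the scheduler plug-in -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- Capacity Scheduler; for the Fair Scheduler use
       org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>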
Application Manager
• It is responsible for accepting job submissions.
• Negotiates the first container from the Resource Manager
for executing the application specific Application Master.
• Manages the running Application Masters in the cluster and provides
the service of restarting the Application Master container on
failure.
Node Manager
• It takes care of individual nodes in a Hadoop cluster and manages
user jobs and workflow on the given node.
• It registers with the Resource Manager and sends heartbeats with
the health status of the node.
• Its primary goal is to manage application containers assigned to it by
the resource manager.
• It keeps itself up to date with the Resource Manager.
• The Application Master requests the assigned container from the Node
Manager by sending it a Container Launch Context (CLC), which
includes everything the application needs in order to run. The Node
Manager creates the requested container process and starts it.
• Monitors resource usage (memory, CPU) of individual containers.
• Performs Log management.
• It also kills the container as directed by the Resource Manager.
Application Master
• An application is a single job submitted to the framework. Each such
application has a unique Application Master associated with it which
is a framework specific entity.
• It is the process that coordinates an application’s execution in the
cluster and also manages faults.
• Its task is to negotiate resources from the Resource Manager and
work with the Node Manager to execute and monitor the component
tasks.
• It is responsible for negotiating appropriate resource containers from
the ResourceManager, tracking their status and monitoring progress.
• Once started, it periodically sends heartbeats to the Resource
Manager to affirm its health and to update the record of its resource
demands.
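A hedged sketch of this negotiation from the Application Master's side, using Hadoop's AMRMClient helper. The host, port, tracking URL, and resource sizes are placeholders, and the code assumes it is already running inside a container that YARN launched as an Application Master; a real AM would also launch the granted containers through the Node Manager.

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMaster {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // Register this Application Master with the Resource Manager
    // (placeholder host, port, and tracking URL).
    rmClient.registerApplicationMaster("", 0, "");

    // Ask for one worker container: 1 GB of memory and 1 vcore.
    Resource capability = Resource.newInstance(1024, 1);
    rmClient.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    // Each allocate() call doubles as a heartbeat to the Resource Manager
    // and returns whatever containers have been granted so far.
    AllocateResponse response = rmClient.allocate(0.1f);
    System.out.println("Containers granted: "
        + response.getAllocatedContainers().size());

    // Unregister when the application has finished.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    rmClient.stop();
  }
}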
Container
• It is a collection of physical resources such as RAM, CPU
cores, and disks on a single node.
• A YARN container is launched from a Container Launch Context (CLC),
the record that describes its life cycle. The CLC contains a map of
environment variables, dependencies stored in remotely accessible
storage, security tokens, the payload for Node Manager services, and
the command necessary to create the container process.
• It grants rights to an application to use a specific amount of
resources (memory, CPU etc.) on a specific host.
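A minimal sketch of building such a record with the YARN Java API. The environment variable, path, and launch command below are hypothetical placeholders; local resources, service data, tokens, and ACLs are left empty for brevity.

import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;

public class ClcSketch {
  public static ContainerLaunchContext buildClc() {
    // Environment variables visible to the launched process (placeholder values).
    Map<String, String> env = Collections.singletonMap("APP_HOME", "/opt/myapp");
    // Dependencies (jars, archives) that the Node Manager would localize; empty here.
    Map<String, LocalResource> localResources = Collections.emptyMap();
    // The command the Node Manager runs to start the container process
    // (com.example.MyTask is a hypothetical class).
    List<String> commands = Collections.singletonList(
        "java -Xmx512m com.example.MyTask 1>stdout 2>stderr");
    // Arguments: localResources, environment, commands, serviceData, tokens, acls.
    return ContainerLaunchContext.newInstance(
        localResources, env, commands, null, null, null);
  }
}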
Application Workflow in Hadoop YARN

1. The client submits an application.

2. The Resource Manager allocates a container to start the Application Master.

3. The Application Master registers with the Resource Manager.

4. The Application Master requests containers from the Resource Manager.

5. The Application Master notifies the Node Manager to launch the containers.

6. The application code is executed in the containers.

7. The client contacts the Resource Manager/Application Master to monitor the
application's status.

8. The Application Master unregisters with the Resource Manager.
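A minimal client-side sketch of steps 1, 2, and 7 of this workflow, using the YarnClient API. The application name and resource sizes are placeholders, and the Application Master's ContainerLaunchContext (see the CLC sketch above) is omitted, so a real submission would also need to set it before the request is accepted.

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitAndMonitor {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Step 1: the client asks the Resource Manager for a new application.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");
    // Resources requested for the Application Master container (step 2).
    appContext.setResource(Resource.newInstance(1024, 1));
    // A complete submission would also call appContext.setAMContainerSpec(...)
    // with a ContainerLaunchContext; omitted here for brevity.

    ApplicationId appId = yarnClient.submitApplication(appContext);

    // Step 7: the client polls the Resource Manager for the application's status.
    ApplicationReport report = yarnClient.getApplicationReport(appId);
    System.out.println("Application state: " + report.getYarnApplicationState());

    yarnClient.stop();
  }
}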


Q/A
• What is YARN in the context of Apache Hadoop?
• a) A data storage system

• b) A distributed computing framework

• c) A query language for big data

• d) A cluster resource management system

Ans: d) A cluster resource management system

29
Q/A
• Which command is used to submit a MapReduce job to YARN for
execution?
• a) hadoop jar

• b) yarn jar

• c) hdfs jar

• d) mapred jar

Ans: b) yarn jar

30
Q/A
Which of the following is a key component of YARN?
a)HDFS (Hadoop Distributed File System)
b) MapReduce
c) YARN Application Manager
d) HBase (Hadoop Database)
Ans: c) YARN Application Manager

31
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Services, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappan, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R, Seifedine Kadry, Amir H. Gandomi,
Wiley publication

8/8/2021 32
THANK YOU

For queries
Email:
[email protected]
33
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
YARN (Continued)
Q/A
What is YARN in the context of Apache Hadoop?
a) Yet Another Real-time Network
b) Yet Another Resource Negotiator
c) Yet Another Redundant Name
d) Yet Another Routing Network

Ans: b) Yet Another Resource Negotiator

29
Q/A
• Which of the following is responsible for monitoring and tracking
the status of YARN applications?
• a) YARN Resource Manager
• b) YARN Node Manager
• c) HDFS NameNode
• d) YARN Application Manager

30
Q/A
What is the primary function of YARN in Hadoop?
a) Data storage and retrieval
b) Cluster management and resource allocation
c) Data processing and analysis
d) Data replication and fault tolerance

Ans: b) Cluster management and resource allocation

31
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Services, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappan, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R, Seifedine Kadry, Amir H. Gandomi,
Wiley publication

8/8/2021 32
THANK YOU

For queries
Email:
[email protected]
33
