Hadoop Class 2 PDF
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large
data sets in a distributed environment.
● Parallel Processing: In MapReduce, the job is divided among multiple nodes, and each node works on its part of the job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm, which helps us process the data on different machines very quickly.
● Data Locality: Instead of moving data to the processing unit, we are moving the processing unit to the data
in the MapReduce Framework. In the traditional system, we used to bring data to the processing unit and
process it. But, as the data grew and became very huge, bringing this huge amount of data to the processing
unit posed the following issues:
○ Moving huge amounts of data to the processing unit is costly and degrades network performance.
○ Processing takes time as the data is processed by a single unit which becomes the bottleneck.
○ The master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data: network traffic stays low, every node works on its own share of the data in parallel, and no single node becomes a bottleneck.
● Partitioner: Each combiner output is partitioned according to the key value; records with the same key go into the same partition, and each partition is then sent to a reducer.
● Shuffling and Sorting: The partitioned output is then shuffled to the reduce node (a normal slave node on which the reduce phase runs, hence called the reducer node). Shuffling is the physical movement of the data over the network. Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.
● Reducer: It takes the set of intermediate key-value pairs produced by the mappers as the input and then runs
a reducer function on each of them to generate the output. The output of the reducer is the final output, which
is stored in HDFS.
● RecordWriter: It writes the output key-value pairs from the Reducer phase to the output files.
● OutputFormat: The way these output key-value pairs are written to the output files by the RecordWriter is determined by the OutputFormat. A minimal job sketch tying these phases together is shown below.
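To make the phases above concrete, here is a minimal WordCount sketch against the Hadoop MapReduce Java API. The class names (WordCount, TokenMapper, HashKeyPartitioner, SumReducer) and the input/output paths are illustrative assumptions, not part of the original material; the partitioner simply mirrors the default hash behaviour to show where a custom Partitioner plugs in.

// Hypothetical WordCount job illustrating the Mapper -> Partitioner -> Shuffle/Sort -> Reducer
// pipeline described above; class names and paths are placeholders.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input split it is given.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Partitioner: records with the same key always land in the same partition,
  // so a single reducer sees every count for a given word.
  public static class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Reducer: sums the values grouped under each key; the RecordWriter/OutputFormat
  // then persist the final (word, count) pairs to HDFS.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);          // combiner runs locally on mapper output
    job.setPartitionerClass(HashKeyPartitioner.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Running it with hadoop jar wordcount.jar WordCount /input /output (jar name and paths assumed) exercises the Mapper, Combiner, Partitioner, shuffle/sort, Reducer, and OutputFormat stages in order.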
Difference between Input Split & Block
A block is the physical division of data in HDFS (128 MB by default in Hadoop 2.x), whereas an input split is the logical division of data that is handed to a single mapper. Splits need not align with block boundaries: one split may cover parts of several blocks, and it is the number of splits, not blocks, that decides how many map tasks run, as the sketch below illustrates.
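A short, hedged sketch of how the two sizes are tuned independently in driver code; the property values and the /data/input path are assumptions for illustration, not recommendations.

// Minimal sketch showing that the HDFS block size and the MapReduce input split size
// are configured separately: blocks are a physical storage unit, splits are a logical
// unit that determines the number of map tasks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitVsBlock {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Physical division: files written with this configuration are stored in 128 MB blocks.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

    Job job = Job.getInstance(conf, "split vs block demo");
    // Logical division: cap each input split at 64 MB, so a single 128 MB block
    // can be processed by two map tasks.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
    FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
  }
}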
YARN
Apache Hadoop YARN (Yet Another Resource Negotiator) is a resource management layer in Hadoop.
YARN came into the picture with the introduction of Hadoop 2.x. It allows various data processing engines
such as interactive processing, graph processing, batch processing, and stream processing to run and
process data stored in HDFS (Hadoop Distributed File System).
Components Of YARN
1. Resource Manager: Resource Manager is the master daemon of YARN. It is responsible for managing the applications running in the cluster, along with the global assignment of resources such as CPU and memory, and it is used for job scheduling. Resource Manager has two components:
a. Scheduler: The Scheduler's task is to allocate resources to the running applications. It deals purely with scheduling and performs no tracking or monitoring of applications.
b. Application Manager: The Application Manager manages the applications running in the cluster. Tasks such as starting the Application Master and monitoring it are handled by the Application Manager.
2. Node Manager: Node Manager is the slave daemon of YARN. It has the following responsibilities:
a. Node Manager monitors each container’s resource usage and reports it to the Resource Manager.
b. The health of the node on which YARN is running is tracked by the Node Manager.
c. It takes care of each node in the cluster while managing the workflow, along with user jobs on a
particular node.
d. It keeps the data in the Resource Manager updated.
e. Node Manager can also destroy or kill the container if it gets an order from the Resource Manager to do
so.
3. Application Master: Every job submitted to the framework is an application, and every application has a specific
Application Master associated with it. Application Master performs the following tasks:
● It coordinates the execution of the application in the cluster, along with managing the faults.
● It negotiates resources from the Resource Manager.
● It works with the Node Manager for executing and monitoring other components’ tasks.
● At regular intervals, it sends heartbeats to the Resource Manager to confirm its health and to update the record of its resource demands.
4. Container: A container is a set of physical resources (CPU cores, RAM, disks, etc.) on a single node. The tasks
of a container are listed below:
● It grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
● A YARN container is launched from a Container Launch Context (CLC). This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process. A sketch of building such a launch context is shown below.
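A hedged sketch of building a Container Launch Context with the YARN Java API; the environment variable, path, and command values are placeholders chosen for illustration.

// The CLC tells a Node Manager which environment variables, local resources, and
// command a container process needs.
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.util.Records;

public class ClcExample {
  public static ContainerLaunchContext buildLaunchContext() {
    ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);

    // Environment variables visible to the process started inside the container.
    clc.setEnvironment(Collections.singletonMap("APP_HOME", "/opt/myapp")); // placeholder

    // Dependencies (jars, config files) fetched from remotely accessible storage such as HDFS.
    clc.setLocalResources(Collections.<String, LocalResource>emptyMap());

    // The command the Node Manager executes to create the container process.
    clc.setCommands(Collections.singletonList(
        "java -Xmx512m com.example.MyTask 1>stdout 2>stderr")); // placeholder command
    return clc;
  }
}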
Running an Application through YARN
1. Application Submission: The client submits the application to the RM, which accepts it and triggers the creation of an ApplicationMaster (AM) instance. The AM is responsible for negotiating resources from the RM and working with the Node Managers (NMs) to execute and monitor the tasks.
2. Resource Request: The AM starts by requesting resources from the RM. It specifies what resources are needed, in which
locations, and other constraints. These resources are encapsulated in terms of "Resource Containers" which include
specifications like memory size, CPU cores, etc.
3. Resource Allocation: The Scheduler in the RM, based on the current system load and capacity, as well as policies (e.g.,
capacity, fairness), allocates resources to the applications by granting containers. The specific strategy depends on the scheduler
type (e.g., FIFO, Capacity Scheduler).
4. Container Launching: Post-allocation, the RM communicates with relevant NMs to launch the containers. The Node Manager
sets up the container's environment, then starts the container by executing the specified commands.
5. Task Execution: Each container then runs the task assigned by the ApplicationMaster. These are actual data processing tasks,
specific to the application's purpose.
6. Monitoring and Fault Tolerance: The AM monitors the progress of each task. If a container fails, the AM requests a new
container from the RM and retries the task, ensuring fault tolerance in the execution phase.
7. Completion and Release of Resources: Upon task completion, the AM releases the allocated containers, freeing up resources.
After all tasks are complete, the AM itself is terminated, and its resources are also released.
8. Finalization: The client polls the RM, or receives a notification, to learn the status of the application. Once informed of the completion, the client retrieves the result and finishes the process (see the client-side sketch below).
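A hedged client-side sketch of the flow above using the YarnClient API: submit an application, then poll the Resource Manager for its terminal state. The application name, AM command, and resource sizes are illustrative assumptions, and the sketch assumes a recent Hadoop release (2.8 or later) where Resource.setMemorySize is available.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();

    // 1. Application submission: ask the RM for a new application id and submission context.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app"); // placeholder name

    // 2./3. Resource request: the container spec for the ApplicationMaster.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "java com.example.MyAppMaster 1>stdout 2>stderr")); // placeholder AM command
    appContext.setAMContainerSpec(amContainer);

    Resource capability = Records.newRecord(Resource.class);
    capability.setMemorySize(1024); // MB for the AM container
    capability.setVirtualCores(1);
    appContext.setResource(capability);

    // 4.-7. The RM schedules the AM container; the AM then requests task containers itself.
    ApplicationId appId = yarnClient.submitApplication(appContext);

    // 8. Finalization: poll the RM until the application reaches a terminal state.
    YarnApplicationState state;
    do {
      Thread.sleep(1000);
      state = yarnClient.getApplicationReport(appId).getYarnApplicationState();
    } while (state != YarnApplicationState.FINISHED
        && state != YarnApplicationState.FAILED
        && state != YarnApplicationState.KILLED);

    System.out.println("Application " + appId + " ended in state " + state);
    yarnClient.stop();
  }
}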