Bda U4
Advantages of MapReduce:
Scalability: Businesses can process petabytes of data stored in the Hadoop Distributed File System (HDFS).
Flexibility: Hadoop enables easier access to multiple sources of data and multiple types of data.
Speed: With parallel processing and minimal data movement, Hadoop offers fast processing of massive amounts of data.
Simplicity: Developers can write code in a choice of languages, including Java, C++ and Python.
Cost Reduction: Because MapReduce is highly scalable, it reduces storage and processing costs as data requirements grow.
Fault Tolerance: Because of its distributed nature, MapReduce is highly fail-safe. The distributed file systems that support MapReduce, together with the framework itself, allow MapReduce jobs to survive hardware failures.
Architecture Of MapReduce
MapReduce has two main phases: the Map phase and the Reduce phase. Running a job, however, involves five steps, described in detail below.
The whole process is illustrated in the figure. At the highest level, there are five independent entities:
The client, which submits the MapReduce job.
The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
The YARN node managers, which launch and monitor the compute containers on machines in
the cluster.
The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
The distributed filesystem, which is used for sharing job files between the other entities.
1. Job Submission:
The submit() method on Job creates an internal JobSubmitter instance
and calls submitJobInternal() on it.
Having submitted the job, waitForCompletion polls the job’s progress once per second and
reports the progress to the console if it has changed since the last report.
When the job completes successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
Asks the resource manager for a new application ID, used for the MapReduce job ID.
Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
Copies the resources needed to run the job, including the job JAR file, the configuration file, and
the computed input splits, to the shared filesystem in a directory named after the job ID.
Submits the job by calling submitApplication() on the resource manager.
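As a hedged illustration of the client side of this process (a sketch written for these notes, not taken from the original; TokenizerMapper and IntSumReducer are placeholder names for a job's own mapper and reducer classes), a typical driver that triggers the submission path described above looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);        // the job JAR copied to the shared filesystem
        job.setMapperClass(TokenizerMapper.class);       // placeholder mapper class
        job.setReducerClass(IntSumReducer.class);        // placeholder reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits are computed from this path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist, or submission fails
        // waitForCompletion() calls submit() internally, then polls progress once per second
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}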
2. Job Initialization :
When the resource manager receives a call to its submitApplication() method, it hands off
the request to the YARN scheduler.
The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management.
The application master for MapReduce jobs is a Java application whose main class is MRAppMaster .
It initializes the job by creating a number of bookkeeping objects to keep track of the job’s
progress, as it will receive progress and completion reports from the tasks.
It retrieves the input splits computed in the client from the shared filesystem.
It then creates a map task object for each split, as well as a number of reduce task objects
determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on
Job).
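For example (a brief, hedged snippet continuing the driver sketch above; the value 4 is arbitrary), the number of reduce task objects created here can be controlled in either of two equivalent ways:

// One map task is created per input split regardless of this setting;
// mapreduce.job.reduces controls only the number of reduce task objects.
job.setNumReduceTasks(4);                    // API call on the Job
conf.setInt("mapreduce.job.reduces", 4);     // equivalent configuration property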
3. Task Assignment:
If the job does not qualify for running as an uber task, then the application master requests containers for all the map and reduce tasks in the job from the resource manager.
Requests for map tasks are made first and with a higher priority than those for reduce tasks,
since all the map tasks must complete before the sort phase of the reduce can start.
Requests for reduce tasks are not made until 5% of map tasks have completed.
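This 5% threshold corresponds to the reduce slow-start setting. As a hedged example (assuming the Hadoop 2.x property name mapreduce.job.reduce.slowstart.completedmaps, whose usual default is 0.05), it can be tuned on the job's Configuration:

// Fraction of map tasks that must complete before container requests
// for reduce tasks are made (0.05 = 5%).
conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);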
4. Task Execution:
Once a task has been assigned resources for a container on a particular node by the resource
manager’s scheduler, the application master starts the container by contacting the node
manager.
The task is executed by a Java application whose main class is YarnChild. Before it can run the task,
it localizes the resources that the task needs, including the job configuration and JAR file, and
any files from the distributed cache.
Finally, it runs the map or reduce task.
Streaming:
Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it.
The Streaming task communicates with the process (which may be written in any language)
using standard input and output streams.
During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
From the node manager’s point of view, it is as if the child process ran the map or reduce code itself.
5. Progress and status updates :
MapReduce jobs are long-running batch jobs, taking anything from tens of seconds to hours to run.
A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job’s counters, and a status message or description (which may be set by user code).
When a task is running, it keeps track of its progress (i.e., the proportion of the task that is completed).
For map tasks, this is the proportion of the input that has been processed.
For reduce tasks, it’s a little more complex, but the system can still estimate the proportion of
the reduce input processed.
It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle.
As the map or reduce task runs, the child process communicates with its parent application
master through the umbilical interface.
The task reports its progress and status (including counters) back to its application master,
which has an aggregate view of the job, every three seconds over the umbilical interface.
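Since the status message and counters mentioned above can be set from user code and flow back over the umbilical interface, here is a hedged, illustrative mapper (the class name and the counter enum are invented for this example):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StatusReportingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Hypothetical counter group for this example
    enum RecordQuality { GOOD, MALFORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().isEmpty()) {
            context.getCounter(RecordQuality.MALFORMED).increment(1); // counters are part of the task status
            return;
        }
        context.getCounter(RecordQuality.GOOD).increment(1);
        context.setStatus("processing offset " + key.get());          // user-set status message
        context.write(value, new LongWritable(1));
    }
}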
MapReduce Example
A very famous example of MapReduce is the word count example.
First, we divide the input into three splits as shown in the figure. This will distribute the work
among all the map nodes.
Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, occurs once.
Now, a list of key-value pairs will be created where the key is nothing but the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
Now, each reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list [1,1] for the key Bear. Then, it counts the number of ones in the list and gives the final output as Bear, 2.
Finally, all the output key/value pairs are collected and written to the output file.
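As a concrete sketch of this example (a standard word count implementation written for these notes, not taken from the original figure), the mapper and reducer could look like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, 1) for every token in the input line
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // e.g. (Dear, 1), (Bear, 1), (River, 1)
        }
    }
}

// Sums the list of ones for each word, e.g. Bear, [1,1] -> (Bear, 2)
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}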
Failures in MapReduce
***[ A very important note here: the failures mentioned below are classical MapReduce failures. But if the question asks about YARN MapReduce failures, then write the following instead:
For MapReduce programs running on YARN, we need to consider the failure of any of the following entities: the task, the application master, the node manager, and the resource manager.]***
In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using
Hadoop is its ability to handle such failures and allow your job to complete. There are generally 3 types of
failures in MapReduce.
Task Failure
TaskTracker Failure
JobTracker Failure
1. Task Failure
In Hadoop, task failure is similar to an employee making a mistake while doing a task. Consider you are
working on a large project that has been broken down into smaller jobs and assigned to different
employees in your team. If one of the team members fails to do their task correctly, the entire project may be
compromised. Similarly, in Hadoop, if a job fails due to a mistake or issue, it could affect overall data
processing, causing delays or faults in the final result.
Reasons for Task Failure
1. Limited memory
2. Failures of disk
3. Issues with software or hardware
How to Overcome Task Failure
1. Increase memory allocation
2. Implement fault tolerance mechanisms
3. Regularly update software and hardware
2. TaskTracker Failure
A TaskTracker in Hadoop is similar to an employee responsible for executing certain tasks in a large project. If
a TaskTracker fails, it signifies a problem occurred while an employee worked on their assignment. This
can interrupt the entire project, much as when a team member makes a mistake or encounters difficulties
with their task, producing delays or problems with the overall project's completion. To avoid TaskTracker
failures, ensure the TaskTracker's hardware and software are in excellent working order and have the
resources they need to do their jobs successfully.
Reasons for TaskTracker Failure
1. Hardware issues
2. Software problems or errors
3. Overload or resource exhaustion
How to Overcome TaskTracker Failure
1. Update software and hardware on a regular basis
2. Upgrade or replace hardware
3. Restart or reinstall the program
3. JobTracker Failure
A JobTracker in Hadoop is similar to a supervisor or manager that oversees the entire project and assigns
tasks to TaskTrackers (employees). If a JobTracker fails, it signifies the supervisor is experiencing a
problem or has stopped working properly. This can interrupt the overall project's coordination and
development, much as when a supervisor is unable to assign tasks or oversee their completion. To avoid
JobTracker failures, it is critical to maintain the JobTracker's hardware and software, ensure adequate
resources, and fix any issues or malfunctions as soon as possible to keep the project going smoothly.
Reasons for JobTracker Failure
1. Database connectivity
2. Security problems
How to Overcome JobTracker Failure
1. Ensure reliable database connectivity
2. Address security-related problems promptly
Failures in YARN
For MapReduce programs running on YARN, we need to consider the failure of any of the
following entities: the task, the application master, the node manager, and the resource
manager.
1. Task Failure
Failure of the running task is similar to the classic case. Runtime exceptions and sudden exits of the
JVM are propagated back to the application master and the task attempt is marked as failed.
Likewise, hanging tasks are noticed by the application master by the absence of a ping over the umbilical channel (the timeout is set by mapreduce.task.timeout), and again the task attempt is marked as failed.
The configuration properties for determining when a task is considered to be failed are the same as
the classic case: a task is marked as failed after four attempts (set by mapreduce.map.maxattempts
for map tasks and mapreduce.reduce.maxattempts for reducer tasks). A job will be failed if more
than mapreduce.map.failures.maxpercent percent of the map tasks in the job fail, or more than
mapreduce.reduce.failures.maxpercent percent of the reduce tasks fail.
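As a hedged snippet (using the property names quoted above on a job Configuration; the attempt counts are the defaults mentioned in the text, and the 5% failure tolerance is an arbitrary example value), these thresholds can be set per job:

// Number of attempts before a task is marked as failed (4 is the default)
conf.setInt("mapreduce.map.maxattempts", 4);
conf.setInt("mapreduce.reduce.maxattempts", 4);
// Percentage of map/reduce tasks allowed to fail before the whole job is failed
conf.setInt("mapreduce.map.failures.maxpercent", 5);
conf.setInt("mapreduce.reduce.failures.maxpercent", 5);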
2. Application Master Failure
Just as MapReduce tasks are given several attempts to succeed (in the face of hardware or network failures), applications in YARN are retried multiple times in the event of failure. By default, applications are marked as failed if they fail once, but this can be increased by setting the property yarn.resourcemanager.am.max-retries.
An application master sends periodic heartbeats to the resource manager, and in the event of
application master failure, the resource manager will detect the failure and start a new instance of
the master running in a new container (managed by a node manager). In the case of the
MapReduce application master, it can recover the state of the tasks that had already been run by
the (failed) application, so they don't have to be rerun. By default, recovery is not enabled, so a failed application master will rerun all of its tasks, but you can turn recovery on by setting yarn.app.mapreduce.am.job.recovery.enable to true.
The client polls the application master for progress reports, so if its application master fails the client
needs to locate the new instance. During job initialization the client asks the resource manager for the
application master's address, and then caches it, so it doesn't overload the resource manager with
a request every time it needs to poll the application master. If the application master fails, however,
the client will experience a timeout when it issues a status update, at which point the client will go
back to the resource manager to ask for the new application master's address.
3. Node Manager Failure
If a node manager fails, any task or application master running on that node manager will be recovered using the mechanisms described in the previous two sections.
Node managers may be blacklisted if the number of failures for the application is high. Blacklisting is
done by the application master, and for MapReduce the application master will try to reschedule
tasks on different nodes if more than three tasks fail on a node manager. The threshold may be set with mapreduce.job.maxtaskfailures.per.tracker.
Job Scheduling
Job scheduling is an important part of MapReduce, as it determines the order in which jobs are executed and
the resources that are allocated to them. There are a number of different job scheduling algorithms that
can be used in MapReduce, each with its own advantages and disadvantages.
There are mainly 3 types of Schedulers in Hadoop:
1. FIFO (First In First Out) Scheduler.
2. Capacity Scheduler.
3. Fair Scheduler.
1. FIFO Scheduler
As the name suggests, FIFO (First In First Out) means the task or application that comes first is served first. This is the default scheduler used in Hadoop. The tasks are placed in a queue and performed in their submission order. In this method, once a job is scheduled, no intervention is allowed, so a high-priority job sometimes has to wait a long time because task priority does not matter in this method.
Advantages:
No need for configuration
First Come First Serve
Simple to execute
Disadvantages:
Priority of tasks doesn’t matter, so high-priority jobs need to wait
Not suitable for shared clusters
2. Capacity Scheduler
In the Capacity Scheduler we have multiple job queues for scheduling our tasks. The Capacity Scheduler allows multiple tenants to share a large Hadoop cluster. In the Capacity Scheduler, each job queue is provided with some slots or cluster resources for performing its jobs, and each job queue has its own slots to perform its tasks. If only one queue has tasks to perform, its tasks can also use the slots of other queues while those slots are free; when a new task enters another queue, the jobs running in that queue's own slots are replaced by its own jobs.
The Capacity Scheduler also provides a level of abstraction for knowing which tenant is using more cluster resources or slots, so that a single user or application does not take a disproportionate or unnecessary share of slots in the cluster. The Capacity Scheduler mainly contains three types of queues, root, parent, and leaf, which represent the cluster, an organization or subgroup, and the point of application submission, respectively.
Advantages:
Best for working with multiple clients or priority jobs in a Hadoop cluster
Maximizes throughput in the Hadoop cluster
Disadvantages:
More complex
Not easy to configure for everyone
3. Fair Scheduler
The Fair Scheduler is very similar to the Capacity Scheduler. The priority of the job is taken into consideration. With the help of the Fair Scheduler, YARN applications can share the resources of a large Hadoop cluster, and these resources are maintained dynamically, so there is no need to configure capacity in advance. The resources are distributed in such a manner that all applications within a cluster get an equal amount of time. The Fair Scheduler makes scheduling decisions on the basis of memory; it can also be configured to work with CPU.
As noted, it is similar to the Capacity Scheduler, but the major thing to notice is that in the Fair Scheduler, whenever a high-priority job arrives in the same queue, it is processed in parallel by taking over some portion of the already dedicated slots.
Advantages:
Resources assigned to each application depend upon its priority.
It can limit the number of concurrently running tasks in a particular pool or queue.
Disadvantages:
Configuration is required.
Configuration Tuning
Various configuration settings can be adjusted to optimize the shuffle and sort phase for improved performance.
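The specific settings are not listed in these notes. As a hedged illustration (assuming commonly used Hadoop 2.x shuffle-related property names, which should be checked against the cluster's documentation), such tuning might look like this on a job Configuration:

// Memory buffer (MB) used when sorting map output; a larger buffer means fewer spills to disk
conf.setInt("mapreduce.task.io.sort.mb", 200);
// Number of map output streams merged at once during the sort
conf.setInt("mapreduce.task.io.sort.factor", 50);
// Number of parallel copier threads a reducer uses to fetch map output
conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);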
Task Execution
Task Execution Environment:
Hadoop provides information to map or reduce tasks about their execution environment.
Properties like mapred.job.id, mapred.tip.id, mapred.task.id, mapred.task.partition,
and mapred.task.is.map can be accessed from the job's configuration.
Streaming programs can retrieve these properties as environment variables, where non-alphanumeric characters are replaced with underscores.
Environment variables can also be set for Streaming processes using the -cmdenv option.
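As a hedged illustration (the class name is invented; the property names are the ones listed above), a task can read these values from its configuration, for example in a mapper's setup() method:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that logs its execution-environment properties in setup()
public class EnvAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) {
        String jobId  = context.getConfiguration().get("mapred.job.id");
        String taskId = context.getConfiguration().get("mapred.task.id");
        boolean isMap = context.getConfiguration().getBoolean("mapred.task.is.map", true);
        System.out.println("task " + taskId + " of job " + jobId + " (map=" + isMap + ")");
    }
}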
Speculative Execution:
Speculative execution is a feature in MapReduce that launches backup tasks for slow-running tasks.
It aims to reduce job execution time by running redundant tasks in parallel.
Speculative execution does not launch two duplicate tasks at the same time so that they race each other; a speculative task is launched only after all the tasks for a job have been launched.
Speculative tasks are only launched for tasks that have been running for some time and have
made slower progress compared to other tasks.
When a task completes successfully, any duplicate tasks running are killed.
Speculative execution can be enabled or disabled independently for map tasks and reduce
tasks, either on a cluster-wide or per-job basis.
It is turned on by default, but it can be disabled to improve cluster efficiency or for non-idempotent tasks.
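As a hedged example (assuming the Hadoop 2.x property names mapreduce.map.speculative and mapreduce.reduce.speculative), speculative execution can be toggled per job on its Configuration:

// Disable speculative execution for map tasks (e.g. for non-idempotent tasks)
conf.setBoolean("mapreduce.map.speculative", false);
// Leave it enabled for reduce tasks (it is on by default for both)
conf.setBoolean("mapreduce.reduce.speculative", true);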
Output Committers:
Hadoop MapReduce uses OutputCommitters to ensure jobs and tasks either succeed or fail cleanly.
The OutputCommitter API provides methods for setup, commit, and abort operations.
The setupJob() method is called before the job runs and is typically used for initialization.
The commitJob() method is called if the job succeeds, and it performs cleanup operations
like deleting temporary working spaces and creating a _SUCCESS marker file.
The abortJob() method is called if the job fails or is killed, and it performs cleanup operations
as well.
The setupTask() method is called before each task runs and is used for any necessary setup.
The commitTask() and abortTask() methods are called for each task to commit or abort the
task's outputs.
The needsTaskCommit() method determines if the commit phase for tasks is necessary, and it
can be disabled to save resources.
OutputCommitters allow customization and can be overridden or implemented differently for specific
requirements, such as special setup or cleanup operations.
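As a hedged sketch of this API (a made-up committer that only logs; a real committer would manage temporary task output and promote it on commit), a custom OutputCommitter might look like this:

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.JobStatus;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class LoggingOutputCommitter extends OutputCommitter {
    @Override
    public void setupJob(JobContext context) throws IOException {
        System.out.println("setupJob: create working/output directories here");
    }
    @Override
    public void commitJob(JobContext context) throws IOException {
        System.out.println("commitJob: job succeeded, clean up temporary space, write success marker");
    }
    @Override
    public void abortJob(JobContext context, JobStatus.State state) throws IOException {
        System.out.println("abortJob: job failed or was killed, cleaning up");
    }
    @Override
    public void setupTask(TaskAttemptContext context) throws IOException { }
    @Override
    public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
        return false;   // nothing to commit per task in this sketch, so the commit phase is skipped
    }
    @Override
    public void commitTask(TaskAttemptContext context) throws IOException { }
    @Override
    public void abortTask(TaskAttemptContext context) throws IOException { }
}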
The context objects are used for emitting key-value pairs, so they are parameterized by the output types,
so that the signature of the write() method is:
public void write(KEYOUT key, VALUEOUT value) throws IOException, InterruptedException
If a combine function is used, then it is the same form as the reduce function (and is an implementation of
Reducer), except its output types are the intermediate key and value types (K2 and V2), so they can feed
the reduce function:
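The code that followed this sentence is not reproduced in these notes. As a hedged, concrete illustration: in the word count sketch shown earlier, the reducer's input and output types are both (Text, IntWritable), i.e. the intermediate (K2, V2) types, so the same class can be set as the combiner:

// IntSumReducer consumes and produces (Text, IntWritable), so its output
// can feed the reduce function; it can therefore double as the combiner.
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);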
MapReduce formats
Hadoop can process many different types of data formats, from flat text files to databases. In this section,
we explore the different formats available.
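The detailed format descriptions are not reproduced in these notes. As a hedged illustration of how a job selects its formats (continuing the driver sketch shown earlier, before waitForCompletion() is called; the choice of classes here is arbitrary):

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

job.setInputFormatClass(TextInputFormat.class);    // line-oriented text input (the default)
job.setOutputFormatClass(TextOutputFormat.class);  // tab-separated key-value text output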