
HOW MAP REDUCE WORKS

Anatomy of a MapReduce Job Run


• Jobs are submitted by calling waitForCompletion() on Job.
• Properties that decide how the job runs:
– mapred.job.tracker:
• Set in the configuration file mapred-site.xml.
• The default is local (the job and task trackers run in a single
JVM).
• Set it to the host:port of the jobtracker for distributed mode
(the jobtracker and tasktrackers run in separate JVMs, whether
on a single node or across a cluster).
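As a minimal sketch of how this looks in code (the class name, paths, and jobtracker address below are illustrative, not taken from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExampleDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // default is "local"; point at a jobtracker host:port for distributed mode
    conf.set("mapred.job.tracker", "jthost:8021");

    Job job = new Job(conf, "example job");
    job.setJarByClass(ExampleDriver.class);
    // job.setMapperClass(...) and job.setReducerClass(...) would go here
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // submits the job, then polls its progress once per second
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}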
There are two implementations of MapReduce:
– Classic MapReduce (MapReduce 1)
– YARN (MapReduce 2)
The work done during a MapReduce job run can be divided
into six steps:
– Job Submission
– Job Initialization
– Task Assignment
– Task Execution
– Progress and Status Updates
– Job Completion
Classic Map Reduce (Map Reduce 1)

•The client, which submits the MapReduce job.

•The jobtracker, which coordinates the job run. The jobtracker is a
Java application whose main class is JobTracker.

•The tasktrackers, which run the tasks that the job has been split into.
Tasktrackers are Java applications whose main class is TaskTracker.

•The distributed file system (normally HDFS), which is used for
sharing job files between the other entities.
Job Submission
•The submit() method on Job creates an internal
JobSubmitter instance and calls
submitJobInternal() on it. Having submitted the job,
waitForCompletion() polls the job's progress once per
second and reports the progress to the console if it has
changed since the last report.

•When the job is complete, if it was successful, the job
counters are displayed. Otherwise, the error that
caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the
following:
•Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).

•Checks the output specification of the job. For example, if the output directory has not
been specified or it already exists, the job is not submitted and an error is thrown to the
MapReduce program.

•Computes the input splits for the job. If the splits cannot be computed (because the
input paths don't exist, for example), the job is not submitted and an error is thrown to
the MapReduce program.

•Copies the resources needed to run the job, including the job JAR file, the configuration
file, and the computed input splits, to the jobtracker's file system in a directory named
after the job ID.

•The job JAR is copied with a high replication factor (controlled by the
mapred.submit.replication property, which defaults to 10) so that there are lots of copies
across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
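The output check behaves roughly like this sketch; it is illustrative of what FileOutputFormat.checkOutputSpecs() does, not the actual Hadoop source:

// given a configured Job named job:
Path outDir = FileOutputFormat.getOutputPath(job);
if (outDir == null) {
  throw new InvalidJobConfException("Output directory not set");
}
if (outDir.getFileSystem(job.getConfiguration()).exists(outDir)) {
  // the job is never submitted; the error propagates to the MapReduce program
  throw new FileAlreadyExistsException("Output directory " + outDir + " already exists");
}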
Job Initialization
•When the JobTracker receives a call to its
submitJob() method, it puts it into an internal
queue from where the job scheduler will pick it
up and initialize it.
•Initialization involves creating an object to
represent the job being run, which encapsulates
its tasks, and bookkeeping information to keep
track of the tasks’ status and progress.
Task Assignment

•Tasktrackers run a simple loop that periodically sends
heartbeat method calls to the jobtracker. Heartbeats tell the
jobtracker that a tasktracker is alive, but they also double as a
channel for messages.

•As part of the heartbeat, a tasktracker will indicate whether
it is ready to run a new task, and if it is, the jobtracker will
allocate it a task, which it communicates to the tasktracker
using the heartbeat return value.

•Tasktrackers have a fixed number of slots for map tasks and for
reduce tasks: for example, a tasktracker may be able to run two
map tasks and two reduce tasks simultaneously.
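The slot counts are fixed in each tasktracker's configuration (normally in its mapred-site.xml); a sketch of the relevant MapReduce 1 properties, shown via the Configuration API for brevity, with illustrative values:

Configuration conf = new Configuration();
// number of map and reduce tasks a tasktracker may run simultaneously
conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);
conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);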
Task Execution
•Now that the tasktracker has been assigned a task,
the next step is for it to run the task. First, it
localizes the job JAR by copying it from the shared
file system to the tasktracker's file system.

•TaskRunner launches a new Java Virtual Machine
to run each task in, so that any bugs in the user-defined
map and reduce functions don't affect the
tasktracker (by causing it to crash or hang, for
example). It is, however, possible to reuse the JVM
between tasks.
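JVM reuse is controlled by a job property; a sketch (the value is illustrative):

// given the job's Configuration conf;
// 1 (the default) means one JVM per task, -1 means unlimited reuse
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);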
Progress and Status Updates

•MapReduce jobs are long-running batch jobs,
taking anything from minutes to hours to run.
Because this is a significant length of time, it's
important for the user to get feedback on how the
job is progressing.

•A job and each of its tasks have a status, which
includes such things as the state of the job or task
(e.g., running, successfully completed, failed), the
progress of maps and reduces, the values of the
job's counters, and a status message or description.
What Constitutes Progress in MapReduce?

Progress reporting is important, as it means Hadoop will not fail
a task that's making progress.

All of the following operations constitute progress:
• Reading an input record (in a mapper or reducer)
• Writing an output record (in a mapper or reducer)
• Setting the status description on a reporter (using Reporter’s
setStatus() method)
• Incrementing a counter (using Reporter’s incrCounter()
method)
• Calling Reporter’s progress() method
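A sketch of these calls in an old-API mapper; the class, counter enum, and logic are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ProgressMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  enum MyCounters { RECORDS_SEEN } // hypothetical counter group

  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    reporter.setStatus("processing offset " + key);     // counts as progress
    reporter.incrCounter(MyCounters.RECORDS_SEEN, 1);   // counts as progress
    reporter.progress();                                // explicit progress call
    output.collect(value, key);                         // writing output: progress
  }
}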
Streaming and Pipes
•The core idea is to make data processing independent of the language we use.

•Hadoop Streaming is the ability of Hadoop to interface with map and reduce
programs written in languages such as Ruby and Python. It uses Unix standard streams
as the interface between Hadoop and our program. In Streaming there is no driver class.

•The Streaming task communicates with the process (which may be written in any
language) using standard input and output streams.

•The Pipes task, on the other hand, listens on a socket (a communication channel to
the tasktracker) and passes the C++ process a port number in its environment, so
that on startup, the C++ process can establish a persistent socket connection back to
the parent Java Pipes task.
Job Completion
•When the jobtracker receives a notification that the last task for
a job is complete (this will be the special job cleanup task), it
changes the status for the job to “successful.”

•Then, when the Job polls for status, it learns that the job has
completed successfully, so it prints a message to tell the user and
then returns from the waitForCompletion() method.

•The jobtracker also sends an HTTP job notification if it is
configured to do so. This can be configured by clients wishing to
receive callbacks, via the job.end.notification.url property.
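A sketch of setting the callback; the URL is hypothetical, and $jobId and $jobStatus are placeholders the framework substitutes:

// given the job's Configuration conf
conf.set("job.end.notification.url",
    "http://myserver:8080/notify?jobId=$jobId&status=$jobStatus");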
YARN (Map Reduce 2)

For very large clusters, in the region of 4,000
nodes and higher, the MapReduce system
described in the previous section begins to hit
scalability bottlenecks, so in 2010 a group at
Yahoo! began to design the next generation of
MapReduce.
Advantage
•YARN addresses the scalability shortcomings of "classic" MapReduce by splitting
the responsibilities of the jobtracker into separate entities.

•The jobtracker takes care of both job scheduling (matching tasks with
tasktrackers) and task progress monitoring (keeping track of tasks, restarting
failed or slow tasks, and doing task bookkeeping such as maintaining counter
totals).

•YARN also gives better memory utilization with the concept of containers
(in MapReduce 1 there are fixed map and reduce slots, which cannot be
repurposed for the other task type).
Roles

YARN separates these two roles into two
independent daemons: a resource manager
(job scheduling) to manage the use of resources
across the cluster, and an application master
(task monitoring) to manage the lifecycle of
applications running on the cluster.
MapReduce on YARN involves more entities than classic
MapReduce:
The client, which submits the MapReduce job.

The YARN resource manager, which coordinates the allocation of compute
resources on the cluster.

The YARN node managers, which launch and monitor the compute
containers on machines in the cluster.

The MapReduce application master, which coordinates the tasks running
the MapReduce job. The application master and the MapReduce tasks run
in containers that are scheduled by the resource manager and managed by
the node managers.

The distributed file system (normally HDFS), which is
used for sharing job files between the other entities.
Job Submission

Jobs are submitted in MapReduce 2 using the same user API
as MapReduce 1.

MapReduce 2 has an implementation of ClientProtocol that
is activated when mapreduce.framework.name is set to yarn.

The submission process is very similar to the classic
implementation.

The new job ID is retrieved from the resource manager
(rather than the jobtracker), although in the nomenclature of
YARN it is an application ID.
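Switching frameworks is a one-line configuration change; a sketch:

// given the job's Configuration conf;
// valid values are "local" (the default), "classic", and "yarn"
conf.set("mapreduce.framework.name", "yarn");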
Splits
The job client checks the output specification of
the job and computes the input splits.
Job Initialization

When the resource manager receives a call to its
submitApplication() method, it hands off the request to
the scheduler.

The scheduler allocates a container, and the
resource manager then launches the application
master's process there, under the node
manager's management.

The application master for MapReduce jobs is a
Java application whose main class is MRAppMaster.

It initializes the job by creating a number of
bookkeeping objects to keep track of the job's
progress, as it will receive progress and
completion reports from the tasks.
Contd.
Next, it retrieves the input splits, computed by
the client, from the shared file system.

It then creates a map task object for each split,
and a number of reduce task objects
determined by the mapreduce.job.reduces
property.
The next thing the application master does is decide
how to run the tasks that make up the MapReduce
job.

If the job is small, the application master may choose
to run the tasks in the same JVM as itself. This happens
when it judges that the overhead of allocating new
containers and running tasks in them outweighs the
gain of running them in parallel, compared to
running them sequentially on one node. Such a job is
said to be uberized, or run as an uber task.
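Uber task behaviour is governed by a handful of properties; a sketch, given the job's Configuration conf, with the usual defaults noted in comments:

conf.setBoolean("mapreduce.job.ubertask.enable", true); // off by default
conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // default: at most 9 maps
conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // default: at most 1 reduce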
Task Assignment
If the job does not qualify for running as an uber
task, then the application master requests
containers for all the map and reduce tasks in
the job from the resource manager.

Each request, which is piggybacked on a
heartbeat call, includes information about the
map task's data locality, in particular the hosts
and corresponding racks that its input split
resides on.
Task Execution

Once a task has been assigned a container by the


resource manager’s scheduler, the application
master starts the container by contacting the node
manager.

The task is executed by a Java application whose


main class is Yarn Child. Before it can run the task it
localizes the resources that the task needs, including
the job configuration and JAR file, and any files from
the distributed cache Finally, it runs the map or
reduce task
Contd.
YarnChild runs in a dedicated JVM, for the
same reason that tasktrackers spawn new JVMs
for tasks in MapReduce 1: to isolate user code
from long-running system daemons. Unlike
MapReduce 1, however, YARN does not support
JVM reuse, so each task runs in a new JVM.
Progress and Status Updates

When running under YARN, the task reports its
progress and status (including counters) back to
its application master every three seconds (over
the umbilical interface), which has an aggregate
view of the job.

The client polls the application master every
second (set via
mapreduce.client.progressmonitor.pollinterval)
to receive progress updates, which are usually
displayed to the user.
Job Completion

As well as polling the application master for
progress, every five seconds the client checks
whether the job has completed, using the
waitForCompletion() method on Job.
Failures
Failures in Classic MapReduce:
1. Task Failure
2. Tasktracker Failure
3. Jobtracker Failure
Task Failure
• User code may run into an infinite loop. The tasktracker may observe that the
task has made no progress over a period of time and mark the task attempt as
failed. The observation window is set by the property mapred.task.timeout; it
can be set to zero, in which case the tasktracker will never fail a long-running
task.
• For runtime errors, the reported error is kept in the user logs by the
tasktracker.
• There are rare cases where the JVM hits a bug while the map/reduce code
runs; in that case the task JVM can crash suddenly, and the tasktracker notices
that the process has exited.
• The jobtracker is notified of the failure and reschedules the task on a
different tasktracker.
• The number of attempts made for a map task is governed by the property
mapred.map.max.attempts (default 4), and similarly for a reduce task by
mapred.reduce.max.attempts (default 4).
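A sketch of the classic (MapReduce 1) properties mentioned above, with their defaults, given the job's Configuration conf:

// fail a task attempt that reports no progress for 10 minutes (0 disables this)
conf.setLong("mapred.task.timeout", 600000);
// attempts allowed before the task (and hence the job) is declared failed
conf.setInt("mapred.map.max.attempts", 4);
conf.setInt("mapred.reduce.max.attempts", 4);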
Task Failure

The most common way that this happens is
when user code in the map or reduce task
throws a runtime exception.

If this happens, the child JVM reports the error
back to its parent tasktracker before it exits.
The error ultimately makes it into the user logs.
The tasktracker marks the task attempt as
failed, freeing up a slot to run another task.
Reason 2
For Streaming tasks, if the Streaming process
exits with a nonzero exit code, it is marked as
failed.

Another failure mode is the sudden exit of the
child JVM: perhaps there is a JVM bug that
causes the JVM to exit for a particular set of
circumstances exposed by the MapReduce user
code.
Reason 3

Hanging tasks are dealt with differently. The
tasktracker notices that it hasn't received a progress
update for a while and proceeds to mark the
task as failed.

The child JVM process will be automatically
killed after this period.
Tasktracker Failure
• When the jobtracker stops receiving heartbeats from a tasktracker,
it concludes that the tasktracker is dead and reschedules its tasks on
other tasktrackers.
• Even completed map tasks are rerun, as their results were written
to the local disk of the failed tasktracker and are lost with it.
• The jobtracker removes the tasktracker from the available pool. If the
number of tasks that fail on a tasktracker crosses a threshold, it
gets blacklisted and is removed from the available pool of
tasktrackers.
• The threshold is set by the property mapred.max.tracker.failures; a
blacklisted tasktracker rejoins the pool on restart or after a certain
period of time.
Tasktracker Failure
•Failure of a tasktracker is another failure mode. If a tasktracker
fails by crashing, or by running very slowly, it will stop sending
heartbeats to the jobtracker (or send them very infrequently).

•The jobtracker will notice a tasktracker that has stopped
sending heartbeats (if it hasn't received one for 10 minutes,
configured via the mapred.tasktracker.expiry.interval property,
in milliseconds) and remove it from its pool of tasktrackers to
schedule tasks on.
Tasktracker Failure
•A tasktracker can also be blacklisted by the jobtracker, even if
the tasktracker has not failed.

•If more than four tasks from the same job fail on a particular
tasktracker (set by mapred.max.tracker.failures), then the
jobtracker records this as a fault.

•A tasktracker is blacklisted if the number of faults is over some
minimum threshold (four, set by mapred.max.tracker.blacklists)
and is significantly higher than the average number of faults for
tasktrackers in the cluster.
Jobtracker Failure

•Failure of the jobtracker is the most serious failure mode.
Hadoop has no mechanism for dealing with failure of the
jobtracker: it is a single point of failure.

•It is recommended to run the jobtracker on reliable hardware,
so as to avoid this scenario as much as possible. All jobs that
were in progress must be resubmitted once the jobtracker is
brought up again.

•However, this failure mode has a low chance of occurring,
since the chance of a particular machine failing is low.
Failures in MapReduce 2 (YARN):

1. Task Failure
2. Application Master Failure
3. Node Manager Failure
4. Resource Manager Failure
Task Failure
•Failure of a running task is similar to the classic case. Runtime exceptions
and sudden exits of the JVM are propagated back to the application master,
and the task attempt is marked as failed.

•The configuration properties for determining when a task is considered to
have failed are the same as in the classic case: a task is marked as failed after
four attempts (set by mapreduce.map.maxattempts for map tasks and
mapreduce.reduce.maxattempts for reduce tasks).

•In jobs that process huge amounts of data with a large number of tasks, the
failure of a few tasks may be acceptable and need not mark the complete job
as failed.

•mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent
are the properties that set the acceptable percentage of failed tasks before
declaring the job as failed.
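A sketch of tolerating a fraction of failed tasks, given the job's Configuration conf (the percentage is illustrative):

// allow up to 5% of map or reduce tasks to fail without failing the job
conf.setInt("mapreduce.map.failures.maxpercent", 5);
conf.setInt("mapreduce.reduce.failures.maxpercent", 5);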
Application Master Failure
•An application master sends periodic heartbeats to the resource manager,
and in the event of application master failure, the resource manager will
detect the failure and start a new instance of the master running in a new
container (managed by a node manager).

•If the application master fails, the tasks that ran under it need not be
resubmitted; they can be recovered, but by default recovery is not switched
on. The property yarn.app.mapreduce.am.job.recovery.enable must be turned
on. The status of the tasks is then recovered and the execution of the job
continues.

•Applications in YARN are tried multiple times in the event of failure. By
default, applications are marked as failed if they fail once, but this can be
increased by setting the property yarn.resourcemanager.am.max-retries.
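A sketch of the settings named above; note that the max-retries property is normally set cluster-wide in yarn-site.xml, shown here via the Configuration API for brevity:

// recover task state under a new application master instead of rerunning tasks
conf.setBoolean("yarn.app.mapreduce.am.job.recovery.enable", true);
// how many times the resource manager will relaunch a failed application master
conf.setInt("yarn.resourcemanager.am.max-retries", 2);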
Node Manager Failure
•If a node manager fails, it will stop sending heartbeats to the resource
manager, and it will be removed from the resource manager's pool of
available nodes.

•If the application master was running under the failed node manager, the
steps described under application master failure are followed.

•The property yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms,
which defaults to 600000 (10 minutes), determines the minimum time the
resource manager waits before considering a node manager that has sent no
heartbeat in that time as failed.

•If tasks running on a node manager fail often and cross a threshold, the
node is taken out of the available pool and blacklisted (a way of tracking
poorly performing nodes).
Resource Manager Failure

•Failure of the resource manager is serious, since without it
neither jobs nor task containers can be launched.

•The resource manager was designed from the outset to be able
to recover from crashes, by using a checkpointing mechanism
to save its state to persistent storage.

•After a crash, a new resource manager instance is brought up
(by an administrator), and it recovers from the saved state, so a
rerun of all the jobs is not required. The state consists of the
node managers in the system as well as the running applications.
Map Reduce Features
Counters

•Counters are a useful channel for gathering statistics about the job, for
quality control or for application-level statistics.

•They are also useful for problem diagnosis. If we are tempted to put a log
message into our map or reduce task, it is often better to see whether
we can use a counter instead to record that a particular condition occurred.

•In addition to counter values being much easier to retrieve than log output
for large distributed jobs, we get a record of the number of times the
condition occurred, which is more work to obtain from a set of log files.
Built-in Counters
Task counters
Task counters gather information about tasks over the
course of their execution, and the results are
aggregated over all the tasks in a job.

For example, the MAP_INPUT_RECORDS counter counts
the input records read by each map task and aggregates
over all map tasks in a job, so that the final figure is the
total number of input records for the whole job.
Task Counters
Job counters

•Job counters are maintained by the jobtracker (or
application master in YARN), so they don't need to be sent
across the network, unlike all other counters, including
user-defined ones.

•They measure job-level statistics, not values that change
while a task is running. For example,
TOTAL_LAUNCHED_MAPS counts the number of map tasks
that were launched over the course of a job (including
ones that failed).
Job Counters
User-Defined Java Counters

•MapReduce allows user code to define a set of counters, which are
then incremented as desired in the mapper or reducer. Counters are
defined by a Java enum, which serves to group related counters.

•A job may define an arbitrary number of enums, each with an
arbitrary number of fields. The name of the enum is the group name,
and the enum's fields are the counter names.

•Counters are global: the MapReduce framework aggregates them
across all maps and reduces to produce a grand total at the end of the
job.
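A minimal sketch of a user-defined counter in a new-API mapper; the class, enum, and parsing logic are simplified stand-ins, not the slides' maximum-temperature program:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // the enum name is the counter group; its fields are the counter names
  enum Temperature { MISSING, MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().trim();
    if (line.isEmpty()) {
      context.getCounter(Temperature.MISSING).increment(1); // aggregated job-wide
      return;
    }
    try {
      int temp = Integer.parseInt(line);
      context.write(new Text("temperature"), new IntWritable(temp));
    } catch (NumberFormatException e) {
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}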
Program (not shown): an application to run the maximum temperature job, including
counting missing and malformed fields and quality codes, and its output.
Program (not shown): an application to calculate the proportion of records with missing
temperature fields.
SORTING
Shuffle and Sort
• Every MapReduce job goes through a shuffle and sort phase. The map program
processes input keys and values; the sorted map output is then transferred to
the reducers, a process known as the shuffle.

• In the sort phase the merge factor plays a key role; it is set by the
property io.sort.factor (10 by default) and signifies how many files can be
merged at one go.
Partial Sort
•By default, Map Reduce will sort input records by their keys. The
variation for sorting sequence files with IntWritable keys is called partial
sort.
•Storing temperatures as Text objects doesn’t work for sorting purposes,
because signed integers don’t sort lexicographically.
•Instead, we are going to store the data using sequence files whose
IntWritable keys represent the temperatures (and sort correctly) and
whose Text values are the lines of data.
Program (not shown).
Controlling Sort Order
The sort order for keys is controlled by a RawComparator,
which is found as follows:
1. If the property mapred.output.key.comparator.class is
set, either explicitly or by calling
setSortComparatorClass() on Job, then an instance of
that class is used.
2. Otherwise, keys must be a subclass of
WritableComparable, and the registered comparator for
the key class is used.
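In driver code, option 1 corresponds to a single call; MyKeyComparator is a placeholder for a user-supplied RawComparator:

// equivalent to setting mapred.output.key.comparator.class
job.setSortComparatorClass(MyKeyComparator.class);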
Total Sort
•How can we produce a globally sorted file using Hadoop? The naive
answer is to use a single partition. But this is inefficient for large files,
since one machine has to process all of the output, so we are
throwing away the benefits of the parallel architecture that Map
Reduce provides.

•Instead, it is possible to produce a set of sorted files that, if
concatenated, would form a globally sorted file.

•We use a partitioner that respects the total order of the output. For
example, if we had four partitions, we could put keys for
temperatures less than –10°C in the first partition, those between –
10°C and 0°C in the second, those between 0°C and 10°C in the third,
and those over 10°C in the fourth.
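Hadoop ships a partitioner and sampler for exactly this; a sketch of the driver calls, assuming IntWritable keys and Text values as in the partial sort discussion (the sampling parameters are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// in the driver, after the input and output formats are configured:
job.setPartitionerClass(TotalOrderPartitioner.class);
// sample the input to choose partition boundaries that respect total order:
// sample 10% of records, up to 10,000 samples, from at most 10 splits
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);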
Program (not shown).
Secondary Sort

•The MapReduce framework sorts the records by key before they
reach the reducers. For any particular key, however, the values
are not sorted. The order in which the values appear is not even
stable from one run to the next, since they come from different
map tasks, which may finish at different times from run to run.

•However, it is possible to impose an order on the values by
sorting and grouping the keys in a particular way.

•This is done by a secondary sort.


• The natural key is the portion of the composite key which should be
considered for partitioning and grouping.
• The natural value is the portion of the composite key which should be
considered while sorting.
• First, write a custom Writable class to handle the composite key (made
up of two or more Hadoop data types).
• While writing the custom Writable class, we need to override a basic
set of functions which are used by the MapReduce framework to read,
write, compare, hash, and convert the objects to strings.
• Second, tell Hadoop how to compare the custom keys while performing
the sort, which is done via job.setSortComparatorClass().
• To this call we pass a custom comparator (typically a WritableComparator
subclass) and override its compare() method to help Hadoop understand
which custom key is smaller than the other when compared.
• The compare logic compares the first part of the composite key and
then considers the second part of the key to determine the order.
• Third is the custom partitioner, which is required for Hadoop to
correctly identify which partition a record belongs to. It overrides
getPartition(), and it is always the natural key portion of the
composite key that decides the partitioning.
• Lastly, we need to tell Hadoop by which field it should group the
input fed to the reducer. For this, too, the natural key portion of
the composite key is the grouping field (see the driver sketch
below).
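Pulled together in the driver, the wiring looks like this sketch; all four class names are placeholders for the custom implementations described in the steps above:

job.setMapOutputKeyClass(CompositeKeyWritable.class);        // custom WritableComparable
job.setPartitionerClass(NaturalKeyPartitioner.class);        // partitions on the natural key only
job.setSortComparatorClass(CompositeKeyComparator.class);    // natural key first, then natural value
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // groups reducer input by natural key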
JOINS
Map Side Join

•A join is an operation where we combine two or more datasets based on a
column or a set of columns.
•Joins are fairly complex to design in the MapReduce framework. In Java (i.e.,
the optimised solution in terms of processing speed) a join takes more lines of
code and a complex design; the same thing can be done easily in high-level
frameworks like Pig or Hive.
•A map-side join between large inputs works by performing the join before the
data reaches the map function.
•The inputs to each map must be partitioned and sorted in a particular way.
Each input dataset must be divided into the same number of partitions, and it
must be sorted by the same key (the join key) in each source.
•All the records for a particular key must reside in the same partition.
Reduce-Side Joins

•A reduce-side join is more general than a map-side join, in that the input
datasets don’t have to be structured in any particular way, but it is less
efficient as both datasets have to go through the Map Reduce shuffle.

•The mapper tags each record with its source and uses the join key as the
map output key, so that the records with the same key are brought
together in the reducer.
• First we need to apply different map logic to each input dataset. This can
be done with the MultipleInputs class in the driver, via the method
MultipleInputs.addInputPath() (see the sketch after this list).
• Each dataset has a different format, and there cannot be a single logic to
process all the different datasets.
• A custom WritableComparable class with the necessary functions overridden.
• A custom partitioner should be designed which considers only the natural
key portion of the composite key.
• A custom grouping comparator class which tells Hadoop how to compare
two records and sort on the basis of the natural value portion of the
composite key.
• Finally, the reduce logic is to store the first record of the group and expand
it with subsequent values to produce the final output.
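A sketch of the driver calls for the first step; the paths, input format, and mapper class names are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// each dataset gets its own mapper; both emit the join key as the map output key
MultipleInputs.addInputPath(job, new Path("/data/stations"),
    TextInputFormat.class, StationMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/records"),
    TextInputFormat.class, RecordMapper.class);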
Side Data Distribution
• It can be done through the distributed cache mechanism.
• A dataset can be distributed to the task nodes, and the mappers and reducers
can read the local copies present on their node at the time they perform their
map and reduce tasks; this mechanism is called the distributed cache.
• This method generally applies when an operation on two or more datasets
involves one very small dataset, for example where a small amount of
information needs to be looked up at map/reduce time.
Side Data Distribution
• Side data: extra read-only data needed by a job to process the
main dataset.
– The main challenge is to make side data available to all the
map or reduce tasks (which are spread across the cluster) in a
way that is convenient and efficient.
• Using the Job Configuration
• This technique should not be used for transferring more than a few kilobytes
of data, as it can put pressure on the memory usage of the Hadoop daemons,
particularly if the system is running several Hadoop jobs.

– Configuration's setter methods are used to set key-value pairs in
the job configuration.
– Useful for passing metadata to tasks.
• Distributed Cache
– Instead of serializing side data in the job configuration, it is
preferable to distribute the datasets using Hadoop's distributed cache.
• Provides a service for copying files and archives to the
task nodes in time for the tasks to use them when they
run.
– Two types of objects can be placed into the cache:
• Files
• Archives
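A sketch of the Hadoop 1.x DistributedCache API; the HDFS path is hypothetical, and conf is the job's Configuration:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// in the driver: register a file already in HDFS for distribution to task nodes
DistributedCache.addCacheFile(new URI("/lookup/stations.txt"), conf);

// in a task: read the local copy the framework placed on this node
Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);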
