How Map Reduce Works
•The task trackers, which run the tasks that the job has been split into.
•Checks the output specification of the job. For example, if the output directory has not
been specified or it already exists, the job is not submitted and an error is thrown to the
Map Reduce program.
•Computes the input splits for the job. If the splits cannot be computed, because the
input paths don’t exist, for example, then the job is not submitted and an error is thrown
to the Map Reduce program.
•Copies the resources needed to run the job, including the job JAR file, the configuration
file, and the computed input splits, to the job tracker's file system in a directory named
after the job ID.
•The job JAR is copied with a high replication factor (controlled by the
mapred.submit.replication property, which defaults to 10) so that there are lots of copies
across the cluster for the task trackers to access when they run tasks for the job (see the driver sketch below).
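A minimal driver sketch of such a submission (the class name and paths are placeholders, not part of the original material); it shows the job JAR being set, the input paths from which the splits are computed, and the output directory that must not already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit demo");

        // The job JAR that submission copies to the shared file system
        // (with a high replication factor) for the task trackers to fetch.
        job.setJarByClass(SubmitDemo.class);

        // No mapper or reducer is set, so the identity Mapper and Reducer run;
        // the default TextInputFormat supplies LongWritable offsets and Text lines.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input paths are used to compute the input splits; if they do not
        // exist, the job is not submitted and an error is thrown.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // The output directory must not already exist, or submission fails.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job and polls its progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}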
Job Initialization
•When the JobTracker receives a call to its
submitJob() method, it puts it into an internal
queue from where the job scheduler will pick it
up and initialize it.
•Initialization involves creating an object to
represent the job being run, which encapsulates
its tasks, and bookkeeping information to keep
track of the tasks’ status and progress.
Task Assignment
•Tasktrackers have a fixed number of slots for map tasks and for
reduce tasks: for example, a tasktracker may be able to run two
map tasks and two reduce tasks simultaneously.
Task Execution
•Now that the task tracker has been assigned a task,
the next step is for it to run the task. First, it
localizes the job JAR by copying it from the shared
file system to the task tracker's file system.
•Hadoop Streaming is the ability of Hadoop to interface with map and reduce
programs written in languages such as Ruby and Python. It uses Unix standard streams as the
interface between Hadoop and our program. In Streaming there is no driver class.
•The Streaming task communicates with the process (which may be written in any
language) using standard input and output streams.
•The Pipes task, on the other hand, listens on a socket (a communication channel to
the task tracker) and passes the C++ process a port number in its environment, so
that on startup the C++ process can establish a persistent socket connection back to
the parent Java Pipes task.
Job Completion
•When the jobtracker receives a notification that the last task for
a job is complete (this will be the special job cleanup task), it
changes the status for the job to “successful.”
•Then, when the Job polls for status, it learns that the job has
completed successfully, so it prints a message to tell the user and
then returns from the waitForCompletion() method.
•The job tracker takes care of both job scheduling (matching tasks with task
trackers) and task progress monitoring (keeping track of tasks and restarting
failed or slow tasks, and doing task bookkeeping such as maintaining counter
totals).
•The YARN node managers, which launch and monitor the compute
containers on machines in the cluster.
•The Map Reduce application master, which coordinates the tasks running
the Map Reduce job. The application master and the Map Reduce tasks run
in containers that are scheduled by the resource manager and managed by
the node managers.
•The distributed file system (normally HDFS), which is
used for sharing job files between the other entities.
Failures
• If more than four tasks from the same job fail on a particular
task tracker (set by mapred.max.tracker.failures), then the job
tracker records this as a fault.
1. Task Failure
2. Application Master Failure
3. Node Manager Failure
4. Resource Manager Failure
Task Failure
•Failure of the running task is similar to the classic case. Runtime exceptions
and sudden exits of the JVM are propagated back to the application master
and the task attempt is marked as failed.
•In jobs that process huge amounts of data with a large number of tasks,
the failure of a few tasks is acceptable and need not mark the whole
job as failed.
• The properties mapred.map.failures.maxpercent and mapred.reduce.failures.maxpercent
set the acceptable percentage of task failures before a job is declared
failed (as sketched below).
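A sketch of how these thresholds might be set from a driver. The property names are taken from the bullet above and treated as illustrative, since the exact names vary between Hadoop versions (the old API also exposes JobConf.setMaxMapTaskFailuresPercent()).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTolerantDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Allow up to 5% of map tasks and 2% of reduce tasks to fail before
        // the whole job is declared failed. Property names follow the text
        // above; they differ slightly across Hadoop releases.
        conf.setInt("mapred.map.failures.maxpercent", 5);
        conf.setInt("mapred.reduce.failures.maxpercent", 2);

        Job job = Job.getInstance(conf, "failure tolerant job");
        // ... set the mapper, reducer, input and output paths as usual ...
    }
}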
Application Master Failure
•An application master sends periodic heartbeats to the resource manager,
and in the event of application master failure, the resource manager will
detect the failure and start a new instance of the master running in a new
container (managed by a node manager).
•If the application master fails, the tasks that ran under it need not be
resubmitted; they can be recovered, but by default recovery is not
switched on. The property yarn.app.mapreduce.am.job.recovery.enable has to be
turned on (see the sketch below). The status of the tasks is then recovered and
execution of the job continues.
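A small sketch of enabling recovery in the driver, using the property named above (its default value and exact name depend on the Hadoop release):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RecoverableJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Turn on recovery so that tasks completed under the failed
        // application master are not rerun after the new master starts.
        conf.setBoolean("yarn.app.mapreduce.am.job.recovery.enable", true);
        Job job = Job.getInstance(conf, "recoverable job");
        // ... set the mapper, reducer, input and output paths as usual ...
    }
}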
Node Manager Failure
•If the application master was running on the failed node manager, the
steps described under application master failure are followed.
•If tasks on a particular node manager fail often and cross the threshold, the
node is taken out of the available pool and blacklisted (a process
of tracking poorly performing nodes).
Resource Manager Failure
Counters
•Counters are a useful channel for gathering statistics about the job: for
quality control or for application-level statistics.
•They are also useful for problem diagnosis. If we are tempted to put a log
message into our map or reduce task, then it is often better to see whether
we can use a counter instead to record that a particular condition occurred.
•In addition to counter values being much easier to retrieve than log output
for large distributed jobs, we get a record of the number of times that
condition occurred, which is more work to obtain from a set of log files.
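As an illustration, a mapper might count malformed records with a user-defined counter instead of logging them; the class, enum, and record layout below are hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class QualityMapper extends Mapper<LongWritable, Text, Text, Text> {

    // A user-defined counter group; each enum value is one counter.
    enum Quality { MALFORMED_RECORD }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.trim().isEmpty()) {
            // Record the condition with a counter rather than a log message;
            // the framework aggregates the totals across all tasks.
            context.getCounter(Quality.MALFORMED_RECORD).increment(1);
            return;
        }
        context.write(new Text("record"), value);
    }
}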
Built-in Counters
Task counters
Task counters gather information about tasks over the
course of their execution, and the results are
aggregated over all the tasks in a job.
• In the sort phase, the merge factor plays a key role; it is set by the
property io.sort.factor (10 by default). It signifies how many files can be
merged in one go, as in the sketch below.
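For illustration, the merge factor could be raised for a job that produces many spill files; a sketch, assuming the classic property name used above (newer releases call it mapreduce.task.io.sort.factor):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MergeFactorTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Merge up to 20 spill files in one pass instead of the default 10.
        conf.setInt("io.sort.factor", 20);
        Job job = Job.getInstance(conf, "merge factor tuning");
        // ... set the mapper, reducer, input and output paths as usual ...
    }
}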
Partial Sort
•By default, Map Reduce will sort input records by their keys. The
variation for sorting sequence files with IntWritable keys is called partial
sort.
•Storing temperatures as Text objects doesn’t work for sorting purposes,
because signed integers don’t sort lexicographically.
•Instead, we are going to store the data using sequence files whose
IntWritable keys represent the temperatures (and sort correctly) and
whose Text values are the lines of data.
Program
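The slide's program is not reproduced here; a minimal sketch of such a job, assuming the input has already been converted to a SequenceFile with IntWritable temperature keys and Text values (class name and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class PartialSortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "partial sort");
        job.setJarByClass(PartialSortDriver.class);

        // Identity map and reduce: the framework's shuffle does the sorting,
        // so each reducer's output file is sorted by IntWritable key.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4);   // 4 sorted (but not globally sorted) files

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}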
Controlling Sort Order
The sort order for keys is controlled by a RawComparator,
which is found as follows:
1. If the property mapred.output.key.comparator.class is
set, either explicitly or by calling
setSortComparatorClass() on Job, then an instance of
that class is used.
2. Otherwise, keys must be a subclass of
WritableComparable, and the registered comparator for
the key class is used.
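For example, the sort order could be reversed by supplying a comparator through setSortComparatorClass(); a sketch, assuming IntWritable keys (the class name is a placeholder):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// A comparator that inverts the natural IntWritable order, so keys come
// out of the shuffle in descending order.
public class DescendingIntComparator extends WritableComparator {
    public DescendingIntComparator() {
        super(IntWritable.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }
}

// In the driver (sketch):
//   job.setSortComparatorClass(DescendingIntComparator.class);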
Total Sort
•How can we produce a globally sorted file using Hadoop? The naive
answer is to use a single partition. But this is inefficient for large files,
since one machine has to process all of the output, so we are
throwing away the benefits of the parallel architecture that Map
Reduce provides.
•We use a partitioner that respects the total order of the output. For
example, if we had four partitions, we could put keys for
temperatures less than –10°C in the first partition, those between –
10°C and 0°C in the second, those between 0°C and 10°C in the third,
and those over 10°C in the fourth.
Program
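The slide's program is not reproduced here; a sketch of such a job, using Hadoop's TotalOrderPartitioner with an InputSampler to choose the partition boundaries (class name, sampling parameters, and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "total sort");
        job.setJarByClass(TotalSortDriver.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Partitioner that respects the total order of the keys, driven by a
        // partition file of boundary keys produced by sampling the input.
        // The partition file path here is just a placeholder.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path(args[1] + "_partitions"));
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}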
Secondary Sort
•The Map Reduce framework sorts the records by key before they
reach the reducers. For any particular key, however, the values
are not sorted. The order in which the values appear is not even
stable from one run to the next, since they come from different
map tasks, which may finish at different times from run to run.
Reduce-Side Join
•A reduce-side join is more general than a map-side join, in that the input
datasets don’t have to be structured in any particular way, but it is less
efficient because both datasets have to go through the Map Reduce shuffle.
•The mapper tags each record with its source and uses the join key as the
map output key, so that the records with the same key are brought
together in the reducer.
• First, we need to apply different map logic to different input datasets; this
can be done with the MultipleInputs class in the driver, using the
method MultipleInputs.addInputPath().
• Each dataset has a different format, so a single piece of map logic cannot
process all the different datasets.
• A custom WritableComparable class with the necessary methods overridden.
• A custom partitioner that considers only the natural
key portion of the composite key.
• A custom grouping comparator class that tells Hadoop to compare two
composite keys on their natural key portion only, so that records differing
only in the value portion go to the same reduce call (see the sketch after this list).
• Finally, the reduce logic stores the first record of each group and combines
it with the subsequent values to produce the final joined output.
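A condensed sketch of these pieces, using a hypothetical composite key made of a Text natural key and an IntWritable secondary field; all class names are placeholders.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: natural key (e.g. a station id) plus a secondary field used
// only for ordering the values within each reduce group.
public class CompositeKey implements WritableComparable<CompositeKey> {
    private final Text naturalKey = new Text();
    private final IntWritable secondary = new IntWritable();

    public void set(String key, int value) {
        naturalKey.set(key);
        secondary.set(value);
    }

    public Text getNaturalKey() { return naturalKey; }

    @Override
    public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        secondary.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        secondary.readFields(in);
    }

    // Sort by natural key first, then by the secondary field.
    @Override
    public int compareTo(CompositeKey o) {
        int cmp = naturalKey.compareTo(o.naturalKey);
        return cmp != 0 ? cmp : secondary.compareTo(o.secondary);
    }
}

// Partitioner that looks only at the natural key, so all records sharing it
// reach the same reducer regardless of the secondary field.
class NaturalKeyPartitioner extends Partitioner<CompositeKey, Text> {
    @Override
    public int getPartition(CompositeKey key, Text value, int numPartitions) {
        return (key.getNaturalKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Grouping comparator that compares only the natural key, so one reduce()
// call sees all values for it, already ordered by the secondary field.
class NaturalKeyGroupingComparator extends WritableComparator {
    public NaturalKeyGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((CompositeKey) a).getNaturalKey()
                .compareTo(((CompositeKey) b).getNaturalKey());
    }
}

// Driver wiring (sketch):
//   job.setMapOutputKeyClass(CompositeKey.class);
//   job.setPartitionerClass(NaturalKeyPartitioner.class);
//   job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);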
Side Data Distribution
• It can be done through the distributed cache mechanism.
• The dataset is distributed to the task nodes, and the mappers and reducers
read the local copies available to them while they perform their
map and reduce tasks; this mechanism is called the distributed cache.
• This method generally applies when an operation on two or
more datasets involves one very small dataset, for example when a small
lookup table needs to be consulted at map or reduce time (see the sketch below).
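A sketch of the mechanism, assuming the driver added a small lookup file to the cache with job.addCacheFile() using a #stations.txt URI fragment, so that it is symlinked into the task's working directory under that name; the file name, field layout, and class name are placeholders.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file has been localized on this task node and symlinked
        // into the working directory, so it can be opened by its link name.
        try (BufferedReader in = new BufferedReader(new FileReader("stations.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                lookup.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Join each record of the main dataset against the in-memory side data.
        String[] fields = value.toString().split("\t", 2);
        String name = lookup.getOrDefault(fields[0], "UNKNOWN");
        context.write(new Text(fields[0]), new Text(name));
    }
}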
Side Data Distribution
• Side Data - extra read-only data needed by a job to process the
main dataset
– The main challenge is to make side data available to all the
map or reduce tasks (which are spread across the cluster) in
a way that is convenient and efficient.
• Using the Job Configuration
• This technique should not be used for transferring more than a few kilobytes
of data, as it puts pressure on the memory usage of the Hadoop daemons,
particularly if the system is running several Hadoop jobs (see the sketch below).
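A sketch of this technique; the property name myapp.quality.threshold is an arbitrary example set in the driver with conf.setInt(), not a real Hadoop setting.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ThresholdMapper extends Mapper<LongWritable, Text, Text, Text> {

    private int threshold;

    // Driver side (sketch): conf.setInt("myapp.quality.threshold", 9);

    @Override
    protected void setup(Context context) {
        // Small side data travels inside the serialized job configuration
        // and is read back in every task.
        Configuration conf = context.getConfiguration();
        threshold = conf.getInt("myapp.quality.threshold", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the side data while processing the main dataset.
        if (value.getLength() > threshold) {
            context.write(new Text("accepted"), value);
        }
    }
}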