Developing a MapReduce Application
By
Dr. K. Venkateswara Rao
Professor
Department of CSE
Contents
1. Hadoop Cluster
2. YARN Architecture
3. Data Flow in Hadoop Computing Model
4. MapReduce Model
5. The Configuration API
6. A simple configuration file
7. Accessing configuration properties
8. The MapReduce Web UI
9. Hadoop Logs
10. Tuning a Job
Hadoop Cluster
(Figure: Hadoop cluster layout with a primary master node and worker nodes connected through rack switches.)
Job Submission
• Job.runJob()
• Creates a new JobClient instance
• Calls submit() on the Job object
• Job.submit()
• Creates an internal JobSubmitter instance and calls submitJobInternal() on it (step 1 in the figure).
• Having submitted the job, waitForCompletion() polls the job's progress once per second and
reports the progress to the console if it has changed since the last report.
• The job submission process is implemented by JobSubmitter; it does five things (a minimal
client-side driver is sketched after the list below).
Five things done in Job Submission Process
1. Asks the resource manager for a new application ID, used for the
MapReduce job ID (step 2).
2. Checks the output specification of the job.
3. Computes the input splits for the job.
4. Copies the resources needed to run the job, including the job JAR
file, the configuration file, and the computed input splits, to the
shared filesystem in a directory named after the job ID (step 3).
5. Submits the job by calling submitApplication() on the resource
manager (step 4).
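The following minimal driver is a hedged, client-side sketch of this process and is not part of the original slides; the class name MinimalJobDriver and the use of the identity Mapper and Reducer are illustrative choices. Calling waitForCompletion() triggers the submission steps above and then polls progress once per second.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "minimal pass-through job");
    job.setJarByClass(MinimalJobDriver.class);       // the job JAR is among the resources copied to the shared filesystem
    job.setMapperClass(Mapper.class);                // identity mapper (the base class passes records through)
    job.setReducerClass(Reducer.class);              // identity reducer
    job.setOutputKeyClass(LongWritable.class);       // matches TextInputFormat's (offset, line) records
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input paths drive the split computation
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // the output specification checked at submission
    // waitForCompletion() submits the job through an internal JobSubmitter and then
    // polls the application master's progress once per second, as described above.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}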
Input Splits
(Figure: relation between input splits and HDFS blocks.)
Job Initialization
1. When the resource manager receives a call to its submitApplication()
method, it hands off the request to the YARN scheduler.
2. The scheduler allocates a container, and the resource manager then
launches the application master’s process there, under the node manager’s
management (steps 5a and 5b).
3. The application master for MapReduce jobs is a Java application whose main
class is MRAppMaster. It initializes the job by creating a number of
bookkeeping objects to keep track of the job’s progress, as it will receive
progress and completion reports from the tasks (step 6).
4. Next, it retrieves the input splits computed in the client from the shared
filesystem (step 7). It then creates a map task object for each split, as well as
a number of reduce task objects. Tasks are given IDs at this point.
Task Assignment
1. The application master requests containers for all the map and
reduce tasks in the job from the resource manager (step 8).
2. Reduce tasks can run anywhere in the cluster, but requests for map
tasks have data locality constraints that the scheduler tries to
honor.
3. Requests for reduce tasks are not made until 5% of the map tasks have
completed.
4. Requests also specify memory requirements and CPUs for tasks. By
default, each map and reduce task is allocated 1,024 MB of
memory and one virtual core (a sketch of overriding these defaults follows).
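As a sketch (not from the slides) of overriding these defaults per job, assuming Hadoop 2's standard property names mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores, and mapreduce.reduce.cpu.vcores; the class name and the chosen values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ResourceTuning {
  public static Job buildJob() throws Exception {
    Configuration conf = new Configuration();
    // Override the default container resources (1,024 MB and one virtual core per task).
    conf.setInt("mapreduce.map.memory.mb", 2048);     // memory requested for each map container
    conf.setInt("mapreduce.reduce.memory.mb", 4096);  // memory requested for each reduce container
    conf.setInt("mapreduce.map.cpu.vcores", 1);       // virtual cores per map task
    conf.setInt("mapreduce.reduce.cpu.vcores", 2);    // virtual cores per reduce task
    return Job.getInstance(conf, "resource-tuned job");
  }
}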
Task Execution
1. Once a task has been assigned resources for a container on a particular
node by the resource manager’s scheduler, the application master starts
the container by contacting the node manager (steps 9a and 9b).
2. The task is executed by a Java application whose main class is YarnChild.
3. Before it can run the task, it localizes the resources that the task needs,
including the job configuration and JAR file, and any files from the
distributed cache (step 10).
4. Finally, it runs the map or reduce task (step 11).
5. The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined
map and reduce functions (or even in YarnChild) don’t affect the node
manager—by causing it to crash or hang, for example.
Progress Measure
• The following operations constitute task progress in Hadoop (a mapper sketch using these calls follows the list):
1. Reading an input record (in a mapper or reducer)
2. Writing an output record (in a mapper or reducer)
3. Setting the status description (via Reporter’s or
TaskAttemptContext’s setStatus() method)
4. Incrementing a counter (using Reporter’s incrCounter() method
or Counter’s increment() method)
5. Calling Reporter’s or TaskAttemptContext’s progress() method
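A minimal mapper sketch (illustrative, not from the slides) showing these progress signals through the new API's Context, which implements TaskAttemptContext; the counter group and name strings are made-up examples.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each of the calls below counts as task progress, so the application master
// does not treat a long-running attempt as hung.
public class ProgressReportingMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // 1 and 2: reading an input record and writing an output record both report progress.
    context.write(value, NullWritable.get());

    // 4: incrementing a counter ("MyApp"/"RecordsSeen" are illustrative names).
    context.getCounter("MyApp", "RecordsSeen").increment(1);

    // 3: setting a human-readable status description.
    context.setStatus("processing offset " + key.get());

    // 5: explicitly reporting progress, useful inside long per-record computations.
    context.progress();
  }
}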
Progress and Status Updates
1. When a task is running, it keeps track of its progress (i.e., the proportion of the task
completed). Progress is not always measurable.
2. For map tasks, this is the proportion of the input that has been processed. For reduce
tasks, it’s a little more complex, but the system can still estimate the proportion of
the reduce input processed.
3. Tasks also have a set of counters that count various events as the task runs. The
counters are either built into the framework, such as the number of map output
records written, or defined by users.
4. The task reports its progress and status (including counters) back to its application
master, which has an aggregate view of the job.
5. During the course of the job, the client receives the latest status by polling the
application master every second.
(Figure: propagation of status updates through the MapReduce system.)
Job Completion
1. When the application master receives a notification that the last
task for a job is complete, it changes the status for the job to
“successful.”
2. When the Job polls for status, it learns that the job has completed
successfully, so it prints a message to tell the user and then returns
from the waitForCompletion() method.
3. Job statistics and counters are printed to the console at this point (a sketch of reading them programmatically follows this list).
4. Finally, on job completion, the application master and the task
containers clean up their working state.
5. Job information is archived by the job history server to enable later
interrogation by users if desired.
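As an illustrative sketch of what the client can read once waitForCompletion() returns (the class and method names here are placeholders), the Job handle exposes the final state and the counters, including built-in ones such as TaskCounter.MAP_OUTPUT_RECORDS.

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobCompletionReport {
  // Assumes 'job' has already been fully configured and is ready to submit.
  public static boolean runAndReport(Job job) throws Exception {
    boolean succeeded = job.waitForCompletion(true);  // prints progress, then the counters, to the console
    System.out.println("Final state: " + job.getJobState());
    Counters counters = job.getCounters();
    long mapOutputRecords =
        counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
    System.out.println("Map output records: " + mapOutputRecords);
    return succeeded;
  }
}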
Failures in Hadoop
• In the real world,
1. User code is buggy,
2. Processes crash, and
3. Machines fail.
• One of the major benefits of using Hadoop is its ability to handle failures.
• Various entities that may fail in Hadoop
1. Task Failure
2. Application Master Failure
3. Node Manager Failure
4. Resource Manager Failure
Task Failure
• User code in the map or reduce task throws a runtime exception
The task JVM reports the error back to its parent application master before it exits. The
error ultimately makes it into the user logs. The application master marks the task
attempt as failed, and frees up the container so its resources are available for another
task.
• Sudden exit of the task JVM—perhaps there is a JVM bug
The node manager notices that the process has exited and informs the application
master so it can mark the attempt as failed.
• Hanging tasks are dealt with differently
The application master notices that it hasn’t received a progress update for a while and
proceeds to mark the task as failed.
• When the application master is notified of a task attempt that has failed, it will reschedule
execution of the task. The application master will try to avoid rescheduling the task on a
node manager where it has previously failed.
Application Master Failure
• An application master sends periodic heartbeats to the resource manager, and
in the event of application master failure, the resource manager will detect the
failure and start a new instance of the master running in a new container.
• In the case of the MapReduce application master, it will use the job history to
recover the state of the tasks that were already run by the (failed) application
so they don’t have to be rerun. Recovery is enabled by default.
• The MapReduce client polls the application master for progress reports, so if the
application master fails, the client needs to locate the new instance.
In that case, the client will experience a timeout when it issues a status update; at that
point it goes back to the resource manager to ask for the new application master's address.
Node Manager Failure
• If a node manager fails by crashing or running very slowly, it will stop sending
heartbeats to the resource manager.
If the resource manager hasn't received a heartbeat from a node manager for 10 minutes, it
considers that node manager failed and removes it from its pool of nodes to schedule
containers on.
Any task or application master running on the failed node manager will be recovered
using the mechanisms described under “Task Failure” and “Application Master Failure”
sections respectively.
The application master arranges for map tasks (which were scheduled on failed nodes)
to be rerun if they belong to incomplete jobs.
Node managers may be blacklisted if the number of failures for the application is high,
even if the node manager itself has not failed. Blacklisting is done by the application
master.
Resource Manager Failure
• The resource manager is a single point of failure.
• To achieve high availability (HA), it is necessary to run a pair of resource managers in an active-
standby configuration. If the active resource manager fails, then the standby can take over
without a significant interruption to the client.
• Information about all the running applications is stored in a highly available state store (backed
by ZooKeeper or HDFS), so that the standby can recover the core state of the failed active
resource manager.
• Node manager information can be reconstructed by the new resource manager as the node
managers send their first heartbeats.
• When the new resource manager starts, it reads the application information from the state
store, then restarts the application masters for all the applications running on the cluster.
• The transition of a resource manager from standby to active is handled by a failover controller.
• Clients and node managers must be configured to handle resource manager failover.
Hadoop MapReduce: A Closer Look
(Figure: data flow on two nodes. On each node, files are loaded from the local HDFS store, the
InputFormat breaks them into splits, RecordReaders (RR) read each split, and the results are
written back to the local HDFS store through the OutputFormat.)
Input Files
• Input files are where the data for a MapReduce task is initially
stored
• The input files typically reside in a distributed file system (e.g.
HDFS)
• The format of input files is arbitrary
Line-based log files
Binary files
Multi-line input records
Or something else entirely
InputFormat
• How the input files are split up and read is defined by the
InputFormat
• InputFormat is a class that does the following:
Selects the files that should be used for input
Defines the InputSplits that break up a file
Provides a factory for RecordReader objects that read the file
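A driver-side sketch (illustrative paths, not from the slides) of selecting input files and the InputFormat; TextInputFormat is the default and supplies a line-oriented RecordReader.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatSetup {
  public static void configureInput(Job job) throws Exception {
    // Which files to use for input: every file under these paths is considered.
    FileInputFormat.addInputPath(job, new Path("/data/logs"));        // illustrative path
    FileInputFormat.addInputPath(job, new Path("/data/more-logs"));   // multiple paths are allowed

    // Which InputFormat to use: TextInputFormat splits files into line-oriented
    // records and acts as the factory for the RecordReader that reads them.
    job.setInputFormatClass(TextInputFormat.class);
  }
}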
Input Splits
An input split describes a unit of work that comprises a single map task in a
MapReduce program
The RecordReader class actually loads data from its source and converts it
into (K, V) pairs suitable for reading by Mappers
The RecordReader is invoked repeatedly on the input until the entire split is consumed
A new instance of Mapper is created for each split
Combiners and Partitioners
(Figure: combiner example.)
Partitioner
Each mapper may emit (K, V) pairs to any partition
The default partitioner computes a hash value for a given key and assigns the pair to a
partition based on that hash value (a combiner and partitioner sketch follows)
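A hedged sketch (class names illustrative) of a combiner and a hash-based partitioner equivalent in spirit to the default HashPartitioner, for (Text, IntWritable) map output.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class PartitioningSetup {

  // Combiner: locally sums the counts emitted by a mapper, shrinking map output
  // before it is written to disk and shuffled to the reducers.
  public static class IntSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Custom partitioner, equivalent in spirit to the default HashPartitioner:
  // the key's hash value decides which reduce partition receives the pair.
  public static class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  public static void configure(Job job) {
    job.setCombinerClass(IntSumCombiner.class);
    job.setPartitionerClass(WordPartitioner.class);
  }
}

Using the reducer class itself as the combiner is common when the reduce function is commutative and associative, such as summing counts.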
Shuffle and Sort: The Map Side
• MapReduce guarantees that the input to every reducer is sorted by key.
• The process by which the system performs the sort—and transfers the map
outputs to the reducers as inputs—is known as the shuffle
• When the map function starts producing output, it is not simply written to
disk. It takes advantage of buffering by writing in main memory and doing
some presorting for efficiency reasons.
• Each map task has a circular memory buffer that it writes the output to. The
buffer is 100 MB by default.
• When the contents of the buffer reach a certain threshold size (default value
0.80, or 80%), a background thread starts to spill the contents to disk (the buffer size and
spill threshold can be tuned, as sketched below).
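A sketch (not from the slides) of tuning these map-side buffer parameters, assuming the Hadoop 2 property names mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent, whose defaults correspond to the 100 MB buffer and 80% threshold above; the class name and values are illustrative.

import org.apache.hadoop.conf.Configuration;

public class MapSideTuning {
  public static Configuration tuned() {
    Configuration conf = new Configuration();
    // Size of the circular in-memory buffer each map task writes its output to (MB).
    conf.setInt("mapreduce.task.io.sort.mb", 200);
    // Fraction of the buffer at which a background thread starts spilling to disk.
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
    return conf;
  }
}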
(Figure: shuffle and sort in MapReduce.)
Shuffle and Sort: The Map Side
• Spills are written in round-robin fashion to the specified directories.
• Before it writes to disk, the thread first divides the data into partitions corresponding
to the reducers that they will ultimately be sent to.
• Within each partition, the background thread performs an in-memory sort by key,
and if there is a combiner function, it is run on the output of the sort.
• Running the combiner function makes for a more compact map output, so there is
less data to write to local disk and to transfer to the reducer.
• Each time the memory buffer reaches the spill threshold, a new spill file is created, so
after the map task has written its last output record, there could be several spill files.
• Before the task is finished, the spill files are merged into a single partitioned and
sorted output file. If there are at least three spill files, the combiner is run again
before the output file is written.
Shuffle and Sort: The Reduce Side
• The map output file is sitting on the local disk of the machine that ran the map
task, but now the map output is needed by the machine that is about to run
the reduce task for the partition.
• Moreover, the reduce task needs the map output for its particular partition
from several map tasks across the cluster.
• The map tasks may finish at different times, so the reduce task starts copying
their outputs as soon as each completes. This is known as the copy phase of
the reduce task.
• The reduce task has a small number of copier threads (5 by default) so that it
can fetch map outputs in parallel (a tuning sketch follows).
• A thread in the reducer periodically asks the application master for map output hosts until
it has retrieved them all.
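A sketch of changing the number of copier threads, assuming the standard property name mapreduce.reduce.shuffle.parallelcopies (default 5); the class name and value are illustrative.

import org.apache.hadoop.conf.Configuration;

public class CopyPhaseTuning {
  public static Configuration tuned() {
    Configuration conf = new Configuration();
    // Number of parallel threads a reduce task uses to fetch map outputs.
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
    return conf;
  }
}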
Shuffle and Sort: The Reduce Side
• Map outputs are copied to the reduce task JVM’s memory if they are small enough.
otherwise, they are copied to disk.
• When the in-memory buffer reaches a threshold size or reaches a threshold number
of map outputs, it is merged and spilled to disk. If a combiner is specified, it will be run
during the merge to reduce the amount of data written to disk.
• As the copies accumulate on disk, a background thread merges them into larger,
sorted files.
• When all the map outputs have been copied, the reduce task moves into the sort
phase, which merges the map outputs, maintaining their sort ordering. This is done in
rounds.
• During the reduce phase, the reduce function is invoked for each key in the sorted
output.
• The output of this phase is written directly to the output filesystem, typically HDFS.
Speculative Execution
• The MapReduce model is to break jobs into tasks and run the tasks in parallel to make
the overall job execution time smaller than it would be if the tasks ran sequentially.
• A MapReduce job is dominated by the slowest task
• Hadoop doesn't try to diagnose and fix slow-running tasks (stragglers); instead, it tries
to detect when a task is running slower than expected and launches another
equivalent task as a backup. This is termed speculative execution of tasks.
• Only one speculative copy of a straggler is run at a time.
• Whichever of the two copies of a task completes first becomes the definitive copy, and
the other copy is killed.
• Speculative execution is an optimization, not a feature to make jobs run more
reliably.
• Speculative execution is turned on by default (a sketch of disabling it per job follows).
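A sketch of turning speculative execution off for a job, assuming the standard property names mapreduce.map.speculative and mapreduce.reduce.speculative (both true by default); the class name is illustrative.

import org.apache.hadoop.conf.Configuration;

public class SpeculationSettings {
  public static Configuration withoutSpeculation() {
    Configuration conf = new Configuration();
    // Disable speculative execution, e.g. for non-idempotent tasks that
    // write to external systems and must not run twice.
    conf.setBoolean("mapreduce.map.speculative", false);
    conf.setBoolean("mapreduce.reduce.speculative", false);
    return conf;
  }
}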
Output Committers
• Hadoop MapReduce uses a commit protocol to ensure that jobs and
tasks either succeed or fail cleanly.
• The behavior is implemented by the OutputCommitter in use for the
job.
• In the old MapReduce API, the OutputCommitter is set by calling the
setOutputCommitter() on JobConf or by setting
mapred.output.committer.class in the configuration.
• In the new MapReduce API, the OutputCommitter is determined by
the OutputFormat, via its getOutputCommitter() method (a skeletal committer is sketched below).
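A skeletal, no-op OutputCommitter sketch (purely illustrative, not part of the slides) listing the methods the commit protocol calls; in practice FileOutputFormat supplies a FileOutputCommitter, and a custom committer would be returned from a custom OutputFormat's getOutputCommitter() method.

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Skeleton of the new-API commit protocol; each method body is a no-op here.
public class NoOpOutputCommitter extends OutputCommitter {

  @Override
  public void setupJob(JobContext jobContext) throws IOException {
    // Called before the job runs, e.g. to create the final output directory.
  }

  @Override
  public void setupTask(TaskAttemptContext taskContext) throws IOException {
    // Called before each task attempt runs.
  }

  @Override
  public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException {
    // Returning false means the framework skips the task commit phase entirely.
    return false;
  }

  @Override
  public void commitTask(TaskAttemptContext taskContext) throws IOException {
    // Called for successful task attempts whose output needs to be committed.
  }

  @Override
  public void abortTask(TaskAttemptContext taskContext) throws IOException {
    // Called for failed or killed task attempts to clean up partial output.
  }
}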