Module 4
MapReduce workflows – unit tests with MRUnit – test data and local tests – anatomy of
MapReduce job run – classic Map-reduce – YARN – failures in classic Map-reduce and
YARN – job scheduling – shuffle and sort – task execution – MapReduce types – input
formats – output formats
MapReduce Workflows
This section explains how a data processing problem is decomposed into the MapReduce model. When the
processing gets more complex, the complexity is generally manifested by having more
MapReduce jobs, rather than by having more complex map and reduce functions. In other
words, as a rule of thumb, think about adding more jobs rather than adding complexity
to jobs. A MapReduce workflow involves two aspects:
1. Running a single MapReduce job
When we start a MapReduce workflow, the framework splits the input into
segments, passing each segment to a different machine. Each machine
then runs the map script on the portion of data assigned to it.
2. Running Dependent Jobs (linear chain of jobs) or More complex
Directed Acyclic Graph jobs
When there is more than one job in a MapReduce workflow, the question arises: how
do you manage the jobs so they are executed in order? There are several approaches,
and the main consideration is whether you have a linear chain of jobs, or a more
complex directed acyclic graph (DAG) of jobs.
For a linear chain, the simplest approach is to run each job one after another, waiting until
a job completes successfully before running the next:
JobClient.runJob(conf1);
JobClient.runJob(conf2);
For anything more complex than a linear chain, such as a DAG of jobs, there is a class called
JobControl which represents a graph of jobs to be run.
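For reference, here is a minimal sketch (not from the original text) of how a two-job dependency might be expressed with the new-API JobControl and ControlledJob classes; conf1 and conf2 are assumed to be the two job configurations from the example above, and exception handling is omitted for brevity:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Wrap each Job in a ControlledJob and declare the dependency.
Job job1 = Job.getInstance(conf1, "first");
Job job2 = Job.getInstance(conf2, "second");
ControlledJob cJob1 = new ControlledJob(job1, null);
ControlledJob cJob2 = new ControlledJob(job2, null);
cJob2.addDependingJob(cJob1);            // job2 starts only after job1 succeeds

// JobControl runs the graph; it implements Runnable, so run it in a thread.
JobControl control = new JobControl("dag-example");
control.addJob(cJob1);
control.addJob(cJob2);
new Thread(control).start();
while (!control.allFinished()) {
    Thread.sleep(1000);                  // poll until every job has finished or failed
}
control.stop();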
Oozie
Unlike JobControl, which runs on the client machine submitting the jobs, Oozie runs as
a server, and a client submits a workflow to the server. In Oozie, a workflow is a DAG of
action nodes and control-flow nodes. An action node performs a workflow task, such as
moving files in HDFS, running a MapReduce job, or running a Pig job. When the
workflow completes, Oozie can make an HTTP callback to the client to inform it of the
workflow status. It is also possible to receive callbacks every time the workflow enters
or exits an action node.
Oozie allows failed workflows to be re-run from an arbitrary point. This is useful for
dealing with transient errors when the early actions in the workflow are time consuming
to execute.
How does MapReduce work? / Explain the anatomy of a classic MapReduce job run / How does
Hadoop run a MapReduce job?
You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It is
very short, but it conceals a great deal of processing behind the scenes. The whole
process is illustrated in the following figure.
As shown in Figure 1, there are four independent entities in the framework:
- Client, which submits the MapReduce job.
- JobTracker, which coordinates and controls the job run. It is implemented as a Java class
called JobTracker.
- TaskTrackers, which run the tasks that the job has been split into, control each specific
map or reduce task, and report to the JobTracker. They are Java classes as well.
- HDFS, which provides distributed data storage and is used to share job files
between the other entities.
As Figure 1 shows, MapReduce processing involves 10 steps; in short:
- The client submits the MapReduce job to the JobTracker.
- The JobTracker assigns map and reduce tasks to other nodes in the cluster.
- Each of these nodes runs a TaskTracker daemon in a separate JVM.
- Each TaskTracker actually initiates the map or reduce tasks and reports progress
back to the JobTracker.
There are six detailed levels in workflows. They are:
1. Job Submission
2. Job Initialization
3. Task Assignment
4. Task Execution
5. Task Progress and status updates
6. Task Completion
Job Submission
When the client calls submit() on the job object, an internal JobSubmitter object is
created and its submitJobInternal() method is called. If the client calls waitForCompletion(),
the job's progress is polled and reported back to the client console until the job completes.
JobSubmitter does the following work:
- Asks the JobTracker for a new job ID.
- Checks the output specification of the job.
- Computes the input splits for the job.
- Copies the resources needed to run the job. Resources include the job JAR file, the
configuration file and the computed input splits. These resources are copied to HDFS
in a directory named after the job ID. The job JAR is copied with a high replication factor
(10 by default, controlled by mapred.submit.replication) so that TaskTrackers can access it quickly.
- Tells the JobTracker that the job is ready for execution by calling submitJob() on the
JobTracker.
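For context, here is a hedged sketch of a typical client driver whose waitForCompletion() call triggers the submission steps listed above. It uses the newer org.apache.hadoop.mapreduce API; WordCountMapper, WordCountReducer and the paths taken from args are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // hypothetical mapper class
        job.setReducerClass(WordCountReducer.class);    // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() submits the job and polls its progress until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}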
Job Initialization
When the JobTracker receives the submitJob() call, it puts the call into an internal
queue from where the job scheduler picks it up and initializes it. The initialization is
done as follows:
- A job object is created to represent the job being run. It encapsulates the job's
tasks and bookkeeping information so as to keep track of task progress and
status.
- The JobTracker retrieves the input splits from HDFS and creates the list of tasks, each of which has
a task ID. It creates one map task for each split, and the number of reduce
tasks according to the configuration.
- The JobTracker also creates a setup task and a cleanup task. The setup task creates the
final output directory for the job and the temporary working space for the task output.
The cleanup task deletes the temporary working space for the task output.
- The JobTracker then assigns tasks to free TaskTrackers.
Task Assignment
TaskTrackers send periodic heartbeats to the JobTracker. As part of the heartbeat, a
TaskTracker indicates whether it is ready to run a new task; if it is, the JobTracker uses its
scheduler to choose a task and allocates it to the TaskTracker via the heartbeat return value.
TaskTrackers have a fixed number of slots for map tasks and for reduce tasks, and for map
tasks the JobTracker prefers a task whose input split is as close as possible to the
TaskTracker (ideally data-local).
Task Execution
When the TaskTracker has been assigned a task, the task execution runs as
follows:
- The TaskTracker copies the job JAR from HDFS and copies any files needed from the distributed cache to the local disk.
- It creates a local working directory for the task and 'un-jars' the JAR contents into that
directory.
- It creates a TaskRunner to run the task. The TaskRunner launches a new JVM to run
each task, so a bug in user code cannot crash the TaskTracker itself. Multiple tasks on
the node can reuse the JVM created by the TaskRunner.
- Each task run in the JVM created by the TaskRunner can execute setup and cleanup actions.
- The child process created by the TaskRunner informs the parent process of the task's
progress every few seconds until the task is complete.
Progress and Status Updates
MapReduce jobs are long-running batch jobs, so after a client submits a job, progress
reporting is important. The following constitute progress in a Hadoop task:
- Reading an input record in a mapper or reducer
- Writing an output record in a mapper or reducer
- Setting the status description on a reporter, using the Reporter's setStatus() method
- Incrementing a counter
- Calling the Reporter's progress() method
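As an illustration of the last three items, here is a hedged sketch (old mapred API, matching the Reporter methods listed above) of a mapper that reports progress explicitly while doing slow per-record work; SlowMapper and its counter are hypothetical names:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    enum RecordCounter { PROCESSED }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        reporter.setStatus("processing record at offset " + key);  // status description
        reporter.incrCounter(RecordCounter.PROCESSED, 1);          // counter increment
        // ... long-running per-record work would go here ...
        reporter.progress();             // tell the framework the task is still alive
        output.collect(new Text(value), new IntWritable(1));
    }
}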
As shown in the figure, when a task is running, the TaskTracker notifies the
JobTracker of its task progress via heartbeats every 5 seconds, and the mapper or reducer
running in the child JVM reports its progress status to the TaskTracker every few
seconds. The mapper or reducer sets a flag to indicate a status change that
should be sent to the TaskTracker. The flag is checked in a separate thread every 3
seconds; if it is set, the thread notifies the TaskTracker of the current task status.
The JobTracker combines all of these updates to produce a global view, and the client can
use getStatus() to get the job's progress status.
Job Completion
When the JobTracker receives a report that the last task for a job is complete, it
changes the job's status to "successful". The JobTracker then sends an HTTP notification to
the client that called waitForCompletion(). The job statistics and counter
information are printed to the client console. Finally, the JobTracker and the
TaskTrackers perform cleanup actions for the job.
MRUnit test
MRUnit is based on JUnit and allows for the unit testing of mappers, reducers and some
limited integration testing of the mapper – reducer interaction along with combiners,
custom counters and partitioners.
To write your test you would:
Testing Mappers
1. Instantiate an instance of the MapDriver class parameterized exactly as the
mapper under test.
2. Add an instance of the Mapper you are testing in the withMapper call.
3. In the withInput call pass in your key and input value
4. Specify the expected output in the withOutput call
5. The last call runTest feeds the specified input values into the mapper and
compares the actual output against the expected output set in the ‘withOutput’
method.
Testing Reducers
1. The test starts by creating a list of objects (pairList) to be used as the input
to the reducer.
2. A ReduceDriver is instantiated.
3. Next we pass in an instance of the reducer we want to test in the withReducer call.
4. In the withInput call we pass in the key (of "190101") and the pairList object
created at the start of the test.
5. Next we specify the output that we expect our reducer to emit.
6. Finally runTest is called, which feeds our reducer the inputs specified and
compares the output from the reducer against the expected output.
The MRUnit testing framework is based on JUnit and can test MapReduce programs
written for several versions of Hadoop.
Following is an example of using MRUnit to unit test a MapReduce program that does SMS
CDR (call details record) analysis.
The records look like:
655209;1;796764372490213;804422938115889;6
(CDRID;CDRType;Phone1;Phone2;SMS Status Code)
The MapReduce program analyzes these records, finds all records with CDRType 1,
and notes the corresponding SMS status code. For example, the mapper outputs:
6, 1
0, 1
The reducer takes these as inputs and outputs the number of times a particular status code
has been obtained in the CDR records.
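The tests below assume that mapDriver and reduceDriver have been created in a JUnit setUp() method. A minimal sketch of that setup is shown here; SMSCDRMapper and SMSCDRReducer are hypothetical names for the mapper and reducer classes under test:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;

// Fields and setup method of the test class:
MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

@Before
public void setUp() {
    SMSCDRMapper mapper = new SMSCDRMapper();     // hypothetical mapper class
    SMSCDRReducer reducer = new SMSCDRReducer();  // hypothetical reducer class
    mapDriver = MapDriver.newMapDriver(mapper);
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
}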
@Test
public void testMapper() {
    mapDriver.withInput(new LongWritable(), new Text(
            "655209;1;796764372490213;804422938115889;6"));
    mapDriver.withOutput(new Text("6"), new IntWritable(1));
    mapDriver.runTest();
}
@Test
public void testReducer() {
    List<IntWritable> values = new ArrayList<IntWritable>();
    values.add(new IntWritable(1));
    values.add(new IntWritable(1));
    reduceDriver.withInput(new Text("6"), values);
    reduceDriver.withOutput(new Text("6"), new IntWritable(2));
    reduceDriver.runTest();
}
Scalability problem
For large clusters with more than 4,000 nodes, the classic MapReduce framework hits
scalability problems, because a single JobTracker has to manage both job scheduling and
task progress monitoring for the whole cluster.
How YARN improves scaling
YARN addresses this by splitting the JobTracker's responsibilities into separate entities.
As shown in the figure, YARN involves more entities than classic MapReduce 1:
- Client, the same as in classic MapReduce, which submits the MapReduce job.
- Resource Manager, which has the ultimate authority for arbitrating resources among
all the applications in the cluster; it coordinates the allocation of compute resources on
the cluster.
- Node Manager, which is in charge of resource containers, monitoring resource
usage (CPU, memory, disk, network) on its node, and reporting to the Resource
Manager.
- Application Master, which is in charge of the life cycle of an application, such as a
MapReduce job. It negotiates with the Resource Manager for cluster resources, which in
YARN are called containers. The Application Master and the MapReduce tasks run in
containers that are scheduled by the Resource Manager and managed by the Node
Managers. The Application Master is also responsible for keeping track of task
progress and status.
- HDFS, the same as in classic MapReduce, for sharing files between the different entities.
Resource Manager consists of two components:
• Scheduler and
• Applications Manager.
Job Submission
Clients can submit jobs with the same API as in MapReduce 1. YARN implements its own
ClientProtocol, and the submission process is similar to MapReduce 1:
- The client calls the submit() method, which initiates a JobSubmitter object
and calls submitJobInternal().
- The Resource Manager allocates a new application ID and returns it to the client.
- The job client checks the output specification of the job.
- The job client computes the input splits.
- The job client copies the resources, including the split data, configuration information
and the job JAR, into HDFS.
- Finally, the job client notifies the Resource Manager that it is ready by calling
submitApplication() on the Resource Manager.
Job Initialization
When the Resource Manager (RM) receives the submitApplication() call, it hands the
job off to its scheduler. The job initialization is as follows:
- The scheduler allocates a resource container for the job.
- The RM launches the Application Master in that container, under the Node Manager's management.
- The Application Master initializes the job. The Application Master is a Java class named
MRAppMaster, which initializes the job by creating a number of bookkeeping objects to
keep track of the job's progress. It will receive the progress and completion reports
from the tasks.
- The Application Master retrieves the input splits from HDFS and creates a map task
object for each split. It creates a number of reduce task objects determined by the
mapreduce.job.reduces configuration property.
- The Application Master then decides how to run the job.
For a small job, called an uber job (one that has fewer than 10 mappers, only one
reducer, and an input size smaller than one HDFS block), the Application Master runs
the tasks sequentially in its own JVM. This policy differs from MapReduce 1,
which never runs small jobs on a single TaskTracker.
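Whether a job qualifies as an uber job is configurable. A hedged sketch of the relevant properties on Hadoop 2.x/YARN (uber jobs are disabled by default; the maxmaps/maxreduces values shown are the usual defaults):

Configuration conf = new Configuration();
conf.setBoolean("mapreduce.job.ubertask.enable", true);  // disabled by default
conf.setInt("mapreduce.job.ubertask.maxmaps", 9);        // at most 9 map tasks
conf.setInt("mapreduce.job.ubertask.maxreduces", 1);     // at most 1 reduce task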
For a larger job, the Application Master requests new containers from the Resource
Manager and the tasks are launched on Node Managers across the cluster, so the job
runs in parallel and achieves better performance.
The Application Master then calls the job setup method to create the job's output directory.
This differs from MapReduce 1, where the setup is performed in a special task run by a
TaskTracker.
Task Assignment
When the job is too large to be run on the same node as the Application
Master, the Application Master makes requests to the Resource Manager to
negotiate more resource containers, and these requests are piggybacked on heartbeat
calls. The task assignment is as follows:
- The Application Master makes requests to the Resource Manager in heartbeat calls. The
requests include data-locality information, such as the hosts and corresponding racks that
the input splits reside on.
- The Resource Manager hands the request over to the Scheduler, which makes decisions
based on this information. It attempts to place each task as close to its data as possible:
a data-local node is ideal; if that is not possible, rack-local is preferred to a non-local node.
- The requests also specify the memory requirements, which must be between the minimum
allocation (1 GB by default) and the maximum allocation (10 GB by default). The Scheduler
schedules a container with a multiple of 1 GB of memory for the task, based on the
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties set for the task.
This is more flexible than MapReduce 1, where TaskTrackers have a fixed number of
slots and each task runs in a slot. Each slot has a fixed memory allowance, which causes
two problems: a small task wastes memory, and a large task that needs more memory is
starved of it. In YARN, the memory allocation is more fine-grained, which is one of the
beauties of YARN.
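A hedged sketch of how a job might request per-task container memory using the properties mentioned above (the values are illustrative, and conf would then be used to build the Job):

Configuration conf = new Configuration();
conf.setInt("mapreduce.map.memory.mb", 2048);     // memory requested for each map container
conf.setInt("mapreduce.reduce.memory.mb", 4096);  // memory requested for each reduce container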
Task Execution
After the task has been assigned a container by the Resource Manager's scheduler,
the Application Master contacts the Node Manager, which launches the task JVM.
The task execution is as follows:
- A Java application whose main class is YarnChild localizes the resources
that the task needs. YarnChild retrieves the job resources, including the job JAR,
the configuration file, and any files needed from the distributed cache, to the
local disk.
- YarnChild then runs the map or reduce task.
Each YarnChild runs in a dedicated JVM, which isolates user code from long-running
system daemons like the Node Manager and the Application Master. Unlike
MapReduce 1, YARN does not support JVM reuse, so each task runs in a new
JVM.
Streaming and Pipes processes and communication work in the same way as in MapReduce 1.
Progress and Status Updates
When a job is running under YARN, the mapper or reducer reports its status and
progress to its Application Master every 3 seconds over the umbilical interface. The
Application Master aggregates these reports into a view of the task status and
progress. In MapReduce 1, by contrast, the TaskTracker reports status to the JobTracker,
which is responsible for aggregating the statuses into a global view.
Moreover, the Node Manager sends heartbeats to the Resource Manager every few
seconds. The Node Manager monitors the Application Master and the resource usage
of the containers (CPU, memory, network) and reports to the Resource
Manager. When a Node Manager fails and stops sending heartbeats,
the Resource Manager removes the node from its pool of available resource nodes.
The client polls the status by calling getStatus() every second to receive progress
updates, which are printed on the user console. Users can also check the status from the
web UI: the Resource Manager web UI displays all the running applications, with
links to the web UIs that display task status and progress in detail.
Job Completion
Every 5 seconds the client checks for job completion over the HTTP ClientProtocol
when waitForCompletion() is called. When the job is done, the Application Master and the
task containers clean up their working state, and the OutputCommitter's job cleanup
method is called. The job information is then archived as history for later interrogation
by users.
Benefits of YARN
- The JobTracker's responsibilities are split between the Resource Manager and per-application
Application Masters, which improves scalability.
- Rolling upgrades.
- Support for programming paradigms other than MapReduce (multi-tenancy), for example
HBase on YARN.
Failures in Classic MapReduce
One of the major benefits of using Hadoop is its ability to handle failures and
allow your job to complete.
1. Task Failure
• Consider first the case of the child task failing. The most common way that this
happens is when user code in the map or reduce task throws a runtime
exception. If this happens, the child JVM reports the error back to its parent
tasktracker, before it exits. The error ultimately makes it into the user logs. The
tasktracker marks the task attempt as failed, freeing up a slot to run another task.
• For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is
marked as failed. This behavior is managed by the stream.non.zero.exit.is.failure
property.
• Another failure mode is the sudden exit of the child JVM. In this case, the tasktracker
notices that the process has exited and marks the attempt as failed.
• A task attempt may also be killed, which is different from failing. Killed
task attempts do not count against the number of attempts to run the task, since
it wasn’t the task’s fault that an attempt was killed.
2. Tasktracker Failure
• If a tasktracker fails by crashing, or running very slowly, it will stop sending
heartbeats to the jobtracker (or send them very infrequently). The jobtracker will
notice a tasktracker that has stopped sending heartbeats (if it hasn’t received one
for 10 minutes) and remove it from its pool of tasktrackers to schedule tasks on.
3. Jobtracker Failure
• Failure of the jobtracker is the most serious failure mode. Currently, Hadoop has
no mechanism for dealing with failure of the jobtracker—it is a single point of
failure— so in this case the job fails.
Failures in YARN
Referring to the YARN architecture figure above, container and task failures are handled by the
Node Manager. When a container fails or dies, the Node Manager detects the failure and launches a
new container to replace the failing one and restart the task execution in it. In the event of an
Application Master failure, the Resource Manager detects the failure and starts a new instance of the
Application Master in a new container. The ability to recover the associated job state depends on the
Application Master implementation: the MapReduce Application Master can recover the state,
but this is not enabled by default. Besides the Resource Manager, the associated client also reacts to the
failure: it contacts the Resource Manager to locate the new Application Master's address.
Failure of the Resource Manager is severe, since clients cannot submit new jobs
and existing running jobs cannot negotiate and request new containers.
Existing Node Managers and Application Masters try to reconnect to the failed Resource
Manager, and the job progress is lost if they cannot reconnect. This loss of
job progress is likely to frustrate the engineers and data scientists who use YARN, because
typical production jobs that run on top of YARN are expected to have long running times,
typically on the order of a few hours.
Furthermore, this limitation prevents YARN from being used efficiently in cloud
environments (such as Amazon EC2), where node failures happen often.
Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key.
The process by which the system performs the sort—and transfers the map
outputs to the reducers as inputs—is known as the shuffle.
The shuffle is an area of the codebase where refinements and improvements are
continually being made.
STEPS
I. The Map Side
Each map task has a circular memory buffer that it writes its output to. The buffer is 100 MB
by default, a size which can be tuned by changing the io.sort.mb property. When the contents
of the buffer reach a certain threshold size (io.sort.spill.percent, 80% by default), a
background thread starts to spill the contents to disk. Map outputs continue to be
written to the buffer while the spill takes place, but if the buffer fills up during this time, the
map blocks until the spill is complete. Spills are written in round-robin fashion to the
directories specified by the mapred.local.dir property, in a job-specific subdirectory.
Before it writes to disk, the thread first divides the data into partitions corresponding to the
reducers that they will ultimately be sent to. Within each partition, the background thread
performs an in-memory sort by key, and if there is a combiner function, it is run on the
output of the sort.
Running the combiner function makes for a more compact map output, so there is
less data to write to local disk and to transfer to the reducer. Each time the memory buffer
reaches the spill threshold, a new spill file is created, so after the map task has written its
last output record there could be several spill files. Before the task is finished, the spill
files are merged into a single partitioned and sorted output file. The configuration property
io.sort.factor controls the maximum number of streams to merge at once; the default is 10.
If there are at least three spill files then the combiner is run again before the output file is
written. Combiners may be run repeatedly over the input without affecting the final result.
If there are only one or two spills, then the potential reduction in map output size is not
worth the overhead of invoking the combiner, so it is not run again for this map output.
Compressing the map output as it is written to disk makes the write faster,
saves disk space, and reduces the amount of data to transfer to the reducer. By default,
the output is not compressed, but compression is easy to enable by setting
mapred.compress.map.output to true. The output file's partitions are made available to the
reducers over HTTP. The maximum number of worker threads used to serve the file
partitions is controlled by the tasktracker.http.threads property; the default of 40 may
need increasing for large clusters running large jobs.
II. The Reduce Side
The map output file is sitting on the local disk of the machine that ran the
map task. The reduce task needs the map output for its particular partition from
several map tasks across the cluster.
Copy phase of reduce: The map tasks may finish at different times, so the reduce task
starts copying their outputs as soon as each completes. The reduce task has a small
number of copier threads so that it can fetch map outputs in parallel.
The default is five threads, but this number can be changed by setting the
mapred.reduce.parallel.copies property. The map outputs are copied to the reduce task
JVM's memory if they are small enough; otherwise, they are copied to disk. When the
in-memory buffer reaches a threshold size, or reaches a threshold number of map outputs,
it is merged and spilled to disk. If a combiner is specified, it is run during the merge to
reduce the amount of data written to disk. As the copies accumulate on disk, a background
thread merges them into larger, sorted files. This saves some time merging later on.
Any map outputs that were compressed have to be decompressed in memory in
order to perform a merge on them. When all the map outputs have been copied, the
reduce task moves into the sort phase, which merges the map outputs, maintaining their
sort ordering. This is done in rounds. For example, if there were 50 map outputs and the
merge factor was 10, there would be 5 rounds; each round would merge 10 files into
one, so at the end there would be five intermediate files. Rather than merging these five
files into a single sorted file, the merge saves a trip to disk by feeding the reduce function
directly. This final merge can come from a mixture of in-memory and on-disk segments.
During the reduce phase, the reduce function is invoked for each key in the sorted
output. The output of this phase is written directly to the output filesystem, typically HDFS.
In the case of HDFS, since the tasktracker node is also running a datanode, the first
block replica will be written to the local disk.
On the map side, the best performance can be obtained by avoiding multiple spills to
disk; one is optimal. If you can estimate the size of your map outputs, then you can set the
io.sort.* properties appropriately to minimize the number of spills. There is a MapReduce
counter that counts the total number of records that were spilled to disk over the course of
a job, which can be useful for tuning. The counter includes both map- and reduce-side
spills.
On the reduce side, the best performance is obtained when the intermediate data can
reside entirely in memory. By default, this does not happen, since for the general case
all the memory is reserved for the reduce function. But if your reduce function has light
memory requirements, then setting mapred.inmem.merge.threshold to 0 and
mapred.job.reduce.input.buffer.percent to 1.0 may bring a performance boost. More
generally, Hadoop uses an I/O buffer size of 4 KB by default (io.file.buffer.size), which is
low, so you should increase this across the cluster.
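A hedged sketch that sets the shuffle-related properties discussed above on a job configuration (old-style property names, as used in this text; the values are illustrative, not recommendations):

Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 200);                        // map-side sort buffer, in MB
conf.setInt("io.sort.factor", 50);                     // streams merged at once
conf.setBoolean("mapred.compress.map.output", true);   // compress map output
conf.setInt("mapred.inmem.merge.threshold", 0);        // keep reduce-side merge in memory
conf.setFloat("mapred.job.reduce.input.buffer.percent", 1.0f);
conf.setInt("io.file.buffer.size", 128 * 1024);        // raise the default 4 KB I/O buffer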
Input Formats
Hadoop can process many different types of data formats, from flat text files to
databases.
1) Input Splits and Records:
An input split is a chunk of the input that is processed by a single map. Each map
processes a single split. Each split is divided into records, and the map processes each
record—a key-value pair—in turn.
public abstract class InputSplit {
    public abstract long getLength() throws IOException, InterruptedException;
    public abstract String[] getLocations() throws IOException, InterruptedException;
}
FileInputFormat: FileInputFormat is the base class for all implementations of
InputFormat that use files as their data source. It provides two things: a place to
define which files are included as the input to a job, and an implementation for
generating splits for the input files.
FileInputFormat input paths: The input to a job is specified as a collection of paths,
which offers great flexibility in constraining the input to a job. FileInputFormat offers
four static convenience methods for setting a Job’s input paths:
public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)
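A hedged usage sketch of these methods (the paths are hypothetical); note that addInputPath()/addInputPaths() add to the existing list, while setInputPaths() replaces it:

Job job = Job.getInstance(new Configuration(), "input path example");
FileInputFormat.addInputPath(job, new Path("/data/2019"));
FileInputFormat.addInputPaths(job, "/data/2020,/data/2021");
// setInputPaths() would replace everything added so far:
// FileInputFormat.setInputPaths(job, new Path("/data/all"));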
Fig: InputFormat class hierarchy
FileInputFormat input splits: FileInputFormat splits only large files. Here “large”
means larger than an HDFS block. The split size is normally the size of an HDFS
block.
Table: Properties for controlling split size
Preventing splitting: There are a couple of ways to ensure that an existing file is
not split. The first way is to increase the minimum split size to be larger than the
largest file in your system. The second is to subclass the concrete FileInputFormat
subclass that you want to use and override its isSplitable() method to return
false.
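A hedged sketch of the second approach, subclassing TextInputFormat (one concrete FileInputFormat) and overriding isSplitable():

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // every file becomes exactly one split
    }
}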
File information in the mapper: A mapper processing a file input split can find
information about the split by calling the getInputSplit() method on the Mapper’s
Context object.
Table: File split properties
Output Formats
1) Text Output: The default output format, TextOutputFormat, writes records as lines of
text. Its keys and values may be of any type, since TextOutputFormat turns them into
strings by calling toString() on them.
2) Binary Output
SequenceFileOutputFormat: As the name indicates, SequenceFileOutputFormat
writes sequence files for its output. Compression is controlled via the static
methods on SequenceFileOutputFormat.
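A hedged sketch of enabling compressed sequence-file output via those static methods, given a Job instance job (the codec choice is illustrative):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);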
SequenceFileAsBinaryOutputFormat: SequenceFileAsBinaryOutputFormat is the
counterpart to SequenceFileAsBinaryInputFormat, and it writes keys and values in
raw binary format into a SequenceFile container.
MapFileOutputFormat: MapFileOutputFormat writes MapFiles as output. The keys
in a MapFile must be added in order, so you need to ensure that your reducers
emit keys in sorted order.
3) Multiple Outputs: FileOutputFormat and its subclasses generate a set of files in the
output directory. There is one file per reducer, and files are named by the partition number:
part-r-00000, part-r-00001, etc. Sometimes there is a need for more control over the naming
of the files or to produce multiple files per reducer; MapReduce comes with the
MultipleOutputs class to help you do this.
Zero reducers: There are no partitions, as the application needs to run only map
tasks.
One reducer: It can be convenient to run small jobs to combine the output of
previous jobs into a single file. This should only be attempted when the amount of data is
small enough to be processed comfortably by one reducer.
MultipleOutputs: MultipleOutputs allows you to write data to files whose names are
derived from the output keys and values, or in fact from an arbitrary string.
MultipleOutputs delegates to the mapper's OutputFormat, which in this
example is a TextOutputFormat, but more complex setups are possible.
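A hedged sketch of a reducer that uses MultipleOutputs (new API) to name its output files after the key; the class name and key/value types are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitionByKeyReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // The third argument is the base path of the file this record goes to.
            multipleOutputs.write(NullWritable.get(), value, key.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}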
Lazy Output: FileOutputFormat subclasses will create output (part-r-nnnnn) files,
even if they are empty. Some applications prefer that empty files not be created, which is
where LazyOutputFormat helps. It is a wrapper output format that ensures that the
output file is created only when the first record is emitted for a given partition. To use it,
call its setOutputFormatClass() method with the JobConf and the underlying output format.
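A hedged one-line sketch using the new-API LazyOutputFormat wrapper, given a Job instance job (the underlying format choice is illustrative):

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Wrap TextOutputFormat so that empty part files are never created.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);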
Database Output: The output formats for writing to relational databases and to
HBase are DBOutputFormat (which writes reduce outputs to a database via JDBC) and
TableOutputFormat (which writes to HBase tables), respectively.