Unit-3 BDA
1. History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web search engine,
itself a part of the Lucene project.
In January 2008, Hadoop was made its own top-level project at Apache, confirming its
success and its diverse, active community. By this time, Hadoop was being used by many other
companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. In one well-
publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch through 4
terabytes of scanned archives from the paper, converting them to PDFs for the Web. The
processing took less than 24 hours to run using 100 machines, and the project probably
wouldn’t have been embarked upon without the combination of Amazon’s pay-by-the-hour
model (which allowed the NYT to access a large number of machines for a short period) and
Hadoop’s easy-to-use parallel programming model.
In April 2008, Hadoop broke a world record to become the fastest system to sort an
entire terabyte of data. Running on a 910-node cluster, Hadoop sorted 1 terabyte in 209 seconds
(just under 3.5 minutes), beating the previous year’s winner of 297 seconds. In November of
the same year, Google reported that its MapReduce implementation sorted 1 terabyte in 68
seconds. Then, in April 2009, it was announced that a team at Yahoo! had used Hadoop to sort
1 terabyte in 62 seconds. The trend since then has been to sort even larger volumes of data at
ever faster rates. In the 2014 competition, a team from Databricks were joint winners of the
Gray Sort benchmark. They used a 207-node Spark cluster to sort 100 terabytes of data in 1,406
seconds, a rate of 4.27 terabytes per minute.
Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a general
purpose storage and analysis platform for big data has been recognized by the industry, and
this fact is reflected in the number of products that use or incorporate Hadoop in some way.
Commercial Hadoop support is available from large, established enterprise vendors, including
EMC, IBM, Microsoft, and Oracle, as well as from specialist Hadoop companies such as
Cloudera, Hortonworks, and MapR.
HADOOP BASICS:
• Need to process huge datasets on large clusters of computers
• Very expensive to build reliability into each application
• Nodes fail every day
  – Failure is expected, rather than exceptional
  – The number of nodes in a cluster is not constant
• Need a common infrastructure
  – Efficient, reliable, easy to use
  – Open source (Apache licence)
• Key benefit: flexibility
Client-server Concept
• A client sends requests to one or more servers, which in turn accept and process them and return the requested information to the client.
• A server typically runs software that listens on a particular IP address and port number for requests.
Examples:
Server – web server
Client – web browser
• Very Large Distributed File System
  – 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
  – Files are replicated to handle hardware failure
  – Detect failures and recover from them
• Optimized for Batch Processing
  – Data locations exposed so that computations can move to where data resides
  – Provides very high aggregate bandwidth
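The bullets above describe HDFS, Hadoop's distributed filesystem. To make the access path concrete, here is a minimal sketch (not from the original notes) of reading a file through Hadoop's Java FileSystem API; the namenode URI and file path are placeholders. The client asks the namenode for block locations and then streams the data directly from the datanodes.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal sketch: print an HDFS file to standard output.
// The URI and file path below are placeholders for illustration only.
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode:8020/user/hadoop/sample.txt"; // hypothetical path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                    // namenode resolves block locations
      IOUtils.copyBytes(in, System.out, 4096, false); // data is streamed from the datanodes
    } finally {
      IOUtils.closeStream(in);
    }
  }
}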
Datanodes – workhorses of the filesystem
• They store and retrieve blocks when they are told to (by clients or namenode)
• And they report back to the namenode periodically with lists of blocks that they are
storing
• Without the namenode, the filesystem cannot be used
• If the machine running the namenode were obliterated, all the files on the filesystem
would be lost since there would be no way of knowing how to reconstruct the files from
the blocks on the datanodes.
• For this reason, it is important to make the namenode resilient to failure. Hadoop provides two mechanisms for this:
1. The first way is to back up the files – Hadoop can be configured so that the namenode
writes its persistent state to multiple filesystems
2. Secondary namenode – it does not act as a namenode. Its main role is to periodically
merge the namespace image with the edit log, to prevent the edit log from becoming too large
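Tying the datanode and namenode roles together: a client can ask the namenode which datanodes hold each block of a file. The following is an illustrative sketch only; it assumes fs.defaultFS points at the cluster, and the file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list which datanodes hold each block of a file.
// The namenode answers this query from the block reports it receives from the datanodes.
public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration()); // assumes fs.defaultFS is the cluster
    Path file = new Path("/user/hadoop/sample.txt");     // placeholder path
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + " length " + block.getLength()
          + " hosts " + String.join(",", block.getHosts()));
    }
  }
}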
Block Caching
• Normally, a datanode reads blocks from disk, but for frequently accessed files the
blocks may be explicitly cached in the datanode’s memory – block cache
• By default, a block is cached in only one datanode’s memory
• Job schedulers (for MapReduce, Spark and other frameworks) can take advantage of
cached blocks by running tasks on the datanode where a block is cached, for increased
read performance
• Users or applications instruct the namenode which files to cache, and for how long, by adding a cache directive to a cache pool
• Cache pools are an administrative grouping for managing cache permissions and resource usage
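As a hedged illustration, a cache pool and a cache directive can also be created programmatically through the DistributedFileSystem API (part of HDFS centralized cache management, Hadoop 2.3 and later); the pool name and path below are made up. In practice an administrator would usually manage pools and directives with the hdfs cacheadmin command instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

// Sketch: create a cache pool and ask the namenode to cache a file in datanode memory.
// Pool name and file path are placeholders; assumes the default filesystem is HDFS.
public class CacheHotFile {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    dfs.addCachePool(new CachePoolInfo("hot-data"));        // administrative grouping
    long directiveId = dfs.addCacheDirective(
        new CacheDirectiveInfo.Builder()
            .setPath(new Path("/user/hadoop/lookup-table")) // hypothetical file
            .setPool("hot-data")
            .build());
    System.out.println("Added cache directive " + directiveId);
  }
}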
3. Components of Hadoop
Fig. 3.6 MapReduce
Fig. 3.9 YARN
• YARN also includes an Application Master and a Scheduler.
Fig. 3.16 Application workflow in Hadoop YARN
Fig. 3.17 Hadoop Ecosystem and Their Components
The components of the Hadoop ecosystem are discussed in this section one by one in detail.
Hadoop Distributed File System
It is the most important component of the Hadoop ecosystem. HDFS is the primary storage
system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable, fault-tolerant, reliable and cost-efficient data storage for big data. HDFS is
a distributed filesystem that runs on commodity hardware. The default HDFS configuration is
adequate for many installations; large clusters, however, usually need the configuration to be
tuned. Users interact directly with HDFS through shell-like commands.
HDFS Components:
There are two major components of Hadoop HDFS: NameNode and DataNode. Let's now discuss these HDFS components.
i. NameNode
It is also known as the master node. The NameNode does not store the actual data or dataset. The NameNode stores metadata, i.e. the number of blocks, their locations, on which rack and on which DataNode the data is stored, and other details. The namespace it manages consists of files and directories.
Tasks of HDFS NameNode
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as naming, closing, and opening files and directories.
ii. DataNode
It is also known as the slave node. The HDFS DataNode is responsible for storing the actual data in HDFS. The DataNode performs read and write operations as per the requests of the clients. Each block replica on a DataNode consists of two files on the local file system: the first holds the data and the second records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake, during which the namespace ID and the software version of the DataNode are verified. If a mismatch is found, the DataNode shuts down automatically.
Tasks of HDFS DataNode
DataNode performs operations like block replica creation, deletion, and replication according to the instructions of the NameNode.
DataNode manages data storage of the system.
This was all about HDFS as a Hadoop Ecosystem component.
MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component which provides data
processing. MapReduce is a software framework for easily writing applications that process
the vast amount of structured and unstructured data stored in the Hadoop Distributed File
system.
MapReduce programs are parallel in nature, and are thus very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Features of MapReduce
Simplicity – MapReduce jobs are easy to run. Applications can be written in any language
such as Java, C++, and Python.
Scalability – MapReduce can process petabytes of data.
Speed – By means of parallel processing, problems that take days to solve can be solved in
hours or minutes by MapReduce.
Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable,
another machine has a copy of the same key-value pair, which can be used for solving the same
subtask.
YARN
Hadoop YARN (Yet Another Resource Negotiator) is a Hadoop ecosystem component
that provides resource management. YARN is also one of the most important components of the
Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for
managing and monitoring workloads. It allows multiple data processing engines, such as real-
time streaming and batch processing, to handle data stored on a single platform.
Hive
The Hadoop ecosystem component Apache Hive is an open source data warehouse
system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main
functions: data summarization, query, and analysis. Hive uses a language called HiveQL (HQL),
which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce
jobs that execute on Hadoop.
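For illustration only, a HiveQL query can be submitted from Java over JDBC to HiveServer2 (the hive-jdbc driver must be on the classpath); the host, credentials and weather table below are assumptions, not part of these notes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: run a HiveQL query over JDBC against HiveServer2.
// Host, database and table names are placeholders; Hive turns the query into MapReduce jobs.
public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://hiveserver:10000/default", "hadoop", "");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT year, MAX(temperature) FROM weather GROUP BY year"); // hypothetical table
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
    }
    conn.close();
  }
}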
Fig. 3.21 Pig Diagram
Fig. 3.22 HBase Diagram
HCatalog
It is a table and storage management layer for Hadoop. HCatalog supports different
components available in Hadoop ecosystems like MapReduce, Hive, and Pig to easily read and
write data from the cluster. HCatalog is a key component of Hive that enables the user to store
their data in any format and structure.
By default, HCatalog supports RCFile, CSV, JSON, SequenceFile and ORC file formats.
Benefits of HCatalog:
Enables notifications of data availability.
With the table abstraction, HCatalog frees the user from the overhead of data storage.
Provides visibility for data cleaning and archiving tools.
Avro
Avro is a part of the Hadoop ecosystem and is a popular data serialization
system. Avro is an open source project that provides data serialization and data exchange
services for Hadoop. These services can be used together or independently. Using Avro,
programs written in different languages can exchange big data. Using the serialization service,
programs can serialize data into files or messages. Avro stores the data definition and the data
together in one message or file, making it easy for programs to dynamically understand the
information stored in an Avro file or message.
Avro schema – Avro relies on schemas for serialization and deserialization. Avro requires the schema
for data reads and writes. When Avro data is stored in a file, its schema is stored with it, so that files
may be processed later by any program.
Dynamic typing – It refers to serialization and deserialization without code generation. It
complements the code generation which is available in Avro for statically typed languages as
an optional optimization.
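A minimal sketch of writing Avro data with the Java generic API follows; the record schema and output file name are invented for illustration. Note how the schema is embedded in the container file so that any reader can process it later.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Sketch: write one record to an Avro container file; the schema travels with the data.
// The schema and output file name are made up for illustration.
public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
        + "{\"name\":\"year\",\"type\":\"string\"},"
        + "{\"name\":\"temperature\",\"type\":\"int\"}]}");

    GenericRecord record = new GenericData.Record(schema);
    record.put("year", "1950");
    record.put("temperature", 22);

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("readings.avro")); // schema is stored in the file header
    writer.append(record);
    writer.close();
  }
}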
Features provided by Avro:
Rich data structures.
Remote procedure call.
Compact, fast, binary data format.
Container file, to store persistent data.
Thrift
It is a software framework for scalable cross-language services development. Thrift is
an interface definition language for RPC (Remote Procedure Call) communication. Hadoop makes
a lot of RPC calls, so there is a possibility of using the Hadoop ecosystem component Apache Thrift
for performance or other reasons.
Apache Drill
It is designed to scale to several thousands of nodes and query petabytes of data. Drill is the
first distributed SQL query engine that has a schema-free model.
Application of Apache Drill
Drill has become an invaluable tool at Cardlytics, a company that provides consumer
purchase data for mobile and internet banking. Cardlytics uses Drill to quickly process
trillions of records and execute queries.
Features of Apache Drill:
Drill has a specialized memory management system that eliminates garbage collection
and optimizes memory allocation and usage. Drill plays well with Hive by allowing developers
to reuse their existing Hive deployment.
Extensibility – Drill provides an extensible architecture at all layers, including query layer,
query optimization, and client API. We can extend any layer for the specific need of an
organization.
Flexibility – Drill provides a hierarchical columnar data model that can represent complex,
highly dynamic data and allow efficient processing.
Dynamic schema discovery – Apache Drill does not require schema or type specification
for the data in order to start the query execution process. Instead, Drill starts processing the
data in units called record batches and discovers the schema on the fly during processing.
Decentralized metadata – Unlike other SQL-on-Hadoop technologies, Drill does not
have a centralized metadata requirement. Drill users do not need to create and manage tables
in metadata in order to query data.
Apache Mahout
Mahout is an open source framework for creating scalable machine learning algorithms and
a data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science
tools to automatically find meaningful patterns in those big data sets.
Algorithms of Mahout are:
Clustering – It takes items in a particular class and organizes them into naturally
occurring groups, such that items belonging to the same group are similar to each other.
Collaborative filtering – It mines user behaviour and makes product recommendations
(e.g. Amazon recommendations)
Classifications – It learns from existing categorization and then assigns unclassified items
to the best category.
Frequent pattern mining – It analyzes items in a group (e.g. items in a shopping cart or
terms in query session) and then identifies which items typically appear together.
Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem components
like HDFS, HBase or Hive. It also exports data from Hadoop to other external sources. Sqoop
works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
Features of Apache Sqoop:
Import sequential datasets from mainframe – Sqoop satisfies the growing need to move
data from the mainframe to HDFS.
Import direct to ORC files – Improves compression, provides lightweight indexing, and
improves query performance.
Parallel data transfer – For faster performance and optimal system utilization.
Efficient data analysis – Improves the efficiency of data analysis by combining structured
and unstructured data in a schema-on-read data lake.
Fast data copies – from an external system into Hadoop.
Ambari
Ambari, another Hadoop ecosystem component, is a management platform for
provisioning, managing, monitoring and securing Apache Hadoop clusters. Hadoop management
gets simpler as Ambari provides a consistent, secure platform for operational control.
Fig. 3.27 ZooKeeper Diagram
Features of Zookeeper:
Fast – Zookeeper is fast with workloads where reads to data are more common than writes.
The ideal read/write ratio is 10:1.
Ordered – Zookeeper maintains a record of all transactions.
Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines
multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated
with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for
Apache MapReduce, Pig, Hive, and Sqoop.
Oozie workflow – It stores and runs workflows composed of Hadoop jobs, e.g.,
MapReduce, Pig, Hive.
Oozie Coordinator – It runs workflow jobs based on predefined schedules and availability
of data.
This was all about Components of Hadoop Ecosystem
Example Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}
Example Reducer for maximum temperature example
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class MaxTemperatureReducer extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int maxValue = Integer.MIN_VALUE;
while (values.hasNext()) {
maxValue = Math.max(maxValue, values.next().get());
}
output.collect(key, new IntWritable(maxValue));
}
}
Example Application to find the maximum temperature in the weather dataset
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}
5. Scaling Out
• This section looks at the data flow for MapReduce jobs with large inputs.
• For simplicity, the examples so far have used files on the local filesystem.
• However, to scale out,
– data should be stored in a distributed filesystem, typically HDFS,
– to allow Hadoop to move the MapReduce computation
– to each machine hosting a part of the data.
• Having many splits means the time taken to process each split is small compared to the
time to process the whole input.
• So if we are processing the splits in parallel, the processing is better load-balanced if
the splits are small, since a faster machine will be able to process proportionally more
splits over the course of the job than a slower machine.
• On the other hand,
• if splits are too small, then the overhead of managing the splits and of map task creation
begins to dominate the total job execution time.
• For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default,
although this can be changed for the cluster
• Hadoop does its best to run the map task on a node where the input data resides in
HDFS. This is called the data locality optimization.
• It should now be clear why the optimal split size is the same as the block size: it is the
largest size of input that can be guaranteed to be stored on a single node.
• Map tasks write their output to the local disk, not to HDFS.
Why is this?
• Map output is intermediate output: it’s processed by reduce tasks to produce the final
output, and once the job is complete the map output can be thrown away.
• So storing it in HDFS, with replication, would be overkill.
• If the node running the map task fails before the map output has been consumed by the
reduce task, then Hadoop will automatically rerun the map task on another node to re-
create the map output.
Fig. 3.30 MapReduce data flow with a single reduce task
Fig. 3.31 MapReduce data flow with multiple reduce tasks
Fig. 3.32 MapReduce data flow with no reduce tasks
Combiner Functions
• Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks.
• Hadoop allows the user to specify a combiner function to be run on the map output—
the combiner function’s output forms the input to the reduce function.
• Since the combiner function is an optimization, Hadoop does not provide a guarantee
of how many times it will call it for a particular map output record, if at all.
• In other words, calling the combiner function zero, one, or many times should produce
the same output from the reducer
• The contract for the combiner function constrains the type of function that may be used.
• This is best illustrated with an example.
• Suppose that for the maximum temperature example, readings for the year 1950 were
processed by two maps (because they were in different splits).
• Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
• And the second produced:
(1950, 25)
(1950, 15)
• The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
• with output:
(1950, 25)
• since 25 is the maximum value in the list.
A combiner function can be used that, just like the reduce function, finds the maximum temperature
for each map output.
• The reduce would then be called with:
(1950, [20, 25])
• and the reduce would produce the same output as before.
• More succinctly, we may express the function calls on the temperature values in this
case as follows:
max (0, 20, 10, 25, 15) = max (max (0, 20, 10), max (25, 15)) = max (20, 25) = 25
• If calculating mean temperatures, then we couldn’t use the mean as our combiner
function, since:
mean (0, 20, 10, 25, 15) = 14
• but:
mean (mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
• The combiner function doesn’t replace the reduce function.
• But it can help cut down the amount of data shuffled between the maps and the reduces,
and for this reason alone it is always worth considering whether a combiner function
can be used in the MapReduce job.
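For the maximum temperature job, enabling a combiner is a one-line change in the driver shown earlier. This sketch shows only the relevant lines (old JobConf API), reusing the reducer class as the combiner.

// In the MaxTemperature driver (old MapReduce API), the combiner is set on the JobConf.
// Reusing MaxTemperatureReducer as the combiner is valid here because taking a maximum
// is commutative and associative; it would not be valid for a mean.
conf.setMapperClass(MaxTemperatureMapper.class);
conf.setCombinerClass(MaxTemperatureReducer.class); // runs on map output before the shuffle
conf.setReducerClass(MaxTemperatureReducer.class);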
6. Hadoop Streaming
• Hadoop provides an API to MapReduce that allows map and reduce functions to be written
in languages other than Java.
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and
the program, so any language can be used that can read standard input and write to
standard output to write the MapReduce program.
• Streaming is naturally suited for text processing and when used in text mode, it has a
line-oriented view of data.
• Map input data is passed over standard input to the map function, which processes it
line by line and writes lines to standard output.
• A map output key-value pair is written as a single tab-delimited line.
• Input to the reduce function is in the same format—a tab-separated key-value pair—
passed over standard input.
• The reduce function reads lines from standard input, which the framework guarantees
are sorted by key, and writes its results to standard output.
• Streaming mappers and reducers are commonly written in scripting languages such as Ruby and Python.
7. Design of HDFS
Distributed File System
• File Systems that manage the storage across a network of machines
• Since they are network based, all the complications of network programming occur
• This makes DFS more complex than regular disk file systems
– For example, one of the biggest challenges is making the filesystem tolerate
node failure without suffering data loss
• HDFS – a file system designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware
• Very Large files
– Hundreds of megabytes, gigabytes or terabytes in size
– There are Hadoop clusters running today that store petabytes of data
• Streaming Data Access
– HDFS is built around the idea that the most efficient data processing pattern is
a write-once, read-many-times pattern
– A dataset is typically generated or copied from source, and then various analyses
are performed on that dataset over time
• Commodity Hardware
– Hadoop doesn’t require expensive, highly reliable hardware.
– It is designed to run on clusters of commodity hardware(commonly available
hardware that can be obtained from multiple vendors)
– The chance of node failure is high, at least for large clusters
– HDFS is designed to carry on working without interruption to the user in the
face of such failure
Areas where HDFS is not good fit today
• Low-latency data access
– Applications that require low-latency access to data, in the tens of milliseconds
range, will not work well with HDFS
– HDFS is designed for delivering a high throughput of data and this may be at
the expense of latency
– Hbase – better choice of low-latency access
• Lots of small files
– Because the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the
namenode
– As a rule of thumb, each file, directory and block takes about 150 bytes.
– So, for example, one million files, each taking one block, amounts to roughly two
million objects (a file object plus a block object each) and would need at least 300 MB
of memory
• Multiple writers, arbitrary file modifications
– Files in HDFS may be written to by a single writer
– Writes are always made at the end of the file, in append-only fashion
– There is no support for multiple writers or for modifications at arbitrary offsets
in the file
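As a small illustration of the single-writer, append-only model, the sketch below appends to an existing HDFS file; it assumes append support is enabled on the cluster, and the path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: HDFS supports a single writer appending at the end of an existing file,
// but not modifications at arbitrary offsets. The path is a placeholder.
public class AppendExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.append(new Path("/user/hadoop/events.log")); // append-only
    out.writeBytes("new record\n");
    out.close();
    // There is no API to seek and overwrite bytes in the middle of an HDFS file.
  }
}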
9. How MapReduce Works
10. Anatomy of a MapReduce Job run
• Can run a MapReduce job with a single method call:
submit () on a Job object
• Can also call
waitForCompletion() – submits the job if it hasn’t been submitted already, then
waits for it to finish
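A minimal driver sketch using the newer org.apache.hadoop.mapreduce API shows the waitForCompletion() call in context; the class below is illustrative and is not the same driver as the earlier example (which uses the old mapred API).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of job submission with the newer org.apache.hadoop.mapreduce API.
// waitForCompletion(true) submits the job (if not already submitted) and polls its progress.
public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "Max temperature");
    job.setJarByClass(MaxTemperatureDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Mapper and reducer classes would be set here with job.setMapperClass(...) etc.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    boolean success = job.waitForCompletion(true); // true = report progress to the console
    System.exit(success ? 0 : 1);
  }
}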
At the highest level, there are five independent entities:
• The client, which submits the MapReduce job
• The YARN resource manager, which coordinates the allocation of computer resources
on the cluster
• The YARN node managers, which launch and monitor the compute containers on
machines in the cluster
• The MapReduce application master, which coordinates the tasks running the
MapReduce job. The application master and the MapReduce tasks run in containers
that are scheduled by the resource manager and managed by the node managers.
• The DFS (normally HDFS), which is used for sharing job files between the other
entities.
Fig. 3.33 Anatomy of MapReduce Job Run
Job Submission
• The submit() method on job creates an internal JobSubmitter instance and calls
submitJobInternal() on it (step 1 in figure)
• Having submitted the job, waitForCompletion() polls the job’s progress once per
second and reports the progress to the console if it has changed since last report
• When the job completes successfully, the job counters are displayed.
• Otherwise, the error that caused the job to fail is logged to the console.
• The job submission process implemented by JobSubmitter does the following
• Asks the resource manager for a new application ID, which is used for the MapReduce job
ID (step 2)
• Checks the output specification of the job. For example, if the output directory has not
been specified or it already exists, the job is not submitted and an error is thrown to the
MapReduce program
• Computes the input splits for the job. If the splits cannot be computed (because the
input paths don't exist, for example), the job is not submitted and an error is thrown to
the MapReduce program
• Copies the resources needed to run the job, including the job JAR file, the configuration
file and the computed input splits, to the shared filesystem in a directory named after
the job ID (step 3). The job JAR is copied with a high replication factor so that there
are lots of copies across the cluster for the node managers to access when they run tasks
for the job.
• Submits the job by calling submitApplication() on the resource manager (step 4)
Job Initialization
• When the resource manager receives a call to its submitApplication() method, it
hands off the request to the YARN scheduler.
• The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management (steps 5a
and 5b)
• The application master for MapReduce jobs is a Java application whose main class
is MRAppMaster.
• It initializes the job by creating a number of bookkeeping objects to keep track of the
job's progress, as it will receive progress and completion reports from the tasks (step 6)
• Next, it retrieves the input splits computed in the client from the shared filesystem (step
7)
• It then creates a map task object for each split, as well as a number of reduce task objects
determined by the mapreduce.job.reduces property (set by the setNumReduceTasks()
method on Job). Tasks are given IDs at this point
• The application master must decide how to run the tasks that make up the MapReduce
job
• If the job is small, the application master may choose to run the tasks in the same JVM
as itself.
• This happens when it judges that the overhead of allocating and running tasks in new
containers outweighs the gain to be had in running them in parallel, compared to
running them sequentially on one node. Such a job is said to be uberized or run as an
uber task.
What qualifies as a small job?
• By default,
• a small job is one that has
– fewer than 10 mappers,
– only one reducer and
– an input size that is less than the size of one HDFS block
• Before it can run the task, the task JVM (YarnChild) localizes the resources that the task
needs, including the job configuration and JAR file, and any files from the distributed cache (step 10).
• Finally it runs the map or reduce task (step 11)
• The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined map and
reduce functions don’t affect the node manager – by causing it to crash or hang.
Streaming
• Streaming runs special map and reduce tasks for the purpose of launching the user-
supplied executable and communicating with it
Job Completion
• When the application master receives a notification that the last task for a job is
complete, it changes the status for the job to “successful”
• Finally, on job completion, the application master and the task containers clean up their
working state