Unit - III
Failures in MapReduce:
1. Hardware failures
2. Software failures
3. Network failures
4. Input data issues
5. Resource allocation issues
Hardware failures: Nodes in the cluster can fail due to hardware
problems such as hard disk crashes, power supply failures, or loss of
network connectivity. These failures can cause loss of data or
processing capacity and can slow down the entire MapReduce job.
Job Submission:
•Asks the resource manager for a new application ID, used for the
MapReduce job ID.
•Checks the output specification of the job. For example, if the output
directory has not been specified or it already exists, the job is not
submitted and an error is thrown to the MapReduce program.
•Computes the input splits for the job. If the splits cannot be
computed (because the input paths don’t exist, for example), the job
is not submitted and an error is thrown to the MapReduce program.
•Copies the resources needed to run the job, including the job
JAR file, the configuration file, and the computed input splits, to
the shared filesystem in a directory named after the job ID.
•Submits the job by calling submitApplication() on the resource
manager (a minimal driver sketch follows this list).
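To make these steps concrete, here is a minimal word-count driver sketch. The class name WordCountDriver is illustrative; TokenCounterMapper and IntSumReducer are Hadoop's built-in classes. The submission checks above are triggered inside waitForCompletion():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCounterMapper.class); // built-in tokenizing mapper
        job.setReducerClass(IntSumReducer.class);     // built-in summing reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        // waitForCompletion() submits the job and polls its progress until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}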
Job Initialization:
•When the resource manager receives a call to its submitApplication() method, it hands off the request to
the YARN scheduler.
•The scheduler allocates a container, and the resource manager then launches the application master’s
process there, under the node manager’s management.
•The application master for MapReduce jobs is a Java application whose main class is MRAppMaster.
•It initializes the job by creating a number of bookkeeping objects to keep track of the job’s progress, as it
will receive progress and completion reports from the tasks.
•It retrieves the input splits computed in the client from the shared filesystem.
•It then creates a map task object for each split, as well as a number of reduce task objects determined by
the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job).
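For instance, the reduce task count referenced above can be set in the driver before submission (the value 4 is illustrative):

job.setNumReduceTasks(4); // equivalent to setting mapreduce.job.reduces=4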
Task Assignment:
• If the job does not qualify for running as an uber task, then the
application master requests containers for all the map and reduce
tasks in the job from the resource manager.
• Requests for map tasks are made first and with a higher priority than
those for reduce tasks, since all the map tasks must complete before
the sort phase of the reduce can start.
• Requests for reduce tasks are not made until 5% of map tasks have
completed (this threshold is configurable, as the sketch after this list shows).
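A configuration sketch, continuing from the driver above (both property names are standard Hadoop properties; the values are illustrative):

Configuration conf = job.getConfiguration();
// Allow small jobs to run as an uber task inside the application master's JVM
conf.setBoolean("mapreduce.job.ubertask.enable", true);
// Start requesting reduce containers once 5% of map tasks have completed
conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);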
Task Execution:
• Once a task has been assigned resources for a container on a
particular node by the resource manager’s scheduler, the
application master starts the container by contacting the node
manager.
• The task is executed by a Java application whose main class is
YarnChild. Before it can run the task, it localizes the resources that
the task needs, including the job configuration and JAR file, and
any files from the distributed cache (see the sketch after this list).
• Finally, it runs the map or reduce task.
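For example, a file can be registered for the distributed cache from the driver; YarnChild then localizes it on the task's node before the task runs (the HDFS path is hypothetical):

// Continuing the driver sketch above
job.addCacheFile(new java.net.URI("/shared/lookup.txt")); // hypothetical lookup file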
Streaming:
• Streaming runs special map and reduce tasks for the purpose of launching the
user-supplied executable and communicating with it.
• The Streaming task communicates with the process (which may be
written in any language) using standard input and output streams.
• During execution of the task, the Java process passes input key-value
pairs to the external process, which runs them through the user-defined
map or reduce function and passes the output key-value pairs back to
the Java process (see the sketch after this list).
• From the node manager’s point of view, it is as if the child process
ran the map or reduce code itself.
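A minimal Java sketch of this stdin/stdout mechanism (this is not Hadoop's actual Streaming implementation, and /path/to/mapper.py is a hypothetical user script):

import java.io.*;

public class StreamingSketch {
    public static void main(String[] args) throws Exception {
        // Launch the (hypothetical) user-supplied executable
        Process proc = new ProcessBuilder("/path/to/mapper.py").start();
        BufferedWriter toChild = new BufferedWriter(
                new OutputStreamWriter(proc.getOutputStream()));
        BufferedReader fromChild = new BufferedReader(
                new InputStreamReader(proc.getInputStream()));
        toChild.write("the quick brown fox\n"); // an input record on the child's stdin
        toChild.close();                        // closing stdin signals end of input
        String line;
        while ((line = fromChild.readLine()) != null) {
            System.out.println(line);           // e.g. "the\t1" key-value lines from stdout
        }
        proc.waitFor();
    }
}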
Job Completion:
• When the application master receives a notification that the last
task for a job is complete, it changes the status for the job to Successful.
• Then, when the Job polls for status, it learns that the job has completed
successfully, so it prints a message to tell the user and then returns from
the waitForCompletion() method.
• Finally, on job completion, the application master and the task
containers clean up their working state, and the OutputCommitter’s
commitJob() method is called.
• Job information is archived by the job history server to enable later
interrogation by users if desired.
MapReduce Types
MapReduce is a programming model used for processing large
datasets in a distributed environment. In this model, data is
divided into smaller chunks and processed in parallel across
multiple computing nodes.
• There are three types of MapReduce:
1. Traditional MapReduce
2. Streaming MapReduce
3. Incremental MapReduce
1.Traditional MapReduce:
This is the original form of MapReduce developed by
Google. It consists of two main phases: the map phase, where data
is divided into smaller chunks and processed in parallel across
multiple computing nodes, and the reduce phase, where the results
of the map phase are combined to produce the final output.
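As a sketch of these two phases, here is a hand-written word-count mapper and reducer (class names are illustrative; in practice each public class goes in its own file):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: split each input line into words and emit (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}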
2. Streaming MapReduce:
This type of MapReduce allows data to be processed in real time
rather than in batches. Data is fed into the system as a stream, and
each record is processed as it arrives. This makes it suitable for
applications such as log processing and real-time analytics.
3. Incremental MapReduce:
This type of MapReduce is used for iterative processing,
where the same dataset is processed multiple times with different
parameters. Instead of processing the entire dataset for each
iteration, only the changed or updated data is processed. This
makes it suitable for applications such as machine learning,
where models need to be trained iteratively on large datasets.
Input formats:
1. Text input format
2. Sequence file input format
3. Hadoop archives input format
4. DB input format
5. Combine file input format
1.Text Input Format: This is the default input format for
MapReduce. It reads plain text files and splits them into
separate records. Each record is a line of text: the key is the
byte offset of the line within the input file, and the value is the
contents of the line.
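A minimal mapper sketch under this format (the class name is illustrative; because Text Input Format is the default, no explicit setInputFormatClass() call is needed):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With the default TextInputFormat, each mapper call receives (byte offset, line).
public class OffsetEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(offset, line); // identity map: emit the offset and the line
    }
}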
Storage efficiency: row-oriented formats are good for OLTP workloads
where individual rows are frequently accessed, while column-oriented
formats are better for OLAP workloads where analytical queries
typically access a subset of columns.