
Unit - III

Anatomy of a MapReduce Task in Hadoop


• Input data is split into small subsets called input splits. Map tasks work on these splits.
• The intermediate output of the Map tasks is then passed to the Reduce tasks after an intermediate phase called 'shuffle'.
• The Reduce task(s) work on this intermediate data to generate the result of the MapReduce job.
1. InputFiles
• The data to be processed by the MapReduce job is stored in input files, which reside in the Hadoop Distributed File System (HDFS). The file format is arbitrary; line-based log files and binary formats can both be used.
2. InputFormat
• It specifies the input specification for the job. InputFormat validates the job's input specification and splits the input files into logical InputSplit instances. Each InputSplit is then assigned to an individual Mapper. TextInputFormat is the default InputFormat.
3. InputSplit
• It represents the data to be processed by an individual Mapper. An InputSplit typically presents a byte-oriented view of the input; it is the RecordReader's responsibility to process it and present a record-oriented view. The default InputSplit is the FileSplit.
4. RecordReader
• RecordReader reads <key, value> pairs from the InputSplit. It converts the byte-oriented view of the input into a record-oriented view and presents it to the Mapper implementations for processing.
• It is responsible for respecting record boundaries and presenting the Map tasks with keys and values. The RecordReader breaks the data into <key, value> pairs for input to the Mapper.
5. Mapper
• Mapper maps the input <key, value> pairs to a set of intermediate <key, value> pairs. It processes the input records from the RecordReader and generates new <key, value> pairs, which are generally different from the input pairs.
• The generated <key, value> pairs are the Mapper's output, known as the intermediate output. These intermediate outputs are written to the local disk.
• The Mapper's output is not stored on the Hadoop Distributed File System because it is temporary data, and writing it to HDFS would create unnecessary copies. The output of the Mappers is then passed to the Combiner for further processing.
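As an illustration, here is a minimal word-count Mapper written against the org.apache.hadoop.mapreduce API; the class name TokenizerMapper and the word-count logic are illustrative, not taken from the slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count Mapper: for every token in an input line it emits
// the intermediate pair <word, 1>. With TextInputFormat the input key is the
// byte offset of the line (LongWritable) and the value is the line itself (Text).
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);   // intermediate <key, value> pair, spilled to local disk
    }
  }
}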
6. Combiner
• It is also known as the 'mini-reducer'. The Combiner performs local aggregation on the output of the Mappers, which minimizes data transfer between the Mapper and the Reducer.
• After the Combiner function has run, its output is passed to the Partitioner for further processing.
7. Partitioner
• The Partitioner comes into the picture only when the MapReduce program has more than one Reducer; with a single Reducer, partitioning is not needed.
• It partitions the keyspace, i.e. it controls how the keys of the Mapper's intermediate output are divided among the Reducers.
• The Partitioner takes the output of the Combiner and performs partitioning. The partition is derived from the key, typically through a hash function. The number of partitions equals the number of reduce tasks. HashPartitioner is the default Partitioner.
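A minimal sketch of a custom Partitioner, assuming the word-count key/value types used above; it mimics what the default HashPartitioner does, deriving the partition from the key's hash.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a hash-based Partitioner (same idea as the default HashPartitioner):
// the key's hash code, reduced modulo the number of reduce tasks, decides which
// reducer receives the <key, value> pair. Registered in the driver with
// job.setPartitionerClass(WordPartitioner.class).
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}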
8. Shuffling and Sorting
• The input to the Reducer is always the sorted intermediate output of the Mappers. After combining and partitioning, the framework fetches, via HTTP, all the relevant partitions of the output of all the Mappers.
• Once the output of all the Mappers has been shuffled, the framework groups the Reducer inputs by key. This grouped data is then provided as input to the Reducer.
9. Reducer
• The Reducer reduces the set of intermediate values that share a key to a smaller set of values. The output of the Reducer is the final output, which is stored in the Hadoop Distributed File System.
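Continuing the illustrative word-count example, here is a Reducer that sums all counts sharing a key; the class name IntSumReducer is an assumption, not taken from the slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative Reducer: receives <word, [1, 1, 1, ...]> and emits <word, total>.
// Its output is written to HDFS through the RecordWriter supplied by the OutputFormat.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);   // final <key, value> pair of the job
  }
}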
10. RecordWriter
• RecordWriter writes the output (key, value pairs) of Reducer to an output file.
It writes the MapReduce job outputs to the FileSystem.
11. OutputFormat
• The OutputFormat specifies the way in which these output key-value pairs are
written to the output files. It validates the output specification for a
MapReduce job.
• OutputFormat basically provides the RecordWriter implementation used for
writing the output files of the MapReduce job. The output files are stored in a
FileSystem.
• Hence, in this manner, MapReduce works over the Hadoop cluster in different
phases.
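To show how these phases are wired together in practice, here is a minimal driver sketch using the illustrative classes above; the class and path names are assumptions, but the Job API calls (setMapperClass, setCombinerClass, setPartitionerClass, setReducerClass, and so on) are standard.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Illustrative driver that wires the components described above into one job.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setInputFormatClass(TextInputFormat.class);    // 2. InputFormat (the default anyway)
    job.setMapperClass(TokenizerMapper.class);         // 5. Mapper
    job.setCombinerClass(IntSumReducer.class);         // 6. Combiner ("mini-reducer"), reuses the reducer
    job.setPartitionerClass(WordPartitioner.class);    // 7. Partitioner
    job.setReducerClass(IntSumReducer.class);          // 9. Reducer
    job.setOutputFormatClass(TextOutputFormat.class);  // 11. OutputFormat, supplies the RecordWriter

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // 1. Input files on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}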
The anatomy of a MapReduce can be broken down into
three main components:
*Map Function: The Map function is responsible for
processing the input data and producing intermediate key-
value pairs. The Map function takes an input key-value pair
and produces zero or more intermediate key-value pairs.
The Map function is executed in parallel on different nodes in
the cluster.
*Shuffle and Sort: The Shuffle and Sort phase is responsible
for sorting the intermediate key-value pairs and partitioning
them according to their keys. This phase ensures that all
intermediate key-value pairs with the same key are sent to the
same Reduce function. It is carried out by the framework on the
worker nodes, partly on the map side and partly on the reduce side.
*Reduce Function: The Reduce function is responsible for
combining the intermediate key-value pairs and producing
the final output.
The Reduce function takes an intermediate key and a list of
values and produces zero or more output key-value pairs.
The Reduce function is executed in parallel on different
nodes in the cluster.
Job Run:
A job run involves the execution of a MapReduce program on a set of input
data to produce an output result. The job run process can be broken down
into several steps:
1. Input Data Splitting
2. Map Task Execution
3. Intermediate Data Shuffling
4. Reduce Task Execution
5. Output Data Writing
*Input Data Splitting: The input data is divided into small chunks called input
splits. Each input split is processed by a single Map task.
*Map Task Execution: The Map task processes the input split and produces a
set of intermediate key-value pairs. The Map task is executed in parallel on
different nodes in the cluster.
*Intermediate Data Shuffling: The intermediate key-value pairs produced
by the Map tasks are sorted and partitioned by key. This ensures that all
intermediate key-value pairs with the same key are sent to the same
Reduce task.
*Reduce Task Execution: The Reduce task takes the intermediate key-
value pairs with the same key and produces a set of output key-value
pairs. The Reduce task is executed in parallel on different nodes in the
cluster.
*Output Data Writing: The output key-value pairs produced by the
Reduce tasks are written to the output file.
The job run process is managed by the JobTracker in classic MapReduce (MRv1)
or by the ResourceManager together with the MapReduce application master in
YARN (MRv2). This component is responsible for assigning Map and Reduce tasks
to available nodes in the cluster, monitoring the progress of the job, and
handling any failures that may occur.
Failures in MapReduce:

1. Hardware failures
2. Software failures
3. Network failures
4. Input data issues
5. Resource allocation issues
Hardware failures: Nodes in the cluster can fail due to various
hardware issues such as hard disk crashes, power supply failures,
network connectivity issues, etc. These failures can cause the loss of
data or processing power and can slow down the entire MapReduce
job.

Software failures: MapReduce jobs rely on various software
components such as Hadoop, YARN, the MapReduce framework, etc. If
any of these software components fail or experience bugs, the entire
job can fail or produce incorrect results.

Network failures: MapReduce jobs require a stable network
connection between the nodes in the cluster. If there are network
issues such as congestion, packet loss, or connection drops, it can
cause delays or even failures in the job.
Input data issues: Input data to MapReduce jobs can also
cause failures if the data is corrupted, incomplete, or in the
wrong format. In such cases, the job may fail to process the
data or produce incorrect results.

Resource allocation issues: MapReduce jobs require
resources such as memory, CPU, and disk space. If the
resources are not properly allocated or if there is a shortage
of resources, it can cause the job to fail or produce incorrect
results.
Shuffle and Sort in MapReduce
• Shuffle and Sort are important phases in the MapReduce framework
that occur between the Map and Reduce phases. They are
responsible for grouping and sorting the output data from the Map
phase before it is passed to the Reduce phase for processing.
• Shuffle phase: the process of transferring data from the Map phase to
the Reduce phase. During the Shuffle phase, the Map output data is
partitioned and sent to the reducers. The Map output is grouped by key,
and each group of data with the same key is sent to the same reducer.
The partitioner decides, based on the key, which reducer each
key-value pair is sent to.
• Sort phase: the process of sorting the Map output data within each
reducer. The Map output data is sorted by key before it is passed to
the reducer. The Sort phase ensures that all data with the same key
reaches the reducer in sorted order, which makes it easier for the
reducer to process the data efficiently.
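The sort order used on the reduce side can be customized. As a minimal sketch (class name assumed, not from the slides), a WritableComparator can invert the natural key order and be registered with job.setSortComparatorClass(...):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sketch of a custom sort comparator: the intermediate keys are presented to
// the reducer in descending rather than ascending order. Register it with
// job.setSortComparatorClass(DescendingTextComparator.class).
public class DescendingTextComparator extends WritableComparator {
  protected DescendingTextComparator() {
    super(Text.class, true);   // true: instantiate keys so compare() gets deserialized objects
  }

  @Override
  @SuppressWarnings({"rawtypes", "unchecked"})
  public int compare(WritableComparable a, WritableComparable b) {
    return -a.compareTo(b);    // negate the natural (ascending) order
  }
}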
MapReduce Task Execution
• A MapReduce job usually splits the input data-set into
independent chunks which are processed by the map tasks in a
completely parallel manner.
• The framework sorts the outputs of the maps, which are then
input to the reduce tasks.
• Typically both the input and the output of the job are stored in a
file-system.
Anatomy Diagram (see figure)
There are five independent entities:
•The client, which submits the MapReduce job.
•The YARN resource manager, which coordinates the allocation of
compute resources on the cluster.
•The YARN node managers, which launch and monitor the compute
containers on machines in the cluster.
•The MapReduce application master, which coordinates the tasks
running the MapReduce job. The application master and
the MapReduce tasks run in containers that are scheduled by the
resource manager and managed by the node managers.
•The distributed filesystem, which is used for sharing job files between
the other entities.
Job Submission:

•The submit() method on Job creates an internal JobSubmitter
instance and calls submitJobInternal() on it.
•Having submitted the job, waitForCompletion() polls the job's
progress once per second and reports the progress to the console if it
has changed since the last report.
•When the job completes successfully, the job counters are displayed.
Otherwise, the error that caused the job to fail is logged to the
console.
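A minimal sketch of the two submission paths just described; the helper class name is an assumption, but waitForCompletion(), submit(), isComplete(), mapProgress(), and reduceProgress() are standard methods on org.apache.hadoop.mapreduce.Job.

import org.apache.hadoop.mapreduce.Job;

// Sketch: blocking vs. polling submission of a fully configured Job.
public class SubmitExample {

  // waitForCompletion() submits the job and then polls its progress about once
  // per second, echoing changes to the console; true enables verbose output.
  static boolean runBlocking(Job job) throws Exception {
    return job.waitForCompletion(true);
  }

  // submit() returns immediately (internally a JobSubmitter is created and
  // submitJobInternal() is called); the caller polls for progress itself.
  static void runAsync(Job job) throws Exception {
    job.submit();
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(1000);
    }
  }
}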
The job submission process implemented by JobSubmitter does the following:

•Asks the resource manager for a new application ID, which is used for the
MapReduce job ID.
•Checks the output specification of the job. For example, if the output
directory has not been specified or it already exists, the job is not
submitted and an error is thrown to the MapReduce program.
•Computes the input splits for the job. If the splits cannot be
computed (because the input paths don't exist, for example), the job
is not submitted and an error is thrown to the MapReduce program.
•Copies the resources needed to run the job, including the job
JAR file, the configuration file, and the computed input splits, to
the shared filesystem in a directory named after the job ID.
•Submits the job by calling submitApplication() on the resource manager.
Job Initialization:
•When the resource manager receives a call to its submitApplication() method, it hands off the request to
the YARN scheduler.
•The scheduler allocates a container, and the resource manager then launches the application master’s
process there, under the node manager’s management.
•The application master for MapReduce jobs is a Java application whose main class is MRAppMaster.
•It initializes the job by creating a number of bookkeeping objects to keep track of the job’s progress, as it
will receive progress and completion reports from the tasks.
•It retrieves the input splits computed in the client from the shared filesystem.
•It then creates a map task object for each split, as well as a number of reduce task objects determined by
the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job).
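For example, the number of reduce task objects created here can be controlled from the driver; this fragment assumes the job and conf objects from the earlier driver sketch.

// Both lines have the same effect: the application master will create
// four reduce task objects for this job.
job.setNumReduceTasks(4);                  // via the Job API
conf.setInt("mapreduce.job.reduces", 4);   // via the configuration property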
Task Assignment:
• If the job does not qualify for running as an uber task, then the
application master requests containers for all the map and reduce
tasks in the job from the resource manager.
• Requests for map tasks are made first and with a higher priority than
those for reduce tasks, since all the map tasks must complete before
the sort phase of the reduce can start.
• Requests for reduce tasks are not made until 5% of map tasks have
completed.
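For reference, whether a job qualifies as an uber task (i.e., runs entirely inside the application master's JVM instead of requesting separate containers) is governed by configuration properties; the values below are illustrative and again assume the conf object from the driver sketch.

// Illustrative thresholds for uber-task execution; a small job that stays
// under these limits runs in the application master's own JVM.
conf.setBoolean("mapreduce.job.ubertask.enable", true);
conf.setInt("mapreduce.job.ubertask.maxmaps", 9);      // maximum number of map tasks
conf.setInt("mapreduce.job.ubertask.maxreduces", 1);   // maximum number of reduce tasks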
Task Execution:
• Once a task has been assigned resources for a container on a
particular node by the resource manager’s scheduler, the
application master starts the container by contacting the node
manager.
• The task is executed by a Java application whose main class is
YarnChild. Before it can run the task, it localizes the resources that
the task needs, including the job configuration and JAR file, and
any files from the distributed cache.
• Finally, it runs the map or reduce task.
Streaming:
• Streaming runs special map and reduce tasks for the purpose of launching the
user-supplied executable and communicating with it.
• The Streaming task communicates with the process (which may be
written in any language) using standard input and output streams.
• During execution of the task, the Java process passes input key-value
pairs to the external process, which runs them through the user-defined
map or reduce function and passes the output key-value pairs back to
the Java process.
• From the node manager’s point of view, it is as if the child process
ran the map or reduce code itself.
Job Completion:
• When the application master receives a notification that the last
task for a job is complete, it changes the status for the job to Successful.
• Then, when the Job polls for status, it learns that the job has completed
successfully, so it prints a message to tell the user and then returns from
waitForCompletion().
• Finally, on job completion, the application master and the task
containers clean up their working state and the OutputCommitter's
commitJob() method is called.
• Job information is archived by the job history server to enable later
interrogation by users if desired.
MapReduce Types
MapReduce is a programming model used for processing large
datasets in a distributed environment. In this model, data is
divided into smaller chunks and processed in parallel across
multiple computing nodes.
• There are three types of MapReduce:
1. Traditional MapReduce
2. Streaming MapReduce
3. Incremental MapReduce
1.Traditional MapReduce:
This is the original form of MapReduce developed by
Google. It consists of two main phases: the map phase, where data
is divided into smaller chunks and processed in parallel across
multiple computing nodes, and the reduce phase, where the results
of the map phase are combined to produce the final output.

2. Streaming MapReduce:
This type of MapReduce allows data to be processed in real-
time, rather than in batches. Data is fed into the system as a
stream, and each record is processed as it arrives. This makes it
suitable for applications such as log processing and real-time
analytics.
3. Incremental MapReduce:
This type of MapReduce is used for iterative processing,
where the same dataset is processed multiple times with different
parameters. Instead of processing the entire dataset for each
iteration, only the changed or updated data is processed. This
makes it suitable for applications such as machine learning,
where models need to be trained iteratively on large datasets.
Input formats:
1. Text input format
2. Sequence file input format
3. Hadoop archives (HAR) input format
4. DB input format
5. Combine file input format
1.Text Input Format: This is the default input format for
MapReduce. It reads plain text files and splits them into
separate records. Each record is a line of text, and the key is
the byte offset of the line in the input file.

2. Sequence File Input Format: This input format reads binary
key-value pairs stored in a sequence file format. Sequence files
are often used as an intermediate format in Hadoop workflows.
3. Hadoop Archives (HAR) Input Format: This input format reads
data stored in a Hadoop archive file. Hadoop archives are used
to combine small files into a single compressed file for more
efficient processing.

4. DBInputFormat: This input format reads data from a relational
database using JDBC. It allows MapReduce jobs to process data
stored in a database table.

5. CombineFileInputFormat: This input format reads multiple
small files and combines them into a single split. This can
improve performance by reducing the number of input splits
processed by MapReduce.
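As a sketch of how these formats are selected in the driver (the helper class is illustrative; the format classes live in org.apache.hadoop.mapreduce.lib.input and org.apache.hadoop.mapreduce.lib.db):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative helpers: each method switches the job to one of the input formats above.
public class InputFormatChoices {

  static void useText(Job job) {
    job.setInputFormatClass(TextInputFormat.class);         // default: one record per line, key = byte offset
  }

  static void useSequenceFile(Job job) {
    job.setInputFormatClass(SequenceFileInputFormat.class); // binary key-value pairs from SequenceFiles
  }

  static void useCombinedSmallFiles(Job job) {
    job.setInputFormatClass(CombineTextInputFormat.class);  // packs many small files into each split
  }

  // DBInputFormat additionally needs a JDBC connection configured via
  // org.apache.hadoop.mapreduce.lib.db.DBConfiguration.configureDB(...)
  // and DBInputFormat.setInput(...) before it can be used.
}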
Output formats
• MapReduce is a programming model and processing
framework used for distributed computing and large-scale data
processing. In MapReduce, the output format refers to the
structure and organization of the final results generated by the
MapReduce job. Here are some commonly used output formats
in MapReduce:
1. TextOutputFormat
2. SequenceFileOutputFormat
3. MultipleOutputs
4. AvroOutputFormat
5. ParquetOutputFormat
6. SequenceFileAsBinaryOutputFormat
7. DBOutputFormat
1.TextOutputFormat: This is the default output format in
MapReduce. It writes the output as plain text files, where each line
represents a key-value pair separated by a tab or a delimiter of
your choice.

2. SequenceFileOutputFormat: This output format writes the output
as binary files called SequenceFiles. SequenceFiles are
compressed and splittable, making them suitable for storing large
amounts of data efficiently.

3. MultipleOutputs: This allows you to generate multiple output files
from a single MapReduce job. You can specify different output
formats for each named output, enabling you to write different
types of output files based on your requirements.
4. AvroOutputFormat: This output format writes the output in Avro
format, a compact binary data serialization system. Avro files are
self-describing, meaning the schema is stored along with the
data, making it easier to work with structured data.

5. ParquetOutputFormat: This format writes the output in the
Parquet columnar storage format, which is optimized for big data
workloads. Parquet files are highly compressed and support
efficient columnar operations, making them suitable for analytics
and querying.
Row-Oriented Storage vs. Columnar Storage (Parquet)

Storage Efficiency
 - Row-oriented: good for OLTP workloads where individual rows are frequently accessed.
 - Columnar (Parquet): better for OLAP workloads where analytical queries typically access a subset of columns.

Compression
 - Row-oriented: limited by data locality within rows, making it less efficient for compression.
 - Columnar (Parquet): more effective compression because similar data types are stored together within columns.

Query Performance
 - Row-oriented: query performance can degrade as the number of columns accessed in a query increases.
 - Columnar (Parquet): query performance remains stable even when only a subset of columns is accessed.

Schema Evolution
 - Row-oriented: schema changes can be complex and may require modifications to existing data and applications.
 - Columnar (Parquet): supports schema evolution without breaking compatibility, allowing easy addition or modification of columns.

Integration
 - Row-oriented: widely supported in relational databases and traditional data warehouses.
 - Columnar (Parquet): seamless integration with Apache Hadoop ecosystem tools such as Hive and Spark.
6. SequenceFileAsBinaryOutputFormat: Similar to
SequenceFileOutputFormat, this format writes the output as
binary SequenceFiles but treats the key-value pairs as binary
data rather than serialized objects.

7. DBOutputFormat: This output format allows you to write the
output directly to a relational database. It provides a convenient
way to load the results of a MapReduce job into a database table.
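As a sketch of the MultipleOutputs mechanism mentioned above (the class and named-output names here are assumptions), a reducer can write the same result to both a text file and a SequenceFile within one job:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Illustrative reducer using MultipleOutputs. In the driver, register the
// named outputs first:
//   MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, IntWritable.class);
//   MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, Text.class, IntWritable.class);
public class MultiOutputReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private MultipleOutputs<Text, IntWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    IntWritable total = new IntWritable(sum);
    mos.write("text", key, total);   // goes to the TextOutputFormat named output
    mos.write("seq", key, total);    // goes to the SequenceFileOutputFormat named output
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();                     // flush and close all named outputs
  }
}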
