Unit 3 Notes
MapReduce Workflow
1. Input Data: The first step in a MapReduce workflow is to have a large dataset that you want
to process. This dataset is typically stored in a distributed file system like Hadoop Distributed
File System (HDFS).
2. Map Phase:
- Map Function: In this phase, the input dataset is divided into smaller chunks called "splits,"
and each split is processed by a separate Mapper task. A user-defined "Map" function is applied
to each record within the split. The Map function takes the input data and generates a set of key-
value pairs as intermediate outputs. These key-value pairs are not sorted.
3. Shuffle and Sort: After the Map phase, all the intermediate key-value pairs from all
Mappers are grouped by key and sorted. This is necessary because the Reduce phase works on
grouped data.
4. Reduce Phase:
- Reduce Function: The Reducer tasks take the sorted and shuffled key-value pairs from the
Map phase and apply a user-defined "Reduce" function to process and aggregate the data. Each
Reducer receives a group of key-value pairs with the same key. The Reduce function can
perform various operations on the values, such as summarization, aggregation, or filtering.
- Output: The output of the Reducer tasks is typically written to an output location, often in a
distributed file system, where it can be used for further analysis or as the final result of the
MapReduce job.
5. Final Output: After all the Reducer tasks have completed, the final result is available in the
output location specified for the job.
MapReduce workflows are highly scalable and fault-tolerant, making them suitable for
processing large-scale data in distributed computing environments. While the core concept of
MapReduce remains the same, there are various implementations and tools available, such as
Apache Hadoop and Apache Spark, that offer more features and flexibility for different data
processing tasks. Additionally, developers need to design their Map and Reduce functions based
on the specific requirements of their data processing tasks. The MRUnit example below shows how
such a Mapper can be unit tested:
```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class MyMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        MyMapper mapper = new MyMapper(); // Replace with your Mapper class
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws Exception {
        mapDriver.withInput(new LongWritable(1), new Text("Hello, world!"))
                 .withOutput(new Text("Hello,"), new IntWritable(1))
                 .withOutput(new Text("world!"), new IntWritable(1))
                 .runTest();
    }
}
```
In this example, we use MRUnit's `MapDriver` to test a Mapper. Replace `MyMapper` with the
actual name of your Mapper class. We provide sample input data and expected output data and
then run the test.
For Reducer tests, you can use `ReduceDriver` in a similar way.
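For instance, a Reducer test might look like the sketch below; `MyReducer` is assumed to be a
word-count style Reducer that sums the `IntWritable` values for each key, so replace the class
and the expected totals with your own.
```java
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class MyReducerTest {

    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        MyReducer reducer = new MyReducer(); // Replace with your Reducer class
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void testReducer() throws Exception {
        // Two occurrences of the key "Hello," should be summed to 2.
        reduceDriver.withInput(new Text("Hello,"),
                               Arrays.asList(new IntWritable(1), new IntWritable(1)))
                    .withOutput(new Text("Hello,"), new IntWritable(2))
                    .runTest();
    }
}
```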
MapReduce is a framework with which we can write applications that process huge amounts of
data in parallel, on large clusters of commodity hardware, in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). The reduce task then takes the output from a map as input and
combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the
reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is
merely a configuration change. This simple scalability is what has attracted many programmers
to use the MapReduce model.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
• MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
o Map stage − The map or mapper’s job is to process the input data. Generally, the
input data is in the form of a file or directory and is stored in the Hadoop Distributed
File System (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and the
Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks, which reduces
network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
The MapReduce framework operates on <key, value> pairs, that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need
to implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. Input and output types of
a MapReduce job: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
• Map − Input: <k1, v1>; Output: list(<k2, v2>)
• Reduce − Input: <k2, list(v2)>; Output: list(<k3, v3>)
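To make these types concrete, here is a minimal word-count sketch (an assumed example, not code
from these notes): the Mapper maps <k1, v1> = <byte offset, line> to list(<k2, v2>) =
list(<word, 1>), and the Reducer turns <k2, list(v2)> into the final <k3, v3> pairs.
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: <k1, v1> = <byte offset, line of text>  ->  list(<k2, v2>) = list(<word, 1>)
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit <word, 1>
        }
    }
}

// Reduce: <k2, list(v2)> = <word, list(1, 1, ...)>  ->  list(<k3, v3>) = list(<word, count>)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```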
Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and form the
core of the job.
• Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is presented in advance before any processing takes place.
• MasterNode − Node where JobTracker runs and which accepts job requests from clients.
• SlaveNode − Node where the Map and Reduce programs run.
• JobTracker − Schedules jobs and tracks the assigned jobs to the TaskTracker.
• Task Tracker − Tracks the task and reports status to JobTracker.
• Job − An execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
When developing MapReduce jobs using Hadoop or similar frameworks, it's essential to perform
unit testing to ensure that your code works correctly. One crucial aspect of unit testing is creating
test data and running local tests to verify the behavior of your MapReduce jobs. Here's a step-by-
step guide on how to create test data and conduct local tests for your MapReduce jobs:
1. Create Test Data:
1.1. Generate Sample Input Data: You can manually create small sample input data in a text
file or use tools to generate synthetic data for testing purposes. Ensure that your test data covers
various scenarios, including edge cases and potential issues that your MapReduce job might
encounter.
1.2. Store Test Data in a Local Directory: Place your sample input data in a directory on your
local machine. This directory will serve as the input source for your local tests.
2. Write Unit Tests:
2.1. Set Up a Testing Framework: Use a testing framework like JUnit, TestNG, or a Hadoop-
specific testing framework like MRUnit to write unit tests for your MapReduce job. Ensure that
your project is properly configured to use the testing framework.
2.2. Write Test Cases: Create test cases that cover various aspects of your MapReduce job,
including Mapper and Reducer functionality, handling of different input data scenarios, and edge
cases. Your test cases should include both positive and negative test scenarios.
2.3. Configure Test Inputs and Outputs: In your test cases, specify the test input data,
including the directory or file containing your sample input data. Define the expected output for
each test case.
2.4. Invoke MapReduce Job: In your test cases, invoke the MapReduce job by creating an
instance of your Mapper and Reducer classes and running them with the test input data. Capture
the output.
2.5. Assertions: Use assertions to compare the actual output of your MapReduce job with the
expected output defined in your test cases. Ensure that the results match your expectations.
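As a sketch of steps 2.3 through 2.5, MRUnit's `MapReduceDriver` runs a Mapper and Reducer
together against in-memory test input and asserts on the expected output; `WordCountMapper` and
`WordCountReducer` are assumed placeholder classes (such as the word-count sketch earlier in
these notes).
```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Test;

public class WordCountPipelineTest {

    @Test
    public void testMapperAndReducerTogether() throws Exception {
        MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> driver =
                MapReduceDriver.newMapReduceDriver(new WordCountMapper(), new WordCountReducer());

        // 2.3: configure test input; 2.4: run the (in-memory) job; 2.5: assert expected output.
        driver.withInput(new LongWritable(0), new Text("cat cat dog"))
              .withOutput(new Text("cat"), new IntWritable(2))
              .withOutput(new Text("dog"), new IntWritable(1))
              .runTest();
    }
}
```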
Classic MapReduce Workflow
1. Input Data:
- A MapReduce job begins with a large dataset that you want to process. This dataset is
typically stored in a distributed file system like Hadoop Distributed File System (HDFS).
2. Mapper Phase:
- The input data is divided into smaller chunks or "splits."
- Each split is assigned to a mapper task. Mappers are responsible for processing these splits
independently.
- The user-defined Mapper function is applied to each record within a split. The Mapper
function emits key-value pairs as intermediate outputs.
3. Shuffle and Sort Phase:
- The intermediate key-value pairs emitted by the Mappers are grouped by key and sorted, so that
all values for a particular key end up in the same Reducer.
4. Reducer Phase:
- Reducers are responsible for processing the grouped and sorted key-value pairs.
- User-defined Reducer functions are applied to each group of values associated with the same
key.
- Reducers emit the final output, which is typically aggregated or transformed data.
5. Output Data:
- The output from the Reducer phase is written to an external storage system, often HDFS.
- This output represents the result of the MapReduce job, which can be used for further
analysis, reporting, or other purposes.
6. Job Configuration:
- Before running the job, you need to configure various parameters, such as input and output
paths, the number of mappers and reducers, and other job-specific settings.
- This configuration is typically done using a job configuration object, which specifies how the
input data should be processed (see the driver sketch after this list).
7. Job Submission:
- Once the MapReduce job is configured, it is submitted to a cluster's resource manager (e.g.,
Hadoop YARN) for execution.
- The resource manager allocates resources (nodes) for running the job's tasks (mappers and
reducers) based on cluster availability and job priorities.
8. Task Execution:
- Mapper and Reducer tasks are distributed across the cluster's nodes.
- Each task runs independently, processing its portion of the data.
- Task progress and status are monitored by the resource manager.
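To make steps 6 and 7 concrete, here is a minimal driver sketch using the standard Hadoop `Job`
API; the class names and command-line argument paths are assumptions, not part of the notes.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Step 6: job configuration - classes, key/value types, input and output paths.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // assumed Mapper class
        job.setReducerClass(WordCountReducer.class);    // assumed Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Step 7: submit the job to the cluster (e.g., YARN) and wait for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```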
This classic MapReduce workflow is the foundation for distributed data processing in
frameworks like Hadoop. While it's a powerful and scalable approach, newer data processing
frameworks like Apache Spark have gained popularity due to their improved performance and
ease of use. However, the core concepts of MapReduce are still relevant in understanding
distributed data processing.
"Classic MapReduce" refers to the original programming model and framework for processing
and generating large datasets that can be parallelized across a distributed cluster of computers. It
was popularized by Google in a 2004 paper written by Jeffrey Dean and Sanjay Ghemawat.
Google used this model to process vast amounts of data across their infrastructure, and the ideas
presented in the paper served as the foundation for open-source implementations like Apache
Hadoop.
Here are the key concepts and components of the classic MapReduce model:
1. Mapper: The Mapper is responsible for processing input data and emitting a set of key-value
pairs as intermediate data. Each Mapper task works independently on a subset of the input data.
2. Reducer: The Reducer takes the intermediate key-value pairs produced by the Mappers,
groups them by key, and processes them to produce the final output. Reducers work in parallel,
and each handles a different set of keys.
3. Input Data: The input data is typically a large dataset stored in a distributed file system like
HDFS. It's divided into smaller chunks or "splits" that are processed by individual Mapper tasks.
4. Output Data: The output from the Reducer phase is written to an external storage system,
often the distributed file system, and represents the result of the MapReduce job.
5. Shuffling and Sorting: After the Mapper phase, there is a shuffling and sorting step where the
intermediate key-value pairs generated by Mappers are grouped by key and sorted. This step is
crucial because it ensures that all values for a particular key end up in the same Reducer.
6. Job Configuration: Before running the job, you need to configure various parameters, such as
input and output paths, the number of Mappers and Reducers, and other job-specific settings.
This configuration is typically done using a job configuration object.
7. Job Submission: Once the MapReduce job is configured, it is submitted to a cluster's resource
manager for execution. The resource manager allocates resources (nodes) for running the job's
tasks based on cluster availability and job priorities.
8. Task Execution: Mapper and Reducer tasks are distributed across the cluster's nodes. Each
task runs independently, processing its portion of the data. Task progress and status are
monitored by the resource manager.
9. Job Monitoring and Logging: During the job run, logs and statistics are generated, which
can be used for monitoring job progress, debugging, and performance analysis.
10. Job Completion: When all the tasks have completed successfully, the MapReduce job is
considered finished. The final output data is available in the specified output directory in the
distributed file system.
Classic MapReduce was groundbreaking because it allowed for the processing of vast amounts
of data on commodity hardware in a fault-tolerant manner. While newer data processing
frameworks like Apache Spark have gained popularity due to their improved performance and
ease of use, the core concepts of MapReduce remain relevant and are foundational in
understanding distributed data processing.
3.6 YARN
YARN is a resource management and job scheduling component in the Hadoop ecosystem. It is
designed to separate the resource management layer from the processing framework, providing a
more flexible and scalable resource management platform. YARN replaced the older
MapReduce JobTracker and TaskTracker architecture, allowing Hadoop to support various
processing frameworks beyond MapReduce.
YARN has three main components:
1. ResourceManager (RM): The ResourceManager is the cluster-wide master that arbitrates resources
among all applications and schedules them onto nodes.
2. NodeManager (NM): NodeManagers run on individual nodes in the cluster and are responsible
for monitoring resource usage and reporting it back to the ResourceManager.
3. ApplicationMaster (AM): Each application running on the cluster has its own
ApplicationMaster. The ApplicationMaster is responsible for negotiating resources with the
ResourceManager and monitoring the execution of its application's tasks.
YARN provides a more robust and scalable resource management solution compared to the
previous MapReduce-specific resource management, allowing multiple applications to run
concurrently on a Hadoop cluster.
Failures can occur in both classic MapReduce and YARN due to various reasons. Handling
failures robustly is a critical aspect of distributed computing. Here are some common types of
failures in these frameworks:
Failures in classic MapReduce:
1. Task Failures: Individual Map or Reduce tasks can fail because of bugs in user code, bad input
records, or hardware problems on the node running the task.
2. JobTracker Failures: In the classic MapReduce model, the JobTracker is a single point of
failure. If the JobTracker fails, it can result in job failures or delays.
3. Data Node Failures: If a Data Node hosting HDFS data blocks fails, it can lead to data
unavailability for MapReduce jobs.
4. Network Failures: Network issues can disrupt communication between nodes and cause job
failures or delays.
Failures in YARN:
1. ResourceManager Failures: The ResourceManager is the central scheduling authority; if it fails
and no high-availability configuration is in place, running applications are disrupted.
2. NodeManager Failures: When a NodeManager fails, the containers running on that node are lost
and their tasks must be rescheduled on other nodes.
3. ApplicationMaster Failures: If an application's ApplicationMaster fails, that application
attempt fails and may be restarted by the ResourceManager.
4. Container Failures: Containers running on nodes can fail due to hardware or software issues,
causing task failures.
To address these failures, Hadoop and YARN incorporate mechanisms for fault tolerance and
recovery, including task retry, speculative execution, and high availability configurations for
ResourceManager and NameNode (in HDFS). Additionally, monitoring, logging, and alerting
systems are often employed to detect and respond to failures in a timely manner.
It's essential to design and configure Hadoop clusters to handle failures gracefully to ensure the
reliability and availability of big data processing jobs. More recent versions of these frameworks
may have introduced additional features and improvements in failure handling and recovery.
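As a small illustration of the retry and speculative-execution mechanisms mentioned above, the
snippet below sets the property names commonly used in Hadoop 2.x; these values are only an
assumed example and should be checked against your cluster's version and defaults.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Allow each map/reduce task to be retried up to 4 times before the job fails
        // (Hadoop 2.x property names; older releases use mapred.* equivalents).
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Enable speculative execution so slow ("straggler") tasks are re-run elsewhere.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "failure-tolerant job");
        // ... set Mapper/Reducer classes and paths as usual, then submit.
    }
}
```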
Job scheduling in the context of Hadoop MapReduce refers to the process of allocating resources
and managing the execution of MapReduce jobs within a cluster. Proper job scheduling is crucial
for efficient resource utilization and achieving good performance in a Hadoop cluster. Here are
some key points related to job scheduling:
1. Job Submission: Users or applications submit MapReduce jobs to the ResourceManager. Each
job specifies its resource requirements and priority.
2. Queue Management: YARN supports the concept of queues to manage resource allocation.
Users can be assigned to specific queues, and administrators can configure queue capacities and
priorities.
3. Fair Scheduler: Hadoop provides a Fair Scheduler that aims to distribute resources fairly
among multiple jobs. It assigns resources based on job weights and can be configured to ensure
that no job starves for resources.
4. Capacity Scheduler: The Capacity Scheduler allows for resource allocation based on
predefined capacities for different queues, making it suitable for multi-tenant clusters.
5. Job Prioritization: Jobs can be assigned different priorities, allowing higher-priority jobs to
receive resources ahead of lower-priority ones.
6. Job Progress Monitoring: The ResourceManager monitors the progress of running jobs and
detects job failures. It also handles task retries and re-allocations in case of task failures.
7. Speculative Execution: Hadoop can perform speculative execution of tasks, where a slow-running
task is executed redundantly on another node and the result of whichever copy finishes first is
used, helping jobs complete faster.
Efficient job scheduling ensures that cluster resources are used optimally, jobs complete in a
timely manner, and different users or applications are allocated resources fairly.
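For instance, a job can be directed to one of the scheduler queues described above and given a
priority through job properties; the queue name "analytics" below is made up, and the property
names (standard in Hadoop 2.x) should be verified for your version.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SchedulingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Submit to a named Capacity/Fair Scheduler queue ("analytics" is a made-up example).
        conf.set("mapreduce.job.queuename", "analytics");

        // Raise the job's priority relative to other jobs in the same queue.
        conf.set("mapreduce.job.priority", "HIGH");

        Job job = Job.getInstance(conf, "scheduled job");
        // ... configure Mapper/Reducer, input/output paths, then submit as usual.
    }
}
```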
Shuffle and Sort
1. Mapper Output: During the Mapper phase, each Mapper processes input data and generates
key-value pairs as intermediate output. These key-value pairs are not yet in the final sorted order.
2. Shuffling: Shuffling is the process of transferring and grouping intermediate key-value pairs
by key across all Mapper tasks. All values associated with the same key are collected together.
This process is crucial because it ensures that all relevant data for a particular key ends up in the
same Reducer task.
3. Sorting: After shuffling, the intermediate key-value pairs are sorted by key. Sorting is
essential because it allows Reducer tasks to process data efficiently in key-sorted order. The
sorting can be either by default, using the natural order of keys, or custom-defined if a
comparator is provided.
4. Reducer Input: The sorted and grouped key-value pairs serve as the input to Reducer tasks.
Each Reducer receives a group of key-value pairs with the same key, making it easier to perform
aggregate operations or computations based on keys.
5. Data Transfer: Shuffling involves significant data transfer across the cluster, as data from
multiple Mapper tasks must be collected, sorted, and distributed to Reducer tasks. Efficient data
transfer is crucial for job performance.
The Shuffle and Sort phases are resource-intensive and can be a bottleneck in MapReduce jobs.
Optimizing these phases, such as by minimizing data transfer or using combiners (mini-reducers)
to pre-aggregate data during shuffling, can significantly improve job performance.
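As a hedged example of the combiner mentioned above, the sketch below reuses a word-count Reducer
as the combiner (safe because summation is associative and commutative); the Mapper and Reducer
class names are assumed placeholders from the earlier sketch.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(CombinerExample.class);
        job.setMapperClass(WordCountMapper.class);    // assumed Mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);  // assumed Reducer from the earlier sketch

        // Combiner: pre-aggregates <word, 1> pairs on the map side so that far less
        // data has to be shuffled across the network.
        job.setCombinerClass(WordCountReducer.class);

        // A custom sort order can be installed with job.setSortComparatorClass(...) and
        // custom grouping with job.setGroupingComparatorClass(...), both of which take a
        // RawComparator implementation.

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ... input/output paths and submission as in the driver sketch above.
    }
}
```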
Overall, Shuffle and Sort ensure that MapReduce jobs can process and aggregate data efficiently
in a distributed and parallelized manner, making them essential components of the MapReduce
framework.
Task Execution
1. Mapper Tasks: These tasks are responsible for processing a portion of the input data. Mappers
apply a user-defined function to each record within their assigned data split. The output of the
Mapper is a set of intermediate key-value pairs.
2. Reducer Tasks: Reducer tasks take the intermediate key-value pairs generated by the Mappers
and process them. Each Reducer receives a group of key-value pairs with the same key, and it
applies a user-defined Reduce function to these pairs. The output of the Reducer is the final
result of the MapReduce job.
- Task execution is parallelized, with multiple Mapper and Reducer tasks running concurrently
on different cluster nodes.
- Tasks are assigned to nodes based on data locality whenever possible, which reduces data
transfer overhead.
- Task progress and status are monitored by the ResourceManager (in YARN) or the JobTracker
(in classic MapReduce).
- In the event of task failures, the ResourceManager or JobTracker can reschedule tasks on
available nodes.
Efficient task execution is critical for achieving good performance in MapReduce jobs, as it
involves the actual processing of data.
Types of MapReduce Jobs
1. Batch Processing:
- Batch processing MapReduce jobs are the most common type. They are used for processing
large volumes of data in a batch mode.
- Examples include data cleaning, transformation, ETL (Extract, Transform, Load) processes,
log analysis, and generating reports.
2. Iterative Processing:
- Some algorithms require multiple iterations over the data, such as machine learning
algorithms (e.g., iterative clustering or graph algorithms).
- Iterative MapReduce jobs involve running the same Map and Reduce functions multiple
times with modified input (see the driver sketch after this list).
3. Data Joining:
- Data joining MapReduce jobs combine data from multiple sources based on common keys.
- Examples include merging user profiles from different data sets or joining logs with customer
data.
4. Log Processing:
- Log processing jobs analyze server logs, application logs, or sensor data to extract insights,
detect anomalies, or troubleshoot issues.
5. Machine Learning:
- Machine learning tasks like training models (e.g., decision trees, random forests) on large
datasets can be parallelized using MapReduce.
6. Recommendation Systems:
- MapReduce can be used to implement recommendation algorithms, such as collaborative
filtering or content-based recommendation.
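As a rough sketch of the iterative pattern from item 2 above, a driver can run the same job in a
loop, feeding each pass's output directory in as the next pass's input; the paths, iteration
count, and Mapper/Reducer class names are all assumptions.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        final int iterations = 5;                 // assumed fixed iteration count
        Path input = new Path("/data/iter_0");    // assumed initial input directory

        for (int i = 1; i <= iterations; i++) {
            Path output = new Path("/data/iter_" + i);

            Job job = Job.getInstance(new Configuration(), "iteration " + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(IterationMapper.class);    // assumed Mapper class
            job.setReducerClass(IterationReducer.class);  // assumed Reducer class
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("Iteration " + i + " failed");
            }
            // The output of this pass becomes the input of the next pass.
            input = output;
        }
    }
}
```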
The type of MapReduce job depends on the specific use case and data processing requirements.
MapReduce's ability to scale horizontally and process vast amounts of data makes it a versatile
choice for a wide range of big data processing tasks. However, it's worth noting that more recent
frameworks like Apache Spark have gained popularity for certain use cases due to their
improved performance and support for interactive and iterative processing.
3.12 INPUT FORMATS
Input formats in Hadoop MapReduce specify how input data is read and processed by the
MapReduce job. Hadoop supports various input formats to handle different types of data sources
and formats efficiently. Here are some common input formats:
1. TextInputFormat (Default): This is the default input format for text-based data. Each line in
the input file is treated as a separate record, and the key is the byte offset of the line, while the
value is the content of the line.
2. KeyValueTextInputFormat: This format is used when input data is in the form of key-value
pairs separated by a delimiter. It allows you to specify a delimiter character to split key and
value pairs.
3. SequenceFileInputFormat: Sequence files are a binary format that stores key-value pairs. This
input format is efficient for storing and processing large datasets.
4. Avro, Parquet, and ORC Input Formats: These formats are used for reading data stored in
Avro, Parquet, or ORC file formats, which are columnar storage formats optimized for big data
processing.
5. TextInputFormat (Custom Delimiter): You can use the TextInputFormat with a custom
delimiter if your text data is not newline-separated but instead uses a different delimiter.
6. Custom Input Formats: You can create custom input formats to handle specialized data
sources or formats not covered by the built-in input formats. This requires implementing specific
logic for reading and parsing input data.
The choice of input format depends on the nature of your data and how it is stored. Hadoop
allows you to configure the input format for your MapReduce job so that it can effectively read
and process the data.
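For example, the input format is configured on the `Job`; the sketch below assumes tab-separated
key-value input read with `KeyValueTextInputFormat`, and the separator property name is the one
used in Hadoop 2.x.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Split each line into key and value at the first tab (Hadoop 2.x property name).
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "key-value input");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... Mapper/Reducer, output settings, and paths as in the earlier driver sketch.
    }
}
```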
Output Formats
Output formats in Hadoop MapReduce specify how the output data of a MapReduce job is written.
Here are some common output formats:
1. TextOutputFormat (Default): This is the default output format for text-based data. Each key-
value pair is written as a line in the output file.
2. SequenceFileOutputFormat: This format writes key-value pairs in Hadoop's binary SequenceFile
format, which is compact and efficient when the output will be consumed by another MapReduce job.
3. Avro, Parquet, and ORC Output Formats: These formats allow you to write data in Avro,
Parquet, or ORC formats, which are optimized for efficient storage and query performance.
4. MultipleOutputFormat: This format allows you to write output to multiple files or directories
based on specific criteria, such as keys or other attributes. It's useful for scenarios where you
need to partition or distribute the output data.
5. Custom Output Formats: Just like with input formats, you can create custom output formats to
write data to specialized storage systems or formats.
The choice of output format depends on how you want to store and use the results of your
MapReduce job. Different formats offer trade-offs between storage efficiency, query
performance, and ease of use. It's essential to select the appropriate output format to ensure that
the processed data is readily accessible and fits your downstream processing or analysis needs.
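As a final hedged sketch, an output format is chosen the same way; here the job is assumed to
write binary SequenceFiles to a made-up output path.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sequence file output");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Write binary <key, value> records instead of plain text lines.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job, new Path("/data/out")); // assumed path

        // ... Mapper/Reducer and input settings as in the earlier driver sketch.
    }
}
```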