Unit 3 Notes
MapReduce Workflow
1. Input Data: The first step in a MapReduce workflow is to have a large dataset that you want
to process. This dataset is typically stored in a distributed file system like Hadoop Distributed
File System (HDFS).
2. Map Phase:
- Map Function: In this phase, the input dataset is divided into smaller chunks called "splits,"
and each split is processed by a separate Mapper task. A user-defined "Map" function is applied
to each record within the split. The Map function takes the input data and generates a set of key-
value pairs as intermediate outputs. These key-value pairs are not sorted.
3. Shuffle and Sort: After the Map phase, all the intermediate key-value pairs from all
Mappers are grouped by key and sorted. This is necessary because the Reduce phase works on
grouped data.
4. Reduce Phase:
- Reduce Function: The Reducer tasks take the sorted and shuffled key-value pairs from the
Map phase and apply a user-defined "Reduce" function to process and aggregate the data. Each
Reducer receives a group of key-value pairs with the same key. The Reduce function can
perform various operations on the values, such as summarization, aggregation, or filtering.
- Output: The output of the Reducer tasks is typically written to an output location, often in a
distributed file system, where it can be used for further analysis or as the final result of the
MapReduce job.
5. Final Output: After all the Reducer tasks have completed, the final result is available in the
output location specified for the job.
MapReduce workflows are highly scalable and fault-tolerant, making them suitable for
processing large-scale data in distributed computing environments. While the core concept of
MapReduce remains the same, there are various implementations and tools available, such as
Apache Hadoop and Apache Spark, that offer more features and flexibility for different data
processing tasks. Additionally, developers need to design their Map and Reduce functions based
on the specific requirements of their data processing tasks. The MRUnit example below shows how
such a Mapper can be unit tested:
```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class MyMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        MyMapper mapper = new MyMapper(); // Replace with your Mapper class
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws Exception {
        mapDriver.withInput(new LongWritable(1), new Text("Hello, world!"))
                 .withOutput(new Text("Hello,"), new IntWritable(1))
                 .withOutput(new Text("world!"), new IntWritable(1))
                 .runTest();
    }
}
```
In this example, we use MRUnit's `MapDriver` to test a Mapper. Replace `MyMapper` with the
actual name of your Mapper class. We provide sample input data and expected output data and
then run the test.
For Reducer tests, you can use `ReduceDriver` in a similar way.
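For instance, a Reducer test might look like the sketch below; `MyReducer` is assumed to be a
word-count style Reducer that sums the `IntWritable` values for each key, so replace the class
and the expected totals with your own.
```java
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class MyReducerTest {

    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        MyReducer reducer = new MyReducer(); // Replace with your Reducer class
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void testReducer() throws Exception {
        // Two occurrences of the key "Hello," should be summed to 2.
        reduceDriver.withInput(new Text("Hello,"),
                               Arrays.asList(new IntWritable(1), new IntWritable(1)))
                    .withOutput(new Text("Hello,"), new IntWritable(2))
                    .runTest();
    }
}
```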
MapReduce is a framework with which we can write applications that process huge amounts of
data in parallel, on large clusters of commodity hardware, in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). The reduce task then takes the output from a map as input and
combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the
reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is
merely a configuration change. This simple scalability is what has attracted many programmers
to use the MapReduce model.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
• MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
o Map stage − The map or mapper’s job is to process the input data. Generally, the
input data is in the form of a file or directory and is stored in the Hadoop Distributed
File System (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and the
Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks, which reduces
network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
The MapReduce framework operates on <key, value> pairs, that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need
to implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. Input and output types of
a MapReduce job: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
• Map − Input: <k1, v1>; Output: list(<k2, v2>)
• Reduce − Input: <k2, list(v2)>; Output: list(<k3, v3>)
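To make these types concrete, here is a minimal word-count sketch (an assumed example, not code
from these notes): the Mapper maps <k1, v1> = <byte offset, line> to list(<k2, v2>) =
list(<word, 1>), and the Reducer turns <k2, list(v2)> into the final <k3, v3> pairs.
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: <k1, v1> = <byte offset, line of text>  ->  list(<k2, v2>) = list(<word, 1>)
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit <word, 1>
        }
    }
}

// Reduce: <k2, list(v2)> = <word, list(1, 1, ...)>  ->  list(<k3, v3>) = list(<word, count>)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```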
Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and form the
core of the job.
• Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is presented in advance before any processing takes place.
• MasterNode − Node where JobTracker runs and which accepts job requests from clients.
• SlaveNode − Node where the Map and Reduce programs run.
• JobTracker − Schedules jobs and tracks the assigned jobs to the TaskTracker.
• Task Tracker − Tracks the task and reports status to JobTracker.
• Job − An execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
When developing MapReduce jobs using Hadoop or similar frameworks, it's essential to perform
unit testing to ensure that your code works correctly. One crucial aspect of unit testing is creating
test data and running local tests to verify the behavior of your MapReduce jobs. Here's a step-by-
step guide on how to create test data and conduct local tests for your MapReduce jobs:
1. Create Test Data:
1.1. Generate Sample Input Data: You can manually create small sample input data in a text
file or use tools to generate synthetic data for testing purposes. Ensure that your test data covers
various scenarios, including edge cases and potential issues that your MapReduce job might
encounter.
1.2. Store Test Data in a Local Directory: Place your sample input data in a directory on your
local machine. This directory will serve as the input source for your local tests.
2. Write Unit Tests:
2.1. Set Up a Testing Framework: Use a testing framework like JUnit, TestNG, or a Hadoop-
specific testing framework like MRUnit to write unit tests for your MapReduce job. Ensure that
your project is properly configured to use the testing framework.
2.2. Write Test Cases: Create test cases that cover various aspects of your MapReduce job,
including Mapper and Reducer functionality, handling of different input data scenarios, and edge
cases. Your test cases should include both positive and negative test scenarios.
2.3. Configure Test Inputs and Outputs: In your test cases, specify the test input data,
including the directory or file containing your sample input data. Define the expected output for
each test case.
2.4. Invoke MapReduce Job: In your test cases, invoke the MapReduce job by creating an
instance of your Mapper and Reducer classes and running them with the test input data. Capture
the output.
2.5. Assertions: Use assertions to compare the actual output of your MapReduce job with the
expected output defined in your test cases. Ensure that the results match your expectations.
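As a sketch of steps 2.3 through 2.5, MRUnit's `MapReduceDriver` runs a Mapper and Reducer
together against in-memory test input and asserts on the expected output; `WordCountMapper` and
`WordCountReducer` are assumed placeholder classes (such as the word-count sketch earlier in
these notes).
```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Test;

public class WordCountPipelineTest {

    @Test
    public void testMapperAndReducerTogether() throws Exception {
        MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> driver =
                MapReduceDriver.newMapReduceDriver(new WordCountMapper(), new WordCountReducer());

        // 2.3: configure test input; 2.4: run the (in-memory) job; 2.5: assert expected output.
        driver.withInput(new LongWritable(0), new Text("cat cat dog"))
              .withOutput(new Text("cat"), new IntWritable(2))
              .withOutput(new Text("dog"), new IntWritable(1))
              .runTest();
    }
}
```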
Classic MapReduce Workflow
1. Input Data:
- A MapReduce job begins with a large dataset that you want to process. This dataset is
typically stored in a distributed file system like Hadoop Distributed File System (HDFS).
2. Mapper Phase:
- The input data is divided into smaller chunks or "splits."
- Each split is assigned to a mapper task. Mappers are responsible for processing these splits
independently.
- The user-defined Mapper function is applied to each record within a split. The Mapper
function emits key-value pairs as intermediate outputs.
3. Shuffle and Sort Phase:
- The intermediate key-value pairs emitted by the Mappers are grouped by key and sorted, so that
all values for a particular key end up in the same Reducer.
4. Reducer Phase:
- Reducers are responsible for processing the grouped and sorted key-value pairs.
- User-defined Reducer functions are applied to each group of values associated with the same
key.
- Reducers emit the final output, which is typically aggregated or transformed data.
5. Output Data:
- The output from the Reducer phase is written to an external storage system, often HDFS.
- This output represents the result of the MapReduce job, which can be used for further
analysis, reporting, or other purposes.
6. Job Configuration:
- Before running the job, you need to configure various parameters, such as input and output
paths, the number of mappers and reducers, and other job-specific settings.
- This configuration is typically done using a job configuration object, which specifies how the
input data should be processed (see the driver sketch after this list).
7. Job Submission:
- Once the MapReduce job is configured, it is submitted to a cluster's resource manager (e.g.,
Hadoop YARN) for execution.
- The resource manager allocates resources (nodes) for running the job's tasks (mappers and
reducers) based on cluster availability and job priorities.
8. Task Execution:
- Mapper and Reducer tasks are distributed across the cluster's nodes.
- Each task runs independently, processing its portion of the data.
- Task progress and status are monitored by the resource manager.
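To make steps 6 and 7 concrete, here is a minimal driver sketch using the standard Hadoop `Job`
API; the class names and command-line argument paths are assumptions, not part of the notes.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Step 6: job configuration - classes, key/value types, input and output paths.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // assumed Mapper class
        job.setReducerClass(WordCountReducer.class);    // assumed Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Step 7: submit the job to the cluster (e.g., YARN) and wait for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```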
This classic MapReduce workflow is the foundation for distributed data processing in
frameworks like Hadoop. While it's a powerful and scalable approach, newer data processing
frameworks like Apache Spark have gained popularity due to their improved performance and
ease of use. However, the core concepts of MapReduce are still relevant in understanding
distributed data processing.
"Classic MapReduce" refers to the original programming model and framework for processing
and generating large datasets that can be parallelized across a distributed cluster of computers. It
was popularized by Google in a 2004 paper written by Jeffrey Dean and Sanjay Ghemawat.
Google used this model to process vast amounts of data across their infrastructure, and the ideas
presented in the paper served as the foundation for open-source implementations like Apache
Hadoop.
Here are the key concepts and components of the classic MapReduce model:
1. Mapper: The Mapper is responsible for processing input data and emitting a set of key-value
pairs as intermediate data. Each Mapper task works independently on a subset of the input data.
2. Reducer: The Reducer takes the intermediate key-value pairs produced by the Mappers,
groups them by key, and processes them to produce the final output. Reducers work in parallel,
and each handles a different set of keys.
3. Input Data: The input data is typically a large dataset stored in a distributed file system like
HDFS. It's divided into smaller chunks or "splits" that are processed by individual Mapper tasks.
4. Output Data: The output from the Reducer phase is written to an external storage system,
often the distributed file system, and represents the result of the MapReduce job.
5. Shuffling and Sorting: After the Mapper phase, there is a shuffling and sorting step where the
intermediate key-value pairs generated by Mappers are grouped by key and sorted. This step is
crucial because it ensures that all values for a particular key end up in the same Reducer.
6. Job Configuration: Before running the job, you need to configure various parameters, such as
input and output paths, the number of Mappers and Reducers, and other job-specific settings.
This configuration is typically done using a job configuration object.
7. Job Submission: Once the MapReduce job is configured, it is submitted to a cluster's resource
manager for execution. The resource manager allocates resources (nodes) for running the job's
tasks based on cluster availability and job priorities.
8. Task Execution: Mapper and Reducer tasks are distributed across the cluster's nodes. Each
task runs independently, processing its portion of the data. Task progress and status are
monitored by the resource manager.
9. Job Monitoring and Logging: During the job run, logs and statistics are generated, which
can be used for monitoring job progress, debugging, and performance analysis.
10. Job Completion: When all the tasks have completed successfully, the MapReduce job is
considered finished. The final output data is available in the specified output directory in the
distributed file system.
Classic MapReduce was groundbreaking because it allowed for the processing of vast amounts
of data on commodity hardware in a fault-tolerant manner. While newer data processing
frameworks like Apache Spark have gained popularity due to their improved performance and
ease of use, the core concepts of MapReduce remain relevant and are foundational in
understanding distributed data processing.
3.6 YARN
YARN is a resource management and job scheduling component in the Hadoop ecosystem. It is
designed to separate the resource management layer from the processing framework, providing a
more flexible and scalable resource management platform. YARN replaced the older
MapReduce JobTracker and TaskTracker architecture, allowing Hadoop to support various
processing frameworks beyond MapReduce.
YARN has three main components:
1. ResourceManager (RM): The ResourceManager is the cluster-wide master that arbitrates resources
among all applications and schedules them onto nodes.
2. NodeManager (NM): NodeManagers run on individual nodes in the cluster and are responsible
for monitoring resource usage and reporting it back to the ResourceManager.
3. ApplicationMaster (AM): Each application running on the cluster has its own
ApplicationMaster. The ApplicationMaster is responsible for negotiating resources with the
ResourceManager and monitoring the execution of its application's tasks.
YARN provides a more robust and scalable resource management solution compared to the
previous MapReduce-specific resource management, allowing multiple applications to run
concurrently on a Hadoop cluster.
Failures can occur in both classic MapReduce and YARN due to various reasons. Handling
failures robustly is a critical aspect of distributed computing. Here are some common types of
failures in these frameworks:
Failures in classic MapReduce:
1. Task Failures: Individual Map or Reduce tasks can fail because of bugs in user code, bad input
records, or hardware problems on the node running the task.
2. JobTracker Failures: In the classic MapReduce model, the JobTracker is a single point of
failure. If the JobTracker fails, it can result in job failures or delays.
3. Data Node Failures: If a Data Node hosting HDFS data blocks fails, it can lead to data
unavailability for MapReduce jobs.
4. Network Failures: Network issues can disrupt communication between nodes and cause job
failures or delays.
Failures in YARN:
1. ResourceManager Failures: The ResourceManager is the central scheduling authority; if it fails
and no high-availability configuration is in place, running applications are disrupted.
2. NodeManager Failures: When a NodeManager fails, the containers running on that node are lost
and their tasks must be rescheduled on other nodes.
3. ApplicationMaster Failures: If an application's ApplicationMaster fails, that application
attempt fails and may be restarted by the ResourceManager.
4. Container Failures: Containers running on nodes can fail due to hardware or software issues,
causing task failures.
To address these failures, Hadoop and YARN incorporate mechanisms for fault tolerance and
recovery, including task retry, speculative execution, and high availability configurations for
ResourceManager and NameNode (in HDFS). Additionally, monitoring, logging, and alerting
systems are often employed to detect and respond to failures in a timely manner.
It's essential to design and configure Hadoop clusters to handle failures gracefully to ensure the
reliability and availability of big data processing jobs. More recent versions of these frameworks
may have introduced additional features and improvements in failure handling and recovery.
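As a small illustration of the retry and speculative-execution mechanisms mentioned above, the
snippet below sets the property names commonly used in Hadoop 2.x; these values are only an
assumed example and should be checked against your cluster's version and defaults.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Allow each map/reduce task to be retried up to 4 times before the job fails
        // (Hadoop 2.x property names; older releases use mapred.* equivalents).
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Enable speculative execution so slow ("straggler") tasks are re-run elsewhere.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "failure-tolerant job");
        // ... set Mapper/Reducer classes and paths as usual, then submit.
    }
}
```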
Job scheduling in the context of Hadoop MapReduce refers to the process of allocating resources
and managing the execution of MapReduce jobs within a cluster. Proper job scheduling is crucial
for efficient resource utilization and achieving good performance in a Hadoop cluster. Here are
some key points related to job scheduling:
1. Job Submission: Users or applications submit MapReduce jobs to the ResourceManager. Each
job specifies its resource requirements and priority.
2. Queue Management: YARN supports the concept of queues to manage resource allocation.
Users can be assigned to specific queues, and administrators can configure queue capacities and
priorities.
3. Fair Scheduler: Hadoop provides a Fair Scheduler that aims to distribute resources fairly
among multiple jobs. It assigns resources based on job weights and can be configured to ensure
that no job starves for resources.
4. Capacity Scheduler: The Capacity Scheduler allows for resource allocation based on
predefined capacities for different queues, making it suitable for multi-tenant clusters.
5. Job Prioritization: Jobs can be assigned different priorities, allowing higher-priority jobs to
receive resources ahead of lower-priority ones.
6. Job Progress Monitoring: The ResourceManager monitors the progress of running jobs and
detects job failures. It also handles task retries and re-allocations in case of task failures.
7. Speculative Execution: Hadoop can perform speculative execution of tasks, where a slow-running
task is executed redundantly on another node and the result of whichever copy finishes first is
used, helping jobs complete faster.
Efficient job scheduling ensures that cluster resources are used optimally, jobs complete in a
timely manner, and different users or applications are allocated resources fairly.
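For instance, a job can be directed to one of the scheduler queues described above and given a
priority through job properties; the queue name "analytics" below is made up, and the property
names (standard in Hadoop 2.x) should be verified for your version.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SchedulingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Submit to a named Capacity/Fair Scheduler queue ("analytics" is a made-up example).
        conf.set("mapreduce.job.queuename", "analytics");

        // Raise the job's priority relative to other jobs in the same queue.
        conf.set("mapreduce.job.priority", "HIGH");

        Job job = Job.getInstance(conf, "scheduled job");
        // ... configure Mapper/Reducer, input/output paths, then submit as usual.
    }
}
```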
Shuffle and Sort
1. Mapper Output: During the Mapper phase, each Mapper processes input data and generates
key-value pairs as intermediate output. These key-value pairs are not yet in the final sorted order.
2. Shuffling: Shuffling is the process of transferring and grouping intermediate key-value pairs
by key across all Mapper tasks. All values associated with the same key are collected together.
This process is crucial because it ensures that all relevant data for a particular key ends up in the
same Reducer task.
3. Sorting: After shuffling, the intermediate key-value pairs are sorted by key. Sorting is
essential because it allows Reducer tasks to process data efficiently in key-sorted order. The
sorting can be either by default, using the natural order of keys, or custom-defined if a
comparator is provided.
4. Reducer Input: The sorted and grouped key-value pairs serve as the input to Reducer tasks.
Each Reducer receives a group of key-value pairs with the same key, making it easier to perform
aggregate operations or computations based on keys.
5. Data Transfer: Shuffling involves significant data transfer across the cluster, as data from
multiple Mapper tasks must be collected, sorted, and distributed to Reducer tasks. Efficient data
transfer is crucial for job performance.
The Shuffle and Sort phases are resource-intensive and can be a bottleneck in MapReduce jobs.
Optimizing these phases, such as by minimizing data transfer or using combiners (mini-reducers)
to pre-aggregate data during shuffling, can significantly improve job performance.
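As a hedged example of the combiner mentioned above, the sketch below reuses a word-count Reducer
as the combiner (safe because summation is associative and commutative); the Mapper and Reducer
class names are assumed placeholders from the earlier sketch.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(CombinerExample.class);
        job.setMapperClass(WordCountMapper.class);    // assumed Mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);  // assumed Reducer from the earlier sketch

        // Combiner: pre-aggregates <word, 1> pairs on the map side so that far less
        // data has to be shuffled across the network.
        job.setCombinerClass(WordCountReducer.class);

        // A custom sort order can be installed with job.setSortComparatorClass(...) and
        // custom grouping with job.setGroupingComparatorClass(...), both of which take a
        // RawComparator implementation.

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ... input/output paths and submission as in the driver sketch above.
    }
}
```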
Overall, Shuffle and Sort ensure that MapReduce jobs can process and aggregate data efficiently
in a distributed and parallelized manner, making them essential components of the MapReduce
framework.
Task Execution
1. Mapper Tasks: These tasks are responsible for processing a portion of the input data. Mappers
apply a user-defined function to each record within their assigned data split. The output of the
Mapper is a set of intermediate key-value pairs.
2. Reducer Tasks: Reducer tasks take the intermediate key-value pairs generated by the Mappers
and process them. Each Reducer receives a group of key-value pairs with the same key, and it
applies a user-defined Reduce function to these pairs. The output of the Reducer is the final
result of the MapReduce job.
- Task execution is parallelized, with multiple Mapper and Reducer tasks running concurrently
on different cluster nodes.
- Tasks are assigned to nodes based on data locality whenever possible, which reduces data
transfer overhead.
- Task progress and status are monitored by the ResourceManager (in YARN) or the JobTracker
(in classic MapReduce).
- In the event of task failures, the ResourceManager or JobTracker can reschedule tasks on
available nodes.
Efficient task execution is critical for achieving good performance in MapReduce jobs, as it
involves the actual processing of data.
Types of MapReduce Jobs
1. Batch Processing:
- Batch processing MapReduce jobs are the most common type. They are used for processing
large volumes of data in a batch mode.
- Examples include data cleaning, transformation, ETL (Extract, Transform, Load) processes,
log analysis, and generating reports.
2. Iterative Processing:
- Some algorithms require multiple iterations over the data, such as machine learning
algorithms (e.g., iterative clustering or graph algorithms).
- Iterative MapReduce jobs involve running the same Map and Reduce functions multiple
times with modified input (see the driver sketch after this list).
3. Data Joining:
- Data joining MapReduce jobs combine data from multiple sources based on common keys.
- Examples include merging user profiles from different data sets or joining logs with customer
data.
4. Log Processing:
- Log processing jobs analyze server logs, application logs, or sensor data to extract insights,
detect anomalies, or troubleshoot issues.
5. Machine Learning:
- Machine learning tasks like training models (e.g., decision trees, random forests) on large
datasets can be parallelized using MapReduce.
6. Recommendation Systems:
- MapReduce can be used to implement recommendation algorithms, such as collaborative
filtering or content-based recommendation.
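As a rough sketch of the iterative pattern from item 2 above, a driver can run the same job in a
loop, feeding each pass's output directory in as the next pass's input; the paths, iteration
count, and Mapper/Reducer class names are all assumptions.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        final int iterations = 5;                 // assumed fixed iteration count
        Path input = new Path("/data/iter_0");    // assumed initial input directory

        for (int i = 1; i <= iterations; i++) {
            Path output = new Path("/data/iter_" + i);

            Job job = Job.getInstance(new Configuration(), "iteration " + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(IterationMapper.class);    // assumed Mapper class
            job.setReducerClass(IterationReducer.class);  // assumed Reducer class
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("Iteration " + i + " failed");
            }
            // The output of this pass becomes the input of the next pass.
            input = output;
        }
    }
}
```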
The type of MapReduce job depends on the specific use case and data processing requirements.
MapReduce's ability to scale horizontally and process vast amounts of data makes it a versatile
choice for a wide range of big data processing tasks. However, it's worth noting that more recent
frameworks like Apache Spark have gained popularity for certain use cases due to their
improved performance and support for interactive and iterative processing.
3.12 INPUT FORMATS
Input formats in Hadoop MapReduce specify how input data is read and processed by the
MapReduce job. Hadoop supports various input formats to handle different types of data sources
and formats efficiently. Here are some common input formats:
1. TextInputFormat (Default): This is the default input format for text-based data. Each line in
the input file is treated as a separate record, and the key is the byte offset of the line, while the
value is the content of the line.
2. KeyValueTextInputFormat: This format is used when input data is in the form of key-value
pairs separated by a delimiter. It allows you to specify a delimiter character to split key and
value pairs.
3. SequenceFileInputFormat: Sequence files are a binary format that stores key-value pairs. This
input format is efficient for storing and processing large datasets.
4. Avro, Parquet, and ORC Input Formats: These formats are used for reading data stored in
Avro, Parquet, or ORC file formats, which are columnar storage formats optimized for big data
processing.
5. TextInputFormat (Custom Delimiter): You can use the TextInputFormat with a custom
delimiter if your text data is not newline-separated but instead uses a different delimiter.
6. Custom Input Formats: You can create custom input formats to handle specialized data
sources or formats not covered by the built-in input formats. This requires implementing specific
logic for reading and parsing input data.
The choice of input format depends on the nature of your data and how it is stored. Hadoop
allows you to configure the input format for your MapReduce job so that it can effectively read
and process the data.
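For example, the input format is configured on the `Job`; the sketch below assumes tab-separated
key-value input read with `KeyValueTextInputFormat`, and the separator property name is the one
used in Hadoop 2.x.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Split each line into key and value at the first tab (Hadoop 2.x property name).
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "key-value input");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... Mapper/Reducer, output settings, and paths as in the earlier driver sketch.
    }
}
```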
Output Formats
Output formats in Hadoop MapReduce specify how the output data of a MapReduce job is written.
Here are some common output formats:
1. TextOutputFormat (Default): This is the default output format for text-based data. Each key-
value pair is written as a line in the output file.
2. SequenceFileOutputFormat: This format writes key-value pairs in Hadoop's binary SequenceFile
format, which is compact and efficient when the output will be consumed by another MapReduce job.
3. Avro, Parquet, and ORC Output Formats: These formats allow you to write data in Avro,
Parquet, or ORC formats, which are optimized for efficient storage and query performance.
4. MultipleOutputFormat: This format allows you to write output to multiple files or directories
based on specific criteria, such as keys or other attributes. It's useful for scenarios where you
need to partition or distribute the output data.
5. Custom Output Formats: Just like with input formats, you can create custom output formats to
write data to specialized storage systems or formats.
The choice of output format depends on how you want to store and use the results of your
MapReduce job. Different formats offer trade-offs between storage efficiency, query
performance, and ease of use. It's essential to select the appropriate output format to ensure that
the processed data is readily accessible and fits your downstream processing or analysis needs.
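As a final hedged sketch, an output format is chosen the same way; here the job is assumed to
write binary SequenceFiles to a made-up output path.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sequence file output");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Write binary <key, value> records instead of plain text lines.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job, new Path("/data/out")); // assumed path

        // ... Mapper/Reducer and input settings as in the earlier driver sketch.
    }
}
```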