
UNIT - 3

3.1 MapReduce Workflows

Introduction:

• MapReduce is designed to simplify the development of large-scale, distributed, fault-tolerant data processing applications; it is foremost a way of writing applications.

• In MapReduce, developers write jobs that consist primarily of a map function and a reduce function, and the framework handles the gory details of parallelizing the work, scheduling parts of the job on worker machines, monitoring for and recovering from failures, and so forth.

• Developers are shielded from having to implement complex and repetitious code and instead focus on algorithms and business logic. User-provided code is invoked by the framework rather than the other way around.

• This is much like Java application servers that invoke servlets upon receiving an HTTP request; the container is responsible for setup and teardown as well as providing a runtime environment for user-supplied code.

• Similarly, just as servlet authors need not implement the low-level details of socket I/O, event handling loops, and complex thread coordination, MapReduce developers program to a well-defined, simple interface and the “container” does the heavy lifting.

Specifically developed to deal with large-scale workloads, MapReduce provides the following features:

1. Simplicity of development

MapReduce is dead simple for developers: no socket programming, no threading or fancy synchronization logic, no management of retries, no special techniques to deal with enormous amounts of data. Developers use functional programming concepts to build data processing applications that operate on one record at a time. Map functions operate on these records and produce intermediate key-value pairs. The reduce function then operates on the intermediate key-value pairs, processing all values that have the same key together and outputting the result. These primitives can be used to implement filtering, projection, grouping, aggregation, and other common data processing functions.

2. Scale
Since tasks do not communicate with one another explicitly and do not share state, they can execute in parallel and on separate machines. Additional machines can be added to the cluster and applications immediately take advantage of the additional hardware with no change at all. MapReduce is designed to be a shared-nothing system.

3. Automatic parallelization and distribution of work


Developers focus on the map and reduce functions that process individual records (where “record” is an abstract concept—it could be a line of a file or a row from a relational database) in a dataset. The storage of the dataset is not prescribed by MapReduce, although it is extremely common, as we’ll see later, that files on a distributed filesystem are an excellent pairing. The framework is responsible for splitting a MapReduce job into tasks. Tasks are then executed on worker nodes or (less pleasantly) slaves.

4. Fault tolerance
Failure is not an exception; it’s the norm. MapReduce treats failure as a
first-class citizen and supports reexecution of failed tasks on healthy
worker nodes in the cluster. Should a worker node fail, all tasks are
assumed to be lost, in which case they are simply rescheduled elsewhere.
The unit of work is always the task, and it either completes successfully or it
fails completely.

• In MapReduce, users write a client application that submits one or more jobs
that contain user-supplied map and reduce code and a job configuration file
to a cluster of machines.

• The job contains a map function and a reduce function, along with job configuration information that controls various aspects of its execution.

• The framework handles breaking the job into tasks, scheduling tasks to run
on machines, monitoring each task’s health, and performing any necessary
retries of failed tasks.

• A job processes an input dataset specified by the user and usually outputs
one as well. Commonly, the input and output datasets are one or more files
on a distributed filesystem.

Stages of MapReduce:

A MapReduce job is made up of four distinct stages, executed in order: client job
submission, map task execution, shuffle and sort, and reduce task execution.
Client applications can really be any type of application the developer desires,
from command line tools to services. The MapReduce framework provides a set of
APIs for submitting jobs and interacting with the cluster. The job itself is made up
of code written by a developer against the MapReduce APIs and the configuration
which specifies things such as the input and output datasets.

The client application submits a job to the cluster using the framework APIs. A
master process, called the jobtracker in Hadoop MapReduce, is responsible for
accepting these submissions (more on the role of the jobtracker later). Job
submission occurs over the network, so clients may be running on one of the
cluster nodes or not; it doesn’t matter. The framework gets to decide how to
split the input dataset into chunks, or input splits, of data that can be processed in
parallel. In Hadoop MapReduce, the component that does this is called an input
format, and Hadoop comes with a small library of them for common file formats.
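
To make the submission step concrete, the sketch below shows a minimal driver written against the Hadoop MapReduce Java API. It deliberately uses only classes that ship with Hadoop (TokenCounterMapper and IntSumReducer, a simple word-count pairing) so it stays self-contained; the class name and the input and output paths taken from the command line are illustrative, not part of the log example that follows.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);

        // Built-in mapper and reducer: tokenize each line, then sum the 1s per token.
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input format decides how the input dataset is split into input splits.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster (over the network) and wait for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}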

To better illustrate how MapReduce works, let’s use a simple application log processing example in which we count all events of each severity within a window of time. Let’s assume 100 GB of logs in a directory in HDFS. A sample of log records might look something like this:

2012-02-13 00:23:54-0800 [INFO - com.company.app1.Main] Application started!
2012-02-13 00:32:02-0800 [WARN - com.company.app1.Main] Something hinky is going down...
2012-02-13 00:32:19-0800 [INFO - com.company.app1.Main] False alarm. No worries.
...
2012-02-13 09:00:00-0800 [DEBUG - com.company.app1.Main] coffee units remaining: zero - triggering coffee time.
2012-02-13 09:00:00-0800 [INFO - com.company.app1.Main] Good morning. It's coffee time.

For each input split, a map task is created that runs the user-supplied map
function on each record in the split. Map tasks are executed in parallel. This means
each chunk of the input dataset is being processed at the same time by various
machines that make up the cluster. It’s fine if there are more map tasks to execute
than the cluster can handle. They’re simply queued and executed in whatever order
the framework deems best. The map function takes a key-value pair as input and
produces zero or more intermediate key-value pairs.

The input format is responsible for turning each record into its key-value pair representation. For now, trust that one of the built-in input formats will turn each line of the file into a value, with the byte offset into the file provided as the key. Getting back to our example, we’ll write a map function that filters records for those within a specific timeframe, and then counts all events of each severity. The map phase is where we’ll perform the filtering. We’ll output the severity and the number 1 for each record that we see with that severity.
function map(key, value) {
  // Example key: 12345 - the byte offset in the file (not really interesting).
  // Example value: 2012-02-13 00:23:54-0800 [INFO - com.company.app1.Main] Application started!
  // Do the nasty record parsing to get dateTime, severity, className, and message.
  (dateTime, severity, className, message) = parseRecord(value);
  // If the date is today...
  if (dateTime.date() == '2012-02-13') {
    // Emit the severity and the number 1 to say we saw one of these records.
    emit(severity, 1);
  }
}

Given the sample records earlier, our intermediate data would look as follows:

DEBUG, 1

INFO, 1

INFO, 1
INFO, 1

WARN, 1

The key INFO repeats, which makes sense because our sample contained three INFO records that would have matched the date 2012-02-13. It’s perfectly legal to output the same key or value multiple times. The other notable effect is that the output records are not in the order you might expect. In the original data, the first record was an INFO record, followed by WARN, but that’s clearly not the case here. This is because the framework sorts the output of each map task by its key. As with outputting the value 1 for each record, the rationale behind sorting the data will become clear in a moment.

Each key is assigned to a partition using a component called the partitioner. In Hadoop MapReduce, the default partitioner implementation is a hash partitioner that takes a hash of the key, modulo the number of configured reducers in the job, to get a partition number. Because the hash implementation used by Hadoop ensures the hash of the key INFO is always the same on all machines, all INFO records are guaranteed to be placed in the same partition. The intermediate data isn’t physically partitioned, only logically so. For all intents and purposes, you can picture a partition number next to each record; it would be the same for all records with the same key. See Figure 3-1 for a high-level overview of the execution of the map phase.
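
As a sketch of that logic, the class below is essentially what Hadoop's default HashPartitioner does: hash the key, mask off the sign bit, and take the result modulo the number of reduce tasks.

import org.apache.hadoop.mapreduce.Partitioner;

public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is never negative, then map
        // the key's hash onto one of the partitions (one partition per reducer).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}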
Ultimately, we want to run the user’s reduce function on the intermediate output data. A number of guarantees, however, are made to the developer with respect to the reducers, and these need to be fulfilled first.

• If a reducer sees a key, it will see all values for that key. For example, if a
reducer receives the INFO key, it will always receive the three number 1 values.

• A key will be processed by exactly one reducer. This makes sense given the preceding requirement.

• Each reducer will see keys in sorted order.

The next phase of processing, called the shuffle and sort, is responsible for
enforcing these guarantees. The shuffle and sort phase is actually performed by
the reduce tasks before they run the user’s reduce function. When started, each
reducer is assigned one of the partitions on which it should work. First, they copy
the intermediate key-value data from each worker for their assigned partition.
It’s possible that tens of thousands of map tasks have run on various machines
throughout the cluster, each having output key value pairs for each partition. The
reducer assigned partition 1, for example, would need to fetch each piece of its
partition data from potentially every other worker in the cluster. A logical view of
the intermediate data across all machines in the cluster might look like this:
worker 1, partition 2, DEBUG, 1
worker 1, partition 1, INFO, 1
worker 2, partition 1, INFO, 1
worker 2, partition 1, INFO, 1
worker 3, partition 2, WARN, 1

With the partition data now combined into a complete sorted list, the user’s
reducer code can now be executed:
# Logical data input to the reducer assigned partition 1:
INFO, [ 1, 1, 1 ]

# Logical data input to the reducer assigned partition 2:
DEBUG, [ 1 ]
WARN, [ 1 ]

The reducer code in our example is hopefully clear at this point:


function reduce(key, iterator<values>) {
  // Initialize a total event count.
  totalEvents = 0;
  // For each value (a number one)...
  foreach (value in values) {
    // Add the number one to the total.
    totalEvents += value;
  }
  // Emit the severity (the key) and the total events we saw.
  // Example key: INFO
  // Example value: 3
  emit(key, totalEvents);
}
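
For readers who prefer real code to pseudocode, here is a rough Java rendering of the map and reduce functions above, written against the Hadoop MapReduce API. The record parsing is simplified (it assumes the severity is the token following the '[' bracket) and the class names are illustrative only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SeverityCount {

    public static class SeverityMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text severity = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Filter on the date prefix, then pull out the severity token, e.g. "INFO".
            if (line.startsWith("2012-02-13")) {
                int open = line.indexOf('[');
                int space = line.indexOf(' ', open);
                if (open >= 0 && space > open) {
                    severity.set(line.substring(open + 1, space));
                    context.write(severity, ONE);
                }
            }
        }
    }

    public static class SeverityReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the 1s emitted by the mappers for this severity.
            int totalEvents = 0;
            for (IntWritable value : values) {
                totalEvents += value.get();
            }
            context.write(key, new IntWritable(totalEvents));
        }
    }
}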

Each reducer produces a separate output file, usually in HDFS (see Figure 3-2).
Separate files are written so that reducers do not have to coordinate access to a
shared file. This greatly reduces complexity and lets each reducer run at whatever
speed it can. The format of the file depends on the output format specified by the
author of the MapReduce job in the job configuration. Unless the job does something special (and most don’t) each reducer output file is named part-<XXXXX>, where <XXXXX> is the number of the reduce task within the job, starting from zero. Sample reducer output for our example job would look as follows:
# Reducer for partition 1:
INFO, 3
# Reducer for partition 2:
DEBUG, 1
WARN, 1

(Figure 3-2: Shuffle and sort, and reduce phases)


For those that are familiar with SQL and relational databases, we could view the
logs as a table with the schema:
CREATE TABLE logs (
  EVENT_DATE DATE,
  SEVERITY VARCHAR(8),
  SOURCE VARCHAR(128),
  MESSAGE VARCHAR(1024)
)

To produce the same output, we could use the following SQL statement. In the interest of readability, we ignore the fact that this doesn’t yield identically formatted output; the data is the same.

SELECT SEVERITY, COUNT(*)
FROM logs
WHERE EVENT_DATE = '2012-02-13'
GROUP BY SEVERITY
ORDER BY SEVERITY

As exciting as all of this is, MapReduce is not a silver bullet. It is just as important to know how MapReduce works and what it’s good for as it is to understand why MapReduce is not going to end world hunger or serve you breakfast in bed.

1. MapReduce is a batch data processing system


The design of MapReduce assumes that jobs will run on the order of minutes, if not hours. It is optimized for full table scan style operations. Consequently, it underwhelms when attempting to mimic low-latency, random access patterns found in traditional online transaction processing (OLTP) systems. MapReduce is not a relational database killer, nor does it purport to be.

2. MapReduce is overly simplistic


One of its greatest features is also one of its biggest drawbacks: MapReduce is
simple. In cases where a developer knows something special about the data
and wants to make certain optimizations, he may find the model limiting. This
usually manifests as complaints that, while the job is faster in terms of wall
clock time, it’s far less efficient in MapReduce than in other systems. This can be
very true. Some have said MapReduce is like a sledgehammer driving a nail; in
some cases, it’s more like a wrecking ball.

3. MapReduce is too low-level


Compared to higher-level data processing languages (notably SQL), MapReduce seems extremely low-level. Certainly, for basic query-like functionality, no one wants to write map and reduce functions. Higher-level languages built atop MapReduce exist to simplify life, and unless you truly need the ability to touch terabytes (or more) of raw data, it can be overkill.

4. Not all algorithms can be parallelized


There are entire classes of problems that cannot easily be parallelized. The act of training a model in machine learning, for instance, cannot be parallelized for many types of models. This is true for many algorithms where there is shared state or dependent variables that must be maintained and updated centrally. Sometimes it’s possible to structure problems that are traditionally solved using shared state differently such that they can fit into the MapReduce model, but at the cost of efficiency (shortest-path-finding algorithms in graph processing are excellent examples of this). Other times, while this is possible, it may not be ideal for a host of reasons. Knowing how to identify these kinds of problems and create alternative solutions is far beyond the scope of this book and an art in its own right. This is the same problem as the “mythical man month,” but it is most succinctly expressed by stating, “If one woman can have a baby in nine months, nine women should be able to have a baby in one month,” which, in case it wasn’t clear, is decidedly false.

3.2 Unit Tests with MRUnit


MRUnit is a popular testing framework designed specifically for testing MapReduce
jobs in the Apache Hadoop ecosystem. It allows developers to write unit tests for
mapper and reducer functions, as well as integration tests for the entire MapReduce job
logic. MRUnit provides a set of APIs that simulate the MapReduce execution
environment and enable developers to validate the correctness of their MapReduce jobs
efficiently and without the need for a full-scale Hadoop cluster.
Here's how unit tests with MRUnit are typically performed:
Set Up MRUnit:
• Developers need to include the MRUnit library in their Java project to access
the testing framework.
• MRUnit is usually provided as a JAR file that can be added to the project's build
path or managed through a build automation tool like Maven or Gradle.
Write Unit Tests:
• Developers write unit tests for the mapper and reducer functions using the
MRUnit API. The MRUnit framework provides convenient methods to create
input data (key-value pairs) for mappers and to set up the expected output for
reducers.
• For mapper tests, developers create input key-value pairs and use MRUnit's
MapDriver class to test the map function. The withInput method sets the input
data, and the withOutput method sets the expected output key-value pairs for
the mapper.
• For reducer tests, developers create input key-value pairs for a specific key
and use MRUnit's ReduceDriver class to test the reduce function. The
withInputKey and withInputValue methods set the input data for the reducer,
and the withOutput method sets the expected output key-value pairs.
Run the Unit Tests:
• Developers run the unit tests using their preferred Java testing framework,
such as JUnit or TestNG.
• MRUnit provides a convenient runTest method to execute the test cases and
compare the actual output with the expected output, checking if they match.
Validate Results:
• The test framework verifies whether the actual output matches the expected
output for each test case. If the outputs match, the test is considered successful;
otherwise, it indicates a potential issue in the MapReduce job's logic.
• Using MRUnit, developers can efficiently test various scenarios, edge cases, and
data inputs to ensure that their MapReduce jobs handle different situations
correctly. This approach significantly improves the quality and reliability of
MapReduce jobs by catching and addressing potential issues early in the
development process.
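
As a concrete illustration, the sketch below tests the severity-count mapper and reducer rendered in Java earlier in this unit (the SeverityCount class is an assumed name) using MRUnit's MapDriver and ReduceDriver. It is a minimal sketch, not a complete test suite.

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class SeverityCountTest {

    @Test
    public void mapperEmitsSeverityAndOne() throws Exception {
        // Feed one log line to the mapper and assert the (severity, 1) output.
        MapDriver.newMapDriver(new SeverityCount.SeverityMapper())
            .withInput(new LongWritable(0),
                new Text("2012-02-13 00:23:54-0800 [INFO - com.company.app1.Main] Application started!"))
            .withOutput(new Text("INFO"), new IntWritable(1))
            .runTest();
    }

    @Test
    public void reducerSumsCountsPerSeverity() throws Exception {
        // Feed one key with three 1s and assert the summed count.
        ReduceDriver.newReduceDriver(new SeverityCount.SeverityReducer())
            .withInput(new Text("INFO"),
                Arrays.asList(new IntWritable(1), new IntWritable(1), new IntWritable(1)))
            .withOutput(new Text("INFO"), new IntWritable(3))
            .runTest();
    }
}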

3.3 Test Data and Local Tests

In the context of MapReduce, "test data" refers to the data that is used for testing
and validating the correctness and efficiency of MapReduce jobs before running
them on a production cluster. Local tests, on the other hand, involve running
MapReduce jobs on a single machine or a small local cluster to check the logic,
functionality, and performance of the job without utilizing the full-scale resources
of a distributed cluster.

Here's how test data and local tests are typically used in the MapReduce
development process:

Test Data Preparation:

• For developing and testing MapReduce jobs, developers create small sample datasets, known as test data, that represent various scenarios and edge cases that the job should handle.

• The test data should cover different input data patterns, including corner
cases, to ensure the job's robustness and correctness.

Local Testing:

• Before deploying MapReduce jobs to a full-scale Hadoop cluster, developers often perform local tests on their development machines or a small local Hadoop cluster with limited resources.

• Local testing allows developers to iterate quickly during the development process and catch bugs or logic issues before submitting the job to a production cluster.

• Local testing can be done using tools like Apache Hadoop's LocalJobRunner, which simulates the MapReduce job execution on a single machine, mimicking the distributed processing environment (see the sketch below).
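
A minimal sketch of forcing local execution follows, assuming a driver along the lines of the one shown earlier in this unit; the two properties shown are the Hadoop 2.x names for selecting the local job runner and the local filesystem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalRunExample {
    public static Job localTestJob() throws Exception {
        Configuration conf = new Configuration();
        // Run the job in a single JVM with the local job runner...
        conf.set("mapreduce.framework.name", "local");
        // ...and read/write the local filesystem instead of HDFS.
        conf.set("fs.defaultFS", "file:///");
        return Job.getInstance(conf, "local-test-run");
    }
}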

Unit Testing and Integration Testing:

• Developers write unit tests and integration tests for the MapReduce job's
individual components, such as mapper and reducer functions, to validate
their correctness.

• Unit tests ensure that each component performs as expected, while integration tests check how the components interact with each other.

Performance Testing:

• During local tests, developers can also assess the performance of their
MapReduce jobs on smaller datasets, identify potential bottlenecks, and
optimize the job accordingly.

• Performance testing on local data helps developers gauge the job's scalability and efficiency before running it on a larger production cluster.

• Once the MapReduce job passes local tests and meets the desired
performance requirements, it can be submitted to a production Hadoop
cluster for processing larger datasets. The data and resources available in
the production cluster are typically much more substantial, so testing on
the local environment is essential for detecting and fixing issues early in
the development process and ensuring the job performs efficiently at scale.

3.4 Anatomy of a MapReduce Job Run

The anatomy of a MapReduce job run involves several phases and components
that work together to process and analyze large-scale data in a distributed
computing environment. Here is a step-by-step overview of how a MapReduce job
is executed:

1. Job Submission:

The process begins when a client submits a MapReduce job to the Hadoop cluster.
The client typically uses Hadoop's command-line interface or an API to submit
the job.

2. Job Initialization:

Upon job submission, the ResourceManager (in YARN-based Hadoop) or the JobTracker (in older Hadoop versions) receives the job request. The ResourceManager/JobTracker verifies the job's configuration and ensures that it has sufficient resources to run the job.

3. Splitting Input Data:

The input data for the job is divided into smaller splits called "input splits." Each
input split is processed independently by a mapper task. The splitting is
performed by the InputFormat associated with the job, which determines how to
read and divide the input data.

4. Mapper Task Assignment:

The ResourceManager (in YARN) or JobTracker (in older Hadoop versions) assigns available mapper tasks to NodeManagers (in YARN) or TaskTrackers (in older Hadoop versions) across the cluster. Each mapper task processes one input split.

5. Map Task Execution:

NodeManagers/TaskTrackers execute the mapper tasks by applying the user-defined map function to the input data within their input splits. The map function processes the data and generates intermediate key-value pairs.

6. Shuffle and Sort:

After the map tasks complete, the intermediate key-value pairs are transferred to
the reducers. The shuffle and sort phase groups the values with the same key
together and sorts them based on the keys. This ensures that the reducers
process all values for each key in a sorted order.

7. Reducer Task Assignment:

The ResourceManager (in YARN) or JobTracker (in older Hadoop versions) assigns available reducer tasks to NodeManagers (in YARN) or TaskTrackers (in older Hadoop versions) across the cluster.

8. Reduce Task Execution:


NodeManagers/TaskTrackers execute the reducer tasks by applying the user-defined reduce function to the shuffled and sorted intermediate data. The reduce function aggregates or processes all values associated with each key and produces the final output.

9. Output Writing:

The output generated by the reduce tasks is written to the specified output
directory in the Hadoop Distributed File System (HDFS) or another storage
system.

10. Job Completion:

Once all map and reduce tasks are successfully executed, the entire MapReduce
job is considered complete, and the final output is available for further analysis or
use.

Throughout the job run, the ResourceManager/JobTracker monitors the progress of individual tasks and ensures that failed tasks are retried on other available nodes to ensure fault tolerance and successful job completion. This fault tolerance
is a crucial feature of the MapReduce framework, allowing it to handle failures
and ensure the successful processing of large-scale data in distributed
environments.

3.5 Classic MapReduce

Classic MapReduce refers to the original implementation of the MapReduce programming model in the Apache Hadoop ecosystem. It was introduced with Hadoop 1.x and served as the foundation for large-scale data processing in distributed environments. Classic MapReduce had two major components: JobTracker and TaskTracker.

JobTracker:

The JobTracker was the central coordinator responsible for managing job
submissions, task scheduling, and resource allocation in the Hadoop cluster. It
received job submissions from clients, divided the jobs into smaller tasks (map and
reduce tasks), and scheduled these tasks across available TaskTrackers in the cluster.
TaskTracker:

Each node in the Hadoop cluster had a TaskTracker, responsible for executing tasks
assigned to it by the JobTracker. A TaskTracker managed the execution of map and
reduce tasks, monitored their progress, and reported the status back to the
JobTracker.

The typical workflow of a classic MapReduce job was as follows:

• Client submitted a MapReduce job to the JobTracker.

• The JobTracker divided the job into map tasks and reduce tasks and scheduled
them across available TaskTrackers.

• TaskTrackers executed the map tasks and reduce tasks in parallel across the cluster nodes.

• The map tasks processed the input data and generated intermediate key-value
pairs.

• The JobTracker managed the shuffle and sort phase, where intermediate data
was shuffled and sorted based on keys before being passed to the reducers.

• The reduce tasks processed the shuffled data, aggregated values for each key,
and produced the final output.

Classic MapReduce served as the foundation for distributed data processing in Hadoop and made it possible to process massive datasets efficiently. However, it had
some limitations, including a single point of failure (JobTracker), scalability
challenges, and overhead in task startup for small jobs. To address these limitations,
Apache Hadoop evolved, and with the release of Hadoop 2.x, YARN (Yet Another
Resource Negotiator) was introduced as a new resource management layer,
decoupling resource management from job scheduling and enabling support for
various distributed computing models beyond MapReduce. This evolution paved the
way for more flexible and efficient data processing frameworks like Apache Tez,
Apache Spark, and Apache Flink, which build on top of YARN and provide better
performance, fault tolerance, and support for interactive and iterative data
processing use cases.
3.6 YARN

YARN stands for Yet Another Resource Negotiator. It is a resource management layer in Apache Hadoop, an open-source framework for distributed storage and processing of large datasets. YARN is responsible for managing and allocating resources (CPU, memory, etc.) across various applications running on a Hadoop cluster.

Before YARN, Hadoop MapReduce was the only processing framework in Hadoop.
MapReduce was tightly integrated with Hadoop's resource management, making
it difficult to run other types of applications. YARN was introduced as a major
architectural change in Hadoop 2.x to address these limitations and enable multi-tenancy, flexibility, and scalability.

Key components of YARN include:

1. ResourceManager:

The ResourceManager is the central authority that manages resources in the cluster. It keeps track of available resources and allocates them to different applications based on their resource requirements. The ResourceManager also monitors the health of NodeManagers and handles job scheduling.

2. NodeManager:

Each node in the Hadoop cluster runs a NodeManager process. The NodeManager is responsible for managing resources on a specific node. It communicates with the ResourceManager to request resources and report on their usage.

3. ApplicationMaster:

For each application running on the cluster (e.g., a MapReduce job or another
application using YARN), there is an ApplicationMaster. The
ApplicationMaster is responsible for negotiating resources with the
ResourceManager and coordinating the tasks and containers needed for the
application to run.

4. Container:

A container represents a slice of resources (CPU, memory) allocated to run a specific task in an application. Containers are managed by the NodeManager and isolated from each other to prevent interference.

With YARN, Hadoop becomes more versatile, allowing various processing engines
and applications, such as Apache Spark, Apache Flink, Apache HBase, and others,
to coexist and run simultaneously on the same Hadoop cluster, each getting its
share of resources. This flexibility and multi-tenancy support have significantly
improved the capabilities and efficiency of the Hadoop ecosystem.
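
As a small illustrative sketch of YARN's client-facing side, the code below uses the YarnClient API to ask the ResourceManager which applications it currently knows about; it assumes a reachable cluster and default client configuration.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // The YarnClient talks to the ResourceManager on behalf of the user.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        try {
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                // Each report is the ResourceManager's view of one application.
                System.out.println(app.getApplicationId() + "\t"
                        + app.getName() + "\t"
                        + app.getYarnApplicationState());
            }
        } finally {
            yarnClient.stop();
        }
    }
}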

3.7 Failures in Classic MapReduce and YARN


Both classic MapReduce and YARN have their own limitations and potential failure
points. Let's explore some of the common failures in each:
Failures in Classic MapReduce:
1. Single Point of Failure: In Hadoop's classic MapReduce framework, the JobTracker
is a single point of failure. If the JobTracker fails, it can cause the entire cluster to
become unavailable, impacting all running and pending jobs.
2. Scalability Issues: The JobTracker handles all job scheduling and resource
management, which can become a performance bottleneck as the cluster scales up
and handles more jobs.
3. Slow Startup for Small Jobs: For small jobs, the overhead of starting up the
MapReduce framework can be relatively high, leading to delays and reduced
efficiency.
4. Data Locality: Although classic MapReduce has data locality optimization, it is not
always perfect. Sometimes, data has to be transferred across the network, leading
to increased network traffic and slower job execution.
Failures in YARN:
1. ResourceManager Failure: The ResourceManager in YARN can become a single
point of failure. If the ResourceManager goes down, it can impact the entire
cluster's ability to manage resources and execute applications.
2. ApplicationMaster Failure: Each application running on YARN has its own
ApplicationMaster. If an ApplicationMaster fails, it can result in the failure of the
specific application it manages.
3. NodeManager Failure: If a NodeManager fails on a specific node, the
ResourceManager needs to redistribute the tasks running on that node to other
healthy nodes, causing additional overhead and possible delays.
4. Fairness and Queue Starvation: YARN allows for resource sharing among multiple
applications, but if not configured properly, certain applications or queues may
dominate the cluster's resources, leading to fairness issues and queue starvation
for other applications.
5. Resource Oversubscription: If YARN is not configured with appropriate resource
limits or if applications demand more resources than available in the cluster, it
can lead to resource oversubscription and overall performance degradation.

To address some of these issues and improve overall efficiency, newer technologies
like Apache Tez, Apache Spark, and Apache Flink have emerged. These frameworks
aim to overcome the limitations of classic MapReduce and provide better resource
management, fault tolerance, and performance optimizations. Moreover,
advancements in Hadoop's ecosystem and the continuous development of YARN
help in addressing and minimizing these failure points to make large-scale data
processing more reliable and efficient.

3.8 Job Scheduling

In MapReduce, job scheduling refers to the process of determining when and how
MapReduce jobs are executed on a Hadoop cluster. The job scheduler is responsible for
allocating resources to different jobs and ensuring that they run efficiently while
considering various factors like data locality, available cluster resources, job priorities,
and fairness.

The MapReduce job scheduling process typically involves the following steps:
1. Job Submission: Users or applications submit MapReduce jobs to the Hadoop
cluster through the client interface.
2. Job Splitting: The input data is divided into smaller splits, known as input
splits, which are processed independently by different mapper tasks. The
number of input splits determines the number of map tasks in the job.
3. Task Scheduling: The job scheduler is responsible for allocating resources (i.e.,
containers) to run mapper and reducer tasks. The tasks are scheduled on the
nodes in the cluster based on the data locality, available resources, and other
cluster conditions.
4. Data Locality: The job scheduler tries to assign tasks to nodes where the data
is already present (data locality) to minimize data transfer over the network.
This helps reduce network traffic and improves overall job performance.
5. Task Execution: Once the tasks are scheduled and assigned to nodes, the
TaskTrackers (in older Hadoop versions) or NodeManagers (in YARN-based
Hadoop versions) execute the tasks.
6. Task Monitoring and Failure Handling: The TaskTrackers or NodeManagers
monitor the task execution and report the status (success, failure, or in-progress) back to the JobTracker (in older Hadoop versions) or
ResourceManager (in YARN-based Hadoop versions). In case of task failures,
the job scheduler may reschedule the failed tasks on other nodes to ensure
fault tolerance and job completion.
7. Job Completion: After all the map and reduce tasks are successfully executed,
the MapReduce job is considered complete, and the results are typically
written to the Hadoop Distributed File System (HDFS) or other storage
systems.
8. Job schedulers in Hadoop can be configured to use different scheduling algorithms, such as First-Come-First-Served (FCFS), the Fair Scheduler, the Capacity Scheduler, or other custom schedulers (see the configuration sketch below). The choice of scheduling algorithm depends on factors like the nature of the workload, resource requirements, and performance objectives of the cluster.
9. In YARN-based Hadoop clusters, the ResourceManager and the
ApplicationMaster play critical roles in job scheduling and resource
management, providing a more flexible and scalable environment for running
various distributed applications beyond just MapReduce.
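
The sketch below shows, under the assumption of a YARN-based cluster, the two places the scheduling choice typically surfaces: the cluster-wide scheduler class (normally set in yarn-site.xml rather than in code) and the queue a particular job is submitted to; the queue name is hypothetical.

import org.apache.hadoop.conf.Configuration;

public class SchedulerSettings {
    public static Configuration schedulerConf() {
        Configuration conf = new Configuration();
        // Cluster-side choice of scheduler (usually configured in yarn-site.xml):
        // here, the Fair Scheduler class shipped with YARN; the Capacity Scheduler
        // lives under ...scheduler.capacity.CapacityScheduler.
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        // Job-side choice of the queue this job should be submitted to.
        conf.set("mapreduce.job.queuename", "analytics");
        return conf;
    }
}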

3.9 Shuffle and Sort


In the context of the MapReduce programming model, "Shuffle and Sort" refers to
a crucial phase that occurs between the Map phase and the Reduce phase. It is an
intermediate step where the data produced by the mappers is transferred and
rearranged to prepare it for processing by the reducers.
The "Shuffle and Sort" phase involves the following key steps:
1. Shuffle: During the Map phase, each mapper processes a portion of the input data
and generates key-value pairs as output. These key-value pairs are partitioned
based on the keys' hash values and sent to the appropriate reducer node. This
process is known as the "shuffle" phase.
2. Sort: Once the key-value pairs reach the reducer nodes, they need to be sorted
based on their keys to ensure that all values with the same key are grouped
together. Sorting is necessary because each reducer processes all the values
associated with a specific key. The sorting phase rearranges the key-value pairs in
ascending or descending order based on the keys.
The primary objectives of the "Shuffle and Sort" phase are:

1. Data Redistribution: The shuffle phase redistributes the data produced by mappers across the reducers based on the keys, ensuring that all values with the same key end up on the same reducer. This step enables the grouping of related data together, simplifying the reduction process.
2. Data Locality: The shuffle phase attempts to move data as close to the reducers as
possible to take advantage of data locality. When the data is already present on
the node where the reducer is running, it avoids unnecessary data transfer over
the network, improving overall performance.
3. Data Sorting: The sort phase ensures that the data is organized based on keys,
making it easier for the reducers to process data with the same key sequentially.
This sequential processing is essential for aggregation or other operations
performed during the Reduce phase.
The "Shuffle and Sort" phase is a crucial aspect of the MapReduce framework, and
efficient handling of this phase can significantly impact the overall performance of
MapReduce jobs. In large-scale data processing, this phase can be resource-intensive
and time-consuming, so optimizations in this phase are crucial for improving the
efficiency of MapReduce applications. Additionally, newer processing frameworks like
Apache Tez, Apache Spark, and Apache Flink have introduced more efficient and
optimized approaches to handle the shuffle and sort phase, further enhancing the
performance of data processing tasks in distributed environments.
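
For reference, these are the main knobs a MapReduce job exposes over the shuffle and sort phase through the Job API; the sketch assumes a job whose map output is (Text, IntWritable), and the specific values are illustrative.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ShuffleSettings {
    public static void configureShuffle(Job job) {
        // How intermediate keys are assigned to partitions (one partition per reducer).
        job.setPartitionerClass(HashPartitioner.class);
        // How many reduce tasks, and therefore how many partitions, the job uses.
        job.setNumReduceTasks(4);
        // An optional combiner pre-aggregates map output before it crosses the network.
        job.setCombinerClass(IntSumReducer.class);
        // The comparator used to sort intermediate keys before they reach the reducers.
        job.setSortComparatorClass(Text.Comparator.class);
    }
}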
3.10 Task Execution
In the MapReduce framework, task execution refers to the process of executing the
map and reduce tasks on a Hadoop cluster to process and analyze large-scale data.
The task execution phase is a critical part of the overall MapReduce job processing
pipeline and involves the following steps:
1. Map Task Execution:
• Input Split: The input data is divided into smaller splits called "input
splits." Each input split is processed independently by a mapper task. The
number of mapper tasks is determined by the number of input splits and
the configuration of the Hadoop cluster.
• Task Assignment: The JobTracker (in older Hadoop versions) or
ResourceManager (in YARN-based Hadoop versions) assigns available
mapper tasks to TaskTrackers (in older Hadoop versions) or
NodeManagers (in YARN-based Hadoop versions) across the cluster.
• Task Execution: Each assigned TaskTracker or NodeManager executes the
mapper task by applying the user-defined map function to the input data
within its input split. The map function processes the input data and
generates intermediate key-value pairs as output.
• Shuffle and Sort: After the map tasks complete, the intermediate key-value pairs produced by the mappers are transferred to the reducers, where they will be grouped and processed based on their keys. This process is known as the "shuffle and sort" phase, as explained in the previous section.
2. Reduce Task Execution:
• Task Assignment: Once the shuffle and sort phase is complete, the
JobTracker or ResourceManager assigns available reducer tasks to
TaskTrackers or NodeManagers across the cluster.
• Task Execution: Each assigned TaskTracker or NodeManager executes the
reducer task by applying the user-defined reduce function to the shuffled
and sorted intermediate data. The reduce function processes all values
associated with the same key and produces the final output, which is
written to the output file.
3. Output Writing:
• The output generated by the reduce tasks is stored in the Hadoop
Distributed File System (HDFS) or another storage system specified by the
user.
• Throughout the task execution phase, the JobTracker or ResourceManager
monitors the progress of individual tasks, ensuring that failed tasks are
rescheduled on other available nodes to ensure fault tolerance and
successful job completion. Once all map and reduce tasks are successfully
executed, the entire MapReduce job is considered complete, and the final
output is available for further analysis or use.
Efficient task execution is essential for achieving optimal performance in MapReduce
jobs. Proper configuration of the cluster, data locality optimizations, and efficient use of
resources can significantly impact the overall execution time of large-scale data
processing tasks in a Hadoop cluster.

3.11 MapReduce Types


In the MapReduce framework, there are two primary types of tasks: Map tasks
and Reduce tasks. These tasks work together to process and analyze large-scale data
in a distributed computing environment. Let's take a closer look at each type:
1. Map Tasks:
• Map tasks are responsible for processing input data and generating
intermediate key-value pairs.
• Input data is divided into smaller splits called "input splits," and each map task
processes one input split independently.
• The input data is passed to the user-defined map function, which transforms
the data and generates intermediate key-value pairs as output.
• Map tasks run in parallel across multiple nodes in the Hadoop cluster, allowing
for efficient processing of large datasets.
2. Reduce Tasks:
• Reduce tasks are responsible for processing the intermediate key-value pairs
generated by the map tasks and producing the final output.
• After the map tasks complete, the intermediate data is shuffled and sorted
based on the keys, and all values with the same key are grouped together.
• The sorted intermediate data is passed to the user-defined reduce function,
which aggregates or processes the values associated with each key and
generates the final output.
• Reduce tasks also run in parallel across multiple nodes in the cluster, making it
possible to process the grouped data concurrently.
• The MapReduce framework automatically manages the distribution of map
and reduce tasks across the nodes in the Hadoop cluster. The number of map
tasks is determined by the number of input splits and the configuration of the
cluster. The number of reduce tasks can be configured by the user or is based
on the number of reducers specified in the job configuration.
MapReduce tasks are designed to be fault-tolerant, and in the event of task
failures, the framework automatically retries the failed tasks on other available
nodes to ensure successful job completion.
Besides map and reduce tasks, there are additional types of tasks and phases in
advanced processing frameworks like Apache Tez, Apache Spark, and Apache Flink,
which provide more flexibility and optimization opportunities for large-scale data
processing beyond classic MapReduce. These newer frameworks often introduce
additional task types, such as shuffle tasks, merge tasks, and more, to enhance
performance and resource utilization in data processing pipelines.

3.12 Input Formats


In the MapReduce framework, input formats define how input data is read and
processed by the mappers. Hadoop provides several built-in input formats that cater
to different types of data and file formats. Each input format determines how input
data is split, read, and passed to the mappers for processing. Some of the commonly
used input formats in Hadoop are:

1. TextInputFormat: This is the default input format for processing text data. It reads
data line-by-line, where each line becomes a separate input record for the
mappers. The key represents the byte offset of the line, and the value represents
the actual text of the line.
2. KeyValueTextInputFormat: This input format is used when data is in a key-value
format, with each line containing a key and its associated value separated by a
delimiter (e.g., tab or comma). It allows mappers to process key-value pairs as
separate input records.
3. SequenceFileInputFormat: This input format is used to read data stored in
Hadoop's SequenceFile format. SequenceFiles are binary files that allow efficient
storage of key-value pairs. SequenceFileInputFormat reads the keys and values
from SequenceFiles and passes them to the mappers.
4. NLineInputFormat: This input format allows mappers to process a fixed number
of lines as a single input record. It is useful when processing data with a fixed
structure, such as log files.
5. CombineTextInputFormat: This input format combines small input files into
larger splits to reduce the number of mappers and improve efficiency for
processing small files.
6. DBInputFormat: This input format allows mappers to read data from relational
databases using JDBC. It enables MapReduce jobs to process data directly from
database tables.
7. AvroKeyInputFormat and AvroKeyValueInputFormat: These input formats are
used for reading Avro data files, which are a compact and efficient binary data
format.
8. Custom Input Formats: Hadoop allows users to define custom input formats to
process data in formats not covered by the built-in input formats. By
implementing the InputFormat interface, users can customize how data is read
and passed to the mappers.
The choice of input format depends on the nature of the input data and the specific
requirements of the MapReduce job. Proper selection of the input format can
significantly impact job performance and data processing efficiency.
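
A short sketch of wiring input formats into a job follows; the job instance, the separator, the line count, and the input path are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatExamples {
    public static void useKeyValueText(Job job) {
        // Each line is split on the first separator (tab by default) into key and value.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.getConfiguration().set(
            "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    }

    public static void useNLine(Job job) throws IOException {
        // Every mapper receives a fixed number of lines as its input split.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1000);
        FileInputFormat.addInputPath(job, new Path("/data/logs"));
    }
}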

3.13 Output Formats


In the MapReduce framework, output formats define how the output data from the
reducers is written to the storage system after processing. Hadoop provides several
built-in output formats that cater to different storage needs and data formats. Each
output format determines how the final output data is organized and written to the
output location. Some of the commonly used output formats in Hadoop are:
1. TextOutputFormat: This is the default output format for MapReduce jobs. It writes
the output as text data, where each key-value pair is written as a line, with the key
and value separated by a delimiter (usually a tab or a comma).
2. SequenceFileOutputFormat: This output format writes the output data as a
Hadoop SequenceFile. SequenceFiles are binary files that store key-value pairs
efficiently, making them suitable for large-scale data storage.
3. KeyValueTextOutputFormat: This output format writes the output as a key-value
text file, where each line contains a key and its associated value separated by a
delimiter.
4. MultipleOutputFormat: This output format allows MapReduce jobs to write
output data to multiple output directories, each having a different format. It is
useful when a single job needs to generate multiple types of outputs
simultaneously.
5. NullOutputFormat: This output format discards the output generated by the
reducers, useful when a MapReduce job is used for data processing but does not
require any output.
6. DBOutputFormat: This output format allows writing the output data from the
reducers directly to a relational database using JDBC.
7. AvroKeyOutputFormat and AvroKeyValueOutputFormat: These output formats
are used for writing Avro data files as the output.
8. Custom Output Formats: Hadoop allows users to define custom output formats to
write output data in formats not covered by the built-in output formats. By
implementing the OutputFormat interface, users can customize how data is
written and stored based on their specific needs.

The choice of output format depends on the desired format of the output data and the
storage system used for data persistence. Proper selection of the output format ensures
that the final output is organized efficiently and can be easily processed by other
systems or applications.
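
And a matching sketch for output formats; again, the job instance, the key/value classes, and the output path are assumptions for illustration.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatExamples {
    public static void useSequenceFileOutput(Job job) {
        // Write the reducer output as a binary SequenceFile of key-value pairs.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Optional block compression for the output files.
        SequenceFileOutputFormat.setOutputCompressionType(
            job, SequenceFile.CompressionType.BLOCK);
        FileOutputFormat.setOutputPath(job, new Path("/reports/severity-counts"));
    }
}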
