
CCS334 BIG DATA ANALYTICS

UNIT IV
MAP REDUCE APPLICATIONS

MAPREDUCE OVERVIEW
MapReduce is the processing engine of Hadoop that processes and computes large
volumes of data. It is one of the most common engines used by Data Engineers to process
Big Data. It allows businesses and other organizations to run calculations to:

 Determine the price for their products that yields the highest profits
 Know precisely how effective their advertising is and where they should spend their ad
dollars
 Make weather predictions
 Mine web clicks, sales records purchased from retailers, and Twitter trending topics to
determine what new products the company should produce in the upcoming season
Before MapReduce, these calculations were complicated. Now, programmers can tackle
problems like these with relative ease. Data scientists have coded complex algorithms into
frameworks so that programmers can use them.

Companies no longer need an entire department of Ph.D. scientists to model data, nor do
they need a supercomputer to process large sets of data, as MapReduce runs across a
network of low-cost commodity machines.

There are two phases in the MapReduce programming model:

1. Mapping
2. Reducing

The following part of this MapReduce tutorial discusses both these phases.

MAPPING AND REDUCING

A mapper class handles the mapping phase; it processes the data present on different
datanodes. A reducer class handles the reducing phase; it aggregates and reduces the
outputs of the mappers running on different datanodes to generate the final output.


Data stored on multiple machines passes through the mapping phase. The final output is
obtained after the data is shuffled, sorted, and reduced.

INPUT DATA

Hadoop accepts data in various formats and stores it in HDFS. This input data is worked
upon by multiple map tasks.


MAP TASKS
Map reads the data, processes it, and generates key-value pairs. The number of map tasks
depends upon the input file and its format.

Typically, a file in a Hadoop cluster is broken down into blocks, each with a default size of
128 MB. Depending upon the size, the input file is split into multiple chunks. A map task
then runs for each chunk. The mapper class has mapper functions that decide what
operation is to be performed on each chunk.
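
As an illustration of what a mapper class can look like, the following Java sketch (the class name WordCountMapper and the tokenizing logic are illustrative assumptions, not code from this material) emits a (word, 1) pair for every word in each line of its split:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: receives (byte offset, line of text) pairs from its split
// and emits (word, 1) for every token it finds.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}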

REDUCE TASKS

In the reducing phase, a reducer class performs operations on the data generated from the
map tasks through a reducer function. It shuffles, sorts, and aggregates the intermediate
key-value pairs (tuples) into a set of smaller tuples.
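
A matching reducer class, again an illustrative sketch rather than code from this material, sums the values received for each key after the shuffle and sort:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: receives (word, [1, 1, ...]) after the shuffle and sort,
// and emits (word, total count) as the final output tuple.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}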

OUTPUT

The smaller set of tuples is the final output and gets stored in HDFS.
Let us look at the MapReduce workflow in the next section of this MapReduce tutorial.

MAPREDUCE WORKFLOW

Here are some of the key concepts related to MapReduce.

 Job – A Job in the context of Hadoop MapReduce is the unit of work to be performed
as requested by the client / user. The information associated with the Job includes
the data to be processed (input data), MapReduce logic / program / algorithm, and
any other relevant configuration information necessary to execute the Job.

 Task – Hadoop MapReduce divides a Job into multiple sub-jobs known as Tasks.
These tasks can be run independently of each other on various nodes across the
cluster. There are primarily two types of Tasks – Map Tasks and Reduce Tasks.


 Job Tracker – Just like the storage (HDFS), the computation (MapReduce) also
works in a master-slave / master-worker fashion. A Job Tracker node acts as the
Master and is responsible for scheduling / executing Tasks on appropriate nodes,
coordinating the execution of tasks, sending the information needed for the execution
of tasks, getting the results back after the execution of each task, re-executing
failed Tasks, and monitoring / maintaining the overall progress of the Job. Since a Job
consists of multiple Tasks, a Job’s progress depends on the status / progress of the
Tasks associated with it. There is only one Job Tracker node per Hadoop Cluster.

 Task Tracker – A Task Tracker node acts as the Slave and is responsible for
executing a Task assigned to it by the Job Tracker. There is no restriction on the
number of Task Tracker nodes that can exist in a Hadoop Cluster. The Task Tracker
receives the information necessary for the execution of a Task from the Job Tracker,
executes the Task, and sends the results back to the Job Tracker.

 Map() – Map Task in MapReduce is performed using the Map() function. This part of
the MapReduce is responsible for processing one or more chunks of data and
producing the output results.

 Reduce() – The next part / component / stage of the MapReduce programming


model is the Reduce() function. This part of the MapReduce is responsible for
consolidating the results produced by each of the Map() functions/tasks.

 Data Locality – MapReduce tries to place the compute as close to the data as
possible. First, it tries to put the compute on the same node where the data resides;
if that is not possible (for example, because the node is down or is busy with another
computation), it then tries to put the compute on the node nearest to the data
node(s) that contain the data to be processed. This feature of MapReduce is called
“Data Locality”.


The following diagram shows the logical flow of a MapReduce programming model.

The stages depicted above are

 Input: This is the input data / file to be processed.

 Split: Hadoop splits the incoming data into smaller pieces called “splits”.

 Map: In this step, MapReduce processes each split according to the logic defined in
the map() function. Each mapper works on one split at a time. Each mapper is treated
as a task, and multiple tasks are executed across different Task Trackers and
coordinated by the Job Tracker.
 Combine: This is an optional step used to improve performance by reducing the
amount of data transferred across the network. The combiner is essentially the same
as the reduce step and is used to aggregate the output of the map() function before it
is passed to the subsequent steps.
 Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to
put them in order, and grouped before being sent to the next step.


 Reduce: This step is used to aggregate the outputs of mappers using the reduce()

function. Output of reducer is sent to the next and final step. Each reducer is treated as
a task and multiple tasks are executed across different Task Trackers and coordinated
by the Job Tracker.
 Output: Finally the output of reduce step is written to a file in HDFS.


WORD COUNT EXAMPLE

For the purpose of understanding MapReduce, let us consider a simple example. Let us
assume that we have a file containing four lines of text made up of words such as SQL,
DW, BI, and SSRS.

In this file, we need to count the number of occurrences of each word. For instance, DW
appears twice, BI appears once, SSRS appears twice, and so on. Let us see how this
counting operation is performed when this file is input to MapReduce.

Below is a simplified representation of the data flow for Word Count Example.


 Input: In this step, the sample file is input to MapReduce.

 Split: In this step, Hadoop splits / divides our sample input file into four parts, each
part made up of one line from the input file. Note that, for the purpose of this
example, we are treating each line as one split. However, this is not necessarily
the case in a real scenario.

 Map: In this step, each split is fed to a mapper which is the map() function containing
the logic on how to process the input data, which in our case is the line of text
present in the split. For our scenario, the map() function would contain the logic to
count the occurrence of each word and each occurrence is captured / arranged as a
(key, value) pair, which in our case is like (SQL, 1), (DW, 1), (SQL, 1), and so on.

 Combine: This is an optional step and is often used to improve the performance by
reducing the amount of data transferred across the network. This is essentially the
same as the reducer (reduce() function) and acts on output from each mapper. In
our example, the key value pairs from first mapper “(SQL, 1), (DW, 1), (SQL, 1)” are
combined and the output of the corresponding combiner becomes “(SQL, 2), (DW,
1)”.

 Shuffle and Sort: In this step, the output of all the mappers is collected, shuffled,
sorted, and arranged to be sent to the reducer.

 Reduce: In this step, the collective data from various mappers, after being shuffled
and sorted, is combined / aggregated and the word counts are produced as (key,
value) pairs like (BI, 1), (DW, 2), (SQL, 5), and so on.

 Output: In this step, the output of the reducer is written to a file on HDFS. For our
sample file, the final output lists each word along with its count. A minimal driver
sketch for running such a word count job is shown below.
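
The following driver is a minimal sketch showing how a mapper, an optional combiner, and a reducer are wired into a job; the class names WordCountMapper and WordCountReducer refer to the illustrative classes sketched earlier, and the input/output paths are taken from command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver that wires the mapper, combiner, and reducer together.
public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // optional combine step
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}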


GAME EXAMPLE

Say you are processing a large amount of data and trying to find out what percentage of
your user base was talking about games. First, we will identify the keywords which we
are going to map from the data to conclude that it is something related to games. Next, we
will write a mapping function to identify such patterns in our data. For example, the
keywords can be Gold medals, Bronze medals, Silver medals, Olympic football, basketball,
cricket, etc.

Let us take the following chunks in a big data set and see how to process it.

“Hi, how are you”

“We love football”

“He is an awesome football player”

“Merry Christmas”

“Olympics will be held in China”

“Records broken today in Olympics”

“Yes, we won 2 Gold medals”

“He qualified for Olympics”


MAPPING PHASE – The map phase of our algorithm will be as follows:

1. Declare a function “Map”


2. Loop: for each word equal to “football”
3. Increment counter
4. Return key value “football”=>counter

In the same way, we can define n number of mapping functions for mapping various words:
“Olympics”, “Gold Medals”, “cricket”, etc.

REDUCING PHASE – The reduce function will accept the input from all these mappers in
the form of key-value pairs and then process it. So, the input to the reduce function will
look like the following:

reduce(“football”=>2)

reduce(“Olympics”=>3)

Our algorithm will continue with the following steps

5. Declare a function reduce to accept the values from the map functions.
6. For each key-value pair, add the value to a counter.
7. Return “games” => counter.

At the end, we will get the output like “games”=>5.
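
A rough Java sketch of this idea follows; the keyword list and the class name are illustrative assumptions, and a real implementation could just as well use one mapper per keyword, exactly as in the pseudocode above. Here, every occurrence of a game-related keyword in a chunk is emitted under the single key "games", and a summing reducer like the one sketched earlier in this unit could then produce the total:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper for the game example: every occurrence of a game-related
// keyword in a chunk is emitted as ("games", 1); the reducer then sums these counts.
public class GamesKeywordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Set<String> KEYWORDS = new HashSet<String>(Arrays.asList(
            "football", "cricket", "basketball", "olympics", "gold", "silver", "bronze"));

    private static final Text GAMES = new Text("games");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().toLowerCase().split("\\W+")) {
            if (KEYWORDS.contains(token)) {
                context.write(GAMES, ONE);
            }
        }
    }
}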

Now, looking at the bigger picture, we can write any number of mapper functions here. Let
us say that you want to know who was wishing whom. In this case you would write a
mapping function to map words like “Wishing”, “Wish”, “Happy”, and “Merry”, and then
write a corresponding reducer function.

Here you will also need a function for shuffling, which will distinguish between the “games”
and “wishing” keys returned by the mappers and send them to the respective reducer function.
Similarly, you may need a function for splitting the data initially to give inputs to the mapper
functions in the form of chunks. The following diagram summarizes the flow of the MapReduce algorithm:


In the above map reduce flow

 The input data can be divided into any number of chunks, depending upon the amount
of data and the processing capacity of each unit.
 Next, the chunks are passed to the mapper functions. Note that all the chunks are
processed simultaneously, which is what enables parallel processing of the data.
 After that, shuffling happens which leads to aggregation of similar patterns.
 Finally, reducers combine them all to get a consolidated output as per the logic.
 This approach scales well: as the size of the input data grows, we can keep
increasing the number of parallel processing units.


UNIT TESTING WITH MRUNIT


What is MRUnit?
MRUnit is a Java library that provides a testing framework for Hadoop MapReduce code. It
allows you to write unit tests for your MapReduce code and run them locally, without the
need for a Hadoop cluster. MRUnit is built on top of JUnit, a popular unit testing framework
for Java, and provides a set of APIs that you can use to test your MapReduce code.
MRUnit can test both the map and reduce functions of your code, as well as the overall job
configuration.

Why Test Hadoop MapReduce Code?


Testing Hadoop MapReduce code is important for several reasons:

1. Correctness: MapReduce code can be complex and difficult to debug. Testing your
code can help you catch errors early and ensure that your code works as expected.

2. Performance: MapReduce jobs can take a long time to run, especially when
processing large datasets. Testing your code can help you identify performance
bottlenecks and optimize your code for faster processing.

3. Scalability: Hadoop is designed to scale horizontally, which means that your code
must be able to handle datasets of varying sizes. Testing your code can help you
ensure that it can handle large datasets without crashing or slowing down.

Setting up MRUnit
To use MRUnit, you need to add the MRUnit dependency to your project. You can do this
by adding the following code to your Maven pom.xml file:

<dependency>
<groupId>org.apache.mrunit</groupId>
<artifactId>mrunit</artifactId>
<version>1.1.0</version>
<scope>test</scope>
</dependency>


Once you have added the dependency, you can start writing unit tests for your MapReduce
code.

Writing MRUnit Tests

To write MRUnit tests, you create a test class that uses the
org.apache.hadoop.mrunit.mapreduce.MapReduceDriver class (or the related MapDriver and
ReduceDriver classes). The driver provides methods that allow you to set up your input
data, run your MapReduce job, and verify the output.

Here is an example of a simple MRUnit test:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Test;

public class WordCountTest {

    @Test
    public void testWordCount() throws IOException {
        // Drive the mapper and reducer together, checking the final (post-reduce) output.
        MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> driver =
                new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>();

        driver.withMapper(new WordCountMapper())
              .withReducer(new WordCountReducer())
              .withInput(new LongWritable(1), new Text("hello world"))
              .withOutput(new Text("hello"), new IntWritable(1))
              .withOutput(new Text("world"), new IntWritable(1))
              .runTest();
    }
}
In this example, we create a MapReduceDriver object and configure it with a WordCountMapper
and a WordCountReducer. We then set up our input data and expected output,
and run the test using the runTest() method.

The withInput() method is used to set up the input data for the MapReduce job. In this case,
we are providing a single input record with the value “hello world”.


The withOutput() method is used to set up the expected output for the MapReduce job. In
this case, we expect the output to contain two records: one with the word “hello” and a
count of 1, and one with the word “world” and a count of 1.

Once we have set up our input and output data, we run the test using the runTest() method.
This method runs the MapReduce job and verifies that the output matches the expected
output.
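
MRUnit also provides MapDriver and ReduceDriver for testing the mapper or reducer in isolation. A minimal sketch, assuming the same illustrative WordCountMapper as above:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void testMapperEmitsOnePairPerWord() throws IOException {
        MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
                MapDriver.newMapDriver(new WordCountMapper());

        // The mapper alone should emit one (word, 1) pair per token, with no aggregation.
        mapDriver.withInput(new LongWritable(1), new Text("hello hello"))
                 .withOutput(new Text("hello"), new IntWritable(1))
                 .withOutput(new Text("hello"), new IntWritable(1))
                 .runTest();
    }
}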

Hadoop testing using MRUnit is an essential practice for data scientists and software engineers
working with Hadoop MapReduce code. MRUnit provides a simple and effective way to test
your code and help ensure its correctness, performance, and scalability. By following the steps
outlined in this section, you can start using MRUnit to test your Hadoop MapReduce code.

ANATOMY OF MAP REDUCE JOB RUN


There are five independent entities:

 The client, which submits the MapReduce job.


 The YARN resource manager, which coordinates the allocation of
compute resources on the cluster.
 The YARN node managers, which launch and monitor the compute
containers on machines in the cluster.
 The MapReduce application master, which coordinates the tasks
running the MapReduce job. The application master and the MapReduce tasks run in
containers that are scheduled by the resource manager and managed by the node
managers.
 The distributed filesystem, which is used for sharing job files between
the other entities.

JOB SUBMISSION :
 The submit() method on Job creates an internal JobSubmitter
instance and calls submitJobInternal() on it.

 Having submitted the job, waitForCompletion() polls the job’s
progress once per second and reports the progress to the console if it
has changed since the last report.

 When the job completes successfully, the job counters are displayed.
Otherwise, the error that caused the job to fail is logged to the
console.

The job submission process implemented by JobSubmitter does the following:

 Asks the resource manager for a new application ID, used for the
MapReduce job ID.

 Checks the output specification of the job. For example, if the output
directory has not been specified or it already exists, the job is not
submitted and an error is thrown to the MapReduce program.


 Computes the input splits for the job. If the splits cannot be
computed (because the input paths don’t exist, for example), the job
is not submitted and an error is thrown to the MapReduce
program.

 Copies the resources needed to run the job, including the job
JAR file, the configuration file, and the computed input splits, to
the shared filesystem in a directory named after the job ID.

 Submits the job by calling submitApplication() on the resource


manager.

JOB INITIALIZATION :

 When the resource manager receives a call to its submitApplication() method, it


hands off the request to the YARN scheduler.

 The scheduler allocates a container, and the resource manager then


launches the application master’s process there, under the node
manager’s management.

 The application master for MapReduce jobs is a Java application


whose main class is MRAppMaster .

 It initializes the job by creating a number of bookkeeping objects to


keep track of the job’s progress, as it will receive progress and
completion reports from the tasks.

 It retrieves the input splits computed in the client from the shared
filesystem.

 It then creates a map task object for each split, as well as a number of
reduce task objects determined by the mapreduce.job.reduces property (set by
the setNumReduceTasks() method on Job).


TASK ASSIGNMENT:

 If the job does not qualify for running as an uber task, then the
application master requests containers for all the map and reduce
tasks in the job from the resource manager.

 Requests for map tasks are made first and with a higher priority than
those for reduce tasks, since all the map tasks must complete before
the sort phase of the reduce can start.

 Requests for reduce tasks are not made until 5% of map tasks have
completed.

TASK EXECUTION:
 Once a task has been assigned resources for a container on a
particular node by the resource manager’s scheduler, the application
master starts the container by contacting the node manager.

 The task is executed by a Java application whose main class is


YarnChild. Before it can run the task, it localizes the resources that
the task needs, including the job configuration and JAR file, and
any files from the distributed cache.

 Finally, it runs the map or reduce task.


STREAMING

 Streaming runs special map and reduce tasks for the purpose of
launching the user-supplied executable and communicating with it.

 The Streaming task communicates with the process (which may be
written in any language) using standard input and output streams.

 During execution of the task, the Java process passes input key-value
pairs to the external process, which runs them through the user-defined
map or reduce function and passes the output key-value pairs back to
the Java process.

 From the node manager’s point of view, it is as if the child process
ran the map or reduce code itself.

PROGRESS AND STATUS UPDATES :


 MapReduce jobs are long-running batch jobs, taking anything from
tens of seconds to hours to run.

 A job and each of its tasks have a status, which includes such things
as the state of the job or task (e.g., running, successfully completed,

failed), the progress of maps and reduces, the values of the job’s
counters, and a status message or description (which may be set by
user code).

 When a task is running, it keeps track of its progress (i.e., the
proportion of the task that is completed).

 For map tasks, this is the proportion of the input that has been
processed.

 For reduce tasks, it’s a little more complex, but the system can still
estimate the proportion of the reduce input processed.

It does this by dividing the total progress into three parts, corresponding to the three phases
of the shuffle.
 As the map or reduce task runs, the child process communicates
with its parent application master through the umbilical interface.

 The task reports its progress and status (including counters) back to
its application master, which has an aggregate view of the job, every
three seconds over the umbilical interface.


How status updates are propagated through the MapReduce System


 The resource manager web UI displays all the running applications
with links to the web UIs of their respective application masters,
each of which displays further details on the MapReduce job,
including its progress.

 During the course of the job, the client receives the latest status
by polling the application master every second (the interval is set
via mapreduce.client.progressmonitor.pollinterval).

JOB COMPLETION:
 When the application master receives a notification that the last
task for a job is complete, it changes the status for the job to Successful.

 Then, when the Job polls for status, it learns that the job has
completed successfully, so it prints a message to tell the user and
then returns from the waitForCompletion().

 Finally, on job completion, the application master and the task
containers clean up their working state, and the OutputCommitter’s
commitJob() method is called.

 Job information is archived by the job history server to enable later interrogation by
users if desired.

YARN ARCHITECTURE

Apache Hadoop YARN (Yet Another Resource Negotiator) is a resource management layer
in Hadoop. YARN came into the picture with the introduction of Hadoop 2.x. It allows
various data processing engines such as interactive processing, graph processing, batch
processing, and stream processing to run and process data stored in HDFS (Hadoop
Distributed File System).

YARN was introduced to make the most out of HDFS, and job scheduling is also handled
by YARN.


HADOOP YARN ARCHITECTURE

Now, we will discuss the architecture of YARN. Apache YARN framework contains a
Resource Manager (master daemon), Node Manager (slave daemon), and an Application
Master.

Let’s now discuss each component of Apache Hadoop YARN one by one in detail.

RESOURCE MANAGER

Resource Manager is the master daemon of YARN. It is responsible for managing the
applications running in the cluster, along with the global assignment of resources such as
CPU and memory, and it handles job scheduling. Resource Manager has two components:


 Scheduler: The Scheduler’s task is to allocate resources to the running applications. It
deals only with scheduling and hence performs no tracking and no
monitoring of applications.

 Application Manager: The Application Manager manages the applications running in
the cluster. Tasks such as starting the Application Master and monitoring it are done
by the Application Manager.

Let’s move on with the second component of Apache Hadoop YARN.

NODE MANAGER

Node Manager is the slave daemon of YARN. It has the following responsibilities:
 Node Manager has to monitor the container’s resource usage, along with reporting it
to the Resource Manager.
 The health of the node on which YARN is running is tracked by the Node Manager.
 It takes care of each node in the cluster while managing the workflow, along with
user jobs on a particular node.
 It keeps the data in the Resource Manager updated
 Node Manager can also destroy or kill the container if it gets an order from the
Resource Manager to do so.

APPLICATION MASTER

The third component of Apache Hadoop YARN is the Application Master. Every job
submitted to the framework is an application, and every application has a specific
Application Master associated with it. The Application Master performs the following tasks:

 It coordinates the execution of the application in the cluster, along with managing the
faults.

 It negotiates resources from the Resource Manager.

 It works with the Node Manager for executing and monitoring other components’
tasks.


 At regular intervals, it sends heartbeats to the Resource Manager to report its
health and to keep the record of its resource demands up to date.

Now, we will step forward with the fourth component of Apache Hadoop YARN.

CONTAINER

A container is a set of physical resources (CPU cores, RAM, disks, etc.) on a single node.
The tasks of a container are listed below:
 It grants the right to an application to use a specific amount of resources (memory,
CPU, etc.) on a specific host.
 YARN containers are managed by a Container Launch Context (CLC), the record that
describes the container life cycle. This record contains a map of environment variables,
dependencies stored in remotely accessible storage, security tokens, the payload for
Node Manager services, and the command necessary to create the process.

How does Apache Hadoop YARN work?

YARN decouples resource management from MapReduce, making the Hadoop environment
more suitable for applications that cannot wait for batch processing jobs to finish. So, no more
batch processing delays with YARN! This architecture lets you process data stored in a single
platform with multiple processing engines, such as real-time streaming, interactive SQL, and
batch processing, and to work with analytics in a completely different manner. It can be
considered the basis of the next generation of the Hadoop ecosystem, enabling
forward-thinking organizations to realize a modern data architecture.


How is an application submitted in Hadoop YARN?

1. Submit the job


2. Get an application ID
3. Retrieval of the context of application submission
 Start Container Launch
 Launch Application Master

4. Allocate Resources.

 Container
 Launching

5. Executing

Workflow of an Application in Apache Hadoop YARN


1. Submission of the application by the client
2. Container allocation for starting the Application Master
3. Registering the Application Master with the Resource Manager
4. The Application Master asks for containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code gets executed in the container
7. The client contacts the Resource Manager / Application Master to monitor the status of
the application
8. The Application Master unregisters from the Resource Manager


Features of Hadoop YARN


 High-degree compatibility: Applications created using the MapReduce framework
can be run easily on YARN.

 Better cluster utilization: YARN allocates all cluster resources efficiently and
dynamically, which leads to better utilization of Hadoop as compared to the
previous version of it.

 Utmost scalability: Whenever there is an increase in the number of nodes in the


Hadoop cluster, the YARN Resource Manager assures that it meets the user
requirements.

 Multi-tenancy: Various engines that access data on the Hadoop cluster can
efficiently work together all because of YARN as it is a highly versatile technology.

YARN vs MapReduce

In Hadoop 1.x, the batch processing framework MapReduce was closely paired with HDFS.
With the addition of YARN to these two components, giving birth to Hadoop 2.x, came a lot
of differences in how Hadoop worked. Let’s go through these differences.


 Type of processing: YARN supports real-time, batch, and interactive processing with
multiple engines; MapReduce v1 supports only silo and batch processing with a single engine.
 Cluster resource optimization: excellent in YARN, due to central resource management;
average in MapReduce v1, due to fixed Map and Reduce slots.
 Suitable for: YARN runs MapReduce and non-MapReduce applications; MapReduce v1
runs only MapReduce applications.
 Managing cluster resources: handled by YARN itself; in MapReduce v1, handled by the
JobTracker.
 Namespace: with YARN, Hadoop supports multiple namespaces; MapReduce v1 supports
only one namespace, i.e., HDFS.

Classic Map Reduce / YARN and MapReduce interaction


MapReduce v1 (Classic MapReduce)

There are four entities involved in classic MapReduce:
1. The job client, which submits the job.
2. The job tracker, which handles the overall execution of the job. It is a Java application
whose main class is JobTracker.
3. The task trackers, which run the individual tasks. Each is a Java application whose main
class is TaskTracker.
4. The shared filesystem, provided by HDFS.

Map Reduce Life Cycle


The following are the main phases in map reduce

Job Submission

The submit() method on Job creates an internal instance of JobSubmitter and
calls the submitJobInternal() method on it. Having submitted the job, waitForCompletion()
polls the job’s progress once a second.

On calling this method, the following happens:

 It goes to the JobTracker and gets a job ID for the job.

 It checks whether an output directory has been specified and whether it already exists,
and throws an error if either check fails.

 It computes the input splits and throws an error if it fails to do so, for example because
the input paths don’t exist.

 It copies the resources to the Job Tracker’s filesystem in a directory named after the
job ID. These resources include the configuration files, the job JAR file, and the
computed input splits. They are copied with high redundancy, by default a replication
factor of 10.

 Finally, it calls the submitJob() method on the Job Tracker.

Job Initialization

The Job Tracker performs the following steps:

 Creates bookkeeping objects to track the tasks and their progress.

 Creates a map task for each input split.
 The number of reduce tasks is defined by the configuration property mapred.reduce.tasks,
set by setNumReduceTasks().
 Tasks are assigned task IDs at this point.


 In addition to these, two other tasks are created: a job initialization (setup) task and a
job cleanup task; these are run by task trackers.
 The job initialization task, based on the OutputCommitter in use (for example,
FileOutputCommitter), creates the output directory to store the tasks' output as well as
the temporary output.
 The job cleanup task deletes the temporary directory after the job is complete.

Task Assignment

Each Task Tracker sends a heartbeat to the Job Tracker every five seconds. This heartbeat
serves as a communication channel and indicates whether the Task Tracker is ready to run
a new task; it also reports the number of available slots.

Here is how job allocation takes place.

 The Job Tracker first selects a job to pick a task from, based on the job scheduling
algorithm in use.
 The default scheduler fills empty map task slots before reduce task slots.
 For a map task, it then chooses a task tracker in the following order of
priority: data-local, rack-local, and then off-rack (network).
 For a reduce task, it simply chooses a task tracker which has empty slots.
 The number of slots a task tracker has depends on the number of cores.

Task Execution

Following is how a job is executed

Setup:

 Task Tracker copies the job jar file from the shared filesystem (HDFS) and any
files needed to run the tasks from distributed cache to Task Tracker’s local file
system


 Task tracker creates a local working directory, and un-jars the jar file into the local file
system

 It then creates an instance of Task Runner


Action:
 The task tracker starts the TaskRunner in a new JVM to run the map or reduce task.

 A separate process is needed so that the Task Tracker does not crash in the case of a
bug in user code or in the JVM.

 The child process communicates its progress to the parent process via the umbilical
interface.

 Each task can perform setup and cleanup actions, which are run in the same JVM
as the task itself, based on the OutputCommitter.

 In the case of speculative execution, the duplicate task attempts are killed before the
task output is committed; only one of the duplicate attempts is committed.

 Even when the map or reduce task runs externally, via sockets in the case of Pipes or
via stdin/stdout in the case of Streaming, the Java process provides the input to and
collects the output from the running process.

Job/Task Progress

Here is how the progress of a job or task is monitored:

 Job Client keeps polling the Job Tracker for progress.

 Each child process reports its progress to its parent task tracker.

 If a task reports progress, it sets a flag to indicate that the status change should be
sent to the task tracker. The flag is checked in a separate thread every 3 seconds,
and if set it notifies the task tracker of the current task status.

 The task tracker sends its progress to the Job Tracker over the heartbeat every five
seconds. Counters are sent less frequently, because they can be relatively
high-bandwidth.

 Job Tracker then assembles task progress from all task trackers and keeps a holistic
view of job.


 The Job receives the latest status by polling the job tracker every second.


Job Completion

On job completion, the cleanup task is run.

 The task notifies its task tracker of completion, which in turn notifies the job tracker.
 The Job Tracker then sends the job completion message to the client when it polls.
 The job tracker cleans up its working state for the job and instructs the task trackers to
do the same; all the temporary directories are cleaned up.
 This causes the job client’s waitForJobToComplete() method to return.

Failures in Classic MapReduce

In the MapReduce 1 runtime there are three failure modes to consider: failure of the
running task, failure of the task tracker, and failure of the job tracker. Let’s look at each in
turn.

Task Failure

Consider first the case of the child task failing. The most common way that this happens is
when user code in the map or reduce task throws a runtime exception. If this happens, the
child JVM reports the error back to its parent task tracker, before it exits. The error
ultimately makes it into the user logs. The task tracker marks the task attempt as failed,
freeing up a slot to run another task.

For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is marked
as failed. This behavior is governed by the stream.non.zero.exit.is.failure property (the
default is true).

Another failure mode is the sudden exit of the child JVM—perhaps there is a JVM bug that
causes the JVM to exit for a particular set of circumstances exposed by the MapReduce
user code. In this case, the task tracker notices that the process has exited and marks the
attempt as failed.


Hanging tasks are dealt with differently. The task tracker notices that it hasn’t received a
progress update for a while and proceeds to mark the task as failed. The child JVM process
will be automatically killed after this period. The timeout period after which tasks are
considered failed is normally 10 minutes and can be configured on a per-job basis (or a
cluster basis) by setting the mapred.task.timeout property to a value in milliseconds.

If a Streaming or Pipes process hangs, the task tracker will kill it (along with the JVM that
launched it) only in one of the following circumstances: either mapred.task.tracker.task-
controller is set to org.apache.hadoop.mapred.LinuxTaskController, or the default task
controller is being used (org.apache.hadoop.mapred.DefaultTaskController) and
the setsid command is available on the system (so that the child JVM and any processes it
launches are in the same process group). In any other case, orphaned Streaming or Pipes
processes will accumulate on the system, which will impact utilization over time.

Setting the timeout to a value of zero disables the timeout, so long-running tasks are never
marked as failed. In this case, a hanging task will never free up its slot, and over time there
may be cluster slowdown as a result. This approach should therefore be avoided, and
making sure that a task is reporting progress periodically will suffice.

When the job tracker is notified of a task attempt that has failed (by the task tracker’s
heartbeat call), it will reschedule execution of the task. The job tracker will try to avoid
rescheduling the task on a task tracker where it has previously failed. Furthermore, if a task
fails four times (or more), it will not be retried further. This value is configurable: the
maximum number of attempts to run a task is controlled by
the mapred.map.max.attempts property for map tasks and mapred.reduce.max.attempts for
reduce tasks. By default, if any task fails four times (or whatever the maximum number of
attempts is configured to), the whole job fails.

For some applications, it is undesirable to abort the job if a few tasks fail, as it may be
possible to use the results of the job despite some failures. In this case, the maximum
percentage of tasks that are allowed to fail without triggering job failure can be set for the
job. Map tasks and reduce tasks are controlled independently, using
the mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties.
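
The following is a hedged sketch of how these classic MapReduce 1 properties could be set on a job Configuration; the property names are the ones quoted above, while the values shown are arbitrary examples rather than recommendations:

import org.apache.hadoop.conf.Configuration;

public class FailureTuningExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        conf.setLong("mapred.task.timeout", 10 * 60 * 1000L); // 10 minutes, in milliseconds
        conf.setInt("mapred.map.max.attempts", 4);            // retries per map task
        conf.setInt("mapred.reduce.max.attempts", 4);         // retries per reduce task

        // Tolerate a small fraction of failed tasks without failing the whole job.
        conf.setInt("mapred.max.map.failures.percent", 5);
        conf.setInt("mapred.max.reduce.failures.percent", 5);

        System.out.println("Task timeout: " + conf.get("mapred.task.timeout") + " ms");
    }
}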


A task attempt may also be killed, which is different from it failing. A task attempt may be
killed because it is a speculative duplicate (for more, see “Speculative Execution” on page
213), or because the tasktracker it was running on failed, and the jobtracker marked all the
task attempts running on it as killed. Killed task attempts do not count against the number
of attempts to run the task (as set by mapred.map.max.attempts and
mapred.reduce.max.attempts), since it wasn’t the task’s fault that an attempt was killed.

Users may also kill or fail task attempts using the web UI or the command line (type hadoop
job to see the options). Jobs may also be killed by the same mechanisms.

Task tracker Failure

Failure of a task tracker is another failure mode. If a task tracker fails by crashing, or
running very slowly, it will stop sending heartbeats to the job tracker (or send them very
infrequently). The job tracker will notice a task tracker that has stopped sending heartbeats
(if it hasn’t received one for 10 minutes, configured via the mapred.tasktracker.expiry.interval
property, in milliseconds) and remove it from its pool of task trackers
to schedule tasks on. The job tracker arranges for map tasks that were run and completed
successfully on that task tracker to be rerun if they belong to incomplete jobs, since their
intermediate output residing on the failed task tracker’s local filesystem may not be
accessible to the reduce task. Any tasks in progress are also rescheduled.

A task tracker can also be blacklisted by the job tracker, even if the task tracker has not
failed. If more than four tasks from the same job fail on a particular task tracker (set by
mapred.max.tracker.failures), then the job tracker records this as a fault. A task tracker is
blacklisted if the number of faults is over some minimum threshold (four, set
by mapred.max.tracker.blacklists) and is significantly higher than the average number of
faults for task trackers in the cluster.

Blacklisted task trackers are not assigned tasks, but they continue to communicate with the
job tracker. Faults expire over time (at the rate of one per day), so task trackers get the
chance to run jobs again simply by leaving them running. Alternatively, if there is an
underlying fault that can be fixed (by replacing hardware, for example), the task tracker will
be removed from the job tracker’s blacklist after it restarts and rejoins the cluster.


Job tracker Failure

Failure of the job tracker is the most serious failure mode. Hadoop has no mechanism for
dealing with failure of the job tracker—it is a single point of failure—so in this case the job
fails. However, this failure mode has a low chance of occurring, since the chance of a
particular machine failing is low. The good news is that the situation is improved in YARN,
since one of its design goals is to eliminate single points of failure in MapReduce.

After restarting a job tracker, any jobs that were running at the time it was stopped will need
to be re-submitted. There is a configuration option that attempts to recover any running jobs
(mapred.jobtracker.restart.recover, turned off by default), however it is known not to work
reliably, so should not be used.

Introduction to Hadoop Scheduler

Hadoop MapReduce is a software framework for writing applications that process huge
amounts of data (terabytes to petabytes) in parallel on large Hadoop clusters. This
framework is responsible for scheduling tasks, monitoring them, and re-executing failed
tasks.

In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic
idea behind the introduction of YARN is to split the functionalities of resource management
and job scheduling/monitoring into separate daemons: the Resource Manager, the
Application Master, and the Node Manager.

Resource Manager is the master daemon that arbitrates resources among all the
applications in the system. Node Manager is the slave daemon responsible for containers,
monitoring their resource usage, and reporting the same to Resource Manager or
Schedulers. Application Master negotiates resources from the Resource Manager and
works with Node Manager in order to execute and monitor the task.

The Resource Manager has two main components that are Schedulers and Applications
Manager.


Schedulers in YARN
The Scheduler within the Resource Manager is a pure scheduler: it is responsible for
allocating resources to the various running applications. It is not responsible for monitoring
or tracking the status of an application. Also, the scheduler does not guarantee restarting of
tasks that fail either due to hardware failure or application failure.

The scheduler performs scheduling based on the resource requirements of the


applications.

It has some pluggable policies that are responsible for partitioning the cluster resources
among the various queues, applications, etc.

The FIFO Scheduler, Capacity Scheduler, and Fair Scheduler are such pluggable policies
that are responsible for allocating resources to the applications.

Let us now study each of these Schedulers in detail.

TYPES OF HADOOP SCHEDULER


1. FIFO Scheduler

First In First Out (FIFO) is the default scheduling policy used in Hadoop. The FIFO Scheduler
gives preference to applications submitted earlier over those submitted later. It places the
applications in a queue and executes them in the order of their submission (first in, first
out).

Here, irrespective of size and priority, the requests of the first application in the queue
are satisfied first. Only once the first application's request is satisfied is the next
application in the queue served.

Advantage:
 It is simple to understand and doesn’t need any configuration.

 Jobs are executed in the order of their submission.

Disadvantage:
 It is not suitable for shared clusters. If the large application comes before the shorter

one, then the large application will use all the resources in the cluster, and the
shorter application has to wait for its turn. This leads to starvation.

 It does not take into account the balance of resource allocation between the long

applications and short applications.


2. Capacity Scheduler

The Capacity Scheduler allows multiple-tenants to securely share a large Hadoop cluster. It
is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing
the throughput and the utilization of the cluster.

It supports hierarchical queues to reflect the structure of the organizations or groups that
utilize the cluster resources. A queue hierarchy contains three types of queues: root,
parent, and leaf.

The root queue represents the cluster itself, parent queue represents organization/group or
sub-organization/sub-group, and the leaf accepts application submission.

The Capacity Scheduler allows the sharing of the large cluster while giving capacity
guarantees to each organization by allocating a fraction of cluster resources to each queue.

Also, when a queue that has completed its tasks has free resources, and other queues that
are running below capacity demand them, these resources are assigned to the applications
in the queues running below capacity. This provides elasticity for the organization in a
cost-effective manner.

Apart from this, the Capacity Scheduler provides a comprehensive set of limits to ensure that
a single application, user, or queue cannot use a disproportionate amount of resources in the
cluster.

To ensure fairness and stability, it also provides limits on initialized and pending apps from
a single user and queue.

Advantages:
 It maximizes the utilization of resources and throughput in the Hadoop cluster.

 Provides elasticity for groups or organizations in a cost-effective manner.

 It also gives capacity guarantees and safeguards to the organization utilizing cluster.

Disadvantage:
 It is the most complex of the three schedulers.


3. Fair Scheduler

Fair Scheduler allows YARN applications to fairly share resources in large Hadoop clusters.
With Fair Scheduler, there is no need for reserving a set amount of capacity because it will
dynamically balance resources between all running applications.

It assigns resources to applications in such a way that all applications get, on average, an
equal amount of resources over time.

The Fair Scheduler, by default, takes scheduling fairness decisions only on the basis of
memory. We can configure it to schedule with both memory and CPU.

When a single application is running, that application uses the entire cluster's resources.
When other applications are submitted, freed-up resources are assigned to the new
applications so that every application eventually gets roughly the same amount of resources.
The Fair Scheduler enables short applications to finish in a reasonable time without starving
long-lived applications.

Similar to the Capacity Scheduler, the Fair Scheduler supports hierarchical queues to reflect
the structure of the shared cluster.


Apart from fair scheduling, the Fair Scheduler allows for assigning minimum shares to
queues for ensuring that certain users, production, or group applications always get
sufficient resources. When an app is present in the queue, then the app gets its minimum
share, but when the queue doesn’t need its full guaranteed share, then the excess share is
split between other running applications.

Advantages:
 It provides a reasonable way to share the Hadoop cluster among a number of
users.

 Also, the Fair Scheduler can work with app priorities where the priorities are used as
weights in determining the fraction of the total resources that each application should
get.

Disadvantage:

 It requires configuration.
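
As a hedged illustration of how a cluster chooses between these schedulers: the active scheduler is normally set cluster-wide through the yarn.resourcemanager.scheduler.class property in yarn-site.xml. The sketch below only demonstrates the property name and the fully qualified scheduler class names on a Configuration object; it is not how a production cluster would normally be configured:

import org.apache.hadoop.conf.Configuration;

public class SchedulerConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Normally placed in yarn-site.xml; shown here only to illustrate the property.
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        // Alternatives:
        //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
        //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler

        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}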

Shuffling and Sorting in Hadoop MapReduce

In Hadoop, the process by which the intermediate output from the mappers is transferred to
the reducers is called shuffling. Each reducer gets one or more keys and their associated
lists of values, based on how keys are partitioned across reducers. The intermediate
key-value pairs generated by the mappers are sorted automatically by key. In this section,
we will discuss shuffling and sorting in Hadoop MapReduce in detail.

Here we will learn what sorting and shuffling are in Hadoop, the purpose of the shuffle and
sort phase in MapReduce, and how the MapReduce shuffle and sort work.


Before we start with shuffle and sort in MapReduce, let us revise the other phases of
MapReduce: mapper, reducer, combiner, partitioner, and input format.

Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in
MapReduce. Sort phase in MapReduce covers the merging and sorting of map outputs.
Data from the mapper are grouped by the key, split among reducers and sorted by the key.
Every reducer obtains all values associated with the same key. Shuffle and sort phase in
Hadoop occur simultaneously and are done by the MapReduce framework.

Shuffling in MapReduce

The process of transferring data from the mappers to the reducers is known as shuffling,
i.e., the process by which the system performs the sort and transfers the map output to the
reducer as input. The shuffle phase is necessary for the reducers; without it, they would not
have any input (or would not have input from every mapper). Since shuffling can start even
before the map phase has finished, it saves some time and completes the job in less time.


Sorting in MapReduce

The keys generated by the mappers are automatically sorted by the MapReduce framework,
i.e., before the reducer starts, all intermediate key-value pairs generated by the mappers
get sorted by key and not by value. Values passed to each reducer
are not sorted; they can be in any order.

Sorting in Hadoop helps reducer to easily distinguish when a new reduce task should start.
This saves time for the reducer. Reducer starts a new reduce task when the next key in the
sorted input data is different than the previous. Each reduce task takes key-value pairs as
input and generates key-value pair as output.

Note that shuffling and sorting in Hadoop MapReduce is not performed at all if you specify
zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map
phase, and the map phase does not include any kind of sorting (so even the map phase is
faster).
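
A minimal sketch of such a map-only job follows; the job name and the omitted mapper, format, and path configuration are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only job");
        // ... set mapper, input/output formats and paths as usual ...

        // Zero reducers: mapper output is written directly to HDFS,
        // and no shuffling or sorting is performed at all.
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}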

Secondary Sorting in MapReduce

If we want to sort reducer’s values, then the secondary sorting technique is used as it
enables us to sort the values (in ascending or descending order) passed to each reducer.

Map Reduce Types

Map Reduce Types: The map and reduce functions in Hadoop MapReduce have the
following general form:

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

In general, the map input key and value types (K1 and V1) are different from the map
output types (K2 and V2). However, the reduce input must have the same types as the map
output, although the reduce output types may be different again (K3 and V3).


public void map(LongWritable key, Text value, Context context) {
    // ... parse the input record, extracting year and airTemperature ...
    context.write(new Text(year), new IntWritable(airTemperature));
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    // ... find maxValue, the maximum of the values for this key ...
    context.write(key, new IntWritable(maxValue));
}

// A combiner has the same form as the reducer:
public void combiner(Text key, Iterable<IntWritable> values, Context context) {
    // ... same aggregation as the reducer ...
    context.write(key, new IntWritable(maxValue));
}

If a combine function is used, then it is the same form as the reduce function (and is an
implementation of Reducer), except its output types are the intermediate key and value
types

(K2 and V2), so they can feed the reduce function:

map: (K1, V1) → list(K2, V2)

combine: (K2, list(V2)) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)


Often the combine and reduce functions are the same, in which case K3 is the same as K2,
and V3 is the same as V2.

Input types are set by the input format. So, for instance, a TextInputFormat generates keys
of type LongWritable and values of type Text. The other types are set explicitly by calling
the methods on the Job as follows.

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

So if K2 and K3 are the same, you don’t need to call setMapOutputKeyClass(), since it falls
back to the type set by calling setOutputKeyClass(). Similarly, if V2 and V3 are the same,
you only need to use setOutputValueClass().
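
Putting these rules together, the following sketch (the concrete types chosen here are illustrative) shows the type-related calls for a job whose map output value type V2 differs from its final output value type V3:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TypeConfigExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "type configuration example");

        // Intermediate (map output) types, K2 and V2.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final (reduce output) types, K3 and V3 -- here V3 differs from V2,
        // which is why setMapOutputValueClass() is needed above.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // Input types (K1, V1) are set implicitly by the input format.
    }
}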

Input and Output Formats in MapReduce

Apache Hadoop is a popular open-source framework for processing and analyzing large
datasets in parallel across a distributed cluster. One of the key components of Hadoop is
MapReduce, which allows us to write distributed data processing tasks in a simple and
scalable manner. In the context of MapReduce, input and output formats play a crucial role
in defining how data is read into and written from the system.

Input Formats
In MapReduce, an input format determines how the framework reads data from its input
source and presents it to the mapper function for processing. Hadoop provides various
built-in input formats to handle different types of data sources:

1. Text Input Format: This is the default input format, where each input record
represents a line of text. The key-value pair supplied to the mapper function has
the byte offset of the record within the input file as the key and the line
content as the value.


2. Key-Value Input Format: This input format is suitable for data represented as key-
value pairs, where each line represents a separate record. The framework splits
each line into a key-value pair based on specified delimiters or patterns.

3. Sequence File Input Format: This format allows the processing of binary key-value
pairs stored in a compact file format. It enables faster data reading and supports
compression for efficient storage.

4. Custom Input Formats: Hadoop also allows developers to create custom input
formats by extending the FileInputFormat class and implementing the necessary
logic for data reading and parsing. This flexibility enables processing of specialized
data formats or proprietary datasets.
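
The sketch below shows how a job selects one of these built-in input formats; the choice of KeyValueTextInputFormat here is only an example, and the rest of the job configuration is assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input format example");

        // Default is TextInputFormat (byte offset -> line); here we choose the
        // key-value variant, which splits each line into a key and a value.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper, reducer, and output configuration omitted ...
    }
}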

Output Formats
Similar to input formats, output formats dictate how the results of MapReduce tasks are
written to the desired output destination. Some commonly used output formats are:

1. Text Output Format: This is the default output format, where each record is written
as a line of text. The key and value pairs produced by the reducer function are
transformed into a string representation and saved in the output file.

2. Sequence File Output Format: This format allows the storage of key-value pairs in
a binary file. It provides a compact and efficient way to store large volumes of
structured data.

3. Multiple Output Format: Hadoop supports writing output to multiple files or


directories based on some condition specified within the MapReduce job. This
format is useful when we need to partition the output based on certain criteria or
distribute it across different locations.

4. Custom Output Formats: Developers can create custom output formats by


extending the FileOutputFormat class and implementing the necessary logic for
writing data to the desired output source. This allows adapting the output format to
specific requirements or integrating with external systems.
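
Similarly, the following sketch selects a non-default output format and registers an extra named output through MultipleOutputs; the named output "summary" is purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output format example");

        // Store the main results as binary key-value pairs rather than plain text.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        // Optionally register a second, text-based named output; the reducer can
        // then write to it through a MultipleOutputs instance.
        MultipleOutputs.addNamedOutput(job, "summary",
                TextOutputFormat.class, Text.class, IntWritable.class);
    }
}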


In summary, input and output formats in MapReduce define how data is read into and
written from the Hadoop ecosystem. The flexibility of Hadoop allows developers to leverage
various pre-defined formats or implement custom formats to handle different types of data
sources and output destinations. Understanding these formats is essential for effective data
processing and analysis using Hadoop's MapReduce framework.
