
BCA 6TH SEM UNIT-4 BIG DATA

HADOOP MAPREDUCE

MapReduce is a framework with which we can write applications that process huge amounts of
data, in parallel, on large clusters of commodity hardware in a reliable manner. It divides a task
into small parts and assigns them to many computers. Later, the results are collected at one place
and integrated to form the result dataset.

Fig.1. MapReduce Framework: a centralized user system connected to several commodity hardware nodes.

 Each commodity hardware node has its own JVM.

 The .jar file is sent to every node by the centralized system.

How does MapReduce work?


MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 Map takes a set of data and converts it into another set of data, where individual elements
are broken down into tuples (key/value pairs).
 Reduce takes the output from a map as its input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is
always performed after the map job.

Fig.2. Working of MapReduce



Example:
Suppose we have a text file and we want to find how many times a particular word appears in the
file.

Fig.3. The overall MapReduce word count process

Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and
partitioner. The output of the map phase, called the intermediate keys and values, is sent to the
reducers. The reduce tasks are broken into the following phases: shuffle, sort, reducer and output
format. The map tasks are assigned by the Hadoop framework to those DataNodes where the actual data
to be processed resides. This way the data typically doesn't have to move over the network, which
saves network bandwidth, and it is computed on the local machine itself; such a map task is said to
be data local.

Record Reader:
The record reader translates an input split generated by the input format into records. Its purpose
is to parse the data into records, not to parse the records themselves. It passes the data to the
mapper in the form of key/value pairs. Usually the key in this context is positional information
and the value is the chunk of data that composes a record. In the above figure the input file would
be split like:

1, Dog Cat Mouse          (where 1 is the 'key' and "Dog Cat Mouse" is the 'value')
2, Dog Dog Cat
3, Dog Cat Duck

Map:
The map function is the heart of the mapper task; it is executed on each key/value pair from the
record reader to produce zero or more key/value pairs, called intermediate pairs. The decision of
what the key/value pair is depends on what the MapReduce job is accomplishing: the data is grouped
on the key, and the value is the information pertinent to the analysis in the reducer. In the above
figure three mappers are used for counting the occurrences of each word.
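
As an illustration, a minimal word-count mapper written against the org.apache.hadoop.mapreduce API
might look like the sketch below. The class name WordCountMapper is made up for this example; the
key supplied by the record reader is the position of the line and the value is the line itself,
matching the record reader output shown above.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input key   : position of the line, supplied by the record reader
    // Input value : the line of text itself
    // Output      : one (word, 1) intermediate pair per word in the line
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit an intermediate pair
            }
        }
    }

For the line "Dog Cat Mouse" this mapper emits (Dog, 1), (Cat, 1) and (Mouse, 1).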

Combiner:
The combiner is an optional component, but it is highly useful and can provide an extreme
performance gain for a MapReduce job without any downside. The combiner is not applicable to all
MapReduce algorithms, but wherever it can be applied it is recommended. It takes the intermediate
keys from the mapper and applies a user-provided method to aggregate values in the small scope of
that one mapper, e.g. sending (hadoop, 3) requires fewer bytes than sending (hadoop, 1) three times
over the network.
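
For word count, the reduce logic itself can serve as the combiner, because summing partial counts
and summing final counts is the same operation. A hedged sketch of wiring it in, assuming the
illustrative WordCountReducer class shown later in the Reduce section:

    // Inside the driver, after the Job instance has been created:
    job.setCombinerClass(WordCountReducer.class);   // aggregate (word, 1) pairs locally on each mapper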

Partitioner:
The partitioner takes the intermediate key/value pairs from the mapper and splits them into shards,
one shard per reducer. This distributes the keyspace roughly evenly over the reducers, while still
ensuring that identical keys produced by different mappers end up at the same reducer. The
partitioned data is written to the local filesystem for each map task and waits to be pulled by its
respective reducer.
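
Hadoop's default HashPartitioner already behaves this way; the sketch below is an illustrative
equivalent (the class name WordPartitioner is made up for this example) showing how identical keys
always land in the same shard and therefore reach the same reducer.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Identical keys produce identical hash codes, so every (word, count)
    // pair for a given word is routed to the same reducer.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }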

Reducer
Shuffle and Sort:
The reduce task starts with the shuffle and sort step. This step takes the output files written by
all of the partitioners and downloads them to the local machine on which the reducer is running.
These individual data pieces are then sorted by key into one larger data list. The purpose of this
sort is to group equivalent keys together so that their values can be iterated over easily in the
reduce task.

Reduce:
The reducer takes the grouped data as input and runs a reduce function once per key grouping.
The function is passed the key and an iterator over all the values associated with that key. A wide
range of processing can happen in this function: the data can be aggregated, filtered, and combined
in a number of ways. Once it is done, it sends zero or more key/value pairs to the final step, the
output format.
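
A matching word-count reducer, again as an illustrative sketch against the
org.apache.hadoop.mapreduce API, receives each word together with an iterator over all of its
counts and sums them:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per key grouping, e.g. ("Dog", [1, 1, 1, 1]) -> emits ("Dog", 4).
    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));   // final (word, total) pair
        }
    }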

Output Format:
The output format translates the final key/value pairs from the reduce function and writes them out
to a file via a record writer. By default, it separates the key and value with a tab and separates
records with a newline character.
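
If a different separator is needed, it can usually be overridden through configuration. The
property name used below is an assumption based on the common TextOutputFormat setting and should
be checked against the Hadoop version in use:

    // Hypothetical driver snippet: make TextOutputFormat write "key,value" instead of "key<TAB>value".
    job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");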

MapReduce Jobs: Execution


MapReduce is a programming model designed to process large amounts of data in parallel by
dividing the job into several independent local tasks.

The following steps occur when the user submits a MapReduce job to Hadoop:

 The local Job Client prepares the job for submission and hands it off to the Job Tracker.

 The Job Tracker schedules the job and distributes the map work among the Task Trackers
for parallel processing.

 Each Task Tracker issues a Map Task.

 The Job Tracker receives progress information from the Task Trackers.

 Once the map phase results are available, the Job Tracker distributes the reduce work
among the Task Trackers for parallel processing.

 Each Task Tracker issues a Reduce Task to perform the work.

 The Job Tracker receives progress information from the Task Trackers.

 Once the reduce tasks are completed, a cleanup task is performed.

The following are the main phases in the MapReduce job execution flow:

1) Job Submission

2) Job Initialization

3) Task Assignment

4) Task Execution

5) Job/Task Progress

6) Job Completion

1) Job Submission: -

The job's submit method creates an internal instance of JobSubmitter and calls the
submitJobInternal method on it.

The waitForCompletion method samples the job's progress once a second after the job is submitted.

Internally, the submission performs the following steps (a driver sketch illustrating
submit/waitForCompletion is given after the list):

 It goes to the Job Tracker and gets a job ID for the job.

 It checks whether the output directory has been specified.

 If specified, it checks whether the directory already exists and throws an error if there is
any issue with the directory.

 It computes the input splits and throws an error if this fails because the input paths don't exist.

 It copies the job resources to the Job Tracker's filesystem in a directory named after the job ID.

 Finally, it calls the submitJob method on the JobTracker.
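
A hedged sketch of a driver that performs this submission using the org.apache.hadoop.mapreduce API
is shown below. The WordCountDriver, WordCountMapper and WordCountReducer names are the
illustrative classes used earlier in this unit, not classes that ship with Hadoop.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);    // the .jar that is shipped to every node
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(1);                    // controls the number of reduce tasks

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist

            // Submits the job and samples its progress once a second until it finishes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }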

2) Job Initialization: -

The Job Tracker performs the following steps during job initialization:

 Creates an object to track the tasks and their progress.

 Creates a map task for each input split.

 The number of reduce tasks is defined by the configuration property mapred.reduce.tasks,
which is set by the setNumReduceTasks method.

 Tasks are assigned task IDs.

 A job setup task and a job cleanup task are created; these are run by the Task Trackers.

 The job cleanup task deletes the temporary working directory after the job is completed.

3) Task Assignment:

Each Task Tracker sends a heartbeat to the Job Tracker every five seconds.

The heartbeat is a communication channel and indicates whether the Task Tracker is ready to run a
new task.

Information about the available task slots is also sent as part of the heartbeat.

Task allocation takes place as follows:

 The Job Tracker first selects a job according to its job scheduling algorithm, and then
selects a task from that job.

 The default scheduler fills empty map task slots before reduce task slots.

 The number of slots a Task Tracker has depends on the number of cores.

4) Task Execution: -

The steps below describe how a task is executed:

 The Task Tracker copies the job's jar file from the shared filesystem (HDFS).

 The Task Tracker creates a local working directory and un-jars the jar file into the local
filesystem.

 The Task Tracker creates an instance of TaskRunner.

 The Task Tracker starts the TaskRunner in a new JVM to run the map or reduce task.

 The child process communicates its progress to the parent process.

 Each task can perform setup and cleanup actions based on the OutputCommitter.

 In the case of Streaming and Pipes, the map or reduce task runs as an external process; input
is provided to it via stdin (or a socket) and output is read back via stdout.

5) Job/Task Progress: -

The steps below describe how the progress of a job/task is monitored:

 The Job Client keeps polling the Job Tracker for progress.

 Each child process reports its progress to its parent Task Tracker.

 When a task makes progress, it sets a flag to indicate the status change that should be sent
to the Task Tracker.

 The flag is checked by a separate thread every 3 seconds; if it is set, the thread notifies
the Task Tracker of the current task status.

 The Task Tracker sends its progress to the Job Tracker over the heartbeat every five seconds.

 The Job Tracker consolidates the task progress from all Task Trackers and keeps a holistic
view of the job.

 The Job object receives the latest status by polling the Job Tracker every second.

6) Job Completion: -

Once the job is completed, cleanup is processed as follows:

 Once its tasks are completed, the Task Tracker sends the job completion status to the Job
Tracker.

 The Job Tracker then sends the job completion message to the client.

 The Job Tracker cleans up its working state for the job, instructs the Task Trackers to do
the same, and cleans up all the temporary directories.

 This process causes the Job Client's waitForJobToComplete method to return.

HADOOP DAEMONS

NOTE*: We have already covered this in previous lectures.

There are basically five daemons:



1. NameNode: The NameNode holds the metadata (information about the location and size of
files/blocks) for HDFS. The metadata can be kept in RAM or on hard disk. There is always only one
NameNode in a cluster, and it is a single point of failure in Hadoop.

2. Secondary NameNode: It is used as a backup for the NameNode; by periodically checkpointing the
NameNode's metadata it holds pretty much the same information as the NameNode. If the NameNode
fails, this checkpoint comes into the picture for recovery.

3. DataNode: While the NameNode stores metadata, the actual data is stored on the DataNodes. The
number of DataNodes depends on your data size, and you can add more as needed. Each DataNode
communicates with the NameNode on a frequent basis (every 3 seconds).

4. Job Tracker: The NameNode and DataNodes store details and actual data in HDFS. This data also
has to be processed as per the client's requirements. A developer writes code to process the data,
and that processing can be done using MapReduce. The MapReduce engine sends the code across the
DataNodes, creating jobs. These jobs have to be continuously monitored; the Job Tracker manages
them.

5. Task Tracker: The jobs handed out by the Job Tracker are actually performed by Task Trackers.
Each DataNode has one Task Tracker. Task Trackers communicate with the Job Tracker to send the
statuses of the jobs.

INVESTIGATING HADOOP DISTRIBUTED FILE SYSTEM

NOTE*: The same topics have been covered in Hadoop Architecture.

DIFFERENT MODES OF HADOOP

Hadoop can be run in three different modes:

1. Local Mode or Standalone Mode

Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging, where we
don't really use HDFS.

We also don't need to do any custom configuration in the files mapred-site.xml, core-site.xml and
hdfs-site.xml.

Standalone mode is usually the fastest Hadoop mode, as it uses the local filesystem for all input
and output. Here is the summarized view of standalone mode:

• Used for debugging purposes
• HDFS is not used
• Uses the local filesystem for input and output
• No need to change any configuration files
• The default Hadoop mode

2. Pseudo-distributed Mode

Pseudo-distributed mode is also known as a single-node cluster, where both the NameNode and the
DataNode reside on the same machine.

In pseudo-distributed mode, all the Hadoop daemons run on a single node. Such a configuration is
mainly used for testing, when we don't need to think about resources or other users sharing them.

In this architecture, a separate JVM is spawned for every Hadoop component, and the components
communicate across network sockets, effectively producing a fully functioning mini-cluster on a
single host.

Here is the summarized view of pseudo-distributed mode:

• A single-node Hadoop deployment is considered pseudo-distributed mode
• All the master and slave daemons run on the same node
• Mainly used for testing purposes
• The replication factor for blocks is ONE
• Changes are required in all three configuration files: mapred-site.xml, core-site.xml and
hdfs-site.xml (a sample is sketched below)
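
As an illustration of the kind of change involved (a common minimal setup assumed for this sketch,
not configuration taken from these notes), pseudo-distributed mode typically points the default
filesystem at a local NameNode in core-site.xml and sets the block replication factor to one in
hdfs-site.xml:

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>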

3. Fully-Distributed Mode (Multi-Node Cluster)

This is the production mode of Hadoop, where multiple nodes are running. Here data is distributed
across several nodes and processing is done on each node.

Master and slave services run on separate nodes in fully-distributed Hadoop mode.

• Production phase of Hadoop
• Separate nodes for master and slave daemons
• Data is used and distributed across multiple nodes
