Big Data BCA Unit 4
HADOOP MAPREDUCE
MapReduce is a framework with which we can write applications to process huge amounts of
data, in parallel, on large clusters of commodity hardware in a reliable manner. It divides a task
into small parts and assigns them to many computers. Later, the results are collected at one place
and integrated to form the result dataset.
Fig. 1. MapReduce framework: a centralized user system distributing work across commodity hardware nodes.
Example:
Suppose we have a text file and we want to find how many times each word appears in the file.
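For the three-line input file shown under the Record Reader section below (Dog Cat Mouse / Dog Dog Cat / Dog Cat Duck), the final result of such a word-count job would be Dog 4, Cat 3, Mouse 1, Duck 1.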
Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and
partitioner. The output of the map phase, called intermediate keys and values, is sent to the reducers.
Each reduce task is broken into the following phases: shuffle, sort, reducer, and output format. The
Hadoop framework assigns map tasks to the DataNodes where the data to be processed actually
resides. Because of this, the data typically does not have to move over the network, which saves
network bandwidth; the computation happens on the local machine itself, which is why the map
task is said to be data local.
Record Reader:
The record reader translates an input split generated by the input format into records. Its purpose
is to parse the data into records, but it does not parse the records themselves. It passes the data
to the mapper in the form of key/value pairs. Usually the key in this context is positional information
and the value is the chunk of data that makes up a record. For the example above, the input file
would be split like:
1, Dog Cat Mouse    (where 1 is the 'key' and 'Dog Cat Mouse' is the 'value')
2, Dog Dog Cat
3, Dog Cat Duck
Map:
The map function is the heart of the mapper task; it is executed on each key/value pair from the
record reader to produce zero or more key/value pairs, called intermediate pairs. What constitutes
the key and value depends on what the MapReduce job is accomplishing. The data is grouped on
the key, and the value is the information pertinent to the analysis in the reducer. In the example
above, three mappers are used to count the occurrences of each word.
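As a concrete illustration, here is a minimal Java sketch of a word-count mapper written against the standard org.apache.hadoop.mapreduce API; the class name and the whitespace tokenization are illustrative choices, not fixed by these notes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = position of the record in the file, value = one line of text
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit the intermediate pair (word, 1)
            }
        }
    }
}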
Combiner:
It is an optional component, but it is highly useful and can provide a significant performance gain
for a MapReduce job without any downside. A combiner is not applicable to every MapReduce
algorithm, but wherever it can be applied it is recommended. It takes the
intermediate keys from the mapper and applies a user-provided method to aggregate values within
the small scope of that one mapper. For example, sending (hadoop, 3) requires fewer bytes than
sending (hadoop, 1) three times over the network.
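In the word-count job, the reduce operation (summing the counts) is associative and commutative, so the reducer class sketched later in this unit can also serve as the combiner; the driver sketch at the end of the Job Submission section registers it with job.setCombinerClass, which is a choice of that sketch rather than something fixed by these notes.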
Partitioner:
The partitioner takes the intermediate key/value pairs from the mapper and splits them into shards,
one shard per reducer. This distributes the keyspace roughly evenly over the reducers, while still
ensuring that pairs with the same key produced by different mappers end up at the same reducer.
The partitioned data is written to the local filesystem of each map task and waits to be pulled by its
respective reducer.
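The behaviour described above is essentially what Hadoop's default hash partitioner does; the following Java sketch (with an illustrative class name) shows the idea for the word-count pairs.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // The same key always hashes to the same shard, so every mapper's
        // ("Dog", 1) pairs end up at the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}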
Reducer
Shuffle and Sort:
The reduce task starts with the shuffle and sort step. This step takes the output files written by all
of the partitioners and downloads them to the local machine on which the reducer is running.
These individual pieces of data are then sorted by key into one larger list. The purpose of this
sort is to group equivalent keys together so that their values can be iterated over easily in the
reduce task.
Reduce:
The reducer takes the grouped data as input and runs a reduce function once per key grouping.
The function is passed the key and an iterator over all the values associated with that key. A wide
range of processing can happen in this function: the data can be aggregated, filtered, and
combined in a number of ways. Once it is done, it sends zero or more key/value pairs to the final
step, the output format.
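Continuing the word-count illustration, a minimal Java reducer sketch (class name illustrative) that sums the values grouped under each key might look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {   // iterate over all values grouped under this key
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // e.g. ("Dog", 4)
    }
}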
Output Format:
The output format takes the final key/value pairs from the reduce function and writes them out to
a file via a record writer. By default, it separates the key and value with a tab and separates
records with a newline character.
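For the word-count example, the default behaviour would therefore produce output lines such as Cat<TAB>3 and Dog<TAB>4, one record per line.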
The below tasks occur when the user submits a MapReduce job to Hadoop -
The local Job Client prepares the job for submission and hands it off to the Job Tracker.
The Job Tracker schedules the job and distributes the map work among the Task Trackers
for parallel processing.
The Job Tracker receives progress information from the Task Trackers.
Once the mapping phase results are available, the Job Tracker distributes the reduce work
among the Task Trackers for parallel processing.
The Job Tracker receives progress information from the Task Trackers.
The following are the main phases in map reduce job execution flow -
1) Job Submission
2) Job Initialization
3) Task Assignment
4) Task Execution
5) Job/Task Progress
6) Job Completion
1) Job Submission: -
The waitForCompletion method samples the job's progress once a second after the job is submitted (a driver sketch using this method follows this list).
Checks the output specification: an error is thrown if the output directory was not specified or already exists.
Computes the input splits and throws an error if this fails because the input paths don't exist.
Copies the job resources to the Job Tracker's filesystem, in a directory named after the job ID.
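Tying the pieces together, here is a minimal Java driver sketch for the word-count job; it assumes the mapper, reducer, and combiner classes sketched earlier, and the input/output paths taken from the command line are illustrative. The waitForCompletion call at the end is the method referred to above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional combiner (see the Combiner section)
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // error if the input path doesn't exist
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist
        // waitForCompletion submits the job and polls its progress until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}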
2) Job Initialization: -
A job initialization (setup) task and a job cleanup task are created; these are run by the task trackers.
The job cleanup task deletes the temporary working directory after the job is completed.
3) Task Assignment:
The heartbeat is a communication channel between a Task Tracker and the Job Tracker; it also indicates whether the Task Tracker is ready to run a new task.
To assign a task, the Job Tracker first selects a job, based on its job scheduling algorithm.
The default scheduler fills empty map task slots before reduce task slots.
The number of slots a Task Tracker has depends on its number of cores.
4) Task Execution: -
The Task Tracker copies the job JAR file from the shared filesystem (HDFS).
The Task Tracker creates a local working directory and un-jars the JAR file into the local
filesystem.
The Task Tracker starts a TaskRunner in a new JVM to run the map or reduce task.
Each task can perform setup and cleanup actions based on its OutputCommitter.
In the case of Streaming or Pipes, where the map or reduce task communicates with an external
user process, input is provided to that process via stdin (or a socket for Pipes) and its output is
collected back via stdout (or the socket).
5) Job/Task Progress: -
When a task reports progress, it sets a flag to indicate that its status has changed and should be
sent to the Task Tracker.
A separate thread checks this flag every 3 seconds and, if it is set, notifies the Task Tracker of the
current task status.
The Task Tracker sends its progress to the Job Tracker over the heartbeat every five seconds.
The Job Tracker consolidates the task progress reports from all Task Trackers and keeps a holistic
view of the job.
The Job client receives the latest status by polling the Job Tracker every second.
6) Job Completion: -
Once the job is completed, the clean-up proceeds as below -
Once its tasks are completed, the Task Tracker sends the job completion status to the Job
Tracker.
The Job Tracker cleans up its working state for the job, instructs the Task Trackers to do the
same, and also cleans up all the temporary directories.
HADOOP DAEMONS
1. NameNode: The NameNode holds the metadata (information about the location and size of
files/blocks) for HDFS. The metadata is held in RAM and also persisted to hard disk. There is
always only one NameNode in a cluster, which makes the NameNode a single point of failure in Hadoop.
2. Secondary NameNode: It periodically checkpoints the NameNode's metadata and therefore holds
much the same information as the NameNode. If the NameNode fails, its metadata can be recovered
from this copy (it is a checkpointing helper rather than an automatic hot standby).
3. DataNode: While the NameNode stores metadata, the actual data is stored on the DataNodes. The
number of DataNodes depends on your data size, and you can add more as needed. Each DataNode
communicates with the NameNode on a frequent basis (every 3 seconds).
4. Job Tracker: The NameNode and DataNodes store the details and the actual data in HDFS. This data
also needs to be processed as per the client's requirements. A developer writes code to process the
data, and the processing is done using MapReduce. The MapReduce engine sends the code
across the DataNodes, creating jobs. These jobs need to be monitored continuously, and the Job Tracker
manages them.
5. Task Tracker: The jobs assigned by the Job Tracker are actually performed by Task Trackers. Each
DataNode has one Task Tracker. Task Trackers communicate with the Job Tracker to send the
statuses of the jobs.
1. Standalone Mode
Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging, and
HDFS is not really used in it.
No custom configuration is needed in the files mapred-site.xml, core-site.xml, and hdfs-site.xml.
Standalone mode is usually the fastest Hadoop mode, as it uses the local file system for all the
input and output. Here is the summarized view of standalone mode -
• Default mode of Hadoop
• Mainly used for debugging; HDFS is not used
• No custom configuration required in mapred-site.xml, core-site.xml, hdfs-site.xml
• Usually the fastest mode, since it uses the local file system for all input and output
2. Pseudo-distributed Mode
Pseudo-distributed mode is also known as a single-node cluster, where both the NameNode and the
DataNode reside on the same machine.
In pseudo-distributed mode, all the Hadoop daemons run on a single node. Such a
configuration is mainly used for testing, when we don't need to think about resources and
other users sharing them.
In this architecture, a separate JVM is spawned for every Hadoop component, and the components
communicate with each other across network sockets, effectively producing a fully functioning and
optimized mini-cluster on a single host.
• A single-node Hadoop deployment is considered pseudo-distributed mode
• All the master and slave daemons run on the same node
• Mainly used for testing purposes
• The replication factor for blocks is ONE
• Changes in configuration files will be required for all three files: mapred-site.xml, core-site.xml, hdfs-site.xml
3. Fully-Distributed Mode
This is the production mode of Hadoop, where multiple nodes are running. Here data is
distributed across several nodes and processing is done on each node.
Master and slave services run on separate nodes in fully-distributed Hadoop mode.