Hadoop MapReduce Tutorial

The document explains the workings of Hadoop MapReduce, detailing its components such as InputFormat, InputSplits, Mapper, Combiner, Partitioner, and Reducer, along with their roles in processing data. It also discusses the types of joins in MapReduce, specifically Map-Side Join and Reduce-Side Join, highlighting their advantages and disadvantages. The document concludes with an example of a Reduce-Side Join operation to analyze customer transaction data.

Word Count MapReduce

1
Hadoop MapReduce

2
Hadoop MapReduce
How does Hadoop MapReduce work?

3
Hadoop MapReduce
Input Files
● The data for a MapReduce task is stored in input files, which typically live
in HDFS. The format of these files is arbitrary: line-based log files and binary
formats can both be used.

4
Hadoop MapReduce
InputFormat
● The InputFormat defines how these input files are split and read. It selects the files
or other objects that are used for input, and it creates the InputSplits.

5
Hadoop MapReduce
InputSplits
● Created by the InputFormat, an InputSplit logically represents the data that will be processed by an
individual Mapper (the mapper is described below). One map task is created for each split,
so the number of map tasks equals the number of InputSplits. Each split is divided
into records, and each record is processed by the mapper.

6
Hadoop MapReduce
RecordReader
● The RecordReader communicates with the InputSplit in Hadoop MapReduce and converts the data into key-
value pairs suitable for reading by the mapper. By default, Hadoop uses TextInputFormat, whose
RecordReader converts data into key-value pairs and keeps reading from the InputSplit until the
file has been fully consumed. It assigns the byte offset (a unique number) of each line present in the
file as the key, with the line itself as the value. These key-value pairs are then sent to the mapper for further processing.
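
With TextInputFormat, the byte-offset key is visible directly in the mapper's type signature. A minimal sketch using Hadoop's org.apache.hadoop.mapreduce API (the class name is illustrative):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the RecordReader delivers each line as:
//   key   = byte offset of the line within the file (LongWritable)
//   value = the text of the line itself (Text)
public class OffsetEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws java.io.IOException, InterruptedException {
        context.write(offset, line); // pass the (offset, line) pair straight through
    }
}
```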

7
Hadoop MapReduce
Mapper
● The Mapper processes each input record (from the RecordReader) and generates new key-value pairs, and these pairs
can be completely different from the input pair. The output of the Mapper is also known as the
intermediate output, and it is written to the local disk. It is not stored on HDFS, because it is
temporary data and writing it to HDFS would create unnecessary copies (HDFS is also a high-latency system).
The mapper's output is passed to the combiner for further processing.
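
As a concrete example, here is a sketch of the word-count mapper behind slide 1. It follows the standard Hadoop pattern; the class name is illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: for each input line it emits one (word, 1) pair per token.
// The intermediate (word, 1) pairs are completely different from the (offset, line) input pair.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(line.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}
```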

8
Hadoop MapReduce
Combiner
● The combiner is also known as a ‘mini-reducer’. The Hadoop MapReduce Combiner performs local
aggregation on the mappers’ output, which helps to minimize the data transfer between
mapper and reducer (the reducer is described below). Once the combiner has run,
its output is passed to the partitioner for further work.
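
A combiner is written exactly like a reducer. A sketch for the word-count job (the class name is illustrative; using a combiner is safe here only because addition is associative and commutative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The combiner runs on the mapper's node and pre-sums counts locally,
// so far fewer (word, 1) pairs have to cross the network to the reducers.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        result.set(sum);
        context.write(key, result); // partial count for this key on this node
    }
}
```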

9
Hadoop MapReduce
Partitioner
● In Hadoop MapReduce, the Partitioner comes into the picture when we are working with more than one
reducer (with a single reducer, no partitioner is used).
● The Partitioner takes the output from the combiners and partitions it. Partitioning
is based on the key, and each partition is then sorted. A hash function applied to the key (or a subset of the
key) derives the partition.
● Each combiner output record is partitioned according to its key value: records
having the same key go into the same partition, and each partition is then sent to a
reducer. Partitioning allows even distribution of the map output over the reducers.
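
Hadoop's default partitioner is HashPartitioner. The sketch below reproduces its behavior (the class name is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent to Hadoop's default HashPartitioner: the key's hash (masked to stay
// non-negative) modulo the number of reducers picks the partition, so every record
// with the same key lands on the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```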

10
Hadoop MapReduce
Shuffling and Sorting
● Now, the partitioned output is shuffled to the reduce node (a normal slave node on which the reduce phase
will run, hence called the reducer node). Shuffling is the physical movement of the data,
and it is done over the network. Once all the mappers have finished and their output has been shuffled
to the reducer nodes, this intermediate output is merged and sorted, and then
provided as input to the reduce phase.

11
Hadoop MapReduce
Reducer
● The Reducer takes the set of intermediate key-value pairs produced by the mappers as its input and
runs a reducer function on each group of values that share a key to generate the output. The output of the reducer is the
final output, which is stored in HDFS.
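
Continuing the word-count sketch, the reducer sums the counts for each word (the class name is illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: after shuffle and sort it receives (word, [1, 1, ...])
// and emits (word, total). Its output is the final result, written to HDFS.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        total.set(sum);
        context.write(word, total); // final (word, count) pair
    }
}
```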

12
Hadoop MapReduce
RecordWriter
● The RecordWriter writes the output key-value pairs from the Reducer phase to the output files.

13
Hadoop MapReduce
OutputFormat
● The way these output key-value pairs are written to output files by the RecordWriter is determined by the
OutputFormat. OutputFormat instances provided by Hadoop are used to write files to HDFS or to the local
disk, so the final output of the reducer is written to HDFS by an OutputFormat instance.
● This is the manner in which a Hadoop MapReduce job works over the cluster.

14
Hadoop MapReduce

15
Hadoop MapReduce

The entire MapReduce program can be fundamentally divided into three parts:

● Mapper Phase Code
● Reducer Phase Code
● Driver Code
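
A minimal driver sketch tying the three parts together, reusing the illustrative mapper, combiner, and reducer classes from the earlier slides:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires together the Mapper, Combiner, and Reducer classes sketched
// above and submits the job to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumCombiner.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```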

16
What is Join in MapReduce

● Advantage – an optimized solution in terms of the
processing speed of the data.
● Disadvantage – time-consuming for the programmer and
not an easy mode of development, since a join takes
hundreds of lines of code and higher-level
frameworks like Hive/Pig already provide it.

17
Type of Join in MapReduce
Map-Side Join
Reads the data streams into the mappers and uses logic within the
mapper function to perform the join.
● Where to use: – when you have one large dataset that you need
to join with a small dataset, and when you want to optimize performance.
● Why: – the smaller table is loaded into memory, and the join
operation happens during the mapper's pass over the large dataset.
● Advantage: better performance.
● Disadvantage: not flexible, i.e. it cannot be used if both datasets
are large in size.
● Note: a Reduce-Side Join can also be used here, but performance
will decrease.
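
A sketch of a map-side join, assuming the small customer file has been shipped to every node via the distributed cache (job.addCacheFile(...)) under the hypothetical name customers.txt with comma-separated custID,custName lines, while the large transaction file streams through map() as custID,amount lines:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the small dataset is loaded fully into memory in setup();
// the large dataset streams through map() and is joined with no reduce phase.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> custById = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "customers.txt" is an assumed cache-file name, symlinked into the task dir.
        try (BufferedReader r = new BufferedReader(new FileReader("customers.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split(",");      // custID,custName
                custById.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");   // custID,amount
        String name = custById.get(parts[0]);
        if (name != null) {                            // inner join: emit matches only
            context.write(new Text(name), new Text(parts[1]));
        }
    }
}
```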

18
Type of Join in MapReduce
Reduce-Side Join
Processes the multiple data streams through multiple map stages and
performs the join at the Reducer stage.
● Where to use: – when both datasets are large.
● Why: – neither dataset can be loaded into memory
completely; both tables have to be processed separately and
joined on the reducer side.
● Advantage: flexible, and can be applied anywhere.
● Disadvantage: poor performance in comparison with Map-Side
Joins.
● Note: – a Map-Side Join can’t be used here.

19
Reduce side Join

• Suppose that I have two separate datasets of a sports complex:

• cust_details: It contains the details of the customer.

• transaction_details: It contains the transaction record of the customer.

• Using these two datasets, I want to know the lifetime value of each customer. To do so, I will
need the following:

• The person’s name along with the frequency of that person's visits.

• The total amount spent by him/her on purchasing equipment.


20
Reduce side Join

21
Reduce side Join
Map phase: Mapper for customer
• Read the input taking one tuple at a time.
• Tokenize each word in that tuple and fetch the cust ID along with the name of the person.
• The cust ID will be the key of the key-value pair that the mapper will generate eventually.
• Add a tag “cust” to indicate that this input tuple is of cust_details type.
• Therefore, the mapper for cust_details will produce intermediate key-value pairs of the following form:
Key – Value pair: [cust ID, cust name]

• Example: [4000001, cust Kristina], [4000002, cust Paige], etc.
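
A sketch of this mapper, assuming comma-separated cust_details records with the cust ID in column 0 and the name in column 1 (the layout and class name are assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for cust_details: emits (cust ID, "cust <name>"). The "cust" tag lets
// the reducer tell customer records apart from transaction records.
public class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");   // custID,custName,...
        context.write(new Text(parts[0]), new Text("cust " + parts[1]));
    }
}
```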

22
Reduce side Join
Map phase: Mapper for transaction
• Fetch the amount value instead of the name of the person.
• In this case, “tnxn” is used as the tag.
• Here too, the cust ID will be the key of the key-value pair that the mapper will generate eventually.
• Finally, the output of mapper for transaction_details will be of the following format:
Key, Value Pair: [cust ID, tnxn amount]

• Example: [4000001, tnxn 40.33], [4000002, tnxn 198.44], etc.
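
A matching sketch for the transaction mapper; the column positions (cust ID in column 2, amount in column 3) and the class name are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for transaction_details: emits (cust ID, "tnxn <amount>") so the
// reducer can recognize these values as transaction amounts.
public class TransactionMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");   // ...,custID,amount
        context.write(new Text(parts[2]), new Text("tnxn " + parts[3]));
    }
}
```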

23
Reduce side Join
Sorting and Shuffling Phase
• The sorting and shuffling phase will generate an array list of values corresponding to each key. In other words, it will
put together all the values corresponding to each unique key in the intermediate key-value pairs. The output of the sorting
and shuffling phase will be of the following format:

Key – list of Values:

{cust ID1 – [(cust name1), (tnxn amount1), (tnxn amount2), (tnxn amount3),…..]}

{cust ID2 – [(cust name2), (tnxn amount1), (tnxn amount2), (tnxn amount3),…..]}

……
• Example:

{4000001 – [(cust kristina), (tnxn 40.33), (tnxn 47.05),…]};

{4000002 – [(cust paige), (tnxn 198.44), (tnxn 5.58),…]};


……

24
Reduce side Join
Reduce Phase
• The primary goal of this reduce-side join operation is to find out how many times a
particular customer has visited the sports complex and the total amount spent by that customer on
different sports. Therefore, the final output should be of the following format:

Key – Value pair: [Name of the customer] (Key) – [total amount, frequency of the visits] (Value)

• Hence, the final output that my reducer will generate is given below:

Kristina, 651.05 8

Paige, 706.97 6

…
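
A sketch of the join reducer that produces this output (the class name is illustrative; it relies on the “cust”/“tnxn” tags added by the two mappers above):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer for the reduce-side join: for each cust ID it receives the tagged name
// and all tagged transaction amounts together, then emits
// (customer name, "total amount<TAB>visit count").
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text custId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = "";
        double total = 0.0;
        int visits = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(" ");
            if (parts[0].equals("cust")) {
                name = parts[1];                       // the customer's name
            } else {                                   // a "tnxn" record
                total += Double.parseDouble(parts[1]);
                visits++;
            }
        }
        context.write(new Text(name),
                new Text(String.format("%.2f\t%d", total, visits)));
    }
}
```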
