Hadoop MapReduce Tutorial

The document explains the workings of Hadoop MapReduce, detailing its components such as InputFormat, InputSplits, Mapper, Combiner, Partitioner, and Reducer, along with their roles in processing data. It also discusses the types of joins in MapReduce, specifically Map-Side Join and Reduce-Side Join, highlighting their advantages and disadvantages. The document concludes with an example of a Reduce-Side Join operation to analyze customer transaction data.

Word Count MapReduce

1
Hadoop MapReduce

2
Hadoop MapReduce
How does Hadoop MapReduce work?

3
Hadoop MapReduce
Input Files
● The data for a MapReduce task is stored in input files, which typically live
in HDFS. The format of these files is arbitrary: line-based log files and binary
formats can both be used.

4
Hadoop MapReduce
InputFormat
● The InputFormat defines how these input files are split and read. It selects the files
or other objects that are used for input, and it creates the InputSplits.

5
Hadoop MapReduce
InputSplits
● Created by the InputFormat, an InputSplit logically represents the data that will be processed by an
individual Mapper (the mapper is described below). One map task is created for each split,
so the number of map tasks equals the number of InputSplits. Each split is divided
into records, and each record is processed by the mapper.

6
Hadoop MapReduce
RecordReader
● The RecordReader communicates with the InputSplit in Hadoop MapReduce and converts the data into key-
value pairs suitable for reading by the mapper. By default, Hadoop uses TextInputFormat, whose
RecordReader converts data into key-value pairs and keeps reading from the InputSplit until the
file has been fully consumed. It assigns the byte offset (a unique number) of each line present in the
file as the key, with the line itself as the value. These key-value pairs are then sent to the mapper for further processing.
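
With TextInputFormat, the byte-offset key is visible directly in the mapper's type signature. A minimal sketch using Hadoop's org.apache.hadoop.mapreduce API (the class name is illustrative):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the RecordReader delivers each line as:
//   key   = byte offset of the line within the file (LongWritable)
//   value = the text of the line itself (Text)
public class OffsetEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws java.io.IOException, InterruptedException {
        context.write(offset, line); // pass the (offset, line) pair straight through
    }
}
```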

7
Hadoop MapReduce
Mapper
● The Mapper processes each input record (from the RecordReader) and generates new key-value pairs, and these pairs
can be completely different from the input pair. The output of the Mapper is also known as the
intermediate output, and it is written to the local disk. It is not stored on HDFS, because it is
temporary data and writing it to HDFS would create unnecessary copies (HDFS is also a high-latency system).
The mapper's output is passed to the combiner for further processing.
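
As a concrete example, here is a sketch of the word-count mapper behind slide 1. It follows the standard Hadoop pattern; the class name is illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: for each input line it emits one (word, 1) pair per token.
// The intermediate (word, 1) pairs are completely different from the (offset, line) input pair.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(line.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}
```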

8
Hadoop MapReduce
Combiner
● The combiner is also known as a ‘mini-reducer’. The Hadoop MapReduce Combiner performs local
aggregation on the mappers’ output, which helps to minimize the data transfer between
mapper and reducer (the reducer is described below). Once the combiner has run,
its output is passed to the partitioner for further work.
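
A combiner is written exactly like a reducer. A sketch for the word-count job (the class name is illustrative; using a combiner is safe here only because addition is associative and commutative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The combiner runs on the mapper's node and pre-sums counts locally,
// so far fewer (word, 1) pairs have to cross the network to the reducers.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        result.set(sum);
        context.write(key, result); // partial count for this key on this node
    }
}
```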

9
Hadoop MapReduce
Partitioner
● In Hadoop MapReduce, the Partitioner comes into the picture when we are working with more than one
reducer (with a single reducer, no partitioner is used).
● The Partitioner takes the output from the combiners and partitions it. Partitioning
is based on the key, and each partition is then sorted. A hash function applied to the key (or a subset of the
key) derives the partition.
● Each combiner output record is partitioned according to its key value: records
having the same key go into the same partition, and each partition is then sent to a
reducer. Partitioning allows even distribution of the map output over the reducers.
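
Hadoop's default partitioner is HashPartitioner. The sketch below reproduces its behavior (the class name is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent to Hadoop's default HashPartitioner: the key's hash (masked to stay
// non-negative) modulo the number of reducers picks the partition, so every record
// with the same key lands on the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```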

10
Hadoop MapReduce
Shuffling and Sorting
● Now, the partitioned output is shuffled to the reduce node (a normal slave node on which the reduce phase
will run, hence called the reducer node). Shuffling is the physical movement of the data,
and it is done over the network. Once all the mappers have finished and their output has been shuffled
to the reducer nodes, this intermediate output is merged and sorted, and then
provided as input to the reduce phase.

11
Hadoop MapReduce
Reducer
● The Reducer takes the set of intermediate key-value pairs produced by the mappers as its input and
runs a reducer function on each group of values that share a key to generate the output. The output of the reducer is the
final output, which is stored in HDFS.
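
Continuing the word-count sketch, the reducer sums the counts for each word (the class name is illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: after shuffle and sort it receives (word, [1, 1, ...])
// and emits (word, total). Its output is the final result, written to HDFS.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        total.set(sum);
        context.write(word, total); // final (word, count) pair
    }
}
```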

12
Hadoop MapReduce
RecordWriter
● The RecordWriter writes the output key-value pairs from the Reducer phase to the output files.

13
Hadoop MapReduce
OutputFormat
● The way these output key-value pairs are written to output files by the RecordWriter is determined by the
OutputFormat. OutputFormat instances provided by Hadoop are used to write files to HDFS or to the local
disk, so the final output of the reducer is written to HDFS by an OutputFormat instance.
● This is the manner in which a Hadoop MapReduce job works over the cluster.

14
Hadoop MapReduce

15
Hadoop MapReduce

The entire MapReduce program can be fundamentally divided into three parts:

● Mapper Phase Code
● Reducer Phase Code
● Driver Code
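
A minimal driver sketch tying the three parts together, reusing the illustrative mapper, combiner, and reducer classes from the earlier slides:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires together the Mapper, Combiner, and Reducer classes sketched
// above and submits the job to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumCombiner.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```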

16
What is Join in MapReduce

● Advantage – an optimized solution in terms of the
processing speed of the data.
● Disadvantage – time-consuming for the programmer and
not an easy mode of development, since a join takes
hundreds of lines of code and higher-level
frameworks like Hive/Pig already provide it.

17
Type of Join in MapReduce
Map-Side Join
Reads the data streams into the mappers and uses logic within the
mapper function to perform the join.
● Where to use: – when you have one large dataset that you need
to join with a small dataset, and when you want to optimize performance.
● Why: – the smaller table is loaded into memory, and the join
operation happens during the mapper's pass over the large dataset.
● Advantage: better performance.
● Disadvantage: not flexible, i.e. it cannot be used if both datasets
are large in size.
● Note: a Reduce-Side Join can also be used here, but performance
will decrease.
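
A sketch of a map-side join, assuming the small customer file has been shipped to every node via the distributed cache (job.addCacheFile(...)) under the hypothetical name customers.txt with comma-separated custID,custName lines, while the large transaction file streams through map() as custID,amount lines:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the small dataset is loaded fully into memory in setup();
// the large dataset streams through map() and is joined with no reduce phase.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> custById = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "customers.txt" is an assumed cache-file name, symlinked into the task dir.
        try (BufferedReader r = new BufferedReader(new FileReader("customers.txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split(",");      // custID,custName
                custById.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");   // custID,amount
        String name = custById.get(parts[0]);
        if (name != null) {                            // inner join: emit matches only
            context.write(new Text(name), new Text(parts[1]));
        }
    }
}
```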

18
Type of Join in MapReduce
Reduce-Side Join
Processes the multiple data streams through multiple map stages and
performs the join at the Reducer stage.
● Where to use: – when both datasets are large.
● Why: – neither dataset can be loaded into memory
completely; both tables have to be processed separately and
joined on the reducer side.
● Advantage: flexible, and can be applied anywhere.
● Disadvantage: poor performance in comparison with Map-Side
Joins.
● Note: – a Map-Side Join can’t be used here.

19
Reduce side Join

• Suppose that I have two separate datasets of a sports complex:

• cust_details: It contains the details of the customer.

• transaction_details: It contains the transaction record of the customer.

• Using these two datasets, I want to know the lifetime value of each customer. To do so, I will
need the following:

• The person’s name along with the frequency of that person's visits.

• The total amount spent by him/her on purchasing equipment.


20
Reduce side Join

21
Reduce side Join
Map phase: Mapper for customer
• Read the input taking one tuple at a time.
• Tokenize each word in that tuple and fetch the cust ID along with the name of the person.
• The cust ID will be the key of the key-value pair that the mapper will generate eventually.
• Add a tag “cust” to indicate that this input tuple is of cust_details type.
• Therefore, the mapper for cust_details will produce intermediate key-value pairs of the following form:
Key – Value pair: [cust ID, cust name]

• Example: [4000001, cust Kristina], [4000002, cust Paige], etc.
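
A sketch of this mapper, assuming comma-separated cust_details records with the cust ID in column 0 and the name in column 1 (the layout and class name are assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for cust_details: emits (cust ID, "cust <name>"). The "cust" tag lets
// the reducer tell customer records apart from transaction records.
public class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");   // custID,custName,...
        context.write(new Text(parts[0]), new Text("cust " + parts[1]));
    }
}
```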

22
Reduce side Join
Map phase: Mapper for transaction
• Fetch the amount value instead of the name of the person.
• In this case, “tnxn” is used as the tag.
• Here too, the cust ID will be the key of the key-value pair that the mapper will generate eventually.
• Finally, the output of mapper for transaction_details will be of the following format:
Key, Value Pair: [cust ID, tnxn amount]

• Example: [4000001, tnxn 40.33], [4000002, tnxn 198.44], etc.
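
A matching sketch for the transaction mapper; the column positions (cust ID in column 2, amount in column 3) and the class name are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for transaction_details: emits (cust ID, "tnxn <amount>") so the
// reducer can recognize these values as transaction amounts.
public class TransactionMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");   // ...,custID,amount
        context.write(new Text(parts[2]), new Text("tnxn " + parts[3]));
    }
}
```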

23
Reduce side Join
Sorting and Shuffling Phase
• The sorting and shuffling phase will generate an array list of values corresponding to each key. In other words, it will
put together all the values corresponding to each unique key in the intermediate key-value pairs. The output of the sorting
and shuffling phase will be of the following format:

Key – list of Values:

{cust ID1 – [(cust name1), (tnxn amount1), (tnxn amount2), (tnxn amount3),…..]}

{cust ID2 – [(cust name2), (tnxn amount1), (tnxn amount2), (tnxn amount3),…..]}

……
• Example:

{4000001 – [(cust kristina), (tnxn 40.33), (tnxn 47.05),…]};

{4000002 – [(cust paige), (tnxn 198.44), (tnxn 5.58),…]};


……

24
Reduce side Join
Reduce Phase
• The primary goal of this reduce-side join operation is to find out how many times a
particular customer has visited the sports complex and the total amount spent by that customer on
different sports. Therefore, the final output should be of the following format:

Key – Value pair: [Name of the customer] (Key) – [total amount, frequency of the visits] (Value)

• Hence, the final output that my reducer will generate is given below:

Kristina, 651.05 8

Paige, 706.97 6

…
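
A sketch of the join reducer that produces this output (the class name is illustrative; it relies on the “cust”/“tnxn” tags added by the two mappers above):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer for the reduce-side join: for each cust ID it receives the tagged name
// and all tagged transaction amounts together, then emits
// (customer name, "total amount<TAB>visit count").
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text custId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = "";
        double total = 0.0;
        int visits = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(" ");
            if (parts[0].equals("cust")) {
                name = parts[1];                       // the customer's name
            } else {                                   // a "tnxn" record
                total += Double.parseDouble(parts[1]);
                visits++;
            }
        }
        context.write(new Text(name),
                new Text(String.format("%.2f\t%d", total, visits)));
    }
}
```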
