Module - 4 - UNDERSTANDING MAP REDUCE FUNDAMENTALS
MapReduce
1. Traditional Enterprise Systems normally have a centralized server to store and process data.
2. The following illustration depicts a schematic view of a traditional enterprise system. The traditional
model is not suitable for processing huge, growing volumes of data, which cannot be accommodated by
standard database servers.
3. Moreover, the centralized system creates too much of a bottleneck while processing multiple files
simultaneously.
4. Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task
into small parts and assigns them to many computers.
5. Later, the results are collected at one place and integrated to form the result dataset.
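The divide-and-collect idea in points 4 and 5 can be sketched on a single machine. This is only an illustration, assuming threads stand in for the "many computers"; the names split_into_parts, process_part and run_job are hypothetical and not part of any MapReduce framework.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_parts(data, num_parts):
    """Divide the input into roughly equal contiguous parts."""
    size = max(1, -(-len(data) // num_parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_part(part):
    """Work done independently on one part (here: a partial sum)."""
    return sum(part)

def run_job(data, num_workers=4):
    """Assign the parts to workers, then integrate the partial results."""
    parts = split_into_parts(data, num_workers)
    # Threads are an assumption standing in for separate computers.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_part, parts))
    # Collect the partial results in one place and combine them.
    return sum(partial_results)

total = run_job(list(range(1, 101)))
print(total)
```

Each worker sees only its own part, so no single server has to hold or process the whole input.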
Join Our Telegram Group to Get Notifications, Study Materials, Practice test & quiz: https://t.me/ccatpreparations
Visit: www.ccatpreparation.com
The Reduce tasks work on one key at a time, and combine all the values associated with that key in some
way. The manner of combination of values is determined by the code written by the user for the Reduce
function.
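As a minimal sketch of that idea, the two hypothetical Reduce functions below each receive one key and the list of values associated with it. Which one runs is entirely the user's choice; the function names and sample data are assumptions made for illustration.

```python
def reduce_sum(key, values):
    """Combine the values for one key by adding them."""
    return key, sum(values)

def reduce_max(key, values):
    """Combine the values for one key by keeping the largest."""
    return key, max(values)

# Input already grouped by key: each key maps to its list of values.
grouped = {"apple": [1, 1, 1], "pear": [1]}

# The Reduce function is applied to one key at a time.
totals = [reduce_sum(key, values) for key, values in grouped.items()]
print(totals)
```

Swapping reduce_sum for reduce_max changes the result without touching the grouping step, which is why the grouping can be left to the system.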
B. Grouping by Key
i. Once the Map tasks have all completed successfully, the key-value pairs are grouped by key, and the values
associated with each key are formed into a list of values.
ii. The grouping is performed by the system, regardless of what the Map and Reduce tasks do.
iii. The master controller process knows how many Reduce tasks there will be, say r such tasks.
iv. The user typically tells the MapReduce system what r should be.
v. Then the master controller picks a hash function that applies to keys and produces a bucket number
from 0 to r − 1.
vi. Each key that is output by a Map task is hashed and its key-value pair is put in one of r local files. Each
file is destined for one of the Reduce tasks.
vii. To perform the grouping by key and distribution to the Reduce tasks, the master controller merges the
files from each Map task that are destined for a particular Reduce task and feeds the merged file to that
process as a sequence of key-list-of-value pairs.
viii. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k,
[v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs with key k coming from
all the Map tasks.
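Steps v to viii can be sketched as follows. This is a simplified, in-memory illustration: Python lists stand in for the r local files, and zlib.crc32 is an assumed choice of hash function (the text does not say which hash the master controller picks). All function names are hypothetical.

```python
import zlib
from collections import defaultdict

def bucket_for(key, r):
    """Hash a key to a bucket number from 0 to r - 1."""
    return zlib.crc32(key.encode("utf-8")) % r

def partition_map_output(pairs, r):
    """Put each pair from one Map task into one of r local 'files'."""
    files = [[] for _ in range(r)]
    for key, value in pairs:
        files[bucket_for(key, r)].append((key, value))
    return files

def merge_for_reducer(all_map_files, reducer_id):
    """Merge the files destined for one Reduce task and group by key."""
    grouped = defaultdict(list)
    for files in all_map_files:
        for key, value in files[reducer_id]:
            grouped[key].append(value)
    # Result: a sequence of (k, [v1, v2, ..., vn]) pairs.
    return sorted(grouped.items())

r = 2  # number of Reduce tasks, chosen by the user
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],  # output of Map task 0
    [("b", 1), ("c", 1)],            # output of Map task 1
]
all_map_files = [partition_map_output(pairs, r) for pairs in map_outputs]
reducer_inputs = [merge_for_reducer(all_map_files, i) for i in range(r)]
for reducer_id, pairs in enumerate(reducer_inputs):
    print(reducer_id, pairs)
```

Because the bucket depends only on the key, every pair with the same key reaches the same Reduce task, no matter which Map task produced it.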
i. The Map task takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key-value pairs).
Figure 4.4: Overview of the execution of a MapReduce program
ii. The Reduce task takes the output from the Map as an input and combines those data tuples (key-value
pairs) into a smaller set of tuples.
iii. The reduce task is always performed after the map job.
Input Phase − Here we have a Record Reader that translates each record in an input file and sends the
parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of
them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into
identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code
to aggregate the values within the scope of a single mapper. It is not a part of the main MapReduce algorithm;
it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-
value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted
by key into a larger data list. The data list groups the equivalent keys together so that their values can be
iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on
each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, which
may require a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to
the final step.
Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file using a record writer.
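The phases above can be chained together in a small single-process word-count sketch. It is illustrative only: the function names, the tab-separated output format, and the use of nested lists as input splits are assumptions, and a real framework would run the mappers and reducers on different machines.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def record_reader(lines):
    """Input phase: turn each record into a (line number, text) pair."""
    for line_number, line in enumerate(lines):
        yield line_number, line

def mapper(key, value):
    """Map: emit zero or more intermediate (word, 1) pairs per record."""
    for word in value.lower().split():
        yield word, 1

def combiner(pairs):
    """Combiner (optional): aggregate one mapper's output locally."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def shuffle_and_sort(pairs):
    """Shuffle and Sort: sort by key, then group equal keys together."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reducer(key, values):
    """Reducer: combine the list of values for one key."""
    yield key, sum(values)

def output_format(pairs):
    """Output phase: format the final pairs as tab-separated lines."""
    return [f"{key}\t{value}" for key, value in pairs]

def run(splits):
    intermediate = []
    for split in splits:  # one mapper per input split
        mapped = [p for k, v in record_reader(split) for p in mapper(k, v)]
        intermediate.extend(combiner(mapped))
    final = [p for k, vs in shuffle_and_sort(intermediate) for p in reducer(k, vs)]
    return output_format(final)

splits = [["the cat sat", "the cat"], ["the dog sat"]]
result = run(splits)
print("\n".join(result))
```

Removing the combiner call (extending intermediate with mapped directly) gives the same final counts; the combiner only reduces how many pairs have to be shuffled.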
iv. The MapReduce phase
F. MapReduce-Example
Twitter receives around 500 million tweets per day, which is roughly 5,800 tweets per second on average. The
following illustration shows how Twitter manages its tweets with the help of MapReduce.
Figure 4.7: Example
i. Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
ii. Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value
pairs.
iii. Count − Generates a token counter per word.
iv. Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
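The four actions can be sketched in a few lines. This is a toy illustration and not a description of Twitter's actual system: the stop-word list, the sample tweets, and all function names are assumptions.

```python
from collections import Counter

# Assumed list of "unwanted words"; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "and"}

def tokenize(tweet):
    """Tokenize: split a tweet into (token, 1) key-value pairs."""
    return [(token, 1) for token in tweet.lower().split()]

def filter_tokens(pairs):
    """Filter: drop the pairs whose token is an unwanted word."""
    return [(token, n) for token, n in pairs if token not in STOP_WORDS]

def count(pairs):
    """Count: build a token counter per word."""
    counter = Counter()
    for token, n in pairs:
        counter[token] += n
    return counter

def aggregate(counters):
    """Aggregate Counters: merge the per-tweet counters into one."""
    total = Counter()
    for counter in counters:
        total.update(counter)  # Counter.update adds counts together
    return total

tweets = ["Hadoop is fast", "the Hadoop cluster is big and fast"]
per_tweet = [count(filter_tokens(tokenize(t))) for t in tweets]
totals = aggregate(per_tweet)
print(totals.most_common())
```

Tokenize, Filter and Count work on one tweet at a time and so play the role of Map-side steps, while Aggregate Counters merges results across tweets in the way a Reduce step would.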
G. MapReduce – Algorithm
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
MapReduce implements various mathematical algorithms to divide a task into small parts and assign
them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the Map and Reduce
tasks to appropriate servers in a cluster.