Module 3
Subject In-charge
Sonali Suryawanshi
Assistant Professor, Department of Information Technology, SFIT
Room No. 328
email: [email protected]
Contents:
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce
Execution, Coping with Node Failures.
Illustrating the use of MapReduce with real-life databases and applications.
Self-learning Topics: Implementation of MapReduce algorithms such as Word Count, Matrix-Vector and
Matrix-Matrix multiplication.
MapReduce
Definition:
MapReduce is a framework for writing applications that process huge amounts of data,
in parallel, on large clusters of commodity hardware, in a reliable manner.
MapReduce
Features:
• Two functions, called Map and Reduce.
• A programming model for distributed computing, written in Java.
• Easy to scale data processing over multiple computing nodes.
• Decomposing a data processing application into mappers and reducers is
sometimes nontrivial.
• However, once we write an application in the MapReduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change.
• Some number of Map tasks are each given one or more chunks from a distributed
file system and turn each chunk into a sequence of key-value pairs.
• How key-value pairs are produced from the input data is determined by the code
written by the user for the Map function.
• The key-value pairs from each Map task are collected and sorted by a master controller, so that
all key-value pairs with the same key wind up at the same Reduce task.
• The Reduce task combines all the values associated with each key in some way.
Map(key1, value1) → list(key2, value2)
Example (word count): a 200 MB file File.txt is stored as four blocks: A.txt (64 MB), B.txt (64 MB), c.txt (64 MB) and d.txt (8 MB). For the lines "Hi how are you", "How is big data class", "you are learning about mapreduce", the Map task receives (byte offset, value) records such as (1, "hi how are you") and (16, "how is big data class"), and emits (Hi, 1), (how, 1), (are, 1), (how, 1), ..., (you, 1).
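To make this concrete, below is a minimal plain-Java sketch (an illustration of the idea only, not Hadoop code) that turns the (byte offset, line) records above into (word, 1) pairs:

```java
import java.util.List;

// Plain-Java illustration of the Map step for word count (no Hadoop involved):
// each input record is a (byte offset, line) pair; "map" emits one (word, 1)
// pair per word occurrence.
public class MapStepSketch {
    public static void main(String[] args) {
        List<String> lines = List.of(
                "hi how are you",
                "how is big data class",
                "you are learning about mapreduce");
        long offset = 0;
        for (String line : lines) {
            System.out.println("input record: (" + offset + ", \"" + line + "\")");
            for (String word : line.split("\\s+")) {
                System.out.println("  emit (" + word + ", 1)");
            }
            offset += line.length() + 1;   // byte offset of the next line
        }
    }
}
```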
3. JobHistoryServer: a daemon that saves historical information about completed tasks/applications.
DataNodes are the slaves which provide the actual storage and are deployed on each machine. They are
responsible for processing read and write requests from the clients. The following are the other functions of a
DataNode:
1. Handles block storage on multiple volumes and maintains block integrity.
2. Periodically sends heartbeats and block reports to the NameNode.
Figure 2.2 shows how HDFS handles job processing requests from the user, in the form of a sequence diagram.
The user copies the input files into DFS and submits the job to the client. The client gets the input file information from
DFS, creates splits and uploads the job information to DFS. The JobTracker puts the ready job into an internal queue.
The JobScheduler picks the job from the queue and initializes it by creating a job object. The JobTracker creates a list of
tasks and assigns one map task to each input split.
TaskTrackers send heartbeats to the JobTracker to indicate that they are ready to run new tasks. The JobTracker chooses a task from
the first job in the priority queue and assigns it to that TaskTracker.
Secondary NameNode is responsible for performing periodic checkpoints. These are used to restart the
NameNode in case of failure. MapReduce can then process the data where it is located.
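The slides describe the classic JobTracker/TaskTracker submission flow; as a hedged illustration of what "submitting the job" looks like in user code, here is a minimal word-count driver sketch using the newer org.apache.hadoop.mapreduce API (WordCountMapper and WordCountReducer are hypothetical classes sketched later in this module, not the slides' own code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hedged sketch of a word-count driver: packages the job configuration and
// submits it to the cluster, which then schedules map and reduce tasks.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // hypothetical mapper, sketched later
        job.setCombinerClass(WordCountReducer.class);  // optional combiner (see Combiners)
        job.setReducerClass(WordCountReducer.class);   // hypothetical reducer, sketched later
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in DFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in DFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
    }
}
```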
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
Map task
• Consists of elements - tuple, a line or a document.
• A chunk is the collection of elements.
• All inputs to Map task are key–value pairs.
• Map function converts the elements to zero or more key–value pairs.
• Keys not unique.
• Several same key–value pairs from same element is possible..
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
Map task
• Example: count the number of occurrences of each word in a document.
• Each document is an element.
• The Map function reads a document and breaks it into a sequence of words.
• Each occurrence of a word is counted as one; the same word can appear many times, so the key is not
unique.
• Output of the Map function: (w1, 1), (w2, 1), ..., (wn, 1).
• The input can be a repository or collection of documents, the output is the number
of occurrences of each word, and a single Map task can process all the
documents in one or more chunks.
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
The Hadoop Java API includes the Mapper class, which declares a map() function. Any
specific Mapper implementation should be a subclass of this class and override the map()
function.
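A minimal sketch of such a subclass for word count, assuming the newer org.apache.hadoop.mapreduce API (the class name WordCountMapper is hypothetical):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper sketch: input key is the byte offset of the line, input
// value is the line itself; output is one (word, 1) pair per word occurrence.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}
```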
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
• The number of map tasks, Nmap, can be set explicitly using setNumMapTasks(int) (in the older JobConf API), although the actual number is ultimately driven by the number of input splits.
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
Grouping by Key
❖As soon as the Map tasks have all completed successfully, the key-value pairs are grouped
by key, and the values associated with each key are formed into a list of values, e.g.
(w1, [1, 1, 1]), (w2, [1, 1]).
❖The grouping is performed by the system, regardless of what the Map and Reduce tasks
do. The master controller process knows how many Reduce tasks there will be, say r such
tasks.
❖The user typically tells the MapReduce system what r should be. Then, the master
controller picks a hash function that applies to keys and produces a bucket number from 0
to r − 1.
❖Each key that is output by a Map task is hashed and its key-value pair is put in one of r
local files. Each file is destined for one of the Reduce tasks.
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
Grouping by Key
❖the master controller merges the files from each Map task that are destined for a particular
Reduce task and feeds the merged file to that process as a sequence of key-list-of-value
pairs.
❖That is, for each key k, the input to the Reduce task that handles key k is a pair of the form
(k, [v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs with key k
coming from all the Map tasks.
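As a plain-Java sketch of the idea (not Hadoop's internal code), the bucketing and grouping can be pictured as follows, with r Reduce tasks and bucket = hash(key) mod r:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of "grouping by key": map outputs are first bucketed by
// hash(key) mod r (one bucket per Reduce task), then each bucket's pairs are
// merged into (key, [v1, v2, ..., vn]) lists that feed the Reduce function.
public class GroupByKeySketch {
    public static void main(String[] args) {
        int r = 2;  // number of Reduce tasks
        String[][] mapOutput = {{"how", "1"}, {"are", "1"}, {"how", "1"}, {"you", "1"}};

        // One "local file" (bucket) per Reduce task, holding grouped values.
        List<Map<String, List<String>>> buckets = new ArrayList<>();
        for (int i = 0; i < r; i++) buckets.add(new HashMap<>());

        for (String[] pair : mapOutput) {
            int bucket = (pair[0].hashCode() & Integer.MAX_VALUE) % r;  // 0 .. r-1
            buckets.get(bucket).computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        for (int i = 0; i < r; i++) {
            System.out.println("Reduce task " + i + " receives: " + buckets.get(i));
        }
    }
}
```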
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
Partitioner
❖ An optional class that works on the map output (sometimes described as a "semi-mapper").
❖ The driver class can specify the Partitioner.
❖ A partitioner processes the map output before it is submitted to the reduce tasks.
❖ It is an optimization in MapReduce that allows local partitioning before the reduce
task phase.
❖ Its main function is to partition the map output records by key.
❖ The function of the MapReduce partitioner is to make sure that all the values of a
single key go to the same reducer, which eventually helps even distribution
of the map output over the reducers.
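A minimal sketch of a custom Partitioner (the class name WordPartitioner is hypothetical; it mirrors what Hadoop's default HashPartitioner already does):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Custom partitioner sketch: every (key, value) pair is assigned to reducer
// number hash(key) mod numReduceTasks, so all pairs with the same key end up
// at the same Reduce task.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then take mod r.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

The driver would register it with job.setPartitionerClass(WordPartitioner.class).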
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
Combiners
❖ Semi-reducers; an optional class (the driver class can specify the combiner).
❖ Sometimes, a Reduce function is associative and commutative.
❖ That is, the values to be combined can be combined in any order, with the same result.
❖ For example, it does not matter how we group a list of numbers v1, v2, . . . , vn; the sum will be the same.
❖ When the Reduce function is associative and commutative, we can push some of what the
reducers do to the Map tasks. This reduces the input/output (data transfer) between the mappers and
reducers.
❖ In word count, the (w, 1) key-value pairs produced by a Map task would thus be replaced by one pair with key w and value equal to the
sum of all the 1's in all those pairs.
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
Combiners
❖That is, the pairs with key w generated by a single Map task would be replaced by
a pair (w, m),
• where m is the number of times that w appears among the documents
handled by this Map task.
✔ Note that it is still necessary to do grouping and aggregation and to pass the
result to the Reduce tasks, since there will typically be one key-value pair with
key w coming from each of the Map tasks.
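As a driver fragment (a sketch reusing the hypothetical classes from the other examples in this module, not the slides' own code), the combiner for word count can simply be the reducer class:

```java
// Because word count's reduce is associative and commutative, the reducer class
// can be reused unchanged as the combiner: each Map task then pre-aggregates its
// (w, 1) pairs into (w, m) before the shuffle, shrinking the data transferred.
job.setCombinerClass(WordCountReducer.class);  // runs locally after each map task
job.setReducerClass(WordCountReducer.class);   // runs in the reduce phase
```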
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
The Hadoop Java API includes the Reducer class, which declares a reduce() function. Any
specific Reducer implementation should be a subclass of this class and override the reduce() function.
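A minimal sketch of such a subclass for word count (the class name WordCountReducer is hypothetical; it sums the grouped counts for each word and can also serve as the combiner):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer sketch: receives (word, [1, 1, ...]) after grouping by key
// and emits (word, total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();          // add up all the 1's (or partial sums from a combiner)
        }
        total.set(sum);
        context.write(word, total);      // emit (word, total)
    }
}
```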
MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners,
Distributed cache
• The partition of the key space produced by the mapper, that is, all the
intermediate data from the mapper, is given as input to the reducers.
• The partitioner determines the reduce node for a given key–value pair.
• All map values with the same key are reduced together, so all the map nodes
must agree on which reducer node handles each key.
Sort: the sort process groups the key–value pairs to form the list of values for each key.
Shuffle: pairs with the same key are grouped together and passed to the single
machine that will run the reducer over them.
Reducer: user-defined code that
receives a key along with an iterator over all the values associated with
the key and produces the output key–value pairs.
Combiner: takes the mapper output as input and combines the values with the
same key to reduce the number of keys (the key space) that must
be sorted and shuffled.
Distributed cache:
• The user's code for the driver, map and reduce, along with the configuration
parameters, can be packaged into a single jar file and placed in this cache.
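Side files (e.g. lookup tables) can also be shipped to every node through the distributed cache; a hedged sketch with the newer Job API (the file path is illustrative):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: shipping a read-only side file to every node via the distributed
// cache, so map and reduce tasks can open it locally. The path is illustrative.
public class CacheExample {
    public static Job buildJob() throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.addCacheFile(new URI("/user/demo/stopwords.txt"));  // hypothetical HDFS path
        // Inside a Mapper or Reducer, context.getCacheFiles() returns these URIs.
        return job;
    }
}
```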
MapReduce
Working of MapReduce:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
https://www.geeksforgeeks.org/map-reduce-in-hadoop/
Block size is configurable across the cluster and even on a per-file basis.
Split: a split is a MapReduce concept. You can change the split size, meaning you can make the split size greater
than the block size or smaller than the block size. By default, if you do not configure anything, the split size is approximately equal
to the block size.
In MapReduce processing, the number of mappers spawned equals the number of splits: if a file has 10 splits, then 10 mappers will
be spawned.
When the put command is fired, the request goes to the NameNode. The NameNode asks the client (in this case the hadoop fs utility is acting as the client) to break the file
into blocks as per the block size, which can be defined in hdfs-site.xml, and then asks the client to write the different blocks to different DataNodes.
The actual data is stored on the DataNodes, while the metadata, i.e. the file's block locations and file attributes, is stored on the NameNode.
The client first establishes a connection with the NameNode; once it gets confirmation about where to store each block, it makes a TCP
connection directly with the DataNodes and writes the data.
Based on the replication factor, additional copies are maintained in the Hadoop cluster and their block information is stored on the NameNode.
In no scenario does a DataNode hold duplicate copies of a block, i.e. the same block is not replicated on the same node.
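As a hedged sketch (property names and helper methods assumed from Hadoop 2.x; verify against your cluster's version), block size, split size and replication can be influenced like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: tuning block size, split size and replication for a job.
// Property names are assumed from Hadoop 2.x and should be verified.
public class SplitConfigSketch {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.blocksize", "134217728");      // 128 MB block size (can also be set per file)
        conf.setInt("dfs.replication", 3);           // replication factor
        Job job = Job.getInstance(conf, "split demo");
        // Splits can be made larger or smaller than a block:
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
        FileInputFormat.addInputPath(job, new Path("/user/demo/input")); // illustrative path
        return job;
    }
}
```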