Noc19-cs33
Lec 06: Hadoop MapReduce 2.0 (Part-I)
Content of this lecture: in this lecture, we will discuss the MapReduce paradigm and what is new in MapReduce version 2. We will also look into its internal working and implementation, see several examples of how different applications can be designed using MapReduce, and look into how scheduling and fault tolerance are supported inside the version 2 implementation of MapReduce.
So, programs written in this style are automatically executed in parallel on a large cluster of commodity machines. The parallel programming is taken care of automatically, and the users or programmers do not have to bother about how the parallelization and the synchronization are to be managed; parallel execution is done automatically in the MapReduce framework. As far as the runtime is concerned, it takes care of the details of partitioning the data, scheduling the program's execution across the set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed computing to easily utilize the resources of a large distributed system. So, let us explain this entire notion of the runtime operations which MapReduce provides as an execution environment for running these programs on large-scale commodity machines.
Starting with the input data set, we assume it is very large and stored on the cluster, that is, it may be stored across hundreds or thousands of nodes. It is stored with the help of partitioning: the data set is partitioned and the partitions are stored on these nodes. Let us say that if 100 nodes are available, the entire data set is partitioned into 100 different chunks and stored in this way for computation. Now, the next task is to schedule the program's execution across these 100 different machines where the entire data set is stored. The framework then launches the programs, in the form of map and reduce, which execute in parallel at all places; that is, if 100 different nodes are involved, all 100 different chunks experience the execution in parallel. This is also done automatically by MapReduce, using the job tracker and the task tracker that we have already seen. In the newer version, 2.0, this scheduling and resource allocation is done by YARN, once the request comes from the MapReduce execution for that particular job on that data set.
Now, there are issues: these nodes are built from commodity hardware, as we have seen in the previous lecture on HDFS, so nodes may fail. In case of node failures, this is taken care of automatically by scheduling the execution on the alternative replicas, and the execution still carries on without any effect. The framework is also required to manage the inter-machine communication; for example, when intermediate results are available, they sometimes need to be communicated between machines. This is called 'shuffle' and 'combine'. Shuffle and combine require inter-machine communication, which we will see in more detail in the further slides.
Now, this allows programmers without any experience with parallel and distributed systems to use the framework, and it simplifies the entire programming paradigm: the programmer has only to write simple MapReduce programs. We will see how programmers can write MapReduce programs for different applications, while the execution is automatically taken care of by this framework. A typical MapReduce computation may process many terabytes of data on thousands of machines. As we have already seen, hundreds of MapReduce programs have been implemented, and upwards of 1,000 MapReduce jobs are executed on Google's cluster every day. Companies like Google and other big companies that work on big data computation use this technology; that is, their applications are written in MapReduce, as we are going to see.
Now, we will see the components which are used in MapReduce program execution. First are the chunk servers, where the file is split into chunks of 64 megabytes, and these chunks are replicated. Chunks are also sometimes called 'blocks'; the two terms are synonymous in this part of the discussion. These chunks are replicated, sometimes two times, often three times; this is called the 'replication factor'. Being replicated means that every chunk is stored on more than one node; for example, when the replication factor is three, every chunk is stored on three different nodes. The replicas are also placed on different racks where possible, to be tolerant of rack failures: if a node on one rack, or a whole rack, fails, the other replicas on a different rack can still be used. That is how replication supports fault tolerance, or failure tolerance. The next component of the distributed file system used in MapReduce is called the 'Master Node', also known as the Name Node in the terminology of HDFS, which stores the metadata, that is, the information about which data nodes the actual data is stored on; this information is maintained at the master node. Another component is the 'Client Library' for file access, which talks to the master to find out the chunk servers and then connects directly to the chunk servers to access the data.
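As an aside, this chunk-to-node metadata can be inspected programmatically. Below is a minimal sketch using the HDFS Java client API, assuming a client configured to point at an HDFS cluster; the file path /data/sample.txt and the class name are hypothetical. It asks the Name Node for the block locations of a file, which is exactly the master lookup that the client library performs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // The client library contacts the Name Node (master) for metadata.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per chunk; each lists the data nodes
        // (chunk servers) that hold a replica of that chunk.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block); // offset, length, replica hosts
        }
    }
}

After this lookup, reads and writes go directly to the chunk servers, not through the master.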
Now, we will see the motivation for MapReduce: why we are so enthusiastic about this programming paradigm, which is heavily used in big data computation. The motivation for MapReduce is to support large-scale data processing: if you want to run a program on thousands of CPUs, MapReduce is the readily available paradigm, and companies that need large-scale data processing rely on this framework for exactly this task. Another important motivation is that this paradigm makes the programming very simple and does not require the programmer to manage the many intricate details underneath.
The MapReduce architecture also provides automatic parallelization and distribution of the data and the computation. The parallel computation on the distributed system is all abstracted away: the programmer is given the simple MapReduce paradigm and does not have to bother about these details, as the parallelization and distributed computation are performed automatically. Another important part is failures: fault tolerance is also supported in the MapReduce architecture, and the programmer does not have to bother about it; it is taken care of automatically, provided the programmer gives sufficient configuration information, so that fault tolerance is handled across different applications. Input/output scheduling is also done automatically: lots of optimizations are used to reduce the number of I/O operations and to improve the performance of the execution engine. Similarly, the monitoring of all the data nodes, that is, the task trackers' execution and their status updates, is done using client-server interaction in the MapReduce architecture; that is also taken care of.
Now, let us go into more detail on the MapReduce paradigm and see what MapReduce is from the programmer's perspective. The terms 'map' and 'reduce' are borrowed from functional programming languages; Lisp is one such programming language with these features. So, let us see this functionality as supported in a functional programming language, and let us say we want to write a program for calculating the sum of squares of a given list. Assume the list we are given is (1 2 3 4), a list of numbers, and we want to find the squares of these numbers. There is a function called 'square', and map says that this square function is to be applied to each and every element of the list. The output will be:
(map square '(1 2 3 4))
Output: (1 4 9 16)
[processes each record sequentially and independently]
So, all the squares are computed, and the output is given, again, in the form of a list. Because each record is processed independently, this square operation may be executed in parallel on all the elements, and the result is produced efficiently; the squares need not be computed sequentially, one by one. Hence this is a parallel execution environment for the map function, which here performs the square operation on all the elements provided to it. That was to calculate the squares; to find the sum, another routine is required which performs the summation of this intermediate result (the output of map is called the 'intermediate result'). With this input, we perform another operation, which is called 'reduce'. Reduce applies an addition operation on this list, which is nothing but the intermediate result calculated by the map function, and you can see that sum is a binary operator: it performs the summation two elements at a time, and once everything is accumulated, the output is the required sum of the squares. In the same style as above:

(reduce + '(1 4 9 16))
Output: 30

So, this is the sum of squares.
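As an aside, the same sum-of-squares computation can be written in Java, the usual language for Hadoop programs, using the stream API's map and reduce operations. This is a minimal sketch of the functional idea, not code from the lecture:

import java.util.List;

public class SumOfSquares {
    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 2, 3, 4);

        // map: apply the square function to every element independently;
        // reduce: fold the intermediate results with the binary sum operator.
        int sumOfSquares = numbers.stream()
                .map(x -> x * x)          // intermediate result: 1, 4, 9, 16
                .reduce(0, Integer::sum); // 1 + 4 + 9 + 16 = 30

        System.out.println(sumOfSquares); // prints 30
    }
}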
Given this kind of function from a functional programming language, we will see how MapReduce can be applied in the Hadoop scenario. Now, let us look at a small, simple application; here it is written as a sample application of word count. That means, let us say we have a large collection of documents; assume it is a Wikipedia dump of a particular author's works, say Shakespeare's (it could be the related works of any other important person). We are going to process this huge data set, and we are asked to list the count of each of the words in these documents. A query could then be to find, with reference to Shakespeare's works, which character is more important, based on how often Shakespeare refers to that character.
So, let us see how this is all done using MapReduce. First, the huge data set is considered in the form of key-value pairs. For example, suppose the document consists of the two lines 'Welcome Everyone' and 'Hello Everyone'. The input file is given in the form of the file name and the text of the file, so this becomes the key-value pair: the file name is the key, and the text is the value. When we apply a map function to this input, it generates another set of key-value pairs, called the 'intermediate result'. In this form of key-value pairs, every word becomes a key, and the number of times it occurs in a particular line of the file becomes the value. For example, in line number one, 'Welcome' appears once, so the map gives a count of one; similarly, in line number one, 'Everyone' is also present only once. When the second line is processed, 'Hello' appears once, so it is counted once, and 'Everyone' appears again on the second line. These key-value pairs emitted by the map are further used by the reduce function.
Now, let us see how this map function can work in parallel. For example, instead of processing these two lines sequentially, let us divide them into two different chunks: one chunk is given to map function one, and the other chunk is given to a second map function. If there are two chunks, we are going to process them in parallel, so map task 1 will execute and map task 2 will also execute, in parallel. When they are executed in parallel, the key-value pairs for 'Welcome' and 'Everyone' are collected by task number 1, while the others, for 'Hello' and 'Everyone', are collected by map task 2. So we can say that the map function processes individual records to generate the intermediate key-value pairs in parallel, and that is done automatically. If the number of chunks is increased beyond two, then multiple map functions execute simultaneously, that is, in parallel, to process the records.
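As a small illustration (my own sketch in plain Java, not the Hadoop API), the two map tasks above can be mimicked with a parallel stream over the chunks, each emitting a (word, 1) pair per word:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelMapSketch {
    public static void main(String[] args) {
        // Two chunks, handled as two independent "map tasks".
        List<String> chunks = List.of("Welcome Everyone", "Hello Everyone");

        // Each map task splits its chunk into words and emits (word, 1)
        // per word; parallelStream() processes the chunks simultaneously.
        List<Map.Entry<String, Integer>> intermediate = chunks.parallelStream()
                .flatMap(chunk -> Arrays.stream(chunk.split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        System.out.println(intermediate);
        // [Welcome=1, Everyone=1, Hello=1, Everyone=1]
    }
}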
The map abstraction is written by the user; it takes an input pair and produces a set of intermediate key-value pairs. This function is written based on the application: the programmer has to write down, as per the application, what is to be done in the map function, and the remaining part is taken care of in the reduce function. The MapReduce library then groups together all intermediate values associated with the same intermediate key and passes them to the reduce function. That is the map abstraction.
Now, we will see the structure of the map and reduce functions that programmers write, again taking the example of word count. We will not go into a detailed explanation at this stage; more detail will come in the further slides. As far as the map function is concerned, this is its pseudo-code; it cannot be executed in this form, because it has to be programmed to the specification of a particular MapReduce implementation, but the pseudo-code shows the structure. It specifies the map function with a key and a value, where the key, in the word count example, is the name of the document, and the value is the text of the document, which is retrieved and given to the map function. For each word appearing in the value, the map emits that word together with a count of one:

map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

The pseudo-code for reduce takes the same format as emitted by the map function: the word becomes the key, and the counts become the values. The formats must match; only then is the output of the map acceptable to the reduce function. Here the key is a word, and the values, grouped by key, are traversed by an iterator. Reduce takes all these values from the iterator, performs the summation, and the result is how many times that key appears:

reduce(key, values):
    // key: a word; values: an iterator over counts for that word
    result = 0
    for each count v in values:
        result = result + v
    emit(key, result)

That is it: we do the summation of all the values, as explained.
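To make the pseudo-code concrete, here is a sketch of word count in Hadoop's Java MapReduce API, closely following the well-known WordCount example that ships with Hadoop; input and output paths are supplied on the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: for each word in the input split, emit (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // reduce: sum all the counts received for the same word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local 'combine' step
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the reducer class is also registered as a combiner, which performs the local 'combine' step mentioned earlier, before the shuffle.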