Lec4 Merged
Content of this lecture: in this lecture, we will cover the goals of HDFS, the read/write processes in HDFS, and the configuration tuning parameters used to control HDFS performance and robustness.
Now, let us see the Hadoop Distributed File System: some of its design concepts, and then we will go into more detail of the HDFS design. The first important thing is that it is a scalable, distributed file system: that means we can add more disks and get scalable performance, and that is one of the major design concepts of HDFS in realizing a scalable distributed file system. As you add more nodes, you are adding a lot of disks, and this automatically scales out the performance, as far as the design goal of HDFS is concerned. Why is this required? Because if the data set is very large, it cannot fit into one computer system. So, hundreds and thousands of computer systems are used to store that file. Hence, the data of a file is divided into blocks and distributed onto this large-scale infrastructure. That means the data is distributed on the local disks of several nodes. This method ensures the use of low-cost commodity hardware to store the information, by distributing it across multiple nodes which comprise low-cost commodity hardware. The drawback is that some of the nodes may fail; we will see that handling such failures is also included in the design goals of HDFS. When low-cost commodity hardware is used in this manner, a lot of performance is achieved, because we are aggregating the performance of hundreds and thousands of such low-cost commodity machines.
So, in this particular diagram we assume n nodes, say node 1, node 2, and so on up to node n; these nodes are in the range of hundreds and thousands. If a file is given, the file is broken into blocks, and the data of these blocks is distributed over this setup. In this example, we can see that there is a file, the file data is divided into data blocks, and each block is stored across different nodes. We can see here that all the blue-colored blocks on these nodes are storing the file's data. Hence, the file data is now distributed onto the local disks in HDFS.
So, hundreds and thousands of nodes are available and their disks are being used for storage. Now, these comprise commodity hardware, so they are prone to hardware failures, and as I told you, the design needs to handle node failures. So, one HDFS design goal is to handle node failures as well. Another aspect is portability across heterogeneous hardware. Why? Because there are hundreds and thousands of commodity machines, and they may be running different operating systems and software; hence this heterogeneity also requires portability support. That is also one of the HDFS design goals. Another important design goal of HDFS is to handle large data sets, with file sizes ranging from terabytes to petabytes. Such huge files or huge data sets can also be stored in the HDFS file system, so HDFS provides support for handling large data sets and enables processing with high throughput. How is processing with high throughput ensured? That we will see; it has been kept as one of the important design goals of HDFS.
Now let us see what is new in Hadoop version 2.0. HDFS in Hadoop 2.0, that is HDFS 2.0, uses HDFS Federation: it is not a single namespace but a federation, and this is called HDFS NameNode Federation. This federation has multiple data nodes and multiple name nodes, and it increases the reliability of the name node service. So, it is not one name node but n name nodes, and this method is called 'HDFS Federation'. The benefit is increased namespace scalability: earlier there was one namespace, now there is a federation of namespaces, so obviously the scalability is increased; the performance is also increased, and so is the isolation. Why? Performance increases because the nearest namespace is used to serve client requests, and isolation means that if a particular application has a very large resource requirement, it is not going to affect the other namespaces. Because it is a federation of namespaces, other applications are not affected by the very high requirement of one particular application; that is called 'isolation'.
Now, let us go and discuss how all this is done in HDFS version 2, or Hadoop version 2. Here, as we have mentioned, instead of one name node we now have multiple name node servers, and they are managing the namespaces; hence there are multiple namespaces, and the data is now stored in the form of block pools. These block pools are managed across the data nodes on the nodes of the cluster. So, it is not only one node; several nodes are involved, and they store the block pools. There is a pool associated with each name node or namespace, and these pools are essentially spread out over all the data nodes.
Here you can see in this particular diagram that there are multiple namespaces: namespace 1, namespace 2, and so on up to namespace n. Each namespace has a block pool. These are called 'block pools', and these block pools are stored on the nodes of the cluster, spread over different nodes. So, different nodes manage the multiple namespaces, and this is called the 'federation' of block pools. Hence there is no longer a single point of failure: even if one or more name nodes or namespaces fail, it is not going to affect the others, and it also increases performance, reliability and throughput, and provides isolation. If you remember the original design, you had only one namespace and a bunch of data nodes; the structure looks similar, but internally it is managed as a federation. So, you have a bunch of name nodes now, instead of one name node, and each of these name nodes essentially writes to its own block pool. The pools are spread out over the data nodes just like before; this is where the data is spread out across the different data nodes. So, the block pool is essentially the main thing that is different in Hadoop version 2, or HDFS version 2.
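As a rough illustration only (not part of the lecture), a client-side configuration pointing to a federated deployment might set the standard federation properties, dfs.nameservices and the per-namespace RPC addresses, roughly as below. The namespace IDs (ns1, ns2) and host names are hypothetical, and in practice these values normally live in hdfs-site.xml rather than in Java code:

import org.apache.hadoop.conf.Configuration;

public class FederationConfigSketch {
    public static Configuration federatedConf() {
        Configuration conf = new Configuration();
        // Declare two logical namespaces (the IDs are hypothetical examples).
        conf.set("dfs.nameservices", "ns1,ns2");
        // Each namespace is served by its own name node (hosts are hypothetical).
        conf.set("dfs.namenode.rpc-address.ns1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "namenode2.example.com:8020");
        return conf;
    }
}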
For HDFS performance measures, we need to determine the number of blocks for a given file size, which key HDFS and system components are affected by the block size, and the impact of using a lot of small files on HDFS. These are some of the performance measures for which we are going to tune the parameters and measure the performance. Let us summarize these different tuning parameters for performance. First, how many blocks are there for a given file size? This is required to be known, and we will see that there is a performance trade-off involving the number of blocks to be replicated. Another key parameter is the size of the block, which typically varies from 64 MB to 128 MB. So, if the block size is 64 MB, what is the performance, and if we increase the block size, what will the performance be? Similarly, there is the replication factor: if the replication factor is three, then every block is replicated on three different nodes for a given file size. If the replication factor is 1, then obviously we are saving a lot of space, but we are going to sacrifice performance, so there is a trade-off between these. Another important parameter for HDFS performance is the number of small files stored on HDFS. If there are a lot of small files in HDFS, the performance goes down; we will see how, and how this problem is to be overcome, in the further slides.
So, let us recall the HDFS architecture again, where the data is distributed on the local disks of several nodes. Here, in this particular picture, we have shown several nodes across which the data of a file is divided; this is called 'distributed data on different local disks'. For example, for a 10 GB file stored with a 64 MB block size, the number of blocks is:
10 × (1024 / 64) = 160 blocks
So, we have to store 160 blocks in a distributed manner on several nodes, and therefore the block size matters a lot. If we increase the size of the block, then obviously there will be fewer than 160 blocks. If the number of blocks is larger, then more parallel operations are possible, and we are going to see the effect of keeping a small block size of 64 MB versus a larger one.
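As a small illustration of this arithmetic (not part of the lecture's slides), the sketch below computes the number of HDFS blocks needed for a given file size under different block sizes; the 10 GB file is just the running example from above:

public class BlockCountSketch {
    // Number of blocks = ceil(fileSize / blockSize).
    static long numberOfBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1024 * 1024 * 1024;   // the 10 GB file from the example
        long mb64  = 64L * 1024 * 1024;
        long mb128 = 128L * 1024 * 1024;
        System.out.println("64 MB blocks:  " + numberOfBlocks(tenGb, mb64));   // 160
        System.out.println("128 MB blocks: " + numberOfBlocks(tenGb, mb128));  // 80
    }
}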
Now consider the importance of the number of blocks in a file. If the number of blocks is larger, then the amount of memory used in the name node will be larger. For example, every file can consist of a lot of blocks, as we saw in the previous case with 160 blocks, and if you have millions of files, then millions of objects essentially have to be stored in the name node to manage them, and the space required becomes several times bigger as the number of blocks and the number of files grow. So, the number of blocks affects the size of the metadata held in the name node; it determines how much memory is used in the name node to manage that many blocks of a file. The number of map tasks also matters a lot: for example, if the file is divided into 160 blocks, then at least 160 different map functions are required to be executed to cover the computation over the entire data set. Hence, if the number of blocks is larger, not only does it take more space in the name node, but a larger number of map functions is also required to be executed.
Hence, there has to be a trade-off. Similarly, a large number of small files will impact the name node, because a lot of memory is required to store the metadata of all these small files, and the network load also increases in this case.
So, HDFS is therefore optimized for large file sizes: lots of small files are bad. One solution to this problem is to merge or concatenate the files; there is a facility called 'sequence files', where several small files are merged together into a sequence, and that is called a 'SequenceFile', which is treated as one big file instead of keeping many small files. Another solution to the problem of lots of small files is to use HBase or Hive style storage for such a large number of small files; they can be used to mitigate this issue. There is also another solution, which is to combine the inputs using the CombineFileInputFormat. A sketch of the SequenceFile idea is shown below.
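As a rough sketch of the SequenceFile approach (the file names and paths here are made up for illustration), many small files could be packed into one SequenceFile using the Hadoop SequenceFile writer API roughly like this:

import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFilesSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/small-files.seq");      // hypothetical output path

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // Each small file becomes one (file name, file contents) record.
            for (Path small : new Path[]{new Path("/user/demo/a.txt"),      // hypothetical inputs
                                         new Path("/user/demo/b.txt")}) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (FSDataInputStream in = fs.open(small)) {
                    IOUtils.copyBytes(in, buf, 4096, false);
                }
                writer.append(new Text(small.getName()), new BytesWritable(buf.toByteArray()));
            }
        } finally {
            writer.close();
        }
    }
}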
Now, let us see in more detail the read and write processes in HDFS and how they are supported.
Consider the read process in HDFS. First of all, we have a name node, a client, and several data nodes; in this example, we have one name node, three data nodes, and a client which will perform the read operation. The HDFS client requests to read a particular file; this read request goes to the name node to find out the blocks on which the read operation is to be executed, so that the data can be returned to the client. The name node gives this block location information back to the HDFS client, and from there the client has two options, whether to read from block number four or from block number five. It then reads from the replica which is the closest one, and that data is given back to the client.
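A minimal client-side sketch of this read path, using the standard Hadoop FileSystem API (the path below is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // talks to the name node
        try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            // Block locations are resolved via the name node; the bytes themselves
            // are streamed from the closest data node replicas.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}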
Now let us see the write operation, which is again initiated by the client. Whenever a client wants to do a write operation, it requests the name node to find out the data nodes which can be used to store the client's data. After getting this information back, the write operation is performed on the data node which is the closest one, and that data node then has to carry out the replication. If the replication factor is 3, this is done in the form of a pipeline: the client writes the data to a particular data node, and that data node in turn carries out the pipeline for replication; this is called a 'replication pipeline'. Once the replication is over, an acknowledgment is sent back, and the write operation is completed.
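A corresponding minimal sketch of the write path with the same FileSystem API (path and text are illustrative); the replication pipeline itself is handled transparently by HDFS once the stream is written and closed:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // name node chooses target data nodes
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
            // Bytes go to the first data node, which forwards them along the
            // replication pipeline; the write completes after acknowledgments.
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}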
So, we are now going to look at the HDFS tuning parameters, especially the DFS block size, and also the name node and data node related tuning parameters.
Let us see which parameters are the most important ones to decide from the performance perspective. Recall that the HDFS block size impacts how much name node memory is used, the number of map tasks that are spawned, and hence the overall performance. By default the block size is 64 megabytes; typically it can go up to 128 megabytes, and it can be changed based on the workload. If we want better performance and the file size is very large, then more than 64 megabytes is appropriate. The parameter that makes this change is the DFS block size, dfs.block.size, where we specify the block size; by default it is 64 MB, but we can increase it to 128 MB. If the block size is larger, then obviously the number of blocks will be smaller; if the number of blocks is smaller, then the amount of namespace memory required in the name node will be smaller, and the number of map tasks required to execute will also be smaller. So, there is a trade-off, and where performance is required we have to set this block size according to the application.
Another parameter is the HDFS replication factor; by default the replication factor is 3, and this parameter is set using the dfs.replication configuration property. There is a trade-off: we can lower it to reduce the replication cost, that is, if the replication factor is less than 3, the replication cost will be less. But the trade-off is that it will be less robust; robust in the sense that if some of the nodes fail and there is only one copy, no replicas of that data are available, so that data will not be available. Hence it will be less robust, and we also lose performance: if a block is replicated, the system is able to serve that data block from the replica closest to the client. So, higher replication can make data local to more workers, whereas lower replication means more free space.
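As a small hedged sketch of how these two parameters might be overridden programmatically (in practice they are usually set in hdfs-site.xml; dfs.block.size is the older spelling of the property, written dfs.blocksize in newer releases):

import org.apache.hadoop.conf.Configuration;

public class HdfsTuningSketch {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Use 128 MB blocks instead of the old 64 MB default.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        // Keep only 2 replicas instead of 3: less space used, but less robust.
        conf.setInt("dfs.replication", 2);
        return conf;
    }
}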
So far we have discussed HDFS robustness considerations. Replication across data nodes is done so that the system is rack fault tolerant: the replicas are placed across racks, so that if one rack is down, the data can still be served from another rack. The name node receives heartbeats and block reports from the data nodes; whenever a data node is down, this is detected by the name node, and that node is then not used to serve client requests. Multiple copies of the central metadata structures are maintained to deal with such common failures, and failover to a standby name node is supported, which by default is done manually.
Now, there is a trade-off between replication and robustness. The idea is that if we reduce the replication factor, it is going to affect robustness. For example, if a block is not replicated to other data nodes, and the data node containing that block fails, then the data is not available anywhere else; hence, this affects robustness. That is why replication is so important. One performance benefit is that, when you go out to run MapReduce jobs, having replicas gives additional locality possibilities. But the big trade-off is robustness: with no replica, we might lose a node, or a local disk might not be recoverable, because there is no replica. Hence, if no replica is available, it leads to data loss, and the system is not robust. Similarly, with data corruption, if we get a bad checksum we cannot recover, because we do not have any replicas; changes to the other parameters have similar effects. So, basically there is a trade-off between replication and robustness.
So in conclusion, in this lecture we have discussed HDFS and HDFS version 2, the read and write operations supported in HDFS, and the main configuration and tuning parameters, with respect to the block size and the replication factor, that govern the trade-off between HDFS performance and robustness. Thank you.
Lecture 05
Hadoop MapReduce 1.0
Lec 06: Hadoop MapReduce 2.0
(Part-I)
Hadoop MapReduce version 2.0.
Content of this lecture: in this lecture, we will discuss the MapReduce paradigm and what is new in MapReduce version 2. We will also look into the internal working and the implementation, see many examples of how different applications can be designed using MapReduce, and look into how scheduling and fault tolerance are now supported inside the MapReduce version 2 implementation.
So, programs written in this style are automatically executed in parallel on a large cluster of commodity machines. The parallel programming aspects are automatically taken care of, and the users or programmers do not have to bother about how the parallelization and the synchronization are to be managed; parallel execution is done automatically by the MapReduce framework. As far as the runtime is concerned, it takes care of the details of partitioning the data, scheduling the programs' execution across the set of machines, handling machine failures, and managing the required inter-machine communication, and this allows programmers without any experience with parallel and distributed computing to easily utilize the resources of a large distributed system.
Let us explain this entire notion of the runtime operations which MapReduce provides for the execution of these programs on large-scale commodity machines. Starting with the input data set, we assume it to be very large and stored on the cluster, that is, it may be stored on hundreds and thousands of nodes. It is stored with the help of partitioning: the data set is partitioned and stored on these nodes. Let us say that if 100 nodes are available, the entire data set is partitioned into 100 different chunks and stored in this way for computation. The next task is to schedule the program's execution across these, let us say, 100 different machines where the data set is stored. The framework then launches the programs, in the form of map and reduce tasks, which execute in parallel at all places; that is, if a hundred different nodes are involved, all hundred chunks will be processed in parallel. This is also done automatically by MapReduce, using the job tracker and task tracker that we have already seen. In the newer version, that is 2.0, this scheduling and resource allocation is done by YARN, once the request comes from the MapReduce execution for a particular job on that data set. Now, there are issues such as node failures: these nodes comprise commodity hardware, as we have seen in the previous lecture on HDFS, so nodes may fail. In the case of node failures, the framework automatically takes care of it by scheduling the tasks on alternative replicas, and the execution still carries on without any effect. The framework also has to manage the inter-machine communication: for example, when the intermediate results are available, they sometimes have to be communicated between machines; this is called 'shuffle' and 'combine'. Shuffle and combine require inter-machine communication, which we will see in more detail in the further slides.
All this allows programmers without any experience with parallel and distributed systems to use it, and it simplifies the programming paradigm: the programmer has only to write simple MapReduce programs. We will see how programmers can write MapReduce programs for different applications, and the execution is automatically taken care of by the framework. A typical MapReduce computation may process many terabytes of data on thousands of machines. Hundreds of MapReduce programs have been implemented, and upwards of 1,000 MapReduce jobs are executed on Google's clusters every day. Companies like Google and other big companies which work on big data computation use this technology; their applications are written in MapReduce, as we are going to see.
Now, we will see the motivation for MapReduce and why we are so enthusiastic about this programming paradigm, which is so widely used in big data computation. The motivation for MapReduce is to support large-scale data processing: if you want to run a program on thousands of CPUs, MapReduce is the readily available paradigm, and the companies which need large-scale data processing use this framework, which takes care of all of this. Another important motivation is that this paradigm makes the programming very simple and does not require the programmer to manage the many intricate details underneath. The MapReduce architecture also provides automatic parallelization and distribution of the data and its configuration; the parallel computation on the distributed system is all abstracted away, and the programmer is given a simple MapReduce paradigm without having to bother about these details. Another important part is failures: fault tolerance is also supported in the MapReduce architecture, and the programmer does not have to bother about it; it is automatically taken care of, provided the programmer gives sufficient configuration information so that fault tolerance can be handled for the different applications. Another thing is that input/output scheduling is automatically done: lots of optimizations are used to reduce the number of I/O operations and to improve the performance of the execution engines; that is all done automatically by the MapReduce architecture. Similarly, the monitoring of all the worker nodes, that is, of the task trackers, their execution and their status updates, is all done using client-server interaction in the MapReduce architecture; that is also taken care of.
Now, let us go into more detail of the MapReduce paradigm and see what MapReduce is from the programmer's perspective. The terms map and reduce are borrowed from functional programming languages; Lisp, for example, is one such programming language which has this kind of map and reduce feature. So, let us look at the corresponding functionality in a functional programming language, and let us say that we want to write a program for calculating the sum of squares of a given list. Assume that the list we are given is (1 2 3 4), a list of numbers, and we want to find the squares of these numbers. There is a function called 'square', and map says that this square function is to be applied on each and every element of the list. The output will be:
(map square ‘(1 2 3 4))
Output: (1 4 9 16)
[processes each record sequentially and independently]
So, all the squares are computed and the output is again given in the form of a list. This square operation may be executed in parallel on all the elements, and the result is produced very efficiently; it is not that the squares are computed sequentially one by one. The map processes each record independently rather than sequentially; hence it is a parallel execution of the map function, which performs the square operation on all the elements provided to it. This was to calculate the squares; now, to compute the sum, another routine is required to make the summation of this intermediate result, which is called the 'intermediate result'. With this as input, we perform another operation, which is called 'reduce'. Reduce applies an addition operation on this list, which is nothing but the intermediate result calculated by the map function, and you can see that the sum is a binary operator: it does the summation two elements at a time, and once all the values are combined, the final output is the required sum of the squares.
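The same map-then-reduce structure can be sketched outside Lisp as well; for instance, a rough Java streams equivalent (purely illustrative, not part of Hadoop) looks like this:

import java.util.Arrays;
import java.util.List;

public class SumOfSquaresSketch {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        int sumOfSquares = input.stream()
                .map(x -> x * x)              // map: square each element -> (1 4 9 16)
                .reduce(0, Integer::sum);     // reduce: add the intermediate values -> 30
        System.out.println(sumOfSquares);     // prints 30
    }
}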
Given this kind of functionality in a functional programming language, we will see how this map and reduce idea is applied in the Hadoop scenario. Let us look at a small, simple application: a sample application of word count. That means, let us say we have a large collection of documents, for example a dump of Shakespeare's works (it could be any important author's works); we are going to process this huge data set, and we are asked to list the count of each of the words in these documents. A query could then be, for example, to find out how often a particular character is referenced in Shakespeare's works, and which character is more important based on how many times Shakespeare referred to that character.
Now we will see the next phase, which is the 'reduce' operation. After the map, the intermediate values, which are generated in the form of key-value pairs, are used as the input for the reduce function. Reduce processes these intermediate values output by the map function and merges all the intermediate values associated with the same key; that means, based on the equality of the keys, they are all combined together, which is the group-by-key operation, and on each group the computation specified by the reduce function is applied to get the final output. For example, if this is the input to the reduce function and we want to do a word count, then it will group by the key; when you group by the key, 'everyone' appears twice, so with 'everyone' there are two values, and these will be added up by the reduce function. Similarly, 'hello' appears once, so its group has only one tuple, and similarly for 'welcome'. This is how the reduce function executes.
Each key is assigned to one reduce task. Here, if there are two different nodes which run the reduce tasks, let us say reduce task 1 and reduce task 2, then each key is assigned to one reduce task; for example, if 'everyone' is assigned to task 1, then when the same key 'everyone' comes a second time it must also go to that same reduce task, whereas 'welcome' and 'hello' can be assigned to reduce task 2. The reduce tasks process in parallel and merge all the intermediate values by partitioning the keys. But how are these keys partitioned? A simple example is hash partitioning: if you have a hash function and you apply it to a particular key, the resulting value is taken modulo the number of reduce tasks; here, let us assume the number of reduce tasks is 2, so the partition is hash(key) mod 2, and this generates the partitioning. We have shown partitioning through a simple example, hash partitioning; it can also be much more complicated, depending on the application, on how the partitioning is to be done, and on how many reduce tasks are allocated by YARN. After partitioning, the intermediate values are merged and the result is given back.
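A minimal sketch of this hash partitioning rule (it mirrors the idea behind Hadoop's default HashPartitioner; the masking with Integer.MAX_VALUE just keeps the result non-negative):

public class HashPartitionSketch {
    // Assign a key to one of numReduceTasks partitions: hash(key) mod R.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[]{"everyone", "welcome", "hello"}) {
            // The same key always lands on the same reduce task.
            System.out.println(key + " -> reduce task " + partitionFor(key, 2));
        }
    }
}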
In the programming model, the computation takes a set of input key-value pairs and produces a set of output key-value pairs. To do this, the user specifies, using the MapReduce library, the two functions, map and reduce, which we have already explained.
(Part-II)
Let us see some of the applications of MapReduce. The programs are already available and in use in production environments by different companies. Here are a few simple and interesting programs that have already been expressed as MapReduce computations. The first one is called 'distributed grep'. Here, the map function emits a line if it matches a supplied pattern; for example, if a document is given and we have supplied one pattern, then all the lines in that document where the pattern appears will be emitted, and the reduce function is an identity function that just copies the supplied intermediate values to the output. That is called 'distributed grep'. The difference between grep and distributed grep is that here the document can be very big and cannot fit into one system's memory; therefore, a document which is distributed and stored across data nodes can still be searched with this grep operation. So, it filters and extracts only those lines of the documents containing the pattern we are interested in; that is called 'distributed grep', and it has various applications.
The next application is the count of URL access frequency. To do this, the map function processes the log of web page requests and outputs (URL, 1); that means the map function inspects the log of web page requests, and for every URL it encounters it emits a 1 in the map phase. The reduce function then combines: it collects all the values for the same URL, that is, it groups by key, and sums up how many 1s there are; it just has to count them. So, it is just like the word count program, an extension of word count, which finds the URL access frequency, that is, how many times a URL is referred to in a particular log file.
Another application of MapReduce is called the 'reverse web-link graph'. For example, there are web pages pointing to each other; let us say these are pages A, B and C, and we want to find out, for web page C, how many different pages are pointing to it. We are given (source, target) pairs: A points to C, B points to A, and B points to C. Out of this, we now have to find, for a particular web page, how many links are pointing to it; this is called the 'reverse web-link graph'. The map function, given this as input, outputs the (target, source) pair: for the link (A, C) the target is C and the source is A, so the map emits (C, A); similarly for (B, C) it emits (C, B), and for (B, A) it emits (A, B), one pair for each link to a target URL found in a page named source. The reduce function then concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(sources)). For example, here C appears two times, so for C the list becomes (A, B); this is given to the reduce function, which takes the target C and outputs its list of sources. So, page C is pointed to by A and B, which is computed here by MapReduce, and web page A is pointed to by only one page, B. In this way, the popularity of pages can also be calculated: if you want a count, you just take the size of the list, which becomes 2 for C and 1 for A, and this kind of output can be used in computing the PageRank.
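A rough Hadoop MapReduce sketch of this reverse web-link graph follows (not code from the lecture; the input lines are assumed, for illustration, to be "source target" pairs separated by whitespace, and the job driver is omitted):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseWebLinkSketch {

    // Map: for an input line "source target", emit (target, source).
    public static class LinkMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().trim().split("\\s+");
            if (parts.length == 2) {
                context.write(new Text(parts[1]), new Text(parts[0]));
            }
        }
    }

    // Reduce: for each target, concatenate the list of sources pointing to it.
    public static class SourceListReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text target, Iterable<Text> sources, Context context)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text source : sources) {
                if (list.length() > 0) list.append(",");
                list.append(source.toString());
            }
            context.write(target, new Text(list.toString()));   // e.g. C -> A,B
        }
    }
}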
Another application is the inverted index, which essentially all search engines compute. Let us see what this application is and how MapReduce can be used to program it. Here, the map function parses each document and emits a sequence of (word, document ID) pairs; the reduce function accepts all the pairs for a given word, sorts the corresponding document IDs and emits a (word, list(document ID)) pair. The set of all output pairs forms a simple inverted index, and it is easy to augment this computation to keep track of word positions as well. In the later slides we will see more detail of the MapReduce program for this application. Search engines do exactly this: when we type a keyword into Google, it gives the list of web pages, that is, the documents in which the keyword appears, and that lookup is performed using the inverted index. Since the number of documents is huge, search engines like Google and Microsoft compute the inverted index ahead of time, and whenever the user searches, the engine checks this inverted index and returns the result; a sketch of the map and reduce functions is given below. Similarly, there is distributed sort: the map function extracts the key from each record and emits the (key, record) pair, and the reduce function emits all the pairs unchanged. Internally, when the map emits the (key, record) pairs, they are provided in sorted order per partition, and if nothing different happens in the shuffle phase and the output is passed on as it is, then the outcome of the map phase is emitted unchanged by the reduce function, and this performs the distributed sort.
We will also see in more detail how the distributed sort application is done in MapReduce.
We are now going into a little more detail of these applications of MapReduce, some of which we have already summarized. Let us take the example of distributed grep and see how, using MapReduce, we can perform this distributed grep operation. We assume that the input is a large collection of files, and the output we want is the lines of the files that match a given pattern. The map function will emit a line if it matches the supplied pattern. Here things are quite simple: whenever a line matches the pattern in the map function, the map emits only that line, and the reducer does not have to do anything; it just copies all the intermediate data given by the map function to the output.
Another application we will look at is sorting. Here, the input is given in the form of <key, value> pairs and we want the values to be output in sorted order. This particular program, as we have shown, is quite simple: whatever <key, value> input is given to the map function, it outputs it, and the reducer's job is also just to output these <key, value> pairs as they are. In this process, when the map outputs these values, they are already sorted within each partition; normally a quicksort is done on the map side during the shuffle phase, and the reduce side uses a merge sort. So, quicksort and merge sort together perform the sort, and we do not have to do much. We do have to be careful with the partitioning function during the sort: the keys must be partitioned across the reducers based on ranges, and you cannot use hashing, otherwise it will disturb the sorted order.
Example 1: word count using MapReduce. Here we see the structure of the map and reduce functions which will do the word count in a given document.
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
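For reference, a hedged, self-contained version of this word count in the Hadoop Java API might look roughly like the following; it mirrors the widely used example shipped with Hadoop, and the input and output paths are taken from the command-line arguments:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(key, value): for each word w in value, emit (w, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(key, values): sum the counts for each word and emit (word, sum)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}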
Let us see this illustrated through a running example. In this example we consider a document; let us say this is the name of the document and this is its text. When the name of the document and the text are given to the map function, then, as each word appears in the text, the map emits the pair (w, 1), where w is the word and 1 is the value, as in the program above. After emitting these (word, 1) pairs, the shuffle phase collects the same words together, sorts them, and passes them on to the reduce phase. Now it depends upon how many reducers we have; let us assume we have two different reducers, so, for example, 'bob' and 'run' go to one reducer while 'spot' and 'throw' go to the other. For a word such as 'see', which appears twice, the reducer receives two values; it goes through them with the iterator and makes the summation.
Now, given this particular document, if we color the words according to a word-length categorization, the document will look like this, and we have to compute the word-length histogram.
Now we will see how to perform the relational join operation using MapReduce. To understand the relational join, let us take an example first and then see how it is done using MapReduce. Let us say 'Employee' is a table having the attributes name and SSN, and another table, called 'Assigned Departments', has the attributes employee SSN and department name. If we want to join Employee and Assigned Departments on SSN = EmpSSN, we see that the employee named Sue has a matching EmpSSN with Accounts, so if we join them this particular tuple is generated. What about the other one? That SSN has two matches, so two different tuples are generated accordingly, in which Sales and Marketing are reflected. Now let us see how we achieve this using MapReduce.
Before going into details, we have to understand that map (and reduce) is a unary operation, whereas the join is a binary operation: it requires two tables, table 1 and table 2. So how can MapReduce, which operates over one input, be used to do a relational join? What we do is consider the entries, that is, the tuples of both tables, as one collection of tuples, and we attach the identity, the name of the table, to each tuple. In this way we list out all the tuples which are there in the different tables, each tagged with its table name. Once this becomes one complete data set, we can perform the join operation easily. How does the join happen? The join happens around a particular key, here the SSN number, so the SSN number becomes the key, and the map will emit the key and the value: the key is the SSN number and the value is the entire tuple together with its table tag. As far as the reduce is concerned, it groups by this key, and then, within each group, it iterates over the tuples, and if the table tags are different, it produces the joined tuples; a sketch follows.
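A rough sketch of this reduce-side join in the Hadoop API (purely illustrative: it assumes the input lines are comma-separated and already tagged with the table name as the first field, for example "EMP,Sue,123" and "DEPT,123,Accounts"; the job driver is omitted):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoinSketch {

    // Map: tag each tuple with its table and emit (SSN, tagged tuple).
    // "EMP,Sue,123"       -> key 123, value "EMP,Sue"
    // "DEPT,123,Accounts" -> key 123, value "DEPT,Accounts"
    public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            if (f[0].equals("EMP")) {
                context.write(new Text(f[2]), new Text("EMP," + f[1]));
            } else if (f[0].equals("DEPT")) {
                context.write(new Text(f[1]), new Text("DEPT," + f[2]));
            }
        }
    }

    // Reduce: group by SSN, then pair every employee with every department.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text ssn, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> names = new ArrayList<>();
            List<String> depts = new ArrayList<>();
            for (Text v : values) {
                String[] f = v.toString().split(",", 2);
                if (f[0].equals("EMP")) names.add(f[1]); else depts.add(f[1]);
            }
            for (String name : names) {
                for (String dept : depts) {
                    context.write(ssn, new Text(name + "," + dept));   // joined tuple
                }
            }
        }
    }
}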
A -> B C D
B -> A C D E
...and so on for C, D and E.
So, each person, followed by an arrow and the list of his or her friends, is given as the input to this particular program. After the map and shuffle phases, the grouped intermediate pairs for each pair of friends look like this:
(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)