Lec4 Merged
Content of this lecture: in this lecture, we will cover the goals of HDFS, the read/write processes in HDFS, and the configuration tuning parameters used to control HDFS performance and robustness.
Now, let us see the Hadoop Distributed File System: some of its design concepts, and then we will go into more detail of the HDFS design. The first important thing is that it is a scalable, distributed file system: that means we can add more disks and get scalable performance, and that is one of the major design concepts of HDFS in realizing a scalable distributed file system. As you add more nodes, you are adding a lot of disks, and this automatically scales out the performance, as far as the design goal of HDFS is concerned. Why is this required? Because if the data set is very large, it cannot fit into one computer system. So, hundreds and thousands of computer systems are used to store that file. Hence, the data of a file is divided into blocks and distributed onto this large-scale infrastructure. That means the data is distributed on the local disks of several nodes. This method ensures the use of low-cost commodity hardware to store the information, by distributing it across multiple nodes which comprise low-cost commodity hardware. The drawback is that some of the nodes may fail; we will see that handling such failures is also included in the design goals of HDFS. When low-cost commodity hardware is used in this manner, a lot of performance is achieved, because we are aggregating the performance of hundreds and thousands of such low-cost commodity machines.
So, in this particular diagram we assume n nodes, say node 1, node 2, and so on up to node n; these nodes are in the range of hundreds and thousands. If a file is given, the file is broken into blocks, and the data of these blocks is distributed over this setup. In this example, we can see that there is a file, the file data is divided into data blocks, and each block is stored across different nodes. We can see here that all the blue-colored blocks on these nodes are storing the file's data. Hence, the file data is now distributed onto the local disks in HDFS.
So, hundreds and thousands of nodes are available and their disks are being used for storage. Now, these comprise commodity hardware, so they are prone to hardware failures, and as I told you, the design needs to handle node failures. So, one HDFS design goal is to handle node failures as well. Another aspect is portability across heterogeneous hardware. Why? Because there are hundreds and thousands of commodity machines, and they may be running different operating systems and software; hence this heterogeneity also requires portability support. That is also one of the HDFS design goals. Another important design goal of HDFS is to handle large data sets, with file sizes ranging from terabytes to petabytes. Such huge files or huge data sets can also be stored in the HDFS file system, so HDFS provides support for handling large data sets and enables processing with high throughput. How is processing with high throughput ensured? That we will see; it has been kept as one of the important design goals of HDFS.
Now let us see what is new in Hadoop version 2.0. HDFS in Hadoop 2.0, that is HDFS 2.0, uses HDFS Federation: it is not a single namespace but a federation, and this is called HDFS NameNode Federation. This federation has multiple data nodes and multiple name nodes, and it increases the reliability of the name node service. So, it is not one name node but n name nodes, and this method is called 'HDFS Federation'. The benefit is increased namespace scalability: earlier there was one namespace, now there is a federation of namespaces, so obviously the scalability is increased; the performance is also increased, and so is the isolation. Why? Performance increases because the nearest namespace is used to serve client requests, and isolation means that if a particular application has a very large resource requirement, it is not going to affect the other namespaces. Because it is a federation of namespaces, other applications are not affected by the very high requirement of one particular application; that is called 'isolation'.
Now, let us go and discuss how all this is done in HDFS version 2, or Hadoop version 2. Here, as we have mentioned, instead of one name node we now have multiple name node servers, and they are managing the namespaces; hence there are multiple namespaces, and the data is now stored in the form of block pools. These block pools are managed across the data nodes on the nodes of the cluster. So, it is not only one node; several nodes are involved, and they store the block pools. There is a pool associated with each name node or namespace, and these pools are essentially spread out over all the data nodes.
Here you can see in this particular diagram that there are multiple namespaces: namespace 1, namespace 2, and so on up to namespace n. Each namespace has a block pool. These are called 'block pools', and these block pools are stored on the nodes of the cluster, spread over different nodes. So, different nodes manage the multiple namespaces, and this is called the 'federation' of block pools. Hence there is no longer a single point of failure: even if one or more name nodes or namespaces fail, it is not going to affect the others, and it also increases performance, reliability and throughput, and provides isolation. If you remember the original design, you had only one namespace and a bunch of data nodes; the structure looks similar, but internally it is managed as a federation. So, you have a bunch of name nodes now, instead of one name node, and each of these name nodes essentially writes to its own block pool. The pools are spread out over the data nodes just like before; this is where the data is spread out across the different data nodes. So, the block pool is essentially the main thing that is different in Hadoop version 2, or HDFS version 2.
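As a rough illustration only (not part of the lecture), a client-side configuration pointing to a federated deployment might set the standard federation properties, dfs.nameservices and the per-namespace RPC addresses, roughly as below. The namespace IDs (ns1, ns2) and host names are hypothetical, and in practice these values normally live in hdfs-site.xml rather than in Java code:

import org.apache.hadoop.conf.Configuration;

public class FederationConfigSketch {
    public static Configuration federatedConf() {
        Configuration conf = new Configuration();
        // Declare two logical namespaces (the IDs are hypothetical examples).
        conf.set("dfs.nameservices", "ns1,ns2");
        // Each namespace is served by its own name node (hosts are hypothetical).
        conf.set("dfs.namenode.rpc-address.ns1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "namenode2.example.com:8020");
        return conf;
    }
}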
For HDFS performance measures, we need to determine the number of blocks for a given file size, which key HDFS and system components are affected by the block size, and the impact of using a lot of small files on HDFS. These are some of the performance measures for which we are going to tune the parameters and measure the performance. Let us summarize these different tuning parameters for performance. First, how many blocks are there for a given file size? This is required to be known, and we will see that there is a performance trade-off involving the number of blocks to be replicated. Another key parameter is the size of the block, which typically varies from 64 MB to 128 MB. So, if the block size is 64 MB, what is the performance, and if we increase the block size, what will the performance be? Similarly, there is the replication factor: if the replication factor is three, then every block is replicated on three different nodes for a given file size. If the replication factor is 1, then obviously we are saving a lot of space, but we are going to sacrifice performance, so there is a trade-off between these. Another important parameter for HDFS performance is the number of small files stored on HDFS. If there are a lot of small files in HDFS, the performance goes down; we will see how, and how this problem is to be overcome, in the further slides.
So, let us recall the HDFS architecture again, where the data is distributed on the local disks of several nodes. Here, in this particular picture, we have shown several nodes across which the data of a file is divided; this is called 'distributed data on different local disks'. For example, for a 10 GB file stored with a 64 MB block size, the number of blocks is:
10 × (1024 / 64) = 160 blocks
So, we have to store 160 blocks in a distributed manner on several nodes, and therefore the block size matters a lot. If we increase the size of the block, then obviously there will be fewer than 160 blocks. If the number of blocks is larger, then more parallel operations are possible, and we are going to see the effect of keeping a small block size of 64 MB versus a larger one.
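As a small illustration of this arithmetic (not part of the lecture's slides), the sketch below computes the number of HDFS blocks needed for a given file size under different block sizes; the 10 GB file is just the running example from above:

public class BlockCountSketch {
    // Number of blocks = ceil(fileSize / blockSize).
    static long numberOfBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1024 * 1024 * 1024;   // the 10 GB file from the example
        long mb64  = 64L * 1024 * 1024;
        long mb128 = 128L * 1024 * 1024;
        System.out.println("64 MB blocks:  " + numberOfBlocks(tenGb, mb64));   // 160
        System.out.println("128 MB blocks: " + numberOfBlocks(tenGb, mb128));  // 80
    }
}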
Now consider the importance of the number of blocks in a file. If the number of blocks is larger, then the amount of memory used in the name node will be larger. For example, every file can consist of a lot of blocks, as we saw in the previous case with 160 blocks, and if you have millions of files, then millions of objects essentially have to be stored in the name node to manage them, and the space required becomes several times bigger as the number of blocks and the number of files grow. So, the number of blocks affects the size of the metadata held in the name node; it determines how much memory is used in the name node to manage that many blocks of a file. The number of map tasks also matters a lot: for example, if the file is divided into 160 blocks, then at least 160 different map functions are required to be executed to cover the computation over the entire data set. Hence, if the number of blocks is larger, not only does it take more space in the name node, but a larger number of map functions is also required to be executed.
Hence, there has to be a trade-off. Similarly, a large number of small files will impact the name node, because a lot of memory is required to store the metadata of all these small files, and the network load also increases in this case.
So, HDFS is therefore optimized for large file sizes: lots of small files are bad. One solution to this problem is to merge or concatenate the files; there is a facility called 'sequence files', where several small files are merged together into a sequence, and that is called a 'SequenceFile', which is treated as one big file instead of keeping many small files. Another solution to the problem of lots of small files is to use HBase or Hive style storage for such a large number of small files; they can be used to mitigate this issue. There is also another solution, which is to combine the inputs using the CombineFileInputFormat. A sketch of the SequenceFile idea is shown below.
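As a rough sketch of the SequenceFile approach (the file names and paths here are made up for illustration), many small files could be packed into one SequenceFile using the Hadoop SequenceFile writer API roughly like this:

import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFilesSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/small-files.seq");      // hypothetical output path

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // Each small file becomes one (file name, file contents) record.
            for (Path small : new Path[]{new Path("/user/demo/a.txt"),      // hypothetical inputs
                                         new Path("/user/demo/b.txt")}) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (FSDataInputStream in = fs.open(small)) {
                    IOUtils.copyBytes(in, buf, 4096, false);
                }
                writer.append(new Text(small.getName()), new BytesWritable(buf.toByteArray()));
            }
        } finally {
            writer.close();
        }
    }
}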
Now, let us see in more detail the read and write processes in HDFS and how they are supported.
Consider the read process in HDFS. First of all, we have a name node, a client, and several data nodes; in this example, we have one name node, three data nodes, and a client which will perform the read operation. The HDFS client requests to read a particular file; this read request goes to the name node to find out the blocks on which the read operation is to be executed, so that the data can be returned to the client. The name node gives this block location information back to the HDFS client, and from there the client has two options, whether to read from block number four or from block number five. It then reads from the replica which is the closest one, and that data is given back to the client.
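A minimal client-side sketch of this read path, using the standard Hadoop FileSystem API (the path below is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // talks to the name node
        try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            // Block locations are resolved via the name node; the bytes themselves
            // are streamed from the closest data node replicas.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}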
Now let us see the write operation, which is again initiated by the client. Whenever a client wants to do a write operation, it requests the name node to find out the data nodes which can be used to store the client's data. After getting this information back, the write operation is performed on the data node which is the closest one, and that data node then has to carry out the replication. If the replication factor is 3, this is done in the form of a pipeline: the client writes the data to a particular data node, and that data node in turn carries out the pipeline for replication; this is called a 'replication pipeline'. Once the replication is over, an acknowledgment is sent back, and the write operation is completed.
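A corresponding minimal sketch of the write path with the same FileSystem API (path and text are illustrative); the replication pipeline itself is handled transparently by HDFS once the stream is written and closed:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // name node chooses target data nodes
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
            // Bytes go to the first data node, which forwards them along the
            // replication pipeline; the write completes after acknowledgments.
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}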
So, we are now going to look at the HDFS tuning parameters, especially the DFS block size, and also the name node and data node related tuning parameters.
Let us see which parameters are the most important ones to decide from the performance perspective. Recall that the HDFS block size impacts how much name node memory is used, the number of map tasks that are spawned, and hence the overall performance. By default the block size is 64 megabytes; typically it can go up to 128 megabytes, and it can be changed based on the workload. If we want better performance and the file size is very large, then more than 64 megabytes is appropriate. The parameter that makes this change is the DFS block size, dfs.block.size, where we specify the block size; by default it is 64 MB, but we can increase it to 128 MB. If the block size is larger, then obviously the number of blocks will be smaller; if the number of blocks is smaller, then the amount of namespace memory required in the name node will be smaller, and the number of map tasks required to execute will also be smaller. So, there is a trade-off, and where performance is required we have to set this block size according to the application.
Another parameter is the HDFS replication factor; by default the replication factor is 3, and this parameter is set using the dfs.replication configuration property. There is a trade-off: we can lower it to reduce the replication cost, that is, if the replication factor is less than 3, the replication cost will be less. But the trade-off is that it will be less robust; robust in the sense that if some of the nodes fail and there is only one copy, no replicas of that data are available, so that data will not be available. Hence it will be less robust, and we also lose performance: if a block is replicated, the system is able to serve that data block from the replica closest to the client. So, higher replication can make data local to more workers, whereas lower replication means more free space.
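As a small hedged sketch of how these two parameters might be overridden programmatically (in practice they are usually set in hdfs-site.xml; dfs.block.size is the older spelling of the property, written dfs.blocksize in newer releases):

import org.apache.hadoop.conf.Configuration;

public class HdfsTuningSketch {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Use 128 MB blocks instead of the old 64 MB default.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        // Keep only 2 replicas instead of 3: less space used, but less robust.
        conf.setInt("dfs.replication", 2);
        return conf;
    }
}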
So far we have discussed HDFS robustness considerations. Replication across data nodes is done so that the system is rack fault tolerant: the replicas are placed across racks, so that if one rack is down, the data can still be served from another rack. The name node receives heartbeats and block reports from the data nodes; whenever a data node is down, this is detected by the name node, and that node is then not used to serve client requests. Multiple copies of the central metadata structures are maintained to deal with such common failures, and failover to a standby name node is supported, which by default is done manually.
Now, there is a trade-off between replication and robustness. The idea is that if we reduce the replication factor, it is going to affect robustness. For example, if a block is not replicated to other data nodes, and the data node containing that block fails, then the data is not available anywhere else; hence, this affects robustness. That is why replication is so important. One performance benefit is that, when you go out to run MapReduce jobs, having replicas gives additional locality possibilities. But the big trade-off is robustness: with no replica, we might lose a node, or a local disk might not be recoverable, because there is no replica. Hence, if no replica is available, it leads to data loss, and the system is not robust. Similarly, with data corruption, if we get a bad checksum we cannot recover, because we do not have any replicas; changes to the other parameters have similar effects. So, basically there is a trade-off between replication and robustness.
So in conclusion, in this lecture we have discussed HDFS and HDFS version 2, the read and write operations supported in HDFS, and the main configuration and tuning parameters, with respect to the block size and the replication factor, that govern the trade-off between HDFS performance and robustness. Thank you.
Lecture 05
Hadoop MapReduce 1.0
Lec 06: Hadoop MapReduce 2.0
(Part-I)
Hadoop MapReduce version 2.0.
Content of this lecture: in this lecture, we will discuss the MapReduce paradigm and what is new in MapReduce version 2. We will also look into the internal working and the implementation, see many examples of how different applications can be designed using MapReduce, and look into how scheduling and fault tolerance are now supported inside the MapReduce version 2 implementation.
So, programs written in this style are automatically executed in parallel on a large cluster of commodity machines. The parallel programming aspects are automatically taken care of, and the users or programmers do not have to bother about how the parallelization and the synchronization are to be managed; parallel execution is done automatically by the MapReduce framework. As far as the runtime is concerned, it takes care of the details of partitioning the data, scheduling the programs' execution across the set of machines, handling machine failures, and managing the required inter-machine communication, and this allows programmers without any experience with parallel and distributed computing to easily utilize the resources of a large distributed system.
Let us explain this entire notion of the runtime operations which MapReduce provides for the execution of these programs on large-scale commodity machines. Starting with the input data set, we assume it to be very large and stored on the cluster, that is, it may be stored on hundreds and thousands of nodes. It is stored with the help of partitioning: the data set is partitioned and stored on these nodes. Let us say that if 100 nodes are available, the entire data set is partitioned into 100 different chunks and stored in this way for computation. The next task is to schedule the program's execution across these, let us say, 100 different machines where the data set is stored. The framework then launches the programs, in the form of map and reduce tasks, which execute in parallel at all places; that is, if a hundred different nodes are involved, all hundred chunks will be processed in parallel. This is also done automatically by MapReduce, using the job tracker and task tracker that we have already seen. In the newer version, that is 2.0, this scheduling and resource allocation is done by YARN, once the request comes from the MapReduce execution for a particular job on that data set. Now, there are issues such as node failures: these nodes comprise commodity hardware, as we have seen in the previous lecture on HDFS, so nodes may fail. In the case of node failures, the framework automatically takes care of it by scheduling the tasks on alternative replicas, and the execution still carries on without any effect. The framework also has to manage the inter-machine communication: for example, when the intermediate results are available, they sometimes have to be communicated between machines; this is called 'shuffle' and 'combine'. Shuffle and combine require inter-machine communication, which we will see in more detail in the further slides.
All this allows programmers without any experience with parallel and distributed systems to use it, and it simplifies the programming paradigm: the programmer has only to write simple MapReduce programs. We will see how programmers can write MapReduce programs for different applications, and the execution is automatically taken care of by the framework. A typical MapReduce computation may process many terabytes of data on thousands of machines. Hundreds of MapReduce programs have been implemented, and upwards of 1,000 MapReduce jobs are executed on Google's clusters every day. Companies like Google and other big companies which work on big data computation use this technology; their applications are written in MapReduce, as we are going to see.
Now, we will see the motivation for MapReduce and why we are so enthusiastic about this programming paradigm, which is so widely used in big data computation. The motivation for MapReduce is to support large-scale data processing: if you want to run a program on thousands of CPUs, MapReduce is the readily available paradigm, and the companies which need large-scale data processing use this framework, which takes care of all of this. Another important motivation is that this paradigm makes the programming very simple and does not require the programmer to manage the many intricate details underneath. The MapReduce architecture also provides automatic parallelization and distribution of the data and its configuration; the parallel computation on the distributed system is all abstracted away, and the programmer is given a simple MapReduce paradigm without having to bother about these details. Another important part is failures: fault tolerance is also supported in the MapReduce architecture, and the programmer does not have to bother about it; it is automatically taken care of, provided the programmer gives sufficient configuration information so that fault tolerance can be handled for the different applications. Another thing is that input/output scheduling is automatically done: lots of optimizations are used to reduce the number of I/O operations and to improve the performance of the execution engines; that is all done automatically by the MapReduce architecture. Similarly, the monitoring of all the worker nodes, that is, of the task trackers, their execution and their status updates, is all done using client-server interaction in the MapReduce architecture; that is also taken care of.
Now, let us go into more detail of the MapReduce paradigm and see what MapReduce is from the programmer's perspective. The terms map and reduce are borrowed from functional programming languages; Lisp, for example, is one such programming language which has this kind of map and reduce feature. So, let us look at the corresponding functionality in a functional programming language, and let us say that we want to write a program for calculating the sum of squares of a given list. Assume that the list we are given is (1 2 3 4), a list of numbers, and we want to find the squares of these numbers. There is a function called 'square', and map says that this square function is to be applied on each and every element of the list. The output will be:
(map square ‘(1 2 3 4))
Output: (1 4 9 16)
[processes each record sequentially and independently]
So, all the squares are computed and the output is again given in the form of a list. This square operation may be executed in parallel on all the elements, and the result is produced very efficiently; it is not that the squares are computed sequentially one by one. The map processes each record independently rather than sequentially; hence it is a parallel execution of the map function, which performs the square operation on all the elements provided to it. This was to calculate the squares; now, to compute the sum, another routine is required to make the summation of this intermediate result, which is called the 'intermediate result'. With this as input, we perform another operation, which is called 'reduce'. Reduce applies an addition operation on this list, which is nothing but the intermediate result calculated by the map function, and you can see that the sum is a binary operator: it does the summation two elements at a time, and once all the values are combined, the final output is the required sum of the squares.
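The same map-then-reduce structure can be sketched outside Lisp as well; for instance, a rough Java streams equivalent (purely illustrative, not part of Hadoop) looks like this:

import java.util.Arrays;
import java.util.List;

public class SumOfSquaresSketch {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        int sumOfSquares = input.stream()
                .map(x -> x * x)              // map: square each element -> (1 4 9 16)
                .reduce(0, Integer::sum);     // reduce: add the intermediate values -> 30
        System.out.println(sumOfSquares);     // prints 30
    }
}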
Given this kind of functionality in a functional programming language, we will see how this map and reduce idea is applied in the Hadoop scenario. Let us look at a small, simple application: a sample application of word count. That means, let us say we have a large collection of documents, for example a dump of Shakespeare's works (it could be any important author's works); we are going to process this huge data set, and we are asked to list the count of each of the words in these documents. A query could then be, for example, to find out how often a particular character is referenced in Shakespeare's works, and which character is more important based on how many times Shakespeare referred to that character.
Now we will see the next phase, which is the 'reduce' operation. After the map, the intermediate values, which are generated in the form of key-value pairs, are used as the input for the reduce function. Reduce processes these intermediate values output by the map function and merges all the intermediate values associated with the same key; that means, based on the equality of the keys, they are all combined together, which is the group-by-key operation, and on each group the computation specified by the reduce function is applied to get the final output. For example, if this is the input to the reduce function and we want to do a word count, then it will group by the key; when you group by the key, 'everyone' appears twice, so with 'everyone' there are two values, and these will be added up by the reduce function. Similarly, 'hello' appears once, so its group has only one tuple, and similarly for 'welcome'. This is how the reduce function executes.
Each key is assigned to one reduce task. Here, if there are two different nodes which run the reduce tasks, let us say reduce task 1 and reduce task 2, then each key is assigned to one reduce task; for example, if 'everyone' is assigned to task 1, then when the same key 'everyone' comes a second time it must also go to that same reduce task, whereas 'welcome' and 'hello' can be assigned to reduce task 2. The reduce tasks process in parallel and merge all the intermediate values by partitioning the keys. But how are these keys partitioned? A simple example is hash partitioning: if you have a hash function and you apply it to a particular key, the resulting value is taken modulo the number of reduce tasks; here, let us assume the number of reduce tasks is 2, so the partition is hash(key) mod 2, and this generates the partitioning. We have shown partitioning through a simple example, hash partitioning; it can also be much more complicated, depending on the application, on how the partitioning is to be done, and on how many reduce tasks are allocated by YARN. After partitioning, the intermediate values are merged and the result is given back.
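A minimal sketch of this hash partitioning rule (it mirrors the idea behind Hadoop's default HashPartitioner; the masking with Integer.MAX_VALUE just keeps the result non-negative):

public class HashPartitionSketch {
    // Assign a key to one of numReduceTasks partitions: hash(key) mod R.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[]{"everyone", "welcome", "hello"}) {
            // The same key always lands on the same reduce task.
            System.out.println(key + " -> reduce task " + partitionFor(key, 2));
        }
    }
}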
In the programming model, the computation takes a set of input key-value pairs and produces a set of output key-value pairs. To do this, the user specifies, using the MapReduce library, the two functions, map and reduce, which we have already explained.
(Part-II)
Let us see some of the applications of MapReduce. The programs are already available and in use in production environments by different companies. Here are a few simple and interesting programs that have already been expressed as MapReduce computations. The first one is called 'distributed grep'. Here, the map function emits a line if it matches a supplied pattern; for example, if a document is given and we have supplied one pattern, then all the lines in that document where the pattern appears will be emitted, and the reduce function is an identity function that just copies the supplied intermediate values to the output. That is called 'distributed grep'. The difference between grep and distributed grep is that here the document can be very big and cannot fit into one system's memory; therefore, a document which is distributed and stored across data nodes can still be searched with this grep operation. So, it filters and extracts only those lines of the documents containing the pattern we are interested in; that is called 'distributed grep', and it has various applications.
The next application is the count of URL access frequency. To do this, the map function processes the log of web page requests and outputs (URL, 1); that means the map function inspects the log of web page requests, and for every URL it encounters it emits a 1 in the map phase. The reduce function then combines: it collects all the values for the same URL, that is, it groups by key, and sums up how many 1s there are; it just has to count them. So, it is just like the word count program, an extension of word count, which finds the URL access frequency, that is, how many times a URL is referred to in a particular log file.
Another application of MapReduce is called the 'reverse web-link graph'. For example, there are web pages pointing to each other; let us say these are pages A, B and C, and we want to find out, for web page C, how many different pages are pointing to it. We are given (source, target) pairs: A points to C, B points to A, and B points to C. Out of this, we now have to find, for a particular web page, how many links are pointing to it; this is called the 'reverse web-link graph'. The map function, given this as input, outputs the (target, source) pair: for the link (A, C) the target is C and the source is A, so the map emits (C, A); similarly for (B, C) it emits (C, B), and for (B, A) it emits (A, B), one pair for each link to a target URL found in a page named source. The reduce function then concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(sources)). For example, here C appears two times, so for C the list becomes (A, B); this is given to the reduce function, which takes the target C and outputs its list of sources. So, page C is pointed to by A and B, which is computed here by MapReduce, and web page A is pointed to by only one page, B. In this way, the popularity of pages can also be calculated: if you want a count, you just take the size of the list, which becomes 2 for C and 1 for A, and this kind of output can be used in computing the PageRank.
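A rough Hadoop MapReduce sketch of this reverse web-link graph follows (not code from the lecture; the input lines are assumed, for illustration, to be "source target" pairs separated by whitespace, and the job driver is omitted):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseWebLinkSketch {

    // Map: for an input line "source target", emit (target, source).
    public static class LinkMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().trim().split("\\s+");
            if (parts.length == 2) {
                context.write(new Text(parts[1]), new Text(parts[0]));
            }
        }
    }

    // Reduce: for each target, concatenate the list of sources pointing to it.
    public static class SourceListReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text target, Iterable<Text> sources, Context context)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text source : sources) {
                if (list.length() > 0) list.append(",");
                list.append(source.toString());
            }
            context.write(target, new Text(list.toString()));   // e.g. C -> A,B
        }
    }
}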
Another application is the inverted index, which essentially all search engines compute. Let us see what this application is and how MapReduce can be used to program it. Here, the map function parses each document and emits a sequence of (word, document ID) pairs; the reduce function accepts all the pairs for a given word, sorts the corresponding document IDs and emits a (word, list(document ID)) pair. The set of all output pairs forms a simple inverted index, and it is easy to augment this computation to keep track of word positions as well. In the later slides we will see more detail of the MapReduce program for this application. Search engines do exactly this: when we type a keyword into Google, it gives the list of web pages, that is, the documents in which the keyword appears, and that lookup is performed using the inverted index. Since the number of documents is huge, search engines like Google and Microsoft compute the inverted index ahead of time, and whenever the user searches, the engine checks this inverted index and returns the result; a sketch of the map and reduce functions is given below. Similarly, there is distributed sort: the map function extracts the key from each record and emits the (key, record) pair, and the reduce function emits all the pairs unchanged. Internally, when the map emits the (key, record) pairs, they are provided in sorted order per partition, and if nothing different happens in the shuffle phase and the output is passed on as it is, then the outcome of the map phase is emitted unchanged by the reduce function, and this performs the distributed sort.
We will also see in more detail how the distributed sort application is done in MapReduce.
We are now going into a little more detail of these applications of MapReduce, some of which we have already summarized. Let us take the example of distributed grep and see how, using MapReduce, we can perform this distributed grep operation. We assume that the input is a large collection of files, and the output we want is the lines of the files that match a given pattern. The map function will emit a line if it matches the supplied pattern. Here things are quite simple: whenever a line matches the pattern in the map function, the map emits only that line, and the reducer does not have to do anything; it just copies all the intermediate data given by the map function to the output.
Another application we will look at is sorting. Here, the input is given in the form of <key, value> pairs and we want the values to be output in sorted order. This particular program, as we have shown, is quite simple: whatever <key, value> input is given to the map function, it outputs it, and the reducer's job is also just to output these <key, value> pairs as they are. In this process, when the map outputs these values, they are already sorted within each partition; normally a quicksort is done on the map side during the shuffle phase, and the reduce side uses a merge sort. So, quicksort and merge sort together perform the sort, and we do not have to do much. We do have to be careful with the partitioning function during the sort: the keys must be partitioned across the reducers based on ranges, and you cannot use hashing, otherwise it will disturb the sorted order.
Example 1: word count using MapReduce. Here we see the structure of the map and reduce functions which will do the word count in a given document.
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
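For reference, a hedged, self-contained version of this word count in the Hadoop Java API might look roughly like the following; it mirrors the widely used example shipped with Hadoop, and the input and output paths are taken from the command-line arguments:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(key, value): for each word w in value, emit (w, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(key, values): sum the counts for each word and emit (word, sum)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}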
Let us see this illustrated through a running example. In this example we consider a document; let us say this is the name of the document and this is its text. When the name of the document and the text are given to the map function, then, as each word appears in the text, the map emits the pair (w, 1), where w is the word and 1 is the value, as in the program above. After emitting these (word, 1) pairs, the shuffle phase collects the same words together, sorts them, and passes them on to the reduce phase. Now it depends upon how many reducers we have; let us assume we have two different reducers, so, for example, 'bob' and 'run' go to one reducer while 'spot' and 'throw' go to the other. For a word such as 'see', which appears twice, the reducer receives two values; it goes through them with the iterator and makes the summation.
Now, given this particular document, if we color the words according to a word-length categorization, the document will look like this, and we have to compute the word-length histogram.
Now we will see how to perform the relational join operation using MapReduce. To understand the relational join, let us take an example first and then see how it is done using MapReduce. Let us say 'Employee' is a table having the attributes name and SSN, and another table, called 'Assigned Departments', has the attributes employee SSN and department name. If we want to join Employee and Assigned Departments on SSN = EmpSSN, we see that the employee named Sue has a matching EmpSSN with Accounts, so if we join them this particular tuple is generated. What about the other one? That SSN has two matches, so two different tuples are generated accordingly, in which Sales and Marketing are reflected. Now let us see how we achieve this using MapReduce.
Before going into details, we have to understand that map (and reduce) is a unary operation, whereas the join is a binary operation: it requires two tables, table 1 and table 2. So how can MapReduce, which operates over one input, be used to do a relational join? What we do is consider the entries, that is, the tuples of both tables, as one collection of tuples, and we attach the identity, the name of the table, to each tuple. In this way we list out all the tuples which are there in the different tables, each tagged with its table name. Once this becomes one complete data set, we can perform the join operation easily. How does the join happen? The join happens around a particular key, here the SSN number, so the SSN number becomes the key, and the map will emit the key and the value: the key is the SSN number and the value is the entire tuple together with its table tag. As far as the reduce is concerned, it groups by this key, and then, within each group, it iterates over the tuples, and if the table tags are different, it produces the joined tuples; a sketch follows.
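A rough sketch of this reduce-side join in the Hadoop API (purely illustrative: it assumes the input lines are comma-separated and already tagged with the table name as the first field, for example "EMP,Sue,123" and "DEPT,123,Accounts"; the job driver is omitted):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoinSketch {

    // Map: tag each tuple with its table and emit (SSN, tagged tuple).
    // "EMP,Sue,123"       -> key 123, value "EMP,Sue"
    // "DEPT,123,Accounts" -> key 123, value "DEPT,Accounts"
    public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            if (f[0].equals("EMP")) {
                context.write(new Text(f[2]), new Text("EMP," + f[1]));
            } else if (f[0].equals("DEPT")) {
                context.write(new Text(f[1]), new Text("DEPT," + f[2]));
            }
        }
    }

    // Reduce: group by SSN, then pair every employee with every department.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text ssn, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> names = new ArrayList<>();
            List<String> depts = new ArrayList<>();
            for (Text v : values) {
                String[] f = v.toString().split(",", 2);
                if (f[0].equals("EMP")) names.add(f[1]); else depts.add(f[1]);
            }
            for (String name : names) {
                for (String dept : depts) {
                    context.write(ssn, new Text(name + "," + dept));   // joined tuple
                }
            }
        }
    }
}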
A -> B C D
B -> A C D E
...and so on for C, D and E.
So, each person, followed by an arrow and the list of his or her friends, is given as the input to this particular program. After the map and shuffle phases, the grouped intermediate pairs for each pair of friends look like this:
(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)