
SSPM’s College Of Engineering, Kankavli

Class: BE COMP Subject: BDA

Module 2

Hadoop HDFS and Map Reduce


2.1 Distributed File Systems
A file system stores data permanently. It presents logical drives layered on top of a
physical storage medium. A file is addressed by its name under a directory that supports
hierarchical nesting, and is accessed through a file path consisting of drive, directory(s) and
filename.
A distributed file system (DFS) supports access to files stored on remote servers. It also
offers support for replication and local caching. Concurrent read/write access to files has to
be handled using locking mechanisms. Different implementations are available depending on
the complexity of the application.
1. Google File System
Google had to store a massive amount of data, so it needed a good DFS built from cheap
commodity computers to reduce cost. These commodity computers are unreliable, hence
redundant storage is required to tolerate failures. Most files in the Google File System (GFS)
are written only once and sometimes appended, but GFS must allow large streaming reads,
so high sustained throughput matters more than low latency.
File sizes are typically in gigabytes and are stored as chunks of 64 MB each. Each of
these chunks is replicated thrice to avoid information loss due to the failure of the commodity
hardware. These chunks are centrally managed through a single master that stores the
metadata information about the chunks. Metadata stored on the master has file and chunk
namespaces, namely, mapping of file to chunks and location of the replicas of each chunk.
Since Google workloads mostly perform streaming reads of large data sets, caching
offers little or no benefit.
What if the master fails? The master is replicated in a shadow master. Master
involvement is further reduced by not moving data through it; metadata from the master is
cached at clients. The master chooses one replica of each chunk as the primary and delegates
to it the authority for handling data mutations.

2. Hadoop Distributed File System


‐ HDFS is very similar to GFS.
‐ Here, the master is called NameNode and shadow master is called Secondary
NameNode.
‐ Chunks are called blocks and chunk server is called DataNode. DataNode stores and
retrieves blocks, and also reports the list of blocks it is storing to NameNode.
‐ Unlike GFS, only a single writer per file is allowed, and no record-append operation is
possible. Since HDFS is open source, interfaces and libraries for different file systems
are provided.


Physical Organization of Compute Nodes


The new parallel-computing architecture, sometimes called cluster computing, is
organized as follows.
‐ Compute nodes are stored on racks, perhaps 8–64 on a rack.
‐ The nodes on a single rack are connected by a network, typically gigabit Ethernet.
There can be many racks of compute nodes, and racks are connected by another level
of network or a switch. The bandwidth of inter-rack communication is somewhat
greater than that of the intra-rack Ethernet, but given the number of pairs of nodes
that might need to communicate between racks, this bandwidth may be essential.
There may be many racks and many compute nodes per rack.

Some important calculations take minutes or even hours on thousands of compute nodes. If
we had to abort and restart the computation every time one component failed, then the
computation might never complete successfully. The solution to this problem takes two
forms:
‐ Files must be stored redundantly.
‐ Computations must be divided into tasks, such that if any one task fails to execute to
completion, it can be restarted without affecting other tasks.

Large-Scale File-System Organization


To exploit cluster computing, files must look and behave somewhat differently from the
conventional file systems found on single computers. This new file system, often called a
distributed file system or DFS (although this term had other meanings in the past), is typically
used as follows.
‐ Files can be enormous, possibly a terabyte in size. If you have only small files, there is
no point using a DFS for them.
‐ Files are rarely updated. Rather, they are read as data for some calculation, and
possibly additional data is appended to files from time to time.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks are
replicated, perhaps three times, at three different compute nodes. Moreover, the nodes
holding copies of one chunk should be located on different racks, so we don’t lose all copies


due to a rack failure. Normally, both the chunk size and the degree of replication can be
decided by the user.
To find the chunks of a file, there is another small file called the master node or name
node for that file. The master node is itself replicated, and a directory for the file system as a
whole knows where to find its copies. The directory itself can be replicated, and all
participants using the DFS know where the directory copies are.

2.2 Map-Reduce
Map-reduce is a style of computing that has been implemented several times. An
implementation of map-reduce can be used to manage many large-scale computations in a
way that is tolerant of hardware faults. All you need to write are two functions, called Map
and Reduce, while the system manages the parallel execution and the coordination of the
tasks that execute Map or Reduce. In brief, a map-reduce computation executes as follows:
1. Some number of Map tasks each are given one or more chunks from a distributed file
system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-
value pairs are produced from the input data is determined by the code written by the user
for the Map function.
2. The key-value pairs from each Map task are collected by a master controller and sorted
by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same
key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time, and combine all the values associated
with that key in some way. The manner of combination of values is determined by the code
written by the user for the Reduce function.

Fig: Map-Reduce computation


The Map Tasks


We view input files for a Map task as consisting of elements, which can be any type: a
tuple or a document, for example. A chunk is a collection of elements, and no element is
stored across two chunks. Technically, all inputs to Map tasks and outputs from Reduce tasks
are of the key-value-pair form, but normally the keys of input elements are not relevant and
we shall tend to ignore them. Insisting on this form for inputs and outputs is motivated by the
desire to allow composition of several map-reduce processes.
A Map function is written to convert input elements to key-value pairs. The types of
keys and values are each arbitrary. Further, keys are not “keys” in the usual sense; they do
not have to be unique. Rather a Map task can produce several key-value pairs with the same
key, even from the same element.

Grouping and Aggregation


Grouping and aggregation are done the same way, regardless of what the Map and
Reduce tasks do. The master controller process knows how many Reduce tasks there will be,
say r such tasks. The user typically tells the map-reduce system what r should be. Then the
master controller normally picks a hash function that applies to keys and produces a bucket
number from 0 to r − 1. Each key that is output by a Map task is hashed and its key-value pair
is put in one of r local files. Each file is destined for one of the Reduce tasks. After all the Map
tasks have completed successfully, the master controller merges the files from each Map task
that are destined for a particular Reduce task and feeds the merged file to that process as a
sequence of key-list-of-values pairs. That is, for each key k, the input to the Reduce task that
handles key k is a pair of the form (k, [v1, v2, ..., vn]), where (k, v1), (k, v2), ..., (k, vn) are all
the key-value pairs with key k coming from all the Map tasks.

The Reduce Tasks


The Reduce function is written to take pairs consisting of a key and its list of associated
values and combine those values in some way. The output of a Reduce task is a sequence of
key-value pairs consisting of each input key k that the Reduce task received, paired with the
combined value constructed from the list of values that the Reduce task received along with
key k. The outputs from all the Reduce tasks are merged into a single file.

Combiners
When the Reduce function is both associative and commutative (e.g., sum, max, min),
some of the work of the Reduce function can be assigned to a combiner. Instead of sending
all the Mapper output to the Reducers, some values are computed on the Map side itself
using combiners and only then sent to the Reducer. This reduces the input-output operations
between Mapper and Reducer. A combiner takes the Mapper output as its input and
combines the values with the same key, reducing the number of pairs (the key space) that
must be sorted. For example, if a particular word w appears k times among all the documents
assigned to a Map process, the Map execution produces k key-value pairs (w, 1); these can
be grouped into a single pair (w, k), since the addition performed in the Reduce task satisfies
the associative and commutative properties. The figure below shows word count using the
MapReduce algorithm, with all the intermediate values obtained in mapping, shuffling and
reducing.
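As an illustration, here is a minimal Python sketch (not from the original notes) of what a
combiner does on the map side, assuming the reduce operation is an associative and
commutative sum:

from collections import defaultdict

# Hypothetical combiner: pre-aggregates (word, 1) pairs on the map side.
# Valid only because summing is associative and commutative.
def combine(pairs):
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count          # (w, 1), (w, 1), ... collapses to (w, k)
    return list(partial.items())

map_output = [("fox", 1), ("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
print(combine(map_output))              # [('fox', 2), ('the', 3)] - fewer pairs to shuffle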


Algorithm: Pseudo Code for Word Count Using MapReduce Algorithm

Word Count using MapReduce

Goal: count the occurrences of each word in the input.
Input: M = a large corpus of text.
Mapper:
    for each word w in M → emit (w, 1)
Reducer:
    for all pairs (w, v_i) with the same word w → emit (w, Σ_i v_i)
This is the “aggregate” pattern in Hadoop.
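A minimal runnable Python sketch of this algorithm (an in-memory simulation, not Hadoop
code; the corpus is made up for illustration):

from collections import defaultdict

corpus = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: each word w becomes a (w, 1) key-value pair.
pairs = [(w, 1) for line in corpus for w in line.split()]

# Shuffle: group the counts by word (the sort/merge done by the system).
groups = defaultdict(list)
for w, one in pairs:
    groups[w].append(one)

# Reduce: sum the list of counts for each word.
counts = {w: sum(ones) for w, ones in groups.items()}
print(counts)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}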

Details of Map-Reduce Execution


Let us now consider in more detail how a program using map-reduce is executed.
Taking advantage of a library provided by a map-reduce system such as Hadoop, the user
program forks a Master controller process and some number of Worker processes at different
compute nodes. Normally, a Worker handles either Map tasks or Reduce tasks, but not both.
The Master has many responsibilities. One is to create some number of Map tasks and some
number of Reduce tasks, these numbers being selected by the user program. These tasks will
be assigned to Worker processes by the Master. It is reasonable to create one Map task for
every chunk of the input file(s), but we may wish to create fewer Reduce tasks. The reason for
limiting the number of Reduce tasks is that it is necessary for each Map task to create an
intermediate file for each Reduce task, and if there are too many Reduce tasks the number of
intermediate files explodes. A Worker process reports to the Master when it finishes a task, and
a new task is scheduled by the Master for that Worker process. Each Map task is assigned one
or more chunks of the input file(s) and executes on it the code written by the user. The Map
task creates a file for each Reduce task on the local disk of the Worker that executes the Map
task. The Master is informed of the location and sizes of each of these files, and the Reduce
task for which each is destined. When a Reduce task is assigned by the master to a Worker


process, that task is given all the files that form its input. The Reduce task executes code written
by the user and writes its output to a file that is part of the surrounding distributed file system.

Fig: Overview of the execution of a map-reduce program

Coping With Node Failures


The worst thing that can happen is that the compute node at which the Master is
executing fails. In this case, the entire map-reduce job must be restarted. But only this one node
can bring the entire process down; other failures will be managed by the Master, and the map-
reduce job will complete eventually.
Suppose the compute node at which a Map worker resides fails. This failure will be
detected by the Master, because it periodically pings the Worker processes. All the Map tasks
that were assigned to this Worker will have to be redone, even if they had completed. The
reason for redoing completed Map tasks is that their output destined for the Reduce tasks
resides at that compute node, and is now unavailable to the Reduce tasks. The Master sets the
status of each of these Map tasks to idle and will schedule them on a Worker when one becomes
available. The Master must also inform each Reduce task that the location of its input from that
Map task has changed. Dealing with a failure at the node of a Reduce worker is simpler. The
Master simply sets the status of its currently executing Reduce tasks to idle. These will be
rescheduled on another reduce worker later.

Algorithms Using MapReduce: Matrix Multiplication by MapReduce

MapReduce is a technique in which a huge program is subdivided into small tasks
that run in parallel, to make computation faster and save time; it is mostly used in
distributed systems. It has two important parts:
 Mapper: It takes raw input data and organizes it into key-value pairs. For example,
in a dictionary you search for the word “Data” and its associated meaning is
“facts and statistics collected together for reference or analysis”. Here the Key
is “Data” and the Value associated with it is “facts and statistics collected
together for reference or analysis”.
 Reducer: It is responsible for processing the data in parallel and producing the
final output.

Matrix multiplication using MapReduce

1. Map Task
Convert the entire input into key-value pairs.
According to the algorithm:
For matrix A:
    Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
For matrix B:
    Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i

2. Combiner Task
Combine the key-value pairs to reduce network congestion:
identify each key and create the group of values associated with it.

3. Reducer Task
The formula for the Reducer is:
    (i, k) => Σ over j of (Aij * Bjk)
    Output => ((i, k), sum)
Example

Let us consider a matrix multiplication example to visualize MapReduce. Consider
the following 2×2 matrices A and B:

    A = | 1  2 |        B = | 5  6 |
        | 3  4 |            | 7  8 |

1. Map Task
Computing the mapper for Matrix A:
Mapper for Matrix A (k, v)=((i, k), (A, j, Aij)) for all k

Aij = A00 = 1 i.e. i=0 and j=0 and k will vary from 0 to 1
((0, 0), (A, 0, 1)) k=0


((0, 1), (A, 0, 1)) k=1

Aij = A01 = 2 i.e. i=0 and j=1 and k will vary from 0 to 1
((0, 0), (A, 1, 2)) k=0
((0, 1), (A, 1, 2)) k=1

Aij = A10 = 3 i.e. i=1 and j=0 and k will vary from 0 to 1
((1, 0), (A, 0, 3)) k=0
((1, 1), (A, 0, 3)) k=1

Aij = A11 = 4 i.e. i=1 and j=1 and k will vary from 0 to 1
((1, 0), (A, 1, 4)) k=0
((1, 1), (A, 1, 4)) k=1

Computing the mapper for Matrix B:


Mapper for Matrix B (k, v)=((i, k), (B, j, Bjk)) for all i

Bjk = B00 = 5 i.e. j=0 and k=0 and i will vary from 0 to 1
((0, 0), (B, 0, 5)) i=0
((1, 0), (B, 0, 5)) i=1

Bjk = B01 = 6 i.e. j=0 and k=1 and i will vary from 0 to 1
((0, 1), (B, 0, 6)) i=0
((1, 1), (B, 0, 6)) i=1

Bjk = B10 = 7 i.e. j=1 and k=0 and i will vary from 0 to 1
((0, 0), (B, 1, 7)) i=0
((1, 0), (B, 1, 7)) i=1

Bjk = B11 = 8 i.e. j=1 and k=1 and i will vary from 0 to 1
((0, 1), (B, 1, 8)) i=0
((1, 1), (B, 1, 8)) i=1
2. Combiner Task:
(0, 0): (A, 0, 1), (A, 1, 2), (B, 0, 5), (B, 1, 7)
(0, 1): (A, 0, 1), (A, 1, 2), (B, 0, 6), (B, 1, 8)
(1, 0): (A, 0, 3), (A, 1, 4), (B, 0, 5), (B, 1, 7)
(1, 1): (A, 0, 3), (A, 1, 4), (B, 0, 6), (B, 1, 8)

3. Reducer Task:
For each key (i, k), match the A and B entries on their j value and sum the products.

(0, 0): (A, 0, 1), (A, 1, 2) and (B, 0, 5), (B, 1, 7)
        1*5 + 2*7 = 5 + 14 = 19

(0, 1): (A, 0, 1), (A, 1, 2) and (B, 0, 6), (B, 1, 8)
        1*6 + 2*8 = 6 + 16 = 22

(1, 0): (A, 0, 3), (A, 1, 4) and (B, 0, 5), (B, 1, 7)
        3*5 + 4*7 = 15 + 28 = 43

(1, 1): (A, 0, 3), (A, 1, 4) and (B, 0, 6), (B, 1, 8)
        3*6 + 4*8 = 18 + 32 = 50

Therefore the final matrix is:

    C = | 19  22 |
        | 43  50 |


Pseudocode (Map task and Reduce task)
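Below is a minimal Python sketch (an in-memory simulation, not Hadoop code) of the Map,
group, and Reduce steps for the 2×2 example above:

from collections import defaultdict

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
n = 2  # matrices are n x n

# Map: A[i][j] is emitted for every k; B[j][k] is emitted for every i.
groups = defaultdict(list)
for i in range(n):
    for j in range(n):
        for k in range(n):
            groups[(i, k)].append(('A', j, A[i][j]))
for j in range(n):
    for k in range(n):
        for i in range(n):
            groups[(i, k)].append(('B', j, B[j][k]))

# Reduce: for each cell (i, k), match A and B entries on j and sum the products.
C = [[0] * n for _ in range(n)]
for (i, k), vals in groups.items():
    a = {j: v for tag, j, v in vals if tag == 'A'}
    b = {j: v for tag, j, v in vals if tag == 'B'}
    C[i][k] = sum(a[j] * b[j] for j in a)
print(C)  # [[19, 22], [43, 50]]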


Relational-Algebra Operations
1. Computing Selections by MapReduce
2. Computing Projections by MapReduce
3. Union
4. Intersection
5. Difference

Fig: Actual storage of a table on distributed file system


Hash Function

A hash function can be something like:

1. Take a key.
2. Typecast it to a string.
3. Sum up the ASCII value of each character in the string.
4. Take the sum modulo the number of reduce workers; this value is the hash value for that
key.
In this case we just want a hash function that distributes the work evenly among the reduce
workers. Even a high number of collisions is fine, since we are not using the hash function to
build a map that allows fast search; we just want the data partitioned into n buckets, where
n is the number of reduce workers, while making sure that data for the same key goes to the
same reduce worker across all worker nodes.
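A minimal Python sketch of this partitioning recipe (the function name is made up for
illustration):

# Hypothetical partitioning function following the recipe above.
def partition(key, r):
    # Typecast the key to a string, sum the ASCII values of its
    # characters, and take the sum modulo the number of reduce workers r.
    return sum(ord(ch) for ch in str(key)) % r

# The same key always lands on the same reduce worker, on every map worker.
print(partition((1, 2), 2))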

1. Selection Using Map Reduce


To perform selections using map reduce we need the following Map and Reduce functions:
 Map Function: For each tuple t in R, test if it satisfies condition C. If so, produce the
key–value pair (t, t). That is, both the key and value are t.
 Reduce Function: The Reduce function is the identity. It simply passes each key–value
pair to the output.


Algorithm

Pseudo Code for Selection

1. Map Task
map(key, value):
    for tuple in value:
        if tuple satisfies condition:
            emit(tuple, tuple)

2. Reduce Task
reduce(key, values):
    emit(key, key)
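A minimal Python simulation of this algorithm, using a made-up relation of (A, B) tuples and
the condition B <= 3 (the same condition as the example that follows):

from collections import defaultdict

R = [(1, 2), (2, 2), (3, 3), (4, 4), (5, 5)]   # tuples (A, B)

# Map: emit (t, t) for every tuple t that satisfies B <= 3.
pairs = [(t, t) for t in R if t[1] <= 3]

# Shuffle: group values by key (the keys are the tuples themselves here).
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# Reduce: identity - pass each key through to the output.
print(list(groups))   # [(1, 2), (2, 2), (3, 3)]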
Example
Select all the rows where the value of B is less than or equal to 3, i.e. Selection(B <= 3).
Let's consider the data initially distributed as files across the map workers, as in the following
figure.

Initial data distributed in files across map workers, representing a single table

After applying the Map function (and grouping; there are no common keys in this case, as
each row is unique) we get the output below. The tuples are constructed with the 0th index
containing values from column A and the 1st index containing values from column B. In actual
implementations this information can be sent either as extra metadata or within each value
itself, making keys and values look something like ({A: 1}, {B: 2}), which does look somewhat
inefficient.


Data after applying the Map function, which kept only rows with a B value of at most 3

After this, based on the number of reduce workers (2 in our case), a hash function is
applied as explained in the Hash Function section. The files for the reduce workers, created
at the map workers, will look like:

Files for reduce workers created at map workers based on the hash function

After this step, the files destined for reduce worker 1 are sent to that worker, and those
for reduce worker 2 to the other. The data at the reduce workers will look like:


Data at reduce workers sent from map workers

The final output, after applying the Reduce function, which ignores the keys and just
considers the values, will look like:

Output of selection(B ≤ 3)

The point to note here is that we don't really need to shuffle data across the nodes.
We can just execute the Map function and save the values to the output from the map
workers themselves. This makes selection an efficient operation (compared to the others,
where the Reduce function does real work).
Projection Using Map Reduce
Algorithm

Let S be the subset of attributes to be selected.

1. Map Task
map(key, value):
    for tuple in value:
        ts = tuple with only the attributes in S
        emit(ts, ts)

2. Reduce Task
reduce(key, values):
    emit(key, key)
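A minimal Python simulation of projection onto (A, B) for tuples (A, B, C), as in the example
below (data made up for illustration):

from collections import defaultdict

R = [(1, 2, 3), (1, 2, 4), (5, 6, 7)]   # tuples (A, B, C)

# Map: keep only the attributes in S = {A, B} and emit (ts, ts).
groups = defaultdict(list)
for t in R:
    ts = t[:2]
    groups[ts].append(ts)

# Reduce: emit each key once, eliminating the duplicates projection creates.
print(list(groups))   # [(1, 2), (5, 6)]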

Initial Data distributed on map workers

After applying the Map function (ignoring the values in column C) and grouping the keys, the
data will look like:

The keys will be partitioned using a hash function as was the case in selection. The data will
look like:

Files generated for reduce workers


The data at the reduce workers will be:

Data at reduce workers


At the reduce node the keys will be aggregated again, as the same keys might have occurred
at multiple map workers. As we already know, the Reduce function operates on the values of
each key only once.

Data after aggregation by key at reduce workers

The Reduce function is applied; it considers only the first value of the values list and ignores
the rest of the information.


Output of projection(A, B)

The point to remember is that here the Reduce function is required only for duplicate
elimination. If duplicates need not be removed (as in SQL, which keeps duplicates by default),
we can get rid of the Reduce step, meaning we don't have to move data around. So this
operation, too, can be implemented without actually passing data around.

Union Using Map Reduce


Pseudo Code for Union
map(key, value):
    for tuple in value:
        emit(tuple, tuple)

reduce(key, values):
    emit(key, key)
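A minimal Python simulation of union; grouping by tuple removes the duplicates (data made
up for illustration):

from collections import defaultdict

R = [(1, 2), (2, 3)]
S = [(2, 3), (4, 5)]

# Map: emit (t, t) for every tuple of both relations.
groups = defaultdict(list)
for t in R + S:
    groups[t].append(t)

# Reduce: emit each distinct key once.
print(list(groups))   # [(1, 2), (2, 3), (4, 5)]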

Initial data at map workers


After applying the map function and grouping the keys we will get output as:


Map and grouping the keys

The data to be sent to reduce workers will look like:

Files to be sent to reduce workers

The data at the reduce workers will be:

Files At reduce workers


At the reduce workers, aggregation on the keys is performed.


Aggregated data at reduce workers

The final output, after applying the Reduce function, which takes only the first value and
ignores everything else, is as follows:

Final table after union

Here we note that, as with projection, this can be done without moving data around if
we are not interested in removing duplicates. Hence this operation is also efficient in terms
of data shuffled across machines.
Intersection Using Map Reduce
Pseudo Code for Intersection
map(key, value):
    for tuple in value:
        emit(tuple, tuple)

reduce(key, values):
    if values == [key, key]:
        emit(key, key)
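A minimal Python simulation of intersection; a tuple is kept exactly when both relations
contributed it, i.e. its values list has length 2 (data made up for illustration):

from collections import defaultdict

R = [(1, 2), (2, 3), (3, 4)]
S = [(2, 3), (3, 4), (4, 5)]

# Map: emit (t, t) from both relations; the shuffle groups them by tuple.
groups = defaultdict(list)
for t in R + S:
    groups[t].append(t)

# Reduce: emit a tuple only if it appeared in both relations.
print([k for k, vs in groups.items() if len(vs) == 2])   # [(2, 3), (3, 4)]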


The data at the reduce workers, just before the Reduce function is applied, looks like:

Data at reduce workers


Now we just apply the Reduce operation, which outputs a row only if its values list has a length of 2.

Output of intersection
Difference Using Map Reduce
Pseudo Code for Difference
map(key, value):
    if key == R:
        for tuple in value:
            emit(tuple, R)
    else:
        for tuple in value:
            emit(tuple, S)

reduce(key, values):
    if values == [R]:
        emit(key, key)
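A minimal Python simulation of the difference R − S; the Map tags each tuple with the
relation it came from (data made up for illustration):

from collections import defaultdict

R = [(1, 2), (2, 3), (3, 4)]
S = [(2, 3), (4, 5)]

# Map: emit (t, 'R') for tuples of R and (t, 'S') for tuples of S.
groups = defaultdict(list)
for t in R:
    groups[t].append('R')
for t in S:
    groups[t].append('S')

# Reduce: keep a tuple only if its tag list is exactly ['R'].
print([k for k, tags in groups.items() if tags == ['R']])   # [(1, 2), (3, 4)]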


Initial Data

After applying the Map function and grouping the keys, the data looks like the following
figure.

Data after applying map function and grouping keys

Files for the reduce workers are then created based on hashing the keys, as has been the
case so far.

Files for reduce workers


The data at the reduce workers will look like

Files at reduce workers

After aggregation of the keys at reduce workers the data looks like:

Data after aggregation of keys at reduce workers

The final output is generated by applying the Reduce function over this data.

Output of difference of the tables


For the difference operation we notice that we cannot get rid of the Reduce part, and
hence have to send data across the workers, because the context of which table a value came
from is needed. Hence it is a more expensive operation compared to selection, projection,
union and intersection.
Grouping and Aggregation Using Map Reduce
Usually, understanding grouping and aggregation takes a bit of time when learning
SQL, but not when we understand these operations using map reduce. The logic is already
built into the way Map works: map workers implicitly group keys, and the Reduce function
acts upon the aggregated values to generate the output.

Pseudo Code for Grouping and Aggregation

map(key, value):
    for (a, b, c) in value:
        emit(a, b)

reduce(key, values):
    emit(key, theta(values))
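A minimal Python simulation of the example below: group by (A, B) and sum C for tuples
(A, B, C, D), with D discarded (data made up for illustration):

from collections import defaultdict

R = [(1, 2, 10, 'x'), (1, 2, 5, 'y'), (3, 4, 7, 'z')]   # tuples (A, B, C, D)

# Map: emit ((a, b), c); the D column is ignored entirely.
groups = defaultdict(list)
for a, b, c, d in R:
    groups[(a, b)].append(c)

# Reduce: apply the aggregation theta (sum here) to each value list.
print({k: sum(vs) for k, vs in groups.items()})   # {(1, 2): 15, (3, 4): 7}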

For our example, let's group by (A, B) and apply sum over C as the aggregation.

Initial data at the map workers


Applying the Map function and grouping the keys creates (A, B) as the key and C as the
value; D is discarded as if it didn't exist.


Data at map workers

Applying partitioning using hash functions, we get

Files for the reduce workers

The data at the reduce workers will look like

Data at reduce workers

The data is aggregated based on keys before applying the aggregation function (sum in this
case).

Aggregated data based on keys

After applying the sum over the value lists we get the final output

Output of group by (A, B) sum(C)

Here also, like the difference operation, we can't get rid of the Reduce stage. The context
of the source table isn't needed here, but the aggregation function makes it necessary for all
the values of a single key to be in one place. This operation is therefore also inefficient
compared to selection, projection, union, and intersection. Any column that appears in
neither the aggregation nor the grouping clause is ignored and isn't required, so if the data is
stored in a columnar format we can save the cost of loading a lot of data. Since usually only a
few columns are involved in grouping and aggregation, this saves a lot of cost, both in terms
of data sent over the network and data that needs to be loaded into main memory for
execution.


Natural Join Using Map Reduce

A natural join keeps the rows that match on the values of the common column of both
tables. To perform a natural join, we have to keep track of which table each value came from.
If the values for the same key come from different tables, we form pairs of those values, along
with the key, to get a single row of the output. A join can explode the number of rows, since
we have to form every possible combination of the values from the two tables.

Pseudo Code for Natural Join

map(key, value):
    if key == R:
        for (a, b) in value:
            emit(b, (R, a))
    else:
        for (b, c) in value:
            emit(b, (S, c))

reduce(key, values):
    list_R = [a for (x, a) in values if x == R]
    list_S = [c for (x, c) in values if x == S]
    for a in list_R:
        for c in list_S:
            emit(key, (a, key, c))
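A minimal Python simulation of this natural join for R(A, B) and S(B, C) (data made up for
illustration):

from collections import defaultdict

R = [(1, 2), (3, 2), (4, 5)]   # tuples (a, b)
S = [(2, 6), (5, 7)]           # tuples (b, c)

# Map: emit (b, ('R', a)) for R and (b, ('S', c)) for S.
groups = defaultdict(list)
for a, b in R:
    groups[b].append(('R', a))
for b, c in S:
    groups[b].append(('S', c))

# Reduce: pair every R value with every S value that shares the key b.
result = []
for b, vals in groups.items():
    list_R = [a for tag, a in vals if tag == 'R']
    list_S = [c for tag, c in vals if tag == 'S']
    result += [(a, b, c) for a in list_R for c in list_S]
print(result)   # [(1, 2, 6), (3, 2, 6), (4, 5, 7)]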

As an example, let's consider joining Table 1 and Table 2, where B is the common column.

Initial data at map workers

The data after applying the map function and grouping at the map workers will look like:


Data at map workers after applying map function and grouping the keys

As has been the case so far, files for the reduce workers are created at the map workers.

Files constructed for reduce workers

The data at the reduce workers will be:

Data at reduce workers


Applying aggregation of keys at the reduce workers we get:


Data after aggregation of keys at the reduce workers

The Reduce function then creates a row by taking one value from table T1 and the other
from T2. If the values list contains only values from T1 or only from T2, it does not constitute
a row in the output.

Output of the join

As we need to keep the context of which table a value came from, we cannot avoid
sending data across the workers for the Reduce task, so this operation is also costly compared
to the others discussed so far. The fact that we need to create pairs for each list of values also
plays a major role in the computation cost of this operation.

Hadoop Limitations

HDFS cannot be mounted directly by an existing operating system, so getting data into
and out of the HDFS file system can be inconvenient. In Linux and other UNIX systems, a
Filesystem in Userspace (FUSE) virtual file system has been developed to address this
problem.
Files can be accessed through the native Java API, through a client generated in the
language of the user's choice (C++, Java, Python, PHP, Ruby, etc.), through the command-line
interface, or browsed through the HDFS-UI web app over HTTP.
1. Security Concerns


Hadoop's security model is disabled by default due to its sheer complexity. Whoever
manages the platform should know how to enable it; otherwise the data could be at huge
risk. Hadoop does not provide encryption at the storage and network levels, which is a major
reason why government agencies and others prefer not to keep their data in the Hadoop
framework.
2. Vulnerable By Nature
The Hadoop framework is written almost entirely in Java, one of the programming
languages most widely exploited by cyber-criminals. For this reason, several experts have
suggested dumping it in favor of safer, more efficient alternatives.
3. Not Fit for Small Data
While big data is not exclusively made for big businesses, not all big data platforms are
suitable for handling small files. Due to its high-capacity design, HDFS lacks the ability
to efficiently support the random reading of small files. As a result, it is not recommended
for organizations with small quantities of data.
4. Potential Stability Issues
Hadoop is an open-source platform necessarily created by the contributions of many
developers who continue to work on the project. While improvements are constantly
being made, like all open-source software, Hadoop has stability issues. To avoid these
issues, organizations are strongly recommended to make sure they are running the latest
stable version or run it under a third-party vendor equipped to handle such problems.
5. General Limitations
Google mentions in its article that Hadoop may not be the only answer for big data.
Google has its own Cloud Dataflow as a possible solution. The main point the article
stresses is that companies could be missing out on many other benefits by using Hadoop
alone.

