Module 2
Some important calculations take minutes or even hours on thousands of compute nodes. If
we had to abort and restart the computation every time one component failed, then the
computation might never complete successfully. The solution to this problem takes two
forms:
‐ Files must be stored redundantly.
‐ Computations must be divided into tasks, such that if any one task fails to execute to
completion, it can be restarted without affecting other tasks.
Files are divided into chunks, and each chunk is replicated, typically several times, at different compute nodes. Moreover, the nodes holding copies of one chunk should be located on different racks, so that we don't lose all copies due to a rack failure. Normally, both the chunk size and the degree of replication can be decided by the user.
To find the chunks of a file, there is another small file called the master node or name
node for that file. The master node is itself replicated, and a directory for the file system as a
whole knows where to find its copies. The directory itself can be replicated, and all
participants using the DFS know where the directory copies are.
2.2 Map-Reduce
Map-reduce is a style of computing that has been implemented several times. You can use an implementation of map-reduce to manage many large-scale computations in a way that is tolerant of hardware faults. All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution and the coordination of the tasks that execute Map or Reduce. In brief, a map-reduce computation executes as follows:
1. Some number of Map tasks each are given one or more chunks from a distributed file
system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-
value pairs are produced from the input data is determined by the code written by the user
for the Map function.
2. The key-value pairs from each Map task are collected by a master controller and sorted
by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same
key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time, and combine all the values associated
with that key in some way. The manner of combination of values is determined by the code
written by the user for the Reduce function.
Combiners
When the Reduce function is both associative and commutative (e.g., sum, max, or count; note that average does not qualify, since an average of averages is not the overall average), some of the work of the Reduce function can be pushed to a combiner. Instead of sending all the mapper output to the reducers, partial results are computed on the Map side by the combiner and only these are sent on to the reducers. This reduces the input-output operations between mapper and reducer. The combiner takes the mapper output as its input and combines the values with the same key, reducing the number of pairs (the key space) that must be shuffled and sorted. For example, if a particular word w appears k times among all the documents assigned to one Map process, there will be k key-value pairs (w, 1) in its output, which can be grouped into the single pair (w, k); this is safe precisely because the addition performed in the Reduce task is associative and commutative. The figure below shows word count using the MapReduce algorithm with all the intermediate values obtained in mapping, shuffling, and reducing.
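To make the flow concrete, here is a minimal runnable Python sketch of word count with a map-side combiner (the function names and the single-process driver loop are illustrative, not taken from the figure):

from collections import Counter, defaultdict

def map_task(document):
    """Map: emit (word, 1) for every word in the chunk."""
    return [(word, 1) for word in document.split()]

def combine(pairs):
    """Combiner: pre-sum counts on the Map side, turning k copies of
    (w, 1) into a single (w, k) before anything is sent to the reducers."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reduce_task(word, counts):
    """Reduce: sum the partial counts for one key."""
    return word, sum(counts)

docs = ["the cat sat", "the cat ran"]
shuffled = defaultdict(list)
for doc in docs:                          # one Map task per document
    for word, n in combine(map_task(doc)):
        shuffled[word].append(n)          # shuffle: group by key
print(dict(reduce_task(w, ns) for w, ns in shuffled.items()))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}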
When a Reduce task starts at a Worker process, that task is given all the files that form its input. The Reduce task executes code written by the user and writes its output to a file that is part of the surrounding distributed file system.
Mapper: It converts the input into key-value pairs, where the Key identifies the record and the Value associated with it is the data itself, i.e., facts and statistics collected together for reference or analysis.
Reducer: It is responsible for processing the grouped data in parallel and producing the final output.
Matrix Multiplication Using Map Reduce
1. Map Task
Convert the entire input into key-value pairs, according to the algorithm:
For matrix A
Mapper for Matrix A (k, v) = ((i, k), (A, j, Aij)) for all k
For matrix B
Mapper for Matrix B (k, v) = ((i, k), (B, j, Bjk)) for all i
2. Combiner Task
Combine the key-value pairs to reduce network congestion.
Identify each key and create the group of values associated with it.
3. Reducer Task
The formula for the Reducer is:
(i, k) => Summation(Aij * Bjk) for all j
Output => ((i, k), sum)
Example
Let A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], both 2 x 2, so i, j, and k each range over {0, 1}.
1. Map Task
Computing the mapper for Matrix A:
Mapper for Matrix A (k, v) = ((i, k), (A, j, Aij)) for all k
Aij = A00 = 1 i.e. i=0 and j=0 and k will vary from 0 to 1
((0, 0), (A, 0, 1)) k=0
((0, 1), (A, 0, 1)) k=1
Aij = A01 = 2 i.e. i=0 and j=1 and k will vary from 0 to 1
((0, 0), (A, 1, 2)) k=0
((0, 1), (A, 1, 2)) k=1
Aij = A10 = 3 i.e. i=1 and j=0 and k will vary from 0 to 1
((1, 0), (A, 0, 3)) k=0
((1, 1), (A, 0, 3)) k=1
Aij = A11 = 4 i.e. i=1 and j=1 and k will vary from 0 to 1
((1, 0), (A, 1, 4)) k=0
((1, 1), (A, 1, 4)) k=1
Computing the mapper for Matrix B:
Mapper for Matrix B (k, v) = ((i, k), (B, j, Bjk)) for all i
Bjk = B00 = 5 i.e. j=0 and k=0 and i will vary from 0 to 1
((0, 0), (B, 0, 5)) i=0
((1, 0), (B, 0, 5)) i=1
Bjk = B01 = 6 i.e. j=0 and k=1 and i will vary from 0 to 1
((0, 1), (B, 0, 6)) i=0
((1, 1), (B, 0, 6)) i=1
Bjk = B10 = 7 i.e. j=1 and k=0 and i will vary from 0 to 1
((0, 0), (B, 1, 7)) i=0
((1, 0), (B, 1, 7)) i=1
Bjk = B11 = 8 i.e. j=1 and k=1 and i will vary from 0 to 1
((0, 1), (B, 1, 8)) i=0
((1, 1), (B, 1, 8)) i=1
2. Combiner Task:
(0, 0), (A, 0, 1), (A, 1, 2), (B, 0, 5), (B, 1, 7)
(0, 1), (A, 0, 1), (A, 1, 2), (B, 0, 6), (B, 1, 8)
(1, 0), (A, 0, 3), (A, 1, 4), (B, 0, 5), (B, 1, 7)
(1, 1), (A, 0, 3), (A, 1, 4), (B, 0, 6), (B, 1, 8)
3. Reducer Task:
For each key (i, k), match the A and B values on j and sum the products. For key (0, 0):
(0, 0): (A, 0, 1), (A, 1, 2) and (B, 0, 5), (B, 1, 7)
j=0: 1*5 = 5 and j=1: 2*7 = 14
5 + 14 = 19, so the reducer emits ((0, 0), 19)
Similarly, key (0, 1) gives 1*6 + 2*8 = 22, key (1, 0) gives 3*5 + 4*7 = 43, and key (1, 1) gives 3*6 + 4*8 = 50, so the product matrix is [[19, 22], [43, 50]].
Pseudocode
1. Map Task
2. Reduce Task
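The two tasks can be sketched as runnable Python (a minimal single-process sketch; the dictionary representation of the matrices and all names are illustrative assumptions):

from collections import defaultdict

# A is p x q and B is q x r, stored as dicts mapping (row, col) -> value.
def map_task(A, B, p, r):
    """Emit ((i, k), (matrix_name, j, value)) for every output cell (i, k)."""
    pairs = []
    for (i, j), a in A.items():
        for k in range(r):               # Aij contributes to output column k
            pairs.append(((i, k), ('A', j, a)))
    for (j, k), b in B.items():
        for i in range(p):               # Bjk contributes to output row i
            pairs.append(((i, k), ('B', j, b)))
    return pairs

def reduce_task(key, values):
    """For one key (i, k), match the A and B values on j and sum the products."""
    a_vals = {j: v for (name, j, v) in values if name == 'A'}
    b_vals = {j: v for (name, j, v) in values if name == 'B'}
    return key, sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)

# The example matrices: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]].
A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
groups = defaultdict(list)               # shuffle: group mapper output by key
for key, value in map_task(A, B, 2, 2):
    groups[key].append(value)
print(dict(reduce_task(k, vs) for k, vs in groups.items()))
# {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}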
Relational-Algebra Operations
1. Computing Selections by MapReduce
2. Computing Projections by MapReduce
3. Union
4. Intersection
5. Difference
Selection Using Map Reduce
Algorithm
1. Map Task
map(key, value):
    for tuple in value:
        if tuple satisfies condition:
            emit(tuple, tuple)
2. Reduce Task
reduce(key, values):
    emit(key, key)
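In runnable form (a minimal Python sketch; the condition B <= 3 matches the example that follows, and all names and data are illustrative):

def map_task(tuples, condition):
    """Selection map: pass through each tuple satisfying the condition,
    using the tuple itself as both key and value."""
    return [(t, t) for t in tuples if condition(t)]

def reduce_task(key, values):
    """Selection reduce: identity; the key is the selected tuple."""
    return key

rows = [(1, 2), (2, 5), (3, 3), (4, 1)]          # (A, B) pairs
selected = [reduce_task(k, [v]) for k, v in map_task(rows, lambda t: t[1] <= 3)]
print(selected)                                   # [(1, 2), (3, 3), (4, 1)]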
Example
Select all the rows where value of B is less than or equal to 3, i.e. Selection(B <= 3)
Let's consider the data initially distributed as files across the map workers, as shown in the following figure:
Initial data distributed in files across map workers, representing a single table
After applying the map function (and grouping, although there are no common keys in this case since each row is unique), we get the output below. The tuples are constructed with the 0th index holding the value from column A and the 1st index holding the value from column B. In actual implementations this schema information can be sent either as extra metadata or within each value itself, making keys and values look something like ({A: 1}, {B: 2}), which is somewhat inefficient.
Data after applying the Map function, which kept rows having B value less than or equal to 3
After this, based on the number of reduce workers (2 in our case), a hash function is applied to the keys as explained in the Hash Function section. The files for the reduce workers created on the map workers will look like:
Files for reduce workers created at map workers based on the hash function
After this step the files for reduce worker 1 are sent to it, and those for reduce worker 2 are sent to it. The data at the reduce workers will look like:
The final output after applying the reduce function, which ignores the values and just emits the keys, will look like:
Output of selection(B ≤ 3)
The point to note here is that we don't really need to shuffle data across the nodes: we could simply execute the map function and write its output from the map workers directly. This makes selection an efficient operation compared to the operations below, where the reduce function does real work.
Projection Using Map Reduce
Algorithm
1. Map Task
map(key, value):
    for tuple in value:
        ts = tuple with only the attributes in S
        emit(ts, ts)
2. Reduce Task
reduce(key, values):
    emit(key, key)
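A minimal runnable sketch of this in Python (assuming tuples are (A, B, C) triples and S = {A, B}; names and data are illustrative):

from collections import defaultdict

def map_task(tuples, indices):
    """Projection map: keep only the attributes in S (given by indices)."""
    out = []
    for t in tuples:
        ts = tuple(t[i] for i in indices)
        out.append((ts, ts))
    return out

def reduce_task(key, values):
    """Projection reduce: emit each key once, eliminating duplicates."""
    return key

rows = [(1, 2, 9), (1, 2, 7), (3, 4, 5)]   # (A, B, C); two rows share (A, B)
groups = defaultdict(list)
for k, v in map_task(rows, (0, 1)):
    groups[k].append(v)
print([reduce_task(k, vs) for k, vs in groups.items()])   # [(1, 2), (3, 4)]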
After applying the map function (dropping the values in column C) and grouping the keys, the data will look like:
The keys will be partitioned using a hash function, as was the case in selection. The data will look like:
The reduce function is then applied; it considers only the first element of each values list and emits the key once:
Output of projection(A, B)
The point to remember is that here the reduce function is needed only for duplicate elimination. If duplicates are acceptable (as in SQL, where projection does not eliminate them by default), we can drop the reduce stage entirely, meaning we don't have to move data around; the operation can then be implemented without actually passing data between workers.
The final output, after applying the reduce function which emits a tuple only when its values list contains a copy from both relations, will look like:
Output of intersection
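For intersection, the usual formulation has map emit (t, t) from both relations, and reduce emit t only when the values list contains an entry from each relation. A minimal Python sketch of this (names and data are illustrative):

from collections import defaultdict

def map_task(tuples):
    """Intersection map: the same for R and S; emit (t, t)."""
    return [(t, t) for t in tuples]

def reduce_task(key, values):
    """Intersection reduce: t is in both relations iff it was emitted twice
    (sets have no duplicates, so a length-2 list means one copy from each)."""
    return key if len(values) == 2 else None

R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]
groups = defaultdict(list)
for k, v in map_task(R) + map_task(S):
    groups[k].append(v)
print([k for k, vs in groups.items() if reduce_task(k, vs) is not None])
# [(3, 4)]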
Difference Using Map Reduce
Pseudo Code for Difference
map(key, value):
    if key == R:
        for tuple in value:
            emit(tuple, R)
    else:
        for tuple in value:
            emit(tuple, S)
reduce(key, values):
    if values == [R]:
        emit(key, key)
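In runnable form (a minimal Python sketch of the pseudocode above; the relation tags and data are illustrative):

from collections import defaultdict

def map_task(tuples, name):
    """Difference map: tag every tuple with the relation it came from."""
    return [(t, name) for t in tuples]

def reduce_task(key, values):
    """Difference reduce: keep t only if it appeared in R and never in S."""
    return key if values == ['R'] else None

R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]
groups = defaultdict(list)
for k, v in map_task(R, 'R') + map_task(S, 'S'):
    groups[k].append(v)
print([k for k, vs in groups.items() if reduce_task(k, vs) is not None])
# [(1, 2)]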
Initial Data
After applying the map function and grouping the keys, the data looks like the following figure:
After aggregation of the keys at the reduce workers, the data looks like:
The final output is generated by applying the reduce function:
For the difference operation we notice that we cannot get rid of the reduce stage, and hence have to send data across the workers, because the context of which table each value came from is needed. Hence it will be a more expensive operation as compared to selection, projection, union, and intersection.
Grouping and Aggregation Using Map Reduce
Usually understanding grouping and aggregation takes a bit of time when we learn SQL,
but not in case when we understand these operations using map reduce. The logic is already
there in the working of the map. Map workers implicitly group keys and the reduce function
acts upon the aggregated values to generate output.
reduce(key, values):
    emit(key, theta(values))
For our example, let's group by (A, B) and apply sum as the aggregation function theta.
The data is grouped by key before the aggregation function (sum in this case) is applied.
After applying the sum over the value lists, we get the final output.
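In runnable form, grouping by (A, B) and summing C (a minimal sketch; the data here is illustrative, not the figure's):

from collections import defaultdict

def map_task(tuples):
    """Grouping map: the key is the grouping columns (A, B), the value is C."""
    return [((a, b), c) for (a, b, c) in tuples]

def reduce_task(key, values):
    """Aggregation reduce: apply theta (sum here) to the value list."""
    return key, sum(values)

rows = [(1, 2, 10), (1, 2, 5), (3, 4, 7)]
groups = defaultdict(list)
for k, v in map_task(rows):
    groups[k].append(v)
print([reduce_task(k, vs) for k, vs in groups.items()])
# [((1, 2), 15), ((3, 4), 7)]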
Here also, as with the difference operation, we can't get rid of the reduce stage. Table context isn't needed here, but the aggregation function makes it necessary for all the values of a single key to be in one place. This operation is therefore also inefficient compared to selection, projection, union, and intersection. Any column that is not in the aggregation or grouping clause is ignored and isn't required, so if the data is stored in a columnar format we can save the cost of loading a lot of it. Since usually only a few columns are involved in grouping and aggregation, this saves a lot of cost, both in terms of the data sent over the network and the data that must be loaded into main memory for execution.
Natural Join Using Map Reduce
The natural join keeps the rows that match on the values in the common column of both tables. To perform a natural join we have to keep track of which table each value came from. If the values for the same key come from different tables, we form pairs of those values along with the key to get a single row of the output. A join can explode the number of rows, as we have to form every possible combination of the values from the two tables.
For an example, let's consider joining Table 1 and Table 2, where B is the common column.
The data after applying the map function and grouping at the map workers will look like:
Data at map workers after applying map function and grouping the keys
As has been the case so far, files for the reduce workers will be created at the map workers.
The reduce function then creates a row by taking one value from table T1 and one from table T2. If a key's values list contains values from only T1 or only T2, it does not contribute a row to the output.
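A minimal Python sketch of this join, assuming T1 has columns (A, B) and T2 has columns (B, C) (the tags, names, and data are illustrative):

from collections import defaultdict

def map_t1(tuples):
    """Map for T1(A, B): key on the join column B, tag values with 'T1'."""
    return [(b, ('T1', a)) for (a, b) in tuples]

def map_t2(tuples):
    """Map for T2(B, C): key on the join column B, tag values with 'T2'."""
    return [(b, ('T2', c)) for (b, c) in tuples]

def reduce_task(b, values):
    """Join reduce: pair every T1 value with every T2 value; if only one
    table contributed values for this key, no output row is produced."""
    left = [v for (tag, v) in values if tag == 'T1']
    right = [v for (tag, v) in values if tag == 'T2']
    return [(a, b, c) for a in left for c in right]

T1 = [(1, 2), (3, 2), (4, 9)]
T2 = [(2, 'x'), (2, 'y')]
groups = defaultdict(list)
for k, v in map_t1(T1) + map_t2(T2):
    groups[k].append(v)
print([row for k, vs in groups.items() for row in reduce_task(k, vs)])
# [(1, 2, 'x'), (1, 2, 'y'), (3, 2, 'x'), (3, 2, 'y')]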
Hadoop Limitations
HDFS cannot be mounted directly by an existing operating system, so getting data into and out of the HDFS file system can be inconvenient. In Linux and other UNIX systems, a Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem.
File access can be achieved through the native Java API, through clients generated in the language of the user's choice (C++, Java, Python, PHP, Ruby, etc.), through the command-line interface, or by browsing through the HDFS-UI web app over HTTP.
1. Security Concerns