Module 2: MapReduce Paradigm
Case 1: Large n, M does not fit into main memory, but v does
Since v fits into main memory, v is available to every Map task
Map: for each matrix element m_ij, emit the key-value pair (i, m_ij v_j)
Shuffle and sort: groups all m_ij v_j values with the same i together
Reduce: sum m_ij v_j over all j for the same i
HOW THE VECTOR IS REPLICATED:
Mv = (x_i), where x_i = Σ_{j=1}^{n} m_ij v_j
Each Map task emits the pairs (i, m_ij v_j).
[Figure: v is copied to every Map task alongside a chunk of M; in Case 2 below, the whole vector no longer fits in main memory.]
Case 2: Very large n, even v does not fit into main memory
For every Map task, many disk accesses (for parts of v) would be required!
Solution:
– How much of v will fit in memory?
– Partition v into stripes, and partition M into corresponding vertical stripes, so that each stripe of v fits into memory
– Take the dot product of one stripe of v with the corresponding stripe of M
– Map and Reduce are the same as before (see the sketch below)
[Figure: the matrix is divided into vertical stripes (one color per stripe), and each stripe is further divided into chunks; the vector is divided into corresponding stripes.]
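A minimal Python sketch (not from the slides) of the striping idea, assuming M is stored as (i, j, m_ij) triples and using an in-memory dictionary to stand in for the shuffle; the name stripe_width is illustrative:

from collections import defaultdict

def striped_matvec(entries, v, stripe_width):
    """Simulate MapReduce matrix-vector multiplication when v is too large
    to hold in memory: v is split into stripes of `stripe_width`, and each
    map task only loads the stripe matching its chunk of M."""
    partials = defaultdict(float)              # shuffle: key i -> running sum

    # Map: each (i, j, m_ij) is processed against the stripe of v that
    # contains v_j; only that stripe needs to be in memory at a time.
    for (i, j, m_ij) in entries:
        stripe = j // stripe_width             # which stripe of v holds v_j
        v_stripe = v[stripe * stripe_width:(stripe + 1) * stripe_width]
        partials[i] += m_ij * v_stripe[j % stripe_width]

    # Reduce: the per-row sums are the entries of M·v.
    return dict(partials)

# Usage: M = [[1, 2], [3, 4]] as triples, v = [1, 1] -> {0: 3.0, 1: 7.0}
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
print(striped_matvec(entries, [1, 1], stripe_width=1))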
MATRIX-VECTOR MULTIPLICATION MR CODE
map(key, value):
    # value is a chunk of matrix entries; the vector v is in memory at every mapper
    for (i, j, a_ij) in value:
        emit(i, a_ij * v[j])

reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)
RELATIONAL ALGEBRA
• Primitives
• Projection (π)
• Selection (σ)
• Cartesian product (×)
• Set union (∪)
• Set difference (-)
• Rename (ρ)
• Other operations
• Join (⋈)
• Group by… aggregation
•…
RELATIONAL ALGEBRA
• R, S - relations
• t, t' - tuples
• C - a selection condition
• A, B, C - subsets of attributes
• a, b, c - attribute values for a given subset of attributes
• Relations (however big) can be stored in a distributed filesystem - if they don't fit on a single machine, they are broken into pieces (think HDFS)
BIG DATA ANALYSIS
• Peta-scale datasets are everywhere:
• Facebook has 2.5 PB of user data + 15 TB/day
(4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• …
• A lot of these datasets are (mostly) structured
• Query logs
• Point-of-sale records
• User data (e.g., demographics)
• …
• How do we perform data analysis at scale?
• Relational databases and SQL
• MapReduce (Hadoop)
Relational Databases vs. MapReduce
• Relational databases:
• Multipurpose: analysis and transactions; batch
and interactive
• Data integrity via ACID transactions
• Lots of tools in software ecosystem (for ingesting,
reporting, etc.)
• Supports SQL (and SQL integration, e.g., JDBC)
• Automatic SQL query optimization
• MapReduce (Hadoop):
• Designed for large clusters, fault tolerant
• Data is accessed in “native format”
• Supports many query languages
• Programmers retain control over performance
• Open source
Selection Using Mapreduce
• Map: For each tuple t in R, test if t satisfies
C. If so, produce the key-value pair (t, t).
• Reduce: The identity function. It simply
passes each key-value pair to the output.
[Figure: tuples R1-R5 flow through the Map tasks; only the tuples satisfying C (here R1 and R3) appear in the output.]
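A minimal Python sketch of selection as Map and Reduce, assuming the condition C is passed in as a predicate; the plumbing around emit is simulated and not from the slides:

def selection_map(tuples, condition):
    """Map: emit (t, t) for every tuple t that satisfies the condition C."""
    for t in tuples:
        if condition(t):
            yield (t, t)

def selection_reduce(key, values):
    """Reduce: identity - pass each key-value pair through to the output."""
    for v in values:
        yield (key, v)

# Usage: select tuples whose second field is greater than 10.
R = [("R1", 5), ("R2", 20), ("R3", 15)]
print(list(selection_map(R, lambda t: t[1] > 10)))
# [(('R2', 20), ('R2', 20)), (('R3', 15), ('R3', 15))]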
PROJECTION
[Figure: each input tuple R1-R5 is mapped to its projection onto the chosen attributes.]
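The slide shows only the picture; a minimal sketch of the usual approach (project each tuple in the Map, let the Reduce collapse the duplicates that projection can create), with illustrative names:

def projection_map(tuples, attrs):
    """Map: emit (t', t') where t' is the tuple restricted to the attributes in `attrs`."""
    for t in tuples:
        t_proj = tuple(t[a] for a in attrs)
        yield (t_proj, t_proj)

def projection_reduce(key, values):
    """Reduce: emit (t', t') exactly once, removing duplicates created by the projection."""
    yield (key, key)

# Usage: project R(A, B, C) onto A (index 0); the first and third tuples collapse.
R = [(1, 10, 12), (2, 20, 34), (1, 10, 22)]
grouped = {}
for k, v in projection_map(R, attrs=[0]):
    grouped.setdefault(k, []).append(v)
print([out for k, vs in grouped.items() for out in projection_reduce(k, vs)])
# [((1,), (1,)), ((2,), (2,))]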
Union Using Mapreduce
• Suppose R and S have the same
schema
• Map tasks are generated from
chunks of both R and S
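A minimal sketch of the usual union scheme (emit every tuple as its own key, emit each key once in the Reduce); the dictionary shuffle is an assumption standing in for the framework:

def union(R, S):
    """Union via MapReduce: the Map emits (t, t) for every tuple of R and S;
    the Reduce emits each key once, so duplicates across R and S collapse."""
    grouped = {}
    for t in list(R) + list(S):      # Map + shuffle
        grouped.setdefault(t, []).append(t)
    return [t for t in grouped]      # Reduce: one output per key

# Usage: (2, 'b') appears in both relations but only once in the result.
R = [(1, "a"), (2, "b")]
S = [(2, "b"), (3, "c")]
print(union(R, S))                   # [(1, 'a'), (2, 'b'), (3, 'c')]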
INTERSECTION (R ∩ S)
DIFFERENCE (R-S)
GROUPING AND AGGREGATION USING MAPREDUCE
• Group and aggregate on a relation R(A, B), grouping by attribute A
  (e.g., select A, sum(B) from R group by A;)
• Map:
  • For each tuple t = (a, b) of R, emit the key-value pair (a, b)
• Reduce:
  • For each key a, apply the aggregation operator (e.g., sum or average) to the list of b values and emit (a, aggregate)
• Example relation R(A, B): (y, 1), (z, 4), (z, 1), (x, 5)
[Worked example: R(A, B, C) = {R1 (1, 10, 12), R2 (2, 20, 34), R3 (1, 10, 22), R4 (1, 30, 56), R5 (3, 40, 17), R6 (2, 10, 49), R7 (1, 20, 44)}. MAP 1 emits (1, 10), (2, 20), (1, 10), (1, 30), (3, 40); MAP 2 emits (2, 10), (1, 20). After grouping, Reducer 1 receives (1, [10, 10, 30, 20]) and Reducer 2 receives (2, [20, 10]) and (3, [40]); aggregating (here by averaging B) gives (1, 17.5), (2, 15), (3, 40).]
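A minimal Python simulation of this grouping and aggregation, using sum(B) as in the query above; the dictionary stands in for the shuffle and the helper name is illustrative:

from collections import defaultdict

def group_aggregate(tuples, aggregate=sum):
    """Map emits (a, b) per tuple (a, b); the shuffle groups the b values by a;
    the Reduce applies the aggregate to each group."""
    groups = defaultdict(list)
    for (a, b) in tuples:                                    # Map + shuffle
        groups[a].append(b)
    return {a: aggregate(bs) for a, bs in groups.items()}    # Reduce

# Usage: select A, sum(B) from R group by A
R = [("y", 1), ("z", 4), ("z", 1), ("x", 5)]
print(group_aggregate(R))            # {'y': 1, 'z': 5, 'x': 5}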
Grouping & Aggregation Summary
NATURAL JOIN
MAP-REDUCE EXAMPLE : JOIN
select R.A, R.B, S.D from R, S where R.A = S.A;
[Worked example: R(A, B, C) = {R1 (1, 10, 12), R2 (2, 20, 34), R3 (1, 10, 22), R4 (1, 30, 56), R5 (3, 40, 17)} and S(A, D, E) = {S1 (1, 20, 22), S2 (2, 30, 36), S3 (2, 10, 29), S4 (3, 50, 16), S5 (3, 40, 37)}. MAP 1 tags the R tuples and emits (1, [R, 10]), (2, [R, 20]), (1, [R, 10]), (1, [R, 30]), (3, [R, 40]); MAP 2 tags the S tuples and emits (1, [S, 20]), (2, [S, 30]), (2, [S, 10]), (3, [S, 50]), (3, [S, 40]). Each reducer pairs every R value with every S value for its key; e.g., key 2 receives [(R, 20), (S, 30), (S, 10)] and emits (2, 20, 30) and (2, 20, 10), and key 3 receives [(R, 40), (S, 50), (S, 40)] and emits (3, 40, 50) and (3, 40, 40).]
Natural Join Using MapReduce
• Join R(A, B) with S(B, C) on attribute B
• Map:
  • For each tuple t = (a, b) of R, emit the key-value pair (b, (R, a))
  • For each tuple t = (b, c) of S, emit the key-value pair (b, (S, c))
• Reduce:
  • Each key b is associated with a list of values of the form (R, a) or (S, c)
  • Construct all pairs consisting of one value with first component R and one with first component S, say (R, a) and (S, c). The output for this key and value list is a sequence of key-value pairs
  • The key is irrelevant. Each value is one of the triples (a, b, c) such that (R, a) and (S, c) are on the input list of values
• Example: R(A, B) = {(x, a), (y, b), (z, c), (w, d)}; S(B, C) = {(a, 1), (c, 3), (d, 4), (g, 7)}
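A minimal Python simulation of this join, tagging each value with the relation it came from as described above; the in-memory shuffle is an assumption standing in for the framework:

from collections import defaultdict

def natural_join(R, S):
    """Join R(A, B) with S(B, C) on attribute B via a simulated MapReduce."""
    grouped = defaultdict(list)
    for (a, b) in R:                      # Map over R: key b, value ('R', a)
        grouped[b].append(("R", a))
    for (b, c) in S:                      # Map over S: key b, value ('S', c)
        grouped[b].append(("S", c))

    joined = []
    for b, values in grouped.items():     # Reduce: pair every R value with every S value
        r_vals = [a for tag, a in values if tag == "R"]
        s_vals = [c for tag, c in values if tag == "S"]
        joined.extend((a, b, c) for a in r_vals for c in s_vals)
    return joined

# Usage with the example relations from the slide:
R = [("x", "a"), ("y", "b"), ("z", "c"), ("w", "d")]
S = [("a", 1), ("c", 3), ("d", 4), ("g", 7)]
print(natural_join(R, S))                 # [('x', 'a', 1), ('z', 'c', 3), ('w', 'd', 4)]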
Need For High-level Languages
• Hadoop is great for large-data
processing!
• But writing Java programs for
everything is verbose and slow
• Analysts don’t want to (or can’t) write
Java
• Solution: develop higher-level data
processing languages
• Hive: HQL is like SQL
• Pig: Pig Latin is a bit like Perl
HIVE AND PIG
• Hive: data warehousing application in Hadoop
• Query language is HQL, variant of SQL
• Tables stored on HDFS as flat files
• Developed by Facebook, now open source
• Pig: large-scale data processing system
• Scripts are written in Pig Latin, a dataflow
language
• Developed by Yahoo!, now open source
• Roughly 1/3 of all Yahoo! internal jobs
• Common idea:
• Provide higher-level language to facilitate large-
data processing
• Higher-level language “compiles down” to
Hadoop jobs
Matrix Multiplication Using MapReduce
A (m × n) × B (n × l) = C (m × l), where c_ik = Σ_{j=1}^{n} a_ij b_jk
Matrix elements are represented as tuples: (i, j, a_ij) for A and (j, k, b_jk) for B.
C = A × B
A has dimensions L × M
B has dimensions M × N
C has dimensions L × N
Matrix multiplication: C[i, k] = Sum_j (A[i, j] * B[j, k])
In the map phase:
  for each element (i, j) of A, emit ((i, k), A[i, j]) for k in 1..N
    Better: emit ((i, k), ('A', j, A[i, j])) for k in 1..N
  for each element (j, k) of B, emit ((i, k), B[j, k]) for i in 1..L
    Better: emit ((i, k), ('B', j, B[j, k])) for i in 1..L
In the reduce phase:
  one reducer per output cell, emit
    key = (i, k)
    value = Sum_j (A[i, j] * B[j, k])
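A minimal Python simulation of this one-phase scheme, assuming A and B arrive as (i, j, a_ij) and (j, k, b_jk) triples and simulating the shuffle with a dictionary:

from collections import defaultdict

def matmul_1phase(A_entries, B_entries, L, N):
    """One-phase MapReduce matrix multiplication: C[i, k] = sum_j A[i, j] * B[j, k]."""
    cells = defaultdict(list)
    # Map: replicate each A entry across all k and each B entry across all i,
    # tagging values with the matrix name and the join index j.
    for (i, j, a) in A_entries:
        for k in range(N):
            cells[(i, k)].append(("A", j, a))
    for (j, k, b) in B_entries:
        for i in range(L):
            cells[(i, k)].append(("B", j, b))

    # Reduce: for each output cell (i, k), pair A and B values on j and sum.
    C = {}
    for (i, k), values in cells.items():
        a_by_j = {j: a for tag, j, a in values if tag == "A"}
        b_by_j = {j: b for tag, j, b in values if tag == "B"}
        C[(i, k)] = sum(a_by_j[j] * b_by_j[j] for j in a_by_j if j in b_by_j)
    return C

# Usage: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]] -> C = [[19, 22], [43, 50]]
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
print(matmul_1phase(A, B, L=2, N=2))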
Illustrating Matrix Multiplication
Matrix Multiplication – 2 Phase
Step 3: the Map is just the identity – emit ((i, k), a_ij b_jk)
Matrix Multiplication – 1 Phase
Mapper for matrix A: (k, v) = ((i, k), ('A', j, A_ij)) for all k
Mapper for matrix B: (k, v) = ((i, k), ('B', j, B_jk)) for all i
Counting the mapper output for matrix A:
• One pair is emitted for every combination of i, j, and k.
• Here all dimensions are 2, so for k = 1, i can take the 2 values 1 and 2,
• and each of those cases has 2 further values, j = 1 and j = 2.
APPLICATIONS OF
MAP-REDUCE
DISTRIBUTED GREP
Inverted Index
[Example index: vocabulary/dictionary terms with their posting lists]
Ship: 1, 3, 4, 8
Jack: 1, 4
Bond: 6
Gun: 1, 5, 6, 7
Ocean: 1, 2, 3, 4, 8
Captain: 1, 3, 4
Batman: 5
Crime: 5, 7
Creating an inverted index
Map: for each term in each document, write out the pair (term, docId)
Sort and group by term
Reduce: list the documents for each term
[Figure: four example documents - 1 The Curse of the Black Pearl, 2 Finding Nemo, 3 Tintin, 4 Titanic - with terms such as Ship, Captain, Jack, Sparrow, Ocean, Fish, Animation, Nemo, Reef, Caribbean, Elizabeth, Gun, Rose, Atlantic, England, Sink, Fight, Haddock, Tintin. The Map phase emits (term, docId) pairs such as (Ship, 1), (Captain, 1), (Jack, 1), (Ship, 3), (Tintin, 3), (Jack, 4); after sorting and grouping by term, the Reduce emits the posting list per term, e.g., Jack → 1, 4 and Ship → 1, 3, ….]
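A minimal Python sketch of the index construction, assuming the input is a mapping from docId to term list and using a dictionary for the sort-and-group step:

from collections import defaultdict

def build_inverted_index(docs):
    """Map: emit (term, docId); group by term; Reduce: emit the posting list."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():          # Map + shuffle
        for term in terms:
            postings[term].add(doc_id)
    # Reduce: a sorted posting list per term
    return {term: sorted(ids) for term, ids in postings.items()}

# Usage with a few of the slide's example terms:
docs = {1: ["Ship", "Captain", "Jack"], 3: ["Ship", "Tintin"], 4: ["Jack", "Ship"]}
print(build_inverted_index(docs))
# {'Ship': [1, 3, 4], 'Captain': [1], 'Jack': [1, 4], 'Tintin': [3]}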
TWITTER ANALYTICS
• Let us take a real-world example to see the power of MapReduce. Twitter receives around 500 million tweets per day, which is roughly 6,000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
• As shown in the illustration, the MapReduce algorithm performs the following actions:
• Tokenize: tokenizes the tweets into maps of tokens and writes them as key-value pairs.
• Filter: filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
• Count: generates a token counter per word.
• Aggregate Counters: prepares an aggregate of similar counter values into small manageable units.
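A minimal Python sketch of the Tokenize / Filter / Count / Aggregate pipeline described above; the stop-word list and function names are illustrative, not Twitter's actual implementation:

from collections import Counter

STOP_WORDS = {"the", "a", "is", "to"}       # illustrative filter list

def tokenize(tweets):
    """Tokenize: split each tweet into (token, 1) key-value pairs."""
    return [(word.lower(), 1) for tweet in tweets for word in tweet.split()]

def filter_tokens(pairs):
    """Filter: drop unwanted words from the token stream."""
    return [(w, c) for (w, c) in pairs if w not in STOP_WORDS]

def count_and_aggregate(pairs):
    """Count + Aggregate Counters: sum the counters per word."""
    counts = Counter()
    for word, c in pairs:
        counts[word] += c
    return dict(counts)

# Usage:
tweets = ["the game is on", "on to the next game"]
print(count_and_aggregate(filter_tokens(tokenize(tweets))))
# {'game': 2, 'on': 2, 'next': 1}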
MapReduce At FaceBook
• Facebook has a list of friends (note that friends are a bi-
directional thing on Facebook. If I'm your friend, you're mine).
• They also have lots of disk space and they serve hundreds of millions of requests every day.
• They've decided to pre-compute calculations when they can
to reduce the processing time of requests.
• One common processing request is the "You and XXX have 230
friends in common" feature.
• When you visit someone's profile, you see a list of friends that
you have in common.
• This list doesn't change frequently so it'd be wasteful to
recalculate it every time you visited the profile
• Facebook uses MapReduce to calculate everyone's common friends once a day and store those results.
• Later on it's just a quick lookup. We've got lots of disk; it's cheap.
MR AT FACEBOOK
• Assume the friends are stored as Person->[List of
Friends], our friends list is then:
• A -> B C D
• B -> A C D E
• C -> A B D E
• D -> A B C E
• E -> B C D
• Each line will be an argument to a mapper.
• For every friend in the list of friends, the mapper will
output a key-value pair.
• The key will be a friend along with the person.
• The value will be the list of friends.
• The key will be sorted so that the friends are in order,
causing all pairs of friends to go to the same reducer.
MR AT FACEBOOK
• After all the mappers are • For map(C -> A B D E) :
done running, you'll have a • (A C) -> A B D E
list like this: • (B C) -> A B D E
• For map(A -> B C D) : • (C D) -> A B D E
• (A B) -> B C D • (C E) -> A B D E
• (A C) -> B C D • For map(D -> A B C E) :
• (A D) -> A B C E
• (A D) -> B C D • (B D) -> A B C E
• For map(B -> A C D E) : (Note • (C D) -> A B C E
that A comes before B in the • (D E) -> A B C E
key)
• And finally for map(E ->
• (A B) -> A C D E B C D):
• (B C) -> A C D E • (B E) -> B C D
• (B D) -> A C D E • (C E) -> B C D
• (D E) -> B C D
• (B E) -> A C D E
MR AT FACEBOOK
• Before we send these key-value pairs to the reducers, we
group them by their keys and get:
• (A B) -> (A C D E) (B C D)
• (A C) -> (A B D E) (B C D)
• (A D) -> (A B C E) (B C D)
• (B C) -> (A B D E) (A C D E)
• (B D) -> (A B C E) (A C D E)
• (B E) -> (A C D E) (B C D)
• (C D) -> (A B C E) (A B D E)
• (C E) -> (A B D E) (B C D)
• (D E) -> (A B C E) (B C D)
MR AT FACEBOOK
• Each line will be passed as an argument to a reducer.
• The reduce function will simply intersect the lists of values and
output the same key with the result of the intersection.
• For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) :
(C D) and means that friends A and B have C and D as common
friends.
• The result after reduction is:
• (A B) -> (C D)
• (A C) -> (B D)
• (A D) -> (B C)
• (B C) -> (A D E)
• (B D) -> (A C E)
• (B E) -> (C D)
• (C D) -> (A B E)
• (C E) -> (B D)
• (D E) -> (B C)
• Now when D visits B's profile, we can quickly look up (B D) and
see that they have three friends in common, (A C E).
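A minimal Python simulation of this common-friends job, building sorted pair keys in the Map step and intersecting the value lists in the Reduce step; the dictionary shuffle is an assumption standing in for Hadoop's machinery:

from collections import defaultdict

def common_friends(friend_lists):
    """For every pair of friends, compute the friends they have in common."""
    grouped = defaultdict(list)
    # Map: for each person -> friends line, emit (sorted pair, friend list).
    for person, friends in friend_lists.items():
        for friend in friends:
            key = tuple(sorted((person, friend)))
            grouped[key].append(set(friends))
    # Reduce: intersect the two lists that arrived for each pair.
    return {pair: sorted(set.intersection(*lists)) for pair, lists in grouped.items()}

# Usage with the slide's friends lists:
friends = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}
result = common_friends(friends)
print(result[("B", "D")])   # ['A', 'C', 'E'] -> B and D have three friends in common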
MAP REDUCE INAPPLICABILITY
• Database management
• Sub-optimal implementation for
DB
• Does not provide traditional DBMS
features
• Lacks support for default DBMS
tools
Map Reduce Inapplicability