
Map Reduce

Other Algorithms using MR

Big Data Analytics - Module 2D


OVERVIEW OF THE CHAPTER
• Algorithms Using MapReduce:
• Matrix-Vector Multiplication by MapReduce
• Relational-Algebra Operations, Computing
Selections by MapReduce, Computing
Projections by MapReduce, Union, Intersection,
and Difference by MapReduce, Computing
Natural Join by MapReduce, Grouping and
Aggregation by MapReduce
• Matrix Multiplication, Matrix Multiplication with
One MapReduce Step.
• Illustrating use of MapReduce with use of real life
databases and applications.
SORTING
• The standard MR framework automatically sorts the output keys
as part of its processing.
• So if your Mapper just takes the input key and outputs the same
key (with no value), and the Reducer just takes the input keys
and outputs them unchanged, you'll end up with all of the keys
sorted.
• Let's suppose you're sorting a very long list of names that all start
with an English letter A - Z.
• Each mapper gets one name at a time. It looks at the first letter
and sends the name to the reducer assigned to that letter. So, all
of the 'A's get sent to Reducer 1, and so on.
• Then each Reducer sorts the names it gets (using any sorting
algorithm) - and that should be relatively fast since each one
only has to handle roughly 1/26 of the data.
• When you're done, you'd end up with 26 files, one for each
reducer, each in sorted order. You could then concatenate the
files in order to get the complete sorted list.
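The steps above can be sketched as a small in-memory simulation (an assumption of this sketch: every name starts with an ASCII letter, and each dictionary entry plays the role of one reducer):

```python
from collections import defaultdict

def sort_names(names):
    """Simulate the MR sort: partition by first letter, sort each
    partition, then concatenate the partitions in letter order."""
    partitions = defaultdict(list)           # one "reducer" per letter
    for name in names:                       # map: route by first letter
        partitions[name[0].upper()].append(name)
    result = []
    for letter in sorted(partitions):        # concatenate reducer outputs
        result.extend(sorted(partitions[letter]))  # each reducer sorts locally
    return result

print(sort_names(["Carol", "Alice", "Bob", "Anna"]))
# ['Alice', 'Anna', 'Bob', 'Carol']
```

Because the partition boundaries respect alphabetical order, no merge step is needed: concatenating the 26 sorted files already yields the globally sorted list.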
MATRIX – VECTOR MULTIPLICATION
Multiply an n × n matrix M = (m_ij) by an n-vector v = (v_j):

    x = Mv, where x_i = Σ_{j=1}^{n} m_ij v_j

Case 1: Large n, M does not fit into main memory, but v does
 Since v fits into main memory, v is available to every map task
 Map: for each matrix element mij, emit key value pair (i, mijvj)
 Shuffle and sort: groups all mijvj values together for the same i
 Reduce: sum mijvj for all j for the same i
HOW THE VECTOR IS REPLICATED:

• Distributed Cache/Broadcast: Before the


MapReduce job begins, vector is placed in a
distributed cache (Hadoop's distributed cache
or Spark's broadcast variables).
• This cache ensures that each mapper has a
local copy of the vector.
• Local Access in Mapper: Each mapper
accesses the vector locally from the cache,
avoiding the need for repeated network
communication.
• The reduce phase aggregates the results to
produce the final output vector.
WHY DATA MOVEMENT IS MANAGEABLE:
• Small Vector Size:
• Vectors are usually much smaller than the matrix itself.
• Broadcasting Efficiency:
• Broadcasting a small vector across the nodes involves
relatively little data movement compared to the overall
size of the matrix.
• The overhead of transmitting this small amount of data to
each node is usually negligible, especially in distributed
systems designed to handle such operations efficiently.
• One-Time Cost:
• Single Broadcast: The vector is typically broadcast only
once at the start of the MapReduce job
• Distributed Cache:
• Systems like Hadoop use a distributed cache to distribute
the vector, ensuring that each node has a local copy. This
minimizes the need for repeated data movement during
the map operations.
– Used in many algorithms, e.g., PageRank

Valeria Cardellini - SABD 2020/21


MATRIX – VECTOR MULTIPLICATION

x = Mv, where x_i = Σ_{j=1}^{n} m_ij v_j; the map still emits (i, m_ij v_j).
(Figure: when v is small, all of it fits into main memory; when n is very
large, even a whole chunk of v does not fit in main memory anymore.)
Case 2: Very large n, even v does not fit into main memory
 For every map, many accesses to disk (for parts of v) required!
 Solution:
– How much of v will fit in?
– Partition v and rows of M so that each partition of v fits into memory
– Take dot product of one partition of v and the corresponding partition of M
– Map and reduce same as before
(Figure: each color is one vertical stripe of M, paired with the
corresponding partition of v; each stripe is further divided into chunks.)
MATRIX-VECTOR MULTIPLICATION MR CODE

map(key, value):
    for (i, j, a_ij) in value:
        emit(i, a_ij * v[j])

reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)
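The pseudocode above can be exercised with a minimal in-memory driver (a sketch with assumed names; a real job would distribute the triples across map tasks and broadcast v):

```python
from collections import defaultdict

def matvec_mr(entries, v):
    """entries: sparse matrix as (i, j, m_ij) triples; v: the in-memory vector."""
    groups = defaultdict(list)
    for i, j, m_ij in entries:          # map: emit (i, m_ij * v_j)
        groups[i].append(m_ij * v[j])
    return {i: sum(vals) for i, vals in groups.items()}  # reduce: sum per row

# M = [[1, 2], [3, 4]], v = [5, 6]  ->  Mv = [17, 39]
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
print(matvec_mr(entries, [5, 6]))   # {0: 17, 1: 39}
```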
RELATIONAL ALGEBRA
• Primitives
• Projection (π)
• Selection (σ)
• Cartesian product (×)
• Set union (∪)
• Set difference (−)
• Rename (ρ)
• Other operations
• Join (⋈)
• Group by… aggregation
•…
RELATIONAL ALGEBRA
• R, S - relation
• t, t’ - a tuple
• C - a condition of selection
• A, B, C - subset of attributes
• a, b, c - attribute values for a given subset of
attributes
• Relations (however big) can be stored in a
distributed filesystem – If they don’t fit in a
single machine, they’re broken into pieces
(think HDFS)
BIG DATA ANALYSIS
• Peta-scale datasets are everywhere:
• Facebook has 2.5 PB of user data + 15 TB/day
(4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• …
• A lot of these datasets are (mostly) structured
• Query logs
• Point-of-sale records
• User data (e.g., demographics)
• …
• How do we perform data analysis at scale?
• Relational databases and SQL
• MapReduce (Hadoop)
Relational Databases Vs. MapReduce
• Relational databases:
• Multipurpose: analysis and transactions; batch
and interactive
• Data integrity via ACID transactions
• Lots of tools in software ecosystem (for ingesting,
reporting, etc.)
• Supports SQL (and SQL integration, e.g., JDBC)
• Automatic SQL query optimization
• MapReduce (Hadoop):
• Designed for large clusters, fault tolerant
• Data is accessed in “native format”
• Supports many query languages
• Programmers retain control over performance
• Open source
Selection Using Mapreduce
• Map: For each tuple t in R, test if t satisfies
C. If so, produce the key-value pair (t, t).
• Reduce: The identity function. It simply
passes each key-value pair to the output.
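The map and identity reduce above can be sketched as follows (an in-memory simulation; `predicate` stands in for the selection condition C):

```python
def selection_mr(tuples, predicate):
    # map: keep t only if it satisfies C, keyed by the tuple itself
    pairs = [(t, t) for t in tuples if predicate(t)]
    # reduce: the identity function -- pass each pair through unchanged
    return [t for t, _ in pairs]

rows = [(1, 'a'), (2, 'b'), (3, 'a')]
print(selection_mr(rows, lambda t: t[1] == 'a'))   # [(1, 'a'), (3, 'a')]
```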

(Figure: tuples R1…R5 flow through the selection; only R1 and R3 satisfy
C and appear in the output.)
PROJECTION
• Map: For each tuple t in R, construct the projected tuple t' by
eliminating the attributes not in the projection list, and emit the
key-value pair (t', t').
• Reduce: For each key t', eliminate duplicates: output a single
pair (t', t').
(Figure: each tuple R1…R5 maps to its projection.)
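A minimal sketch of projection with duplicate elimination (attributes are addressed by position here, an assumption of this sketch):

```python
from collections import defaultdict

def projection_mr(tuples, attrs):
    groups = defaultdict(int)
    for t in tuples:                       # map: emit (t', t') for projected t'
        groups[tuple(t[a] for a in attrs)] += 1
    return sorted(groups)                  # reduce: one output per distinct t'

rows = [(1, 'a'), (2, 'a'), (3, 'b')]
print(projection_mr(rows, [1]))   # [('a',), ('b',)]  -- duplicates removed
```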
Union Using Mapreduce
• Suppose R and S have the same schema
• Map tasks are generated from chunks of both R and S
• Map: for each tuple t, emit the key-value pair (t, t)
• Reduce: each key t has one or two values; output (t, t) exactly once
INTERSECTION (R ∩ S)
• Map: for each tuple t, emit the key-value pair (t, t)
• Reduce: emit t only if its value list has two elements, i.e., t
appeared in both R and S
DIFFERENCE (R − S)
• Map: for each tuple t of R, emit (t, R); for each tuple t of S, emit (t, S)
• Reduce: emit t only if its value list is [R], i.e., t appeared in R
but not in S
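The three set operations can be sketched together (an in-memory simulation, assuming R and S are duplicate-free and hold tuples with the same schema; `_group` simulates the shuffle):

```python
from collections import defaultdict

def _group(pairs):
    """Simulate the shuffle: collect all values per key."""
    g = defaultdict(list)
    for k, v in pairs:
        g[k].append(v)
    return g

def union_mr(R, S):
    # reduce: emit each key once, whether it came from R, S, or both
    return set(_group((t, None) for t in list(R) + list(S)))

def intersection_mr(R, S):
    # map tags each tuple with its relation; reduce keeps keys seen in both
    g = _group([(t, 'R') for t in R] + [(t, 'S') for t in S])
    return {t for t, tags in g.items() if {'R', 'S'} <= set(tags)}

def difference_mr(R, S):
    # reduce keeps keys whose value list contains only the tag 'R'
    g = _group([(t, 'R') for t in R] + [(t, 'S') for t in S])
    return {t for t, tags in g.items() if set(tags) == {'R'}}

R, S = [1, 2, 3], [2, 3, 4]
print(sorted(union_mr(R, S)))          # [1, 2, 3, 4]
print(sorted(intersection_mr(R, S)))   # [2, 3]
print(sorted(difference_mr(R, S)))     # [1]
```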
GROUPING AND AGGREGATION USING MAPREDUCE
• Group and aggregate on a relation R(A,B) using aggregation
function γ(B), group by attribute A
• Map:
• For each tuple t = (a,b) of R, emit key-value pair (a,b)
• Reduce:
• For each group {(a,b1), …, (a,bm)} represented by a key a,
apply γ to obtain b_a = b1 + … + bm (here γ = SUM)
• Output (a, b_a)

Example: R(A,B) = {(x,2), (y,1), (z,4), (z,1), (x,5)}; the SQL query

    select A, sum(B) from R group by A;

gives the result (x,7), (y,1), (z,5).
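The group-and-aggregate pattern above can be sketched directly (an in-memory simulation; γ defaults to SUM as in the example):

```python
from collections import defaultdict

def group_aggregate(tuples, gamma=sum):
    groups = defaultdict(list)
    for a, b in tuples:            # map: emit (a, b)
        groups[a].append(b)        # shuffle groups the b values by a
    return {a: gamma(bs) for a, bs in groups.items()}  # reduce: apply gamma

R = [('x', 2), ('y', 1), ('z', 4), ('z', 1), ('x', 5)]
print(group_aggregate(R))   # {'x': 7, 'y': 1, 'z': 5}
```

Passing a different `gamma` (e.g. `min`, `max`, or an average function) gives the other SQL aggregates with no change to the map side.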
GROUP BY… AGGREGATION - 2
• Example: What is the average time
spent per URL?
• In SQL:
• SELECT url, AVG(time) FROM visits GROUP BY
url
• In MapReduce:
• Map over tuples, emit time, keyed by url
• MR automatically groups values by keys
• Compute average in reducer
MAP-REDUCE EXAMPLE:
AGGREGATION
• Compute the average of B for each distinct value of A

R(A,B,C): R1 = (1,10,12), R2 = (2,20,34), R3 = (1,10,22), R4 = (1,30,56),
R5 = (3,40,17), R6 = (2,10,49), R7 = (1,20,44)

MAP 1 (R1–R4) emits: (1, 10), (2, 20), (1, 10), (1, 30)
MAP 2 (R5–R7) emits: (3, 40), (2, 10), (1, 20)

Reducer 1 receives (1, [10, 10, 30, 20]) and outputs (1, 17.5)
Reducer 2 receives (2, [20, 10]) and (3, [40]) and outputs (2, 15) and (3, 40)
Grouping & Aggregation Summary
NATURAL JOIN
MAP-REDUCE EXAMPLE : JOIN
 Select R.A, R.B, S.D where R.A == S.A

R(A,B,C): R1 = (1,10,12), R2 = (2,20,34), R3 = (1,10,22), R4 = (1,30,56),
R5 = (3,40,17)
S(A,D,E): S1 = (1,20,22), S2 = (2,30,36), S3 = (2,10,29), S4 = (3,50,16),
S5 = (3,40,37)

MAP 1 (over R) emits: (1, [R, 10]), (2, [R, 20]), (1, [R, 10]),
(1, [R, 30]), (3, [R, 40])
MAP 2 (over S) emits: (1, [S, 20]), (2, [S, 30]), (2, [S, 10]),
(3, [S, 50]), (3, [S, 40])

Reducer 1 receives (1, [(R, 10), (R, 10), (R, 30), (S, 20)]) and outputs
(1, 10, 20), (1, 10, 20), (1, 30, 20)
Reducer 2 receives (2, [(R, 20), (S, 30), (S, 10)]) and
(3, [(R, 40), (S, 50), (S, 40)]) and outputs
(2, 20, 30), (2, 20, 10), (3, 40, 50), (3, 40, 40)
Natural Join Using Mapreduce
• Join R(A,B) with S(B,C) on attribute B
• Map:
• For each tuple t = (a,b) of R, emit key-value pair (b, (R,a))
• For each tuple t = (b,c) of S, emit key-value pair (b, (S,c))
• Reduce:
• Each key b is associated with a list of values that are of the
form (R,a) or (S,c)
• Construct all pairs consisting of one value with first
component R and one with first component S, say (R,a) and
(S,c). The output from this key and value list is a sequence
of key-value pairs
• The key is irrelevant. Each value is one of the triples (a, b, c)
such that (R,a) and (S,c) are on the input list of values

Example: R(A,B) = {(x,a), (y,b), (z,c), (w,d)} and
S(B,C) = {(a,1), (c,3), (d,4), (g,7)}; the join produces
(x, a, 1), (z, c, 3), (w, d, 4).
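The natural-join algorithm above, applied to the example relations, can be sketched as an in-memory simulation:

```python
from collections import defaultdict

def natural_join_mr(R, S):
    """Join R(A,B) with S(B,C) on B. R holds (a, b) pairs, S holds (b, c) pairs."""
    groups = defaultdict(list)
    for a, b in R:                     # map: key by the join attribute b
        groups[b].append(('R', a))
    for b, c in S:
        groups[b].append(('S', c))
    out = []
    for b, vals in groups.items():     # reduce: pair every (R,a) with every (S,c)
        r_side = [a for tag, a in vals if tag == 'R']
        s_side = [c for tag, c in vals if tag == 'S']
        out.extend((a, b, c) for a in r_side for c in s_side)
    return sorted(out)

R = [('x', 'a'), ('y', 'b'), ('z', 'c'), ('w', 'd')]
S = [('a', 1), ('c', 3), ('d', 4), ('g', 7)]
print(natural_join_mr(R, S))   # [('w', 'd', 4), ('x', 'a', 1), ('z', 'c', 3)]
```

Keys b and g match nothing on the other side, so they contribute no output, exactly as in the tables above.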
Need For High-level Languages
• Hadoop is great for large-data
processing!
• But writing Java programs for
everything is verbose and slow
• Analysts don’t want to (or can’t) write
Java
• Solution: develop higher-level data
processing languages
• Hive: HQL is like SQL
• Pig: Pig Latin is a bit like Perl
HIVE AND PIG
• Hive: data warehousing application in Hadoop
• Query language is HQL, variant of SQL
• Tables stored on HDFS as flat files
• Developed by Facebook, now open source
• Pig: large-scale data processing system
• Scripts are written in Pig Latin, a dataflow
language
• Developed by Yahoo!, now open source
• Roughly 1/3 of all Yahoo! internal jobs
• Common idea:
• Provide higher-level language to facilitate large-
data processing
• Higher-level language “compiles down” to
Hadoop jobs
Matrix Multiplication Using Mapreduce 1

A (m × n) × B (n × l) = C (m × l), with c_ik = Σ_{j=1}^{n} a_ij b_jk

 Think of a matrix as a relation with three attributes
 For example, matrix A is represented by the relation A(I, J, V)
– For every non-zero entry (i, j, a_ij), the row number is the value of I,
the column number is the value of J, and the entry is the value in V
– Added advantage: most large matrices are sparse, so the relation
has far fewer entries
 The product is essentially a natural join followed by grouping with
aggregation
MATRIX MULTIPLICATION USING MR 2

A (m × n) × B (n × l) = C (m × l), with c_ik = Σ_{j=1}^{n} a_ij b_jk

 Natural join of A(I,J,V) and B(J,K,W) gives tuples (i, j, k, a_ij, b_jk)
 Map:
– For every (i, j, a_ij), emit key-value pair (j, (A, i, a_ij))
– For every (j, k, b_jk), emit key-value pair (j, (B, k, b_jk))
 Reduce:
for each key j
    for each pair of values (A, i, a_ij) and (B, k, b_jk)
        produce a key-value pair ((i,k), a_ij * b_jk)
MATRIX MULTIPLICATION USING MR 3

A (m × n) × B (n × l) = C (m × l), with c_ik = Σ_{j=1}^{n} a_ij b_jk

 The first MapReduce step has produced key-value pairs ((i,k), a_ij b_jk)
 A second MapReduce step groups and aggregates
 Map: identity, just emit the key-value pair ((i,k), a_ij b_jk)
 Reduce:
for each key (i,k)
    produce the sum of all the values for the key: c_ik = Σ_{j=1}^{n} a_ij b_jk
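The two-step method can be sketched as an in-memory simulation (assumed helper name; a real deployment would run the join and the aggregation as two separate MapReduce jobs):

```python
from collections import defaultdict

def matmul_two_step(A, B):
    """A, B as sparse triples: A holds (i, j, a_ij), B holds (j, k, b_jk)."""
    # Step 1 -- join on j: emit ((i,k), a_ij * b_jk) for every matching j
    by_j = defaultdict(lambda: ([], []))
    for i, j, a in A:
        by_j[j][0].append((i, a))
    for j, k, b in B:
        by_j[j][1].append((k, b))
    partials = []
    for j, (a_side, b_side) in by_j.items():
        for i, a in a_side:
            for k, b in b_side:
                partials.append(((i, k), a * b))
    # Step 2 -- group by (i,k) and sum the partial products
    C = defaultdict(int)
    for key, val in partials:
        C[key] += val
    return dict(C)

A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # [[1, 2], [3, 4]]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]   # [[5, 6], [7, 8]]
print(matmul_two_step(A, B))
# {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```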
MATRIX MULTIPLICATION USING
MAPREDUCE: METHOD 2

A (m × n) × B (n × l) = C (m × l), with c_ik = Σ_{j=1}^{n} a_ij b_jk

 A method with one MapReduce step
 Map:
– For every (i, j, a_ij), emit for all k = 1,…,l the key-value pair ((i,k), (A, j, a_ij))
– For every (j, k, b_jk), emit for all i = 1,…,m the key-value pair ((i,k), (B, j, b_jk))
 Reduce:
for each key (i,k)
    sort values (A, j, a_ij) and (B, j, b_jk) by j to group them by j
    for each j, multiply a_ij and b_jk
    sum the products for the key (i,k) to produce c_ik = Σ_{j=1}^{n} a_ij b_jk
Caveat: the value list for one key may not fit in main memory —
an expensive external sort!
Matrix Multiplication In One Step

• One reducer per output cell


• Each reducer computes Sumj (A[i,j] * B[j,k])
Matrix Multiply Example

C = A × B
A has dimensions L × M
B has dimensions M × N
C has dimensions L × N

Matrix multiplication: C[i,k] = Sum_j (A[i,j] * B[j,k])

In the map phase:
for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N
    Better: emit ((i,k), ('A', i, k, A[i,j])) for k in 1..N
for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L
    Better: emit ((i,k), ('B', i, k, B[j,k])) for i in 1..L

In the reduce phase — one reducer per output cell — emit
key = (i,k)
value = Sum_j (A[i,j] * B[j,k])
Illustrating Matrix Multiplication
Matrix Multiplication – 2 Phase
Step 3: Map is just identity – emit ((i, k), a_ij b_jk)
Matrix Multiplication – 1 Phase
Mapper for Matrix A: (k, v) = ((i, k), (A, j, A_ij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, B_jk)) for all i
Computing the mapper for Matrix A:
• The indices k, i, j determine how many pairs each entry produces.
• Here all dimensions are 2, so when k=1, i can have 2 values, 1 & 2,
• and each case can have 2 further values of j=1 and j=2.

k=1 i=1 j=1 ((1, 1), (A, 1, 1))


j=2 ((1, 1), (A, 2, 2))
i=2 j=1 ((2, 1), (A, 1, 3))
j=2 ((2, 1), (A, 2, 4))
k=2 i=1 j=1 ((1, 2), (A, 1, 1))
j=2 ((1, 2), (A, 2, 2))
i=2 j=1 ((2, 2), (A, 1, 3))
j=2 ((2, 2), (A, 2, 4))
Computing the mapper for Matrix B

i=1 j=1 k=1 ((1, 1), (B, 1, 5))


k=2 ((1, 2), (B, 1, 6))
j=2 k=1 ((1, 1), (B, 2, 7))
k=2 ((1, 2), (B, 2, 8))

i=2 j=1 k=1 ((2, 1), (B, 1, 5))


k=2 ((2, 2), (B, 1, 6))
j=2 k=1 ((2, 1), (B, 2, 7))
k=2 ((2, 2), (B, 2, 8))

The formula for the Reducer is:

• Reducer(k, v): for each key (i, k), make a sorted Alist and Blist
• (i, k) => Summation(A_ij * B_jk) over j
• Output => ((i, k), sum)
Reducer
• Observe that four keys appear in both mapper outputs: (1, 1), (1, 2), (2, 1) and (2, 2)
• Build separate lists for Matrix A & B with the adjoining values from the Mapper

(1, 1) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(1*5) + (2*7)] = 19 -----(i)

(1, 2) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(1*6) + (2*8)] = 22 -----(ii)

(2, 1) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(3*5) + (4*7)] = 43 -----(iii)

(2, 2) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(3*6) + (4*8)] = 50 ----(iv)

Thus we have ((1, 1), 19), ((1, 2), 22), ((2, 1), 43), ((2, 2), 50)
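The one-step method on this 2×2 example can be sketched as an in-memory simulation (assumed helper name; 1-indexed to match the worked example):

```python
from collections import defaultdict

def matmul_one_step(A, B, n_rows, n_cols):
    """One MapReduce step. A: (i, j, a_ij) triples; B: (j, k, b_jk) triples.
    n_rows = rows of A (L), n_cols = columns of B (N)."""
    groups = defaultdict(list)
    for i, j, a in A:                       # map A: replicate over all k
        for k in range(1, n_cols + 1):
            groups[(i, k)].append(('A', j, a))
    for j, k, b in B:                       # map B: replicate over all i
        for i in range(1, n_rows + 1):
            groups[(i, k)].append(('B', j, b))
    out = {}
    for (i, k), vals in groups.items():     # reduce: pair up by j, then sum
        a_by_j = {j: v for tag, j, v in vals if tag == 'A'}
        b_by_j = {j: v for tag, j, v in vals if tag == 'B'}
        out[(i, k)] = sum(a_by_j[j] * b_by_j[j] for j in a_by_j if j in b_by_j)
    return out

# The worked example: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
A = [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4)]
B = [(1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8)]
print(matmul_one_step(A, B, 2, 2))
# {(1, 1): 19, (1, 2): 22, (2, 1): 43, (2, 2): 50}
```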
EXAMPLE - VISITS PER HOUR
• A common metric that web analytic tools provide
about website traffic is the number of page views
on a per-hour basis.
• In order to compute the number of page visits for
each hour, we must create a custom Key class
that encapsulates an hour (day, month, year, and
hour) and then map that key to the number of
observed page views for that hour.
• Just as we did with the WordCount example, the
mapper will return the key mapped to the value 1,
and then the reducer and combiners will compute
the actual count of occurrences for each hour.
• The challenge is that we need to create a custom
key class to hold our date.
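The custom-key idea can be sketched in a few lines (a simulation with assumed names; the original describes a Java custom Key class, while here a plain tuple (year, month, day, hour) plays that role):

```python
from collections import defaultdict
from datetime import datetime

def visits_per_hour(timestamps):
    """timestamps: ISO-format strings from an access log (assumed format)."""
    counts = defaultdict(int)
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        # the "custom key": (year, month, day, hour)
        counts[(t.year, t.month, t.day, t.hour)] += 1   # map emits (key, 1)
    return dict(counts)   # reduce/combine: sum the 1s per hour key

log = ['2021-03-01T10:05:00', '2021-03-01T10:59:00', '2021-03-01T11:00:00']
print(visits_per_hour(log))
# {(2021, 3, 1, 10): 2, (2021, 3, 1, 11): 1}
```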
SENTIMENT ANALYSIS
(Figure: a Map–Reduce–Reduce pipeline over Twitter posts. TweetScan (Map)
infers a movie rating from each tweet: "Avatar was great" → Avatar 8,
"I hated Twilight" → Twilight 0, "Twilight was pretty bad" → Twilight 2,
"I enjoyed Avatar" → Avatar 7, "I loved Twilight" → Twilight 7,
"Avatar was okay" → Avatar 4. Summarize Rating (Reduce) computes the
median rating per movie: Avatar 7, Twilight 2. Count Per Rating Median
(Reduce) counts movies per median: median 2 → 1 movie, median 7 → 1 movie.)
APPLICATIONS OF
MAP-REDUCE
DISTRIBUTED GREP

• Very popular example to explain


how Map-Reduce works
• Demo program comes with Nutch
(where Hadoop originated)
Distributed Grep
For Unix guru:
grep -Eh <regex> <inDir>/* | sort | uniq -c | sort -nr

- counts lines in all files in <inDir> that match <regex>


and displays the counts in descending order
Example: File 1 contains the lines C, B, B, C; File 2 contains the lines C, A.

- grep -Eh 'A|C' in/* | sort | uniq -c | sort -nr

produces the result: 3 C, 1 A.

- Analyzing web server access logs to find the top


requested pages that match a given pattern
Distributed Grep
Map function in this case:
- input is (file offset, line)
- output is either:
1. an empty list [] (the line does not match)
2. a key-value pair [(line, 1)] (if it matches)

Reduce function in this case:


- input is (line, [1, 1, ...])
- output is (line, n) where n is the number of 1s in the
list.
Distributed Grep
(Same example: File 1 = C, B, B, C; File 2 = C, A; result: 3 C, 1 A.)

Map tasks: Reduce tasks:


(0, C) -> [(C, 1)] (A, [1]) -> (A, 1)
(2, B) -> [] (C, [1, 1, 1]) -> (C, 3)
(4, B) -> []
(6, C) -> [(C, 1)]
(0, C) -> [(C, 1)]
(2, A) -> [(A, 1)]
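The map and reduce functions above can be sketched end to end on the same example (an in-memory simulation with assumed names):

```python
import re
from collections import Counter

def distributed_grep(files, pattern):
    regex = re.compile(pattern)
    # map: (offset, line) -> [(line, 1)] if the line matches, else []
    matches = [line for lines in files for line in lines if regex.search(line)]
    # reduce: (line, [1, 1, ...]) -> (line, n); then sort by count descending
    counts = Counter(matches)
    return sorted(counts.items(), key=lambda kv: -kv[1])

file1 = ['C', 'B', 'B', 'C']
file2 = ['C', 'A']
print(distributed_grep([file1, file2], 'A|C'))   # [('C', 3), ('A', 1)]
```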
GEOGRAPHICAL DATA
• Large data sets including road,
intersection, and feature data
• Problems that Google Maps has
used MapReduce to solve
• Locating roads connected to a
given intersection
• Rendering of map tiles
• Finding nearest feature to a
given address or location
GEOGRAPHICAL DATA

• Input: List of roads and intersections


• Map: Creates pairs of connected
points (road, intersection) or (road,
road)
• Sort: Sort by key
• Reduce: Get list of pairs with same
key
• Output: List of all points that
connect to a particular road
GEOGRAPHICAL DATA
• Input: Graph describing node
network with all gas stations marked
• Map: Search five mile radius of each
gas station and mark distance to
each node
• Sort: Sort by key
• Reduce: For each node, emit path
and gas station with the shortest
distance
• Output: Graph marked and nearest
gas station to each node
Inverted Index For Text Collections

Collection and Documents

The curse of the black pearl: Ocean Ship Jack Sparrow Caribbean Turner
Elizabeth Gun Fight
Finding Nemo: Ocean Fish Nemo Reef Animation
Tintin: Animation Ship Haddock Tintin
Titanic: Ship Rose Jack Atlantic Ocean England Sink
The Dark Knight: Bruce Wayne Batman Joker Harvey Gordon Gun Fight Crime
Skyfall: 007 James Bond MI6 Gun Fight
Silence of the Lambs: Hannibal Lector Death FBI Crime Cannibal
The Ghost Ship: Ship Ghost Ocean Horror Gun
§ Document: unit of retrieval


§ Collection: the group of documents from which we retrieve
– Also called corpus (a body of texts)
Boolean retrieval
(Figure: the same eight-document collection, with occurrences of terms
such as "Captain" and "Jack" highlighted.)

§ Find all documents containing a word w
§ Find all documents containing a word w1 but not containing
the word w2
§ Queries in the form of any Boolean expression
§ Query: Jack
Inverted index

Documents: 1 Black pearl, 2 Finding Nemo, 3 Tintin, 4 Titanic,
5 Dark Knight, 6 Skyfall, 7 Silence of the Lambs, 8 Ghost Ship

Vocabulary / Dictionary with posting lists:
Ship    1 3 4 8
Jack    1 4
Bond    6
Gun     1 5 6 7
Ocean   1 2 3 4 8
Captain 1 3 4
Batman  5
Crime   5 7
Creating an inverted index

1 The curse of the black pearl: Ship Captain Jack Sparrow Caribbean
Elizabeth Gun Fight
2 Finding Nemo: Ocean Fish Nemo Reef Animation
3 Tintin: Ocean Animation Ship Captain Haddock Tintin
4 Titanic: Ship Rose Jack Atlantic Ocean England Sink Captain
5 The Dark Knight: Bruce Wayne Batman Joker Harvey Gordon Gun Fight Crime
6 Skyfall: 007 James Bond MI6 Gun Fight
7 Silence of the Lambs: Hannibal Lector FBI Crime Gun Cannibal
8 The Ghost Ship: Ship Ghost Ocean Death Horror

Map: for each term in each document, write out pairs (term, docId),
e.g. (Ship, 1), (Captain, 1), (Jack, 1), …, (Ship, 3), (Tintin, 3), …, (Jack, 4), …
Sort and group by term: (Captain, 1), …, (Jack, 1), (Jack, 4), …, (Ship, 1), (Ship, 3), …
Reduce: list the documents for each term:
Captain 1 …
Jack    1 4 …
Ship    1 3 …
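The map/sort/reduce steps above can be sketched as follows (an in-memory simulation over a small assumed subset of the collection; real indexers also handle tokenization and stop words):

```python
from collections import defaultdict

def inverted_index(docs):
    """docs: {doc_id: text}. Map emits (term, doc_id); reduce builds
    the sorted, de-duplicated posting list per term."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # map: emit (term, doc_id)
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: 'Ship Captain Jack', 3: 'Ship Tintin', 4: 'Ship Jack'}
idx = inverted_index(docs)
print(idx['ship'], idx['jack'])   # [1, 3, 4] [1, 4]
```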
TWITTER ANALYTICS
• Let us take a real-world example to comprehend the
power of MapReduce. Twitter receives around 500
million tweets per day, which is nearly 3000 tweets
per second. The following illustration shows how
Tweeter manages its tweets with the help of
MapReduce.
• As shown in the illustration, the MapReduce algorithm
performs the following actions −
• Tokenize − Tokenizes the tweets into maps of tokens
and writes them as key-value pairs.
• Filter − Filters unwanted words from the maps of
tokens and writes the filtered maps as key-value
pairs.
• Count − Generates a token counter per word.
• Aggregate Counters − Prepares an aggregate of
similar counter values into small manageable units.
MapReduce At FaceBook
• Facebook has a list of friends (note that friends are a bi-
directional thing on Facebook. If I'm your friend, you're mine).
• They also have lots of disk space and they serve hundreds of
millions of requests everyday.
• They've decided to pre-compute calculations when they can
to reduce the processing time of requests.
• One common processing request is the "You and XXX have 230
friends in common" feature.
• When you visit someone's profile, you see a list of friends that
you have in common.
• This list doesn't change frequently so it'd be wasteful to
recalculate it every time you visited the profile
• Facebook uses MapReduce to calculate everyone's
common friends once a day and store those results.
• Later on it's just a quick lookup: disk is plentiful and
cheap.
MR AT FACEBOOK
• Assume the friends are stored as Person->[List of
Friends], our friends list is then:
• A -> B C D
• B -> A C D E
• C -> A B D E
• D -> A B C E
• E -> B C D
• Each line will be an argument to a mapper.
• For every friend in the list of friends, the mapper will
output a key-value pair.
• The key will be a friend along with the person.
• The value will be the list of friends.
• The key will be sorted so that the friends are in order,
causing all pairs of friends to go to the same reducer.
MR AT FACEBOOK
• After all the mappers are done running, you'll have a list like this:
• For map(A -> B C D) :
• (A B) -> B C D
• (A C) -> B C D
• (A D) -> B C D
• For map(B -> A C D E) : (Note that A comes before B in the key)
• (A B) -> A C D E
• (B C) -> A C D E
• (B D) -> A C D E
• (B E) -> A C D E
• For map(C -> A B D E) :
• (A C) -> A B D E
• (B C) -> A B D E
• (C D) -> A B D E
• (C E) -> A B D E
• For map(D -> A B C E) :
• (A D) -> A B C E
• (B D) -> A B C E
• (C D) -> A B C E
• (D E) -> A B C E
• And finally for map(E -> B C D):
• (B E) -> B C D
• (C E) -> B C D
• (D E) -> B C D
MR AT FACEBOOK
• Before we send these key-value pairs to the reducers, we
group them by their keys and get:
• (A B) -> (A C D E) (B C D)
• (A C) -> (A B D E) (B C D)
• (A D) -> (A B C E) (B C D)
• (B C) -> (A B D E) (A C D E)
• (B D) -> (A B C E) (A C D E)
• (B E) -> (A C D E) (B C D)
• (C D) -> (A B C E) (A B D E)
• (C E) -> (A B D E) (B C D)
• (D E) -> (A B C E) (B C D)
MR AT FACEBOOK
• Each line will be passed as an argument to a reducer.
• The reduce function will simply intersect the lists of values and
output the same key with the result of the intersection.
• For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) :
(C D) and means that friends A and B have C and D as common
friends.
• The result after reduction is:
• (A B) -> (C D)
• (A C) -> (B D)
• (A D) -> (B C)
• (B C) -> (A D E)
• (B D) -> (A C E)
• (B E) -> (C D)
• (C D) -> (A B E)
• (C E) -> (B D)
• (D E) -> (B C)
• Now when D visits B's profile, we can quickly look up (B D) and
see that they have three friends in common, (A C E).
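The whole common-friends job can be sketched as an in-memory simulation (assumed helper name; the sorted pair key is exactly what routes both friend lists to the same reducer):

```python
from collections import defaultdict

def common_friends(friends):
    """friends: {person: set of friends}, assumed symmetric."""
    groups = defaultdict(list)
    for person, flist in friends.items():           # map
        for friend in flist:
            key = tuple(sorted((person, friend)))   # sorted key -> same reducer
            groups[key].append(flist)
    # reduce: intersect the two friend lists collected for each pair
    return {pair: set.intersection(*map(set, lists))
            for pair, lists in groups.items()}

friends = {'A': {'B', 'C', 'D'}, 'B': {'A', 'C', 'D', 'E'},
           'C': {'A', 'B', 'D', 'E'}, 'D': {'A', 'B', 'C', 'E'},
           'E': {'B', 'C', 'D'}}
cf = common_friends(friends)
print(sorted(cf[('B', 'D')]))   # ['A', 'C', 'E']
print(sorted(cf[('A', 'B')]))   # ['C', 'D']
```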
MAP REDUCE INAPPLICABILITY

• Database management
• Sub-optimal implementation for
DB
• Does not provide traditional DBMS
features
• Lacks support for default DBMS
tools
Map Reduce Inapplicability

Database implementation issues


• Lack of a schema
• No separation from application
program
• No indexes
• Reliance on brute force
Map Reduce Inapplicability
Feature absence and tool
incompatibility
• Transaction updates
• Changing data and maintaining
data integrity
• Data mining and replication tools
• Database design and construction
tools
