Module 2: MapReduce Paradigm
Case 1: Large n, M does not fit into main memory, but v does
Since v fits into main memory, v is available to every Map task
Map: for each matrix element m_ij, emit the key-value pair (i, m_ij v_j)
Shuffle and sort: groups all m_ij v_j values with the same i together
Reduce: sum m_ij v_j over all j for the same i
HOW THE VECTOR IS REPLICATED:
Mv = (x_i), where x_i = Σ_{j=1}^{n} m_ij v_j
Each Map task emits the pairs (i, m_ij v_j).
[Figure: v is copied to every Map task alongside a chunk of M; in Case 2 below, the whole vector no longer fits in main memory.]
Case 2: Very large n, even v does not fit into main memory
For every Map task, many disk accesses (for parts of v) would be required!
Solution:
– How much of v will fit in memory?
– Partition v into stripes, and partition M into corresponding vertical stripes, so that each stripe of v fits into memory
– Take the dot product of one stripe of v with the corresponding stripe of M
– Map and Reduce are the same as before (see the sketch below)
[Figure: the matrix is divided into vertical stripes (one color per stripe), and each stripe is further divided into chunks; the vector is divided into corresponding stripes.]
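A minimal Python sketch (not from the slides) of the striping idea, assuming M is stored as (i, j, m_ij) triples and using an in-memory dictionary to stand in for the shuffle; the name stripe_width is illustrative:

from collections import defaultdict

def striped_matvec(entries, v, stripe_width):
    """Simulate MapReduce matrix-vector multiplication when v is too large
    to hold in memory: v is split into stripes of `stripe_width`, and each
    map task only loads the stripe matching its chunk of M."""
    partials = defaultdict(float)              # shuffle: key i -> running sum

    # Map: each (i, j, m_ij) is processed against the stripe of v that
    # contains v_j; only that stripe needs to be in memory at a time.
    for (i, j, m_ij) in entries:
        stripe = j // stripe_width             # which stripe of v holds v_j
        v_stripe = v[stripe * stripe_width:(stripe + 1) * stripe_width]
        partials[i] += m_ij * v_stripe[j % stripe_width]

    # Reduce: the per-row sums are the entries of M·v.
    return dict(partials)

# Usage: M = [[1, 2], [3, 4]] as triples, v = [1, 1] -> {0: 3.0, 1: 7.0}
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
print(striped_matvec(entries, [1, 1], stripe_width=1))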
MATRIX-VECTOR MULTIPLICATION MR CODE
map(key, value):
    # value is a chunk of matrix entries; the vector v is in memory at every mapper
    for (i, j, a_ij) in value:
        emit(i, a_ij * v[j])

reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)
RELATIONAL ALGEBRA
• Primitives
• Projection (π)
• Selection (σ)
• Cartesian product (×)
• Set union (∪)
• Set difference (-)
• Rename (ρ)
• Other operations
• Join (⋈)
• Group by… aggregation
•…
RELATIONAL ALGEBRA
• R, S - relations
• t, t' - tuples
• C - a selection condition
• A, B, C - subsets of attributes
• a, b, c - attribute values for a given subset of attributes
• Relations (however big) can be stored in a distributed filesystem - if they don't fit on a single machine, they are broken into pieces (think HDFS)
BIG DATA ANALYSIS
• Peta-scale datasets are everywhere:
• Facebook has 2.5 PB of user data + 15 TB/day
(4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• …
• A lot of these datasets are (mostly) structured
• Query logs
• Point-of-sale records
• User data (e.g., demographics)
• …
• How do we perform data analysis at scale?
• Relational databases and SQL
• MapReduce (Hadoop)
Relational Databases vs. MapReduce
• Relational databases:
• Multipurpose: analysis and transactions; batch
and interactive
• Data integrity via ACID transactions
• Lots of tools in software ecosystem (for ingesting,
reporting, etc.)
• Supports SQL (and SQL integration, e.g., JDBC)
• Automatic SQL query optimization
• MapReduce (Hadoop):
• Designed for large clusters, fault tolerant
• Data is accessed in “native format”
• Supports many query languages
• Programmers retain control over performance
• Open source
Selection Using Mapreduce
• Map: For each tuple t in R, test if t satisfies
C. If so, produce the key-value pair (t, t).
• Reduce: The identity function. It simply
passes each key-value pair to the output.
[Figure: tuples R1-R5 flow through the Map tasks; only the tuples satisfying C (here R1 and R3) appear in the output.]
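A minimal Python sketch of selection as Map and Reduce, assuming the condition C is passed in as a predicate; the plumbing around emit is simulated and not from the slides:

def selection_map(tuples, condition):
    """Map: emit (t, t) for every tuple t that satisfies the condition C."""
    for t in tuples:
        if condition(t):
            yield (t, t)

def selection_reduce(key, values):
    """Reduce: identity - pass each key-value pair through to the output."""
    for v in values:
        yield (key, v)

# Usage: select tuples whose second field is greater than 10.
R = [("R1", 5), ("R2", 20), ("R3", 15)]
print(list(selection_map(R, lambda t: t[1] > 10)))
# [(('R2', 20), ('R2', 20)), (('R3', 15), ('R3', 15))]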
PROJECTION
[Figure: each input tuple R1-R5 is mapped to its projection onto the chosen attributes.]
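The slide shows only the picture; a minimal sketch of the usual approach (project each tuple in the Map, let the Reduce collapse the duplicates that projection can create), with illustrative names:

def projection_map(tuples, attrs):
    """Map: emit (t', t') where t' is the tuple restricted to the attributes in `attrs`."""
    for t in tuples:
        t_proj = tuple(t[a] for a in attrs)
        yield (t_proj, t_proj)

def projection_reduce(key, values):
    """Reduce: emit (t', t') exactly once, removing duplicates created by the projection."""
    yield (key, key)

# Usage: project R(A, B, C) onto A (index 0); the first and third tuples collapse.
R = [(1, 10, 12), (2, 20, 34), (1, 10, 22)]
grouped = {}
for k, v in projection_map(R, attrs=[0]):
    grouped.setdefault(k, []).append(v)
print([out for k, vs in grouped.items() for out in projection_reduce(k, vs)])
# [((1,), (1,)), ((2,), (2,))]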
Union Using Mapreduce
• Suppose R and S have the same
schema
• Map tasks are generated from
chunks of both R and S
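A minimal sketch of the usual union scheme (emit every tuple as its own key, emit each key once in the Reduce); the dictionary shuffle is an assumption standing in for the framework:

def union(R, S):
    """Union via MapReduce: the Map emits (t, t) for every tuple of R and S;
    the Reduce emits each key once, so duplicates across R and S collapse."""
    grouped = {}
    for t in list(R) + list(S):      # Map + shuffle
        grouped.setdefault(t, []).append(t)
    return [t for t in grouped]      # Reduce: one output per key

# Usage: (2, 'b') appears in both relations but only once in the result.
R = [(1, "a"), (2, "b")]
S = [(2, "b"), (3, "c")]
print(union(R, S))                   # [(1, 'a'), (2, 'b'), (3, 'c')]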
INTERSECTION (R ∩ S)
DIFFERENCE (R-S)
GROUPING AND AGGREGATION USING MAPREDUCE
• Group and aggregate on a relation R(A, B), grouping by attribute A
  (e.g., select A, sum(B) from R group by A;)
• Map:
  • For each tuple t = (a, b) of R, emit the key-value pair (a, b)
• Reduce:
  • For each key a, apply the aggregation operator (e.g., sum or average) to the list of b values and emit (a, aggregate)
• Example relation R(A, B): (y, 1), (z, 4), (z, 1), (x, 5)
[Worked example: R(A, B, C) = {R1 (1, 10, 12), R2 (2, 20, 34), R3 (1, 10, 22), R4 (1, 30, 56), R5 (3, 40, 17), R6 (2, 10, 49), R7 (1, 20, 44)}. MAP 1 emits (1, 10), (2, 20), (1, 10), (1, 30), (3, 40); MAP 2 emits (2, 10), (1, 20). After grouping, Reducer 1 receives (1, [10, 10, 30, 20]) and Reducer 2 receives (2, [20, 10]) and (3, [40]); aggregating (here by averaging B) gives (1, 17.5), (2, 15), (3, 40).]
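A minimal Python simulation of this grouping and aggregation, using sum(B) as in the query above; the dictionary stands in for the shuffle and the helper name is illustrative:

from collections import defaultdict

def group_aggregate(tuples, aggregate=sum):
    """Map emits (a, b) per tuple (a, b); the shuffle groups the b values by a;
    the Reduce applies the aggregate to each group."""
    groups = defaultdict(list)
    for (a, b) in tuples:                                    # Map + shuffle
        groups[a].append(b)
    return {a: aggregate(bs) for a, bs in groups.items()}    # Reduce

# Usage: select A, sum(B) from R group by A
R = [("y", 1), ("z", 4), ("z", 1), ("x", 5)]
print(group_aggregate(R))            # {'y': 1, 'z': 5, 'x': 5}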
Grouping & Aggregation Summary
NATURAL JOIN
MAP-REDUCE EXAMPLE : JOIN
select R.A, R.B, S.D from R, S where R.A = S.A;
[Worked example: R(A, B, C) = {R1 (1, 10, 12), R2 (2, 20, 34), R3 (1, 10, 22), R4 (1, 30, 56), R5 (3, 40, 17)} and S(A, D, E) = {S1 (1, 20, 22), S2 (2, 30, 36), S3 (2, 10, 29), S4 (3, 50, 16), S5 (3, 40, 37)}. MAP 1 tags the R tuples and emits (1, [R, 10]), (2, [R, 20]), (1, [R, 10]), (1, [R, 30]), (3, [R, 40]); MAP 2 tags the S tuples and emits (1, [S, 20]), (2, [S, 30]), (2, [S, 10]), (3, [S, 50]), (3, [S, 40]). Each reducer pairs every R value with every S value for its key; e.g., key 2 receives [(R, 20), (S, 30), (S, 10)] and emits (2, 20, 30) and (2, 20, 10), and key 3 receives [(R, 40), (S, 50), (S, 40)] and emits (3, 40, 50) and (3, 40, 40).]
Natural Join Using MapReduce
• Join R(A, B) with S(B, C) on attribute B
• Map:
  • For each tuple t = (a, b) of R, emit the key-value pair (b, (R, a))
  • For each tuple t = (b, c) of S, emit the key-value pair (b, (S, c))
• Reduce:
  • Each key b is associated with a list of values of the form (R, a) or (S, c)
  • Construct all pairs consisting of one value with first component R and one with first component S, say (R, a) and (S, c). The output for this key and value list is a sequence of key-value pairs
  • The key is irrelevant. Each value is one of the triples (a, b, c) such that (R, a) and (S, c) are on the input list of values
• Example: R(A, B) = {(x, a), (y, b), (z, c), (w, d)}; S(B, C) = {(a, 1), (c, 3), (d, 4), (g, 7)}
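A minimal Python simulation of this join, tagging each value with the relation it came from as described above; the in-memory shuffle is an assumption standing in for the framework:

from collections import defaultdict

def natural_join(R, S):
    """Join R(A, B) with S(B, C) on attribute B via a simulated MapReduce."""
    grouped = defaultdict(list)
    for (a, b) in R:                      # Map over R: key b, value ('R', a)
        grouped[b].append(("R", a))
    for (b, c) in S:                      # Map over S: key b, value ('S', c)
        grouped[b].append(("S", c))

    joined = []
    for b, values in grouped.items():     # Reduce: pair every R value with every S value
        r_vals = [a for tag, a in values if tag == "R"]
        s_vals = [c for tag, c in values if tag == "S"]
        joined.extend((a, b, c) for a in r_vals for c in s_vals)
    return joined

# Usage with the example relations from the slide:
R = [("x", "a"), ("y", "b"), ("z", "c"), ("w", "d")]
S = [("a", 1), ("c", 3), ("d", 4), ("g", 7)]
print(natural_join(R, S))                 # [('x', 'a', 1), ('z', 'c', 3), ('w', 'd', 4)]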
Need For High-level Languages
• Hadoop is great for large-data
processing!
• But writing Java programs for
everything is verbose and slow
• Analysts don’t want to (or can’t) write
Java
• Solution: develop higher-level data
processing languages
• Hive: HQL is like SQL
• Pig: Pig Latin is a bit like Perl
HIVE AND PIG
• Hive: data warehousing application in Hadoop
• Query language is HQL, variant of SQL
• Tables stored on HDFS as flat files
• Developed by Facebook, now open source
• Pig: large-scale data processing system
• Scripts are written in Pig Latin, a dataflow
language
• Developed by Yahoo!, now open source
• Roughly 1/3 of all Yahoo! internal jobs
• Common idea:
• Provide higher-level language to facilitate large-
data processing
• Higher-level language “compiles down” to
Hadoop jobs
Matrix Multiplication Using MapReduce
A (m × n) × B (n × l) = C (m × l), where c_ik = Σ_{j=1}^{n} a_ij b_jk
Matrix elements are represented as tuples: (i, j, a_ij) for A and (j, k, b_jk) for B.
C = A × B
A has dimensions L × M
B has dimensions M × N
C has dimensions L × N
Matrix multiplication: C[i, k] = Sum_j (A[i, j] * B[j, k])
In the map phase:
  for each element (i, j) of A, emit ((i, k), A[i, j]) for k in 1..N
    Better: emit ((i, k), ('A', j, A[i, j])) for k in 1..N
  for each element (j, k) of B, emit ((i, k), B[j, k]) for i in 1..L
    Better: emit ((i, k), ('B', j, B[j, k])) for i in 1..L
In the reduce phase:
  one reducer per output cell, emit
    key = (i, k)
    value = Sum_j (A[i, j] * B[j, k])
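A minimal Python simulation of this one-phase scheme, assuming A and B arrive as (i, j, a_ij) and (j, k, b_jk) triples and simulating the shuffle with a dictionary:

from collections import defaultdict

def matmul_1phase(A_entries, B_entries, L, N):
    """One-phase MapReduce matrix multiplication: C[i, k] = sum_j A[i, j] * B[j, k]."""
    cells = defaultdict(list)
    # Map: replicate each A entry across all k and each B entry across all i,
    # tagging values with the matrix name and the join index j.
    for (i, j, a) in A_entries:
        for k in range(N):
            cells[(i, k)].append(("A", j, a))
    for (j, k, b) in B_entries:
        for i in range(L):
            cells[(i, k)].append(("B", j, b))

    # Reduce: for each output cell (i, k), pair A and B values on j and sum.
    C = {}
    for (i, k), values in cells.items():
        a_by_j = {j: a for tag, j, a in values if tag == "A"}
        b_by_j = {j: b for tag, j, b in values if tag == "B"}
        C[(i, k)] = sum(a_by_j[j] * b_by_j[j] for j in a_by_j if j in b_by_j)
    return C

# Usage: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]] -> C = [[19, 22], [43, 50]]
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
print(matmul_1phase(A, B, L=2, N=2))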
Illustrating Matrix Multiplication
Matrix Multiplication – 2 Phase
Step 3: the Map is just the identity – emit ((i, k), a_ij b_jk)
Matrix Multiplication – 1 Phase
Mapper for matrix A: (k, v) = ((i, k), ('A', j, A_ij)) for all k
Mapper for matrix B: (k, v) = ((i, k), ('B', j, B_jk)) for all i
Counting the mapper output for matrix A:
• One pair is emitted for every combination of i, j, and k.
• Here all dimensions are 2, so for k = 1, i can take the 2 values 1 and 2,
• and each of those cases has 2 further values, j = 1 and j = 2.
APPLICATIONS OF
MAP-REDUCE
DISTRIBUTED GREP
Inverted Index
[Example index: vocabulary/dictionary terms with their posting lists]
Ship: 1, 3, 4, 8
Jack: 1, 4
Bond: 6
Gun: 1, 5, 6, 7
Ocean: 1, 2, 3, 4, 8
Captain: 1, 3, 4
Batman: 5
Crime: 5, 7
Creating an inverted index
Map: for each term in each document, write out the pair (term, docId)
Sort and group by term
Reduce: list the documents for each term
[Figure: four example documents - 1 The Curse of the Black Pearl, 2 Finding Nemo, 3 Tintin, 4 Titanic - with terms such as Ship, Captain, Jack, Sparrow, Ocean, Fish, Animation, Nemo, Reef, Caribbean, Elizabeth, Gun, Rose, Atlantic, England, Sink, Fight, Haddock, Tintin. The Map phase emits (term, docId) pairs such as (Ship, 1), (Captain, 1), (Jack, 1), (Ship, 3), (Tintin, 3), (Jack, 4); after sorting and grouping by term, the Reduce emits the posting list per term, e.g., Jack → 1, 4 and Ship → 1, 3, ….]
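A minimal Python sketch of the index construction, assuming the input is a mapping from docId to term list and using a dictionary for the sort-and-group step:

from collections import defaultdict

def build_inverted_index(docs):
    """Map: emit (term, docId); group by term; Reduce: emit the posting list."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():          # Map + shuffle
        for term in terms:
            postings[term].add(doc_id)
    # Reduce: a sorted posting list per term
    return {term: sorted(ids) for term, ids in postings.items()}

# Usage with a few of the slide's example terms:
docs = {1: ["Ship", "Captain", "Jack"], 3: ["Ship", "Tintin"], 4: ["Jack", "Ship"]}
print(build_inverted_index(docs))
# {'Ship': [1, 3, 4], 'Captain': [1], 'Jack': [1, 4], 'Tintin': [3]}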
TWITTER ANALYTICS
• Let us take a real-world example to see the power of MapReduce. Twitter receives around 500 million tweets per day, which is roughly 6,000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
• As shown in the illustration, the MapReduce algorithm performs the following actions:
• Tokenize: tokenizes the tweets into maps of tokens and writes them as key-value pairs.
• Filter: filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
• Count: generates a token counter per word.
• Aggregate Counters: prepares an aggregate of similar counter values into small manageable units.
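A minimal Python sketch of the Tokenize / Filter / Count / Aggregate pipeline described above; the stop-word list and function names are illustrative, not Twitter's actual implementation:

from collections import Counter

STOP_WORDS = {"the", "a", "is", "to"}       # illustrative filter list

def tokenize(tweets):
    """Tokenize: split each tweet into (token, 1) key-value pairs."""
    return [(word.lower(), 1) for tweet in tweets for word in tweet.split()]

def filter_tokens(pairs):
    """Filter: drop unwanted words from the token stream."""
    return [(w, c) for (w, c) in pairs if w not in STOP_WORDS]

def count_and_aggregate(pairs):
    """Count + Aggregate Counters: sum the counters per word."""
    counts = Counter()
    for word, c in pairs:
        counts[word] += c
    return dict(counts)

# Usage:
tweets = ["the game is on", "on to the next game"]
print(count_and_aggregate(filter_tokens(tokenize(tweets))))
# {'game': 2, 'on': 2, 'next': 1}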
MapReduce At FaceBook
• Facebook has a list of friends (note that friends are a bi-
directional thing on Facebook. If I'm your friend, you're mine).
• They also have lots of disk space and they serve hundreds of millions of requests every day.
• They've decided to pre-compute calculations when they can
to reduce the processing time of requests.
• One common processing request is the "You and XXX have 230
friends in common" feature.
• When you visit someone's profile, you see a list of friends that
you have in common.
• This list doesn't change frequently so it'd be wasteful to
recalculate it every time you visited the profile
• Facebook uses MapReduce to calculate everyone's common friends once a day and store those results.
• Later on it's just a quick lookup. We've got lots of disk; it's cheap.
MR AT FACEBOOK
• Assume the friends are stored as Person->[List of
Friends], our friends list is then:
• A -> B C D
• B -> A C D E
• C -> A B D E
• D -> A B C E
• E -> B C D
• Each line will be an argument to a mapper.
• For every friend in the list of friends, the mapper will
output a key-value pair.
• The key will be a friend along with the person.
• The value will be the list of friends.
• The key will be sorted so that the friends are in order,
causing all pairs of friends to go to the same reducer.
MR AT FACEBOOK
• After all the mappers are • For map(C -> A B D E) :
done running, you'll have a • (A C) -> A B D E
list like this: • (B C) -> A B D E
• For map(A -> B C D) : • (C D) -> A B D E
• (A B) -> B C D • (C E) -> A B D E
• (A C) -> B C D • For map(D -> A B C E) :
• (A D) -> A B C E
• (A D) -> B C D • (B D) -> A B C E
• For map(B -> A C D E) : (Note • (C D) -> A B C E
that A comes before B in the • (D E) -> A B C E
key)
• And finally for map(E ->
• (A B) -> A C D E B C D):
• (B C) -> A C D E • (B E) -> B C D
• (B D) -> A C D E • (C E) -> B C D
• (D E) -> B C D
• (B E) -> A C D E
MR AT FACEBOOK
• Before we send these key-value pairs to the reducers, we
group them by their keys and get:
• (A B) -> (A C D E) (B C D)
• (A C) -> (A B D E) (B C D)
• (A D) -> (A B C E) (B C D)
• (B C) -> (A B D E) (A C D E)
• (B D) -> (A B C E) (A C D E)
• (B E) -> (A C D E) (B C D)
• (C D) -> (A B C E) (A B D E)
• (C E) -> (A B D E) (B C D)
• (D E) -> (A B C E) (B C D)
MR AT FACEBOOK
• Each line will be passed as an argument to a reducer.
• The reduce function will simply intersect the lists of values and
output the same key with the result of the intersection.
• For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) :
(C D) and means that friends A and B have C and D as common
friends.
• The result after reduction is:
• (A B) -> (C D)
• (A C) -> (B D)
• (A D) -> (B C)
• (B C) -> (A D E)
• (B D) -> (A C E)
• (B E) -> (C D)
• (C D) -> (A B E)
• (C E) -> (B D)
• (D E) -> (B C)
• Now when D visits B's profile, we can quickly look up (B D) and
see that they have three friends in common, (A C E).
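A minimal Python simulation of this common-friends job, building sorted pair keys in the Map step and intersecting the value lists in the Reduce step; the dictionary shuffle is an assumption standing in for Hadoop's machinery:

from collections import defaultdict

def common_friends(friend_lists):
    """For every pair of friends, compute the friends they have in common."""
    grouped = defaultdict(list)
    # Map: for each person -> friends line, emit (sorted pair, friend list).
    for person, friends in friend_lists.items():
        for friend in friends:
            key = tuple(sorted((person, friend)))
            grouped[key].append(set(friends))
    # Reduce: intersect the two lists that arrived for each pair.
    return {pair: sorted(set.intersection(*lists)) for pair, lists in grouped.items()}

# Usage with the slide's friends lists:
friends = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}
result = common_friends(friends)
print(result[("B", "D")])   # ['A', 'C', 'E'] -> B and D have three friends in common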
MAP REDUCE INAPPLICABILITY
• Database management
• Sub-optimal implementation for
DB
• Does not provide traditional DBMS
features
• Lacks support for default DBMS
tools
Map Reduce Inapplicability