Hadoop Map Reduce Concepts - Teaching - 1
Hadoop MapReduce
• Hadoop MapReduce is a
– Software framework
– For easily writing applications
– Which process vast amounts of data (multi-terabyte data-sets)
– In parallel on large clusters (thousands of nodes) of commodity hardware
– In a reliable, fault-tolerant manner.
• A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner.
• The framework sorts the outputs of the maps, which are then input to
the reduce tasks.
• Typically both the input and the output of the job are stored in a file-system.
• The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
JobTracker and TaskTracker
• The MapReduce framework consists of a
– single master JobTracker and
– one slave TaskTracker per cluster-node.
• The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.
• The slaves execute the tasks as directed by the master.
Job Specification
• Minimally, applications specify the input/output locations and
supply map and reduce functions via implementations of appropriate
interfaces and/or abstract-classes.
• These, and other job parameters, comprise the job configuration.
• The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client (a minimal driver sketch follows this list).
• Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.
– Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
– Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI™ based).
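• A minimal Java driver sketch of such a job configuration (the class names WordCountDriver, TokenizerMapper and SumReducer, and the argument paths, are illustrative, not taken from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Supplies the items the slide lists: input/output locations plus map and reduce
// implementations; the job client then submits this configuration to the cluster,
// which distributes the software, schedules the tasks and monitors them.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);   // hypothetical Mapper implementation
    job.setCombinerClass(SumReducer.class);      // the combiner reuses the reducer logic
    job.setReducerClass(SumReducer.class);       // hypothetical Reducer implementation
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
  }
}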
Input Output
• The MapReduce framework operates exclusively on <key,
value> pairs.
• That is, the framework views the input to the job as a set
of <key, value> pairs and produces a set of <key, value> pairs as
the output of the job, conceivably of different types.
• The key and value classes have to be serializable by the
framework and hence need to implement
the Writable interface.
• Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework (a minimal custom key sketch follows this list).
• Input and Output types of a MapReduce job:
– (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
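• As an illustration of the Writable / WritableComparable requirements above, a sketch of a custom composite key (the name TermDocKey is illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// write()/readFields() satisfy the serialization (Writable) requirement;
// compareTo() is what lets the framework sort keys between map and reduce.
public class TermDocKey implements WritableComparable<TermDocKey> {
  private String term = "";
  private String docId = "";

  public TermDocKey() {}   // Hadoop needs a no-arg constructor for deserialization

  public TermDocKey(String term, String docId) {
    this.term = term;
    this.docId = docId;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(term);
    out.writeUTF(docId);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    term = in.readUTF();
    docId = in.readUTF();
  }

  @Override
  public int compareTo(TermDocKey other) {
    int c = term.compareTo(other.term);
    return c != 0 ? c : docId.compareTo(other.docId);
  }
}

• Note: a real key class should also override hashCode() and equals() so that partitioning and grouping stay consistent with compareTo().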
Hadoop Map/Reduce Architecture
• Client: submits MR jobs (UI for submitting jobs); polls the JobTracker for status information
• JobTracker (Master): accepts MR jobs; assigns tasks to the slaves; monitors tasks; handles failures
• TaskTracker (Slaves): run Map and Reduce tasks as directed; manage intermediate output
• Task: runs the Map and Reduce functions; reports progress
Hadoop HDFS + MR cluster - putting them together
• Locality optimizations
– With large data, bandwidth to data is a problem
– Map tasks are scheduled close to the inputs when possible
• Automatic re-execution on failure
• HTTP monitoring UI
• [Diagram: the client submits a job to the JobTracker, which assigns tasks to cluster nodes that each run a TaskTracker (T) alongside an HDFS DataNode (D)]
• Streaming
– http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
• Pipes
– http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html
• Pig
– http://hadoop.apache.org/pig/
• Hive
– http://hive.apache.org
Example: Compute TF-IDF using Map/Reduce
• TF-IDF (Term Frequency, Inverse Document Frequency) is a basic technique for weighting how important a term is to a document in a collection.
class Combiner
  method Combine(term t, [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)
class Reducer
  method Reduce(term t, counts [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)
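• The slide shows only the Combiner and Reducer of the counting step (the term-frequency part of TF-IDF); the Mapper that feeds them tokenizes each document and emits (term, 1). A minimal Java sketch (the class name TermCountMapper is illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (term, 1) for every token of the input; the Combiner and Reducer
// shown above then sum these counts per term.
public class TermCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text term = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      term.set(tokens.nextToken().toLowerCase());
      ctx.write(term, ONE);
    }
  }
}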
Collating
• Problem Statement:
– There is a set of items and some function f of one item.
– It is required to save all items that have the same value of f into one file, or to perform some other computation that requires all such items to be processed as a group.
– The most typical example is the building of inverted indexes.
class Mapper
  method Map(docid id, doc d)
    for all term t in doc d do
      Emit(term f(t), t)
class Reducer
  method Reduce(term f(t), [t1, t2,...])
    // process or save the whole group of items that share f(t)
    Emit(term f(t), items [t1, t2,...])
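• For the inverted-index example, a minimal Java sketch might look as follows (it assumes an input format such as KeyValueTextInputFormat that delivers the document id as the map input key; class names are illustrative):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
  // Here f(t) is simply the term itself: emit (term, docId) for every term of the document.
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text doc, Context ctx) throws IOException, InterruptedException {
      for (String term : doc.toString().split("\\s+")) {
        ctx.write(new Text(term), docId);
      }
    }
  }

  // Collect all document ids that share the same term into one posting list.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context ctx) throws IOException, InterruptedException {
      Set<String> postings = new HashSet<>();
      for (Text id : docIds) {
        postings.add(id.toString());
      }
      ctx.write(term, new Text(String.join(",", postings)));
    }
  }
}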
Filtering (“Grepping”), Parsing, and Validation
• Problem Statement:
– There is a set of records and it is required to collect all
records that meet some condition or transform each record
(independently from other records) into another
representation.
– The latter case includes such tasks as text parsing and value extraction, or conversion from one format to another.
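• Filtering of this kind is often run as a map-only job (job.setNumReduceTasks(0)); a minimal Java sketch of the grepping case, with an illustrative configuration parameter for the pattern:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each record is checked independently of all others; records that satisfy
// the predicate are emitted unchanged.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private String pattern;

  @Override
  protected void setup(Context ctx) {
    pattern = ctx.getConfiguration().get("grep.pattern", "ERROR");  // "grep.pattern" is an assumed parameter name
  }

  @Override
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    if (record.toString().contains(pattern)) {
      ctx.write(record, NullWritable.get());
    }
  }
}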
• Example: a simulation workload split across Mappers.
– There is a software simulator of a digital communication system (e.g. WiMAX) that passes some volume of random data through the system model and computes the error probability of the throughput.
– Each Mapper runs the simulation for a specified amount of data (1/Nth of the required sampling) and emits its error rate; a Reducer can then aggregate (e.g. average) the per-Mapper error rates.
Iterative Message Passing (Graph Processing)
• Each node of a network sends messages to its neighbors, and each neighbor updates its state on the basis of the received messages.
• From the technical point of view, the Mapper emits messages for each node, using the ID of the adjacent node as a key.
• As a result, all messages are grouped by the receiving node, and the Reducer is able to re-compute the state and rewrite the node with its new state.
Solution:
class Mapper
  method Map(id n, object N)
    Emit(id n, object N)
    for all id m in N.OutgoingRelations do
      Emit(id m, message getMessage(N))
class Reducer
  method Reduce(id m, [s1, s2,...])
    object M = null
    messages = []
    for all s in [s1, s2,...] do
      if IsObject(s) then
        M = s
      else // s is a message
        messages.add(s)
    M.State = calculateState(messages)
    Emit(id m, object M)
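• One MapReduce job corresponds to one round of message passing, so the pattern is iterated (the output of round i becomes the input of round i + 1) until the states stop changing or a round limit is reached. A minimal driver-loop sketch in Java (paths, job names, and the MessagePassing* classes are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Runs one MapReduce job per message-passing round.
public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    String base = args[0];
    int maxRounds = Integer.parseInt(args[1]);
    for (int round = 0; round < maxRounds; round++) {
      Job job = Job.getInstance(new Configuration(), "message-passing-" + round);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(MessagePassingMapper.class);    // hypothetical: emits each node plus its outgoing messages
      job.setReducerClass(MessagePassingReducer.class);  // hypothetical: recomputes each node's state
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(base + "/round-" + round));
      FileOutputFormat.setOutputPath(job, new Path(base + "/round-" + (round + 1)));
      if (!job.waitForCompletion(true)) {
        System.exit(1);   // a convergence check could also break out of the loop early
      }
    }
  }
}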
Case Study: Availability Propagation Through The Tree of Categories
• Problem Statement:
– This problem is inspired by a real-life eCommerce task.
– There is a tree of categories that branches out from large categories (like Men, Women, Kids) to smaller ones (like Men Jeans or Women Dresses), and eventually to small end-of-line categories (like Men Blue Jeans).
– An end-of-line category is either available (contains products) or not.
– A higher-level category is available if there is at least one available end-of-line category in its subtree.
– The goal is to calculate availabilities for all categories when the availabilities of the end-of-line categories are known.
Case Study: Availability Propagation Through The Tree of Categories
class N
  State in {True = 2, False = 1, null = 0} // initialized to 1 or 2 for end-of-line categories, 0 otherwise
method getMessage(object N)
  return N.State
method calculateState(messages [m1, m2,...])
  return max([m1, m2,...]) // a category is available (2) if at least one child subtree reports available
// Variant of getMessage for path-length style problems (e.g. breadth-first search), where each hop adds 1:
method getMessage(N)
  return N.State + 1
Distinct Values (Unique Items Counting)
• Problem Statement:
– There is a set of records; each record contains a value f and a list of categories [g1, g2,...].
– Count the number of unique values of f that occur in the records of each category g.
– Sample result (number of distinct F values per category):
a -> 3 // F=1, F=2, F=3
b -> 2 // F=1, F=3
d -> 1 // F=2
e -> 1 // F=2
Solution:
PHASE-1
class Mapper
  method Map(null, record [value f, categories [g1, g2,...]])
    for all category g in [g1, g2,...]
      Emit(record [g, f], count 1)
class Reducer
  method Reduce(record [g, f], counts [n1, n2, ...])
    Emit(record [g, f], null) // one record per distinct [g, f] pair
PHASE-2
class Mapper
  method Map(record [g, f], null) // consumes the Phase-1 output
    Emit(value g, count 1)
class Reducer
  method Reduce(value g, counts [n1, n2,...])
    Emit(value g, sum( [n1, n2,...] ) )
Cross-Correlation
• Problem Statement:
– There is a set of tuples of items.
– For each possible pair of items, calculate the number of tuples in which these items co-occur.
– If the total number of items is N then N*N values should be
reported.
– This problem appears in:
• text analysis (say, items are words and tuples are sentences),
• market analysis (customers who buy this tend to also buy that).
– If N*N is small, the matrix can fit in the memory of a single machine.
Market Basket Analysis
• Suppose a store sells N different products and has data on B customer purchases (each purchase is called a “market basket”).
• The “pairs” approach: the Mapper emits each co-occurring pair of items with count 1, and the Reducer sums the counts for every pair:
class Mapper
  method Map(null, basket = [i1, i2,...] )
    for all pair [i, j] of items in basket
      Emit(pair [i j], count 1)
class Reducer
  method Reduce(pair [i j], counts [c1, c2,...])
    s = sum([c1, c2,...])
    Emit(pair [i j], count s)
Cross-Correlation
Solution: the “stripes” approach. For every item, the Mapper builds an associative array (a “stripe”) of co-occurrence counts, and the Reducer merge-sums the stripes:
class Mapper
  method Map(null, basket = [i1, i2,...] )
    for all item i in basket
      H = new AssociativeArray : item -> counter
      for all item j in basket
        H{j} = H{j} + 1
      Emit(item i, stripe H)
class Reducer
  method Reduce(item i, stripes [H1, H2,...])
    H = new AssociativeArray : item -> counter
    H = merge-sum( [H1, H2,...] )
    for all item j in H.keys()
      Emit(pair [i j], H{j})
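• A minimal Java sketch of the stripes solution above, using Hadoop's MapWritable as the associative array (one whitespace-separated basket per input line is assumed; class names are illustrative):

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CrossCorrelationStripes {
  // For each item i of a basket, emit a stripe: co-occurring item j -> count within this basket.
  public static class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable key, Text basket, Context ctx) throws IOException, InterruptedException {
      String[] items = basket.toString().split("\\s+");
      for (String i : items) {
        MapWritable stripe = new MapWritable();
        for (String j : items) {
          if (j.equals(i)) continue;   // skip the item itself (drop this line to match the pseudocode exactly)
          Text tj = new Text(j);
          IntWritable c = (IntWritable) stripe.get(tj);
          stripe.put(tj, new IntWritable(c == null ? 1 : c.get() + 1));
        }
        ctx.write(new Text(i), stripe);
      }
    }
  }

  // Merge-sum all stripes of an item and emit one (pair, count) record per co-occurring item.
  public static class StripesReducer extends Reducer<Text, MapWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text item, Iterable<MapWritable> stripes, Context ctx) throws IOException, InterruptedException {
      MapWritable merged = new MapWritable();
      for (MapWritable stripe : stripes) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          IntWritable sum = (IntWritable) merged.get(e.getKey());
          int add = ((IntWritable) e.getValue()).get();
          merged.put(new Text(e.getKey().toString()), new IntWritable(sum == null ? add : sum.get() + add));
        }
      }
      for (Map.Entry<Writable, Writable> e : merged.entrySet()) {
        ctx.write(new Text(item.toString() + " " + e.getKey()), (IntWritable) e.getValue());
      }
    }
  }
}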
Selection
Solution:
class Mapper
method Map(rowkey key, tuple t)
if t satisfies the predicate
Emit(tuple t, null)
Projection
Solution:
class Mapper
method Map(rowkey key, tuple t)
tuple g = project(t) // extract required fields to tuple g
Emit(tuple g, null)
class Reducer
method Reduce(tuple t, array n) // n is an array of nulls
Emit(tuple t, null)
Union
Solution:
class Mapper
method Map(rowkey key, tuple t)
Emit(tuple t, null)
class Reducer
method Reduce(tuple t, array n) // n is an array of one or two nulls
Emit(tuple t, null)
Intersection
Solution:
class Mapper
method Map(rowkey key, tuple t)
Emit(tuple t, null)
class Reducer
method Reduce(tuple t, array n) // n is an array of one or more nulls
if n.size() >= 2
Emit(tuple t, null)
Difference
Solution:
// We want to compute the difference R – S.
// The Mapper emits each tuple together with a tag: the name of the set the tuple came from.
// The Reducer emits only the tuples that came from R but not from S.
class Mapper
method Map(rowkey key, tuple t)
Emit(tuple t, string t.SetName) // t.SetName is either 'R' or 'S'
class Reducer
method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R' 'S'], or ['S', 'R']
if n.size() = 1 and n[1] = 'R'
Emit(tuple t, null)
Join
• A JOIN is a means for combining fields from two tables by using values
common to each.
• ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS.
• As a special case, a table (base table, view, or joined table) can JOIN to itself
in a self-join.
• Join algorithms
– Nested loop
– Sort-Merge
– Hash
SELECT *
FROM employee
INNER JOIN department ON employee.DepartmentID = department.DepartmentID;
SELECT *
FROM employee
JOIN department ON employee.DepartmentID = department.DepartmentID;
SELECT *
FROM employee
LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;
Nested loop
For each tuple r in R do
For each tuple s in S do
If r and s satisfy the join condition
Then output the tuple <r,s>
Sort-Merge
p in P; q in Q; g in Q // p and q are cursors over the two sorted inputs; g marks the start of the current group in Q
while more tuples in inputs do
while p.a < g.b do
advance p
end while
while p.a > g.b do
advance g //a group might begin here
end while
while p.a == g.b do
q = g //mark group beginning
while p.a == q.b do
Add <p,q> to the result
Advance q
end while
Advance p //move forward
end while
g = q //candidate to begin next group
end while
Sort-Merge : MapReduce
• This algorithm joins two sets R and L on some key k.
• The Mapper goes through all tuples from R and L, extracts the key k from each tuple, marks the tuple with a tag that indicates the set this tuple came from (‘R’ or ‘L’), and emits the tagged tuple using k as a key.
• The Reducer receives all tuples for a particular key k and puts them into two buckets, one for R and one for L.
• When the two buckets are filled, the Reducer runs a nested loop over them and emits a cross-join of the buckets.
• Each emitted tuple is a concatenation of an R-tuple, an L-tuple, and the key k.
class Reducer
method Reduce(join_key k, tagged_tuples [t1, t2,...])
H = new AssociativeArray : set_name -> values
for all tagged_tuple t in [t1, t2,...] // separate values into 2 arrays
H{t.tag}.add(t.values)
for all values r in H{'R'} // produce a cross-join of the two arrays
for all values l in H{'L'}
Emit(null, [k r l] )
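• A minimal Java sketch of this repartition (reduce-side) join, assuming both inputs are text files whose first tab-separated field is the join key (class names are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RepartitionJoin {
  // Tag every R record with "R" and key it by the join key (first field).
  public static class RMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable k, Text v, Context ctx) throws IOException, InterruptedException {
      String[] f = v.toString().split("\t", 2);
      ctx.write(new Text(f[0]), new Text("R\t" + f[1]));
    }
  }

  // Same for L records.
  public static class LMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable k, Text v, Context ctx) throws IOException, InterruptedException {
      String[] f = v.toString().split("\t", 2);
      ctx.write(new Text(f[0]), new Text("L\t" + f[1]));
    }
  }

  // Put tagged tuples into an R bucket and an L bucket, then emit their cross-join.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
      List<String> rTuples = new ArrayList<>();
      List<String> lTuples = new ArrayList<>();
      for (Text v : values) {
        String[] f = v.toString().split("\t", 2);
        if (f[0].equals("R")) rTuples.add(f[1]); else lTuples.add(f[1]);
      }
      for (String r : rTuples)
        for (String l : lTuples)
          ctx.write(key, new Text(r + "\t" + l));   // corresponds to [k r l]
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(RepartitionJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, RMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, LMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}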
Hash Join
• The Hash Join algorithm consists of a ‘build’ phase and a ‘probe’ phase.
• In its simplest variant, the smaller dataset is loaded into an in-memory hash
table in the build phase.
• In the ‘probe’ phase, the larger dataset is scanned and joined with the
relevant tuple(s) by looking into the hash table.
for all p in P do
Load p into the in-memory hash table H
end for
for all q in Q do
if H contains p matching with q then
add <p,q> to the result
end if
end for
Hash Join : MapReduce Pseudocode
• Let's assume that we join two sets, R and L, and that R is relatively small. If so, R can be distributed to all Mappers, and each Mapper can load it and index it by the join key.
class Mapper
  method Initialize
    H = new AssociativeArray : join_key -> tuples from R
    R = loadR()
    for all [ join_key k, tuple [r1, r2,...] ] in R
      H{k} = H{k}.append( [r1, r2,...] )
  method Map(rowkey key, tuple l) // probe phase: stream over the larger set L
    k = join key of l
    for all tuple r in H{k}
      Emit(null, tuple [k r l] )
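• A minimal Java sketch of this map-side (replicated) hash join; it assumes the driver shipped the small relation with job.addCacheFile(new URI("/data/R.tsv#smallR")) so it is visible to each task as a local file named smallR, that both relations are tab-separated with the join key in the first field, and that the join key is unique in R:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Build phase happens once per mapper in setup(); the map() calls then stream
// over the larger relation L and probe the in-memory table. No reducer is needed.
public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> rByKey = new HashMap<>();

  @Override
  protected void setup(Context ctx) throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader("smallR"))) {  // local copy from the distributed cache
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.split("\t", 2);
        rByKey.put(f[0], f[1]);        // build: join_key -> R payload
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
    String[] f = value.toString().split("\t", 2);   // probe: one L tuple per call
    String r = rByKey.get(f[0]);
    if (r != null) {
      ctx.write(new Text(f[0]), new Text(r + "\t" + f[1]));   // inner-join output [k r l]
    }
  }
}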
• R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, pages 495-506, 2010.
• A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, 2011.