Map Reduce
[Figure: input data is read from HDFS, processed by MapReduce, and the output data is written to HDFS]
MAP REDUCE FRAMEWORK
Map – runs on multiple nodes in the cluster
Reduce – takes the output from Map and aggregates it to produce aggregated key-value pairs
Runs on top of HDFS – takes input from HDFS and writes results back to HDFS
Map Reduce Job Architecture
[Figure: input from HDFS → Map → output (key, value) pairs → intermediate data in the local file system → (key, value) pairs for the Reduce phase → output to HDFS]
MAP REDUCE FRAMEWORK
Number of mappers – depends on the number of data partitions that exist across nodes; this decision is made by the storage system
Number of reducers – can be configured by developers; a greater number of reducers provides more parallelism
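To illustrate how a configurable number of reducers spreads the work, here is a minimal sketch that assigns each intermediate key to one of `num_reducers` partitions by hashing it. The helper names `partition_for` and `assign_to_reducers` and the choice of CRC32 are assumptions made for this illustration; they are not taken from the slides or from Hadoop itself.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    """Pick the reducer index for a key (hypothetical helper).

    CRC32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def assign_to_reducers(pairs, num_reducers):
    """Group (key, value) pairs by the reducer that will receive them."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition_for(key, num_reducers)].append((key, value))
    return dict(buckets)


# Sample pairs in the style of the "Map Phase" slide.
pairs = [("Rajesh", 1), ("Gopi", 1), ("Ram", 1), ("Gopi", 1), ("Rajesh", 1)]
for n in (1, 2, 3):
    print(n, assign_to_reducers(pairs, n))
```

Whatever the reducer count, all pairs with the same key land in the same partition, which is what lets each reducer aggregate its keys independently.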
Map Phase – with Data Splits
[Figure: mappers (M) on data nodes DN1, DN2 and DN3 emit pairs such as (Rajesh, 1), (Gopi, 1) and (Ram, 1); a reducer (R) outputs totals such as (Rajesh, 30) and (Gopi, 20)]
MapReduce: The Map Step
[Figure: each map call turns an input key-value pair (k, v) into intermediate key-value pairs]
MapReduce: The Reduce Step
[Figure: intermediate key-value pairs are grouped by key into key-value groups (k, [v, v, …]); each reduce call turns one group into an output key-value pair]
MapReduce
First task:
We have a huge text document
Sequentially read the data (only sequential reads)

Big document:
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need …

Map output (key, value): (The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) …
Grouped by key (key, value): (crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) …
Reduce output (key, value): (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) …
Map-Reduce: Environment
Map-Reduce environment takes care of:
Partitioning the input data
Scheduling the program’s execution across a set of machines
Performing the group by key step
Handling machine failures
Managing required inter-machine communication
Map-Reduce: A diagram
Big document
MAP: Reads input and produces a set of key-value pairs
Group by key: Collects all pairs with the same key (hash merge, shuffle, sort, partition)
Reduce: Collects all values belonging to the key and outputs the result
Map-Reduce: In Parallel
All phases are distributed with many tasks doing the work
Map-Reduce
Programmer specifies:
Map and Reduce and input files
Workflow:
Read inputs as a set of key-value pairs
Map transforms input kv-pairs into a new set of k'v'-pairs
Sorts & shuffles the k'v'-pairs to output nodes
All k'v'-pairs with a given k' are sent to the same reduce
Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
Write the resulting pairs to files
[Figure: Input 0, Input 1, Input 2 → Map 0, Map 1, Map 2 → Shuffle → Reduce 0, Reduce 1 → Out 0, Out 1]
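The workflow above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not a distributed implementation: the names `run_mapreduce`, `map_words` and `reduce_sum` are hypothetical, and a real framework would run the map and reduce calls as separate tasks on different machines.

```python
from collections import defaultdict


def map_words(_doc_id, text):
    """Map: emit (word, 1) for every word in the input value."""
    for word in text.split():
        yield word, 1


def reduce_sum(key, values):
    """Reduce: sum all values collected for one key."""
    yield key, sum(values)


def run_mapreduce(inputs, mapper, reducer):
    """Run map, group-by-key (shuffle and sort), and reduce in one process."""
    # Map: every input (k, v) pair produces zero or more (k', v') pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(mapper(key, value))

    # Shuffle and sort: collect all values that share the same k'.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: each (k', [v', ...]) group produces new (k'', v'') pairs.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


inputs = [(0, "the crew of the space shuttle"), (1, "the space crew")]
print(run_mapreduce(inputs, map_words, reduce_sum))
# [('crew', 2), ('of', 1), ('shuttle', 1), ('space', 2), ('the', 3)]
```

The programmer supplies only `mapper`, `reducer` and the inputs; grouping by key is done by the driver, mirroring the division of labour described on the slide.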
Combiners
A map task often produces many pairs of the form (k, v1), (k, v2), … for the same key k
Can save network time by pre-aggregating values in the mapper:
combine(k, list(v1)) → v2
The combiner is usually the same as the reduce function
Refinement: Combiners
Back to our word counting example:
The combiner combines the values of all keys of a single mapper (single machine)
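As a rough sketch of what a combiner does in the word counting example, the code below pre-aggregates the (word, 1) pairs produced by one mapper before anything is sent over the network. The function names and the sample split are made up for illustration. Reusing a sum as the combiner works here because addition is commutative and associative.

```python
from collections import Counter


def map_words(text):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: sum the values per key within a single mapper's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


split = "the crew of the space shuttle the crew"
raw = map_words(split)
combined = combine(raw)
print(len(raw), "pairs without a combiner")
print(len(combined), "pairs with a combiner:", combined)
```

For this split the mapper would ship 5 pairs instead of 8; the reducer still receives the same totals once it sums the partial counts from all mappers.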
Combiners Advantages
Reduces data transfer across the network within the cluster:
- Reduces the size of the data transferred to the reducers
- Reduces the frequency of data transfers to the reducers
Example: Measure Host size
Suppose we have a large web corpus
Look at the metadata file
Lines of the form: (URL, size, date, …)
For each host, find the total number of bytes, that is, the sum of the page sizes for all URLs from that particular host
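One possible way to express the host size task as map and reduce functions is sketched below. The record layout follows the (URL, size, date, …) form from the slide, but the sample records, hosts and function names are invented for this example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_host_size(record):
    """Map: turn one (URL, size, date, ...) record into (host, size)."""
    url, size = record[0], record[1]
    yield urlparse(url).netloc, int(size)


def reduce_total(host, sizes):
    """Reduce: total number of bytes over all pages of one host."""
    return host, sum(sizes)


# Hypothetical metadata records; hosts, sizes and dates are made up.
records = [
    ("http://example.com/a.html", 1200, "2020-01-01"),
    ("http://example.com/b.html", 800, "2020-01-02"),
    ("http://example.org/index.html", 500, "2020-01-03"),
]

# Group by key: collect all sizes emitted for the same host.
groups = defaultdict(list)
for record in records:
    for host, size in map_host_size(record):
        groups[host].append(size)

totals = dict(reduce_total(host, sizes) for host, sizes in groups.items())
print(totals)  # {'example.com': 2000, 'example.org': 500}
```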
Example: Word Sequence Count
Statistical machine translation:
Need to count the number of times every 5-word sequence occurs in a large corpus of documents
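A minimal sketch of the 5-word sequence count follows, assuming whitespace tokenization and sequences that do not cross document boundaries (neither detail is specified on the slide). The map step emits every window of five consecutive words with a count of 1, and the reduce step sums the counts per sequence. `map_ngrams` and `count_ngrams` are hypothetical names.

```python
from collections import Counter


def map_ngrams(text, n=5):
    """Map: emit (n-word sequence, 1) for every window of n words."""
    words = text.split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1


def count_ngrams(documents, n=5):
    """Group by sequence and sum the counts (the reduce step)."""
    counts = Counter()
    for text in documents:
        for ngram, one in map_ngrams(text, n):
            counts[ngram] += one
    return counts


docs = [
    "a new era of space exploration",
    "harbingers of a new era of space exploration",
]
counts = count_ngrams(docs)
print(counts[("a", "new", "era", "of", "space")])  # 2
```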
Example: Cost Measures
For a map-reduce algorithm:
Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
Elapsed communication cost = the largest input + output for any Map process, plus the largest input + output for any Reduce process
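The two cost measures can be written as small functions to make the arithmetic concrete. The function names and the sample sizes below are hypothetical, picked only so the result is easy to check by hand.

```python
def communication_cost(input_size, map_to_reduce_sizes, reduce_output_sizes):
    """Input size + 2 * (Map-to-Reduce file sizes) + Reduce output sizes."""
    return input_size + 2 * sum(map_to_reduce_sizes) + sum(reduce_output_sizes)


def elapsed_communication_cost(map_io, reduce_io):
    """Largest input + output of any map process, plus the same for reduce.

    Each argument is a list of (input_size, output_size) tuples, one per
    process.
    """
    largest_map = max(i + o for i, o in map_io)
    largest_reduce = max(i + o for i, o in reduce_io)
    return largest_map + largest_reduce


# Hypothetical sizes in MB: two map processes and two reduce processes.
total = communication_cost(100, [30, 20], [5, 5])
elapsed = elapsed_communication_cost([(50, 30), (50, 20)], [(30, 5), (20, 5)])
print(total)    # 100 + 2 * 50 + 10 = 210
print(elapsed)  # (50 + 30) + (30 + 5) = 115
```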
Implementations
Google
Not available outside Google
Hadoop
An open-source implementation in Java
Uses HDFS for stable storage
Aster Data
Cluster-optimized SQL database that also implements MapReduce
Cloud Computing
Ability to rent computing by the hour
Additional services e.g., persistent storage
Cloud Computing – In-Memory
In-memory computing means using a type of middleware software that allows one to store data in RAM across a cluster of computers and process it in parallel
Operational datastore in “connected” RAM across multiple computers
Case Study
The combination of the Hadoop MapReduce
programming model and cloud computing allows
biological scientists to analyze next-generation
sequencing (NGS) data in a timely and cost-effective
manner.
Cloud computing platforms remove the burden of IT
facility procurement and management from end
users and provide ease of access to Hadoop clusters.
Biological scientists are still expected to choose
appropriate Hadoop parameters for running their
jobs.
The challenge is to minimize the cloud computing
cost of bioinformatics data analysis by optimizing
the Hadoop parameters identified as significant.
When using MapReduce-based bioinformatics
tools in the cloud, the default settings often lead to
resource underutilization and wasteful expenses.
The available Hadoop tuning guidelines are either
obsolete or too general to capture the particular
characteristics of bioinformatics applications.