Map Reduce

The document describes the MapReduce framework, which is an algorithmic approach for processing large datasets in a distributed computing environment. It consists of two main stages - the Map stage, which breaks down data into key-value pairs, and the Reduce stage, which aggregates the outputs from Map based on keys. MapReduce utilizes the Hadoop distributed file system (HDFS) to store input/output data and allows parallel processing across multiple nodes in a cluster.

MAP REDUCE FRAMEWORK

 "With data collection, 'the sooner the better' is always the best answer." – Marissa Mayer
 Marissa Ann Mayer – American businesswoman and investor. Former president and chief executive officer of Yahoo!
MAP REDUCE FRAMEWORK
 Algorithmic approach to deal with big data
 Two-stage process
 Map step – break the data into chunks and process those chunks
 Output from Map – in key-value format
 Reduce step – aggregates the output of the Map
High Level MAP REDUCE FRAMEWORK

[Diagram: input data is read from HDFS, processed by MapReduce, and the output data is written back to HDFS.]
MAP REDUCE FRAMEWORK
 Map – runs on multiple nodes in the cluster
 Reduce – takes the output from Map and aggregates the outcomes to produce aggregated key-value pairs
 On top of HDFS – takes input from HDFS and writes the results back to HDFS
Map Reduce Job Architecture

[Diagram: the MASTER (RM, owning 100% of the job) distributes work to DataNode/NodeManager pairs (DN1/NM1, DN2/NM2, DN3/NM3, ..., DNx/NMx), each taking a share of the job (e.g. 33%, 33%, 34%). The NodeManagers report back to the master via heartbeat signals. Intermediate events are kept in the local file system of each DataNode.]
MAP REDUCE FRAMEWORK
 RM – distributes the overall job across available Node Managers based on resource availability (CPU, memory, JVM, etc.)
 NM executes tasks with the help of the Application Master (executor handled by YARN)
 Reports back to the RM with a heartbeat signal (every 3 seconds)
MAP REDUCE FRAMEWORK
 Developer – needs to write the logic for two functions: map() and reduce() (a word-count sketch follows below)
 Fault tolerance, replication, etc. are handled by the Hadoop framework
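For illustration, here is a minimal word-count sketch of those two functions for Hadoop (class names, tokenization and types are assumptions made for this example, not taken from the slides): map() emits a (word, 1) pair for every token in an input line, and reduce() sums the counts for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map(): called once per input record (here, one line of text);
  // emits a (word, 1) key-value pair for every token in the line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce(): called once per key with all values collected for that key;
  // sums the counts and emits the aggregated (word, total) pair.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      result.set(sum);
      context.write(word, result);
    }
  }
}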
MAP REDUCE FRAMEWORK
 'Map' – runs on one subset of the data
 The Map program executes one record at a time and outputs key-value pairs for each record of data
MAP REDUCE FRAMEWORK
 'Reduce' – takes the 'Map' outputs, collates the values associated with the same key, and combines these values based on the requirement of the program (average, sum, max, min, etc.)
MAP REDUCE FRAMEWORK
 'Map' – processes run in parallel
 'Reduce' – mostly a single task, but sometimes parallel Reduce tasks also run
 Map output is placed in an intermediate state for the intermediate events
 Events – shuffling, sorting, partitioning and combining (a partitioner sketch follows below)
 Handled by the MapReduce framework on each key
 Performed in the local file system of each DataNode
 Copied from HDFS into the local file system for the intermediate events
 Then transferred back to HDFS as input for the Reduce phase
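To make the partitioning event concrete, the sketch below mirrors the behaviour of a simple hash partitioner (shown purely as an illustration): each intermediate key is hashed to choose which reduce task receives it, so all values for one key meet at the same reducer.

import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative hash-based partitioner: routes every (key, value) pair to a
// reduce task based on the key's hash, so identical keys land on the same reducer.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is a valid partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}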

Internal Flow – MapReduce Job

[Diagram: input data is read from HDFS and transformed into key-value pairs; the Map (transformation) phase outputs (key, value) pairs; the intermediate events take place in the local file system; the resulting key-value pairs feed the Reduce (operation) phase, whose output is written back to HDFS.]
MAP REDUCE FRAMEWORK
 Number of Mappers – depends on the number of data partitions that exist across nodes
 Decision made by the storage system
 Number of Reducers can be configured by developers (see the driver sketch below)
 A greater number of reducers provides more parallelization
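A sketch of how the reducer count might be set in a Hadoop job driver (the paths and the value 4 are illustrative assumptions; the mapper and reducer classes refer to the earlier word-count sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setNumReduceTasks(4);                    // developer-chosen reducer count (illustrative)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}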
Map Phase – with Data Splits

[Diagram: each DataNode runs a Map task over its own split – DN1 emits (Rajesh, 1), (Gopi, 1); DN2 emits (Ram, 1), (Gopi, 1); DN3 emits (Rajesh, 1), (Ram, 1) – and the Reduce task aggregates the counts per key (the slide shows totals such as Rajesh, 30 and Gopi, 20 over the full dataset).]
MapReduce: The Map Step

[Diagram: each map call takes an input key-value pair (k, v) and produces intermediate key-value pairs; many map calls run over the input set.]
MapReduce: The Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups (k, [v, v, ...]); each reduce call collapses one group into an output key-value pair.]
MapReduce

First task:
 We have a huge text document
 Count the number of times each distinct word appears in the file
 Sample application:
 Analyze web server logs to find popular URLs
Task: Word Count
Case 1:
 File too large for memory, but all <word, count> pairs fit in memory
Case 2:
 Count occurrences of words:
 words(doc.txt) | sort | uniq -c
 ▪ where words outputs the words in the file, one per line
 Cases 1 and 2 capture the essence of MapReduce
MapReduce:
 Sequentially read a lot of data
 Map:
 Extract something you care about
 Group by key: Sort and Shuffle
 Reduce:
 Aggregate, summarize, filter or transform
 Write the result
MapReduce: Word Counting

MAP (provided by the programmer): reads the input and produces a set of key-value pairs.
Group by key: collects all pairs with the same key.
Reduce (provided by the programmer): collects all values belonging to the key and outputs the result.

[Diagram: a big document ("The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. '"The work we're doing now -- the robotics we're doing -- is what we're going to need ...") is read with only sequential reads; Map emits pairs such as (The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), ...; grouping by key collects (crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), ...; Reduce outputs (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), ...]
Map-Reduce: Environment
The Map-Reduce environment takes care of:
 Partitioning the input data
 Scheduling the program's execution across a set of machines
 Performing the group-by-key step
 Handling machine failures
 Managing required inter-machine communication
Map-Reduce: A Diagram

[Diagram: a big document flows through three steps –
MAP: reads the input and produces a set of key-value pairs;
Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition);
Reduce: collects all values belonging to the key and outputs the result.]
Map-Reduce: In Parallel

 All phases are distributed, with many tasks doing the work
Map-Reduce
 Programmer specifies:
 Map and Reduce and input files
 Workflow:
 Read inputs as a set of key-value pairs
 Map transforms input kv-pairs into a new set of k'v'-pairs
 Sorts & shuffles the k'v'-pairs to output nodes
 All k'v'-pairs with a given k' are sent to the same Reduce
 Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
 Write the resulting pairs to files
 All phases are distributed with many tasks doing the work

[Diagram: Input 0/1/2 → Map 0/1/2 → Shuffle → Reduce 0/1 → Out 0/1.]
Failures
 Map worker failure
 Map tasks completed or in progress are reset to idle
 Reduce workers are notified when a task is rescheduled on another worker
 Reduce worker failure
 Only in-progress tasks are reset to idle
 The Reduce task is restarted
 Master failure
 The MapReduce task is aborted and the client is notified
How many Map and Reduce jobs?
 M map tasks, R reduce tasks
 Rule of thumb:
 Make M much larger than the number of nodes in the cluster
 One DFS chunk per map is common
 Improves dynamic load balancing and speeds up recovery from worker failures
 Usually R is smaller than M (an illustrative calculation follows below)
 Because output is spread across R files
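As a purely illustrative calculation (numbers assumed, not from the slides): with a 10 TB input and 128 MB DFS chunks, one chunk per map gives roughly 10 TB / 128 MB ≈ 80,000 map tasks, while R might be set to a few hundred so each reducer writes one reasonably sized output file.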
Task Granularity & Pipelining
 Fine-granularity tasks: map tasks >> machines
 Minimizes time for fault recovery
 Can pipeline shuffling with map execution
 Better dynamic load balancing
Combiners
 A Map task often produces many pairs of the form (k, v1), (k, v2), ... for the same key k
 Can save network time by pre-aggregating values in the mapper:
 combine(k, list(v1)) → v2
 The Combiner is usually the same as the reduce function
Refinement: Combiners
 Back to our word counting example:
 The Combiner combines the values of all keys of a single mapper (single machine); see the one-line job setting below
 Much less data needs to be copied and shuffled!
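In Hadoop this is typically a single line in the job setup; continuing the earlier word-count driver sketch (illustrative), the reducer class doubles as the combiner:

// Run the reduce logic locally on each mapper's output before the shuffle,
// so only pre-summed (word, partialCount) pairs cross the network.
job.setCombinerClass(WordCount.IntSumReducer.class);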
Combiners Advantages
 Combiners improve parallelism
 – running the combiner on multiple map nodes
 – provides greater processing capability
 – improves utilization of resources
 – provides greater optimization
Combiners Advantages
 Reduces data transfer across the network within the cluster
 – reduces the size of the output data transferred to the reducer
 – reduces the frequency of data transfers to the reducer
Example: Measure Host Size
 Suppose we have a large web corpus
 Look at the metadata file
 Lines of the form: (URL, size, date, …)
 For each host, find the total number of bytes
 That is, the sum of the page sizes for all URLs from that particular host (a sketch follows below)
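A sketch of what the two functions could look like for this task (the field layout, parsing and class names are assumptions made for illustration): the map emits (host, size) for each metadata line, and the reduce sums the sizes per host.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HostSize {

  // map(): for each metadata line assumed to look like "URL, size, date, ..."
  // emit (host, size).
  public static class HostSizeMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      String host;
      long size;
      try {
        host = new URI(fields[0].trim()).getHost();
        size = Long.parseLong(fields[1].trim());
      } catch (Exception e) {
        return; // skip malformed metadata lines
      }
      if (host != null) {
        context.write(new Text(host), new LongWritable(size));
      }
    }
  }

  // reduce(): sum all page sizes seen for one host.
  public static class HostSizeReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text host, Iterable<LongWritable> sizes, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable s : sizes) {
        total += s.get();
      }
      context.write(host, new LongWritable(total));
    }
  }
}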
Example: Word Sequence Count
 Statistical machine translation:
 Need to count the number of times every 5-word sequence occurs in a large corpus of documents
 Very easy with MapReduce:
 Map:
 ▪ Extract (5-word sequence, count) from each document (a mapper sketch follows below)
 Reduce:
 ▪ Combine the counts
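The map function is the only new piece; a sketch (tokenization details are assumed) that emits every window of five consecutive words with a count of 1, to be summed by a word-count-style reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (5-word sequence, 1) for every window of five consecutive words in a line;
// a summing reducer (as in word count) then combines the counts per sequence.
public class FiveGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text gram = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] words = line.toString().trim().split("\\s+");
    for (int i = 0; i + 5 <= words.length; i++) {
      gram.set(String.join(" ", words[i], words[i + 1], words[i + 2],
                           words[i + 3], words[i + 4]));
      context.write(gram, ONE);
    }
  }
}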
Cost Measures for Algorithms
 Cost of an algorithm:
 Communication cost = total I/O of all processes
 Elapsed communication cost = max of I/O along any path
 – counting only the running time of processes
Example: Cost Measures
 For a map-reduce algorithm:
 Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
 Elapsed communication cost is the sum of the largest input + output for any Map process, plus the same for any Reduce process (an illustrative calculation follows below)
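As a purely illustrative calculation (numbers assumed, not from the slides): if the input file is 100 GB, the Map processes pass a total of 20 GB of files to the Reduce processes, and the Reduce output totals 5 GB, then the communication cost is 100 + 2 × 20 + 5 = 145 GB.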
Implementations
 Google
 Not available outside Google
 Hadoop
 An open-source implementation in Java
 Uses HDFS for stable storage
 Aster Data
 Cluster-optimized SQL database that also implements MapReduce
Cloud Computing
 Ability to rent computing by the hour
 Additional services, e.g., persistent storage
 Amazon's "Elastic Compute Cloud" (EC2)
 Aster Data and Hadoop can both be run on EC2
Cloud Computing – In-Memory
 In-memory computing means using a type of middleware software that allows one to store data in RAM, across a cluster of computers, and process it in parallel.
 Operational datastore in "connected" RAM across multiple computers
Case Study
 The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner.
 Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide ease of access to Hadoop clusters.
Case Study
 Biological scientists are still expected to choose appropriate Hadoop parameters for running their jobs.
Case Study
 The challenge is to minimize the cloud computing cost spent on bioinformatics data analysis by optimizing the extracted significant Hadoop parameters.
 When using MapReduce-based bioinformatics tools in the cloud, the default settings often lead to resource underutilization and wasteful expenses.
Case Study
 The available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications.
