Map Reduce

The document describes the MapReduce framework, which is an algorithmic approach for processing large datasets in a distributed computing environment. It consists of two main stages - the Map stage, which breaks down data into key-value pairs, and the Reduce stage, which aggregates the outputs from Map based on keys. MapReduce utilizes the Hadoop distributed file system (HDFS) to store input/output data and allows parallel processing across multiple nodes in a cluster.

Uploaded by

K Anantha Krishnan

MAP REDUCE FRAMEWORK

“With data collection, ‘the sooner the better’ is always the best answer.” – Marissa Mayer

Marissa Ann Mayer: American businesswoman and investor. Former president and chief executive officer of Yahoo!
MAP REDUCE FRAMEWORK
• Algorithmic approach for dealing with big data
• Two-stage process
▪ Map step: breaks the data into chunks and processes those chunks
▪ Output from Map is in key-value format
▪ Reduce step: aggregates the output of the Map step
High Level MAP REDUCE FRAMEWORK

[Diagram: input data in HDFS → MapReduce → output data written to HDFS]
MAP REDUCE FRAMEWORK
• Map: runs on multiple nodes in the cluster
• Reduce: takes the output from Map and aggregates it to produce an aggregated key-value pair for each key
• Runs on top of HDFS: takes its input from HDFS and writes its results back to HDFS
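Map Reduce Job Architecture

[Diagram: the MASTER (RM, 100% of the job) splits the job across the worker nodes DN1/NM1 (33%), DN2/NM2 (33%), DN3/NM3 (34%), ... DNx/NMx. Each node sends heartbeats back to the master. Intermediate events are handled in the local file system of each DN.]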

MAP REDUCE FRAMEWORK
• RM (Resource Manager): distributes the overall job across the available Node Managers based on resource availability (CPU, memory, JVM, etc.)
• NM (Node Manager): executes tasks with the help of the Application Master (executors are handled by YARN)
• Each NM reports back to the RM with a heartbeat signal (every 3 seconds)
MAP REDUCE FRAMEWORK
• Developer: needs to write the logic for two functions, map() and reduce()
• Fault tolerance, replication, etc. are handled by the Hadoop framework
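To make the division of labour concrete, here is a minimal sketch in plain Python rather than the Hadoop API. The names map_fn, reduce_fn and run_job, and the "year,temperature" record format, are assumptions made for illustration. run_job only imitates on one machine what the framework would do across a cluster.

```python
from collections import defaultdict

def map_fn(record):
    """Hypothetical map(): emit (year, temperature) for one input record."""
    year, temperature = record.split(",")
    yield year, int(temperature)

def reduce_fn(key, values):
    """Hypothetical reduce(): keep the maximum temperature seen for a year."""
    return key, max(values)

def run_job(records, map_fn, reduce_fn):
    """Stand-in for the framework: group map output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

records = ["2020,31", "2021,35", "2020,38", "2021,29"]
result = run_job(records, map_fn, reduce_fn)
print(result)
```

Only map_fn and reduce_fn correspond to what the developer writes. Everything inside run_job stands for work the framework takes over.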
MAP REDUCE FRAMEWORK
• ‘Map’: runs on one subset of the data
• The Map program processes one record at a time and outputs key-value pairs for each record
MAP REDUCE FRAMEWORK
• ‘Reduce’: takes the ‘Map’ outputs, collates the values associated with the same key and combines them as the program requires (average, sum, max, min, etc.)
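As an illustration of "combines them as the program requires", the sketch below applies four different reduce functions to the same grouped map output. The REDUCERS table, the reduce_all helper and the sample values are hypothetical and not part of any Hadoop API.

```python
from statistics import mean

# Hypothetical reduce functions: each receives one key and all values for that key.
REDUCERS = {
    "sum": lambda key, values: (key, sum(values)),
    "max": lambda key, values: (key, max(values)),
    "min": lambda key, values: (key, min(values)),
    "average": lambda key, values: (key, mean(values)),
}

def reduce_all(grouped, how):
    """Apply the chosen reduce function to every (key, values) group."""
    reducer = REDUCERS[how]
    return dict(reducer(key, values) for key, values in grouped.items())

# Map output after it has been collated by key (made-up numbers).
grouped = {"Gopi": [10, 30], "Rajesh": [20, 40, 30]}

for how in REDUCERS:
    print(how, reduce_all(grouped, how))
```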


MAP REDUCE FRAMEWORK
• ‘Map’: processes run in parallel
• ‘Reduce’: mostly a single task, although parallel Reduce tasks can also run
• Map output is placed in an intermediate state for the intermediate events
• Events: shuffling, sorting, partitioning and combining
▪ Handled by the MapReduce framework on each key
▪ Performed in the local file system of each DataNode
▪ Input is read from HDFS, and the Map output is kept in the local file system for the intermediate events
▪ The result is then transferred to the reducers as input for the Reduce phase; the final Reduce output is written back to HDFS
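The intermediate events can be sketched as follows, assuming a simple hash partitioner. The partition and shuffle_and_sort names are hypothetical. zlib.crc32 is used only because it gives the same result on every run, whereas Hadoop's default partitioner hashes the key object itself.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Choose the reducer for a key: deterministic hash modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(map_outputs, num_reducers):
    """Route every (key, value) pair to one partition, then sort each partition by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:  # one list of pairs per map task
        for key, value in pairs:
            partitions[partition(key, num_reducers)][key].append(value)
    return [sorted(p.items()) for p in partitions]

# Hypothetical output of three map tasks.
map_outputs = [
    [("Rajesh", 1), ("Gopi", 1)],
    [("Ram", 1), ("Gopi", 1)],
    [("Rajesh", 1), ("Ram", 1)],
]
reducer_inputs = shuffle_and_sort(map_outputs, num_reducers=2)
for index, part in enumerate(reducer_inputs):
    print("reducer", index, part)
```

Because the partition depends only on the key, all values for one key end up at the same reducer, whichever map task produced them.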

Internal Flow - MapReduce Job

[Diagram: input data is read from HDFS and transformed into key-value pairs → MAP transformation phase → Map output as (key, value) pairs → intermediate events in the local file system → key-value pairs for the Reduce phase → REDUCE operation phase → HDFS]
MAP REDUCE FRAMEWORK
• Number of mappers: depends on the number of data partitions that exist across the nodes
▪ This decision is made by the storage system
• The number of reducers can be configured by developers
▪ A greater number of reducers provides more parallelization
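A rough back-of-the-envelope estimate of the mapper count, assuming one map task per data split and that each file is split independently. The 128 MB split size and the file sizes are assumed values, not figures from the slides, and a real cluster may differ slightly in how it treats the last, partial split.

```python
import math

def estimate_map_tasks(file_sizes_bytes, split_size_bytes):
    """Estimate: one map task per split, with each file split on its own."""
    return sum(
        math.ceil(size / split_size_bytes)
        for size in file_sizes_bytes
        if size > 0
    )

MB = 1024 * 1024
split_size = 128 * MB                   # assumed split size
files = [300 * MB, 64 * MB, 1024 * MB]  # hypothetical input files

num_mappers = estimate_map_tasks(files, split_size)
print(num_mappers)  # 3 + 1 + 8 splits
```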
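Map Phase – with Data Splits

[Diagram: three DataNodes each run a mapper (M) on their own data split. DN1 emits (Rajesh, 1), (Gopi, 1); DN2 emits (Ram, 1), (Gopi, 1); DN3 emits (Rajesh, 1), (Ram, 1). A reducer (R) aggregates the pairs per key, for example (Rajesh, 30), (Gopi, 20).]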
MapReduce: The Map Step

[Diagram: each input key-value pair (k, v) is passed to map, which emits intermediate key-value pairs (k, v).]
MapReduce: The Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups (k, [v, v, ...]); reduce turns each group into an output key-value pair (k, v).]
MapReduce

First task:
• We have a huge text document
• Count the number of times each distinct word appears in the file
• Sample application:
▪ Analyze web server logs to find popular URLs
Task: Word Count
Case 1:
• File too large for memory, but all <word, count> pairs fit in memory
Case 2:
• Count occurrences of words:
▪ words(doc.txt) | sort | uniq -c
▪ where words outputs the words in the file, one per line
• Cases 1 and 2 capture the essence of MapReduce
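The pipeline words(doc.txt) | sort | uniq -c can be mirrored in a few lines of Python. The words function below is a hypothetical stand-in for the words command, and the sample text is made up.

```python
import re
from itertools import groupby

def words(text):
    """Hypothetical stand-in for words(doc.txt): return the words in lowercase."""
    return re.findall(r"[a-z']+", text.lower())

def sort_uniq_count(tokens):
    """Equivalent of `sort | uniq -c`: sort, then count each run of equal words."""
    return [(word, sum(1 for _ in run)) for word, run in groupby(sorted(tokens))]

text = "the crew of the space shuttle the crew"
counts = sort_uniq_count(words(text))
print(counts)
```

The correspondence with MapReduce is direct: words plays the role of Map, sort is the group-by-key step, and uniq -c is the Reduce.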
MapReduce:
• Sequentially read a lot of data
• Map:
▪ Extract something you care about
• Group by key: sort and shuffle
• Reduce:
▪ Aggregate, summarize, filter or transform
• Write the result
MapReduce: Word Counting
• MAP (provided by the programmer): read input and produce a set of key-value pairs
• Group by key: collect all pairs with the same key
• Reduce (provided by the programmer): collect all values belonging to the key and output

Big document (the data is read sequentially; only sequential reads):
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need …"

Map output (key, value): (The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) …
Group by key (key, value): (crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) …
Reduce output (key, value): (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) …
Map-Reduce: Environment
The Map-Reduce environment takes care of:
• Partitioning the input data
• Scheduling the program’s execution across a set of machines
• Performing the group by key step
• Handling machine failures
• Managing required inter-machine communication
Map-Reduce: A diagram

[Diagram: Big document → MAP (read input and produce a set of key-value pairs) → Group by key (collect all pairs with the same key: hash merge, shuffle, sort, partition) → Reduce (collect all values belonging to the key and output)]
Map-Reduce: In Parallel

All phases are distributed with many tasks doing the work
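A small single-machine sketch of the parallel idea, assuming the document has already been cut into chunks. A thread pool stands in for the cluster's map tasks. The names map_chunk, group_by_key and reduce_counts and the two sample chunks are illustrative, and only the map phase is run in parallel here.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    """Map task: emit (word, 1) for every word in one chunk of the document."""
    pairs = []
    for token in chunk.split():
        word = token.strip(".,\"'-").lower()
        if word:
            pairs.append((word, 1))
    return pairs

def group_by_key(pair_lists):
    """Group by key: collect the values of all map tasks under their key."""
    groups = defaultdict(list)
    for pairs in pair_lists:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce task: sum the counts for one word."""
    return key, sum(values)

chunks = [
    "The crew of the space shuttle Endeavor recently returned to Earth",
    "as ambassadors, harbingers of a new era of space exploration.",
]

# Each chunk is handled by its own map task; the pool runs them concurrently.
with ThreadPoolExecutor() as pool:
    map_outputs = list(pool.map(map_chunk, chunks))

word_counts = dict(reduce_counts(k, v) for k, v in group_by_key(map_outputs).items())
print(word_counts["of"], word_counts["space"])
```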
Map-Reduce
• Programmer specifies:
▪ Map and Reduce and input files
• Workflow:
▪ Read inputs as a set of key-value pairs
▪ Map transforms input kv-pairs into a new set of k'v'-pairs
▪ Sorts & shuffles the k'v'-pairs to output nodes
▪ All k'v'-pairs with a given k' are sent to the same reduce
▪ Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
▪ Write the resulting pairs to files
• All phases are distributed with many tasks doing the work

[Diagram: Input 0, Input 1, Input 2 → Map 0, Map 1, Map 2 → Shuffle → Reduce 0, Reduce 1 → Out 0, Out 1]
Failures
• Map worker failure
▪ Map tasks completed or in-progress are reset to idle
▪ Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
▪ Only in-progress tasks are reset to idle
▪ Reduce task is restarted
• Master failure
▪ MapReduce task is aborted and the client is notified
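The worker-failure rules above can be written as a small state update. The Task class and handle_worker_failure function are hypothetical names for this sketch and do not mirror any Hadoop class. Completed map tasks are reset as well because their output sits on the failed worker's local disk, while completed reduce output has already been written to the distributed file system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in-progress" or "completed"
    worker: Optional[str] = None  # worker currently or last assigned

def handle_worker_failure(tasks, failed_worker):
    """Reset the failed worker's tasks to idle so the master can reschedule them."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_map = task.kind == "map" and task.state in ("in-progress", "completed")
        lost_reduce = task.kind == "reduce" and task.state == "in-progress"
        if lost_map or lost_reduce:
            task.state, task.worker = "idle", None
    return tasks

tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in-progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in-progress", "w1"),
    Task("map", "completed", "w2"),
]
handle_worker_failure(tasks, "w1")
print([(t.kind, t.state) for t in tasks])
```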
How many Map and Reduce jobs?
• M map tasks, R reduce tasks
• Rule of thumb:
▪ Make M much larger than the number of nodes in the cluster
▪ One DFS chunk per map is common
▪ Improves dynamic load balancing and speeds up recovery from worker failures
• Usually R is smaller than M
▪ Because output is spread across R files
Task Granularity & Pipelining
• Fine granularity tasks: map tasks >> machines
▪ Minimizes time for fault recovery
▪ Can pipeline shuffling with map execution
▪ Better dynamic load balancing
Combiners
• A map task often produces many pairs of the form (k, v1), (k, v2), … for the same key k
• Can save network time by pre-aggregating values in the mapper:
▪ combine(k, list(v1)) → v2
▪ Combiner is usually the same as the reduce function
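A minimal sketch of the combine(k, list(v1)) → v2 step for word counting, run on the output of a single mapper. The map_words and combine functions and the sample chunk are assumptions for illustration.

```python
from collections import Counter

def map_words(chunk):
    """Mapper: emit (word, 1) for every word in the chunk."""
    return [(word, 1) for word in chunk.lower().split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's pairs, here with the same sum the reducer uses."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

chunk = "the crew of the space shuttle the crew"
raw = map_words(chunk)
combined = combine(raw)
print(len(raw), "pairs before combining,", len(combined), "after")
```

Reusing the reduce function as the combiner is safe here because addition is commutative and associative. A reduce function such as a plain average could not be reused this way without also carrying the counts.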
Refinement: Combiners
• Back to our word counting example:
▪ Combiner combines the values of all keys of a single mapper (single machine)
• Much less data needs to be copied and shuffled!

Combiners Advantages
• Combiners improve parallelism
▪ The combiner runs on multiple map nodes
▪ Provides greater processing capability
▪ Improves utilization of resources
▪ Provides greater optimization
Combiners Advantages
• Combiners reduce data transfer across the network within the cluster
▪ Reduce the size of the output data transferred to the reducer
▪ Reduce the frequency of data transfer to the reducer
Example: Measure Host size
• Suppose we have a large web corpus
• Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
• For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that particular host
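One possible sketch of this job, assuming each metadata line is a comma-separated "URL,size,date" string. The line format, the sample URLs and the function names are hypothetical. The map step extracts the host from the URL and the reduce step sums the sizes.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def map_line(line):
    """Map: emit (host, size) from one metadata line of the assumed form 'URL,size,date'."""
    url, size, *_ = line.split(",")
    yield urlsplit(url).hostname, int(size)

def reduce_host(host, sizes):
    """Reduce: total number of bytes for one host."""
    return host, sum(sizes)

lines = [
    "http://example.com/a.html,1200,2024-01-01",
    "http://example.com/b.html,800,2024-01-02",
    "http://example.org/index.html,500,2024-01-03",
]

# Group by key, done here in memory in place of the framework's shuffle.
groups = defaultdict(list)
for line in lines:
    for host, size in map_line(line):
        groups[host].append(size)

host_sizes = dict(reduce_host(host, sizes) for host, sizes in groups.items())
print(host_sizes)
```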
Example: Word Sequence Count
• Statistical machine translation:
▪ Need to count the number of times every 5-word sequence occurs in a large corpus of documents
• Very easy with MapReduce:
▪ Map: extract (5-word sequence, count) from document
▪ Reduce: combine the counts
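A compact sketch of the two steps, with whitespace tokenization and two tiny made-up documents standing in for the corpus. map_ngrams and reduce_counts are illustrative names.

```python
from collections import Counter

def map_ngrams(document, n=5):
    """Map: emit (n-word sequence, 1) for every window of n consecutive words."""
    words = document.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1

def reduce_counts(pairs):
    """Reduce: combine the counts per sequence."""
    totals = Counter()
    for sequence, count in pairs:
        totals[sequence] += count
    return totals

documents = [
    "to be or not to be or not to be",
    "or not to be or not",
]
pairs = [pair for doc in documents for pair in map_ngrams(doc)]
totals = reduce_counts(pairs)
for sequence, count in sorted(totals.items()):
    print(" ".join(sequence), count)
```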
Cost Measures for Algorithms
• Cost of an algorithm
▪ Communication cost = total I/O of all processes
▪ Elapsed communication cost = max of I/O along any path
▪ (Elapsed) computation cost: analogous, but counts only the running time of processes
Example: Cost Measures
• For a map-reduce algorithm:
▪ Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
▪ Elapsed communication cost is the sum of the largest input + output for any map process, plus the same for any reduce process
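The two formulas can be checked with a small calculation. The sizes below are made-up numbers in arbitrary units, and the function names are illustrative. The factor 2 reflects that each intermediate file is written once by a Map process and read once by a Reduce process.

```python
def communication_cost(input_size, map_to_reduce_sizes, reduce_output_sizes):
    """Input size + 2 x (Map-to-Reduce file sizes) + Reduce output sizes."""
    return input_size + 2 * sum(map_to_reduce_sizes) + sum(reduce_output_sizes)

def elapsed_communication_cost(map_io, reduce_io):
    """Largest input + output of any map process, plus the same for any reduce process."""
    return max(i + o for i, o in map_io) + max(i + o for i, o in reduce_io)

# Hypothetical job: two map processes and two reduce processes.
total = communication_cost(
    input_size=100,
    map_to_reduce_sizes=[30, 20],
    reduce_output_sizes=[5, 5],
)
elapsed = elapsed_communication_cost(
    map_io=[(50, 30), (50, 20)],    # (input, output) per map process
    reduce_io=[(30, 5), (20, 5)],   # (input, output) per reduce process
)
print(total, elapsed)
```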
Implementations
• Google
▪ Not available outside Google
• Hadoop
▪ An open-source implementation in Java
▪ Uses HDFS for stable storage
• Aster Data
▪ Cluster-optimized SQL database that also implements MapReduce
Cloud Computing
• Ability to rent computing by the hour
▪ Additional services, e.g., persistent storage
• Amazon’s “Elastic Compute Cloud” (EC2)
• Aster Data and Hadoop can both be run on EC2
Cloud Computing – In-Memory
• In-memory computing means using a type of middleware software that allows one to store data in RAM, across a cluster of computers, and process it in parallel
• Operational datastore in “connected” RAM across multiple computers
Case Study
• The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner
• Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide ease of access to Hadoop clusters
Case Study
• Biological scientists are still expected to choose appropriate Hadoop parameters for running their jobs
Case Study
• The challenge is to minimize the cloud computing cost of bioinformatics data analysis by identifying the significant Hadoop parameters and optimizing them
• When using MapReduce-based bioinformatics tools in the cloud, the default settings often lead to resource underutilization and wasteful expenses
Case Study
• The available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications
