MapReduce
1
Distributed File System (DFS)
• For very large files: TBs, PBs
• Each file is partitioned into chunks, typically
64MB
• Each chunk is replicated several times (≥3),
on different racks, for fault tolerance
• Implementations:
– Google’s DFS: GFS, proprietary
– Hadoop’s DFS: HDFS, open source
2
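As a rough illustration of the chunking arithmetic (a minimal sketch, not the configuration of any particular DFS; the 64 MB chunk size and replication factor 3 come from the bullets above):

    CHUNK_SIZE = 64 * 2**20   # 64 MB chunks, the typical size cited above
    REPLICAS = 3              # each chunk stored on >= 3 servers, on different racks

    def chunk_ranges(file_size):
        # Byte ranges of the chunks a DFS would create for a file of this size
        return [(off, min(off + CHUNK_SIZE, file_size))
                for off in range(0, file_size, CHUNK_SIZE)]

    # A 10 TB file is split into 163,840 chunks, each stored 3 times:
    print(len(chunk_ranges(10 * 2**40)) * REPLICAS, "chunk replicas")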
MapReduce
• Google: paper published 2004
• Free variant: Hadoop
• MapReduce = high-level programming model
and implementation for large-scale parallel
data processing
3
Typical Problems Solved by MR
• Read a lot of data
• Map: extract something you care about from each
record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, transform
• Write the results
The paradigm stays the same; only the map and reduce
functions change from problem to problem
4
slide source: Jeff Dean
Data Model
Files!
A file = a bag of (key, value) pairs
A MapReduce program:
• Input: a bag of (inputkey, value) pairs
• Output: a bag of (outputkey, value) pairs
5
Step 1: the MAP Phase
User provides the MAP-function:
• Input: (input key, value)
• Output:
bag of (intermediate key, value)
System applies the map function in parallel to all
(input key, value) pairs in the input file
6
Step 2: the REDUCE Phase
User provides the REDUCE function:
• Input:
(intermediate key, bag of values)
• Output: bag of output values
System groups all pairs with the same intermediate
key, and passes the bag of values to the REDUCE
function
7
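The whole model fits in a few lines of sequential code. Below is a minimal single-machine simulation in Python (an illustrative sketch, not the real distributed runtime; run_mapreduce is a made-up name): it applies the map function to every input pair, groups the intermediate pairs by key, and hands each bag of values to the reduce function.

    from collections import defaultdict

    def run_mapreduce(input_pairs, map_fn, reduce_fn):
        # Single-machine simulation of the MapReduce data flow.
        # input_pairs: a bag of (input key, value) pairs
        # map_fn(k, v): yields (intermediate key, value) pairs
        # reduce_fn(k, values): yields output values
        groups = defaultdict(list)
        for k, v in input_pairs:              # MAP phase (conceptually in parallel)
            for ik, iv in map_fn(k, v):
                groups[ik].append(iv)         # shuffle: group by intermediate key
        output = []
        for ik, values in groups.items():     # REDUCE phase, one call per key
            output.extend(reduce_fn(ik, values))
        return output

The examples later in this deck can be plugged into this sketch; in the real system the map and reduce calls run in parallel on many workers, and the grouping happens during the shuffle.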
Example
• Counting the number of occurrences of each
word in a large collection of documents
• Each Document
– The key = document id (did)
– The value = set of words (word)
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, “1”);

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
8
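For comparison, here is the same word-count example as runnable Python, written against the run_mapreduce sketch from a few slides back (names are illustrative, not part of any MapReduce API):

    def wc_map(did, contents):
        # did: document id; contents: the document text
        for w in contents.split():
            yield (w, 1)

    def wc_reduce(word, counts):
        # counts: the bag of 1s emitted for this word across all documents
        yield (word, sum(counts))

    docs = [("did1", "a rose is a rose"), ("did2", "a b a")]
    print(run_mapreduce(docs, wc_map, wc_reduce))
    # e.g. [('a', 4), ('rose', 2), ('is', 1), ('b', 1)]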
[Diagram: MAP turns each (did, value) pair into intermediate (word, 1) pairs;
the Shuffle groups them by word into bags such as (w1, (1,1,1,…,1));
REDUCE emits one (word, count) pair per word, e.g. (w1, 25), (w2, 77), (w3, 12).]
9
Jobs vs. Tasks
• A MapReduce Job
– One single “query”, e.g. count the words in all docs
– More complex queries may consist of multiple jobs
• A Map Task, or a Reduce Task
– A group of instantiations of the map or reduce
function, scheduled on a single worker
10
Workers
• A worker is a process that executes one task
at a time
• Typically there is one worker per processor,
hence 4 or 8 per node
11
Fault Tolerance
• If one server fails once a year on average…
... then a job running on 10,000 servers expects a
failure about every 8,766 / 10,000 ≈ 0.9 hours,
i.e. in less than one hour
• MapReduce handles fault tolerance by writing
intermediate files to disk:
– Mappers write file to local disk
– Reducers read the files (=reshuffling); if the server
fails, the reduce task is restarted on another
server
12
MAP Tasks and REDUCE Tasks
[Diagram (same word-count data flow, now labeled by task): map tasks turn
(did, value) pairs into intermediate (word, 1) pairs, the shuffle groups them
by word, and reduce tasks emit (word, count) pairs such as (w1, 25), (w2, 77),
(w3, 12).]
13
MapReduce Execution Details
[Diagram: input is read from the file system (GFS or HDFS), so the data is
not necessarily local to a map task; each map task writes its intermediate
data to local disk, partitioned for the shuffle into M × R files in total
(why?); each reduce task writes its output to disk, replicated in the cluster.]
14
Implementation
• There is one master node
• Master partitions input file into M splits, by key
• Master assigns workers (=servers) to the M map
tasks, keeps track of their progress
• Workers write their output to local disk,
partitioned into R regions (see the sketch below)
• Master assigns workers to the R reduce tasks
• Reduce workers read regions from the map
workers’ local disks
16
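This partitioning step is what produces the "M × R files" noted on the execution-details slide: each of the M map tasks splits its output into R regions, one per reduce task, typically by hashing the intermediate key. A minimal sketch (assuming a hash-based partitioning rule; the function names are illustrative):

    from collections import defaultdict

    R = 4  # number of reduce tasks

    def partition(key):
        # Hash-partition an intermediate key into one of R regions;
        # real systems use their own (stable, user-overridable) partitioning function.
        return hash(key) % R

    def write_map_output(intermediate_pairs):
        # One map task buckets its output into R local "files" (regions);
        # with M map tasks this yields M x R intermediate files in total.
        regions = defaultdict(list)
        for k, v in intermediate_pairs:
            regions[partition(k)].append((k, v))
        return regions  # reduce task r later fetches regions[r] from every map task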
Interesting Implementation Details
Worker failure:
• Master pings workers periodically,
• If down then reassigns the task to another
worker
17
Interesting Implementation Details
Backup tasks:
• Straggler = a machine that takes an unusually
long time to complete one of the last tasks. E.g.:
– A bad disk causes frequent correctable errors,
dropping throughput from 30 MB/s to 1 MB/s
– The cluster scheduler has scheduled other tasks on
that machine
• Stragglers are a main reason for slowdown
• Solution: pre-emptive backup execution of the
last few remaining in-progress tasks
18
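A toy calculation (purely illustrative numbers, not from the paper) shows why backup execution helps: the job waits only for whichever copy of each remaining task finishes first, so a single straggler no longer dominates the completion time.

    # Estimated finish times (seconds) of the last in-progress tasks on their
    # primary workers; one is a straggler hit by a slow disk.
    primary = {"task1": 30, "task2": 35, "straggler": 900}

    # Speculative backup copies launched on idle workers near the end of the job.
    backup = {"task1": 40, "task2": 38, "straggler": 33}

    # Each task completes as soon as either its primary or its backup finishes.
    finish = {t: min(primary[t], backup[t]) for t in primary}
    print(max(finish.values()))   # job finishes after ~35s instead of ~900s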
Parallel Data Processing @ 2010
19
Issues with MapReduce
• Difficult to write more complex queries
• Complex queries need multiple MapReduce jobs,
which slows them down dramatically because each
job writes all of its results to disk
• Next lecture: Spark
20
Relational Operators in
MapReduce
Given relations R(A,B) and S(B, C) compute:
• Selection: σA=123(R)
• Group-by: γA,sum(B)(R)
• Join: R ⋈ S
21
Selection σA=123(R)
map(String value):
if value.A = 123:
EmitIntermediate(value.key, value);
reduce(String k, Iterator values):
for each v in values:
Emit(v);
No reduce is needed here,
but removing the reduce phase requires
hacking the MapReduce system.
23
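The same selection as runnable Python against the earlier run_mapreduce sketch (illustrative only; here a tuple of R is a dict and its record key is the input key):

    def select_map(key, t):
        # t: a tuple of R represented as a dict; keep it only if t["A"] == 123
        if t["A"] == 123:
            yield (key, t)

    def select_reduce(key, tuples):
        # identity reduce: pass the selected tuples through unchanged
        for t in tuples:
            yield t

    R_tuples = [(1, {"A": 123, "B": "x"}), (2, {"A": 7, "B": "y"})]
    print(run_mapreduce(R_tuples, select_map, select_reduce))
    # [{'A': 123, 'B': 'x'}]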
Group By γA,sum(B)(R)
map(String value):
EmitIntermediate(value.A, value.B);
reduce(String k, Iterator values):
s=0
for each v in values:
s=s+v
Emit(k, s);
24
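And the group-by as runnable Python for the same sketch (illustrative; tuples of R(A,B) are dicts with keys "A" and "B"):

    def groupby_map(key, t):
        # re-key each tuple of R(A,B) by its grouping attribute A
        yield (t["A"], t["B"])

    def groupby_reduce(a, bs):
        # sum the B-values within each group
        yield (a, sum(bs))

    R_tuples = [(1, {"A": "x", "B": 10}), (2, {"A": "y", "B": 5}), (3, {"A": "x", "B": 7})]
    print(run_mapreduce(R_tuples, groupby_map, groupby_reduce))
    # [('x', 17), ('y', 5)]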
Join
Two simple parallel join algorithms:
• Partitioned hash-join (we saw it, will recap)
• Broadcast join
25
R(A,B) ⋈B=C S(C,D)
Partitioned Hash-Join
Initially, both R and S are horizontally partitioned
R1, S1 R2, S2 . . . RP, SP
Reshuffle R on R.B
and S on S.C
R’1, S’1 R’2, S’2 . . . R’P, S’P
Each server computes
the join locally
26
R(A,B) ⋈B=C S(C,D)
Partitioned Hash-Join
map(String value):
case value.relationName of
‘R’: EmitIntermediate(value.B, (‘R’, value));
‘S’: EmitIntermediate(value.C, (‘S’, value));
reduce(String k, Iterator values):
R = empty; S = empty;
for each v in values:
case v.type of:
‘R’: R.insert(v)
‘S’: S.insert(v);
for v1 in R, for v2 in S
Emit(v1,v2);
27
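A single-machine Python rendering of the same algorithm, again using the run_mapreduce sketch (illustrative; each tuple carries a "rel" tag so the map can tell R-tuples from S-tuples):

    def join_map(key, t):
        # Re-key each tuple on its join attribute: R on B, S on C
        if t["rel"] == "R":
            yield (t["B"], ("R", t))
        else:
            yield (t["C"], ("S", t))

    def join_reduce(join_key, tagged_tuples):
        # All R- and S-tuples with this join-key value arrive in one bag
        r_side = [t for tag, t in tagged_tuples if tag == "R"]
        s_side = [t for tag, t in tagged_tuples if tag == "S"]
        for r in r_side:
            for s in s_side:
                yield (r, s)

    data = [(1, {"rel": "R", "A": "a1", "B": 5}),
            (2, {"rel": "S", "C": 5, "D": "d1"})]
    print(run_mapreduce(data, join_map, join_reduce))
    # [({'rel': 'R', 'A': 'a1', 'B': 5}, {'rel': 'S', 'C': 5, 'D': 'd1'})]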
R(A,B) ⋈B=C S(C,D)
Broadcast Join
[Diagram: the small relation S is broadcast in full to every server; the
partitions R1, R2, …, RP of R need not be reshuffled, so each server ends up
with its R-partition together with a full copy of S and joins them locally.]
28
R(A,B) ⋈B=C S(C,D)
Broadcast Join
Note: here each map call should read
several records of R:
value = a group of records of R

map(String value):
  open(S); /* read the entire (small) table S over the network */
  hashTbl = new(); /* build a hash table on S.C */
  for each w in S:
    hashTbl.insert(w.C, w);
  close(S);
  for each v in value: /* probe with every R-record in the group */
    for each w in hashTbl.find(v.B):
      Emit(v, w);

reduce(…):
  /* empty: map-side join only */
29
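A sketch of the same map-side join in Python (illustrative; here "reading S over the network" is just an in-memory list, and one call handles a whole group of R-records, as the note above suggests):

    def broadcast_join_map(r_group, s_table):
        # Build a hash table on S.C from the entire (small) relation S
        hash_tbl = {}
        for s in s_table:
            hash_tbl.setdefault(s["C"], []).append(s)
        # Probe with every R-record in this map task's group; no reduce phase
        for r in r_group:
            for s in hash_tbl.get(r["B"], []):
                yield (r, s)

    S = [{"C": 5, "D": "d1"}, {"C": 9, "D": "d2"}]
    R_group = [{"A": "a1", "B": 5}, {"A": "a2", "B": 7}]
    print(list(broadcast_join_map(R_group, S)))
    # [({'A': 'a1', 'B': 5}, {'C': 5, 'D': 'd1'})]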
Conclusions
• MapReduce offers a simple abstraction, and
handles distribution + fault tolerance
• Speedup/scaleup is achieved by dynamically
allocating map tasks and reduce tasks to
available servers. However, skew is possible
(e.g. one huge reduce task)
• Writing intermediate results to disk is
necessary for fault tolerance, but very slow.
Spark replaces this with “Resilient Distributed
Datasets” = main memory + lineage