1.4 MapReduce

MapReduce is a programming paradigm for processing large datasets in a distributed manner. It allows automatic parallelization, distribution, fault tolerance, and load balancing across large clusters of commodity servers. A typical MapReduce job involves mapping data to extract key-value pairs, shuffling and sorting the data, then reducing to aggregate or transform the values associated with each key.


MapReduce

[Figure: Google Flu Trends, http://www.google.org/flutrends/ca/ (2012). Average searches per day: 5,134,000,000.]
Motivation
• Process lots of data
  – Google processed about 24 petabytes of data per day in 2009.
• A single machine cannot serve all the data
• You need a distributed system to store and process in parallel
• Parallel programming?
  – Threading is hard!
  – How do you facilitate communication between nodes?
  – How do you scale to more machines?
  – How do you handle machine failures?
MapReduce
• MapReduce [OSDI’04] provides
  – Automatic parallelization and distribution
  – I/O scheduling
    • Load balancing
    • Network and data transfer optimization
  – Fault tolerance
    • Handling of machine failures
• Need more power? Scale out, not up!
  – A large number of commodity servers as opposed to a few high-end specialized servers

(Apache Hadoop is the open-source implementation of MapReduce.)
Typical problem solved by MapReduce
• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results
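In the paper's terms, the user supplies only the Map and Reduce functions and the framework does the rest. A minimal Java-flavored sketch of their shapes (the interface name and type parameters here are illustrative, not Hadoop's actual API, which appears later):

import java.util.List;
import java.util.Map;

// Schematic shape of the two user-supplied functions, following the
// MapReduce paper: map(k1, v1) -> list(k2, v2) and
// reduce(k2, list(v2)) -> list(v2).
interface MapReduceJob<K1, V1, K2, V2> {
  List<Map.Entry<K2, V2>> map(K1 key, V1 value);  // extract what you care about
  List<V2> reduce(K2 key, List<V2> values);       // aggregate per key
}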
MapReduce workflow

[Figure: MapReduce workflow. Input data is divided into splits (Split 0, Split 1, Split 2) that Map workers read locally; Reduce workers fetch the intermediate data via remote reads, sort it, and write Output File 0 and Output File 1.]

Map: extract something you care about from each record
Reduce: aggregate, summarize, filter, or transform
Mappers and Reducers
• Need to handle more data? Just add more Mappers/Reducers!
• No need to write multithreaded code ☺
  – Mappers and Reducers are typically single-threaded and deterministic
    • Determinism allows failed tasks to be safely restarted
  – Mappers/Reducers run entirely independently of each other
    • In Hadoop, they run in separate JVMs
Example: Word Count

http://kickstarthadoop.blogspot.ca/2011/04/word-count-hadoop-map-reduce-example.html
Mapper
• Reads in an input pair <Key, Value>
• Outputs a pair <K’, V’>
  – Let’s count the number of occurrences of each word in user queries (or Tweets/Blogs)
  – The input to the mapper will be <queryID, QueryText>:
    <Q1, “The teacher went to the store. The store was closed;
    the store opens in the morning. The store opens at 9am.”>
  – The output would be (the example ignores punctuation and treats “The” and “the” as the same word):
    <The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1>
    <store, 1> <the, 1> <store, 1> <was, 1> <closed, 1> <the, 1>
    <store, 1> <opens, 1> <in, 1> <the, 1> <morning, 1>
    <the, 1> <store, 1> <opens, 1> <at, 1> <9am, 1>
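As a concrete sketch, this is the Mapper from the standard Apache Hadoop (2.x) WordCount example, which does the same thing for lines of text (unlike the slide's example, it does not fold case or strip punctuation):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count Mapper: emits <word, 1> for every token in the input line.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken()); // the word itself becomes the output key
      context.write(word, ONE);  // each occurrence counts as 1
    }
  }
}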
Reducer
• Accepts the Mapper output and aggregates the values on each key
  – After shuffle and sort, all values for the same key arrive together, e.g. <store, [1, 1, 1, 1]>
  – For our example, the reducer input would be:
    <The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1>
    <the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1>
    <opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1>
    <opens, 1> <at, 1> <9am, 1>
  – The output would be:
    <The, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1>
    <closed, 1> <opens, 2> <in, 1> <morning, 1> <at, 1> <9am, 1>
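Again as a concrete sketch, the matching Reducer from the standard Apache Hadoop WordCount example:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count Reducer: sums the 1s emitted by the Mapper for each word.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get(); // add up every occurrence of this word
    }
    result.set(sum);
    context.write(key, result); // emit <word, total count>
  }
}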
MapReduce

[Figure: MapReduce execution overview. The user program forks a Master and Worker processes; the Master assigns map tasks and reduce tasks to Workers. Map workers read input splits and write intermediate results locally; Reduce workers fetch them via remote reads, sort them, and write Output File 0 and Output File 1. Reading the input remotely would transfer peta-scale data through the network.]
Google File System (GFS)
Hadoop Distributed File System (HDFS)
• Split the data and store three replicas of each chunk on commodity servers
HDFS + MapReduce

[Figure: HDFS feeding MapReduce. The Master asks the NameNode where the chunks of input data are located, then assigns map tasks so that each Worker reads its split (Split 0, Split 1, Split 2) from local disk; Reduce workers perform remote reads and sorting before writing Output File 0 and Output File 1.]
Locality Optimization
• Master scheduling policy:
  – Asks GFS for the locations of the replicas of the input file blocks
  – Map tasks are scheduled so that an input block replica is on the same machine or the same rack
• Effect: thousands of machines read input at local disk speed
  – Eliminates the network bottleneck!
Failure in MapReduce
• Failures are the norm on commodity hardware
• Worker failure
  – Detect failure via periodic heartbeats
  – Re-execute in-progress map/reduce tasks
• Master failure
  – Single point of failure; resume from the execution log
• Robust
  – Google’s experience: once lost 1600 of 1800 machines, but the job finished fine
Fault tolerance:
Handled via re-execution
• On worker failure:
  – Detect failure via periodic heartbeats
  – Re-execute completed and in-progress map tasks
    (completed map tasks must be re-run because their output sits on the failed worker’s local disk)
  – Task completion is committed through the master
• Robust: [Google’s experience] lost 1600 of 1800 machines, but finished fine
Refinement:
Redundant Execution
• Slow workers (“stragglers”) significantly lengthen completion time
  – Other jobs consuming resources on the machine
  – Bad disks with soft errors transfer data very slowly
  – Weird things: processor caches disabled (!!)
• Solution: near the end of the job, spawn backup copies of the remaining tasks
  – Whichever copy finishes first “wins”
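Hadoop implements this refinement as "speculative execution". A minimal sketch of turning it on for one job, assuming the Hadoop 2.x property names (it is already on by default in many distributions; the sketch only makes the knob visible):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: enable speculative (backup) copies of slow tasks for a job.
public class SpeculativeJobSetup {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", true);    // backup copies of slow map tasks
    conf.setBoolean("mapreduce.reduce.speculative", true); // backup copies of slow reduce tasks
    return Job.getInstance(conf, "job with backup tasks");
  }
}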
Refinement:
Skipping Bad Records
• Map/Reduce functions sometimes fail deterministically for particular inputs
• The best solution is to debug & fix, but that is not always possible (e.g., the bug is in a third-party library)
• If the master sees two failures for the same record:
  – The next worker is told to skip that record
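Hadoop's counterpart to this refinement is the SkipBadRecords helper from the old "mapred" API (record skipping only applies to jobs written against that API). A minimal sketch of the configuration, with illustrative threshold values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;

// Sketch: after repeated task failures, the framework narrows down the
// offending record range and tells the next attempt to skip it.
public class SkipBadRecordsSetup {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2); // enter skipping mode after 2 failed attempts
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);    // tolerate skipping at most 1 record per bad range
    return conf;
  }
}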
A MapReduce Job

[Figure, slides 19–20: source code of the Mapper, the Reducer, and the driver that runs the program as a MapReduce job.]
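The code shown on those two slides did not survive conversion. As a representative reconstruction (not necessarily the exact code the slides showed), this is the driver from the standard Apache Hadoop WordCount example, which wires the Mapper and Reducer sketched earlier into a job and submits it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: run the word-count program as a MapReduce job.
public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it could be launched along the lines of $ yarn jar wordcount.jar WordCount <input dir> <output dir> (the jar name here is hypothetical).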
Summary
• MapReduce
  – Programming paradigm for data-intensive computing
  – Distributed & parallel execution model
  – Simple to program
    • The framework automates many tedious tasks (machine selection, failure handling, etc.)
Running Program in MR
• With the Apache sources under /opt, the examples will be in the following directory:

  /opt/hadoop-2.6.0/share/hadoop/mapreduce/

• The exact location of the examples jar file can be found using the find command:

  $ find / -name "hadoop-mapreduce-examples*.jar" -print
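The commands on the next slides refer to the jar through a $HADOOP_EXAMPLES variable. One way to define it, assuming the Hadoop 2.6.0 layout above:

  $ export HADOOP_EXAMPLES=/opt/hadoop-2.6.0/share/hadoop/mapreduce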
Running Program in MR
• A list of the available examples can be found by running the following command:

  $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar
Running Program in MR
• The pi example calculates the digits of π using a quasi-Monte Carlo method. Here it runs with 16 map tasks of 1,000,000 samples each:

  $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 1000000
Running the Terasort Test
• The terasort benchmark sorts a specified amount of randomly generated data.
• This benchmark provides combined testing of the HDFS and MapReduce layers of a Hadoop cluster.
• A full terasort benchmark run consists of the following three steps:
  – Generating the input data via the teragen program.
  – Running the actual terasort benchmark on the input data.
  – Validating the sorted output data via the teravalidate program.
Running the Terasort Test
• Run teragen to generate rows of random data to sort (each row is 100 bytes, so 500,000,000 rows ≈ 50 GB):

  $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 /user/hdfs/TeraGen-50GB

• Run terasort to sort the data:

  $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB

• Run teravalidate to validate the sort:

  $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB
Running the Terasort Test
• The following command will instruct terasort to use four reducer tasks:

  $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
Additional information and background on each of the examples and benchmarks
• Pi Benchmark
  https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/pi/package-summary.html
• Terasort Benchmark
  https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html
• Benchmarking and Stress Testing an Hadoop Cluster
  http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stresstesting-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench (uses Hadoop V1, will work with V2)
