
MapReduce

Major Components
• User Components:
– Mapper
– Reducer
– Combiner (Optional)
– Partitioner (Shuffle) (Optional)
– Writable(s) (Optional)

• System Components:
– Master
– Input Splitter*
– Output Committer*
* You can use your own if you really want!

Image source: http://www.ibm.com/developerworks/java/library/l-hadoop-3/index.html


Basic Concepts
• All data is represented as key-value pairs of an arbitrary type
• Data is read in from a file or list of files, from HDFS
• Data is chunked based on an input split; a typical chunk is 64MB
(larger or smaller splits can be configured depending on your use case)
• Mappers read in a chunk of data
Mappers emit (write out) a set of data, typically derived from their
input
• Intermediate data (the output of the mappers) is split to a number
of reducers
• Reducers receive each key of data, along with ALL of the values
associated with it (this means each key must always be sent to the
same reducer)
– Essentially, <key, set<value>>
• Reducers emit a set of data, typically reduced from their input, which is
written to disk
Data Flow

[Figure: Input → Split 0 / Split 1 / Split 2 → Mapper 0 / Mapper 1 / Mapper 2 → (shuffle) → Reducer 0 / Reducer 1 → Out 0 / Out 1]
Input Splitter
• Is responsible for splitting your input into multiple
chunks
• These chunks are then used as input for your mappers
• Splits on logical boundaries. The default is 64MB per
chunk
– Depending on what you’re doing, 64MB might be a LOT of
data! You can change it
• Typically, you can just use one of the built-in splitters,
unless you are reading in a specially formatted file
• Each map task corresponds to a single input split.
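As a rough illustration (not from the slides), Hadoop's FileInputFormat lets a job bound the split size; the 64MB/128MB figures below are example values only, and the Configuration object conf is assumed to already exist in a driver:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: bound the split size for one job (values are examples only).
Job job = Job.getInstance(conf, "word count");                  // conf: an existing Configuration
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64MB per split
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // at most 128MB per split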
Record Reader
• The RecordReader generates records (key-value
pairs) and passes them to the map function.
• The RecordReader is invoked repeatedly on
the input until the entire split is consumed.
Mapper
• Reads in input pair <K,V> (a section as split by the input
splitter)
• Outputs a pair <K’, V’>

• Ex. For our Word Count example (each word is lowercased before it is emitted), with the following input:
“The teacher went to the store. The store was closed; the
store opens in the morning. The store opens at 9am.”

• The output would be:


– <the, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1>
<the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1>
<opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
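A minimal sketch of such a mapper using Hadoop's Java API (the class name TokenizerMapper and the lowercasing choice are illustrative, not from the slides):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // The input key is the byte offset of the line; the value is the line itself.
    StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);   // emit <word, 1> for every token
    }
  }
}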
Reducer
• Accepts the Mapper output, and collects values on the
key
– All inputs with the same key must go to the same reducer!
• Input is typically sorted by key; output is written out exactly as produced
• For our example, the reducer input would be:
– <the, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1>
<the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1>
<opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1>
<opens, 1> <at, 1> <9am, 1>
• The output would be:
– <the, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1>
<closed, 1> <opens, 2> <in, 1> <morning, 1> <at, 1> <9am, 1>
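A minimal sketch of the matching reducer in Hadoop's Java API (the class name IntSumReducer is illustrative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // All values for one key arrive together: sum them up.
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);  // emit <word, total count>
  }
}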
Output Committer
• Is responsible for taking the reduce output,
and committing it to a file
• Typically, this committer needs a
corresponding input splitter (so that another
job can read the output as its input)
• Again, usually built in splitters are good
enough, unless you need to output a special
kind of file
Combiner
• Essentially an intermediate reducer
• Is optional
• Used to reduce the amount of data received by the
Reduce class by combining (pre-aggregating) the data
output from the map
• Reduces the output of each mapper locally, cutting
network bandwidth and sort cost
• Cannot change the types it works with
– Its input types must be the same as its output types
(both match the mapper's output types)
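Because the word-count reduce function is associative and commutative, the reducer sketched earlier could also serve as the combiner. A one-line example, assuming a Job object named job from a driver like the one sketched under Execution Flow:

job.setCombinerClass(IntSumReducer.class);  // run a local reduce over each mapper's output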
Partitioner (Shuffler)
• Decides which pairs are sent to which reducer
• Default is simply:
– Key.hashCode() % numOfReducers

• User can override to:


– Provide (more) uniform distribution of load between
reducers
– Ensure that related values are sent to the same
reducer
• Ex. To compute the relative frequency of a word pair
<W1, W2>, you would need to make sure all pairs that share
the same left word W1 are sent to the same reducer
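A hedged sketch of such a partitioner, assuming pair keys are stored as Text in the form "w1 w2" (the key format and the class name FirstWordPartitioner are assumptions, not from the slides):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstWordPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Partition only on the first word of the pair key, so all <W1, *> pairs co-locate.
    String firstWord = key.toString().split(" ")[0];
    // Mask off the sign bit so the result of % is never negative.
    return (firstWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}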
Master
• Responsible for scheduling & managing jobs

• Scheduled computation should be close to the data if possible


– Bandwidth is expensive! (and slow)
– This relies on a Distributed File System (GFS / HDFS)!

• If a task fails to report progress (reading input, writing output,
etc.), crashes, or its machine goes down, it is assumed to be
stuck; it is killed and the task is re-launched (with the same
input)

• The Master is handled by the framework; no user code is necessary


Master Cont.
• HDFS can replicate data to be local if necessary
for scheduling
• Because our nodes are (or at least should be)
deterministic
– The Master can restart failed nodes
• Nodes should have no side effects!
– If a node is the last step, and is completing slowly, the
master can launch a second copy of that node
• This can be due to hardware issues, network issues, etc.
• First one to complete wins, then any other runs are killed
Output Committer
• The Map-Reduce framework relies on
the OutputCommitter of the job to:
– Setup the job during initialization. For example, create the
temporary output directory for the job during the
initialization of the job.
– Cleanup the job after the job completion. For example,
remove the temporary output directory after the job
completion.
– Setup the task temporary output.
– Check whether a task needs a commit. This is to avoid the
commit procedure if a task does not need commit.
– Commit the task output.
– Discard the task commit (e.g., if the task is aborted).
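These steps correspond roughly to the methods of Hadoop's abstract org.apache.hadoop.mapreduce.OutputCommitter (FileOutputCommitter is the usual concrete implementation). A sketch of the core signatures, for orientation only:

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class OutputCommitter {
  public abstract void setupJob(JobContext jobContext) throws IOException;                    // job setup
  public abstract void setupTask(TaskAttemptContext taskContext) throws IOException;          // task setup
  public abstract boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException; // commit needed?
  public abstract void commitTask(TaskAttemptContext taskContext) throws IOException;         // commit task output
  public abstract void abortTask(TaskAttemptContext taskContext) throws IOException;          // discard task output
  // commitJob(JobContext) / abortJob(...) cover the job-level cleanup steps.
}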
Writables
• Are types that can be serialized / deserialized to a stream
• Your input/output classes are required to be Writables, as the framework
will serialize your data before writing it to disk
• Users can implement this interface and use their own types
for their input/output/intermediate values
• There are defaults for basic types, like Strings (Text),
Integers (IntWritable), Longs (LongWritable), etc.
• Can also handle collections, such as arrays, maps, etc.
• Your application needs at least six writables
– 2 for your input
– 2 for your intermediate values (Map <-> Reduce)
– 2 for your output
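A hedged sketch of a custom Writable (the class name PairWritable and its fields are illustrative, not from the slides). Note that types used as keys must additionally implement WritableComparable so they can be sorted:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class PairWritable implements Writable {

  private String first;
  private int count;

  public PairWritable() { }                       // required no-arg constructor

  public PairWritable(String first, int count) {
    this.first = first;
    this.count = count;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // Serialize the fields in a fixed order...
    out.writeUTF(first);
    out.writeInt(count);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // ...and deserialize them in exactly the same order.
    first = in.readUTF();
    count = in.readInt();
  }
}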
Execution Flow
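A minimal sketch of a driver that wires the pieces above together, using only standard Hadoop calls (the class names and the input/output paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(TokenizerMapper.class);     // map: line -> <word, 1>
    job.setCombinerClass(IntSumReducer.class);     // optional local reduce
    job.setReducerClass(IntSumReducer.class);      // reduce: <word, [1,1,...]> -> <word, n>

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}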
