Introduction To MapReduce
Introducing MapReduce
• Now that we have described how Hadoop stores data, let's turn
our attention to how it processes data
• We typically process data in Hadoop using MapReduce
• MapReduce is not a language; it's a programming model
• MapReduce is a method for distributing a task across multiple
nodes. Each node processes data stored on that node.
• MapReduce consists of two functions:
map (K1, V1) -> (K2, V2)
reduce (K2, list(V2)) -> list(K3, V3)
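The two signatures above can be sketched with a hypothetical word-count example in plain Python (a simulation of the model, not Hadoop itself; the function names are illustrative):

```python
# map(K1, V1) -> list of (K2, V2): here K1 is a byte offset, V1 a line of text,
# and each output pair is (word, 1).
def map_fn(offset, line):
    return [(word, 1) for word in line.split()]

# reduce(K2, list(V2)) -> list of (K3, V3): here K2 is a word, list(V2) its
# counts, and the output is (word, total).
def reduce_fn(word, counts):
    return [(word, sum(counts))]

print(map_fn(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
print(reduce_fn("to", [1, 1]))
# [('to', 2)]
```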
• Automatic parallelization and distribution (the biggest advantage)
• Fault tolerance (individual tasks can be retried)
• Hadoop comes with standard status and monitoring tools
• A clean abstraction for developers
• MapReduce programs are usually written in Java (possibly in other
languages using Hadoop Streaming)
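In the streaming style just mentioned, the mapper and reducer are ordinary programs that exchange tab-separated key/value lines. A sketch of word count in that style (the pipeline is simulated on a list here; in a real streaming job each function would read standard input):

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word in each input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Hadoop delivers the mapper output sorted by key; sum counts per key."""
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulating the shuffle/sort between the two stages with sorted():
mapped = sorted(mapper(["to be or not to be"]))
print(list(reducer(mapped)))
# ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```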
Understanding Map and Reduce
The map function always runs first
• Typically used to “break down” the input data
• Filter, transform, or parse data, e.g. parse the stock symbol, price, and time
from a data feed
• The output from the map function (eventually) becomes the input to the
reduce function
The reduce function
• Typically used to aggregate data from the map function
• e.g. Compute the average hourly price of the stock
• Not always needed and therefore optional
• You can run something called a “map-only” job
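The stock example from the bullets above can be sketched as two small functions (a hypothetical illustration; the CSV record format and function names are assumptions, not from the original text):

```python
# Map side: filter/transform/parse — pull (symbol, hour) and price
# out of a raw "symbol,price,HH:MM" record.
def parse_quote(line):
    symbol, price, timestamp = line.split(",")
    hour = timestamp.split(":")[0]          # e.g. "09:15" -> "09"
    return ((symbol, hour), float(price))

# Reduce side: aggregate — average the prices seen for one (symbol, hour) key.
def average_price(key, prices):
    return (key, sum(prices) / len(prices))

feed = ["ACME,10.00,09:15", "ACME,12.00,09:45", "ACME,11.00,10:05"]
mapped = [parse_quote(line) for line in feed]
print(mapped[0])
# (('ACME', '09'), 10.0)
print(average_price(("ACME", "09"), [10.0, 12.0]))
# (('ACME', '09'), 11.0)
```

In a map-only job, the `parse_quote` step would run alone and its output would be written directly, with no aggregation step.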
Between these two tasks there is typically a hidden phase known as the
“Shuffle and Sort”
• Which organizes map output for delivery to the reducer
Each individual piece is simple, but collectively they are quite powerful
• Analogous to pipes and filters in Unix
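The Shuffle and Sort phase described above can be sketched as a grouping step: map output pairs are sorted by key so each call to the reducer sees one key with all of its values, much like `sort` sitting between two filters in a Unix pipeline (a simplified single-machine illustration; the function name is an assumption):

```python
from itertools import groupby

def shuffle_and_sort(map_output):
    """Turn (key, value) pairs into (key, [values]) lists, ordered by key."""
    ordered = sorted(map_output, key=lambda kv: kv[0])
    return [(key, [v for _, v in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]

print(shuffle_and_sort([("be", 1), ("to", 1), ("be", 1)]))
# [('be', [1, 1]), ('to', [1])]
```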
MapReduce Daemons
JobTracker
◦ Determines the execution plan for the job
◦ Assigns individual tasks
TaskTracker
◦ Runs individual map and reduce tasks and reports their progress to the JobTracker
MapReduce: The Big Picture
Map Process