Introduction To MapReduce

The document introduces MapReduce, describing it as a programming model for processing large datasets in a distributed manner. MapReduce consists of map and reduce functions, with the map function processing input data in parallel across nodes and the reduce function combining results from the map stage. Fault tolerance is built into the MapReduce framework.


Introducing MapReduce
• Now that we have described how Hadoop stores data, let's turn
our attention to how it processes data
• We typically process data in Hadoop using MapReduce
• MapReduce is not a language; it is a programming model
• MapReduce is a method for distributing a task across multiple
nodes. Each node processes data stored on that node.
• MapReduce consists of two functions:
 map (K1, V1) -> list(K2, V2)
 reduce (K2, list(V2)) -> list(K3, V3)
• Key benefits of the model:
 Automatic parallelization and distribution (the biggest advantage)
 Fault tolerance (individual tasks can be retried)
 Standard status and monitoring tools, which ship with Hadoop
 A clean abstraction for developers
 MapReduce programs are usually written in Java (or in other
languages using Hadoop Streaming)
Understanding Map and Reduce
The map function always runs first
• Typically used to "break down" the data
• Filters, transforms, or parses data, e.g. parsing the stock symbol, price, and time
from a data feed
• The output from the map function (eventually) becomes the input to the
reduce function
The reduce function
• Typically used to aggregate data from the map function
• e.g. computing the average hourly price of the stock
• Not always needed, and therefore optional
• You can run a "map-only" job that skips the reduce phase
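The stock-feed example above can be sketched as two plain Python functions, in the style of Hadoop Streaming. This is a hypothetical illustration: the input line format `SYMBOL,PRICE,HH:MM` and the function names are assumptions for the sketch, not taken from the slides.

```python
def map_fn(offset, line):
    # Parse one feed line "SYMBOL,PRICE,HH:MM" and emit ((symbol, hour), price).
    symbol, price, time = line.split(",")
    hour = time.split(":")[0]
    yield (symbol, hour), float(price)

def reduce_fn(key, prices):
    # Aggregate: average all prices observed for one (symbol, hour) key.
    prices = list(prices)
    return key, sum(prices) / len(prices)
```

In a map-only job, `reduce_fn` would simply be omitted and the map output written out directly.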
Understanding Map and Reduce
Between these two phases there is typically a hidden phase known as the
"Shuffle and Sort"
• It organizes the map output for delivery to the reducers
Each individual piece is simple, but collectively they are quite powerful
• Analogous to a pipe/filter chain in Unix
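The hidden Shuffle and Sort phase can be mimicked in a few lines of Python. This is a single-process stand-in for what the framework does across the cluster; `shuffle_and_sort` is a name chosen here for illustration, not a Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(pairs):
    # Sort the intermediate (key, value) pairs by key, then group them,
    # so each reducer receives (key, [values]) for the keys assigned to it.
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]
```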
Typical Large Data Problem
 JobTracker
◦ Determines the execution plan for the job
◦ Assigns individual tasks

 TaskTracker
◦ Keeps track of the performance of an individual mapper or reducer
MapReduce: The Big Picture
Map Process
• map (in_key, in_value) -> list(out_key, out_value)

Reduce Process
• reduce (out_key, list(out_value)) -> list(final_key, final_value)
Terminology
• The client program submits a job to Hadoop.
• The job consists of a mapper, a reducer, and a list of inputs.
• The job is sent to the JobTracker process on the Master Node.
• Each Slave Node runs a process called the TaskTracker.
• The JobTracker instructs TaskTrackers to run and monitor tasks.
• A Map or Reduce over a piece of data is a single task.
• A task attempt is an instance of a task running on a slave node.
MapReduce: High Level
MapReduce Failure Recovery
 Task processes send heartbeats to the TaskTracker.
 TaskTrackers send heartbeats to the JobTracker.
 Any task that fails to report in 10 minutes is assumed to have
failed; its JVM is killed by the TaskTracker.
 Any task that throws an exception is said to have failed.
 Failed tasks are reported to the JobTracker by the TaskTracker.
 The JobTracker reschedules any failed tasks - it tries to avoid
rescheduling the task on the same TaskTracker where it
previously failed.
 If a task fails 4 times (the default limit), the whole job fails.
TaskTracker Recovery
 Any TaskTracker that fails to report in 10 minutes is assumed to
have crashed.
 All tasks on the node are restarted elsewhere.
 Any TaskTracker reporting a high number of failed tasks is
blacklisted, to prevent the node from blocking the entire job.
 There is also a “global blacklist”, for TaskTrackers which fail on
multiple jobs.
 The JobTracker manages the state of each job, and partial results
of failed tasks are ignored.
Example: Word Count

• We have a large file of words, one word to a line


• Count the number of times each distinct word appears in the
file
• Sample application: analyze web server logs to find popular
URLs
MapReduce
• Input: a set of key/value pairs
• User supplies two functions:
• map(k, v) -> list(k1, v1)
• reduce(k1, list(v1)) -> v2
• (k1,v1) is an intermediate key/value pair
• Output is the set of (k1,v2) pairs
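Putting the definitions above together, word count can be sketched end to end in Python. This is a single-process toy driver standing in for the framework; real Hadoop distributes each phase across the cluster:

```python
from collections import defaultdict

def map_fn(_, line):
    # map(k, v) -> list(k1, v1): emit (word, 1) for each word on the line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce(k1, list(v1)) -> v2: sum the ones emitted for each word
    return word, sum(counts)

def run_job(lines):
    # Map phase, then shuffle (group values by key), then reduce phase.
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for k1, v1 in map_fn(offset, line):
            groups[k1].append(v1)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))
```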
MapReduce: Word Count
MapReduce Example
Explanation Of The Map Function
Shuffle and Sort
Explanation of Reduce Function
Putting It All Together
Benefits of MapReduce
• Simplicity (via fault tolerance)
• Particularly when compared with other distributed programming
models
• Flexibility
• Offers more analytic capabilities and works with more data types than
platforms like SQL
• Scalability
• Because it
• Processes small quantities of data at a time
• Runs in parallel across a cluster
• Shares nothing among the participating nodes
