2 Mapreduce Model Principles

The document discusses MapReduce, a programming model for processing large datasets in a distributed environment. It describes how MapReduce allows developers to focus on defining computations rather than how they are executed. It also explains the two key stages of MapReduce: the map stage that processes input records in parallel, and the reduce stage that aggregates the outputs from the map stage.


MAP-REDUCE: SOME PRINCIPLES AND PATTERNS

GENOVEVA VARGAS SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
[email protected]
http://mapreducefest.wordpress.com/
http://vargas-solar.imag.fr
MAP-REDUCE

- Programming model for expressing distributed computations on massive amounts of data
- Execution framework for large-scale data processing on clusters of commodity servers
- Market: any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing content must tackle large-data problems
  - Data-intensive processing is beyond the capability of any individual machine and requires clusters
  - Large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines

« Data represent the rising tide that lifts all boats—more data lead to better algorithms and systems for solving real-world problems »
DATA PROCESSING

- Process the data to produce other data: analysis tools, business intelligence tools, ...
- This means:
  - Handling large volumes of data
  - Managing thousands of processors
  - Parallelizing and distributing treatments
  - Scheduling I/O
  - Managing fault tolerance
  - Monitoring/controlling processes

MapReduce makes all of this easy!
MOTIVATION

- The only feasible approach to tackling large-data problems is to divide and conquer
- To the extent that the sub-problems are independent, they can be tackled in parallel by different workers (threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster)
- Intermediate results from each individual worker are then combined to yield the final output
- Aspects to consider:
  - How do we decompose the problem so that the smaller tasks can be executed in parallel?
  - How do we assign tasks to workers distributed across a potentially large number of machines? (some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)
  - How do we ensure that the workers get the data they need?
  - How do we coordinate synchronization among the different workers?
  - How do we share partial results from one worker that are needed by another?
  - How do we accomplish all of the above in the face of software errors and hardware faults?
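The divide-and-conquer pattern can be sketched in a few lines: split the input into independent chunks, hand each chunk to a worker, then combine the partial results. This is an illustrative in-process sketch (the function names and chunking scheme are ours, not part of any MapReduce API):

```python
# Divide and conquer in miniature: decompose, run sub-problems in
# parallel, combine partial results. Word counting stands in for any
# computation over independent chunks.
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Worker task: process one independent sub-problem."""
    return sum(len(line.split()) for line in chunk)

def total_words(lines, n_workers=4):
    # Decompose: split the dataset into roughly one chunk per worker.
    size = max(1, len(lines) // n_workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Execute sub-problems in parallel, then combine the partial counts.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_words, chunks))

print(total_words(["hello world", "map reduce", "divide and conquer"]))  # 7
```

Every question in the list above (task assignment, data delivery, synchronization, fault handling) is trivial here because everything lives in one process; MapReduce exists precisely because none of it is trivial across thousands of machines.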
MOTIVATION

- OpenMP for shared-memory parallelism, or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism, provide logical abstractions that hide details of operating-system synchronization and communication primitives
  → developers must still keep track of how resources are made available to workers

- MapReduce provides an abstraction that hides many system-level details from the programmer
  → developers focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes

- Yet, organizing and coordinating large amounts of computation is only part of the challenge
- Large-data processing requires bringing data and code together for computation to occur — no small feat for datasets that are terabytes and perhaps petabytes in size!
APPROACH

Centralized computing with distributed data storage: run the program at the client and fetch the data from the distributed system.
Downsides: heavy data flows, no use of the cluster's computing resources.

- Instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data: "push the program near the data"
- The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce
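"Push the program near the data" boils down to a scheduling decision: run each task on a worker that already stores a replica of that task's input block. A toy sketch, with a made-up block-to-worker placement map (real systems get this information from the distributed file system):

```python
# Locality-aware scheduling sketch: assign each block's task to a worker
# that holds a local replica, so no input data crosses the network.
# The placement map below is invented for the example.
placement = {
    "block1": ["w1", "w2"],  # block1 is replicated on workers w1 and w2
    "block2": ["w2", "w3"],
    "block3": ["w1", "w3"],
}

def schedule(blocks, placement):
    """Pick, for each block, a worker that stores a local replica."""
    return {b: placement[b][0] for b in blocks}

print(schedule(["block1", "block2", "block3"], placement))
# {'block1': 'w1', 'block2': 'w2', 'block3': 'w1'}
```

A real scheduler also balances load across workers and falls back to rack-local or remote reads when every replica holder is busy; the sketch only shows the locality preference itself.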
MAP-REDUCE PRINCIPLE

- Stage 1: apply a user-specified computation over all input records in a dataset
  - These operations occur in parallel and yield intermediate output (key-value pairs)

- Stage 2: aggregate the intermediate output by another user-specified computation
  - The intermediate pairs are grouped by key, and the aggregation function is applied to each key's list of values
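The two stages can be shown with the canonical word-count example. This is a minimal in-memory sketch; the function names are ours, not a real framework API, and a real MapReduce run would execute the map calls and the per-key reductions on different machines:

```python
# Stage 1 (map): emit intermediate (key, value) pairs for each record.
# Stage 2 (reduce): group pairs by key, then aggregate each key's values.
from collections import defaultdict

def map_stage(records):
    """User-specified computation applied to every input record."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_stage(pairs):
    """Group intermediate pairs by key and aggregate the value lists."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_stage(map_stage(["the cat", "the dog"]))
print(counts)  # {'the': 2, 'cat': 1, 'dog': 1}
```

The grouping step between the two stages is what the framework's shuffle phase performs across the cluster; the user only supplies the per-record map function and the per-key aggregation.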
