2 Mapreduce Model Principles

The document discusses MapReduce, a programming model for processing large datasets in a distributed environment. It describes how MapReduce allows developers to focus on defining computations rather than how they are executed. It also explains the two key stages of MapReduce: the map stage that processes input records in parallel, and the reduce stage that aggregates the outputs from the map stage.


MAP-REDUCE: SOME PRINCIPLES AND PATTERNS

GENOVEVA VARGAS SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
[email protected]
http://mapreducefest.wordpress.com/
http://vargas-solar.imag.fr
MAP-REDUCE

- Programming model for expressing distributed computations on massive amounts of data
- Execution framework for large-scale data processing on clusters of commodity servers
- Market: any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing content must tackle large-data problems
  - Data-intensive processing is beyond the capability of any individual machine and requires clusters
  - Large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines

« Data represent the rising tide that lifts all boats—more data lead to better algorithms and systems for solving real-world problems »
DATA PROCESSING

- Process the data to produce other data: analysis tools, business intelligence tools, ...
- This means:
  - Handling large volumes of data
  - Managing thousands of processors
  - Parallelizing and distributing treatments
  - Scheduling I/O
  - Managing fault tolerance
  - Monitoring/controlling processes

MapReduce makes all of this easy!
MOTIVATION

- The only feasible approach to tackling large-data problems is to divide and conquer
- To the extent that the sub-problems are independent, they can be tackled in parallel by different workers (threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster)
- Intermediate results from each individual worker are then combined to yield the final output
- Aspects to consider:
  - How do we decompose the problem so that the smaller tasks can be executed in parallel?
  - How do we assign tasks to workers distributed across a potentially large number of machines? (some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)
  - How do we ensure that the workers get the data they need?
  - How do we coordinate synchronization among the different workers?
  - How do we share partial results from one worker that are needed by another?
  - How do we accomplish all of the above in the face of software errors and hardware faults?
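The divide-and-conquer pattern can be sketched in a few lines: split the input into independent chunks, hand each chunk to a worker, then combine the partial results. This is an illustrative in-process sketch (the function names and chunking scheme are ours, not part of any MapReduce API):

```python
# Divide and conquer in miniature: decompose, run sub-problems in
# parallel, combine partial results. Word counting stands in for any
# computation over independent chunks.
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Worker task: process one independent sub-problem."""
    return sum(len(line.split()) for line in chunk)

def total_words(lines, n_workers=4):
    # Decompose: split the dataset into roughly one chunk per worker.
    size = max(1, len(lines) // n_workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Execute sub-problems in parallel, then combine the partial counts.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_words, chunks))

print(total_words(["hello world", "map reduce", "divide and conquer"]))  # 7
```

Every question in the list above (task assignment, data delivery, synchronization, fault handling) is trivial here because everything lives in one process; MapReduce exists precisely because none of it is trivial across thousands of machines.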
MOTIVATION

- OpenMP for shared-memory parallelism, or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism, provide logical abstractions that hide details of operating-system synchronization and communication primitives
  → developers must still keep track of how resources are made available to workers

- MapReduce provides an abstraction that hides many system-level details from the programmer
  → developers focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes

- Yet, organizing and coordinating large amounts of computation is only part of the challenge
- Large-data processing requires bringing data and code together for computation to occur — no small feat for datasets that are terabytes and perhaps petabytes in size!
APPROACH

Centralized computing with distributed data storage: run the program at the client and fetch the data from the distributed system.
Downsides: heavy data flows, no use of the cluster's computing resources.

- Instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data: "push the program near the data"
- The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce
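"Push the program near the data" boils down to a scheduling decision: run each task on a worker that already stores a replica of that task's input block. A toy sketch, with a made-up block-to-worker placement map (real systems get this information from the distributed file system):

```python
# Locality-aware scheduling sketch: assign each block's task to a worker
# that holds a local replica, so no input data crosses the network.
# The placement map below is invented for the example.
placement = {
    "block1": ["w1", "w2"],  # block1 is replicated on workers w1 and w2
    "block2": ["w2", "w3"],
    "block3": ["w1", "w3"],
}

def schedule(blocks, placement):
    """Pick, for each block, a worker that stores a local replica."""
    return {b: placement[b][0] for b in blocks}

print(schedule(["block1", "block2", "block3"], placement))
# {'block1': 'w1', 'block2': 'w2', 'block3': 'w1'}
```

A real scheduler also balances load across workers and falls back to rack-local or remote reads when every replica holder is busy; the sketch only shows the locality preference itself.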
MAP-REDUCE PRINCIPLE

- Stage 1: apply a user-specified computation over all input records in a dataset
  - These operations occur in parallel and yield intermediate output (key-value pairs)

- Stage 2: aggregate the intermediate output by another user-specified computation
  - The intermediate pairs are grouped by key, and the aggregation function is applied to each key's list of values
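The two stages can be shown with the canonical word-count example. This is a minimal in-memory sketch; the function names are ours, not a real framework API, and a real MapReduce run would execute the map calls and the per-key reductions on different machines:

```python
# Stage 1 (map): emit intermediate (key, value) pairs for each record.
# Stage 2 (reduce): group pairs by key, then aggregate each key's values.
from collections import defaultdict

def map_stage(records):
    """User-specified computation applied to every input record."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_stage(pairs):
    """Group intermediate pairs by key and aggregate the value lists."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_stage(map_stage(["the cat", "the dog"]))
print(counts)  # {'the': 2, 'cat': 1, 'dog': 1}
```

The grouping step between the two stages is what the framework's shuffle phase performs across the cluster; the user only supplies the per-record map function and the per-key aggregation.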
