Parallel Programming, MapReduce Model: Unit II
UNIT II
A serial program consists of a sequence of instructions, where each instruction is executed one after the other. In a parallel program, the processing is broken up into parts, each of which can be executed concurrently: sets of tasks that can run concurrently and/or partitions of data that can be processed concurrently. A common situation is having a large amount of consistent data which must be processed.
The MASTER:
takes the array and splits it up according to the number of available WORKERS
sends each WORKER its subarray
receives the results from each WORKER

The WORKER:
receives the subarray from the MASTER
performs processing on the subarray
returns results to MASTER
Approximating pi
Inscribe a circle of radius r in a square of side 2r. The area of the square, denoted As, is (2r)^2 = 4r^2. The area of the circle, denoted Ac, is pi * r^2. So:

pi = Ac / r^2
r^2 = As / 4
pi = 4 * Ac / As

Randomly generate points inside the square and count the number of generated points that are both in the circle and in the square. Then:

r = the number of points in the circle divided by the number of points in the square
PI = 4 * r
NUMPOINTS = 100000; // some large number - the bigger, the closer the approximation
p = number of WORKERS;
numPerWorker = NUMPOINTS / p;
countCircle = 0; // one of these for each WORKER

// each WORKER does the following:
for (i = 0; i < numPerWorker; i++) {
    generate 2 random numbers that lie inside the square;
    xcoord = first random number;
    ycoord = second random number;
    if (xcoord, ycoord) lies inside the circle
        countCircle++;
}

// MASTER: receives from WORKERS their countCircle values and computes PI:
PI = 4.0 * (sum of countCircle values) / NUMPOINTS;
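A runnable single-worker sketch of the pseudocode above, assuming a unit circle (r = 1) inscribed in the square [-1, 1] x [-1, 1]; the fixed seed is only there to make the run repeatable:

```python
import random

def estimate_pi(num_points, seed=0):
    rng = random.Random(seed)
    count_circle = 0
    for _ in range(num_points):
        # Generate a random point inside the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Count it if it also lies inside the circle x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            count_circle += 1
    # PI = 4 * (points in circle) / (points in square)
    return 4.0 * count_circle / num_points

print(estimate_pi(100000))  # close to 3.14159; the bigger num_points, the closer
```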
MapReduce
A Brief History
reduce() function
Combines all elements of a sequence using a binary operator
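In Python, for example, the built-in functools.reduce folds a sequence with a binary operator in exactly this way:

```python
from functools import reduce

# reduce(f, [a, b, c, d]) computes f(f(f(a, b), c), d):
# the binary operator is applied left to right across the sequence.
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])  # ((1 + 2) + 3) + 4
print(total)  # 10
```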
What is MapReduce?
This model derives from the map and reduce combinators of functional languages like Lisp. It is a restricted parallel programming model meant for large clusters.
Map()
Process a key/value pair to generate intermediate key/value pairs
Reduce()
Merge all intermediate values associated with the same key
Map()
Input: <filename, file text>
Parses the file and emits <word, count> pairs, e.g. <hello, 1>
Reduce()
Sums all values for the same key and emits <word, TotalCount>, e.g. <hello, (3 5 2 7)> => <hello, 17>
MapReduce Framework

[Diagram: the input "How now brown cow" is split across Map tasks (M), whose intermediate output flows to Reduce tasks (R), which produce the final output.]
map(String key, String value)
    // key: document name
    // value: document contents
    for each word w in value
        EmitIntermediate(w, "1");

reduce(String key, Iterator values)
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values
        result += ParseInt(v);
    Emit(AsString(result));
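The word-count pseudocode can be rendered sequentially in Python. A real MapReduce runtime would execute map_fn and reduce_fn on many machines and perform the grouping step itself; the names here are illustrative:

```python
from collections import defaultdict

def map_fn(filename, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield (word, "1")

def reduce_fn(word, values):
    # key: a word, values: a list of counts (as strings)
    return (word, sum(int(v) for v in values))

def run(documents):
    # Group intermediate values by key (the framework's shuffle phase).
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            intermediate[word].append(count)
    return dict(reduce_fn(w, vs) for w, vs in intermediate.items())

print(run({"doc1": "how now brown cow", "doc2": "how now"}))
# {'how': 2, 'now': 2, 'brown': 1, 'cow': 1}
```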
MapReduce Examples
Distributed grep
Map function emits <word, line_number> if the word matches the search criteria
Reduce function is the identity function
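A tiny sequential sketch of distributed grep as described above: the map function emits <word, line_number> when a word matches the search criteria (here, simple substring matching), and the reduce function is the identity. All names are illustrative:

```python
def grep_map(line_number, line, pattern):
    # Emit <word, line_number> for each word matching the pattern.
    for word in line.split():
        if pattern in word:
            yield (word, line_number)

def grep_reduce(key, values):
    return (key, values)  # identity: intermediate data is the answer

lines = ["error disk full", "all ok", "fatal error"]
matches = [kv for i, ln in enumerate(lines) for kv in grep_map(i, ln, "error")]
print(matches)  # [('error', 0), ('error', 2)]
```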
More formally:
Map(k1, v1) --> list(k2, v2)
Reduce(k2, list(v2)) --> list(v2)
MapReduce Benefits
Practical
User to-do list, indicate:
Input/output files
M: number of map tasks
R: number of reduce tasks
W: number of machines
The user program, via the MapReduce library, shards the input data.

Data Distribution
Intermediate files created from map tasks are written to local disk
Output files are written to the distributed file system
The user program creates process copies distributed across a machine cluster. One copy will be the Master and the others will be workers.
Assigning Tasks

[Diagram: the Master sends a Do_map_task message to an idle worker.]
Many copies of the user program are started
One instance becomes the Master
The Master finds idle machines and assigns them tasks
It tries to exploit data locality by running map tasks on machines that already hold the data
Each map-task worker reads assigned input shard and outputs intermediate key/value pairs.
Output buffered in RAM.
[Diagram: a map worker reads Shard 0 and produces key/value pairs in memory.]
Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process.
[Diagram: the map worker writes the partitioned data to local storage and sends the disk locations to the Master.]
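The flush step above partitions intermediate pairs into R regions, one per reduce task. A minimal sketch, assuming the hash(key) mod R partitioning scheme described in the MapReduce paper (crc32 is used here just as a stable hash every worker would agree on):

```python
from collections import defaultdict
from zlib import crc32

R = 4  # number of reduce tasks

def partition(key, r=R):
    # Stable hash so every map worker routes a key to the same region.
    return crc32(key.encode()) % r

# Bucket each intermediate pair into its region before flushing to disk.
regions = defaultdict(list)
for key, value in [("how", "1"), ("now", "1"), ("brown", "1")]:
    regions[partition(key)].append((key, value))
print(dict(regions))
```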
Master process gives disk locations to an available reduce-task worker who reads all associated intermediate data.
[Diagram: the Master passes the disk locations to a reduce worker, which reads the intermediate data from the map workers' remote storage.]
Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
Master process wakes up user process when all tasks have completed. Output contained in R output files.
Observations
No reduce can begin until map is complete
Tasks are scheduled based on the location of data
If a map worker fails any time before reduce finishes, its task must be completely rerun
The Master must communicate the locations of intermediate files
The MapReduce library does most of the hard work for us!
[Diagram: map tasks read from data stores 1..n and emit (key, values...) pairs; a barrier then aggregates the intermediate values by output key, and one reduce task processes each key's intermediate values.]
Fault Tolerance
Input file blocks are stored on multiple machines
When the computation is almost done, the Master reschedules in-progress tasks, which avoids stragglers
Conclusions
Simplifies large-scale computations that fit this model
Allows the user to focus on the problem without worrying about distributed-systems details
The underlying computer architecture is largely hidden from the user
Portable model
MapReduce Applications
A relational join can be executed in parallel using MapReduce, e.g. given a sales table and a city table, compute the gross sales by city
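The join can be sketched sequentially in the same map/group/reduce style. The table layouts and all names here (sales, cities, city_id) are invented purely for illustration:

```python
from collections import defaultdict

sales = [("c1", 100.0), ("c1", 50.0), ("c2", 70.0)]  # (city_id, amount)
cities = [("c1", "Pune"), ("c2", "Mumbai")]          # (city_id, name)

def map_phase():
    # Tag each record with its source table, keyed by city_id.
    for city_id, amount in sales:
        yield (city_id, ("sale", amount))
    for city_id, name in cities:
        yield (city_id, ("city", name))

def reduce_phase(city_id, values):
    # Join: pick the city name, sum the sales sharing its city_id.
    name = next(v for tag, v in values if tag == "city")
    total = sum(v for tag, v in values if tag == "sale")
    return (name, total)

grouped = defaultdict(list)
for k, v in map_phase():
    grouped[k].append(v)
print([reduce_phase(k, vs) for k, vs in grouped.items()])
# [('Pune', 150.0), ('Mumbai', 70.0)]
```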
Enterprise context: interest in leveraging the MapReduce model for high-throughput batch processing and analysis of data
References
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"
Josh Carter, http://multipartmixed.com/software/mapreduce_presentation.pdf
Ralf Lämmel, "Google's MapReduce Programming Model Revisited"
http://code.google.com/edu/parallel/mapreduce-tutorial.html