Take A Close Look At: MapReduce
MapReduce
Xuanhua Shi
Acknowledgement
Most of the slides are from Dr. Bing Chen,
http://grid.hust.edu.cn/chengbin/
Some slides are from Shadi Ibrahim,
http://grid.hust.edu.cn/shadi/
What is MapReduce?
Originated at Google [OSDI'04]
A simple programming model
A functional model
For large-scale data processing
Exploits large sets of commodity computers
Executes processing in a distributed manner
Offers high availability
Motivation
Lots of demand for very large-scale data processing
Certain common themes run through these demands
Lots of machines needed (scaling)
Two basic operations on the input:
Map
Reduce
Distributed Grep
[Diagram: very big data is split; grep runs on each split in parallel; the per-split matches are concatenated (cat) into all matches]
Distributed Word Count
[Diagram: very big data is split; count runs on each split in parallel; the partial counts are merged into the final merged count]
Map+Reduce
Map:
Accepts an input key/value pair
Emits intermediate key/value pairs
Reduce:
Accepts an intermediate key and the list of its values
Emits output key/value pairs
(a minimal sketch of these signatures follows the diagram below)
[Diagram: very big data → Map → partitioning function → Reduce → result]
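To make the shapes of the two functions concrete, here is a minimal Java sketch; the generic interfaces and names below are illustrative assumptions, not the Google or Hadoop API.

// Illustrative sketch only: names and generics are assumptions, not a real library API.
interface Emitter<K, V> {
    void emit(K key, V value);  // collects one key/value pair
}
interface Mapper<K1, V1, K2, V2> {
    // Accepts one input key/value pair and emits intermediate key/value pairs.
    void map(K1 key, V1 value, Emitter<K2, V2> out);
}
interface Reducer<K2, V2, K3, V3> {
    // Accepts an intermediate key with all of its values and emits output pairs.
    void reduce(K2 key, Iterable<V2> values, Emitter<K3, V3> out);
}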
The design and how it works
Architecture overview
[Diagram: the user submits a job to the job tracker on the master node; task trackers on slave nodes 1..N run the workers]
GFS: the underlying storage system
Goal
A global view
Make huge files available in the face of node failures
Master node (meta server)
Centralized; indexes all chunks on the data servers
Chunk servers (data servers)
Files are split into contiguous chunks, typically 16-64 MB
Each chunk is replicated (usually 2x or 3x)
Replicas are kept in different racks where possible
GFS architecture
[Diagram: a client contacts the GFS master for metadata; chunkservers 1..N store the chunks (e.g., C0, C1, C2, C3, C5), each chunk replicated on several chunkservers]
Functions in the Model
Map
Process a key/value pair to generate
intermediate key/value pairs
Reduce
Merge all intermediate values associated with
the same key
Partition
By default: hash(key) mod R (see the sketch below)
Gives a well-balanced distribution of keys across the R reduce tasks
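A minimal Java sketch of this default rule; the class and method names are illustrative assumptions, not a specific framework's API.

// Illustrative sketch of the default partitioning rule: hash(key) mod R.
public class DefaultPartitioner {
    // Returns the index of the reduce task (0 .. numReduceTasks-1) for a key.
    public static int partition(String key, int numReduceTasks) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}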
A Simple Example
Counting words in a large set of documents
map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value
    EmitIntermediate(w, "1");

reduce(String key, Iterator values)
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values
    result += ParseInt(v);
  Emit(AsString(result));
How does it work?
Locality issue
Master scheduling policy:
Asks GFS for the locations of the replicas of the input file blocks
Input is typically split into 64 MB pieces (== the GFS block size)
Map tasks are scheduled so that a replica of their input block is on the same machine or the same rack
Effect
Thousands of machines read their input at local-disk speed
Without this, the rack switches would limit the read rate
Fault Tolerance
Reactive way
Worker failure
Heartbeat: workers are periodically pinged by the master
No response = failed worker
If a worker fails, its tasks are reassigned to another worker
Master failure
The master writes periodic checkpoints
Another master can be started from the last checkpointed state
If the master eventually dies, the job is aborted
Fault Tolerance
Proactive way (redundant execution)
The problem of stragglers (slow workers):
Other jobs consuming resources on the machine
Bad disks with soft errors transfer data very slowly
Weird things: processor caches disabled (!!)
When the computation is almost done, backup executions of the in-progress tasks are scheduled
Whenever either the primary or the backup execution finishes, the task is marked as completed
Fault Tolerance
Input errors: bad records
Map/Reduce functions sometimes fail for particular inputs
The best solution is to debug & fix, but that is not always possible
On a segmentation fault:
Send a UDP packet to the master from the signal handler
Include the sequence number of the record being processed
Skip bad records:
If the master sees two failures for the same record, the next worker is told to skip that record
Status monitor
Refinements
Task granularity: fine-grained tasks minimize the time for fault recovery and help load balancing
Local execution for debugging/testing
Compression of intermediate data
Points that need to be emphasized
No reduce can begin until the map phase is complete
The master must communicate the locations of the intermediate files
Tasks are scheduled based on the location of the data
If a map worker fails at any time before the reduce finishes, its tasks must be completely rerun
The MapReduce library does most of the hard work for us!
Model is Widely Applicable
MapReduce Programs In Google Source Tree
distributed grep, distributed sort, web link-graph reversal,
term-vector per host, web access log stats, inverted index construction,
document clustering, machine learning, statistical machine translation, ...
Examples follow
How to use it
The user's to-do list:
Indicate:
Input/output files
M: number of map tasks
R: number of reduce tasks
W: number of machines
Write the map and reduce functions
Submit the job
Detailed Example: Word Count (1)
Map
Detailed Example: Word Count (2)
Reduce
Detailed Example: Word Count (3)
Main
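The code from these three slides is not preserved in this text; below is a minimal sketch of the classic Hadoop WordCount (Map, Reduce, and Main/driver), assuming the standard org.apache.hadoop.mapreduce API. Treat it as an illustration rather than the exact code from the slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts emitted for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Main: configure and submit the job; input/output paths come from the command line.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}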
Applications
String Match, such as Grep
Reverse (inverted) index
Count URL access frequency
Lots of examples in data mining
MapReduce Implementations
Cluster: Google MapReduce, Apache Hadoop
Multicore CPU: Phoenix @ Stanford
GPU: Mars @ HKUST
Hadoop
Open source
A Java-based implementation of MapReduce
Uses HDFS as the underlying file system
Hadoop
Google      |  Yahoo
MapReduce   |  Hadoop
GFS         |  HDFS
Bigtable    |  HBase
Chubby      |  (nothing yet, but planned)
Recent news about Hadoop
Apache Hadoop wins the Terabyte Sort Benchmark
The sort used 1800 maps and 1800 reduces, and allocated enough buffer memory to hold the intermediate data in memory.
Phoenix
Best Paper at HPCA'07
MapReduce for multiprocessor systems
A shared-memory implementation of MapReduce
SMP, multi-core
Features
Uses threads instead of cluster nodes for parallelism (illustrated in the sketch below)
Communicates through shared memory instead of network messages
Dynamic scheduling, locality management, fault recovery
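Phoenix itself is a C/C++ runtime; as a rough illustration of the idea (map tasks on threads, results merged in shared memory), here is a small Java sketch. The names and structure are assumptions for illustration, not the Phoenix API.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedMemoryWordCount {
  public static Map<String, Integer> run(List<String> splits, int numThreads)
      throws InterruptedException {
    // A shared in-memory map takes the place of intermediate files on disk.
    ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    for (String split : splits) {
      // Each "map task" runs on a worker thread instead of a cluster node.
      pool.submit(() -> {
        for (String word : split.split("\\s+")) {
          if (!word.isEmpty()) {
            counts.merge(word, 1, Integer::sum);  // reduce-like merge in shared memory
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
    return counts;
  }

  public static void main(String[] args) throws InterruptedException {
    List<String> splits = List.of("to be or not to be", "that is the question");
    System.out.println(run(splits, 2));
  }
}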
Workflow
The Phoenix API
System-defined functions
User-defined functions
Mars: MapReduce on GPU
PACT'08
GeForce 8800 GTX, PS3, Xbox 360
Implementation of Mars
[Software stack, bottom to top: NVIDIA GPU (GeForce 8800 GTX) and CPU (Intel P4, four cores, 2.4 GHz) → operating system (Windows or Linux) → CUDA and system calls → MapReduce (Mars) → user applications]
Implementation of Mars
Discussion
We have MPI and PVM; why do we need MapReduce?

               MPI, PVM                               MapReduce
Objective      General distributed programming model  Large-scale data processing
Availability   Weaker, harder                         Better
Data locality  MPI-IO                                 GFS
Usability      Difficult to learn                     Easier
Conclusions
Provides a general-purpose model that simplifies large-scale computation
Allows users to focus on the problem without worrying about the details
References
The original paper: http://labs.google.com/papers/mapreduce.html
MapReduce on Wikipedia: http://en.wikipedia.org/wiki/MapReduce
Hadoop MapReduce in Java: http://lucene.apache.org/hadoop/
http://code.google.com/edu/parallel/mapreduce-tutorial.html