L06 Map Reduce

The document discusses the MapReduce architecture, which is designed for processing large-scale data across clusters of commodity hardware. It outlines the components of the architecture, including the distributed file system for stable storage, the Map and Reduce functions for data processing, and the fault tolerance mechanisms in place. Additionally, it highlights the practical applications of MapReduce and introduces Hadoop as an open-source implementation of this model.

Uploaded by

ANUJ

Map Reduce Architecture

Adapted from lectures by Anand Rajaraman (Stanford Univ.) and Dan Weld (Univ. of Washington)

Prasad L06MapReduce 1
Single-node architecture
[Figure: a single node with CPU, memory, and disk. Labels: Machine Learning, Statistics; “Classical” Data Mining.]
Commodity Clusters
• Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server (why?)
• Standard architecture emerging:
  • Cluster of commodity Linux nodes
  • Gigabit Ethernet interconnect
• How to organize computations on this architecture?
  • Mask issues such as hardware failure
  • General-purpose programming model
Cluster Architecture
[Figure: racks of nodes, each node with CPU, memory, and disk, joined by one switch per rack and a backbone switch between racks.]
• 1 Gbps between any pair of nodes in a rack
• 2-10 Gbps backbone between racks
• Each rack contains 16-64 nodes
Stable storage
• First-order problem: if nodes can fail, how can we store data persistently?
• Answer: Distributed File System
  • Provides global file namespace
  • Google GFS; Hadoop HDFS; Kosmix KFS
  • File replication (e.g., 3 copies)
• Typical usage pattern
  • Huge files (100s of GB to TB)
  • Data is rarely updated in place
  • Reads and appends are common
Distributed File System
• Chunk servers
  • File is split into contiguous chunks
  • Typically each chunk is 16-64 MB
  • Each chunk is replicated (usually 2x or 3x)
  • Try to keep replicas in different racks
• Master node
  • a.k.a. NameNode in HDFS
  • Stores metadata
  • Might be replicated
• Client library for file access
  • Talks to master to find chunk servers
  • Connects directly to chunk servers to access data
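The chunking and replica placement described above can be sketched in a few lines of Python. The 64 MB chunk size and 3 replicas come from the typical values on the slide; the function names, the `servers_by_rack` layout, and the round-robin rack choice are illustrative assumptions, not the actual GFS or HDFS policy.

```python
from itertools import cycle

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, upper end of the typical range above


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) for each contiguous chunk of a file."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, servers_by_rack, replicas=3):
    """Assign each chunk to `replicas` servers, each in a different rack.

    Hypothetical policy: rotate through the racks round-robin. A real
    system would also weigh load and free space.
    """
    racks = sorted(servers_by_rack)
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas")
    rack_cycle = cycle(racks)
    placement = {}
    for chunk_id in range(num_chunks):
        # Consecutive picks from the cycle are distinct racks.
        chosen = [next(rack_cycle) for _ in range(replicas)]
        placement[chunk_id] = [
            (rack, servers_by_rack[rack][chunk_id % len(servers_by_rack[rack])])
            for rack in chosen
        ]
    return placement
```

A client would then ask the master for `placement[chunk_id]` and read the chunk directly from one of the listed servers.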
Motivation for MapReduce (why)
• Large-scale data processing
  • Want to use 1000s of CPUs
  • But don’t want the hassle of managing things
• MapReduce architecture provides
  • Automatic parallelization & distribution
  • Fault tolerance
  • I/O scheduling
  • Monitoring & status updates
  • Redundant computations (e.g., 3 processes)
What is Map/Reduce
• Map/Reduce
  • Programming model from LISP (and other functional languages)
• Many problems can be phrased this way
• Easy to distribute across nodes
• Nice retry/failure semantics
Map in LISP (Scheme)
• (map f list [list2 list3 …])
  • f: unary operator
• (map square '(1 2 3 4))
  • (1 4 9 16)
Reduce in LISP (Scheme)
• (reduce f id list)
  • f: binary operator
• (reduce + 0 '(1 4 9 16))
  • (+ 16 (+ 9 (+ 4 (+ 1 0))))
  • 30
• (reduce + 0 (map square (map - l1 l2)))
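For readers more at home in Python, the same two primitives exist as the built-in `map` and `functools.reduce`. This is a sketch for comparison only; `square` and `sum_squared_diff` are names chosen here. Note that `functools.reduce` folds from the left, whereas the expansion above nests from the right; for `+` the result is the same.

```python
from functools import reduce
from operator import add, sub


def square(x):
    return x * x


# (map square '(1 2 3 4))
squares = list(map(square, [1, 2, 3, 4]))

# (reduce + 0 '(1 4 9 16))
total = reduce(add, squares, 0)


def sum_squared_diff(l1, l2):
    """Python form of (reduce + 0 (map square (map - l1 l2)))."""
    return reduce(add, map(square, map(sub, l1, l2)), 0)
```

Here `squares` is `[1, 4, 9, 16]` and `total` is `30`, matching the Scheme results above.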
Warm up: Word Count
• We have a large file of words, one word to a line
• Count the number of times each distinct word appears in the file
• Sample application: analyze web server logs to find popular URLs
Word Count (2)
• Case 1: Entire file fits in memory
• Case 2: File too large for memory, but all <word, count> pairs fit in memory
• Case 3: File on disk, too many distinct words to fit in memory
  • sort datafile | uniq -c
Word Count (3)
• To make it slightly harder, suppose we have a large corpus of documents
• Count the number of times each distinct word occurs in the corpus
  • words(docs/*) | sort | uniq -c
  • where words takes a file and outputs the words in it, one to a line
• The above captures the essence of MapReduce
  • Great thing is it is naturally parallelizable
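The pipeline `words(docs/*) | sort | uniq -c` can be mirrored stage by stage in Python, which makes the later correspondence with map (words), shuffle (sort), and reduce (uniq -c) easier to see. This is a sketch under assumptions: the tokenizing regular expression and the lowercasing are choices made here, and the sort runs in memory, unlike the Unix `sort`, which can spill to disk.

```python
import re
from itertools import groupby


def words(text):
    """Yield the words of a text one at a time (lowercased by assumption)."""
    for match in re.finditer(r"[A-Za-z0-9']+", text):
        yield match.group().lower()


def sort_uniq_count(items):
    """Equivalent of `sort | uniq -c`: sort, then count each run of equal items."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(sorted(items))]


def word_count(docs):
    """Equivalent of `words(docs/*) | sort | uniq -c` over in-memory documents."""
    all_words = (w for doc in docs for w in words(doc))
    return sort_uniq_count(all_words)
```

Sorting brings equal words next to each other, so counting needs only one pass and constant memory per run, which is what makes Case 3 tractable.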
MapReduce
• Input: a set of key/value pairs
• User supplies two functions:
  • map(k, v) → list(k1, v1)
  • reduce(k1, list(v1)) → v2
• (k1, v1) is an intermediate key/value pair
• Output is the set of (k1, v2) pairs
Word Count using MapReduce
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
Word Count, Illustrated
map(key=url, val=contents):
    For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
    Sum all “1”s in values list
    Emit result “(word, sum)”

Input:          see bob run | see spot throw
Map output:     (see, 1) (bob, 1) (run, 1) | (see, 1) (spot, 1) (throw, 1)
Reduce output:  (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
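The Word Count pseudocode above can be made runnable with a small single-process stand-in for the framework. This is a minimal sketch, not how Google's library or Hadoop is structured: `run_mapreduce` is a hypothetical helper that maps, groups intermediate values by key, and reduces, all in one process, and the map and reduce functions use `yield` in place of `emit`.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k1, v1 in map_fn(key, value):      # map(k, v) -> list(k1, v1)
            groups[k1].append(v1)              # shuffle: group values by key
    output = []
    for k1 in sorted(groups):
        for v2 in reduce_fn(k1, groups[k1]):   # reduce(k1, list(v1)) -> v2
            output.append((k1, v2))
    return output


def wc_map(doc_name, text):
    # key: document name; value: text of document
    for word in text.split():
        yield (word, 1)


def wc_reduce(word, counts):
    # key: a word; values: an iterable of counts
    yield sum(counts)


docs = [("doc1", "see bob run"), ("doc2", "see spot throw")]
counts = run_mapreduce(docs, wc_map, wc_reduce)
```

With the two documents from the illustration, `counts` is `[("bob", 1), ("run", 1), ("see", 2), ("spot", 1), ("throw", 1)]`.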
Model is Widely Applicable
[Figure: MapReduce programs in Google source tree]

Example uses:
• distributed grep
• distributed sort
• web link-graph reversal
• term-vector / host
• web access log stats
• inverted index construction
• document clustering
• machine learning
• statistical machine translation
Implementation Overview
Typical cluster:
• 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
• Limited bisection bandwidth
• Storage is on local IDE disks
• GFS: distributed file system manages data (SOSP'03)
• Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines

Implementation is a C++ library linked into user programs
Distributed Execution Overview
[Figure: the user program forks a master and several workers. The master assigns map and reduce tasks. Map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate data to local disk; reduce workers remote-read and sort that data, then write Output File 0 and Output File 1.]
Data flow
• Input and final output are stored on a distributed file system
  • Scheduler tries to schedule map tasks “close” to physical storage location of input data
• Intermediate results are stored on local FS of map and reduce workers
• Output is often input to another MapReduce task
Coordination
• Master data structures
  • Task status: (idle, in-progress, completed)
  • Idle tasks get scheduled as workers become available
  • When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
  • Master pushes this info to reducers
• Master pings workers periodically to detect failures
Failures
• Map worker failure
  • Map tasks completed or in-progress at worker are reset to idle
  • Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
  • Only in-progress tasks are reset to idle
• Master failure
  • MapReduce task is aborted and client is notified
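The task states and failure rules above amount to a small state machine in the master. The sketch below is a toy model with hypothetical names (`Task`, `Master`, `worker_failed`); it leaves out pings, timeouts, and the notification of reduce workers, and only shows which tasks go back to idle. Completed map tasks are reset because their output sits on the failed worker's local disk, while completed reduce output is already in the distributed file system.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None


class Master:
    """Toy model of the master's task table (all names are hypothetical)."""

    def __init__(self, num_map, num_reduce):
        self.tasks = ([Task("map") for _ in range(num_map)]
                      + [Task("reduce") for _ in range(num_reduce)])

    def assign(self, worker):
        """Give the next idle task to a worker that has become available."""
        for task in self.tasks:
            if task.state == IDLE:
                task.state, task.worker = IN_PROGRESS, worker
                return task
        return None

    def complete(self, task):
        task.state = COMPLETED

    def worker_failed(self, worker):
        """Reset the failed worker's tasks, following the rules above."""
        for task in self.tasks:
            if task.worker != worker:
                continue
            # In-progress work is always lost; completed work is lost only
            # for map tasks, whose output was on the worker's local disk.
            lost = task.state == IN_PROGRESS or (
                task.kind == "map" and task.state == COMPLETED)
            if lost:
                task.state, task.worker = IDLE, None
```

Once reset to idle, a task is picked up by the next call to `assign`, which is the rescheduling step described in the slide.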
[Figure: Execution]
[Figure: Parallel Execution]
How many Map and Reduce tasks?
• M map tasks, R reduce tasks
• Rule of thumb:
  • Make M and R much larger than the number of nodes in cluster
  • One DFS chunk per map is common
  • Improves dynamic load balancing and speeds recovery from worker failure
• Usually R is smaller than M, because output is spread across R files
Combiners
• Often a map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k
  • E.g., popular words in Word Count
• Can save network time by pre-aggregating at mapper
  • combine(k1, list(v1)) → v2
  • Usually same as reduce function
• Works only if reduce function is commutative and associative
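A small experiment shows what the combiner buys for Word Count. The sketch below uses hypothetical helper names and two made-up input splits; it compares the number of pairs that would cross the network with and without pre-aggregation and checks that the final counts agree.

```python
from collections import Counter


def map_words(text):
    """Map side of Word Count: one (word, 1) pair per word."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Pre-aggregate one mapper's output: sum the values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_counts(pairs):
    """Reduce side: sum all values per key, across every mapper."""
    return combine(pairs)  # for Word Count the combiner is the reduce function


splits = ["the cat and the hat", "the end of the line"]
raw = [pair for text in splits for pair in map_words(text)]
combined = [pair for text in splits for pair in combine(map_words(text))]
```

Here `raw` holds 10 pairs and `combined` holds 8, yet `reduce_counts` gives the same totals for both. Summation is commutative and associative, so partial sums can be merged in any grouping; a reduce that computes a mean, by contrast, could not be reused as a combiner without also carrying the counts.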
Partition Function
• Inputs to map tasks are created by contiguous splits of input file
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker
• System uses a default partition function, e.g., hash(key) mod R
• Sometimes useful to override
  • E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
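Both partition functions are easy to write down. The sketch below assumes CRC32 as the hash, chosen here only because Python's built-in `hash()` for strings changes between processes, which would send the same key to different reducers on different machines; the function names are illustrative.

```python
import zlib
from urllib.parse import urlsplit


def stable_hash(text):
    """Deterministic hash that is the same in every process."""
    return zlib.crc32(text.encode("utf-8"))


def default_partition(key, num_reducers):
    """hash(key) mod R"""
    return stable_hash(key) % num_reducers


def host_partition(url, num_reducers):
    """hash(hostname(URL)) mod R: all URLs from one host share a reducer."""
    host = urlsplit(url).hostname or ""
    return stable_hash(host) % num_reducers
```

With `host_partition`, `http://example.com/a` and `http://example.com/b/c?x=1` always land in the same partition, whereas `default_partition` on the full URL may separate them.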
Execution Summary
• How is this distributed?
  1. Partition input key/value pairs into chunks, run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Now partition space of output map keys, and run reduce() in parallel
• If map() or reduce() fails, re-execute!
[Figure: Logical flow of a MapReduce programming model.]
Exercise 1: Host size
• Suppose we have a large web corpus
• Let’s look at the metadata file
  • Lines of the form (URL, size, date, …)
• For each host, find the total number of bytes
  • i.e., the sum of the page sizes for all URLs from that host
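One possible solution to the exercise, as a sketch: the map emits (host, size) for every metadata line and the reduce sums the sizes per host. The comma-separated line format is an assumption (the slide only says the lines contain URL, size, date, and more), and `total_bytes_per_host` is a hypothetical single-process driver standing in for the framework.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_size_map(line):
    """Emit (host, size) for one metadata line; assumes 'URL,size,date,...'."""
    url, size = line.split(",")[:2]
    yield (urlsplit(url.strip()).hostname, int(size))


def host_size_reduce(host, sizes):
    """Total bytes for one host."""
    return sum(sizes)


def total_bytes_per_host(lines):
    """Group map output by host, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for host, size in host_size_map(line):
            groups[host].append(size)
    return {host: host_size_reduce(host, sizes) for host, sizes in groups.items()}
```

Because the reduce is a plain sum, the same function could also serve as a combiner on each map worker.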
Exercise 2: Distributed Grep
• Find all occurrences of the given pattern in a very large set of files
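A sketch of one way to phrase grep as MapReduce: the map emits every line that matches the pattern, and the reduce is the identity, simply copying matches to the output. The key layout `(filename, line_number)` and the in-memory `files` dictionary are assumptions made for illustration.

```python
import re


def grep_map(filename, text, pattern):
    """Emit ((filename, line_number), line) for each line matching the pattern."""
    regex = re.compile(pattern)
    for number, line in enumerate(text.splitlines(), start=1):
        if regex.search(line):
            yield ((filename, number), line)


def grep_reduce(key, lines):
    """Identity reduce: pass matching lines through unchanged."""
    yield from lines


def distributed_grep(files, pattern):
    """Run the map over every file; on a cluster each chunk is one map task."""
    matches = []
    for filename, text in sorted(files.items()):
        for key, line in grep_map(filename, text, pattern):
            matches.extend((key, out) for out in grep_reduce(key, [line]))
    return matches
```

All of the work happens in the map phase, so this job parallelizes across input chunks with almost no shuffle traffic.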
Exercise 3: Graph reversal
• Given a directed graph as an adjacency list:
    src1: dest11, dest12, …
    src2: dest21, dest22, …
• Construct the graph in which all the links are reversed
Reverse Web-Link Graph
• Map
  • For each URL linking to target, …
  • Output <target, source> pairs
• Reduce
  • Concatenate list of all source URLs
  • Outputs: <target, list(source)> pairs
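The map and reduce steps above translate almost directly into Python. In this sketch the graph is an in-memory dictionary, `reverse_graph` is a hypothetical single-process driver, and sorting the source list is a choice made here so the output is deterministic.

```python
from collections import defaultdict


def reverse_map(source, targets):
    """For each link source -> target, emit (target, source)."""
    for target in targets:
        yield (target, source)


def reverse_reduce(target, sources):
    """Collect every source that links to target."""
    return sorted(sources)


def reverse_graph(adjacency):
    """Group the emitted pairs by target, then reduce each group."""
    groups = defaultdict(list)
    for source, targets in adjacency.items():
        for target, src in reverse_map(source, targets):
            groups[target].append(src)
    return {t: reverse_reduce(t, s) for t, s in groups.items()}
```

For example, `reverse_graph({"a": ["b", "c"], "b": ["c"], "c": ["a"]})` gives `{"b": ["a"], "c": ["a", "b"], "a": ["c"]}`.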
Hadoop
• An open-source implementation of MapReduce in Java
• Uses HDFS for stable storage
• Download from: http://lucene.apache.org/hadoop/
Reading
• Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
  http://labs.google.com/papers/mapreduce.html
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System
  http://labs.google.com/papers/gfs.html
Conclusions
• MapReduce has proven to be a useful abstraction
• Greatly simplifies large-scale computations
• Fun to use:
  • focus on problem,
  • let library deal w/ messy details
