0% found this document useful (0 votes)

31 views

Map Reduce

MapReduce and Hadoop provide a framework for processing vast amounts of data in parallel across large clusters of computers. The MapReduce paradigm involves splitting input data into chunks, mapping those chunks to generate intermediate key-value pairs, shuffling and sorting the pairs by key, and then reducing the values for each key. Hadoop is an open source implementation of MapReduce that uses HDFS for storage and a master-slave architecture to distribute processing across nodes.

Uploaded by

Informatique Info

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

31 views

Map Reduce

Uploaded by

Informatique Info

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 26

MapReduce and Hadoop

Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata

November 10, 2014

Let’s keep the intro short
 Modern data mining: process immense amount of data quickly
 Exploit parallelism

Traditional parallelism

Bring data to compute

MapReduce

Bring compute to data

Pictures courtesy: Glenn K. Lockwood, glennklockwood.com 2

The MapReduce paradigm
Split Map Shuffle and sort Reduce Final

Final
original output
Input Input <Key,Value>
chunks pairs
<Key,Value> Output May not
May be pairs grouped chunks need to
already by keys combine
split in
filesystem The user needs to write the map() and the reduce() 3
An example: word frequency counting
Split Map Shuffle and sort Reduce Final

the pairs (w,1) for the same words

reduce: count the number (n) of

subcollections of documnts

pairs for each w, make it (w,n)

collection of documnts

output: (w,n) for

map: for each word w,
output pairs (w,1)

are grouped together

each w
Final
original output
Input Input <Key,Value>
chunks pairs
<Key,Value> Output
Problem: Given a collection
of documents, count the
pairs grouped chunks
number of times each by keys
word occurs in the map: for each word w, reduce: count the number (n) of
collection emit pairs (w,1) pairs for each w, make it (w,n) 4
An example: word frequency counting
Split Map Shuffle and sort Reduce Final

(apple,1)
(apple,2)
(apple,1)
apple (apple,1)
orange (orange,1) (orange,1)
apple orange peach (peach,1) (orange,1) (orange,3)
peach (orange,1)
orange (orange,1)
plum (plum,1) (guava,1) (guava,1) (apple,2)
orange plum
(orange,3)
orange (orange,1) (guava,1)
orange apple (plum,1)
apple (apple,1) (plum,2) (plum,2)
guava (plum,1)
guava (guava,1) (cherry,2)
(fig,2)
cherry fig (cherry,1) (cherry,1)
cherry fig (cherry,2) (peach,3)
(fig,1) (cherry,1)
peach fig
peach (peach,1) (fig,1)
peach fig
(fig,1) (fig,1)
(fig,2) Final
original peach
(peach,1) (peach,1) output
Input Input <Key,Value> (peach,1) (peach,3)
(peach,1)
chunks pairs
<Key,Value> Output
Problem: Given a collection
of documents, count the
pairs grouped chunks
number of times each by keys
word occurs in the map: for each word w, reduce: count the number (n) of
collection output pairs (w,1) pairs for each w, make it (w,n) 5
Apache Hadoop
An open source MapReduce framework

HADOOP

6
Hadoop
 Two main components
– Hadoop Distributed File System (HDFS): to store data
– MapReduce engine: to process data
 Master – slave architecture using commodity servers

 The HDFS
– Master: Namenode
– Slave: Datanode
 MapReduce
– Master: JobTracker
– Slave: TaskTracker
7
HDFS: Blocks
Datanode 1
Block 1 Block 2
Block 1 Block 2 Block 3

Datanode 2
Block 1 Block 3
Big File Block 3 Block 4 Block 4

Datanode 3
Block 2 Block 6
Block 5 Block 6
Block 5

Datanode 4
 Runs on top of existing filesystem
 Blocks are 64MB (128MB recommended) Block 4 Block 6
 Single file can be > any single disk Block 5
 POSIX based permissions
 8
Fault tolerant
HDFS: Namenode and Datanode
 Namenode
– Only one per Hadoop Cluster
– Manages the filesystem namespace
– The filesystem tree
– An edit log
– For each block block i, the datanode(s) in which block i is saved
– All the blocks residing in each datanode
 Secondary Namenode
– Backup namenode
 Datanodes
– Many per Hadoop cluster
– Controls block operations
– Physically puts the block in the nodes
– Do the physical replication
9
HDFS: an example

10
MapReduce: JobTracker and TaskTracker

1. JobClient submits job to JobTracker; Binary copied into HDFS

2. JobTracker talks to Namenode
3. JobTracker creates execution plan
4. JobTracker submits work to TaskTrackers
5. TaskTrackers report progress via heartbeat
6. JobTracker updates status 11
Map, Shuffle and Reduce: internal steps

1. Splits data up to send it to the mapper

2. Transforms splits into key/value pairs
3. (Key-Value) with same key sent to the same reducer
4. Aggregates key/value pairs based on user-defined code
5. Determines how the result are saved

12
Fault Tolerance
 If the master fails
– MapReduce would fail, have to restart the entire job
 A map worker node fails
– Master detects (periodic ping would timeout)
– All the map tasks for this node have to be restarted
• Even if the map tasks were done, the output were at the node
 A reduce worker fails
– Master sets the status of its currently executing reduce
tasks to idle
– Reschedule these tasks on another reduce worker

13
Some algorithms using MapReduce

USING MAPREDUCE

14
Matrix – Vector Multiplication
 Multiply M = (mij) (an n × n matrix) and v = (vi) (an n-vector)
 If n = 1000, no need of MapReduce!
n

M v n

Case 1: Large n, M does not fit into main memory, but v does
 Since v fits into main memory, v is available to every map task
 Map: for each matrix element mij, emit key value pair (i, mijvj)
 Shuffle and sort: groups all mijvj values together for the same i
 Reduce: sum mijvj for all j for the same i
15
Matrix – Vector Multiplication
 Multiply M = (mij) (an n × n matrix) and v = (vi) (an n-vector)
 If n = 1000, no need of MapReduce!
This much will fit into main
memory

This whole chunk does not fit

in main memory anymore

Case 2: Very large n, even v does not fit into main memory
 For every map, many accesses to disk (for parts of v) required!
 Solution:
– How much of v will fit in?
– Partition v and rows of M so that each partition of v fits into memory
– Take dot product of one partition of v and the corresponding partition of M
– Map and reduce same as before
16
Relational Alegebra
 Relation R(A1, A3, …, An) is Attr1 Attr2 Attr3 Attr4
xyz abc 1 true
a relation with attributes Ai
abc xyz 1 true
 Schema: set of attributes xyz def 1 false
 Selection on condition C: bcd def 2 true
apply C on each tuple in R,
output only those which
satisfy C Links between URLs

 Projection on a subset S of URL1 URL2

attributes: output the url1 url2
components for the url2 url1
attributes in S url3 url5
 Union, Intersection, Join… url1 url3

17
Selection using MapReduce
 Trivial Links between URLs
 Map: For each tuple t in R, test if t URL1 URL2
satisfies C. If so, produce the key-value url1 url2
pair (t, t). url2 url1
 Reduce: The identity function. It simply url3 url5
passes each key-value pair to the output. url1 url3

18
Union using MapReduce
 Union of two relations R and S Links between URLs
 Suppose R and S have the same schema URL1 URL2
 Map tasks are generated from chunks of url1 url2
both R and S url2 url1
 Map: For each tuple t, produce the key- url3 url5
value pair (t, t) url1 url3
 Reduce: Only need to remove duplicates
– For all key t, there would be either one or
two values
– Output (t, t) in either case

19
Natural join using MapReduce
 Join R(A,B) with S(B,C) on attribute B R
 Map:
A B
– For each tuple t = (a,b) of R, emit key value pair
(b,(R,a)) x a
– For each tuple t = (b,c) of S, emit key value pair y b
(b,(S,c)) z c
 Reduce: w d
– Each key b would be associated with a list of
values that are of the form (R,a) or (S,c)
S
– Construct all pairs consisting of one with first
component R and the other with first component B C
S , say (R,a ) and (S,c ). The output from this key a 1
and value list is a sequence of key-value pairs c 3
– The key is irrelevant. Each value is one of the
d 4
triples (a, b, c ) such that (R,a ) and (S,c) are on
the input list of values g 7

20
Grouping and Aggregation using MapReduce
 Group and aggregate on a relation R

R(A,B) using aggregation function γ(B), A B

group by x 2
y 1
 Map:
z 4
– For each tuple t = (a,b) of R, emit key
z 1
value pair (a,b)
x 5
 Reduce:
– For all group {(a,b1), …, (a,bm)} select A, sum(B) from R
group by A;
represented by a key a, apply γ to obtain
b a = b1 + … + bm
A SUM(B)
– Output (a,ba)
x 7
y 1
z 5
21
Matrix multiplication using MapReduce
n l l

m A n = C
B m
(m × n) (m × l)
(n × l)

 Think of a matrix as a relation with three attributes

 For example matrix A is represented by the relation A(I, J, V)
– For every non-zero entry (i, j, aij), the row number is the value of I,
column number is the value of J, the entry is the value in V
– Also advantage: usually most large matrices would be sparse, the relation
would have less number of entries
 The product is ~ a natural join followed by a grouping with
aggregation
22
Matrix multiplication using MapReduce
n l l
A
m n
B = C
(m × n) m
(n × l) (m × l)
(i, j, aij)
(j, k, bjk)

 Natural join of (I,J,V) and (J,K,W)  tuples (i, j, k, aij, bjk)

 Map:
– For every (i, j, aij), emit key value pair (j, (A, i, aij))
– For every (j, k, bjk), emit key value pair (j, (B, k, bjk))
 Reduce:
for each key j
for each value (A, i, aij) and (B, k, bjk)
produce a key value pair ((i,k),(aijbjk))
23
Matrix multiplication using MapReduce
n l l
A
m n
B = C
(m × n) m
(n × l) (m × l)
(i, j, aij)
(j, k, bjk)

 First MapReduce process has produced key value pairs ((i,k),

(aijbjk))
 Another MapReduce process to group and aggregate
 Map: identity, just emit the key value pair ((i,k),(aijbjk))
 Reduce:
for each key (i,k)
produce the sum of the all the values for the key:
24
Matrix multiplication using MapReduce: Method 2
n l l
A
m n
B = C
(m × n) m
(n × l) (m × l)
(i, j, aij)
(j, k, bjk)

 A method with one MapReduce step

 Map:
– For every (i, j, aij), emit for all k = 1,…, l, the key value ((i,k), (A, j, aij))
– For every (j, k, bjk), emit for all i = 1,…, m, the key value ((i,k), (B, j, bjk))
 Reduce: May not fit in
for each key (i,k) main memory.
sort values (A, j, aij) and (B, j, bjk) by j to group them by j Expensive
for each j multiply aij and bjk external sort!
sum the products for the key (i,k) to produce
25
References and acknowledgements
 Mining of Massive Datasets, by Leskovec, Rajaraman and
Ullman, Chapter 2
 Slides by Dwaipayan Roy

BCOR 2020 Exam 1 Study Guide 1 1
No ratings yet
BCOR 2020 Exam 1 Study Guide 1 1
5 pages
Information Systems Analysis and Design: Designing Databases, Forms, Reports, Interfaces, Dialogues
No ratings yet
Information Systems Analysis and Design: Designing Databases, Forms, Reports, Interfaces, Dialogues
69 pages
Agri-Informatics - Notes by G Vanitha
No ratings yet
Agri-Informatics - Notes by G Vanitha
4 pages
Introduction To MapReduce
No ratings yet
Introduction To MapReduce
43 pages
Module2 D MapReduceParadigm
No ratings yet
Module2 D MapReduceParadigm
84 pages
Computational Tools DTU Presentation Week3
No ratings yet
Computational Tools DTU Presentation Week3
33 pages
MR Databases
No ratings yet
MR Databases
52 pages
09b - MapReduce
No ratings yet
09b - MapReduce
44 pages
Module2 C MapReduceParadigm
No ratings yet
Module2 C MapReduceParadigm
74 pages
Map Reduce PDF
No ratings yet
Map Reduce PDF
29 pages
L4
No ratings yet
L4
65 pages
Mapreduce Model Principles
No ratings yet
Mapreduce Model Principles
65 pages
Big Data Computing
No ratings yet
Big Data Computing
36 pages
Lecture 3 - MapReduce
No ratings yet
Lecture 3 - MapReduce
9 pages
5 RK_MapReduce_v3
No ratings yet
5 RK_MapReduce_v3
30 pages
Lecture 4: Mapreduce and Hadoop: Indranil Gupta (Indy)
No ratings yet
Lecture 4: Mapreduce and Hadoop: Indranil Gupta (Indy)
37 pages
Introduction To MapReduce
No ratings yet
Introduction To MapReduce
9 pages
Introduction To Map Reduce
No ratings yet
Introduction To Map Reduce
50 pages
Module 1 Algorithm For Massive Datasets
No ratings yet
Module 1 Algorithm For Massive Datasets
59 pages
Big Data Notes (All Lectures)
No ratings yet
Big Data Notes (All Lectures)
44 pages
Yum Yum D Giga
No ratings yet
Yum Yum D Giga
368 pages
Map Reduce
No ratings yet
Map Reduce
28 pages
Chapter 9 - Processing Big Data With Mapreduce
No ratings yet
Chapter 9 - Processing Big Data With Mapreduce
157 pages
Bda - Unit I - Lecture 6, 7
No ratings yet
Bda - Unit I - Lecture 6, 7
48 pages
Map-Reduce and The New Software Stack: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
No ratings yet
Map-Reduce and The New Software Stack: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
49 pages
Join Algorithms Using Mapreduce: A Survey: Vikas Jadhav, Jagannath Aghav, Sunil Dorwani
No ratings yet
Join Algorithms Using Mapreduce: A Survey: Vikas Jadhav, Jagannath Aghav, Sunil Dorwani
5 pages
Map-Reduce and The New Software Stack: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
No ratings yet
Map-Reduce and The New Software Stack: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
48 pages
Parlab Parallel Boot Camp: Cloud Computing With Mapreduce and Hadoop
No ratings yet
Parlab Parallel Boot Camp: Cloud Computing With Mapreduce and Hadoop
55 pages
BDA RepeatedImp Questions
No ratings yet
BDA RepeatedImp Questions
30 pages
BDA Module 3
No ratings yet
BDA Module 3
66 pages
Lecture 2.1
No ratings yet
Lecture 2.1
13 pages
Ch02a Mapreduce
No ratings yet
Ch02a Mapreduce
53 pages
Big Data Analytics Module 3: Mapreduce Paradigm: Faculty Name: Ms. Varsha Sanap Dr. Vivek Singh
No ratings yet
Big Data Analytics Module 3: Mapreduce Paradigm: Faculty Name: Ms. Varsha Sanap Dr. Vivek Singh
36 pages
TM2 ch02 Mapreduce
No ratings yet
TM2 ch02 Mapreduce
51 pages
MapReduce and The New Software Stack
No ratings yet
MapReduce and The New Software Stack
33 pages
Map Reduce Notes and Learning
No ratings yet
Map Reduce Notes and Learning
48 pages
Chapter Five Hadoop Mapreduce & HDFS
No ratings yet
Chapter Five Hadoop Mapreduce & HDFS
44 pages
3a - MapReduce Data Flow Scheduling Combiner Partitioner PDF
No ratings yet
3a - MapReduce Data Flow Scheduling Combiner Partitioner PDF
22 pages
Jeffrey D. Ullman: Stanford University
No ratings yet
Jeffrey D. Ullman: Stanford University
52 pages
BDA 2 (1)
No ratings yet
BDA 2 (1)
35 pages
Chapter 4 - Understanding Map Reduce Fundamentals
No ratings yet
Chapter 4 - Understanding Map Reduce Fundamentals
45 pages
Map Reduce
No ratings yet
Map Reduce
44 pages
Lecture 10 MapReduce Hadoop
No ratings yet
Lecture 10 MapReduce Hadoop
37 pages
777 1651400043 BD Module 4
No ratings yet
777 1651400043 BD Module 4
21 pages
Mapreduce and Hadoop Distributed File System
No ratings yet
Mapreduce and Hadoop Distributed File System
45 pages
MapReduce Algorithms For Big Data Analysis
No ratings yet
MapReduce Algorithms For Big Data Analysis
2 pages
L04-MapReduce
No ratings yet
L04-MapReduce
37 pages
Chapter 2_Introduction to MapReduce_new (1)
No ratings yet
Chapter 2_Introduction to MapReduce_new (1)
107 pages
Ir MR 1
No ratings yet
Ir MR 1
34 pages
BDA IAT 1 Question Bank
No ratings yet
BDA IAT 1 Question Bank
21 pages
Introduction To MapReduce
No ratings yet
Introduction To MapReduce
17 pages
AAAI2011 Tutorial Slides
No ratings yet
AAAI2011 Tutorial Slides
213 pages
Lez.d-01-Hadoop (A) Intro
No ratings yet
Lez.d-01-Hadoop (A) Intro
58 pages
Map reduce
No ratings yet
Map reduce
35 pages
Problem-Solving Using Mapreduce/Hadoop
No ratings yet
Problem-Solving Using Mapreduce/Hadoop
22 pages
BIG DATA UNIT-1
No ratings yet
BIG DATA UNIT-1
9 pages
07-BigData-DataAnalysis
No ratings yet
07-BigData-DataAnalysis
66 pages
Unit 2 Topic 4 Map Reduce
No ratings yet
Unit 2 Topic 4 Map Reduce
43 pages
Unit 1 Lecture 3
No ratings yet
Unit 1 Lecture 3
12 pages
Lecture 1 - Map Reduce
No ratings yet
Lecture 1 - Map Reduce
31 pages
Lecture-3-MR-model-and-systems
No ratings yet
Lecture-3-MR-model-and-systems
67 pages
Learn Python through Nursery Rhymes and Fairy Tales: Classic Stories Translated into Python Programs (Coding for Kids and Beginners)
From Everand
Learn Python through Nursery Rhymes and Fairy Tales: Classic Stories Translated into Python Programs (Coding for Kids and Beginners)
Shari Eskenas
5/5 (1)
A Short Course in Discrete Mathematics
From Everand
A Short Course in Discrete Mathematics
Edward A. Bender
3/5 (1)
Advantages and Disadvantages of Indexes
No ratings yet
Advantages and Disadvantages of Indexes
2 pages
Hashing Problem Set Solutions
No ratings yet
Hashing Problem Set Solutions
3 pages
Voo Bly Launch 2
No ratings yet
Voo Bly Launch 2
5 pages
CertyIQ AZ-900 UpdatedExam Dumps - 2022 Part 2
No ratings yet
CertyIQ AZ-900 UpdatedExam Dumps - 2022 Part 2
27 pages
Aixperf Part2
No ratings yet
Aixperf Part2
26 pages
David M. Kroenke's: Database Processing
No ratings yet
David M. Kroenke's: Database Processing
32 pages
1z0-482 Dump
0% (4)
1z0-482 Dump
6 pages
Chapter 3 - Functional Dependencies
No ratings yet
Chapter 3 - Functional Dependencies
122 pages
Sera Gem 0022001
No ratings yet
Sera Gem 0022001
13 pages
DBMS UNIT 2
No ratings yet
DBMS UNIT 2
19 pages
EMC VMAX3 - CLI Cheat Sheet - David Ring
No ratings yet
EMC VMAX3 - CLI Cheat Sheet - David Ring
8 pages
Open Learning Oracle Server Access
No ratings yet
Open Learning Oracle Server Access
8 pages
Dell Emc Cx4-120 Spec Sheet
No ratings yet
Dell Emc Cx4-120 Spec Sheet
2 pages
Coursera 4RVSFATZ4Y63
No ratings yet
Coursera 4RVSFATZ4Y63
1 page
SQL Full Course Notes
No ratings yet
SQL Full Course Notes
53 pages
First Year Harmony - Thomas Tapper, 1908
No ratings yet
First Year Harmony - Thomas Tapper, 1908
179 pages
Website: Vce To PDF Converter: Facebook: Twitter:: Saa-C02.Vceplus - Premium.Exam.65Q
No ratings yet
Website: Vce To PDF Converter: Facebook: Twitter:: Saa-C02.Vceplus - Premium.Exam.65Q
22 pages
HANA Overview
No ratings yet
HANA Overview
69 pages
Task 1 Data Science With Documentation (1)
No ratings yet
Task 1 Data Science With Documentation (1)
11 pages
Firebird 2 Migration & Installation: Helen Borrie (Collator/Editor)
No ratings yet
Firebird 2 Migration & Installation: Helen Borrie (Collator/Editor)
46 pages
DWM-3
No ratings yet
DWM-3
7 pages
Sending Mail From An 11G Oracle Database
No ratings yet
Sending Mail From An 11G Oracle Database
6 pages
How To Resize and Move Partitions in Linux
No ratings yet
How To Resize and Move Partitions in Linux
14 pages
Microsoft Biztalk Server Tutorial: Cis764 - Enterprise Database Design Oubai Bounie
No ratings yet
Microsoft Biztalk Server Tutorial: Cis764 - Enterprise Database Design Oubai Bounie
21 pages
Configuring SQL Server 2005 For Use With ShipConstructor - ShipConstructor Knowledge Base - ShipConstructor Knowledgebase
No ratings yet
Configuring SQL Server 2005 For Use With ShipConstructor - ShipConstructor Knowledge Base - ShipConstructor Knowledgebase
8 pages
DBMS MCQ (Multiple Choice Questions) - Sanfoundry
No ratings yet
DBMS MCQ (Multiple Choice Questions) - Sanfoundry
16 pages
International University of Africa Faculty of Computer Studies Department of Computer Science
No ratings yet
International University of Africa Faculty of Computer Studies Department of Computer Science
4 pages

Map Reduce

Uploaded by

Map Reduce

Uploaded by

MapReduce and Hadoop

November 10, 2014

Bring data to compute

Bring compute to data

Pictures courtesy: Glenn K. Lockwood, glennklockwood.com 2

the pairs (w,1) for the same words

reduce: count the number (n) of

pairs for each w, make it (w,n)

output: (w,n) for

are grouped together

1. JobClient submits job to JobTracker; Binary copied into HDFS

1. Splits data up to send it to the mapper

This whole chunk does not fit

 Projection on a subset S of URL1 URL2

R(A,B) using aggregation function γ(B), A B

 Think of a matrix as a relation with three attributes

 Natural join of (I,J,V) and (J,K,W)  tuples (i, j, k, aij, bjk)

 First MapReduce process has produced key value pairs ((i,k),

 A method with one MapReduce step

You might also like