
Map-Reduce With Hadoop!
Announcement - 1!
• Assignment 1B:!
• Autolab is not secure and assignments aren’t
designed for adversarial interactions!
• Our policy: deliberately “gaming” an
autograded assignment is considered
cheating.!
• The default penalty for cheating is failing
the course.!
• Getting perfect test scores should not be
possible: you’re either cheating, or it’s a bug.!
Announcement - 2!
• Paper presentations: 3/3 and 3/5!
• Projects: !
• see “project info” on wiki!
• 1-2 page writeup of your idea: 2/17!
• Response to my feedback: 3/5!
• Option for 605 students to collaborate:!
• Proposals will be posted; proposers can
advertise slots for collaborators, who can be
605 students (1-2 per project max)!
• “Pay”: 1 less assignment, no exam!
Today: from stream+sort to hadoop!

• Looked at algorithms consisting of!


• Sorting (to organize messages)!
• Streaming (low-memory, line-by-line) file transformations (“map” operations)!
• Streaming “reduce” operations, like summing counts, that take as input files sorted by keys and operate on contiguous runs of lines with the same key!

• è Our algorithms could be expressed as sequences of


map-sort-reduce triples (allowing identity maps and
reduces) operating on sequences of key-value pairs!
• è To parallelize we can look at parallelizing these …!
Today: from stream+sort to hadoop!

• Important point:!
• Our code is not CPU-bound!
• It’s I/O bound!
• To speed it up, we need to add more disk drives, not
more CPUs.!
• Example: finding a particular line in 1 TB of data!

• è Our algorithms could be expressed as sequences of


map-sort-reduce triples (allowing identity maps and
reduces) operating on sequences of key-value pairs!
• è To parallelize we can look at parallelizing these …!
Write code to run assignment 1B
in parallel!!
• What infrastructure would you need?!

• How could you run a generic stream-and-sort algorithm in parallel?!

• cat input.txt | MAP | sort | REDUCE > output.txt!

[Diagram: key-value pairs (one/line), e.g., labeled docs → MAP → key-value pairs (one/line), e.g., event counts → sort → sorted key-val pairs (one/line) → REDUCE → key-value pairs (one/line), e.g., aggregate counts]
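• A minimal sketch of the two scripts in that pipeline (not from the original slides; the word-count task and the script names map.py/reduce.py are illustrative assumptions):!

# --- map.py: streaming "map" step: read lines, emit key<TAB>value pairs ---
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# --- reduce.py: streaming "reduce" step: sum counts over contiguous runs ---
# --- of lines that share the same key (so the input must be sorted)      ---
import sys
current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(current_key + "\t" + str(total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print(current_key + "\t" + str(total))

# usage: cat input.txt | python map.py | sort | python reduce.py > output.txt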

How would you run assignment
1B in parallel? !!
• What infrastructure would you need?!
• How could you run a generic stream-and-sort algorithm in
parallel?!

• cat input.txt | MAP | sort | REDUCE > output.txt!

Step 1: split input data, by key, into shards and ship each shard to a different box (A-E → Box1, F-M → Box2, N-P → Box3, Q-Z → Box4). Input: key-value pairs (one/line), e.g., labeled docs.

How would you run assignment 1B in parallel?!
• What infrastructure would you need?!
• How could you run a generic stream-and-sort algorithm in parallel?!
• cat input.txt | MAP | sort | REDUCE > output.txt!
• Open sockets to receive data to ~wcohen/kludge/mapinput.txt on each of the K boxes!
• For each key,val pair: send the key,val pair to boxFor(key)!
Step 1: split input data, by key, into shards and ship each shard to a different box (A-E → Box1, F-M → Box2, N-P → Box3, Q-Z → Box4); a partitioning sketch follows below.
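• A sketch of that partitioning step in Python (an illustration only: the slides show range partitioning by first letter, while this sketch uses a hash partition; boxFor and the shard file names are made up):!

import sys
from hashlib import md5

K = 4  # number of boxes (assumed)

def boxFor(key):
    # stable hash partition: map a key to one of the K boxes
    return int(md5(key.encode("utf-8")).hexdigest(), 16) % K

# write each key<TAB>value line into the shard destined for its box;
# the real setup would ship each shard (or stream it over a socket) to that box
shards = [open("shard-%d.txt" % i, "w") for i in range(K)]
for line in sys.stdin:
    key = line.split("\t", 1)[0]
    shards[boxFor(key)].write(line)
for f in shards:
    f.close()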

How would you run assignment 1B in parallel?!
• What infrastructure would you need?!
• How could you run a generic stream-and-sort algorithm in parallel?!
• cat input.txt | MAP | sort | REDUCE > output.txt!
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes!
• For each key,val pair in input.txt: send the key,val pair to boxFor(key)!
• Run K processes: rsh boxk MAP < mapin.txt > mapout.txt!
Step 2: run the maps in parallel (on Box1 … Box4).

How would you run assignment 1B in parallel?!
• What infrastructure would you need?!
• How could you run a generic stream-and-sort algorithm in parallel?!
• cat input.txt | MAP | sort | REDUCE > output.txt!
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes!
• For each key,val pair in input.txt: send the key,val pair to socket[boxFor(key)]!
• Run K processes: rsh … MAP < … > … to completion!
• On each box:!
• Open sockets to receive and sort data to boxk:/kludge/redin.txt on each of the K boxes!
• For each key,val pair in mapout.txt: send the key,val pair to socket[boxFor(key)]!
Step 3: redistribute the map output (Box1 … Box4).

How would you run assignment 1B in parallel?!
• What infrastructure would you need?!
• How could you run a generic stream-and-sort algorithm in parallel?!
• cat input.txt | MAP | sort | REDUCE > output.txt!
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes!
• For each key,val pair in input.txt: send the key,val pair to socket[boxFor(key)]!
• Run K processes: rsh MAP …!
• On each box:!
• Open sockets to receive and sort data to boxk:/kludge/redin.txt on each of the K boxes!
• For each key,val pair in mapout.txt: send the key,val pair to socket[boxFor(key)]!
Step 3: redistribute the map output (A-E map output → Box1, F-M → Box2, N-P → Box3, Q-Z → Box4).
How would you run assignment 1B in parallel?!
• What infrastructure would you need?!
• How could you run a generic stream-and-sort algorithm in parallel?!
• cat input.txt | MAP | sort | REDUCE > output.txt!
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes!
• For each key,val pair in input.txt: send the key,val pair to socket[boxFor(key)]!
• Run K processes: rsh MAP < mapin.txt > mapout.txt!
• Shuffle the data back to the right box!
• Do the same steps for the reduce processes!
Step 4: run the reduce processes in parallel (on Box1 … Box4).
How would you run assignment 1B in parallel?!
• What infrastructure would you need?!
• How could you run a generic stream-and-sort algorithm in parallel?!
• cat input.txt | MAP | sort | REDUCE > output.txt!
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes!
• For each key,val pair in input.txt: send the key,val pair to socket[boxFor(key)]!
• Run K processes: rsh MAP < mapin.txt > mapout.txt!
• Shuffle the data back to the right box!
• Do the same steps for the reduce process!
• (If the keys for the reduce process don't change, you don't need to reshuffle them)!
[Diagram: MAP on A-E/F-M/N-P/Q-Z shards (Box1 … Box4) → shuffle → REDUCE on A-E/F-M/N-P/Q-Z shards (Box1 … Box4)]

1. This would be pretty
systems-y (remote copy
files, waiting for remote
processes, …)

2. It would take work to
make it useful….

Motivating Example!
• Wikipedia is a very small part of the internet*!
[Diagram: the INTERNET, containing ClueWeb09 (5Tb) and Wikipedia abstracts (650Mb)]
• *may not be to scale!
1. This would be pretty
systems-y (remote copy
files, waiting for remote
processes, …)

2. It would take work to make it work for 500 jobs

• Reliability: Replication,
restarts, monitoring jobs,…

• Efficiency: load-balancing,
reducing file/network i/o,
optimizing file/network i/o,


• Usability: stream-defined datatypes, simple reduce functions, ….

[Diagram captions: “Event Counting on Subsets of Documents” and “Summing Counts”]
1. This would be pretty
systems-y (remote copy
files, waiting for remote
processes, …)

2. It would take work to make it work for 500 jobs

• Reliability: Replication,
restarts, monitoring jobs,…

• Efficiency: load-balancing,
reducing file/network i/o,
optimizing file/network i/o,


• Usability: stream-defined datatypes, simple reduce functions, ….

Parallel and Distributed Computing: MapReduce!
• pilfered from: Alona Fyshe!


Inspiration not Plagiarism!
This is not the first lecture ever on Mapreduce!
I borrowed from Alona Fyshe and she borrowed from:!

Jimmy Lin!

http://www.umiacs.umd.edu/~jimmylin/cloud-computing/SIGIR-2009/Lin-MapReduce-SIGIR2009.pdf!

Google!
http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html!

http://code.google.com/edu/submissions/mapreduce/listing.html!

Cloudera!
http://vimeo.com/3584536!
Surprise, you mapreduced!!
• Mapreduce has three main phases!
• Map (send each input record to a key)!
• Sort (put all of one key in the same place)!
• handled behind the scenes!
• Reduce (operate on each key and its set of values)!
• Terms come from functional programming:!

• map(lambda x:x.upper(),["william","w","cohen"]) ⇒ ['WILLIAM', 'W', 'COHEN']!
• reduce(lambda x,y:x+"-"+y,["william","w","cohen"]) ⇒ william-w-cohen!
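• (Aside, not from the slides: in Python 3, map returns an iterator and reduce lives in functools, so the same examples run as:)!

from functools import reduce
print(list(map(lambda x: x.upper(), ["william", "w", "cohen"])))    # ['WILLIAM', 'W', 'COHEN']
print(reduce(lambda x, y: x + "-" + y, ["william", "w", "cohen"]))  # william-w-cohen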
Mapreduce overview!

Map! Shuffle/Sort! Reduce!


Distributing NB (Naive Bayes)!
• Questions:!
• How will you know when each machine is done?!

• Communication overhead!

• How will you know if a machine is dead?!


Failure!
• How big of a deal is it really?!

• A huge deal. In a distributed environment disks fail ALL THE TIME. !


• Large scale systems must assume that any process can fail at any time.!

• It may be much cheaper to make the software run reliably on unreliable


hardware than to make the hardware reliable.!
• Ken Arnold (Sun, CORBA designer): !

• Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure. Imagine asking people, "If the probability of something happening is one in 10^13, how often would it happen?" Common sense would be to answer, "Never." That is an infinitely large number in human terms. But if you ask a physicist, she would say, "All the time. In a cubic foot of air, those things happen all the time."
Well, that's a pain!
• What will you do when a task fails?!
Well, that's a pain!
• What's the difference between slow and dead?!
• Who cares? Start a backup process.!

• If the process is slow because of machine issues, the backup


may finish first!

• If it's slow because you poorly partitioned your data... waiting is your punishment!
What else is a pain?!
• Losing your work!!
• If a disk fails you can lose some intermediate output!
• Ignoring the missing data could give you wrong
answers!

• Who cares? if I'm going to run backup processes I might as well have backup copies of the intermediate data also!
HDFS: The Hadoop File System!
• Distributes data across the cluster!
• distributed file looks like a directory with shards
as files inside it!
• makes an effort to run processes locally with the
data!
• Replicates data!
• default 3 copies of each file!
• Optimized for streaming!
• really really big blocks !
$ hadoop fs -ls rcv1/small/sharded!
Found 10 items!
-rw-r--r-- 3 … 606405 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00000!
-rw-r--r-- 3 … 1347611 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00001!
-rw-r--r-- 3 … 939307 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00002!
-rw-r--r-- 3 … 1284062 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00003!
-rw-r--r-- 3 … 1009890 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00004!
-rw-r--r-- 3 … 1206196 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00005!
-rw-r--r-- 3 … 1384658 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00006!
-rw-r--r-- 3 … 1299698 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00007!
-rw-r--r-- 3 … 928752 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00008!
-rw-r--r-- 3 … 806030 2013-01-22 16:28 /user/wcohen/rcv1/small/sharded/part-00009!
!
$ hadoop fs -tail rcv1/small/sharded/part-00005!
weak as the arrival of arbitraged cargoes from the West has put the local market under pressure… !
M14,M143,MCAT The Brent crude market on the Singapore International …!
!
MR Overview!
1. This would be pretty
systems-y (remote copy
files, waiting for remote
processes, …)

2. It would take work to make it work for 500 jobs

• Reliability: Replication,
restarts, monitoring jobs,…

• Efficiency: load-balancing,
reducing file/network i/o,
optimizing file/network i/o,


• Usability: stream-defined datatypes, simple reduce functions, ….

Map reduce with Hadoop
streaming!
Breaking this down…!
• What actually is a key-value pair? How do you interface with
Hadoop?!

• One very simple way: Hadoop’s streaming interface.!

• Mapper outputs key-value pairs as: !

• One pair per line, key and value tab-separated!

• Reducer reads in data in the same format!

• Lines are sorted so lines with the same key are adjacent.!
An example:!
• SmallStreamNB.java and StreamSumReducer.java: !
• the code you just wrote. !
To run locally:!
To train with streaming Hadoop you do this:!
• But first you need to get your code and data to the “Hadoop file system”!

To train with streaming
Hadoop:!
• First, you need to prepare the corpus by splitting it into shards!
• … and distributing the shards to different machines:!
To train with streaming
Hadoop:!
• One way to shard text:!
• hadoop fs -put LocalFileName HDFSName!

• then run a streaming job with ‘cat’ as mapper and reducer!
• and specify the number of shards you want with -numReduceTasks (a command sketch follows below)!
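• A hedged sketch of those two steps (the local file name, HDFS paths, and the streaming-jar location are assumptions that depend on your installation):!

$ hadoop fs -put rcv1.small.train.txt rcv1/small/train.txt
$ hadoop jar $STREAMING_JAR \
      -input rcv1/small/train.txt -output rcv1/small/sharded \
      -mapper cat -reducer cat -numReduceTasks 10
# $STREAMING_JAR = the hadoop-streaming jar shipped with your Hadoop installation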
To train with streaming
Hadoop:!
• Next, prepare your code for upload and distribution to the machines in the cluster!
Now you can run streaming
Hadoop:!
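• The command on the original slide isn't reproduced here; a sketch of the kind of invocation it showed (the jar name and the way the Java classes are wrapped are assumptions, not the actual course command):!

$ hadoop jar $STREAMING_JAR \
      -input rcv1/small/sharded -output rcv1/small/model \
      -mapper 'java -cp nb.jar SmallStreamNB' \
      -reducer 'java -cp nb.jar StreamSumReducer' \
      -file nb.jar -numReduceTasks 10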
“Real” Hadoop!
• Streaming is simple but!
• There’s no typechecking of inputs/outputs!

• You need to parse strings a lot!

• You can’t use compact binary encodings!

• …!

• basically you have limited control over what you’re doing!


Other input formats:
• KeyValueInputFormat
• SequenceFileInputFormat

Is any part of this wasteful?!
• Remember - moving data around and writing to/reading from
disk are very expensive operations!

• No reducer can start until:!

• all mappers are done !

• data in its partition has been sorted!


How much does buffering help?!
BUFFER_SIZE    Time    Message Size
none                   1.7M words
100            47s     1.2M
1,000          42s     1.0M
10,000         30s     0.7M
100,000        16s     0.24M
1,000,000      13s     0.16M
limit                  0.05M
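• The gains come from buffering counts inside the mapper and emitting partial sums; a sketch of the idea (the word-count task and variable names are stand-ins for the actual assignment code):!

import sys
from collections import defaultdict

BUFFER_SIZE = 10000          # max distinct keys to hold before flushing

buf = defaultdict(int)

def flush():
    # emit partial counts; the reducer still sums them, so correctness is unchanged
    for key, count in buf.items():
        print(key + "\t" + str(count))
    buf.clear()

for line in sys.stdin:
    for word in line.strip().split():
        buf[word] += 1
    if len(buf) >= BUFFER_SIZE:
        flush()
flush()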
Combiners!
• Sits between the map and the shuffle!
• Do some of the reducing while you're waiting for other stuff to happen!
• Avoid moving all of that data over the network!
• Only applicable when!
• order of reduce values doesn't matter!
• effect is cumulative!
Deja vu: Combiner = Reducer!
• Often the combiner is the reducer.!
• like for word count!

• but not always!


1. This would be pretty
systems-y (remote copy
files, waiting for remote
processes, …)

2. It would take work to make it work for 500 jobs

• Reliability: Replication,
restarts, monitoring jobs,…

• Efficiency: load-balancing,
reducing file/network i/o,
optimizing file/network i/o,


• Usability: stream-defined datatypes, simple reduce functions, ….

Some common pitfalls!
• You have no control over the order in which reduces are
performed!

• You have no control over the order in which you encounter


reduce values!

• More on this later!

• The only ordering you should assume is that Reducers always


start after Mappers!
Some common pitfalls!
• You should assume your Maps and Reduces will be taking place
on different machines with different memory spaces!

• Don't make a static variable and assume that other processes can read it!
• They can't.!
• It appears that they can when run locally, but they can't!
• No really, don't do this.!


Some common pitfalls!
• Do not communicate between mappers or between reducers!
• overhead is high!

• you don't know which mappers/reducers are actually running at any given point!
• there's no easy way to find out what machine they're running on!
• because you shouldn't be looking for them anyway!


When mapreduce doesn't fit!
• The beauty of mapreduce is its separability and independence!
• If you find yourself trying to communicate between processes!
• you're doing it wrong!
• or!
• what you're doing is not a mapreduce!
When mapreduce doesn't fit!
• Not everything is a mapreduce!
• Sometimes you need more communication!
• We'll talk about other programming paradigms later!


What's so tricky about MapReduce?!
• Really, nothing. It's easy.!
• What's often tricky is figuring out how to write an algorithm as a series of map-reduce substeps.!

• How and when do you parallelize?!

• When should you even try to do this? when should you use a
different model?!
Thinking in Mapreduce!
• A new task: Word co-occurrence statistics (simplified)!
• Input:!

•Sentences!
• Output:!

• P(Word B is in sentence | Word A started the sentence)!


Thinking in mapreduce!
• We need to calculate!
• P(B in sentence | A started sentence) =!

• P(B in sentence & A started sentence)/P(A started sentence)=!


•count<A,B>/count<A,*>!
Word Co-occurrence: Solution 1!
•The Pairs paradigm:!
• For each sentence, output pairs (a Python sketch follows this example)!
• E.g. Map(“Machine learning for big data”) creates:!

•<Machine, learning>:1!
•<Machine, for>:1!
•<Machine, big>:1!
•<Machine, data>:1!
•<Machine,*>:1!
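• A sketch of that Pairs mapper in Python (assuming one sentence per input line and the tab-separated streaming format from earlier; the comma delimiter inside the key is an arbitrary choice):!

import sys

for line in sys.stdin:
    words = line.strip().split()
    if not words:
        continue
    first = words[0]
    for w in words[1:]:
        print(first + "," + w + "\t1")   # <A,B>:1
    print(first + ",*" + "\t1")          # <A,*>:1, once per sentence that A starts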
Word Co-occurrence: Solution 1!
• Reduce would create, for example:!
•<Machine, learning>:10!
•<Machine, for>:1000!
•<Machine, big>:50!
•<Machine, data>:200!
•...!
•<Machine,*>:12000!
Word Co-occurrence: Solution 1!
•P(B in sentence | A started sentence) =!
• P(B in sentence & A started sentence)/P(A started sentence)=!
•<A,B>/<A,*>!

•Do we have what we need?!


•Yes!!
Word Co-occurrence: Solution 1!
• But wait!!
• There's a problem.... can you see it?!
Word Co-occurrence: Solution 1!
• Each reducer will process all counts for a <word1,word2> pair!
• We need to know <word1,*> at the same time as
<word1,word2>!

• The information is in different reducers!!


Word Co-occurrence: Solution 1!
• Solution 1 a)!
• Make the first word the reduce key!

• Each reducer has:!

• key: word_i !

• values:
<word_i,word_j>....<word_i,word_b>.....<word_i,*>....!
Word Co-occurrence: Solution 1!
• Now we have all the information in the same reducer!
• But, now we have a new problem, can you see it?!

• Hint: remember - we have no control over the order of values!


Word Co-occurrence: Solution 1!
• There could be too many values to hold in memory!
• We need <word_i,*> to be the first value we encounter!

• Solution 1 b): !

• Keep <word_i,word_j> as the reduce key!

• Change the way Hadoop does its partitioning.!


Word Co-occurrence: Solution 1!
Word Co-occurrence: Solution 1!
• Ok cool, but we still have the same problem.!
• The information is all in the same reducer, but we don't know the order!

• But now, we have all the information we need in the reduce key!!
Word Co-occurrence: Solution 1!
•We can use a custom comparator to sort the keys we encounter in
the reducer!

• One way: a custom key class which implements WritableComparable!

• Aside: if you use tab-separated values and Hadoop streaming you can create a streaming job where (for instance) field 1 is the partition key, and the lines are sorted by fields 1 & 2.!
Word Co-occurrence: Solution 1!
•Now the order of key, value pairs will be as we need:!
•<Machine,*>:12000!
•<Machine, big>:50!
•<Machine, data>:200!
•<Machine, for>:1000!
•<Machine, learning>:10!
•...!
• P(“big” in sentence | “Machine” started sentence) = 50/12000!
Word Co-occurrence: Solution 2!
• The Stripes paradigm!
• For each sentence, output a (key, record) pair!
• E.g. Map(“Machine learning for big data”) creates:!
• <Machine>:<*:1, learning:1, for:1, big:1, data:1>!
• E.g. Map(“Machine parts are for machines”) creates:!
• <Machine>:<*:1, parts:1, are:1, for:1, machines:1> (a mapper sketch follows below)!
Word Co-occurrence: Solution 2!
• Reduce combines the records:!
• E.g. Reduce for key <Machine> receives values:!

•<*:1,learning:1, for:1, big:1, data:1>!


•<*:1,parts:1,are:1, for:1,machines:1>!
• And merges them to create!

• <*:2,learning:1, for:2, big:1, data:1,parts:1,are:1,machines:1>!


Word Co-occurrence: Solution 2!
• This is nice because we have the * count already created !
• we just have to ensure it always occurs first in the record!
Word Co-occurrence: Solution 2!
• There is a really big (ha ha) problem with this solution!
• Can you see it?!

• The value may become too large to fit in memory!


Performance!
• IMPORTANT!
• You may not have room for all reduce values in memory!

• In fact you should PLAN not to have memory for all values!

• Remember, small machines are much cheaper!

• you have a limited budget!


Performance!
• Which is faster, stripes vs pairs?!
• Stripes has a bigger value per key!

• Pairs has more partition/sort overhead!


Performance!
Conclusions!
• Mapreduce!
• Can handle big data!

• Requires minimal code-writing!

• Real algorithms are typically a sequence of map-reduce steps!
