MapReduce
Hadoop
Announcement - 1
• Assignment 1B:
  • Autolab is not secure and assignments aren’t designed for adversarial interactions
  • Our policy: deliberately “gaming” an autograded assignment is considered cheating.
  • The default penalty for cheating is failing the course.
  • Getting perfect test scores should not be possible: you’re either cheating, or it’s a bug.
Announcement - 2
• Paper presentations: 3/3 and 3/5
• Projects:
  • see “project info” on wiki
  • 1-2 page writeup of your idea: 2/17
  • Response to my feedback: 3/5
• Option for 605 students to collaborate:
  • Proposals will be posted; proposers can advertise slots for collaborators, who can be 605 students (1-2 per project max)
  • “Pay”: 1 less assignment, no exam
Today: from stream+sort to Hadoop
• Important point:
  • Our code is not CPU-bound
  • It’s I/O-bound
  • To speed it up, we need to add more disk drives, not more CPUs.
  • Example: finding a particular line in 1 TB of data
[Pipeline diagram: key-value pairs (one/line), e.g., labeled docs → key-value pairs (one/line), e.g., event counts → sorted key-val pairs → key-value pairs (one/line), e.g., aggregate counts]
How would you run assignment 1B in parallel?
• What infrastructure would you need?
• How could you run a generic stream-and-sort algorithm in parallel?

Step 1: split input data, by key, into shards and ship each shard to a different box
[Diagram: key-value pairs (one/line), e.g., labeled docs, routed by key range: A-E → Box1, F-M → Box2, N-P → Box3, Q-Z → Box4]
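The key-range routing in the diagram can be sketched as a small routing function; `boxFor` is the name the later slides use, and the A-E/F-M/N-P/Q-Z ranges are taken straight from the diagram:

```python
# Route a key to a box by its first letter, mirroring the
# A-E -> Box1, F-M -> Box2, N-P -> Box3, Q-Z -> Box4 split in the diagram.
def boxFor(key):
    first = key[0].upper()
    if first <= 'E':
        return 1
    elif first <= 'M':
        return 2
    elif first <= 'P':
        return 3
    else:
        return 4

# Step 1: split a stream of (key, value) pairs into per-box shards.
def shard(pairs, num_boxes=4):
    shards = {b: [] for b in range(1, num_boxes + 1)}
    for key, val in pairs:
        shards[boxFor(key)].append((key, val))
    return shards
```

In a real deployment `boxFor` would more likely be a hash of the key modulo the number of boxes, so the shards stay balanced regardless of how keys are distributed alphabetically.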
How would you run assignment 1B in parallel?
• Open sockets to receive data to ~wcohen/kludge/mapinput.txt on each of the K boxes
• For each key,val pair:
  • Send key,val pair to boxFor(key)

Step 1: split input data, by key, into shards and ship each shard to a different box
[Diagram: A-E → Box1, F-M → Box2, N-P → Box3, Q-Z → Box4]
How would you run assignment 1B in parallel?
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes
• For each key,val pair in input.txt:
  • Send key,val pair to boxFor(key)
• Run K processes: rsh boxk MAP < mapin.txt > mapout.txt

Step 2: run the maps in parallel
[Diagram: A-E → Box1, F-M → Box2, N-P → Box3, Q-Z → Box4]
How would you run assignment 1B in parallel?
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes
• For each key,val pair in input.txt:
  • Send key,val pair to socket[boxFor(key)]
• Run K processes: rsh … MAP < … > … to completion
• On each box:
  • Open sockets to receive and sort data to boxk:/kludge/redin.txt on each of the K boxes
  • For each key,val pair in mapout.txt:
    • Send key,val pair to socket[boxFor(key)]

Step 3: redistribute the map output
[Diagram: A-E → Box1, F-M → Box2, N-P → Box3, Q-Z → Box4]
How would you run assignment 1B in parallel?
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes
• For each key,val pair in input.txt:
  • Send key,val pair to socket[boxFor(key)]
• Run K processes: rsh MAP …
• On each box:
  • Open sockets to receive and sort data to boxk:/kludge/redin.txt on each of the K boxes
  • For each key,val pair in mapout.txt:
    • Send key,val pair to socket[boxFor(key)]

Step 3: redistribute the map output
[Diagram: map output flows from Box1-Box4 back to Box1-Box4, routed by key range A-E, F-M, N-P, Q-Z]
How would you run assignment 1B in parallel?
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes
• For each key,val pair in input.txt:
  • Send key,val pair to socket[boxFor(key)]
• Run K processes: rsh MAP < mapin.txt > mapout.txt
• Shuffle the data back to the right box
• Do the same steps for the reduce processes

Step 4: run the reduce processes in parallel
[Diagram: reduce input flows from Box1-Box4 back to Box1-Box4, routed by key range A-E, F-M, N-P, Q-Z]
How would you run assignment 1B in parallel?
• Open sockets to receive data to boxk:/kludge/mapin.txt on each of the K boxes
• For each key,val pair in input.txt:
  • Send key,val pair to socket[boxFor(key)]
• Run K processes: rsh MAP < mapin.txt > mapout.txt
• Shuffle the data back to the right box
• Do the same steps for the reduce process
  • (If the keys for the reduce process don’t change, you don’t need to reshuffle them)

[Diagram: MAP runs on the boxes for key ranges F-M, N-P, Q-Z; the output is shuffled by key range and REDUCE runs on the same boxes]
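The four steps above can be simulated in a single process; here sockets and rsh are replaced by plain in-memory lists, and the MAP/REDUCE functions are passed in as parameters. This is a sketch of the control flow only, with a hypothetical `box_for` routing function standing in for the per-box sockets:

```python
from collections import defaultdict

def run_stream_and_sort(pairs, map_fn, reduce_fn, num_boxes, box_for):
    # Step 1: split input data by key into shards (one "box" per shard).
    map_in = defaultdict(list)
    for key, val in pairs:
        map_in[box_for(key)].append((key, val))
    # Step 2: run the maps "in parallel" (here: one box at a time).
    map_out = {b: [kv for pair in kvs for kv in map_fn(*pair)]
               for b, kvs in map_in.items()}
    # Step 3: redistribute (shuffle) the map output by key.
    red_in = defaultdict(list)
    for kvs in map_out.values():
        for key, val in kvs:
            red_in[box_for(key)].append((key, val))
    # Step 4: sort each box's input, group by key, and run the reduces.
    out = []
    for b, kvs in red_in.items():
        kvs.sort()
        grouped = defaultdict(list)
        for key, val in kvs:
            grouped[key].append(val)
        for key in sorted(grouped):
            out.append(reduce_fn(key, grouped[key]))
    return sorted(out)
```

For example, word count is `map_fn = lambda k, v: [(w, 1) for w in v.split()]` with `reduce_fn = lambda k, vals: (k, sum(vals))`. Everything the real version adds (remote copies, process monitoring, failure handling) lives outside this skeleton, which is exactly the point of the next slide.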
1. This would be pretty systems-y (remote copy files, waiting for remote processes, …)
2. It would take work to make it useful….
Motivating Example
• Wikipedia is a very small part of the internet*
  • ClueWeb09: 5Tb
  • Wikipedia abstracts: 650Mb
  • *may not be to scale
[Diagram: Wikipedia abstracts (650Mb) inside ClueWeb09 (5Tb) inside the INTERNET]
1. This would be pretty systems-y (remote copy files, waiting for remote processes, …)
2. It would take work to make it run for 500 jobs
  • Reliability: replication, restarts, monitoring jobs, …
  • Efficiency: load-balancing, reducing file/network i/o, optimizing file/network i/o, …
  • Useability: stream defined datatypes, simple reduce functions, ….
[Diagram: Event Counting on Subsets of Documents → Summing Counts]
Parallel and Distributed Computing: MapReduce
Jimmy Lin
• http://www.umiacs.umd.edu/~jimmylin/cloud-computing/SIGIR-2009/Lin-MapReduce-SIGIR2009.pdf
Google
• http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
• http://code.google.com/edu/submissions/mapreduce/listing.html
Cloudera
• http://vimeo.com/3584536
Surprise, you mapreduced!
• Mapreduce has three main phases
  • Map (send each input record to a key)
  • Sort (put all of one key in the same place)
    • handled behind the scenes
  • Reduce (operate on each key and its set of values)
• Terms come from functional programming:
  • map(lambda x:x.upper(),["william","w","cohen"]) → ['WILLIAM', 'W', 'COHEN']
• Communication overhead
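The functional-programming analogy can be carried one step further with Python's built-in `functools.reduce`; the concatenating reducer here is just an illustration, not anything Hadoop-specific:

```python
from functools import reduce

# map: apply a function independently to every element of a sequence.
mapped = list(map(lambda x: x.upper(), ["william", "w", "cohen"]))

# reduce: fold a sequence of values down to one value,
# here by concatenating with spaces.
reduced = reduce(lambda a, b: a + " " + b, mapped)
```

MapReduce generalizes this picture by running many maps in parallel and by running one reduce per key rather than one reduce over everything.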
MapReduce with Hadoop streaming
Breaking this down…
• What actually is a key-value pair? How do you interface with Hadoop?
• Lines are sorted so lines with the same key are adjacent.
An example:
• SmallStreamNB.java and StreamSumReducer.java:
  • the code you just wrote.
To run locally:
To train with streaming Hadoop you do this:
• …
• effect is cumulative
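Hadoop streaming just runs your mapper and reducer as ordinary processes that read stdin and write stdout, with a sort in between. A minimal word-count pair in that style might look like the following sketch (these are hypothetical scripts, not the SmallStreamNB code above):

```python
# mapper: emit one tab-separated "key<TAB>value" line per word,
# which is Hadoop streaming's default key-value format.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

# reducer: input lines arrive sorted, so lines with the same key are
# adjacent and a running total is all the state we need.
def reducer(lines):
    prev, total = None, 0
    for line in lines:
        key, count = line.split("\t")
        if key != prev and prev is not None:
            yield "%s\t%d" % (prev, total)
            total = 0
        prev = key
        total += int(count)
    if prev is not None:
        yield "%s\t%d" % (prev, total)
```

Locally the same logic is a shell pipeline, `mapper < input | sort | reducer`; on the cluster the two scripts are handed to the streaming jar as the mapper and reducer commands, and Hadoop supplies the distributed sort.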
Deja vu: Combiner = Reducer
• Often the combiner is the reducer.
  • like for word count
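A combiner runs on each mapper's local output before the shuffle, cutting network i/o. Reusing the reducer as the combiner is safe when the reduce operation is associative and commutative, as summing counts is; this sketch (with an illustrative `sum_reduce` standing in for the word-count reducer) checks that locally pre-reducing each shard gives the same final answer:

```python
from collections import defaultdict

def sum_reduce(pairs):
    # Word-count-style reduce: sum the values for each key.
    out = defaultdict(int)
    for key, val in pairs:
        out[key] += val
    return dict(out)

# Two mappers' local outputs.
shard1 = [("big", 1), ("data", 1), ("big", 1)]
shard2 = [("big", 1), ("machine", 1)]

# Combiner pass: reduce each shard locally, then ship the (smaller)
# partial counts and reduce them again.
combined = list(sum_reduce(shard1).items()) + list(sum_reduce(shard2).items())
```

Because addition is associative, `sum_reduce(combined)` equals `sum_reduce(shard1 + shard2)`; for a non-associative reduce (e.g., a median) this shortcut would be wrong.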
Some common pitfalls
• You have no control over the order in which reduces are performed
• Don’t make a static variable and assume that other processes can read it
  • It may appear that they can when run locally, but they can’t
• When should you even try to do this? When should you use a different model?
Thinking in Mapreduce
• A new task: Word co-occurrence statistics (simplified)
• Input:
  • Sentences
• Output:
  • <Machine, learning>:1
  • <Machine, for>:1
  • <Machine, big>:1
  • <Machine, data>:1
  • <Machine, *>:1
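Output like the above can be produced by a mapper that, for each word, emits one record per co-occurring word plus a `<word,*>` record for the marginal count. This is a sketch under the assumption that `*` counts occurrences of the first word; the string keys mimic the `<A,B>` notation on the slide:

```python
def cooccur_map(sentence):
    # For each word w_i, emit (<w_i,w_j>, 1) for every other word w_j in
    # the sentence, plus (<w_i,*>, 1) as the marginal count for w_i.
    words = sentence.split()
    for i, wi in enumerate(words):
        for j, wj in enumerate(words):
            if i != j:
                yield ("<%s,%s>" % (wi, wj), 1)
        yield ("<%s,*>" % wi, 1)
```

The reduce is then the usual per-key sum, which produces exactly the aggregated counts on the next slide.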
Word Co-occurrence: Solution 1
• Reduce would create, for example:
  • <Machine, learning>:10
  • <Machine, for>:1000
  • <Machine, big>:50
  • <Machine, data>:200
  • ...
  • <Machine, *>:12000
Word Co-occurrence: Solution 1
• P(B in sentence | A started sentence) = P(B in sentence & A started sentence) / P(A started sentence) = <A,B>/<A,*>
• key: word_i
• values: <word_i,word_j>....<word_i,word_b>.....<word_i,*>....
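With those counts in hand, the conditional probability is a single division; a sketch using the example numbers from the previous slide:

```python
# Counts as they would arrive at the reducer for word "Machine"
# (numbers taken from the example on the previous slide).
counts = {"<Machine,learning>": 10, "<Machine,for>": 1000,
          "<Machine,big>": 50, "<Machine,data>": 200,
          "<Machine,*>": 12000}

def cond_prob(a, b, counts):
    # P(B ... | A ...) = <A,B> / <A,*>
    return counts["<%s,%s>" % (a, b)] / counts["<%s,*>" % a]
```

The catch, raised on the next slide, is that computing this inside the reducer requires `<A,*>` to be available alongside every `<A,B>`, which is what keying on word_i (and the sort-order tricks that follow) arranges.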
Word Co-occurrence: Solution 1
• Now we have all the information in the same reducer
• But now we have a new problem; can you see it?
• Solution 1 b):
  • But now, we have all the information we need in the reduce key!
Word Co-occurrence: Solution 1!
•We can use a custom comparator to sort the keys we encounter in
the reducer!
•<Machine>:<*:1,parts:1,are:1, for:1,machines:1>!
Word Co-occurrence: Solution 2
• Reduce combines the records:
• E.g., Reduce for key <Machine> receives values:
• In fact you should PLAN not to have memory for all values
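Planning not to have memory for all values means the reducer should consume its value stream incrementally, keeping only constant-size state. A sketch of the idea (not the course code), for count-summing:

```python
def streaming_reduce(key, values):
    # 'values' may be a one-pass generator over millions of records:
    # consume it element by element and keep only a running total,
    # never materializing the whole value list in memory.
    total = 0
    for v in values:
        total += v
    return (key, total)
```

The same discipline applies to any associative aggregate (max, counts, partial sums); only reduces that genuinely need all values at once, like a median, force buffering.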