Mapreduce

1) The document discusses cheating policies for an autograded assignment and paper presentation dates. It states that getting perfect scores should not be possible without cheating or a bug. 2) It then explains how sorting and streaming algorithms can be expressed as sequences of map, sort, and reduce operations on key-value pairs. 3) Finally, it discusses how Hadoop can parallelize these operations by distributing the data across multiple machines and disks, in order to speed up I/O-bound algorithms like searching large datasets.

Uploaded by

jefferyleclerc
Map-Reduce With Hadoop

Announcement - 1
• Assignment 1B:
  • Autolab is not secure and assignments aren’t designed for adversarial interactions
  • Our policy: deliberately “gaming” an autograded assignment is considered cheating.
  • The default penalty for cheating is failing the course.
  • Getting perfect test scores should not be possible: you’re either cheating, or it’s a bug.

Announcement - 2
• Paper presentations: 3/3 and 3/5
• Projects:
  • see “project info” on wiki
  • 1-2 page writeup of your idea: 2/17
  • Response to my feedback: 3/5
• Option for 605 students to collaborate:
  • Proposals will be posted; proposers can advertise slots for collaborators, who can be 605 students (1-2 per project max)
  • “Pay”: 1 less assignment, no exam

Today: from stream+sort to hadoop

• Looked at algorithms consisting of
  • Sorting (to organize messages)
  • Streaming (low-memory, line-by-line) file transformations (“map” operations)
  • Streaming “reduce” operations, like summing counts, that input files sorted by keys and operate on contiguous runs of lines with the same keys
• → Our algorithms could be expressed as sequences of map-sort-reduce triples (allowing identity maps and reduces) operating on sequences of key-value pairs
• → To parallelize we can look at parallelizing these …

Today: from stream+sort to hadoop

• Important point:
  • Our code is not CPU-bound
  • It’s I/O bound
  • To speed it up, we need to add more disk drives, not more CPUs.
  • Example: finding a particular line in 1 TB of data
• → Our algorithms could be expressed as sequences of map-sort-reduce triples (allowing identity maps and reduces) operating on sequences of key-value pairs
• → To parallelize we can look at parallelizing these …

Write code to run assignment 1B in parallel
• What infrastructure would you need?
• How could you run a generic stream-and-sort algorithm in parallel?
  • cat input.txt | MAP | sort | REDUCE > output.txt

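[Figure: the pipeline above annotated with the data at each stage. Input: key-value pairs (one/line), e.g., labeled docs → after MAP: key-value pairs (one/line), e.g., event counts → after sort: sorted key-val pairs (one/line) → after REDUCE: key-value pairs (one/line), e.g., aggregate counts]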
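
One possible answer to the question above is to split the input into shards, run MAP on each shard, route every pair to a partition chosen by hashing its key, and then sort and REDUCE each partition independently. The sketch below simulates that in a single Python process; it is not Hadoop's API, and all names (`partition_of`, `run_parallel`, `word_mapper`) are hypothetical. Each loop iteration marks work that could, in principle, run on a separate machine.

```python
import zlib
from collections import defaultdict
from itertools import groupby


def partition_of(key, num_reducers):
    """Pick a partition from a stable hash, so a key always goes to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def word_mapper(lines):
    """Example mapper: emit (word, 1) for every token in a shard of lines."""
    for line in lines:
        for word in line.split():
            yield word, 1


def run_parallel(shards, mapper, reducer, num_reducers=2):
    """Simulate map -> partition by key -> sort -> reduce over input shards."""
    partitions = defaultdict(list)
    # Map phase: each shard is processed independently of the others.
    for shard in shards:
        for key, value in mapper(shard):
            partitions[partition_of(key, num_reducers)].append((key, value))
    # Reduce phase: each partition is sorted and reduced independently.
    output = []
    for p in range(num_reducers):
        pairs = sorted(partitions[p])
        for key, run in groupby(pairs, key=lambda kv: kv[0]):
            output.append((key, reducer(value for _, value in run)))
    return output
```

For example, `sorted(run_parallel([["a b"], ["a c"]], word_mapper, sum, 3))` gives the combined word counts for both shards. Hashing on the key is what guarantees that all pairs for one key meet in the same partition, so each reducer still sees complete runs.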
