Parlab Parallel Boot Camp: Cloud Computing With Mapreduce and Hadoop
Matei Zaharia
Electrical Engineering and Computer Sciences
University of California, Berkeley
What is Cloud Computing?
• Attractive features:
– Scale: up to 100s of nodes
– Fine-grained billing: pay only for what you use
– Ease of use: sign up with credit card, get root access
What is MapReduce?
• Pioneered by Google
– Processes 20 petabytes of data per day
• At Google:
– Index construction for Google Search
– Article clustering for Google News
– Statistical machine translation
• At Yahoo!:
– “Web map” powering Yahoo! Search
– Spam detection for Yahoo! Mail
• At Facebook:
– Data mining
– Ad optimization
– Spam detection
Example: Facebook Lexicon
www.facebook.com/lexicon
What is MapReduce used for?
• In research:
– Astronomical image analysis (Washington)
– Bioinformatics (Maryland)
– Analyzing Wikipedia conflicts (PARC)
– Natural language processing (CMU)
– Particle physics (Nebraska)
– Ocean climate simulation (Washington)
– <Your application here>
Outline
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Higher-level languages over Hadoop: Pig and Hive
• Amazon Elastic MapReduce
MapReduce Design Goals
1. Scalability to large data volumes:
– Scanning 100 TB on 1 node at 50 MB/s takes about 24 days
– The same scan on a 1000-node cluster takes about 35 minutes
2. Cost-efficiency:
– Commodity machines (cheap, but unreliable)
– Commodity network
– Automatic fault-tolerance (fewer administrators)
– Easy to use (fewer programmers)
Typical Hadoop Cluster
[Diagram: racks of commodity nodes, each rack with its own rack switch, connected by an aggregation switch]
• MapReduce framework
– Executes user jobs specified as “map” and “reduce” functions
– Manages work distribution & fault-tolerance
Hadoop Distributed File System
[Diagram: a Namenode holding file metadata and Datanodes storing replicated file blocks]
MapReduce Programming Model
• Map function:
(Kin, Vin) → list(Kinter, Vinter)
• Reduce function:
(Kinter, list(Vinter)) → list(Kout, Vout)
Example: Word Count
def mapper(line):
    foreach word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
[Diagram: the input lines “the quick brown fox”, “the fox ate the mouse”, and “how now brown cow” pass through three Map tasks; the (word, 1) pairs are shuffled to two Reduce tasks, which emit the final counts: (the, 3), (fox, 2), (brown, 2), (how, 1), (now, 1), (ate, 1), (mouse, 1), (cow, 1), (quick, 1)]
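A minimal single-process Python sketch of this data flow; the names `mapper`, `shuffle`, and `reducer` mirror the pseudocode above and are illustrative, not Hadoop API:

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the line
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    # Sum the per-word counts
    return (word, sum(counts))

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(w, c) for w, c in shuffle(mapped).items())
print(counts["the"], counts["fox"], counts["brown"])  # 3 2 2
```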
MapReduce Execution Details
[Diagram: the same word-count data flow, but map outputs are buffered on local disk before being served to reducers, and the second mapper’s two (the, 1) pairs are pre-combined into (the, 2) before the shuffle]
Fault Tolerance in MapReduce
1. If a task crashes:
– Retry on another node
» OK for a map because it has no dependencies
» OK for reduce because map outputs are on disk
– If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
2. If a node crashes:
– Re-launch its current tasks on other nodes
– Re-run any maps the node previously ran
» Necessary because their output files were lost along with the crashed node
Outline
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Higher-level languages over Hadoop: Pig and Hive
• Amazon Elastic MapReduce
1. Search
• Map:
    if(line matches pattern):
        output(line)
2. Sort
• Map: identity function
• Reduce: identity function
[Diagram: words such as ant, bee, cow, pig, aardvark, elephant, and zebra pass through identity Map tasks and are partitioned by key range (e.g. [A-M] to one reducer) so the concatenated reducer outputs are globally sorted]
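The sort example works only because keys are partitioned in order across reducers; a toy Python sketch of that trick, with illustrative partition bounds:

```python
def partition(key):
    # Route keys starting with [a-m] to reducer 0 and [n-z] to reducer 1,
    # so concatenating reducer outputs yields a totally ordered result
    return 0 if key[0].lower() <= 'm' else 1

words = ["zebra", "ant", "cow", "bee", "pig", "elephant", "aardvark"]
buckets = {0: [], 1: []}
for w in words:                  # "map" is the identity, plus partitioning
    buckets[partition(w)].append(w)
# Each "reducer" sorts only its own bucket; the concatenation is fully sorted
result = sorted(buckets[0]) + sorted(buckets[1])
print(result)  # equals sorted(words)
```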
3. Inverted Index
• Map:
    foreach word in text.split():
        output(word, filename)
• Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))
Inverted Index Example
[Diagram: hamlet.txt (“to be or not to be”) and 12th.txt (“be not afraid of greatness”) are mapped to (word, filename) pairs, which reduce to:
afraid, (12th.txt)
be, (12th.txt, hamlet.txt)
greatness, (12th.txt)
not, (12th.txt, hamlet.txt)
of, (12th.txt)
or, (hamlet.txt)
to, (hamlet.txt)]
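The same example as a single-process Python sketch (file names and contents taken from the diagram; the map/reduce split is simulated in-memory):

```python
from collections import defaultdict

def mapper(filename, text):
    # Emit (word, filename) for every word in the file
    for word in text.split():
        yield (word, filename)

files = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Shuffle: collect the set of files for each word
index = defaultdict(set)
for name, text in files.items():
    for word, fname in mapper(name, text):
        index[word].add(fname)

# Reduce: output each word with its sorted list of files
inverted = {w: sorted(fs) for w, fs in index.items()}
print(inverted["be"])  # ['12th.txt', 'hamlet.txt']
print(inverted["to"])  # ['hamlet.txt']
```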
4. Most Popular Words
• Two-stage solution:
– Job 1:
» Create inverted index, giving (word, list(file)) records
– Job 2:
» Map each (word, list(file)) to (count, word)
» Sort these records by count, as in the sort job
• Optimizations:
– Map to (word, 1) instead of (word, file) in Job 1
– Count files in Job 1’s reducer rather than Job 2’s mapper
– Estimate count distribution in advance and drop rare words
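A compact Python sketch of the two-stage solution, using the (word, 1) optimization; the file names and contents are illustrative:

```python
from collections import Counter

docs = {"hamlet.txt": "to be or not to be",
        "12th.txt": "be not afraid of greatness"}

# Job 1 (with the (word, 1) optimization): count occurrences per word
counts = Counter(word for text in docs.values() for word in text.split())

# Job 2: map each (word, count) to (count, word), then sort descending
ranked = sorted(((c, w) for w, c in counts.items()), reverse=True)
print(ranked[0])  # (3, 'be')
```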
5. Numerical Integration
• Map:
    def map(start, end):
        sum = 0
        for(x = start; x < end; x += step):
            sum += f(x) * step
        output("", sum)
• Reduce:
    def reduce(key, values):
        output(key, sum(values))
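A runnable version of the same idea: each "map task" integrates its own slice of the interval, and the single reducer adds the partial sums. The function names and the choice of f(x) = x² are illustrative:

```python
def map_range(start, end, step, f):
    # Each map task integrates f over its own [start, end) slice
    total, x = 0.0, start
    while x < end:
        total += f(x) * step
        x += step
    return ("", total)

def reduce_sums(key, values):
    # The single reducer adds up the partial sums
    return (key, sum(values))

# Integrate f(x) = x^2 over [0, 1) split across four "map tasks"
f = lambda x: x * x
step = 0.001
parts = [map_range(i * 0.25, (i + 1) * 0.25, step, f) for i in range(4)]
_, total = reduce_sums("", [v for _, v in parts])
print(total)  # close to 1/3
```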
Outline
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Higher-level languages over Hadoop: Pig and Hive
• Amazon Elastic MapReduce
Getting Started with Hadoop
Word Count in Java (excerpt: the driver code that configures and submits the job):
conf.setMapperClass(MapClass.class);
conf.setCombinerClass(ReduceClass.class);
conf.setReducerClass(ReduceClass.class);
FileInputFormat.setInputPaths(conf, args[0]);
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
Word Count in Python with Hadoop Streaming
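This slide's code did not survive extraction; a typical Streaming word count looks roughly like the following sketch. Streaming passes input lines on stdin and expects tab-separated key/value pairs on stdout, with reducer input already sorted by key; here the framework's sort step is simulated in-process:

```python
from itertools import groupby

def mapper(stream):
    # mapper.py: emit "word<TAB>1" for each word on stdin
    for line in stream:
        for word in line.split():
            yield "%s\t1" % word

def reducer(stream):
    # reducer.py: stdin arrives sorted by key, so adjacent lines share a key
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(v) for _, v in group))

# Simulate the framework: map, sort by key, then reduce
mapped = sorted(mapper(["the quick brown fox", "the fox ate the mouse"]))
result = dict(line.split("\t") for line in reducer(mapped))
print(result["the"])  # 3
```

On a real cluster these would be two separate scripts passed to the Streaming jar via its `-mapper` and `-reducer` options.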
Outline
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Higher-level languages over Hadoop: Pig and Hive
• Amazon Elastic MapReduce
Motivation
[Example data flow for finding the top 5 pages visited by users aged 18-25: load users, filter by age, load pages, join on name, group on url, count clicks, order by clicks, take the top 5]
Notice how naturally the components of the job translate into Pig Latin:
Load users → Users = load …
Filter by age → Filtered = filter …
Load pages → Pages = load …
Join on name → Joined = join …
Group on url → Grouped = group …
Count clicks → Summed = … count() …
Order by clicks → Sorted = order …
Take top 5 → Top5 = limit …
[A second diagram shows the same pipeline compiled into three MapReduce jobs: Job 1 performs the join, Job 2 the group and count, Job 3 the order and limit]
Hive
• Developed at Facebook
• Used for majority of Facebook jobs
• “Relational database” built on Hadoop
– Maintains list of table schemas
– SQL-like query language (HQL)
– Can call Hadoop Streaming scripts from HQL
– Supports table partitioning, clustering, complex
data types, some optimizations
Sample Hive Queries
• Find top 5 pages visited by users aged 18-25:
SELECT p.url, COUNT(1) as clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;
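A rough Python sketch of what this top-5 query computes, over illustrative in-memory tables (note that the click counts must be sorted descending):

```python
from collections import Counter

# Illustrative stand-ins for the users and page_views tables
users = [("alice", 20), ("bob", 30), ("carol", 19)]
page_views = [("alice", "a.com"), ("carol", "a.com"),
              ("alice", "b.com"), ("bob", "a.com")]

# WHERE u.age >= 18 AND u.age <= 25
young = {name for name, age in users if 18 <= age <= 25}

# JOIN on name + GROUP BY p.url + COUNT(1)
clicks = Counter(url for user, url in page_views if user in young)

# ORDER BY clicks DESC, LIMIT 5
top5 = clicks.most_common(5)
print(top5)  # [('a.com', 2), ('b.com', 1)]
```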
Outline
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Higher-level languages over Hadoop: Pig and Hive
• Amazon Elastic MapReduce
Amazon Elastic MapReduce
• Web service that provisions Hadoop clusters on Amazon EC2 on demand
• My email: [email protected]