ParLab Parallel Boot Camp: Cloud Computing with MapReduce and Hadoop
• Pioneered by Google
– Processes 20 petabytes of data per day
2. Cost-efficiency:
– Commodity machines (cheap, but unreliable)
– Commodity network
– Automatic fault-tolerance (fewer administrators)
– Easy to use (fewer programmers)
Typical Hadoop Cluster
[Figure: cluster network topology — per-rack switches connected to an aggregation switch]
• MapReduce framework
– Executes user jobs specified as “map” and “reduce” functions
– Manages work distribution & fault-tolerance
Hadoop Distributed File System
• Files split into 128MB blocks
• Blocks replicated across several datanodes (usually 3)
• Optimized for large files and sequential reads
[Figure: the Namenode maps File1 to its four blocks, each replicated on several datanodes]
MapReduce Programming Model
• Map function:
    (Kin, Vin) → list(Kinter, Vinter)
• Reduce function:
    (Kinter, list(Vinter)) → list(Kout, Vout)
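As a concrete instance of these signatures, here is a minimal runnable sketch (plain Python, using word count as the job; the shuffle step is simulated locally, and names such as map_fn, reduce_fn, and run are illustrative, not Hadoop API):

    # Word count in the map/reduce model, simulated locally.
    from collections import defaultdict

    def map_fn(key, line):              # (Kin, Vin) -> list((Kinter, Vinter))
        return [(word, 1) for word in line.split()]

    def reduce_fn(word, counts):        # (Kinter, list(Vinter)) -> list((Kout, Vout))
        return [(word, sum(counts))]

    def run(records):
        groups = defaultdict(list)      # stands in for the framework's shuffle & sort
        for key, value in records:
            for k, v in map_fn(key, value):
                groups[k].append(v)
        return [out for k in sorted(groups) for out in reduce_fn(k, groups[k])]

    print(run([(0, "the quick brown fox"), (1, "the fox ate the mouse")]))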
Word Count Execution
[Figure: three input splits — “the quick brown fox”, “the fox ate the mouse”, “how now brown cow” — pass through Map, Shuffle & Sort, and Reduce; the maps emit (word, 1) pairs and the reduces produce the final counts: the 3, brown 2, fox 2, ate 1, cow 1, how 1, mouse 1, now 1, quick 1]
MapReduce Execution Details
• Single master controls job execution on multiple slaves
[Figure: the word count data flow again; in this version one mapper’s two (the, 1) pairs appear combined into a single (the, 2) before the shuffle]
Fault Tolerance in MapReduce
1. If a task crashes:
– Retry on another node
• OK for a map because it has no dependencies
• OK for reduce because map outputs are on disk
– If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
1. Search
• Map:
    if(line matches pattern):
        output(line)
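A runnable version of this map as a Hadoop Streaming-style script reading lines on stdin (the pattern is an illustrative choice, and with Streaming this can run as a map-only job):

    # Map-only search: emit every input line that matches the pattern.
    import re
    import sys

    PATTERN = re.compile(r"error")      # illustrative pattern, not from the slides

    for line in sys.stdin:
        if PATTERN.search(line):
            sys.stdout.write(line)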
3. Inverted Index
• Map:
    foreach word in text.split():
        output(word, filename)
• Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))
Inverted Index Example
[Figure: hamlet.txt (“to be or not to be”) and 12th.txt (“be not afraid of greatness”) are mapped to (word, filename) pairs and reduced to:
afraid, (12th.txt)
be, (12th.txt, hamlet.txt)
greatness, (12th.txt)
not, (12th.txt, hamlet.txt)
of, (12th.txt)
or, (hamlet.txt)
to, (hamlet.txt)]
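A runnable sketch that reproduces this example in plain Python (the shuffle is simulated with a dictionary, duplicate filenames are collapsed to match the output above, and all names are illustrative):

    # Inverted index: map (filename, text) to (word, filename) pairs,
    # group by word, then reduce each group to a sorted list of files.
    from collections import defaultdict

    def map_fn(filename, text):
        return [(word, filename) for word in text.split()]

    def reduce_fn(word, filenames):
        return (word, sorted(set(filenames)))   # collapse duplicates, as in the figure

    docs = [("hamlet.txt", "to be or not to be"),
            ("12th.txt", "be not afraid of greatness")]

    groups = defaultdict(list)
    for fname, text in docs:
        for word, f in map_fn(fname, text):
            groups[word].append(f)              # shuffle: group filenames by word

    for word in sorted(groups):
        print(reduce_fn(word, groups[word]))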
4. Most Popular Words
• Input: (filename, text) records
• Output: top 100 words occurring in the most files
• Two-stage solution:
– Job 1:
• Create inverted index, giving (word, list(file)) records
– Job 2:
• Map each (word, list(file)) to (count, word)
• Sort these records by count as in sort job
• Optimizations:
– Map to (word, 1) instead of (word, file) in Job 1
– Count files in job 1’s reducer rather than job 2’s mapper
– Estimate count distribution in advance and drop rare words
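A rough runnable sketch of Job 2 (local simulation; the input index and the top-2 cutoff are illustrative, whereas the deck keeps the top 100 and relies on the framework's sort over the count key):

    # Job 2: turn each (word, list(file)) record into (count, word),
    # then keep the words appearing in the most files.
    index = {                            # assumed output of Job 1 (inverted index)
        "the": ["a.txt", "b.txt", "c.txt"],
        "fox": ["a.txt", "b.txt"],
        "ate": ["b.txt"],
    }

    def map_fn(word, files):
        return [(len(files), word)]      # key by file count

    pairs = [p for word, files in index.items() for p in map_fn(word, files)]
    print(sorted(pairs, reverse=True)[:2])   # e.g. [(3, 'the'), (2, 'fox')]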
5. Numerical Integration
• Input: (start, end) records for sub-ranges to integrate
– Easy using custom InputFormat
• Output: integral of f(x) dx over entire range
• Map:
    def map(start, end):
        sum = 0
        for(x = start; x < end; x += step):
            sum += f(x) * step
        output(“”, sum)
• Reduce:
    def reduce(key, values):
        output(key, sum(values))
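A runnable local simulation of this job (f, step, and the sub-ranges are illustrative choices, not from the deck):

    # Each map call integrates f over its sub-range with a left Riemann sum
    # and emits the partial sum under the single key ""; one reduce adds them up.
    def f(x):
        return x * x                     # illustrative integrand

    step = 0.001

    def map_fn(start, end):
        total, x = 0.0, start
        while x < end:
            total += f(x) * step
            x += step
        return [("", total)]

    def reduce_fn(key, values):
        return (key, sum(values))

    ranges = [(0.0, 0.5), (0.5, 1.0)]    # sub-ranges to integrate
    partials = [v for r in ranges for _, v in map_fn(*r)]
    print(reduce_fn("", partials))       # close to 1/3 for x^2 on [0, 1]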
Outline
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Higher-level languages over Hadoop: Pig and Hive
• Amazon Elastic MapReduce
Getting Started with Hadoop
• Download from hadoop.apache.org
• To install locally, unzip and set JAVA_HOME
• Details: hadoop.apache.org/core/docs/current/quickstart.html
Word Count in Java
// end of the driver's main(): wire up the map, combine, and reduce classes,
// point the job at its input and output paths, and run it
conf.setMapperClass(MapClass.class);
conf.setCombinerClass(ReduceClass.class);   // combiner pre-aggregates map output locally
conf.setReducerClass(ReduceClass.class);
FileInputFormat.setInputPaths(conf, args[0]);
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
Word Count in Python with Hadoop Streaming
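A minimal sketch of the two scripts (assuming the usual Streaming convention of line-oriented, tab-separated key/value pairs on stdin and stdout; these are not the deck's exact scripts):

    # mapper.py: emit (word, 1) for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    # reducer.py (a separate script): input arrives sorted by key,
    # so all counts for a word are contiguous
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

These scripts would be handed to the Hadoop Streaming jar via its -input, -output, -mapper, and -reducer options.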
[Figures: an example query expressed as a Pig Latin dataflow, and its translation into MapReduce jobs.
Dataflow (operation → Pig Latin statement):
    Users = load …
    Pages = load …
    Filter by age → Filtered = filter …
    Join on name → Joined = join …
    Group on url → Grouped = group …
    Count clicks → Summed = … count() …
    Order by clicks → Sorted = order …
    Take top 5 → Top5 = limit …
Translation: Job 1 runs the loads, filter, and join; Job 2 the group and count; Job 3 the order and top 5.]
• If you want more control over how Hadoop runs, you can launch a Hadoop cluster on EC2 manually using the scripts in src/contrib/ec2
Elastic MapReduce Workflow
Conclusions
• MapReduce programming model hides the complexity of work distribution and fault tolerance
• My email: [email protected]