Programming Hadoop Map-Reduce
Programming, Tuning & Debugging
Arun C Murthy
Yahoo! CCDI
[email protected]
ApacheCon US 2008
Existential angst: Who am I?
• Yahoo!
– Grid Team (CCDI)
• Apache Hadoop
– Developer since April 2006
– Core Committer (Map-Reduce)
– Member of the Hadoop PMC
Hadoop - Overview
• Hadoop includes:
– Distributed File System - distributes data
– Map/Reduce - distributes applications
• Open source from Apache
• Written in Java
• Runs on
– Linux, Mac OS X, Windows, and Solaris
– Commodity hardware
Distributed File System
• Designed to store large files
• Stores files as large blocks (64 to 128 MB)
• Each block stored on multiple servers
• Data is automatically re-replicated when needed
• Accessed from command line, Java API, or C API
– bin/hadoop fs -put my-file hdfs://node1:50070/foo/bar
– Path p = new Path("hdfs://node1:50070/foo/bar");
FileSystem fs = p.getFileSystem(conf);
DataOutputStream file = fs.create(p);
file.writeUTF("hello\n");
file.close();
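• The snippet above, expanded into a compilable sketch that also reads the record back (class name is illustrative; host, port, and path are the slide's example values):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path("hdfs://node1:50070/foo/bar"); // example host/port/path
    FileSystem fs = p.getFileSystem(conf);

    // Write one UTF record to a new file
    DataOutputStream out = fs.create(p);
    out.writeUTF("hello\n");
    out.close();

    // Read the record back
    DataInputStream in = fs.open(p);
    System.out.println(in.readUTF());
    in.close();
  }
}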
Map-Reduce
• Map-Reduce is a programming model for efficient
distributed computing
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Streaming through data, reducing seeks
– Pipelining
• A good fit for a lot of applications
– Log processing
– Web index building
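• A minimal driver sketch of that pipeline, assuming the 0.18-era org.apache.hadoop.mapred API; the identity map/reduce classes just pass records through, so what remains visible is the framework's Shuffle & Sort between the two stages:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PipelineDemo {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PipelineDemo.class);
    conf.setJobName("pipeline-demo");

    // TextInputFormat (the default) produces (offset, line) pairs
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    conf.setMapperClass(IdentityMapper.class);   // Map: pass through
    conf.setReducerClass(IdentityReducer.class); // Reduce: pass through
    // Shuffle & Sort happens inside the framework, between map and reduce

    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // Input
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // Output

    JobClient.runJob(conf);
  }
}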
Map/Reduce features
• Fine-grained Map and Reduce tasks
– Improved load balancing
– Faster recovery from failed tasks
• Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– These introduce long tails and failures into computations
– Framework re-executes failed tasks
• Locality optimizations
– With big data, bandwidth to data is a problem
– Map-Reduce + HDFS is a very effective solution
– Map-Reduce queries HDFS for locations of input data
– Map tasks are scheduled local to the inputs when possible
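• A sketch of that kind of location query, using the public FileSystem#getFileBlockLocations call (illustrative code, not the scheduler's internals; the path reuses the slide's example):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path("hdfs://node1:50070/foo/bar"); // example path
    FileSystem fs = p.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(p);

    // Ask the NameNode which hosts hold each block of the file;
    // the framework uses this information to place map tasks near their data
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset " + b.getOffset()
          + " length " + b.getLength()
          + " hosts " + Arrays.toString(b.getHosts()));
    }
  }
}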
Mappers and Reducers
• Every Map/Reduce program must specify a Mapper
and typically a Reducer
• The Mapper has a map method that transforms input
(key, value) pairs into any number of intermediate
(key’, value’) pairs
• The Reducer has a reduce method that transforms
intermediate (key’, value’*) aggregates into any number
of output (key’’, value’’) pairs
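• A minimal word-count sketch of a Mapper and Reducer, again assuming the 0.18-era org.apache.hadoop.mapred API (class names are illustrative):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
  // map: (offset, line) -> (word, 1) for each word in the line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total count)
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}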