Programming Hadoop
ApacheCon US 2008
Apache Hadoop
Developer since April 2006
Core Committer (Map-Reduce)
Member of the Hadoop PMC
Hadoop - Overview
Hadoop includes:
Distributed File System - distributes data
Map/Reduce - distributes application
Map-Reduce
Map-Reduce is a programming model for efficient distributed computing
It works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
Efficiency from
Streaming through data, reducing seeks
Pipelining
Map/Reduce features
Fine grained Map and Reduce tasks
Improved load balancing
Faster recovery from failed tasks
Locality optimizations
With big data, bandwidth to data is a problem
Map-Reduce + HDFS is a very effective solution
Map-Reduce queries HDFS for locations of input data
Map tasks are scheduled local to the inputs when possible
Map/Reduce Dataflow
Example
45% of all Hadoop tutorials count words. 25% count sentences. 20% are about paragraphs. 10% are log parsers. The remainder are helpful.
jandersen, http://twitter.com/jandersen/statuses/926856631
Input and Output Formats
A Map/Reduce may specify how its input is to be read by specifying an InputFormat, and how its output is to be written by specifying an OutputFormat
These default to TextInputFormat and TextOutputFormat, which process line-based text data
SequenceFile: SequenceFileInputFormat and SequenceFileOutputFormat
These are file-based, but they are not required to be
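Illustrative only (not from the slides): on the old org.apache.hadoop.mapred API the formats are chosen on the JobConf roughly as below; the paths and key/value classes are assumptions.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class FormatExample {
  public static void configure(JobConf conf) {
    // Read line-oriented text (the default InputFormat)
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("in-dir"));
    // Write the job output as a binary SequenceFile instead of text
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(conf, new Path("out-dir"));
  }
}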
Configuring a Job
Jobs are controlled by configuring JobConf
JobConfs are maps from attribute names to string values
The framework defines attributes to control how the job is executed
conf.set("mapred.job.name", "MyApp");
The launching program then submits the job and typically waits for it to complete
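A minimal launching-program sketch (not verbatim from the slides) on the old mapred API; IdentityMapper and IdentityReducer stand in for real application classes, and the class name MyApp is illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MyApp {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyApp.class);
    conf.setJobName("MyApp");                    // equivalent to conf.set("mapred.job.name", "MyApp")
    conf.setMapperClass(IdentityMapper.class);   // stand-ins for real Map and Reduce classes
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);  // TextInputFormat keys are byte offsets
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);                      // submit and wait for completion
  }
}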
Non-Java Interfaces
Streaming
Pipes (C++)
Pig
Hive
Jaql
Cascading
Streaming
What about Unix hacks?
Can define Mapper and Reducer using Unix text filters
Typically use grep, sed, python, or perl scripts
Format for input and output is: key \t value \n
Allows for easy debugging and experimentation
Slower than Java programs
bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir
-mapper streamingMapper.sh -reducer streamingReducer.sh
Mapper: /bin/sed -e 's| |\n|g' | /bin/grep .
Reducer: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'
Pipes (C++)
C++ API and library to link application with
C++ application is launched as a sub-process of the Java task
Keys and values are std::string with binary data
Word count map looks like:
class WordCountMap: public HadoopPipes::Mapper {
public:
WordCountMap(HadoopPipes::TaskContext& context){}
void map(HadoopPipes::MapContext& context) {
std::vector<std::string> words =
HadoopUtils::splitString(context.getInputValue(), " ");
for(unsigned int i=0; i < words.size(); ++i) {
context.emit(words[i], "1");
}}};
Pipes (C++)
The reducer looks like:
class WordCountReduce: public HadoopPipes::Reducer {
public:
WordCountReduce(HadoopPipes::TaskContext& context){}
void reduce(HadoopPipes::ReduceContext& context) {
int sum = 0;
while (context.nextValue()) {
sum += HadoopUtils::toInt(context.getInputValue());
}
context.emit(context.getInputKey(), HadoopUtils::toString(sum));
}
};
Pipes (C++)
And define a main function to invoke the tasks:
int main(int argc, char *argv[]) {
  // TemplateFactory parameters: mapper, reducer, partitioner (void = default),
  // combiner (here the reducer class is reused as the combiner)
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce,
                                   void, WordCountReduce>());
}
Pig
Word Count in Pig Latin:
input = LOAD 'in-dir' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'out-dir';
Reduces
Unless the amount of data being processed is small, use about
0.95 * num_nodes * mapred.tasktracker.reduce.tasks.maximum
Performance Example
Bob wants to count lines in text files totaling several terabytes
He uses
Identity Mapper (input: text, output: same text)
A single Reducer that counts the lines and outputs the total
Every map output record is shuffled to that one Reducer, so a single node becomes the bottleneck
Partitioners
Partitioners are application code that define how keys are assigned to reduces
Default partitioning spreads keys evenly, but randomly
Uses key.hashCode() % num_reduces
Custom partitioning is often required, for example, to produce a total order in the output
Should implement Partitioner interface
Set by calling conf.setPartitionerClass(MyPart.class)
To get a total order, sample the map output keys and pick values that divide the keys into roughly equal buckets, then use those split points in your partitioner
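A hedged sketch of what such a partitioner can look like on the old mapred API; the class name MyPart, the single split point "m", and the assumption of two reduces are all illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyPart implements Partitioner<Text, IntWritable> {
  public void configure(JobConf conf) {}   // no per-job setup needed for this sketch

  // Keys before "m" go to reduce 0, the rest to reduce 1, so concatenating
  // the reduce outputs in order yields a totally ordered result.
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks < 2) {
      return 0;
    }
    return key.toString().compareTo("m") < 0 ? 0 : 1;
  }
}

In a real job the split points would come from sampling the map output keys, as described above.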
Combiners
When maps produce many repeated keys
It is often useful to do a local aggregation following the map, done by specifying a Combiner
Goal is to decrease the size of the transient data
Combiners have the same interface as Reducers, and often are the same class
Combiners must not have side effects, because they run an indeterminate number of times
In WordCount, conf.setCombinerClass(Reduce.class);
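For WordCount the reduce is a sum, which is associative, so the same class can serve as the combiner. A sketch of such a Reducer on the old mapred API (the class name Reduce matches the setCombinerClass call above; the rest is illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // raw 1s from maps, or partial sums from combiners
    }
    output.collect(key, new IntWritable(sum));
  }
}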
Compression
Compressing the outputs and intermediate data will often yield huge performance gains
Can be specified via a configuration file or set programmatically
Set mapred.output.compress to true to compress job output
Set mapred.compress.map.output to true to compress map outputs
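A minimal sketch of setting the same properties programmatically through the old JobConf API; the choice of the Gzip codec is an assumption.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressionExample {
  public static void configure(JobConf conf) {
    // Compress the final job output (mapred.output.compress)
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    // Compress the intermediate map outputs (mapred.compress.map.output)
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
  }
}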
Counters
Often Map/Reduce applications have countable events
For example, the framework counts records into and out of the Mapper and Reducer
To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
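For context, a sketch of where such a counter lives: inside a map() method on the old mapred API. The class name, counter names, and the "empty line" condition are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  static enum Counter { EVENT1, EVENT2 }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    if (value.getLength() == 0) {
      reporter.incrCounter(Counter.EVENT1, 1);   // count empty input lines
    }
    output.collect(value, new LongWritable(1));
  }
}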
Speculative execution
The framework can run multiple instances of slow tasks
Output from the instance that finishes first is used
Controlled by the configuration variable mapred.speculative.execution
Can dramatically bring in long tails on jobs
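The old JobConf API also exposes setters for this; a hedged sketch (exact property and method availability varies by Hadoop version):

import org.apache.hadoop.mapred.JobConf;

public class SpeculationExample {
  public static void configure(JobConf conf, boolean enable) {
    conf.setSpeculativeExecution(enable);        // maps and reduces together
    conf.setMapSpeculativeExecution(enable);     // or control each side separately
    conf.setReduceSpeculativeExecution(enable);
  }
}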
Zero Reduces
Frequently, we only need to run a filter on the input data
No sorting or shuffling required by the job
Set the number of reduces to 0
Output from maps will go directly to OutputFormat and disk
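A minimal sketch, assuming the old JobConf API:

import org.apache.hadoop.mapred.JobConf;

public class MapOnly {
  // Turn a job into a map-only (filter) job: with zero reduces there is
  // no sort/shuffle and map output goes straight to the OutputFormat.
  public static void makeMapOnly(JobConf conf) {
    conf.setNumReduceTasks(0);
  }
}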
Distributed File Cache
Define list of files you need to download in JobConf
Files are downloaded once per computer
Add to launching program:
DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);
Add to task:
Path[] files = DistributedCache.getLocalCacheFiles(conf);
Tool
Handle standard Hadoop command line options:
-conf file - load a configuration file named file
-D prop=value - define a single configuration property prop
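A minimal sketch of picking these options up via the Tool interface and ToolRunner; the class name MyTool and the omitted job setup are placeholders.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyTool extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already reflects any -conf files and -D prop=value overrides
    JobConf conf = new JobConf(getConf(), MyTool.class);
    // ... set mapper, reducer, input and output paths here, then submit ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options before calling run()
    System.exit(ToolRunner.run(new MyTool(), args));
  }
}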
Profiling
Set mapred.task.profile to true
Use mapred.task.profile.{maps|reduces}
hprof support is built-in
Use mapred.task.profile.params to set options for the profiler
Possibly use DistributedCache for the profiler's agent
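A hedged sketch of setting these properties programmatically; the hprof parameter string and the 0-1 task ranges are illustrative.

import org.apache.hadoop.mapred.JobConf;

public class ProfilingExample {
  public static void configure(JobConf conf) {
    conf.setBoolean("mapred.task.profile", true);
    // Profile only the first two map and reduce task attempts
    conf.set("mapred.task.profile.maps", "0-1");
    conf.set("mapred.task.profile.reduces", "0-1");
    // Options handed to the built-in hprof agent
    conf.set("mapred.task.profile.params",
             "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s");
  }
}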
Job counters
Task status
Drilling down
Performance
Is your input splittable?
Gzipped files are NOT splittable
Use compressed SequenceFiles
Are partitioners uniform?
Buffering sizes (especially io.sort.mb)
Can you avoid Reduce step?
Only use singleton reduces for very small data
Use Partitioners and cat to get a total order
Memory usage
Please do not load all of your inputs into memory!
Q&A
For more information:
Website: http://hadoop.apache.org/core
Mailing lists:
[email protected]
[email protected]
[email protected]