Programming Hadoop
ApacheCon US 2008
Apache Hadoop
Developer since April 2006
Core Committer (Map-Reduce)
Member of the Hadoop PMC
Hadoop - Overview
Hadoop includes:
Distributed File System - distributes data
Map/Reduce - distributes application
Map-Reduce
Map-Reduce is a programming model for efficient distributed computing
It works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
Efficiency from
Streaming through data, reducing seeks
Pipelining
Map/Reduce features
Fine grained Map and Reduce tasks
Improved load balancing
Faster recovery from failed tasks
Locality optimizations
With big data, bandwidth to data is a problem
Map-Reduce + HDFS is a very effective solution
Map-Reduce queries HDFS for locations of input data
Map tasks are scheduled local to the inputs when possible
Map/Reduce Dataflow
Example
45% of all Hadoop tutorials count words. 25% count sentences. 20% are about paragraphs. 10% are log parsers. The remainder are helpful.
jandersen, http://twitter.com/jandersen/statuses/926856631
Input and Output Formats
A Map/Reduce may specify how its input is to be read by specifying an InputFormat, and how its output is to be written by specifying an OutputFormat
These default to TextInputFormat and TextOutputFormat, which process line-based text data
SequenceFile: SequenceFileInputFormat and SequenceFileOutputFormat
These are file-based, but they are not required to be
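Illustrative only (not from the slides): on the old org.apache.hadoop.mapred API the formats are chosen on the JobConf roughly as below; the paths and key/value classes are assumptions.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class FormatExample {
  public static void configure(JobConf conf) {
    // Read line-oriented text (the default InputFormat)
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("in-dir"));
    // Write the job output as a binary SequenceFile instead of text
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(conf, new Path("out-dir"));
  }
}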
Configuring a Job
Jobs are controlled by configuring JobConf
JobConfs are maps from attribute names to string values
The framework defines attributes to control how the job is executed
conf.set("mapred.job.name", "MyApp");
The launching program then submits the job and typically waits for it to complete
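A minimal launching-program sketch (not verbatim from the slides) on the old mapred API; IdentityMapper and IdentityReducer stand in for real application classes, and the class name MyApp is illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MyApp {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyApp.class);
    conf.setJobName("MyApp");                    // equivalent to conf.set("mapred.job.name", "MyApp")
    conf.setMapperClass(IdentityMapper.class);   // stand-ins for real Map and Reduce classes
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);  // TextInputFormat keys are byte offsets
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);                      // submit and wait for completion
  }
}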
Non-Java Interfaces
Streaming
Pipes (C++)
Pig
Hive
Jaql
Cascading
Streaming
What about Unix hacks?
Can define Mapper and Reducer using Unix text filters
Typically use grep, sed, python, or perl scripts
Format for input and output is: key \t value \n
Allows for easy debugging and experimentation
Slower than Java programs
bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir
-mapper streamingMapper.sh -reducer streamingReducer.sh
Mapper: /bin/sed -e 's| |\n|g' | /bin/grep .
Reducer: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'
Pipes (C++)
C++ API and library to link application with
C++ application is launched as a sub-process of the Java task
Keys and values are std::string with binary data
Word count map looks like:
class WordCountMap: public HadoopPipes::Mapper {
public:
WordCountMap(HadoopPipes::TaskContext& context){}
void map(HadoopPipes::MapContext& context) {
std::vector<std::string> words =
HadoopUtils::splitString(context.getInputValue(), " ");
for(unsigned int i=0; i < words.size(); ++i) {
context.emit(words[i], "1");
}}};
Pipes (C++)
The reducer looks like:
class WordCountReduce: public HadoopPipes::Reducer {
public:
WordCountReduce(HadoopPipes::TaskContext& context){}
void reduce(HadoopPipes::ReduceContext& context) {
int sum = 0;
while (context.nextValue()) {
sum += HadoopUtils::toInt(context.getInputValue());
}
context.emit(context.getInputKey(), HadoopUtils::toString(sum));
}
};
Pipes (C++)
And define a main function to invoke the tasks:
int main(int argc, char *argv[]) {
  // TemplateFactory parameters: mapper, reducer, partitioner (void = default),
  // combiner (here the reducer class is reused as the combiner)
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce,
                                   void, WordCountReduce>());
}
Pig
Word Count in Pig Latin:
input = LOAD 'in-dir' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'out-dir';
Reduces
Unless the amount of data being processed is small, use about
0.95 * num_nodes * mapred.tasktracker.reduce.tasks.maximum
Performance Example
Bob wants to count lines in text files totaling several terabytes
He uses
Identity Mapper (input: text, output: same text)
A single Reducer that counts the lines and outputs the total
Every map output record is shuffled to that one Reducer, so a single node becomes the bottleneck
Partitioners
Partitioners are application code that define how keys are assigned to reduces
Default partitioning spreads keys evenly, but randomly
Uses key.hashCode() % num_reduces
Custom partitioning is often required, for example, to produce a total order in the output
Should implement Partitioner interface
Set by calling conf.setPartitionerClass(MyPart.class)
To get a total order, sample the map output keys and pick values that divide the keys into roughly equal buckets, then use those split points in your partitioner
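A hedged sketch of what such a partitioner can look like on the old mapred API; the class name MyPart, the single split point "m", and the assumption of two reduces are all illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyPart implements Partitioner<Text, IntWritable> {
  public void configure(JobConf conf) {}   // no per-job setup needed for this sketch

  // Keys before "m" go to reduce 0, the rest to reduce 1, so concatenating
  // the reduce outputs in order yields a totally ordered result.
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks < 2) {
      return 0;
    }
    return key.toString().compareTo("m") < 0 ? 0 : 1;
  }
}

In a real job the split points would come from sampling the map output keys, as described above.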
Combiners
When maps produce many repeated keys
It is often useful to do a local aggregation following the map, done by specifying a Combiner
Goal is to decrease the size of the transient data
Combiners have the same interface as Reducers, and often are the same class
Combiners must not have side effects, because they run an indeterminate number of times
In WordCount, conf.setCombinerClass(Reduce.class);
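For WordCount the reduce is a sum, which is associative, so the same class can serve as the combiner. A sketch of such a Reducer on the old mapred API (the class name Reduce matches the setCombinerClass call above; the rest is illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // raw 1s from maps, or partial sums from combiners
    }
    output.collect(key, new IntWritable(sum));
  }
}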
Compression
Compressing the outputs and intermediate data will often yield huge performance gains
Can be specified via a configuration file or set programmatically
Set mapred.output.compress to true to compress job output
Set mapred.compress.map.output to true to compress map outputs
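A minimal sketch of setting the same properties programmatically through the old JobConf API; the choice of the Gzip codec is an assumption.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressionExample {
  public static void configure(JobConf conf) {
    // Compress the final job output (mapred.output.compress)
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    // Compress the intermediate map outputs (mapred.compress.map.output)
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
  }
}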
Counters
Often Map/Reduce applications have countable events
For example, the framework counts records into and out of the Mapper and Reducer
To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
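For context, a sketch of where such a counter lives: inside a map() method on the old mapred API. The class name, counter names, and the "empty line" condition are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  static enum Counter { EVENT1, EVENT2 }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    if (value.getLength() == 0) {
      reporter.incrCounter(Counter.EVENT1, 1);   // count empty input lines
    }
    output.collect(value, new LongWritable(1));
  }
}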
Speculative execution
The framework can run multiple instances of slow tasks
Output from the instance that finishes first is used
Controlled by the configuration variable mapred.speculative.execution
Can dramatically bring in long tails on jobs
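The old JobConf API also exposes setters for this; a hedged sketch (exact property and method availability varies by Hadoop version):

import org.apache.hadoop.mapred.JobConf;

public class SpeculationExample {
  public static void configure(JobConf conf, boolean enable) {
    conf.setSpeculativeExecution(enable);        // maps and reduces together
    conf.setMapSpeculativeExecution(enable);     // or control each side separately
    conf.setReduceSpeculativeExecution(enable);
  }
}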
Zero Reduces
Frequently, we only need to run a filter on the input data
No sorting or shuffling required by the job
Set the number of reduces to 0
Output from maps will go directly to OutputFormat and disk
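A minimal sketch, assuming the old JobConf API:

import org.apache.hadoop.mapred.JobConf;

public class MapOnly {
  // Turn a job into a map-only (filter) job: with zero reduces there is
  // no sort/shuffle and map output goes straight to the OutputFormat.
  public static void makeMapOnly(JobConf conf) {
    conf.setNumReduceTasks(0);
  }
}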
Distributed File Cache
Define list of files you need to download in JobConf
Files are downloaded once per computer
Add to launching program:
DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);
Add to task:
Path[] files = DistributedCache.getLocalCacheFiles(conf);
Tool
Handle standard Hadoop command line options:
-conf file - load a configuration file named file
-D prop=value - define a single configuration property prop
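A minimal sketch of picking these options up via the Tool interface and ToolRunner; the class name MyTool and the omitted job setup are placeholders.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyTool extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already reflects any -conf files and -D prop=value overrides
    JobConf conf = new JobConf(getConf(), MyTool.class);
    // ... set mapper, reducer, input and output paths here, then submit ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options before calling run()
    System.exit(ToolRunner.run(new MyTool(), args));
  }
}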
Profiling
Set mapred.task.profile to true
Use mapred.task.profile.{maps|reduces}
hprof support is built-in
Use mapred.task.profile.params to set options for the profiler
Possibly use DistributedCache for the profiler's agent
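A hedged sketch of setting these properties programmatically; the hprof parameter string and the 0-1 task ranges are illustrative.

import org.apache.hadoop.mapred.JobConf;

public class ProfilingExample {
  public static void configure(JobConf conf) {
    conf.setBoolean("mapred.task.profile", true);
    // Profile only the first two map and reduce task attempts
    conf.set("mapred.task.profile.maps", "0-1");
    conf.set("mapred.task.profile.reduces", "0-1");
    // Options handed to the built-in hprof agent
    conf.set("mapred.task.profile.params",
             "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s");
  }
}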
Job counters
Task status
Drilling down
Performance
Is your input splittable?
Gzipped files are NOT splittable
Use compressed SequenceFiles
Are partitioners uniform?
Buffering sizes (especially io.sort.mb)
Can you avoid Reduce step?
Only use singleton reduces for very small data
Use Partitioners and cat to get a total order
Memory usage
Please do not load all of your inputs into memory!
Q&A
For more information:
Website: http://hadoop.apache.org/core
Mailing lists:
[email protected]
[email protected]
[email protected]