Hadoop Training #4: Programming With Hadoop
Learn how to get started writing programs against Hadoop's API.
Check http://www.cloudera.com/hadoop-training-basic for training videos.
Some MapReduce Terminology
• Job – A “full program”: an execution of a Mapper and Reducer across a data set
• Task – An execution of a Mapper or a Reducer on a slice of data – a.k.a. Task-In-Progress (TIP)
• Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example
• Running one MapReduce program across 20 files is one job
• 20 files to be mapped imply 20 map tasks + some number of reduce tasks
• At least 20 map task attempts will be performed… more if a machine crashes, etc.
Task Attempts
• A particular task will be attempted at least once, possibly more times if it crashes
  – If the same input causes crashes over and over, that input will eventually be abandoned
• Multiple attempts at one task may occur in parallel with speculative execution turned on
  – Task ID from TaskInProgress is not a unique identifier; don’t use it that way
Job Distribution
• MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
• Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
Data Distribution
• Implicit in the design of MapReduce!
  – All mappers are equivalent, so map whatever data is local to a particular node in HDFS
• If lots of data does happen to pile up on the same node, nearby nodes will map instead
  – Data transfer is handled implicitly by HDFS
Configuring With JobConf
• MR programs have many configurable options
• JobConf objects hold (key, value) components mapping String names to values
  – e.g., “mapred.map.tasks” → 20
  – JobConf is serialized and distributed before running the job
• Objects implementing JobConfigurable can retrieve elements from a JobConf
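For example, a small sketch of setting and reading options on a JobConf with the classic org.apache.hadoop.mapred API (the class and job names are placeholders):

```java
import org.apache.hadoop.mapred.JobConf;

public class ConfExample {
  public static void main(String[] args) {
    // Build a JobConf; the driver class used here is just this example class.
    JobConf conf = new JobConf(ConfExample.class);
    conf.setJobName("conf-example");

    // Options are plain (String key, String value) pairs.
    conf.set("mapred.map.tasks", "20");
    int maps = conf.getInt("mapred.map.tasks", 1);  // read back with a default
    System.out.println("requested map tasks: " + maps);

    // A class implementing JobConfigurable would receive this JobConf in its
    // configure(JobConf) method and could read the same keys there.
  }
}
```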
Job Launch Process: JobClient
• Pass JobConf to JobClient.runJob() or submitJob()
  – runJob() blocks, submitJob() does not
• JobClient:
  – Determines proper division of input into InputSplits
  – Sends job data to master JobTracker server
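A minimal driver sketch of the two launch styles (input and output paths come from the command line; mapper/reducer setup is omitted):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LaunchExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LaunchExample.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Blocking: runJob() submits the job and waits for completion,
    // reporting progress as it goes.
    JobClient.runJob(conf);

    // Non-blocking alternative: submitJob() returns immediately with a
    // RunningJob handle you can poll.
    // RunningJob rj = new JobClient(conf).submitJob(conf);
  }
}
```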
Job Launch Process: TaskTracker
• TaskTrackers running on slave nodes periodically query the JobTracker for work
• Retrieve job-specific jar and config
• Launch task in a separate instance of Java
  – main() is provided by Hadoop
Job Launch Process: Task
• TaskTracker.Child.main():
  – Sets up the child TaskInProgress attempt
  – Reads XML configuration
  – Connects back to necessary MapReduce components via RPC
  – Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
• TaskRunner launches your Mapper
  – Task knows ahead of time which InputSplits it should be mapping
  – Calls Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same
Creating the Mapper
• You provide the instance of Mapper
  – Should extend MapReduceBase
• One instance of your Mapper is initialized per task
  – Exists in a separate process from all other instances of Mapper – no data sharing!
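For example, a minimal word-count-style Mapper against the classic API (the class name and token-splitting logic are illustrative, not from the slides):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Extends MapReduceBase and implements
// Mapper<input key, input value, output key, output value>.
public class TokenMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();   // reused container, see next slide

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) {
        continue;
      }
      word.set(token);
      output.collect(word, ONE);   // emit (token, 1)
    }
  }
}
```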
What is Writable?
• Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
• All values are instances of Writable
• All keys are instances of WritableComparable
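A sketch of the box classes in use, plus the two serialization hooks a hypothetical custom value type would need (a key type would implement WritableComparable and add compareTo()):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class WritableExample {
  // Built-in box classes wrap strings and primitives for Hadoop's wire format.
  static Text word = new Text("hadoop");
  static IntWritable count = new IntWritable(42);

  // A minimal custom value type (illustrative): implement Writable's
  // serialization and deserialization hooks.
  public static class PairWritable implements Writable {
    private int left;
    private int right;

    public void write(DataOutput out) throws IOException {
      out.writeInt(left);
      out.writeInt(right);
    }

    public void readFields(DataInput in) throws IOException {
      left = in.readInt();
      right = in.readInt();
    }
  }
}
```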
Writing For Cache Coherency
• Running the GC takes time
• Reusing locations allows better cache usage (up to 2x performance benefit)
• All keys and values given to you by Hadoop use this model (they share container objects)
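One practical consequence, sketched below: if a Reducer buffers the values it is handed, it must copy them, because the framework may reuse a single container object for every record (the class is illustrative):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// The values iterator may return the SAME Text instance on every next() call,
// so anything you keep a reference to must be copied first.
public class BufferingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    List<Text> buffered = new ArrayList<Text>();
    while (values.hasNext()) {
      // new Text(v) copies the bytes; adding values.next() directly would
      // leave the list full of aliases to one reused object.
      buffered.add(new Text(values.next()));
    }
    for (Text v : buffered) {
      output.collect(key, v);
    }
  }
}
```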
Reading Data
• Data sets are specified by InputFormats
  – Defines input data (e.g., a directory)
  – Identifies partitions of the data that form an InputSplit
  – Factory for RecordReader objects to extract (k, v) records from the input source
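A driver-side sketch of pointing a JobConf at an input directory and choosing an InputFormat (the path is a placeholder):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputSetup {
  public static void configure(JobConf conf) {
    // Point the job at an input directory (placeholder path); the InputFormat
    // enumerates the files there, carves them into InputSplits, and acts as a
    // factory for RecordReaders.
    FileInputFormat.setInputPaths(conf, new Path("/data/input"));

    // TextInputFormat (the default): key = byte offset, value = line of text.
    conf.setInputFormat(TextInputFormat.class);

    // Alternative: KeyValueTextInputFormat splits each line on a tab into
    // (Text key, Text value) pairs.
    // conf.setInputFormat(KeyValueTextInputFormat.class);
  }
}
```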
Record Readers
• Each InputFormat provides its own RecordReader implementation
  – Provides (unused?) capability multiplexing
• LineRecordReader – Reads a line from a text file
• KeyValueRecordReader – Used by KeyValueTextInputFormat
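Roughly how any RecordReader gets driven, per the classic interface; this loop is a sketch of what the framework does for you, not code you normally write:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

public class ReaderLoop {
  // Create reusable key and value containers, then pull records until
  // next() reports that the split is exhausted.
  static void drive(RecordReader<LongWritable, Text> reader) throws IOException {
    LongWritable key = reader.createKey();
    Text value = reader.createValue();
    while (reader.next(key, value)) {
      // e.g., for LineRecordReader: key = byte offset, value = line text
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}
```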
Sending Data To Reducers
• Map function receives OutputCollector object
  – OutputCollector.collect() takes (k, v) elements
• Any (WritableComparable, Writable) can be used
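For example, a summing Reducer that emits (Text, IntWritable) pairs through its own OutputCollector (the class name is illustrative):

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Any (WritableComparable, Writable) output pair works; here (Text, IntWritable).
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    total.set(sum);
    output.collect(key, total);   // emit (word, count)
  }
}
```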
Sending Data To The Client
• Reporter object sent to Mapper allows simple asynchronous feedback
  – incrCounter(Enum key, long amount)
  – setStatus(String msg)
• Allows self-identification of input
  – InputSplit getInputSplit()
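A sketch of Reporter in use inside a Mapper; the counter enum and status message are made-up examples:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReportingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Counters are keyed by an enum you define yourself.
  enum Stats { BAD_RECORDS }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    if (value.getLength() == 0) {
      reporter.incrCounter(Stats.BAD_RECORDS, 1);        // asynchronous counter bump
      return;
    }
    reporter.setStatus("processing offset " + key.get()); // free-form status string
    // reporter.getInputSplit() identifies which split this task is reading
    output.collect(new Text("ok"), value);
  }
}
```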
Partitioner
• int getPartition(key, val, numPartitions)
  – Outputs the partition number for a given key
  – One partition == values sent to one Reduce task
• HashPartitioner used by default
  – Uses key.hashCode() to return partition num
• JobConf sets Partitioner implementation
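A contrived custom Partitioner, just to show the hook; it is registered on the job's JobConf via setPartitionerClass():

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes keys starting with 'a'..'m' to one partition and everything else to
// another. HashPartitioner remains the default if nothing is set.
public class AlphaPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // Partitioner extends JobConfigurable, so options could be read here.
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String k = key.toString();
    char first = k.isEmpty() ? 'z' : Character.toLowerCase(k.charAt(0));
    int partition = (first >= 'a' && first <= 'm') ? 0 : 1;
    return partition % numPartitions;   // always within [0, numPartitions)
  }
}

// Registration in the driver ('conf' is the job's JobConf):
// conf.setPartitionerClass(AlphaPartitioner.class);
```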
Conclusions
• That’s the Hadoop flow!
• Lots of flexibility to override components, customize inputs and outputs
• Using custom-built binary formats allows high-speed data movement
Hadoop Streaming Motivation
• You want to use a scripting language
  – Faster development time
  – Easier to read, debug
  – Use existing libraries
• You (still) have lots of data
Hadoop Streaming
• Interfaces Hadoop MapReduce with arbitrary program code
• Uses stdin and stdout for data flow
• You define a separate program for each of mapper, reducer
Data Format
• Input (key, val) pairs are sent in as lines of input:
  key (tab) val (newline)
• Data is naturally transmitted as text
• You emit lines of the same form on stdout for output (key, val) pairs
Launching Streaming Jobs
• Special jar contains the streaming “job”
• Arguments select mapper, reducer, format…
• Can also specify Java classes
  – Note: must be in Hadoop “internal” library
Streaming Conclusions
• Fast, simple, powerful
• Low-overhead way to get started with Hadoop
• Resources:
  – http://wiki.apache.org/hadoop/HadoopStreaming
  – http://hadoop.apache.org/core/docs/current/streaming.html