Hadoop Training #4: Programming With Hadoop
Learn how to get started writing programs against Hadoop's API.
Check http://www.cloudera.com/hadoop-training-basic for training videos.
Some MapReduce Terminology
• Job – A “full program”: an execution of a Mapper and Reducer across a data set
• Task – An execution of a Mapper or a Reducer on a slice of data – a.k.a. Task-In-Progress (TIP)
• Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example
• Running one MapReduce program across 20 files is one job
• 20 files to be mapped imply 20 map tasks + some number of reduce tasks
• At least 20 map task attempts will be performed… more if a machine crashes, etc.
Task Attempts
• A particular task will be attempted at least once, possibly more times if it crashes
  – If the same input causes crashes over and over, that input will eventually be abandoned
• Multiple attempts at one task may occur in parallel with speculative execution turned on
  – Task ID from TaskInProgress is not a unique identifier; don’t use it that way
Job Distribution
• MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
• Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
Data Distribution
• Implicit in the design of MapReduce!
  – All mappers are equivalent, so map whatever data is local to a particular node in HDFS
• If lots of data does happen to pile up on the same node, nearby nodes will map instead
  – Data transfer is handled implicitly by HDFS
Configuring With JobConf
• MR programs have many configurable options
• JobConf objects hold (key, value) components mapping String names to values
  – e.g., “mapred.map.tasks” → 20
  – JobConf is serialized and distributed before running the job
• Objects implementing JobConfigurable can retrieve elements from a JobConf
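For example, a small sketch of setting and reading options on a JobConf with the classic org.apache.hadoop.mapred API (the class and job names are placeholders):

```java
import org.apache.hadoop.mapred.JobConf;

public class ConfExample {
  public static void main(String[] args) {
    // Build a JobConf; the driver class used here is just this example class.
    JobConf conf = new JobConf(ConfExample.class);
    conf.setJobName("conf-example");

    // Options are plain (String key, String value) pairs.
    conf.set("mapred.map.tasks", "20");
    int maps = conf.getInt("mapred.map.tasks", 1);  // read back with a default
    System.out.println("requested map tasks: " + maps);

    // A class implementing JobConfigurable would receive this JobConf in its
    // configure(JobConf) method and could read the same keys there.
  }
}
```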
Job Launch Process: JobClient
• Pass JobConf to JobClient.runJob() or submitJob()
  – runJob() blocks, submitJob() does not
• JobClient:
  – Determines proper division of input into InputSplits
  – Sends job data to master JobTracker server
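A minimal driver sketch of the two launch styles (input and output paths come from the command line; mapper/reducer setup is omitted):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LaunchExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LaunchExample.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Blocking: runJob() submits the job and waits for completion,
    // reporting progress as it goes.
    JobClient.runJob(conf);

    // Non-blocking alternative: submitJob() returns immediately with a
    // RunningJob handle you can poll.
    // RunningJob rj = new JobClient(conf).submitJob(conf);
  }
}
```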
Job Launch Process: TaskTracker
• TaskTrackers running on slave nodes periodically query the JobTracker for work
• Retrieve job-specific jar and config
• Launch task in a separate instance of Java
  – main() is provided by Hadoop
Job Launch Process: Task
• TaskTracker.Child.main():
  – Sets up the child TaskInProgress attempt
  – Reads XML configuration
  – Connects back to necessary MapReduce components via RPC
  – Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
• TaskRunner launches your Mapper
  – Task knows ahead of time which InputSplits it should be mapping
  – Calls Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same
Creating the Mapper
• You provide the instance of Mapper
  – Should extend MapReduceBase
• One instance of your Mapper is initialized per task
  – Exists in a separate process from all other instances of Mapper – no data sharing!
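For example, a minimal word-count-style Mapper against the classic API (the class name and token-splitting logic are illustrative, not from the slides):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Extends MapReduceBase and implements
// Mapper<input key, input value, output key, output value>.
public class TokenMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();   // reused container, see next slide

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) {
        continue;
      }
      word.set(token);
      output.collect(word, ONE);   // emit (token, 1)
    }
  }
}
```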
What is Writable?
• Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
• All values are instances of Writable
• All keys are instances of WritableComparable
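A sketch of the box classes in use, plus the two serialization hooks a hypothetical custom value type would need (a key type would implement WritableComparable and add compareTo()):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class WritableExample {
  // Built-in box classes wrap strings and primitives for Hadoop's wire format.
  static Text word = new Text("hadoop");
  static IntWritable count = new IntWritable(42);

  // A minimal custom value type (illustrative): implement Writable's
  // serialization and deserialization hooks.
  public static class PairWritable implements Writable {
    private int left;
    private int right;

    public void write(DataOutput out) throws IOException {
      out.writeInt(left);
      out.writeInt(right);
    }

    public void readFields(DataInput in) throws IOException {
      left = in.readInt();
      right = in.readInt();
    }
  }
}
```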
Writing For Cache Coherency
• Running the GC takes time
• Reusing locations allows better cache usage (up to 2x performance benefit)
• All keys and values given to you by Hadoop use this model (they share container objects)
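One practical consequence, sketched below: if a Reducer buffers the values it is handed, it must copy them, because the framework may reuse a single container object for every record (the class is illustrative):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// The values iterator may return the SAME Text instance on every next() call,
// so anything you keep a reference to must be copied first.
public class BufferingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    List<Text> buffered = new ArrayList<Text>();
    while (values.hasNext()) {
      // new Text(v) copies the bytes; adding values.next() directly would
      // leave the list full of aliases to one reused object.
      buffered.add(new Text(values.next()));
    }
    for (Text v : buffered) {
      output.collect(key, v);
    }
  }
}
```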
Reading Data
• Data sets are specified by InputFormats
  – Defines input data (e.g., a directory)
  – Identifies partitions of the data that form an InputSplit
  – Factory for RecordReader objects to extract (k, v) records from the input source
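A driver-side sketch of pointing a JobConf at an input directory and choosing an InputFormat (the path is a placeholder):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputSetup {
  public static void configure(JobConf conf) {
    // Point the job at an input directory (placeholder path); the InputFormat
    // enumerates the files there, carves them into InputSplits, and acts as a
    // factory for RecordReaders.
    FileInputFormat.setInputPaths(conf, new Path("/data/input"));

    // TextInputFormat (the default): key = byte offset, value = line of text.
    conf.setInputFormat(TextInputFormat.class);

    // Alternative: KeyValueTextInputFormat splits each line on a tab into
    // (Text key, Text value) pairs.
    // conf.setInputFormat(KeyValueTextInputFormat.class);
  }
}
```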
Record Readers
• Each InputFormat provides its own RecordReader implementation
  – Provides (unused?) capability multiplexing
• LineRecordReader – Reads a line from a text file
• KeyValueRecordReader – Used by KeyValueTextInputFormat
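Roughly how any RecordReader gets driven, per the classic interface; this loop is a sketch of what the framework does for you, not code you normally write:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

public class ReaderLoop {
  // Create reusable key and value containers, then pull records until
  // next() reports that the split is exhausted.
  static void drive(RecordReader<LongWritable, Text> reader) throws IOException {
    LongWritable key = reader.createKey();
    Text value = reader.createValue();
    while (reader.next(key, value)) {
      // e.g., for LineRecordReader: key = byte offset, value = line text
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}
```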
Sending Data To Reducers
• Map function receives OutputCollector object
  – OutputCollector.collect() takes (k, v) elements
• Any (WritableComparable, Writable) can be used
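For example, a summing Reducer that emits (Text, IntWritable) pairs through its own OutputCollector (the class name is illustrative):

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Any (WritableComparable, Writable) output pair works; here (Text, IntWritable).
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    total.set(sum);
    output.collect(key, total);   // emit (word, count)
  }
}
```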
Sending Data To The Client
• Reporter object sent to Mapper allows simple asynchronous feedback
  – incrCounter(Enum key, long amount)
  – setStatus(String msg)
• Allows self-identification of input
  – InputSplit getInputSplit()
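A sketch of Reporter in use inside a Mapper; the counter enum and status message are made-up examples:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReportingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Counters are keyed by an enum you define yourself.
  enum Stats { BAD_RECORDS }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    if (value.getLength() == 0) {
      reporter.incrCounter(Stats.BAD_RECORDS, 1);        // asynchronous counter bump
      return;
    }
    reporter.setStatus("processing offset " + key.get()); // free-form status string
    // reporter.getInputSplit() identifies which split this task is reading
    output.collect(new Text("ok"), value);
  }
}
```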
Partitioner
• int getPartition(key, val, numPartitions)
  – Outputs the partition number for a given key
  – One partition == values sent to one Reduce task
• HashPartitioner used by default
  – Uses key.hashCode() to return partition num
• JobConf sets Partitioner implementation
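A contrived custom Partitioner, just to show the hook; it is registered on the job's JobConf via setPartitionerClass():

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes keys starting with 'a'..'m' to one partition and everything else to
// another. HashPartitioner remains the default if nothing is set.
public class AlphaPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // Partitioner extends JobConfigurable, so options could be read here.
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String k = key.toString();
    char first = k.isEmpty() ? 'z' : Character.toLowerCase(k.charAt(0));
    int partition = (first >= 'a' && first <= 'm') ? 0 : 1;
    return partition % numPartitions;   // always within [0, numPartitions)
  }
}

// Registration in the driver ('conf' is the job's JobConf):
// conf.setPartitionerClass(AlphaPartitioner.class);
```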
Conclusions
• That’s the Hadoop flow!
• Lots of flexibility to override components, customize inputs and outputs
• Using custom-built binary formats allows high-speed data movement
Hadoop Streaming Motivation
• You want to use a scripting language
  – Faster development time
  – Easier to read, debug
  – Use existing libraries
• You (still) have lots of data
Hadoop Streaming
• Interfaces Hadoop MapReduce with arbitrary program code
• Uses stdin and stdout for data flow
• You define a separate program for each of mapper, reducer
Data Format
• Input (key, val) pairs are sent in as lines of input:
  key (tab) val (newline)
• Data is naturally transmitted as text
• You emit lines of the same form on stdout for output (key, val) pairs
Launching Streaming Jobs
• Special jar contains the streaming “job”
• Arguments select mapper, reducer, format…
• Can also specify Java classes
  – Note: must be in Hadoop “internal” library
Streaming Conclusions
• Fast, simple, powerful
• Low-overhead way to get started with Hadoop
• Resources:
  – http://wiki.apache.org/hadoop/HadoopStreaming
  – http://hadoop.apache.org/core/docs/current/streaming.html