Cloudera Developer Training
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
201202
Chapter 1
Introduction
Introduction
About This Course
About Cloudera
Course Logistics
Course Objectives
During this course, you will learn:
The core technologies of Hadoop
How HDFS and MapReduce work
What other projects exist in the Hadoop ecosystem
How to develop MapReduce jobs
How Hadoop integrates into the datacenter
Algorithms for common MapReduce tasks
How to create large workflows using Oozie
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Introduction
About This Course
About Cloudera
Course Logistics
About Cloudera
Cloudera is the commercial Hadoop company
Founded by leading experts on Hadoop from Facebook, Google,
Oracle and Yahoo
Provides consulting and training services for Hadoop users
Staff includes committers to virtually all Hadoop projects
Cloudera Software
Cloudera's Distribution including Apache Hadoop (CDH)
A single, easy-to-install package from the Apache Hadoop core
repository
Includes a stable version of Hadoop, plus critical bug fixes and
solid new features from the development version
100% open source
Cloudera Manager, Free Edition
The easiest way to deploy a Hadoop cluster
Automates installation of Hadoop software
Installation, monitoring and configuration are performed from a
central machine
Manages up to 50 nodes
Completely free
Cloudera Enterprise
Cloudera Enterprise
Complete package of software and support
Built on top of CDH
Includes full version of Cloudera Manager
Install, manage, and maintain a cluster of any size
LDAP integration
Includes powerful cluster monitoring and auditing tools
Resource consumption tracking
Proactive health checks
Alerting
Configuration change audit trails
And more
24 x 7 support
Cloudera Services
Provides consultancy services to many key users of Hadoop
Including AOL Advertising, Samsung, Groupon, NAVTEQ, Trulia,
Tynt, RapLeaf, Explorys Medical
Solutions Architects are experts in Hadoop and related
technologies
Several are committers to the Apache Hadoop project
Provides training in key areas of Hadoop administration and
development
Courses include System Administrator training, Developer
training, Hive and Pig training, HBase Training, Essentials for
Managers
Custom course development available
Both public and on-site training available
Introduction
About This Course
About Cloudera
Course Logistics
Logistics
Course start and end times
Lunch
Breaks
Restrooms
Can I come in early/stay late?
Certification
Introductions
About your instructor
About you
Experience with Hadoop?
Experience as a developer?
What programming languages do you use?
Expectations from the course?
Chapter 2
The Motivation For Hadoop
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Data Recoverability
If a component of the system fails, its workload should be
assumed by still-functioning units in the system
Failure should not result in the loss of any data
Component Recovery
If a component of the system fails and then recovers, it should
be able to rejoin the system
Without requiring a full restart of the entire system
Consistency
Component failures during execution of a job should not affect
the outcome of the job
Scalability
Adding load to the system should result in a graceful decline in
performance of individual jobs
Not failure of the system
Increasing resources should support a proportional increase in
load capacity
Hadoop's History
Hadoop is based on work done by Google in the late 1990s/early
2000s
Specifically, on papers describing the Google File System (GFS)
published in 2003, and MapReduce published in 2004
This work takes a radical new approach to the problem of
distributed computing
Meets all the requirements we have for reliability and scalability
Core concept: distribute the data as it is initially stored in the
system
Individual nodes can work on data local to those nodes
No data transfer over the network is required for initial
processing
Fault Tolerance
If a node fails, the master will detect that failure and re-assign the
work to a different node on the system
Restarting a task does not require communication with nodes
working on other portions of the data
If a failed node restarts, it is automatically added back to the
system and assigned new tasks
If a node appears to be running slowly, the master can
redundantly execute another instance of the same task
Results from the first to finish will be used
Known as speculative execution
Chapter 3
Hadoop: Basic Concepts
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Hadoop Components
Hadoop consists of two core components
The Hadoop Distributed File System (HDFS)
MapReduce
There are many other projects based around core Hadoop
Often referred to as the Hadoop Ecosystem
Pig, Hive, HBase, Flume, Oozie, Sqoop, etc
Many are discussed later in the course
A set of machines running HDFS and MapReduce is known as a
Hadoop Cluster
Individual machines are known as nodes
A cluster can have as few as one node, as many as several
thousand
More nodes = better performance!
Accessing HDFS
Applications can read and write HDFS files directly via the Java
API
Covered later in the course
Typically, files are created on a local filesystem and must be
moved into HDFS
Likewise, files stored in HDFS may need to be moved to a
machine's local filesystem
Access to HDFS from the command line is achieved with the
hadoop fs command
hadoop fs Examples
Copy file foo.txt from local disk to the user's directory in HDFS
hadoop fs -copyFromLocal foo.txt foo.txt
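A few more hadoop fs examples in the same vein (the paths shown here are illustrative):
Get a directory listing of the user's home directory in HDFS
hadoop fs -ls
Display the contents of the HDFS file /user/fred/foo.txt
hadoop fs -cat /user/fred/foo.txt
Copy that file from HDFS back to the local disk
hadoop fs -copyToLocal /user/fred/foo.txt foo.txt
Delete the file from HDFS
hadoop fs -rm /user/fred/foo.txt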
What Is MapReduce?
MapReduce is a method for distributing a task across multiple
nodes
Each node processes data stored on that node
Where possible
Consists of two phases:
Map
Reduce
Features of MapReduce
Automatic parallelization and distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers
MapReduce programs are usually written in Java
Can be written in any scripting language using Hadoop
Streaming (see later)
All of Hadoop is written in Java
MapReduce abstracts all the housekeeping away from the
developer
Developer can concentrate simply on writing the Map and
Reduce functions
MapReduce: Terminology
A job is a full program
A complete execution of Mappers and Reducers over a dataset
A task is the execution of a single Mapper or Reducer over a slice
of data
A task attempt is a particular instance of an attempt to execute a
task
There will be at least as many task attempts as there are tasks
If a task attempt fails, another will be started by the JobTracker
Speculative execution (see later) can also result in more task
attempts than completed tasks
reduce(String output_key, Iterator<int> intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)
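For completeness, the Map side of the same WordCount example can be sketched in the same pseudocode style (this sketch is not part of the original slide):
map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)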
Submitting A Job
When a client submits a job, its configuration information is
packaged into an XML file
This file, along with the .jar file containing the actual program
code, is handed to the JobTracker
The JobTracker then parcels out individual tasks to TaskTracker
nodes
When a TaskTracker receives a request to run a task, it
instantiates a separate JVM for that task
TaskTracker nodes can be configured to run multiple tasks at the
same time
If the node has enough processing power and memory
Hive
Hive is an abstraction on top of MapReduce
Allows users to query data in the Hadoop cluster without
knowing Java or MapReduce
Uses the HiveQL language
Very similar to SQL
The Hive Interpreter runs on a client machine
Turns HiveQL queries into MapReduce jobs
Submits those jobs to the cluster
Note: this does not turn the cluster into a relational database
server!
It is still simply running MapReduce jobs
Those jobs are created by the Hive Interpreter
Hive (contd)
Sample Hive query:
SELECT stock.product, SUM(orders.purchases)
FROM stock INNER JOIN orders
ON (stock.id = orders.stock_id)
WHERE orders.quarter = 'Q1'
GROUP BY stock.product;
We will investigate Hive in greater detail later in the course
Pig
Pig is an alternative abstraction on top of MapReduce
Uses a dataflow scripting language
Called PigLatin
The Pig interpreter runs on the client machine
Takes the PigLatin script and turns it into a series of MapReduce
jobs
Submits those jobs to the cluster
As with Hive, nothing magical happens on the cluster
It is still simply running MapReduce jobs
Pig (contd)
Sample Pig script:
stock = LOAD '/user/fred/stock' AS (id, item);
orders = LOAD '/user/fred/orders' AS (id, cost);
grpd = GROUP orders BY id;
totals = FOREACH grpd GENERATE group,
SUM(orders.cost) AS t;
result = JOIN stock BY id, totals BY group;
DUMP result;
We will investigate Pig in more detail later in the course
Oozie
Oozie allows developers to create a workflow of MapReduce jobs
Including dependencies between jobs
The Oozie server submits the jobs to the cluster in the correct
sequence
We will investigate Oozie later in the course
HBase
HBase is the Hadoop database
A NoSQL datastore
Can store massive amounts of data
Gigabytes, terabytes, and even petabytes of data in a table
Scales to provide very high write throughput
Hundreds of thousands of inserts per second
Copes well with sparse data
Tables can have many thousands of columns
Even if most columns are empty for any given row
Has a very constrained access model
Insert a row, retrieve a row, do a full or partial table scan
Only one column (the row key) is indexed
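A minimal sketch of that access model using the HBase Java client API of this era; the table name, column family and values are made up for illustration:
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws java.io.IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "webtable");
    // Insert a row
    Put put = new Put(Bytes.toBytes("com.example.www"));
    put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
    table.put(put);
    // Retrieve the row by its row key
    Result result = table.get(new Get(Bytes.toBytes("com.example.www")));
    System.out.println(result);
    table.close();
  }
}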
RDBMS vs HBase:
Data layout: RDBMS is row-oriented; HBase is column-oriented
Transactions: RDBMS yes; HBase single row only
Query language: RDBMS SQL; HBase get/put/scan
Security: RDBMS authentication/authorization; HBase TBD
Indexes: RDBMS on arbitrary columns; HBase row-key only
Max data size: RDBMS TBs; HBase PB+
Read/write throughput limits: RDBMS 1000s of queries/second; HBase millions of queries/second
Conclusion
In this chapter you have learned
What Hadoop is
What features the Hadoop Distributed File System (HDFS)
provides
The concepts behind MapReduce
How a Hadoop cluster operates
What other Hadoop Ecosystem projects exist
Chapter 4
Writing a MapReduce
Program
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
What is Writable?
Hadoop defines its own box classes for strings, integers and so
on
IntWritable for ints
LongWritable for longs
FloatWritable for floats
DoubleWritable for doubles
Text for strings
Etc.
The Writable interface makes serialization quick and easy for
Hadoop
Any value's type must implement the Writable interface
What is WritableComparable?
A WritableComparable is a Writable which is also
Comparable
Two WritableComparables can be compared against each
other to determine their order
Keys must be WritableComparables because they are passed
to the Reducer in sorted order
We will talk more about WritableComparable later
Note that despite their names, all Hadoop box classes implement
both Writable and WritableComparable
For example, IntWritable is actually a
WritableComparable
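As an illustration, a minimal sketch of a custom WritableComparable holding a pair of strings; the class name and fields are hypothetical:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class StringPairWritable implements WritableComparable<StringPairWritable> {
  private String left;
  private String right;

  public StringPairWritable() { }    // Hadoop requires a no-argument constructor

  public StringPairWritable(String left, String right) {
    this.left = left;
    this.right = right;
  }

  public String getLeft() { return left; }

  public void write(DataOutput out) throws IOException {    // serialize the fields
    out.writeUTF(left);
    out.writeUTF(right);
  }

  public void readFields(DataInput in) throws IOException { // deserialize in the same order
    left = in.readUTF();
    right = in.readUTF();
  }

  public int compareTo(StringPairWritable other) {          // defines the sort order of keys
    int cmp = left.compareTo(other.left);
    return (cmp != 0) ? cmp : right.compareTo(other.right);
  }
}
If such a class is used as a key with the default HashPartitioner, hashCode() (and equals()) should also be overridden so that equal keys end up in the same partition.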
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
if (args.length != 2) {
System.out.printf(
"Usage: %s [generic options] <input dir> <output dir>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.out);
return -1;
}
JobConf conf = new JobConf(getConf(), WordCount.class);
conf.setJobName(this.getClass().getName());
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(WordMapper.class);
conf.setReducerClass(SumReducer.class);
To configure the job, create a new JobConf object and specify the class which will be called to run the job.
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(WordMapper.class);
conf.setReducerClass(SumReducer.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
Next, specify the input directory from which data will be read, and the output directory to which final output will be written.
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
Give the JobConf object information about which classes are to be instantiated as the Mapper and Reducer.
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
}
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
int wordCount = 0;
while (values.hasNext()) {
IntWritable value = values.next();
wordCount += value.get();
}
output.collect(key, new IntWritable(wordCount));
The reduce method receives a key and an Iterator of values; it also receives an OutputCollector object and a Reporter object.
int wordCount = 0;
while (values.hasNext()) {
IntWritable value = values.next();
wordCount += value.get();
}
output.collect(key, new IntWritable(wordCount));
Using Eclipse
Launch Eclipse by double-clicking on the Eclipse icon on the
desktop
If you are asked whether you want to send usage data, hit
Cancel
The Old API vs the New API
Old API: import org.apache.hadoop.mapred.*
New API: import org.apache.hadoop.mapreduce.*
The slides compare the Driver code, the Mapper and the Reducer for the two APIs side by side
Old API: configure(JobConf job) (see later)
New API: setup(Context c) and cleanup(Context c)
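For comparison, a sketch of a New API Mapper for word count (this code is not from the original slides):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void setup(Context context) { }      // replaces configure(JobConf) from the old API

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), ONE);      // output goes through the Context object
      }
    }
  }

  @Override
  protected void cleanup(Context context) { }    // replaces close() from the old API
}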
Conclusion
In this chapter you have learned
How to use the Hadoop API to write a MapReduce program in
Java
How to use the Streaming API to write Mappers and Reducers in
other languages
How to use Eclipse to speed up your Hadoop development
The differences between the Old and New Hadoop APIs
Chapter 5
Integrating Hadoop Into
The Workflow
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Introduction
Your data center already has a lot of components
Database servers
Data warehouses
File servers
Backup systems
How does Hadoop fit into this ecosystem?
RDBMS Strengths
Relational Database Management Systems (RDBMSs) have many
strengths
Ability to handle complex transactions
Ability to process hundreds or thousands of queries per second
Real-time delivery of results
Simple but powerful query language
RDBMS Weaknesses
There are some areas where RDBMSs are less ideal
Data schema is determined before data is ingested
Can make ad-hoc data collection difficult
Upper bound on data storage of 100s of terabytes
Practical upper bound on data in a single query of 10s of
terabytes
Benefits of Hadoop
Processing power scales with data storage
As you add more nodes for storage, you get more processing
power for free
Views do not need prematerialization
Ad-hoc full or partial dataset queries are possible
Total query size can be multiple petabytes
Hadoop Tradeoffs
Cannot serve interactive queries
The fastest Hadoop job will still take several seconds to run
Less powerful updates
No transactions
No modification of existing records
Sqoop tools include:
import
import-all-tables
list-tables
Options include:
--connect
--username
--password
Sqoop: Example
Example: import a table called employees from a database called
personnel in a MySQL RDBMS
sqoop import --username fred --password derf \
--connect jdbc:mysql://database.example.com/personnel \
--table employees
Flume: Basics
Flume is a distributed, reliable, available service for efficiently
moving large amounts of data as it is produced
Ideally suited to gathering logs from multiple systems and
inserting them into HDFS as they are generated
Flume is Open Source
Initially developed by Cloudera
Flume's design goals:
Reliability
Scalability
Manageability
Extensibility
The Flume architecture (diagram):
Agents collect data as it is produced; data can be encrypted, compressed and batched on its way to the Collectors
The Master communicates with all Agents, specifying configuration etc.
Optionally deployable Processors, centrally administered, can pre-process incoming data: perform transformations, suppressions, metadata enrichment
Collectors write to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
Parallelized writes across many collectors provide as much write throughput as required
Multiple configurable levels of reliability; Agents can guarantee delivery in the event of failure
Decorators can be flexibly deployed at any step to improve performance, reliability or security
FuseDFS
FuseDFS is based on FUSE (Filesystem in USEr space)
Allows you to mount HDFS as a regular filesystem
Note: HDFS limitations still exist!
Not intended as a general-purpose filesystem
Files are write-once
Not optimized for low latency
FuseDFS is included as part of the Hadoop distribution
Hoop
Hoop is an open-source project started at Cloudera
Distributed under the Apache Software License 2.0
Provides an HTTP/HTTPS REST interface to HDFS
Supports both reads and writes from/to HDFS
Can be accessed from within a program
Can be used via command-line tools such as curl or wget
Client accesses the Hoop server
Hoop server then accesses HDFS
Available from http://cloudera.github.com/hoop
Conclusion
In this chapter you have learned
How Hadoop can be integrated into an existing enterprise
How to load data from an existing RDBMS into HDFS using
Sqoop
How to manage real-time data such as log files using Flume
How to access HDFS from legacy systems with FuseDFS and
Hoop
Chapter 6
Delving Deeper Into
The Hadoop API
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Why MRUnit?
JUnit is a very popular Java unit testing framework
Problem: JUnit cannot be used directly to test Mappers or
Reducers
Unit tests require mocking up InputSplits,
OutputCollector, Reporter,
A lot of work
MRUnit is built on top of JUnit
Provides those mock objects
Allows you to test your code from within an IDE
Much easier to debug
JUnit Basics
We are using JUnit 4 in class
Earlier versions would also work
@Test
Java annotation
Indicates that this method is a test which JUnit should execute
@Before
Java annotation
Tells JUnit to call this method before every @Test method
Two @Test methods would result in the @Before method
being called twice
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Before;
import org.junit.Test;

public MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

@Before
public void setUp() {
WordMapper mapper = new WordMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
}

@Test
public void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
mapDriver.withOutput(new Text("cat"), new IntWritable(1));
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
mapDriver.runTest();
}
}
MRUnit Drivers
MRUnit has a MapDriver, a ReduceDriver, and a
MapReduceDriver
Methods:
withInput
Specifies input to the Mapper/Reducer
withOutput
Specifies expected output from the Mapper/Reducer
with{Input,Output} follow the builder pattern and can be
chained
add{Input,Output}
Similar to with{Input,Output} but returns void
runTest
Runs the test
run
Runs the test and allows you to retrieve the results
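Because the with{Input,Output} methods return the driver, a test can also be written as one chained expression; a brief sketch reusing the mapDriver from the earlier example:
mapDriver.withInput(new LongWritable(1), new Text("cat dog"))
         .withOutput(new Text("cat"), new IntWritable(1))
         .withOutput(new Text("dog"), new IntWritable(1))
         .runTest();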
@Test
public void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
mapDriver.withOutput(new Text("cat"), new IntWritable(-1));
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
try {
mapDriver.runTest();
} catch (Exception e) {
fail(e.getMessage());
}
}
MRUnit Conclusions
You should write unit tests for your code!
As you are performing the Hands-On Exercises in the rest of the
course we strongly recommend that you write unit tests as you
proceed
This will help greatly in debugging your code
Your instructor may ask to see your unit tests if you ask for
assistance!
The Combiner
Often, Mappers produce large amounts of intermediate data
That data must be passed to the Reducers
This can result in a lot of network traffic
It is often possible to specify a Combiner
Like a mini-Reducer
Runs locally on a single Mapper's output
Output from the Combiner is sent to the Reducers
Combiner and Reducer code are often identical
Technically, this is possible if the operation performed is
commutative and associative
In this case, input and output data types for the Combiner/
Reducer must be identical
reduce(String output_key, Iterator<int> intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)
Specifying a Combiner
To specify the Combiner class to be used in your MapReduce
code, put the following line in your Driver:
conf.setCombinerClass(YourCombinerClass.class);
Note that the close() method does not receive the JobConf
object
You could save a reference to the JobConf object in the
configure method and use it in the close method if necessary
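A minimal sketch of that pattern in the old API (the class shown is illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SetupTeardownMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private JobConf conf;    // saved so that it is still available in close()

  @Override
  public void configure(JobConf job) {   // called once, before any calls to map()
    this.conf = job;
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    output.collect(value, new IntWritable(1));
  }

  @Override
  public void close() throws IOException {   // called once, after the last call to map()
    String jobName = conf.getJobName();      // the saved JobConf can be used here
  }
}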
Custom Partitioners
Sometimes you will need to write your own Partitioner
Example: your key is a custom WritableComparable which
contains a pair of values (a, b)
You may decide that all keys with the same value for a need to go
to the same Reducer
The default Partitioner is not sufficient in this case
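A sketch of such a Partitioner in the old API, reusing the hypothetical StringPairWritable key class sketched earlier (the value type is assumed to be Text):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstFieldPartitioner implements Partitioner<StringPairWritable, Text> {

  public void configure(JobConf job) { }   // required by the JobConfigurable interface

  public int getPartition(StringPairWritable key, Text value, int numPartitions) {
    // Partition on the first element of the key only, so that all keys
    // sharing the same value of 'a' go to the same Reducer
    return (key.getLeft().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
The driver would then select it with conf.setPartitionerClass(FirstFieldPartitioner.class).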
The conf object has read in the Hadoop configuration files, and
therefore knows the address of the NameNode etc.
A file in HDFS is represented by a Path object
Path p = new Path("/path/to/my/file");
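A minimal sketch of reading such a file directly with the FileSystem API (the path is illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws java.io.IOException {
    Configuration conf = new Configuration();   // reads the Hadoop configuration files
    FileSystem fs = FileSystem.get(conf);       // connects to the filesystem named in the configuration
    Path p = new Path("/path/to/my/file");
    FSDataInputStream in = fs.open(p);          // open the file for reading
    // ... read from the stream, then close it
    in.close();
  }
}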
Hands-On Exercise
In this Hands-On Exercise, you will gain practice writing
combiners and creating unit tests
Please refer to the Hands-On Exercise Manual
Conclusion
In this chapter you have learned
More about ToolRunner
How to write unit tests with MRUnit
How to specify Combiners to reduce intermediate data
How to use the configure and close methods for Map and
Reduce setup and teardown
How to write custom Partitioners for better load balancing
How to directly access HDFS
How to use the Distributed Cache
Chapter 7
Common MapReduce
Algorithms
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Introduction
MapReduce jobs tend to be relatively short in terms of lines of
code
It is typical to combine multiple small MapReduce jobs together
in a single workflow
Often using Oozie (see later)
You are likely to find that many of your MapReduce jobs use very
similar code
In this chapter we present some very common MapReduce
algorithms
These algorithms are frequently the basis for more complex
MapReduce jobs
Sorting
MapReduce is very well suited to sorting large data sets
Recall: keys are passed to the reducer in sorted order
Assuming the file to be sorted contains lines with a single value:
Mapper is merely the identity function for the value
(k, v) -> (v, _)
Reducer is the identity function
(k, _) -> (k, '')
Sorting (contd)
Trivial with a single reducer
For multiple reducers, need to choose a partitioning function
such that if k1 < k2, partition(k1) <= partition(k2)
Searching
Assume the input is a set of files containing lines of text
Assume the Mapper has been passed the pattern for which to
search as a special parameter
We saw how to pass parameters to your Mapper in the previous
chapter
Algorithm:
Mapper compares the line against the pattern
If the pattern matches, Mapper outputs (line, _)
Or (filename+line, _), or
If the pattern does not match, Mapper outputs nothing
Reducer is the Identity Reducer
Just outputs each intermediate key
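A sketch of such a search Mapper in the old API; the configuration property used to pass the pattern is an assumption:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GrepMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {

  private String pattern;

  @Override
  public void configure(JobConf job) {
    pattern = job.get("grep.pattern", "");   // property name is illustrative
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    if (line.contains(pattern)) {            // emit matching lines; emit nothing otherwise
      output.collect(new Text(line), NullWritable.get());
    }
  }
}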
Indexing
Assume the input is a set of files containing lines of text
Key is the byte offset of the line, value is the line itself
We can retrieve the name of the file using the Reporter object
More details on how to do this later
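As a preview, in the old API the file name can be recovered inside map() by casting the InputSplit obtained from the Reporter (a sketch, valid for file-based input formats):
// FileSplit is org.apache.hadoop.mapred.FileSplit
FileSplit split = (FileSplit) reporter.getInputSplit();
String fileName = split.getPath().getName();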
The Three Cs
Machine Learning is an active area of research and new
applications
There are three well-established categories of techniques for
exploiting data:
Collaborative filtering (recommendations)
Clustering
Classification
Collaborative Filtering
Collaborative Filtering is a technique for recommendations
Example application: given people who each like certain books,
learn to suggest what someone may like based on what they
already like
Very useful in helping users navigate data by expanding to topics
that have affinity with their established interests
Collaborative Filtering algorithms are agnostic to the different
types of data items involved
So they are equally useful in many different domains
Clustering
Clustering algorithms discover structure in collections of data
Where no formal structure previously existed
They discover what clusters, or groupings, naturally occur in
data
Examples:
Finding related news articles
Computer vision (groups of pixels that cohere into objects)
Classification
The previous two techniques are considered unsupervised
learning
The algorithm discovers groups or recommendations itself
Classification is a form of supervised learning
A classification system takes a set of data records with known
labels
Learns how to label new records based on that information
Example:
Given a set of e-mails identified as spam/not spam, label new emails as spam/not spam
Given tumors identified as benign or malignant, classify new
tumors
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
07-24
Collaborative filtering:
Pearson correlation
Log likelihood
Spearman correlation
Tanimoto coefficient
Singular value decomposition (SVD)
Linear interpolation
Cluster-based recommenders
Clustering:
k-means clustering
Canopy clustering
Fuzzy k-means
Latent Dirichlet allocation (LDA)
Classification:
Stochastic gradient descent (SGD)
Support vector machine (SVM)
Naïve Bayes
Complementary naïve Bayes
Random forests
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
07-25
TF-IDF: Motivation
Merely counting the number of occurrences of a word in a
document is not a good enough measure of its relevance
If the word appears in many other documents, it is probably less relevant
Some words appear too frequently in all documents to be
relevant
Known as stopwords
TF-IDF considers both the frequency of a word in a given
document and the number of documents which contain the word
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
07-28
idf = log( N / n )
N: total number of documents
n: number of documents that contain a term
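Worked example: with N = 1,000,000 documents and a term that appears in n = 100 of them, idf = log10(1,000,000 / 100) = log10(10,000) = 4; a term that appears in every document has idf = log(1) = 0.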
TF-IDF = TF × IDF
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
07-30
Computing TF-IDF
What we need:
Number of times t appears in a document
Different value for each document
Number of documents that contain t
One value for each term
Total number of documents
One value
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
07-31
Reducer
reduce(pair p, Iterator counts) {
  s = 0
  foreach c in counts do
    s += c
  emit(p, s)
}
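A runnable equivalent in the old (org.apache.hadoop.mapred) API might look like the following; it assumes the Mapper has encoded each word pair as a single Text key, and the class name is illustrative:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text pair, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Sum the intermediate counts for this word pair
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(pair, new IntWritable(sum));
  }
}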
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
07-40
Conclusion
In this chapter you have learned
Some typical MapReduce algorithms, including
Sorting
Searching
Indexing
Term Frequency - Inverse Document Frequency (TF-IDF)
Word Co-Occurrence
We also briefly discussed machine learning with Hadoop
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
07-44
Chapter 8
An Introduction to
Hive and Pig
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-1
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-2
Hive: Introduction
Hive was originally developed at Facebook
Provides a very SQL-like language
Can be used by people who know SQL
Under the covers, generates MapReduce jobs that run on the
Hadoop cluster
Enabling Hive requires almost no extra work by the system
administrator
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-8
Hive Datatypes
Primitive types:
TINYINT
INT
BIGINT
BOOLEAN
DOUBLE
STRING
Type constructors:
ARRAY < primitive-type >
MAP < primitive-type, data-type >
STRUCT < col-name : data-type, ... >
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-10
Joining Tables
Joining datasets is a complex operation in standard Java
MapReduce
We will cover this later in the course
In Hive, it's easy!
SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-17
Hive Limitations
Not all standard SQL is supported
No correlated subqueries, for example
No support for UPDATE or DELETE
No support for INSERTing single rows
Relatively limited number of built-in functions
No datatypes for date or time
Use the STRING datatype instead
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-20
Pig: Introduction
Pig was originally created at Yahoo! to answer a similar need to
Hive
Many developers did not have the Java and/or MapReduce
knowledge required to write standard MapReduce programs
But still needed to query data
Pig is a dataflow language
Language is called PigLatin
Relatively simple syntax
Under the covers, PigLatin scripts are turned into MapReduce
jobs and executed on the cluster
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-23
Pig Installation
Installation of Pig requires no modification to the cluster
The Pig interpreter runs on the client machine
Turns PigLatin into standard Java MapReduce jobs, which are
then submitted to the JobTracker
There is (currently) no shared metadata, so no need for a shared
metastore of any kind
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-24
Pig Concepts
In Pig, a single element of data is an atom
A collection of atoms, such as a row or a partial row, is a tuple
Tuples are collected together into bags
Typically, a PigLatin script starts by loading one or more datasets
into bags, and then creates new bags by modifying those it
already has
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-25
Pig Features
Pig supports many features which allow developers to perform
sophisticated data analysis without having to write Java
MapReduce code
Joining datasets
Grouping data
Referring to elements by position rather than name
Useful for datasets with many elements
Loading non-delimited data using a custom SerDe
Creation of user-defined functions, written in Java
And more
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-26
More PigLatin
To view the structure of a bag:
DESCRIBE bagname;
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-28
Conclusion
In this chapter you have learned
What features Hive provides
What features Pig provides
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
08-37
Chapter 9
Practical Development
Tips and Techniques
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-1
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-2
Introduction to Debugging
Debugging MapReduce code is difficult!
Each instance of a Mapper runs as a separate task
Often on a different machine
Difficult to attach a debugger to the process
Difficult to catch edge cases
Very large volumes of data mean that unexpected input is likely
to appear
Code which expects all data to be well-formed is likely to fail
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-5
Testing Strategies
When testing in pseudo-distributed mode, ensure that you are
testing with a similar environment to that on the real cluster
Same amount of RAM allocated to the task JVMs
Same version of Hadoop
Same version of Java
Same versions of third-party libraries
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-9
Testing Locally
Hadoop can run MapReduce in a single, local process
Does not require any Hadoop daemons to be running
Uses the local filesystem
Known as the LocalJobRunner
This is a very useful way of quickly testing incremental changes
to code
To run in LocalJobRunner mode, add the following lines to your
driver code:
conf.set("mapred.job.tracker", "local");
conf.set("fs.default.name", "file:///");
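A minimal sketch of a driver configured this way (the class name and the "input"/"output" paths are placeholders; Mapper, Reducer and format classes would be set as usual):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LocalTestDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LocalTestDriver.class);
    conf.setJobName("local-test");

    // Run in-process with the LocalJobRunner, against the local filesystem
    conf.set("mapred.job.tracker", "local");
    conf.set("fs.default.name", "file:///");

    // Set Mapper, Reducer, output key/value classes etc. here as usual

    FileInputFormat.setInputPaths(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    JobClient.runJob(conf);
  }
}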
09-11
log4j Configuration
Configuration for log4j is stored in
/etc/hadoop/conf/log4j.properties
Can change global log settings with the hadoop.root.logger property
Can override log level on a per-class basis:
log4j.logger.org.apache.hadoop.mapred.JobTracker=WARN
log4j.logger.org.apache.hadoop.mapred.FooMapper=DEBUG
Programmatically:
LOGGER.setLevel(Level.WARN);
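For example (FooMapper is just the illustrative class name used in the override above; the output goes to the task's log files):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class FooMapper {
  // Logger named after the class, matching the log4j.properties entry
  private static final Logger LOGGER = Logger.getLogger(FooMapper.class);

  static {
    // Programmatic equivalent of the properties-file override
    LOGGER.setLevel(Level.DEBUG);
  }

  public void process(String line) {
    LOGGER.debug("Processing line: " + line);   // written to the task logs
  }
}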
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-21
Example: incrementing a counter via the Reporter object passed to the Mapper or Reducer:
r.incrCounter("RecordType", "TypeA", 1);
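The final counter values can then be read back in the driver once the job completes. A minimal sketch using the old API (class name and paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class CounterExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CounterExample.class);
    // Set Mapper, Reducer, output key/value classes etc. here as usual
    FileInputFormat.setInputPaths(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    RunningJob job = JobClient.runJob(conf);   // blocks until the job completes

    // Read back the counter incremented by the tasks
    Counters counters = job.getCounters();
    long typeA = counters.findCounter("RecordType", "TypeA").getCounter();
    System.out.println("TypeA records: " + typeA);
  }
}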
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-27
Counters: Caution
Do not rely on a counter's value from the Web UI while a job is running
Due to possible speculative execution, a counter's value could appear larger than the actual final value
Modifications to counters from subsequently killed/failed tasks will
be removed from the final count
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-29
InputSplits: Reprise
Recall that each Mapper processes a single InputSplit
InputSplits are calculated by the client program before the job is
submitted to the JobTracker
Typically, an InputSplit equates to an HDFS block
So the number of Mappers will equal the number of HDFS blocks
in the input data
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-31
Hands-On Exercise
In this Hands-On Exercise you will write a Map-Only MapReduce
job using Counters
Please refer to the Hands-On Exercise Manual
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-45
Conclusion
In this chapter you have learned
Strategies for debugging your MapReduce code
How to write and subsequently view log files
How to retrieve job information with Counters
What issues to consider when using file compression
How to determine the optimal number of Reducers for a job
How to create Map-only MapReduce jobs
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
09-47
Chapter 10
More Advanced
MapReduce Programming
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-1
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-2
Writable
Defines a de/serialization protocol. Every data type in Hadoop is a Writable
Examples include WritableComparable, IntWritable, LongWritable, and Text
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-9
Simply concatenating values into a delimited string (e.g. "a,b") is:
Inelegant
Problematic
If a or b contained commas
Not always practical
Doesn't easily work for binary objects
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-11
DataInput/DataOutput support:
boolean
byte, char (Unicode: 2 bytes)
double, float, int, long
unsigned byte, short
String (Unicode or UTF-8)
Line until line terminator
byte array
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-12
WritableComparable
WritableComparable is a sub-interface of Writable
Must implement compareTo, hashCode, equals methods
All keys in MapReduce must be WritableComparable
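A minimal sketch of a custom WritableComparable; the WordPair class is illustrative, not part of Hadoop or the course exercises:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A hypothetical compound key holding a pair of words
public class WordPair implements WritableComparable<WordPair> {
  private String left = "";
  private String right = "";

  public WordPair() { }

  public WordPair(String left, String right) {
    this.left = left;
    this.right = right;
  }

  public void write(DataOutput out) throws IOException {
    // Serialize both fields
    out.writeUTF(left);
    out.writeUTF(right);
  }

  public void readFields(DataInput in) throws IOException {
    // Deserialize in the same order as write()
    left = in.readUTF();
    right = in.readUTF();
  }

  public int compareTo(WordPair other) {
    int cmp = left.compareTo(other.left);
    return (cmp != 0) ? cmp : right.compareTo(other.right);
  }

  @Override
  public int hashCode() {
    // Used by the default HashPartitioner to assign keys to Reducers
    return left.hashCode() * 163 + right.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof WordPair)) return false;
    WordPair other = (WordPair) o;
    return left.equals(other.left) && right.equals(other.right);
  }

  @Override
  public String toString() {
    return left + "\t" + right;
  }
}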
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-15
MultiFileInputFormat
Abstract class which manages the use of multiple files in a
single task
You must supply a getRecordReader() implementation
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-27
What RecordReaders Do
InputSplits are handed to the RecordReaders
Specified by the path, starting position offset, length
RecordReaders must:
Ensure each (key, value) pair is processed
Ensure no (key, value) pair is processed more than once
Handle (key, value) pairs which are split across InputSplits
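As an illustration of these responsibilities (not a class provided by Hadoop or by the course), a RecordReader that treats an entire, non-split file as a single record might look like this; it would be returned by a custom FileInputFormat whose isSplitable() returns false:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class WholeFileRecordReader implements RecordReader<LongWritable, Text> {
  private final FileSplit split;
  private final Configuration conf;
  private boolean processed = false;

  public WholeFileRecordReader(FileSplit split, Configuration conf) {
    this.split = split;
    this.conf = conf;
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    if (processed) return false;                 // exactly one record per split
    // Assumes the split covers the whole (small) file
    byte[] contents = new byte[(int) split.getLength()];
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    FSDataInputStream in = fs.open(path);
    try {
      in.readFully(0, contents);
      key.set(split.getStart());
      value.set(contents, 0, contents.length);
    } finally {
      in.close();
    }
    processed = true;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() { return processed ? split.getLength() : 0; }
  public float getProgress() { return processed ? 1.0f : 0.0f; }
  public void close() throws IOException { }
}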
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-29
Sample InputSplit
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-30
OutputFormat
OutputFormats work much like InputFormat classes
Custom OutputFormats must provide a RecordWriter
implementation
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-34
Hands-On Exercise
In this Hands-On Exercise you will create a custom Writable to
extend the word co-occurrence program you wrote earlier in the
course
Please refer to the Hands-On Exercise Manual
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-36
Conclusion
In this chapter you have learned
How to create custom Writables and WritableComparables
How to save binary data using SequenceFiles and Avro
How to build custom InputFormats and OutputFormats
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
10-38
Chapter 11
Joining Data Sets
in MapReduce Jobs
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-1
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-2
Introduction
We frequently need to join data together from two sources as
part of a MapReduce job, such as
Lookup tables
Data from database tables
There are two fundamental approaches: Map-side joins and
Reduce-side joins
Map-side joins are easier to write, but have potential scaling
issues
We will investigate both types of joins in this chapter
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-5
But First
But first
Avoid writing joins in Java MapReduce if you can!
Abstractions such as Pig and Hive are much easier to use
Save hours of programming
If you are dealing with text-based data, there really is no reason
not to use Pig or Hive
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-6
Required output:
EMP: 42, Aaron, loc(13), New York City
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-20
String empName;
int empId;
int locId;
String locationName;
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-21
void map(k, v) {
Record r = parse(v);
emit (r.locId, r);
}
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-22
A Better Mapper
void map(k, v) {
  Record r = parse(v);
  if (r.type == Typ.emp) {
    emit(FK(r.locId), r);
  } else {
    emit(PK(r.locId), r);
  }
}
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-28
A Better Reducer
Record thisLoc;

void reduce(k, values) {
  for (Record v in values) {
    if (v.type == Typ.loc) {
      thisLoc = v;
    } else {
      v.locationName = thisLoc.locationName;
      emit(v);
    }
  }
}

This works because the keys are constructed and sorted so that each location record reaches the Reducer before the employee records for that location (a secondary sort)
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-29
Conclusion
In this chapter you have learned
How to write a Map-side join
How to implement a secondary sort
How to write a Reduce-side join
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
11-33
Chapter 12
Graph Manipulation
in MapReduce
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
12-1
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
12-2
Representing Graphs
Imagine we want to represent this simple graph:
(figure: an example graph with four vertices, v1 to v4)
Two approaches:
Adjacency matrices
Adjacency lists
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
12-11
Adjacency Matrices
Represent the graph as an n x n square matrix
   v1 v2 v3 v4
v1  0  1  0  1
v2  1  0  1  1
v3  1  0  0  0
v4  1  0  1  0
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
12-12
Adjacency Lists
Take an adjacency matrix and throw away all the zeros
   v1 v2 v3 v4
v1  0  1  0  1
v2  1  0  1  1
v3  1  0  0  0
v4  1  0  1  0

becomes

v1: v2, v4
v2: v1, v3, v4
v3: v1
v4: v1, v3
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
12-14
v1: v2, v4
v2: v1, v3, v4
v3: v1
v4: v1, v3
1: [2] 2, 4
2: [3] 1, 3, 4
3: [1] 1
4: [2] 1, 3
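The remaining slides in this section are diagrams. As a rough sketch only (the line format nodeId<TAB>distance<TAB>neighbour,neighbour,... with -1 meaning "not yet reached" is an assumption, not the course's format), one iteration of parallel breadth-first search over such an adjacency-list file could be written as follows and re-run until no distances change:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One iteration of parallel breadth-first search over lines of the form
//   nodeId<TAB>distance<TAB>neighbour1,neighbour2,...
public class BfsIteration {

  public static class BfsMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String[] parts = value.toString().split("\t");
      String nodeId = parts[0];
      int distance = Integer.parseInt(parts[1]);
      String adjacency = (parts.length > 2) ? parts[2] : "";

      // Pass the graph structure through to the next iteration
      output.collect(new Text(nodeId), new Text("NODE\t" + distance + "\t" + adjacency));

      // Propose a distance of (distance + 1) to each neighbour
      if (distance >= 0 && adjacency.length() > 0) {
        for (String neighbour : adjacency.split(",")) {
          output.collect(new Text(neighbour), new Text("DIST\t" + (distance + 1)));
        }
      }
    }
  }

  public static class BfsReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text nodeId, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      int best = -1;          // best distance seen so far (-1 = unreached)
      String adjacency = "";

      while (values.hasNext()) {
        String[] parts = values.next().toString().split("\t");
        if (parts[0].equals("NODE")) {
          int current = Integer.parseInt(parts[1]);
          adjacency = (parts.length > 2) ? parts[2] : "";
          if (current >= 0 && (best < 0 || current < best)) best = current;
        } else {
          int proposed = Integer.parseInt(parts[1]);
          if (best < 0 || proposed < best) best = proposed;
        }
      }

      // Re-emit the node in the original format, with its updated distance
      output.collect(nodeId, new Text(best + "\t" + adjacency));
    }
  }
}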
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
12-16
From Lin & Dyer (2010), Data-Intensive Text Processing with MapReduce
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
12-23
Conclusion
In this chapter you have learned
Best practices for representing graphs in Hadoop
How to implement a single source shortest path algorithm in
MapReduce
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
12-27
Chapter 13
An Introduction
To Oozie
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-1
Course Chapters
Introduction
The Motivation For Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Integrating Hadoop Into The Workflow
Delving Deeper Into The Hadoop API
Common MapReduce Algorithms
An Introduction to Hive and Pig
Practical Development Tips and Techniques
More Advanced MapReduce Programming
Joining Data Sets in MapReduce Jobs
Graph Manipulation In MapReduce
An Introduction to Oozie
Conclusion
Appendix: Cloudera Enterprise
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-2
An Introduction to Oozie
In this chapter you will learn
What Oozie is
How to create Oozie workflows
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-3
An Introduction to Oozie
The Motivation for Oozie
Creating Oozie workflows
Hands-On Exercise
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-4
What is Oozie?
Oozie is a workflow engine
Runs on a server
Typically outside the cluster
Runs workflows of Hadoop jobs
Including Pig, Hive, Sqoop jobs
Submits those jobs to the cluster based on a workflow definition
Workflow definitions are submitted via HTTP
Jobs can be run at specific times
One-off or recurring jobs
Jobs can be run when data is present in a directory
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-7
An Introduction to Oozie
The Motivation for Oozie
Creating Oozie workflows
Hands-On Exercise
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-8
The wordcount action node defines a mapreduce action: a standard Java MapReduce job.
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-14
Action type: Description
map-reduce: Runs either a Java MapReduce or Streaming job
fs: Create directories, move or delete files or directories
java: Runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
pig: Runs a Pig job
hive: Runs a Hive job
sqoop: Runs a Sqoop job
email: Sends an e-mail message
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-20
More on Oozie
For full documentation on Oozie, refer to
http://docs.cloudera.com/
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-22
An Introduction to Oozie
The Motivation for Oozie
Creating Oozie workflows
Hands-On Exercise
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-23
Hands-On Exercise
In this Hands-On Exercise you will run Oozie jobs
Please refer to the Hands-On Exercise Manual
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-24
An Introduction to Oozie
The Motivation for Oozie
Creating Oozie workflows
Hands-On Exercise
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-25
Conclusion
In this chapter you have learned
What Oozie is
How to create Oozie workflows
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
13-26
Chapter 14
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
14-1
Conclusion
During this course, you have learned:
The core technologies of Hadoop
How HDFS and MapReduce work
What other projects exist in the Hadoop ecosystem
How to develop MapReduce jobs
How Hadoop integrates into the datacenter
Algorithms for common MapReduce tasks
How to create large workflows using Oozie
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
14-2
Conclusion (cont'd)
How Hive and Pig can be used for rapid application development
Best practices for developing and debugging MapReduce jobs
Advanced features of the Hadoop API
How to handle graph manipulation problems with MapReduce
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
14-3
Certification
You can now take the Cloudera Certified Developer for Apache
Hadoop exam
Your instructor will give you information on how to access the
exam
14-4
Appendix A:
Cloudera Enterprise
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
A-1
Cloudera Enterprise
Reduces the risks of running Hadoop in production
Improves consistency and compliance, and reduces administrative overhead
Management Suite components:
- Cloudera Manager (CM)
- Activity Monitor
- Resource Manager
- Authorization Manager
- The Flume User Interface
A-2
Cloudera Manager
Cloudera Manager (CM) is designed to make installing and
managing your cluster very easy
View a dashboard of cluster status
Modify configuration parameters
Easily start and stop master and slave daemons
Easily retrieve the configuration files required for client
machines to access the cluster
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
A-3
Activity Monitor
Activity Monitor gives an in-depth, comprehensive view of what
is happening on the cluster, in real-time
And what has happened in the past
Compare the performance of similar jobs
Store historical data
Chart metrics on cluster performance
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
A-5
Reporting
Cloudera Manager can report on resource usage in the cluster
Helpful for auditing and internal billing
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
A-7
Authorization Manager
Authorization Manager allows you to provision users, and
manage user activity, within the cluster
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
A-8
Conclusion
Cloudera Enterprise makes it easy to run open source Hadoop in
production
Includes
Cloudera Distribution including Apache Hadoop (CDH)
Cloudera Management Suite (CMS)
Production Support
Cloudera Management Suite enables you to:
Simplify and accelerate Hadoop deployment
Reduce the costs and risks of adopting Hadoop in production
Reliably operate Hadoop in production with repeatable success
Apply SLAs to Hadoop
Increase control over Hadoop cluster provisioning and
management
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
A-9