Unit-V: Hadoop
(PECO8013T)
By: Dr. D. R. Patil
Outline
• Hadoop - Basics
• HDFS
– Goals
– Architecture
– Other functions
• MapReduce
– Basics
– Word Count Example
– Handy tools
– Finding shortest path example
• Related Apache sub-projects (Pig, HBase, Hive, ZooKeeper, Mahout, Storm)
Hadoop
• Hadoop is an open source framework.
• It is provided by Apache to process and analyze very
large volumes of data.
• It is written in Java and is not an OLAP (online
analytical processing) system.
• It is used for batch/offline processing.
• It is currently used by Google, Facebook, LinkedIn,
Yahoo, Twitter etc.
• Moreover it can be scaled up just by adding nodes in
the cluster.
History of Hadoop
• Hadoop was started by Doug Cutting and Mike
Cafarella in 2002 (as part of the Apache Nutch project).
Its origin was the Google File System paper published
by Google.
History of Hadoop

Year  Event
2003  Google released the Google File System (GFS) paper.
2004  Google released a white paper on MapReduce.
2006  Hadoop introduced; Hadoop 0.1.0 released. Yahoo deploys 300
      machines and reaches 600 machines within the year.
2007  Yahoo runs 2 clusters of 1,000 machines. Hadoop includes HBase.
2008  YARN JIRA opened. Hadoop becomes the fastest system to sort 1
      terabyte of data, on a 900-node cluster in 209 seconds. Yahoo
      clusters are loaded with 10 terabytes per day. Cloudera is
      founded as a Hadoop distributor.
History of Hadoop

Year  Event
2009  Yahoo runs 17 clusters totaling 24,000 machines. Hadoop becomes
      capable of sorting a petabyte. MapReduce and HDFS become
      separate subprojects.
2010  Hadoop adds support for Kerberos. Hadoop operates 4,000 nodes
      with 40 petabytes. Apache Hive and Pig released.
2011  Apache ZooKeeper released. Yahoo has 42,000 Hadoop nodes and
      hundreds of petabytes of storage.
2012  Apache Hadoop 1.0 released.
2013  Apache Hadoop 2.2 released.
2014  Apache Hadoop 2.6 released.
2015  Apache Hadoop 2.7 released.
2017  Apache Hadoop 3.0 released.
2018  Apache Hadoop 3.1 released.
Modules of Hadoop
• HDFS: the Hadoop Distributed File System. Google published its
GFS paper, and HDFS was developed on the basis of it: files are
broken into blocks and stored on nodes across the distributed
architecture.
• YARN: Yet Another Resource Negotiator, used for job scheduling
and cluster management.
• MapReduce: a framework that helps Java programs do parallel
computation on data using key-value pairs. The Map task takes
input data and converts it into a data set that can be computed
over as key-value pairs. The output of the Map task is consumed
by the Reduce task, and the output of the reducer gives the
desired result.
• Hadoop Common: Java libraries used to start Hadoop and used by
the other Hadoop modules.
Hadoop Architecture
• The Hadoop architecture is a package of the file system (HDFS,
the Hadoop Distributed File System) and the MapReduce engine.
The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
• A Hadoop cluster consists of a single master and multiple slave
nodes. The master node runs the JobTracker and the NameNode,
whereas each slave node runs a TaskTracker and a DataNode.
Hadoop Architecture
[Diagram: Hadoop cluster architecture]
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is a
distributed file system for Hadoop.
• It has a master/slave architecture: a single NameNode performs
the role of master, and multiple DataNodes perform the role of
slaves.
• Both the NameNode and the DataNode are capable of running on
commodity machines. HDFS is developed in Java, so any machine
that supports the Java language can easily run the NameNode and
DataNode software.
Hadoop - Why ?
• Need to process huge datasets on large
clusters of computers
• Very expensive to build reliability into
each application
• Nodes fail every day
– Failure is expected, rather than exceptional
– The number of nodes in a cluster is not
constant
• Need a common infrastructure
– Efficient, reliable, easy to use
– Open Source, Apache Licence
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• New York Times
• Veoh
• Yahoo!
• …. many more
Commodity Hardware
[Diagram: typical cluster network, with rack switches uplinked to an
aggregation switch]
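Word Count Example: Mapper and Reducer (sketch)

The driver configuration on the next slide refers to Map and Reduce classes that are not shown in this handout. The following is a minimal sketch of what they look like, assuming the classic org.apache.hadoop.mapred (pre-YARN) API and that both are nested inside the WordCount class whose main() is the driver:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reduce: sum the counts emitted for each word
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // main() (the driver) follows on the next slide
}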
Word Count Example: Driver

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}
Input and Output Formats
• A Map/Reduce job may specify how its input is to be read
by specifying an InputFormat to be used
• A Map/Reduce job may specify how its output is to be
written by specifying an OutputFormat to be used
• These default to TextInputFormat and
TextOutputFormat, which process line-based text data
• Another common choice is SequenceFileInputFormat
and SequenceFileOutputFormat for binary data
• These are file-based, but they are not required to be
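As a hedged illustration (not from the original slides), switching a job from the text defaults to the binary SequenceFile formats is just a change of format classes on the JobConf:

// Classic mapred API: read and write binary SequenceFiles instead of line-based text
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);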
How many Maps and Reduces
• Maps
  – Usually as many as the number of HDFS blocks being
    processed; this is the default
  – Otherwise the number of maps can be specified as a hint
  – The number of maps can also be controlled by specifying the
    minimum split size
  – The actual size of each map input is computed by:
    max(min(block_size, data_size / #maps), min_split_size)
• Reduces
  – Unless the amount of data being processed is small, use about
    0.95 * num_nodes * mapred.tasktracker.tasks.maximum
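A hedged sketch of how these knobs are set on a JobConf; numNodes and tasksPerNode are illustrative variables, not Hadoop names:

conf.setNumMapTasks(100);                        // only a hint; the actual count follows the input splits
conf.set("mapred.min.split.size", "134217728");  // raise the minimum split to 128 MB for fewer, larger maps
conf.setNumReduceTasks((int) (0.95 * numNodes * tasksPerNode));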
Some handy tools
• Partitioners
• Combiners
• Compression
• Counters
• Speculation
• Zero Reduces
• Distributed File Cache
• Tool
Partitioners
• Partitioners are application code that define how keys
are assigned to reduces
• Default partitioning spreads keys evenly, but randomly
– Uses key.hashCode() % num_reduces
• Custom partitioning is often required, for example, to
produce a total order in the output
– Should implement Partitioner interface
– Set by calling conf.setPartitionerClass(MyPart.class)
– To get a total order, sample the map output keys and pick
values to divide the keys into roughly equal buckets and use
that in your partitioner
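A minimal sketch of a custom partitioner for the classic mapred API; the class name MyPart and the first-letter bucketing rule are illustrative only:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Spread keys across reducers by first letter, so that the sorted
// partitions concatenate into a totally ordered output
public class MyPart implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }

  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String k = key.toString().toLowerCase();
    char c = k.isEmpty() ? 'a' : k.charAt(0);
    if (c < 'a' || c > 'z') return 0;         // non-alphabetic keys go to the first reducer
    return (c - 'a') * numReduceTasks / 26;   // 'a'..'z' map onto 0..numReduceTasks-1 in order
  }
}

It is enabled with conf.setPartitionerClass(MyPart.class), as noted above.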
Combiners
• When maps produce many repeated keys
– It is often useful to do a local aggregation following the map
– Done by specifying a Combiner
– Goal is to decrease size of the transient data
– Combiners have the same interface as Reducers, and are often the
  same class
– Combiners must not have side effects, because the framework may
  run them zero, one, or many times
– In WordCount, conf.setCombinerClass(Reduce.class);
Compression
• Compressing the outputs and intermediate data will often yield
huge performance gains
– Can be specified via a configuration file or set programmatically
– Set mapred.output.compress to true to compress job output
– Set mapred.compress.map.output to true to compress map outputs
• Compression Types (mapred(.map)?.output.compression.type)
– “block” - Groups of keys and values are compressed together
– “record” - Each value is compressed individually
– Block compression is almost always best
• Compression Codecs
(mapred(.map)?.output.compression.codec)
– Default (zlib) - slower, but more compression
– LZO - faster, but less compression
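A hedged sketch of setting these properties programmatically on the JobConf (a configuration file works equally well):

conf.setBoolean("mapred.output.compress", true);        // compress the job output
conf.setBoolean("mapred.compress.map.output", true);    // compress the intermediate map output
conf.set("mapred.output.compression.type", "BLOCK");    // block compression is almost always best
conf.setClass("mapred.output.compression.codec",
              DefaultCodec.class, CompressionCodec.class);  // zlib-based default codec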
Counters
• Often Map/Reduce applications have countable events
• For example, the framework counts records into and out
of the Mapper and Reducer
• To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
• Define nice names in a MyClass_Counter.properties
file
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2
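A hedged sketch of reading a counter back in the launching program once the job has finished (classic mapred API):

RunningJob job = JobClient.runJob(conf);
Counters counters = job.getCounters();
long events = counters.getCounter(Counter.EVENT1);  // value aggregated across all tasks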
Speculative execution
• The framework can run multiple instances of slow
tasks
– Output from instance that finishes first is used
– Controlled by the configuration variable
mapred.speculative.execution
– Can dramatically pull in the long tail of job completion times
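A hedged sketch of turning it off for a job; note that Hadoop 1.x splits the single flag named above into separate map-side and reduce-side properties:

conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);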
Zero Reduces
• Frequently, we only need to run a filter on the input
data
– No sorting or shuffling required by the job
– Set the number of reduces to 0
– Output from maps will go directly to OutputFormat and disk
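In code this is a one-liner (sketch):

conf.setNumReduceTasks(0);  // map-only job: mapper output goes straight through the OutputFormat to HDFS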
Distributed File Cache
• Sometimes need read-only copies of data on the local
computer
– Downloading 1GB of data for each Mapper is expensive
• Define list of files you need to download in JobConf
• Files are downloaded once per computer
• Add to launching program:
DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"),
conf);
• Add to task:
Path[] files = DistributedCache.getLocalCacheFiles(conf);
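A hedged usage sketch of reading one of the cached files from inside a task, e.g. in Mapper.configure(); the lookup-table use case is illustrative:

Path[] files = DistributedCache.getLocalCacheFiles(conf);
BufferedReader reader = new BufferedReader(
    new InputStreamReader(FileSystem.getLocal(conf).open(files[0])));
String line;
while ((line = reader.readLine()) != null) {
  // parse the line and build an in-memory lookup table
}
reader.close();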
Tool
• Handle “standard” Hadoop command line options
– -conf file - load a configuration file named file
– -D prop=value - define a single configuration property prop
• Class looks like:
public class MyApp extends Configured implements Tool {
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new Configuration(),
new MyApp(), args));
}
public int run(String[] args) throws Exception {
…. getConf() ….
}
}
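A hedged sketch of what a typical run() body looks like (the slide elides it); the -conf and -D options have already been applied to getConf() by ToolRunner:

public int run(String[] args) throws Exception {
  JobConf conf = new JobConf(getConf(), MyApp.class);
  // configure mapper, reducer, formats and input/output paths here
  JobClient.runJob(conf);
  return 0;
}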
Finding the Shortest Path
• A common graph search application is finding the shortest
  path from a start node to one or more target nodes
• Commonly done on a single machine with Dijkstra’s Algorithm
• Can we use BFS to find the shortest path via MapReduce?
Finding the Shortest Path: Intuition
• We can define the solution to this problem inductively
  – DistanceTo(startNode) = 0
  – For all nodes n directly reachable from startNode,
    DistanceTo(n) = 1
  – For all nodes n reachable from some other set of nodes S,
    DistanceTo(n) = 1 + min(DistanceTo(m) : m ∈ S)
From Intuition to Algorithm
• A map task receives a node n as a key, and (D, points-to) as its
  value
  – D is the distance to the node from the start
  – points-to is a list of nodes reachable from n
  – For each p ∈ points-to, emit (p, D+1)
• A reduce task gathers the possible distances to a given p and
  selects the minimum one
What This Gives Us
• This MapReduce task can advance the
known frontier by one hop
• To perform the whole BFS, a non-
MapReduce component then feeds the
output of this step back into the
MapReduce task for another iteration
– Problem: Where’d the points-to list go?
– Solution: Mapper emits (n, points-to) as well
Blow-up and Termination
• This algorithm starts from one node
• Subsequent iterations include many
more nodes of the graph as the frontier
advances
• Does this ever terminate?
– Yes! Eventually no new routes between nodes are
  discovered and no better distances are found; when the
  distances stop changing between iterations, we stop
– The mapper should also emit (n, D) to ensure that the
  “current distance” is carried into the reducer
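Pulling the last few slides together, a minimal sketch of one BFS iteration, assuming each input line encodes a node as "id<TAB>distance|comma-separated-points-to-list" (the encoding, class names, and GRAPH/DIST tags are illustrative, not from the original slides; unreached nodes can carry a large sentinel distance, and a non-MapReduce driver re-runs this job until no distance changes):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class BfsStep {

  // Map: re-emit the graph structure and current distance, and propose D+1 for each neighbor
  public static class BfsMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String[] keyValue = line.toString().split("\t", 2);
      String node = keyValue[0];
      String[] distAndList = keyValue[1].split("\\|", 2);
      int d = Integer.parseInt(distAndList[0]);
      String pointsTo = distAndList.length > 1 ? distAndList[1] : "";

      out.collect(new Text(node), new Text("GRAPH|" + pointsTo));  // keep the points-to list
      out.collect(new Text(node), new Text("DIST|" + d));          // carry the current distance
      if (!pointsTo.isEmpty()) {
        for (String p : pointsTo.split(",")) {
          out.collect(new Text(p), new Text("DIST|" + (d + 1)));   // the frontier advances one hop
        }
      }
    }
  }

  // Reduce: keep the minimum proposed distance and re-attach the points-to list
  public static class BfsReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text node, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      int best = Integer.MAX_VALUE;
      String pointsTo = "";
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("GRAPH|")) {
          pointsTo = v.substring("GRAPH|".length());
        } else {
          best = Math.min(best, Integer.parseInt(v.substring("DIST|".length())));
        }
      }
      // Same format as the input, ready to be fed into the next iteration
      out.collect(node, new Text(best + "|" + pointsTo));
    }
  }
}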
Hadoop Related Subprojects
• Pig
– High-level language for data analysis
• HBase
– Table storage for semi-structured data
• Zookeeper
– Coordinating distributed applications
• Hive
– SQL-like Query language and Metastore
• Mahout
– Machine learning
Pig
• Started at Yahoo! Research
• Now runs about 30% of Yahoo!’s jobs
• Features
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc.)
– Easy to plug in Java functions
An Example Problem
• Suppose you have user data in one file and website click data in
  another, and you need to find the top 5 most visited pages by
  users of a certain age
• Dataflow: Load Users and Load Pages; Filter users by age; Join on
  name; Group on url; Count clicks; Order by clicks; Take top 5
• The same dataflow in Pig Latin:
    Users   = load …
    Fltrd   = filter …
    Pages   = load …
    Joined  = join …
    Grouped = group …
    Summed  = … count() …
    Sorted  = order …
    Top5    = limit …
Ease of Translation
• Pig translates the script into a chain of MapReduce jobs:
  – Job 1: Load Users, Filter by age, Load Pages, Join on name
    (Users = load …; Fltrd = filter …; Pages = load …; Joined = join …)
  – Job 2: Group on url, Count clicks
    (Grouped = group …; Summed = … count() …)
  – Job 3: Order by clicks, Take top 5
    (Sorted = order …; Top5 = limit …)
HBase - What?
• Modeled on Google’s Bigtable
• Row/column store
• Billions of rows/millions of columns
• Column-oriented - nulls are free
• Untyped - stores byte[]
HBase - Data Model

Row         Timestamp   Column family animal:        Column family repairs:
                        animal:type   animal:size    repairs:cost
enclosure1  t2          zebra                        1000 EUR
            t1          lion          big
enclosure2  …           …             …              …
HBase - Data Storage
Column family animal:
(enclosure1, t2, animal:type) zebra
(enclosure1, t1, animal:size) big
(enclosure1, t1, animal:type) lion
• Retrieve a row:
RowResult row = table.getRow("enclosure1");
Hive: Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
       USING 'map_script.py' AS dt, uid
FROM page_views
CLUSTER BY dt;
Storm
• Developed by BackType, which was acquired by Twitter
• There are lots of tools for batch data processing
  – Hadoop, Pig, HBase, Hive, …
• None of them is a realtime system, and realtime
  processing is becoming a real requirement for businesses
• Storm provides realtime computation
– Scalable
– Guarantees no data loss
Before Storm
[Diagram: a typical pre-Storm realtime processing setup]
Before Storm - Adding a worker
[Diagram: Deploy, then Reconfigure/Redeploy]
Problems
• Scaling is painful
• Poor fault-tolerance
• Coding is tedious
What we want
• Guaranteed data processing
• Horizontal scalability
• Fault-tolerance
• No intermediate message brokers!
• Higher level abstraction than message
passing
• “Just works” !!
Storm Cluster
• Master node (similar to the Hadoop JobTracker)
Storm Concepts
• Spouts - sources of streams
• Bolts - process input streams and produce new streams