Chapter Five: Hadoop MapReduce & HDFS
• Programming model:
– Provides an abstraction to express the computation
• Library:
– Takes care of the runtime parallelisation of the computation
• Distributed, with some centralization
• Main nodes of the cluster are where most of the computational power and storage of the system lie
MapReduce Programming Model
• Inspired by the map and reduce operations commonly used in
functional programming languages like Lisp.
• Input: a set of key/value pairs
• User supplies two functions:
– map(k,v) → list(k1,v1)
– reduce(k1, list(v1)) → v2
• (k1,v1) is an intermediate key/value pair
• Output is the set of (k1,v2) pairs
• For our example, assume that the system
– Breaks up files into lines, and
– Calls the map function with the value of each line
• Key is the line number
Programming Model
2. Reduce task: all intermediate pairs with the same intermediate key kint are grouped into a list of values
reduce(out-key, list(intermediate-value)) → list(out-values)
< kint, {vint} > → < ko, vo >
Example: Counting the number of occurrences of each word in a collection of documents
[Figure: MapReduce execution overview. A master assigns map and reduce tasks to workers; map workers read partitions of the input file and write intermediate files to local disk; reduce workers remotely read and sort the intermediate data and write the output files.]
MapReduce: Word Count Example
• Consider the problem of counting the number of occurrences of
each word in a large collection of documents
• How would you do it in parallel?
• Solution:
– Divide documents among workers
– Each worker parses its documents to find all words; the map
function outputs (word, count) pairs
– Partition (word, count) pairs across workers based on word
– For each word at a worker, the reduce function locally adds up
the counts
• Given input: “One a penny, two a penny, hot cross buns.”
– Records output by the map() function would be
• (“One”, 1), (“a”, 1), (“penny”, 1),(“two”, 1), (“a”, 1),
(“penny”, 1), (“hot”, 1), (“cross”, 1), (“buns”, 1).
– Records output by the reduce() function would be
• (“One”, 1), (“a”, 2), (“penny”, 2), (“two”, 1), (“hot”, 1),
(“cross”, 1), (“buns”, 1)
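A minimal sketch of this word count in Hadoop's Java MapReduce API is shown below. The class names are illustrative; a complete job would also need a driver that sets input/output paths.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: the framework calls map() once per input line; the key is the
    // byte offset of the line and the value is the line text.
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);                 // emit ("word", 1)
        }
      }
    }

    // Reducer: receives ("word", [1, 1, ...]) and sums the counts.
    class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
          sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // emit ("word", total)
      }
    }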
Schematic Flow of Keys and Values
• Flow of keys and values in a MapReduce task
MapReduce Implementation
Parallel Execution
User specifies:
M: number of map tasks
R: number of reduce tasks
Map:
The MapReduce library splits the input file into M pieces
Typically 16-64 MB per piece
Map tasks are distributed across the machines
Reduce:
The intermediate key space is partitioned into R pieces
hash(intermediate_key) mod R (see the sketch below)
Typical setting:
2,000 machines
M = 200,000
R = 5,000
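A rough sketch of that default partitioning rule, assuming string-valued intermediate keys:

    final class Partitioning {
      // Routes an intermediate key to one of the R reduce tasks.
      // The '& Integer.MAX_VALUE' keeps the hash non-negative, mirroring what
      // Hadoop's default HashPartitioner does with key.hashCode().
      static int reduceTaskFor(String intermediateKey, int numReduceTasks /* R */) {
        return (intermediateKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }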
Execution Flow
Master Data Structures
Fault-Tolerance
Two types of failures:
1. Worker failures:
Detected by the master via periodic heartbeat messages. If a worker does not
respond within a certain amount of time, the master marks it as dead.
In-progress and completed map tasks on the failed worker are reset to idle and re-scheduled
In-progress reduce tasks on the failed worker are reset to idle and re-scheduled
Workers executing reduce tasks affected by the failed map workers are
notified of the re-scheduling
Question: Why do completed map tasks have to be re-scheduled?
Answer: Map output is stored on the local file system of the failed worker, while
reduce output is stored on GFS
2. Master failure:
1. Rare
2. Can be recovered from checkpoints
3. Solution: abort the MapReduce computation and start it again
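The heartbeat-based detection above can be pictured with a small, purely hypothetical sketch of the master's bookkeeping (class name, timeout value, and worker IDs are all assumptions, not part of the original slides):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical failure detector: the master records the last heartbeat it
    // received from each worker; a worker that has been silent for longer than
    // the timeout is treated as dead and its tasks are reset to idle.
    final class FailureDetector {
      private static final long TIMEOUT_MS = 10_000;   // assumed timeout
      private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

      void onHeartbeat(String workerId) {
        lastHeartbeat.put(workerId, System.currentTimeMillis());
      }

      boolean isDead(String workerId) {
        Long last = lastHeartbeat.get(workerId);
        return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
      }
    }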
Disk Locality
Uses GFS, which typically stores three copies of each data block
on different machines
Map tasks are scheduled “close” to data
On nodes that have input data (local disk)
If not, on nodes that are nearer to input data (e.g., same switch)
Task Granularity
Number of map tasks > number of worker nodes
Better load balancing
Better recovery
Stragglers
Slow workers (stragglers) delay the overall completion time
Bad disks with soft errors
Other tasks using up resources
Machine configuration problems, etc
Refinements: Partitioning Function
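In Hadoop, the user can replace the default hash partitioner with a custom one by extending the Partitioner class. Below is an illustrative sketch (the class name and the grouping-by-first-letter rule are assumptions, not part of the original slides); it would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative custom partitioner: all words that start with the same letter
    // are sent to the same reduce task, instead of hashing the whole key.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String s = key.toString();
        int firstChar = s.isEmpty() ? 0 : Character.toLowerCase(s.charAt(0));
        return firstChar % numReduceTasks;
      }
    }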
Refinements: Combiner Function
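A combiner performs a partial, map-side reduce before intermediate data is written and shuffled, which cuts network traffic. For word count the reduce function is associative and commutative, so the reducer itself can serve as the combiner. The driver sketch below wires this up, reusing the WordCountMapper/WordCountReducer classes from the earlier sketch (input/output paths and job submission are omitted).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // combiner = local reduce on the map side
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input/output paths and job.waitForCompletion(true) omitted for brevity.
      }
    }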
Hadoop MapReduce Summary
Hadoop is a software framework for distributed processing of
large datasets across large clusters of computers
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model: any data will fit
Design Principles of Hadoop
Need to process big data
Need to parallelize computation across thousands of nodes
Commodity hardware
Large number of low-end cheap machines working in
parallel to solve a computing problem
This is in contrast to Parallel DBs
Small number of high-end expensive machines
Automatic parallelization & distribution
Hidden from the end-user
Fault tolerance and automatic recovery
Nodes/tasks will fail and will recover automatically
Clean and simple programming abstraction
Users only provide two functions “map” and “reduce”
Divide and Conquer
[Figure: the "work" is partitioned into parts w1, w2, w3, ...; each worker produces a partial result r1, r2, r3, ...; the partial results are combined into the final "result".]
Distributed File System
Two major products are used in big data computing: HDFS
(Hadoop Distributed File System) and Google's GFS
(Google File System), now Colossus.
HDFS uses a master/slave architecture, in which a
master node (NameNode) and a group of slave nodes
(DataNodes) are used to create a data storage system.
A data file saved to HDFS is first partitioned into multiple
data blocks of a fixed size, and replicated copies of
each data block are distributed to various DataNodes.
Don’t move data to workers… move workers to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that has the data local
Why?
Not enough RAM to hold all the data in memory
Disk access is slow, but disk throughput is reasonable
A distributed file system is the answer
GFS (Google File System)
HDFS for Hadoop (= GFS clone)
Hadoop Distributed File System (HDFS)
Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Can be built out of commodity hardware
Master/slave architecture
An HDFS cluster consists of a single NameNode, a master server that
manages the file system namespace and regulates access to files by
clients
There are a number of DataNodes, usually one per node in the cluster
HDFS exposes a file system namespace and allows user data to be
stored in files
A file is split into one or more blocks, and the set of blocks is stored in
DataNodes
NameNode
• Maps a filename to list of Block IDs.
• Maps each Block ID to DataNodes containing a replica of the block.
• Stores metadata for the files, like the directory structure of a typical
FS.
• The server holding the NameNode instance is quite crucial, as there is
only one.
• Transaction log for file deletes/adds, etc. Does not use transactions for
whole blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a
DataNode failure
• Single namespace for the entire cluster. Files are broken up into blocks,
typically 64 MB in size
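As a hedged illustration, a client can override the block size and replication factor through the Hadoop Configuration object. dfs.blocksize and dfs.replication are the usual HDFS property names, but exact keys and defaults can differ between Hadoop versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HdfsBlockSettings {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);  // 64 MB blocks
        conf.setInt("dfs.replication", 3);                 // three replicas per block
        // Files created through this FileSystem handle use the settings above.
        FileSystem fs = FileSystem.get(conf);
        fs.close();
      }
    }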
DataNode
• The DataNodes manage storage attached to the nodes that they run on
• DataNodes serve read and write requests and perform block creation,
deletion, and replication upon instruction from the NameNode
•Maps a Block ID to a physical location on disk
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• Each block replicated on multiple DataNodes
Client
• Finds location of blocks from NameNode
• Accesses data directly from DataNode
•Client can only append to existing files
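A minimal sketch of such a client using Hadoop's FileSystem API follows; the path is hypothetical, and append support must be enabled on the cluster. The API hides the NameNode lookup: open() fetches block locations from the NameNode, while the bytes themselves are streamed from DataNodes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/input.txt");   // hypothetical path

        // Read: block locations come from the NameNode, data from the DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
          System.out.println(reader.readLine());
        }

        // Append: existing bytes are immutable; clients can only add data at the end.
        try (FSDataOutputStream out = fs.append(file)) {
          out.writeBytes("one more line\n");
        }
        fs.close();
      }
    }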
Data Coherency
• Write-once-read-many access model
Distributed file systems are good for millions of large files, but have very
high overheads and poor performance with billions of smaller tuples
Main Properties of HDFS
NameNode and DataNode Comparison
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Transaction log for file deletes/adds, etc. Does not use transactions for whole blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode failure
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
HDFS Architecture
[Figure: a client issues metadata operations (e.g., /home/foo/data, 6 replicas) to the NameNode and block operations (read/write, replication) directly to the DataNodes, where the blocks are stored.]
Master/Slave Mode
On the master node, a NameNode thread runs to handle
DataNode registration, file partitioning, data block
replication and distribution, job scheduling, and cluster
resource management
On each slave node, a DataNode thread runs to handle
data block storage, node state reporting, data
processing, etc.
The Google File System
GFS architecture and components: The GFS is composed of clusters.
A cluster is a set of networked computers. GFS clusters contain
three types of interdependent entities which are: Client, master
and chunk server.
Clients can be computers or applications that manipulate
existing files or create new files on the system.
The master server is the orchestrator or manager of the cluster
system and maintains the operation log. The operation log keeps track
of the activities of the master itself, which helps keep
service interruptions to a minimum.
At startup, the master server retrieves information about contents
and inventories from the chunk servers. Thereafter, the master server
keeps track of the location of the chunks within the cluster.
The GFS architecture keeps the messages that the master server
sends and receives very small. The master server itself
doesn't handle file data at all; this is done by the chunk servers.
Chunk servers are the core engine of the GFS. They store file
chunks of 64 MB in size. Chunk servers coordinate with the master
server and send requested chunks to clients directly.
GFS replicas: The GFS has two types of replicas: primary and secondary.
A primary replica is the data chunk that a chunk server sends to a
client.
Secondary replicas serve as backups on other chunk servers. The
master server decides which chunks act as primary or secondary.
If the client makes changes to the data in a chunk, the
master server lets the chunk servers holding secondary replicas
know that they have to copy the new chunk from the primary chunk
server to stay up to date.
Google File System (GFS) Motivation
GFS was developed in the late 1990s; it uses thousands of storage
systems built from inexpensive commodity components to
provide petabytes of storage to a large user community with
diverse needs
Motivation
1. Component failures are the norm
Appl./OS bugs, human errors, failures of disks, power supplies, …
2. Files are huge (multi-GB to multi-TB files)
3. The most common operation is to append to an existing
file; random write operations to a file are extremely
infrequent. Sequential read operations are the norm
4. The consistency model should be relaxed to simplify the
system implementation but without placing an additional
burden on the application developers
GFS Assumptions
The system is built from inexpensive commodity components
that often fail.
The system stores a modest number of large files.
The workload consists mostly of two kinds of reads: large
streaming reads and small random reads.
The workloads also have many large sequential writes that
append data to files.
The system must implement well-defined semantics for many
clients simultaneously appending to the same file.
Three main applications of Hadoop