Chapter Five: Hadoop MapReduce & HDFS

Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It allows for the parallel processing of large datasets across nodes in a cluster using a simple programming model. Hadoop features distributed storage, parallel processing, fault tolerance, and scalability.

Hadoop, a Distributed Framework for Big Data

What is Hadoop?
• An open-source software framework that supports data-
intensive distributed applications, licensed under the
Apache v2 license.
• Abstract and facilitate the storage and processing of large
and/or rapidly growing data sets.
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data
Hadoop Framework Tools
Major Hadoop components

• Hadoop has two major components: MapReduce and HDFS
  – MapReduce: a programming model (framework) that helps programs
perform parallel computation on data
  – Hadoop Distributed File System (HDFS): a distributed file
system that runs on standard or low-end hardware
Hadoop MapReduce?
MapReduce is a programming model that Google has used
successfully in processing its “big data” sets (~20 petabytes
per day)
• A map function extracts some intelligence from raw data.
• A reduce function aggregates the data output by the map,
according to some guides.
• Users specify the computation in terms of a map and a
reduce function.
• The underlying runtime system automatically parallelizes the
computation across large-scale clusters of machines.
• The underlying system also handles machine failures, efficient
communications, and performance issues.
MapReduce: “A new abstraction that allows us to express the
simple computations we were trying to perform but hides the
messy details of parallelization, fault-tolerance, data distribution
and load balancing in a library.”

• Programming model:
– Provides abstraction to express computation
• Library:
– Takes care of the runtime parallelisation of the computation.
• Distributed, with some centralization
• Main nodes of cluster are where most of the computational
power and storage of the system lies
MapReduce Programming Model
• Inspired from map and reduce operations commonly used in
functional programming languages like Lisp.
• Input: a set of key/value pairs
• User supplies two functions:
– map(k,v) → list(k1,v1)
– reduce(k1, list(v1)) → v2
• (k1,v1) is an intermediate key/value pair
• Output is the set of (k1,v2) pairs
• For our example, assume that system
– Breaks up files into lines, and
– Calls map function with value of each line
• Key is the line number
Programming Model

 Input: a set of key/value pairs


 Output: a set of key/value pairs
 Computation is expressed using the two functions:
1. Map task: a single pair  a list of intermediate pairs
 map(input-key, input-value)  list(out-key, intermediate-value)
 <ki, vi>  { < kint, vint > }

2. Reduce task: all intermediate pairs with the same kint  a list of values
 reduce(out-key, list(intermediate-value))  list(out-values)
 < kint, {vint} >  < ko, vo >
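To make these signatures concrete, the sketch below is a minimal, hypothetical Python rendering of the two function shapes (the names map_fn and reduce_fn and the word-count bodies are illustrative assumptions, not Hadoop's API):

from typing import Iterable, Iterator, Tuple

# map: (input-key, input-value) -> list of (out-key, intermediate-value)
def map_fn(input_key: str, input_value: str) -> Iterator[Tuple[str, int]]:
    # Illustrative body: emit one intermediate pair per word in the value.
    for word in input_value.split():
        yield (word, 1)

# reduce: (out-key, list of intermediate-values) -> list of out-values
def reduce_fn(out_key: str, intermediate_values: Iterable[int]) -> Iterator[int]:
    # Illustrative body: collapse all values for one key into a single value.
    yield sum(intermediate_values)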

Example: Counting the number of occurrences of each
word in a collection of documents

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
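For readers who want to run the pseudocode above, here is a minimal sketch of the same job written as Hadoop Streaming scripts in Python (the file names mapper.py and reducer.py are assumptions; Streaming feeds input lines on stdin and expects tab-separated key/value pairs on stdout):

# mapper.py -- emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- stdin arrives sorted by key, so all counts for a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

A job like this is typically submitted through the Hadoop Streaming jar, passing the two scripts as the -mapper and -reducer options; the exact jar location depends on the installation.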
Parallel Processing of MapReduce Job
[Figure: the user program forks copies of itself and a master process; the master assigns map tasks and reduce tasks to workers. Map workers read partitions of the input file and write intermediate files to local disk; reduce workers remote-read and sort those intermediate partitions and write the final output files.]
MapReduce: Word Count Example
• Consider the problem of counting the number of occurrences of
each word in a large collection of documents
• How would you do it in parallel?
• Solution:
– Divide documents among workers
– Each worker parses document to find all words, map
function outputs (word, count) pairs
– Partition (word, count) pairs across workers based on word
– For each word at a worker, the reduce function locally adds up
the counts
• Given input: “One a penny, two a penny, hot cross buns.”
– Records output by the map() function would be
• (“One”, 1), (“a”, 1), (“penny”, 1),(“two”, 1), (“a”, 1),
(“penny”, 1), (“hot”, 1), (“cross”, 1), (“buns”, 1).
– Records output by reduce function would be
• (“One”, 1), (“a”, 2), (“penny”, 2), (“two”, 1), (“hot”, 1),
(“cross”, 1), (“buns”, 1)
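A quick way to check these records by hand is to simulate the map, shuffle, and reduce steps locally; the short Python sketch below (no Hadoop involved) reproduces the output listed above:

from collections import defaultdict

line = "One a penny, two a penny, hot cross buns."

# map: emit (word, 1) per word, stripping trailing punctuation
map_output = [(w.strip(".,"), 1) for w in line.split()]

# shuffle: group intermediate pairs by key
groups = defaultdict(list)
for word, one in map_output:
    groups[word].append(one)

# reduce: sum the counts for each word
print({word: sum(counts) for word, counts in groups.items()})
# {'One': 1, 'a': 2, 'penny': 2, 'two': 1, 'hot': 1, 'cross': 1, 'buns': 1}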
Schematic Flow of Keys and Values
• Flow of keys and values in a map reduce task

[Figure: map inputs (mk1, mv1) … (mkn, mvn) produce map outputs such as (rk1, rv1), (rk7, rv2), (rk2, rv8); outputs that share a key are grouped, so each reduce input is a key with the list of all values emitted for it, e.g. (rk1, [rv1, rv7, …]), (rk7, [rv2, …]), (rk2, [rv8, rvi, …]).]
Hadoop’s Architecture: MapReduce Engine
Hadoop MapReduce Engine
A MapReduce Process
• JobClient
• Submit job
• JobTracker
• Manage and schedule job, split job into tasks;
• Splits up data into smaller tasks(“Map”) and
sends it to the TaskTracker process in each node
• TaskTracker
• Start and monitor the task execution;
• Reports back to the JobTracker node on job progress, sends
data (“Reduce”), or requests new jobs
• Child
• The process that really executes the task
MapReduce Engine
• Main nodes run a TaskTracker to accept and reply to MapReduce
tasks; main nodes also run a DataNode to store the needed blocks
as closely as possible
• Central control node runs NameNode to keep track of HDFS
directories & files, and JobTracker to dispatch compute tasks to
TaskTracker
• MapReduce requires a distributed file system and an engine
that can distribute, coordinate, monitor and gather the results.
• Hadoop provides that engine through HDFS (the file system we
discussed earlier) and the JobTracker + TaskTracker system.
• JobTracker is simply a scheduler.
• A TaskTracker is assigned a Map or Reduce task (or other
operations); the Map or Reduce runs on the same node as its
TaskTracker, and each task runs in its own JVM on that node.
• Written in Java, also supports Python and Ruby
MapReduce, Batch Processing
• Batch processing: processing large amounts of data at once, in
one go, to deliver a result according to a query on the data.
• Need for many computations over large/huge sets of data:
  – Input data: crawled documents, web request logs
  – Output data: inverted indices, summary of pages crawled per
host, the set of the most frequent queries in a given day, …
• Most of these computations are relatively straightforward
• To speed up computation and shorten processing time, we can
distribute data across 100s of machines and process them in
parallel
• But parallel computations are difficult and complex to manage:
  – Race conditions, debugging, data distribution, fault-tolerance,
load balancing, etc.
• Ideally, we would like to process data in parallel but not deal
with the complexity of parallelisation and data distribution
MapReduce Example Applications
• The MapReduce model can be applied to many applications:
  – Distributed grep (sketched below):
    map: emits a line if the line matches the pattern
    reduce: identity function
  – Count of URL access frequency
  – Reverse web-link graph
  – Inverted index
  – Distributed sort
  – …

MapReduce Implementation

• The MapReduce implementation presented in the paper
matched Google's infrastructure at the time:
  1. Large cluster of commodity PCs connected via switched Ethernet
  2. Machines are typically dual-processor x86, running Linux, with 2-4 GB of
memory (slow machines by today's standards)
  3. A cluster of machines, so failures are anticipated
  4. Storage with the Google File System (GFS, 2003) on IDE disks
attached to the PCs. GFS is a distributed file system that uses replication
for availability and reliability.
• Scheduling system:
  1. Users submit jobs
  2. Each job consists of tasks; the scheduler assigns tasks to machines
Parallel Execution
• User specifies:
  – M: number of map tasks
  – R: number of reduce tasks
• Map:
  – The MapReduce library splits the input file into M pieces
  – Typically 16-64 MB per piece (see the sketch after this list)
  – Map tasks are distributed across the machines
• Reduce:
  – Partition the intermediate key space into R pieces
  – hash(intermediate_key) mod R
• Typical setting:
  – 2,000 machines
  – M = 200,000
  – R = 5,000
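To connect M with the split size, a small back-of-the-envelope sketch in Python (the 1 TB input size is an illustrative assumption):

# Number of map tasks implied by splitting the input into 64 MB pieces.
input_bytes = 1 * 1024**4            # assume 1 TB of input
split_bytes = 64 * 1024**2           # 64 MB per piece (upper end of the 16-64 MB range)

M = -(-input_bytes // split_bytes)   # ceiling division
print(M)                             # 16384 map tasks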
Execution Flow

Master Data Structures

• For each map/reduce task, the master stores:
  – State/status: {idle, in-progress, completed}
  – Identity of the worker machine (for non-idle tasks)
• The location of intermediate file regions is passed from map
tasks to reduce tasks through the master.
• This information is pushed incrementally (as map tasks finish) to
workers that have in-progress reduce tasks.

Fault-Tolerance
Two types of failures:
1. Worker failures:
  – Identified by heartbeat messages sent by the master. If there is no
response within a certain amount of time, the worker is considered dead.
  – In-progress and completed map tasks are re-scheduled → idle
  – In-progress reduce tasks are re-scheduled → idle
  – Workers executing reduce tasks affected by failed map workers are
notified of the re-scheduling
  – Question: Why do completed map tasks have to be re-scheduled?
  – Answer: Map output is stored on the local file system, while reduce output
is stored on GFS
2. Master failure:
  – Rare
  – Can be recovered from checkpoints
  – Solution: abort the MapReduce computation and start again
Disk Locality

• Network bandwidth is a relatively scarce resource and also
increases latency
• The goal is to save network bandwidth
• GFS typically stores three copies of each data block
on different machines
• Map tasks are scheduled “close” to the data:
  – On nodes that have the input data (local disk)
  – If not possible, on nodes that are nearer to the input data (e.g., same switch)

Task Granularity
• Number of map tasks > number of worker nodes
  – Better load balancing
  – Better recovery
• But this increases the load on the master:
  – More scheduling
  – More state to be saved
• M could be chosen with respect to the block size of the file
system
  – For locality properties
• R is usually specified by users
  – Each reduce task produces one output file

Stragglers
• Slow workers delay overall completion time → stragglers
  – Bad disks with soft errors
  – Other tasks using up resources
  – Machine configuration problems, etc.
• Very close to the end of a MapReduce operation, the master schedules backup
executions of the remaining in-progress tasks.
• A task is marked as complete whenever either the primary or the
backup execution completes.
• Example: a sort operation takes 44% longer to complete when the
backup task mechanism is disabled.

Refinements: Partitioning Function

• The partitioning function identifies the reduce task
  – Users specify the number of output files they want, R
  – But there may be many more keys than R
  – Uses the intermediate key and R
  – Default: hash(key) mod R
• It is important to choose a well-balanced partitioning function:
  – hash(hostname(urlkey)) mod R
  – For output keys that are URLs, this places all URLs from the same
host in the same output file (see the sketch below)
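A minimal sketch of such a host-based partitioning function (plain Python to illustrate the idea; the function name is an assumption and this is not Hadoop's Partitioner API):

from urllib.parse import urlparse
from zlib import crc32

def url_partition(url_key: str, R: int) -> int:
    # All URLs from the same host hash to the same reduce task / output file.
    host = urlparse(url_key).netloc
    # crc32 is stable across runs (Python's built-in hash() for str is salted).
    return crc32(host.encode("utf-8")) % R

# Both keys land in the same partition out of R = 5000.
print(url_partition("http://example.com/a.html", 5000))
print(url_partition("http://example.com/b/c.html", 5000))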

Refinements: Combiner Function

• Introduces a mini-reduce phase before intermediate data is
sent to the reduce tasks
  – Used when there is significant repetition of intermediate keys
  – Merges values of intermediate keys before sending them to the reduce tasks
  – Example: in word count there are many records of the form <word, 1>;
merge records with the same word (see the sketch below)
  – Similar to the reduce function
• Saves network bandwidth
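The sketch referenced above shows, in plain Python, what a combiner does to one map task's local output for word count (for this job the combiner logic is the same as the reducer's):

from collections import Counter

# Intermediate pairs produced by one map task, before any network transfer.
map_output = [("hot", 1), ("cross", 1), ("buns", 1), ("hot", 1), ("hot", 1)]

# Combiner: locally merge values that share a key.
combined = Counter()
for word, count in map_output:
    combined[word] += count

print(list(combined.items()))
# [('hot', 3), ('cross', 1), ('buns', 1)] -- 3 records sent instead of 5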

Hadoop MapReduce Summary
• Hadoop is a software framework for distributed processing of
large datasets across large clusters of computers
  – Large datasets → terabytes or petabytes of data
  – Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google
MapReduce
• Hadoop is based on a simple programming model called
MapReduce
• Hadoop is based on a simple data model; any data will fit

Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
  – A large number of low-end, cheap machines working in
parallel to solve a computing problem
  – This is in contrast to parallel DBs: a small number of
high-end, expensive machines
• Automatic parallelization & distribution
  – Hidden from the end user
• Fault tolerance and automatic recovery
  – Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
  – Users only provide two functions, “map” and “reduce”
Divide and Conquer

[Figure: a “Work” item is partitioned into w1, w2, w3 and handed to three workers; each worker produces a partial result r1, r2, r3, and these are combined into the final “Result”.]
Distributed File System
• Two major products are used in big data computing: HDFS
(Hadoop Distributed File System) and Google's GFS
(Google File System, now Colossus).
• HDFS uses a master/slave architecture, in which a
master node (NameNode) and a group of slave nodes
(DataNodes) are used to create a data storage system.
• A data file saved to HDFS is first partitioned into multiple
data blocks of a fixed size, and duplicate copies of
each data block are distributed to various DataNodes.
• Don't move data to workers… move workers to the data!
  – Store data on the local disks of nodes in the cluster
  – Start up the workers on the node that has the data local
• Why?
  – Not enough RAM to hold all the data in memory
  – Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer:
  – GFS (Google File System)
  – HDFS for Hadoop (a GFS clone)
Hadoop Distributed File System (HDFS)
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Can be built out of commodity hardware
• Master/slave architecture:
  – An HDFS cluster consists of a single NameNode, a master server that
manages the file system namespace and regulates access to files by
clients
  – There are a number of DataNodes, usually one per node in the cluster
  – HDFS exposes a file system namespace and allows user data to be
stored in files
  – A file is split into one or more blocks, and the set of blocks is stored in
DataNodes
NameNode
• Maps a filename to list of Block IDs.
• Maps each Block ID to DataNodes containing a replica of the block.
• Stores metadata for the files, like the directory structure of a typical
FS.
• The server holding the NameNode instance is quite crucial, as there is
only one.
• Transaction log for file deletes/adds, etc. Does not use transactions for
whole blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a
DataNode failure
• Single namespace for the entire cluster. Files are broken up into blocks
  – Typically 64 MB block size
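To make the block and replica bookkeeping concrete, a small back-of-the-envelope sketch in Python (the 200 MB file size is an illustrative assumption; 64 MB blocks and a replication factor of 3 are the defaults quoted in these slides):

import math

file_mb     = 200    # assumed file size
block_mb    = 64     # typical HDFS block size
replication = 3      # default replication factor

blocks   = math.ceil(file_mb / block_mb)   # 4 blocks (the last one only partially full)
replicas = blocks * replication            # 12 block replicas spread over DataNodes
raw_mb   = file_mb * replication           # ~600 MB of raw disk consumed

print(blocks, replicas, raw_mb)            # 4 12 600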
DataNode
• The DataNodes manage storage attached to the nodes that they run on
• DataNodes serve read and write requests, and perform block creation,
deletion, and replication upon instruction from the NameNode
• Maps a block ID to a physical location on disk
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc.)
• Notifies the NameNode of what blocks it has
• Each block is replicated on multiple DataNodes
Client
• Finds the location of blocks from the NameNode
• Accesses data directly from the DataNodes
• A client can only append to existing files
Data Coherency
• Write-once-read-many access model
• Distributed file systems are good for millions of large files,
but have very high overheads and poor performance with billions of
smaller tuples
Main Properties of HDFS

• Large: an HDFS instance may consist of thousands of
server machines, each storing part of the file system's
data
• Replication: each data block is replicated many
times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault tolerance: detection of faults and quick,
automatic recovery from them is a core architectural
goal of HDFS
  – The NameNode is constantly checking the DataNodes

NameNode and DataNode Comparison

NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Keeps a transaction log for file deletes/adds, etc. Does not use transactions for
whole blocks or file streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode failure.

DataNode:
• Stores the actual data in HDFS.
• Can run on any underlying filesystem (ext3/4, NTFS, etc.)
• Notifies the NameNode of what blocks it has.
• The NameNode replicates blocks 2x in the local rack, 1x elsewhere.
HDFS Architecture
[Figure: a client issues metadata operations to the NameNode, which holds metadata such as file name and number of replicas (e.g. /home/foo/data, 6), and performs block read/write operations directly against DataNodes spread over racks; blocks are replicated across racks.]
• Structure: master/slave
• Organization:
  – One master node
  – A group of slave nodes
• A data file is partitioned into blocks of a fixed size
  – 3 copies of each data block are generated and stored on different DataNodes
Software Components on Hadoop

Master/Slave mode:
• On the master node, a NameNode thread runs to do
the work of DataNode registration, file partitioning, data block
replication and distribution, job scheduling, and cluster
resource management
• On each slave node, a DataNode thread runs to do
the tasks of data block storage, node state reporting, data
processing, etc.
The Google File System
GFS architecture and components: GFS is composed of clusters.
• A cluster is a set of networked computers. GFS clusters contain
three types of interdependent entities: clients, a master,
and chunk servers.
• Clients can be computers or applications manipulating
existing files or creating new files on the system.
• The master server is the orchestrator or manager of the cluster;
it maintains the operation log. The operation log keeps track
of the activities made by the master itself, which helps reduce
service interruptions to a minimum level.
• At startup, the master server retrieves information about contents
and inventories from the chunk servers. After that, the master server
keeps track of the location of the chunks within the cluster.
• The GFS architecture keeps the messages that the master server
sends and receives very small. The master server itself
doesn't handle file data at all; this is done by the chunk servers.
• Chunk servers are the core engine of the GFS. They store file
chunks of 64 MB in size. Chunk servers coordinate with the master
server and send requested chunks to clients directly.
• GFS replicas: GFS has two kinds of replicas, primary and secondary.
  – A primary replica is the data chunk that a chunk server sends to a
client.
  – Secondary replicas serve as backups on other chunk servers. The
master server decides which chunks act as primary or secondary.
If the client makes changes to the data in the chunk, the
master server lets the chunk servers holding secondary replicas
know that they have to copy the new chunk off the primary chunk
server to stay current.
Google File System (GFS) Motivation
• GFS was developed in the late 1990s; it uses thousands of storage
systems built from inexpensive commodity components to
provide petabytes of storage to a large user community with
diverse needs
• Motivation:
  1. Component failures are the norm
(application/OS bugs, human errors, failures of disks, power supplies, …)
  2. Files are huge (multi-GB to multi-TB files)
  3. The most common operation is to append to an existing
file; random write operations to a file are extremely
infrequent. Sequential read operations are the norm.
  4. The consistency model should be relaxed to simplify the
system implementation, but without placing an additional
burden on the application developers
GFS Assumptions
• The system is built from inexpensive commodity components
that often fail.
• The system stores a modest number of large files.
• The workload consists mostly of two kinds of reads: large
streaming reads and small random reads.
• The workloads also have many large sequential writes that
append data to files.
• The system must implement well-defined semantics for many
clients simultaneously appending to the same file.

Three main applications of Hadoop

• Advertisement (Mining user behavior to generate


recommendations)
• Searches (group related documents)
• Security (search for uncommon patterns)
