Lecture 4: Intro to MapReduce
MapReduce/Hadoop
Limitations of Existing Data Analytics Architecture
• Moving data to compute doesn’t scale
Slides from Dr. Amr Awadallah’s Hadoop talk at Stanford (CTO & VPE, Cloudera). ©2011 Cloudera, Inc. All Rights Reserved.
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
• The problem:
– Diverse input formats (data diversity & heterogeneity)
– Large Scale: Terabytes, Petabytes
– Parallelization
[Figure: input is split across workers w1–w3, which produce partial results r1–r3 that are combined into the final “Result”]
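The five steps above form a common skeleton. A minimal sketch of that skeleton in Python, assuming user-supplied extract and aggregate callables (the names and the log-record example are illustrative, not from any particular framework):

from itertools import groupby
from operator import itemgetter

def run_pipeline(records, extract, aggregate):
    # 1. iterate over a large number of records
    # 2. extract something of interest from each, as (key, value) pairs
    intermediate = []
    for record in records:
        intermediate.extend(extract(record))
    # 3. shuffle and sort intermediate results by key
    intermediate.sort(key=itemgetter(0))
    # 4. aggregate intermediate results per key; 5. generate final output
    return {key: aggregate(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))}

# Example: total bytes served per URL from web-server log records
logs = [("/index.html", 200, 1024), ("/img.png", 200, 2048), ("/index.html", 200, 512)]
totals = run_pipeline(logs,
                      extract=lambda rec: [(rec[0], rec[2])],      # (url, bytes)
                      aggregate=lambda key, values: sum(values))
# totals == {"/img.png": 2048, "/index.html": 1536}

MapReduce is essentially this skeleton with steps 1–2 parallelized as "map", step 3 done by the framework, and steps 4–5 parallelized as "reduce".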
Parallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than
workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have
finished?
• What if workers die?
What is the common theme of all of these problems?
Common Theme?
• Parallelization problems arise from:
– Communication between workers (e.g., to
exchange state)
– Access to shared resources (e.g., data)
• Thus, we need a synchronization mechanism
Managing Multiple Workers
• Difficult because
– We don’t know the order in which workers run
– We don’t know when workers interrupt each other
– We don’t know the order in which workers access shared data
• Thus, we need:
– Semaphores (acquire, release)
– Condition variables (wait, notify, broadcast)
– Barriers
• Still, lots of problems:
– Deadlock, livelock, race conditions...
– Dining philosophers, sleeping barbers, cigarette smokers...
• Moral of the story: be careful!
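To make the slide concrete, here is a minimal sketch of manual coordination with Python’s threading primitives: a lock protecting shared results and a barrier to learn that all workers have finished. The worker count and work split are illustrative only.

import threading

NUM_WORKERS = 4
results = {}                                # shared resource
results_lock = threading.Lock()             # guards access to shared data
done = threading.Barrier(NUM_WORKERS + 1)   # main thread + workers rendezvous here

def worker(worker_id, work_units):
    partial = sum(work_units)               # compute a partial result
    with results_lock:                      # synchronize the shared update
        results[worker_id] = partial
    done.wait()                             # signal completion

work = [range(i * 100, (i + 1) * 100) for i in range(NUM_WORKERS)]
for wid, units in enumerate(work):
    threading.Thread(target=worker, args=(wid, units), daemon=True).start()
done.wait()                                 # how do we know all the workers have finished? the barrier
print(sum(results.values()))                # aggregate the partial results

Even in this toy, forgetting the lock or miscounting the barrier parties gives races or a hang, which is exactly the moral of the slide.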
Current Tools
• Programming models
– Shared memory (pthreads)
– Message passing (MPI)
• Design Patterns
– Master-slaves
– Producer-consumer flows
– Shared work queues
[Figure: shared memory vs. message passing among processes P1–P5; a master farming work out to slaves via a shared work queue; a producer-consumer flow]
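A minimal sketch of the producer-consumer / shared-work-queue pattern with Python’s queue.Queue (the master and slave roles are illustrative):

import queue
import threading

work_queue = queue.Queue()

def slave():
    while True:
        item = work_queue.get()             # consumer: pull a work unit
        if item is None:                    # sentinel: no more work
            work_queue.task_done()
            break
        print(f"processed {item}")          # stand-in for real work
        work_queue.task_done()

workers = [threading.Thread(target=slave) for _ in range(3)]
for w in workers:                           # master: start the slaves
    w.start()
for unit in range(10):                      # producer: enqueue work units
    work_queue.put(unit)
for _ in workers:                           # one sentinel per slave
    work_queue.put(None)
work_queue.join()                           # block until every unit is processed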
Concurrency Challenge!
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about
– At the scale of datacenters (even across datacenters)
– In the presence of failures
– In terms of multiple interacting services
• Not to mention debugging…
• The reality:
– Lots of one-off solutions, custom code
– Write your own dedicated library, then program with it
– Burden on the programmer to explicitly manage
everything
What’s the point?
• It’s all about the right level of abstraction
– The von Neumann architecture has served us well,
but is no longer appropriate for the multi-
core/cluster environment
• Hide system-level details from the developers
– No more race conditions, lock contention, etc.
• Separating the what from how
– Developer specifies the computation that needs to be
performed
– Execution framework (“runtime”) handles actual
execution
Key Ideas
• Scale “out”, not “up”
– Limits of SMP and large shared-memory machines
• Move processing to the data
– Clusters have limited bandwidth
• Process data sequentially, avoid random access
– Seeks are expensive, disk throughput is reasonable
• Seamless scalability
– From the mythical man-month to the tradable
machine-hour
The datacenter is the computer!
[Figure: the functional-programming roots of MapReduce. Map applies a function f to every element independently; Fold aggregates the results with a function g]
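In Python the same two building blocks are the built-in map and functools.reduce (a fold); a tiny sketch:

from functools import reduce

data = [1, 2, 3, 4, 5]
squares = map(lambda x: x * x, data)                  # Map: apply f to each element independently
total = reduce(lambda acc, x: acc + x, squares, 0)    # Fold: aggregate with g and an initial value
print(total)  # 55

The map step is embarrassingly parallel; the fold is where partial results must be combined, and that combining step is what MapReduce restructures around keys.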
MapReduce
• Programmers specify two functions:
map (k, v) → [(k’, v’)]
reduce (k’, [v’]) → [(k’, v’)]
– All values with the same key are sent to the same
reducer
• The execution framework handles everything
else…
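For word count, the two user-supplied functions might look like this in Python (function names are illustrative; the grouping between them is done by the framework, not the programmer):

def map_fn(doc_id, text):
    # map(k, v) -> [(k', v')]: emit (word, 1) for every word in the document
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce(k', [v']) -> [(k', v')]: all counts for one word arrive at the same reducer
    return [(word, sum(counts))]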
Key Observation from Data Mining Algorithms (Jin & Agrawal, SDM’01)
• Popular algorithms have a common canonical loop:
While( ) {
  forall( data instances d ) {
    I = process(d)
    R(I) = R(I) op d
  }
  …….
}
• Can be used as the basis for supporting a common middleware (FREERide, Framework for Rapid Implementation of Data mining Engines)
• Targets distributed memory parallelism, shared memory parallelism, and their combination
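A minimal instance of that canonical loop in Python, assuming process(d) assigns each data instance to a reduction group and op is addition (a per-group sum):

data = [3.2, 7.5, 1.1, 9.8, 4.4, 6.0]

def process(d):
    return int(d) // 2          # assign the instance to a reduction group I

R = {}                          # the reduction object, indexed by group I
for d in data:                  # forall( data instances d )
    I = process(d)
    R[I] = R.get(I, 0) + d      # R(I) = R(I) op d, with op = +
print(R)

Because the per-group updates are associative and commutative, the loop can be split across workers and the per-worker R objects merged afterwards; that is the property both this middleware style and MapReduce exploit.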
[Figure: mappers emit intermediate pairs (a,1) (b,2) (c,3) (c,6) (a,5) (c,2) (b,7) (c,8); after shuffle and sort, three reducers produce outputs (r1,s1) (r2,s2) (r3,s3)]
MapReduce
• Programmers specify two functions:
map (k, v) → [(k’, v’)]
reduce (k’, [v’]) → [(k’, v’)]
– All values with the same key are sent to the same
reducer
• The execution framework handles everything
else…
What’s “everything else”?
MapReduce “Runtime”
• Handles scheduling
– Assigns workers to map and reduce tasks
• Handles “data distribution”
– Moves processes to data
• Handles synchronization
– Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
– Detects worker failures and restarts their tasks
• Everything happens on top of a distributed FS
(later)
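All of those responsibilities can be seen in miniature in a single-process sketch of the runtime: it "schedules" the map and reduce calls, moves intermediate data, and does the sort/shuffle, so the programmer supplies only the two functions. This toy version has no distribution or fault tolerance; it only illustrates the division of labor.

from collections import defaultdict

def map_fn(doc_id, text):
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    return [(word, sum(counts))]

def toy_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = []
    for key, value in inputs:                 # "scheduling": run map over every input split
        intermediate.extend(map_fn(key, value))
    groups = defaultdict(list)
    for k, v in intermediate:                 # "synchronization": gather and group by key
        groups[k].append(v)
    output = []
    for k in sorted(groups):                  # keys reach each reducer in sorted order
        output.extend(reduce_fn(k, groups[k]))
    return output

docs = [(1, "the quick brown fox"), (2, "the fox ate the mouse"), (3, "how now brown cow")]
print(toy_mapreduce(docs, map_fn, reduce_fn))
# [('ate', 1), ('brown', 2), ('cow', 1), ('fox', 2), ('how', 1),
#  ('mouse', 1), ('now', 1), ('quick', 1), ('the', 3)]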
MapReduce
• Programmers specify two functions:
map (k, v) → [(k’, v’)]
reduce (k’, [v’]) → [(k’, v’)]
– All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite…usually, programmers also specify:
partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod n
– Divides up key space for parallel reduce operations
combine (k’, [v’]) → [(k’, v’’)]
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
[Figure: mappers process inputs (k1,v1)…(k6,v6) and emit (a,1) (b,2) (c,3) (c,6) (a,5) (c,2) (b,7) (c,8); a combiner merges (c,3) and (c,6) into (c,9) before the shuffle; three reducers produce (r1,s1) (r2,s2) (r3,s3)]
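Continuing the Python word-count sketch, the two optional functions might look as follows; the hash-mod partitioner and the summing combiner are typical choices, not the only possible ones:

from collections import defaultdict

def partition_fn(key, num_partitions):
    # partition(k', number of partitions) -> partition for k'
    return hash(key) % num_partitions

def combine_fn(key, values):
    # combine(k', [v']) -> [(k', v'')]: a mini-reduce over one map task's output
    return [(key, sum(values))]

# apply the combiner to a single map task's output before it crosses the network
map_output = [("the", 1), ("fox", 1), ("ate", 1), ("the", 1), ("mouse", 1)]
local = defaultdict(list)
for k, v in map_output:
    local[k].append(v)
combined = [pair for k, vs in local.items() for pair in combine_fn(k, vs)]
# combined carries ("the", 2) once instead of ("the", 1) twice

# the partitioner decides which reducer receives each key
buckets = defaultdict(list)
for k, v in combined:
    buckets[partition_fn(k, 3)].append((k, v))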
Two more details…
• Barrier between map and reduce phases
– But we can begin copying intermediate data
earlier
• Keys arrive at each reducer in sorted order
– No enforced ordering across reducers
MapReduce can refer to…
• The programming model
• The execution framework (aka “runtime”)
• The specific implementation
[Figure: word count data flow without a combiner. Map tasks over "the quick brown fox", "the fox ate the mouse", and "how now brown cow" emit (word, 1) pairs; two reduce tasks produce (brown,2), (fox,2), (how,1), (now,1), (the,3) and (ate,1), (cow,1), (mouse,1), (quick,1)]
An Optimization: The Combiner
• A combiner is a local aggregation function
for repeated keys produced by the same map
• For associative and commutative ops like sum, count, max
• Decreases size of intermediate data
[Figure: the same word count data flow with a combiner. The map task for "the fox ate the mouse" emits (the,2) instead of two (the,1) pairs; the reduce outputs are unchanged: (brown,2), (fox,2), (how,1), (now,1), (the,3), (ate,1), (cow,1), (mouse,1), (quick,1)]
[Figure: MapReduce execution overview. (1) The user program submits the job to the master, which assigns tasks to workers; (3) map workers read input splits 0–4; (4) they write intermediate data to local disk; (5) reduce workers remote-read that data; (6) reduce workers write output files (file 0, file 1)]
[Figure: compute nodes attached to a SAN]
[Figure: HDFS architecture. An application opens a file such as /foo/bar through the HDFS client; the client sends (file name, block id) to the NameNode, which holds the file namespace (e.g., block 3df2) and returns (block id, block location); the client then requests (block id, byte range) from an HDFS DataNode and receives block data, which each DataNode stores in its local Linux file system. The NameNode sends instructions to DataNodes and receives DataNode state; a Secondary NameNode and the cluster membership of DataNodes are also shown]
NameNode : Maps a file to a file-id and a list of DataNodes
DataNode : Maps a block-id to a physical location on disk
SecondaryNameNode: Periodic merge of Transaction log
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
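As a sketch of what the "intelligent client" computes, here is the block arithmetic for a 128 MB block size; the helper name is hypothetical, and the real HDFS client obtains block locations from the NameNode rather than deriving them itself:

BLOCK_SIZE = 128 * 1024 * 1024   # typical HDFS block size

def blocks_for_read(offset, length):
    # which block indices of the file does a read of [offset, offset+length) touch?
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return list(range(first, last + 1))

# a 300 MB read starting 200 MB into a file spans blocks 1, 2 and 3
print(blocks_for_read(200 * 1024 * 1024, 300 * 1024 * 1024))  # [1, 2, 3]

The client asks the NameNode where the replicas of each of those blocks live and then streams the bytes directly from a nearby DataNode.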
NameNode Metadata
• Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g., creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
Namenode Responsibilities
• Managing the file system namespace:
– Holds file/directory structure, metadata, file-to-
block mapping, access permissions, etc.
• Coordinating file operations:
– Directs clients to datanodes for reads and writes
– No data is moved through the namenode
• Maintaining overall health:
– Periodic communication with the datanodes
– Block re-replication and rebalancing
– Garbage collection
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block (e.g. CRC)
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to
the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
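A sketch of that default strategy over a simple rack-to-nodes map; this is an illustration of the policy described above, not the actual HDFS BlockPlacementPolicy code:

import random

def place_replicas(topology, writer_node, num_replicas=3):
    # topology: dict mapping rack name -> list of node names
    writer_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    replicas = [writer_node]                         # first replica on the local node
    remote_rack = random.choice([r for r in topology if r != writer_rack])
    candidates = [n for n in topology[remote_rack] if n not in replicas]
    replicas.append(random.choice(candidates))       # second replica on a remote rack
    candidates = [n for n in topology[remote_rack] if n not in replicas]
    replicas.append(random.choice(candidates))       # third replica on the same remote rack
    spare = [n for nodes in topology.values() for n in nodes if n not in replicas]
    while len(replicas) < num_replicas and spare:    # additional replicas placed randomly
        replicas.append(spare.pop(random.randrange(len(spare))))
    return replicas

topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas(topology, writer_node="n1"))    # e.g. ['n1', 'n5', 'n4']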
Data Correctness
• Use Checksums to validate data
– Use CRC32
• File Creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from
DataNode
– If validation fails, the client tries other replicas
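A sketch of that scheme with CRC32 over 512-byte chunks; zlib.crc32 stands in for the HDFS checksum code, and the chunk size and fallback logic here only mirror the bullets above:

import zlib

CHUNK = 512  # bytes of data covered by each checksum

def checksums(data):
    # file creation: one CRC32 per 512-byte chunk (stored alongside the block)
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verified_read(replicas, stored_sums):
    # file access: retrieve data, recompute checksums, fall back on mismatch
    for replica_data in replicas:
        if checksums(replica_data) == stored_sums:
            return replica_data
    raise IOError("all replicas failed checksum validation")

good = b"x" * 1300
corrupt = b"x" * 600 + b"?" + b"x" * 699    # one flipped byte in the second chunk
sums = checksums(good)
assert verified_read([corrupt, good], sums) == good   # client tries the other replica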
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
Putting everything together…
[Figure: a cluster of slave nodes, each co-hosting HDFS and MapReduce daemons]
MapReduce Data Flow
[Figures: MapReduce data flow diagrams, from Hadoop: The Definitive Guide, 2nd Edition, Tom White, O’Reilly]
• References:
• Hadoop: The Definitive Guide, Tom White, O’Reilly
• Hadoop In Action, Chuck Lam, Manning