Chapter 4
Why Distribution?
Motivation
Big data: volume, velocity, variety
We need more storage space
Space efficiency
We need more computing power
Time efficiency
We need more I/O throughput
Distribution is inevitable given the limits of single-machine computer architecture
Divide and Conquer
[Diagram: the "Work" is partitioned into units w1, w2, w3; workers produce partial results r1, r2, r3, which are combined into the "Result"]
Parallelization Challenges
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
Shared memory vs. message passing (MPI)
Design Patterns
Master-slaves
Producer-consumer flows
Shared work queues (a minimal sketch follows below)
[Diagram: a master coordinating slave processes P1-P5; producers and consumers connected by a shared work queue]
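To make the shared work queue pattern concrete, here is a minimal Python sketch (my illustration, not from the slides): a master thread enqueues work units, several worker threads consume them, and a sentinel value tells each worker when to stop.

import queue
import threading

work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    # Pull work units until the sentinel (None) arrives.
    while True:
        item = work_queue.get()
        if item is None:
            work_queue.task_done()
            break
        partial = item * item          # stand-in for real processing
        with results_lock:
            results.append(partial)
        work_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for unit in range(10):                 # the "master" assigns work units
    work_queue.put(unit)
for _ in threads:                      # one sentinel per worker
    work_queue.put(None)

work_queue.join()                      # all work units (and sentinels) processed
print(sorted(results))                 # [0, 1, 4, 9, ..., 81]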
Where the rubber meets the road
Concurrency is difficult to reason about
Concurrency is even more difficult to reason about
At the scale of datacenters and across datacenters
In the presence of failures
In terms of multiple interacting services
Not to mention debugging…
The reality:
Lots of one-off solutions, custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything
“Big Ideas”
Scale “out”, not “up”
Limits of SMP and large shared-memory machines
Symmetric multiprocessing (SMP) is a computing architecture in which two or more processors are attached to a single shared memory.
Move processing to the data
Cluster has limited bandwidth
Process data sequentially, avoid random access
Seeks are expensive, disk throughput is reasonable
Seamless scalability
From the mythical man-month to the tradable machine-hour
Scaling “up” vs. “out”
No single machine is large enough
Smaller cluster of large SMP machines vs. larger cluster of commodity machines (e.g., 16 128-core machines vs. 128 16-core machines)
Nodes need to talk to each other!
Intra-node latencies: ~100 ns
Inter-node latencies: ~100 µs
Distributed Computing for Data Mining
Large-scale Computing
Large-scale computing for data mining problems on commodity hardware
Challenges:
How do you distribute computation?
How can we make it easy to write distributed programs?
Machines fail:
One server may stay up 3 years (~1,000 days)
If you have 1,000 servers, expect to lose one per day
With 1M machines, 1,000 machines fail every day!
An Idea and a Solution
Issue:
Copying data over a network takes time
Idea:
Bring computation to data
Store files multiple times for reliability
Spark/Hadoop address these problems
Storage Infrastructure – File system
Google: GFS. Hadoop: HDFS
Programming model
MapReduce
Spark
Storage Infrastructure
Problem:
If nodes fail, how to store data persistently?
Answer:
Distributed File System
Provides global file namespace
Typical usage pattern:
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common
Distributed File System
Chunk servers
File is split into contiguous chunks
Typically each chunk is 16-64MB
Each chunk replicated (usually 2x or 3x)
Try to keep replicas in different racks
Master node
a.k.a. Name Node in Hadoop’s HDFS
Stores metadata about where files are stored
Might be replicated
Client library for file access
Talks to master to find chunk servers
Connects directly to chunk servers to access data (a conceptual sketch of this read path follows below)
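To illustrate the client read path described above, here is a purely conceptual Python sketch (my illustration); the Master, ChunkServer, and dfs_read names are hypothetical and are not the real GFS/HDFS client API.

class Master:
    # Stores metadata only: which chunk servers hold which chunk of which file.
    def __init__(self, chunk_locations):
        # e.g. {("bigfile.txt", 0): ["server-a", "server-b", "server-c"], ...}
        self.chunk_locations = chunk_locations
    def lookup(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]

class ChunkServer:
    # Holds the actual chunk bytes; chunks are replicated across servers.
    def __init__(self, chunks):
        self.chunks = chunks           # {(filename, chunk_index): bytes}
    def read(self, filename, chunk_index):
        return self.chunks[(filename, chunk_index)]

def dfs_read(master, servers, filename, num_chunks):
    # Client library: ask the master where chunks live, then read directly from chunk servers.
    data = b""
    for i in range(num_chunks):
        replicas = master.lookup(filename, i)            # metadata request to the master
        data += servers[replicas[0]].read(filename, i)   # data transfer bypasses the master
    return data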
Distributed File System
Reliable distributed file system
Data kept in “chunks” spread across machines
Each chunk replicated on different machines
Seamless recovery from disk or machine failure
[Diagram: chunks C0, C1, C2, C3, C5, D0, D1, … replicated across Chunk server 1, Chunk server 2, Chunk server 3, …, Chunk server N]
HDFS Architecture – Master/Slave
Typical Big Data Problem
Iterate over a large number of records
Map: extract something of interest from each
Shuffle and sort intermediate results
Reduce: aggregate intermediate results
Generate final output
[Diagram: Map applies a function f to every record; Fold aggregates the results with a function g; a minimal sketch follows below]
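As a minimal illustration of the map/fold pattern above (my example, not from the slides), f extracts a value from each record and g folds the extracted values into a running aggregate:

from functools import reduce

# Records: (URL, size-in-bytes) pairs; f extracts the size, g sums sizes.
records = [("example.com/a", 1200), ("example.com/b", 340), ("example.org/c", 980)]

f = lambda record: record[1]               # "extract something of interest from each"
g = lambda acc, value: acc + value         # "aggregate intermediate results"

intermediate = map(f, records)             # Map phase
total_bytes = reduce(g, intermediate, 0)   # Fold/Reduce phase
print(total_bytes)                         # 2520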
Programming Model
MapReduce is a style of programming designed for:
1. Easy parallel programming
2. Invisible management of hardware and software failures
3. Easy management of very-large-scale data
[Diagram: MapReduce execution overview: (1) the client submits the job to the Master; (2) the Master assigns map and reduce tasks to workers; (3) map workers read their input splits (split 0 to split 4); (4) intermediate results are written to local disk; (5) reduce workers remote-read the intermediate data; (6) reduce workers write the output files (output file 0, file 1)]
MapReduce: The Map Step
[Diagram: map tasks transform input key-value pairs (k, v) into intermediate key-value pairs]
MapReduce: The Reduce Step
[Diagram: intermediate key-value pairs are grouped by key into key-value groups (k, [v, v, …]), and each group is reduced to output key-value pairs]
Map-Reduce: A diagram
MAP: reads input and produces a set of key-value pairs
Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition)
Reduce: collects all values belonging to the key and outputs the result
Map-Reduce: In Parallel
All phases are distributed with many tasks doing the work
MapReduce Pattern
[Diagram: input key-value pairs flow through Mappers, are shuffled by key, and Reducers produce the output key-value pairs]
Example: Word Counting
Example MapReduce task:
We have a huge text document
Count the number of times each distinct word appears in the file
Many applications of this:
Analyze web server logs to find popular URLs
Statistical machine translation:
Need to count the number of times every 5-word sequence occurs in a large corpus of documents
MapReduce: Word Counting
Provided by the programmer: MAP (read input and produce a set of key-value pairs) and Reduce (collect all values belonging to the key and output); the system performs Group by key (collect all pairs with the same key)
[Diagram: a big document ("The crew of the space shuttle Endeavor recently returned to Earth …") is read; MAP emits pairs such as (The, 1), (crew, 1), (of, 1), (space, 1), (the, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), …; Group by key gathers, e.g., (crew, 1), (crew, 1) and (the, 1), (the, 1), (the, 1); Reduce outputs (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), …]
Only sequential reads of the data are needed
Word Count Using MapReduce
map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
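The pseudocode above maps directly onto plain Python; the sketch below (my illustration, single-process only) simulates the map, group-by-key, and reduce phases for word counting.

from collections import defaultdict

def map_fn(doc_name, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all partial counts for this word.
    yield (word, sum(counts))

def mapreduce(documents):
    # Map phase
    intermediate = []
    for doc_name, text in documents.items():
        intermediate.extend(map_fn(doc_name, text))
    # Group-by-key (shuffle) phase
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output

docs = {"d1": "the crew of the space shuttle", "d2": "the shuttle returned"}
print(sorted(mapreduce(docs)))
# [('crew', 1), ('of', 1), ('returned', 1), ('shuttle', 2), ('space', 1), ('the', 3)]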
MapReduce: Environment
MapReduce environment takes care of:
Partitioning the input data
Scheduling the program’s execution across a set of machines
Performing the group-by-key step
In practice this is the bottleneck (a partitioning sketch follows below)
Handling machine failures
Managing required inter-machine communication
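The group-by-key step is typically implemented by hash-partitioning intermediate pairs across the reduce tasks. The sketch below is my illustration with a hypothetical number of reducers (real frameworks use a deterministic hash; Python's built-in hash is salted per process, which is fine for a single-run demo).

from collections import defaultdict

NUM_REDUCERS = 4   # hypothetical setting

def partition(key, num_reducers=NUM_REDUCERS):
    # Each intermediate key is routed to exactly one reducer.
    return hash(key) % num_reducers

def shuffle(intermediate_pairs):
    # Group pairs by destination reducer, then by key within each reducer.
    per_reducer = defaultdict(lambda: defaultdict(list))
    for key, value in intermediate_pairs:
        per_reducer[partition(key)][key].append(value)
    return per_reducer

pairs = [("the", 1), ("crew", 1), ("the", 1), ("space", 1)]
for reducer_id, groups in shuffle(pairs).items():
    print(reducer_id, dict(groups))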
Dealing with Failures
Map worker failure
Map tasks completed or in-progress at the worker are reset to idle and rescheduled
Reduce workers are notified when a map task is rescheduled on another worker
Reduce worker failure
Only in-progress tasks are reset to idle and the reduce task is restarted
Putting everything together…
[Diagram: a Hadoop cluster of slave nodes, with HDFS Federation providing storage and YARN acting as the resource manager]
Hadoop 2: HDFS Federation
Two major components in HDFS:
Namespaces
Block storage service
Hadoop 1:
A single NameNode manages the entire namespace
HDFS Federation:
Multiple NameNode servers manage namespaces
Horizontal scaling, performance improvement, multiple namespaces
Hadoop 2: YARN
Brings significant performance improvements for some applications
Supports additional processing models
Implements a more flexible execution engine
YARN: from Hadoop 2
What is YARN?
A resource manager separated from MapReduce in Hadoop 1
The operating system of Hadoop:
Managing and monitoring workloads
Maintaining a multi-tenant environment
Implementing security controls
Managing high-availability features of Hadoop
No longer limited to the I/O-intensive, high-latency MapReduce model
Spark
Problems with MapReduce
Two major limitations of MapReduce:
Difficulty of programming directly in MR
Many problems aren’t easily described as map-reduce
Performance bottlenecks, or batch processing not fitting the use cases
Persistence to disk is typically slower than in-memory work
Spark: Most Popular Data-Flow System
Spark Architecture
Components
Spark applications
Independent sets of processes on a cluster, coordinated by the SparkContext object in the main (Driver) program
Cluster managers
Allocate resources across applications
Executors
Processes that run computations and store data
Spark sends your application code (JAR or Python files) to the executors, and SparkContext sends tasks to the executors to run (a minimal PySpark sketch follows below)
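A minimal PySpark driver program, assuming a local Spark installation (the master URL and application name are placeholders, not values from the slides):

from pyspark.sql import SparkSession

# Driver program: creates the SparkSession/SparkContext that coordinates executors.
spark = (SparkSession.builder
         .master("local[*]")          # placeholder: use spark://..., yarn, or mesos:// on a cluster
         .appName("architecture-demo")
         .getOrCreate())
sc = spark.sparkContext

# The driver describes the computation; the tasks run on executor processes.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.map(lambda x: x * x).sum())

spark.stop()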
Note about the architecture
Each application has its own executor processes
Benefit: isolating applications from each other
Con: data cannot be shared across different applications without writing to external storage
Spark is agnostic to the underlying cluster manager
The driver program must listen for and accept incoming connections from its executors throughout its lifetime
The driver must be network addressable from the workers
The driver program should run close to the worker nodes, since it schedules tasks on the cluster
Cluster Manager Types
Standalone: a simple cluster manager included in Spark
Faster job startup, but it doesn’t support communication with an HDFS secured with the Kerberos authentication protocol
Apache Mesos: a general cluster manager that can also run Hadoop MapReduce
A scalable and fault-tolerant “distributed systems kernel” written in C++, which also supports C++ and Python applications
It is actually a “scheduler of scheduler frameworks” because of its two-level scheduling architecture
Hadoop YARN: the resource manager in Hadoop 2
YARN lets you run different types of Java applications, not just Spark
It also provides methods for isolating and prioritizing applications among users and organizations
The only cluster manager type that supports Kerberos-secured HDFS
You don’t have to install Spark on all nodes in the cluster
Spark Standalone Cluster
Two deploy modes for a Spark Standalone cluster:
Client mode: the driver is launched in the same process as the client that submits the application
Cluster mode: the driver is launched inside one of the Worker processes in the cluster, and the client exits as soon as it has submitted the application
Client mode vs. Cluster mode
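For reference, the deploy mode is selected with spark-submit's --deploy-mode flag; the master URL, class name, and JAR below are placeholders, not values from the slides.

# Client mode: the driver runs in the submitting process
spark-submit --master spark://master-host:7077 --deploy-mode client --class com.example.MyApp my-app.jar

# Cluster mode: the driver runs inside a Worker process; the client returns once the application is submitted
spark-submit --master spark://master-host:7077 --deploy-mode cluster --class com.example.MyApp my-app.jar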
Spark on YARN cluster mode
Spark on YARN client mode
Spark: RDD
Key concept: Resilient Distributed Dataset (RDD)
Partitioned collection of records
Generalizes (key-value) pairs
Spread across the cluster, read-only
Datasets can be cached in memory
Different storage levels available
Fallback to disk possible (a minimal RDD sketch follows below)
[Diagram: an RDD lineage graph, with RDDs (C, D, E, F, …) connected by transformations such as groupBy and join and grouped into stages]
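A small PySpark sketch of these ideas (my illustration, assuming the sc from the earlier SparkSession example): the RDD is partitioned, transformed without mutation, and cached in memory.

# Assumes `sc` is an existing SparkContext (see the earlier SparkSession sketch).
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=4)  # partitioned collection of records

# Transformations build new (read-only) RDDs; nothing is mutated in place.
sums = pairs.reduceByKey(lambda x, y: x + y)

sums.cache()                 # keep the dataset in memory (MEMORY_ONLY by default)
print(sums.collect())        # e.g. [('a', 4), ('b', 2)]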
Problems Suited for MapReduce
Example: Host size
Suppose we have a large web corpus
Look at the metadata file
Lines of the form: (URL, size, date, …)
For each host, find the total number of bytes
That is, the sum of the page sizes for all URLs from that particular host (a sketch follows after this list)
Other examples:
Link analysis and graph processing
Machine learning algorithms
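A sketch of the host-size job in the same style as the word-count example (my illustration; the tab-separated record format and function names are assumptions):

from urllib.parse import urlparse

def map_fn(line):
    # Assumed line format: "URL<TAB>size<TAB>date..."
    url, size, *_ = line.split("\t")
    host = urlparse(url).netloc        # key each page by its host
    yield (host, int(size))

def reduce_fn(host, sizes):
    yield (host, sum(sizes))           # total bytes for all URLs from this host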
Example: Language Model
Statistical machine translation:
Need to count the number of times every 5-word sequence occurs in a large corpus of documents (a sketch follows below)
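In the word-count style above, a possible map function emits each 5-word sequence (5-gram) with a count of 1, and the reduce function sums the counts (my sketch, not from the slides):

def map_fn(doc_id, text):
    words = text.split()
    for i in range(len(words) - 4):
        five_gram = " ".join(words[i:i + 5])
        yield (five_gram, 1)

def reduce_fn(five_gram, counts):
    yield (five_gram, sum(counts))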
Example: Join By Map-Reduce
Compute the natural join R(A,B) ⋈ S(B,C)
R and S are each stored in files
Tuples are pairs (a,b) or (b,c)
R(A, B) = {(a1, b1), (a2, b1), (a3, b2), (a4, b3)}
S(B, C) = {(b2, c1), (b2, c2), (b3, c3)}
R ⋈ S = {(a3, c1), (a3, c2), (a4, c3)}
Map-Reduce Join
Use a hash function h from B-values to 1...k
A Map process turns:
Each input tuple R(a,b) into the key-value pair (b, (a, R))
Each input tuple S(b,c) into (b, (c, S))
Each key b and its associated values go to the Reduce process h(b), which pairs every (a, R) with every (c, S) to produce the joined tuples (a, b, c) (a sketch follows below)
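A single-process sketch of this reduce-side join (my illustration; the in-memory grouping stands in for the shuffle to Reduce process h(b)):

from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

# Map: tag every tuple with its relation and key it by the B-value.
groups = defaultdict(lambda: {"R": [], "S": []})
for a, b in R:
    groups[b]["R"].append(a)
for b, c in S:
    groups[b]["S"].append(c)

# Reduce: for each B-value, pair every R-side value with every S-side value.
joined = [(a, b, c)
          for b, sides in groups.items()
          for a in sides["R"]
          for c in sides["S"]]
print(sorted(joined))   # [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]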
Cost Measures for Algorithms
In MapReduce we quantify the cost of an algorithm using:
1. Communication cost = total I/O of all processes
2. Elapsed communication cost = max of I/O along any path
3. (Elapsed) computation cost: analogous, but counting only the running time of processes
Note that here big-O notation is not the most useful (adding more machines is always an option)
Example: Cost Measures
For a map-reduce algorithm:
Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
Elapsed communication cost is the sum of the largest input + output for any Map process, plus the same for any Reduce process
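As a small worked example with hypothetical numbers (not from the slides): if the input is 100 GB, the Map processes pass 40 GB of intermediate files to the Reduce processes, and the Reduce output is 10 GB, then the communication cost is 100 + 2 × 40 + 10 = 190 GB, since each intermediate file is both written by a Map process and read by a Reduce process.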
What Cost Measures Mean
Either the I/O (communication) or the processing (computation) cost dominates
Ignore one or the other
Cost of Map-Reduce Join
Total communication cost = O(|R| + |S| + |R ⋈ S|)
Elapsed communication cost = O(s)
We’re going to pick k and the number of Map processes so that the I/O limit s is respected
We put a limit s on the amount of input or output that any one process can have; s could be:
What fits in main memory
What fits on local disk
With proper indexes, the computation cost is linear in the input + output size
So the computation cost is like the communication cost
Thanks for Your Attention!
References
Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman, Mining of Massive Datasets, 2nd Edition, Cambridge University Press, November 2014. (Ch. 1-2)
Slides from Stanford CS246 “Mining Massive Datasets” (by Jure Leskovec)
Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool, 2010. (Ch. 1-2)
Slides from Jimmy Lin’s “Big Data Infrastructure” course, Univ. of Maryland (and Univ. of Waterloo)
References
Papers:
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” SOSP 2003. (GFS)
Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004. (MapReduce)