DSECL ZG 522: Big Data Systems: Session 2: Parallel and Distributed Systems
Topics for today
Serial Computing
Spectrum of Parallelism
Pseudo-parallel → (intra-processor) Parallel → Distributed
Multi-processor Vs Multi-computer systems
(Figure: UMA vs NUMA memory architectures)
Interconnection Networks
Classify based on Instruction and Data parallelism
• The term ‘stream’ refers to a sequence or flow of either instructions or data operated on by the computer.
• In the complete cycle of instruction execution, a flow of instructions from main memory to the CPU is established. This flow of instructions is called the instruction stream.
• Similarly, there is a flow of operands between processor and memory, bi-directionally. This flow of operands is called the data stream.
Flynn’s Taxonomy
                    Instruction Streams
                    Single             Multiple
Data     Single     SISD               MISD
Streams             (uniprocessors)    (uncommon)
         Multiple   SIMD               MIMD
Some basic concepts ( esp. for programming in Big Data Systems )
» Coupling
  » Tight - SIMD, MISD, shared memory systems
  » Loose - NOW, distributed systems, no shared memory
» Speedup
  » How much faster a program runs when given N processors as opposed to 1 processor: T(1) / T(N)
  » We will study Amdahl’s Law and Gustafson’s Law
» Parallelism of a program
  » Compare time spent in computations to time spent for communication via shared memory or message passing
» Granularity
  » Average number of compute instructions before communication is needed across processors
» Note:
  » If granularity is coarse, use distributed systems; else use tightly coupled multi-processors/computers
  » Potentially high parallelism does not lead to high speedup if granularity is too small, since communication overheads become dominant
Comparing Parallel and Distributed Systems
Distributed System                            Parallel System
Programs have coarse grain parallelism        Programs may demand fine grain parallelism
Motivation for parallel / distributed systems (1)
• Inherently distributed applications
  • e.g. a financial transaction involving 2 or more parties
• Better scale by creating multiple smaller parallel tasks instead of one complex task
  • e.g. evaluating an aggregate over 6 months of data
• Processors getting cheaper and networks faster
  • e.g. processor speed doubles every 1.5 years, network traffic doubles every year; processors are limited by energy consumption
• Better scale using replication or partitioning of storage
  • e.g. replicated media servers for faster access, or shards in search engines
• Access to shared remote resources
  • e.g. a remote central DB
• Increased performance/cost ratio compared to specialized parallel systems
  • e.g. a search engine running on a Network-of-Workstations
Motivation for parallel / distributed systems (2)
• Better reliability
  • because there is less chance of multiple cluster nodes failing together
  • Be careful about Integrity: consistent state of a resource across concurrent access
• Incremental scalability
  • Add more nodes in a cluster to scale up
  • e.g. clusters in Cloud services, autoscaling in AWS to resize a cluster
Example: Netflix
reference: https://medium.com/refraction-tech-everything/how-netflix-works-the-hugely-simplified-complex-stuff-that-happens-every-time-you-hit-play-3a40c9be254b
Distributed network of content caching servers
This would be a P2P network if you were using BitTorrent to get the content for free
Examples
Techniques for High Data Processing
Cluster computing
  Description: A collection of computers, homogeneous or heterogeneous, built from commodity components and running open source or proprietary software, communicating via message passing
  Usage: Commonly used in Big Data Systems, such as Hadoop

Massively Parallel Processing (MPP)
  Description: Typically proprietary Distributed Shared Memory machines with integrated storage
  Usage: May be used in traditional Data Warehouses and Data processing appliances, e.g. EMC Greenplum (PostgreSQL on an MPP)

High-Performance Computing (HPC)
  Description: Known to offer high performance and scalability by using in-memory computing
  Usage: Used to develop specialty and custom scientific applications for research, where results are more valuable than cost
Topics for today
Limits of Parallelism
Amdahl’s Law
• S(N) = 1 / ( f + (1-f) / N )
• Implication:
  ✓ As N → ∞, S(N) → 1/f
  ✓ The effective speedup is limited by the sequential fraction f of the code (see the sketch below)
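A minimal Python sketch of this saturation (illustrative, not from the slides; the sequential fraction f = 0.05 is an assumed value):

def amdahl_speedup(f, n):
    # S(N) = 1 / (f + (1-f)/N): f is the sequential fraction of the code
    return 1.0 / (f + (1.0 - f) / n)

for n in (1, 4, 16, 64, 256):
    # with f = 5%, speedup saturates near 1/f = 20 no matter how large N gets
    print(n, round(amdahl_speedup(0.05, n), 2))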
Amdahl’s Law - Example
Limitations in speedup
Super-linear speedup
Partitioning a data parallel program to run across multiple processors may lead to a better cache hit rate.*
Possible in some workloads without much other overhead, especially in embarrassingly parallel applications.
Why Amdahl’s Law is such bad news
Speedup plot
(Plot: speedup T1/TN = 1/(f + (1-f)/N) for 1, 4, 16, 64, and 256 processors, against the percentage of code that is sequential, from 0% to 26%.)
But wait - maybe we are missing something
Gustafson-Barsis Law
Let W be the execution workload of the program before adding resources
f is the sequential fraction of the workload
So W = f * W + (1-f) * W

Let W(N) be the larger execution workload after adding N processors
So W(N) = f * W + N * (1-f) * W
(the parallelizable part of the work can grow N times)

The theoretical speedup of the whole task, with the execution time held fixed at T:
S(N) = W(N) / W = ( f * W + N * (1-f) * W ) / W
S(N) = f + (1-f) * N

S(N) is not limited by f as N scales.
(Remember this when we discuss programming in Session 5.)
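The contrast with Amdahl’s Law is easy to see in a few lines of Python (a sketch; f = 0.05 is again an assumed sequential fraction):

def gustafson_speedup(f, n):
    # scaled speedup S(N) = f + (1-f)*N: the parallel part of the workload grows N times
    return f + (1.0 - f) * n

for n in (1, 4, 16, 64, 256):
    # grows roughly linearly in N instead of saturating at 1/f
    print(n, gustafson_speedup(0.05, n))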
Topics for today
Memory access models
» Shared memory
  » Multiple tasks on different processors access a common shared memory address space, in UMA or NUMA architectures
  » Conceptually easier for programmers
  » Think of writing a voting algorithm - it is trivial because everyone is in the same room, i.e. writing to the same variable (see the sketch after this list)
» Distributed memory
  » Multiple tasks - executing a single program - access data from separate (and isolated) address spaces (i.e. separate virtual memories)
  » How will this remote access happen?
(Figure: processors P sharing one memory, vs. processor/memory pairs P/M connected by a network)
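To make the "same room, same variable" idea concrete, here is a hedged Python sketch of the shared memory model: several threads in one address space update one shared variable, with a lock guarding concurrent access (the thread count and iteration count are arbitrary):

import threading

counter = 0                       # shared variable: one address space, visible to all threads
lock = threading.Lock()

def vote():
    global counter
    for _ in range(100_000):
        with lock:                # concurrent access still needs synchronization
            counter += 1

threads = [threading.Thread(target=vote) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                    # 400000: every thread wrote the same memory location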
Shared Memory Model: Implications for Architecture
Distributed memory and message passing
• In a Distributed Memory model, data has to be moved across Virtual Memories:
  ✓ i.e. a data item in VMem1 produced by task T1 has to be “communicated” to task T2, so that
  ✓ T2 can make a copy of the same in VMem2 and use it.
(Figure: processors P1 and P2 with separate memories M1 and M2)
Computing model for message passing
Communication model for message passing
Process 2: send local variable X, with an id, as a message to P1
Process 1: receive the message with that id and store it as local variable Y
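The same exchange as a Python sketch using multiprocessing (a stand-in for real message passing; the tag "id-1" and the value of X are arbitrary):

from multiprocessing import Process, Queue

def process_2(q):
    x = 42                  # local variable X in Process 2's address space
    q.put(("id-1", x))      # send X as a message, tagged with an id

def process_1(q):
    msg_id, y = q.get()     # receive the message and store the payload as local variable Y
    print(msg_id, y)

if __name__ == "__main__":
    q = Queue()
    procs = [Process(target=process_2, args=(q,)), Process(target=process_1, args=(q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()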
Distributed Memory Model: Implications for Architecture
* One can create a shared memory view using message passing and vice versa
Message Passing Model – Separate Address Spaces
• Use of separate address spaces complicates programming
• But this complication is usually restricted to one or two phases:
  ✓ Partitioning the input data
    ✓ Improves locality - computation moves closer to data
    ✓ Each process is enabled to access data from within its own address space, which in turn is likely to be mapped to the memory hierarchy of the processor on which the process is running
  ✓ Merging / collecting the output data
    ✓ This is required if each task produces outputs that have to be combined
(Figure: data item X on Processor A, Y on Processor B, Z on Processor C; outputs X', Y', Z' collected at Processor A)
Remember granularity?
We will see the example of Hadoop map-reduce, where data is partitioned, outputs are communicated over messages, and results are merged to get the final answer (a toy sketch follows).
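A toy Python sketch of the partition / merge pattern described above (a word count; the input strings and pool size are made up, and this stands in for Hadoop map-reduce rather than reproducing it):

from multiprocessing import Pool
from collections import Counter

def map_task(chunk):
    # each worker counts words only in its own partition of the input
    return Counter(chunk.split())

if __name__ == "__main__":
    docs = ["big data systems", "parallel and distributed systems"]  # partitioned input
    with Pool(2) as pool:
        partials = pool.map(map_task, docs)   # partition phase: compute close to the data
    result = sum(partials, Counter())         # merge phase: combine the partial outputs
    print(result)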
Message Passing Primitives
» Important: A message can be received in the OS buffer but may not have been delivered to the application buffer. This is where a distributed message ordering logic can come in.
Image ref: https://cvw.cac.cornell.edu/mpip2p/SyncSend
Send-Receive Options
1. Send, Sync, Blocking: returns only after the data is sent from the kernel buffer. Easiest to program but longest wait.
2. Send, Async, Blocking: returns after the data is copied to the kernel buffer but before it is sent. A handle is returned to check send status.
3. Send, Sync, Non-blocking: same as (2) but with no handle.
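For a concrete feel of these options, mpi4py exposes blocking and non-blocking sends; a sketch assuming mpi4py and an MPI runtime are installed (run with: mpiexec -n 2 python demo.py); the mapping to rows (1)-(3) above is approximate:

from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    comm.send("hello", dest=1, tag=0)          # blocking send: returns once the send buffer is reusable
    req = comm.isend("world", dest=1, tag=1)   # non-blocking send: returns a handle immediately
    req.wait()                                 # use the handle to check / complete the send
else:
    print(comm.recv(source=0, tag=0))
    print(comm.recv(source=0, tag=1))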
Topics for today
Data Access Strategies: Partition
• Strategy:
✓ Partition data – typically, equally – to the nodes of the (distributed) system
• Cost:
✓ Network access and merge cost when a query needs to go across partitions
• Advantage(s):
✓ Works well if task/algorithm is (mostly) data parallel
✓ Works well when there is Locality of Reference within a partition
• Concerns
✓ Merge across data fetched from multiple partitions
✓ Partition balancing
✓ Row vs Columnar layouts - what improves locality of reference?
✓ Will study shards and partitioning in Hadoop and MongoDB
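A hedged Python sketch of hash partitioning (the keys and node count are made up; real systems use a stable hash function rather than Python's per-process hash()):

N_NODES = 3

def node_for(key):
    # hash partitioning: each record lands on exactly one node
    return hash(key) % N_NODES

partitions = [dict() for _ in range(N_NODES)]
for key, value in {"alice": 1, "bob": 2, "carol": 3}.items():
    partitions[node_for(key)][key] = value

# a point query touches one partition; a scan must merge across all of them
print(partitions)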
Data Access Strategies: Replication
• Strategy:
✓ Replicate all data across nodes of the (distributed) system
• Cost:
✓ Higher storage cost
• Advantage(s):
✓ All data accessed from local disk: no (runtime) communication on the network
✓ High performance with parallel access
✓ Fail over across replicas
• Concerns
✓ Keep replicas in sync — various consistency models between readers and writers
✓ Will study in depth for MongoDB
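A toy Python sketch of full replication (write-all / read-one; synchronous write-all is just one of the consistency choices mentioned above):

class ReplicatedStore:
    def __init__(self, n_replicas):
        # each dict stands in for the local storage of one node
        self.replicas = [dict() for _ in range(n_replicas)]

    def write(self, key, value):
        for r in self.replicas:      # write-all keeps replicas in sync,
            r[key] = value           # at the cost of n_replicas x storage

    def read(self, key, node=0):
        return self.replicas[node].get(key)  # any replica can serve reads locally

store = ReplicatedStore(3)
store.write("x", 42)
print(store.read("x", node=2))       # reads can fail over to another replica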
Data Access Strategies: (Dynamic) Communication
• Strategy:
✓ Communicate (at runtime) only the data that is required
• Cost:
✓ High network cost for loosely coupled systems when the data set to be exchanged is large
• Advantage(s):
✓ Minimal communication cost when only a small portion of the data is actually
required by each node
• Concerns
✓ Requires a highly available and performant network
✓ Works best when the parallel data processing is fairly independent
Data Access Strategies – Networked Storage
Topics for today
Computer Cluster - Definition
Cluster - Objectives
• A computer cluster is typically built for one of the following two reasons:
✓ High Performance - referred to as compute-clusters
✓ High Availability - achieved via redundancy
Hadoop nodes form a cluster for both performance (independent Map/Reduce jobs are started on multiple nodes) and availability (data is replicated on multiple nodes for fault tolerance).
Most Big Data systems run on a cluster configuration for performance and availability.
Clusters – Peer to Peer computation
Client-Server vs. Peer-to-Peer
• Client-Server Computation
✓ A server node performs the core computation – business logic in case of
applications
✓ Client nodes request such computation
✓ At the programming level this is referred to as the request-response model
✓ Email, network file servers, …
• Peer-to-Peer Computation:
✓ All nodes are peers i.e. they perform core computations and may act as
client or server for each other.
✓ BitTorrent, some multi-player games, clusters
Cloud and Clusters
Motivation for using Clusters (1)
Motivation for using Clusters (2)
• Scale-out clusters with commodity workstations as nodes are suitable for software
environments that are resilient:
✓ i.e. individual nodes may fail, but
✓ middleware and software will enable computations to keep running (and keep
services available) for end users
✓ for instance, back-ends of Google and Facebook use this model.
• On the other hand, (public) cloud infrastructure is typically built as clusters of servers
  ✓ due to the higher reliability of individual servers used as nodes, compared to that of workstations as nodes.
Typical cluster components
(Figure: parallel applications running over cluster middleware on cluster nodes; ref: http://www.cloudbus.org/papers/SSI-CCWhitePaper.pdf)
Example cluster: Hadoop
• A job is divided into tasks
• Considers every task either as a Map or a Reduce
• Tasks assigned to a set of nodes (cluster)
• Special control nodes manage the nodes for resource
management, setup, monitoring, data transfer, failover etc.
• Hadoop clients work with these control nodes to get the
job done
Summary
Next Session:
Big Data Analytics and Systems