DSECL ZG 522: Big Data Systems: Session 2: Parallel and Distributed Systems


DSECL ZG 522: Big Data Systems

Session 2: Parallel and Distributed Systems

Dr. Anindya Neogi


Associate Professor
[email protected]
Context

Big Data Systems use the basic principles of Parallel and Distributed Systems

2
Topics for today

• What are parallel / distributed systems


• Motivation for parallel / distributed systems
• Limits of parallelism
• Shared Memory vs Message Passing
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

3
Serial Computing

• Software written for serial computation:


✓ A problem is broken into a discrete series of instructions
✓ Instructions are executed sequentially one after another
✓ Executed on a single processor
✓ Only one instruction may execute at any moment in time
✓ Single data stores - memory and disk
Extra info:

Von Neumann architecture: a common memory store and common pathways for instructions and data - causes the Von Neumann bottleneck.

Harvard architecture separates them to reduce the bottleneck.

Modern architectures use separate caches for instructions and data.
4
Parallel Computing

• Simultaneous use of multiple compute resources to solve a computational problem


✓ A problem is broken into discrete parts that can be solved concurrently
✓ Each part is further broken down to a series of instructions
✓ Instructions from each part execute simultaneously on different processors
✓ Different processors can work with independent memory and storage
✓ An overall control/coordination mechanism is employed

5
Spectrum of Parallelism

From more coupling to more granularity of parallelism:

» Pseudo-parallel (intra-processor): super-scalar, pipelining (think of a factory assembly line), multi-threaded - instruction-level parallelism
» Parallel: multi-core, shared memory over interconnect - task-level and data-level parallelism
» Distributed: message passing over interconnect, clusters, grids, clouds - task-level, data-level, service / request-level, and application-level parallelism

More in session 5 on programming
6
Distributed Computing
• In distributed computing,
✓ Multiple computing resources are connected in a network and computing tasks are distributed across these resources
✓ This results in increased speed and efficiency of the system
✓ Faster and more efficient than traditional serial computing
✓ More suitable for processing huge amounts of data in limited time

7
Multi-processor Vs Multi-computer systems

Multiprocessor
» Uniform Memory Access
» Shared memory address space
» No common clock
» Fast interconnect

(figures: UMA and NUMA organizations)

8
Multi-processor Vs Multi-computer systems

Multicomputer
» Non-Uniform Memory Access
» May have shared address spaces
» Typically message passing
» No common clock

(figures: NUMA and UMA organizations)

9
Interconnection Networks

a) A crossbar switch - faster


b) An omega switching network - cheaper

10
Classify based on Instruction and Data parallelism

Instruction Stream and Data Stream

• The term ‘stream’ refers to a sequence or flow of either instructions or data operated on
by the computer.
• In the complete cycle of instruction execution, a flow of instructions from main memory
to the CPU is established. This flow of instructions is called instruction stream.
• Similarly, there is a flow of operands between processor and memory bi-directionally.
This flow of operands is called data stream.

Reduce Von Neumann bottleneck with separate caches

11
Flynn’s Taxonomy

Instruction streams (single / multiple) vs data streams (single / multiple):

» SISD (Single Instruction, Single Data): uniprocessors, pipelining
» SIMD (Single Instruction, Multiple Data): scientific computing, matrix manipulations
» MISD (Multiple Instruction, Single Data): uncommon, used for fault tolerance
» MIMD (Multiple Instruction, Multiple Data): multi-computers, distributed systems

Image from sciencedirect.com

12
Some basic concepts ( esp. for programming in Big Data Systems )

» Coupling
» Tight - SIMD, MISD shared memory systems
» Loose - NOW, distributed systems, no shared memory
» Speedup
» how much faster can a program run when given N processors as opposed to 1 processor: S(N) = T(1) / T(N)
» We will study Amdahl’s Law, Gustafson’s Law
» Parallelism of a program
» Compare time spent in computations to time spent for communication via shared memory or message
passing
» Granularity
» Average number of compute instructions before communication is needed across processors
» Note:
» If granularity is coarse, use distributed systems; else use tightly coupled multi-processors / multi-computers
» Potentially high parallelism doesn't lead to high speedup if granularity is too small, because overheads become high

13
Comparing Parallel and Distributed Systems
Distributed System:
» Independent, autonomous systems connected in a network, accomplishing specific tasks
» Coordination is possible between connected computers, each with its own memory and CPU
» Loose coupling of computers connected in a network, providing access to data and remotely located resources
» Programs have coarse grain parallelism

Parallel System:
» A computer system with several processing units attached to it
» A common shared memory can be directly accessed by every processing unit
» Tight coupling of processing resources that are used for solving a single, complex problem
» Programs may demand fine grain parallelism

14
Motivation for parallel / distributed systems (1)
• Inherently distributed applications
  • e.g. a financial transaction involving 2 or more parties
• Better scale by creating multiple smaller parallel tasks instead of one complex task
  • e.g. evaluate an aggregate over 6 months of data
• Processors getting cheaper and networks faster
  • e.g. processor speed 2x / 1.5 years, network traffic 2x / year; processors limited by energy consumption
• Better scale using replication or partitioning of storage
  • e.g. replicated media servers for faster access, or shards in search engines
• Access to shared remote resources
  • e.g. a remote central DB
• Increased performance/cost ratio compared to specialized parallel systems
  • e.g. a search engine running on a Network-of-Workstations

15
Motivation for parallel / distributed systems (2)

• Better reliability because there is less chance of multiple failures
  • Be careful about Integrity: consistent state of a resource across concurrent access
• Incremental scalability
  • Add more nodes in a cluster to scale up
  • e.g. Clusters in Cloud services, autoscaling (cluster resize) in AWS
• Offload computing closer to the user for scalability and better resource usage
  • Edge computing, e.g. offload some error handling to the edge

16
Example: Netflix

~700+ distributed micro-services and hardware, integrated with other vendors

reference: https://medium.com/refraction-tech-everything/how-netflix-works-the-hugely-simplified-complex-stuff-that-happens-every-time-you-hit-play-3a40c9be254b

17
Distributed network of content caching servers

This would be a P2P network if you were using bit torrent for free

18
Examples

In each of the following cases, what is the motivation for parallel / distributed computing ?
1. Airline scheduling
2. Summary Statistics of Historic Sales Data of a Retailer
3. Web Crawler
4. Weather forecasting

19
Techniques for High Data Processing
Method: Cluster computing
Description: a collection of computers, homogeneous or heterogeneous, built from commodity components, running open source or proprietary software, communicating via message passing
Usage: commonly used in Big Data Systems, such as Hadoop

Method: Massively Parallel Processing (MPP)
Description: typically proprietary Distributed Shared Memory machines with integrated storage
Usage: may be used in traditional Data Warehouses and data processing appliances, e.g. EMC Greenplum (PostgreSQL on an MPP)

Method: High-Performance Computing (HPC)
Description: known to offer high performance and scalability by using in-memory computing
Usage: used to develop specialty and custom scientific applications for research, where the result is more valuable than the cost

20
Topics for today

• What are parallel / distributed systems


• Motivation for parallel / distributed systems
• Limits of parallelism
• Shared Memory vs Message Passing
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

21
Limits of Parallelism

• A parallel program has some sequential / serial code and significant parallelized code

22
Amdahl’s Law

• T(1) : Time taken for a job with 1 processor


• T(N) : Time taken for same job with N processors
• Speedup S(N) = T(1) / T(N)
• S(N) is ideally N when it is a perfectly parallelizable program, i.e. data parallel with no sequential
component
• Assume fraction of a program that cannot be parallelised (serial) is f and 1-f is parallel
✓ T(N) = f * T(1) + (1-f) * T(1) / N
Only parallel portion is faster by N

• S(N) = T(1) / ( f * T(1) + (1-f) * T(1) / N )

• S(N) = 1 / ( f + (1-f) / N )
• Implication :
✓ If N=>inf, S(N) => 1/f
✓ The effective speedup is limited by the sequential fraction of the code

23
Amdahl’s Law - Example

10% of a program is sequential and there are 100 processors.


What is the effective speedup ?
S(N) = 1 / ( f + (1-f) / N )
S(100) = 1 / ( 0.1 + (1-0.1) / 100 )
= 1 / 0.109
= 9.17 (approx)
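A quick way to check this arithmetic, and to see the 1/f ceiling, is a couple of lines of Python (a minimal sketch; the function name is ours):

    def amdahl_speedup(f: float, n: int) -> float:
        """S(N) = 1 / (f + (1 - f) / N), where f is the serial fraction."""
        return 1.0 / (f + (1.0 - f) / n)

    print(amdahl_speedup(0.1, 100))     # ~9.17, as computed above
    print(amdahl_speedup(0.1, 10**9))   # ~10.0, i.e. approaching 1/f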

24
Limitations in speedup

Besides the sequential component of the program, communication delays also result in reduction of speedup

A and B exchange messages in blocking mode


Say processor speed is 0.5ns / instruction
Say network delay one way is 10 us
For one message delay, A and B would have each
executed 10us/0.5ns = 20000 instructions

+ context switching, scheduling, load balancing, I/O …
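The instruction count above follows directly from the two assumed numbers; a tiny sketch of the arithmetic:

    instruction_time = 0.5e-9   # 0.5 ns per instruction (assumed, as on the slide)
    one_way_delay = 10e-6       # 10 us one-way network delay (assumed, as on the slide)

    # Instructions each processor could have executed while waiting for one message
    print(one_way_delay / instruction_time)   # 20000.0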

25
Super-linear speedup

Memory hierarchy / caches may increase the speedup


Single processor will have cache size D
But N processors will have total cache size N * D
So in a parallel environment the effect of cache may go up to N fold

Partitioning a data parallel program to run across multiple processors may lead to a better
cache hit.*
Possible in some workloads with not much of other overheads, esp in embarrassingly
parallel applications.

* There could be also be cache coherency overheads.

26
Why Amdahl’s Law is such bad news

S(N) ~ 1/ f , for large N

Suppose 33% of a program is sequential


• Then even a billion processors won’t give a speedup over 3

• For the 256 cores to gain ≥100x speedup, we need


100 ≤ 1 / (f + (1-f)/256)
Which means f ≤ .0061 or 99.4% of the algorithm must be perfectly
parallelizable !!
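The f ≤ 0.0061 figure comes from rearranging Amdahl's formula for the largest serial fraction that still allows a target speedup; a minimal sketch (function name is ours):

    def max_serial_fraction(target_speedup: float, n: int) -> float:
        """Largest f such that 1 / (f + (1 - f) / n) >= target_speedup."""
        return (1.0 / target_speedup - 1.0 / n) / (1.0 - 1.0 / n)

    print(max_serial_fraction(100, 256))   # ~0.0061 -> ~99.4% must be parallelizable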

27
Speedup plot
(figure) Speedup for 1, 4, 16, 64, and 256 processors: S(N) = T(1)/T(N) = 1 / (f + (1-f)/N), plotted against the percentage of code that is sequential (0.00% to 26.00%).
28
But wait - may be we are missing something

» The key assumption in Amdahl's Law is that the total workload is fixed as the number of processors is increased
» This doesn't happen in practice - the sequential part doesn't increase with resources
» Additional processors can be used for more complex workloads and new-age, larger parallel problems
» So Amdahl’s law under-estimates Speedup
» What if we assume fixed workload per processor

29
Gustafson-Barsis Law
Let W be the execution workload of the program before adding resources
f is the sequential part of the workload
So W = f * W + (1-f) * W
Let W(N) be larger execution workload after adding N processors
So W(N) = f * W + N * (1-f) * W
Parallelizable work can increase N times

The theoretical speedup in latency of the whole task, executed in the same fixed time T:
S(N) = W(N) / W
     = ( f * W + N * (1-f) * W ) / W
S(N) = f + (1-f) * N

S(N) is not limited by f as N scales. (Remember this when we discuss programming in Session 5.)

So solve larger problems when you have more processors
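A minimal sketch contrasting the two laws for the same serial fraction (function names are ours):

    def amdahl_speedup(f: float, n: int) -> float:
        return 1.0 / (f + (1.0 - f) / n)        # fixed total workload

    def gustafson_speedup(f: float, n: int) -> float:
        return f + (1.0 - f) * n                # workload scaled with the processors

    print(amdahl_speedup(0.1, 100))     # ~9.17, capped near 1/f = 10
    print(gustafson_speedup(0.1, 100))  # 90.1, keeps growing with N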

30
Topics for today

• What are parallel / distributed systems


• Motivation for parallel / distributed systems
• Limits of parallelism
• Shared Memory vs Message Passing
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

31
Memory access models
» Shared memory
  » Multiple tasks on different processors access a common shared memory address space, in UMA or NUMA architectures
  » Conceptually easier for programmers
  » Think of writing a voting algorithm - it is trivial because everyone is in the same room, i.e. writing to the same variable
» Distributed memory
  » Multiple tasks – executing a single program – access data from separate (and isolated) address spaces (i.e. separate virtual memories)
  » How will this remote access happen ?

(figure: processors P sharing one memory vs. processor/memory nodes P/M connected over a network)
32
Shared Memory Model: Implications for Architecture

• A shared memory system has


✓ Physical memory (or memories) are
accessible by all processors
✓ The single (logical) address space
of each processor is mapped onto
the physical memory (or memories)
• A single program is implemented as a
collection of threads, with one or more
threads scheduled in a processor
• Conceptually easier to program
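As an illustration of why shared memory is conceptually easier, the "voting" analogy from the previous slide can be written with ordinary threads updating one shared variable; a minimal Python sketch (the lock guards concurrent access, the integrity concern mentioned earlier):

    import threading

    votes = 0                      # shared variable in a single address space
    lock = threading.Lock()        # protects concurrent updates (integrity)

    def cast_vote():
        global votes
        with lock:
            votes += 1

    threads = [threading.Thread(target=cast_vote) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(votes)   # 8 - every thread wrote to the same memory location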

33
Distributed memory and message passing
• In a Distributed Memory model, data has to be moved across Virtual Memories:
  ✓ i.e. a data item in VMem1 produced by task T1 has to be "communicated" to task T2, so that
  ✓ T2 can make a copy of the same in VMem2 and use it.

• Whereas in a Shared Memory model
  ✓ task T1 writes the data item into a memory location and T2 can read the same
  ✓ how ? the memory location is part of the logical address space that is shared between the tasks

(figure: two nodes P1/M1 and P2/M2 connected over a network)

34
Computing model for message passing

• Each data item must be located in one of the address spaces


✓ Data must be partitioned explicitly and placed (i.e. distributed)
✓ All interactions between processes require explicit communication i.e.
passing of messages
✓ Harder than shared memory from a programming abstraction standpoint
✓ Note: Shared memory abstractions can be theoretically created on
message passing systems

• In the simplest form:


✓ a sender (who has the data) and
✓ a receiver (who has to access the data)
• must co-operate for exchange of data

35
Communication model for message passing

• Processes operate within their own private address spaces


• Processes communicate by sending/receiving messages
✓ send: specifies recipient, buffer to be transmitted, and message identifier (“tag”)
✓ receive: specifies buffer to store data, and optional message identifier
✓ Sending messages is the only way to exchange data between processes 1 and 2

(figure) Process 1 holds a local variable X; Process 2 holds a local variable Y.
Process 1: send local variable X as a message to Process 2, with a message id.
Process 2: receive the message with that id and store it as local variable Y.
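With message passing there is no shared variable at all; the exchange has to be written as an explicit send and receive. A minimal sketch using mpi4py (assuming mpi4py and an MPI runtime are installed; run with something like `mpiexec -n 2 python send_recv.py`):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:                        # "Process 1": owns X
        X = {"payload": 42}
        comm.send(X, dest=1, tag=11)     # specify recipient, buffer, and tag
    elif rank == 1:                      # "Process 2": needs the data
        Y = comm.recv(source=0, tag=11)  # a copy arrives in this address space
        print(Y)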

36
Distributed Memory Model: Implications for Architecture

A distributed memory model is implemented in a distributed system where


✓ A collection of stand-alone systems are connected in a network
✓ A task runs as a collection of processes
✓ One or more processes are scheduled in each system / node
✓ Data exchanges happen via messages over the network
✓ Harder for programmers - hence need distributed OS / middleware layer
to hide some complexity *

* One can create a shared memory view using message passing and vice versa
37
Message Passing Model – Separate Address Spaces
• Use of separate address spaces complicates programming
• But this complication is usually restricted to one or two phases:
  ✓ Partitioning the input data
    ✓ Improves locality - computation moves closer to data
    ✓ Each process is enabled to access data from within its address space, which in turn is likely to be mapped to the memory hierarchy of the processor in which the process is running
  ✓ Merging / collecting the output data
    ✓ This is required if each task is producing outputs that have to be combined

(figure: data items X, Y, Z partitioned across processors A, B, C; outputs X', Y', Z' merged back on one processor)

Remember granularity ?
We will see the example of Hadoop map-reduce, where data is partitioned, outputs are communicated over messages, and merged to get the final answer.

38
Message Passing Primitives

» Operations: Send and Receive


» Options:
» Blocking vs Non-Blocking
» Sync vs Async

» Important : A message can be received in the OS buffer but may not have been
delivered to application buffer. This is where a distributed message ordering
logic can come in.
Image ref: https://cvw.cac.cornell.edu/mpip2p/SyncSend

39
Send-Receive Options

1. Send, Sync, Blocking: returns only after data is sent from the kernel buffer. Easiest to program but longest wait.
2. Send, Async, Blocking: returns after data is copied to the kernel buffer but not yet sent. A handle is returned to check send status.
3. Send, Sync, Non-blocking: same as (2) but with no handle.
4. Send, Async, Non-blocking: returns immediately with a handle. Complex to program but minimum wait.
5. Receive, Sync, Blocking: returns after the application gets the data.
6. Receive, Sync, Non-blocking: returns immediately with the data or a handle to check status. More efficient.
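Options (2), (4) and (6) above rely on a handle that can be checked or waited on later. In mpi4py (assumed to be installed), the non-blocking variants isend/irecv return such a request handle; a minimal sketch:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        req = comm.isend({"payload": 42}, dest=1, tag=7)  # returns immediately with a handle
        # ... overlap other computation here ...
        req.wait()                                        # later: confirm the send completed
    elif rank == 1:
        req = comm.irecv(source=0, tag=7)                 # non-blocking receive
        data = req.wait()                                 # block only when the data is needed
        print(data)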

40
Topics for today

• What are parallel / distributed systems


• Motivation for parallel / distributed systems
• Limits of parallelism
• Shared Memory vs Message Passing
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

41
Data Access Strategies: Partition
• Strategy:
  ✓ Partition data - typically, equally - across the nodes of the (distributed) system (see the sketch below)
• Cost:
  ✓ Network access and merge cost when a query needs to go across partitions
• Advantage(s):
  ✓ Works well if the task/algorithm is (mostly) data parallel
  ✓ Works well when there is Locality of Reference within a partition
• Concerns
  ✓ Merging data fetched from multiple partitions
  ✓ Partition balancing
  ✓ Row vs columnar layouts - what improves locality of reference ?
  ✓ Will study shards and partitioning in Hadoop and MongoDB
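A minimal sketch of hash-based partitioning (the function and node count are illustrative, not from any particular system): each key is mapped to one node, and a query that touches keys on several nodes must gather and merge the partial results.

    import hashlib

    NUM_NODES = 4   # illustrative cluster size

    def node_for(key: str) -> int:
        """Stable hash partitioning: the same key always lands on the same node."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_NODES

    keys = ["user:1", "user:2", "user:3", "user:42"]
    placement = {k: node_for(k) for k in keys}
    print(placement)   # e.g. {'user:1': 2, 'user:2': 0, ...}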

42
Data Access Strategies: Replication

• Strategy:
✓ Replicate all data across nodes of the (distributed) system
• Cost:
✓ Higher storage cost
• Advantage(s):
✓ All data accessed from local disk: no (runtime) communication on the network
✓ High performance with parallel access
✓ Fail over across replicas
• Concerns
✓ Keep replicas in sync — various consistency models between readers and writers
✓ Will study in depth for MongoDB
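A toy model of replication (purely illustrative): every write is applied to all replicas to keep them in sync, while a read can be served by any one replica, which is what gives local access, parallelism, and failover.

    import random

    class ReplicatedStore:
        """Toy replicated key-value store: write-to-all, read-from-any."""

        def __init__(self, num_replicas: int):
            self.replicas = [dict() for _ in range(num_replicas)]

        def write(self, key, value):
            for replica in self.replicas:   # keep all copies consistent (storage cost x N)
                replica[key] = value

        def read(self, key):
            return random.choice(self.replicas).get(key)   # any replica can answer

    store = ReplicatedStore(3)
    store.write("config", {"shards": 4})
    print(store.read("config"))   # {'shards': 4}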

43
Data Access Strategies: (Dynamic) Communication

• Strategy:
✓ Communicate (at runtime) only the data that is required
• Cost:
✓ High network cost for loosely coupled systems when the data set to be exchanged is large
• Advantage(s):
✓ Minimal communication cost when only a small portion of the data is actually
required by each node
• Concerns
✓ Highly available and performant network
✓ Fairly independent parallel data processing

44
Data Access Strategies – Networked Storage

• Common Storage on the Network:


✓ Storage Area Network (for raw access – i.e. disk block access)
✓ Network Attached Storage (for file access)

• Common Storage on the Cloud:


✓ Use Storage as a Service
✓ e.g. Amazon S3

More in-depth coverage when studying Amazon storage case study

45
Topics for today

• What are parallel / distributed systems


• Motivation for parallel / distributed systems
• Limits of parallelism
• Shared Memory vs Message Passing
• Data access strategies - Replication, Partitioning, Messaging
• Cluster computing

46
Computer Cluster - Definition

• A cluster is a type of distributed processing


system
✓ consisting of a collection of inter-
connected stand-alone computers
✓ working together as a single, integrated
computing resource
✓ e.g. web and application server cluster,
cluster running a Cloud service, DB server
cluster

47
Cluster - Objectives

• A computer cluster is typically built for one of the following two reasons:
✓ High Performance - referred to as compute-clusters
✓ High Availability - achieved via redundancy

An off-the-shelf or custom load balancer / reverse proxy can be configured to serve the use case

• Question: How is this relevant for Big Data?

Hadoop nodes are a cluster for performance (independent Map/Reduce jobs are
started on multiple nodes) and availability (data is replicated on multiple nodes for
fault tolerance)

Most Big Data systems run on a cluster configuration for performance and availability

48
Clusters – Peer to Peer computation

• Distributed Computing models can be classified as:


✓ Client Server models
✓ Peer-to-Peer models
• based on the structure and interactions of the nodes in a distributed
system

• Nodes within a cluster use a Peer-to-Peer model of computation.


• There may be special control nodes that allocate and manage work
thus having a master-slave relationship.

49
Client-Server vs. Peer-to-Peer

• Client-Server Computation
✓ A server node performs the core computation – business logic in case of
applications
✓ Client nodes request for such computation
✓ At the programming level this is referred to as the request-response model
✓ Email, network file servers, …

• Peer-to-Peer Computation:
✓ All nodes are peers i.e. they perform core computations and may act as
client or server for each other.
✓ bit torrent, some multi-player games, clusters

50
Cloud and Clusters

• A cloud uses a datacenter as the infrastructure on top of which services


are provided
• e.g. AWS would have a datacenter in many regions - Mumbai, US
east, … (you can pick where you want your services deployed)
• A cluster is the basic building block for a datacenter:
✓ i.e. a datacenter is structured as a collection of clusters
• A cluster can host

✓ a multi-tenant service across clients - cost effective


✓ individual clients and their service(s) - dedicated instances

Can you draw a conceptual diagram to illustrate these cases ?

51
Motivation for using Clusters (1)

• Rate of obsolescence of computers is high


✓ Even for mainframes and supercomputers
✓ Servers (used for high performance computing) have to be replaced every
3 to 5 years.
• Solution: Build a cluster of commodity workstations
✓ Incrementally add nodes to the cluster to meet increasing workload
✓ Add nodes instead of replacing (i.e. let older nodes operate at a lower
speed)
✓ This model is referred to as a scale-out cluster

52
Motivation for using Clusters (2)

• Scale-out clusters with commodity workstations as nodes are suitable for software
environments that are resilient:
✓ i.e. individual nodes may fail, but
✓ middleware and software will enable computations to keep running (and keep
services available) for end users
✓ for instance, back-ends of Google and Facebook use this model.

• On the other hand, (public) cloud infrastructure is typically built as clusters of servers
✓ due to higher reliability of individual servers – used as nodes – (compared to
that of workstations as nodes).

53
Typical cluster components
(layered stack, top to bottom)
Parallel applications
Parallel programming environment (e.g. map-reduce)
Sequential apps | Cluster middleware (e.g. Hadoop)
Cluster nodes, each with: OS and runtimes, processor and memory, local storage, network stack
High speed switching network connecting the nodes
54
Cluster Middleware - Some Functions

Single System Image (SSI) infrastructure
✓ Glues together the OSs on all nodes to offer unified access to system resources
✓ Single process space
✓ Cluster IP or single entry point
✓ Single auth
✓ Single memory and IO space
✓ Single IPC space
✓ Single fs root
✓ Single virtual networking
✓ Single management GUI

High Availability (HA) Infrastructure
✓ Cluster services for
  ✓ Availability
  ✓ Redundancy
  ✓ Fault-tolerance
  ✓ Recovery from failures
  ✓ Process checkpointing and migration

http://www.cloudbus.org/papers/SSI-CCWhitePaper.pdf
55
Example cluster: Hadoop
• A job divided into tasks
• Considers every task either as a Map or a Reduce
• Tasks assigned to a set of nodes (cluster)
• Special control nodes manage the nodes for resource
management, setup, monitoring, data transfer, failover etc.
• Hadoop clients work with these control nodes to get the
job done

56
Summary

• Motivation and classification of parallel systems


• Computing limits of speedup
• Message passing
• How replication, partitioning helps in Big Data storage
and access
• Cluster computing basics

57
Next Session:
Big Data Analytics and Systems
