
Introduction to

Parallel Computing

National Tsing Hua University


Instructor: Jerry Chou
2017, Summer Semester
Outline
 Parallel Computing Introduction
 What is parallel computing
 Why we need parallel computing
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Parallel Program Analysis

Parallel Programming – NTHU LSA Lab 2


What is Parallel Computing?
“Solve a single problem by using multiple processors
(i.e., cores) working together”
 Traditionally, programs have been written for serial computation:
a problem is executed as a single stream of instructions
(t1, t2, t3, …, tN) on one processor
 In parallel computing, multiple compute resources are used to solve
a computational problem: the problem is divided so that several
processors each execute their own instruction stream simultaneously
Parallel Programming – NTHU LSA Lab 3
Difference between parallel computing
& distributed computing
The two terms are closely related,
but they come from different backgrounds
 Parallel computing …
 Means different activities happen at the same time
 Spreads a single application over many cores/
processors/processes to solve it at larger scale or faster
 Mostly used in scientific computing
 Distributed computing …
 Activities spread across systems or distant servers
 Focuses more on concurrency and resource sharing
 Comes from the business/commercial world
Parallel Programming – NTHU LSA Lab 4
The Universe is Parallel
 Parallel computing is an evolution of serial computing
that attempts to emulate what has always been the
state of affairs in the natural world

Parallel Programming – NTHU LSA Lab 5


Why We Need Parallel Computing
 Save time
 Use more resources to shorten execution time, with potential
cost savings (e.g., 4 hours of work finishes in 1 hour)
 Shorter execution time allows more runs or more tuning
opportunities

                    DUAL XEON CPU server   DGX-1 GPU server (8 GPUs)
FLOPS               3 TF                   170 TF
Node Mem BW         76 GB/s                768 GB/s
AlexNet Train Time  150 hr                 2 hr
Train in 2 hr       >250 nodes             1 node
Parallel Programming – NTHU LSA Lab 6
Why We Need Parallel Computing
 Solve larger problems
 Impossible or impractical to solve on a single computer
 Scientific computing:
Trillions of particles
Tens to hundreds of parameters
TBs of data to be processed/analyzed
Several hours of execution
 Example: a simulation of 1 trillion particles in a 4.225 Gpc box
with 6 kpc force resolution, using millions of cores (PetaFLOPS)

The world has been driven by science research!
Parallel Programming – NTHU LSA Lab 7
Why We Need Parallel Computing
 Make better use of the underlying parallel hardware
 Advances in computer architecture
 Examples: 12-core IBM Blade multi-core CPU, 512-core NVIDIA Fermi GPU

Parallel Programming – NTHU LSA Lab 8


The Death of CPU Scaling
 Increase in transistor density ≠ increase in performance
 Power and clock-speed improvements have collapsed

“Parallel computing is a trend and an essential tool
in today’s world!”
Parallel Programming – NTHU LSA Lab 9
Trend of Parallel Computing
 Single-Core Era
 Enabled by: Moore’s Law, voltage scaling
 Constrained by: power, complexity
 Assembly → C/C++/Java …
 Multi-Core Era
 Enabled by: Moore’s Law, SMP
 Constrained by: power, parallel SW, scalability
 Pthread → OpenMP …
 Heterogeneous Systems Era
 Enabled by: abundant data parallelism, power-efficient GPUs
 Constrained by: programming models, communication overhead
 Shader → CUDA/OpenCL …
 Distributed System Era
 Enabled by: networking
 Constrained by: synchronization, communication overhead
 MPI → MapReduce …
Parallel Programming – NTHU LSA Lab 10
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Flynn’s classic taxonomy
 Memory architecture classification
 Programming model classification
 Supercomputer & Latest technologies
 Parallel Program Analysis

Parallel Programming – NTHU LSA Lab 11


Parallel Computer Classification
 Flynn’s classic taxonomy
 Since 1966 (50 years ago …)
 From the processing-unit perspective: classify
computer architectures based on two independent
dimensions: Instruction & Data
 SISD: Single Instruction, Single Data
 SIMD: Single Instruction, Multiple Data
 MISD: Multiple Instruction, Single Data
 MIMD: Multiple Instruction, Multiple Data
Parallel Programming – NTHU LSA Lab 12
Flynn’s classic taxonomy: SISD
 Single Instruction, Single Data (SISD):
 A serial (non-parallel) computer
 Single Instruction: Only one instruction stream is being acted
on by the CPU during any one clock cycle
 Single Data: Only one data stream is being used as input
during any one clock cycle
 Example: Old mainframes, single-core processor

Parallel Programming – NTHU LSA Lab 13


Flynn’s classic taxonomy: SIMD
 Single Instruction, Multiple Data (SIMD):
 Single Instruction: All processing units execute the same
instruction at any given clock cycle
 Multiple Data: Each processing unit can operate on a
different data element
 Example: GPU, vector processors (x86 AVX instructions)

Parallel Programming – NTHU LSA Lab 14
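To make the SIMD idea above concrete, here is a minimal C sketch (mine, not from the slides) using x86 AVX intrinsics: a single vector instruction adds eight single-precision floats at once. The file name and compile flag (e.g., gcc -mavx simd_demo.c) are assumptions.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);      /* load 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);   /* one instruction, 8 additions */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);           /* prints 11 22 ... 88 */
    printf("\n");
    return 0;
}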


Flynn’s classic taxonomy: MISD
 Multiple Instruction, Single Data (MISD):
 Multiple Instruction: Each processing unit operates on the
data independently via separate instruction streams.
 Single Data: A single data stream is fed into multiple
processing units.
 Example: only experimental machines (e.g., at CMU in 1971);
could be used for fault tolerance

Parallel Programming – NTHU LSA Lab 15


Flynn’s classic taxonomy: MIMD
 Multiple Instruction, Multiple Data (MIMD):
 Multiple Instruction: Every processor may be executing a
different instruction stream
 Multiple Data: Every processor may be working with a
different data stream
 Example: Most modern computers, such as multi-core CPU

Parallel Programming – NTHU LSA Lab 16


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Flynn’s classic taxonomy
 Memory architecture classification
 Programming model classification
 Supercomputer & Latest technologies
 Parallel Program Analysis

Parallel Programming – NTHU LSA Lab 17


Shared Memory vs. Distributed Memory
Computer Architecture

 Distributed memory: each CPU (CPU0–CPU3) has its own
memory module (MEM0–MEM3)
 Shared memory: all CPUs (CPU0–CPU3) access a single
shared memory (MEM)

Parallel Programming – NTHU LSA Lab 18


Shared Memory Multiprocessor
Computer System
 Single computer with multiple internal multi-core processors

Parallel Programming – NTHU LSA Lab 19


Shared Memory Computer Architecture
 Uniform Memory Access (UMA):
 Most commonly represented today by Symmetric
Multiprocessor (SMP) machines
 Identical processors
 Equal access times to memory
 Example: commercial servers
 Non-Uniform Memory Access (NUMA):
 Often made by physically linking two or more SMPs
 One SMP can directly access
memory of another SMP
 Memory access across link is slower
 Example: HPC server

Parallel Programming – NTHU LSA Lab 20


Distributed Memory Multicomputer
 Connect multiple computers to form a
computing platform without sharing memory

 Cluster: tens of servers
 Supercomputer: hundreds of servers
 Datacenter: thousands of servers
Parallel Programming – NTHU LSA Lab 21
Distributed Memory Multicomputer
 Requires a communication network (i.e., not a bus)
to connect inter-processor memory
 Processors have their own memory & address space
 Memory change made by a processor has NO effect on
the memory of other processors
 Programmers or programming tools are responsible for
explicitly defining how and when data is communicated
between processors
Network fabric:
Ethernet,
InfiniBand

Parallel Programming – NTHU LSA Lab 22


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Flynn’s classic taxonomy
 Memory architecture classification
 Programming model classification
 Supercomputer & Latest technologies
 Parallel Program Analysis

Parallel Programming – NTHU LSA Lab 23


Parallel Programming Model
 Parallel programming models exist as an abstraction
above hardware & memory architectures
 In general, programming models are designed to match
the computer architecture
 Shared memory prog. model for shared memory machine
 Message passing prog. model for distributed memory machine
 But programming models are NOT restricted by the
machine or memory architecture
 Message passing model can be supported on SHARED memory
machine: e.g., MPI on a single server
 Shared memory model on DISTRIBUTED memory machine:
e.g., Partitioned Global Address Space
Parallel Programming – NTHU LSA Lab 24
Shared Memory Programming Model
 A single process can have multiple, concurrent execution paths
 Threads have local data, but also share resources
 Threads communicate with each other through global memory
 Threads can come and go, but the main program remains to
provide the necessary shared resources until the application
has completed

Parallel Programming – NTHU LSA Lab 25


Shared Memory Programming Model
 Implementation
 A library of subroutines called from parallel source code
E.g.: POSIX Thread (Pthread)
 A set of compiler directives embedded in either serial or
parallel source code
E.g.: OpenMP
Pthread version:
#include <pthread.h>
#include <stdio.h>

void *print_message_function(void *ptr) {    /* thread entry point */
    printf("Hello, world.\n");
    return NULL;
}

int main() {
    pthread_t thread;
    pthread_create(&thread, NULL, print_message_function, NULL);
    pthread_join(thread, NULL);              /* wait for the thread */
    return 0;
}

OpenMP version:
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel                     /* each thread runs this block */
    {
        printf("Hello, world.\n");
    }
    return 0;
}
Parallel Programming – NTHU LSA Lab 26
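(Build commands are not shown on the slides; as an assumption, on a typical Linux/GCC toolchain the two hello-world programs above would be compiled with gcc hello_pthread.c -pthread and gcc hello_omp.c -fopenmp, where the file names are placeholders.)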
Message Passing Programming Model
 A set of tasks that use their own local memory
during computation
 Multiple tasks can reside on the same physical machine
and/or across an arbitrary number of machines
 Tasks exchange data through communications by
sending and receiving messages (Memory copy)
 MPI API:
 Send, Recv, Bcast,
Gather, Scatter, etc.

Parallel Programming – NTHU LSA Lab 27
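To make the message-passing model above concrete, here is a minimal C sketch (mine, not from the slides) built on the MPI_Send/MPI_Recv calls listed on the slide: every task keeps its data in its own local memory and all data exchange is explicit. The file name and run command (mpicc sum_mpi.c -o sum_mpi; mpirun -np 4 ./sum_mpi) are assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */

    int local = rank + 1;                   /* data in this task's own memory */

    if (rank != 0) {
        /* worker tasks explicitly send their local value to task 0 */
        MPI_Send(&local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        int sum = local, msg;
        for (int src = 1; src < size; src++) {
            /* task 0 explicitly receives one message per worker */
            MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += msg;
        }
        printf("Sum of 1..%d gathered from %d tasks: %d\n", size, size, sum);
    }

    MPI_Finalize();
    return 0;
}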


Shared Memory vs. Message Passing
Shared Memory
 Convenient
 Can share data structures
 Just annotate loops
 Closer to serial code
 Disadvantages
 No locality control
 Does not scale
 Race conditions
Message Passing
 Scalable
 Locality control
 Communication is all explicit in code (cost transparency)
 Disadvantages
 Need to rethink entire application / data structures
 Lots of tedious pack/unpack code
 Don’t know when to say “receive” for some problems
Parallel Programming – NTHU LSA Lab 28
Summary
 The designs and popularity of programming models and
parallel systems are highly influenced by each other
 OpenMP, MPI, Pthreads, and CUDA are just some of the
parallel languages for users to do parallel programming
 In reality, understanding parallel computing itself is more
IMPORTANT than knowing how to do parallel
programming, because that’s how you can…
 Learn new parallel programming tools quickly
 Understand the performance of your program
 Optimize the performance of your program

Parallel Programming – NTHU LSA Lab 29


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Supercomputer
 Processor technology
 Interconnect & Network technology
 I/O & Storage technology
 Parallel Program Analysis
Parallel Programming – NTHU LSA Lab 30
Today’s Typical Parallel Computers
 Racks: 16~42U
 Node/Server: 1~4U
 Multi-core Processor: 4~12 cores
 Co-Processor: 100x cores / 1000x threads
Parallel Programming – NTHU LSA Lab 31
Supercomputers
 Definition: A computer with a high-level
computational capacity compared to
a general-purpose computer
 Its performance is measured in floating-
point operations per second (FLOPS) instead
of million instructions per second (MIPS)
 Ranked by the TOP500 list since 1993
 According to the HPL benchmark results
 Announced twice a year at ISC and SC conferences

Parallel Programming – NTHU LSA Lab 32


HPL Benchmark
 A parallel implementation of the Linpack library
 Measures the floating-point rate of execution
 Computation:
 Solves a dense system of linear equations
 LU factorization with panel factorization
 Divides the matrix into many pieces
 All parameters must be determined by the user

Parallel Programming – NTHU LSA Lab 33


What makes it a supercomputer
 What makes it a supercomputer?
 All the latest hardware technologies
 Customized system configurations
 Optimized software and libraries
 Huge amount of cost in money and energy

 It represents a competition
of technology and wealth
among countries

Parallel Programming – NTHU LSA Lab 34


TOP500 List (2016 June)
Rank Country System     Vendor  Power (kW) #Cores Accelerator Rmax  Rpeak
                                                              (PFLOPS)
1    China   TaihuLight NRCPC   15,371     10M    –           93.0  125.4
2    China   Tianhe-2   NUDT    17,808     3M     Xeon Phi    33.9  54.9
3    US      Titan      Cray    8,209      560K   Tesla K20X  17.6  27.1
4    US      Sequoia    IBM     7,890      1.5M   –           17.2  20.1
5    Japan   K          Fujitsu 12,660     705K   –           10.5  11.3
6    US      Mira       IBM     3,954      786K   –           8.6   10.0
7    US      Trinity    Cray    –          301K   –           8.1   11.1
8    Swiss   Piz Daint  Cray    2,325      116K   Tesla K20X  6.2   7.8

 Accelerators provide huge computing power
 Titan’s Rmax without GPU was only 2K!!!

Parallel Programming – NTHU LSA Lab 35


TOP500 Trend: CPU
 Intel CPUs account for more than 80%

Parallel Programming – NTHU LSA Lab 36


TOP500 Trend: Interconnect
 InfiniBand has a much larger share in performance

Parallel Programming – NTHU LSA Lab 37


TOP500 Trend: Vendor
 CRAY and IBM still have a larger share of
performance

Parallel Programming – NTHU LSA Lab 38


TOP500 Trend: Country
 China has a huge jump because of the new
supercomputer
2015 Nov 2016 June

Parallel Programming – NTHU LSA Lab 39


TOP500 Trend: Computing power
 The goal is to reach exascale computing, 1 EFLOPS (10^18 FLOPS),
by 2020

Parallel Programming – NTHU LSA Lab 40


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Supercomputer
 Processor technology
 Interconnect & Network technology
 I/O & Storage technology
 Parallel Program Analysis
Parallel Programming – NTHU LSA Lab 41
Limitation of CPU
General Purpose Processor
 A general-purpose CPU (central processing unit) can do
anything, but its design works against achieving
the best performance for any specific application.

[Source: NVIDIA] Parallel Programming – NTHU LSA Lab 42


Comparison Numbers
                Intel Xeon E5-2697 v3  NVIDIA Tesla K80     Intel Xeon Phi 7120P
                CPU (Haswell)          GPU (Kepler)         (Knight's Corner)
Cores           2x14                   2x13 (SMX)           61
Logical Cores   2x28                   2x2,496              244
Frequency       2.60 GHz               562 MHz              1.238 GHz
GFLOPS (double) 2x583                  2x1,455              1,208
Max memory      768 GB                 2x12 GB              16 GB
Max Mem BW      2x68 GB/s              2x240 GB/s (internal) 352 GB/s (internal)
Price           2,700 USD              5,000 USD            4,000 USD
Source: https://www.xcelerit.com/computing-benchmarks/libor/haswell_k80_phi/
Parallel Programming – NTHU LSA Lab 43
NVIDIA General-Purpose GPU (GPGPU)
 Extends the GPU as a form of stream processor (or vector processor)
for general-purpose computing
 Suited for embarrassingly parallel tasks and vectorized operations
 Hierarchical memory structure
 Used as an accelerator/co-processor
 [Figure: the host feeds the GPU (device) through an input assembler
and thread execution manager; many SMs, each with per-block shared
memory (PBSM), load/store to global and constant memory]
Parallel Programming – NTHU LSA Lab 44
Intel Xeon Phi
 A brand name given to a series of manycore processors that follow
Intel's MIC (Many Integrated Core) architecture
 Typically it has 50-70 cores on the die connected by a bidirectional
ring network
 More like a separate system
 It runs Intel assembly code just like the main CPU in your computer
 It has an embedded Linux
 Second-generation chips (Knights Landing) can be used as a
standalone CPU

Parallel Programming – NTHU LSA Lab 45


Sunway TaihuLight SW26010
 Each node contains four clusters of 64 CPEs (SIMD)
 Each cluster is accompanied by an MPE (general
purpose)

Parallel Programming – NTHU LSA Lab 46


Google Tensor Processing Unit (TPU)
 Specifically for deep learning (the TensorFlow framework)
 30–80x higher performance-per-watt than
contemporary CPUs and GPUs
 Only for reduced-precision computation (e.g. 8-bit precision)
 Matrix Multiplier Unit: uses a systolic array to perform hundreds of
thousands of matrix operations in a single clock cycle
 Systolic array: the ALUs perform only multiplications and
additions in fixed patterns
 Reference
 https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Parallel Programming – NTHU LSA Lab 47
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Supercomputer
 Processor technology
 Interconnect & Network technology
 I/O & Storage technology
 Parallel Program Analysis
Parallel Programming – NTHU LSA Lab 48
Communication
 Communication has the greatest impact on the
performance of parallel programs (even more critical
than computation or memory)
 The network is generally much slower than the CPU
 Communication is common in parallel programs
 Synchronization is expensive and could grow exponentially
with the number of servers

Parallel Programming – NTHU LSA Lab 49


Interconnection Networks
 Network design considerations
 Scalability, Performance, Resilience and Cost
Application
• Communication pattern & protocol
Interconnection Network Topology
• Network diameter
• Re-routing path for fault tolerance
• # fan-in & fan-out degree per node
Network Devices (Cable, Switch, Adapter, etc.)
• Bandwidth: # bits transferred per second
• Latency: time to pack, unpack, and send a message
• Scalability: # of ports on the adapter and switch
Parallel Programming – NTHU LSA Lab 50
Network Topology
             Diameter   Bisection    #Links   Degree
             (latency)  (resilience) (cost)   (scalability)
Linear array p−1        1            p−1      2

 Cheapest solution, but not reliable, and has long latency
Parallel Programming – NTHU LSA Lab 51


Network Topology
             Diameter   Bisection    #Links       Degree
             (latency)  (resilience) (cost)       (scalability)
Linear array p−1        1            p−1          2
Ring         p/2        2            p            2
Tree         2·log2(p)  1            2(p−1)       3
2-D Mesh     2(√p − 1)  √p           2√p(√p − 1)  4
 Particularly suitable for some applications, such as the ocean
application and matrix calculations
 Can be extended to a 3-D mesh
Parallel Programming – NTHU LSA Lab 52
Network Topology
             Diameter   Bisection    #Links         Degree
             (latency)  (resilience) (cost)         (scalability)
Linear array p−1        1            p−1            2
Ring         p/2        2            p              2
Tree         2·log2(p)  1            2(p−1)         3
2-D Mesh     2(√p − 1)  √p           2√p(√p − 1)    4
2-D Torus    √p − 1     2√p          2p             4
Hypercube    log2(p)    p/2          (p/2)·log2(p)  log2(p)
 Hypercube: smaller diameter and larger bisection, but also higher
cost and degree than mesh and torus
 More suitable for smaller-scale systems
Parallel Programming – NTHU LSA Lab 53
Network Topology
 4-D hypercube
 Each node is numbered with a bitstring that is
log2(p) bits long.
 One bit can be flipped per hop so the diameter is
log2(p).

Parallel Programming – NTHU LSA Lab 54
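The bitstring numbering above can be sketched in a few lines of C (mine, not from the slides): XOR-ing a node id with a single bit gives its neighbor along that dimension, so any two of the p = 2^d nodes are at most d = log2(p) hops apart.

#include <stdio.h>

/* Print the neighbors of one node in a d-dimensional hypercube:
   flipping bit k of the node id gives the neighbor in dimension k. */
void print_neighbors(unsigned node, unsigned d) {
    printf("node %u neighbors:", node);
    for (unsigned k = 0; k < d; k++)
        printf(" %u", node ^ (1u << k));   /* flip one address bit */
    printf("\n");
}

int main(void) {
    unsigned d = 4;                        /* 4-D hypercube: p = 16 nodes */
    for (unsigned node = 0; node < (1u << d); node++)
        print_neighbors(node, d);
    return 0;
}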


6-Dimensional Mesh/Torus on K-Computer
 K-computer (Kei means “京”)
 Designed by FUJITSU, Japan
 World’s #5 fastest supercomputer
 80,000 compute nodes; 640,000 cores
 Network connection: Tofu
 Introduction video clip:
 http://www.fujitsu.com/global/about/businesspolicy/tech/k/whatis/network/

Parallel Programming – NTHU LSA Lab 55


Network Device: InfiniBand
 A computer network communications link used in high-
performance computing featuring very high throughput
 It is the most commonly used interconnect in supercomputers
 Manufactured by Mellanox

[Figure: InfiniBand vs. Ethernet hardware]

Parallel Programming – NTHU LSA Lab 56


InfiniBand: Usage in TOP500

Parallel Programming – NTHU LSA Lab 57


InfiniBand: RDMA

Source: Mellanox Parallel Programming – NTHU LSA Lab 58


InfiniBand vs. Gigabit Ethernet
                 InfiniBand                      Ethernet
Protocol         Guaranteed, credit-based flow   Best-effort delivery;
                 control; end-to-end congestion  TCP/IP protocol designed
                 management; hardware-based      for L3/L4 switching;
                 retransmission                  software-based retransmission
RDMA             Yes                             No (only now starting)
Latency          Low                             High
Throughput       High                            Low
Max cable length 4 km                            up to 70 km
Price            36-port switch: 25k USD         36-port switch: 1.5k USD
                 QDR adapter: 500 USD            Network card: 50 USD
Parallel Programming – NTHU LSA Lab 59
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Supercomputer
 Processor technology
 Interconnect & Network technology
 I/O & Storage technology
 Parallel Program Analysis
Parallel Programming – NTHU LSA Lab 60
How About I/O?
 Not so great…

Source: http://www.mostlycolor.ch/2015_10_01_archive.html
Parallel Programming – NTHU LSA Lab 61
Opportunity in I/O
 Memory hierarchy
 New storage technology is coming: Flash
 It is still challenging to put the data in the right place,
at the right time
 There is always a price to pay
 [Figure: memory hierarchy from fastest to slowest: register, cache,
main memory, Flash (non-volatile memory), hard disk drive,
magnetic tape (storage systems)]


Parallel Programming – NTHU LSA Lab 62
Opportunity in I/O
 Parallel file and IO systems
 Lustre file system, MPI-IO

[Figure: compute nodes accessing multiple IO servers in parallel]

Parallel Programming – NTHU LSA Lab 63


Opportunity in I/O
 Burst buffering
 Add non-volatile RAM at the IO server nodes as a
buffer to smooth bursty traffic patterns, improving
the IO performance of storage systems and reducing
the IO latency

Parallel Programming – NTHU LSA Lab 64


Summary
 People have been, and will always be, able to find ways to
keep computing performance growing
 Technology: CPU scaling, distributed computing, new
processor architecture
 Optimization: algorithm, data management, compiler
 System design: network topology, file system
 It is more than just computing
 Networks and IO become greater concerns
 Does the performance reported by supercomputers
really meet the needs of applications?
 People are starting to rethink what the right objective
and benchmark should be for designing the next generation
of supercomputers.
Parallel Programming – NTHU LSA Lab 65
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Parallel Program Analysis
 Speedup & Efficiency
 Strong scalability vs. Weak scalability
 Time complexity & Cost optimality

Parallel Programming – NTHU LSA Lab 66


Speedup Factor
 Program speedup factor: S(p) = Ts / Tp
 Ts: execution time using the BEST sequential algorithm
 Tp: execution time using p processors
 Linear speedup: S(p) = p
 Ideal maximum speedup in theory
 Superlinear speedup: S(p) > p
 Occasionally happens in practice
 Extra HW resources (e.g. memory)
 SW or HW optimization (e.g. caching)
 System efficiency: E(p) = Ts / (Tp × p) = S(p) / p × 100%
 [Figure: speedup factor vs. number of processors: ideal (linear),
superlinear, and normal cases]
Parallel Programming – NTHU LSA Lab 67
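A quick worked example (illustrative numbers, not from the slides): if the best sequential time is Ts = 100 s and the parallel time on p = 8 processors is Tp = 20 s, then S(8) = 100/20 = 5 and E(8) = S(8)/8 = 62.5%.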
Maximum Speedup
 Difficult to reach ideal max. speedup: S(p)=p
 Not every part of a computation can be parallelized
(resulting in idle processors)
 Extra computation is needed in the parallel version
(e.g. due to synchronization cost)
 Communication time between processes
(normally the major factor)
 [Figure: in normal cases the speedup factor flattens as the
number of processors increases]
Parallel Programming – NTHU LSA Lab 68
Maximum Speedup
 Let 𝑓 be the fraction of computations that can
NOT be parallelized
S(p) = ts / (f·ts + (1 − f)·ts / p) = p / (1 + (p − 1)·f)

 [Figure: a serial fraction f·ts followed by the parallelizable
portion (1 − f)·ts divided among p processors]
Parallel Programming – NTHU LSA Lab 69
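A quick worked example (illustrative numbers, not from the slides): with f = 0.1 and p = 16, S(16) = 16 / (1 + 15 × 0.1) = 16 / 2.5 = 6.4; letting p → ∞ shows the speedup can never exceed 1/f = 10, no matter how many processors are added.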


Maximum Speedup
 [Figure: speedup S(p) plotted against the number of processors
for several values of f]
Parallel Programming – NTHU LSA Lab 70


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Parallel Program Analysis
 Speedup & Efficiency
 Strong scalability vs. Weak scalability
 Time complexity & Cost optimality

Parallel Programming – NTHU LSA Lab 71


Strong Scaling
 The problem size stays fixed but the number of
processing elements is increased
 It is used to find a "sweet spot" that allows the
computation to complete in a reasonable amount of
time, yet does not waste too many cycles due to parallel
overhead
 Linear scaling is achieved if the speedup is equal to the
number of processing elements
 [Figure: execution time vs. # of cores, and speedup vs. # of cores
against the linear-speedup line]
Parallel Programming – NTHU LSA Lab 72
Weak Scaling
 The problem size (workload) assigned to each processing
element stays fixed and additional processing elements
are used to solve a larger total problem
 It is a justification for programs that take a lot of memory
or other system resources (e.g., a problem wouldn't fit in
RAM on a single node)
 Linear scaling is achieved if the run time stays constant
while the workload is increased
 [Figure: execution time vs. # of cores, and speedup vs. # of cores
against the linear-speedup line]
Parallel Programming – NTHU LSA Lab 73
Strong Scaling vs. Weak Scaling
 Strong scaling
 Linear scaling is harder to achieve, because the
communication overhead may increase in
proportion to the scale
 Weak scaling
 Linear scaling is easier to achieve because
programs typically employ nearest-neighbor
communication patterns where the
communication overhead is relatively constant
regardless of the number of processes used
Parallel Programming – NTHU LSA Lab 74
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Parallel Program Analysis
 Speedup & Efficiency
 Strong scalability vs. Weak scalability
 Time complexity & Cost optimality

Parallel Programming – NTHU LSA Lab 75


Time Complexity Analysis
 Tp = Tcomp + Tcomm
 Tp: Total execution time of a parallel algorithm
 Tcomp: Computation part
 Tcomm: Communication part
 Tcomm = q (Tstartup + n Tdata)
 Tstartup: Startup time / message latency (assumed constant)
 Tdata: Transmission time to send one data item
 n: Number of data items in a message
 q: Number of messages
Parallel Programming – NTHU LSA Lab 76
Time Complexity Example 1
 Algorithm phase:
1. Computer 1 sends n/2 numbers to computer 2
2. Both computers add n/2 numbers simultaneously
3. Computer 2 sends its partial result back to computer 1
4. Computer 1 adds the partial sums to produce the final result
 Complexity analysis:
 Computation (for step 2 & 4):
Tcomp = n/2 + 1 = O(n)
 Communication (for step 1 & 3):
Tcomm = (Tstartup + n/2 x Tdata) + (Tstartup + Tdata)
= 2Tstartup + (n/2 + 1) Tdata = O(n)
 Overall complexity: O(n)
Parallel Programming – NTHU LSA Lab 77
Time Complexity Example 2
 Adding n numbers using m processes
 Evenly partition numbers to processes
 [Figure: the n numbers are partitioned into m blocks
(x0..x(n/m−1), x(n/m)..x(2n/m−1), …, x((m−1)n/m)..x(n−1));
each process adds its block to form a partial sum, and the
partial sums are then added to produce the final sum]
Parallel Programming – NTHU LSA Lab 78


Time Complexity Example 2 (cont.)
 Sequential: O(n)
 Parallel:
 Phase 1: Send numbers to slaves
tcomm1 = m (tstartup + (n/m) tdata)
 Phase 2: Compute partial sums
tcomp1 = n/m − 1
 Phase 3: Send results to master
tcomm2 = m (tstartup + tdata)
 Phase 4: Compute final accumulation
tcomp2 = m − 1
 Overall (tradeoff between computation & communication):
tp = 2m tstartup + (n + m) tdata + m + n/m − 2 = O(m + n/m)
Parallel Programming – NTHU LSA Lab 79
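A brief follow-up observation (not on the slide): the O(m + n/m) bound makes the tradeoff explicit, since m + n/m is minimized at m ≈ √n, giving roughly O(√n) overall; adding processes beyond that point only inflates the communication term.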
Cost-Optimal Algorithm
 Definition:
 Cost to solve a problem is proportional to the
execution time on a single processor system
 O(Tp ) x N = O(Ts)
 Example:
 Sequential algo: O(N log N)
 Parallel algo1: uses N processors with O(log N) time
→ cost = N × O(log N) = O(N log N), which is cost-optimal
 Parallel algo2: uses N^2 processors with O(1) time
→ cost = N^2 × O(1) = O(N^2), which is NOT cost-optimal

Parallel Programming – NTHU LSA Lab 80


Reference
 Textbook: Parallel Computing Chap1
 TOP500: https://www.top500.org/
 Blaise Barney, Lawrence Livermore National
Laboratory, Introduction to Parallel Computing,
https://computing.llnl.gov/tutorials/parallel_comp/
 Flynn's taxonomy,
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
 K computer,
http://www.fujitsu.com/global/about/businesspolicy/tech/k/
 InfiniBand, http://www.infinibandta.org/
Parallel Programming – NTHU LSA Lab 81
