
Introduction to

Parallel Computing

National Tsing Hua University


Instructor: Jerry Chou
2017, Summer Semester
Outline
 Parallel Computing Introduction
 What is parallel computing
 Why we need parallel computing
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Parallel Program Analysis

Parallel Programming – NTHU LSA Lab 2


What is Parallel Computing?
“Solve a single problem by using multiple processors
(i.e., cores) working together”
 Traditionally, programs have been written for serial computation:
a problem is executed as a single stream of instructions
(t1, t2, t3, …, tN) on one processor
 In parallel computing, multiple compute resources are used to solve
a computational problem: the problem is divided so that several
processors each execute their own instruction stream simultaneously
Parallel Programming – NTHU LSA Lab 3
Difference between parallel computing
& distributed computing
The two terms are closely related,
but they come from different backgrounds
 Parallel computing …
 Means different activities happen at the same time
 Spreads a single application over many cores/
processors/processes to solve it at larger scale or faster
 Mostly used in scientific computing
 Distributed computing …
 Activities spread across systems or distant servers
 Focuses more on concurrency and resource sharing
 Comes from the business/commercial world
Parallel Programming – NTHU LSA Lab 4
The Universe is Parallel
 Parallel computing is an evolution of serial computing
that attempts to emulate what has always been the
state of affairs in the natural world

Parallel Programming – NTHU LSA Lab 5


Why We Need Parallel Computing
 Save time
 Use more resources to shorten execution time, with potential
cost savings (e.g., 4 hours of work finishes in 1 hour)
 Shorter execution time allows more runs or more tuning
opportunities

                    DUAL XEON CPU server   DGX-1 GPU server (8 GPUs)
FLOPS               3 TF                   170 TF
Node Mem BW         76 GB/s                768 GB/s
AlexNet Train Time  150 hr                 2 hr
Train in 2 hr       >250 nodes             1 node
Parallel Programming – NTHU LSA Lab 6
Why We Need Parallel Computing
 Solve larger problems
 Impossible or impractical to solve on a single computer
 Scientific computing:
Trillions of particles
Tens to hundreds of parameters
TBs of data to be processed/analyzed
Several hours of execution
 Example: a simulation of 1 trillion particles in a 4.225 Gpc box
with 6 kpc force resolution, using millions of cores (PetaFLOPS)

The world has been driven by science research!
Parallel Programming – NTHU LSA Lab 7
Why We Need Parallel Computing
 Make better use of the underlying parallel hardware
 Advances in computer architecture
 Examples: 12-core IBM Blade multi-core CPU, 512-core NVIDIA Fermi GPU

Parallel Programming – NTHU LSA Lab 8


The Death of CPU Scaling
 Increase in transistor density ≠ increase in performance
 Power and clock-speed improvements have collapsed

“Parallel computing is a trend and an essential tool
in today’s world!”
Parallel Programming – NTHU LSA Lab 9
Trend of Parallel Computing
 Single-Core Era
 Enabled by: Moore’s Law, voltage scaling
 Constrained by: power, complexity
 Assembly → C/C++/Java …
 Multi-Core Era
 Enabled by: Moore’s Law, SMP
 Constrained by: power, parallel SW, scalability
 Pthread → OpenMP …
 Heterogeneous Systems Era
 Enabled by: abundant data parallelism, power-efficient GPUs
 Constrained by: programming models, communication overhead
 Shader → CUDA/OpenCL …
 Distributed System Era
 Enabled by: networking
 Constrained by: synchronization, communication overhead
 MPI → MapReduce …
Parallel Programming – NTHU LSA Lab 10
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Flynn’s classic taxonomy
 Memory architecture classification
 Programming model classification
 Supercomputer & Latest technologies
 Parallel Program Analysis

Parallel Programming – NTHU LSA Lab 11


Parallel Computer Classification
 Flynn’s classic taxonomy
 Since 1966 (50 years ago …)
 From the processing-unit perspective: classify
computer architectures based on two independent
dimensions: Instruction & Data
 SISD: Single Instruction, Single Data
 SIMD: Single Instruction, Multiple Data
 MISD: Multiple Instruction, Single Data
 MIMD: Multiple Instruction, Multiple Data
Parallel Programming – NTHU LSA Lab 12
Flynn’s classic taxonomy: SISD
 Single Instruction, Single Data (SISD):
 A serial (non-parallel) computer
 Single Instruction: Only one instruction stream is being acted
on by the CPU during any one clock cycle
 Single Data: Only one data stream is being used as input
during any one clock cycle
 Example: Old mainframes, single-core processor

Parallel Programming – NTHU LSA Lab 13


Flynn’s classic taxonomy: SIMD
 Single Instruction, Multiple Data (SIMD):
 Single Instruction: All processing units execute the same
instruction at any given clock cycle
 Multiple Data: Each processing unit can operate on a
different data element
 Example: GPU, vector processors (x86 AVX instructions)

Parallel Programming – NTHU LSA Lab 14
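To make the SIMD idea above concrete, here is a minimal C sketch (mine, not from the slides) using x86 AVX intrinsics: a single vector instruction adds eight single-precision floats at once. The file name and compile flag (e.g., gcc -mavx simd_demo.c) are assumptions.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);      /* load 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);   /* one instruction, 8 additions */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);           /* prints 11 22 ... 88 */
    printf("\n");
    return 0;
}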


Flynn’s classic taxonomy: MISD
 Multiple Instruction, Single Data (MISD):
 Multiple Instruction: Each processing unit operates on the
data independently via separate instruction streams.
 Single Data: A single data stream is fed into multiple
processing units.
 Example: only experimental machines (e.g., at CMU in 1971);
could be used for fault tolerance

Parallel Programming – NTHU LSA Lab 15


Flynn’s classic taxonomy: MIMD
 Multiple Instruction, Multiple Data (MIMD):
 Multiple Instruction: Every processor may be executing a
different instruction stream
 Multiple Data: Every processor may be working with a
different data stream
 Example: Most modern computers, such as multi-core CPU

Parallel Programming – NTHU LSA Lab 16


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Flynn’s classic taxonomy
 Memory architecture classification
 Programming model classification
 Supercomputer & Latest technologies
 Parallel Program Analysis

Parallel Programming – NTHU LSA Lab 17


Shared Memory vs. Distributed Memory
Computer Architecture

 Distributed memory: each CPU (CPU0–CPU3) has its own
memory module (MEM0–MEM3)
 Shared memory: all CPUs (CPU0–CPU3) access a single
shared memory (MEM)

Parallel Programming – NTHU LSA Lab 18


Shared Memory Multiprocessor
Computer System
 Single computer with multiple internal multi-core processors

Parallel Programming – NTHU LSA Lab 19


Shared Memory Computer Architecture
 Uniform Memory Access (UMA):
 Most commonly represented today by Symmetric
Multiprocessor (SMP) machines
 Identical processors
 Equal access times to memory
 Example: commercial servers
 Non-Uniform Memory Access (NUMA):
 Often made by physically linking two or more SMPs
 One SMP can directly access
memory of another SMP
 Memory access across link is slower
 Example: HPC server

Parallel Programming – NTHU LSA Lab 20


Distributed Memory Multicomputer
 Connect multiple computers to form a
computing platform without sharing memory

 Cluster: tens of servers
 Supercomputer: hundreds of servers
 Datacenter: thousands of servers
Parallel Programming – NTHU LSA Lab 21
Distributed Memory Multicomputer
 Requires a communication network (i.e., not a bus)
to connect inter-processor memory
 Processors have their own memory & address space
 Memory change made by a processor has NO effect on
the memory of other processors
 Programmers or programming tools are responsible for
explicitly defining how and when data is communicated
between processors
Network fabric:
Ethernet,
InfiniBand

Parallel Programming – NTHU LSA Lab 22


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Flynn’s classic taxonomy
 Memory architecture classification
 Programming model classification
 Supercomputer & Latest technologies
 Parallel Program Analysis

Parallel Programming – NTHU LSA Lab 23


Parallel Programming Model
 Parallel programming models exist as an abstraction
above hardware & memory architectures
 In general, programming models are designed to match
the computer architecture
 Shared memory prog. model for shared memory machine
 Message passing prog. model for distributed memory machine
 But programming models are NOT restricted by the
machine or memory architecture
 Message passing model can be supported on SHARED memory
machine: e.g., MPI on a single server
 Shared memory model on DISTRIBUTED memory machine:
e.g., Partitioned Global Address Space
Parallel Programming – NTHU LSA Lab 24
Shared Memory Programming Model
 A single process can have multiple, concurrent execution paths
 Threads have local data, but also share resources
 Threads communicate with each other through global memory
 Threads can come and go, but the main program remains to
provide the necessary shared resources until the application
has completed

Parallel Programming – NTHU LSA Lab 25


Shared Memory Programming Model
 Implementation
 A library of subroutines called from parallel source code
E.g.: POSIX Thread (Pthread)
 A set of compiler directives embedded in either serial or
parallel source code
E.g.: OpenMP
Pthread version:
#include <pthread.h>
#include <stdio.h>

void *print_message_function(void *ptr) {    /* thread entry point */
    printf("Hello, world.\n");
    return NULL;
}

int main() {
    pthread_t thread;
    pthread_create(&thread, NULL, print_message_function, NULL);
    pthread_join(thread, NULL);              /* wait for the thread */
    return 0;
}

OpenMP version:
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel                     /* each thread runs this block */
    {
        printf("Hello, world.\n");
    }
    return 0;
}
Parallel Programming – NTHU LSA Lab 26
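(Build commands are not shown on the slides; as an assumption, on a typical Linux/GCC toolchain the two hello-world programs above would be compiled with gcc hello_pthread.c -pthread and gcc hello_omp.c -fopenmp, where the file names are placeholders.)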
Message Passing Programming Model
 A set of tasks that use their own local memory
during computation
 Multiple tasks can reside on the same physical machine
and/or across an arbitrary number of machines
 Tasks exchange data through communications by
sending and receiving messages (Memory copy)
 MPI API:
 Send, Recv, Bcast,
Gather, Scatter, etc.

Parallel Programming – NTHU LSA Lab 27
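To make the message-passing model above concrete, here is a minimal C sketch (mine, not from the slides) built on the MPI_Send/MPI_Recv calls listed on the slide: every task keeps its data in its own local memory and all data exchange is explicit. The file name and run command (mpicc sum_mpi.c -o sum_mpi; mpirun -np 4 ./sum_mpi) are assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */

    int local = rank + 1;                   /* data in this task's own memory */

    if (rank != 0) {
        /* worker tasks explicitly send their local value to task 0 */
        MPI_Send(&local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        int sum = local, msg;
        for (int src = 1; src < size; src++) {
            /* task 0 explicitly receives one message per worker */
            MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += msg;
        }
        printf("Sum of 1..%d gathered from %d tasks: %d\n", size, size, sum);
    }

    MPI_Finalize();
    return 0;
}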


Shared Memory vs. Message Passing
Shared Memory
 Convenient
 Can share data structures
 Just annotate loops
 Closer to serial code
 Disadvantages
 No locality control
 Does not scale
 Race conditions
Message Passing
 Scalable
 Locality control
 Communication is all explicit in code (cost transparency)
 Disadvantages
 Need to rethink entire application / data structures
 Lots of tedious pack/unpack code
 Don’t know when to say “receive” for some problems
Parallel Programming – NTHU LSA Lab 28
Summary
 The designs and popularity of programming models and
parallel systems are highly influenced by each other
 OpenMP, MPI, Pthreads, and CUDA are just some of the
parallel languages for users to do parallel programming
 In reality, understanding parallel computing itself is more
IMPORTANT than knowing how to do parallel
programming, because that’s how you can…
 Learn new parallel programming tools quickly
 Understand the performance of your program
 Optimize the performance of your program

Parallel Programming – NTHU LSA Lab 29


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Supercomputer
 Processor technology
 Interconnect & Network technology
 I/O & Storage technology
 Parallel Program Analysis
Parallel Programming – NTHU LSA Lab 30
Today’s Typical Parallel Computers
 Racks: 16~42U
 Node/Server: 1~4U
 Multi-core Processor: 4~12 cores
 Co-Processor: 100x cores / 1000x threads
Parallel Programming – NTHU LSA Lab 31
Supercomputers
 Definition: A computer with a high-level
computational capacity compared to
a general-purpose computer
 Its performance is measured in floating-
point operations per second (FLOPS) instead
of million instructions per second (MIPS)
 Ranked by the TOP500 list since 1993
 According to the HPL benchmark results
 Announced twice a year at ISC and SC conferences

Parallel Programming – NTHU LSA Lab 32


HPL Benchmark
 A parallel implementation of the Linpack library
 Measures the floating-point rate of execution
 Computation:
 Solves a dense system of linear equations
 LU factorization with panel factorization
 Divides the matrix into many pieces
 All parameters must be determined by the user

Parallel Programming – NTHU LSA Lab 33


What makes it a supercomputer
 What makes it a supercomputer?
 All the latest hardware technologies
 Customized system configurations
 Optimized software and libraries
 Huge amount of cost in money and energy

 It represents a competition
of technology and wealth
among countries

Parallel Programming – NTHU LSA Lab 34


TOP500 List (2016 June)
Rank Country System     Vendor  Power (kW) #Cores Accelerator Rmax  Rpeak
                                                              (PFLOPS)
1    China   TaihuLight NRCPC   15,371     10M    –           93.0  125.4
2    China   Tianhe-2   NUDT    17,808     3M     Xeon Phi    33.9  54.9
3    US      Titan      Cray    8,209      560K   Tesla K20X  17.6  27.1
4    US      Sequoia    IBM     7,890      1.5M   –           17.2  20.1
5    Japan   K          Fujitsu 12,660     705K   –           10.5  11.3
6    US      Mira       IBM     3,954      786K   –           8.6   10.0
7    US      Trinity    Cray    –          301K   –           8.1   11.1
8    Swiss   Piz Daint  Cray    2,325      116K   Tesla K20X  6.2   7.8

 Accelerators provide huge computing power
 Titan’s Rmax without GPU was only 2K!!!

Parallel Programming – NTHU LSA Lab 35


TOP500 Trend: CPU
 Intel CPUs account for more than 80%

Parallel Programming – NTHU LSA Lab 36


TOP500 Trend: Interconnect
 InfiniBand has a much larger share in performance

Parallel Programming – NTHU LSA Lab 37


TOP500 Trend: Vendor
 CRAY and IBM still have a larger share of
performance

Parallel Programming – NTHU LSA Lab 38


TOP500 Trend: Country
 China has a huge jump because of the new
supercomputer
2015 Nov 2016 June

Parallel Programming – NTHU LSA Lab 39


TOP500 Trend: Computing power
 The goal is to reach exascale computing, 1 EFLOPS (10^18 FLOPS),
by 2020

Parallel Programming – NTHU LSA Lab 40


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Supercomputer
 Processor technology
 Interconnect & Network technology
 I/O & Storage technology
 Parallel Program Analysis
Parallel Programming – NTHU LSA Lab 41
Limitation of CPU
General Purpose Processor
 A general-purpose CPU (central processing unit) can do
anything, but its design works against achieving
the best performance for any specific application.

[Source: NVIDIA] Parallel Programming – NTHU LSA Lab 42


Comparison Numbers
                Intel Xeon E5-2697 v3  NVIDIA Tesla K80     Intel Xeon Phi 7120P
                CPU (Haswell)          GPU (Kepler)         (Knight's Corner)
Cores           2x14                   2x13 (SMX)           61
Logical Cores   2x28                   2x2,496              244
Frequency       2.60 GHz               562 MHz              1.238 GHz
GFLOPS (double) 2x583                  2x1,455              1,208
Max memory      768 GB                 2x12 GB              16 GB
Max Mem BW      2x68 GB/s              2x240 GB/s (internal) 352 GB/s (internal)
Price           2,700 USD              5,000 USD            4,000 USD
Source: https://www.xcelerit.com/computing-benchmarks/libor/haswell_k80_phi/
Parallel Programming – NTHU LSA Lab 43
NVIDIA General-Purpose GPU (GPGPU)
 Extends the GPU as a form of stream processor (or vector processor)
for general-purpose computing
 Suited for embarrassingly parallel tasks and vectorized operations
 Hierarchical memory structure
 Used as an accelerator/co-processor
 [Figure: the host feeds the GPU (device) through an input assembler
and thread execution manager; many SMs, each with per-block shared
memory (PBSM), load/store to global and constant memory]
Parallel Programming – NTHU LSA Lab 44
Intel Xeon Phi
 A brand name given to a series of manycore processors that follow
Intel's MIC (Many Integrated Core) architecture
 Typically it has 50-70 cores on the die connected by a bidirectional
ring network
 More like a separate system
 It runs Intel assembly code just like the main CPU in your computer
 It has an embedded Linux
 Second-generation chips (Knights Landing) can be used as a
standalone CPU

Parallel Programming – NTHU LSA Lab 45


Sunway TaihuLight SW26010
 Each node contains four clusters of 64 CPEs (SIMD)
 Each cluster is accompanied by an MPE (general
purpose)

Parallel Programming – NTHU LSA Lab 46


Google Tensor Processing Unit (TPU)
 Specifically for deep learning (the TensorFlow framework)
 30–80x higher performance-per-watt than
contemporary CPUs and GPUs
 Only for reduced-precision computation (e.g. 8-bit precision)
 Matrix Multiplier Unit: uses a systolic array to perform hundreds of
thousands of matrix operations in a single clock cycle
 Systolic array: the ALUs perform only multiplications and
additions in fixed patterns
 Reference
 https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Parallel Programming – NTHU LSA Lab 47
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Supercomputer
 Processor technology
 Interconnect & Network technology
 I/O & Storage technology
 Parallel Program Analysis
Parallel Programming – NTHU LSA Lab 48
Communication
 Communication has the greatest impact on the
performance of parallel programs (even more critical
than computation or memory)
 The network is generally much slower than the CPU
 Communication is common in parallel programs
 Synchronization is expensive and could grow exponentially
with the number of servers

Parallel Programming – NTHU LSA Lab 49


Interconnection Networks
 Network design considerations
 Scalability, Performance, Resilience and Cost
Application
• Communication pattern & protocol
Interconnection Network Topology
• Network diameter
• Re-routing path for fault tolerance
• # fan-in & fan-out degree per node
Network Devices (Cable, Switch, Adapter, etc.)
• Bandwidth: # bits transferred per second
• Latency: time to pack, unpack, and send a message
• Scalability: # of ports on the adapter and switch
Parallel Programming – NTHU LSA Lab 50
Network Topology
             Diameter   Bisection    #Links   Degree
             (latency)  (resilience) (cost)   (scalability)
Linear array p−1        1            p−1      2

 Cheapest solution, but not reliable, and has long latency
Parallel Programming – NTHU LSA Lab 51


Network Topology
             Diameter   Bisection    #Links       Degree
             (latency)  (resilience) (cost)       (scalability)
Linear array p−1        1            p−1          2
Ring         p/2        2            p            2
Tree         2·log2(p)  1            2(p−1)       3
2-D Mesh     2(√p − 1)  √p           2√p(√p − 1)  4
 Particularly suitable for some applications, such as the ocean
application and matrix calculations
 Can be extended to a 3-D mesh
Parallel Programming – NTHU LSA Lab 52
Network Topology
             Diameter   Bisection    #Links         Degree
             (latency)  (resilience) (cost)         (scalability)
Linear array p−1        1            p−1            2
Ring         p/2        2            p              2
Tree         2·log2(p)  1            2(p−1)         3
2-D Mesh     2(√p − 1)  √p           2√p(√p − 1)    4
2-D Torus    √p − 1     2√p          2p             4
Hypercube    log2(p)    p/2          (p/2)·log2(p)  log2(p)
 Hypercube: smaller diameter and larger bisection, but also higher
cost and degree than mesh and torus
 More suitable for smaller-scale systems
Parallel Programming – NTHU LSA Lab 53
Network Topology
 4-D hypercube
 Each node is numbered with a bitstring that is
log2(p) bits long.
 One bit can be flipped per hop so the diameter is
log2(p).

Parallel Programming – NTHU LSA Lab 54
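The bitstring numbering above can be sketched in a few lines of C (mine, not from the slides): XOR-ing a node id with a single bit gives its neighbor along that dimension, so any two of the p = 2^d nodes are at most d = log2(p) hops apart.

#include <stdio.h>

/* Print the neighbors of one node in a d-dimensional hypercube:
   flipping bit k of the node id gives the neighbor in dimension k. */
void print_neighbors(unsigned node, unsigned d) {
    printf("node %u neighbors:", node);
    for (unsigned k = 0; k < d; k++)
        printf(" %u", node ^ (1u << k));   /* flip one address bit */
    printf("\n");
}

int main(void) {
    unsigned d = 4;                        /* 4-D hypercube: p = 16 nodes */
    for (unsigned node = 0; node < (1u << d); node++)
        print_neighbors(node, d);
    return 0;
}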


6-Dimensional Mesh/Torus on K-Computer
 K-computer (Kei means “京”)
 Designed by FUJITSU, Japan
 World’s #5 fastest supercomputer
 80,000 compute nodes; 640,000 cores
 Network connection: Tofu
 Introduction video clip:
 http://www.fujitsu.com/global/about/businesspolicy/tech/k/whatis/network/

Parallel Programming – NTHU LSA Lab 55


Network Device: InfiniBand
 A computer network communications link used in high-
performance computing featuring very high throughput
 It is the most commonly used interconnect in supercomputers
 Manufactured by Mellanox

[Figure: InfiniBand vs. Ethernet hardware]

Parallel Programming – NTHU LSA Lab 56


InfiniBand: Usage in TOP500

Parallel Programming – NTHU LSA Lab 57


InfiniBand: RDMA

Source: Mellanox Parallel Programming – NTHU LSA Lab 58


InfiniBand vs. Gigabit Ethernet
                 InfiniBand                      Ethernet
Protocol         Guaranteed, credit-based flow   Best-effort delivery;
                 control; end-to-end congestion  TCP/IP protocol designed
                 management; hardware-based      for L3/L4 switching;
                 retransmission                  software-based retransmission
RDMA             Yes                             No (only now starting)
Latency          Low                             High
Throughput       High                            Low
Max cable length 4 km                            up to 70 km
Price            36-port switch: 25k USD         36-port switch: 1.5k USD
                 QDR adapter: 500 USD            Network card: 50 USD
Parallel Programming – NTHU LSA Lab 59
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Supercomputer
 Processor technology
 Interconnect & Network technology
 I/O & Storage technology
 Parallel Program Analysis
Parallel Programming – NTHU LSA Lab 60
How About I/O?
 Not so great…

Source: http://www.mostlycolor.ch/2015_10_01_archive.html
Parallel Programming – NTHU LSA Lab 61
Opportunity in I/O
 Memory hierarchy
 New storage technology is coming: Flash
 It is still challenging to put the data in the right place,
at the right time
 There is always a price to pay
 [Figure: memory hierarchy from fastest to slowest: register, cache,
main memory, Flash (non-volatile memory), hard disk drive,
magnetic tape (storage systems)]


Parallel Programming – NTHU LSA Lab 62
Opportunity in I/O
 Parallel file and IO systems
 Lustre file system, MPI-IO

[Figure: compute nodes accessing multiple IO servers in parallel]

Parallel Programming – NTHU LSA Lab 63


Opportunity in I/O
 Burst buffering
 Add non-volatile RAM at the IO server nodes as a
buffer to smooth bursty traffic patterns, improving
the IO performance of storage systems and reducing
the IO latency

Parallel Programming – NTHU LSA Lab 64


Summary
 People have been, and will always be, able to find ways to
keep computing performance growing
 Technology: CPU scaling, distributed computing, new
processor architecture
 Optimization: algorithm, data management, compiler
 System design: network topology, file system
 It is more than just computing
 Networks and IO become greater concerns
 Does the performance reported by supercomputers
really meet the needs of applications?
 People are starting to rethink what the right objective
and benchmark should be for designing the next generation
of supercomputers.
Parallel Programming – NTHU LSA Lab 65
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Parallel Program Analysis
 Speedup & Efficiency
 Strong scalability vs. Weak scalability
 Time complexity & Cost optimality

Parallel Programming – NTHU LSA Lab 66


Speedup Factor
 Program speedup factor: S(p) = Ts / Tp
 Ts: execution time using the BEST sequential algorithm
 Tp: execution time using p processors
 Linear speedup: S(p) = p
 Ideal maximum speedup in theory
 Superlinear speedup: S(p) > p
 Occasionally happens in practice
 Extra HW resources (e.g. memory)
 SW or HW optimization (e.g. caching)
 System efficiency: E(p) = Ts / (Tp × p) = S(p) / p × 100%
 [Figure: speedup factor vs. number of processors: ideal (linear),
superlinear, and normal cases]
Parallel Programming – NTHU LSA Lab 67
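A quick worked example (illustrative numbers, not from the slides): if the best sequential time is Ts = 100 s and the parallel time on p = 8 processors is Tp = 20 s, then S(8) = 100/20 = 5 and E(8) = S(8)/8 = 62.5%.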
Maximum Speedup
 Difficult to reach ideal max. speedup: S(p)=p
 Not every part of a computation can be parallelized
(resulting in idle processors)
 Extra computation is needed in the parallel version
(e.g. due to synchronization cost)
 Communication time between processes
(normally the major factor)
 [Figure: in normal cases the speedup factor flattens as the
number of processors increases]
Parallel Programming – NTHU LSA Lab 68
Maximum Speedup
 Let 𝑓 be the fraction of computations that can
NOT be parallelized
S(p) = ts / (f·ts + (1 − f)·ts / p) = p / (1 + (p − 1)·f)

 [Figure: a serial fraction f·ts followed by the parallelizable
portion (1 − f)·ts divided among p processors]
Parallel Programming – NTHU LSA Lab 69
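A quick worked example (illustrative numbers, not from the slides): with f = 0.1 and p = 16, S(16) = 16 / (1 + 15 × 0.1) = 16 / 2.5 = 6.4; letting p → ∞ shows the speedup can never exceed 1/f = 10, no matter how many processors are added.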


Maximum Speedup
 [Figure: speedup S(p) plotted against the number of processors
for several values of f]
Parallel Programming – NTHU LSA Lab 70


Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Parallel Program Analysis
 Speedup & Efficiency
 Strong scalability vs. Weak scalability
 Time complexity & Cost optimality

Parallel Programming – NTHU LSA Lab 71


Strong Scaling
 The problem size stays fixed but the number of
processing elements is increased
 It is used to find a "sweet spot" that allows the
computation to complete in a reasonable amount of
time, yet does not waste too many cycles due to parallel
overhead
 Linear scaling is achieved if the speedup is equal to the
number of processing elements
 [Figure: execution time vs. # of cores, and speedup vs. # of cores
against the linear-speedup line]
Parallel Programming – NTHU LSA Lab 72
Weak Scaling
 The problem size (workload) assigned to each processing
element stays fixed and additional processing elements
are used to solve a larger total problem
 It is a justification for programs that take a lot of memory
or other system resources (e.g., a problem wouldn't fit in
RAM on a single node)
 Linear scaling is achieved if the run time stays constant
while the workload is increased
 [Figure: execution time vs. # of cores, and speedup vs. # of cores
against the linear-speedup line]
Parallel Programming – NTHU LSA Lab 73
Strong Scaling vs. Weak Scaling
 Strong scaling
 Linear scaling is harder to achieve, because the
communication overhead may increase in
proportion to the scale
 Weak scaling
 Linear scaling is easier to achieve because
programs typically employ nearest-neighbor
communication patterns where the
communication overhead is relatively constant
regardless of the number of processes used
Parallel Programming – NTHU LSA Lab 74
Outline
 Parallel Computing Introduction
 Classifications of Parallel Computers &
Programming Models
 Supercomputer & Latest technologies
 Parallel Program Analysis
 Speedup & Efficiency
 Strong scalability vs. Weak scalability
 Time complexity & Cost optimality

Parallel Programming – NTHU LSA Lab 75


Time Complexity Analysis
 Tp = Tcomp + Tcomm
 Tp: Total execution time of a parallel algorithm
 Tcomp: Computation part
 Tcomm: Communication part
 Tcomm = q (Tstartup + n Tdata)
 Tstartup: Startup time / message latency (assumed constant)
 Tdata: Transmission time to send one data item
 n: Number of data items in a message
 q: Number of messages
Parallel Programming – NTHU LSA Lab 76
Time Complexity Example 1
 Algorithm phase:
1. Computer 1 sends n/2 numbers to computer 2
2. Both computers add n/2 numbers simultaneously
3. Computer 2 sends its partial result back to computer 1
4. Computer 1 adds the partial sums to produce the final result
 Complexity analysis:
 Computation (for step 2 & 4):
Tcomp = n/2 + 1 = O(n)
 Communication (for step 1 & 3):
Tcomm = (Tstartup + n/2 x Tdata) + (Tstartup + Tdata)
= 2Tstartup + (n/2 + 1) Tdata = O(n)
 Overall complexity: O(n)
Parallel Programming – NTHU LSA Lab 77
Time Complexity Example 2
 Adding n numbers using m processes
 Evenly partition numbers to processes
 [Figure: the n numbers are partitioned into m blocks
(x0..x(n/m−1), x(n/m)..x(2n/m−1), …, x((m−1)n/m)..x(n−1));
each process adds its block to form a partial sum, and the
partial sums are then added to produce the final sum]
Parallel Programming – NTHU LSA Lab 78


Time Complexity Example 2 (cont.)
 Sequential: O(n)
 Parallel:
 Phase 1: Send numbers to slaves
tcomm1 = m (tstartup + (n/m) tdata)
 Phase 2: Compute partial sums
tcomp1 = n/m − 1
 Phase 3: Send results to master
tcomm2 = m (tstartup + tdata)
 Phase 4: Compute final accumulation
tcomp2 = m − 1
 Overall (tradeoff between computation & communication):
tp = 2m tstartup + (n + m) tdata + m + n/m − 2 = O(m + n/m)
Parallel Programming – NTHU LSA Lab 79
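A brief follow-up observation (not on the slide): the O(m + n/m) bound makes the tradeoff explicit, since m + n/m is minimized at m ≈ √n, giving roughly O(√n) overall; adding processes beyond that point only inflates the communication term.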
Cost-Optimal Algorithm
 Definition:
 Cost to solve a problem is proportional to the
execution time on a single processor system
 O(Tp ) x N = O(Ts)
 Example:
 Sequential algo: O(N log N)
 Parallel algo1: uses N processors with O(log N) time
→ cost = N × O(log N) = O(N log N), which is cost-optimal
 Parallel algo2: uses N^2 processors with O(1) time
→ cost = N^2 × O(1) = O(N^2), which is NOT cost-optimal

Parallel Programming – NTHU LSA Lab 80


Reference
 Textbook: Parallel Computing Chap1
 TOP500: https://www.top500.org/
 Blaise Barney, Lawrence Livermore National
Laboratory, Introduction to Parallel Computing,
https://computing.llnl.gov/tutorials/parallel_comp/
 Flynn's taxonomy,
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
 K computer,
http://www.fujitsu.com/global/about/businesspolicy/tech/k/
 InfiniBand, http://www.infinibandta.org/
Parallel Programming – NTHU LSA Lab 81
