Introduction To Parallel Computing: National Tsing Hua University Instructor: Jerry Chou 2017, Summer Semester
Parallel Computing
[Figure: tasks t1, t2, t3, …, tN distributed across multiple processors and executed simultaneously]
Difference between parallel computing & distributed computing
The two terms are closely related, but they come from different backgrounds
Parallel computing …
Means different activities happen at the same time
Spreads a single application over many cores/processors/processes to solve bigger problems or solve them faster
Mostly used in scientific computing
Distributed computing …
Activities spread across systems or distant servers
Focuses more on concurrency and resource sharing
Comes from the business/commercial world
The Universe is Parallel
Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world
[Figure: 12-core IBM Blade multi-core CPU; 512-core NVIDIA Fermi GPU]
Multi-Core Era
Enabled by: Moore's Law, SMP
Constrained by: Power, Parallel SW, Scalability
Programming models: Pthread, OpenMP, …
Distributed System Era
Enabled by: Networking
Constrained by: Synchronization, Comm. overhead
Programming models: MPI, MapReduce, …
Outline
Parallel Computing Introduction
Classifications of Parallel Computers & Programming Models
Flynn’s classic taxonomy
Memory architecture classification
Programming model classification
Supercomputer & Latest technologies
Parallel Program Analysis
Multi-core Processor: 4~12 cores
Co-Processor: 100x cores / 1000x threads
Supercomputers
Definition: A computer with a high level of computational capacity compared to a general-purpose computer
Its performance is measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS)
Ranked by the TOP500 list since 1993
According to the HPL benchmark results
Announced twice a year at the ISC and SC conferences
Computation: solve a dense linear system of equations
It represents a competition of technology and wealth among countries
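As a rough worked example (standard HPL accounting, not stated on this slide): HPL solves a dense n x n system Ax = b by LU factorization, which takes about (2/3)n^3 + 2n^2 floating-point operations, so the reported performance is roughly
R ≈ ((2/3)n^3 + 2n^2) / Trun  FLOPS
where Trun is the measured wall-clock time of the benchmark run.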
[Figure: GPU architecture with many streaming multiprocessors (SMs), each with per-block shared memory (PBSM), accessing global and constant memory through load/store units]
Intel Xeon Phi
A brand name given to a series of manycore processors that follow Intel's MIC (Many Integrated Core) architecture
Typically it has 50-70 cores on the die connected by a bidirectional ring network
More like a separate system
It runs Intel assembly code just like the main CPU in your computer
It has an embedded Linux
Second-generation chips (Knights Landing) can be used as a standalone CPU
[Figure: system diagram with InfiniBand and Ethernet interconnects. Source: http://www.mostlycolor.ch/2015_10_01_archive.html]
Opportunity in I/O
Memory hierarchy
New storage technology is coming: Flash
It is still challenging to put the data in the right place, at the right time
There is always a price to pay
[Figure: memory hierarchy, from fastest/smallest to slowest/largest: register, cache, main memory, Flash (non-volatile memory), hard disk drive, I/O server]
Speedup factor
Ideal maximum speedup in theory: S(p) = p
Superlinear speedup: S(p) > p
Occasionally happens in practice, due to:
Extra HW resources (e.g. memory)
SW or HW optimization (e.g. caching)
System efficiency: E(p) = Ts / (Tp × p) × 100% = S(p) / p × 100%
[Figure: speedup vs. number of processors, showing ideal linear speedup, superlinear speedup, and normal cases]
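A minimal sketch (not taken from the lecture) of how S(p) and E(p) can be measured with OpenMP; the summation loop and problem size are arbitrary placeholders:

#include <stdio.h>
#include <omp.h>

int main(void) {
    const long N = 200000000L;          /* arbitrary problem size */
    double sum_s = 0.0, sum_p = 0.0;
    double t0, ts, tp;

    /* Serial run: measures Ts */
    t0 = omp_get_wtime();
    for (long i = 0; i < N; i++) sum_s += (double)i * 0.5;
    ts = omp_get_wtime() - t0;

    /* Parallel run: measures Tp */
    t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum_p)
    for (long i = 0; i < N; i++) sum_p += (double)i * 0.5;
    tp = omp_get_wtime() - t0;

    int p = omp_get_max_threads();
    double S = ts / tp;                  /* S(p) = Ts / Tp        */
    double E = S / p * 100.0;            /* E(p) = S(p) / p       */
    printf("check: %g %g\n", sum_s, sum_p);   /* keep results live */
    printf("p = %d  S(p) = %.2f  E(p) = %.1f%%\n", p, S, E);
    return 0;
}

Compile with gcc -O2 -fopenmp and set the number of threads p via OMP_NUM_THREADS.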
Maximum Speedup
Difficult to reach ideal max. speedup: S(p)=p
Not every part of a computation can be parallelized
(results in idle processors; a worked bound follows below)
Extra computations are needed in the parallel version
(e.g. synchronization cost)
Communication time between processes
(normally the major factor)
[Figure: computation divided among p processors; execution time and speedup vs. number of cores]
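To make the first point concrete (an Amdahl's-law style bound, not spelled out on this slide): if a fraction f of the serial execution time Ts cannot be parallelized, then
S(p) = Ts / (f Ts + (1 - f) Ts / p) = 1 / (f + (1 - f)/p) <= 1/f
For example, f = 0.1 limits the speedup to at most 10, no matter how many processors are used.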
Weak Scaling
The problem size (workload) assigned to each processing
element stays fixed and additional processing elements
are used to solve a larger total problem
Weak scaling justifies programs that need a lot of memory or other system resources (e.g., when the problem wouldn't fit in RAM on a single node)
Linear scaling is achieved if the run time stays constant
while the workload is increased
[Figures: execution time vs. number of cores; speedup vs. number of cores with a linear-speedup reference]
Strong Scaling vs. Weak Scaling
Strong scaling
Linear scaling is harder to achieve, because the communication overhead may increase in proportion to the scale
Weak scaling
Linear scaling is easier to achieve because
programs typically employ nearest-neighbor
communication patterns where the
communication overhead is relatively constant
regardless of the number of processes used
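As a hedged worked example (the numbers are hypothetical, not from the lecture):
Strong scaling: a fixed problem that takes 100 s on 1 core and 20 s on 8 cores gives S(8) = 100/20 = 5 and a strong-scaling efficiency of 5/8 = 62.5%
Weak scaling: with the per-core workload fixed, the ideal run time stays at 100 s as cores are added; if it grows to 125 s on 8 cores, the weak-scaling efficiency is 100/125 = 80%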
Outline
Parallel Computing Introduction
Classifications of Parallel Computers & Programming Models
Supercomputer & Latest technologies
Parallel Program Analysis
Speedup & Efficiency
Strong scalability vs. Weak scalability
Time complexity & Cost optimality
Tcomm: Communication part
Tcomm = q (Tstartup + n Tdata)
Tstartup: Message latency / startup time (assumed constant)
Tdata: Transmission time to send one data item
n: Number of data items in a message
q: Number of messages
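A small worked example with hypothetical numbers: q = 4 messages, each carrying n = 1000 data items, with Tstartup = 10 us and Tdata = 0.01 us per item:
Tcomm = 4 × (10 us + 1000 × 0.01 us) = 4 × 20 us = 80 us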
Time Complexity Example 1
Algorithm steps:
1. Computer 1 sends n/2 numbers to computer 2
2. Both computers add n/2 numbers simultaneously
3. Computer 2 sends its partial result back to computer 1
4. Computer 1 adds the partial sums to produce the final result
Complexity analysis:
Computation (for steps 2 & 4):
Tcomp = n/2 + 1 = O(n)
Communication (for steps 1 & 3):
Tcomm = (Tstartup + n/2 x Tdata) + (Tstartup + Tdata)
= 2Tstartup + (n/2 + 1) Tdata = O(n)
Overall complexity: O(n)
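A minimal MPI sketch of these four steps (a hedged illustration, not code from the lecture; "computer 1" is rank 0, "computer 2" is rank 1; error handling omitted):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1000000   /* total number of values, assumed even */

int main(int argc, char **argv) {
    int rank;
    double partial = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double *x = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) x[i] = 1.0;     /* dummy data */
        /* Step 1: send n/2 numbers to computer 2 */
        MPI_Send(x + N/2, N/2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        /* Step 2: both add n/2 numbers simultaneously */
        for (int i = 0; i < N/2; i++) partial += x[i];
        /* Step 3: receive the partial result from computer 2 */
        double other;
        MPI_Recv(&other, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Step 4: add the partial sums to produce the final result */
        printf("sum = %f\n", partial + other);
        free(x);
    } else if (rank == 1) {
        double *half = malloc(N/2 * sizeof(double));
        MPI_Recv(half, N/2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < N/2; i++) partial += half[i];
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        free(half);
    }
    MPI_Finalize();
    return 0;
}

Run with two processes, e.g. mpirun -np 2 ./sum2.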
Time Complexity Example 2
Adding n numbers using m processes
Evenly partition numbers to processes
Process 0: x_0 … x_(n/m−1) → partial sum
Process 1: x_(n/m) … x_(2n/m−1) → partial sum
……
Process m−1: x_((m−1)n/m) … x_(n−1) → partial sum
The partial sums are then added together to produce the final sum
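A minimal MPI sketch of this scheme (a hedged illustration, not code from the lecture; assumes n is divisible by the number of processes m). MPI_Reduce is used to combine the partial sums; the same combination could be done with point-to-point sends as in Example 1:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, m;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &m);

    const int n = 1 << 20;              /* assumed divisible by m */
    int chunk = n / m;
    double *x = NULL;
    if (rank == 0) {
        x = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) x[i] = 1.0;   /* dummy data */
    }

    /* Evenly partition the n numbers among the m processes */
    double *local = malloc(chunk * sizeof(double));
    MPI_Scatter(x, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process computes its partial sum */
    double partial = 0.0;
    for (int i = 0; i < chunk; i++) partial += local[i];

    /* Combine the partial sums into the final sum on process 0 */
    double sum = 0.0;
    MPI_Reduce(&partial, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", sum);

    free(local);
    if (rank == 0) free(x);
    MPI_Finalize();
    return 0;
}

Run with, e.g., mpirun -np 4 ./summ.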