Lecture 1: Introduction to Parallel Computing (2025)
Why Parallelism?
Incredible progress in computer technology (Cont’d)
[Figure: Moore's Law — transistor counts growing exponentially over time]
Why wasn’t parallel processing required?
▪ Until around 2003, each processor generation made sequential programs faster automatically (higher clock frequencies, more instruction-level parallelism), so most software had no need for parallelism
Two driving forces of performance improvement until 2003, and their limitations
▪ Increasing clock frequency
▪ Exploiting more instruction-level parallelism (ILP)
Two driving forces of performance improvement until 2003, and their limitations (Cont’d)
▪ The “Power wall”
o Dynamic power consumption is proportional to frequency (and to the square of the supply voltage)
Power = Capacitive load × Voltage² × Frequency
o High power consumption ➔ high temperature
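To see why the power wall favors parallelism, here is a minimal sketch of the formula above. All constants are illustrative assumptions, not measured values, and it assumes the workload parallelizes across cores:

```cuda
#include <cstdio>

// Dynamic power model from the slide: P = C * V^2 * f.
// All numbers below are illustrative assumptions, not measurements.
static double dynamic_power(double c_load, double voltage, double freq_hz) {
    return c_load * voltage * voltage * freq_hz;
}

int main() {
    const double C = 1.0e-9;  // effective capacitive load (farads), assumed

    // One core at 4 GHz needs a higher supply voltage.
    double p_single = dynamic_power(C, 1.2, 4.0e9);

    // Two cores at 2 GHz each can run at a lower voltage,
    // yet deliver a comparable aggregate instruction rate.
    double p_dual = 2.0 * dynamic_power(C, 0.9, 2.0e9);

    printf("1 core  @ 4 GHz: %.2f W\n", p_single);  // ~5.76 W
    printf("2 cores @ 2 GHz: %.2f W\n", p_dual);    // ~3.24 W
    return 0;
}
```

With these assumed numbers, the dual-core configuration delivers the same aggregate clock cycles per second at little more than half the power, which is the argument that ended frequency scaling.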
Types of Parallelism (Cont’d)
Flynn’s Classification
▪ SISD : Single Instruction Single Data Stream
▪ SIMD : Single Instruction Multiple Data Stream
▪ MISD : Multiple Instruction Single Data Stream
▪ MIMD : Multiple Instruction Multiple Data Stream
SISD : Single Instruction Single Data Stream
▪ Executes a single instruction stream that operates on a single data stream
▪ Instruction-level Parallelism
o Pipelining
o Out-of-order Execution
o Superscalar Processor
o VLIW Processor
Out-of-Order Execution
▪ Problem of an in-order pipeline : a data dependency stalls the dispatch of younger instructions into functional (execution) units
[Figure: the same instruction sequence takes 16 cycles in-order but 12 cycles out-of-order]
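A small source-level sketch of that dependency (function and variable names are made up for illustration; the actual behavior depends on the compiler and microarchitecture):

```cuda
// Illustrative only: hardware details are simplified.
int ooo_example(const int *p, int c, int e, int f) {
    int a = *p;     // long-latency load (e.g., a cache miss)
    int b = a + c;  // depends on 'a': an in-order pipeline stalls here
    int d = e - f;  // independent: an out-of-order core can execute this
                    // while the load is still in flight
    return b + d;
}
```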
Superscalar Processor
▪ Two or more consecutive instructions in the original program order can execute in parallel
▪ N-way Superscalar
o Can issue up to N instructions per cycle
o 2-way, 3-way, …
1-way (one instruction issued per cycle, time →):
fetch decode ld
      fetch decode add
            fetch decode sub
                  fetch decode bne

2-way (two instructions issued per cycle, time →):
fetch decode ld
fetch decode add
      fetch decode sub
      fetch decode bne
SIMD : Single Instruction Multiple Data Stream
▪ Data-level Parallelism
o Vector Processing Unit
Vector Processing Unit
[Figure: a single vadd instruction adds all lanes of two vector registers in parallel]
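A minimal sketch of this data-level parallelism (the 8-lane width is an assumption). It is written as a scalar loop, which a vectorizing compiler can map onto a single vadd over the vector unit's lanes:

```cuda
// One conceptual vadd: 8 lanes computed by a single vector instruction
// on a SIMD unit. An auto-vectorizing compiler can emit one vector add
// for this loop.
void vadd8(const float a[8], const float b[8], float c[8]) {
    for (int i = 0; i < 8; ++i) {
        c[i] = a[i] + b[i];  // same operation, different data per lane
    }
}
```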
▪ SIMD vs SIMT
[Table: comparison of SIMD and SIMT]
o GPU → SIMT (Single Instruction Multiple Thread)
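A minimal CUDA sketch of SIMT (array sizes and names are illustrative): every thread executes the same kernel code on its own data element, and the threads of a warp advance through it in lockstep.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SIMT: one instruction stream, many threads, each on its own element.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index per thread
    if (i < n)
        c[i] = a[i] + b[i];  // a warp executes this add in lockstep
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory for brevity
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // 4096 blocks of 256 threads
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);  // 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```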
Parallel Processor
Amdahl’s Law (I)
Amdahl’s Law (II)
T_improved = T_affected / (improvement factor) + T_unaffected
▪ Example 1
o Old processor (single core): total execution time is split between A (30%) and B (70%)
o New processor (dual core): B is the affected fraction with a 2× speedup; A is unaffected
o T_improved = 0.7 / 2 + 0.3 = 0.65 ➔ overall speedup = 1 / 0.65 ≈ 1.54×
[Figure: execution timeline of the old vs. new processor, with B halved and A unchanged]
▪ If the parallel portion is 50%, the speedup is at most 2×, regardless of the number of processors: 1 / (0.5 + 0.5/N) → 2 as N → ∞
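A small sketch (the helper name amdahl_speedup is mine) that evaluates Amdahl's Law for the example above and shows the 2× ceiling when only half the program is parallel:

```cuda
#include <cstdio>

// Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
// where p is the parallelizable fraction and n the number of processors.
static double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    // Example 1 from the slide: 70% parallel portion, dual core.
    printf("p = 0.7, n = 2:    %.2fx\n", amdahl_speedup(0.7, 2));  // ~1.54x

    // 50% parallel portion: the ceiling is 2x no matter how many cores.
    for (int n = 2; n <= 1024; n *= 4)
        printf("p = 0.5, n = %4d: %.3fx\n", n, amdahl_speedup(0.5, n));
    return 0;
}
```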
Heterogeneous Parallel Computing (I)
▪ CPUs : Latency Oriented Design
o designed to minimize the execution latency of a single thread
• Large caches
Convert long-latency memory accesses to short-latency cache accesses
• Large control unit
Branch prediction for reduced branch latency
Data forwarding for reduced data latency
• Powerful ALUs
Reduced operation latency
o good for programs that have one or very few threads
[Figure: CPU block diagram — a few cores with a large control unit, powerful ALUs, and a large cache, connected to global memory]
Heterogeneous Parallel Computing (II)
▪ GPUs : Throughput Oriented Design
o Thread pool
• threads are set aside (pending) while their memory fetches are outstanding
• they resume execution once those fetches complete
o Small caches
• To boost memory throughput
o Simple control
• No branch prediction
• No data forwarding
o Energy-efficient ALUs
[Figure: GPU block diagram — many small cores with simple control and small caches, connected to global memory]
Heterogeneous Parallel Computing (III)
The free lunch is over. Now it’s up to the programmers: adding more processors doesn’t help much if programmers don’t know how to use them.
Next...
▪ GPU Architecture