Lecture 1: Introduction to Parallel Computing (2025)

The document discusses the evolution and necessity of parallel computing in response to the limitations of single-threaded CPU performance, which peaked around 2003 due to power constraints and diminishing returns from instruction-level parallelism. It outlines various types of parallelism, including instruction-level, data-level, and thread-level, as well as the implications of Amdahl's Law on speedup in parallel processing. Additionally, it highlights the differences between CPU and GPU architectures, emphasizing their respective strengths in handling sequential and parallel tasks.


Introduction to Parallel Computing

Why Parallelism?

Prof. Seokin Hong


Incredible progress in computer technology

Incredible progress in computer technology (Cont’d)

▪ Performance improvements are led by


o Technology Scaling
• Feature size reduction in CMOS transistor technology
Smaller transistors → More transistors
Faster transistors → More performance (higher clock rate)
Lower-power transistors → Lower power consumption
• Moore’s Law (1965): the number of transistors in a dense
integrated circuit doubles about every two years

Incredible progress in computer technology (Cont’d)

▪ Performance improvements are led by


o Improvements in computer architectures
• Enabled by
Advanced compilers → elimination of assembly language programming
Standardized and vendor-independent operating systems (e.g., Linux)
• These two changes lowered the cost of bringing out a new architecture

• Led to innovative CPU architectures

Why wasn’t parallel processing required?

▪ Single-threaded CPU performance doubling every 18 months


o As H/W performance increased, S/W performance improved automatically,
without any code changes
o Working to parallelize program code was often not worth the time

Two driving forces of performance improvement until 2003,
and their limitations

1. Exploiting instruction-level parallelism (ILP)


o Execute independent instructions simultaneously
2. Increasing clock frequency
o Technology scaling → faster transistors → higher clock frequency

▪ Single-processor performance improvement ended in 2003


o Cannot continue to leverage instruction-level parallelism (ILP)
o Cannot increase the clock frequency further due to power constraints

Two driving forces of performance improvement until 2003,
and their limitations
▪ The “Power wall”
o Power consumption is proportional to frequency
Power = Capacitive load × Voltage² × Frequency
o High power consumption ➔ high temperature

Image source: “Idontcare”, posted at http://forums.anandtech.com/showthread.php?t=2281195


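As a back-of-the-envelope illustration of the dynamic-power relation above, here is a minimal C sketch; the helper function name and the capacitance, voltage, and frequency values are purely hypothetical, not measurements of any real processor.

```c
/* A minimal sketch of the slide's dynamic-power relation:
 *   Power = Capacitive load * Voltage^2 * Frequency
 * All values below are hypothetical, chosen only for illustration. */
#include <stdio.h>

static double dynamic_power(double cap_load_f, double voltage_v, double freq_hz)
{
    return cap_load_f * voltage_v * voltage_v * freq_hz;
}

int main(void)
{
    /* Hypothetical baseline: 1 nF of switched capacitance, 1.0 V, 2 GHz */
    double base = dynamic_power(1e-9, 1.0, 2e9);

    /* Hypothetical "faster" chip: 50% higher clock, which in practice
     * also tends to require a higher supply voltage (here 1.2 V).      */
    double fast = dynamic_power(1e-9, 1.2, 3e9);

    printf("baseline:     %.2f W\n", base);                      /* 2.00 W        */
    printf("higher clock: %.2f W (%.2fx)\n", fast, fast / base); /* 4.32 W, 2.16x */
    return 0;
}
```

In this toy example, a 50% higher clock that also needs a higher supply voltage roughly doubles the dynamic power, which is why clock scaling ran into the power wall.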
Two driving forces of performance improvement until 2003,
and their limitations
▪ Diminishing gains with ILP
o Little performance benefit from building a processor that can issue more
instructions per cycle

[Figure: Culler & Singh (data from Johnson 1991)]


Two driving forces of performance improvement until 2003,
and their limitations

[Figure from “The Free Lunch Is Over” by Herb Sutter, Dr. Dobb’s Journal, 2005]


Why is parallel processing required?

▪ Parallel processing is the primary way to continue improving
processor performance

Intel's Big Shift After Hitting Technical Wall


………
Then two weeks ago, Intel, the world's largest chip maker,
publicly acknowledged that it had hit a "thermal wall" on
its microprocessor line. As a result, the company is
changing its product strategy and disbanding one of its
most advanced design groups. Intel also said that it would
abandon two advanced chip development projects, code-
named Tejas and Jayhawk.
Now, Intel is embarked on a course already adopted by
some of its major rivals: obtaining more computing
power by stamping multiple processors on a single chip
rather than straining to increase the speed of a single
processor.
...
John Markoff, New York Times, May 17, 2004
Types of Parallelism
▪ Instruction-Level Parallelism (ILP)
o Parallel execution of a sequence of instructions belonging to a specific
thread
o Superscalar, VLIW

▪ Data-Level Parallelism (DLP)


o Applying a single instruction to a collection of data in parallel
o SIMD instructions, GPU

▪ Thread-Level Parallelism (TLP)


o Running tasks (threads) at the same time
o Multi-core
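
As a concrete illustration of thread-level parallelism, here is a minimal C sketch using OpenMP (the choice of OpenMP is an assumption made for illustration; any threading API would do). It splits independent loop iterations across the cores of a multi-core CPU.

```c
/* A minimal sketch of thread-level parallelism with OpenMP.
 * Assumes an OpenMP-capable compiler (e.g. gcc -fopenmp). */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The iterations are independent, so OpenMP can distribute
     * them across the available cores (one or more threads per core). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %.1f (up to %d threads)\n", c[10], omp_get_max_threads());
    return 0;
}
```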

Types of Parallelism (Cont’d)

[Figure: Instruction-Level vs. Data-Level vs. Thread-Level Parallelism, shown as work distributed across cores over time]
Flynn’s Classification
▪ SISD : Single Instruction Single Data Stream
o Pipelining
o Out-of-order Execution
o Superscalar Processor
o VLIW Processor

▪ SIMD : Single Instruction Multiple Data Stream


o Vector Processing Unit (e.g., Intel AVX)
o GPU

▪ MISD : Multiple Instruction Single Data Stream


▪ MIMD : Multiple Instruction Multiple Data Stream
o Shared-memory multiprocessor (e.g., Multi-Core)
o Distributed-memory multiprocessors
Flynn’s Classification

[Figure: Flynn’s classification taxonomy]
SISD : Single Instruction Single Data Stream
▪ Executes a single instruction which operates on a
single data stream

▪ Instruction-level Parallelism
o Pipelining
o Out-of-order Execution
o Superscalar Processor
o VLIW Processor
Out-of-Order Execution
▪ Problem of an in-order pipeline: a data dependency stalls dispatch of
younger instructions into functional (execution) units

[Figure: the first ADD stalls the whole pipeline]


Out-of-Order Execution
▪ Idea of an out-of-order pipeline: move dependent instructions out of
the way of independent ones
o When all source values of an instruction are available, the instruction
can be executed
▪ Benefit: allows independent instructions to execute and complete in
the presence of a long-latency operation

[Figure: example instruction schedule; in-order takes 16 cycles, out-of-order takes 12 cycles]
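
To make the idea concrete, here is an illustrative C sketch (not the slide's own example; the arrays and values are made up) of the kind of code an out-of-order core benefits from: independent instructions can execute while a long-latency load is still in flight.

```c
/* An illustrative sketch (not the slide's example): code where an
 * out-of-order core can hide a long-latency load behind other work. */
#include <stdio.h>

int main(void)
{
    int a[16] = {0}, b[16] = {0};
    a[5] = 10;
    b[3] = 7;

    int x = a[5];       /* LOAD: may miss the cache (long latency)          */
    int y = x + 1;      /* depends on the load, so it must wait             */
    int z = b[3] * 2;   /* independent of the load: an out-of-order core    */
    int w = z + 3;      /*   can execute these while the load is in flight  */

    printf("%d %d %d %d\n", x, y, z, w);   /* prints: 10 11 14 17 */
    return 0;
}
```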
Superscalar Processor
▪ Two or more consecutive instructions in the original program
order can execute in parallel

▪ N-way Superscalar
o Can issue up to N instructions per cycle
o 2-way, 3-way, …

[Figure: a 2-way superscalar processor]


Superscalar vs. Pipelining

[Figure: pipeline timing for the instruction sequence ld, add, sub, bne; the 1-way pipeline issues one instruction per cycle, while the 2-way superscalar issues up to two per cycle and finishes the sequence sooner]
SIMD : Single Instruction Multiple Data Stream

▪ Executes a single instruction which operates on multiple
data streams

▪ Data-level Parallelism
o Vector Processing Unit
Vector Processing
Vector Processing Unit

▪ A processor can operate on an entire vector in one instruction


▪ Work done automatically in parallel
▪ The operands to the instructions are complete vectors instead of single
elements
▪ Reduces the fetch and decode bandwidth
▪ Important for multimedia applications and DNNs (Deep Neural Networks)
▪ Example
o Intel AVX
Vector Processing Unit

[Figure: a vadd instruction operating on entire vectors]

▪ Vector SIMD Units


o e.g., Intel AVX (Advanced Vector Extension)
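
A minimal sketch of data-level parallelism with AVX intrinsics in C; it assumes an AVX-capable x86 CPU and a compiler flag such as -mavx.

```c
/* A minimal sketch of data-level parallelism with AVX intrinsics.
 * Assumes an AVX-capable x86 CPU and a flag such as -mavx. */
#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* One vector add processes eight floats at once:
     * the operands are whole vectors, not single elements. */
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);             /* prints: 8 8 8 8 8 8 8 8 */
    printf("\n");
    return 0;
}
```

A single _mm256_add_ps call corresponds to one vector instruction that adds eight single-precision elements at once, which is exactly the fetch and decode saving described above.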
GPU

▪ A GPU contains multiple SIMD Units


GPU → SIMT (Single Instruction Multiple Thread)

▪ SIMD vs SIMT

GPU → SIMT (Single Instruction Multiple Thread)

▪ High-Level View of GPU

[Figure: high-level view of a GPU as an array of many cores]


Parallel Processor

▪ Intel Comet Lake Core i9


o 10 Cores (20 threads), 3.7 GHz
o GPU : UHD630, 1.2 GHz
o ILP + DLP + TLP

Parallel Processor

▪ Intel Xeon Phi 7290 coprocessor


▪ 72 cores, 1.7 GHz
▪ ILP + DLP + TLP

Parallel Processor

▪ NVIDIA Ampere A100


▪ 6912 cores, 1.4 GHz
▪ DLP + TLP

Parallel Processor

▪ 8 ARM Cores, Mali GPU, NPU


▪ ILP + DLP + TLP

Amdahl’s Law (I)

▪ Gene Amdahl, chief architect of IBM's first mainframe series, found
that there were some fairly stringent restrictions on how much of a
speedup one could get for a given parallelized task. These observations
were wrapped up in Amdahl's Law.

▪ Often used in parallel computing to predict the theoretical maximum
speedup using multiple processors
▪ The speedup of a program using multiple processors in parallel
computing is limited by the time needed for the sequential fraction
of the program

Amdahl’s Law (II)

T_improved = T_affected / improvement_factor + T_unaffected

▪ Example 1
[Figure: total execution time of a program on a single core, split into
part A (30%, sped up 2× on the new dual-core processor) and part B
(70%, the unaffected fraction)]

Execution time with new processor = 0.3 T / 2 + 0.7 T = 0.85 T

Speedup = T / (0.85 T) ≈ 1.176
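
A minimal C sketch of this calculation; the helper function name and the 50%/1000-processor case are illustrative additions, not taken from the slide.

```c
/* A minimal sketch of Amdahl's Law:
 *   T_improved = T_affected / improvement_factor + T_unaffected
 *   speedup    = T_old / T_improved        (T_old normalized to 1) */
#include <stdio.h>

static double amdahl_speedup(double affected_fraction, double improvement)
{
    double t_improved = affected_fraction / improvement
                        + (1.0 - affected_fraction);
    return 1.0 / t_improved;
}

int main(void)
{
    /* Example 1 from the slide: 30% of the time sped up 2x -> ~1.176x overall */
    printf("30%% affected, 2x:    %.3f\n", amdahl_speedup(0.3, 2.0));

    /* 50% parallel portion: even with a 1000x improvement the overall
     * speedup stays just below 2x.                                    */
    printf("50%% affected, 1000x: %.3f\n", amdahl_speedup(0.5, 1000.0));
    return 0;
}
```

The second result shows why the curves on the next slide flatten: with only 50% of the work parallelizable, even a huge improvement factor cannot push the overall speedup past 2×.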
Amdahl’s Law (III)

[Figure: speedup vs. number of processors; with a parallel portion of 50%,
the speedup is limited to 2×, regardless of the number of processors]
Heterogeneous Parallel Computing (I)
▪ CPUs : Latency-Oriented Design
o Designed to minimize the execution latency of a single thread
• Large caches
Convert long-latency memory accesses into short-latency cache accesses
• Large control unit
Branch prediction for reduced branch latency
Data forwarding for reduced data latency
• Powerful ALUs
Reduced operation latency
o Good for programs that have one or very few threads

[Figure: CPU block diagram with a few cores, each with large control logic and powerful ALUs, backed by a cache and global memory]

Heterogeneous Parallel Computing (II)
▪ GPUs : Throughput-Oriented Design
o Thread pool
• Threads are suspended while they wait on memory fetches
• They resume execution when those fetches complete
o Small caches
• To boost memory throughput
o Simple control
• No branch prediction
• No data forwarding
o Energy-efficient ALUs
• Many, long-latency but heavily pipelined for high throughput
o Requires a massive number of threads to tolerate latencies

[Figure: GPU block diagram with many simple cores backed by global memory]

Heterogeneous Parallel Computing (III)

▪ CPUs for sequential parts where latency matters


o CPUs can be 10+ times faster than GPUs for sequential code

▪ GPUs for parallel parts where throughput wins


o GPUs can be 10+ times faster than CPUs for parallel code

The free lunch is over... Now it's up to the programmers. Adding more
processors doesn't help much if programmers don't know how to use them.

Next...

▪ GPU Architecture

