Lecture 1: Introduction to Parallel Computing (2025)

The document discusses the evolution and necessity of parallel computing in response to the limitations of single-threaded CPU performance, which peaked around 2003 due to power constraints and diminishing returns from instruction-level parallelism. It outlines various types of parallelism, including instruction-level, data-level, and thread-level, as well as the implications of Amdahl's Law on speedup in parallel processing. Additionally, it highlights the differences between CPU and GPU architectures, emphasizing their respective strengths in handling sequential and parallel tasks.


Introduction to Parallel Computing

Why Parallelism?

Prof. Seokin Hong


Incredible progress in computer technology

Incredible progress in computer technology (Cont’d)

▪ Performance improvements are led by


o Technology Scaling
• Feature size reduction in CMOS transistor technology
Smaller transistors → More transistors
Faster transistors → More performance (higher clock rate)
Lower-power transistors → Lower power consumption
• Moore’s Law (1965): the number of transistors in a dense
integrated circuit doubles about every two years

Incredible progress in computer technology (Cont’d)

▪ Performance improvements are led by


o Improvements in computer architectures
• Enabled by
Advanced compilers → elimination of assembly language programming
Standardized and vendor-independent operating systems (e.g., Linux)
• These two changes lowered the cost of bringing out a new architecture

• Led to innovative CPU architectures

Why wasn’t parallel processing required?

▪ Single-threaded CPU performance doubling every 18 months


o As H/W performance increased, S/W performance improved automatically,
without any code changes
o Working to parallelize program code was often not worth the time

Two driving forces of performance improvement until 2003,
and their limitations

1. Exploiting instruction-level parallelism (ILP)


o Execute independent instructions simultaneously
2. Increasing clock frequency
o Technology scaling → faster transistors → higher clock frequency

▪ Single-processor performance improvement ended in 2003


o Cannot continue to leverage instruction-level parallelism (ILP)
o Cannot increase the clock frequency further due to power constraints

Two driving forces of performance improvement until 2003,
and their limitations
▪ The “Power wall”
o Power consumption is proportional to frequency
Power = Capacitive load × Voltage² × Frequency
o High power consumption ➔ high temperature

Image source: “Idontcare”, posted at http://forums.anandtech.com/showthread.php?t=2281195


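As a back-of-the-envelope illustration of the dynamic-power relation above, here is a minimal C sketch; the helper function name and the capacitance, voltage, and frequency values are purely hypothetical, not measurements of any real processor.

```c
/* A minimal sketch of the slide's dynamic-power relation:
 *   Power = Capacitive load * Voltage^2 * Frequency
 * All values below are hypothetical, chosen only for illustration. */
#include <stdio.h>

static double dynamic_power(double cap_load_f, double voltage_v, double freq_hz)
{
    return cap_load_f * voltage_v * voltage_v * freq_hz;
}

int main(void)
{
    /* Hypothetical baseline: 1 nF of switched capacitance, 1.0 V, 2 GHz */
    double base = dynamic_power(1e-9, 1.0, 2e9);

    /* Hypothetical "faster" chip: 50% higher clock, which in practice
     * also tends to require a higher supply voltage (here 1.2 V).      */
    double fast = dynamic_power(1e-9, 1.2, 3e9);

    printf("baseline:     %.2f W\n", base);                      /* 2.00 W        */
    printf("higher clock: %.2f W (%.2fx)\n", fast, fast / base); /* 4.32 W, 2.16x */
    return 0;
}
```

In this toy example, a 50% higher clock that also needs a higher supply voltage roughly doubles the dynamic power, which is why clock scaling ran into the power wall.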
Two driving forces of performance improvement until 2003,
and their limitations
▪ Diminishing gains with ILP
o Little performance benefit from building a processor that can issue more
instructions per cycle

[Figure: Culler & Singh (data from Johnson 1991)]


Two driving forces of performance improvement until 2003,
and their limitations

[Figure from “The Free Lunch Is Over” by Herb Sutter, Dr. Dobb’s Journal, 2005]


Why is parallel processing required?

▪ Parallel processing is the primary way to continue improving
processor performance

Intel's Big Shift After Hitting Technical Wall


………
Then two weeks ago, Intel, the world's largest chip maker,
publicly acknowledged that it had hit a "thermal wall" on
its microprocessor line. As a result, the company is
changing its product strategy and disbanding one of its
most advanced design groups. Intel also said that it would
abandon two advanced chip development projects, code-
named Tejas and Jayhawk.
Now, Intel is embarked on a course already adopted by
some of its major rivals: obtaining more computing
power by stamping multiple processors on a single chip
rather than straining to increase the speed of a single
processor.
...
John Markoff, New York Times, May 17, 2004
Types of Parallelism
▪ Instruction-Level Parallelism (ILP)
o Parallel execution of a sequence of instructions belonging to a specific
thread
o Superscalar, VLIW

▪ Data-Level Parallelism (DLP)


o Applying a single instruction to a collection of data in parallel
o SIMD instructions, GPU

▪ Thread-Level Parallelism (TLP)


o Running tasks (threads) at the same time
o Multi-core
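
As a concrete illustration of thread-level parallelism, here is a minimal C sketch using OpenMP (the choice of OpenMP is an assumption made for illustration; any threading API would do). It splits independent loop iterations across the cores of a multi-core CPU.

```c
/* A minimal sketch of thread-level parallelism with OpenMP.
 * Assumes an OpenMP-capable compiler (e.g. gcc -fopenmp). */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The iterations are independent, so OpenMP can distribute
     * them across the available cores (one or more threads per core). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %.1f (up to %d threads)\n", c[10], omp_get_max_threads());
    return 0;
}
```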

Types of Parallelism (Cont’d)

[Figure: Instruction-Level vs. Data-Level vs. Thread-Level Parallelism, shown as work distributed across cores over time]
Flynn’s Classification
▪ SISD : Single Instruction Single Data Stream
o Pipelining
o Out-of-order Execution
o Superscalar Processor
o VLIW Processor

▪ SIMD : Single Instruction Multiple Data Stream


o Vector Processing Unit (e.g., Intel AVX)
o GPU

▪ MISD : Multiple Instruction Single Data Stream


▪ MIMD : Multiple Instruction Multiple Data Stream
o Shared-memory multiprocessor (e.g., Multi-Core)
o Distributed-memory multiprocessors
Flynn’s Classification

[Figure: Flynn’s classification taxonomy]
SISD : Single Instruction Single Data Stream
▪ Executes a single instruction which operates on a
single data stream

▪ Instruction-level Parallelism
o Pipelining
o Out-of-order Execution
o Superscalar Processor
o VLIW Processor
Out-of-Order Execution
▪ Problem of an in-order pipeline: a data dependency stalls dispatch of
younger instructions into functional (execution) units

[Figure: the first ADD stalls the whole pipeline]


Out-of-Order Execution
▪ Idea of an out-of-order pipeline: move dependent instructions out of
the way of independent ones
o When all source values of an instruction are available, the instruction
can be executed
▪ Benefit: allows independent instructions to execute and complete in
the presence of a long-latency operation

[Figure: example instruction schedule; in-order takes 16 cycles, out-of-order takes 12 cycles]
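
To make the idea concrete, here is an illustrative C sketch (not the slide's own example; the arrays and values are made up) of the kind of code an out-of-order core benefits from: independent instructions can execute while a long-latency load is still in flight.

```c
/* An illustrative sketch (not the slide's example): code where an
 * out-of-order core can hide a long-latency load behind other work. */
#include <stdio.h>

int main(void)
{
    int a[16] = {0}, b[16] = {0};
    a[5] = 10;
    b[3] = 7;

    int x = a[5];       /* LOAD: may miss the cache (long latency)          */
    int y = x + 1;      /* depends on the load, so it must wait             */
    int z = b[3] * 2;   /* independent of the load: an out-of-order core    */
    int w = z + 3;      /*   can execute these while the load is in flight  */

    printf("%d %d %d %d\n", x, y, z, w);   /* prints: 10 11 14 17 */
    return 0;
}
```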
Superscalar Processor
▪ Two or more consecutive instructions in the original program
order can execute in parallel

▪ N-way Superscalar
o Can issue up to N instructions per cycle
o 2-way, 3-way, …

[Figure: a 2-way superscalar processor]


Superscalar vs. Pipelining

[Figure: pipeline timing for the instruction sequence ld, add, sub, bne; the 1-way pipeline issues one instruction per cycle, while the 2-way superscalar issues up to two per cycle and finishes the sequence sooner]
SIMD : Single Instruction Multiple Data Stream

▪ Executes a single instruction which operates on multiple
data streams

▪ Data-level Parallelism
o Vector Processing Unit
Vector Processing
Vector Processing Unit

▪ A processor can operate on an entire vector in one instruction


▪ Work done automatically in parallel
▪ The operands to the instructions are complete vectors instead of single
elements
▪ Reduces the fetch and decode bandwidth
▪ Important for multimedia applications and DNNs (Deep Neural Networks)
▪ Example
o Intel AVX
Vector Processing Unit

[Figure: a vadd instruction operating on entire vectors]

▪ Vector SIMD Units


o e.g., Intel AVX (Advanced Vector Extension)
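
A minimal sketch of data-level parallelism with AVX intrinsics in C; it assumes an AVX-capable x86 CPU and a compiler flag such as -mavx.

```c
/* A minimal sketch of data-level parallelism with AVX intrinsics.
 * Assumes an AVX-capable x86 CPU and a flag such as -mavx. */
#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* One vector add processes eight floats at once:
     * the operands are whole vectors, not single elements. */
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);             /* prints: 8 8 8 8 8 8 8 8 */
    printf("\n");
    return 0;
}
```

A single _mm256_add_ps call corresponds to one vector instruction that adds eight single-precision elements at once, which is exactly the fetch and decode saving described above.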
GPU

▪ A GPU contains multiple SIMD Units


GPU → SIMT (Single Instruction Multiple Thread)

▪ SIMD vs SIMT

GPU → SIMT (Single Instruction Multiple Thread)

▪ High-Level View of GPU

[Figure: high-level view of a GPU as an array of many cores]


Parallel Processor

▪ Intel Comet Lake Core i9


o 10 Cores (20 threads), 3.7 GHz
o GPU : UHD630, 1.2 GHz
o ILP + DLP + TLP

Parallel Processor

▪ Intel Xeon Phi 7290 coprocessor


▪ 72 cores, 1.7 GHz
▪ ILP + DLP + TLP

Parallel Processor

▪ NVIDIA Ampere A100


▪ 6912 cores, 1.4 GHz
▪ DLP + TLP

Parallel Processor

▪ 8 ARM Cores, Mali GPU, NPU


▪ ILP + DLP + TLP

Amdahl’s Law (I)

▪ Gene Amdahl, chief architect of IBM's first mainframe series, found
that there were some fairly stringent restrictions on how much of a
speedup one could get for a given parallelized task. These observations
were wrapped up in Amdahl's Law.

▪ Often used in parallel computing to predict the theoretical maximum
speedup using multiple processors
▪ The speedup of a program using multiple processors in parallel
computing is limited by the time needed for the sequential fraction
of the program

Amdahl’s Law (II)

T_improved = T_affected / improvement_factor + T_unaffected

▪ Example 1
[Figure: total execution time of a program on a single core, split into
part A (30%, sped up 2× on the new dual-core processor) and part B
(70%, the unaffected fraction)]

Execution time with new processor = 0.3 T / 2 + 0.7 T = 0.85 T

Speedup = T / (0.85 T) ≈ 1.176
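
A minimal C sketch of this calculation; the helper function name and the 50%/1000-processor case are illustrative additions, not taken from the slide.

```c
/* A minimal sketch of Amdahl's Law:
 *   T_improved = T_affected / improvement_factor + T_unaffected
 *   speedup    = T_old / T_improved        (T_old normalized to 1) */
#include <stdio.h>

static double amdahl_speedup(double affected_fraction, double improvement)
{
    double t_improved = affected_fraction / improvement
                        + (1.0 - affected_fraction);
    return 1.0 / t_improved;
}

int main(void)
{
    /* Example 1 from the slide: 30% of the time sped up 2x -> ~1.176x overall */
    printf("30%% affected, 2x:    %.3f\n", amdahl_speedup(0.3, 2.0));

    /* 50% parallel portion: even with a 1000x improvement the overall
     * speedup stays just below 2x.                                    */
    printf("50%% affected, 1000x: %.3f\n", amdahl_speedup(0.5, 1000.0));
    return 0;
}
```

The second result shows why the curves on the next slide flatten: with only 50% of the work parallelizable, even a huge improvement factor cannot push the overall speedup past 2×.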
Amdahl’s Law (III)

[Figure: speedup vs. number of processors; with a parallel portion of 50%,
the speedup is limited to 2×, regardless of the number of processors]
Heterogeneous Parallel Computing (I)
▪ CPUs : Latency-Oriented Design
o Designed to minimize the execution latency of a single thread
• Large caches
Convert long-latency memory accesses into short-latency cache accesses
• Large control unit
Branch prediction for reduced branch latency
Data forwarding for reduced data latency
• Powerful ALUs
Reduced operation latency
o Good for programs that have one or very few threads

[Figure: CPU block diagram with a few cores, each with large control logic and powerful ALUs, backed by a cache and global memory]

Heterogeneous Parallel Computing (II)
▪ GPUs : Throughput-Oriented Design
o Thread pool
• Threads are suspended while they wait on memory fetches
• They resume execution when those fetches complete
o Small caches
• To boost memory throughput
o Simple control
• No branch prediction
• No data forwarding
o Energy-efficient ALUs
• Many, long-latency but heavily pipelined for high throughput
o Requires a massive number of threads to tolerate latencies

[Figure: GPU block diagram with many simple cores backed by global memory]

Heterogeneous Parallel Computing (III)

▪ CPUs for sequential parts where latency matters


o CPUs can be 10+ times faster than GPUs for sequential code

▪ GPUs for parallel parts where throughput wins


o GPUs can be 10+ times faster than CPUs for parallel code

The free lunch is over... Now it's up to the programmers. Adding more
processors doesn't help much if programmers don't know how to use them.

Next...

▪ GPU Architecture

