Lecture 03: Architecture & Memory
CS3210 Parallel Computing
[Figure: the parallel computing stack — an application poses a problem; the problem exposes parallelism and is decomposed into tasks; tasks are computed by hardware, whose cores and processors provide physical parallelism.]
Computer architecture
Key concepts about how modern computers work
Concerns in parallel execution
Challenges of accessing memory
Understanding these architecture basics helps you:
Understand and optimize the performance of your parallel programs
Gain intuition about which workloads might benefit from fast parallel machines
Outline
Processor Architecture and Technology Trends
Various forms of parallelism
Flynn’s Parallel Architecture Taxonomy
Architecture of Multicore Processors
Memory Organization
Distributed-memory Systems
Shared-memory Systems
Hybrid (Distributed-Shared Memory) Systems
Concurrency vs. Parallelism
Concurrency:
Two or more tasks can start, run, and complete in overlapping time periods
They might not be running (executing on CPU) at the same instant
Two or more execution flows make progress at the same time, by interleaving their executions or by executing instructions (on CPU) at exactly the same time

Parallelism:
Two or more tasks can run (execute) simultaneously, at the exact same time
Tasks do not only make progress, they actually execute simultaneously
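A minimal sketch to make the distinction concrete (pthreads; the task bodies are placeholders, not from the slides). The two threads are concurrent by construction; whether they also run in parallel depends on whether the OS schedules them on different cores at the same instant.

```c
#include <pthread.h>
#include <stdio.h>

void *task(void *name) {
    // Each thread is an independent execution flow. On one core the OS
    // interleaves them (concurrency); on two or more cores they can
    // execute at exactly the same time (parallelism).
    printf("task %s making progress\n", (char *)name);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task, "A");
    pthread_create(&t2, NULL, task, "B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```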
Source of Processor Performance Gain
Parallelism in various forms is the main source of performance gain
Processor level: multiple processors, organized with shared memory or distributed memory
Bit Level Parallelism
Parallelism via increasing the processor word size
Word size may mean:
Unit of transfer between processor and memory
Memory address space capacity
Integer size
Single precision floating point number size
Instruction Level Parallelism: Pipelining
Split instruction execution into multiple stages, e.g.
Fetch (IF), Decode (ID), Execute (EX), Write-Back (WB)
Allows multiple instructions to occupy different stages in the same clock cycle
Provided there are no data / control dependencies
Number of pipeline stages == maximum achievable speedup
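A quick worked justification of that claim (idealized, ignoring hazards): with $k$ stages of equal length, $n$ instructions finish in $k + (n-1)$ cycles when pipelined, versus $nk$ cycles when not, so

$$\text{speedup} = \frac{nk}{k + (n-1)} \longrightarrow k \quad \text{as } n \to \infty$$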
Pipelined Execution: Illustration
[Figure: timing diagrams. Non-pipelined: each instruction completes IF, ID, EX, WB before the next begins. Pipelined: successive instructions of the program flow overlap, each one stage behind the previous.]
Disadvantages / complications: instructions must be independent; bubbles; hazards in data and control flow (e.g. read-after-write); speculation; out-of-order execution
The end of ILP
Instruction Level Parallelism: Superscalar
Duplicate the pipelines:
Allow multiple instructions to pass through the same stage
Scheduling is challenging (deciding which instructions can be executed together; see the sketch after this list):
Dynamic (Hardware decision)
Static (Compiler decision)
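A small illustration of what the scheduler (hardware or compiler) must detect, with hypothetical variables:

```c
// Sketch: which instructions can issue together on a 2-wide superscalar?
int schedule_example(int b, int c, int e, int f) {
    int a = b + c;  // independent of the next line:
    int d = e + f;  // both additions can issue in the same cycle
    return a + d;   // depends on a and d (read-after-write),
                    // so it must wait for both to complete
}
```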
Superscalar Execution: Illustration
[Figure: timing diagram of 2-wide superscalar execution — two instructions from the program flow enter IF, ID, EX, WB together in each clock cycle.]
Disadvantage: structural hazards (instructions competing for the same hardware resource)
Pipelined Processor:
Determines what instruction to run next
Execution unit: performs the operation described by an instruction
Registers: store values of variables used as inputs and outputs to operations

Superscalar Processor:
Automatically finds independent instructions in an instruction sequence and can execute them in parallel on multiple execution units
Instructions come from the same execution flow (thread)
Instruction Level Parallelism: SIMD
Superscalar: decodes and executes multiple (e.g. two) instructions per clock
Issue: typical programs do not have enough ILP to keep the extra units busy, even though their math operations form long sequences
SIMD: add ALUs to increase compute capability
Single instruction, multiple data: the same instruction is broadcast to and executed by all ALUs
Thread Level Parallelism: Motivation
Instruction-level parallelism is limited
For typical programs, only 2-3 instructions can be executed in parallel (whether pipelined or superscalar)
Due to data / control dependencies
Superscalar vs. SMT (2 hardware threads)

Superscalar:
Registers store values of variables used as inputs and outputs to operations, all from one execution flow

SMT (2 hardware threads):
Can run one scalar instruction per clock from either of the hardware threads
Appears as two logical cores
Thread Level Parallelism in the Processor
Processor can provide hardware support for multiple "thread contexts"
Known as simultaneous multithreading (SMT)
Information specific to each thread, e.g. Program Counter, Registers,
etc
Software threads can then execute in parallel
Many implementation approaches
Example:
Intel processors with hyper-threading technology, e.g. each i7 core
can execute 2 threads at the same time
Processor Level Parallelism (Multiprocessing)
Add more cores to the processor
The application should have multiple execution flows
Each process/thread has an independent context that can be
mapped to multiple processor cores
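A minimal sketch of such multiple execution flows (pthreads; the array size, thread count, and function names are illustrative, not from the slides): each thread carries its own context and sums a private slice of an array, so the OS can map the threads onto different cores.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];

void *sum_slice(void *arg) {
    long id = (long)arg;               // per-thread context
    long lo = id * (N / NTHREADS);
    long hi = lo + (N / NTHREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)     // independent work per thread,
        s += data[i];                  // mappable to separate cores
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;                 // placeholder input
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, sum_slice, (void *)i);
    double total = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("sum = %f\n", total);       // expect 1000000.0
    return 0;
}
```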
Flynn's Parallel Architecture Taxonomy
One commonly used taxonomy of parallel architecture:
Based on the parallelism of instructions and data streams in the most
constrained component of the processor
Proposed by M. Flynn in 1972(!)
Instruction stream:
A single execution flow
i.e. a single Program Counter (PC)
Data stream:
Data being manipulated by the instruction stream
Single Instruction Single Data (SISD)
A single instruction stream is executed
Each instruction works on a single data item
Most uniprocessors fall into this category
[Figure: a single instruction stream driving one Processing Unit over a single data stream.]
Single Instruction Multiple Data (SIMD)
A single stream of instructions
For example: one program counter
Each instruction works on multiple data
Popular model for supercomputers during the 1980s:
Exploits data parallelism; such machines are commonly known as vector processors
Modern processors have some form of SIMD:
E.g. the SSE, AVX instructions in Intel x86 processors
[Figure: one instruction shared by many PUs, with multiple data elements being processed in lockstep.]
SIMD nowadays
Data parallel architectures
AVX instructions
GPGPUs
Same instruction broadcast to all ALUs
AVX: intrinsic functions operate on vectors of four 64-bit values (e.g., a vector of 4 doubles)
Not great for divergent execution
Original program
Processes one array element using scalar instructions on
scalar registers (e.g., 32-bit floats)
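The slide's own code is not reproduced here; a minimal sketch of the kind of scalar loop it describes (function and array names are illustrative):

```c
// Scalar version: one 32-bit float is processed per loop iteration,
// using ordinary scalar instructions and scalar registers.
void scale_scalar(int N, const float *x, float *y) {
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i];   // one scalar multiply per element
}
```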
Vector program (using AVX intrinsics)
Intrinsic functions operate on vectors of
eight 32-bit values (e.g., vector of floats)
Processes eight array elements
simultaneously using vector instructions on
256-bit vector registers
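A sketch of the corresponding AVX loop (again not the slide's own code; this assumes N is a multiple of 8 and uses intrinsics from <immintrin.h>):

```c
#include <immintrin.h>

// AVX version: eight 32-bit floats are processed per iteration,
// using 256-bit vector registers.
void scale_avx(int N, const float *x, float *y) {
    __m256 two = _mm256_set1_ps(2.0f);       // broadcast 2.0 to all 8 lanes
    for (int i = 0; i < N; i += 8) {         // assumes N % 8 == 0
        __m256 v = _mm256_loadu_ps(&x[i]);   // load 8 floats at once
        v = _mm256_mul_ps(v, two);           // 8 multiplies in one instruction
        _mm256_storeu_ps(&y[i], v);          // store 8 results
    }
}
```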
Multiple Instruction Single Data (MISD)
Multiple instruction streams
All instruction streams work on the same data at any time
No actual implementation except for the systolic array
Multiple Instruction Multiple Data (MIMD)
Each PU fetches its own instructions
Each PU operates on its own data
Currently the most popular model for multiprocessors
Variant – SIMD + MIMD
Stream processors (NVIDIA GPUs)
A set of threads executing the same code (effectively SIMD)
Multiple sets of threads executing in parallel (effectively MIMD at this level)
MULTICORE ARCHITECTURE
Architecture of Multicore Processors
Hierarchical design
Pipelined design
Network-based design
Hierarchical Design
Multiple cores share multiple caches
Usages:
Standard desktop processors
Server processors
Graphics processing units
[Figure: multiple cores sharing a hierarchy of L1 and L2 caches.]
Hierarchical Design - Examples
Each core is a sophisticated, out-of-order processor designed to maximize ILP
Pipelined Design
Data elements are processed by multiple execution cores in a
pipelined way
Example: Pipelined Design
[Figure: an example pipelined multicore, where data elements flow through a chain of cores, each performing one processing step.]
Future Trends
Efficient on-chip interconnection
Enough bandwidth for data transfers between the cores
Scalable
Robust to tolerate failures
Efficient energy management
Reduce memory access time
Keyword: Network on Chip (NoC)
MEMORY ORGANIZATION
Parallel Computer Component
Typical uniprocessor components:
Processor with a core
One or more levels of caches
Memory module
Others (e.g. I/O)
[Figure: a uniprocessor, showing a core and its cache inside the processor, attached to a memory module.]
Processors in a parallel computer system are also commonly known as processing elements
Recap: why do modern processors have cache?
Processors run efficiently when data is resident in caches
Caches reduce memory access latency
Caches also provide high-bandwidth data transfer to the CPU
Recap: memory latency and bandwidth
Memory latency: the amount of time for a memory request
(e.g., load, store) from a processor to be serviced by the
memory system
Example: 100 cycles, 100 nsec
Memory bandwidth: the rate at which the memory system can
provide data to a processor
Example: 20 GB/s
Processor “stalls” when it cannot run the next instruction in an
instruction stream because of a dependency on a previous
instruction
Execution on a Processor (one add per clock)
[Figure: execution timeline in which the processor completes one add per clock.]
In Modern Computing, Bandwidth is the Critical Resource
Performant parallel programs should:
Organize computation to fetch data from memory less often
Reuse data previously loaded by the same thread (temporal locality
optimizations)
Share data across threads (inter-thread cooperation)
Favor performing additional arithmetic over storing and reloading values (the math is “free” relative to the memory traffic)
Main point: programs must access memory infrequently to utilize
modern processors efficiently
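A back-of-envelope illustration (the numbers are illustrative, not from the lecture): an element-wise multiply c[i] = a[i] * b[i] on 4-byte floats moves 12 bytes per math operation (two loads and one store). With 20 GB/s of memory bandwidth, throughput is capped at

$$\frac{20\ \text{GB/s}}{12\ \text{bytes/op}} \approx 1.7 \times 10^{9}\ \text{ops/s}$$

which is far below what even a single modern core can compute, so the loop is bandwidth-bound unless data is reused from cache.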
Memory Organization of Parallel Computers
[Figure: classification tree, dividing parallel computers into distributed-memory, shared-memory, and hybrid (distributed-shared memory) systems.]
Distributed-Memory Systems
[Figure: several nodes, each with its own processor, cache, local memory, and I/O, connected by an interconnection network.]
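Because each node can address only its own memory, data moves between nodes by explicit message passing. A minimal sketch using MPI (not from the slides; rank 0 sends one value to rank 1):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 0.0;
    if (rank == 0) {
        value = 3.14;   // exists only in rank 0's local memory
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // explicit receive: the only way rank 1 can observe rank 0's data
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }
    MPI_Finalize();
    return 0;
}
```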
Cache Coherence
Multiple copies of the same data can exist in different caches
Local update by one processing unit: other PUs should not keep seeing the unchanged (stale) data
[Figure: three PUs with private caches above a shared memory holding u = 5.
1. PU1 reads u (5) into its cache.
2. PU3 reads u (5) into its cache.
3. PU3 writes u = 7 in its own cache.
4. PU1 reads u again and still sees the stale value 5.
5. PU2 reads u and may see 5 or 7, depending on when memory is updated.]
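To see why this matters to software, consider one thread publishing a flag that another thread polls; hardware cache coherence is what lets the polling thread observe the write to its cached copy (a minimal sketch with C11 atomics, which also keep the compiler from hoisting the flag into a register):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int ready = 0;
int payload = 0;

void *producer(void *arg) {
    payload = 42;                        // write the data...
    atomic_store(&ready, 1);             // ...then publish the flag; the
    return NULL;                         // coherence protocol propagates
}                                        // the update to other caches

void *consumer(void *arg) {
    while (atomic_load(&ready) == 0)     // spins on a (cached) copy of the
        ;                                // flag; coherence invalidates it
    printf("payload = %d\n", payload);   // when the producer writes
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```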
Further Classification – Shared Memory
Two factors can further differentiate shared memory systems:
Processor-to-memory delay (UMA / NUMA): whether the delay to memory is uniform
Cache coherence support (e.g. ccNUMA, below)
Uniform Memory Access (Time) (UMA)
[Figure: UMA organization for a multiprocessor, in which all PUs access one shared memory with uniform latency.]
Example: Multicore NUMA
[Figure: multicore NUMA, where a multicore processor P contains cores 1, 2, ..., n sharing its locally attached memory; memory attached to other processors is reached with longer delay.]
ccNUMA
[Figure: ccNUMA, with several processors each holding local cache and memory, kept cache-coherent across the interconnect.]
COMA
[Figure: COMA, with several processors whose local memories act only as caches, connected by an interconnection network.]
Summary: Shared Memory Systems
Advantages:
No need to partition code or data
No need to physically move data among processors; communication is efficient
Disadvantages:
Special synchronization constructs are required
Lack of scalability due to contention
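On the synchronization point: concurrent updates to shared data need constructs such as locks. A minimal sketch (pthreads; the counter and loop bound are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;                          // shared data: no partitioning needed
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);         // synchronization construct guards
        counter++;                         // the update; it is also a point of
        pthread_mutex_unlock(&lock);       // contention, which limits scalability
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);    // expect 200000
    return 0;
}
```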
Summary
Goal of parallel architecture is to reduce the average time to
execute an instruction
Reading
CS149 Parallel Computing class at Stanford, by Kayvon Fatahalian and Kunle Olukotun
https://www.brendangregg.com/blog/2023-03-01/computer-performance-future-2022.html