L03 Architecture Memory

The document discusses parallel computing architectures, focusing on key concepts such as concurrency, parallelism, and various forms of processor performance gains. It outlines the importance of understanding processor architecture, memory organization, and different types of parallelism, including bit, instruction, thread, and processor level parallelism. Additionally, it covers Flynn's taxonomy of parallel architectures and the design of multicore processors, emphasizing the significance of efficient memory access and data management in modern computing.


Parallel Computing Architectures

Lecture 03
Parallel Computing
[Figure: an application problem exposes application parallelism and is decomposed into tasks; the tasks are computed on the hardware (cores & processors), which provides the physical parallelism]

[ CS3210 - AY2425S1 - L3 ]
2
Computer architecture
 Key concepts about how modern computers work
  Concerns with parallel execution
  Challenges of accessing memory
 Understanding these architecture basics helps you:
  Understand and optimize the performance of your parallel programs
  Gain intuition about what workloads might benefit from fast parallel machines

[ CS3210 - AY2425S1 - L3 ]
3
Outline
 Processor Architecture and Technology Trends
 Various forms of parallelism
 Flynn’s Parallel Architecture Taxonomy
 Architecture of Multicore Processors
 Memory Organization
 Distributed-memory Systems
 Shared-memory Systems
 Hybrid (Distributed-Shared Memory) Systems

[ CS3210 - AY2425S1 - L3 ]
4
Concurrency vs. Parallelism

Concurrency
 Two or more tasks can start, run, and complete in overlapping time periods
 They might not be running (executing on CPU) at the same instant
 Two or more execution flows make progress at the same time by interleaving their executions or by executing instructions (on CPU) at exactly the same time

Parallelism
 Two or more tasks can run (execute) simultaneously, at the exact same time
 Tasks do not only make progress, they actually execute simultaneously
[ CS3210 - AY2425S1 - L3 ]
5
Source of Processor Performance Gain
 Parallelism of various forms is the main source of performance gain

 Let us understand parallelism at the:
  Bit Level (single processor)
  Instruction Level (single processor)
  Thread Level (single processor)
  Processor Level (multiple processors):
   Shared Memory
   Distributed Memory
[ CS3210 - AY2425S1 - L3 ]
8
Bit Level Parallelism
 Parallelism via increasing the processor word size (see the sketch below)
 Word size may mean:
  Unit of transfer between processor ↔ memory
  Memory address space capacity
  Integer size
  Single-precision floating point number size

 Word size trend:
  Varied in the 50s to 70s
  For x86 processors:
   16 bits (8086 – 1978)
   32 bits (80386 – 1985)
   64 bits (Pentium 4 / Opteron – 2003)
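
A small illustration (my own sketch, not from the slides) of what word size buys: on a machine whose word is only 32 bits, a 64-bit addition must be emulated with two adds plus carry handling, while a 64-bit word does the same work in one instruction.

    #include <stdint.h>

    /* Hypothetical sketch: 64-bit addition emulated with 32-bit words. */
    void add64_using_32bit_words(uint32_t a_lo, uint32_t a_hi,
                                 uint32_t b_lo, uint32_t b_hi,
                                 uint32_t *r_lo, uint32_t *r_hi) {
        uint32_t lo = a_lo + b_lo;        /* add the low halves           */
        uint32_t carry = (lo < a_lo);     /* detect carry out of low half */
        *r_lo = lo;
        *r_hi = a_hi + b_hi + carry;      /* add high halves plus carry   */
    }

    /* With a 64-bit word size the same work is a single add instruction. */
    uint64_t add64_native(uint64_t a, uint64_t b) { return a + b; }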
[ CS3210 - AY2425S1 - L3 ]
9
Instruction Level Parallelism
 Execute instructions in parallel:
 Pipelining (parallelism across time)
 Superscalar (parallelism across space)

 Pipelining:
 Split instruction execution in multiple stages, e.g.
 Fetch (IF), Decode (ID), Execute (EX), Write-Back (WB)
 Allow multiple instructions to occupy different stages in the same
clock cycle
 Provided there are no data or control dependencies
 Number of pipeline stages == maximum achievable speedup
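
A short worked example (my own numbers, assuming an ideal pipeline with no stalls): with k stages and n instructions,

    S(n) = \frac{n \cdot k}{k + (n - 1)} \longrightarrow k \quad \text{as } n \to \infty

e.g. for k = 4 stages and n = 1000 instructions, S = 4000 / 1003 ≈ 3.99, approaching but never exceeding the 4x bound.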
[ CS3210 - AY2425S1 - L3 ]
10
Pipelined Execution: Illustration
[Figure: timeline of the program flow (instructions) through the IF, ID, EX, WB stages, non-pipelined vs. pipelined]
 Non-pipelined: each instruction completes all stages before the next one starts
 Pipelined: successive instructions overlap, each occupying a different stage in the same cycle
 Disadvantages / related issues:
  Instructions must be independent
  Bubbles
  Hazards: data and control flow
  Speculation
  Out-of-order execution
  Read-after-write dependencies

[ CS3210 - AY2425S1 - L3 ]
11
The end of ILP

 Processor clock rate stops increasing
 No further benefit from ILP

[ CS3210 - AY2425S1 - L3 ]
12
Instruction Level Parallelism: Superscalar
 Duplicate the pipelines:
 Allow multiple instructions to pass through the same stage
 Scheduling is challenging (decide which instructions can be
executed together):
 Dynamic (Hardware decision)
 Static (Compiler decision)

 Most modern processors are superscalar
  e.g. each Intel i7 core has 14 pipeline stages and can execute 6 micro-ops in the same cycle (a small sketch follows)
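
A hedged illustration (my own example, not from the slides): whether a superscalar core can actually use its extra issue slots depends on instruction independence.

    /* Independent operations: a 2-wide superscalar core may issue the two
     * multiplies in the same cycle; only the final add must wait. */
    float independent(float a, float b, float c, float d) {
        float x = a * b;
        float y = c * d;
        return x + y;
    }

    /* Dependency chain: each multiply needs the previous result, so the
     * operations issue one after another regardless of issue width. */
    float dependent(float a, float b, float c) {
        float x = a * b;
        x = x * c;
        return x;
    }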

[ CS3210 - AY2425S1 - L3 ]
13
Superscalar Execution: Illustration
Superscalar (2-wide)
[Figure: two instructions from the program flow enter the IF, ID, EX, WB stages together in each cycle]
 Disadvantages: structural hazards
 Metric of interest shifts from Cycles-per-instruction (CPI) → Instructions-per-cycle (IPC)

[ CS3210 - AY2425S1 - L3 ]
14
Pipelined Processor
 Determines what instruction to run next
 Execution unit: performs the operation described by an instruction
 Registers: store values of variables used as inputs and outputs to operations

Superscalar Processor
 Processor automatically finds independent instructions in an instruction sequence and can execute them in parallel on execution units
 Instructions come from the same execution flow (thread)

[ CS3210 - AY2425S1 - L3 ]
15
Instruction Level Parallelism: SIMD
 Superscalar: decode and execute multiple (e.g. two) instructions per clock
 Issue: not enough ILP, but math operations are long
 Add ALUs to increase compute capability
 Single instruction, multiple data (SIMD): the same instruction is broadcast to and executed by all ALUs

[ CS3210 - AY2425S1 - L3 ]
16
Thread Level Parallelism: Motivation
 Instruction-level parallelism is limited
  For typical programs, only 2-3 instructions can be executed in parallel (whether pipelined or superscalar)
  Due to data/control dependencies

 Multithreading was originally a software mechanism
  Allows multiple parts of the same program to execute concurrently
 Key idea:
  The processor can execute the threads in parallel

[ CS3210 - AY2425S1 - L3 ]
17
Superscalar vs. SMT (2 hardware threads)
Superscalar
 Registers: store values of variables used as inputs and outputs to operations from one execution flow
SMT – 2 hardware threads
 Can run one scalar instruction per clock from one of the hardware threads
 Two logical cores

[ CS3210 - AY2425S1 - L3 ]
18
Thread Level Parallelism in the Processor
 Processor can provide hardware support for multiple "thread contexts"
  Known as simultaneous multithreading (SMT)
  Information specific to each thread, e.g. Program Counter, registers, etc.
  Software threads can then execute in parallel
  Many implementation approaches

 Example:
  Intel processors with hyper-threading technology, e.g. each i7 core can execute 2 threads at the same time
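
As a small aside (my own example, not from the slides): on Linux, the number of logical processors, which with 2-way hyper-threading is typically twice the physical core count, can be queried like this:

    #include <stdio.h>
    #include <unistd.h>   /* sysconf */

    int main(void) {
        /* Logical processors currently online; with 2-way SMT this is
         * usually 2x the number of physical cores. */
        long logical = sysconf(_SC_NPROCESSORS_ONLN);
        printf("logical processors online: %ld\n", logical);
        return 0;
    }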
[ CS3210 - AY2425S1 - L3 ]
19
Processor Level Parallelism (Multiprocessing)
 Add more cores to the processor
 The application should have multiple execution flows
 Each process/thread has an independent context that can be mapped to multiple processor cores
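
A minimal sketch (my own example, not from the slides) of an application with multiple execution flows that the OS can map onto different cores:

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread is an independent execution flow with its own context. */
    static void *worker(void *arg) {
        long id = (long)arg;
        printf("execution flow %ld running\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }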

[ CS3210 - AY2425S1 - L3 ]
20
Flynn's Parallel Architecture Taxonomy
 One commonly used taxonomy of parallel architecture:
 Based on the parallelism of instructions and data streams in the most
constrained component of the processor
 Proposed by M. Flynn in 1972(!)

 Instruction stream:
 A single execution flow
 i.e. a single Program Counter (PC)
 Data stream:
 Data being manipulated by the instruction stream

[ CS3210 - AY2425S1 - L3 ]
23
Single Instruction Single Data (SISD)
 A single instruction stream is executed
 Each instruction works on a single data item
 Most uniprocessors fall into this category

[Figure: a single processing unit operating on one instruction stream and one data stream]

[ CS3210 - AY2425S1 - L3 ]
24
Single Instruction Multiple Data (SIMD)
 A single stream of instructions
 For example: one program counter
 Each instruction works on multiple data
 Popular model for supercomputers during the 1980s:
  Exploit data parallelism; commonly known as vector processors
 Modern processors have some form of SIMD:
  E.g. the SSE, AVX instructions in Intel x86 processors

[Figure: one instruction shared by many PUs, with multiple data items being processed]

[ CS3210 - AY2425S1 - L3 ]
25
SIMD nowadays
 Data parallel architectures
 AVX instructions
 GPGPUs
 Same instruction broadcast to all ALUs
 AVX: intrinsic functions operate on vectors of four 64-bit values (e.g., a vector of 4 doubles)
 Not great for divergent execution

[ CS3210 - AY2425S1 - L3 ]
26
Original program
 Processes one array element at a time using scalar instructions on scalar registers (e.g., 32-bit floats)
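
The slide's actual code is not reproduced here; a minimal scalar sketch of the kind of loop it describes (my own example) could look like:

    /* Scalar version: one 32-bit float element processed per iteration,
     * using ordinary scalar instructions and registers. */
    void scale_add_scalar(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i++)
            out[i] = a[i] * 2.0f + b[i];
    }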

[ CS3210 - AY2425S1 - L3 ]
27
Vector program (using AVX intrinsics)
 Intrinsic functions operate on vectors of eight 32-bit values (e.g., a vector of 8 floats)
 Processes eight array elements simultaneously using vector instructions on 256-bit vector registers
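
Again the slide's code is not reproduced; an AVX sketch equivalent to the scalar loop above (my own example, assuming n is a multiple of 8) could be:

    #include <immintrin.h>   /* AVX intrinsics */

    /* Vector version: eight 32-bit floats processed per iteration using
     * 256-bit vector registers. */
    void scale_add_avx(const float *a, const float *b, float *out, int n) {
        __m256 two = _mm256_set1_ps(2.0f);         /* broadcast 2.0 to 8 lanes */
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);    /* load 8 floats            */
            __m256 vb = _mm256_loadu_ps(b + i);
            __m256 vr = _mm256_add_ps(_mm256_mul_ps(va, two), vb);
            _mm256_storeu_ps(out + i, vr);         /* store 8 results          */
        }
    }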

[ CS3210 - AY2425S1 - L3 ]
28
Multiple Instruction Single Data (MISD)
 Multiple instruction streams
 All instructions work on the same data at any time
 No actual implementation except for systolic arrays

[ CS3210 - AY2425S1 - L3 ]
29
Multiple Instruction Multiple Data (MIMD)
 Each PU fetches its own instructions
 Each PU operates on its own data
 Currently the most popular model for multiprocessors

[ CS3210 - AY2425S1 - L3 ]
30
Variant – SIMD + MIMD
 Stream processor (nVidia GPUs)
 A set of threads executing the same code (effectively SIMD)
 Multiple sets of threads executing in parallel (effectively MIMD at this level)

[ CS3210 - AY2425S1 - L3 ]
31
MULTICORE ARCHITECTURE

[ CS3210 - AY2425S1 - L3 ]
32
Architecture of Multicore Processors
 Hierarchical design
 Pipelined design
 Network-based design

[ CS3210 - AY2425S1 - L3 ]
33
Hierarchical Design
 Multiple cores share multiple caches
 Cache size increases from the leaves to the root
 Each core can have a separate L1 cache and shares the L2 cache with other cores
 All cores share the common external memory
 Usages:
  Standard desktop
  Server processors
  Graphics processing units

[Figure: cache tree with per-core L1 caches at the leaves, shared L2 above, and L3 at the root]
[ CS3210 - AY2425S1 - L3 ]
34
Hierarchical Design - Examples
Each core is a sophisticated, out-of-order processor to maximize ILP

Examples: Quad-Core AMD Opteron, Intel Quad-Core Xeon

[ CS3210 - AY2425S1 - L3 ]
35
Pipelined Design
 Data elements are processed by multiple execution cores in a pipelined way
 Useful if the same computation steps have to be applied to a long sequence of data elements
  E.g. processors used in routers and graphics processors

[ CS3210 - AY2425S1 - L3 ]
36
Example: Pipelined Design

Xelerator X11 network processor


[ CS3210 - AY2425S1 - L3 ]
37
Network-Based Design
 Cores and their local caches and memories are connected via
an interconnection network
Example: Sun Niagara 2 (UltraSPARC T2)

[ CS3210 - AY2425S1 - L3 ]
38
Future Trends
 Efficient on-chip interconnection
 Enough bandwidth for data transfers between the cores
 Scalable
 Robust to tolerate failures
 Efficient energy management
 Reduce memory access time
 Key word: Network on Chip (NoC)

[ CS3210 - AY2425S1 - L3 ]
39
MEMORY ORGANIZATION

[ CS3210 - AY2425S1 - L3 ]
40
Parallel Computer Component
 Typical uniprocessor components:
  Processor
  One or more levels of caches
  Memory module
  Other (e.g. I/O)

[Figure: uniprocessor stack of core, cache, and memory]

 These components are similarly present in a parallel computer setup
 Processors in a parallel computer system are also commonly known as processing elements

[ CS3210 - AY2425S1 - L3 ]
43
Recap: why do modern processors have cache?
 Processors run efficiently when data is resident in caches
 Caches reduce memory access latency
 Caches also provide high-bandwidth data transfer to the CPU

[ CS3210 - AY2425S1 - L3 ]
44
Recap: memory latency and bandwidth
 Memory latency: the amount of time for a memory request
(e.g., load, store) from a processor to be serviced by the
memory system
 Example: 100 cycles, 100 nsec
 Memory bandwidth: the rate at which the memory system can
provide data to a processor
 Example: 20 GB/s
 Processor “stalls” when it cannot run the next instruction in an
instruction stream because of a dependency on a previous
instruction
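
A rough back-of-the-envelope calculation (my own assumed numbers, not from the slides): a core issuing one 8-byte load per cycle at 1 GHz demands

    1 \times 10^{9}\ \tfrac{\text{loads}}{\text{s}} \times 8\ \text{B} = 8\ \text{GB/s per core}

so eight such cores would need 64 GB/s, well beyond the 20 GB/s example above, and most cycles end up as stalls waiting on memory.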

[ CS3210 - AY2425S1 - L3 ]
45
Execution on a Processor (one add per clock)

 The memory system is occupied 100% of the time!

[Figure: execution timeline showing back-to-back memory requests over time]
[ CS3210 - AY2425S1 - L3 ]
46
In Modern Computing, Bandwidth is the Critical Resource
 Performant parallel programs should:
 Organize computation to fetch data from memory less often
 Reuse data previously loaded by the same thread (temporal locality
optimizations)
 Share data across threads (inter-thread cooperation)
 Favor performing additional arithmetic to storing/reloading values
(the math is “free”)
 Main point: programs must access memory infrequently to utilize
modern processors efficiently
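
A minimal sketch (my own example, not from the slides) of trading memory traffic for arithmetic by fusing two passes over the same array:

    /* Two passes: every element of x is loaded and stored twice. */
    void two_passes(float *x, int n) {
        for (int i = 0; i < n; i++) x[i] = x[i] * 2.0f;
        for (int i = 0; i < n; i++) x[i] = x[i] + 1.0f;
    }

    /* Fused pass: each element is loaded and stored once, roughly halving
     * the memory traffic while doing the same arithmetic. */
    void fused(float *x, int n) {
        for (int i = 0; i < n; i++) x[i] = x[i] * 2.0f + 1.0f;
    }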

[ CS3210 - AY2425S1 - L3 ]
47
Memory Organization of Parallel Computers

Parallel Computers
 Distributed-Memory Multicomputers
 Shared-Memory Multiprocessors
  Uniform Memory Access (UMA)
  Non-Uniform Memory Access (NUMA)
  Cache-Only Memory Access (COMA)
 Hybrid (Distributed-Shared Memory)

[ CS3210 - AY2425S1 - L3 ]
48
Distributed-Memory Systems
[Figure: nodes, each with a processor, cache, and memory, connected by an (interconnection) network]

 Each node is an independent unit:
  With processor, memory and, sometimes, peripheral elements
 Physically distributed memory modules:
  Memory in a node is private
[ CS3210 - AY2425S1 - L3 ]
49
Shared Memory System
[Figure: PU1, PU2, PU3 access memory and I/O through a shared memory provider]

 Parallel programs / threads access memory through the shared memory provider
 Program is unaware of the actual hardware memory architecture
 Cache coherence and memory consistency

[ CS3210 - AY2425S1 - L3 ]
50
Cache Coherence
 Multiple copies of the same data exist on different caches
 Local update by a processing unit → other PUs should not see the unchanged (stale) data

[Figure: u = 5 in memory is read into the caches of PU1 and PU3 (steps 1, 2); PU3 then writes u = 7 (step 3); when PU1 and PU2 subsequently read u (steps 4, 5), they must not see the stale value 5]
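
A small sketch of the same scenario in code (my own example, not from the slides; the atomic type keeps the C program well-defined, while propagating the update between caches is the hardware coherence protocol's job):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    _Atomic int u = 5;                 /* shared, may be cached by several PUs */

    static void *writer(void *arg) {
        (void)arg;
        atomic_store(&u, 7);           /* local update, e.g. by PU3 */
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        /* Depending on timing this prints 5 or 7, but once the write has
         * propagated no PU may keep reading a stale cached 5. */
        printf("read u = %d\n", atomic_load(&u));
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }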

[ CS3210 - AY2425S1 - L3 ]
51
Further Classification – Shared Memory
 Two factors can further differentiate shared memory systems:
  Processor to memory delay (UMA / NUMA):
   Whether delay to memory is uniform
  Presence of a local cache with a cache coherence protocol (CC / NCC):
   Same shared variable may exist in multiple caches
   Hardware ensures correctness via the cache coherence protocol

[ CS3210 - AY2425S1 - L3 ]
53
Uniform Memory Access (Time) (UMA)
[Figure: PUs, each with a cache, connected through a shared memory provider to a single memory; UMA organization for a multiprocessor]

 Latency of accessing the main memory is the same for every processor:
  Uniform access time, hence the name
 Suitable for a small number of processing units – due to contention
[ CS3210 - AY2425S1 - L3 ]
54
Non-Uniform Memory Access (NUMA)
[Figure: PUs, each with a cache, connected via an interconnection (shared memory provider) to multiple physically distributed memories]

 Physically distributed memory of all processing units is combined to form a global shared-memory address space:
  Also called distributed shared memory
 Accessing local memory is faster than accessing remote memory for a processor:
  Non-uniform access time
[ CS3210 - AY2425S1 - L3 ]
55
AMD Ryzen

[ CS3210 - AY2425S1 - L3 ]
56
Example: Multicore NUMA
[Figure: each node is a multicore processor P (cores 1 to n) with its own cache and memory; the nodes are connected via an interconnection (shared memory provider)]

[ CS3210 - AY2425S1 - L3 ]
57
ccNUMA
[Figure: processors, each with a cache and a local memory, connected via an interconnection (shared memory provider)]

 Cache-Coherent Non-Uniform Memory Access
  Each node has cache memory to reduce contention

[ CS3210 - AY2425S1 - L3 ]
58
COMA
[Figure: processors, each with a cache only (no separate per-node main memory), connected via an interconnection network]

 Cache-Only Memory Architecture
  Each memory block works as cache memory
  Data migrates dynamically and continuously according to the cache coherence scheme

[ CS3210 - AY2425S1 - L3 ]
59
Summary: Shared Memory Systems
 Advantages:
 No need to partition code or data
 No need to physically move data among processors → communication is efficient

 Disadvantages:
 Special synchronization constructs are required
 Lack of scalability due to contention

[ CS3210 - AY2425S1 - L3 ]
60
Summary
 Goal of parallel architecture is to reduce the average time to execute an instruction

 Various forms of parallelism
 Different types of multicore processors
 Different types of parallel systems and different memory systems

[ CS3210 - AY2425S1 - L3 ]
65
Reading
 CS149 - PARALLEL COMPUTING class at Stanford
 Kayvon Fatahalian and Kunle Olukotun

 Platform 2015: Intel Processor and Platform Evolution for the Next Decade
  Intel white paper, 2005

 https://www.brendangregg.com/blog/2023-03-01/computer-performance-future-2022.html

[ CS3210 - AY2425S1 - L3 ]
66
