
Computer Architecture: SIMD/Vector/GPU

Vector Processing: Exploiting Regular (Data) Parallelism

Prof. Onur Mutlu (edited by seth)
Carnegie Mellon University

Data Parallelism

 Concurrency arises from performing the same operations on different pieces of data
   Single instruction multiple data (SIMD)
   E.g., dot product of two vectors
 Contrast with data flow
   Concurrency arises from executing different operations in parallel (in a data-driven manner)
 Contrast with thread (“control”) parallelism
   Concurrency arises from executing different threads of control in parallel
 SIMD exploits instruction-level parallelism
   Multiple instructions are concurrent: the instructions happen to be the same

SIMD Processing

 Single instruction operates on multiple data elements
   In time or in space
 Multiple processing elements
 Time-space duality
   Array processor: Instruction operates on multiple data elements at the same time
   Vector processor: Instruction operates on multiple data elements in consecutive time steps
Array vs. Vector Processors

Instruction stream (VLEN = 4):

  LD  VR  A[3:0]
  ADD VR  VR, 1
  MUL VR  VR, 2
  ST  A[3:0]  VR

[Figure: An ARRAY PROCESSOR executes the same op on all elements at the same time (LD0-LD3 together, then AD0-AD3, ...): same op @ same time, different ops @ same space. A VECTOR PROCESSOR executes the same op on elements in consecutive time steps (LD0; then LD1 with AD0; then LD2, AD1, MU0; ...): same op @ same space, different ops @ same time, staggered diagonally down the pipeline.]

SIMD Array Processing vs. VLIW

 VLIW
   [Figure: a VLIW instruction packs multiple different operations that issue in parallel.]

SIMD Array Processing vs. VLIW

 Array processor

  for (i = 0; i <= 49; i++)
      C[i] = (A[i] + B[i]) / 2

Vector Processors

 A vector is a one-dimensional array of numbers
 Many scientific/commercial programs use vectors
 A vector processor is one whose instructions operate on vectors rather than scalar (single-data) values
 Basic requirements
   Need to load/store vectors  vector registers (contain vectors)
   Need to operate on vectors of different lengths  vector length register (VLEN)
   Elements of a vector might be stored apart from each other in memory  vector stride register (VSTR)
     Stride: distance between two elements of a vector
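To make the VLEN/VSTR roles concrete, here is a minimal C model of a strided vector load; the function name and the velem_t type are illustrative, not an actual ISA.

  #include <stddef.h>

  typedef long velem_t;   /* illustrative element type */

  /* Software model of a vector load: reads VLEN elements starting at
     `base`, VSTR elements apart, into vector register v0. VLEN and
     VSTR model the vector length and stride control registers. */
  void vld(velem_t *v0, const velem_t *base, size_t VLEN, size_t VSTR)
  {
      for (size_t i = 0; i < VLEN; i++)
          v0[i] = base[i * VSTR];  /* element i comes from base + i*VSTR */
  }

With VSTR = 1 this is a unit-stride load; a column of a row-major matrix would use VSTR equal to the row length (see the later slide on matrix storage).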
Vector Processors (II)

 A vector instruction performs an operation on each element in consecutive cycles
   Vector functional units are pipelined
   Each pipeline stage operates on a different data element
 Vector instructions allow deeper pipelines
   No intra-vector dependencies  no hardware interlocking needed within a vector
   No control flow within a vector
   Known stride allows prefetching of vectors into cache/memory

Vector Processor Advantages

+ No dependencies within a vector
   Pipelining and parallelization work well
   Can have very deep pipelines, no dependencies!
+ Each instruction generates a lot of work
   Reduces instruction fetch bandwidth
+ Highly regular memory access pattern
   Interleaving multiple banks for higher memory bandwidth
   Prefetching
+ No need to explicitly code loops
   Fewer branches in the instruction sequence

Vector Processor Disadvantages

-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list?

Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.

Vector Processor Limitations

-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks
Vector Registers

 Each vector data register holds N M-bit values
 Vector control registers: VLEN, VSTR, VMASK
 Vector Mask Register (VMASK)
   Indicates which elements of the vector to operate on
   Set by vector test instructions
     e.g., VMASK[i] = (Vk[i] == 0)
 Maximum VLEN can be N
   Maximum number of elements stored in a vector register

[Figure: vector register file; each register is M bits wide and holds N elements, e.g. V0,0 ... V0,N-1 and V1,0 ... V1,N-1.]

Vector Functional Units

 Use a deep pipeline (=> fast clock) to execute element operations
 Simplifies control of the deep pipeline because elements in a vector are independent

[Figure: six-stage multiply pipeline computing V3 <- V1 * V2, with operands streaming in from registers V1 and V2 one element per cycle.]

Slide credit: Krste Asanovic

Vector Machine Organization (CRAY-1)

 CRAY-1
   Russell, “The CRAY-1 computer system,” CACM 1978.
 Scalar and vector modes
 8 64-element vector registers
   64 bits per element
 16 memory banks
 8 64-bit scalar registers
 8 24-bit address registers

Memory Banking

 Example: 16 banks; can start one bank access per cycle
 Bank latency: 11 cycles
 Can sustain 16 parallel accesses if they go to different banks

[Figure: banks 0, 1, 2, ..., 15, each with its own MDR/MAR, connected to the CPU by a shared data bus and address bus.]

Slide credit: Derek Chiou
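A minimal C sketch of how word-interleaved banking hides the 11-cycle bank latency; the constants mirror the example above and the function names are illustrative.

  #include <stdint.h>

  #define NUM_BANKS    16   /* as in the example above */
  #define BANK_LATENCY 11   /* cycles per bank access */

  /* Word-interleaved mapping: consecutive word addresses go to
     consecutive banks, so a unit-stride stream touches a new bank
     every cycle. */
  static inline uint32_t bank_of(uint32_t word_addr)
  {
      return word_addr % NUM_BANKS;
  }

  /* A bank can accept a new access only every BANK_LATENCY cycles.
     With 16 (> 11) banks, a unit-stride stream returns to a given
     bank only after 16 cycles, so it never stalls on a busy bank. */
  static inline int bank_ready(uint32_t cycle,
                               const uint32_t last_start[NUM_BANKS],
                               uint32_t bank)
  {
      return cycle - last_start[bank] >= BANK_LATENCY;
  }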
Vector Memory System

[Figure: an address generator takes a Base address and a Stride from the vector unit and spreads successive element addresses across memory banks 0-F.]

Slide credit: Krste Asanovic

Scalar Code Example

 For i = 0 to 49
   C[i] = (A[i] + B[i]) / 2

 Scalar code (4 setup instructions + 50 iterations x 6 instructions = 304 dynamic instructions):

      MOVI R0 = 50            1
      MOVA R1 = A             1
      MOVA R2 = B             1
      MOVA R3 = C             1
  X:  LD R4 = MEM[R1++]       11   ; autoincrement addressing
      LD R5 = MEM[R2++]       11
      ADD R6 = R4 + R5        4
      SHFR R7 = R6 >> 1       1
      ST MEM[R3++] = R7       11
      DECBNZ --R0, X          2    ; decrement and branch if NZ

Scalar Code Execution Time

 Scalar execution time on an in-order processor with 1 bank
   First two loads in the loop cannot be pipelined: 2 x 11 cycles
   4 + 50*40 = 2004 cycles
 Scalar execution time on an in-order processor with 16 banks (word-interleaved)
   First two loads in the loop can be pipelined
   4 + 50*30 = 1504 cycles
 Why 16 banks?
   11-cycle memory access latency
   Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency

Vectorizable Loops

 A loop is vectorizable if each iteration is independent of any other
 For i = 0 to 49
   C[i] = (A[i] + B[i]) / 2
 Vectorized loop (7 dynamic instructions):

  MOVI VLEN = 50        1
  MOVI VSTR = 1         1
  VLD V0 = A            11 + VLEN - 1
  VLD V1 = B            11 + VLEN - 1
  VADD V2 = V0 + V1     4 + VLEN - 1
  VSHFR V3 = V2 >> 1    1 + VLEN - 1
  VST C = V3            11 + VLEN - 1
Vector Code Performance

 No chaining
   i.e., the output of a vector functional unit cannot be used as the input of another (i.e., no vector data forwarding)
 One memory port (one address generator)
 16 memory banks (word-interleaved)

[Figure: execution timeline; every instruction waits for the previous one to complete.]

 285 cycles: 1 + 1 + (11+49) + (11+49) + (4+49) + (1+49) + (11+49), since without chaining each instruction must fully complete before the dependent one starts

Vector Chaining

 Vector chaining: data forwarding from one vector functional unit to another

  LV   v1
  MULV v3, v1, v2
  ADDV v5, v3, v4

[Figure: the load unit chains its results into the multiplier (v1 -> v3), which chains into the adder (v3 -> v5), using vector registers V1-V5.]

Slide credit: Krste Asanovic

Vector Code Performance - Chaining

 Vector chaining: data forwarding from one vector functional unit to another
 Strict assumption: each memory bank has a single port (memory bandwidth bottleneck)

[Figure: timeline 1, 1, 11+49, 11+49 for the two VLDs, then 4+49 and 1+49 chained, then 11+49 for the VST. These two VLDs cannot be pipelined. WHY? VLD and VST cannot be pipelined. WHY?]

 182 cycles

Vector Code Performance - Multiple Memory Ports

 Chaining and 2 load ports, 1 store port in each bank

 79 cycles
Questions (I)

 What if # data elements > # elements in a vector register?
   Need to break loops so that each iteration operates on # elements in a vector register
   E.g., 527 data elements, 64-element VREGs
     8 iterations where VLEN = 64
     1 iteration where VLEN = 15 (need to change the value of VLEN)
   Called vector strip-mining (see the C sketch below)
 What if vector data is not stored in a strided fashion in memory? (irregular memory access to a vector)
   Use indirection to combine elements into vector registers
   Called scatter/gather operations

Gather/Scatter Operations

 Want to vectorize loops with indirect accesses:

  for (i = 0; i < N; i++)
      A[i] = B[i] + C[D[i]]

 Indexed load instruction (Gather):

  LV vD, rD          # Load indices in D vector
  LVI vC, rC, vD     # Load indirect from rC base
  LV vB, rB          # Load B vector
  ADDV.D vA, vB, vC  # Do add
  SV vA, rA          # Store result
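Returning to strip-mining: a minimal C sketch under the assumptions above (64-element vector registers); the inner loop stands in for one vector instruction operating on vlen elements.

  #include <stddef.h>

  #define MAX_VLEN 64   /* elements per vector register, as in the example */

  /* Strip-mined C[i] = (A[i] + B[i]) / 2: each outer iteration sets
     VLEN and issues "vector" work on at most MAX_VLEN elements. */
  void stripmine_avg(const int *A, const int *B, int *C, size_t n)
  {
      for (size_t i = 0; i < n; ) {
          size_t vlen = (n - i < MAX_VLEN) ? (n - i) : MAX_VLEN; /* set VLEN */
          for (size_t j = 0; j < vlen; j++)   /* models one vector op */
              C[i + j] = (A[i + j] + B[i + j]) / 2;
          i += vlen;
      }
  }

For n = 527 this runs 8 full strips with vlen = 64 and one final strip with vlen = 15, matching the count above.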

Gather/Scatter Operations

 Gather/scatter operations are often implemented in hardware to handle sparse matrices
 Vector loads and stores use an index vector, which is added to the base register to generate the addresses

  Index vector:             1, 3, 7, 8
  Data vector:              3.14, 6.5, 71.2, 2.71
  Equivalent dense vector:  0.0, 3.14, 0.0, 6.5, 0.0, 0.0, 0.0, 71.2, 2.71

Conditional Operations in a Loop

 What if some operations should not be executed on a vector (based on a dynamically-determined condition)?

  loop:  if (a[i] != 0) then b[i] = a[i] * b[i]
         goto loop

 Idea: Masked operations
   VMASK register is a bit mask determining which data elements should not be acted upon

  VLD V0 = A
  VLD V1 = B
  VMASK = (V0 != 0)
  VMUL V1 = V0 * V1
  VST B = V1

 Does this look familiar? This is essentially predicated execution.
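A software model of the gather above, as a hedged sketch (a real vector machine would use an indexed-load instruction like LVI instead of the scalar loop):

  #include <stddef.h>

  /* Models A[i] = B[i] + C[D[i]]: the load of C[D[i]] is the gather,
     an indirect load through the index vector D. */
  void gather_add(double *A, const double *B, const double *C,
                  const size_t *D, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          A[i] = B[i] + C[D[i]];   /* gather: base C indexed by D[i] */
  }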
Another Example with Masking

  for (i = 0; i < 64; ++i)
      if (a[i] >= b[i]) then c[i] = a[i]
      else c[i] = b[i]

 Steps to execute the loop:
  1. Compare A, B to get VMASK
  2. Masked store of A into C
  3. Complement VMASK
  4. Masked store of B into C

   A    B   VMASK
   1    2   0
   2    2   1
   3    2   1
   4   10   0
  -5   -4   0
   0   -3   1
   6    5   1
  -7   -8   1

Masked Vector Instructions

 Simple implementation
   Execute all N operations; turn off result writeback according to the mask
 Density-time implementation
   Scan the mask vector and only execute elements with non-zero masks

[Figure: the simple implementation gates the write-enable of each element operation A[i] op B[i] with mask bit M[i]; the density-time implementation compacts execution so only the enabled elements (e.g., M[7], M[5], M[4], M[1]) occupy the write data port.]

Slide credit: Krste Asanovic
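A minimal C model of the four steps above for the "simple" implementation, where every element is processed and the mask only gates writeback (the vmask array stands in for the VMASK register; n <= 64 assumed, i.e., one vector register):

  #include <stddef.h>
  #include <stdint.h>

  /* c[i] = (a[i] >= b[i]) ? a[i] : b[i], as two masked stores. */
  void masked_select(const int *a, const int *b, int *c, size_t n)
  {
      uint8_t vmask[64];                 /* models the VMASK register */
      for (size_t i = 0; i < n; i++)     /* 1. compare A, B to set VMASK */
          vmask[i] = (a[i] >= b[i]);
      for (size_t i = 0; i < n; i++)     /* 2. masked store of A into C */
          if (vmask[i]) c[i] = a[i];
      for (size_t i = 0; i < n; i++)     /* 3.+4. complement VMASK, then
                                                  masked store of B into C */
          if (!vmask[i]) c[i] = b[i];
  }

A density-time implementation would instead iterate only over the set mask bits, so its cost scales with the number of enabled elements rather than with the vector length.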

Some Issues

 Stride and banking
   As long as the stride and the number of banks are relatively prime to each other and there are enough banks to cover the bank access latency, consecutive accesses proceed in parallel
 Storage of a matrix
   Row major: consecutive elements in a row are laid out consecutively in memory
   Column major: consecutive elements in a column are laid out consecutively in memory
   You need to change the stride when accessing a row versus a column (see the sketch below)
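A small C sketch of the stride change for a row-major matrix (names and signatures are illustrative):

  #include <stddef.h>

  /* For a row-major matrix with `cols` columns, element (r, c) lives
     at m[r * cols + c]. Walking a row uses stride 1; walking a column
     uses stride `cols` - the value a vector machine would put in VSTR. */
  double sum_row(const double *m, size_t cols, size_t r, size_t len)
  {
      double s = 0.0;
      for (size_t j = 0; j < len; j++)
          s += m[r * cols + j];        /* row access: stride 1 */
      return s;
  }

  double sum_col(const double *m, size_t cols, size_t c, size_t len)
  {
      double s = 0.0;
      for (size_t i = 0; i < len; i++)
          s += m[c + i * cols];        /* column access: stride = cols */
      return s;
  }

Tying back to banking: with 16 banks, a column stride that is a multiple of 16 lands every access in the same bank, while a stride relatively prime to 16 spreads accesses across all banks.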
Array vs. Vector Processors, Revisited

 Array vs. vector processor distinction is a “purist’s” distinction
 Most “modern” SIMD processors are a combination of both
   They exploit data parallelism in both time and space

Remember: Array vs. Vector Processors

[Figure: the same ARRAY PROCESSOR vs. VECTOR PROCESSOR execution diagram shown earlier: same op @ same time across space vs. same op @ same space across consecutive time steps, for the LD/ADD/MUL/ST instruction stream on A[3:0].]

Vector Instruction Execution

  ADDV C, A, B

[Figure: execution using one pipelined functional unit: elements A[i] + B[i] enter one per cycle (..., A[6]+B[6] behind results C[2], C[1], C[0]). Execution using four pipelined functional units: elements 0, 4, 8, ... / 1, 5, 9, ... / 2, 6, 10, ... / 3, 7, 11, ... proceed in parallel, producing C[0]-C[3], then C[4]-C[7], and so on.]

Slide credit: Krste Asanovic

Vector Unit Structure

[Figure: the vector register file is partitioned across lanes; each lane holds a slice of every vector register (elements 0, 4, 8, ... in lane 0; 1, 5, 9, ... in lane 1; etc.), its own pipelined functional units, and its own connection to the memory subsystem.]

Slide credit: Krste Asanovic
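A toy C model of the lane organization above (the loop over lanes stands in for hardware that runs all lanes concurrently):

  #define LANES 4   /* four pipelined functional units, as in the figure */

  /* ADDV C, A, B on a 4-lane vector unit: lane k owns elements
     k, k+LANES, k+2*LANES, ... of every vector register. */
  void addv_lanes(const int *A, const int *B, int *C, int n)
  {
      for (int lane = 0; lane < LANES; lane++)    /* concurrent in HW */
          for (int i = lane; i < n; i += LANES)   /* pipelined per lane */
              C[i] = A[i] + B[i];
  }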


Vector Instruction Level Parallelism

 Can overlap execution of multiple vector instructions
   Example machine has 32 elements per vector register and 8 lanes
   Complete 24 operations/cycle while issuing 1 short instruction/cycle

[Figure: load unit, multiply unit, and add unit each execute a different vector instruction at the same time; instruction issue is staggered so all three units stay busy.]

Slide credit: Krste Asanovic

Automatic Code Vectorization

  for (i = 0; i < N; i++)
      C[i] = A[i] + B[i];

[Figure: scalar sequential code performs load, load, add, store per iteration (Iter. 1, Iter. 2, ...); vectorized code performs one vector load, load, add, store covering many iterations at once.]

 Vectorization is a compile-time reordering of operation sequencing
   Requires extensive loop dependence analysis

Slide credit: Krste Asanovic
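As a concrete example, the loop below is in a form compilers can typically auto-vectorize; restrict is standard C99 and tells the dependence analysis the arrays do not alias (the compiler flag is GCC's usual one, mentioned only as an illustration):

  /* Compile with e.g. `gcc -O3` (which enables -ftree-vectorize):
     independent iterations + no aliasing => vectorizable. */
  void vec_add(float *restrict C, const float *restrict A,
               const float *restrict B, int n)
  {
      for (int i = 0; i < n; i++)
          C[i] = A[i] + B[i];   /* no loop-carried dependence */
  }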

Vector/SIMD Processing Summary

 Vector/SIMD machines are good at exploiting regular data-level parallelism
   Same operation performed on many data elements
   Improve performance, simplify design (no intra-vector dependencies)
 Performance improvement limited by vectorizability of code
   Scalar operations limit vector machine performance
   Amdahl’s Law
   CRAY-1 was the fastest SCALAR machine at its time!
 Many existing ISAs include (vector-like) SIMD operations
   Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD

SIMD Operations in Modern ISAs
Intel Pentium MMX Operations

 Idea: One instruction operates on multiple data elements simultaneously
   À la array processing (yet much more limited)
   Designed with multimedia (graphics) operations in mind
 No VLEN register
 Opcode determines data type:
   8 8-bit bytes
   4 16-bit words
   2 32-bit doublewords
   1 64-bit quadword
 Stride is always equal to 1

 Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 1996.

MMX Example: Image Overlaying (I)

[Figure: image-overlay example using MMX packed operations.]
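In this style the element type is baked into the opcode and the data must be contiguous. A minimal sketch using SSE2 intrinsics (the 128-bit successor of MMX; the intrinsic names are Intel's standard ones), assuming n is a multiple of 16:

  #include <emmintrin.h>   /* SSE2 intrinsics */

  /* Saturating add of 16 packed 8-bit pixels per instruction: the
     epu8 suffix selects the element type; there is no VLEN register,
     and all accesses are unit-stride. */
  void add_pixels(const unsigned char *a, const unsigned char *b,
                  unsigned char *out, int n)
  {
      for (int i = 0; i < n; i += 16) {
          __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
          __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
          _mm_storeu_si128((__m128i *)(out + i),
                           _mm_adds_epu8(va, vb)); /* saturating byte add */
      }
  }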

MMX Example: Image Overlaying (II)

[Figure: second step of the image-overlay example.]

Graphics Processing Units
SIMD not Exposed to Programmer (SIMT)
High-Level View of a GPU

[Figure: a GPU as an array of cores, each containing a SIMD pipeline.]

Concept of “Thread Warps” and SIMT

 Warp: A set of threads that execute the same instruction (on different data elements)  SIMT (Nvidia-speak)
 All threads run the same kernel
 Warp: the threads that run lengthwise in a woven fabric ...

[Figure: thread warps 3, 7, 8, ... wait with a common PC; the scheduled warp's scalar threads W, X, Y, Z feed the SIMD pipeline.]

Loop Iterations as Threads

  for (i = 0; i < N; i++)
      C[i] = A[i] + B[i];

[Figure: in scalar sequential code, each iteration performs load, load, add, store in turn; in vectorized code, one vector instruction covers iterations 1, 2, ... at once. On a GPU, each iteration instead becomes a thread.]

Slide credit: Krste Asanovic

SIMT Memory Access

 Same instruction in different threads uses the thread id to index and access different data elements
 Let’s assume N = 16, blockDim = 4  4 blocks (see the sketch below)

[Figure: elements 0-15 are split into four blocks of four; each block performs its four additions in parallel.]

Slide credit: Hyesoon Kim
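A C model of that indexing for N = 16 and blockDim = 4; the nested loops stand in for blocks and threads that actually run concurrently on the GPU:

  /* Thread (blockIdx, threadIdx) computes global id
     tid = blockDim * blockIdx + threadIdx and handles one element. */
  void simt_add(const int *A, const int *B, int *C)
  {
      const int blockDim = 4, gridDim = 4;   /* N = 16 */
      for (int blockIdx = 0; blockIdx < gridDim; blockIdx++)         /* blocks  */
          for (int threadIdx = 0; threadIdx < blockDim; threadIdx++) { /* threads */
              int tid = blockDim * blockIdx + threadIdx;
              C[tid] = A[tid] + B[tid];      /* same instruction, per-tid data */
          }
  }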
Sample GPU SIMT Code (Simplified)

CPU code:

  for (ii = 0; ii < 100; ++ii) {
      C[ii] = A[ii] + B[ii];
  }

CUDA code:

  // there are 100 threads
  __global__ void KernelFunction(…) {
      int tid = blockDim.x * blockIdx.x + threadIdx.x;
      int varA = aa[tid];
      int varB = bb[tid];
      C[tid] = varA + varB;
  }

Slide credit: Hyesoon Kim

Sample GPU Program (Less Simplified)

[Figure: a fuller version of the same CUDA program.]

Slide credit: Hyesoon Kim

Latency Hiding with “Thread Warps”

 Warp: A set of threads that execute the same instruction (on different data elements)
 Fine-grained multithreading
   One instruction per thread in the pipeline at a time (no branch prediction)
   Interleave warp execution to hide latencies
 Register values of all threads stay in the register file
 No OS context switching
 Memory latency hiding
   Graphics has millions of pixels

[Figure: warps available for scheduling (e.g., Thread Warps 3, 7, 8) feed a SIMD pipeline of I-Fetch, Decode, RF, and ALUs; on a miss, warps access the memory hierarchy (Thread Warps 1, 2, 6); if all accesses hit in the D-cache, data proceeds to writeback.]

Slide credit: Tor Aamodt

Warp-based SIMD vs. Traditional SIMD

 Traditional SIMD contains a single thread
   Lock step
   Programming model is SIMD (no threads)  SW needs to know the vector length
   ISA contains vector/SIMD instructions

 Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., the same instruction executed by all threads)
   Does not have to be lock step
   Each thread can be treated individually (i.e., placed in a different warp)  programming model is not SIMD
     SW does not need to know the vector length
     Enables memory and branch latency tolerance
   ISA is scalar  vector instructions formed dynamically
   Essentially, it is the SPMD programming model implemented on SIMD hardware
SPMD

 Single procedure/program, multiple data
   This is a programming model rather than a computer organization
 Each processing element executes the same procedure, except on different data elements
   Procedures can synchronize at certain points in the program, e.g., barriers
 Essentially, multiple instruction streams execute the same program
   Each program/procedure can 1) execute a different control-flow path, 2) work on different data, at run-time
   Many scientific applications are programmed this way and run on MIMD computers (multiprocessors)
   Modern GPUs are programmed in a similar way on a SIMD computer

Branch Divergence Problem in Warp-based SIMD

 SPMD execution on SIMD hardware
   NVIDIA calls this “Single Instruction, Multiple Thread” (“SIMT”) execution

[Figure: control-flow graph A -> B -> {C, D, F} -> E -> G executed by a thread warp of threads 1-4 sharing a common PC; different threads may need different paths.]

Slide credit: Tor Aamodt

Control Flow Problem in GPUs/SIMD

 GPU uses a SIMD pipeline to save area on control logic
   Group scalar threads into warps
 Branch divergence occurs when threads inside warps branch to different execution paths

[Figure: a warp of threads 1-4 with a common PC reaches a branch; some threads follow Path A, the rest follow Path B.]

Slide credit: Tor Aamodt

Branch Divergence Handling (I)

[Figure: reconvergence stack. Each entry holds a reconvergence PC, a next PC, and an active mask. For the control-flow graph A/1111 -> B/1111 -> {C/1001, D/0110} -> E/1111 -> G/1111, the stack (TOS) serializes the divergent paths: execute C with mask 1001, then D with mask 0110, reconverge at E with mask 1111, and continue to G. Execution order over time: A, B, C, D, E, G, A. The sketch after this figure models the same idea in C.]

Slide credit: Tor Aamodt
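A toy C model of masked execution of both branch paths, a hedged sketch of what the hardware's active masks achieve (the warp size and the per-path operations are illustrative):

  #define WARP 4
  typedef unsigned mask_t;   /* one bit per lane/thread */

  /* if (a[t] != 0) out[t] = a[t] * 2; else out[t] = -1;
     executed warp-wide: both paths run, each under its active mask. */
  void run_divergent_branch(const int a[WARP], int out[WARP])
  {
      mask_t taken = 0;
      for (int t = 0; t < WARP; t++)     /* evaluate the branch per lane */
          if (a[t] != 0) taken |= 1u << t;

      for (int t = 0; t < WARP; t++)     /* path A, active mask = taken */
          if (taken & (1u << t)) out[t] = a[t] * 2;

      for (int t = 0; t < WARP; t++)     /* path B, complement mask */
          if (!(taken & (1u << t))) out[t] = -1;
      /* reconvergence point: all lanes active again */
  }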


Dynamic Warp Formation

 Idea: Dynamically merge threads executing the same instruction (after branch divergence)
 Form new warps at divergence
   Enough threads branching to each path can create full new warps

[Figure: after a branch, threads taking Path A and threads taking Path B are regrouped into new warps.]

Dynamic Warp Formation/Merging

 Idea: Dynamically merge threads executing the same instruction (after branch divergence)
 Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007.

Dynamic Warp Formation Example

[Figure: two warps x and y execute blocks A-G with per-block active masks (A: x/1111, y/1111; B: x/1110, y/0011; C: x/1000, y/0010; D: x/0110, y/0001; F: x/0001, y/1100; E: x/1110, y/0011; G: x/1111, y/1111). A new warp is created from scalar threads of both warp x and warp y executing at basic block D. Baseline execution over time: A A B B C C D D E E F F G G A A; with dynamic warp formation: A A B B C D E E F G G A A, saving cycles.]

Slide credit: Tor Aamodt

What About Memory Divergence?

 Modern GPUs have caches
   Ideally: want all threads in the warp to hit (without conflicting with each other)
 Problem: One thread in a warp can stall the entire warp if it misses in the cache
 Need techniques to
   Tolerate memory divergence
   Integrate solutions to branch and memory divergence


NVIDIA GeForce GTX 285

 NVIDIA-speak:
   240 stream processors
   “SIMT execution”
 Generic speak:
   30 cores
   8 SIMD functional units per core

Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 “core”

[Figure: one core contains 64 KB of storage for fragment contexts (registers), 8 SIMD functional units (each a multiply-add plus a multiply), an instruction stream decoder shared across the 8 units, and execution context storage.]

Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 “core”

 64 KB of storage for thread contexts (registers)
 Groups of 32 threads share an instruction stream (each group is a warp)
 Up to 32 warps are simultaneously interleaved
 Up to 1024 thread contexts can be stored (32 warps x 32 threads)

Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285

[Figure: the full chip: 30 cores, each paired with texture units (Tex).]

 There are 30 of these cores on the GTX 285: 30 x 1024 = 30,720 threads

Slide credit: Kayvon Fatahalian
