Vector Processor

SIMD architectures can significantly improve performance by enabling single instructions to operate on multiple data elements simultaneously. SIMD is more energy efficient than MIMD as it does not need to fetch and execute a separate instruction for each data operation. Vector processors, a type of SIMD architecture, improve performance on tasks like numerical simulations by collecting data elements into registers and operating on the register files to hide memory latency. Common SIMD instruction set extensions include MMX, SSE, AltiVec, and AVX.


Unit V

SIMD
• SIMD architectures have significant data-level parallelism (DLP)
• A single instruction can launch many data operations
• SIMD is more energy efficient than MIMD
– MIMD needs to fetch and execute one instruction per data operation
• SIMD is more attractive for PMDs (Personal Mobile Devices)
• Advantage of SIMD over MIMD: the programmer thinks in terms of sequential execution yet achieves parallel speedup through parallel data operations
SIMD
• SIMD has 3 variations:
– Vector architectures
• Allow pipelined execution of many data operations
– Multimedia SIMD instruction set extensions (e.g., MMX)
• Allow simultaneous parallel data operations that support multimedia applications
– GPU architectures
• Offer higher performance than traditional multicore processors
• Have a system processor, system memory, and graphics memory
Vector Processor
• An efficient way to execute a vectorizable application is a vector processor - Jim Smith
• A vector processor is a CPU that executes instructions that operate on arrays (vectors) of data
– It collects sets of data elements and places them in the vector register file
– It operates on the data in those register files and stores the results back to memory
– These register files act as buffers and hide the memory latency
Vector processor
• Falls under the SIMD classification
• Also called an array processor
• Improves performance on numerical simulations
• Used in video game consoles and graphics accelerators
• Examples of SIMD extensions: VIS, MMX, SSE, AltiVec and AVX
 VIS (Visual Instruction Set): developed by Sun Microsystems (now part of Oracle) for their SPARC processors.
 MMX (MultiMedia Extensions): introduced by Intel in 1996 as an extension to the x86 instruction set architecture.
 SSE (Streaming SIMD Extensions): an extension of the x86 instruction set architecture introduced by Intel in 1999.
 AltiVec: also known as VMX, and marketed by Apple as the Velocity Engine, developed jointly by IBM, Motorola, and Apple for PowerPC processors.
 AVX (Advanced Vector Extensions): an extension of the x86 instruction set architecture proposed by Intel in 2008 and first implemented in the Sandy Bridge microarchitecture.
VMIPS
• It is loosely based on the Cray-1
• VMIPS instruction set:
– The scalar portion is similar to MIPS
– The vector portion is a logical vector extension of MIPS
• Registers:
– It has 8 vector registers
– Each vector register is of fixed length and holds a single vector
– Each vector register holds 64 elements, each 64 bits wide
The Cray-1 was the first supercomputer to successfully implement the vector processor design. These systems improve the
performance of math operations by arranging memory and registers to quickly perform a single operation on a large set of data.
VMIPS
• Vector registers
– The vector register file has 16 read ports and 8 write ports
– It supplies operands to the vector functional units (VFUs)
– The registers and the functional units are connected by a pair of crossbar switches (the thick gray lines in the figure)
• Scalar registers
– 32 GPRs and 32 FPRs, as in MIPS
– These supply operands to the VFUs
– They also supply addresses to the load/store unit
VMIPS
• Vector functional units
– Each unit is fully pipelined
– Each can start a new operation on every clock cycle
– A control unit is needed to detect hazards:
• structural hazards for functional units
• data hazards on register accesses
VMIPS
• VMIPS has five functional units
– Integer unit
– Logical Unit
– Floating point Add/Sub
– Floating point Multiply
– Floating point Divide
VMIPS
• VMIPS has a scalar architecture just like MIPS.
• Vector load/store unit:
– The vector load/store unit loads or stores a vector to or from memory
– The unit is fully pipelined
– Words can be moved between the vector registers and memory at one word per clock cycle, after an initial latency
– The unit also handles scalar loads and stores
How Vector Processors Work: An Example

• Let's take a typical vector problem: Y = a * X + Y
• X and Y are vectors resident in memory
• a is a scalar
• This problem is the so-called SAXPY or DAXPY loop
– SAXPY: single-precision a * X plus Y
– DAXPY: double-precision a * X plus Y
How Vector Processors Work: An Example
DAXPY using MIPS instructions Y = a * X plus Y

• MIPS code for the DAXPY loop


• L.D F0,a ;load scalar a
• DADDIU R4,Rx,#512 ;last address to load
• Loop: L.D F2,0(Rx) ;load X[i]
• MUL.D F2,F2,F0 ;a x X[i]
• L.D F4,0(Ry) ;load Y[i]
• ADD.D F4,F4,F2 ;a x X[i] + Y[i]
• S.D F4,0(Ry) ;store into Y[i]
• DADDIU Rx,Rx,#8 ;increment index to X
• DADDIU Ry,Ry,#8 ;increment index to Y
• DSUBU R20,R4,Rx ;compute bound
• BNEZ R20,Loop ;check if done

Requires almost 600 MIPS operations when the vectors have 64 elements: 64 iterations x 9 instructions per iteration is roughly 578 instructions.
VMIPS instructions

How Vector Processors Work: An Example


ADDVV.D: add two vectors
ADDVS.D: add a vector to a scalar
MULVS.D: multiply a vector by a scalar
LV / SV: vector load and vector store from an address
Rx holds the address of vector X; Ry holds the address of vector Y

• VMIPS code for DAXPY


• L.D F0,a ;load scalar a
• LV V1,Rx ;load vector X
• MULVS.D V2,V1,F0 ;vector-scalar multiply
• LV V3,Ry ;load vector Y
• ADDVV.D V4,V2,V3 ;add
• SV V4,Ry ;store the result
Assumption: the vector length equals the maximum vector length (64), so no strip-mining loop is necessary.
How Vector Processors Work: An Example
• The vector processor greatly reduces the dynamic instruction bandwidth
• It executes only 6 instructions versus almost 600 for MIPS
• This reduction occurs because the vector operations work on 64 elements at a time
• The overhead instructions that constitute nearly half the loop on MIPS are not present in the VMIPS code
• When the compiler produces vector instructions for such a sequence and the resulting code spends most of its time running in vector mode, the code is said to be vectorized or vectorizable
How Vector Processors Work: An Example
• Loops can be vectorized when they do not have dependences between iterations of the loop; such dependences are called loop-carried dependences.
• Another important difference between MIPS and VMIPS is the frequency of pipeline interlocks.
• In the MIPS code, every ADD.D must wait for a MUL.D, and every S.D must wait for the ADD.D.
• On the vector processor, each vector instruction will only stall for the first element in each vector; subsequent elements then flow smoothly down the pipeline.
How Vector Processors Work: An Example

• Vector architects call this forwarding of element-dependent operations chaining, because the dependent operations are "chained" together.
• Thus, pipeline stalls are required only once per vector instruction, rather than once per vector element.
Vector Execution Time
• Vector execution time depends on three factors:
– the length of the operand vectors
– structural hazards among the operations
– data dependences
• Given the vector length and the initiation rate (the rate at which a vector unit consumes new operands and produces new results), we can compute the time for a single vector instruction.
• All modern vector computers have vector functional units with multiple parallel pipelines (or lanes) that can produce two or more results per clock cycle.
Vector Execution Time
• Convoy:
– the set of vector instructions that could potentially execute together
– The instructions in a convoy must not contain any structural hazards
– If such hazards were present, the instructions would need to be serialized and initiated in different convoys
– We assume that a convoy of instructions must complete execution before any other instructions can begin execution
– Vector instruction sequences with structural hazards must therefore be placed in separate convoys
Vector Execution Time
• Convoy example:
LW $t0, 0($s0)      # load a value from memory into $t0
ADD $t1, $t0, $s1   # add the value in $t0 to $s1 and store the result in $t1
In this example, the ADD instruction depends on the value loaded from memory by the previous LW instruction. If these instructions are executed in the same convoy, a hazard can occur.
To serialize the instructions using convoys, they can be split as follows:
Convoy 1:
LW $t0, 0($s0)

Convoy 2:
ADD $t1, $t0, $s1
Vector Execution Time
• Chaining:
– allows a vector operation to start as soon as the individual elements of its vector source operand become available
– The results from the first functional unit in the chain are "forwarded" to the second functional unit
– Chaining allows dependent operations to be placed in the same convoy
In these examples, we'll assume a vector length of 4 elements.

Vector Addition:
ADD $t0, $s1, $s2   # Add the first pair of vector elements
ADD $t1, $s3, $s4   # Add the second pair of vector elements
ADD $t2, $t0, $t1   # Add the results of the previous additions

Vector Average:
ADD $t0, $s1, $s2   # Add the first pair of vector elements
ADD $t1, $s3, $s4   # Add the second pair of vector elements
SRL $t0, $t0, 1     # Shift the first addition result right by 1 (divide by 2)
SRL $t1, $t1, 1     # Shift the second addition result right by 1 (divide by 2)
ADD $t2, $t0, $t1   # Add the averaged results of the previous additions
Vector Execution Time
• Chime:
– a timing metric used to estimate the time for a convoy
– simply the unit of time taken to execute one convoy
– a vector sequence that consists of m convoys executes in m chimes for a vector length of n
– for VMIPS this is approximately m x n clock cycles
– measuring time in chimes is a better approximation for long vectors
– if we know the number of convoys in a vector sequence, we know the execution time in chimes
Vector Execution Time
• Show how the following code sequence lays out in convoys,
assuming a single copy of each vector functional unit:
– LV V1,Rx ;load vector X
– MULVS.D V2,V1,F0 ;vector-scalar multiply
– LV V3,Ry ;load vector Y
– ADDVV.D V4,V2,V3 ;add two vectors
– SV V4,Ry ;store the sum
• How many chimes will this vector sequence take? How many
cycles per FLOP (floating-point operation) are needed, ignoring
vector instruction issue overhead?
Vector Execution Time

• MULVS.D V2,V1,F0 ;vector-scalar multiply

MULVS.D is the VMIPS vector-scalar multiply instruction: it multiplies each element of a double-precision vector register by a scalar held in a floating-point register and writes the products to a destination vector register.

The syntax for the MULVS.D instruction is as follows:

MULVS.D Vd, Vs, Fs
Vd: destination vector register that receives the products.
Vs: source vector register containing the double-precision vector elements.
Fs: scalar floating-point register whose value multiplies every element of Vs.

In the DAXPY sequence below, MULVS.D V2,V1,F0 multiplies every element of V1 (vector X) by the scalar a in F0 and places the results in V2.

LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV V4,Ry ;store the sum
Vector Execution Time

– The first convoy starts with the first LV instruction.
– The MULVS.D is dependent on the first LV, but chaining allows it to be in the same convoy.
– The second LV instruction must be in a separate convoy, since there is a structural hazard on the load/store unit with the prior LV instruction.
– The ADDVV.D is dependent on the second LV, but it can again be in the same convoy via chaining.
– Finally, the SV has a structural hazard on the LV in the second convoy, so it must go in the third convoy.
Vector Execution Time

• The sequence requires three convoys:
– Convoy 1: LV, MULVS.D
– Convoy 2: LV, ADDVV.D
– Convoy 3: SV
• Since the sequence takes three chimes and there are two floating-point operations per result, the number of cycles per FLOP is 1.5 (ignoring vector instruction issue overhead).
• Even though we allow the LV and MULVS.D both to execute in the first convoy, the chime approximation is reasonably accurate for long vectors.
• For 64-element vectors, the time in chimes is 3, so the sequence would take about 64 x 3 = 192 clock cycles.
Multiple Lanes

Fig: Using multiple functional units to improve the performance of a single vector add instruction, C = A + B
Multiple Lanes
• The vector processor in the figure:
– (a) has a single add pipeline and can complete one addition per cycle.
– (b) has four add pipelines and can complete four additions per cycle.
– The elements within a single vector add instruction are interleaved across the four pipelines.
– The set of elements that move through the pipelines together is termed an element group.
– Going from one lane to four lanes reduces the number of clocks for a chime from 64 to 16.
– For multiple lanes to be advantageous, both the applications and the architecture must support long vectors (see the C sketch after this list).
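A minimal C sketch (illustrative only, not VMIPS code) of how a four-lane unit spreads a 64-element vector add across its pipelines; the names NUM_LANES, MVL, and vector_add_four_lanes are made up for this example.

#define NUM_LANES 4     /* four parallel add pipelines (lanes) */
#define MVL 64          /* maximum vector length, as in VMIPS */

/* Element i of every vector register lives in lane i % NUM_LANES, so a
   64-element add needs 64 / 4 = 16 element groups, i.e. a chime of 16 clocks. */
void vector_add_four_lanes(const double A[MVL], const double B[MVL], double C[MVL])
{
    for (int group = 0; group < MVL / NUM_LANES; group++) {   /* one element group per clock */
        for (int lane = 0; lane < NUM_LANES; lane++) {        /* the lanes work in parallel */
            int i = group * NUM_LANES + lane;                 /* element handled by this lane */
            C[i] = A[i] + B[i];
        }
    }
}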
Multiple Lanes

Fig: Structure of a vector unit containing four lanes


Multiple Lanes

– Each lane contains one portion of the vector register file and one execution pipeline from each vector functional unit.
– Each vector functional unit executes vector instructions at the rate of one element group per cycle, using multiple pipelines, one per lane.
– The first lane holds the first element (element 0) of all vector registers, so the first element in any vector instruction will have its source and destination operands located in the first lane.
Multiple Lanes
• Vector register storage is divided across the lanes,
with each lane holding every fourth element of
each vector register
• three vector functional units:
– an FP add
– an FP multiply
– a load-store unit.
• Adding multiple lanes is a popular technique to
improve vector performance.
Vector-Length Registers
• A vector-register processor has a natural vector length determined by the number of elements in each vector register.
• This length is 64 for VMIPS.
• In a real program, the length of a particular vector operation is often unknown at compile time.
Vector Length Register
• A single piece of code may require different vector lengths:
for (i = 0; i < n; i = i + 1)
    Y[i] = a * X[i] + Y[i];
• The size of all the vector operations depends on n
• The value of n might change during execution
vector-length register
• vector-length register (VLR)
– controls the length of any vector operation
– value in the VLR cannot be greater than the length of
the vector registers
– This solves our problem as long as the real length is
less than or equal to the maximum vector length
(MVL)
vector-length register
• What if the value of n is greater than the MVL?
• Strip mining is the generation of code such that each vector operation is done for a size less than or equal to the MVL.
• We create one loop that handles any number of iterations that is a multiple of the MVL, and another loop that handles any remaining iterations, which must be fewer than the MVL (see the C sketch after this list).
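A minimal C sketch of a strip-mined DAXPY loop, assuming MVL = 64; the helper daxpy_vector is a made-up stand-in for the vector sequence (LV, MULVS.D, ADDVV.D, SV) that would run with the VLR set to len.

#define MVL 64   /* maximum vector length (64 for VMIPS) */

/* Stand-in for one vector-mode pass over len elements (len <= MVL). */
static void daxpy_vector(int len, double a, double *x, double *y)
{
    for (int i = 0; i < len; i++)
        y[i] = a * x[i] + y[i];
}

void daxpy_strip_mined(int n, double a, double *x, double *y)
{
    int low = 0;
    int len = n % MVL;                           /* first, odd-sized strip: the remainder */
    while (low < n) {
        daxpy_vector(len, a, &x[low], &y[low]);  /* VLR would be set to len here */
        low += len;                              /* advance to the next strip */
        len = MVL;                               /* all remaining strips are full length */
    }
}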
Vector Mask Registers
• The presence of conditionals (IF statements) inside loops and the use of sparse matrices are two main reasons for lower levels of vectorization.
• Consider the following loop written in C:
for (i = 0; i < 64; i = i + 1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
Vector Mask Registers
• This loop cannot normally be vectorized
because of the conditional execution of the
body
• Mask registers essentially provide conditional
execution of each element operation.
• The vector-mask control uses a Boolean vector
to control the execution of a vector instruction
Vector Mask Registers
• When the vector mask register is enabled, any
vector instructions executed operate only on
the vector elements whose corresponding
entries in the vector-mask register are one.
• The entries in the destination vector register
that correspond to a zero in the mask register
are unaffected by the vector operation
Vector Mask Registers
LV V1,Rx ;load vector X into V1
LV V2,Ry ;load vector Y
L.D F0,#0 ;load FP zero into F0
SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i) != F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV V1,Rx ;store the result in X
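As an illustration only (C, not VMIPS), a minimal sketch of what the masked sequence above does: a Boolean vector vm[] models the vector-mask register and gates each element operation.

#define MVL 64

/* X[i] = X[i] - Y[i] only where X[i] != 0; other elements are unaffected. */
void masked_subtract(double X[MVL], double Y[MVL])
{
    int vm[MVL];                      /* models the vector-mask register */

    for (int i = 0; i < MVL; i++)     /* SNEVS.D: set mask bit where X[i] != F0 (zero) */
        vm[i] = (X[i] != 0.0);

    for (int i = 0; i < MVL; i++)     /* SUBVV.D under the vector mask */
        if (vm[i])
            X[i] = X[i] - Y[i];
}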
Memory Banks
• Start-up penalties on load/store units are higher than those for arithmetic units
• They are over 100 clock cycles on many processors
• For VMIPS we assume a start-up time of 12 clock cycles
• To maintain an initiation rate of one word fetched or stored per clock:
– the memory system must be capable of producing or accepting this much data
– spreading accesses across multiple independent memory banks usually delivers the desired rate
Memory Banks
• Most vector processors use memory banks,
which allow multiple independent accesses
rather than simple memory interleaving for
three reasons:
– To support simultaneous accesses from multiple loads or stores, the memory system needs multiple banks and must be able to control the addresses to the banks independently
Memory Banks
– Most vector processors support the ability to load or store
data words that are not sequential. In such cases,
independent bank addressing, rather than interleaving, is
required.
– Most vector computers support multiple processors
sharing the same memory system, so each processor will
be generating its own independent stream of addresses
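As a worked example with assumed numbers: if each memory bank is busy for 6 clock cycles after an access and the vector unit must sustain one memory access per clock cycle, at least 6 independent banks are needed to hide the bank busy time; with several load/store units, or several processors sharing the memory system, the required number of banks grows in proportion to the number of simultaneous access streams.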
SIMD Instruction Set Extensions for Multimedia
• Media applications operate on narrower data types than the native word size.
• Many graphics systems use 8 bits to represent each of the three primary colors plus 8 bits for transparency.
• Depending on the application, audio samples are usually represented with 8 or 16 bits.
• Like vector instructions, a SIMD instruction specifies the same operation on vectors of data.
• SIMD instructions tend to specify fewer operands and hence use much smaller register files; this is in contrast to vector architectures, which have large register files.
SIMD Instruction Set Extensions for Multimedia
• Multimedia SIMD support for 256-bit-wide operations:

Instruction category       Operands
Unsigned add/subtract      Thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit
Maximum/minimum            Thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit
Average                    Thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit
Shift right/left           Thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit
Floating point             Sixteen 16-bit, eight 32-bit, four 64-bit, or two 128-bit
SIMD Instruction Set Extensions for Multimedia
• Multimedia SIMD extensions fix the number of data
operands in the opcode
• Multimedia SIMD does not offer the more sophisticated
addressing modes of vector architectures
• Multimedia SIMD usually does not offer the mask
registers to support conditional execution
• The Streaming SIMD Extensions (SSE), the successor to MMX introduced in 1999, added separate registers that are 128 bits wide
• Instructions could then simultaneously perform sixteen 8-bit operations, eight 16-bit operations, or four 32-bit operations
SIMD Instruction Set Extensions for Multimedia

• Advanced Vector Extensions (AVX), added in


2010, doubles the width of the registers again
to 256 bits and thereby offers instructions that
double the number of operations on all
narrower data types
SIMD Instruction Set Extensions for Multimedia
• AVX Instruction Description
• VADDPD Add four packed double-precision operands
• VSUBPD Subtract four packed double-precision operands
• VMULPD Multiply four packed double-precision operands
• VDIVPD Divide four packed double-precision operands
• VFMADDPD Multiply and add four packed double-precision
operands
• VFMSUBPD Multiply and subtract four packed double-
precision operands
• VCMPxx Compare four packed double-precision operands for EQ, NEQ, LT, LE, GT, GE, etc.
• VMOVAPD Move aligned four packed double-precision
operands
• VBROADCASTSD Broadcast one double-precision operand to four
locations in a 256-bit register
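To make the packed double-precision AVX operations above concrete, here is a minimal C sketch of DAXPY using AVX intrinsics (it assumes a compiler with AVX enabled, e.g. -mavx, and that n is a multiple of 4 just to keep the sketch short).

#include <immintrin.h>

/* DAXPY with 256-bit AVX: four double-precision operations per instruction. */
void daxpy_avx(int n, double a, const double *x, double *y)
{
    __m256d va = _mm256_set1_pd(a);            /* broadcast a, like VBROADCASTSD */
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);   /* load four elements of X */
        __m256d vy = _mm256_loadu_pd(&y[i]);   /* load four elements of Y */
        __m256d vp = _mm256_mul_pd(va, vx);    /* VMULPD: a * X[i..i+3] */
        vy = _mm256_add_pd(vp, vy);            /* VADDPD: add Y[i..i+3] */
        _mm256_storeu_pd(&y[i], vy);           /* store the four results back to Y */
    }
}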
Graphics Processing Unit
• A graphics processing unit (GPU) is similar to a CPU.
• It is designed specifically for performing the complex mathematical and geometric calculations that are necessary for graphics rendering.
Graphics Processing Unit
• A graphics processing unit (GPU) is a computer chip that performs rapid mathematical calculations, primarily for the purpose of rendering images.
• It is occasionally called a visual processing unit (VPU).
• A GPU is able to render images more quickly than a CPU because of its parallel processing architecture.
• Nvidia introduced the first GPU, the GeForce 256, in 1999.
• Other GPU vendors include AMD, Intel and ARM.
• In 2012, Nvidia released a virtualized GPU, which offloads graphics processing from the server CPU in a virtual desktop infrastructure.
Graphics Processing Unit
• GPUs are used in
– Embedded Systems
– Mobile phones
– Personal computers
– Workstations
– Game consoles
GPU Vs CPU
• A GPU is tailored for highly parallel operation, while a CPU executes programs serially.
• For this reason, GPUs have many parallel execution units and higher transistor counts, while CPUs have fewer execution units and higher clock speeds.
• A GPU is for the most part deterministic in its operation.
• GPUs have much deeper pipelines (several thousand stages vs. 10-20 for CPUs).
• GPUs have significantly faster and more advanced memory interfaces, as they need to move around a lot more data than CPUs.
What Drives GPU Growth?
• The entertainment industry has driven the economics of these chips
– Males age 15-35 buy $10B in video games per year
• Moore's Law ++
• Simplified design (stream processing)
• Single-chip designs
GPU
• Very Efficient For
– Fast Parallel Floating Point Processing
– Single Instruction Multiple Data Operations
– High Computation per Memory Access

• Not Efficient For


– Double Precision
– Logical Operations on Integer Data
– Branching-Intensive Operations
– Random Access, Memory-Intensive Operations
CUDA
• CUDA - Compute Unified Device Architecture
– a parallel computing platform and programming model created by NVIDIA
– implemented by NVIDIA GPUs
– CUDA gives developers access to the instruction set and memory of the parallel computational elements in CUDA GPUs
– Using CUDA, GPUs become accessible for general-purpose computation like CPUs
CUDA

• CUDA distinguishes functions for the GPU (device) from functions for the system processor (host).
• CUDA uses the qualifiers __device__ or __global__ for GPU functions and __host__ for CPU functions.
• CUDA variables declared in device or global functions are allocated to the GPU memory.
Definitions

– dimGrid - dimensions of the grid (in blocks)
– dimBlock - dimensions of a block (in threads)
– blockIdx - identifier for a block within the grid
– threadIdx - identifier for a thread within its block
– blockDim - number of threads per block
DAXPY C code
• /* Sequential code */
• // DAXPY in C
• void daxpy(int n, double a, double *x, double *y)
• {
• for(int i=0; i< n;++i)
• y[i] = a*x[i] + y[i];
• }
DAXPY CUDA version
• // Invoke DAXPY with 256 threads per Thread Block
• __host__
• int nblocks = (n + 255) / 256;
• daxpy<<<nblocks, 256>>>(n, a, x, y);
• // DAXPY in CUDA
• __global__
• void daxpy(int n, double a, double *x, double *y)
• {
•   int i = blockIdx.x * blockDim.x + threadIdx.x;
•   if (i < n) y[i] = a * x[i] + y[i];
• }
DAXPY
• We launch n threads, one per vector element, with 256 CUDA threads per thread block.
• The GPU function starts by calculating the element index i
– based on the block ID,
– the number of threads per block,
– and the thread ID.
– As long as this index is within the array (i < n), it performs the multiply and add (a worked example follows).
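As a worked example (n = 1000 is assumed for illustration): with 256 threads per block, nblocks = (1000 + 255) / 256 = 4 blocks are launched, giving 1024 threads; thread 3 of block 2 computes i = 2 x 256 + 3 = 515 and updates y[515], while the last 24 threads of block 3 fail the i < n test and do nothing.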
NVIDIA GPUs: Terminology
• Program abstractions :
• Grid
– A vectorizable loop executed on GPU made up of one or
more thread blocks
• Thread Block
• A group of threads processing a portion of the loop on
MTSIMD processor .
• They communicate via local memory
• CUDA Thread
–Thread that processes one iteration of the loop executed on
one SIMD lane
NVIDIA GPUs: Terminology
• Machine objects:
• Warp
– A thread of SIMD instructions, executed on the SIMD lanes
• PTX instruction
– A single SIMD instruction executed across the SIMD lanes
NVIDIA GPUs: Terminology
• Memory hardware:
• Global Memory
– DRAM available to all threads (all SIMD processors in the GPU)
• Local Memory
– Private to each thread
• Shared Memory
– Accessible to all threads of a streaming multiprocessor
• Thread Processor Registers
– Registers private to a single SIMD lane
NVIDIA GPUs: Terminology
• Processing hardware:
• Streaming Multiprocessor
– Multithreaded SIMD processor that executes threads of SIMD instructions
• Giga Thread Engine
– Thread block scheduler that assigns thread blocks to multithreaded SIMD processors
• Warp Scheduler
– SIMD thread scheduler that issues threads of SIMD instructions when they are ready to execute
• Thread Processor
– SIMD lane that executes the operations in a thread of SIMD instructions on a single element
NVIDIA GPU- MTSIMD
• A GPU is a multiprocessor composed of multithreaded SIMD (MT-SIMD) processors.
• Each is similar to a vector processor, but with many parallel functional units that are deeply pipelined.
• An MT-SIMD processor executes code in the form of thread blocks.
• The GPU hardware contains a collection of MT-SIMD processors that execute a grid of thread blocks.
NVIDIA GPU- MTSIMD
• GPU hardware has two levels of hardware schedulers:
1. Thread Block Scheduler:
– The thread block scheduler is similar to the control unit in a vector processor
– It determines the number of thread blocks needed for a loop and allocates them to different MT-SIMD processors
– It ensures that thread blocks are assigned to processors whose local memories hold the corresponding data
NVIDIA GPU- FERMI MTSIMD
2. SIMD Thread Scheduler:
• The SIMD thread scheduler includes scoreboard logic
• It keeps track of up to 48 threads of SIMD instructions
• It identifies which threads of SIMD instructions are ready to run
• It sends those instructions to the dispatch unit to be run on the MT-SIMD processor
• It operates within a SIMD processor, scheduling when threads of SIMD instructions should run
NVIDIA GPU- MTSIMD
• The MT-SIMD processor has many parallel functional units
• SIMD processors have separate PCs and are programmed using threads
• Each MT-SIMD processor is assigned 512 elements of the vectors to work on
• SIMD processors have 32,768 registers
• As in a vector processor, these registers are logically divided across the SIMD lanes
NVIDIA GPU- MTSIMD
• Each SIMD thread has 64 vector registers of 32 elements, each element 32 bits wide
• Fermi has 16 physical lanes, each containing 2048 registers
• A thread block would contain 512/32 = 16 SIMD threads
• Each thread of SIMD instructions in this example computes 32 of the elements of the computation
NVIDIA GPU- MTSIMD
• GPU applications have so many threads of SIMD
instructions that multithreading can
– hide the latency to DRAM
– increase utilization of multithreaded SIMD
Processors
NVIDIA GPU ISA
• PTX (Parallel Thread Execution) provides a stable instruction set for GPUs
• The hardware instruction set is hidden from the programmer
• PTX instructions describe the operations of a single CUDA thread
• PTX uses virtual registers
• Translation to machine code is performed in software
NVIDIA GPU ISA
• Format of a PTX instruction is
opcode.type d, a, b, c;
– where d is the destination operand; a, b, and c are
source operands
• Source operands are 32-bit or 64-bit registers
or a constant value. Destinations are registers,
except for store instructions.
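For example (illustrative only), add.f64 d, a, b adds two 64-bit floating-point source registers and writes the sum to the destination register d, while ld.global.f64 d, [a] loads a 64-bit floating-point value from global memory into d.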
NVIDIA GPU ISA
• The operation type is one of the following:
Type                                        .type specifier
Untyped bits, 8, 16, 32, and 64 bits        .b8, .b16, .b32, .b64
Unsigned integer, 8, 16, 32, and 64 bits    .u8, .u16, .u32, .u64
Signed integer, 8, 16, 32, and 64 bits      .s8, .s16, .s32, .s64
Floating point, 16, 32, and 64 bits         .f16, .f32, .f64
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• It also uses:
– a branch synchronization stack
– entries consisting of masks for each SIMD lane, i.e. which threads commit their results
• There is a per-thread-lane 1-bit predicate register, specified by the programmer
NVIDIA GPU Memory Structures (figure)
GPU Architecture (figure)
Fermi GPU Architecture (figures)
Fermi vs. Kepler comparison (figures)
