Vector Processor
SIMD
• SIMD architectures can exploit significant data-level parallelism (DLP)
• A single instruction can launch many data operations
• SIMD is potentially more energy efficient than MIMD
– MIMD needs to fetch and execute one instruction
per data operation
• This makes SIMD attractive for PMDs (Personal
Mobile Devices)
• Advantage of SIMD over MIMD
– The programmer continues to think sequentially yet achieves
parallel speedup through parallel data operations
SIMD
• SIMD has 3 variations:
– Vector architectures
• Allow pipelined execution of many data operations
– Multimedia SIMD instruction set extensions (e.g., MMX)
• Allow simultaneous parallel data operations that
support multimedia applications
– GPU architectures
• Offer higher potential performance than traditional
multicore computers
• Have a system processor, system memory, and
graphics memory
Vector Processor
• The most efficient way to execute a vectorizable
application is a vector processor (Jim Smith)
• A vector processor is a CPU that executes
instructions that operate on arrays of data.
– It collects sets of data elements from memory and places them in
the vector register file
– It operates on the data in those registers and
stores the results back to memory
– The vector registers act as buffers and help hide memory
latency
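The load, operate, store pattern described above can be sketched in Python. This is only an illustration, not a model of any real machine: the 64-element register length and the helper names `vector_load`, `vector_add`, `vector_store`, and `add_arrays` are assumptions made for the example.

```python
# Sketch of the load / operate / store pattern of a vector processor.
# MVL and every helper name below are hypothetical, chosen for illustration.
MVL = 64  # assumed maximum vector length (elements per vector register)

def vector_load(memory, base, length):
    """Copy `length` consecutive elements from memory into a vector register."""
    return memory[base:base + length]

def vector_add(va, vb):
    """One vector instruction: element-wise add of two vector registers."""
    return [a + b for a, b in zip(va, vb)]

def vector_store(memory, base, vreg):
    """Write a vector register back to consecutive memory locations."""
    memory[base:base + len(vreg)] = vreg

def add_arrays(memory, x_base, y_base, z_base, n):
    """Compute Z = X + Y in strips of at most MVL elements.

    Returns the number of vector instructions issued.
    """
    vector_instructions = 0
    for start in range(0, n, MVL):
        length = min(MVL, n - start)
        v1 = vector_load(memory, x_base + start, length)
        v2 = vector_load(memory, y_base + start, length)
        v3 = vector_add(v1, v2)
        vector_store(memory, z_base + start, v3)
        vector_instructions += 4  # two loads, one add, one store
    return vector_instructions

# X, Y, and Z are laid out back to back in one flat "memory" list.
memory = list(range(100)) + [1] * 100 + [0] * 100
count = add_arrays(memory, 0, 100, 200, 100)
print(count, memory[200:205])
```

Each pass through the loop handles up to 64 elements with four vector instructions, whereas a scalar loop would issue several instructions for every single element.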
Vector processor
• Belongs to the SIMD classification
• Also called an array processor
• Improves performance on numerical simulations
• Used in video game consoles and graphics
accelerators
• Examples of SIMD instruction sets: VIS, MMX, SSE, AltiVec, and AVX
VIS (Visual Instruction Set): VIS was developed by Sun Microsystems (now part of Oracle)
for their SPARC processors.
MMX (MultiMedia Extensions): MMX was introduced by Intel in 1997 as an extension to the
x86 instruction set architecture.
SSE (Streaming SIMD Extensions): SSE is an extension of the x86 instruction set
architecture introduced by Intel in 1999.
AltiVec: AltiVec is a SIMD extension for PowerPC processors developed by Apple, IBM, and
Motorola; IBM calls it VMX and Apple calls it Velocity Engine.
AVX (Advanced Vector Extensions): AVX is an extension of the x86 instruction set
architecture proposed by Intel in 2008 and first supported in the Sandy Bridge microarchitecture (2011).
VMIPS
• It is loosely based on the Cray-1
• VMIPS instruction set
– The scalar portion is MIPS
– The vector portion is the logical vector extension of MIPS
• Vector registers:
– There are 8 vector registers
– Each is a fixed-length bank holding a single vector
– Each vector register holds 64 elements of 64 bits
The Cray-1 was the first supercomputer to successfully implement the vector processor design. These systems improve the
performance of math operations by arranging memory and registers to quickly perform a single operation on a large set of data.
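The register dimensions listed above fix the capacity of the vector register file. The short calculation below works it out; the constant names are made up for the example.

```python
# Capacity of the VMIPS vector register file, from the figures in the text:
# 8 vector registers, 64 elements per register, 64 bits per element.
NUM_VECTOR_REGISTERS = 8
ELEMENTS_PER_REGISTER = 64
BITS_PER_ELEMENT = 64

bytes_per_register = ELEMENTS_PER_REGISTER * BITS_PER_ELEMENT // 8
total_bytes = NUM_VECTOR_REGISTERS * bytes_per_register

print(f"{bytes_per_register} bytes per vector register")
print(f"{total_bytes} bytes ({total_bytes // 1024} KiB) in the whole vector register file")
```

A single vector register therefore holds as much data as 64 of the scalar floating-point registers.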
VMIPS
• Vector registers
– The vector register file has at least 16 read ports and 8 write ports
– It supplies operands to the vector functional units (VFUs)
– The register ports and the functional units are connected by a pair of
crossbar switches (thick gray lines)
• Scalar registers
– 32 GPRs and 32 FPRs, as in MIPS
– They can supply scalar operands to the VFUs
– They supply addresses to the vector load/store unit
VMIPS
• Vector functional units
– Each unit is fully pipelined
– Each can start a new operation on every clock cycle
– A control unit is needed to detect hazards
• structural hazards for functional units
• data hazards on register accesses
VMIPS
• VMIPS has five functional units
– Integer unit
– Logical Unit
– Floating point Add/Sub
– Floating point Multiply
– Floating point Divide
VMIPS
• VMIPS has a scalar architecture like MIPS.
• Vector load/store unit
– Loads a vector from memory or stores a vector to memory
– This unit is fully pipelined
– After an initial latency, words can be moved between the vector registers and memory at
one word per clock cycle
– This unit also handles scalar loads and stores
How Vector Processors Work: An Example
Convoy 2:
ADD $t1, $t0, $s1
Vector Execution Time
• Chaining:
– allows a vector operation to start as soon as the
individual elements of its vector source operand
become available
– the results from the first functional unit in the chain
are "forwarded" to the second functional unit
– allows dependent vector instructions to be placed in the same convoy
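The benefit of chaining can be estimated with a simple timing model. The sketch below assumes two dependent vector operations, a vector length of 64, and pipeline latencies of 7 cycles (multiply) and 6 cycles (add); these latencies and the function names are assumptions for illustration, not figures taken from these slides.

```python
# Simplified timing model for two dependent vector operations.
# Latencies and names are assumed for the example.

def unchained_cycles(n, first_latency, second_latency):
    """Without chaining, the second operation waits for the whole
    result vector of the first one before it can start."""
    return (first_latency + n) + (second_latency + n)

def chained_cycles(n, first_latency, second_latency):
    """With chaining, the second operation starts as soon as the first
    element is forwarded, so the two element streams overlap."""
    return first_latency + second_latency + n

n = 64
mul_latency, add_latency = 7, 6  # assumed pipeline depths in cycles
print("unchained:", unchained_cycles(n, mul_latency, add_latency))
print("chained:  ", chained_cycles(n, mul_latency, add_latency))
```

Under these assumptions chaining saves roughly one full vector length of cycles, because the second unit no longer waits for the first to drain.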
In these examples, we'll assume a vector length of 4 elements.
Vector Addition:
ADD $t0, $s1, $s2 # Add the first pair of vector elements
ADD $t1, $s3, $s4 # Add the second pair of vector elements
ADD $t2, $t0, $t1 # Add the results of the previous additions
Vector Average:
ADD $t0, $s1, $s2 # Add the first pair of vector elements
ADD $t1, $s3, $s4 # Add the second pair of vector elements
SRL $t0, $t0, 1 # Shift the first addition result right by 1 (divide by 2)
SRL $t1, $t1, 1 # Shift the second addition result right by 1 (divide by 2)
ADD $t2, $t0, $t1 # Add the averaged results of the previous additions
Vector Execution Time
• Chime
– a timing metric used to estimate the time for a convoy
– simply the unit of time taken to execute one convoy
– a vector sequence that consists of m convoys executes in m
chimes; for a vector length of n,
this is approximately m × n clock cycles on VMIPS
– the chime approximation ignores some processor-specific overheads, so
measuring time in chimes is a better approximation for long
vectors than for short ones
– if we know the number of convoys in a vector sequence, we
know the execution time in chimes
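The m × n estimate, and why it works better for long vectors, can be illustrated with a small calculation. The fixed start-up overhead of 12 cycles per convoy and the function names are assumptions made for the example; real overheads depend on the machine.

```python
# Chime estimate versus a simple model that adds a fixed start-up cost.
# The per-convoy overhead is a hypothetical value for illustration.

def chime_estimate(num_convoys, vector_length):
    """Approximate cycles as m * n, ignoring all overheads."""
    return num_convoys * vector_length

def cycles_with_startup(num_convoys, vector_length, startup_per_convoy):
    """Assumed model: every convoy pays a fixed start-up overhead."""
    return num_convoys * (vector_length + startup_per_convoy)

M = 3          # number of convoys
STARTUP = 12   # assumed start-up overhead per convoy, in cycles

ratios = []
for n in (8, 64, 512):
    estimate = chime_estimate(M, n)
    modelled = cycles_with_startup(M, n, STARTUP)
    ratios.append(estimate / modelled)
    print(f"n={n}: chime estimate {estimate}, with start-up {modelled}, "
          f"ratio {estimate / modelled:.2f}")
```

As n grows, the fixed overhead is spread over more elements, so the ratio approaches 1 and the chime count becomes a closer estimate.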
Vector Execution Time
• Show how the following code sequence lays out in convoys,
assuming a single copy of each vector functional unit:
– LV V1,Rx ;load vector X
– MULVS.D V2,V1,F0 ;vector-scalar multiply
– LV V3,Ry ;load vector Y
– ADDVV.D V4,V2,V3 ;add two vectors
– SV V4,Ry ;store the sum
• How many chimes will this vector sequence take? How many
cycles per FLOP (floating-point operation) are needed, ignoring
vector instruction issue overhead?
MULVS.D is the VMIPS vector-scalar multiply in double precision: it multiplies each element of V1 by the scalar in F0.
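One way to work through the question above is a greedy grouping of the instructions into convoys. The sketch assumes that chaining lets dependent instructions share a convoy, so only a structural hazard (a second instruction needing the same functional unit) starts a new convoy. The table `UNIT_OF` and the function `build_convoys` are hypothetical names for this illustration.

```python
# Greedy convoy formation with a single copy of each functional unit.
# Assumption: chaining is available, so only structural hazards
# (two instructions needing the same unit) force a new convoy.

UNIT_OF = {
    "LV": "load/store",
    "SV": "load/store",
    "MULVS.D": "fp-multiply",
    "ADDVV.D": "fp-add",
}
FLOP_OPCODES = {"MULVS.D", "ADDVV.D"}  # instructions that perform a FLOP per element

def build_convoys(opcodes):
    """Group opcodes, in program order, into convoys free of structural hazards."""
    convoys = [[]]
    busy_units = set()
    for op in opcodes:
        unit = UNIT_OF[op]
        if unit in busy_units:   # unit already taken: start a new convoy
            convoys.append([])
            busy_units = set()
        convoys[-1].append(op)
        busy_units.add(unit)
    return convoys

sequence = ["LV", "MULVS.D", "LV", "ADDVV.D", "SV"]
convoys = build_convoys(sequence)
chimes = len(convoys)
flops_per_element = sum(op in FLOP_OPCODES for op in sequence)
cycles_per_flop = chimes / flops_per_element

for number, convoy in enumerate(convoys, start=1):
    print(f"Convoy {number}: {', '.join(convoy)}")
print("chimes:", chimes, "cycles per FLOP:", cycles_per_flop)
```

Under these assumptions the sequence forms three convoys, so it takes three chimes. With two floating-point operations per element, that is 1.5 cycles per FLOP, or about 64 × 3 = 192 clock cycles for 64-element vectors.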
Vector Execution Time