XX-BSC Compact Vector Processing

Vector processors provide improved performance over superscalar processors for workloads involving linear algebra operations on arrays by mapping loops to vector instructions. They reduce fetch/decode bandwidth and avoid data hazards between elements of the same vector. The basic architecture includes vector and scalar units, with vector registers and functional units. Vector instructions are executed through convoys that avoid structural hazards. Performance is measured by total execution time and MFLOPS. Streaming SIMD Extensions provide vector-like instructions to x86 processors and are useful for multimedia, scientific and financial applications. The Intel Larrabee architecture combines aspects of CPUs and GPUs through its use of in-order cores with 512-bit vector processing units.

Uploaded by

mheba11

Vector Processors

Kavitha Chandrasekar
Sreesudhan Ramkumar
Agenda
• Why Vector processors
• Basic Vector Architecture
• Vector Execution time
• Vector load - store units and Vector memory
systems
• Vector length Control
• Vector stride
Limitations of ILP
• ILP:
– Increase in instruction width (superscalar)
– Increase in machine pipeline depth
– Hence, Increase in number of in-flight instructions
• Need for increase in hardware structures like
ROB, rename register files
• Need to increase logic to track dependences
• Even in VLIW, increase in hardware and logic is
required
Vector Processor
• Work on linear arrays of numbers (vectors)
• Each iteration of a loop becomes one element of the vector
• Overcoming limitations of ILP:
– Dramatic reduction in fetch and decode bandwidth.
– No data hazards between elements of the same vector.
– Data hazard logic is required only between two vector
instructions
– Heavily interleaved memory banks, so the latency of
initiating a memory access (versus a cache access) is
amortized over the whole vector.
– Since loops are reduced to vector instructions, the loop's
control hazards disappear
– Good performance even with poor locality
Basic Architecture
• Vector and Scalar units
• Types:
– Vector-register processors
– Memory-memory Vector
processors
• Vector Units
– Vector registers (with 2
read and 1 write ports)
– Vector functional units
(fully pipelined)
– Vector Load Store unit(fully
pipelined)
– Set of scalar registers
VMIPS vector instructions
MIPS vs VMIPS
(DAXPY loop)
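DAXPY computes Y = a × X + Y in double precision. The sketch below is only an illustration, not the MIPS or VMIPS code from the slide: the function names are hypothetical and Python lists stand in for vector registers. It contrasts an element-at-a-time loop with the same work expressed as two whole-vector operations.

```python
def daxpy_scalar(a, x, y):
    """Scalar style: one multiply and one add per loop iteration."""
    result = list(y)
    for i in range(len(x)):
        result[i] = a * x[i] + y[i]
    return result


def daxpy_vector(a, x, y):
    """Vector style: two whole-vector operations, no per-element loop control."""
    scaled = [a * xi for xi in x]                   # vector-scalar multiply
    return [si + yi for si, yi in zip(scaled, y)]   # vector-vector add


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(daxpy_scalar(2.0, x, y))
print(daxpy_vector(2.0, x, y))
```

Both functions return the same values; the difference the slides care about is how many instructions have to be fetched and decoded to get there.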
Execution time of vector instructions
• Factors:
– length of operand vectors
– structural hazards among operations
– data dependences
• Overhead:
– initiating multiple vector instructions in a clock cycle
– Start-up overhead (more details soon)
Vector Execution time (contd.)
• Terms:
– Convoy:
– set of vector instructions that can begin execution
together in one clock period
– Instructions in a convoy must not have structural or
data hazards among them
– Analogous to placing scalar instructions in a VLIW bundle
– One convoy must finish before another begins
– Chime: Unit of time taken to execute one convoy
• Hence a vector sequence of m convoys executes in m
chimes
• Hence for a vector length of n, time ≈ m × n clock cycles
(ignoring start-up overhead)
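The convoy rule can be sketched as a greedy grouping pass. This is a simplified model under stated assumptions: each instruction is a hypothetical (unit, destination, sources) tuple, a new convoy starts whenever the next instruction needs a functional unit already in use or reads a register written earlier in the same convoy, and chaining is not modeled. The five-instruction sequence is an illustrative DAXPY-style example, not taken from the slides.

```python
def build_convoys(instructions):
    """Group (unit, dest, sources) tuples into convoys, in program order."""
    convoys = []
    current, used_units, written = [], set(), set()
    for unit, dest, sources in instructions:
        structural_hazard = unit in used_units
        data_hazard = any(src in written for src in sources)
        if current and (structural_hazard or data_hazard):
            convoys.append(current)                     # close the current convoy
            current, used_units, written = [], set(), set()
        current.append((unit, dest, sources))
        used_units.add(unit)
        if dest is not None:
            written.add(dest)
    if current:
        convoys.append(current)
    return convoys


def chime_estimate(num_convoys, vector_length):
    """Approximate cycles: m convoys x n elements, start-up ignored."""
    return num_convoys * vector_length


# Hypothetical sequence: load, multiply, load, add, store.
sequence = [
    ("load_store", "V1", []),
    ("multiply", "V2", ["V1"]),
    ("load_store", "V3", []),
    ("add", "V4", ["V2", "V3"]),
    ("load_store", None, ["V4"]),
]
convoys = build_convoys(sequence)
print(len(convoys), chime_estimate(len(convoys), 64))
```

Under these assumptions the sequence splits into four convoys, so a 64-element vector takes roughly 4 × 64 cycles before start-up is counted.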
Example

Convoy
Start-up overhead
• Start-up time: time between initiation of
the instruction and the time the first result
emerges from the pipeline
• Once the pipeline is full, a result is produced every
cycle.
• If vector lengths were infinite, the start-up
overhead would be fully amortized
• But for finite vector lengths, it adds significant
overhead
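The amortization argument can be checked with a small calculation. This sketch assumes the common convention that a single vector operation takes start-up cycles plus one cycle per element; the function names and the 7-cycle start-up value are illustrative.

```python
def vector_op_cycles(startup, n):
    """Cycles for one vector operation: start-up latency plus one cycle per element."""
    return startup + n


def cycles_per_result(startup, n):
    """Average cycles per element; approaches 1 as n grows."""
    return vector_op_cycles(startup, n) / n


for n in (8, 64, 6400):
    print(n, cycles_per_result(7, n))
```

With a 7-cycle start-up, short vectors pay close to two cycles per result, while very long vectors approach the ideal of one.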
Startup overhead-example
Vector Load-Store Units and Vector
Memory Systems
• Start-up time: Time to get first word from
memory into a register
• To produce results every clock multiple memory
banks are used
• Need for multiple memory banks in vector
processors:
– Many vector processors allow multiple loads and
stores per clock cycle
– Support for nonsequential access
– Support for sharing of system memory by multiple
processors
Example
• Number of memory banks required:
Real world issues
• Vector length in a program is not always
fixed (say, at 64)
• Need to access non-adjacent elements in
memory
• Solutions:
– Vector length Control
– Vector Stride
Vector Length Control
• Example:

• Here the value of 'n' might be known only at run time.
• If n is a parameter to a procedure, it can even change
from call to call at run time
• Hence, the VLR (Vector Length Register) is used to control
the length of a vector operation at run time
• MVL (Maximum Vector Length) is the maximum
length of a vector operation (processor dependent)
Vector Length Control(contd.)
• Strip mining:
– Used when a vector operation is longer than MVL: the
loop is split into pieces of at most MVL elements each
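Strip mining can be sketched as an outer loop over blocks of at most MVL elements. This is a minimal illustration, not the code from the slide: the function name is hypothetical, the inner loop stands in for one vector operation whose length would be written to VLR, and the DAXPY body is just a convenient example.

```python
def strip_mined_daxpy(a, x, y, mvl=64):
    """Compute a*x + y in strips of at most mvl elements; also return the strip lengths."""
    n = len(x)
    result = list(y)
    strips = []
    low = 0
    while low < n:
        vl = min(mvl, n - low)             # value that would be placed in VLR
        for i in range(low, low + vl):     # stands in for one vector operation of length vl
            result[i] = a * x[i] + y[i]
        strips.append(vl)
        low += vl
    return result, strips


x = [float(i) for i in range(200)]
y = [1.0] * 200
result, strips = strip_mined_daxpy(2.0, x, y)
print(strips)
```

For n = 200 and MVL = 64 the loop runs four strips: three full ones and a final short one.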
Execution time due to strip mining
• Key factors that contribute to the running time of a strip-mined
loop consisting of a sequence of convoys:
1. Number of convoys in the loop, which determines the number of
chimes.
2. Overhead for each strip-mined sequence of convoys. This
overhead consists of the cost of executing the scalar code for
strip-mining each block, plus the vector start-up cost for each
convoy.
• Total running time for a vector sequence operating on a vector of
length n, Tn:
Tn = ⌈n / MVL⌉ × (Tloop + Tstart) + n × Tchime
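The running-time model can be sketched as a one-line function. It assumes the standard form Tn = ceil(n / MVL) × (Tloop + Tstart) + n × Tchime, built from the terms the slides name (n, MVL, Tloop, Tstart, Tchime); the numeric values in the usage lines are illustrative and not taken from the slides.

```python
import math


def strip_mined_time(n, mvl, t_loop, t_start, t_chime):
    """Total cycles: per-strip overhead for each block, plus n elements times chimes."""
    num_strips = math.ceil(n / mvl)
    return num_strips * (t_loop + t_start) + n * t_chime


# Illustrative values: 200 elements, MVL 64, 15-cycle loop overhead,
# 49-cycle start-up per strip, 3 chimes.
print(strip_mined_time(200, 64, 15, 49, 3))
```

The first term grows in steps each time n crosses a multiple of MVL, which is why the per-element cost is worst for vectors just over a strip boundary.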
Example
Vector Stride
• Handles access to non-adjacent elements in memory
• Example:

• This loop can be strip-mined as a vector multiplication
• Each row of B would be the first operand and each column of C
the second operand
• With column-major memory organization, the elements of a row of B
are non-adjacent
• Stride is the (uniform) distance between those non-adjacent elements.
• Allows access to nonsequential memory elements
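The effect of stride can be shown on a small matrix stored column by column in a flat list. This is a sketch with hypothetical names: `load_with_stride` stands in for a strided vector load, and the 3 × 4 matrix is an arbitrary example.

```python
def load_with_stride(memory, base, stride, length):
    """Read `length` elements starting at `base`, stepping `stride` slots each time."""
    return [memory[base + k * stride] for k in range(length)]


rows, cols = 3, 4
b = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

# Column-major layout: element (i, j) lives at index j * rows + i.
memory = [b[i][j] for j in range(cols) for i in range(rows)]

row1 = load_with_stride(memory, base=1, stride=rows, length=cols)         # row 1 of B
col2 = load_with_stride(memory, base=2 * rows, stride=1, length=rows)     # column 2 of B
print(row1, col2)
```

A column is contiguous (stride 1), while a row needs a stride equal to the number of rows, which is the non-adjacent case the slide describes.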
Vector processors - Contd.
Agenda
• Enhancing Vector performance
• Measuring Vector performance
• SSE Instruction set and Applications
• A case study - Intel Larrabee vector processor
• Pitfalls and Fallacies
Enhancing Vector performance
• General
o Pipelining individual operations of one instruction
o Reducing Startup latency
• Addressing following hazards effectively
o Structural hazards
o Data hazards
o Control hazards
Pipelining & reducing Startup latency
Addressing Structural hazards - Multiple Lanes
• Addressed using pipelining and parallel lanes
Multiple Lanes - Contd.
• Registers & Floating point units are localized
within lanes
Addressing Data hazards - Flexible chaining
• Similar to Forwarding
• Chaining allows a vector operation to start as soon as the
individual elements of its vector source operand become
available
• Example:

Instruction          Start-up time (cycles)   Vector length (elements)
MULV.D V1, V2, V3    7                        64
ADDV.D V4, V1, V5    6                        64
Flexible Chaining - Contd.

Sequence: MULV.D V1, V2, V3 followed by ADDV.D V4, V1, V5
(VL = vector length, ST = start-up time; M = multiply, A = add)

                      Unchained                         Chained
Time (cycles)         VL_M + VL_A + ST_M + ST_A = 141   VL_M/A + ST_M + ST_A = 77
Cycles / result       141 / 64 = 2.2                    77 / 64 = 1.2
FLOPS / clock cycle   128 / 141 = 0.9                   128 / 77 = 1.7
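The chained and unchained figures can be reproduced with a short calculation. The sketch uses the start-up times and vector length from the example (7 and 6 cycles, 64 elements); the function names are hypothetical, and the model assumes the dependent add either waits for the whole multiply (unchained) or overlaps with it element by element (chained).

```python
def unchained_time(vl, st_mul, st_add):
    """The add starts only after the multiply has produced all vl elements."""
    return (st_mul + vl) + (st_add + vl)


def chained_time(vl, st_mul, st_add):
    """The add consumes each multiply result as soon as it appears."""
    return st_mul + st_add + vl


vl, st_mul, st_add = 64, 7, 6
flops = 2 * vl   # one multiply and one add per element

for name, cycles in (("unchained", unchained_time(vl, st_mul, st_add)),
                     ("chained", chained_time(vl, st_mul, st_add))):
    print(name, cycles, round(cycles / vl, 1), round(flops / cycles, 1))
```

Chaining removes one full vector length from the critical path, which is where the drop from 141 to 77 cycles comes from.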
Addressing Control hazards - Vector mask

• Loops containing a control statement (e.g. an IF in the loop
body) can't run in vector mode directly
• Solution:
o Convert the control dependence into a data dependence by
evaluating the condition for every element and recording
the results in the vector mask register
o Run the dependent instructions in vector mode; they take
effect only for elements whose bit in the vector mask
register is set
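Masked execution can be sketched as follows. The loop body chosen here (subtract B from A only where A is nonzero) is an illustrative assumption, and the function name is hypothetical; a list of booleans plays the role of the vector mask register.

```python
def masked_subtract(a, b):
    """Compute a[i] - b[i] only where a[i] != 0; other elements are left unchanged."""
    mask = [ai != 0 for ai in a]    # condition evaluated for every element
    result = [ai - bi if m else ai  # operation takes effect only where the mask is set
              for ai, bi, m in zip(a, b, mask)]
    return result, mask


result, mask = masked_subtract([4, 0, 7, 0], [1, 1, 1, 1])
print(result, mask)
```

Note that the operation still occupies a slot for every element, masked or not, which is the cost the scatter and gather method tries to avoid.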
Vector mask - Contd.
Improving Vector mask - Scatter & Gather method

• Step 1: Set the VM bits to 1 where the control condition holds
• Step 2: CVI - Create Vector Index based on VM
o Create an index vector which points to the addresses of the
valid elements
• Step 3: LVI - Load Vector Index (GATHER)
o Load the valid operands using the index vector from step 2
• Step 4: Execute the arithmetic operation on the compressed vector
• Step 5: SVI - Store Vector Index (SCATTER)
o Store the valid outputs using the index vector from step 2
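The five steps above can be sketched on plain lists. As with any illustration of this kind, the names are hypothetical and the operation (subtract B from A where A is nonzero) is an assumed example; list indexing stands in for the gather and scatter memory operations.

```python
def scatter_gather_subtract(a, b):
    """Apply a[i] - b[i] only to elements where a[i] != 0, using gather and scatter."""
    mask = [ai != 0 for ai in a]                      # Step 1: set the mask
    index = [i for i, m in enumerate(mask) if m]      # Step 2: create the index vector
    a_packed = [a[i] for i in index]                  # Step 3: gather valid operands
    b_packed = [b[i] for i in index]
    packed = [x - y for x, y in zip(a_packed, b_packed)]  # Step 4: compute on compressed vector
    result = list(a)
    for i, value in zip(index, packed):               # Step 5: scatter results back
        result[i] = value
    return result, index


result, index = scatter_gather_subtract([4, 0, 7, 0], [1, 1, 1, 1])
print(result, index)
```

The arithmetic step now runs over only the valid elements, at the price of the extra index, gather and scatter operations.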
Scatter & Gather - Contd.
Comparison - Basic vector mask & Scatter - Gather
• Conclusion: Scatter & Gather will run faster if fewer than
one-quarter of the elements are nonzero
Enhancing Vector performance - Summary

• General
o Pipelining individual operations of one instruction
o Reducing Startup latency
• Structural hazards
o Multiple Lanes
• Data hazards
o Flexible chaining
• Control hazards
o Basic vector mask
o Scatter & Gather
Measuring Vector Performance - Total execution time
Scale for measuring performance:
• Total execution time of the vector loop - Tn
o Used to compare performance of different instruction
sequences on the same processor
o Tn = ⌈n / MVL⌉ × (Tloop + Tstart) + n × Tchime
o Unit - clock cycles
o n - vector length
o MVL - maximum vector length
o Tloop - loop overhead
o Tstart - start-up overhead
o Tchime - number of chimes (convoys) in the loop body
Measuring Vector Performance - MFLOPS
• MFLOPS - Millions of FLoating-point Operations Per Second
o Used to compare performance of two different processors
• Rn - MFLOPS rate on a vector of length n
o Unit - millions of operations / second
• Rinfinity - MFLOPS rate on an infinite-length vector
(theoretical / peak performance)
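A rate in MFLOPS follows from an operation count, a cycle count and a clock rate. The sketch below assumes Rn = (floating-point operations × clock rate) / Tn, scaled to millions; the function name, the 500 MHz clock and the cycle count are illustrative values, not figures from the slides.

```python
def mflops(flops_per_element, n, total_cycles, clock_hz):
    """Millions of floating-point operations per second for a loop over n elements."""
    seconds = total_cycles / clock_hz
    return flops_per_element * n / seconds / 1e6


# Illustrative: 2 FLOPs per element, 200 elements, 856 cycles, 500 MHz clock.
rate = mflops(2, 200, 856, 500e6)
print(round(rate, 1))
```

Rinfinity is the limit of the same expression as n grows, where the per-strip overhead no longer matters.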
SSE Instructions
• Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to
the x86 architecture
• SSE instructions are similar to vector instructions.
• SSE originally added eight new 128-bit registers known as XMM0
through XMM7
• Each register packs together:
– four 32-bit single-precision floating-point numbers or
– two 64-bit double-precision floating-point numbers or
– two 64-bit integers or
– four 32-bit integers or
– eight 16-bit short integers or
– sixteen 8-bit bytes or characters.
SSE Instruction set & Applications
• Sample instruction set for floating point operations
o Scalar – ADDSS, SUBSS, MULSS, DIVSS
o Packed – ADDPS, SUBPS, MULPS, DIVPS
• Example

• Applications - multimedia, scientific and financial applications
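The difference between the scalar (SS) and packed (PS) forms can be emulated on four-element lists. This is a behavioral sketch in Python, not real SSE code: the function names simply mirror the ADDSS and ADDPS mnemonics, and element 0 stands for the lowest 32-bit lane of an XMM register.

```python
def addps(dst, src):
    """Packed add: all four single-precision lanes are added."""
    return [d + s for d, s in zip(dst, src)]


def addss(dst, src):
    """Scalar add: only the lowest lane is added; the upper lanes of dst pass through."""
    return [dst[0] + src[0]] + list(dst[1:])


xmm0 = [1.0, 2.0, 3.0, 4.0]
xmm1 = [10.0, 20.0, 30.0, 40.0]
print(addps(xmm0, xmm1))
print(addss(xmm0, xmm1))
```

The packed form does four additions with one instruction, which is the vector-like behavior the slide refers to.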


A Case study - Intel Larrabee Architecture
• Code name for a many-core visual computing
architecture
• Intel's new approach to a GPU
• Considered to be a hybrid between a
multi-core CPU and a GPU
• Combines functions of a multi-core CPU
with the functions of a GPU
Larrabee - The Big picture
• In-order execution (execution is also more deterministic, so
instruction and task scheduling can be done by the compiler)
• Each Larrabee core contains a 512-bit vector
processing unit, able to process 16 single-precision
floating-point numbers at a time.
• Uses an extended x86 instruction set with additional features
like scatter / gather instructions and a mask register,
designed to make using the vector unit easier and more
efficient.
Larrabee VPU Architecture
• 16-wide vector ALU in each core
• Executes integer, single-precision float
and double-precision float instructions
• Choice of 16: a tradeoff between
increased computational density and the
difficulty of achieving high utilization
with a wider unit
• Supports swizzling and replication
• Mask register and index register
operations
Larrabee Data types
• 32 512-bit vector registers & 8 16-bit vector mask registers
• Each vector register can be treated as
o 16 wide - to store 16 float32s or 16 int32s
o 8 wide - to store 8 float64s or 8 int64s
Larrabee Instruction set
• vector arithmetic, logic and shift
• vector mask generation
• vector load / store
• swizzling
• vector multiply-add and multiply-sub instructions
Past, Present & Future of Vector
processors
• Past
o Cray X1
o Earth simulator
• Present
o Cray Jaguar
o Larrabee
• Future: AVX (Advanced Vector Extensions)
o Sandy Bridge (Intel)
o Bulldozer (AMD)
Pitfalls and Fallacies

• Pitfalls:
o Concentrating on peak performance and ignoring start up
overhead (on memory-memory vector architecture)
o Increasing Vector performance, without comparable
increase in scalar performance
• Fallacy
o You can get vector performance without providing memory
bandwidth (by reusing vector registers)
Recap

• Why Vector processors


• Basic Vector Architecture
• Vector Execution time
• Vector load - store units and Vector memory systems
• Vector length - VLR
• Vector stride
• Enhancing Vector performance
• Measuring Vector performance
• SSE Instruction set and Applications
• A case study - Intel Larrabee vector processor
• Pitfalls and Fallacies
References

• Computer Architecture: A Quantitative Approach, 4th edition (Appendix A, F &
G, chapters 2 & 3)
• Cray X1: http://www.supercomp.org/sc2003/paperpdfs/pap183.pdf
• Larrabee official page on Intel: http://software.intel.com/en-us/articles/larrabee/
• Larrabee: http://www.gpucomputing.org/drdobbs_042909_final.pdf
• http://www.vizworld.com/2009/05/new-whitepapers-from-intel-about-larrabee/
Thank you.
