Pipelining and vector processing can improve parallelism and throughput. Pipelining breaks processes into sequential stages that can partially overlap to improve throughput. Vector processing performs the same operation on multiple data elements concurrently using SIMD. Examples show how pipelining a laundry process into washing, drying, and folding stages reduces time from 6 hours to 3.5 hours for 4 loads. Vector instructions can multiply or add multiple elements at once. Pipelining speedup approaches the number of stages for large problems, while vector processing speedup approaches the number of concurrent elements.


UNIT-V: PIPELINING AND VECTOR PROCESSING

Pipeline and Vector Processing
Dr. Bernard Chen Ph.D.
University of Central Arkansas
Spring 2009
Parallel processing

 A parallel processing system is able to perform concurrent data processing to achieve faster execution time

 The system may have two or more ALUs and be able to execute two or more instructions at the same time

 The goal is to increase the throughput – the amount of processing that can be accomplished during a given interval of time
Parallel processing classification

Single instruction stream, single data stream – SISD

Single instruction stream, multiple data stream – SIMD

Multiple instruction stream, single data stream – MISD

Multiple instruction stream, multiple data stream – MIMD


Single instruction stream, single data stream – SISD

 Single control unit, single computer, and a memory unit

 Instructions are executed sequentially. Parallel processing may be achieved by means of multiple functional units or by pipeline processing
Single instruction stream, multiple data stream – SIMD

 Represents an organization that includes many processing units under the supervision of a common control unit.

 Includes multiple processing units with a single control unit. All processors receive the same instruction, but operate on different data.
Multiple instruction stream, single data stream – MISD

 Theoretical only

 Processors receive different instructions, but operate on the same data.
Multiple instruction stream, multiple data stream – MIMD

 A computer system capable of processing several programs at the same time.

 Most multiprocessor and multicomputer systems can be classified in this category
Pipelining: Laundry Example

 A small laundry has one washer, one dryer and one operator; it takes 90 minutes to finish one load (loads A, B, C, D):
  Washer takes 30 minutes
  Dryer takes 40 minutes
  "Operator folding" takes 20 minutes
Sequential Laundry

[Figure: task-order timeline from 6 PM to midnight; loads A, B, C, D each occupy a 90-minute slot (30 wash + 40 dry + 20 fold), one after another.]

 This operator scheduled his loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he will not start a new task unless he is already done with the previous task.
 The process is sequential. Sequential laundry takes 6 hours for 4 loads.
Efficiently Scheduled Laundry: Pipelined Laundry

 The operator starts work ASAP.

[Figure: task-order timeline from 6 PM to midnight; load A starts at 6 PM, and loads B, C, D each start 40 minutes after the previous one, so the washing, drying, and folding of different loads overlap.]

 Another operator asks for the delivery of loads to the laundry every 40 minutes.
 Pipelined laundry takes 3.5 hours for 4 loads.
Pipelining Facts

[Figure: pipelined laundry timeline from 6 PM to about 9:30 PM; stage times 30, 40, 40, 40, 40, 20 minutes; the washer waits for the dryer for 10 minutes.]

 Multiple tasks operate simultaneously
 Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
 Pipeline rate is limited by the slowest pipeline stage
 Potential speedup = number of pipe stages
 Unbalanced lengths of pipe stages reduce speedup
 Time to "fill" the pipeline and time to "drain" it reduce speedup
9.2 Pipelining
• Decomposes a sequential process into segments.
• Divides the processor into segment processors, each dedicated to a particular segment.
• Each segment is executed in its dedicated segment processor, which operates concurrently with all other segments.
• Information flows through these multiple hardware segments.
9.2 Pipelining
 Instruction execution is divided into k segments or stages
 An instruction exits pipe stage k-1 and proceeds into pipe stage k
 All pipe stages take the same amount of time, called one processor cycle
 The length of the processor cycle is determined by the slowest pipe stage

[Figure: a pipeline of k segments.]
9.2 Pipelining
 Suppose we want to perform the combined multiply and add operations with a stream of numbers:

 Ai * Bi + Ci   for i = 1, 2, 3, …, 7
9.2 Pipelining
 The suboperations performed in each segment of the pipeline are as follows:

 Segment 1: R1 ← Ai, R2 ← Bi
 Segment 2: R3 ← R1 * R2, R4 ← Ci
 Segment 3: R5 ← R3 + R4
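As an illustration only (not from the original slides), the following C sketch steps these three segments clock by clock for the stream Ai*Bi + Ci, i = 1..7; R1-R5 are the inter-segment registers named above, and downstream segments are updated first within each simulated clock so every segment reads the values latched in the previous clock.

#include <stdio.h>

#define N 7

int main(void) {
    double A[N] = {1, 2, 3, 4, 5, 6, 7};
    double B[N] = {2, 2, 2, 2, 2, 2, 2};
    double C[N] = {1, 1, 1, 1, 1, 1, 1};
    double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

    /* One loop iteration = one clock; all three segments are active at once. */
    for (int clk = 0; clk < N + 2; clk++) {
        if (clk >= 2) R5 = R3 + R4;                                   /* segment 3: add      */
        if (clk >= 1 && clk <= N) { R3 = R1 * R2; R4 = C[clk - 1]; }  /* segment 2: multiply */
        if (clk < N) { R1 = A[clk]; R2 = B[clk]; }                    /* segment 1: fetch    */
        if (clk >= 2) printf("A%d*B%d+C%d = %g\n", clk - 1, clk - 1, clk - 1, R5);
    }
    return 0;
}

After the two fill cycles, one result emerges per clock, which is exactly the overlap the three segments provide.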
Pipeline Performance

 n: number of instructions (equivalent to the number of loads in the laundry example)
 k: number of stages in the pipeline (washing, drying, and folding in the laundry example)
 τ: clock cycle (the time of the slowest stage)
 Tk: total time
SPEEDUP
 Consider a k-segment pipeline operating on n data sets. (In the laundry example, k = 3 and n = 4.)

 It takes k clock cycles to fill the pipeline and get the first result from the output of the pipeline.

 After that, the remaining (n - 1) results come out at one per clock cycle.

 It therefore takes (k + n - 1) clock cycles to complete the task.
SPEEDUP
 If we execute the same task sequentially in a single processing unit, it takes (k * n) clock cycles.
 The speedup gained by using the pipeline is:
 S = k * n / (k + n - 1)
SPEEDUP
 S = k * n / (k + n - 1)

 For n >> k (such as 1 million data sets on a 3-stage pipeline),
 S ≈ k
 So for large data sets the speedup approaches the number of pipeline segments. This is because the segments work in parallel except during the filling and draining cycles.
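A quick numerical check of this limit (a small sketch added here, not part of the slides), evaluating the formula for the laundry example and for a very large n:

#include <stdio.h>

/* Pipeline speedup S = k*n / (k + n - 1) for a k-segment pipeline and n data sets. */
static double pipeline_speedup(double k, double n) {
    return k * n / (k + n - 1.0);
}

int main(void) {
    printf("laundry (k=3, n=4):       S = %.2f\n", pipeline_speedup(3, 4));        /* 2.00 */
    printf("large n (k=3, n=1000000): S = %.2f\n", pipeline_speedup(3, 1000000));  /* ~3.00, i.e. ~k */
    return 0;
}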
Example: 6 tasks, divided into 4 segments

Clock cycle: 1   2   3   4   5   6   7   8   9
Segment 1:   T1  T2  T3  T4  T5  T6
Segment 2:       T1  T2  T3  T4  T5  T6
Segment 3:           T1  T2  T3  T4  T5  T6
Segment 4:               T1  T2  T3  T4  T5  T6

Total time = k + n - 1 = 4 + 6 - 1 = 9 clock cycles.
Vector Processing

Review: Instruction Level Parallelism
High-speed execution based on instruction-level parallelism (ILP): the potential of short instruction sequences to execute in parallel
High-speed microprocessors exploit ILP by:
1) pipelined execution: overlap instructions
2) superscalar execution: issue and execute multiple instructions per clock cycle
3) out-of-order execution (commit in order)
Memory accesses for a high-speed microprocessor?
Data cache, possibly multiported, multiple levels
Review
Speculation: out-of-order execution, in-order commit (reorder buffer + rename registers) => precise exceptions
Software pipelining: symbolic loop unrolling (instructions from different iterations) to optimize the pipeline with little code expansion and little overhead
Superscalar and VLIW: CPI < 1 (IPC > 1)
Dynamic issue vs. static issue: more instructions issued at the same time => larger hazard penalty
# independent instructions = # functional units x latency
Branch prediction:
Branch History Table: 2 bits for loop accuracy
Recently executed branches correlated with the next branch?
Branch Target Buffer: includes branch address & prediction
Predicated execution can reduce the number of branches and the number of mispredicted branches
Review: Theoretical Limits to ILP? (Figure 4.48, Page 332)

[Figure: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doducd, and tomcatv assuming perfect memory disambiguation (HW), 1K-entry selective prediction, a 16-entry return stack, 64 registers, and issue of as many instructions as the window allows, for window sizes of infinite, 256, 128, 64, 32, 16, 8, and 4. FP programs reach roughly 8-45 IPC; integer programs roughly 6-12.]
Problems with the conventional approach

Limits to conventional exploitation of ILP:
1) pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
2) instruction fetch and decode: at some point, it's hard to fetch and decode more instructions per clock cycle
3) cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality
Alternative Model: Vector Processing

Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

[Figure: a SCALAR add (1 operation), add r3,r1,r2, combines registers r1 and r2 into r3; a VECTOR add (N operations), add.vv v3,v1,v2, adds corresponding elements of vector registers v1 and v2 over the whole vector length into v3.]
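To see the same contrast in source form, here is a minimal C sketch; it assumes a GCC/Clang compiler that supports the vector_size extension, and the 4-element width and function names are only illustrative.

/* Scalar vs. vector add in C (GCC/Clang vector_size extension). */
typedef double v4df __attribute__((vector_size(32)));   /* 4 x 64-bit elements */

double scalar_add(double r1, double r2) {
    return r1 + r2;          /* SCALAR: 1 operation, like  add r3,r1,r2       */
}

v4df vector_add(v4df v1, v4df v2) {
    return v1 + v2;          /* VECTOR: N element-wise adds, like add.vv v3,v1,v2 */
}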


Properties of Vector Processors

Each result is independent of previous results
=> long pipeline, compiler ensures no dependencies
=> high clock rate
Vector instructions access memory with a known pattern
=> highly interleaved memory
=> memory latency amortized over 64 elements
=> no (data) caches required! (Do use an instruction cache)
Reduces branches and branch problems in pipelines
A single vector instruction implies lots of work (a whole loop)
=> fewer instruction fetches
Operation & Instruction Count: RISC vs. Vector Processor
(Spec92fp; operations and instructions in millions; from F. Quintana, U. Barcelona)

Program   RISC ops (M)  Vector ops (M)  R/V    RISC instr (M)  Vector instr (M)  R/V
swim256   115           95              1.1x   115             0.8               142x
hydro2d   58            40              1.4x   58              0.8               71x
nasa7     69            41              1.7x   69              2.2               31x
su2cor    51            35              1.4x   51              1.8               29x
tomcatv   15            10              1.4x   15              1.3               11x
wave5     27            25              1.1x   27              7.2               4x
mdljdp2   32            52              0.6x   32              15.8              2x

Vector reduces operations by 1.2X and instructions by 20X

Styles of Vector Architectures
• memory-memory vector processors: all vector operations are memory to memory
• vector-register processors: all vector operations are between vector registers (except load and store)
  The vector equivalent of load-store architectures
  Includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC
  We assume vector-register machines for the rest of these lectures
Components of a Vector Processor
• Vector registers: fixed-length banks, each holding a single vector
  at least 2 read ports and 1 write port
  typically 8-32 vector registers, each holding 64-128 64-bit elements
• Vector functional units (FUs): fully pipelined, start a new operation every clock
  typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; may have multiples of the same unit
• Vector load-store units (LSUs): fully pipelined units to load or store a vector; may have multiple LSUs
• Scalar registers: single element for an FP scalar or an address
• Cross-bar to connect FUs, LSUs, registers
"DLXV" Vector Instructions

Instr.   Operands   Operation                      Comment
ADDV     V1,V2,V3   V1 = V2 + V3                   vector + vector
ADDSV    V1,F0,V2   V1 = F0 + V2                   scalar + vector
MULTV    V1,V2,V3   V1 = V2 x V3                   vector x vector
MULSV    V1,F0,V2   V1 = F0 x V2                   scalar x vector
LV       V1,R1      V1 = M[R1..R1+63]              load, stride = 1
LVWS     V1,R1,R2   V1 = M[R1..R1+63*R2]           load, stride = R2
LVI      V1,R1,V2   V1 = M[R1+V2(i)], i = 0..63    indirect ("gather")
CeqV     VM,V1,V2   VMASK(i) = (V1(i) == V2(i))    compare, set mask
MOV      VLR,R1     Vec. Len. Reg. = R1            set vector length
MOV      VM,R1      Vec. Mask = R1                 set vector mask
Memory operations
Load/store operations move groups of data
between registers and memory
Three types of addressing
– Unit stride
• Fastest
– Non-unit (constant) stride
– Indexed (gather-scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases number of programs that vectorize

DAXPY (Y = a * X + Y): Scalar vs. Vector
Assuming vectors X and Y are of length 64.

Vector (DLXV) code:
  LD    F0,a        ;load scalar a
  LV    V1,Rx       ;load vector X
  MULTS V2,F0,V1    ;vector-scalar mult.
  LV    V3,Ry       ;load vector Y
  ADDV  V4,V2,V3    ;add
  SV    Ry,V4       ;store the result

Scalar (DLX) code:
        LD    F0,a        ;load scalar a
        ADDI  R4,Rx,#512  ;last address to load
  loop: LD    F2,0(Rx)    ;load X(i)
        MULTD F2,F0,F2    ;a*X(i)
        LD    F4,0(Ry)    ;load Y(i)
        ADDD  F4,F2,F4    ;a*X(i) + Y(i)
        SD    F4,0(Ry)    ;store into Y(i)
        ADDI  Rx,Rx,#8    ;increment index to X
        ADDI  Ry,Ry,#8    ;increment index to Y
        SUB   R20,R4,Rx   ;compute bound
        BNZ   R20,loop    ;check if done

578 (2 + 9*64) vs. 321 (1 + 5*64) operations (1.8X)
578 (2 + 9*64) vs. 6 instructions (96X)
64-operation vectors + no loop overhead; also 64X fewer pipeline hazards
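For reference, the same kernel written as a plain C loop (a sketch, not part of the original slides); with n = 64 it corresponds to the code above, and a vectorizing compiler can map it onto vector hardware.

/* DAXPY reference in C: Y = a*X + Y over n elements. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* one multiply-add per element */
}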
Example Vector Machines

Machine     Year  Clock    Regs   Elements  FUs  LSUs
Cray 1      1976  80 MHz   8      64        6    1
Cray XMP    1983  120 MHz  8      64        8    2 L, 1 S
Cray YMP    1988  166 MHz  8      64        8    2 L, 1 S
Cray C-90   1991  240 MHz  8      128       8    4
Cray T-90   1996  455 MHz  8      128       8    4
Conv. C-1   1984  10 MHz   8      128       4    1
Conv. C-4   1994  133 MHz  16     128       3    1
Fuj. VP200  1982  133 MHz  8-256  32-1024   3    2
Fuj. VP300  1996  100 MHz  8-256  32-1024   3    2
NEC SX/2    1984  160 MHz  8+8K   256+var   16   8
NEC SX/3    1995  400 MHz  8+8K   256+var   16   8
Vector Linpack Performance (MFLOPS)
Matrix inverse (Gaussian elimination)

Machine     Year  Clock    100x100  1k x 1k  Peak (Procs)
Cray 1      1976  80 MHz   12       110      160 (1)
Cray XMP    1983  120 MHz  121      218      940 (4)
Cray YMP    1988  166 MHz  150      307      2,667 (8)
Cray C-90   1991  240 MHz  387      902      15,238 (16)
Cray T-90   1996  455 MHz  705      1,603    57,600 (32)
Conv. C-1   1984  10 MHz   3        --       20 (1)
Conv. C-4   1994  135 MHz  160      2,531    3,240 (4)
Fuj. VP200  1982  133 MHz  18       422      533 (1)
NEC SX/2    1984  166 MHz  43       885      1,300 (1)
NEC SX/3    1995  400 MHz  368      2,757    25,600 (4)
Vector Surprise
Use vectors for inner-loop parallelism (no surprise)
  One dimension of an array: A[0,0], A[0,1], A[0,2], ...
  Think of the machine as, say, 32 vector registers, each with 64 elements
  1 instruction updates 64 elements of 1 vector register
...and for outer-loop parallelism!
  1 element from each column: A[0,0], A[1,0], A[2,0], ...
  Think of the machine as 64 "virtual processors" (VPs), each with 32 scalar registers! (a multithreaded processor)
  1 instruction updates 1 scalar register in 64 VPs
The hardware is identical; these are just 2 compiler perspectives
Virtual Processor Vector Model

 Vector operations are SIMD (single instruction, multiple data) operations
 Each element is computed by a virtual processor (VP)
 The number of VPs is given by the vector length (vector control register)
Vector Architectural State

[Figure: vlr virtual processors VP0 .. VP(vlr-1). Each VP holds one element of the general-purpose vector registers vr0-vr31 (#vdw bits wide) and of the flag registers vf0-vf31 (1 bit each); 32 shared 32-bit vector control registers vcr0-vcr31 are common to all VPs.]
Vector Implementation
Vector register file
  Each register is an array of elements
  The size of each register determines the maximum vector length
  The vector length register determines the vector length for a particular operation
Multiple parallel execution units = "lanes" (sometimes called "pipelines" or "pipes")
[Figure: 4 lanes, 2 vector functional units; each lane contains one pipeline of each vector functional unit.]
Tentative VIRAM-1 Floorplan

[Floorplan figure] Key parameters:
 0.18 µm DRAM: 32 MB in 16 banks x 256b, 128 subbanks (two blocks of 128 Mbits / 16 MBytes)
 0.25 µm, 5-metal logic
 200 MHz MIPS CPU, 16K I$, 16K D$
 4 x 200 MHz FP/integer vector pipes/lanes
 Ring-based switch, I/O
 die: 16x16 mm; transistors: 270M; power: 2 Watts
Vector Execution Time
Time = f(vector length, data dependencies, structural hazards)
• Initiation rate: rate at which an FU consumes vector elements (= number of lanes; usually 1 or 2 on the Cray T-90)
• Convoy: set of vector instructions that can begin execution in the same clock (no structural or data hazards)
• Chime: approximate time for a vector operation
• m convoys take m chimes; if each vector length is n, they take approximately m x n clock cycles (ignores overhead; a good approximation for long vectors)

Example (4 convoys, 1 lane, VL = 64 => 4 x 64 = 256 clocks, or 4 clocks per result):
1: LV   V1,Rx      ;load vector X
2: MULV V2,F0,V1   ;vector-scalar mult.
   LV   V3,Ry      ;load vector Y
3: ADDV V4,V2,V3   ;add
4: SV   Ry,V4      ;store the result
DLXV Start-up Time
• Start-up time: pipeline latency (depth of the FU pipeline) plus other sources of overhead

Operation          Start-up penalty (from CRAY-1)
Vector load/store  12
Vector multiply    7
Vector add         6

Assume convoys don't overlap; vector length = n:
Convoy       Start    1st result  Last result
1. LV        0        12          11+n   (12+n-1)
2. MULV, LV  12+n     12+n+7      18+2n  (multiply start-up)
             12+n+1   12+n+13     24+2n  (load start-up)
3. ADDV      25+2n    25+2n+6     30+3n  (wait for convoy 2)
4. SV        31+3n    31+3n+12    42+4n  (wait for convoy 3)
Why startup time for each vector instruction?
Why not overlap the startup time of back-to-back vector instructions?
Cray machines were built from many ECL chips operating at high clock rates; overlapping startup was hard to do.
The Berkeley vector design ("T0") didn't know it wasn't supposed to overlap, so it has no startup times for functional units (except load).
Vector Load/Store Units & Memories
Start-up overheads are usually longer for LSUs
The memory system must sustain (# lanes x word) per clock cycle
Many vector processors use banks (versus simple interleaving):
1) to support multiple loads/stores per cycle => multiple banks, addressed independently
2) to support non-sequential accesses (see soon)
Note: the number of memory banks must exceed the memory latency to avoid stalls
  m banks => m words per memory latency of l clocks
  if m < l, there is a gap in the memory pipeline:
    clock: 0 … l  l+1  l+2 … l+m-1  l+m … 2l
    word:  -- … 0  1    2  … m-1    --  … m
  may have 1024 banks in SRAM
Vector Length
What to do when the vector length is not exactly 64?
• The vector-length register (VLR) controls the length of any vector operation, including a vector load or store. (It cannot be greater than the length of the vector registers.)

   do 10 i = 1, n
10    Y(i) = a * X(i) + Y(i)

We don't know n until runtime! What if n > Max. Vector Length (MVL)?
Strip Mining
Suppose Vector Length > Max. Vector Length (MVL)?
• Strip mining: generation of code such that each vector operation is done for a size ≤ MVL
The 1st loop does the short piece (n mod MVL); the rest use VL = MVL (see the C sketch after this Fortran code):
low = 1
VL = (n mod MVL) /*find the odd size piece*/
do 1 j = 0,(n / MVL) /*outer loop*/
do 10 i = low,low+VL-1 /*runs for length VL*/
Y(i) = a*X(i) + Y(i) /*main operation*/
10 continue
low = low+VL /*start of next vector*/
VL = MVL /*reset the length to max*/
Loop Overhead!
1 continue
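The same idea in C, as a minimal sketch under the assumption of a 64-element maximum vector length (the macro, function name, and loop structure are illustrative):

/* Strip-mined DAXPY: process n elements in pieces of at most MVL. */
#define MVL 64

void daxpy_stripmined(int n, double a, const double *x, double *y) {
    int low = 0;
    int vl = n % MVL;                        /* odd-size first piece */
    if (vl == 0) vl = MVL;                   /* n is a multiple of MVL */
    while (low < n) {
        for (int i = low; i < low + vl; i++) /* one "vector operation" of length vl */
            y[i] = a * x[i] + y[i];
        low += vl;                           /* start of the next piece */
        vl = MVL;                            /* reset the length to maximum */
    }
}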
Common Vector Metrics
• R∞: MFLOPS rate on an infinite-length vector
  the vector "speed of light"
  Real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger
  (Rn is the MFLOPS rate for a vector of length n)
• N1/2: the vector length needed to reach one-half of R∞
  a good measure of the impact of start-up
• Nv: the vector length needed to make vector mode faster than scalar mode
  measures both start-up and the speed of scalars relative to vectors, and the quality of the connection of the scalar unit to the vector unit
Vector Stride
Suppose adjacent elements not sequential in memory
do 10 i = 1,100
do 10 j = 1,100
A(i,j) = 0.0
do 10 k = 1,100
10 A(i,j) = A(i,j)+B(i,k)*C(k,j)
Either B or C accesses not adjacent (800 bytes between)
• stride: distance separating elements that are to be merged
into a single vector (caches do unit stride)
=> LVWS (load vector with stride) instruction
Strides => can cause bank conflicts
(e.g., stride = 32 and 16 banks)
Think of address per vector element
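As a concrete C illustration of where non-unit strides come from (a sketch, assuming a row-major 100x100 matrix of doubles as in the loop above; the function name is illustrative):

/* Walking down a column of a row-major 100x100 double matrix touches every
   100th element, i.e. elements 800 bytes apart: a non-unit stride. */
double column_sum(double b[100][100], int j) {
    double s = 0.0;
    for (int k = 0; k < 100; k++)
        s += b[k][j];        /* consecutive accesses are 100 doubles (800 bytes) apart */
    return s;
}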
Compiler Vectorization on Cray XMP

Benchmark  %FP   %FP in vector
ADM        23%   68%
DYFESM     26%   95%
FLO52      41%   100%
MDG        28%   27%
MG3D       31%   86%
OCEAN      28%   58%
QCD        14%   1%
SPICE      16%   7% (1% overall)
TRACK      9%    23%
TRFD       22%   10%
Vector Opt #1: Chaining
Suppose:
  MULV V1,V2,V3
  ADDV V4,V1,V5   ; separate convoy?
• Chaining: the vector register (V1) is treated not as a single entity but as a group of individual registers; pipeline forwarding can then work on individual elements of a vector
• Flexible chaining: allow a vector to chain to any other active vector operation => more read/write ports
As long as there is enough hardware, chaining increases convoy size
[Timing, with start-up 7 for multv and 6 for addv over 64 elements:
 Unchained: multv (7 + 64) followed by addv (6 + 64) = 141 clocks
 Chained:   addv starts as soon as the first multv result is available: 7 + 6 + 64 = 77 clocks]
Example Execution of Vector Code

[Figure: scalar pipeline, vector memory pipeline, vector multiply pipeline, and vector adder pipeline executing vector code concurrently; 8 lanes, vector length 32, with chaining.]
Vector Opt #2: Conditional Execution
Suppose:
do 100 i = 1, 64
if (A(i) .ne. 0) then
A(i) = A(i) – B(i)
endif
100 continue
• vector-mask control takes a Boolean vector: when vector-mask
register is loaded from vector test, vector instructions operate
only on vector elements whose corresponding entries in the
vector-mask register are 1.
Still requires a clock cycle per element even if the result is not stored; if the operation is still performed, what about divide by 0?
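The effect of vector-mask control, written out as a scalar C sketch (assumes the vector length n is at most 64; names are illustrative, not a specific ISA):

/* Build a mask from the test, then subtract only where the mask is 1. */
void masked_sub(int n, double *a, const double *b) {
    int mask[64];                      /* vector-mask register analogue */
    for (int i = 0; i < n; i++)
        mask[i] = (a[i] != 0.0);       /* vector test loads the mask */
    for (int i = 0; i < n; i++)
        if (mask[i])
            a[i] = a[i] - b[i];        /* operate only on elements whose mask bit is 1 */
}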
Vector Opt #3: Sparse Matrices
Suppose:
do 100 i = 1,n
100 A(K(i)) = A(K(i)) + C(M(i))
• gather (LVI) operation takes an index vector and fetches the
vector whose elements are at the addresses given by adding a
base address to the offsets given in the index vector => a
nonsparse vector in a vector register
After these elements are operated on in dense form, the sparse
vector can be stored in expanded form by a scatter store (SVI),
using the same index vector
This can't be done automatically by the compiler, since it cannot know that the K(i) elements are distinct (and hence that there are no dependencies); it is enabled by a compiler directive.
Use CVI to create the index vector 0, 1xm, 2xm, ..., 63xm.
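In C, the gather/operate/scatter sequence looks like this minimal sketch (assumes n is at most 64 and the indices in K are distinct, as discussed above; names and the 64-element buffers are illustrative):

/* A(K(i)) = A(K(i)) + C(M(i)) via gather, dense add, scatter. */
void sparse_update(int n, double *a, const double *c, const int *k, const int *m) {
    double va[64], vc[64];            /* dense "vector registers" */
    for (int i = 0; i < n; i++) {     /* gather (LVI-style indexed loads) */
        va[i] = a[k[i]];
        vc[i] = c[m[i]];
    }
    for (int i = 0; i < n; i++)       /* operate on the dense form */
        va[i] += vc[i];
    for (int i = 0; i < n; i++)       /* scatter (SVI-style indexed store) */
        a[k[i]] = va[i];
}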
Sparse Matrix Example: Cache (1993) vs. Vector (1988)

                     IBM RS6000    Cray YMP
Clock                72 MHz        167 MHz
Cache                256 KB        0.25 KB
Linpack              140 MFLOPS    160 (1.1x)
Sparse Matrix        17 MFLOPS     125 (7.3x)
(Cholesky Blocked)

Cache: 1 address per cache block (32B to 64B)
Vector: 1 address per element (4B)
Challenges: Vector Example with a Dependency
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i = 0; i < m; i++)
{
  for (j = 0; j < n; j++)
  {
    sum = 0;
    for (t = 0; t < k; t++)
    {
      sum += a[i][t] * b[t][j];
    }
    c[i][j] = sum;
  }
}

Problem: creating the sum of the elements in a vector (a reduction) is slow and requires use of the scalar unit
Optimized Vector Example
Consider the vector processor as a collection of 32 virtual processors!
Does not need a reduction!

/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i = 0; i < m; i++)
{
  for (j = 0; j < n; j += 32)   /* Step j 32 at a time. */
  {
    sum[0:31] = 0;              /* Initialize a vector register to zeros. */
    for (t = 0; t < k; t++)
    {
      a_scalar = a[i][t];                     /* Get scalar from a matrix. */
      b_vector[0:31] = b[t][j:j+31];          /* Get vector from b matrix. */
      prod[0:31] = b_vector[0:31] * a_scalar; /* Do a vector-scalar multiply. */
      sum[0:31] += prod[0:31];                /* Vector-vector add into results. */
    }
    c[i][j:j+31] = sum[0:31];   /* Unit-stride store of the vector of results. */
  }
}
Applications
Limited to scientific computing?
Multimedia Processing (compress., graphics, audio synth, image proc.)
Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)
Lossy Compression (JPEG, MPEG video and audio)
Lossless Compression (Zero removal, RLE, Differencing, LZW)
Cryptography (RSA, DES/IDEA, SHA/MD5)
Speech and handwriting recognition
Operating systems/Networking (memcpy, memset, parity, checksum)
Databases (hash/join, data mining, image/video serving)
Language run-time support (stdlib, garbage collection)
even SPECint95
Vector for Multimedia?
Intel MMX: 57 new 80x86 instructions (the first extension since the 386)
  similar to Intel i860, Motorola 88110, HP PA-7100LC, UltraSPARC
  3 data types: 8 8-bit, 4 16-bit, 2 32-bit elements packed in 64 bits
  reuses the 8 FP registers (FP and MMX cannot mix)
  short vector: load, add, store 8 8-bit operands at once
Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, communications, ...
  used in drivers or added to library routines; no compiler support
MMX Instructions
Move: 32b, 64b
Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  optional signed/unsigned saturate (set to max) on overflow
Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
Multiply, Multiply-Add in parallel: 4 16b
Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  sets a field to 0s (false) or 1s (true); removes branches
Pack/Unpack
  Convert 32b <-> 16b, 16b <-> 8b
  Pack saturates (sets to max) if the number is too large
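To show what "saturate on overflow" means for the packed 8-bit case, here is a plain-C sketch of the behaviour (the function name is illustrative, not an actual MMX intrinsic):

#include <stdint.h>

/* MMX-style parallel add with unsigned saturation on eight 8-bit fields
   packed in a 64-bit word. */
uint64_t padd_unsigned_saturate_8x8(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned ai = (unsigned)((a >> (8 * i)) & 0xFF);
        unsigned bi = (unsigned)((b >> (8 * i)) & 0xFF);
        unsigned s  = ai + bi;
        if (s > 0xFF) s = 0xFF;              /* saturate: clamp to the max value on overflow */
        r |= (uint64_t)s << (8 * i);
    }
    return r;
}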
Vectors and Variable Data Width
The programmer thinks in terms of vectors of data of some width (8, 16, 32, or 64 bits)
Good for multimedia; more elegant than MMX-style extensions
  Don't have to worry about how data is stored in hardware
  No need for explicit pack/unpack operations
  Just think of more virtual processors operating on narrow data
Expand the maximum vector length as the data width decreases:
  64 x 64-bit, 128 x 32-bit, 256 x 16-bit, 512 x 8-bit
Media Processing: Vectorizable? Vector Lengths?

Kernel                               Vector length
Matrix transpose/multiply            # vertices at once
DCT (video, communication)           image width
FFT (audio)                          256-1024
Motion estimation (video)            image width, iw/16
Gamma correction (video)             image width
Haar transform (media mining)        image width
Median filter (image processing)     image width
Separable convolution (img. proc.)   image width

(from Pradeep Dubey - IBM, http://www.research.ibm.com/people/p/pradeep/tutor.html)
Vector Pitfalls
Pitfall: concentrating on peak performance and ignoring start-up overhead
  e.g. Nv (the vector length needed to beat scalar) > 100 on the CDC STAR-100
Pitfall: increasing vector performance without comparable increases in scalar performance (Amdahl's Law)
  the failure of a Cray competitor from his former company
Pitfall: good processor vector performance without providing good memory bandwidth
  MMX?
Vector Advantages
Easy to get high performance; N operations:
– are independent
– use same functional unit
– access disjoint registers
– access registers in same order as previous instructions
– access contiguous memory words or known pattern
– can exploit large memory bandwidth
– hide memory latency (and any other latency)
• Scalable (get higher performance as more HW resources available)
• Compact: Describe N operations with 1 short instruction (v. VLIW)
• Predictable (real-time) performance vs. statistical performance (cache)
• Multimedia ready: choose N * 64b, 2N * 32b, 4N * 16b, 8N * 8b
Mature, developed compiler technology
• Vector Disadvantage: Out of Fashion
Vectors Are Inexpensive

Scalar: N ops per cycle => O(N^2) circuitry
  HP PA-8000: 4-way issue
  reorder buffer: 850K transistors
  incl. 6,720 5-bit register number comparators

Vector: N ops per cycle => O(N + eN^2) circuitry
  T0 vector micro: 24 ops per cycle
  730K transistors total
  only 23 5-bit register number comparators
  no floating point

MIPS R10000 vs. T0: see http://www.icsi.berkeley.edu/real/spert/t0-intro.html
Vectors Lower Power

Single-issue Scalar:
  One instruction fetch, decode, dispatch per operation
  Arbitrary register accesses add area and power
  Loop unrolling and software pipelining for high performance increase the instruction cache footprint
  All data passes through the cache; power is wasted if there is no temporal locality
  One TLB lookup per load or store
  Off-chip access is in whole cache lines

Vector:
  One instruction fetch, decode, dispatch per vector
  Structured register accesses
  Smaller code for high performance, less power in instruction cache misses
  Bypass the cache
  One TLB lookup per group of loads or stores
  Move only the necessary data across the chip boundary
Superscalar Energy Efficiency Even Worse

Superscalar:
  Control logic grows quadratically with issue width
  Control logic consumes energy regardless of available parallelism
  Speculation to increase visible parallelism wastes energy

Vector:
  Control logic grows linearly with issue width
  The vector unit switches off when not in use
  Vector instructions expose parallelism without speculation
  Software control of speculation when desired: whether to use vector mask or compress/expand for conditionals
VLIW/Out-of-Order versus Modest Scalar+Vector

[Figure: performance (0 to 100) plotted against applications sorted from very sequential to very parallel along the instruction-level-parallelism axis; one curve for VLIW/OOO and one for Modest Scalar+Vector. Where are the crossover points on these curves? Where are the important applications on this axis?]
Vector Summary
The alternative model accommodates long memory latency and doesn't rely on caches as out-of-order superscalar/VLIW designs do
Much easier for hardware: more powerful instructions, more predictable memory accesses, fewer hazards, fewer branches, fewer mispredicted branches, ...
What % of computation is vectorizable?
Is vector a good match for new apps such as multimedia and DSP?
