CSO Computer Programming
Theoretical only
[Figure: sequential laundry. Four loads A-D, in task order; each load takes 30 min wash + 40 min dry + 20 min fold = 90 min, and a new load starts only when the previous one finishes.]
This operator scheduled his loads to be delivered to the laundry every 90
minutes, which is the time required to finish one load. In other words, he
will not start a new task unless he has already finished the previous task:
the process is sequential. Sequential laundry takes 6 hours for 4 loads.
Efficiently Scheduled Laundry: Pipelined Laundry
The operator starts work ASAP.
[Figure: pipelined laundry. Time axis from 6 PM to midnight; loads A-D, in task order, start 40 minutes apart (segment times 30, 40, 40, 40, 40, 20), so the washer, dryer, and folding operate on different loads at the same time.]
Another operator asks for the delivery of loads to the laundry every 40
minutes. Pipelined laundry takes 3.5 hours for 4 loads.
Pipelining Facts
[Figure: the pipelined laundry timeline again (6 PM to 9 PM; stage times 30, 40, 40, 40, 40, 20); the callout notes that the washer waits for the dryer for 10 minutes.]
• Multiple tasks operate simultaneously.
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
• The pipeline rate is limited by the slowest pipeline stage.
• Potential speedup = number of pipe stages.
• Unbalanced lengths of pipe stages reduce speedup.
• Time to "fill" the pipeline and time to "drain" it reduces speedup.
9.2 Pipelining
• Decomposes a sequential process into segments.
• Divides the processor into segment processors, each dedicated to a particular segment.
• Each segment executes in its dedicated segment processor, which operates concurrently with all the other segments.
• Information flows through these multiple hardware segments.
Instruction execution is divided into k segments, or stages: an instruction exits pipe stage k-1 and proceeds into pipe stage k, and all pipe stages take the same amount of time.
Suppose we want to perform a combined multiply and add operation on a stream of numbers:

    Ai * Bi + Ci    for i = 1, 2, 3, ..., 7
The suboperations performed in each segment of the pipeline are as follows:

    Segment 1:  R1 <- Ai,        R2 <- Bi
    Segment 2:  R3 <- R1 * R2,   R4 <- Ci
    Segment 3:  R5 <- R3 + R4
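To make the flow concrete, here is a minimal C sketch (not from the original slides) that clocks the three segment registers simultaneously; once the pipe is full, one result R5 emerges every cycle:

    /* 3-segment multiply-add pipeline, clocked once per loop iteration.
       Stages are updated back-to-front so each stage sees the values its
       predecessor produced on the PREVIOUS clock. */
    #include <stdio.h>
    #define N 7
    int main(void) {
        double A[N] = {1,2,3,4,5,6,7}, B[N] = {7,6,5,4,3,2,1}, C[N] = {1,1,1,1,1,1,1};
        double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5;
        for (int clk = 0; clk < N + 2; clk++) {          /* k+n-1 = 9 clocks */
            R5 = R3 + R4;                                /* segment 3 */
            R3 = R1 * R2;                                /* segment 2 */
            R4 = (clk >= 1 && clk <= N) ? C[clk - 1] : 0;
            if (clk < N) { R1 = A[clk]; R2 = B[clk]; }   /* segment 1 */
            if (clk >= 2)                                /* pipe is full */
                printf("clock %d: result for i=%d is %g\n", clk + 1, clk - 1, R5);
        }
        return 0;
    }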
Pipeline Performance: Speedup
Consider a k-segment pipeline operating on n data sets. (In the multiply-add example above, k = 3.)
[Figure: space-time diagram of tasks T1-T6 flowing through the pipeline segments, each task one clock behind the previous one.]
The pipeline finishes all n data sets in k + n - 1 clock cycles (k cycles for the first result, then one more result per cycle), while a non-pipelined unit needs n x k cycles, so

    Speedup = n * k / (k + n - 1)

which approaches k as n grows large.
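As a quick numeric check of the formula (a sketch; the values are illustrative):

    /* Pipeline speedup S = n*k / (k + n - 1); approaches k for large n. */
    #include <stdio.h>
    static double speedup(int k, int n) { return (double)(n * k) / (k + n - 1); }
    int main(void) {
        printf("k=3, n=7:    %.2f\n", speedup(3, 7));    /* 21/9 = 2.33 */
        printf("k=3, n=1000: %.2f\n", speedup(3, 1000)); /* ~= 3 = k    */
        return 0;
    }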
Vector Processing
Review: Instruction-Level Parallelism
High-speed execution is based on instruction-level parallelism (ILP): the potential of short instruction sequences to execute in parallel.
High-speed microprocessors exploit ILP by:
1) pipelined execution: overlapping instructions
2) superscalar execution: issuing and executing multiple instructions per clock cycle
3) out-of-order execution (committing in order)
Memory accesses for a high-speed microprocessor? A data cache, possibly multiported, with multiple levels.
Review
• Speculation: out-of-order execution with in-order commit (reorder buffer + rename registers) => precise exceptions.
• Software pipelining: symbolic loop unrolling (interleaving instructions from different iterations) to optimize the pipeline with little code expansion and little overhead.
• Superscalar and VLIW: CPI < 1 (IPC > 1); dynamic issue vs. static issue; the more instructions issued at the same time, the larger the hazard penalty; # independent instructions needed = # functional units x latency.
• Branch prediction: a branch history table with 2 bits per entry suffices for loop accuracy; are recently executed branches correlated with the next branch?; a branch target buffer also includes the branch address & prediction; predicated execution can reduce the number of branches and of mispredicted branches.
Review: Theoretical Limits to ILP? (Figure 4.48, Page 332)
[Figure: instruction issues per cycle (IPC) per program as the instruction window shrinks from infinite to 256, 128, 64, 32, 16, 8, and 4 entries, assuming perfect HW memory disambiguation, selective branch prediction with a 1K-entry table, a 16-entry return predictor, 64 registers, and issue width as large as the window. Achievable IPC is roughly 8-45 for FP programs and 6-12 for integer programs.]
Problems with the Conventional Approach
Limits to conventional exploitation of ILP:
1) pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
2) instruction fetch and decode: at some point, it's hard to fetch and decode more instructions per clock cycle
3) cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality
Alternative Model: Vector Processing
Vector processors have high-level operations that work on linear arrays of numbers: "vectors".

    SCALAR (1 operation)        VECTOR (N operations)
    r3 <- r1 + r2               v3 <- v1 + v2
                                (up to the vector length, e.g. 32)
Scalar vs. Vector: DAXPY (Y = a * X + Y)
Assuming vectors X and Y are of length 64:

Vector code (DLXV):
    LD     F0,a        ;load scalar a
    LV     V1,Rx       ;load vector X
    MULTS  V2,F0,V1    ;vector-scalar mult.
    LV     V3,Ry       ;load vector Y
    ADDV   V4,V2,V3    ;add
    SV     Ry,V4       ;store the result

Scalar code (DLX):
    LD     F0,a        ;load scalar a
    ADDI   R4,Rx,#512  ;last address to load
loop:
    LD     F2,0(Rx)    ;load X(i)
    MULTD  F2,F0,F2    ;a*X(i)
    LD     F4,0(Ry)    ;load Y(i)
    ADDD   F4,F2,F4    ;a*X(i) + Y(i)
    SD     F4,0(Ry)    ;store into Y(i)
    ADDI   Rx,Rx,#8    ;increment index to X
    ADDI   Ry,Ry,#8    ;increment index to Y
    SUB    R20,R4,Rx   ;compute bound
    BNZ    R20,loop    ;check if done

578 (2 + 9*64) vs. 321 (1 + 5*64) operations (1.8X); 578 (2 + 9*64) vs. 6 instructions (96X); 64-operation vectors + no loop overhead; also 64X fewer pipeline hazards.
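For reference, what both sequences compute is the single DAXPY loop below (a C sketch; the vector code executes it as 6 instructions, the scalar code as roughly 9 instructions per element):

    /* DAXPY: Y = a*X + Y, element by element. */
    #include <stddef.h>
    void daxpy(size_t n, double a, const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one multiply-add per element */
    }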
Example Vector Machines

    Machine      Year  Clock     Regs   Elements  FUs  LSUs
    Cray 1       1976   80 MHz   8      64        6    1
    Cray XMP     1983  120 MHz   8      64        8    2 L, 1 S
    Cray YMP     1988  166 MHz   8      64        8    2 L, 1 S
    Cray C-90    1991  240 MHz   8      128       8    4
    Cray T-90    1996  455 MHz   8      128       8    4
    Conv. C-1    1984   10 MHz   8      128       4    1
    Conv. C-4    1994  133 MHz   16     128       3    1
    Fuj. VP200   1982  133 MHz   8-256  32-1024   3    2
    Fuj. VP300   1996  100 MHz   8-256  32-1024   3    2
    NEC SX/2     1984  160 MHz   8+8K   256+var   16   8
    NEC SX/3     1995  400 MHz   8+8K   256+var   16   8
Vector Linpack Performance (MFLOPS)
[Performance table lost in extraction; the benchmark is matrix inverse (Gaussian elimination).]
[Figure: vector register files, 32 registers each: vector registers vr0-vr31 (#vdw bits per element), vector control registers vcr0-vcr31 (32 bits each), and flag registers vf0-vf31 (1 bit per element).]
Vector Implementation
• Vector register file: each register is an array of elements.
• The size of each register determines the maximum vector length.
• A vector length register determines the vector length for a particular operation.
• Multiple parallel execution units = "lanes" (sometimes called "pipelines" or "pipes"); see the figure and sketch below.
[Figure: 4 lanes, 2 vector functional units.]
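A sketch of the usual element-to-lane mapping (round-robin assignment, an assumption for illustration): with L lanes, lane i mod L handles element i, so a 64-element operation keeps 4 lanes busy for about 16 cycles:

    /* Round-robin mapping of vector elements onto lanes. */
    #include <stdio.h>
    int main(void) {
        enum { LANES = 4, VL = 64 };
        for (int i = 0; i < VL; i++) {
            int lane  = i % LANES;    /* lane that executes element i */
            int cycle = i / LANES;    /* issue cycle it occupies      */
            if (i < 8)
                printf("element %2d -> lane %d, cycle %d\n", i, lane, cycle);
        }
        printf("total cycles ~= VL/LANES = %d\n", VL / LANES);
        return 0;
    }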
Tentative VIRAM-1 Floorplan
[Floorplan figure; the recoverable details:]
• Memory: 0.18 µm DRAM, 32 MB in 16 banks x 256b, 128 subbanks (two blocks of 128 Mbits / 16 MBytes each)
• Logic: 0.25 µm, 5 metal layers
• CPU + caches: 200 MHz MIPS, 16K I$, 16K D$
• Vector unit: 4 vector pipes/lanes, 4 x 200 MHz FP/integer vector units
• Ring-based switch, I/O
• Die: 16 x 16 mm; transistors: 270M; power: 2 Watts
Vector Execution Time
Time = f(vector length, data dependencies, structural hazards)
• Initiation rate: the rate at which an FU consumes vector elements (= number of lanes; usually 1 or 2 on the Cray T-90).
• Convoy: a set of vector instructions that can begin execution in the same clock (no structural or data hazards).
• Chime: the approximate time for one vector operation.
• m convoys take m chimes; if each vector length is n, they take approximately m x n clock cycles (this ignores overhead but is a good approximation for long vectors).
    1: LV    V1,Rx       ;load vector X
    2: MULV  V2,F0,V1    ;vector-scalar mult.
       LV    V3,Ry       ;load vector Y
    3: ADDV  V4,V2,V3    ;add
    4: SV    Ry,V4       ;store the result

4 convoys, 1 lane, VL = 64 => 4 x 64 = 256 clocks (or 4 clocks per result).
DLXV Start-up Time
• Start-up time: pipeline latency time (depth of the FU pipeline) + other sources of overhead.

    Operation          Start-up penalty (from CRAY-1)
    Vector load/store  12
    Vector multiply    7
    Vector add         6

Assume convoys don't overlap; vector length = n:

    Convoy        Start    1st result  Last result
    1. LV         0        12          11+n  (12+n-1)
    2. MULV, LV   12+n     12+n+7      18+2n  (multiply start-up)
                  12+n+1   12+n+13     24+2n  (load start-up)
    3. ADDV       25+2n    25+2n+6     30+3n  (waits for convoy 2)
    4. SV         31+3n    31+3n+12    42+4n  (waits for convoy 3)
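The table can be reproduced mechanically (a sketch; convoy 2's effective penalty is 13 because its LV issues one clock after the MULV):

    /* Completion times for the 4-convoy DAXPY, assuming convoys do not
       overlap: each convoy starts the clock after the previous one drains. */
    #include <stdio.h>
    int main(void) {
        int n = 64;                        /* vector length           */
        int penalty[] = {12, 13, 6, 12};   /* LV; MULV+LV; ADDV; SV   */
        int start = 0, last = 0;
        for (int c = 0; c < 4; c++) {
            last = start + penalty[c] + n - 1;
            printf("convoy %d: start %3d, last result %3d\n", c + 1, start, last);
            start = last + 1;
        }
        return 0;   /* n=64: last result at 42+4n = 298, vs. 256 without start-up */
    }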
Why start-up time for each vector instruction?
Why not overlap the start-up time of back-to-back vector instructions? Cray machines were built from many ECL chips operating at high clock rates, which made overlap hard to do. The Berkeley vector design ("T0") didn't know it wasn't supposed to do overlap, so it has no start-up times for functional units (except load).
Vector Load/Store Units & Memories
• Start-up overheads are usually longer for LSUs.
• The memory system must sustain (# lanes x word) per clock cycle.
• Many vector processors use banks (versus simple interleaving) to:
  1) support multiple loads/stores per cycle => multiple banks & the ability to address banks independently
  2) support non-sequential accesses (discussed soon)
• Note: the number of memory banks must exceed the memory latency to avoid stalls. m banks deliver m words per memory latency of l clocks; if m < l, there is a gap in the memory pipeline:

    clock:  0 ... l    l+1  l+2 ... l+m-1   l+m ... 2l
    word:   -- ... 0    1    2  ...  m-1     --  ...  m

• SRAM designs may have as many as 1024 banks.
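The gap is easy to see in a sketch (illustrative numbers: m = 4 banks, latency l = 6 clocks, so m < l):

    /* Delivery schedule for sequential words from m banks with latency l:
       word i comes from bank i%m and is ready at clock (i/m)*l + (i%m) + l. */
    #include <stdio.h>
    int main(void) {
        int m = 4, l = 6;
        for (int i = 0; i <= 2 * m; i++)
            printf("word %d ready at clock %d\n", i, (i / m) * l + (i % m) + l);
        /* words 0-3 arrive at clocks 6-9, then a gap until word 4 at clock 12;
           with m >= l the stream would flow at one word per clock. */
        return 0;
    }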
Vector Length
What to do when the vector length is not exactly 64?
• A vector-length register (VLR) controls the length of any vector operation, including a vector load or store (it cannot exceed the length of the vector registers).

      do 10 i = 1, n
 10   Y(i) = a * X(i) + Y(i)

We don't know n until runtime! What if n > the maximum vector length (MVL)?
Strip Mining
Suppose the vector length > the maximum vector length (MVL)?
• Strip mining: generation of code such that each vector operation is done for a size <= MVL.
• The first strip does the short piece (n mod MVL); all remaining strips use VL = MVL. The price is the outer-loop overhead (see the C sketch after the code).

      low = 1
      VL = (n mod MVL)          /*find the odd size piece*/
      do 1 j = 0,(n / MVL)      /*outer loop*/
        do 10 i = low,low+VL-1  /*runs for length VL*/
          Y(i) = a*X(i) + Y(i)  /*main operation*/
 10     continue
        low = low+VL            /*start of next vector*/
        VL = MVL                /*reset the length to max*/
  1   continue
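The same strip-mined loop in C (a minimal sketch; the inner loop stands in for one vector operation of length vl):

    /* Strip-mined DAXPY: no strip exceeds MVL; the first strip handles
       the odd-size piece (n mod MVL). */
    #include <stddef.h>
    #define MVL 64                       /* maximum vector length */
    void daxpy_strip(size_t n, double a, const double *x, double *y) {
        size_t low = 0;
        size_t vl  = n % MVL;            /* odd-size first piece (may be 0) */
        while (low < n) {
            if (vl == 0) vl = MVL;       /* skip an empty first strip */
            for (size_t i = low; i < low + vl; i++)
                y[i] = a * x[i] + y[i];  /* one vector op of length vl */
            low += vl;
            vl = MVL;                    /* reset the length to max */
        }
    }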
Common Vector Metrics
• R∞: the MFLOPS rate on an infinite-length vector.

MMX (Intel's multimedia extension)
Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, communications, ...; used in drivers or added to library routines; no compiler support.
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, or 2 32b
  - optional signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, or 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, or 2 32b
  - sets each field to 0s (false) or 1s (true); removes branches
• Pack/Unpack: convert 32b <-> 16b, 16b <-> 8b
  - Pack saturates (sets to max) if the number is too large
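What "add with unsigned saturate" means, spelled out in scalar C (a sketch of the semantics only; paddusb_like is a hypothetical name, not a real intrinsic):

    /* MMX-style saturating byte add: 8 independent 8-bit lanes packed in a
       64-bit word; each lane clamps to 255 instead of wrapping on overflow. */
    #include <stdint.h>
    uint64_t paddusb_like(uint64_t a, uint64_t b) {
        uint64_t r = 0;
        for (int lane = 0; lane < 8; lane++) {
            uint32_t x = (a >> (8 * lane)) & 0xFF;
            uint32_t y = (b >> (8 * lane)) & 0xFF;
            uint32_t s = x + y;
            if (s > 255) s = 255;        /* saturate (set to max) */
            r |= (uint64_t)s << (8 * lane);
        }
        return r;
    }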
Vectors and Variable Data Width
• The programmer thinks in terms of vectors of data of some width (8, 16, 32, or 64 bits).
• Good for multimedia, and more elegant than MMX-style extensions:
  - no need to worry about how data is stored in hardware
  - no need for explicit pack/unpack operations
  - just think of more virtual processors operating on narrower data
• Expand the maximum vector length as the data width decreases: 64 x 64-bit, 128 x 32-bit, 256 x 16-bit, 512 x 8-bit (see the sketch below).
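The scaling is just register bits divided by element width (a sketch assuming a 64 x 64-bit register file, i.e. 4096 bits per vector register):

    /* Maximum vector length vs. element width for a 4096-bit register. */
    #include <stdio.h>
    int main(void) {
        const int reg_bits = 64 * 64;    /* 64 elements x 64 bits */
        int widths[] = {64, 32, 16, 8};
        for (int i = 0; i < 4; i++)
            printf("%2d-bit elements: MVL = %d\n", widths[i], reg_bits / widths[i]);
        return 0;    /* prints 64, 128, 256, 512 */
    }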
Mediaprocessing: Vectorizable? Vector Lengths?

    Kernel                               Vector length
    Matrix transpose/multiply            # vertices at once
    DCT (video, communication)           image width
    FFT (audio)                          256-1024
    Motion estimation (video)            image width, iw/16
    Gamma correction (video)             image width
    Haar transform (media mining)        image width
    Median filter (image processing)     image width
    Separable convolution (img. proc.)   image width

(from Pradeep Dubey - IBM, http://www.research.ibm.com/people/p/pradeep/tutor.html)
Vector Pitfalls
• Pitfall: concentrating on peak performance and ignoring start-up overhead, e.g. Nv (the vector length needed to run faster than scalar) > 100 on the CDC STAR.
• Pitfall: increasing vector performance without comparable increases in scalar performance (Amdahl's Law); the failure of a Cray competitor from his former company.
• Pitfall: good processor vector performance without providing good memory bandwidth (MMX?).
Vector Advantages
• Easy to get high performance; the N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in the same order as previous instructions
  - access contiguous memory words or a known pattern
  - can exploit large memory bandwidth
  - hide memory latency (and any other latency)
• Scalable: get higher performance as more HW resources become available.
• Compact: describe N operations with 1 short instruction (vs. VLIW).
• Predictable (real-time) performance vs. statistical performance (cache).
• Multimedia ready: choose N x 64b, 2N x 32b, 4N x 16b, 8N x 8b.
• Mature, developed compiler technology.
• Vector disadvantage: out of fashion.
Vectors Are Inexpensive

Scalar: N ops per cycle => O(N^2) circuitry
  HP PA-8000: 4-way issue; the reorder buffer alone is 850K transistors, including 6,720 5-bit register number comparators.

Vector: N ops per cycle => O(N) circuitry
  T0 vector micro*: 24 ops per cycle; 730K transistors total; only 23 5-bit register number comparators; no floating point.

[Figure: MIPS R10000 vs. T0 die comparison.]
*See http://www.icsi.berkeley.edu/real/spert/t0-intro.html
Vectors Lower Power

Single-issue scalar:
• One instruction fetch, decode, dispatch per operation
• Arbitrary register accesses add area and power
• Loop unrolling and software pipelining for high performance increase the instruction cache footprint
• All data passes through the cache; power is wasted if there is no temporal locality
• One TLB lookup per load or store
• Off-chip access is in whole cache lines

Vector:
• One instruction fetch, decode, dispatch per vector
• Structured register accesses
• Smaller code for high performance; less power in instruction cache misses
• Bypass the cache
• One TLB lookup per group of loads or stores
• Move only necessary data across the chip boundary
Superscalar Energy Efficiency Even Worse

Superscalar:
• Control logic grows quadratically with issue width
• Control logic consumes energy regardless of available parallelism
• Speculation to increase visible parallelism wastes energy

Vector:
• Control logic grows linearly with issue width
• Vector unit switches off when not in use
• Vector instructions expose parallelism without speculation
• Software control of speculation when desired: whether to use a vector mask or compress/expand for conditionals
VLIW/Out-of-Order versus Modest Scalar+Vector
[Figure: performance comparison chart (y-axis: performance, up to 100); the rest of the slide was lost in extraction.]