CSO Computer Programming
Theoretical only
[Figure: sequential laundry. Four loads A-D, in task order; each load takes 30 min wash + 40 min dry + 20 min fold = 90 min, and a new load starts only when the previous one finishes.]
This operator scheduled his loads to be delivered to the laundry every 90
minutes, which is the time required to finish one load. In other words, he
will not start a new task unless he has already finished the previous task:
the process is sequential. Sequential laundry takes 6 hours for 4 loads.
Efficiently Scheduled Laundry: Pipelined Laundry
The operator starts work ASAP.
[Figure: pipelined laundry. Time axis from 6 PM to midnight; loads A-D, in task order, start 40 minutes apart (segment times 30, 40, 40, 40, 40, 20), so the washer, dryer, and folding operate on different loads at the same time.]
Another operator asks for the delivery of loads to the laundry every 40
minutes. Pipelined laundry takes 3.5 hours for 4 loads.
Pipelining Facts
[Figure: the pipelined laundry timeline again (6 PM to 9 PM; stage times 30, 40, 40, 40, 40, 20); the callout notes that the washer waits for the dryer for 10 minutes.]
• Multiple tasks operate simultaneously.
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
• The pipeline rate is limited by the slowest pipeline stage.
• Potential speedup = number of pipe stages.
• Unbalanced lengths of pipe stages reduce speedup.
• Time to "fill" the pipeline and time to "drain" it reduces speedup.
9.2 Pipelining
• Decomposes a sequential process into segments.
• Divides the processor into segment processors, each dedicated to a particular segment.
• Each segment executes in its dedicated segment processor, which operates concurrently with all the other segments.
• Information flows through these multiple hardware segments.
Instruction execution is divided into k segments, or stages: an instruction exits pipe stage k-1 and proceeds into pipe stage k, and all pipe stages take the same amount of time.
Suppose we want to perform a combined multiply and add operation on a stream of numbers:

    Ai * Bi + Ci    for i = 1, 2, 3, ..., 7
The suboperations performed in each segment of the pipeline are as follows:

    Segment 1:  R1 <- Ai,        R2 <- Bi
    Segment 2:  R3 <- R1 * R2,   R4 <- Ci
    Segment 3:  R5 <- R3 + R4
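To make the flow concrete, here is a minimal C sketch (not from the original slides) that clocks the three segment registers simultaneously; once the pipe is full, one result R5 emerges every cycle:

    /* 3-segment multiply-add pipeline, clocked once per loop iteration.
       Stages are updated back-to-front so each stage sees the values its
       predecessor produced on the PREVIOUS clock. */
    #include <stdio.h>
    #define N 7
    int main(void) {
        double A[N] = {1,2,3,4,5,6,7}, B[N] = {7,6,5,4,3,2,1}, C[N] = {1,1,1,1,1,1,1};
        double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5;
        for (int clk = 0; clk < N + 2; clk++) {          /* k+n-1 = 9 clocks */
            R5 = R3 + R4;                                /* segment 3 */
            R3 = R1 * R2;                                /* segment 2 */
            R4 = (clk >= 1 && clk <= N) ? C[clk - 1] : 0;
            if (clk < N) { R1 = A[clk]; R2 = B[clk]; }   /* segment 1 */
            if (clk >= 2)                                /* pipe is full */
                printf("clock %d: result for i=%d is %g\n", clk + 1, clk - 1, R5);
        }
        return 0;
    }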
Pipeline Performance: Speedup
Consider a k-segment pipeline operating on n data sets. (In the multiply-add example above, k = 3.)
[Figure: space-time diagram of tasks T1-T6 flowing through the pipeline segments, each task one clock behind the previous one.]
The pipeline finishes all n data sets in k + n - 1 clock cycles (k cycles for the first result, then one more result per cycle), while a non-pipelined unit needs n x k cycles, so

    Speedup = n * k / (k + n - 1)

which approaches k as n grows large.
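As a quick numeric check of the formula (a sketch; the values are illustrative):

    /* Pipeline speedup S = n*k / (k + n - 1); approaches k for large n. */
    #include <stdio.h>
    static double speedup(int k, int n) { return (double)(n * k) / (k + n - 1); }
    int main(void) {
        printf("k=3, n=7:    %.2f\n", speedup(3, 7));    /* 21/9 = 2.33 */
        printf("k=3, n=1000: %.2f\n", speedup(3, 1000)); /* ~= 3 = k    */
        return 0;
    }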
Vector Processing
Review: Instruction-Level Parallelism
High-speed execution is based on instruction-level parallelism (ILP): the potential of short instruction sequences to execute in parallel.
High-speed microprocessors exploit ILP by:
1) pipelined execution: overlapping instructions
2) superscalar execution: issuing and executing multiple instructions per clock cycle
3) out-of-order execution (committing in order)
Memory accesses for a high-speed microprocessor? A data cache, possibly multiported, with multiple levels.
Review
• Speculation: out-of-order execution with in-order commit (reorder buffer + rename registers) => precise exceptions.
• Software pipelining: symbolic loop unrolling (interleaving instructions from different iterations) to optimize the pipeline with little code expansion and little overhead.
• Superscalar and VLIW: CPI < 1 (IPC > 1); dynamic issue vs. static issue; the more instructions issued at the same time, the larger the hazard penalty; # independent instructions needed = # functional units x latency.
• Branch prediction: a branch history table with 2 bits per entry suffices for loop accuracy; are recently executed branches correlated with the next branch?; a branch target buffer also includes the branch address & prediction; predicated execution can reduce the number of branches and of mispredicted branches.
Review: Theoretical Limits to ILP? (Figure 4.48, Page 332)
[Figure: instruction issues per cycle (IPC) per program as the instruction window shrinks from infinite to 256, 128, 64, 32, 16, 8, and 4 entries, assuming perfect HW memory disambiguation, selective branch prediction with a 1K-entry table, a 16-entry return predictor, 64 registers, and issue width as large as the window. Achievable IPC is roughly 8-45 for FP programs and 6-12 for integer programs.]
Problems with the Conventional Approach
Limits to conventional exploitation of ILP:
1) pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
2) instruction fetch and decode: at some point, it's hard to fetch and decode more instructions per clock cycle
3) cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality
Alternative Model: Vector Processing
Vector processors have high-level operations that work on linear arrays of numbers: "vectors".

    SCALAR (1 operation)        VECTOR (N operations)
    r3 <- r1 + r2               v3 <- v1 + v2
                                (up to the vector length, e.g. 32)
Scalar vs. Vector: DAXPY (Y = a * X + Y)
Assuming vectors X and Y are of length 64:

Vector code (DLXV):
    LD     F0,a        ;load scalar a
    LV     V1,Rx       ;load vector X
    MULTS  V2,F0,V1    ;vector-scalar mult.
    LV     V3,Ry       ;load vector Y
    ADDV   V4,V2,V3    ;add
    SV     Ry,V4       ;store the result

Scalar code (DLX):
    LD     F0,a        ;load scalar a
    ADDI   R4,Rx,#512  ;last address to load
loop:
    LD     F2,0(Rx)    ;load X(i)
    MULTD  F2,F0,F2    ;a*X(i)
    LD     F4,0(Ry)    ;load Y(i)
    ADDD   F4,F2,F4    ;a*X(i) + Y(i)
    SD     F4,0(Ry)    ;store into Y(i)
    ADDI   Rx,Rx,#8    ;increment index to X
    ADDI   Ry,Ry,#8    ;increment index to Y
    SUB    R20,R4,Rx   ;compute bound
    BNZ    R20,loop    ;check if done

578 (2 + 9*64) vs. 321 (1 + 5*64) operations (1.8X); 578 (2 + 9*64) vs. 6 instructions (96X); 64-operation vectors + no loop overhead; also 64X fewer pipeline hazards.
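For reference, what both sequences compute is the single DAXPY loop below (a C sketch; the vector code executes it as 6 instructions, the scalar code as roughly 9 instructions per element):

    /* DAXPY: Y = a*X + Y, element by element. */
    #include <stddef.h>
    void daxpy(size_t n, double a, const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one multiply-add per element */
    }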
Example Vector Machines

    Machine      Year  Clock     Regs   Elements  FUs  LSUs
    Cray 1       1976   80 MHz   8      64        6    1
    Cray XMP     1983  120 MHz   8      64        8    2 L, 1 S
    Cray YMP     1988  166 MHz   8      64        8    2 L, 1 S
    Cray C-90    1991  240 MHz   8      128       8    4
    Cray T-90    1996  455 MHz   8      128       8    4
    Conv. C-1    1984   10 MHz   8      128       4    1
    Conv. C-4    1994  133 MHz   16     128       3    1
    Fuj. VP200   1982  133 MHz   8-256  32-1024   3    2
    Fuj. VP300   1996  100 MHz   8-256  32-1024   3    2
    NEC SX/2     1984  160 MHz   8+8K   256+var   16   8
    NEC SX/3     1995  400 MHz   8+8K   256+var   16   8
Vector Linpack Performance (MFLOPS)
[Performance table lost in extraction; the benchmark is matrix inverse (Gaussian elimination).]
[Figure: vector register files, 32 registers each: vector registers vr0-vr31 (#vdw bits per element), vector control registers vcr0-vcr31 (32 bits each), and flag registers vf0-vf31 (1 bit per element).]
Vector Implementation
• Vector register file: each register is an array of elements.
• The size of each register determines the maximum vector length.
• A vector length register determines the vector length for a particular operation.
• Multiple parallel execution units = "lanes" (sometimes called "pipelines" or "pipes"); see the figure and sketch below.
[Figure: 4 lanes, 2 vector functional units.]
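A sketch of the usual element-to-lane mapping (round-robin assignment, an assumption for illustration): with L lanes, lane i mod L handles element i, so a 64-element operation keeps 4 lanes busy for about 16 cycles:

    /* Round-robin mapping of vector elements onto lanes. */
    #include <stdio.h>
    int main(void) {
        enum { LANES = 4, VL = 64 };
        for (int i = 0; i < VL; i++) {
            int lane  = i % LANES;    /* lane that executes element i */
            int cycle = i / LANES;    /* issue cycle it occupies      */
            if (i < 8)
                printf("element %2d -> lane %d, cycle %d\n", i, lane, cycle);
        }
        printf("total cycles ~= VL/LANES = %d\n", VL / LANES);
        return 0;
    }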
Tentative VIRAM-1 Floorplan
[Floorplan figure; the recoverable details:]
• Memory: 0.18 µm DRAM, 32 MB in 16 banks x 256b, 128 subbanks (two blocks of 128 Mbits / 16 MBytes each)
• Logic: 0.25 µm, 5 metal layers
• CPU + caches: 200 MHz MIPS, 16K I$, 16K D$
• Vector unit: 4 vector pipes/lanes, 4 x 200 MHz FP/integer vector units
• Ring-based switch, I/O
• Die: 16 x 16 mm; transistors: 270M; power: 2 Watts
Vector Execution Time
Time = f(vector length, data dependencies, structural hazards)
• Initiation rate: the rate at which an FU consumes vector elements (= number of lanes; usually 1 or 2 on the Cray T-90).
• Convoy: a set of vector instructions that can begin execution in the same clock (no structural or data hazards).
• Chime: the approximate time for one vector operation.
• m convoys take m chimes; if each vector length is n, they take approximately m x n clock cycles (this ignores overhead but is a good approximation for long vectors).
    1: LV    V1,Rx       ;load vector X
    2: MULV  V2,F0,V1    ;vector-scalar mult.
       LV    V3,Ry       ;load vector Y
    3: ADDV  V4,V2,V3    ;add
    4: SV    Ry,V4       ;store the result

4 convoys, 1 lane, VL = 64 => 4 x 64 = 256 clocks (or 4 clocks per result).
DLXV Start-up Time
• Start-up time: pipeline latency time (depth of the FU pipeline) + other sources of overhead.

    Operation          Start-up penalty (from CRAY-1)
    Vector load/store  12
    Vector multiply    7
    Vector add         6

Assume convoys don't overlap; vector length = n:

    Convoy        Start    1st result  Last result
    1. LV         0        12          11+n  (12+n-1)
    2. MULV, LV   12+n     12+n+7      18+2n  (multiply start-up)
                  12+n+1   12+n+13     24+2n  (load start-up)
    3. ADDV       25+2n    25+2n+6     30+3n  (waits for convoy 2)
    4. SV         31+3n    31+3n+12    42+4n  (waits for convoy 3)
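The table can be reproduced mechanically (a sketch; convoy 2's effective penalty is 13 because its LV issues one clock after the MULV):

    /* Completion times for the 4-convoy DAXPY, assuming convoys do not
       overlap: each convoy starts the clock after the previous one drains. */
    #include <stdio.h>
    int main(void) {
        int n = 64;                        /* vector length           */
        int penalty[] = {12, 13, 6, 12};   /* LV; MULV+LV; ADDV; SV   */
        int start = 0, last = 0;
        for (int c = 0; c < 4; c++) {
            last = start + penalty[c] + n - 1;
            printf("convoy %d: start %3d, last result %3d\n", c + 1, start, last);
            start = last + 1;
        }
        return 0;   /* n=64: last result at 42+4n = 298, vs. 256 without start-up */
    }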
Why start-up time for each vector instruction?
Why not overlap the start-up time of back-to-back vector instructions? Cray machines were built from many ECL chips operating at high clock rates, which made overlap hard to do. The Berkeley vector design ("T0") didn't know it wasn't supposed to do overlap, so it has no start-up times for functional units (except load).
Vector Load/Store Units & Memories
• Start-up overheads are usually longer for LSUs.
• The memory system must sustain (# lanes x word) per clock cycle.
• Many vector processors use banks (versus simple interleaving) to:
  1) support multiple loads/stores per cycle => multiple banks & the ability to address banks independently
  2) support non-sequential accesses (discussed soon)
• Note: the number of memory banks must exceed the memory latency to avoid stalls. m banks deliver m words per memory latency of l clocks; if m < l, there is a gap in the memory pipeline:

    clock:  0 ... l    l+1  l+2 ... l+m-1   l+m ... 2l
    word:   -- ... 0    1    2  ...  m-1     --  ...  m

• SRAM designs may have as many as 1024 banks.
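The gap is easy to see in a sketch (illustrative numbers: m = 4 banks, latency l = 6 clocks, so m < l):

    /* Delivery schedule for sequential words from m banks with latency l:
       word i comes from bank i%m and is ready at clock (i/m)*l + (i%m) + l. */
    #include <stdio.h>
    int main(void) {
        int m = 4, l = 6;
        for (int i = 0; i <= 2 * m; i++)
            printf("word %d ready at clock %d\n", i, (i / m) * l + (i % m) + l);
        /* words 0-3 arrive at clocks 6-9, then a gap until word 4 at clock 12;
           with m >= l the stream would flow at one word per clock. */
        return 0;
    }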
Vector Length
What to do when the vector length is not exactly 64?
• A vector-length register (VLR) controls the length of any vector operation, including a vector load or store (it cannot exceed the length of the vector registers).

      do 10 i = 1, n
 10   Y(i) = a * X(i) + Y(i)

We don't know n until runtime! What if n > the maximum vector length (MVL)?
Strip Mining
Suppose the vector length > the maximum vector length (MVL)?
• Strip mining: generation of code such that each vector operation is done for a size <= MVL.
• The first strip does the short piece (n mod MVL); all remaining strips use VL = MVL. The price is the outer-loop overhead (see the C sketch after the code).

      low = 1
      VL = (n mod MVL)          /*find the odd size piece*/
      do 1 j = 0,(n / MVL)      /*outer loop*/
        do 10 i = low,low+VL-1  /*runs for length VL*/
          Y(i) = a*X(i) + Y(i)  /*main operation*/
 10     continue
        low = low+VL            /*start of next vector*/
        VL = MVL                /*reset the length to max*/
  1   continue
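The same strip-mined loop in C (a minimal sketch; the inner loop stands in for one vector operation of length vl):

    /* Strip-mined DAXPY: no strip exceeds MVL; the first strip handles
       the odd-size piece (n mod MVL). */
    #include <stddef.h>
    #define MVL 64                       /* maximum vector length */
    void daxpy_strip(size_t n, double a, const double *x, double *y) {
        size_t low = 0;
        size_t vl  = n % MVL;            /* odd-size first piece (may be 0) */
        while (low < n) {
            if (vl == 0) vl = MVL;       /* skip an empty first strip */
            for (size_t i = low; i < low + vl; i++)
                y[i] = a * x[i] + y[i];  /* one vector op of length vl */
            low += vl;
            vl = MVL;                    /* reset the length to max */
        }
    }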
Common Vector Metrics
• R∞: the MFLOPS rate on an infinite-length vector.

MMX (Intel's multimedia extension)
Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, communications, ...; used in drivers or added to library routines; no compiler support.
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, or 2 32b
  - optional signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, or 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, or 2 32b
  - sets each field to 0s (false) or 1s (true); removes branches
• Pack/Unpack: convert 32b <-> 16b, 16b <-> 8b
  - Pack saturates (sets to max) if the number is too large
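What "add with unsigned saturate" means, spelled out in scalar C (a sketch of the semantics only; paddusb_like is a hypothetical name, not a real intrinsic):

    /* MMX-style saturating byte add: 8 independent 8-bit lanes packed in a
       64-bit word; each lane clamps to 255 instead of wrapping on overflow. */
    #include <stdint.h>
    uint64_t paddusb_like(uint64_t a, uint64_t b) {
        uint64_t r = 0;
        for (int lane = 0; lane < 8; lane++) {
            uint32_t x = (a >> (8 * lane)) & 0xFF;
            uint32_t y = (b >> (8 * lane)) & 0xFF;
            uint32_t s = x + y;
            if (s > 255) s = 255;        /* saturate (set to max) */
            r |= (uint64_t)s << (8 * lane);
        }
        return r;
    }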
Vectors and Variable Data Width
• The programmer thinks in terms of vectors of data of some width (8, 16, 32, or 64 bits).
• Good for multimedia, and more elegant than MMX-style extensions:
  - no need to worry about how data is stored in hardware
  - no need for explicit pack/unpack operations
  - just think of more virtual processors operating on narrower data
• Expand the maximum vector length as the data width decreases: 64 x 64-bit, 128 x 32-bit, 256 x 16-bit, 512 x 8-bit (see the sketch below).
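The scaling is just register bits divided by element width (a sketch assuming a 64 x 64-bit register file, i.e. 4096 bits per vector register):

    /* Maximum vector length vs. element width for a 4096-bit register. */
    #include <stdio.h>
    int main(void) {
        const int reg_bits = 64 * 64;    /* 64 elements x 64 bits */
        int widths[] = {64, 32, 16, 8};
        for (int i = 0; i < 4; i++)
            printf("%2d-bit elements: MVL = %d\n", widths[i], reg_bits / widths[i]);
        return 0;    /* prints 64, 128, 256, 512 */
    }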
Mediaprocessing: Vectorizable? Vector Lengths?

    Kernel                               Vector length
    Matrix transpose/multiply            # vertices at once
    DCT (video, communication)           image width
    FFT (audio)                          256-1024
    Motion estimation (video)            image width, iw/16
    Gamma correction (video)             image width
    Haar transform (media mining)        image width
    Median filter (image processing)     image width
    Separable convolution (img. proc.)   image width

(from Pradeep Dubey - IBM, http://www.research.ibm.com/people/p/pradeep/tutor.html)
Vector Pitfalls
• Pitfall: concentrating on peak performance and ignoring start-up overhead, e.g. Nv (the vector length needed to run faster than scalar) > 100 on the CDC STAR.
• Pitfall: increasing vector performance without comparable increases in scalar performance (Amdahl's Law); the failure of a Cray competitor from his former company.
• Pitfall: good processor vector performance without providing good memory bandwidth (MMX?).
Vector Advantages
• Easy to get high performance; the N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in the same order as previous instructions
  - access contiguous memory words or a known pattern
  - can exploit large memory bandwidth
  - hide memory latency (and any other latency)
• Scalable: get higher performance as more HW resources become available.
• Compact: describe N operations with 1 short instruction (vs. VLIW).
• Predictable (real-time) performance vs. statistical performance (cache).
• Multimedia ready: choose N x 64b, 2N x 32b, 4N x 16b, 8N x 8b.
• Mature, developed compiler technology.
• Vector disadvantage: out of fashion.
Vectors Are Inexpensive

Scalar: N ops per cycle => O(N^2) circuitry
  HP PA-8000: 4-way issue; the reorder buffer alone is 850K transistors, including 6,720 5-bit register number comparators.

Vector: N ops per cycle => O(N) circuitry
  T0 vector micro*: 24 ops per cycle; 730K transistors total; only 23 5-bit register number comparators; no floating point.

[Figure: MIPS R10000 vs. T0 die comparison.]
*See http://www.icsi.berkeley.edu/real/spert/t0-intro.html
Vectors Lower Power

Single-issue scalar:
• One instruction fetch, decode, dispatch per operation
• Arbitrary register accesses add area and power
• Loop unrolling and software pipelining for high performance increase the instruction cache footprint
• All data passes through the cache; power is wasted if there is no temporal locality
• One TLB lookup per load or store
• Off-chip access is in whole cache lines

Vector:
• One instruction fetch, decode, dispatch per vector
• Structured register accesses
• Smaller code for high performance; less power in instruction cache misses
• Bypass the cache
• One TLB lookup per group of loads or stores
• Move only necessary data across the chip boundary
Superscalar Energy Efficiency Even Worse

Superscalar:
• Control logic grows quadratically with issue width
• Control logic consumes energy regardless of available parallelism
• Speculation to increase visible parallelism wastes energy

Vector:
• Control logic grows linearly with issue width
• Vector unit switches off when not in use
• Vector instructions expose parallelism without speculation
• Software control of speculation when desired: whether to use a vector mask or compress/expand for conditionals
VLIW/Out-of-Order versus Modest Scalar+Vector
[Figure: performance comparison chart (y-axis: performance, up to 100); the rest of the slide was lost in extraction.]