Computer Organization & Architecture (3140707), Unit 3
PIPELINING AND VECTOR PROCESSING
• Parallel Processing
• Pipelining
• Arithmetic Pipeline
• Instruction Pipeline
• RISC Pipeline
• Vector Processing
• Array Processors
PARALLEL PROCESSING
Execution of concurrent events in the computing process to achieve faster computational speed. Parallelism can be exploited at two levels:
- Inter-instruction level (among different instructions)
- Intra-instruction level (within a single instruction)
PARALLEL COMPUTERS
Architectural Classification
• Flynn's classification
  - Based on the multiplicity of instruction streams and data streams
  - Instruction stream: the sequence of instructions read from memory
  - Data stream: the operations performed on the data in the processor
SISD COMPUTER SYSTEMS
[Figure: Control Unit, Processor, and Memory in series; a single instruction stream and a single data stream]
Characteristics
- Standard von Neumann machine
- Instructions and data are stored in memory
- One operation at a time
Limitations
- Von Neumann bottleneck: a limitation on throughput caused by the standard personal computer architecture
PERFORMANCE IMPROVEMENTS
• Multiprogramming
• Spooling
• Multifunction processor
• Pipelining
• Exploiting instruction-level parallelism
- Superscalar
- Superpipelining
- VLIW (Very Long Instruction Word)
MISD COMPUTER SYSTEMS
[Figure: multiple control units (CU), each issuing its own instruction stream to a processor (P); all processors operate on a single data stream from memory (M)]
Characteristics
- There is no computer at present that can be classified as MISD
SIMD COMPUTER SYSTEMS
[Figure: one Control Unit issues a single instruction stream to multiple processor elements; each element handles its own data stream through an alignment network to the memory modules]
Characteristics
- Only one copy of the program exists; a single controller executes one instruction at a time
Examples
- Systolic arrays
- Associative processors
  - Content addressing
  - Data transformation operations over many sets of arguments with a single instruction
  - STARAN, PEPE
MIMD COMPUTER SYSTEMS
[Figure: processor-memory pairs (P, M) connected through an Interconnection Network to a Shared Memory]
Characteristics
- Types of MIMD systems: shared-memory multiprocessors and message-passing multicomputers
SHARED MEMORY MULTIPROCESSORS
[Figure: processors (P) connected to memory modules (M) through an Interconnection Network (IN): buses, a multistage IN, or a crossbar switch]
Characteristics
- All processors have equally direct access to one large memory address space
Example systems
- Bus and cache-based systems: Sequent Balance, Encore Multimax
- Multistage IN-based systems: Ultracomputer, Butterfly, RP3, HEP
- Crossbar switch-based systems: C.mmp, Alliant FX/8
Limitations
- Memory access latency
- Hot spot problem
MESSAGE-PASSING MULTICOMPUTER
[Figure: processors (P), each with its own local memory (M), connected by a message-passing network with point-to-point connections]
Characteristics
- Interconnected computers
- Each processor has its own memory and communicates via message passing
Example systems
Limitations
- Communication overhead
- Hard to program
[Figure: classification tree of parallel computer architectures; recoverable labels include VLIW, MISD (nonexistence), systolic arrays, dataflow, associative processors, and message-passing multicomputers with hypercube, mesh, and reconfigurable topologies]
PIPELINING
A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.
Ai * Bi + Ci for i = 1, 2, 3, ... , 7
[Figure: three-segment pipeline; operands Ai, Bi, Ci flow from memory through registers R1-R5]
Segment 1: R1 <- Ai, R2 <- Bi
Segment 2: R3 <- R1 * R2, R4 <- Ci
Segment 3: R5 <- R3 + R4
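A minimal Python sketch (illustrative only; the lists and latch variables are assumptions, not slide material) of how these three segments overlap in time. The inter-segment latches stand in for registers R1-R5, so up to three tasks are in flight at once; for simplicity Ci is carried along from segment 1 rather than read in segment 2.

    A = [1, 2, 3, 4, 5, 6, 7]
    B = [7, 6, 5, 4, 3, 2, 1]
    C = [1, 1, 1, 1, 1, 1, 1]

    latch1 = latch2 = None   # (R1, R2, Ci) and (R3, R4) inter-segment latches
    results = []             # R5 values, one per completed task

    for clock in range(len(A) + 2):      # k + (n - 1) = 3 + 6 = 9 cycles
        # Segment 3: R5 <- R3 + R4 (uses the latch written last cycle)
        if latch2 is not None:
            r3, r4 = latch2
            results.append(r3 + r4)
        # Segment 2: R3 <- R1 * R2, R4 <- Ci
        if latch1 is not None:
            r1, r2, ci = latch1
            latch2 = (r1 * r2, ci)
        else:
            latch2 = None
        # Segment 1: R1 <- Ai, R2 <- Bi (one new task per clock)
        latch1 = (A[clock], B[clock], C[clock]) if clock < len(A) else None

    assert results == [a * b + c for a, b, c in zip(A, B, C)]

All seven tasks finish after 9 clock cycles, matching k + (n - 1) below.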
GENERAL PIPELINE
[Figure: four segments S1-S4 in series, each followed by a register R1-R4; input enters S1]
Space-Time Diagram

Clock cycles   1   2   3   4   5   6   7   8   9
Segment 1      T1  T2  T3  T4  T5  T6
Segment 2          T1  T2  T3  T4  T5  T6
Segment 3              T1  T2  T3  T4  T5  T6
Segment 4                  T1  T2  T3  T4  T5  T6

To complete n tasks using a k-segment pipeline, the number of clock cycles required is k + (n - 1).
PIPELINE SPEEDUP
Sk = (time on the non-pipelined machine) / (time on the pipelined machine)
Sk = n * tn / ((k + n - 1) * tp)
If n is much larger than k - 1:
Sk = tn / tp   ( = k, if tn = k * tp)
Example: k = 4 segments, tp = 20 ns, n = 100 tasks
Pipelined system: (k + n - 1) * tp = (4 + 99) * 20 = 2060 ns
Non-pipelined system: n * k * tp = 100 * 4 * 20 = 8000 ns
Speedup: Sk = 8000 / 2060 = 3.88
[Figure: the same 4x speedup can also be approached with multiple functional units in parallel]
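The arithmetic above can be checked with a few lines of Python (a sketch, not part of the original material):

    k, tp, n = 4, 20, 100
    tn = k * tp                        # 80 ns per task without pipelining

    pipelined     = (k + n - 1) * tp   # (4 + 99) * 20 = 2060 ns
    non_pipelined = n * tn             # 100 * 80     = 8000 ns
    speedup       = non_pipelined / pipelined

    print(pipelined, non_pipelined, round(speedup, 2))   # 2060 8000 3.88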
ARITHMETIC PIPELINE
Floating-point adder: X = A x 2^a, Y = B x 2^b
[Figure: four-segment pipeline; a register R separates each pair of segments]
Segment 1: Compare the exponents by subtraction
Segment 2: Align the mantissas
Segment 3: Add or subtract the mantissas
Segment 4: Normalize the result (adjust the exponent)

Example: X = 0.9504 x 10^3, Y = 0.8200 x 10^2
[1] Compare the exponents: 3 - 2 = 1
[2] Align the mantissas: Y = 0.0820 x 10^3
[3] Add the mantissas: Z = 1.0324 x 10^3
[4] Normalize the result: Z = 0.10324 x 10^4
Stages (exponents p, q; fractions of the two operands):
S1: Exponent subtractor computes t = |p - q|; the fraction selector picks the fraction with min(p, q); r = max(p, q)
S2: Right shifter aligns the selected fraction by t places; the fraction adder combines it with the other fraction to give c; on overflow the mantissa is shifted right and the exponent incremented by one
S3: Leading-zero counter and left shifter; on underflow the mantissa is shifted left and the exponent decremented (determined by the number of leading zeros), giving d
S4: Exponent adder produces s
C = A + B = c x 2^r = d x 2^s   (r = max(p, q), 0.5 <= d < 1)
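A small Python sketch of the four steps on the decimal example above (illustrative only; real hardware operates on binary fractions, and the helper fp_add is a made-up name):

    def fp_add(m1, e1, m2, e2):
        # Segment 1: compare exponents by subtraction
        if e1 < e2:
            m1, e1, m2, e2 = m2, e2, m1, e1
        t = e1 - e2
        # Segment 2: align the mantissa with the smaller exponent
        m2 = m2 / (10 ** t)            # shift right t decimal places
        # Segment 3: add the mantissas
        m, e = m1 + m2, e1
        # Segment 4: normalize (keep 0.1 <= mantissa < 1)
        while m >= 1.0:                # overflow: shift right, bump exponent
            m, e = m / 10, e + 1
        while 0 < m < 0.1:             # underflow: shift left, drop exponent
            m, e = m * 10, e - 1
        return m, e

    print(fp_add(0.9504, 3, 0.8200, 2))   # ~ (0.10324, 4): Z = 0.10324 x 10^4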
INSTRUCTION CYCLE
Four-segment instruction pipeline:
- FI: Fetch an instruction from memory
- DA: Decode the instruction and calculate the effective address
- FO: Fetch the operand
- EX: Execute the instruction
* Effective address calculation can be done as part of the decoding phase
INSTRUCTION PIPELINE

Conventional (one instruction completes before the next begins):
i     FI DA FO EX
i+1               FI DA FO EX
i+2                           FI DA FO EX

Pipelined (overlapped execution):
i     FI DA FO EX
i+1      FI DA FO EX
i+2         FI DA FO EX
MAJOR HAZARDS IN PIPELINED EXECUTION
Structural hazards (resource conflicts)
• Occur when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute
Data hazards (data dependency conflicts)
• An instruction depends on the result of a previous instruction that is still in the pipeline (treated in detail below)
Control hazards
• Branches and other instructions that change the PC delay the fetch of the next instruction
DATA HAZARDS
Interlock
- Hardware detects the data dependencies and delays the scheduling of the dependent instruction by stalling enough clock cycles
Forwarding (bypassing, short-circuiting)
- Accomplished by a data path that routes a value from a source (usually an ALU) to a user, bypassing a designated register. This allows the value produced to be used at an earlier stage in the pipeline than would otherwise be possible
Software technique
- Instruction scheduling (by the compiler) for delayed load
FORWARDING HARDWARE
Example: 3-stage pipeline
I: Instruction fetch
A: Decode, read registers, ALU operations
E: Write the result to the destination register
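A toy Python sketch of the two hardware responses to a load-use dependency in this 3-stage pipeline (the instruction tuples are an assumed encoding, not from the slides). When an instruction in its A stage reads a register that the instruction just ahead of it only writes in its E stage, the hardware must either interlock (stall) or forward the value:

    instrs = [
        ("LOAD", "R1", ()),            # writes R1 in its E stage
        ("ADD",  "R3", ("R1", "R2")),  # reads R1 in its A stage, one cycle too early
    ]

    def hazards(program):
        # compare each adjacent pair: producer's destination vs consumer's sources
        for prev, cur in zip(program, program[1:]):
            if prev[1] in cur[2]:
                yield prev[0], cur[0], prev[1]

    for producer, consumer, reg in hazards(instrs):
        print(consumer, "needs", reg, "from", producer,
              "-> interlock (stall one cycle) or forward from E into A")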
CONTROL HAZARDS
Branch instructions: the branch target address is not known until the branch instruction completes, so the pipeline stalls. Techniques for dealing with control hazards:
Prefetch Target Instruction
– Fetch instructions from both streams, branch not taken and branch taken
– Both are saved until the branch is executed; then the correct instruction stream is selected and the wrong one discarded
Branch Target Buffer (BTB; associative memory)
– Entry: the address of a previously executed branch, plus its target instruction and the next few instructions
– When fetching an instruction, search the BTB
– If found, fetch the instruction stream from the BTB
– If not, fetch the new stream and update the BTB
Loop Buffer (high-speed register file)
– Stores an entire loop so that it can be executed without accessing memory
Branch Prediction
– Guess the branch outcome and fetch the instruction stream based on the guess; a correct guess eliminates the branch penalty (a small predictor sketch follows this list)
Delayed Branch
– Compiler detects the branch and rearranges the instruction sequence
by inserting useful instructions that keep the pipeline busy
in the presence of a branch instruction
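As a sketch of the branch-prediction idea above (the one-bit, last-outcome scheme and the table layout are assumptions, not the BTB from the slides):

    predictor = {}                        # branch PC -> last observed outcome

    def predict(pc):
        return predictor.get(pc, False)   # default: predict not taken

    def update(pc, taken):
        predictor[pc] = taken             # remember the most recent outcome

    outcomes = [True, True, True, False, True]   # e.g. a loop branch
    penalties = 0
    for taken in outcomes:
        if predict(0x100) != taken:       # wrong guess: pay the branch penalty
            penalties += 1
        update(0x100, taken)
    print(penalties, "mispredictions out of", len(outcomes))   # 3 out of 5

Each misprediction costs the cycles needed to flush the wrongly fetched stream; a correct guess costs nothing.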
RISC PIPELINE
RISC
- A machine with a very fast clock cycle that executes at the rate of one instruction per cycle, made possible by:
  - A simple instruction set
  - A fixed-length instruction format
  - Register-to-register operations
DELAYED LOAD
LOAD:  R1 <- M[address 1]
LOAD:  R2 <- M[address 2]
ADD:   R3 <- R1 + R2
STORE: M[address 3] <- R3
Three-segment pipeline timing

Pipeline timing with data conflict:

clock cycle   1  2  3  4  5  6
Load R1       I  A  E
Load R2          I  A  E
Add R1+R2           I  A  E
Store R3               I  A  E

Pipeline timing with delayed load:

clock cycle   1  2  3  4  5  6  7
Load R1       I  A  E
Load R2          I  A  E
NOP                 I  A  E
Add R1+R2              I  A  E
Store R3                  I  A  E

The data dependency is taken care of by the compiler (which inserts the NOP) rather than by the hardware.
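A Python sketch of that compiler-side scheduling: scan the program and insert a NOP whenever an instruction consumes a register loaded by the immediately preceding LOAD (the tuple encoding is an assumed toy format):

    def schedule(program):
        out = []
        for instr in program:
            op, dst, srcs = instr
            if out:
                prev_op, prev_dst, _ = out[-1]
                if prev_op == "LOAD" and prev_dst in srcs:
                    out.append(("NOP", None, ()))   # fill the load delay slot
            out.append(instr)
        return out

    prog = [
        ("LOAD",  "R1", ()),
        ("LOAD",  "R2", ()),
        ("ADD",   "R3", ("R1", "R2")),
        ("STORE", "M3", ("R3",)),
    ]
    for op, dst, srcs in schedule(prog):
        print(op, dst, srcs)    # a NOP appears between the second LOAD and the ADD

The output matches the second timing table above: one NOP between Load R2 and Add R1+R2.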
DELAYED BRANCH
Compiler analyzes the instructions before and after the branch and
rearranges the program sequence by inserting useful instructions in the
delay steps
VECTOR PROCESSING
VECTOR PROGRAMMING
Conventional (scalar) computer:
   Initialize I = 0
20 Read A(I)
   Read B(I)
   Store C(I) = A(I) + B(I)
   Increment I = I + 1
   If I <= 100 go to 20

Vector computer:
   C(1:100) = A(1:100) + B(1:100)
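Both forms can be sketched in Python (an illustration; numpy stands in for the vector hardware and is an assumption, not slide material). The scalar loop issues one add per iteration, while the vector statement expresses all 100 additions as a single operation:

    import numpy as np

    A = np.arange(100, dtype=float)
    B = np.ones(100)

    # Conventional (scalar) form: one element per iteration
    C = np.empty(100)
    for i in range(100):
        C[i] = A[i] + B[i]

    # Vector form: C(1:100) = A(1:100) + B(1:100)
    C_vec = A + B

    assert np.array_equal(C, C_vec)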
Address Interleaving
[Figure: an address source feeds multiple memory modules, each with its own address register (AR) and data register (DR), connected to a common data bus]
ARRAY PROCESSORS
• The master control unit controls all the operations of the processor elements (PEs). It also decodes the instructions and determines how each instruction is to be executed.
• Scalar and program-control instructions are executed directly within the master control unit.
• The main memory is used for storing the program, and the control unit is responsible for fetching the instructions. Vector instructions are sent to all PEs simultaneously, and the results are returned to memory.
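A toy Python sketch of this organization: a function standing in for the master control unit broadcasts one operation, and every PE applies it in lockstep to its own slice of the data (the PE count and the even slicing are assumptions):

    def broadcast(op, A, B, n_pes=4):
        # master control unit: decode once, send the same op to every PE
        results = [None] * len(A)
        chunk = len(A) // n_pes          # assumes len(A) divisible by n_pes
        for pe in range(n_pes):          # each PE handles its own slice
            lo, hi = pe * chunk, (pe + 1) * chunk
            for i in range(lo, hi):      # PE-local data stream
                results[i] = op(A[i], B[i])
        return results

    C = broadcast(lambda a, b: a + b, list(range(8)), [10] * 8)
    print(C)   # [10, 11, 12, 13, 14, 15, 16, 17]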