
Computer Architecture

PCC-CS402

Day: 05
Date: 13.02.2025
Topics to be Covered:

• LINEAR PIPELINE PROCESSORS


• Asynchronous Model
• Synchronous Model
• Reservation table
• Clocking and Timing Control
• Clock Skewing
Learning Objective:

• To learn about pipelining architecture and its various formations used to increase the
performance of computer systems.
Course Outcome:

PCC-CS402.1: Learn pipelining concepts with prior knowledge of stored-program methods
Pipelining:

• LINEAR PIPELINE PROCESSORS:


• A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a fixed
function over a stream of data flowing from one end to the other.
• In modern computers, linear pipelines are applied for instruction execution, arithmetic computation, and
memory-access operations.

• A linear pipeline processor is constructed with k processing stages. External inputs (operands) are fed into the
pipeline at the first stage S1. The processed results are passed from stage Si to stage Si+1, for all i = 1, 2, ..., k-1.
The final result emerges from the pipeline at the last stage Sk.
• Depending on the control of data flow along the pipeline, we model linear pipelines in two categories:
asynchronous and synchronous.
Asynchronous Model:

As shown in the figure, data flow between adjacent stages in an asynchronous pipeline is controlled by a handshaking
protocol. When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the
incoming data, it returns an acknowledge signal to Si.

Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers
where pipelined wormhole routing is practiced. Asynchronous pipelines may have a variable throughput
rate, since different amounts of delay may be experienced in different stages.
Synchronous Model:

Synchronous pipelines are illustrated in the figure. Clocked latches are used to interface between stages. The
latches are made with master-slave flip-flops, which can isolate inputs from outputs. Upon the arrival of a
clock pulse, all latches transfer data to the next stage simultaneously.

The pipeline stages are combinational logic circuits. It is desirable to have approximately equal delays
in all stages; these delays determine the clock period and thus the speed of the pipeline.
Reservation table:

The utilization pattern of successive stages in a synchronous pipeline is specified by a reservation table.
For a linear pipeline, the utilization follows a diagonal streamline pattern.

This table is essentially a space-time diagram depicting the precedence relationship in using the pipeline stages.

For a k-stage linear pipeline, k clock cycles are needed for data to flow through the pipeline.

Successive tasks or operations are initiated one per cycle to enter the pipeline. Once the pipeline is filled up, one
result emerges from the pipeline for each additional cycle. This throughput is sustained only if the successive tasks
are independent of each other.
Clocking and Timing Control:

• The clock cycle τ of a pipeline is determined as follows.
• Let τi be the time delay of the circuitry in stage Si and d the time delay of a latch.
• Denote the maximum stage delay as τm = max{τi}. The clock cycle can then be written as

τ = max{τi} + d = τm + d    (1)

• At the rising edge of the clock pulse, the data is latched into the master flip-flops of each latch register.
The clock pulse has a width equal to d. In general, τm >> d by one to two orders of magnitude. This
implies that the maximum stage delay τm dominates the clock period.
• The pipeline frequency is defined as the inverse of the clock period:

f = 1/τ    (2)

• If one result is expected to come out of the pipeline per cycle, f represents the maximum throughput
of the pipeline. Depending on the initiation rate of successive tasks entering the pipeline, the actual
throughput of the pipeline may be lower than f, because more than one clock cycle may elapse
between successive task initiations.
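Equations (1) and (2) can be sketched directly in Python; the stage and latch delays below are illustrative values, not taken from the text:

```python
def clock_period(stage_delays, latch_delay):
    """Eq. (1): tau = max(tau_i) + d."""
    return max(stage_delays) + latch_delay

def frequency(tau):
    """Eq. (2): f = 1 / tau, the maximum throughput."""
    return 1.0 / tau

# Hypothetical three-stage pipeline; delays in ns
tau = clock_period([9.0, 10.0, 8.5], latch_delay=0.5)  # -> 10.5 ns
f = frequency(tau)                                     # results per ns
```

Note how the longest stage (10.0 ns) dominates the clock period, as the text states.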
Clock Skewing:

• Ideally, we expect the clock pulses to arrive at all stages (latches) at the same time.
However, due to a problem known as clock skew, the same clock pulse may
arrive at different stages with a time offset of s. Let tmax be the time delay of the
longest logic path within a stage and tmin that of the shortest logic path within a
stage.
• To avoid a race in two successive stages, we must choose

τm ≥ tmax + s  and  d ≤ tmin − s

• These constraints translate into the following bounds on the clock period when
clock skew takes effect:

d + tmax + s ≤ τ ≤ τm + tmin − s

• In the ideal case s = 0, tmax = τm, and tmin = d. Thus we have τ = τm + d, consistent
with the definition in (1) without the effect of clock skew.
Speedup:

• Ideally, a linear pipeline of k stages can process n tasks in k + (n−1) clock cycles:
• k cycles are needed to complete the execution of the very first task, and the remaining n−1 tasks
require one cycle each. Thus the total time required is

Tk = [k + (n−1)]τ

• where τ is the clock period. Consider an equivalent-function non-pipelined processor which has a flow-through
delay of kτ. The amount of time it takes to execute n tasks on this non-pipelined processor is
T1 = nkτ.
• The speedup factor of a k-stage pipeline over an equivalent non-pipelined processor is
defined as

Sk = T1/Tk = nkτ / [k + (n−1)]τ = nk / (k + n − 1)

• The maximum speedup is Sk → k as n → ∞. This maximum speedup is very difficult to achieve because of
data dependences between successive tasks (instructions), program branches, interrupts, and other factors.
Speedup (continued):

The figure plots the speedup factor as a function of n, the number of tasks
(operations or instructions) performed by the pipeline. For small values
of n, the speedup can be very poor. The smallest value of Sk is 1 when
n = 1.
The larger the number k of subdivided pipeline stages, the higher the
potential speedup performance. When n = 64, an eight-stage pipeline
has a speedup value of 7.1 and a four-stage pipeline has a speedup of
3.7. However, the number of pipeline stages cannot increase indefinitely
due to practical constraints on cost, control complexity, circuit
implementation, and packaging limitations. Furthermore, the stream
length n also affects the speedup; the longer, the better in using a
pipeline.
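The speedup formula can be checked numerically; a small sketch (the n and k values are arbitrary):

```python
def speedup(n, k):
    """S_k = n*k / (k + n - 1): k-stage pipeline over non-pipelined."""
    return n * k / (k + n - 1)

# S_k = 1 when only one task is issued
assert speedup(1, 8) == 1.0
# S_k approaches k as the number of tasks n grows
print(speedup(64, 8))     # ~7.2, near the plotted value of 7.1
print(speedup(10**6, 8))  # ~8.0
```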

Speedup factors of pipeline stages of a linear pipeline unit


Optimal Number of Stages:
• Let t be the total time required for a non-pipelined sequential
program of a given function.
• To execute the same program on a k-stage pipeline with an
equal flow-through delay t, one needs a clock period of
p = t/k + d, where d is the latch delay.
Thus, the pipeline has a maximum throughput of f = 1/p =
1/(t/k + d).
• The total pipeline cost is roughly estimated by c + kh, where
c covers the cost of all logic stages and h represents the
cost of each latch.
• A pipeline performance/cost ratio (PCR) has been defined
by Larson:

PCR = f / (c + kh) = 1 / [(t/k + d)(c + kh)]

The figure plots the PCR as a function of k. The peak of the PCR curve
corresponds to an optimal choice for the number of desired
pipeline stages:

k0 = √(tc / (dh))

where t is the total flow-through delay of the pipeline. Thus the total
stage cost c, the latch delay d, and the latch cost h must all be considered
to achieve the optimal value k0.

Optimal number of pipeline stages of a linear pipeline unit
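A numerical check that the PCR really peaks at k0; the cost and delay parameters below are hypothetical:

```python
import math

def pcr(k, t, d, c, h):
    """Performance/cost ratio: f / (c + k*h) with f = 1/(t/k + d)."""
    return 1.0 / ((t / k + d) * (c + k * h))

def k_opt(t, d, c, h):
    """Peak of the PCR curve: k0 = sqrt(t*c / (d*h))."""
    return math.sqrt(t * c / (d * h))

t, d, c, h = 100.0, 1.0, 10.0, 1.0   # hypothetical values
k0 = k_opt(t, d, c, h)               # sqrt(1000) ~ 31.6
# PCR at the optimum beats PCR at much smaller or larger k
assert pcr(round(k0), t, d, c, h) > pcr(5, t, d, c, h)
assert pcr(round(k0), t, d, c, h) > pcr(100, t, d, c, h)
```

Maximizing PCR is the same as minimizing (t/k + d)(c + kh); differentiating that product with respect to k and setting it to zero yields k0.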
Efficiency and Throughput:

• The efficiency Ek of a linear k-stage pipeline is defined as

Ek = Sk / k = n / (k + n − 1)

• Obviously, the efficiency approaches 1 as n → ∞, and a lower bound on Ek is 1/k when n = 1.
• The pipeline throughput Hk is defined as the number of tasks (operations) performed per unit time:

Hk = n / [(k + n − 1)τ] = Ek · f = Ek / τ = Sk / (kτ)

The maximum throughput f occurs when Ek → 1 as n → ∞.
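The efficiency and throughput definitions can be sketched the same way; the n, k, and τ values are arbitrary:

```python
def efficiency(n, k):
    """E_k = S_k / k = n / (k + n - 1)."""
    return n / (k + n - 1)

def throughput(n, k, tau):
    """H_k = n / ((k + n - 1) * tau) = E_k * f."""
    return n / ((k + n - 1) * tau)

assert efficiency(1, 4) == 0.25                    # lower bound 1/k at n = 1
# H_k approaches f = 1/tau as n grows
assert abs(throughput(10**6, 4, 2.0) - 0.5) < 1e-4
```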


Reservation and Latency Analysis:

• In a static pipeline, it is relatively easy to partition a given function into a sequence of linearly ordered
subfunctions. However, function partitioning in a dynamic pipeline becomes quite complex because the
pipeline stages are interconnected with loops in addition to streamline connections.

A multifunction dynamic pipeline is shown in Fig. a. This pipeline has three stages. Besides the streamline
connections from S1 to S2 and from S2 to S3, there is a feedforward connection from S1 to S3 and two
feedback connections from S3 to S2 and from S3 to S1.
These feedforward and feedback connections make the scheduling of successive events into the pipeline
a nontrivial task. With these connections, the output of the pipeline is not necessarily from the last stage. In
fact, following different dataflow patterns, one can use the same pipeline to evaluate different functions.
Reservation and Latency Analysis:

A dynamic pipeline with feedforward and feedback connections for two different
functions
Reservation Tables:
• The reservation table for a static linear pipeline is trivial in the
sense that dataflow follows a linear streamline. The reservation
table for a dynamic pipeline becomes more interesting because a
nonlinear pattern is followed.
• Given a pipeline configuration, multiple reservation tables can be
generated for the evaluation of different functions.
• Two reservation tables are given in Figs. b and c, corresponding to a
function X and a function Y, respectively. Each function evaluation is
specified by one reservation table.
• A static pipeline is specified by a single reservation table. A dynamic
pipeline may be specified by more than one reservation table.
Reservation Tables :

• Each reservation table displays the time‐space flow of data through the pipeline for one
function evaluation. Different functions follow different paths through the pipeline.
• The number of columns in a reservation table is called the evaluation time of a given
function. For example, the function X requires eight clock cycles to evaluate, and
function Y requires six cycles, as shown in Figs b and c, respectively.
• A pipeline initiation table corresponds to each function evaluation. All initiations to a
static pipeline use the same reservation table. On the other hand, a dynamic pipeline
may allow different initiations to follow a mix of reservation tables.
• The checkmarks in each row of the reservation table correspond to the time instants
(cycles) at which a particular stage will be used.
• There may be multiple checkmarks in a row, which means repeated usage of the same
stage in different cycles. Contiguous checkmarks in a row simply imply the extended
usage of a stage over more than one cycle. Multiple checkmarks in a column mean that
multiple stages need to be used in parallel during a particular clock cycle.
Latency Analysis:

• The number of time units (clock cycles) between two initiations of a pipeline is the latency
between them.
• Latency values must be nonnegative integers.
• A latency of k means that two initiations are separated by k clock cycles. Any attempt by
two or more initiations to use the same pipeline stage at the same time will cause a
collision.
• A collision implies resource conflicts between two initiations in the pipeline. Therefore, all
collisions must be avoided in scheduling a sequence of pipeline initiations.
• Some latencies will cause collisions, and some will not.
• Latencies that cause collisions are called forbidden latencies. In using the pipeline in
Fig. 1 to evaluate the function X, latencies 2 and 5 are forbidden, as illustrated in Fig. 2.
Latency Analysis:

Fig. 2. Collisions with forbidden latencies 2 and 5 in using the pipeline in Fig. 1 to evaluate the function X

The ith initiation is denoted as Xi in Fig. 2. With latency 2, initiations X1 and X2 collide in stage 2 at time
4. At time 7, these initiations collide in stage 3. Similarly, other collisions are shown at times 5, 6, 8, etc.
The collision patterns for latency 5 are shown in Fig. 2b, where X1 and X2 are scheduled 5 clock cycles
apart. Their first collision occurs at time 6.
Forbidden latency & Latency Sequence:

• To detect a forbidden latency, one simply checks the distance between every pair of checkmarks in the
same row of the reservation table.
• For example, the distance between the first mark and the second mark in one row of Fig. 1b is 5, implying that 5
is a forbidden latency.
• Similarly, latencies 2, 4, 5, and 7 are all seen to be forbidden from inspecting the same reservation table. From
the reservation table in Fig. 1c, we discover the forbidden latencies 2 and 4 for function Y.
• A latency sequence is a sequence of permissible (non-forbidden) latencies between successive task initiations.
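The row-distance check above is easy to mechanize. A minimal sketch, where a reservation table is represented as a mapping from stage name to the set of clock cycles in which it is used (the example table is hypothetical, not the one in Fig. 1):

```python
def forbidden_latencies(table):
    """A latency is forbidden iff it equals the distance between
    two checkmarks in the same row of the reservation table."""
    forbidden = set()
    for cycles in table.values():
        cs = sorted(cycles)
        for i in range(len(cs)):
            for j in range(i + 1, len(cs)):
                forbidden.add(cs[j] - cs[i])
    return forbidden

# Hypothetical 3-stage, 4-column table: S1 used at cycles 1 and 4
table = {"S1": {1, 4}, "S2": {2}, "S3": {3}}
assert forbidden_latencies(table) == {3}
```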
Latency Cycle:

• A latency cycle is a latency sequence which repeats the
same subsequence (cycle) indefinitely. Figure 3 illustrates
latency cycles in using the pipeline in Fig. 1 to evaluate the
function X without causing a collision.
• For example, the latency cycle (1, 8) represents the infinite
latency sequence 1, 8, 1, 8, 1, 8, ... This implies that
successive initiations of new tasks are separated by one
cycle and eight cycles alternately.

Fig. 3. Three valid latency cycles for the evaluation of function X


Average latency & Constant latency:

• The average latency of a latency cycle is obtained by dividing the sum of all latencies by the number of
latencies along the cycle.
• The latency cycle (1, 8) thus has an average latency of (1 + 8)/2 = 4.5.
• A constant cycle is a latency cycle which contains only one latency value.
• Cycles (3) and (6) in Figs. 3b and 3c are both constant cycles.
• The average latency of a constant cycle is simply the latency itself.
• When scheduling events in a nonlinear pipeline, the main objective is to obtain the shortest average latency
between initiations without causing collisions. In what follows, we present a systematic method for
achieving such collision-free scheduling, based on collision vectors, state diagrams, simple cycles,
greedy cycles, and the minimal average latency (MAL).
• This pipeline design theory was originally developed by Davidson (1971) and his students.
• Collision Vectors:
By examining the reservation table, one can distinguish the set of permissible latencies from the set of
forbidden latencies.
• For a reservation table with n columns, the maximum forbidden latency is m ≤ n − 1.
• The permissible latency p should be as small as possible. The choice is made in the range 1 ≤ p ≤ m − 1.
• A permissible latency of p = 1 corresponds to the ideal case.
• In theory, a latency of 1 can always be achieved in a static pipeline which follows a linear (diagonal or
streamlined) reservation table, as shown already.
• The combined set of permissible and forbidden latencies can be easily displayed by a collision vector, which
is an m-bit binary vector C = (Cm Cm−1 ... C2 C1).
• The value of Ci = 1 if latency i causes a collision and Ci = 0 if latency i is permissible. Note that it is
always true that Cm = 1, corresponding to the maximum forbidden latency.
• For the two reservation tables in Fig. 1, the collision vector CX = (1011010) is obtained for function X and
CY = (1010) for function Y. From CX, we can immediately tell that latencies 7, 5, 4, and 2 are forbidden and
latencies 6, 3, and 1 are permissible. Similarly, 4 and 2 are forbidden latencies and 3 and 1 are permissible
latencies for function Y.
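Given the forbidden set, the collision vector follows directly. A sketch that reproduces CX and CY from the forbidden latencies quoted above:

```python
def collision_vector(forbidden):
    """C = (C_m ... C_1), with C_i = 1 iff latency i is forbidden;
    m is the maximum forbidden latency, so C_m is always 1."""
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))

assert collision_vector({2, 4, 5, 7}) == "1011010"   # CX for function X
assert collision_vector({2, 4}) == "1010"            # CY for function Y
```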
State Diagrams:

• From the above collision vector, one can construct a state diagram specifying the permissible
state transitions among successive initiations.
• The collision vector, like CX, corresponds to the initial state of the pipeline at time 1 and thus is
called an initial collision vector.
• Let p be a permissible latency within the range 1 ≤ p ≤ m − 1.
• The next state of the pipeline at time t + p is obtained with the assistance of an m-bit right shift
register, as in Fig. 4a. The initial collision vector C is initially loaded into the register. The register
is then shifted to the right. Each 1-bit shift corresponds to an increase in the latency by 1. When a
0 bit emerges from the right end after p shifts, it means p is a permissible latency. Likewise, a 1 bit
being shifted out means a collision, and thus the corresponding latency should be forbidden.
• A logical 0 enters from the left end of the shift register. The next state after p shifts is thus obtained
by bitwise-ORing the initial collision vector with the shifted register contents.
• For example, from the initial state CX = (1011010), the next state (1111111) is reached after one
right shift of the register, and the next state (1011011) is reached after three shifts or six shifts.
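The shift-and-OR rule can be sketched directly on bit strings; the example transitions reproduce the ones quoted above for CX:

```python
def next_state(state, p, initial):
    """Right-shift the current state p bits (zeros enter on the left),
    then bitwise-OR with the initial collision vector. Meaningful only
    when latency p is permissible (bit p of the state is 0)."""
    m = len(initial)
    shifted = int(state, 2) >> p
    return format(shifted | int(initial, 2), "0{}b".format(m))

CX = "1011010"
assert next_state(CX, 1, CX) == "1111111"   # one shift
assert next_state(CX, 3, CX) == "1011011"   # three shifts
assert next_state(CX, 6, CX) == "1011011"   # six shifts reach the same state
```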
The state transition diagram for a pipeline unit:

A state diagram is obtained in Fig. 4b for function X. From the
initial state (1011010), only three outgoing transitions are
possible, corresponding to the three permissible latencies 6, 3,
and 1 in the initial collision vector. Similarly, from state
(1011011), one reaches the same state after either three shifts or
six shifts.

When the number of shifts is m + 1 or greater, all transitions
are redirected back to the initial state.
For example, after eight or more (denoted as 8+) shifts, the
next state must be the initial state, regardless of which state
the transition starts from. In Fig. 4c, a state diagram is
obtained for the reservation table in Fig. 1c using a 4-bit
shift register. Once the initial collision vector is determined,
the corresponding state diagram is uniquely determined.

Fig. 4. Two state diagrams obtained from the two reservation tables in Fig. 1, respectively
The state transition diagram for a pipeline unit:

• The 0's and 1's in the present state, say at time t, of a state diagram indicate the permissible and
forbidden latencies, respectively, at time t. The bitwise ORing of the shifted version of the present
state with the initial collision vector is meant to prevent collisions from future initiations starting
at time t + 1 and onward.
• Thus the state diagram covers all permissible state transitions that avoid collisions. All latencies
equal to or greater than m are permissible. This implies that collisions can always be avoided if
events are scheduled far apart (with latencies of m+). However, such long latencies are not
tolerable from the viewpoint of pipeline throughput.
Simple Cycles & Greedy Cycles:

• From the state diagram, we can determine optimal latency cycles which result in the MAL.
• There are infinitely many latency cycles one can trace from the state diagram. For example, (1, 8), (1, 8, 6, 8), (3), (6), (3, 8),
(3, 6, 3), ... are legitimate cycles traced from the state diagram in Fig. 4b. Among these cycles, only simple cycles are of
interest.
• A simple cycle is a latency cycle in which each state appears only once. In the state diagram in Fig. 4b, only (3), (6), (8),
(1, 8), (3, 8), and (6, 8) are simple cycles. The cycle (1, 8, 6, 8) is not simple because it travels through the state (1011010)
twice. Similarly, the cycle (3, 6, 3, 8, 6) is not simple because it repeats the state (1011011) three times.
• Some of the simple cycles are greedy cycles.
• A greedy cycle is one whose edges are all made with minimum latencies from their respective starting states. For example,
in Fig. 4b the cycles (1, 8) and (3) are greedy cycles. The greedy cycles in Fig. 4c are (1, 5) and (3).
• Such cycles must first be simple, and their average latencies must be lower than those of other simple cycles. The greedy
cycle (1, 8) in Fig. 4b has an average latency of (1 + 8)/2 = 4.5, which is lower than that of the simple cycle (6, 8), with (6 +
8)/2 = 7. The greedy cycle (3) has a constant latency which equals the MAL for evaluating function X without causing a
collision.
• The MAL in Fig. 4c is 3, corresponding to either of the two greedy cycles. The minimum-latency edges in the state diagrams
are marked with asterisks.
• At least one of the greedy cycles will lead to the MAL. The collision-free scheduling of pipeline events is thus reduced to
finding greedy cycles from the set of simple cycles. The greedy cycle yielding the MAL is the final choice.
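Since at least one greedy cycle attains the MAL, the MAL can be found mechanically by tracing the minimum-latency successor from every reachable state until a state repeats. A sketch, with all latencies above m collapsed into the single m+1 transition back to the initial state:

```python
def step(state, p, initial):
    """One transition: latencies > m return to the initial state;
    otherwise right-shift p bits and OR with the initial vector."""
    m = len(initial)
    if p > m:
        return initial
    return format((int(state, 2) >> p) | int(initial, 2), "0{}b".format(m))

def permissible(state):
    """Latencies i whose bit C_i is 0 (bit 1 is the rightmost)."""
    m = len(state)
    return [i for i in range(1, m + 1) if state[m - i] == "0"]

def greedy_cycle(start, initial):
    """Follow the minimum permissible latency from each state until a
    state repeats; return the latency cycle and its average latency."""
    m = len(initial)
    latencies, seen, s = [], {}, start
    while s not in seen:
        seen[s] = len(latencies)
        ps = permissible(s)
        latencies.append(min(ps) if ps else m + 1)
        s = step(s, latencies[-1], initial)
    cycle = latencies[seen[s]:]
    return cycle, sum(cycle) / len(cycle)

def mal(initial):
    """Best greedy-cycle average latency over all reachable states."""
    states, stack = {initial}, [initial]
    while stack:
        s = stack.pop()
        for p in permissible(s) + [len(initial) + 1]:
            t = step(s, p, initial)
            if t not in states:
                states.add(t)
                stack.append(t)
    return min(greedy_cycle(s, initial)[1] for s in states)

assert greedy_cycle("1011010", "1011010")[0] == [1, 8]  # cycle (1, 8)
assert mal("1011010") == 3   # function X
assert mal("1010") == 3      # function Y
```

From the initial state of CX the greedy trace yields the cycle (1, 8) with average 4.5, while the self-looping state (1011011) yields the constant cycle (3); the minimum over all states gives the MAL of 3, matching the text.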
Pipeline Schedule Optimization:

• An optimization technique based on the MAL: the idea is to insert non-compute delay stages into the
original pipeline. This modifies the reservation table, resulting in a new collision vector and an
improved state diagram. The purpose is to yield an optimal latency cycle, which is absolutely the
shortest.
• Bounds on the MAL: In 1972, Shar determined the following bounds on the minimal average latency
(MAL) achievable by any control strategy on a statically reconfigured pipeline executing a given
reservation table:
• (1) The MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation
table.
• (2) The MAL is lower than or equal to the average latency of any greedy cycle in the state diagram.
• (3) The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial
collision vector plus 1. This is also an upper bound on the MAL.
Problem to Solved Number 1:

• Consider the following pipeline reservation table:

1 2 3 4
S1 X X
S2 X
S3 X

(a) What are the forbidden latencies?
(f) Draw the state transition diagram.
(g) List all the simple cycles and greedy cycles.
(h) Determine the optimal constant latency cycle and the minimal average latency.
(i) Let the pipeline clock period be τ = 2 ns. Determine the throughput of this pipeline.
Number 2:

• Consider the following reservation table for a four-stage pipeline with a clock cycle τ = 2 ns:

1 2 3 4 5 6
S1 X X
S2 X X
S3 X
S4 X X

(a) What are the forbidden latencies and the initial collision vector?
(b) Draw the state transition diagram for scheduling the pipeline.
(c) Determine the MAL associated with the shortest greedy cycle.
(d) Determine the pipeline throughput corresponding to the MAL and the given τ.
(e) Determine the lower bound on the MAL for this pipeline. Have you obtained the optimal latency from
the above state diagram?
Number 3:

Consider the five-stage pipelined processor specified by the following reservation table:

1 2 3 4 5 6
S1 X X
S2 X X
S3 X
S4 X
S5 X X

(a) List the set of forbidden latencies and the collision vector.
(b) Draw a state transition diagram showing all possible initial sequences (cycles) without causing a collision in the pipeline.
(c) List all the simple cycles from the state diagram.
(d) Identify the greedy cycles among the simple cycles.
(e) What is the minimum average latency (MAL) of this pipeline?
(f) What is the minimum allowed constant cycle in using this pipeline?
(g) What will be the maximum throughput of this pipeline?
(h) What will be the throughput if the minimum constant cycle is used?
Examples:
Home Assignment:

• Prove the lower bound and upper bound on the minimal average latency (MAL)
Non Compute Delay Stage Insertion to Increase performance:

• The optimal latency cycle must be selected from one of the lowest greedy cycles. However,
a greedy cycle alone is not sufficient to guarantee the optimality of the MAL; the lower bound
guarantees the optimality. For example, the MAL = 3 for both function X and function Y,
and this meets the lower bound of 3 from their respective reservation tables.
• From Fig. 4b, the upper bound on the MAL for function X is equal to 4 + 1 = 5, a rather
loose bound. On the other hand, Fig. 4c shows a rather tight upper bound of 2 + 1 = 3 on
the MAL. Therefore, all greedy cycles for function Y lead to the optimal latency value of 3,
which cannot be lowered further.
• To optimize the MAL, one needs to meet the lower bound by modifying the reservation
table. The approach is to reduce the maximum number of checkmarks in any row. The
modified reservation table must preserve the original function being evaluated. Patel and
Davidson (1976) suggested the use of non-compute delay stages to increase pipeline
performance with a shorter MAL.
Delay Insertion:

• The purpose of delay insertion is to modify the reservation table, yielding a new collision
vector. This leads to a modified state diagram, which may produce greedy cycles meeting
the lower bound on the MAL.
• Before delay insertion, the three-stage pipeline in Fig. 5a is specified by the reservation
table in Fig. 5b. This table leads to a collision vector C = (1011), corresponding to
forbidden latencies 1, 2, and 4. The corresponding state diagram (Fig. 5c) contains only
one self-reflecting state with a greedy cycle of latency 3 equal to the MAL.
• Based on the given reservation table, the maximum number of checkmarks in any row is 2.
Therefore, the MAL = 3 obtained in Fig. 5c is not optimal.
Inserting non-compute delays to reduce the MAL:

• Inserting a non-compute stage D1 after stage S3 will delay both the X1 and X2 operations one cycle beyond time
4. Inserting yet another non-compute stage D2 after the second usage of S1 will delay the operation X2 by
another cycle.
• These delay operations, as grouped in Fig. 5b, result in a new pipeline configuration in Fig. 6a. Both delay
elements D1 and D2 are inserted as extra stages, as shown in Fig. 6b, with an enlarged reservation table
having 3 + 2 = 5 rows and 5 + 2 = 7 columns.
Inserting non-compute delays to reduce the MAL:

• In total, the operation X1 has been delayed one cycle, from time 4 to time 5, and the
operation X2 has been delayed two cycles, from time 5 to time 7. All remaining
operations (marked as X in Fig. 6b) are unchanged.
• This new table leads to a new collision vector (100010) and a modified state diagram in
Fig. 6c.

Fig. 6. Insertion of two delay stages to obtain an optimal MAL for the pipeline in Fig. 5.

This diagram displays a greedy cycle (1, 3), resulting in a reduced
MAL = (1 + 3)/2 = 2. The delay insertion thus improves the pipeline
performance, meeting the lower bound on the MAL.
Any Questions?

Thank You
