
Subject Name: Advanced Computer Architecture

Subject Code: CS-6001


Semester: 6th


Subject Notes
Unit-III
Pipelining:
Pipelining is one way of improving the overall processing performance of a processor. This architectural
approach allows the simultaneous execution of several instructions. Pipelining is transparent to the
programmer; it exploits parallelism at the instruction level by overlapping the execution process of
instructions. It is analogous to an assembly line where each worker performs a specific task and passes the
partially completed product to the next worker.
Linear Pipeline Processor:
The pipeline design technique decomposes a sequential process into several subprocesses, called stages or
segments. A stage performs a particular function and produces an intermediate result. It consists of an input
latch, also called a register or buffer, followed by a processing circuit. (A processing circuit can be a
combinational or sequential circuit.) The processing circuit of a given stage is connected to the input latch of
the next stage (see Figure 3.1). A clock signal is connected to each input latch. At each clock pulse, every stage
transfers its intermediate result to the input latch of the next stage. In this way, the final result is produced
after the input data have passed through the entire pipeline, completing one stage per clock pulse. The period
of the clock pulse should be large enough to provide sufficient time for a signal to traverse through the
slowest stage, which is called the bottleneck (i.e., the stage needing the longest amount of time to complete).
In addition, there should be enough time for a latch to store its input signals. If the clock's period, P, is
expressed as P = tb + tl, then tb should be greater than the maximum delay of the bottleneck stage, and tl
should be sufficient for storing data into a latch.
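
For illustration, here is a minimal Python sketch of how the clock period follows from the bottleneck stage; the stage and latch delays below are assumed values, not taken from these notes:

# Sketch: the clock period of a linear pipeline is set by its slowest stage.
# Stage delays (in ns) are illustrative assumptions.
stage_delays = [6, 10, 7, 8, 9]   # processing delay of each stage's circuit
latch_delay = 2                   # tl: time to store results into an input latch

bottleneck = max(stage_delays)    # tb must cover the slowest (bottleneck) stage
P = bottleneck + latch_delay      # P = tb + tl
print(f"bottleneck = {bottleneck} ns, clock period P = {P} ns")
# Once full, the pipeline delivers one result per clock pulse:
print(f"steady-state throughput = {1000 / P:.1f} million results/second")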

Types of pipeline:
Pipelines can actually be divided into two classes:
a) Static or Linear Pipelines: These pipelines can perform one operation (e.g., addition or multiplication) at a time.
The operation of a static pipeline can only be changed after the pipeline has been drained. (A pipeline is said
to be drained when the last input data leave the pipeline.) For example, consider a static pipeline that is able
to perform addition and multiplication. Each time that the pipeline switches from a multiplication operation to
an addition operation, it must be drained and set for the new operation. The performance of static pipelines is
severely degraded when the operations change often, since this requires the pipeline to be drained and
refilled each time. The output of the pipeline is produced from the last stage. For Diagram of static or linear
pipeline you can refer to the previous Figure 3.1.
b) Dynamic or Nonlinear Pipeline Processors: A dynamic pipeline can perform more than one operation at a
time. To perform a particular operation on input data, the data must go through a certain sequence of
stages. For example, Figure 3.2 shows a three-stage dynamic pipeline that performs addition and
multiplication on different data at the same time. To perform multiplication, the input data must go through
stages 1, 2, and 3; to perform addition, the data only need to go through stages 1 and 3. Therefore, the first
stage of the addition process can be performed on input data D1 at stage 1, while at the same time the last
stage of the multiplication process is performed at stage 3 on different input data D2. Note that the time
interval between the initiation of the inputs D1 and D2 to the pipeline should be such that they do not reach
stage 3 at the same time; otherwise, there is a collision. In general, in dynamic pipelines the mechanism that
controls when data should be fed to the pipeline is much more complex than in static pipelines.
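
The collision constraint can be made concrete with a small reservation-table check in Python. The stage usage below mirrors the example (multiplication uses stages 1, 2, 3; addition uses stages 1 and 3); the table encoding and the helper function are illustrative assumptions, not a real pipeline controller:

# Sketch: collision check for the three-stage dynamic pipeline described
# above (multiplication uses stages 1 -> 2 -> 3; addition uses stages 1 -> 3).
# A reservation table maps relative time step -> stage used.
MULTIPLY = {0: 1, 1: 2, 2: 3}
ADD      = {0: 1, 1: 3}

def collides(first, second, delay):
    """True if `second`, initiated `delay` cycles after `first`, ever needs
    a stage in the same cycle that `first` occupies it."""
    return any(first.get(t + delay) == stage for t, stage in second.items())

print(collides(MULTIPLY, ADD, 1))   # True: both would reach stage 3 together
print(collides(MULTIPLY, ADD, 2))   # False: a safe initiation interval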

Mechanism for Instruction pipeline:


In the von Neumann architecture, the process of executing an instruction involves several steps. First, the control
unit of a processor fetches the instruction from the cache (or from memory). Then the control unit decodes
the instruction to determine the type of operation to be performed. When the operation requires operands,
the control unit also determines the address of each operand and fetches them from cache (or memory).
Next, the operation is performed on the operands and, finally, the result is stored in the specified location. An
instruction pipeline increases the performance of a processor by overlapping the processing of several
different instructions. Often, this is done by dividing the instruction execution process into several stages. As
shown in Figure 3.3, an instruction pipeline often consists of five stages, as follows:
 Instruction fetch (IF): Retrieval of instructions from cache (or main memory).
 Instruction decoding (ID): Identification of the operation to be performed.
 Operand fetch (OF): Determination of operand addresses and retrieval of any required operands.
 Execution (EX): Performing the operation on the operands.
 Write-back (WB): Updating the destination operands.


An instruction pipeline overlaps the process of the preceding stages for different instructions to achieve a
much lower total completion time, on average, for a series of instructions. As an example, consider Figure 3.4,
which shows the execution of four instructions in an instruction pipeline. During the first cycle, or clock pulse,
instruction i1 is fetched from memory. Within the second cycle, instruction i1 is decoded while instruction i2 is
fetched. This process continues until all the instructions are executed. The last instruction finishes the write-
back stage after the eighth clock cycle. Therefore, it takes 80 nanoseconds (ns) to complete execution of all
the four instructions when assuming the clock period to be 10 ns.
The total completion time can also be obtained from equation (3.1); that is,
Tpipe = m*P + (n - 1)*P = 5*10 + (4 - 1)*10 = 80 ns.
Note that in a nonpipelined design the completion time will be much higher. Using equation (3.2),
Tseq = n*m*P = 4*5*10 = 200 ns.
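
Both equations can be verified with a short sketch, using the example's values (m = 5 stages, n = 4 instructions, P = 10 ns):

# Sketch: total completion times from equations (3.1) and (3.2),
# using the example's values.
m, n, P = 5, 4, 10   # pipeline stages, instructions, clock period (ns)

T_pipe = m * P + (n - 1) * P   # first result after m cycles, then one per cycle
T_seq  = n * m * P             # nonpipelined: every instruction takes m cycles

print(f"T_pipe = {T_pipe} ns")            # 80 ns
print(f"T_seq  = {T_seq} ns")             # 200 ns
print(f"speedup = {T_seq / T_pipe:.2f}")  # 2.50 for this short sequence

As n grows, the speedup Tseq/Tpipe = n*m/(m + n - 1) approaches m, the number of stages.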
Even though pipelining speeds up the execution of instructions, it does pose potential problems. Some of
these problems and possible solutions are discussed next.

Improving the Throughput of Instruction Pipeline:


Three sources of architectural problems may affect the throughput of an instruction pipeline. They are
fetching, bottleneck, and issuing problems. Some solutions are given for each.


The Fetching Problem
In general, supplying instructions rapidly through a pipeline is costly in terms of chip area.
Buffering the data to be sent to the pipeline is one simple way of improving the overall utilization of a pipeline.
The utilization of a pipeline is defined as the percentage of time that the stages of the pipeline are used over a
sufficiently long period of time. A pipeline is utilized 100% of the time when every stage is used (utilized)
during each clock cycle.
Occasionally, the pipeline has to be drained and refilled, for example, whenever an interrupt or a branch
occurs. The time spent refilling the pipeline can be minimized by having instructions and data loaded ahead of
time into various geographically close buffers (like on-chip caches) for immediate transfer into the pipeline. If
instructions and data for normal execution can be fetched before they are needed and stored in buffers, the
pipeline will have a continuous source of information with which to work. Prefetch algorithms are used to
make sure potentially needed instructions are available most of the time. Delays from memory access conflicts
can thereby be reduced if these algorithms are used, since the time required to transfer data from main
memory is far greater than the time required to transfer data from a buffer.
The Bottleneck Problem
The bottleneck problem relates to the amount of load (work) assigned to a stage in the pipeline. If too much
work is applied to one stage, the time taken to complete an operation at that stage can become unacceptably
long. This relatively long time spent by the instruction at one stage will inevitably create a bottleneck in the
pipeline system. In such a system, it is better to remove the bottleneck that is the source of congestion. One
solution to this problem is to further subdivide the stage. Another solution is to build multiple copies of this
stage into the pipeline.
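
A small sketch of both remedies in clock-period terms; the stage delays are assumed values for illustration:

# Sketch: two remedies for a bottleneck stage, in clock-period terms.
# Stage delays (ns) are illustrative assumptions.
delays = [4, 4, 12, 4]                # stage 3 (12 ns) is the bottleneck

P_base  = max(delays)                 # baseline clock period: 12 ns
P_split = max([4, 4, 6, 6, 4])        # subdivide stage 3 into two 6 ns substages
P_copy  = max([4, 4, 12 / 2, 4])      # two alternating copies of stage 3:
                                      # effective initiation interval 6 ns
print(P_base, P_split, P_copy)        # 12 6 6.0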
Pipelining hazards:
If an instruction is available, but cannot be executed for some reason, a hazard exists for that instruction.
These hazards create issuing problems; they prevent issuing an instruction for execution. Three types of
hazard are discussed here. They are called structural hazard, data hazard, and control hazard. A structural
hazard refers to a situation in which a required resource is not available (or is busy) for executing an
instruction. A data hazard refers to a situation in which there exists a data dependency (operand conflict) with
a prior instruction. A control hazard refers to a situation in which an instruction, such as branch, causes a
change in the program flow. Each of these hazards is explained next.
Structural Hazards:
A structural hazard occurs as a result of resource conflicts between instructions. One type of structural hazard
that may occur is due to the design of execution units. If an execution unit that requires more than one clock
cycle (such as multiply) is not fully pipelined or replicated, then a sequence of instructions that uses that
unit cannot be issued successively (one per clock cycle). Replicating and/or pipelining the execution unit
increases the number of instructions that can be issued simultaneously.
Another type of structural hazard that may occur is due to the design of register files. If a register file does not
have multiple write (read) ports, multiple writes (reads) to (from) registers cannot be performed
simultaneously.
For example, under certain situations the instruction pipeline might want to perform two register writes in a
clock cycle. This may not be possible when the register file has only one write port. The effect of a structural
hazard can be reduced fairly simply by implementing multiple execution units and using register files with
multiple input/output ports.


Data Hazards: In a non-pipelined processor, the instructions are executed one by one, and the execution of an
instruction is completed before the next instruction is started. In this way, the instructions are executed in the
same order as the program. However, this may not be true in a pipelined processor, where instruction
executions are overlapped. An instruction may be started and completed before the previous instruction is
completed.
The data hazard, which is also referred to as the data dependency problem, comes about as a result of
overlapping (or changing the order of) the execution of data-dependent instructions. For example, in Figure
3.5 instruction i2 has a data dependency on i1 because it uses the result of i1 (i.e., the contents of register R2)
as input data. If the instructions were sent to a pipeline in the normal manner, i2 would be in the OF stage
before i1 passed through the WB stage.
This would result in using the old contents of R2 for computing a new value for R5, leading to an invalid result.
To have a valid result, i2 must not enter the OF stage until i1 has passed through the WB stage. In this way, as
is shown in Figure 3.6, the execution of i2 will be delayed for two clock cycles. In other words, instruction i2 is
said to be stalled for two clock cycles. Often, when an instruction is stalled, the instructions that are
positioned after the stalled instruction will also be stalled. However, the instructions before the stalled
instruction can continue execution. The delaying of execution can be accomplished in two ways. One way is to
delay the OF or IF stages of i2 for two clock cycles.
To insert a delay, an extra hardware component called a pipeline interlock can be added to the pipeline. A
pipeline interlock detects the dependency and delays the dependent instructions until the conflict is resolved.
Another way is to let the compiler solve the dependency problem.
During compilation, the compiler detects the dependency between data and instructions. It then rearranges
these instructions so that the dependency is not hazardous to the system. If it is not possible to rearrange the
instructions, NOP (no operation) instructions are inserted to create delays.
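
A hypothetical sketch of this NOP insertion in Python follows. Instructions are encoded as (dest, src1, src2) register tuples, and a two-cycle gap is assumed between a producer and a dependent consumer, matching Figure 3.6; the encoding and the helper function are inventions of this sketch:

# Sketch: compiler-style NOP insertion for RAW hazards.
# Instructions are (dest, src1, src2) register tuples -- an invented encoding.
# Assumption (matching Figure 3.6): a dependent instruction must trail its
# producer by two extra cycles, so up to two NOPs are inserted.
NOP = (None, None, None)

def insert_nops(program, gap=2):
    scheduled = []
    for dest, s1, s2 in program:
        # Look back over the last `gap` slots for a producer of s1 or s2.
        for back, prev in enumerate(reversed(scheduled[-gap:]), start=1):
            if prev[0] is not None and prev[0] in (s1, s2):
                scheduled.extend([NOP] * (gap - back + 1))
                break
        scheduled.append((dest, s1, s2))
    return scheduled

i1 = ("R2", "R3", "R4")   # R2 = R3 + R4
i2 = ("R5", "R2", "R1")   # R5 = R2 + R1 -- RAW on R2
for slot in insert_nops([i1, i2]):
    print("NOP" if slot == NOP else slot)   # i1, NOP, NOP, i2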

There are three primary types of data hazards named as


 RAW(Read after Write)
 WAR(Write after Read)
 WAW(Write after Write)


RAW: This type of data hazard was discussed previously; it refers to the situation in which i2 reads a data
source before i1 writes to it. This may produce an invalid result since the read must be performed after the
write in order to obtain a valid result.
For example, in the sequence
i1: Add R2, R3, R4   -- R2 = R3 + R4
i2: Add R5, R2, R1   -- R5 = R2 + R1
an invalid result may be produced if i2 reads R2 before i1 writes to it.
WAR: This refers to the situation in which i2 writes to a location before i1 reads it.
For example, in the sequence
i1: Add R2, R3, R4   -- R2 = R3 + R4
i2: Add R4, R5, R6   -- R4 = R5 + R6
an invalid result may be produced if i2 writes to R4 before i1 reads it; that is, instruction i1 might use the
wrong value of R4.
WAW: This refers to the situation in which i2 writes to a location before i1 writes to it.
For example, in the sequence
i1: Add R2, R3, R4   -- R2 = R3 + R4
i2: Add R2, R5, R6   -- R2 = R5 + R6
the value of R2 is recomputed by i2. An invalid result is left in R2 if the order of the two writes is reversed,
that is, if i1 writes to R2 after i2 does.
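
The three cases can be distinguished mechanically. Below is a small illustrative checker; the (dest, src1, src2) tuple encoding is an assumption of the sketch:

# Sketch: classifying the data hazard between two instructions i1 -> i2.
# Each instruction is (dest, src1, src2); the encoding is illustrative.
def classify(i1, i2):
    d1, *srcs1 = i1
    d2, *srcs2 = i2
    hazards = []
    if d1 in srcs2:
        hazards.append("RAW")   # i2 reads what i1 writes
    if d2 in srcs1:
        hazards.append("WAR")   # i2 writes what i1 reads
    if d1 == d2:
        hazards.append("WAW")   # both write the same location
    return hazards or ["none"]

print(classify(("R2", "R3", "R4"), ("R5", "R2", "R1")))  # ['RAW']
print(classify(("R2", "R3", "R4"), ("R4", "R5", "R6")))  # ['WAR']
print(classify(("R2", "R3", "R4"), ("R2", "R5", "R6")))  # ['WAW']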
Dependencies can be checked statically at compile time by the compiler, or dynamically at run time by
hardware. At compile time it is not always possible to determine the actual memory addresses of load and
store instructions in order to resolve a possible dependency between them. However, at run time the actual
memory addresses are known, and thereby dependencies between instructions can be determined by
checking them dynamically. In general, dynamic dependency checking has the advantage of being able to
determine dependencies that are either impossible or hard to detect at compile time.
Here we will discuss the techniques for dynamic dependency checking. Two of the most commonly used
techniques are called Tomasulo's method and the scoreboard method.
Tomasulo’s Algorithm
The Tomasulo algorithm was first implemented in the IBM 360/91 floating-point unit, which came out three
years after the CDC 6600. This scheme was intended to address several issues:
 A small number of available floating-point registers: the 360/91 had only four double-precision registers.
 Long memory latency: this was just prior to the introduction of caches as a standard part of the
memory hierarchy.
 The cost-effectiveness of functional unit hardware: with multiple copies of the same functional unit,
some units were often underutilized.
 The performance penalties of name dependencies, which lead to WAW and WAR hazards.


The FPU consists of:

 Instruction buffer
 Load and store buffers: entries in these buffers consist of
a) Busy bit: indicating the buffer element contains an outstanding load or store operation.
b) Tag: indicating the destination (or the source, for a store) of the data for the operation.
c) Address (not shown): provided by the integer unit.
d) Data.
 FP register file: entries in this file consist of
a) Valid bit: indicating the register contains the current value of the register.
b) Tag: indicating the current source of the register value, if not present.
c) Value: the register value, if present.
 FP functional units with associated reservation stations:
a) Busy bit: indicating the reservation station is occupied with an outstanding instruction.
b) Result tag: the "name" of the result to be produced by this instruction.
c) Source operands.
 Common Data Bus (CDB).
Instructions here are executed in four phases: fetch, issue, execute, and write-back.
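
A highly simplified Python sketch of the tagging idea: an invalid register carries the tag of the reservation station that will produce its value, and a CDB broadcast fills in every waiter. All class, field, and tag names here are invented for illustration, and the load/store buffers and timing are omitted:

# Sketch: Tomasulo-style register tags and CDB broadcast.
class RegEntry:
    def __init__(self, value):
        self.valid, self.tag, self.value = True, None, value

regs = {r: RegEntry(0.0) for r in ("F0", "F2", "F4")}
regs["F2"].value = 2.5
stations = {}   # reservation stations: tag -> pending instruction

def issue(tag, op, dest, s1, s2):
    def read(r):   # a valid register yields its value; else the producer's tag
        e = regs[r]
        return ("val", e.value) if e.valid else ("tag", e.tag)
    stations[tag] = {"op": op, "src": [read(s1), read(s2)]}
    regs[dest].valid, regs[dest].tag = False, tag   # dest is now named by tag

def broadcast(tag, result):   # CDB: update registers and waiting stations
    for e in regs.values():
        if not e.valid and e.tag == tag:
            e.valid, e.value = True, result
    for st in stations.values():
        st["src"] = [("val", result) if s == ("tag", tag) else s
                     for s in st["src"]]

issue("RS1", "MUL", "F4", "F2", "F2")   # F4 = F2 * F2
issue("RS2", "ADD", "F0", "F4", "F2")   # needs F4: waits on tag RS1, no stall
print(stations["RS2"]["src"])           # [('tag', 'RS1'), ('val', 2.5)]
broadcast("RS1", 6.25)                  # multiplication completes on the CDB
print(stations["RS2"]["src"])           # [('val', 6.25), ('val', 2.5)]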
Control Hazards (Branch handling techniques): In any set of instructions, there is normally a need for some
kind of statement that allows the flow of control to be something other than sequential. Instructions that do
this are included in every programming language and are called branches.
In general, about 30% of all instructions in a program are branches. This means that branch instructions in the
pipeline can reduce the throughput tremendously if not handled properly. Whenever a branch is taken, the
performance of the pipeline is seriously affected. Each such branch requires a new address to be loaded into
the program counter, which may invalidate all the instructions that are either already in the pipeline or
prefetched in the buffer.
This draining and refilling of the pipeline for each branch degrades the throughput of the pipeline to that of a
sequential processor. Note that the presence of a branch statement does not automatically cause the pipeline
to drain and begin refilling. A branch not taken allows the continued sequential flow of uninterrupted
instructions to the pipeline. Only when a branch is taken does the problem arise.


Branches can be divided into three groups:


a) Unconditional Branches
b) Conditional Branches
c) Loop Branches
An unconditional branch always alters the sequential program flow. It sets a new target address in the
program counter, rather than incrementing it by 1 to point to the next sequential instruction address, as is
normally the case.
A conditional branch sets a new target address in the program counter only when a certain condition, usually
based on a condition code, is satisfied. Otherwise, the program counter is incremented by 1 as usual. In other
words, a conditional branch selects a path of instructions based on a certain condition.
If the condition is satisfied, the path starts from the target address and is called a target path. If it is not, the
path starts from the next sequential instruction and is called a sequential path. Finally, a loop branch in a loop
statement usually jumps back to the beginning of the loop and executes it either a fixed or a variable (data-
dependent) number of times.
Among the preceding branch types, conditional branches are the hardest to handle. As an example, consider
the following conditional branch instruction sequence and its execution in our instruction pipeline when the
target path is selected. Here c denotes the number of cycles wasted whenever the target path is chosen.

The average number of cycles per instruction is
Tave = Pb * (average number of cycles per branch instruction) + (1 - Pb) * (average number of cycles per
nonbranch instruction),
where Pb denotes the probability that a given instruction is a branch. The average number of cycles per
branch instruction is Pt(1 + c) + (1 - Pt)(1), where Pt denotes the probability that the branch target path is
chosen; a nonbranch instruction averages one cycle. Thus
Tave = Pb[Pt(1 + c) + (1 - Pt)(1)] + (1 - Pb)(1) = 1 + c*Pb*Pt.
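
Plugging in assumed probabilities shows how sensitive the average is to branches; the values below (Pb = 0.3, matching the 30% figure above; Pt = 0.6; c = 3) are otherwise illustrative:

# Sketch: average cycles per instruction with branch penalty c.
# Pb, Pt, and c are illustrative assumptions.
Pb, Pt, c = 0.30, 0.60, 3   # P(branch), P(target taken), wasted cycles

T_ave = Pb * (Pt * (1 + c) + (1 - Pt) * 1) + (1 - Pb) * 1
print(f"T_ave = {T_ave:.2f} cycles/instruction")    # 1.54
print(f"check: 1 + c*Pb*Pt = {1 + c * Pb * Pt:.2f}")  # 1.54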
Branch Prediction:


In this type of design, the outcome of a branch decision is predicted before the branch is actually executed.
There are two types of predictions:
a) Static Branch Prediction: In static prediction, a fixed decision for prefetching one of the two paths is made
before the program runs. For example, a simple technique would be to always assume that the branch is
taken. This technique simply loads the program counter with the target address when a branch is
encountered. Another such technique is to automatically choose one path (sequential or target) for some
branch types and another for the rest of the branch types. If the chosen path is wrong, the pipeline is
drained and instructions corresponding to the correct path are fetched; the penalty is paid.
b) Dynamic Branch Prediction: In dynamic prediction, during the execution of the program the processor
makes a decision based on the past information of the previously executed branches. For example, a simple
technique would be to record the history of the last two paths taken by each branch instruction. If the last
two executions of a branch instruction have chosen the same path, that path will be chosen for the current
execution of the branch instruction. If the two paths do not match, one of the paths will be chosen
randomly.
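
A sketch of this "last two paths" predictor in Python; keying the history on the branch's address, and the helper names, are assumptions of the sketch:

# Sketch: the "last two paths" dynamic predictor described above.
import random

history = {}   # branch address -> up to two most recent outcomes

def predict(addr):
    h = history.get(addr, [])
    if len(h) == 2 and h[0] == h[1]:
        return h[0]                       # last two executions agree
    return random.choice([True, False])   # otherwise choose randomly

def update(addr, taken):
    history[addr] = (history.get(addr, []) + [taken])[-2:]  # keep last two

# A branch at (assumed) address 0x40, taken three times then not taken:
for outcome in [True, True, True, False]:
    print(f"predicted {predict(0x40)}, actual {outcome}")
    update(0x40, outcome)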
Delayed Branching: The delayed branching scheme eliminates or significantly reduces the effect of the branch
penalty.
In this type of design, a certain number of instructions after the branch instruction are fetched and executed
regardless of which path will be chosen for the branch. For example, a processor with a branch delay of k
executes a path containing the next k sequential instructions and then either continues on the same path or
starts a new path from a new target address.
As often as possible, the compiler tries to fill the next k instruction slots after the branch with instructions
that are independent of the branch instruction. NOP (no operation) instructions are placed in any remaining
empty slots. As an example, consider the following code:

Assume k = 2; the compiler will modify the code by moving i1 after the branch and inserting a NOP instruction after i3.

After rescheduling, i1 occupies a delay slot and will execute regardless of the branch outcome.
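
A toy sketch of the delay-slot filling follows. Instructions are plain strings, and independence checking is reduced to a caller-supplied set of movable instructions, which is an illustrative shortcut for real dependence analysis:

# Sketch: filling k branch-delay slots (k = 2, as above).
def fill_delay_slots(before_branch, branch, movable, k=2):
    slots = [i for i in before_branch if i in movable][:k]
    rest = [i for i in before_branch if i not in slots]
    slots += ["NOP"] * (k - len(slots))   # pad with NOPs when candidates run out
    return rest + [branch] + slots

program = fill_delay_slots(
    before_branch=["i1: Add R2,R3,R4", "i2: Sub R6,R5,R1"],
    branch="i3: Br L1",
    movable={"i1: Add R2,R3,R4"},   # only i1 is independent of the branch
)
print(program)   # ['i2: Sub R6,R5,R1', 'i3: Br L1', 'i1: Add R2,R3,R4', 'NOP']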


Throughput Improvement of the Instruction pipeline:


One way to increase the throughput of an instruction pipeline is to exploit instruction-level parallelism. The
common approaches to accomplish such parallelism are called superscalar, superpipeline, and very long
instruction word (VLIW). Each approach attempts to initiate several instructions per cycle.
Superscalar pipeline design: The superscalar approach relies on spatial parallelism, that is, multiple operations
running concurrently on separate hardware. This approach achieves the execution of multiple instructions per
clock cycle by issuing several instructions to different functional units.
A superscalar processor contains one or more instruction pipelines sharing a set of functional units. It often
contains functional units, such as an add unit, multiply unit, divide unit, floating-point add unit, and graphic
unit. A superscalar processor contains a control mechanism to preserve the execution order of dependent
instructions for ensuring a valid result.
The scoreboard method and Tomasulo's method (discussed in the previous section) can be used for
implementing such mechanisms. In practice, most of the processors are based on the superscalar approach
and employ a scoreboard method to ensure a valid result.
Superscalar processing has its origins in the Cray-designed CDC supercomputers, in which multiple functional
units are kept busy by multiple instructions. The CDC machines could pack as many as 4 instructions in a word
at once, and these were fetched together and dispatched via a pipeline. Given the technology of the time, this
configuration was fast enough to keep the functional units busy without outpacing the instruction memory.
Current technology has enabled, and at the same time created the need for, issuing instructions in
parallel. As execution pipelines have approached the limits of speed, parallel execution has been required to
improve performance. As this requires greater fetch rates from memory, which hasn't accelerated
comparably, it has become necessary to fetch instructions in parallel -- fetching serially and pipelining their
dispatch can no longer keep multiple functional units busy. At the same time, the movement of the L1
instruction cache onto the chip has permitted designers to fetch a cache line in parallel with little cost.
In some cases superscalar machines still employ a single fetch-decode-dispatch pipe that drives all of the units.
For example, the UltraSPARC splits execution after the third stage of a unified pipeline. However, it is
becoming more common to have multiple fetch-decode-dispatch pipes feeding the functional units.
The choice of approach depends on tradeoffs of the average execute time vs. the speed with which
instructions can be issued. For example, if execution averages several cycles, and the number of functional
units is small, then a single pipe may be able to keep the units utilized. When the number of functional units
grows large and/or their execution time approaches the issue time, then multiple issue pipes may be
necessary. This requires:

 Being able to fetch instructions for that many pipes at once.
 Inter-pipeline interlocking.
 Reordering of instructions for multiple interlocked pipelines.
 Multiple write-back stages.
 A multiport D-cache and/or register file, and/or a functionally split register file.
Reordering may be either static (compiler) or dynamic (using hardware lookahead). It can be difficult to
combine the two approaches because the compiler may not be able to predict the actions of the hardware
reordering mechanism.
Superscalar operation is limited by the number of independent operations that can be extracted from an
instruction stream. Early studies on simpler processor models showed that this is limited, mostly by branches,
to a small number (<10, typically about 4). More recent work has shown that, with speculative
execution and aggressive branch prediction, higher levels may be achievable. On certain highly regular codes,
the level of parallelism may be quite high (around 50). Of course, such highly regular codes are just as
amenable to other forms of parallel processing that can be employed more directly, and are also the exception
rather than the rule. Current thinking is that about 6-way instruction level parallelism for a typical program
mix may be the natural limit, with 4-way being likely for integer codes. Potential ILP may be three times this,
but it will be very difficult to exploit even a majority of this parallelism. Nonetheless, obtaining a factor
of 4 to 6 boost in performance is quite significant, especially as processor speeds approach their limits.
Going beyond a single instruction stream and allowing multiple tasks (or threads) to operate at the same
time can enable greater system throughput. Because these are naturally independent at the fine-grained
level, we can select instructions from different streams to fill pipeline slots that would otherwise go vacant in
the case of issuing from a single thread. In turn, this makes it useful to add more functional units. We shall
further explore these multithreaded architectures later in the course.

Superpipeline processor design: The superpipeline approach achieves high performance by overlapping the
execution of multiple instructions on one instruction pipeline. A superpipeline processor often has an
instruction pipeline with more stages than a typical instruction pipeline design.
In other words, the execution process of an instruction is broken down into even finer steps. By increasing the
number of stages in the instruction pipeline, each stage has less work to do. This allows the pipeline clock rate
to increase (cycle time decreases), since the clock rate depends on the delay found in the slowest stage of the
pipeline. Superpipelining is based on dividing the stages of a pipeline into substages and thus increasing the
number of instructions that are supported by the pipeline at a given moment. For example, if we divide each
stage into two, the clock cycle period t will be reduced to half, t/2; hence, at maximum capacity, the
pipeline produces a result every t/2 seconds. For a given architecture and the corresponding instruction set there is
an optimal number of pipeline stages; increasing the number of stages over this limit reduces the overall
performance. A solution to further improve speed is the superscalar architecture.
Given a pipeline stage time T, it may be possible to execute at a higher rate by starting operations at intervals
of T/n. This can be accomplished in two ways:
a) Further divide each of the pipeline stages into n substages.
b) Provide n pipelines that are overlapped.
The first approach requires faster logic and the ability to subdivide the stages into segments with uniform
latency. It may also require more complex inter-stage interlocking and stall-restart logic. The second approach
could be viewed in a sense as staggered superscalar operation, and has associated with it all of the same
requirements except that instructions and data can be fetched with a slight offset in time. In addition, inter-
pipeline interlocking is more difficult to manage because of the sub-clock period differences in timing between
the pipelines. Even so, staggered clock pipelines may be necessary with superscalar designs in the future, in
order to reduce peak power and corresponding power-supply induced noise. Alternatively, designs may be
forced to shift to a balanced mode of circuit operation in which logic transitions are balanced by reverse
transitions -- a technique used in the Cray supercomputers that resulted in the computer presenting a pure DC
load to the power supply, and greatly reduced noise in the system.
Inevitably, superpipelining is limited by the speed of logic, and the frequency of unpredictable branches. Stage
time cannot productively grow shorter than the interstage latch time, and so this is a limit for the number of
stages.
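
The diminishing return can be seen numerically; in the sketch below, the base stage time and latch overhead are assumed values:

# Sketch: splitting each stage into n substages shortens the cycle toward
# the interstage latch time, which does not shrink. Values are assumed.
T, latch = 10.0, 1.5   # base stage time and latch overhead, in ns

for n in (1, 2, 4, 8):
    cycle = T / n + latch
    print(f"n = {n}: cycle = {cycle:.2f} ns, "
          f"peak rate = {1000 / cycle:.0f} M results/s")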
The MIPS R4000 is sometimes called a superpipelined machine, although its 8 stages really only split the I-
fetch and D-fetch stages of the pipe and add a Tag Check stage. Nonetheless, the extra stages enable it to
operate with higher throughput. The UltraSPARC's 9-stage pipe definitely qualifies it as a superpipelined
machine, and in fact it is a Super-Super design because of its superscalar issue. The Pentium 4 splits the
pipeline into 20 stages to enable increased clock rate. The benefit of such extensive pipelining is really only
gained for very regular applications such as graphics. On more irregular applications, there is little
performance advantage.
Static Arithmetic Pipeline
Some functions of the arithmetic logic unit of a processor can be pipelined to maximize performance. An
arithmetic pipeline is used for implementing complex arithmetic functions like floating-point addition,
multiplication, and division. These functions can be decomposed into consecutive subfunctions. For example,
the figure presents a pipeline architecture for floating-point addition, which can be divided into three stages:
mantissa alignment, mantissa addition, and result normalization.

In the first stage, the mantissas M1 and M2 are aligned based on the difference in the exponents E1 and E2. If
| E1 - E2 | = k > 0, then the mantissa with the smaller exponent is right shifted by k digit positions. In the
second stage, the mantissas are added (or subtracted). In the third stage, the result is normalized so that the
final mantissa has a nonzero digit after the fraction point. When necessary, this normalization is done
by shifting the result mantissa and adjusting the exponent accordingly.
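
A toy Python sketch of the three stages on a decimal (mantissa, exponent) representation; the number format (values normalized so 0.1 <= |M| < 1) is an assumption chosen for readability:

# Sketch: the three floating-point addition stages described above.
def align(m1, e1, m2, e2):
    k = abs(e1 - e2)                  # stage 1: align the mantissas
    if e1 > e2:
        m2 = m2 / 10**k
        return m1, m2, e1
    m1 = m1 / 10**k
    return m1, m2, e2

def add(m1, m2, e):
    return m1 + m2, e                 # stage 2: add the mantissas

def normalize(m, e):
    while abs(m) >= 1:                # stage 3: renormalize the result
        m, e = m / 10, e + 1
    while m != 0 and abs(m) < 0.1:
        m, e = m * 10, e - 1
    return m, e

# 0.95 x 10^2 + 0.82 x 10^1 -> align, add, normalize
m1, m2, e = align(0.95, 2, 0.82, 1)
print(normalize(*add(m1, m2, e)))     # ~ (0.1032, 3)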
Multifunctional arithmetic pipeline:
An arithmetic pipeline is similar to an assembly line in a factory. Data enters a stage of the pipeline, which
performs some arithmetic operation on the data. The results are then passed to the next stage, which performs
its operation, and so on until the final computation has been performed.
 Each stage performs only its specific function; it does not have to be capable of performing the task of
any other stage. An individual stage might be an adder, a multiplier, or other hardware that performs
some arithmetic function.
 One variation is the fixed arithmetic pipeline. It is not very flexible: unless the exact function performed
by the pipeline is required, the CPU cannot use it.
 A configurable arithmetic pipeline is better suited to general use, as it uses multiplexers at its inputs.
The control unit of the CPU sets the select signals of the multiplexers to control the flow of data (i.e.,
the pipeline is configurable).
 A CPU may instead include a vectored arithmetic unit, which contains multiple functional units (to
perform addition, multiplication, shifting, division, etc.) so that different arithmetic operations can
proceed in parallel. It is used to implement floating-point operations, multiplication of fixed-point
numbers, and similar computations encountered in scientific applications.
 Although arithmetic pipelines can perform many iterations of the same operation in parallel, they
cannot perform different operations simultaneously.


This figure presents a pipelined architecture for multiplying two unsigned 4-bit numbers using carry-save
adders. The first stage generates the partial products M1, M2, M3, and M4. The figure also shows how M1 is
generated; the rest of the partial products can be generated in the same way. M1, M2, M3, and M4 are then
added together through two stages of carry-save adders and a final stage of carry-lookahead addition.
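
The arithmetic being wired up can be sketched in Python, with the bit-level wiring simplified to integer operations; the decomposition into two carry-save levels and a final add mirrors the description above:

# Sketch: multiplying two unsigned 4-bit numbers via partial products and
# carry-save addition. Bit-level wiring is simplified to integer arithmetic.
def partial_products(a, b, bits=4):
    # Stage 1: M_i = (i-th bit of b) * a, shifted left by i.
    return [(((b >> i) & 1) * a) << i for i in range(bits)]

def carry_save(x, y, z):
    # One CSA level: reduce three operands to a (sum, carry) pair.
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

a, b = 13, 11
m1, m2, m3, m4 = partial_products(a, b)
s, c = carry_save(m1, m2, m3)   # stage 2: first carry-save level
s, c = carry_save(s, c, m4)     # stage 3: second carry-save level
print(s + c, a * b)             # final carry-lookahead add: 143 143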
