CST202 Lect. Note 3
Arithmetic algorithms: Algorithms for multiplication and division (restoring method) of binary
numbers. Array multiplier, Booth’s multiplication algorithm.
Pipelining: Basic principles, classification of pipeline processors, instruction and arithmetic
pipelines (Design examples not required), hazard detection and resolution.
ARITHMETIC ALGORITHMS
In the binary system, multiplication of the multiplicand by one bit of the multiplier is easy.
If the multiplier bit is 1, the multiplicand is entered in the appropriate shifted position.
If the multiplier bit is 0, then 0s are entered, as in the third row of the example.
The product is computed one bit at a time by adding the bit columns from right to left and
propagating carry values between columns.
Array Multiplier
Binary multiplication of unsigned operands can be implemented in a combinational, two-dimensional logic array, as shown in the figure for the 4-bit operand case.
The main component in each cell is a full adder, FA. The AND gate in each cell determines whether a
multiplicand bit, mj, is added to the incoming partial-product bit, based on the value of the
multiplier bit, qi.
Each row i, where 0 ≤ i ≤ 3, adds the multiplicand (appropriately shifted) to the incoming partial
product, PPi, to generate the outgoing partial product, PP(i + 1), if qi = 1. If qi = 0, PPi is passed
vertically downward unchanged.
PP0 is all 0s, and PP4 is the desired product.
The multiplicand is shifted left one position per row by the diagonal signal path.
We note that the row-by-row addition done in the array circuit differs from the usual hand addition
described previously, which is done column-by-column.
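The row-by-row operation of the array can be modeled in software. The following Python sketch (an illustrative model of the data flow, not of the combinational hardware itself) forms each outgoing partial product exactly as a row of the array does:

```python
def array_multiply(m: int, q: int, n: int = 4) -> int:
    """Multiply two n-bit unsigned integers the way the array does:
    row i adds the shifted multiplicand to the incoming partial
    product PPi whenever multiplier bit qi is 1."""
    pp = 0                      # PP0 is all 0s
    for i in range(n):          # one row per multiplier bit
        qi = (q >> i) & 1       # the AND gates select the row's action
        if qi:
            pp += m << i        # add multiplicand shifted left i places
        # if qi == 0, PPi passes downward unchanged
    return pp                   # PPn is the desired product

print(array_multiply(0b1101, 0b1011))  # 13 x 11 = 143
```

Note that the hardware computes all rows concurrently as signals propagate, whereas this loop processes them one at a time.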
Hardware:
Flowchart
Example:
Booth’s Multiplication Algorithm
Booth’s algorithm is a powerful direct algorithm to perform signed number multiplication.
The algorithm is based on the fact that any binary number can be represented by the sum and
difference of other binary numbers.
It operates on the fact that strings of 0's in the multiplier require no addition but just shifting,
and a string of 1's in the multiplier from bit weight 2^k to weight 2^m can be treated as 2^(k+1) - 2^m.
It handles both +ve and -ve numbers uniformly.
It achieves some efficiency in the number of additions required when the multiplier has a few large blocks of 1's.
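A minimal Python sketch of the register-level procedure (names are illustrative): the multiplier sits in Q, an accumulator A starts at 0, and an extra bit Q-1 starts at 0. Each step inspects the pair (Q0, Q-1), optionally adds or subtracts the multiplicand M, and then arithmetic-shifts the combined A,Q,Q-1 register right:

```python
def booth_multiply(m: int, q: int, n: int = 4) -> int:
    """Booth's signed multiplication of two n-bit 2's-complement ints."""
    mask = (1 << n) - 1
    M, Q = m & mask, q & mask        # n-bit 2's-complement patterns
    A, q_1 = 0, 0                    # accumulator and the Q(-1) bit
    for _ in range(n):
        pair = ((Q & 1) << 1) | q_1  # (Q0, Q-1)
        if pair == 0b10:             # start of a string of 1s: A -= M
            A = (A - M) & mask
        elif pair == 0b01:           # end of a string of 1s: A += M
            A = (A + M) & mask
        # arithmetic shift right of the combined A,Q,Q-1 register
        q_1 = Q & 1
        Q = ((Q >> 1) | ((A & 1) << (n - 1))) & mask
        A = (A >> 1) | (A & (1 << (n - 1)))  # sign-extend A
    prod = (A << n) | Q                      # 2n-bit result in A,Q
    if prod & (1 << (2 * n - 1)):            # interpret as signed
        prod -= 1 << (2 * n)
    return prod

print(booth_multiply(2, -3))  # -6
```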
Flowchart:
Example:
Multiply 2 × -3 using Booth’s Multiplication
Multiplicand (M): 2 -> 0010
Multiplier (Q): -3 -> 1101
3                   : 0011
1's complement of 3 : 1100
Add 1               :    1
-3 (2's complement) : 1101
Integer Division
The figure shows a logic circuit arrangement that implements the restoring division algorithm.
An n-bit positive divisor is loaded into register M and an n-bit positive dividend is loaded into
register Q at the start of the operation.
Register A is set to 0.
After the division is complete, the n-bit quotient is in register Q and the remainder is in register A.
The required subtractions are facilitated by using 2’s-complement arithmetic.
The extra bit position at the left end of both A and M accommodates the sign bit during subtractions.
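The register behaviour just described can be sketched in Python (names are illustrative; A holds the partial remainder, Q carries the dividend in and the quotient out, M holds the divisor):

```python
def restoring_divide(dividend: int, divisor: int, n: int = 4):
    """Restoring division of two n-bit positive integers."""
    A, Q, M = 0, dividend, divisor
    for _ in range(n):
        # shift the combined A,Q register left one position
        A = (A << 1) | ((Q >> (n - 1)) & 1)
        Q = (Q << 1) & ((1 << n) - 1)
        A -= M                    # trial subtraction
        if A < 0:                 # sign of A is 1: subtraction failed
            A += M                # restore A; quotient bit Q0 stays 0
        else:
            Q |= 1                # subtraction succeeded; set Q0 = 1
    return Q, A                   # n-bit quotient, remainder

print(restoring_divide(8, 3))  # (2, 2): 8 = 3*2 + 2
```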
Flowchart:
PIPELINING
Pipelining is a technique of decomposing a sequential process into sub operations, with each
sub process being executed in a special dedicated segment that operates concurrently with
all other segments.
Laundry Example :Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = number of pipe stages
A pipeline can be visualized as a collection of processing segments through which binary
information flows. Each segment performs partial processing dictated by the way the task is
partitioned.
The result obtained from the computation in each segment is transferred to the next segment in
the pipeline. The final result is obtained after the data have passed through all segments.
Advantages of Pipelining
1. The cycle time of the processor is reduced.
2. It increases the throughput of the system
3. It makes the system reliable.
Disadvantages of Pipelining
1. The design of pipelined processor is complex and costly to manufacture.
2. The instruction latency is more.
Pipeline Organization
The simplest way of viewing the pipeline structure is to imagine that each segment consists of
an input register followed by a combinational circuit.
The register holds the data and the combinational circuit performs the sub operation in the
particular segment.
The output of the combinational circuit is applied to the input register of the next segment.
A clock is applied to all registers after enough time has elapsed to perform all segment activity.
In this way the information flows through the pipeline one step at a time.
The first clock pulse transfers A1 and B1 into R1 and R2.
The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4.
The same clock pulse transfers A2 and B2 into R1 and R2.
The third clock pulse operates on all three segments simultaneously.
It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2
into R4, and places the sum of R3 and R4 into R5.
It takes three clock pulses to fill up the pipe and retrieve the first output from R5.
From there on, each clock produces a new output and moves the data one step down the
pipeline.
This happens as long as new input data flow into the system.
The operands are passed through all four segments in a fixed sequence.
Each segment consists of a combinational circuit Si that performs a sub operation over the data
stream flowing through the pipe.
The segments are separated by registers Ri that hold the intermediate results between the
stages.
Information flows between adjacent stages under the control of a common clock applied to all
the registers simultaneously.
Space time diagram:
The behavior of a pipeline can be illustrated with a space time diagram.
This is a diagram that shows the segment utilization as a function of time.
In the figure, the horizontal axis displays the time in clock cycles and the vertical axis gives the segment number.
The diagram shows six tasks T1 through T6 executed in four segments.
Initially, task T1 is handled by segment 1.
After the first clock, segment 2 is busy with T1, while segment 1 is busy with task T2.
Continuing in this manner, the first task T1 is completed after the fourth clock cycle.
From then on, the pipe completes a task every clock cycle.
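The space-time diagram can be generated mechanically. In the sketch below (an illustrative helper, not from the note), task Tj occupies segment i during clock cycle i + j - 1:

```python
def space_time(k: int, n: int):
    """Space-time diagram: rows are segments 1..k, columns are clock
    cycles 1..k+n-1; task Tj occupies segment i in cycle i + j - 1."""
    cycles = k + n - 1
    grid = [["--"] * cycles for _ in range(k)]
    for j in range(1, n + 1):          # task Tj
        for i in range(1, k + 1):      # segment i
            grid[i - 1][i + j - 2] = "T" + str(j)
    return grid

for row in space_time(4, 6):           # six tasks, four segments
    print(" ".join(row))
```

Running it reproduces the staircase pattern: T1 reaches segment 4 in cycle 4, and thereafter one task completes per cycle.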
Consider the case where a k-segment pipeline with a clock cycle time tp is used to execute n
tasks.
The first task T1 requires a time equal to ktp to complete its operation since there are k
segments in a pipe.
The remaining n-1 tasks emerge from the pipe at the rate of one task per clock cycle and they
will be completed after a time equal to (n-1) tp.
Therefore, to complete n tasks using a k-segment pipeline requires
k + (n - 1) clock cycles.
Consider a non-pipeline unit that performs the same operation and takes a time equal to tn to
complete each task.
The total time required for n tasks is n tn.
The speedup of pipeline processing over an equivalent non-pipeline processing is defined by
the ratio
S = n tn / ((k + n - 1) tp)
As the number of tasks increases, n becomes much larger than k - 1, and k + n - 1 approaches the
value of n. Under this condition the speedup ratio becomes
S = tn / tp
If we assume that the time it takes to process a task is the same in the pipeline and non-pipeline circuits, we will have
tn = k tp.
With this assumption, the speedup ratio reduces to
S = k tp / tp = k
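The formulas above are easy to evaluate numerically; the small sketch below (illustrative values) shows the speedup approaching k as n grows:

```python
def pipeline_speedup(k: int, n: int, tp: float, tn: float) -> float:
    """S = n*tn / ((k + n - 1) * tp), as derived above."""
    return (n * tn) / ((k + n - 1) * tp)

# With tn = k*tp (same total work per task), S approaches k for large n:
print(pipeline_speedup(k=4, n=100, tp=20, tn=80))      # close to 4
print(pipeline_speedup(k=4, n=1_000_000, tp=20, tn=80))  # closer still
```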
1. Arithmetic Pipelines:
An arithmetic pipeline divides an arithmetic operation into sub operations for execution in the
pipeline segments.
So in arithmetic pipeline, an arithmetic operation like multiplication, addition, etc. can be
divided into series of steps that can be executed one by one in stages in Arithmetic Logic Unit
(ALU).
Pipeline arithmetic units are usually found in very high speed computers.
They are used to implement floating point operations, multiplication of fixed point numbers, and
similar computations encountered in scientific problems.
To understand the concepts of arithmetic pipeline in a more convenient way, let us consider an
example of a pipeline unit for floating-point addition and subtraction.
X = A * 10^a = 0.9504 * 10^3
Y = B * 10^b = 0.8200 * 10^2
Z = X + Y = 0.10324 * 10^4
Here A and B are mantissas (the significant digits of the floating point numbers), while a and b are exponents.
The floating point addition and subtraction is done in 4 parts:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract mantissas
4. Normalize the result.
Registers are used for storing the intermediate results between the above operations.
The following block diagram represents the sub operations performed in each segment of the pipeline.
1. Compare exponents by subtraction:
The exponents are compared by subtracting them to determine their difference. The larger
exponent is chosen as the exponent of the result.
The difference of the exponents, i.e., 3 - 2 = 1 determines how many times the mantissa
associated with the smaller exponent must be shifted to the right.
2. Align the mantissas:
The mantissa associated with the smaller exponent is shifted according to the difference of
exponents determined in segment one.
3. Add mantissas:
The two mantissas are added in segment three.
Z = X + Y = 1.0324 * 10^3
4. Normalize the result:
After normalization, the result is written as:
Z = 0.10324 * 10^4
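The four segments can be traced for the example above with a small Python model (decimal mantissa/exponent pairs; the function name is illustrative, and a real unit would operate on binary fields):

```python
def fp_add(a_mant, a_exp, b_mant, b_exp):
    """Four-segment floating point addition on decimal (mantissa,
    exponent) pairs with mantissas normalized as 0.xxxx."""
    # Segment 1: compare exponents by subtraction
    diff = a_exp - b_exp
    # Segment 2: align the mantissa of the smaller exponent
    if diff >= 0:
        b_mant /= 10 ** diff
        exp = a_exp
    else:
        a_mant /= 10 ** (-diff)
        exp = b_exp
    # Segment 3: add the mantissas
    mant = a_mant + b_mant
    # Segment 4: normalize the result so that 0.1 <= |mant| < 1
    while abs(mant) >= 1:
        mant /= 10
        exp += 1
    while 0 < abs(mant) < 0.1:
        mant *= 10
        exp -= 1
    return mant, exp

print(fp_add(0.9504, 3, 0.8200, 2))  # approximately (0.10324, 4)
```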
2. Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream as well.
Most digital computers with complex instructions require an instruction pipeline to carry out operations such as fetching, decoding and executing instructions.
In general, the computer needs to process each instruction with the following sequence of
steps.
1. Fetch instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
Each step is executed in a particular segment, and there are times when different segments
may take different times to operate on the incoming information.
Moreover, there are times when two or more segments may require memory access at the
same time, causing one segment to wait until another is finished with the memory.
The organization of an instruction pipeline will be more efficient if the instruction cycle is
divided into segments of equal duration.
One of the most common examples of this type of organization is a Four-segment
instruction pipeline.
A four-segment instruction pipeline combines two or more of these segments into a single one.
For instance, the decoding of the instruction can be combined with the calculation of the
effective address into one segment.
The following block diagram shows a typical example of a four-segment instruction pipeline.
The instruction cycle is completed in four segments.
Segment 1: FI
The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2: DA
The instruction fetched from memory is decoded in the second segment, and eventually, the
effective address is calculated in a separate arithmetic circuit.
Segment 3: FO
An operand from memory is fetched in the third segment.
Segment 4: EX
The instructions are finally executed in the last segment of the pipeline organization.
The following shows the operation of the instruction pipeline. The time on the horizontal axis is divided into steps of equal duration. The four segments are represented in the diagram with abbreviated symbols.
It is assumed that the processor has separate instruction and data memories so that the operation
in FI and FO can proceed at the same time.
In the absence of a branch instruction, each segment operates on different instructions.
Thus, in step 4, instruction 1 is being executed in segment EX; the operand for instruction 2 is
being fetched into segment FO; instruction 3 is being decoded in segment DA; and instruction 4 is
being fetched from memory in segment FI.
Pipeline Conflicts
There are some factors that cause the pipeline to deviate from its normal performance. Some of these factors are given below:
1. Timing Variations
All stages cannot take the same amount of time. This problem generally occurs in instruction processing, where different instructions have different operand requirements and thus different processing times.
2. Data Hazards
When several instructions are in partial execution, a problem arises if they reference the same data. We must ensure that the next instruction does not attempt to access the data before the current instruction has finished with it, because this would lead to incorrect results.
3. Branching
In order to fetch and execute the next instruction, we must know what that instruction is. If the present instruction is a conditional branch, whose result determines which instruction comes next, then the next instruction may not be known until the current one is processed.
4. Interrupts
Interrupts inject unwanted instructions into the instruction stream and affect the execution of instructions.
5. Data Dependency
It arises when an instruction depends upon the result of a previous instruction but this result
is not yet available.
Note : Read after read (RAR) is not a hazard case.
Consider two instructions i1 and i2, with i1 occurring before i2 in program order.
Read after write (RAW): (i2 tries to read a source before i1 writes to it)
A read after write (RAW) data hazard refers to a situation where an instruction refers to
a result that has not yet been calculated or retrieved.
This can occur because even though an instruction is executed after a prior instruction,
the prior instruction has been processed only partly through the pipeline.
For example:
i1: R2 <- R5 + R3
i2: R4 <- R2 + R3
The first instruction is calculating a value to be saved in register R2, and the second
is going to use this value to compute a result for register R4.
However, in a pipeline, when operands are fetched for the second instruction, the results from the first have not yet been saved, and hence a data dependency occurs.
A data dependency occurs with instruction i2, as it is dependent on the completion
of instruction i1.
Write after read (WAR): (i2 tries to write a destination before it is read by i1)
A write after read (WAR) data hazard represents a problem with concurrent execution.
For example:
i1. R4 <- R1 + R5
i2. R5 <- R1 + R2
In any situation with a chance that i2 may finish before i1 (i.e., with concurrent
execution), it must be ensured that the result of register R5 is not stored before i1
has had a chance to fetch the operands.
Write after write (WAW): (i2 tries to write an operand before it is written by i1)
A write after write (WAW) data hazard may occur in a concurrent execution
environment.
For example:
i1. R2 <- R4 + R7
i2. R2 <- R1 + R3
The write back (WB) of i2 must be delayed until i1 finishes executing.
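The three cases can be detected mechanically from each instruction's destination and source registers. A small illustrative sketch (the (destination, sources) representation is an assumption for this example):

```python
def classify_hazards(i1, i2):
    """Classify data hazards between i1 (earlier) and i2 (later).
    Each instruction is a (destination, set_of_sources) pair."""
    d1, s1 = i1
    d2, s2 = i2
    hazards = []
    if d1 in s2:
        hazards.append("RAW")   # i2 reads what i1 writes
    if d2 in s1:
        hazards.append("WAR")   # i2 writes what i1 reads
    if d1 == d2:
        hazards.append("WAW")   # both write the same register
    return hazards

# i1: R2 <- R5 + R3 ; i2: R4 <- R2 + R3  gives a RAW hazard on R2
print(classify_hazards(("R2", {"R5", "R3"}), ("R4", {"R2", "R3"})))
```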
Solutions for Data Hazards
a) Stalling
b) Forwarding
c) Reordering
a) Stalling:
Consider the following example
add $1, $2, $3
sub $4, $5, $1
Earlier instruction produces a value used by a later instruction.
Without a stall, sub would read R1 in its decode stage before add has written it back:

Cycle:         1  2  3  4  5  6  7  8  9  10
add $1,$2,$3   F  D  X  M  W
sub $4,$5,$1      F  D  X  M  W

To resolve the hazard, the pipeline stalls sub until add has written R1 in its W stage:

Cycle:         1  2  3  4  5  6  7  8  9  10
add $1,$2,$3   F  D  X  M  W
sub $4,$5,$1      F  -  -  -  D  X  M  W
To minimize data dependency stalls in the pipeline, operand forwarding is used.
Operand Forwarding: In operand forwarding, we use the interface registers present between the stages to hold the intermediate output, so that a dependent instruction can access the new value from the interface register directly.
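The benefit of forwarding can be sketched under a simple model (an assumption for illustration, not the note's exact timing): a five-stage F D X M W pipeline in which, without forwarding, a value is usable only after the producer's W stage, while ALU-to-ALU forwarding makes it usable right after the producer's X stage:

```python
def raw_stalls(distance: int, forwarding: bool) -> int:
    """Bubbles a dependent instruction suffers from a RAW hazard in a
    simple five-stage F D X M W pipeline model.  `distance` is how many
    instructions separate consumer from producer (1 = back to back)."""
    penalty = 0 if forwarding else 3   # worst-case bubbles, back to back
    return max(0, penalty - (distance - 1))

print(raw_stalls(1, forwarding=False))  # add/sub pair above: 3 bubbles
print(raw_stalls(1, forwarding=True))   # forwarding removes them all
```

Under this model, every extra independent instruction placed between producer and consumer hides one bubble, which is why compilers also try to reorder code around such dependencies.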
2. Structural hazards
This dependency arises due to a resource conflict in the pipeline.
A structural hazard occurs when two (or more) instructions that are already in the pipeline need the same resource.
The result is that the instructions must be executed in series rather than in parallel for a portion of the pipeline.
Structural hazards are sometimes referred to as resource hazards.
Example:
A situation in which multiple instructions are ready to enter the execute phase and there is a single ALU (Arithmetic Logic Unit).
One solution to such a resource hazard is to increase the available resources, such as having multiple ports into main memory and multiple ALUs (Arithmetic Logic Units).
Another Example:
In the above scenario, in cycle 4, instructions I1 and I4 are trying to access the same resource (memory), which introduces a resource conflict.
3. Control hazards (branch hazards or instruction hazards)
This type of dependency occurs during the transfer of control instructions such as BRANCH,
CALL, JMP, etc.
A control hazard arises when we need to find the destination of a branch and cannot fetch any new instructions until we know that destination.
Control hazard occurs when the pipeline makes wrong decisions on branch prediction and
therefore brings instructions into the pipeline that must subsequently be discarded.
The term branch hazard also refers to a control hazard.
Consider the following sequence of instructions in the program:
100: I1
101: I2 (JMP 250)
102: I3
.
.
250: BI1
To correct the problem we need to stop the instruction fetch until we get the target address of the
branch instruction. This can be implemented by introducing a delay slot until we get the target
address.
If the branch is taken, the fetched instruction is turned into a no-op (idle) and instruction fetch restarts at the branch target address.
c) Delayed branch
Specify in the architecture that the instruction immediately following a branch is always executed, regardless of whether or not the branch is taken.
The instructions in the delay slots are always fetched. Therefore, we would like to arrange for them to be fully executed whether or not the branch is taken.
The objective is to place useful instructions in these slots.
The effectiveness of the delayed branch approach depends on how often it is possible to
reorder instructions.