CST202 Lect. Note 3

MODULE – III

Arithmetic algorithms: Algorithms for multiplication and division (restoring method) of binary
numbers. Array multiplier, Booth’s multiplication algorithm.
Pipelining: Basic principles, classification of pipeline processors, instruction and arithmetic
pipelines (Design examples not required), hazard detection and resolution.

ARITHMETIC ALGORITHMS

1. Multiplication of binary numbers

 In the binary system, multiplication of the multiplicand by one bit of the multiplier is easy.
 If the multiplier bit is 1, the multiplicand is entered in the appropriate shifted position.
 If the multiplier bit is 0, then 0s are entered, as in the third row of the example.
 The product is computed one bit at a time by adding the bit columns from right to left and
propagating carry values between columns.
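
As a rough illustration of this shift-and-add process (not part of the original notes; the function name and operand values are chosen only for this sketch), the following Python fragment forms one shifted partial product per multiplier bit and accumulates them:

# Sketch: paper-and-pencil style multiplication of unsigned binary numbers.
def multiply_unsigned(multiplicand, multiplier):
    product = 0
    position = 0
    while multiplier:
        if multiplier & 1:                        # multiplier bit is 1
            product += multiplicand << position   # add the shifted multiplicand
        # a 0 bit contributes only 0s, so nothing is added
        position += 1                             # move to the next bit weight
        multiplier >>= 1
    return product

print(multiply_unsigned(0b1101, 0b1011))          # 13 * 11 = 143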

Array Multiplier
 Binary multiplication of unsigned operands can be implemented in a combinational, two-
dimensional logic array, as shown in the figure for the 4-bit operand case.
 The main component in each cell is a full adder, FA. The AND gate in each cell determines whether a
multiplicand bit, mj, is added to the incoming partial-product bit, based on the value of the
multiplier bit, qi.

 Each row i, where 0 ≤ i ≤ 3, adds the multiplicand (appropriately shifted) to the incoming partial
product, PPi, to generate the outgoing partial product, PP(i + 1), if qi = 1. If qi = 0, PPi is passed
vertically downward unchanged.
 PP0 is all 0s, and PP4 is the desired product.
 The multiplicand is shifted left one position per row by the diagonal signal path.
 We note that the row-by-row addition done in the array circuit differs from the usual hand addition
described previously, which is done column-by-column.

(a) Array multiplication of positive binary operands (b) Multiplier cell
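
The cell behaviour described above can be modelled in Python as follows; this is only an illustrative bit-level sketch (the helper names and the 13 × 11 test values are assumptions), not the exact wiring of the figure:

# Sketch of the 4x4 array multiplier: every cell ANDs one multiplicand bit
# with one multiplier bit and adds the result to the incoming partial-product
# bit using a full adder.
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def array_multiply(m_bits, q_bits):
    # m_bits, q_bits: lists of bits, least significant bit first
    n = len(m_bits)
    pp = [0] * (2 * n)                      # PP0 is all 0s
    for i, qi in enumerate(q_bits):         # row i handles multiplier bit qi
        carry = 0
        for j, mj in enumerate(m_bits):     # one cell per multiplicand bit
            s, carry = full_adder(pp[i + j], mj & qi, carry)
            pp[i + j] = s                   # outgoing partial-product bit
        pp[i + n] = carry                   # carry out of the row
    return pp                               # PP4 is the product, LSB first

print(array_multiply([1, 0, 1, 1], [1, 1, 0, 1]))   # 1101 * 1011 = 13 * 11 = 143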

Hardware:

Flowchart

Example:

Booth’s Multiplication Algorithm
 Booth’s algorithm is a powerful direct algorithm to perform signed number multiplication.
 The algorithm is based on the fact that any binary number can be represented by the sum and
difference of other binary numbers.
 It operates on the fact that strings of 0's in the multiplier require no addition but just shifting,
and a string of 1's in the multiplier from bit weight 2^k down to weight 2^m can be treated as 2^(k+1) – 2^m.
 It handles both positive and negative numbers uniformly.
 It achieves some efficiency in the number of additions required, when the multiplier has a few
large blocks of 1’s.

 Booth's algorithm speeds up the multiplication process.

Flowchart:

Example:
Multiply 2 × -3 using Booth’s Multiplication
Multiplicand (M): 2 -> 0010
Multiplier (Q): -3 -> 1101
3 in binary          : 0011
1's complement of 3  : 1100
add 1                : 0001
-3 in 2's complement : 1101
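
The register-level steps can be sketched in Python as below. This is a minimal model of Booth's method (the register names follow the usual A, Q, Q-1 convention and the 4-bit word size is assumed for the 2 × -3 example), not a reproduction of the flowchart above:

# Sketch of Booth's multiplication with registers A, Q and the extra bit Q-1.
# Each of the n cycles inspects the pair (Q0, Q-1), adds or subtracts the
# multiplicand, then arithmetically shifts A,Q,Q-1 right by one position.
def booth_multiply(multiplicand, multiplier, n=4):
    mask = (1 << n) - 1
    M = multiplicand & mask
    A, Q, Q_1 = 0, multiplier & mask, 0
    for _ in range(n):
        pair = (Q & 1, Q_1)
        if pair == (1, 0):                 # start of a string of 1s: A = A - M
            A = (A - M) & mask
        elif pair == (0, 1):               # end of a string of 1s: A = A + M
            A = (A + M) & mask
        # arithmetic right shift of the combined A, Q, Q-1
        Q_1 = Q & 1
        Q = ((Q >> 1) | ((A & 1) << (n - 1))) & mask
        A = (A >> 1) | (A & (1 << (n - 1)))      # replicate the sign bit of A
    product = (A << n) | Q                 # 2n-bit product held in A,Q
    if product & (1 << (2 * n - 1)):       # interpret as a signed number
        product -= 1 << (2 * n)
    return product

print(booth_multiply(2, -3))               # expected result: -6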

Integer Division

 The figure shows a logic circuit arrangement that implements the restoring division algorithm
described below.
 An n-bit positive divisor is loaded into register M and an n-bit positive dividend is loaded into
register Q at the start of the operation.
 Register A is set to 0.
 After the division is complete, the n-bit quotient is in register Q and the remainder is in register A.
 The required subtractions are facilitated by using 2’s-complement arithmetic.
 The extra bit position at the left end of both A and M accommodates the sign bit during subtractions.

 The following algorithm performs restoring division.


Do the following three steps n times:
1. Shift A and Q left one bit position.
2. Subtract M from A, and place the answer back in A.
3. If the sign of A is 1, set Q0 to 0 and add M back to A (that is, restore A); otherwise, set Q0 to 1
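
A minimal Python sketch of these three steps is given below (the function name, operand values, and 4-bit word size are assumptions made only for illustration):

# Sketch of restoring division for n-bit positive operands.
# A is the partial remainder, Q holds the dividend and receives the quotient,
# and M is the divisor, mirroring the register names used above.
def restoring_divide(dividend, divisor, n=4):
    A, Q, M = 0, dividend, divisor
    for _ in range(n):
        # 1. Shift A and Q left one bit position.
        A = (A << 1) | ((Q >> (n - 1)) & 1)
        Q = (Q << 1) & ((1 << n) - 1)
        # 2. Subtract M from A and place the answer back in A.
        A = A - M
        # 3. If A is negative, restore A and set Q0 to 0; otherwise set Q0 to 1.
        if A < 0:
            A = A + M
        else:
            Q = Q | 1
    return Q, A        # quotient in Q, remainder in A

print(restoring_divide(8, 3))   # (2, 2): 8 divided by 3 is 2 remainder 2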

Flowchart:

 PIPELINING
 Pipelining is a technique of decomposing a sequential process into sub operations, with each
sub process being executed in a special dedicated segment that operates concurrently with
all other segments.
 Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes

Sequential laundry takes 6 hours for 4 loads.


Pipelined laundry takes 3.5 hours for 4 loads

 Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
 Pipeline rate is limited by the slowest pipeline stage.
 Multiple tasks operate simultaneously.
 Potential speedup = number of pipe stages.
 A pipeline can be visualized as a collection of processing segments through which binary
information flows. Each segment performs partial processing dictated by the way the task is
partitioned.
 The result obtained from the computation in each segment is transferred to the next segment in
the pipeline. The final result is obtained after the data have passed through all segments.
Advantages of Pipelining
1. The cycle time of the processor is reduced.
2. It increases the throughput of the system
3. It makes the system reliable.

Disadvantages of Pipelining
1. The design of a pipelined processor is complex and costly to manufacture.
2. The instruction latency increases.

 Pipeline Organization
 The simplest way of viewing the pipeline structure is to imagine that each segment consists of
an input register followed by a combinational circuit.
 The register holds the data and the combinational circuit performs the sub operation in the
particular segment.
 The output of the combinational circuit is applied to the input register of the next segment.
 A clock is applied to all registers after enough time has elapsed to perform all segment activity.
 In this way the information flows through the pipeline one step at a time.

Example demonstrating the pipeline organization


 Suppose we want to perform the combined multiply and add operations with a stream of
numbers.
Ai * Bi + Ci for i = 1, 2, 3, ..., 7
 Each sub operation is to be implemented in a segment within a pipeline. Each segment has one or
two registers and a combinational circuit as shown in fig.

 The sub operations performed in each segment of the pipeline are as follows:


R1 <- Ai, R2 <- Bi          Input Ai and Bi
R3 <- R1 * R2, R4 <- Ci     Multiply and input Ci
R5 <- R3 + R4               Add Ci to the product
 The five registers are loaded with new data every clock pulse.

 The first clock pulse transfers A1 and B1 into R1 and R2.
 The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4.
 The same clock pulse transfers A2 and B2 into R1 and R2.
 The third clock pulse operates on all three segments simultaneously.
 It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2
into R4, and places the sum of R3 and R4 into R5.
 It takes three clock pulses to fill up the pipe and retrieve the first output from R5.
 From there on, each clock produces a new output and moves the data one step down the
pipeline.
 This happens as long as new input data flow into the system.
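
The clock-by-clock behaviour described above can be sketched as a small simulation; the data values below are assumptions chosen only to show the pipeline filling and draining:

# Sketch: three-segment pipeline computing Ai*Bi + Ci.
# R1..R5 are modelled as variables that are all updated on every clock pulse;
# None means the corresponding segment has not received valid data yet.
A = [1, 2, 3, 4, 5, 6, 7]
B = [2, 2, 2, 2, 2, 2, 2]
C = [1, 1, 1, 1, 1, 1, 1]

R1 = R2 = R3 = R4 = R5 = None
results = []
for clock in range(len(A) + 2):                      # extra pulses drain the pipe
    new_R5 = R3 + R4 if R3 is not None else None     # segment 3: add
    new_R3 = R1 * R2 if R1 is not None else None     # segment 2: multiply
    new_R4 = C[clock - 1] if 1 <= clock <= len(C) else None
    new_R1 = A[clock] if clock < len(A) else None    # segment 1: input Ai, Bi
    new_R2 = B[clock] if clock < len(B) else None
    R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
    if R5 is not None:
        results.append(R5)

print(results)    # [3, 5, 7, 9, 11, 13, 15], i.e. Ai*Bi + Ci for i = 1..7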

 Four Segment Pipeline


 The general structure of four segment pipeline is shown in fig.

 The operands are passed through all four segments in a fixed sequence.
 Each segment consists of a combinational circuit Si that performs a sub operation over the data
stream flowing through the pipe.
 The segments are separated by registers Ri that hold the intermediate results between the
stages.
 Information flows between adjacent stages under the control of a common clock applied to all
the registers simultaneously.
Space time diagram:
 The behavior of a pipeline can be illustrated with a space time diagram.
 This is a diagram that shows the segment utilization as a function of time.

 In the figure, the horizontal axis displays the time in clock cycles and the vertical axis gives the
segment number.
 The diagram shows six tasks T1 through T6 executed in four segments.
 Initially, task T1 is handled by segment 1.
 After the first clock, segment 2 is busy with T1, while segment 1 is busy with task T2.
Continuing in this manner, the first task T1 is completed after the fourth clock cycle.
 From then on, the pipe completes a task every clock cycle.
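
The space-time diagram itself is easy to generate; the following sketch prints one row per segment for the four-segment, six-task case discussed above (purely illustrative):

# Sketch: space-time diagram of a k-segment pipeline executing n tasks.
# Segment s works on task T(c - s + 1) during clock cycle c.
def space_time(k=4, n=6):
    total_cycles = k + n - 1                      # cycles needed for n tasks
    for seg in range(1, k + 1):
        row = []
        for cycle in range(1, total_cycles + 1):
            task = cycle - seg + 1
            row.append(f"T{task}" if 1 <= task <= n else "--")
        print(f"Segment {seg}: " + " ".join(row))

space_time()   # T1 finishes after cycle 4, T6 after cycle 9 = 4 + 6 - 1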
 Consider the case where a k-segment pipeline with a clock cycle time tp is used to execute n
tasks.
 The first task T1 requires a time equal to ktp to complete its operation since there are k
segments in a pipe.
 The remaining n-1 tasks emerge from the pipe at the rate of one task per clock cycle and they
will be completed after a time equal to (n-1) tp.
 Therefore, to complete n tasks using a k-segment pipeline requires
k + (n - 1) clock cycles.
 Consider a non-pipeline unit that performs the same operation and takes a time equal to tn to
complete each task.
 The total time required for n tasks is n tn.
 The speedup of pipeline processing over an equivalent non-pipeline processing is defined by
the ratio
S = n tn / [(k + n - 1) tp]
 As the number of tasks increases, n becomes much larger than k - 1, and k + n - 1 approaches the
value of n. Under this condition the speedup ratio becomes
S = tn / tp
 If we assume that the time it takes to process a task is the same in the pipeline and non-pipeline
circuits, we will have tn = k tp.
 Including this assumption, the speedup ratio reduces to
S = k tp / tp = k
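
For example, assuming purely illustrative values of k = 4 segments, tp = 20 ns, tn = k tp = 80 ns, and n = 100 tasks:
Pipelined time     = (k + n - 1) tp = (4 + 100 - 1) * 20 ns = 2060 ns
Non-pipelined time = n tn = 100 * 80 ns = 8000 ns
S = 8000 / 2060 ≈ 3.88, which approaches the maximum speedup k = 4 as n grows.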

 Classification of Pipeline Processors


In general, the pipeline organization is applicable to two areas of computer design:
1. Arithmetic Pipeline
2. Instruction Pipeline

1. Arithmetic Pipelines:
 An arithmetic pipeline divides an arithmetic operation into sub operations for execution in the
pipeline segments.
 So in an arithmetic pipeline, an arithmetic operation like multiplication, addition, etc. can be
divided into a series of steps that can be executed one by one in stages in the Arithmetic Logic Unit
(ALU).
 Pipeline arithmetic units are usually found in very high speed computers.
 They are used to implement floating point operations, multiplication of fixed point numbers, and
similar computations encountered in scientific problems.
 To understand the concepts of arithmetic pipeline in a more convenient way, let us consider an
example of a pipeline unit for floating-point addition and subtraction.
X = A * 10^a = 0.9504 * 10^3
Y = B * 10^b = 0.8200 * 10^2
Z = X + Y = 0.10324 * 10^4
 Here A and B are mantissas (the significant digits of the floating point numbers), while a and b are
exponents.
 The floating point addition and subtraction are done in 4 parts:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract mantissas
4. Produce the result.
 Registers are used for storing the intermediate results between the above operations.
 The following block diagram represents the sub operations performed in each segment of the
pipeline.
1. Compare exponents by subtraction:
 The exponents are compared by subtracting them to determine their difference. The larger
exponent is chosen as the exponent of the result.
 The difference of the exponents, i.e., 3 - 2 = 1 determines how many times the mantissa
associated with the smaller exponent must be shifted to the right.
2. Align the mantissas:
 The mantissa associated with the smaller exponent is shifted according to the difference of
exponents determined in segment one.

X = 0.9504 * 10^3    Y = 0.08200 * 10^3

3. Add mantissas:
 The two mantissas are added in segment three.
Z = X + Y = 1.0324 * 10^3
4. Normalize the result:
After normalization, the result is written as:
Z = 0.10324 * 10^4
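
The four sub operations can be sketched for decimal (mantissa, exponent) pairs as follows; this is only an illustrative model of the segment functions (the function name and rounding are assumptions), not a hardware description:

# Sketch of the four segments of the floating-point adder pipeline.
def fp_add(a_mant, a_exp, b_mant, b_exp):
    # Segment 1: compare the exponents by subtraction.
    diff = a_exp - b_exp
    # Segment 2: align the mantissa associated with the smaller exponent.
    if diff >= 0:
        b_mant /= 10 ** diff
        exp = a_exp
    else:
        a_mant /= 10 ** (-diff)
        exp = b_exp
    # Segment 3: add the mantissas.
    mant = a_mant + b_mant
    # Segment 4: normalize the result so that 0.1 <= |mantissa| < 1.
    while abs(mant) >= 1:
        mant /= 10
        exp += 1
    while 0 < abs(mant) < 0.1:
        mant *= 10
        exp -= 1
    return round(mant, 5), exp

print(fp_add(0.9504, 3, 0.8200, 2))   # (0.10324, 4), i.e. 0.10324 * 10^4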

2. Instruction Pipeline
 Pipeline processing can occur not only in the data stream but in the instruction stream as well.
 Most digital computers with complex instructions require an instruction pipeline to carry
out operations such as fetching, decoding, and executing instructions.
 In general, the computer needs to process each instruction with the following sequence of
steps.
1. Fetch instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.

 Each step is executed in a particular segment, and there are times when different segments
may take different times to operate on the incoming information.
 Moreover, there are times when two or more segments may require memory access at the
same time, causing one segment to wait until another is finished with the memory.
 The organization of an instruction pipeline will be more efficient if the instruction cycle is
divided into segments of equal duration.
 One of the most common examples of this type of organization is a Four-segment
instruction pipeline.
 A four-segment instruction pipeline combines two or more of these steps into a single
segment.
 For instance, the decoding of the instruction can be combined with the calculation of the
effective address into one segment.
 The following block diagram shows a typical example of a four-segment instruction pipeline.
The instruction cycle is completed in four segments.

Segment 1: FI
The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2: DA
The instruction fetched from memory is decoded in the second segment, and eventually, the
effective address is calculated in a separate arithmetic circuit.
Segment 3: FO
An operand from memory is fetched in the third segment.
Segment 4: EX
The instructions are finally executed in the last segment of the pipeline organization.
 The following shows the operation of the instruction pipeline. The time on the horizontal axis is
divided into steps of equal duration. The four segments are represented in the diagram with
abbreviated symbols.

 It is assumed that the processor has separate instruction and data memories so that the operation
in FI and FO can proceed at the same time.
 In the absence of a branch instruction, each segment operates on different instructions.
 Thus, in step 4, instruction 1 is being executed in segment EX; the operand for instruction 2 is
being fetched into segment FO; instruction 3 is being decoded in segment DA; and instruction 4 is
being fetched from memory in segment FI.
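
For illustration, the overlap described above can be tabulated as follows (I1–I4 are four consecutive instructions, with no branches assumed):

Step:   1    2    3    4    5    6    7
I1:     FI   DA   FO   EX
I2:          FI   DA   FO   EX
I3:               FI   DA   FO   EX
I4:                    FI   DA   FO   EX

In step 4, I1 is in EX, I2 in FO, I3 in DA, and I4 in FI, exactly as stated above.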

 Pipeline Conflicts
There are some factors that cause the pipeline to deviate from its normal performance. Some of these
factors are given below:
1. Timing Variations
All stages cannot take the same amount of time. This problem generally occurs in instruction
processing, where different instructions have different operand requirements and thus
different processing times.
2. Data Hazards
When several instructions are in partial execution and they reference the same data, a
problem arises. We must ensure that the next instruction does not attempt to access the data
before the current instruction has finished with it, because this would lead to incorrect results.
3. Branching
In order to fetch and execute the next instruction, we must know what that instruction is. If the
present instruction is a conditional branch, and its result will lead us to the next instruction,
then the next instruction may not be known until the current one is processed.
4. Interrupts
Interrupts insert unwanted instructions into the instruction stream and affect the execution
of instructions.
5. Data Dependency
It arises when an instruction depends upon the result of a previous instruction but this result
is not yet available.

 Pipeline Hazards Detection and Resolution

 The problems that occur in the pipeline are called hazards.


 Hazards that arise in the pipeline prevent the next instruction from executing during its
designated clock cycle.
 There are three types of hazards:
1. Data hazards: Instruction depends on result of prior instruction still in the pipeline
2. Structural hazards: Hardware cannot support certain combinations of instructions
(two instructions in the pipeline require the same resource).
3. Control hazards: Caused by delay between the fetching of instructions and decisions
about changes in control flow (branches and jumps).
1. Data hazards
 Data hazards occur when instructions that exhibit data dependence modify data in
different stages of a pipeline.
 Ignoring potential data hazards can result in race conditions (also termed race hazards).
 There are three situations in which a data hazard can occur:

a) Read After Write (RAW), a true dependency


b) Write After Read (WAR), an anti-dependency
c) Write After Write (WAW), an output dependency

Note: Read after read (RAR) is not a hazard.
 Consider two instructions i1 and i2, with i1 occurring before i2 in program order.
Read after write (RAW): (i2 tries to read a source before i1 writes to it)
 A read after write (RAW) data hazard refers to a situation where an instruction refers to
a result that has not yet been calculated or retrieved.
 This can occur because even though an instruction is executed after a prior instruction,
the prior instruction has been processed only partly through the pipeline.
For example:
i1: R2 <- R5 + R3
i2: R4 <- R2 + R3
 The first instruction is calculating a value to be saved in register R2, and the second
is going to use this value to compute a result for register R4.
 However, in a pipeline, when operands are fetched for the 2nd operation, the
results from the first have not yet been saved, and hence a data dependency occurs.
 A data dependency occurs with instruction i2, as it is dependent on the completion
of instruction i1.
Write after read (WAR): (i2 tries to write a destination before it is read by i1)
 A write after read (WAR) data hazard represents a problem with concurrent execution.
For example:
i1. R4 <- R1 + R5
i2. R5 <- R1 + R2
 In any situation with a chance that i2 may finish before i1 (i.e., with concurrent
execution), it must be ensured that the result of register R5 is not stored before i1
has had a chance to fetch the operands.

Write after write (WAW): (i2 tries to write an operand before it is written by i1)
 A write after write (WAW) data hazard may occur in a concurrent execution
environment.
For example:
i1. R2 <- R4 + R7
i2. R2 <- R1 + R3
 The write back (WB) of i2 must be delayed until i1 finishes executing.
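
Given the read and write register sets of two instructions, the three cases can be detected mechanically; the following Python sketch (the helper name and register notation are assumptions for illustration) classifies the hazards between i1 and the later instruction i2:

# Sketch: classify data hazards between instruction i1 and a later
# instruction i2 from their sets of read (source) and written (destination)
# registers.
def classify_hazards(i1_reads, i1_writes, i2_reads, i2_writes):
    hazards = []
    if i1_writes & i2_reads:
        hazards.append("RAW (true dependency)")
    if i1_reads & i2_writes:
        hazards.append("WAR (anti-dependency)")
    if i1_writes & i2_writes:
        hazards.append("WAW (output dependency)")
    return hazards

# i1: R2 <- R5 + R3    i2: R4 <- R2 + R3
print(classify_hazards({"R5", "R3"}, {"R2"}, {"R2", "R3"}, {"R4"}))
# ['RAW (true dependency)']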

Solutions for Data Hazards
a) Stalling
b) Forwarding
c) Reordering

a) Stalling:
Consider the following example
add $1, $2, $3
sub $4, $5, $1
The earlier instruction produces a value used by the later instruction.

Without a stall, sub reads $1 in its decode (D) stage before add has written it back in W:

Cycle:         1   2   3   4   5   6
add $1,$2,$3   F   D   X   M   W
sub $4,$5,$1       F   D   X   M   W

With stalling, sub is held in decode until add has written $1 back (assuming the register file is
written in the first half of a clock cycle and read in the second half):

Cycle:         1   2   3   4   5   6   7   8
add $1,$2,$3   F   D   X   M   W
sub $4,$5,$1       F   -   -   D   X   M   W

b) Forwarding: Connect new value directly to next stage


The root cause of a data hazard is the data dependency between instructions.
To minimize data dependency stalls in the pipeline, operand forwarding is used.
Operand Forwarding: In operand forwarding, we use the interface registers present between the
stages to hold intermediate output so that a dependent instruction can access the new value
directly from the interface register.

2. Structural hazards
 This dependency arises due to the resource conflict in the pipeline.
 A structural hazard occurs when two (or more) instructions that are already in the pipeline need
the same resource.
 The result is that the instructions must be executed in series rather than in parallel for a portion of
the pipeline.
 Structural hazards are sometimes referred to as resource hazards.
Example:
 A situation in which multiple instructions are ready to enter the execute instruction phase
and there is a single ALU (Arithmetic Logic Unit).
 One solution to such a resource hazard is to increase the available resources, such as having
multiple ports into main memory and multiple ALU (Arithmetic Logic Unit) units.
Another Example:

 In the above scenario, in cycle 4, instructions I1 and I4 are trying to access the same resource
(memory), which introduces a resource conflict.

Dealing with Structural Hazards


a) Stall
• Low cost, simple
• Increases CPI
• Use for rare cases, since stalling has a performance cost
• To avoid this problem, we have to keep the instruction waiting until the required resource
(memory in our case) becomes available. This wait introduces stalls in the pipeline as
shown below:

b) Pipeline the hardware resource


• useful for multi-cycle resources
• good performance
• sometimes complex e.g., RAM
c) Replicate resource
• good performance
• increases cost (+ maybe interconnect delay)
• useful for cheap or divisible resources

Structural hazards are reduced with these rules:


• Each instruction uses a resource at most once
• Always use the resource in the same pipeline stage
• Use the resource for one cycle only

3. Control hazards (branch hazards or instruction hazards)
 This type of dependency occurs during the transfer of control instructions such as BRANCH,
CALL, JMP, etc.
 A control hazard arises when we need to find the destination of a branch and cannot fetch any new
instructions until we know that destination.
 A control hazard occurs when the pipeline makes wrong decisions on branch prediction and
therefore brings instructions into the pipeline that must subsequently be discarded.
 The term branch hazard also refers to a control hazard.
 Consider the following sequence of instructions in the program:
100: I1
101: I2 (JMP 250)
102: I3
.
.
250: BI1

 Expected output: I1 -> I2 -> BI1


 NOTE: Generally, the target address of the JMP instruction is known only after the ID stage.

 Output Sequence: I1 -> I2 -> I3 -> BI1


 So, the output sequence is not equal to the expected output, which means the pipeline does not
handle the branch correctly.
Dealing with Control Hazards
a) Stall: Stop fetching instructions until the result is available

 To correct the problem we need to stop the instruction fetch until we get the target address of the
branch instruction. This can be implemented by introducing a delay slot until we get the target
address.

 Output Sequence: I1 -> I2 -> Delay (Stall) -> BI1


 As the delay slot performs no operation, this output sequence is equal to the expected output
sequence. But this slot introduces a stall in the pipeline.
 Another example

b) Predict: Assume an outcome and continue fetching (undo if prediction is wrong)


 Solution for control dependency: Branch prediction is the method through which stalls due to
control dependency can be eliminated.
 Continue fetching as if we won’t take the branch, but then invalidate the instructions
if we do take the branch
 Simply treat every branch as untaken; when the branch is indeed untaken, pipelining proceeds as if
there were no hazard.

 But if the branch is taken, turn the fetched instructions into no-ops (idle) and restart the
instruction fetch at the branch target address.

c) Delayed branch
 Specify in the architecture that the instruction immediately following a branch is always executed.
 Always execute instructions following a branch regardless of whether or not we take it
 The instructions in the delay slots are always fetched. Therefore, we would like to arrange for
them to be fully executed whether or not the branch is taken.
 The objective is to place useful instructions in these slots.
 The effectiveness of the delayed branch approach depends on how often it is possible to
reorder instructions.
