
Department of Artificial Intelligence and Data Science

UNIT: IV

Processor

Syllabus

Instruction Execution - Building a Data Path - Designing a Control Unit - Hardwired Control, Microprogrammed
Control - Pipelining - Data Hazard - Control Hazards.

Instruction Execution

• Let us see how an instruction is executed. The complete instruction cycle involves three operations: instruction
fetching, opcode decoding and instruction execution.

• Fig. 7.1.1 shows the basic instruction cycle. After each instruction cycle, the central processing unit checks for any valid
interrupt request. If there is one, the central processing unit fetches the instructions of the interrupt service routine and, after
completing the interrupt service routine, resumes the instruction cycle from where it was
interrupted.

Fig. 7.1.2 shows instruction cycle with interrupt cycle.

Instruction fetch cycle: In this cycle, the instruction is fetched from the memory location whose address is in the PC.
This instruction is placed in the Instruction Register (IR) in the processor.

Instruction decode cycle: In this cycle, the opcode of the instruction stored in the instruction register is
decoded/examined to determine which operation is to be performed.

Instruction execution cycle: In this cycle, the specified operation is performed by the processor. This often involves
fetching operands from memory or from processor registers, performing an arithmetic or logical operation, and
storing the result in the destination location. During instruction execution, the PC contents are incremented to point
to the next instruction. After completion of execution of the current instruction, the PC contains the address of the
next instruction and a new instruction fetch cycle can begin.
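The three-phase cycle above can be sketched as a small Python simulation of a toy machine. The instruction names and the single accumulator register here are hypothetical, chosen only to illustrate the fetch-increment-decode-execute loop; they are not the MIPS formats discussed later in this unit.

```python
# A minimal sketch of the fetch-decode-execute cycle for a toy machine.
def run(memory, num_steps):
    """Simulate a toy processor: each memory word is (opcode, operand)."""
    pc = 0                             # program counter
    acc = 0                            # a single accumulator register
    for _ in range(num_steps):
        opcode, operand = memory[pc]   # fetch: read instruction at PC
        pc += 1                        # increment PC to the next instruction
        if opcode == "LOAD":           # decode and execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "JUMP":
            pc = operand               # a branch overwrites the PC
    return acc

program = [("LOAD", 10), ("ADD", 5), ("ADD", 5)]
print(run(program, 3))   # 20
```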

SubCode:CS3351 Subject Name:Digital Principles and Computer Organisation



Building a Data Path

AU: Dec.-14, May-15

• As shown in Fig. 7.3.2, the MIPS implementation includes datapath elements (units used to operate on or hold
data within a processor) such as the instruction and data memories, the register file, the ALU, and adders.

• Fig. 7.3.1 shows the combination of the three elements (instruction memory, program counter and adder) from Fig.
7.3.2 to form a datapath that fetches instructions and increments the PC to obtain the address of the next sequential
instruction.

• The instruction memory stores the instructions of a program and outputs the instruction corresponding to
the address specified by the program counter. The adder is used to increment the PC by 4 to obtain the address of the next
instruction.

• Since the instruction memory is only read, the output at any time reflects the contents of the location specified by
the address input, and no read control signal is needed.

• The program counter is a 32-bit register that is written at the end of every clock cycle and thus does not need a
write control signal.

• The adder always adds its two 32-bit inputs and places the sum on its output.

Datapath Segment for Arithmetic - Logic Instructions

• The arithmetic-logic instructions read operands from two registers, perform an ALU operation on the contents of
the registers, and write the result to a register. We call these instructions R-type instructions. This instruction class
includes add, sub, AND, OR, and slt. For example, OR $t1, $t2, $t3 reads $t2 and $t3, performs a logical OR operation
and saves the result in $t1.


• The processor's 32 general-purpose registers are stored in a structure called a register file. A register file is a
collection of registers in which any register can be read or written by specifying the number of the register in the file.
The register file contains the register state of the computer.

• Fig. 7.3.2 shows the multiport register file (two read ports and one write port) and the ALU. We
know that the R-format instructions have three register operands: two source operands and one
destination operand.

• For each data word to be read from the register file, we need to specify the register number to the register file. On
the other hand, to write a data word, we need two inputs: One to specify the register number to be written and one
to supply the data to be written into the register.

• The register file always outputs the contents of whatever register numbers are on the Read register inputs. Write
operations, however, are controlled by the write control (RegW) signal, which is asserted for a write operation at
the clock edge.

• Since writes to the register file are edge-triggered, it is possible to perform read and write operation for the same
register within a clock cycle: The read operation gives the value written in an earlier clock cycle, while the value
written will be available to a read in a subsequent clock cycle.

• As shown in Fig. 7.3.2, the register number inputs are 5 bits wide to specify one of 32 registers, whereas the data
input and two data output buses are each 32 bits wide.
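The register-file behaviour described above can be sketched in Python: reads are combinational and always reflect the current contents, while writes take effect only when the write control signal is asserted at the clock edge. The class and method names here are illustrative, not part of any real hardware description.

```python
# A sketch of the register file: combinational reads, edge-triggered writes.
class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32           # 32 general-purpose registers

    def read(self, reg1, reg2):
        """Two read ports: always return current contents."""
        return self.regs[reg1], self.regs[reg2]

    def clock_edge(self, reg_write, write_reg, write_data):
        """One write port: write happens only when RegW is asserted."""
        if reg_write:
            self.regs[write_reg] = write_data

rf = RegisterFile()
rf.clock_edge(True, 8, 42)    # write 42 into register 8
print(rf.read(8, 0))          # (42, 0)
rf.clock_edge(False, 8, 99)   # RegW deasserted: contents unchanged
print(rf.read(8, 0))          # (42, 0)
```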

Datapath Segment for Load Word and Store Word Instructions

• Now, consider the MIPS load word and store word instructions, which have the general form lw $t1,
offset_value($t2) or sw $t1, offset_value($t2).

• In these instructions $t1 is a data register and $t2 is a base register. The memory address is computed by adding the
contents of the base register ($t2) to the 16-bit signed offset value specified in the instruction.

• In case of store instruction, the value from the data register ($t1) must be read and in case of load instruction, the
value read from memory must be written into the data register ($t1). Thus, we will need both the register file and the
ALU from Fig. 7.3.2.

• We know that the offset value is 16 bits and the base register contents are 32 bits. Thus, we need a sign-extend unit to
convert the 16-bit offset field in the instruction to a 32-bit signed value so that it can be added to the base register.


• In addition to sign extend unit, we need a data memory unit to read from or write to. The data memory has read
and write control signals to control the read and write operations. It also has an address input, and an input for the
data to be written into memory. Fig. 7.3.3 shows these two elements.

• Sign extension is implemented by replicating the high-order sign bit of the original data item in the high-order bits
of the larger, destination data item.
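The sign-extension rule above (replicate the high-order sign bit into the upper bits) can be sketched directly on bit patterns:

```python
# 16-bit to 32-bit sign extension: copy the sign bit into the upper 16 bits.
def sign_extend_16_to_32(value16):
    """value16 is an unsigned 16-bit pattern; return the 32-bit pattern."""
    if value16 & 0x8000:                 # sign bit set: negative offset
        return value16 | 0xFFFF0000      # fill the upper 16 bits with 1s
    return value16                       # positive: upper bits stay 0

print(hex(sign_extend_16_to_32(0x0004)))  # 0x4
print(hex(sign_extend_16_to_32(0xFFFC)))  # 0xfffffffc (i.e., -4)
```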

• Therefore, two units needed to implement loads and stores, in addition to the register file and ALU of Fig. 7.3.2, are
the data memory unit and the sign extension unit.

Datapath Segment for Branch Instruction

• The beq instruction has three operands: two registers that are compared for equality, and a 16-bit offset which is
used to compute the branch target address relative to the branch instruction address. It has the general form beq $t1,
$t2, offset.

• To implement this instruction, it is necessary to compute the branch target address by adding the sign-extended
offset field of the instruction to the PC. The two important things in the definition of branch instructions which need
careful attention are:

• The instruction set architecture specifies that the base for the branch address calculation is the address of the
instruction following the branch (i.e., PC + 4, the address of the next instruction).

• The architecture also states that the offset field is shifted left 2 bits so that it is a word offset; this shift increases
the effective range of the offset field by a factor of 4.

• Therefore, the branch target address is given by

Branch target address = (PC + 4) + (offset shifted left 2 bits)

• In addition to computing the branch target address, we must also see whether the two operands are equal or not. If
two operands are not equal the next instruction is the instruction that follows sequentially (PC= PC+4); in this case,
we say that the branch is not taken. On the other hand, if two operands are equal (i.e., condition is true), the branch
target address becomes the new PC, and we say that the branch is taken.

• Thus, the branch datapath must perform two operations: compute the branch target address and compare the
register contents.
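The target-address calculation can be sketched numerically, assuming 32-bit wraparound arithmetic: target = PC + 4 + (sign-extended offset << 2).

```python
# A sketch of the branch target address calculation for beq.
MASK32 = 0xFFFFFFFF

def sign_extend16(value16):
    """Interpret a 16-bit pattern as a signed integer."""
    return value16 - 0x10000 if value16 & 0x8000 else value16

def branch_target(pc, offset16):
    """target = PC + 4 + (sign-extended word offset shifted left 2 bits)."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & MASK32

print(hex(branch_target(0x1000, 3)))       # 0x1010: branch 3 words forward
print(hex(branch_target(0x1000, 0xFFFF)))  # 0x1000: offset -1, one word back
```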


• Fig. 7.3.5 shows the structure of the datapath segment that handles branches.


• To compute the branch target address, the branch datapath includes a sign extension unit, shifter and an adder.

• To perform the compare, we need to use the register file and the ALU shown in Fig. 7.3.2.

• Since the ALU provides a Zero signal that indicates whether the result is 0, we can send the two register operands
to the ALU with the control set to perform a subtraction. If the Zero signal is asserted, we know that the two values
are equal.

• For the jump instruction, the lower 28 bits of the PC are replaced by the lower 26 bits of the instruction shifted left by 2 bits,
making the two LSBs 0. This can be implemented by simply concatenating 00 to the jump offset.
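The jump-address formation described above can be sketched as bit manipulation: the upper 4 bits come from PC + 4, and the lower 28 bits are the 26-bit target field with 00 concatenated. The example address values are illustrative.

```python
# A sketch of forming the jump target address from the 26-bit target field.
def jump_target(pc, target26):
    upper4 = (pc + 4) & 0xF0000000          # keep the upper 4 bits of PC + 4
    return upper4 | ((target26 & 0x3FFFFFF) << 2)   # lower 28 bits: target << 2

print(hex(jump_target(0x00400000, 0x100040)))  # 0x400100
```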

• In the MIPS instruction set, branches are delayed, meaning that the instruction immediately following the branch is
always executed, independent of whether the branch condition is true or false. When the condition is false, the
execution looks like a normal branch. When the condition is true, a delayed branch first executes the instruction
immediately following the branch in sequential instruction order before jumping to the specified branch target
address.

Creating a Single Datapath

• We can combine the datapath components needed for the individual instruction classes into a single datapath and
add the control to complete the implementation.

• This simplest datapath will attempt to execute all instructions in one clock cycle. This means that no datapath
resource can be used more than once per instruction, so any element needed more than once must be duplicated.
We therefore need a memory for instructions separate from the one for data. Some functional units need to be
duplicated, while many of the elements can be shared by different instruction flows.

• To share a datapath element between two different instruction classes, we allow multiple connections to
the input of an element and use a multiplexer and a control signal to select among the multiple inputs.

Designing a Control Unit

The ALU Control

• The MIPS ALU defines the following six combinations of four control inputs:

• Depending on the instruction class, the ALU will need to perform one of the first five functions. (The NOR function
is needed for other parts of the MIPS instruction set; it is not included in the subset we are implementing.)

• In case of load word and store word instructions, we use the ALU to compute the memory address by addition.

• In case of branch equal, the ALU must perform a subtraction.


• We can control the operation of the ALU by the 4-bit ALU control input and the 2-bit ALUOp field. The 2-bit ALUOp is
interpreted as shown in Table 7.4.1.

• Table 7.4.2 shows how to set the ALU control inputs based on the 2-bit ALUOp control and the 6-bit function
code.

• Here, a multiple-level decoding technique is used.

Advantages of using multiple levels of decoding

• It reduces the size of the main control unit.

• Use of several smaller control units may also potentially increase the speed of the control unit.

• Table 7.4.3 shows how the 4-bit ALU control is set depending on these two input fields: the 6-bit function field and
the 2-bit ALUOp field.


• Once the truth table has been constructed, it can be optimized and can be implemented using logic gates.

Designing the Main Control Unit

• Before looking at the rest of the control design, it is useful to review the formats of the three instruction classes:
the R-type, branch and load-store instructions. Fig. 7.4.1 shows these formats.

• Format for R-format instructions: Opcode is 0. These instructions have three register operands: rs, rt, and rd.
Fields rs and rt are sources, and rd is the destination. The funct (Function) field is an ALU function discussed in the
previous section. The shamt field is used only for shifts.

Format for load and store instructions: Load (opcode is 35) or store (opcode is 43). The register rs is the base register that is
added to the 16-bit address field to form the memory address. For loads, rt is the destination register for the
loaded value. For stores, rt is the source register whose value should be stored into memory.

Format for branch equal: Opcode is 4. The registers rs and rt are the source registers that are compared for
equality. The 16-bit address field is sign-extended, shifted, and added to PC + 4 to compute the branch target
address.

Important observations about this instruction format


• Bits 31:26 in the instruction format are the op field and give the opcode (operation code). We will refer to this field as
Op[5:0].

• Bits 25:21 and 20:16 in the instruction format always specify the rs and rt fields, respectively.

• Bits 25:21 always give the base register (rs) for load and store instructions.

• Bits 15:0 give the 16-bit offset for branch equal, load, and store.

• The destination register is in one of two places. For a load it is in bit positions 20:16 (rt), while for an R-type
instruction it is in bit positions 15:11 (rd). Thus, we will need to add a multiplexer to select which field of the
instruction is used.
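The field positions listed above can be sketched as a small decoder that slices a 32-bit instruction word. The example encoding is for add $t1, $t2, $t3 (registers $t2 = 10, $t3 = 11, $t1 = 9).

```python
# A sketch of extracting MIPS instruction fields from a 32-bit word.
def decode_fields(instr):
    return {
        "op":     (instr >> 26) & 0x3F,   # bits 31:26
        "rs":     (instr >> 21) & 0x1F,   # bits 25:21
        "rt":     (instr >> 16) & 0x1F,   # bits 20:16
        "rd":     (instr >> 11) & 0x1F,   # bits 15:11 (R-type only)
        "funct":  instr & 0x3F,           # bits 5:0   (R-type only)
        "offset": instr & 0xFFFF,         # bits 15:0  (I-type only)
    }

# add $t1, $t2, $t3 encodes as 0x014B4820 (rs=10, rt=11, rd=9, funct=0x20)
fields = decode_fields(0x014B4820)
print(fields["rs"], fields["rt"], fields["rd"])  # 10 11 9
```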

• From the above information, we can add the instruction labels and extra multiplexer (for the Write register
number input of the register file) to the simple datapath. Fig. 7.4.2 shows these additions plus the ALU control
block, the write signals for state elements, the read signal for the data memory, and the control signals for the
multiplexers. Since all the multiplexers have two inputs, they each require a single control line.

• Fig. 7.4.2 shows seven single-bit control lines (RegDst, RegW, ALUSrc, MemW, MemR, PCSrc and MemtoReg) plus
the 2-bit ALUOp control signal.

• Table 7.4.4 describes the function of single-bit control lines.


• These nine control signals (seven single-bit control lines and the 2-bit ALUOp control signal) can be set
according to six input signals to the control unit, namely the opcode bits 31 to 26. Fig. 7.4.3 shows the datapath
with the control unit and the control signals. [Refer Fig. 7.4.3 on next page]

• As shown in Fig. 7.4.3, the input to the control unit is the 6-bit opcode field from the instruction.

Hardwired Control

• In the hardwired control, the control units use fixed logic circuits to interpret instructions and generate control
signals from them.


• The fixed logic circuits use the contents of the control step counter, the contents of the instruction register, the
condition code flags and external input signals such as MFC and interrupt requests to generate the control signals.

• Fig. 7.5.1 shows the typical hardwired control unit. Here, the fixed logic circuit block includes combinational circuit
(decoder and encoder) that generates the required control outputs, depending on the state of all its inputs.

• By separating the decoding and encoding functions, we can draw a more detailed block diagram for the hardwired control
unit, as shown in Fig. 7.5.2.

• The instruction decoder decodes the instruction loaded in the IR. If the IR is an 8-bit register, then the instruction
decoder generates 2^8, i.e. 256, output lines: one for each instruction. According to the code in the IR,
only one line amongst all the output lines of the decoder goes high, i.e. is set to 1, and all other lines are set to 0.

• The step decoder provides a separate signal line for each step, or time slot, in a control sequence. The encoder gets
its inputs from the instruction decoder, step decoder, external inputs and condition codes. It uses all these inputs to
generate the individual control signals.

• After execution of each instruction, an End signal is generated, which resets the control step counter and makes it
ready to generate the control steps for the next instruction.


• Let us see how the encoder generates the Yin signal for the single-bus processor organisation shown in Fig. 7.5.3.
The encoder circuit implements the following logic function to generate Yin:

Yin = T1 + T6 · ADD + T4 · BRANCH + ...

• The Yin signal is asserted during time interval T1 for all instructions, during T6 for an ADD instruction, during T4 for an
unconditional BRANCH instruction and so on.

• As another example, the logic function to generate the Zout signal can be given by,

Zout = T2 + T7 · ADD + T6 · BRANCH + ...

• The Zout signal is asserted during time interval T2 of all instructions, during T7 for an ADD instruction, during T6 for an
unconditional BRANCH instruction and so on.

• Fig. 7.5.3 and Fig. 7.5.4 show the hardware implementation of the logic functions for the Yin and Zout control signals.
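The encoder logic is an OR of (time-step AND instruction) terms, which can be sketched directly as a Boolean function. The step and instruction encodings below are illustrative stand-ins for the decoder outputs.

```python
# A sketch of the encoder term Yin = T1 + T6·ADD + T4·BRANCH + ...
def y_in(step, instruction):
    """True when the Yin control signal is asserted."""
    return (step == 1 or                                  # T1, any instruction
            (step == 6 and instruction == "ADD") or       # T6 · ADD
            (step == 4 and instruction == "BRANCH"))      # T4 · BRANCH

print(y_in(1, "SUB"))     # True: T1 asserts Yin for every instruction
print(y_in(6, "ADD"))     # True
print(y_in(4, "ADD"))     # False
```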


Example 7.5.1 Generate the logic circuit for the following function:

End = T7 · ADD + T5 · BR + (T5 · N + T4 · N') · BRN + ...

Solution: Fig. 7.5.5 shows the circuit that generates the End control signal from this logic function.

Advantages of hardwired control unit

• Hardwired control unit is fast because control signals are generated by combinational circuits.


• The delay in generation of control signals depends upon the number of gates.

• It has greater chip-area efficiency, since it uses less area on-chip.

Disadvantages of hardwired control unit

• The more control signals the CPU requires, the more complex the design of the control unit becomes.

• Modifications to the control signals are very difficult; they require rearranging the wires in the hardware circuit.

• It is difficult to correct a mistake in the original design or to add a new feature to the existing design of the control unit.

Microprogrammed Control

• Every instruction in a processor is implemented by a sequence of one or more sets of concurrent microoperations.
Each microoperation is associated with a specific set of control lines which, when activated, causes that
microoperation to take place.

• Since the number of instructions and control lines is often in the hundreds, the complexity of hardwired control unit
is very high. Thus, it is costly and difficult to design.

• Furthermore, the hardwired control unit is relatively inflexible because it is difficult to change the design if one
wishes to correct a design error or modify the instruction set.

Comparison Between Hardwired and Microprogrammed Control Units

Pipelining


• We have seen the various cycles involved in the instruction cycle. The fetch, decode and execute cycles for several
instructions are performed simultaneously to reduce the overall processing time. This process is referred to as instruction
pipelining.

• To apply the concept of instruction pipelining, we must subdivide instruction processing into a number of stages as
given below.

S1 - Fetch (F): Read instruction from the memory.

S2 - Decode (D): Decode the opcode and fetch source operand(s) if necessary.

S3 - Execute (E): Perform the operation specified by the instruction.

S4 - Store (S): Store the result in the destination.

• Here, instruction processing is divided into four stages; hence it is known as a four-stage instruction pipeline. With
this subdivision, and assuming equal duration for each stage, we can reduce the execution time for 4 instructions
from 16 time units to 7 time units. This is illustrated in Fig. 7.8.1.
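The 16-to-7 claim follows from the standard pipeline timing formula: with k stages of equal duration, n instructions take k + (n - 1) cycles in a pipeline instead of k × n sequentially.

```python
# A sketch of pipeline timing: k-stage pipeline, n instructions.
def sequential_cycles(k, n):
    return k * n              # each instruction takes all k stages in turn

def pipelined_cycles(k, n):
    return k + (n - 1)        # first result after k cycles, then one per cycle

print(sequential_cycles(4, 4))  # 16 time units
print(pipelined_cycles(4, 4))   # 7 time units
```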

• In this instruction pipelining, four instructions are in progress at any given time. This means that four distinct
hardware units are needed, as shown in Fig. 7.8.2. These units are implemented such that they are capable of
performing their tasks simultaneously and without interfering with one another. Information from one stage is
passed to the next stage with the help of buffers.


Example 7.8.1 Explain the function of a six segment pipeline and draw a space diagram for a six segment pipeline
showing the time it takes to process eight tasks. AU May-07, Marks 8

Solution: Six stages in the pipeline :

1) Fetch Instruction (FI): Read the next expected instruction into a buffer.

2) Decode Instruction (DI): Determine the opcode and the operand specifiers.

3) Calculate Operands (CO): Calculate the effective address of each source operand.

4) Fetch Operands (FO): Fetch each operand from memory.

5) Execute Instruction (EI): Perform the indicated operation and store the result, if any in the specified destination
operand location.

6) Write Operand (WO): Store the result in memory.

Example 7.8.2 What is the ideal speed-up expected in a pipelined architecture with 'n' stages? Justify your
answer. AU May-07, Marks 2

Solution: The pipelined processor ideally completes the processing of one instruction in each clock cycle, which
means that the rate of instruction processing with an n-stage pipeline is n times that of sequential operation.
Therefore, the ideal speed-up factor is n. However, such ideal performance is achieved only when the
pipeline stages complete their processing tasks for a given instruction in the time allotted. Unfortunately, this
is not always the case; pipeline operation cannot be sustained without interruption throughout program execution.
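The ideal speed-up can also be justified numerically: with n stages and m instructions, the speed-up is (n × m) / (n + m - 1), which approaches n as the number of instructions m grows large. The stage and instruction counts below are illustrative.

```python
# A sketch of the pipeline speed-up factor for n stages and m instructions.
def speedup(n_stages, m_instructions):
    sequential = n_stages * m_instructions      # n cycles per instruction
    pipelined = n_stages + (m_instructions - 1) # fill once, then 1 per cycle
    return sequential / pipelined

print(round(speedup(5, 10), 2))      # 3.57
print(round(speedup(5, 1000), 2))    # 4.98, approaching the ideal factor 5
```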

Pipeline Stages in the MIPS Instructions

1. Fetch instruction from memory.

2. Read registers while decoding the instruction. The regular format of MIPS instructions allows reading and
decoding to occur simultaneously.


3. Execute the operation or calculate an address.

4. Access an operand in data memory.

5. Write the result into a register.

Designing Instruction Sets for Pipelining

• From the following points we can see that the MIPS instruction set is designed for pipelined execution.

1. All MIPS instructions are the same length. This restriction makes it much easier to fetch instructions in the first
pipeline stage and to decode them in the second stage.

Pipeline Hazards

• The timing diagram for instruction pipeline operation shown in Fig. 7.8.3 (b) completes the processing of one
instruction in each clock cycle. This means that the rate of instruction processing is four times that of sequential
operation.

• The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages.
However, this increase would be achieved only if pipelined operation shown in Fig. 7.8.3 (b) could be performed
without any interruption throughout program execution. Unfortunately, this is not the case.

• For a variety of reasons, one of the pipeline stages may not be able to complete its operation in the allotted time.

• Fig. 7.8.4 shows an example in which the operation specified in instruction 2 requires three cycles to complete,
from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the information in buffer B2 must remain intact until the
instruction execution stage has completed its operation. This means that stage 2, and in turn stage 1, are blocked
from accepting new instructions because the information in B1 cannot be overwritten. Thus the decode step for
instruction 4 and the fetch step for instruction 5 must be postponed as shown in Fig. 7.8.4.

• The instruction pipeline shown in Fig. 7.8.4 is said to have been stalled for two clock cycles (clock cycles 5 and 6)
and normal pipeline operation resumes in clock cycle 7.

• Any reason that causes the pipeline to stall is called a hazard.


Types of Hazards

1. Structural hazards: These hazards arise from conflicts due to insufficient resources, when even with all
possible combinations it may not be possible to overlap the operations.

2. Data or data-dependent hazards: These result when an instruction in the pipeline depends on the result of previous
instructions which are still in the pipeline and not yet completed.

3. Instruction or control hazards: These arise while pipelining branch and other instructions that change the
contents of the program counter. The simplest way to handle these hazards is to stall the pipeline. Stalling the
pipeline allows some instructions to proceed to completion while stopping the execution of those that would result in
hazards.

Structural Hazards

• The performance of a pipelined processor depends on whether the functional units are themselves pipelined and whether
there are multiple execution units to allow all possible combinations of instructions in the pipeline. If, for some
combination, the pipeline has to be stalled to avoid resource conflicts, then there is a structural hazard.

• In other words, we can say that when two instructions require the use of a given hardware resource at the same
time, the structural hazard occurs.

Data Hazards

• When either the source or the destination operands of an instruction are not available at the time expected in
the pipeline and as a result pipeline is stalled, we say such a situation is a data hazard.

• Consider a program with two instructions, I1 followed by I2. When this program is executed in a pipeline, the
execution of these two instructions can be performed concurrently. In such a case, the result of I1 may not be
available for the execution of I2. If the result of I2 depends on the result of I1, we may get an incorrect result if both
are executed concurrently. For example, assume A = 10 in the following two operations:

I1: A ← A + 5


I2: B ← A × 2

• When these two operations are performed in the given order, one after the other, we get the result 30. But if they
are performed concurrently, the value of A used in computing B would be the original value, 10, leading to an
incorrect result. In this case the data used in I2 depend on the result of I1. The hazard due to such a situation is called a
data hazard or data-dependent hazard. To avoid incorrect results we have to execute dependent instructions one
after the other (in-order).
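The example above can be traced in code: reading A before I1's write completes yields the stale value and an incorrect B.

```python
# Tracing the data hazard example: in-order vs. premature read of A.
A = 10
A = A + 5           # I1: A becomes 15
B = A * 2           # I2 uses the updated A
print(B)            # 30, the correct in-order result

# If I2 had read A concurrently with I1, before the write completed,
# it would have used the original value:
stale_A = 10
print(stale_A * 2)  # 20, the incorrect result the hazard would cause
```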

Control (Instruction) Hazards

• The purpose of the instruction fetch unit is to supply the execution units with a steady stream of instructions.
This stream is interrupted when a pipeline stall occurs, either due to a cache miss or due to a branch instruction. Such a
situation is known as an instruction hazard.

• Instruction hazards can cause greater degradation in performance than data hazards.

Unconditional Branching

• Fig. 7.8.5 shows a sequence of instructions being executed in a two-stage pipeline. The instruction I2 is a branch
instruction and its target instruction is IK. In clock cycle 3, the instruction I3 is fetched and at the same time the branch
instruction (I2) is decoded and the target address is computed. In clock cycle 4, the incorrectly fetched instruction
I3 is discarded and instruction IK is fetched. During this time the execution unit is idle and the pipeline is stalled for one
clock cycle.

Branch Penalty: The time lost as a result of a branch instruction is often referred to as the branch penalty.

Factors affecting the branch penalty

1. It is more for complex instructions.

2. For a longer pipeline, the branch penalty is more.

• In case of longer pipelines, the branch penalty can be reduced by computing the branch address earlier in the
pipeline.


Handling Control Hazards

Instruction Queue and Prefetching

• To reduce the effect of cache miss or branch penalty, many processors employ sophisticated fetch units that can
fetch instructions before they are needed and put them in a queue. This is illustrated in Fig. 7.11.1.

• A separate unit called dispatch unit takes instructions from the front of the queue and sends them to execution
unit. It also performs the decoding function.

• The fetch unit attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays
when fetching instructions during cache miss.

• In case of data hazard, the dispatch unit is not able to issue instructions from the instruction queue. However, the
fetch unit continues to fetch instructions and add them to the queue.

Use of instruction queue during branch instruction

• Fig. 7.11.2 shows instruction time line. It also shows how the queue length changes over the clock cycles. Every
fetch operation adds one instruction to the queue and every dispatch unit operation reduces the queue length by
one. Hence, the queue length remains the same for the first four clock cycles.

• One of the instructions has a 2-cycle stall. In these two cycles the fetch unit adds two instructions but the dispatch unit does not
issue any instruction. Due to this, the queue length rises to 3 in clock cycle 6.


• Since I5 is a branch instruction, instruction I6 is discarded and the target instruction of I5, IK, is fetched in cycle 7.
Since I6 is discarded, there would normally be a stall in cycle 7; however, here instruction I4 is dispatched from the
queue to the decoding stage.

• After discarding I6, the queue length drops to 1 in cycle 8. The queue length remains one until another stall is
encountered. In this example, instructions I1, I2, I3, I4 and IK complete execution in successive clock cycles. Hence, the
branch instruction does not increase the overall execution time.

Branch Folding

• The technique in which instruction fetch unit executes the branch instruction (by computing the branch address)
concurrently with the execution of other instructions is called branch folding.

• Branch folding occurs only if there exists at least one instruction in the queue other than the branch instruction,
at the time a branch instruction is encountered.

• In case of cache miss, the dispatch unit can send instructions for execution as long as the instruction queue is not
empty. Thus, instruction queue also prevents the delay that may occur due to cache miss.

Approaches to Deal with Conditional Branching

• The conditional branching is a major factor that affects the performance of instruction pipelining. There are
several approaches to deal with conditional branching.

• These are:

• Multiple streams

• Prefetch branch target

• Loop buffer

• Delayed branch

• Branch prediction.


Multiple streams

• We know that a simple pipeline suffers a penalty for a branch instruction. To avoid this, this approach uses two streams to store the fetched instructions. One stream stores the instructions that follow the conditional branch instruction; it is used when the branch is not taken.

• The other stream stores the instructions starting at the branch target address; it is used when the branch is taken.

• Drawbacks :

• Due to multiple streams there are contention delays for access to the registers and to memory.

• Each additional branch instruction entering the pipeline needs a further stream.

Prefetch branch target

• In this approach, when a conditional branch is recognized, the target of the branch is prefetched in addition to the instruction following the branch. The prefetched target is then used if the branch is taken.

Loop buffer

• A loop buffer is a small, very high-speed memory used to store the most recently prefetched instructions in sequence. When a conditional branch is taken, the hardware first checks whether the branch target is within the buffer. If so, the next instructions are fetched from the buffer instead of memory, avoiding the memory access.

• Advantages:

• When a branch is taken to a target already in the buffer, instructions are fetched from the buffer, saving memory access time. This is useful for loop sequences.

• If a branch occurs to a target just a few locations ahead of the address of the branch instruction, the target may already be in the buffer. This is useful for IF-THEN-ELSE-ENDIF sequences.
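The hit check performed by the hardware can be sketched as follows (a hypothetical model: the addresses, buffer size and word size are illustrative, not taken from any specific machine):

```python
# Loop-buffer hit check: the buffer holds the n most recently prefetched
# sequential instructions; a taken branch hits if its target address
# falls inside that window of addresses.
def loop_buffer_hit(buffer_start, buffer_words, target, word_size=4):
    """All addresses are byte addresses; the buffer holds buffer_words
    instructions of word_size bytes each."""
    return buffer_start <= target < buffer_start + buffer_words * word_size

# A backward branch to the top of a short loop held in a 64-word buffer:
print(loop_buffer_hit(0x1000, 64, 0x1008))   # True -> fetch from the buffer
print(loop_buffer_hit(0x1000, 64, 0x2000))   # False -> go to memory
```
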

Delayed branch

• The location following a branch instruction is called a branch delay slot. There may be more than one branch
delay slot, depending on the time it takes to execute a branch instruction.

• There are three ways to fill the delay slot :

1. The delay slot is filled with an independent instruction taken from before the branch. In this case performance always improves.

2. The delay slot is filled from the branch target instructions. Performance improves only if the branch is taken.

3. The delay slot is filled with one of the fall-through instructions. Performance improves only if the branch is not taken.

• This technique is called delayed branching.
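The trade-off between the three filling strategies can be sketched numerically (a hypothetical back-of-the-envelope model, assuming a single delay slot and a 1-cycle penalty whenever the slot instruction turns out to be useless):

```python
# Expected branch penalty (cycles per branch) for one delay slot under the
# three filling strategies, given the probability p_taken that the branch
# is taken. A wasted slot costs one cycle; a useful slot costs nothing.
def expected_penalty(strategy, p_taken):
    if strategy == "independent":    # slot always does useful work
        return 0.0
    if strategy == "from_target":    # useful only when the branch is taken
        return 1.0 - p_taken
    if strategy == "fall_through":   # useful only when it is not taken
        return p_taken
    raise ValueError(strategy)

# For a branch taken 60 % of the time, filling from the target wastes
# fewer cycles on average than filling from the fall-through path.
print(expected_penalty("from_target", 0.6) < expected_penalty("fall_through", 0.6))
```
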

Branch Prediction

• Prediction techniques can be used to guess whether a branch will be taken or not taken. These techniques reduce the branch penalty.

• The common prediction techniques are:

• Predict never taken


• Predict always taken

• Predict by opcode

• Taken/Not taken switch

• Branch history table

• In the first two approaches, if the prediction is wrong, prefetching down the wrong path may cause a page fault or protection violation. The processor then halts prefetching and fetches the instruction from the correct address.

• In the third prediction technique, the prediction decision is based on the opcode of the branch instruction. The processor assumes that the branch will be taken for certain branch opcodes and not for others.

• The fourth and fifth prediction techniques are dynamic; they depend on the execution history of previously executed conditional branch instructions.

Branch Prediction Strategies

• There are two types of branch prediction strategies :

• Static branch strategy

• Dynamic branch strategy.

Static Branch Strategy: In this strategy the branch is predicted statically, based on the branch instruction type. That is, the probability of a branch being taken for a particular branch instruction type is used to predict the branch. This strategy may not produce accurate results every time.


Dynamic Branch Strategy: This strategy uses recent branch history during program execution to predict whether or not the branch will be taken the next time it occurs. The recent branch information includes branch prediction statistics such as:

T: Branch taken

N: Not taken

NN: Last two branches not taken

NT: Last branch not taken and previous taken

TT: Last two branches taken

TN: Last branch taken and previous not taken

• The recent branch information is stored in the buffer called Branch Target Buffer (BTB).

• Along with above information branch target buffer also stores the address of branch target.

• Fig. 7.11.4 shows the organization of branch target buffer.

• Fig. 7.11.5 shows a typical state diagram used in dynamic branch prediction.

• This state diagram allows backtracking of last two instructions in a given program. The branch target buffer entry
contains the backtracking information which guides the prediction.

• The prediction information is updated upon completion of the current branch.
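The four-state behaviour of Fig. 7.11.5 is, in essence, a 2-bit saturating counter. A minimal sketch follows (assuming states 0-1 predict not-taken and states 2-3 predict taken; the exact encoding in the figure may differ):

```python
# 2-bit saturating-counter branch predictor: two consecutive mispredictions
# are needed to change a "strong" prediction, which tolerates the single
# not-taken outcome at the end of each pass through a loop.
class TwoBitPredictor:
    def __init__(self, state=0):
        self.state = state           # 0 = strongly not-taken ... 3 = strongly taken

    def predict(self):
        return self.state >= 2       # True means "predict taken"

    def update(self, taken):
        # Move toward 3 on a taken branch, toward 0 on a not-taken branch.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor(state=3)                 # start strongly taken
outcomes = [True, False, True, True]         # one not-taken among takens
predictions = []
for taken in outcomes:
    predictions.append(p.predict())
    p.update(taken)
print(predictions)   # [True, True, True, True]: the single N does not flip it
```
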


• To make branch overhead zero, the branch target buffer is extended to store the target instruction itself and a few of its successor instructions. This allows processing of conditional branches with zero delay.

Example 7.11.1 The following sequence of instructions are executed in the basic 5-stage pipelined processor :

lw $1, 40($6)

add $6, $2, $2

sw $6, 50($1)

Indicate dependence and their type. Assuming there is no forwarding in this pipelined processor, indicate hazards
and add NOP instructions to eliminate them. AU: Dec.-18, Marks 6

Solution: a) I1: lw $1, 40($6) : RAW dependency on $1 from I1 to I3

I2: add $6, $2, $2 : RAW dependency on $6 from I2 to I3

I3: sw $6, 50($1) : WAR dependency on $6 from I1 to I2

b) In the basic five-stage pipeline the WAR dependency does not cause any hazard. With no forwarding, the RAW dependences cause hazards even when the register write happens in the first half of the clock cycle and the register read in the second half: the dependent instruction's ID stage must not come earlier than the producing instruction's WB stage. The code that eliminates these hazards by inserting nop instructions is:

lw $1, 40($6)

add $6, $2, $2

nop ; delay I3 to avoid the RAW hazard on $1 from I1

nop ; and the RAW hazard on $6 from I2

sw $6, 50($1)
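The stall counts can be cross-checked with a small helper (a hypothetical model of the timing assumed above: the instruction at position i has its ID in cycle i + 1 and its WB in cycle i + 4, and a dependent ID may share a cycle with the producer's WB):

```python
# Number of nops needed between a producer and a consumer in the classic
# 5-stage pipeline with no forwarding, assuming the register write occurs
# in the first half of WB and the register read in the second half of ID.
def nops_needed(producer_pos, consumer_pos):
    """Positions are 1-based indices in the instruction stream.
    Producer's WB is in cycle producer_pos + 4; consumer's ID must be in
    that cycle or later, i.e. consumer_pos + 1 + nops >= producer_pos + 4."""
    return max(0, (producer_pos + 4) - (consumer_pos + 1))

print(nops_needed(1, 3))   # $1: lw (I1) -> sw (I3) needs 1 nop
print(nops_needed(2, 3))   # $6: add (I2) -> sw (I3) needs 2 nops (the binding case)
```
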

Example 7.11.2 A processor has five individual stages, namely IF, ID, EX, MEM and WB, and their latencies are 250 ps, 350 ps, 150 ps, 300 ps and 200 ps respectively. The frequencies of the instructions executed by the processor are as follows: ALU: 40 %, branch: 25 %, load: 20 % and store: 15 %. What is the clock cycle time in a pipelined and a non-pipelined processor? If you can split one stage of the pipelined datapath into two new stages, each with half the latency of the original stage, which stage would you split and what is the new clock cycle time of the processor? Assuming there are no stalls or hazards, what is the utilization of the data memory? Assuming there are no stalls or hazards, what is the utilization of the write-register port of the "Registers" unit? AU: Dec.-18, Marks 6

Solution: a) Clock cycle time in a pipelined processor = 350 ps (the latency of the slowest stage, ID)

Clock cycle time in non-pipelined processor


= 250 ps +350 ps + 150 ps + 300 ps + 200 ps= 1250 ps

b) We split the stage of the pipelined datapath that has the maximum latency, i.e. ID (350 ps).

After splitting the ID stage into ID1 = 175 ps and ID2 = 175 ps, the slowest remaining stage is MEM (300 ps), so the new clock cycle time of the processor is 300 ps.

c) Assuming there are no stalls or hazards, the utilization of the data memory (accessed by loads and stores) = 20 % + 15 % = 35 %.

d) Assuming there are no stalls or hazards, the utilization of the write-register port of the "Registers" unit = 40 % + 20 % = 60 %, since only ALU instructions and loads write a register; branches and stores do not.
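The arithmetic above can be verified with a short script (latencies and instruction-mix percentages taken from the problem statement):

```python
# Pipeline timing and port utilization for Example 7.11.2.
stages = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}   # ps
mix = {"alu": 40, "branch": 25, "load": 20, "store": 15}            # percent

pipelined_cycle = max(stages.values())       # slowest stage sets the clock
nonpipelined_cycle = sum(stages.values())    # all stages in one long cycle

# Split the slowest stage (ID, 350 ps) into two 175 ps halves;
# MEM (300 ps) then becomes the slowest stage.
split = {**stages, "ID": 175, "ID2": 175}
new_cycle = max(split.values())

mem_utilization = mix["load"] + mix["store"]   # data memory: loads + stores
wport_utilization = mix["alu"] + mix["load"]   # write port: ALU ops + loads

print(pipelined_cycle, nonpipelined_cycle, new_cycle,
      mem_utilization, wport_utilization)      # 350 1250 300 35 60
```
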
