• Let Fi and Ei refer to the fetch and execute steps for instruction Ii
[Figure: (a) sequential execution vs. (b) hardware organization of a two-stage pipeline]
Basic Idea of Instruction Pipelining (Two-Stage)
[Figure: clock-cycle timing of pipelined execution, with the fetch and execute steps of successive instructions overlapped]
Sequential vs. pipelined execution (k = 2 stages, n = 4 instructions):
• Sequential: k × n = 2 × 4 = 8 clock cycles
• Pipelined: k + n − 1 = 2 + 4 − 1 = 5 clock cycles
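These counts follow directly from the overlap: the first instruction needs all k stages, and each remaining instruction completes one cycle later. A minimal sketch in Python (the function names are mine, not from the slides):

    # Clock cycles needed by an ideal k-stage pipeline for n instructions.
    def sequential_cycles(k, n):
        return k * n          # each instruction runs start to finish alone

    def pipelined_cycles(k, n):
        return k + n - 1      # k cycles to fill the pipe, then one finish per cycle

    print(sequential_cycles(2, 4))   # 8
    print(pipelined_cycles(2, 4))    # 5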
[Figure: processing hardware divided into Segment 1, Segment 2, and Segment 3]
• The data stream passes through the first processor, and the results are stored in a memory block that is also accessible by the second processor.
• Fetch (F) : read the instruction from memory
• Decode (D) : decode the instruction and fetch the source operands
• Execute (E) : perform the operation specified by the instruction
• Write (W) : store the result in the destination location
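To see how these four stages overlap, here is a small illustrative Python snippet (my own sketch, assuming one cycle per stage and no hazards) that prints the stage occupied by each instruction in each cycle:

    # Ideal 4-stage pipeline: instruction i (0-based) is in stage s at cycle i + s.
    STAGES = ["F", "D", "E", "W"]

    def timing_table(n):
        for i in range(n):
            row = ["  "] * (n + len(STAGES) - 1)
            for s, stage in enumerate(STAGES):
                row[i + s] = stage + " "
            print("I%d: %s" % (i + 1, "".join(row).rstrip()))

    timing_table(4)
    # I1: F D E W
    # I2:   F D E W
    # I3:     F D E W
    # I4:       F D E W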
[Figure: clock-cycle timing of a pipeline stalled by a cache miss]
• Pipelined operation in the figure above is said to have been stalled for two clock cycles.
• For example, this may be a result of a miss in the cache, requiring the instruction
to be fetched from the main memory.
• Such idle periods shown in the figure are called stalls. They are also often referred
to as bubbles in the pipeline.
• Once created as a result of a delay in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.
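Extending the earlier timing sketch (again purely illustrative), a cache miss that stretches one instruction's Execute step creates exactly this kind of bubble:

    # I2's Execute takes 3 cycles (e.g. after a cache miss); the instructions
    # behind it are held up, and a 2-cycle bubble ('.') moves down the pipeline.
    rows = [
        "cycle:  1 2 3 4 5 6 7 8",
        "I1:     F D E W",
        "I2:       F D E E E W",
        "I3:         F D . . E W",
    ]
    for row in rows:
        print(row)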
• The most common case in which this hazard (a structural hazard) may arise is in access to memory.
• One instruction may need to access memory as part of the Execute and Write stage
while another instruction is being fetched.
• A data hazard is any condition in which either the source or the destination
operands of an instruction are not available at the time expected in the pipeline.
• When this program is executed in a pipeline, the execution of I2 can begin before the execution of I1 is completed. This means that the results generated by I1 may not be available for use by I2. Consider, for example, the two operations below:
A ← 3 + A
B ← 4 × A
• When these operations are performed in the order given, with A initially 5, the result is B = 32.
• But if they are performed concurrently, the value of A used in computing B would be the original value, 5, giving the incorrect result B = 20.
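The arithmetic can be checked in a few lines of Python (illustrative only): reading A before the first operation writes it back is exactly the stale read the hazard describes.

    a = 5
    stale_a = a          # I2 reads its source operand too early
    a = 3 + a            # I1 completes: A becomes 8
    print(4 * a)         # 32 - correct result, using the updated A
    print(4 * stale_a)   # 20 - incorrect result, using the stale A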
• The data hazard just described arises because one instruction, I2, is waiting for data to be written into the register file.
• However, these data are available at the output of the ALU once the Execute stage completes step E1.
• Hence, the delay can be reduced, or possibly eliminated, if we arrange for the
result of instruction I1 to be forwarded directly for use in step E2.
• Operand forwarding is used to minimize the stalls caused by data dependencies.
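A minimal sketch of the forwarding decision, under my own simplified model (none of these names come from the slides): if the instruction that just finished Execute writes the register the next instruction is about to read, the ALU output is bypassed straight to it.

    def read_operand(src, reg_file, fwd_dest, fwd_value):
        # Forwarding check: bypass the register file if the previous
        # instruction's ALU result is destined for the register we need.
        if fwd_dest == src:
            return fwd_value
        return reg_file[src]

    reg_file = {"R1": 5}
    # I1: R1 <- 3 + R1 has finished Execute but not yet Write.
    fwd_dest, fwd_value = "R1", 3 + reg_file["R1"]
    # I2: B <- 4 * R1 can start Execute immediately, without stalling.
    print(4 * read_operand("R1", reg_file, fwd_dest, fwd_value))   # 32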
• Instruction hazards have two main causes:
▪ Cache miss
▪ Branch
• Now we will see the effect of branch instructions and the techniques that can be used to mitigate their impact.
• To reduce the effect of these interruptions, many processors employ fetch units
that can fetch instructions before they are needed and put them in a queue.
• A separate unit, which we call the dispatch unit, takes instructions from the front
of the queue and sends them to the execution unit.
• To be effective, the fetch unit must have sufficient decoding and processing
capability to recognize and execute branch instructions.
• It attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays when fetching instructions.
• However, the fetch unit continues to fetch instructions and add them to the queue.
• This is because the instruction fetch unit has executed the branch instruction (by
computing the branch address) concurrently with the execution of other
instructions.
• The effectiveness of this technique is enhanced when the instruction fetch unit is
able to add more than one instruction at a time.
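As a rough software analogy for this fetch/dispatch arrangement (my own sketch, nothing here is from the slides), the fetch unit acts as a producer filling a queue while the dispatch unit consumes from its front:

    from collections import deque

    program = ["I1", "I2", "I3", "I4", "I5"]
    queue = deque()
    pc = 0

    for cycle in range(1, 5):
        for _ in range(2):               # fetch unit adds up to two instructions per cycle
            if pc < len(program):
                queue.append(program[pc])
                pc += 1
        issued = queue.popleft() if queue else None   # dispatch unit takes one
        print("cycle", cycle, "queue depth", len(queue), "dispatched", issued)

Because fetching runs ahead of dispatch, the queue builds up a cushion that can absorb occasional fetch delays.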
• The decision to branch cannot be made until the execution of the branch instruction has been completed.
• There are several ways to reduce the branch penalty associated with conditional branches and their negative impact on the rate of execution of instructions:
▪ Delayed Branch
▪ Branch Prediction
▪ Dynamic Branch Prediction
Delayed Branch
• The instructions in the delay slots are always fetched. Therefore, we would like to
arrange for them to be fully executed whether or not the branch is taken.
• The loop below has been reordered so that the shift instruction fills the branch delay slot:

LOOP    Decrement    R2
        Branch=0     LOOP
        Shift_left   R1
NEXT    Add          R1,R3
• Logically, the program is executed as if the branch instruction were placed after the shift instruction.
[Figure 8.13: Execution timing showing the delay slot being filled during the last two passes through the loop in Figure 8.12.]
Branch Prediction
• The results of the compare operation are available at the end of cycle 3.
• Assuming that they are forwarded immediately to the instruction fetch unit, the
branch condition is evaluated in cycle 4.
• At this point the instruction fetch unit realizes that the prediction was incorrect and
the two instructions in the execution pipe are purged.
• Use hardware to observe whether the target address is lower or higher than that of the branch instruction; a branch to a lower address (a backward branch, which typically closes a loop) is predicted as taken.
• In dynamic branch prediction, the processor assumes that the next time the instruction is executed, the branch decision is likely to be the same as the last time.
• Hence, the algorithm may be described by a two-state machine, whose two states are:
▪ LT: Branch is likely to be taken
▪ LNT: Branch is likely not to be taken
• This scheme requires one bit of history information for each branch instruction and works well inside program loops.
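A minimal sketch of this one-bit predictor in Python (my own illustration; only the state names LT and LNT come from the slides):

    LT, LNT = "LT", "LNT"

    def predict(state):
        return state == LT               # True: predict the branch is taken

    def update(state, taken):
        return LT if taken else LNT      # remember only the latest outcome

    state = LNT
    for taken in [True, True, True, False]:   # a loop taken 3 times, then exited
        print("predict taken?", predict(state), "actual:", taken)
        state = update(state, taken)

Inside a loop this mispredicts only twice per execution of the loop: once on the first iteration and once at the exit.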
Branch Prediction Algorithm: State Machine
• Better performance can be achieved by keeping more information about execution history.
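One common way to keep more history (a sketch of the usual extension, not necessarily the exact scheme on these slides) is a four-state machine, equivalent to a two-bit saturating counter, so that a single surprise outcome does not immediately flip the prediction:

    # Two-bit saturating counter: 0-1 predict not taken, 2-3 predict taken.
    def predict(counter):
        return counter >= 2

    def update(counter, taken):
        return min(counter + 1, 3) if taken else max(counter - 1, 0)

    counter = 3                       # strongly "taken", e.g. inside a loop
    for taken in [True, False, True, True]:
        print("predict taken?", predict(counter), "actual:", taken)
        counter = update(counter, taken)
    # The single not-taken outcome (the loop exit) does not flip the next prediction.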