
Pipeline Hazards

There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:

Structural hazards. They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.

Data hazards. They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.

Control hazards. They arise from the pipelining of branches and other instructions that change the PC.

Hazards in pipelines can make it necessary to stall the pipeline. The processor can stall on different events:

A cache miss. A cache miss stalls all the instructions in the pipeline, both before and after the instruction causing the miss.

A hazard in the pipeline. Eliminating a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. When an instruction is stalled, all the instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear.

A hazard causes pipeline bubbles to be inserted. The following table shows how the stalls are actually implemented. As a result, no new instruction is fetched during clock cycle 4, and no instruction will finish during clock cycle 8. In the case of structural hazards:
                         Clock cycle number
Instr       1    2    3    4       5       6       7       8       9    10
Instr i     IF   ID   EX   MEM     WB
Instr i+1        IF   ID   EX      MEM     WB
Instr i+2             IF   ID      EX      MEM     WB
Stall                      bubble  bubble  bubble  bubble  bubble
Instr i+3                          IF      ID      EX      MEM     WB
Instr i+4                                  IF      ID      EX      MEM  WB

To simplify the picture, it is also commonly shown like this:


                         Clock cycle number
Instr       1    2    3    4      5    6    7    8    9    10
Instr i     IF   ID   EX   MEM    WB
Instr i+1        IF   ID   EX     MEM  WB
Instr i+2             IF   ID     EX   MEM  WB
Instr i+3                  stall  IF   ID   EX   MEM  WB
Instr i+4                         IF   ID   EX   MEM  WB

In case of data hazards:


                         Clock cycle number
Instr       1    2    3    4       5    6    7    8    9    10
Instr i     IF   ID   EX   MEM     WB
Instr i+1        IF   ID   bubble  EX   MEM  WB
Instr i+2             IF   bubble  ID   EX   MEM  WB
Instr i+3                  bubble  IF   ID   EX   MEM  WB
Instr i+4                          IF   ID   EX   MEM  WB

which is also commonly shown with stalls in place of bubbles:


                         Clock cycle number
Instr       1    2    3    4      5    6    7    8    9    10
Instr i     IF   ID   EX   MEM    WB
Instr i+1        IF   ID   stall  EX   MEM  WB
Instr i+2             IF   stall  ID   EX   MEM  WB
Instr i+3                  stall  IF   ID   EX   MEM  WB
Instr i+4                         IF   ID   EX   MEM  WB

Performance of Pipelines with Stalls


A stall causes the pipeline performance to degrade from the ideal performance.

    Speedup from pipelining = Average instruction time unpipelined / Average instruction time pipelined
                            = (CPI unpipelined * Clock cycle time unpipelined) / (CPI pipelined * Clock cycle time pipelined)

The ideal CPI on a pipelined machine is almost always 1. Hence, the pipelined CPI is

    CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
                  = 1 + Pipeline stall clock cycles per instruction

If we ignore the cycle time overhead of pipelining and assume the stages are all perfectly balanced, then the cycle times of the two machines are equal and

    Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

If all instructions take the same number of cycles, which must also equal the number of pipeline stages (the depth of the pipeline), then the unpipelined CPI is equal to the depth of the pipeline, leading to

    Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

If there are no pipeline stalls, this leads to the intuitive result that pipelining can improve performance by the depth of the pipeline.
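The speedup formulas above can be sanity-checked numerically. A minimal sketch (the 5-stage depth and the 0.4 stall cycles per instruction are illustrative numbers, not taken from the text):

```python
def pipeline_speedup(depth, stalls_per_instr):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return depth / (1 + stalls_per_instr)

# With no stalls the speedup equals the pipeline depth;
# stall cycles eat into that gain.
print(pipeline_speedup(5, 0.0))            # 5.0
print(round(pipeline_speedup(5, 0.4), 2))  # 3.57
```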

Structural Hazards
When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard. Common instances of structural hazards arise when:

Some functional unit is not fully pipelined. Then a sequence of instructions using that unpipelined unit cannot proceed at the rate of one per clock cycle.

Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute. Example 1: a machine may have only one register-file write port, but in some cases the pipeline might want to perform two writes in a clock cycle.

Example 2: a machine shares a single memory pipeline for data and instructions. As a result, when an instruction contains a data-memory reference (load), it will conflict with the instruction fetch of a later instruction (Instr 3):
                     Clock cycle number
Instr      1    2    3    4    5    6    7    8
Load       IF   ID   EX   MEM  WB
Instr 1         IF   ID   EX   MEM  WB
Instr 2              IF   ID   EX   MEM  WB
Instr 3                   IF   ID   EX   MEM  WB

To resolve this, we stall the pipeline for one clock cycle when a data-memory access occurs. The effect of the stall is actually to occupy the resources for that instruction slot. The following table shows how the stalls are actually implemented.
                     Clock cycle number
Instr      1    2    3    4       5       6       7       8       9
Load       IF   ID   EX   MEM     WB
Instr 1         IF   ID   EX      MEM     WB
Instr 2              IF   ID      EX      MEM     WB
Stall                     bubble  bubble  bubble  bubble  bubble
Instr 3                           IF      ID      EX      MEM     WB

Instruction 1 is assumed not to be a data-memory reference (load or store); otherwise Instruction 3 cannot start execution, for the same reason as above. To simplify the picture, it is also commonly shown like this:
                     Clock cycle number
Instr      1    2    3    4      5    6    7    8    9
Load       IF   ID   EX   MEM    WB
Instr 1         IF   ID   EX     MEM  WB
Instr 2              IF   ID     EX   MEM  WB
Instr 3                   stall  IF   ID   EX   MEM  WB

Introducing stalls degrades performance, as we saw before. Why, then, would a designer allow structural hazards? There are two reasons:

To reduce cost. For example, machines that support both an instruction access and a data access every cycle (to eliminate the structural hazard of the above example) require at least twice as much total memory bandwidth.

To reduce the latency of the unit. The shorter latency comes from the lack of pipeline registers, which introduce overhead.
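The cost of tolerating such a hazard can be estimated with the stall-CPI formula from the performance section. A small sketch with hypothetical numbers (the 35% data-reference fraction is an assumption chosen for illustration, not a figure from the text):

```python
# Assumed instruction mix: 35% of instructions reference data memory,
# and each such reference costs one stall cycle on the shared
# instruction/data memory port.
mem_ref_fraction = 0.35
stall_cycles_per_mem_ref = 1

# CPI = ideal CPI (1) + pipeline stall cycles per instruction
cpi = 1 + mem_ref_fraction * stall_cycles_per_mem_ref
print(round(cpi, 2))  # 1.35 -> the machine runs 35% more cycles than ideal
```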

Data Hazards
A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine. Consider the pipelined execution of these instructions:

ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7
OR  R8, R1, R9
XOR R10, R1, R11

Instr             1    2    3      4      5      6      7      8    9
ADD R1, R2, R3    IF   ID   EX     MEM    WB
SUB R4, R5, R1         IF   IDsub  EX     MEM    WB
AND R6, R1, R7              IF     IDand  EX     MEM    WB
OR  R8, R1, R9                     IF     IDor   EX     MEM    WB
XOR R10, R1, R11                          IF     IDxor  EX     MEM  WB

All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD instruction writes the value of R1 in the WB stage (cycle 5), but the SUB instruction reads the value during its ID stage (IDsub, cycle 3). This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it. The AND instruction is also affected by this hazard: the write of R1 does not complete until the end of cycle 5, so the AND instruction, which reads the registers during cycle 4 (IDand), will receive the wrong result. The OR instruction can be made to operate without incurring a hazard by a simple implementation technique: perform register-file reads in the second half of the cycle and writes in the first half. Because both WB for ADD and IDor for OR occur in cycle 5, the write to the register file by ADD is performed in the first half of the cycle and the read of the registers by OR in the second half. The XOR instruction operates properly, because its register read occurs in cycle 6, after the register write by ADD.

The next page discusses forwarding, a technique to eliminate the stalls for the hazard involving the SUB and AND instructions. We will also classify the data hazards and consider the cases in which stalls cannot be eliminated. We will see what the compiler can do to schedule the pipeline to avoid stalls.

Forwarding
The problem with data hazards, introduced by this sequence of instructions, can be solved with a simple hardware technique called forwarding.

ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7

Instr            1    2    3      4      5    6    7
ADD R1, R2, R3   IF   ID   EX     MEM    WB
SUB R4, R5, R1        IF   IDsub  EX     MEM  WB
AND R6, R1, R7             IF     IDand  EX   MEM  WB

The key insight in forwarding is that the result is not really needed by SUB until after the ADD actually produces it. The only problem is to make it available for SUB when SUB needs it. If the result can be moved from where the ADD produces it (the EX/MEM register) to where the SUB needs it (the ALU input latch), then the need for a stall can be avoided. Using this observation, forwarding works as follows: the ALU result from the EX/MEM register is always fed back to the ALU input latches. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs.

The paths correspond to a forwarding of: (a) the ALU output at the end of EX, (b) the ALU output at the end of MEM, and (c) the memory output at the end of MEM.
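The selection logic just described can be sketched in software. This is a simplified model, not the DLX hardware: `forward_select` stands in for one ALU-input multiplexer, and `exmem`/`memwb` are hypothetical (destination register, value) pairs modeling the two pipeline registers, with the younger producer (EX/MEM) taking priority:

```python
def forward_select(src_reg, regfile_val, exmem, memwb):
    """Choose the value for one ALU input.

    exmem / memwb model the EX/MEM and MEM/WB pipeline registers as
    (dest_reg, value) pairs, or None when the instruction in that
    stage writes no register.  The most recent producer (EX/MEM)
    has priority; with no match, the register-file value is used.
    """
    if exmem is not None and exmem[0] == src_reg:
        return exmem[1]
    if memwb is not None and memwb[0] == src_reg:
        return memwb[1]
    return regfile_val

# SUB needs R1: the ADD immediately ahead of it (now in EX/MEM)
# wins over any older writer of R1 still sitting in MEM/WB.
print(forward_select("R1", 7, ("R1", 42), ("R1", 13)))  # 42
print(forward_select("R5", 7, ("R1", 42), None))        # 7
```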

Without forwarding, our example will execute correctly only with stalls:

ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7

Instr            1    2    3      4      5      6      7    8    9
ADD R1, R2, R3   IF   ID   EX     MEM    WB
SUB R4, R5, R1        IF   stall  stall  IDsub  EX     MEM  WB
AND R6, R1, R7                           IF     IDand  EX   MEM  WB

As our example shows, we need to forward results not only from the immediately previous instruction but possibly from an instruction that started three cycles earlier. Forwarding can also be arranged from the MEM/WB latch to the ALU input. Using those forwarding paths, the code sequence can be executed without stalls:

ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7

Instr            1    2    3      4       5      6    7
ADD R1, R2, R3   IF   ID   EXadd  MEMadd  WB
SUB R4, R5, R1        IF   ID     EXsub   MEM    WB
AND R6, R1, R7             IF     ID      EXand  MEM  WB

The first forwarding is for the value of R1, from EXadd to EXsub. The second forwarding is also for the value of R1, from MEMadd to EXand. This code can now be executed without stalls. Forwarding can be generalized to include passing a result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another, rather than just from the output of a unit to the input of the same unit. One more example: to prevent a stall in the sequence below, we would need to forward the values of R1 and R4 from the pipeline registers to the ALU and data-memory inputs.

ADD R1, R2, R3
LW  R4, d(R1)
SW  R4, 12(R1)

Instr            1    2    3      4       5      6      7
ADD R1, R2, R3   IF   ID   EXadd  MEMadd  WB
LW  R4, d(R1)         IF   ID     EXlw    MEMlw  WB
SW  R4, 12(R1)             IF     ID      EXsw   MEMsw  WB

Stores require an operand during MEM, and forwarding of that operand is shown here. The first forwarding is for the value of R1, from EXadd to EXlw. The second forwarding is also for the value of R1, from MEMadd to EXsw. The third forwarding is for the value of R4, from MEMlw to MEMsw. Observe that the SW instruction is storing the value of R4 into a memory location computed by adding the displacement 12 to the value contained in register R1. This effective-address computation is done in the ALU during the EX stage of the SW instruction. The value to be stored (R4 in this case) is needed only in the MEM stage, as an input to data memory. Thus the value of R1 is forwarded to the EX stage for the effective-address computation and is needed earlier in time than the value of R4, which is forwarded to the input of data memory in the MEM stage. So forwarding takes place from "left to right" in time, but operands are not ALWAYS forwarded to the EX stage - it depends on the instruction and the point in the datapath where the operand is needed. Of course, hardware support is necessary for data forwarding.

Data Hazard Classification


A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand. Our example hazards have all involved register operands, but it is also possible to create a dependence by writing and reading the same memory location. In the DLX pipeline, however, memory references are always kept in order, preventing this type of hazard from arising. All the data hazards discussed here involve registers within the CPU. By convention, the hazards are named by the ordering in the program that must be preserved by the pipeline: RAW (read after write), WAW (write after write), and WAR (write after read). Consider two instructions i and j, with i occurring before j. The possible data hazards are:

RAW (read after write) - j tries to read a source before i writes it, so j incorrectly gets the old value. This is the most common type of hazard and the kind that we use forwarding to overcome.

WAW (write after write) - j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard is present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled. The DLX integer pipeline writes a register only in WB and so avoids this class of hazards. WAW hazards would be possible if we made the following two changes to the DLX pipeline: move the write back for an ALU operation into the MEM stage, since the data value is available by then; and suppose that the data-memory access took two pipe stages. Here is a sequence of two instructions showing the execution in this revised pipeline, highlighting the pipe stage that writes the result:
LW  R1, 0(R2)    IF   ID   EX   MEM1  MEM2  WB
ADD R1, R2, R3        IF   ID   EX    WB

Unless this hazard is avoided, execution of this sequence on the revised pipeline will leave the result of the first write (the LW) in R1, rather than the result of the ADD. Allowing writes in different pipe stages introduces other problems, since two instructions can try to write during the same clock cycle. The DLX FP pipeline, which has both writes in different stages and different pipeline lengths, deals with both write conflicts and WAW hazards in detail. WAR (write after read) - j tries to write a destination before it is read by i, so i incorrectly gets the new value. This cannot happen in our example pipeline, because all reads are early (in ID) and all writes are late (in WB). This hazard occurs when some instructions write results early in the pipeline and other instructions read a source late in the pipeline.

Because of the natural structure of a pipeline, which typically reads values before it writes results, such hazards are rare. Pipelines for complex instruction sets that support autoincrement addressing and require operands to be read late in the pipeline could create a WAR hazard. If we modified the DLX pipeline as in the above example and also read some operands late, such as the source value for a store instruction, a WAR hazard could occur. Here is the pipeline timing for such a potential hazard, highlighting the stage where the conflict occurs:

SW  R1, 0(R2)    IF   ID   EX   MEM1  MEM2  WB
ADD R2, R3, R4        IF   ID   EX    WB

If the SW reads R2 during the second half of its MEM2 stage and the ADD writes R2 during the first half of its WB stage (both in cycle 5), the SW will incorrectly read and store the value produced by the ADD. RAR (read after read) - this case is not a hazard.
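The RAW/WAW/WAR definitions above translate directly into set operations on each instruction's read and write sets. A minimal sketch (the (reads, writes) encoding and register names are ours, chosen for illustration):

```python
def classify_hazard(i, j):
    """i occurs before j; each instruction is a (reads, writes)
    pair of register-name sets.  Returns the hazard types this pair
    could exhibit if the pipeline reordered their accesses."""
    reads_i, writes_i = i
    reads_j, writes_j = j
    hazards = set()
    if writes_i & reads_j:
        hazards.add("RAW")  # j reads what i writes
    if writes_i & writes_j:
        hazards.add("WAW")  # both write the same register
    if reads_i & writes_j:
        hazards.add("WAR")  # j overwrites what i still needs to read
    return hazards

# ADD R1,R2,R3 followed by SUB R4,R5,R1: the classic RAW on R1.
add = ({"R2", "R3"}, {"R1"})
sub = ({"R5", "R1"}, {"R4"})
print(classify_hazard(add, sub))  # {'RAW'}
```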

When Stalls are Required


Unfortunately, not all potential hazards can be handled by forwarding. Consider the following sequence of instructions:

LW  R1, 0(R1)
SUB R4, R1, R5
AND R6, R1, R7
OR  R8, R1, R9

Instr            1    2    3    4      5      6    7    8
LW  R1, 0(R1)    IF   ID   EX   MEM    WB
SUB R4, R1, R5        IF   ID   EXsub  MEM    WB
AND R6, R1, R7             IF   ID     EXand  MEM  WB
OR  R8, R1, R9                  IF     ID     EX   MEM  WB

The LW instruction does not have the data until the end of clock cycle 4 (MEM), while the SUB instruction needs it by the beginning of that clock cycle (EXsub). For the AND instruction, we can forward the result immediately to the ALU (EXand) from the MEM/WB register (MEM).

The OR instruction has no problem, since it receives the value through the register file (ID): in clock cycle 5, the WB of the LW instruction occurs "early", in the first half of the cycle, and the register read of the OR instruction occurs "late", in the second half of the cycle. For the SUB instruction, however, the forwarded result would arrive too late - at the end of a clock cycle, when it is needed at the beginning. The load instruction has a delay or latency that cannot be eliminated by forwarding alone. Instead, we need to add hardware, called a pipeline interlock, to preserve the correct execution pattern. In general, a pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared. The pipeline with the stall and the legal forwarding is:

LW  R1, 0(R1)
SUB R4, R1, R5
AND R6, R1, R7
OR  R8, R1, R9

Instr            1    2    3      4      5      6    7    8    9
LW  R1, 0(R1)    IF   ID   EX     MEM    WB
SUB R4, R1, R5        IF   ID     stall  EXsub  MEM  WB
AND R6, R1, R7             IF     stall  ID     EX   MEM  WB
OR  R8, R1, R9                    stall  IF     ID   EX   MEM  WB

The only necessary forwarding is done for R1, from MEM to EXsub. Notice that there is no need to forward R1 to the AND instruction, because it now gets the value through the register file in ID (as OR did above). There are techniques to reduce the number of stalls even in this case, which we consider next.
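The interlock's detection condition can be stated compactly: stall when the instruction in EX is a load whose destination is read by the instruction in ID. A minimal sketch, with a hypothetical dict encoding of instructions (the op/dest/srcs field names are ours):

```python
def needs_load_interlock(id_instr, ex_instr):
    """True when the instruction now in ID reads the destination of
    the load now in EX: the loaded value leaves MEM too late to be
    forwarded to the next EX, so one stall cycle is required."""
    return (ex_instr is not None
            and ex_instr["op"] == "LW"
            and ex_instr["dest"] in id_instr["srcs"])

lw  = {"op": "LW",  "dest": "R1", "srcs": ["R1"]}        # LW  R1, 0(R1)
sub = {"op": "SUB", "dest": "R4", "srcs": ["R1", "R5"]}  # SUB R4, R1, R5
print(needs_load_interlock(sub, lw))   # True -> stall one cycle
```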

Pipeline Scheduling
Generate DLX code that avoids pipeline stalls for the following sequence of statements:

a = b + c;
d = a - f;
e = g - h;

Assume that all variables are 32-bit integers. Wherever necessary, explicitly explain the actions that are needed to avoid pipeline stalls in your scheduled code. Solution: The DLX assembly code for the given sequence of statements is:

LW  Rb, b
LW  Rc, c
ADD Ra, Rb, Rc
SW  Ra, a
LW  Rf, f
SUB Rd, Ra, Rf
SW  Rd, d
LW  Rg, g
LW  Rh, h
SUB Re, Rg, Rh
SW  Re, e

Instruction      1   2   3   4     5     6   7   8     9     10  11  12  13  14    15  16  17  18
LW  Rb, b        IF  ID  EX  M     WB
LW  Rc, c            IF  ID  EX    M     WB
ADD Ra, Rb, Rc           IF  ID    stall EX  M   WB
SW  Ra, a                    IF    stall ID  EX  M     WB
LW  Rf, f                                IF  ID  EX    M     WB
SUB Rd, Ra, Rf                               IF  ID    stall EX  M   WB
SW  Rd, d                                        IF    stall ID  EX  M   WB
LW  Rg, g                                                    IF  ID  EX  M     WB
LW  Rh, h                                                        IF  ID  EX    M   WB
SUB Re, Rg, Rh                                                       IF  ID    stall EX  M   WB
SW  Re, e                                                                IF    stall ID  EX  M   WB

Running this code segment will need some forwarding. But an LW immediately followed by an ALU instruction (ADD or SUB) that uses the loaded value generates a hazard for the pipeline that cannot be resolved by forwarding, so the pipeline will stall. Observe that in time steps 4, 5, and 6 there are two forwards from the data-memory unit to the ALU in the EX stage of the ADD instruction; the same happens in time steps 13, 14, and 15. The hardware to implement this forwarding will need two load-memory-data registers to store the output of data memory. Note that for the SW instructions, the register value is needed at the input of data memory. A better solution with compiler assistance is given below. Rather than just allowing the pipeline to stall, the compiler can try to schedule the pipeline to avoid these stalls by rearranging the code sequence to eliminate the hazards. A suggested version is (the problem actually has more than one solution):

Instruction      1   2   3   4   5   6   7   8   9   10  11  12  13  14  15   Explanation
LW  Rb, b        IF  ID  EX  M   WB
LW  Rc, c            IF  ID  EX  M   WB
LW  Rf, f                IF  ID  EX  M   WB
ADD Ra, Rb, Rc               IF  ID  EX  M   WB                               Rb read in second half of ID; Rc forwarded
SW  Ra, a                        IF  ID  EX  M   WB                           Ra forwarded
SUB Rd, Ra, Rf                       IF  ID  EX  M   WB                       Rf read in second half of ID; Ra forwarded
LW  Rg, g                                IF  ID  EX  M   WB
LW  Rh, h                                    IF  ID  EX  M   WB
SW  Rd, d                                        IF  ID  EX  M   WB           Rd read in second half of ID
SUB Re, Rg, Rh                                       IF  ID  EX  M   WB       Rg read in second half of ID; Rh forwarded
SW  Re, e                                                IF  ID  EX  M   WB   Re forwarded

In the original figure, matching colors outlined the source and destination of each forward; blue marked the reads that succeed because register-file reads are performed in the second half of a cycle and writes in the first half (as noted in the Explanation column). Note: the use of different registers for the first, second, and third statements was critical for this schedule to be legal! In general, pipeline scheduling can increase the register count required.
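The effect of the schedule can be checked by counting load-use pairs, which are the only stalls left once full forwarding is in place. A minimal sketch (the (op, dest, sources) triples are our encoding; store-data operands are omitted from the source sets because a store needs its data only in MEM, where forwarding always succeeds, so it never triggers the load-use stall modeled here):

```python
def count_load_use_stalls(instrs):
    """Count one-cycle load-use stalls in a DLX-like 5-stage
    pipeline with full forwarding: a stall occurs only when an
    instruction reads the destination of the load immediately
    before it (the loaded value arrives too late for EX)."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

# (op, dest, EX-stage sources) for the two versions of the code.
unscheduled = [
    ("LW",  "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW",  None, []), ("LW", "Rf", []), ("SUB", "Rd", ["Ra", "Rf"]),
    ("SW",  None, []), ("LW", "Rg", []), ("LW",  "Rh", []),
    ("SUB", "Re", ["Rg", "Rh"]), ("SW", None, []),
]
scheduled = [
    ("LW",  "Rb", []), ("LW", "Rc", []), ("LW", "Rf", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, []),
    ("SUB", "Rd", ["Ra", "Rf"]), ("LW", "Rg", []), ("LW", "Rh", []),
    ("SW",  None, []), ("SUB", "Re", ["Rg", "Rh"]), ("SW", None, []),
]
print(count_load_use_stalls(unscheduled))  # 3
print(count_load_use_stalls(scheduled))    # 0
```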

Control Hazards
Control hazards can cause a greater performance loss for the DLX pipeline than data hazards. When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken.

If instruction i is a taken branch, then the PC is normally not changed until the end of the MEM stage, after the completion of the address calculation and comparison (see diagram). The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected until we reach the MEM stage, which determines the new PC. The pipeline behavior looks like:

Instr                 1    2          3      4      5    6    7    8    9
Branch instr          IF   ID         EX     MEM    WB
Branch successor           IF(stall)  stall  stall  IF   ID   EX   MEM  WB
Branch successor+1                                       IF   ID   EX   MEM  WB

The stall does not occur until after the ID stage (where we know that the instruction is a branch). This control-hazard stall must be implemented differently from a data-hazard stall, since the IF cycle of the instruction following the branch must be repeated as soon as we know the branch outcome. Thus, the first IF cycle is essentially a stall (because it never performs useful work), bringing the total to three stall cycles.

Three clock cycles wasted for every branch is a significant loss. With a 30% branch frequency and an ideal CPI of 1, the machine with branch stalls achieves only about half the ideal speedup from pipelining! The number of stall cycles can be reduced by two steps: find out whether the branch is taken or not taken earlier in the pipeline, and compute the taken PC (i.e., the address of the branch target) earlier. Both steps should be taken as early in the pipeline as possible. By moving the zero test into the ID stage, it is possible to know whether the branch is taken at the end of the ID cycle. Computing the branch-target address during ID requires an additional adder, because the main ALU, which has been used for this function so far, is not usable until EX. The revised datapath:

With this datapath we will need only one-clock-cycle stall on branches.

Instr               1    2          3    4    5    6    7
Branch instr        IF   ID         EX   MEM  WB
Branch successor         IF(stall)  IF   ID   EX   MEM  WB

In some machines, branch hazards are even more expensive in clock cycles. For example, a machine with separate decode and register-fetch stages will probably have a branch delay - the length of the control hazard - that is at least one clock cycle longer. The branch delay, unless it is dealt with, turns into a branch penalty. Many older machines that implement more complex instruction sets have branch delays of four clock cycles or more. In general, the deeper the pipeline, the worse the branch penalty in clock cycles. There are many methods for dealing with the pipeline stalls caused by branch delay. Next we discuss four simple branch-prediction schemes, and then a more powerful compile-time scheme, loop unrolling, that reduces the frequency of loop branches.
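The 30%-branch example above can be checked numerically. A small sketch (the pipeline depth of 5 is our assumption; the 3-cycle penalty and 30% frequency come from the text):

```python
depth = 5            # assumed pipeline depth
branch_freq = 0.30   # branch frequency, from the text
branch_penalty = 3   # stall cycles per branch, from the text

# CPI = ideal CPI (1) + branch stall cycles per instruction
cpi = 1 + branch_freq * branch_penalty
speedup = depth / cpi
print(round(speedup, 2))  # 2.63 -- roughly half of the ideal speedup of 5
```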

Branch Prediction Schemes


There are many methods to deal with the pipeline stalls caused by branch delay. We discuss four simple compile-time schemes in which predictions are static: they are fixed for each branch during the entire execution, and the predictions are compile-time guesses. The schemes are: stall pipeline, predict taken, predict not taken, and delayed branch.

Stall pipeline. The simplest scheme to handle branches is to freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known. Advantage: simple, for both software and hardware (this is the solution described earlier).

Predict not taken. A higher-performance, and only slightly more complex, scheme is to predict the branch as not taken, simply allowing the hardware to continue as if the branch were not executed. Care must be taken not to change the machine state until the branch outcome is definitely known.

The complexity arises from two requirements: we have to know when the state might be changed by an instruction, and we have to know how to "back out" of a change. The pipeline with this scheme implemented behaves as shown below:

Instr                  1    2    3     4     5     6     7    8
Untaken branch instr   IF   ID   EX    MEM   WB
Instr i+1                   IF   ID    EX    MEM   WB
Instr i+2                        IF    ID    EX    MEM   WB

Instr                  1    2    3     4     5     6     7    8
Taken branch instr     IF   ID   EX    MEM   WB
Instr i+1                   IF   idle  idle  idle  idle
Branch target                    IF    ID    EX    MEM   WB
Branch target+1                        IF    ID    EX    MEM  WB

When the branch is not taken, determined during ID, we have fetched the fall-through instruction and just continue. If the branch is taken during ID, we restart the fetch at the branch target. This causes all instructions following the branch to stall one clock cycle.

Predict taken. An alternative scheme is to predict the branch as taken. As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target address. Because in the DLX pipeline the target address is not known any earlier than the branch outcome, there is no advantage in this approach. In machines where the target address is known before the branch outcome, a predict-taken scheme might make sense.

Delayed branch. In a delayed branch, the execution cycle with a branch delay of length n is:
Branch instr
  sequential successor 1
  sequential successor 2
  . . . . .
  sequential successor n
Branch target if taken

The sequential successors are in the branch-delay slots. These instructions are executed whether or not the branch is taken. The pipeline behavior of the DLX pipeline, which has one branch-delay slot, is shown below:

Untaken branch instr       IF   ID   EX   MEM  WB
Branch delay instr (i+1)        IF   ID   EX   MEM  WB
Instr i+2                            IF   ID   EX   MEM  WB
Instr i+3                                 IF   ID   EX   MEM  WB
Instr i+4                                      IF   ID   EX   MEM  WB

Taken branch instr         IF   ID   EX   MEM  WB
Branch delay instr (i+1)        IF   ID   EX   MEM  WB
Branch target                        IF   ID   EX   MEM  WB
Branch target+1                           IF   ID   EX   MEM  WB
Branch target+2                                IF   ID   EX   MEM  WB

The job of the compiler is to make the successor instructions valid and useful. We will show three branch-scheduling schemes: from before the branch, from the target, and from the fall-through.

Scheduling the branch-delay slot. The left box in each pair shows the code before scheduling; the right box shows the scheduled code. In (a) the delay slot is scheduled with an independent instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of R1 in the branch condition prevents the ADD instruction (whose destination is R1) from being moved after the branch. In (b) the branch-delay slot is scheduled from the target of the branch; usually the target instruction will need to be copied, because it can be reached by another path.

In (c) the branch-delay slot is scheduled from the not-taken fall-through. To make this optimization legal for (b) and (c), it must be OK to execute the SUB instruction when the branch goes in the unexpected direction. "OK" means that the work might be wasted, but the program will still execute correctly.

Scheduling strategy   Requirements                                                            When it improves performance
From before branch    Branch must not depend on the rescheduled instructions                  Always
From target           Must be OK to execute rescheduled instructions if branch is not taken   When branch is taken; may enlarge program if instructions are duplicated
From fall-through     Must be OK to execute instructions if branch is taken                   When branch is not taken

The limitations on delayed-branch scheduling arise from the restrictions on the instructions that can be scheduled into the delay slots and from our ability to predict at compile time whether a branch is likely to be taken or not.

Cancelling branch. To improve the ability of the compiler to fill branch-delay slots, most machines with conditional branches have introduced a cancelling branch. In a cancelling branch, the instruction includes the direction in which the branch was predicted. If the branch behaves as predicted, the instruction in the branch-delay slot is fully executed; if the branch is incorrectly predicted, the instruction in the delay slot is turned into a no-op (idle). The behavior of a predicted-taken cancelling branch depends on whether the branch is taken or not:
Untaken branch instr       IF   ID   EX    MEM   WB
Branch delay instr (i+1)        IF   ID    idle  idle  idle
Instr i+2                            IF    ID    EX    MEM   WB
Instr i+3                                  IF    ID    EX    MEM  WB
Instr i+4                                        IF    ID    EX   MEM  WB

Taken branch instr         IF   ID   EX   MEM  WB
Branch delay instr (i+1)        IF   ID   EX   MEM  WB
Branch target                        IF   ID   EX   MEM  WB
Branch target+1                           IF   ID   EX   MEM  WB
Branch target+2                                IF   ID   EX   MEM  WB

The advantage of cancelling branches is that they eliminate the requirements on the instruction placed in the delay slot. Delayed branches are an architecturally visible feature of the pipeline. This is the source both of their advantage - allowing the use of simple compiler scheduling to reduce branch penalties - and of their disadvantage - exposing an aspect of the implementation that is likely to change.

Problem on Branch Prediction Schemes


Use the following code segment:
loop: LW   R1, 0(R2)
      ADDI R1, R1, #1
      SW   R1, 0(R2)
      ADDI R2, R2, #4
      SUB  R4, R3, R2
      BNEZ R4, loop

Assume that the initial value of R3 is R2+396. Throughout this exercise use the DLX integer pipeline and assume all memory accesses are cache hits.

a) Show the timing of this instruction sequence for the DLX pipeline without any forwarding, but assuming that a register read and a write in the same clock cycle "forward" through the register file. Assume that a branch is handled by flushing the pipeline. If the memory references hit in the cache, how many cycles does this loop take to execute?

Solution:

Instruction       1   2   3     4     5   6   7     8   9   10  11    12  13    14    15  16  17
LW   R1, 0(R2)    IF  ID  EX    M     WB
ADDI R1, R1, #1       IF  stall stall ID  EX  M     WB
SW   R1, 0(R2)            IF    stall stall stall stall ID  EX  M     WB
ADDI R2, R2, #4                                     IF  ID  EX  M     WB
SUB  R4, R3, R2                                         IF  stall stall ID  EX    M   WB
BNEZ R4, loop                                               IF  stall stall stall stall ID  EX  M

The branch outcome is known only late, the pipeline is flushed, and the next iteration's LW is fetched in cycle 18, so each iteration takes 17 clock cycles.

Number of steps = (396 / 4) + 1 = 100
Clock cycles to execute the loop = Number of steps * Number of clock cycles per step = 100 * 17 = 1700

b) Show the timing of this instruction sequence for the DLX pipeline with normal forwarding hardware. Assume that the branch is handled by predicting it as not taken. If the memory references hit in the cache, how many cycles does this loop take to execute?

Solution:

l: LW   R1, 0(R2)     IF(1) ID(2) EX(3) MEM(4)  WB(5)
   ADDI R1, R1, #1    IF(2) ID(3) EX(5) MEM(6)  WB(7)    ;1 load-use stall; LW result forwarded from MEM
   SW   R1, 0(R2)     IF(3) ID(5) EX(6) MEM(7)  WB(8)    ;ADDI result forwarded
   ADDI R2, R2, #4    IF(5) ID(6) EX(7) MEM(8)  WB(9)
   SUB  R4, R3, R2    IF(6) ID(7) EX(8) MEM(9)  WB(10)
   BNEZ R4, l         IF(7) ID(8) EX(9) MEM(10) WB(11)   ;R4 forwarded from SUB; outcome known at end of cycle 8
   next instr         IF(8)        ;fetched under predict-not-taken; idled if the branch is taken
l: LW (if taken)      IF(9)        ;target fetched after the misprediction is discovered
Keep in mind that 99 times the branch is predicted wrong (as not taken), and 1 time it is actually predicted right!

Clock cycles to execute the loop = 99 * 8 + 1 * 7 = 799

c) Assuming the DLX pipeline with a single-cycle delayed branch and forwarding hardware, schedule the instructions in the loop including the branch delay slot. You may reorder instructions and modify the individual instruction operands. Show a pipeline timing diagram and compute the number of cycles needed to execute the entire loop.

Solution:

l: LW   R1, 0(R2)     IF(1) ID(2) EX(3) MEM(4) WB(5)
   ADDI R2, R2, #4    IF(2) ID(3) EX(4) MEM(5) WB(6)
   SUB  R4, R3, R2    IF(3) ID(4) EX(5) MEM(6) WB(7)
   ADDI R1, R1, #1    IF(4) ID(5) EX(6) MEM(7) WB(8)
   BNEZ R4, l         IF(5) ID(6) EX(7) MEM(8) WB(9)
   SW   R1, -4(R2)    IF(6) ID(7) EX(8) MEM(9) WB(10)   ;branch delay slot; offset adjusted since R2 was already incremented
next LW               IF(7)
Clock Cycles to Execute the loop = 100 * 6 = 600

Problem on Pipeline Hazards

Consider the following pipeline with 8 stages for a version of DLX:

IF1          Instruction fetch starts
IF2          Instruction fetch completes
ID           Instruction decode and register fetch; begin computing branch target
EX1          Execution starts; branch condition tested; finish computing branch target
EX2          Execution completes; effective address or ALU result available
MEM1/ALUWB   First part of memory cycle, plus WB of an ALU operation
MEM2         Memory access completes
LWB          Write back for a load instruction

As in the standard DLX pipeline, assume register writes are in the first half of a cycle and register reads in the second half.

a) How many register read/write ports are required?
b) For each possible type of instruction source and each possible type of instruction destination, show a code example that depicts all possible forwarding requirements (not stalls).
c) Show the same information as part (b), but for stalls rather than forwards.
d) Assuming a predict-not-taken strategy, find the branch penalty for a taken and an untaken branch. Assume that a predicted instruction can be executed up to, but not including, a pipestage that does a write back.

Solution:

a) We need 2 read ports, because at most 2 registers are read in one clock cycle in the ID stage (the maximum number of register operands in an instruction). We need 2 write ports, due to the potential overlap in time between the MEM1/ALUWB and LWB stages.

b) ALU - ALU / ALU - Branch

1 ALU instr R1, _ , _               IF1(1) IF2(2) ID(3) EX1(4) EX2(5) MEM1(6) MEM2(7) LWB(8)
2 any instr                         IF1(2) IF2(3) ID(4) EX1(5) EX2(6) MEM1(7) MEM2(8) LWB(9)
3 ALU instr _ , R1, _ / BNEZ R1, _  IF1(3) IF2(4) ID(5) EX1(6) EX2(7) MEM1(8) MEM2(9) LWB(10)

Forwarding is done for R1 from EX2 of instruction 1 (cycle 5) to EX1 of instruction 3 (cycle 6).

Memory - ALU / Memory - Branch / Memory - Memory

1 LW instr R1, _ , _                            IF1(1) IF2(2) ID(3) EX1(4) EX2(5) MEM1(6)  MEM2(7)  LWB(8)
2 any instr                                     IF1(2) IF2(3) ID(4) EX1(5) EX2(6) MEM1(7)  MEM2(8)  LWB(9)
3 any instr                                     IF1(3) IF2(4) ID(5) EX1(6) EX2(7) MEM1(8)  MEM2(9)  LWB(10)
4 any instr                                     IF1(4) IF2(5) ID(6) EX1(7) EX2(8) MEM1(9)  MEM2(10) LWB(11)
5 ALU instr _ , R1, _ / BNEZ R1, _ / SW _ , R1  IF1(5) IF2(6) ID(7) EX1(8) EX2(9) MEM1(10) MEM2(11) LWB(12)

Forwarding is done for R1 from MEM2 of instruction 1 (cycle 7) to EX1 of instruction 5 (cycle 8).

ALU - Memory

1 ALU instr R1, _ , _   IF1(1) IF2(2) ID(3) EX1(4) EX2(5) MEM1(6) MEM2(7) LWB(8)
2 SW _ , R1             IF1(2) IF2(3) ID(4) EX1(5) EX2(6) MEM1(7) MEM2(8) LWB(9)

Forwarding is done for R1 from MEM1/ALUWB of instruction 1 (cycle 6) to MEM1 of instruction 2 (cycle 7); no intervening instructions are needed.

c) ALU - ALU / ALU - Branch

1 ALU instr R1, _ , _               IF1(1) IF2(2) ID(3)    EX1(4)   EX2(5) MEM1/ALUWB(6) MEM2(7) LWB(8)
2 ALU instr _ , R1, _ / BNEZ R1, _  IF1(2) IF2(3) stall(4) stall(5) ID(6)  EX1(7) EX2(8) MEM1(9) ...

The ALU result is written to the register file in MEM1/ALUWB (first half of cycle 6), so the dependent instruction can read it in ID in the second half of cycle 6, after 2 stall cycles.

Memory - ALU / Memory - Branch / Memory - Memory

1 LW instr R1, _ , _                            IF1(1) IF2(2) ID(3)    EX1(4)   EX2(5)   MEM1(6)  MEM2(7) LWB(8)
2 ALU instr _ , R1, _ / BNEZ R1, _ / SW _ , R1  IF1(2) IF2(3) stall(4) stall(5) stall(6) stall(7) ID(8)   EX1(9) ...

The load result is written in LWB (first half of cycle 8), so the dependent instruction reads it in ID in the second half of cycle 8, after 4 stall cycles.

ALU - Memory

1 ALU instr R1, _ , _   IF1(1) IF2(2) ID(3)    EX1(4)   EX2(5) MEM1/ALUWB(6) MEM2(7) LWB(8)
2 SW _ , R1             IF1(2) IF2(3) stall(4) stall(5) ID(6)  EX1(7) EX2(8) ...

d) Branch taken

1 BNEZ R1, N    IF1(1) IF2(2) ID(3) EX1(4) EX2(5) MEM1(6) MEM2(7) LWB(8)
2 any instr     IF1(2) IF2(3) ID(4)   ;flushed
3 any instr     IF1(3) IF2(4)         ;flushed
4 any instr     IF1(4)                ;flushed
N (target)      IF1(5) IF2(6) ID(7) EX1(8) EX2(9) ...
The target address is computed at the end of EX1 of the branch instruction. If at that time we find that the branch is taken, we have to flush all instructions in the pipeline after the branch and fetch the instruction we jumped to. So the taken-branch penalty is 3 stall cycles.

Branch not taken
1 BNEZ R1, N
2 any instr

1 BNEZ R1, N    IF1(1) IF2(2) ID(3) EX1(4) EX2(5) MEM1(6) MEM2(7) LWB(8)
2 any instr     IF1(2) IF2(3) ID(4) EX1(5) EX2(6) MEM1(7) MEM2(8) LWB(9)

If the branch is not taken, the pipeline functions without any stalls, because it is designed as a predict-not-taken pipeline.

Dealing with Exceptions

What makes pipelining hard to implement? Exceptions! Now we are ready to consider the challenges of exceptional situations where the instruction execution order is changed in unexpected ways. Exceptional situations are harder to handle in a pipelined machine because the overlapping of instructions makes it more difficult to know whether an instruction can safely change the state of the machine. In a pipelined machine an instruction is executed step by step and is not completed for several clock cycles. Unfortunately, other instructions in the pipeline can raise exceptions that may force the machine to abort the instructions in the pipeline before they complete. First we look at the types of situations that can arise and what architectural requirements exist for supporting them.

Types of Exceptions
The terminology used to describe exceptional situations, where the normal execution order of instructions is changed, varies among machines. The terms interrupt, fault, and exception are all used. We use the term exception to cover all these mechanisms, including the following:
- I/O device request
- Invoking an operating system service from a user program (system call)
- Tracing instruction execution
- Breakpoint (programmer-requested interrupt)
- Integer arithmetic overflow or underflow; FP arithmetic anomaly
- Page fault
- Misaligned memory access (if alignment is required)
- Memory protection violation
- Using an undefined instruction
- Hardware malfunction
- Power failure

The requirements on exceptions can be characterized along five axes:

Synchronous versus asynchronous. If the event occurs at the same place every time the program is executed with the same data and memory allocation, the event is synchronous. With the exception of hardware malfunctions, asynchronous events are caused by devices external to the processor and memory. Asynchronous events usually can be handled after the completion of the current instruction, which makes them easier to handle.

User requested versus coerced. If the user task directly asks for it, it is a user-requested event. In some sense, user-requested exceptions are not really exceptions, since they are predictable. They are treated as exceptions because the same mechanisms that are used to save and restore the state are used for these user-requested events. Because the only function of an instruction that triggers this exception is to cause the exception, user-requested exceptions can always be handled after the instruction has completed. Coerced exceptions are caused by some hardware event that is not under the control of the user program. Coerced exceptions are harder to implement because they are not predictable.

User maskable versus user nonmaskable. If an event can be masked or disabled by a user task, it is user maskable. The mask simply controls whether the hardware responds to the exception or not.

Within versus between instructions. This classification depends on whether the event prevents instruction completion by occurring in the middle of execution (within) or whether it is recognized between instructions. Exceptions that occur within instructions are always synchronous, since the instruction triggers the exception. It is harder to implement exceptions that occur within instructions than between instructions, since the instruction must be stopped and restarted.

Resume versus terminate. If the program's execution always stops after the interrupt, it is a terminating event. If the program's execution continues after the interrupt, it is a resuming event. It is easier to implement exceptions that terminate execution, since the machine need not be able to restart execution of the same program after handling the exception.

The following table classifies different types of exceptions using the categories above:

Exception type                     Sync/Async     Request       Maskability     Within/Between  Resume/Terminate
I/O device request                 Asynchronous   Coerced       Nonmaskable     Between         Resume
Invoke operating system            Synchronous    User request  Nonmaskable     Between         Resume
Tracing instruction execution      Synchronous    User request  User maskable   Between         Resume
Breakpoint                         Synchronous    User request  User maskable   Between         Resume
Integer arithmetic overflow        Synchronous    Coerced       User maskable   Within          Resume
FP arithmetic overflow/underflow   Synchronous    Coerced       User maskable   Within          Resume
Page fault                         Synchronous    Coerced       Nonmaskable     Within          Resume
Misaligned memory access           Synchronous    Coerced       User maskable   Within          Resume
Memory protection violation        Synchronous    Coerced       Nonmaskable     Within          Resume
Using undefined instruction        Synchronous    Coerced       Nonmaskable     Within          Terminate
Hardware malfunction               Asynchronous   Coerced       Nonmaskable     Within          Terminate
Power failure                      Asynchronous   Coerced       Nonmaskable     Within          Terminate

Synchronous, coerced exceptions occurring within instructions that can be resumed are the most difficult to implement.

The difficult task is implementing interrupts that occur within instructions where the instruction must be resumed, because another program must be invoked to:
- save the state of the executing program;
- correct the cause of the exception;
- restore the state of the program as it was before the instruction that caused the exception;
- restart the program from the instruction that caused the exception.
If a pipeline provides the ability for the machine to handle the exception, save the state, and restart without affecting the execution of the program, the pipeline or machine is said to be restartable. Almost all machines today are restartable, at least for integer pipelines, because restartability is needed to implement virtual memory.

Exceptions in DLX
As in unpipelined implementations, the most difficult exceptions have two properties: they occur within instructions, and the instruction (within which the exception occurred) must be restartable. If the pipeline can be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted from scratch, the pipeline is said to have precise exceptions. Supporting precise exceptions is a requirement in many systems. In practice, the need to accommodate virtual memory has led designers to always provide precise exceptions for the integer pipeline.

Exceptions that may occur in the DLX pipeline:

Pipeline stage   Problem exceptions occurring
IF               Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID               Undefined or illegal opcode
EX               Arithmetic exception
MEM              Page fault on data fetch; misaligned memory access; memory-protection violation
WB               None

1) With pipelining, multiple exceptions may occur in the same clock cycle because there are multiple instructions in execution. Example:
LW    IF  ID  EX  MEM WB
ADD       IF  ID  EX  MEM  WB

This pair of instructions can cause a data page fault and an arithmetic exception at the same time, since the LW is in the MEM stage while the ADD is in the EX stage. This can be handled by dealing with only the data page fault and then restarting the execution. The second exception will reoccur and will be handled independently.

2) Exceptions may even occur out of order; that is, an instruction may cause an exception before an earlier instruction causes one. Example:
LW    IF  ID  EX  MEM WB
ADD       IF  ID  EX  MEM  WB

This time consider the case when the LW gets a data page fault, seen when the instruction is in MEM, and the ADD gets an instruction page fault, seen when the ADD instruction is in IF. The instruction page fault will actually occur first, even though it is caused by the later instruction. Since we are implementing precise exceptions, the pipeline is required to handle the exception caused by the LW instruction first! So the pipeline cannot simply handle an exception when it occurs in time, since that would lead to exceptions occurring out of the unpipelined order. Instead, the following steps are used:
- the hardware posts all exceptions caused by a given instruction in a status vector associated with that instruction;
- the status vector is carried along as the instruction goes down the pipeline;

- once the exception indicator is set in the exception status vector, any control signals that may cause a data value to be written (including both register and memory writes) are turned off;
- when an instruction enters WB, the exception status vector is checked;
- if any exceptions are posted, they are handled in the order in which they would occur in time on an unpipelined machine.

Pipeline with Multicycle Operations


It is impractical to require that all DLX floating-point operations complete in one clock cycle, or even two. Doing so would mean accepting a slow clock, or using enormous amounts of logic in the floating-point units, or both. Instead, the floating-point pipeline allows a longer latency for operations. This is easier to grasp if we imagine the floating-point instructions as having the same pipeline as the integer instructions, with two important differences:
- the EX cycle may be repeated as many times as needed to complete the operation;
- there may be multiple floating-point functional units.
Let's assume that there are four separate functional units:
- the main integer unit;
- FP and integer multiplier;
- FP adder (handles FP add, subtract, and conversion);
- FP and integer divider.
If we also assume that the execution stages of these functional units are not pipelined, then the resulting pipeline looks like:

Because only one instruction issues on every clock cycle, all instructions go through the standard pipeline for integer operations. The floating-point operations simply loop when they reach the EX stage. In reality, the intermediate results are probably not cycled around the EX unit; rather, the EX pipeline stage has some number of clock delays larger than 1. We can generalize the structure of the FP pipeline to allow pipelining of some stages and multiple ongoing operations. To describe such a pipeline we must define the latency of the functional units and the initiation interval. Latency is defined as the number of intervening cycles between an instruction that produces a result and an instruction that uses the result. The initiation interval is the number of cycles that must elapse between issuing two operations of a given type. For example, we will use the latencies and initiation intervals shown below:
Functional unit                               Latency   Initiation interval
Integer ALU                                   0         1
Data memory (integer and FP loads)            1         1
FP add                                        3         1
FP multiply (also integer multiply)           6         1
FP divide (also integer divide and FP sqrt)   24        25

Integer ALU operations have a latency of 0, since their results can be used on the next clock cycle (right after EX). Loads have a latency of 1, since their results can be used after one intervening cycle (MEM). Pipeline latency is essentially equal to one cycle less than the depth of the execution pipeline, which is the number of stages from the EX stage to the stage that produces the result. Thus, for the example pipeline above, the number of stages in an FP add is 4, while the number of stages in an FP multiply is 7. The extended pipeline is shown below:

The example pipeline structure allows up to 4 outstanding FP adds, 7 outstanding FP/integer multiplies, and 1 FP divide. The FP multiplier and adder are fully pipelined and have a depth of 7 and 4 stages, respectively. The FP divider is not pipelined.

The pipeline timing of a set of independent FP operations:

MULTD   IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM  WB
ADDD        IF  ID  A1  A2  A3  A4  MEM WB
LD              IF  ID  EX  MEM WB
SD                  IF  ID  EX  MEM  WB

The stages in italic show where data is needed, while the stages in bold show where a result is available.

Instruction Level Parallelism

Pipelining can overlap the execution of instructions when they are independent of one another. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel. The amount of parallelism available within a basic block (a straight-line code sequence with no branches in or out except at the entry and exit) is quite small. The average dynamic branch frequency in integer programs was measured to be about 15%, meaning that about 7 instructions execute between a pair of branches. Since these instructions are likely to depend upon one another, the amount of overlap we can exploit within a basic block is likely to be much less than 7. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.

The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism.

Example 1

for (i=1; i<=1000; i= i+1)
    x[i] = x[i] + y[i];

This is a parallel loop. Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little opportunity for overlap.

Example 2

for (i=1; i<=100; i= i+1){
    a[i] = a[i] + b[i];      /* s1 */
    b[i+1] = c[i] + d[i];    /* s2 */
}

Is this loop parallel? If not, how can it be made parallel?

Statement s1 uses the value assigned in the previous iteration by statement s2, so there is a loop-carried dependency between s1 and s2. Despite this dependency, the loop can be made parallel because the dependency is not circular:
- neither statement depends on itself;
- while s1 depends on s2, s2 does not depend on s1.

A loop is parallel unless there is a cycle in the dependencies, since the absence of a cycle means that the dependencies give a partial ordering on the statements. To expose the parallelism, the loop must be transformed to conform to the partial order. Two observations are critical to this transformation:
- There is no dependency from s1 to s2, so interchanging the two statements will not affect the execution of s2.
- On the first iteration of the loop, statement s1 depends on the value of b[1] computed prior to initiating the loop.

This allows us to replace the loop with the following code sequence, which makes overlapping of the iterations possible:

a[1] = a[1] + b[1];
for (i=1; i<=99; i= i+1){
    b[i+1] = c[i] + d[i];
    a[i+1] = a[i+1] + b[i+1];
}
b[101] = c[100] + d[100];

Example 3

for (i=1; i<=100; i= i+1){
    a[i+1] = a[i] + c[i];     /* S1 */
    b[i+1] = b[i] + a[i+1];   /* S2 */
}

This loop is not parallel because it has cycles in the dependencies: statements S1 and S2 each depend on themselves! There are a number of techniques for converting such loop-level parallelism into instruction-level parallelism. Basically, such techniques work by unrolling the loop.

An important alternative method for exploiting loop-level parallelism is the use of vector instructions on a vector processor, which is not covered by this tutorial.

Loop Unrolling
To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid stalls, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.

Latencies of FP operations used in the example:

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

Our example uses a simple loop that adds a scalar value to an array in memory:
for (i=1; i<=1000; i++)
    x[i] = x[i] + s;

1) The first step is to translate the above segment to DLX assembly language:
Loop: LD   F0, 0(R1)    ;F0 = array element
      ADDD F4, F0, F2   ;add scalar in F2
      SD   0(R1), F4    ;store result
      SUBI R1, R1, #8   ;decrement pointer; 8 bytes per double
      BNEZ R1, Loop     ;branch if R1 != zero

2) Show how the loop would look on DLX, both scheduled and unscheduled, including any stalls or idle clock cycles. Schedule for both delays - from floating-point operations and - from the delayed branches.

Without any scheduling:

                        Cycle
Loop: LD   F0, 0(R1)    1
      stall             2
      ADDD F4, F0, F2   3
      stall             4
      stall             5
      SD   0(R1), F4    6
      SUBI R1, R1, #8   7
      BNEZ R1, Loop     8
      stall             9

9 clock cycles per element.

Scheduled:

                        Cycle
Loop: LD   F0, 0(R1)    1
      stall             2
      ADDD F4, F0, F2   3
      SUBI R1, R1, #8   4
      BNEZ R1, Loop     5    ;delayed branch
      SD   8(R1), F4    6    ;altered and interchanged with SUBI

6 clock cycles per element.

3) Show the loop unrolled (scheduled and unscheduled) so that there are 4 copies of the loop body, assuming R1 is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations, and do not reuse any of the registers.

Without any scheduling:

                          Cycle
Loop: LD   F0, 0(R1)      1
      stall               2
      ADDD F4, F0, F2     3
      stall               4
      stall               5
      SD   0(R1), F4      6     ;drop SUBI & BNEZ
      LD   F6, -8(R1)     7
      stall               8
      ADDD F8, F6, F2     9
      stall               10
      stall               11
      SD   -8(R1), F8     12    ;drop SUBI & BNEZ
      LD   F10, -16(R1)   13
      stall               14
      ADDD F12, F10, F2   15
      stall               16
      stall               17
      SD   -16(R1), F12   18    ;drop SUBI & BNEZ
      LD   F14, -24(R1)   19
      stall               20
      ADDD F16, F14, F2   21
      stall               22
      stall               23
      SD   -24(R1), F16   24
      SUBI R1, R1, #32    25
      BNEZ R1, Loop       26
      stall               27

27 clock cycles per iteration; 27/4 = 6.8 clock cycles per element.

Scheduled:

                          Cycle
Loop: LD   F0, 0(R1)      1
      LD   F6, -8(R1)     2
      LD   F10, -16(R1)   3
      LD   F14, -24(R1)   4
      ADDD F4, F0, F2     5
      ADDD F8, F6, F2     6
      ADDD F12, F10, F2   7
      ADDD F16, F14, F2   8
      SD   0(R1), F4      9
      SD   -8(R1), F8     10
      SD   -16(R1), F12   11
      SUBI R1, R1, #32    12
      BNEZ R1, Loop       13
      SD   8(R1), F16     14    ;8 - 32 = -24

14 clock cycles per iteration; 14/4 = 3.5 clock cycles per element.

Summary
To obtain the final unrolled code we had to make the following decisions and transformations: Determine that it was legal to move the SD instruction after the SUBI and BNEZ, and find the amount to adjust the SD offset. Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for loop maintenance code. Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations. Eliminate the extra tests and branches and adjust loop maintenance code. Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing memory addresses and finding that they do not refer to the same address! Schedule the code, preserving any dependencies needed to yield the same result as the original code.

Dynamic Scheduling Techniques

We examined compiler techniques for scheduling instructions so as to separate dependent instructions and minimize the number of actual hazards and resultant stalls. This approach, called static scheduling, became popular with pipelining. Another approach, used by earlier processors, is called dynamic scheduling: the hardware rearranges the instruction execution to reduce the stalls. Dynamic scheduling offers several advantages:
- It enables handling some cases where dependencies are unknown at compile time (e.g., because they may involve a memory reference);
- It simplifies the compiler;
- It allows code compiled with one pipeline in mind to run efficiently on a different pipeline.
These advantages are gained at a cost of a significant increase in hardware complexity!

Dynamic Scheduling: The Idea

A major limitation of the pipelining techniques discussed so far is that they use in-order instruction issue: if an instruction is stalled in the pipeline, no later instructions can proceed. Thus, if there is a dependency between two closely spaced instructions in the pipeline, it will stall. For example:

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14

The SUBD instruction cannot execute because the dependency of ADDD on DIVD causes the pipeline to stall; yet SUBD is not data dependent on anything in the pipeline. This is a performance limitation that can be eliminated by not requiring instructions to execute in order. To allow SUBD to begin executing, we must separate the instruction issue process into two parts: checking for structural hazards and waiting for the absence of a data hazard. We can still check for structural hazards when we issue the instruction; thus, we still use in-order instruction issue. However, we want the instructions to begin execution as soon as their data operands are available. Thus, the pipeline will do out-of-order execution, which implies out-of-order completion. In introducing out-of-order execution, we have essentially split the ID pipe stage into two stages:

Issue - Decode instructions, check for structural hazards;
Read operands - Wait until no data hazards, then read operands.

An instruction fetch proceeds with the issue stage and may fetch either into a single-entry latch or into a queue; instructions are then issued from the latch or queue. The EX stage follows the read operands stage, just as in the DLX pipeline. As in the DLX floating-point pipeline, execution may take multiple cycles, depending on the operation. Thus, we may need to distinguish when an instruction begins execution and when it completes execution; between the two times, the instruction is in execution. This allows multiple instructions to be in execution at the same time.

Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependencies; it is named after the CDC 6600 scoreboard, which developed this capability. The goal of a scoreboard is to maintain an execution rate of one instruction per clock cycle (when there are no structural hazards) by executing an instruction as early as possible. Thus, when the next instruction to execute is stalled, other instructions can be issued and executed if they do not depend on any active or stalled instruction. The scoreboard takes full responsibility for instruction issue and execution, including all hazard detection. Every instruction goes through the scoreboard, where a record of the data dependencies is constructed; this step corresponds to instruction issue and replaces part of the ID step in the DLX pipeline. The scoreboard then determines when the instruction can read its operands and begin execution.

The Tomasulo approach is another scheme to allow execution to proceed in the presence of hazards, developed for the IBM 360/91 floating-point unit. This scheme combines key elements of the scoreboarding scheme with the introduction of register renaming.
In the loop unrolling section we showed how a compiler could rename registers to avoid WAW and WAR hazards. In Tomasulo's scheme this functionality is provided by the reservation stations, which buffer the operands of instructions waiting to issue, and by the issue logic. The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input. Finally, when successive writes to a register appear, only the last one is actually used to update the register. As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation stations, in a process called register renaming. This combination of issue logic and reservation stations provides renaming and eliminates WAW and WAR hazards. This additional capability is the major conceptual difference between scoreboarding and Tomasulo's algorithm. Since there can be more reservation stations than real registers, the technique can eliminate hazards that could not be eliminated by a compiler.
