
Pipelining - Modified1

Uploaded by

Sasuke Uchiha
Copyright
© © All Rights Reserved

Pipelining: Datapath and Hazards

Chapter 6, Computer Organization and Design, David A. Patterson and John L. Hennessy
Introduction
•The performance of a single CPU can be increased by:
–Improving the hardware by introducing faster circuits.
–Arranging the hardware so that more than one operation can be performed at the same time.
•Since there is a limit on the speed of hardware and faster circuits are expensive, we adopt the second option. This second approach is instruction-level parallelism (ILP).
•Pipelining: a technique for implementing instruction-level parallelism within a single processor.
•The hardware elements of the CPU are arranged so that more than one instruction is in execution at the same time, which increases overall performance.
•Pipelining attempts to keep every part of the processor busy by dividing each incoming instruction into a series of sequential steps, each performed by a different processor unit, so that different instructions are processed in parallel at different steps.
Pipelining: Laundry Example

• A small laundry has one washer, one dryer and one operator. It takes 90 minutes to finish one load (loads A, B, C, D):
– Washer takes 30 minutes
– Dryer takes 40 minutes
– “Operator folding” takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight showing loads A to D in task order, each taking 30 + 40 + 20 = 90 min, run one after another]
• This operator schedules loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he will not start a new task until he is done with the previous one.
• The process is sequential. Sequential laundry takes 6 hours for 4 loads (4 × 90 minutes).
Efficiently scheduled laundry: Pipelined Laundry

[Figure: timeline from 6 PM showing loads A to D in task order, overlapped; the stage times along the time axis are 30, 40, 40, 40, 40, 20 min]
• Another operator asks for the delivery of loads to the laundry every 40 minutes!
• Pipelined laundry takes 3.5 hours for 4 loads
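The two totals can be checked with a short sketch. It assumes the stage times from the example and that a washed load may wait until the dryer is free; the function names are illustrative only.

```python
# Stage times in minutes, taken from the laundry example.
STAGE_MINUTES = {"wash": 30, "dry": 40, "fold": 20}

def sequential_minutes(loads: int) -> int:
    """Each load starts only after the previous one is completely finished."""
    return loads * sum(STAGE_MINUTES.values())

def pipelined_minutes(loads: int) -> int:
    """Loads overlap: a load enters a stage once it has left the previous
    stage and the stage itself is free (assumption: loads may wait in between)."""
    stage_free = [0] * len(STAGE_MINUTES)   # time at which each stage becomes free
    finish = 0
    for _ in range(loads):
        ready = 0                           # time this load is ready for its next stage
        for i, minutes in enumerate(STAGE_MINUTES.values()):
            start = max(ready, stage_free[i])
            ready = start + minutes
            stage_free[i] = ready
        finish = ready
    return finish

print(sequential_minutes(4) / 60)  # hours for 4 loads, one after another
print(pipelined_minutes(4) / 60)   # hours for 4 loads, overlapped
```

The pipelined total is 30 + 4 × 40 + 20 minutes: one wash to fill the pipeline, four dryer slots (the slowest stage), and one fold to drain it.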
Pipelining Facts
• Multiple tasks operate simultaneously
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline from 6 PM to 9 PM for loads A to D, with stage times 30, 40, 40, 40, 40, 20 min; the washer waits for the dryer for 10 minutes]
Instruction pipeline versus sequential processing

sequential processing

Instruction pipeline
Instruction pipeline (Contd.)

Sequential processing is faster for few instructions
Performance of Pipelining system

•Throughput of the instruction pipeline is determined by how often an instruction exits the pipeline. Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput.

•Machine cycle: the time required to move an instruction one step further in the pipeline. The length of the machine cycle is determined by the time required for the slowest pipe stage.
Performance Measurement

• n: number of instructions (equivalent to the number of loads in the laundry example)
• k: number of stages in the pipeline (washing, drying and folding in the laundry example)
• τ: clock cycle (the slowest task time)
• T_k: total time on a k-stage pipeline

$$T_k = (k + (n - 1))\,\tau$$

$$\text{Speedup} = \frac{T_1}{T_k} = \frac{n k \tau}{(k + (n - 1))\,\tau} = \frac{n k}{k + (n - 1)}$$
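As a quick numeric check of these formulas, here is a minimal sketch; the function names are illustrative, and every stage is assumed to take one full clock cycle τ.

```python
def pipeline_time(n: int, k: int, tau: float) -> float:
    """Total time T_k for n instructions on a k-stage pipeline with cycle time tau."""
    return (k + (n - 1)) * tau

def speedup(n: int, k: int) -> float:
    """T_1 / T_k, where the non-pipelined time is T_1 = n * k * tau."""
    return (n * k) / (k + (n - 1))

print(speedup(1, 5))      # a single instruction gains nothing
print(speedup(4, 3))      # 4 loads, 3 stages
print(speedup(1000, 5))   # approaches k = 5 as n grows
```

With n = 4, k = 3 and τ = 40 minutes the formula gives 240 minutes, slightly more than the 3.5 hours of the laundry example, because the formula charges every stage the full 40-minute cycle while washing and folding are actually shorter.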
Pipeline Datapath for MIPS Instruction
Time Taken by each MIPS Instruction: Sequential Vs Pipeline Execution
Graphical Representation of ILP
Single Cycle Non Pipeline Datapath
Instruction Execution in Single Cycle Datapath Assuming Pipelining
Pipeline Version of Single Cycle Datapath for MIPS
Pipeline Control Issues and Hardware
• Here, the pipeline registers between stages perform the following work:
• IF/ID: Initializes control by passing the rs, rd, and rt fields of the instruction, together with the opcode and funct fields, to the control circuitry.
• ID/EX: Buffers control for the EX, MEM, and WB stages, while executing control for the EX stage. Control decides what operands will be input to the ALU, what ALU operation will be performed, and whether or not a branch is to be taken based on the ALU Zero output.
• EX/MEM: Buffers control for the MEM and WB stages, while executing control for the MEM stage. The control lines are set for memory read or write, as well as for data selection for memory write. This stage of control also contains the branch control logic.
• MEM/WB: Buffers and executes control for the WB stage, and selects the value to be written into the register file.
The control lines for the final three stages: four of the nine control lines are used in the EX stage, and the remaining five are passed on to the EX/MEM pipeline register, which is extended to hold the control lines; three are used during the MEM stage, and the last two are passed to MEM/WB for use in the WB stage.
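The 4 / 3 / 2 split can be sketched as a control word that shrinks as it moves through the pipeline registers. The slides do not name the nine lines; the names and the lw values below follow the usual textbook MIPS convention and are an assumption here, as is the helper `consume`.

```python
# Assumed signal names (textbook MIPS convention, not given in the slides).
EX_LINES = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc")
MEM_LINES = ("Branch", "MemRead", "MemWrite")
WB_LINES = ("RegWrite", "MemtoReg")

def consume(control: dict, lines: tuple) -> tuple:
    """Return (lines used in this stage, lines passed to the next pipeline register)."""
    used = {name: control[name] for name in lines}
    passed_on = {name: value for name, value in control.items() if name not in lines}
    return used, passed_on

# Control word for a load (lw), as generated in ID and stored in ID/EX.
id_ex = {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1,
         "Branch": 0, "MemRead": 1, "MemWrite": 0,
         "RegWrite": 1, "MemtoReg": 1}

ex_used, ex_mem = consume(id_ex, EX_LINES)      # 4 used in EX, 5 passed on
mem_used, mem_wb = consume(ex_mem, MEM_LINES)   # 3 used in MEM, 2 passed on
wb_used, leftover = consume(mem_wb, WB_LINES)   # 2 used in WB, nothing left

print(len(ex_used), len(mem_used), len(wb_used))
```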
Hazards in Pipelining System: Data & Branch
Overview of Hazards
• Pipelined processors have several problems associated with keeping instruction execution on the pipeline smooth and efficient. These problems are generally called hazards, and include the following three types:
• Structural Hazards occur when different instructions collide while trying to access the same piece of hardware in the same clock cycle. This type of hazard can be alleviated by having redundant hardware for the segments wherein the collision occurs. Occasionally, it is possible to insert stalls or reorder instructions to avoid this type of hazard.
Time Taken by each MIPS Instruction: Sequential Vs Pipeline Execution
Structural Hazard #1: in case of Single Memory
[Figure: pipeline diagram over time (clock cycles) for lw followed by Inst 1 to Inst 4, in instruction order, each passing through Mem, Reg, ALU, Mem, Reg. lw is reading data from memory while a later instruction is being read from the same memory.]
Read same memory twice in same clock cycle
Structural Hazard #1: Fix with separate instruction and data memories (I$ and D$)
[Figure: pipeline diagram over time (clock cycles) for lw followed by Instr 1 to Instr 4, in instruction order, each passing through I$, Reg, ALU, D$, Reg]
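The collision and its fix can be reproduced with a small sketch. It assumes the five-stage pipeline (IF, ID, EX, MEM, WB) with one instruction issued per cycle; the instruction mnemonics after lw and the function name are placeholders.

```python
def memory_conflicts(program, split_memories: bool):
    """Return the clock cycles in which one memory is accessed more than once.

    program: list of mnemonics, one issued per cycle starting at cycle 1.
    """
    accesses = {}  # (cycle, memory name) -> number of accesses
    for i, op in enumerate(program):
        fetch_cycle = i + 1                  # IF is the 1st stage
        mem_cycle = i + 4                    # MEM is the 4th stage
        instr_mem = "I$" if split_memories else "Mem"
        data_mem = "D$" if split_memories else "Mem"
        key = (fetch_cycle, instr_mem)
        accesses[key] = accesses.get(key, 0) + 1
        if op in ("lw", "sw"):               # only loads and stores touch data memory
            key = (mem_cycle, data_mem)
            accesses[key] = accesses.get(key, 0) + 1
    return sorted(cycle for (cycle, _), count in accesses.items() if count > 1)

program = ["lw", "add", "sub", "and", "or"]   # lw followed by four placeholder instructions
print(memory_conflicts(program, split_memories=False))  # cycles with a collision
print(memory_conflicts(program, split_memories=True))   # no collisions with I$ and D$
```

With a single memory, the data access of lw in cycle 4 coincides with the fetch of the fourth instruction; with separate I$ and D$ the two accesses go to different memories.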
Structural Hazard #2: Registers (1/2)
[Figure: pipeline diagram over time (clock cycles) for lw followed by Instr 1 to Instr 4, in instruction order, each passing through I$, Reg, ALU, D$, Reg]
Can we read and write to registers simultaneously?
Structural Hazard #2: Registers (2/2)

• Two different solutions have been used:
(1) RegFile access is very fast: takes less than half the time of ALU stage
• Write to Registers during first half of each clock cycle
• Read from Registers during second half of each clock cycle
(2) Build RegFile with independent read and write ports
• Result:
– can perform register Read and Write during same clock cycle
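Solution (1) can be modelled in a few lines. This is a behavioural sketch only, with an illustrative class name; it shows that a value written in the first half of a cycle is visible to a read in the second half of the same cycle.

```python
class RegFile:
    """Behavioural model: write in the first half of a cycle, read in the second."""

    def __init__(self):
        self.regs = [0] * 32                 # MIPS has 32 general-purpose registers

    def cycle(self, write=None, reads=()):
        # First half: the instruction in WB writes its result.
        if write is not None:
            reg, value = write
            if reg != 0:                     # $0 is hardwired to zero in MIPS
                self.regs[reg] = value
        # Second half: the instruction in ID reads its operands.
        return [self.regs[r] for r in reads]

rf = RegFile()
print(rf.cycle(write=(2, 99), reads=(2, 3)))  # the read of $2 sees the new value
```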
Overview of Hazards
• Data Hazards occur when an instruction depends on the result of a previous instruction still in the pipeline, whose result has not yet been computed or written back. The simplest remedy inserts stalls in the execution sequence, which reduces the pipeline's efficiency.
• The solution to data dependencies is twofold.
– First, one can forward the ALU result directly to the stages that need it, without waiting for writeback.
– Second, in selected instances, it is possible to restructure the code to eliminate some data dependencies.
• Control Hazards can result from branch instructions. Here, the branch target address might not be ready in time for the branch to be taken, which results in stalls (dead segments) in the pipeline that have to be inserted as local wait events, until the branch target is known and processing can resume. Control hazards can be mitigated through accurate branch prediction (which is difficult), and by delayed branch strategies.
Data Hazard
• Definition. A data hazard occurs when the current instruction requires the result of a preceding instruction, but there are insufficient segments in the pipeline to compute the result and write it back to the register file in time for the current instruction to read that result from the register file.
• We typically remedy this problem in one of three ways:
• Forwarding: In order to resolve a dependency, one adds special circuitry to the pipeline, consisting of wires and switches, with which one forwards or transmits the desired value to the pipeline segment that needs that value for computation. Although this adds hardware and control circuitry, the method works because it takes far less time for the required value(s) to travel through a wire than it does for a pipeline segment to compute its result.
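The "switches" in forwarding are multiplexers at the ALU inputs, driven by a comparison of register numbers. The sketch below follows the common textbook forwarding conditions rather than anything shown in these slides; the dictionary keys and the function name are illustrative.

```python
def forward_select(src_reg: int, ex_mem: dict, mem_wb: dict) -> str:
    """Pick where an ALU operand comes from: 'EX/MEM', 'MEM/WB' or 'REGFILE'.

    ex_mem / mem_wb describe the instructions one and two stages ahead:
    {"RegWrite": bool, "Rd": destination register number}.
    """
    # Most recent result first: the instruction that has just left EX.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src_reg:
        return "EX/MEM"
    # Otherwise the instruction that has just left MEM.
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"

# sub $2, $1, $3 has just left EX; and $12, $2, $5 is entering EX.
after_sub = {"RegWrite": True, "Rd": 2}
nothing = {"RegWrite": False, "Rd": 0}
print(forward_select(2, after_sub, nothing))  # operand $2 is taken from EX/MEM
print(forward_select(5, after_sub, nothing))  # operand $5 is read from the register file
```

Register 0 is excluded because it always reads as zero in MIPS, so a "result" destined for it must never be forwarded.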
Data Hazard
Example of data hazards in a sequence of MIPS instructions, where the red (blue) arrows indicate dependencies that are problematic (not problematic)
Operand Forwarding

Data Hazard
Data Hazard Solution using Operand Forwarding and Stall
Data Hazard Solution using Operand Forwarding and Stall (Cont.)
Data Hazard
• Code Re-Ordering: Here, the compiler reorders statements in the source code, or the assembler reorders object code, to place one or more statements between the current instruction and the instruction that computes the required operand. This requires an "intelligent" compiler or assembler, which must have detailed information about the structure and timing of the pipeline on which the data hazard would occur. We call this type of software a hardware-dependent compiler.
• Stall Insertion: It is possible to insert one or more stalls (no-op instructions) into the pipeline, which delays the execution of the current instruction until the required operand is written to the register file. This decreases pipeline efficiency and throughput, which is contrary to the goals of pipeline processor design. Stalls are an expedient method of last resort that can be used when compiler action or forwarding fails or is not supported in the hardware or software design.
• Problem: The first instruction (sub), starting on clock cycle 1 (CC1), completes on CC5, when the result in Register 2 is written to the register file. If we did nothing to resolve data dependencies, then no instruction that reads Register 2 from the register file could read the "new" value computed by the sub instruction until CC5. The dependencies in the other instructions are illustrated by solid lines with arrowheads. If register read and write cannot occur within the same clock cycle (we will see how this could happen in Section 5.3.4), then only the fifth instruction (sw) can access the contents of register 2 in the manner indicated by the flow of sequential execution in the MIPS code fragment shown previously.
• Solution #1 - Forwarding: The result generated by the sub instruction can be forwarded to the other stages of the pipeline using special control circuitry (a data bus switchable to any other segment, which can be implemented via a decoder or crossbar switch). This is indicated notionally in Figure 5.7 by solid red lines with arrowheads. If the register file can write in the first half of a cycle and read in the second half of the same cycle, then the forwarding in CC5 is not problematic. Otherwise, we would have to delay the execution of the add instruction by one clock cycle (see Figure 5.9 for insertion of a stall).
• Solution #2 - Code Re-Ordering: Since Instructions 2 through 5 in the MIPS code fragment all require Register 2 as an operand, we do not have instructions in that particular code fragment to put between Instruction 1 and Instruction 2. However, let us assume that we have other instructions that (a) do not depend on the results of Instructions 1-5, and (b) themselves induce no dependencies in Instructions 1-5 (e.g., by writing to register 1, 2, 3, 5, or 6). In that case, we could insert two instructions between Instructions 1 and 2, if register read and write could occur concurrently. Otherwise, we would have to insert three such instructions. The latter case is illustrated in the following figure, where the inserted instructions and their pipeline actions are colored dark green.
Example of code reordering to solve data hazards in a sequence of MIPS instructions
• Solution #3 - Stalls: Suppose that we had no instructions to
insert between Instructions 1 and 2. For example, there
might be data dependencies arising from the inserted
instructions that would themselves have to be repaired.
Alternatively, the program execution order (functional
dependencies) might not permit the reordering of code. In
such cases, we have to insert stalls, also called bubbles,
which are no-op instructions that merely delay the
pipeline execution until the dependencies are no longer
problematic with respect to pipeline timing. This is
illustrated in Figure 5.9 by inserting three stalls between
Instructions 1 and 2.
Example of stall insertion to solve data hazards in a sequence of MIPS instructions

• The insertion of stalls is the least desirable technique because it delays the execution of an instruction without accomplishing any useful work (in contrast to code re-ordering).
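The counts of two and three quoted above can be reproduced with a small model of a five-stage pipeline without forwarding. This is a sketch under stated assumptions: operands are read in ID, results are written in WB three cycles later, and the function name and tuple encoding of instructions are illustrative. The program is the sub / and / or / add / sw fragment.

```python
def stalls_without_forwarding(program, same_cycle_rw: bool) -> int:
    """Count bubbles needed when results reach consumers only via the register file.

    program: list of (dest_reg or None, [source regs]) in program order.
    same_cycle_rw: True if a register written in a cycle can also be read in
    that cycle (write in the first half, read in the second half).
    """
    readable_from = {}   # reg -> first cycle in which ID can read the new value
    stalls = 0
    prev_id = 1          # so that the first instruction has IF in cycle 1, ID in cycle 2
    for dest, sources in program:
        earliest = prev_id + 1                           # one instruction per cycle
        needed = max([readable_from.get(r, 0) for r in sources], default=0)
        this_id = max(earliest, needed)
        stalls += this_id - earliest
        prev_id = this_id
        if dest is not None:
            wb_cycle = this_id + 3                       # ID -> EX -> MEM -> WB
            readable_from[dest] = wb_cycle if same_cycle_rw else wb_cycle + 1
    return stalls

fragment = [
    (2,    [1, 3]),    # sub $2, $1, $3
    (12,   [2, 5]),    # and $12, $2, $5
    (13,   [6, 2]),    # or  $13, $6, $2
    (14,   [2, 2]),    # add $14, $2, $2
    (None, [15, 2]),   # sw  $15, 100($2)
]
print(stalls_without_forwarding(fragment, same_cycle_rw=True))
print(stalls_without_forwarding(fragment, same_cycle_rw=False))
```

All the bubbles land between sub and and; once and has been delayed, the later instructions read Register 2 after it has been written.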
Data Hazards and Stalls
if (ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt)))
stall the pipeline

• lw $s0, 20($t1) # lw rt, imm(rs)
• sub $t2, $s0, $t3 # sub rd, rs, rt

Note: The first line tests whether the instruction in the EX stage is a load: the only instruction that reads data memory is a load. The next two lines check whether the destination register field (rt) of the load in the EX stage matches either source register of the instruction in the ID stage. If the condition holds, the instruction in the ID stage stalls for 1 clock cycle.
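The same test can be written as an executable sketch. The dictionaries stand in for the ID/EX and IF/ID pipeline registers and the function name is illustrative; the register numbers use the standard MIPS numbering ($t1 = 9, $t2 = 10, $t3 = 11, $s0 = 16).

```python
def must_stall(id_ex: dict, if_id: dict) -> bool:
    """Load-use hazard check: is the load's destination a source of the next instruction?"""
    return bool(
        id_ex["MemRead"]
        and (id_ex["RegisterRt"] == if_id["RegisterRs"]
             or id_ex["RegisterRt"] == if_id["RegisterRt"])
    )

# lw $s0, 20($t1) is in EX: rs = $t1 (9), rt = $s0 (16), and it reads memory.
lw_in_ex = {"MemRead": True, "RegisterRt": 16}
# sub $t2, $s0, $t3 is in ID: rs = $s0 (16), rt = $t3 (11).
sub_in_id = {"RegisterRs": 16, "RegisterRt": 11}

print(must_stall(lw_in_ex, sub_in_id))  # the sub must wait one cycle
```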
Stall when a dependent R-format instruction follows a load instruction (pipeline stall)
Control Hazards

• Branch determines flow of control
– Fetching next instruction depends on branch outcome
– The delay in determining the proper instruction to fetch is called a control hazard or branch hazard.
– Pipeline can’t always fetch correct instruction
• Still working on ID stage of branch
• beq, bne in MIPS pipeline
Control Hazards Simple Solution Option 1: two Stalls

Stall on every branch until we have the new PC value; this would add 2 bubbles (clock cycles) for every branch!

[Figure: pipeline diagram over time (clock cycles): beq (compare done in EX), followed by two nop bubbles, then the following instructions, each passing through I$, Reg, ALU, D$, Reg]

Where do we do the compare for the branch?

Control Hazard: Branching
• Optimization #1:
– Insert special branch comparator in Stage 2 (Dec)
– As soon as instruction is decoded (i.e. Opcode identifies it
as a branch), immediately make a decision and set the
new value of the PC
– Benefit: since branch is complete in Stage 2, only one
unnecessary instruction is fetched, so only one no-op is
needed
– Side Note: this means that branches are idle in Stages 3, 4 and 5
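The benefit of resolving the branch in Stage 2 can be put in numbers with the usual back-of-the-envelope model: every branch adds a fixed number of bubbles to an otherwise ideal pipeline. The 20% branch frequency below is an assumed figure for illustration, not taken from the slides.

```python
def effective_cpi(branch_fraction: float, bubbles_per_branch: int,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when each branch costs a fixed number of bubbles."""
    return base_cpi + branch_fraction * bubbles_per_branch

BRANCH_FRACTION = 0.2  # assumption: one instruction in five is a branch

print(effective_cpi(BRANCH_FRACTION, 2))  # compare in EX: two bubbles per branch
print(effective_cpi(BRANCH_FRACTION, 1))  # compare in ID: one bubble per branch
```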
Special Branch Comparator with One Clock Cycle Stall

[Figure: pipeline diagram over time (clock cycles): beq (compare done in ID/RF), followed by one nop bubble, then the following instructions, each passing through I$, Reg, ALU, D$, Reg]

Branch comparator moved to Decode stage
Control Hazards: Branch Delay Slot

• Optimization #2: Redefine branches
– Old definition: if we take the branch, none of the instructions after the branch get executed by accident
– New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (the branch-delay slot)
• Delayed Branch means we always execute the instruction after branch
• This optimization is used with MIPS.
Example: Nondelayed vs. Delayed Branch
Nondelayed Branch        Delayed Branch
or  $8, $9, $10          add $1, $2, $3
add $1, $2, $3           sub $4, $5, $6
sub $4, $5, $6           beq $1, $4, Exit
beq $1, $4, Exit         or  $8, $9, $10
xor $10, $1, $11         xor $10, $1, $11
Exit:                    Exit:
Notes on Branch-Delay Slot

– Worst-Case Scenario: put a no-op in the branch-delay slot
– Better Case: place some instruction preceding the branch in the branch-delay slot, as long as the change doesn’t affect the logic of the program
• Re-ordering instructions is a common way to speed up programs
• Compiler finds such an instruction about 50% of the time
• Jumps also have a delay slot …
Control Hazards: Branch Prediction

• Opt #3: Predict outcome of a branch, fix up if guess wrong
– Must cancel all instructions in pipeline that depended on the wrong guess
– This is called “flushing” the pipeline
• Opt 3.1: Assume branches are NOT taken and continue execution down the sequential instruction stream. If the branch is taken, the instructions that are being fetched and decoded must be discarded. Execution continues at the branch target.
– If branches are untaken half the time, and if it costs little to discard the instructions, this optimization halves the cost of control hazards.
• Opt 3.2: Dynamic branch prediction: Prediction of branches at runtime using runtime information.
– branch prediction buffer or branch history table
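A branch history table can be sketched in its simplest form: one bit per entry, indexed by the low bits of the branch address, predicting that a branch will do what it did last time. The slides do not specify a scheme; the 1-bit policy, the table size of 16, and all names below are assumptions for illustration.

```python
class BranchHistoryTable:
    """One bit per entry: predict that a branch does what it did last time."""

    def __init__(self, entries: int = 16):
        self.taken = [False] * entries       # start by predicting not taken

    def _index(self, pc: int) -> int:
        return (pc >> 2) % len(self.taken)   # drop the 2 byte-offset bits of a word address

    def predict(self, pc: int) -> bool:
        return self.taken[self._index(pc)]

    def update(self, pc: int, was_taken: bool) -> None:
        self.taken[self._index(pc)] = was_taken

def count_mispredictions(pc: int, outcomes) -> int:
    """Run one branch through the table and count wrong guesses (pipeline flushes)."""
    bht = BranchHistoryTable()
    wrong = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            wrong += 1
        bht.update(pc, taken)
    return wrong

# A loop branch at address 40 that is taken 9 times and then falls through.
loop_outcomes = [True] * 9 + [False]
print(count_mispredictions(40, loop_outcomes))
```

For this loop the table is wrong only on the first and the last iteration, whereas always assuming "not taken" would be wrong on all nine taken iterations.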
Exercise 1
• For the following code sequence in MIPS,
– Indicate the dependences
– Indicate the potential hazards and types
– Provide your hazard resolution methods and show how many extra
clock cycles you have to pay.

sub $2, $1, $3      # Register $2 written by sub
and $12, $2, $5     # 1st operand ($2) depends on sub
or  $13, $6, $2     # 2nd operand ($2) depends on sub
add $14, $2, $2     # 1st ($2) & 2nd ($2) depend on sub
sw  $15, 100($2)    # Base ($2) depends on sub
Exercise 2

• Show what happens when the branch is taken in this instruction sequence, assuming the pipeline is optimized for branches that are not taken and that we moved the branch execution to the ID stage. The numbers to the left of the instruction (40, 44, ...) are the addresses of the instructions.

36 sub $10, $4, $8
40 beq $1, $3, 7 # PC-relative branch to 40 + 4 + 7 * 4 = 72
44 and $12, $2, $5
48 or $13, $2, $6
52 add $14, $4, $2
56 slt $15, $6, $7
… …
72 lw $4, 50($7)

https://www.cise.ufl.edu/~mssz/CompOrg/CDA-pipe.html

Thank you
