
CprE 381 – Computer Organization and Assembly Level Programming

Lecture 08 – Pipelining

Joseph Zambreno
Electrical and Computer Engineering
Iowa State University

www.ece.iastate.edu/~zambreno
rcl.ece.iastate.edu

As always, there's a couple of things in the pipeline - but that pipeline is a strange
and ambiguous place – Hugh Dancy
This Week’s Topic
• Multi-cycle processor datapath and control
– P&H D.3 (to some extent)
– Will move past fairly quickly
• Introduction to pipelined processor design
– P&H 4.5-4.6
– Pipelined datapath
– Pipelined control

• Continue work on Project Part B


• Don’t forget HW-08
– Single-cycle architectures and performance
• Midterm #2, Nov 6

Zambreno, Fall 2019 © ISU CprE 381 (Pipelining) Lect-08.2


Abstract View of the Single Cycle CPU
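[Figure: abstract block diagram of the single-cycle CPU. Main Control (from op) and ALU control (from fun) sit above the datapath; signals shown are nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, and Equal. Datapath blocks: Next PC, PC, Instruction Fetch, Register Fetch, Ext, ALU, Mem Access, Data Mem, Result Store / Reg. Wrt.]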
• Looks like an FSM with the PC as its state
What’s wrong with our CPI=1 CPU?
Arithmetic & Logical: PC → Inst Memory → Reg File → mux → ALU → mux → setup
Load (critical path): PC → Inst Memory → Reg File → mux → ALU → Data Mem → mux → setup
Store: PC → Inst Memory → Reg File → mux → ALU → Data Mem
Branch: PC → Inst Memory → Reg File → cmp → mux

• Long Cycle Time


• All instructions take as much time as the slowest
• Real memory is not so nice as our idealized memory
– Cannot always get the job done in one (short) cycle



Single Cycle Disadvantages & Advantages
• Uses the clock cycle inefficiently – the clock cycle must
be timed to accommodate the slowest instr
– Especially problematic for more complex instructions like
floating point multiply

[Timing diagram: lw fills all of Cycle 1; sw finishes early in Cycle 2 and the rest of that cycle is wasted.]

• May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
But
• It is simple and easy to understand



Multicycle Implementation Overview
• Each instruction step takes 1 clock cycle
– Therefore, an instruction takes more than 1 clock cycle to complete
• Not every instruction takes the same number of clock cycles
to complete

• Multicycle implementations allow:


– faster clock rates
– different instructions to take a different number of clock cycles
– functional units to be used more than once per instruction as long as
they are used on different clock cycles, as a result
• only need one memory
• only need one ALU/adder

• Challenge: designing control logic


The Multicycle Datapath
• Registers have to be added after every major
functional unit to hold the output value until it is used
in a subsequent clock cycle

[Figure: multicycle datapath. The PC supplies the Address to a single Memory (instructions or data); Memory Read Data is held in the IR and the MDR; the Register File outputs Read Data 1 and Read Data 2 are held in A and B; the ALU result is held in ALUout.]



Clocking the Multicycle Datapath

[Figure: the same multicycle datapath (PC, Memory, IR, MDR, Register File, A, B, ALU, ALUout) driven by the System Clock, with one clock cycle marked and the MemWrite and RegWrite signals shown.]



Our Multicycle Approach
• Break up the instructions into steps where each step takes a
clock cycle while trying to
– Balance the amount of work to be done in each step
– Use only one major functional unit per clock cycle
• At the end of a clock cycle
– Store values needed in a later clock cycle by the current instruction in
a state element (internal register not visible to the programmer)
• IR – Instruction Register
• MDR – Memory Data Register
• A and B – Register File read data registers
• ALUout – ALU output register
• All (except IR) hold data only between a pair of adjacent clock
cycles (so they don’t need a write control signal)
– Data used by subsequent instructions are stored in programmer visible
state elements (i.e., Register File, PC, or Memory)



The Complete Multicycle Datapath with Control
[Figure: the complete multicycle datapath with its Control unit. Control takes Instr[31-26] and drives PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, and RegDst. The jump address is formed from PC[31-28] and Instr[25-0] shifted left 2 (28 bits); Instr[15-0] is sign-extended to 32 bits and shifted left 2; Instr[5-0] feeds the ALU control; the constant 4 is one of the ALU's second-operand choices.]



Our Multicycle Approach (cont.)
• Reading from or writing to any of the internal registers, Register
File, or the PC occurs (quickly) at the beginning (for read) or the
end of a clock cycle (for write)

• Reading from the Register File takes ~50% of a clock cycle since it
has additional control and access overhead (but reading can be
done in parallel with decode)

• Had to add multiplexors in front of several of the functional unit input ports (e.g., Memory, ALU) because they are now shared by different clock cycles and/or do multiple jobs

• All operations occurring in one clock cycle occur in parallel


– This limits us to one ALU operation, one Memory access, and one
Register File access per clock cycle



Five Instruction Steps
1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. R-type Instruction Execution, Memory Read/Write Address Computation, Branch Completion, or Jump Completion
4. Memory Read Access, Memory Write Completion, or R-type Instruction Completion
5. Memory Read Completion (Write Back)

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
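
The step list above implies a cycle count per instruction class: branches and jumps complete in step 3, stores and R-type instructions in step 4, and loads in step 5. The sketch below turns that into an average CPI. The instruction mix is a hypothetical example chosen for illustration, not data from the lecture.

```python
# Cycles per instruction class, read off the five-step breakdown above.
CYCLES = {"lw": 5, "sw": 4, "rtype": 4, "beq": 3, "j": 3}


def average_cpi(mix):
    """Weighted average CPI for a mix given as {class: fraction}."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(CYCLES[kind] * fraction for kind, fraction in mix.items())


# Hypothetical mix, used only to exercise the function.
example_mix = {"lw": 0.25, "sw": 0.10, "rtype": 0.45, "beq": 0.15, "j": 0.05}
print(f"average CPI = {average_cpi(example_mix):.2f}")
```

A mix made up entirely of loads gives the worst case of 5 cycles per instruction; any mix with branches or jumps pulls the average toward 3.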


Step 1: Instruction Fetch
• Use PC to get instruction from the memory
and put it in the Instruction Register
• Increment the PC by 4 and put the result
back in the PC
• Can be described succinctly using RTL (“Register-Transfer Language”)
IR = Memory[PC];
PC = PC + 4;
Can we figure out the values of the control signals?

What is the advantage of updating the PC now?
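
The two RTL statements for the fetch step can be mirrored in a few lines of Python. This is only a sketch: the dictionary-based `state` and `memory` (keyed by byte address) and the sample instruction word are assumptions made for the example, not part of the lecture's datapath.

```python
def instruction_fetch(state, memory):
    """Step 1: IR = Memory[PC]; PC = PC + 4."""
    state["IR"] = memory[state["PC"]]             # read the instruction word
    state["PC"] = (state["PC"] + 4) & 0xFFFFFFFF  # keep the PC to 32 bits
    return state


# Hypothetical one-word program: lw $9, 4($4), encoded as 0x8C890004.
memory = {0x00400000: 0x8C890004}
state = instruction_fetch({"PC": 0x00400000, "IR": 0}, memory)
print(hex(state["IR"]), hex(state["PC"]))
```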



Datapath Activity During Instr Fetch
[Figure: the complete multicycle datapath with control, as on the earlier slide, showing datapath activity during instruction fetch.]



Fetch Control Signal Settings
• Start (unless otherwise assigned): PCWrite, IRWrite, MemWrite, RegWrite = 0; others = X
• Instr Fetch: IorD=0; MemRead; IRWrite; ALUSrcA=0; ALUSrcB=01; PCSource, ALUOp=00; PCWrite



Step 2: Instr Decode and Reg Fetch
• Don’t know what the instruction is yet, so can only
– Read registers rs and rt in case we need them
– Compute the branch address in case the instruction is a branch
• The RTL:

A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);

• Note we aren't setting any control lines based on the instruction, since we don’t know what it is yet (the control logic is busy "decoding" the op code bits)
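
The decode step's RTL can be sketched the same way. The helper names, the list-based register file, and the sample `beq` encoding are assumptions for the example; the three assignments follow the RTL above, with `pc` being the already-incremented PC from step 1.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode_and_register_fetch(ir, pc, regs):
    """Step 2: read rs and rt, and speculatively compute the branch target."""
    a = regs[(ir >> 21) & 0x1F]                             # A = Reg[IR[25-21]]
    b = regs[(ir >> 16) & 0x1F]                             # B = Reg[IR[20-16]]
    alu_out = (pc + (sign_extend16(ir) << 2)) & 0xFFFFFFFF  # branch address
    return a, b, alu_out


# Hypothetical example: beq $1, $2, -2 is encoded as 0x1022FFFE.
regs = [0] * 32
regs[1] = 7
regs[2] = 7
a, b, alu_out = decode_and_register_fetch(0x1022FFFE, 0x00400008, regs)
print(a, b, hex(alu_out))
```

The register reads and the address computation are done for every instruction, whether or not it turns out to need them, which is why no opcode appears in the function.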



Datapath Activity During Instr Decode
[Figure: the complete multicycle datapath with control, as on the earlier slide, showing datapath activity during instruction decode and register fetch.]



Decode Control Signals Settings
• Start (unless otherwise assigned): PCWrite, IRWrite, MemWrite, RegWrite = 0; others = X
• Instr Fetch: IorD=0; MemRead; IRWrite; ALUSrcA=0; ALUSrcB=01; PCSource, ALUOp=00; PCWrite
• Decode: ALUSrcA=0; ALUSrcB=11; ALUOp=00; PCWriteCond=0



Step 3 (Instruction Dependent)
• ALU is performing one of four functions, based on
instruction type

• Memory reference (lw and sw):


ALUOut = A + sign-extend(IR[15-0]);

• R-type:
ALUOut = A op B;

• Branch:
if (A==B) PC = ALUOut;
• Jump:
PC = PC[31-28] || (IR[25-0] << 2);
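
The four cases of step 3 can be sketched as one function that returns the updated PC and ALUOut. The `kind` strings, the `op` names, and the small R-type operation table are hypothetical conveniences for the example; the four assignments follow the RTL above.

```python
import operator

MASK32 = 0xFFFFFFFF
# Hypothetical subset of R-type operations, keyed by an illustrative name.
R_TYPE_OPS = {"add": operator.add, "sub": operator.sub,
              "and": operator.and_, "or": operator.or_}


def sign_extend16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def execute(kind, a, b, ir, pc, alu_out, op="add"):
    """Step 3: returns (pc, alu_out) after the instruction-dependent action."""
    if kind == "mem":                       # lw / sw address computation
        alu_out = (a + sign_extend16(ir)) & MASK32
    elif kind == "rtype":                   # ALUOut = A op B
        alu_out = R_TYPE_OPS[op](a, b) & MASK32
    elif kind == "branch":                  # if (A == B) PC = ALUOut
        if a == b:
            pc = alu_out
    elif kind == "jump":                    # PC = PC[31-28] || (IR[25-0] << 2)
        pc = (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")
    return pc, alu_out


print(execute("mem", 0x1000, 0, 0x8C890004, 0x00400004, 0))
print(execute("jump", 0, 0, 0x08100002, 0x00400004, 0))
```

Note that the branch case relies on ALUOut already holding the target computed during decode, so the ALU is free to do the comparison in this step.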



Datapath Activity During lw/sw Execute
[Figure: the complete multicycle datapath with control, as on the earlier slide, showing datapath activity during the lw/sw execute step.]



Execute Control Signals Settings
• Start (unless otherwise assigned): PCWrite, IRWrite, MemWrite, RegWrite = 0; others = X
• Instr Fetch: IorD=0; MemRead; IRWrite; ALUSrcA=0; ALUSrcB=01; PCSource, ALUOp=00; PCWrite
• Decode: ALUSrcA=0; ALUSrcB=11; ALUOp=00; PCWriteCond=0
• Execute (one state per instruction type):
– lw/sw: ALUSrcA=1; ALUSrcB=10; ALUOp=00; PCWriteCond=0
– R-type: ALUSrcA=1; ALUSrcB=00; ALUOp=10; PCWriteCond=0
– Branch: ALUSrcA=1; ALUSrcB=00; ALUOp=01; PCSource=01; PCWriteCond
– Jump: PCSource=10; PCWrite



Finishing Up – Write Back Control Settings
• Start (unless otherwise assigned): PCWrite, IRWrite, MemWrite, RegWrite = 0; others = X
• Instr Fetch: IorD=0; MemRead; IRWrite; ALUSrcA=0; ALUSrcB=01; PCSource, ALUOp=00; PCWrite
• Decode: ALUSrcA=0; ALUSrcB=11; ALUOp=00; PCWriteCond=0
• Execute (one state per instruction type):
– lw/sw: ALUSrcA=1; ALUSrcB=10; ALUOp=00; PCWriteCond=0
– R-type: ALUSrcA=1; ALUSrcB=00; ALUOp=10; PCWriteCond=0
– Branch: ALUSrcA=1; ALUSrcB=00; ALUOp=01; PCSource=01; PCWriteCond
– Jump: PCSource=10; PCWrite
• Memory Access:
– lw: MemRead; IorD=1; PCWriteCond=0
– sw: MemWrite; IorD=1; PCWriteCond=0
– R-type completion: RegDst=1; RegWrite; MemtoReg=0; PCWriteCond=0
• Write Back (lw): RegDst=0; RegWrite; MemtoReg=1; PCWriteCond=0



Multicycle Control
• Multicycle datapath control signals are not determined
solely by the bits in the instruction
– e.g., op code bits tell what operation the ALU should be doing, but
not what instruction cycle is to be done next
• We can use a finite state machine for control
– A set of states (current state stored in State Register)
– Next state function (determined by current state and the input)
– Output function (determined by current state)

[Figure: combinational control logic takes the instruction opcode and the current state from the State Reg, and produces the next state and the datapath control points.]

• So we are using a Moore machine (datapath control signals based only on current state)
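
A minimal Python sketch of such a Moore machine is shown below. The state names and the five instruction classes are labels invented for this example; the signal values in `OUTPUTS` are the ones listed on the control settings slides, with one-bit signals written as 1 when asserted.

```python
def next_state(state, kind):
    """Next-state function: depends on the current state and the opcode class."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        return {"lw": "mem_addr", "sw": "mem_addr", "rtype": "rtype_exec",
                "beq": "branch", "j": "jump"}[kind]
    if state == "mem_addr":
        return "mem_read" if kind == "lw" else "mem_write"
    if state == "mem_read":
        return "write_back"
    if state == "rtype_exec":
        return "rtype_done"
    # mem_write, write_back, rtype_done, branch, and jump end the instruction.
    return "fetch"


# Output function: a Moore machine's outputs depend only on the current state.
OUTPUTS = {
    "fetch":      {"MemRead": 1, "IRWrite": 1, "IorD": 0, "ALUSrcA": 0,
                   "ALUSrcB": 0b01, "ALUOp": 0b00, "PCSource": 0b00, "PCWrite": 1},
    "decode":     {"ALUSrcA": 0, "ALUSrcB": 0b11, "ALUOp": 0b00},
    "mem_addr":   {"ALUSrcA": 1, "ALUSrcB": 0b10, "ALUOp": 0b00},
    "rtype_exec": {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b10},
    "branch":     {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b01,
                   "PCSource": 0b01, "PCWriteCond": 1},
    "jump":       {"PCSource": 0b10, "PCWrite": 1},
    "mem_read":   {"MemRead": 1, "IorD": 1},
    "mem_write":  {"MemWrite": 1, "IorD": 1},
    "rtype_done": {"RegDst": 1, "RegWrite": 1, "MemtoReg": 0},
    "write_back": {"RegDst": 0, "RegWrite": 1, "MemtoReg": 1},
}


def run_instruction(kind):
    """Return the sequence of states one instruction passes through."""
    states = ["fetch"]
    while True:
        following = next_state(states[-1], kind)
        if following == "fetch":
            return states
        states.append(following)


for kind in ("lw", "sw", "rtype", "beq", "j"):
    print(kind, run_instruction(kind))
```

Counting the states visited reproduces the 3 to 5 cycle range: five for `lw`, four for `sw` and R-type, three for `beq` and `j`.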
Simplifying the Control Unit Design
• For an implementation of the full MIPS ISA, instructions can take from 3 clock cycles to 20+ clock cycles
– Resulting in finite state machines with hundreds to thousands of states with even more arcs (state sequences)
• Such state machine representations become impossibly complex
• Instead, can represent the set of control signals that are asserted during a state as a low-level control “instruction” to be executed by the datapath: a microinstruction

• “Executing” the microinstruction is equivalent to asserting the control signals specified by the microinstruction



Microprogramming
• A microinstruction has to specify
– what control signals should be asserted
– what microinstruction should be executed next
• Each microinstruction corresponds to one state in the FSM
and is assigned a state number (or “address”)
1. Sequential behavior – increment the state (address) of the current
microinstruction to get to the state (address) of the next
2. Jump to the microinstruction that begins execution of the next MIPS
instruction (state 0)
3. Branch to a microinstruction based on control unit input using
dispatch tables
• need one for microinstructions following state 1
• need another for microinstructions following state 2
• The set of microinstructions that define a MIPS assembly
language instruction (macroinstruction) is its microroutine
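
The three sequencing options can be sketched with a small microprogram store and two dispatch tables. The microinstruction labels, their addresses, and the five instruction classes are hypothetical; only the sequencing mechanism (increment, return to state 0, or dispatch after state 1 or state 2) comes from the slide.

```python
# Each entry is (label, sequencing); the list index is the microinstruction address.
MICROPROGRAM = [
    ("Fetch",    "seq"),        # 0: fall through to the next address
    ("Decode",   "dispatch1"),  # 1: branch on opcode using dispatch table 1
    ("Mem1",     "dispatch2"),  # 2: branch on opcode using dispatch table 2
    ("LW2",      "seq"),        # 3
    ("LW3",      "fetch"),      # 4: go back to address 0 for the next instruction
    ("SW2",      "fetch"),      # 5
    ("Rformat1", "seq"),        # 6
    ("Rformat2", "fetch"),      # 7
    ("BEQ1",     "fetch"),      # 8
    ("JUMP1",    "fetch"),      # 9
]
DISPATCH1 = {"rtype": 6, "lw": 2, "sw": 2, "beq": 8, "j": 9}
DISPATCH2 = {"lw": 3, "sw": 5}


def microroutine(kind):
    """Return the labels of the microinstructions executed for one instruction."""
    address = 0
    trace = []
    while True:
        label, sequencing = MICROPROGRAM[address]
        trace.append(label)
        if sequencing == "seq":
            address += 1
        elif sequencing == "dispatch1":
            address = DISPATCH1[kind]
        elif sequencing == "dispatch2":
            address = DISPATCH2[kind]
        else:  # "fetch": the microroutine is complete
            return trace


for kind in ("lw", "sw", "rtype", "beq", "j"):
    print(kind, microroutine(kind))
```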



Multicycle Advantages & Disadvantages
• Uses the clock cycle efficiently – the clock cycle is timed to
accommodate the slowest instruction step
– Balance the amount of work to be done in each step
– Restrict each step to use only one major functional unit
• Multicycle implementations allow
– Faster clock rates
– Different instructions to take a different number of clock cycles
– Functional units to be used more than once per instruction as long as
they are used on different clock cycles
But
• Requires additional internal state registers, muxes, and more
complicated (FSM) control



Single Cycle vs. Multiple Cycle Timing
Single Cycle Implementation:
[Timing diagram: lw fills all of Cycle 1; sw finishes early in Cycle 2 and the rest of that cycle is wasted.]

Multiple Cycle Implementation:
(the multicycle clock cycle is longer than 1/5 of the single-cycle clock cycle because of state register overhead)

Cycle   1       2       3       4       5       6       7       8       9       10
lw      IFetch  Dec     Exec    Mem     WB
sw                                              IFetch  Dec     Exec    Mem
R-type                                                                          IFetch



Increasing Parallelism
• Problem with the multi-cycle processor:
– Each functional unit used once per cycle
– Most of the time it is sitting waiting for its turn
• Well it is calculating all the time, but it is waiting for valid
data
– There is no parallelism in this arrangement
• Making instructions take more cycles can make machine
faster!
– Each instruction takes roughly the same time
• While the CPI is much worse, the clock freq is much higher
– Overlap execution of multiple instructions at the same time
• Different instructions will be active at the same time
– This is called “Pipelining”
– We will look at a 5 stage pipeline
• Modern machines (e.g., Intel Core 2) have anywhere from 10 to 30+ stages per instruction



Pipelining: You Do it All the Time!
• Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, fold, and put away

• Washer takes 30 minutes

• Dryer takes 30 minutes

• “Folder” takes 30 minutes

• “Stasher” takes 30 minutes to put clothes into drawers



Sequential Laundry

• Sequential laundry takes 8 hours for 4 loads
Pipelined Laundry

• Pipelined laundry takes 3.5 hours for 4 loads!
• Speedup = 8 / 3.5 ≈ 2.3
• General case: for n loads, speedup = 2n / (0.5n + 1.5), which approaches 4 (the number of stages) as n grows
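
The general-case figure can be derived as follows, assuming k equal stages of t hours each and n loads (k = 4 and t = 0.5 in the laundry example). The symbols k, t, and n are introduced here for the derivation only.

```latex
\begin{align*}
T_{\text{seq}}(n)  &= n\,k\,t = 2n \\
T_{\text{pipe}}(n) &= (k + n - 1)\,t = 0.5n + 1.5 \\
\text{Speedup}(n)  &= \frac{T_{\text{seq}}(n)}{T_{\text{pipe}}(n)} = \frac{2n}{0.5n + 1.5} \\
\text{Speedup}(4)  &= \frac{8}{3.5} \approx 2.3,
\qquad \lim_{n \to \infty} \text{Speedup}(n) = k = 4
\end{align*}
```

The k - 1 extra stage times in the pipelined total are the cost of filling the pipeline; they matter less and less as n grows.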
Pipelining Lessons

• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Multiple tasks operating simultaneously using different resources
• Potential speedup = number of pipe stages
• Time to “fill” the pipeline and time to “drain” it reduce speedup: 2.3X vs. 4X in this example



Pipelining Lessons (cont.)
• Suppose the new Washer takes 20 minutes and the new Stasher takes 20 minutes. How much faster is the pipeline?
• Pipeline rate is limited by the slowest pipeline stage
• Unbalanced lengths of pipe stages reduce speedup
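
A short Python sketch makes the effect of unbalanced stages concrete. It assumes a lockstep model in which every stage advances at the pace of the slowest one, which is how a clocked processor pipeline behaves; real laundry is not strictly lockstep, so treat the numbers as illustrative. The function names are made up for the example.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return sum(stage_minutes) * loads


def pipelined_minutes(stage_minutes, loads):
    """Lockstep pipeline: the step time is set by the slowest stage."""
    step = max(stage_minutes)
    return (len(stage_minutes) + loads - 1) * step


def speedup(stage_minutes, loads):
    return sequential_minutes(stage_minutes, loads) / pipelined_minutes(stage_minutes, loads)


balanced = [30, 30, 30, 30]      # washer, dryer, folder, stasher
faster_ends = [20, 30, 30, 20]   # new washer and new stasher

for stages in (balanced, faster_ends):
    print(stages, pipelined_minutes(stages, 4), round(speedup(stages, 4), 2))
```

Under this model the pipelined time for 4 loads stays at 210 minutes with the faster washer and stasher, because the 30-minute dryer and folder still set the pace; only the sequential time improves, so the measured speedup drops.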
5 Stage MIPS Execution
• IF: Instruction Fetch
– Fetch the instruction from memory
– Increment the PC
• RF/ID: Register Fetch and Instruction Decode
– Fetch base register
• EX: Execute
– Calculate base + sign-extended offset
• MEM: Memory
– Read the data from the data memory
• WB: Write back
– Write the results back to the register file
A Pipelined MIPS Processor
• Start the next instruction before the current one has
completed
– Improves throughput - total amount of work done in a given time
– Instruction latency (execution time, delay time, response time -
time from the start of an instruction to its completion) is not
reduced

Cycle   1       2       3       4       5       6       7       8
lw      IFetch  Dec     Exec    Mem     WB
sw              IFetch  Dec     Exec    Mem     WB
R-type                  IFetch  Dec     Exec    Mem     WB

– Clock cycle (pipeline stage time) is limited by the slowest stage


– For some instructions, some stages are wasted cycles
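
The staggered timing above can be generated mechanically. The sketch below assumes an ideal pipeline that issues one instruction per cycle with no stalls or hazards; the function names are hypothetical.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def pipeline_schedule(instructions):
    """Return [(name, {cycle: stage})], issuing one instruction per cycle."""
    schedule = []
    for issue_cycle, name in enumerate(instructions, start=1):
        stages = {issue_cycle + offset: stage for offset, stage in enumerate(STAGES)}
        schedule.append((name, stages))
    return schedule


def total_cycles(instruction_count):
    """Cycles until the last instruction leaves the pipeline."""
    return instruction_count + len(STAGES) - 1


def render(schedule):
    """Format the schedule as a text timing diagram, one row per instruction."""
    last_cycle = max(max(stages) for _, stages in schedule)
    rows = []
    for name, stages in schedule:
        cells = [stages.get(cycle, "").ljust(8) for cycle in range(1, last_cycle + 1)]
        rows.append((name.ljust(8) + "".join(cells)).rstrip())
    return "\n".join(rows)


schedule = pipeline_schedule(["lw", "sw", "R-type"])
print(render(schedule))
print("total cycles:", total_cycles(len(schedule)))
```

Every instruction passes through all five stages here, including ones it does not need (for example Mem for an R-type), which is the "wasted cycles" point made above.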
Single Cycle, Multiple Cycle, vs. Pipeline
Single Cycle Implementation:
[Timing diagram: lw fills all of Cycle 1; sw finishes early in Cycle 2 and the rest of that cycle is wasted.]

Multiple Cycle Implementation:

Cycle   1       2       3       4       5       6       7       8       9       10
lw      IFetch  Dec     Exec    Mem     WB
sw                                              IFetch  Dec     Exec    Mem
R-type                                                                          IFetch

Pipeline Implementation (pipeline clock same as multi-cycle clock):

lw      IFetch  Dec     Exec    Mem     WB
sw              IFetch  Dec     Exec    Mem     WB
R-type                  IFetch  Dec     Exec    Mem     WB



MIPS Pipeline Datapath Modifications
• What do we need to add/modify in our MIPS datapath?
– State registers between each pipeline stage to isolate them

[Figure: pipelined datapath with five stages (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack). The PC, Instruction Memory, Register File, sign extender, ALU, adders, and Data Memory are separated by the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB state registers, all driven by the System Clock.]



MIPS Pipeline Control Path Modifications
• All control signals can be determined during Decode
– And held in the state registers between pipeline stages

[Figure: the pipelined datapath with a Control unit in the decode stage; the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers are shown, with the control signals held in the registers after decode.]



Implementing Control



Pipelining the MIPS ISA
• What makes it easy
– All instructions are the same length (32 bits)
• Can fetch in the 1st stage and decode in the 2nd stage
– Few instruction formats (three) with symmetry across formats
• Can begin reading register file in 2nd stage
– Memory operations can occur only in loads and stores
• Can use the execute stage to calculate memory addresses
– Each MIPS instruction writes at most one result (i.e., changes the
machine state) and does so near the end of the pipeline (MEM and
WB)
• What makes it hard
– Structural hazards: what if we had only one memory?
– Control hazards: what about branches?
– Data hazards: what if an instruction’s input operands depend on the
output of a previous instruction?



Graphically Representing MIPS Pipeline

• Can help with answering questions like:


– How many cycles does it take to execute this code?
– What is the ALU doing during cycle 4?
– Is there a hazard, why does it occur, and how can it
be fixed?
Why Pipeline? For Performance!
• Once the pipeline is full, one instruction is completed every cycle so CPI = 1

Time to fill the pipeline



A Simple Performance Analysis

• Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write; compute the instruction rate

• Nonpipelined Execution:
– lw: IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns
– add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns
(recall 8 ns for the single-cycle processor)

• Pipelined Execution:
– Max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns
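
The arithmetic above can be checked with a short Python sketch. The delays are the ones given on this slide; the helper names and the instruction counts passed to `total_time_ns` are assumptions for the example, and the pipelined figure assumes no stalls.

```python
# Per-step delays in nanoseconds, as given on the slide.
DELAY_NS = {"IF": 2, "Read Reg": 1, "ALU": 2, "Memory": 2, "Write Reg": 1}

LW_STEPS = ["IF", "Read Reg", "ALU", "Memory", "Write Reg"]
ADD_STEPS = ["IF", "Read Reg", "ALU", "Write Reg"]


def latency(steps):
    """Nonpipelined time for one instruction that uses the given steps."""
    return sum(DELAY_NS[step] for step in steps)


# The single-cycle clock must fit the slowest instruction.
single_cycle_ns = max(latency(LW_STEPS), latency(ADD_STEPS))
# The pipeline clock must fit the slowest stage.
pipeline_cycle_ns = max(DELAY_NS.values())


def total_time_ns(instruction_count, pipelined):
    if pipelined:
        return (len(DELAY_NS) + instruction_count - 1) * pipeline_cycle_ns
    return instruction_count * single_cycle_ns


print("lw:", latency(LW_STEPS), "ns; add:", latency(ADD_STEPS), "ns")
print("single-cycle clock:", single_cycle_ns, "ns; pipeline clock:", pipeline_cycle_ns, "ns")
print("3 instructions:", total_time_ns(3, False), "ns vs", total_time_ns(3, True), "ns")
```

For long instruction streams the ratio approaches 8 / 2 = 4 rather than 5, because the 1 ns register stages are padded out to the 2 ns clock. A single lw also takes 5 × 2 = 10 ns through the pipeline, slightly longer than its 8 ns nonpipelined latency.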



The lw Datapath

[Figure: the five-stage pipelined datapath (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack) with the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB registers and the System Clock, shown for a lw instruction.]



Other Pipeline Structures Are Possible
• What about the (slow) multiply operation?
– Make the clock twice as slow or …
– Let it take two cycles (since it doesn’t use the DM stage)
[Figure: pipeline IM, Reg, ALU, DM, Reg, with a MUL unit drawn alongside the ALU.]

• What if the data memory access is twice as slow as the instruction memory?
– Make the clock twice as slow or …
– Let data memory access take two cycles (and keep the same clock rate)
[Figure: pipeline IM, Reg, ALU, DM1, DM2, Reg.]



Food for Thought...
• If dividing it into 5 parts made the clock faster
– And the effective CPI is still one

• Then dividing it into 10 parts would make the clock even faster
– And wouldn’t the CPI still be one?

• Then why not go to twenty cycles?

• Really two issues


– Some things really have to complete in a cycle
• Find next PC from current PC
– CPI is not really one
• Sometimes you need the results from the previous instruction



Making it Even Faster
• In splitting the multiple instruction cycle
design into smaller and smaller steps
– There is a point of diminishing returns where as
much time is spent loading the state registers as
doing the work

• Other potential optimizations:
– Fetch (and execute) more than one instruction at a time (out-of-order superscalar and VLIW (EPIC)) – CprE 581
– Fetch (and execute) instructions from more than one instruction stream (multithreading (hyperthreading)) – CprE 581
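
The point of diminishing returns noted at the top of this slide can be illustrated with a simple model: the clock period is the logic delay divided evenly across the stages, plus a fixed cost for loading the state register. Both numbers below are hypothetical, and the model assumes the logic splits into perfectly balanced stages.

```python
def cycle_time_ns(logic_ns, stages, register_overhead_ns):
    """Clock period when logic_ns of work is split evenly across the stages."""
    return logic_ns / stages + register_overhead_ns


LOGIC_NS = 8.0     # total logic delay of one instruction (hypothetical)
OVERHEAD_NS = 0.2  # time to load one state register (hypothetical)

for stages in (1, 5, 10, 20, 40):
    period = cycle_time_ns(LOGIC_NS, stages, OVERHEAD_NS)
    useful = (LOGIC_NS / stages) / period
    print(f"{stages:2d} stages: period {period:.2f} ns, {useful:.0%} of each cycle is useful work")
```

With these numbers, going from 20 to 40 stages shortens the period only from 0.6 ns to 0.4 ns, and at 40 stages half of every cycle is spent loading state registers.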
Acknowledgments
• These slides contain material developed and copyrighted by:
– David Patterson (UC Berkeley)
– Mary Jane Irwin (Penn State)
– Christos Kozyrakis (Stanford)
– Onur Mutlu (Carnegie Mellon)
– Krste Asanović (UC Berkeley)
