
CSC 252/452: Computer Organization

Fall 2024: Lecture 13

Instructor: Yanan Guo

Department of Computer Science


University of Rochester

Announcement
• Programming assignment 3 is out
• Details: https://www.cs.rochester.edu/courses/252/fall2024/labs/assignment3.html
• Due on Oct. 25th, 11:59 PM
• You (may still) have 3 slip days


Announcement
• Programming assignment 3 is in x86 assembly language. Seek
help from TAs.
• TAs are best positioned to answer your questions about
programming assignments!!!
• Programming assignments do NOT repeat the lecture materials.
They ask you to synthesize what you have learned from the
lectures and work out something new.


Single-Cycle Microarchitecture
[Figure: clocked state elements (PC, register file, condition flags Z/S/O, and memory) feed the combinational logic, which produces the new PC, new register values, new flag values, new memory data, and the corresponding write-enable signals.]
Combinational logic:
  Read current_states;
  next_states = calculate_new_state(current_states);
  When the clock rises, current_states = next_states;
  next_states has to be ready before the clock rises.
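A minimal C sketch of this clocked state-machine view, under the assumption of a toy state layout and a stubbed-out calculate_new_state (this is not the actual Y86-64 datapath):

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural state, committed only when the clock rises. */
struct state {
    uint64_t pc;
    uint64_t regs[15];
    uint8_t  flags;     /* Z, S, O condition codes packed into one byte */
    uint8_t  mem[64];   /* toy data memory */
};

/* Combinational logic: a pure function of the current state.
 * Stubbed here to treat every instruction as a 1-byte nop. */
static struct state calculate_new_state(const struct state *cur) {
    struct state next = *cur;
    next.pc = cur->pc + 1;
    return next;
}

int main(void) {
    struct state s = {0};
    for (int cycle = 0; cycle < 4; cycle++) {
        struct state next = calculate_new_state(&s); /* settles during the cycle */
        s = next;                                    /* committed at the clock edge */
    }
    printf("PC after 4 cycles: 0x%llx\n", (unsigned long long)s.pc);
    return 0;
}
```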


Single-Cycle Microarchitecture

[Figure: single-cycle datapath. The PC register feeds instruction fetch; decoded register IDs drive the register file's read ports (RA, RB) and write port; the ALU and condition flags (Z, S, O) compute results; MUXes select the new PC, the data written back to the register file, and the address/data sent to memory; every state element is written on the clock edge, gated by an enable signal.]

Single-Cycle Microarchitecture: Illustration

• Think of it as a state machine.
• Every cycle, one instruction gets executed. At the end of the cycle, the architectural states get modified.
• States (all updated as the clock rises):
  • PC register
  • Cond. code register
  • Data memory
  • Register file

[Figure: the state elements around the combinational logic, with CC = 100, %rbx = 0x100 in the register file, and PC = 0x014.]

[Timing diagram: clock waveform across Cycles 1 through 4, with markers ① through ④.]

Cycle 1: 0x000: irmovq $0x100,%rbx   # %rbx <-- 0x100
Cycle 2: 0x00a: irmovq $0x200,%rdx   # %rdx <-- 0x200
Cycle 3: 0x014: addq %rdx,%rbx       # %rbx <-- 0x300  CC <-- 000
Cycle 4: 0x016: je dest              # Not taken
Cycle 5: 0x01f: rmmovq %rbx,0(%rdx)  # M[0x200] <-- 0x300

• Start of cycle 3: state is set according to the second irmovq instruction (PC = 0x014, %rbx = 0x100, CC = 100); the combinational logic is starting to react to the state changes.
• End of cycle 3: state is still set according to the second irmovq instruction; the combinational logic generates the results for the addq instruction (%rbx <-- 0x300, CC <-- 000, next PC = 0x016).
• Start of cycle 4: state is set according to the addq instruction (PC = 0x016, %rbx = 0x300, CC = 000); the combinational logic is starting to react to the state changes.
• End of cycle 4: the combinational logic generates the results for the je instruction (next PC = 0x01f).

Processor Microarchitecture
• Sequential, single-cycle microarchitecture implementation
• Basic idea
• Hardware implementation
• Pipelined microarchitecture implementation
• Basic Principles
• Difficulties: Control Dependency
• Difficulties: Data Dependency


Performance Model

Execution time of a program (in seconds)
  = # of dynamic instructions
  × # of cycles taken to execute an instruction, on average (CPI)
  / number of cycles per second (clock frequency = 1 / cycle time)
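As a quick numeric check, here is a hedged C sketch of this formula; the instruction count, CPI, and frequency below are illustrative values, not measurements from the course:

```c
#include <stdio.h>

int main(void) {
    double insts = 2.0e9;    /* dynamic instruction count (illustrative)      */
    double cpi   = 1.0;      /* single-cycle CPU: one cycle per instruction   */
    double freq  = 3.125e9;  /* clock frequency in Hz, i.e., 1 / (320 ps)     */

    double seconds = insts * cpi / freq;
    printf("execution time = %.3f s\n", seconds);  /* 0.640 s */
    return 0;
}
```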

Improving Performance

Execution time of a program = # of dynamic instructions × CPI / clock frequency

• 1. Reduce the total number of instructions executed (mainly done by the compiler and/or programmer).
• 2. Increase the clock frequency (reduce the cycle time). Has huge power implications.
• 3. Reduce the CPI, i.e., execute more instructions in one cycle.
• We will talk about one technique that simultaneously achieves 2 & 3.

Limitations of a Single-Cycle CPU

• Cycle time
  • Every instruction finishes in one cycle.
  • The absolute time it takes to execute each instruction varies. Consider, for instance, an ADD instruction and a JMP instruction.
  • But the cycle time is uniform across instructions, so the cycle time needs to accommodate the worst case, i.e., the slowest instruction.
  • How do we shorten the cycle time (increase the frequency)?
• CPI
  • The entire hardware is occupied to execute one instruction at a time. We can't execute multiple instructions at the same time.
  • How do we execute multiple instructions in one cycle?

A Motivating Example

[Figure: combinational logic (300 ps) followed by a register (20 ps), driven by the clock.]

• Computation requires a total of 300 picoseconds
• Additional 20 picoseconds to save the result in a register
• Must have a clock cycle time of at least 320 ps

Pipeline Diagrams

• Time to finish 3 insts = 960 ps
• Each inst.’s latency is 320 ps

[Diagram: OP1, OP2, and OP3 each occupy 320 ps, back to back along the time axis.]

• 3 instructions will take 960 ps to finish
• First cycle: Inst 1 takes 300 ps to compute the new state, 20 ps to store the new state
• Second cycle: Inst 2 starts; it takes 300 ps to compute the new state, 20 ps to store the new state
• And so on…

3-Stage Pipelined Version

[Figure: three stages of combinational logic A, B, and C (100 ps each), each followed by a 20 ps pipeline register, all driven by the clock.]

• Divide the combinational logic into 3 stages of 100 ps each
• Insert registers between stages to store intermediate data between stages. These are called pipeline registers (ISA-invisible)
• Can begin a new instruction as soon as the previous one finishes stage A and has stored the intermediate data
• Begin a new operation every 120 ps
• Cycle time can be reduced to 120 ps
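A hedged C sketch of the arithmetic behind these numbers (the stage delays and register overhead are taken from the slide; the calculation itself is the standard cycle-time rule):

```c
#include <stdio.h>

int main(void) {
    double stage_ps[] = {100.0, 100.0, 100.0}; /* combinational delay of each stage */
    double reg_ps = 20.0;                      /* pipeline register delay           */
    int nstages = 3;

    /* The cycle time is set by the slowest stage plus the register delay. */
    double cycle = 0.0;
    for (int i = 0; i < nstages; i++)
        if (stage_ps[i] + reg_ps > cycle)
            cycle = stage_ps[i] + reg_ps;

    double latency = cycle * nstages;           /* per-instruction latency        */
    double gips    = 1000.0 / cycle;            /* insts per second, in billions  */

    printf("cycle time = %.0f ps\n", cycle);    /* 120 ps */
    printf("latency    = %.0f ps\n", latency);  /* 360 ps */
    printf("throughput = %.1f GIPS\n", gips);   /* 8.3    */
    return 0;
}
```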

3-Stage Pipelined Version

[Pipeline diagram: OP1, OP2, and OP3 each flow through stages A, B, and C, with each operation starting one cycle after the previous one.]

Comparison

Unpipelined
• Time to finish 3 insts = 960 ps
• Each inst.’s latency is 320 ps

3-Stage Pipelined
• Time to finish 3 insts = 120 * 5 = 600 ps
• But each inst.’s latency increases: 120 * 3 = 360 ps
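The 600 ps figure follows from counting cycles: with k stages, the first instruction needs k cycles and each later instruction finishes one cycle after the previous one. A small C sketch of that count, using the numbers from this slide:

```c
#include <stdio.h>

/* Time for n instructions on a k-stage pipeline with the given cycle time:
 * the first instruction needs k cycles, each later one adds one cycle,
 * so the total is (k + n - 1) cycles. */
static double pipelined_ps(int n, int k, double cycle_ps) {
    return (k + n - 1) * cycle_ps;
}

int main(void) {
    printf("unpipelined: %.0f ps\n", 3 * 320.0);                 /* 960 ps */
    printf("pipelined:   %.0f ps\n", pipelined_ps(3, 3, 120.0)); /* 600 ps */
    return 0;
}
```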

Benefits of Pipelining

• Unpipelined: time to finish 3 insts = 960 ps; each inst.’s latency is 320 ps.
• Pipelining reduces the cycle time from 320 ps to 120 ps.
• 3-stage pipelined: time to finish 3 insts = 120 * 5 = 600 ps.
• But each inst.’s latency increases: 120 * 3 = 360 ps.

One Requirement of Pipelining

• The stages need to be using different hardware structures.
• That is, Stage A, Stage B, and Stage C need to exercise different parts of the combinational logic.

Pipeline Trade-offs

• Pros: Decrease the total execution time (increase the “throughput”).
• Cons: Increase the latency of each instruction, as new registers are needed between pipeline stages.

[Figure: the 3-stage pipelined datapath (100 ps logic + 20 ps register per stage) compared with the unpipelined datapath (300 ps logic + 20 ps register).]

Throughput

• The rate at which the processor can finish executing an instruction (at the steady state).

[Pipeline diagram: Inst 1 through Inst 5 overlapped in stages A, B, and C.]

• The throughput of this 3-stage processor is 1 instruction every 120 ps, or 8.3 Giga (billion) Instructions per Second (GIPS).

Aside: Unbalanced Pipeline

• A pipeline’s delay is limited by the slowest stage. This limits the cycle time and the throughput.
• Balanced pipeline (stages of 100 ps, 100 ps, and 100 ps, each followed by a 20 ps register): cycle time 120 ps, delay 360 ps, throughput 8.3 GIPS.
• Unbalanced pipeline (stages of 50 ps, 150 ps, and 100 ps, each followed by a 20 ps register): cycle time 170 ps, delay 510 ps, throughput 5.9 GIPS.

[Pipeline diagram: with a 170 ps cycle, OP1, OP2, and OP3 flow through stages A, B, and C one cycle apart.]

Aside: Mitigating Unbalanced Pipeline

• Solution 1: Further pipeline the slow stages.
  • Not always possible. What to do if we can’t further pipeline a stage?
• Solution 2: Use multiple copies of the slow component.

[Figure: stages of 50 ps, 100 ps, and 50 ps, each followed by a 20 ps register. Stage B is duplicated into copy 1 and copy 2, with a register in front of copy 2 and a MUX (whose select is derived from the clock) choosing which copy feeds stage C.]

• Data is sent to copy 1 in odd cycles and to copy 2 in even cycles.
• This is called 2-way interleaving. Effectively the same as pipelining Comb. logic B into two sub-stages.
• The cycle time is reduced to 70 ps (as opposed to 120 ps) at the cost of extra hardware.

Another Way to Look At the Microarchitecture


Principles:
• Execute each instruction one at a time, one after another
• Express every instruction as a series of simple steps
• Dedicated hardware structure for completing each step
• Follow same general flow for each instruction type

Fetch: Read instruction from instruction memory
Decode: Read program registers
Execute: Compute value or address
Memory: Read or write data
Write Back: Write program registers
PC: Update program counter

[Figure: SEQ hardware structure. Fetch: the PC feeds the instruction memory and PC incrementer, producing icode, ifun, rA, rB, valC, and valP. Decode / Write Back: the register file is read through ports A and B (valA, valB) and written through ports E and M (valE, valM). Execute: the ALU and condition codes (CC) compute valE and Cnd from aluA and aluB. Memory: the data memory reads or writes data at the computed address, producing valM. PC: the new PC is selected.]

Stage Computation: Arith/Log. Ops

OPq rA, rB        (encoding: 6 fn rA rB)

Fetch        icode:ifun ← M1[PC]      Read instruction byte
             rA:rB ← M1[PC+1]         Read register byte
             valP ← PC+2              Compute next PC
Decode       valA ← R[rA]             Read operand A
             valB ← R[rB]             Read operand B
Execute      valE ← valB OP valA      Perform ALU operation
             Set CC                   Set condition code register
Memory
Write back   R[rB] ← valE             Write back result
PC update    PC ← valP                Update PC
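A hedged C sketch of these per-stage computations for one OPq instruction; the state variables and ALU stub are illustrative (overflow detection is omitted) and are not the textbook's SEQ implementation:

```c
#include <stdint.h>

/* Illustrative machine state for this sketch. */
static uint8_t  mem[1024];   /* instruction/data memory */
static uint64_t R[15];       /* program registers       */
static uint64_t PC;
static int ZF, SF, OF;       /* condition codes         */

/* ifun encodings: 0 = addq, 1 = subq, 2 = andq, 3 = xorq. */
static uint64_t alu(int fn, uint64_t b, uint64_t a) {
    switch (fn) {
    case 0:  return b + a;
    case 1:  return b - a;
    case 2:  return b & a;
    default: return b ^ a;
    }
}

void step_opq(void) {
    /* Fetch */
    uint8_t icode_ifun = mem[PC];         /* icode:ifun <- M1[PC]   */
    uint8_t regids     = mem[PC + 1];     /* rA:rB      <- M1[PC+1] */
    int rA = regids >> 4, rB = regids & 0xF;
    uint64_t valP = PC + 2;               /* compute next PC        */

    /* Decode */
    uint64_t valA = R[rA];
    uint64_t valB = R[rB];

    /* Execute */
    uint64_t valE = alu(icode_ifun & 0xF, valB, valA);
    ZF = (valE == 0);                     /* set condition codes    */
    SF = ((int64_t)valE < 0);
    OF = 0;                               /* overflow check omitted */

    /* Memory: no memory access for OPq */

    /* Write back */
    R[rB] = valE;

    /* PC update */
    PC = valP;
}
```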

Stage Computation: rmmovq

rmmovq rA, D(rB)        (encoding: 4 0 rA rB D)

Fetch        icode:ifun ← M1[PC]      Read instruction byte
             rA:rB ← M1[PC+1]         Read register byte
             valC ← M8[PC+2]          Read displacement D
             valP ← PC+10             Compute next PC
Decode       valA ← R[rA]             Read operand A
             valB ← R[rB]             Read operand B
Execute      valE ← valB + valC       Compute effective address
Memory       M8[valE] ← valA          Write value to memory
Write back
PC update    PC ← valP                Update PC
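The same kind of hedged C sketch for rmmovq, now with an 8-byte displacement read in Fetch and a store in the Memory stage (the state layout and little-endian helpers are illustrative, not the SEQ implementation):

```c
#include <stdint.h>
#include <string.h>

static uint8_t  mem[1024];   /* instruction/data memory */
static uint64_t R[15];       /* program registers       */
static uint64_t PC;

/* Little-endian 8-byte load/store helpers for this sketch. */
static uint64_t load8(uint64_t addr) { uint64_t v; memcpy(&v, &mem[addr], 8); return v; }
static void store8(uint64_t addr, uint64_t v) { memcpy(&mem[addr], &v, 8); }

void step_rmmovq(void) {
    /* Fetch */
    uint8_t icode_ifun = mem[PC];       /* icode:ifun <- M1[PC] (identifies rmmovq) */
    (void)icode_ifun;
    uint8_t regids = mem[PC + 1];       /* rA:rB <- M1[PC+1]          */
    int rA = regids >> 4, rB = regids & 0xF;
    uint64_t valC = load8(PC + 2);      /* displacement D <- M8[PC+2] */
    uint64_t valP = PC + 10;            /* next PC                    */

    /* Decode */
    uint64_t valA = R[rA];
    uint64_t valB = R[rB];

    /* Execute */
    uint64_t valE = valB + valC;        /* effective address          */

    /* Memory */
    store8(valE, valA);                 /* M8[valE] <- valA           */

    /* Write back: nothing for rmmovq */

    /* PC update */
    PC = valP;
}
```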

Stage Computation: Jumps

jXX Dest

Fetch        icode:ifun ← M1[PC]          Read instruction byte
             valC ← M8[PC+1]              Read destination address
             valP ← PC+9                  Fall-through address
Decode
Execute      Cnd ← Cond(CC, ifun)         Take branch?
Memory
Write back
PC update    PC ← Cnd ? valC : valP       Update PC

• Compute both addresses
• Choose based on the setting of the condition codes and the branch condition
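A hedged C sketch of Cond(CC, ifun) and the PC selection; the ifun encodings follow the standard Y86-64 convention, and the CC struct is just for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* Condition codes; the struct layout is only for this sketch. */
struct cc { bool ZF, SF, OF; };

/* ifun encodings (Y86-64 convention):
 * 0 = jmp, 1 = jle, 2 = jl, 3 = je, 4 = jne, 5 = jge, 6 = jg. */
static bool cond(struct cc cc, int ifun) {
    switch (ifun) {
    case 0:  return true;                         /* jmp */
    case 1:  return (cc.SF != cc.OF) || cc.ZF;    /* jle */
    case 2:  return  cc.SF != cc.OF;              /* jl  */
    case 3:  return  cc.ZF;                       /* je  */
    case 4:  return !cc.ZF;                       /* jne */
    case 5:  return  cc.SF == cc.OF;              /* jge */
    case 6:  return (cc.SF == cc.OF) && !cc.ZF;   /* jg  */
    default: return false;
    }
}

/* PC update: taken -> branch target valC, not taken -> fall-through valP. */
static uint64_t next_pc(struct cc cc, int ifun, uint64_t valC, uint64_t valP) {
    return cond(cc, ifun) ? valC : valP;
}
```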

Pipeline Stages
Fetch
• Select current PC
• Read instruction
• Compute incremented PC
Decode
• Read program registers
Execute
• Operate ALU
Memory
• Read or write data memory
Write Back
• Update register file


Real-World Pipelines: Car Washes

[Images: a sequential car wash vs. a pipelined car wash.]

Idea
• Divide the process into independent stages
• Move objects through the stages in sequence
• At any given time, multiple objects are being processed

Pipeline Illustration

[Figure: a 5-stage pipeline (Fetch, Decode, Execute, Memory, Write back) with a pipeline register after each stage. Cycle by cycle, Inst0 through Inst4 enter at Fetch and advance one stage per cycle until each drains out after Write back.]

Another Illustration

[Figure: the 3-stage pipeline diagram (OP1 through OP3 in stages A, B, C, 120 ps cycle) shown at several points in time (t = 239, 241, 300, and 359 ps), illustrating which stage each operation occupies and which pipeline register is written as the clock rises.]

Making the Pipeline Really Work


• Control Dependencies
• What is it?
• Software mitigation: Inserting Nops
• Software mitigation: Delay Slots
• Data Dependencies
• What is it?
• Software mitigation: Inserting Nops


Control Dependency

• Definition: The outcome of instruction A determines whether or not instruction B should be executed.
• Jump instruction example below:
  • jne L1 determines whether irmovq $1, %rax should be executed
  • But jne doesn’t know its outcome until after its Execute stage
• Two nops are inserted after jne so that the next instruction is not fetched until jne’s outcome is known.

                                      1  2  3  4  5  6  7  8  9
xorq %rax, %rax                       F  D  E  M  W
jne L1               # Not taken         F  D  E  M  W
nop                                         F  D  E  M  W
nop                                            F  D  E  M  W
irmovq $1, %rax      # Fall Through               F  D  E  M  W
L1: irmovq $4, %rcx  # Target                        F  D  E  M
irmovq $3, %rax      # Target + 1                       F  D  E
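These wasted slots translate directly into CPI in the performance model. A hedged C sketch, where the 20% jump frequency is a made-up number purely for illustration:

```c
#include <stdio.h>

int main(void) {
    double base_cpi    = 1.0;   /* ideal pipelined CPI                        */
    double branch_frac = 0.20;  /* fraction of insts that are jumps (made up) */
    double bubble_cost = 2.0;   /* two wasted slots per unresolved jump       */

    double cpi = base_cpi + branch_frac * bubble_cost;
    printf("effective CPI = %.2f\n", cpi);  /* 1.40 */
    return 0;
}
```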

Delay Slots

• The two slots after jne are wasted on nops (see the pipeline chart above). Can we make use of the 2 wasted slots?

if (cond) {
    do_A();
} else {
    do_B();
}
do_C();

• Idea: move do_C()'s work into the slots, since it runs on both paths. Have to make sure do_C doesn’t depend on do_A and do_B!!!

• A less obvious example:

do_C();
if (cond) {
    do_A();
} else {
    do_B();
}

• The branch-independent or instruction can be moved into the slot after the jump:

    Original        With the delay slot filled
    add A, B        add A, B
    or  C, D        sub E, F
    sub E, F        jle 0x200
    jle 0x200       or  C, D
    add A, C        add A, C

• Why don’t we move the sub instruction?

Resolving Control Dependencies


• Software Mechanisms
• Adding NOPs: requires compiler to insert nops, which also take
memory space — not a good idea
• Delay slot: insert instructions that do not depend on the effect
of the preceding instruction. These instructions will execute
even if the preceding branch is taken — old RISC approach
• Hardware mechanisms
• Stalling (Think of it as hardware automatically inserting nops)
• Branch Prediction
• Return Address Stack
