
CSC 252/452: Computer Organization

Fall 2024: Lecture 13

Instructor: Yanan Guo

Department of Computer Science


University of Rochester

Announcement
• Programming assignment 3 is out
• Details: https://www.cs.rochester.edu/courses/252/fall2024/labs/assignment3.html
• Due on Oct. 25th, 11:59 PM
• You (may still) have 3 slip days


Announcement
• Programming assignment 3 is in x86 assembly language. Seek
help from TAs.
• TAs are best positioned to answer your questions about
programming assignments!!!
• Programming assignments do NOT repeat the lecture materials.
They ask you to synthesize what you have learned from the
lectures and work out something new.


Single-Cycle Microarchitecture
[Figure: clocked state elements (PC, register file, condition flags Z/S/O, and memory) feed the combinational logic, which produces the new PC, new register values, new flag values, new memory data, and the corresponding write-enable signals.]
Combinational logic:
  Read current_states;
  next_states = calculate_new_state(current_states);
  When the clock rises, current_states = next_states;
  next_states has to be ready before the clock rises.
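A minimal C sketch of this clocked state-machine view, under the assumption of a toy state layout and a stubbed-out calculate_new_state (this is not the actual Y86-64 datapath):

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural state, committed only when the clock rises. */
struct state {
    uint64_t pc;
    uint64_t regs[15];
    uint8_t  flags;     /* Z, S, O condition codes packed into one byte */
    uint8_t  mem[64];   /* toy data memory */
};

/* Combinational logic: a pure function of the current state.
 * Stubbed here to treat every instruction as a 1-byte nop. */
static struct state calculate_new_state(const struct state *cur) {
    struct state next = *cur;
    next.pc = cur->pc + 1;
    return next;
}

int main(void) {
    struct state s = {0};
    for (int cycle = 0; cycle < 4; cycle++) {
        struct state next = calculate_new_state(&s); /* settles during the cycle */
        s = next;                                    /* committed at the clock edge */
    }
    printf("PC after 4 cycles: 0x%llx\n", (unsigned long long)s.pc);
    return 0;
}
```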


Single-Cycle Microarchitecture

[Figure: single-cycle datapath. The PC register feeds instruction fetch; decoded register IDs drive the register file's read ports (RA, RB) and write port; the ALU and condition flags (Z, S, O) compute results; MUXes select the new PC, the data written back to the register file, and the address/data sent to memory; every state element is written on the clock edge, gated by an enable signal.]

Single-Cycle Microarchitecture: Illustration

• Think of it as a state machine.
• Every cycle, one instruction gets executed. At the end of the cycle, the architectural states get modified.
• States (all updated as the clock rises):
  • PC register
  • Cond. code register
  • Data memory
  • Register file

[Figure: the state elements around the combinational logic, with CC = 100, %rbx = 0x100 in the register file, and PC = 0x014.]

[Timing diagram: clock waveform across Cycles 1 through 4, with markers ① through ④.]

Cycle 1: 0x000: irmovq $0x100,%rbx   # %rbx <-- 0x100
Cycle 2: 0x00a: irmovq $0x200,%rdx   # %rdx <-- 0x200
Cycle 3: 0x014: addq %rdx,%rbx       # %rbx <-- 0x300  CC <-- 000
Cycle 4: 0x016: je dest              # Not taken
Cycle 5: 0x01f: rmmovq %rbx,0(%rdx)  # M[0x200] <-- 0x300

• Start of cycle 3: state is set according to the second irmovq instruction (PC = 0x014, %rbx = 0x100, CC = 100); the combinational logic is starting to react to the state changes.
• End of cycle 3: state is still set according to the second irmovq instruction; the combinational logic generates the results for the addq instruction (%rbx <-- 0x300, CC <-- 000, next PC = 0x016).
• Start of cycle 4: state is set according to the addq instruction (PC = 0x016, %rbx = 0x300, CC = 000); the combinational logic is starting to react to the state changes.
• End of cycle 4: the combinational logic generates the results for the je instruction (next PC = 0x01f).

Processor Microarchitecture
• Sequential, single-cycle microarchitecture implementation
• Basic idea
• Hardware implementation
• Pipelined microarchitecture implementation
• Basic Principles
• Difficulties: Control Dependency
• Difficulties: Data Dependency


Performance Model

Execution time of a program (in seconds)
  = # of dynamic instructions
  × # of cycles taken to execute an instruction, on average (CPI)
  / number of cycles per second (clock frequency = 1 / cycle time)
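As a quick numeric check, here is a hedged C sketch of this formula; the instruction count, CPI, and frequency below are illustrative values, not measurements from the course:

```c
#include <stdio.h>

int main(void) {
    double insts = 2.0e9;    /* dynamic instruction count (illustrative)      */
    double cpi   = 1.0;      /* single-cycle CPU: one cycle per instruction   */
    double freq  = 3.125e9;  /* clock frequency in Hz, i.e., 1 / (320 ps)     */

    double seconds = insts * cpi / freq;
    printf("execution time = %.3f s\n", seconds);  /* 0.640 s */
    return 0;
}
```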

Improving Performance

Execution time of a program = # of dynamic instructions × CPI / clock frequency

• 1. Reduce the total number of instructions executed (mainly done by the compiler and/or programmer).
• 2. Increase the clock frequency (reduce the cycle time). Has huge power implications.
• 3. Reduce the CPI, i.e., execute more instructions in one cycle.
• We will talk about one technique that simultaneously achieves 2 & 3.

Limitations of a Single-Cycle CPU

• Cycle time
  • Every instruction finishes in one cycle.
  • The absolute time it takes to execute each instruction varies. Consider, for instance, an ADD instruction and a JMP instruction.
  • But the cycle time is uniform across instructions, so the cycle time needs to accommodate the worst case, i.e., the slowest instruction.
  • How do we shorten the cycle time (increase the frequency)?
• CPI
  • The entire hardware is occupied to execute one instruction at a time. We can't execute multiple instructions at the same time.
  • How do we execute multiple instructions in one cycle?

A Motivating Example

[Figure: combinational logic (300 ps) followed by a register (20 ps), driven by the clock.]

• Computation requires a total of 300 picoseconds
• Additional 20 picoseconds to save the result in a register
• Must have a clock cycle time of at least 320 ps

Pipeline Diagrams

• Time to finish 3 insts = 960 ps
• Each inst.’s latency is 320 ps

[Diagram: OP1, OP2, and OP3 each occupy 320 ps, back to back along the time axis.]

• 3 instructions will take 960 ps to finish
• First cycle: Inst 1 takes 300 ps to compute the new state, 20 ps to store the new state
• Second cycle: Inst 2 starts; it takes 300 ps to compute the new state, 20 ps to store the new state
• And so on…

3-Stage Pipelined Version

[Figure: three stages of combinational logic A, B, and C (100 ps each), each followed by a 20 ps pipeline register, all driven by the clock.]

• Divide the combinational logic into 3 stages of 100 ps each
• Insert registers between stages to store intermediate data between stages. These are called pipeline registers (ISA-invisible)
• Can begin a new instruction as soon as the previous one finishes stage A and has stored the intermediate data
• Begin a new operation every 120 ps
• Cycle time can be reduced to 120 ps
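A hedged C sketch of the arithmetic behind these numbers (the stage delays and register overhead are taken from the slide; the calculation itself is the standard cycle-time rule):

```c
#include <stdio.h>

int main(void) {
    double stage_ps[] = {100.0, 100.0, 100.0}; /* combinational delay of each stage */
    double reg_ps = 20.0;                      /* pipeline register delay           */
    int nstages = 3;

    /* The cycle time is set by the slowest stage plus the register delay. */
    double cycle = 0.0;
    for (int i = 0; i < nstages; i++)
        if (stage_ps[i] + reg_ps > cycle)
            cycle = stage_ps[i] + reg_ps;

    double latency = cycle * nstages;           /* per-instruction latency        */
    double gips    = 1000.0 / cycle;            /* insts per second, in billions  */

    printf("cycle time = %.0f ps\n", cycle);    /* 120 ps */
    printf("latency    = %.0f ps\n", latency);  /* 360 ps */
    printf("throughput = %.1f GIPS\n", gips);   /* 8.3    */
    return 0;
}
```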

3-Stage Pipelined Version

[Pipeline diagram: OP1, OP2, and OP3 each flow through stages A, B, and C, with each operation starting one cycle after the previous one.]

Comparison

Unpipelined
• Time to finish 3 insts = 960 ps
• Each inst.’s latency is 320 ps

3-Stage Pipelined
• Time to finish 3 insts = 120 * 5 = 600 ps
• But each inst.’s latency increases: 120 * 3 = 360 ps
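The 600 ps figure follows from counting cycles: with k stages, the first instruction needs k cycles and each later instruction finishes one cycle after the previous one. A small C sketch of that count, using the numbers from this slide:

```c
#include <stdio.h>

/* Time for n instructions on a k-stage pipeline with the given cycle time:
 * the first instruction needs k cycles, each later one adds one cycle,
 * so the total is (k + n - 1) cycles. */
static double pipelined_ps(int n, int k, double cycle_ps) {
    return (k + n - 1) * cycle_ps;
}

int main(void) {
    printf("unpipelined: %.0f ps\n", 3 * 320.0);                 /* 960 ps */
    printf("pipelined:   %.0f ps\n", pipelined_ps(3, 3, 120.0)); /* 600 ps */
    return 0;
}
```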

Benefits of Pipelining

• Unpipelined: time to finish 3 insts = 960 ps; each inst.’s latency is 320 ps.
• Pipelining reduces the cycle time from 320 ps to 120 ps.
• 3-stage pipelined: time to finish 3 insts = 120 * 5 = 600 ps.
• But each inst.’s latency increases: 120 * 3 = 360 ps.

One Requirement of Pipelining

• The stages need to be using different hardware structures.
• That is, Stage A, Stage B, and Stage C need to exercise different parts of the combinational logic.

Pipeline Trade-offs

• Pros: Decrease the total execution time (increase the “throughput”).
• Cons: Increase the latency of each instruction, as new registers are needed between pipeline stages.

[Figure: the 3-stage pipelined datapath (100 ps logic + 20 ps register per stage) compared with the unpipelined datapath (300 ps logic + 20 ps register).]

Throughput

• The rate at which the processor can finish executing an instruction (at the steady state).

[Pipeline diagram: Inst 1 through Inst 5 overlapped in stages A, B, and C.]

• The throughput of this 3-stage processor is 1 instruction every 120 ps, or 8.3 Giga (billion) Instructions per Second (GIPS).

Aside: Unbalanced Pipeline

• A pipeline’s delay is limited by the slowest stage. This limits the cycle time and the throughput.
• Balanced pipeline (stages of 100 ps, 100 ps, and 100 ps, each followed by a 20 ps register): cycle time 120 ps, delay 360 ps, throughput 8.3 GIPS.
• Unbalanced pipeline (stages of 50 ps, 150 ps, and 100 ps, each followed by a 20 ps register): cycle time 170 ps, delay 510 ps, throughput 5.9 GIPS.

[Pipeline diagram: with a 170 ps cycle, OP1, OP2, and OP3 flow through stages A, B, and C one cycle apart.]

Aside: Mitigating Unbalanced Pipeline

• Solution 1: Further pipeline the slow stages.
  • Not always possible. What to do if we can’t further pipeline a stage?
• Solution 2: Use multiple copies of the slow component.

[Figure: stages of 50 ps, 100 ps, and 50 ps, each followed by a 20 ps register. Stage B is duplicated into copy 1 and copy 2, with a register in front of copy 2 and a MUX (whose select is derived from the clock) choosing which copy feeds stage C.]

• Data is sent to copy 1 in odd cycles and to copy 2 in even cycles.
• This is called 2-way interleaving. Effectively the same as pipelining Comb. logic B into two sub-stages.
• The cycle time is reduced to 70 ps (as opposed to 120 ps) at the cost of extra hardware.

Another Way to Look At the Microarchitecture


Principles:
• Execute each instruction one at a time, one after another
• Express every instruction as a series of simple steps
• Dedicated hardware structure for completing each step
• Follow same general flow for each instruction type

Fetch: Read instruction from instruction memory
Decode: Read program registers
Execute: Compute value or address
Memory: Read or write data
Write Back: Write program registers
PC: Update program counter

[Figure: SEQ hardware structure. Fetch: the PC feeds the instruction memory and PC incrementer, producing icode, ifun, rA, rB, valC, and valP. Decode / Write Back: the register file is read through ports A and B (valA, valB) and written through ports E and M (valE, valM). Execute: the ALU and condition codes (CC) compute valE and Cnd from aluA and aluB. Memory: the data memory reads or writes data at the computed address, producing valM. PC: the new PC is selected.]

Stage Computation: Arith/Log. Ops

OPq rA, rB        (encoding: 6 fn rA rB)

Fetch        icode:ifun ← M1[PC]      Read instruction byte
             rA:rB ← M1[PC+1]         Read register byte
             valP ← PC+2              Compute next PC
Decode       valA ← R[rA]             Read operand A
             valB ← R[rB]             Read operand B
Execute      valE ← valB OP valA      Perform ALU operation
             Set CC                   Set condition code register
Memory
Write back   R[rB] ← valE             Write back result
PC update    PC ← valP                Update PC
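A hedged C sketch of these per-stage computations for one OPq instruction; the state variables and ALU stub are illustrative (overflow detection is omitted) and are not the textbook's SEQ implementation:

```c
#include <stdint.h>

/* Illustrative machine state for this sketch. */
static uint8_t  mem[1024];   /* instruction/data memory */
static uint64_t R[15];       /* program registers       */
static uint64_t PC;
static int ZF, SF, OF;       /* condition codes         */

/* ifun encodings: 0 = addq, 1 = subq, 2 = andq, 3 = xorq. */
static uint64_t alu(int fn, uint64_t b, uint64_t a) {
    switch (fn) {
    case 0:  return b + a;
    case 1:  return b - a;
    case 2:  return b & a;
    default: return b ^ a;
    }
}

void step_opq(void) {
    /* Fetch */
    uint8_t icode_ifun = mem[PC];         /* icode:ifun <- M1[PC]   */
    uint8_t regids     = mem[PC + 1];     /* rA:rB      <- M1[PC+1] */
    int rA = regids >> 4, rB = regids & 0xF;
    uint64_t valP = PC + 2;               /* compute next PC        */

    /* Decode */
    uint64_t valA = R[rA];
    uint64_t valB = R[rB];

    /* Execute */
    uint64_t valE = alu(icode_ifun & 0xF, valB, valA);
    ZF = (valE == 0);                     /* set condition codes    */
    SF = ((int64_t)valE < 0);
    OF = 0;                               /* overflow check omitted */

    /* Memory: no memory access for OPq */

    /* Write back */
    R[rB] = valE;

    /* PC update */
    PC = valP;
}
```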

Stage Computation: rmmovq

rmmovq rA, D(rB)        (encoding: 4 0 rA rB D)

Fetch        icode:ifun ← M1[PC]      Read instruction byte
             rA:rB ← M1[PC+1]         Read register byte
             valC ← M8[PC+2]          Read displacement D
             valP ← PC+10             Compute next PC
Decode       valA ← R[rA]             Read operand A
             valB ← R[rB]             Read operand B
Execute      valE ← valB + valC       Compute effective address
Memory       M8[valE] ← valA          Write value to memory
Write back
PC update    PC ← valP                Update PC
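The same kind of hedged C sketch for rmmovq, now with an 8-byte displacement read in Fetch and a store in the Memory stage (the state layout and little-endian helpers are illustrative, not the SEQ implementation):

```c
#include <stdint.h>
#include <string.h>

static uint8_t  mem[1024];   /* instruction/data memory */
static uint64_t R[15];       /* program registers       */
static uint64_t PC;

/* Little-endian 8-byte load/store helpers for this sketch. */
static uint64_t load8(uint64_t addr) { uint64_t v; memcpy(&v, &mem[addr], 8); return v; }
static void store8(uint64_t addr, uint64_t v) { memcpy(&mem[addr], &v, 8); }

void step_rmmovq(void) {
    /* Fetch */
    uint8_t icode_ifun = mem[PC];       /* icode:ifun <- M1[PC] (identifies rmmovq) */
    (void)icode_ifun;
    uint8_t regids = mem[PC + 1];       /* rA:rB <- M1[PC+1]          */
    int rA = regids >> 4, rB = regids & 0xF;
    uint64_t valC = load8(PC + 2);      /* displacement D <- M8[PC+2] */
    uint64_t valP = PC + 10;            /* next PC                    */

    /* Decode */
    uint64_t valA = R[rA];
    uint64_t valB = R[rB];

    /* Execute */
    uint64_t valE = valB + valC;        /* effective address          */

    /* Memory */
    store8(valE, valA);                 /* M8[valE] <- valA           */

    /* Write back: nothing for rmmovq */

    /* PC update */
    PC = valP;
}
```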

Stage Computation: Jumps

jXX Dest

Fetch        icode:ifun ← M1[PC]          Read instruction byte
             valC ← M8[PC+1]              Read destination address
             valP ← PC+9                  Fall-through address
Decode
Execute      Cnd ← Cond(CC, ifun)         Take branch?
Memory
Write back
PC update    PC ← Cnd ? valC : valP       Update PC

• Compute both addresses
• Choose based on the setting of the condition codes and the branch condition
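A hedged C sketch of Cond(CC, ifun) and the PC selection; the ifun encodings follow the standard Y86-64 convention, and the CC struct is just for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* Condition codes; the struct layout is only for this sketch. */
struct cc { bool ZF, SF, OF; };

/* ifun encodings (Y86-64 convention):
 * 0 = jmp, 1 = jle, 2 = jl, 3 = je, 4 = jne, 5 = jge, 6 = jg. */
static bool cond(struct cc cc, int ifun) {
    switch (ifun) {
    case 0:  return true;                         /* jmp */
    case 1:  return (cc.SF != cc.OF) || cc.ZF;    /* jle */
    case 2:  return  cc.SF != cc.OF;              /* jl  */
    case 3:  return  cc.ZF;                       /* je  */
    case 4:  return !cc.ZF;                       /* jne */
    case 5:  return  cc.SF == cc.OF;              /* jge */
    case 6:  return (cc.SF == cc.OF) && !cc.ZF;   /* jg  */
    default: return false;
    }
}

/* PC update: taken -> branch target valC, not taken -> fall-through valP. */
static uint64_t next_pc(struct cc cc, int ifun, uint64_t valC, uint64_t valP) {
    return cond(cc, ifun) ? valC : valP;
}
```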

Pipeline Stages
Fetch
• Select current PC
• Read instruction
• Compute incremented PC
Decode
• Read program registers
Execute
• Operate ALU
Memory
• Read or write data memory
Write Back
• Update register file


Real-World Pipelines: Car Washes

[Images: a sequential car wash vs. a pipelined car wash.]

Idea
• Divide the process into independent stages
• Move objects through the stages in sequence
• At any given time, multiple objects are being processed

Pipeline Illustration

[Figure: a 5-stage pipeline (Fetch, Decode, Execute, Memory, Write back) with a pipeline register after each stage. Cycle by cycle, Inst0 through Inst4 enter at Fetch and advance one stage per cycle until each drains out after Write back.]

Another Illustration

[Figure: the 3-stage pipeline diagram (OP1 through OP3 in stages A, B, C, 120 ps cycle) shown at several points in time (t = 239, 241, 300, and 359 ps), illustrating which stage each operation occupies and which pipeline register is written as the clock rises.]

Making the Pipeline Really Work


• Control Dependencies
• What is it?
• Software mitigation: Inserting Nops
• Software mitigation: Delay Slots
• Data Dependencies
• What is it?
• Software mitigation: Inserting Nops


Control Dependency

• Definition: The outcome of instruction A determines whether or not instruction B should be executed.
• Jump instruction example below:
  • jne L1 determines whether irmovq $1, %rax should be executed
  • But jne doesn’t know its outcome until after its Execute stage
• Two nops are inserted after jne so that the next instruction is not fetched until jne’s outcome is known.

                                      1  2  3  4  5  6  7  8  9
xorq %rax, %rax                       F  D  E  M  W
jne L1               # Not taken         F  D  E  M  W
nop                                         F  D  E  M  W
nop                                            F  D  E  M  W
irmovq $1, %rax      # Fall Through               F  D  E  M  W
L1: irmovq $4, %rcx  # Target                        F  D  E  M
irmovq $3, %rax      # Target + 1                       F  D  E
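These wasted slots translate directly into CPI in the performance model. A hedged C sketch, where the 20% jump frequency is a made-up number purely for illustration:

```c
#include <stdio.h>

int main(void) {
    double base_cpi    = 1.0;   /* ideal pipelined CPI                        */
    double branch_frac = 0.20;  /* fraction of insts that are jumps (made up) */
    double bubble_cost = 2.0;   /* two wasted slots per unresolved jump       */

    double cpi = base_cpi + branch_frac * bubble_cost;
    printf("effective CPI = %.2f\n", cpi);  /* 1.40 */
    return 0;
}
```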

Delay Slots

• The two slots after jne are wasted on nops (see the pipeline chart above). Can we make use of the 2 wasted slots?

if (cond) {
    do_A();
} else {
    do_B();
}
do_C();

• Idea: move do_C()'s work into the slots, since it runs on both paths. Have to make sure do_C doesn’t depend on do_A and do_B!!!

• A less obvious example:

do_C();
if (cond) {
    do_A();
} else {
    do_B();
}

• The branch-independent or instruction can be moved into the slot after the jump:

    Original        With the delay slot filled
    add A, B        add A, B
    or  C, D        sub E, F
    sub E, F        jle 0x200
    jle 0x200       or  C, D
    add A, C        add A, C

• Why don’t we move the sub instruction?

Resolving Control Dependencies


• Software Mechanisms
• Adding NOPs: requires compiler to insert nops, which also take
memory space — not a good idea
• Delay slot: insert instructions that do not depend on the effect
of the preceding instruction. These instructions will execute
even if the preceding branch is taken — old RISC approach
• Hardware mechanisms
• Stalling (Think of it as hardware automatically inserting nops)
• Branch Prediction
• Return Address Stack
