Pipelining
2. The Pipeline
In pipelining, multiple tasks (for example, instructions) are executed in parallel.
To use the pipelining approach efficiently:
1. We must have tasks that are repeated many times on different data.
2. Tasks must be divided into small pieces (operations or actions) that can be
performed in parallel.
Computer Architecture
[Figure: car assembly line with three stations (Station 1, Station 2, Station 3), shown at Step = 1 with Car 1.]
After Step = 3 (the pipeline is full), at each step, a new car (task) is completed.
http://akademi.itu.edu.tr/en/buzluca/ 2013 - 2021 Feza BUZLUCA 2.2
http:// www.buzluca.info
Computer Architecture License: https://creativecommons.org/licenses/by-nc-nd/4.0/
[Figure: general structure of a pipeline with k stages (1. Stage, 2. Stage, ..., k. Stage; a stage is also called a segment or layer), all driven by the same clock.]
Example: The elements of the arrays A, B, and C are first read from memory,
and then the following operation is performed: Ai*Bi + Ci, for i = 1, 2, 3, ...
[Figure: three-stage pipeline for Ai*Bi + Ci. Inputs Ai, Bi, and Ci; the 3rd stage performs the addition and writes the result to register R5.]
Example (cont'd):
• In this example, the task is decomposed into 3 operations: Reading,
multiplication, and addition.
• We assume that arrays are in separate memory modules, which can be read in
parallel.
• We start to read elements of array C one clock cycle after reading A and B.
Functioning of the pipeline with three stages:
Clock cycle   Stage 1 (Read)   Stage 2 (Multiply)   Stage 3 (Add)
              R1    R2         R3       R4          R5
1             A1    B1         -        -           -
2             A2    B2         A1*B1    C1          -
3             A3    B3         A2*B2    C2          A1*B1 + C1 (first result)
4             A4    B4         A3*B3    C3          A2*B2 + C2 (2nd result)
5             A5    B5         A4*B4    C4          A3*B3 + C3 (3rd result)
Note:
• Assuming that the time to access the memory is significantly shorter than the
durations of the other operations and the data is always ready to be read,
reading is not treated as a separate operation.
• In this case, the pipeline could be designed with two stages which perform only
arithmetical operations: multiplication and addition.
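The register contents of the three-stage example can be reproduced with a short simulation. The Python sketch below is illustrative only: the function name run_pipeline and the use of plain lists for the arrays are assumptions, and it models the three-stage version (read, multiply, add) in which all registers load on the same clock edge.

```python
def run_pipeline(A, B, C):
    """Simulate the three-stage pipeline computing A[i]*B[i] + C[i].

    Returns (results, trace); each trace entry is (cycle, R1, R2, R3, R4, R5).
    """
    n = len(A)
    r1 = r2 = r3 = r4 = r5 = None
    results, trace = [], []
    for cycle in range(1, n + 3):  # n tasks need n + k - 1 = n + 2 cycles (k = 3)
        # All registers load on the same clock edge, so the new values
        # are computed from the old register contents first.
        new_r5 = r3 + r4 if r3 is not None else None          # stage 3: add
        new_r3 = r1 * r2 if r1 is not None else None          # stage 2: multiply
        new_r4 = C[cycle - 2] if 2 <= cycle <= n + 1 else None  # C is read one cycle later
        new_r1 = A[cycle - 1] if cycle <= n else None         # stage 1: read A
        new_r2 = B[cycle - 1] if cycle <= n else None         # stage 1: read B
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        trace.append((cycle, r1, r2, r3, r4, r5))
        if r5 is not None:
            results.append(r5)
    return results, trace


results, trace = run_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9])
for row in trace:
    print(row)
```

The first result appears in R5 at clock cycle 3, and one further result appears in every cycle after that.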
Space-time diagram (each stage row is shifted by one clock cycle):
S2      T1  T2  T3  T4  T5  T6
S3          T1  T2  T3  T4  T5
S4              T1  T2  T3  T4
The 1st task (T1) is completed in 4 clock cycles (number of stages k = 4).
After the kth cycle, a new task is completed in each clock cycle.
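The timing rule stated here implies that task i leaves a k-stage pipeline at clock cycle k + i - 1, so n tasks need k + n - 1 cycles in total. The following Python sketch generates the rows of such a space-time diagram; the function names space_time and completion_cycle are chosen for illustration only.

```python
def space_time(k, n):
    """Rows of the space-time diagram for k stages and n tasks.

    rows[s][c] is the task label in stage s+1 during clock cycle c+1,
    or an empty string if the stage is idle.
    """
    cycles = k + n - 1
    rows = []
    for s in range(k):
        row = []
        for c in range(cycles):
            t = c - s  # zero-based index of the task in stage s at cycle c
            row.append(f"T{t + 1}" if 0 <= t < n else "")
        rows.append(row)
    return rows


def completion_cycle(k, i):
    """Clock cycle in which task i (1-based) leaves a k-stage pipeline."""
    return k + i - 1


for s, row in enumerate(space_time(4, 4), start=1):
    print(f"S{s}", " ".join(f"{cell:>3}" for cell in row))
```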
Since all stages proceed at the same time, the time (delay) required for the
slowest stage determines the length of the period of the clock signal (cycle time).
The cycle time (the period of the clock) tp can be determined as follows:
tp = max(τi) + dr = τM + dr
tp: cycle time
τi: time delay of the circuitry in the ith stage
τM: maximum stage delay (the slowest stage)
dr: time delay of the register
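The cycle-time formula can be evaluated directly. In the Python sketch below, the function names and the sample stage delays are hypothetical values chosen only for illustration; the corresponding maximum clock frequency is added as a convenience.

```python
def cycle_time(stage_delays_ns, register_delay_ns):
    """tp = max(tau_i) + dr, with all delays in nanoseconds."""
    return max(stage_delays_ns) + register_delay_ns


def max_clock_mhz(tp_ns):
    """Highest clock frequency (in MHz) for a cycle time given in nanoseconds."""
    return 1000.0 / tp_ns


# Hypothetical stage delays; the slowest stage (50 ns) sets the clock.
tp = cycle_time([20, 50, 30], register_delay_ns=5)
print(tp, "ns,", round(max_clock_mhz(tp), 2), "MHz")
```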
Speedup:
k: number of stages in the pipeline
tp: cycle time
n: number of tasks
tn : time required for a task without pipelining
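With these symbols, a commonly used form of the pipeline speedup is S = (n·tn) / ((k + n - 1)·tp): n tasks take n·tn without pipelining and (k + n - 1)·tp with a k-stage pipeline. This form is an assumption here (it is the standard textbook expression, and it approaches tn/tp as n grows). A minimal Python sketch with illustrative numbers:

```python
def speedup(n, k, t_n, t_p):
    """Speedup of a k-stage pipeline over unpipelined execution of n tasks.

    Assumed form: S = (n * t_n) / ((k + n - 1) * t_p).
    """
    return (n * t_n) / ((k + n - 1) * t_p)


# Illustrative values: t_n = 100 ns, k = 4 stages, t_p = 25 ns.
for n in (1, 10, 100, 10_000):
    print(n, round(speedup(n, 4, 100, 25), 3))
```

For a single task there is no gain (S = 1 in this example); the speedup approaches tn/tp only when many tasks flow through the pipeline.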
Comments on speedup:
To improve the performance of the pipeline, tasks must be divided into small and
balanced operations with equal (or at least similar) durations.
If the durations of the operations are short, then the clock cycle (tp) can be short.
Remember: The slowest stage determines the clock cycle.
Effects of increasing the number of stages of a pipeline:
Advantage:
• If the task can be divided into many small operations, increasing the number of
stages can lower the clock cycle (tp), and consequently the speedup increases.
lim (n→∞) S = tn / tp, so Smax = k (theoretical, when tn = k·tp)
Disadvantages:
• The cost of the pipeline increases. At each stage of the pipeline, there is some
overhead (cost, energy, space) because of registers and additional connections.
• The completion time of the first task increases. T(1) = k·tp
• Branch penalties in the instruction pipeline caused by control hazards increase.
We will discuss branch penalties in the section "2.5 Pipeline hazards".
While designing a pipeline, these advantages and disadvantages should be taken
into consideration.
If the delay of the registers is 5 ns, then the clock cycle is tp = 50 + 5 = 55 ns.
Case C: We partition the task into three stages with similar durations.
Conclusion:
Pipelining has advantages if a task can be partitioned into small and balanced
operations.
It is important to decrease the length of the clock cycle (tp).
For example, if we could partition the task into five operations, each having a
duration of 20 ns, we would have a clock cycle of 20 + 5 = 25 ns.
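The effect of balancing can be checked numerically. The Python sketch below compares an unbalanced partition with the balanced five-stage partition mentioned above. It assumes that the unpipelined task time tn equals the sum of the stage delays (100 ns) and that the register delay is 5 ns; the unbalanced delays [30, 50, 20] are hypothetical.

```python
def analyze(stage_delays_ns, register_delay_ns=5):
    """Return (tp, limiting speedup tn/tp) for a given partition.

    Assumption: the unpipelined task time tn is the sum of the stage delays.
    """
    t_n = sum(stage_delays_ns)
    t_p = max(stage_delays_ns) + register_delay_ns
    return t_p, t_n / t_p


print(analyze([30, 50, 20]))  # unbalanced: tp = 55 ns
print(analyze([20] * 5))      # balanced:   tp = 25 ns
```

Under these assumptions, even the balanced five-stage design reaches a limiting speedup of 4 rather than 5, because the register delay is added to every clock cycle.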
b. Conditional Branch:
For a conditional branch instruction, there are two cases:
1. condition is false (branch is not taken), 2. condition is true (branch is taken)
b1. Conditional Branch (if the condition is false):
If the condition is not true, it is not necessary to stop or empty the pipeline
because the execution will continue with the next instruction.
Instructions          Clock cycles: 1    2    3    4    5    6
Instruction 1                       FI   DA   FO   EX
Conditional branch 2                     FI   DA   FO   EX
Instruction 3                                 FI   DA   FO   EX
• The previous instruction sets the conditions (flags).
• Without considering the condition, the next instruction is fetched.
• PC is not changed. No branching.
• The instruction following the branch is executed.
• No need to empty the pipeline. No branch penalty.
Here, the problem is that the previous instruction must be executed to determine
if the condition is true or not (depends on the flags of the CPU).
• If condition is false (branch is not taken), there is no branch penalty.
• If condition is true, a solution mechanism is necessary (next slide).
2. Immediate mode
• ADD Rs, S2, Rd     Rd ← Rs + S2       (S2: immediate data)
• LDL S2(Rs), Rd     Rd ← M[Rs + S2]    Load long (32 bits)
Bits     31-26    25-21   20-16   15   14-0
Field    Opcode   Rd      Rs      1    S2 (immediate data)
Width    6        5       5       1    15
3. Relative
• BRU Y     PC ← PC + Y                 Unconditional branch
• Bcc Y     If (cc) then PC ← PC + Y    Conditional branch
Bits     31-26    25-21            20-0
Field    Opcode   CC (condition)   Y (signed offset)
Width    6        5                21
CC = 0: BRU (unconditional)
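The two 32-bit formats can be unpacked with shifts and masks. The Python sketch below follows the bit positions given above; the helper names and the opcode values used in the example are hypothetical, since no opcode encodings are listed here. The 21-bit offset Y is sign-extended because it is described as a signed offset; S2 is returned as raw bits.

```python
def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit word as an unsigned value."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_immediate(word):
    """Fields of the immediate-mode format: Opcode | Rd | Rs | 1 | S2."""
    return {
        "opcode": bits(word, 31, 26),
        "rd": bits(word, 25, 21),
        "rs": bits(word, 20, 16),
        "imm_flag": bits(word, 15, 15),
        "s2": bits(word, 14, 0),
    }


def decode_relative(word):
    """Fields of the relative (branch) format: Opcode | CC | Y."""
    y = bits(word, 20, 0)
    if y & (1 << 20):  # sign-extend the 21-bit offset
        y -= 1 << 21
    return {"opcode": bits(word, 31, 26), "cc": bits(word, 25, 21), "y": y}


# Hypothetical opcode value 1; Rd = 5, Rs = 4, immediate flag = 1, S2 = 0x500.
word = (1 << 26) | (5 << 21) | (4 << 16) | (1 << 15) | 0x500
print(decode_immediate(word))
```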
[Figure: datapath of the exemplary processor: PC with an adder for the next instruction address, a PC-relative adder (PC_Rel, offset/immediate) for the branch address, PC_Select (0: next instruction address, 1: branch address), register file (RA, RB, RD, WE; Rs1, Rs2, Rd), ALU (Opr, A_Out, B_Sel, R_Sel) with flags (C, Z, V, N), and control logic driven by the opcode.]
The "+1" at the PC is actually +4 if the instruction length is 4 bytes.
CL: Control Logic. A digital circuit that decodes the instructions and generates the control signals.
In this course, to explain the concepts, we will use an exemplary five-stage RISC
load-store architecture:
1. Instruction fetch (IF):
Get instruction from memory, increment PC (depending on the instruction
length).
If instruction length is 4 bytes, PC ← PC + 4.
2. Instruction Decode, Read registers (DR)
Translate opcode into control signals and read registers (operands).
3. Execute (EX)
Perform ALU operation, compute jump/branch targets.
4. Memory (ME)
Access memory if needed (only load/store instructions).
5. Write back (WB)
Update register file (write results).
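To make the role of each stage concrete, the following Python sketch walks a single load instruction of the form LDL offset(base), dest through the five steps in sequence. It is not a pipeline model; the function name, the dictionary-based register file and memory, and the sample values are assumptions made for illustration.

```python
def execute_load(pc, regs, memory, base, offset, dest, instr_len=4):
    """Walk one load instruction (LDL offset(base), dest) through the five steps.

    Returns the next PC and the updated register file.
    """
    # 1. IF: the instruction is fetched (not modeled) and the PC is incremented.
    next_pc = pc + instr_len
    # 2. DR: the opcode is decoded and the base register is read.
    a = regs[base]
    # 3. EX: the ALU adds the offset to the base register (memory address).
    address = a + offset
    # 4. ME: the data memory is read at the computed address.
    data = memory[address]
    # 5. WB: the loaded value is written to the destination register.
    new_regs = dict(regs)
    new_regs[dest] = data
    return next_pc, new_regs


regs = {"R4": 0x100, "R5": 0}
memory = {0x600: 42}
print(execute_load(0x200, regs, memory, base="R4", offset=0x500, dest="R5"))
```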
[Figure: IF stage: PC, an adder producing the next instruction address PC+1 (actually +4 if the instruction length is 4 bytes), a multiplexer controlled by PC_Select (0: next instruction address, 1: branch target address from other stages), and the IF/DR pipeline register.]
• ... assume no branches for now).
• Write the instruction bits (op code, Rs1, Rs2, Rd, offset/immediate) to the
pipeline register (IF/DR).
• Write PC+1 to the pipeline register (for calculating the branch address in
other stages).
• In case of a branch, PC_Select = 1 and the branch target address is written to PC.
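[Figure: DR stage between the IF/DR and DR/EX pipeline registers: register file addressed by Rs1, Rs2, Rd (source registers Ra, Rb), control logic, and the off/imm and PC+1 values passed on.]
• Decode instruction, generate control signals.
• Written to the pipeline register (DR/EX):
o control bits (control bits that control all operational units in the processor)
o offset/immediate
o contents of RA, RB
o PC+1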
• Read the control bits and data (offset/immediate, RA, RB) from the pipeline
register (DR/EX).
• Perform the ALU operation.
The ALU also calculates memory addresses for LOAD/STORE instructions.
For example: LDL $500(R4), R5     R5 ← M[R4 + $500]
The immediate value $500 is added to the contents of R4 in the ALU.
• Compute target addresses for the branch instructions.
For example: BGT $0A     If greater, then PC ← PC + $0A
In this exemplary processor, an additional adder is used for target address
calculation.
• Decide if the jump/branch should be taken (control bits and flags from the
ALU are used)
• Write the following data to the pipeline register (EX/ME):
o Control bits
o Result of the ALU (D) and flags (F)
o RB for memory store operations (B)
o Branch target address
[Figure: EX stage between the DR/EX and EX/ME pipeline registers: ALU (operations +, -, shift, ...) with B_Select choosing between B and off/imm, output D and A_Out; B is passed on to the data memory; a separate adder performs the relative branch address calculation from PC+1; the Branch? decision (PC_Select) and the branch address are sent to Stage 1.]
[Figure: ME stage: data memory (Address, Din, R/W, CS) between the EX/ME and ME/WB pipeline registers; inputs D, F, and B come from Stage 3: Execute (EX).]
• ... the pipeline register (EX/ME).
• Read data B (for ... pipeline register.
• Perform memory load/store if needed.
• Write the following data to the pipeline register (ME/WB):
o Result of ALU operation (D) (pass)
[Figure: WB stage: the value from the ME/WB pipeline register is written to the destination register Rd of the register file, with write enable (WE) set by the control bits.]
Instructions      Clock cycles: 1    2    3    4    5    6
ADD R1,R2,R3                    IF   DR   EX   ME   WB
SUB R3,R4,R5                         IF   DR   EX   ME   WB
Instructions      Clock cycles: 1    2    3    4    5    6    7    8
ADD R1,R2,R3                    IF   DR   EX   ME   WB
SUB R3,R4,R5                         IF   -    -    DR   EX   ME   WB
In clock cycle 5: first write to R3 in the first half, then read it in the second half.
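The number of stall cycles in this diagram follows from the stage positions: without forwarding, the dependent instruction may read the register in its DR stage no earlier than the clock cycle in which the producer is in WB (write in the first half, read in the second half). The Python sketch below encodes this rule; the tuple layout for instructions and the function names are assumptions made for illustration.

```python
# Stage numbers in the five-stage pipeline: IF=1, DR=2, EX=3, ME=4, WB=5.
DR_STAGE, WB_STAGE = 2, 5


def reads_result_of(producer, consumer):
    """True if the consumer reads the register written by the producer.

    Instructions are (name, sources, destination) tuples.
    """
    _, _, dest = producer
    _, sources, _ = consumer
    return dest in sources


def stalls_without_forwarding(distance):
    """Stall cycles for a dependent instruction that follows the producer
    by `distance` instructions (1 = immediately after), assuming the
    register file is written in the first half of a cycle and read in
    the second half."""
    return max(0, (WB_STAGE - DR_STAGE) - distance)


add = ("ADD", ("R1", "R2"), "R3")   # R3 <- R1 + R2
sub = ("SUB", ("R3", "R4"), "R5")   # R5 <- R3 - R4
if reads_result_of(add, sub):
    print(stalls_without_forwarding(1))  # prints 2
```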
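[Figure: EX stage with an A_Select multiplexer (inputs 0 and 1) at ALU input A, fed from Stage 2: Decode, Read (DR); the B_Select multiplexer chooses between B and off/imm; the ALU (operations +, -, shift, ...) produces the result D and the flags.]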
[Figure: EX and ME stages with an OperandSelect multiplexer at the ALU input; the ALU (operations +, -, shift, ...) produces the result D and flags F; the data memory has Address, Din, Dout, R/W, and CS connections.]
LDL $500(R4), R1     IF   DR   EX   ME   WB
ADD R1, R2, R3            IF   -    DR   EX   ME   WB
• The previous value (not valid) of R1 is fetched. This invalid value will not be used
in the EX cycle.
• The control unit of the pipeline selects the forwarding path as the input, not the
value that has been read in the DR stage.
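With forwarding, the stall count depends on when the producer's result exists: a loaded value is available only after ME, while the dependent instruction needs it at the start of EX. The Python sketch below captures this timing. The function name is illustrative, and the case of an ALU producer (result forwarded after EX, no stall) is an assumption based on the usual five-stage design rather than on this example.

```python
EX_STAGE, ME_STAGE = 3, 4


def stalls_with_forwarding(producer_op, distance):
    """Stall cycles for a dependent instruction `distance` positions after
    the producer (1 = immediately after), with operand forwarding.

    Assumption: load results are ready after ME, ALU results after EX.
    """
    ready_stage = ME_STAGE if producer_op == "LDL" else EX_STAGE
    return max(0, ready_stage - EX_STAGE - (distance - 1))


print(stalls_with_forwarding("LDL", 1))  # load followed by a dependent ADD
print(stalls_with_forwarding("ADD", 1))  # ALU producer (assumed: no stall)
```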
In the case of a stall, the branch penalty is 3 cycles for this exemplary CPU.
[Figure: EX stage with the ALU (inputs A and B or off/imm, output D, flags F) and the relative branch address calculation from PC+1.]
... operations are performed in the EX stage, and results are sent directly to the IF
stage.
In the case of a stall, we will have 2 cycles (instead of 3) of branch penalty if
the branch is taken (slide 2.54).
Example:
Instructions
SUB R1, R2, R1             IF   DR   EX   ME   WB
BGT $1C                         IF   DR   EX   ME   WB
ADD R1, R1, R2                       IF   DR   EX   ME   WB     (should be skipped)
ADD R3, R4, R2                            IF   DR   EX   ME   WB     (should be skipped)
Target: STL $00(R6), R2                        IF   DR   EX   ME   WB
• The target address ($108 + $1C = $124) has been calculated, and the branch decision
has been made (in EX).
• The target address is sent to the IF stage.
In the case of a stall, the branch penalty is 2 cycles for this exemplary pipeline.
http://akademi.itu.edu.tr/en/buzluca/ 2013 - 2021 Feza BUZLUCA 2.54
[Figure: DR stage with the register file (outputs A and B), the control logic (instruction decoding), and an adder that computes the branch target address from PC+1 and the offset/immediate; the result is sent to Stage 1 (IF).]
... unconditional branch instruction BRU is 1 cycle.
Example:
Instructions
SUB R1, R2, R1             IF   DR   EX   ME   WB
BRU $1C                         IF   DR   EX   ME   WB
ADD R1, R1, R2                       IF   DR   EX   ME   WB     (should be skipped)
Target: STL $00(R6), R2                   IF   DR   EX   ME   WB
• The target address ($108 + $1C = $124) has been calculated.
• The target address is sent to the IF stage.
For the unconditional branch instruction, the branch penalty is 1 cycle after
moving the address calculation operation to the DR stage.
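The target address arithmetic in these examples, and the cost of taken branches, can be sketched as follows. In this Python sketch the branch is assumed to be located at address $104 with 4-byte instructions, so that the next instruction address is $108; the function names and the simple cost model (ideal pipeline time plus one penalty per taken branch) are assumptions made for illustration.

```python
def branch_target(branch_address, offset, instr_len=4):
    """Target of a relative branch: next instruction address plus offset."""
    return branch_address + instr_len + offset


def total_cycles(n_instructions, taken_branches, penalty, k=5):
    """Cycles for n instructions on a k-stage pipeline when each taken
    branch costs `penalty` extra cycles (simplified model)."""
    return (k + n_instructions - 1) + taken_branches * penalty


print(hex(branch_target(0x104, 0x1C)))     # 0x124
for penalty in (3, 2, 1):                  # the three penalties discussed above
    print(penalty, total_cycles(100, 10, penalty))
```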
Instructions
SUB R1, R2, R1             IF   DR   EX   ME   WB
BGT $1C                         IF   DR   EX   ME   WB
NOOP                                 IF   DR   EX   ME   WB     (inserted by the compiler)
NOOP                                      IF   DR   EX   ME   WB     (inserted by the compiler)
Target: STL $00(R6), R2                        IF   DR   EX   ME   WB
There are two types of branch prediction mechanisms: static and dynamic.
Static branch prediction strategies:
a) Always predict not taken: Always assumes that the branch will not be taken
and fetches the next instruction in sequence.
b) Always predict taken: Always predicts that the branch will be taken and
fetches the target instruction of the branch.
To determine the target of the branch in advance (without calculation), the
branch target table is used (slide 2.66).
Studies analyzing program behavior have shown that conditional branches are
taken more than 50% of the time.
Therefore, always prefetching from the branch target address should give better
performance than always prefetching from the sequential path.
[Figure: BHT (branch history table), which holds recent conditional branch instructions in the current program.]
A) We assume that at the beginning of the given piece of code, the BNZ instruction
is in the BHT and the value of its p bit is 1 (predict to take the branch).
In the first iteration (step) of the loop, the prediction at BNZ will be correct and
the pipeline will prefetch the correct instruction (beginning of the loop).
The p bit (p=1) is not changed until the last iteration of the loop.
In the last iteration of the loop, the p bit is still 1, and the prediction is to take the
branch; however, as the counter is zero, the program will not jump, and it will
instead continue with the next instruction following the branch (misprediction).
The p bit of BNZ is cleared (p ← 0) because the branch is not taken in the last step.
As a result, in a loop with 100 iterations, there are 99 correct predictions and only
one incorrect prediction.
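The counting argument above can be checked with a few lines of code. The Python sketch below models a single p bit that predicts the last outcome of the branch; the function name and the encoding of outcomes as 1 (taken) and 0 (not taken) are assumptions made for illustration.

```python
def simulate_one_bit(outcomes, p):
    """Run a 1-bit predictor over a sequence of branch outcomes.

    outcomes: iterable of 1 (taken) / 0 (not taken); p: initial p bit.
    Returns (number of mispredictions, final p bit).
    """
    misses = 0
    for taken in outcomes:
        if p != taken:
            misses += 1
        p = taken  # the p bit records the most recent outcome
    return misses, p


# A loop with 100 iterations: BNZ is taken 99 times, then falls through.
loop = [1] * 99 + [0]
print(simulate_one_bit(loop, p=1))  # (1, 0)
```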
Remember: In the previous example, after exiting the loop, the p bit of the inner
BNZ LOOP was 0 ("do not take the branch").
Now, if the same loop runs again (2nd run), in the first iteration (step), the
prediction about the BNZ will be "not to take the branch" (p=0).
However, the program will jump to the beginning of the loop (first misprediction).
Now, the p bit will be 1 because the branch is taken (p ← 1).
Until the last iteration of the loop, predictions will be correct.
In the last iteration of the loop, there will be a misprediction as in the previous
example (second misprediction).
Hence, two mispredictions occur in every complete run of the inner loop after the first run.
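The effect of keeping the p bit between runs can be sketched as follows. This is an illustrative model, assuming the BHT entry survives from one run of the loop to the next; `mispredictions_per_run` is a hypothetical name.

```python
def mispredictions_per_run(runs, iterations, p=1):
    """Count one-bit mispredictions of the loop-closing BNZ in each run.

    The p bit is carried over between runs, as it would be in the BHT.
    """
    counts = []
    for _ in range(runs):
        missed = 0
        for i in range(iterations):
            taken = i < iterations - 1   # not taken only at loop exit
            if (p == 1) != taken:
                missed += 1
            p = 1 if taken else 0        # remember the last outcome
        counts.append(missed)
    return counts


print(mispredictions_per_run(runs=3, iterations=10))  # [1, 2, 2]
```

Only the first run, which starts with p = 1, has a single misprediction.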
http://akademi.itu.edu.tr/en/buzluca/ 2013 - 2021 Feza BUZLUCA 2.72
http:// www.buzluca.info
T: Branch is taken        √: Prediction was correct
N: Branch is not taken    ∅: Misprediction

State:       11  11  10  11  10  00  00  01  00  01  11
Prediction:  T   T   T   T   T   N   N   N   N   N   T
Actual:      T√  N∅  T√  N∅  N∅  N√  T∅  N√  T∅  T∅  T√

From "Take" to "Not take": the prediction changes after 2 consecutive mispredictions (columns 4-5).
From "Not take" to "Take": the prediction changes after 2 consecutive mispredictions (columns 9-10).
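The trace can be reproduced with a small state-machine sketch. The transition table below is inferred from the state sequence in the trace (states 11 and 10 predict taken, states 01 and 00 predict not taken); the names `NEXT_STATE` and `trace` are hypothetical.

```python
# (current state, branch actually taken?) -> next state, inferred from the trace
NEXT_STATE = {
    ("11", True): "11", ("11", False): "10",
    ("10", True): "11", ("10", False): "00",
    ("00", True): "01", ("00", False): "00",
    ("01", True): "11", ("01", False): "00",
}


def trace(outcomes, state="11"):
    """Return one (state, prediction, actual, correct) row per executed branch."""
    rows = []
    for actual in outcomes:
        prediction = "T" if state in ("11", "10") else "N"
        rows.append((state, prediction, actual, prediction == actual))
        state = NEXT_STATE[(state, actual == "T")]
    return rows


for row in trace("TNTNNNTNTTT"):
    print(row)
```

A single misprediction (for example, the N in the second column) moves the state from 11 to 10 but does not change the prediction.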
[Figure: State diagram of the two-bit branch prediction mechanism. States 11 and 10: predict taken. States 01 and 00: predict not taken.]
Example:
Problem:
A CPU has an instruction pipeline, where hardware-based mechanisms are used
to solve branch hazards.
This CPU runs the given piece of code below, which includes two nested loops.
        Counter1 ← 10
LOOP1   ------                     ; Any instruction
        Counter2 ← 10
LOOP2   ------                     ; Any instruction
        ------                     ; Any instruction
        Counter2 ← Counter2 - 1
        BNZ LOOP2                  ; Branch if not zero
        ------                     ; Instruction after loop2
        Counter1 ← Counter1 - 1
        BNZ LOOP1                  ; Branch if not zero
        ------                     ; Instruction after loop1
For each branch prediction mechanism, give the number of correct predictions
and mispredictions for the two branch instructions (BNZ) in the given piece of
code.
Briefly explain your results.
http://akademi.itu.edu.tr/en/buzluca/ 2013 - 2021 Feza BUZLUCA 2.76
http:// www.buzluca.info
Solution:
a. Static prediction
i) Always predict not taken (for this method, a BTT (branch target table) is
not necessary)
BNZ LOOP1: There is a correct prediction only in the last iteration (exit).
Other predictions are incorrect.
Correct: 1   Incorrect: 9
BNZ LOOP2: There is a correct prediction only in the last iteration (exit).
Other predictions are incorrect.
Correct: 10x1 = 10   Incorrect: 10x9 = 90
Total: Correct: 11   Incorrect: 99
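These counts can be verified by listing the outcome of every executed BNZ in the nested loops. The sketch below assumes both counters start at 10, as in the problem; `branch_outcomes` and `count_always_not_taken` are hypothetical helper names.

```python
def branch_outcomes(outer=10, inner=10):
    """Yield (branch name, taken?) in execution order for the two nested loops."""
    for i in range(outer):
        for j in range(inner):
            yield "BNZ LOOP2", j < inner - 1   # not taken only when Counter2 reaches 0
        yield "BNZ LOOP1", i < outer - 1       # not taken only when Counter1 reaches 0


def count_always_not_taken():
    """Static 'always not taken': a prediction is correct only when the branch falls through."""
    stats = {"BNZ LOOP1": [0, 0], "BNZ LOOP2": [0, 0]}   # [correct, incorrect]
    for name, taken in branch_outcomes():
        stats[name][1 if taken else 0] += 1
    return stats


print(count_always_not_taken())
# {'BNZ LOOP1': [1, 9], 'BNZ LOOP2': [10, 90]}
```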
ii-2) Always predict taken, under the assumption that the instructions are NOT in the BTT
BNZ LOOP1: There are mispredictions only in the first iteration (the branch target
is not yet in the BTT) and in the last iteration (exit).
Other predictions are correct.
Correct: 8   Incorrect: 2
BNZ LOOP2: In the first run of the loop, there are mispredictions only in the
first and last iterations; other predictions are correct.
In the 2nd-10th runs, there is a misprediction only in the last
iteration (exit).
Correct: 8 + 9x9 = 89   Incorrect: 2 + 9x1 = 11
Total: Correct: 97   Incorrect: 13
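A possible way to model this case is sketched below. It rests on two assumptions that are not stated explicitly in the solution: while a branch is not in the BTT, the pipeline can only prefetch the sequential path, and a branch is entered into the BTT the first time it is taken. The function name is hypothetical.

```python
def count_always_taken_cold_btt(outer=10, inner=10):
    """Static 'always taken' with an initially empty BTT (modeling assumptions above)."""
    btt = set()                                          # branches with a stored target
    stats = {"BNZ LOOP1": [0, 0], "BNZ LOOP2": [0, 0]}   # [correct, incorrect]

    def execute(name, taken):
        fetched_target = name in btt     # without a BTT entry, the sequential path is fetched
        stats[name][0 if fetched_target == taken else 1] += 1
        if taken:
            btt.add(name)                # assumed: the target is stored once the branch is taken

    for i in range(outer):
        for j in range(inner):
            execute("BNZ LOOP2", j < inner - 1)
        execute("BNZ LOOP1", i < outer - 1)
    return stats


print(count_always_taken_cold_btt())
# {'BNZ LOOP1': [8, 2], 'BNZ LOOP2': [89, 11]}
```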
http://akademi.itu.edu.tr/en/buzluca/ 2013 - 2021 Feza BUZLUCA 2.78
http:// www.buzluca.info
Solution (cont’d):
b. Dynamic prediction with one bit
Note: Different prediction bits are used for each branch instruction (Slides 2.68,
2.69).
i) Assumption: In the beginning, the instructions are in the BHT, and the initial
decision is to take the branch (p = 1)
BNZ LOOP1: There is a misprediction only in the last iteration (exit). Other
predictions are correct.
Correct: 9   Incorrect: 1
BNZ LOOP2: In the first run of the loop, there is a misprediction only in the
last iteration (exit).
Other predictions are correct.
After the first run, the prediction bit "p" changes to "branch
will not be taken".
Therefore, in the 2nd-10th runs, there are mispredictions in both
the first and last iterations (Slide 2.71).
Correct: 9 + 9x8 = 81   Incorrect: 1 + 9x2 = 19
Total: Correct: 90   Incorrect: 20
http://akademi.itu.edu.tr/en/buzluca/ 2013 - 2021 Feza BUZLUCA 2.79
http:// www.buzluca.info
ii) Assumption: In the beginning, the instructions are NOT in the BHT, or the initial
decision is NOT to take the branch (p = 0)
BNZ LOOP1: There are mispredictions in the first and last iterations.
Other predictions are correct.
Correct: 8   Incorrect: 2
BNZ LOOP2: In each run of the loop, there are mispredictions in the first and
last iterations.
Other predictions are correct.
Correct: 10x8 = 80   Incorrect: 10x2 = 20
Total: Correct: 88   Incorrect: 22
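Both one-bit cases can be checked with one short simulation. In this sketch, each BNZ has its own p bit, and a missing BHT entry is modeled as p = 0, which is an assumption made here for simplicity; `one_bit_nested` is a hypothetical name.

```python
def one_bit_nested(initial_p, outer=10, inner=10):
    """One-bit dynamic prediction for the two BNZ instructions of the nested loops."""
    p = {"BNZ LOOP1": initial_p, "BNZ LOOP2": initial_p}   # separate p bit per branch
    stats = {name: [0, 0] for name in p}                   # [correct, incorrect]

    def execute(name, taken):
        predicted_taken = (p[name] == 1)
        stats[name][0 if predicted_taken == taken else 1] += 1
        p[name] = 1 if taken else 0                        # remember the last outcome

    for i in range(outer):
        for j in range(inner):
            execute("BNZ LOOP2", j < inner - 1)
        execute("BNZ LOOP1", i < outer - 1)
    return stats


print(one_bit_nested(initial_p=1))  # {'BNZ LOOP1': [9, 1], 'BNZ LOOP2': [81, 19]}
print(one_bit_nested(initial_p=0))  # {'BNZ LOOP1': [8, 2], 'BNZ LOOP2': [80, 20]}
```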
c. Dynamic prediction with two bits
i) Assumption: In the beginning, the instructions are in the BHT, and the initial
decision is to take the branch (prediction bits are 11).
BNZ LOOP1: There is a misprediction only in the last iteration (exit).
Other predictions are correct.
Correct: 9 Incorrect: 1
BNZ LOOP2: There is a misprediction only in the last iteration (exit).
Other predictions are correct.
Correct: 10x9 = 90 Incorrect: 10x1 = 10
Total: Correct: 99 Incorrect: 11
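The two-bit result can be reproduced with the sketch below. The transition table is an assumption taken from the state sequence of the two-bit trace shown earlier (11 and 10 predict taken, 01 and 00 predict not taken); `NEXT_STATE` and `two_bit_nested` are hypothetical names.

```python
# (current state, branch actually taken?) -> next state (assumed from the earlier trace)
NEXT_STATE = {
    ("11", True): "11", ("11", False): "10",
    ("10", True): "11", ("10", False): "00",
    ("00", True): "01", ("00", False): "00",
    ("01", True): "11", ("01", False): "00",
}


def two_bit_nested(initial_state="11", outer=10, inner=10):
    """Two-bit dynamic prediction for the two BNZ instructions of the nested loops."""
    state = {"BNZ LOOP1": initial_state, "BNZ LOOP2": initial_state}
    stats = {name: [0, 0] for name in state}               # [correct, incorrect]

    def execute(name, taken):
        predicted_taken = state[name] in ("11", "10")
        stats[name][0 if predicted_taken == taken else 1] += 1
        state[name] = NEXT_STATE[(state[name], taken)]

    for i in range(outer):
        for j in range(inner):
            execute("BNZ LOOP2", j < inner - 1)
        execute("BNZ LOOP1", i < outer - 1)
    return stats


print(two_bit_nested())  # {'BNZ LOOP1': [9, 1], 'BNZ LOOP2': [90, 10]}
```

The loop exit moves BNZ LOOP2 only from 11 to 10, so the first iteration of the next run is still predicted as taken.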