Lab3 Branch Prediction Hardware
Jaewoong Sim
Electrical and Computer Engineering
Seoul National University
Lab Overview
• Goal: Run real RV32I programs on your optimized CPU design
• Implement U-type instructions (lui and auipc)
[Figure: Simple CPU block diagram with five pipeline stages (IF, ID, EX, MEM, WB). Blocks shown: Instruction Memory, Immediate Generator, Register File, ALU Control Unit, ALU, Adders, Branch Predictor, Data Memory, Hazard, and Forwarding.]
You should add ports, wires, and MUXs to complete the diagram
• Implement Branch Hardware (branch predictor and branch target buffer)
• Workloads
• Unit Tests: Synthetic instructions to test the CPU design
Task 1: Arithmetic/Logical Operations
Task 2: Arithmetic/Logical Operations with Immediate
Task 3: Load/Store Operations
Task 4: Branch Instructions
Task 5: Jump Instructions
Task 6: U-type Instructions
• Follow the instructions and improve your CPU design step by step
• Part 0: Lab 3 Set Up
• Part 1: Enable Full RV32I Support
• Part 2: Measuring Baseline CPU Performance
• Part 3: Add Branch Hardware to CPU
• Part 4: Implement a Modern Branch Predictor
Part 3: Add Branch Hardware to CPU
• Branch Hardware consists of …
• Branch Predictor: Predict the direction of conditional branches
• Branch Target Buffer: Predict the target address of taken branches
Branch Target Buffer (BTB)
• Configurations
• Direct-mapped Cache
• Consists of 256 entries
• Each entry consists of a valid bit, tag bits, and a 32-bit branch target address
• Initialization
• When the active-low reset is asserted, all entries in the BTB must be invalidated
• Accessing BTB
• Index BTB using the lower bits of the PC (excluding PC[1:0])
• If a BTB miss happens, use PC + 4 as the target address
• Updating BTB
• Update BTB with the actual branch target address for all types of taken branches (taken conditional branches, jumps, etc.)
• Do not update BTB if a conditional branch is actually not taken
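The BTB rules above can be sketched as a small behavioral model. This is only an illustration in Python, not the Verilog you will submit. The class name `BTB`, its method names, and the choice of PC[31:10] as the tag are assumptions; the slides only say that each entry holds "tag bits".

```python
NUM_ENTRIES = 256          # from the slide: 256 entries, direct-mapped
INDEX_BITS = 8             # log2(256)
MASK32 = 0xFFFFFFFF

class BTB:
    """Behavioral sketch of a direct-mapped branch target buffer."""

    def __init__(self):
        self.reset()

    def reset(self):
        # On reset, every entry is invalid.
        self.valid = [False] * NUM_ENTRIES
        self.tag = [0] * NUM_ENTRIES
        self.target = [0] * NUM_ENTRIES

    @staticmethod
    def _split(pc):
        index = (pc >> 2) & (NUM_ENTRIES - 1)   # PC[9:2], skipping PC[1:0]
        tag = pc >> (2 + INDEX_BITS)            # PC[31:10] (assumed tag width)
        return index, tag

    def lookup(self, pc):
        """Return (hit, target); on a miss the target is PC + 4."""
        index, tag = self._split(pc)
        hit = self.valid[index] and self.tag[index] == tag
        target = self.target[index] if hit else (pc + 4) & MASK32
        return hit, target

    def update(self, pc, taken, actual_target):
        """Record the actual target, but only for taken branches and jumps."""
        if not taken:
            return
        index, tag = self._split(pc)
        self.valid[index] = True
        self.tag[index] = tag
        self.target[index] = actual_target & MASK32
```

Two PCs that differ by a multiple of 1024 bytes map to the same entry in this model, so the tag comparison is what keeps one branch from using another branch's target.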
Gshare Branch Predictor
• Configurations
• Branch History Register (BHR) + Pattern History Table (PHT)
• BHR
8-bit register that stores actual branch outcomes
The right-most bit indicates the youngest branch outcome
1: taken, 0: not taken
• PHT
Consists of 256 entries
Each entry consists of a 2-bit saturating counter
[Figure: gshare structure with an 8-bit BHR, an 8-bit index, and a 256-entry PHT]
• Initialization
• When the active-low reset is asserted,
BHR: 0
PHT: 01 (weakly not taken)
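The gshare configuration and reset values can be modeled in a few lines. The sketch below is a Python behavioral model for reasoning about expected behavior, not RTL. It assumes the usual gshare indexing, PC[9:2] XOR BHR, and that a counter value of 2 or 3 means "predict taken"; the class and method names are hypothetical.

```python
class Gshare:
    """Behavioral sketch: 8-bit BHR + 256-entry PHT of 2-bit counters."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.bhr = 0                  # BHR: 0
        self.pht = [0b01] * 256       # PHT: 01 (weakly not taken)

    def _index(self, pc):
        # Assumed indexing: PC[9:2] XOR BHR (standard gshare hashing).
        return ((pc >> 2) & 0xFF) ^ self.bhr

    def predict(self, pc):
        # Counter values 10 and 11 predict taken; 00 and 01 predict not taken.
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        # In a pipeline, the index computed at prediction time should be
        # carried along with the branch; this model recomputes it, which
        # is equivalent only when update() directly follows predict().
        i = self._index(pc)
        if taken:
            self.pht[i] = min(self.pht[i] + 1, 3)   # saturate at 11
        else:
            self.pht[i] = max(self.pht[i] - 1, 0)   # saturate at 00
        # Shift the actual outcome in at the right-most (youngest) bit.
        self.bhr = ((self.bhr << 1) | int(taken)) & 0xFF
```

Because every counter starts at 01, a single taken outcome is enough to flip that entry's prediction to taken.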
Next PC Selection Logic
• With branch hardware,
now there will be four possible next PC values
• PC
• PC + 4
• Predicted Taken PC (predicted target address from BTB)
• Misprediction recovery PC (actual branch target address)
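One way to think about the four candidates is as a priority MUX. The sketch below is a Python illustration under stated assumptions: the signal names are hypothetical, holding the current PC is assumed to correspond to a pipeline stall, and the priority order (recovery first, then stall, then predicted taken, then PC + 4) is one reasonable choice rather than something the slides specify.

```python
def next_pc(pc, stall, mispredicted, recovery_pc, predicted_taken, btb_target):
    """Select the next PC from the four candidates (assumed priority order)."""
    if mispredicted:
        # Misprediction recovery PC: redirect fetch to the correct path.
        return recovery_pc & 0xFFFFFFFF
    if stall:
        # Hold the current PC (assumed to be the stall case).
        return pc
    if predicted_taken:
        # Predicted taken PC: target address supplied by the BTB.
        return btb_target & 0xFFFFFFFF
    # Default: sequential fetch.
    return (pc + 4) & 0xFFFFFFFF
```

Recovery is given the highest priority here because a resolved misprediction means the instructions currently being fetched are on the wrong path anyway.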
Part 4: Implement a Modern Branch Predictor
• In Part 4, you will implement
one of the state-of-the-art branch predictors
• Implementation of BTB and Next PC Selection Logic
can be reused
• All you need to do is replace
the gshare predictor with the perceptron predictor
• All branch predictors are functionally the same;
they implement different policies to improve the prediction accuracy
Perceptron Branch Predictor
• Configuration
• Branch History Register (BHR) + Perceptron Table
• BHR
25-bit register that stores actual branch outcomes (History Length: 25)
The right-most bit indicates the youngest branch outcome
1: taken, 0: not taken
• Perceptron Table
Consists of 32 entries
Each entry consists of 25 perceptron weights + 1 bias
Weight: 7-bit 2’s complement
Output: 12-bit 2’s complement
Training Threshold: 62
[Figure: perceptron predictor structure with a 25-bit BHR, a 32-entry Perceptron Table indexed by PC bits (labeled PC[*:1] in the figure), and a ≥ comparison on the output]
• Initialization
• When the active-low reset is asserted,
BHR: 0
Perceptron Table: 0
• Accessing Perceptron Branch Predictor
• Index Perceptron Table using the lower bits of the PC (excluding PC[1:0])
• Make a branch prediction by
performing the dot product of the weights and the inputs
Inputs are the same as the BHR, except that…
The 0 in BHR is considered -1 in the inputs
Input to the bias is always set to 1
Output >= 0: Taken
* Toy Example (History Length: 4, Perceptron Table Entries: 4)
BHR:     0   1   0   1   (+ input for bias)
Input:  -1   1  -1   1   1
Weight: -2   3  -1   0   2
Output: (-1)(-2) + (1)(3) + (-1)(-1) + (1)(0) + (1)(2) = 8, so the prediction is Taken
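The prediction step can be checked against the toy example with a short Python model. This is a behavioral sketch, not RTL; the function names are hypothetical, and the weight ordering (oldest history bit first, bias weight last) follows the left-to-right layout of the toy example. With 32 table entries, the index is assumed to be PC[6:2].

```python
def table_index(pc, num_entries=32):
    # Lower PC bits, excluding PC[1:0]; PC[6:2] for 32 entries (assumed).
    return (pc >> 2) % num_entries

def bhr_to_inputs(bhr, history_length):
    """Map BHR bits to +1/-1 inputs, left-most bit first, then the bias input."""
    bits = [(bhr >> i) & 1 for i in range(history_length - 1, -1, -1)]
    return [1 if b else -1 for b in bits] + [1]   # bias input is always 1

def perceptron_output(weights, inputs):
    """Dot product of the weights (bias weight last) and the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def predict(weights, bhr, history_length):
    """Return (output, predicted_taken); output >= 0 means taken."""
    y = perceptron_output(weights, bhr_to_inputs(bhr, history_length))
    return y, y >= 0

# Toy example from the slide: BHR = 0101, weights = -2 3 -1 0 2
toy_output, toy_taken = predict([-2, 3, -1, 0, 2], 0b0101, 4)
print(toy_output, toy_taken)   # 8 True
```

In hardware the multiplications disappear, since each input is +1 or -1: every weight is simply added or subtracted depending on the corresponding history bit.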
• Updating Perceptron Branch Predictor
• The branch predictor is updated only for the conditional branch instructions
• It is updated in the MEM stage
• Updating Algorithm
Θ: Training Threshold
t: Actual branch direction
x: Perceptron inputs
w: Perceptron weights
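The algorithm itself appears on the slide as a figure that is not reproduced in this text. The symbols listed match the standard perceptron training rule from the perceptron branch predictor literature, which can be sketched as follows. Treat it as an assumption about what the slide shows: train when the prediction was wrong or when |output| <= Θ, and add t·x to each weight, with t = +1 for taken and -1 for not taken. Clamping the weights to the 7-bit 2's complement range is also an assumption.

```python
THETA = 62               # training threshold from the slide
W_MIN, W_MAX = -64, 63   # 7-bit 2's complement range (saturation is assumed)

def train(weights, inputs, output, taken, theta=THETA):
    """Return updated weights under the standard perceptron rule (assumed).

    weights: perceptron weights, bias weight last
    inputs:  +1/-1 values derived from the BHR, with the bias input (1) last
    output:  dot product that was computed when the branch was predicted
    taken:   actual branch direction
    """
    t = 1 if taken else -1
    predicted_taken = output >= 0
    if predicted_taken != taken or abs(output) <= theta:
        # Move each weight toward agreement with the actual outcome,
        # clamping so it still fits in 7 bits.
        return [max(W_MIN, min(W_MAX, w + t * x))
                for w, x in zip(weights, inputs)]
    # Correct and confident prediction: leave the weights unchanged.
    return list(weights)
```

For the toy example (output 8, below the threshold), an actual outcome of taken still triggers training, and each weight moves by one in the direction of its input.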
Tips
• Before you dive into the code, complete the diagram
• You should add ports, wires, and MUXs
• How to debug your CPU Design
• Take advantage of GTKWave
It is a very powerful debugging tool for Verilog code
Linux> ./simple_cpu
sim.vcd (value change dump file) will be generated
Linux> gtkwave sim.vcd
gtkwave will be launched, loading the sim.vcd file