Computer Architecture

Lab 3: Branch Prediction Hardware Implementation

Jaewoong Sim
Electrical and Computer Engineering
Seoul National University
Lab Overview
• Goal: Run real RV32I programs on your optimized CPU design
• Implement U-type instructions (lui and auipc)
[Figure: Simple CPU block diagram. Five-stage pipeline (IF, ID, EX, MEM, WB) with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Blocks shown: PC, Instruction Memory, Register File, Immediate Generator, Control Unit, ALU Control Unit, ALU, Adders, Branch Control Unit, Data Memory, Hazard, Forwarding, Branch Predictor, and Branch Target Buffer.]

You should add ports, wires, and MUXs to complete the diagram
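As a rough behavioral reference for the two U-type instructions, the sketch below computes the value written to rd. It follows the standard RV32I encoding (opcode 0110111 for lui, 0010111 for auipc, immediate in bits 31:12); the function name u_type_result is hypothetical and not part of the lab skeleton.

```python
MASK32 = 0xFFFFFFFF

def u_type_result(inst, pc):
    """Value written to rd by lui/auipc (hypothetical helper, not from the lab code)."""
    opcode = inst & 0x7F
    imm = inst & 0xFFFFF000          # imm[31:12] already in place, low 12 bits are zero
    if opcode == 0b0110111:          # lui: rd = imm
        return imm
    if opcode == 0b0010111:          # auipc: rd = pc + imm (wraps at 32 bits)
        return (pc + imm) & MASK32
    raise ValueError("not a U-type instruction")

# lui x5, 0x12345 is encoded as 0x123452B7
print(hex(u_type_result(0x123452B7, pc=0)))
# auipc x5, 0x1 is encoded as 0x00001297
print(hex(u_type_result(0x00001297, pc=0x100)))
```

In the datapath, this corresponds to routing the U-type immediate (and, for auipc, the PC) into the write-back path, which is one reason extra MUXs are needed.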
Lab Overview
• Goal: Run real RV32I programs on your optimized CPU design
• Implement Branch Hardware (branch predictor and branch target buffer)
Lab Overview
• Workloads
• Unit Tests: Synthetic instructions to test the CPU design
  ▪ Task 1: Arithmetic/Logical Operations
  ▪ Task 2: Arithmetic/Logical Operations with Immediate
  ▪ Task 3: Load/Store Operations
  ▪ Task 4: Branch Instructions
  ▪ Task 5: Jump Instructions
  ▪ Task 6: U-type Instructions

• Benchmarks: Real RV32I programs to test the branch hardware performance
  ▪ bst_array, fibo, matmul, quicksort, spmv, spconv

• Do not modify inst.mem
• If you do need to modify it, keep in mind that…
  ▪ Each line of inst.mem consists of a 32-bit instruction plus a newline character (33 characters)
  ▪ inst.mem begins with a NOP
  ▪ inst.mem ends with five NOPs and a jump
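If you do edit inst.mem, a quick format check can catch malformed lines before simulation. The sketch below assumes each line holds 32 ASCII binary digits (which is what "33 characters including the newline" suggests) and that the NOP is the canonical RV32I encoding of addi x0, x0, 0; both are assumptions, and parse_inst_mem is a hypothetical helper.

```python
NOP = 0x00000013  # canonical RV32I NOP (addi x0, x0, 0); assumed to match the lab's NOP

def parse_inst_mem(text):
    """Parse inst.mem-style text into a list of 32-bit words (hypothetical helper)."""
    words = []
    for n, line in enumerate(text.splitlines(), start=1):
        # Assumption: every line is exactly 32 characters of '0'/'1'
        if len(line) != 32 or set(line) - {"0", "1"}:
            raise ValueError(f"line {n}: expected 32 binary digits")
        words.append(int(line, 2))
    if words and words[0] != NOP:
        raise ValueError("file does not begin with a NOP")
    return words

sample = f"{NOP:032b}\n{0x123452B7:032b}\n"
print([hex(w) for w in parse_inst_mem(sample)])
```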

Lab Overview
• Follow the instructions and improve your CPU design step by step
• Part 0: Lab 3 Set Up
• Part 1: Enable Full RV32I Support
• Part 2: Measuring Baseline CPU Performance
• Part 3: Add Branch Hardware to CPU
• Part 4: Implement a Modern Branch Predictor

• Refer to README.md for the details


• Today, we will mainly discuss Part 3 & Part 4

Part 3: Add Branch Hardware to CPU
• Branch Hardware consists of …
• Branch Predictor: Predict the direction of conditional branches
• Branch Target Buffer: Predict the target address of taken branches

• Accessing Branch Hardware
• The branch hardware is accessed in the instruction fetch (IF) stage
if the instruction is a conditional branch or a (direct/indirect) jump
  ▪ Peek at the instruction opcode in the IF stage (a form of pre-decoding)
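The pre-decoding step can be pictured as a check on the low seven bits of the fetched instruction. The sketch below uses the standard RV32I opcodes for conditional branches, jal, and jalr; the function name uses_branch_hardware is hypothetical.

```python
OPCODE_BRANCH = 0b1100011  # beq, bne, blt, bge, bltu, bgeu
OPCODE_JAL    = 0b1101111  # direct jump
OPCODE_JALR   = 0b1100111  # indirect jump

def uses_branch_hardware(inst):
    """True if the instruction fetched in IF should access the branch hardware."""
    opcode = inst & 0x7F     # inst[6:0]
    return opcode in (OPCODE_BRANCH, OPCODE_JAL, OPCODE_JALR)

print(uses_branch_hardware(0x00000063))  # beq x0, x0, 0
print(uses_branch_hardware(0x00000013))  # NOP (addi x0, x0, 0)
```

In hardware this is a small comparator on the instruction memory output, so it adds little delay to the IF stage.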

• Updating Branch Hardware
• The branch hardware is updated with the actual direction and target address
• It is updated in the memory access (MEM) stage
(i.e., the branch is resolved in the MEM stage)
  ▪ The branch target address is computed in the EX stage,
but is latched into the EX/MEM register to shorten the critical path

Part 3: Add Branch Hardware to CPU
Branch Target Buffer (BTB)
• Configurations
• Direct-mapped Cache
• Consists of 256 entries
• Each entry consists of a valid bit, tag bits, and a 32-bit branch target address

• Initialization
• On an active-low reset, all entries in the BTB must be invalidated

• Accessing BTB
• Index BTB using the lower bits of the PC (excluding PC[1:0])
• If a BTB miss happens, use PC + 4 as the target address

• Updating BTB
• Update the BTB with the actual branch target address
for all types of taken branches (taken conditional branches, jumps, etc.)
(i.e., do not update the BTB if the conditional branch is actually not taken)
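The BTB behavior described above can be sketched as a small behavioral model. The index is taken as PC[9:2] (256 entries, PC[1:0] excluded); using PC[31:10] as the tag is an assumption, since the slides do not state the tag width. The class and method names are hypothetical, and pipeline timing is ignored.

```python
class BTB:
    """Behavioral model of a 256-entry direct-mapped BTB (not the lab's Verilog)."""
    ENTRIES = 256

    def __init__(self):
        self.reset()

    def reset(self):
        # On reset, every entry is invalid
        self.valid = [False] * self.ENTRIES
        self.tag = [0] * self.ENTRIES
        self.target = [0] * self.ENTRIES

    @staticmethod
    def _split(pc):
        index = (pc >> 2) & 0xFF   # PC[9:2]
        tag = pc >> 10             # assumed tag: PC[31:10]
        return index, tag

    def lookup(self, pc):
        """Return (hit, target); on a miss the target is PC + 4."""
        index, tag = self._split(pc)
        if self.valid[index] and self.tag[index] == tag:
            return True, self.target[index]
        return False, (pc + 4) & 0xFFFFFFFF

    def update(self, pc, taken, actual_target):
        """Called when the branch resolves; only taken branches are written."""
        if not taken:
            return
        index, tag = self._split(pc)
        self.valid[index] = True
        self.tag[index] = tag
        self.target[index] = actual_target

btb = BTB()
print(btb.lookup(0x40))            # miss, falls back to PC + 4
btb.update(0x40, True, 0x100)
print(btb.lookup(0x40))            # hit with the stored target
```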

Part 3: Add Branch Hardware to CPU
Gshare Branch Predictor
• Configurations
• Branch History Register (BHR) + Pattern History Table (PHT)

• BHR
  ▪ 8-bit register that stores actual branch outcomes
  ▪ The right-most bit indicates the youngest branch outcome
  ▪ 1: taken, 0: not taken

• PHT
  ▪ Consists of 256 entries
  ▪ Each entry consists of a 2-bit saturating counter

[Figure: gshare predictor. The 8-bit BHR and 8 bits of the PC form the 8-bit index into the 256-entry PHT.]
Part 3: Add Branch Hardware to CPU
Gshare Branch Predictor
• Initialization
• On an active-low reset,
  ▪ BHR: 0
  ▪ PHT: 01 (weakly not taken) in every entry

• Accessing Gshare Branch Predictor
• Index the PHT using the PC XOR BHR (PC[1:0] is ignored)

• Updating Gshare Branch Predictor
• The branch predictor is updated only for conditional branch instructions
• It is updated in the MEM stage
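A behavioral sketch of the gshare predictor may help when checking waveforms. It assumes the index is PC[9:2] XOR BHR and that a counter value of 10 or 11 means "predict taken" (the usual convention for 2-bit saturating counters, not stated on the slides). It also ignores pipeline timing: in the real design the lookup happens in IF and the update in MEM. All names are hypothetical.

```python
class Gshare:
    """Behavioral gshare model: 8-bit BHR, 256 two-bit saturating counters."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.bhr = 0                  # BHR: 0 on reset
        self.pht = [0b01] * 256       # every counter starts at 01 (weakly not taken)

    def _index(self, pc):
        return ((pc >> 2) & 0xFF) ^ self.bhr   # PC[9:2] XOR BHR

    def predict(self, pc):
        # Assumption: the counter MSB gives the predicted direction
        return self.pht[self._index(pc)] >= 0b10

    def update(self, pc, taken):
        """Call only for conditional branches, with the actual outcome."""
        i = self._index(pc)
        if taken:
            self.pht[i] = min(self.pht[i] + 1, 0b11)   # saturate at 11
        else:
            self.pht[i] = max(self.pht[i] - 1, 0b00)   # saturate at 00
        # Shift the outcome into the right-most (youngest) BHR bit
        self.bhr = ((self.bhr << 1) | int(taken)) & 0xFF

g = Gshare()
print(g.predict(0x40))   # weakly not taken after reset
g.update(0x40, True)
print(g.pht[0x10], g.bhr)
```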

Part 3: Add Branch Hardware to CPU
Next PC Selection Logic
• With branch hardware,
now there will be four possible next PC values
• PC
• PC + 4
• Predicted Taken PC (predicted target address from BTB)
• Misprediction recovery PC (actual branch target address)

• Your next PC selection logic should be revised accordingly
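One possible way to arrange the four-way selection is a priority MUX, sketched below. The priority order (misprediction recovery first, then stall, then predicted-taken, then PC + 4) is an assumption rather than something the slides specify, and every signal name is hypothetical. Here recovery_pc stands for the correct next PC of the resolved branch.

```python
def next_pc(pc, stall, mispredicted, recovery_pc, predicted_taken, btb_target):
    """Select the next PC; the priority order is an assumed design choice."""
    if mispredicted:          # resolved branch disagrees with the prediction
        return recovery_pc
    if stall:                 # hold the PC (e.g., on a load-use hazard)
        return pc
    if predicted_taken:       # follow the predicted target from the BTB
        return btb_target
    return (pc + 4) & 0xFFFFFFFF

print(hex(next_pc(0x100, False, False, 0x0, True, 0x200)))
```

Giving recovery the highest priority reflects that a resolved branch is older than anything currently in the IF stage.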

Part 4: Implement a Modern Branch Predictor
• In Part 4, you will implement
one of the state-of-the-art branch predictors
• Implementation of BTB and Next PC Selection Logic
can be reused
• All you need to do is to replace
the gshare predictor with the perceptron predictor
• All branch predictors serve the same function;
they implement different policies to improve the prediction accuracy
Part 4: Implement a Modern Branch Predictor
Perceptron Branch Predictor
• Configuration
• Branch History Register (BHR) + Perceptron Table

• BHR
  ▪ 25-bit register that stores actual branch outcomes (history length: 25)
  ▪ The right-most bit indicates the youngest branch outcome
  ▪ 1: taken, 0: not taken

• Perceptron Table
  ▪ Consists of 32 entries
  ▪ Each entry consists of 25 perceptron weights + 1 bias
  ▪ Weight: 7-bit 2’s complement
  ▪ Output: 12-bit 2’s complement
  ▪ Training Threshold: 62

[Figure: perceptron predictor with a 25-bit BHR and a 32-entry Perceptron Table indexed by the PC]
Part 4: Implement a Modern Branch Predictor
Perceptron Branch Predictor
• Initialization
• On an active-low reset,
  ▪ BHR: 0
  ▪ Perceptron Table: 0
• Accessing Perceptron Branch Predictor
• Index the Perceptron Table using the lower bits of the PC (excluding PC[1:0])
• Make a branch prediction by
performing the dot product of the weights and the inputs
  ▪ Inputs are the same as the BHR, except that…
  ▪ A 0 in the BHR is treated as -1 in the inputs
  ▪ The input to the bias is always 1
  ▪ Output >= 0: Taken

* Toy Example (History Length: 4, Perceptron Table Entries: 4)

BHR        0    1    0    1    (input for bias)
Input     -1    1   -1    1    1
Weight    -2    3   -1    0    2

Output = (-1)(-2) + (1)(3) + (-1)(-1) + (1)(0) + (1)(2) = 8, so the prediction is Taken
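The toy example can be reproduced with a few lines of Python. This is only a behavioral sketch of the dot product; the function name perceptron_output is hypothetical, and the weights and BHR bits are listed in the same left-to-right order as in the toy example.

```python
def perceptron_output(weights, bias, bhr_bits):
    """Dot product of weights and inputs, plus the bias (whose input is always 1)."""
    inputs = [1 if bit else -1 for bit in bhr_bits]   # a 0 in the BHR counts as -1
    return bias + sum(w * x for w, x in zip(weights, inputs))

# Toy example: history length 4
y = perceptron_output(weights=[-2, 3, -1, 0], bias=2, bhr_bits=[0, 1, 0, 1])
print(y, "Taken" if y >= 0 else "Not taken")   # prints: 8 Taken
```

With the lab configuration, the same sum runs over 25 weights plus the bias, which is why the output needs more bits (12) than a single weight (7).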
Part 4: Implement a Modern Branch Predictor
Perceptron Branch Predictor
• Updating Perceptron Branch Predictor
• The branch predictor is updated only for the conditional branch instructions
• It is updated in the MEM stage
• Updating Algorithm

Θ: Training Threshold
t: Actual branch direction
x: Perceptron inputs
w: Perceptron weights

  ▪ Perceptron weights saturate at their MIN/MAX values
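The sketch below assumes the standard perceptron training rule from the branch prediction literature, expressed with the symbols above: train when the prediction was wrong or |output| <= Θ, and in that case set each w_i to w_i + t * x_i, with t = +1 for taken and -1 for not taken. Whether this matches the lab's exact rule should be checked against README.md. Treating the bias as a 7-bit weight (range -64 to 63) is also an assumption, and the function names are hypothetical.

```python
THETA = 62                 # training threshold from the configuration
W_MIN, W_MAX = -64, 63     # range of a 7-bit 2's complement weight

def saturate(value):
    return max(W_MIN, min(W_MAX, value))

def train(weights, bias, bhr_bits, taken, theta=THETA):
    """Return updated (weights, bias) using the assumed standard training rule."""
    x = [1 if bit else -1 for bit in bhr_bits]        # perceptron inputs
    y = bias + sum(w * xi for w, xi in zip(weights, x))
    t = 1 if taken else -1                            # actual branch direction
    mispredicted = (y >= 0) != taken
    if mispredicted or abs(y) <= theta:
        weights = [saturate(w + t * xi) for w, xi in zip(weights, x)]
        bias = saturate(bias + t)                     # the bias input is always 1
    return weights, bias

# Toy example weights, actual outcome taken: output 8 <= 62, so training happens
print(train([-2, 3, -1, 0], 2, [0, 1, 0, 1], taken=True))
```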

Tips
• Before you dive into the code, complete the diagram
• You should add ports, wires, and MUXs

• This lab requires a lot of debugging
• Even if you pass all the unit tests, your CPU design may still have bugs
• Do not hesitate to go over the assembly files
  ▪ We provide assembly files for each workload
  ▪ Unit Tests: inst.txt
  ▪ Benchmarks: *.riscv.dump
  ▪ The assembly looks complex at first glance,
but if you know where to focus, you can easily understand it
  ▪ Instructions are in the .text section
  ▪ <_start>: 1. Initialize the registers
2. Jump to <main>
3. Once the main function returns, jump to <end>
  ▪ <main>: Several important functions are called
  ▪ <end>: Execute NOPs to wait until all the instructions retire from the five-stage pipelined CPU
Tips
• How to debug your CPU design
• Take advantage of GTKWave
  ▪ It is a very powerful debugging tool for Verilog code
  ▪ Linux> ./simple_cpu
  ▪ sim.vcd (value change dump file) will be generated
  ▪ Linux> gtkwave sim.vcd
  ▪ gtkwave will be launched, loading the sim.vcd file

• Debugging from the command line
  ▪ Take advantage of built-in Verilog system tasks
  ▪ Use the “$monitor” or “$display” task to inspect the variables you want
