MICRO 2022 XiangShan Slides
Yinan Xu∗†, Zihao Yu∗, Dan Tang∗‡, Guokai Chen∗†, Lu Chen∗†, Lingrui Gou∗†, Yue Jin∗†, Qianruo Li∗†, Xin Li∗†, Zuojun Li∗†,
Jiawei Lin∗†, Tong Liu∗, Zhigang Liu∗, Jiazhan Tan∗, Huaqiang Wang∗†, Huizhe Wang∗†, Kaifan Wang∗†, Chuanqi Zhang∗†, Fawang Zhang∥,
Linjuan Zhang∗†, Zifei Zhang∗†, Yangyang Zhao∗, Yaoyang Zhou∗†, Yike Zhou∗, Jiangrui Zou∥, Ye Cai∥, Dandan Huan¶, Zusong Li¶, Jiye Zhao¶,
Zihao Chen§, Wei He§, Qiyuan Quan§, Xingwu Liu∗∗, Sa Wang∗†, Kan Shi∗, Ninghui Sun∗† and Yungang Bao∗†
∗State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences
†University of Chinese Academy of Sciences
‡Beijing Institute of Open Source Chip
§Peng Cheng Laboratory
¶Beijing VCore Technology
∥Shenzhen University
∗∗Dalian University of Technology
The Era of Agile and Open-Source Hardware
Customized code: < 10% LOC
Open-Source Chip Ecosystem
Platform
Big Ones:
Institute of Computing Technology, Chinese Academy of Sciences (ICT, CAS)
Major Concerns Regarding the Agile Methodology
Concern #2: How do you verify the processors?
[Figure: Complexity versus Time; curve labels: Verification Complexity, Design Complexity, Verification Gap]
Source: Serge Leef, Reimagining Digital Simulation, DAC 2021.
This Work: Let’s Do It and See What’s Happening
XiangShan: Open-Source High Performance Processors
• 1st generation: YQH
• RV64GC, single-core, superscalar OoO
• 28nm tape-out, 1.3GHz, July 2021
• SPEC CPU2006 7.01@1GHz, DDR4-1600
• 2nd generation: NH
• RV64GCBK, dual-core, superscalar OoO
• Scheduled 2GHz@14nm tape-out, Q4 2022
• Estimated** SPEC CPU2006 19.45@2GHz
[Figure: SPECint 2006/GHz* (proportional to IPC)]
• 3rd generation: KMH
• RV64GCBKHV, quad-core, superscalar OoO
• Close collaboration with industrial partners
MinJie: Platform with Agile Development Flows and Tools
Feature Request
Debugging
Verification
Agile HDL Verilog Loop
Silicon-Proven IP
Simulation
MinJie: Platform with Agile Development Flows and Tools
Feature Request
Debugging
Broken Workflow
Verification
Agile HDL Verilog Loop
Silicon-Proven IP
Simulation
MinJie: Platform with Agile Development Flows and Tools
Verification
Agile HDL Verilog Loop
Silicon-Proven IP
Simulation
This Work
• XiangShan: High Performance RISC-V Processor
Debugging
Verification
Agile HDL Verilog Loop
Simulation
Processor Functional Verification
[Diagram: RISC-V Processor and RISC-V Specifications]
Processor Functional Verification: Reality
[Diagram: RISC-V Processor =?= Non-Executable RISC-V Specifications; reference models: NEMU, Dromajo]
SOTA: Let’s co-simulate them and compare the results!
Processor Co-simulation to Find Bugs
[Diagram: the RISC-V Model and the Processor Under Test start from an equal initial state, each executes an instruction, and their next states are compared]
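The compare-after-every-instruction loop can be sketched in a few lines. This is a minimal illustration, not the DiffTest implementation: the toy instruction format, the `State` class, and the functions `ref_step`, `buggy_step`, and `cosimulate` are all hypothetical names invented for this sketch, and the "processor under test" is just a second Python function with an injected bug.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Toy instruction: (opcode, rd, a, b). Hypothetical format, not a RISC-V encoding.
Instr = Tuple[str, int, int, int]


@dataclass
class State:
    pc: int = 0
    regs: List[int] = field(default_factory=lambda: [0] * 4)


def ref_step(state: State, instr: Instr) -> None:
    """Reference model: executes one toy instruction."""
    op, rd, a, b = instr
    if op == "addi":
        state.regs[rd] = state.regs[a] + b
    elif op == "add":
        state.regs[rd] = state.regs[a] + state.regs[b]
    else:
        raise ValueError(f"unknown opcode {op}")
    state.pc += 1


def buggy_step(state: State, instr: Instr) -> None:
    """Stand-in for a design under test: 'add' wrongly ignores its second operand."""
    op, rd, a, _ = instr
    if op == "add":
        state.regs[rd] = state.regs[a]
        state.pc += 1
    else:
        ref_step(state, instr)


def cosimulate(program: List[Instr],
               dut_step: Callable[[State, Instr], None]) -> Optional[int]:
    """Return the index of the first instruction whose next state differs, else None."""
    ref, dut = State(), State()        # equal initial state
    for index, instr in enumerate(program):
        ref_step(ref, instr)           # instruction executed by the model
        dut_step(dut, instr)           # instruction executed by the design
        if ref != dut:                 # compare next state
            return index
    return None


PROGRAM: List[Instr] = [("addi", 1, 0, 5), ("addi", 2, 0, 7), ("add", 3, 1, 2)]
```

Calling `cosimulate(PROGRAM, buggy_step)` reports the third instruction (index 2), while running the reference against itself reports no mismatch. The strict equality check is what makes this scheme vulnerable to false positives whenever the specification permits more than one legal next state.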
Challenge: False Positives
• Example: Linux allocates valid PTEs and lazily executes memory-barrier instructions
• To avoid frequent TLB flushes for better performance
[Figure: Most RISC-V Models versus the CPU Under Test; labels: PTE Allocation, PTE Allocated by OS (SD), LD Instruction, Page Access, Store Queue]
Towards 1-to-N Correspondence Between REF and Designs
This Work: DiffTest for RISC-V Processors
• Idea: acknowledge the non-deterministic nature of ISA – the ultimate golden model
• DiffTest: the state-of-the-art co-simulation framework for RISC-V processors
• This Work: Identify and Specify Sources of Behavioral Non-Determinism in ISA using Diff-Rules
• Dromajo [1]: avoids non-deterministic sources such as the Debug Transport Module (DTM, MMIO)
• Imperas [2]: extracts asynchronous information from the micro-architecture RTL pipeline
Categories | Sub-Categories            | Examples                          | Addressed Before
Static     | Impl. Dependent Registers | CSR, MMIO Devices                 | [1][2]
Dynamic    | Impl. Dependent Registers | Timer, Counters                   | [2]
Dynamic    | Asynchronous Events       | External/Timer Interrupts         | [1][2]
Dynamic    | Speculative Execution     | Instr./Load/Store Page Fault      | This Work Only
Dynamic    | Memory Model              | Memory Accesses in Multi-core     | This Work Only
Dynamic    | Hardware Timing           | LR/SC, Instruction Fusion, Caches | This Work Only
[1] Nursultan Kabylkas et al., Effective Processor Verification with Logic Fuzzer Enhanced Co-simulation. MICRO-54, 2021.
[2] Kevin McDermott. Brief Introduction to the 5 Levels of RISC-V Processor Verification. RISC-V Summit, 2021.
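One way to picture a diff-rule is as a predicate that decides whether an observed difference between the reference and the design is a permitted form of non-determinism. The sketch below is an assumption-laden illustration, not the DiffTest API: the `Commit` record, the two rule functions, the `ctx` dictionary keys, and `check` are hypothetical, and the conditions inside each rule are simplified guesses at what such rules might test. A real framework would presumably also update the reference to follow the design's choice, which is omitted here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Commit:
    """What one side reports after committing an instruction (toy format)."""
    pc: int
    value: int
    reads_timer: bool = False
    page_fault: bool = False


Context = Dict[str, int]
DiffRule = Callable[[Commit, Commit, Context], bool]


def timer_rule(ref: Commit, dut: Commit, ctx: Context) -> bool:
    """Timer reads may differ, as long as the design's timer never runs backwards."""
    if not (ref.reads_timer and dut.reads_timer):
        return False
    if dut.value < ctx.get("last_timer", 0):
        return False
    ctx["last_timer"] = dut.value
    return True


def speculative_page_fault_rule(ref: Commit, dut: Commit, ctx: Context) -> bool:
    """The design may fault where the reference does not while a PTE store is pending."""
    return dut.page_fault and not ref.page_fault and bool(ctx.get("pte_store_pending", 0))


RULES: List[DiffRule] = [timer_rule, speculative_page_fault_rule]


def check(ref: Commit, dut: Commit, rules: List[DiffRule], ctx: Context) -> str:
    """Classify one commit as an exact match, an allowed difference, or a mismatch."""
    if ref == dut:
        return "match"
    if ref.pc == dut.pc:               # rules only excuse differences at the same PC
        for rule in rules:
            if rule(ref, dut, ctx):
                return f"allowed by {rule.__name__}"
    return "mismatch"
```

For example, a design that raises a load page fault while `ctx["pte_store_pending"]` is set is accepted, whereas the same fault with no pending PTE store is still reported as a mismatch. Keeping each rule narrow is the point: an overly broad rule would hide real bugs.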
Accelerating Simulation Debugging with Snapshots
• Idea: reconstruct the last cycles of simulation after an abort
[Figure: snapshot timeline; labels: Time++, Snapshot Taken, Wakeup, Waveform Enabled, High Simulation Speed]
• Key Insight: use the Linux fork() system call to take snapshots of the simulation process
• Differential snapshot: copy-on-write mechanism provided by the OS
• Portability: transparent to the circuits, RTL simulators, external C/C++ models
• Efficiency: in-memory snapshots without disk I/O
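The fork-based snapshot idea can be sketched as follows, assuming a Unix-like system where `os.fork` is available. Everything here is a hypothetical illustration rather than the authors' code: `step` is a toy stand-in for an RTL simulator, `fail_at` fakes the abort condition, and the pipe-based wake-up protocol is one possible choice. The parent runs at full speed without tracing; each forked child sleeps on a pipe and, if woken, replays from its snapshot point with tracing on.

```python
import os
from typing import List, Optional, Tuple


def step(state: int) -> int:
    """One simulated cycle of a toy 'circuit' (hypothetical stand-in for an RTL simulator)."""
    return (state * 5 + 1) % 1000


def _snapshot_body(wake_r: int, trace_w: int, state: int, start: int) -> None:
    """Runs in the forked child: sleep until woken, then replay with tracing enabled."""
    try:
        message = os.read(wake_r, 32)          # blocks; b"" (EOF) means "discarded"
        if message:
            lines = []
            for cycle in range(start, int(message) + 1):
                state = step(state)
                lines.append(f"cycle={cycle} state={state}")
            os.write(trace_w, "\n".join(lines).encode())
    finally:
        os._exit(0)                            # never fall back into the parent's code


def simulate(total_cycles: int, interval: int, fail_at: int) -> Tuple[int, List[str]]:
    """Run fast; on abort, wake the newest snapshot and collect its trace."""
    trace_r, trace_w = os.pipe()
    snapshot: Optional[Tuple[int, int]] = None  # (child pid, write end of its wake pipe)
    state, aborted_at = 0, None
    for cycle in range(total_cycles):
        if cycle % interval == 0:
            if snapshot is not None:            # keep only the newest snapshot
                os.close(snapshot[1])           # EOF makes the old child exit
                os.waitpid(snapshot[0], 0)
            wake_r, wake_w = os.pipe()
            pid = os.fork()                     # copy-on-write copy of the process
            if pid == 0:
                os.close(wake_w)
                os.close(trace_r)
                _snapshot_body(wake_r, trace_w, state, cycle)
            os.close(wake_r)
            snapshot = (pid, wake_w)
        state = step(state)                     # fast path: no tracing
        if cycle == fail_at:                    # stand-in for an assertion or mismatch
            aborted_at = cycle
            break
    os.close(trace_w)
    trace: List[str] = []
    if snapshot is not None:
        pid, wake_w = snapshot
        if aborted_at is not None:
            os.write(wake_w, str(aborted_at).encode())  # wake up the snapshot
        os.close(wake_w)
        if aborted_at is not None:
            data = b""
            while chunk := os.read(trace_r, 4096):
                data += chunk
            trace = data.decode().splitlines()
        os.waitpid(pid, 0)
    os.close(trace_r)
    return state, trace
```

With `simulate(total_cycles=20, interval=4, fail_at=10)`, snapshots are taken at cycles 0, 4, and 8, and only the last three cycles are replayed with tracing. Nothing in the toy `step` function had to know about snapshots, which mirrors the portability claim: the snapshot is taken at the process level, not inside the simulated design.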
[Figure: Simulation Time / s versus Snapshot Interval / s; panels: Single-Core CoreMark, Dual-Core Linux Boot]
Minor performance overhead: as the snapshot interval size increases, simulation speed remains stable
Practice of Agile Development on XiangShan
① Design Optimization ② Paper Reproduction
• 3 graduate students
• 11 days for a functionally correct prototype
• 37 bugs in 5 days
• 38 days to boot Linux with BPU
• 51 days for the overall frontend architecture
Thanks!
Open-Source Chip Working Group
XiangShan Team
Open-Source Collaborators