Nvdla Emc2019 Slides
SiFive Internship
Motivation
• Limitations
• No L2
FireSim
• Fast, cycle-exact full system
simulator, runs on FPGA in the cloud
• Simulated design is derived from
Rocket Chip RTL
• Decouples target from FPGA DRAM
• Adds its own DRAM and LLC model
• Easy-to-use. Very good
documentation.
How Does FireSim Work?
• Transforms RTL to target model
• Inserts queues at I/O ports of
target
• Creates a token-based simulator
• In each cycle a token is
consumed by model
• What if token queue is empty?
• The model has to wait
• Stall the target pipeline
Figure credit: Donggyu Kim et al., “Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL”
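The token mechanism above can be illustrated with a toy model. This is a minimal Python sketch, not FireSim's implementation: the class name, the running-sum "target state", and the single input queue are all assumptions made for illustration. The model advances one target cycle only when an input token is available; otherwise it stalls and its state stays frozen.

```python
from collections import deque

class TokenSimulatedModel:
    """Toy target model (hypothetical): one target cycle per input token."""

    def __init__(self):
        self.inputs = deque()    # token queue at the target's input port
        self.outputs = deque()   # token queue at the target's output port
        self.target_cycle = 0    # simulated (target) time
        self.acc = 0             # example target state: a running sum

    def host_step(self):
        """Run one host cycle. Returns True if the target advanced."""
        if not self.inputs:
            return False         # queue empty: stall, target state is frozen
        token = self.inputs.popleft()
        self.acc += token        # the target consumes exactly one token
        self.outputs.append(self.acc)
        self.target_cycle += 1
        return True

model = TokenSimulatedModel()
model.inputs.append(1)
advanced = [model.host_step(), model.host_step()]  # second step finds no token
model.inputs.append(2)                             # late token arrives
advanced.append(model.host_step())
print(advanced, model.target_cycle, list(model.outputs))
```

Three host cycles elapse but only two target cycles are simulated, so a late token changes how long the simulation takes on the host, not what the target computes.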
How to Stall The Target Pipeline?
• For Chisel code:
• Rocket Chip is written
in Chisel
Overall System Architecture
• NVDLA is integrated in target
• LLC + Memory Model: Not part of
the target. Added by FireSim.
• Supports multiple models e.g. DDR3,
constant latency
• Runtime-configurable LLC: different numbers of sets and ways, and different block sizes.
No need to rebuild the FPGA image
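The LLC parameters named above (sets, ways, block size) fix the cache capacity as sets × ways × block size. A minimal Python sketch follows; the helper name and the example configurations are hypothetical and are not taken from the slides or from FireSim.

```python
def llc_capacity_bytes(num_sets, num_ways, block_bytes):
    """Capacity of a set-associative cache: sets x ways x block size."""
    return num_sets * num_ways * block_bytes

# Hypothetical configurations (illustrative only, not from the slides).
configs = [
    (2048, 8, 64),
    (1024, 8, 128),
    (4096, 4, 64),
]
for sets, ways, block in configs:
    kib = llc_capacity_bytes(sets, ways, block) // 1024
    print(f"sets={sets} ways={ways} block={block}B -> {kib} KiB")
```

All three example geometries have the same capacity, which shows why set count, associativity, and block size are worth sweeping independently.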
Integrate Your Own Accelerator
• Any accelerator can be integrated
(if it fits inside FPGA)
• Develop and test software for your
accelerator in Linux environment
before having the chip in hand
• Get fast and accurate performance
results
NVDLA
• Scalable: nv_small, nv_medium,
nv_large
• Post-processing: activation
Adopted from “The Nvidia Deep Learning Accelerator”, https://goo.gl/Znyba5
• Baseline config:
• Quad-core Rocket Core, 3.2 GHz
• NVDLA: 2048 INT8 MACs, 512 KiB conv. buffer, 3.2 GHz
Performance Analysis (II)
• Frame process time: 133 ms (7.5 fps)
• 67 ms on NVDLA
• 66 ms on processor, multithreaded with OpenMP
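The frame-time figures above can be checked with a few lines of arithmetic. This Python sketch assumes the NVDLA and CPU portions run back to back, which is what the 67 + 66 = 133 ms total suggests. The function name is illustrative.

```python
NVDLA_MS = 67   # per-frame time on NVDLA (from the slide)
CPU_MS = 66     # per-frame time on the processor (from the slide)

def frame_stats(nvdla_ms, cpu_ms):
    """Return (total ms, frames per second, fps bound if NVDLA time were zero)."""
    total_ms = nvdla_ms + cpu_ms
    fps = 1000.0 / total_ms
    cpu_bound_fps = 1000.0 / cpu_ms   # upper bound set by the CPU portion alone
    return total_ms, fps, cpu_bound_fps

total_ms, fps, cpu_bound_fps = frame_stats(NVDLA_MS, CPU_MS)
print(f"{total_ms} ms/frame, {fps:.1f} fps, CPU-only bound {cpu_bound_fps:.1f} fps")
```

Under that serial assumption, the CPU portion alone caps throughput at roughly 15 fps, so speeding up the processor-side layers matters about as much as speeding up the accelerator.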
Performance Comparison
• Rocket: baseline config, no NVDLA
[Chart annotation: 5.5x]
• Xeon: E5-2658 v3
• Titan Xp: Pascal arch, 3840 CUDA cores
• Titan consumes more power
• Titan Xp: board TDP 250 W, 471 mm² in 16nm
• NVDLA IP: 766 mW peak, 3.3 mm² in 16nm
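The power and area figures above imply the following ratios. This is a rough Python sketch: board TDP and IP-block peak power are not like-for-like quantities, so the ratios are order-of-magnitude indications only.

```python
# Figures from the slide; note the two power numbers measure different things.
titan_xp = {"power_w": 250.0, "area_mm2": 471.0}   # board TDP, die area in 16nm
nvdla_ip = {"power_w": 0.766, "area_mm2": 3.3}     # peak IP power, IP area in 16nm

def ratio(a, b, key):
    """How many times larger a[key] is than b[key]."""
    return a[key] / b[key]

power_ratio = ratio(titan_xp, nvdla_ip, "power_w")
area_ratio = ratio(titan_xp, nvdla_ip, "area_mm2")
print(f"power: ~{power_ratio:.0f}x, area: ~{area_ratio:.0f}x")
```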
Sharing LLC with Accelerator
• Sharing the LLC can be a good alternative to scratchpad
• Consumes less chip area
• Less programming effort
• Performance does not vary by changing the LLC size
• But varies by changing the block size
• Streaming access pattern. Not much data reuse left
• NVDLA minimum burst length: 32B
• Hardware prefetcher should help
[Chart annotation: 1.6x]
* Speedup is measured w.r.t. the design with no LLC
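The effect of a streaming access pattern on cache size and block size can be reproduced with a toy cache model. This is a minimal Python sketch, assuming a direct-mapped cache and a purely sequential stream of 32-byte bursts. It is not the FireSim LLC model, and the function name and stream length are illustrative.

```python
def streaming_hit_rate(cache_bytes, block_bytes, burst_bytes=32,
                       total_bytes=1 << 20):
    """Hit rate of a direct-mapped cache under a purely sequential stream."""
    num_blocks = cache_bytes // block_bytes
    tags = [None] * num_blocks              # one stored block number per line
    hits = accesses = 0
    for addr in range(0, total_bytes, burst_bytes):
        block = addr // block_bytes
        index = block % num_blocks
        accesses += 1
        if tags[index] == block:
            hits += 1                       # block was fetched by an earlier burst
        else:
            tags[index] = block             # miss: bring in the whole block
    return hits / accesses

for size_kib in (512, 4096):
    for block in (32, 64, 128):
        rate = streaming_hit_rate(size_kib * 1024, block)
        print(f"LLC={size_kib} KiB block={block}B hit rate={rate:.2f}")
```

With no reuse, every hit comes from a neighbouring burst landing in a block that was already fetched, so the hit rate depends only on how many 32-byte bursts fit in one block, not on total capacity.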
Contention In Memory System
• We care about worst-case execution time in real-time systems
• Synthetic benchmark is running on the CPU, stressing the memory system
• NVDLA execution time is measured
[Chart annotation: 2.5x]
* Normalized to solo execution time, i.e., running in isolation
Conclusion
• We integrated NVDLA with a RISC-V SoC on FireSim
• Fast, easy-to-use
• No FPGA board needed: runs on the Amazon cloud
• Can be used for architectural/system research
• We will be using it for research in real-time embedded systems
• Open-sourced and publicly available at:
https://github.com/CSL-KU/firesim-nvdla/
Google “firesim nvdla”
Demo
• Questions?