Nvdla Emc2019 Slides

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim
Farzad Farshchi§, Qijing Huang¶, Heechul Yun§
§University of Kansas, ¶University of California, Berkeley
SiFive Internship

Rocket Chip SoC + NVDLA
• Rocket Chip: open-source RISC-V SoC
• NVDLA: open-source DNN inference engine
• Demoed the integration at Hot Chips’18

Motivation

• Useful platform for research
• Limitations
• No L2
• Fast DRAM, slow SoC
• Expensive: $7k FPGA board
• Let’s integrate NVDLA into FireSim

FireSim
• Fast, cycle-exact full system simulator, runs on FPGA in the cloud
• Simulated design is derived from Rocket Chip RTL
• Decouples target from FPGA DRAM
• Adds its own DRAM and LLC model
• Easy-to-use. Very good documentation.

How Does FireSim Work?
• Transforms RTL to target model
• Inserts queues at I/O ports of target
• Creates a token-based simulator
• In each cycle a token is consumed by model
• What if token queue is empty?
• The model has to wait
• Stall the target pipeline
Figure credit: Donggyu Kim et al., “Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL”

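The token mechanism can be illustrated with a small software model. This is only a sketch of the idea, not FireSim's implementation: the names `TokenChannel`, `CounterModel`, and `run_host_cycles` are hypothetical, and the "target" is a toy accumulator. The point it shows is that target state advances only in host cycles where an input token is available, so a slow host delays the simulation without changing the target-cycle behavior.

```python
from collections import deque


class TokenChannel:
    """FIFO of tokens; one token carries the inputs for one target cycle."""

    def __init__(self):
        self._queue = deque()

    def push(self, token):
        self._queue.append(token)

    def empty(self):
        return not self._queue

    def pop(self):
        return self._queue.popleft()


class CounterModel:
    """Toy target (hypothetical): adds its input once per target cycle."""

    def __init__(self):
        self.total = 0
        self.target_cycles = 0

    def step(self, token):
        self.total += token
        self.target_cycles += 1


def run_host_cycles(model, channel, arrivals):
    """Advance one host cycle per entry in `arrivals`.

    `arrivals[i]` is the token delivered in host cycle i, or None when the
    host side has nothing ready. Returns the number of stalled host cycles.
    """
    stalled = 0
    for token in arrivals:
        if token is not None:
            channel.push(token)
        if channel.empty():
            stalled += 1  # no token: the target state must not change
        else:
            model.step(channel.pop())
    return stalled


model = CounterModel()
stalls = run_host_cycles(model, TokenChannel(), [1, None, None, 2, 3])
print(model.target_cycles, model.total, stalls)
```

Here five host cycles produce three target cycles and two stalls; the accumulated value depends only on the tokens, not on when they arrived.
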
How to Stall the Target Pipeline?
• For Chisel code:
• Rocket Chip is written in Chisel
• For Verilog (we added):

Overall System Architecture
• NVDLA is integrated in target
• LLC + Memory Model: Not part of the target. Added by FireSim.
• Supports multiple models, e.g. DDR3, constant latency
• Runtime configurable LLC: different set, way, block sizes. No need to rebuild FPGA image

Integrate Your Own Accelerator
• Any accelerator can be integrated (if it fits inside FPGA)
• Develop and test software for your accelerator in Linux environment before having the chip in hand
• Get fast and accurate performance results

NVDLA
• Scalable: nv_small, nv_medium, nv_large
• We used nv_large: 2048 MACs
• Convolutional core: matrix-matrix multiplication
• Post-processing: activation function, pooling, etc.
Adopted from “The Nvidia Deep Learning Accelerator”, https://goo.gl/Znyba5

Performance Analysis (I)

• Baseline config:
• Quad-core Rocket Core, 3.2 GHz
• NVDLA: 2048 INT8 MACs, 512 KiB conv. buffer, 3.2 GHz
• LLC: Shared 2 MiB, 8-way, 64 B block
• DRAM: 4 ranks, 8 banks, FR-FCFS
• YOLOv3: 416 × 416 frame, 66 billion operations

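As a quick sanity check on the baseline LLC geometry, the number of sets follows from capacity, associativity, and block size. The helper below is a sketch using the standard set-associative relation; `llc_num_sets` is a hypothetical name, not a FireSim parameter.

```python
def llc_num_sets(capacity_bytes, ways, block_bytes):
    """Number of sets in a set-associative cache of the given geometry."""
    bytes_per_set = ways * block_bytes
    if capacity_bytes % bytes_per_set != 0:
        raise ValueError("capacity must be a multiple of ways * block size")
    return capacity_bytes // bytes_per_set


MIB = 1024 * 1024

# Baseline from the slide: shared 2 MiB, 8-way, 64 B block.
baseline_sets = llc_num_sets(2 * MIB, ways=8, block_bytes=64)
print(baseline_sets)  # 4096
```

The same relation gives the set count for any other set, way, and block size combination chosen at runtime.
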
Performance Analysis (II)
• Frame processing time: 133 ms (7.5 fps)
• 67 ms on NVDLA
• 66 ms on processor, multithreaded with OpenMP
• Layers not supported by NVDLA run on the processor
• Custom YOLO, upsampling, FP ⇔ INT8
• Makes common DNN algorithms run very fast ✔
• Computations not supported by the accelerator can slow you down ✖

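The cost of the unsupported layers can be made concrete with a little arithmetic. The sketch below assumes the NVDLA and processor portions run back to back, which is consistent with 67 ms + 66 ms = 133 ms on the slide; the function names are hypothetical.

```python
NVDLA_MS = 67.0  # time spent on NVDLA per frame (from the slide)
CPU_MS = 66.0    # time spent on the processor per frame (from the slide)


def fps(frame_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / frame_ms


def frame_time_ms(nvdla_speedup):
    """Frame time if only the NVDLA portion were made faster.

    Assumes the two portions are strictly serialized (an assumption).
    """
    return CPU_MS + NVDLA_MS / nvdla_speedup


frame_ms = NVDLA_MS + CPU_MS
print(frame_ms, round(fps(frame_ms), 1))  # 133.0 7.5
print(frame_time_ms(2.0))                 # 99.5
print(round(fps(CPU_MS), 1))              # upper bound on fps: 15.2
```

Under that assumption, even an infinitely fast accelerator could not push the frame rate past about 15 fps, because the 66 ms processor portion stays fixed.
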
Performance Comparison
• Rocket: baseline config, no NVDLA
• NVDLA+Rocket: baseline config
• Xeon: E5-2658 v3
• Titan Xp: Pascal arch, 3840 CUDA cores
• Titan consumes more power
• Titan Xp: board TDP 250 W, 471 mm² in 16nm
• NVDLA IP: 766 mW peak, 3.3 mm² in 16nm
[Figure: performance comparison chart; annotations recovered from the slide: 5.5x, 407x]

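The power and area figures quoted above can be put side by side as ratios. This is a rough illustration only: it compares a board-level TDP with the peak power of an IP block alone, so the result is not an apples-to-apples efficiency number.

```python
# Figures quoted on the slide.
TITAN_XP_POWER_W = 250.0   # board TDP
TITAN_XP_AREA_MM2 = 471.0  # die area in 16nm
NVDLA_POWER_W = 0.766      # IP peak power
NVDLA_AREA_MM2 = 3.3       # IP area in 16nm

power_ratio = TITAN_XP_POWER_W / NVDLA_POWER_W
area_ratio = TITAN_XP_AREA_MM2 / NVDLA_AREA_MM2

print(f"power: ~{power_ratio:.0f}x, area: ~{area_ratio:.0f}x")
```

On these numbers the Titan Xp budget is roughly 326 times the power and 143 times the area of the NVDLA IP.
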
Sharing LLC with Accelerator
• Sharing the LLC can be a good alternative to scratchpad
• Consumes less chip area
• Less programming effort
• Performance does not vary by changing the LLC size
• But varies by changing the block size
• Streaming access pattern. Not much data reuse left
• NVDLA minimum burst length: 32B
• Hardware prefetcher should help
[Figure: speedup chart, annotated 1.6x] * Speedup is measured w.r.t. design with no LLC

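Why block size matters while capacity does not can be seen in a toy cache model. The sketch below is not the FireSim LLC model: it is a minimal LRU set-associative cache fed with a purely sequential stream of 32 B reads (the minimum burst length from the slide), with no data reuse. The function name and the 1 MiB stream length are assumptions, and hit rate is used only as a proxy for performance.

```python
from collections import OrderedDict


def streaming_hit_rate(capacity_bytes, block_bytes, ways=8,
                       burst_bytes=32, total_bytes=1 << 20):
    """Hit rate of sequential `burst_bytes` reads through an LRU cache.

    Assumes burst_bytes <= block_bytes and aligned accesses.
    """
    num_sets = capacity_bytes // (ways * block_bytes)
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = accesses = 0
    for addr in range(0, total_bytes, burst_bytes):
        block = addr // block_bytes
        lines = sets[block % num_sets]
        accesses += 1
        if block in lines:
            hits += 1
            lines.move_to_end(block)       # mark as most recently used
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict least recently used
            lines[block] = True
    return hits / accesses


KIB = 1024
for capacity in (256 * KIB, 2048 * KIB):
    for block in (32, 64, 128):
        print(capacity // KIB, block, streaming_hit_rate(capacity, block))
```

In this model the hit rate is 1 - burst/block (0.0, 0.5, 0.75 for 32 B, 64 B, 128 B blocks) at either capacity: a larger block acts like a prefetch of the next bursts, whereas extra capacity has nothing to retain.
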
Contention In Memory System
• We care about worst-case execution time in real-time systems
• Synthetic benchmark is running on the CPU stressing the memory system
• NVDLA execution time is measured
[Figure: NVDLA execution time chart, annotated 2.5x] * Normalized to solo execution time, i.e. running in isolation

Conclusion
• We integrated NVDLA with a RISC-V SoC on FireSim
• Fast, easy-to-use
• No FPGA board needed: runs on the Amazon cloud
• Can be used for architectural/system research
• We will be using it for research in real-time embedded systems
• Open-sourced and publicly available at:
https://github.com/CSL-KU/firesim-nvdla/
Google “firesim nvdla”

Demo

• Questions?
