Design and Implementation of UVM-based Verification Framework for Deep Learning Accelerators
Winter 1-31-2025
Graduate Studies
A THESIS SUBMITTED BY
TO THE
SUPERVISED BY
August, 2024
In our thesis, a novel, scalable, reusable, and efficient verification framework for
deep learning hardware accelerators using the Universal Verification Methodology
(UVM) is introduced. The proposed framework provides a scalable and reusable
UVM verification testbench for testing deep learning accelerators (DLAs) with simulation,
emulation, and Field Programmable Gate Array (FPGA) prototyping by running
different testing scenarios for convolutional neural networks (CNNs) with multiple
configurations. Each test scenario configures the DLA to run a specific CNN and
drives the input data, pre-trained weights, and biases required for each layer. The
proposed framework is applicable to different deep neural network (DNN) architectures,
such as AlexNet and GoogLeNet. Therefore, the proposed framework makes the DLA verification
process faster, easier, and more reliable. It provides large coverage, can hit corner
cases, and has complete access to changing the DLA configurations through a generic
verification environment with components that can be easily reused to test different
DLA designs.
Moreover, the proposed framework has a scalable error injection methodology for
testing the trustworthiness of deep learning accelerators with simulation, emulation,
and FPGA prototyping. The proposed methodology applies error injection using
three mechanisms. Firstly, error injection is applied to the DNN data path by
corrupting each DNN layer’s feature map, weight, and bias. Secondly, adversarial
attacks on the input image are applied to emulate the perturbation in inputs received
from cameras and other sensors in the physical world. Thirdly, the proposed
methodology applies error injection to the DLA hardware configurations to detect
faulty hardware configurations for a DNN that may disrupt the inference process.
Moreover, the proposed error injection methodology is reliable and has complete
access to the DNN data path between layers and the DLA configurations. In addition
to that, it is applicable to different DNN architectures.
Regarding error injection, the cross-layer error injection experiments on the LeNet-5
CNN indicate that the network is more sensitive to multiple-value corruption of the
internal layers' input data, weights, and biases, and less sensitive to single-value
corruption propagating between layers. Furthermore, the LeNet-5 CNN accuracy rate
decreases as the perturbation factor of the fast gradient sign method used for input
image error injection increases. Furthermore, the convolution layer is more vulnerable
to multiple data errors than to multiple weight errors, while single errors are largely
masked: the ReLU activation function in the convolution layer almost completely
masks single-value data and weight errors.
Acknowledgements
Above all, I am truly grateful to the Almighty God for His sustenance, grace,
and strength. His kindness has enabled me to succeed in every academic endeavor I
have undertaken. I express my gratitude to him for giving me this chance and giving
me the capacity to move forward with success. I am appreciative of the gifts of
happiness, difficulties, and growth opportunities that I have experienced throughout
my life, not just during this research project.
My deepest gratitude goes out to my supervisors, Prof. Dr. Yehea Ismail and Dr.
Mohamed AbdelSalam, for their support and for giving me the chance to complete
my master’s thesis. I value the time and ideas they have contributed to make my
work interesting and productive. Their insightful advice, insights, and mentorship
inspired me to keep learning more every day. I benefited from their profound insights
at different points during my research. Once again, many thanks to them, since
without them, this work would have never seen the light as it is today.
I would like to thank my colleague, Eng. Tasneem Awaad, for her support and
helpful suggestions on this work. Her support has been one of the many important
contributions to this research.
I would like to thank Dr. Ahmed Awaad for his support and encouragement over
the past few months.
Contents
Abstract
Acknowledgements
Contents
List of Figures
1 INTRODUCTION
1.1.1 GPU
1.4 Contributions
2.1 Background
2.1.3.3 NVDLA
3.2.2 HVL Top
6 EXPERIMENTAL RESULTS
7.1 Contributions
References
List of Figures
4.2 Input data and weight pre-processing with input image error injection.
5.2 NVDLA ping-pong register file [51].
6.3 Proposed framework LeNet-5 CNN test case code coverage analysis.
6.5 Single CNN convolution layer data path error injection testing scenarios error rate.
6.6 LeNet-5 CNN data path error injection testing scenarios error rate.
6.7 The accuracy rate of the LeNet-5 CNN with adversarial images.
List of Tables
6.2 The single CNN convolution layer simulation and emulation runtime.
List of Abbreviations
MAC Multiplication and Accumulation
ML Machine Learning
NN Neural Networks
NVDLA Nvidia Deep Learning Accelerator
1D one-dimensional
ONNX Open Neural Network Exchange
OOP Object-Oriented Programming
OVM Open Verification Methodology
RGBA Red Green Blue Alpha
RGB Red Green Blue
RTL Register-Transfer Level
SCE-MI Standard Co-Emulation Modeling
Interface
SDCs Silent Data Corruptions
SoC System on Chip
SRAM Static Random Access Memory
SV SystemVerilog
SWiFI Software-implemented Fault Injection
TACs Testbench Acceleration Compilers
TBX Testbench-Xpress
3D three-dimensional
TLM Transaction Level Model
UMD User Mode Driver
UVM Universal Verification Methodology
VMM Verification Methodology Manual
Chapter 1
Introduction
In this chapter, an overview of the thesis is introduced. We begin with an
introduction to deep learning hardware acceleration. We then state the problem,
motivations, and objectives of our work and highlight our contributions. Finally,
we conclude this chapter with a general thesis outline.
1.1.1 GPU
The graphics processing unit (GPU) was first developed as a processor to control
the performance of graphics and video on computers. Currently, GPUs have reached
high performance and are used for more complex computational tasks. They are
general-purpose designs with hundreds or thousands of cores that can simultaneously
process hundreds or thousands of threads. Deep learning algorithms are applied
to large amounts of data to form an effective representation of the input
space. This requires complex computation and large amounts of memory, making
it challenging for computing platforms to achieve real-time performance combined
with energy efficiency [2]. Consequently, to achieve high throughput and speed up
DNNs during the training and inference stages, GPUs are employed as hardware
accelerators. However, their excessive power consumption is frequently costly for
systems with limited resources.
1.2 Problem Statement and Motivation
Any UVM testbench can be divided into three main parts: the test, which runs
test scenarios outlined in the test plan; the sequence, which serves as the test’s
stimulus and the input of the Design Under Test (DUT); and the environment, which
initiates the sequence and monitors the DUT’s response to verify its functionality [8].
Extensive simulation and hardware emulation techniques are required to verify such
sophisticated DLA designs, and these approaches are essential for evaluating the
correctness and high-quality performance of such complex DLA architectures. These
architectural designs are represented in software routine forms in the software-based
simulation. Nevertheless, this method has a number of drawbacks, including
its comparatively slow speed as design complexity rises. For hardware emulation,
in contrast, such architectural designs are expressed in Hardware Description
Languages (HDLs), which helps them run faster and allows design and architectural
issues to be identified early.
Co-simulation combined with the transaction-level methodology (co-emulation) [9] is
a powerful emulation approach that is exploited by emulation systems such
as Veloce Strato [10]. In this method, the testbench transactors interface the
DUT running on the emulator with the testbench test that runs on the host machine.
Using simulation-only test scenarios to assess and validate different DLA designs
has become difficult. Therefore, DLA verification requires hardware acceleration.
Emulation speeds up complex test cases so they can run faster and find corner cases.
1.3 Research Objective
This research aims to create a scalable and reusable UVM verification framework
for testing the inference functionality of deep learning accelerators with simulation,
emulation, and FPGA prototyping. This framework makes the deep learning
accelerator verification process faster, easier, and more reliable. The proposed
framework in this research has a generic and scalable UVM-based verification
testbench. This testbench can run deep neural networks with different
architectures and inference parameters on deep learning
accelerators to test the inference process with different hardware configurations. It
provides complete access to the DLA configuration space, which helps in hitting
corner cases and speeding up the detection of functional bugs to achieve large coverage
for DLA verification. Moreover, the proposed framework has a scalable error injection
methodology for testing the trustworthiness of various DLA designs. The proposed
error injection methodology applies error injection using three mechanisms: DNN
data path error injection, input image error injection, and hardware configuration
registers error injection. This methodology can be applied to each layer of a
convolutional neural network to evaluate the robustness and ability of each layer
to alleviate the impact of an injected error. The verification environment in this
framework is portable across simulation and hardware-assisted verification (HAV)
platforms for emulation and FPGA prototyping. Simulation helps to detect functional
issues with more visibility in the early phases of the verification process. Furthermore,
the co-emulation accelerates the DLA verification process by speeding up the
test cases for complex DNNs so they run faster and hit more corner cases. The Nvidia
Deep Learning Accelerator (NVDLA) is used as a case study to prove this objective.
1.4 Contributions
2. Design a novel, generic, scalable, and reusable UVM testbench to test DLA
for simulation and hardware emulation.
3. Develop testing scenarios for single and multiple-layer CNNs for DLA
verification.
(a) Applying error injection in the DNN data path by corrupting each DNN
layer’s feature map, weight, and bias.
(b) Applying adversarial attacks on the input image to emulate the
perturbation in inputs received from cameras and other sensors in the
physical world.
(c) Applying error injection in the DLA hardware configuration registers to
detect faulty hardware configurations for a DNN that may disrupt the
inference process.
Chapter 3 explains our scalable and reusable proposed framework for DLA
verification.
Finally, the thesis is concluded in Chapter 7, where some future work is suggested
as well.
Chapter 2
Background and Related Work
This chapter provides some background about convolutional neural networks,
fault injection for deep neural networks, and deep learning hardware accelerators.
Moreover, it discusses related work for different research efforts focused on DLA
testing and verification.
2.1 Background
2.1.1.1 Convolutional Layer
A key component of the CNN architecture is the convolution layer, which conducts
feature extraction. Typically, this involves combining linear and nonlinear operations,
such as the convolution operation and activation function. Convolution is a particular
type of linear operation for feature extraction. A matrix of numbers called a kernel
is applied across the input, which is another matrix of numbers called a tensor. At
each point of the tensor, the element-wise product between each element of the
kernel and the input tensor is calculated and added to get the output value at the
corresponding position in the output tensor, which is referred to as a feature map,
as shown in Figure 2.1 [12]. This process is then repeated using different kernels to
create feature maps that each reflect a different attribute of the input tensors; in this
way, different kernels may be thought of as distinct feature extractors. Let layer l be
a convolutional layer. Then, the input of layer l comprises m_1^{(l-1)} feature maps from
the previous layer, each of size m_2^{(l-1)} \times m_3^{(l-1)}. In the case where l = 1, the input
is a single image I consisting of one or more channels. This way, a convolutional
neural network directly accepts raw images as input. The output of layer l consists
of m_1^{(l)} feature maps of size m_2^{(l)} \times m_3^{(l)} [13]. The ith feature map in layer l,
denoted Y_i^{(l)}, is computed as follows [13]:

Y_i^{(l)} = B_i^{(l)} + \sum_{j=1}^{m_1^{(l-1)}} K_{i,j}^{(l)} * Y_j^{(l-1)},   (2.1)

where B_i^{(l)} is a bias matrix and K_{i,j}^{(l)} is the filter of the trainable weights of the
network, of size (2h_1^{(l)} + 1) \times (2h_2^{(l)} + 1), connecting the jth feature map in layer
(l-1) with the ith feature map in layer l. To skip a fixed number of pixels, both in
the horizontal and vertical directions, before applying the filter again, the skipping
factors s_1^{(l)} and s_2^{(l)} are used [13]. Then, the size of the output feature maps is
given by [13]:

m_2^{(l)} = \frac{m_2^{(l-1)} - 2h_1^{(l)}}{s_1^{(l)} + 1} \quad \text{and} \quad m_3^{(l)} = \frac{m_3^{(l-1)} - 2h_2^{(l)}}{s_2^{(l)} + 1}.   (2.2)

For example, with no skipping (s_1^{(l)} = s_2^{(l)} = 0) and a 5x5 filter (h_1^{(l)} = h_2^{(l)} = 2),
a 32x32 input feature map produces a 28x28 output feature map, which matches the
first convolution layer of LeNet-5.
2.1.1.2 Pooling Layer
The final convolution or pooling layer’s output feature maps typically get
flattened, or converted into a one-dimensional (1D) array of numbers, and then
connected to one or more fully connected layers, referred to as dense layers, where
each input and each output are connected by a learnable weight. A subset of fully
connected layers maps the features that were retrieved by the convolution layers
and down-sampled by the pooling layers to the final outputs of the network, which
in classification tasks are probabilities for each class. The number of output nodes
in the last fully linked layer is typically equal to the number of classes [12].
Figure 2.2: Max pooling [12].
There are some concerns to consider while implementing an effective fault injection
methodology because evaluating the impact of a fault differs from measuring a typical
fault injection. There are three primary fault categories [16]:
• Masked faults: The fault-free and faulty DNN outputs are the same for the
user.
• Benign faults: The user accepts a faulty output even when it differs from a
fault-free output.
• Malignant faults: The user rejects the defective output.
The fault allowance is decided by the DNN, the chosen metric, and the error
threshold. For example, in autonomous driving object detection algorithms, a fault
could cause a minor deviation from the desired behavior; in this case, it would be
considered a benign fault if the pedestrians were still identified correctly. Conversely,
if the error significantly alters the behavior of the application, the fault is then
classified as malignant, much like when a major failure occurs and an object cannot
be positively identified. For example, when a car doesn’t stop and collides with a
pedestrian.
Many DNN fault models have been proposed by academics, for example:
of object detection networks, (3) resiliency analysis of models intended to
withstand adversarial attacks, (4) training error-resilient models, and (5) an
initial investigation of PyTorchFI’s use for DNN interpretability.
Because there is only one check per layer in PyTorchFI, the implementation
has a very low overhead. There isn’t any overhead if there are no defined
perturbations. Additionally, PyTorchFI operates at the native speed of the silicon
because error modeling does not require code instrumentation. This makes it
possible to explore the enormous state space more quickly, which is essential for
understanding the kinds of faults that arise in real-world safety-critical
systems. For the purpose of modeling high-level perturbations and comprehending
their impact at the system level, PyTorchFI operates at the application level
of DNNs. Lower-level perturbation models, on the other hand, including
register-level errors, cannot be represented at this level unless they can be
mapped to single- or multiple-bit flips.
In this work, the proposed error injection methodology models both
application-level and architecture-level perturbation using the three proposed
error injection mechanisms: DNN data path error injection, DNN input image
error injection, and hardware configuration registers error injection for testing the
resilience of DNNs running on deep learning accelerators. The proposed methodology
performs cross-layer error injection at different layers in the DNN. Moreover, it does not
require any additional processing, which maintains the system's performance
compared to other fault injection mechanisms. Hence, the proposed framework does
not add any extra runtime.
2.1.3 Deep Learning Hardware Accelerators
A DNN accelerator is designed using one of two properties shared by all DNN
algorithms: (1) Strong temporal and spatial localities in the data within and across
each feature map allow the data to be stored and reused strategically; (2) each
feature map’s MAC operations have highly sparse dependencies that allow them to
be computed in parallel. To benefit from the second property, DNN accelerators
use spatial architectures, which consist of massively parallel processing engines that
each compute MACs. Hence, a DNN accelerator consists of a global buffer along
with an array of processing engines, in addition to a DRAM connection from which
data is transferred [14].
The accelerator’s architecture and memory hierarchy are shown in Figure 2.3.
To improve data movement, temporal reuse of loaded data is facilitated by buffering
the input image data (Img), filter weights (Filt), and partial sums (Psum) in a
shared 108 KB Static Random Access Memory (SRAM) buffer. Memory traffic
and computation can overlap when image data and filter weights are read from
DRAM into the buffer and streamed into the spatial computation array. Even with
the memory link operating at a lower clock frequency than the spatial array, the
system is still able to achieve excellent computational efficiency due to the presence
of streaming and reuse. The partial sums that are produced by the spatial array’s
computation of the inner products between the image and the filter weights are
sent to the buffer, where they may subsequently be compressed and rectified before
being sent to the DRAM. With run-length-based compression, the average image
bandwidth is lowered by 2x. Partial sums are saved in the buffer and then restored
to the spatial array to provide configurable support for image and filter sizes that
do not fit entirely within the spatial array. The number of these "passes" required
to complete the computations for a particular layer depends on the buffer size and
spatial array.
Figure 2.3: Eyeriss deep learning accelerator architecture [25].
The Flex Logix InferX X1 [26] accelerator is a system that combines software
and parallel hardware that can be adjusted per the needs of any given algorithm to
fully leverage parallelism. For edge servers and industrial vision systems, the InferX
X1M board is suitable for processing huge models and mega-pixel images at batch
= 1 for object detection and other high-resolution image processing tasks. It can
also support Yolov5. The software tools offered by Flex Logix include a utility to
move trained Open Neural Network Exchange (ONNX) models to the X1M, a basic
runtime framework to facilitate inference processing on Linux and Windows, and a
driver with external APIs to set up and deploy models. Additionally, X1M offers
high computational density and performance, giving customers’ applications greater
flexibility.
2.1.3.3 NVDLA
to larger performance-focused IoT devices. Furthermore, NVDLA can be set up as
a small or large system, as shown in Figure 2.4, to satisfy particular power, cost, or
area requirements [28].
If high performance and versatility are the main priorities, the large-NVDLA
system is a better choice. Performance-oriented systems need to retain a high
degree of flexibility because they can execute inference on a wide variety of network
topologies. Additionally, rather than serializing inference operations, these systems
can handle multiple jobs concurrently. Therefore, inference procedures shouldn’t
take up a lot of the host’s processing capacity. A second memory interface for a
specialized high-bandwidth SRAM was incorporated into the NVDLA hardware
shown in Figure 2.4 as an optional feature to meet these demands. The NVDLA
uses this SRAM as a cache and has it connected to a fast-memory bus interface port
[28].
Figure 2.4: Small NVDLA versus large NVDLA system [28].
can be used to apply Local Response Normalization, a kind of normalization
function.
Additionally, NVDLA has three main interfaces, as shown in Figure 2.6 [30]:
• Configuration Space Bus (CSB) interface: It is a control bus that a CPU uses
to access the NVDLA configuration registers.
Figure 2.6: NVDLA interface [30].
• Compilation tools: A parser and compiler are the main components. The
compiler is in charge of generating an order of hardware layers that is optimized
for a particular NVDLA setup to improve performance by lowering model size,
load, and run times. Parsing and compiling are the two fundamental steps in the
compilation process. The parser can read a Caffe model that has already been
trained and produce an "intermediate representation" of a network that can
move on to the next compilation stage, as shown in Figure 2.7. The compiler
creates a network of hardware layers based on the hardware configuration of an
NVDLA implementation and the parsed intermediate representation. These
steps can be performed on the device that has the NVDLA implementation
and could be done offline. It’s crucial to understand the precise hardware setup
of an NVDLA implementation since it helps the compiler produce the right
layers for the features that are offered. Depending on the convolution buffer
size available, this could involve, for instance, dividing convolution operations
into several smaller mini-operations or choosing between various convolution
operation types (such as Winograd convolution or basic convolution). This
stage is also in charge of assigning memory regions for weights and quantizing
models to lower precision, such as 8- or 16-bit integers or 16-bit floating points.
A list of operations for various NVDLA configurations can be produced using
the same compiler tool.
• Runtime environment: This refers to the setup where a model is executed on
NVDLA-compatible hardware. It can be split into two sections:
– User Mode Driver (UMD): This is the primary user-mode software
interface. The neural network is parsed, and then the compiler builds
the network layer by layer into an NVDLA loadable file format, as shown
in Figure 2.7. After loading this loadable, the user mode runtime driver
sends the inference job to the kernel mode driver.
– Kernel Mode Driver (KMD): It consists of drivers and firmware where
the NVDLA layer operations are scheduled, and each functional block is
configured by programming the NVDLA registers.
The API for UMD is standard and may be used to process loadable images,
execute inference, and bind input and output tensors to memory locations. An
"NVDLA loadable" image is a type of stored representation of the network that is
used to begin runtime execution. In the loadable, every functional block in the
NVDLA implementation is represented by a "layer" in the software; each layer
contains details about the dependencies of the blocks it uses, the tensors it uses as
inputs and outputs in memory, and the particular configuration of each block for a
particular operation. KMD schedules each action by utilizing a dependency graph
that connects the layers. The format of an NVDLA loadable is standardized for
both UMD and compiler implementations. An inference job is received by KMD’s
main entry point, which then chooses one of the several possible jobs for scheduling
and sends it to the core engine scheduler. The core engine scheduler is in charge
of managing NVDLA interrupts, scheduling layers on a block-by-block basis, and
updating any dependencies associated with that layer in response to a task from
a preceding layer being completed. When a layer is ready to be scheduled, the
scheduler takes information from the dependency graph. This helps the compiler
schedule layers optimally and prevents performance discrepancies caused by different
KMD implementations.
Shanggong et al. in [32] run the LeNet-5 CNN with INT8 format on NVDLA,
which is integrated on RISC-V SoC as shown in Figure 2.8. The NVDLA’s CSB
interface is connected through an adapter to the ARM Advanced Peripheral
Bus (APB) that is used in the SoC to connect the NVDLA to the RISC-V
core. Moreover, an AXI4-compliant DBB interface is used for communication
between the NVDLA and an external SoC DRAM. Additionally, the LeNet-5
CNN is implemented in the C language. However, this work is specific to a
single CNN and is done with a low-precision INT8 format that doesn’t provide
enough testing for the NVDLA. Furthermore, the communication time between
the RISC-V core and the NVDLA is much higher than using a native UVM
testbench code, which affects the overall system speed during the verification process.
Zynq Ultrascale+ FPGA for executing AlexNet as a benchmark for evaluating the
implemented system.
DNN execution is then finished using the faulty outputs from the RTL simulation at
the software level to determine the impact of the injected faults at the application
level. This framework runs faster in terms of fault injection time and has parallel
capabilities.
As discussed, none of these efforts uses UVM, which would have provided more
modularity, scalability, and reusability compared to other verification methodologies.
In this work, the proposed framework provides a generic and scalable UVM-based
verification methodology that is much more efficient in terms of running speed
and adaptability to different DLA designs, for both simulation and emulation,
across various DNN architectures. Moreover, the proposed error injection methodology is
scalable and reliable across different DLA designs and DNN architectures: it does not
add any extra simulation or emulation runtime, nor any performance overhead
compared to other fault injection mechanisms. Furthermore, the proposed framework
is portable across the various HAV platforms to accelerate the verification process.
Chapter 3
Proposed UVM framework for
DLA Verification
This chapter presents the proposed UVM-based framework for verifying the
inference process of DNNs on DLAs with different designs. The proposed
framework provides a scalable and reusable UVM verification testbench for testing
deep learning accelerators by running different test scenarios for convolutional neural
networks with multiple configurations. Each test scenario configures the DLA to
run a specific CNN and drives the inputs, pre-trained weights, and biases required for
each layer. The proposed framework makes the verification process faster, easier,
and more reliable. It provides large coverage, can hit corner cases, and offers
complete access to the DLA configurations. The proposed framework is portable
across simulation and HAV platforms for emulation and FPGA prototyping. Those
HAV platforms, including emulation and prototyping platforms, are used in SoC
verification to accelerate the verification testbenches to run faster and hit more
corner cases. Figure 3.1 shows an example of an emulation platform, which is the
Veloce Strato emulator [37]. Moreover, Figure 3.2 shows an example of an FPGA
prototyping platform, which is the Veloce proFPGA platform [38].
the design. A design description expressed in HDL is the foundation of the hardware
emulator [10], which is an array of FPGAs. The design is converted into a gate-level
netlist using an RTL compiler/synthesis tool. Depending on the technology, the design
is then mapped onto a crystal chip, which might be referred to as an FPGA.
An Advanced Verification Board (AVB) is used to distribute a large design
across multiple crystal chips.
• Every transactor’s untimed part is put inside the Hardware Verification
Language (HVL) space to be a proxy to the corresponding timed parts in the
HDL space that are implemented as a Bus Functional Model (BFM), which acts
as an interface connection. The HDL BFM and the related testbench proxy
must be interfaced at the Transaction Level Model (TLM) to communicate
between the hardware emulator and the software-based simulator.
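As a rough illustration of this split (the interface, class, and signal names below are placeholders rather than the thesis code), the timed bus-functional model can live in an interface on the HDL side while the untimed HVL proxy calls its tasks through a virtual interface handle:

```systemverilog
// Timed BFM on the HDL (emulator) side: all clocked activity stays here.
interface axi_write_bfm (input logic clk);
  logic [31:0] awaddr;
  logic [31:0] wdata;
  logic        valid;
  // Simplified single-beat register write; a real AXI BFM drives the full
  // write address, data, and response channel handshakes.
  task automatic write_reg(input bit [31:0] addr, input bit [31:0] data);
    @(posedge clk);
    awaddr <= addr;
    wdata  <= data;
    valid  <= 1'b1;
    @(posedge clk);
    valid  <= 1'b0;
  endtask
endinterface

// Untimed proxy on the HVL (host simulator) side: forwards transaction-level
// calls to the BFM through a virtual interface handle.
class axi_write_proxy;
  virtual axi_write_bfm vif;  // bound to the interface instance in the HDL top
  task write_reg(bit [31:0] addr, bit [31:0] data);
    vif.write_reg(addr, data);
  endtask
endclass
```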
The proposed framework uses a CNN Caffe model that has the CNN pre-trained
weight and bias and a .prototxt file containing the details of the CNN model
architecture to extract the CNN weight and bias for each layer using Python-based
scripts. After that, the extracted weights and biases, in addition to the input data,
are remapped according to the DLA requirements. Then, precision conversion
is done on the remapped data based on the DLA precision configurations. The
remapped and precision-converted data is saved inside an input package in the
CNN testbench to be used by the UVM verification environment during the testing
process, and is then transmitted to the DLA as shown in Figure 3.3.
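One possible shape for such an input package is sketched below, assuming (hypothetically) that the Python-based scripts dump the remapped, precision-converted data as hex files; the file names and array sizes are illustrative only.

```systemverilog
package cnn_input_pkg;
  // Layer-1 data prepared offline by the Python-based scripts (sizes illustrative).
  localparam int L1_DATA_BYTES   = 28*28*32;
  localparam int L1_WEIGHT_BYTES = 6*5*5*128;

  bit [7:0] l1_data   [0:L1_DATA_BYTES-1];
  bit [7:0] l1_weight [0:L1_WEIGHT_BYTES-1];
  bit [7:0] l1_bias   [0:255];

  // Load the prepared hex images once; the UVM sequences then slice these
  // arrays per layer when building the layer transactions.
  task automatic load_inputs();
    $readmemh("l1_data.hex",   l1_data);
    $readmemh("l1_weight.hex", l1_weight);
    $readmemh("l1_bias.hex",   l1_bias);
  endtask
endpackage
```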
and hit corner-case bugs, while the DLA-DUT netlist is prototyped at the hardware
emulator.
3.2.1 HDL Top
The HDL Top has the DUT instance and its connections with the UVM testbench
virtual interfaces as it is timed and synthesizable. Moreover, the HDL Top is the
hardware part running on the emulator. The implemented transactors in Figure 3.5
allow the transaction-level UVM testbench part to access the RTL DUT. Transactors
are split into proxies (as part of the HVL side running on the simulator) and
BFMs, which are AXI-based interfaces (as part of the HDL side running on the
HAV platform). The implemented AXI-based interfaces make it easier to integrate
different DLA RTL designs. Furthermore, the virtual interfaces are registered in the
UVM configuration database so that other testbench components can have access to
them, as they are connected to the DUT as follows:
• AXIMaster IF: A virtual AXI interface responsible for sending the DUT
configurations on the AXI bus.
• INTR IF: It is responsible for detecting interrupts sent by the DLA DUT.
Figure 3.5: UVM testbench timed HDL and untimed HVL partitioning.
HVL Top is responsible for running the UVM test, which is the testbench part
running on the host machine simulator, as it is dynamic, class-based, behavioral,
and untimed.
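The dual-top arrangement described in these two sections can be summarized with the following sketch for the simulation flow (module contents are abridged and the DUT/interface names are placeholders; in the co-emulation flow the interface binding follows the transactor guidelines of the HAV platform):

```systemverilog
// HDL top: static, synthesizable side containing the DUT and the interfaces.
module hdl_top;
  import uvm_pkg::*;

  logic clk = 0;
  always #5 clk = ~clk;

  axi_if  aximaster_if (clk);   // register configuration traffic
  axi_if  axislave_if  (clk);   // DRAM/memory traffic
  intr_if interrupt_if (clk);   // interrupt from the DLA DUT

  dla_dut dut ( /* connections to the three interfaces */ );

  initial begin
    // Publish the virtual interfaces so the HVL components can retrieve them.
    uvm_config_db#(virtual axi_if)::set(null, "*", "aximaster_vif", aximaster_if);
    uvm_config_db#(virtual axi_if)::set(null, "*", "axislave_vif",  axislave_if);
    uvm_config_db#(virtual intr_if)::set(null, "*", "intr_vif",     interrupt_if);
  end
endmodule

// HVL top: dynamic, class-based, untimed side running on the host simulator.
module hvl_top;
  import uvm_pkg::*;
  initial run_test();   // test selected with +UVM_TESTNAME
endmodule
```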
3.2.2.1 CNN Test
The CNN test class is extended from the uvm_test class, which is the top-level
UVM component in the UVM testbench. The CNN test checks and verifies the DUT
functionality by running the testing scenarios created in the verification plan. This
test class includes the CNN environment and the AXI environment. At the test level,
the environment configuration can be passed by creating a handle for the env_config
shown in Figure 3.4 and assigning all needed arguments for the testbench agents
and other configurations. The test also declares all virtual sequences, virtual sequencers,
and other components needed for the entire test, as shown in Figure 3.4.
Furthermore, the test runs by calling the task run_test() in the HVL top.
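A condensed sketch of such a test class is shown below; the component types follow the names used in Figure 3.4, while the exact fields and helper sequences are illustrative assumptions rather than the thesis code.

```systemverilog
class cnn_basic_test extends uvm_test;
  `uvm_component_utils(cnn_basic_test)

  cnn_env     env;   // CNN environment: layer, reg_file, mem, intr agents + scoreboard
  axi_env     axi;   // AXI environment driving the DUT interfaces
  env_config  cfg;   // agent configuration passed down through the config DB

  function new(string name = "cnn_basic_test", uvm_component parent = null);
    super.new(name, parent);
  endfunction

  function void build_phase(uvm_phase phase);
    super.build_phase(phase);
    cfg = env_config::type_id::create("cfg");
    uvm_config_db#(env_config)::set(this, "*", "env_config", cfg);
    env = cnn_env::type_id::create("env", this);
    axi = axi_env::type_id::create("axi", this);
  endfunction

  task run_phase(uvm_phase phase);
    cnn_vseq vseq = cnn_vseq::type_id::create("vseq");
    phase.raise_objection(this);
    vseq.start(env.vsequencer);   // virtual sequencer owned by the CNN environment
    phase.drop_objection(this);
  endtask
endclass
```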
For example, if we want to run a CNN with multiple layers, then a new sequence
will be extended from the cnn_vseq base sequence, and then it can have the retrieved
input data, weight, and bias for each layer, along with the configurations of each
block in the hardware related to each CNN layer. For example, the convolution
pipeline for convolution layers, the planar data engine for the POOL layers, and so
on. Then, this flow could be extended to multiple layers. Therefore, the proposed
approach is scalable to run different convolutional neural networks by adding new
sequences for each network.
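For instance, a LeNet-5-style virtual sequence might be sketched as follows, assuming a hypothetical layer_id field on cnn_layer_seq and a cnn_layer_sqr handle declared on the virtual sequencer of the base cnn_vseq.

```systemverilog
class lenet5_vseq extends cnn_vseq;
  `uvm_object_utils(lenet5_vseq)

  function new(string name = "lenet5_vseq");
    super.new(name);
  endfunction

  task body();
    cnn_layer_seq layer_seq;
    // LeNet-5: conv -> pool -> conv -> pool -> two fully connected layers.
    // Each iteration sends one cnn_layer_seq_item carrying the hardware block
    // configuration plus the pre-processed data, weight, and bias for that layer.
    for (int i = 0; i < 6; i++) begin
      layer_seq = cnn_layer_seq::type_id::create($sformatf("layer_seq_%0d", i));
      layer_seq.layer_id = i;                     // assumed field selecting the layer parameters
      layer_seq.start(p_sequencer.cnn_layer_sqr); // p_sequencer assumed declared in cnn_vseq
    end
  endtask
endclass
```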
3.2.2.3 CNN Environment
2. The three UVM agents: the reg_file agent, the mem agent, and the intr agent.
In addition to the Data Scoreboard and the env_config class that has the agents’
configurations as shown in Figure 3.4.
The cnn_layer agent (Figure 3.4) is extended from the uvm_agent. The
cnn_layer agent is the highest abstraction level agent that has the CNN layer
driver, which is extended from the uvm_driver. The CNN layer driver receives the
cnn_layer_seq_item transaction and maps the included hardware configurations
inside it to the DLA hardware blocks’ registers specified for each layer as register
read/write operations included in the reg_file_seq_item transaction, which is also a
UVM transaction to be transmitted to the middle level of abstraction reg_file agent.
Moreover, the CNN layer data, weight, and bias received in the cnn_layer_seq_item
transaction are transmitted via the CNN layer driver according to the memory
address configurations included in the mem_seq_item UVM transaction as memory
read/write operations to the middle level of abstraction mem agent. The transaction
flow is summarized as follows:
is then received by the mem agent’s driver using the cnnlayer2mem UVM
sequence. This sequence also acts as a linking sequence to send the
mem_seq_item transaction to the mem agent. Similarly, this transaction
is then converted to the axi_transaction (which also has a lower level of
abstraction) to be sent to the DUT through the AXIBusSlave driver.
• Furthermore, the interrupt received from the DLA DUT through the middle
level of abstraction intr agent is handled through the intr_seq_item, and then
forwarded to the cnn_vseq through the CNN layer driver in the CNN layer
agent.
Additionally, the CNN layer driver is connected to the data scoreboard using a UVM
analysis port to send the received CNN layer output to the data scoreboard for
comparison and checking.
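A skeleton of this driver-level mapping could look as follows; the helper tasks are assumptions that stand in for the reg_file/mem conversions and interrupt handling described above.

```systemverilog
class cnn_layer_driver extends uvm_driver #(cnn_layer_seq_item);
  `uvm_component_utils(cnn_layer_driver)

  uvm_analysis_port #(cnn_layer_seq_item) layer_out_ap;  // to the data scoreboard

  function new(string name, uvm_component parent);
    super.new(name, parent);
    layer_out_ap = new("layer_out_ap", this);
  endfunction

  task run_phase(uvm_phase phase);
    cnn_layer_seq_item item;
    forever begin
      seq_item_port.get_next_item(item);
      write_layer_config(item);   // assumed helper: map config to reg_file_seq_item writes
      write_layer_data(item);     // assumed helper: map data/weight/bias to mem_seq_item writes
      wait_for_done_interrupt();  // assumed helper: blocks on the intr agent notification
      read_layer_output(item);    // assumed helper: memory reads of the layer output surface
      layer_out_ap.write(item);   // publish the layer output for checking
      seq_item_port.item_done();
    end
  endtask
endclass
```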
The mentioned middle-level agents send the CNN data, weight, and bias with
the DLA hardware registers’ configurations as memory and register read/write
operations to be sent by the AXI environment agent on the DLA DUT AXI interface.
The AXI environment is extended from the UVM environment. It consists of the
axiBus UVM agent, which contains the AXIBusMaster Driver that stimulates the
DUT by driving AXI transactions for DLA register read/write operations, and
the AXIBusSlave Driver that stimulates the DUT by driving AXI transactions for
memory read/write operations, as shown in Figure 3.4.
After inference, the output predictions are transmitted to the Data scoreboard.
The implemented UVM data scoreboard receives each CNN layer output data from
the CNN layer driver. It then compares the output data received from the DUT
for the last CNN layer with the output data received from the reference model and
then checks the resilience of the inference process.
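A minimal sketch of such a scoreboard is given below, assuming hypothetical is_last_layer and output_data fields on the transaction and a reference output array loaded beforehand.

```systemverilog
class data_scoreboard extends uvm_scoreboard;
  `uvm_component_utils(data_scoreboard)

  uvm_analysis_imp #(cnn_layer_seq_item, data_scoreboard) layer_in_imp;

  // Golden output of the last CNN layer, produced by the reference model.
  byte unsigned expected_output[$];

  function new(string name, uvm_component parent);
    super.new(name, parent);
    layer_in_imp = new("layer_in_imp", this);
  endfunction

  function void write(cnn_layer_seq_item item);
    if (!item.is_last_layer) return;   // assumed flag: only the final predictions are checked
    foreach (expected_output[i])
      if (item.output_data[i] !== expected_output[i])
        `uvm_error("DATA_SB", $sformatf("Mismatch at byte %0d: DUT %0h expected %0h",
                                        i, item.output_data[i], expected_output[i]))
  endfunction
endclass
```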
Chapter 4
Proposed Error Injection
Methodology for DLA Verification
This chapter presents the proposed error injection methodology, which is part
of our UVM verification framework mentioned in Chapter 3. The proposed error
injection methodology is implemented to test the trustworthiness of deep learning
accelerators. It can detect errors at an early stage during the DLA verification process
with simulation, emulation, and FPGA prototyping. The proposed methodology
applies error injection using three mechanisms, as shown in Figure 4.1: (1) applying
error injection in the DNN data path by corrupting each DNN layer's feature map,
weight, and bias; (2) applying adversarial attacks on the
input image to emulate the perturbation in inputs received from cameras and other
sensors in the physical world; and (3) applying error injection in the DLA hardware
configurations to detect faulty hardware configurations for a DNN that may disrupt
the inference process.
Figure 4.1: UVM testbench architecture with the error injection methodology.
Error injection for the DNN data path is done in this framework by extending
a new sequence for error injection from the cnn_vseq base sequence, which is a
UVM virtual base sequence as shown in Figure 4.1. As demonstrated in Chapter
3, this virtual sequence’s responsibility is to run the cnn_layer_seq sequence,
which is used for running a CNN on the DLA. The cnn_layer_seq sequence
constructs the cnn_layer_seq_item transactions that include the DLA hardware
configurations, data, weight, and bias specific for each CNN layer. Then these
transactions are transmitted using the virtual sequencer to the cnn_layer_agent, as
shown in Figure 4.1. After that, the cnn_layer_agent breaks these received
transactions down to a lower level of abstraction to be transmitted to the DLA DUT.
DLA configuration register read/write operations are carried by reg_file_seq_item
transactions through the reg_file agent and are sent to the DLA DUT AXI interface
by the AXIBusMaster driver, which is part of the AXIBus agent in the AXI environment.
CNN data, weight, and bias transfers are carried as memory read/write operations by
mem_seq_item transactions through the mem agent and are sent to the DLA DUT
DRAM AXI interface by the AXIBusSlave driver, which is also part of the AXIBus
agent in the AXI environment.
Furthermore, the error injection is done by randomly injecting errors for single
and multiple values in the CNN internal layers’ input data, weight, and bias included
in the cnn_layer_seq_item transactions for each CNN layer. This is done at different
positions in the CNN (at different layers) to study the error propagation in the
CNN and the resilience of each layer, and to investigate the effect of the error
injection on each aspect of the DLA inference process in terms of the CNN robustness
and the output predictions. Table 4.1 lists the proposed data path error injection
testing scenarios that are applicable for any CNN and scalable for different CNN
architectures. After that, the DLA DUT response is checked in the Data scoreboard
to be compared to the reference model output to study the impact of the injected
errors on the DLA DUT CNN output predictions.
Table 4.1: DNN data path error injection testing scenarios.
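A simplified sketch of how such a corruption sequence could be structured is shown below; the data, weight, and bias fields on cnn_layer_seq_item are assumptions for illustration, and the randomization bounds are arbitrary.

```systemverilog
class cnn_layer_err_seq extends cnn_layer_seq;
  `uvm_object_utils(cnn_layer_err_seq)

  rand int unsigned target_layer;  // layer whose inputs are corrupted
  rand int unsigned num_errors;    // 1 for single-value, >1 for multiple-value scenarios
  rand bit [1:0]    target_kind;   // 0: input data, 1: weight, 2: bias

  constraint c_errors { num_errors inside {[1:16]}; target_kind inside {[0:2]}; }

  function new(string name = "cnn_layer_err_seq");
    super.new(name);
  endfunction

  // Called on the cnn_layer_seq_item of the chosen layer before it is sent.
  function void inject(cnn_layer_seq_item item);
    repeat (num_errors) begin
      int unsigned idx;
      case (target_kind)
        0: begin idx = $urandom_range(item.data.size()-1);   item.data[idx]   = $urandom; end
        1: begin idx = $urandom_range(item.weight.size()-1); item.weight[idx] = $urandom; end
        2: begin idx = $urandom_range(item.bias.size()-1);   item.bias[idx]   = $urandom; end
      endcase
    end
  endfunction
endclass
```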
The error injection is done on the input images in the proposed methodology
by generating adversarial examples from the original input data to test the CNN’s
robustness. An adversarial example is a sample of input data that has been slightly
modified to intentionally make a CNN misclassify it. The CNN still fails even if these
changes are so minor that a human observer does not even notice them. Adversarial
examples cause security concerns because they could be used to attack deep learning
systems, even if the adversary is unable to access the underlying model [48].
The Fast Gradient Sign Method (FGSM) [48] attack is used in the proposed error
injection mechanism to generate the adversarial examples. A Python-based script
is used to implement this method on the input image during data preparation, as
demonstrated in Chapter 3, to generate the adversarial image to be used during the
inference process as shown in Figure 4.2. The neural network’s gradients are used
by the fast gradient sign method to generate an adversarial example. The method
takes an input image and creates a new image that maximizes the loss using the
loss gradients with respect to the input image. We refer to this newly created image
as the adversarial image. The FGSM method can be explained using the following
equation [49]:

adv_x = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x J(\theta, x, y)\big)

where:
• adv_x: The adversarial image.
• x: The original input image.
• y: The original input label.
• ϵ: A perturbation factor.
• θ: Model parameters.
• J: The loss function.
In this method, gradients are calculated with respect to the input image. This is
carried out to create an image that maximizes the loss. This is done by calculating
the contribution of each pixel in the image to the loss value and then adding a
perturbation correspondingly. This method is pretty fast because it is easy to
determine how each input pixel contributes to the loss using the chain rule and
determining the required gradients. Hence, the goal is to deceive an already-trained
model using these adversarial generated images. Therefore, the cnn_vseq base
sequence is extended to implement a new sequence to run the CNN on the DLA
using the generated adversarial images, as shown in Figure 4.1. The DLA DUT
inference output is then verified in the data scoreboard against that received
from the reference model, to evaluate the impact of the input image perturbation on
the CNN output predictions for different perturbation factor values.
Figure 4.2: Input data and weight pre-processing with input image error injection.
Error injection for the DLA hardware configuration register space is done in this
framework by extending a new sequence from the cnn_vseq base sequence for DLA
configuration error injection, as shown in Figure 4.1. The hardware configurations
of the DLA for a specific CNN included in the cnn_layer_seq_item transaction are
randomly corrupted such that:
• Changing the feature map parameters for different CNN layers. For example,
changing the input or output data width, height, or number of channels.
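A minimal sketch of this configuration corruption is shown below, assuming hypothetical in_width, in_height, and in_channels fields inside the transaction's configuration structure; the value ranges are arbitrary.

```systemverilog
// Corrupt one randomly chosen hardware configuration value, e.g. a feature
// map dimension, before the register writes are issued (illustrative only).
function automatic void corrupt_config(cnn_layer_seq_item item);
  int unsigned pick = $urandom_range(2);
  case (pick)
    0: item.cfg.in_width    = $urandom_range(4096, 1);  // wrong input width
    1: item.cfg.in_height   = $urandom_range(4096, 1);  // wrong input height
    2: item.cfg.in_channels = $urandom_range(1024, 1);  // wrong channel count
  endcase
endfunction
```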
Chapter 5
NVDLA Case Study
In this chapter, the proposed framework is used to verify the NVDLA as a case
study. The proposed framework tests running a CNN on the NVDLA by applying a
specific testing scenario for a CNN architecture with the required NVDLA hardware
blocks’ configurations for each CNN layer according to the layer’s parameters and
then sending the required input, pre-trained weight, and bias to the NVDLA through
its DRAM interface.
To integrate the NVDLA DUT in the UVM testbench, the APB2CSB bridge
shown in Figure 5.1 is used to make the NVDLA CSB interface that is used to access
the NVDLA configuration registers compatible with the Advanced Peripheral Bus
(APB) interface. Then, the AXI2APB wrapper is used to convert from the APB
interface to the AXI interface and vice versa; the AXI2APB wrapper acts as a bridge
between the two buses to be then connected to the AXIMaster virtual interface.
Furthermore, the NVDLA DBB interface is then connected directly to the AXISlave
virtual interface, while the NVDLA interrupt signal is connected to the testbench
through the INTR virtual interface, as shown in Figure 5.1.
5.2 NVDLA Registers Configuration
The NVDLA register configurations are done in this framework in the CNN
layer driver inside the cnn_layer agent, according to the hardware configurations for
each CNN layer in the received cnn_layer_seq_item transaction, as demonstrated
in Chapter 3 and Chapter 4. To hide the reprogramming latency that occurs
when the hardware sits idle between two consecutive hardware layers while waiting
for reprogramming after sending a "done" interrupt to the testbench, the NVDLA
relies on the idea of ping-pong register programming for per-hardware-layer registers.
The CNN layer driver configures the NVDLA registers for each hardware block
(subunit) and then sends these configurations as a register read/write operation to
the NVDLA DUT with the help of the reg_file driver and the AXIBusMaster driver,
as mentioned before. Most NVDLA subunits have two sets of registers; the testbench
can program the second group in the background while the subunit operates using
the configuration from the first register set, and then the testbench sets the second
group’s "enable" bit when it is finished. The hardware will move to the second group
if the second group’s "enable" bit has already been set after processing the layer
defined by the first register set and clearing its "enable" bit. Hardware becomes
idle until programming is finished if the "enable" bit for the second group is not yet
set. Next, the procedure is repeated, with the first group now functioning as the
"shadow" group to which the testbench writes in the background and the second
group as the active group. With the help of this method, the hardware may easily
switch between active layers without wasting any cycles on testbench configuration
[51].
The NVDLA core is implemented based on a series of pipeline stages. All or part
of the hardware layers are handled by each pipeline stage. These are the pipeline’s
stages [51]: the convolution DMA (CDMA), the convolution buffer (CBUF), the
convolution sequence controller (CSC), the convolution MAC array (CMAC), the
convolution accumulator (CACC), the single data point processor (SDP), the planar
data processor (PDP), and the cross-channel data processor (CDP).
The convolution core pipeline consists of the first five pipeline stages mentioned
above; all of these pipeline stages, except the CBUF and the CMAC, create hardware
layers through linked ping-pong buffers [51].
As shown in Figure 5.2, the ping-pong technique is considered in each pipeline
stage’s register file. Three register groups form each register file implementation:
a separate non-shadowed group called a "single register group" makes up the third
register group, while the two ping-pong groups (duplicated register groups 0 and 1)
share the same addresses. The CONSUMER register value shows which register the
data path is sourcing from, whereas the PRODUCER register field in the POINTER
register is used to specify which of the ping-pong groups is to be accessible from
the CSB interface. Both pointers choose group 0 by default. The names of registers
indicate which register set they are part of; if a register’s name begins with D_,
it is part of a duplicated register group; if not, it is part of a single register group.
The ping-pong groups’ registers serve as hardware layer configuration parameters
that are configured by the testbench based on the CNN layers’ parameters. An
enable register field exists in each group, and it is set by the testbench and cleared
by hardware. Before setting the enable bit, the testbench sets all other fields in the
group; this indicates that the hardware layer is prepared for execution. Until the
hardware layer finishes executing, all writes to register groups that have the enable
bit set will be silently dropped. At that point, the hardware will clear the enable bit
[51].
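From the testbench side, the per-layer programming flow implied by this scheme can be sketched as follows; the register names and the reg_write() helper are placeholders standing in for the reg_file/AXI path described in Chapter 3.

```systemverilog
// Program one hardware layer into the selected ping-pong group.
task automatic program_hw_layer(input bit group, input bit [31:0] cfg_regs[$]);
  // 1. Point the PRODUCER field at the shadow group to be programmed in the background.
  reg_write(S_POINTER, {31'b0, group});
  // 2. Program the duplicated (D_*) registers of that group for this hardware layer.
  foreach (cfg_regs[i]) reg_write(D_REG_BASE + 4*i, cfg_regs[i]);
  // 3. Set the group's enable bit last; hardware clears it when the layer completes.
  reg_write(D_OP_ENABLE, 32'h1);
endtask
```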
To run a CNN on NVDLA, the image data, weight, and bias need to be mapped
into the memory’s address space. The data is then loaded into the CBUF by the
CDMA for computation. For this reason, the previously mentioned process of
extracting the bias and weight in Chapter 3 and Chapter 4 is followed according to
the NVDLA configurations. The data of each layer in the CNN is mapped, and then
precision conversion is done, and the resulting mapped and precision converted data
is written in an input package that the testbench can read, as shown in Figure 5.1.
The feature data format that NVDLA supports with Direct Convolution (DC)
is the one utilized for the other internal CNN layers. As shown in Figure 5.3, the
feature data elements of one layer are grouped into a three-dimensional data cube
with width (W), height (H), and channel (C) size. For memory mapping, the data
cube is divided into small 1x1x32-byte atom cubes, such that padding may be added
at the channel end if the original data is not 32-byte aligned in the C direction.
Moreover, the atom cube mapping at surface or line boundaries may or may not be
compact, but its alignment is always 32 bytes. Next, a compact mapping of every
small cube in the feature data cube is performed [52].
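A simplified address calculation for this layout is sketched below, assuming INT8 data (one byte per element, 32 channels per atom cube) and compact line and surface strides; in the real hardware the strides come from the layer configuration registers.

```systemverilog
// Byte address of feature element (w, h, c) in the DC feature-data layout.
function automatic longint unsigned feature_addr(
    longint unsigned base,
    int unsigned w, int unsigned h, int unsigned c,
    int unsigned width, int unsigned height);
  longint unsigned line_stride    = width * 32;           // one line of atom cubes
  longint unsigned surface_stride = line_stride * height; // one 32-channel surface
  return base
       + (c / 32) * surface_stride   // which 32-channel surface
       + h * line_stride             // which line inside the surface
       + w * 32                      // which atom cube inside the line
       + (c % 32);                   // byte offset inside the atom cube
endfunction
```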
5.3.3 Weight Mapping
The weights in this framework are not compacted. The width, height, channel,
and number of kernels are the four dimensions of the original weight data. A group of
three-dimensional (3D) data cubes can be formed using these dimensions. One data
cube is referred to as a kernel. Grouping the kernels is necessary before mapping.
Four types of weight data are supported by the NVDLA. These are the weights
for image input, weights for deconvolution, weights for Winograd convolution, and
weights for DC. Moreover, sparse compression and channel post-extension are the
two weight options that can be used to enhance DLA performance. In DC mode, a
single weight group has 16 kernels if the weights are in the FP16 or INT16 format.
Each kernel should be divided into 1x1x64-element small cubes. Each small cube is
128 bytes for FP16 or INT16 format. Furthermore, the last group may have fewer
kernels if the total number of kernels is not a multiple of 16. Figure 5.4 shows the
weight mapping for DC mode within a single group. Once the first weight group
has been mapped, the second weight group is also compactly mapped. Zeros are then
added after each mapped weight group for 128-byte alignment; otherwise, no zeros
are added in between the group boundaries. Similar to the weights for DC,
weight mapping is used for input image convolution as well. The key difference
is that input image weight mapping requires a prior extra channel extension step,
which is mandatory as the first step for the input image weight mapping. Its main
concept is based on converting all the weights that are in the same line to a single
channel, then applying the same mapping steps of weight for DC [52].
Chapter 6
Experimental Results
This chapter illustrates the experimental results of using the proposed framework
to verify the NVDLA inference function by running different CNN testing scenarios
with and without the error injection methodology for simulation and emulation.
The proposed framework is applicable to sophisticated custom and standard CNN
architectures as per the CNN sequence for programming the NVDLA core. The
NVDLA DUT is integrated with the UVM testbench through AXI-based interfaces,
as demonstrated in Chapter 5. Each running testing scenario configures the NVDLA
hardware blocks required for each CNN layer according to the layer’s parameters
and then sends the required input data, pre-trained weight, and bias to the NVDLA
through its DRAM interface. The tool used for simulation is the QuestaSim simulator
[53]. Moreover, the platform used for emulation is Veloce Strato [10].
The first testing scenario runs a single CNN convolution layer on the NVDLA
with the inference parameters shown in Table 6.1 for simulation and emulation. The
NVDLA supports the following three data types: FP16, INT16, and INT8. To
improve computational performance, INT16 precision is used in this testing scenario.
Moreover, the ReLU activation function is applied in this convolution layer. Hence,
the NVDLA is configured according to the mentioned parameters.
The simulation output is then compared with the trace-player direct testbench
test case supplied with the NVDLA release with the same inference parameters [54].
The comparison showed that they have the same simulation output. Furthermore,
by comparing the simulation runtime, it is shown in Table 6.2 that the proposed
framework simulation runtime for a single CNN layer is reduced by 17x compared to
that of the trace-player direct testbench. In addition to that, the same single CNN
convolution layer test case is run on the emulation platform with the runtime value
shown in Table 6.2.
Table 6.2: The single CNN convolution layer simulation and emulation runtime.
The second testing scenario runs the LeNet-5 CNN on the NVDLA. LeNet-5
was proposed by LeCun in 1998 [55] and was successfully applied to
handwritten digit recognition. LeNet-5 consists of the following layers: an input
layer, two convolution layers each followed by a pooling layer, two fully connected
layers, and an output layer. The architecture and the parameters used for inference
are mentioned in Table 6.3 [32]. The NVDLA is configured according to the
LeNet-5 parameters. The dataset used for testing is the MNIST handwritten digit
dataset. The 28x28 grayscale images are used as input data for LeNet-5. To improve
computational accuracy and performance, FP16 precision is used in this testing
scenario. Therefore, the weights with floating-point precision are converted to FP16
format. The precision conversion from a 16-bit binary pattern to a 16-bit floating-point
value and vice versa is done in this framework according to the IEEE-754-2008 standard,
using the equation below for a normalized value, in which each variable (the sign bit S,
the 5-bit exponent E, and the 10-bit mantissa M) is reflected in Figure 6.1 [56]:

value = (-1)^S \times 2^{E-15} \times \left(1 + \frac{M}{2^{10}}\right)
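A small SystemVerilog helper following this equation is sketched below (illustrative only, e.g. for a reference-model or scoreboard comparison); infinities and NaNs are not modeled.

```systemverilog
// Convert an IEEE-754-2008 half-precision (FP16) bit pattern to a real value.
function automatic real fp16_to_real(input bit [15:0] h);
  bit        s    = h[15];
  bit [4:0]  e    = h[14:10];
  bit [9:0]  m    = h[9:0];
  real       frac = real'(m) / 1024.0;   // M / 2^10
  real       mag;
  if (e == 0)          mag = (2.0 ** -14) * frac;                    // subnormal value
  else if (e == 5'h1F) mag = 0.0;                                    // Inf/NaN not modeled here
  else                 mag = (2.0 ** (int'(e) - 15)) * (1.0 + frac); // normalized value
  return s ? -mag : mag;
endfunction
```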
Table 6.3: The LeNet-5 CNN parameters.
The simulation and emulation results are then compared to those when running
the same LeNet-5 CNN with the NVDLA software environment [57]. The comparison
showed that both have almost identical output accuracy rates for a regression run
over random MNIST dataset testing samples. Table 6.4 and Table 6.5 show
LeNet-5 output examples for the proposed framework and the NVDLA software
environment for two different input images. However, the simulation runtime for
the proposed framework is reduced by more than 200x compared to that using
the NVDLA software, as shown in Table 6.6. Similarly, the emulation runtime for
the proposed framework is reduced by more than 10x compared to that using the
NVDLA software, as shown in Table 6.6.
Table 6.4: LeNet-5 output for the first example input image (proposed framework LeNet-5 CNN test case versus NVDLA software environment LeNet-5 CNN).
Table 6.5: LeNet-5 output for digit 2 input image.
The mentioned runtime values for the proposed framework are for running the
CNN on the NVDLA after the data preparation phase is done using Python-based
scripts whose running time is insignificant. The reason behind this reduction in
runtime is that the NVDLA software environment is running on a virtual platform
with instruction-by-instruction execution, which is more realistic. The proposed UVM
framework, in contrast, executes native code very fast on the host machine; it also
has larger coverage for running any CNN, hits corner cases, and provides faster and
easier debugging. Therefore, the best practice is to start the
testing and verification process of a DLA using the proposed framework until the
design stability is reached, then proceed with the realistic software execution.
In addition to that, the simulation and emulation wall time are measured on the
same host machine for the LeNet-5 CNN proposed framework test case. Figure 6.2
shows that the LeNet-5 CNN test case runtime is reduced by more than 3x for
emulation compared to simulation. Accordingly, accelerating the UVM testbench
for emulation speeds up the testing process, particularly for operating complicated
DNN architectures and when the complexity of the DLA design increases.
Figure 6.2: Proposed framework LeNet-5 CNN test case simulation versus emulation wall
time.
Code coverage is collected by the simulation tool for both the verification
testbench and the NVDLA software. The code coverage includes statement coverage,
branch coverage, and conditional coverage. These measurements assess the execution
of the code from different perspectives. The NVDLA’s total code coverage for the
LeNet-5 CNN is 45% for both the verification testbench and the NVDLA software,
as shown in Figure 6.3 and Figure 6.4. The reason for this is the specific
hardware configurations written to the NVDLA registers according to the LeNet-5
CNN architecture and inference parameters: which RTL code statements are executed
depends on the type of hardware layers being run and the complexity of the CNN
architecture, which may or may not exercise all the DLA hardware blocks needed to
achieve 100% code coverage.
Figure 6.3: Proposed framework LeNet-5 CNN test case code coverage analysis.
Figure 6.4: NVDLA software environment LeNet-5 CNN code coverage analysis.
6.3 Emulation Analysis
The emulation analysis shown in Table 6.7 indicates that the proposed framework
requires much lower software time and far fewer TBX clock cycles. This is due to
the optimizations made in the proposed framework for the interaction between the
hardware DLA DUT and the testbench running as software on the host machine
simulator, both for the register-space configurations and for the memory read and
write operations. Moreover, there is a slight increase in hardware area, mainly
Look-Up Tables (LUTs) and registers, on the emulation platform used with the
proposed framework compared to the NVDLA software, due to the testbench's added
synthesizable XRTL (synthesizable SystemVerilog RTL superset) hardware interface.
Table 6.7: Emulation analysis summary.
6.4 Error Injection Analysis
The implemented error injection testing scenarios are applied to the single
CNN convolution layer and the LeNet-5 CNN as case studies. The proposed error
injection methodology is applicable to any CNN architecture by extending the
testbench with a new CNN sequence.
6.4.1 DNN Data Path Error Injection
The implemented error injection testing scenarios are applied to the single CNN
convolution layer test case running on the NVDLA by randomly injecting errors
into the weights and the layer's input data (feature map) during inference, for both
simulation and emulation. Figure 6.5 shows the simulation and emulation regression
results for the convolution layer with the different random error injection testing
scenarios. The convolution layer is more sensitive to multiple data errors than to
multiple weight errors. Moreover, single data and single weight errors are almost
always masked in the convolution layer due to the presence of the ReLU activation
function, which clamps negative pre-activation values to zero and thus suppresses
many single corrupted contributions.
Figure 6.5: Single CNN convolution layer data path error injection testing scenarios error
rate.
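To illustrate this masking effect qualitatively, the following minimal NumPy sketch (not part of the framework; the layer size, kernel, and injected value are arbitrary) corrupts a single feature-map value before a convolution followed by ReLU and compares how many outputs remain visibly wrong:

```python
import numpy as np

def conv2d(x, k):
    """Plain 2-D valid cross-correlation, for illustration only."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

relu = lambda t: np.maximum(t, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8)).astype(np.float32)   # feature map
k = rng.standard_normal((3, 3)).astype(np.float32)   # convolution kernel

x_err = x.copy()
x_err[4, 4] = -10.0                                   # single injected data error

pre_golden, pre_faulty = conv2d(x, k), conv2d(x_err, k)
touched = ~np.isclose(pre_golden, pre_faulty)         # outputs affected before ReLU
# ReLU clamps negative values to zero, so differences confined to negative
# pre-activations on both the golden and faulty side vanish after activation.
visible = ~np.isclose(relu(pre_golden), relu(pre_faulty))
print(f"touched before ReLU: {touched.sum()}, visible after ReLU: {visible.sum()}")
```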
Furthermore, error injection testing scenarios are implemented for the LeNet-5 CNN
running on the NVDLA by randomly injecting errors into the weights, biases, and
the internal layers' input data (the feature maps between layers) during inference,
for both simulation and emulation. The error injection is done by inserting random,
incorrect single or multiple values of weight, bias, or internal-layer input data for a
randomly chosen layer. Simulation and emulation regressions are run for the
implemented error injection testing scenarios over randomly selected MNIST test
samples.
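Conceptually, the single-value and multiple-value corruption applied by these injection scenarios can be sketched in Python as follows; this is a hypothetical illustration of the mechanism, while in the framework the corruption is performed by the UVM error injection sequences on the data written to the DLA memory.

```python
import numpy as np

def inject_errors(tensor, num_errors, rng):
    """Return a copy of `tensor` (weight, bias, or internal feature map)
    with `num_errors` randomly chosen elements replaced by random values."""
    corrupted = tensor.copy()
    flat = corrupted.reshape(-1)                  # flat view into the copy
    idx = rng.choice(flat.size, size=num_errors, replace=False)
    flat[idx] = rng.uniform(-8.0, 8.0, size=num_errors).astype(tensor.dtype)
    return corrupted

rng = np.random.default_rng(1)
conv1_w = rng.standard_normal((6, 1, 5, 5)).astype(np.float16)  # e.g., LeNet-5 conv1 weights

single_fault = inject_errors(conv1_w, num_errors=1, rng=rng)    # single-value scenario
multi_fault = inject_errors(conv1_w, num_errors=16, rng=rng)    # multiple-value scenario
print(np.count_nonzero(single_fault != conv1_w),
      np.count_nonzero(multi_fault != conv1_w))
```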
As demonstrated in Figure 6.6, the results indicate that in the LeNet-5 CNN, the
majority of single-value input data errors within the internal layers are masked and
do not impact the output predictions of the final layer. Moreover, the LeNet-5 CNN
is sensitive to multiple-value errors in the internal layers' input data, although some
of these are masked by the pooling layers when they are injected into the convolution
layers. Furthermore, the LeNet-5 CNN masks most single-value weight errors;
however, some errors in the second convolution layer propagated and corrupted the
last layer's output predictions because the ReLU activation function hardware
configuration is disabled in that convolution layer. Moreover, the LeNet-5 CNN is
sensitive to multiple-value weight errors, mainly when they are injected into the
convolution layers. For the bias, the LeNet-5 CNN masks most single-value errors;
however, it is sensitive to multiple-value bias errors, which propagate and corrupt
the output predictions, except for those injected into the fully connected layers.
In summary, the LeNet-5 CNN is more sensitive to multiple-value corruption of the
internal layers' input data and weights than to multiple-value bias corruption, and
it is less sensitive to single-value corruption of the data, weights, and biases
propagated between layers.
Figure 6.6: LeNet-5 CNN data path error injection testing scenarios error rate.
The implemented testing scenario for input image error injection is applied to
the LeNet-5 CNN test case running on the NVDLA, using adversarial images as
inputs for both simulation and emulation. Figure 6.7 shows the regression results
for this test case over randomly selected MNIST samples with different perturbation
factor (ϵ) values. The LeNet-5 CNN accuracy rate decreases as ϵ increases, since
the fast gradient sign method adds noise scaled by ϵ to the input image.
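For reference, a minimal FGSM sketch in TensorFlow/Keras is shown below; the model object, dataset variables, and ϵ value are assumptions for illustration and are not the exact scripts used to generate the adversarial images in this framework.

```python
import tensorflow as tf

# Assumes `model` is a trained LeNet-5 Keras classifier and the MNIST
# images are scaled to the range [0, 1].
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm_attack(model, images, labels, epsilon):
    """Fast gradient sign method: add epsilon-scaled sign of the input
    gradient of the loss to each image, then clip to the valid range."""
    images = tf.convert_to_tensor(images, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(images)
        predictions = model(images, training=False)
        loss = loss_fn(labels, predictions)
    gradients = tape.gradient(loss, images)
    adversarial = images + epsilon * tf.sign(gradients)
    return tf.clip_by_value(adversarial, 0.0, 1.0)

# Example usage (hypothetical):
# x_adv = fgsm_attack(model, x_test[:16], y_test[:16], epsilon=0.1)
```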
Figure 6.7: The accuracy rate of the LeNet-5 CNN with adversarial images.
As shown in Figure 6.5, Figure 6.6, and Figure 6.7, some of the error injection
testing scenarios running on the emulator have higher error rates than those run in
simulation. This is because emulation tends to expose more bugs than simulation,
as it executes the design in a more realistic and faster environment and operates at
a lower level of abstraction, which increases the likelihood of uncovering bugs that
might not surface during simulation. The trade-off is that, while emulation can
detect more bugs, simulation often provides better tools for visibility and debugging.
Moreover, this framework provides broader coverage by running any CNN, hits
corner cases, and is faster and easier to debug. The proposed framework also adds
visibility during NVDLA testing and debugging, allowing direct and full access to the
NVDLA register space for configuration with lower runtime. Table 6.8 summarizes
the main differences between the proposed framework and the NVDLA software
environment.
Table 6.8: The proposed framework versus the NVDLA software environment.
Chapter 7
Conclusion and Future Work
This chapter summarizes the contributions made in this research work and outlines
future work for this thesis.
7.1 Contributions
4. The NVDLA is used as a case study of the DLA design under verification to
prove the robustness of our proposed verification framework with the error
injection methodology.
5. Verifying the NVDLA by running a single convolution layer CNN and running
the LeNet-5 CNN, with and without error injection testing scenarios, as
examples of CNNs; however, the proposed framework is applicable to any
CNN architecture.
6. Comparing the NVDLA direct testbench's single CNN convolution layer test
case results with those of our framework, and comparing the NVDLA software
environment's LeNet-5 CNN results with those of our framework.
7. The simulation runtime for the implemented framework is reduced by more
than 200x, and the emulation runtime for the implemented framework is
reduced by more than 10x, compared to that for the NVDLA software. This
reduction in simulation and emulation runtime helps in handling complex DLA
designs, mainly in identifying and fixing design issues that may arise from the
complex computations involved in each CNN layer mapped onto hardware. In
addition, the LeNet-5 CNN test case runtime is reduced by more than 3x for
emulation compared to simulation, which speeds up the testing process,
particularly when testing complex DNN architectures and when the complexity
of the DLA design increases.
8. The error injection experimental results for running a single CNN convolution
layer and the LeNet-5 CNN on the NVDLA show that a CNN is more sensitive
to multiple-value corruption of the internal layers' input data, weights, and
biases than to single-value corruption of the data, weights, and biases
propagated between layers.
7.2 Future Work
Our future work is to implement testing scenarios that run other, more complex
CNNs, with and without defense mechanisms, on the NVDLA to check the system's
stability and analyze the resilience of such CNNs against faults and attacks. The
proposed verification framework has been tested in simulation and hardware
emulation; we plan to extend it to FPGA prototyping as well to achieve better
performance.
References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with
deep convolutional neural networks,” Advances in neural information processing
systems, vol. 25, 2012.
[2] C.-s. Oh and J.-m. Yoon, “Hardware acceleration technology for deep-learning
in autonomous vehicles,” in 2019 IEEE International Conference on Big Data
and Smart Computing (BigComp), IEEE, 2019, pp. 1–3.
[3] G. Cesarano, “FPGA implementation of a deep learning inference accelerator
for autonomous vehicles,” Ph.D. dissertation, Politecnico di Torino, 2018.
[4] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically
configurable coprocessor for convolutional neural networks,” in Proceedings
of the 37th annual international symposium on Computer architecture, 2010,
pp. 247–257.
[5] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),”
in 2014 IEEE international solid-state circuits conference digest of technical
papers (ISSCC), IEEE, 2014, pp. 10–14.
[6] Y.-N. Yun, J.-B. Kim, N.-D. Kim, and B. Min, “Beyond UVM for practical
SoC verification,” in 2011 International SoC Design Conference, IEEE, 2011,
pp. 158–162.
[7] Universal Verification Methodology (UVM) 1.2 User’s Guide,
2015. https://www.accellera.org/images/downloads/standards/uvm/uvm_users_guide_1.2.pdf (accessed Aug. 16, 2024).
[8] S. Sutherland and T. Fitzpatrick, “UVM Rapid Adoption: A Practical Subset
of UVM,” Proceedings of DVCon, 2015.
[9] S. Hassoun, M. Kudlugi, D. Pryor, and C. Selvidge, “A transaction-based
unified architecture for simulation and emulation,” IEEE transactions on very
large scale integration (VLSI) systems, vol. 13, no. 2, pp. 278–287, 2005.
[10] Veloce CS system - IC emulation and prototyping, Siemens Digital Industries
Software. https://eda.sw.siemens.com/en-US/ic/veloce/ (accessed Aug. 16,
2024).
[11] S. Albawi, T. Abed Mohammed, and S. Al-Zawi, “Understanding of a
Convolutional Neural Network,” Aug. 2017.
[12] R. Yamashita, M. Nishio, R. K. G. Do, and K. Togashi, “Convolutional neural
networks: An overview and application in radiology,” Insights into imaging,
vol. 9, pp. 611–629, 2018.
[13] D. Stutz and L. Beyer, “Understanding Convolutional Neural Networks,” 2014.
[14] G. Li et al., “Understanding error propagation in deep learning neural network
(DNN) accelerators and applications,” in Proceedings of the International
Conference for High Performance Computing, Networking, Storage and
Analysis, 2017, pp. 1–12.
[15] ISO 26262-1:2018, ISO. https://www.iso.org/standard/68383.html#lifecycle.
[16] S. Pappalardo, A. Ruospo, I. O’Connor, B. Deveautour, E. Sanchez, and A.
Bosio, “A fault injection framework for AI hardware accelerators,” in 2023
IEEE 24th Latin American Test Symposium (LATS), IEEE, 2023, pp. 1–6.
[17] Z. Chen, N. Narayanan, B. Fang, G. Li, K. Pattabiraman, and N. DeBardeleben,
“TensorFI: A flexible fault injection framework for tensorflow applications,” in
2020 IEEE 31st International Symposium on Software Reliability Engineering
(ISSRE), IEEE, 2020, pp. 426–435.
[18] M. Abadi et al., “TensorFlow: A system for Large-Scale machine learning,”
in 12th USENIX symposium on operating systems design and implementation
(OSDI 16), 2016, pp. 265–283.
[19] A. Paszke et al., “Pytorch: An imperative style, high-performance deep learning
library,” Advances in neural information processing systems, vol. 32, 2019.
[20] A. Mahmoud et al., “PytorchFI: A runtime perturbation tool for DNNs,” in
2020 50th Annual IEEE/IFIP International Conference on Dependable Systems
and Networks Workshops (DSN-W), IEEE, 2020, pp. 25–31.
[21] R. Gräfe, Q. S. Sha, F. Geissler, and M. Paulitsch, “Large-scale application of
fault injection into PyTorch models-an extension to pytorchFI for validation
efficiency,” in 2023 53rd Annual IEEE/IFIP International Conference on
Dependable Systems and Networks-Supplemental Volume (DSN-S), IEEE, 2023,
pp. 56–62.
[22] C. Bolchini, L. Cassano, A. Miele, and A. Toschi, “Fast and accurate error
simulation for CNNs against soft errors,” IEEE Transactions on Computers,
vol. 72, no. 4, pp. 984–997, 2022.
[23] Z. Chen, N. Narayanan, B. Fang, G. Li, K. Pattabiraman, and N. DeBardeleben,
“TensorFI: A flexible fault injection framework for tensorflow applications,” in
2020 IEEE 31st International Symposium on Software Reliability Engineering
(ISSRE), IEEE, 2020, pp. 426–435.
[24] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer, “SASSIFI:
An architecture-level fault injection tool for GPU application resilience
evaluation,” in 2017 IEEE International Symposium on Performance Analysis
of Systems and Software (ISPASS), IEEE, 2017, pp. 249–258.
[25] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks,” IEEE
journal of solid-state circuits, vol. 52, no. 1, pp. 127–138, 2016.
[26] Home | Flex Logix - Reconfigurable Computing, Flex Logix, flex-logix.com, Jun.
10, 2019. https://flex-logix.com/ (accessed Aug. 16, 2024).
[27] NVIDIA Jetson Xavier series, NVIDIA. https://www.nvidia.com/en-us/
autonomous-machines/embedded-systems/jetson-agx-xavier (accessed Aug.
16, 2024).
[28] NVDLA primer - NVDLA Documentation, nvdla.org.
http://nvdla.org/primer.html (accessed Aug. 16, 2024).
[29] Hardware Architectural Specification - NVDLA Documentation, nvdla.org.
http://nvdla.org/hw/v1/hwarch.html (accessed Aug. 16, 2024).
[30] Integrator's Manual - NVDLA Documentation, nvdla.org.
http://nvdla.org/hw/v2/integration_guide.html (accessed Aug. 16, 2024).
[31] M. Khan, V. K. Kodavalla, and A. U. Sameer, “Verification Methodology of
Vision Based Hardware Accelerators in System-on-Chip,” in 2020 International
Conference on Industry 4.0 Technology (I4Tech), IEEE, 2020, pp. 39–45.
[32] S. Feng, J. Wu, S. Zhou, and R. Li, “The implementation of LeNet-5 with
NVDLA on RISC-V SoC,” in 2019 IEEE 10th International Conference on
Software Engineering and Service Science (ICSESS), IEEE, 2019, pp. 39–42.
[33] G. Zhou, J. Zhou, and H. Lin, “Research on NVIDIA deep learning accelerator,”
in 2018 12th IEEE International Conference on Anti-counterfeiting, Security,
and Identification (ASID), IEEE, 2018, pp. 192–195.
[34] S. Ramakrishnan, “Implementation of a deep learning inference accelerator on
the FPGA.,” 2020.
[35] B. Salami, O. S. Unsal, and A. C. Kestelman, “On the resilience of RTL NN
accelerators: Fault characterization and mitigation,” in 2018 30th International
Symposium on Computer Architecture and High Performance Computing
(SBAC-PAD), IEEE, 2018, pp. 322–329.
[36] W. Liu and C.-H. Chang, “A forward error compensation approach for fault
resilient deep neural network accelerator design,” in Proceedings of the 5th
Workshop on Attacks and Solutions in Hardware Security, 2021, pp. 41–50.
[37] Veloce Strato CS Emulation Platform for IC
Verification, Siemens Digital Industries Software, 2024.
https://eda.sw.siemens.com/en-US/ic/veloce/strato-hardware/ (accessed Aug.
16, 2024).
[38] MIPS leverages Siemens’ Veloce proFPGA
platform, Siemens Digital Industries Software, 2023.
https://newsroom.sw.siemens.com/en-US/mips-veloce-profpga/ (accessed
Aug. 16, 2024).
[39] Emulation and simulation; invaluable tools for IC verification - Verification
Horizons, blogs.sw.siemens.com, Dec. 06, 2016.
https://blogs.sw.siemens.com/verificationhorizons/2016/12/06/emulation-and-simulation-invaluable-tools-for-ic-verification/
(accessed Aug. 16, 2024).
[40] V. Rousseau, Anoop, G. Nagrecha, V. Kankariya, V. Mamidi, and D. Hurakadli,
“Techniques for robust transaction based acceleration,” in DVCON, Bangalore,
India, 2016.
[41] V. Billa and S. Haran, “Trials and tribulations of migrating a native UVM
testbench from simulation to emulation,” in DVCON, Bangalore, India, 2017.
[42] M. K. A. M. Saad, “An Automatic Generation of NoC Architectures:
An Application-Mapping Approach,” M.S. thesis, Computer and Systems
Engineering, Faculty of Engineering, Ain Shams University, 2020.
[43] SCE-MI (Standard Co-Emulation Modeling Interface), Accellera Systems
Initiative. https://www.accellera.org/downloads/standards/sce-mi (accessed
Aug. 16, 2024).
[44] TensorFlow, Keras | TensorFlow Core | TensorFlow, TensorFlow, 2019.
https://www.tensorflow.org/guide/keras (accessed Aug. 16, 2024).
[45] R. Aboudeif, T. A. Awaad, M. AbdelElSalam, and Y. Ismail, “UVM Based
Verification Framework for Deep Learning Hardware Accelerator: Case Study,”
in International Conference on Electrical, Computer and Energy Technologies,
Sydney, Australia, 2024.
[46] J. Montesano and M. Litterick, “UVM sequence item based error injection,”
SNUG Ottawa, 2012.
[47] K. Schwartz and T. Corcoran, “Error injection: When good input goes bad,”
in Proc. Design Verification Conf.(DVCon US 2017), 2017.
[48] A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in the
physical world,” in Artificial intelligence safety and security, Chapman and
Hall/CRC, 2018, pp. 99–112.
[49] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing
adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[50] R. Aboudeif, T. A. Awaad, M. AbdelElSalam, and Y. Ismail, “Trustworthiness
Evaluation of Deep Learning Accelerators Using UVM-based Verification with
Error Injection,” in DVCON Europe, 2024 (accepted).
[51] Hardware Architectural Specification - NVDLA Documentation, nvdla.org.
http://nvdla.org/hw/v1/hwarch.html (accessed Aug. 16, 2024).
[52] In-memory data formats - NVDLA Documentation, nvdla.org.
http://nvdla.org/hw/format.html (accessed Aug. 16, 2024).
[53] Questa advanced verification, Siemens Digital Industries Software.
https://eda.sw.siemens.com/en-us/ic/questa/ (accessed Aug. 16, 2024).
[54] NVDLA Open Source Hardware, version 1.0, GitHub, Jan. 25, 2023.
https://github.com/nvdla/hw (accessed Aug. 16, 2024).
[55] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,
pp. 2278–2324, 1998.
[56] IEEE, “IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019
(Revision of IEEE 754-2008),” Institute of Electrical and Electronics Engineers
New York, 2019.
[57] Software Manual - NVDLA Documentation, nvdla.org.
http://nvdla.org/sw/contents.html (accessed Aug. 16, 2024).