Deep Learning HDL Toolbox User's Guide
The MathWorks, Inc.
R2021a
Contents

Workflow and APIs
Prototype Deep Learning Networks on FPGA and SoCs Workflow  5-2
Effects of Custom Deep Learning Processor Parameters on Performance and Resource Utilization  8-15

Featured Examples
Get Started with Deep Learning FPGA Deployment on Intel Arria 10 SoC  10-2
Get Started with Deep Learning FPGA Deployment on Xilinx ZCU102 SoC  10-5
Run a Deep Learning Network on FPGA with Live Camera Input  10-62
Custom Deep Learning Processor Generation to Meet Performance Requirements  10-96

Calibration  11-12
Workflow  11-12
Validation  11-14
Workflow  11-14

Use Compiler Output for System Integration  12-3
External Memory Address Map  12-3
Compiler Optimizations  12-3
Leg Level Compilations  12-4
Training Process
You can train deep learning neural networks for classification tasks by using methods such as training from scratch, transfer learning, or feature extraction.
Transfer Learning
Transfer learning is used for cases where there is a lack of labeled data. The existing network architectures, trained for scenarios with large amounts of labeled data, are used for this approach. The parameters of a pretrained network are modified to fit the new data, so transfer learning transfers knowledge across tasks. Because you can train or modify these networks faster than networks trained from scratch, transfer learning is the most widely used training approach for deep learning applications.
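A minimal sketch of this idea (the layer indices, class count, and training options here are illustrative assumptions, not a prescribed recipe):

% Replace the final layers of a pretrained network so that it predicts
% the classes in a new, smaller labeled data set.
net = alexnet;                           % pretrained network
layers = net.Layers;
numClasses = 5;                          % assumed number of new classes
layers(end-2) = fullyConnectedLayer(numClasses);
layers(end) = classificationLayer;
options = trainingOptions('sgdm','InitialLearnRate',1e-4,'MaxEpochs',5);
% newImds: assumed labeled image datastore resized to the network input size.
newNet = trainNetwork(newImds,layers,options);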
Feature Extraction
Layers in deep learning networks are trained to extract features from the input data. This approach uses the network as a feature extractor. The features extracted after the training process can be used as inputs to various machine learning models, such as support vector machines (SVM).
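A minimal sketch of this approach (the tapped layer and datastore name are illustrative assumptions):

% Use a pretrained network as a fixed feature extractor and train an SVM.
net = alexnet;
inputSize = net.Layers(1).InputSize;
% imdsTrain: assumed labeled image datastore; resize images to the input size.
augimds = augmentedImageDatastore(inputSize(1:2), imdsTrain);
features = activations(net, augimds, 'fc7', 'OutputAs', 'rows');
svm = fitcecoc(features, imdsTrain.Labels);   % multiclass SVM on the features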
Convolutional Neural Networks
You can train CNNs from scratch, by transfer learning, or by feature extraction. You can then use the
trained network for classification or regression applications.
For more details on training CNNs, see "Pretrained Deep Neural Networks".
For more details on deep learning, training process, and CNNs, see Deep Learning Onramp.
Deep Learning Processor Architecture
To illustrate the deep learning processor architecture, consider an image classification example.
Activation Normalization
Based on the neural network that you provide, the Activation Normalization module serves the
purpose of adding the ReLU nonlinearity, a maxpool layer, or performs Local Response Normalization
(LRN). You see that the processor has two Activation Normalization units. One unit follows the
Generic Convolution Processor. The other unit follows the Generic FC Processor.
Generic FC Processor
The Generic FC Processor performs the equivalent operation of one fully-connected layer (FC).
Using another AXI4 Master interface, the weights for the fully-connected layer are provided to the
Generic FC Processor. The Generic FC Processor then performs the fully-connected layer
operation on the input image and provides the activations for the Activation Normalization
module. This processor is also generic because it can support tensors and shapes of various sizes.
FC Controller (Scheduling)
The FC Controller (Scheduling) works similarly to the Conv Controller (Scheduling). The
FC Controller (Scheduling) coordinates with the FIFO to act as ping-pong buffers for
performing the fully-connected layer operation and Activation Normalization depending on the
number of FC layers, and ReLU, maxpool, or LRN features that you have in your neural network. After
the Generic FC Processor and Activation Normalization modules process all the frames in
the image, the predictions or scores are transmitted through the AXI4 Master interface and stored in
the external DDR memory.
For more information, see “MATLAB Controlled Deep Learning Processor” on page 3-2.
MATLAB Controlled Deep Learning Processor
• Generic deep learning processor IP, see "Deep Learning Processor Applications" on page 2-3.
• MATLAB as AXI Master IP, see "Set Up for MATLAB AXI Master" (HDL Verifier).
You can use this processor to run neural networks with various inputs, weights, and biases on the same FPGA platform because the deep learning processor IP core can handle tensors and shapes of any size. Before you use the MATLAB as AXI Master, make sure that you have installed the HDL
Verifier support packages for the FPGA boards. This figure shows the MATLAB controlled deep
learning processor architecture.
To integrate the generic deep learning processor IP with the MATLAB as AXI Master, use the AXI4 Slave interface of the deep learning processor IP core. By using a JTAG or PCI express interface, the IP responds to read or write commands from MATLAB. Therefore, you can use the MATLAB controlled deep learning processor to deploy the deep learning neural network to the FPGA boards from MATLAB, perform operations specified by the network architecture, and then return the predicted results to MATLAB. The following example illustrates how to deploy the pretrained series network, AlexNet, to an Intel® Arria® 10 SoC development kit.
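The example itself is not reproduced in this excerpt; a minimal sketch of such a deployment, assuming the arria10soc_single shipping bitstream, is:

hTarget = dlhdl.Target('Intel','Interface','JTAG');
hW = dlhdl.Workflow('Network', alexnet, 'Bitstream', 'arria10soc_single', 'Target', hTarget);
hW.compile;
hW.deploy;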
Deep Learning on FPGA Workflow
1 Load deep learning neural network
You can load various deep learning neural networks, such as AlexNet, VGG, and GoogLeNet, onto the MATLAB framework. When you compile the network, the network parameters are saved into a structure that consists of NetConfigs and layerConfigs. NetConfigs consists of the weights and biases of the trained network. layerConfigs consists of various configuration values of the trained network.
2 Modify pretrained neural network on MATLAB using transfer learning
The internal network developed on the MATLAB framework is trained and modified according to
the parameters of the external neural network. See also “Get Started with Transfer Learning”.
3 Compile user network
Compilation of the user network usually begins with validating the architecture, the types of layers present, the data types of input and output parameters, and the maximum number of activations. This FPGA solution supports series network architectures with data types of single and int8. For more details, see "Product Description". If the user network features are different, the compiler produces an error and stops. The compiler also performs a sanity check by using weight compression and weight quantization.
4 Deploy on target FPGA board
By using specific APIs and the NetConfigs and layerConfigs, deploying the compiled
network converts the user-trained network into a fixed bitstream and then programs the
bitstream on the target FPGA.
5 Predict outcome
To classify objects in the input image, use the deployed framework on the FPGA board (see the sketch after this list).
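A hedged end-to-end sketch of steps 1 through 5, assuming the zcu102_single shipping bitstream, an Ethernet-attached board, and a sample image shipped with MATLAB:

snet = alexnet;                                      % 1: load the network
hT = dlhdl.Target('Xilinx','Interface','Ethernet');
hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'zcu102_single', 'Target', hT);
hW.compile;                                          % 3: compile
hW.deploy;                                           % 4: deploy
img = imresize(imread('peppers.png'), [227 227]);    % example input image
prediction = hW.predict(single(img));                % 5: predict
[~, idx] = max(prediction);
snet.Layers(end).ClassNames{idx}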
See Also
“Deep Learning on FPGA Solution and Workflows” on page 4-4
Deep Learning on FPGA Solution and Workflows
The FPGA deep learning solution provides an end-to-end solution that allows you to estimate, compile, profile, and debug your custom pretrained series network. You can also generate a custom deep learning processor IP. The estimator estimates the performance of the deep learning network in terms of speed. The compiler converts the pretrained deep learning network for the current application and deploys it on the intended target FPGA boards.
To learn more about the deep learning processor IP, see "Deep Learning Processor IP Core" on page 12-2.
FPGA Advantages
FPGAs provide advantages such as:
• High performance
• Flexible interfacing
• Data parallelism
• Model parallelism
• Pipeline parallelism
This table lists common tasks and their corresponding workflows:

• Run a pretrained series network on your target FPGA board: see "Prototype Deep Learning Networks on FPGA and SoCs Workflow" on page 5-2.
• Obtain the performance of your pretrained series network for a preconfigured deep learning processor: see "Estimate Performance of Deep Learning Network" on page 8-3.
• Customize the deep learning processor to meet your area constraints: see "Estimate Resource Utilization for Custom Processor Configuration" on page 8-9.
• Generate a custom deep learning processor for your FPGA: see "Generate Custom Bitstream" on page 9-2.
• Learn about the benefits of quantizing your pretrained series networks: see "Quantization of Deep Neural Networks" on page 11-2.
• Compare the accuracy of your quantized pretrained series networks against your single data type pretrained series network: see "Validation" on page 11-14.
• Run a quantized pretrained series network on your target FPGA board: see "Code Generation and Deployment" on page 11-17.
• “Prototype Deep Learning Networks on FPGA and SoCs Workflow” on page 5-2
• “Profile Inference Run” on page 5-4
• “Multiple Frame Support” on page 5-6
Prototype Deep Learning Networks on FPGA and SoCs Workflow
• Compile and deploy the deep learning network on specified target FPGA or SoC board by using
the deploy function.
• Retrieve the bitstream resource utilization by using the getBuildInfo function.
• Execute the deployed deep learning network and predict the classification of input images by
using the predict function.
• Calculate the speed and profile of the deployed deep learning network by using the predict
function. Set the Profile parameter to on.
This figure illustrates the workflow to deploy your deep learning network to the FPGA boards.
See Also
dlhdl.Target | dlhdl.Workflow
More About
• “Get Started with Deep Learning FPGA Deployment on Xilinx ZCU102 SoC” on page 10-5
Profile Inference Run
The labels classifying the images are stored in a struct and displayed on the screen. The performance parameters of speed and latency are returned in a struct.
% Load the pretrained network and create the target and workflow objects.
snet = resnet18;
hT = dlhdl.Target('Xilinx','Interface','Ethernet');
hW = dlhdl.Workflow('Net', snet, 'Bitstream', 'zcu102_single','Target',hT);
% Deploy the network to the FPGA.
hW.deploy;
% Read and resize an input image to the network input size.
image = imread('zebra.jpeg');
inputImg = imresize(image, [224, 224]);
imshow(inputImg);
% Run prediction with profiling enabled, then look up the predicted label.
[prediction, speed] = hW.predict(single(inputImg),'Profile','on');
[val, idx] = max(prediction);
snet.Layers(end).ClassNames{idx}
See Also
dlhdl.Target | dlhdl.Workflow | predict
More About
• “Prototype Deep Learning Networks on FPGA and SoCs Workflow” on page 5-2
• “Profile Network for Performance Improvement” on page 10-42
Multiple Frame Support
This information is automatically generated by the compile method. For more information on the
generated DDR address offsets, see “Use Compiler Output for System Integration” on page 12-3.
You can also specify the maximum number of input frames as an optional argument in the compile
method. For more information, see “Generate DDR Memory Offsets Based On Number of Input
Frames”.
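For example, a minimal sketch (the frame limit value here is illustrative):

hW.compile('InputFrameNumberLimit',15);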
The output results have to be formatted to be a multiple of the FC output feature size. The
information and formatting are automatically generated by the compile method. For more
information on the generated DDR address offsets, see “Use Compiler Output for System Integration”
on page 12-3.
% Write the number of input frames (here 15) to the 'nc_op_image_count' register:
dnnfpga.hwutils.writeSignal(1, dnnfpga.hwutils.numTo8Hex(addrMap('nc_op_image_count')),15,hT);
See Also
compile | dlhdl.Target | dlhdl.Workflow
More About
• “Prototype Deep Learning Networks on FPGA and SoCs Workflow” on page 5-2
5-7
6
Ethernet Interface
The Ethernet interface leverages the ARM processor to send and receive information from the design
running on the FPGA. The ARM processor runs on a Linux operating system. You can use the Linux
operating system services to interact with the FPGA. When using the Ethernet interface, the
bitstream is downloaded to the SD card. The bitstream persists on the SD card through power cycles, and the FPGA is reprogrammed from the SD card each time it is turned on. The ARM processor is configured with the correct device tree when the bitstream is programmed.
To communicate with the design running on the FPGA, MATLAB leverages the Ethernet connection
between the host computer and ARM processor. The ARM processor runs a LIBIIO service, which
communicates with a datamover IP in the FPGA design. The datamover IP is used for fast data
transfers between the host computer and FPGA, which is useful when prototyping large deep
learning networks that would have long transfer times over JTAG. The ARM processor generates the
read and write transactions to access memory locations in both the onboard memory and deep
learning processor.
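A minimal sketch of creating an Ethernet target object (the IP address is an assumed example; compare the properties shown in later examples):

hTarget = dlhdl.Target('Xilinx','Interface','Ethernet','IPAddress','192.168.1.101');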
LIBIIO/Ethernet Performance
The performance improvement of the LIBIIO/Ethernet connection compared to JTAG is listed in this table.
See Also
dlhdl.Target
More About
• “Accelerate Prototyping Workflow for Large Networks by using Ethernet” on page 10-77
Supported Networks, Layers, Boards, and Tools
In this section...
“Supported Pretrained Networks” on page 7-2
“Supported Layers” on page 7-10
“Supported Boards” on page 7-21
“Third-Party Synthesis Tools and Version Support” on page 7-22
Supported Pretrained Networks

These pretrained networks are supported by Deep Learning HDL Toolbox. Support with the shipping bitstreams is listed as ZCU102 / ZC706 / Arria10, first for the single data type and then for the INT8 data type.

AlexNet
AlexNet convolutional neural network. Type: Series Network. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Classification.

LogoNet
Logo recognition network (LogoNet) is a MATLAB developed logo identification network. For more information, see "Logo Recognition Network". Type: Series Network. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Classification.

MNIST
MNIST Digit Classification. See "Create Simple Deep Learning Network for Classification". Type: Series Network. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Classification.

Lane detection
LaneNet convolutional neural network. For more information, see "Deploy Transfer Learning Network for Lane Detection" on page 10-14. Type: Series Network. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Classification.

VGG-16
VGG-16 convolutional neural network. For the pretrained VGG-16 model, see vgg16. Type: Series Network. Single: No (network exceeds PL DDR memory size) / No (network exceeds FC module memory size) / Yes. INT8: Yes / No (network exceeds FC module memory size) / Yes. Application: Classification.

VGG-19
VGG-19 convolutional neural network. For the pretrained VGG-19 model, see vgg19. Type: Series Network. Single: No (network exceeds PL DDR memory size) / No (network exceeds FC module memory size) / Yes. INT8: Yes / No (network exceeds FC module memory size) / Yes. Application: Classification.

Darknet-19
Darknet-19 convolutional neural network. For the pretrained darknet-19 model, see darknet19. Type: Series Network. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Classification.

Radar Classification
Convolutional neural network that uses micro-Doppler signatures to identify and classify the object. For more information, see "Bicyclist and Pedestrian Classification by Using FPGA" on page 10-46. Type: Series Network. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Classification and Software Defined Radio (SDR).

Defect Detection snet_defnet
snet_defnet is a custom AlexNet network used to identify and classify defects. For more information, see "Defect Detection" on page 10-24. Type: Series Network. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Classification.

Defect Detection snet_blemdetnet
snet_blemdetnet is a custom convolutional neural network used to identify and classify defects. For more information, see "Defect Detection" on page 10-24. Type: Series Network. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Classification.

YOLO v2 Vehicle Detection
You look only once (YOLO) is an object detector that decodes the predictions from a convolutional neural network and generates bounding boxes around the objects. For more information, see "Vehicle Detection Using YOLO v2 Deployed to FPGA" on page 10-87. Type: Series Network based. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Object detection.

DarkNet-53
Darknet-53 convolutional neural network. For the pretrained DarkNet-53 model, see darknet53. Type: Directed acyclic graph (DAG) network based. Single: No (network exceeds PL DDR memory size) / No (network exceeds PL DDR memory size) / Yes. INT8: Yes / Yes / Yes. Application: Classification.

ResNet-18
ResNet-18 convolutional neural network. For the pretrained ResNet-18 model, see resnet18. Type: Directed acyclic graph (DAG) network based. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Classification.

ResNet-50
ResNet-50 convolutional neural network. For the pretrained ResNet-50 model, see resnet50. Type: Directed acyclic graph (DAG) network based. Single: No (network exceeds PL DDR memory size) / No (network exceeds PL DDR memory size) / Yes. INT8: Yes / Yes / Yes. Application: Classification.

ResNet-based YOLO v2
You look only once (YOLO) is an object detector that decodes the predictions from a convolutional neural network and generates bounding boxes around the objects. For more information, see "Vehicle Detection Using DAG Network Based YOLO v2 Deployed to FPGA" on page 10-125. Type: Directed acyclic graph (DAG) network based. Single: Yes / Yes / Yes. INT8: Yes / Yes / Yes. Application: Object detection.
Supported Layers
The following layers are supported by Deep Learning HDL Toolbox.
Input Layers
Limitations apply when generating code for a network using the input layer.

Convolution and Fully Connected Layers

groupedConvolution2dLayer (HW)
Code generation is now supported for a 2-D grouped convolution layer that has the NumGroups property set as 'channel-wise'. These limitations apply when generating code for a network using this layer:
• Dilation factor must be [1 1].
• Number of groups must be 1 or 2.

fullyConnectedLayer (HW)
A fully connected (FC) layer multiplies the input by a weight matrix, and then adds a bias vector. Limitations apply when generating code for a network using this layer.
Activation Layers
A ReLU layer is supported only when it is preceded by any of these layers:
• convolution layer
• fully connected layer
• adder layer
leakyReluLayer (HW, layer is fused)
A leaky ReLU layer performs a threshold operation where any input value less than zero is multiplied by a fixed scalar. A leaky ReLU layer is supported only when it is preceded by any of these layers:
• convolution layer
• fully connected layer
• adder layer
A clipped ReLU layer is supported only when it is preceded by any of these layers:
• convolution layer
• fully connected layer
• adder layer
A batch normalization layer is supported only when it is preceded by a convolution layer.
The WindowChannelSize must be in the range of 3-9 for code generation.
dropoutLayer (NoOP on inference)
A dropout layer randomly sets input elements to zero with a given probability. Limitations apply when generating code for a network using this layer.
Combination Layers
These limitations apply when generating code for a network using the addition layer:
• The maximum number of inputs to the addition layer is two when the input data type is int8.
• Both input layers must have the same output layer format. For example, both layers must have conv output format or fc output format.
These limitations apply when generating code for a network using the depth concatenation layer:
• The input activation feature number must be a multiple of the square root of the "ConvThreadNumber".
• Inputs to the depth concatenation layer must be exclusive to the depth concatenation layer.
• Layers that have a conv output format and layers that have an FC output format cannot be concatenated together.
Output Layer
A nnet.keras.layer.FlattenCStyleLayer is supported only when it is followed by a fully connected layer.
nnet.keras.layer.ZeroPadding2dLayer (HW, layer is fused)
Zero padding layer for 2-D input. A nnet.keras.layer.ZeroPadding2dLayer is supported only when it is followed by a convolution layer or a maxpool layer.
Supported Boards
These boards are supported by Deep Learning HDL Toolbox:
• Xilinx Zynq-7000 ZC706
• Intel Arria 10 SoC development kit
• Xilinx Zynq UltraScale+ MPSoC ZCU102
See Also
More About
• “Configure Board-Specific Setup Information”
Custom Processor Configuration Workflow
After configuring your custom deep learning processor, you can build and generate a custom bitstream and custom deep learning processor IP core. For more information about the custom deep learning processor IP core, see "Deep Learning Processor IP Core" on page 12-2.
This figure shows the workflow to customize your deep learning processor, estimate the custom deep
learning processor performance and resource utilization, and build and generate your custom deep
learning processor IP core and bitstream.
See Also
estimatePerformance | estimateResources | dlhdl.ProcessorConfig |
getModuleProperty | setModuleProperty
More About
• “Deep Learning Processor Architecture” on page 2-2
• “Estimate Performance of Deep Learning Network” on page 8-3
• “Estimate Resource Utilization for Custom Processor Configuration” on page 8-9
Estimate Performance of Deep Learning Network
To learn how to use the information in the table data from the estimatePerformance function to
calculate your network performance, see “Profile Inference Run” on page 5-4.
1 Create a file in your current working folder called getLogoNetwork.m. In the file, enter:
function net = getLogoNetwork()
if ~isfile('LogoNet.mat')
url = 'https://www.mathworks.com/supportfiles/gpucoder/cnn_models/logo_detection/LogoNet.mat';
websave('LogoNet.mat',url);
end
data = load('LogoNet.mat');
net = data.convnet;
end
snet = getLogoNetwork;
2 Create a dlhdl.ProcessorConfig object.
hPC = dlhdl.ProcessorConfig;
3 Call estimatePerformance with snet to retrieve the layer level latencies and performance for
the LogoNet network.
hPC.estimatePerformance(snet)
In this example, compare the performance of the ResNet-18 network on the zcu102_single bitstream configuration to the performance on the default custom bitstream configuration.
Prerequisites
• Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
• Deep Learning Toolbox™
• Deep Learning HDL Toolbox™
• Deep Learning Toolbox Model for ResNet-18 Network
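A minimal sketch of the setup that produces the display below (assuming the ResNet-18 network and the zcu102_single shipping bitstream named in this example):

snet = resnet18;
hPC_shipping = dlhdl.ProcessorConfig('Bitstream','zcu102_single')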
hPC_shipping =
Processing Module "conv"
ConvThreadNumber: 16
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'single'
To estimate the performance of the ResNet-18 DAG network, use the estimatePerformance
function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency,
network latency, and network performance in frames per second (Frames/s).
hPC_shipping.estimatePerformance(snet)
To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more
information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor
configuration, see getModuleProperty and setModuleProperty.
hPC_custom = dlhdl.ProcessorConfig
hPC_custom =
Processing Module "conv"
ConvThreadNumber: 16
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'single'
To estimate the performance of the ResNet-18 DAG network, use the estimatePerformance
function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency,
network latency, and network performance in frames per second (Frames/s).
hPC_custom.estimatePerformance(snet)
The performance of the ResNet-18 network on the custom bitstream configuration is lower than the
performance on the zcu102_single bitstream configuration. The difference between the custom
bitstream configuration and the zcu102_single bitstream configuration is the target frequency.
Modify the custom processor configuration to increase the target frequency. To learn about
modifiable parameters of the processor configuration, see dlhdl.ProcessorConfig.
hPC_custom.TargetFrequency = 220;
hPC_custom
hPC_custom =
Processing Module "conv"
ConvThreadNumber: 16
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'single'
Estimate the performance of the ResNet-18 DAG network on the modified custom bitstream
configuration.
hPC_custom.estimatePerformance(snet)
See Also
dlhdl.ProcessorConfig | estimatePerformance | estimateResources |
getModuleProperty | setModuleProperty
More About
• “Estimate Performance of Deep Learning Network” on page 8-3
• “Estimate Resource Utilization for Custom Processor Configuration” on page 8-9
• “Effects of Custom Deep Learning Processor Parameters on Performance and Resource
Utilization” on page 8-15
Estimate Resource Utilization for Custom Processor Configuration
hPC = dlhdl.ProcessorConfig
hPC.estimateResources
The returned table contains resource utilization for the entire processor and individual modules.
The reference (shipping) zcu102_int8 bitstream configuration is for a Xilinx ZCU102 ZU9EG device. The default board resource counts are:

The default board resource counts exceed the user resource budget and are at the higher end of the cost spectrum. You can meet the target performance and resource budget by quantizing the target deep learning network and customizing the default bitstream configuration.

In this example, create a custom bitstream configuration to match your resource budget and performance requirements.
Prerequisites
• Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
• Deep Learning Toolbox™
• Deep Learning HDL Toolbox™
• Deep Learning Toolbox Model Quantization Library
To load the pretrained series network, which has been trained on the Modified National Institute of Standards and Technology (MNIST) database, enter:
snet = getDigitsNetwork;
Quantize Network
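A hedged sketch of the quantization commands that produce the calibration table below (the calibration datastore name is an assumption; compare the "Defect Detection" example on page 10-24):

dlquantObj = dlquantizer(snet,'ExecutionEnvironment','FPGA');
dlquantObj.calibrate(calibrationData)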
ans=21×5 table
Optimized Layer Name Network Layer Name Learnables / Activations MinValue
____________________________ __________________ ________________________ _________
hPC_reference = dlhdl.ProcessorConfig('Bitstream','zcu102_int8')
hPC_reference =
Processing Module "conv"
ConvThreadNumber: 64
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'int8'
To estimate the performance of the digits series network, use the estimatePerformance function of
the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network
latency, and network performance in frames per second (Frames/s).
To estimate the resource use of the zcu102_int8 bitstream, use the estimateResources
function of the dlhdl.ProcessorConfig object. The function returns the estimated DSP slice and
BRAM usage.
hPC_reference.estimatePerformance(dlquantObj)
hPC_reference.estimateResources
The estimated performance is 4314 FPS and the estimated resource use counts are:
The estimated DSP slice count and BRAM count use exceeds the target device resource budget.
Customize the bitstream configuration to reduce resource use.
To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more
information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor
configuration, see getModuleProperty and setModuleProperty.
To reduce the resource use for the custom bitstream, modify the KernelDataType for the conv,
fc, and adder modules. Modify the ConvThreadNumber to reduce DSP slice count. Reduce the
InputMemorySize and OutputMemorySize for the conv module to reduce BRAM count.
hPC_custom = dlhdl.ProcessorConfig;
hPC_custom.setModuleProperty('conv','KernelDataType','int8');
hPC_custom.setModuleProperty('fc','KernelDataType','int8');
hPC_custom.setModuleProperty('adder','KernelDataType','int8');
hPC_custom.setModuleProperty('conv','ConvThreadNumber',4);
hPC_custom.setModuleProperty('conv','InputMemorySize',[30 30 1]);
hPC_custom.setModuleProperty('conv','OutputMemorySize',[30 30 1]);
hPC_custom
hPC_custom =
Processing Module "conv"
ConvThreadNumber: 4
InputMemorySize: [30 30 1]
OutputMemorySize: [30 30 1]
FeatureSizeLimit: 2048
KernelDataType: 'int8'
To estimate the performance of the digits series network, use the estimatePerformance function of
the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network
latency, and network performance in frames per second (Frames/s).
To estimate the resource use of the hPC_custom bitstream, use the estimateResources
function of the dlhdl.ProcessorConfig object. The function returns the estimated DSP slice and
BRAM usage.
hPC_custom.estimatePerformance(dlquantObj)
hPC_custom.estimateResources
The estimated performance is 574 FPS and the estimated resource use counts are:
The estimated resources of the customized bitstream match the user target device resource budget
and the estimated performance matches the target network performance.
See Also
dlhdl.ProcessorConfig | estimatePerformance | estimateResources |
getModuleProperty | setModuleProperty
More About
• “Estimate Performance of Deep Learning Network” on page 8-3
• “Effects of Custom Deep Learning Processor Parameters on Performance and Resource
Utilization” on page 8-15
Effects of Custom Deep Learning Processor Parameters on Performance and Resource Utilization
This table lists the deep learning processor parameters and their effects on performance and
resource utilization.
See Also
dlhdl.ProcessorConfig | estimatePerformance | estimateResources |
getModuleProperty | setModuleProperty
More About
• “Estimate Performance of Deep Learning Network” on page 8-3
• “Estimate Resource Utilization for Custom Processor Configuration” on page 8-9
• “Effects of Custom Deep Learning Processor Parameters on Performance and Resource
Utilization” on page 8-15
Generate Custom Bitstream
1 Create a dlhdl.ProcessorConfig object.
hPC = dlhdl.ProcessorConfig;
2 Set up the tool path to your design tool. For example, to set up the path to the Vivado® design tool, enter:
hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');
3 Generate the custom bitstream.
dlhdl.buildProcessor(hPC);
4 After the bitstream generation is completed, you can locate the bitstream file at cwd\dlhdl_prj
\vivado_ip_prj\vivado_prj.runs\impl_1, where cwd is your current working directory.
The name of the bitstream file is system_top_wrapper.bit. The associated
system_top_wrapper.mat file is located in the top level of the cwd.
To use the generated bitstream for the supported Xilinx boards, you should copy the
system_top_wrapper.bit and system_top_wrapper.mat files to the same folder.
To use the generated bitstream for the supported Intel boards, you should copy the
system_core.rbf, system.mat, system_periph.rbf, and system.sof files to the same
folder.
5 Deploy the custom bitstream and deep learning network to your target device.
hTarget = dlhdl.Target('Xilinx');
snet = alexnet;
hW = dlhdl.Workflow('Network',snet,'Bitstream','system_top_wrapper.bit','Target',hTarget);
% If your custom bitstream files are in a different folder, use:
% hW = dlhdl.Workflow('Network',snet,'Bitstream',...
% 'C:\yourfolder\system_top_wrapper.bit','Target',hTarget);
hW.compile;
hW.deploy;
See Also
dlhdl.ProcessorConfig | dlhdl.buildProcessor | dlhdl.Workflow
Generate Custom Processor IP
1 Create a dlhdl.ProcessorConfig object.
hPC = dlhdl.ProcessorConfig;
2 Set up the tool path to your design tool. For example, to set up the path to the Vivado design tool, enter:
hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');
3 Generate the custom processor IP.
dlhdl.buildProcessor(hPC);
See Also
dlhdl.ProcessorConfig | dlhdl.buildProcessor
More About
• “Deep Learning Processor IP Core” on page 12-2
Featured Examples
• “Get Started with Deep Learning FPGA Deployment on Intel Arria 10 SoC” on page 10-2
• “Get Started with Deep Learning FPGA Deployment on Xilinx ZCU102 SoC” on page 10-5
• “Logo Recognition Network” on page 10-9
• “Deploy Transfer Learning Network for Lane Detection” on page 10-14
• “Image Category Classification by Using Deep Learning” on page 10-18
• “Defect Detection” on page 10-24
• “Profile Network for Performance Improvement” on page 10-42
• “Bicyclist and Pedestrian Classification by Using FPGA” on page 10-46
• “Visualize Activations of a Deep Learning Network by Using LogoNet” on page 10-51
• “Authoring a Reference Design for Live Camera Integration with Deep Learning Processor IP
Core” on page 10-57
• “Run a Deep Learning Network on FPGA with Live Camera Input” on page 10-62
• “Running Convolution-Only Networks by using FPGA Deployment” on page 10-72
• “Accelerate Prototyping Workflow for Large Networks by using Ethernet” on page 10-77
• “Create Series Network for Quantization” on page 10-83
• “Vehicle Detection Using YOLO v2 Deployed to FPGA” on page 10-87
• “Custom Deep Learning Processor Generation to Meet Performance Requirements”
on page 10-96
• “Deploy Quantized Network Example” on page 10-100
• “Quantize Network for FPGA Deployment” on page 10-109
• “Evaluate Performance of Deep Learning Network on Custom Processor Configuration”
on page 10-115
• “Customize Bitstream Configuration to Meet Resource Use Requirements” on page 10-120
• “Vehicle Detection Using DAG Network Based YOLO v2 Deployed to FPGA” on page 10-125
• “Customize Bitstream Configuration to Meet Resource Use Requirements” on page 10-134
• “Image Classification Using DAG Network Deployed to FPGA” on page 10-139
• “Classify Images on an FPGA Using a Quantized DAG Network” on page 10-147
• “Classify ECG Signals Using DAG Network Deployed To FPGA” on page 10-156
Get Started with Deep Learning FPGA Deployment on Intel Arria 10 SoC
Prerequisites
To load the pretrained series network, which has been trained on the Modified National Institute of Standards and Technology (MNIST) database, enter:
snet = getDigitsNetwork();
analyzeNetwork(snet)
Create a target object that has a custom name for your target device and an interface to connect your
target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install
Intel™ Quartus™ Prime Standard Edition 18.1. Set up the path to your installed Intel Quartus Prime
executable if it is not already set up. For example, to set the toolpath, enter:
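% The toolpath command itself does not appear in this excerpt; this commented
% sketch follows the HDL Verifier hdlsetuptoolpath usage, and the tool name and
% installation path are assumptions for an Intel Quartus Prime 18.1 install:
% hdlsetuptoolpath('ToolName','Intel Quartus II','ToolPath','C:\intel\18.1\quartus\bin64');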
hTarget = dlhdl.Target('Intel')
hTarget =
Target with properties:
Vendor: 'Intel'
Interface: JTAG
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and
the bitstream name. Specify the saved pretrained MNIST neural network, snet, as the network. Make
sure that the bitstream name matches the data type and the FPGA board that you are targeting. In
this example, the target FPGA board is the Intel Arria 10 SOC board and the bitstream uses a single
data type.
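A sketch of the workflow-object creation this step assumes (the arria10soc_single shipping bitstream matches the board and data type named in the text; the pattern follows the ZCU102 example later in this chapter):

hW = dlhdl.Workflow('network', snet, 'Bitstream', 'arria10soc_single', 'Target', hTarget)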
hW =
Workflow with properties:
To compile the MNIST series network, run the compile function of the dlhdl.Workflow object.
dn = hW.compile;
To deploy the network on the Intel Arria 10 SoC hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target device.
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 28-Jun-2020 13:45:47
To load the example image, execute the predict function of the dlhdl.Workflow object, and then
display the FPGA result, enter:
inputImg = imread('five_28x28.pgm');
imshow(inputImg);
Run prediction with the profile 'on' to see the latency and throughput results.
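A minimal sketch of the profiled call (mirroring "Profile Inference Run" on page 5-4; the variable names follow the preceding code):

[prediction, speed] = hW.predict(single(inputImg),'Profile','on');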
See Also
More About
• “Check Host Computer Connection to FPGA Boards”
• “Create Simple Deep Learning Network for Classification”
Get Started with Deep Learning FPGA Deployment on Xilinx ZCU102 SoC
Prerequisites
To load the pretrained series network, which has been trained on the Modified National Institute of Standards and Technology (MNIST) database, enter:
snet = getDigitsNetwork();
Create a target object that has a custom name for your target device and an interface to connect your
target device to the host computer. Interface options are JTAG and Ethernet.
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet')
hTarget =
Target with properties:
Vendor: 'Xilinx'
Interface: Ethernet
IPAddress: '192.168.1.101'
Username: 'root'
Port: 22
Create an object of the dlhdl.Workflow class. Specify the network and the bitstream name during the object creation. Specify the saved pretrained MNIST neural network, snet, as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses a single data type.
hW = dlhdl.Workflow('network', snet, 'Bitstream', 'zcu102_single','Target',hTarget)
hW =
Workflow with properties:
To compile the MNIST series network, run the compile function of the dlhdl.Workflow object.
dn = hW.compile;
Skipping: imageinput
Compiling leg: conv_1>>relu_3 ...
### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: (Layer 1) The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 10) The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
Compiling leg: conv_1>>relu_3 ... complete.
Compiling leg: fc ...
### Notice: (Layer 1) The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 3) The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
Compiling leg: fc ... complete.
Skipping: softmax
Skipping: classoutput
Creating Schedule...
.......
Creating Schedule...complete.
Creating Status Table...
......
Creating Status Table...complete.
Emitting Schedule...
......
Emitting Schedule...complete.
Emitting Status Table...
........
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy
Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 30-Dec-2020 15:13:03
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 30-Dec-2020 15:13:03
To load the example image, execute the predict function of the dlhdl.Workflow object, and then
display the FPGA result, enter:
inputImg = imread('five_28x28.pgm');
imshow(inputImg);
Run prediction with the profile 'on' to see the latency and throughput results.
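A minimal sketch of the profiled call (mirroring "Profile Inference Run" on page 5-4; the variable names follow the preceding code):

[prediction, speed] = hW.predict(single(inputImg),'Profile','on');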
See Also
More About
• “Check Host Computer Connection to FPGA Boards”
• “Create Simple Deep Learning Network for Classification”
Logo Recognition Network
Logos assist users in brand identification and recognition. Many companies incorporate their logos in advertising, documentation materials, and promotions. The logo recognition network (LogoNet) was developed in MATLAB® and can recognize 32 logos under various lighting conditions and camera motions. Because this network focuses only on recognition, you can use it in applications where localization is not required.
Prerequisites
snet = getLogoNetwork();
analyzeNetwork(snet)
Create a target object that has a custom name for your target device and an interface to connect your
target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install
Xilinx™ Vivado™ Design Suite 2019.2. To set the Xilinx Vivado toolpath, enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and
the bitstream name. Specify the saved pretrained logonet neural network, snet, as the network.
Make sure that the bitstream name matches the data type and the FPGA board that you are
targeting. In this example the target FPGA board is the Xilinx ZCU102 SOC board. The bitstream
uses a single data type.
hW = dlhdl.Workflow('network', snet, 'Bitstream', 'zcu102_single','Target',hTarget);
% If running on Xilinx ZC706 board, instead of the above command,
% uncomment the command below.
%
% hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'zc706_single','Target',hTarget);
To compile the logo recognition network, run the compile function of the dlhdl.Workflow object.
dn = hW.compile
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target device.
### Loading weights to FC Processor.
### 33% finished, current time is 28-Jun-2020 12:40:14.
### 67% finished, current time is 28-Jun-2020 12:40:14.
### FC Weights loaded. Current time is 28-Jun-2020 12:40:14
image = imread('heineken.png');
inputImg = imresize(image, [227, 227]);
imshow(inputImg);
Execute the predict function on the dlhdl.Workflow object and display the result:
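A sketch of the call this step assumes (the pattern follows "Profile Inference Run" on page 5-4; variable names follow the preceding code):

[prediction, speed] = hW.predict(single(inputImg),'Profile','on');
[val, idx] = max(prediction);
snet.Layers(end).ClassNames{idx}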
ans =
'heineken'
See Also
More About
• “Check Host Computer Connection to FPGA Boards”
Deploy Transfer Learning Network for Lane Detection
Prerequisites
snet = getLaneDetectionNetwork();
analyzeNetwork(snet)
% The saved network contains 23 layers including input, convolution, ReLU, cross channel normalization,
% max pool, fully connected, and the regression output layers.
Create a target object that has a custom name for your target device and an interface to connect your
target device to the host computer. Interface options are JTAG and Ethernet.
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and
the bitstream name. Specify the saved pretrained lanenet neural network, snet, as the network. Make
sure that the bitstream name matches the data type and the FPGA board that you are targeting. In
this example the target FPGA board is the Xilinx ZCU102 SOC board. The bitstream uses a single
data type.
hW = dlhdl.Workflow('network', snet, 'Bitstream', 'zcu102_single','Target',hTarget);
% If running on Xilinx ZC706 board, instead of the above command,
% uncomment the command below.
%
% hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'zc706_single','Target',hTarget);
To compile the lanenet series network, run the compile function of the dlhdl.Workflow object.
dn = hW.compile;
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy;
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target device.
### Loading weights to FC Processor.
### 13% finished, current time is 28-Jun-2020 12:36:09.
### 25% finished, current time is 28-Jun-2020 12:36:10.
### 38% finished, current time is 28-Jun-2020 12:36:11.
### 50% finished, current time is 28-Jun-2020 12:36:12.
### 63% finished, current time is 28-Jun-2020 12:36:13.
### 75% finished, current time is 28-Jun-2020 12:36:14.
### 88% finished, current time is 28-Jun-2020 12:36:14.
### FC Weights loaded. Current time is 28-Jun-2020 12:36:15
Run the demoOnVideo function for the dlhdl.Workflow class object. This function loads the
example video, executes the predict function of the dlhdl.Workflow object, and then plots the
result.
demoOnVideo(hW,1);
Image Category Classification by Using Deep Learning
Prerequisites
snet = alexnet;
% snet = vgg19;
% snet = darknet19;
analyzeNetwork(snet)
% The saved network contains 25 layers including input, convolution, ReLU, cross channel normalization,
% max pool, fully connected, and the softmax output layers.
Use the dlhdl.Target class to create a target object with a custom name for your target device and
an interface to connect your target device to the host computer. Interface options are JTAG and
Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2019.2. To set the Xilinx Vivado toolpath, enter:
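% The toolpath command matches the one shown in the earlier examples
% (the installation folder is an example path):
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');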
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Use the dlhdl.Workflow class to create an object. When you create the object, specify the network
and the bitstream name. Specify the saved pretrained alexnet neural network as the network. Make
sure that the bitstream name matches the data type and the FPGA board that you are targeting. In
this example, the target FPGA board is the Xilinx ZCU102 SoC board. The bitstream uses a single
data type.
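A sketch of the workflow-object creation this step assumes (the zcu102_single shipping bitstream matches the board and data type named in the text):

hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'zcu102_single', 'Target', hTarget);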
To compile the Alexnet series network, run the compile method of the dlhdl.Workflow object. You
can optionally specify the maximum number of input frames.
dn = hW.compile('InputFrameNumberLimit',15)
To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target device.
### Deep learning network programming has been skipped as the same network is already loaded on the target device.
imgFile = 'espressomaker.jpg';
inputImg = imresize(imread(imgFile), [227,227]);
imshow(inputImg)
Execute the predict method on the dlhdl.Workflow object and then show the label in the MATLAB
command window.
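A sketch of the call this step assumes (same pattern as the earlier examples; variable names follow the preceding code):

[prediction, speed] = hW.predict(single(inputImg),'Profile','on');
[val, idx] = max(prediction);
snet.Layers(end).ClassNames{idx}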
ans =
'espresso maker'
Load multiple images and retrieve their prediction results by using the multiple frame support feature. For more information, see “Multiple Frame Support” on page 5-6.
The demoOnImage function loads multiple images and retrieves their prediction results. The annotateresults function displays the image prediction result on top of the images, which are assembled into a 3-by-5 array.
imshow(inputImg)
demoOnImage;
Defect Detection
This example shows how to deploy a custom trained series network to detect defects in objects such
as hexagon nuts. The custom networks were trained by using transfer learning. Transfer learning is
commonly used in deep learning applications. You can take a pretrained network and use it as a
starting point to learn a new task. Fine-tuning a network with transfer learning is usually much faster
and easier than training a network with randomly initialized weights from scratch. You can quickly
transfer learned features to a new task using a smaller number of training signals. This example uses
two trained series networks trainedDefNet.mat and trainedBlemDetNet.mat.
Prerequisites
To download and load the custom pretrained series networks trainedDefNet and
trainedBlemDetNet, enter:
if ~isfile('trainedDefNet.mat')
url = 'https://www.mathworks.com/supportfiles/dlhdl/trainedDefNet.mat';
websave('trainedDefNet.mat',url);
end
net1 = load('trainedDefNet.mat');
snet_defnet = net1.custom_alexnet
snet_defnet =
SeriesNetwork with properties:
analyzeNetwork(snet_defnet)
if ~isfile('trainedBlemDetNet.mat')
url = 'https://www.mathworks.com/supportfiles/dlhdl/trainedBlemDetNet.mat';
websave('trainedBlemDetNet.mat',url);
end
net2 = load('trainedBlemDetNet.mat');
snet_blemdetnet = net2.convnet
snet_blemdetnet =
SeriesNetwork with properties:
analyzeNetwork(snet_blemdetnet)
Create a target object that has a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use the JTAG connection, install the Xilinx™ Vivado™ Design Suite 2020.1.
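A sketch of the target-object creation that produces the display below (consistent with the properties shown):

hT = dlhdl.Target('Xilinx','Interface','Ethernet')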
hT =
Target with properties:
Vendor: 'Xilinx'
Interface: Ethernet
IPAddress: '192.168.1.101'
Username: 'root'
Port: 22
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and
the bitstream name. Specify the saved pretrained trainedDefNet as the network. Make sure that
the bitstream name matches the data type and the FPGA board that you are targeting. In this
example the target FPGA board is the Xilinx ZCU102 SOC board. The bitstream uses a single data
type.
hW = dlhdl.Workflow('Network',snet_defnet,'Bitstream','zcu102_single','Target',hT)
hW =
Workflow with properties:
To compile the trainedDefnet series network, run the compile function of the dlhdl.Workflow object.
hW.compile
Skipping: data
Compiling leg: conv1>>pool5 ...
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 16-Dec-2020 16:16:31
### Loading weights to FC Processor.
### 20% finished, current time is 16-Dec-2020 16:16:32.
### 40% finished, current time is 16-Dec-2020 16:16:32.
### 60% finished, current time is 16-Dec-2020 16:16:33.
### 80% finished, current time is 16-Dec-2020 16:16:34.
### FC Weights loaded. Current time is 16-Dec-2020 16:16:34
Load an image from the attached testImages folder, resize the image to match the network image
input layer dimensions, and run the predict function of the dlhdl.Workflow object to retrieve and
display the defect prediction from the FPGA.
% Set the expected image dimensions, then read and resize the test image.
wi = uint32(320);
he = uint32(240);
ch = uint32(3);
filename = fullfile(pwd,'ng1.png');
img = imread(filename);
img = imresize(img, [he, wi]);
% Convert the MATLAB image to the OpenCV layout that the helper functions expect.
img = mat2ocv(img);
sz = [he wi ch];
out = ocv2mat(out,sz);
imshow(out)
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and
the bitstream name. Specify the saved pretrained trainedblemDetNet as the network. Make sure
that the bitstream name matches the data type and the FPGA board that you are targeting. In this
example the target FPGA board is the Xilinx ZCU102 SOC board. The bitstream uses a single data
type.
hW = dlhdl.Workflow('Network',snet_blemdetnet,'Bitstream','zcu102_single','Target',hT)
hW =
Workflow with properties:
To compile the trainedBlemDetNet series network, run the compile function of the
dlhdl.Workflow object.
hW.compile
Skipping: imageinput
Compiling leg: conv_1>>maxpool_2 ...
Compiling leg: conv_1>>maxpool_2 ... complete.
Compiling leg: fc_1>>fc_2 ...
Compiling leg: fc_1>>fc_2 ... complete.
Skipping: softmax
Skipping: classoutput
Creating Schedule...
.......
Creating Schedule...complete.
Creating Status Table...
......
Creating Status Table...complete.
Emitting Schedule...
......
Emitting Schedule...complete.
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target device.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 16-Dec-2020 16:16:47
### Loading weights to FC Processor.
### 50% finished, current time is 16-Dec-2020 16:16:48.
### FC Weights loaded. Current time is 16-Dec-2020 16:16:48
Load an image from the attached testImages folder, resize the image to match the network image
input layer dimensions, and run the predict function of the dlhdl.Workflow object to retrieve and
display the defect prediction from the FPGA.
wi = uint32(320);
he = uint32(240);
ch = uint32(3);
filename = fullfile(pwd,'ok1.png');
img=imread(filename);
img = imresize(img, [he, wi]);
img = mat2ocv(img);
sz = [he wi ch];
out = ocv2mat(out,sz);
imshow(out)
The trainedBlemDetNet network improves performance to 45 frames per second. The target performance of the deployed network is 100 frames per second while staying within the target resource utilization budget. The resource utilization budget takes into consideration parameters such as memory size, onboard IO, and so on. Increasing the resource utilization could mean choosing a larger board, which could cost more money. You can increase the deployed network performance and stay within the resource utilization budget by quantizing the network. To quantize and deploy the trainedBlemDetNet network:
• Load the data set as an image datastore. The imageDatastore labels the images based on folder
names and stores the data. Divide the data into calibration and validation data sets. Use 50% of
the images for calibration and 50% of the images for validation. Expedite the calibration and
validation process by using a subset of the calibration and validation image sets.
if ~isfile('dataSet.zip')
url = 'https://www.mathworks.com/supportfiles/dlhdl/dataSet.zip';
websave('dataSet.zip',url);
end
unzip('dataSet.zip')
unzip('dataset.zip')
imageData = imageDatastore(fullfile('dataset'),...
'IncludeSubfolders',true,'FileExtensions','.PNG','LabelSource','foldernames');
[calibrationData, validationData] = splitEachLabel(imageData, 0.5,'randomized');
calibrationData_reduced = calibrationData.subset(1:20);
validationData_reduced = validationData.subset(1:1);
• Create a quantized network by using the dlquantizer object. Set the target execution
environment to FPGA.
dlQuantObj = dlquantizer(snet_blemdetnet,'ExecutionEnvironment','FPGA')
dlQuantObj =
dlquantizer with properties:
• Use the calibrate function to exercise the network with sample inputs and collect the range
information. The calibrate function exercises the network and collects the dynamic ranges of
the weights and biases in the convolution and fully connected layers of the network and the
dynamic ranges of the activations in all layers of the network. The calibrate function returns a
table. Each row of the table contains range information for a learnable parameter of the quantized
network.
dlQuantObj.calibrate(calibrationData_reduced)
ans = 21×5 table
    Optimized Layer Name    Network Layer Name    Learnables / Activations    MinValue    MaxValue
    ____________________    __________________    ________________________    ________    ________
• Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the saved pretrained quantized trainedBlemDetNet object dlQuantObj as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example the target FPGA board is the Xilinx ZCU102 SoC board. The bitstream uses an int8 data type.
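For example, assuming the target object hT created earlier in this example:
hW = dlhdl.Workflow('Network',dlQuantObj,'Bitstream','zcu102_int8','Target',hT);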
• To compile the quantized network, run the compile function of the dlhdl.Workflow object.
hW.compile('InputFrameNumberLimit',30)
     4   'maxpool_1'     Max Pooling                   2×2 max pooling with stride [2 2] and padding [0 0 0 0]
     5   'crossnorm'     Cross Channel Normalization   cross channel normalization with 5 channels per element
     6   'conv_2'        Convolution                   20 5×5×20 convolutions with stride [1 1] and padding [0 0 0 0]
     7   'relu_2'        ReLU                          ReLU
     8   'maxpool_2'     Max Pooling                   2×2 max pooling with stride [2 2] and padding [0 0 0 0]
     9   'fc_1'          Fully Connected               512 fully connected layer
    10   'fc_2'          Fully Connected               2 fully connected layer
    11   'softmax'       Softmax                       softmax
    12   'classoutput'   Classification Output         crossentropyex with classes 'ng' and 'ok'
Skipping: imageinput
Compiling leg: conv_1>>maxpool_2 ...
Compiling leg: conv_1>>maxpool_2 ... complete.
Compiling leg: fc_1>>fc_2 ...
Compiling leg: fc_1>>fc_2 ... complete.
Skipping: softmax
Skipping: classoutput
Creating Schedule...
.........
Creating Schedule...complete.
Creating Status Table...
........
Creating Status Table...complete.
Emitting Schedule...
......
Emitting Schedule...complete.
Emitting Status Table...
..........
Emitting Status Table...complete.
• To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device, and displays progress messages and the time it takes to deploy the network.
hW.deploy
Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot for persistent changes to take effect.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 16-Dec-2020 16:18:03
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 16-Dec-2020 16:18:03
• Load an image from the attached testImages folder, resize the image to match the network
image input layer dimensions, and run the predict function of the dlhdl.Workflow object to
retrieve and display the defect prediction from the FPGA.
wi = uint32(320);
he = uint32(240);
ch = uint32(3);
filename = fullfile(pwd,'ok1.png');
img = imread(filename);
img = imresize(img, [he, wi]);
img = mat2ocv(img);
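As before, run the prediction and post-processing steps that produce out (a minimal sketch with the same assumed support functions):
[Iori, imgPacked, num, bbox] = myNDNet_Preprocess(img);
result = hW.predict(single(imgPacked),'Profile','on');
out = myNDNet_Postprocess(Iori, result, num, bbox);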
sz = [he wi ch];
out = ocv2mat(out,sz);
imshow(out)
To test that the quantized network can identify all test cases, deploy an additional image, resize the image to match the network image input layer dimensions, and run the predict function of the dlhdl.Workflow object to retrieve and display the defect prediction from the FPGA.
wi = uint32(320);
he = uint32(240);
ch = uint32(3);
filename = fullfile(pwd,'okng.png');
img = imread(filename);
img = imresize(img, [he, wi]);
img = mat2ocv(img);
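As before, run the prediction and post-processing steps that produce out (a minimal sketch with the same assumed support functions):
[Iori, imgPacked, num, bbox] = myNDNet_Preprocess(img);
result = hW.predict(single(imgPacked),'Profile','on');
out = myNDNet_Postprocess(Iori, result, num, bbox);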
sz = [he wi ch];
out = ocv2mat(out,sz);
imshow(out)
Quantizing the network improves the performance from 45 frames per second to 125 frames per
second and reduces the deployed network size from 88 MB to 72 MB.
Profile Network for Performance Improvement
Prerequisites
The saved pretrained digits network, snet, is a 15×1 layer array.
Create a target object that has a custom name for your target device and an interface to connect your
target device to the host computer. Interface options are JTAG and Ethernet. For Ethernet interface,
enter:
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
To use the JTAG interface, install Xilinx™ Vivado™ Design Suite 2019.2. Set up the path to your
installed Xilinx Vivado executable if it is not already set up. For example, to set the toolpath, enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and
the bitstream name. Specify the saved pretrained digits neural network, snet, as the network. Make
sure that the bitstream name matches the data type and the FPGA board that you are targeting. In
this example the target FPGA board is the Xilinx ZCU102 SoC board. The bitstream uses a single
data type.
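A minimal sketch of the object creation, assuming the target object hTarget created above:
hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'zcu102_single', 'Target', hTarget);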
To compile the MNIST series network, run the compile function of the dlhdl.Workflow object.
dn = hW.compile;
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases.
hW.deploy;
Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot for persistent changes to take effect.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 28-Jun-2020 12:24:21
inputImg = imread('five_28x28.pgm');
Execute the predict function of the dlhdl.Workflow object with the 'Profile' option set to 'on' to display the latency and throughput results, as shown below.
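A minimal sketch of the call; the profiler results are returned in the table speed, which the following code modifies:
[prediction, speed] = hW.predict(single(inputImg),'Profile','on');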
Remove the NumFrames, Total latency, and Frames/s columns from the profiler results table. This includes removing the module-level and network-level profiler results; retain only the network layer profiler results. After the bottleneck layer has been identified, display the bottleneck layer index, running time, and layer information.
speed('Network',:) = [];
speed('____conv_module',:) = [];
speed('____fc_module',:) = [];
speed = removevars(speed, {'NumFrames','Total Latency(cycles)','Frame/s'});
% The first row in the profile table is the bottleneck layer. Use its name
% to look up the corresponding layer in the network.
layerSpeed = speed(1,:);
layerName = strip(layerSpeed.Properties.RowNames{1},'_');
for idx = 1:length(snet.Layers)
currLayer = snet.Layers(idx);
if strcmp(currLayer.Name, layerName)
bottleNeckLayer = currLayer;
break;
end
end
### It accounts for about 63.29 percent of the total running time.
disp(currLayer);
Name: 'fc'
Hyperparameters
InputSize: 1568
OutputSize: 10
Learnable Parameters
Weights: [10×1568 single]
Bias: [10×1 single]
Bicyclist and Pedestrian Classification by Using FPGA
Prerequisites
• The MAT File trainedNetBicPed.mat contains a model trained on training data set
trainDataNoCar and its label set trainLabelNoCar.
• The MAT File testDataBicPed.mat contains the test data set testDataNoCar and its label set
testLabelNoCar.
load('trainedNetBicPed.mat','trainedNetNoCar')
load('testDataBicPed.mat')
analyzeNetwork(trainedNetNoCar);
Set up the path to your installed Xilinx™ Vivado™ Design Suite 2019.2 executable if it is not already
set up. For example, to set the toolpath, enter:
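% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');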
Create a target object for your target device with a vendor name and an interface to connect your
target device to the host computer. Interface options are JTAG (default) and Ethernet. Vendor options
are Intel or Xilinx. Use the installed Xilinx Vivado Design Suite over an Ethernet connection to
program the device.
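For example:
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');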
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and
the bitstream name. Specify the saved pre-trained series network, trainedNetNoCar, as the
network. Make sure the bitstream name matches the data type and the FPGA board that you are
targeting. In this example, the target FPGA board is the Zynq UltraScale+ MPSoC ZCU102 board.
The bitstream uses a single data type.
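A minimal sketch, assuming the target object hTarget created above:
hW = dlhdl.Workflow('Network', trainedNetNoCar, 'Bitstream', 'zcu102_single', 'Target', hTarget);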
To compile the trainedNetNoCar series network, run the compile function of the dlhdl.Workflow object.
dn = hW.compile;
To deploy the network on the Zynq® UltraScale+™ MPSoC ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. The function also downloads the network weights and biases. The deploy function checks for the Xilinx Vivado tool and the supported tool version. It then starts programming the FPGA device by using the bitstream, and displays progress messages and the time it takes to deploy the network.
hW.deploy;
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Deep learning network programming has been skipped as the same network is already loaded on the target FPGA.
Classify one input from the sample test data set by using the predict function of the
dlhdl.Workflow object and display the label. The inputs to the network correspond to the
sonograms of the micro-Doppler signatures for a pedestrian or a bicyclist or a combination of both.
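A minimal sketch of the single-input prediction, assuming the class names are taken from the network's output layer:
testImg = testDataNoCar(:, :, :, 1);
testLabel = testLabelNoCar(1);
classnames = trainedNetNoCar.Layers(end).Classes;
score = hW.predict(testImg,'Profile','on');
[~, idx] = max(score);
predTestLabel = classnames(idx)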
predTestLabel = categorical
ped
Load five random images from the sample test data set and execute the predict function of the dlhdl.Workflow object to display the labels alongside the signatures, as in the sketch below. The predictions happen in a single call because the input images are concatenated along the fourth dimension.
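A minimal sketch of the batched prediction, assuming classnames from the previous step:
numView = 5;
listIndex = randperm(size(testDataNoCar,4),numView);
testImgBatch = testDataNoCar(:, :, :, listIndex);
scores = hW.predict(testImgBatch,'Profile','on');
[~, idx] = max(scores,[],2);
predTestLabels = classnames(idx);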
fc 19441 0.00009
* The clock frequency of the DL processor is: 220MHz
% Display the micro-doppler signatures along with the ground truth and
% predictions.
for k = 1:numView
index = listIndex(k);
imagesc(testDataNoCar(:, :, :, index));
axis xy
xlabel('Time (s)')
ylabel('Frequency (Hz)')
title('Ground Truth: '+string(testLabelNoCar(index))+', Prediction FPGA: '+string(predTestLabels(k)));
drawnow;
pause(3);
end
The image shows the micro-Doppler signatures of two bicyclists (bic+bic), which is the ground truth. The ground truth is the classification of the image against which the network prediction is compared. The network prediction retrieved from the FPGA correctly predicts that the image has two bicyclists.
Visualize Activations of a Deep Learning Network by Using LogoNet
Logos assist in brand identification and recognition. Many companies incorporate their logos in
advertising, documentation materials, and promotions. The logo recognition network (LogoNet) was
developed in MATLAB® and can recognize 32 logos under various lighting conditions and camera
motions. Because this network focuses only on recognition, you can use it in applications where
localization is not required.
Prerequisites
snet = getLogoNetwork();
Create a target object that has a custom name for your target device and an interface to connect your
target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install
Intel™ Quartus™ Prime Standard Edition 18.1. Set up the path to your installed Intel Quartus Prime
executable if it is not already set up. For example, to set the toolpath, enter:
hTarget = dlhdl.Target('Intel','Interface','JTAG');
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and
the bitstream name. Specify the saved pretrained LogoNet neural network, snet, as the network.
Make sure that the bitstream name matches the data type and the FPGA board that you are
targeting. In this example, the target FPGA board is the Intel Arria10 SoC board. The bitstream uses
a single data type.
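A minimal sketch, assuming the single-data-type Arria 10 SoC bitstream name arria10soc_single:
hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'arria10soc_single', 'Target', hTarget);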
Read and show an image. Save its size for future use.
im = imread('ferrari.jpg');
imshow(im)
imgSize = size(im);
imgSize = imgSize(1:2);
Analyze the network to see which layers you can view. The convolutional layers perform convolutions
by using learnable parameters. The network learns to identify useful features, often including one
feature per channel. The first convolutional layer has 96 channels.
analyzeNetwork(snet)
The Image Input layer specifies the input size. Before passing the image through the network, you can resize it. The network can also process larger images. If you feed the network larger images, the activations also become larger. Because the network is trained on images of size 227-by-227, it is not trained to recognize larger objects or features.
Investigate features by observing which areas in the maxpool layers activate on an image and
comparing that image to the corresponding areas in the original images. Each layer of a convolutional
neural network consists of many 2-D arrays called channels. Pass the image through the network and
examine the output activations of the maxpool_1 layer.
act1 = hW.activations(single(im),'maxpool_1','Profiler','on');
The activations are returned as a 3-D array, with the third dimension indexing the channel on the
maxpool_1 layer. To show these activations using the imtile function, reshape the array to 4-D. The
third dimension in the input to imtile represents the image color. Set the third dimension to have
size 1 because the activations do not have color. The fourth dimension indexes the channel.
sz = size(act1);
act1 = reshape(act1,[sz(1) sz(2) 1 sz(3)]);
Display the activations. Each activation can take any value, so normalize the output by using mat2gray. All activations are scaled so that the minimum activation is 0 and the maximum activation is 1. Display the 96 images on a 12-by-8 grid, one for each channel in the layer.
I = imtile(mat2gray(act1),'GridSize',[12 8]);
imshow(I)
Each tile in the activations grid is the output of a channel in the maxpool_1 layer. White pixels
represent strong positive activations and black pixels represent strong negative activations. A
channel that is mostly gray does not activate as strongly on the input image. The position of a pixel in
the activation of a channel corresponds to the same position in the original image. A white pixel at a
location in a channel indicates that the channel is strongly activated at that position.
Resize the activations in channel 33 to be the same size as the original image and display the
activations.
act1ch33 = act1(:,:,:,33);
act1ch33 = mat2gray(act1ch33);
act1ch33 = imresize(act1ch33,imgSize);
I = imtile({im,act1ch33});
imshow(I)
Find interesting channels by programmatically investigating channels with large activations. Find the
channel that has the largest activation by using the max function, resize the channel output, and
display the activations.
[maxValue,maxValueIndex] = max(max(max(act1)));
act1chMax = act1(:,:,:,maxValueIndex);
act1chMax = mat2gray(act1chMax);
act1chMax = imresize(act1chMax,imgSize);
I = imtile({im,act1chMax});
imshow(I)
Compare the strongest activation channel image to the original image. This channel activates on
edges. It activates positively on light left/dark right edges and negatively on dark left/light right
edges.
See Also
More About
• activations
Authoring a Reference Design for Live Camera Integration with Deep Learning Processor IP Core
• Extracts the region of interest (ROI) based on ROI dimensions from the processing system (PS)
(ARM).
• Performs downsampling on the input image.
• Zero-centers the input image.
• Transfers the preprocessed image to the external DDR memory.
• Triggers the deep learning processor IP core.
• Notifies the PS (ARM) processor.
The deep learning processor IP core accesses the preprocessed inputs, performs the object
classification and loads the output results back into the external DDR memory.
The PS (ARM):
• Takes the ROI dimensions and passes them to the user IP core.
• Performs post-processing on the image data.
• Annotates the object classification results from the deep learning processor IP core on the output
video frame.
You can also use MATLAB® to retrieve the classification results and verify the generated deep
learning processor IP core. The user DUT for this reference design is the preprocessing algorithm
(User IP Core). You can design the preprocessing DUT algorithm in Simulink®, generate the DUT IP
core, and integrate the generated DUT IP core into the larger system that contains the deep learning
processor IP core. To learn how to generate the DUT IP core, see “Run a Deep Learning Network on
FPGA with Live Camera Input” on page 10-62.
Follow these steps to configure and generate the deep learning processor IP core into the reference
design.
hPC = dlhdl.ProcessorConfig
To learn more about the deep learning processor architecture, see “Deep Learning Processor
Architecture” on page 2-2. To get information about the custom processor configuration parameters
and modifying the parameters, see getModuleProperty and setModuleProperty.
To learn how to generate the custom deep learning processor IP, see “Generate Custom Processor IP”
on page 9-4. The deep learning processor IP core is generated by using the HDL Coder™ IP core
generation workflow. For more information, see “Custom IP Core Generation” (HDL Coder).
dlhdl.buildProcessor(hPC)
The generated IP core files are located at cwd\dlhdl_prj\ipcore. cwd is the current working directory. The ipcore folder contains an HTML report located at cwd\dlhdl_prj\ipcore\DUT_ip_v1_0\doc.
The HTML report contains a description of the deep learning processor IP core, instructions for using the core and integrating the core into your Vivado™ reference design, and a list of AXI4 registers. You need the AXI4 register list to enter addresses into the Vivado™ Address Mapping tool. For more information about the AXI4 registers, see “Deep Learning Processor Register Map” on page 12-9.
Integrate the Generated Deep Learning Processor IP Core into the Reference Design
Insert the generated deep learning processor IP core into your reference design. After inserting the
generated deep learning processor IP core into the reference design, you must:
• Connect the generated deep learning processor IP core AXI4 slave interface to an AXI4 master
device such as a JTAG AXI master IP core or a Zynq™ processing system (PS). Use the AXI4
master device to communicate with the deep learning processor IP core.
• Connect the vendor provided external memory interface IP core to the three AXI4 master
interfaces of the generated deep learning processor IP core.
The deep learning processor IP core uses the external memory interface to access the external DDR
memory. The image shows the deep learning processor IP core integrated into the Vivado™ reference
design and connected to the DDR memory interface generator (MIG) IP.
In your Vivado™ reference design, add an external memory interface generator (MIG) block and
connect the generated deep learning processor IP core to the MIG module. The MIG module is
connected to the processor IP core through an AXI interconnect module. The image shows the high
level architectural design and the Vivado™ reference design implementation.
The following code describes the contents of the ZCU102 reference design definition file
plugin_rd.m for the above Vivado™ reference design. For more details on how to define and register
the custom board, refer to the “Define Custom Board and Reference Design for Zynq Workflow” (HDL
Coder).
% Parse config
config = ZynqVideoPSP.common.parse_config(...
'ToolVersion', '2019.1', ...
'Board', 'zcu102', ...
After creating the reference design, use the HDL Coder™ IP core generation workflow to generate
the bitstream and program the ZCU102 board. You can then use MATLAB® and a dlhdl.Workflow
object to verify the deep learning processor IP core or you can use the HDL Coder™ workflow to
prototype the entire system. To verify the reference design, see “Run a Deep Learning Network on
FPGA with Live Camera Input” on page 10-62.
Run a Deep Learning Network on FPGA with Live Camera Input
Introduction
1 Model the preprocessing logic that processes the live camera input for the deep learning
processor IP core. The processed video frame is sent to the external DDR memory on the FPGA
board.
2 Simulate the model in Simulink® to verify the algorithm functionality.
3 Implement the preprocessing logic on a ZCU102 board by using a custom video reference design
which includes the generated deep learning processor IP core.
4 Individually validate the preprocessing logic on the FPGA board.
5 Individually validate the deep learning processor IP core functionality by using the Deep
Learning HDL Toolbox™ prototyping workflow.
6 Deploy and validate the entire system on a ZCU102 board.
This figure is a high-level architectural diagram of the system. The result of the deep learning
network prediction is sent to the ARM processor. The ARM processor annotates the deep learning
network prediction onto the output video frame.
The objective of this system is to receive the live camera input through the HDMI input of the FMC
daughter card on the ZCU102 board. You design the preprocessing logic in Simulink® to select and
resize the region of interest (ROI). You then transmit the processed image frame to the deep learning
processor IP core to run image classification by using a deep learning network.
Model the preprocessing logic to process the live camera input for the deep learning network and send the video frame to external DDR memory on the FPGA board. This logic is modeled in the DUT subsystem:
• Image frame selection logic that allows you to use your cursor to choose an ROI from the incoming
camera frame. The selected ROI is the input to the deep learning network.
• Image resizing logic that resizes the ROI image to match the input image size of the deep learning
network.
• AXI4 Master interface logic that sends the resized image frame into the external DDR memory,
where the deep learning processor IP core reads the input. To model the AXI4 Master interface,
see “Model Design for AXI4 Master Interface Generation” (HDL Coder).
This figure shows the Simulink® model for the preprocessing logic DUT.
To implement the preprocessing logic model on a ZCU102 SoC board, create an HDL Coder™
reference design in Vivado™ which receives the live camera input and transmits the processed video
data to the deep learning processor IP core. To create a custom video reference design that
integrates the deep learning processor IP core, see “Authoring a Reference Design for Live Camera
Integration with Deep Learning Processor IP Core” on page 10-57.
Start the HDL Coder HDL Workflow Advisor and use the Zynq hardware-software co-design workflow
to deploy the preprocessing logic model on Zynq hardware. This workflow is the standard HDL Coder
workflow. In this example the only difference is that this reference design contains the generated
deep learning processor IP core. For more details refer to the “Getting Started with Targeting Xilinx
Zynq Platform” (HDL Coder) example.
1. Start the HDL Workflow Advisor from the model by right-clicking the DLPreProcess DUT
subsystem and selecting HDL Advisor Workflow.
In Task 1.1, IP Core Generation is selected for Target workflow and ZCU102-FMC-HDMI-CAM is
selected for Target platform.
In Task 1.2, HDMI RGB with DL Processor is selected for Reference Design.
In Task 1.3, the Target platform interface table is loaded as shown in the following screenshot.
Here you can map the ports of the DUT subsystem to the interfaces in the reference design.
2. Right-click Task 3.2, Generate RTL Code and IP Core, and then select Run to Selected Task.
You can find the register address mapping and other documentation for the IP core in the generated
IP Core Report.
In the HDL Workflow Advisor, run the Embedded System Integration tasks to deploy the generated
HDL IP core on Zynq hardware.
1. Run Task 4.1, Create Project. This task inserts the generated IP core into the HDMI RGB with
DL Processor reference design. To create a reference design that integrates the deep learning
processor IP core, see “Authoring a Reference Design for Live Camera Integration with Deep
Learning Processor IP Core” on page 10-57.
2. Click the link in the Result pane to open the generated Vivado project. In the Vivado tool, click
Open Block Design to view the Zynq design diagram, which includes the generated preprocessing
HDL IP core, the deep learning processor IP core and the Zynq processor.
3. In the HDL Workflow Advisor, run the rest of the tasks to generate the software interface model
and build and download the FPGA bitstream.
To validate the integrated reference design that includes the generated preprocessing logic IP core,
deep learning processor IP core, and the Zynq processor:
1. Using the standard HDL Coder hardware/software co-design workflow, you can validate that the
preprocessing logic works as expected on the FPGA. The HDL Workflow Advisor generates a software
interface subsystem during Task 4.2 Generate Software Interface Model, which you can use in
your software model for interfacing with the FPGA logic. From the software model, you can tune and
probe the FPGA design on the hardware by using Simulink External Mode. Instruct the FPGA
preprocessing logic to capture an input frame and send it to the external DDR memory.
You can then use the fpga object to create a connection from MATLAB to the ZCU102 board and read the contents of the external DDR memory into MATLAB for validation. To learn how to use the fpga object, see “Create Software Interface Script to Control and Rapidly Prototype HDL IP Core” (HDL Coder).
2. The generated deep learning processor IP core has Ethernet and JTAG interfaces for
communications in the generated bitstream. You can individually validate the deep learning processor
IP core by using the dlhdl.Workflow object.
3. After you individually validate the preprocessing logic IP core and the deep learning processor IP
core, you can prototype the entire integrated system on the FPGA board. Using Simulink External
mode, instruct the FPGA preprocessing logic to send a processed input image frame to the DDR
buffer, instruct the deep learning processor IP core to read from the same DDR buffer, and execute
the prediction.
The deep learning processor IP core sends the result back to the external DDR memory. The software
model running on the ARM processor retrieves the prediction result and annotates the prediction on
the output video stream. This screenshot shows that you can read the ARM processor prediction
result by using a serial connection.
This screenshot shows the frame captured from the output video stream which includes the ROI
selection and the annotated prediction result.
4. After completing all your verification steps, manually deploy the entire reference design as an executable on the SD card on the ZCU102 board by using the ARM processor. After the manual deployment is complete, a MATLAB connection to the FPGA board is no longer required to operate the reference design.
Running Convolution-Only Networks by using FPGA Deployment
Prerequisites
ResNet-50 Network
ResNet-50 is a convolutional neural network that is 50 layers deep. This pretrained network can classify images into 1000 object categories (such as keyboard, mouse, pencil, and more). The network has learned rich feature representations for a wide range of images. The network has an image input size of 224-by-224.
rnet = resnet50;
To visualize the structure of the ResNet-50 network, at the MATLAB command prompt, enter:
analyzeNetwork(rnet)
To examine the outputs of the max_pooling2d_1 layer, create this network, which is a subset of the ResNet-50 network:
layers = rnet.Layers(1:5);
outLayer = regressionLayer('Name','output');
layers(end+1) = outLayer;
snet = assembleNetwork(layers);
Create a target object with a custom name and an interface to connect your target device to the host
computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx™ Vivado™ Design
Suite 2019.2. To set the Xilinx Vivado toolpath, enter:
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and
the bitstream name. Specify the saved pretrained ResNet-50 subset network, snet, as the network.
Make sure that the bitstream name matches the data type and the FPGA board that you are
targeting. In this example the target FPGA board is the Xilinx ZCU102 SoC board. The bitstream
uses a single data type.
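A minimal sketch, assuming the target object hTarget created above:
hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'zcu102_single', 'Target', hTarget);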
To compile the modified ResNet-50 series network, run the compile function of the dlhdl.Workflow
object.
dn = hW.compile
To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function programs the FPGA device, and displays progress messages and the time it takes to deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Deep learning network programming has been skipped as the same network is already loaded on the target FPGA.
Load and display an image to use as an input image to the series network.
I = imread('daisy.jpg');
imshow(I)
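The prediction call that produces the result P is not shown in this excerpt. A minimal sketch, assuming the input image is resized to the 224-by-224 network input size:
I = imresize(I,[224 224]);
P = hW.predict(single(I),'Profile','on');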
The result data is returned as a 3-D array, with the third dimension indexing across the 64 feature
images.
sz = size(P)
sz = 1×3
56 56 64
To visualize all 64 features in a single image, the data is reshaped into 4 dimensions, which is an appropriate input for the imtile function.
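A minimal sketch of the reshape, assuming the 3-D result P and its size sz from above:
R = reshape(P,[sz(1) sz(2) 1 sz(3)]);
sz = size(R)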
sz = 1×4
56 56 1 64
The input to imtile is normalized by using mat2gray. All values are scaled so that the minimum activation is 0 and the maximum activation is 1. To show these activations by using the imtile function, reshape the array to 4-D. The third dimension in the input to imtile represents the image color. Set the third dimension to size 1 because the activations do not have color. The fourth dimension indexes the channel. A grid size of 8-by-8 is selected because there are 64 features to display.
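A minimal sketch of the tiling step, assuming the reshaped array R from above:
J = imtile(mat2gray(R),'GridSize',[8 8]);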
imshow(J)
Bright features indicate a strong activation. Running the network and visualizing the resulting data is a useful tool for understanding and debugging convolutional networks.
Accelerate Prototyping Workflow for Large Networks by using Ethernet
Prerequisites
• Xilinx ZCU102 SoC development kit. For help with board setup, see “Guided SD Card Set Up”
(Deep Learning HDL Toolbox Support Package for Xilinx FPGA and SoC Devices).
• Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
• Deep Learning HDL Toolbox™
• Deep Learning Toolbox™ Model for AlexNet Network
Introduction
Deep Learning HDL Toolbox establishes a connection between the host computer and FPGA board to
prototype deep learning networks on hardware. This connection is used to deploy deep learning networks and run predictions. The connection provides two services:
• Programming the bitstream onto the FPGA
• Communicating with the design that is running on the FPGA
There are two hardware interfaces for establishing a connection between the host computer and
FPGA board: JTAG and Ethernet.
JTAG Interface
The JTAG interface programs the bitstream onto the FPGA over JTAG. The bitstream is not persistent through power cycles. You must reprogram the bitstream each time the FPGA is turned on.
MATLAB uses JTAG to control an AXI Master IP in the FPGA design and to communicate with the design running on the FPGA. You can use the AXI Master IP to read and write memory locations in the onboard memory and deep learning processor.
Ethernet Interface
The Ethernet interface leverages the ARM processor to send and receive information from the design
running on the FPGA. The ARM processor runs on a Linux operating system. You can use the Linux
operating system services to interact with the FPGA. When using the Ethernet interface, the
bitstream is downloaded to the SD card. The bitstream is persistent through power cycles; the FPGA is automatically reprogrammed from the SD card each time it is turned on. The ARM processor is configured with the correct device tree when the bitstream is programmed.
To communicate with the design running on the FPGA, MATLAB leverages the Ethernet connection
between the host computer and ARM processor. The ARM processor runs a LIBIIO service, which
communicates with a datamover IP in the FPGA design. The datamover IP is used for fast data
transfers between the host computer and FPGA, which is useful when prototyping large deep
learning networks that would have long transfer times over JTAG. The ARM processor generates the
read and write transactions to access memory locations in both the onboard memory and deep
learning processor.
The figure below shows the high-level architecture of the Ethernet interface.
This example uses the pretrained series network alexnet. Because AlexNet is a large network, deploying it by using Ethernet instead of JTAG significantly improves the transfer time. To load alexnet, run the command:
snet = alexnet;
analyzeNetwork(snet);
% The saved network contains 25 layers including input, convolution, ReLU, cross channel normalization,
% max pool, fully connected, and the softmax output layers.
To deploy the deep learning network on the target FPGA board, create a dlhdl.Workflow object
that has the pretrained network snet as the network and the bitstream for your target FPGA board.
This example uses the bitstream 'zcu102_single', which has single data type and is configured for
the ZCU102 board. To run this example on a different board, use the bitstream for your board.
hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'zcu102_single');
The output displays the size of the compiled network, which is 300 MB. The entire 300 MB is
transferred to the FPGA by using the deploy method. Due to the large size of the network, the
transfer can take a significant amount of time if using JTAG. When using Ethernet, the transfer
happens quickly.
Before deploying a network, you must first establish a connection to the FPGA board. The
dlhdl.Target object represents this connection between the host computer and the FPGA. Create
two target objects, one for connection through the JTAG interface and one for connection through the
Ethernet interface. To use the JTAG connection, install Xilinx™ Vivado™ Design Suite 2019.2 and set
the path to your installed Xilinx Vivado executable if it is not already set up.
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.b
hTargetJTAG = dlhdl.Target('Xilinx', 'Interface', 'JTAG')
hTargetJTAG =
Target with properties:
Vendor: 'Xilinx'
Interface: JTAG
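Create the Ethernet target object in the same way. A minimal sketch (the IP address shown in the output reflects this example's board configuration):
hTargetEthernet = dlhdl.Target('Xilinx', 'Interface', 'Ethernet')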
hTargetEthernet =
Target with properties:
Vendor: 'Xilinx'
Interface: Ethernet
IPAddress: '192.168.1.100'
Username: 'root'
Port: 22
To deploy the network, assign the target object to the dlhdl.Workflow object and execute the
deploy method. The deployment happens in two stages. First, the bitstream is programmed onto the
FPGA. Then, the network is transferred to the onboard memory.
Select the JTAG interface and time the operation. This operation might take several minutes.
hW.Target = hTargetJTAG;
tic;
hW.deploy;
elapsedTimeJTAG = toc
elapsedTimeJTAG = 1.0614e+03
Use the Ethernet interface by setting the dlhdl.Workflow target object to hTargetEthernet and
running the deploy function. There is a significant acceleration in the network deployment when
you use Ethernet to deploy the bitstream and network to the FPGA.
hW.Target = hTargetEthernet;
tic;
hW.deploy;
Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot for persistent changes to take effect.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to FC Processor.
### 8% finished, current time is 29-Jun-2020 16:47:08.
### 17% finished, current time is 29-Jun-2020 16:47:08.
### 25% finished, current time is 29-Jun-2020 16:47:09.
### 33% finished, current time is 29-Jun-2020 16:47:10.
### 42% finished, current time is 29-Jun-2020 16:47:10.
### 50% finished, current time is 29-Jun-2020 16:47:11.
### 58% finished, current time is 29-Jun-2020 16:47:13.
### 67% finished, current time is 29-Jun-2020 16:47:13.
### 75% finished, current time is 29-Jun-2020 16:47:15.
### 83% finished, current time is 29-Jun-2020 16:47:16.
### 92% finished, current time is 29-Jun-2020 16:47:18.
### FC Weights loaded. Current time is 29-Jun-2020 16:47:18
elapsedTimeEthernet = toc
elapsedTimeEthernet = 47.5854
When you change from JTAG to Ethernet, the deploy function reprograms the bitstream, which accounts for most of the elapsed time. Reprogramming is required because the two hardware interfaces use different methods to program the bitstream. The Ethernet interface configures the ARM processor and uses a persistent programming method, so the bitstream is reprogrammed automatically each time the board is turned on. When you deploy different deep learning networks by using the same bitstream and hardware interface, you can skip the bitstream programming, which further speeds up network deployment.
imgFile = 'zebra.JPEG';
inputImg = imresize(imread(imgFile), [227,227]);
imshow(inputImg)
prediction = hW.predict(single(inputImg));
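The class lookup that produces result is not shown in this excerpt. A minimal sketch, assuming the softmax scores returned by predict:
[val, idx] = max(prediction);
result = snet.Layers(end).ClassNames{idx}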
result =
'zebra'
release(hTargetJTAG)
release(hTargetEthernet)
Create Series Network for Quantization
AlexNet has been trained on over a million images and can classify images into 1000 object
categories (such as keyboard, coffee mug, pencil, and many animals). The network has learned rich
feature representations for a wide range of images. The network takes an image as input and outputs
a label for the object in the image together with the probabilities for each of the object categories.
Transfer learning is commonly used in deep learning applications. You can take a pretrained network
and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is
usually much faster and easier than training a network with randomly initialized weights from
scratch. You can quickly transfer learned features to a new task using a smaller number of training
images.
Unzip and load the new images as an image datastore. imageDatastore automatically labels the
images based on folder names and stores the data as an ImageDatastore object. An image
datastore enables you to store large image data, including data that does not fit in memory, and
efficiently read batches of images during training of a convolutional neural network.
unzip('logos_dataset.zip');
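The datastore creation itself is not shown in this excerpt. A minimal sketch, assuming the images unzip into a logos_dataset folder:
imds = imageDatastore('logos_dataset', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');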
Divide the data into training and validation data sets. Use 70% of the images for training and 30% for
validation. splitEachLabel splits the images datastore into two new datastores.
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
Load the pretrained AlexNet neural network. If Deep Learning Toolbox™ Model for AlexNet Network
is not installed, then the software provides a download link. AlexNet is trained on more than one
million images and can classify images into 1000 object categories, such as keyboard, mouse, pencil,
and many animals. As a result, the model has learned rich feature representations for a wide range of
images.
snet = alexnet;
Use analyzeNetwork to display an interactive visualization of the network architecture and detailed
information about the network layers.
analyzeNetwork(snet)
The first layer, the image input layer, requires input images of size 227-by-227-by-3, where 3 is the
number of color channels.
inputSize = snet.Layers(1).InputSize
inputSize = 1×3
227 227 3
The last three layers of the pretrained network snet are configured for 1000 classes. These three layers must be fine-tuned for the new classification problem. Extract all layers, except the last three, from the pretrained network.
layersTransfer = snet.Layers(1:end-3);
Transfer the layers to the new classification task by replacing the last three layers with a fully
connected layer, a softmax layer, and a classification output layer. Specify the options of the new fully
connected layer according to the new data. Set the fully connected layer to have the same size as the
number of classes in the new data. To learn faster in the new layers than in the transferred layers,
increase the WeightLearnRateFactor and BiasLearnRateFactor values of the fully connected
layer.
numClasses = numel(categories(imdsTrain.Labels))
numClasses = 32
layers = [
layersTransfer
fullyConnectedLayer(numClasses,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20)
softmaxLayer
classificationLayer];
Train Network
The network requires input images of size 227-by-227-by-3, but the images in the image datastores
have different sizes. Use an augmented image datastore to automatically resize the training images.
Specify additional augmentation operations to perform on the training images: randomly flip the
training images along the vertical axis, and randomly translate them up to 30 pixels horizontally and
vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact
details of the training images.
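A minimal sketch of the augmented training datastore, assuming a ±30 pixel translation range as described above:
pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandXTranslation',pixelRange, ...
    'RandYTranslation',pixelRange);
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
    'DataAugmentation',imageAugmenter);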
To automatically resize the validation images without performing further data augmentation, use an
augmented image datastore without specifying any additional preprocessing operations.
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
Specify the training options. For transfer learning, keep the features from the early layers of the
pretrained network (the transferred layer weights). To slow down learning in the transferred layers,
set the initial learning rate to a small value. In the previous step, you increased the learning rate
factors for the fully connected layer to speed up learning in the new final layers. This combination of
learning rate settings results in fast learning only in the new layers and slower learning in the other
layers. When performing transfer learning, you do not need to train for as many epochs. An epoch is a
full training cycle on the entire training data set. Specify the mini-batch size and validation data. The
software validates the network every ValidationFrequency iterations during training.
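A minimal sketch of the training options, with assumed values for the mini-batch size and epoch count:
options = trainingOptions('sgdm', ...
    'MiniBatchSize',10, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'ValidationData',augimdsValidation, ...
    'ValidationFrequency',3, ...
    'Verbose',false, ...
    'Plots','training-progress');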
Train the network that consists of the transferred and new layers. By default, trainNetwork uses a
GPU if one is available (requires Parallel Computing Toolbox™ and a CUDA® enabled GPU with
compute capability 6.1, 6.3, or higher). Otherwise, it uses a CPU. You can also specify the execution
environment by using the 'ExecutionEnvironment' name-value pair argument of
trainingOptions.
netTransfer = trainNetwork(augimdsTrain,layers,options);
Vehicle Detection Using YOLO v2 Deployed to FPGA
Deep learning is a powerful machine learning technique that you can use to train robust object
detectors. Several techniques for object detection exist, including Faster R-CNN and you only look
once (YOLO) v2. This example trains a YOLO v2 vehicle detector using the
trainYOLOv2ObjectDetector function.
Load Dataset
This example uses a small vehicle dataset that contains 295 images. Many of these images come from
the Caltech Cars 1999 and 2001 data sets, available at the Caltech Computational Vision website,
created by Pietro Perona and used with permission. Each image contains one or two labeled instances
of a vehicle. A small dataset is useful for exploring the YOLO v2 training procedure, but in practice,
more labeled images are needed to train a robust detector. Unzip the vehicle images and load the
vehicle ground truth data.
unzip vehicleDatasetImages.zip
data = load('vehicleDatasetGroundTruth.mat');
vehicleDataset = data.vehicleDataset;
The vehicle data is stored in a two-column table, where the first column contains the image file paths
and the second column contains the vehicle bounding boxes.
Split the dataset into training and test sets. Select 60% of the data for training and the rest for
testing the trained detector.
rng(0);
shuffledIndices = randperm(height(vehicleDataset));
idx = floor(0.6 * length(shuffledIndices) );
trainingDataTbl = vehicleDataset(shuffledIndices(1:idx),:);
testDataTbl = vehicleDataset(shuffledIndices(idx+1:end),:);
Use imageDatastore and boxLabelDatastore to create datastores for loading the image and label data during training and evaluation.
imdsTrain = imageDatastore(trainingDataTbl{:,'imageFilename'});
bldsTrain = boxLabelDatastore(trainingDataTbl(:,'vehicle'));
imdsTest = imageDatastore(testDataTbl{:,'imageFilename'});
bldsTest = boxLabelDatastore(testDataTbl(:,'vehicle'));
trainingData = combine(imdsTrain,bldsTrain);
testData = combine(imdsTest,bldsTest);
A YOLO v2 object detection network is composed of two subnetworks: a feature extraction network followed by a detection network. The feature extraction network is typically a pretrained CNN (for details, see Pretrained Deep Neural Networks). This example uses AlexNet for feature extraction. You can also use other pretrained networks such as MobileNet v2 or ResNet-18, depending on application requirements. The detection subnetwork is a small CNN compared to the feature extraction network and is composed of a few convolutional layers and layers specific to YOLO v2.
Use the yolov2Layers function to create a YOLO v2 object detection network automatically given a
pretrained AlexNet feature extraction network. yolov2Layers requires you to specify several inputs
that parameterize a YOLO v2 network:
First, specify the network input size and the number of classes. When choosing the network input
size, consider the minimum size required by the network itself, the size of the training images, and
the computational cost incurred by processing data at the selected size. When feasible, choose a
network input size that is close to the size of the training image and larger than the input size
required for the network. To reduce the computational cost of running the example, specify a network
input size of [224 224 3], which is the minimum size required to run the network.
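For example:
inputSize = [224 224 3];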
numClasses = width(vehicleDataset)-1;
Note that the training images used in this example are bigger than 224-by-224 and vary in size, so
you must resize the images in a preprocessing step prior to training.
Next, use estimateAnchorBoxes to estimate anchor boxes based on the size of objects in the
training data. To account for the resizing of the images prior to training, resize the training data for
estimating anchor boxes. Use transform to preprocess the training data, then define the number of
anchor boxes and estimate the anchor boxes. Resize the training data to the input image size of the
network using the supporting function yolo_preprocessData.
trainingDataForEstimation = transform(trainingData,@(data)yolo_preprocessData(data,inputSize));
numAnchors = 7;
[anchorBoxes, meanIoU] = estimateAnchorBoxes(trainingDataForEstimation, numAnchors)
anchorBoxes = 7×2
145 126
91 86
161 132
41 34
67 64
136 111
33 23
meanIoU = 0.8651
For more information on choosing anchor boxes, see Estimate Anchor Boxes From Training Data (Computer Vision Toolbox) and Anchor Boxes for Object Detection (Computer Vision Toolbox).
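Load the feature extraction network; a minimal sketch, assuming AlexNet as stated above (no trailing semicolon, so the network properties display):
featureExtractionNetwork = alexnet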
featureExtractionNetwork =
SeriesNetwork with properties:
Select 'relu5' as the feature extraction layer to replace the layers after 'relu5' with the detection
subnetwork. This feature extraction layer outputs feature maps that are downsampled by a factor of
16. This amount of downsampling is a good trade-off between spatial resolution and the strength of
the extracted features, as features extracted further down the network encode stronger image
features at the cost of spatial resolution. Choosing the optimal feature extraction layer requires
empirical analysis.
featureLayer = 'relu5';
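With these inputs, create the YOLO v2 network; a minimal sketch, assuming the variables defined above:
lgraph = yolov2Layers(inputSize,numClasses,anchorBoxes,featureExtractionNetwork,featureLayer);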
You can visualize the network using analyzeNetwork or Deep Network Designer from Deep
Learning Toolbox™.
If more control is required over the YOLO v2 network architecture, use Deep Network Designer to
design the YOLO v2 detection network manually. For more information, see Design a YOLO v2
Detection Network (Computer Vision Toolbox).
Data Augmentation
Data augmentation is used to improve network accuracy by randomly transforming the original data
during training. By using data augmentation you can add more variety to the training data without
actually having to increase the number of labeled training samples.
Use transform to augment the training data by randomly flipping the image and associated box
labels horizontally. Note that data augmentation is not applied to the test and validation data. Ideally,
test and validation data should be representative of the original data and is left unmodified for
unbiased evaluation.
augmentedTrainingData = transform(trainingData,@yolo_augmentData);
Preprocess the augmented training data, and the validation data to prepare for training.
preprocessedTrainingData = transform(augmentedTrainingData,@(data)yolo_preprocessData(data,inputSize));
options = trainingOptions('sgdm', ...
    'InitialLearnRate',1e-3, ...
    'MaxEpochs',20,...
    'CheckpointPath', tempdir, ...
    'Shuffle','never');
[detector,info] = trainYOLOv2ObjectDetector(preprocessedTrainingData,lgraph,options);
*************************************************************************
Training a YOLO v2 Object Detector for the following object classes:
* vehicle
As a quick test, run the detector on one test image. Make sure you resize the image to the same size
as the training images.
I = imread(testDataTbl.imageFilename{2});
I = imresize(I,inputSize(1:2));
[bboxes,scores] = detect(detector,I);
I_new = insertObjectAnnotation(I,'rectangle',bboxes,scores);
figure
imshow(I_new)
snet = detector.Network;
I_pre = yolo_pre_proc(I);
analyzeNetwork(snet)
Create a target object for your target device with a vendor name and an interface to connect your
target device to the host computer. Interface options are JTAG (default) and Ethernet. Vendor options
are Intel or Xilinx. Use the installed Xilinx Vivado Design Suite over an Ethernet connection to
program the device.
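A minimal sketch of the target object creation:
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');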
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the saved pretrained series network, snet, as the network. Make sure the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Zynq UltraScale+ MPSoC ZCU102 board. The bitstream uses a single data type.
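A minimal sketch, assuming the target object hTarget created above (no trailing semicolon, so the object properties display as shown):
hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'zcu102_single', 'Target', hTarget)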
hW =
Workflow with properties:
To compile the snet series network, run the compile function of the dlhdl.Workflow object.
dn = hW.compile
Skipping: data
Compiling leg: conv1>>yolov2ClassConv ...
Compiling leg: conv1>>yolov2ClassConv ... complete.
Skipping: yolov2Transform
Skipping: yolov2OutputLayer
Creating Schedule...
......
Creating Schedule...complete.
Creating Status Table...
.....
Creating Status Table...complete.
Emitting Schedule...
.....
Emitting Schedule...complete.
Emitting Status Table...
.......
Emitting Status Table...complete.
To deploy the network on the Zynq® UltraScale+™ MPSoC ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. The function also downloads the network weights and biases. The deploy function checks for the Xilinx Vivado tool and the supported tool version. It then starts programming the FPGA device by using the bitstream, and displays progress messages and the time it takes to deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 20-Dec-2020 15:26:28
Execute the predict function on the dlhdl.Workflow object and display the result:
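The prediction code itself is not shown in this excerpt. A minimal sketch, assuming a support function yolo_post_proc (a hypothetical counterpart to yolo_pre_proc above) that converts the raw FPGA output into bounding boxes:
% Run the preprocessed image through the deployed network and profile it.
[prediction, speed] = hW.predict(I_pre,'Profile','on');
% Convert the raw output to boxes and annotate (yolo_post_proc is assumed).
bboxes = yolo_post_proc(prediction, I_pre, anchorBoxes, {'vehicle'});
I_new = insertObjectAnnotation(I,'rectangle',bboxes,'vehicle');
figure
imshow(I_new)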
Custom Deep Learning Processor Generation to Meet Performance Requirements
snet = getLogoNetwork;
To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more
information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor
configuration, see getModuleProperty and setModuleProperty.
hPC = dlhdl.ProcessorConfig;
hPC.TargetFrequency = 220;
hPC
hPC =
Processing Module "conv"
ConvThreadNumber: 16
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'single'
To estimate the performance of the LogoNet series network, use the estimatePerformance
function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency,
network latency, and network performance in frames per second (Frames/s).
hPC.estimatePerformance(snet)
### Notice: (Layer 1) The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 14) The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Notice: (Layer 1) The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 7) The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
The estimated frames per second is 5.5 Frames/s. To improve the network performance, modify the
custom processor convolution module kernel data type, convolution processor thread number, fully
connected module kernel data type, and fully connected module thread number. For more information
about these processor parameters, see getModuleProperty and setModuleProperty.
Create a modified custom processor configuration that targets these parameters:
hPCNew = dlhdl.ProcessorConfig;
hPCNew.TargetFrequency = 300;
hPCNew.setModuleProperty('conv', 'KernelDataType', 'int8');
hPCNew.setModuleProperty('conv', 'ConvThreadNumber', 64);
hPCNew.setModuleProperty('fc', 'KernelDataType', 'int8');
hPCNew.setModuleProperty('fc', 'FCThreadNumber', 16);
hPCNew
hPCNew =
Processing Module "conv"
ConvThreadNumber: 64
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'int8'
dlquantObj = dlquantizer(snet,'ExecutionEnvironment','FPGA');
Image = imageDatastore('heineken.png','Labels','Heineken');
dlquantObj.calibrate(Image);
To estimate the performance of the quantized LogoNet series network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).
hPCNew.estimatePerformance(dlquantObj)
### Notice: (Layer 1) The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 14) The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Notice: (Layer 1) The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 7) The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
Use the new custom processor configuration to build and generate a custom processor and bitstream.
Use the custom bitstream to deploy the LogoNet network to your target FPGA board.
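As a minimal sketch of this build step (assuming Xilinx Vivado is installed and on the MATLAB tool path; the build can take several hours):
% Build and generate a custom deep learning processor and bitstream
% from the modified processor configuration.
dlhdl.buildProcessor(hPCNew);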
To learn how to use the generated bitstream file, see “Generate Custom Bitstream” on page 9-2.
The generated bitstream in this example is similar to the zcu102_int8 bitstream. To deploy the
quantized LogoNet network using the zcu102_int8 bitstream, see “Obtain Prediction Results for
Quantized LogoNet Network”.
Deploy Quantized Network Example
Required Products
To load the pretrained AlexNet network and analyze the network architecture, enter:
snet = alexnet;
analyzeNetwork(snet);
The first layer, the image input layer, requires input images of size 227-by-227-by-3, where 3 is the
number of color channels.
inputSize = snet.Layers(1).InputSize;
inputSize = 1×3
227 227 3
This example uses the logos_dataset data set. The data set consists of 320 images. Create an
augmentedImageDatastore object to use for training and validation.
curDir = pwd;
newDir = fullfile(matlabroot,'examples','deeplearning_shared','data','logos_dataset.zip');
copyfile(newDir,curDir,'f');
unzip('logos_dataset.zip');
imds = imageDatastore(fullfile(curDir,'logos_dataset'),...
'IncludeSubfolders',true,'FileExtensions','.JPG','LabelSource','foldernames');
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
The last three layers of the pretrained network snet are configured for 1000 classes. These three layers must be fine-tuned for the new classification problem. Extract all the layers, except the last three layers, from the pretrained network.
layersTransfer = snet.Layers(1:end-3);
Transfer the layers to the new classification task by replacing the last three layers with a fully
connected layer, a softmax layer, and a classification output layer. Set the fully connected layer to
have the same size as the number of classes in the new data.
numClasses = numel(categories(imdsTrain.Labels));
numClasses = 32
layers = [
layersTransfer
fullyConnectedLayer(numClasses,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20)
softmaxLayer
classificationLayer];
Train Network
The network requires input images of size 227-by-227-by-3, but the images in the image datastores
have different sizes. Use an augmented image datastore to automatically resize the training images.
Specify additional augmentation operations to perform on the training images, such as randomly
flipping the training images along the vertical axis and randomly translating them up to 30 pixels
horizontally and vertically. Data augmentation helps prevent the network from overfitting and
memorizing the exact details of the training images.
pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
'RandXReflection',true, ...
'RandXTranslation',pixelRange, ...
'RandYTranslation',pixelRange);
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
'DataAugmentation',imageAugmenter);
To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
Specify the training options. For transfer learning, keep the features from the early layers of the
pretrained network (the transferred layer weights). To slow down learning in the transferred layers,
set the initial learning rate to a small value. Specify the mini-batch size and validation data. The
software validates the network every ValidationFrequency iterations during training.
options = trainingOptions('sgdm', ...
'MiniBatchSize',10, ...
'MaxEpochs',6, ...
'InitialLearnRate',1e-4, ...
'Shuffle','every-epoch', ...
'ValidationData',augimdsValidation, ...
'ValidationFrequency',3, ...
'Verbose',false, ...
'Plots','training-progress');
Train the network that consists of the transferred and new layers. By default, trainNetwork uses a
GPU if one is available (requires Parallel Computing Toolbox™ and a supported GPU device. For more
information, see “GPU Support by Release” (Parallel Computing Toolbox)). Otherwise, the network
uses a CPU (requires MATLAB Coder Interface for Deep Learning Libraries™). You can also specify
the execution environment by using the 'ExecutionEnvironment' name-value argument of
trainingOptions.
netTransfer = trainNetwork(augimdsTrain,layers,options);
Create a dlquantizer object and specify the network to quantize. Specify the execution
environment as FPGA.
dlQuantObj = dlquantizer(netTransfer,'ExecutionEnvironment','FPGA');
The dlquantizer object uses calibration data to collect dynamic ranges for the learnable
parameters of the convolution and fully connected layers of the network.
For best quantization results, the calibration data must be representative of the actual inputs to the network. Expedite the calibration process by reducing the calibration data set to 20 images.
imageData = imageDatastore(fullfile(curDir,'logos_dataset'),...
'IncludeSubfolders',true,'FileExtensions','.JPG','LabelSource','foldernames');
imageData_reduced = imageData.subset(1:20);
dlQuantObj.calibrate(imageData_reduced)
Create a target object with a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2020.1, and set the Xilinx Vivado tool path by using the hdlsetuptoolpath function.
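As a minimal sketch of setting the tool path (the installation path shown is hypothetical and depends on your machine):
% Point MATLAB at the Vivado executable before programming the board.
hdlsetuptoolpath('ToolName','Xilinx Vivado', ...
'ToolPath','C:\Xilinx\Vivado\2020.1\bin\vivado.bat');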
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Create an object of the dlhdl.Workflow class. When you create the object, specify the dlquantizer object, the bitstream name, and the target information. Specify dlQuantObj as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses the int8 data type.
hW = dlhdl.Workflow('Network',dlQuantObj,'Bitstream','zcu102_int8','Target',hTarget);
To compile the quantized AlexNet series network, run the compile function of the dlhdl.Workflow
object.
dn = hW.compile
Skipping: data
Compiling leg: conv1>>pool5 ...
Compiling leg: conv1>>pool5 ... complete.
Compiling leg: fc6>>fc ...
Compiling leg: fc6>>fc ... complete.
Skipping: softmax
Skipping: classoutput
Creating Schedule...
.........
Creating Schedule...complete.
Creating Status Table...
........
Creating Status Table...complete.
Emitting Schedule...
......
Emitting Schedule...complete.
Emitting Status Table...
..........
Emitting Status Table...complete.
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 17-Dec-2020 11:06:56
### Loading weights to FC Processor.
### 33% finished, current time is 17-Dec-2020 11:06:57.
### 67% finished, current time is 17-Dec-2020 11:06:59.
### FC Weights loaded. Current time is 17-Dec-2020 11:06:59
To load example images from the validation set, execute the predict function of the dlhdl.Workflow object, and then display the FPGA prediction results, enter:
idx = randperm(numel(imdsValidation.Files),4);
figure
for i = 1:4
subplot(2,2,i)
I = readimage(imdsValidation,idx(i));
imshow(I)
[prediction, speed] = hW.predict(single(I),'Profile','on');
[val, index] = max(prediction);
netTransfer.Layers(end).ClassNames{index}
label = netTransfer.Layers(end).ClassNames{index}
title(string(label));
end
The profiler output (abridged) shows the per-layer latency, for example:
fc 117059 0.00053
* The clock frequency of the DL processor is: 220MHz
ans =
'ford'
label =
'ford'
ans =
'bmw'
label =
'bmw'
ans =
'aldi'
label =
'aldi'
ans =
'corona'
label =
'corona'
See Also
• dlquantizer
• calibrate
• dlhdl.Workflow
• “Quantization of Deep Neural Networks” on page 11-2
• “Prototype Deep Learning Networks on FPGA and SoCs Workflow” on page 5-2
Quantize Network for FPGA Deployment
To load the pretrained LogoNet network and analyze the network architecture, enter:
snet = getLogoNetwork;
analyzeNetwork(snet);
This example uses the logos_dataset data set. The data set consists of 320 images. Each image is
227-by-227 in size and has three color channels (RGB). Create an imageDatastore object to use for calibration and validation. Expedite calibration and validation by reducing the
calibration data set to 20 images. The MATLAB simulation workflow has a maximum limit of five
images when validating the quantized network. Reduce the validation data set sizes to five images.
The FPGA validation workflow has a maximum limit of one image when validating the quantized
network. Reduce the FPGA validation data set to a single image.
curDir = pwd;
newDir = fullfile(matlabroot,'examples','deeplearning_shared','data','logos_dataset.zip');
copyfile(newDir,curDir,'f');
unzip('logos_dataset.zip');
imageData = imageDatastore(fullfile(curDir,'logos_dataset'),...
'IncludeSubfolders',true,'FileExtensions','.JPG','LabelSource','foldernames');
[calibrationData, validationData] = splitEachLabel(imageData, 0.5,'randomized');
calibrationData_reduced = calibrationData.subset(1:20);
validationData_simulation = validationData.subset(1:5);
validationData_FPGA = validationData.subset(1:1);
Create two dlquantizer objects and specify the network to quantize. Specify the execution environment as FPGA for both objects. Turn on MATLAB simulation for the first dlquantizer object by setting the Simulation property to 'on'.
dlQuantObj_simulation = dlquantizer(snet,'ExecutionEnvironment',"FPGA",'Simulation','on');
dlQuantObj_FPGA = dlquantizer(snet,'ExecutionEnvironment',"FPGA");
Use the calibrate function to exercise the network with sample inputs and collect the range
information. The calibrate function exercises the network and collects the dynamic ranges of the
weights and biases. The calibrate function returns a table. Each row of the table contains range
information for a learnable parameter of the quantized network.
dlQuantObj_simulation.calibrate(calibrationData_reduced)
ans=35×5 table
Optimized Layer Name Network Layer Name Learnables / Activations MinValue
____________________________ __________________ ________________________ ___________
dlQuantObj_FPGA.calibrate(calibrationData_reduced)
ans=35×5 table
Optimized Layer Name Network Layer Name Learnables / Activations MinValue
____________________________ __________________ ________________________ ___________
Create a target object with a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2020.1, and set the Xilinx Vivado tool path by using the hdlsetuptoolpath function.
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Create a dlquantizationOptions object. Specify the target bitstream and target board interface.
The default metric function is a Top-1 accuracy metric function.
options_FPGA = dlquantizationOptions('Bitstream','zcu102_int8','Target',hTarget);
options_simulation = dlquantizationOptions;
To use a custom metric function, specify the metric function in the dlquantizationOptions
object.
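As a minimal sketch of this option, assuming a user-defined helper function hTop1Accuracy exists on the MATLAB path (the helper name and its signature are assumptions; the MetricFcn property takes a cell array of function handles):
% Supply a custom metric function that the validate function calls
% to score the floating-point and quantized networks.
options_custom = dlquantizationOptions( ...
'MetricFcn',{@(x)hTop1Accuracy(x,snet,validationData_simulation)}, ...
'Bitstream','zcu102_int8','Target',hTarget);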
Use the validate function to quantize the learnable parameters in the convolution layers of the
network. The validate function simulates the quantized network in MATLAB. The validate
function uses the metric function defined in the dlquantizationOptions object to compare the
results of the single data type network object to the results of the quantized network object.
prediction_simulation = dlQuantObj_simulation.validate(validationData_simulation,options_simulation)
### Notice: (Layer 1) The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 2) The layer 'out_imageinput' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
Compiling leg: conv_1>>maxpool_4 ...
### Notice: (Layer 1) The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 14) The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
Compiling leg: conv_1>>maxpool_4 ... complete.
Compiling leg: fc_1>>fc_3 ...
### Notice: (Layer 1) The layer 'maxpool_4' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 7) The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
Compiling leg: fc_1>>fc_3 ... complete.
### Should not enter here. It means a component is unaccounted for in MATLAB Emulation.
### Notice: (Layer 1) The layer 'fc_3' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: (Layer 2) The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: (Layer 3) The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
For FPGA-based validation, the validate function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The validate function uses the metric function defined in the dlquantizationOptions object to compare the results of the network before and after quantization.
prediction_FPGA = dlQuantObj_FPGA.validate(validationData_FPGA,options_FPGA)
Skipping: imageinput
Compiling leg: conv_1>>maxpool_4 ...
Compiling leg: conv_1>>maxpool_4 ... complete.
Compiling leg: fc_1>>fc_3 ...
Compiling leg: fc_1>>fc_3 ... complete.
Skipping: softmax
Skipping: classoutput
Creating Schedule...
.........
Creating Schedule...complete.
Creating Status Table...
........
Creating Status Table...complete.
Emitting Schedule...
......
Emitting Schedule...complete.
Emitting Status Table...
..........
Emitting Status Table...complete.
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Deep learning network programming has been skipped as the same network is already loaded on the target FPGA.
### Finished writing input activations.
### Running single input activations.
Examine the MetricResults.Result field of the validation output to see the performance of the
quantized network.
prediction_simulation.MetricResults.Result
ans=2×2 table
NetworkImplementation MetricOutput
_____________________ ____________
{'Floating-Point'} 1
{'Quantized' } 1
prediction_FPGA.MetricResults.Result
ans=2×2 table
NetworkImplementation MetricOutput
_____________________ ____________
{'Floating-Point'} 1
{'Quantized' } 1
Examine the QuantizedNetworkFPS field of the validation output to see the frames per second
performance of the quantized network.
prediction_FPGA.QuantizedNetworkFPS
ans = 17.2915
See Also
• dlquantizer
• calibrate
• validate
• dlquantizationOptions
• “Quantization of Deep Neural Networks” on page 11-2
Evaluate Performance of Deep Learning Network on Custom Processor Configuration
In this example, compare the performance of the ResNet-18 network on the zcu102_single bitstream configuration to the performance on the default custom bitstream configuration.
Prerequisites
• Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
• Deep Learning Toolbox™
• Deep Learning HDL Toolbox™
• Deep Learning Toolbox Model for ResNet-18 Network
Load the pretrained ResNet-18 network.
snet = resnet18;
hPC_shipping = dlhdl.ProcessorConfig('Bitstream',"zcu102_single")
hPC_shipping =
Processing Module "conv"
ConvThreadNumber: 16
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'single'
To estimate the performance of the ResNet-18 DAG network, use the estimatePerformance
function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency,
network latency, and network performance in frames per second (Frames/s).
hPC_shipping.estimatePerformance(snet)
To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more
information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor
configuration, see getModuleProperty and setModuleProperty.
hPC_custom = dlhdl.ProcessorConfig
hPC_custom =
Processing Module "conv"
ConvThreadNumber: 16
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'single'
To estimate the performance of the ResNet-18 DAG network, use the estimatePerformance
function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency,
network latency, and network performance in frames per second (Frames/s).
hPC_custom.estimatePerformance(snet)
The performance of the ResNet-18 network on the custom bitstream configuration is lower than the
performance on the zcu102_single bitstream configuration. The difference between the custom
bitstream configuration and the zcu102_single bitstream configuration is the target frequency.
Modify the custom processor configuration to increase the target frequency. To learn about
modifiable parameters of the processor configuration, see dlhdl.ProcessorConfig.
hPC_custom.TargetFrequency = 220;
hPC_custom
hPC_custom =
Processing Module "conv"
ConvThreadNumber: 16
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'single'
Estimate the performance of the ResNet-18 DAG network on the modified custom bitstream
configuration.
hPC_custom.estimatePerformance(snet)
Customize Bitstream Configuration to Meet Resource Use Requirements
The reference (shipping) zcu102_int8 bitstream configuration is for a Xilinx ZCU102 ZU9EG device. The default board resource counts exceed the user resource budget and are on the higher end of the cost spectrum. You can meet the target performance and resource use budget by quantizing the target deep learning network and customizing the default bitstream configuration.
In this example, create a custom bitstream configuration to match your resource budget and performance requirements.
Prerequisites
• Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
• Deep Learning Toolbox™
• Deep Learning HDL Toolbox™
• Deep Learning Toolbox Model Quantization Library
To load the pretrained series network that has been trained on the Modified National Institute of Standards and Technology (MNIST) database, enter:
snet = getDigitsNetwork;
Quantize Network
dlquantObj = dlquantizer(snet,'ExecutionEnvironment','FPGA');
Image = imageDatastore('five_28x28.pgm','Labels','five');
dlquantObj.calibrate(Image)
ans=21×5 table
Optimized Layer Name Network Layer Name Learnables / Activations MinValue
____________________________ __________________ ________________________ _________
hPC_reference = dlhdl.ProcessorConfig('Bitstream','zcu102_int8')
hPC_reference =
Processing Module "conv"
ConvThreadNumber: 64
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
KernelDataType: 'int8'
To estimate the performance of the digits series network, use the estimatePerformance function of
the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network
latency, and network performance in frames per second (Frames/s).
To estimate the resource use of the zcu102_int8 bitstream, use the estimateResources
function of the dlhdl.ProcessorConfig object. The function returns the estimated DSP slice and
BRAM usage.
hPC_reference.estimatePerformance(dlquantObj)
hPC_reference.estimateResources
The estimated performance is 4314 FPS. The estimated DSP slice count and BRAM count exceed the target device resource budget. Customize the bitstream configuration to reduce resource use.
To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more
information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor
configuration, see getModuleProperty and setModuleProperty.
To reduce the resource use for the custom bitstream, modify the KernelDataType for the conv,
fc, and adder modules. Modify the ConvThreadNumber to reduce DSP slice count. Reduce the
InputMemorySize and OutputMemorySize for the conv module to reduce BRAM count.
hPC_custom = dlhdl.ProcessorConfig;
hPC_custom.setModuleProperty('conv','KernelDataType','int8');
hPC_custom.setModuleProperty('fc','KernelDataType','int8');
hPC_custom.setModuleProperty('adder','KernelDataType','int8');
hPC_custom.setModuleProperty('conv','ConvThreadNumber',4);
hPC_custom.setModuleProperty('conv','InputMemorySize',[30 30 1]);
hPC_custom.setModuleProperty('conv','OutputMemorySize',[30 30 1]);
hPC_custom
hPC_custom =
Processing Module "conv"
ConvThreadNumber: 4
InputMemorySize: [30 30 1]
OutputMemorySize: [30 30 1]
FeatureSizeLimit: 2048
KernelDataType: 'int8'
To estimate the performance of the digits series network, use the estimatePerformance function of
the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network
latency, and network performance in frames per second (Frames/s).
To estimate the resource use of the hPC_custom bitstream, use the estimateResources
function of the dlhdl.ProcessorConfig object. The function returns the estimated DSP slice and
BRAM usage.
hPC_custom.estimatePerformance(dlquantObj)
hPC_custom.estimateResources
The estimated performance is 574 FPS. The estimated resources of the customized bitstream now fit within the user target device resource budget, and the estimated performance meets the target network performance.
Vehicle Detection Using DAG Network Based YOLO v2 Deployed to FPGA
Deep learning is a powerful machine learning technique that you can use to train robust object
detectors. Several techniques for object detection exist, including Faster R-CNN and you only look
once (YOLO) v2. This example trains a YOLO v2 vehicle detector using the
trainYOLOv2ObjectDetector function.
Load Dataset
This example uses a small vehicle dataset that contains 295 images. Many of these images come from
the Caltech Cars 1999 and 2001 data sets, available at the Caltech Computational Vision website,
created by Pietro Perona and used with permission. Each image contains one or two labeled instances
of a vehicle. A small dataset is useful for exploring the YOLO v2 training procedure, but in practice,
more labeled images are needed to train a robust detector. Unzip the vehicle images and load the
vehicle ground truth data.
unzip vehicleDatasetImages.zip
data = load('vehicleDatasetGroundTruth.mat');
vehicleDataset = data.vehicleDataset;
The vehicle data is stored in a two-column table, where the first column contains the image file paths
and the second column contains the vehicle bounding boxes.
% Add the fullpath to the local vehicle data folder.
vehicleDataset.imageFilename = fullfile(pwd,vehicleDataset.imageFilename);
Split the dataset into training and test sets. Select 60% of the data for training and the rest for
testing the trained detector.
rng(0);
shuffledIndices = randperm(height(vehicleDataset));
idx = floor(0.6 * length(shuffledIndices) );
trainingDataTbl = vehicleDataset(shuffledIndices(1:idx),:);
testDataTbl = vehicleDataset(shuffledIndices(idx+1:end),:);
Use imageDatastore and boxLabelDatastore to create datastores for loading the image and label data during training and evaluation.
imdsTrain = imageDatastore(trainingDataTbl{:,'imageFilename'});
bldsTrain = boxLabelDatastore(trainingDataTbl(:,'vehicle'));
imdsTest = imageDatastore(testDataTbl{:,'imageFilename'});
bldsTest = boxLabelDatastore(testDataTbl(:,'vehicle'));
A YOLO v2 object detection network is composed of two subnetworks: a feature extraction network followed by a detection network. The feature extraction network is typically a pretrained CNN (for
details, see Pretrained Deep Neural Networks). This example uses ResNet-18 for feature extraction.
You can also use other pretrained networks such as MobileNet v2 or ResNet-50 depending on
application requirements. The detection sub-network is a small CNN compared to the feature
extraction network and is composed of a few convolutional layers and layers specific for YOLO v2.
Use the yolov2Layers function to create a YOLO v2 object detection network automatically given a
pretrained ResNet-18 feature extraction network. yolov2Layers requires you to specify several
inputs that parameterize a YOLO v2 network:
First, specify the network input size and the number of classes. When choosing the network input
size, consider the minimum size required by the network itself, the size of the training images, and
the computational cost incurred by processing data at the selected size. When feasible, choose a
network input size that is close to the size of the training image and larger than the input size
required for the network. To reduce the computational cost of running the example, specify a network
input size of [224 224 3], which is the minimum size required to run the network.
inputSize = [224 224 3];
numClasses = width(vehicleDataset)-1;
Note that the training images used in this example are bigger than 224-by-224 and vary in size, so
you must resize the images in a preprocessing step prior to training.
Next, use estimateAnchorBoxes to estimate anchor boxes based on the size of objects in the
training data. To account for the resizing of the images prior to training, resize the training data for
estimating anchor boxes. Use transform to preprocess the training data, then define the number of
anchor boxes and estimate the anchor boxes. Resize the training data to the input image size of the
network using the supporting function yolo_preprocessData.
trainingData = combine(imdsTrain,bldsTrain);
trainingDataForEstimation = transform(trainingData,@(data)yolo_preprocessData(data,inputSize));
numAnchors = 7;
[anchorBoxes, meanIoU] = estimateAnchorBoxes(trainingDataForEstimation, numAnchors)
anchorBoxes = 7×2
145 126
91 86
161 132
41 34
67 64
136 111
33 23
meanIoU = 0.8651
For more information on choosing anchor boxes, see Estimate Anchor Boxes From Training Data (Computer Vision Toolbox™) and Anchor Boxes for Object Detection (Computer Vision Toolbox™).
Load a pretrained ResNet-18 network to use as the feature extraction network.
featureExtractionNetwork = resnet18;
Select 'res4b_relu' as the feature extraction layer to replace the layers after 'res4b_relu' with
the detection subnetwork. This feature extraction layer outputs feature maps that are downsampled
by a factor of 16. This amount of downsampling is a good trade-off between spatial resolution and the
strength of the extracted features, as features extracted further down the network encode stronger
image features at the cost of spatial resolution. Choosing the optimal feature extraction layer
requires empirical analysis.
featureLayer = 'res4b_relu';
lgraph = yolov2Layers(inputSize,numClasses,anchorBoxes,featureExtractionNetwork,featureLayer);
You can visualize the network using analyzeNetwork or Deep Network Designer from Deep
Learning Toolbox™.
If more control is required over the YOLO v2 network architecture, use Deep Network Designer to
design the YOLO v2 detection network manually. For more information, see Design a YOLO v2
Detection Network (Computer Vision Toolbox).
Data Augmentation
Data augmentation is used to improve network accuracy by randomly transforming the original data
during training. By using data augmentation you can add more variety to the training data without
actually having to increase the number of labeled training samples.
Use transform to augment the training data by randomly flipping the image and associated box
labels horizontally. Note that data augmentation is not applied to the test and validation data. Ideally, test and validation data are representative of the original data and remain unmodified for unbiased evaluation.
augmentedTrainingData = transform(trainingData,@yolo_augmentData);
Preprocess the augmented training data, and the validation data to prepare for training.
preprocessedTrainingData = transform(augmentedTrainingData,@(data)yolo_preprocessData(data,inputSize));
Train the YOLO v2 object detector by using the trainYOLOv2ObjectDetector function with the preprocessed training data, the layer graph, and a set of training options.
[detector,info] = trainYOLOv2ObjectDetector(preprocessedTrainingData,lgraph,options);
*************************************************************************
Training a YOLO v2 Object Detector for the following object classes:
* vehicle
As a quick test, run the detector on one test image. Make sure you resize the image to the same size
as the training images.
I = imread(testDataTbl.imageFilename{2});
I = imresize(I,inputSize(1:2));
[bboxes,scores] = detect(detector,I);
I_new = insertObjectAnnotation(I,'rectangle',bboxes,scores);
figure
imshow(I_new)
snet = detector.Network;
I_pre = yolo_pre_proc(I);
analyzeNetwork(snet)
Create a target object for your target device with a vendor name and an interface to connect your target device to the host computer. Interface options are JTAG (default) and Ethernet. Vendor options are Intel or Xilinx. Use the installed Xilinx Vivado Design Suite over an Ethernet connection to program the device.
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the trained YOLO v2 network, snet, as the network. Make sure the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Zynq UltraScale+ MPSoC ZCU102 board. The bitstream uses a single data type.
hW = dlhdl.Workflow('Network',snet,'Bitstream','zcu102_single','Target',hTarget);
To compile the snet DAG network, run the compile function of the dlhdl.Workflow object.
dn = hW.compile
Skipping: data
Compiling leg: conv1>>pool1 ...
Compiling leg: conv1>>pool1 ... complete.
To deploy the network on the Zynq® UltraScale+™ MPSoC ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. The function also downloads the network weights and biases. The deploy function checks for the Xilinx Vivado tool and the supported tool version. It then starts programming the FPGA device by using the bitstream, and displays progress messages and the time it takes to deploy the network.
hW.deploy
Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 04-Jan-2021 13:59:03
Execute the predict function on the dlhdl.Workflow object and display the result:
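The prediction code and its output were on the original pages. As a minimal sketch of this step, using the preprocessed image I_pre created earlier with yolo_pre_proc:
% Run the deployed network on the FPGA and profile the execution.
[prediction, speed] = hW.predict(I_pre,'Profile','on');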
Image Classification Using DAG Network Deployed to FPGA
Required Products
To load the pretrained ResNet-18 network and analyze the network architecture, enter:
snet = resnet18;
analyzeNetwork(snet);
The first layer, the image input layer, requires input images of size 224-by-224-by-3, where 3 is the number of color channels.
inputSize = snet.Layers(1).InputSize;
This example uses the MathWorks MerchData data set. This is a small data set containing 75 images
of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver,
and torch).
curDir = pwd;
unzip('MerchData.zip');
imds = imageDatastore('MerchData', ...
'IncludeSubfolders',true, ...
'LabelSource','foldernames');
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
The fully connected layer and classification layer of the pretrained network snet are configured for 1000 classes. These two layers, fc1000 and ClassificationLayer_predictions in ResNet-18, contain information on how to combine the features that the network extracts into class probabilities and predicted labels. These two layers must be fine-tuned for the new classification problem. Extract all the layers, except the last two layers, from the pretrained network.
lgraph = layerGraph(snet)
lgraph =
LayerGraph with properties:
numClasses = numel(categories(imdsTrain.Labels))
numClasses = 5
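The layer-replacement code was on the original page. As a minimal sketch of this step, replace the two layers by using replaceLayer (the layer names new_fc and new_classoutput match the compile log later in this example; the learn-rate factors are assumptions):
% Replace the final fully connected layer and the classification
% output layer so the network predicts the five new classes.
newFCLayer = fullyConnectedLayer(numClasses,'Name','new_fc', ...
'WeightLearnRateFactor',10,'BiasLearnRateFactor',10);
lgraph = replaceLayer(lgraph,'fc1000',newFCLayer);
newClassLayer = classificationLayer('Name','new_classoutput');
lgraph = replaceLayer(lgraph,'ClassificationLayer_predictions',newClassLayer);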
Train Network
The network requires input images of size 224-by-224-by-3, but the images in the image datastores
have different sizes. Use an augmented image datastore to automatically resize the training images.
Specify additional augmentation operations to perform on the training images, such as randomly
flipping the training images along the vertical axis and randomly translating them up to 30 pixels
horizontally and vertically. Data augmentation helps prevent the network from overfitting and
memorizing the exact details of the training images.
pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
'RandXReflection',true, ...
'RandXTranslation',pixelRange, ...
'RandYTranslation',pixelRange);
To automatically resize the validation images without performing further data augmentation, use an
augmented image datastore without specifying any additional preprocessing operations.
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
Specify the training options. For transfer learning, keep the features from the early layers of the
pretrained network (the transferred layer weights). To slow down learning in the transferred layers,
set the initial learning rate to a small value. Specify the mini-batch size and validation data. The
software validates the network every ValidationFrequency iterations during training.
options = trainingOptions('sgdm', ...
'MiniBatchSize',10, ...
'MaxEpochs',6, ...
'InitialLearnRate',1e-4, ...
'Shuffle','every-epoch', ...
'ValidationData',augimdsValidation, ...
'ValidationFrequency',3, ...
'Verbose',false, ...
'Plots','training-progress');
Train the network that consists of the transferred and new layers. By default, trainNetwork uses a
GPU if one is available (requires Parallel Computing Toolbox™ and a supported GPU device. For more
information, see “GPU Support by Release” (Parallel Computing Toolbox)). Otherwise, the network
uses a CPU (requires MATLAB Coder Interface for Deep Learning Libraries™). You can also specify
the execution environment by using the 'ExecutionEnvironment' name-value argument of
trainingOptions.
netTransfer = trainNetwork(augimdsTrain,lgraph,options);
Use the dlhdl.Target class to create a target object with a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2019.2, and set the Xilinx Vivado tool path by using the hdlsetuptoolpath function.
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Use the dlhdl.Workflow class to create an object. When you create the object, specify the network and the bitstream name. Specify the trained netTransfer DAG network as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board. The bitstream uses a single data type.
hW = dlhdl.Workflow('Network',netTransfer,'Bitstream','zcu102_single','Target',hTarget);
To compile the netTransfer DAG network, run the compile method of the dlhdl.Workflow object.
You can optionally specify the maximum number of input frames.
dn = hW.compile('InputFrameNumberLimit',15)
Skipping: data
Compiling leg: conv1>>pool1 ...
Compiling leg: conv1>>pool1 ... complete.
Compiling leg: res2a_branch2a>>res2a_branch2b ...
Compiling leg: res2a_branch2a>>res2a_branch2b ... complete.
Compiling leg: res2b_branch2a>>res2b_branch2b ...
Compiling leg: res2b_branch2a>>res2b_branch2b ... complete.
Compiling leg: res3a_branch1 ...
Compiling leg: res3a_branch1 ... complete.
Compiling leg: res3a_branch2a>>res3a_branch2b ...
Compiling leg: res3a_branch2a>>res3a_branch2b ... complete.
Compiling leg: res3b_branch2a>>res3b_branch2b ...
Compiling leg: res3b_branch2a>>res3b_branch2b ... complete.
Compiling leg: res4a_branch1 ...
Compiling leg: res4a_branch1 ... complete.
Compiling leg: res4a_branch2a>>res4a_branch2b ...
Compiling leg: res4a_branch2a>>res4a_branch2b ... complete.
Compiling leg: res4b_branch2a>>res4b_branch2b ...
Compiling leg: res4b_branch2a>>res4b_branch2b ... complete.
Compiling leg: res5a_branch1 ...
Compiling leg: res5a_branch1 ... complete.
Compiling leg: res5a_branch2a>>res5a_branch2b ...
Compiling leg: res5a_branch2a>>res5a_branch2b ... complete.
Compiling leg: res5b_branch2a>>res5b_branch2b ...
Compiling leg: res5b_branch2a>>res5b_branch2b ... complete.
Compiling leg: pool5 ...
Compiling leg: pool5 ... complete.
Compiling leg: new_fc ...
Compiling leg: new_fc ... complete.
Skipping: prob
Skipping: new_classoutput
Creating Schedule...
...........................
Creating Schedule...complete.
Creating Status Table...
..........................
Creating Status Table...complete.
Emitting Schedule...
..........................
Emitting Schedule...complete.
Emitting Status Table...
............................
Emitting Status Table...complete.
To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Deep learning network programming has been skipped as the same network is already loaded on the target FPGA.
Execute the predict method on the dlhdl.Workflow object and then show the label in the MATLAB
command window.
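The code that creates inputImg was on the original page. As a minimal sketch, assuming a validation image resized to the network input size:
% Read one validation image and resize it to the 224-by-224 network
% input size before prediction.
I = readimage(imdsValidation,1);
inputImg = imresize(I,inputSize(1:2));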
[prediction, speed] = hW.predict(single(inputImg),'Profile','on');
ans =
'MathWorks Cube'
Classify Images on an FPGA Using a Quantized DAG Network
ResNet-18 has been trained on over a million images and can classify images into 1000 object
categories (such as keyboard, coffee mug, pencil, and many animals). The network has learned rich
feature representations for a wide range of images. The network takes an image as input and outputs
a label for the object in the image together with the probabilities for each of the object categories.
Required Products
To perform classification on a new set of images, you fine-tune a pretrained ResNet-18 convolutional
neural network by transfer learning. In transfer learning, you can take a pretrained network and use
it as a starting point to learn a new task. Fine-tuning a network with transfer learning is usually much
faster and easier than training a network with randomly initialized weights from scratch. You can
quickly transfer learned features to a new task using a smaller number of training images.
snet = resnet18;
analyzeNetwork(snet);
The first layer, the image input layer, requires input images of size 224-by-224-by-3, where 3 is the number of color channels.
inputSize = snet.Layers(1).InputSize;
This example uses the MathWorks MerchData data set. This is a small data set containing 75 images
of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver,
and torch).
curDir = pwd;
unzip('MerchData.zip');
imds = imageDatastore('MerchData', ...
'IncludeSubfolders',true, ...
'LabelSource','foldernames');
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
The fully connected layer and classification layer of the pretrained network snet are configured for 1000 classes. These two layers, fc1000 and ClassificationLayer_predictions in ResNet-18, contain information on how to combine the features that the network extracts into class probabilities and predicted labels. These two layers must be fine-tuned for the new classification problem. Extract all the layers, except the last two layers, from the pretrained network.
lgraph = layerGraph(snet)
lgraph =
LayerGraph with properties:
numClasses = numel(categories(imdsTrain.Labels))
numClasses = 5
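As in the previous example, the layer-replacement code was on the original page; a minimal sketch using replaceLayer (the layer names new_fc and new_classoutput match the compile log below; the learn-rate factors are assumptions):
% Replace the final fully connected layer and the classification
% output layer for the five new classes.
newFCLayer = fullyConnectedLayer(numClasses,'Name','new_fc', ...
'WeightLearnRateFactor',10,'BiasLearnRateFactor',10);
lgraph = replaceLayer(lgraph,'fc1000',newFCLayer);
newClassLayer = classificationLayer('Name','new_classoutput');
lgraph = replaceLayer(lgraph,'ClassificationLayer_predictions',newClassLayer);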
Train Network
The network requires input images of size 224-by-224-by-3, but the images in the image datastores
have different sizes. Use an augmented image datastore to automatically resize the training images.
Specify additional augmentation operations to perform on the training images, such as randomly
flipping the training images along the vertical axis and randomly translating them up to 30 pixels
horizontally and vertically. Data augmentation helps prevent the network from overfitting and
memorizing the exact details of the training images.
pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
'RandXReflection',true, ...
'RandXTranslation',pixelRange, ...
'RandYTranslation',pixelRange);
To automatically resize the validation images without performing further data augmentation, use an
augmented image datastore without specifying any additional preprocessing operations.
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
Specify the training options. For transfer learning, keep the features from the early layers of the
pretrained network (the transferred layer weights). To slow down learning in the transferred layers,
set the initial learning rate to a small value. Specify the mini-batch size and validation data. The
software validates the network every ValidationFrequency iterations during training.
options = trainingOptions('sgdm', ...
'MiniBatchSize',10, ...
'MaxEpochs',6, ...
'InitialLearnRate',1e-4, ...
'Shuffle','every-epoch', ...
'ValidationData',augimdsValidation, ...
'ValidationFrequency',3, ...
'Verbose',false, ...
'Plots','training-progress');
Train the network that consists of the transferred and new layers. By default, trainNetwork uses a
GPU if one is available (requires Parallel Computing Toolbox™ and a supported GPU device. For more
information, see “GPU Support by Release” (Parallel Computing Toolbox)). Otherwise, the network
uses a CPU (requires MATLAB Coder Interface for Deep Learning Libraries™). You can also specify
the execution environment by using the 'ExecutionEnvironment' name-value argument of
trainingOptions.
netTransfer = trainNetwork(augimdsTrain,lgraph,options);
Create a dlquantizer object and specify the network to quantize. Specify the execution environment as FPGA.
dlquantObj = dlquantizer(netTransfer,'ExecutionEnvironment','FPGA');
Use the calibrate function to exercise the network with sample inputs and collect the range information. The calibrate function exercises the network and collects the dynamic ranges of the weights and biases in the convolution and fully connected layers of the network and the dynamic ranges of the activations in all layers of the network. The calibrate function returns a table. Each row of the table contains range information for a learnable parameter of the quantized network.
dlquantObj.calibrate(augimdsTrain)
ans=95×5 table
Optimized Layer Name Network Layer Name Learnables / Activations MinValue M
__________________________ __________________ ________________________ ________ _
Use the dlhdl.Target class to create a target object with a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2019.2. To set the Xilinx Vivado tool path, enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Use the dlhdl.Workflow class to create an object. When you create the object, specify the network and the bitstream name. Specify the dlquantizer object, dlquantObj, as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board. The bitstream uses the int8 data type.
hW = dlhdl.Workflow('Network', dlquantObj, 'Bitstream', 'zcu102_int8','Target',hTarget);
To compile the netTransfer DAG network, run the compile method of the dlhdl.Workflow object.
You can optionally specify the maximum number of input frames.
dn = hW.compile('InputFrameNumberLimit',15)
Skipping: data
Compiling leg: conv1>>pool1 ...
Compiling leg: conv1>>pool1 ... complete.
Compiling leg: res2a_branch2a>>res2a_branch2b ...
Compiling leg: res2a_branch2a>>res2a_branch2b ... complete.
Compiling leg: res2b_branch2a>>res2b_branch2b ...
Compiling leg: res2b_branch2a>>res2b_branch2b ... complete.
Compiling leg: res3a_branch1 ...
Compiling leg: res3a_branch1 ... complete.
Compiling leg: res3a_branch2a>>res3a_branch2b ...
Compiling leg: res3a_branch2a>>res3a_branch2b ... complete.
Compiling leg: res3b_branch2a>>res3b_branch2b ...
Compiling leg: res3b_branch2a>>res3b_branch2b ... complete.
Compiling leg: res4a_branch1 ...
Compiling leg: res4a_branch1 ... complete.
Compiling leg: res4a_branch2a>>res4a_branch2b ...
Compiling leg: res4a_branch2a>>res4a_branch2b ... complete.
Compiling leg: res4b_branch2a>>res4b_branch2b ...
Compiling leg: res4b_branch2a>>res4b_branch2b ... complete.
Compiling leg: res5a_branch1 ...
Compiling leg: res5a_branch1 ... complete.
Compiling leg: res5a_branch2a>>res5a_branch2b ...
Compiling leg: res5a_branch2a>>res5a_branch2b ... complete.
Compiling leg: res5b_branch2a>>res5b_branch2b ...
Compiling leg: res5b_branch2a>>res5b_branch2b ... complete.
Compiling leg: pool5 ...
Compiling leg: pool5 ... complete.
Compiling leg: new_fc ...
Compiling leg: new_fc ... complete.
Skipping: prob
Skipping: new_classoutput
Creating Schedule...
.............................
Creating Schedule...complete.
Creating Status Table...
............................
Creating Status Table...complete.
Emitting Schedule...
..........................
Emitting Schedule...complete.
Emitting Status Table...
..............................
Emitting Status Table...complete.
To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and the time it takes to
deploy the network.
hW.deploy
Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 11-Jan-2021 11:26:16
Execute the predict method on the dlhdl.Workflow object and then show the label in the MATLAB
command window.
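The prediction code was on the original page. As a minimal sketch, assuming inputImg is a validation image resized to the network input size, as in the previous example:
% Run the quantized network on the FPGA, profile the execution, and
% look up the predicted class name.
[prediction, speed] = hW.predict(single(inputImg),'Profile','on');
[val, idx] = max(prediction);
netTransfer.Layers(end).ClassNames{idx}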
ans =
'MathWorks Cube'
Classify ECG Signals Using DAG Network Deployed To FPGA
Training a deep CNN from scratch is computationally expensive and requires a large amount of
training data. In various applications, a sufficient amount of training data is not available, and
synthesizing new realistic training examples is not feasible. In these cases, leveraging existing neural
networks that have been trained on large data sets for conceptually similar tasks is desirable. This
leveraging of existing neural networks is called transfer learning. In this example we adapt two deep
CNNs, GoogLeNet and SqueezeNet, pretrained for image recognition to classify ECG waveforms
based on a time-frequency representation.
GoogLeNet and SqueezeNet are deep CNNs originally designed to classify images in 1000 categories.
We reuse the network architecture of the CNN to classify ECG signals based on images from the CWT
of the time series data. The data used in this example are publicly available from PhysioNet.
Data Description
In this example, you use ECG data obtained from three groups of people: persons with cardiac
arrhythmia (ARR), persons with congestive heart failure (CHF), and persons with normal sinus
rhythms (NSR). In total you use 162 ECG recordings from three PhysioNet databases: MIT-BIH
Arrhythmia Database [3][7], MIT-BIH Normal Sinus Rhythm Database [3], and The BIDMC Congestive
Heart Failure Database [1][3]. More specifically, 96 recordings from persons with arrhythmia, 30
recordings from persons with congestive heart failure, and 36 recordings from persons with normal
sinus rhythms. The goal is to train a classifier to distinguish between ARR, CHF, and NSR.
Download Data
The first step is to download the data from the GitHub repository. To download the data from the
website, click Clone or download and select Download ZIP. Save the file
physionet_ECG_data-master.zip in a folder where you have write permission. The instructions
for this example assume you have downloaded the file to your temporary directory, tempdir, in
MATLAB. Modify the subsequent instructions for unzipping and loading the data if you choose to
download the data to a folder different from tempdir. If you are familiar with Git, you can download
the latest version of the tools (git) and obtain the data from a system command prompt using git
clone https://github.com/mathworks/physionet_ECG_data/.
After downloading the data from GitHub, unzip the file in your temporary directory.
unzip(fullfile(tempdir,'physionet_ECG_data-master.zip'),tempdir)
Unzipping creates the folder physionet_ECG_data-master in your temporary directory. This folder
contains the text file README.md and ECGData.zip. The ECGData.zip file contains
• ECGData.mat
• Modified_physionet_data.txt
• License.txt
ECGData.mat holds the data used in this example. The text file, Modified_physionet_data.txt,
is required by PhysioNet's copying policy and provides the source attributions for the data as well as
a description of the preprocessing steps applied to each ECG recording.
Unzip ECGData.zip in physionet_ECG_data-master. Load the data file into your MATLAB
workspace.
unzip(fullfile(tempdir,'physionet_ECG_data-master','ECGData.zip'),...
fullfile(tempdir,'physionet_ECG_data-master'))
load(fullfile(tempdir,'physionet_ECG_data-master','ECGData.mat'))
ECGData is a structure array with two fields: Data and Labels. The Data field is a 162-by-65536
matrix where each row is an ECG recording sampled at 128 hertz. Labels is a 162-by-1 cell array of
diagnostic labels, one for each row of Data. The three diagnostic categories are: 'ARR', 'CHF', and
'NSR'.
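As a quick sketch, you can confirm this structure after loading the file:
% Data is a 162-by-65536 matrix of ECG recordings; Labels holds one
% diagnostic label ('ARR', 'CHF', or 'NSR') per recording.
size(ECGData.Data)
summary(categorical(ECGData.Labels))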
To store the preprocessed data of each category, first create an ECG data directory dataDir inside
tempdir. Then create three subdirectories in 'data' named after each ECG category. The helper
function helperCreateECGDirectories does this. helperCreateECGDirectories accepts
ECGData, the name of an ECG data directory, and the name of a parent directory as input arguments.
You can replace tempdir with another directory where you have write permission. You can find the
source code for this helper function in the Supporting Functions section at the end of this example.
%parentDir = tempdir;
parentDir = pwd;
dataDir = 'data';
helperCreateECGDirectories(ECGData,parentDir,dataDir)
Plot a representative of each ECG category. The helper function helperPlotReps does this.
helperPlotReps accepts ECGData as input. You can find the source code for this helper function in
the Supporting Functions section at the end of this example.
helperPlotReps(ECGData)
After making the folders, create time-frequency representations of the ECG signals. These
representations are called scalograms. A scalogram is the absolute value of the CWT coefficients of a
signal.
To create the scalograms, precompute a CWT filter bank. Precomputing the CWT filter bank is the
preferred method when obtaining the CWT of many signals using the same parameters.
Before generating the scalograms, examine one of them. Create a CWT filter bank using
cwtfilterbank (Wavelet Toolbox) for a signal with 1000 samples. Use the filter bank to take the
CWT of the first 1000 samples of the signal and obtain the scalogram from the coefficients.
Fs = 128;
fb = cwtfilterbank('SignalLength',1000,...
'SamplingFrequency',Fs,...
'VoicesPerOctave',12);
sig = ECGData.Data(1,1:1000);
[cfs,frq] = wt(fb,sig);
t = (0:999)/Fs;figure;pcolor(t,frq,abs(cfs))
set(gca,'yscale','log');shading interp;axis tight;
title('Scalogram');xlabel('Time (s)');ylabel('Frequency (Hz)')
Use the helper function helperCreateRGBfromTF to create the scalograms as RGB images and
write them to the appropriate subdirectory in dataDir. The source code for this helper function is in
the Supporting Functions section at the end of this example. To be compatible with the GoogLeNet
architecture, each RGB image is an array of size 224-by-224-by-3.
helperCreateRGBfromTF(ECGData,parentDir,dataDir)
Load the scalogram images as an image datastore. The imageDatastore function automatically
labels the images based on folder names and stores the data as an ImageDatastore object. An image
datastore enables you to store large image data, including data that does not fit in memory, and
efficiently read batches of images during training of a CNN.
allImages = imageDatastore(fullfile(parentDir,dataDir),...
'IncludeSubfolders',true,...
'LabelSource','foldernames');
Randomly divide the images into two groups, one for training and the other for validation. Use 80% of
the images for training, and the remainder for validation. For purposes of reproducibility, set the
random seed to the default value.
rng default
[imgsTrain,imgsValidation] = splitEachLabel(allImages,0.8,'randomized');
disp(['Number of training images: ',num2str(numel(imgsTrain.Files))]);
SqueezeNet
SqueezeNet is a deep CNN whose architecture supports images of size 227-by-227-by-3. Even though
the image dimensions are different from those used for GoogLeNet, you do not have to generate new
RGB images at the SqueezeNet dimensions. You can use the original RGB images.
Load the pretrained SqueezeNet neural network.
sqz = squeezenet;
Extract the layer graph from the network. Confirm SqueezeNet has fewer layers than GoogLeNet.
Also confirm that SqueezeNet is configured for images of size 227-by-227-by-3.
lgraphSqz = layerGraph(sqz);
disp(['Number of Layers: ',num2str(numel(lgraphSqz.Layers))])
Number of Layers: 68
disp(lgraphSqz.Layers(1).InputSize)
227 227 3
To retrain SqueezeNet to classify new images, make changes similar to those made for GoogLeNet.
Replace the 'drop9' layer, the last dropout layer in the network, with a dropout layer of probability
0.6.
tmpLayer = lgraphSqz.Layers(end-5);
newDropoutLayer = dropoutLayer(0.6,'Name','new_dropout');
lgraphSqz = replaceLayer(lgraphSqz,tmpLayer.Name,newDropoutLayer);
Unlike GoogLeNet, the last learnable layer in SqueezeNet is a 1-by-1 convolutional layer, 'conv10',
and not a fully connected layer. Replace the 'conv10' layer with a new convolutional layer with the
number of filters equal to the number of classes. As was done with GoogLeNet, increase the learning
rate factors of the new layer.
numClasses = numel(categories(imgsTrain.Labels));
tmpLayer = lgraphSqz.Layers(end-4);
newLearnableLayer = convolution2dLayer(1,numClasses,'Name','new_conv', ...
    'WeightLearnRateFactor',10,'BiasLearnRateFactor',10);
lgraphSqz = replaceLayer(lgraphSqz,tmpLayer.Name,newLearnableLayer);
Replace the classification layer with a new one without class labels.
tmpLayer = lgraphSqz.Layers(end);
newClassLayer = classificationLayer('Name','new_classoutput');
lgraphSqz = replaceLayer(lgraphSqz,tmpLayer.Name,newClassLayer);
Inspect the last six layers of the network. Confirm the dropout, convolutional, and output layers have
been changed.
lgraphSqz.Layers(63:68)
The RGB images have dimensions appropriate for the GoogLeNet architecture. Create augmented
image datastores that automatically resize the existing RGB images for the SqueezeNet architecture.
For more information, see augmentedImageDatastore.
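The datastore creation code does not appear here; the following is a minimal sketch, assuming the
227-by-227 SqueezeNet input size and the augimgsTrain and augimgsValidation variable names that
the training options below expect.
augimgsTrain = augmentedImageDatastore([227 227],imgsTrain);
augimgsValidation = augmentedImageDatastore([227 227],imgsValidation);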
Create a new set of training options to use with SqueezeNet. Set the random seed to the default value
and train the network. The training process usually takes 1-5 minutes on a desktop CPU.
ilr = 3e-4;
miniBatchSize = 10;
maxEpochs = 15;
valFreq = floor(numel(augimgsTrain.Files)/miniBatchSize);
opts = trainingOptions('sgdm',...
'MiniBatchSize',miniBatchSize,...
'MaxEpochs',maxEpochs,...
'InitialLearnRate',ilr,...
'ValidationData',augimgsValidation,...
'ValidationFrequency',valFreq,...
'Verbose',1,...
'ExecutionEnvironment','cpu',...
'Plots','training-progress');
rng default
trainedSN = trainNetwork(augimgsTrain,lgraphSqz,opts);
Inspect the last layer of the network. Confirm the Classification Output layer includes the three
classes.
trainedSN.Layers(end)
ans =
ClassificationOutputLayer with properties:
Name: 'new_classoutput'
Classes: [ARR CHF NSR]
ClassWeights: 'none'
OutputSize: 3
Hyperparameters
LossFunction: 'crossentropyex'
Use the dlhdl.Target class to create a target object with a custom name for your target device and
an interface to connect your target device to the host computer. Interface options are JTAG and
Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2019.2. To set the Xilinx Vivado toolpath,
enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Use the dlhdl.Workflow class to create an object. When you create the object, specify the network
and the bitstream name. Specify the trained SqueezeNet network, trainedSN, as the network. Make
sure that the bitstream name matches the data type and the FPGA board that you are targeting. In
this example, the target FPGA board is the Xilinx ZCU102 SoC board. The bitstream uses a single
data type.
hW=dlhdl.Workflow('Network', trainedSN, 'Bitstream', 'zcu102_single','Target',hTarget)
hW =
Workflow with properties:
To compile the trainedSN DAG network, run the compile method of the dlhdl.Workflow object.
dn = hW.compile
Skipping: data
Compiling leg: conv1>>fire2-relu_squeeze1x1 ...
Compiling leg: conv1>>fire2-relu_squeeze1x1 ... complete.
Compiling leg: fire2-expand1x1>>fire2-relu_expand1x1 ...
Compiling leg: fire2-expand1x1>>fire2-relu_expand1x1 ... complete.
Compiling leg: fire2-expand3x3>>fire2-relu_expand3x3 ...
Compiling leg: fire2-expand3x3>>fire2-relu_expand3x3 ... complete.
Do nothing: fire2-concat
Compiling leg: fire3-squeeze1x1>>fire3-relu_squeeze1x1 ...
Compiling leg: fire3-squeeze1x1>>fire3-relu_squeeze1x1 ... complete.
Compiling leg: fire3-expand1x1>>fire3-relu_expand1x1 ...
Compiling leg: fire3-expand1x1>>fire3-relu_expand1x1 ... complete.
Compiling leg: fire3-expand3x3>>fire3-relu_expand3x3 ...
Compiling leg: fire3-expand3x3>>fire3-relu_expand3x3 ... complete.
Do nothing: fire3-concat
Compiling leg: pool3>>fire4-relu_squeeze1x1 ...
Compiling leg: pool3>>fire4-relu_squeeze1x1 ... complete.
Compiling leg: fire4-expand1x1>>fire4-relu_expand1x1 ...
Compiling leg: fire4-expand1x1>>fire4-relu_expand1x1 ... complete.
Compiling leg: fire4-expand3x3>>fire4-relu_expand3x3 ...
Compiling leg: fire4-expand3x3>>fire4-relu_expand3x3 ... complete.
Do nothing: fire4-concat
Compiling leg: fire5-squeeze1x1>>fire5-relu_squeeze1x1 ...
Compiling leg: fire5-squeeze1x1>>fire5-relu_squeeze1x1 ... complete.
Compiling leg: fire5-expand1x1>>fire5-relu_expand1x1 ...
Compiling leg: fire5-expand1x1>>fire5-relu_expand1x1 ... complete.
Compiling leg: fire5-expand3x3>>fire5-relu_expand3x3 ...
Compiling leg: fire5-expand3x3>>fire5-relu_expand3x3 ... complete.
Do nothing: fire5-concat
Compiling leg: pool5>>fire6-relu_squeeze1x1 ...
Compiling leg: pool5>>fire6-relu_squeeze1x1 ... complete.
Compiling leg: fire6-expand1x1>>fire6-relu_expand1x1 ...
Compiling leg: fire6-expand1x1>>fire6-relu_expand1x1 ... complete.
Compiling leg: fire6-expand3x3>>fire6-relu_expand3x3 ...
Compiling leg: fire6-expand3x3>>fire6-relu_expand3x3 ... complete.
Do nothing: fire6-concat
Compiling leg: fire7-squeeze1x1>>fire7-relu_squeeze1x1 ...
Compiling leg: fire7-squeeze1x1>>fire7-relu_squeeze1x1 ... complete.
Compiling leg: fire7-expand1x1>>fire7-relu_expand1x1 ...
Compiling leg: fire7-expand1x1>>fire7-relu_expand1x1 ... complete.
Compiling leg: fire7-expand3x3>>fire7-relu_expand3x3 ...
To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the
dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA
board by using the programming file. It also downloads the network weights and biases. The deploy
function starts programming the FPGA device, displays progress messages, and reports the time it
takes to deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the ta
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 12-Jan-2021 16:28:17
Randomly select an image from the validation set, resize it to the 227-by-227 SqueezeNet input size,
and classify it by using the trained network.
idx = randi(32);
testim = readimage(imgsValidation,idx);
im = imresize(testim,[227 227]);
imshow(testim)
[YPred1,probs1] = classify(trainedSN,im);
accuracy1 = (YPred1 == imgsValidation.Labels(idx));
Classify the same image on the FPGA by using the predict method of the dlhdl.Workflow object with
profiling enabled.
[YPred2,probs2] = hW.predict(single(im),'profile','on');
[val,classIdx] = max(YPred2);
accuracy2 = strcmp(trainedSN.Layers(end).ClassNames{classIdx},char(imgsValidation.Labels(idx)));
trainedSN.Layers(end).ClassNames{classIdx}
ans =
'CHF'
Supporting Functions
helperCreateECGDirectories creates an ECG data directory inside a parent directory and then creates
three subdirectories inside the data directory, one for each ECG class found in ECGData.
function helperCreateECGDirectories(ECGData,parentFolder,dataFolder)
% This function is only intended to support the ECGAndDeepLearningExample.
% It may change or be removed in a future release.
rootFolder = parentFolder;
localFolder = dataFolder;
mkdir(fullfile(rootFolder,localFolder))
folderLabels = unique(ECGData.Labels);
for i = 1:numel(folderLabels)
mkdir(fullfile(rootFolder,localFolder,char(folderLabels(i))));
end
end
helperPlotReps plots the first thousand samples of a representative of each class of ECG signal
found in ECGData.
function helperPlotReps(ECGData)
% This function is only intended to support the ECGAndDeepLearningExample.
% It may change or be removed in a future release.
folderLabels = unique(ECGData.Labels);
for k=1:3
ecgType = folderLabels{k};
ind = find(ismember(ECGData.Labels,ecgType));
subplot(3,1,k)
plot(ECGData.Data(ind(1),1:1000));
grid on
title(ecgType)
end
end
function helperCreateRGBfromTF(ECGData,parentFolder,childFolder)
% This function is only intended to support the ECGAndDeepLearningExample.
% It may change or be removed in a future release.
imageRoot = fullfile(parentFolder,childFolder);
data = ECGData.Data;
labels = ECGData.Labels;
[~,signalLength] = size(data);
fb = cwtfilterbank('SignalLength',signalLength,'VoicesPerOctave',12);
r = size(data,1);
for ii = 1:r
cfs = abs(fb.wt(data(ii,:)));
im = ind2rgb(im2uint8(rescale(cfs)),jet(128));
imgLoc = fullfile(imageRoot,char(labels(ii)));
imFileName = strcat(char(labels(ii)),'_',num2str(ii),'.jpg');
imwrite(imresize(im,[224 224]),fullfile(imgLoc,imFileName));
end
end
11
Quantization of Deep Neural Networks
Most pretrained neural networks and neural networks trained using Deep Learning Toolbox™ use
single-precision floating point data types. Even small trained neural networks require a considerable
amount of memory, and require hardware that can perform floating-point arithmetic. These
restrictions can inhibit deployment of deep learning capabilities to low-power microcontrollers and
FPGAs.
Using the Deep Learning Toolbox Model Quantization Library support package, you can quantize a
network to use 8-bit scaled integer data types.
Quantization of a neural network requires a GPU, the GPU Coder™ Interface for Deep Learning
Libraries support package, and the Deep Learning Toolbox Model Quantization Library support
package. Using a GPU requires a CUDA® enabled NVIDIA® GPU with compute capability 6.1, 6.3 or
higher.
Quantizing a network can introduce these effects:
• Precision loss: Precision loss is a rounding error. When precision loss occurs, the value is rounded
to the nearest number that is representable by the data type. In the case of a tie it rounds:
• Positive numbers to the closest representable value in the direction of positive infinity.
• Negative numbers to the closest representable value in the direction of negative infinity.
In MATLAB you can perform this type of rounding by using the round function, as shown in the
example after this list.
• Underflow: Underflow is a type of precision loss. Underflows occur when the value is smaller than
the smallest value representable by the data type. When this occurs, the value saturates to zero.
• Overflow: When a value is larger than the largest value that a data type can represent, an
overflow occurs. When an overflow occurs, the value saturates to the largest value representable
by the data type.
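For instance, this snippet illustrates the tie behavior of round and the precision loss that occurs when
a value is scaled to a fixed step size. The 2^-3 step is illustrative and matches the worked example
later in this topic.
round(2.5)             % returns 3, a tie rounds toward positive infinity
round(-2.5)            % returns -3, a tie rounds toward negative infinity
round(2.1/2^-3)*2^-3   % returns 2.125, the nearest multiple of the 2^-3 step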
1 Consider the following values logged for a parameter while exercising a network.
2 Find the ideal binary representation of each logged value of the parameter.
The most significant bit (MSB) is the left-most bit of the binary word. This bit contributes most to
the value of the number. The MSB for each value is highlighted in yellow.
3 By aligning the binary words, you can see the distribution of bits used by the logged values of a
parameter. The sum of MSBs in each column, highlighted in green, gives an aggregate view of the
logged values.
4 Display the MSB counts of each bit location as a heat map. In this heat map, darker blue regions
correspond to a larger number of MSBs in the bit location.
5 The software assigns a data type that can represent the bit locations that capture the most
information. In this example, the software selects a data type that represents bits from 2^3 to 2^-3.
An additional sign bit is required to represent the signedness of the value.
6 After assigning the data type, any bits outside of that data type are removed. Due to the
assignment of a smaller data type of fixed length, precision loss, overflow, and underflow can
occur for values that are not representable by the data type.
In this example, the value 0.03125 suffers from an underflow, so the quantized value is 0. The
value 2.1 suffers some precision loss, so the quantized value is 2.125. The value 16.250 is larger
than the largest representable value of the data type, so this value overflows and the quantized
value saturates to 15.875.
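You can reproduce this rounding and saturation behavior with a few lines of MATLAB. This is a
minimal sketch, not a toolbox function; it assumes the data type described above, with a 2^-3 step
and a largest representable value of 2^4 - 2^-3, and for simplicity assumes a symmetric negative
saturation point.
scale = 2^-3;                 % smallest representable step
limit = 2^4 - scale;          % largest representable value, 15.875
quantize = @(x) max(min(round(x/scale)*scale,limit),-limit);
quantize(0.03125)             % underflows to 0
quantize(2.1)                 % rounds to 2.125 with some precision loss
quantize(16.250)              % overflows and saturates to 15.875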
7 The Deep Network Quantizer app displays this heat map histogram for each learnable parameter
in the convolution layers and fully connected layers of the network. The gray regions of the
histogram show the bits that cannot be represented by the data type.
See Also
Apps
Deep Network Quantizer
Functions
calibrate | dlquantizationOptions | dlquantizer | validate
Quantization Workflow Prerequisites
Development host requirements for setting up the toolkit environment depend on the execution
environment:
• FPGA: hdlsetuptoolpath (HDL Coder). The calibrate workflow also requires the MinGW C++
compiler or another supported compiler. For a list of supported compilers, see
https://www.mathworks.com/support/requirements/supported-compilers.html.
• GPU: "Setting Up the Prerequisite Products" (GPU Coder)
• CPU: "Prerequisites for Deep Learning with MATLAB Coder" (MATLAB Coder)
Calibration
Workflow
Use the calibrate method to collect the dynamic ranges of the weights and biases in the convolution
and fully connected layers of the quantized network and the dynamic ranges of the activations in all
layers.
The calibrate method uses the collected dynamic ranges to generate an exponents file. The
dlhdl.Workflow class compile method uses the exponents file to generate a configuration file that
contains the weights and biases of the quantized network.
This figure illustrates the workflow to calibrate your quantized series deep learning network.
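In code, the calibration step looks similar to this minimal sketch, where the trained network net and
the calibration image datastore calData are placeholder names.
dlQuantObj = dlquantizer(net,'ExecutionEnvironment','FPGA');
dlQuantObj.calibrate(calData)  % collect dynamic ranges and generate exponents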
See Also
calibrate | dlquantizationOptions | dlquantizer | validate
More About
• “Quantization of Deep Neural Networks” on page 11-2
• “Validation” on page 11-14
• “Code Generation and Deployment” on page 11-17
Validation
Workflow
Before deploying the quantized network to your target FPGA or SoC board, verify the accuracy of
your quantized network by using the validation workflow.
This figure illustrates the workflow to validate your quantized series deep learning network.
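In code, the validation step looks similar to this minimal sketch. It assumes the calibrated dlQuantObj
object from the calibration workflow, a validation datastore valData, a dlhdl.Target object hTarget,
and an int8 bitstream for the target board; all of these names are placeholders.
options = dlquantizationOptions('Bitstream','zcu102_int8','Target',hTarget);
results = dlQuantObj.validate(valData,options);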
See Also
dlquantizationOptions | dlquantizer | validate
More About
• “Quantization of Deep Neural Networks” on page 11-2
• “Calibration” on page 11-12
• “Code Generation and Deployment” on page 11-17
Code Generation and Deployment
After calibrating and validating the quantized network, use the dlhdl.Workflow object to:
• Compile and deploy the quantized deep learning network on a target FPGA or SoC board by using
the deploy function.
• Estimate the speed of the quantized deep learning network in terms of number of frames per
second by using the estimate function.
• Execute the deployed quantized deep learning network and predict the classification of input
images by using the predict function.
• Calculate the speed and profile of the deployed quantized deep learning network by using the
predict function. Set the Profile parameter to on.
This figure illustrates the workflow to deploy your quantized deep learning network to the FPGA
boards.
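A minimal sketch of these steps, assuming the calibrated and validated dlQuantObj object, the
hTarget object from the previous topics, an int8 bitstream for the target board, and a placeholder
input image inputImg.
hW = dlhdl.Workflow('Network',dlQuantObj,'Bitstream','zcu102_int8','Target',hTarget);
hW.compile;      % generate the weights, biases, and instructions
hW.deploy;       % program the FPGA and download the network parameters
[prediction,speed] = hW.predict(single(inputImg),'Profile','on');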
See Also
dlhdl.Workflow | dlhdl.Target | dlquantizer
More About
• “Quantization of Deep Neural Networks” on page 11-2
• “Calibration” on page 11-12
• “Validation” on page 11-14
12
Deep Learning Processor IP Core
To generate the DL processor IP core, use the HDL Coder™ IP core generation workflow. The
generated IP core contains a standard set of registers and the generated IP core report. For more
information, see "Deep Learning Processor Register Map" on page 12-9.
The DL processor IP core reads inputs from the external memory and sends outputs to the external
memory. The external memory buffer allocation is calculated by the compiler based on the network
size and your hardware design. For more information, see “Use Compiler Output for System
Integration” on page 12-3.
The input and output data are stored in the external memory in a predefined format. For more
information, see "External Memory Data Format" on page 12-6.
See Also
More About
• “Custom IP Core Generation” (HDL Coder)
• “Use Compiler Output for System Integration” on page 12-3
• “External Memory Data Format” on page 12-6
• “Deep Learning Processor Register Map” on page 12-9
Use Compiler Output for System Integration
To integrate the generated deep learning processor IP core into your system reference design, use
the compile method outputs.
For an example of the external memory map generated for the ResNet-18 recognition network that
uses the zcu102_single bitstream, see "Compile dagnet network object".
Compiler Optimizations
The compile function optimizes networks for deployment by identifying network layers that can be
executed as a single operation on hardware and fusing those layers together. For example, the
compile function can fuse a batch normalization layer into the preceding convolution layer.
This image shows the legs of the ResNet-18 network created by the compile function and those legs
highlighted on the ResNet-18 layer architecture.
See Also
More About
• “Deep Learning Processor IP Core” on page 12-2
• “External Memory Data Format” on page 12-6
• “Deep Learning Processor Register Map” on page 12-9
External Memory Data Format
Key Terminology
• Parallel Data Transfer Number refers to the number of pixels that are transferred every
clock cycle through the AXI master interface. Use the letter N in place of the Parallel Data
Transfer Number. Mathematically N is the square root of the ConvThreadNumber. See
“ConvThreadNumber”.
• Feature Number refers to the value of the z dimension of an x-by-y-by-z matrix. For example,
most input images are of dimension x-by-y-by-three, with three referring to the red, green, and
blue channels of an image. Use the letter Z in place of the Feature Number.
The image demonstrates how the data stored in a 3-by-3-by-4 matrix is translated into a 1-by-36
matrix that is then stored in the external memory.
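The following MATLAB sketch illustrates the idea of this translation. The exact element order is
defined by the generated IP core; this sketch assumes that the Z features of each pixel are stored
together and that pixels are traversed row by row.
A = reshape(1:36,[3 3 4]);                % a 3-by-3-by-4 input matrix
vec = reshape(permute(A,[3 2 1]),1,[]);   % 1-by-36 row vector for external memory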
When the image Feature Number (Z) is not a multiple of the Parallel Data Transfer Number
(N), you must pad the matrix with x-by-y matrices of zeros along the z dimension to make the
image Z value a multiple of N.
For example, if your input image is an x-by-y matrix with a Z value of three and the value of N is four,
pad the image with one x-by-y matrix of zeros to make the input to the external memory an
x-by-y-by-4 matrix.
The image shows the example output external memory data format for the input matrix after the zero
padding. In the image, A, B, and C are the three features of the input image and G is the zero-padded
data that makes the input image Z value four, which is a multiple of N.
If your deep learning processor consists of only a convolution (conv) processing module, the output
data in external memory uses the conv module data format, which means that it can contain padded
data if your output Z value is not a multiple of the N value. The padded data is removed automatically
when you use the dlhdl.Workflow workflow. If you do not use the dlhdl.Workflow workflow and
read the output directly from the external memory, remove the padded data.
The image shows the example external memory output data format for a fully connected output
feature size of six. In the image, A, B, C, D, E, and F are the output features of the image.
See Also
More About
• “Deep Learning Processor IP Core” on page 12-2
• “Use Compiler Output for System Integration” on page 12-3
• “Deep Learning Processor Register Map” on page 12-9
Deep Learning Processor Register Map
The DL processor IP core is generated by using the HDL Coder IP core generation workflow. The
generated IP core contains a standard set of registers. For more information, see “Custom IP Core
Generation” (HDL Coder).
For the full list of register offsets, see the Register Address Mapping table in the generated deep
learning (DL) processor IP core report.
The image contains all the AXI4 registers created during IP core generation.
See Also
More About
• “Deep Learning Processor IP Core” on page 12-2
• “Use Compiler Output for System Integration” on page 12-3
• “External Memory Data Format” on page 12-6