

An End-to-End Workflow to Efficiently Compress and Deploy DNN Classifiers on SoC/FPGA

Romina Soledad Molina, Iván René Morales, Maria Liz Crespo, Veronica Gil Costa, Sergio Carrato, and Giovanni Ramponi, Life Senior Member, IEEE

Abstract—Machine learning (ML) models have demonstrated discriminative and representative learning capabilities over a wide range of applications, even at the cost of high computational complexity. Due to their parallel processing capabilities, reconfigurability, and low power consumption, systems on chip based on a field programmable gate array (SoC/FPGA) have been used to face this challenge. Nevertheless, SoC/FPGA devices are resource-constrained, which implies the need for an optimal use of the technology for the computation and storage operations involved in ML-based inference. Consequently, mapping a deep neural network (DNN) architecture to an SoC/FPGA requires compression strategies to obtain a hardware design with a good compromise between effectiveness, memory footprint, and inference time. This letter presents an efficient end-to-end workflow for deploying DNNs on an SoC/FPGA by integrating hyperparameter tuning through Bayesian optimization (BO) with an ensemble of compression techniques.

Index Terms—Compression, deep neural networks, FPGA/SoC, machine learning (ML), workflow.

Manuscript received 7 October 2023; revised 27 November 2023; accepted 11 December 2023. Date of publication 14 December 2023; date of current version 30 August 2024. This manuscript was recommended for publication by M. Zhao. (Corresponding author: Romina Soledad Molina.)
Romina Soledad Molina and Iván René Morales are with the Multidisciplinary Laboratory (MLab), STI Unit, Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy, and also with the Department of Engineering and Architecture (DIA), University of Trieste, 34127 Trieste, Italy (e-mail: [email protected]).
Maria Liz Crespo is with the Multidisciplinary Laboratory (MLab), STI Unit, Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy.
Sergio Carrato and Giovanni Ramponi are with the Department of Engineering and Architecture (DIA), University of Trieste, 34127 Trieste, Italy.
Veronica Gil Costa is with CONICET, National University of San Luis, San Luis 5700, Argentina.
Digital Object Identifier 10.1109/LES.2023.3343030

I. INTRODUCTION

MANY new machine learning (ML) application fields require mapping deep neural network (DNN) inference processes onto systems on chip based on a field programmable gate array (SoC/FPGA) to exploit some inherent favorable features of these technologies, such as low latency, low power consumption, and high parallelism. Compression techniques become essential to deploy ML models on resource-constrained devices, preserving effectiveness in smaller and faster models [1], [2].

Long design cycles are needed to implement suitable FPGA firmware that optimizes latency and resource utilization. Developers would benefit from a methodology guiding DNN implementation on SoC/FPGA, unifying existing techniques and considering the interaction between the FPGA and the processor that make up the SoC.

Recent studies in this field usually focus on the integration of pruning and/or quantization methods for model compression, and only a few contributions address the entire development cycle [3], [4], [5]. Specifically, Fahim et al. [3] proposed a co-design workflow, including quantization-aware training and pruning, to optimize architectures for low-power FPGA-based devices, without including knowledge distillation (KD) or a strategy for FPGA verification. Furthermore, that co-design workflow was validated on the MNIST dataset rather than on real-world applications. The letter in [5] presents a co-design methodology for FPGA-based DNN accelerators developed for inference at the edge, but it requires a high degree of data transfer between on-chip and off-chip memory. The proposal in [4] exposes an FPGA/DNN co-design for low-end devices, but it is limited to convolutional neural networks (CNNs) and assumes the weights are stored in external memory.

In this letter, we propose a workflow¹ that integrates KD, pruning, and quantization to efficiently compress and deploy DNN-based classifiers on SoC/FPGA, addressing the entire development cycle, from ML-based architecture training to hardware deployment, and tackling the drawbacks presented in [3], [4], and [5]. The inclusion of the KD technique, combined with hyperparameter optimization (HPO) through Bayesian optimization (BO), allows the generation of a reduced and more efficient model (the student model) that mimics the overall accuracy of a teacher architecture.

HPO is carried out using BO because it is supported by the literature as one of the most effective techniques for performing this task [6]. In contrast, grid search suffers from the curse of dimensionality, while random search is more efficient than grid search but is unreliable for training some complex models [7]. In the proposed methodology, the developer defines the different topologies (teacher and student) and employs HPO to find the best set of values for the previously defined DNN topology.

The proposed workflow allows full on-chip storage of the ML-based models. This makes it suitable for applications that require fast inference and energy efficiency, unlike the contributions in [4] and [5], which require a high degree of data transfer between on-chip and off-chip memory. The reduced model is obtained at the expense of a longer training time and a more extensive search space for HPO.

¹Available on: https://github.com/RomiSolMolina/workflowCompressionML


Fig. 1. Workflow to efficiently compress and deploy DNN-based classifiers on SoC/FPGA.

The main contributions of this letter can be summarized as follows.
1) A workflow based on KD combined with quantization-aware pruning (QAP) and HPO, aiming to guide the developer in obtaining and compressing ML-based models and implementing them on SoC/FPGA.
2) An SoC/FPGA framework containing the ComBlock IP core [8], which facilitates the communication protocols and the control and evaluation of the ML-based hardware from the SoC's soft-core or hard-core processor.
3) The incorporation of KD and BO in the training steps, allowing a full on-chip deployment on the FPGA and thus avoiding off-chip communication bottlenecks.
The remainder of this letter is organized as follows. Section II introduces the workflow to compress and deploy DNNs on SoC/FPGA. The experiments and results are presented in Section III. Finally, Section IV summarizes the conclusions.

II. WORKFLOW

The main steps of the proposed workflow are shown in Fig. 1. The input to the workflow is the dataset used to train and compress the target model, as described in Section II-A. Hereafter, a data structure is generated with information regarding the layers, weights, and biases of the model. This data structure is used by a hardware synthesis tool for ML to translate the DNN model, as presented in Section II-B. Subsequently, the inference hardware is created and integrated into an assessment template to verify its functionality on the SoC/FPGA device, as introduced in Section II-C.
The implementation details are available in the repository: https://github.com/RomiSolMolina/workflowCompressionML.

Fig. 2. Training and compression stages.

A. DNN Training and Compression

The training and compression of the DNN classifier include the following stages (Fig. 2).
1) Stage 1—Teacher Training: The teacher hypermodel is created by the user, who defines the number and type of layers and the search space, i.e., a range of values per layer chosen with a heuristic approach. Using the dataset and the teacher hypermodel, the BO process is performed according to the number of iterations specified by the developer. BO is derived from probabilistic models and aims to maximize or minimize an objective function (e.g., model accuracy or loss) by efficiently exploring the hyperparameter space and identifying the configuration that optimizes that function. In this process, BO guides the search of hyperparameters using past evaluation results. During the search, a new architecture is compiled and trained at each iteration, while the optimizer evaluates a combination of hyperparameters within the defined search space, considering the validation accuracy as the objective function.
After optimizing, the network is trained with the optimal configuration found, generating a teacher model with 32-bit floating-point precision (teacherFP). Alternatively, a pretrained model can be used as teacherFP.
2) Stage 2—Student-Aware Training: The corresponding student hypermodel is constructed, and a search space is created within a range of values that defines the number of neurons and kernels per layer. Furthermore, the hypermodel is defined with n-bit fixed-point precision (where n is the number of bits) for the quantization (Q) method and with a target sparsity for pruning (P). Because the student network has a reduced number of parameters compared to the teacher network, the range of values for the search space (for each layer) is set to a smaller number. Using the dataset, the student hypermodel, and the teacherFP model, the BO process is performed according to the number of iterations specified by the developer. The learning approach chosen during the optimization is KD, which is combined with QAP to transfer the knowledge of teacherFP to a smaller architecture. QAP incorporates quantization and pruning into the student training process.
Once the optimization process is completed, the selected hyperparameter values are delivered to the developer. With these, the knowledge of teacherFP is distilled into the student network, previously defined by the number of bits and target sparsity used in the optimization process. QAP is employed along with the learning process through KD. Finally, the compressed studentKDQP model is trained.
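To make the KD-based student training concrete, the following sketch shows a distillation training step in TensorFlow/Keras, where the student loss blends hard-label cross-entropy with a temperature-scaled soft-target term obtained from teacherFP. It is a minimal illustration rather than the authors' released implementation: the weighting alpha, the temperature, and the assumption that the student model has already been wrapped with quantization-aware and pruning operators (e.g., quantized layers and a pruning schedule) are ours.

# Sketch of one KD training step for the student (QAP wrappers are assumed to be
# already applied to `student`; alpha and temp are illustrative values).
import tensorflow as tf

def distillation_loss(y_true, student_logits, teacher_logits, alpha=0.1, temp=4.0):
    # Hard-label term: standard cross-entropy against the dataset labels.
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)
    # Soft-target term: KL divergence between temperature-scaled distributions.
    soft = tf.keras.losses.kl_divergence(
        tf.nn.softmax(teacher_logits / temp), tf.nn.softmax(student_logits / temp))
    return tf.reduce_mean(alpha * hard + (1.0 - alpha) * (temp ** 2) * soft)

@tf.function
def train_step(student, teacher_fp, optimizer, x, y):
    teacher_logits = teacher_fp(x, training=False)   # frozen 32-bit teacher
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)   # quantized and pruned student
        loss = distillation_loss(y, student_logits, teacher_logits)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss

In the full workflow, a loop of this kind would run inside each BO trial, so that the tuner can score every candidate student on validation accuracy.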
3) General Key Points: BO, implemented through KerasTuner [9], performs parameter searches based on developer-configured settings, generating architectures with more or fewer parameters. The heuristic search spaces for teacher optimization range between 32 and 300 (neurons or kernels per layer), whereas for the student they range from 1 to 20. Fig. 3 shows an example of the search space definition considering one convolutional layer: to tune the number of kernels in the first layer, the algorithm searches for an optimal value between 32 and 128 in increments of 8. The optimal learning rate for the teacher and student models is selected from 0.01, 0.001, or 0.0001.
The dataset was divided into training, validation, and testing subsets (70%, 15%, and 15%, respectively) for both stages of the DNN training and compression process. When using pretrained networks from the state of the art, it is necessary to fine-tune the architecture with the corresponding dataset to adapt it to the specific problem.
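As an illustration of how such a search space can be expressed with KerasTuner [9], the sketch below tunes the number of kernels of a first convolutional layer between 32 and 128 in steps of 8 and picks the learning rate from 0.01, 0.001, or 0.0001, using Bayesian optimization with validation accuracy as the objective. The layer body, input shape, trial budget, and placeholder data are illustrative and do not reproduce the authors' exact hypermodels.

# Hypermodel sketch mirroring the search space of Fig. 3 (all concrete values
# are illustrative placeholders, not the configuration used in this letter).
import numpy as np
import tensorflow as tf
import keras_tuner as kt

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(
            filters=hp.Int("conv1_kernels", min_value=32, max_value=128, step=8),
            kernel_size=3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    lr = hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Placeholder data standing in for the 70%/15% training/validation split.
x_train, y_train = np.random.rand(512, 32, 32, 3), np.random.randint(0, 10, 512)
x_val, y_val = np.random.rand(128, 32, 32, 3), np.random.randint(0, 10, 128)

tuner = kt.BayesianOptimization(build_model, objective="val_accuracy",
                                max_trials=30)   # iterations set by the developer
tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]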


Fig. 3. Example of the search space definition.

B. Integration With Hardware Synthesis Tool for ML

After DNN training and compression, the model is saved in a data file suitable for the backend tools that perform hardware synthesis for ML-based designs. We chose hls4ml [10] because it is open source and supports Vivado HLS, Intel HLS, and (in experimental mode) Vitis HLS as high-level synthesis (HLS) backends. However, the proposed workflow can be adapted to other hardware synthesis tools for ML applications.
The DNN model is saved in a hierarchical data format (.h5) file containing the data structure of the model, such as the layers, weights, and biases. This file is the input to the hls4ml package, which is linked to an HLS tool that generates hardware from C, C++, and SystemC code. hls4ml creates an HLS project based on the configuration provided by the developer, such as the FPGA part, reuse factor, data precision, and latency/resource strategy. Once the HLS project is generated, a hardware block can be synthesized and exported as an FPGA register transfer level (RTL) design.
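A typical invocation of this step with hls4ml follows the sketch below: the trained and compressed Keras model is translated into an HLS project whose precision, reuse factor, strategy, and target FPGA part are chosen by the developer. The file names, part number, and numeric settings are illustrative assumptions, not the values used in this letter.

# Sketch of the hls4ml conversion step (file names, part, precision, and reuse
# factor are illustrative, not the settings used in this work).
import hls4ml
from tensorflow import keras

model = keras.models.load_model("studentKDQP.h5")    # compressed student model

config = hls4ml.utils.config_from_keras_model(model, granularity="model")
config["Model"]["Precision"] = "ap_fixed<8,3>"       # n-bit fixed-point data type
config["Model"]["ReuseFactor"] = 8                   # latency/resource trade-off
config["Model"]["Strategy"] = "Latency"

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    backend="Vivado",              # Vivado HLS backend; Intel/Vitis also supported
    output_dir="hls_prj",
    part="xczu9eg-ffvb1156-2-e",   # example UltraScale+ ZCU102 device
    io_type="io_parallel")         # weights kept on chip

hls_model.compile()                                   # C simulation of the design
hls_model.build(csim=False, synth=True, export=True)  # HLS synthesis and RTL export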
C. Hardware Assessment Framework

An SoC/FPGA framework was developed to test the deployment of the DNN models in hardware (Fig. 4). The assessment framework includes an embedded processor (soft core or hard core) that acts as a programmable control device. The DNN inference hardware exported from the HLS is placed within the framework design to enable flawless interchangeability during the testing iterations. Communication between the processor and the inference hardware is carried out using ComBlock [8], an open-source, portable, and customizable IP block designed to simplify the interaction between the embedded processor and the FPGA. We selected ComBlock because of its easy integration with soft-core and hard-core processors. However, the developer can select a different hardware communication block, such as a DMA controller.

Fig. 4. Hardware assessment framework.

Through ComBlock, the processor sends the input stream to the inference block and retrieves the results of the DNN classifier. Algorithm 1 summarizes how the processor interacts with the FPGA, using ComBlock for initialization and data streaming.

Algorithm 1 Pseudocode Running on the Processor
Input: inputSignal[INPUT_SIZE]
1: comBlock.fifo_flush(fifo)
2: comBlock.register_write(reset_DNN, 0)
3: while true do
4:   for i = 0 to INPUT_SIZE - 1 do
5:     comBlock.fifo_write(inputSignal[i])
6:   end for
7:   outVal ← comBlock.register_read(dnnOutput)
8: end while

III. EXPERIMENTS AND RESULTS

A. Real Case Study Applications

A preliminary version of our work was presented in [11], [12], [13], and [14], where we applied Q, P, and KD for model compression in two real-world applications and provided evidence of performance improvement when targeting SoC/FPGA devices. The former application (1D-MLP) focuses on 1-D signals: it is a pulse shape discriminator (PSD) based on a multilayer perceptron (MLP), to be used for event recognition in cosmic ray studies [12], [13]. The latter application is in the field of object classification in 2-D images; its aim is moth classification in the context of pest detection. The solution in [11] (2D-CNN) is based on an ad hoc CNN trained with a dataset obtained from in-field traps through an IoT system. This application is further developed in [14] (2D-VGG16) using a larger pretrained teacher network (VGG16) and a public dataset (Pest24 [15]) employed for fine-tuning the model. With this application, we aimed to illustrate that the developer may start from a previously trained model—in this case, extracted from [14]—and compress it through the proposed methodology (without the need to generate the teacher model from scratch), as well as the potential of the compression method when working with larger teacher architectures. Moreover, we extended the 2-D classification problem using MobileNetV2 (2D-MobileNetV2) as a teacher model, which was previously fine-tuned on Pest24 and modified with a classifier suitable for pest classification. This option was selected because MobileNetV2 is a topology engineered for operation in resource-constrained environments.
In this letter, we apply the end-to-end workflow to both real case studies, complementing the results in [11], [12], [13], and [14].

B. Workflow Assessment

Table I summarizes the most relevant results. In 1D-MLP, the teacher is an MLP with four fully connected layers. The 2D-CNN teacher is composed of four convolutional and three fully connected layers. For 2D-VGG16 and 2D-MobileNetV2, the teacher models were obtained through transfer learning after fine-tuning the networks on the Pest24 dataset. For the four architectures, the overall accuracy was above 97%.
Regarding the compressed models, the 1D-MLP student network was implemented and tested on an Artix-7 FPGA board, using an embedded soft-core processor based on MicroBlaze. The 2D-VGG16 and 2D-CNN student models were implemented and tested on an UltraScale+ ZCU102 with an embedded hard-core processor.


TABLE I
Evaluation. The acronyms in the table are P: parameters, CR: compression ratio, NL: number of layers (only convolutional and fully connected layers), OA: overall accuracy, SP: sparsity, and L: latency without data transfer at 200 MHz (5-ns clock cycle).

Table I shows the post-place-and-route resource utilization for the compressed models. We analyze the 1D-MLP student latency with reuse factors of 1, 8, 16, 32, and 64. This parameter, introduced by hls4ml, is intrinsically associated with the use of multipliers in the FPGA to compute a layer of neuron values. A low reuse factor implies low latency and high resource utilization. The results are presented in Fig. 5. An increased latency (in clock cycle units) is evident when running at 200 MHz compared to 100 MHz. This difference is more noticeable in the Artix-7 FPGA and the SoC-based PYNQ-Z1, which use a 28-nm process technology, than in the 16-nm fabric of the UltraScale+ SoC.

Fig. 5. Latency (in clock cycle units) for 1D-MLP. Data from HLS reports at different clock frequencies (100 and 200 MHz).
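Under the same assumptions as the conversion sketch in Section II-B, a reuse-factor sweep of this kind can be scripted as shown below: the HLS project is rebuilt for each reuse factor and the latency is then read from the generated synthesis reports. The report helper follows the public hls4ml API, while the file names, part, and clock settings are illustrative.

# Sketch of a reuse-factor sweep (file names, part, and clock period are
# illustrative; latency figures come from the generated HLS reports).
import hls4ml
from tensorflow import keras

model = keras.models.load_model("studentKDQP.h5")

for rf in (1, 8, 16, 32, 64):
    config = hls4ml.utils.config_from_keras_model(model, granularity="model")
    config["Model"]["ReuseFactor"] = rf
    hls_model = hls4ml.converters.convert_from_keras_model(
        model, hls_config=config, backend="Vivado",
        output_dir=f"hls_rf{rf}", part="xc7a100tcsg324-1",
        clock_period=5)                     # 5-ns clock (200 MHz); 10 ns for 100 MHz
    hls_model.build(csim=False, synth=True)
    hls4ml.report.read_vivado_report(f"hls_rf{rf}")   # prints latency and resources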
Considering image classification, we addressed the compression of the ML-based model in [11]. For 2D-VGG16, a suitable network was obtained by employing BO for HPO, considering an implementation on a high-end FPGA but with twice the number of parameters compared to the model in [14] (which was generated by manually tuning the hyperparameters). For the cosmic ray application, a smaller architecture was generated compared to the ones obtained in [12] and [13] (26% and more than 70% FPGA usage, respectively).
The experiments show that our proposed methodology successfully generates compressed models, leading to a fully on-chip, memory-mapped implementation on the FPGA. Integrating KD into the ensemble of compression techniques contributes to achieving a balanced student model in terms of size, computational efficiency, and accuracy. The methodology addresses the entire development cycle, overcoming the limitations outlined in previous works, such as [3], [4], and [5]. Furthermore, the ComBlock in the hardware assessment framework facilitates the evaluation of model performance on the FPGA by simplifying the communication interfaces.

IV. CONCLUSION

In this letter, we presented an end-to-end workflow for efficient DNN-based classifier compression and deployment on SoC/FPGA. Combining pruning, quantization, and KD leads to a good tradeoff between resource utilization and quality metrics. BO automatically selects an optimal combination of hyperparameters for the teacher and student models. An improvement in latency and energy efficiency was achieved owing to the minimal off-chip data transactions and the reduced computational operations required to complete the inference. Future work could include neural architecture search (NAS) as an extension of the methodology.

REFERENCES

[1] T. Aarrestad et al., "Fast convolutional neural networks on FPGAs with hls4ml," Mach. Learn., Sci. Technol., vol. 2, no. 4, Jul. 2021, Art. no. 045015.
[2] T. Choudhary, V. Mishra, A. Goswami, and J. Sarangapani, "A comprehensive survey on model compression and acceleration," Artif. Intell. Rev., vol. 53, no. 7, pp. 5113–5155, 2020.
[3] F. Fahim et al., "hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices," in Proc. Res. Symp. Tiny Mach. Learn., 2021, pp. 1–10.
[4] C. Hao et al., "FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge," in Proc. 56th ACM/IEEE Design Autom. Conf. (DAC), Las Vegas, NV, USA, 2019, pp. 1–6.
[5] J. Haris, P. Gibson, J. Cano, N. B. Agostini, and D. Kaeli, "SECDA: Efficient hardware/software co-design of FPGA-based DNN accelerators for edge inference," in Proc. IEEE 33rd Int. Symp. Comput. Archit. High Perform. Comput. (SBAC-PAD), Belo Horizonte, Brazil, 2021, pp. 33–43.
[6] A. H. Victoria and G. Maragatham, "Automatic tuning of hyperparameters using Bayesian optimization," Evolving Syst., vol. 12, no. 1, pp. 217–223, 2021.
[7] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," J. Mach. Learn. Res., vol. 13, no. 2, pp. 1–25, 2012.
[8] K. S. Mannatunga et al., "Design for portability of reconfigurable virtual instrumentation," in Proc. 10th Southern Conf. Program. Logic (SPL), Buenos Aires, Argentina, 2019, pp. 45–52.
[9] T. O'Malley et al., "KerasTuner," 2019. [Online]. Available: https://github.com/keras-team/keras-tuner
[10] J. Duarte et al., "Fast inference of deep neural networks in FPGAs for particle physics," J. Instrum., vol. 13, no. 7, 2018, Art. no. P07027.
[11] A. Suárez, R. S. Molina, G. Ramponi, R. Petrino, L. Bollati, and D. Sequeiros, "Pest detection and classification to reduce pesticide use in fruit crops based on deep neural networks and image processing," in Proc. 19th Workshop Inf. Process. Control (RPIC), San Juan, Argentina, 2021, pp. 1–6.
[12] R. S. Molina et al., "Compression of NN-based pulse-shape discriminators in front-end electronics for particle detection," in Proc. Appl. Electron. Pervading Ind., Environ. Soc., 2021, pp. 93–99.
[13] L. G. G. Ordóñez et al., "Pulse shape discrimination for online data acquisition in water Cherenkov detectors based on FPGA/SoC," in Proc. 37th Int. Cosmic Ray Conf., Berlin, Germany, 2021, pp. 1–8.
[14] R. S. Molina et al., "ML-based classifier for precision agriculture on embedded systems," in Proc. Appl. Electron. Pervading Ind., Environ. Soc., 2023, pp. 117–124.
[15] Q.-J. Wang et al., "Pest24: A large-scale very small object data set of agricultural pests for multi-target detection," Comput. Electron. Agricult., vol. 175, Aug. 2020, Art. no. 105585.

