An End-to-End Workflow To Efficiently Compress and Deploy DNN Classifiers On SoC FPGA
Abstract—Machine learning (ML) models have demonstrated discriminative and representative learning capabilities over a wide range of applications, even at the cost of high computational complexity. Due to their parallel processing capabilities, reconfigurability, and low-power consumption, systems on chip based on a field programmable gate array (SoC/FPGA) have been used to face this challenge. Nevertheless, SoC/FPGA devices are resource-constrained, which implies the need for optimal use of the technology for the computation and storage operations involved in ML-based inference. Consequently, mapping a deep neural network (DNN) architecture to an SoC/FPGA requires compression strategies to obtain a hardware design with a good compromise between effectiveness, memory footprint, and inference time. This letter presents an efficient end-to-end workflow for deploying DNNs on an SoC/FPGA by integrating hyperparameter tuning through Bayesian optimization (BO) with an ensemble of compression techniques.

Index Terms—Compression, deep neural networks, FPGA/SoC, machine learning (ML), workflow.

Manuscript received 7 October 2023; revised 27 November 2023; accepted 11 December 2023. Date of publication 14 December 2023; date of current version 30 August 2024. This manuscript was recommended for publication by M. Zhao. (Corresponding author: Romina Soledad Molina.)
Romina Soledad Molina and Iván René Morales are with the Multidisciplinary Laboratory (MLab), STI Unit, Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy, and also with the Department of Engineering and Architecture (DIA), University of Trieste, 34127 Trieste, Italy (e-mail: [email protected]).
Maria Liz Crespo is with the Multidisciplinary Laboratory (MLab), STI Unit, Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy.
Sergio Carrato and Giovanni Ramponi are with the Department of Engineering and Architecture (DIA), University of Trieste, 34127 Trieste, Italy.
Veronica Gil Costa is with CONICET, National University of San Luis, San Luis 5700, Argentina.
Digital Object Identifier 10.1109/LES.2023.3343030

I. INTRODUCTION

MANY new machine learning (ML) application fields require mapping deep neural network (DNN) inference processes into systems on chip based on a field programmable gate array (SoC/FPGA), to exploit some inherent favorable features of these technologies, such as low latency, low-power consumption, and high parallelism. Compression techniques become essential to deploy ML models on resource-constrained devices, preserving effectiveness in smaller and faster models [1], [2].

Long design cycles are needed to implement suitable FPGA firmware optimizing latency and resource utilization. Developers would benefit from a methodology guiding DNN implementation on SoC/FPGA, unifying existing techniques and considering the interaction between the FPGA and the processor that make up the SoC.

Recent studies in this field usually focus on the integration of pruning and/or quantization methods for model compression, and only a few contributions address the entire development cycle [3], [4], [5]. Specifically, Fahim et al. [3] proposed a co-design workflow, including quantization-aware training and pruning, to optimize architectures for low-power FPGA-based devices, without including knowledge distillation (KD) or a strategy for FPGA verification. Furthermore, their co-design workflow was validated with the MNIST dataset rather than real-world applications. The letter in [5] presents a co-design methodology for FPGA-based DNN accelerators developed for inference at the edge, but it requires a high degree of data transfer between on-chip and off-chip memory. The proposal in [4] presents an FPGA/DNN co-design for low-end devices, but it is limited to convolutional neural networks (CNNs) and assumes the weights are stored in external memory.

In this letter, we propose a workflow1 that integrates KD, pruning, and quantization to efficiently compress and deploy DNN-based classifiers on SoC/FPGA, addressing the entire development cycle: from the ML-based architecture training to the hardware deployment, tackling the drawbacks presented in [3], [4], and [5]. The inclusion of the KD technique, combined with hyperparameter optimization (HPO) through Bayesian optimization (BO), allows the generation of a reduced and more efficient model (student model) that mimics the overall accuracy of a teacher architecture.

1 Available on: https://github.com/RomiSolMolina/workflowCompressionML.

HPO is carried out using BO because it is supported by the literature as one of the most effective techniques for performing this task [6]. In contrast, grid search suffers from the curse of dimensionality, while random search is more efficient than grid search but is unreliable for training some complex models [7]. In the proposed methodology, the developer defines the different topologies (teacher and student) and employs HPO to find the best set of values for the previously defined DNN topology.

The proposed workflow allows full on-chip storage of ML-based models. This makes it suitable for applications that require fast inference and energy efficiency, unlike the contributions in [4] and [5], which require a high degree of data transfer between the on-chip and off-chip memory. The reduced model is obtained at the expense of a longer training
time and a more extensive search space for HPO. The main contributions of this letter can be summarized as follows.
1) A workflow based on KD combined with quantization-aware pruning (QAP) and HPO, aiming to guide the developer in obtaining and compressing ML-based models and implementing them on SoC/FPGA.
2) An SoC/FPGA framework containing the ComBlock IP core [8], which facilitates the communication protocols and the control and evaluation of the ML-based hardware from the SoC's soft-core or hard-core processor.
3) The incorporation of KD and BO in the training steps, allowing a full on-chip deployment on the FPGA, thus avoiding off-chip communication bottlenecks.
The remainder of this letter is organized as follows. Section II introduces the workflow to compress and deploy DNNs on SoC/FPGA. The experiments and results are presented in Section III. Finally, Section IV summarizes the conclusions.

II. WORKFLOW
The main steps of the proposed workflow are shown in Fig. 1. The input to the workflow is the dataset used to train and compress the target model, as described in Section II-A. Hereafter, a data structure is generated with information regarding the layers, weights, and bias of the model. This data structure is used by a hardware synthesis tool for ML to translate the DNN model, as presented in Section II-B. Subsequently, inference hardware is created and integrated into an assessment template to verify its functionality on the SoC/FPGA device, as introduced in Section II-C.

The implementation details are available in the repository: https://github.com/RomiSolMolina/workflowCompressionML.
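Section II-B details the hardware synthesis step; purely as a minimal illustration of the translation outlined above, the sketch below assumes an hls4ml-style flow in which the trained Keras model (here called student_model) is converted into an HLS project. The tool choice, FPGA part, I/O type, and directory name are assumptions for illustration, not necessarily those used in this work.

```python
# Hedged sketch: translating a trained/compressed Keras model into inference
# hardware, assuming an hls4ml-style flow (illustrative only).
import hls4ml

# Per-layer configuration (precision, reuse factor) derived from the Keras model.
config = hls4ml.utils.config_from_keras_model(student_model, granularity="name")

hls_model = hls4ml.converters.convert_from_keras_model(
    student_model,                 # assumed: the compressed student model
    hls_config=config,
    output_dir="hls_prj",          # placeholder output directory
    part="xc7z020clg400-1",        # placeholder SoC/FPGA part
    io_type="io_parallel",         # parallel I/O, weights kept on-chip
)

hls_model.compile()                # builds the C-simulation library for functional checks
# hls_model.build(csim=False, synth=True)   # run HLS synthesis when ready
```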
Fig. 2. Training and compression stages.

A. DNN Training and Compression

The training and compression of the DNN classifier include the following stages (Fig. 2):

1) Stage 1—Teacher Training: The teacher hypermodel is created by the user, who defines the number and type of layers and the search space, i.e., a range of values per layer chosen with a heuristic approach. Using the dataset and the teacher hypermodel, the BO process is performed according to the number of iterations specified by the developer. BO is derived from probabilistic models and efficiently explores the hyperparameter space to identify the configuration that maximizes or minimizes an objective function (e.g., model accuracy or loss). In this process, BO guides the search of hyperparameters using past evaluation results. During the search, a new architecture is compiled and trained for each iteration, whereas the optimizer looks at a combination of hyperparameters within the defined search space, considering the validation accuracy as the objective function.

After optimizing, the network is trained with the optimal configuration found, generating a teacher model with 32-bit floating-point precision (teacherFP). Alternatively, a pretrained model can be used as the teacherFP.

2) Stage 2—Student-Aware Training: The corresponding student hypermodel is constructed, and a search space is created within a range of values that defines the number of neurons and kernels per layer. Furthermore, the hypermodel is defined with n-bit fixed-point precision (where n is the number of bits) for the quantization (Q) method and with a target sparsity for pruning (P). Because the student network has a reduced number of parameters compared to the teacher network, the range of values for the search space (for each layer) is set to a smaller number. Using the dataset, the student hypermodel, and the teacherFP model, the BO process is performed according to the number of iterations specified by the developer. The learning approach chosen during the optimization is KD, which is combined with QAP to transfer the knowledge of the teacherFP to a smaller architecture. QAP incorporates quantization and pruning into the student training process.

Once the optimization process is completed, the selected hyperparameter values are delivered to the developer. With these values, the knowledge of the teacherFP is distilled into the student network, previously defined by the number of bits and target sparsity used in the optimization process. QAP is employed along with the learning process through KD. Finally, the compressed studentKDQP model is trained.
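As a minimal sketch of how the student-aware training described above could be expressed, the code below combines a temperature-scaled KD loss with QAP, assuming QKeras for the n-bit fixed-point quantizers and the TensorFlow Model Optimization Toolkit for magnitude pruning. The bit width, target sparsity, temperature, loss weight, toy topology, and the variable teacher_fp (the trained 32-bit teacher) are illustrative assumptions; in the workflow, these values come out of the HPO stage.

```python
# Hedged sketch: student training with KD combined with quantization-aware pruning (QAP).
import tensorflow as tf
import tensorflow_model_optimization as tfmot
from qkeras import QConv2D, QDense, QActivation, quantized_bits, quantized_relu

BITS, TARGET_SPARSITY, TEMPERATURE, ALPHA = 8, 0.5, 4.0, 0.1   # assumed values

prune = tfmot.sparsity.keras.prune_low_magnitude
schedule = tfmot.sparsity.keras.ConstantSparsity(TARGET_SPARSITY, begin_step=0)
q = quantized_bits(BITS, 0, alpha=1)          # n-bit fixed-point quantizer (Q)

# Student topology with quantized, pruned layers (P is applied during training).
student = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),        # assumed input shape
    prune(QConv2D(8, 3, kernel_quantizer=q, bias_quantizer=q), pruning_schedule=schedule),
    QActivation(quantized_relu(BITS)),
    tf.keras.layers.Flatten(),
    prune(QDense(10, kernel_quantizer=q, bias_quantizer=q), pruning_schedule=schedule),  # assumed 10 classes
])

cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
kld = tf.keras.losses.KLDivergence()
opt = tf.keras.optimizers.Adam(1e-3)

@tf.function
def kd_step(x, y, teacher_fp):
    # Soft targets from the frozen 32-bit floating-point teacher (teacherFP).
    soft_teacher = tf.nn.softmax(teacher_fp(x, training=False) / TEMPERATURE)
    with tf.GradientTape() as tape:
        logits = student(x, training=True)
        soft_student = tf.nn.softmax(logits / TEMPERATURE)
        # Combined loss: hard-label term plus temperature-scaled distillation term.
        loss = (ALPHA * cce(y, logits)
                + (1.0 - ALPHA) * TEMPERATURE ** 2 * kld(soft_teacher, soft_student))
    grads = tape.gradient(loss, student.trainable_variables)
    opt.apply_gradients(zip(grads, student.trainable_variables))
    return loss

# Training loop outline: the pruning wrappers need their step advanced every batch.
# step_cb = tfmot.sparsity.keras.UpdatePruningStep()
# step_cb.set_model(student); step_cb.on_train_begin()
# for epoch in range(num_epochs):
#     for step, (x, y) in enumerate(train_ds):
#         step_cb.on_train_batch_begin(batch=step)
#         kd_step(x, y, teacher_fp)
# studentKDQP = tfmot.sparsity.keras.strip_pruning(student)   # remove wrappers after training
```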
3) General Key Points: BO, implemented through KerasTuner [9], performs parameter searches based on developer-configured settings, generating architectures with more or fewer parameters. The heuristic search spaces for teacher optimization range between 32 and 300, whereas those for the student range from 1 to 20. Fig. 3 shows an example of the search space definition considering one convolutional layer. To tune the number of kernels in the first layer, the algorithm searches for an optimal value between 32 and 128 in increments of 8. The optimal learning rate for the teacher and student models is selected from 0.01, 0.001, or 0.0001.
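To make the Fig. 3-style search-space definition concrete, the following minimal sketch shows how such a hypermodel and BO search could be written with KerasTuner; the hyperparameter names, trial budget, input shape, and number of classes are illustrative assumptions rather than the exact settings used here.

```python
# Hedged sketch: Bayesian-optimization HPO over a convolutional hypermodel with KerasTuner.
import keras_tuner as kt
import tensorflow as tf

def build_hypermodel(hp):
    # Kernels of the first convolutional layer: 32 to 128 in increments of 8.
    n_kernels = hp.Int("conv1_kernels", min_value=32, max_value=128, step=8)
    # Learning rate selected from 0.01, 0.001, or 0.0001.
    lr = hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 1)),               # assumed input shape
        tf.keras.layers.Conv2D(n_kernels, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"), # assumed 10 classes
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# BO explores the defined search space, using validation accuracy as the objective.
tuner = kt.BayesianOptimization(build_hypermodel,
                                objective="val_accuracy",
                                max_trials=30,           # iterations set by the developer
                                overwrite=True,
                                project_name="teacher_hpo")
# tuner.search(x_train, y_train, epochs=20, validation_data=(x_val, y_val))
# best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
```

The same pattern applies to the student hypermodel, with smaller per-layer ranges and the KD/QAP training described above used inside the search.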
The dataset was divided into training, validation, and testing (70%, 15%, and 15%, respectively) for both stages of the
TABLE I
Evaluation. The acronyms in the table are P: parameters, CR: compression ratio, NL: number of layers (only convolutional and fully connected layers), OA: overall accuracy, SP: sparsity, and L: latency without data transfer at 200 MHz (5-ns clock cycle).