An End-to-End Workflow To Efficiently Compress and Deploy DNN Classifiers On SoC FPGA
Abstract—Machine learning (ML) models have demonstrated discriminative and representative learning capabilities over a wide range of applications, even at the cost of high computational complexity. Due to their parallel processing capabilities, reconfigurability, and low-power consumption, systems on chip based on a field programmable gate array (SoC/FPGA) have been used to face this challenge. Nevertheless, SoC/FPGA devices are resource-constrained, which implies the need for optimal use of the technology for the computation and storage operations involved in ML-based inference. Consequently, mapping a deep neural network (DNN) architecture to an SoC/FPGA requires compression strategies to obtain a hardware design with a good compromise between effectiveness, memory footprint, and inference time. This letter presents an efficient end-to-end workflow for deploying DNNs on an SoC/FPGA by integrating hyperparameter tuning through Bayesian optimization (BO) with an ensemble of compression techniques.

Index Terms—Compression, deep neural networks, FPGA/SoC, machine learning (ML), workflow.

Manuscript received 7 October 2023; revised 27 November 2023; accepted 11 December 2023. Date of publication 14 December 2023; date of current version 30 August 2024. This manuscript was recommended for publication by M. Zhao. (Corresponding author: Romina Soledad Molina.)
Romina Soledad Molina and Iván René Morales are with the Multidisciplinary Laboratory (MLab), STI Unit, Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy, and also with the Department of Engineering and Architecture (DIA), University of Trieste, 34127 Trieste, Italy (e-mail: [email protected]).
Maria Liz Crespo is with the Multidisciplinary Laboratory (MLab), STI Unit, Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy.
Sergio Carrato and Giovanni Ramponi are with the Department of Engineering and Architecture (DIA), University of Trieste, 34127 Trieste, Italy.
Veronica Gil Costa is with CONICET, National University of San Luis, San Luis 5700, Argentina.
Digital Object Identifier 10.1109/LES.2023.3343030

I. INTRODUCTION

MANY new machine learning (ML) application fields require mapping deep neural network (DNN) inference processes into systems on chip based on a field programmable gate array (SoC/FPGA), to exploit some inherent favorable features of these technologies, such as low latency, low-power consumption, and high parallelism. Compression techniques become essential to deploy ML models on resource-constrained devices, preserving effectiveness in smaller and faster models [1], [2].

Long design cycles are needed to implement suitable FPGA firmware optimizing latency and resource utilization. Developers would benefit from a methodology guiding DNN implementation on SoC/FPGA, unifying existing techniques and considering the interaction between the FPGA and the processor that make up the SoC.

Recent studies in this field usually focus on the integration of pruning and/or quantization methods for model compression, and only a few contributions address the entire development cycle [3], [4], [5]. Specifically, Fahim et al. [3] proposed a co-design workflow, including quantization-aware training and pruning, to optimize architectures for low-power FPGA-based devices, without including knowledge distillation (KD) or a strategy for FPGA verification. Furthermore, their co-design workflow was validated with the MNIST dataset rather than real-world applications. The letter in [5] presents a co-design methodology for FPGA-based DNN accelerators developed for inference at the edge, but it requires a high degree of data transfer between on-chip and off-chip memory. The proposal in [4] presents an FPGA/DNN co-design for low-end devices, but it is limited to convolutional neural networks (CNNs) and assumes the weights are stored in external memory.

In this letter, we propose a workflow1 that integrates KD, pruning, and quantization to efficiently compress and deploy DNN-based classifiers on SoC/FPGA, addressing the entire development cycle: from the ML-based architecture training to the hardware deployment, tackling the drawbacks presented in [3], [4], and [5]. The inclusion of the KD technique, combined with hyperparameter optimization (HPO) through Bayesian optimization (BO), allows the generation of a reduced and more efficient model (student model) that mimics the overall accuracy of a teacher architecture.

1 Available on: https://github.com/RomiSolMolina/workflowCompressionML.

HPO is carried out using BO because it is supported by the literature as one of the most effective techniques for performing this task [6]. In contrast, grid search suffers from the curse of dimensionality, while random search is more efficient than grid search but is unreliable for training some complex models [7]. In the proposed methodology, the developer defines the different topologies (teacher and student) and employs HPO to find the best set of values for the previously defined DNN topology.

The proposed workflow allows full on-chip storage of ML-based models. This makes it suitable for applications that require fast inference and energy efficiency, unlike the contributions in [4] and [5], which require a high degree of data transfer between the on-chip and off-chip memory. The reduced model is obtained at the expense of a longer training
time and a more extensive search space for HPO. The main contributions of this letter can be summarized as follows.
1) A workflow based on KD combined with quantization-aware pruning (QAP) and HPO, aiming to guide the developer in obtaining and compressing ML-based models and implementing them on SoC/FPGA.
2) An SoC/FPGA framework containing the ComBlock IP core [8], which facilitates the communication protocols and the control and evaluation of the ML-based hardware from the SoC's soft-core or hard-core processor.
3) The incorporation of KD and BO in the training steps, allowing a full on-chip deployment on the FPGA, thus avoiding off-chip communication bottlenecks.
The remainder of this letter is organized as follows. Section II introduces the workflow to compress and deploy DNNs on SoC/FPGA. The experiments and results are presented in Section III. Finally, Section IV summarizes the conclusions.

II. WORKFLOW
The main steps of the proposed workflow are shown in Fig. 1. The input to the workflow is the dataset used to train and compress the target model, as described in Section II-A. Hereafter, a data structure is generated with information regarding the layers, weights, and bias of the model. This data structure is used by a hardware synthesis tool for ML to translate the DNN model, as presented in Section II-B. Subsequently, inference hardware is created and integrated into an assessment template to verify its functionality on the SoC/FPGA device, as introduced in Section II-C.

The implementation details are available in the repository: https://github.com/RomiSolMolina/workflowCompressionML.
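Section II-B details the hardware synthesis step; purely as a minimal illustration of the translation outlined above, the sketch below assumes an hls4ml-style flow in which the trained Keras model (here called student_model) is converted into an HLS project. The tool choice, FPGA part, I/O type, and directory name are assumptions for illustration, not necessarily those used in this work.

```python
# Hedged sketch: translating a trained/compressed Keras model into inference
# hardware, assuming an hls4ml-style flow (illustrative only).
import hls4ml

# Per-layer configuration (precision, reuse factor) derived from the Keras model.
config = hls4ml.utils.config_from_keras_model(student_model, granularity="name")

hls_model = hls4ml.converters.convert_from_keras_model(
    student_model,                 # assumed: the compressed student model
    hls_config=config,
    output_dir="hls_prj",          # placeholder output directory
    part="xc7z020clg400-1",        # placeholder SoC/FPGA part
    io_type="io_parallel",         # parallel I/O, weights kept on-chip
)

hls_model.compile()                # builds the C-simulation library for functional checks
# hls_model.build(csim=False, synth=True)   # run HLS synthesis when ready
```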
Fig. 2. Training and compression stages.

A. DNN Training and Compression

The training and compression of the DNN classifier include the following stages (Fig. 2):

1) Stage 1—Teacher Training: The teacher hypermodel is created by the user, who defines the number and type of layers and the search space, i.e., a range of values per layer chosen with a heuristic approach. Using the dataset and the teacher hypermodel, the BO process is performed according to the number of iterations specified by the developer. BO is derived from probabilistic models and efficiently explores the hyperparameter space to identify the configuration that maximizes or minimizes an objective function (e.g., model accuracy or loss). In this process, BO guides the search of hyperparameters using past evaluation results. During the search, a new architecture is compiled and trained for each iteration, whereas the optimizer looks at a combination of hyperparameters within the defined search space, considering the validation accuracy as the objective function.

After optimizing, the network is trained with the optimal configuration found, generating a teacher model with 32-bit floating-point precision (teacherFP). Alternatively, a pretrained model can be used as the teacherFP.

2) Stage 2—Student-Aware Training: The corresponding student hypermodel is constructed, and a search space is created within a range of values that defines the number of neurons and kernels per layer. Furthermore, the hypermodel is defined with n-bit fixed-point precision (where n is the number of bits) for the quantization (Q) method and with a target sparsity for pruning (P). Because the student network has a reduced number of parameters compared to the teacher network, the range of values for the search space (for each layer) is set to a smaller number. Using the dataset, the student hypermodel, and the teacherFP model, the BO process is performed according to the number of iterations specified by the developer. The learning approach chosen during the optimization is KD, which is combined with QAP to transfer the knowledge of the teacherFP to a smaller architecture. QAP incorporates quantization and pruning into the student training process.

Once the optimization process is completed, the selected hyperparameter values are delivered to the developer. With these values, the knowledge of the teacherFP is distilled into the student network, previously defined by the number of bits and target sparsity used in the optimization process. QAP is employed along with the learning process through KD. Finally, the compressed studentKDQP model is trained.
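As a minimal sketch of how the student-aware training described above could be expressed, the code below combines a temperature-scaled KD loss with QAP, assuming QKeras for the n-bit fixed-point quantizers and the TensorFlow Model Optimization Toolkit for magnitude pruning. The bit width, target sparsity, temperature, loss weight, toy topology, and the variable teacher_fp (the trained 32-bit teacher) are illustrative assumptions; in the workflow, these values come out of the HPO stage.

```python
# Hedged sketch: student training with KD combined with quantization-aware pruning (QAP).
import tensorflow as tf
import tensorflow_model_optimization as tfmot
from qkeras import QConv2D, QDense, QActivation, quantized_bits, quantized_relu

BITS, TARGET_SPARSITY, TEMPERATURE, ALPHA = 8, 0.5, 4.0, 0.1   # assumed values

prune = tfmot.sparsity.keras.prune_low_magnitude
schedule = tfmot.sparsity.keras.ConstantSparsity(TARGET_SPARSITY, begin_step=0)
q = quantized_bits(BITS, 0, alpha=1)          # n-bit fixed-point quantizer (Q)

# Student topology with quantized, pruned layers (P is applied during training).
student = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),        # assumed input shape
    prune(QConv2D(8, 3, kernel_quantizer=q, bias_quantizer=q), pruning_schedule=schedule),
    QActivation(quantized_relu(BITS)),
    tf.keras.layers.Flatten(),
    prune(QDense(10, kernel_quantizer=q, bias_quantizer=q), pruning_schedule=schedule),  # assumed 10 classes
])

cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
kld = tf.keras.losses.KLDivergence()
opt = tf.keras.optimizers.Adam(1e-3)

@tf.function
def kd_step(x, y, teacher_fp):
    # Soft targets from the frozen 32-bit floating-point teacher (teacherFP).
    soft_teacher = tf.nn.softmax(teacher_fp(x, training=False) / TEMPERATURE)
    with tf.GradientTape() as tape:
        logits = student(x, training=True)
        soft_student = tf.nn.softmax(logits / TEMPERATURE)
        # Combined loss: hard-label term plus temperature-scaled distillation term.
        loss = (ALPHA * cce(y, logits)
                + (1.0 - ALPHA) * TEMPERATURE ** 2 * kld(soft_teacher, soft_student))
    grads = tape.gradient(loss, student.trainable_variables)
    opt.apply_gradients(zip(grads, student.trainable_variables))
    return loss

# Training loop outline: the pruning wrappers need their step advanced every batch.
# step_cb = tfmot.sparsity.keras.UpdatePruningStep()
# step_cb.set_model(student); step_cb.on_train_begin()
# for epoch in range(num_epochs):
#     for step, (x, y) in enumerate(train_ds):
#         step_cb.on_train_batch_begin(batch=step)
#         kd_step(x, y, teacher_fp)
# studentKDQP = tfmot.sparsity.keras.strip_pruning(student)   # remove wrappers after training
```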
3) General Key Points: BO, implemented through KerasTuner [9], performs parameter searches based on developer-configured settings, generating architectures with more or fewer parameters. The heuristic search spaces for teacher optimization range between 32 and 300, whereas those for the student range from 1 to 20. Fig. 3 shows an example of the search space definition considering one convolutional layer. To tune the number of kernels in the first layer, the algorithm searches for an optimal value between 32 and 128 in increments of 8. The optimal learning rate for the teacher and student models is selected from 0.01, 0.001, or 0.0001.
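To make the Fig. 3-style search-space definition concrete, the following minimal sketch shows how such a hypermodel and BO search could be written with KerasTuner; the hyperparameter names, trial budget, input shape, and number of classes are illustrative assumptions rather than the exact settings used here.

```python
# Hedged sketch: Bayesian-optimization HPO over a convolutional hypermodel with KerasTuner.
import keras_tuner as kt
import tensorflow as tf

def build_hypermodel(hp):
    # Kernels of the first convolutional layer: 32 to 128 in increments of 8.
    n_kernels = hp.Int("conv1_kernels", min_value=32, max_value=128, step=8)
    # Learning rate selected from 0.01, 0.001, or 0.0001.
    lr = hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 1)),               # assumed input shape
        tf.keras.layers.Conv2D(n_kernels, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"), # assumed 10 classes
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# BO explores the defined search space, using validation accuracy as the objective.
tuner = kt.BayesianOptimization(build_hypermodel,
                                objective="val_accuracy",
                                max_trials=30,           # iterations set by the developer
                                overwrite=True,
                                project_name="teacher_hpo")
# tuner.search(x_train, y_train, epochs=20, validation_data=(x_val, y_val))
# best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
```

The same pattern applies to the student hypermodel, with smaller per-layer ranges and the KD/QAP training described above used inside the search.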
The dataset was divided into training, validation, and testing (70%, 15%, and 15%, respectively) for both stages of the
TABLE I
Evaluation. The acronyms in the table are P: parameters, CR: compression ratio, NL: number of layers (only convolutional and fully connected layers), OA: overall accuracy, SP: sparsity, and L: latency without data transfer at 200 MHz (5-ns clock cycle).