arXiv : hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices

Fahim, Farah; Hirschauer, James; Luo, Yingyi; Herwig, Christian; Summers, Sioni; Harris, Philip; Krupa, Jeffrey; Hoang, Duc; Kreinar, Edward; Pierini, Maurizio; Javed, Hamza; Ogrenci-Memik, Seda; Aarrestad, Thea; Hawks, Benjamin; Hauck, Scott; Loncar, Vladimir; Tran, Nhan; Wu, Zhenbin; Jindariani, Sergo; Di Guglielmo, Giuseppe; Duarte, Javier; Liu, Mia; Carloni, Luca P.; Rankin, Dylan; Hester, Josiah; Ngadiuba, Jennifer; Mamish, John; Pol, Adrian Alan; Valentin, Manuel Blanco; Hsu, Shih-Chieh
002754189 001__ 2754189
002754189 005__ 20240217072842.0
002754189 0248_ $$aoai:cds.cern.ch:2754189$$pcerncds:FULLTEXT$$pcerncds:CERN:FULLTEXT$$pcerncds:CERN
002754189 037__ $$9arXiv$$aarXiv:2103.05579$$ccs.LG
002754189 037__ $$9arXiv:reportnumber$$aFERMILAB-CONF-21-080-SCD
002754189 035__ $$9arXiv$$aoai:arXiv.org:2103.05579
002754189 035__ $$9Inspire$$aoai:inspirehep.net:1850827$$d2024-02-16T15:14:32Z$$h2024-02-17T03:00:35Z$$mmarcxml$$ttrue$$uhttps://inspirehep.net/api/oai2d
002754189 035__ $$9Inspire$$a1850827
002754189 041__ $$aeng
002754189 100__ $$aFahim, Farah$$uFermilab$$uNorthwestern U.$$vFermilab$$vNorthwestern U.
002754189 245__ $$9arXiv$$ahls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices
002754189 269__ $$c2021-03-09
002754189 300__ $$a10 p
002754189 500__ $$9arXiv$$a10 pages, 8 figures, TinyML Research Symposium 2021
002754189 520__ $$9arXiv$$aAccessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. To support domain scientists, we have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate machine learning algorithms for implementation with both FPGA and ASIC technologies. We expand on previous hls4ml work by extending capabilities and techniques towards low-power implementations and increased usability: new Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long pipeline kernels for low power, and new device backends, including an ASIC workflow. Taken together, these and continued efforts in hls4ml will arm a new generation of domain scientists with accessible, efficient, and powerful tools for machine-learning-accelerated discovery.
002754189 540__ $$3preprint$$aCC BY-SA 4.0$$uhttp://creativecommons.org/licenses/by-sa/4.0/
002754189 65017 $$2arXiv$$aphysics.ins-det
002754189 65017 $$2SzGeCERN$$aDetectors and Experimental Techniques
002754189 65017 $$2arXiv$$acs.AR
002754189 65017 $$2SzGeCERN$$aComputing and Computers
002754189 65017 $$2arXiv$$acs.LG
002754189 65017 $$2SzGeCERN$$aComputing and Computers
002754189 690C_ $$aCERN
002754189 690C_ $$aPREPRINT
002754189 700__ $$aHawks, Benjamin$$uFermilab$$vFermilab
002754189 700__ $$aHerwig, Christian$$uFermilab$$vFermilab
002754189 700__ $$aHirschauer, James$$uFermilab$$vFermilab
002754189 700__ $$aJindariani, Sergo$$uFermilab$$vFermilab
002754189 700__ $$aTran, Nhan$$uFermilab$$vFermilab
002754189 700__ $$aCarloni, Luca P.$$uColumbia U.$$vColumbia U.
002754189 700__ $$aDi Guglielmo, Giuseppe$$uColumbia U.$$vColumbia U.
002754189 700__ $$aHarris, Philip$$uMIT$$vMIT
002754189 700__ $$aKrupa, Jeffrey$$uMIT$$vMIT
002754189 700__ $$aRankin, Dylan$$uMIT$$vMIT
002754189 700__ $$aValentin, Manuel Blanco$$uNorthwestern U.$$vNorthwestern U.
002754189 700__ $$aHester, Josiah$$uNorthwestern U.$$vNorthwestern U.
002754189 700__ $$aLuo, Yingyi$$uNorthwestern U.$$vNorthwestern U.
002754189 700__ $$aMamish, John$$uNorthwestern U.$$vNorthwestern U.
002754189 700__ $$aOgrenci-Memik, Seda$$uNorthwestern U.$$vNorthwestern U.
002754189 700__ $$aAarrestad, Thea$$uCERN$$vCERN
002754189 700__ $$aJaved, Hamza$$uCERN$$vCERN
002754189 700__ $$aLoncar, Vladimir$$uCERN$$vCERN
002754189 700__ $$aPierini, Maurizio$$uCERN$$vCERN
002754189 700__ $$aPol, Adrian Alan$$uCERN$$vCERN
002754189 700__ $$aSummers, Sioni$$uCERN$$vCERN
002754189 700__ $$aDuarte, Javier$$uUC, San Diego$$vUC, San Diego
002754189 700__ $$aHauck, Scott$$uWashington U., Seattle$$vWashington U., Seattle
002754189 700__ $$aHsu, Shih-Chieh$$uWashington U., Seattle$$vWashington U., Seattle
002754189 700__ $$aNgadiuba, Jennifer$$uCaltech$$vCaltech
002754189 700__ $$aLiu, Mia$$uPurdue U.$$vPurdue U.
002754189 700__ $$aHoang, Duc$$uRhodes Coll.$$vRhodes Coll.
002754189 700__ $$aKreinar, Edward$$uUnlisted, US, VA$$vUnlisted, US, VA
002754189 700__ $$aWu, Zhenbin$$uIllinois U., Chicago$$vIllinois U., Chicago
002754189 773__ $$wC21-03-26
002754189 8564_ $$uhttps://lss.fnal.gov/archive/2021/conf/fermilab-conf-21-080-scd.pdf$$yFermilab Library Server
002754189 8564_ $$82281303$$s6987$$uhttps://cds.cern.ch/record/2754189/files/accuracy-2.png$$y00004 Performance of quantization-aware training from Ref.~\cite{Coelho:2020zfu} in terms of the relative accuracy as a function of bit width. The relative accuracy is evaluated with respect to the floating-point baseline model. The CPU-based emulation (solid green) of the FPGA-based QAT model (solid orange) is compared to the PTQ model (dashed purple).
002754189 8564_ $$82281304$$s22993$$uhttps://cds.cern.ch/record/2754189/files/hls4ml-flow-4.png$$y00000 A typical workflow to translate an ML model into an FPGA or ASIC implementation using hls4ml. The red boxes (left) describe the model training and compression steps performed within conventional ML software frameworks. The hls4ml configuration and conversion steps are shown in the blue boxes (center). The black boxes (right) illustrate possible ways to export and integrate the HLS project into a larger hardware design.
002754189 8564_ $$82281305$$s20031$$uhttps://cds.cern.ch/record/2754189/files/AUCROC_FT_vs_LT_6b32b_v2.png$$y00005 Performance of quantization-aware pruning using the lottery ticket pruning scheme as a function of hardware computational complexity. After QAP, the 6-bit, 80\% pruned model achieves a factor of 50 reduction in BOPs compared to the 32-bit, unpruned model with no loss in performance.
002754189 8564_ $$82281306$$s7990$$uhttps://cds.cern.ch/record/2754189/files/pr_scan_lut_histogram.png$$y00008 DSP (top) and LUT (bottom) usage of the jet substructure classification network as a function of the percentage of the network pruned.
002754189 8564_ $$82281307$$s32118$$uhttps://cds.cern.ch/record/2754189/files/nn_mlp.png$$y00003 Numerical profiling graph (top) from hls4ml for a fully-connected neural network (bottom). The distribution of the absolute value of the weights is shown on the x-axis. The items on the y-axis are the different weights (0) and biases (1) for the model layers.
002754189 8564_ $$82281308$$s141256$$uhttps://cds.cern.ch/record/2754189/files/hls4ml-arch.png$$y00001 Internal structure of the hls4ml package. Model converters translate models from \textsc{Keras}, \textsc{PyTorch}, etc. into an intermediate HLSModel representation. This representation can be further configured and optimized. Different backend writers can be used to export the model into a given vendor-specific language, such as Vitis HLS, Quartus HLS, Catapult HLS, or others.
002754189 8564_ $$82281309$$s16574$$uhttps://cds.cern.ch/record/2754189/files/slider_weights.png$$y00002 Numerical profiling graph (top) from hls4ml for a fully-connected neural network (bottom). The distribution of the absolute value of the weights is shown on the x-axis. The items on the y-axis are the different weights (0) and biases (1) for the model layers.
002754189 8564_ $$82281310$$s1356592$$uhttps://cds.cern.ch/record/2754189/files/2103.05579.pdf$$yFulltext
002754189 8564_ $$82281311$$s8124$$uhttps://cds.cern.ch/record/2754189/files/pr_scan_dsp_histogram.png$$y00007 DSP (top) and LUT (bottom) usage of the jet substructure classification network as a function of the percentage of the network pruned.
002754189 8564_ $$82281312$$s4888$$uhttps://cds.cern.ch/record/2754189/files/largeReuse.png$$y00006 DSP usage for the MNIST neural network implementation where the reuse factor $R$ is scanned. As $R$ is increased, the DSP usage decreases while the latency (not shown) increases accordingly.
002754189 8564_ $$82281313$$s88484$$uhttps://cds.cern.ch/record/2754189/files/designMethodology.png$$y00009 Design and verification stack for the ASIC workflow.
002754189 8564_ $$82325709$$s931594$$uhttps://cds.cern.ch/record/2754189/files/fermilab-conf-21-080-scd.pdf$$yFulltext
002754189 960__ $$a11
002754189 980__ $$aConferencePaper
002754189 980__ $$aPREPRINT