
Hands On - Setup

● The interactive part is served with Python notebooks


● Open https://cern.ch/ssummers/hls4ml-tutorial in your web browser
● Authenticate with your Github account (login if necessary)
● If you’re new to Jupyter notebooks, select a cell and hit “shift + enter” to execute the code
● If you have a Vivado installation yourself, you might prefer to work locally; see the ‘conda’ section at:
https://github.com/fastmachinelearning/hls4ml-tutorial

Aug 18, 2021 hls4ml tutorial


1
hls4ml tutorial
FastML Workshop 2020
Sioni Summers et al. for the hls4ml team
2
Introduction
● hls4ml is a package for translating neural networks to FPGA firmware, for inference with extremely low latency
○ https://github.com/hls-fpga-machine-learning/hls4ml
○ https://fastmachinelearning.org/hls4ml/
○ pip install hls4ml

● In this session you will get hands-on experience with the hls4ml package (see the conversion sketch below)
● We’ll learn how to:
○ Translate models into synthesizable FPGA code
○ Explore the different handles provided by the tool to optimize the inference
■ Latency, throughput, resource usage
○ Make our inference more computationally efficient with pruning and quantization
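As a preview of the workflow used throughout the tutorial, a minimal conversion sketch in Python (the file name, FPGA part number and argument names follow the tutorial notebooks; treat them as placeholders for your own setup):

import hls4ml
from tensorflow import keras

# Load a trained Keras model (placeholder file name)
model = keras.models.load_model('model.h5')

# Generate a per-layer hls4ml configuration from the Keras model
config = hls4ml.utils.config_from_keras_model(model, granularity='model')

# Convert to an HLS project for a chosen FPGA part (example part number)
hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config,
    output_dir='my-hls-test', part='xcu250-figd2104-2L-e')

# Bit-accurate emulation of the firmware on the CPU
hls_model.compile()
y_hls = hls_model.predict(X_test)   # X_test: your own numpy input array

# Run Vivado HLS C-synthesis (needs a local Vivado installation)
hls_model.build(csim=False)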

Aug 18, 2021 hls4ml tutorial


3
Why FPGAs?

Aug 18, 2021 hls4ml tutorial


4
LHC Experiment Data Flow

[Diagram: DATA FLOW: pp collisions (40 MHz) → L1 Trigger → High-Level Trigger → Offline computing]
L1 trigger:
∙ 40 MHz in / 100 kHz out
∙ Process 100s TB/s
∙ Trigger decision to be made in ≈ 10 μs
∙ Coarse local reconstruction
∙ FPGAs / Hardware implemented
hls4ml tutorial
5
hls4ml origins: triggering at (HL-)LHC

Extreme collision frequency of 40 MHz → extreme data rates O(100 TB/s)


Most collision “events” don’t produce interesting physics
“Triggering” = filter events to reduce data rates to manageable levels

hls4ml tutorial 6
LHC Experiment Data Flow

[Diagram: DATA FLOW: pp collisions (40 MHz) → L1 Trigger → High-Level Trigger → Offline computing]

Deploy ML algorithms very early in the game


Challenge: strict latency constraints!

hls4ml tutorial
7
The challenge: triggering at (HL-)LHC
The trigger discards events forever, so selection must be very precise
ML can improve sensitivity to rare physics
Needs to be fast!
Enter: hls4ml (high level synthesis for machine learning)

hls4ml tutorial 8
Muon trigger example

Aug 18, 2021 hls4ml tutorial


9
hls4ml: progression
● Previous slides showed the original motivation for hls4ml
○ Extreme low latency, high throughput domain
● Since then, we have been expanding!
○ Longer latency domains, larger models, resource constrained
○ Different FPGA vendors
○ New applications, new architectures
● While maintaining core characteristics:
○ “Layer-unrolled” HLS library → not another DPU
○ Extremely configurable: precision, resource vs latency/throughput tradeoff
○ Research project, application- and user-driven
○ Accessible, easy to use

Aug 18, 2021 hls4ml tutorial


10
Recent Developments
hls4ml community is very active!
● Binary & ternary neural networks: [2020 Mach. Learn.: Sci. Technol.]
○ Compressed weights for low-resource inference
● Boosted decision trees: [JINST 15 P05026 (2020)]
○ Low latency for decision tree ensembles
● GarNet / GravNet: [arXiv:2008.03601]
○ Distance-weighted graph neural networks suitable for sparse/irregular point-cloud data
● Quantization-aware training with QKeras + support in hls4ml: [arXiv:2006.10159]
● Convolutional neural networks: [Mach. Learn.: Sci. Technol. 2 045015 (2021)]

Aug 18, 2021 hls4ml tutorial


11
Coming Soon
● A few exciting new things are being developed and should become available soon:
○ Intel Quartus HLS, Mentor Catapult HLS, Intel oneAPI ‘backends’
○ Recurrent neural networks
○ More integrated ‘end-to-end’ flow with bitfile generation and host bindings for platforms like Alveo, PYNQ
■ Bundled into the MLCommons Tiny submission -- image classification and anomaly detection
https://mlcommons.org/en/news/mlperf-tiny-v05/

Aug 18, 2021 hls4ml tutorial


12
What are FPGAs?
[FPGA diagram]
Field Programmable Gate Arrays are reprogrammable integrated circuits

Contain many different building blocks (‘resources’) which are connected together as you desire

Originally popular for prototyping ASICs, but now also for high performance computing

Aug 18, 2021 hls4ml tutorial


13
What are FPGAs?
[FPGA diagram]
Field Programmable Gate Arrays are reprogrammable integrated circuits

Logic cells / Look Up Tables perform arbitrary functions on small bitwidth inputs (2-6)

These can be used for boolean operations, arithmetic, small memories

Flip-Flops register data in time with the clock pulse

[Logic cell diagram: look-up table (logic) + flip-flop (registers)]

Aug 18, 2021 hls4ml tutorial


14
What are FPGAs?
[FPGA diagram]
Field Programmable Gate Arrays are reprogrammable integrated circuits

DSPs (Digital Signal Processor) are specialized units for multiplication and arithmetic

Faster and more efficient than using LUTs for these types of operations

And for Neural Nets, DSPs are often the most scarce

[DSP diagram: DSP (multiplication)]

Aug 18, 2021 hls4ml tutorial


15
What are FPGAs?
[FPGA diagram]
Field Programmable Gate Arrays are reprogrammable integrated circuits

BRAMs are small, fast memories - RAMs, ROMs, FIFOs (18Kb each in Xilinx)

Memories using BRAMs are more efficient than using LUTs

A big FPGA has nearly 100Mb of BRAM, chained together as needed

Aug 18, 2021 hls4ml tutorial


16
What are FPGAs?
[FPGA diagram]
In addition, there are specialised blocks for I/O, making FPGAs popular in embedded systems and HEP triggers

High speed transceivers with Tb/s total bandwidth
PCIe, (Multi) Gigabit Ethernet, Infiniband

AND: Support highly parallel algorithm implementations

Low power per Op (relative to CPU/GPU)

Aug 18, 2021 hls4ml tutorial


17
Why are FPGAs Fast?
● Fine-grained / resource parallelism
○ Use the many resources to work on different parts of the problem simultaneously
○ Allows us to achieve low latency
● Most problems have at least some sequential aspect, limiting how low latency we can go
○ But we can still take advantage of it with…
● Pipeline parallelism
○ Use the register pipeline to work on different data simultaneously
○ Like a production line for data…
○ Allows us to achieve high throughput

Aug 18, 2021 hls4ml tutorial


18
How are FPGAs programmed?
Hardware Description Languages
HDLs are programming languages which describe electronic circuits

High Level Synthesis
Compile from C/C++ to VHDL
Pre-processor directives and constraints used to optimize the design
Drastic decrease in firmware development time!

Today we’ll use Xilinx Vivado HLS [*]

[*] https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug902-vivado-high-level-synthesis.pdf
Aug 18, 2021 hls4ml tutorial
19
Jargon
● LUT - Look Up Table aka ‘logic’ - generic functions on small bitwidth inputs. Combine many to build the algorithm
● FF - Flip Flop - controls the flow of data with the clock pulse. Used to build the pipeline and achieve high throughput
● DSP - Digital Signal Processor - performs multiplication and other arithmetic in the FPGA
● BRAM - Block RAM - hardened RAM resource. More efficient memories than using LUTs for more than a few elements
● HLS - High Level Synthesis - compiler for C, C++, SystemC into FPGA IP cores
● HDL - Hardware Description Language - low level language for describing circuits
● RTL - Register Transfer Level - the very low level description of the function and connection of logic gates
● Latency - time between starting processing and receiving the result
○ Measured in clock cycles or seconds
● II - Initiation Interval - time from accepting first input to accepting the next input

Aug 18, 2021 hls4ml tutorial


20
high level synthesis for machine learning

Catapult: coming soon

https://fastmachinelearning.org/hls4ml/
Aug 18, 2021 hls4ml tutorial
21
Neural network inference

Aug 18, 2021 hls4ml tutorial


22
Neural network inference
[Diagram: network layers L1 … Ln … LN on the FPGA; labels: precomputed and stored in BRAMs, DSPs, logic cells]

Aug 18, 2021 hls4ml tutorial


23
Neural network inference
[Diagram: network layers L1 … Ln … LN on the FPGA; labels: precomputed and stored in BRAMs, DSPs, logic cells]

How many resources? DSPs, LUTs, FFs?
Does the model fit in the latency requirements?

Aug 18, 2021 hls4ml tutorial


24
Efficient NN design for FPGAs
FPGAs provide huge flexibility
Performance depends on how well you take advantage of this
Constraints: input bandwidth, FPGA resources, latency

Today you will learn how to optimize your project through:
- compression: reduce number of synapses or neurons
- quantization: reduces the precision of the calculations (inputs, weights, biases)
- parallelization: tune how much to parallelize to make the inference faster/slower versus FPGA resources

[Diagram: iterative loop between NN training and FPGA project designing]

Aug 18, 2021 hls4ml tutorial


26
What we won’t cover today
● Two new tutorial notebooks are not yet ready, but will be soon!
○ Boosted decision trees: implemented in a companion package to hls4ml
■ https://github.com/thesps/conifer
○ Convolutional NNs: convolutional layers can quickly increase in number of operations, recently available in hls4ml at larger scales
■ https://arxiv.org/abs/2101.05108
● What comes after hls4ml… you would need to integrate the ‘IP core’ into a larger design
○ For a custom board, you’d need to do this by hand (e.g. CMS L1 Trigger, National Instruments DAQ framework)
○ For more off-the-shelf boards, integration with system-on-chip or host CPU can be more straightforward
■ https://github.com/mlcommons/tiny_results_v0.5/tree/main/open/hls4ml

Aug 18, 2021 hls4ml tutorial


27
Today’s hls4ml hands on
∙ Part 1:
- Get started with hls4ml: train a basic model and run the conversion, simulation & c-synthesis
steps

∙ Part 2:
- Learn how to tune inference performance with quantization & ReuseFactor

∙ Part 3:
- Perform model compression and observe its effect on the FPGA resources/latency

∙ Part 4:
- Train using QKeras “quantization aware training” and study impact on FPGA metrics

Aug 18, 2021 hls4ml tutorial


28
hls4ml tutorial
Part 1: Model Conversion

29
Physics case: jet tagging
Study a multi-classification task to be implemented on FPGA: discrimination between highly
energetic (boosted) q, g, W, Z, t initiated jets

Jet = collimated ‘spray’ of particles

[Jet class illustrations: top, Z, W, other quark, gluon]
top: t→bW→bqq, 3-prong jet
Z: Z→qq, 2-prong jet
W: W→qq, 2-prong jet
other quark / gluon: q/g background, no substructure and/or mass ~ 0

Reconstructed as one massive jet with substructure

Aug 18, 2021 hls4ml tutorial 30


Physics case: jet tagging

[Jet class illustrations: top, Z, W, other quark, gluon]

Input variables: several observables known to have high discrimination power from offline data analyses and published studies [*]

[*] D. Guest et al., PhysRevD.94.112002; G. Kasieczka et al., JHEP05(2017)006; J. M. Butterworth et al., PhysRevLett.100.242001; etc.

Aug 18, 2021 hls4ml tutorial


31
Physics case: jet tagging
● We’ll train the five class multi-classifier on a sample of ~ 1M events with two boosted
WW/ZZ/tt/qq/gg anti-kT jets
○ Dataset DOI: 10.5281/zenodo.3602254
○ OpenML: https://www.openml.org/d/42468
● Fully connected neural network with 16 expert-level inputs:
○ ReLU activation function for intermediate layers
○ Softmax activation function for output layer

[ROC curve plot; arrow indicates ‘better’]
AUC = area under ROC curve (100% is perfect, 20% is random)
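A minimal Keras sketch of such an architecture (the hidden-layer sizes here are illustrative; the tutorial notebook defines the actual model):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# 16 expert-level inputs, 5 output classes (q, g, W, Z, t)
model = Sequential([
    Dense(64, input_shape=(16,), name='fc1'),
    Activation('relu'),
    Dense(32, name='fc2'),
    Activation('relu'),
    Dense(32, name='fc3'),
    Activation('relu'),
    Dense(5, name='output'),
    Activation('softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])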
Aug 18, 2021 hls4ml tutorial
32
Hands On - Setup
● The interactive part is served with Python notebooks
● Open https://cern.ch/ssummers/hls4ml-tutorial in your web browser
● Authenticate with your Github account (login if necessary)
● Open and start running through “part1_getting_started” !
● If you’re new to Jupyter notebooks, select a cell and hit “shift + enter” to execute the code
● If you have a Vivado installation yourself, you might prefer to work locally; see the ‘conda’ section at:
https://github.com/fastmachinelearning/hls4ml-tutorial

Aug 18, 2021 hls4ml tutorial


33
hls4ml Tutorial
Part 2: Advanced Configuration

34
Efficient NN design: quantization
∙ In the FPGA we use fixed point representation: ap_fixed<width bits, integer bits>
e.g. 0101.1011101010 (integer bits, then fractional bits; total number of bits = width)
− Operations are integer ops, but we can represent fractional values
∙ But we have to make sure we’ve used the correct data types! (see the configuration sketch below)

[Plots: FPGA AUC / Expected AUC vs number of bits]
∙ Scan integer bits (fractional bits fixed to 8): full performance at 6 integer bits
∙ Scan fractional bits (integer bits fixed to 6): full performance at 8 fractional bits
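In hls4ml the data types are set through the configuration dictionary; a sketch (the layer name 'fc1' and the bit widths are illustrative, and the dictionary layout follows recent hls4ml versions):

import hls4ml

# model: your trained Keras model
# Per-layer configuration generated from the model
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Default precision for the whole model: 16 bits total, 6 integer bits
config['Model']['Precision'] = 'ap_fixed<16,6>'

# Override the weight precision of a single layer
config['LayerName']['fc1']['Precision']['weight'] = 'ap_fixed<8,2>'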

Aug 18, 2021 hls4ml tutorial


35
Efficient NN design: quantization

Aug 18, 2021 hls4ml tutorial


36
Efficient NN design: parallelization
∙ Trade-off between latency and FPGA resource usage determined by the parallelization of the
calculations in each layer

∙ Configure the “reuse factor” = number of times a multiplier is used to do a computation (see the sketch below)
∙ Reuse factor: how much to parallelize operations in a hidden layer
∙ Fully serial (high reuse factor): fewer resources, lower throughput, higher latency
∙ Fully parallel (reuse factor = 1): more resources, higher throughput, lower latency
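The reuse factor is set through the same configuration dictionary; a sketch (the layer name and values are illustrative):

import hls4ml

# model: your trained Keras model
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# ReuseFactor = 1: fully parallel (lowest latency, most DSPs)
# ReuseFactor = n: each multiplier is reused n times (fewer DSPs, higher latency)
config['Model']['ReuseFactor'] = 1

# It can also be set per layer (layer name 'fc1' is illustrative)
config['LayerName']['fc1']['ReuseFactor'] = 4

hls_model = hls4ml.converters.convert_from_keras_model(model, hls_config=config)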
Aug 18, 2021 hls4ml tutorial
37
Parallelization: DSP usage
[Plot: DSP usage vs reuse factor. Fully parallel (each mult. used 1x) uses the most resources; each mult. used 2x, 3x, … uses progressively fewer DSPs at the cost of longer latency]

Aug 18, 2021 hls4ml tutorial


38
Parallelization: Timing
[Plot: latency of layer m in clock cycles vs reuse factor. Fully parallel (each mult. used 1x): ~ 75 ns, more resources; each mult. used 3x, …, 6x: progressively longer latency, up to ~ 175 ns]
Aug 18, 2021 hls4ml tutorial
39
Large MLP
● ‘Strategy: Resource’ for larger networks and higher reuse factor
● Uses a slightly different HLS implementation of the dense layer to compile faster and better for large layers
● Here, we use a different partitioning on the first layer for the best partitioning of arrays

IOType: io_parallel # options: io_serial/io_parallel
HLSConfig:
  Model:
    Precision: ap_fixed<16,6>
    ReuseFactor: 128
    Strategy: Resource
  LayerName:
    dense1:
      ReuseFactor: 112

This config is for a model trained on the MNIST digits classification dataset
Architecture (fully connected): 784 → 128 → 128 → 128 → 10
Model accuracy: ~97%
We can work out how many DSPs this should use...

Aug 18, 2021 hls4ml tutorial


40
Large MLP
∙ It takes a while to synthesise, so here’s one I made earlier…

∙ The DSPs should be: (784 x 128) / 112 + (2 x 128 x 128 + 128 x 10) / 128 = 1162 🤞
+ Timing (ns):
* Summary:
+--------+-------+----------+------------+
| Clock  | Target| Estimated| Uncertainty|
+--------+-------+----------+------------+
|ap_clk  |   5.00|     4.375|        0.62|
+--------+-------+----------+------------+

+ Latency (clock cycles):
* Summary:
+-----+-----+-----+-----+----------+
|  Latency  |  Interval | Pipeline |
| min | max | min | max |   Type   |
+-----+-----+-----+-----+----------+
|  518|  522|  128|  128| dataflow |
+-----+-----+-----+-----+----------+

=====================================
== Utilization Estimates
=====================================
* Summary:
+---------------------+---------+-------+---------+--------+
| Name                | BRAM_18K| DSP48E| FF      | LUT    |
+---------------------+---------+-------+---------+--------+
| ...                 |      ...|    ...|      ...|     ...|
+---------------------+---------+-------+---------+--------+
|Total                |     1962|   1162|   169979|  222623|
+---------------------+---------+-------+---------+--------+
|Available SLR        |     2160|   2760|   663360|  331680|
+---------------------+---------+-------+---------+--------+
|Utilization SLR (%)  |       90|     42|       25|      67|
+---------------------+---------+-------+---------+--------+
|Available            |     4320|   5520|  1326720|  663360|
+---------------------+---------+-------+---------+--------+
|Utilization (%)      |       45|     21|       12|      33|
+---------------------+---------+-------+---------+--------+

II determined by the largest reuse factor


Aug 18, 2021 hls4ml tutorial
41
hls4ml Tutorial
Part 3: Compression

42
NN compression methods
● Network compression is a widespread technique to reduce the size, energy consumption, and
overtraining of deep neural networks
● Several approaches have been studied:
○ parameter pruning: selective removal of weights based on a particular ranking
[arxiv.1510.00149, arxiv.1712.01312]
○ low-rank factorization: using matrix/tensor decomposition to estimate informative parameters
[arxiv.1405.3866]
○ transferred/compact convolutional filters: special structural convolutional filters to save
parameters [arxiv.1602.07576]
○ knowledge distillation: training a compact network with distilled knowledge of a large network
[doi:10.1145/1150402.1150464]
● Today we’ll use the TensorFlow model sparsity toolkit
○ https://blog.tensorflow.org/2019/05/tf-model-optimization-toolkit-pruning-API.html
● But you can use other methods!

Aug 18, 2021 hls4ml tutorial


43
TF Sparsity
● Iteratively remove low magnitude weights, starting with 0 sparsity, smoothly increasing
up to the set target as training proceeds
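A sketch of what this looks like with the TensorFlow model optimization toolkit (the target sparsity, schedule steps and training settings are illustrative):

import tensorflow_model_optimization as tfmot

# model: your trained (or freshly built) Keras model
# Ramp sparsity from 0% to 75% between training steps 2000 and 10000
pruning_params = {'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.75, begin_step=2000, end_step=10000)}
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)

pruned_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
pruned_model.fit(X_train, y_train, epochs=10,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])  # advances the schedule

# Remove the pruning wrappers before converting the model with hls4ml
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)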

Aug 18, 2021 hls4ml tutorial


44
Efficient NN design: compression

[Plot: DSP usage vs compression. Fully parallelized (max DSP use) compared with the number of DSPs available; 70% compression ~ 70% fewer DSPs]

● DSPs (used for multiplication) are often the limiting resource
○ maximum use when fully parallelized
○ DSPs have a max size for input (e.g. 27x18 bits), so the number of DSPs per multiplication changes with precision

Aug 18, 2021 hls4ml tutorial


45
hls4ml Tutorial
Part 4: Quantization

46
Efficient NN design: quantization
● hls4ml allows you to use different data types everywhere; we saw how to tune that in part 2
● We will also try quantization-aware training with QKeras (part 4)
● With quantization-aware training we can even go down to just 1 or 2 bits
○ See our recent work: https://arxiv.org/abs/2003.06308
● See other talks on quantization at this workshop: Amir, Thea, Benjamin

Aug 18, 2021 hls4ml tutorial


47
QKeras
● QKeras is a library to train models with quantization in the training
○ Developed & maintained by Google
● Easy to use, drop-in replacements for Keras layers (see the sketch below)
○ e.g. Dense → QDense
○ e.g. Conv2D → QConv2D
○ Use ‘quantizers’ to specify how many bits to use where
○ Same kind of granularity as hls4ml
● Can achieve good performance with very few bits
● We’ve recently added support for QKeras-trained models to hls4ml
○ The number of bits used in training is also used in inference
○ The intermediate model is adjusted to capture all optimizations possible with QKeras
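A sketch of the drop-in replacement for a small jet-tagging-style model (bit widths and layer names are illustrative):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

model = Sequential([
    # 6-bit weights and biases (0 integer bits); alpha=1 keeps the quantizer scale fixed
    QDense(64, input_shape=(16,), name='fc1',
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0, alpha=1)),
    QActivation(quantized_relu(6), name='relu1'),
    QDense(5, name='output',
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0, alpha=1)),
    Activation('softmax', name='softmax'),
])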
hls4ml tutorial
48
Summary
● After this session you’ve gained some hands on experience with hls4ml
○ Translated neural networks to FPGA firmware, run simulation and synthesis
● Tuned network inference performance with precision and ReuseFactor
○ Used profiling and trace tools to guide tuning
● Learned how to simply prune a neural network and saw the impact on resources
● Trained a model with a small number of bits using QKeras, and used the same spec in inference easily with hls4ml
● The tutorial server is always available at https://cern.ch/ssummers/hls4ml-tutorial
● You can find these tutorial notebooks to run locally at:
https://github.com/fastmachinelearning/hls4ml-tutorial
● You can run the tutorial Docker image yourself like:
○ docker run -p 8888:8888 gitlab-registry.cern.ch/ssummers/hls4ml-tutorial:12.v
○ 15 GB download! Or remove ‘.v’ for a much smaller image but without Xilinx tools (so no ‘build’)
● Use hls4ml in your own environment: pip install hls4ml[profiling]

Aug 18, 2021 hls4ml tutorial


49
