Autoencoders on FPGAs for Real-Time, Unsupervised New Physics Detection at 40 MHz at the Large Hadron Collider
Jennifer Ngadiuba
Fermi National Accelerator Laboratory, Batavia, IL 60510, USA
Thong Q. Nguyen
California Institute of Technology Pasadena, CA 91125, USA
Javier Duarte
University of California San Diego La Jolla, CA 92093, USA
Zhenbin Wu
University of Illinois at Chicago Chicago, IL 60607, USA
(Dated: August 13, 2021)
In this paper, we show how to adapt and deploy anomaly detection algorithms based on deep autoencoders for the unsupervised detection of new physics signatures in the extremely challenging environment of a real-time event selection system at the Large Hadron Collider (LHC). We demonstrate that new physics signatures can be enhanced by three orders of magnitude, while staying within the strict latency and resource constraints of a typical LHC event filtering system. This would allow for collecting datasets potentially enriched with high-purity contributions from new physics processes. Through highly parallel implementations of network layers, support for autoencoder-specific losses on FPGAs, and latent-space-based inference, we demonstrate that anomaly detection can be performed in as little as 80 ns using less than 3% of the logic resources in the Xilinx Virtex VU9P FPGA, opening the way to real-life applications of this idea during the next data-taking campaign of the LHC.
ing so far has been for the HLT, this strategy could be more effective if deployed in the L1T, i.e. before any selection bias is introduced. Due to the extreme latency and computing resource constraints of the L1T, only relatively simple, mostly theory-motivated selection algorithms are currently deployed. These usually include requirements on the minimum energy of a physics object, such as a reconstructed lepton or a jet, effectively excluding lower-energy events from further processing. Instead, by deploying an unbiased algorithm which selects events based on their degree of abnormality, rather than on the amount of energy present in the event, we can collect data in a signal-model-independent way. Such an anomaly detection (AD) algorithm is required to have extremely low latency because of the restrictions imposed by the L1T.

Recent developments of the hls4ml library allow us to consider, for the first time, the possibility of deploying an AD algorithm on the FPGAs mounted on the L1T boards. The hls4ml library is an open-source software, developed to translate neural networks [16–20] and boosted decision trees [21] into FPGA firmware. A fully on-chip implementation of the machine learning model is used in order to stay within the 1 µs latency budget imposed by a typical L1T system. Additionally, the initiation interval (II) of the algorithm should be within 150 ns, which is related to the bunch-crossing time for the upcoming period of the LHC operations. Since there are several L1T algorithms deployed per FPGA, each of them should use much less than the available resources. With its interface to QKeras [22], hls4ml supports quantization-aware training (QAT) [23], which makes it possible to drastically reduce the FPGA resource consumption while preserving accuracy. Using hls4ml we can compress neural networks to fit the limited resources of an FPGA.

In this paper, we discuss how to adapt and improve the strategy presented in Ref. [12] to fit the L1T infrastructure. We focus on AEs, with specific emphasis on VAEs [24, 25]. We consider both fully-connected and convolutional architectures, and discuss how to compress the model through pruning [26–28], the removal of unnecessary operations, and quantization [17, 29–36], the reduction of the precision of operations, at training time.

As discussed in Ref. [12], one can train a (V)AE on a given data sample by minimizing a measure of the distance between the input and the output (the loss function). This strategy, which is very common when using (V)AEs for anomaly detection [37], comes with practical challenges when considering a deployment on FPGAs. The use of high-level features is not optimal because it requires time-consuming data preprocessing. The situation is further complicated for VAEs, which require a random sampling from a Gaussian distribution in the latent space. Furthermore, one has to buffer the input data on chip while the output is generated by the FPGA processing, in order to compute the distance afterwards. To deal with all of these aspects, we explore different approaches and compare the accuracy, latency and resource consumption of the various methods.

In addition, we discuss how to customize the model compression in order to better accommodate unsupervised learning. Previously, we showed that QAT can result in a large reduction in resource consumption with minor accuracy loss for supervised algorithms [19, 23]. In this paper, we extend and adapt that compression workflow to deal with the specific challenge of compressing autoencoders used for AD. Several approaches are possible:

• Post-training quantization (PTQ) [16, 27, 38–41], consisting of applying a fixed-point precision to a floating-point baseline model. This is the simplest quantization approach, typically resulting in good algorithm stability, at the cost of losing performance. More aggressive PTQ (lower precision) is usually accompanied by a larger reduction in accuracy.

• QAT, consisting of imposing the fixed-point precision constraint at training time, e.g., using the QKeras library. This approach typically allows one to limit the accuracy loss when imposing a higher level of quantization, finding a better weight configuration than what one can get with PTQ. However, applying QAT to VAE models for AD can result in unstable performance, because QAT would return the best input-to-output reconstruction performance, but the best reconstruction does not necessarily guarantee the best AD performance. Ultimately, the stability of the result depends on the nature of the detected anomaly.

• Knowledge distillation with QAT: one could change the quantized-model optimization strategy, reframing the problem as knowledge distillation [42–45]. Rather than fixing the quantized weights to minimize the VAE loss, the difference between the loss from the quantized model and the floating-point model for the same input could be minimized. Rather than training a quantized copy of a given floating-point model, one could train a different model to predict this floating-point output, starting from the same input. Doing so, one could aim at targeting the floating-point AD performance with a completely different network (e.g., an MLP regression) that could better meet the constraints of an L1T environment, e.g. being faster or consuming less computing resources.

• Anomaly classification with QAT: the approximated loss regression with QAT could be turned into a classification problem. Rather than approximating the floating-point decision, one could try to obtain a yes/no answer to a different question: would the floating-point algorithm return an AD score larger than a threshold for this event? In this way, one could set the threshold on the accurate floating-point model and could obtain good accuracy (in terms of anomaly acceptance) without having to predict the exact AD score value across multiple orders of magnitude.

In this paper, we focus on the first two approaches, leaving the investigation of the other approaches to future work.

This paper is structured as follows: in Section II we describe the benchmark dataset. In Section III a detailed description of the autoencoder models is given, followed by Section IV in which the definition of the quantities used as anomaly detection scores is presented. In Section V results of the uncompressed and unquantized model are presented. In the next part, Section VI, model compression is detailed. Section VII describes the strategy to compress the models and deploy them on FPGAs, including an assessment of the required FPGA resources.

II. DATA SAMPLES

This study follows the setup of Ref. [12, 46]. We use a data sample that represents a typical proton-proton collision dataset that has been pre-filtered by requiring the presence of an electron or a muon with a transverse momentum pT > 23 GeV and a pseudo-rapidity |η| < 3 (electron) and |η| < 2.1 (muon). This is representative of a typical L1T selection algorithm of a multipurpose LHC experiment. In addition to this, we consider the four benchmark new physics scenarios discussed in Ref. [12]:

• A leptoquark (LQ) with a mass of 80 GeV, decaying to a b quark and a τ lepton [47],

• A neutral scalar boson (A) with a mass of 50 GeV, decaying to two off-shell Z bosons, each forced to decay to two leptons: A → 4ℓ [48],

• A scalar boson with a mass of 60 GeV, decaying to two tau leptons: h0 → ττ [49],

• A charged scalar boson with a mass of 60 GeV, decaying to a tau lepton and a neutrino: h± → τν [50].

These four processes are used to evaluate the accuracy of the trained models. A detailed description of the dataset can be found in Ref. [51].

In total, the background sample consists of 8 million events. Of these, 50% are used for training, 40% for testing, and 10% for validation. The new physics benchmark samples are only used for evaluating the performance of the models.

The training dataset together with the signal datasets for testing are published on Zenodo [47–50, 52].

III. AUTOENCODER MODELS

We consider two classes of architectures: one based on dense feed-forward neural networks (DNNs) and one using convolutional neural networks (CNNs). Both start from the same inputs, namely the (pT, η, φ) values for 18 reconstructed objects (ordered as 4 muons, 4 electrons, and 10 jets), and the φ and magnitude of the missing transverse energy (MET), forming together an input of shape (19, 3), where the MET η values are zero-padded by construction (η is zero for transverse quantities). For events with fewer than the maximum number of muons, electrons, or jets, the input is also zero-padded, as commonly done in the L1T algorithm logic.

In order to account for resource consumption and latency of the data pre-processing step, we use a batch normalization layer [53] as the first layer for each model. As all processing is done on-chip, the resource and latency measurements will be consistent with those of a real L1T implementation. For both architectures, CNN and DNN, we consider both a plain AE and a VAE. In the AE, the encoder provides directly the coordinates of the given input, projected in the latent space. In the VAE, the encoder returns the mean values µ and the standard deviations σ of the N-dimensional Gaussian distribution which represents the latent-space probability density function associated with a given event.

For the DNN model, the four-vector of each reconstructed object is flattened and concatenated into a 1D array, resulting in a 57-dimensional input vector. The DNN AE architecture is shown on the top plot in Figure I. All of the inputs are batch-normalized and passed through a stack of 3 fully connected layers, with 32, 16, and 3 nodes. The output of each layer is followed by a batch normalization layer and activated by a leaky ReLU function [54]. The 3-dimensional output of the encoder is the projection of the input in the latent space. The decoder consists of a stack of 3 layers, with 16, 32, and 57 nodes. As for the encoder, we use a batch normalization layer between the fully connected layer and its activation. The last layer has no activation function, while leaky ReLU is used for the others. The DNN VAE follows the same architecture, except for the latent-space processing, which follows the usual VAE prescription: two 3-dimensional fully connected layers produce the µ and σ vectors from which Gaussian latent quantities are sampled and injected in the decoder.

The CNN AE architecture is shown on the bottom plot in Figure I. The encoder takes as input the single-channel 2D array of four-momenta including the two MET-related features (magnitude and φ angle) and zeros for MET η, resulting in a total input size of 19 × 3 × 1. It should be emphasised that we are not using image data, rather treating tabular data as a 2D image to make it possible to explore CNN architectures. The input is first zero-padded in order to resize the image to 20 × 3 × 1, which is required in order to parallelize the network processing in the following layer on the FPGA. For the Conv2D FPGA implementation, we control how many iterations of the outer loop (over the rows of the image array) are running in parallel. To simplify the implementation, we run the same number of iterations in parallel, which requires that the number of rows in the input image is an integer multiple
FIG. I. Network architecture for the DNN AE (top) and CNN AE (bottom) models. The corresponding VAE models are derived introducing the Gaussian sampling in the latent space, for the same encoder and decoder architectures (see text). [DNN AE diagram: Input ∈ ℝ57 → BN → Dense ∈ ℝ32 → Dense ∈ ℝ16 → Latent space ∈ ℝ3 → Dense ∈ ℝ16 → Dense ∈ ℝ32 → Dense ∈ ℝ57.]
of the number of parallel processors. Since 19 is a prime number, we choose to extend the input size to 20 before passing it through the Conv2D layer. After padding, the input is scaled by a batch normalization layer and then processed by a stack of two CNN blocks, each including a 2D convolutional layer followed by a ReLU [55] activation function. The first layer has 16 3 × 3 kernels, without padding to ensure that the pT, η and φ inputs do not share weights. The second layer has 32 3 × 1 kernels. Both layers have no bias parameters and a stride set to one. The output of the second CNN block is flattened and passed to a DNN layer, with 8 neurons and no activation, which represents the latent space. The decoder takes this as input to a dense layer with 64 nodes and ReLU activation, and reshapes it into a 2 × 1 × 32 table. The following architecture mirrors the encoder architecture, with 2 CNN blocks with the same number of filters as in the encoder and with ReLU activation. Both are followed by an upsampling layer, in order to mimic the result of a transposed convolutional layer. Finally, one convolutional layer with a single filter and no activation function is added. Its output is interpreted as the AE reconstructed input. The CNN VAE is derived from the AE, including the µ and σ Gaussian sampling in the latent space.

All models are implemented in TensorFlow, and trained on the background dataset by minimizing a customized mean squared error (MSE) loss with the Adam [56] optimizer. In order to aid the network learning process, we use a dataset with standardized pT as a target, so that all the quantities are O(1). To account for the physical boundaries of η and φ, a re-scaled tanh activation is used for those features in the loss computation. In addition, the sum in the MSE loss is modified in order to ignore the zero-padding entries of the input dataset and the corresponding outputs. When training the VAE, the loss is changed to:

L = (1 − β) MSE(Output, Input) + β DKL(µ, σ) ,    (1)

where MSE labels the reconstruction loss (also used in the AE training), DKL is the Kullback-Leibler regularization
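For illustration, the training objective of Eq. (1) can be sketched in NumPy. The toy event, the random stand-ins for the network outputs, and the value of β below are hypothetical (not taken from the paper); DKL uses the standard closed form for a diagonal Gaussian against a unit normal, and the mask implements the zero-padding exclusion described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy event with the (19, 3) input shape described above: rows of
# (pT, eta, phi) for 4 muons, 4 electrons, 10 jets, plus a MET row
# whose eta entry is zero by construction. Absent objects are zero-padded.
x = np.zeros((19, 3))
x[0] = [45.0, 0.3, 1.2]      # one reconstructed muon
x[8] = [120.0, -1.1, -2.0]   # one reconstructed jet
x[18] = [30.0, 0.0, 0.5]     # MET: (magnitude, eta = 0, phi)

# Stand-ins for the network outputs (random here; produced by the VAE in practice).
x_hat = x + 0.1 * rng.standard_normal(x.shape)  # reconstructed input
mu = rng.standard_normal(3)                     # latent means
log_var = rng.standard_normal(3)                # latent log-variances

def masked_mse(x, x_hat):
    """MSE restricted to the non-zero-padded input entries."""
    mask = (x != 0).astype(float)
    return np.sum(mask * (x - x_hat) ** 2) / np.sum(mask)

def d_kl(mu, log_var):
    """Closed-form KL divergence of N(mu, diag(sigma^2)) from N(0, 1)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def vae_loss(x, x_hat, mu, log_var, beta=0.8):
    # Eq. (1): L = (1 - beta) * MSE(Output, Input) + beta * D_KL(mu, sigma).
    # beta here is an arbitrary illustrative choice, not the paper's value.
    return (1.0 - beta) * masked_mse(x, x_hat) + beta * d_kl(mu, log_var)

print(vae_loss(x, x_hat, mu, log_var))
```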
FIG. II. ROC curves of four AD scores (IO AD for AE and VAE models, Rz and DKL ADs for the VAE models) for the CNN (left) and DNN (right) models, obtained from the two new physics benchmark models: LQ → bτ (top) and A → 4ℓ (bottom). [Panels show True Positive Rate on a logarithmic scale.]
of QAT are compared to results obtained by applying a fixed-point precision to a BP floating-point model (i.e. using PTQ), using the same bit precision scan.

Performance of the quantized models, both for QAT and PTQ, is assessed using the TPR obtained for an FPR of 10−5 for the given precision. The bottom plots in Figures IV and V show ratios of QAT performance quantities obtained for each bit width with respect to the BP model performance of the AE and VAE, respectively. The top plots show ratios of PTQ performance quantities obtained in the same manner as for QAT.

Based on these ratio plots, the precision used for the final model is chosen. As expected, the performance of the VAEs is not stable as a function of bit width, since the AD figure of merit used for inference (DKL) is different from the one minimized during the QAT training (VAE IO). Therefore, we use PTQ compression for both DNN and CNN VAEs, because they show stable results as a function of the bit width. For the DNN and CNN AEs, both PTQ and QAT show stable results, and therefore we choose QAT for the AEs. For the QAT DNN AE and the PTQ DNN and CNN VAEs a bit width of 8 is chosen, and for the QAT CNN AE a bit width of 4 is used. The performance numbers for the chosen models are summarized in Table II.

VII. PORTING THE ALGORITHM TO FPGAS

The models described above are translated into firmware using hls4ml, then synthesized with Vivado HLS 2020.1 [60], targeting a Xilinx Virtex UltraScale+ VU9P (xcvu9p-flgb2104-2-e) FPGA with a clock frequency of 200 MHz. In order to have fair resource and latency estimations, obtained from the HLS C Simulation, we have implemented custom layers in hls4ml, which in
FIG. III. ROC curves of four AD scores (IO AD for AE and VAE models, Rz and DKL ADs for the VAE models) for the CNN (left) and DNN (right) models, obtained from two new physics benchmark models: h± → τν (top) and h0 → ττ (bottom). [Panels show True Positive Rate on a logarithmic scale.]
the case of the AE computes the loss function between the input and the network output, and for the VAE computes the DKL term of the loss.

A summary of the accuracy, resource consumption, and latency for the QAT DNN and CNN BP AE models, and the PTQ DNN and CNN BP VAE models, is shown in Table III. Resource utilization is quoted as a fraction of the total available resources on the FPGA. We find the resources are less than about 12% of the available FPGA resources, except for the CNN AE, which uses up to 47% of the look-up tables (LUTs). Moreover, the latency is less than about 365 ns for all models except the CNN AE, which has a latency of 1480 ns. The II for all models is within the required 115 ns, again except the CNN AE. Based on these, both types of architectures with both types of autoencoders are suitable for application at the LHC L1T, except for the CNN AE, which consumes too much of the resources.

Since the performance of all the models under study is of a similar level, we choose the "best" model based on the smallest resource consumption, which turns out to be the DNN VAE. This model was integrated into the emp-fwk infrastructure firmware for LHC trigger boards [61], targeting a Xilinx VCU118 development kit, with the same VU9P FPGA as previously discussed. Data were loaded into onboard buffers mimicking the manner in which data arrives from optical fibres in the L1T system. The design was operated at 240 MHz, and the model predictions observed at the output were consistent with those captured from the HLS C Simulation. For this model we also provide resource and latency estimates for a Xilinx Virtex 7 690 FPGA, which is the FPGA most widely used in the current CMS trigger. The estimates are given in Table IV.
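The PTQ setting used for the deployed VAEs amounts to rounding trained weights onto a fixed-point grid after training. The following NumPy sketch mimics an ap_fixed-style signed quantizer; the toy weights and bit allocations are illustrative only, and this is not the hls4ml implementation:

```python
import numpy as np

def quantize_fixed(w, total_bits=8, int_bits=1):
    """Mimic an ap_fixed<total_bits, int_bits>-style signed quantizer:
    int_bits integer bits (sign included), the rest fractional."""
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - step
    return np.clip(np.round(w / step) * step, lo, hi)

rng = np.random.default_rng(1)
w = rng.standard_normal((16, 8)) * 0.3  # toy "trained" weights

# Scan bit widths as in the PTQ study: rounding error grows at low precision.
for bits in (16, 8, 4, 2):
    wq = quantize_fixed(w, total_bits=bits)
    print(f"{bits:2d} bits: max |w - wq| = {np.max(np.abs(w - wq)):.4f}")
```

As the bit-width scans in Figures IV and V suggest, the rounding error grows quickly at low precision, which is one way to understand why aggressive PTQ can degrade the AD performance.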
TABLE I. Performance assessment of the CNN and DNN models, for different AD scores and different new physics benchmark scenarios.

                     TPR @ FPR 10−5 [%]                    AUC [%]
Model     AD score   LQ→bτ  A→4ℓ  h±→τν  h0→ττ    LQ→bτ  A→4ℓ  h±→τν  h0→ττ
CNN VAE   IO         0.06   3.28  0.10   0.09     92     94    95     85
          DKL        0.05   2.85  0.07   0.14     84     85    86     71
          Rz         0.05   2.53  0.06   0.12     84     85    86     71
CNN AE    IO         0.09   6.29  0.10   0.13     95     94    96     85
DNN VAE   IO         0.07   5.23  0.08   0.11     93     95    95     85
          DKL        0.07   5.27  0.08   0.11     92     94    94     81
          Rz         0.06   4.05  0.07   0.10     86     93    88     76
DNN AE    IO         0.05   3.56  0.06   0.09     95     96    96     87
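The working points in Table I can be reproduced conceptually: fix the threshold on the AD score at the background quantile corresponding to the target FPR of 10−5, then measure the signal efficiency (TPR) above it. A NumPy sketch with toy score distributions (illustrative only, not the paper's scores):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy AD scores: background peaked at low values, an injected signal shifted higher.
bkg_scores = rng.exponential(scale=1.0, size=1_000_000)
sig_scores = rng.exponential(scale=1.0, size=10_000) + 4.0

def tpr_at_fpr(bkg, sig, target_fpr=1e-5):
    """Threshold at the (1 - target_fpr) background quantile, then
    return the fraction of signal events passing that threshold."""
    threshold = np.quantile(bkg, 1.0 - target_fpr)
    return np.mean(sig > threshold), threshold

tpr, thr = tpr_at_fpr(bkg_scores, sig_scores)
print(f"threshold = {thr:.2f}, TPR @ FPR 1e-5 = {tpr:.4f}")
```

Scanning the threshold over all values instead of fixing one working point traces out the full ROC curves shown in Figures II and III.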
FIG. IV. TPR ratios versus model bit width for the AE CNN (left) and DNN (right) models tested on four new physics benchmark models (LQ → bτ, A → 4ℓ, h± → τν, h0 → ττ), using mean squared error as figure of merit for PTQ (top) and QAT (bottom) strategies. [Axes: TPR/TPR baseline versus bit width, scanned from 2 to 16 bits.]
FIG. V. TPR ratios versus model bit width for the VAE CNN (left) and DNN (right) models tested on four new physics benchmark models (LQ → bτ, A → 4ℓ, h± → τν, h0 → ττ), using DKL as figure of merit for PTQ (top) and QAT (bottom) strategies. [Axes: TPR/TPR baseline versus bit width, scanned from 2 to 16 bits.]
TABLE II. Performance assessment of the quantized and pruned CNN and DNN models, for different AD scores and different new physics benchmark scenarios.

                               TPR @ FPR 10−5 [%]                    AUC [%]
Model               AD score   LQ→bτ  A→4ℓ  h±→τν  h0→ττ    LQ→bτ  A→4ℓ  h±→τν  h0→ττ
CNN AE QAT 4 bits   IO         0.09   5.96  0.10   0.13     94     96    96     88
CNN VAE PTQ 8 bits  DKL        0.05   2.56  0.06   0.12     84     84    85     71
DNN AE QAT 8 bits   IO         0.08   5.48  0.09   0.11     95     96    96     88
DNN VAE PTQ 8 bits  DKL        0.08   3.41  0.09   0.08     92     94    94     81
the Xilinx VU9P resources and the corresponding latency is within 130 ns, while the CNN VAE uses less than 12% and the corresponding latency is 365 ns. All three models have the initiation interval within the strict limit imposed by the frequency of bunch crossing at the LHC. Between the two architectures under study, the DNN requires a few times less resources in the trigger; however, both DNN and CNN fit the strict latency requirement, and therefore both architectures can potentially be used at the LHC trigger. The CNN AE model is found to require more resources than are available.

With this work, we have identified and finalized the necessary ingredients to deploy (V)AEs in the L1T of the LHC experiments for Run 3 to accelerate the search for unexpected signatures of new physics.

IX. CODE AVAILABILITY

The QKeras library is available at github.com/google/qkeras; the work presented here uses QKeras version 0.9.0. The hls4ml library with the custom layers used in this paper is under the AE L1 paper branch, available at https://github.com/fastmachinelearning/hls4ml/tree/AE_L1_paper.

X. DATA AVAILABILITY

The data used in this study are openly available at Zenodo at Ref. [47–50, 52].
TABLE III. Resource utilization and latency for the quantized and pruned DNN and CNN (V)AE models. Resources are based
on the Vivado estimates from Vivado HLS 2020.1 for a clock period of 5 ns on Xilinx VU9P.
Model DSP [%] LUT [%] FF [%] BRAM [%] Latency [ns] II [ns]
DNN AE QAT 8 bits 2 5 1 0.5 130 5
CNN AE QAT 4 bits 8 47 5 6 1480 895
DNN VAE PTQ 8 bits 1 3 0.5 0.3 80 5
CNN VAE PTQ 8 bits 10 12 4 2 365 115
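As a sanity check, at the 5 ns clock period used for the Table III estimates, the quoted latency and II values convert directly into clock-cycle counts (numbers copied from Table III):

```python
CLOCK_NS = 5  # Vivado HLS clock period used for the Table III estimates

# (model, latency in ns, initiation interval in ns) from Table III
models = [
    ("DNN AE QAT 8 bits", 130, 5),
    ("CNN AE QAT 4 bits", 1480, 895),
    ("DNN VAE PTQ 8 bits", 80, 5),
    ("CNN VAE PTQ 8 bits", 365, 115),
]

for name, latency_ns, ii_ns in models:
    cycles = latency_ns // CLOCK_NS
    ii_cycles = ii_ns // CLOCK_NS
    print(f"{name}: {cycles} cycles latency, new input every {ii_cycles} cycles")
```

The 80 ns DNN VAE latency quoted in the abstract thus corresponds to 16 clock cycles at 200 MHz, and the 115 ns II budget to 23 cycles.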
TABLE IV. Resource utilization and latency for the quantized and pruned DNN VAE model. Resources are based on the Vivado estimates from Vivado HLS 2020.1 for a clock period of 5 ns on Xilinx V7-690.
Model               DSP [%]  LUT [%]  FF [%]  BRAM [%]  Latency [ns]  II [ns]
DNN VAE PTQ 8 bits  3        9        3       0.4       205           5
[1] LHC Machine. JINST 3, S08001 (2008).
[2] Aad, G. et al. The ATLAS Experiment at the CERN Large Hadron Collider. JINST 3, S08003 (2008).
[3] Chatrchyan, S. et al. The CMS Experiment at the CERN LHC. JINST 3, S08004 (2008).
[4] Sirunyan, A. M. et al. Performance of the CMS Level-1 trigger in proton-proton collisions at √s = 13 TeV. J. Instrum. 15, P10017 (2020). 2006.10165.
[5] The Phase-2 upgrade of the CMS Level-1 trigger. CMS Technical Design Report CERN-LHCC-2020-004, CMS-TDR-021 (2020). URL https://cds.cern.ch/record/2714892.
[6] Aad, G. et al. Operation of the ATLAS trigger system in Run 2. J. Instrum. 15, P10004 (2020). 2007.12539.
[7] Technical Design Report for the Phase-II Upgrade of the ATLAS TDAQ System. ATLAS Technical Design Report CERN-LHCC-2017-020, ATLAS-TDR-029 (2017). URL https://cds.cern.ch/record/2285584.
[8] Aad, G. et al. Observation of a new particle in the search for the standard model Higgs boson with the ATLAS detector at the LHC. Phys. Lett. B 716, 1 (2012). 1207.7214.
[9] Chatrchyan, S. et al. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Phys. Lett. B 716, 30 (2012). 1207.7235.
[10] Aarrestad, T. et al. The Dark Machines anomaly score challenge: Benchmark data and model independent event classification for the Large Hadron Collider (2021). 2105.14027.
[11] Kasieczka, G. et al. The LHC Olympics 2020: A community challenge for anomaly detection in high energy physics (2021). 2101.08320.
[12] Cerri, O. et al. Variational Autoencoders for New Physics Mining at the Large Hadron Collider. JHEP 05, 036 (2019). 1811.10276.
[13] Knapp, O. et al. Adversarially Learned Anomaly Detection on CMS Open Data: re-discovering the top quark. Eur. Phys. J. Plus 136, 236 (2021). 2005.01598.
[14] CMS Exotica hotline leads hunt for exotic particles (2010). URL https://www.symmetrymagazine.org/breaking/2010/06/24/cms-exotica-hotline-leads-hunt-for-exotic-particles.
[15] Poppi, F. Is the bell ringing? Exotica: à l'affût des événements exotiques 14 (2010). URL http://cds.cern.ch/record/1306501.
[16] Duarte, J. et al. Fast inference of deep neural networks in FPGAs for particle physics. JINST 13, P07027 (2018). 1804.06913.
[17] Ngadiuba, J. et al. Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml. Mach. Learn.: Sci. Technol. (2020). 2003.06308.
[18] Iiyama, Y. et al. Distance-Weighted Graph Neural Networks on FPGAs for Real-Time Particle Reconstruction in High Energy Physics. Front. Big Data 3, 598927 (2020). 2008.03601.
[19] Aarrestad, T. et al. Fast convolutional neural networks on FPGAs with hls4ml. Mach. Learn.: Sci. Technol. 2, 045015 (2021). 2101.05108.
[20] Heintz, A. et al. Accelerated Charged Particle Tracking with Graph Neural Networks on FPGAs. In 34th Conference on Neural Information Processing Systems (2020). 2012.01563.
[21] Summers, S. et al. Fast inference of Boosted Decision Trees in FPGAs for particle physics. JINST 15, P05026 (2020). 2002.02534.
[22] Coelho, C. QKeras (2019). URL https://github.com/google/qkeras.
[23] Coelho, C. N. et al. Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors. Nat. Mach. Intell. (2021).