Autoencoders on FPGAs for real-time, unsupervised new physics detection at 40 MHz at the Large Hadron Collider

arXiv:2108.03986v2 [physics.ins-det] 12 Aug 2021


Ekaterina Govorkova,∗ Ema Puljak, Thea Aarrestad, Thomas James, Vladimir Loncar,† Maurizio Pierini, Adrian Alan Pol, Nicolò Ghielmetti,‡ Maksymilian Graczyk,§ and Sioni Summers
European Organization for Nuclear Research (CERN), CH-1211 Geneva 23, Switzerland

Jennifer Ngadiuba¶
Fermi National Accelerator Laboratory, Batavia, IL 60510, USA

Thong Q. Nguyen
California Institute of Technology, Pasadena, CA 91125, USA

Javier Duarte
University of California San Diego, La Jolla, CA 92093, USA

Zhenbin Wu
University of Illinois at Chicago, Chicago, IL 60607, USA

(Dated: August 13, 2021)

∗ E-mail: [email protected]
† Also at Institute of Physics Belgrade, Serbia
‡ Also at Politecnico di Milano, Italy
§ Also at Imperial College London, UK
¶ Also at California Institute of Technology, USA
In this paper, we show how to adapt and deploy anomaly-detection algorithms based on deep autoencoders for the unsupervised detection of new physics signatures in the extremely challenging environment of a real-time event selection system at the Large Hadron Collider (LHC). We demonstrate that new physics signatures can be enhanced by three orders of magnitude while staying within the strict latency and resource constraints of a typical LHC event filtering system. This would allow the collection of datasets potentially enriched with high-purity contributions from new physics processes. Through highly parallel, per-layer implementations of the network layers, support for autoencoder-specific losses on FPGAs, and latent-space-based inference, we demonstrate that anomaly detection can be performed in as little as 80 ns using less than 3% of the logic resources of a Xilinx Virtex VU9P FPGA, opening the way to real-life applications of this idea during the next data-taking campaign of the LHC.

I. INTRODUCTION

The CERN Large Hadron Collider (LHC) [1] generates 40 million proton-proton collision events per second. Particles produced in these events are detected in the sensors of detectors located around the LHC ring, producing hundreds of terabytes of data per second. The largest general-purpose particle detectors at the LHC, ATLAS [2] and CMS [3], discard most of the collision events with online selection systems, as a result of bandwidth limitations. These systems consist of two stages: the level-1 trigger (L1T) [4–7], where algorithms are deployed as programmable logic on custom electronic boards equipped with field-programmable gate arrays (FPGAs), and the High Level Trigger (HLT), where selection algorithms asynchronously process the events accepted by the L1T on commercially available CPUs. The largest fraction of events is discarded at the first selection stage, the L1T, which has the task of reducing the event rate by 2.5 orders of magnitude within a few microseconds.

The trigger selection algorithms running in the L1T and HLT are designed to guarantee a high acceptance rate for certain physics processes under study. When designing searches for new physics (e.g., dark matter production), physicists typically consider specific scenarios motivated by theoretical considerations. This supervised strategy has proven successful when dealing with theory-motivated searches, as was the case with the search for the Higgs boson [8, 9]. Conversely, this approach may become a limiting factor in the absence of a strong theoretical prior. For this reason, there are several community efforts to investigate unsupervised machine learning (ML) techniques for new physics searches [10, 11]. These investigate the use of autoencoders (AEs) and variational autoencoders (VAEs) for offline processing [24, 25], and therefore do not consider constraints such as resource usage and latency. Refs. [12, 13] propose to integrate unsupervised learning algorithms in the online selection systems of the CMS and ATLAS experiments, in order to preserve, in a special data stream, rare events which would not otherwise be selected. A similar, albeit non-learning, approach was pursued in CMS with the exotica hotline [14, 15] during the first year of LHC data taking, and in similar efforts at the Super Proton–Antiproton Synchrotron and Tevatron experiments.

While the primary focus for online unsupervised learning so far has been the HLT, this strategy could be more effective if deployed in the L1T, i.e., before any selection bias is introduced. Due to the extreme latency and computing resource constraints of the L1T, only relatively simple, mostly theory-motivated selection algorithms are currently deployed. These usually include requirements on the minimum energy of a physics object, such as a reconstructed lepton or a jet, effectively excluding lower-energy events from further processing. Instead, by deploying an unbiased algorithm which selects events based on their degree of abnormality, rather than on the amount of energy present in the event, we can collect data in a signal-model-independent way. Such an anomaly detection (AD) algorithm is required to have extremely low latency because of the restrictions imposed by the L1T.

Recent developments of the hls4ml library allow us to consider, for the first time, the possibility of deploying an AD algorithm on the FPGAs mounted on the L1T boards. The hls4ml library is open-source software developed to translate neural networks [16–20] and boosted decision trees [21] into FPGA firmware. A fully on-chip implementation of the machine learning model is used in order to stay within the 1 µs latency budget imposed by a typical L1T system. Additionally, the initiation interval (II) of the algorithm should be within 150 ns, which is related to the bunch-crossing time for the upcoming period of LHC operations. Since several L1T algorithms are deployed per FPGA, each of them should use much less than the available resources. With its interface to QKeras [22], hls4ml supports quantization-aware training (QAT) [23], which makes it possible to drastically reduce the FPGA resource consumption while preserving accuracy. Using hls4ml, we can compress neural networks to fit the limited resources of an FPGA.

In this paper, we discuss how to adapt and improve the strategy presented in Ref. [12] to fit the L1T infrastructure. We focus on AEs, with specific emphasis on VAEs [24, 25]. We consider both fully-connected and convolutional architectures, and discuss how to compress the models at training time through pruning [26–28], the removal of unnecessary operations, and quantization [17, 29–36], the reduction of the precision of operations.

As discussed in Ref. [12], one can train a (V)AE on a given data sample by minimizing a measure of the distance between the input and the output (the loss function). This strategy, which is very common when using (V)AEs for anomaly detection [37], comes with practical challenges when considering deployment on FPGAs. The use of high-level features is not optimal because it requires time-consuming data preprocessing. The situation is further complicated for VAEs, which require random sampling from a Gaussian distribution in the latent space. Furthermore, one has to buffer the input data on-chip while the output is generated by the FPGA processing, in order to compute the distance afterwards. To deal with all of these aspects, we explore different approaches and compare the accuracy, latency, and resource consumption of the various methods.

In addition, we discuss how to customize the model compression in order to better accommodate unsupervised learning. Previously, we showed that QAT can result in a large reduction in resource consumption with minor accuracy loss for supervised algorithms [19, 23]. In this paper, we extend and adapt that compression workflow to deal with the specific challenge of compressing autoencoders used for AD. Several approaches are possible:

• Post-training quantization (PTQ) [16, 27, 38–41], consisting of applying a fixed-point precision to a floating-point baseline model. This is the simplest quantization approach, typically resulting in good algorithm stability at the cost of some loss in performance. More aggressive PTQ (lower precision) is usually accompanied by a larger reduction in accuracy.

• QAT, consisting of imposing the fixed-point precision constraint at training time, e.g., using the QKeras library (see the sketch after this list). This approach typically allows one to limit the accuracy loss when imposing a higher level of quantization, finding a better weight configuration than what one can get with PTQ. However, applying QAT to VAE models for AD can result in unstable performance, because QAT would return the best input-to-output reconstruction performance, but the best reconstruction does not necessarily guarantee the best AD performance. Ultimately, the stability of the result depends on the nature of the detected anomaly.

• Knowledge distillation with QAT: one could change the quantized-model optimization strategy, reframing the problem as knowledge distillation [42–45]. Rather than fitting the quantized weights to minimize the VAE loss, one could minimize the difference between the loss of the quantized model and that of the floating-point model for the same input. Rather than training a quantized copy of a given floating-point model, one could train a different model to predict this floating-point output, starting from the same input. Doing so, one could aim at matching the floating-point AD performance with a completely different network (e.g., an MLP regression) that better meets the constraints of an L1T environment, e.g., by being faster or consuming fewer computing resources.

• Anomaly classification with QAT: the approximated loss regression with QAT could be turned into a classification problem. Rather than approximating the floating-point decision, one could try to obtain a yes/no answer to a different question: would the floating-point algorithm return an AD score larger than a threshold for this event? In this way, one could set the threshold on the accurate floating-point model and obtain good accuracy (in terms of anomaly acceptance) without having to predict the exact AD score value across multiple orders of magnitude.
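To make the PTQ/QAT distinction concrete, the following is a minimal QKeras sketch of a quantization-aware dense stack; the layer widths and the 8-bit fixed-point choice are illustrative assumptions, not the exact configuration used in this paper:

    # Minimal QAT sketch with QKeras: weights and activations are constrained
    # to a fixed-point representation while training. Layer sizes are illustrative.
    from tensorflow.keras import layers, models
    from qkeras import QDense, QActivation, quantized_bits, quantized_relu

    inp = layers.Input(shape=(57,))
    x = QDense(32,
               kernel_quantizer=quantized_bits(8, 0, alpha=1),
               bias_quantizer=quantized_bits(8, 0))(inp)
    x = QActivation(quantized_relu(8))(x)
    latent = QDense(3,
                    kernel_quantizer=quantized_bits(8, 0, alpha=1),
                    bias_quantizer=quantized_bits(8, 0))(x)
    qat_encoder = models.Model(inp, latent)
    qat_encoder.compile(optimizer='adam', loss='mse')

    # PTQ, by contrast, would train an ordinary floating-point Keras model
    # and only apply the fixed-point precision afterwards, e.g. in the
    # per-layer hls4ml configuration.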

In this paper, we focus on the first two approaches, leaving the investigation of the others to future work.

This paper is structured as follows: in Section II we describe the benchmark dataset. In Section III a detailed description of the autoencoder models is given, followed by Section IV, in which the quantities used as anomaly detection scores are defined. In Section V results for the uncompressed and unquantized models are presented. In Section VI, the model compression is detailed. Section VII describes the strategy used to compress the models and deploy them on FPGAs, including an assessment of the required FPGA resources.

II. DATA SAMPLES

This study follows the setup of Refs. [12, 46]. We use a data sample that represents a typical proton-proton collision dataset that has been pre-filtered by requiring the presence of an electron or a muon with a transverse momentum pT > 23 GeV and a pseudorapidity |η| < 3 (electron) or |η| < 2.1 (muon). This is representative of a typical L1T selection algorithm of a multipurpose LHC experiment. In addition, we consider the four benchmark new physics scenarios discussed in Ref. [12]:

• A leptoquark (LQ) with a mass of 80 GeV, decaying to a b quark and a τ lepton [47],

• A neutral scalar boson (A) with a mass of 50 GeV, decaying to two off-shell Z bosons, each forced to decay to two leptons: A → 4ℓ [48],

• A scalar boson with a mass of 60 GeV, decaying to two tau leptons: h0 → ττ [49],

• A charged scalar boson with a mass of 60 GeV, decaying to a tau lepton and a neutrino: h± → τν [50].

These four processes are used to evaluate the accuracy of the trained models. A detailed description of the dataset can be found in Ref. [51].

In total, the background sample consists of 8 million events. Of these, 50% are used for training, 40% for testing, and 10% for validation. The new physics benchmark samples are used only for evaluating the performance of the models.

The training dataset, together with the signal datasets for testing, is published on Zenodo [47–50, 52].
III. AUTOENCODER MODELS

We consider two classes of architectures: one based on dense feed-forward neural networks (DNNs) and one using convolutional neural networks (CNNs). Both start from the same inputs, namely the (pT, η, φ) values of 18 reconstructed objects (ordered as 4 muons, 4 electrons, and 10 jets), plus the φ angle and magnitude of the missing transverse energy (MET), forming together an input of shape (19, 3), where the MET η value is zero-padded by construction (η is zero for transverse quantities). For events with fewer than the maximum number of muons, electrons, or jets, the input is also zero-padded, as is commonly done in the L1T algorithm logic.
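To make this input format concrete, here is a sketch of how a single event could be assembled; the row ordering (MET in the last row) and the example values are assumptions for illustration only:

    import numpy as np

    # One event as a (19, 3) array of (pT, eta, phi): 4 muons, 4 electrons,
    # 10 jets, plus one MET row. Missing objects stay zero-padded.
    event = np.zeros((19, 3), dtype=np.float32)

    muons = [(25.3, 1.1, 0.4)]          # hypothetical event with a single muon
    for i, (pt, eta, phi) in enumerate(muons[:4]):
        event[i] = (pt, eta, phi)
    # electrons would fill rows 4-7 and jets rows 8-17 in the same way

    met_pt, met_phi = 40.2, -2.1        # hypothetical MET magnitude and angle
    event[18] = (met_pt, 0.0, met_phi)  # MET eta is zero by construction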
In order to account for the resource consumption and latency of the data pre-processing step, we use a batch normalization layer [53] as the first layer of each model. As all processing is done on-chip, the resource and latency measurements will be consistent with those of a real L1T implementation. For both architectures, CNN and DNN, we consider both a plain AE and a VAE. In the AE, the encoder directly provides the coordinates of the given input, projected into the latent space. In the VAE, the encoder returns the mean values µ and standard deviations σ of the N-dimensional Gaussian distribution that represents the latent-space probability density function associated with a given event.

For the DNN model, the four-vector of each reconstructed object is flattened and concatenated into a 1D array, resulting in a 57-dimensional input vector. The DNN AE architecture is shown in the top panel of Figure I. The inputs are batch-normalized and passed through a stack of 3 fully connected layers, with 32, 16, and 3 nodes. The output of each layer is followed by a batch normalization layer and activated by a leaky ReLU function [54]. The 3-dimensional output of the encoder is the projection of the input into the latent space. The decoder consists of a stack of 3 layers, with 16, 32, and 57 nodes. As in the encoder, we use a batch normalization layer between each fully connected layer and its activation. The last layer has no activation function, while leaky ReLU is used for the others. The DNN VAE follows the same architecture, except for the latent-space processing, which follows the usual VAE prescription: two 3-dimensional fully connected layers produce the µ and σ vectors, from which Gaussian latent quantities are sampled and injected into the decoder.
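A minimal Keras sketch of the DNN AE just described; details the text does not fix (bias terms, the exact treatment of the output layer) are assumptions:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def dense_block(x, n):
        # Dense -> batch normalization -> leaky ReLU, as described in the text
        x = layers.Dense(n)(x)
        x = layers.BatchNormalization()(x)
        return layers.LeakyReLU()(x)

    inp = layers.Input(shape=(57,))
    x = layers.BatchNormalization()(inp)   # models the on-chip pre-processing
    for n in (32, 16, 3):                  # encoder stack; 3 = latent dimension
        x = dense_block(x, n)
    latent = x
    for n in (16, 32):                     # decoder stack
        latent = dense_block(latent, n)
    out = layers.Dense(57)(latent)         # last layer: no activation
    dnn_ae = models.Model(inp, out)
    dnn_ae.summary()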
The CNN AE architecture is shown in the bottom panel of Figure I. The encoder takes as input the single-channel 2D array of four-momenta, including the two MET-related features (magnitude and φ angle) and zeros for the MET η, resulting in a total input size of 19 × 3 × 1. It should be emphasized that we are not using image data; rather, we treat tabular data as a 2D image to make it possible to explore CNN architectures.

[Figure I: block diagrams of the two models. Top: DNN AE — Input ∈ R^57 → BN → Dense 32 → Dense 16 → latent space ∈ R^3 → Dense 16 → Dense 32 → Dense 57. Bottom: CNN AE — Input 19×3×1 → zero-padding → BN → Conv2D 16 (3,3) → average pooling → Conv2D 32 (3,1) → average pooling → Flatten → Dense 8 latent space → Dense 64 → Reshape (2,1,32) → Conv2D 32 (3,1) → upsampling → Conv2D 16 (3,1) → upsampling → Conv2D 1 (3,3) output, with ReLU activations.]

FIG. I. Network architecture for the DNN AE (top) and CNN AE (bottom) models. The corresponding VAE models are derived by introducing the Gaussian sampling in the latent space, for the same encoder and decoder architectures (see text).

The input is first zero-padded in order to resize the image to 20 × 3 × 1, which is required to parallelize the network processing in the following layer on the FPGA. For the Conv2D FPGA implementation, we control how many iterations of the outer loop (over the rows of the image array) run in parallel. To simplify the implementation, we run the same number of iterations in parallel, which requires the number of rows in the input image to be an integer multiple of the number of parallel processors. Since 19 is a prime number, we choose to extend the input size to 20 before passing it through the Conv2D layer. After padding, the input is scaled by a batch normalization layer and then processed by a stack of two CNN blocks, each including a 2D convolutional layer followed by a ReLU [55] activation function. The first layer has 16 3 × 3 kernels, without padding, to ensure that the pT, η, and φ inputs do not share weights. The second layer has 32 3 × 1 kernels. Both layers have no bias parameters and a stride set to one. The output of the second CNN block is flattened and passed to a dense layer, with 8 neurons and no activation, which represents the latent space. The decoder takes this as input to a dense layer with 64 nodes and ReLU activation, and reshapes it into a 2 × 1 × 32 table. The following architecture mirrors the encoder, with 2 CNN blocks with the same number of filters as in the encoder and with ReLU activation. Both are followed by an upsampling layer, in order to mimic the result of a transposed convolutional layer. Finally, one convolutional layer with a single filter and no activation function is added. Its output is interpreted as the AE reconstructed input. The CNN VAE is derived from the AE, including the µ and σ Gaussian sampling in the latent space.

All models are implemented in TensorFlow and trained on the background dataset by minimizing a customized mean squared error (MSE) loss with the Adam [56] optimizer. In order to aid the network learning process, we use a dataset with standardized pT as a target, so that all the quantities are O(1). To account for the physical boundaries of η and φ, a re-scaled tanh activation is used for those features in the loss computation. In addition, the sum in the MSE loss is modified to ignore the zero-padding entries of the input dataset and the corresponding outputs. When training the VAE, the loss is changed to

L = (1 − β) MSE(Output, Input) + β DKL(µ, σ) ,   (1)

where MSE labels the reconstruction loss (also used in the AE training), DKL is the Kullback-Leibler regularization term [57] usually adopted for VAEs,

DKL(µ, σ) = −(1/2) Σᵢ [ log(σᵢ²) − σᵢ² − µᵢ² + 1 ] ,   (2)

and β is a hyperparameter defined in the range [0, 1] [58].
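A sketch of the customized loss described above, assuming the encoder parameterizes the latent Gaussian through µ and log σ²; the β value is hypothetical, and the re-scaled tanh treatment of η and φ is omitted for brevity:

    import tensorflow as tf

    beta = 0.8  # hypothetical value; the paper tunes beta on the benchmark signals

    def masked_mse(y_true, y_pred):
        # Ignore the zero-padded entries of the input and the corresponding outputs.
        mask = tf.cast(tf.not_equal(y_true, 0.0), tf.float32)
        sq = tf.square(y_true - y_pred) * mask
        return tf.reduce_sum(sq, axis=-1) / tf.maximum(
            tf.reduce_sum(mask, axis=-1), 1.0)

    def kl_term(z_mean, z_log_var):
        # Eq. (2): D_KL = -(1/2) * sum_i [log(sigma_i^2) - sigma_i^2 - mu_i^2 + 1]
        return -0.5 * tf.reduce_sum(
            z_log_var - tf.exp(z_log_var) - tf.square(z_mean) + 1.0, axis=-1)

    def vae_loss(y_true, y_pred, z_mean, z_log_var):
        # Eq. (1): L = (1 - beta) * MSE + beta * D_KL
        return (1.0 - beta) * masked_mse(y_true, y_pred) \
            + beta * kl_term(z_mean, z_log_var)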

Both models are trained for 100 epochs with a batch size of 1024, using early stopping if no improvement in the loss is observed for ten epochs. All models are trained with floating-point precision on an NVIDIA RTX 2080 GPU. We refer to these as the baseline floating-point (BF) models.

IV. ANOMALY DETECTION SCORES

An autoencoder is optimized to retain the minimal set of information needed to reconstruct an accurate estimate of the input. During inference, an autoencoder might have problems generalizing to features it was not exposed to during training. Selecting events where the autoencoder output is far from the given input is therefore often an effective AD algorithm. For this purpose, one can use a metric that measures the distance between the input and the output. The simplest solution is to use the same metric that defines the training loss function; in our case, the MSE between the input and the output. We refer to this strategy as input-output (IO) AD.

In the case of a VAE deployed in the L1T, one cannot simply exploit an IO AD strategy, since this would require sampling random numbers on the FPGA. The trigger decision would not be deterministic, something usually tolerated only for service triggers, and not for triggers serving physics studies. Moreover, one would have to store random numbers on the FPGA, which would consume resources and increase the latency. To deal with this problem, we consider an alternative strategy, defining an AD score based on the µ and σ values returned by the encoder (see Eq. (1)). In particular, we consider two options: the KL divergence term entering the VAE loss (see Eq. (2)) and the z-score of the origin in the latent space with respect to a Gaussian distribution centered at µ with standard deviation σ [10]:

Rz = Σᵢ µᵢ² / σᵢ² .   (3)

These two AD scores have several benefits: Gaussian sampling is avoided; we save significant resources and latency by not evaluating the decoder; and we do not need to buffer the input data for the computation of the MSE. During the model optimization, we tune β so that we obtain comparable performance (on the benchmark signal models) for the DKL AD score and the IO AD score of the VAE.
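Both scores can be evaluated from the encoder outputs alone. A sketch, again assuming a (µ, log σ²) parameterization of a hypothetical two-headed encoder:

    import numpy as np

    def dkl_score(z_mean, z_log_var):
        # Eq. (2), evaluated per event; no decoder and no random sampling needed
        return -0.5 * np.sum(
            z_log_var - np.exp(z_log_var) - z_mean**2 + 1.0, axis=-1)

    def rz_score(z_mean, z_log_var):
        # Eq. (3): z-score of the origin, R_z = sum_i mu_i^2 / sigma_i^2
        return np.sum(z_mean**2 / np.exp(z_log_var), axis=-1)

    # z_mean, z_log_var = encoder.predict(events)  # hypothetical encoder call
    # scores = dkl_score(z_mean, z_log_var)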
V. PERFORMANCE AT FLOATING-POINT PRECISION

The model performance is assessed using the four new physics benchmark models. The anomaly-detection scores considered in this paper are the IO AD for the AE models, and the Rz and DKL ADs for the VAE models. For completeness, results obtained from the IO AD score of the VAE models are also shown. The receiver operating characteristic (ROC) curves in Figures II and III show the true positive rate (TPR) as a function of the false positive rate (FPR), computed by changing the lower threshold applied to the different anomaly scores. We further quantify the AD performance by quoting the area under the ROC curve (AUC) and the TPR corresponding to an FPR working point of 10⁻⁵ (see Table I), which on this dataset corresponds to a reduction of the background rate to approximately 1000 events per month.

From the ROC curves, we conclude that DKL can be used as an anomaly metric for both the DNN and the CNN VAE. This has the potential to significantly reduce the inference latency and on-chip resource consumption, since only half of the network (the encoder) needs to be evaluated and there is no longer a need to buffer the input in order to compute an MSE loss. The Rz metric performs worse and is therefore not included in the following studies.

VI. MODEL COMPRESSION

We adopt different strategies for model compression. First of all, we compress the BF model by pruning the dense and convolutional layers by 50% of their connections, following the same procedure as Ref. [19]. Pruning is enforced using the polynomial decay implemented in the TensorFlow pruning API, a Keras-based [59] interface consisting of a simple drop-in replacement of Keras layers. A sparsity of 50% is targeted, meaning that only 50% of the weights are retained in the pruned layers and the remaining ones are set to zero. The pruning is set to start from the fifth epoch of the training, to ensure the model is closer to a stable minimum before weights deemed unimportant are removed. By pruning the BF model layers to a target sparsity of 50%, the number of floating-point operations required to evaluate the model can be significantly reduced. We refer to the resulting model as the baseline pruned (BP) model. For the VAE, only the encoder is pruned, since only the encoder will be deployed on the FPGA. The BP models are taken as a reference to evaluate the resource savings of the subsequent compression strategies, QAT and PTQ.

Furthermore, we perform QAT of each model described in Section III, implementing them with the QKeras library [23]. The bit precision is scanned between 2 and 16 in steps of 2 bits. When quantizing a model, we also impose pruning of the dense (convolutional) layers by 50%, as done for the DNN (CNN) BP models.
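A sketch of this pruning setup with the TensorFlow Model Optimization API; the end step and the schedule parameters beyond the 50% target and the fifth-epoch start are assumptions, and `model` stands for the Keras (V)AE to be pruned:

    import tensorflow_model_optimization as tfmot

    steps_per_epoch = 4_000_000 // 1024   # 50% of the 8M events, batch size 1024
    schedule = tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=0.5,               # 50% of the weights set to zero
        begin_step=5 * steps_per_epoch,   # start pruning at the fifth epoch
        end_step=100 * steps_per_epoch)

    pruned = tfmot.sparsity.keras.prune_low_magnitude(
        model, pruning_schedule=schedule)
    pruned.compile(optimizer='adam', loss='mse')
    # The UpdatePruningStep callback must be passed to fit() so the
    # schedule advances during training.
    callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]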

[Figure II: four ROC panels (TPR versus FPR, log-log scales), with the AUC of each AD score quoted in the legends; the AUC values also appear in Table I.]

FIG. II. ROC curves of four AD scores (IO AD for the AE and VAE models, Rz and DKL ADs for the VAE models) for the CNN (left) and DNN (right) models, obtained from the two new physics benchmark models: LQ → bτ (top) and A → 4ℓ (bottom).

The results of QAT are compared to results obtained by applying a fixed-point precision to the BP floating-point model (i.e., using PTQ), with the same bit-precision scan.

The performance of the quantized models, both for QAT and PTQ, is assessed using the TPR obtained for an FPR of 10⁻⁵ at the given precision. The bottom plots in Figures IV and V show ratios of the QAT performance obtained at each bit width with respect to the BP model performance, for the AE and VAE respectively. The top plots show the corresponding ratios for PTQ.

Based on these ratio plots, the precision used for the final model is chosen. As expected, the performance of the VAEs is not stable as a function of bit width, since the AD figure of merit used for inference (DKL) differs from the one minimized during the QAT training (the VAE IO loss). Therefore, we use PTQ compression for the DNN and CNN VAEs, because it shows stable results as a function of the bit width. For the DNN and CNN AEs, both PTQ and QAT show stable results, and we therefore choose QAT for the AEs. For the QAT DNN AE and the PTQ DNN and CNN VAEs a bit width of 8 is chosen, and for the QAT CNN AE a bit width of 4 is used (see Tables II and III). The performance numbers for the chosen models are summarized in Table II.

VII. PORTING THE ALGORITHM TO FPGAS

The models described above are translated into firmware using hls4ml and then synthesized with Vivado HLS 2020.1 [60], targeting a Xilinx Virtex UltraScale+ VU9P (xcvu9p-flgb2104-2-e) FPGA with a clock frequency of 200 MHz. In order to obtain fair resource and latency estimates from the HLS C simulation, we have implemented custom layers in hls4ml which, in the case of the AE, compute the loss function between the input and the network output and, for the VAE, compute the DKL term of the loss.
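A sketch of this conversion flow for the encoder, using the public hls4ml API with the FPGA part and clock period quoted above; `encoder` is a hypothetical Keras model, and the options of the authors' custom AE_L1_paper layers are not reproduced here:

    import hls4ml

    # Per-layer configuration (precision, reuse factor) derived from the model
    config = hls4ml.utils.config_from_keras_model(encoder, granularity='name')

    hls_model = hls4ml.converters.convert_from_keras_model(
        encoder,
        hls_config=config,
        part='xcvu9p-flgb2104-2-e',   # the VU9P target quoted in the text
        clock_period=5,               # ns, i.e. a 200 MHz clock
        output_dir='ae_l1t_hls')

    hls_model.compile()                      # bit-accurate HLS C simulation
    hls_model.build(csim=False, synth=True)  # run Vivado HLS synthesis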

[Figure III: four ROC panels (TPR versus FPR, log-log scales), with the AUC of each AD score quoted in the legends; the AUC values also appear in Table I.]

FIG. III. ROC curves of four AD scores (IO AD for the AE and VAE models, Rz and DKL ADs for the VAE models) for the CNN (left) and DNN (right) models, obtained from two new physics benchmark models: h± → τν (top) and h0 → ττ (bottom).

A summary of the accuracy, resource consumption, and latency for the QAT DNN and CNN BP AE models and the PTQ DNN and CNN BP VAE models is shown in Table III. Resource utilization is quoted as a fraction of the total resources available on the FPGA. We find that the models use less than about 12% of the available FPGA resources, except for the CNN AE, which uses up to 47% of the look-up tables (LUTs). Moreover, the latency is less than about 365 ns for all models except the CNN AE, which has a latency of 1480 ns. The II of all models is within the required 115 ns, again with the exception of the CNN AE. Based on these results, both architectures with both types of autoencoders are suitable for application at the LHC L1T, except for the CNN AE, which consumes too many resources.

Since the performance of all the models under study is of a similar level, we choose the "best" model based on the smallest resource consumption, which turns out to be the DNN VAE. This model was integrated into the emp-fwk infrastructure firmware for LHC trigger boards [61], targeting a Xilinx VCU118 development kit with the same VU9P FPGA as discussed previously. Data were loaded into onboard buffers, mimicking the manner in which data arrive from optical fibres in the L1T system. The design was operated at 240 MHz, and the model predictions observed at the output were consistent with those captured from the HLS C simulation. For this model we also provide resource and latency estimates for a Xilinx Virtex-7 690 FPGA, the FPGA most widely used in the current CMS trigger. The estimates are given in Table IV.

TABLE I. Performance assessment of the CNN and DNN models, for different AD scores and different new physics benchmark scenarios.

                        TPR @ FPR 10⁻⁵ [%]                  AUC [%]
Model     AD score   LQ→bτ  A→4ℓ  h±→τν  h0→ττ   LQ→bτ  A→4ℓ  h±→τν  h0→ττ
CNN VAE   IO          0.06  3.28   0.10   0.09      92    94     95     85
CNN VAE   DKL         0.05  2.85   0.07   0.14      84    85     86     71
CNN VAE   Rz          0.05  2.53   0.06   0.12      84    85     86     71
CNN AE    IO          0.09  6.29   0.10   0.13      95    94     96     85
DNN VAE   IO          0.07  5.23   0.08   0.11      93    95     95     85
DNN VAE   DKL         0.07  5.27   0.08   0.11      92    94     94     81
DNN VAE   Rz          0.06  4.05   0.07   0.10      86    93     88     76
DNN AE    IO          0.05  3.56   0.06   0.09      95    96     96     87
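The figures of merit in Table I can be reproduced from arrays of per-event scores; a sketch with scikit-learn (the tooling is an assumption, and `scores` and `labels` are hypothetical arrays mixing background, label 0, and signal, label 1):

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    fpr, tpr, _ = roc_curve(labels, scores)   # scan the lower score threshold
    print('AUC = %.0f%%' % (100 * auc(fpr, tpr)))
    # TPR at the FPR = 1e-5 working point, interpolated along the ROC curve
    print('TPR @ FPR 1e-5 = %.2f%%' % (100 * np.interp(1e-5, fpr, tpr)))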

[Figure IV: four panels of TPR ratio versus bit width, one curve per benchmark signal (LQ → bτ, A → 4ℓ, h± → τν, h0 → ττ).]

FIG. IV. TPR ratios versus model bit width for the CNN (left) and DNN (right) AE models tested on four new physics benchmark models, using the mean squared error as figure of merit, for the PTQ (top) and QAT (bottom) strategies.

VIII. CONCLUSIONS

We discussed how to extend new physics detection strategies at the LHC with autoencoders deployed in the L1T infrastructure of the experiments. In particular, we show how one could deploy a deep neural network (DNN) or convolutional neural network (CNN) AE on a field-programmable gate array (FPGA) using the hls4ml library, within a O(1) µs latency and with small resource utilization, once the model is quantized and pruned. We show that one can retain accuracy by compressing the model at training time. Moreover, we discuss different strategies to identify potential anomalies. We show that one could perform the anomaly detection (AD) with a variational AE (VAE) using the projected representation of a given input in the latent space, which has several advantages for an FPGA implementation: (1) there is no need to sample Gaussian-distributed pseudorandom numbers (preserving the deterministic outcome of the trigger decision), and (2) there is no need to run the decoder in the trigger, resulting in a significant resource saving.

As can be seen from Table III, the latency when using only the encoder, as opposed to the full VAE, is reduced by a factor of two, while the performance is of a similar level (see Table II).

[Figure V: four panels of TPR ratio versus bit width, one curve per benchmark signal (LQ → bτ, A → 4ℓ, h± → τν, h0 → ττ).]

FIG. V. TPR ratios versus model bit width for the CNN (left) and DNN (right) VAE models tested on four new physics benchmark models, using DKL as figure of merit, for the PTQ (top) and QAT (bottom) strategies.

TABLE II. Performance assessment of the quantized and pruned CNN and DNN models, for different AD scores and different new physics benchmark scenarios.

                                 TPR @ FPR 10⁻⁵ [%]                  AUC [%]
Model                AD score  LQ→bτ  A→4ℓ  h±→τν  h0→ττ  LQ→bτ  A→4ℓ  h±→τν  h0→ττ
CNN AE QAT 4 bits    IO         0.09  5.96   0.10   0.13     94    96     96     88
CNN VAE PTQ 8 bits   DKL        0.05  2.56   0.06   0.12     84    84     85     71
DNN AE QAT 8 bits    IO         0.08  5.48   0.09   0.11     95    96     96     88
DNN VAE PTQ 8 bits   DKL        0.08  3.41   0.09   0.08     92    94     94     81

The DNN (V)AE models use less than 5% of the Xilinx VU9P resources, with a latency within 130 ns, while the CNN VAE uses less than 12%, with a latency of 365 ns. All three models have an initiation interval within the strict limit imposed by the frequency of bunch crossings at the LHC. Between the two architectures under study, the DNN requires a few times fewer resources in the trigger; however, both the DNN and the CNN fit the strict latency requirement, and therefore both architectures can potentially be used at the LHC trigger. The CNN AE model is found to require more resources than are available.

With this work, we have identified and finalized the necessary ingredients to deploy (V)AEs in the L1T of the LHC experiments for Run 3, to accelerate the search for unexpected signatures of new physics.

IX. CODE AVAILABILITY

The QKeras library is available at github.com/google/qkeras; the work presented here uses QKeras version 0.9.0. The hls4ml library with the custom layers used in this paper is available under the AE_L1_paper branch at https://github.com/fastmachinelearning/hls4ml/tree/AE_L1_paper.

X. DATA AVAILABILITY

The data used in this study are openly available on Zenodo [47–50, 52].

TABLE III. Resource utilization and latency for the quantized and pruned DNN and CNN (V)AE models. Resources are based on the Vivado estimates from Vivado HLS 2020.1 for a clock period of 5 ns on a Xilinx VU9P.

Model                DSP [%]  LUT [%]  FF [%]  BRAM [%]  Latency [ns]  II [ns]
DNN AE QAT 8 bits       2        5       1       0.5          130         5
CNN AE QAT 4 bits       8       47       5       6           1480       895
DNN VAE PTQ 8 bits      1        3       0.5     0.3           80         5
CNN VAE PTQ 8 bits     10       12       4       2            365       115

TABLE IV. Resource utilization and latency for the quantized and pruned DNN VAE model. Resources are based on the Vivado estimates from Vivado HLS 2020.1 for a clock period of 5 ns on a Xilinx Virtex-7 690.

Model                DSP [%]  LUT [%]  FF [%]  BRAM [%]  Latency [ns]  II [ns]
DNN VAE PTQ 8 bits      3        9       3       0.4          205         5

XI. AUTHOR INFORMATION

Correspondence and material requests can be e-mailed to E. Govorkova ([email protected]).

ACKNOWLEDGEMENTS

This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant Agreement No. 772369) and the ERC-POC programme (Grant No. 996696).

[1] LHC Machine. JINST 3, S08001 (2008).
[2] Aad, G. et al. The ATLAS Experiment at the CERN Large Hadron Collider. JINST 3, S08003 (2008).
[3] Chatrchyan, S. et al. The CMS Experiment at the CERN LHC. JINST 3, S08004 (2008).
[4] Sirunyan, A. M. et al. Performance of the CMS Level-1 trigger in proton-proton collisions at √s = 13 TeV. J. Instrum. 15, P10017 (2020). 2006.10165.
[5] The Phase-2 upgrade of the CMS Level-1 trigger. CMS Technical Design Report CERN-LHCC-2020-004, CMS-TDR-021 (2020). URL https://cds.cern.ch/record/2714892.
[6] Aad, G. et al. Operation of the ATLAS trigger system in Run 2. J. Instrum. 15, P10004 (2020). 2007.12539.
[7] Technical Design Report for the Phase-II Upgrade of the ATLAS TDAQ System. ATLAS Technical Design Report CERN-LHCC-2017-020, ATLAS-TDR-029 (2017). URL https://cds.cern.ch/record/2285584.
[8] Aad, G. et al. Observation of a new particle in the search for the standard model Higgs boson with the ATLAS detector at the LHC. Phys. Lett. B 716, 1 (2012). 1207.7214.
[9] Chatrchyan, S. et al. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Phys. Lett. B 716, 30 (2012). 1207.7235.
[10] Aarrestad, T. et al. The Dark Machines anomaly score challenge: Benchmark data and model independent event classification for the Large Hadron Collider (2021). 2105.14027.
[11] Kasieczka, G. et al. The LHC Olympics 2020: A community challenge for anomaly detection in high energy physics (2021). 2101.08320.
[12] Cerri, O. et al. Variational autoencoders for new physics mining at the Large Hadron Collider. JHEP 05, 036 (2019). 1811.10276.
[13] Knapp, O. et al. Adversarially learned anomaly detection on CMS Open Data: re-discovering the top quark. Eur. Phys. J. Plus 136, 236 (2021). 2005.01598.
[14] CMS Exotica hotline leads hunt for exotic particles (2010). URL https://www.symmetrymagazine.org/breaking/2010/06/24/cms-exotica-hotline-leads-hunt-for-exotic-particles.
[15] Poppi, F. Is the bell ringing? Exotica: à l'affût des événements exotiques 14 (2010). URL http://cds.cern.ch/record/1306501.
[16] Duarte, J. et al. Fast inference of deep neural networks in FPGAs for particle physics. JINST 13, P07027 (2018). 1804.06913.
[17] Ngadiuba, J. et al. Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml. Mach. Learn.: Sci. Technol. (2020). 2003.06308.
[18] Iiyama, Y. et al. Distance-weighted graph neural networks on FPGAs for real-time particle reconstruction in high energy physics. Front. Big Data 3, 598927 (2020). 2008.03601.
[19] Aarrestad, T. et al. Fast convolutional neural networks on FPGAs with hls4ml. Mach. Learn.: Sci. Technol. 2, 045015 (2021). 2101.05108.
[20] Heintz, A. et al. Accelerated charged particle tracking with graph neural networks on FPGAs. In 34th Conference on Neural Information Processing Systems (2020). 2012.01563.
[21] Summers, S. et al. Fast inference of boosted decision trees in FPGAs for particle physics. JINST 15, P05026 (2020). 2002.02534.
[22] Coelho, C. QKeras (2019). URL https://github.com/google/qkeras.
[23] Coelho, C. N. et al. Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors. Nat. Mach. Intell. (2021). 2006.10159.
[24] Kingma, D. P. & Welling, M. Auto-encoding variational Bayes (2014). 1312.6114.
[25] Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, vol. 32, 1278 (2014). 1401.4082.
[26] LeCun, Y., Denker, J. S. & Solla, S. A. Optimal brain damage. In Touretzky, D. S. (ed.) Advances in Neural Information Processing Systems, vol. 2, 598 (Morgan-Kaufmann, 1990). URL http://papers.nips.cc/paper/250-optimal-brain-damage.
[27] Han, S., Mao, H. & Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Bengio, Y. & LeCun, Y. (eds.) 4th International Conference on Learning Representations, San Juan, Puerto Rico, May 2, 2016 (2016). 1510.00149.
[28] Blalock, D., Ortiz, J. J. G., Frankle, J. & Guttag, J. What is the state of neural network pruning? In Dhillon, I., Papailiopoulos, D. & Sze, V. (eds.) Proceedings of Machine Learning and Systems, vol. 2, 129 (2020). URL https://proceedings.mlsys.org/paper/2020/file/d2ddea18f00665ce8623e36bd4e3c7c5-Paper.pdf. 2003.03033.
[29] Moons, B., Goetschalckx, K., Berckelaer, N. V. & Verhelst, M. Minimum energy quantized neural networks. In Matthews, M. B. (ed.) 2017 51st Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, October 29, 2017, 1921 (2017). 1711.00215.
[30] Courbariaux, M., Bengio, Y. & David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, 3123 (Curran Associates, Inc., 2015). URL https://proceedings.neurips.cc/paper/2015/file/3e15cc11f979ed25912dff5b0669f2cd-Paper.pdf. 1511.00363.
[31] Zhang, D., Yang, J., Ye, D. & Hua, G. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In Ferrari, V., Hebert, M., Sminchisescu, C. & Weiss, Y. (eds.) Proceedings of the European Conference on Computer Vision, Munich, Germany, September 8, 2018, 373 (2018). 1807.10029.
[32] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R. & Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18, 1 (2018). URL http://jmlr.org/papers/v18/16-456.html. 1609.07061.
[33] Rastegari, M., Ordonez, V., Redmon, J. & Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In 14th European Conference on Computer Vision (ECCV), 525 (Springer International Publishing, Cham, Switzerland, 2016). 1603.05279.
[34] Micikevicius, P. et al. Mixed precision training. In 6th International Conference on Learning Representations, Vancouver, BC, Canada, April 30, 2018 (2018). URL https://openreview.net/forum?id=r1gs9JgRZ. 1710.03740.
[35] Zhuang, B., Shen, C., Tan, M., Liu, L. & Reid, I. Towards effective low-bitwidth convolutional neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, June 18, 2018, 7920 (2018). 1711.00205.
[36] Wang, N., Choi, J., Brand, D., Chen, C.-Y. & Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. In Bengio, S. et al. (eds.) Advances in Neural Information Processing Systems, vol. 31, 7675 (Curran Associates, Inc., 2018). URL https://proceedings.neurips.cc/paper/2018/file/335d3d1cd7ef05ec77714a215134914c-Paper.pdf. 1812.08011.
[37] An, J. & Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2, 1 (2015).
[38] Nagel, M., van Baalen, M., Blankevoort, T. & Welling, M. Data-free quantization through weight equalization and bias correction. In 2019 IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, October 27, 2019, 1325 (2019). 1906.04721.
[39] Meller, E., Finkelstein, A., Almog, U. & Grobman, M. Same, same but different: Recovering neural network quantization error through weight factorization. In Chaudhuri, K. & Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, June 9, 2019, Long Beach, CA, USA, vol. 97, 4486 (PMLR, 2019). URL http://proceedings.mlr.press/v97/meller19a.html. 1902.01917.
[40] Zhao, R., Hu, Y., Dotzel, J., Sa, C. D. & Zhang, Z. Improving neural network quantization without retraining using outlier channel splitting. In Chaudhuri, K. & Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, June 9, 2019, Long Beach, CA, USA, vol. 97, 7543 (PMLR, 2019). URL http://proceedings.mlr.press/v97/zhao19c.html. 1901.09504.
[41] Banner, R., Nahshan, Y., Hoffer, E. & Soudry, D. Post-training 4-bit quantization of convolution networks for rapid-deployment. In Wallach, H. et al. (eds.) Advances in Neural Information Processing Systems, vol. 32, 7950 (Curran Associates, Inc., 2019). URL https://proceedings.neurips.cc/paper/2019/file/c0a62e133894cdce435bcb4a5df1db2d-Paper.pdf. 1810.05723.
[42] Shin, S., Boo, Y. & Sung, W. Knowledge distillation for optimization of quantized deep neural networks. In 2020 IEEE Workshop on Signal Processing Systems (SiPS), 1 (2020).
[43] Polino, A., Pascanu, R. & Alistarh, D. Model compression via distillation and quantization. In International Conference on Learning Representations (2018). URL https://openreview.net/forum?id=S1XolQbRW.
[44] Gao, M. et al. An embarrassingly simple approach for knowledge distillation (2019). 1812.01819.
[45] Mishra, A. & Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In International Conference on Learning Representations (2018). URL https://openreview.net/forum?id=B1ae1lZRb.
[46] Nguyen, T. Q. et al. Topology classification with deep learning to improve real-time event selection at the LHC. Comput. Softw. Big Sci. 3, 12 (2019). 1807.00083.
[47] Govorkova, E. et al. Unsupervised new physics detection at 40 MHz: LQ → bτ signal benchmark dataset. Zenodo https://doi.org/10.5281/zenodo.5055454 (2021).
[48] Govorkova, E. et al. Unsupervised new physics detection at 40 MHz: A → 4 leptons signal benchmark dataset. Zenodo https://doi.org/10.5281/zenodo.5046446 (2021).
[49] Govorkova, E. et al. Unsupervised new physics detection at 40 MHz: h0 → ττ signal benchmark dataset. Zenodo https://doi.org/10.5281/zenodo.5061633 (2021).
[50] Govorkova, E. et al. Unsupervised new physics detection at 40 MHz: h+ → τν signal benchmark dataset. Zenodo https://doi.org/10.5281/zenodo.5061688 (2021).
[51] Govorkova, E. et al. LHC physics dataset for unsupervised new physics detection at 40 MHz (2021). 2107.02157.
[52] Govorkova, E. et al. Unsupervised new physics detection at 40 MHz: Training dataset. Zenodo https://doi.org/10.5281/zenodo.5046389 (2021).
[53] Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Bach, F. & Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning, vol. 37, 448 (PMLR, 2015). URL http://proceedings.mlr.press/v37/ioffe15.html. 1502.03167.
[54] Maas, A. L., Hannun, A. Y. & Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing (2013). URL https://sites.google.com/site/deeplearningicml2013/relu_hybrid_icml2013_final.pdf?attredirects=0&d=1.
[55] Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Fürnkranz, J. & Joachims, T. (eds.) ICML, 807 (Omnipress, 2010). URL http://dblp.uni-trier.de/db/conf/icml/icml2010.html#NairH10.
[56] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. ArXiv e-prints (2014). 1412.6980.
[57] Joyce, J. M. Kullback-Leibler Divergence, 720–722 (Springer Berlin Heidelberg, 2011). URL https://doi.org/10.1007/978-3-642-04898-2_327.
[58] Higgins, I. et al. beta-VAE: Learning basic visual concepts with a constrained variational framework (2016).
[59] Chollet, F. et al. Keras (2015). URL https://keras.io.
[60] Xilinx. Vivado Design Suite user guide: High-level synthesis (2020). URL https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug902-vivado-high-level-synthesis.pdf.
[61] EMP Collaboration. emp-fwk homepage (2019). URL https://serenity.web.cern.ch/serenity/emp-fwk/.
