Fast kernel methods for data quality monitoring as a goodness-of-fit test

Published 25 August 2023 © 2023 The Author(s). Published by IOP Publishing Ltd
Citation: Gaia Grosso et al 2023 Mach. Learn.: Sci. Technol. 4 035029. DOI: 10.1088/2632-2153/acebb7

Abstract

We propose an accurate and efficient machine learning approach for monitoring particle detectors in real-time. The goal is to assess the compatibility of incoming experimental data with a reference dataset, characterising the data behaviour under normal circumstances, via a likelihood-ratio hypothesis test. The model is based on a modern implementation of kernel methods, nonparametric algorithms that can learn any continuous function given enough data. The resulting approach is efficient and agnostic to the type of anomaly that may be present in the data. Our study demonstrates the effectiveness of this strategy on multivariate data from drift tube chamber muon detectors.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Modern high-energy physics experiments operating at colliders are extremely sophisticated devices consisting of millions of sensors sampled every few nanoseconds, producing an enormous throughput of complex data. Several types of technologies are employed for identifying and measuring the particles originating in the collisions; in all cases, the environmental conditions are severe, making the required performance challenging to achieve. Although the various subsystems are designed to offer redundancy, measurements can be undermined by malfunctions of parts of the experiment, either because of critical inefficiencies or because of possibly misinterpreted spurious signals. In addition to supervising the status (powering, electronic configuration, temperature, etc) of the various hardware components, data from all sources must thus be monitored continuously to assess their quality and to promptly detect any faults, possibly providing indications about their causes. Given the rate of tens of MHz at which data are gathered and the number of sensors to be checked, the monitoring process needs to be as automated as possible: approaches based on machine learning (ML) techniques are particularly appealing for this task and have started being employed by the experimental collaborations [1–4], complementing more traditional methods [5–9]. Data quality monitoring (DQM) consists, in essence, of comparing batches of data with corresponding reference samples gathered in nominal conditions; departures from the latter can then be analysed to identify their origin. The data processing must fit the computational constraints imposed by the frequency at which batches are delivered and by their size, with the latter depending on the granularity with which sensors are grouped and the statistical uncertainty aimed at.

In this work, we present the application of a methodology developed in the context of model-independent searches for new physics [10–12]—specifically of its kernel methods implementation [13] based on the Falkon [14] library—as an efficient and effective DQM tool. The method (dubbed New Physics Learning Machine (NPLM), see section 3) implements a goodness-of-fit statistical test in many dimensions. It leverages the ability of classifiers to infer the underlying data-generating distributions in order to estimate the likelihood ratio, eventually assessing goodness-of-fit by a hypothesis test based on the likelihood-ratio test statistic. At variance with anomaly detection methods, which are more common in the ML literature, NPLM is not only sensitive to the presence of outliers in the data. It searches for discrepancies in the statistical distribution of the data, relative to the expected (reference) distribution. It is thus also sensitive to data discrepancies that are due to an anomalous statistical concentration of multiple events, each of which is not an outlier but on the contrary very typical of the reference data population. The Falkon-based implementation of NPLM offers tremendous advantages in terms of training time compared to the one based on neural networks. It can thus be used for DQM.

Our proposed approach addresses two main limitations of conventional DQM methods. First, it introduces a sophisticated multivariate statistical data analysis strategy for the assessment of the data quality. Conventional methods typically employ one or a few one-dimensional (1D) marginal distributions; their sensitivity to anomalies therefore depends critically on the choice of input variables and does not exploit anomalies in their correlations. A key feature of NPLM is instead the capability of examining the phase space as a whole, in many dimensions, exploiting correlations to boost the sensitivity to anomalous data. For the same reason, it will be possible to run NPLM on low-level quantities that require limited pre-processing. This can be advantageous for DQM, as it allows the algorithm to deal with almost raw data from the detectors’ electronic front-ends, thus limiting the bias introduced by further manipulations that could hide issues in the data.

Second, our proposal can seed a standardisation of the design of DQM methodologies. Current strategies are designed specifically for each detector system and monitoring task, while NPLM is a universal method with a wide range of applicability. An intelligent ad hoc selection of the relevant physical variables will still be beneficial for the sensitivity, but NPLM provides a universal framework to assess the agreement of their distribution with the reference. The availability of a standardised approach based on NPLM could also be exploited to validate the adequacy (or establish the inadequacy) of conventionally-designed approaches.

In order to test the effectiveness of NPLM for DQM, we exploit an experimental setup over which we have full control, consisting of a reduced-size version of the muon chambers installed in the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC). The setup is operated as a cosmic muon telescope. As explained later, scaling tests are performed to assess the performance of the DQM algorithm in view of its possible deployment during standard LHC operations.

The paper is organised as follows. In the next section, we introduce the experimental setup and the algorithm input variables. These include a reference data set collected under standard conditions and smaller samples with anomalous controlled behaviours. The ML model and our core strategy are then described in section 3, whereas an overview of the results is given in section 4. Finally, the last section is devoted to conclusions and further developments.

2. Experimental setup

For this research, we exploited an experimental apparatus consisting of a set of drift tube (DT) chambers housed at the Legnaro INFN National Laboratory (figure 1, left). These chambers are smaller-sized copies of those deployed in the CMS experiment at the LHC [15]. The basic element of a DT chamber is a 70 cm long tube with a cross section of $4\times2.1\ \mathrm{cm}^2$ (figure 1, bottom right). Inside each tube, an electric field is produced by an anodic wire laid in the centre and two cathodic strips (cathodes) on the sides; the former is set at a voltage of 3.6 kV, the latter at −1.2 kV. An additional pair of strips at 1.8 kV is placed above and below the wire to improve the homogeneity of the field. The tubes are filled with a mixture of argon and carbon dioxide gas (85%–15%) that gets ionised by charged particles passing through it. The produced electrons drift towards the wire at a constant velocity along the field lines, where they are collected. For each tube, the front-end electronics record the arrival time of the signals, amplify them, and filter out noise below a specific threshold (nominally 100 mV).

Figure 1. Left: the experimental apparatus at Legnaro Laboratory, with four drift-tube chambers, vertically stacked. Right: a schematic view of the cell (bottom) and an example of hit pattern left by a charged particle crossing a chamber (top).

A DT chamber consists of 64 tubes arranged in four layers of 16 tubes each. The layers are staggered horizontally by half a cell. The setup at Legnaro records muons from cosmic rays, which occur at a rate of about 1 per minute per cm$^2$ at sea level. Data acquisition occurs continuously at a rate of 40 MHz, without the need for any trigger logic. An external time reference is provided by plastic scintillators placed in between the DT chambers; the corresponding information is added to the data stream and used in the following analysis steps.

Thanks to the homogeneity of the electric field, the particle’s position within each tube (with a left-right ambiguity) is linearly dependent on the drift time. Namely, the distance of the muon track from the wire reads

$d = v_\mathrm{d}\, t = v_\mathrm{d}\left(t_\mathrm{hit} - t_0\right), \qquad (1)$

with $t_\mathrm{hit}$ the time associated with each signal in a tube (called a hit). The two parameters are the drift velocity $v_\mathrm{d}$, known by means of a calibration procedure (in our case, $v_\mathrm{d} = 53\, \mu$m ns$^{-1}$), and the time pedestal $t_0$, which can be deduced from the timing information provided by the scintillators 10 . The drift time t is obtained as the difference between $t_\mathrm{hit}$ and the time pedestal.

The hits occurring in a time window of $90\,\mu$s centred around the signal provided by the scintillators are grouped in quadruplets (with one hit pertaining to each of the four layers as in figure 1, right top). Then, a linear fit is performed on each of the quadruplets and the candidate muon track is obtained from the combination yielding the best χ2. In this way the trajectory of the muon in the plane transverse to the tubes is determined, with a precision on the position of about 180 µm and on the slope of about 1 mrad. Tracks from various DT chambers can be combined to determine the 3D muon trajectory; in the following we will however consider only the 2D measurement.
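
As a rough illustration of the reconstruction chain just described, the following minimal numpy sketch converts four drift times into left/right position candidates via equation (1) and keeps the straight-line fit with the best χ2. It is a simplified stand-in for the actual reconstruction software: the wire coordinates, function names and units are ours.

```python
import numpy as np

# Minimal sketch (not the experiment's actual software): equation (1) turns drift
# times into distances from the wire, then the best straight-line fit over all
# left/right combinations of the quadruplet is kept.

V_DRIFT = 0.053  # drift velocity in mm/ns (53 um/ns, as quoted in the text)

def hit_candidates(t_hits, t0, wire_x, wire_y):
    """Left/right position candidates for the four hits of a quadruplet."""
    d = V_DRIFT * (np.asarray(t_hits, dtype=float) - t0)   # equation (1)
    return [[(xw - di, yw), (xw + di, yw)]                  # left-right ambiguity
            for xw, yw, di in zip(wire_x, wire_y, d)]

def best_track(candidates):
    """Fit x = a*y + b for every left/right choice and keep the smallest chi2."""
    best = None
    for choice in np.ndindex(*(2,) * len(candidates)):
        pts = np.array([candidates[i][c] for i, c in enumerate(choice)])
        x, y = pts[:, 0], pts[:, 1]
        coeffs, residuals, *_ = np.polyfit(y, x, 1, full=True)
        chi2 = residuals[0] if residuals.size else 0.0
        if best is None or chi2 < best[0]:
            best = (chi2, coeffs[0], coeffs[1])
    return best   # (chi2, slope, intercept); theta = arctan(slope)
```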

If the detector conditions are anomalous, the efficiency and accuracy of the muon track reconstruction may be compromised. Ensuring the proper operation of the detector thus requires monitoring the quality of the recorded data. In what follows, we consider six basic quantities related to the passage of a muon through a DT chamber:

  • Drift times $t_i$: the four drift times associated with the muon track. The drift time distribution is displayed in figure 2 in different ranges for the muon track angle θ (or slope, see the next item), showing the correlation between these two variables. The $t_i$ distributions are also reported in figures 3 and 4.
  • Slope θ: the angle formed by the muon track with the vertical axis. The chamber efficiency is expected to drop beyond $|\theta|\sim 40^{\circ}$, as seen in figures 3 and 4.
  • Number of hits $n_\mathrm{Hits}$: the number of hits recorded in a time window of one second around the muon crossing time. Many spurious hits are present in addition to those due to the passage of a muon. The noise rate depends on the environmental conditions: at the LHC it is orders of magnitude larger than in our laboratory in Legnaro. The rate of recorded spurious hits can also be affected by issues related to the detector operating conditions.

Figure 2. Drift time distribution in different θ ranges.

Figure 3. The distribution of the input features in the reference and in three anomalous working conditions of the cathode voltages.

Figure 4. The distribution of the input features in the reference and in three anomalous working conditions of the thresholds.

The six variables $x = \{t_1,\ldots,t_4,\theta,n_\mathrm{Hits}\}$ will be the input features of the NPLM algorithm for DQM, described in the next sections. Notice that the data are gathered from the subset of tubes in a single chamber that geometrically matches the scintillators, i.e. about three tubes per layer.

We collected the data by artificially inducing possible issues that can occur during detector operations. Specifically, we reduced the voltage of the cathodic strips to 75%, 50%, and 25% of their nominal value (−900 V, −600 V, and −300 V, respectively), and we lowered the front-end thresholds to 75%, 50%, and 25% of their nominal value (75 mV, 50 mV, and 25 mV, respectively). The former action distorts the electric field shape, whereas the latter mimics the sudden contribution of noise sources. We conducted a dedicated data acquisition campaign in these six anomalous configurations, collecting around $10^4$ events for each configuration. We also collected around $3\times 10^5$ data points in the normal (or reference) working conditions of the apparatus. The distributions of the six input features for the reference data and the data collected under the different anomalous conditions are shown in figure 3 (variation of the cathode voltages) and figure 4 (variation of the thresholds). These data will be used to design and calibrate the DQM algorithm, as described in the following section.

3. Methodology

In the setup described in the previous section, we are interested in assessing the quality of individual batches of data collected by the apparatus, each denoted as ${\cal{D}} = \{x_i\}_{i = 1}^{{N}_{\cal{D}}}$. Namely, we ask whether the statistical distribution of the data points in ${\cal{D}}$ coincides or not with the one expected under reference working conditions, $p(x|\textrm{{R}})$. We thus aim at performing what is known in statistics as a goodness-of-fit test. See [17] for references and a concise overview.

The reference distribution $p(x|\textrm{{R}})$ is not available in closed form. What is available is instead a second dataset ${\cal{R}} = \{x_i\}_{i = 1}^{{N}_{\cal{R}}}$ collected by the same apparatus when operated in the reference working conditions, such that the data in ${\cal{R}}$ do follow the $p(x|\textrm{{R}})$ distribution. Our goodness-of-fit test is thus carried out by comparing the two datasets ${\cal{D}}$ and ${\cal{R}}$, asking whether they are sampled from the same statistical distribution. The problem can then be formulated as a two-sample test, in which, however, ${\cal{D}}$ and ${\cal{R}}$ play asymmetric roles.

The data batch ${\cal{D}}$ is what needs to be tested. Therefore its composition and its size, ${N}_{\cal{D}}$, are among the specification requirements of the DQM methodology we are developing. ${N}_{\cal{D}}\sim 1000$ is in the ballpark of what is typically considered by DQM applications deployed at CMS.

The reference dataset ${\cal{R}}$ is instead created within the methodology design, with mild or no limitation on its size, ${N}_{\cal{R}}$. A larger ${\cal{R}}$ dataset offers a more faithful representation of the underlying reference statistical distribution and therefore a more accurate test. Furthermore, taking ${N}_{\cal{R}}$ larger than ${N}_{\cal{D}}$ reduces the effect of the statistical fluctuations of the ${\cal{R}}$ dataset on the outcome of the test, leaving only those inherently due to the fluctuations of ${\cal{D}}$. This makes the outcome for a given data batch ${\cal{D}}$ nearly independent of the specific instance of the set ${\cal{R}}$ that is employed for the test, making the result more robust. In what follows, we will thus preferentially consider an unbalanced setup for the two datasets, with ${N}_{\cal{R}}\gt{N}_{\cal{D}}$. We will further exploit the availability of a relatively large volume of data collected under the reference working conditions for calibrating the test statistic and for selecting the hyperparameters, as discussed in the following.

The availability of a large set of data that are accurately labelled as having been collected under the reference detector conditions deserves further comment. These data are routinely available, in particular in high-energy physics experiments, and are in fact used for the design and calibration of regular DQM methods [5–9]. They are validated by a careful offline inspection, which typically requires human intervention. This validation process is far too demanding and slow to be employed as a DQM algorithm. The purpose of DQM is in fact to monitor the data quality online, i.e. while they are being collected. The offline validation is instead straightforwardly capable of producing labelled reference data samples that are far larger than individual data batches.

3.1. The NPLM method

We employ the NPLM method, which was proposed and developed by some of us [10–13] to address a similar problem in the different context of searches for new physical laws at collider experiments. The search for new physics is performed by comparing the measured data with a reference dataset whose statistical distribution is the one predicted by a standard set of physical laws that supposedly describe the experimental setup. The purpose of the comparison is not to assess the quality of the data as in DQM, but the quality of the distribution prediction, and in turn to check whether the standard laws are adequate or, instead, new physical laws are needed to model the experimental setup. However, this conceptual difference does not have practical consequences. The NPLM setup of ${\cal{D}}$ versus ${\cal{R}}$ data comparison is straightforwardly portable to DQM problems.

The NPLM method design is inspired by the classical approach to hypothesis testing based on the likelihood ratio [18]. A model $f_{\textbf{w}}(x)$ acting on the space of data x, with trainable parameters ${{\textbf{w}}}$, is employed to define a set of alternatives to $p(x|\textrm{{R}})$ for the distribution of the data points in ${\cal{D}}$. Since the alternative hypothesis depends on ${{\textbf{w}}}$, we denote it as $\textrm{{H}}_{{{\textbf{w}}}}$ and $p(x|\textrm{{H}}_{{{\textbf{w}}}})$ is the alternative distribution of x. In particular, $f_{{\textbf{w}}}(x)$ directly parametrises the logarithm of the ratio between $p(x|\textrm{{H}}_{{{\textbf{w}}}})$ and $p(x|\textrm{{R}})$. The model $f_{{\textbf{w}}}(x)$ could be a neural network as in [10–12], or it could be built with kernel methods [13]. We will employ the latter option for reasons that will become clear soon. The model is trained by adjusting its parameters to best accommodate the observed data. Consequently, the trained parameters ${\widehat{\mathbf{w}}}$ define the best-fit hypothesis $\textrm{{H}}_{\widehat{\mathbf{w}}}$. Following [18], the test statistic variable to be employed for the assessment of the quality of the data ${\cal{D}}$ is 11

$t({\cal{D}}) = 2\,\log\frac{{\cal{L}}(\textrm{{H}}_{\widehat{\mathbf{w}}})}{{\cal{L}}(\textrm{{R}})} = 2\sum_{x\in{\cal{D}}} f_{\widehat{\mathbf{w}}}(x). \qquad (2)$
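
In code, and assuming the form of equation (2) given above, the test statistic is simply twice the sum over the batch of the learned log-ratio; a minimal numpy transcription (function name ours) is:

```python
import numpy as np

def test_statistic(f_hat_on_batch):
    """Equation (2): twice the sum over the batch of the learned log-ratio f."""
    return 2.0 * float(np.sum(f_hat_on_batch))
```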

In order to train the model we exploit a classical result of statistical learning: a continuous-output classifier trained to tell apart two datasets approximates—possibly up to a given monotonic transformation—the log ratio between the probability distributions of the two training sets. This property is proven explicitly in e.g. [10, 13] for the weighted logistic loss

$\ell\left(y, f_{{\textbf{w}}}(x)\right) = (1-y)\,\frac{{N}_1}{{N}_0}\,\log\left(1+e^{f_{{\textbf{w}}}(x)}\right) + y\,\log\left(1+e^{-f_{{\textbf{w}}}(x)}\right). \qquad (3)$

By assigning label y = 0 to the data in ${\cal{R}}$, and y = 1 to those in ${\cal{D}}$, the model $f_{\widehat{\mathbf{w}}}(x)$ trained with the loss in equation (3) approaches the logarithm of ${p(x|\textrm{{H}}_{\widehat{\mathbf{w}}})}/{p(x|\textrm{{R}})}$, as needed in equation (2). The weight factors in equation (3), which depend on ${N}_1/{N}_0 = {N}_{\cal{D}}/{N}_{\cal{R}}$, are included because the two training datasets are unbalanced, as previously explained.
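
For concreteness, a numpy transcription of the weighted logistic loss of equation (3), under the labelling convention just described (y = 0 for ${\cal{R}}$, y = 1 for ${\cal{D}}$), could look as follows; this is an illustrative sketch with names of our choosing, not the Falkon implementation.

```python
import numpy as np

def weighted_logistic_loss(y, f, n_d, n_r):
    """Weighted logistic loss of equation (3), summed over the training sample.

    y: labels (0 for reference R, 1 for data D); f: model outputs f_w(x);
    n_d, n_r: sizes of the D and R samples (the weight N_1/N_0 = n_d/n_r).
    """
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    ref_term = (n_d / n_r) * np.logaddexp(0.0, f)    # (1 - y) * log(1 + e^f)
    data_term = np.logaddexp(0.0, -f)                # y * log(1 + e^-f)
    return float(np.sum((1.0 - y) * ref_term + y * data_term))
```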

A direct application of the classical theory of hypothesis testing [18] would actually suggest employing a different loss function. In fact, the best-fit parameters ${\widehat{\mathbf{w}}}$ to be used in the definition of the test statistic (2) should be those that maximise the likelihood function. Minimising the logistic loss produces instead an estimate of the best-fit parameters that is different, a priori, from the maximum likelihood estimate. This can be remedied by employing a special loss function called the ‘maximum likelihood loss’, whose minimisation is equivalent to maximising the likelihood [10]. The maximum likelihood loss is not used in the kernel-based implementation of NPLM [13], and the logistic loss (3) is preferred for practical reasons. No strong performance degradation has been observed using the logistic loss in place of the maximum likelihood loss in the tests of the NPLM method performed so far.

Using the elements above, the design of the NPLM method for DQM works as follows. We first pick a model for $f_{{\textbf{w}}}(x)$ and select its hyperparameters. The hyperparameter selection strategy is described in the next section for the kernel-based implementation of NPLM. Next, we need to calibrate the test statistic (2) in order to be able to associate its value $t({\cal{D}})$ with a probability $\textrm{{p}}[t({\cal{D}})]$, the p-value. This probability will be the output of the DQM algorithm. Based on its value, the analyser will eventually judge the quality of each data batch ${\cal{D}}$. For instance, the analyser might define a probability threshold, below which the data batch is discarded or set apart for further analyses. Above the threshold, the batch could be retained as a good batch.

It should be noted that the selected hyperparameters and the p-value do depend on the detailed setup of the DQM problem under consideration. For instance, different hyperparameters will be used in section 4 for the setup with five input features and data batch size ${N}_{\cal{D}} = 1000$ than in the case of six features and ${N}_{\cal{D}} = 500$. The p-value calibration function ${p}[t]$ will also be different. However, once these elements are made available for a given setup, they can be used to evaluate the quality of all the ${\cal{D}}$ batches in that setup. The only operation that the DQM algorithm has to perform at run-time is one single training of ${\cal{D}}$ against ${\cal{R}}$, out of which $t({\cal{D}})$ is obtained and in turn ${p}[t({\cal{D}})]$.

Calibration is performed as follows. The test statistic (2) is preferentially large and positive if the best-fit alternative distribution $p(x|\textrm{{H}}_{\widehat{\mathbf{w}}})$ accommodates the data better than the reference distribution $p(x|\textrm{{R}})$ does, signalling that the data batch is likely not drawn from $p(x|\textrm{{R}})$. Large $t({\cal{D}})$ should thus correspond to a small probability. The precise correspondence is established by comparison with the typical values that t attains when the data batch is instead a good batch. We thus compute the distribution, $p(t|\textrm{{R}})$, that the t variable possesses when the data follow the reference statistical distribution, and the p-value is defined as

$\textrm{{p}}\left[t({\cal{D}})\right] = \int_{t({\cal{D}})}^{\infty} p(t|\textrm{{R}})\,\mathrm{d}t. \qquad (4)$

The physical meaning of ${p}[t({\cal{D}})]$ is the probability that a good data batch gives a value of t that is more unlikely (i.e. larger) than the value $t({\cal{D}})$ produced by the batch ${\cal{D}}$. If a threshold is set on p, this threshold measures the frequency at which good data batches are not recognised as such by the algorithm.

The $p(t | \textrm{{R}})$ distribution is straightforwardly estimated empirically, thanks to the availability of reference-distributed labelled data points. We create several artificial data batches—called Toy datasets—of the same size ${N}_{\cal{D}}$ as the true batches. We run the training and compute t on each of them. Each Toy dataset should be statistically independent, and independent from the reference dataset ${\cal{R}}$ that is employed for training. A very large sample of reference-distributed data is thus used in order to produce both the Toy batches and the reference dataset. By histogramming the values of t computed on the Toys we could easily obtain an estimate of $p(t|R)$ and hence of ${p}[t]$. A different procedure is adopted here, exploiting the empirical observation [13] that $p(t|\textrm{{R}})$ is well approximated by a chi-squared (χ2) distribution. The number of degrees of freedom of the χ2 depends on the setup but can be determined by fitting to the empirical distribution of the t values computed on the Toys. The survival function (one minus the cumulative) of the corresponding χ2 distribution will be used as an estimate of ${p}[t]$. It should be noted that by proceeding in this way we will be formally able to compute very small p-values that correspond to highly-discrepant data batches with very large $t({\cal{D}})$. However, the agreement of $p(t|\textrm{{R}})$ with the χ2 cannot be verified in the high t region, which the Toys do not populate, and there is no theoretical reason to expect that this agreement will persist in that region. Our quantification of the p-value is thus only accurate in the region that the Toys statistically populate. For instance, if 300 Toys are thrown, only p-values larger than around $1/300$ are accurately computed. If $t({\cal{D}})$ falls in a region where our determination of p is much smaller than that, ours should be regarded as a reasonable estimate that is particularly useful to compare the level of discrepancy of different batches, but it cannot be directly validated. However, in those cases we will be able to ensure that ${p}[t({\cal{D}})]\lesssim1/300$ by directly comparing with the t values on the Toys.
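
A sketch of this calibration step, assuming an array t_toys of test-statistic values obtained from reference-distributed Toy batches (variable and function names are ours), could look as follows.

```python
import numpy as np
from scipy import stats

def fit_chi2_dof(t_toys):
    """Fit the degrees of freedom of a chi2 to the empirical t distribution
    (location and scale are kept fixed to 0 and 1)."""
    dof, _, _ = stats.chi2.fit(np.asarray(t_toys), floc=0, fscale=1)
    return dof

def p_value(t_obs, dof):
    """p-value of equation (4), using the chi2 survival function as p(t|R)."""
    return stats.chi2.sf(t_obs, df=dof)

# Example: with 300 Toys, p-values below ~1/300 are extrapolations and should
# only be used to rank the discrepancy of different batches.
# dof = fit_chi2_dof(t_toys)
# p   = p_value(t_of_batch, dof)
```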

Another feature of the NPLM approach is the possibility of exploiting the function $f_{{\widehat{\mathbf{w}}}}$ learned during the training task to characterise anomalous batches of data. The function $f_{{\widehat{\mathbf{w}}}}$ represents the log-ratio between $p(x|\textrm{{H}}_{\widehat{\mathbf{w}}})$ and $p(x|\textrm{{R}})$ and, hence, can be used to deform and adapt the reference distribution to the data by reweighting, according to the following expression

$p(x|\textrm{{H}}_{\widehat{\mathbf{w}}}) = e^{f_{\widehat{\mathbf{w}}}(x)}\, p(x|\textrm{{R}}). \qquad (5)$

The function $\exp(f_{{\widehat{\mathbf{w}}}}(x))$ will be close to one if the data are well described by the reference distribution, while it will depart from it otherwise. One should therefore be able to gain additional information about the anomalous batch by inspecting this quantity as a function of the input variables, or of any combination of them, even when not explicitly provided as an input feature for the training. Having access to this kind of information is a valuable element in the context of the search for new physics [10, 11, 13], since the physics-motivated variables that one might want to inspect to explain a potential anomalous score could be some nontrivial combination of the input features with a clear physical meaning, such as the invariant mass of a many-body final state. For DQM applications, this analysis is less relevant, since a direct visual inspection of the ratio between the binned data and reference marginal distributions is already quite informative and the user might not be interested in exploring specific high-level features in the first place. On the other hand, one can still exploit the possibility of reconstructing the data distribution using $f_{{\widehat{\mathbf{w}}}}$ as a debugging tool, namely to check whether the learning model correctly recognises if and how the data deviate from the reference.
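
As a sketch of how this check can be carried out for one input variable (binning and names are ours), the reference sample can be reweighted event by event with $\exp(f_{\widehat{\mathbf{w}}})$ as in equation (5) and compared, bin by bin, with the data:

```python
import numpy as np

def marginal_ratios(x_ref, x_data, f_hat_ref, bins=30):
    """Binned 'learned' and 'true' ratios to the reference marginal, following
    equation (5). f_hat_ref is the trained model evaluated on the R sample."""
    ref_hist, edges = np.histogram(x_ref, bins=bins)
    learned_hist, _ = np.histogram(x_ref, bins=edges, weights=np.exp(f_hat_ref))
    data_hist, _ = np.histogram(x_data, bins=edges)
    scale = len(x_ref) / len(x_data)        # account for N_R != N_D
    with np.errstate(divide="ignore", invalid="ignore"):
        learned = learned_hist / ref_hist               # reweighted R vs R
        true = scale * data_hist / ref_hist             # D vs R
    return edges, learned, true
```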

Moreover, somewhat aside from the main goal of the present article, the output of the NPLM-DQM application could be exploited to study those data batches that display significant deviations from the reference and, depending on the characteristics of the departures, to classify them into different anomalous categories. Further investigations on a possible extension of the application in this respect are left for future work.

3.2. Falkon-based NPLM

Applying NPLM to the DQM problem is simpler than using it for new physics searches. For new physics searches one needs to worry about imperfections in the reference data that stem from the mismodelling of the reference distribution based on the underlying standard physical laws. Including these effects in NPLM is possible but requires dedicated work and domain-specific expertise [12]. Mismodelling is not a concern in DQM problems because no modelling is required at all: the reference-distributed data are merely collected from the same experimental apparatus and not simulated. NPLM algorithms for DQM can thus be designed more easily and systematically without the need for extremely specialised domain knowledge.

DQM applications are, however, much more computationally demanding than new physics searches. For new physics searches there is typically only one dataset ${\cal{D}}$ to be analysed. For DQM, a large flow of data batches needs to be analysed online. We will see in section 5 that, for instance, on the order of 10 s are needed by the CMS muon system to collect one data batch. Our DQM algorithm must respond on a competitive timescale in order to be applicable to that problem. The relevant operation time is the one needed for a single training, as previously explained. The original implementation of NPLM based on neural networks is vastly incompatible with this requirement. On the other hand, the one based on kernel methods is much faster to train on problems of comparable scale [13]. It could thus match the specification requirements for applications to LHC detectors.

The performance of the kernel-based version of NPLM stems from that of the Falkon [14] library, the core algorithm powering our implementation. A sketch of the basic theoretical and algorithmic ideas implemented in Falkon, developed in [19–21], is reported below.

With kernel methods, one learns functions of the following form

$f_{{\textbf{w}}}(x) = \sum_{i = 1}^{{N}} w_i\, k_\sigma(x, x_i), \qquad (6)$

with ${{N}} = {N}_0+{N}_1$ the total size of the training dataset. Here $k_\sigma (x,x_i)$ is the kernel function and σ a hyperparameter. We consider the Gaussian kernel

$k_\sigma(x, x^{\prime}) = \exp\left(-\frac{\| x - x^{\prime}\|^2}{2\sigma^2}\right), \qquad (7)$

so that $f_{{\textbf{w}}}$ is a linear combination of Gaussians of fixed width σ, centred at the training data points. The optimisation of the model parameters ${{\textbf{w}}}$ is achieved by minimising the empirical risk $\hat L(f_{{\textbf{w}}})$, plus a regularisation term

$\min_{{\textbf{w}}}\ \hat L(f_{{\textbf{w}}}) + \lambda\, R(f_{{\textbf{w}}}). \qquad (8)$

The empirical risk in our case is the one associated with the logistic loss (3)

$\hat L(f_{{\textbf{w}}}) = \frac{1}{{N}}\sum_{i = 1}^{{N}} \ell\left(y_i, f_{{\textbf{w}}}(x_i)\right). \qquad (9)$

The regularisation term is given by

$R(f_{{\textbf{w}}}) = \| f_{{\textbf{w}}}\|^2 = \sum_{i,j = 1}^{{N}} w_i\, w_j\, k_\sigma(x_i, x_j). \qquad (10)$

Its relative importance in the optimisation target (8) is controlled by the hyperparameter λ.

Kernel methods are non-parametric approaches, in the sense that the number of parameters ${{\textbf{w}}}$ in equation (6) increases automatically with the total number of data points. Gaussian kernel methods are universal, meaning that they can recover any continuous function in the large sample limit [22, 23]. However, optimising the function in equation (6), with the target in equation (8), requires handling an ${{N}}\times {{N}}$ matrix—the kernel matrix—with entries $k_\sigma(x_i,x_j)$. The computational complexity of the optimisation thus scales cubically in time and quadratically in space with respect to the number of training points ${{N}}$ [14, 19]. These costs prevent the application to large-scale settings, and some approximation is needed.

Within the Falkon library, the problem of minimising equation (8) is formulated in terms of an approximate Newton method (see algorithm 2 of [14]). The algorithm is based on the Nyström approximation, which is used twice. First, to reduce the size of the problem, by considering solutions of the form

$f_{{\textbf{w}}}(x) = \sum_{i = 1}^{M} w_i\, k_\sigma(x, \tilde{x}_i), \qquad (11)$

where $\{\tilde{x}_1,\ldots, \tilde{x}_M\} \subset \{x_1,\ldots,x_N\}$ are called Nyström centres and are sampled uniformly at random from the input data. The number of centres $M\leqslant N$ is a hyperparameter to be chosen. Then, Nyström approximation is again used to derive an approximate Hessian matrix

$\tilde{H} = \frac{1}{M}\, T\, \tilde{D}\, T^{\intercal} + \lambda I. \qquad (12)$

Here, T is such that $T^\intercal T = \tilde{K}$ (Cholesky decomposition), with $\tilde{K}\in \mathbb{R}^{M\times M}$ the kernel matrix subsampled with respect to both rows and columns. $\tilde{D} \in \mathbb{R}^{M \times M}$ is a diagonal matrix such that its ith element is the second derivative of the loss, $\ell^{{\prime} ^{\prime}}(y_i,f_{{\textbf{w}}} (x_i))$, with respect to its first variable. Equation (12) is then used as a preconditioner to perform conjugate gradient descent. With this strategy, the overall computational cost to achieve optimal statistical bounds is $\mathcal{O}(N)$ in memory and, of particular importance for our scope, $\mathcal{O}(N\sqrt{N} \log N)$ in time. It is known in the literature [24, 25] that the effect of the projection onto the subspace determined by the centres is a form of regularisation. On the other hand, the stochasticity of the projection can potentially lead to a subspace that does not guarantee stability. From this point of view, the inclusion of a further explicit penalty term can be used to ensure stability as needed. Indeed, the regularisation level is determined by both the penalty parameter and the number of centres. These ideas are formalised and made quantitative in [24]. The reader can find more details about the Falkon algorithm, including comparisons with alternative approaches, in [14].
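
The following pure-numpy sketch illustrates the Nyström-restricted model of equation (11), trained here with the weighted logistic loss by plain gradient descent. It conveys the roles of the centres, of σ and of λ, but it is only a schematic stand-in for the Falkon solver, which instead uses the preconditioned approximate Newton iteration outlined above; all names are ours.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel matrix of equation (7) between the rows of a and b."""
    d2 = (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.clip(d2, 0.0, None) / (2.0 * sigma ** 2))

def fit_nystrom_logistic(x, y, sigma, M, lam, n_d, n_r, lr=0.1, steps=2000, seed=0):
    """Gradient descent on the regularised weighted logistic objective,
    restricted to M Nystrom centres as in equation (11)."""
    rng = np.random.default_rng(seed)
    centres = x[rng.choice(len(x), size=M, replace=False)]
    K_nm = gaussian_kernel(x, centres, sigma)        # N x M
    K_mm = gaussian_kernel(centres, centres, sigma)  # M x M, for the penalty
    w = np.zeros(M)
    z = 2.0 * y - 1.0                                # +1 for D, -1 for R
    sample_w = np.where(y == 1, 1.0, n_d / n_r)      # weights of equation (3)
    for _ in range(steps):
        f = K_nm @ w
        s = 1.0 / (1.0 + np.exp(-z * f))             # sigmoid(z * f)
        grad_f = sample_w * z * (s - 1.0)            # d(loss_i)/d(f_i)
        grad = K_nm.T @ grad_f / len(x) + 2.0 * lam * (K_mm @ w)
        w -= lr * grad
    return centres, w

def predict(x_new, centres, w, sigma):
    """Evaluate the learned log-ratio f_w(x) at new points."""
    return gaussian_kernel(x_new, centres, sigma) @ w
```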

3.2.1. Hyperparameters selection

The selection of the three Falkon hyperparameters M, σ and λ follows the prescriptions of [13], with one minor modification described below. The hyperparameter selection employs data collected under the reference working conditions, and proceeds as follows.

The number of centres M controls the expressive power of the model and therefore should be as large as possible, so as not to compromise the sensitivity to anomalous distributions with intricate shapes. It must also be at least as large as $\sqrt{N}$ in order to achieve statistically optimal bounds on the training convergence. At the same time, training is faster if M is smaller. The experiments performed in [13] show that any value of M above roughly the data batch size ${N}_{\cal{D}}$ does not compromise sensitivity.

The Gaussian width σ is selected as the 90th percentile of the pairwise distance between reference-distributed data points. Notice that the model (11) acts on an input vector x whose input features are standardised to have zero mean and unit variance on reference-distributed data. The same standardisation is applied before computing the distances.

The regularisation parameter λ is kept as small as possible while keeping training stable, i.e. avoiding large training times or non-numerical outputs. A number of reference-distributed Toy data batches is employed for this study, each trained against the reference sample ${\cal{R}}$. Some of the experiments performed in this paper employ considerably smaller data batches (e.g. ${N}_{\cal{D}} = 250$) than those considered in [13]. In these new conditions we observe that the compatibility of the test statistic distribution with a χ2 (see the end of section 3.1) is violated for very small λ. In these cases, we raise λ until the agreement with the χ2 is restored.
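
A sketch of the σ selection just described, assuming x_ref is an array of reference-distributed points (the subsampling step is ours, added only to keep the pairwise-distance computation cheap):

```python
import numpy as np

def select_sigma(x_ref, quantile=90, n_sub=2000, seed=0):
    """90th percentile of the pairwise distances between standardised
    reference-distributed points, as prescribed above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_ref, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0)          # zero mean, unit variance
    idx = rng.choice(len(x), size=min(n_sub, len(x)), replace=False)
    xs = x[idx]
    sq = (xs * xs).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * xs @ xs.T  # squared distances
    iu = np.triu_indices(len(xs), k=1)                # unique pairs only
    return float(np.percentile(np.sqrt(np.clip(d2[iu], 0.0, None)), quantile))
```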

The hyperparameters selected with the above criteria, in the different setups for DQM considered in this paper, are reported in table 1. It should be emphasised that the hyperparameter selection problem for NPLM is rather different than for regular applications of Falkon or other types of classifiers. The hyperparameters for regular classifiers can be optimised based on the performance in the specific classification task under examination. NPLM aims instead at goodness-of-fit, namely at attaining good sensitivity to a wide class of anomalous data distributions that are not known or specified a priori. Hence, a reasonable a priori choice of the hyperparameters must be made and cannot be re-optimised a posteriori. In particular, no re-optimisation can be or has been performed to enhance the sensitivity to the specific types of anomalies that are considered in this paper to demonstrate the sensitivity of the method.

Table 1. NPLM algorithm parameter configurations for the five-dimensional and six-dimensional experiments considered in this work. The number of degrees of freedom of the χ2 that best approximates $p(t | \textrm{{R}})$ is reported in the last column.

     | ${N}_{\cal{R}}$ | ${N}_{\cal{D}}$ | M    | σ   | λ         | dof
  5D | 2000            | 250             | 2000 | 4.5 | $10^{-6}$ | 40
     |                 | 500             |      |     | $10^{-7}$ | 83
     |                 | 1000            |      |     | $10^{-8}$ | 171
  6D | 2000            | 250             | 2000 | 4.8 | $10^{-6}$ | 58
     |                 | 500             |      |     |           | 78
     |                 | 1000            |      |     |           | 109

3.3. Alternative approaches

Goodness-of-fit and two-sample test problems are of interest in several domains of science. Many approaches exist, and developing new strategies is an active area of research. One heuristic reason to choose NPLM for DQM, among the many different options, is that it has been developed in the challenging context of new physics searches. Prior experimental and theoretical knowledge suggests that new physics is elusive. The target for new physics searches is thus to spot minor departures of the actual data from the reference distribution. These departures could emerge either as small corrections to the distribution shape or as relatively large corrections like sharp peaks, which however only account for a very small fraction of the experimental data. Detecting such small effects requires precisely comparing the reference distribution with large datasets, which NPLM is designed to perform. Using NPLM for DQM could thus enable a more accurate monitoring of the data, offering sensitivity to more subtle failures of the apparatus. The number of input features in the data that are typically relevant for new physics searches ranges from a few to tens, which is an adequate number also for the monitoring of individual detectors and detector systems fully exploiting the correlations among the variables. For comparison, methods to assess the quality of generated images instead target input data with dimensionality of order one thousand. They could be less performant for DQM as they are designed to address a radically different problem.

These heuristic considerations suggest that NPLM is a reasonable starting point for the development of novel DQM algorithms based on advanced multivariate goodness-of-fit or two-sample test methods, which we advocate in this paper. On the other hand, no comprehensive comparative study of the NPLM performance is currently available. Such a comparison is beyond the scope of this paper. However, the DQM problems and datasets we study will be useful benchmarks for future work in this direction.

Work has started [26, 27] to compare NPLM with a certain class of methods, called ‘classifier-based’ methods. The classifier-based approaches [28] are all those that entail training a classifier to tell apart ${\cal{D}}$ from ${\cal{R}}$ and using the trained classifier to construct a test statistic for the hypothesis test. A simple implementation [29] employs the classification accuracy as the test statistic. Following the standard pipeline for classifiers, the model is trained on part of the ${\cal{D}}$ and ${\cal{R}}$ datasets (the training set), while the accuracy is evaluated on the remaining data (the test set). The idea is that while the accuracy will be poor (around random guessing) if ${\cal{D}}$ and ${\cal{R}}$ follow the same distribution, it will be higher if their distributions differ.
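
For reference, a generic sketch of such an accuracy-based test is given below; the classifier choice, the split fraction and the function names are ours, for illustration only, and do not reproduce the specific setups of [26, 27, 29].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_test_statistic(x_d, x_r, seed=0):
    """Train a classifier to separate D from R on half of the data and use the
    held-out accuracy as the test statistic (it stays near the trivial baseline
    when D and R follow the same distribution)."""
    x = np.vstack([x_d, x_r])
    y = np.concatenate([np.ones(len(x_d)), np.zeros(len(x_r))])
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, test_size=0.5, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return accuracy_score(y_te, clf.predict(x_te))
```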

NPLM is technically a classifier-based method. Its major peculiarities are the choice of the likelihood-ratio test statistic in equation (2) and the fact that the entire datasets are employed both for training and for the evaluation of the test statistic. Neither of these choices is motivated from the viewpoint of the theory of classification, while they are both natural or in fact required from the perspective of the theory of hypothesis testing that underlies the NPLM approach. Performance studies in [27] show that these choices are beneficial for the sensitivity. These results partly contradict [26], which however employs different classification models and different criteria for hyperparameter selection, and uses permutation tests for the estimate of the sensitivity rather than computing it empirically as in NPLM. These differences are evidently responsible for the different findings, and more work is needed for a conclusive assessment.

4. Results

In this section, we present the application of the NPLM strategy for DQM to the DT chambers data described in section 2 12 . We will consider monitoring data batches of variable size ${N}_{\cal{D}} = 250$, 500 and 1000, by employing a reference dataset of fixed size ${N}_{\cal{R}} = 2000$.

The input data consist of six features: the four drift times, the muon angle and the number of hits. As shown in the bottom-right plots of figures 3 and 4, the number of hits, $n_\mathrm{Hits}$, is highly discriminant for the anomalies we considered in our study, and in particular for the ones affecting the thresholds (the lower the threshold, the higher the noise). At the LHC, however, that quantity also depends on the luminosity delivered to the experiment, which can vary greatly even during a single run. Since it is not necessarily a proxy for a detector issue, it is worth also considering the case where only the other five variables are provided to the algorithm; as an additional benefit, this will allow assessing the ability of the NPLM DQM approach to exploit correlations between variables and detect anomalies even when their effect is unexpected and not straightforwardly evident.

The left and middle panels of figure 5 show the test statistic distribution in the five-dimensional (5D) problem, for data batches of size ${N}_{\cal{D}} = 500$. The grey histograms display the distribution of t in the reference working conditions, $p(t|\textrm{{R}})$. This is obtained empirically by processing reference-distributed Toy data batches, and fitted to a χ2 distribution as explained in section 3.1. The distributions of the test statistic associated with the anomalous batches, shown in the coloured histograms, are very well separated from the reference distribution, meaning that anomalous data are very likely to be identified as such by the algorithm. This is quantified by the median p-value of the anomalous batches, reported in the central column of table 2. The table also reports the median p-values for larger (${N}_{\cal{D}} = 1000$) and smaller (${N}_{\cal{D}} = 250$) batches. The sensitivity to the anomalies increases with ${N}_{\cal{D}}$, as expected.

Figure 5. Distribution of the test statistic in the scenario ${N}_{\cal{D}} = 500$. The plot displays the distribution of the test statistic t on reference-distributed Toys and on the data collected under anomalous detector conditions.

Table 2. Median p-values for different anomalies and data batch sizes. Five input features are considered, excluding $n_\mathrm{hits}$.

Anomaly        | ${N}_{\cal{D}} = 250$ | ${N}_{\cal{D}} = 500$ | ${N}_{\cal{D}} = 1000$
Cathode 75%    | 0.0034                | $1.1\times 10^{-6}$   | ${\lt}10^{-7}$
Cathode 50%    | 0.029                 | $3.4\times 10^{-4}$   | ${\lt}10^{-7}$
Cathode 25%    | 0.14                  | 0.0019                | ${\lt}10^{-7}$
Threshold 75%  | $2.8\times 10^{-7}$   | ${\lt}10^{-7}$        | ${\lt}10^{-7}$
Threshold 50%  | ${\lt}10^{-7}$        | ${\lt}10^{-7}$        | ${\lt}10^{-7}$
Threshold 25%  | ${\lt}10^{-7}$        | ${\lt}10^{-7}$        | ${\lt}10^{-7}$

For a comparative assessment of the performance, we computed a Kolmogorov–Smirnov (KS) test on each individual feature for the same data used to train the NPLM model. The KS median p-values are reported in table 3 and compared with the ones obtained with the 5D NPLM test. We see that individual variables have a very limited power to discriminate the anomalous batches. The NPLM method is instead sensitive to correlated discrepancies in the different distributions and discriminates the anomalies effectively. For illustrative purposes, we show in the left and middle panels of figure 6 the distribution of the 1D KS statistic computed on the drift time of the first layer ($t_1$) for reference and anomalous batches. By comparison with figure 5, it is easy to recognise the advantage of the NPLM strategy.
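
The per-feature KS baseline used above amounts to the following sketch (feature ordering and names are ours):

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_per_feature(batch, reference, names=("t1", "t2", "t3", "t4", "theta")):
    """Two-sample KS p-value for each input feature, taken one at a time."""
    out = {}
    for j, name in enumerate(names):
        statistic, pvalue = ks_2samp(batch[:, j], reference[:, j])
        out[name] = pvalue
    return out
```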

Figure 6. Distribution of the test statistic for the KS test.

Table 3. Median p-values in the setup ${N}_{\cal{D}} = 500$.

Anomaly        | NPLM (5D)            | KS $(t_1)$ | KS $(t_2)$ | KS $(t_3)$ | KS $(t_4)$ | KS $(\theta)$
Cathode 75%    | $1.1\times 10^{-6}$  | 0.50       | 0.41       | 0.43       | 0.40       | 0.42
Cathode 50%    | $3.4\times 10^{-4}$  | 0.47       | 0.27       | 0.47       | 0.37       | 0.41
Cathode 25%    | 0.0019               | 0.45       | 0.44       | 0.13       | 0.45       | 0.50
Threshold 75%  | ${\lt}10^{-7}$       | 0.23       | 0.14       | 0.16       | 0.14       | 0.48
Threshold 50%  | ${\lt}10^{-7}$       | 0.09       | 0.10       | 0.06       | 0.17       | 0.42
Threshold 25%  | ${\lt}10^{-7}$       | 0.11       | 0.07       | 0.04       | 0.11       | 0.66

We now turn to the study of the complete six-dimensional problem, including the variable $n_\mathrm{hits}$. The reference and anomalous test statistic distributions are shown in the right panel of figure 5. By comparing with the other panels of the figure we can appreciate the tremendous discriminating power of the $n_\mathrm{hits}$ variable: including $n_\mathrm{hits}$, all the anomalies can be detected with very high significance. Therefore, using this variable alone for the NPLM DQM test, or running a regular KS test on it (as shown in the right panel of figure 6), is sufficient to identify the anomalies, as previously mentioned.

We conclude this section by showing some examples of the data marginal distributions reconstructed by the model. The three plots reported in figure 7 are produced by reweighting each event of the reference sample used for the training by the exponential factor $e^{f_{\widehat{\mathbf{w}}}(x)}$, as in equation (5); both the reweighted reference and the data samples are binned, and their ratios with respect to the original reference sample are shown in the bottom panels. By comparing the data-versus-reference ratio (labelled as ‘true’) with the reconstructed one (‘learned’) we can appreciate how well the model captures the nature of the anomaly and, hence, trust the results of the ML task.

Figure 7. Examples of input data and respective learned likelihood ratios with sample size $N_R = 2000$ and $N_D = 500$.

All the numerical experiments presented in this paper have been performed on a single machine equipped with an NVIDIA Titan Xp GPU with 12 GB of VRAM. We tested the performance of the algorithm in terms of execution time; the training time for a single 5D classification task is approximately 0.5 s, with no significant dependence on the nature of the data or the size of the sample.

5. Conclusions and outlook

We presented the test of a powerful ML-based algorithm, NPLM, as a tool to monitor the quality of the data produced by a typical detector used for measuring particles at high-energy colliders. NPLM compares collected measurements with a reference dataset describing the standard detector readout, performing a multidimensional likelihood-ratio hypothesis test.

The study demonstrated the capability of the algorithm to detect anomalous detector conditions, with a much greater discriminating power than simpler traditional methods, such as the KS test. Traditional strategies for DQM are specifically designed for individual detectors and monitoring needs. We will thus be in a position to perform a full-fledged sensitivity comparison only after implementing our strategy in a concrete realistic setup with an established standard alternative. Nevertheless, the strong observed advantage in performance suggests that major sensitivity improvements will be attained thanks to the deployment of a sophisticated goodness-of-fit method such as NPLM.

Although conducted under simplified experimental conditions, the test presents figures appropriate for a typical monitoring system of a detector operating at the LHC; in particular, the number of channels and the size of the datasets are of the same order of magnitude as in the corresponding CMS DQM application. The amount of data we consider for each batch can be gathered much more quickly at the LHC than in a cosmic-ray stand like the one used here; nevertheless, the rate at which possible issues should be detected is not higher than about one per minute 13 . The time required by NPLM to run, less than a second, makes the algorithm suitable for online execution.

Our results open the door to the systematic deployment of NPLM for DQM, in particular for CMS detector subsystems in real LHC data-taking conditions, potentially advancing the standard practice in two main aspects. First, by boosting the sensitivity to detector malfunctions through a more refined multivariate statistical methodology and the use of low-level variables, which reduces the bias introduced by further manipulations of the data. This will ultimately improve the quality of the data used for physics analyses. Second, by introducing a universal, problem-independent approach for a more efficient and nearly automated design of DQM strategies.

Acknowledgments

L R, M L and M R acknowledge the financial support of the European Research Council (Grant SLING 819789). L R acknowledges the financial support of the AFOSR Projects FA9550-18-1-7009, FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), the EU H2020-MSCA-RISE Project NoMADS—DLV-777826, and the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. G G is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 772369). A W is supported by the Grant PID2020-115845GB-I00/AEI/10.13039/501100011033.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://zenodo.org/record/7128223.

Footnotes

  • 10 

    In addition a mean-timer algorithm [16] is executed on the back-end board receiving the data. The timing information provided by that algorithm is currently not used in this analysis.

  • 11 

    Unlike in NPLM applications to new physics searches, the total number of data points in ${\cal{D}}$ is not a random variable, but is fixed to the data batch size. The regular likelihood for i.i.d. data is thus employed rather than the extended likelihood. Correspondingly, the test statistic contains one term fewer than in [10–13].

  • 12 
  • 13 

    Failures potentially leading to catastrophic consequences, which require a much prompter reaction, are typically handled by hardware interlock systems.
