ML Physics
Abstract
Physics and modern machine learning have a huge number of connections, through their scientific goals, methods,
applications, and theoretical understanding. These notes provide an introduction to modern machine learning, specifically
for physicists. As prerequisites they rely on some results from bachelor-level physics lectures, but do not assume any prior
knowledge of machine learning.
1 Basics
1.1 Statistics
1.2 Multivariate analysis
1.3 Fits and interpolations
1.4 Neural networks
1.5 Bayesian networks and likelihood loss
2 Regression
2.1 Amplitude regression
2.2 Numerical integration
1 Basics
1.1 Statistics
Bayes’ Theorem Before we start with some basics of machine learning, let us remind ourselves of basic statistics from our first year of physics studies. A formal introduction to statistics starts with the three Kolmogorov axioms. The first axiom states that the probability for a given outcome is a non-negative real number,

p(A) ≥ 0 . (1.1)
The second axiom states that the sum of probabilities for all possible outcomes of an experiment is one,

∑_i p(A_i) = 1    or    ∫ dA p(A) = 1 . (1.2)
The third axiom states that probabilities for disjoint outcomes add,

p(A_1 ∪ A_2) = p(A_1) + p(A_2)    for    A_1 ∩ A_2 = ∅ . (1.3)

The logical combination of these outcomes is an ‘or’. Alternatively, we can ask what the joint probability of two measurements A and B is. Their independence is defined by

p(A ∩ B) = p(A) p(B) . (1.4)
If two measurements are not independent, we need to define conditional probabilities. This means we ask what the probability of A is under the condition that we also observed B,

p(A|B) := p(A ∩ B)/p(B)    ⇔    p(A|B) p(B) = p(A ∩ B) . (1.5)

p(A ∩ B) = p(B ∩ A)    ⇒    p(B|A) = p(A ∩ B)/p(A) . (1.6)
These formulas can be viewed as definitions or as axioms, extending the Kolmogorov axioms. They give us Bayes’ Theorem, which states that the conditional probability for a theory T given a set of measurements M is

p(T|M) = p(M|T) p(T) / p(M) . (1.7)
It comes with a set of definitions: p(T|M) is called the posterior probability; if we are interested in p(M|T) as a function of the second argument T, it is called the likelihood. The problem with the likelihood is its lack of normalization as a function of T. p(M) is called the evidence, and it is typically replaced through the normalization condition
1 = ∫ dT p(T|M) = (1/p(M)) ∫ dT p(M|T) p(T)
⇔ p(M) = ∫ dT p(M|T) p(T) . (1.8)
Finally, p(T) is called the prior probability or just the prior. In Eq.(1.8) we see that it defines an integration measure over theory space. Bayes’ Theorem tells us how to combine two independent measurements. We first make one measurement, then compute its posterior probability, use this as the prior for the second measurement, and compute the combined probability. This is logically clear, but technically complicated.
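This sequential combination can be sketched numerically on a discretized theory space. The Gaussian measurement model, the grid, and all numbers below are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Minimal sketch of combining two measurements with Bayes' theorem on a
# discretized theory parameter T; measurement model and numbers are invented.
T = np.linspace(-5, 5, 1001)                              # grid over theory space
prior = np.ones_like(T) / np.trapz(np.ones_like(T), T)    # flat prior p(T)

def posterior(measurement, sigma, prior):
    """One Bayesian update: p(T|M) ~ p(M|T) p(T), normalized via the evidence."""
    likelihood = np.exp(-(measurement - T)**2 / (2 * sigma**2))
    evidence = np.trapz(likelihood * prior, T)             # p(M) as in Eq.(1.8)
    return likelihood * prior / evidence

p1 = posterior(measurement=1.0, sigma=1.0, prior=prior)    # first measurement
p12 = posterior(measurement=1.4, sigma=0.8, prior=p1)      # second, using p1 as prior

print("combined posterior mean:", np.trapz(T * p12, T))
```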
Information entropy To describe complex systems with the help of probabilities, we need the concept of information entropy. Let us start with a system with 2^N equally probable states,

p_j = 2^{−N}    with    ∑_j p_j = 2^N 2^{−N} = 1 . (1.9)
To find out in which state the system is we can construct a decision tree. The most efficient algorithm is to step-wise split
the possible states into groups of equal size and then ask in which of the two halves the system lives. This way we avoid
any element of luck, and we need
N = − log_2 p_j = − ln p_j / ln 2 (1.10)
answers {0, 1}. Now we generalize the system to Ω states with different probabilities p_j. In that case the branches of the decision tree correspond to different probabilities, so there will be an element of luck. The number of necessary answers should be the same, but it is only defined as an expectation value. It defines the information entropy H,

⟨ − log_2 p_j ⟩ = − (1/ln 2) ∑_{j=1}^Ω p_j ln p_j =: H/ln 2 (1.11)
By definition, the information entropy measures (logarithmic) uncertainty in units of ln 2, so-called bits. Uncertainty is
the amount of expected information needed to correctly identify the state of the system.
First, we see that for pj ∈ [0, 1] each term in the expectation value in Eq.(1.11) is positive, so H ≥ 0. The entropy
assumes its smallest value for only one term, pj = 1. Here we know the system perfectly. For two outcomes with p1 = p
and p2 = 1 − p we can compute the maximum entropy easily,
H = − [ p ln p + (1 − p) ln(1 − p) ]
dH/dp = − ln p − p/p + ln(1 − p) + (1 − p)/(1 − p)
      = − ln p + ln(1 − p) = 0    ⇔    p = 1 − p = 1/2 . (1.12)
The information entropy vanishes for p = 0 and p = 1 and is symmetric around its maximum at p = 1/2. For a perfectly
understood dataset with only p(x) = 0 and p(x) = 1 there is no entropy. In general, the entropy is maximal for Ω = 1/p,
and its value depends on the number of states,
H ≤ − ∑_{j=1}^Ω p ln p = − ln p = ln Ω . (1.13)
Next, we can ask what happens with the entropy when we split our system in two. If the two systems are not independent,
we can compute the entropies in terms of the joint probability pi,j ,
H_1 = − ∑_{i=1}^{Ω_1} p_i ln p_i    H_2 = − ∑_{j=1}^{Ω_2} p_j ln p_j    H_12 = − ∑_{i,j} p_{i,j} ln p_{i,j} . (1.14)
Because equally distributed systems maximize the entropy, this combined entropy should be smaller than the sum of the independent individual entropies. The difference is the mutual information I_12,

I_12 = H_1 + H_2 − H_12
     = − ∑_{i,j} p_i p_{j|i} ln p_i − ∑_{i,j} p_{i|j} p_j ln p_j + ∑_{i,j} p_{i,j} ln p_{i,j}
     = − ∑_{i,j} [ p_i p_{j|i} ln p_i + p_j p_{i|j} ln p_j − p_{i,j} ln p_{i,j} ]
     = − ∑_{i,j} p_{i,j} ln ( p_i p_j / p_{i,j} ) ≡ D_KL[p_{i,j}, p_i p_j] . (1.15)
Figure 1: Relation between the entropies H_1, H_2, and the combined entropy H_12.
The relation between the different entropies is illustrated in Fig. 1. We can understand the mutual information in terms of clustering algorithms: if we pick a random point in the entire space, what does the association with partition 1 tell us about a possible association with partition 2?
Finally, we can construct the variation of information,

VI_12 = H_1 + H_2 − 2 I_12 = H_12 − I_12 . (1.16)

It compares two partitions, and it vanishes exactly when the two partitions are equal. For finite values it gives a proper distance, i.e. it is positive definite, symmetric, and satisfies the triangle inequality.
The functional defined in Eq.(1.15) is called Kullback–Leibler divergence and compares two probability distributions,
evaluated on a dataset corresponding to the first distribution,
D_KL[p_a, p_b] = ∑ p_a ln ( p_a / p_b ) ≠ D_KL[p_b, p_a] . (1.17)
For continuous distributions it reads
D_KL[p_a, p_b] = ⟨ log ( p_a / p_b ) ⟩_{p_a} ≡ ∫ dx p_a(x) log ( p_a(x) / p_b(x) ) . (1.18)
It vanishes if two distributions agree everywhere. We will come back to it in much more detail later.
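As a quick illustration of Eq.(1.17), the following snippet evaluates the KL-divergence for two discrete toy distributions; the distributions themselves are made up.

```python
import numpy as np

# Toy check of the discrete KL-divergence of Eq.(1.17).
p_a = np.array([0.5, 0.3, 0.2])
p_b = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    """D_KL[p, q] = sum_i p_i log(p_i/q_i), evaluated on the support of p."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl(p_a, p_b), kl(p_b, p_a))   # asymmetric, both non-negative
```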
Likelihoods and Neyman-Pearson Lemma Combining measurements is easier when we use likelihoods. Let us
assume that p(x) gives the probability that an event or phase space configuration x corresponds to some kind of signal,
turning the conditional probability or likelihood of Eq.(1.7) into
We can split the measurement into a counting experiment leading to n events with s events expected,
p_Poisson(n|s) = s^n e^{−s} / n! , (1.20)
and a properly normalized probability for each event,
p(x) ∈ [0, 1]    with    ∫ dx p(x) = 1 . (1.21)
In this form we see that the combined likelihoods for two independent datasets are simply
We can compute the ratio of two likelihoods for the same dataset of n observed events {xi }, but two theory hypotheses,
for instance signal plus background vs background only,
p({x_i}|T_{S+B}) / p({x_i}|T_B) = e^{−(s+b)} ∏_{i=1}^n [ (s + b) p_{s+b}(x_i) ] / ( e^{−b} ∏_{i=1}^n [ b p_b(x_i) ] )
= e^{−s} ∏_{i=1}^n ( s p_s(x_i) + b p_b(x_i) ) / ( b p_b(x_i) ) . (1.24)
The Neyman-Pearson lemma states that this likelihood ratio is the most powerful test statistic, or test variable, to distinguish two hypotheses. Most powerful is defined as the smallest false-negative error for a given false-positive error. For typical physics applications this means it minimizes the mis-identification of a signal as a background fluctuation.
Before moving on, let us mention that in the limit of large event counts the Poisson distribution in Eq.(1.20) becomes a
Gaussian,
p_Poisson(n|s) → p_Gauss(n|s) = 1/√(2πs) exp( − (n − s)²/(2s) ) , (1.25)
a standard form we will use throughout this lecture.
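Going back to Eq.(1.24), the likelihood ratio is easy to evaluate numerically. In the sketch below the event-level densities p_s and p_b over x ∈ [0, 1], the expected rates, and the observed events are all illustrative assumptions.

```python
import numpy as np

# Toy evaluation of the Neyman-Pearson likelihood ratio of Eq.(1.24).
rng = np.random.default_rng(1)
s, b = 5.0, 20.0                        # expected signal and background counts
p_s = lambda x: 2 * x                   # signal prefers large x, normalized on [0,1]
p_b = lambda x: np.ones_like(x)         # flat background density

x_obs = rng.random(22)                  # a hypothetical set of observed events
log_ratio = -s + np.sum(np.log((s * p_s(x_obs) + b * p_b(x_obs)) / (b * p_b(x_obs))))
print("log likelihood ratio:", log_ratio)
```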
Robust estimators In Eq.(1.7) we introduced the likelihood with an abstract definition of a theory T. In Eq.(1.24) the two theories were signal-plus-background vs background-only, again without any reference to what makes them different. In general, the theory hypothesis defining a likelihood is described by an n-tuple of model parameters θ, so we again replace our conventions,
To simplify our argument, let us assume that our measurement is not a set of events, but a set of values f(x), for instance counting rates over a binned phase space x. This means we compare a measured set of numbers f_i with the corresponding theory predictions f_θ(x). If the data f_i and the uncertainties σ_i are statistically distributed, we usually employ a logarithmic Gaussian likelihood

− log p(M|θ) = ∑_i (f_i − f_θ(x_i))² / (2σ_i²) + const . (1.27)

In this form the individual Gaussians have mean f_i, variance σ_i², standard deviation σ_i, and a width of the bell curve of 2σ_i. However, in the real world data is usually not distributed statistically, and the tails of distributions are usually less suppressed than the exponential suppression defining the Gaussian.
Given a likelihood, we can estimate the value for this model parameter θ by maximizing the likelihood that it explains the
data,
θ̂ = argmax_θ ∑_M p(M|θ)
  ≈ argmin_θ [ − log p(M|θ) ] . (1.28)
In the second step we assume that maximizing the likelihood or minimizing the negative log-likelihood gives the same
result. We call θ̂ the estimator for the model parameter θ. The sensitivity or robustness of an estimator depends on the form of the likelihood. It can be measured by the impact that a change of a datapoint has on the log-likelihood and is called the influence function,

IF = − d log p(M|θ) / df_i . (1.29)
If we already know that a Gaussian likelihood does not describe a dataset, for instance in the tails, we can ask what kind of likelihood would lead to a robust estimator. Specifically, we might want to be less sensitive to poorly modelled outliers. As an example, we consider the mean or center θ = µ of a given likelihood distribution. Starting with a one-dimensional Gaussian, we find

p(x|µ) ∝ exp( − (x − µ)²/(2µ) )
⇒ − log p(x|µ) = (x − µ)²/(2µ) + ···
⇒ IF = (x − µ)/µ  →  x    for x → ∞ . (1.30)
This influence function gives us the robustness of a Gaussian likelihood with extremely suppressed tails. Next, we check a Cauchy or Breit-Wigner distribution centered around zero,

p(x|µ) ∝ 1/(1 + x²)
⇒ − log p(x|µ) = log(1 + x²) + ···
⇒ IF = 2x/(1 + x²)  →  2/x    for x → ∞ . (1.31)

Its influence function vanishes for larger x-values, which means the Breit-Wigner distribution is more robust against outliers in the tails than a Gaussian. A more complete set of examples, including a Gaussian and the more robust Laplace, Cauchy (Breit-Wigner), and Tukey distributions, is given in Fig. 2.
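The difference in robustness can be seen in a small numerical sketch: on a toy dataset with a few gross outliers, the Gaussian estimator of the center is pulled away, while the Cauchy estimator barely moves. The dataset and the bounds of the minimization are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Gaussian vs Cauchy estimator of a location parameter on data with outliers.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 100), np.array([15.0, 20.0, 25.0])])

def nll_cauchy(mu):
    # negative log-likelihood for p(x|mu) ~ 1/(1 + (x-mu)^2), cf. Eq.(1.31)
    return np.sum(np.log1p((data - mu)**2))

mu_gauss = data.mean()                                     # minimizes the L2 loss
mu_cauchy = minimize_scalar(nll_cauchy, bounds=(-5, 5), method="bounded").x
print(f"Gaussian estimate {mu_gauss:.2f}  vs  Cauchy estimate {mu_cauchy:.2f}")
```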
As a side remark, we can use the estimator as an illustration of frequentist/likelihood vs Bayesian statistics. Everything starts with Bayes’ theorem, but the two methods ask different questions. For the estimator, the likelihood question in Eq.(1.28) determines the value for θ which is the most likely, given the data. For this purpose, it maximizes the likelihood p(M|θ). If we ask the question what M tells us about possible values of θ, we would naively use the probability instead, which means we would compute

θ̂_Bayes = argmax_θ ∑_M p(θ|M) = argmax_θ ∑_M p(M|θ) p(θ) / p(M)
        = argmax_θ ∑_M [ p(M|θ) p(θ) ] . (1.32)
This result differs from Eq.(1.28) because we now need a prior to construct the estimator. Similarly, a likelihood approach would get rid of a nuisance parameter or a physics parameter we are currently not interested in by profiling over it. This means we identify the maximum likelihood and project this value into the target space. In the Bayesian approach we integrate it out. The prior then defines a measure on θ-space which is needed for such an integration. In terms of the integration measure the profile likelihood would correspond to a delta-distribution prior, which is why we refer to differences between the methods as volume effects. Since the prior is a measure in model space, we have to extract it from other measurements or make an assumption. It is important to always keep in mind that inference can be done using likelihoods or using posterior probabilities, and if the two methods give different answers this is because they ask different questions.
One problem we encounter in any physics analysis is that we want to extract simple numbers, like an energy measurement or a signal probability, from a complex detector output. Historically, this is done by constructing a set of theory-motivated observables and evaluating each of these observables in view of the regression or classification task.

Figure 2: Likelihoods and their associated losses: a Gaussian ∼ exp(−(x−µ)²/2) gives the L2-loss ∼ (x−µ)²/2 (mean, least squares, not robust); a Laplace ∼ exp(−|x−µ|) gives the L1-loss ∼ |x−µ| (median, robust); a Cauchy ∼ 1/(1+x²) gives the loss ∼ log(1+x²) (very robust).

If all available
information can be extracted from one observable, this is called a sufficient statistic or an optimal observable. In reality, this is never the case. Instead, we are faced with a set of correlated observables, which we need to analyse together.
Let us look at such a set of correlated observables and the problem of multivariate classification. A historic solution is the decision tree. Imagine we want to classify an event x_i as signal or background, based on the observables O_j. As a basis
for this decision we study a set of training events {xi }, histogram them for each Oj , and find the values Oj,split which give
the most successful split between signal and background for each distribution.
To define the optimal split we start with the signal and background probabilities as a function of the signal event count S
and the background event count B,
p_S = s/(s + b) ≡ p    and    p_B = b/(s + b) ≡ 1 − p . (1.33)
If we look at a histogrammed normalized observable O ∈ [0 ... 1] we can compute p and 1 − p from the number of expected signal and background events following Eq.(1.33). For instance, we can look at a signal which prefers large O-values and two background distributions to compute these signal and background probabilities p(O),

s(O) = A O    b(O) = A (1 − O)  ⇒  p(O) = s(O)/(s(O) + b(O)) = O
s(O) = A O    b(O) = B          ⇒  p(O) = O/(O + B/A) . (1.34)
We then define that a single event is signal-like when its (signal) probability is p(O) > 1/2, otherwise the event is background. For the two cases in Eq.(1.34) this gives the split values

O_{j,split} = O_j |_{p=1/2} = 1/2  or  B/A . (1.35)
To organize our multivariate analysis we evaluate the performance of each optimized cut Oj,split , for example to apply the
most efficient cut first.
We already know how this discussion can be formalized — we need to construct the set of Oj,split such that they maximize
the information entropy defined in Eq.(1.11). Because we do not want to measure the information entropy, we can use a
standard logarithm to evaluate it,
H[p] = − [ p log p + (1 − p) log(1 − p) ] . (1.36)
To build a decision tree out of our observables we first compute the best splitting for each observable individually and
then choose the observable with the most successful split. More precisely, for a successful split we want to maximize the
difference of the information entropy before the split and the sum of the information entropies after the split. This is
called the information gain, and we choose the first observable of our decision tree through

max_j [ H_before split[p(O_j)] − H_after split,1[p(O_j)] − H_after split,2[p(O_j)] ] . (1.38)
Because multivariate analysis using (boosted) decision trees relies on successive split points of pre-defined observables, it
cannot be optimal. To do better, we would need to extract p(O) from our dataset and work with it without binning or
explicit cuts. This will eventually lead us to classification using neural networks.
A historic illustration for a decision tree used in particle physics is shown in Fig. 3. It comes from the first high-visibility
application of (boosted) decision trees in particle physics, to identify electron-neutrinos from a beam of muon-neutrinos
using the MiniBooNE Cherenkov detector. Each observable defines a so-called node, and the two branches below each
node are defined as ‘yes’ vs ‘no’ or as ‘signal-like’ vs ‘background-like’. The first node is defined by the observable with
the highest information gain among all the optimal splits. The two branches are based on this optimal split value, found
by maximizing the cross entropy. Every outgoing branch defines the next node again through the maximum information
gain, and its outgoing branches again reflect the optimal split, etc. Finally, the algorithm needs a condition when we stop
splitting a branch by defining a node and instead define a so-called leaf, for instance calling all events ‘signal’ after a
certain number of splittings selecting it as ‘signal-like’. Such conditions could be that all collected training events are
either signal or background, that a branch has too few events to continue, or simply by enforcing a maximum number of
branches.
No matter how we define the stopping criterion for constructing a decision tree, there will always be signal events in background leaves and vice versa. We can only guarantee that a tree is completely right for the training sample if each leaf includes exactly one training event. This perfect discrimination obviously does not carry over to an independent test sample, which means our decision tree is overtrained. In general, overtraining means that the performance, for instance of a classifier, on the training data is so finely tuned that it follows the statistical fluctuations of the training data and does not generalize to the same performance on an independent sample of test data.
Figure 3: Illustration of a decision tree from an early application in particle physics, selecting electron neutrinos to prove neutrino oscillations. Starting from 52 signal and 48 background events, the tree successively splits on the number of PMT hits, the reconstructed energy, and the radius, with the signal/background counts given at each node and leaf. Figure from Ref. [?].
If we want to further enhance the performance of the decision tree we can focus on the events which are wrongly classified after we define the leaves. For instance, we can add an event weight w > 1 to every mis-identified event (we care about) and carry this weight through the calculation of the splitting condition. This is the simple idea behind a boosted decision tree (BDT). Usually, the weights are chosen such that the sum over all events is one. If we construct several independent decision trees, we can also combine their output for the final classifier. It is not obvious that this procedure will improve the tree for a finite number of leaves, and it is not obvious that such a re-weighting will converge to a unique or even improved boosted decision tree, but in practice this method has been shown to be extremely powerful.
Finally, we need to measure the performance of a BDT classification using some kind of success metric. Obviously, a large signal efficiency alone is not sufficient, because the signal-to-background ratio s/b or the Gaussian significance s/√b depend on the signal and background rates. For a simple classification task we can compute four numbers.
If we tag all events we know the normalization conditions s^(truth) = s^(S-tagged) + s^(B-tagged) and correspondingly for b^(truth). The signal efficiency is also called recall or sensitivity in other fields of research. The background mis-identification rate can be re-phrased as the background rejection 1 − ε_B, also referred to as specificity.
Once we have tagged a signal sample we can ask how many of those tagged events are actually signal, defining the purity or precision s^(S-tagged)/(s^(S-tagged) + b^(S-tagged)). Finally, we can ask how many of our decisions are correct and compute (s^(S-tagged) + b^(B-tagged))/(s^(truth) + b^(truth)), reflecting the fraction of correct decisions and referred to as accuracy.
In particle physics we usually measure the success of a classifier in the plane ε_S vs ε_B, where for the latter we either write 1 − ε_B or 1/ε_B. If a classifier gives us, for example, a continuous measure of an event being signal-like, we can choose different working points by defining a cut on the classifier output. The problem with any such cut is that we lose information from all those signal events which almost made it into the signal-tagged sample. If we can construct our classifier such that its output is a probability, we can also weight all events by their signal vs background probability score and keep all events in our analysis.
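The metrics introduced above are straightforward to compute from a classifier score and a working point. In the sketch below the toy labels, the score distributions, and the cut value are all illustrative assumptions.

```python
import numpy as np

# Efficiency, rejection, purity, and accuracy for one working point.
rng = np.random.default_rng(7)
is_signal = rng.random(10000) < 0.3
score = np.where(is_signal, rng.normal(0.7, 0.2, 10000), rng.normal(0.3, 0.2, 10000))

tagged = score > 0.5                        # cut on the classifier output
s_tag = np.sum(tagged & is_signal)          # signal tagged as signal
b_tag = np.sum(tagged & ~is_signal)         # background tagged as signal

eff_s = s_tag / np.sum(is_signal)           # signal efficiency (recall)
eff_b = b_tag / np.sum(~is_signal)          # background mis-identification rate
purity = s_tag / (s_tag + b_tag)            # precision
accuracy = (s_tag + np.sum(~tagged & ~is_signal)) / len(score)
print(eff_s, 1 - eff_b, purity, accuracy)
```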
From a practical perspective, we start with the assumption or observation that neural networks are nothing but numerically defined functions. In the simplest case, regression networks are scalar or vector fields defined on some space, approximated by f_θ(x). In physics applications, this space is often some kind of phase space, an important aspect, because many aspects of phase space are interpretable to physicists.
Assuming that we have indirect or implicit access to the truth f (x) in form of a training dataset (x, f )j , we want to
construct the approximation
A set of functional values for a given set of points can be approximated in two ways. First, in a fit we start with a
functional form in terms of a small set of parameters which we also refer to as θ.
To determine these model parameters, we use a minimization algorithm on an appropriately defined loss function. More specifically, we maximize the probability for the fit output f_θ(x_j) to correspond to the training points f_j, with uncertainties σ_j. Because we are interested in θ, we evaluate this probability as a likelihood, for example assuming the Gaussian log-likelihood of Eq.(1.27), often referred to as χ². For the definition of the best parameters θ we ignore the θ-independent normalization, so the loss function or function we minimize to determine the fit’s model parameters is

L_fit = ∑_j (f_j − f_θ(x_j))² / (2σ_j²) ≡ ∑_j L_j . (1.40)

The fit function is not optimized to go through all or even some of the training data points, f_θ(x_j) ≠ f_j. Instead, the log-likelihood loss is a compromise to agree with all training data points within their uncertainties. We can plot the values L_j for the training data and should find a Gaussian distribution of mean zero and standard deviation one, N(µ = 0, σ = 1).
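A minimal sketch of such a likelihood fit and of the pull check is given below; the quadratic model, the toy data, and the uncertainties are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Chi^2-style fit of a simple model and the corresponding pulls.
rng = np.random.default_rng(5)
x = np.linspace(0, 1, 30)
sigma = 0.1 * np.ones_like(x)
f_data = 1.0 + 2.0 * x**2 + rng.normal(0, sigma)          # noisy training points

model = lambda x, a, b: a + b * x**2
theta, _ = curve_fit(model, x, f_data, sigma=sigma, absolute_sigma=True)

pulls = (f_data - model(x, *theta)) / sigma               # should scatter like N(0, 1)
print("fitted parameters:", theta, " pull mean/std:", pulls.mean(), pulls.std())
```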
An interesting question arises in cases where we do not know an uncertainty σj for each training point, or where such an
uncertainty does not make any sense, because we know all training data points to the same precision σj = σ. In that case
we can still define a fit function, but the loss function becomes a simple mean squared error,
L_fit = 1/(2σ²) ∑_j | f_j − f_θ(x_j) |² ≡ MSE / (2σ²) . (1.41)
Again, the prefactor is θ-independent and does not contribute to the definition of the best fit. This simplification means that our MSE fit puts much more weight on deviations for large functional values f_j. This is not what we want, so without control over the uncertainties most machine learning applications will apply a preprocessing to the data,

f_j → log f_j    or    f_j → f_j − ⟨f_j⟩    or    f_j → f_j / ⟨f_j⟩    ··· (1.42)
In cases where we expect something like a Gaussian distribution, a standard scaling would preprocess the data to a mean of zero and a standard deviation of one. In an ideal world, such preprocessing should not affect our results, but in reality it almost always does. The only way to avoid preprocessing is to add information like the scale of expected and allowed deviations to the likelihood loss in Eq.(1.40).
The second way of approximating a set of functional values is interpolation, which ensures fθ (xj ) = fj and is the method
of choice for datasets without noise. Between these training data points we choose a linear or polynomial form, the latter
defining a so-called spline approximation. It provides an interpolation which is differentiable a certain number of times by matching not only the functional values f_θ(x_j) = f_j, but also the n-th derivatives f_θ^(n)(x ↑ x_j) = f_θ^(n)(x ↓ x_j). In the
machine learning language we can say that the order of our spline defines an implicit bias for our interpolation, because it
defines a resolution in x-space where the interpolation works. A linear interpolation is not expected to do well for widely
spaced training points and rapidly changing functional values, while a spline-interpolation should require fewer training
points because the spline itself can express a non-trivial functional form.
The main difference between a fit and an interpolation is their respective behavior on an unknown dataset. For both a fit and an interpolation, we expect our fit model f_θ(x) to describe the training data. To measure the quality of a fit beyond the
training data we can compute the loss function L or the point-wise contributions to the loss Lj on an independent
test dataset. If a fit does not generalize from a training to a test dataset, it is usually because it has learned not only the
smooth underlying function, but also the statistical fluctuation of the training data. While a test dataset of the same size
will have statistical fluctuations of the same size, they will not be in the same place, which means the loss function
evaluated on the training data will be much smaller than the loss function evaluated on the test data. This failure mode is
called over-fitting or, more generally, overtraining. For interpolation this overtraining is a feature, because we want to
reproduce the training data perfectly. The generalization property is left to the choice of the interpolation function. As a side remark, manipulating the training dataset while performing one fit after the other is an efficient way to search for outliers in a dataset, or for a new particle in an otherwise smooth distribution of invariant masses at the LHC.
More systematically, we can define a set of errors which we make when targeting a problem by constructing a fit function
through minimizing a loss on a training dataset. First, an approximation error is introduced when we define a fit function,
which limits the expressiveness of the network in describing the true function we want to learn. Second, an estimation or
generalization error appears when we approximate the true training objective by a combination of loss function and
training data set. In practice, these errors are related. A justified limit to the expressiveness of a fit function, or implicit
bias, defines useful fits for a given task. In many physics applications we want our fit to be smooth at a given resolution.
When defining a good fit, increasing the class of functions the fit represents leads to a smaller approximation error, but
increases the estimation error. This is called the bias-variance trade off, and we can control it by limiting or regularizing
the expressiveness of the fit function and by ensuring that the loss of an independent test dataset does not increase while
training on the training dataset. Finally, any numerical optimization comes with a training error, representing the fact that
a fitted function might just live, for instance, in a sufficiently good local minimum of the loss landscape. This error is a
numerical problem, which we can solve though more efficient loss minimization. While we can introduce these errors for
fits, they will become more relevant for neural networks.
Next, we introduce a neural network as a numerically defined fit function with a huge number of model parameters θ,
As mentioned before, we minimize a loss function numerically to determine the neural network parameters θ. This
procedure is called network training and requires a training dataset (x, f)_j representing the target function f(x).
We will skip the usual inspiration from biological neurons and instead ask our first question, which is how to describe an
unknown function in terms of a large number of model parameters θ without making more assumptions than some kind of
smoothness on the relevant scales. For a simple regression task we write the mapping as
The key is to think of this problem in terms of building blocks which we can put together such that simple functions
require a small number of modules or building blocks, and model parameters, and complex functions require larger and
larger numbers of those building blocks. We start by defining so-called layers, which in a fully connected or dense
network transfer information from all D entries of the vector x defining one layer to all vector entries of the subsequent
layer,
Our network consists of N layers, including one input layer x → x^(1), one output layer, and N − 2 hidden layers. If a vector entry x_j^(n+1) collects information from all x_j^(n), we can try to write each step of this chain as

x^(n) = W^(n) x^(n−1) + b^(n) , (1.46)

where the D × D matrix W is referred to as the network weights and the D-dimensional vector b as the bias. In general, neighboring layers do not need to have the same dimension, which means W does not have to be a square matrix. In our simple regression case we already know that over the layers we need to reduce the width of the network from the input vector dimension D to the output scalar x^(N) = f_θ(x).
Splitting the vector x^(n) into its D entries defines the nodes which form our network layer

x_i^(n) = W_ij^(n) x_j^(n−1) + b_i^(n) . (1.47)

For a fully connected network a node takes the D components x_j^(n−1) and transforms them into a single output x_i^(n). For each node the D + 1 network parameters are D matrix entries W_ij and one bias b_i. If we want to compute the loss function for
a given data point (x_j, f_j), we follow the arrows in Eq.(1.45), feed each data point x_j through the input layer, go through the following layers one by one, compute the network output f_θ(x_j), and compare it to f_j through a loss function.
The transformation shown in Eq.(1.46) is an affine transformation. Just like linear transformations, affine transformations
form a group. This is equivalent to saying that combining affine layers still gives us an affine transformation, just encoded
in a slightly more complicated manner. This means our network defined by Eq.(1.46) can only describe linear functions,
albeit in high-dimensional spaces.
To describe non-linear functions we need to introduce some kind of non-linear structure in our neural network. The
simplest implementation of the required nonlinearity is to apply a so-called activation function to each node. Probably the
simplest 1-dimensional choice is the so-called rectified linear unit
ReLU(x_j) := max(0, x_j) = { 0 for x_j ≤ 0 ;  x_j for x_j > 0 } , (1.48)
Here we write the ReLU transformation of a vector as the vector of ReLU-transformed elements. This non-linear
transformation is the same for each node, so all our network parameters are still given by the affine transformations. But
now a sufficiently deep network can describe general, non-linear functions, and combining layers adds complexity, new
parameters, and expressivity to our network function fθ (x). There are many alternatives to ReLU as the source of
non-linearity in the network setup, and depending on our problem they might be helpful, for example by providing a finite
gradient over the x-range. However, throughout this lecture we use ReLU as the standard activation function.
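The forward pass of Eqs.(1.45)-(1.48) is easy to write down explicitly; in the sketch below the layer widths and the random initialization are illustrative choices.

```python
import numpy as np

# Forward pass of a small fully connected network with ReLU activations,
# mapping a D-dimensional input to a scalar output.
rng = np.random.default_rng(0)
widths = [4, 16, 16, 1]                      # D=4 input, two hidden layers, scalar output
layers = [(rng.normal(0, 1/np.sqrt(n_in), (n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]

def network(x, layers):
    for i, (W, b) in enumerate(layers):
        x = W @ x + b                        # affine step, Eq.(1.47)
        if i < len(layers) - 1:              # no activation on the output layer
            x = np.maximum(0.0, x)           # ReLU, Eq.(1.48)
    return x

print(network(np.array([0.2, -1.0, 0.5, 3.0]), layers))
```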
This brings us to the second question, namely, how to determine a correct or at least good set of network parameters θ to
describe a training dataset (x, f )j . From our fit discussion we know that one way to determine the network parameters is
by minimizing a loss function. For simplicity, we can think of the MSE loss defined in Eq.(1.41) and ignore the
normalization 1/(2σ 2 ). To minimize the loss we have to compute its derivative with respect to the network parameters. If
we ignore the bias for now, for a given weight in the last network layer we need to compute
dL/dW_1j^(N) = ∂( f − ReLU[W_1k^(N) x_k^(N−1)] )² / ∂ReLU[W_1k^(N) x_k^(N−1)] × ∂ReLU[W_1k^(N) x_k^(N−1)] / ∂[W_1k^(N) x_k^(N−1)] × ∂[W_1k^(N) x_k^(N−1)] / ∂W_1j^(N)
= −2 ( f − ReLU[W_1k^(N) x_k^(N−1)] ) × 1 × δ_jk x_k^(N−1)
≡ −2 √L x_j^(N−1) , (1.50)

provided W_1k^(N) x_k^(N−1) > 0, otherwise the partial derivative vanishes. This form implies that the derivative of the loss with respect to the weights in the N-th layer is a function of the loss itself and of the previous layer x^(N−1). If we ignore the ReLU derivative in Eq.(1.50) and still limit ourselves to the weight matrix in Eq.(1.49) we can follow the chain of layers and find

dL/dW_ij^(n) = ∂( f − ReLU[W_1k^(N) x_k^(N−1)] )² / ∂[W_1k^(N) x_k^(N−1)] × ∂[ (W^(N) ··· W^(n+1))_1k W_kℓ^(n) x_ℓ^(n−1) ] / ∂W_ij^(n)
= −2 √L ( W^(N) ··· W^(n+1) )_1i x_j^(n−1) . (1.51)
This means we compute the derivative of the loss with respect to the weights in the reverse direction of the network evaluation shown in Eq.(1.45). We have shown this only for the network weights, but it works the same way for the biases. This back-propagation is the crucial step in defining appropriate network parameters by numerically minimizing a loss function. The simple back-propagation might also give a hint as to why the chain-like network structure of Eq.(1.45) combined with the affine layers of Eq.(1.46) has turned out to be so successful as a high-dimensional representation of arbitrary functions.
The output of the back-propagation in the network training is the expectation value of the derivative of the loss function
with respect to a network parameter. We could evaluate this expectation value over the full training dataset. However,
especially for large datasets, it becomes impossible to compute this expectation value, so instead we evaluate the same
expectation value over a small, randomly chosen subset of the training data. This method is called stochastic gradient
descent, and the subsets of the training data are called minibatches or batches,

⟨ ∂L/∂θ_j ⟩_minibatch    with    θ_j ∈ {b, W} . (1.52)

Even though the training data is split into batches and the network training works on these batches, we still follow the progress of the training and the numerical value of the loss as a function of epochs, defined as the number of batch steps required for the network to evaluate the full training sample once.
After showing how to compute the loss function and its derivative with respect to the network parameters, the final
question is how we actually do the minimization. For a given network parameter θj , we first need to scan over possible
values widely, and then tune it precisely to its optimal value. In other words, we first scan the parameter landscape
globally, identify the global minimum or at least a local minimum close enough in loss value to the global minimum, and
then descend into this minimum. This is a standard task in physics, including particle physics, and compared to many
applications of Markov chain Monte Carlo methods the standard ML-minimization is not very complicated. We start with the naive iterative optimization in time steps

θ_j^(t+1) = θ_j^(t) − α ⟨ ∂L^(t)/∂θ_j ⟩ . (1.53)
The minus sign means that our optimization walks against the direction of the gradient, and α is the learning rate. From
our description above it is clear that the learning rate should not be constant, but should follow a decreasing schedule.
One of the problems with the high-dimensional loss optimization is that far away from the minimum the gradients are
small and not reliable. Nevertheless, we know that we need large steps to scan the global landscape. Once we approach a
minimum, the gradients will become larger, and we want to stay within the range of the minimum. An efficient adaptive
strategy is given by
θ_j^(t+1) = θ_j^(t) − α ⟨ ∂L^(t)/∂θ_j ⟩ / √( ⟨ ∂L^(t)/∂θ_j ⟩² + ε ) . (1.54)
Away from the minimum, this form allows us to enhance the step size even for small gradients by choosing a sizeable value α. However, whenever the gradient grows too fast, the step size remains below the cutoff α/ε. Finally, we can stabilize the walk through the loss landscape by mixing the loss gradient at the most recent step with gradients from the updates before,
⟨ ∂L^(t)/∂θ_j ⟩  →  β ⟨ ∂L^(t)/∂θ_j ⟩ + (1 − β) ⟨ ∂L^(t−1)/∂θ_j ⟩ . (1.55)
This strategy is called momentum, and now the complicated form of the denominator in Eq.(1.54) makes sense, and
serves as a smoothing of the denominator for rapidly varying gradients. A slightly more sophisticated version of this
adaptive scan of the loss landscape is encoded in the widely used Adam optimizer.
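The combination of Eqs.(1.53)-(1.55) can be illustrated on a one-dimensional toy loss; the learning rate, the momentum parameter, and the loss itself are illustrative choices in the spirit of Adam, not its exact definition.

```python
import numpy as np

# Adaptive gradient descent with momentum and a smoothed denominator.
loss_grad = lambda theta: 2 * (theta - 3.0)          # gradient of (theta - 3)^2

theta, m, v = 10.0, 0.0, 0.0
alpha, beta, eps = 0.1, 0.9, 1e-8
for t in range(200):
    g = loss_grad(theta)
    m = beta * m + (1 - beta) * g                    # momentum, Eq.(1.55)
    v = beta * v + (1 - beta) * g**2                 # smoothed squared gradient
    theta = theta - alpha * m / (np.sqrt(v) + eps)   # adaptive step, Eq.(1.54)
print(theta)                                         # approaches the minimum at 3
```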
Note that for any definition of the step size we still need to schedule the learning rate α. A standard choice for such a
learning rate scheduling is an exponential decay of α with the batch or epoch number. An interesting alternative is a
one-cycle learning rate where we first increase α batch after batch, with a dropping loss, until the loss rapidly explodes.
This point defines the size of the minimum structure in the loss landscape. Now we can choose the step size at the
minimum loss value to define a suitable constant learning rate for our problem, potentially leading to much faster training.
Finally, we need to mention that the minimization of the loss function for a neural network only ever uses first derivatives,
differently from the optimization of a fit function. The simple reason is that for the large number of network parameters θ
the simple scaling of the computational effort rules out computing second derivatives like we would normally do.
Going back to the three errors introduced in the last section, they can be translated directly to neural networks. The approximation error is less obvious than for the choice of fit function, but the expressiveness of a neural network is also limited through the network architecture and the set of hyperparameters. The training error becomes more relevant because we now minimize the loss over an extremely high-dimensional parameter space, where we cannot expect to find the global minimum and will always have to settle for a sufficiently good local minimum. To define a compromise between the approximation and generalization errors we usually divide an ML-related dataset into three parts. The main part is the training data, anything between 60% and 80% of the data. The above-mentioned test data is then 10% to 20% of the complete dataset, and we use it to check how the network generalizes to unseen data, test for overtraining, or measure the network performance. The validation data can be used to re-train the network and to optimize the architecture or the different settings of the network. The crucial aspect is that the test data is completely independent of the network training.
The training data consists of pairs (x, A)j . We define p(A) ≡ p(A|x) as the probability distribution for possible
amplitudes at a given phase space point x and omit the argument x from now on. The mean value for the amplitude at the
point x is
⟨A⟩ = ∫ dA A p(A)    with    p(A) = ∫ dθ p(A|θ) p(θ|x_train) . (1.57)
Here, we can think of p(A|θ) as a single model describing an amplitude through a set of network parameters, while
p(θ|xtrain ) weights this model by its level of agreement with the training data xtrain . We do not know the closed form of
p(θ|x_train), because it is encoded in the training data. Training the network means that we approximate it as a distribution, using a variational approximation for the integrand in the sense of a distribution and a test function,
p(A) = ∫ dθ p(A|θ) p(θ|x_train) ≈ ∫ dθ p(A|θ) q(θ|x) ≡ ∫ dθ p(A|θ) q(θ) . (1.58)
As for p(A) we omit the x-dependence of q(θ|x). This approximation leads us directly to the BNN loss function. We
define the variational approximation using the KL-divergence introduced in Eq.(1.18),
D_KL[q(θ), p(θ|x_train)] = ⟨ log ( q(θ) / p(θ|x_train) ) ⟩_q = ∫ dθ q(θ) log ( q(θ) / p(θ|x_train) ) . (1.59)
There are many ways to compare two distributions, defining a problem called optimal transport. We will come back to alternative ways of comparing probability densities over high-dimensional spaces in Sec. ??. Using Bayes’ theorem we can write the KL-divergence as

D_KL[q(θ), p(θ|x_train)] = ∫ dθ q(θ) log ( q(θ) p(x_train) / ( p(θ) p(x_train|θ) ) )
= D_KL[q(θ), p(θ)] − ∫ dθ q(θ) log p(x_train|θ) + log p(x_train) ∫ dθ q(θ) . (1.60)
The prior p(θ) describes the network parameters before training; since it does not really include prior physics or training
information we will still refer to it as a prior, but we think about it as a hyperparameter which can be chosen to optimize
performance and stability. From a practical perspective, a good prior will help the network converge more efficiently, but
any prior should give the correct results, and we always need to test the effect of different priors.
The evidence p(xtrain ) guarantees the correct normalization of p(θ|xtrain ) and is usually intractable. If we implement the
normalization condition for q(θ) by construction, we find
D_KL[q(θ), p(θ|x_train)] = D_KL[q(θ), p(θ)] − ∫ dθ q(θ) log p(x_train|θ) + log p(x_train) . (1.61)
The log-evidence in the last term does not depend on θ, which means that it will not be adjusted during training and we can ignore it when constructing the loss. However, it ensures that D_KL[q(θ), p(θ|x_train)] can reach its minimum at zero.
Alternatively, we can solve the equation for the evidence and find
log p(x_train) = D_KL[q(θ), p(θ|x_train)] − D_KL[q(θ), p(θ)] + ∫ dθ q(θ) log p(x_train|θ)
≥ ∫ dθ q(θ) log p(x_train|θ) − D_KL[q(θ), p(θ)] . (1.62)
This condition is called the evidence lower bound (ELBO), and the evidence reaches this lower bound exactly when our
training condition in Eq.(1.18) is minimal. Combining all of this, we turn Eq.(1.61) or, equivalently, the ELBO into the
loss function for a Bayesian network,
L_BNN = − ∫ dθ q(θ) log p(x_train|θ) + D_KL[q(θ), p(θ)] . (1.63)
The first term of the BNN loss is a likelihood sampled according to q(θ), the second enforces a (Gaussian) prior. This
Gaussian prior acts on the distribution of network weights. Using an ELBO loss means nothing but minimizing the
KL-divergence between the probability p(θ|xtrain ) and its network approximation q(θ) and neglecting all terms which do
not depend on θ. It results in two terms, a likelihood and a KL-divergence, which we will study in more detail next.
The Bayesian network output is constructed in a non-linear way with a large number of layers, so we can assume that Gaussian weight distributions do not limit us in terms of the uncertainty on the network output. The log-likelihood log p(x_train|θ) implicitly includes the sum over all training points.
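The structure of the ELBO loss in Eq.(1.63) can be sketched for a single Gaussian network weight: the likelihood term is sampled over q(θ) and the KL-divergence to a unit-Gaussian prior is known in closed form. The toy data, the prior choice, and the single-parameter model are illustrative assumptions; a real BNN does this for every weight of the network.

```python
import numpy as np

# Two terms of the ELBO loss of Eq.(1.63) for one Gaussian weight theta.
rng = np.random.default_rng(4)
x_train, sigma_noise = rng.normal(2.0, 0.5, size=50), 0.5   # toy data around theta = 2

def elbo_loss(mu_q, sigma_q, n_samples=100):
    theta = mu_q + sigma_q * rng.standard_normal(n_samples)  # sample q(theta)
    # sampled negative log-likelihood, -<log p(x_train|theta)>_q (up to constants)
    nll = np.mean([np.sum((x_train - t)**2) / (2 * sigma_noise**2) for t in theta])
    # closed-form KL between q = N(mu_q, sigma_q^2) and the prior p = N(0, 1)
    kl = np.log(1.0 / sigma_q) + (sigma_q**2 + mu_q**2 - 1.0) / 2.0
    return nll + kl

print(elbo_loss(2.0, 0.1), elbo_loss(0.0, 1.0))   # good vs poor approximation
```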
Before we discuss how we evaluate the Bayesian network in the next section, we want to understand more about the BNN
setup and loss. First, let us look at the deterministic limit of our Bayesian network loss. This means we want to look at the loss function of the BNN in the limit

q(θ) → δ(θ − θ_0) . (1.64)

The easiest way to look at this limit is to first assume a Gaussian form of the network parameter distributions, as given in Eq.(1.27),

q_{µ,σ}(θ) = 1/(√(2π) σ_q) e^{−(θ−µ_q)²/(2σ_q²)} , (1.65)

and correspondingly p_{µ,σ}(θ) with mean µ_p and width σ_p for the prior. The KL-divergence between two such Gaussians has the closed form

D_KL[q_{µ,σ}(θ), p_{µ,σ}(θ)] = log( σ_p/σ_q ) + ( σ_q² + (µ_q − µ_p)² )/(2σ_p²) − 1/2 . (1.66)
We can now evaluate this KL-divergence in the limit of σ_q → 0 and finite µ_q(θ) → θ_0 as the one remaining θ-dependent parameter,

D_KL[q_{µ,σ}(θ), p_{µ,σ}(θ)] → (θ_0 − µ_p)²/(2σ_p²) + const . (1.67)
Figure 4: Evaluation of a Bayesian network as an ensemble of networks: sampling the network parameters ω ∼ q(ω) gives per-sample outputs A(ω_i) and σ_stoch(ω_i), which are combined into ⟨A⟩ = (1/N) ∑_i A(ω_i), σ²_stoch = (1/N) ∑_i σ²_stoch(ω_i), and σ²_pred = (1/N) ∑_i (⟨A⟩ − A(ω_i))².
Inserted into Eq.(1.63), the deterministic limit of the BNN loss becomes

L_BNN = − log p(x_train|θ_0) + (θ_0 − µ_p)²/(2σ_p²) . (1.68)
The first term is again the likelihood defining the correct network parameters, the second ensures that the network parameters do not become too large. Because it includes the squares of the network parameters, it is referred to as an L2-regularization. Going back to Eq.(1.63), an ELBO loss is a combination of a likelihood loss and a regularization. While for the Bayesian network the prefactor of this regularization term is fixed, we can generalize this idea and apply an L2-regularization to any network with an appropriately chosen prefactor.
Sampling the likelihood following a distribution of the network parameters, as it happens in the first term of the Bayesian
loss in Eq.(1.63), is something we can also generalize to deterministic networks. Let us start with a toy model where we
sample over network parameters by either including them in the loss computation or not. Such a random sampling
between two discrete possible outcomes is described by a Bernoulli distribution. If the two possible outcomes are zero
and one, we can write the distribution in terms of the expectation value ρ ∈ [0, 1],
p_Bernoulli(x) = { ρ^x (1 − ρ)^{1−x} for x = 0, 1 ;  0 else } . (1.69)
When we include an event, the network weight is set to θ0 , otherwise it is set to zero, or θ = xθ0 . We can include it as a
test function for our integral over the log-likelihood log p(xtrain |θ) and find
L_BNN = − ∫ dx ρ^x (1 − ρ)^{1−x} log p(x_train|x θ_0)
2 Regression
2.1 Amplitude regression
After introducing BNNs using the notation of transition amplitude learning, we still have to extract the mean and the
uncertainty for the amplitude A over phase space. In the evaluation step we exchange the two integrals in Eq.(1.57) and
use the variational approximation to write the mean prediction A for a given phase space point as
⟨A⟩ = ∫ dA dθ A p(A|θ) p(θ|x_train)
    = ∫ dA dθ A p(A|θ) q(θ)
    ≡ ∫ dθ q(θ) A(θ)    with the θ-dependent mean    A(θ) = ∫ dA A p(A|θ) . (2.1)
We can interpret this formula as a sampling over network parameters, provided we assume uncorrelated variations of the
individual network parameters. Corresponding to the definition of the θ-dependent mean A, the variance of A is
σ²_tot = ∫ dA dθ (A − ⟨A⟩)² p(A|θ) q(θ)
       = ∫ dA dθ ( A² − 2A⟨A⟩ + ⟨A⟩² ) p(A|θ) q(θ)
       = ∫ dθ q(θ) [ ∫ dA A² p(A|θ) − 2⟨A⟩ ∫ dA A p(A|θ) + ⟨A⟩² ∫ dA p(A|θ) ] (2.2)
For the three integrals we can generalize the notation for the θ-dependent mean as in Eq.(2.1) and write
σ²_tot = ∫ dθ q(θ) [ A²(θ) − 2⟨A⟩ A(θ) + ⟨A⟩² ]
       = ∫ dθ q(θ) [ A²(θ) − A(θ)² + A(θ)² − 2⟨A⟩ A(θ) + ⟨A⟩² ]
       = ∫ dθ q(θ) [ A²(θ) − A(θ)² + ( A(θ) − ⟨A⟩ )² ]
       ≡ σ²_stoch + σ²_pred . (2.3)
For this transformation we keep in mind that hAi is already integrated over θ and A and can be pulled out of the integrals.
This expression defines two contributions to the variance or uncertainty.
First, σpred is defined in terms of the θ-integrated expectation value hAi
σ²_pred = ∫ dθ q(θ) [ A(θ) − ⟨A⟩ ]² . (2.4)
Following from the definition in Eq.(2.1), it vanishes in the limit of perfectly narrow network weights, q(θ) → δ(θ − θ0 ).
This limit requires perfect training, so we expect σpred to decrease with more and better training data. In that sense it
represents a statistical uncertainty.
Second, σstoch is defined without sampling the network parameters,
σ²_stoch ≡ ⟨ σ_stoch(θ)² ⟩ = ∫ dθ q(θ) σ_stoch(θ)²
         = ∫ dθ q(θ) [ A²(θ) − A(θ)² ] . (2.5)
It does not vanish for q(θ) → δ(θ − θ0 ), but it vanishes if the amplitude is arbitrarily well known and well described,
characterized by p(A|θ) → δ(A − A0 ). While this uncertainty may receive contributions from too little training data, it
only approaches a plateau for perfect training. This plateau value can reflect a stochastic training sample, limited
expressivity of the network, not-so-smart choices of hyperparameters etc, in the sense of a systematic uncertainty. To
avoid misunderstandings we can refer to it as a stochastic or as a model-related uncertainty,

σ²_stoch ≡ σ²_model ≡ ⟨ σ_model(θ)² ⟩ . (2.6)
To understand these two uncertainty measures better, we can look at the θ-dependent network output and read Eq.(2.1)
and (2.5) as sampling A(θ) and σmodel (θ)2 over a network parameter distribution q(θ) for each phase space point x
BNN :    x, θ → ( A(θ), σ_model(θ) ) . (2.7)
If we follow Eq.(1.63) and assume q(θ) to be Gaussian, we now have a network with twice as many parameters as a
standard network to describe two outputs. For a given phase space point x we can then compute the three global network
predictions hAi, σmodel , and eventually σpred . Unlike the distribution of the individual network weights q(θ), the amplitude
output is not Gaussian.
The evaluation of a BNN is illustrated in Fig. 4. From this graphic we see that a BNN works very much like an ensemble of networks trained on the same data and producing a spread of outputs. The first advantage over an ensemble is that the BNN is only twice as expensive as a regular network, and less if we assume that the likelihood loss leads to an especially efficient training. The second advantage of the BNN is that it learns a function and its uncertainty together, which will give us some insight into how advanced networks learn such densities in Sec. ??. The disadvantage of Bayesian networks compared to ensembles is that the Bayesian network only covers local structures in the loss landscape.
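The sampling step of Fig. 4 and Eqs.(2.1)-(2.5) is simple to write down; in the sketch below the per-sample outputs are generated randomly instead of by an actual network, and all numbers are illustrative.

```python
import numpy as np

# BNN evaluation: sample N parameter sets, collect A(omega_i) and
# sigma_stoch(omega_i), and combine them into the mean and the two uncertainties.
rng = np.random.default_rng(6)
N = 1000
A_samples = rng.normal(1.00, 0.05, N)          # A(omega_i); their spread gives sigma_pred
sigma_samples = rng.normal(0.10, 0.01, N)      # sigma_stoch(omega_i) per sample

A_mean = A_samples.mean()                                  # <A>, Eq.(2.1)
sigma_pred2 = np.mean((A_samples - A_mean)**2)             # Eq.(2.4)
sigma_stoch2 = np.mean(sigma_samples**2)                   # Eq.(2.5)
sigma_tot = np.sqrt(sigma_pred2 + sigma_stoch2)            # Eq.(2.3)
print(A_mean, np.sqrt(sigma_pred2), np.sqrt(sigma_stoch2), sigma_tot)
```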
While we usually do not assume a Gaussian uncertainty on the Bayesian network output, this might be a good
approximation for σmodel (θ). Using this approximation we can write the likelihood p(xtrain |θ) in Eq.(1.63) as a Gaussian
and use the closed form for the KL-divergence in Eq.(1.66), so the BNN loss function turns into

L_BNN = ∫ dθ q_{µ,σ}(θ) ∑_{points j} [ ( A_j(θ) − A_j^truth )² / ( 2σ_model,j(θ)² ) + log σ_model,j(θ) ] + D_KL[q_{µ,σ}(θ), p(θ)] . (2.8)
As always, the amplitudes and σ_model are functions of phase space x. The loss is minimized with respect to the means and standard deviations of the network weights describing q(θ). The interesting aspect of this loss function is that, while the training data just consists of amplitudes at phase space points and does not include an uncertainty estimate, the network constructs a point-wise uncertainty from the variation of the network parameters. This means we can rely on a well-defined likelihood loss rather than some kind of MSE loss even for our regression network training, which is fudging genius!
For some applications we might want to use this advantage of the BNN loss in Eq.(2.8), but without sampling the network
parameters θ. In complete analogy to a fit we can then use a deterministic network like in Eq.(1.66), described by
q(θ) = δ(θ − θ_0) inserted into the Gaussian BNN loss in Eq.(2.8) to define

L_heteroskedastic = ∑_{points j} [ ( A_j(θ_0) − A_j^truth )² / ( 2σ_model,j(θ_0)² ) + log σ_model,j(θ_0) ] + (θ_0 − µ_p)²/(2σ_p²) . (2.9)
The interplay between the first two terms works in a way that the first term can be minimized either by reproducing the
data and minimizing the numerator, or by maximizing the denominator. The second term penalizes the second strategy,
defining a correlated limit of A and σmodel over phase space. Compared to the full Bayesian network this simplified
approach has two disadvantages: first, we implicitly assume that the uncertainty on the amplitudes is Gaussian. Second,
σmodel only captures noisy amplitude values in the training data. Extracting a statistical uncertainty, as encoded in σpred ,
requires us to sample over the weight space. Obviously, this simplification cannot be interpreted as an efficient network
ensembling.
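The interplay of the first two terms of Eq.(2.9) is easy to see in a one-line implementation; the sketch below omits the weight-regularization term, and the placeholder arrays stand in for network outputs and true amplitudes.

```python
import numpy as np

# Heteroskedastic loss of Eq.(2.9) without the parameter-regularization term:
# small sigma_model is only rewarded where the prediction matches the truth.
def heteroskedastic_loss(A_pred, A_truth, sigma_model):
    return np.sum((A_pred - A_truth)**2 / (2 * sigma_model**2) + np.log(sigma_model))

A_truth = np.array([1.0, 2.0, 3.0])
A_pred = np.array([1.1, 1.9, 3.3])
print(heteroskedastic_loss(A_pred, A_truth, sigma_model=np.array([0.1, 0.1, 0.3])))
```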
For a specific task, let us look at LHC amplitudes for the production of two photons and a jet [?],
gg → γγg(g) (2.10)
The corresponding transition amplitude A is a real function over phase space, with the detector-inspired kinematic cuts
The jet 4-momenta are identified with the gluon 4-momenta through a jet algorithm. The standard computer program for
this calculation is called NJet, and the same amplitudes can also be computed with the event generator Sherpa at one loop.
This amplitude has to be calculated at one loop for every phase space point, so we want to train a network once to
reproduce its output much faster. Because these amplitude calculations are a key ingredient to the LHC simulation chain,
the amplitude network needs to reproduce the correct amplitude distributions including all relevant features and with a
reliable uncertainty estimate.
Figure 5: Performance of the BNN, loss-boosted BNN, and process-boosted BNN, in terms of the precision of the generated
amplitudes, defined in Eq.(2.12) and evaluated on the training (upper) and test datasets. Figures from Ref. [?].
For one gluon in the final state, the most general phase space has 5 × 4 = 20 dimensions. We could simplify this phase
space by requiring momentum conservation and on-shell particles in the initial and final state, but for this test we leave
these effects for the network to be learned. Limiting ourselves to one jet, this means we train a regression network to learn
a real number over a 20-dimensional phase space. To estimate the accuracy of the network we can compute the relative deviation between the network amplitudes and the training or test data,

∆_j^(train/test) = ( ⟨A⟩_j − A_j^(train/test) ) / A_j^(train/test) . (2.12)
In the upper left panel of Fig. 5 we show the performance of a well-trained Bayesian network in reproducing the training
dataset. While the majority of phase space points are described very precisely, a problem occurs for the phase space
points with the largest amplitudes. The reason for this shortcoming is that there exist small phase space regions where the
transition amplitude increases rapidly, by several orders of magnitude. The network fails to learn this behavior in spite of a
log-scaling following Eq.(1.42), because the training data in these regions is sparse. For an LHC simulation this kind of
bias is a serious limitation, because exactly these phase space regions need to be controlled for a reliable rate prediction.
This defines our goal of an improved amplitude network training, namely to identify and control outliers in the
∆-distribution of Eq.(2.12).
The likelihood loss in Eq.(2.8) with its σmodel (θ) has a great advantage, because we can test the required Gaussian shape
of the corresponding pull variable for the truth and NN-amplitudes,
( A_j(θ) − A_j^truth ) / σ_model,j(θ) , (2.13)
as part of the training. For the initial BNN run it turns out that this distribution indeed looks like a Gaussian in the peak
region, but with too large and not exponentially suppressed tails. We can improve this behavior using an inspiration from
the boosting of a decision tree. We enhance the impact of critical phase space points through an increased event weight,
This boosted or feedback training will improve the network performance both in the accuracy of the amplitude prediction and in the learned uncertainty on the network amplitudes. If we limit ourselves to self-consistency arguments, we can select the amplitudes with n_j > 1 through large pull values, as defined in Eq.(2.13). It turns out that this loss-based boosting significantly improves the uncertainty estimate. However, in the upper right panel of Fig. 5 we see that the effect on the large amplitudes is modest, large amplitudes still lead to too many outliers in the network accuracy, and we also tend to systematically underestimate those large amplitudes.
To further improve the performance of our network we can target the problematic phase space points directly, by
increasing nj based on the size of the amplitudes. This process-specific boosting goes beyond the self-consistency of the
network and directly improves a specific task beyond just learning the distribution of amplitudes over phase space. In the
lower panels of Fig. 5 we see that at least for the training data the ∆-distribution looks the same for small and large amplitudes. Going back to Sec. 1.3, we can interpret the boosted loss for large values of n_j as a step towards interpolating the corresponding amplitudes. While for small amplitudes the network still corresponds to a fit, we are forcing the
network to reproduce the amplitudes at some phase space points very precisely. Obviously, this boosting will lead to
issues of overtraining. We can see this in the lower right panel of Fig. 5, where the improvement through process boosting
for the test dataset does not match the improvement for the training dataset. However, as long as the performance of the
test dataset improves, even if it is less than for the training dataset, we improve the network in spite of the overtraining.
The issue with this overtraining is that it becomes harder to control the uncertainties, which might require some kind of
alternating application of loss-based and process boosting.
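One way loss-based boosting could be implemented mechanically is sketched below: events whose pull, Eq.(2.13), lies far in the tail get an increased training weight n_j for the next round. The threshold and the weight increment are arbitrary illustrative choices, not the procedure of Ref. [?].

```python
import numpy as np

# Pull-based reweighting: boost badly described training points.
def update_weights(A_pred, A_truth, sigma_model, n_j, threshold=3.0):
    pull = np.abs(A_pred - A_truth) / sigma_model     # Eq.(2.13)
    return np.where(pull > threshold, n_j + 1, n_j)

n_j = np.ones(5)
A_truth = np.array([1.0, 2.0, 3.0, 4.0, 50.0])
A_pred = np.array([1.1, 2.0, 2.9, 4.2, 30.0])
sigma = np.array([0.2, 0.2, 0.2, 0.2, 1.0])
print(update_weights(A_pred, A_truth, sigma, n_j))    # only the last event is boosted
```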
This example illustrates three aspects of advanced regression networks. First, we have seen that a Bayesian network
allows us to construct a likelihood loss even if the training data does not include an uncertainty estimate. Second, we can
improve the consistency of the network training by boosting events selected from the pull distribution. Third, we can
further improve the network through process-specific boosting, where we force the network from a fit to an interpolation
mode based on a specific selection of the input data. For the latter the benefits do not fully generalize from the training to
the test dataset, but the network performance does improve on the test dataset, which is all we really require.
The last application of a regression network is the numerical calculation of a D-dimensional phase space integral
I(s) = ∫₀¹ dx_1 ··· ∫₀¹ dx_D f(s; x) , (2.15)

where the x_i are the integration variables and s is a vector of additional parameters, not integrated over. Because the values of the integrand can span a wide numerical range it is useful to normalize the integrand, for example by its value at the center of the x-hypercube,

f(s; x) → f(s; x) / f(s; 1/2, 1/2, ..., 1/2)    ⇔    I(s) → I(s) / f(s; 1/2, 1/2, ..., 1/2) . (2.16)
Without going into details, it is also useful to transform the integrand into a form which vanishes at the integration
boundaries. Analytically, we would compute the primitive F ,
d^D F(s; x) / ( dx_1 ··· dx_D ) = f(s; x) , (2.17)
In particle physics we really never know the primitive of a phase space integrand, but we can try to construct it and
encode it in a neural network,
On the other hand, we do not have data to train a surrogate network for F directly. The idea is to instead train an
integrand surrogate, such that its D-th derivative matches f ,
L_MSE [ f(s; x), d^D F_θ(s; x) / ( dx_1 ··· dx_D ) ] . (2.20)

If the training on the integrand fixes the network weights such that the integrand as well as F are determined by the same network, and F_θ fulfills Eq.(2.19), we have directly learned the integral.
To construct a surrogate which can be differentiated multiple times with respect to some of its inputs, we need a
differentiable activation function, for example the sigmoid function,
Sigmoid(x) = 1/(1 + e^{−x})    with primitive    ∫ dx Sigmoid(x) = log(e^x + 1)    and n-fold primitive    −Li_n(−e^x) . (2.21)
There exists a fast sum representation of the dilogarithm Li_2 for numerical evaluation. The same set of derivatives can be computed for the tanh activation function.
Next, we need to compute the derivative of this fully connected neural network. Following the conventions of Eq.(1.45),
the input layer for the two input vectors is
x_i^(0) = { x_i for i ≤ D ;  s_{i−D} for i > D } . (2.22)
For the hidden layers we just replace the ReLU activation function in Eq.(1.49) with the sigmoid,
x_i^(n) = Sigmoid[ W_ij^(n) x_j^(n−1) + b_i^(n) ] . (2.23)
The scalar output of the network with N layers can be differentiated, for instance, with respect to x1 ,
f_θ ≡ x^(N) = W_j^(N) x_j^(N−1) + b^(N)
⇒ df_θ/dx_1 = W_j^(N) dx_j^(N−1)/dx_1
= ∑_j W_j^(N) Sigmoid′[ W_jk^(N−1) x_k^(N−2) + b_j^(N−1) ] [ W_jℓ^(N−1) dx_ℓ^(N−2)/dx_1 ] , (2.24)
where we write the sum over j explicitly, while for the other indices we use the usual summing convention. Next, we differentiate this expression with respect to x_2, altogether D times, to compute the MSE loss in Eq.(2.20). The loss can be minimized with respect to the network parameters θ using the usual backpropagation. Because the integrand is known exactly, there is no need to regularize the network, but it would also not hurt. Also, the numerical generation of integrand values is cheap, which means F_θ can be trained using very large numbers of training data points.
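The derivative-matching training of Eq.(2.20) can be sketched with automatic differentiation instead of the explicit derivative chain of Eq.(2.24). The two-dimensional toy integrand, the architecture, and all hyperparameters below are invented for illustration.

```python
import torch

# Train a surrogate F_theta so that its mixed derivative matches the integrand (D = 2).
torch.manual_seed(0)
f = lambda x: torch.exp(-x[:, 0] * x[:, 1])              # toy integrand on [0,1]^2

F_theta = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Sigmoid(),
    torch.nn.Linear(32, 32), torch.nn.Sigmoid(),
    torch.nn.Linear(32, 1))
opt = torch.optim.Adam(F_theta.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.rand(256, 2, requires_grad=True)
    F = F_theta(x).sum()
    dF_dx1 = torch.autograd.grad(F, x, create_graph=True)[0][:, 0]
    d2F = torch.autograd.grad(dF_dx1.sum(), x, create_graph=True)[0][:, 1]
    loss = torch.mean((d2F - f(x))**2)                   # MSE of Eq.(2.20)
    opt.zero_grad(); loss.backward(); opt.step()

# the integral follows from F_theta at the corners of the unit square
corners = torch.tensor([[1., 1.], [1., 0.], [0., 1.], [0., 0.]])
signs = torch.tensor([1., -1., -1., 1.])
print((signs * F_theta(corners).squeeze()).sum().item())  # exact value is about 0.797
```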
Figure 6: Number of digits accuracy for two integrals, using two different activation functions, one of them defined in
Eq.(2.25). Figure from Ref. [?].
In the original paper, the method is showcased for two integrals, one of them is
I_1L(s_12, s_14, m_H², m_t²) = ∫₀¹ dx_1 ∫₀¹ dx_2 ∫₀¹ dx_3 1/F_1L²
with F_1L = m_t² + 2x_3 m_t² + x_3² m_t² + 2x_2 m_t² − x_2 s_14 + 2x_2 x_3 m_t² − x_2 x_3 m_H² + x_2² m_t²
+ 2x_1 m_t² + 2x_1 x_3 m_t² − x_1 x_3 s_12 + 2x_1 x_2 m_t² − x_1 x_2 m_H² + x_1² m_t² . (2.25)

The accuracy of the network estimate I_NN is quantified as

p = log_10 | ( I_NN − I_truth ) / I_truth | , (2.26)
giving the effective number of digits the estimate gets right. The results for this accuracy are shown in Fig. 6. They are based on training an ensemble of eight replicas of the same network, using their average as the central prediction of the integral and their standard deviation as an uncertainty estimate. The entries in the histogram have different initialisations and are trained on different training data. The results for the two different activation functions are similar. The two-loop integral,
which we skip in this summary, has a lower accuracy than the one-loop integral, which is to be expected given the larger
number of integrations.