ML Physics
Abstract
Physics and modern machine learning have a huge number of connections, through their scientific goals, methods,
applications, and theoretical understanding. These notes provide an introduction to modern machine learning, specifically
for physicists. As prerequisites they rely on some results from bachelor-level physics lectures, but do not assume any prior
knowledge of machine learning.
1 Basics
1.1 Statistics
1.2 Multivariate analysis
1.3 Fits and interpolations
1.4 Neural networks
1.5 Bayesian networks and likelihood loss
2 Regression
2.1 Amplitude regression
2.2 Numerical integration
1 Basics
1.1 Statistics
Bayes’ Theorem Before we start with some basics of machine learning, let us remind ourselves of basic statistics from our first year of physics studies. A formal introduction to statistics starts with the three Kolmogorov axioms. The first axiom states that the probability for a given outcome is a non-negative real number,

p(A) ≥ 0 . (1.1)
The second axiom states that the sum of probabilities for all possible outcomes of an experiment is one,

∑_i p(A_i) = 1    or    ∫ dA p(A) = 1 . (1.2)
The third axiom states that probabilities for disjoint outcomes add,

p(A_1 ∪ A_2) = p(A_1) + p(A_2)    for    A_1 ∩ A_2 = ∅ . (1.3)

The logical combination of these outcomes is an ‘or’. Alternatively, we can ask what the joint probability of two measurements A and B is. Their independence is defined by

p(A ∩ B) = p(A) p(B) . (1.4)
If two measurements are not independent, we need to define conditional probabilities. This means we ask what the probability of A is under the condition that we also observed B,

p(A|B) := p(A ∩ B)/p(B)    ⇔    p(A|B) p(B) = p(A ∩ B) . (1.5)

p(A ∩ B) = p(B ∩ A)    ⇒    p(B|A) = p(A ∩ B)/p(A) . (1.6)
These formulas can be viewed as definitions or as axioms, extending the Kolmogorov axioms. They give us Bayes’ Theorem, which states that the conditional probability for a theory T given a set of measurements M is

p(T|M) = p(M|T) p(T) / p(M) . (1.7)
It comes with a set of definitions: p(T|M) is called the posterior probability; if we are interested in p(M|T) as a function of the second argument T, it is called the likelihood. The problem with the likelihood is its lack of normalization as a function of T. p(M) is called the evidence, and it is typically replaced through the normalization condition
1 = ∫ dT p(T|M) = (1/p(M)) ∫ dT p(M|T) p(T)
⇔ p(M) = ∫ dT p(M|T) p(T) . (1.8)
Finally, p(T) is called the prior probability or just the prior. In Eq.(1.8) we see that it defines an integration measure over theory space. Bayes’ Theorem tells us how to combine two independent measurements. We first make one measurement, then compute its posterior probability, use this as the prior for the second measurement, and compute the combined probability. This is logically clear, but technically complicated.
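This sequential combination can be sketched numerically on a discretized theory space. The Gaussian measurement model, the grid, and all numbers below are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Minimal sketch of combining two measurements with Bayes' theorem on a
# discretized theory parameter T; measurement model and numbers are invented.
T = np.linspace(-5, 5, 1001)                              # grid over theory space
prior = np.ones_like(T) / np.trapz(np.ones_like(T), T)    # flat prior p(T)

def posterior(measurement, sigma, prior):
    """One Bayesian update: p(T|M) ~ p(M|T) p(T), normalized via the evidence."""
    likelihood = np.exp(-(measurement - T)**2 / (2 * sigma**2))
    evidence = np.trapz(likelihood * prior, T)             # p(M) as in Eq.(1.8)
    return likelihood * prior / evidence

p1 = posterior(measurement=1.0, sigma=1.0, prior=prior)    # first measurement
p12 = posterior(measurement=1.4, sigma=0.8, prior=p1)      # second, using p1 as prior

print("combined posterior mean:", np.trapz(T * p12, T))
```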
Information entropy To describe complex systems with the help of probabilities, we need the concept of information entropy. Let us start with a system with 2^N equally probable states,

p_j = 2^{−N}    with    ∑_j p_j = 2^N 2^{−N} = 1 . (1.9)
To find out in which state the system is we can construct a decision tree. The most efficient algorithm is to step-wise split
the possible states into groups of equal size and then ask in which of the two halves the system lives. This way we avoid
any element of luck, and we need
N = − log_2 p_j = − ln p_j / ln 2 (1.10)
answers {0, 1}. Now we generalize the system to Ω states with different probabilities p_j. In that case the branches of the decision tree correspond to different probabilities, so there will be an element of luck. The number of necessary answers should be the same, but it is only defined as an expectation value. It defines the information entropy H,

⟨ − log_2 p_j ⟩ = − (1/ln 2) ∑_{j=1}^Ω p_j ln p_j =: H/ln 2 (1.11)
By definition, the information entropy measures (logarithmic) uncertainty in units of ln 2, so-called bits. Uncertainty is
the amount of expected information needed to correctly identify the state of the system.
First, we see that for pj ∈ [0, 1] each term in the expectation value in Eq.(1.11) is positive, so H ≥ 0. The entropy
assumes its smallest value for only one term, pj = 1. Here we know the system perfectly. For two outcomes with p1 = p
and p2 = 1 − p we can compute the maximum entropy easily,
H = − [ p ln p + (1 − p) ln(1 − p) ]
dH/dp = − ln p − p/p + ln(1 − p) + (1 − p)/(1 − p)
      = − ln p + ln(1 − p) = 0    ⇔    p = 1 − p = 1/2 . (1.12)
The information entropy vanishes for p = 0 and p = 1 and is symmetric around its maximum at p = 1/2. For a perfectly
understood dataset with only p(x) = 0 and p(x) = 1 there is no entropy. In general, the entropy is maximal for Ω = 1/p,
and its value depends on the number of states,
H ≤ − ∑_{j=1}^Ω p ln p = − ln p = ln Ω . (1.13)
Next, we can ask what happens with the entropy when we split our system in two. If the two systems are not independent,
we can compute the entropies in terms of the joint probability pi,j ,
H_1 = − ∑_{i=1}^{Ω_1} p_i ln p_i    H_2 = − ∑_{j=1}^{Ω_2} p_j ln p_j    H_12 = − ∑_{i,j} p_{i,j} ln p_{i,j} . (1.14)
Because equally distributed systems maximize the entropy, this combined entropy should be smaller than the sum of the independent individual entropies. The difference is the mutual information I_12,

I_12 = H_1 + H_2 − H_12
     = − ∑_{i,j} p_i p_{j|i} ln p_i − ∑_{i,j} p_{i|j} p_j ln p_j + ∑_{i,j} p_{i,j} ln p_{i,j}
     = − ∑_{i,j} [ p_i p_{j|i} ln p_i + p_j p_{i|j} ln p_j − p_{i,j} ln p_{i,j} ]
     = − ∑_{i,j} p_{i,j} ln ( p_i p_j / p_{i,j} ) ≡ D_KL[p_{i,j}, p_i p_j] . (1.15)
Figure 1: Relation between the entropies H_1, H_2, and the combined entropy H_12.
The relation between the different entropies is illustrated in Fig. 1. We can understand the mutual information in terms of clustering algorithms: if we pick a random point in the entire space, what does the association with partition 1 tell us about a possible association with partition 2?
Finally, we can construct the variation of information,

VI_12 = H_1 + H_2 − 2 I_12 = H_12 − I_12 . (1.16)

It compares two partitions, and it vanishes exactly when the two partitions are equal. For finite values it gives a proper distance, i.e. it is positive definite, symmetric, and satisfies the triangle inequality.
The functional defined in Eq.(1.15) is called Kullback–Leibler divergence and compares two probability distributions,
evaluated on a dataset corresponding to the first distribution,
D_KL[p_a, p_b] = ∑ p_a ln ( p_a / p_b ) ≠ D_KL[p_b, p_a] . (1.17)
For continuous distributions it reads
D_KL[p_a, p_b] = ⟨ log ( p_a / p_b ) ⟩_{p_a} ≡ ∫ dx p_a(x) log ( p_a(x) / p_b(x) ) . (1.18)
It vanishes if two distributions agree everywhere. We will come back to it in much more detail later.
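As a quick illustration of Eq.(1.17), the following snippet evaluates the KL-divergence for two discrete toy distributions; the distributions themselves are made up.

```python
import numpy as np

# Toy check of the discrete KL-divergence of Eq.(1.17).
p_a = np.array([0.5, 0.3, 0.2])
p_b = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    """D_KL[p, q] = sum_i p_i log(p_i/q_i), evaluated on the support of p."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl(p_a, p_b), kl(p_b, p_a))   # asymmetric, both non-negative
```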
Likelihoods and Neyman-Pearson Lemma Combining measurements is easier when we use likelihoods. Let us
assume that p(x) gives the probability that an event or phase space configuration x corresponds to some kind of signal,
turning the conditional probability or likelihood of Eq.(1.7) into
We can split the measurement into a counting experiment leading to n events with s events expected,
p_Poisson(n|s) = s^n e^{−s} / n! , (1.20)
and a properly normalized probability for each event,
p(x) ∈ [0, 1]    with    ∫ dx p(x) = 1 . (1.21)
In this form we see that the combined likelihoods for two independent datasets are simply
We can compute the ratio of two likelihoods for the same dataset of n observed events {xi }, but two theory hypotheses,
for instance signal plus background vs background only,
p({x_i}|T_{S+B}) / p({x_i}|T_B) = e^{−(s+b)} ∏_{i=1}^n [ (s + b) p_{s+b}(x_i) ] / ( e^{−b} ∏_{i=1}^n [ b p_b(x_i) ] )
= e^{−s} ∏_{i=1}^n ( s p_s(x_i) + b p_b(x_i) ) / ( b p_b(x_i) ) . (1.24)
The Neyman-Pearson lemma states that this likelihood ratio is the most powerful test statistic, or test variable, to distinguish two hypotheses. Most powerful is defined as the smallest false-negative error for a given false-positive error. For typical physics applications this means it minimizes the mis-identification of a signal as a background fluctuation.
Before moving on, let us mention that in the limit of large event counts the Poisson distribution in Eq.(1.20) becomes a
Gaussian,
p_Poisson(n|s) → p_Gauss(n|s) = 1/√(2πs) exp( − (n − s)²/(2s) ) , (1.25)
a standard form we will use throughout this lecture.
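Going back to Eq.(1.24), the likelihood ratio is easy to evaluate numerically. In the sketch below the event-level densities p_s and p_b over x ∈ [0, 1], the expected rates, and the observed events are all illustrative assumptions.

```python
import numpy as np

# Toy evaluation of the Neyman-Pearson likelihood ratio of Eq.(1.24).
rng = np.random.default_rng(1)
s, b = 5.0, 20.0                        # expected signal and background counts
p_s = lambda x: 2 * x                   # signal prefers large x, normalized on [0,1]
p_b = lambda x: np.ones_like(x)         # flat background density

x_obs = rng.random(22)                  # a hypothetical set of observed events
log_ratio = -s + np.sum(np.log((s * p_s(x_obs) + b * p_b(x_obs)) / (b * p_b(x_obs))))
print("log likelihood ratio:", log_ratio)
```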
Robust estimators In Eq.(1.7) we introduced the likelihood with an abstract definition of a theory T. In Eq.(1.24) the two theories were signal-plus-background vs background-only, again without any reference to what makes them different. In general, the theory hypothesis defining a likelihood is described by an n-tuple of model parameters θ, so we again replace our conventions,
To simplify our argument, let us assume that our measurement is not a set of events, but a set of values f(x), for instance counting rates over a binned phase space x. This means we compare a measured set of numbers f_i with the corresponding theory predictions f_θ(x). If the data f_i and the uncertainties σ_i are statistically distributed, we usually employ a logarithmic Gaussian likelihood

− log p(M|θ) = ∑_i (f_i − f_θ(x_i))² / (2σ_i²) + const . (1.27)

In this form the individual Gaussians have mean f_i, variance σ_i², standard deviation σ_i, and a width of the bell curve of 2σ_i. However, in the real world data is usually not distributed statistically, and the tails of distributions are usually less suppressed than the exponential suppression defining the Gaussian.
Given a likelihood, we can estimate the value for this model parameter θ by maximizing the likelihood that it explains the
data,
θ̂ = argmax_θ ∑_M p(M|θ)
  ≈ argmin_θ [ − log p(M|θ) ] . (1.28)
In the second step we assume that maximizing the likelihood or minimizing the negative log-likelihood gives the same
result. We call θ̂ the estimator for the model parameter θ. The sensitivity or robustness of an estimator depends on the form of the likelihood. It can be measured by the impact that a change of a datapoint has on the log-likelihood and is called the influence function,

IF = − d log p(M|θ) / df_i . (1.29)
If we already know that a Gaussian likelihood does not describe a dataset, for instance in the tails, we can ask what kind of likelihood would lead to a robust estimator. Specifically, we might want to be less sensitive to poorly modelled outliers. As an example, we consider the mean or center θ = µ of a given likelihood distribution. Starting with a one-dimensional Gaussian, we find

p(x|µ) ∝ exp( − (x − µ)²/(2µ) )
⇒ − log p(x|µ) = (x − µ)²/(2µ) + ···
⇒ IF = (x − µ)/µ  →  x    for x → ∞ . (1.30)
This influence function gives us the robustness of a Gaussian likelihood with extremely suppressed tails. Next, we check a Cauchy or Breit-Wigner distribution centered around zero,

p(x|µ) ∝ 1/(1 + x²)
⇒ − log p(x|µ) = log(1 + x²) + ···
⇒ IF = 2x/(1 + x²)  →  2/x    for x → ∞ . (1.31)

Its influence function vanishes for larger x-values, which means the Breit-Wigner distribution is more robust against outliers in the tails than a Gaussian. A more complete set of examples, including a Gaussian and the more robust Laplace, Cauchy (Breit-Wigner), and Tukey distributions, is given in Fig. 2.
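The difference in robustness can be seen in a small numerical sketch: on a toy dataset with a few gross outliers, the Gaussian estimator of the center is pulled away, while the Cauchy estimator barely moves. The dataset and the bounds of the minimization are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Gaussian vs Cauchy estimator of a location parameter on data with outliers.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 100), np.array([15.0, 20.0, 25.0])])

def nll_cauchy(mu):
    # negative log-likelihood for p(x|mu) ~ 1/(1 + (x-mu)^2), cf. Eq.(1.31)
    return np.sum(np.log1p((data - mu)**2))

mu_gauss = data.mean()                                     # minimizes the L2 loss
mu_cauchy = minimize_scalar(nll_cauchy, bounds=(-5, 5), method="bounded").x
print(f"Gaussian estimate {mu_gauss:.2f}  vs  Cauchy estimate {mu_cauchy:.2f}")
```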
As a side remark, we can use the estimator as an illustration of frequentist/likelihood vs Bayesian statistics. Everything starts with Bayes’ theorem, but the two methods ask different questions. For the estimator, the likelihood question in Eq.(1.28) determines the value for θ which is the most likely, given the data. For this purpose, it maximizes the likelihood p(M|θ). If we ask the question what M tells us about possible values of θ, we would naively use the probability instead, which means we would compute

θ̂_Bayes = argmax_θ ∑_M p(θ|M) = argmax_θ ∑_M p(M|θ) p(θ) / p(M)
        = argmax_θ ∑_M [ p(M|θ) p(θ) ] . (1.32)
This result differs from Eq.(1.28) because we now need a prior to construct the estimator. Similarly, a likelihood approach would get rid of a nuisance parameter or a physics parameter we are currently not interested in by profiling over it. This means we identify the maximum likelihood and project this value into the target space. In the Bayesian approach we integrate it out. The prior then defines a measure on θ-space which is needed for such an integration. In terms of the integration measure the profile likelihood would correspond to a delta-distribution prior, which is why we refer to differences between the methods as volume effects. Since the prior is a measure in model space, we have to extract it from other measurements or make an assumption. It is important to always keep in mind that inference can be done using likelihoods or using posterior probabilities, and if the two methods give different answers this is because they ask different questions.
One problem we encounter in any physics analysis is that we want to extract simple numbers, like an energy measurement or a signal probability, from a complex detector output. Historically, this is done by constructing a set of theory-motivated observables and evaluating each of these observables in view of the regression or classification task.

Figure 2: Likelihoods and their associated losses: a Gaussian ∼ exp(−(x−µ)²/2) gives the L2-loss ∼ (x−µ)²/2 (mean, least squares, not robust); a Laplace ∼ exp(−|x−µ|) gives the L1-loss ∼ |x−µ| (median, robust); a Cauchy ∼ 1/(1+x²) gives the loss ∼ log(1+x²) (very robust).

If all available
information can be extracted from one observable, this is called a sufficient statistic or an optimal observable. In reality, this is never the case. Instead, we are faced with a set of correlated observables, which we need to analyse together.
Let us look at such a set of correlated observables and the problem of multivariate classification. A historic solution is the decision tree. Imagine we want to classify an event x_i as signal or background, based on the observables O_j. As a basis
for this decision we study a set of training events {xi }, histogram them for each Oj , and find the values Oj,split which give
the most successful split between signal and background for each distribution.
To define the optimal split we start with the signal and background probabilities as a function of the signal event count S
and the background event count B,
p_S = s/(s + b) ≡ p    and    p_B = b/(s + b) ≡ 1 − p . (1.33)
If we look at a histogrammed normalized observable O ∈ [0 ... 1] we can compute p and 1 − p from the number of expected signal and background events following Eq.(1.33). For instance, we can look at a signal which prefers large O-values and two background distributions to compute these signal and background probabilities p(O),

s(O) = A O    b(O) = A (1 − O)  ⇒  p(O) = s(O)/(s(O) + b(O)) = O
s(O) = A O    b(O) = B          ⇒  p(O) = O/(O + B/A) . (1.34)
We then define that a single event is signal-like when its (signal) probability is p(O) > 1/2, otherwise the event is background. For the two cases in Eq.(1.34) this gives the split values

O_{j,split} = O_j |_{p=1/2} = 1/2  or  B/A . (1.35)
To organize our multivariate analysis we evaluate the performance of each optimized cut Oj,split , for example to apply the
most efficient cut first.
We already know how this discussion can be formalized — we need to construct the set of Oj,split such that they maximize
the information entropy defined in Eq.(1.11). Because we do not want to measure the information entropy, we can use a
standard logarithm to evaluate it,
H[p] = − [ p log p + (1 − p) log(1 − p) ] . (1.36)
To build a decision tree out of our observables we first compute the best splitting for each observable individually and
then choose the observable with the most successful split. More precisely, for a successful split we want to maximize the
difference of the information entropy before the split and the sum of the information entropies after the split. This is
called the information gain, and we choose the first observable of our decision tree through

max_j [ H_before split[p(O_j)] − H_after split,1[p(O_j)] − H_after split,2[p(O_j)] ] . (1.38)
Because multivariate analysis using (boosted) decision trees relies on successive split points of pre-defined observables, it
cannot be optimal. To do better, we would need to extract p(O) from our dataset and work with it without binning or
explicit cuts. This will eventually lead us to classification using neural networks.
A historic illustration for a decision tree used in particle physics is shown in Fig. 3. It comes from the first high-visibility
application of (boosted) decision trees in particle physics, to identify electron-neutrinos from a beam of muon-neutrinos
using the MiniBooNE Cherenkov detector. Each observable defines a so-called node, and the two branches below each
node are defined as ‘yes’ vs ‘no’ or as ‘signal-like’ vs ‘background-like’. The first node is defined by the observable with
the highest information gain among all the optimal splits. The two branches are based on this optimal split value, found
by maximizing the cross entropy. Every outgoing branch defines the next node again through the maximum information
gain, and its outgoing branches again reflect the optimal split, etc. Finally, the algorithm needs a condition when we stop
splitting a branch by defining a node and instead define a so-called leaf, for instance calling all events ‘signal’ after a
certain number of splittings selecting it as ‘signal-like’. Such conditions could be that all collected training events are
either signal or background, that a branch has too few events to continue, or simply by enforcing a maximum number of
branches.
No matter how we define the stopping criterion for constructing a decision tree, there will always be signal events in background leaves and vice versa. We can only guarantee that a tree is completely right for the training sample if each leaf includes exactly one training event. This perfect discrimination obviously does not carry over to an independent test sample, which means our decision tree is overtrained. In general, overtraining means that the performance, for instance of a classifier, on the training data is so finely tuned that it follows the statistical fluctuations of the training data and does not generalize to the same performance on an independent sample of test data.
Figure 3: Illustration of a decision tree from an early application in particle physics, selecting electron neutrinos to prove neutrino oscillations. Starting from 52 signal and 48 background events, the tree successively splits on the number of PMT hits, the reconstructed energy, and the radius, with the signal/background counts given at each node and leaf. Figure from Ref. [?].
If we want to further enhance the performance of the decision tree we can focus on the events which are wrongly classified after we define the leaves. For instance, we can add an event weight w > 1 to every mis-identified event (we care about) and carry this weight through the calculation of the splitting condition. This is the simple idea behind a boosted decision tree (BDT). Usually, the weights are chosen such that the sum over all events is one. If we construct several independent decision trees, we can also combine their output for the final classifier. It is not obvious that this procedure will improve the tree for a finite number of leaves, and it is not obvious that such a re-weighting will converge to a unique or even improved boosted decision tree, but in practice this method has been shown to be extremely powerful.
Finally, we need to measure the performance of a BDT classification using some kind of success metric. Obviously, a large signal efficiency alone is not sufficient, because the signal-to-background ratio s/b or the Gaussian significance s/√b depend on the signal and background rates. For a simple classification task we can compute four numbers.
If we tag all events we know the normalization conditions s^(truth) = s^(S-tagged) + s^(B-tagged) and correspondingly for b^(truth). The signal efficiency is also called recall or sensitivity in other fields of research. The background mis-identification rate can be re-phrased as the background rejection 1 − ε_B, also referred to as specificity.
Once we have tagged a signal sample we can ask how many of those tagged events are actually signal, defining the purity or precision s^(S-tagged)/(s^(S-tagged) + b^(S-tagged)). Finally, we can ask how many of our decisions are correct and compute (s^(S-tagged) + b^(B-tagged))/(s^(truth) + b^(truth)), reflecting the fraction of correct decisions and referred to as accuracy.
In particle physics we usually measure the success of a classifier in the plane ε_S vs ε_B, where for the latter we either write 1 − ε_B or 1/ε_B. If a classifier gives us, for example, a continuous measure of an event being signal-like, we can choose different working points by defining a cut on the classifier output. The problem with any such cut is that we lose information from all those signal events which almost made it into the signal-tagged sample. If we can construct our classifier such that its output is a probability, we can also weight all events by their signal vs background probability score and keep all events in our analysis.
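The metrics introduced above are straightforward to compute from a classifier score and a working point. In the sketch below the toy labels, the score distributions, and the cut value are all illustrative assumptions.

```python
import numpy as np

# Efficiency, rejection, purity, and accuracy for one working point.
rng = np.random.default_rng(7)
is_signal = rng.random(10000) < 0.3
score = np.where(is_signal, rng.normal(0.7, 0.2, 10000), rng.normal(0.3, 0.2, 10000))

tagged = score > 0.5                        # cut on the classifier output
s_tag = np.sum(tagged & is_signal)          # signal tagged as signal
b_tag = np.sum(tagged & ~is_signal)         # background tagged as signal

eff_s = s_tag / np.sum(is_signal)           # signal efficiency (recall)
eff_b = b_tag / np.sum(~is_signal)          # background mis-identification rate
purity = s_tag / (s_tag + b_tag)            # precision
accuracy = (s_tag + np.sum(~tagged & ~is_signal)) / len(score)
print(eff_s, 1 - eff_b, purity, accuracy)
```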
From a practical perspective, we start with the assumption or observation that neural networks are nothing but numerically defined functions. In the simplest case, regression networks are scalar or vector fields defined on some space, approximated by f_θ(x). In physics applications, this space is often some kind of phase space, an important aspect, because many aspects of phase space are interpretable to physicists.
Assuming that we have indirect or implicit access to the truth f (x) in form of a training dataset (x, f )j , we want to
construct the approximation
A set of functional values for a given set of points can be approximated in two ways. First, in a fit we start with a
functional form in terms of a small set of parameters which we also refer to as θ.
To determine these model parameters, we use a minimization algorithm on an appropriately defined loss function. More specifically, we maximize the probability for the fit output f_θ(x_j) to correspond to the training points f_j, with uncertainties σ_j. Because we are interested in θ, we evaluate this probability as a likelihood, for example assuming the Gaussian log-likelihood of Eq.(1.27), often referred to as χ². For the definition of the best parameters θ we ignore the θ-independent normalization, so the loss function or function we minimize to determine the fit’s model parameters is

L_fit = ∑_j (f_j − f_θ(x_j))² / (2σ_j²) ≡ ∑_j L_j . (1.40)

The fit function is not optimized to go through all or even some of the training data points, f_θ(x_j) ≠ f_j. Instead, the log-likelihood loss is a compromise to agree with all training data points within their uncertainties. We can plot the values L_j for the training data and should find a Gaussian distribution of mean zero and standard deviation one, N(µ = 0, σ = 1).
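A minimal sketch of such a likelihood fit and of the pull check is given below; the quadratic model, the toy data, and the uncertainties are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Chi^2-style fit of a simple model and the corresponding pulls.
rng = np.random.default_rng(5)
x = np.linspace(0, 1, 30)
sigma = 0.1 * np.ones_like(x)
f_data = 1.0 + 2.0 * x**2 + rng.normal(0, sigma)          # noisy training points

model = lambda x, a, b: a + b * x**2
theta, _ = curve_fit(model, x, f_data, sigma=sigma, absolute_sigma=True)

pulls = (f_data - model(x, *theta)) / sigma               # should scatter like N(0, 1)
print("fitted parameters:", theta, " pull mean/std:", pulls.mean(), pulls.std())
```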
An interesting question arises in cases where we do not know an uncertainty σj for each training point, or where such an
uncertainty does not make any sense, because we know all training data points to the same precision σj = σ. In that case
we can still define a fit function, but the loss function becomes a simple mean squared error,
L_fit = 1/(2σ²) ∑_j | f_j − f_θ(x_j) |² ≡ MSE / (2σ²) . (1.41)
Again, the prefactor is θ-independent and does not contribute to the definition of the best fit. This simplification means that our MSE fit puts much more weight on deviations for large functional values f_j. This is not what we want, so without control over the uncertainties most machine learning applications will apply a preprocessing to the data,

f_j → log f_j    or    f_j → f_j − ⟨f_j⟩    or    f_j → f_j / ⟨f_j⟩    ··· (1.42)
In cases where we expect something like a Gaussian distribution, a standard scaling would preprocess the data to a mean of zero and a standard deviation of one. In an ideal world, such preprocessing should not affect our results, but in reality it almost always does. The only way to avoid preprocessing is to add information like the scale of expected and allowed deviations to the likelihood loss in Eq.(1.40).
The second way of approximating a set of functional values is interpolation, which ensures fθ (xj ) = fj and is the method
of choice for datasets without noise. Between these training data points we choose a linear or polynomial form, the latter
defining a so-called spline approximation. It provides an interpolation which is differentiable a certain number of times by matching not only the functional values f_θ(x_j) = f_j, but also the n-th derivatives f_θ^(n)(x ↑ x_j) = f_θ^(n)(x ↓ x_j). In the
machine learning language we can say that the order of our spline defines an implicit bias for our interpolation, because it
defines a resolution in x-space where the interpolation works. A linear interpolation is not expected to do well for widely
spaced training points and rapidly changing functional values, while a spline-interpolation should require fewer training
points because the spline itself can express a non-trivial functional form.
The main difference between a fit and an interpolation is their respective behavior on an unknown dataset. For both a fit and an interpolation, we expect our fit model f_θ(x) to describe the training data. To measure the quality of a fit beyond the
training data we can compute the loss function L or the point-wise contributions to the loss Lj on an independent
test dataset. If a fit does not generalize from a training to a test dataset, it is usually because it has learned not only the
smooth underlying function, but also the statistical fluctuation of the training data. While a test dataset of the same size
will have statistical fluctuations of the same size, they will not be in the same place, which means the loss function
evaluated on the training data will be much smaller than the loss function evaluated on the test data. This failure mode is
called over-fitting or, more generally, overtraining. For interpolation this overtraining is a feature, because we want to
reproduce the training data perfectly. The generalization property is left to the choice of the interpolation function. As a side remark, manipulating the training dataset while performing one fit after the other is an efficient way to search for outliers in a dataset, or for a new particle in an otherwise smooth distribution of invariant masses at the LHC.
More systematically, we can define a set of errors which we make when targeting a problem by constructing a fit function
through minimizing a loss on a training dataset. First, an approximation error is introduced when we define a fit function,
which limits the expressiveness of the network in describing the true function we want to learn. Second, an estimation or
generalization error appears when we approximate the true training objective by a combination of loss function and
training data set. In practice, these errors are related. A justified limit to the expressiveness of a fit function, or implicit
bias, defines useful fits for a given task. In many physics applications we want our fit to be smooth at a given resolution.
When defining a good fit, increasing the class of functions the fit represents leads to a smaller approximation error, but
increases the estimation error. This is called the bias-variance trade off, and we can control it by limiting or regularizing
the expressiveness of the fit function and by ensuring that the loss of an independent test dataset does not increase while
training on the training dataset. Finally, any numerical optimization comes with a training error, representing the fact that
a fitted function might just live, for instance, in a sufficiently good local minimum of the loss landscape. This error is a
numerical problem, which we can solve though more efficient loss minimization. While we can introduce these errors for
fits, they will become more relevant for neural networks.
Next, we introduce a neural network as a numerically defined fit function with a huge number of model parameters θ,
As mentioned before, we minimize a loss function numerically to determine the neural network parameters θ. This
procedure is called network training and requires a training dataset (x, f)_j representing the target function f(x).
We will skip the usual inspiration from biological neurons and instead ask our first question, which is how to describe an
unknown function in terms of a large number of model parameters θ without making more assumptions than some kind of
smoothness on the relevant scales. For a simple regression task we write the mapping as
The key is to think of this problem in terms of building blocks which we can put together such that simple functions
require a small number of modules or building blocks, and model parameters, and complex functions require larger and
larger numbers of those building blocks. We start by defining so-called layers, which in a fully connected or dense
network transfer information from all D entries of the vector x defining one layer to all vector entries of the subsequent
layer,
Our network consists of N layers, including one input layer x → x^(1), one output layer, and N − 2 hidden layers. If a vector entry x_j^(n+1) collects information from all x_j^(n), we can try to write each step of this chain as

x^(n) = W^(n) x^(n−1) + b^(n) , (1.46)

where the D × D matrix W is referred to as the network weights and the D-dimensional vector b as the bias. In general, neighboring layers do not need to have the same dimension, which means W does not have to be a square matrix. In our simple regression case we already know that over the layers we need to reduce the width of the network from the input vector dimension D to the output scalar x^(N) = f_θ(x).
Splitting the vector x^(n) into its D entries defines the nodes which form our network layer

x_i^(n) = W_ij^(n) x_j^(n−1) + b_i^(n) . (1.47)

For a fully connected network a node takes the D components x_j^(n−1) and transforms them into a single output x_i^(n). For each node the D + 1 network parameters are D matrix entries W_ij and one bias b_i. If we want to compute the loss function for
a given data point (x_j, f_j), we follow the arrows in Eq.(1.45), feed each data point x_j through the input layer, go through the following layers one by one, compute the network output f_θ(x_j), and compare it to f_j through a loss function.
The transformation shown in Eq.(1.46) is an affine transformation. Just like linear transformations, affine transformations
form a group. This is equivalent to saying that combining affine layers still gives us an affine transformation, just encoded
in a slightly more complicated manner. This means our network defined by Eq.(1.46) can only describe linear functions,
albeit in high-dimensional spaces.
To describe non-linear functions we need to introduce some kind of non-linear structure in our neural network. The
simplest implementation of the required nonlinearity is to apply a so-called activation function to each node. Probably the
simplest 1-dimensional choice is the so-called rectified linear unit
ReLU(x_j) := max(0, x_j) = { 0 for x_j ≤ 0 ;  x_j for x_j > 0 } , (1.48)
Here we write the ReLU transformation of a vector as the vector of ReLU-transformed elements. This non-linear
transformation is the same for each node, so all our network parameters are still given by the affine transformations. But
now a sufficiently deep network can describe general, non-linear functions, and combining layers adds complexity, new
parameters, and expressivity to our network function fθ (x). There are many alternatives to ReLU as the source of
non-linearity in the network setup, and depending on our problem they might be helpful, for example by providing a finite
gradient over the x-range. However, throughout this lecture we use ReLU as the standard activation function.
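The forward pass of Eqs.(1.45)-(1.48) is easy to write down explicitly; in the sketch below the layer widths and the random initialization are illustrative choices.

```python
import numpy as np

# Forward pass of a small fully connected network with ReLU activations,
# mapping a D-dimensional input to a scalar output.
rng = np.random.default_rng(0)
widths = [4, 16, 16, 1]                      # D=4 input, two hidden layers, scalar output
layers = [(rng.normal(0, 1/np.sqrt(n_in), (n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]

def network(x, layers):
    for i, (W, b) in enumerate(layers):
        x = W @ x + b                        # affine step, Eq.(1.47)
        if i < len(layers) - 1:              # no activation on the output layer
            x = np.maximum(0.0, x)           # ReLU, Eq.(1.48)
    return x

print(network(np.array([0.2, -1.0, 0.5, 3.0]), layers))
```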
This brings us to the second question, namely, how to determine a correct or at least good set of network parameters θ to
describe a training dataset (x, f )j . From our fit discussion we know that one way to determine the network parameters is
by minimizing a loss function. For simplicity, we can think of the MSE loss defined in Eq.(1.41) and ignore the
normalization 1/(2σ 2 ). To minimize the loss we have to compute its derivative with respect to the network parameters. If
we ignore the bias for now, for a given weight in the last network layer we need to compute
dL/dW_1j^(N) = ∂( f − ReLU[W_1k^(N) x_k^(N−1)] )² / ∂ReLU[W_1k^(N) x_k^(N−1)] × ∂ReLU[W_1k^(N) x_k^(N−1)] / ∂[W_1k^(N) x_k^(N−1)] × ∂[W_1k^(N) x_k^(N−1)] / ∂W_1j^(N)
= −2 ( f − ReLU[W_1k^(N) x_k^(N−1)] ) × 1 × δ_jk x_k^(N−1)
≡ −2 √L x_j^(N−1) , (1.50)

provided W_1k^(N) x_k^(N−1) > 0, otherwise the partial derivative vanishes. This form implies that the derivative of the loss with respect to the weights in the N-th layer is a function of the loss itself and of the previous layer x^(N−1). If we ignore the ReLU derivative in Eq.(1.50) and still limit ourselves to the weight matrix in Eq.(1.49) we can follow the chain of layers and find

dL/dW_ij^(n) = ∂( f − ReLU[W_1k^(N) x_k^(N−1)] )² / ∂[W_1k^(N) x_k^(N−1)] × ∂[ (W^(N) ··· W^(n+1))_1k W_kℓ^(n) x_ℓ^(n−1) ] / ∂W_ij^(n)
= −2 √L ( W^(N) ··· W^(n+1) )_1i x_j^(n−1) . (1.51)
This means we compute the derivative of the loss with respect to the weights in the reverse direction of the network evaluation shown in Eq.(1.45). We have shown this only for the network weights, but it works the same way for the biases. This back-propagation is the crucial step in defining appropriate network parameters by numerically minimizing a loss function. The simple back-propagation might also give a hint as to why the chain-like network structure of Eq.(1.45) combined with the affine layers of Eq.(1.46) has turned out to be so successful as a high-dimensional representation of arbitrary functions.
The output of the back-propagation in the network training is the expectation value of the derivative of the loss function
with respect to a network parameter. We could evaluate this expectation value over the full training dataset. However,
especially for large datasets, it becomes impossible to compute this expectation value, so instead we evaluate the same
expectation value over a small, randomly chosen subset of the training data. This method is called stochastic gradient
descent, and the subsets of the training data are called minibatches or batches,

⟨ ∂L/∂θ_j ⟩_minibatch    with    θ_j ∈ {b, W} . (1.52)

Even though the training data is split into batches and the network training works on these batches, we still follow the progress of the training and the numerical value of the loss as a function of epochs, defined as the number of batch steps required for the network to evaluate the full training sample once.
After showing how to compute the loss function and its derivative with respect to the network parameters, the final
question is how we actually do the minimization. For a given network parameter θj , we first need to scan over possible
values widely, and then tune it precisely to its optimal value. In other words, we first scan the parameter landscape
globally, identify the global minimum or at least a local minimum close enough in loss value to the global minimum, and
then descend into this minimum. This is a standard task in physics, including particle physics, and compared to many
applications of Markov chain Monte Carlo methods the standard ML-minimization is not very complicated. We start with the naive iterative optimization in time steps

θ_j^(t+1) = θ_j^(t) − α ⟨ ∂L^(t)/∂θ_j ⟩ . (1.53)
The minus sign means that our optimization walks against the direction of the gradient, and α is the learning rate. From
our description above it is clear that the learning rate should not be constant, but should follow a decreasing schedule.
One of the problems with the high-dimensional loss optimization is that far away from the minimum the gradients are
small and not reliable. Nevertheless, we know that we need large steps to scan the global landscape. Once we approach a
minimum, the gradients will become larger, and we want to stay within the range of the minimum. An efficient adaptive
strategy is given by
θ_j^(t+1) = θ_j^(t) − α ⟨ ∂L^(t)/∂θ_j ⟩ / √( ⟨ ∂L^(t)/∂θ_j ⟩² + ε ) . (1.54)
Away from the minimum, this form allows us to enhance the step size even for small gradients by choosing a sizeable value α. However, whenever the gradient grows too fast, the step size remains below the cutoff α/ε. Finally, we can stabilize the walk through the loss landscape by mixing the loss gradient at the most recent step with gradients from the updates before,
⟨ ∂L^(t)/∂θ_j ⟩  →  β ⟨ ∂L^(t)/∂θ_j ⟩ + (1 − β) ⟨ ∂L^(t−1)/∂θ_j ⟩ . (1.55)
This strategy is called momentum, and now the complicated form of the denominator in Eq.(1.54) makes sense, and
serves as a smoothing of the denominator for rapidly varying gradients. A slightly more sophisticated version of this
adaptive scan of the loss landscape is encoded in the widely used Adam optimizer.
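The combination of Eqs.(1.53)-(1.55) can be illustrated on a one-dimensional toy loss; the learning rate, the momentum parameter, and the loss itself are illustrative choices in the spirit of Adam, not its exact definition.

```python
import numpy as np

# Adaptive gradient descent with momentum and a smoothed denominator.
loss_grad = lambda theta: 2 * (theta - 3.0)          # gradient of (theta - 3)^2

theta, m, v = 10.0, 0.0, 0.0
alpha, beta, eps = 0.1, 0.9, 1e-8
for t in range(200):
    g = loss_grad(theta)
    m = beta * m + (1 - beta) * g                    # momentum, Eq.(1.55)
    v = beta * v + (1 - beta) * g**2                 # smoothed squared gradient
    theta = theta - alpha * m / (np.sqrt(v) + eps)   # adaptive step, Eq.(1.54)
print(theta)                                         # approaches the minimum at 3
```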
Note that for any definition of the step size we still need to schedule the learning rate α. A standard choice for such a
learning rate scheduling is an exponential decay of α with the batch or epoch number. An interesting alternative is a
one-cycle learning rate where we first increase α batch after batch, with a dropping loss, until the loss rapidly explodes.
This point defines the size of the minimum structure in the loss landscape. Now we can choose the step size at the
minimum loss value to define a suitable constant learning rate for our problem, potentially leading to much faster training.
Finally, we need to mention that the minimization of the loss function for a neural network only ever uses first derivatives,
differently from the optimization of a fit function. The simple reason is that for the large number of network parameters θ
the simple scaling of the computational effort rules out computing second derivatives like we would normally do.
Going back to the three errors introduced in the last section, they can be translated directly to neural networks. The approximation error is less obvious than for the choice of fit function, but the expressiveness of a neural network is also limited through the network architecture and the set of hyperparameters. The training error becomes more relevant because we now minimize the loss over an extremely high-dimensional parameter space, where we cannot expect to find the global minimum and will always have to settle for a sufficiently good local minimum. To define a compromise between the approximation and generalization errors we usually divide an ML-related dataset into three parts. The main part is the training data, anything between 60% and 80% of the data. The above-mentioned test data is then 10% to 20% of the complete dataset, and we use it to check how the network generalizes to unseen data, test for overtraining, or measure the network performance. The validation data can be used to re-train the network and to optimize the architecture or the different settings of the network. The crucial aspect is that the test data is completely independent of the network training.
The training data consists of pairs (x, A)j . We define p(A) ≡ p(A|x) as the probability distribution for possible
amplitudes at a given phase space point x and omit the argument x from now on. The mean value for the amplitude at the
point x is
⟨A⟩ = ∫ dA A p(A)    with    p(A) = ∫ dθ p(A|θ) p(θ|x_train) . (1.57)
Here, we can think of p(A|θ) as a single model describing an amplitude through a set of network parameters, while
p(θ|xtrain ) weights this model by its level of agreement with the training data xtrain . We do not know the closed form of
p(θ|x_train), because it is encoded in the training data. Training the network means that we approximate it as a distribution, using a variational approximation for the integrand in the sense of a distribution and a test function,
p(A) = ∫ dθ p(A|θ) p(θ|x_train) ≈ ∫ dθ p(A|θ) q(θ|x) ≡ ∫ dθ p(A|θ) q(θ) . (1.58)
As for p(A) we omit the x-dependence of q(θ|x). This approximation leads us directly to the BNN loss function. We
define the variational approximation using the KL-divergence introduced in Eq.(1.18),
D_KL[q(θ), p(θ|x_train)] = ⟨ log ( q(θ) / p(θ|x_train) ) ⟩_q = ∫ dθ q(θ) log ( q(θ) / p(θ|x_train) ) . (1.59)
There are many ways to compare two distributions, defining a problem called optimal transport. We will come back to alternative ways of comparing probability densities over high-dimensional spaces in Sec. ??. Using Bayes’ theorem we can write the KL-divergence as

D_KL[q(θ), p(θ|x_train)] = ∫ dθ q(θ) log ( q(θ) p(x_train) / ( p(θ) p(x_train|θ) ) )
= D_KL[q(θ), p(θ)] − ∫ dθ q(θ) log p(x_train|θ) + log p(x_train) ∫ dθ q(θ) . (1.60)
The prior p(θ) describes the network parameters before training; since it does not really include prior physics or training
information we will still refer to it as a prior, but we think about it as a hyperparameter which can be chosen to optimize
performance and stability. From a practical perspective, a good prior will help the network converge more efficiently, but
any prior should give the correct results, and we always need to test the effect of different priors.
The evidence p(xtrain ) guarantees the correct normalization of p(θ|xtrain ) and is usually intractable. If we implement the
normalization condition for q(θ) by construction, we find
D_KL[q(θ), p(θ|x_train)] = D_KL[q(θ), p(θ)] − ∫ dθ q(θ) log p(x_train|θ) + log p(x_train) . (1.61)
The log-evidence in the last term does not depend on θ, which means that it will not be adjusted during training and we can ignore it when constructing the loss. However, it ensures that D_KL[q(θ), p(θ|x_train)] can reach its minimum at zero.
Alternatively, we can solve the equation for the evidence and find
log p(x_train) = D_KL[q(θ), p(θ|x_train)] − D_KL[q(θ), p(θ)] + ∫ dθ q(θ) log p(x_train|θ)
≥ ∫ dθ q(θ) log p(x_train|θ) − D_KL[q(θ), p(θ)] . (1.62)
This condition is called the evidence lower bound (ELBO), and the evidence reaches this lower bound exactly when our
training condition in Eq.(1.18) is minimal. Combining all of this, we turn Eq.(1.61) or, equivalently, the ELBO into the
loss function for a Bayesian network,
L_BNN = − ∫ dθ q(θ) log p(x_train|θ) + D_KL[q(θ), p(θ)] . (1.63)
The first term of the BNN loss is a likelihood sampled according to q(θ), the second enforces a (Gaussian) prior. This
Gaussian prior acts on the distribution of network weights. Using an ELBO loss means nothing but minimizing the
KL-divergence between the probability p(θ|xtrain ) and its network approximation q(θ) and neglecting all terms which do
not depend on θ. It results in two terms, a likelihood and a KL-divergence, which we will study in more detail next.
The Bayesian network output is constructed in a non-linear way with a large number of layers, so we can assume that Gaussian weight distributions do not limit us in terms of the uncertainty on the network output. The log-likelihood log p(x_train|θ) implicitly includes the sum over all training points.
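The structure of the ELBO loss in Eq.(1.63) can be sketched for a single Gaussian network weight: the likelihood term is sampled over q(θ) and the KL-divergence to a unit-Gaussian prior is known in closed form. The toy data, the prior choice, and the single-parameter model are illustrative assumptions; a real BNN does this for every weight of the network.

```python
import numpy as np

# Two terms of the ELBO loss of Eq.(1.63) for one Gaussian weight theta.
rng = np.random.default_rng(4)
x_train, sigma_noise = rng.normal(2.0, 0.5, size=50), 0.5   # toy data around theta = 2

def elbo_loss(mu_q, sigma_q, n_samples=100):
    theta = mu_q + sigma_q * rng.standard_normal(n_samples)  # sample q(theta)
    # sampled negative log-likelihood, -<log p(x_train|theta)>_q (up to constants)
    nll = np.mean([np.sum((x_train - t)**2) / (2 * sigma_noise**2) for t in theta])
    # closed-form KL between q = N(mu_q, sigma_q^2) and the prior p = N(0, 1)
    kl = np.log(1.0 / sigma_q) + (sigma_q**2 + mu_q**2 - 1.0) / 2.0
    return nll + kl

print(elbo_loss(2.0, 0.1), elbo_loss(0.0, 1.0))   # good vs poor approximation
```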
Before we discuss how we evaluate the Bayesian network in the next section, we want to understand more about the BNN
setup and loss. First, let us look at the deterministic limit of our Bayesian network loss. This means we want to look at the loss function of the BNN in the limit

q(θ) → δ(θ − θ_0) . (1.64)

The easiest way to look at this limit is to first assume a Gaussian form of the network parameter distributions, as given in Eq.(1.27),

q_{µ,σ}(θ) = 1/(√(2π) σ_q) e^{−(θ−µ_q)²/(2σ_q²)} , (1.65)

and correspondingly p_{µ,σ}(θ) with mean µ_p and width σ_p for the prior. The KL-divergence between two such Gaussians has the closed form

D_KL[q_{µ,σ}(θ), p_{µ,σ}(θ)] = log( σ_p/σ_q ) + ( σ_q² + (µ_q − µ_p)² )/(2σ_p²) − 1/2 . (1.66)
We can now evaluate this KL-divergence in the limit of σ_q → 0 and finite µ_q(θ) → θ_0 as the one remaining θ-dependent parameter,

D_KL[q_{µ,σ}(θ), p_{µ,σ}(θ)] → (θ_0 − µ_p)²/(2σ_p²) + const . (1.67)
Figure 4: Evaluation of a Bayesian network as an ensemble of networks: sampling the network parameters ω ∼ q(ω) gives per-sample outputs A(ω_i) and σ_stoch(ω_i), which are combined into ⟨A⟩ = (1/N) ∑_i A(ω_i), σ²_stoch = (1/N) ∑_i σ²_stoch(ω_i), and σ²_pred = (1/N) ∑_i (⟨A⟩ − A(ω_i))².
Inserted into Eq.(1.63), the deterministic limit of the BNN loss becomes

L_BNN = − log p(x_train|θ_0) + (θ_0 − µ_p)²/(2σ_p²) . (1.68)
The first term is again the likelihood defining the correct network parameters, the second ensures that the network parameters do not become too large. Because it includes the squares of the network parameters, it is referred to as an L2-regularization. Going back to Eq.(1.63), an ELBO loss is a combination of a likelihood loss and a regularization. While for the Bayesian network the prefactor of this regularization term is fixed, we can generalize this idea and apply an L2-regularization to any network with an appropriately chosen prefactor.
Sampling the likelihood following a distribution of the network parameters, as it happens in the first term of the Bayesian
loss in Eq.(1.63), is something we can also generalize to deterministic networks. Let us start with a toy model where we
sample over network parameters by either including them in the loss computation or not. Such a random sampling
between two discrete possible outcomes is described by a Bernoulli distribution. If the two possible outcomes are zero
and one, we can write the distribution in terms of the expectation value ρ ∈ [0, 1],
p_Bernoulli(x) = { ρ^x (1 − ρ)^{1−x} for x = 0, 1 ;  0 else } . (1.69)
When we include an event, the network weight is set to θ0 , otherwise it is set to zero, or θ = xθ0 . We can include it as a
test function for our integral over the log-likelihood log p(xtrain |θ) and find
L_BNN = − ∫ dx ρ^x (1 − ρ)^{1−x} log p(x_train|x θ_0)
2 Regression
2.1 Amplitude regression
After introducing BNNs using the notation of transition amplitude learning, we still have to extract the mean and the
uncertainty for the amplitude A over phase space. In the evaluation step we exchange the two integrals in Eq.(1.57) and
use the variational approximation to write the mean prediction A for a given phase space point as
⟨A⟩ = ∫ dA dθ A p(A|θ) p(θ|x_train)
    = ∫ dA dθ A p(A|θ) q(θ)
    ≡ ∫ dθ q(θ) A(θ)    with the θ-dependent mean    A(θ) = ∫ dA A p(A|θ) . (2.1)
We can interpret this formula as a sampling over network parameters, provided we assume uncorrelated variations of the
individual network parameters. Corresponding to the definition of the θ-dependent mean A, the variance of A is
σ²_tot = ∫ dA dθ (A − ⟨A⟩)² p(A|θ) q(θ)
       = ∫ dA dθ ( A² − 2A⟨A⟩ + ⟨A⟩² ) p(A|θ) q(θ)
       = ∫ dθ q(θ) [ ∫ dA A² p(A|θ) − 2⟨A⟩ ∫ dA A p(A|θ) + ⟨A⟩² ∫ dA p(A|θ) ] (2.2)
For the three integrals we can generalize the notation for the θ-dependent mean as in Eq.(2.1) and write
σ²_tot = ∫ dθ q(θ) [ A²(θ) − 2⟨A⟩ A(θ) + ⟨A⟩² ]
       = ∫ dθ q(θ) [ A²(θ) − A(θ)² + A(θ)² − 2⟨A⟩ A(θ) + ⟨A⟩² ]
       = ∫ dθ q(θ) [ A²(θ) − A(θ)² + ( A(θ) − ⟨A⟩ )² ]
       ≡ σ²_stoch + σ²_pred . (2.3)
For this transformation we keep in mind that hAi is already integrated over θ and A and can be pulled out of the integrals.
This expression defines two contributions to the variance or uncertainty.
First, σpred is defined in terms of the θ-integrated expectation value hAi
σ²_pred = ∫ dθ q(θ) [ A(θ) − ⟨A⟩ ]² . (2.4)
Following from the definition in Eq.(2.1), it vanishes in the limit of perfectly narrow network weights, q(θ) → δ(θ − θ0 ).
This limit requires perfect training, so we expect σpred to decrease with more and better training data. In that sense it
represents a statistical uncertainty.
Second, σstoch is defined without sampling the network parameters,
σ²_stoch ≡ ⟨ σ_stoch(θ)² ⟩ = ∫ dθ q(θ) σ_stoch(θ)²
         = ∫ dθ q(θ) [ A²(θ) − A(θ)² ] . (2.5)
It does not vanish for q(θ) → δ(θ − θ0 ), but it vanishes if the amplitude is arbitrarily well known and well described,
characterized by p(A|θ) → δ(A − A0 ). While this uncertainty may receive contributions from too little training data, it
only approaches a plateau for perfect training. This plateau value can reflect a stochastic training sample, limited
expressivity of the network, not-so-smart choices of hyperparameters etc, in the sense of a systematic uncertainty. To
avoid misunderstandings we can refer to it as a stochastic or as a model-related uncertainty,

σ²_stoch ≡ σ²_model ≡ ⟨ σ_model(θ)² ⟩ . (2.6)
To understand these two uncertainty measures better, we can look at the θ-dependent network output and read Eq.(2.1)
and (2.5) as sampling A(θ) and σmodel (θ)2 over a network parameter distribution q(θ) for each phase space point x
BNN :    x, θ → ( A(θ), σ_model(θ) ) . (2.7)
If we follow Eq.(1.63) and assume q(θ) to be Gaussian, we now have a network with twice as many parameters as a
standard network to describe two outputs. For a given phase space point x we can then compute the three global network
predictions hAi, σmodel , and eventually σpred . Unlike the distribution of the individual network weights q(θ), the amplitude
output is not Gaussian.
The evaluation of a BNN is illustrated in Fig. 4. From this graphic we see that a BNN works very much like an ensemble of networks trained on the same data and producing a spread of outputs. The first advantage over an ensemble is that the BNN is only twice as expensive as a regular network, and less if we assume that the likelihood loss leads to an especially efficient training. The second advantage of the BNN is that it learns a function and its uncertainty together, which will give us some insight into how advanced networks learn such densities in Sec. ??. The disadvantage of Bayesian networks compared to ensembles is that the Bayesian network only covers local structures in the loss landscape.
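The sampling step of Fig. 4 and Eqs.(2.1)-(2.5) is simple to write down; in the sketch below the per-sample outputs are generated randomly instead of by an actual network, and all numbers are illustrative.

```python
import numpy as np

# BNN evaluation: sample N parameter sets, collect A(omega_i) and
# sigma_stoch(omega_i), and combine them into the mean and the two uncertainties.
rng = np.random.default_rng(6)
N = 1000
A_samples = rng.normal(1.00, 0.05, N)          # A(omega_i); their spread gives sigma_pred
sigma_samples = rng.normal(0.10, 0.01, N)      # sigma_stoch(omega_i) per sample

A_mean = A_samples.mean()                                  # <A>, Eq.(2.1)
sigma_pred2 = np.mean((A_samples - A_mean)**2)             # Eq.(2.4)
sigma_stoch2 = np.mean(sigma_samples**2)                   # Eq.(2.5)
sigma_tot = np.sqrt(sigma_pred2 + sigma_stoch2)            # Eq.(2.3)
print(A_mean, np.sqrt(sigma_pred2), np.sqrt(sigma_stoch2), sigma_tot)
```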
While we usually do not assume a Gaussian uncertainty on the Bayesian network output, this might be a good
approximation for σmodel (θ). Using this approximation we can write the likelihood p(xtrain |θ) in Eq.(1.63) as a Gaussian
and use the closed form for the KL-divergence in Eq.(1.66), so the BNN loss function turns into

L_BNN = ∫ dθ q_{µ,σ}(θ) ∑_{points j} [ ( A_j(θ) − A_j^truth )² / ( 2σ_model,j(θ)² ) + log σ_model,j(θ) ] + D_KL[q_{µ,σ}(θ), p(θ)] . (2.8)
As always, the amplitudes and σ_model are functions of phase space x. The loss is minimized with respect to the means and standard deviations of the network weights describing q(θ). The interesting aspect of this loss function is that, while the training data just consists of amplitudes at phase space points and does not include an uncertainty estimate, the network constructs a point-wise uncertainty from the variation of the network parameters. This means we can rely on a well-defined likelihood loss rather than some kind of MSE loss even for our regression network training, which is fudging genius!
For some applications we might want to use this advantage of the BNN loss in Eq.(2.8), but without sampling the network
parameters θ. In complete analogy to a fit we can then use a deterministic network like in Eq.(1.66), described by
q(θ) = δ(θ − θ_0) inserted into the Gaussian BNN loss in Eq.(2.8) to define

L_heteroskedastic = ∑_{points j} [ ( A_j(θ_0) − A_j^truth )² / ( 2σ_model,j(θ_0)² ) + log σ_model,j(θ_0) ] + (θ_0 − µ_p)²/(2σ_p²) . (2.9)
The interplay between the first two terms works in a way that the first term can be minimized either by reproducing the
data and minimizing the numerator, or by maximizing the denominator. The second term penalizes the second strategy,
defining a correlated limit of A and σmodel over phase space. Compared to the full Bayesian network this simplified
approach has two disadvantages: first, we implicitly assume that the uncertainty on the amplitudes is Gaussian. Second,
σmodel only captures noisy amplitude values in the training data. Extracting a statistical uncertainty, as encoded in σpred ,
requires us to sample over the weight space. Obviously, this simplification cannot be interpreted as an efficient network
ensembling.
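The interplay of the first two terms of Eq.(2.9) is easy to see in a one-line implementation; the sketch below omits the weight-regularization term, and the placeholder arrays stand in for network outputs and true amplitudes.

```python
import numpy as np

# Heteroskedastic loss of Eq.(2.9) without the parameter-regularization term:
# small sigma_model is only rewarded where the prediction matches the truth.
def heteroskedastic_loss(A_pred, A_truth, sigma_model):
    return np.sum((A_pred - A_truth)**2 / (2 * sigma_model**2) + np.log(sigma_model))

A_truth = np.array([1.0, 2.0, 3.0])
A_pred = np.array([1.1, 1.9, 3.3])
print(heteroskedastic_loss(A_pred, A_truth, sigma_model=np.array([0.1, 0.1, 0.3])))
```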
For a specific task, let us look at LHC amplitudes for the production of two photons and a jet [?],
gg → γγg(g) (2.10)
The corresponding transition amplitude A is a real function over phase space, with the detector-inspired kinematic cuts
The jet 4-momenta are identified with the gluon 4-momenta through a jet algorithm. The standard computer program for
this calculation is called NJet, and the same amplitudes can also be computed with the event generator Sherpa at one loop.
This amplitude has to be calculated at one loop for every phase space point, so we want to train a network once to
reproduce its output much faster. Because these amplitude calculations are a key ingredient to the LHC simulation chain,
the amplitude network needs to reproduce the correct amplitude distributions including all relevant features and with a
reliable uncertainty estimate.
Figure 5: Performance of the BNN, loss-boosted BNN, and process-boosted BNN, in terms of the precision of the generated
amplitudes, defined in Eq.(2.12) and evaluated on the training (upper) and test datasets. Figures from Ref. [?].
For one gluon in the final state, the most general phase space has 5 × 4 = 20 dimensions. We could simplify this phase
space by requiring momentum conservation and on-shell particles in the initial and final state, but for this test we leave
these effects for the network to be learned. Limiting ourselves to one jet, this means we train a regression network to learn
a real number over a 20-dimensional phase space. To estimate the accuracy of the network we can compute the relative deviation between the network amplitudes and the training or test data,

∆_j^(train/test) = ( ⟨A⟩_j − A_j^(train/test) ) / A_j^(train/test) . (2.12)
In the upper left panel of Fig. 5 we show the performance of a well-trained Bayesian network in reproducing the training
dataset. While the majority of phase space points are described very precisely, a problem occurs for the phase space
points with the largest amplitudes. The reason for this shortcoming is that there exist small phase space regions where the
transition amplitude increases rapidly, by several orders of magnitude. The network fails to learn this behavior in spite of a
log-scaling following Eq.(1.42), because the training data in these regions is sparse. For an LHC simulation this kind of
bias is a serious limitation, because exactly these phase space regions need to be controlled for a reliable rate prediction.
This defines our goal of an improved amplitude network training, namely to identify and control outliers in the
∆-distribution of Eq.(2.12).
The likelihood loss in Eq.(2.8) with its σmodel (θ) has a great advantage, because we can test the required Gaussian shape
of the corresponding pull variable for the truth and NN-amplitudes,
( A_j(θ) − A_j^truth ) / σ_model,j(θ) , (2.13)
as part of the training. For the initial BNN run it turns out that this distribution indeed looks like a Gaussian in the peak
region, but with too large and not exponentially suppressed tails. We can improve this behavior using an inspiration from
the boosting of a decision tree. We enhance the impact of critical phase space points through an increased event weight,
This boosted or feedback training will improve the network performance both in the accuracy of the amplitude prediction and in the learned uncertainty on the network amplitudes. If we limit ourselves to self-consistency arguments, we can select the amplitudes with n_j > 1 through large pull values, as defined in Eq.(2.13). It turns out that this loss-based boosting significantly improves the uncertainty estimate. However, in the upper right panel of Fig. 5 we see that the effect on the large amplitudes is modest, large amplitudes still lead to too many outliers in the network accuracy, and we also tend to systematically underestimate those large amplitudes.
To further improve the performance of our network we can target the problematic phase space points directly, by
increasing nj based on the size of the amplitudes. This process-specific boosting goes beyond the self-consistency of the
network and directly improves a specific task beyond just learning the distribution of amplitudes over phase space. In the
lower panels of Fig. 5 we see that at least for the training data the ∆-distribution looks the same for small and large amplitudes. Going back to Sec. 1.3, we can interpret the boosted loss for large values of n_j as a step towards interpolating the corresponding amplitudes. While for small amplitudes the network still corresponds to a fit, we are forcing the
network to reproduce the amplitudes at some phase space points very precisely. Obviously, this boosting will lead to
issues of overtraining. We can see this in the lower right panel of Fig. 5, where the improvement through process boosting
for the test dataset does not match the improvement for the training dataset. However, as long as the performance of the
test dataset improves, even if it is less than for the training dataset, we improve the network in spite of the overtraining.
The issue with this overtraining is that it becomes harder to control the uncertainties, which might require some kind of
alternating application of loss-based and process boosting.
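One way loss-based boosting could be implemented mechanically is sketched below: events whose pull, Eq.(2.13), lies far in the tail get an increased training weight n_j for the next round. The threshold and the weight increment are arbitrary illustrative choices, not the procedure of Ref. [?].

```python
import numpy as np

# Pull-based reweighting: boost badly described training points.
def update_weights(A_pred, A_truth, sigma_model, n_j, threshold=3.0):
    pull = np.abs(A_pred - A_truth) / sigma_model     # Eq.(2.13)
    return np.where(pull > threshold, n_j + 1, n_j)

n_j = np.ones(5)
A_truth = np.array([1.0, 2.0, 3.0, 4.0, 50.0])
A_pred = np.array([1.1, 2.0, 2.9, 4.2, 30.0])
sigma = np.array([0.2, 0.2, 0.2, 0.2, 1.0])
print(update_weights(A_pred, A_truth, sigma, n_j))    # only the last event is boosted
```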
This example illustrates three aspects of advanced regression networks. First, we have seen that a Bayesian network
allows us to construct a likelihood loss even if the training data does not include an uncertainty estimate. Second, we can
improve the consistency of the network training by boosting events selected from the pull distribution. Third, we can
further improve the network through process-specific boosting, where we force the network from a fit to an interpolation
mode based on a specific selection of the input data. For the latter the benefits do not fully generalize from the training to
the test dataset, but the network performance does improve on the test dataset, which is all we really require.
The last application of a regression network is the numerical calculation of a D-dimensional phase space integral
I(s) = ∫₀¹ dx_1 ··· ∫₀¹ dx_D f(s; x) , (2.15)

where the x_i are the integration variables and s is a vector of additional parameters, not integrated over. Because the values of the integrand can span a wide numerical range it is useful to normalize the integrand, for example by its value at the center of the x-hypercube,

f(s; x) → f(s; x) / f(s; 1/2, 1/2, ..., 1/2)    ⇔    I(s) → I(s) / f(s; 1/2, 1/2, ..., 1/2) . (2.16)
Without going into details, it is also useful to transform the integrand into a form which vanishes at the integration
boundaries. Analytically, we would compute the primitive F ,
d^D F(s; x) / ( dx_1 ··· dx_D ) = f(s; x) , (2.17)
In particle physics we really never know the primitive of a phase space integrand, but we can try to construct it and
encode it in a neural network,
On the other hand, we do not have data to train a surrogate network for F directly. The idea is to instead train an
integrand surrogate, such that its D-th derivative matches f ,
L_MSE [ f(s; x), d^D F_θ(s; x) / ( dx_1 ··· dx_D ) ] . (2.20)

If the training on the integrand fixes the network weights such that the integrand as well as F are determined by the same network, and F_θ fulfills Eq.(2.19), we have directly learned the integral.
To construct a surrogate which can be differentiated multiple times with respect to some of its inputs, we need a
differentiable activation function, for example the sigmoid function,
Sigmoid(x) = 1/(1 + e^{−x})    with primitive    ∫ dx Sigmoid(x) = log(e^x + 1)    and n-fold primitive    −Li_n(−e^x) . (2.21)
There exists a fast sum representation of the dilogarithm Li_2 for numerical evaluation. The same set of derivatives can be computed for the tanh activation function.
Next, we need to compute the derivative of this fully connected neural network. Following the conventions of Eq.(1.45),
the input layer for the two input vectors is
x_i^(0) = { x_i for i ≤ D ;  s_{i−D} for i > D } . (2.22)
For the hidden layers we just replace the ReLU activation function in Eq.(1.49) with the sigmoid,
x_i^(n) = Sigmoid[ W_ij^(n) x_j^(n−1) + b_i^(n) ] . (2.23)
The scalar output of the network with N layers can be differentiated, for instance, with respect to x1 ,
f_θ ≡ x^(N) = W_j^(N) x_j^(N−1) + b^(N)
⇒ df_θ/dx_1 = W_j^(N) dx_j^(N−1)/dx_1
= ∑_j W_j^(N) Sigmoid′[ W_jk^(N−1) x_k^(N−2) + b_j^(N−1) ] [ W_jℓ^(N−1) dx_ℓ^(N−2)/dx_1 ] , (2.24)
where we write the sum over j explicitly, while for the other indices we use the usual summing convention. Next, we differentiate this expression with respect to x_2, altogether D times, to compute the MSE loss in Eq.(2.20). The loss can be minimized with respect to the network parameters θ using the usual backpropagation. Because the integrand is known exactly, there is no need to regularize the network, but it would also not hurt. Also, the numerical generation of integrand values is cheap, which means F_θ can be trained using very large numbers of training data points.
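The derivative-matching training of Eq.(2.20) can be sketched with automatic differentiation instead of the explicit derivative chain of Eq.(2.24). The two-dimensional toy integrand, the architecture, and all hyperparameters below are invented for illustration.

```python
import torch

# Train a surrogate F_theta so that its mixed derivative matches the integrand (D = 2).
torch.manual_seed(0)
f = lambda x: torch.exp(-x[:, 0] * x[:, 1])              # toy integrand on [0,1]^2

F_theta = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Sigmoid(),
    torch.nn.Linear(32, 32), torch.nn.Sigmoid(),
    torch.nn.Linear(32, 1))
opt = torch.optim.Adam(F_theta.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.rand(256, 2, requires_grad=True)
    F = F_theta(x).sum()
    dF_dx1 = torch.autograd.grad(F, x, create_graph=True)[0][:, 0]
    d2F = torch.autograd.grad(dF_dx1.sum(), x, create_graph=True)[0][:, 1]
    loss = torch.mean((d2F - f(x))**2)                   # MSE of Eq.(2.20)
    opt.zero_grad(); loss.backward(); opt.step()

# the integral follows from F_theta at the corners of the unit square
corners = torch.tensor([[1., 1.], [1., 0.], [0., 1.], [0., 0.]])
signs = torch.tensor([1., -1., -1., 1.])
print((signs * F_theta(corners).squeeze()).sum().item())  # exact value is about 0.797
```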
Figure 6: Number of digits accuracy for two integrals, using two different activation functions, one of them defined in
Eq.(2.25). Figure from Ref. [?].
In the original paper, the method is showcased for two integrals, one of them is
I_1L(s_12, s_14, m_H², m_t²) = ∫₀¹ dx_1 ∫₀¹ dx_2 ∫₀¹ dx_3 1/F_1L²
with F_1L = m_t² + 2x_3 m_t² + x_3² m_t² + 2x_2 m_t² − x_2 s_14 + 2x_2 x_3 m_t² − x_2 x_3 m_H² + x_2² m_t²
+ 2x_1 m_t² + 2x_1 x_3 m_t² − x_1 x_3 s_12 + 2x_1 x_2 m_t² − x_1 x_2 m_H² + x_1² m_t² . (2.25)

The accuracy of the network estimate I_NN is quantified as

p = log_10 | ( I_NN − I_truth ) / I_truth | , (2.26)
giving the effective number of digits the estimate gets right. The results for this accuracy are shown in Fig. 6. They are based on training an ensemble of eight replicas of the same network, using their average as the central prediction of the integral and their standard deviation as an uncertainty estimate. The entries in the histogram have different initialisations and are trained on different training data. The results for the two different activation functions are similar. The two-loop integral,
which we skip in this summary, has a lower accuracy than the one-loop integral, which is to be expected given the larger
number of integrations.