SSRN 5162304
– Course Material –
Authors:
Mario V. Wüthrich (ETH Zurich)
Ronald Richman (InsureAI)
Benjamin Avanzi (The University of Melbourne)
Mathias Lindholm (Stockholm University)
Michael Mayer (la Mobilière)
Jürg Schelldorfer (Swiss Re)
Salvatore Scognamiglio (University of Naples Parthenope)
Terms of Use
These lecture notes are an ongoing project which is continuously revised, updated and
extended. We highly appreciate any comments that readers may have to improve these
notes. The use of these lecture notes is subject to the following rules:
• These notes are provided to reusers to distribute, remix, adapt, and build upon the
material in any medium or format for noncommercial purposes only, and only so
long as attribution and credit are given to the original authors and source, and you
indicate whether changes were made. This aligns with the Creative Commons
Attribution-NonCommercial 4.0 International License (CC BY-NC).
• The authors may update the manuscript or withdraw it at any time. There is
no right of availability of any (old) version of these notes. The authors may also
change these terms of use at any time.
• The authors disclaim all warranties, including but not limited to warranties regarding
the use of the contents of these notes and the related notebooks and statistical code.
By using these notes, notebooks and statistical code, you fully agree to this.
Contents

2 Regression models
2.1 Exponential dispersion family
2.1.1 Introduction of the exponential dispersion family
2.1.2 Cumulant function, mean and variance function
2.1.3 Maximum likelihood estimation and deviance loss
2.2 Regression models
2.3 Covariate pre-processing
2.3.1 Notation
2.3.2 Categorical covariates
2.3.3 Continuous covariates
2.4 Regularization and sparsity
2.4.1 Introduction and overview
2.4.2 Regularization
2.4.3 Ridge regularization
2.4.4 LASSO regularization
2.4.5 Best-subset selection regularization
2.4.6 Elastic net regularization
4 Interlude
4.1 Unbiasedness and calibration
4.1.1 Statistical biases
4.1.2 The balance property
4.1.3 Auto-calibration
4.2 General purpose non-parametric regressions
4.2.1 Local regression
4.2.2 Isotonic regression
4.3 Model analysis tools
4.3.1 Gini score
4.3.2 Murphy's score decomposition
5.4.1 Aggregating
5.4.2 Network ensembling
5.5 Summary on feed-forward neural networks
5.6 Combining a GLM and a neural network
5.7 LocalGLMnet
5.8 Outlook: Kolmogorov–Arnold networks
12 Outlook
1.1 Introduction
The actuarial profession is rapidly evolving, with machine learning (ML) and artificial in-
telligence (AI) tools becoming increasingly incorporated into practical actuarial method-
ology. This book explains the evolution of AI tools from more familiar regression models
all the way up to generative AI systems, to equip actuaries with technical knowledge
about these tools and to enable them to apply these tools within their work.
Why do we believe that these AI tools represent such an important advance that they
are worthy of a book length analysis? As we explain next, the methodology underlying
modern AI tools is surprisingly similar to the methodology of actuarial science, and the
advances in AI tools therefore can be applied easily and with great effect within the work
that actuaries do.
Actuaries, through their education within and study of actuarial science, are equipped
with tools from many disciplines to solve the challenges they encounter in their work,
which often focuses on managing risk within financial institutions. These tools are,
on the one hand, drawn from a variety of other disciplines such as statistics, finance,
demography and economics, and, on the other hand, they also include specialized techniques
developed within the field of actuarial science. Examples of the former are the valuation
of options and guarantees using risk-neutral techniques and the projection of mortality using
the Lee–Carter model [132]; an example of the latter is the chain-ladder technique, used to
predict outstanding claims liabilities. Combining these tools with expert knowledge of
the particular industries in which they work allows actuaries to build models that provide
insight into and analysis of a variety of important problems within these industries.
Actuarial modeling often takes the approach of approximating and predicting empirically
observed phenomena without spending too much time building full theories explaining
these observations in detail; in this sense, actuarial science is different from economics
or physics which use empirical phenomena to build theories. Of course, actuarial sci-
ence is grounded in the study of rigorous mathematics, probability theory and statistics
and, moreover, has developed deep theoretical frameworks for topics such as credibility
theory. Nonetheless, the practical application of actuarial modeling is to make predic-
tions and not to develop theories explaining the observations. Take, for example, an
actuary who is tasked with estimating the required reserve for the outstanding liabilities
This describes the probability that the random variable Y takes a value less than or equal
to y ∈ R. We write Y ∼ F for a random variable Y having distribution F . In most
practical applications, these distributions are unknown, and the general aim in statistics
and data science is to infer the correct distribution from observed realizations of the
random variables.
There are two main types of distributions: discrete distributions and absolutely
continuous ones. Discrete distributions are step functions having countably many steps
(y_j)_{j≥1} ⊂ R, allowing for positive probability masses (p_j)_{j≥1} in these steps. That is,
\[
p_j = \mathbb{P}[Y = y_j] > 0 \qquad \text{with} \qquad \sum_{j \ge 1} p_j = 1.
\]
Figure 1.1 (lhs) shows a discrete distribution with finitely many steps in (y_j)_{j=1}^J, J = 8.
Typical examples of discrete distributions are count random variables Y taking values
in the integers (y_j)_{j≥1} ⊆ N_0. In insurance, count random variables are used to model the
numbers of claims; examples are the binomial distribution, the Poisson distribution and
the negative binomial distribution. We discuss these distributions below.
Figure 1.1: (lhs) Discrete distribution with finitely many steps in (y_j)_{j=1}^J, J = 8; (rhs)
density of an absolutely continuous distribution.
Absolutely continuous distributions F have a density f ≥ 0 such that F(y) = ∫_{-∞}^{y} f(z) dz
for all y ∈ R. Typical examples in actuarial science are the gamma or the log-normal
distributions, which are used to model positive claim sizes. Figure 1.1 (rhs) shows a
(gamma) density y ↦ f(y) with the area of the blue region being equal to one.
The expected value of Y ∼ F is defined by
\[
\mathbb{E}[Y] := \int_{\mathbb{R}} y \, dF(y), \qquad (1.1)
\]
subject to existence. In the discrete distribution case, this expected value is equal to
\[
\mathbb{E}[Y] = \sum_{j \ge 1} y_j\, p_j.
\]
Remark 1.1. The expected value (1.1) is generally not the most likely outcome, e.g., if
we have skewed densities, as in Figure 1.1 (rhs), the expected value, the mode and the
median will generally differ. The expected value corresponds to the average outcome of
the uncertain (claim) event. In insurance, one typically assumes that one has a large
insurance portfolio of independent and identically distributed (i.i.d.) claims (Yi )i≥1 . The
selection of the mean E[Y_1] is then justified by the fact that the law of large numbers
implies that the average claim n^{-1} \sum_{i=1}^{n} Y_i converges to the expected value E[Y_1], a.s.,
as the sample size n → ∞. One can also view the mean E[Y_1] as the value
that minimizes the expected quadratic difference between our forecast and the actual
outcome of the random event; thus, on average, it provides the most accurate prediction
w.r.t. the expected quadratic difference. This and similar statements are
discussed further in Section 1.4, below; in particular, we refer to Remarks 1.4. ■
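The following small sketch illustrates these two statements numerically; it assumes Python with NumPy installed, and the gamma claim size distribution is chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. right-skewed claims, as in Figure 1.1 (rhs); mean E[Y_1] = 1000
claims = rng.gamma(shape=2.0, scale=500.0, size=100_000)

# law of large numbers: the sample average approaches the expected value
print("sample average:", claims.mean())

# the mean also minimizes the average squared forecast error: scan constant
# forecasts m and locate the minimizer of the empirical quadratic loss
grid = np.linspace(500.0, 1500.0, 1001)
mse = [np.mean((claims - m) ** 2) for m in grid]
print("forecast minimizing the squared error:", grid[int(np.argmin(mse))])
```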
Y |X ∼ F (·|X),
for F (·|X) being a distribution depending on (varies with) the covariates X, e.g., X may
describe a translation of the expected value relative to a base case.
Having covariate information X about the insurance claim Y specifies the following
regression consideration. Denote by X ⊆ Rq the support of the covariates X; the set X
is called covariate space or feature space. The general aim in regression modeling is to
find the (true) regression function µ∗ : X → R that describes the conditional expectation
of the response variable Y having covariate X, that is,
µ∗ (X) := E [ Y | X] . (1.2)
The main problem in most applications is that the true regression function µ∗ in (1.2) is
unknown, and we have only observed a finite sample L = (Yi , X i )ni=1 from that model.
Goal: Infer the (true) regression function µ∗ from this finite sample L.
One way to estimate the unknown (true) regression function µ∗ is to assume that it takes
a specific functional form. Assuming a GLM with log-link means that we consider
regression functions µ_ϑ : X → R of the type
\[
X \mapsto \log\big(\mu_\vartheta(X)\big) = \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_j, \qquad (1.3)
\]
for regression parameters ϑ ∈ R^{q+1}.^1 Inverting the log-link gives the equivalent formulation
\[
X \mapsto \mu_\vartheta(X) = \exp\Big( \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_j \Big). \qquad (1.4)
\]
These regression functions provide the candidate model class M = {µ_ϑ}_ϑ.
To find the best candidate µϑ ∈ M for µ∗ , one typically chooses an objective function
to select the best parameter (candidate) ϑ ∈ Rq+1 for the given observed sample L. For
this, select a loss function
L : R × R → R, (y, m) 7→ L(y, m); (1.5)
in this loss function, y plays the role of the outcome of Y , and m plays the role of the
mean of Y . The loss function L is then used to assess the difference between y and m;
we are going to discuss the required properties of this loss function in Section 1.4, below,
to make this a sensible model selection tool. This motivates solving the following
optimization problem
\[
\widehat{\vartheta} = \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min} \; \sum_{i=1}^{n} L\big(Y_i, \mu_\vartheta(X_i)\big), \qquad (1.6)
\]
provided there exists a unique solution. Thus, we try to minimize the loss between Yi
and (its prediction) µϑ (X i ) for a suitable loss function L.
The solution µϑb from (1.6) is the best candidate in M w.r.t. the selected loss
function L and for given observations L generated by µ∗ . In this sense, we do not
compare the candidates µϑ to the unknown µ∗ , but rather to the data (Yi , X i )
generated by µ∗ .
Example 1.2 (Gaussian log-link GLM). A common example for the loss function is the
square loss function L(y, m) = (y − m)^2. The previous optimization problem (1.6) then
reads as
\[
\widehat{\vartheta} = \underset{\vartheta\in\mathbb{R}^{q+1}}{\arg\min} \sum_{i=1}^{n} \big( Y_i - \mu_\vartheta(X_i) \big)^2
= \underset{\vartheta\in\mathbb{R}^{q+1}}{\arg\min} \sum_{i=1}^{n} \Big( Y_i - \exp\Big( \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_{i,j} \Big) \Big)^2, \qquad (1.7)
\]
and we obtain the estimated regression function µϑb to approximate µ∗ . We call the
square loss function approach (1.7) the Gaussian log-link GLM, and the reason for this
name will become clear in (1.11), below. ■
^1 We generally do not use boldface notation for ϑ, even if ϑ is a real-valued vector. The reason for
this is that we would like to understand ϑ generically as the model parameter, which can be a real-valued
vector, but which could also be a different object that parametrizes a model.
• First, clearly, the true model µ∗ should be ‘close’ to the selected model class
M, otherwise it is impossible to identify a model in M that is similar to µ∗.
E.g., the GLM has a linear structure (1.3) which may not be suitable in some
problems, and this requires selecting other regression function classes.
(1) In many actuarial problems, one does not have unlimited data resources. E.g.,
for a certain insurance product we may only have very few claims.
(2) In classification problems, one estimates probabilities in the unit interval (0, 1),
which is a nice and bounded space. Insurance claims can be heavy-tailed, and
very few large claims can heavily distort the fitting procedure on finite samples.
Therefore, the selection of the loss function needs special care.
For these reasons, one has to carefully analyze the choices of the potential models, the
loss function and the fitting algorithm; otherwise, one may end up with an inappropriate
predictive model. The purpose of this section is to introduce the theory behind model
fitting and model evaluation in a broader sense, and we are going to be more specific in
Chapter 2 by relating these statistical concepts to actuarial problems.
for all µ ∈ M, and where we assume that the left-hand side of (1.9) takes a finite value.
The loss function is strictly consistent for mean estimation if equality holds in (1.9)
if and only if µ(X) = µ∗(X), a.s.
Definition 1.3 tells us that we should only select strictly consistent loss functions for mean
estimation for regression model fitting, otherwise the expected loss minimization (1.8) is
not a sensible model fitting strategy as we may not discover the true model µ∗ by this
minimization (assuming it belongs to M).
Mathematical result. The strictly consistent loss functions for mean estimation are the
Bregman divergences; see Savage [198] and Gneiting [77]. Under certain assumptions, this
statement is an “if and only if”-statement. This implies that we should always consider
a Bregman divergence for regression model fitting of conditional mean type (1.2).
Bregman divergences take the following form
\[
L_\psi(y, m) = \psi(y) - \psi(m) - \psi'(m)\,(y - m),
\]
for a strictly convex function ψ.
Generally, Kullback–Leibler (KL) divergences and deviance loss functions are Breg-
man divergences; examples include the square loss, the Poisson deviance loss, the
gamma deviance loss or the categorical loss; see Table 2.2, below.
Remarks 1.4. We have mentioned in Remark 1.1 that, typically, for insurance pricing
we use (conditional) means. This is justified by the law of large numbers that ensures
that we charge the correct price level. These (conditional) means are estimated based on
strictly consistent loss functions for mean estimation.
^2 It is probably in the genes of actuaries to try to minimize losses. By a sign switch −L, we obtain a
score, and economists would probably rather want to maximize scores.
In contrast, we could also be interested in medians and quantiles, e.g., for risk management
purposes. Strictly consistent loss functions for median and quantile estimation
are the mean absolute error (MAE) and, more generally, the pinball losses; see Thomson
[216] and Saerens [197]. These losses are not Bregman divergences, and they should not
be used for regression model fitting if we are interested in (conditional) expectations. The
same applies to model validation and model selection, i.e., if one fits regression models
on conditional means, one should not use MAE figures to validate them. ■
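A short numerical sketch (assuming NumPy; the simulated responses are illustrative only) makes this distinction concrete: minimizing the square loss over constant forecasts recovers the mean, whereas minimizing the MAE recovers the median, so the two criteria target different functionals.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=500.0, size=200_000)   # right-skewed responses

grid = np.linspace(200.0, 2000.0, 1801)               # candidate constant forecasts m
sq_loss = [np.mean((y - m) ** 2) for m in grid]       # square loss (a Bregman divergence)
abs_loss = [np.mean(np.abs(y - m)) for m in grid]     # MAE (pinball loss at level 0.5)

print("square-loss minimizer:", grid[int(np.argmin(sq_loss))], " sample mean:", y.mean())
print("MAE minimizer        :", grid[int(np.argmin(abs_loss))], " sample median:", np.median(y))
```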
• Compared to (1.6), we scale by 1/n. This does not change the solution on finite
samples n < ∞, but it is necessary for the law of large numbers argument to hold.
• Generally, we only work with strictly consistent loss functions L for mean estimation;
this also applies to the empirical version (1.10).
• The solution µ̂ of (1.10) depends on the realization of the sample (Y_i, X_i)_{i=1}^n, and
repeating this experiment typically gives a different solution µ̂. In this sense, the
solution µ̂ of (1.10) is itself a random variable, a function of the sample (Y_i, X_i)_{i=1}^n,
and atypical observations may give atypical solutions.
• The difference between the solutions of (1.8) and (1.10) is called the estimation error.
Typically, for increasing sample size, the estimation error decreases on average. In
many problems, estimation errors decay at rate 1/√n for i.i.d. data (Y_i, X_i)_{i=1}^n,
which usually can be attributed to a central limit theorem (strict consistency and
asymptotic normality).
If we study the model fitting problem (1.8) and if the true model µ∗ ∈ M belongs to the
selected model class, then any strictly consistent loss function L finds the true model µ∗ ,
and there are infinitely many strictly consistent loss functions for mean estimation. This
statement is an asymptotic (infinite sample size) statement, in contrast to its empirical
counterpart (1.10).
The specific selection of the loss function L becomes important on finite sample
sizes n < ∞.
More specifically, Gourieroux et al. [87, Theorem 4] proved (in a GLM context) that
optimal regression models are found if the chosen strictly consistent loss function L for
mean estimation reflects the correct variance behavior of the response Y . In that case,
the model fitting procedure results in a so-called best asymptotically normal estimation.
We illustrate the Gourieroux et al. [87] result by an example. Consider the Gaussian,
the Poisson, the gamma and the inverse Gaussian distributions for Y , given X, with
corresponding conditional mean µ∗ (X) (being the same in all four cases). The conditional
variances in these four models are given by
\[
\mathrm{Var}(Y \mid X) \;=\; \varphi, \qquad \varphi\,\mu^*(X), \qquad \varphi\,\mu^*(X)^2, \qquad \varphi\,\mu^*(X)^3, \qquad \text{respectively},
\]
where φ > 0 is a given dispersion parameter. We observe that in these four models the
conditional variances are power functions of the mean functional µ∗ (X) with different
power variance parameters p ∈ {0, 1, 2, 3}. These different variance functions can be
translated to deviance loss functions by selecting the corresponding distribution within the
exponential dispersion family (EDF); for details see (2.6) and the subsequent discussion.
In particular, the Gaussian case translates to the square loss
\[
L(y, m) = (y - m)^2, \qquad (1.11)
\]
the Poisson case to the Poisson deviance loss
\[
L(y, m) = 2\left( m - y - y \log\frac{m}{y} \right), \qquad (1.12)
\]
the gamma case to the gamma deviance loss
\[
L(y, m) = 2\left( \frac{y - m}{m} + \log\frac{m}{y} \right), \qquad (1.13)
\]
and the inverse Gaussian case to the inverse Gaussian deviance loss
\[
L(y, m) = \frac{(y - m)^2}{m^2\, y}. \qquad (1.14)
\]
All of these deviance loss functions are strictly consistent for mean estimation, but
the one with the correct conditional variance behavior for the response Y , given
X, has the best finite sample properties (on average).
More examples are provided in Table 2.2, below, and in Section 2.1.3, below, we give
more mathematical justification to this statement.
Figure 1.2 shows the four deviance loss functions (1.11)-(1.14) with power variance pa-
rameters p ∈ {0, 1, 2, 3}. For this figure, we select a fixed mean parameter m = 1000
and the dispersion parameters φ > 0 are chosen such that all four models have the same
variance of 1000. Figure 1.2 shows the resulting deviance losses y ↦ L(y, m)/φ, scaled by
φ^{-1} so that they all live on the same scale; the colored circles are at y = 800, 1200
(symmetric around m = 1000) for better orientation.

Figure 1.2: Deviance loss functions (1.11)-(1.14), y ↦ L(y, m)/φ for the Gaussian, Poisson,
gamma and inverse Gaussian cases, for fixed mean m = 1000; the circles are at y = 800, 1200
(symmetric around m = 1000) for better orientation.

The square loss of the Gaussian model
(in blue color) is symmetric around m = 1000, and all other deviance losses are asymmetric
around this value (compare the colored circles). This asymmetry is the property
that should match the response distribution so that we receive (on average) optimal finite
sample properties in model estimation according to Gourieroux et al. [87]. That is, if
the response distribution is very right-skewed, the selected deviance loss should account
for this to receive a best asymptotically normal estimation of the expected values; we
come back to this in Section 2.1, below.
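A small sketch of the four deviance losses, under the assumption that (1.11)-(1.14) take the standard power-variance (Tweedie) forms stated above; it reproduces the quantities plotted in Figure 1.2 for the fixed mean m = 1000 and dispersions calibrated to a common variance of 1000 (NumPy assumed available).

```python
import numpy as np

def deviance_loss(y, m, p):
    """Deviance losses (1.11)-(1.14) for power variance parameters p in {0, 1, 2, 3}."""
    y, m = np.asarray(y, dtype=float), np.asarray(m, dtype=float)
    if p == 0:    # Gaussian: square loss (1.11)
        return (y - m) ** 2
    if p == 1:    # Poisson deviance loss (1.12)
        return 2.0 * (m - y - y * np.log(m / y))
    if p == 2:    # gamma deviance loss (1.13)
        return 2.0 * ((y - m) / m + np.log(m / y))
    if p == 3:    # inverse Gaussian deviance loss (1.14)
        return (y - m) ** 2 / (m ** 2 * y)
    raise ValueError("p must be in {0, 1, 2, 3}")

# setting of Figure 1.2: fixed mean m = 1000, dispersion phi chosen such that
# all four models have variance 1000, i.e., phi = 1000 / 1000**p
m, var = 1000.0, 1000.0
ys = np.array([800.0, 1000.0, 1200.0])
for p in (0, 1, 2, 3):
    phi = var / m ** p
    print("p =", p, " scaled losses L(y, m)/phi:", np.round(deviance_loss(ys, m, p) / phi, 2))
```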
Remarks 1.5. • All losses in (1.11)-(1.14) satisfy L(y, m) ≥ 0, and these losses are
zero if and only if y = m. The square loss (1.11) is defined on R × R, and the other
three losses on the positive real line R+ × R+ , i.e., they need positive inputs.
• There is a deep connection between the distributions of the EDF, maximum likeli-
hood estimation (MLE) and deviance loss minimization which we did not explain
here. First, minimizing the above deviance loss functions is equivalent to MLE
in the corresponding distributional model. That is, e.g., minimizing the Poisson
deviance loss results in the MLE µ̂^MLE for the Poisson model. We will come back
to this, below. Second, all considered models (1.11)-(1.14) have in common that
they belong to the EDF, and each member of the EDF is characterized by a certain
functional form of its conditional variance function V(Y |X); we refer to Jørgensen
[112, Theorem 2.11] and Bar-Lev–Kokonendji [13, Section 2.4]. In fact, the condi-
tional variance function determines the specific distribution within the EDF, and
this, in turn, gives us the optimal deviance loss function choice for model fitting on
finite samples.
• Distributions within the EDF with conditional variance functions of the form
\[
\mathrm{Var}(Y \mid X) = \varphi\, \mu^*(X)^{p} \qquad (1.15)
\]
are called Tweedie's distributions with power variance parameter p ∈ R \ (0, 1). This
class of distributions has simultaneously been introduced by Tweedie [222] and Bar-
Lev–Enis [12], and for p ∈ (0, 1) there do not exist any Tweedie distributions; see
Jørgensen [111]. For p ∈ (1, 2) we have Tweedie's compound Poisson model, which
is absolutely continuous on R_+ and has a point mass at zero.
• In cases where the conditional variance function is unknown or rather unclear, one
can exploit an iterative estimation procedure by alternating mean and variance
estimation to get optimal regression models. Under isotonicity of the conditional
variance in the conditional mean behavior, this has recently been studied in Delong–
Wüthrich [51], and it is verified in this study that this iterative estimation procedure
can be very beneficial for improving model accuracy, i.e., getting closer to best
asymptotically normal in the sense of Gourieroux et al. [87].
■
We conclude this section with the log-link GLM example (1.3)-(1.4). We call Example
1.2 a Gaussian log-link GLM because the square loss minimization emphasizes that the
responses Y are conditionally Gaussian, given X. Similarly, we can define a Poisson and
a gamma log-link GLM.
Example 1.6 (Poisson log-link GLM). Select the Poisson deviance loss (1.12) for L.
This gives the optimization in the log-link GLM case
\[
\widehat{\vartheta} = \underset{\vartheta\in\mathbb{R}^{q+1}}{\arg\min}\; \sum_{i=1}^{n} 2\left( \mu_\vartheta(X_i) - Y_i - Y_i \log\frac{\mu_\vartheta(X_i)}{Y_i} \right),
\]
for log-link GLM (1.3)-(1.4). This is a Poisson log-link GLM, and it emphasizes that the
responses Yi have conditional Poisson distributions, given X i . ■
Example 1.7 (gamma log-link GLM). Select the gamma deviance loss (1.13) for L. This
gives the optimization in the log-link GLM case
\[
\widehat{\vartheta} = \underset{\vartheta\in\mathbb{R}^{q+1}}{\arg\min}\; \sum_{i=1}^{n} 2\left( \frac{Y_i - \mu_\vartheta(X_i)}{\mu_\vartheta(X_i)} + \log\frac{\mu_\vartheta(X_i)}{Y_i} \right),
\]
for log-link GLM (1.3)-(1.4). This is a gamma log-link GLM, and it emphasizes that the
responses Yi have conditional gamma distributions, given X i . ■
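The equivalence between deviance loss minimization and MLE can be checked numerically. The sketch below (assuming NumPy, SciPy and statsmodels are available; the simulated portfolio is purely illustrative) fits the Poisson log-link GLM of Example 1.6 once by minimizing the Poisson deviance loss and once by MLE; the two parameter estimates coincide up to numerical tolerance.

```python
import numpy as np
from scipy.optimize import minimize
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, q = 5_000, 3
X = rng.normal(size=(n, q))
theta_true = np.array([-2.0, 0.3, -0.5, 0.2])            # (theta_0, ..., theta_q)
Y = rng.poisson(np.exp(theta_true[0] + X @ theta_true[1:]))

def poisson_deviance(theta):
    mu = np.exp(theta[0] + X @ theta[1:])
    # use y*log(y/mu) = 0 for y = 0 (limiting sense)
    ylogy = np.where(Y > 0, Y * np.log(np.where(Y > 0, Y, 1.0) / mu), 0.0)
    return np.sum(2.0 * (mu - Y + ylogy))

theta_dev = minimize(poisson_deviance, x0=np.zeros(q + 1), method="BFGS").x

# MLE via a Poisson GLM with (canonical) log-link
theta_mle = sm.GLM(Y, sm.add_constant(X), family=sm.families.Poisson()).fit().params

print(np.round(theta_dev, 4))
print(np.round(np.asarray(theta_mle), 4))   # coincides with the deviance minimizer
```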
Table 2.2, below, gives more examples and it describes how certain distributions are
linked to deviance losses.
diagrams, etc. We will come back to graphical illustrations, below. In this section we
describe model validation and model selection.
Section 1.4 focused on model fitting, and the crucial message was that this
should be done under strictly consistent loss functions for mean estimation. The same
methodology can be used for model validation and model selection, that is, by a strictly
consistent scoring of the fitted models. The objective of expected loss minimization is
turned into the new objective of generalization loss minimization. In other words, one
would like to know which of several models has the best forecast accuracy, and this is
evaluated by studying their generalization loss, meaning that one wants to know “which
of the fitted models generalizes best to new unseen data” in the sense of giving the most
accurate forecasts for new data.
This section focuses on the core (model-agnostic) methods for model validation and
model selection that are in the intersection between machine learning and statistics, such
as cross-validation, Akaike’s information criterion or out-of-bag validation. Out-of-bag
validation requires introducing the bootstrap, which is also done in this section. More
sophisticated tools for model validation and model selection (model-agnostic and model-
specific ones) will be described in later chapters, but for this we first need to introduce
the relevant techniques.
Definition 1.3 motivates selecting the model with the smallest expected loss for a given
strictly consistent loss function L, see (1.9). As highlighted in (1.10), this selection needs
to be done empirically because the true data generating mechanism is unknown. There is
a specific point that needs special attention in this model validation and model selection
procedure.
Model estimation and model validation should not be done on the identical sample.
This is most easily understood by realizing that if one used the identical data,
a more complex model would always outperform a nested simpler model. This is related
to the in-sample bias that may judge the more complex model too optimistically because
it solves the same minimization problem (1.10) under more degrees of freedom.
The standard way of analyzing the generalization loss (forecast performance) is to partition
the entire sample into two data sets: a learning sample L for model fitting and a
test sample (hold-out sample) T for model testing (generalization loss analysis). These
two samples should be mutually independent and contain i.i.d. data L = (Y_i, X_i)_{i=1}^n
and T = (Y_t, X_t)_{t=1}^m, respectively, following the same law as (Y, X); the two samples are
distinguished here by the different lower indices 1 ≤ i ≤ n and 1 ≤ t ≤ m, respectively.
Model fitting (model learning) is then performed solely on the learning sample L by
minimizing the in-sample loss
\[
\widehat{\mu}_{\mathcal{L}} \in \underset{\mu \in \mathcal{M}}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} L\big(Y_i, \mu(X_i)\big).
\]
The selected model(s) are evaluated (compared to each other) on the hold-out sample T
by analyzing their out-of-sample loss (empirical generalization loss (GL))
\[
\widehat{\mathrm{GL}}(\mathcal{T}, \widehat{\mu}_{\mathcal{L}}) := \frac{1}{m} \sum_{t=1}^{m} L\big(Y_t, \widehat{\mu}_{\mathcal{L}}(X_t)\big). \qquad (1.16)
\]
Thus, the learning sample L is only used for the estimation of the regression function
µ̂_L(·), which is then evaluated on the independent hold-out sample T. This out-of-sample
loss (1.16) is the main workhorse for model selection in machine learning models (e.g.,
between a neural network and a gradient boosting model). A main reason for this is that
computationally it is not very demanding.
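The following sketch (NumPy only; simulated Poisson claim counts for illustration) shows the learning/test split and the computation of the out-of-sample loss (1.16), here used to compare two candidate predictors.

```python
import numpy as np

rng = np.random.default_rng(3)

def poisson_deviance(y, mu):
    """Average Poisson deviance loss (1.12); y*log(y/mu) is set to 0 for y = 0."""
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return np.mean(2.0 * (mu - y + ylogy))

# simulate a portfolio and partition it into a learning sample L and a test sample T
n = 20_000
X = rng.normal(size=(n, 3))
mu_true = np.exp(-2.0 + 0.4 * X[:, 0] - 0.3 * X[:, 1])
Y = rng.poisson(mu_true)

idx = rng.permutation(n)
learn, test = idx[: int(0.8 * n)], idx[int(0.8 * n):]

# two candidate predictors, "fitted" on L only
candidates = {
    "homogeneous model": np.full(n, Y[learn].mean()),
    "true regression function": mu_true,
}
for name, mu_hat in candidates.items():
    print(name, " out-of-sample loss (1.16):",
          round(poisson_deviance(Y[test], mu_hat[test]), 4))
```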
1.5.2 Cross-validation
Out-of-sample loss validation (1.16) may not be the most economic way of dealing with
small data, i.e., with a statistical problem where no big data is available. By this we
mean that, unlike in many machine learning problems, in actuarial problems we do not
have unlimited data resources, but the available data is determined by the size of the
insurance portfolio. In this case, K-fold cross-validation (CV) is an alternative.
In a first step, K-fold cross-validation fits the model K times to different sub-samples of
the data to derive the generalization loss estimate (1.17), below, and in a second final
step, all data is used to obtain the optimal predictive model µ̂. For the first step, we
partition (at random) the index set I = {1, . . . , n} into K roughly equally sized folds
(I_k)_{k=1}^K, see Figure 1.3. K is a hyper-parameter that is usually selected as K = 10, but
for small sample sizes n we may also select a smaller K to receive reliable results; in
Figure 1.3 it is set to K = 5. We then learn the model on the data with indices I \ I_k,
and we perform an out-of-sample validation on the indices I_k. That is, for 1 ≤ k ≤ K,
we compute the estimated models
\[
\widehat{\mu}^{(\setminus k)} \in \underset{\mu\in\mathcal{M}}{\arg\min} \sum_{i \in \mathcal{I}\setminus \mathcal{I}_k} L\big(Y_i, \mu(X_i)\big).
\]
The K-fold cross-validation loss is then obtained by averaging the out-of-sample losses over the K folds,
\[
\widehat{\mathrm{GL}}^{\mathrm{CV}} := \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\mathcal{I}_k|} \sum_{i \in \mathcal{I}_k} L\big(Y_i, \widehat{\mu}^{(\setminus k)}(X_i)\big). \qquad (1.17)
\]
Note that each term under the k-summation uses a (disjoint) partition of the entire data
into a learning sample with indices I \ Ik and a test sample with indices Ik . Thus, we
perform a proper (mutual) out-of-sample validation for each 1 ≤ k ≤ K.
Remarks 1.8. • Both the out-of-sample loss (1.16) and the K-fold cross-validation
loss (1.17) give estimates for the true (expected) generalization loss. The cross-
validation loss has the advantage that it not only estimates this generalization
loss, but we can also quantify uncertainty by computing the empirical standard
deviation of the K folds under the k-summation in (1.17).
Figure 1.3: Partition of the index set I into K = 5 folds I_1, . . . , I_5; for each fold k, the
model is learned on the indices I \ I_k and validated on I_k.
• K-fold cross-validation can be demanding because it requires fitting the model K + 1
times: once to get the optimal regression function µ̂, and K times to compute the
K-fold cross-validation loss (1.17).
• Above, we have partitioned the index set I completely at random into the folds
(I_k)_{k=1}^K. Stratified K-fold cross-validation does not partition I completely at random.
For instance, one may order the instances 1 ≤ i ≤ n w.r.t. the sizes of the
responses (Y_i)_{i=1}^n, and then allocate the K largest claims at random to the different
folds (I_k)_{k=1}^K of the partition, then the next K largest claims likewise, etc. This
provides more similarity between the folds, which can be an advantage in model
selection, especially under right-skewed or heavy-tailed loss size distributions.
■
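A minimal sketch of K-fold cross-validation (NumPy; simulated gamma claim sizes and a homogeneous toy model class for illustration), computing the cross-validation loss and the empirical standard deviation over the K folds; the fold-wise averaging follows (1.17).

```python
import numpy as np

rng = np.random.default_rng(4)

def gamma_deviance(y, mu):
    """Average gamma deviance loss (1.13)."""
    return np.mean(2.0 * ((y - mu) / mu + np.log(mu / y)))

# illustrative claim size data with a single covariate
n = 5_000
x = rng.uniform(size=n)
y = rng.gamma(shape=2.0, scale=(500.0 * np.exp(x)) / 2.0)   # mean 500*exp(x)

K = 10
folds = np.array_split(rng.permutation(n), K)               # random partition (I_k)_{k=1}^K

cv_losses = []
for k in range(K):
    test_idx = folds[k]
    learn_idx = np.setdiff1d(np.arange(n), test_idx)
    # toy model class M: homogeneous model, fitted on I \ I_k
    mu_hat = np.full(n, y[learn_idx].mean())
    cv_losses.append(gamma_deviance(y[test_idx], mu_hat[test_idx]))

cv_losses = np.array(cv_losses)
print("K-fold CV loss (1.17):", cv_losses.mean())
print("empirical std over the K folds:", cv_losses.std(ddof=1))
```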
Since ϑ̂^MLE maximizes (1.18), the quantity ℓ_L(ϑ̂^MLE) gives an in-sample biased view of
this model. AIC and BIC determine asymptotically an in-sample bias correction (in
different settings), based on the fact that model fitting was done by MLE. The AIC
value of this MLE fitted model is defined by
\[
\mathrm{AIC} := -2\, \ell_{\mathcal{L}}\big(\widehat{\vartheta}^{\mathrm{MLE}}\big) + 2 \dim(\vartheta), \qquad (1.19)
\]
and the corresponding BIC value by
\[
\mathrm{BIC} := -2\, \ell_{\mathcal{L}}\big(\widehat{\vartheta}^{\mathrm{MLE}}\big) + \log(n) \dim(\vartheta), \qquad (1.20)
\]
where n is the sample size. Note that these model selection criteria may not be valid in
machine learning models, such as neural networks, as these models do not use the MLE
for model fitting. Moreover, in many machine learning models, the dimension of the
model parameter is unclear because such models are often over-parametrized resulting
in redundancy; we refer to Abbas et al. [1] who discuss the effective dimension of neural
networks. Therefore, AIC and BIC are mainly useful tools for model selection among
GLMs, provided they were fitted with MLE.
We summarize the important points about AIC and BIC.
• First, we need MLE estimated models for applying AIC and BIC.
• Second, (1.19) and (1.20) require considering all terms of the log-likelihood function
ℓ_L(ϑ). This also applies to models where some of the terms cannot be computed
analytically, like, e.g., in Tweedie's compound Poisson model.
• Third, model selection can be done for any two models and these models do not
need to be nested and they do not need to be Gaussian. This makes AIC and BIC
very attractive and widely applicable.
• Fourth, the responses need to be on the identical scale in all compared models.
• Fifth, AIC and BIC only give preference to a model, but they do not confirm that
the selected model is suitable, i.e., it could simply be the best option of a class of
inappropriate models.
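A small illustration of these points, using the AIC and BIC formulas (1.19)-(1.20) stated above for two nested, MLE-fitted Poisson log-link GLMs (statsmodels assumed available; the data are simulated for illustration).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 2_000
X = rng.normal(size=(n, 2))
Y = rng.poisson(np.exp(-1.0 + 0.5 * X[:, 0]))

# two nested Poisson log-link GLMs, both fitted by MLE
fit1 = sm.GLM(Y, sm.add_constant(X[:, :1]), family=sm.families.Poisson()).fit()
fit2 = sm.GLM(Y, sm.add_constant(X), family=sm.families.Poisson()).fit()

for name, fit, k in (("model 1 (1 covariate)", fit1, 2), ("model 2 (2 covariates)", fit2, 3)):
    aic = -2.0 * fit.llf + 2.0 * k              # (1.19), k = dim(theta)
    bic = -2.0 * fit.llf + np.log(n) * k        # (1.20)
    print(name, " AIC:", round(aic, 1), " BIC:", round(bic, 1))
```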
Parametric bootstrap
For a parametric bootstrap version, we assume that the independent observations follow
a given model Y_i | X_i ∼ F_ϑ(·|X_i), for 1 ≤ i ≤ n, parametrized by an unknown
model parameter ϑ ∈ R^r. This allows one to estimate the model parameter ϑ from
the i.i.d. sample (Y_i, X_i)_{i=1}^n, giving an estimated model F_ϑ̂(·|X) for given covariates
X. A bootstrap sample (Y_i^⋆, X_i)_{i=1}^n is then obtained by simulating new conditionally
independent observations Y_i^⋆ | X_i ∼ F_ϑ̂(·|X_i) from the estimated model, for 1 ≤ i ≤ n. If
the estimated model is sufficiently accurate, the bootstrap sample (Y_i^⋆, X_i)_{i=1}^n has similar
distributional properties as the original sample (Y_i, X_i)_{i=1}^n. Based on this bootstrap
sample (Y_i^⋆, X_i)_{i=1}^n, we can re-estimate the model parameter ϑ, providing us with a
bootstrap estimate ϑ̂^⋆ ∈ R^r. Repeating this procedure m times gives us an empirical
distribution of the estimated model parameter (called empirical bootstrap distribution)
\[
\widehat{G}(\theta) := \frac{1}{m} \sum_{j=1}^{m} \mathbf{1}_{\{\theta \le \widehat{\vartheta}^{(\star j)}\}} \qquad \text{for } \theta \in \mathbb{R}^{r},
\]
with ϑ̂^{(⋆j)} ∈ R^r denoting the estimated model parameter from the j-th bootstrap sample
(Y_i^{(⋆j)}, X_i)_{i=1}^n, 1 ≤ j ≤ m.
This is one version of a parametric bootstrap. It keeps the covariates (X i )ni=1 fixed,
and it only re-simulates the responses from the estimated model. There are many differ-
ent variants of the parametric bootstrap, e.g., also re-simulating the covariates or only
re-simulating the residuals, called residual bootstrap; we refer to Wüthrich–Merz [243,
Section 4.3] and the references therein for more discussion.
Non-parametric bootstrap
The non-parametric bootstrap is even simpler than its parametric counterpart. For
a non-parametric bootstrap we directly start from the observed sample (Yi , X i )ni=1 . A
non-parametric bootstrap sample (Y_j^⋆, X_j^⋆)_{j=1}^n is generated by drawing with replacement
from the original data (Y_i, X_i)_{i=1}^n. This naturally also re-samples the covariates;
some instances appear multiple times in the bootstrap sample (Y_j^⋆, X_j^⋆)_{j=1}^n, and others
are not selected by this drawing with replacement. Denote by I^⋆ ⊆ I = {1, . . . , n} the
set of indices that have been selected from the original data (Y_i, X_i)_{i=1}^n for the bootstrap
sample (Y_j^⋆, X_j^⋆)_{j=1}^n. We estimate a new model from this bootstrap sample
\[
\widehat{\mu}^{\star} \in \underset{\mu\in\mathcal{M}}{\arg\min} \sum_{j=1}^{n} L\big(Y_j^{\star}, \mu(X_j^{\star})\big), \qquad (1.21)
\]
and we can study the empirical bootstrap distribution as above by repeating this boot-
strap procedure many times.
Out-of-bag cross-validation
The interesting point about the non-parametric bootstrap now is that the instances
i ∈ I \ I^⋆ have not been used in this estimation procedure to obtain the bootstrap
estimate µ̂^⋆ in (1.21). We call the set of observations (Y_i, X_i)_{i∈I\I^⋆} an out-of-bag (OoB)
sample. This presents a valid (disjoint) sample for cross-validation (i.e., an empirical
generalization loss)
\[
\widehat{\mathrm{GL}}^{\mathrm{OoB}} := \frac{1}{|\mathcal{I}\setminus\mathcal{I}^{\star}|} \sum_{i\in\mathcal{I}\setminus\mathcal{I}^{\star}} L\big(Y_i, \widehat{\mu}^{\star}(X_i)\big). \qquad (1.22)
\]
The probability that a given instance i is not contained in the bootstrap sample converges to
\[
\lim_{n\to\infty} \Big(\frac{n-1}{n}\Big)^{n} = \lim_{n\to\infty} \big(1 - n^{-1}\big)^{n} = e^{-1} = 36.8\%.
\]
Thus, on average, the out-of-bag sample has a reasonably big size; see Breiman [28].
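A sketch of the non-parametric bootstrap with out-of-bag validation (NumPy; a simple linear least squares toy model on simulated data), computing the OoB generalization loss (1.22) and the average OoB fraction, which is close to e^{-1}.

```python
import numpy as np

rng = np.random.default_rng(6)

n = 2_000
x = rng.uniform(size=n)
y = 100.0 * x + rng.normal(scale=10.0, size=n)

oob_losses, oob_sizes = [], []
for _ in range(200):
    idx_star = rng.integers(0, n, size=n)            # draw instances with replacement
    oob = np.setdiff1d(np.arange(n), idx_star)       # out-of-bag indices I \ I*
    oob_sizes.append(len(oob))

    # toy model fitted on the bootstrap sample: ordinary least squares
    A = np.column_stack([np.ones(n), x[idx_star]])
    coef, *_ = np.linalg.lstsq(A, y[idx_star], rcond=None)
    mu_oob = coef[0] + coef[1] * x[oob]

    oob_losses.append(np.mean((y[oob] - mu_oob) ** 2))   # square loss on the OoB sample

print("average OoB fraction:", np.mean(oob_sizes) / n)   # close to e^{-1} = 36.8%
print("OoB generalization loss (1.22):", np.mean(oob_losses))
```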
\[
L\big(Y_i, \mu(X_i)\big) \;\to\; \frac{v_i}{\varphi}\, L\big(Y_i, \mu(X_i)\big),
\]
where vi > 0 is an instance specific weight and φ > 0 is a general parameter for dispersion
that is not instance specific; for details see Section 2.1. This instance specific factor vi /φ
is going to be added in front of all loss functions L in all subsequent considerations.
• Assume we want to consider two model classes M1 and M2 that are signifi-
cantly different, e.g., GLMs and gradient boosting models, and we would like
to select the best model from these two classes to predict a new observation
Y , given X.
• Select a suitable strictly consistent loss function L for mean estimation, e.g.,
if the responses Y are gamma like, given X, we select the gamma deviance
loss for L.
• Based on the learning sample L, select the best models w.r.t. the selected
loss function L from both model classes Mk , k = 1, 2,
\[
\widehat{\mu}_k \in \underset{\mu\in\mathcal{M}_k}{\arg\min} \sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\big(Y_i, \mu(X_i)\big).
\]
• Compute the out-of-sample losses (1.16) of both fitted models on the hold-out sample T.
That is, select µ̂_1 ∈ M_1 if ĜL(T, µ̂_1) < ĜL(T, µ̂_2); otherwise select µ̂_2 ∈ M_2.
Regression models
Before diving into predictive modeling, we should ask ourselves about the desirable
characteristics a predictive model should possess. Typically, we cannot comply with all
of them, and the best model will be a compromise between these desirable characteristics.
Let us mention some points in a slightly unstructured manner.
(a) Clearly, the model should have a good predictive performance giving us very accu-
rate forecasts.
(b) The model should have a certain smoothness so that the forecasts do not dramati-
cally change, if one slightly perturbs the inputs.
(c) The model should have a certain sparsity and simplicity, which means that we target
a model that is as small as possible, but still predicts sufficiently accurately;
i.e., we aim for a parsimonious model.
(d) Towards stakeholders, we should be able to explain the inner functioning of the
model, and the results should intuitively make sense and be explainable.
(e) It should have good finite sample properties in estimation, so that all parts of the
model can be determined with credibility.
(g) We should be able to manually change parts of the model to integrate expert
knowledge, if available and necessary.
(h) It should comply with regulation, and we should be able to verify this.
The starting point of a machine learner would probably be to just run the available data
through a gradient boosting machine (GBM) or a neural network, and then study its
outputs. As already discussed at the beginning of Section 1.4, this may not be the best
way of dealing with the problem, especially, in scarce data settings that may have a
large variability in their responses (class imbalance is a related buzzword). This is one
of the key differences between machine learning and actuarial data science, namely, the
actuary first tries to understand the (raw) data, and then designs an optimal architecture
according to the insights that she/he has gained from this initial data analysis. This
insight can already significantly improve predictive models, e.g., choosing a more suitable
strictly consistent loss function for mean estimation than the square loss can make a huge
difference. This is the main motivation for us to first study the most important family
of distributions, the exponential dispersion family (EDF). This knowledge can then be
translated into optimal choices of the objective function for a certain type of data and
problem; in fact, this will justify the deviance loss choices (1.11)-(1.14). In this sense,
the statistical theory matters beyond algorithmic forecasting.
κ : Θ → R,
This indicates the crucial role played by the cumulant function κ in the EDF.
The inverse function h := (κ′)^{-1} is called the canonical link of the chosen EDF, and it
provides us with the identity
\[
h(\mu_0) = \theta, \qquad (2.3)
\]
which is one-to-one with the interior of the effective domain Θ̊; thus, the EDF can either
be parametrized by the canonical parameter θ or by its mean parameter µ_0.
Finally, we introduce the variance function µ_0 ↦ V(µ_0) = (κ′′ ∘ h)(µ_0), which has the
property
\[
\mathrm{Var}(Y) = \frac{\varphi}{v}\, V(\mu_0). \qquad (2.4)
\]
All the models discussed in (1.11)-(1.14) and (1.15) are of this EDF type, with power
variance function V(µ_0) = µ_0^p for p ∈ R \ (0, 1); for simplicity we have set the weight v = 1
in these previous examples. This variance function V fully characterizes the cumulant
function κ, provided it exists on the selected mean parameter space; see Jørgensen [112,
Theorem 2.11] and Bar-Lev–Kokonendji [13, Section 2.4].
The EDF is attractive because it contains many popular statistical models used to
solve actuarial problems such as the Bernoulli, Gaussian, gamma, inverse Gaussian,
Poisson or negative binomial models. These examples are distinguished by the
choice of the cumulant function κ; Table 2.1 gives the most relevant examples.
Table 2.1: Commonly used examples of the EDF with corresponding means.
this statement only holds up to the possibly degenerate behavior at the boundary of Θ.1
This allows us to define the deviance loss function within the EDF.
Definition 2.1. Select Y ∼ EDF(θ, φ/v; κ) with steep cumulant function κ; see footnote
1 below. The deviance loss function of the selected EDF is defined by
\[
L(y, m) = 2\,\frac{\varphi}{v} \Big( \log f_{h(y)}(y) - \log f_{h(m)}(y) \Big) \;\ge\; 0, \qquad (2.6)
\]
with m ∈ κ′(Θ̊) and y in the convex closure of the support of the response Y.
• Using (2.6), the EDF density fθ (y) with cumulant function κ, given in (2.1),
is transformed into a deviance loss function L(y, m); this uses the canonical
link relation h(m) = θ, see (2.3).
• The latter is precisely the property that motivated the choices of the deviance
losses (1.11)-(1.14), e.g., if the responses Y are Poisson distributed, we can
either perform MLE in the Poisson model or we can minimize the Poisson
deviance loss (1.12) providing us with the same model.
Table 2.2 presents the most popular EDF models in actuarial science and their deviance
loss functions (2.6). Tweedie’s CP refers to Tweedie’s compound Poisson model that has
a power variance function (1.15) with p ∈ (1, 2). This can be extended to p ∈ {0}∪[1, ∞),
where p = 1, 2 has to be understood in the limiting sense for the cumulant function κ
and the corresponding deviance loss L.2 The power variance parameter p = 0 gives the
Gaussian model, p = 1 the Poisson model, p = 2 the gamma model and p = 3 the
inverse Gaussian model. These are the only models of the EDF with a power variance
function for which the normalizing term c(y; φ/v) has a closed form; see Blæsild–Jensen
[23]. This becomes relevant for AIC and BIC, see (1.19) and (1.20), but also for MLE of
the dispersion parameter φ.

Table 2.2: Commonly used examples of the EDF with corresponding deviance losses; the
corresponding canonical links h are provided in Table 3.1, below.

^1 In fact, we need to be slightly more careful with statement (2.5). Generally, we request that the
cumulant function κ is steep at the boundary of the effective domain Θ; see Barndorff-Nielsen [16,
Theorem 9.2]. This aligns the mean parameter space with the convex closure of the support of the
response Y. Then, (2.5) is correct up to the boundary of Θ. At the boundary we may get a degenerate
model having a finite mean estimate µ̂_0^MLE, but an undefined canonical parameter estimate.
^2 The cases p < 0 do not have steep cumulant functions κ and are therefore disregarded here.
Table 2.3: Tweedie's models for power variance parameters p ∈ {0} ∪ [1, ∞); this table
is taken from Jørgensen [112].
We close this section with a technical remark. Table 2.3 gives the supports of the re-
sponses Y , the effective domains Θ and the mean parameter spaces κ′ (Θ̊) of Tweedie’s
distributions, having power variance function (1.15). In all these examples, the convex
closure of the support of Y is equal to the closure of the mean parameter space κ′ (Θ̊).
This is a characterization of steep cumulant functions κ; see Barndorff-Nielsen [16, The-
orem 9.2]. If Y is in the boundary of the mean parameter space, the deviance loss (2.6)
needs to be understood in the limiting sense; see Wüthrich–Merz [243, formula (4.8)].
In analogy to (1.10), the regression model is fitted by solving the weighted counterpart
\[
\widehat{\mu} \in \underset{\mu\in\mathcal{M}}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\big(Y_i, \mu(X_i)\big), \qquad (2.7)
\]
for a strictly consistent loss function L for mean estimation. There is a (minor) change
to (1.10), namely, we add factors v_i/φ to receive weighted losses in (2.7). The motivation
for these weightings is that we assume that we have selected a deviance loss function for
L, implied by a distributional model coming from the EDF (2.1). In view of the
variance behavior (2.2), this requires a suitable weighting of all the individual instances
1 ≤ i ≤ n. The weighting proposed in (2.7) is precisely the one that selects the MLE
µ̂^MLE w.r.t. the log-likelihood function of the i.i.d. sample (Y_i, X_i, v_i)_{i=1}^n, with responses
Y_i following the corresponding (conditional) EDF (2.1), given means µ(X_i) and weights
v_i, for 1 ≤ i ≤ n.^3 In summary, as already mentioned before, deviance loss minimization
in (2.7) is equivalent to MLE in the corresponding EDF.
in (2.7) is equivalent to MLE in the corresponding EDF.
This is the core of regression modeling with the recurrent goal of finding the true re-
gression function µ∗ , see (1.2). The different statistical and machine learning methods
mainly differ in selecting different classes M of candidate regression functions µ : X → R,
e.g., we can select a class of GLMs, of deep neural networks of a certain architecture, or
of regression trees. Some of these classes are non-parametric and others are parametric
families. Optimization (2.7) is formulated in a non-parametric way; for a parametrized
class of regression functions M = {µ_ϑ}_ϑ, we rather solve
\[
\widehat{\vartheta} \in \underset{\vartheta}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\big(Y_i, \mu_\vartheta(X_i)\big). \qquad (2.8)
\]
Example 2.2 (Poisson log-link GLM, revisited). We revisit Example 1.6. Assume
Yi , given X i and vi , is Poisson distributed with log-link GLM regression function
\[
X \mapsto \log\big(\mu_\vartheta(X)\big) = \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_j,
\]
for regression parameter ϑ ∈ Rq+1 , see (1.3). Assume the instances i are indepen-
dent, and select the Poisson deviance loss for L. This gives the optimal solution
\[
\widehat{\vartheta}^{\mathrm{MLE}} = \underset{\vartheta\in\mathbb{R}^{q+1}}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} 2\, v_i \left( \mu_\vartheta(X_i) - Y_i - Y_i \log\frac{\mu_\vartheta(X_i)}{Y_i} \right).
\]
^3 There is a subtle point that should be highlighted in the notation of the i.i.d. sample (Y_i, X_i, v_i)_{i=1}^n.
The i.i.d. property concerns the random sampling of the entire triple (Yi , X i , vi ) including the weight
vi . In this sense, there is a slight abuse of notation here because one should use a capital letter for the
(random) weight. We have decided to use a small letter to have the identical notation as in the (standard)
definition of the EDF (2.1). Having the weight vi as a random variable moreover implies that for the
response distribution of Yi one always needs to condition on X i and vi ; this is the typical situation
because, generally, the joint distribution of (Yi , X i , vi ) is not of interest. For notational convenience (and
to align with the standard EDF definition) we have not done so in the text, see, e.g., formula (3.3). With
this said, the (implicit) randomness in vi may play a role, but this will then be clear from the context.
Remark 2.3. From (2.2) we observe that the expected value of Y is invariant under
changes of the volume/weight v > 0, and the variance is inversely proportional to this volume
parameter. This indicates that the responses Y within the EDF represent normalized
quantities, see Example 2.2. In Jørgensen [112], the normalized quantities correspond to
the so-called reproductive form. ■
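To make the weighted fitting (2.8) and the reproductive form of Remark 2.3 concrete, the following sketch (NumPy and SciPy assumed available; a simulated claim frequency portfolio with exposures as weights and φ = 1) minimizes the exposure-weighted Poisson deviance loss of Example 2.2.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# claim frequency example: exposures v_i, responses Y_i = N_i / v_i (reproductive form)
n = 10_000
v = rng.uniform(0.1, 1.0, size=n)                          # exposure weights v_i
X = rng.normal(size=(n, 2))
mu_true = np.exp(-2.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1])      # expected frequency
N = rng.poisson(v * mu_true)                                # claim counts
Y = N / v                                                   # observed frequencies

def weighted_poisson_deviance(theta):
    """Objective (2.8) with Poisson deviance loss, weights v_i and phi = 1."""
    mu = np.exp(theta[0] + X @ theta[1:])
    ylogy = np.where(Y > 0, Y * np.log(np.where(Y > 0, Y, 1.0) / mu), 0.0)
    return np.mean(2.0 * v * (mu - Y + ylogy))

theta_hat = minimize(weighted_poisson_deviance, x0=np.zeros(3), method="BFGS").x
print(np.round(theta_hat, 3))   # close to (-2.0, 0.5, -0.3) for large n
```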
These are the basics of regression modeling. The selected class M of candidate models will
also depend on the purpose of its use. In some cases, we want to use regression techniques
to explain (or understand) the specific impact of the covariates X on the responses Y . For
instance, does a new medication (reflected in X) have a positive impact on the state of
health (reflected in Y ).4 In other cases, we are just aiming for optimal predictive accuracy,
being less interested into the specific impact of X on Y . An interesting discussion on
explain vs. predict is given in Shmueli [207]. For actuarial pricing, we are somehow
in between the two requirements. We would like to have very accurate (risk-adjusted)
pricing schemes, however, stakeholders like customers, management and regulators want
to know the risk factors and how they impact the prices. That is why actuaries typically
have to compromise between explain vs. predict, and the extent to which this is necessary
depends on societies, countries, different regulations and companies’ business strategies.
2.3.1 Notation
We start with a few words on the notation. The q-dimensional real-valued covariate in-
formation is denoted by X = (X1 , . . . , Xq )⊤ , and it takes values in the covariate space
X ⊆ Rq . If we have a sample (X i )ni=1 ⊂ X of such covariates, we can stack them into
a design matrix X. For this we typically extend the covariates X i by an initial compo-
nent (bias component, zero component) being equal to one, to receive the new (extended)
covariates
X i = (1, Xi,1 , . . . , Xi,q )⊤ ∈ Rq+1 ; (2.9)
this uses a slight abuse of notation because we do not indicate in X i whether it includes
the bias component or not. However, this will always be clear from the context. The
^4 We have purposefully avoided causal language here, since causal statements can only be made
from statistical models in certain circumstances.
This collects all covariates X i of all instances 1 ≤ i ≤ n on the rows, and it adds an
initial bias column being identically equal to one. This is the input information in tabular
form; it has the shape of a matrix, which is equal to a tensor of order 2 (also called 2D
tensor). It is important to highlight that the different fonts used for a single covariate X,
the covariate vector X, the design matrix X and the covariate space X have different
meanings.
Ordinal encoding
This assigns to each level ak the corresponding integer k ∈ N. In our running example
(2.11) we have an alphabetical ordering which can be regarded as ordinal. This provides
the one-dimensional ordinal entity embedding given in Table 2.4.
accountant 1
actuary 2
economist 3
quant 4
statistician 5
underwriter 6
One-hot encoding
One may argue that this ordinal (alphabetic) order does not make much sense for risk
classification, and one should treat these variables rather as nominal variables. The first
solution for a numerical embedding of nominal variables is one-hot encoding which maps
each level a_k to a basis vector in R^K, resulting in the K-dimensional entity embedding
\[
X \in \mathcal{A} \;\mapsto\; \big( \mathbf{1}_{\{X=a_1\}}, \ldots, \mathbf{1}_{\{X=a_K\}} \big)^{\top} \in \mathbb{R}^{K}. \qquad (2.13)
\]
accountant 1 0 0 0 0 0
actuary 0 1 0 0 0 0
economist 0 0 1 0 0 0
quant 0 0 0 1 0 0
statistician 0 0 0 0 1 0
underwriter 0 0 0 0 0 1
Dummy coding
One-hot encoding does not lead to full rank design matrices X because there is a redundancy:
if we know that X does not take any of the first K − 1 levels (a_k)_{k=1}^{K-1}, it is immediately
clear that it has to be of the last level a_K. If a full rank design matrix is an important
prerequisite, one should therefore change to dummy coding (note that for GLMs full
rank design matrices are important, but not for neural networks, as is going to be explained
below). For dummy coding one selects a reference level, e.g., a_2 = actuary. Based on
this selection, all other levels are measured relative to this reference level,
\[
X \in \mathcal{A} \;\mapsto\; \big( \mathbf{1}_{\{X=a_1\}}, \mathbf{1}_{\{X=a_3\}}, \mathbf{1}_{\{X=a_4\}}, \ldots, \mathbf{1}_{\{X=a_K\}} \big)^{\top} \in \mathbb{R}^{K-1}. \qquad (2.14)
\]
accountant 1 0 0 0 0
actuary 0 0 0 0 0
economist 0 1 0 0 0
quant 0 0 1 0 0
statistician 0 0 0 1 0
underwriter 0 0 0 0 1
In actuarial practice, usually the level with the biggest exposure is chosen as reference
level.
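A small sketch of one-hot encoding (2.13) and dummy coding (2.14) using pandas (assumed available); the job-profile data are made up, and the reference level is set to actuary by reordering the categories before dropping the first indicator column.

```python
import pandas as pd

jobs = pd.Series(["actuary", "economist", "actuary", "underwriter",
                  "quant", "statistician", "accountant"], name="job")

# one-hot encoding (2.13): one indicator column per level
one_hot = pd.get_dummies(jobs)

# dummy coding (2.14): put the reference level first, then drop its indicator column
levels = sorted(jobs.unique())
ordered = ["actuary"] + [a for a in levels if a != "actuary"]
dummy = pd.get_dummies(pd.Categorical(jobs, categories=ordered), drop_first=True)

print(one_hot.head())
print(dummy.head())
```

In practice, one would instead put the level with the biggest exposure first, in line with the remark above.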
Entity embedding
One-hot encoding and dummy coding lead to so-called sparse design matrices X, meaning
that most of the entries will be zero by these encoding schemes if we have many categorical
covariates with many levels. Such sparse design matrices can lead to issues in statistical
modeling and model estimation, e.g., resulting estimated model parameters may not be
credible, and matrices may not be well-conditioned because of too sparse levels which can
cause numerical issues when inverting these matrices. Therefore, less sparse encodings
are considered. One should note that in one-hot encoding and dummy coding there is no
notion of adjacency and similarity. However, it might be that some job profiles have a
more similar risk behavior than others; another popular actuarial example is car brands:
sports car brands certainly have a claims behavior that is more similar to each other than
to car brands that typically produce family cars. Borrowing ideas from natural language
processing (NLP), one should therefore consider low(er)-dimensional entity embeddings
where proximity is related to similarity; see de Brébisson et al. [27], Guo–Berkhahn [88],
Richman [186, 187], Delong–Kozak [49] and Richman–Wüthrich [192].
Select an embedding dimension b ∈ N; this is a hyper-parameter that needs to be selected
by the modeler, typically b ≪ K. We define an entity embedding (EE) as follows
\[
e^{\mathrm{EE}}: \mathcal{A} \to \mathbb{R}^{b}, \qquad a_k \mapsto e^{\mathrm{EE}}(a_k).
\]
This assigns to each level ak ∈ A an embedding vector eEE (ak ) ∈ Rb . In total this
entity embedding involves b · K parameters (called embedding weights). These need to
be determined either by the modeler (manually) or during the model fitting procedure
(algorithmically), and proximity in embedding should reflect similarity in (risk) behavior.
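A minimal sketch of an entity embedding lookup (NumPy; the embedding weights are randomly initialized here, whereas in a network they would be learned during model fitting):

```python
import numpy as np

rng = np.random.default_rng(8)

levels = ["accountant", "actuary", "economist", "quant", "statistician", "underwriter"]
K, b = len(levels), 2                      # K levels, embedding dimension b << K

# embedding weights: a (K x b) matrix, one row e_EE(a_k) per level; these b*K
# weights are either set manually or learned jointly with the other parameters
E = rng.normal(scale=0.1, size=(K, b))
level_index = {a: k for k, a in enumerate(levels)}

def entity_embedding(x):
    """Map a level a_k to its embedding vector e_EE(a_k) in R^b."""
    return E[level_index[x]]

print(entity_embedding("actuary"))
print(entity_embedding("quant"))
```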
Target encoding
Especially for regression trees, one sometimes uses target encoding, meaning that one
considers not only the categorical covariate, but also the corresponding response.
We assume we have a sample (Y_i, X_i, v_i)_{i=1}^n with categorical covariates X_i ∈ A, real-valued
responses Y_i and weights v_i > 0. We compute the weighted sample means on all
levels a_k ∈ A by
\[
\bar{y}_k = \frac{\sum_{i=1}^{n} v_i\, Y_i\, \mathbf{1}_{\{X_i=a_k\}}}{\sum_{i=1}^{n} v_i\, \mathbf{1}_{\{X_i=a_k\}}}.
\]
These weighted sample means (ȳ_k)_{k=1}^K are used like ordinal levels, replacing the nominal
ones, and similarly to (2.12) we obtain the one-dimensional target encoding embedding
\[
X \in \mathcal{A} \;\mapsto\; \sum_{k=1}^{K} \bar{y}_k\, \mathbf{1}_{\{X=a_k\}}. \qquad (2.16)
\]
Though convincing at first sight, one has to be aware of the fact that this does
not consider any interactions within the covariates, e.g., for scarce levels it may happen
that a high or low value is mainly implied by another covariate, and the resulting target
encoding value (marginal value) ȳ_k is misleading. This is especially an issue in regression
tree constructions if some of the leaves of the regression tree only contain very few
instances (under high-cardinality categorical covariates). A method to deal with scarce
levels is to combine this target encoding scheme with Bühlmann credibility [34]; see also
Micci-Barreca [155]. For this, we try to assess how credible the individual estimates ȳ_k
are, and we improve unreliable ones by mixing them with the global weighted empirical
mean ȳ = \sum_{i=1}^{n} v_i Y_i / \sum_{i=1}^{n} v_i, providing a convex credibility combination
\[
\bar{y}_k^{\mathrm{cred}} = \omega_k\, \bar{y}_k + (1 - \omega_k)\, \bar{y}. \qquad (2.17)
\]
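A sketch of target encoding with a credibility adjustment (pandas/NumPy; the data are simulated, and the credibility weight ω_k = v_k/(v_k + τ) used here is one possible Bühlmann-type choice with tuning constant τ — it is an assumption of this sketch, not the precise specification of the weights in (2.17)).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
n = 1_000
df = pd.DataFrame({
    "job": rng.choice(["actuary", "quant", "underwriter"], size=n, p=[0.6, 0.3, 0.1]),
    "v": rng.uniform(0.5, 1.0, size=n),
})
mean_by_job = df["job"].map({"actuary": 400.0, "quant": 500.0, "underwriter": 600.0})
df["y"] = rng.gamma(shape=2.0, scale=mean_by_job / 2.0)

# weighted sample means per level (target encoding, (2.16))
num = (df["v"] * df["y"]).groupby(df["job"]).sum()
den = df["v"].groupby(df["job"]).sum()
y_bar_k = num / den
y_bar = (df["v"] * df["y"]).sum() / df["v"].sum()          # global weighted mean

# credibility mixing (2.17) with assumed weights omega_k = v_k / (v_k + tau)
tau = 50.0
omega_k = den / (den + tau)
y_cred_k = omega_k * y_bar_k + (1.0 - omega_k) * y_bar

print(pd.DataFrame({"y_bar_k": y_bar_k, "omega_k": omega_k, "y_cred_k": y_cred_k}))
```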
Standardization
Often, one standardizes covariates. Assume that we have n instances with a continuous
covariate (X_i)_{i=1}^n. Standardization considers the transformation
\[
X \;\mapsto\; \frac{X - \widehat{m}}{\widehat{s}}, \qquad (2.18)
\]
where m̂ ∈ R is the empirical mean and ŝ > 0 the empirical standard deviation of (X_i)_{i=1}^n.
MinMaxScaler
The MinMaxScaler transforms the covariate to the interval [−1, 1],
\[
X \;\mapsto\; 2\, \frac{X - \min_{1\le i\le n} X_i}{\max_{1\le i\le n} X_i - \min_{1\le i\le n} X_i} - 1. \qquad (2.19)
\]
Categorization
Finally, especially in GLMs, one often discretizes continuous covariates by binning them
into categorical classes, e.g., one builds age classes. This is often done to provide more
robust functional forms to particular covariates within a GLM framework; of course, this
could also be achieved by splines. Select a finite partition (I_k)_{k=1}^K of the support of the
continuous covariate X. Then, we can assign a categorical value a_k ∈ A to X if it falls
into I_k, that is,
\[
X \;\mapsto\; \sum_{k=1}^{K} a_k\, \mathbf{1}_{\{X\in I_k\}}. \qquad (2.20)
\]
This categorization then allows one to treat the continuous covariate X as a categorical
one, e.g., using dummy encoding in regression modeling; this is very frequently used in
actuarial GLMs.
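A short sketch of the three transformations (2.18)-(2.20) (NumPy/pandas; the age variable is simulated and the bin boundaries are chosen arbitrarily for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
age = rng.uniform(18.0, 90.0, size=1_000)          # a continuous covariate

# standardization (2.18)
age_std = (age - age.mean()) / age.std(ddof=1)

# MinMaxScaler (2.19): maps the observed range to [-1, 1]
age_mm = 2.0 * (age - age.min()) / (age.max() - age.min()) - 1.0

# categorization (2.20): bin into age classes and treat them as categorical levels
bins = [18, 25, 35, 45, 55, 65, 90]
age_class = pd.cut(age, bins=bins, include_lowest=True)
print(pd.get_dummies(age_class, drop_first=True).head())   # dummy coding of the classes
```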
provides the lowest generalization error under finite sample fitting, it avoids in-sample
noise (over-)fitting, and it is very beneficial in explaining a model. We have already
touched upon this topic in the AIC model selection (1.19), which quantifies a penalization
for model complexity under MLE.
In practice, knowing neither the true model nor the (causal) regression structure and the
factors that impact the responses, one often starts with slightly too large models and
tries to shrink them by regularization. Regularization penalizes model complexity and/or
extreme regression coefficients; typically, a zero coefficient means that the corresponding
term is dropped from the regression function, making a model more sparse.
The most popular regularization techniques include ridge regularization (also known as
Tikhonov regularization [219] or L2 -regularization), the LASSO regularization of Tibshi-
rani [217] (also known as L1 -regularization) and the elastic net regularization of Zou–
Hastie [250]. Furthermore, there are more specialized techniques like the fused LASSO
of Tibshirani et al. [218] for ordered features, or the group LASSO by Yuan–Lin [246],
which we are also going to present in this section. There are more methods, e.g., smoothly
clipped absolute deviation (SCAD) regularization of Fan–Li [66], which are less relevant
for our purposes. An excellent reference on sparse regression is the monograph of Hastie
et al. [94].
2.4.2 Regularization
We come back to the parametric regression estimation problem (2.8). It involves selecting
the optimal regression function µ_ϑ̂ from a class of candidate models M = {µ_ϑ}_ϑ that are
parametrized by ϑ. Let us assume that the parameter ϑ is a (r + 1)-dimensional vector
ϑ = (ϑ0 , ϑ1 , . . . , ϑr )⊤ ∈ Rr+1 ,
where we typically assume that (ϑj )rj=1 parametrizes the terms in µϑ (X) that involve the
covariates X, and ϑ0 is a parameter for the covariate-free part of the regression function
that determines the overall level of the regression; ϑ_0 is referred to as the bias term, and
this is best understood from the log-link GLM structure (1.3). For reasons explained
in Section 4.1, below, this bias term ϑ0 should always be excluded from regularization.
In other words, if we regularize the entire regression parameter ϑ, we drop out of the
framework of strictly consistent loss functions, even if the selected loss function L is
strictly consistent. Denote by ϑ\0 = (ϑ1 , . . . , ϑr )⊤ ∈ Rr the parameter vector excluding
the bias term ϑ0 .
A regularized parameter estimation is achieved by considering the optimization problem
$$\hat{\vartheta} \in \underset{\vartheta}{\arg\min} \left( \sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\big(Y_i, \mu_\vartheta(X_i)\big) + \eta\, R(\vartheta_{\setminus 0}) \right), \qquad (2.21)$$
for a regularization parameter $\eta \ge 0$ and a penalty function $R(\cdot)$.
Note that for larger sample sizes n, regularization should be weaker which is naturally
achieved by dropping any scaling 1/n in (2.21).
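For illustration, a sketch of regularized fitting with the R package glmnet (not the implementation used in these notes), assuming the learning sample learn of the French MTPL example of Listing 3.1 below; glmnet leaves the intercept (bias term) unpenalized by default, and alpha interpolates between ridge (alpha = 0) and LASSO (alpha = 1):

library(glmnet)
x <- model.matrix(~ DrivAge + VehBrand + VehGas + Density + Area, data = learn)[, -1]
fit <- cv.glmnet(x, y = learn$ClaimNb, family = "poisson",
                 offset = log(learn$Exposure), alpha = 1)   # LASSO; lambda (eta) tuned by cross-validation
coef(fit, s = "lambda.min")                                 # sparse regression parameter estimate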
[Figure: ridge vs. LASSO penalty functions (left) and the corresponding constraint regions in the coordinates (θ1, θ2) (right).]
This version is not used very often in applications, mainly because optimizing (2.24) is difficult; in fact, LASSO regularization (2.23) can be considered as a tractable surrogate.
with α ∈ [0, 1]. The elastic net regularization overcomes some issues of LASSO, e.g., LASSO does not necessarily provide a unique solution for a linear regression problem with the square loss function L. It also has the tendency to group effects by assigning similar weights to correlated covariate components.
We give some further remarks on these standard regularization methods.
• In case of a linear regression with the square loss function, there is always a unique
(closed-form) solution to the ridge regression problem, because we minimize a con-
vex (quadratic) objective function. The LASSO regression is more complicated,
uniqueness is not guaranteed, and it is typically solved by the method of Karush–
Kuhn–Tucker (KKT) [114, 129], and using the so-called soft-thresholding operator;
see Hastie et al. [94].
• Neither the best-subset selection regression nor the SCAD regression leads to a convex minimization problem.
where each group ϑ(k) ∈ Rdk contains dk components of the regression parameter ϑ\0 .
Group LASSO regularization is obtained by solving
$$\hat{\vartheta} \in \underset{\vartheta}{\arg\min} \left( \sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\big(Y_i, \mu_\vartheta(X_i)\big) + \sum_{k=1}^{G} \eta_k\, \big\| \vartheta^{(k)} \big\|_2 \right), \qquad (2.26)$$
for regularization parameters $\eta_k \ge 0$. The fused LASSO proposal enforces sparsity in the components, but also sparsity in the differences of the regression parameter values of adjacent variables. It considers first order differences, which are related to derivatives of
functions, but one could also consider second (or higher) order differences.
Finally, one may want to enforce that parameters are positive or that first differences are positive. This suggests considering, respectively, for positivity
$$\hat{\vartheta} \in \underset{\vartheta}{\arg\min}\; \sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\big(Y_i, \mu_\vartheta(X_i)\big) + \sum_{j=1}^{r} \eta_j\, (0 - \vartheta_j)_+ ,$$
2.5 Outlook
We have introduced all the technical and mathematical tools to now dive into predictive
modeling. Roughly speaking, there are three main types of regression model classes
that are used for actuarial modeling. (1) There are parametric GLM-type models; these include feed-forward neural networks, recurrent and convolutional neural networks as well as transformers. These models have in common that the architecture is fixed before model fitting, and this gives a fixed number of parameters ϑ to be determined. (2) There are non-parametric regression tree-type models; these include random forests and gradient boosting machines. These models have in common that we do not start from a fixed complexity, but we let the models grow during fitting by searching for more structure in the data. (3) There are nearest neighbor and kernel based models. These models are
based on topologies and adjacency relations, under the assumption that locally we have
similar predictions, and these predictions can be obtained by locally smoothing the noisy
responses. For example, local regression and isotonic regression belong to this class.
Generalized linear models (GLMs) are the core models in predictive modeling, and, still
today, they are the state-of-the-art in practice for solving actuarial and financial problems
because they have many advantages over more advanced (and more complicated) machine
learning models; we refer to the introduction to Chapter 2. GLMs were introduced in
1972 by Nelder–Wedderburn [164] and the standard monograph on GLMs is the book of
McCullagh–Nelder [150]. This chapter on GLMs will set the stage for later chapters on
machine learning methods and AI tools. In these later chapters, we will see that neural
networks can be seen as a generalization of GLMs.
GLMs are based on the EDF (2.1). Model fitting is done by MLE which can either
be achieved by maximizing the log-likelihood function of the selected EDF (2.1) or by
minimizing the corresponding deviance loss function, see (2.6) and Table 2.2. Numerical
optimization for parameter fitting is usually done by Fisher’s scoring method and the
iteratively re-weighted least squares (IRLS) algorithm; this is one of the key contributions
in the original work of Nelder–Wedderburn [164].
for a given regression parameter ϑ ∈ Rq+1 . Thus, after applying the link function g,
one postulates a linear functional form in the components of X, expressed by the scalar
(dot) product ⟨ϑ, X⟩ between ϑ and X; we implicitly extend the covariate X by a bias
component X0 ≡ 1 in this GLM chapter, see (2.9). The parameter ϑ0 is called intercept
or bias. In this chapter, ϑ ∈ Rq+1 is generally a (q + 1)-dimensional vector; for the chosen
notation we also refer to the footnote on page 16.
The GLM assumption (3.1) provides the following GLM structure for the conditional
mean of the response Y , given the covariates X,
Example 3.1 (log-link GLM, revisited). The most popular link function in actuarial
pricing is the log-link g(·) = log(·). This log-link GLM has already been introduced in
Section 1.3, in particular, we refer to the regression function defined in (1.3) and (1.4),
respectively. We can rewrite this conditional mean functional as
$$X \mapsto \mu_\vartheta(X) = \mathbb{E}[\,Y \mid X\,] = \exp\langle \vartheta, X\rangle = e^{\vartheta_0} \prod_{j=1}^{q} e^{\vartheta_j X_j}.$$
This is a multiplicative best-estimate pricing structure with price factors (price relativi-
ties) eϑj Xj . This multiplicative pricing structure is transparent and interpretable, e.g., if
ϑj > 0 we can easily read off the increase in best-estimate implied by an increase in Xj .
The bias term eϑ0 gives the base premium and the calibration of the GLM (note that
the base premium is not the same as the average premium, the latter averages over the
covariate distribution X ∼ P and considers the entire regression parameter ϑ). Shifting
the bias term ϑ0 shifts the average price level.
[Figure 3.1: log-link GLM regression function µ(X) as a function of the age X1, shown separately for female and male policyholders.]
Figure 3.1 shows a log-link GLM regression function X 7→ µϑ (X) = exp⟨ϑ, X⟩, with
a two-dimensional covariate X ∈ R2 and a regression parameter ϑ ∈ R3 . The first
component X1 ∈ [18, 90] models the age of the policyholder, given on the x-axis of
the graph, and the second component X2 ∈ {0, 1} is a binary categorical component
with X2 = 0 for male and X2 = 1 for female (blue and red colors). That is, it uses
dummy coding with male being the reference level, see (2.14). The choice of the log-link gives the easy interpretation that females differ from males (at the same age X1) by a multiplicative factor of $e^{\vartheta_2}$; the selected regression parameter is negative, ϑ2 < 0, in the picture. Increasing the age X1 by one year increases the best-estimate by a multiplicative factor of $e^{\vartheta_1}$, since the selected regression parameter is positive, ϑ1 > 0, here. Finally, the bias term $e^{\vartheta_0}$ globally shifts (calibrates) the overall level. ■
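In R, the price relativities of such a fitted log-link GLM can be read off directly from the estimated coefficients; a minimal sketch, assuming fit is a fitted log-link Poisson GLM such as the one shown later in Listing 3.1:

relativities <- exp(coef(fit))     # base premium exp(theta_0) and price factors exp(theta_j)
relativities["(Intercept)"]        # base premium e^{theta_0}
relativities["VehGasRegular"]      # multiplicative factor of a regular-gas vehicle vs. the reference level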
Figure 3.1 gives a nice picture of a transparent and interpretable regression function based
on the log-link, and it shows the main advantages of using the log-link. In practice, this
regression function is not known and needs to be estimated from noisy data, i.e., we
need to find the (true) regression parameter ϑ ∈ Rq+1 from a (noisy) learning sample
(Yi , X i , vi )ni=1 , after having chosen a suitable link function g. We could just select any
strictly consistent loss function L, see Section 1.4, and then perform model fitting, or
rather loss minimization1 (2.8), to find a regression parameter estimate ϑb for ϑ. We
have seen three examples of that type for the log-link GLM, see Examples 1.2, 1.6 and
1.7. However, we have argued in Section 1.4.3 that on finite samples L = (Yi , X i , vi )ni=1
one can find the most accurate models (on average) if the selected strictly consistent
loss function L reflects the properties of the responses (Yi )ni=1 ; this is grounded in the
theoretical work of Gourieroux et al. [87].
Select the member of the EDF that is a reasonable distributional choice for the responses,
given the covariates, and set the assumption
That is, we make the canonical parameter θ = θ(X i ) ∈ Θ dependent on the covariates.
Assuming the GLM regression structure (3.1), applying the EDF mean property (2.2),
and using the one-to-one correspondence between the canonical parameter and the mean
gives us
$$g\big(\kappa'(\theta)\big) = \langle \vartheta, X\rangle,$$
or equivalently, using the canonical link $h = (\kappa')^{-1}$ of the selected EDF, we can solve this for the canonical parameter
$$\theta = \theta(X) = \theta_\vartheta(X) = h\big( g^{-1}\langle \vartheta, X\rangle \big). \qquad (3.4)$$
This explains the relationship between the canonical parameter θ = θ(X) = θϑ (X) ∈ Θ
of the selected EDF and the selected GLM regression function. The unknown regression
parameter ϑ ∈ Rq+1 enters this canonical parameter; pay attention to the distinguished use of θ ∈ Θ for the canonical parameter and ϑ ∈ Rq+1 for the regression parameter.
¹ Often we call the minimization of a loss function 'model fitting'; however, at the current stage we do
not really have a ‘model’, but only a regression function assumption, see page 32. For having a ‘model’, we
also need a distributional assumption. This is often disregarded within the machine learning community,
but this has played a key role in advancing actuarial modeling, see, e.g., the claims reserving uncertainty
results developed by Mack [146].
From (3.4) we observe that there is a distinguished link choice for a GLM, called the
canonical link choice. Namely, select g = h. In the case of this canonical link choice
g = h, the canonical parameter coincides with the linear predictor
Mathematically speaking, the canonical link choice has some real advantages, e.g., MLE is always a concave maximization problem and, thus, a solution is unique (provided we have a full-rank design matrix X). However, practical needs often overrule mathematical properties, and for various other reasons the modeler typically prefers the log-link for g. The log-link is the canonical link if and only if we select the Poisson model within the EDF. Therefore, in many regression problems, we do not work with the canonical link for g.
Table 3.1: Canonical link h(µ) = (κ′ )−1 (µ) of selected EDF distributions.
Table 3.1 shows the canonical links h of the most popular members of the EDF. Usually,
the canonical link is chosen for g in case of the Gaussian, the Poisson and the Bernoulli
models. In these cases we have an effective domain Θ = R, and, as a result, there is no
domain conflict resulting from the linear predictor (3.5) for any possible choices of the
covariates X ∈ Rq. This does not hold in the other cases, e.g., in the gamma case we have a one-sided bounded effective domain Θ = (−∞, 0). This gives constraints on the possible choices of ϑ and X in (3.5) to have a well-defined model. This difficulty can be circumvented in the gamma case by selecting the log-link for g, i.e.,
$$\theta_\vartheta(X) = h\big( g^{-1}\langle \vartheta, X\rangle \big) = -1/\exp\langle \vartheta, X\rangle = -\exp\langle -\vartheta, X\rangle < 0,$$
being a well-defined canonical parameter for any ϑ ∈ Rq+1 and X ∈ Rq . This is a main
reason in practice to choose a link function g different from the canonical link h of the
selected EDF distribution.
We also remark that the balance property that is going to be introduced in Section 4.1, below, is important in insurance pricing. This balance property is fulfilled for MLE fitted GLMs if and only if we work with the canonical link g = h. Otherwise, a balance correction will be necessary; for more discussion we also refer to Lindholm–Wüthrich [140].
There remains fitting the regression function (3.1) based on a learning sample L =
(Yi , X i , vi )ni=1 . As outlined in Section 2.1.3, there are two different ways to receive the
MLE of ϑ. We can either maximize the log-likelihood function or we can minimize the
corresponding deviance loss function.
For the given learning sample L = (Yi , X i , vi )ni=1 we receive the log-likelihood function
under the previous choices (and assuming independence between the instances), see (2.1),
$$\vartheta \mapsto \ell(\vartheta) = \sum_{i=1}^{n} \frac{v_i}{\varphi}\, \big[ Y_i\, \theta_\vartheta(X_i) - \kappa\big(\theta_\vartheta(X_i)\big) \big] + c(Y_i, \varphi/v_i), \qquad (3.6)$$
where θϑ (X i ) contains the regression parameter ϑ, see (3.4). Solving the resulting max-
imization problem gives the MLE, subject to existence,
One may argue that working with the log-likelihood function (3.6) looks too complicated
because the formula seems quite involved. In all future derivations we will translate the
MLE problem (3.7) to the corresponding deviance loss minimization problem.
We translate the selected EDF to the corresponding deviance loss function L, see (2.6)
and Table 2.2. This gives us the (same) MLE as in (3.7) by solving the minimization
problem
$$\hat{\vartheta}^{\rm MLE} \in \underset{\vartheta}{\arg\min}\; \sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\big(Y_i, \mu_\vartheta(X_i)\big). \qquad (3.8)$$
The regression fitting problem (3.8) is now in an attractive form that is going to be
used throughout and in more generality below according to the following recipe:
(1) Specify the distributional model within the EDF for modeling the responses
Y , given X; see (3.3). This gives us the choice of the cumulant function κ.
(3) Then, one can choose any family M = {µϑ }ϑ of regression functions µϑ :
X → R, and the optimal one is found by solving (3.8) for this family M.
• For GLM fitting we always require the design matrix X to have full rank q + 1 ≤ n.
For categorical inputs this requires dummy coding, see Section 2.3.2.
• Under the canonical link choice g = h, the deviance loss minimization (3.8) is a convex problem; together with the full-rank condition on the design matrix, the solution is unique.
• For non-canonical link choices, the objective function in (3.8) is not necessarily
convex, and this needs to be checked case by case; compare Examples 5.5 and 5.6
of Wüthrich–Merz [243] which show that the gamma log-link GLM is a convex
fitting problem, whereas the inverse Gaussian log-link GLM is not.
We can then apply the model validation and model selection tools of Chapter 1 such
as cross-validation. One could also compute a regularized MLE, e.g., adding a LASSO
penalization to (3.8) to receive a sparse GLM, see Section 2.4. For a more detailed outline
² One could also use the gradient descent methods described in Section 5.3, below. Gradient descent has the advantage of being able to deal with big data and big models; Fisher's scoring method and the IRLS algorithm usually converge faster because they consider not only the gradient, but also the second derivatives (Hessians). The size of the data and the model will decide whether these Hessians can be computed efficiently.
about model fitting and validation we refer to Wüthrich–Merz [243, Chapter 5], and we
discuss some GLM related techniques in Section 3.3, below.
Remark 3.2. In general, the GLM fitted model does not fulfill the balance property discussed below in (4.4). Therefore, one often adjusts the estimated bias term $\hat\vartheta_0^{\rm MLE}$ to rectify this balance property. An exception is the choice of the canonical link g = h under which the balance property always holds, i.e., under the canonical link choice, the GLM estimated model is an (in-sample) re-allocation of the total observed claim; this is explained below in Section 4.1, and we also refer to Lindholm–Wüthrich [140]. ■
We then pre-process the covariates, and we fit a Poisson log-link GLM to the learning
sample L = (Yi , X i , vi )ni=1 , i.e., using the Poisson deviance loss for L, we solve
$$\begin{aligned}
\hat{\vartheta}^{\rm MLE} &= \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} v_i\, L\big(Y_i, \mu_\vartheta(X_i)\big) \\
&= \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} 2 v_i \left[ \mu_\vartheta(X_i) - Y_i - Y_i \log \frac{\mu_\vartheta(X_i)}{Y_i} \right] \qquad (3.10)\\
&= \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} 2 \left[ v_i\, \mu_\vartheta(X_i) - N_i - N_i \log \frac{v_i\, \mu_\vartheta(X_i)}{N_i} \right].
\end{aligned}$$
This is precisely Example 2.2. Recall that Yi = Ni/vi are the observed claims frequencies, where Ni ∈ N0 denote the observed claims counts.
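The Poisson deviance loss (3.10) can also be evaluated directly and compared with the residual deviance reported by R; a sketch, assuming fit is the fitted GLM shown in Listing 3.1 below, with exposures v and claim counts N:

v   <- learn$Exposure
N   <- learn$ClaimNb
vmu <- predict(fit, type = "response")                        # fitted expected counts v_i * mu_theta(X_i)
dev_i <- 2 * (vmu - N + ifelse(N > 0, N * log(N / vmu), 0))   # unit Poisson deviances
sum(dev_i)                                                    # matches deviance(fit) up to rounding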
The results of the GLM fitting procedure are presented in Listing 3.1, below. We discuss
this example in the next section, and a more detailed description and analysis of this
GLM example is contained in the accompanying notebook on GLMs; see
notebook-insert-link
with q′ < q. This nested model has a (q′ + 1)-dimensional regression parameter $\vartheta^{H_0} \in \mathbb{R}^{q'+1}$. We can now set up a statistical test for the null-hypothesis that the data has been
generated by the smaller nested model (3.12) against the alternative of the full model
(3.11). Based on the learning sample L = (Yi , X i , vi )ni=1 , we estimate in both models
the regression parameter with MLE, providing us with the MLEs ϑbfull in the full model
and ϑbH0 in the nested model, respectively. The resulting likelihood ratio of the two fitted
models gives us, refer to (2.1) for the EDF densities,
Qn
i=1 fθϑ
bH0 (X i )
(Yi )
Qn ≤ 1. (3.13)
i=1 fθϑ
bfull (X i )
(Yi )
This likelihood ratio is upper bounded by one because we apply MLE to two nested models, the one in the denominator having more degrees of freedom in the MLE. The rationale behind the LRT now is as follows. If this ratio is fairly close to one, the null-hypothesis model is as good as the full model, and one cannot reject the null-hypothesis that the data has been generated by the smaller model.
The test statistic (3.13) is not in a convenient form and, instead, one considers a logged version thereof,
$$T = -2 \sum_{i=1}^{n} \Big[ \log f_{\theta_{\hat\vartheta^{H_0}}(X_i)}(Y_i) - \log f_{\theta_{\hat\vartheta^{\rm full}}(X_i)}(Y_i) \Big] \;\ge\; 0. \qquad (3.14)$$
If this test statistic T is large, the null-hypothesis should be rejected. Using asymptotic MLE theory, the distribution of this test statistic T under the null-hypothesis can be approximated by a χ²-distribution with q − q′ degrees of freedom; see Fahrmeir–Tutz [65, Section 2.2.2].
Figure 3.2: χ²-density rejection area (in orange) for a significance level of 5% with q − q′ = 3 degrees of freedom.
Figure 3.2 shows in orange color the rejection region for the test statistic T defined in (3.14) at a significance level of 5% for dropping q − q′ = 3 covariate components from X. If the learning sample L = (Yi, X i, vi)ni=1 provides an observed test statistic T that has a p-value less than 5% (lies in the orange region), we reject the null-hypothesis at this significance level and we go for the bigger model; otherwise this LRT does not support working with the bigger model.
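In R, the LRT and the Wald tests can be sketched as follows, assuming full is the fitted Poisson GLM of Listing 3.1 below (variable names as in that listing):

nested <- update(full, . ~ . - Area)     # null-hypothesis model dropping Area
anova(nested, full, test = "Chisq")      # LRT: T compared with a chi^2 distribution with q - q' d.o.f.
summary(full)$coefficients               # Wald z-statistics and p-values, as reported in Listing 3.1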
• The LRT test statistic T in (3.14) can also be expressed as a difference of deviance
losses. It then receives the interpretation of measuring the distance of the two
models to the empirical sample in terms of a KL divergence; see Wüthrich–Merz
[243, Section 5.1.7].
• The LRT can only be applied to nested models, i.e., with responses having the
same distributions under both models, and one having a nested regression function
within the other one. For non-nested models, only AIC (1.19) and BIC (1.20) apply.
Alternatively, we could employ (group) LASSO regularization during model fitting,
see Section 2.4. This already leads to sparsity and parsimony during model fitting,
but it requires fine-tuning of the regularization parameter η. Often, LASSO is only
used for parameter (covariate) selection, and once the covariates are selected, in an
additional step, a non-regularized GLM is fitted.
• The LRT requires that we fit two models, the full model and the nested model. The
Wald test [230] is rather similar but with a smaller computational effort. For the
Wald test, one only needs to fit the full model, and the nested one is then approx-
imated by asymptotic MLE arguments. This is computationally more attractive,
but it is less accurate because it involves more approximations; see Wüthrich–Merz
[243, Section 5.3.2].
• The LRT test statistic (3.14) requires the true dispersion parameter φ. Usu-
ally, this is not available. However, if we estimate this dispersion parameter by
MLE (consistently in a statistical sense) in the bigger (full) model, the asymp-
totic normality results carry over to the case with consistently estimated dispersion
parameter.
We illustrate the LRT and the Wald test on the French MTPL claims frequency data in-
troduced in Section 3.3.2. Listing 3.1 shows the results of the fitted Poisson log-link GLM
(3.10). We consider 5 covariates: a continuous covariate DrivAge which was discretized
into age classes and which was then implemented by dummy coding, the categorical
VehBrand variable implemented by dummy coding, a binary VehGas variable also imple-
mented by dummy coding, and two continuous variables Density and Area. This results
in a regression parameter ϑ ∈ Rq+1 of dimension q + 1 = 20.
The last column p value of Listing 3.1 shows the p-values of the Wald tests. As explained
above, these Wald tests are essentially the same as the LRTs, but with less computational effort because only the full model needs to be fitted, and we do not need to refit a null-
hypothesis model for all q + 1 = 20 Wald tests in Listing 3.1. These Wald tests consider
dropping one component of ϑ at a time against the full model. As a result, we cannot simply drop all the variables that have a large p-value, because the simultaneous consideration of multiple individual variable droppings is not a nested model consideration. That is, a multiple comparison does not work if these multiple models are not nested. In
other words, variable dropping needs to be done recursively, i.e., once we have dropped
the least significant variable, we have a new full model, and we need to re-perform
the whole exercise based on this new full model. This precisely results in the variable
selection process that is typically known as backward selection, because every Wald or
LRT has to be done recursively in a full vs. null-hypothesis model context. If this model
reduction is combined with increasing a model by adding (new) variables, it is called
backward-forward-stepwise selection, e.g., we may decide to drop recursively variable A,
then variable B, and then variable C, and after these three reductions it may turn out that
adding again variable A is beneficial. This allows one to stepwise (recursively) decrease
Listing 3.1: Poisson log-link GLM example on the French MTPL claims frequency data.
1 glm ( formula = ClaimNb ~ DrivAge + VehBrand + VehGas + Density +
2 Area , family = poisson () , data = learn , offset = log ( Exposure ))
3
4 Deviance Residuals :
5 Min 1Q Median 3Q Max
6 -0.8890 -0.3393 -0.2535 -0.1387 7.6569
7
8 Coefficients :
9 Estimate Std . Error z value p value
10 ( Intercept ) -3.258957 0.034102 -95.564 < 2e -16 ***
11 DrivAge18 -20 1.275057 0.044964 28.358 < 2e -16 ***
12 DrivAge21 -25 0.641668 0.028659 22.390 < 2e -16 ***
13 DrivAge26 -30 0.153978 0.025703 5.991 2.09 e -09 ***
14 DrivAge41 -50 0.121999 0.018925 6.447 1.14 e -10 ***
15 DrivAge51 -70 -0.017036 0.018525 -0.920 0.357776
16 DrivAge71 + -0.047132 0.029964 -1.573 0.115726
17 VehBrandB2 0.007238 0.018084 0.400 0.688958
18 VehBrandB3 0.085213 0.025049 3.402 0.000669 ***
19 VehBrandB4 0.034577 0.034523 1.002 0.316553
20 VehBrandB5 0.122826 0.028792 4.266 1.99 e -05 ***
21 VehBrandB6 0.080310 0.032325 2.484 0.012976 *
22 VehBrandB10 0.067790 0.040607 1.669 0.095032 .
23 VehBrandB11 0.221375 0.043348 5.107 3.27 e -07 ***
24 VehBrandB12 -0.152185 0.020866 -7.294 3.02 e -13 ***
25 VehBrandB13 0.101940 0.047062 2.166 0.030306 *
26 VehBrandB14 -0.201833 0.093754 -2.153 0.031336 *
27 VehGasRegular -0.198766 0.013323 -14.920 < 2e -16 ***
28 Density 0.094453 0.014623 6.459 1.05 e -10 ***
29 Area 0.028487 0.019909 1.431 0.152471
30 ---
31 Signif . codes : 0 ’*** ’ 0.001 ’** ’ 0.01 ’* ’ 0.05 ’. ’ 0.1 ’ ’ 1
32
33 ( Dispersion parameter for poisson family taken to be 1)
34
35 Null deviance : 153852 on 610205 degrees of freedom
36 Residual deviance : 151375 on 610186 degrees of freedom
37 AIC : 197067
38
39 Number of Fisher Scoring iterations : 6
justified by AIC (1.19) because dropping Area has roughly the same AIC value as the full
model, and since we aim for parsimony we should go for the smaller model in such
a situation. In all other cases, the AIC value increases by dropping the corresponding
variables, see Listing 3.2.
Finally, we show an analysis in Listing 3.3, where we sequentially add one variable after the other, providing the corresponding reduction in the deviance loss (LRT test statistic). The conclusions are essentially the same as above. A critical point that needs to be emphasized for Listing 3.3 is that the order of inclusion matters in this stepwise forward selection. Changing the order may lead to different conclusions. For instance, if two covariate components are highly collinear, then they essentially explain the same phenomenon, and once we have included the first one, we do not need the second one any more in the model, though the second one may have marginally better explanatory power (or slightly different interactions with other variables). That is, the order of inclusion matters in these stepwise forward selections, and the p-values depend on this order; in the example of Listing 3.3, the variables Density and Area are highly collinear, and exchanging the order of these two variables in Listing 3.3 gives a high p-value to Density. The accompanying notebook on GLMs gives more discussion of this example, see
notebook-insert-link
We conclude that variable selection is as much art as science. Increasing the size of the model quickly leads to a combinatorial complexity that does not allow one to explore all possible sub-models (and combinations). This section has been focusing on LRTs and Wald tests because there is a well-founded and understood statistical theory behind the LRT and the Wald test, but, in essence, they are not any different from other model selection procedures. For instance, we can replace all LRTs by LASSO regressions, but then the computational effort will become bigger as LASSO regression needs hyper-parameter (regularization parameter) tuning. This increased computational complexity is common to any machine learning method, because in complex algorithmic models we will no longer be able to rely on an asymptotic likelihood theory. Therefore, first trying to understand the (raw) data is always very beneficial. This allows the modeler to already make a reasonable model choice in the first place, which can then be further refined. Otherwise, we have to go through the tedious backward-forward LRT model selection process because we can only test nested GLMs, and we likely end up with a sort of non-uniqueness,
having equally good non-nested models. This is quite similar to the regression tree con-
structions below, and in regression trees a technique called pruning is used to select the
optimal tree.
Summary
This chapter has set the ground for machine learning methods by giving the basic
tools and intuition from classical statistical and actuarial methods; we especially
refer to the boxes on pages 29 and 54, as well as to the Poisson Example 2.2. We
are now ready to dive into the theory of machine learning tools.
Interlude
Chapters 1 to 3 introduced statistical methods that are part of the core syllabus
of actuarial science, and these chapters have laid a solid foundation for diving into
the machine learning tools. We are going to introduce these tools from Chapter 5
onwards. The present chapter discusses some general techniques that are useful in
various situations, but which are not strictly necessary to understand the machine
learning tools. For this reason, the fast reader may skip this chapter and come
back to it at a later stage.
• In the second part of this chapter, we discuss two general purpose non-parametric
regression methods, local regression and isotonic regression. These are useful in
various situations.
• In the final part of this chapter, we discuss further model selection tools, such as
the Gini score and Murphy’s score decomposition. These are model-agnostic tools,
i.e., they can be used for any regression method.
• Most regression models include an intercept term which is the part of the regression
function that is not influenced by the covariates, see GLM (3.1). In deep learning,
this intercept term is coined bias term, and we need to ensure that it is correctly
specified to avoid the statistical bias.
• There is some concern about unfair discrimination in insurance pricing, and algo-
rithmic decision making more generally. Any kind of unfair treatment of individuals
or groups with similar features is related to a bias in the model construction and/or
the decision making process. This is called an unfair discrimination bias.
In the present section, we focus on the statistical bias which studies the average price
level over the entire insurance portfolio. We assume in all considerations that the first
moments exist.
Global unbiasedness means that the average price level E[vµ(X)] provided by the selected
regression function µ is sufficient to cover the portfolio claim vY on average. This global
unbiasedness is stated w.r.t. the population distribution (Y, X, v) ∼ P.
If we work with an estimated model $\hat\mu_L$, which has been fitted on a learning sample $L = (Y_i, X_i, v_i)_{i=1}^n$, we typically also average over the learning sample L to state the global unbiasedness
$$\mathbb{E}\big[ v\, \hat\mu_L(X) \big] = \mathbb{E}[v\, Y], \qquad (4.1)$$
• The left-hand side of (4.1) reflects that we re-sample both L and (X, v) to verify this
global unbiasedness. In insurance, this may be questionable, as we cannot repeat
an experiment, i.e., we only have one past claims history reflected in the learning
sample L. Bootstrap may be a way of generating different past histories, however,
this in itself maybe problematic because if we have a biased model, this bias remains
in the bootstrap samples as we re-simulate from the very same observations, or in
other words, a bias remains latent and is not discovered by bootstrapping.
• Global unbiasedness (4.1) also re-samples over the covariates X, and an insurer may argue that the company (only) wants to be unbiased for its specific portfolio $T = (Y_t, X_t, v_t)_{t=1}^m$. This would then result in verifying the conditional global unbiasedness
$$\sum_{t=1}^{m} v_t\, \mathbb{E}\big[ \hat\mu_L(X_t) \,\big|\, X_t \big] = \sum_{t=1}^{m} v_t\, \mathbb{E}\big[ Y_t \,\big|\, X_t, v_t \big]. \qquad (4.2)$$
This averages over the learning sample L on the left-hand side, and over the claims $(Y_t)_{t=1}^m$ on the right-hand side, but it keeps the (forecast) portfolio $(X_t, v_t)_{t=1}^m$ fixed. Note that (4.2) will also require an assumption about the dependence between L and T. In fact, one can even go one step further and require that the learning sample L and the test sample T have the identical covariates, i.e., working on a fixed portfolio $(X_i, v_i)_{i=1}^n$ for learning and forecasting. In that case, we require instead
$$\sum_{i=1}^{n} v_i\, \mathbb{E}\big[ \hat\mu_L(X_i) \,\big|\, (X_i, v_i)_{i=1}^n \big] = \sum_{i=1}^{n} v_i\, \mathbb{E}\big[ Y_i \,\big|\, X_i, v_i \big]. \qquad (4.3)$$
On both sides, this only averages over the responses. On the left-hand side, these
responses reflect past responses in L used for fitting, and on the right-hand side
these are the future claims to be forecast.
• These considerations bear the difficulty that the average claims level on the right-
hand sides of (4.1)-(4.3) needs to be available, which is typically not the case. That
is why we discuss the balance property next.
• The balance property is an in-sample property that holds for almost every (a.e.)
realization of the learning sample L. A crucial difference to the unbiasedness defi-
nitions above is that for verifying the balance property we do not need to know the
right price level, but it is fully empirical.
• The correct interpretation of the balance property (4.4) is that we re-allocate the total (aggregate) portfolio claim $\sum_{i=1}^n v_i Y_i$ to all insurance policyholders 1 ≤ i ≤ n, such that this collective bears the entire aggregate claim. This view is at the core of insurance, namely, the collective shares all risks (and claims) within the risk community (solidarity).
• Generally, in actuarial science, model fitting procedures that have this balance
property should be preferred. MLE estimated GLMs using the canonical link com-
ply with the balance property; see Nelder–Wedderburn [164]. That is, for a MLE
fitted GLM we have estimated means
$$\mu_{\hat\vartheta^{\rm MLE}}(X_i) = \widehat{\mathbb{E}}\,[\, Y_i \mid X_i\,] = g^{-1}\big\langle \hat\vartheta^{\rm MLE}, X_i \big\rangle.$$
These estimated means fulfill the balance property if the selected link g = h was
the canonical link of the chosen EDF. Otherwise, the balance property fails to hold.
It can be rectified by modifying the intercept estimate
$$\hat\vartheta_0^{\rm MLE} \;\mapsto\; \hat\vartheta_0^{\rm corrected} = \hat\vartheta_0^{\rm MLE} + \delta, \qquad (4.5)$$
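A minimal sketch of this bias term correction under the log-link: shifting the intercept by δ multiplies all fitted means by exp(δ), so δ can be chosen to restore the balance property in-sample. We assume vectors Y (responses), v (weights) and mu_hat (fitted means of a log-link GLM not using the canonical link):

delta        <- log(sum(v * Y) / sum(v * mu_hat))   # intercept shift, see (4.5)
mu_corrected <- exp(delta) * mu_hat                 # re-scaled means with sum(v * mu_corrected) = sum(v * Y)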
4.1.3 Auto-calibration
For actuarial pricing auto-calibration is an important property. We discuss this in this
section. Auto-calibration has been introduced by Schervish [199] in the context of me-
teorology, and it is discussed in the statistical literature by Tsyplakov [221], Menon et
al. [153], Gneiting–Ranjan [79], Pohle [177], Gneiting–Resin [80] and Tasche [214]. Ac-
tuarial literature discussing auto-calibration includes Krüger–Ziegel [127], Denuit et al. [53],
Fissler et al. [69], Lindholm et al. [138], Lindholm–Wüthrich [140], Wüthrich [240] and
Wüthrich–Ziegel [244].
Definition 4.3. A regression function µ : X → R is auto-calibrated for (Y, X) if, a.s.,
[Figure: violation of auto-calibration; for three price cohorts, the price cohort value µ(X) is compared to the claim E[Y|µ(X)] and the total claim.]
• From a statistical point of view, we should test any fitted regression model for
auto-calibration. Developing powerful statistical tests for auto-calibration is still
an open field of research; see Wüthrich [240].
(1) In the first step, one fits a (GLM) regression function X 7→ µ(X) by regressing
the responses Yi from the covariates X i , i.e., by considering the regression problem
(2) To assess the auto-calibration property of this fitted regression function µb(·), one
applies a second regression step. This second fitting step is a “regression on the
regression”, i.e., one regresses the responses Yt but this time from the real-valued
estimates µb(X t ), i.e., by considering the regression problem Yt ∼ µb(X t ) for the
independent instances 1 ≤ t ≤ m. This second step is performed on the test
sample T . These second fitted regression values are plotted on the y-axis in Figure
4.2, and they provide us with the blue dots in Figure 4.2. If these blue dots are on
the orange diagonal line we have a perfectly auto-calibrated model.
We need suitable regression techniques to perform this second regression step Yt ∼ µ b(X t ).
In Sections 4.2.1 and 4.2.2, below, we meet two non-parametric regression methods that
may serve at performing this second regression step. The method used in Figure 4.2 is a simpler one. It uses a discrete binning w.r.t. the empirical deciles of $(\hat\mu(X_t))_{t=1}^m$, and then it computes the empirical means of the responses belonging to the corresponding bins (and on the x-axis we select the barycenters of the bins).
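A sketch of this decile-binning construction, assuming out-of-sample predictions mu_hat_test and responses Y_test on the test sample T (volumes are ignored here for simplicity):

breaks <- quantile(mu_hat_test, probs = seq(0, 1, by = 0.1))     # empirical deciles of the predictions
bin    <- cut(mu_hat_test, breaks = breaks, include.lowest = TRUE)
x_bar  <- tapply(mu_hat_test, bin, mean)                         # barycenters of the prediction bins (x-axis)
y_bar  <- tapply(Y_test, bin, mean)                              # empirical means of the responses per bin (y-axis)
plot(x_bar, y_bar, col = "blue"); abline(0, 1, col = "orange")   # dots vs. the diagonal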
The resulting two-step fitting procedure is illustrated by the blue dots in Figure 4.2, and
they are compared to the orange diagonal line. If the blue dots lie on this orange diagonal
line, the second regression step does not learn any new estimated means compared to the first estimates $\hat\mu$. This indicates that auto-calibration holds for $\hat\mu$. From Figure 4.2, there is some indication that for some values $\hat\mu(X_t)$ the average responses are misspecified, $\hat\mu(X_t) \neq \mathbb{E}[Y_t \mid \hat\mu(X_t)]$, and, thus, these insurance policies are likely cross-subsidized by other policies to cover the entire portfolio claim. It is also striking that there seems to be non-monotonicity in the blue dots, which indicates that the first regression function $(\hat\mu(X_t))_{t=1}^m$ does not provide the correct ranking.
Note that the type of plot in Figure 4.2 has different names, sometimes they are called lift
plots, auto-calibration plots or T -reliability diagrams; see Gneiting–Resin [80]. There is
not a unique terminology for these kinds of plots, and sometimes it is even contradictory.
In some cases, plots that show both regression steps on the x-axis are called lift plots,
and in some literature the cumulative accuracy profile, studied in Section 4.3.1, below,
is called lift plot.
graphs and results. The standard reference for local regression is Loader [142], who is
also the owner of the R package locfit; the present section is taken from Loader [142,
Chapter 2]. The goal is to locally fit a non-parametric polynomial regression function to
a sample (Yi , Xi , vi )ni=1 that has one-dimensional real-valued covariates Xi ∈ R.
Assume we want to fit a regression value $\hat\mu^{\rm loc}(X)$ in a fixed covariate value X ∈ R by only considering the instances (Yi, Xi, vi) with covariates Xi in the neighborhood of X.
First, we select a bandwidth δ(X) > 0 to define the smoothing window
∆(X) = (X − δ(X), X + δ(X)) .
This determines the neighborhood around X. Only instances with Xi ∈ ∆(X) are considered for estimating $\hat\mu^{\rm loc}(X)$. Second, we introduce a weighting function. Often
the tricube weighting function w(u) = (1 − |u|3 )3 is used, u ∈ [−1, 1]. This weights
the instances i within the smoothing window w.r.t. their relative distances ui = (Xi −
X)/δ(X) to X. The goal then is to fit a spline to the weighted observations in this
smoothing window. For illustration, let us select a quadratic spline
x 7→ µϑ (x; X) = ϑ0 + ϑ1 (x − X) + ϑ2 (x − X)2 ,
with regression parameter $\vartheta = (\vartheta_j)_{j=0}^2 \in \mathbb{R}^3$. This motivates us to consider the following weighted local regression problem around X
$$\hat\vartheta = \underset{\vartheta}{\arg\min}\; \sum_{i=1}^{n} v_i\, \mathbb{1}_{\{X_i \in \Delta(X)\}}\, w\!\left( \frac{X_i - X}{\delta(X)} \right) \big( Y_i - \mu_\vartheta(X_i; X) \big)^2.$$
The fitted local regression value in X is then obtained by setting $\hat\mu^{\rm loc}(X) = \mu_{\hat\vartheta}(X; X) = \hat\vartheta_0$.
We revisit Figure 4.2 which studies the auto-calibration property of a fitted GLM $\hat\mu$. We repeat the two-step fitting procedure discussed in Section 4.1.3, but this time we apply a local regression for the second fitting step $Y_t \sim \hat\mu(X_t)$, based on the independent instances 1 ≤ t ≤ m. For this, we select quadratic splines, and the bandwidth δ(X) is chosen such that the smoothing window ∆(X) contains a nearest neighbor fraction of α = 10% of the data T. The results are presented in Figure 4.3. The conclusion of this
plot is that a priori (without further investigation) we cannot reject the null-hypothesis of having an auto-calibrated GLM $\hat\mu$, because the resulting blue dots fluctuate around the orange diagonal line. The main question is whether these fluctuations are too large, and a second question is why we do not receive a monotone picture. These two issues may indicate that the ordering received by $(\hat\mu(X_t))_{t=1}^m$ is not correct. But to come to a firm conclusion, further analysis is necessary.
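For illustration, the second regression step can be sketched with base R's loess() (used here as a stand-in for the locfit package referenced below), where span plays the role of the nearest neighbor fraction α and degree = 2 gives the local quadratic fit; mu_hat_test, Y_test and v_test are the assumed test sample quantities:

second <- loess(Y_test ~ mu_hat_test, weights = v_test, span = 0.10, degree = 2)
grid   <- seq(min(mu_hat_test), max(mu_hat_test), length.out = 100)
plot(grid, predict(second, newdata = data.frame(mu_hat_test = grid)),
     type = "l", col = "blue", xlab = "first regression", ylab = "second regression")
abline(0, 1, col = "orange")     # diagonal of a perfectly auto-calibrated model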
Naturally, the hyper-parameters of the nearest neighbor fraction α ∈ (0, 1] and the degree of the splines have a crucial influence on the results, and changing these parameters can provide rather different results. Especially, the choice of the nearest neighbor fraction α can be very sensitive: for a small value of α we do not receive a credible estimate $\hat\mu^{\rm loc}(X)$ because the (random) noise dominates the systematic effects, and for α close to one, we consider remote instances i from X which may have a completely different systematic response behavior. Thus, there is a critical trade-off between small and large values of the nearest neighbor fraction α. This also impacts the conclusions gained from Figure 4.3.
[Figure 4.3: auto-calibration (lift) plot using a local regression for the second regression step, on log-scale.]
$$v_i \leftarrow v_i + \ldots + v_{i+k} \qquad \text{and} \qquad Y_i \leftarrow \frac{v_i Y_i + \ldots + v_{i+k} Y_{i+k}}{v_i + \ldots + v_{i+k}}, \qquad (4.8)$$
this reduces the sample size, but accounts for this smaller sample size by increasing the
corresponding weight. Basically, this merging means that we build sufficient statistics on
instances with identical covariates.
Remarks 4.4. • The isotonic regression (4.9) is based on the (strictly consistent)
square loss function for mean estimation. However, every strictly consistent loss
function for mean estimation gives the identical solution; see Barlow et al. [14,
Theorem 1.10].
• The PAV algorithm due to Ayer et al. [8], Miles [158] and Kruskal [128] is used to solve the constrained optimization problem (4.9). Essentially, the PAV algorithm is
based on merging neighboring classes by applying (4.8) if the isotonic assumption
is violated by the corresponding sample means of adjacent bins (indices); for details
see Wüthrich–Ziegel [244, Appendix]. Thus, the PAV algorithm is constructing the
isotonic estimate µb iso by optimally binning the instances, optimal w.r.t. the square
loss objective function and w.r.t. the ranking of the covariates. This precisely
replaces any hyper-parameter choice in isotonic regression that would otherwise
need to be set, e.g., in local regression.
• The isotonic regression only gives regression values in the discrete covariate values $\hat\mu^{\rm iso}(X_i) := \hat\mu^{\rm iso}_i$, 1 ≤ i ≤ n, and typically a step-function interpolation is used.
■
A major argument for the isotonic regression is that it provides an empirically auto-
calibrated regression solution. That is, through (optimally) binning and empirical mean
computations, we obtain
$$\hat\mu^{\rm iso}(X_i) = \hat\mu^{\rm iso}_i = \frac{\sum_{j=1}^{n} v_j\, Y_j\, \mathbb{1}_{\{\hat\mu^{\rm iso}_j = \hat\mu^{\rm iso}_i\}}}{\sum_{j=1}^{n} v_j\, \mathbb{1}_{\{\hat\mu^{\rm iso}_j = \hat\mu^{\rm iso}_i\}}} = \widehat{\mathbb{E}}\Big[\, Y_i \,\Big|\, \hat\mu^{\rm iso}(X_i) \Big], \qquad (4.10)$$
the latter denoting the empirical mean of the instances having regression estimate $\hat\mu^{\rm iso}(X_i)$. Empirical auto-calibration (4.10) expresses that we perform binning in the PAV algorithm, and the bin labels are precisely the empirical means of the bins; see also target encoding (2.16). This verifies empirical auto-calibration.
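A minimal, self-contained R sketch of the weighted PAV algorithm described in Remarks 4.4: the responses are ordered w.r.t. the first fitted regression values, adjacent bins are merged via (4.8) whenever isotonicity is violated, and the bin labels are the weighted empirical means; mu_hat, Y and v are assumed vectors of first-stage predictions, responses and weights:

pav <- function(y, w) {
  val <- y; wt <- w; idx <- as.list(seq_along(y))    # start with one bin per instance
  i <- 1
  while (i < length(val)) {
    if (val[i] > val[i + 1]) {                       # isotonicity violated: pool bins i and i+1
      pooled_w <- wt[i] + wt[i + 1]
      val[i]   <- (wt[i] * val[i] + wt[i + 1] * val[i + 1]) / pooled_w   # merging rule (4.8)
      wt[i]    <- pooled_w
      idx[[i]] <- c(idx[[i]], idx[[i + 1]])
      val <- val[-(i + 1)]; wt <- wt[-(i + 1)]; idx <- idx[-(i + 1)]
      if (i > 1) i <- i - 1                          # re-check the previous bin
    } else i <- i + 1
  }
  fit <- numeric(length(y))
  for (k in seq_along(idx)) fit[idx[[k]]] <- val[k]  # bin means as fitted values
  fit
}
ord <- order(mu_hat)                                 # ranking induced by the first regression
mu_iso <- numeric(length(Y)); mu_iso[ord] <- pav(Y[ord], v[ord])   # empirically auto-calibrated fit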
We revisit the auto-calibration analysis of Figures 4.2 and 4.3, but instead of using decile
binning or a local regression, we perform an isotonic re-calibration (regression). For this
we fit a first regression function $X \mapsto \hat\mu(X)$ to the learning sample $L = (Y_i, X_i, v_i)_{i=1}^n$. As discussed in Section 4.1.3, this first regression function $\hat\mu$ does not necessarily fulfill the auto-calibration property (4.6). The idea of isotonic re-calibration is to apply the re-calibration step (4.7), under the assumption that the first fitted regression function $\hat\mu$ provides the right risk ranking for the true regression function $\mu^*$, that is,
$$\hat\mu(X_t) \le \hat\mu(X_{t'}) \iff \mu^*(X_t) \le \mu^*(X_{t'}).$$
The isotonic re-calibration step then replaces the first estimates by the isotonic regression values,
$$\hat\mu(X_t) \;\mapsto\; \hat\mu^{\rm rc}(X_t) := \hat\mu^{\rm iso}_t.$$
Figure 4.4: Lift plot with isotonic re-calibration; this continues from Figures 4.2-4.3.
We come back to the lift plots of Figures 4.2-4.3, but this time we use an isotonic re-calibration step to receive the blue dots in Figure 4.4. The crucial differences to the previous two plots are: (1) the isotonically re-calibrated lift plot is rank preserving, giving a monotone regression in Figure 4.4; (2) the binning is optimal w.r.t. any strictly consistent loss function for mean estimation and subject to the initial ranking; (3) the solution $\hat\mu^{\rm iso}$ is auto-calibrated and the balance property holds; (4) the Gini score, introduced in Section 4.3.1, below, gives a valid model selection criterion because of auto-calibration.
We conclude with the following result, which is mentioned in Wüthrich–Ziegel [244].
Corollary 4.5. Assume the estimated regression function $\hat\mu : \mathcal{X} \to \mathbb{R}$ gets the ranking correct, i.e., $\hat\mu(X)$ and $\mu^*(X)$ are strictly comonotonic. Then, the true regression function $\mu^*$ satisfies
$$\mu^*(X) = \mathbb{E}\big[\, Y \mid \hat\mu(X)\,\big], \qquad \text{a.s.}$$
Isotonic re-calibration also has some drawbacks. A first one is that the resulting regression function is a non-parametric solution that is no longer, e.g., a GLM, even if the first regression model is a GLM. As a consequence, it is no longer as easily explainable as a GLM, and it is also more difficult to manually change the model. The second disadvantage is that the resulting regression function has discontinuities, which is not very appreciated in insurance pricing. The latter disadvantage can be removed by replacing the step function by linearly interpolating functions between the observations. The third problem is that the isotonic re-calibration step needs special attention at the lower and upper boundaries of the support, as it tends to over-fit in this part of the covariate space, which may require manually merging the largest and the smallest bins, respectively. Coming back to Figure 4.4, the lowest three bins have been merged to ensure that all predicted values are strictly positive, and Figure 4.4 suggests also merging the two or three highest bins because the largest prediction seems an outlier, over-fitting to a single observation.
¹ We call the curve in (4.11) an empirical Lorenz curve, because we take the sample average over the covariates $(X_i)_{i=1}^n$, and the non-empirical version would consider the population distribution for X ∼ P, instead. There is a second ingredient which is the regression function µ. On purpose, we did not put hats on µ because the Gini score (4.13) should be evaluated out-of-sample (this applies to the subsequent cumulative accuracy profile (4.12)). That is, if the regression function $\hat\mu$ is estimated from a learning sample L, then the cumulative accuracy profile $\widehat{C}_{\hat\mu}(\alpha)$ should be computed on an independent test sample T, and there should be two hats in the notation $\widehat{C}_{\hat\mu}$. In this section, we use a generic regression function µ to explain the theory.
This empirical mirrored Lorenz curve measures the contribution of the largest regression values $(\mu(X_i))_{i=\lceil (1-\alpha) n\rceil + 1}^{n}$ to the portfolio average.² It is precisely this property that allows Gini [74, 75] to describe discrimination or disparity in wealth distribution by computing the resulting area under the curve (AUC). The bigger this area, the more disperse are the regression values $(\mu(X_i))_{i=1}^n$.³
For statistical modeling, we replace the empirical Lorenz curve (4.11) by the cumula-
tive accuracy profile (CAP); see Ferrario–Hämmerli [67, Section 6.3.7] and Tasche [213];
Denuit–Trufin [54] call the same construction concentration curve. The (empirical) cu-
mulative accuracy profile of regression function µ(·) is given by
$$\alpha \in (0,1) \;\mapsto\; \widehat{C}_\mu(\alpha) = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} Y_i}\; \frac{1}{n} \sum_{i=\lceil (1-\alpha) n\rceil + 1}^{n} Y_i\,; \qquad (4.12)$$
compared to (4.11), we replace the predictions µ(X i ) by the observations Yi , but, impor-
tantly, we keep the order of the regression values µ(X 1 ) < µ(X 2 ) < . . . < µ(X n ) in the
indices 1 ≤ i ≤ n. Similarly to the empirical Lorenz curve, a better discrimination results
in a bigger AUC, and the maximal AUC is obtained if the claim sizes (Yi )ni=1 provide the
same ordering as the regression values (µ(X i ))ni=1 . This is precisely the motivation to
use this concept for model selection. We illustrate this in Figure 4.5.
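A sketch of the empirical CAP (4.12) and the resulting Gini score (4.13), evaluated out-of-sample on assumed vectors mu_hat_test and Y_test; areas are approximated by Riemann sums and ties in the predictions are ignored:

ord   <- order(mu_hat_test)                                          # order responses by increasing predictions
Y_ord <- Y_test[ord]
cap_model   <- cumsum(rev(Y_ord)) / sum(Y_ord)                       # CAP of the fitted model at alpha = k/n
cap_perfect <- cumsum(sort(Y_ord, decreasing = TRUE)) / sum(Y_ord)   # perfectly ranked responses
cap_null    <- seq_along(Y_ord) / length(Y_ord)                      # null model: diagonal
gini <- sum(cap_model - cap_null) / sum(cap_perfect - cap_null)      # area(A) / area(A + B)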
Figure 4.5: Cumulative accuracy profile (CAP) given by $\alpha \mapsto \widehat{C}_\mu(\alpha)$ used to define the Gini score (4.13); this plot is taken from [241]. The areas A and B refer to the regions between the fitted model, the null model and the perfect model curves.
The orange dotted line in Figure 4.5 shows the cumulative accuracy profile of perfectly aligned responses $(Y_i)_{i=1}^n$ and regression values, and the red line the cumulative accuracy profile w.r.t. the selected regression function $(\mu(X_i))_{i=1}^n$. The dotted blue line corresponds to the null model $\hat\mu_0 = \frac{1}{n}\sum_{i=1}^n Y_i$, not considering any covariates, but the global empirical mean $\hat\mu_0$ instead. The Gini score is defined by, see Figure 4.5,
$${\rm Gini}(\mu) = \frac{{\rm area}(A)}{{\rm area}(A+B)} \;\le\; 1, \qquad (4.13)$$
where the areas under the curves, area(A) and area(A + B), have precisely the meaning as in Figure 4.5. Generally, a bigger Gini score is interpreted as a better discrimination of the selected regression model µ(·) for the responses Y. We remark that this Gini score (defined by the AUC) is equivalent to the receiver operating characteristic (ROC) curve method for binary responses Y; see Tasche [213, formula (5.6a)].
² These considerations are also a well-known method in extreme value theory and risk management. E.g., one speaks about the 20-80 Pareto rule, which means that the 20% largest claims make up 80% of the total claim amount; see Embrechts et al. [63, Section 8.2.3].
³ Remark that in an economic context the Lorenz curve is usually below the 45° line because the summation in the upper tail in (4.11) is switched to the lower tail, compare Figure 4.5 and Goldburd et al. [81, Figure 25]. In machine learning, one typically considers the curve mirrored at the diagonal, giving a sign switch in all inequalities.
There is one critical issue with this model selection technique. Namely, the Gini score
is not a strictly consistent model selection tool. The Gini score is fully rank based,
but it does not consider whether the model lives on the right scale; this is precisely
the point raised in Wüthrich [239]. However, if we can additionally ascertain that the
regression functions µ(·) under consideration are auto-calibrated for (Y, X), the Gini
score is a sensible model selection tool; see Wüthrich [239, Theorem 4.3]. Basically, auto-
calibration lifts the models to the right level (the level of the responses), and the Gini
score then verifies whether the ordering of the policies w.r.t. their responses is optimal.
$$\begin{aligned}
{\rm UNC}_L &= \mathbb{E}\big[ L(Y, \mu_0) \big] \;\ge\; 0,\\
{\rm DSC}_L &= \mathbb{E}\big[ L(Y, \mu_0) \big] - \mathbb{E}\big[ L\big(Y, \mu^{\rm rc}(X)\big) \big] \;\ge\; 0,\\
{\rm MSC}_L &= \mathbb{E}\big[ L\big(Y, \mu(X)\big) \big] - \mathbb{E}\big[ L\big(Y, \mu^{\rm rc}(X)\big) \big] \;\ge\; 0.
\end{aligned}$$
The uncertainty term UNCL quantifies the total prediction uncertainty not using any
covariates in its prediction µ0 = E[Y ], this is the global mean. The discrimination
(resolution) term DSCL quantifies the reduction in prediction uncertainty if we use the
auto-calibrated regression function µrc (X) based on covariate information X, see (4.7).
Finally, the miscalibration term MSCL vanishes if the regression function µ(X) is auto-
calibrated, otherwise it quantifies the auto-calibration misspecification.
In applications, we need to compute these quantities empirically, out-of-sample, similarly
to the previous sections. The auto-calibrated model µrc can be determined with an
isotonic re-calibration step (4.9).
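An out-of-sample sketch of this decomposition for the Poisson deviance loss, using base R's (unweighted) isoreg() for the isotonic re-calibration step; mu_hat_test and Y_test are the assumed test sample predictions and responses, and unit weights are assumed:

pois_loss <- function(y, mu) 2 * (mu - y + ifelse(y > 0, y * log(y / mu), 0))   # Poisson deviance loss
mu0   <- mean(Y_test)                                      # null model: global empirical mean
ord   <- order(mu_hat_test)
mu_rc <- numeric(length(Y_test))
mu_rc[ord] <- isoreg(mu_hat_test[ord], Y_test[ord])$yf     # isotonic re-calibration, see (4.7) and (4.9)
UNC <- mean(pois_loss(Y_test, mu0))                        # uncertainty
DSC <- UNC - mean(pois_loss(Y_test, mu_rc))                # discrimination (resolution)
MSC <- mean(pois_loss(Y_test, mu_hat_test)) - mean(pois_loss(Y_test, mu_rc))   # miscalibration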
There are other decompositions which, e.g., compare a regression model µ to the true one µ∗ w.r.t. the available covariate information; see Fissler et al. [69, formula (17)]. Because the true regression model µ∗ is unknown, this decomposition is of less practical
value. These considerations are not restricted to the true regression model µ∗ , but
one can also compare two regression functions µ1 and µ2 . Positivity of the conditional
miscalibration is then related to convex orders of regression functions. In particular, if
we choose the square loss function for L, such a convex order can easily be obtained by
martingale arguments on filtrations (which reflect increasing sets of information, i.e., if
we increase the information set σ(X) by including more covariate information σ(X + ) ⊃
σ(X), we receive a higher resolution in the resulting true regression function based on
X + ); see Wüthrich–Buser [241, Section 2.5].4
⁴ Basically, this reflects a martingale construction from an integrable random variable and a filtration.
FNNs are also called artificial neural networks (ANNs), multi-layer perceptrons (MLPs),
if they have multiple layers, and more generally, deep learning (DL) architectures. In this
chapter, we study plain-vanilla (standard) FNNs.
suitable structure to enter a (generalized) linear model; this is illustrated in Figure 5.1 by the blue and green boxes. Having done all the preparatory work in the GLM Chapter
3, this FNN extension is a natural one:
$$X \mapsto g(\mu(X)) = \langle \vartheta, X\rangle.$$
Inserting a feature extractor $z^{(d:1)}$ modifies this GLM structure to the FNN equation
$$X \mapsto g(\mu(X)) = \big\langle \vartheta, z^{(d:1)}(X) \big\rangle. \qquad (5.1)$$
This is precisely the (natural) FNN extension going to be discussed in this chapter.
It allows for non-linear structures and interactions of the covariate components.
Items (a) and (b) are hyper-parameters that are selected by the modeler, and the network
weights of item (c) are parameters that are learned during network training (model
fitting).
Having d ∈ N of these FNN layers $(z^{(m)})_{m=1}^d$, with matching input and output dimensions, we compose them to a feature extractor of depth d
$$X \mapsto z^{(d:1)}(X) := \big( z^{(d)} \circ \cdots \circ z^{(1)} \big)(X) \in \mathbb{R}^{q_d}, \qquad (5.3)$$
where the input dimension of the first FNN layer $z^{(1)}$ is the dimension of the covariates X, given by $q_0 := q$. This feature extractor is illustrated in the blue box of Figure 5.1.
where $w^{(d+1)} \in \mathbb{R}^{q_d+1}$ is the output/readout parameter, including a bias term $w_0^{(d+1)}$, and $g^{-1}$ is the inverse of the chosen link function. Compared to the GLM in (3.2), the (only) difference is that we replace the original covariates $X \in \mathbb{R}^{q+1}$ by the newly learned representation $z^{(d:1)}(X) \in \mathbb{R}^{q_d+1}$ from the feature extractor (5.3). For notational convenience, we change the notation of the readout parameter in (5.1) to $w^{(d+1)}$, compare to (5.4).
In the next sections, we discuss the construction of the FNN layers z (m) in careful detail.
Table 5.1: Popular choices of non-linear activation functions φ and their derivatives φ′; Φ denotes the standard Gaussian distribution function, and $\Phi'(x) = e^{-x^2/2}/\sqrt{2\pi}$ its density.
The activation functions and their derivatives in Table 5.1 are illustrated in Figure 5.2.
They have different properties, e.g., the first two activation functions are bounded, which can be an advantage or a disadvantage, depending on the problem to be solved (do we want bounded or unbounded functions?). The hyperbolic tangent is symmetric in zero, which can be an advantage over the sigmoid in deep neural network fitting (because it is naturally calibrated to zero and does not require adjusting biases). The ReLU is an activation function that is very popular in the machine learning community; it leads to sparsity in networks, and it is not differentiable in zero but it has a sub-gradient because it is a convex function. The SiLU is a smooth version of the ReLU, but it is neither monotone nor convex, see Figure 5.2. The GELU has recently gained popularity in transformer architectures, and it has some similarity with the SiLU, see Figure 5.2. Generally, it is difficult to give good advice for a specific selection of the 'best' activation function, and this should rather be part of the hyper-parameter tuning for the specific actuarial problem to be solved.
[Figure 5.2: activation functions φ(x) (left) and their derivatives φ′(x) (right) for the sigmoid, hyperbolic tangent, ReLU, SiLU and GELU.]
In this section, we formalize the FNN layer z (m) : Rqm−1 → Rqm introduced in (5.2). This
step is a bit technical, but basically it relates to defining all the connections between the
neurons (units) illustrated by the lines between the circles in Figure 5.1.
Select an activation function φ. The FNN layer $z^{(m)}$ is, for $x \in \mathbb{R}^{q_{m-1}}$, defined by
$$z^{(m)}(x) = \Big( z_1^{(m)}(x), \ldots, z_{q_m}^{(m)}(x) \Big)^\top, \qquad (5.5)$$
We interpret this as follows. Every neuron $z_j^{(m)}(\cdot)$ corresponds to a circle in Figure 5.1. Each of these neurons performs a GLM operation with inverse link φ, see (5.6). That is, every neuron performs a data compression (projection) from $x \in \mathbb{R}^{q_{m-1}}$ to the real line $z_j^{(m)}(x) \in \mathbb{R}$, $1 \le j \le q_m$. This inevitably results in a loss of information. To compensate for this loss of information, each of the $q_m$ neurons $z_j^{(m)}(\cdot)$ performs a different compression, reflected by different network weights $w_j^{(m)}$, so that (hopefully) the relevant information for prediction is extracted by the feature extractor $z^{(d:1)}$.
As mentioned above, the selections of the activation function φ and of the number of neurons $q_m$ are hyper-parameters selected by the modeler, whereas the (optimal) network weights $w_j^{(m)}$ are learned during network training, see Section 5.3, below.
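A minimal R sketch of a single FNN layer (for illustration only): each of the q_m neurons applies the activation φ to its own affine compression of the input; W and b are hypothetical, randomly initialized network weights:

fnn_layer <- function(x, W, b, phi = tanh) {
  phi(as.vector(W %*% x + b))      # q_m neurons: phi(b_j + <w_j, x>), W of dimension q_m x q_{m-1}
}
set.seed(1)
q0 <- 16; q1 <- 8
W1 <- matrix(rnorm(q1 * q0, sd = 0.1), nrow = q1)   # hypothetical weights of the first layer
b1 <- rep(0, q1)                                    # bias (intercept) terms
z1 <- fnn_layer(rnorm(q0), W1, b1)                  # learned representation z^(1)(X) in R^{q1}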
We can now paste everything together by composing the FNN layers (assuming matching input and output dimensions) to the feature extractor (5.3), and then apply the readout to this feature extracted covariate information. From (5.5)-(5.6), it follows that each FNN layer $z^{(m)}$ has network weights $(w_1^{(m)}, \ldots, w_{q_m}^{(m)})$ of dimension $q_m(q_{m-1}+1)$. Collecting all network weights of all layers, including the output parameter of (5.4), gives network weights (for the notation, see also footnote on page 16)
$$\vartheta = \Big( w_1^{(1)}, \ldots, w_{q_d}^{(d)}, w^{(d+1)} \Big) \in \mathbb{R}^r, \qquad (5.7)$$
of total dimension
$$r = \sum_{m=1}^{d} q_m (q_{m-1} + 1) + (q_d + 1). \qquad (5.8)$$
Indicating the network parameter in the notation motivates us to replace (5.4) by the slightly adapted notation
$$\mu_\vartheta(X) = \mathbb{E}[\,Y \mid X\,] = g^{-1}\big\langle w^{(d+1)}, z^{(d:1)}(X) \big\rangle. \qquad (5.9)$$
Example 5.1. We discuss the FNN example of depth d = 2 given in Figure 5.1. It has a 16-dimensional covariate vector X providing an input dimension of q_0 = q = 16. The first hidden layer z^(1): R^{q_0} → R^{q_1} has q_1 = 8 neurons providing 8 · 17 = 136 network weights. The second hidden layer z^(2): R^{q_1} → R^{q_2} has q_2 = 8 neurons providing 8 · 9 = 72 network weights. Finally, the output parameter has dimension 9, thus, altogether the FNN architecture of Figure 5.1 has network weights ϑ of dimension r = 217. These network weights ϑ need to be fitted from the learning sample L.
The hyper-parameters selected by the modeler are the depth d = 2, the number of hidden
neurons q1 = 8 and q2 = 8, as well as the activation function ϕ and the inverse link
g −1 . For model fitting, the modeler needs to additionally select the (strictly consistent)
loss function L for mean estimation, as well as the optimization algorithm to solve the
optimization problem. This optimization algorithm will have a significant impact on the
selected network, i.e., this is different from GLMs where the solution is fully determined
by the model architecture and the loss function. This might be surprising at first sight, and we are going to discuss this in detail below. ■
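The parameter count of Example 5.1 can also be verified directly in software. The following is a minimal sketch, assuming a Keras-style implementation; the exponential output activation (a log-link readout) is an illustrative choice and does not affect the count, and model.count_params() reproduces r = 136 + 72 + 9 = 217 of (5.8).

from tensorflow import keras

# minimal sketch of the FNN of Example 5.1 with (q0, q1, q2) = (16, 8, 8)
model = keras.Sequential([
    keras.Input(shape=(16,)),
    keras.layers.Dense(8, activation="tanh"),         # 8 * (16 + 1) = 136 weights
    keras.layers.Dense(8, activation="tanh"),         # 8 * (8 + 1)  =  72 weights
    keras.layers.Dense(1, activation="exponential"),  # (8 + 1)      =   9 weights
])
assert model.count_params() == 217                    # r of (5.8)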
We close this section with some remarks before discussing the practical issues related to
network fitting.
Remarks 5.2. • The above FNN architecture is also called a fully-connected FNN because every neuron z_k^(m−1) is connected to every neuron z_j^(m) by the network weight w_{j,k}^(m).
• The FNN defined in (5.9) has a one-dimensional output. We can also have a multi-
output network for multi-task learning, e.g., if we want to predict claims frequency
and claims severity (simultaneously), we may select a two-dimensional output
X ↦ ( g_1^{−1}⟨ w_1^(d+1), z^(d:1)(X) ⟩, g_2^{−1}⟨ w_2^(d+1), z^(d:1)(X) ⟩ )^⊤.
In this case, the feature extractor z (d:1) (X) should learn the relevant information
for both claims frequency and claims severity prediction.
• We call FNNs parametric models because once the architecture is fixed, the size
of the network parameter ϑ is determined. This is different from non-parametric
models where the dimension of the parameter is not given a priori. For instance,
in regression trees, discussed below, every iteration of the fitting algorithm will
add a new parameter to the model. Sometimes, people call FNNs semi-parametric
models. One reason for this is that the dimension of the network parameter ϑ does
not determine the complexity of the FNN regression function. FNN regression
functions are not parsimonious, i.e., they usually have much redundancy, and there
is research on exploring the ‘effective dimension’ of FNNs; we refer, e.g., to Abbas
et al. [1].
■
FNNs possess a universal approximation property: any continuous (compactly supported) regression function can be approximated arbitrarily well by a suitable (and sufficiently large) FNN.
This approximation can be w.r.t. different norms and the assumptions for such a state-
ment to hold are comparably weak, e.g., the sigmoid activation function leads to a class
of FNNs that are universal in the above sense. For precise mathematical statements and
proofs about these denseness results we refer to Cybenko [47], Hornik et al. [103], Hornik
[102], Leshno et al. [134] and Isenbeck–Rüschendorf [108]; and there is a vast literature
with similar statements and proofs.
These universality statements imply that basically any (continuous and compactly supported) regression function can be approximated arbitrarily well within the class of FNNs.
This sounds very promising:
• First, it means that the class of FNNs is very rich and flexible.
• Second, no matter what the specific true data generating model looks like, there is a FNN that is similar to this data generating mechanism, and our aim is to find it using the learning sample L that has been generated by that mechanism.
Unfortunately, there is a flip side of the coin of these exciting properties:
• There is no hope to find a (best) parsimonious FNN (on finite samples). In other words, within the class of FNNs there are infinitely many (almost equally) good candidate models. Based on a finite sample there is no best selection; we can only distinguish clearly better from clearly worse models. This can almost be stated as a paradigm in FNN predictive modeling.
• Model selection within the class of FNN has several elements of randomness, e.g.,
a fitting algorithm needs to be (randomly) initialized and this impacts the selected
solution. To be able to replicate results, the fitting procedure has to be designed
very carefully and seeds of random number generators need to be stored to be able
to track and replicate the specific solutions.
Some of the previous items will only become clear once we have introduced stochastic
gradient descent fitting, and the reader should keep these (critical) items in mind for the
discussions below.
Based on the fact that we cannot find a ‘best’ FNN approximation to the true
model on finite samples (see discussion above), we try to find a ‘reasonably good’
FNN approximation to the true data generating mechanism. Reasonably good
means that it usually outperforms a classical GLM, but at the same time there
are infinitely many other FNNs that have a similarly good predictive performance
(generalization to new data).
Due to the non-convexity and the complexity of the problem, computational aspects are
crucial in designing a good FNN learning algorithm. The main tool is stochastic gradient
descent (SGD) which stepwise adaptively improves the network weights ϑ w.r.t. a given
objective function. We are going to derive the SGD algorithm step-by-step, and it will
follow in Section 5.3.7, below. On the way to get there, we need to discuss several items
and issues, which is done in the next sections. The next section starts by introducing the
standard gradient descent method.
for ϑ[t+1] close to ϑ[t] ; and ∇ϑ denotes the gradient (derivative) w.r.t. the network weights.
The right-hand side of the above approximation (5.11) becomes minimal, if the second
term is as negative as possible. Therefore, the update in the network weights should
point into the opposite direction of the gradient.
This motivates the standard gradient descent update

ϑ^[t] ↦ ϑ^[t+1] = ϑ^[t] − ϱ_{t+1} ∇_ϑ L(ϑ^[t]; L),   (5.12)

where ϱ_{t+1} > 0 is a (small) learning rate, also called step size.
The learning rate needs to be small because the first order Taylor expansion is only a
valid approximation in the neighborhood of ϑ[t] . On the other hand, the learning rate
should not be too small, otherwise we need to run too many of these standard gradient
descent steps.
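A minimal sketch may illustrate the mechanics of the update rule (5.12) on a toy quadratic objective; the objective, its gradient, the learning rate and the number of steps are illustrative choices only and are not part of any actuarial fitting procedure.

import numpy as np

# toy illustration of the gradient descent update (5.12): minimize L(theta) = ||theta||^2
def grad(theta):
    return 2.0 * theta                   # gradient of the toy objective

rng = np.random.default_rng(1)
theta = rng.uniform(-1.0, 1.0, size=5)   # random initialization theta^[0]
rho = 0.1                                # learning rate (step size)
for t in range(200):
    theta = theta - rho * grad(theta)    # gradient descent update (5.12)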
An important point is that the initial value ϑ^[0] of the gradient descent algorithm should be selected at random to avoid that the gradient descent algorithm starts in a saddlepoint of the loss surface ϑ ↦ L(ϑ; L).¹ For instance, if one sets ϑ^[0] = 0, there is no pre-determined initial direction for the first gradient descent step because the FNN has symmetries around this initial value, i.e., we have a saddlepoint of the loss surface and the algorithm will not start to explore the parameter space. A popular initializer is the glorot_uniform initializer of Glorot–Bengio [76, formula (16)]. It adjusts the volatility in the random uniform initialization to the sizes of the network layers.
¹ We assume differentiability in all gradient descent considerations. This is typically the case in the selected network architectures, except for the ReLU activation function which is not differentiable in the origin.
The following points need to be addressed by the modeler for a successful gradient
descent fitting:
• Stochastic gradient descent to deal with big data, i.e., big learning sam-
ples L; see Section 5.3.7.
embedded variables are concatenated with the continuous ones, which then jointly enter
the feature extractor of the FNN architecture. This adds b · K embedding weights to the
network parameter ϑ, if K is the number of levels of the categorical covariate. These
embedding weights are also learned during gradient descent training, i.e., they also enter
the gradient computations. In many cases, this gives a superior performance over one-hot
encoding and dummy coding, respectively. An example is given in the notebook:
notebook-insert-link
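As a complement to the notebook, the following is a minimal Keras-style sketch of such an embedding layer; the sizes K, b and q_cont as well as the layer sizes are assumptions made only for illustration.

from tensorflow import keras

# embedding of one categorical covariate with K levels into b dimensions, concatenated
# with the continuous covariates before the FNN feature extractor
K, b, q_cont = 11, 2, 7
cat_in = keras.Input(shape=(1,), dtype="int32")        # integer-coded categorical level
cont_in = keras.Input(shape=(q_cont,))                 # pre-processed continuous covariates
emb = keras.layers.Embedding(input_dim=K, output_dim=b)(cat_in)   # adds b * K weights to theta
emb = keras.layers.Flatten()(emb)
x = keras.layers.Concatenate()([cont_in, emb])
for units in (20, 15, 10):                             # FNN feature extractor z^(d:1)
    x = keras.layers.Dense(units, activation="tanh")(x)
out = keras.layers.Dense(1, activation="exponential")(x)   # log-link readout
model = keras.Model([cat_in, cont_in], out)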
with learning rate ϱt+1 > 0 and momentum parameter ν > 0, and where we initialize
v[0] = 0. The learning rate and the momentum parameter are hyper-parameters that
need to be fine-tuned by the modeler. This and slightly modified versions thereof are
implemented in standard software, and this software often comes with suitable standard
values for these hyper-parameters, i.e., they are ready-to-use. Therefore, we do not
describe these points in more detail. Standard momentum-based algorithms are rmsprop
or adam; see Hinton et al. [98], Kingma–Ba [118] and Goodfellow et al. [83, Sections 8.3
and 8.5].
Another noteworthy improvement is the Nesterov acceleration [165]. Nesterov has noticed
that such algorithms often have a zig-zag behavior, meaning that they overshoot and
then correct by moving back and forth, which does not seem to be very effective. The
improvement suggested by Nesterov is to already anticipate the next gradient descent step
in determining the optimal learning rates and momentums. This way an overshooting
can be reduced. This is implemented, for example, in the nadam version of adam.
The described gradient descent algorithms are usually used in standard network architec-
tures such as FNNs. If one works with more specific architectures, e.g., with transformers,
there are more specialized gradient descent methods. For instance, for transformers, there
is an adamW version of Loshchilov–Hutter [145] which better adapts to problems where
the variables live on different scales.
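In Keras-style software, these optimizers can be configured as in the following hedged sketch; the hyper-parameter values shown are the usual defaults and only serve as an illustration, and the AdamW class is only available in more recent versions.

from tensorflow import keras

# momentum-based optimizers as provided by standard software
sgd_momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
adam = keras.optimizers.Adam(learning_rate=0.001)    # adaptive moment estimation
nadam = keras.optimizers.Nadam(learning_rate=0.001)  # adam with Nesterov acceleration
adamw = keras.optimizers.AdamW(weight_decay=0.004)   # adamW (decoupled weight decay)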
is not a sensible problem that we should try to solve. This MLE fitted FNN will
not only extract the structural part (systematic effects) from the learning sample L =
(Yi , X i , vi )ni=1 , but it will also largely adapt to the noisy part (pure randomness) in this
learning sample. Obviously, such a FNN will badly generalize and it will have a poor
predictive performance on out-of-sample test data T .
Figure 5.3 gives an example that in-sample over-fits. The black dots are the observed
responses Yi (in the learning sample L), and the true regression function µ∗ is shown in
green color. The red graph shows a fitted regression model that over-fits to the learning
sample L. It follows the black dots quite closely, significantly deviating from the true
green regression function. Out-of-sample (repeating this experiment), the black dots may
likely also lie on the other side of the green line and, thus, the red estimated model will
generally not perform well in predictions (perform worse than an estimated model that is close to the green line).

Figure 5.3: An example of over-fitting; this figure is taken from [243, Figure 7.6].
Consequently, within a highly flexible model class, we need to try to find a model that only extracts the systematic part from a noisy sample. The key to this problem is early stopping. Some scholars call early stopping a regularization method; however, technically it is different because it has an essential temporal component related to algorithmic time.
At this fitting step, FNN regression modeling significantly differs from GLM. In
GLMs, there often is little over-fitting potential and one tries to minimize the
empirical loss L(ϑ; L) to find the optimal GLM parameter. In contrast, reasonably
large FNNs have a high over-fitting potential and, therefore, one only tries to
get the empirical loss L(ϑ; L) reasonably small to find a good network parameter.
Practically, this is achieved by exercising an early stopping rule during gradient
descent training.
Let us first explain why early stopping works before discussing its implementation. Coming back to the empirical loss (5.10), we compute its gradient

∇_ϑ L(ϑ; L) = Σ_{i=1}^{n} (v_i/φ) ∇_ϑ L(Y_i, µ_ϑ(X_i)).
We observe that this gradient consists of a sum of many individual gradients of each
instance 1 ≤ i ≤ n. In each gradient descent step we try to find the most effective/signif-
icant update. Systematic effects will impact many individual instances (otherwise these
effects would not be systematic). At the beginning of the gradient descent algorithm,
before having found these systematic effects, they will therefore dominate the gradient
descent steps. Once these systematic effects acting on many instances 1 ≤ i ≤ n have
been found, the relative importance of instance-individual factors (noise) starts to in-
crease. This is precisely the time-point to early stop the gradient descent algorithm; we
call this early stopping because the algorithm has not yet reached a local minimum of
the loss function and, as explained above, this is not our intention.
The previous outline also explains why all components of the covariates should live on a comparable scale. If one covariate component lives on a bigger scale than the other ones, it dominates the gradients. Thus, the gradient descent algorithm will find the systematic effects of that dominant covariate component and then it starts to exploit its noisy part (because this noise is still of a bigger magnitude than the systematic part of the remaining covariates). At this stage, we early stop because learning the noise does not generalize to new data, and, hence, we have not yet found the systematic part of the other covariate components.
Figure 5.4: Partition of the entire data (lhs) into learning sample L and test sample T
(middle), and into training sample U, validation sample V and test sample T (rhs); this
figure is taken from [243, Figure 7.7].
Thus, we perform the gradient descent algorithm only on the training sample U

∇_ϑ L(ϑ; U) = Σ_{i∈U} (v_i/φ) ∇_ϑ L(Y_i, µ_ϑ(X_i)),   (5.13)
[Figure: training loss and validation loss ((modified) deviance loss) as a function of the training epochs, with the minimal validation loss indicated.]
dominated by the noise in V). Often one takes 20% or 10% of the learning data L as
validation sample V, depending on the sample size n.
Technically, for gradient descent training, one installs a so-called callback. This just means that one saves every weight ϑ^[t], t ≥ 0, that decreases the validation loss L(ϑ^[t]; V), and after running the algorithm one 'calls back' the weight ϑ^[t⋆] with the minimal validation loss, which then presents the estimated network weights ϑ̂ = ϑ^[t⋆] from the early stopped gradient descent algorithm.
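A minimal Keras-style sketch of such callback-based early stopping could look as follows; the objects model, X_train and Y_train, the file name, the patience, the number of epochs, the batch size and the validation split are assumptions for illustration.

from tensorflow import keras

# callback tracking the validation loss and restoring the best weights (early stopping)
callbacks = [
    keras.callbacks.ModelCheckpoint("best.weights.h5", monitor="val_loss",
                                    save_best_only=True, save_weights_only=True),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                  restore_best_weights=True),
]
model.fit(X_train, Y_train,
          epochs=500,
          batch_size=5000,          # mini-batch size s, typically 1000 to 5000
          validation_split=0.1,     # validation sample V (10% of the learning sample)
          callbacks=callbacks,
          verbose=0)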
naturally to non-sparsity because every neuron needs to be able to comply with different
tasks. We will come back to drop-out in Section 8.5.1 on page 153.
The gradient calculation in (5.13) involves large matrix multiplications if the dimension
of the network weights ϑ and the size of the training sample U are large, which is usually
the case. Matrix multiplications of large matrices can be very slow which hinders fast
network fitting. For this reason, one typically considers a stochastic gradient descent
(SGD) algorithm.
For the SGD method one chooses a fixed batch size s ∈ N, and one randomly partitions the training sample U = (Y_i, X_i, v_i)_{i=1}^n into (mini-)batches U_1, . . . , U_{⌊n/s⌋} of roughly the same size s. The gradient descent step (5.12) is then performed with the gradient (5.13) evaluated only on a single batch, where one cyclically visits the batches (U_k)_{k=1}^{⌊n/s⌋} for the gradient descent step t → t + 1.
The batch size s ∈ N of the batches (U_k)_{k=1}^{⌊n/s⌋} should not be too small. Assuming that the observations (Y_i, X_i, v_i)_{i=1}^s are i.i.d., the law of large numbers will provide us with the (locally) optimal gradient descent step if we let the batch size s → ∞. This suggests that
we should choose very large batch sizes s. As explained above, computational reasons
force us to choose small(er) batch sizes, which may provide certain ‘erratic’ gradient
descent updates in view of the optimal next step. However, this is not the full story, and
some erratic steps can even be beneficial for finding better network weights, as long as
these erratic steps are not too numerous (and not too large). An infinite sample only
gives the next optimal step, which is a one-step ahead consideration. This may guide
us into a bottleneck, saddlepoint or local minimum that is far from optimal, because
the next optimal step is only a local optimal consideration. Having some erratic steps
from time to time may help us to escape from trapped situations like a bottleneck by
slightly shifting in the parameter space, so that we have the opportunity to explore
different environments (generally, such erratic steps are not too big for small step sizes,
and usually the loss surface is smooth and not too steep, so that an erratic step does not
dramatically change the situation). In this sense, finding a good trade-off between next best steps and erratic steps leads to the best predictive FNNs.
• We use SGD training for FNN fitting. For insurance pricing problems, typi-
cally, reasonable mini-batch sizes s are in the range of 1000 to 5000.
• For early stopping, we implement a callback which tracks the validation loss on the validation sample V during SGD training. The validation sample is typically 10% to 20% of the entire learning sample L.
There is the recurrent question of how to select a good network architecture. A general principle is that the selected network architecture should not be too small: it needs to be sufficiently flexible to approximate all potentially suitable regression functions. Generally, it is bad guidance to aim for a minimal and parsimonious FNN. Usually, there are many different, roughly equally good approximations to the real data generating mechanism, and the SGD algorithm can only find (some of) those if it has sufficiently many degrees of freedom to explore the parameter space (this contradicts parsimony); otherwise the fitting will likely not result in the best possible predictive model. Of course, this is a bit against actuarial thinking. Optimizing neural network architectures (e.g., the hyper-parameters like the depth d and the number of neurons q_m) is not a target that one should try to achieve, but one has to accept the fact that the selected architectures should exceed a minimal complexity bound above parsimony for SGD training to be successful. Typically, this results in a lot of redundancy, which cannot be reduced by existing techniques. Specifically, the FNN architecture should have a certain depth d that is not too small (depth promotes interactions), and each hidden layer should also not be chosen too small.
We attempt to give clearer guidance about the choice of the FNN architecture:
In our examples, which are comparably small insurance pricing examples of roughly
500,000 insurance policies equipped with roughly 10 covariate components, it has turned
out that FNN architectures of depth d ∈ {3, . . . , 6} with approximately 15 to 30 neurons
in each hidden layer work well. For the French MTPL claims frequency example used in
these notes, see Section 3.3.2, we designed a standard FNN architecture of depth d = 3
with (q1 , q2 , q3 ) = (20, 15, 10) neurons in the three FNN layers. This has proved to work
well in this example; see our notebook
notebook-insert-link
Another critical point is that network fitting involves several elements of randomness.
Even if we fix the architecture and the fitting procedure, we typically end up with multiple
equally good predictive models if we run the same fitting algorithm and strategy multiple
times (with different seeds).
(2) the random partition into learning sample L and test sample T ;
(3) the random partition into training sample U and validation sample V;
⌊n/s⌋
(4) the partition into the batches (Uk )k=1 ; and
(5) there are further random items like drop-outs, if used during SGD training,
etc.
All this makes the early stopped SGD solution (highly) non-unique.
This non-uniqueness is a fact that one has to live with in machine learning models. In
Section 5.4 we present an ensemble predictor that reduces this randomness by averaging.
Remark 5.3 (balance correction). The early stopped SGD fitted FNN will not fulfill the balance property (4.4), even if we use the canonical link of the selected deviance loss function for g in the readout (5.4). A reason for this failure is early stopping, which stops the algorithm before having found a critical point of the loss surface. The balance property can be rectified by adjusting the estimate of the bias term ŵ_0^(d+1) of the readout correspondingly. If we work with the canonical link for g, one can alternatively exercise another GLM step, using the feature extracted covariates (ẑ^(d:1)(X_i))_{i=1}^n as new covariates for this GLM step; ẑ^(d:1) denotes the SGD fitted feature extractor. Thus, we fit a GLM on the new learning sample (Y_i, ẑ^(d:1)(X_i), v_i)_{i=1}^n under the canonical link choice, see (5.1). This is the proposal of Wüthrich [238]. ■
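A minimal sketch of the bias-term adjustment under the log-link, assuming arrays Y, v and fitted means mu_hat on the learning sample, is the following.

import numpy as np

# balance correction of Remark 5.3 under the log-link: adding a constant to the readout
# bias re-scales all fitted means so that the balance property (4.4) holds on L
shift = np.log(np.sum(v * Y) / np.sum(v * mu_hat))   # additive correction of the readout bias
mu_balanced = mu_hat * np.exp(shift)
assert np.isclose(np.sum(v * mu_balanced), np.sum(v * Y))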
5.4 Nagging
In Chapter 6 we will meet a method called bagging which was introduced by Breiman [29].
Bagging combines bootstrap and aggregating. Bootstrap is the re-sampling technique
discussed in Section 1.5.4, and this is combined with aggregation which has an averaging
effect, reducing the randomness.
In this section we do not bootstrap, but, because network fitting has several items of
randomness as discussed in the previous section, we replace the bootstrap samples by
different SGD solutions. An ensembling of network predictors has first been considered
by Dietterich [56, 57] and subsequently it has been studied in Richman–Wüthrich [189],
where the name nagging for network aggregating was introduced.
5.4.1 Aggregating
Aggregating can most easily be explained by considering an i.i.d. sequence of square-integrable predictors (µ̂_j)_{j≥1}, which are assumed to be unbiased for the true predictor µ*, that is, E[µ̂_j] = µ*, for j ≥ 1. For a fixed predictor µ̂_j, we have an approximation error, called estimation error (or, more broadly, model error),

µ̂_j − µ*,

which on average is zero due to the unbiasedness assumption. For unknown true predictor µ*, one estimates this estimation error by the variance (or the standard deviation, respectively) of the predictor, i.e., the average approximation error, called estimation uncertainty, is given by V(µ̂_j) or √V(µ̂_j), respectively, which can be determined empirically from the predictors (µ̂_j)_{j≥1}.
On the other hand, having multiple independent unbiased predictors (µ̂_j)_{j=1}^M, one can build the ensemble predictor

µ̂^(M) = (1/M) Σ_{j=1}^{M} µ̂_j.   (5.15)

This ensemble predictor has an estimation error µ̂^(M) − µ*, and the estimation uncertainty is given by

V(µ̂^(M)) = (1/M) V(µ̂_1) → 0   for M → ∞.   (5.16)
Of course, all this is well-known in statistics, but the important takeaway is that ensem-
bling over unbiased i.i.d. predictors substantially reduces estimation errors and uncer-
tainty (through the law of large numbers).
There are two caveats:
µ_{ϑ̂_j}(X), where ϑ̂_j denotes the SGD fitted network weights from the j-th conditionally independent SGD run, using always the identical fitting strategy, only the initialization and partitioning is done with a different random seed, and the conditional stems from the fact that it is conditional on the learning sample L. Iterating this SGD fitting M times gives us M conditionally independent FNNs (µ_{ϑ̂_j})_{j=1}^M, given L.
A first question is: how robust are the predictions µ_{ϑ̂_1}(X), . . . , µ_{ϑ̂_M}(X) for a given covariate value X? This question has been analyzed in Richman–Wüthrich [189] and Wüthrich–Merz [243, Figure 7.18] on a motor insurance claims frequency data set of sample size roughly n = 500,000. The average fluctuations of the different fits (µ_{ϑ̂_j}(X))_{j=1}^M were of magnitude 10%; this concerns the main body of the covariate distribution X ∼ P. In this part of the covariate space we got reliable and quite robust models. However, there are less frequent covariate combinations where these fluctuations were up to 40%, i.e., the different initializations of SGD gave fluctuations in the best-estimates of up to 40%. Thus, there is clearly a credibility issue in the estimated FNNs in this (scarce) part of the covariate space. Aggregating helps to reduce these fluctuations.
This motivated the nagging predictor

µ̂^nagg_M(X) = (1/M) Σ_{j=1}^{M} µ_{ϑ̂_j}(X).   (5.17)

This ensembling reduces the average fluctuations by a factor √M (on the standard deviation scale). That is, this determines the rate of convergence, and we obtain the law of large numbers, a.s.,

lim_{M→∞} µ̂^nagg_M(X) = E[ µ_{ϑ̂_1}(X) | L, X ],   (5.18)

where E[·|L, X] is the conditional expectation operator describing the selected SGD fitting procedure, and for a fixed covariate value X, this is also the measure the 'a.s.' is referring to. This law of large numbers limit precisely states what kind of (conditional) unbiasedness we can receive by the nagging predictor. A difference of the limit (5.18) compared to the true mean µ*(X) can originate from the following items: the particular learning sample L that is at our disposal, the FNN architecture that we choose, the specific version of the SGD algorithm with early stopping that we apply, but also the particular distribution we use for initializing the SGD algorithm.
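A minimal sketch of the nagging predictor (5.17) could look as follows; fit_fnn(seed) is an assumed helper that re-runs the identical fitting strategy (same architecture, same early stopping rule) with a different random seed and returns the fitted network, and X_test denotes the covariates to be predicted.

import numpy as np

# nagging predictor (5.17): average M conditionally independent SGD fits
M = 20
models = [fit_fnn(seed=j) for j in range(M)]
mu_nagg = np.mean([m.predict(X_test).flatten() for m in models], axis=0)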
A first conclusion is that the nagging predictor robustifies the best-estimate prediction. A second conclusion is that the nagging predictor significantly improves the prediction accuracy by reducing the estimation error. This is verified in many examples in the literature, see, e.g., Wüthrich–Merz [243, Table 7.9]. Figure 5.6 shows two different examples, the left-hand side gives a claims frequency example and the right-hand side a claims severity example. Both plots show the decrease of the out-of-sample loss as a function of the number M of ensembled network predictors (µ_{ϑ̂_j})_{j=1}^M. From these examples we conclude that a good value for M is either 10 or 20, because afterwards the out-of-sample loss of the nagging predictor µ̂^nagg_M stabilizes.
Figure 5.6: Decrease of the out-of-sample loss of the nagging predictors µ̂^nagg_M as a function of M ≥ 1: (lhs) Poisson frequency example, (rhs) gamma severity example; these figures are taken from [243, Figures 7.19 and 7.24].
However, these improvements are always conditional on the learning sample L, and on the requirement that the selected model class and the model fitting procedure are not flawed, i.e., the true model is close to the selected model class in the sense of the universality statements, and suitable models can be found by the selected SGD procedure. Moreover, computing the nagging predictor can be demanding because one needs to fit the network architecture M times. There is a recent proposal that performs multiple predictions within the same model and learning procedure; see Gorishniy et al. [85].
The previous sections introduced FNNs and their SGD training. This builds the core
of deep learning on tabular data. The following sections of this chapter present FNN
architectures that are particularly useful for solving actuarial problems, and Chapter 8,
below, presents deep learning architectures for tensor data and unstructured data.
• Select a deviance loss function that reflects the properties of the responses.
• Run SGD fitting with early stopping using a callback; see Section 5.3.5. This can be complemented with regularization and drop-out; see Section 5.3.6.
• Apply a balance correction to comply with the balance property (4.4); see
Remark 5.3.
The next two sections present two FNN architectures that are attractive to solve actuarial
problems. The first one combines a GLM with FNN features, and the second one locally
looks like a GLM.
We close this short section by highlighting the reference Richman–Wüthrich [193] on the
ICEnet architecture. Generally, it is non-trivial to enforce smoothness and monotonicity
in FNN architectures. The ICEnet is a regularized FNN method that achieves these
properties. This requires the evaluation of first differences, and to do so in the ICEnet, it is
necessary to have a multi-output network to be able to simultaneously obtain predictions
for adjacent inputs; see Richman–Wüthrich [193].
with MLE ϑ̂^MLE ∈ R^{q+1} and link function g. This first fitting has been performed on a learning sample L = (Y_i, X_i, v_i)_{i=1}^n and it provides us with the estimated residuals, 1 ≤ i ≤ n,

ε̂_i = Y_i − µ̂^GLM(X_i).   (5.19)
If this fitted GLM is suitable for the collected data, these residuals should not show
any systematic structure. That is, these residuals should be (roughly) independent and
centered (on average), and there should not be any systematic effects in these residuals as
a function of the covariates X. This motivates us to study a new, second regression step, namely, we regress the residuals on the covariates, using the new learning sample L_ε = (ε̂_i, X_i, v_i)_{i=1}^n. This is precisely the basic idea behind boosting, namely, one stage-wise adaptively tries to improve the model by specifically focusing on finding the weaknesses of the previous model(s). This is quite different from ensembling, which (only) averages over the models but does not let the individual models compete to improve on each other; see Section 5.4 for ensembling.
Remarks 5.4. The residuals defined in (5.19) need some care. First, they are not
independent, even if the instances 1 ≤ i ≤ n are independent. Note that the learning
sample L enters the estimated regression function µ̂^GLM = µ^GLM_{ϑ̂^MLE}. This typically implies
a negative correlation between the estimated residuals. This is also the reason why
empirical variance estimators are normalized by 1/(n − 1) and not by 1/n to receive
unbiased variance estimators. Second, the residuals in (5.19) may have different variances,
even under the true regression function µ∗ because they are not standardized. If we
know that the learning data has been generated by a member of the EDF with cumulant
function κ, the standardization is straightforward from (2.4), i.e., using the variance
function V (·) = (κ′′ ◦h)(·) with canonical link h. That is, based on this variance function,
we obtain Pearson’s residuals
bGLM (X i )
Yi − µ
εbPi = q . (5.20)
bGLM (X i )) /vi
V (µ
These Pearson’s residuals should roughly look like an independent sequence of centered
variables having the same dispersion, in particular, they should not show any structure in
the covariates and as a function of the estimated regression function. Otherwise, either
the estimated regression function is not correctly specified, or the selected cumulant
function κ does not provide the correct variance behavior; see also Delong–Wüthrich [51]
for variance back-testing using isotonic regression. The residuals (5.20) can also be used
to receive Pearson’s dispersion estimate defined by
φ̂^P = (1/(n − (q + 1))) Σ_{i=1}^{n} (ε̂_i^P)^2.   (5.21)
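For illustration, a minimal sketch of (5.20)-(5.21) in the Poisson case, where V(µ) = µ and Y, v, mu_glm are assumed arrays of length n with q + 1 estimated GLM parameters:

import numpy as np

# Pearson's residuals (5.20) and dispersion estimate (5.21) for the Poisson case
pearson = (Y - mu_glm) / np.sqrt(mu_glm / v)             # Pearson's residuals (5.20)
phi_pearson = np.sum(pearson ** 2) / (len(Y) - (q + 1))  # dispersion estimate (5.21)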
The CANN proposal of Wüthrich–Merz [242] does not explicitly compute the residuals, but it directly adds a regression function to the first estimated model, i.e., it acts on the (linear) predictor scale. Based on a given regression function µ̂^GLM with link g it makes the Ansatz

µ^CANN(X) = g^{−1}( g(µ̂^GLM(X)) + ⟨ w^(d+1), z^(d:1)(X) ⟩ ),

and since the first model is a GLM with link g, we can equivalently write

µ^CANN(X) = g^{−1}( ⟨ ϑ̂^MLE, X ⟩ + ⟨ w^(d+1), z^(d:1)(X) ⟩ ),   (5.22)
where we assume that ϑbMLE has been fitted in a first GLM step and is kept fixed (frozen)
in the second estimation step, and where the second part of (5.22) describes a FNN
architecture. In this second CANN step, only this second (FNN) part is fitted to detect
more systematic structure that has not been found by the initial GLM. If the fitted FNN
part in this second CANN step is equal to zero, then we are back in the GLM. This
expresses that the FNN could not find any additional systematic structure that is not
already integrated into the GLM.
Note that keeping the first part ⟨ϑ̂^MLE, X⟩ frozen during the second boosting step (fitting the FNN) can also be interpreted as having an offset playing the role of known prior differences, and one wants to see whether one finds further differences beyond this offset.
For fitting the FNN we apply a SGD algorithm on training and validation data as explained above. This two-step estimation concept has by now been applied successfully in many studies, see, e.g., Brauer [26] and Havrylenko–Heger [95]; the latter reference uses this CANN approach to detect interactions.
In view of (5.22) the similarity to ResNet is apparent. The second term in (5.22) describes
a classic FNN architecture, the first term can be interpreted as a residual connection or a
skip connection because it connects the input X directly to the output; for an illustration
see Wüthrich–Merz [243, Figure 7.14]. We can also interpret (5.22) as having a linear
(GLM) term and we build a non-linear FNN architecture around this linear term to
capture interactions and non-linearities not present in the GLM-term.
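A minimal Keras-style sketch of the CANN architecture (5.22) under the log-link could look as follows; the covariate dimension q, the layer sizes and the array theta_mle (the fitted GLM coefficients, intercept first) are assumptions for illustration.

from tensorflow import keras

# CANN (5.22): frozen GLM skip connection plus trainable FNN part on the predictor scale
q = 16
x_in = keras.Input(shape=(q,))
glm_part = keras.layers.Dense(1, trainable=False, name="glm_skip")(x_in)  # <theta_MLE, X>
z = x_in
for units in (20, 15, 10):                               # feature extractor z^(d:1)
    z = keras.layers.Dense(units, activation="tanh")(z)
fnn_part = keras.layers.Dense(1, name="fnn_readout")(z)  # <w^(d+1), z^(d:1)(X)>
out = keras.layers.Activation("exponential")(keras.layers.Add()([glm_part, fnn_part]))
cann = keras.Model(x_in, out)
cann.get_layer("glm_skip").set_weights(
    [theta_mle[1:].reshape(q, 1), theta_mle[:1]])        # freeze the fitted GLM part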
5.7 LocalGLMnet
Essentially, there are two different ways of explaining results of algorithmic models. Ei-
ther one uses post-hoc methods, we discuss these in Chapter [insert], below, or one tries to
integrate explainable features into the algorithmic models. The LocalGLMnet is a FNN
architecture that attempts to design a model that provides these integrated explainable
features; we refer to Richman–Wüthrich [190, 191]. The LocalGLMnet can be seen as
a varying coefficient model; for varying coefficient models we refer to Hastie–Tibshirani
[92], and in a tree-based context to Zhou–Hooker [249] and Zakrisson–Lindholm [247],
the latter discussing parameter identifiability issues, which we are also going to face in
this section.
The starting point is a GLM whose sensitivities can easily be understood, especially under the log-link choice, see Example 3.1. Recall the GLM regression function

µ_ϑ(X) = g^{−1}⟨ϑ, X⟩ = g^{−1}( ϑ_0 + Σ_{j=1}^{q} ϑ_j X_j ),   (5.23)

with regression parameter ϑ ∈ R^{q+1}. This regression parameter takes a fixed value that is estimated with MLE. The LocalGLMnet architecture replaces this fixed parameter ϑ by a multi-output FNN architecture (function) z^(d:1) = (z_1^(d:1), . . . , z_q^(d:1))^⊤,

µ_ϑ(X) = g^{−1}( ϑ_0 + Σ_{j=1}^{q} z_j^(d:1)(X) X_j ),   (5.24)

with z^(d:1): R^q → R^q.
Thus, the GLM parameters (ϑ_j)_{j=1}^q are replaced by network outputs (z_j^(d:1)(X))_{j=1}^q of the same dimension q. Locally, these network outputs look like constants and, therefore, the LocalGLMnet behaves locally as a GLM. This is precisely the main motivation to study the architecture in (5.24). While we estimated the parameter ϑ ∈ R^{q+1} with MLE in GLMs, we now learn these attention weights (z_j^(d:1)(X))_{j=1}^q with SGD fitting within a FNN framework.
The learned attention weights (z_j^(d:1)(X))_{j=1}^q allow for nice interpretations:
(1) We focus on the individual terms under the j-summation in (5.24). If for the j-th component X_j the learned network output is constant, z_j^(d:1)(X) ≡ ϑ_j ≠ 0, we effectively have a GLM component in this term. Therefore, we aim at understanding whether the multi-output network provides sensitivities in the inputs or not. If there are no sensitivities we should go for a GLM.
(2) The property z_j^(d:1)(X) ≡ 0 for some j proposes to completely drop this term from the regression function. This is a way of selecting or dropping terms from the LocalGLMnet regression function. In fact, Richman–Wüthrich [190] propose an empirical statistical test to check for this kind of model sparsity.
(3) If we obtain an attention weight z_j^(d:1)(X) = z_j^(d:1)(X_j) that only depends on the covariate component X_j, we know that this term does not interact with any other covariate components. More generally, we can test for interactions by considering for a fixed component j the gradient w.r.t. X

∇z_j^(d:1)(X) = ( ∂z_j^(d:1)(X)/∂X_1, . . . , ∂z_j^(d:1)(X)/∂X_q )^⊤ ∈ R^q.   (5.25)

This allows us to understand the local interactions in the j-th term in the neighborhood of X.
This all looks very nice and convincing, however, there is a caveat that needs careful
consideration. Namely, the LocalGLMnet regression function lacks identifiability. We
briefly discuss this in the following items.
(4) Due to the flexibility of large FNNs we may find a term that gives us the function

z_j^(d:1)(X) X_j = X_{j'},   (5.26)

by learning an attention weight z_j^(d:1)(X) = X_{j'}/X_j, for j' ≠ j. This is the reason why we speak about dropping a 'term' and not a 'covariate component' in the previous items (1)-(3), because even if we drop the term for X_j, this covariate component may still play an important role in the attention weights of other terms.
(5) Related to item (4), for SGD training of the LocalGLMnet we need to initialize the gradient descent algorithm. We recommend to initialize the network weights such that we precisely start in the MLE fitted GLM (5.23). In our examples, this has pre-determined the role of all j-terms such that we did not encounter any situation where an issue similar to (5.26) occurred.
Assume that all covariate components are standardized to be centered with unit variance, i.e., we have standardized columns in the design matrix X. This makes the attention weights z_j^(d:1)(X) directly comparable across the different components 1 ≤ j ≤ q. It motivates a measure of variable importance by defining the sample averages

I_j = (1/n) Σ_{i=1}^{n} | z_j^(d:1)(X_i) |.   (5.27)

If this value is very small, the empirical test of Richman–Wüthrich [190] gives support to the null hypothesis of dropping this term and, obviously, if I_j is large it has a big impact on the regression function (because the centering of the covariates implies that the regression function is calibrated to zero); this is similar to the LRTs and the p-values in GLMs, see Listing 3.1.
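A minimal sketch of the variable importance measure (5.27), assuming that attention is the (n, q) matrix of fitted attention weights z_j^(d:1)(X_i) evaluated on the standardized design matrix:

import numpy as np

# variable importance (5.27) from the fitted LocalGLMnet attention weights
importance = np.mean(np.abs(attention), axis=0)   # I_j, one value per component j
ranking = np.argsort(importance)[::-1]            # components ordered by importance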
Since the LocalGLMnet locally behaves like a GLM, if the attention weights do not take extreme values, there is also some similarity to the post-hoc interpretability tool called local interpretable model-agnostic explanations (LIME) by Ribeiro et al. [185]. LIME fits a LASSO regularized GLM locally to individual covariate values X to describe the most important variables that explain the regression value µ̂(X) of a fitted regression model. Using the attention weights (z_j^(d:1)(X))_{j=1}^q we have similar, but more precise, information about this local behavior in X; we come back to LIME in Section [insert], below.
For KANs we exchange these roles by putting learnable activation functions on the edges
in the form of learnable splines, and we set all weights (in the nodes) equal to one. Let
(Bs )s be a family of B-splines.2 For a KAN we build the splines
x ∈ R ↦ S(x) = Σ_s w_s B_s(x),
with learnable weights (ws )s ⊂ R. That is, every spline S incorporates weights (ws )s that
can be trained with SGD methods. In the KAN proposal of Liu et al. [141], these splines
are used as residual connections around the SiLU function, defining the KAN activation
functions ϕ: R → R by

x ↦ ϕ(x) = w ( SiLU(x) + S(x) ) = w ( SiLU(x) + Σ_s w_s B_s(x) ),   (5.28)
with another weight w ∈ R and the SiLU function given in Table 5.1. These are highly
flexible activation functions that can be trained on learning data.
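A minimal sketch of such a KAN activation function (5.28), where basis is an assumed list of B-spline basis functions B_s (e.g., built with scipy.interpolate.BSpline) and w, ws are the learnable weights of this edge:

import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def kan_activation(x, w, ws, basis):
    # S(x) = sum_s w_s B_s(x), the learnable spline part of (5.28)
    spline = sum(w_s * B_s(x) for w_s, B_s in zip(ws, basis))
    return w * (silu(x) + spline)   # phi(x) of (5.28)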
We are now ready to define a KAN layer, which is the analogue to the FNN layer
introduced in (5.5)-(5.6).
KAN layer. Based on the selected class of KAN activation functions (5.28), we define the KAN layer z^(m): R^{q_{m−1}} → R^{q_m} as follows. For x = (x_1, . . . , x_{q_{m−1}})^⊤ ∈ R^{q_{m−1}}, we set

z^(m)(x) = ( z_1^(m)(x), . . . , z_{q_m}^(m)(x) )^⊤,
Regression trees (decision trees) have been introduced in the seminal monograph of
Breiman et al. [31] called classification and regression trees (CARTs), published in 1984.
Regression trees are based on recursively partitioning the covariate space, therefore, this
technique is also known as rpart in R; see Therneau–Atkinson [215]. Nowadays, regres-
sion trees are not used any more in their pure form because they are not fully competitive
with more advanced regression methods. However, they are the main building blocks of
gradient boosting machines (GBMs) and random forests. For this reason, we give a short
introduction to regression trees in this chapter. In GBMs many small regression trees
(called weak learners) are combined to a powerful predictor, and in random forests many
large and noisy regression trees are combined with bagging to a more powerful predictor.
Random forests have been introduced by Breiman [30], and they will be discussed in this chapter; GBMs will be discussed in Chapter 7, below.
This finite partition is constructed by a binary tree, and we call (Xt )t∈T the leaves of this
binary tree as these sets are the knots of the binary tree that do not have any descendants,
see Figure 6.1 for an example with six leaves.
Assuming that all insurance policyholders X ∈ X_t, who belong to the same leaf X_t, have the same risk behavior motivates to define the regression function as

X ↦ µ(X) = Σ_{t∈T} µ_t 1_{{X ∈ X_t}},   (6.2)
Figure 6.1: Binary regression tree with three binary splits resulting in six leaves.
Figure 6.2: (lhs) Partition (Xt )t∈T of a rectangular covariate space X ⊂ R2 with dif-
ferently colored conditional means (µt )t∈T on the corresponding leaves; (rhs) GLM with
multiplicative structure.
There are two main items to be selected to design the regression tree function (6.2):
(1) The partitioning of the covariate space X into the leaves (Xt )t∈T ; and
(2) the selection of the conditional means (µt )t∈T on the leaves.
(b) Select the least homogeneous leaf from (Xt )t∈T , i.e., the leaf for which we can find
the most efficient SBS w.r.t. some objective function.
(c) Estimate the conditional means on these two new leaves Xt0 and Xt1 .
Items (a)-(c) require choosing a leaf X_t of the actual tree (X_t)_{t∈T}, a covariate component
Xk , 1 ≤ k ≤ q, that serves for the next SBS, and a split level c ∈ R that partitions
w.r.t. the selected covariate component Xk . To decide about these three choices, we need
an objective function. Since we estimate conditional means, it is natural to take a strictly
consistent loss function L for mean estimation.
This then translates items (a)-(c) into the following optimization problem

(t̂, k̂, ĉ) = arg max_{t∈T, 1≤k≤q, c∈R} Σ_{i: X_i ∈ X_t} (v_i/φ) [ L(Y_i, µ̂_t) − ( L(Y_i, µ̂_{t0}) 1_{{X_{i,k} ≤ c}} + L(Y_i, µ̂_{t1}) 1_{{X_{i,k} > c}} ) ],   (6.3)
This may look complicated, but it (only) says that we try to find the leaf Xt̂ , the covariate
component Xk̂ and the split level ĉ that provides the biggest decrease in loss (6.3) by this
additional SBS. Moreover, the weighted empirical means (6.4) are the optimal predictors
on both parts of the partitioned leaf Xt w.r.t. the selected strictly consistent loss function
L. Therefore, the objective function in (6.3) is lower bounded by zero.
The solution of (6.3) gives us the new leaves, i.e., the descendants of X_t̂ defined by

X_{t̂0} = { x ∈ X_t̂ ; x_k̂ ≤ ĉ }   and   X_{t̂1} = { x ∈ X_t̂ ; x_k̂ > ĉ },

and the empirical weighted means µ̂_{t̂0} and µ̂_{t̂1} on the new leaves X_{t̂0} and X_{t̂1}, respectively.
This fully explains the SBS recursive partitioning algorithm, and this is the standard
regression tree algorithm usually used.
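A minimal sketch of such a recursive partitioning fit for claim frequencies, using the Poisson deviance as strictly consistent loss and the exposures v as case weights on the observed frequencies Y/v; the arrays X, Y, v and all hyper-parameter values are assumptions for illustration.

from sklearn.tree import DecisionTreeRegressor

# SBS recursive partitioning with the Poisson deviance as splitting criterion
tree = DecisionTreeRegressor(criterion="poisson", max_depth=3, min_samples_leaf=1000)
tree.fit(X, Y / v, sample_weight=v)
mu_leaf = tree.predict(X)    # weighted empirical leaf means, compare (6.4)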
We close this section with some remarks.
• The selection of the optimal split level ĉ in (6.3) is not unique because we work
with a finite sample on every leaf Xt . Typically, the split levels c ∈ R are chosen
precisely in the middle of adjacent observed covariate values to make them unique.
• In practice, one only considers potential SBS that exceed a minimal number of
instances in both descendants Xt0 and Xt1 of leaf Xt , otherwise one cannot reli-
ably estimate the weighted empirical means (6.4). In implementations, there is a
hyper-parameter related to the minimal leaf size which precisely accounts for this
constraint. The choice of the minimal leaf size depends on the problem to be solved,
e.g., for car insurance frequencies one typically requires 1000 insurance policies to
receive reliable frequency estimates.
• It may happen that in one leaf there are only instances with zero claims, which
(of course) is a very homogeneous leaf. However, the mean estimate (6.4) typically
leads to a degenerate model in that case. Therefore, often Bühlmann credibility is
used with a credibility coefficient (shrinkage parameter) being selected as a hyper-
parameter; technically this is precisely done as in (2.17); see Therneau–Atkinson
[215].
• We did not discuss the stopping rule of the recursive partitioning algorithm. Of course, we should prevent over-fitting. However, designing a good stopping rule is usually not feasible, also because optimization (6.3) only focuses on the next best split (greedy search), but a poor next split may enable an excellent split thereafter. To account for such flexibility, a large binary regression tree can be
constructed in a first step. In a second step, all parts of the large tree are pruned,
if they do not sufficiently contribute to the required homogeneity in relation to
the parameters involved; this is measured by analyzing how much a certain split
contributes to the decrease in loss (including all its descendants and accounting
for their complexity). This pruning step uses best-subset selection regularization
(2.24), and, importantly, it can be performed efficiently by another recursive algo-
rithm. The details are rather technical, they were proved in Breiman et al. [31],
and, aligned to our notation, we refer to Wüthrich–Buser [241, Section 6.2]. We do
not discuss this any further because we are not going to use regression trees in their pure form.
6.2 Bagging
In view of the missing robustness of the plain-vanilla regression tree construction dis-
cussed in the previous section, there were many attempts to robustify regression tree
predictors. For this, the non-parametric bootstrap, discussed in Section 1.5.4, is com-
bined with aggregating, discussed in Section 5.4.1, resulting in Breiman’s [29] bagging
proposal.
We start by revisiting aggregation. As mentioned, the estimated regression tree related
to (6.2) lacks robustness. Assume we have M independent learning samples L(j) , 1 ≤
j ≤ M , that follow the same data generating mechanism, and which have sample size
n. This allows us to construct M independent regression tree predictors (using the same
methodology, but different independent learning data)
µ̂^(j)(X) = Σ_{t∈T^(j)} µ̂_t^(j) 1_{{X ∈ X_t^(j)}},
where the upper index 1 ≤ j ≤ M denotes the different estimated regression trees. Since,
by assumption, the underlying learning samples L(j) and the resulting regression trees
are i.i.d., the law of large numbers applies

lim_{M→∞} (1/M) Σ_{j=1}^{M} µ̂^(j) = E[µ̂^(1)],   (6.5)
a.s., and the randomness asymptotically vanishes, see (5.16). This highlights the advan-
tages of aggregating, namely, the randomness (from the finite samples) asymptotically
vanishes. On the other hand, (6.5) also indicates a main issue of this technique. Namely,
we have convergence to a deterministic limit E[µ̂^(1)], but there is no guarantee that this limit is close to the true regression function µ*, i.e., if the individual regression tree constructions µ̂^(j) are biased (in some systematic way), so will the limit be. Therefore,
aggregation is only a method to diminish uncertainty through randomness in the learning
samples, but not for mitigating a (systematic) bias in the construction.
Example 6.1. An easy example for a biased estimation procedure is the following. If we set µ̂^(j) = max_{1≤i≤n} Y_i^(j), we certainly get a positive bias in this mean estimation procedure since the limit in (6.5) is equal to E[max_{1≤i≤n} Y_i^(1)] > E[Y_1^(1)] = µ* in any non-deterministic situation with non-comonotonic responses and n > 1. ■
For aggregation, we need multiple independent learning samples L^(j) and predictors µ̂^(j), respectively. Similarly to Section 5.4, it is not immediately clear where we can get these independent samples from. Breiman's [29] b in bagging refers to bootstrap simulation, or more precisely to the non-parametric bootstrap discussed in Section 1.5.4. Starting from the observed learning sample L = (Y_i, X_i, v_i)_{i=1}^n, we draw with replacements independent bootstrap samples L^(⋆j) = (Y_i^(⋆j), X_i^(⋆j), v_i^(⋆j))_{i=1}^n, where 'independent' applies to the drawing with replacements. The resulting bootstrapped learning samples (L^(⋆j))_{j=1}^M are conditionally i.i.d., given the learning sample L. From these, we can construct conditionally i.i.d. regression tree predictors µ̂^(⋆j) to which the law of large numbers applies, a.s.,

lim_{M→∞} (1/M) Σ_{j=1}^{M} µ̂^(⋆j) = E[ µ̂^(⋆1) | L ].
The same remark about the bias applies as above, but this time the bias additionally
depends on the specific observations in the learning sample L, e.g., having a small sample
with an outlier will likely result in a largely biased predictor, if the outlier is not properly
controlled. On the other hand, the out-of-bag method (unique to non-parametric boot-
strapping) gives one an easy (and integrated) cross-validation technique that may allow
to detect such biases; for out-of-bag validation, see (1.22).
A general issue with bagging is that the individual bootstrapped regression trees µ̂^(⋆j) are highly correlated because the identical observations are recycled many times. This mutual dependence makes this whole modeling approach not very efficient, and hence these bagged regression trees are not used in actuarial science. Random forests, discussed in the next section, precisely try to improve on this point.
large regression trees frequently missing the optimal split provides some over-fitting but
also a more random tree construction. This precisely has a decorrelating effect, resulting
in the random forests predictor discussed next.
As for bagging, we generate i.i.d. bootstrap samples L^(⋆j) = (Y_i^(⋆j), X_i^(⋆j), v_i^(⋆j))_{i=1}^n from the learning sample L = (Y_i, X_i, v_i)_{i=1}^n by drawing with replacements. For each bootstrap sample L^(⋆j), we construct a large and noisy regression tree estimator µ̂^(⋆j) as follows. Consider the j-th bootstrap sample L^(⋆j), and assume we have constructed a binary tree (X_t^(⋆j), µ̂_t^(⋆j))_{t∈T} on that bootstrap sample that we want to further partition similar to (6.3). To add randomness, we select in each loop of the SBS recursive partitioning algorithm a non-empty random subset Q ⊂ {1, . . . , q} of the covariate components X = (X_1, . . . , X_q)^⊤, and we only consider the components in Q for the next SBS, that is, we replace (6.3) by

(t̂, k̂, ĉ) = arg max_{t∈T, k∈Q, c∈R} Σ_{i: X_i^(⋆j) ∈ X_t^(⋆j)} (v_i^(⋆j)/φ) [ L(Y_i^(⋆j), µ̂_t^(⋆j)) − ( L(Y_i^(⋆j), µ̂_{t0}^(⋆j)) 1_{{X_{i,k}^(⋆j) ≤ c}} + L(Y_i^(⋆j), µ̂_{t1}^(⋆j)) 1_{{X_{i,k}^(⋆j) > c}} ) ],   (6.6)

the main difference to (6.3) being the random set Q, highlighted in magenta color in (6.6). This algorithm gives us a randomized and bootstrapped regression tree predictor µ̂^(⋆j)(X) for each bootstrap sample L^(⋆j), 1 ≤ j ≤ M.
Aggregating over these regression trees allows us to define the random forest predictor

µ̂^RF(X) = (1/M) Σ_{j=1}^{M} µ̂^(⋆j)(X).   (6.7)
• By sampling a true subset Q ⫋ {1, . . . , q} we may miss the optimal SBS. This
introduces more randomness and decorrelation for i.i.d. sets Q in each iteration of
the recursive partitioning algorithm.
• Often, Q is set to have a fixed size in all recursive partitioning steps, popular choices are √q or ⌊q/3⌋ for the size of Q.
• Generally, random forest predictors are not as competitive as networks and GBMs,
that is why these techniques are not used very frequently. Moreover, random forests
can be computationally intensive, i.e., constructing large trees on potentially many
high-cardinality categorical covariates can severely impact the fitting time.
• Standard random forest packages often work under the Gaussian loss assumption
which is not appropriate in many actuarial problems, and this loss cannot easily be
replaced in these implementations.
– First, they may help to detect interactions, and if there are interactions a more
sophisticated method can be used.
– Second, they are used as a surrogate model for explainability, because through
the splitting mechanism they provide a simple variable importance measure.
This is explained next.
There is, though, one nice application of random forests that is used in practice. Assume we have fitted a regression model µ̂: X → R to the learning data L = (Y_i, X_i, v_i)_{i=1}^n, and we would like to have a measure of variable importance, meaning that we would like to measure which of the components X_k of the covariates X has a big impact on the regression function. A possible solution to this question is to fit a random forest surrogate model µ̂^RF to the regression function µ̂. That is, we fit a random forest regression function µ̂^RF to the learning data L̂ = (µ̂(X_i), X_i)_{i=1}^n by minimizing the square loss

(1/n) Σ_{i=1}^{n} ( µ̂(X_i) − µ̂^RF(X_i) )^2.
If we find an accurate random forest regression model µ̂^RF ≈ µ̂, we can use this random forest as a surrogate model for analyzing variable importance. This random forest is an ensemble over multiple regression trees (µ̂^(⋆j))_{j=1}^M, see (6.7), and we can analyze all the SBS that lead to these regression trees (µ̂^(⋆j))_{j=1}^M. Each SBS can be allocated to a covariate component X_k, 1 ≤ k ≤ q, and aggregating the decreases of losses (6.6) for each component 1 ≤ k ≤ q gives us a measure of variable importance.
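A minimal sketch of this surrogate construction, assuming mu_hat holds the predictions of the already fitted regression model on the covariates X; the hyper-parameter values are illustrative.

from sklearn.ensemble import RandomForestRegressor

# random forest surrogate model fitted (square loss) to the fitted regression function
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", min_samples_leaf=1000)
rf.fit(X, mu_hat)
importance = rf.feature_importances_   # aggregated loss decreases per covariate component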
This chapter presents boosting and, in particular, gradient boosting machines (GBMs). GBMs are among the most powerful machine learning methods on tabular data, often outperforming the FNN architectures studied in Chapter 5. The concept of boosting has already been mentioned in Section 5.6, where we boosted a GLM with FNN features. Before going into the theory of GBMs, we first take a look at the idea of using (additive) iterative updating, i.e., boosting; we then study GBMs, and in the last part of this chapter we discuss XGBoost and LightGBM, which are the state-of-the-art GBMs these days.
subject to existence and uniqueness. This gives us the updated regression function estimate

µ̂^(j)(X) = g^{−1}( g(µ̂^(j−1)(X)) + b(X; ϑ̂^(j)) ) = g^{−1}( Σ_{s=0}^{j} b(X; ϑ̂^(s)) ),   (7.2)
where we set for the initialization b(X; ϑ̂^(0)) the homogeneous MLE given by b(X; ϑ̂^(0)) ≡ g(µ̂_0) = g( Σ_{i=1}^{n} v_i Y_i / Σ_{i=1}^{n} v_i ).
Note that (7.1) is a natural generalization of the one-step boosting described in Section 5.6, see formula (5.22). The only difference is that in that earlier section we used very specific base learners. Moreover, (7.2) stresses the close connection to additive modeling, where the update in iteration j is based on trying to capture the remaining signal after having adjusted the intercept according to

g( µ̂^(j−1)(X) ) = Σ_{s=0}^{j−1} b(X; ϑ̂^(s)).
This can be reinterpreted by noting that the base learner b(X; ϑ̂^(j)) tries to find the weaknesses of the previous regression model µ̂^(j−1)(X). This is rather different from ensembling as described in Section 5.4.
The pseudo algorithm for (generalized) additive boosting is given in Algorithm 1, and
is sometimes also referred to as stagewise additive boosting, see, e.g., Hastie et al. [93,
Algorithm 10.2].
Initialize.
– Set the initial mean estimate to the global empirical mean µ̂^(0)(X) = µ̂_0.
– Select the maximum number of boosting iterations j_max ≥ 1.
Iterate.
while 1 ≤ j ≤ j_max do
   Update
   µ̂^(j)(X) = g^{−1}( g(µ̂^(j−1)(X)) + b(X; ϑ̂^(j)) ),
end.
Return.
µ̂^boost(X) := µ̂^(j_max)(X).
• Apart from estimating the base learners’ parameters ϑb(j) , one needs to decide on the
maximum number of boosting iterations jmax . In practice, one picks a large jmax
The intuition behind this step is to avoid taking too large steps in each iteration
of the algorithm.
■
Example 7.2. Assume that the true data generating mechanism is given by

Y | X ∼ N( µ*(X), σ² ),   (7.3)

where µ*(X) ∈ R is an unknown mean function, and σ² > 0 is an unknown variance parameter. For simplicity, we set v_i ≡ 1. Assume furthermore that we have an i.i.d. learning sample L = (Y_i, X_i)_{i=1}^n generated from (7.3).
Our goal is to approximate µ*(X) with base learners b(X; ϑ) using Algorithm 1 under the square loss and with the identity link choice for g. In view of (7.1), this gives us for the loss in iteration j (we drop φ = σ² in the following identities)

Σ_{i=1}^{n} L( Y_i, g^{−1}( g(µ̂^(j−1)(X_i)) + b(X_i; ϑ) ) ) = Σ_{i=1}^{n} ( Y_i − µ̂^(j−1)(X_i) − b(X_i; ϑ) )² = Σ_{i=1}^{n} ( ε̂_i^(j) − b(X_i; ϑ) )²,

where we set

ε̂_i^(j) = Y_i − µ̂^(j−1)(X_i).

Thus, the base learners b(X_i; ϑ) try to exploit the present residuals ε̂_i^(j) to find the next optimal parameter ϑ̂^(j). This is analogous to (5.19). ■
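A minimal sketch of Algorithm 1 in the setting of Example 7.2 (square loss, identity link, small regression trees as base learners); the arrays X, Y as well as j_max and the shrinkage factor are assumptions for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# stage-wise additive boosting of the residuals (matching pursuit / L2 boosting)
j_max, shrinkage = 100, 0.1
mu = np.full(len(Y), Y.mean())                 # homogeneous initialization mu^(0)
for j in range(1, j_max + 1):
    res = Y - mu                               # residuals eps_i^(j)
    base = DecisionTreeRegressor(max_depth=2).fit(X, res)
    mu = mu + shrinkage * base.predict(X)      # additive update (7.2)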
The boosting procedure described in Example 7.2 is also known as matching pursuit;
see, e.g., Mallat–Zhang [148]. Matching pursuit was introduced in the signal processing
literature. Typically, it focuses on the square loss which is most suitable for Gaussian
responses, see Table 2.2. The situation in Example 7.2 is discussed in detail in Bühlmann
[37], when using low depth and low interaction regression trees based on the square loss in
(6.3), and this reference also includes results on convergence. The Gaussian assumption
used in Example 7.2 is a special case of Tweedie’s family, see Table 2.3, and it is possible
to obtain similar iterative boosting schemes for Tweedie’s family under the log-link choice.
This is known as response boosting; see Hainaut et al. [90]. It is, however, important to
note that when using a general Tweedie’s deviance loss with trees as base learners, one
should, of course, also optimize the regression trees in (6.3) w.r.t. that Tweedie’s deviance
loss.
for learning rate $\eta^{(j)} > 0$. By choosing a sufficiently small learning rate $\eta^{(j)} > 0$, it is
possible to ascertain a loss improvement (unless the gradient in (7.5) is zero)
$$L_g\big(Y_i, \hat{\vartheta}^{(j)}\big) \;\le\; L_g\big(Y_i, \hat{\vartheta}^{(j-1)}\big).$$
One option to select the learning rate $\eta^{(j)} > 0$ is to use so-called full relaxation, which
corresponds to doing a line search according to
$$\hat{\eta}^{(j)} = \underset{\eta > 0}{\arg\min}\; L_g\Big(Y_i,\, \hat{\vartheta}^{(j-1)} - \eta\, v_i \nabla_\vartheta L_g\big(Y_i, \hat{\vartheta}^{(j-1)}\big)\Big), \qquad (7.6)$$
compare with (7.1). For more on this, including conditions for convergence and conver-
gence rates; see Nesterov [166].
Learning an unknown function can be approached similarly to learning an unknown
parameter, and this is the intuition behind GBMs. Again consider a single instance i
with observation $(Y_i, X_i, v_i)$. By using abbreviation (7.4), we obtain
$$\frac{v_i}{\phi}\, L\big(Y_i, \mu(X_i)\big) = v_i\, L_g\big(Y_i,\, g(\mu(X_i))\big).$$
Differentiating the right-hand side w.r.t. $\vartheta = g(\mu(X_i))$ allows us to rewrite the standard
gradient descent step (7.5) according to
$$g\big(\hat{\mu}^{(j)}(X_i)\big) = g\big(\hat{\mu}^{(j-1)}(X_i)\big) - \eta^{(j)}\, v_i \nabla_\vartheta L_g\Big(Y_i,\, g\big(\hat{\mu}^{(j-1)}(X_i)\big)\Big). \qquad (7.7)$$
If we apply this iteration (7.7) for each instance 1 ≤ i ≤ n, it will converge (under suitable
learning rates) to a saturated model. Naturally, this provides an in-sample over-fitted
model as each instance i receives its individual mean parameter estimate.
Instead, the crucial step is to approximate the gradients $v_i \nabla_\vartheta L_g\big(Y_i, g(\hat{\mu}^{(j-1)}(X_i))\big)$ by
base learners that are simple functions (low-dimensional objects), such as trees. This
regularizes the problem and reduces its dimension.
Define, in iteration $j \ge 1$, the working responses for $1 \le i \le n$ by
$$r_i^{(j)} = -\,v_i \nabla_\vartheta L_g\Big(Y_i,\, g\big(\hat{\mu}^{(j-1)}(X_i)\big)\Big). \qquad (7.8)$$
For GBMs, one fits a base learner from the parametrized class $\mathcal{B} = \{X \mapsto b(X; \vartheta)\}_\vartheta$
to the new learning sample $\mathcal{L}^{(j)} = (r_i^{(j)}, X_i)_{i=1}^n$. Since this is a regularization step, and
since the idea is that the base learners should iteratively learn a gradient approximation,
this suggests using the square loss for this approximation step
$$\hat{\vartheta}^{(j)} = \underset{\vartheta}{\arg\min} \sum_{i=1}^n \Big(r_i^{(j)} - b(X_i; \vartheta)\Big)^2. \qquad (7.9)$$
This implies that (the saturated) standard gradient descent iteration (7.7) is replaced by
its regularized gradient-approximated version, called GBM step,
$$\hat{\mu}^{(j)}(X_i) = g^{-1}\Big(g\big(\hat{\mu}^{(j-1)}(X_i)\big) + \eta^{(j)}\, b\big(X_i; \hat{\vartheta}^{(j)}\big)\Big). \qquad (7.10)$$
Here, one can note the close resemblance between (7.10) and the updating step in the
generalized additive boosting of Algorithm 1. In order to make the connection to gener-
alized additive boosting even stronger, note that the full relaxation corresponds to doing
a full line search w.r.t. η (j) in analogy to (7.6). That is, solve
$$\hat{\eta}^{(j)} = \underset{\eta > 0}{\arg\min} \sum_{i=1}^n v_i\, L\Big(Y_i,\, g^{-1}\big(g(\hat{\mu}^{(j-1)}(X_i)) + \eta\, b(X_i; \hat{\vartheta}^{(j)})\big)\Big). \qquad (7.11)$$
By iterating over (7.8)-(7.11), in analogy with the generalized additive boosting algo-
rithm, see Algorithm 1, we obtain the general GBM procedure from Friedman [70]. This
algorithm is summarized in Algorithm 2.
Remarks 7.3. • It is common to add yet another learning rate to Algorithm 2, the
intuition again being that small steps are less harmful than overly large ones;
if a step is too small, one can always “catch up” in the coming iterations by
taking further similar small steps.
• Note that depending on which software implementation one is using, the loss
function may or may not include a pre-defined link function.
Algorithm 2: gradient boosting machine (GBM)
Initialize.
– Set the initial mean estimate to the global empirical mean $\hat{\mu}^{(0)}(X) = \hat{\mu}_0$.
– Select the maximum number of boosting iterations $j_{\max} \ge 1$.
Iterate. While $1 \le j \le j_{\max}$ do:
1. Calculate the working responses $r_i^{(j)}$ from (7.8).
2. Fit a base learner from $\{b(X; \vartheta)\}_\vartheta$ to the working responses according to (7.9).
3. Calculate the optimal step length $\hat{\eta}^{(j)}$ according to (7.11).
4. Update
$$\hat{\mu}^{(j)}(X) = g^{-1}\Big(g\big(\hat{\mu}^{(j-1)}(X)\big) + \hat{\eta}^{(j)}\, b\big(X; \hat{\vartheta}^{(j)}\big)\Big),$$
and increase $j$.
End.
Return. $\hat{\mu}^{\rm GBM}(X) := \hat{\mu}^{(j_{\max})}(X)$.
• In Algorithm 2, the weights $v_i > 0$ are included, and depending on the software
implementation, one needs to make sure that the weights, intercepts and offsets,
respectively, are used properly.
• The general GBM described in Algorithm 2 shares the same problems as general-
ized additive boosting w.r.t. potential over-fitting, and one should again use early
stopping based on cross-validation, or a similar technique, to prevent over-fitting.
• Consistency and convergence properties for GBMs tend to become technical. Such
results can be found in Zhang–Yu [248], where the GBM from Algorithm 2
appears as a special case of a more general greedy boosting algorithm.
■
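To make Algorithm 2 concrete, the following minimal Python sketch (our own illustration with simulated data and hypothetical names, not the notebooks accompanying these notes) runs the GBM for the Poisson deviance loss under the log-link: the working responses (7.8) are the negative gradients on the linked scale, a shallow square-loss tree approximates them as in (7.9), and the step length is found by the line search (7.11).

```python
import numpy as np
from scipy.special import xlogy
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def poisson_deviance(y, mu, v):
    # weighted Poisson deviance loss (up to the dispersion phi)
    return np.sum(v * 2 * (mu - y + xlogy(y, y / mu)))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 4))
v = np.ones(2000)                                        # exposures v_i
y = rng.poisson(np.exp(-1 + X[:, 0] + 0.5 * X[:, 1]))    # Poisson responses

theta = np.full(2000, np.log(np.average(y, weights=v)))  # g(mu^{(0)}) = log of the global mean
for j in range(100):
    mu = np.exp(theta)
    r = 2 * v * (y - mu)                                 # working responses (7.8) on the log-link scale
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)  # square-loss gradient approximation (7.9)
    b = tree.predict(X)
    # line search (7.11) for the optimal step length eta^{(j)}
    eta = minimize_scalar(lambda e: poisson_deviance(y, np.exp(theta + e * b), v),
                          bounds=(0.0, 10.0), method="bounded").x
    theta = theta + eta * b                              # GBM update (7.10)

print("final in-sample deviance:", poisson_deviance(y, np.exp(theta), v))
```

In a real application, the number of iterations would again be chosen by early stopping on out-of-sample data.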
The close connections between additive boosting and general GBMs have already been
established above, and to stress this further we continue Example 7.2:
Example 7.4. We revisit the set-up of Example 7.2. Calculating the working responses
according to (7.8) for the square loss we get
$$r_i^{(j)} = -\nabla_\vartheta L\big(Y_i, \hat{\mu}^{(j-1)}(X_i)\big) = 2\big(Y_i - \hat{\mu}^{(j-1)}(X_i)\big) \;\propto\; Y_i - \hat{\mu}^{(j-1)}(X_i),$$
where constants not depending on i have been dropped; in this example we assumed
$v_i \equiv 1$. Fitting a base learner $b(X; \vartheta)$ to the new learning sample $\mathcal{L}^{(j)} = (r_i^{(j)}, X_i)_{i=1}^n$
using the square loss is equivalent to minimizing the following expression in $\vartheta$, see (7.9),
$$\sum_{i=1}^n \Big(r_i^{(j)} - b(X_i; \vartheta)\Big)^2
= \sum_{i=1}^n \Big(Y_i - \hat{\mu}^{(j-1)}(X_i) - b(X_i; \vartheta)\Big)^2
= \sum_{i=1}^n \Big(Y_i - \big(\hat{\mu}^{(j-1)}(X_i) + b(X_i; \vartheta)\big)\Big)^2.$$
The latter expression is equivalent to additive boosting (7.1) under the square loss func-
tion choice and for the identity link for g. ■
Low interaction regression trees have a small cardinality $|T|$, i.e., they only consider a few
binary splits. Using a fixed small cardinality $|T|$, we select the class $\mathcal{B}$ of regression tree
base learners by, see (6.2),
$$b(X; \vartheta) = \sum_{t \in T} m_t\, \mathbb{1}_{\{X \in \mathcal{X}_t\}}, \qquad (7.12)$$
where the parameter $\vartheta$ collects the full characterization of the binary regression tree (7.12).
We insert this class $\mathcal{B}$ of regression trees of fixed small cardinality $|T|$ as base learners
in Algorithm 2.
We reconsider the line search (7.11) for the optimal selection of the learning rate $\eta^{(j)} > 0$
in the j-th iteration of Algorithm 2. This line search can be replaced by directly
updating the piecewise constant estimates $(m_t)_{t \in T}$ on every leaf. That is, for all $t \in T$
$$\hat{m}_t^{(j)} = \underset{m \in \mathbb{R}}{\arg\min} \sum_{i:\, X_i \in \hat{\mathcal{X}}_t^{(j)}} v_i\, L_g\Big(Y_i,\, g\big(\hat{\mu}^{(j-1)}(X_i)\big) + m\Big). \qquad (7.13)$$
This minimization relies on first having fitted a regression tree to the working responses
to obtain the covariate space partition $(\hat{\mathcal{X}}_t^{(j)})_{t \in T}$ of the covariate space $\mathcal{X}$, see (7.9).
This results in the tree-based gradient boosting predictor in the j-th iteration
$$\hat{\mu}^{(j)}(X) = g^{-1}\bigg(g\big(\hat{\mu}^{(j-1)}(X)\big) + \sum_{t \in T} \hat{m}_t^{(j)}\, \mathbb{1}_{\{X \in \hat{\mathcal{X}}_t^{(j)}\}}\bigg). \qquad (7.14)$$
As mentioned above, tree-based GBMs have become very popular and can be found in
off-the-shelf software, such as the gbm package in R. When using one of these packages,
one always needs to carefully check what link functions g are implemented and how
the algorithm deals with the weights vi ; we also refer to Remark 2.3. The tree-based
GBM procedure is summarized in Algorithm 3, which corresponds to the regression tree
boosting described in Friedman [70].
Algorithm 3: tree-based gradient boosting machine
Initialize.
– Set the initial mean estimate to the global empirical mean $\hat{\mu}^{(0)}(X) = \hat{\mu}_0$.
– Select the maximum number of boosting iterations $j_{\max} \ge 1$.
– Fix the cardinality $|T|$ of the regression tree base learners.
Iterate. While $1 \le j \le j_{\max}$ do:
1. Calculate the working responses $r_i^{(j)}$ from (7.8).
2. Fit a regression tree (7.12) of cardinality $|T|$ to the working responses using
the greedy minimization of (7.9).
3. Perform optimal leaf adjustments according to (7.13).
4. Update
$$\hat{\mu}^{(j)}(X) = g^{-1}\bigg(g\big(\hat{\mu}^{(j-1)}(X)\big) + \sum_{t \in T} \hat{m}_t^{(j)}\, \mathbb{1}_{\{X \in \hat{\mathcal{X}}_t^{(j)}\}}\bigg),$$
and increase $j$.
End.
Return. $\hat{\mu}^{\rm tree\text{-}GBM}(X) := \hat{\mu}^{(j_{\max})}(X)$.
• Note that compared with tree-based additive boosting, a tree-based GBM only
makes use of square loss fitted trees, irrespective of the underlying data generating
process motivating the use of the loss L(Y, µ). This makes the tree-based GBMs
easy to apply to custom distributions. Furthermore, it is no problem to use
count data responses or binary responses (classification). For general binary tree-
fitting, see, e.g., the discussion concerning (6.3) above, and Hainaut et al. [90] on
response boosting.
• In many applications, so-called tree stumps are used as base learners; tree stumps
consider one single split and have |T| = 2 leaves. We suggest to consider bigger
trees, because bigger trees promote interaction modeling. With tree stumps and
the log-link choice for g, the GBM results in a multiplicative model; see Wüthrich–
Buser [241, Example 7.5].
■
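The following minimal Python sketch (our own illustration with simulated data) implements Algorithm 3 for the Poisson deviance loss with log-link. For this loss, the leaf adjustment (7.13) has the closed-form solution $\hat{m}_t^{(j)} = \log\big(\sum v_i Y_i / \sum v_i \hat{\mu}^{(j-1)}(X_i)\big)$, where the sums run over the instances in leaf t; scikit-learn's apply() is used to identify leaf membership, and a small floor guards against leaves containing only zero responses.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 4))
v = np.ones(2000)                                        # exposures v_i
y = rng.poisson(np.exp(-1 + X[:, 0] + 0.5 * X[:, 1]))    # Poisson claim counts
theta = np.full(2000, np.log(np.average(y, weights=v)))  # g(mu^{(0)}), log-link

for j in range(100):
    mu = np.exp(theta)
    r = 2 * v * (y - mu)                                      # working responses (7.8)
    tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, r)  # low-cardinality tree, |T| = 4
    leaf_id = tree.apply(X)                                   # leaf membership of every instance
    for t in np.unique(leaf_id):
        idx = leaf_id == t
        # optimal leaf adjustment (7.13); closed form for the Poisson deviance with log-link
        m_t = np.log(max(np.sum(v[idx] * y[idx]), 1e-8) / np.sum(v[idx] * mu[idx]))
        theta[idx] += m_t                                     # update (7.14) on the linked scale
```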
where bik (T)2 corresponds to the loss improvement in the k-th split in T, i.e., this is
obtained by evaluating the objective function of (6.3) in the selected split. Consequently,
if we are using a method which is based on jmax additively boosted trees, Tj , 1 ≤ j ≤ jmax ,
this suggests considering the average VI-score for covariate component $1 \le l \le q$ given by
$$\mathrm{VI}_l = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \mathrm{VI}_l(T_j). \qquad (7.16)$$
For more on VI-scores and other tree-based methods, see, e.g., Hastie et al. [93, Chapter
10.13], and the references therein.
Remarks 7.6. • Note that the VI-scores from (7.15) and (7.16) naturally balance the
importance of covariate components that occur rarely in splits but have large
loss reductions against covariate components that occur often in splits but with low
loss reductions.
• When using these VI-scores in GBMs it is typically the case that the importance
is measured on the tree scale, hence, focusing on the loss improvement w.r.t. the
gradient approximations, not w.r.t. the (empirical) loss given by a deviance loss
function.
■
In the previous sections, the basic ideas underpinning (generalized additive) boosting and
gradient boosting were presented, with an extra focus on the situation when using low-cardinality
trees as base learners. These base learners are the most commonly used ones
in practice.
LightGBM. A popular high-speed version of GBMs targeting the issue of the high-
cardinality nominal categorical covariates is LightGBM by Ke et al. [116]. The idea
behind LightGBM is to (i) focus on data instances with large gradients, and (ii) to ex-
ploit sparsity in the covariate space. Step (i) is what is referred to as gradient-based one
sided sampling (GOSS) and is an alternative to uniform sampling in bagging, and step
(ii) is what is called exclusive feature bundling (EFB), which is a type of histogram pro-
cedure combining both covariate (feature) selection and merging of covariates (features).
The intuition behind the merging of covariates is that covariates whose one-hot encoded
categories are not jointly active can be merged, which is likely to happen if the covariate
space is large. For more details, see Ke et al. [116]. The LightGBM method has
proven to be both fast and accurate.
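As an illustration only (a minimal sketch; the data names df, y and v are hypothetical, and the lightgbm Python package is assumed to be installed), a Poisson claim frequency LightGBM model can be fitted through the scikit-learn style interface. Categorical covariates stored as pandas category columns are handled natively by the histogram procedure; whether the exposure enters as a weight or as an offset depends on the chosen parametrization, see the remarks on weights and offsets above.

```python
import lightgbm as lgb

# df: DataFrame of covariates (categorical columns with dtype "category"),
# y: claim counts, v: exposures -- all hypothetical user data
model = lgb.LGBMRegressor(
    objective="poisson",     # Poisson deviance loss with log-link
    n_estimators=500,        # j_max, to be tuned with early stopping
    learning_rate=0.05,
    num_leaves=8,            # low-cardinality trees as base learners
)
model.fit(df, y, sample_weight=v)
```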
XGBoost. Another popular fast gradient based boosting procedure is XGBoost, see
Chen–Guestrin [41]. It is based on a functional second order approximation of the loss
function. In terms of the previously used notation for a single instance i, this can be
written as the second-order Taylor expansion
$$L_g^{\rm XGB}\big(Y_i, \vartheta\big) \;=\; L_g\big(Y_i, \hat{\vartheta}_i\big) + \nabla_\vartheta L_g\big(Y_i, \hat{\vartheta}_i\big)\big(\vartheta - \hat{\vartheta}_i\big) + \tfrac{1}{2}\,\nabla_\vartheta^2 L_g\big(Y_i, \hat{\vartheta}_i\big)\big(\vartheta - \hat{\vartheta}_i\big)^2, \qquad (7.17)$$
expanded around the current (linked) estimate $\hat{\vartheta}_i = g(\hat{\mu}^{(j-1)}(X_i))$, where $\nabla_\vartheta^2$ denotes the
second derivative (Hessian) w.r.t. $\vartheta$. Consequently, even though
derivatives appear in $L_g^{\rm XGB}(\cdot)$ from (7.17), they do not appear in the same way as in the
GBMs described by Algorithms 2 and 3. In fact, this second order Taylor expansion (7.17)
is related to a Newton step and the working residuals are suitably scaled by their Hessians
before being approximated by the base learners. Thus, effectively, (7.9) is replaced by
a Hessian scaled and weighted version. Furthermore, by using low-cardinality trees as
base learners applied to $L_g^{\rm XGB}(\cdot)$ from (7.17), the leaf values for a given part of the
partition are given explicitly, and so is the criterion for finding the greedy optimal split
point. Thus, using the approximate loss $L_g^{\rm XGB}(\cdot)$ from (7.17) combined
with trees makes it possible to skip (costly) line searches. Note that this will also result
in a different type of tree than standard trees that are grown recursively leaf-wise; for
more details, see Chen–Guestrin [41]. XGBoost also allows for regularization by using
a penalized deviance loss (log-likelihood), and it can be equipped with histogram-based
techniques handling high-cardinality nominal categorical covariates. For more details,
see Chen-Guestrin [41].
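Analogously to the LightGBM sketch above, a hedged minimal example of the XGBoost scikit-learn interface for Poisson claim counts could look as follows (hypothetical data names; the xgboost Python package is assumed to be installed).

```python
import xgboost as xgb

# X_num: numeric design matrix, y: claim counts, v: exposures -- hypothetical user data
model = xgb.XGBRegressor(
    objective="count:poisson",   # Poisson deviance loss, log-link
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,                 # shallow trees as base learners
    tree_method="hist",          # histogram-based split finding
    reg_lambda=1.0,              # penalization of the leaf values
)
model.fit(X_num, y, sample_weight=v)
```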
Multi-parametric losses and further extensions. Above, the loss functions considered
have all been effectively one-dimensional, trying to learn an unknown regression
function $X \mapsto \mu(X) \in \mathbb{R}$ in the presence of a nuisance parameter $\phi \in \mathbb{R}_+$, referred to
as a dispersion parameter. The above discussed boosting techniques can be naturally
extended to the situation with a functional dispersion φ(X), or more generally, when
the loss function is expressed in terms of a p-dimensional real-valued argument ϑ ∈ Rp ,
and we want to learn an unknown p-dimensional function X 7→ ϑ(X); we use boldface
notation in ϑ to emphasize that this is a multi-dimensional object. This situation is
similar to considering a multi-task (and multi-output) FNN, see Remarks 5.2. Exam-
ples of methods addressing this situation are gamboostLSS, see Mayr et al. [149],
NGBoost, see Duan et al. [59], and Cyclic GBMs (CGBMs), see Delong et al. [50]. Both
gamboostLSS and CGBMs use cyclic updating over the p parameter dimensions of $\boldsymbol{\vartheta}$. In
addition, gamboostLSS allows for component-wise optimization, which means that one
may specify covariate-specific base learners. NGBoost uses so-called natural gradients,
which aim at improving speed and stability.
8 Deep learning for tensor and unstructured data

8.1 Introduction
This chapter builds on the feed-forward neural network (FNN) architecture that was
introduced in Chapter 5. The FNN architecture can be seen as a prototype of more
sophisticated deep learning architectures, with the recurrent goal of an optimal feature
extraction for predictive modeling. The FNNs discussed in Chapter 5 act on so-called
tabular input data, which means that one has a q-dimensional cross-section X i,t ∈ Rq of
(structured) real-valued input data over all instances 1 ≤ i ≤ n at a given time point t.
It is useful to interpret t as discrete time. More generally, it is just a positional index.
This structured data has the format of q-dimensional real-valued vectors, and it is called
tabular because we can collect the cross-sectional input data (X i,t )ni=1 at a given time
point t in a table Xt , resulting in the design matrix at time t
$$X_t = [X_{1,t}, \ldots, X_{n,t}]^\top = \begin{pmatrix} X_{1,t,1} & \cdots & X_{1,t,q} \\ \vdots & \ddots & \vdots \\ X_{n,t,1} & \cdots & X_{n,t,q} \end{pmatrix} \in \mathbb{R}^{n \times q}.$$
Compared to (2.10), we drop the intercept (bias) component. This describes the covariate
information at time t for predicting the responses (Yi,t )ni=1 .
Naturally, this allows for a time-series extension, called panel data or longitudinal data.
If one has only one instance, one typically drops the instance index i, and in that case
one speaks about time-series data. The time-series data of a given instance comprises
responses, covariates and volumes, respectively, given by
Y1:t = (Y1 , . . . , Yt )⊤ ∈ Rt ,
X 1:t = (X 1 , . . . , X t )⊤ ∈ Rt×q ,
v1:t = (v1 , . . . , vt )⊤ ∈ Rt .
We do not use boldface notation in Y1:t and v1:t to highlight that these are time-series of
one-dimensional variables.
We can illustrate this data by mapping the time to the vertical axis
$$Y_{1:t} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_t \end{pmatrix}, \qquad X_{1:t} = \begin{pmatrix} X_{1,1} & \cdots & X_{1,q} \\ \vdots & \ddots & \vdots \\ X_{t,1} & \cdots & X_{t,q} \end{pmatrix} \qquad \text{and} \qquad v_{1:t} = \begin{pmatrix} v_1 \\ \vdots \\ v_t \end{pmatrix}.$$
This color image has a spatial structure described by the first two indices (u, v), in fact,
in this example we have a rectangle with t pixels on the x-axis and s pixels on the y-axis.
The last index j then labels the three color channels red-green-blue (RGB). This is the
typical way of encoding color pictures into a 3D tensor. An example is given in Figure
8.1 for a 30 × 30 color picture. ■
Figure 8.1: RGB channels for a 30 × 30 color picture: R channel, G channel, B channel,
and overlap of the three channels.
The common approach of using unstructured data such as texts, speech and images in
predictive models, is to map such unstructured data to tensors. For images, this is solved
as in Example 8.1. For texts and speech, this is achieved by an entity embedding which
uses tokenization. That is, we tokenize speech by assigning integers to all words, and
then we apply an entity embedding (2.15) to these integers and words, respectively. This
turns sentences (and speech) into 2D tensors of shape Rt×b , with t being the length of
the sentence and b the dimension of the entity embedding. We discuss this in careful
detail in Section 8.2, below.
Image recognition problems are very different from actuarial problems. A classical
image recognition problem is to determine whether there is a cat, Y = 0, or a
dog, Y = 1, on an image, thus, one tries to determine Y from an RGB image
$X_{1:t,1:s} \in \mathbb{R}^{t \times s \times 3}$. Purposefully, we wrote ‘determine’ for this image recognition
task, because this is not a forecasting problem. There is no randomness in terms of
irreducible risk involved in this cat vs. dog image recognition problem. Therefore,
we will not elaborate any further on such image recognition problems, but we
will focus on actuarial forecasting problems, below, where the irreducible risk is a
significant factor in the prediction task. The techniques, though, are the same, but
we put a different emphasis on the different features.
8.2.1 Tensors
The input data to networks (and regression functions) is usually in tensor form. For
single instances, the input information is either a single vector X ∈ Rq (1D tensor), a
time-series X 1:t ∈ Rt×q (2D tensor) or a spatial image X 1:t,1:s ∈ Rt×s×q (3D tensor),
where we generally assume that a color image has three color channels expressed by
setting q = 3, see Example 8.1 and Figure 8.1. If we have a black-and-white picture, we
typically want to preserve this spatial structure and, therefore, use a 3D tensor with a
single gray color channel X 1:t,1:s ∈ Rt×s×1 , i.e., we set q = 1. In this notation, typically,
the first indices of the tensor describe a time-series or a spatial structure, and the last
index (referring to q) are the channels.
Having multiple instances $1 \le i \le n$ increases the tensors by one order, e.g., for images
as covariates, we have an input 4D design tensor over all instances
$$\mathcal{X} = \big(X_{i,1:t,1:s}\big)_{i=1}^n \in \mathbb{R}^{n \times t \times s \times q}.$$
Assuming independence between the instances then applies to the first index $1 \le i \le n$
of this 4D design tensor $\mathcal{X}$.
results in high-dimensional input tensors. A similar situation occurs for unstructured text
data. The Oxford English Dictionary estimates that there are roughly 170,000 English
words in regular use, which results in an input dimension of 170,000 if one uses one-hot
encoding for these words. Aiming at making the input data smaller, we revisit entity
embedding discussed in (2.15). Select a (small) embedding dimension b ∈ N, and consider
the entity embedding
$$e^{\rm EE}: \mathcal{A} \to \mathbb{R}^b, \qquad X_1 \mapsto e^{\rm EE}(X_1), \qquad (8.1)$$
where A = {a1 , . . . , aK } is the set of all levels of a categorical covariate component
X1 contained in X (for the moment we do not assume that all components of X are
real-valued).
This results in $K$ embedding weights $e_k^{\rm EE} := e^{\rm EE}(a_k) \in \mathbb{R}^b$, $1 \le k \le K$. In FNN fitting,
these embedding weights are part of the network parameter $\vartheta$, and they are learned with
SGD that aims at making a strictly consistent loss function (generally) small
$$\mathcal{L}(\vartheta; \mathcal{L}) = \sum_{i=1}^n \frac{v_i}{\phi}\, L\big(Y_i, \mu_\vartheta(X_i)\big) \;\overset{!}{=}\; \min,$$
see Section 5.3. Hence, the embedding weights involved in (8.1) are learned in a su-
pervised learning manner using the targets (responses) (Yi )ni=1 . Changing the targets
will lead to different embeddings. E.g., job profiles impact accident or liability insur-
ance claims differently, and using these two different responses will result in different
embeddings of the job profiles. In contrast, we could also try to use an unsupervised
learning embedding, which requires that we can put the categorical covariates into some
context. This unsupervised learning embedding will be discussed in Section 8.2.3, and it
also relates to the clustering methods studied in Section 9, below.
Often, gradient descent fitting does not work well if one has many high-cardinality cat-
egorical covariates. High-cardinality categorical covariates give a significant potential
for over-fitting, and, as a result, usually gradient descent methods exercise a very early
stopping time. In such cases, it is beneficial to regularize the embedding, similarly to
Section 2.4. Assume that we have one categorical covariate $X_{i,1} \in \mathcal{A}$ in $X_i$ with $K$ levels.
This gives the embedding weights $(e_k^{\rm EE})_{k=1}^K \subset \mathbb{R}^b$; these embedding weights are part of
the network weights $\vartheta$. Using ridge regularization (2.22) on these embedding weights
motivates the regularized loss
$$\sum_{i=1}^n \frac{v_i}{\phi}\, L\big(Y_i, \mu_\vartheta(X_i)\big) \;+\; \frac{\eta}{\phi} \sum_{k=1}^K \big\| e_k^{\rm EE} \big\|_2^2, \qquad (8.2)$$
for regularization parameter η > 0. There is one point that we want to highlight; this has
been discussed in Richman–Wüthrich [192]. The regularized loss (8.2) balances between
the sample size n and the number of occurrences of a given categorical level ak ∈ A.
The issue in SGD training is that one does not consider the loss simultaneously over the
entire learning sample L, but only over the random (mini-)batch used for the next SGD
step, see (5.14). As a result, (8.2) cannot be evaluated in SGD, because we only see one
mini-batch at a time. To account for this issue, we change the regularized loss (8.2) into
a different form. Taking the notation from Avanzi et al. [7], we define $k[i] \in \{1, \ldots, K\}$
through $X_{i,1} = a_{k[i]}$, that is, $k[i]$ indicates to which level $a_{k[i]}$ the categorical variable
$X_{i,1}$ of instance $1 \le i \le n$ belongs. We define the totally observed exposure on each level
$1 \le k' \le K$ by
$$v_{k'}^+ = \sum_{i=1}^n v_i\, \mathbb{1}_{\{k[i]=k'\}} = \sum_{i=1}^n v_i\, \mathbb{1}_{\{X_{i,1}=a_{k'}\}},$$
and we rewrite the regularized loss (8.2) as
$$\sum_{i=1}^n \left( \frac{v_i}{\phi}\, L\big(Y_i, \mu_\vartheta(X_i)\big) \;+\; \frac{\eta}{\phi}\, \frac{v_i}{v_{k[i]}^+}\, \big\| e_{k[i]}^{\rm EE} \big\|_2^2 \right). \qquad (8.3)$$
The crucial difference of this latter expression to (8.2) is that the regularization is integrated
into the round brackets under the i-summation. This allows one to apply SGD on
partitions (mini-batches) of $\{1, \ldots, n\}$; this requires that one changes the loss function
correspondingly in SGD implementations, and that one equips all instances $1 \le i \le n$
with the volume information $v_{k[i]}^+$, so that on every instance i (and batch) one can evaluate
both terms under the i-summation in (8.3).
We observe from (8.3) that the regularization on individual instances i is inversely proportional
to the volumes $v_{k[i]}^+$, and more frequent levels $a_k$ receive a less strong regularization
towards zero compared to scarce ones. This scaling is quite common, and it is a natural
consequence of a Bayesian interpretation of this regularization approach; see Richman–
Wüthrich [192]. This reference also discusses regularized entity embedding in case of
hierarchical categorical covariates, e.g., vehicle brand - vehicle model - vehicle details
build a natural hierarchy, and a certain Toyota make cannot appear under a Volkswagen
brand. This may give further regularization restrictions, e.g., similar to fused regulariza-
tion, see (2.27), we can bring hierarchies into a context.
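For illustration, a minimal tensorflow.keras sketch of a ridge-regularized entity embedding in the spirit of (8.2) could look as follows. This is only a sketch under our own assumptions (hypothetical layer sizes and names, Poisson deviance loss with log-link output); the exposure-scaled variant (8.3) would additionally require a custom penalty term added to the per-instance loss.

```python
import tensorflow as tf

K, b = 50, 2           # number of levels and embedding dimension (hypothetical)
eta = 1e-3             # regularization strength

cat_in = tf.keras.Input(shape=(1,), dtype="int32")          # tokenized categorical covariate
emb = tf.keras.layers.Embedding(
    input_dim=K, output_dim=b,
    embeddings_regularizer=tf.keras.regularizers.L2(eta),   # ridge penalty on the e_k^EE
)(cat_in)
emb = tf.keras.layers.Flatten()(emb)
num_in = tf.keras.Input(shape=(5,))                          # five continuous covariates (hypothetical)
z = tf.keras.layers.Concatenate()([emb, num_in])
z = tf.keras.layers.Dense(16, activation="tanh")(z)
out = tf.keras.layers.Dense(1, activation="exponential")(z)  # log-link mean output

model = tf.keras.Model(inputs=[cat_in, num_in], outputs=out)
model.compile(loss="poisson", optimizer="adam")              # Poisson deviance up to constants
```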
where we add a token zero for an empty word. This is going to be useful if we need
to bring different sentences to equal length. In machine learning jargon, this is called
padding shorter sentences with zeros to equal length.
A sentence consists of different words aw and their tokens w, respectively, and the order
of the words and tokens matters for the meaning of the sentence. Therefore, we use the
positional index t ∈ N to indicate the position of a word in a sentence.
A sentence text of length T is given by $\textbf{text} = (w_1, \ldots, w_T) \in \mathcal{W}_0^T$.
Bag-of-words
The method of bag-of-words is the most crude one to make a text = (w1 , . . . , wT ) numer-
ical for predictive modeling. It drops the positional index and it defines the bag-of-word
mapping
$$\psi: \mathcal{W}_0^T \to \mathbb{N}_0^W, \qquad \textbf{text} \mapsto \psi(\textbf{text}) = \bigg( \sum_{t=1}^T \mathbb{1}_{\{w_t = w\}} \bigg)_{w \in \mathcal{W}}. \qquad (8.4)$$
This is called bag-of-words because one places all words into the same bag. As a result,
one loses the order and the positional index, and one only counts how often a certain
word appears in text. This is very crude because, e.g., the following two sentences
provide the same bag-of-words: ‘the car is red’ and ‘is the car red’, but their meaning
is rather different. That is, the semantics of the sentence gets lost by the bag-of-words
embedding (by dropping the positional index). Moreover, the range $\mathbb{N}_0^W$ of $\psi$ is very
high-dimensional, and $\psi(\textbf{text})$ is likely sparse if the text is short and the vocabulary
large. In this approach, one often removes so-called stop words such as ‘at’, ‘to’, ‘the’,
etc., to put more emphasis on the more important parts of the sentences and to reduce
the dimension of the vocabulary.
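A minimal Python sketch of the bag-of-words map (8.4) is given below (toy vocabulary and names of our own choosing); it shows that ‘the car is red’ and ‘is the car red’ become indistinguishable, since only word counts are kept.

```python
from collections import Counter

vocabulary = ["the", "car", "is", "red", "blue"]      # toy vocabulary W

def bag_of_words(text):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]            # psi(text) in N_0^W

print(bag_of_words("the car is red"))   # [1, 1, 1, 1, 0]
print(bag_of_words("is the car red"))   # the same vector, the word order is lost
```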
Referring to Bengio et al. [20, 21, 22], we start with an entity embedding of the vocabulary
A and its tokenization W0 , respectively. Select an embedding dimension b ≪ W and
consider the word embedding (WE)
$$e^{\rm WE}: \mathcal{W}_0 \to \mathbb{R}^b, \qquad w \mapsto e^{\rm WE}(w), \qquad (8.5)$$
that assigns to each token w an embedding vector $e^{\rm WE}(w)$. In an unsupervised learning
manner, one tries to learn the embedding vectors from their contexts. E.g., ‘I’m driving
by car to the city’ and ‘I’m driving my vehicle to the town center’ uses similar words in
a similar context. Therefore, their embedding vectors should be close because they are
almost interchangeable. The goal is to learn such similarity in the meanings from the
context in which these words are used. For this, consider a sentence in the set of all
sentences $\mathcal{C} = \{\textbf{text} = (w_1, \ldots, w_T)\}$, and assume that there are occurrence probabilities
$$p(\textbf{text}) = p(w_1, \ldots, w_T), \qquad \textbf{text} \in \mathcal{C}. \qquad (8.6)$$
These probabilities should reflect the frequencies of the sentences $\textbf{text} = (w_1, \ldots, w_T)$
in speeches and texts (in the domain we are interested in).
Applying Bayes’ rule, we can determine how likely a certain word wt ∈ W0 occurs in a
given sentence text = (w1 , . . . , wT ) at position t
p(w1 , . . . , wT )
p ( wt | w1 , . . . , wt−1 , wt+1 , . . . , wT ) = . (8.7)
p(w1 , . . . , wt−1 , wt+1 , . . . , wT )
In general, these probabilities (8.6)-(8.7) are unknown, and they need to be estimated
(learned) from a learning sample L. Learning these probabilities will be based on em-
bedding the tokens into low dimensional spaces, and this is precisely the step where the
word embedding (8.5) is learned. There are two classic approaches for this: word-to-
vector (word2vec) by Mikolov et al. [156, 157] and global vectors (GloVe) by Pennington
et al. [175] and Chaubard et al. [40]. We describe these two methods next; this description
is taken from Wüthrich–Merz [243, Chapter 10].
(i) One can try to predict the center word wt from its context w1 , . . . , wt−1 , wt+1 , . . . , wT
as described in (8.7). In this approach, to reduce complexity, one often neglects the
positional indices of the context words, and one considers the bag-of-words {ws }s̸=t
instead. This method is called continuous bag-of-words (CBOW).
(ii) One can revert the problem and try to predict the context from the center word $w_t$
$$p\big( w_1, \ldots, w_{t-1}, w_{t+1}, \ldots, w_T \,\big|\, w_t \big) = \frac{p(w_1, \ldots, w_T)}{p(w_t)}. \qquad (8.8)$$
This is obtained by Bayes’ rule from (8.7). This gives the skip-gram approach.
We present word2vec of Mikolov et al. [156, 157] in detail. For this we discuss the
two different approaches:
(1) Skip-gram approach of predicting the context from the center word;
(2) CBOW approach of predicting the center word from the context.
As a result of these approaches, we receive the word embeddings $e^{\rm WE}(w)$, because
they enter the probabilities (8.7) and (8.8) through a cosine similarity and
a softmax implementation.
For the skip-gram approach one tries to determine the probabilities (8.8) from a learning
sample L = (texti )ni=1 of different sentences. Since this problem is too complex in its
full generality, one solves a simpler problem.
(1) First, one restricts to a fixed small context (window) size c ∈ N, and one tries to
find the probabilities in this context window of wt , given by
(2) Second, one assumes conditional independence of the context words, given the
center word wt .
Of course, the second assumption is generally not satisfied by real texts, but it signifi-
cantly simplifies the estimation problem. In particular, this crude (wrong) version is still
sufficient to receive a good word embedding (8.5), which is our main incentive to look
at this method. Under this conditional independence assumption, we have the log-likelihood
for learning sample $\mathcal{L}$ and for given context size $c \in \mathbb{N}$
$$\ell_{\mathcal{L}} = \sum_{i=1}^n \sum_t \sum_{-c \le j \le c,\, j \ne 0} \log p\big( w_{i,t+j} \,\big|\, w_{i,t} \big). \qquad (8.9)$$
Assume that the conditional probabilities in (8.9) can be modeled by the softmax function
$$p\big( w_s \,\big|\, w_t \big) = \frac{\exp\big\langle e^{(1)}(w_t),\, e^{(2)}(w_s) \big\rangle}{\sum_{w=1}^W \exp\big\langle e^{(1)}(w_t),\, e^{(2)}(w) \big\rangle} \;\in\; (0,1). \qquad (8.10)$$
Thus, if the scalar (dot) product between $e^{(1)}(w_t)$ and $e^{(2)}(w_s)$ is large, we get a high
probability that $w_s$ is in the context of the center word $w_t$.
Inserting (8.10) into (8.9), we receive a log-likelihood function $\ell_{\mathcal{L}}$ in the two word embedding
mappings
$$e^{(1)}: \mathcal{W}_0 \to \mathbb{R}^b \qquad \text{and} \qquad e^{(2)}: \mathcal{W}_0 \to \mathbb{R}^b. \qquad (8.11)$$
Maximizing this log-likelihood ℓL for the given learning sample L gives us the two (dif-
ferent) word embeddings. The optimization is done by variants of the SGD algorithm;
the only difficulty is that this high-dimensional problem can result in very expensive
computations, and negative sampling is a method that can circumvent this problem. We
discuss this in the next subsection.
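In practice, the skip-gram model with negative sampling is rarely coded from scratch; for instance, the gensim library provides an implementation. The following sketch is for illustration only (toy sentences and hypothetical names; gensim is assumed to be installed): sg=1 selects the skip-gram approach and negative=5 activates negative sampling as discussed in the next subsection.

```python
from gensim.models import Word2Vec

sentences = [
    ["car", "hit", "pole", "on", "street"],
    ["vehicle", "hit", "hydrant", "on", "street"],
    ["water", "pipe", "froze", "in", "school"],
]  # tokenized claims texts (toy data)

w2v = Word2Vec(sentences, vector_size=2, window=2, sg=1, negative=5,
               min_count=1, epochs=100, seed=1)
print(w2v.wv["car"])                        # 2-dimensional word embedding of 'car'
print(w2v.wv.most_similar("car", topn=2))   # nearest words in the embedding space
```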
Figure 8.2 shows the resulting two-dimensional word embedding (dimension 1 vs. dimension 2)
of the words appearing in the claims texts; words that are used in similar contexts are located
close to each other,
e.g., ‘house’ is next to ‘water’ or ‘pole’ is next to ‘vehicle’. We have chosen an embedding
dimension of b = 2 to nicely illustrate the results; typically, word embedding dimensions
range from 50 to 300.
Negative sampling
For computational reasons, it can be difficult to solve the word2vec skip-gram approach
(8.9)-(8.10), the categorical distribution in (8.10) has W different levels, and likewise
the input has this cardinality. Negative sampling turns this learning problem into a
supervised learning problem of a lower complexity; see Mikolov et al. [157].
For this, we consider the pairs $(w, \tilde{w}) \in \mathcal{W} \times \mathcal{W}$ of center words $w$ and context words $\tilde{w}$.
To each of these pairs we add a binary response variable $Y \in \{0,1\}$, resulting in
observation $(Y, w, \tilde{w})$. There will be two types of center-context pairs, real ones that are
obtained from the learning sample $\mathcal{L}$ and fake ones that are generated purely randomly.
We construct these two types of pairs as follows:
(1) We take all real center-context pairs from the learning sample $\mathcal{L}$, giving the learning
data set $\mathcal{L}_1 = (Y_i = 1,\, w_i,\, \tilde{w}_i)_{i=1}^n$ with $Y = 1$ as response.
(2) We take all real pairs $(w_i, \tilde{w}_i)_{i=1}^n$, and we randomly permute the index of the context
word, indicated by a permutation $\pi$. This gives us a second (fake) learning data set,
where we shift the index by n for later purposes, $\mathcal{L}_2 = (Y_{n+i} = 0,\, w_{n+i},\, \tilde{w}_{n+\pi(i)})_{i=1}^n$, with
$Y = 0$ as response.
Merging real and fake learning data gives us a learning sample $\mathcal{L} = \mathcal{L}_1 \cup \mathcal{L}_2$ of sample size
2n. This now allows us to turn the unsupervised learning problem into a supervised
logistic regression problem by studying the new log-likelihood
$$\ell_{\mathcal{L}} \;=\; \sum_{i=1}^{2n} \log \mathbb{P}\big[\, Y = Y_i \,\big|\, w_i, \tilde{w}_i \,\big]
\;=\; \sum_{i=1}^{n} \log \frac{1}{1 + \exp\big\langle -e^{(1)}(w_i),\, e^{(2)}(\tilde{w}_i)\big\rangle}
\;+\; \sum_{k=n+1}^{2n} \log \frac{1}{1 + \exp\big\langle e^{(1)}(w_k),\, e^{(2)}(\tilde{w}_k)\big\rangle}.$$
The first n instances 1 ≤ i ≤ n come from the real data L1 , and the second n instances
n + 1 ≤ k ≤ 2n from the fake data L2 with the π-permuted context words. The two
parts of the log-likelihood then correspond to the logistic probabilities for the responses
Yi = 1 and Yk = 0 being real or fake, respectively. Maximizing this log-likelihood ℓL , we
can learn the two embeddings (8.11). The example in Figure 8.2 has been obtained in
this way.
For SGD training to work properly in this negative sampling learning, one should ran-
domly permute the instances in L = L1 ∪ L2 , to ensure that all (mini-)batches contain
instances of both types.
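To make the structure of this objective explicit, the following numpy sketch (toy data and hypothetical names, for illustration only) evaluates the negative-sampling log-likelihood for given embeddings $e^{(1)}, e^{(2)}$: real pairs contribute $\log \sigma(\langle e^{(1)}, e^{(2)}\rangle)$ and permuted (fake) pairs contribute $\log \sigma(-\langle e^{(1)}, e^{(2)}\rangle)$.

```python
import numpy as np

rng = np.random.default_rng(0)
W, b, n = 20, 2, 100                      # vocabulary size, embedding dimension, number of real pairs
e1 = rng.normal(size=(W, b))              # center-word embeddings e^(1)
e2 = rng.normal(size=(W, b))              # context-word embeddings e^(2)

center = rng.integers(0, W, n)            # real center/context pairs (toy data)
context = rng.integers(0, W, n)
fake_context = context[rng.permutation(n)]   # fake pairs via a random permutation pi

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)         # numerically stable log(1 / (1 + exp(-x)))

dot_real = np.sum(e1[center] * e2[context], axis=1)
dot_fake = np.sum(e1[center] * e2[fake_context], axis=1)
loglik = np.sum(log_sigmoid(dot_real)) + np.sum(log_sigmoid(-dot_fake))
print(loglik)   # to be maximized in (e1, e2) by (stochastic) gradient ascent
```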
For the CBOW method we aim at predicting the center word wt from its context, see
(8.7). As in the skip-gram approach, we select a fixed context (window) size c ∈ N. For
given learning sample L, this provides us with the log-likelihood
$$\sum_{i=1}^n \sum_t \log p\big( w_{i,t} \,\big|\, w_{i,t-c}, \ldots, w_{i,t-1}, w_{i,t+1}, \ldots, w_{i,t+c} \big).$$
To solve this problem, we need again to reduce the complexity. As in the bag-of-words
approach (8.4), we drop the positional index t. Moreover, for the (continuous) CBOW
version, we average over the bag-of-words to receive the average embedding of the context
words
$$\bar{e}_{i,t}^{(2)} = \frac{1}{2c} \sum_{-c \le j \le c,\, j \ne 0} e^{(2)}(w_{i,t+j}),$$
for the context word embedding e(2) , see (8.11). This averaging can be done because
the word embedding gives a numerical representation to the context words. The CBOW
approach then considers the following log-likelihood
$$\ell_{\mathcal{L}} = \sum_{i=1}^n \sum_t \log p\big( w_{i,t} \,\big|\, \bar{e}_{i,t}^{(2)} \big)
= \sum_{i=1}^n \sum_t \log \frac{\exp\big\langle e^{(1)}(w_{i,t}),\, \bar{e}_{i,t}^{(2)} \big\rangle}{\sum_{w=1}^W \exp\big\langle e^{(1)}(w),\, \bar{e}_{i,t}^{(2)} \big\rangle} \qquad (8.12)$$
$$\phantom{\ell_{\mathcal{L}}} = \sum_{i=1}^n \sum_t \Bigg( \big\langle e^{(1)}(w_{i,t}),\, \bar{e}_{i,t}^{(2)} \big\rangle - \log \sum_{w=1}^W \exp\big\langle e^{(1)}(w),\, \bar{e}_{i,t}^{(2)} \big\rangle \Bigg).$$
Thus, we measure the similarity between the center word embedding $e^{(1)}(w)$ and its
average context word embedding $\bar{e}_{i,t}^{(2)}$. From this we can again learn the two embeddings
(8.11) using a version of the SGD algorithm to minimize the negative log-likelihood.
Compared to skip-gram, CBOW is usually faster in fitting, but skip-gram performs better
on less frequent words. Naturally, we can apply a negative sampling version to CBOW,
by randomly permuting the average context words $\bar{e}_{i,t}^{(2)}$, and then designing a logistic
regression that tries to identify the true and the fake pairs.
Whereas word2vec is based on solid statistical methods, using well-defined and explain-
able log-likelihoods, GloVe is a word embedding approach that is more of an engineering
type. GloVe was developed by Pennington et al. [175] and Chaubard et al. [40].
GloVe is more in the spirit of clustering; clustering is going to be presented in Section
9.3, below. Select a fixed context size $c \in \mathbb{N}$ and count the different context words $\tilde{w}$
in the context window of the given center word $w \in \mathcal{W}$. This defines the matrix of
co-occurrences
$$C = \big(C(w, \tilde{w})\big)_{w, \tilde{w} \in \mathcal{W}} \;\in\; \mathbb{N}_0^{W \times W}.$$
Matrix C is a symmetric matrix, and typically it is sparse as many words do not appear
in the context of other words (on finitely many texts). Empirical analysis and intuitive
arguments lead to an approach of approximating this co-occurrence matrix by
$$\log C(w, \tilde{w}) \;\approx\; \big\langle e^{(1)}(w),\, e^{(2)}(\tilde{w}) \big\rangle + \alpha_w + \beta_{\tilde{w}},$$
with intercepts $\alpha_w, \beta_{\tilde{w}} \in \mathbb{R}$; see Pennington et al. [175]. To ensure that everything is
well-defined, Pennington et al. [175] come up with the following objective function to be
minimized
$$\sum_{w, \tilde{w} \in \mathcal{W}} \chi\big(C(w, \tilde{w})\big) \Big( \log C(w, \tilde{w}) - \big\langle e^{(1)}(w),\, e^{(2)}(\tilde{w}) \big\rangle - \alpha_w - \beta_{\tilde{w}} \Big)^2,$$
with weighting function $\chi(x) = \min\{(x/x_{\max})^\gamma, 1\}$ for hyper-parameters $x_{\max} > 0$ and $\gamma > 0$. From this we can again learn the two
embeddings (8.11) using the available learning data L.
Clearly, GloVe is more difficult to implement and to fine-tune than the word2vec methods;
some small scale examples in an insurance context are given in Wüthrich–Merz [243,
Chapter 10]. This short introduction was not meant to explain GloVe to the level of
explicit implementation and reasoning why it is sensible, but for GloVe (as well as for
other methods) there is a large pre-trained version available that can be downloaded;2
other pre-trained open-source models that can be downloaded include, e.g., spaCy3 and
FastText.4
These pre-trained word embeddings are ready to use, and they can be downloaded in
different scales and embedding dimensions. A point that needs careful attention is that
these word embeddings have been trained on a large corpus of texts from the internet. These
texts consider all sorts of topics. When it comes to a specific use of such pre-trained
libraries, this needs some care because certain words have different meanings in different
contexts. Wüthrich–Merz [243, Section 10] computed a non-life insurance example that
considered insurance coverage of public institutions. In this example, the word Lincoln
appears in several claims texts. Lincoln is a former US president, there are Lincoln
memorials, there are towns called Lincoln, there is a Lincoln car brand, there are restau-
rants named Lincoln, but in the claims texts Lincoln refers to the insured school. Therefore, a
pre-trained embedding may not be fully suitable for the purpose needed, because specific
insurance related terminology may not have been used while training the embedding.
This will require additional training of the pre-trained libraries to the specific purpose,
while fitting the entire predictive model. Nevertheless, having a pre-trained data basis is
often an excellent starting point for an actuarial application, and, in the sense of transfer
learning, a pre-trained library can be refined for the specific task to be solved.
2
https://nlp.stanford.edu/projects/glove/
3
https://spacy.io/models/en#en_core_web_md
4
https://fasttext.cc
Naturally, we can simultaneously use all these different input formats by designing suit-
able network architectures. E.g., we can use a FNN on the tabular input data providing
us with a first intermediate output, we can use a recurrent neural network (RNN) on
the time-series and text data providing us with a second intermediate output, and we can
use a convolutional neural network (CNN) on the image data providing us with a third
intermediate output. These intermediate outputs are concatenated and then further pro-
cessed through a FNN providing a predictor variable. Such an example is illustrated in
Figure 8.3.
FNNs have already been introduced in Chapter 5, and the following sections are going
to be devoted to RNNs, CNNs as well as transformers, which are another popular way of
dealing either with time-series data or with tabular data.
(1) FNNs ignore spatial and/or time-series structure in the data, treating input ele-
ments equally regardless of their relative proximity in the positional index.
(3) FNNs cannot deal with time-series observations whose length increases over time.
To solve these issues, specialized network architectures like convolutional neural networks
(CNNs) and recurrent neural networks (RNNs) have been introduced. In the present
section, we study CNNs, introduced by LeCun–Bengio [130]. CNNs derive their name
from the use of the convolution operator to extract features from the input tensors.
CNN layers rely on two main principles:
• local connectivity, and
• parameter sharing.
First, regarding local connectivity, each unit (neuron) in a CNN layer is connected only
to a localized region (window) of the input, known as the receptive field. The resulting
weight matrices, called filters, present a smaller size than the entire input data because
they focus on the receptive field (window) only, and the features extracted depend only
on that portion of the input data.
Second, CNN layers employ parameter sharing, wherein the same filter is applied across
all different regions of the input. That is, by sliding the filter across the input surface,
we compute features in each receptive field using the identical filter (a single set of
weights). One can imagine a rolling window similar to time-series applications.
=⇒ This design of local connectivity and parameter sharing significantly reduces the
number of parameters to be learned compared to FNN layers.
Different CNN layers are available; they differ in the number of dimensions over which
the convolutional operation is applied. The choice of dimension depends on the characteristics
of the input data and the specific prediction task. In the following sections,
we formally describe the mechanisms of 1D and 2D CNN layers. However, it should be
noted that these principles can be generalized to any higher dimension.
• The kernel size defines the length of the filters (window size) used in the convolu-
tional operation. This parameter determines the size of the receptive field, or the
range of input values each filter can ‘see’ at a time. A larger kernel size allows the
model to detect features that span longer sequences. However, this also increases
the number of parameters and the computational complexity of the model.
• The stride specifies the step size with which the kernel moves along the input
sequence during the convolution process. A smaller stride (e.g., δ = 1) results in
overlapping receptive fields, which can provide a more detailed representation of
the input but at higher computational cost. On the other hand, a
larger stride (e.g., δ ≥ 2) reduces the overlap, leading to faster computation at the
risk of losing information.
Figure 8.4: CNN filter of kernel size K = 3 and using (lhs) stride δ = 1 and (rhs) stride
δ = 3.
A 1D CNN layer can be thought of as operating like a rolling window procedure. The
kernel directly corresponds to the size of the rolling window, which slides across the
input sequence, examining a fixed number of consecutive elements at each step. The
stride, meanwhile, defines the step size, determining how far the rolling window advances
along the input after each computation. Figure 8.4 shows two examples with kernel size
K = 3, the left-hand side has stride δ = 1 giving overlapping windows on the time axis
t, whereas the right-hand side has stride δ = 3 giving non-overlapping windows. These
input parameters K and δ need to be selected by the modeler and their choices depend on
the trade-off between computational efficiency and the desired level of feature resolution.
1D CNN layer. In practice, a single filter may not be sufficient to capture the com-
plexity of the features in the input data; this is similar to the number of neurons in a
FNN layer. To address this issue, multiple filters can be concatenated, each learning a
different set of features, i.e., using different filter weights.
A 1D CNN layer with $q_1 \in \mathbb{N}$ filters is a mapping
$$z^{(1)}: \mathbb{R}^{t \times q} \to \mathbb{R}^{t' \times q_1}, \qquad X_{1:t} \mapsto z^{(1)}(X_{1:t}) = \Big( z_{u,j}^{(1)}(X_{1:t}) \Big)_{1 \le u \le t',\, 1 \le j \le q_1}. \qquad (8.15)$$
In particular, each element of the matrix, denoted as $z_{u,j}^{(1)}(X_{1:t})$, is a unit obtained by
convolving the j-th filter with the u-th receptive field
$$z_{u,j}^{(1)}(X_{1:t}) = \phi\bigg( w_{0,j}^{(1)} + \sum_{k=1}^K \Big\langle w_{k,j}^{(1)},\, X_{(u-1)\delta + k} \Big\rangle \bigg), \qquad (8.16)$$
with bias term $w_{0,j}^{(1)} \in \mathbb{R}$ and filter weights $w_{k,j}^{(1)} \in \mathbb{R}^q$, $1 \le k \le K$. In this case, the total
number of parameters to be learned is $(1 + Kq)q_1$.
• The j-th column of the matrix $z^{(1)} \in \mathbb{R}^{t' \times q_1}$ given in (8.15), containing the elements
$z_{1,j}^{(1)}, z_{2,j}^{(1)}, \ldots, z_{t',j}^{(1)}$, represents a set of features extracted by applying the same filter
to the different receptive fields (sliding along the time axis). This provides a
representation of all receptive fields in a common feature space.
• The u-th row of the matrix $z^{(1)} \in \mathbb{R}^{t' \times q_1}$ given in (8.15), containing the elements
$z_{u,1}^{(1)}, z_{u,2}^{(1)}, \ldots, z_{u,q_1}^{(1)}$, represents a set of features obtained by applying $q_1$ different
filters to the u-th receptive field. These features provide multiple representations
of the same receptive field, or in other words, a slice across the different filters.
This enables the model to extract different sets of features from each receptive
field, capturing different aspects of the data and resulting in multi-dimensional
representations; this is analogous to having multiple neurons in a FNN layer.
Figure 8.5 provides a graphical illustration of how a 1D CNN layer with q1 filters works.
It emphasizes that these layers perform computations based on a convolutional operator,
where the filters are convolved across different regions of the input, similar to how the two
functions interact in a mathematical convolution. In the figure, the blue matrix represents
the input data X 1:t . Each filter is applied to the input data, generating a corresponding
set of features that are graphically represented by a colored rectangular block; e.g., the
yellow filter weights $(w_{0,1}^{(1)}, (w_{k,1}^{(1)})_{k=1}^K)$ map to the first block $z_1^{(1)} = (z_{u,1}^{(1)})_{u=1}^{t'}$ (in yellow).
The feature vectors $z_1^{(1)}, \ldots, z_{q_1}^{(1)}$ obtained from the $q_1$ filters are then concatenated along
the appropriate dimension, resulting in the matrix (2D tensor) $z^{(1)}$, see (8.15). This
output matrix can be interpreted as a learned multi-dimensional representation of the
output matrix can be interpreted as a learned multi-dimensional representation of the
time-series input data (respecting time adjacency), ready to be passed to subsequent
layers for further processing.
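The following numpy sketch (our own illustration with hypothetical sizes) implements the 1D CNN layer (8.15)-(8.16) directly, with tanh activation; it can be checked that the output has $t' = \lfloor (t-K)/\delta \rfloor + 1$ rows and that the layer has $(1+Kq)q_1$ parameters.

```python
import numpy as np

def conv1d_layer(X, W, w0, stride):
    """X: (t, q) input, W: (q1, K, q) filter weights, w0: (q1,) bias terms."""
    t, q = X.shape
    q1, K, _ = W.shape
    t_out = (t - K) // stride + 1
    Z = np.empty((t_out, q1))
    for u in range(t_out):                        # slide the window along the time axis
        window = X[u * stride : u * stride + K]   # u-th receptive field, shape (K, q)
        for j in range(q1):                       # apply each of the q1 filters
            Z[u, j] = np.tanh(w0[j] + np.sum(W[j] * window))
    return Z

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))                      # time-series input, t = 12, q = 3
q1, K, delta = 4, 3, 1
W = rng.normal(size=(q1, K, 3))
w0 = rng.normal(size=q1)
Z = conv1d_layer(X, W, w0, delta)
print(Z.shape)                                    # (10, 4): t' = (12 - 3)//1 + 1 = 10 receptive fields
print((1 + K * 3) * q1)                           # 40 parameters, i.e., (1 + Kq) q1
```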
The input data consists of three dimensions and is represented as 3D tensors in Rt×s×q ,
where q represents the number of channels (e.g., q = 3 for the RGB color channels in
images, see Example 8.1). The convolution operation involves sliding a small window
across the two axes t and s, while simultaneously processing all q channels at each
position. This enables the model to perform convolution operations along both the
vertical t and horizontal s axes, making it highly effective at detecting spatial patterns
and local structure in the input data X 1:t,1:s ∈ Rt×s×q . Popular applications in the
actuarial literature of 2D CNNs are pattern recognition problems, e.g., for telematics
data, see Gao et al. [72], or mortality modeling, see Wang et al. [231].
In the case of a 2D CNN, both the kernel size and stride are represented as pairs of
elements that specify the size and movement of the filter in two dimensions.
Also in the case of a 2D CNN layer, the kernel size (Kt , Ks ) and stride δ = (δt , δs ) directly
affect both the size of the output feature map and the computational complexity of the
convolution operation. The output of a 2D CNN layer with a single filter is a feature
matrix with $t' = \big\lfloor \frac{t - K_t}{\delta_t} + 1 \big\rfloor$ rows and $s' = \big\lfloor \frac{s - K_s}{\delta_s} + 1 \big\rfloor$ columns. In this case, the total
number of receptive fields is given by $t' s'$.
′
2D CNN filter. Let X 1:t,1:s ∈ Rt×s×q be the input tensor with q channels. A 2D CNN
layer with a single filter of size Kt × Ks and stride δt × δs is a mapping
$$z_1^{(1)}: \mathbb{R}^{t \times s \times q} \to \mathbb{R}^{t' \times s'}, \qquad X_{1:t,1:s} \mapsto z_1^{(1)}(X_{1:t,1:s}) = \Big( z_{u,v,1}^{(1)}(X_{1:t,1:s}) \Big)_{1 \le u \le t',\, 1 \le v \le s'}. \qquad (8.17)$$
Unit $z_{u,v,1}^{(1)}(X_{1:t,1:s})$ is extracted by convolving the filter with the receptive field $(u,v)$
$$z_{u,v,1}^{(1)}(X_{1:t,1:s}) = \phi\bigg( w_{0,1}^{(1)} + \sum_{k_t=1}^{K_t} \sum_{k_s=1}^{K_s} \Big\langle w_{k_t,k_s,1}^{(1)},\, X_{(u-1)\delta_t + k_t,\, (v-1)\delta_s + k_s} \Big\rangle \bigg),$$
with bias term $w_{0,1}^{(1)} \in \mathbb{R}$ and filter weights $w_{k_t,k_s,1}^{(1)} \in \mathbb{R}^q$, for $1 \le k_t \le K_t$ and $1 \le k_s \le K_s$.
The total number of parameters to be optimized is $1 + K_t K_s q$. The rows of the matrix
$z_1^{(1)}(X_{1:t,1:s})$ are obtained by sliding the filter (window) across the input data in a hori-
zontal direction, i.e., the filter moves from left to right across the rows of the input. Each
row of the matrix corresponds to a different receptive field (u, v) along this horizontal
pass. On the other hand, the columns of the matrix are obtained by sliding the filter
vertically over the input data. In this case, the filter moves from top to bottom across
the columns of the input. Each column in the matrix corresponds to a different receptive
field in this vertical pass.
2D CNN layer. Also in the case of a 2D CNN layer, multiple filters can be applied.
A 2D CNN layer with $q_1$ filters is a mapping
$$z^{(1)}: \mathbb{R}^{t \times s \times q} \to \mathbb{R}^{t' \times s' \times q_1}, \qquad X_{1:t,1:s} \mapsto z^{(1)}(X_{1:t,1:s}) = \Big( z_{u,v,j}^{(1)}(X_{1:t,1:s}) \Big)_{1 \le u \le t',\, 1 \le v \le s',\, 1 \le j \le q_1}, \qquad (8.18)$$
with $z_j^{(1)}(X_{1:t,1:s}) \in \mathbb{R}^{t' \times s'}$, $1 \le j \le q_1$, being 2D CNN filters (8.17) with different filter
weights. Each element of the j-th feature map is computed as the double summation
(convolution)
$$z_{u,v,j}^{(1)}(X_{1:t,1:s}) = \phi\bigg( w_{0,j}^{(1)} + \sum_{k_t=1}^{K_t} \sum_{k_s=1}^{K_s} \Big\langle w_{k_t,k_s,j}^{(1)},\, X_{(u-1)\delta_t + k_t,\, (v-1)\delta_s + k_s} \Big\rangle \bigg), \qquad (8.19)$$
where $w_{0,j}^{(1)} \in \mathbb{R}$ is the bias term of the j-th filter, and $w_{k_t,k_s,j}^{(1)} \in \mathbb{R}^q$, $1 \le k_t \le K_t$ and
$1 \le k_s \le K_s$, represents the filter weights for that j-th filter. The total number of
coefficients to learn in a 2D CNN layer with $q_1$ filters of size $K_t \times K_s$ is $(1 + K_t K_s q)q_1$.
The graphical representation of how a 2D CNN layer works is shown in Figure 8.6. The
3D object in blue represents the input data, while the 3D yellow object represents the
first filter. By sliding this filter horizontally and vertically over the input tensor a feature
map is produced, resulting in the yellow 2D CNN filter output $z_1^{(1)}(X_{1:t,1:s}) \in \mathbb{R}^{t' \times s'}$.
Concatenating the feature maps obtained from the q1 different filters generates the output
of the 2D CNN layer.
where
$$t'' = \bigg\lfloor \frac{t' - K_t^{(2)}}{\delta_t^{(2)}} + 1 \bigg\rfloor \in \mathbb{N} \qquad \text{and} \qquad s'' = \bigg\lfloor \frac{s' - K_s^{(2)}}{\delta_s^{(2)}} + 1 \bigg\rfloor \in \mathbb{N},$$
and where we select $q_2 \in \mathbb{N}$ filters for this second 2D CNN layer.
We can then compose these two 2D CNN layers to a deep 2D CNN architecture
$$z^{(2:1)}: \mathbb{R}^{t \times s \times q} \to \mathbb{R}^{t'' \times s'' \times q_2} \qquad \text{with} \qquad z^{(2:1)} = z^{(2)} \circ z^{(1)}. \qquad (8.20)$$
Naturally, this mechanism also works for 1D CNN layers, and we can generalize it to any
depth $d$, resulting in deep CNN architectures $z^{(d:1)} = z^{(d)} \circ \cdots \circ z^{(1)}$. Importantly,
the output dimension of the CNN layer $z^{(m-1)}$ always has to match the input dimension
of the next CNN layer $z^{(m)}$. This is completely analogous to deep FNN architectures
(5.3).
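A minimal tensorflow.keras sketch (our own illustration; all layer sizes are hypothetical) of the composition (8.20) of two 2D CNN layers on 30 × 30 RGB images is given below; the parameter counts reported by the summary can be checked against $(1 + K_t K_s q) q_1$ per layer.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(30, 30, 3)),                            # X in R^{t x s x q}, q = 3 channels
    tf.keras.layers.Conv2D(8, kernel_size=(3, 3), strides=(1, 1),
                           activation="tanh"),                    # (1 + 3*3*3)*8 = 224 parameters
    tf.keras.layers.Conv2D(16, kernel_size=(3, 3), strides=(2, 2),
                           activation="tanh"),                    # (1 + 3*3*8)*16 = 1168 parameters
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="exponential"),           # regression output, log-link
])
model.summary()
```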
This almost fully describes (deep) CNN architectures. There are some more points that
we would like to discuss:
This is discussed in the next sections. When it comes to CNN fitting, we apply a version
of SGD, see Section 5.3.
For simplicity, we use the 1D case for explaining a LCN layer, and the 2D case is com-
pletely analogous. Formally, a 1D LCN layer with q1 ∈ N filters is a mapping
$$z^{(1)}: \mathbb{R}^{t \times q} \to \mathbb{R}^{t' \times q_1}, \qquad X_{1:t} \mapsto z^{(1)}(X_{1:t}) = \Big( z_{u,j}^{(1)}(X_{1:t}) \Big)_{1 \le u \le t',\, 1 \le j \le q_1},$$
with
$$z_{u,j}^{(1)}(X_{1:t}) = \phi\bigg( w_{0,u,j}^{(1)} + \sum_{k=1}^K \Big\langle w_{k,u,j}^{(1)},\, X_{(u-1)\delta + k} \Big\rangle \bigg),$$
with bias terms $w_{0,u,j}^{(1)} \in \mathbb{R}$ and filter weights $w_{k,u,j}^{(1)} \in \mathbb{R}^q$, $1 \le k \le K$, related to the u-th
receptive field. In this case, the total number of parameters to be learned is $(1 + Kq)q_1 t'$.
Compared to (8.15) and (8.16) there is only one difference, namely, the bias terms and
filter weights have a lower index u that corresponds to the u-th receptive field considered.
This increases the number of parameters by a factor $t'$ compared to the 1D CNN case.
Figure 8.7 provides a graphical representation of the FNN, LCN and CNN layers; the
number of parameters exclude the bias terms. For an input tensor X 1:6 ∈ R6×1 we have
18 parameters by mapping this to a FNN layer with q1 = 3 neurons, see Figure 8.7 (lhs).
The right-hand side shows an example of a 1D CNN layer with kernel size K = 2 and
stride δ = 2, resulting in 2 parameters. The LCN layer (with the same kernel size and
stride) in between has 6 parameters.
Figure 8.7: A graphical comparison of the FNN, LCN and CNN layers; the number of
parameters excludes the bias terms.
• MaxPooling: It selects the maximum value from a set of input values in the se-
lected pooling windows. This operation is commonly preferred when the goal is to
emphasize the most relevant features while ignoring the less significant ones.
• AveragePooling: It computes the average of the values within the given pooling
windows. This operation results in smoother feature maps, as it considers the
overall characteristics of a local neighborhood, rather than focusing on extreme
values. AveragePooling is especially appropriate when the objective is to capture
the overall structure or distribution.
Guidance on selecting the appropriate types of pooling for various scenarios is provided
in Boureau et al. [25]. Again, as with several other items, finding the right pooling and
its parametrization is part of the hyper-parameter tuning that needs to be performed by
the modeler. Notably, we remark that pooling layers can be applied along multiple
dimensions, depending on the structure of the input data.
For illustrative purposes, we consider the 1D pooling case, and the 2D case is completely
similar. For this we keep the 1D CNN layer in mind, see (8.15). The only difference is
that we replace the convolutional operation (8.16) by pooling operations. For this we
select a pooling size K ∈ N and a stride δ ∈ N giving the mapping
$$z^{\rm pool}: \mathbb{R}^{t \times q} \to \mathbb{R}^{t' \times q}, \qquad X_{1:t} \mapsto z^{\rm pool}(X_{1:t}) = \Big( z_{u,j}^{\rm pool}(X_{1:t}) \Big)_{1 \le u \le t',\, 1 \le j \le q}.$$
Here, we note that the 1D pooling layer reduces the first dimension of the data from t to
$t' = \big\lfloor \frac{t-K}{\delta} + 1 \big\rfloor$ while preserving the original covariate dimension q. The elements of the
output depend on the type of pooling used:
For MaxPooling, the u-th pooling window is mapped to
$$z_{u,j}^{\rm pool}(X_{1:t}) = \max_{(u-1)\delta + 1 \,\le\, k \,\le\, u\delta} X_{k,j},$$
and for AveragePooling to
$$z_{u,j}^{\rm pool}(X_{1:t}) = \frac{1}{\delta} \sum_{k=(u-1)\delta+1}^{u\delta} X_{k,j}.$$
The case of 2D pooling is completely analogous. For the default stride δ = K, we consider
a partition of the tensor, and in each of these subsets of the partition we either extract
the maximum or the average, depending on the type of pooling; this also relates to Figure
8.4 (rhs).
8.3.6 Padding
Padding refers to the process of adding extra values, generally zeros, to the edges of a
tensor before applying the convolution operation. This technique is used to control the
dimension of the output feature map and to ensure that the network effectively captures
information located at the edges. When a CNN layer is applied to the input data tensor,
it typically reduces the dimension from (t, s) to (t′ , s′ ) because the filter does not fully
operate on the edges of the input even if we have a stride δ = 1. A widely used solution
to this problem, known as padding, involves adding zeros to the edges of the input data
to ensure that the output feature map maintains the same spatial dimensions as the
original input tensor. This allows for the filter to operate effectively at the borders
without reducing the size of the output. Padding can be applied to various types of
input tensors. For example, in 1D CNNs (used for time-series data), padding may be
applied at the start and end of the sequence. Similarly, in 2D CNNs, padding is applied
along both dimensions and so on for higher-dimensional CNNs. The amount of padding
is typically determined based on the filter size K; we assume stride δ = 1 for the moment.
For instance, in the case of a 2D CNN layer with a filter of size 3 × 3, padding of 1 pixel
is commonly added to each side of the input to preserve its original dimensions.
Figure 8.8 graphically illustrates the process of applying padding to data in both one-
dimensional and two-dimensional cases.
We have now met all the modules that are necessary to use CNNs as feature extrac-
tors. Generally, the output of a deep CNN architecture is a tensor that has the same
order as the input tensor, e.g., in a time-series example, the output tensor is an element
$z^{(d:1)}(X_{1:t}) \in \mathbb{R}^{t' \times q_d}$. If we want to use this new data representation $z^{(d:1)}(X_{1:t})$ for pre-
dictive modeling, say, to forecast an insurance claim in the next period, it needs further
processing, e.g., through a FNN layer; this is illustrated in Figure 8.3. In particular,
this requires transforming the tensor $z^{(d:1)}(X_{1:t})$ into a vector of length $t' q_d$. In machine
learning lingo, this shape transformation is called flatten, and it is implemented by a
so-called flatten layer (which only performs that shape transformation of a tensor to a
vector). After this flatten layer, the extracted information can be concatenated with
other vector information, as illustrated in Figure 8.3.
Recurrent neural networks (RNNs) are a class of neural networks specifically de-
signed to process sequential data.
RNNs generate predictions at a given time step by taking into account the preceding elements of the sequence. This mechanism is implemented through the introduction of cyclic/recurrent connections within the network, which feed the output of a network layer back into the network as input, enabling information to persist across several time steps.
This design makes RNNs promising methods for tasks where the order of data is relevant
(time-causality), such as text processing or time-series forecasting.
The two most popular RNN architectures are the long short-term memory (LSTM) archi-
tecture of Hochreiter–Schmidhuber [100] and the gated recurrent unit (GRU) architecture
of Cho et al. [42]. These two architectures are presented in Sections 8.4.2 and 8.4.3, be-
low. An early actuarial application of RNNs is the mortality example of Nigri et al. [167],
and Lindholm–Palmborg [139] discuss the efficient use of data in time-series mortality
problems.
At each time step u, the RNN cell receives two inputs:

(1) the value X_u of the sequence X_{1:t} at the current time step u, and

(2) the output z^{(1)}_{u−1} of the RNN cell from the previous time step u − 1.

The output z^{(1)}_{u−1} of the previous RNN step can be seen as a compressed summary (learned representation) of the past observations X_{1:u−1} prior to u. These two components are concatenated to generate the next prediction. This gives the recursion

    z^{(1)}_u = z^{(1)}(X_u, z^{(1)}_{u−1}),

where X_u is the covariate at time u ≥ 1 and z^{(1)}_{u−1} is the output from the previous RNN loop; we initialize z^{(1)}_0 = 0. The vector z^{(1)}_u = (z^{(1)}_{u,1}, . . . , z^{(1)}_{u,q_1})^⊤ is interpreted as a learned representation considering all information X_{1:u}. The j-th unit is computed as

    z^{(1)}_{u,j} = z^{(1)}_{u,j}(X_{1:u}) = φ( w^{(1)}_{0,j} + ⟨w^{(1)}_j, X_u⟩ + ⟨v^{(1)}_j, z^{(1)}_{u−1}⟩ ),    (8.22)
with the same update rule applied recursively. This design creates a deep computational graph allowing the network to learn long-range time dependencies in the data. The total number of parameters to optimize in an RNN layer with q_1 cells is given by q_1(1 + q + q_1). A visual representation of the recurrent mechanism of an RNN unit is provided in Figure 8.9. At each time step, the unit produces an output, with the final output of the sequence high-
lighted in red. In most applications of neural networks, particularly in tasks involving
Figure 8.9: The recurrent mechanism of an RNN cell.
sequential data, only the last RNN cell state z^{(1)}_t = z^{(1)}_t(X_{1:t}) is used to generate the prediction of the response Y_t. However, in some cases, it can be beneficial for the model to consider the entire sequence (called return sequence) of RNN cell states across the different time steps, represented as (z^{(1)}_u)_{u=1}^t. This provides a more comprehensive/granular representation of the input data. In such scenarios, one may apply flattening to further process the data, see Section 8.3.7 and Figure 8.3.
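The following R sketch implements the plain-vanilla RNN recursion (8.22) in matrix form, z_u = φ(w_0 + W X_u + V z_{u−1}) with initialization z_0 = 0; the tanh activation and the randomly initialized weights are illustrative only (no training is performed).

rnn_forward <- function(X, W, V, w0, phi = tanh) {
  t_len <- nrow(X); q1 <- length(w0)
  Z <- matrix(0, nrow = t_len, ncol = q1)
  z_prev <- rep(0, q1)                                # initialization z_0 = 0
  for (u in seq_len(t_len)) {
    z_prev <- phi(w0 + as.vector(W %*% X[u, ]) + as.vector(V %*% z_prev))
    Z[u, ] <- z_prev
  }
  Z                                                   # return sequence (z_1, ..., z_t)
}
q <- 3; q1 <- 4; t_len <- 6
X  <- matrix(rnorm(t_len * q), nrow = t_len)
W  <- matrix(rnorm(q1 * q), nrow = q1)
V  <- matrix(rnorm(q1 * q1), nrow = q1)
w0 <- rnorm(q1)
rnn_forward(X, W, V, w0)              # t x q1; the last row is z_t used for prediction
q1 * (1 + q + q1)                     # number of parameters of this RNN layer: 32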
For training RNNs, one commonly uses the back-propagation through time (BPTT)
algorithm, see Werbos [235]. BPTT unrolls the RNN across the entire sequence during
optimization. While effective, BPTT often suffers from issues such as slow convergence,
vanishing gradients, and difficulties in learning long-term dependencies. To overcome
these challenges, more advanced RNN architectures such as LSTM and GRU networks
were introduced. These are discussed next.
released. This gating mechanism allows the network to select important information that
has to be kept in the memory and discard irrelevant or outdated data, making it more
effective in capturing long-term dependencies. By combining this long-term memory with
the short-term memory provided by the recurrent connections, LSTMs are able to learn
both long and short-term dependencies simultaneously.
To calculate the output of the LSTM layer, three activation functions are applied: the sigmoid function σ ∈ (0, 1), the hyperbolic tangent function tanh(x) ∈ (−1, 1), and a general activation function φ; we refer to Table 5.1 for their definitions.
The first LSTM layer is given by the mapping

    z^{(1)} : R^q × R^{q_1} × R^{q_1} → R^{q_1} × R^{q_1},   (X_u, z^{(1)}_{u−1}, c^{(1)}_{u−1}) ↦ (z^{(1)}_u, c^{(1)}_u),

where z^{(1)}_u is called the hidden state and c^{(1)}_u is the memory cell. At each time step, the LSTM layer takes the current input X_u, the previous state z^{(1)}_{u−1}, and the previous memory cell c^{(1)}_{u−1}. These inputs are used to compute the two outputs of the LSTM layer:

• The updated hidden state

    z^{(1)}_u = o^{(1)}_u ⊙ φ( c^{(1)}_u ) ∈ R^{q_1},

• The updated memory cell

    c^{(1)}_u = f^{(1)}_u ⊙ c^{(1)}_{u−1} + i^{(1)}_u ⊙ c̃^{(1)}_u ∈ R^{q_1}.
The three gates and the candidate new memory cell are defined by

    f^{(1)}_u = σ( w^{(1)}_{0,f} + W^{(1)}_f X_u + V^{(1)}_f z^{(1)}_{u−1} ) ∈ (0, 1)^{q_1},
    i^{(1)}_u = σ( w^{(1)}_{0,i} + W^{(1)}_i X_u + V^{(1)}_i z^{(1)}_{u−1} ) ∈ (0, 1)^{q_1},
    o^{(1)}_u = σ( w^{(1)}_{0,o} + W^{(1)}_o X_u + V^{(1)}_o z^{(1)}_{u−1} ) ∈ (0, 1)^{q_1},
    c̃^{(1)}_u = tanh( w^{(1)}_{0,c} + W^{(1)}_c X_u + V^{(1)}_c z^{(1)}_{u−1} ) ∈ (−1, 1)^{q_1},

with biases w^{(1)}_{0,f}, w^{(1)}_{0,i}, w^{(1)}_{0,o}, w^{(1)}_{0,c} ∈ R^{q_1}, network weights W^{(1)}_f, W^{(1)}_i, W^{(1)}_o, W^{(1)}_c ∈ R^{q_1×q} and V^{(1)}_f, V^{(1)}_i, V^{(1)}_o, V^{(1)}_c ∈ R^{q_1×q_1}. These gates work together to dynamically regulate
the flow of information into, out of, and within the cell. During training, the parameters
of the LSTM layer are adjusted to optimize this gating mechanism. The total number of
weights in an LSTM layer is given by 4q1 (q + q1 + 1).
A graphical illustration of the LSTM layer is shown in Figure 8.10. This diagram high-
lights the flow of data through the gates and the interaction between the memory cell
and the hidden state over time.
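A minimal R sketch of one LSTM time step with the gates defined above is given next; the memory-cell update c_u = f_u ⊙ c_{u−1} + i_u ⊙ c̃_u is the standard LSTM update, and all weights are randomly initialized for illustration only.

sigma <- function(x) 1 / (1 + exp(-x))                # sigmoid activation
lstm_step <- function(X_u, z_prev, c_prev, par, phi = tanh) {
  f <- sigma(par$w0_f + par$Wf %*% X_u + par$Vf %*% z_prev)        # forget gate
  i <- sigma(par$w0_i + par$Wi %*% X_u + par$Vi %*% z_prev)        # input gate
  o <- sigma(par$w0_o + par$Wo %*% X_u + par$Vo %*% z_prev)        # output gate
  c_tilde <- tanh(par$w0_c + par$Wc %*% X_u + par$Vc %*% z_prev)   # candidate memory
  c_new <- f * c_prev + i * c_tilde                                # updated memory cell
  z_new <- o * phi(c_new)                                          # updated hidden state
  list(z = as.vector(z_new), c = as.vector(c_new))
}
q <- 3; q1 <- 2
par <- list(w0_f = rnorm(q1), Wf = matrix(rnorm(q1 * q), q1), Vf = matrix(rnorm(q1 * q1), q1),
            w0_i = rnorm(q1), Wi = matrix(rnorm(q1 * q), q1), Vi = matrix(rnorm(q1 * q1), q1),
            w0_o = rnorm(q1), Wo = matrix(rnorm(q1 * q), q1), Vo = matrix(rnorm(q1 * q1), q1),
            w0_c = rnorm(q1), Wc = matrix(rnorm(q1 * q), q1), Vc = matrix(rnorm(q1 * q1), q1))
lstm_step(rnorm(q), rep(0, q1), rep(0, q1), par)
4 * q1 * (q + q1 + 1)                                 # total number of weights: 48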
The GRU networks of Cho et al. [42] have emerged as a promising alternative to LSTM
networks. Although GRUs were developed more recently, they offer a simplified archi-
tecture compared to LSTMs, making them computationally more efficient while still
addressing the issue of vanishing gradients that traditional RNNs face. Unlike LSTMs,
GRUs do not have an explicit memory cell. Instead, they rely on two gates: the update
gate and the reset gate. The first determines how much of the previous hidden state
should be carried forward, essentially controlling the flow of information. The second
one, on the other hand, decides how much of the previous information should be dis-
carded when generating the current state. This design simplifies the architecture while
still allowing it to capture long-term dependencies. In essence, GRUs can be seen as a
more compact version of LSTMs, offering similar benefits in terms of handling long-range
dependencies but with fewer parameters.
This equation defines the hidden state z^{(1)}_u at time step u, which is a weighted combination of the previous hidden state z^{(1)}_{u−1} and the candidate state z̃^{(1)}_u. The balance between these two components is controlled by the state of the update gate o^{(1)}_u, which determines how much of the previous state should be retained and how much of the new information should be incorporated. The states o^{(1)}_u and z̃^{(1)}_u, with z̃^{(1)}_u further influenced by the reset gate r^{(1)}_u, are expressed as follows
    o^{(1)}_u = σ( w^{(1)}_{0,o} + W^{(1)}_o X_u + V^{(1)}_o z^{(1)}_{u−1} ) ∈ (0, 1)^{q_1},
    r^{(1)}_u = σ( w^{(1)}_{0,r} + W^{(1)}_r X_u + V^{(1)}_r z^{(1)}_{u−1} ) ∈ (0, 1)^{q_1},
    z̃^{(1)}_u = tanh( w^{(1)}_{0,z} + W^{(1)}_z X_u + V^{(1)}_z ( r^{(1)}_u ⊙ z^{(1)}_{u−1} ) ) ∈ (−1, 1)^{q_1}.
The number of weights in a GRU layer with q_1 units, equal to 3 q_1 (q + q_1 + 1), is lower than the number of parameters in an LSTM layer with the same number of units.
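The parameter counts quoted above can be checked with a few lines of R; the input dimension q = 10 and the number of units q_1 = 8 are illustrative choices.

count_rnn  <- function(q, q1) q1 * (1 + q + q1)       # plain RNN layer
count_lstm <- function(q, q1) 4 * q1 * (q + q1 + 1)   # LSTM layer
count_gru  <- function(q, q1) 3 * q1 * (q + q1 + 1)   # GRU layer
c(RNN = count_rnn(10, 8), LSTM = count_lstm(10, 8), GRU = count_gru(10, 8))
# RNN = 152, LSTM = 608, GRU = 456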
RNN layers can be stacked to obtain deep RNN architectures. For two RNN layers, this gives the recursions

    z^{(1)}_u = z^{(1)}(X_u, z^{(1)}_{u−1}),
    z^{(2)}_u = z^{(2)}(z^{(1)}_u, z^{(2)}_{u−1}),

where z^{(2)}_u represents the state of the second RNN layer at time u. Of course, this can be generalized to any depth d ≥ 2.
In the deep LSTM case, we consider for the m-th RNN layer, m ≥ 1,

    z^{(m)} : R^{q_{m−1}} × R^{q_m} × R^{q_m} → R^{q_m} × R^{q_m},   (z^{(m−1)}_u, z^{(m)}_{u−1}, c^{(m)}_{u−1}) ↦ (z^{(m)}_u, c^{(m)}_u),    (8.23)

where we initialize the input for m = 1 by z^{(0)}_u = X_u and q_0 = q.
• Deep RNN architectures, like the LSTM layers (8.23), consider the entire return sequence (z^{(m−1)}_u)_{u=1}^t of the previous RNN layer m − 1. For predicting a response Y_t, in the last time period t, one typically only extracts the last state z^{(d)}_t ∈ R^{q_d} of the last RNN layer m = d. This can then be concatenated with other covariate information, see Figure 8.3.

• Following up on the previous item, if one wants to process the entire return sequence (z^{(d)}_u)_{u=1}^t of the last RNN layer m = d, one typically needs to flatten this return sequence to further process it, see Section 8.3.7 for flatten layers.

• The previous examples have all been dealing with information X_{1:t} of equal length t. However, RNNs can process information of any length, as long as only the last state z^{(d)}_t ∈ R^{q_d} is extracted. For example, we can have insurance policyholders 1 ≤ i ≤ n with claims histories of different lengths X_{τ_i:t}, where τ_i ∈ {1, . . . , t − 1} is the starting point of the history of policyholder i. This can easily be handled by RNNs.
8.5 Transformers
Transformers are a class of deep learning architectures that have revolutionized natural language processing (NLP), and they are at the core of the great success of large language models (LLMs). Introduced by Vaswani et al. [228], these models replace traditional re-
current and convolutional structures with attention mechanisms. These attention mecha-
nisms use weighting schemes to identify and prioritize the most relevant information and
their interactions. While originally developed for tasks involving data with a sequential
structure, transformers have been adapted to tabular input data, increasing the potential
of these models in actuarial science. First applications of transformers for tabular data
were considered by Huang et al. [106] and Brauer [26]. In their work, continuous and
categorical information is tokenized and embedded so that this pre-processed input infor-
mation has a 2D tensor structure, making them suitable to enter a transformer; see also
the feature tokenizer transformer (FTT) of Gorishniy et al. [86]. More recently, Rich-
man et al. [188] advanced transformer-based models for tabular data by incorporating
a novel weighting scheme inspired by the credibility mechanism of Bühlmann’s seminal
work [34, 36].
Attention layer
The core of transformers is the attention layer, which is designed to identify the most
relevant information in the input data. The central idea is to learn a weighting scheme
that prioritizes the most important parts of the input, thereby enhancing the model’s
ability to perform a given predictive task.
Different attention mechanisms are available in the literature. Our focus is on the most
commonly used variant, the scaled dot-product attention of Vaswani et al. [228]. To
illustrate this attention mechanism, we consider three matrices Q, K, V ∈ Rt×q of the
same dimensions. These three matrices represent the query, the key, and the value,
respectively, of the attention mechanism.
The scaled dot-product attention mechanism is given by the mapping, called attention
head,
H : R(t×q)×(t×q)×(t×q) → Rt×q , (Q, K, V ) 7→ H = H(Q, K, V ).
The attention mechanism is applied to the value matrix V , with the output H calculated
as a weighted sum of its elements. The (attention) weights, dependent on the query ma-
trix Q and the key matrix K, are computed through a scalar/dot-product multiplication,
followed by the softmax function to normalize the scores
    H = A V = softmax( Q K^⊤ / √q ) V ∈ R^{t×q}.    (8.24)
Here, √q is a scaling factor which tries to make the matrices free of the input dimension q, while the matrix of scores A ∈ R^{t×t} is derived from the matrix A′ = Q K^⊤ / √q by applying the softmax operator to the rows of A′

    A = softmax(A′),   where   a_{u,s} = exp(a′_{u,s}) / Σ_{k=1}^t exp(a′_{u,k}) ∈ (0, 1),    (8.25)
for 1 ≤ u, s ≤ t. This transformation ensures that the elements of each row of the
matrix A sum to one. To provide some intuition: the learned attention scores in A are
multiplied by the value vectors in V . Each row of the resulting matrix AV is a weighted
average of the vectors in V , where the weights, which sum to one, determine the (relative)
importance of each row vector in V . It is important to note that the scaled dot-product
attention mechanism is highly efficient computationally, as it performs matrix operations,
such as dot-products and softmax, in parallel across all queries, keys and values. This
eliminates the need for recursive or sequential computation, making it particularly well-
suited for implementation on Graphics Processing Units (GPUs).
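The scaled dot-product attention (8.24)-(8.25) can be written in a few lines of base R; the helper names and the random query, key and value matrices are illustrative.

softmax_rows <- function(A) {
  E <- exp(A - apply(A, 1, max))        # subtract row maxima for numerical stability
  E / rowSums(E)                        # normalize each row to sum to one
}
attention_head <- function(Q, K, V) {
  q <- ncol(Q)
  A <- softmax_rows(Q %*% t(K) / sqrt(q))   # attention weights A
  A %*% V                                   # weighted averages of the rows of V
}
t_len <- 5; q <- 4
Q <- matrix(rnorm(t_len * q), t_len)
K <- matrix(rnorm(t_len * q), t_len)
V <- matrix(rnorm(t_len * q), t_len)
H <- attention_head(Q, K, V)                # t x q output
rowSums(softmax_rows(Q %*% t(K) / sqrt(q))) # each row of A sums to one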
Let us briefly illustrate the attention mechanism implied by the query Q and the key K. These two matrices can be expressed by their row vectors q_u and k_s, respectively, which give the entries

    a′_{u,s} = (1/√q) q_u^⊤ k_s = (1/√q) ⟨q_u, k_s⟩.    (8.26)
From this we conclude that if the query q_u points in the same direction as the key k_s, we receive a large attention weight a_{u,s} (provided all queries and keys have roughly the same absolute values). This then implies that the corresponding entries on the s-th row of the value matrix V get a large attention (weight).
Figure 8.11: Construction of attention matrix A using transposed query matrix Q⊤ (in
blue) and key matrix K ⊤ (in yellow).
Figure 8.11 illustrates the scalar products (8.26). Basically, every query q u tries to
find the keys ks that provide a large scalar product (8.26), which is mapped to a large
attention weight au,s ; this is related to the cosine similarity mentioned in the footnote
on page 131.
Time-distributed layer
    z^{t−FNN} : R^{t×q} → R^{t×q_1},   X_{1:t} ↦ z^{t−FNN}(X_{1:t}) = ( z^{FNN}(X_1), . . . , z^{FNN}(X_t) )^⊤,    (8.27)
where z FNN : Rq → Rq1 is a FNN layer. This transformation leaves the time dimension
t unchanged, as the FNN layer z FNN is applied separately to each time step 1 ≤ u ≤ t.
Importantly, the same parameters (network weights and biases) are shared across all time
steps. This invariance across time steps is what makes the layer time-distributed.
Drop-out layer
Layer normalization
where ε > 0 is a small constant added for numerical stability, and where the empirical mean X̄ ∈ R and variance σ² are calculated as follows

    X̄ = (1/q) Σ_{j=1}^q X_j   and   σ² = (1/q) Σ_{j=1}^q (X_j − X̄)².
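A minimal R sketch of layer normalization applied to a single vector X ∈ R^q reads as follows; a learnable scale and shift, as used in many implementations, is omitted here to keep the sketch aligned with the formulas above.

layer_norm <- function(X, eps = 1e-6) {
  X_bar  <- mean(X)                       # empirical mean over the q entries
  sigma2 <- mean((X - X_bar)^2)           # empirical variance over the q entries
  (X - X_bar) / sqrt(sigma2 + eps)        # centered and scaled output
}
layer_norm(c(1, 2, 3, 10))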
We are now ready to introduce the attention head and the transformer layer in
detail. In this section, we discuss transformers designed for sequential data.
Transformer layer
    z^{t−FNN}_j : R^{t×q} → R^{t×q},   X_{1:t} ↦ z^{t−FNN}_j(X_{1:t}),

for j ∈ {Q, K, V}. These provide us with the time-slices for fixed time points 1 ≤ u ≤ t, see (8.27),
see (8.27),
    q_u = z^{FNN}_Q(X_u) = φ_Q( w^{(Q)}_0 + W_Q X_u ) ∈ R^q,
    k_u = z^{FNN}_K(X_u) = φ_K( w^{(K)}_0 + W_K X_u ) ∈ R^q,
    v_u = z^{FNN}_V(X_u) = φ_V( w^{(V)}_0 + W_V X_u ) ∈ R^q,
with corresponding network weights, biases and activation functions. Writing this in
matrix notation gives us the query, key and value matrices
    Q = z^{t−FNN}_Q(X_{1:t}) = [q_1, . . . , q_t]^⊤ ∈ R^{t×q},
    K = z^{t−FNN}_K(X_{1:t}) = [k_1, . . . , k_t]^⊤ ∈ R^{t×q},
    V = z^{t−FNN}_V(X_{1:t}) = [v_1, . . . , v_t]^⊤ ∈ R^{t×q},
see also Figure 8.11. This allows us to define the attention head implied by input X 1:t
as follows, see (8.24),
    H = H(X_{1:t}) = softmax( Q K^⊤ / √q ) V ∈ R^{t×q}.    (8.28)
5 Some authors suggest pre-processing the input by applying layer normalization; however, we omit this step in our notation to keep it as simple as possible.
A transformer layer is constructed by combining the attention head with the augmented input tensor through a skip-connection mechanism

    X_{1:t} ↦ z^{skip}(X_{1:t}) = X_{1:t} + H(X_{1:t}) ∈ R^{t×q}.    (8.29)

After the attention mechanism, the transformed input is typically processed through a series of additional layers. Generally, a normalization layer z^{norm} is applied first, followed by a time-distributed FNN layer z^{t−FNN} having output dimension q. In this setting, the output of the transformer layer can be expressed as

    z^{trans}(X_{1:t}) = z^{skip}(X_{1:t}) + z^{t−FNN} ∘ z^{norm}( z^{skip}(X_{1:t}) ),    (8.30)
The layers described in (8.30) operate on the original input X 1:t , and this can be
formalized to the transformer layer
Figure 8.12 illustrates the transformer layer (8.30)-(8.31). The input X 1:t corresponds
to the blue box and the output z trans (X 1:t ) to the yellow box, and the feature extraction
by the transformer layer is sketched in the magenta box.
Multi-head transformers
A transformer layer can also have multiple attention heads, allowing the model to focus
on different parts of the input sequence simultaneously. Rather than computing a single
attention output, multi-head attention applies the attention mechanism multiple times in
parallel, with each attention head using different weights and parameters. That is, each
attention head operates on a separate set of query, key and value matrices, producing
multiple output tensors. These output tensors are then concatenated and projected once
more to generate the final attention result.
Formally, we can define the multi-head attention mechanism with nh attention heads as
follows. For each head j, we apply the attention mechanism to the input tensor X 1:t to
obtain the matrices Qj , Kj , Vj , which are derived from separate projections of the input
tensor. This gives attention heads for 1 ≤ j ≤ nh
    H_j(X_{1:t}) = softmax( Q_j K_j^⊤ / √q ) V_j ∈ R^{t×q}.
These attention head outputs are concatenated along the feature dimension, yielding the
multi-head (MH) attention output
    H^{MH}(X_{1:t}) = Concat( H_1(X_{1:t}), H_2(X_{1:t}), . . . , H_{n_h}(X_{1:t}) ) W^O ∈ R^{t×q},    (8.32)

where W^O ∈ R^{n_h q×q} is the output weight matrix. This multi-head attention output is
then incorporated into the subsequent layers, as in the original architecture. Specifically,
after computing the multi-head attention output, it is added to the input tensor X 1:t
using a skip-connection (8.29), followed by normalization and FNN layers (8.30).
• Bring all data in tensor form using feature tokenization, see Section 8.2.
• The tensors are processed by FNN layers, CNN layers, RNN layers and/or
(multi-)head transformer layers, see also Figure 8.3.
• This vector is further processed through FNN layers to form the output, see
Figure 8.3.
• Based on the training, validation and test split, (U, V, T ), we train this
architecture using a strictly consistent loss function L, see Section 5.3.
Remark 8.2 (RNN vs. transformer). We close this section with a remark highlighting
a key difference between RNN layers and transformer layers. RNN layers have a natural
notion of time/position because RNN layers move sequentially across the time-series data
X 1:t . In contrast, attention layers do not respect time-causality, in the sense that any
query q_u can communicate with any key k_s through (8.26). To make transformers aware of time, one typically adds a positional encoding to the input tensor, meaning that, e.g., the last column q of X_{1:t} ∈ R^{t×q} contains the (normalized) entries u/t ∈ [0, 1] on the u-th row of X_{1:t}. This adds a notion of time to the algorithm, though it does not imply that the algorithm will respect time causality. Time causality would only hold if queries q_u could only communicate with keys k_s that occurred before, i.e., for s ≤ u. ■
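A minimal R sketch of the simple positional encoding mentioned in Remark 8.2 is given next; it appends the normalized time index u/t as an extra column to the input tensor (the function name is illustrative).

add_position <- function(X) {
  t_len <- nrow(X)
  cbind(X, position = (1:t_len) / t_len)  # append normalized time index u/t
}
X <- matrix(rnorm(10), nrow = 5)          # t = 5, q = 2
add_position(X)                           # t x (q + 1), last column contains u/t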
    e^{EE}_j : A_j → R^b,   X_j ↦ e^{EE}_j := e^{EE}_j(X_j),

where A_j denotes the levels of the categorical covariate X_j. Note that for every categorical covariate we use the same embedding dimension b. This results in b Σ_{j=1}^{q_c} |A_j| embedding weights that need to be learned.
For the continuous covariates X_j, q_c + 1 ≤ j ≤ q, we also perform a b-dimensional embedding. For this we select FNNs, for q_c + 1 ≤ j ≤ q, providing

    z_j : R → R^b,   X_j ↦ z_j := z_j(X_j).
This tensor X°_{1:q} contains the transformed covariate information, and it serves as input to subsequent layers in our model. It is further augmented by introducing an additional component, the so-called classification (CLS) token. This additional component is inspired by the bidirectional encoder representations from transformers (BERT) architecture discussed in Devlin et al. [55]. The purpose of the CLS token is to encode every column 1 ≤ k ≤ b of the input tensor X°_{1:q} ∈ R^{q×b} into a single variable.
This results in the augmented input tensor

    X_{1:q+1} = [ X°_{1:q} ; c^⊤ ] = [ e^{EE}_1, . . . , e^{EE}_{q_c}, z_{q_c+1}, . . . , z_q, c ]^⊤ ∈ R^{(q+1)×b},    (8.33)

which stacks the tokenized covariates X°_{1:q} on top of the CLS token c ∈ R^b.
This mapping can also be implemented using the multi-head attention mechanism. Predictions are derived considering only the final row of the output tensor z^{trans}(X_{1:q+1}), which corresponds to the position of the CLS token before the transformer layer is applied. Let c^{trans}(X) := z^{trans}_{q+1}(X_{1:q+1}) ∈ R^b denote the CLS token after being processed by the transformer layer. It encodes the tokenized information of the input covariates. Through the attention mechanism within the transformer layer, interactions between all covariates are captured and integrated into the CLS token. As a result, c^{trans}(X) becomes an optimized representation of the raw tabular input data X

    X ↦ c^{trans}(X).
The final step involves decoding this tokenized variable c^{trans}(X) into a set of covariates suitable for predicting the response variable Y. This decoding process is problem-specific. For instance, Gorishniy et al. [86] use layer normalization, followed by a ReLU activation and a one-dimensional FNN layer with linear activation, such that

    X ↦ µ^{trans}(X) = z^{FNN}( ReLU( z^{norm}(c^{trans}(X)) ) ).    (8.34)
Of course, for claims counts and claims size modeling we would rather consider a different architecture, having the log-link, i.e., g^{−1} = exp, to ensure positivity, see Section 5.1. Figure 8.14 graphically illustrates all the blocks constituting the transformer architecture described above.
Figure 8.14: Graphical representation of the transformer architecture for tabular data.
Richman et al. [188] propose an extension of this transformer architecture for tabular data which is inspired by the seminal work of Bühlmann [34] and Bühlmann–Straub [36]; this new architecture was named the credibility transformer (CT). The resulting architecture presents some modifications to enhance model flexibility and to fully leverage the benefits of the credibility mechanism.
Positional encodings
The first modification concerns the input tensor (8.33). Additional positional encodings
were added, this is a common modification for transformers to capture the notion of time
and/or position. However, unlike the sequential data typically processed by traditional
transformers, tabular data lacks a natural ordering. Indeed, in this context, positional
encodings are adapted to encode information specific to the covariates, ensuring the
model can receive additional information about the structure of the tabular data.
While sophisticated positional encoding mechanisms exist, such as the sine-cosine encoding scheme proposed by Vaswani et al. [228], the CT architecture adopts a simpler approach based on embedding layers. More formally, the embedding layer maps the position of each covariate j ∈ {1, . . . , q} into a b-dimensional representation inducing the mapping

    e^{pos} : {1, . . . , q} → R^b,   j ↦ e^{pos}_j := e^{pos}(j).
This positional encoding scheme introduces qb additional parameters. These learned representations are incorporated to augment the input tensor (8.33) obtained from the feature tokenization transformation of the original covariates and the CLS token. In this context, the augmented input tensor is represented as

    X^+_{1:q+1} = [ X°_{1:q}, ((e^{pos}_1)^⊤; . . . ; (e^{pos}_q)^⊤) ; c_1^⊤, c_2^⊤ ] ∈ R^{(q+1)×2b},

where the first q rows concatenate the tokenized covariates X°_{1:q} with the corresponding positional encodings (e^{pos}_j)^⊤, and the last row contains the CLS token components c_1 and c_2.
Transformer layer
The augmented tensor X + 1:q+1 represents the input to the standard transformer architec-
ture. Considering a single transformer layer, we have a mapping
    z^{trans} : R^{(q+1)×2b} → R^{(q+1)×2b},   X^+_{1:q+1} ↦ z^{trans}(X^+_{1:q+1}).
Within the transformer layer, the CT architecture of Richman et al. [188] introduces some
differences compared to the standard layer (8.30). Specifically, it incorporates additional
time-distributed FNN and normalization layers to increase the model’s flexibility. Start-
ing from z^{skip}(X^+_{1:q+1}) obtained as in (8.29), the output of the transformer layer used in the CT architecture is then defined as

    z^{trans}(X^+_{1:q+1}) = z^{skip}(X^+_{1:q+1}) + z^{norm2} ∘ z^{drop2} ∘ z^{t-FNN2} ∘ z^{drop1} ∘ z^{t-FNN1} ∘ z^{norm1}( z^{skip}(X^+_{1:q+1}) ).    (8.35)
In this process, the tensor is first normalized, resulting in z norm1 , and then processed
through a time-distributed FNN layer, denoted as z t-FNN1 , combined with drop-out layer
z drop1 . This is followed by a second time-distributed FNN layer, z t-FNN2 , with another
drop-out z drop2 . Finally, the process concludes with a second normalization step, z norm2 .
The result of these transformations is combined through a second skip-connection, pro-
ducing an output tensor with shape R(q+1)×2b .
The next component of the CT architecture stands out from other models as it focuses on
implementing the core credibility mechanism. Unlike Gorishniy et al. [86], which relies
solely on the transformer-tokenized CLS token for predictions, the CT combines two
distinct versions of the CLS token through a credibility-based weighting scheme. More
precisely, the first version of the CLS token, referred to as the prior, is extracted before the
covariates undergo interactions through the attention mechanism. As a result, it reflects
the initial representation of the input covariates without any interactions between them.
The prior version is extracted from the value matrix V = [v 1 , . . . , v q+1 ]⊤ . To ensure
that this token is represented exactly in the same embedding space as the outputs of the
transformer, it is processed through the same transformations as the transformer layer
in equation (8.35), using the same weights.
This gives us the following representation for the prior token
cprior = z norm2 ◦ z drop2 ◦ z FNN2 ◦ z drop1 ◦ z FNN1 ◦ z norm1 (v q+1 ) ∈ R2b ,
that is, we use the layers from (8.35), but we do not need its time-distributed versions
because we only process a single vector v q+1 . The second version of the CLS token is
derived after processing everything through the transformer layer. This version incor-
porates the effects of the attention mechanism, reflecting the interactions among the
covariates. The second version of the CLS token holds the tokenized information of the
covariates X as well as their positional embeddings.
This transformer-based token is given by

    c^{trans} = z^{trans}_{q+1}(X^+_{1:q+1}) ∈ R^{2b}.
Credibility mechanism
Both tokens cprior and ctrans are used for making predictions in the CT architecture,
with weights assigned to each representation. This process involves selecting a fixed
probability weight, α ∈ (0, 1), and sampling independent Bernoulli random variables
Z ∼ Bernoulli(α) during SGD training. These random variables determine which CLS
token is passed forward through the network to make predictions.
Specifically, the two tokens are combined as

    c^{cred} = Z c^{trans} + (1 − Z) c^{prior} ∈ R^{2b}.    (8.36)
Thus, in α·100% of the gradient descent steps, the transformer-based token ctrans is used,
which has been augmented by covariate interactions. In the remaining (1−α)·100% of the
steps, the prior token cprior is selected. This mechanism effectively assigns a credibility
of α to the transformer-based token ctrans and a complementary credibility of 1 − α to
the prior token cprior in SGD, guiding the network to learn reasonable parameters during
training. This credibility token ccred then enters an encoder for prediction, similar to
(8.34).
The probability α is treated as a hyper-parameter and can be optimized via grid search. One could select α > 1/2 to give greater weight to the tokenized covariate information, reflecting its increased importance in the prediction process; in the examples of Richman et al. [188], the best results have been obtained for a choice of roughly α = 95%.
The credibility mechanism in equation (8.36) is applied only during the SGD fitting. For
out-of-sample predictions, one sets Z ≡ 1, and uses the transformer-based token.
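A minimal R sketch of this Bernoulli gating reads as follows; it mirrors the credibility weighting described above, with the transformer token used with probability α during training and always used for out-of-sample prediction (function and argument names are illustrative).

credibility_token <- function(c_prior, c_trans, alpha = 0.95, training = TRUE) {
  Z <- if (training) rbinom(1, size = 1, prob = alpha) else 1   # Z = 1 out-of-sample
  Z * c_trans + (1 - Z) * c_prior
}
c_prior <- rnorm(4); c_trans <- rnorm(4)
credibility_token(c_prior, c_trans, alpha = 0.95, training = TRUE)
credibility_token(c_prior, c_trans, training = FALSE)           # returns c_trans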
The CT architecture introduces another and less obvious credibility mechanism which is
realized during the dot-product attention. To understand how it works, consider that,
according to equation (8.28), the last row of the attention head, denoted as Hq+1 , is
obtained by multiplying the elements of Aq+1 = (aq+1,1 , . . . , aq+1,q+1 ) with the corre-
sponding columns of the value matrix V ∈ R(q+1)×2b :
Furthermore, since A results from applying the softmax operator (see (8.25)), the ele-
ments of the vector Aq+1 are strictly positive and satisfy the normalization condition
    Σ_{j=1}^{q+1} a_{q+1,j} = 1,   with a_{q+1,j} > 0.
In this context, the k-th element of the attention head H_{q+1} can be expressed as a convex combination of the elements in the k-th column of the value matrix V, with coefficients given by A_{q+1}. Additionally, decomposing the value matrix V into two parts, the first part, v^{covariate} ∈ R^{q×2b}, contains the first q rows of V, while the second part, v_{q+1} ∈ R^{2b}, corresponds to the row associated with the CLS token, equation (8.37) can be reformulated as:

    H_{q+1} = P v_{q+1} + (1 − P) v^{covariate},    (8.38)

where P = a_{q+1,q+1} ∈ (0, 1) is the last element of the attention row A_{q+1} and represents the weight assigned to the CLS token. The remaining weight, Σ_{j=1}^{q} a_{q+1,j} = 1 − P, is
distributed across the covariate information. This formulation reveals that the attention
mechanism for the CLS token can be interpreted as a credibility-weighted average. The
CLS token’s own information (representing collective experience) is combined with the
covariates’ information (representing individual experience) according to their respective
credibility weights. In essence, this is a Bühlmann [34] type linear credibility formula, or a
dynamic version of it, with the credibility weights learned during training and depending
on the input.
Decoder
The final block of the CT architecture is the decoder that derives the predictions from the representation c^{cred}(X) given in (8.36). The proposal of Richman et al. [188] performs this task considering an additional FNN layer z^{FNN} and a suitable link function g, so that we obtain the regression function

    X ↦ µ^{cred}(X) = g^{−1} ⟨ w, z^{FNN}(c^{cred}(X)) ⟩.    (8.39)
We conclude that transformers for tabular data (using the feature tokenization transformation) and the credibility transformer extension present interesting network architectures for solving actuarial problems. The credibility transformer is inspired by a Bühlmann [34] credibility mechanism that is useful to stabilize and improve network training. In fact, in examples, the credibility transformer has shown excellent performance, though at higher computational costs than a classical FNN architecture. Moreover, the hidden credibility mechanism (8.38) allows for an integrated variable importance measure; for further details, see Richman et al. [188].
Unsupervised learning
9.1 Introduction
Unsupervised learning covers the topic of learning the structure in the covariates X
without considering any response variable Y . This means that unsupervised learning
aims at understanding the population distribution of the covariates X ∼ P. This can be
achieved by learning the inherent pattern in X from an i.i.d. learning sample L = (X i )ni=1 .
This learning sample is called unlabelled because it does not include any responses.
Broadly speaking, there are the following main tasks that can be studied and solved with
unsupervised learning.
(2) Clustering. Clustering techniques are methods that are based on classifying (or bin-
ning), meaning that they group similar covariates X into (homogeneous) classes
(clusters, bins). This leads to a segmentation of a heterogeneous population into
homogeneous classes. Popular methods are hierarchical clustering methods, K-
means clustering, K-medoids clustering, distribution-based Gaussian mixture mod-
els (GMMs) clustering or density-based spatial clustering of applications with noise
(DBSCAN).
The previous methods are mainly used to understand, simplify and illustrate the covari-
ate distribution X ∼ P. In financial and insurance applications, unsupervised learning
methods can also be used for anomaly detection. Based on similarity measures, unsuper-
vised learning methods may be used for outlier detection, e.g., indicating fraud or other
abnormal structure or behavior in the data.
We give a selected overview of some of these unsupervised learning techniques, and for
more methods and a more in-depth discussion we refer the interested reader to the unsu-
pervised learning literature. A general difficulty in most unsupervised learning methods
is that they work well for real-valued vectors X ∈ Rq , but they struggle to deal with
categorical variables. Naturally, actuarial problems heavily rely on categorical covari-
ates, and there is only little guidance on how to deal with those categorical variables
in an unsupervised learning context, e.g., how can we reasonably quantify dissimilarity
between different colors of cars? Section 2.3.2 has been discussing pre-processing of cat-
egorical covariates, such as one-hot encoding, entity embedding or target encoding. This
has been extended by considering a contextual entity embedding in Section 8.2.2. These
are possible ways of pre-processing categorical covariates before considering unsupervised
learning methods.
Figure 9.1: 2-dimensional features X.
9.2.1 Standardization
Throughout this chapter, we assume to work with standardized data, meaning the fol-
lowing. Based on the learning sample L = (X i )ni=1 ⊂ Rq , we construct the design matrix
    X = [X_1, . . . , X_n]^⊤ = ( X_{i,j} )_{1≤i≤n, 1≤j≤q} ∈ R^{n×q};    (9.1)
compared to (2.10), we drop the intercept column from the design matrix in this chapter.
The columns of this design matrix X describe different quantities, e.g., the first column
may describe the weight of the vehicle, the second one the age of the policyholder, etc.
Thus, these columns live on different scales (and units). Since we would like to apply
a unit-free dimension reduction technique to X, we need to standardize the columns
of X beforehand. For this foregoing standardization, we apply (2.18) to all covariate
components, such that the columns of the resulting design matrix X are centered and have
the same empirical variance. The empirical variance is either (n − 1)/n or 1 depending
on which empirical standard deviation estimator was applied. Standardization is done
to ensure that all columns of the design matrix X live on the same scale, are unit-free,
before applying the PCA to this design matrix X.
A general difficulty in most of the unsupervised learning methods is the treatment of
categorical variables. They can be embedded using one-hot encoding, but if we have
many levels, the resulting design matrix X will be sparse, i.e., with lots of zero entries
(before standardization), and most unsupervised learning methods struggle with this
sparsity; intuitively, the higher the dimension q the more collinear sparse vectors become
and the less well-conditioned the design matrix X will be. Therefore, it is advantageous if
one can use a low-dimensional entity embedding (2.15) for categorical variables, though,
it is not always clear where one can get it from. E.g., if one uses a supervised learning
method for this entity embedding, the utilized target may not be the right label that
reflects the clustering that one wants to get. For instance, if one has different provinces,
there are many different targets (population density, average salary, area of the province,
average rainfall, etc.), and each target may lead to a different entity embedding.
9.2.2 Auto-encoders
The general principle of finding a lower dimensional object that encodes the learning
sample L = (X i )ni=1 can be described by an auto-encoder. An auto-encoder maps a
high-dimensional object X ∈ Rq to a low-dimensional representation, say, in Rp , p < q,
so that this dimension reduction still allows to (almost perfectly) reconstruct the original
data X. To measure the loss of information, we introduce a dissimilarity function.
Definition 9.1 (dissimilarity function). A dissimilarity function
Note that we use the same notation L for the dissimilarity function as for the loss function
(1.5) because, essentially, they play the same role, only the input dimensions differ. In
general, a dissimilarity function does not need to be a proper distance function. That is,
a dissimilarity function does not need to be symmetric in its arguments, nor does it need
to satisfy the triangle inequality, e.g., the KL divergence is a non-symmetric example
that does not satisfy the triangle inequality.
Φ : Rq → Rp and Ψ : Rp → Rq ,
such that their composition Ψ ◦ Φ has a small reconstruction error w.r.t. the chosen
dissimilarity function L(·, ·), that is,
Strictly speaking, Definition 9.2 is not a proper mathematical definition, because the
clause in (9.3) is not a well-defined mathematical term. We allow ourselves to be a bit
sloppy at this point; we are just going to explain it, and it will not harm any mathematical
arguments below.
Naturally, a small reconstruction error in (9.3) cannot be achieved for all X ∈ Rq , because
the dimension reduction p < q always leads to a loss of information on the entire space
Rq . However, if all X “of interest” live in a lower dimensional object B1 of dimension
p, then it is possible to find an encoder (coordinate mapping) Φ : Rq → Rp so that one
can perfectly reconstruct the original object B1 by the decoder Ψ : Rp → Rq . We give
an example.
Example 9.3. We come back to Figure 9.1. If the covariates live in the unit circle,
X ∈ B1 ⊂ R2 , the encoder (coordinate mapping) is received by Φ : R2 → [0, 2π)
describing the angle in polar coordinates, and the decoder Ψ : [0, 2π) → B1 ⊂ R2
maps these angles to the unit circle. This auto-encoder preserves the information of the
polar angle and it loses the information about the size of the radius. However, if all
covariates X ∈ B1 “of interest” lie in the unit circle, the information about the radius
is not necessary, and, in fact, the object B1 is one-dimensional p = 1 in this case. In
other words, the auto-encoder Ψ ◦ Φ is the identity map on B1 , and generally it is an
(orthogonal) projection from R2 to B1 . We conclude that in this case, the interval [0, 2π)
is the low-dimensional representation of the unit circle B1 . ■
Working in Euclidean spaces R^q, one can take the classical L^k-norms, k ∈ (0, ∞], as dissimilarity measures

    L(X, X′) = ‖X − X′‖_k = ( Σ_{j=1}^q |X_j − X′_j|^k )^{1/k}.    (9.4)
When it comes to categorical covariates, things start to become more difficult. Assume
that X is categorical taking K levels. Using one-hot encoding we can embed X 7→ X ∈
{0, 1}K into the K-dimensional Euclidean space RK . The Hamming distance counts the
number of positions where two one-hot encoded vectors X and X ′ differ. The Hamming
distance is equal to the Manhattan distance for binary vectors. Up to a factor of two,
the one-hot encoded Hamming distance is equal to
    L(X, X′) = (1/K²) 1_{{X ≠ X′}}.
This is inspired by the fact that if X and X ′ are independent and uniform on {a1 , . . . , aK },
we receive the contingency Table 9.1 (lhs). For non-uniform i.i.d. categorical variables it
may be adapted to a weighted version as sketched on the right-hand side of Table 9.1
    L(X, X′) = ( 1/(p_X p_{X′}) ) 1_{{X ≠ X′}},
    X \ X′ :  a_1 , . . . , a_K                      X \ X′ :  a_1 , . . . , a_K
    a_1   :  1/K², . . . , 1/K²                      a_1   :  1/p²_{a_1}, . . . , 1/(p_{a_1} p_{a_K})
    ...                                              ...
    a_K   :  1/K², . . . , 1/K²                      a_K   :  1/(p_{a_K} p_{a_1}), . . . , 1/p²_{a_K}

Table 9.1: Dissimilarity weights between the levels of X and X′: (lhs) uniform case, (rhs) weighted case.
The PCA is a linear dimension reduction technique that has different interpretations,
one of them is a linear auto-encoder interpretation, according to Definition 9.2. We
have decided to give the technical explanation behind the PCA that describes an explicit
construction that will result in the auto-encoder interpretation. Basically, this only uses
linear algebra.
Consider the design matrix X = [X 1 , . . . , X n ]⊤ given by (9.1). The rows of this design
matrix contain the transposed covariates X i ∈ Rq , 1 ≤ i ≤ n. Select the standard unit
basis (ej )qj=1 of the Euclidean space Rq . This allows us to write the covariates X i as
    X_i = X_{i,1} e_1 + . . . + X_{i,q} e_q = Σ_{j=1}^q X_{i,j} e_j ∈ R^q.    (9.7)

The PCA represents these covariates X_i in a different orthonormal basis (v_j)_{j=1}^q, i.e., by a linear transformation we can rewrite these covariates X_i as

    X_i = a_{i,1} v_1 + . . . + a_{i,q} v_q = Σ_{j=1}^q ⟨X_i, v_j⟩ v_j ∈ R^q,    (9.8)
The PCA performs the following steps that provide the dimension reduction at a minimal
loss of information:
(1) Select the encoding dimension p < q and define the encoder Φ_p : R^q → R^p by

    X ↦ Φ_p(X) = ( ⟨X, v_1⟩, . . . , ⟨X, v_p⟩ )^⊤ ∈ R^p.    (9.9)

That is, we only keep the first p principal components in the alternative orthonormal basis (v_j)_{j=1}^q representation (9.8), and we truncate the rest.
(2) Define the decoder Ψ_p : R^p → R^q by mapping Z ∈ R^p to Ψ_p(Z) = Σ_{j=1}^p Z_j v_j ∈ R^q. That is, we pad the vector Z ∈ R^p with zeros to get the right length q > p; in network modeling this is called padding with zeros to length q, see Section 8.3.6.
(3) Composing the encoder Φ_p and the decoder Ψ_p gives us the auto-encoder

    X ↦ Ψ_p ∘ Φ_p(X) = Σ_{j=1}^p ⟨X, v_j⟩ v_j ∈ R^q.    (9.10)
In view of (9.8), the resulting reconstruction error is

    X − Ψ_p ∘ Φ_p(X) = Σ_{j=p+1}^q ⟨X, v_j⟩ v_j,    (9.11)

that is, the reconstruction error is precisely determined by the components that were truncated by the encoder Φ_p. The main idea behind the PCA is to select the orthonormal basis (v_j)_{j=1}^q of R^q such that this reconstruction error (9.11) becomes minimal (in some dissimilarity measure) over the learning sample L = (X_i)_{i=1}^n. This is achieved by selecting the directions of the biggest variabilities in the design matrix X.
We want to minimize the reconstruction error (9.11) over the entire learning sample L = (X_i)_{i=1}^n. For this we select the squared L²-distance dissimilarity measure

    L(X, X′) = ‖X − X′‖²_2,

and for aggregating over the instances 1 ≤ i ≤ n, we simply take the sum of the dissimilarity terms. In view of (9.11), a straightforward computation gives us the total dissimilarity on the learning sample L, expressed by the design matrix X,

    Σ_{i=1}^n L(X_i, Ψ_p ∘ Φ_p(X_i)) = Σ_{i=1}^n ‖X_i − Ψ_p ∘ Φ_p(X_i)‖²_2 = Σ_{j=p+1}^q ‖X v_j‖²_2.    (9.12)
The latter term tells us how we should select the orthonormal basis (v_j)_{j=1}^q, namely, the terms ‖X v_j‖²_2 should be maximally decreasing in j. This then ensures that any linear auto-encoder Ψ_p ∘ Φ_p (i.e., for any 1 ≤ p ≤ q) has minimal total dissimilarity across the learning sample L. This requirement can be solved by recursive convex Lagrange problems.
A first orthonormal basis vector v 1 ∈ Rq is given by a solution of
This solves the total reconstruction error (dissimilarity) minimization (9.12) for the linear
auto-encoders (9.10) simultaneously for all 1 ≤ p ≤ q, and the single terms ⟨X, v_j⟩ v_j are the principal components in the lower dimensional representations.
At this stage, we could close the chapter on the PCA, because (9.10), (9.13) and (9.14)
fully solve the problem. However, there is a more efficient way of computing the PCA
than recursively solving the convex Lagrange problems (9.13)-(9.14). This alternative
way of computing the orthonormal basis (v j )qj=1 uses a singular value decomposition
(SVD) and the algorithm of Golub–Van Loan [82]; see Hastie et al. [93, Section 14.5.1].
The SVD is based on the following mathematical result:
There exist orthogonal matrices U ∈ Rn×q and V ∈ Rq×q , with U ⊤ U = V ⊤ V = Idq , and
a diagonal matrix Λ = diag(λ1 , . . . , λq ) ∈ Rq×q with singular values λ1 ≥ . . . ≥ λq ≥ 0
such that we have the SVD of X
X = U ΛV ⊤ . (9.15)
The matrix U is called left-singular matrix of X, and the matrix V is called right-singular
matrix of X. The crucial property of the SVD is that the column vectors of the right-
singular matrix V = [v 1 , . . . , v q ] ∈ Rq×q precisely give the orthonormal basis (v j )qj=1 that
we are looking for. This is justified by the computation
    ‖X v_j‖²_2 = v_j^⊤ X^⊤ X v_j = v_j^⊤ V Λ² V^⊤ v_j = λ_j².    (9.16)
Crucially, the singular values are ordered λ1 ≥ . . . ≥ λq ≥ 0, and the orthonormal basis
vectors (v j )qj=1 are the eigenvectors of X⊤ X to the squared singular values λ2j . Thus,
(9.16) shows that the first principal component has the biggest potential reconstruction
error, and this is decreasing in j, minimizing the reconstruction error (9.11) for any p.
Example
We provide a two-dimensional example, and we use the R command svd which is included
in the base package of R [179].
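A small R sketch illustrating the PCA via the svd command is as follows; the simulated data and the induced correlation are purely illustrative.

set.seed(1)
n <- 200
X_raw <- cbind(rnorm(n), rnorm(n))
X_raw[, 2] <- 0.9 * X_raw[, 1] + 0.3 * X_raw[, 2]  # induce a dominant first direction
X   <- scale(X_raw)                # standardized design matrix, see Section 9.2.1
dec <- svd(X)                      # X = U Lambda V^T, see (9.15)
dec$d                              # singular values lambda_1 >= lambda_2 >= 0
V   <- dec$v                       # right-singular matrix with columns v_1, v_2
PC  <- X %*% V                     # principal components <X_i, v_j>
head(PC[, 1])                      # first principal component of the first instances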
Figure 9.2: Two PCAs for two different learning samples L = (X i )ni=1 in R2 .
Figure 9.2 shows two PCAs for two different learning samples L = (X i )ni=1 in R2 , i.e., in
two dimensions q = 2. The plot on the left-hand side has a very dominant first principal
component, and the first basis vector v 1 can fairly well describe the data. The plot on the
right-hand side shows much more dispersion, and the singular values λ1 ≈ λ2 . Therefore,
we need both basis vectors v 1 and v 2 to accurately describe the second learning sample.
The principal components are obtained by computing ⟨X_i, v_j⟩, and these are given by the orthogonal projections of the black dots X_i in Figure 9.2 onto the red (j = 1) and orange (j = 2) line, respectively.
(i) The output dimension is equal to the input dimension, qd = q0 = q, where q is the
dimension of the covariates X ∈ Rq .
(ii) One FNN layer z (m) , 1 ≤ m < d, should have a very small dimension (number
of units) qm = p < q. This dimension corresponds to the number of principal
components we consider in (9.9)-(9.10) for the PCA, and z (m) is called the bottleneck
of the FNN. This bottleneck has size p, being a hyper-parameter selected by the
modeler.
Composing the BNN encoder and BNN decoder provides us with the BNN auto-encoder
Ψp ◦ Φp : Rq → Rq given by
X 7→ Ψp ◦ Φp (X) = z (d) ◦ · · · ◦ z (1) (X).
Figure 9.3: Example of a bottleneck neural network (BNN) auto-encoder architecture with q = 5 input and output units.
For network fitting, the chosen dissimilarity function is used as the loss function, so that the SGD algorithm tries to minimize the reconstruction error w.r.t. the
selected dissimilarity function. The low-dimensional representation of the data is then
received by evaluating the bottleneck of the trained FNN

    { Φ_p(X_i) }_{i=1}^n = { z^{(m:1)}(X_i) }_{i=1}^n ⊂ R^p.
The advantage of the BNN auto-encoder over the PCA is that the BNN auto-encoder can deal with non-linear structures in the data, provided we select a non-linear activation function φ. The disadvantage clearly is that the BNN auto-encoder treats the bottleneck dimension p as a hyper-parameter that needs to be selected before training the BNN. Changing this hyper-parameter requires refitting a new BNN architecture, i.e., in contrast to the PCA, we do not simultaneously get the results for all dimensions 1 ≤ p ≤ q.
Moreover, there is also no notion of singular values (λj )qj=1 that quantifies the significance
of the principal components, but one has to evaluate the reconstruction errors for every
bottleneck dimension p to receive the suitable size of the bottleneck, i.e., an acceptable
reconstruction error.
We close with remarks:
• Hinton–Salakhutdinov [97] have noticed that the gradient descent training of BNN
architectures can be difficult, and it can likely result in poorly trained BNNs. There-
fore, Hinton–Salakhutdinov [97] propose to use a BNN architecture that is sym-
metric in the bottleneck layer w.r.t. the number of neurons in all FNN layers. That
is, select an architecture with qm−k = qm+k for all 1 ≤ k ≤ m and d = 2m; Figure
9.3 gives such an example with d = 4. Training of BNNs can then successively
be done by recursively collapsing layers (keeping symmetric FNNs); we refer to
Wüthrich–Merz [243, Section 7.5.5] for more details.
• If one chooses the linear activation function for φ, the bottleneck will represent a linear space of dimension p (provided that all other FNN layers have more units),
and we receive a linear data compression. The result is basically the same as the
one from the PCA, with the difference that the BNN auto-encoder does not give
a representation in the orthonormal basis (v j )pj=1 , but it gives the (same) results
in the same linear subspace in a different parametrization (that is not explicitly
specified). The reason for this is that the BNN auto-encoder does not have any
notion of orthonormality, but this would need to be enforced by regularization
during training.
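As a sketch, a symmetric BNN auto-encoder of depth d = 4 with bottleneck dimension p = 2 can be set up as follows, assuming the R package keras is available; the layer sizes, the tanh activation and the squared-error reconstruction loss are illustrative choices and not a prescription.

library(keras)
q <- 5
model <- keras_model_sequential() %>%
  layer_dense(units = 7, activation = "tanh", input_shape = q) %>%
  layer_dense(units = 2, activation = "tanh", name = "bottleneck") %>%  # p = 2
  layer_dense(units = 7, activation = "tanh") %>%
  layer_dense(units = q, activation = "linear")
model %>% compile(optimizer = "adam", loss = "mse")
X <- matrix(rnorm(1000 * q), ncol = q)                           # standardized covariates
model %>% fit(X, X, epochs = 10, batch_size = 32, verbose = 0)   # reconstruct X from X
encoder <- keras_model(inputs = model$input,
                       outputs = get_layer(model, "bottleneck")$output)
Z <- predict(encoder, X)                                         # n x 2 encoded data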
The (non-linear) kernel PCA is based on a feature map

    F : R^q → R^d,   X ↦ F(X),    (9.17)

where we typically are thinking of a higher dimensional feature embedding, d > q; in fact, typically, this higher dimensional space will be infinite-dimensional, but for the moment it is sufficient to think about a finite large d. We give an example that is related to support vector machines (SVMs).
Figure 9.4 (lhs) shows four real-valued instances (X_i)_{i=1}^4 ⊂ R. There are two of red type and two of blue type. On the real line R, the red dots cannot be separated from the blue ones by one partition of R. Figure 9.4 (rhs) shows the situation after applying the feature map X_i ↦ F(X_i) = (X_i, X_i²/4)^⊤ ∈ R². After this two-dimensional embedding, we can separate the red from the blue dots by a single partition illustrated by the orange horizontal line. Such a feature map F is the first part of the idea behind the non-linear kernel PCA, i.e., this higher dimensional embedding gives the necessary flexibility.
It is not necessary to explicitly select the feature map F, but for our problem it suffices to know the (implied) kernel

    K : R^q × R^q → R,   (X, X′) ↦ K(X, X′) = ⟨F(X), F(X′)⟩.    (9.18)
This is called the kernel trick, which means that for such types of problems, it is sufficient
to know the kernel K. In many cases, it is simpler to directly select this kernel K, instead
of trying to find a suitable feature map F .
We are going to explain why knowing the kernel K is sufficient, but before that we
start by assuming that the feature map F : Rq → Rd is known. This gives us the new
(embedded) features (F (X i ))ni=1 ⊂ Rd , and we can construct the new design matrix in
this bigger dimensional space
Z = [F (X 1 ), . . . , F (X n )]⊤ ∈ Rn×d .
Based on this embedding, we can perform a PCA on these embedded new features. Following the recipe of the PCA, see Section 9.2.3, we can find the principal components using a SVD of Z, providing us with the right-singular matrix with column vectors (v^F_j)_{j=1}^d. This is then used to define the encoder Φ_p, see (9.9), which gives us the PCA dimension reduction for p ≤ d

    X ↦ F(X) ↦ Φ_p(F(X)) = ( ⟨F(X), v^F_1⟩, . . . , ⟨F(X), v^F_p⟩ )^⊤ ∈ R^p.    (9.19)
Up to this stage, it seems that we need to know the feature map F : Rq → Rd to compute
the individual principal components in (9.19). The crucial (next) step of the kernel trick
is that the kernel K is sufficient to compute ⟨F (X), v Fj ⟩, and the explicit knowledge of
the feature map F is not necessary. This precisely provides the (non-linear) kernel PCA
of Schölkopf et al. [200].
There are two crucial points that make the kernel trick work. These two points need
some mathematical arguments which we are going to skip here, the interested reader is
referred to Schölkopf et al. [200]. Before discussing these two points (a) and (b), we need
to slightly generalize the feature map F introduced in (9.17). Typically, this feature map
F : Rq → H maps to an infinite-dimensional Hilbert space H. This Hilbert space H
allows for the kernel K construction (9.18) because any Hilbert space is equipped with
a scalar product. Of course, the finite-dimensional space Rd , selected in (9.17), is one
example of a Hilbert space H. Now we are ready to discuss the two points (a) and (b)
that make it possible to compute (9.19).
(a) First, there exist vectors α_j = (α_{j,1}, . . . , α_{j,n})^⊤ ∈ R^n, j ≥ 1, that allow one to write the (eigen-)vectors (v^F_j)_{j≥1} as follows

        v^F_j = Σ_{i=1}^n α_{j,i} F(X_i),    (9.20)

    i.e., the vectors (v^F_j)_{j≥1} are in the span of the new features (F(X_i))_{i=1}^n. Inserting this gives us for j ≥ 1

        ⟨F(X), v^F_j⟩ = Σ_{i=1}^n α_{j,i} K(X, X_i).    (9.21)
Note: this only requires the (implied) kernel K, but not the feature map F itself.
(b) Second, we need to compute (α_j)_{j≥1}. Define the Gram matrix (kernel matrix) K = ( K(X_i, X_{i′}) )_{1≤i,i′≤n} ∈ R^{n×n}, and denote by a_j ∈ R^n its eigenvectors with corresponding eigenvalues λ^K_j. The coefficient vectors are then obtained by the normalization

        α_j = a_j / ( λ^K_j a_j^⊤ a_j )^{1/2}.    (9.22)
In summary, we can compute (9.21)-(9.22) directly from the kernel K, without the explicit
use of the feature map F . Thus, the kernel PCA dimension reduction (9.19) is fully
determined by the kernel K, which also justifies the name kernel PCA.
Popular kernels in practice are polynomial kernels of order k (with b ∈ R) or radial Gaussian kernels (with γ > 0) given by, respectively,

    K(X, X′) = ( ⟨X, X′⟩ + b )^k,    (9.23)

and

    K(X, X′) = exp( −γ ‖X − X′‖²_2 ).    (9.24)

For k = 1 (and b = 0) we have the linear kernel used in the standard PCA. Based on these kernel selections we can directly solve the kernel PCA by using (9.21)-(9.22), and we receive the kernel PCA dimension reduction (9.19) without an explicit choice of the
feature map F .
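The following R sketch performs a kernel PCA with the radial Gaussian kernel: it builds the Gram matrix, centers it as in (9.25), and computes the principal component scores via (9.21)-(9.22) from the eigen-decomposition of the centered Gram matrix; γ = 1 and the simulated data are illustrative choices.

kernel_pca <- function(X, gamma = 1, p = 2) {
  n  <- nrow(X)
  D2 <- as.matrix(dist(X))^2                                 # squared Euclidean distances
  K  <- exp(-gamma * D2)                                     # Gaussian Gram matrix
  One <- matrix(1 / n, n, n)
  K_tilde <- K - One %*% K - K %*% One + One %*% K %*% One   # centering (9.25)
  e <- eigen(K_tilde, symmetric = TRUE)
  sapply(1:p, function(j) K_tilde %*% e$vectors[, j] / sqrt(e$values[j]))
}
X <- cbind(rnorm(100), rnorm(100))
head(kernel_pca(X, gamma = 1, p = 2))                        # first two kernel principal components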
Remark 9.4 (Mercer kernel). There is one question though. Namely, are the selected
kernels in (9.23)-(9.24) “valid” kernels, i.e., are these (artificial) kernel choices implied
by a feature map F : Rq → H according to (9.18)? If there does not exist such a feature
map F for a selected kernel K, then the presented theory may not hold for that kernel
K. The Moore–Aronszajn [6] theorem gives sufficient conditions to solve this question.
For this, we first need to introduce the Mercer kernel [154]. Assume X is a metric space.
A mapping K : X × X → R is a Mercer kernel if it is continuous in both arguments,
symmetric and positive semi-definite.2 The theorem of Moore–Aronszajn tells us that
for every Mercer kernel there is a so-called reproducing kernel Hilbert space (RKHS) for
which we can select a feature map F that has this Mercer kernel K as implied kernel.
Thus, any Mercer kernel K is a valid selection as it can be generated by a feature map
F ; see, e.g., Andrès et al. [3] for more details. ■
There are a couple of points that one should consider. The (kernel) PCA is usually
performed on standardized matrices because this provides better results, i.e., essentially
we should focus on the correlation/dependence structure, see Section 9.2.1. For the
2 Positive semi-definite means that for all finite sequences of instances (X_i)_{i=1}^n and for all sequences a ∈ R^n we have Σ_{i,i′=1}^n a_i a_{i′} K(X_i, X_{i′}) ≥ 0.
kernelized version this means that one often replaces the Gram matrix K by a normalized version

    K̃ = K − 1_n K − K 1_n + 1_n K 1_n,    (9.25)

where 1_n ∈ R^{n×n} denotes the matrix with all entries equal to 1/n.
Example 9.5. We give an example of a kernel PCA. Assume that the covariates L =
(X i )ni=1 ⊂ R2 are two-dimensional, and they are related to three circles with different
radii.
Figure 9.5: Learning sample L = (X i )ni=1 in R2 with three circles of different radii.
Figure 9.5 shows the learning sample L. This learning sample cannot be partitioned
into the different colors by a simple splitting with hyperplanes (straight lines), and both
coordinates are necessary to distinguish the differently colored dots. This implies that
the first principal component of a (linear) PCA is not sufficient to describe the different
instances. This is verified in Figure 9.6 (lhs).
Figure 9.6 (middle and rhs) show a polynomial kernel PCA, with k = 2 and b = 1, and a
Gaussian kernel PCA, with γ = 1, respectively. We now observe that the first principal
component (on the x-axis) can separate the different colors fairly well, and in these two
cases this first principal component (from the kernel PCA) might be sufficient to describe
the data. This analysis does not use the standardized version of K. ■
Figure 9.6: Kernel PCAs: (lhs) linear PCA, (middle) polynomial kernel PCA with k = 2,
(rhs) Gaussian kernel PCA.
This gives us a partition (X_k)_{k∈K} of the covariate space X by defining for all k ∈ K the clusters

    X_k = { X ∈ X : C_K(X) = k };    (9.27)
for an illustration see Figure 9.7. This is essentially equivalent to the regression tree partition (6.1); the main difference lies in its specific construction. For the regression tree construction in Chapter 6 we use the responses Y to construct the partition, whereas in the clustering methods we use the covariates X themselves to define the clustering through a dissimilarity function. That is, we aim at choosing the classifier C_K such that the resulting dissimilarities within all clusters (X_k)_{k∈K} are minimal. Sometimes this is also called quantization, meaning that all covariates X_i ∈ X_k can be represented sufficiently accurately by a so-called model point c_k ∈ X_k, and actuarial modeling is then only performed on these model points (c_k)_{k∈K}. This is a common approach in life insurance portfolio valuation, as it helps to reduce the complexity of valuing large life insurance portfolios.
One distinguishes different types of clustering methods. There is (1) hierarchical cluster-
ing, (2) centroid-based clustering, (3) distribution-based clustering and (4) density-based
clustering.
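As an illustration of centroid-based clustering, a minimal R sketch with the base function kmeans is given next (see also the K-means illustration below); the simulated data and the choice of K = 4 clusters are illustrative.

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
fit <- kmeans(scale(X), centers = 4, nstart = 20)
table(fit$cluster)     # cluster sizes
fit$centers            # cluster centers, i.e., the model points c_k (standardized scale)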
[Figure 9.7: Illustration of the two K-means updating steps, step (1a) and step (1b), for K = 4 clusters.]
a tree construction-like manner, and there are two opposite ways of deriving such a
segmentation: (i) divisive clustering and (ii) agglomerative clustering.
(i) Divisive clustering is a top-down approach of constructing clusters. Analogously to
the recursive partitioning algorithm for regression trees, see Chapter 6, a divisive clus-
tering method partitions non-homogeneous clusters into more homogeneous sub-groups
w.r.t. the chosen dissimilarity measure, and a stopping rule decides on the resulting
number K of clusters. The most basic divisive clustering algorithm is called DIvisive
ANAlysis (DIANA) clustering; see Kaufman–Rousseeuw [115, Chapter 6].
(ii) Agglomerative clustering is a bottom-up method that dynamically grows bigger clus-
ters by merging sets that are similar until the algorithm is stopped. A basic agglomer-
ative algorithm is called AGglomerative NESting (AGNES) clustering; see Kaufman–
Rousseeuw [115, Chapter 5].
Typically, the resulting clusterings are illustrated in so-called dendrograms, which look the same as regression trees, see Figure 6.1. Both algorithms are usually run in a greedy manner, meaning that one tries to find the next optimal step in each loop of the recursive algorithm, i.e., the next optimal partition or fusion for the divisive and the agglomerative algorithm, respectively. Usually, this results in time-consuming algorithms, and, if they converge, they likely converge to local optima and not to the global one; the argument for this is similar to the one for the gradient descent algorithm; see Section 5.3. Note that if there are n instances that we want to partition into two non-empty sets, there are 2^{n−1} − 1 possibilities. Therefore, finding the next best split can be computationally prohibitive, and this search is therefore often combined with other methods such as K-means clustering.
centers are selected for these K clusters. The partitioning is then obtained by allocating all instances to these cluster centers w.r.t. some distance or similarity measure. We will present K-means and K-medoids clustering; these methods mainly differ in how the cluster centers are selected.
A divisive clustering algorithm is the DIANA algorithm which we briefly discuss in this
section. Select a dissimilarity function L : X × X → R+ on the covariate space X ⊂ Rq .
We are going to recursively partition the learning sample L = (X i )ni=1 ⊂ X .
A remark on the notation. The classifier (9.26) gives a partition (Xk )k∈K of the
covariate space X ⊂ Rq , see (9.27). Naturally, we can only construct an exact partition
on the (finite) learning sample L = (X i )ni=1 ⊂ X . Thus, in fact, we will (only) construct
the finite clusters
Xk ∩ L = {X ∈ L; X ∈ Xk } . (9.28)
To keep the notation simple in this outline, we will not distinguish the notation of the
clusters Xk , given in (9.27), and their finite sample counterparts Xk ∩ L, given in (9.28).
Moreover, we will also use the same notation X_k for the set of indices of the instances in the clusters, i.e.,
{1 ≤ i ≤ n; X_i ∈ X_k} . (9.29)
Thus, X_k has the three different meanings (9.27), (9.28) and (9.29), but from the context it will always be clear which version we use.
(i) For the given dissimilarity function L, compute the diameters of all clusters, k ∈ K,
δ_k = max_{i,i' ∈ X_k} L(X_i, X_{i'}) . (9.30)
This diameter δk ≥ 0 gives the maximal dissimilarity between the instances within
the same cluster Xk . If maxk∈K δk = 0, we stop the algorithm because all clusters
only contain completely similar instances. Otherwise, select the cluster with the
biggest diameter
k* = arg max_{k∈K} δ_k ,
with a deterministic rule if there is more than one maximizer. This provides us with the cluster X_{k*} that we would like to split in the recursive step K → K + 1. Naturally, |X_{k*}| > 1, because δ_k = 0 for all clusters that only contain one instance.
with a deterministic rule if there is more than one such instance. This defines the
initialization of the inner loop by setting up a new cluster XK+1 = {X i∗ } and
reducing the existing cluster by setting Xk′ ∗ = Xk∗ \ XK+1 .
If |X'_{k*}| > 1, we may migrate more instances from X'_{k*} to X_{K+1} by recursively computing for all instances i ∈ X'_{k*} the differences of the average dissimilarities on these two clusters X'_{k*} and X_{K+1}, that is,
∆(i) = 1/(|X'_{k*}| − 1) ∑_{i' ∈ X'_{k*}\{i}} L(X_i, X_{i'}) − 1/|X_{K+1}| ∑_{i' ∈ X_{K+1}} L(X_i, X_{i'}) .
If max_{i ∈ X'_{k*}} ∆(i) ≤ 0, we stop the inner migration loop because migrating does not decrease the average dissimilarity. Otherwise, we select
i* = arg max_{i ∈ X'_{k*}} ∆(i) ,
with a deterministic rule if there is more than one maximizer. We migrate this
instance i* to the new cluster by setting X_{K+1} ← X_{K+1} ∪ {X_{i*}} and we reduce the existing cluster to X'_{k*} = X_{k*} \ X_{K+1}. This inner loop is iterated until the stopping rule is met or until |X'_{k*}| = 1. We then update the number of clusters K ← K + 1, the reduced cluster X_{k*} ← X'_{k*} = X_{k*} \ X_{K+1} and we add the new cluster X_{K+1}, giving us the new partition (X_k)_{k=1}^{K+1} of X.
These two steps (i)-(ii) are iterated until a stopping rule is met, or until there is no dissimilarity left in the existing clusters, i.e., max_{k∈K} δ_k = 0. If we do not install a stopping rule, this algorithm naturally terminates once all instances within each cluster are fully similar, and we end up with at most K ≤ n = |L| clusters.
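The splitting step (i)-(ii) can be sketched as follows for a precomputed dissimilarity matrix. This is a minimal Python illustration of the recursion described above (the new cluster is initialized with the instance of maximal average dissimilarity, which is the standard DIANA choice; the corresponding formula is not displayed in this outline), not the reference implementation of Kaufman–Rousseeuw [115].

# Minimal sketch of one DIANA splitting step (i)-(ii) on a precomputed
# dissimilarity matrix D (D[i, j] = L(X_i, X_j)); clusters are lists of indices.
import numpy as np

def diana_split(D, clusters):
    # (i) pick the cluster with the largest diameter (9.30)
    diameters = [D[np.ix_(c, c)].max() if len(c) > 1 else 0.0 for c in clusters]
    if max(diameters) == 0.0:
        return clusters                      # all clusters are fully similar: stop
    k_star = int(np.argmax(diameters))       # deterministic tie-breaking via argmax
    old = list(clusters[k_star])

    # (ii) initialize the new cluster with the instance of maximal average dissimilarity
    avg = [D[i, [j for j in old if j != i]].mean() for i in old]
    new = [old.pop(int(np.argmax(avg)))]

    # inner migration loop: move instances while Delta(i) > 0
    while len(old) > 1:
        deltas = []
        for i in old:
            d_old = D[i, [j for j in old if j != i]].mean()
            d_new = D[i, new].mean()
            deltas.append(d_old - d_new)
        if max(deltas) <= 0.0:
            break                            # migrating no longer decreases dissimilarity
        new.append(old.pop(int(np.argmax(deltas))))

    return [c for k, c in enumerate(clusters) if k != k_star] + [old, new]

# usage: start from one cluster containing all instances and split recursively
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
clusters = [list(range(len(X)))]
for _ in range(3):                           # grow from K = 1 to K = 4 clusters
    clusters = diana_split(D, clusters)
print(clusters)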
To define the diameter in (9.30), we consider the biggest dissimilarity between two in-
stances in the same cluster. Of course, we could use many other definitions, e.g., we
could consider the average dissimilarity as an alternative
δ_k = 1/|X_k|² ∑_{i,i' ∈ X_k} L(X_i, X_{i'}) .
We briefly describe the most basic agglomerative clustering algorithm, also known as AGNES; see Kaufman–Rousseeuw [115, Chapter 5]. Agglomerative means that we let the clusters grow starting from the individual instances.
Recursive fusion iteration. We recursively fuse the clusters that have the smallest mutual dissimilarity. Therefore, we define the average dissimilarity between two clusters X_k and X_l, for k, l ∈ K, as follows
δ_{k,l} = 1/(|X_k| |X_l|) ∑_{i ∈ X_k} ∑_{i' ∈ X_l} L(X_i, X_{i'}) , (9.31)
this is called the unweighted pair-group method with arithmetic mean (UPGMA). The UPGMA allows us to select the two most similar clusters
(k*, l*) = arg min_{k,l ∈ K, k ≠ l} δ_{k,l} ,
with a deterministic rule if there is more than one minimizer. We merge these two clusters X_{k*} ← X_{k*} ∪ X_{l*}, reduce the index set K ← K \ {l*}, and relabel the indices such that we obtain the new decreased index set K = {1, . . . , K}.
The UPGMA (9.31) is sometimes replaced by other dissimilarity measures. For example, the complete linkage considers
δ_{k,l} = max_{i ∈ X_k, i' ∈ X_l} L(X_i, X_{i'}) , (9.32)
and the single linkage considers
δ_{k,l} = min_{i ∈ X_k, i' ∈ X_l} L(X_i, X_{i'}) . (9.33)
The complete linkage considers the two most distinct instances, and the single linkage the two most similar instances in the two clusters.
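In practice, agglomerative clustering is readily available in standard software. As an illustration (an assumption about tooling, not a prescription), SciPy's hierarchical clustering implements the UPGMA (9.31), the complete linkage (9.32) and the single linkage (9.33) via the methods 'average', 'complete' and 'single', respectively.

# Agglomerative clustering sketch with SciPy (illustrative only): the linkage
# methods 'average', 'complete' and 'single' correspond to (9.31), (9.32) and (9.33).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2)) for m in ([0, 0], [2, 2], [0, 3])])

D = pdist(X, metric="euclidean")                  # pairwise dissimilarities L(X_i, X_i')
Z = linkage(D, method="average")                  # AGNES-type fusion with UPGMA
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into K = 3 clusters
print(np.bincount(labels))

# dendrogram(Z)  # the dendrogram can be drawn with matplotlib, compare the trees in Figure 6.1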
where (ck )k∈K are the corresponding cluster centers (also called cores); these cluster
centers can be part of the minimization in (9.34); this is not indicated in the notation,
but we briefly explain this next. K-means clustering and K-medoids clustering mainly
(but not only) differ in how these cluster centers (ck )k∈K are selected. For K-means
clustering we select the empirical cluster means (thus, the selection of the cluster centers
is not part of the minimization in (9.34)) by setting for k ∈ K
c_k = 1/|X_k| ∑_{i ∈ X_k} X_i ∈ R^q . (9.35)
K-means clustering
For K-means clustering we select the squared Euclidean distance as dissimilarity function
L(X_i, X_{i'}) = ∥X_i − X_{i'}∥_2^2 .
The consequence of this choice is that the empirical cluster means (9.35) minimize the
within-cluster dissimilarities on the clusters Xk . That is, we have for all k ∈ K
c_k = arg min_{c ∈ R^q} ∑_{i ∈ X_k} ∥c − X_i∥_2^2 = 1/|X_k| ∑_{i ∈ X_k} X_i . (9.37)
Precisely this property is the reason for dropping the cluster center optimization in (9.34)
for K-means clustering, and this motivates the name K-means clustering for this method.
Thus, for the cluster centers given by (9.37), we aim at solving
(X_k^*)_{k∈K} = arg min_{(X_k)_{k∈K}} ∑_{k=1}^K ∑_{i ∈ X_k} ∥c_k − X_i∥_2^2 .
Recursive K-means iteration. We repeat for t ≥ 1 until no more changes are ob-
served:
(1a) Given the present empirical cluster means (c_k^{(t−1)})_{k∈K}, we update the partition (X_k^{(t)})_{k∈K} by computing for each instance 1 ≤ i ≤ n the optimal allocation
k*_t(i) = arg min_{k∈K} ∥c_k^{(t−1)} − X_i∥_2^2 ,
with a deterministic rule if there is more than one minimizer. This gives us the new clusters at algorithmic time t
X_k^{(t)} = {i ∈ {1, . . . , n}; k*_t(i) = k} .
(1b) Given the new clusters (X_k^{(t)})_{k∈K}, we update the empirical cluster means (c_k^{(t)})_{k∈K} according to (9.37); these two steps (1a) and (1b) are illustrated in Figure 9.7.
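A minimal Python sketch of this two-step iteration reads as follows; for practical work one would rather rely on an optimized implementation such as sklearn.cluster.KMeans, which adds careful initialization and multiple restarts.

# Minimal numpy sketch of the K-means iteration (1a)-(1b).
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # initial cluster means
    for _ in range(n_iter):
        # (1a) allocate every instance to the closest cluster mean
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        alloc = dist2.argmin(axis=1)
        # (1b) update the empirical cluster means according to (9.37)
        new_centers = np.array([
            X[alloc == k].mean(axis=0) if np.any(alloc == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break                                             # no more changes: stop
        centers = new_centers
    return alloc, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.4, size=(100, 2))
               for m in ([0, 0], [3, 0], [0, 3], [3, 3])])
alloc, centers = kmeans(X, K=4)
print(np.round(centers, 2))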
The crucial point in the above algorithm is that neither step (1a) nor step (1b) increases the total within-cluster dissimilarity D^{(t)} as t increases. Having the lower bound zero, and the fact that a finite sample can only be allocated in finitely many different ways to K clusters, implies that we have convergence of the algorithm.
[Figure 9.8: Total within-cluster dissimilarity as a function of the hyperparameter K (elbow plot).]
The drawback is that (X_k^{(t*)})_{k∈K} typically is a local minimum of the total within-cluster dissimilarity, and different initial configurations may converge to different (local) minima.
An open question is the selection of the number of clusters K. This could be determined recursively as follows. Construct an optimal partition (X_k^{(t*)})_{k=1}^K for a given K. For the increased number of clusters K + 1, initialize the (K + 1)-means algorithm by the optimal clusters for parameter K, and randomly partition one of these clusters into two clusters. This implies that the total within-cluster dissimilarity decreases when going from (X_k^{(t*)})_{k=1}^K to (X_k^{(0)})_{k=1}^{K+1}. Then, run the algorithm in this increased setting with this initialization, and monotonicity implies that this new solution for K + 1 clusters has a smaller total within-cluster dissimilarity. This results in a graph as in Figure 9.8 that is decreasing in K. An elbow criterion selects K* where this graph has a kink; in Figure 9.8 this might be at K* = 4.
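The elbow criterion can be illustrated with a few lines of Python (using scikit-learn as an assumed tool): for the squared Euclidean distance, the attribute inertia_ is exactly the total within-cluster dissimilarity plotted in Figure 9.8.

# Elbow criterion sketch: run K-means over a grid of K and record the total
# within-cluster dissimilarity (the 'inertia_'), which is decreasing in K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, scale=0.4, size=(100, 2))
               for m in ([0, 0], [3, 0], [0, 3], [3, 3])])

for K in range(1, 11):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    print(K, round(km.inertia_, 1))   # look for the kink ("elbow") in this sequence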
K-medoids clustering
The K-means clustering algorithm requires that we consider the squared Euclidean distance as the dissimilarity function L. Of course, this is not always suitable, e.g., if we have a network where we need to travel from X_i ∈ R² to X_{i'} ∈ R², the traveling costs are related to the Euclidean distance rather than to the squared Euclidean distance. However, K-means clustering cannot deal with this Euclidean distance problem. The K-medoids algorithm is more flexible and it can deal with any dissimilarity function L. This comes at the price of higher computational costs, because we cannot simply select the empirical cluster means as the cores (c_k)_{k∈K}. Instead, we need to compute the medoids (c_k)_{k∈K} to obtain a monotonically decreasing algorithm. The (optimal) medoids are given by
(c_k)_{k∈K} = arg min_{(c_k)_{k∈K} ⊂ L} ∑_{k=1}^K ∑_{i ∈ X_k} L(c_k, X_i) , (9.39)
i.e., the medoids c_k ∈ L = (X_i)_{i=1}^n belong to the observed sample; again, we install a deterministic rule if the argument of the minimum is not unique.
Similarly to K-means clustering, the global minimum can generally not be found; therefore, we try to find a local minimum. Usually, the partitioning around medoids (PAM) algorithm of Kaufman–Rousseeuw [115] is exploited to solve the K-medoids clustering problem.
D = ∑_{k=1}^K ∑_{i ∈ X_k} L(c_k, X_i) ≥ 0 . (9.40)
(1a) Select a present medoid c_k and a non-medoid X_i ∈ L \ (c_k)_{k∈K}, swap the roles of c_k and X_i, and allocate each instance in L to the closest medoid in this new configuration;
(1b) compute the new total within-cluster dissimilarity (9.40) under this new configuration;
(1c) if the total within-cluster dissimilarity increases under this new configuration, reject the swap, otherwise keep it.
Note that there are many variants of how the swap in step (1a) can be selected (in a systematic way). Kaufman–Rousseeuw [115] provide one version, which is also described in Algorithm 2 of Schubert–Rousseeuw [201], but there are many other possibilities that may provide computational improvements.
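A naive Python sketch of the swap steps (1a)-(1c) for a generic dissimilarity matrix is given below; it scans all possible swaps, which is computationally wasteful (see Schubert–Rousseeuw [201] for efficient variants), and it only keeps a swap if it strictly improves (9.40).

# Naive K-medoids sketch in the spirit of the PAM swap steps (1a)-(1c), for a
# generic dissimilarity matrix D; illustrative only.
import numpy as np

def total_cost(D, medoids):
    # allocate every instance to its closest medoid and sum the dissimilarities (9.40)
    return D[:, medoids].min(axis=1).sum()

def k_medoids(D, K, seed=0, n_sweeps=10):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=K, replace=False))
    best = total_cost(D, medoids)
    for _ in range(n_sweeps):
        improved = False
        for k in range(K):
            for i in range(n):
                if i in medoids:
                    continue
                candidate = medoids.copy()
                candidate[k] = i                  # (1a) swap medoid k with instance i
                cost = total_cost(D, candidate)   # (1b) new total dissimilarity
                if cost < best:                   # (1c) keep the swap only if it improves
                    medoids, best = candidate, cost
                    improved = True
        if not improved:
            break
    return medoids, D[:, medoids].argmin(axis=1), best

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=m, scale=0.4, size=(60, 2)) for m in ([0, 0], [3, 0], [0, 3])])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # Euclidean (not squared) distance
medoids, alloc, cost = k_medoids(D, K=3)
print(medoids, round(cost, 2))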
f(x) = ∑_{k=1}^K p_k (2π)^{−q/2} |Σ_k|^{−1/2} exp( −(1/2) (x − c_k)^⊤ Σ_k^{−1} (x − c_k) ) , (9.41)
with mean vectors (c_k)_{k∈K} ⊂ R^q, positive definite covariance matrices Σ_k ∈ R^{q×q}, 1 ≤ k ≤ K, and mixture probabilities (p_k)_{k∈K} ⊂ (0, 1) aggregating to one, ∑_{k=1}^K p_k = 1. This density gives a multivariate GMM with model parameter
ϑ = (c_k, Σ_k, p_k)_{k∈K} .
If we want to estimate this parameter with MLE, we need to consider the log-likelihood function for the given learning sample L = (X_i)_{i=1}^n
ϑ ↦ ℓ_L(ϑ) = ∑_{i=1}^n log( ∑_{k=1}^K p_k (2π)^{−q/2} |Σ_k|^{−1/2} exp( −(1/2) (X_i − c_k)^⊤ Σ_k^{−1} (X_i − c_k) ) ) . (9.42)
It is well-known that this MLE problem cannot be solved directly. In fact, it is not even clear whether an MLE for ϑ exists. There are many examples where this log-likelihood function is unbounded, hence, there is no MLE for ϑ in such cases.
For these reasons, one is less ambitious, and one just tries to find an estimator for ϑ that explains the learning sample reasonably well (in jargon, is not a spurious solution).
State-of-the-art approaches use variants of the expectation-maximization (EM) algorithm to find such solutions. We will not describe the EM algorithm here, but we refer to Dempster et al. [52], Wu [237] and McLachlan–Krishnan [152]. There are many different implementations of the EM algorithm, and for GMM clustering there are many variants relating to different choices of the covariance matrices Σ_k. E.g., if we decouple this covariance matrix according to Σ_k = λ_k D_k A_k D_k^⊤, with a scalar λ_k > 0, an orthogonal matrix D_k containing the eigenvectors, and a diagonal matrix A_k that is proportional to the eigenvalues of Σ_k, then one can fix some of these choices and exclude them from the MLE optimization. One choice is identical orientations D_k = Id (identity matrices), equal volumes λ_k = λ > 0, and A_k can then provide different ellipsoids. Finally, the GMM clustering is obtained by allocating X_i to the estimated GMM component that provides the biggest log-likelihood.
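For illustration, GMM clustering can be performed with scikit-learn's GaussianMixture (an assumed tool, not necessarily what accompanies these notes); its covariance_type argument restricts the shape of the Σ_k in the spirit of the decomposition Σ_k = λ_k D_k A_k D_k^⊤ discussed above.

# GMM clustering sketch: EM is run internally, 'covariance_type' restricts Sigma_k.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.multivariate_normal(m, C, size=150)
               for m, C in [([0, 0], [[0.3, 0.1], [0.1, 0.2]]),
                            ([3, 1], [[0.2, 0.0], [0.0, 0.5]])]])

# 'full': unrestricted Sigma_k; 'diag': axis-aligned ellipsoids (D_k = Id);
# 'spherical': Sigma_k = lambda_k * Id; 'tied': one common Sigma for all components.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)

# allocate each X_i to the component with the biggest (log-)likelihood contribution
alloc = gmm.predict(X)
print(np.bincount(alloc), np.round(gmm.means_, 2))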
C = {i ∈ {1, . . . , n}; mi ≥ M } .
The DBSCAN method of Ester et al. [64] is obtained by constructing a graph of vertices
and edges from these core instances C:
(1) The vertices are given by all core instances X i ∈ C, and we add an edge between
two core instances X i , X i′ ∈ C if they are in the ε-neighborhood of each other, i.e.,
if L(X i , X i′ ) ≤ ε. This gives a graph with vertices and edges, and we define the
clusters to be the connected components of this graph.
(2) There are still the instances X l ∈ L \ C that are not core instances. If such a
non-core instance X l is in the ε-neighborhood of at least one core instance X i ,
i.e., if L(X i , X l ) ≤ ε for at least one core instance X i ∈ C, then we assign it (at
random) to one of these close core instances by adding an edge from X l to X i .
This increases the corresponding connected component of that core instance X_i, but because the graph ends at X_l (there is no further edge from X_l), this non-core instance is an isolated (satellite) point that is only connected to X_i.
Finally, there are the so-called outliers X l with L(X i , X l ) > ε for all core instances
X i ∈ C. These are treated as noise, and they are not assigned to any cluster.
Advantages of DBSCAN are that the number of clusters is flexible, and the resulting clusters can have any shape. Such a structure may be useful if one tries to describe how things spread (in a graph-like manner by nearest neighbor infections), but also for disaster modeling, e.g., one may use such a graph to model the spread of a fire.
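A minimal DBSCAN illustration in Python reads as follows (using scikit-learn as an assumed tool): eps corresponds to the radius ε, min_samples plays the role of the threshold M for core instances, and the label −1 marks the outliers that are not assigned to any cluster.

# DBSCAN sketch: a ring-shaped cluster, a blob and one isolated outlier.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(5)
phi = rng.uniform(0.0, 2.0 * np.pi, size=300)
ring = np.column_stack((2.0 * np.cos(phi), 2.0 * np.sin(phi))) + 0.05 * rng.normal(size=(300, 2))
blob = rng.normal(loc=0.0, scale=0.2, size=(100, 2))
X = np.vstack([ring, blob, [[5.0, 5.0]]])        # ring + blob + one isolated outlier

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_                              # cluster indices, -1 for noise/outliers
print(np.unique(labels, return_counts=True))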
In other words, we try to find a sample (Z_i)_{i=1}^n in the two-dimensional Euclidean space such that the original adjacency matrix
(L(X_i, X_{i'}))_{1≤i,i'≤n} ∈ R_+^{n×n} , (9.43)
is preserved as well as possible. This idea is similar to most visualization methods.
The t-SNE method of van der Maaten–Hinton [223] considers an embedding that slightly modifies the above idea. Namely, we are going to map the dissimilarities L(X_i, X_{i'}) to a categorical distribution q = (q_j)_{j=1}^J. Correspondingly, we are going to find instances (Z_i)_{i=1}^n whose (Student-t weighted) dissimilarities are mapped to a second categorical distribution p = (p_j)_{j=1}^J. Instead of directly matching (9.43), we try to make the KL divergence from p to q small,
D_KL(q||p) = ∑_{j=1}^J q_j log( q_j / p_j ) . (9.44)
Original sample. For two instances X_i, X_{i'} ∈ L one defines the conditional probability weight
q_{i'|i} = exp( −∥X_{i'} − X_i∥_2^2 / (2σ_i^2) ) / ∑_{k≠i} exp( −∥X_k − X_i∥_2^2 / (2σ_i^2) ) ∈ (0, 1), for i ≠ i'. (9.45)
The choice of the bandwidth σ_i > 0 is discussed below. The explanation of (9.45) is that q_{i'|i} gives the probability of selecting X_{i'} as the neighbor of X_i from all instances, under a Gaussian kernel similarity measure, i.e., X_i is the center (core) of these conditional probabilities.
Since q_{i'|i} is non-symmetric, a symmetrized version is defined by
q_{i,i'} = ( q_{i'|i} + q_{i|i'} ) / (2n) ∈ (0, 1), for i ≠ i'. (9.46)
Note, we exclude the diagonal from these definitions. Observe that ∑_{i'≠i} q_{i'|i} = 1 for all 1 ≤ i ≤ n. This implies that ∑_{i=1}^n ∑_{i'≠i} q_{i,i'} = 1 and, hence, q = (q_{i,i'})_{i≠i'} is a categorical (probability) distribution over the index pairs i ≠ i'.
Visualization sample. Select a fixed dimension p < q. The goal is to find a visualization sample (Z_i)_{i=1}^n ⊂ R^p such that its Student-t probabilities p = (p_{i,i'})_{i≠i'} (with one degree of freedom, which is the Cauchy distribution), defined by
p_{i,i'} = ( 1 + ∥Z_i − Z_{i'}∥_2^2 )^{−1} / ∑_{k≠l} ( 1 + ∥Z_k − Z_l∥_2^2 )^{−1} ∈ (0, 1), for i ≠ i', (9.47)
are close to the categorical distribution q = (q_{i,i'})_{i≠i'} of the original sample (X_i)_{i=1}^n.
This optimization problem is usually solved with the gradient descent algorithm.
Remarks 9.6. • There is some discrepancy in the definitions of q and p. For the high-dimensional case, we define q via the conditional probabilities (9.46). This approach has turned out to be more robust towards outliers. In the low-dimensional case we can directly define p by (9.47).
• The Student-t distribution is heavy-tailed (regularly varying), and for one degree of freedom (Cauchy case) we have a quadratic asymptotic decay p_{i,i'} ≈ ∥Z_i − Z_{i'}∥_2^{−2} for ∥Z_i − Z_{i'}∥_2 → ∞.
• There remains the choice of the bandwidth σ_i > 0. Typically, a smaller value for σ_i > 0 gives a denser clustering. For q_{•|i} = (q_{i'|i})_{i'≠i}, one can define the perplexity
Perp(q_{•|i}) = 2^{H(q_{•|i})} = 2^{ −∑_{i'≠i} q_{i'|i} log_2(q_{i'|i}) } ,
with H(q_{•|i}) being the Shannon entropy (in bits). Following van der Maaten–Hinton [223], a good choice of the bandwidths σ_i is obtained by requiring a constant perplexity Perp(q_{•|i}) in i.
■
[Figure 9.9: Two-dimensional t-SNE visualizations (first two components) for different perplexity values.]
Figure 9.9 gives an example where we map a learning sample (X_i)_{i=1}^n ⊂ R^5 that is five-dimensional to a two-dimensional illustration (Z_i)_{i=1}^n ⊂ R². We use the R package tsne [58] which has a hyperparameter called perplexity. Figure 9.9 shows the results for different values of this perplexity parameter. The colors are identical for the instances across all plots, and the specific meaning of the colors is related to a sports car evaluation, with red color indicating a sports car; for details see Rentzmann–Wüthrich [183].
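The example in the text uses the R package tsne [58]; for readers working in Python, an analogous sketch with scikit-learn's TSNE (an assumption about tooling, producing a comparable but not identical embedding) looks as follows.

# t-SNE sketch: vary the perplexity hyperparameter and compute the 2-dimensional embedding.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 5))                     # stand-in for a five-dimensional sample
for perplexity in (5, 30, 50):
    Z = TSNE(n_components=2, perplexity=perplexity,
             init="pca", random_state=0).fit_transform(X)
    print(perplexity, Z.shape)                    # (Z_i)_{i=1}^n in R^2 for each perplexity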
Generative modeling
10.1 Introduction
Most of the applications considered in these notes follow the classical paradigm of super-
vised learning, introduced in Chapters 1 and 2, which is to infer the best approximation
to an unknown regression function from a finite sample of data. Once this regression
function has been learned, this function can be used to compute point estimates of in-
terest, including, e.g., best estimates and estimates of quantiles of distributions. Within
actuarial science, this paradigm of learning a regression function and a best estimate,
respectively, encompasses a large proportion of tasks focused on prediction. Nonetheless,
some tasks performed quite often by actuaries stand outside this paradigm, where, in-
stead of producing point estimates, it is desired or necessary to produce a full predictive
distribution. For example, in non-life reserving, besides a point estimate of reserves, it
is usually important to produce a distribution of the outstanding loss liabilities, both to
communicate the uncertainty of the point estimate (reserves) and to quantify the capi-
tal and margins needed to back the reserves. We briefly also mention capital modeling
applications, whether parameterizing (simple) statistical distributions to model a full
distribution of prospective insurance risks, or using dimensionality reduction tools, such
as those introduced in the previous chapter, to reduce complex observations of market
variables or mortality rates down to tractable quantities to enable modeling; a classical
example here is interest rate modeling. These tasks can be characterized as generative modeling, whose goal is to learn the underlying probability distribution X ∼ P of the instances themselves, or a joint distribution of responses and covariates (Y, X) ∼ P; we refer to Section 1.2.3.¹
Once we have learned a good approximation to this distribution of the data, we can:
(1) Generate new samples: Draw new instances X ∼ P that resemble the data the
model was trained on. This is useful for data augmentation, simulating scenarios,
and creating synthetic data.
(2) Estimate probabilities: Evaluate the likelihood p(X) of a given data point X, which
is useful, e.g., for anomaly detection.
¹ For simplicity, we drop the volume/weight in this chapter by assuming v ≡ 1.
(4) Perform conditional data generation: Generate new samples from a conditional
distribution X|Y ∼ P(·|Y ), allowing for targeted data synthesis.
What distinguishes generative modeling from the unsupervised learning applications discussed in Chapter 9 - some of which also focus on producing latent factors Z - is exactly this goal of producing a learned probability distribution X ∼ P of the data X.
Notation. We remark that in this field of literature, one typically uses the notation
p(X) for the density and the likelihood of X ∼ P, we adopt this convention (which is in
slight conflict with our previous notation).
Major recent advances in generative modeling have been driven by the use of deep neural
networks to learn probability distributions over complex, high-dimensional data; these
so-called deep generative models (DGMs) have been extraordinarily successful in appli-
cations in natural language processing (NLP) and image generation. We parameterize
this distribution with parameter ϑ, writing pϑ (X), where ϑ ∈ Rr represents the network
parameter that needs to be learned from observed data. Various approaches have been
proposed in the literature for the task of deep generative modeling; we refer to Tomczak
[220] for an overview.
Here, we mainly discuss two of these approaches: the latent factor and implicit probability
distribution approaches.
(1) The latent factor approach is to assume a parametric model
p_ϑ(X) = ∫ p_ϑ(X | z) π(z) dz ,
where z denotes latent variables capturing unobserved factors that create variability in the data samples, and π(z) is a prior distribution over these latent variables; this prior distribution can also be set as a very weak or flat prior in some cases.
The deep neural network components can parameterize both
• the conditional likelihood p_ϑ(X | z) of the data given the latent variables, and
• the latent distribution π(z) or, more commonly, the conditional distribution Z|X ∼ q_ϑ(z | X) in variational inference frameworks, where ϑ contains both the parameters of the conditional likelihood p_ϑ and those of the variational inference posterior q_ϑ.
Variational auto-encoders (VAEs) are a typical example of this latent factor approach,
where the encoder network approximates the posterior qϑ (z | X), and the decoder net-
work approximates the likelihood pϑ (X | z); in what follows, we will extend the auto-
encoders introduced in Section 9.2.2 into this variational approach. We also briefly men-
tion the Generative Adversarial Networks (GANs) of Goodfellow et al. [84], which work
by sampling directly from an assumed latent factor distribution Z ∼ π(z), and the de-
noising diffusion models due to Ho et al. [99], which are another way of acting on samples
from a latent factor distribution π.
(2) The implicit probability distribution is a different approach that does not rely on
explicit latent factors Z. Instead, a neural network can directly learn a conditional
probability distribution over X = (X1 , . . . , XT ) = X1:T . Concretely, for sequential data
X1:T (such as text or time-series data), this probability distribution is represented by
factorizing the joint distribution into a product of conditional terms, i.e., as an auto-
regressive factorization that takes the form
p_ϑ(X_{1:T}) = ∏_{t=1}^T p_ϑ( X_t | X_{1:t−1} ) ,
where each term p_ϑ( X_t | X_{1:t−1} ) is modeled by a neural network that conditions on the preceding elements of the sequence X_{1:t−1}.
In many cases - especially when each X_t is categorical (e.g., a word in a vocabulary W) - this network outputs a probability distribution via a softmax function, which is the natural parameterization for multinomial outcomes, see (8.10). Specifically, if the possible values for X_t belong to some discrete set W ⊂ R, then for each w ∈ W, the network produces outputs
p_ϑ( X_t = w | X_{1:t−1} ) = exp(logits(w)) / ∑_{u∈W} exp(logits(u)) ,
where logits(·) refers to the unnormalized log-probabilities before the softmax transforms them back to the probability scale. This implicit approach obviates the need for latent variables by directly
specifying how the next outcome depends on the past. In language modeling, for example,
the model learns pϑ (Xt | X1:t−1 ) over a vocabulary of possible tokens (words or subwords);
we have already introduced these concepts in Section 8.2. In time-series forecasting, the
same principle applies, although the data may be continuous or mixed-type, in which
case alternative output layers can be used. In all such scenarios, the core idea is to define
the probabilities over W of the next outcome - completely determined by the network -
rather than decomposing the distribution through auxiliary latent factors.
In modern NLP, transformer-based models have emerged as a powerful way to implement
these auto-regressive approaches. In Section 8.5, we have introduced encoder transform-
ers, which process a known sequence of tokens to produce an output. For NLP and other
generative modeling purposes, decoder transformers are used, where the next token in an observed sequence is predicted; this prediction is then appended to the sequence and the following token is predicted, and so on. We will discuss decoder transformers and large
language models in more detail below.
(1) Encoder (inference/recognition model): The encoder network takes an input data
point X and outputs the parameters of a latent distribution, typically an isotropic
multivariate Gaussian distribution
q_ϑ(z | X) = N( µ_ϑ(X), Σ_ϑ(X) ) . (10.1)
Here, µϑ (X) and Σϑ (X) are given by the encoder’s outputs, with Σϑ (X) often
constrained to be diagonal for simplicity. The encoder is thus learning an approx-
imate posterior distribution over latent variables Z, conditioned on X. We will
explain how this assumption is enforced using a regularization term in the next
section.
(2) Decoder (generative model): The decoder network takes a latent sample Z and
outputs parameters of a distribution over the data space
p_ϑ(x | Z) = N( m_ϑ(Z), S_ϑ(Z) ) ,
In summary, the decoder is the generative component: once trained, it can take random
samples Z ∼ π(z) from the latent space and produce synthetic data points by generating
new data X ′ ∼ pϑ (x | Z) that resembles the original dataset.
where π(z) is a prior over latent variables, typically chosen as a standard Gaussian
N (0, I). Directly optimizing log pϑ (x) is generally intractable, but we can employ varia-
tional inference to maximize a lower bound, the evidence lower bound (ELBO), for a full
derivation and explanation, see Odaibo [169],
log p_ϑ(x) = log ∫ p_ϑ(x | z) π(z) dz
           = log ∫ q_ϑ(z | x) [ p_ϑ(x | z) π(z) / q_ϑ(z | x) ] dz
           = log E_{q_ϑ(z|x)}[ p_ϑ(x | Z) π(Z) / q_ϑ(Z | x) ]
           ≥ E_{q_ϑ(z|x)}[ log( p_ϑ(x | Z) π(Z) / q_ϑ(Z | x) ) ]
           = E_{q_ϑ(z|x)}[ log p_ϑ(x | Z) ] − D_KL( q_ϑ(· | x) ∥ π ) =: E(ϑ; x),
where Eqϑ (z|x) [·] is the expectation operator of Z ∼ qϑ (z | x) and DKL (qϑ (· | x)∥π) is
the KL divergence from π to qϑ (· | x); for the finite discrete case, see (9.44).
• Reconstruction term: Eqϑ (z|x) [log pϑ (x | Z)], which encourages the decoder to re-
construct the original x from the latent code Z.
• Regularization term: −D_KL(q_ϑ(· | x) ∥ π), which aligns the encoder's approximate posterior with the prior π(z). As we have mentioned already, typically π(z) = N(0, I).
Combining these two terms yields a balance between, on the one hand, faithful reconstructions and, on the other, a latent space constrained to follow the prior assumptions.
For a learning sample L = (X i )ni=1 , training maximizes the average ELBO over the
learning sample
arg max_ϑ (1/n) ∑_{i=1}^n E( ϑ; X_i ) . (10.2)
(1) Reconstruction quality: How accurately can we reconstruct the original data from our compressed representation? This corresponds to the first term in the ELBO, E_{q_ϑ(z|x)}[log p_ϑ(x|Z)]. Higher values mean better reconstruction.
(2) Regularization quality: How well-structured is the compressed representation, i.e., how close is the approximate posterior q_ϑ(· | x) to the prior π? This corresponds to the second term in the ELBO, −D_KL(q_ϑ(· | x) ∥ π).
The ELBO elegantly combines these objectives into a single value to be maximized.
When maximizing the ELBO, we are finding the best trade-off between accurate
reconstructions and well-structured compressions. This is why VAEs often learn
meaningful, disentangled representations - they are simultaneously optimizing for
fidelity (reconstruction) and simplicity (regularization).
and second, during the backward pass, taking the gradients w.r.t. the mean and variance
parameters. A single Monte Carlo sample often suffices to approximate the ELBO
E(ϑ; x) ≈ Ẽ(ϑ; x) := log p_ϑ( x | µ_ϑ(x) + Σ_ϑ^{1/2}(x) ε ) − D_KL( q_ϑ(· | x) ∥ π ) .
By performing this Monte Carlo sampling during the fitting procedure, VAEs learn both
the inference (encoder) and generative (decoder) networks by maximizing this Monte
Carlo ELBO instead of (10.2). Once trained, we can generate new data by sampling
Z ∼ π(z) and then drawing X ′ ∼ pϑ (x | Z).
(2) Indirect approach (with reparameterization). Instead, you could start with
standard-sized blanks (from a fixed, standard distribution) and then apply
a consistent transformation to each blank (scaling and shifting). Now, if
you need to adjust your output distribution, you only need to modify the
transformation parameters, not the random selection process.
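A minimal PyTorch sketch of the reparameterization trick and the Monte Carlo ELBO is given below; it assumes a Gaussian encoder with diagonal covariance and a Gaussian decoder with fixed unit variance (so the reconstruction term reduces to a squared error up to constants). It is an illustration only, not the implementation behind these notes.

# Minimal VAE sketch: Gaussian encoder q(z|x) with diagonal covariance, Gaussian
# decoder with unit variance, Monte Carlo ELBO with z = mu(x) + sigma(x) * eps.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, q_dim=5, latent_dim=2, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(q_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, 2 * latent_dim))   # -> (mu, log-variance)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, q_dim))            # -> mean of p(x|z)

    def elbo(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps                   # reparameterization trick
        recon = -0.5 * ((x - self.decoder(z)) ** 2).sum(dim=-1)  # log p(x|z), up to constants
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)  # KL(q(.|x) || N(0, I))
        return (recon - kl).mean()

vae = VAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
X = torch.randn(256, 5)                                           # stand-in learning sample
for _ in range(200):
    opt.zero_grad()
    loss = -vae.elbo(X)                                           # maximize the ELBO (10.2)
    loss.backward()
    opt.step()

# generation: sample Z ~ N(0, I) and decode to new instances X'
Z = torch.randn(10, 2)
X_new = vae.decoder(Z)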
10.2.5 Discussion
VAEs illustrate the latent factor approach described earlier: the hidden variables Z
capture underlying structure, and the encoder-decoder networks map between X and
Z. This ensures that VAEs can both reconstruct existing data and sample novel data
points, all while maintaining a tractable training objective (the ELBO). In practice, many
variants of VAEs exist, e.g., β-VAEs or conditional VAEs, each modifying the objective
or architecture to emphasize different aspects, such as disentangled latent representations
or conditional generation.
Overall, VAEs remain one of the most popular DGMs due to their relative conceptual
simplicity, stable training procedure, and ability to produce both probabilistic encodings
and realistic sample generations.
two popular methods - generative adversarial networks (GANs) and diffusion models - and then we briefly compare how these approaches relate to each other and to VAEs.
GANs, introduced by Goodfellow et al. [84], represent a different strategy for generative
modeling. Unlike VAEs, which explicitly approximate a conditional probability distri-
bution over the latent variables Z ∼ qϑ (z | X) via a latent variable formulation, GANs
implicitly learn to generate data directly from random latent factors Z sampled from
conventional distributions by pitting two neural networks against each other in an adver-
sarial game. In other words, the encoder part is missing from GANs. These two networks,
the generator and the discriminator, evolve through a competitive process that can yield
remarkably realistic samples in many domains - especially images - but unfortunately
suffers from training difficulties.
X ′ = G(Z; ϑ1 ).
The generator’s objective is to produce data that is indistinguishable from real data
by the discriminator.
The generator is designed to take a random noise vector and transform it into a generated
output by processing the noise through several neural network layers. This process allows
the generator to learn how to create realistic images from random inputs. On the other
hand, the discriminator receives a sample as input and then processes it through a series
of layers. It outputs a probability via a sigmoid activation function that indicates whether
the sample is real, i.e., sampled from the dataset used to train the GAN, or generated
by the generator network.
As in the earlier sections, we use ϑ generically to denote the model parameters, though
in practice one typically maintains separate sets of parameters for G and D, i.e., ϑ =
(ϑ1 , ϑ2 ). These networks are trained simultaneously in a mini-max game.
GANs frame training as a zero-sum game between the generator and the discriminator.
The value function is given by
min_G max_D V(D, G) = E_{X∼p_data(x)}[ log D(X) ] + E_{Z∼π(z)}[ log( 1 − D(G(Z)) ) ] .
Here:
• Z ∼ π(z) is the prior for the noise vector (often a Gaussian or uniform distribution).
The discriminator maximizes V (D, G) to try to distinguish real versus fake samples. The
generator minimizes V (D, G), trying to “fool” the discriminator D so that generated
samples are classified as real. In practice, optimization is performed via alternating
gradient-based updates.
The discriminator is trained using the binary cross-entropy loss function, which is suitable
for binary classification tasks (real vs. fake); in fact, the log-likelihood of the Bernoulli
distribution is given by Y log p + (1 − Y ) log(1 − p) which is the structure of the above
minimax game.
The discriminator's weights are frozen during the training of the generator: while the generator is being improved, only the generator's parameters are updated. Freezing the discriminator in this step ensures that its classification score serves purely as a training signal, so that the generator progressively learns how to create samples that can fool the current state of the discriminator.
Training GANs is known to be difficult and can suffer from issues such as mode collapse
(where the generator learns to produce only a limited variety of samples) or vanishing
gradients. Despite these challenges, with proper techniques (e.g., careful network design,
hyperparameter tuning, and objective variants like Wasserstein GAN [4]), GANs can
generate highly detailed and convincing samples.
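The alternating updates can be sketched in a few lines of PyTorch; this is an illustrative toy example (using the common non-saturating variant of the generator loss), not a recipe for stable large-scale GAN training.

# Minimal GAN training-step sketch: binary cross-entropy losses for D and G; when
# updating G, the weights of D are not updated and only provide the training signal.
import torch
import torch.nn as nn

latent_dim, data_dim = 4, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    n = x_real.shape[0]
    # (1) discriminator update: real samples labeled 1, generated samples labeled 0
    z = torch.randn(n, latent_dim)
    x_fake = G(z).detach()                       # detach: do not update G in this step
    loss_D = bce(D(x_real), torch.ones(n, 1)) + bce(D(x_fake), torch.zeros(n, 1))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    # (2) generator update: try to get generated samples classified as real
    #     (the common non-saturating variant of the minimax objective)
    z = torch.randn(n, latent_dim)
    loss_G = bce(D(G(z)), torch.ones(n, 1))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()

x_real = torch.randn(128, data_dim) * 0.5 + torch.tensor([2.0, -1.0])  # stand-in "real" data
for _ in range(100):
    losses = train_step(x_real)
print(losses)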
Diffusion models, also referred to as score-based generative models, see Song–Ermon [209],
have recently gained significant attention as a state-of-the-art approach for image and
audio generation, see Ho et al. [99]. Unlike VAEs and GANs, which rely on (possibly)
low-dimensional latent factors or adversarial training objectives, diffusion models employ
a forward noising process paired with a reverse denoising process. The forward process
systematically corrupts data into noise, and the reverse process - learned by a neural
network - seeks to recover clean data from noisy samples. Thus, at generation time,
one simply starts from random noise and iteratively applies the learned reverse process
to obtain a final synthetic sample. Similar to GANs, there is no need for an encoder model within the diffusion modeling paradigm and, moreover, we do not approximate conditional latent factors, but rather learn an implicit map directly from random noise samples to data.
A typical forward noising process (following [208, 99]) is defined as a Markov chain of
length T . Starting with a real data sample X 0 (e.g., an image), we produce a sequence
of increasingly noisy samples
X 1, X 2, . . . , X T ,
where each step X t is obtained by adding a small amount of Gaussian noise to X t−1 .
Concretely, one common choice is
q(X_t | X_{t−1}) = N( √(1 − β_t) X_{t−1}, β_t I ) ,
with a variance schedule {β_t}_{t=1}^T ⊂ (0, 1). After iterating this for several steps, the sample is nearly indistinguishable from pure Gaussian noise (provided that the variance schedule adds sufficient noise).
Training objective
• The neural network ϵϑ (often a U-Net, see Ronneberger et al. [195], which is a
type of encoder-decoder CNN framework useful for working with images or similar
architectures) is trained to predict ε from (X t , t). The training loss commonly used
is
L_simple(ϑ) = E_{X_0, ε, t}[ ∥ε − ϵ_ϑ(X_t, t)∥_2^2 ] .
By minimizing this loss, the model learns to “denoise” xt at each time step, effectively
approximating the score function (the gradient of the log-density w.r.t. the data) and
providing a route to reverse the forward chain.
Once trained, sampling proceeds by starting with X T ∼ N (0, I) and recursively applying
X t−1 ∼ pϑ (xt−1 | X t ), t = T, . . . , 1,
to obtain a final sample X 0 that resembles the data distribution. In practice, the neural
network predicts either the noise ε or the clean image X 0 from (X t , t), and one uses
these predictions to sample from the approximate reverse Gaussian.
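A compact PyTorch sketch of the forward noising and one ε-prediction training step is shown below; it uses the closed-form marginal X_t = √(ᾱ_t) X_0 + √(1 − ᾱ_t) ε with ᾱ_t = ∏_{s≤t}(1 − β_s), a standard identity for this Gaussian forward process that is not derived in the text above, and a toy feed-forward network in place of the U-Net ϵ_ϑ.

# Minimal diffusion sketch: linear variance schedule, forward noising via the
# closed-form marginal, and one Monte Carlo training step for L_simple.
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.2, T)                 # variance schedule {beta_t}
abar = torch.cumprod(1.0 - betas, dim=0)             # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

data_dim = 2
eps_net = nn.Sequential(nn.Linear(data_dim + 1, 64), nn.ReLU(),
                        nn.Linear(64, data_dim))     # stand-in for epsilon_theta(X_t, t)
opt = torch.optim.Adam(eps_net.parameters(), lr=1e-3)

def training_step(x0):
    n = x0.shape[0]
    t = torch.randint(0, T, (n,))                    # random time step per sample
    eps = torch.randn_like(x0)                       # the noise to be predicted
    xt = abar[t].sqrt().unsqueeze(1) * x0 + (1.0 - abar[t]).sqrt().unsqueeze(1) * eps
    t_feat = (t.float() / T).unsqueeze(1)            # simple time embedding
    pred = eps_net(torch.cat([xt, t_feat], dim=1))
    loss = ((eps - pred) ** 2).sum(dim=1).mean()     # Monte Carlo estimate of L_simple
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x0 = torch.randn(256, data_dim) * 0.3 + 1.0          # stand-in "clean" data sample
for _ in range(200):
    loss = training_step(x0)
print(round(loss, 3))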
Discussion
Diffusion models present a compelling alternative to VAEs and GANs for high-fidelity
generation, particularly for large-scale image and audio data. In contrast to the single-
step generation of GANs (where latent noise is transformed into a final image in one
forward pass), diffusion models gradually refine pure noise into structured data through a
learned denoising sequence. Although this multi-step sampling can be slower, the gradual
nature often leads to stable training dynamics and high-quality samples. Moreover, Song–
Ermon [209] and subsequent works show that diffusion and score-based approaches can be
unified through a differential equation perspective, with many interesting recent advances
about the theoretical foundations of these models. Empirically, modern diffusion models
have achieved state-of-the-art results on various generation tasks.
At each time step t, the decoder uses self-attention over the previously generated tokens
X1:t−1 , together with positional embeddings, to form a representation from which to
predict the next Xt . Notably, a causal mask is applied in the self-attention mechanism to
ensure the model can only attend to past tokens, preserving the auto-regressive property
and preventing future leakage; this is not the case in classical transformers as discussed in
Section 8.5, because the queries and keys can freely interact in the attention mechanism
(8.24).
Importantly, and in contrast to the encoding transformers presented earlier, the positional
embeddings in decoding transformers are not usually learned, but are a static (previsible)
function; we refer to Vaswani et al. [228] for the original approach using trigonometric
functions of the position of each token and to Su et al. [211] for the highly successful
rotary position embedding (ROPE) approach, which has been widely adopted by modern
large language models.
Recall from (8.24) that the scaled dot-product attention mechanism computes, for a query
Q, key K, and corresponding values V , the attention head
H = softmax( QK^⊤ / √q ) V ,
where q denotes the (embedding) dimension of the query vectors q_u, the key vectors k_u and usually also of the value vectors v_u, 1 ≤ u ≤ t − 1, i.e., we have Q, K, V ∈ R^{(t−1)×q}, see (8.24).
In a decoder block, to preserve time-causality, each token's query vector q_u is restricted to only attend to keys from the previous positions, (k_s)_{s=1}^u. This is implemented via a causal mask in the softmax step, which sets the attention weights to zero whenever the key position exceeds the query position. More precisely, this is done by setting q_u^⊤ k_s to −∞ for s > u before applying the softmax. Formally, for each token index u and each position s, we set the mask
mask_{s,u} = 0 if s ≤ u, and mask_{s,u} = −∞ if s > u, (10.3)
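The effect of the causal mask (10.3) can be made explicit with a small numpy sketch of the masked scaled dot-product attention; the random queries, keys and values below are purely illustrative.

# Causal self-attention sketch: scaled dot-product attention with the mask (10.3),
# so that the token at position u can only attend to positions s <= u.
import numpy as np

def causal_attention(Q, K, V):
    t, q_dim = Q.shape
    scores = Q @ K.T / np.sqrt(q_dim)              # scores[u, s] = q_u^T k_s / sqrt(q)
    mask = np.where(np.tril(np.ones((t, t))) == 1, 0.0, -np.inf)   # 0 if s <= u, -inf if s > u
    scores = scores + mask
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))   # row-wise softmax
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V                             # attention head H

rng = np.random.default_rng(0)
t, q_dim = 5, 8
Q, K, V = (rng.normal(size=(t, q_dim)) for _ in range(3))
H = causal_attention(Q, K, V)
print(np.allclose(H[0], V[0]))                     # True: the first token only attends to itself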
where the first index is the accident year and the second is the development year.
When projecting future claims (Ci,j , where i + j > n + 1, i.e., beyond the cur-
rent calendar year), we restrict ourselves to using only observed claims-entries in
the upper-left triangle. This is directly analogous to causal masking, where for
predicting token Xt , the model can only attend to previous tokens X1:t−1 .
Formally, in self-attention with causal masking, we use (10.3) to ensure that the attention weights for future positions vanish after the softmax application, preventing information leakage from tokens not yet generated - just as actuaries cannot use future claims development factors that have not occurred yet.
The sequential nature of both processes emphasizes the auto-regressive property:
each new prediction builds upon previous predictions, compounding both capabil-
ities and potential errors. Just as errors in early development factors propagate
through the entire claims triangle, errors in early token predictions can influence
all subsequent generations in a decoder transformer model.
make more confident predictions. Such post-hoc calibration not only affects sampling
diversity when generating text (by controlling how quickly the distribution’s mass is
concentrated), but also helps ensure that probability estimates from the softmax layer
align more faithfully with actual uncertainties.
Decoder transformers are trained by minimizing the average negative log-likelihood of the next token,
ℓ(ϑ) = − (1/T) ∑_{t=1}^T log p_ϑ( X_t | X_{1:t−1} ) ,
over a dataset of sequences. By training in this manner, the model learns to predict the next token based on the prior context. Although this next-token prediction task appears simple, it is sufficient for decoder transformer models to learn highly useful representations that can be adapted for various NLP tasks.
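For illustration, the loss ℓ(ϑ) and the temperature-scaled softmax discussed above can be computed as follows for a toy vocabulary; the logits are random stand-ins for the network outputs.

# Next-token loss sketch: softmax probabilities over a toy vocabulary W (optionally
# with a temperature parameter tau) and the average negative log-likelihood l(theta).
import numpy as np

def softmax(logits, tau=1.0):
    z = logits / tau                               # tau < 1: sharper, tau > 1: flatter
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def next_token_loss(logits, targets, tau=1.0):
    # logits: (T, |W|) unnormalized scores per step t; targets: (T,) observed tokens X_t
    probs = softmax(logits, tau=tau)
    ll = np.log(probs[np.arange(len(targets)), targets])
    return -ll.mean()                              # -(1/T) sum_t log p(X_t | X_{1:t-1})

rng = np.random.default_rng(0)
T, vocab_size = 6, 10
logits = rng.normal(size=(T, vocab_size))          # stand-in for the network outputs
targets = rng.integers(0, vocab_size, size=T)      # stand-in for the observed next tokens
print(round(next_token_loss(logits, targets), 3))
print(round(next_token_loss(logits, targets, tau=0.5), 3))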
Applications and variants. Decoder transformer models have been successfully de-
ployed in a wide range of generative tasks, including:
• Language modeling and text generation: GPT models, see Radford et al. [180]
and Brown et al. [32], achieve state-of-the-art results on various NLP benchmarks,
generating coherent text and facilitating tasks such as summarization, translation,
and open-domain dialogue.
Ongoing research explores scaling up decoder transformers to billions (and even trillions)
of parameters, yielding large language models capable of zero-shot or few-shot learning,
robust transfer, and coherent long-range generation. Although computationally expen-
sive, these models adapt to diverse downstream tasks with minimal additional training.
GANs: adversarial training between generator and discriminator. Strengths: high-fidelity samples; sharp, realistic outputs; excels in visual domains. Weaknesses: training instability; mode collapse; no explicit density estimation. Typical applications: image synthesis, style transfer, data augmentation.
The connection between traditional generative models and LLMs is not merely architectural. The fundamental principles we have discussed - learning probability distributions, leveraging self-attention mechanisms, and employing auto-regressive factorization - remain at the core of LLM design. However, at the extreme scales of modern LLMs (with billions or trillions of parameters), these principles yield models that transcend simple next-token prediction to capture deeper patterns of language, knowledge, and problem-solving.
In the following sections, we will explore how LLMs have extended the generative paradigm
from simple distribution learning to complex systems capable of contextual understand-
ing, few-shot learning, and even rudimentary reasoning. This trajectory from specialized
generative models to increasingly general AI systems reflects both the power of scale and
the richness of the frameworks we have developed to this point for modeling complex
distributions.
motivated the construction of models with billions (and later trillions) of parameters,
coining the term Large Language Models (LLMs). As model capacity increases, LLMs
often display emergent capabilities not present in smaller counterparts, such as better
zero-shot generalization and few-shot in-context learning (described below).
Alongside model scaling, researchers recognized the importance of diverse, high-quality
training corpora. Auto-regressive transformers trained on multi-domain data (web text,
scientific articles, code, etc.) acquire flexible linguistic competence that can be harnessed
for many downstream tasks, simply by changing the prompt or performing minimal fine-
tuning. The development of the principles which underlie modern LLMs can be traced
through the advances contained in the Generative Pretrained Transformer (GPT) series
of papers.
GPT-1. The first GPT, Radford et al. [180], demonstrated that a unidirectional (de-
coder) transformer trained on large unsupervised corpora could achieve strong per-
formance on downstream tasks with minimal fine-tuning. This pretrain-then-finetune
paradigm became a blueprint for subsequent GPT-style models.
GPT-2. Scaling up both model size (up to 1.5B parameters) and data quantity revealed
that bigger models not only improved perplexity (in simple terms, the loss of the model)
but could also generate impressively coherent texts, Radford et al. [181]. GPT-2 sparked
discussions about responsible model release due to concerns over disinformation and
misuse, thus, highlighting ethical and security considerations.
GPT-3 and in-context learning. GPT-3, Brown et al. [32], introduced a much larger
model (up to 175B parameters) and ushered in the era of in-context learning. Surpris-
ingly, GPT-3 could perform new tasks simply by reading a handful of examples within
the prompt - few-shot prompting - without any gradient updates to model parameters.
This phenomenon occurs because of the model’s internal representation of language: it
implicitly “learns” from the in-prompt examples and generalizes these patterns to pre-
dict the next tokens. This emergent capability defied earlier assumptions that explicit
fine-tuning was always necessary.
A growing body of work attempts to explain how LLMs implement in-context learning from a theoretical standpoint.
While no single unifying theory fully accounts for in-context learning, these angles under-
score its complexity and partially illuminate the remarkable “learning without parameter
updates” phenomenon.
These elements, taken together, have propelled LLMs forward, creating significant capa-
bilities in language understanding and generation. Research continues to expand context
windows, refine architectural insights, and develop more robust theoretical frameworks.
Given the significant achievements of LLMs, their theoretical underpinnings and practical
implications will likely remain a central focus in research in generative modeling.
From a research perspective, a significant practical constraint of auto-regressive trans-
formers is the context window, typically limited by computational considerations (e.g., a
few thousand tokens). Research on extending this context to tens or hundreds of thou-
sands of tokens (via efficient attention mechanisms or hierarchical memory) is ongoing;
see Beltagy et al. [17] and Chowdhery et al. [43]. These longer context windows allow
LLMs to handle extensive documents, multi-step narratives, or long code bases, further
extending their utility.
where ω ∈ [0, 1] is a credibility factor that determines how much to rely on individual risk data versus the global model.
p(y | Q, E1 , . . . , En ).
This process can be viewed as an implicit “credibility weighting” (or Bayes’ update)
for the new prompt examples - just as an actuary balances a policy’s past claims
with broader class results.
No parameter updates required. In both cases, the global model (the insurer’s
rating manual or the LLM’s pretrained weights) remains unchanged. New, context-
specific outputs are produced without the computational overhead of a full re-
training process or the risk of forgetting established knowledge.
• Large-scale data: Billions of tokens from diverse sources, including web pages,
books, and domain-specific datasets.
• Customer support and chatbots: Companies employ LLM-driven assistants for query
handling, knowledge base lookups, and interactive customer service.
• Scientific and legal research: LLMs facilitate initial drafts for technical documents,
case analyses, or literature reviews, speeding up research processes.
While LLMs show promising capabilities, caution must be exercised regarding halluci-
nations, biases, and misinterpretations. Techniques like RLHF, see Section 10.6.4, and
carefully designed prompts, see Section 10.6.6, can mitigate these issues to some extent.
(1) Initial supervised fine-tuning: Start from the pretrained model and fine-tune it on
labeled examples to adapt it toward desired behaviors (e.g., polite conversation).
(2) Reward model (RM) training: Collect human preference data (e.g., given two model
outputs, which is more helpful or correct?) and train a reward model to predict
these preference scores.
RLHF has been instrumental in aligning LLMs’ outputs with more human-like values,
improving their helpfulness and reducing harmful or factually incorrect content.
• Monotonicity: For example, requiring that premium rates increase with cov-
erage amounts or risk classifications.
In LLM workflows, full fine-tuning of all model parameters can be prohibitively expensive in terms of computation and memory, especially for models with tens or hundreds of billions of parameters.
Adapters [104]. Insert lightweight adapter layers within or between the existing trans-
former layers. During fine-tuning, only these adapter parameters are updated, while the
original model weights remain fixed. Adapters can be trained for each downstream task,
enabling modular reusability.
Prefix tuning [136]. Prepends a small set of learnable “prefix” tokens to each atten-
tion block. The main LLM parameters are frozen, and only the prefix embeddings are
optimized to steer model behavior.
Low-rank adaptation (LoRA) [105]. Instead of storing full dense weight updates,
LoRA factors the update matrix into low-rank components. During forward/backward
passes, these low-rank matrices are injected into attention or FNN layers. This greatly
reduces the memory overhead needed to adapt the model.
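The mechanics of LoRA can be sketched with a small PyTorch module: the pretrained weight matrix is frozen and only the low-rank factors of the update are trained. The layer and parameter names below are illustrative and do not follow any particular library's API.

# LoRA-style layer sketch: the pretrained weight W is frozen and only the low-rank
# factors A and B of the update W + (alpha/r) * B A are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=4, alpha=8.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)                 # freeze the pretrained dense weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)     # low-rank factors, r << min(d_in, d_out)
        self.B = nn.Parameter(torch.zeros(d_out, r))           # B = 0: the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=16, d_out=16, r=4)
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_trainable)      # only the low-rank factors A and B are updated during fine-tuning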
Practical considerations
• Modularity: Adapters and prefix modules can be swapped in or out for different
tasks, making it easier to maintain multiple domain-specific fine-tunings.
Overall, parameter-efficient methods have become a standard practice for adapting large
decoder transformers (e.g., GPT-3.5, Llama) to specialized tasks without incurring the
enormous cost of training all parameters end to end.
10.6.6 Prompting
Prompting has emerged as a powerful way to harness decoding auto-regressive trans-
former models. In essence, prompting leverages the fact that a LLM generates tokens
conditionally on previously observed (or generated) tokens. By carefully selecting an ini-
tial sequence of tokens (the prompt), users can steer the model to perform specific tasks
or produce more detailed and context-aligned outputs. This framework of “prompt-
and-generate” has dramatically extended the applicability of LLMs to tasks including
question answering, summarization, and domain-specific reasoning; see Brown et al. [32].
Although we are going to attempt to provide a mathematical description here, it is im-
portant to note that prompting can be very heuristic in practice and that many ideas in
prompting LLMs have been discovered in a totally empirical manner.
Thus, the prompt p is a partial sequence or context that conditions the generation of the
continuation c. By altering the structure, style, or content of the prompt, we can influence
the model’s output distribution without modifying any model parameters! Of course, we
need a very large model to be able to produce these highly conditional distributions over
outputs, but with the large foundation models we have been discussing, the required
conditions are in place for this to be successful.
Prompting in practice
This sequence forms the prompt p. The model then predicts the continuation c, expected
to mirror the pattern shown by the demonstrations.
Chain-of-thought prompting
• Prompt:
“Q: If a car travels at 60 km/h for 2 hours, how far does it go? Let’s break it down
step by step.”
• Model reasoning:
“We know speed = distance / time. If speed is 60 km/h and time is 2 hours, distance = 60 · 2 = 120 km.”
• Final answer:
“120 km.”
By exposing intermediate steps, chain-of-thought prompts often elicit more accurate and
interpretable responses from the model, especially for multi-step reasoning tasks. This
technique has been shown to improve performance on mathematical reasoning, logical
deduction, and other complex question-answering domains.
Moreover, this approach underlies the next generation of LLMs, which are called reason-
ing models, see Section 10.6.7 below.
Prompting has become a rapidly expanding research area with efforts focusing on:
• Interpretability and reliability: Investigating how prompts can reveal model reason-
ing, help identify hallucinations, or mitigate undesired behaviors. Chain-of-thought
prompting is one approach that aims to surface model reasoning steps.
Importantly, prompting interacts strongly with model size: large-scale LLMs often ex-
hibit emergent few-shot and reasoning capabilities that smaller models lack. As a result,
prompting-based methods have become the de facto approach for eliciting complex be-
haviors from LLMs with minimal overhead.
akin to RLHF approaches. Additional data is collected from the model itself,
enabling multi-stage improvement (SFT → RL → more SFT → more RL). This
yields a more robust reasoning LLM.
This pipeline highlights a broader theme: many state-of-the-art reasoning LLMs rely on
a combination of reinforcement learning, supervised instruction tuning (especially with
chain-of-thought data), and occasional distillation to smaller architectures.
Four general strategies for building or improving reasoning models have emerged:
(3) SFT + RL (typical RLHF). Most top-performing reasoning LLMs (e.g., fi-
nal DeepSeek-R1 or rumored pipelines for the O1 and O3 models from OpenAI) blend
supervised fine-tuning (SFT) with reinforcement learning (RL):
• SFT stage: Collect chain-of-thought or instruction data from either humans or the
model itself (“cold-start” generation). Train the LLM to follow these instructions
or produce step-by-step solutions.
• Iterate: Additional SFT data can be created using the latest model checkpoint,
forming a virtuous cycle of improvement.
This approach generally outperforms pure reinforcement learning, especially for large-
scale deployments, and is favored in contemporary reasoning LLM research.
(4) Pure SFT and distillation. Finally, distillation from a larger reasoning model
can be an easier method to produce smaller reasoning LLMs:
• SFT data generation: A larger teacher model (e.g., DeepSeek-R1) generates high-
quality chain-of-thought or instruction examples.
Advanced reasoning LLMs like DeepSeek-R1 or OpenAI’s “O1” models demand substan-
tial compute resources. However, projects like TinyZero [174] and Sky-T1 [168] show
that interesting progress is possible on smaller scales. For instance:
Such efforts underline the practicality of targeted or lower-scale fine-tuning for specialized
tasks and domain constraints.
Consider a query (or prompt) Q. A single LLM can generate multiple candidate responses
$\mathcal{R} = \{R_1, R_2, \ldots, R_K\}$.
where rk,t denotes the t-th token in candidate Rk , and Tk is the length of that candidate.
Next, the judge component (which may be the same LLM configured in “critique” mode,
or a separate model) provides a quality score or utility Jϕ (Rk , Q) for each candidate,
1 ≤ k ≤ K, reflecting the likelihood of correctness, alignment, or other criteria; here, ϕ represents the parameters of the judge LLM, which can coincide with the parameter set ϑ of the generating LLM if the same model is used for both roles. A simple self-consistency mechanism then selects the final response $\widehat{R}$ as
$$\widehat{R} = \underset{R \in \mathcal{R}}{\arg\max}\; J_\phi(R, Q).$$
This dual role (generator + judge) has inspired research on iterated refinement, self-
consistency decoding, and constitutional AI, whereby models use rules or guidelines to
critique their outputs. For instance, a chain-of-thought can be included in the judging
mechanism to facilitate more nuanced evaluations, e.g., verifying steps in a math proof.
By iterating this process - sampling multiple answers, critiquing them, and selecting or
refining the best - it is often possible to reduce error rates and highlight reasoning flaws
that might otherwise go unnoticed in a single pass. This constitutes a “self-consistency”
or “self-evaluation” loop, potentially improving the reliability of LLM-generated content
without additional external supervision.
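The sample–critique–select loop can be summarized in a few lines of Python; generate_candidates and judge_score are hypothetical placeholders for the generator LLM and the judge $J_\phi$, not calls to any specific library.

```python
def best_of_k(query, generate_candidates, judge_score, k=5):
    """Sample k candidate responses and return the one preferred by the judge.

    generate_candidates(query, k) -> list[str]   # hypothetical generator LLM call
    judge_score(response, query)  -> float       # hypothetical judge call J_phi(R, Q)
    """
    candidates = generate_candidates(query, k)
    scored = [(judge_score(r, query), r) for r in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score
```

Iterating this selection, e.g., feeding the judge's critique back into the generator before re-sampling, yields the self-consistency loop described above.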
Overall, using LLMs in a judge capacity exemplifies how generative models can be ex-
tended beyond pure text generation to include meta-level reasoning about their own out-
puts. Such techniques dovetail with alignment strategies (Section 10.6.4) and advanced
prompting methodologies (Section 10.6.6), contributing to a growing toolbox for building
and refining LLMs. In Section 10.6.7 we discussed similar ideas that now underlie state-of-the-art LLMs.
Working with LLMs requires a well-defined governance framework to ensure ethical, trans-
parent, and compliant usage; see van der Merwe–Richman [224]. Such a framework typ-
ically includes:
These governance pillars lay the foundation for subsequent steps in model selection,
performance evaluation, and long-term monitoring.
When using an LLM for a task, where relevant, actuaries and data scientists should strongly consider requiring the LLM to output a numerical score for its predictions. Let $X \in \mathcal{X}$ be an input (such as a set of claims documents), and let the LLM’s decision function
$$f : \mathcal{X} \to \mathbb{R},$$
produce a numerical output $f(X)$. Interpreting $f(X)$ as a probability, confidence level, or rating on a defined scale, we might write
• Machine learning metrics: Standard metrics, such as strictly consistent loss functions (e.g., mean squared error), the Gini score, or F1-scores, enable direct benchmarking across different prompts or data conditions.
Moreover, assigning a numerical score at each step helps mitigate hallucinations: once
the model is required to quantify its certainty, stakeholders can explicitly detect outlier
predictions (e.g., extremely high scores for dubious responses) and investigate them using
structured review or escalation processes. Having a set of quantitative scores and analysis
for several different LLMs can aid in the process of selecting the most relevant model for
a task.
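As a purely illustrative sketch of such numerical scoring (the function query_llm stands for whatever LLM interface is in use and is a hypothetical placeholder), one can request a confidence score in a machine-readable format and flag outliers for escalation:

```python
import json

def scored_answer(question, query_llm, low=0.2, high=0.98):
    """Request an answer plus a numeric score f(X) and flag outliers for review."""
    prompt = (
        "Answer the question and also return a confidence score between 0 and 1. "
        'Respond as JSON with the keys "answer" and "score".\n\nQuestion: ' + question
    )
    raw = query_llm(prompt)               # hypothetical call to the deployed LLM
    parsed = json.loads(raw)              # fails loudly if the required format is violated
    score = float(parsed["score"])
    flag = score < low or score > high    # candidate for structured review or escalation
    return parsed["answer"], score, flag
```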
Performance metrics
Once the model produces numeric scores, a variety of performance measures become
straightforward to implement:
• NLP/LLM metrics: Evaluate textual quality via ROUGE or BLEU, while exam-
ining numeric agreement with human-labeled data (e.g., classification accuracy).
• Alignment measures: Compare the LLM’s proposed chain-of-thought and final nu-
meric scores against expert judgements or known standards.
• Stability and sensitivity: Assess the model’s outputs across varying prompt word-
ings or reordered inputs, checking for robustness in both text and numerical pre-
dictions.
Robustness. Testing the model under adversarial prompts or demographic shifts en-
sures its numeric output remains stable. For instance, if f (X) changes dramatically under
minor prompt modifications, further prompt engineering or data augmentation may be
warranted.
$$Y \in \{\text{True}, \text{False}\}, \qquad \widehat{p}(Y = 1 \mid X).$$
• Data inputs: Ensuring that new data remain consistent with original training or
fine-tuning assumptions.
Regular review of numeric scoring trends - particularly areas with unexpectedly high
or low scores - can reveal potential hallucinations or systematic biases early, prompt-
ing timely interventions. Governance bodies can decide on revised thresholds, updated
prompts, or further training if required.
Conclusion
Imposing a numeric score on every LLM decision links the model’s text generation to
rigorous statistical validation, a hallmark of actuarial and data science practice. By in-
tegrating such quantitative outputs with robust governance, performance tracking, bias
assessments, and human oversight, actuaries can more confidently deploy LLM-based
solutions in high-stakes or regulated environments. These protocols do not merely en-
hance transparency - they offer a meaningful safeguard against hallucinations by flagging
uncertain or outlier score values for deeper review.
• Model compression: In some cases, the sparse representation learned via auto-
encoding can hint at parameter-efficient strategies to prune or quantize model
weights.
Although LLMs themselves typically rely on dense transformer layers, the application
of sparse auto-encoders to LLM-generated activations or embeddings is increasingly ex-
plored in post-hoc interpretability settings, aiming to identify stable, low-dimensional
factors that underlie the rich behaviors observed in large-scale generation.
Mechanistic interpretability
test whether sub-circuits within the network encode these structures internally; see Cao
et al. [39]. For example, if a network reliably identifies symbolic patterns in its hidden
representations, it indicates emergent, structured computations within the parameters.
The synergy between mechanistic interpretability and sparse auto-encoders arises from
their shared focus on identifying meaningful structure within high-dimensional repre-
sentations. Sparse auto-encoders, by design, encourage models to compress data into
a limited set of activation units, thereby shedding light on which dimensions are most
critical for a given task; see Makhzani–Frey [147]. When examining LLM activations, if a
sparse auto-encoder robustly encodes certain linguistic features in a small subset of neu-
rons, this subset may correspond to functionally relevant circuits in the original model.
Interventions such as ablation or activation patching can then be more narrowly targeted,
allowing researchers to focus on the most influential dimensions within the network.
Enforcing sparsity is particularly helpful when trying to localize features to individual
neurons or small neuron clusters. For instance, if a sparse auto-encoder bottleneck consis-
tently highlights dimensions tied to tense or sentiment, these dimensions become natural
entry points for deeper mechanistic analysis. Researchers can then ablate or patch only
those critical neurons to measure how the LLM’s behavior changes, shedding light on
where and how key linguistic functions are implemented.
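As a rough illustration of this idea, the following sketch (Python/PyTorch, our assumption; not code from the original notebooks) trains a sparse auto-encoder on a matrix of LLM activations, using an L1 penalty on the bottleneck code to encourage a small number of active units; all dimensions and hyper-parameters are arbitrary choices.

```python
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    """Auto-encoder with an L1-sparsity penalty on the hidden code."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse code
        return self.decoder(z), z

def train_sae(activations, d_hidden=4096, l1=1e-3, epochs=10, lr=1e-3):
    """activations: tensor of shape (n_tokens, d_model) collected from one LLM layer.
    Full-batch training for brevity; mini-batches would be used in practice."""
    model = SparseAutoEncoder(activations.shape[1], d_hidden)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, z = model(activations)
        loss = ((recon - activations) ** 2).mean() + l1 * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Dimensions of the code z that activate consistently for a specific linguistic feature are then candidates for the targeted ablation or activation-patching experiments described above.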
10.6.11 Summary
LLMs represent a paradigm shift in generative modeling, wherein a single pretrained
neural network can address myriad tasks with minimal additional training. By incorpo-
rating human feedback mechanisms (RLHF), employing specialized prompts, or enabling
self-consistency checks, LLMs can produce high-quality, context-aware, and interpretable
outputs. Nonetheless, significant challenges remain, including controlling unwanted bi-
ases, ensuring factual accuracy, and addressing potential misuse. Ongoing research in
foundation models, fine-tuning protocols, and advanced prompting strategies continues
to shape the evolving landscape of LLMs.
Reinforcement learning
11.1 Introduction
Reinforcement learning is a fast-evolving and exciting field on optimal decision making in a dynamic environment. In particular, reinforcement learning is one of the key techniques behind reasoning, which is a crucial feature of modern generative AI models, e.g., used in solving mathematical problems.
To give full consideration to reinforcement learning, we would need to write an entire book. Our aim here is to give a short introduction to reinforcement learning and to discuss what kind of problems can be studied with this technology. The reader should be aware of the fact that we only present the simplest problems and their solutions, while the latest technology is by far more developed; classical references on reinforcement learning are Sutton–Barto [212] and Murphy [162], and the material presented in this section is largely taken from these two references. For an explicit actuarial example considering an optimal pricing problem, see Palmborg–Lindskog [173]. This paper presents a non-life insurance premium control problem that seeks an optimal premium rule maximizing profits, while complying with solvency regulation and taking customers’ price sensitivities into account.
Such a multi-objective premium control problem can be solved dynamically by learning
how specific actions contribute to a total reward with the aim of maximizing this total
reward. This is the general philosophy in reinforcement learning.
Before starting, we would like to mention that the reinforcement learning community
is somewhat disjoint from the classical machine learning community and also from the
statistical community. We emphasize this because terminology can be quite different in
these different communities, e.g., bootstrapping can mean rather different things depend-
ing on the specific community. We mention this to highlight that there may be some
inconsistencies in this section compared to earlier chapters.
Generally speaking, in predictive modeling a decision maker tries to make accurate fore-
casts, and to evaluate the accuracy of her/his forecasts, the decision maker receives the
correct answer at a later stage. As an example, we forecast an insurance claim at the
beginning of the period, and by the end of the period we know the true claim incurred.
In contrast, in reinforcement learning a decision maker takes actions which are rewarded. However, there is no right or wrong answer that is revealed to the decision maker; she/he only gets feedback in terms of a bigger or smaller reward, and at the same time,
she/he does not have the possibility to test all potential actions and their resulting
rewards. For example, an insurer (decision maker) can either increase or lower the in-
surance premium, and by the end of the period the insurer gets a reward in terms of the
total premium earned (based on the assumption that the customers will only sign new
contracts up to their price tolerance levels), but there is no possibility for the insurer
to simultaneously test different pricing strategies before exercising one of them. Using
reinforcement learning, the insurer can continuously (online) learn to improve its decision
making strategy by learning from the feedback received.
The classical multi-armed bandit problem gives a fairly good introduction and overview
of the field of reinforcement learning. That is why we start with this classical example
(which is not directly related to insurance).
Assume that a gambler has the option to play on k ≥ 2 different slot machines (one-armed
bandits), where each of the slot machines has a different random payout. Naturally, the
gambler’s goal is to maximize her/his gain, and she/he selects the slot machine which she/he believes will hit the jackpot in the next round.
In this game the gambler can exploit the slot machine that she/he believes has the
biggest payout, but at the same time it may also be worthwhile to explore the other
k − 1 slot machines because one of them could even be better.
The true action-value function is given by the expected reward of each action,
$$q(a) := \mathbb{E}\left[\, R_{t+1} \,\middle|\, A_t = a \,\right], \qquad (11.1)$$
for a ∈ A. This is called the true action-value function because it uses the true reward mechanism; this is similar to the true regression function in (1.2).
If we knew the true action-value function a 7→ q(a), we would simply maximize this
function over a ∈ A, to obtain the maximal (expected) reward. Typically, the true action-
value function is unknown to us because we do not know the precise reward mechanisms
of the different slot machines. A general way to solve this problem is to try to learn this
action-value function by exploring and exploiting the slot machines over several rounds
of the game. This gives us estimates (qbt (a))a∈A in every round t ≥ 0 for the true action-
value function q. We use these estimates (qbt (a))a∈A for the next round of the game. They
are then updated according to the received reward Rt+1 in this next round. Thus, the
reward Rt+1 is a feedback on how our specific action At has performed.
For given estimates (qbt (a))a∈A at time t ≥ 0, the greedy action is the one with the
(immediate) highest action-value estimate
If we select the greedy action, we exploit our current knowledge around the maximum of
the estimated action-value function a 7→ qbt (a). But we can also select a non-greedy action
by exploring whether we can improve our estimates (qbt (a))a∈A by selecting a different
slot machine. Exploiting is the (estimated) optimal one-step ahead strategy but it is not
necessarily optimal for multiple steps ahead (in the long run). This is precisely the trade-
off between exploiting and exploring which may have a sophisticated interrelationship.
To better understand this reinforcement learning mechanism, it is useful to give an ex-
plicit numerical example. We present the example of Sutton–Barto [212, Section 2.3].
Example 11.1. We choose k = 10 one-armed bandits and we select their true action-values $(q(a))_{a=1}^{10}$ by simulating them from independent standard Gaussian distributions. These true action-values $(q(a))_{a=1}^{10}$ are kept fixed, and they are unknown to the gambler. Figure 11.1 illustrates these true action-values; the most promising slot machine is number a = 4, closely followed by slot machine number a = 8.
and we assume a Markov property, meaning that this reward Rt+1 only depends on the
last selected action At , and not on any information prior to time period t.
The most natural and simple action-value estimate is given by computing the empirical average rewards on each slot machine a ∈ A,
$$\widehat{q}_t(a) = \frac{1}{\sum_{s=0}^{t-1} \mathbb{1}_{\{A_s = a\}}} \sum_{s=0}^{t-1} R_{s+1}\, \mathbb{1}_{\{A_s = a\}},$$
and we set it to a default value for actions a ∈ A without any observations.
The law of large numbers tells us that $\widehat{q}_t(a) \to q(a)$, a.s., as $t \to \infty$, provided that a is selected infinitely often in these trials. At time t ≥ 0, the next greedy action is given by
$$A_t = \underset{a \in \mathcal{A}}{\arg\max}\; \widehat{q}_t(a),$$
with a deterministic rule if there is more than one solution to this maximization problem.
This greedy step exploits the estimated optimal slot machine at time t ≥ 0. To also
explore the other slot machines, we insert random non-greedy steps, by using a so-called
ε-greedy strategy. Select ε ∈ (0, 1) and sample i.i.d. Bernoulli random variables Bt , t ≥ 0,
being independent of everything else and taking the value one with probability ε.
The ε-greedy action at time t ≥ 0 is given by
$$A_t = \begin{cases} \underset{a \in \mathcal{A}}{\arg\max}\; \widehat{q}_t(a) & \text{if } B_t = 0,\\[4pt] U_t & \text{if } B_t = 1, \end{cases} \qquad (11.2)$$
where $U_t$ denotes an action drawn uniformly at random from the action space $\mathcal{A}$.
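A short simulation sketch of strategy (11.2) is given below (Python/NumPy, our own illustration rather than the code behind the figures); we assume, as in the Sutton–Barto example, that the reward of slot machine a is Gaussian with mean q(a), and that the exploration action $U_t$ is uniform over the k arms.

```python
import numpy as np

def epsilon_greedy_bandit(k=10, epsilon=0.05, n_steps=1000, seed=1):
    """Simulate one run of the epsilon-greedy strategy (11.2) on a k-armed bandit."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(size=k)          # true action-values q(a), unknown to the gambler
    q_hat = rng.normal(size=k) * 0.01    # random initialization of the estimates
    counts = np.zeros(k)
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        if rng.random() < epsilon:       # B_t = 1: explore
            a = int(rng.integers(k))     # U_t uniform on the action space
        else:                            # B_t = 0: exploit
            a = int(np.argmax(q_hat))
        r = rng.normal(loc=q_true[a])    # assumed reward: Gaussian around q(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]   # incremental empirical mean
        rewards[t] = r
    return q_true, q_hat, rewards

q_true, q_hat, rewards = epsilon_greedy_bandit()
print("best true arm:", q_true.argmax(), "best estimated arm:", q_hat.argmax())
```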
Figure 11.2: (lhs) Development of average rewards Rt , and (rhs) ratio of selection of the
optimal slot machine for iterations 1 ≤ t ≤ 1000.
We implement the k-armed bandit reinforcement learning algorithm (11.2) for different ε-greedy strategies with ε ∈ {0.01, 0.02, 0.05}. The results are shown in Figure 11.2. The left-hand side gives the average rewards, under the gambler’s strategy (11.2), defined by
$$\bar{R}_t = \frac{1}{t} \sum_{s=0}^{t-1} R_{s+1}.$$
The right-hand side of the figure shows the proportion of selections of the optimal slot machine a = 4, i.e., the slot machine a ∈ A that has the highest expected reward q(a); this optimal value is illustrated by the black horizontal line in Figure 11.2 (lhs). We observe that for all ε choices the average rewards approach this optimal value q(a), a = 4, and there seems to be little difference in their speeds of convergence. Note, however, that due to the ε-greedy strategy, the average reward will not converge exactly to q(a), but to a slightly smaller value, because the ε-greedy sampling gives a fixed proportion of non-optimal slot machine selections. For convergence to q(a) one also needs to let the Bernoulli probability decay, εt ↓ 0 as t → ∞.
Figure 11.2 (rhs) shows the proportions of the selection of the optimal slot machine a = 4,
$$\frac{1}{t} \sum_{s=0}^{t-1} \mathbb{1}_{\{A_s = 4\}}.$$
These quantities converge to 1 − (ε − ε/k), illustrated by the horizontal lines in Figure 11.2 (rhs). In two cases, we have a rather smooth increase of these proportions to their limits; only the green graph looks a bit surprising. Note that all these graphs depend on the initialization of the algorithm. We have randomly initialized $(\widehat{q}_0(a))_{a=1}^{k}$, and different seeds provide different results. A random initialization avoids the difficulty of having multiple maxima in (11.2); if one sets $(\widehat{q}_0(a))_{a=1}^{k}$ to very large values, this promotes exploring in the beginning of the algorithm, because every initial reward is likely to be a disappointment, moving the gambler on to the next slot machine. Let us try to understand the green graph of Figure 11.2 (rhs).
[Figure: three panels titled “k-armed bandit: selected bandit”, showing the selected bandit (y-axis) against iterations 0 ≤ t ≤ 1000 (x-axis).]
problems, having side constraints and multiple targets, etc. This will require more so-
phisticated reinforcement learning algorithms, potentially based on approximations where
some of the quantities cannot be computed explicitly, etc. In the remainder of this sec-
tion we will discuss some of these extensions. The main purpose of this discussion is to
make the reader familiar with some of the reinforcement learning concepts and intuition.
Clearly, this discussion should be understood as an introduction, and for more advanced
reinforcement learning technologies, the reader is referred to the specialized literature on
reinforcement learning.
These updates can be performed with a constant memory size, and they essentially use the Markov property. Beyond that, they have a rather interesting structure that is common to many reinforcement learning algorithms. We can interpret $1/N_t(a)$ as a learning rate or a step size parameter.
For suitable learning rates $\varrho_t(a) > 0$, we obtain the updates
$$\widehat{q}_{t+1}(a) = \widehat{q}_t(a) + \varrho_t(a)\left( R_{t+1} - \widehat{q}_t(a) \right) \mathbb{1}_{\{A_t = a\}}.$$
This looks very innocent, but actually this structure is the key to many reinforcement
learning algorithms, namely, it proposes temporal difference learning by incrementally
improving the estimate over time by the new experience
It can be interpreted as trying to predict the new experience Rt+1 by the old estimate
qbt (a), and (11.4) can then be seen as the corresponding updating step, trying to improve
the prediction based on the new experience.
Summary. The key to reinforcement learning often has its grounds in a temporal difference learning structure of type (11.4). The multi-armed bandit problem of Example 11.1 has all these features. There are some extensions/changes that we are going to introduce below, to make the framework more practical.
• We extend the actions (At )t≥0 and the rewards (Rt )t≥1 by a third sequence, the
states (St )t≥0 . This will allow us to solve more interesting problems.
• We will not directly focus on the rewards (Rt )t≥1 , but rather on the future ex-
pected discounted rewards, called expected gains, for given (initial) state-action
pairs (St , At ). Under an unknown reward mechanism, we need to estimate these
expected gains along the way.
R0 , S0 , A0 ; R1 , S1 , A1 ; R2 , S2 , A2 ; . . . , (11.5)
[Figure 11.4: agent-environment interaction; the environment returns the reward Rt and the state St, and the agent responds with the action At.]
A finite MDP has three finite spaces S, A and R. If the state space S and the action
space A are finite, we speak about tabular learning because we can store all potential
state-action pairs (s, a) ∈ S × A in a finite table. This outline mainly focuses on tabular
learning, and possible extensions are only briefly considered in Section 11.8, below.
In this dynamic learning setting, one distinguishes two different cases. One can either have a continuing task, where (St)t≥0 randomly evolves over the state space S forever, or an episodic task, which is assumed to terminate and which can be restarted from scratch. In the latter case, one adds a terminal (absorbing) state to the state space, S † = S ∪ {†}, and the game is terminated when St+1 enters the terminal state † (for the first time). This motivates the definition of the stopping time
$$T = \inf\left\{\, t \ge 0;\; S_{t+1} \in \mathcal{S}^\dagger \setminus \mathcal{S} \,\right\} \in [0, \infty]; \qquad (11.6)$$
if there is no terminal state or if the state space process does not reach the terminal
state, we set T = ∞.
Figure 11.4 shows that an MDP involves two different (Markovian) transitions: (a) there is the environment’s dynamics p : S × R × S × A → [0, 1] in blue color, and (b) there is the agent’s policy π : A × S → [0, 1] in orange color. Markovian means that these dynamics are fully determined by just considering the realization in the previous iteration. We discuss these two transitions in turn.
(a) Environment’s dynamics. The environment’s dynamics is given by nature, and it is
either known or unknown to the decision maker.
Specifically, in the finite spaces case, we assume the transition probabilities
$$p(s', r \mid s, a) := \mathbb{P}\left[\, S_{t+1} = s', R_{t+1} = r \,\middle|\, S_t = s, A_t = a \,\right] \qquad (11.7)$$
$$= \mathbb{P}\left[\, S_{t+1} = s', R_{t+1} = r \,\middle|\, S_t = s, A_t = a, (S_u)_{u=0}^{t-1}, (A_u)_{u=0}^{t-1}, (R_u)_{u=0}^{t} \,\right],$$
for t ≥ 0.
Thus, the pair (St+1 , Rt+1 ) only depends on the previous state-action pair (St , At ), this is
the Markov property we use for the environment’s dynamics. These transition probabil-
ities p fully determine the environment’s dynamics of the MDP. We give some remarks:
• In the case of a terminal state † and state space S † , one constrains (11.7) to remain
in the terminal state with probability one (and all subsequent rewards and actions
are discarded because the process is terminated).
• To run this dynamics we still need to define the agent’s policy π(a|s), and we need
to specify the initial state S0 ; we have set initial reward R0 = 0.
Using the stopping time T introduced in (11.6), one defines the total discounted reward,
called gain, after time t by
$$G_t = \sum_{u=t}^{T} \gamma^{u-t} R_{u+1}, \qquad (11.8)$$
for γ ∈ (0, 1], and where an empty sum is set equal to zero.
The gain Gt is not generally finite for γ = 1: there are models without stopping (T = ∞) or models with a slow (heavy-tailed) stopping time which may make the gain infinite on average for γ = 1. For γ < 1, this sum is always finite (also on average), because the rewards (Rt)t≥1 are uniformly bounded on a finite reward space R, providing a uniform upper bound (and a finite mean) via the corresponding geometric series.
(b) Agent’s policy. There remains the agent’s policy; note that the decision maker is commonly called the agent.
The agent’s policy is assumed to be of the form
$$\pi(a \mid s) := \mathbb{P}\left[\, A_t = a \,\middle|\, S_t = s \,\right] \qquad (11.9)$$
$$= \mathbb{P}\left[\, A_t = a \,\middle|\, S_t = s, (S_u)_{u=0}^{t-1}, (A_u)_{u=0}^{t-1}, (R_u)_{u=0}^{t} \,\right],$$
for t ≥ 0.
This policy π describes the decision making of the agent, see Figure 11.4. This decision
making can be deterministic, in which case the agent’s policy π(·|s) is a single point mea-
sure in some action a ∈ A, but it can also be random with π(·|s) describing a distribution
over the action space A. In case of a deterministic policy, it is more convenient to use
the notation π : S → A, s 7→ a = π(s) ∈ A.
We assume that the policy π(·|s) is not influenced by the rewards, see (11.9) and the
dotted blue line in Figure 11.4. The goal is to select the optimal policy by maximizing
the future expected discounted rewards, called value function, see next section.
The environment’s transition probabilities p, given by (11.7), and the agent’s poli-
cies π, given by (11.9), describe the finite MDP, as illustrated in Figure 11.4. The
goal of this dynamic decision making problem is to find an optimal policy π ∗ for a
given environment’s dynamics p that maximizes the expected gain. This problem
is solved by reinforcement learning, and there are two rather different situations:
either the environment’s dynamics p is known to the agent or it is unknown to
the agent. In the latter case, we can perform model-based reinforcement learning
by trying to learn the model, or we can perform model-free reinforcement learning
where the environment’s dynamics is not needed to solve the task.
for states s ∈ S.
Because of stationarity, we can drop the time index t on the left-hand side of the previous
identity. Under a known environment’s dynamics p, the value function vπ can be com-
puted for every policy π, and the aim is to find the optimal policy π ∗ that maximizes the
value function. This then gives the optimal value
In this setting, the main question is whether there exists an optimal policy π ∗ that solves (11.10), and, if yes, how it can be found. In the finite MDP case and under deterministic
policies, there exists an optimal policy π ∗ ; see Puterman [178, Corollary 6.2.8]. Moreover,
there are many other settings where such an existence result can be proved. In the finite
tabular case and under a known environment’s dynamics p, the optimal policy problem
is then solved by dynamic programming. There are two different versions that are useful:
(A) policy iteration and (B) value iteration. We briefly describe these. Let A(s) ⊂ A be
the admissible actions a in state s ∈ S.
(a) Policy evaluation (also called prediction problem) aims at computing the value func-
tion vπ for a fixed policy π. Unrolling the Markov property by one step gives us the Bellman equations
for s ∈ S. These Bellman equations (11.11) have a unique solution if γ < 1. Note that, for a given π, (11.11) gives us a system of |S| linear equations for (vπ(s))s∈S. Thus, for a known environment’s dynamics p, this system can be solved exactly for the given policy π.
(b) Policy improvement aims at improving the policy for a given value function (vπ(s))s∈S. This is done by a greedy step for deterministic policy improvements.
(a) Apply policy evaluation to the deterministic policy πk to find the unique so-
lution of the value function (vπk (s))s∈S to the linear system
and increase k → k + 1.
(2) Return (πk∗ (s))s∈S and (vπk∗ (s))s∈S for the stopping time k ∗ .
Algorithm 4 gives the resulting policy iteration algorithm. We comment on this algo-
rithm. The greedy step (11.13) implies that each policy πk+1 is uniformly better than
the previous one πk , and this algorithm will converge. As mentioned above, the system
(11.12) describes |S| linear equations that need to be solved. That is, for a suitable vector
bπk ∈ R|S| and matrix Bπk ∈ R|S|×|S| , (11.12) can be rewritten in vector notation
where we set $v_{\pi_k} = (v_{\pi_k}(s))_{s \in \mathcal{S}} \in \mathbb{R}^{|\mathcal{S}|}$. This can be solved by a matrix inversion, giving us the solution $v_{\pi_k} = (\mathrm{Id} - B_{\pi_k})^{-1} b_{\pi_k}$ for the given policy πk. In practice, this is solved differently. Namely, the value function can be seen as the fixed point of the system (11.12) and (11.14), respectively. This fixed point can be found by Banach’s fixed point iteration, for γ ∈ (0, 1). This observation is precisely the idea for the next algorithm: it may not be necessary to run this fixed point iteration until convergence, and one can alternate (single) fixed point iteration steps and policy improvement more frequently.
(0) For k = 0, select an initial value function (vk (s))s∈S , and γ ∈ (0, 1).
(2) Return (vk∗ (s))s∈S and the resulting deterministic policy (πk∗ (s))s∈S obtained by
Algorithm 5 directly iterates the value optimization, i.e., it focuses on vk (s) instead of
the value vπk (s) of an actual policy πk .
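For a known environment’s dynamics p, value iteration can be sketched in a few lines (Python/NumPy, our own illustration rather than the notebooks’ code); the random dynamics below is chosen in the spirit of Example 11.2 further down.

```python
import numpy as np

def value_iteration(p, rewards, gamma=0.5, tol=1e-10, max_iter=10_000):
    """Tabular value iteration for a known environment's dynamics.

    p       : array of shape (S, R, S, A), p[s_next, r, s, a] = p(s', r | s, a)
    rewards : array of shape (R,) with the numerical reward values
    Returns the (approximately) optimal value function and the greedy policy.
    """
    n_next, n_r, n_s, n_a = p.shape
    v = np.zeros(n_s)
    for _ in range(max_iter):
        target = rewards[None, :] + gamma * v[:, None]     # r + gamma * v(s'), shape (S, R)
        q = np.einsum('nrsa,nr->sa', p, target)            # action-values q(s, a)
        v_new = q.max(axis=1)                              # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)                              # deterministic greedy policy
    return v, policy

# small random dynamics in the spirit of Example 11.2 (10 states, actions, rewards)
rng = np.random.default_rng(0)
p = rng.random((10, 10, 10, 10))
p /= p.sum(axis=(0, 1), keepdims=True)                     # p(., . | s, a) sums to one
v_opt, pi_opt = value_iteration(p, rewards=np.arange(1, 11), gamma=0.5)
```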
knowledge is not available, see, e.g., the multi-armed bandit problem studied in Example
11.1, where the reward distribution is not available, but needs to be learned from experi-
ence. Thus, the typical case in practice is the one with unknown environment’s dynamics
p. This section mainly presents a preparation for the next Section 11.7 which explains
how to deal with the case of unknown environment’s dynamics p. The main technique
used will be temporal difference learning, and in this section we prepare for this.
In the case of an unknown environment’s dynamics p, one tries to either learn from actual
experience or one tries to learn from simulated experience.
• Learning from actual experience does not require any knowledge about the environ-
ment’s dynamics p. In fact, in a model-free manner, one directly tries to estimate
the value function v(s) from actual experience (this is also called prediction), from
which one then derives the optimal policy.
We expand the value function (vπ (s))s∈S to the action-value function (also called Q-
function) which additionally accounts for the taken action
$$q_\pi(s, a) = \mathbb{E}_\pi\left[\, G_t \,\middle|\, S_t = s, A_t = a \,\right] = \mathbb{E}_\pi\left[\, \sum_{u=t}^{T} \gamma^{u-t} R_{u+1} \,\middle|\, S_t = s, A_t = a \,\right],$$
for s ∈ S and a ∈ A(s) ⊂ A, where A(s) are the admissible actions in state s.
This is similar to the multi-armed bandit example (11.1), but expanded by the state
St = s and accounting for all future (discounted) rewards under policy π. We have for
any t ≥ 0 the following two crucial relationships between the value and the action-value
functions
$$v_\pi(s) = \mathbb{E}_\pi\left[\, q_\pi(S_t, A_t) \,\middle|\, S_t = s \,\right], \qquad q_\pi(s, a) = \mathbb{E}_\pi\left[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \,\middle|\, S_t = s, A_t = a \,\right];$$
this uses the tower property for conditional expectations, and the latter particularly uses
that the next action At only depends on St , and not on the entire history, see (11.9), and
on the stationarity of the MDP. The latter identity shows that the action-value function
naturally enters the policy improvement (11.13) because we have
$$\pi_{k+1}(s) = \underset{a \in \mathcal{A}(s)}{\arg\max} \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a)\left( r + \gamma\, v_{\pi_k}(s') \right) = \underset{a \in \mathcal{A}(s)}{\arg\max}\; q_{\pi_k}(s, a). \qquad (11.15)$$
R0 , S0 , A0 ; R1 , S1 , A1 ; R2 , S2 , A2 ; . . . ; RT , ST , AT ; RT +1 , ST +1 . (11.16)
Denote by Ts,a the first visit of the observed sequence (11.16) to the state-action pair
(s, a) ∈ S × A. This gives us an empirical estimate of the action-value qπ (s, a) of policy
π in (s, a)
$$\widehat{q}_\pi(s, a) = \sum_{u=T_{s,a}}^{T} \gamma^{u - T_{s,a}} R_{u+1}. \qquad (11.17)$$
This is the simplest version of Monte Carlo estimation of the action-value function, and
there are many similar variants; see Sutton–Barto [212, Section 5.1]. For the following
algorithm, this estimation is performed for all state-action pairs (s, a) independently, i.e.,
the estimates do not build on each other. Therefore, according to reinforcement learning
terminology, this is not a bootstrapping estimate.1 Inserting the empirical estimate
(11.17) into (11.15) motivates the following policy improvement step for k → k + 1
This leads us to the following algorithm; in the sequel it is more convenient to write the
updates πk → πk+1 as π ← π. I.e., instead of labeling the iterations by k ≥ 0, we use a
generic left-arrow ‘←’ to indicate the updates in the loops of the following algorithms.
The resulting Monte Carlo exploring starts algorithm is presented in Algorithm 6. It
considers the first visits Ts,a to every state-action pair (s, a) by recursively checking
whether there is no earlier visit in the observed episode (11.16); this refers to the name
‘exploring starts’ of the algorithm. The resulting gain G is then appended to the observed
values Gains(s, a) of that state-action pair (s, a), and the current optimal policy estimate
π is re-evaluated/updated in the observed state s = St . However, this is not for a fixed
policy, but rather over all past experienced policies because Gains(s, a) collects the gains
over all past episodes (11.16). This algorithm cannot converge to a suboptimal solution
because of monotonicity. This is intuitively clear, however, as stated in Sutton–Barto
[212, Section 5.3], there is no formal proof of this intuition. There is another difficulty in
this algorithm, namely, there needs to be a way of observing an episode (11.16) for each
policy π under consideration. Typically, this requires simulated experience, but it will
not easily be possible to generate actual experience for each policy π of interest. E.g.,
in the multi-armed bandit problem this would require a huge investment to generate an
episode for every policy π of interest.
1 Note that bootstrapping in reinforcement learning means that the parameter estimation depends
recursively on previous estimates. This is different from the statistical bootstrap of Section 1.5.4.
The greedy update (11.18) exploits the optimal action in state St . In all algorithms below
we insert ε-greedy updates for a given ε ∈ (0, 1) to also explore.
An ε-greedy strategy is obtained by replacing (11.18) by the following two steps
$$a^{+}_{k+1} = \underset{a \in \mathcal{A}(S_t)}{\arg\max}\; \widehat{q}_{\pi_k}(S_t, a), \qquad (11.19)$$
This ε-greedy strategy is equivalent to (11.2). It also mitigates the problem that there
are state-action pairs that are not sufficiently often visited. In fact, it allows us to drop
the inconvenient assumption in Algorithm 6 that each potential pair (s, a) must appear
as a starting point with a positive probability. This ε-greedy strategy is called an on-policy
method because it is used on the policy πk itself, whereas off-policy methods work on the
transition probabilities to generate the episodes, e.g., by using a version of importance
sampling.
There is one cool thing that we did not mention, namely, the above action-value updates
can again be done by incremental learning (because we consider simple averages), see
Section 11.3 and (11.4),
where G(St , At ) is the gain appended to Gains(St , At ) in the current iteration, and with
learning rates ϱt = ϱt (St , At ) > 0. This is the basis for all upcoming more practical
proposals under unknown environment’s dynamics p.
Assume that the gain Gt = G(St , At ) belongs to the state-action pair (St , At ) at time t,
and that it has been generated by an episode following policy π. Then, Gt is an empirical
estimate for qπ (St , At ) = Eπ [Gt |St , At ], see (11.17). If we revert this consideration, we
can also use the action-value qπ (St , At ) to predict the gain Gt , i.e., the gain is predicted
by its expected value (which minimizes the mean squared error). If we perform this
prediction at time t + 1, i.e., if we use the next action-value qπ (St+1 , At+1 ) to predict the
gain Gt+1 , we receive the following approximation
Inserting this approximation into the previous incremental update gives us what is known as the one-step temporal difference (TD(0)) update
π under consideration, but only the next policy actions π(At |St ) matter for this step-
by-step update. Therefore, it can be performed online on actual experience. Rewriting
(11.23) gives us
(0) Select an initial action-value function (qb(s, a))s,a with qb(†, a) ≡ 0; and γ ∈ (0, 1),
and small ε > 0.
Algorithm 7 gives the SARSA temporal difference algorithm for estimating the action-
value function. SARSA on-policy learning is fully practical, as we can observe online (in real time) the next reward Rt+1 and the next state St+1 , given the state-action pair (St , At )
on the selected policy. For an example, we refer to the multi-armed bandit problem that
returns the next reward Rt+1 after we have taken the action At . That is, this can be
performed with actual experience, and it does not require any (prior) knowledge about
the environment’s dynamics p. On-policy refers to the fact that we use the selected policy
π once more to anticipate the next action At+1 , given state St+1 . This is precisely the
difference to the off-policy algorithm presented in the next subsection.
A variant of SARSA is expected SARSA temporal difference learning which considers the
update
$$\widehat{q}(S_t, A_t) \leftarrow \widehat{q}(S_t, A_t) + \varrho_t \left( R_{t+1} + \gamma \sum_{a \in \mathcal{A}(S_{t+1})} \pi(a \mid S_{t+1})\, \widehat{q}(S_{t+1}, a) - \widehat{q}(S_t, A_t) \right). \qquad (11.24)$$
That is, we do not simulate the next action At+1 for the update, but we replace it by
an expected value. The following off-policy algorithm is similar in that it replaces the
expected value (reflected by the sum) by an off-policy maximum operation, see (11.25),
below.
This is called off-policy learning because it does not anticipate the next action At+1 w.r.t. the selected policy π, given state St+1 .
(0) Select an initial action-value function (qb(s, a))s,a with qb(†, a) ≡ 0; and γ ∈ (0, 1)
and small ε > 0.
The maximizations in Algorithm 8 can be critical as they may lead to biases. To mitigate
such biases there are more advanced methods like double-Q-learning; for details see
Sutton–Barto [212, Section 6.7].
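Both temporal difference schemes can be sketched in one short routine (Python; the function env_step is a hypothetical environment hook returning the next reward and state); the only difference between the on-policy SARSA target of Algorithm 7 and the off-policy Q-learning target of Algorithm 8 is the bootstrap term.

```python
import numpy as np

def td_control(env_step, n_states, n_actions, episodes=2000, steps=3000,
               gamma=0.5, eps=0.01, method="sarsa", seed=0):
    """Tabular SARSA / Q-learning with an epsilon-greedy behavior policy.

    env_step(s, a) -> (reward, s_next)   # hypothetical environment interface
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(q[s]))

    for _ in range(episodes):
        s = int(rng.integers(n_states))
        a = eps_greedy(s)
        for _ in range(steps):
            r, s_next = env_step(s, a)
            a_next = eps_greedy(s_next)
            if method == "sarsa":                      # on-policy target (Algorithm 7)
                target = r + gamma * q[s_next, a_next]
            else:                                      # off-policy Q-learning target (Algorithm 8)
                target = r + gamma * q[s_next].max()
            counts[s, a] += 1
            rho = 1.0 / counts[s, a]                   # learning rate as used in Example 11.2 below
            q[s, a] += rho * (target - q[s, a])
            s, a = s_next, a_next
    return q, q.argmax(axis=1)                         # action-value estimates and greedy policy
```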
Example 11.2. For our illustration, we select a small scale example with finite spaces
R = S = A = {1, . . . , 10} for the reward space, the state space and the action space,
respectively. We select a continuing task MDP by specifying an environment’s dynamics
p : S × R × S × A → (0, 1), see (11.7). This probability tensor p has $10^4$ = 10,000
entries that we select at random such that p(·, ·|s, a) sums to one for all state-action pairs
(s, a) ∈ S ×A, and such that it does not have any symmetries. For the gain computation,
we select a discount factor of γ = 1/2. Our goal is to find the optimal policy π ∗ that
maximizes the values vπ (s) for all initial states s ∈ S.
There exists a unique solution to this optimal control problem and we try to find it with
reinforcement learning. We present the four methods: (1) policy iteration presented in
Algorithm 4, (2) value iteration presented in Algorithm 5, (3) SARSA temporal difference
learning of Algorithm 7, and (4) Q-learning temporal difference of Algorithm 8. The first
two methods (1)-(2) are based on the knowledge of the true probability tensor p, methods
(3)-(4) do not use the knowledge of this probability tensor p, but only observed actual
experience, thus, the latter two methods present realistic learning problems. The SARSA
algorithm (3) performs on-policy learning, and Q-learning (4) off-policy learning.
Figure 11.5: Algorithm convergence analysis: (lhs) policy iteration Algorithm 4 and (rhs)
value iteration Algorithm 5; the x-axis shows the iterations k ≥ 0 of the algorithms.
Figure 11.5 shows the developments of the value functions vπk (s), s ∈ S, in the policy
iteration algorithm, and the value functions vk (s), s ∈ S, in the value iteration algorithm
for iterations k ≥ 0. The policy iteration Algorithm 4 converges in two iterations to the
optimal policy π ∗ , and the value iteration Algorithm 5 converges in roughly 20 iterations,
see Figure 11.5. For the policy iteration algorithm we solve the linear system (11.14) in
every step k ≥ 0 of the algorithm, which can easily be done here because Bπk is a small
matrix of size 10 × 10. For the value iteration algorithm we instead use a (single) Banach fixed point step in each iteration k ≥ 0. For this small scale example, this results in a less efficient algorithm, because the matrix inversion is very cheap here.
Table 11.1 shows the resulting optimal policy π ∗ which is the same in both algorithms.
state s 1 2 3 4 5 6 7 8 9 10
policy iteration 7 5 10 6 4 3 6 3 3 7
value iteration 7 5 10 6 4 3 6 3 3 7
Under a known environment’s dynamics p, we can easily find the optimal policy π ∗ , as
illustrated in Table 11.1. We now turn our attention to the more realistic situation of not
knowing the environment’s dynamics p. We therefore implement SARSA and Q-learning
temporal difference to determine an optimal policy. Recall, this is done as follows. Based
on state St , we exercise action At according to our actual policy π in place. Nature then
gives us the reward Rt+1 and the next state St+1 , based on the current state-action pair
(St , At ). This allows us to perform actual experience learning.
We implement SARSA temporal difference learning as follows. We select an ε-greedy
policy with ε = 1%, and we run the (continuing) task for 3000 iterations t ∈ {0, . . . , 3000},
that is, on average we explore 30 times, instead of exploiting the currently estimated
optimal policy. This seems a low value. We apply this procedure 2000 times, which
determines the outer loop in Algorithm 7. Finally, the learning rate ϱt = ϱt (St , At )
is chosen inversely proportional to the number of occurrences of the state-action pair
(St , At ) up to and including time t; this corresponds to (11.3) extended to the state
observation.
Figure 11.6: Algorithm convergence analysis: (lhs) SARSA temporal difference Algo-
rithm 7 and (rhs) Q-learning temporal difference Algorithm 8.
Figure 11.6 (lhs) shows the convergence behavior of the on-policy SARSA temporal dif-
ference algorithm. The x-axis shows the time scale t ∈ {0, . . . , 3000}, and the y-axis the
action-value functions qb(s, a) for selected state-action pairs (s, a). Each graph is averaged
over the 2000 runs of the outer loop of Algorithm 7. We do not yet observe full convergence, which indicates that we should run the continuing task for more time steps t. Based
on these action-value estimates qb(s, a), we determine the estimated optimal policy π b ∗ (s),
s ∈ S. The result is given in Table 11.2.
From Table 11.2, we observe that SARSA finds the true optimal policy π ∗ almost perfectly; only in state s = 9 do we estimate the optimal action to be a = π̂ ∗ (9) = 10, instead of the true optimal action π ∗ (9) = 3.
For the off-policy learning with Q-learning temporal difference we apply the same strategy
and the same parameters as for SARSA. The convergence behavior is shown in Figure
11.6 (rhs) and the estimated optimal policy π b ∗ is given on the last line of Table 11.2.
There is one misspecification with Q-learning, which concerns state s = 7, where we estimate the optimal action to be a = π̂ ∗ (7) = 2, instead of the true optimal action π ∗ (7) = 6.

state s                          1  2  3  4  5  6  7  8  9 10
policy iteration                 7  5 10  6  4  3  6  3  3  7
value iteration                  7  5 10  6  4  3  6  3  3  7
SARSA temporal difference        7  5 10  6  4  3  6  3 10  7
Q-learning temporal difference   7  5 10  6  4  3  2  3  3  7
Remarks 11.3. We close this tabular learning exposition with some further methods.
• There are many variants that aim at improving both accuracy and speed of con-
vergence. E.g., the temporal difference step (11.23) considers a one-step ahead
prediction which can easily be replaced by an n-step ahead prediction
if qbk denotes the estimate of the action-value function at time k. This motivates
the n-step temporal difference update
$$\widehat{q}_{t+n}(S_t, A_t) \leftarrow \widehat{q}_{t+n-1}(S_t, A_t) + \varrho_t \left( \widehat{G}_{t:t+n} - \widehat{q}_{t+n-1}(S_t, A_t) \right);$$
see Sutton–Barto [212, formula (7.5)]. Integrating this into the SARSA tempo-
ral difference algorithm gives the n-step SARSA which approaches Monte Carlo
estimation for n → ∞.
• A second modification is to average over different returns $\widehat{G}_{t:t+n}$. Selecting λ ∈ [0, 1), we can also approximate the gain Gt by
$$\widehat{G}^{\lambda}_{t} = (1 - \lambda) \sum_{n \ge 1} \lambda^{n-1}\, \widehat{G}_{t:t+n},$$
note that the weights aggregate to one. This gives the general temporal difference
methods called TD(λ); see Sutton–Barto [212, Chapter 12]. For λ = 1, this again
reduces to Monte Carlo estimation.
First, if one only operated on this multi-output network qϑ , one would end up in an unstable situation, because in the optimizations the unknown network parameter ϑ ap-
pears on both sides of the equation. To improve the stability of the algorithm, Mnih et
al. [159, 160] duplicated the network qϑ by a second network qϑτ , called target network,
that has the same architecture and only differs in the network parameter ϑτ . Both net-
works are initialized by the same network parameter ϑ = ϑτ , but the network parameter
of the first network will be updated more frequently, resulting in a second network qϑτ
that is more inert. This bigger inertia stabilizes the updates of the action-value esti-
mates qϑ because otherwise the estimated action-value estimates would use themselves
(in a self-referential way), see (11.25). Therefore, we need this more inert second network to receive meaningful results and to prevent over-fitting.
Second, to not waste any (costly) observations, every quadruple (St , At , Rt+1 , St+1 ) is
stored in a memory denoted by M; this idea was introduced by Lin [137]. For gradient
descent learning, we will (re-)sample random (mini-)batches from this memory M to
learn the network parameter ϑ. As a side effect, such random mini-batches break the
temporal correlation in the experience which is an advantage in gradient descent learning.
Note that this makes the following Algorithm 9 an off-policy algorithm.
The first part of the deep Q-network algorithm is similar to the Q-learning temporal
difference algorithm, see Algorithm 8, and the two algorithms start to differ in the step
where we use the memory M. Using this memory, we sample a mini-batch of size K, and
each of these samples is used to construct an approximative gain $\widehat{G}_k$ based on the second
inert network qϑτ , this is motivated by (11.22). The approximative gains are then used
to improve the network parameter ϑ of the first network in the next step, by optimally
predicting the approximative gains $(\widehat{G}_k)_{k=1}^{K}$ by this first network qϑ . Note that the multi-
output network qϑ (Sk , Ak ) has input Sk , and we select the output channel that coincides
with the value of Ak . For the loss function L one can use any strictly consistent loss
function for mean estimation, however, in reinforcement learning practice, also robust
versions of loss functions have been chosen. Finally, every τ ≫ 1 iterations, the inert
network qϑτ is updated, with either a soft update α ∈ (0, 1) or a hard update α = 1.
The above algorithm is again prone to provide a biased estimate by taking the maximum
in the Gb k estimate. The deep double-Q-network proposed by Hasselt et al. [91] tries to
compensate for this by considering an estimated gain instead
$$\widehat{G}_k = \begin{cases} R_{k+1} & \text{if } S_{k+1} \text{ is terminal},\\[4pt] R_{k+1} + \gamma\, q_{\vartheta_\tau}\!\left( S_{k+1},\, \underset{a \in \mathcal{A}(S_{k+1})}{\arg\max}\; q_\vartheta(S_{k+1}, a) \right) & \text{otherwise}. \end{cases}$$
(0) Select a random initial network parameter ϑ and initialize ϑτ = ϑ. Choose α ∈ (0, 1]
and small ε > 0. Initialize the memory M, set the (mini-)batch size K ∈ N and
select the step size τ ∈ N.
– Choose S0 ∈ S at random.
– Loop for t ≥ 0 until terminal value † for St :
∗ For state St , sample At from an ε-greedy policy of structure (11.19)-
(11.20) under the actual network parameter ϑ for qϑ .
∗ Given state-action pair (St , At ), observe reward Rt+1 and next state St+1 .
∗ Store (St , At , Rt+1 , St+1 ) to the memory M.
∗ Sample a mini-batch of size K from the memory M.
∗ Set training labels, for 1 ≤ k ≤ K,
$$\widehat{G}_k = \begin{cases} R_{k+1} & \text{if } S_{k+1} \text{ is terminal},\\[4pt] R_{k+1} + \gamma \max_{a \in \mathcal{A}(S_{k+1})} q_{\vartheta_\tau}(S_{k+1}, a) & \text{otherwise}. \end{cases}$$
ϑτ ← αϑ + (1 − α)ϑτ .
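A condensed sketch of the deep Q-network loop is given below (Python/PyTorch, our own simplified illustration of Algorithm 9; env_step and reset are hypothetical environment hooks, and the architecture and hyper-parameters are arbitrary). It combines the two stabilizing ingredients discussed above, the replay memory M and the inert target network.

```python
import copy, random
from collections import deque
import torch
import torch.nn as nn

def dqn_sketch(env_step, reset, n_features, n_actions, episodes=200,
               gamma=0.99, eps=0.05, K=32, tau=100, alpha=1.0, lr=1e-3):
    """Minimal DQN loop; env_step(s, a) -> (r, s_next, terminal), reset() -> s0,
    where states s are numeric feature vectors (hypothetical hooks)."""
    q = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
    q_target = copy.deepcopy(q)                    # inert target network q_{theta_tau}
    opt = torch.optim.Adam(q.parameters(), lr=lr)
    memory = deque(maxlen=10_000)                  # replay memory M
    step = 0
    for _ in range(episodes):
        s, terminal = reset(), False
        while not terminal:                        # one episode until the terminal state
            with torch.no_grad():
                qs = q(torch.as_tensor(s, dtype=torch.float32))
            a = random.randrange(n_actions) if random.random() < eps else int(qs.argmax())
            r, s_next, terminal = env_step(s, a)
            memory.append((s, a, r, s_next, terminal))
            if len(memory) >= K:                   # sample a mini-batch from M
                batch = random.sample(memory, K)
                S = torch.tensor([b[0] for b in batch], dtype=torch.float32)
                A = torch.tensor([b[1] for b in batch])
                R = torch.tensor([b[2] for b in batch], dtype=torch.float32)
                S2 = torch.tensor([b[3] for b in batch], dtype=torch.float32)
                done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
                with torch.no_grad():              # training labels G_k from the target network
                    G = R + gamma * (1.0 - done) * q_target(S2).max(dim=1).values
                pred = q(S).gather(1, A.view(-1, 1)).squeeze(1)
                loss = nn.functional.mse_loss(pred, G)
                opt.zero_grad()
                loss.backward()
                opt.step()
            if step % tau == 0:                    # soft (alpha < 1) or hard (alpha = 1) update
                with torch.no_grad():
                    for pt, p in zip(q_target.parameters(), q.parameters()):
                        pt.mul_(1 - alpha).add_(alpha * p)
            step += 1
            s = s_next
    return q
```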
$$\pi_\vartheta(a \mid s) = \frac{\exp\{ z_\vartheta(s)_a \}}{\sum_{a' \in \mathcal{A}} \exp\{ z_\vartheta(s)_{a'} \}},$$
• Actor. The actor is responsible for learning an optimal policy πϑ (a|s) by improving
the parameter ϑ based on the feedback signal the actor receives.
• Critic. The critic evaluates the taken action and gives a feedback signal to the
actor. The evaluation is often done with a value and/or an action-value function
estimate. This value function is estimated/improved by the critic analyzing the
resulting rewards of the taken actions.
for a given policy πϑ and starting in state s0 ∈ S. Using some algebra, we can reformulate
the gradient of the performance J(ϑ) w.r.t. ϑ as follows
$$\nabla_\vartheta J(\vartheta) \;\propto\; \mathbb{E}_{\pi_\vartheta}\left[\, \sum_{a \in \mathcal{A}} q_{\pi_\vartheta}(S_t, a)\, \nabla_\vartheta \pi_\vartheta(a \mid S_t) \,\middle|\, S_0 = s_0 \right]$$
$$= \mathbb{E}_{\pi_\vartheta}\left[\, q_{\pi_\vartheta}(S_t, A_t)\, \nabla_\vartheta \log \pi_\vartheta(A_t \mid S_t) \,\middle|\, S_0 = s_0 \right] \qquad (11.26)$$
$$= \mathbb{E}_{\pi_\vartheta}\left[\, G_t\, \nabla_\vartheta \log \pi_\vartheta(A_t \mid S_t) \,\middle|\, S_0 = s_0 \right];$$
see Sutton–Barto [212, Section 13.3]. There is a subtle issue hidden in this identity
(11.26), namely, the time index t seems undefined. In fact, in this identity t should be
a random variable such that St has the same distribution as the relative frequency of
the visits of the state-space sequence to the states in S under policy πϑ and starting
in S0 = s0 . Thus, St should have the long-term equilibrium distribution of the relative
numbers of visits to the states under Pπϑ [·|S0 = s0 ].
The policy update in Algorithm 10 is also called reinforce; see Sutton–Barto [212, Section
13.3]. This reinforce uses the so-called eligibility vector
$$\nabla_\vartheta \log \pi_\vartheta(A_t \mid S_t) = \frac{1}{\pi_\vartheta(A_t \mid S_t)}\, \nabla_\vartheta \pi_\vartheta(A_t \mid S_t),$$
which shows that the gradient ascent steps are composed of the policy-increasing direction ∇ϑ πϑ(At | St), reweighted by the inverse of the corresponding occurrence probabilities of the conditional actions At, in states St, so as not to favor more frequent actions. Having γ < 1 also allows us to use
Algorithm 10 for continuing tasks.
The policy update in Algorithm 10 can be generalized by including a state-dependent
baseline b(St ), being independent of At . This baseline will cancel in the gradient compu-
tations, and it will provide the reinforce with baseline policy update, see Sutton–Barto
[212, Section 13.4],
The advantage of this baseline, if chosen smartly, is that it can significantly reduce the
variance in the reinforcement learning algorithm. A typical, good selection is an estimate
of the value function b(St ) = vbπϑ (St ). Such a choice can significantly improve the speed of
convergence of the reinforcement learning algorithm; see, e.g., Sutton–Barto [212, Figure
13.2].
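For a tabular softmax policy, the reinforce with baseline policy update can be sketched as follows (Python/NumPy, our own illustration; run_episode is a hypothetical hook that plays one episode under the current policy and returns the observed state-action-reward triples).

```python
import numpy as np

def reinforce_with_baseline(run_episode, n_states, n_actions, iterations=5000,
                            gamma=0.99, lr_theta=0.05, lr_value=0.05):
    """Monte Carlo policy gradient (reinforce) with a learned value-function baseline.

    run_episode(policy_fn) -> list of (s, a, r) triples observed during one episode,
    where policy_fn(s) returns the probabilities pi_theta(.|s); a hypothetical hook.
    """
    z = np.zeros((n_states, n_actions))   # logits z_theta(s)_a of the softmax policy
    v = np.zeros(n_states)                # baseline b(s): estimated value function

    def policy(s):
        e = np.exp(z[s] - z[s].max())     # numerically stable softmax
        return e / e.sum()

    for _ in range(iterations):
        episode = run_episode(policy)
        G = 0.0
        for t in reversed(range(len(episode))):    # compute gains G_t backwards
            s, a, r = episode[t]
            G = r + gamma * G
            delta = G - v[s]                        # advantage w.r.t. the baseline
            v[s] += lr_value * delta                # update the baseline
            grad_log = -policy(s)                   # eligibility vector of the softmax policy:
            grad_log[a] += 1.0                      # grad log pi = one_hot(a) - pi(.|s)
            z[s] += lr_theta * (gamma ** t) * delta * grad_log
    return z, v
```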
feedback signal to the actor. This implies a dependence along the time line which allows one to improve the quality of the estimates (by bootstrapping). For illustration, we consider
the one-step actor-critic algorithm of Sutton–Barto [212, Section 13.5]. It is the analogue
to the one-step temporal difference algorithms from above such as the SARSA Algorithm
7.
First, we select a second parametrized function class {vw(s)}w, indexed by a real-valued vector w, and we assume that all considered terms are differentiable in w. We
then modify the reinforce with baseline policy update by a one-step temporal difference
update using the gain approximation (11.22) based on its expected version (11.24), and
the baseline b(St ) = vw (St )
Since this is (only) a one-step incremental update it can be done online, in contrast to
the Monte Carlo policy gradient control Algorithm 10.
Secondly, we also need to describe the value function update by temporal difference. This can be achieved by a second gradient ascent step
(0) Initialize ϑ and w, and select γ ∈ (0, 1] and step sizes ϱ > 0 and ϱ′ > 0.
In both of these two temporal difference steps (11.27) and (11.28), we consider a so-called
advantage function, which can have different forms depending on the specific algorithms
used. In our case it is
δt = Rt+1 + γvw (St+1 ) − vw (St ).
This compares the result of action At , given by Rt+1 + γvw (St+1 ), to the (average) value
that we would expect, given by vw (St ). Thus, the direction of the improvements is
multiplied by a step-size that adjusts for the advantage achieved by the corresponding
action.
There are many notable variants of the actor-critic Algorithm 11; we refer to the rapidly growing literature in this field. We would like to close this section with the proximal policy optimization (PPO) introduced by Schulman et al. [203]. For this, we first discuss trust
region policy optimization (TRPO) of Schulman et al. [202]. Many of the policy gradient
methods lack sufficient robustness, and TRPO is a method that tries to improve on that
point. Let us come back to the temporal difference step (11.27)
$$\vartheta \leftarrow \vartheta + \varrho\, \gamma^t \delta_t\, \nabla_\vartheta \log \pi_\vartheta(A_t \mid S_t),$$
with advantage function δt . This gradient ascent step can stem from a maximization
problem
$$\underset{\vartheta}{\arg\max}\; \gamma^t \delta_t \log \pi_\vartheta(A_t \mid S_t).$$
TRPO now argues that having an old estimate ϑold , the updated estimate ϑ should not
be too different from this previous estimate. This motivates regularization, see Section
2.4, and as penalty function we select the KL divergence of the resulting categorical dis-
tributions (we have assumed a finite action space A); for the KL divergence see (9.44).
Moreover, we replace log πϑ by a ratio of new and old policy, providing a different nor-
malization in the gradient ∇ϑ . This motivates the KL regularized optimization
$$\underset{\vartheta}{\arg\max}\, \left( \gamma^t \delta_t\, \frac{\pi_\vartheta(A_t \mid S_t)}{\pi_{\vartheta_{\mathrm{old}}}(A_t \mid S_t)} \;-\; \eta\, D_{\mathrm{KL}}\!\left( \pi_\vartheta(\cdot \mid S_t)\, \big\|\, \pi_{\vartheta_{\mathrm{old}}}(\cdot \mid S_t) \right) \right),$$
with regularization parameter η > 0. Since TRPO is comparably complex and does not allow, e.g., for drop-outs during fitting, PPO proposes a simpler method with comparable performance, see Schulman et al. [203]. The objective to be solved is given by
arg max_ϑ  min{ γ^t δt πϑ(At | St) / πϑold(At | St) ,  γ^t δt min( max( πϑ(At | St) / πϑold(At | St), 1 − ε ), 1 + ε ) },
for a clipping hyper-parameter ε ∈ (0, 1). Thus, the probability ratio is censored (clipped) at 1 − ε and 1 + ε. This removes the incentive to move the probability ratio out of the interval [1 − ε, 1 + ε]. This objective function is plotted in Figure 11.7 as a function of
the probability ratio r = r(ϑ) = πϑ ( At | St )/πϑold ( At | St ) > 0. The resulting function
depends on the sign of the advantage function δt ∈ R.
Figure 11.7: Objective function of PPO for ε = 1/2 and with the probability ratios
r = r(ϑ) = πϑ ( At | St )/πϑold ( At | St ) > 0 on the x-axis.
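The clipping mechanism behind Figure 11.7 can be reproduced with a few lines of Python; the grid of probability ratios and the advantage values ±1 below are illustrative assumptions.

import numpy as np

def ppo_objective(r, delta_t, eps=0.5, gamma_t=1.0):
    # min( gamma^t * delta_t * r, gamma^t * delta_t * clip(r, 1 - eps, 1 + eps) )
    unclipped = gamma_t * delta_t * r
    clipped = gamma_t * delta_t * np.clip(r, 1.0 - eps, 1.0 + eps)
    return np.minimum(unclipped, clipped)

# evaluate the objective on a grid of probability ratios, as in Figure 11.7 (eps = 1/2)
r = np.linspace(0.01, 3.0, 300)
obj_pos = ppo_objective(r, delta_t=1.0)    # delta_t > 0: capped at (1 + eps) * delta_t
obj_neg = ppo_objective(r, delta_t=-1.0)   # delta_t < 0: constant below 1 - eps, unbounded below for large r

For δt > 0 the objective never exceeds (1 + ε) γ^t δt, so there is no incentive to push the probability ratio beyond 1 + ε; for δt < 0 the objective is constant for ratios below 1 − ε, so there is no reward for pushing the ratio further down.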
Outlook
The present version of these notes covers a large part of the AI tools that actuaries should be familiar with, but the reader may have noticed that some topics are still missing, e.g., methods for interpretability, data visualization, variable importance, or a discussion of fairness and discrimination. We are going to provide more chapters covering these topics; for an overview of possible further topics, we also refer to:
https://actuary.eu/about-the-aae/continuous-professional-development/
Bibliography
[1] Abbas, A., Sutter, D., Zoufal, C., Lucchi, A., Figalli, A., Woerner, S. (2021). The power of
quantum neural networks. Nature Computational Science 1, June 2021, 403-409.
[2] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions
on Automatic Control 19/6, 716-723.
[3] Andrès, H., Boumezoued, A., Jourdain, B. (2024). Signature-based validation of real-world
economic scenarios. ASTIN Bulletin - The Journal of the IAA 54/2, 410-440.
[4] Arjovsky, M., Chintala, S., Bottou, L. (2017). Wasserstein GAN. Proceedings of the 34th
International Conference on Machine Learning (ICML), 214-223.
[5] Arnold, V.I. (1957). On functions of three variables. Doklady Akademii Nauk SSSR 114/4,
679-681.
[6] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Math-
ematical Society 68/3, 337-404.
[7] Avanzi, B., Taylor, G., Wang, M., Wong, B. (2024). Machine learning with high-cardinality
categorical features in actuarial applications. ASTIN Bulletin - The Journal of the IAA
54/2, 213-238.
[8] Ayer, M., Brunk, H.D., Ewing, G.M., Reid, W.T., Silverman, E. (1955). An empirical
distribution function for sampling with incomplete information. Annals of Mathematical
Statistics 26, 641-647.
[9] Ba, J.L., Kiros, J.R., Hinton, G.E. (2016). Layer normalization. arXiv:1607.06450.
[10] Bai, Y., Jones, A., Ndousse, K., Askell, A., Leike, J., Amodei, D. (2022). Constitutional
AI: Harmlessness from AI feedback. arXiv:2212.08073.
[11] Bailey, R.A., Simon, L.J. (1960). Two studies on automobile insurance ratemaking. ASTIN
Bulletin - The Journal of the IAA 1, 192-217.
[12] Bar-Lev, S. K., Enis, P. (1986). Reproducibility and natural exponential families with power
variance functions. The Annals of Statistics 14, 1507-1522.
[13] Bar-Lev, S.K., Kokonendji, C.C. (2017). On the mean value parametrization of the natural exponential family - a revisited review. Mathematical Methods of Statistics 26/3, 159-175.
[14] Barlow, R.E., Bartholomew, D.J., Bremner, J.M., Brunk, H.D. (1972). Statistical Inference under Order Restrictions. John Wiley & Sons.
[15] Barlow, R.E., Brunk, H.D. (1972). The isotonic regression problem and its dual. Journal
of the American Statistical Association 67/337, 140-147.
[16] Barndorff-Nielsen, O. (2014). Information and Exponential Families: In Statistical Theory.
John Wiley & Sons.
[17] Beltagy, I., Peters, M.E., Cohan, A. (2020). Longformer: The long-document transformer.
arXiv:2004.05150.
[18] Bender, E.M., Koller, A. (2020). Climbing towards NLU: On meaning, form, and under-
standing in the age of data. Proceedings of the Annual Meeting of the ACL.
[19] Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N. (2015). Scheduled sampling for sequence pre-
diction with recurrent neural networks. Advances in Neural Information Processing Systems
28, 1171-1179.
[20] Bengio Y., Courville A., Vincent P. (2013). Representation learning: a review and new
perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence
35/8, 1798-1828.
[21] Bengio Y., Ducharme R., Vincent P., Jauvin C. (2003). A neural probabilistic language
model. Journal of Machine Learning Research 3/Feb, 1137-1155.
[22] Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., Gauvain, J.-L. (2006). Neural proba-
bilistic language models. In: Innovations in Machine Learning. Holmes, D.E., Jain, L.C.
(Eds.). Springer, Studies in Fuzziness and Soft Computing 194, 137-186.
[23] Blæsild, P., Jensen, J.L. (1985). Saddlepoint formulas for reproductive exponential models.
Scandinavian Journal of Statistics 12/3, 193-202.
[24] Bommasani, R., Hudson, D.A., Adeli, E., et al. (2021). On the opportunities and risks of
foundation models. arXiv:2108.07258.
[25] Boureau, Y.L., Ponce, J., LeCun, Y. (2010). A theoretical analysis of feature pooling in
vision recognition. In: Proceedings of the International Conference on Machine Learning
ICML 65.
[26] Brauer, A. (2024). Enhancing actuarial non-life pricing models via transformers. European
Actuarial Journal 14/3, 991-1012.
[27] Brébisson, de A., Simon, É., Auvolat, A., Vincent, P., Bengio, Y. (2015). Artificial neural
networks applied to taxi destination prediction. arXiv:1508.00021.
[31] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression
Trees. Wadsworth Statistics/Probability Series.
[32] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan,
A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners.
Advances in Neural Information Processing Systems 33, 1877-1901.
[33] Brunk, H.D., Ewing, G.M., Utz, W.R. (1957). Minimizing integrals in certain classes of
monotone functions. Pacific Journal of Mathematics 7, 833-847.
[34] Bühlmann, H. (1967). Experience rating and credibility. ASTIN Bulletin - The Journal of
the IAA 4/3, 199-207.
[35] Bühlmann, H., Gisler, A. (2005). A Course in Credibility Theory and its Applications.
Springer Universitext.
[36] Bühlmann, H., Straub, E. (1970). Glaubwürdigkeit für Schadensätze. Mitteilungen der
Schweizerischen Vereinigung der Versicherungsmathematiker 1970, 111-131.
[37] Bühlmann, P. (2002). Consistency for L2 boosting and matching pursuit with trees and tree-
type basis functions. In: Research Report/Seminar für Statistik, Eidgenössische Technische
Hochschule (ETH), Vol. 109. Seminar für Statistik, Eidgenössische Technische Hochschule
(ETH).
[38] Burges, C.J.C. (1998). Tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery 2, 121-167.
[39] Cao, Q., Sutton, C., Liska, A., Titov, I. (2021). Neuro-symbolic probing in neural language
models. Proceedings of the Annual Meeting of the ACL.
[40] Chaubard, F., Mundra, R., Socher, R. (2016). Deep Learning for Natural Language Pro-
cessing. Lecture Notes, Stanford University.
[41] Chen, T., Guestrin, C. (2016). XGBoost: a scalable tree boosting system. In: KDD ’16:
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 785-794.
[42] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Ben-
gio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv:1406.1078.
[43] Chowdhery, A., Narang, A., Devlin, J., et al. (2022). PaLM: Scaling language modeling with pathways. arXiv:2204.02311.
[44] Christiano, P.F., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D. (2017). Deep
reinforcement learning from human preferences. Advances in Neural Information Processing
Systems 30, 4299-4307.
[45] Cortes, C., Vapnik, V. (1995). Support-vector networks. Machine Learning 20/3, 273-297.
[46] Cunningham, H., Ewart, A., Riggs, L., Huben, R., Sharkey, L. (2023). Sparse autoencoders
find highly interpretable features in language models. arXiv:2309.08600.
[48] DeepSeek Research Team (2025). DeepSeek R1: A family of open-source reasoning LLMs.
arXiv:2501.12948.
[49] Delong, Ł, Kozak, A. (2023). The use of autoencoders for training neural networks with
mixed categorical and numerical features. ASTIN Bulletin - The Journal of the IAA 53/2,
213-232.
[50] Delong, Ł., Lindholm, M. and Zakrisson, H., (2023). On cyclic gradient boosting machines.
SSRN Manuscript ID 4352505.
[51] Delong, Ł, Wüthrich, M.V. (2024). Isotonic regression for variance estimation and its role
in mean estimation and model validation. North American Actuarial Journal, in press.
[52] Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood for incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, Series B 39/1, 1-22.
[53] Denuit, M., Charpentier, A., Trufin, J. (2021). Autocalibration and Tweedie-dominance for
insurance pricing in machine learning. Insurance: Mathematics and Economics 101/B,
485-497.
[54] Denuit, M., Trufin, J. (2021). Lorenz curve, Gini coefficient, and Tweedie dominance for
autocalibrated predictors. LIDAM Discussion Paper ISBA 2021/36.
[55] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of deep
bidirectional Transformers for language understanding. arXiv:1810.04805.
[56] Dietterich, T.G. (2000) An experimental comparison of three methods for constructing
ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40,
139-157.
[57] Dietterich, T.G. (2000). Ensemble methods in machine learning. In: Multiple Classifier
Systems, Kittel, J., Roli, F. (eds.). Lecture Notes in Computer Science 1857. Springer,
1-15.
[58] Donaldson, J. (2016). t-distributed stochastic neighbor embedding for R (t-SNE). R package
tsne.
[59] Duan, T., Anand, A., Ding, D.Y., Thai, K.K., Basu, S., Ng, A., Schuler, A. (2020). NGBoost: Natural gradient boosting for probabilistic prediction. In: International Conference on Machine Learning, Proceedings of Machine Learning Research, 2690-2700.
[60] Dutang, C., Charpentier, A., Gallic, E. (2024). Insurance dataset. Recherche Data Gouv.
https://doi.org/10.57745/P0KHAG
[61] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics
7/1, 1-26.
[62] Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall.
[63] Embrechts, P., Klüppelberg, C., Mikosch, T. (2003). Modelling Extremal Events for Insur-
ance and Finance. 4th printing. Springer.
[64] Ester, M., Kriegel, J.P., Sander, J., Xu, X. (1996). A density-based algorithm for discovering
clusters in large spatial databases with noise. In: Proceedings of the Second International
Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. 226-231.
[65] Fahrmeir, L., Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized
Linear Models. Springer.
[66] Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association 96/456 1348-1360.
[67] Ferrario, A., Hämmerli, R. (2019). On boosting: theory and applications. SSRN Manuscript
ID 3402687.
[68] Fisher, R.A. (1934). Two new properties of mathematical likelihood. Proceeding of the Royal
Society A 144/852, 285-307.
[69] Fissler, T., Lorentzen, C., Mayer, M. (2022). Model comparison and calibration assess-
ment: user guide for consistent scoring functions in machine learning and actuarial practice.
arXiv:2202.12780.
[70] Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics 29/5, 1189-1232.
[71] Fritschi, S., Guenther, F., Wright, M.N., Suling, M., Mueller, S.M. (2019). Training of
neural networks. R package neuralnet.
[72] Gao, G., Wang, H., Wüthrich, M.V. (2022). Boosting Poisson regression models with telem-
atics car driving data. Machine Learning 111/1, 243-272.
[73] Ghosh, P., Sajjadi, M.S.M., Vergari, A., Black, M., Schölkopf, B., (2020). From varia-
tional to deterministic autoencoders. International Conference on Learning Representations
(ICLR).
[74] Gini, C. (1912). Variabilità e Mutabilità. Contributo allo Studio delle Distribuzioni e delle
Relazioni Statistiche. C. Cuppini, Bologna.
[75] Gini, C. (1936). On the measure of concentration with special reference to income and
statistics. Colorado College Publication, General Series No. 208, 73-79.
[76] Glorot, X., Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In: Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, Proceedings of Machine Learning Research 9, 249-256.
[77] Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Sta-
tistical Association 106/494, 746-762.
[78] Gneiting, T., Raftery, A.E. (2007). Strictly proper scoring rules, prediction, and estimation.
Journal of the American Statistical Association 102/477, 359-378.
[79] Gneiting, T., Ranjan, R. (2013). Combining predictive distributions. Electronic Journal of
Statistics 7, 1747-1782.
[80] Gneiting, T., Resin, J. (2023). Regression diagnostics meets forecast evaluation: conditional
calibration, reliability diagrams, and coefficient of determination. Electronic Journal of
Statistics 17, 3226-3286.
[81] Goldburd, M., Khare, A., Tevet, D., Guller, D. (2020). Generalized Linear Models for
Insurance Rating. 2nd edition. CAS Monograph Series, 5.
[82] Golub, G., Van Loan, C. (1983). Matrix Computations. Johns Hopkins University Press.
[83] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press, http://www.
deeplearningbook.org
[84] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Pro-
cessing Systems 27.
[85] Gorishniy, Y., Kotelnikov, A., Babenko, A. (2024). TabM: advancing tabular deep learning
with parameter-efficient ensembling. arXiv:2410.24210.
[86] Gorishniy, Y., Rubachev, I., Khrulkov, V., Babenko, A. (2021). Revisiting deep learning
models for tabular data. In: Beygelzimer, A., Dauphin, Y., Liang, P., Wortman Vaughan,
J. (eds). Advances in Neural Information Processing Systems, 34. Curran Associates, Inc.,
New York, 18932-18943.
[87] Gourieroux, C., Monfort, A., Trognon, A. (1984). Pseudo maximum likelihood methods:
theory. Econometrica 52/3, 681-700.
[88] Guo, C., Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv:1604.06737.
[89] Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning (ICML), 1321-1330.
[90] Hainaut, D., Trufin, J., Denuit, M. (2022). Response versus gradient boosting trees, GLMs
and neural networks under Tweedie loss and log-link. Scandinavian Actuarial Journal
2022/10, 841-866.
[91] Hasselt, van H., Guez, A., Silver, D. (2015). Deep reinforcement learning with double Q-
learning. arXiv:1509.06461.
[92] Hastie, T., Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical
Society Series B: Statistical Methodology 55/4, 757-779.
[93] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. 2nd edition. Springer Series in Statistics.
[94] Hastie, T., Tibshirani, R., Wainwright, M. (2015). Statistical Learning with Sparsity: The
Lasso and Generalizations. CRC Press.
[95] Havrylenko, Y., Heger, J. (2024) Detection of interacting variables for generalized linear
models via neural networks. European Actuarial Journal 14/2, 551-580.
[96] He, K., Zhang, X., Ren, S., Sun, J. (2015). Deep residual learning for image recognition.
arXiv:1512.03385.
[97] Hinton, G.E., Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural
networks. Science 313/5786, 504-507.
[98] Hinton, G., Srivastava, N., Swersky, K. (2012). Neural Networks for Machine Learning.
Lecture Slides. University of Toronto.
[99] Ho, J., Jain, A., Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in
Neural Information Processing Systems 33, 6840-6851.
[100] Hochreiter, S., Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9/8,
1735-1780.
[101] Hofner, B., Mayr, A., Robinzonov, N., Schmid, M. (2014). Model-based boosting in R: A hands-on tutorial using the R package mboost. Computational Statistics 29, 3-35.
[102] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural
Networks 4/2, 251-257.
[103] Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are uni-
versal approximators. Neural Networks 2/5, 359-366.
[104] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A.,
Attariyan, M., Gelly, S. (2019). Parameter-efficient transfer learning for NLP. Proceedings
of the 36th International Conference on Machine Learning (ICML), 2790-2799.
[105] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.
(2022). LoRA: Low-rank adaptation of large language models. International Conference
on Learning Representations (ICLR).
[106] Huang, X., Khetan, A., Cvitkovic, M., Karnin, Z. (2020). TabTransformer: Tabular data
modeling using contextual embeddings. arXiv:2012.06678.
[107] Ioffe, S., Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In: Proceedings of the 32nd International Conference on
Machine Learning 37, 448-456.
[108] Isenbeck, M., Rüschendorf, L. (1992). Completeness in location families. Probability and
Mathematical Statistics 13/2, 321-343.
[109] James, G., Witten, D., Hastie, T., Tibshirani, R. (2015). An Introduction to Statistical
Learning. With Applications in R. Corrected 6th printing. Springer.
[110] Jørgensen, B. (1986). Some properties of exponential dispersion models. Scandinavian Jour-
nal of Statistics 13/3, 187-197.
[111] Jørgensen, B. (1987). Exponential dispersion models. Journal of the Royal Statistical Soci-
ety, Series B 49/2, 127-145.
[112] Jørgensen, B. (1997). The Theory of Dispersion Models. Chapman & Hall.
[113] Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language
models. arXiv:2001.08361.
[114] Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. MSc Thesis. Department of Mathematics, University of Chicago.
[115] Kaufman, L., Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster
Analysis. John Wiley & Sons.
[116] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y. (2017).
LightGBM: a highly efficient gradient boosting decision tree. Advances in Neural Information
Processing Systems 30, 3146-3154.
[117] Khattab, O., Singhvi, A., Maheshwari, P., et al. (2023). DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv:2310.03714.
[118] Kingma, D., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
[119] Kingma, D.P, Welling, M. (2013). Auto-encoding variational Bayes. arXiv:1312.6114.
[120] Kingma, D.P., Welling, M. (2019). An introduction to variational autoencoders. Founda-
tions and Trends in Machine Learning 12/4, 307-392.
[121] Kirsch, L., Wang, Y., Zhao, Y., Pickett, M. (2022). Meta-learning in-context transformers.
arXiv:2209.07680.
[122] Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biolog-
ical Cybernetics 43, 59-69.
[123] Kohonen, T. (2001). Self-Organizing Maps. 3rd edition. Springer.
[124] Kohonen, T. (2013). Essentials of the self-organizing map. Neural Networks 37, 52-65.
[125] Kolmogorov, A. (1957). On the representation of continuous functions of many variables
by superposition of continuous functions of one variable and addition. Doklady Akademii
Nauk SSSR 114/5, 953-956.
[126] Kramer, M.A. (1991). Nonlinear principal component analysis using autoassociative neural
networks. AIChE Journal 37/2, 233-243.
[127] Krüger, F., Ziegel, J.F. (2021). Generic conditions for forecast dominance. Journal of Business & Economic Statistics 39/4, 972-983.
[128] Kruskal, J.B. (1964). Nonmetric multidimensional scaling. Psychometrika 29, 115-129.
[129] Kuhn, H.W., Tucker, A.W. (1951). Nonlinear programming. In: Proceedings of 2nd Berkeley
Symposium. University of California Press, 481-492.
[130] LeCun, Y., Bengio, Y. (1995). Convolutional networks for images, speech, and time series.
The Handbook of Brain Theory and Neural Networks 3361/10.
[131] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE 86/11, 2278-2324.
[132] Lee, R.D., Carter, L.R. (1992). Modeling and forecasting U.S. mortality. Journal of the
American Statistical Association 87/419, 659-671.
[133] Leeuw, de J., Hornik, K., Mair, P. (2009). Isotone optimization in R: pool-adjacent-violators
algorithm (PAVA) and active set methods. Journal of Statistical Software 32/5, 1-24.
[134] Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S. (1993). Multilayer feedforward networks
with a nonpolynomial activation function can approximate any function. Neural Networks
6/6, 861-867.
[135] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V.,
Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural lan-
guage generation, translation, and comprehension. Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, 7871-7880.
[136] Li, X., Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics,
4582-4597.
[137] Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning
and teaching. Machine Learning 8/3-4, 293-321.
[138] Lindholm, M., Lindskog, F., Palmquist, J. (2023). Local bias adjustment, duration-weighted
probabilities, and automatic construction of tariff cells. Scandinavian Actuarial Journal
2023/10, 946-973.
[139] Lindholm, M., Palmborg, L. (2022). Efficient use of data for LSTM mortality forecasting.
European Actuarial Journal 12/2, 749-778.
[140] Lindholm, M., Wüthrich, M.V. (2024). The balance property in insurance pricing. SSRN
Manuscript ID 4925165.
[141] Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T.Y., Tegmark,
M. (2024). KAN: Kolmogorov–Arnold networks. arXiv:2404.19756.
[142] Loader, C. (1999). Local Regression and Likelihood. Springer.
[143] Lorentzen, C., Mayer, M., Wüthrich, M.V. (2022). Gini index and friends. SSRN Manuscript
ID 4248143.
[144] Lorenz, M.O. (1905). Methods of measuring the concentration of wealth. Publications of
the American Statistical Association 9/70, 209-219.
[145] Loshchilov, I., Hutter, F. (2017). Decoupled weight decay regularization. International Con-
ference on Learning Representations (ICLR).
[146] Mack, T. (1993). Distribution-free calculation of the standard error of chain ladder reserve
estimates. ASTIN Bulletin - The Journal of the IAA 23/2, 213-225.
[147] Makhzani, A., Frey, B. (2014). K-sparse autoencoders. International Conference on Learn-
ing Representations (ICLR).
[148] Mallat, S., Zhang, Z. (1993). Matching pursuits with time frequency dictionaries. IEEE
Transactions on Signal Processing 41, 3397-3415.
[149] Mayr, A., Fenske, N., Hofner, B., Kneib, T., Schmid, M., (2012). Generalized additive
models for location, scale and shape for high dimensional data – a flexible approach based
on boosting. Journal of the Royal Statistical Society Series C: Applied Statistics 61/3,
403-427.
[150] McCullagh, P., Nelder, J.A. (1983). Generalized Linear Models. Chapman & Hall.
[151] McInnes, L., Healy, J., Melville, J. (2018). UMAP: uniform manifold approximation and
projection for dimension reduction. arXiv:1802.03426v2.
[152] McLachlan, G.J., Krishnan, T. (2008). The EM Algorithm and Extensions. 2nd edition.
John Wiley & Sons.
[153] Menon, A.K., Jiang, X., Vembu, S., Elkan, C., Ohno-Machado, L. (2012). Predicting accu-
rate probabilities with ranking loss. ICML’12: Proceedings of the 29th International Con-
ference on Machine Learning, 659-666.
[154] Mercer, J. (1909). Functions of positive and negative type and their connection with the
theory of integral equations. Philosophical Transactions of the Royal Society A 209/441-
458, 415-446.
[155] Micci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes
in classification and prediction problems. ACM SIGKDD Explorations Newsletter 3/1, 27-
32.
[156] Mikolov, T., Chen, K., Corrado, G.S., Dean, J. (2013). Efficient estimation of word repre-
sentations in vector space. arXiv:1301.3781.
[157] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J. (2013). Distributed represen-
tations of words and phrases and their compositionality. Advances in Neural Information
Processing Systems 26, 3111-3119.
[158] Miles, R.E. (1959). The complete amalgamation into blocks, by weighted means, of a finite
set of real numbers. Biometrika 46, 317-327.
[159] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller,
M. (2013). Playing Atari with deep reinforcement learning. arXiv:1312.5602.
[160] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves,
A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Antonoglou,
I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D. (2015). Human-level control
through deep reinforcement learning. Nature 518/7540, 529-533.
[161] Murphy, A.H. (1973). A new vector partition of the probability score. Journal of Applied
Meteorology 12/4, 595-600.
[162] Murphy, K.P. (2024). Reinforcement learning: an overview. arXiv:2412.05265.
[163] Nanda, N., Lindner, J., Belrose, C., Olsson, C. (2023). Activation patching for mechanistic
interpretability. Proceedings of the Mechanistic Interpretability Workshop.
[164] Nelder, J.A., Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the Royal
Statistical Society, Series A 135/3, 370-384.
[165] Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. Tech-
nical Report 76. Center for Operations Research and Econometrics (CORE), Catholic Uni-
versity of Louvain.
[166] Nesterov, Y. (2018). Lectures on Convex Optimization. Springer.
[167] Nigri, A., Levantesi, S., Marino, M., Scognamiglio, S., Perla, F. (2019). A deep learning
integrated Lee–Carter model. Risks 7/1, 33.
[168] NovaSky Team (2025). Sky-T1: Train your own O1 preview model within $450. https://novasky-ai.github.io/posts/sky-t1, accessed: 2025-01-09.
[169] Odaibo, S. (2019). Tutorial: deriving the standard variational autoencoder (VAE) loss
function. arXiv:1907.08956.
[170] Olah, C., Mordvintsev, A., Schubert, L. (2018). Feature visualization. Distill. https://distill.pub/2017/feature-visualization
[171] Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K., Mordvintsev, A.
(2020). An overview of early vision in InceptionV1. Distill. https://distill.pub/2020/
circuits/early-vision
[172] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Christiano, P., Leike, J., Amodei, D. (2022).
Training language models to follow instructions with human feedback. arXiv:2203.02155.
[173] Palmborg, L, Lindskog, F. (2023). Premium control with reinforcement learning. ASTIN
Bulletin - The Journal of the IAA 53/2, 233-257.
[174] Pan, J., Zhang, J., Wang, X., Yuan, L., Peng, H., Suhr, A. (2025). TinyZero. https://github.com/Jiayi-Pan/TinyZero, accessed: 2025-01-24.
[175] Pennington, J., Socher, R., Manning, C.D. (2014). GloVe: global vectors for word repre-
sentation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), 1532-1543.
[176] Perla, F., Richman, R., Scognamiglio, S., Wüthrich, M.V. (2024). Accurate and explainable
mortality forecasting with the LocalGLMnet. Scandinavian Actuarial Journal 2024/7, 1-
23.
[177] Pohle, M.-O. (2020). The Murphy decomposition and the calibration-resolution principle:
A new perspective on forecast evaluation. arXiv:2005.01835.
[178] Puterman, M.L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Program-
ming. John Wiley & Sons.
[179] R Core Team (2021). R: A language and environment for statistical computing. R Founda-
tion for Statistical Computing, Vienna, Austria. http://www.R-project.org/.
[180] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. (2018). Improving language un-
derstanding by generative pre-training. OpenAI Technical Report.
[181] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu,
P.J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer.
Journal of Machine Learning Research 21/140, 1-67.
[182] Raschka, S. (2025). Understanding reasoning LLMs: Methods and strategies for building
and refining reasoning models. Blog Post, February 5, 2025.
[183] Rentzmann, S., Wüthrich, M.V. (2019). Unsupervised learning: What is a sports car?
SSRN Manuscript ID 3439358.
[184] Rezende, D.J., Mohamed, S., Wierstra, D. (2014). Stochastic backpropagation and approx-
imate inference in deep generative models. Proceedings of the 31st International Conference
on Machine Learning, 1278-1286.
[185] Ribeiro, M.T., Singh, S., Guestrin, C. (2016). “Why should I trust you?”: explaining
the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’16. New York: Association
for Computing Machinery, 1135-1144.
[186] Richman, R. (2021). AI in actuarial science - a review of recent advances - part 1. Annals
of Actuarial Science 15/2, 207-229.
[187] Richman, R. (2021). AI in actuarial science - a review of recent advances - part 2. Annals
of Actuarial Science 15/2, 230-258.
[188] Richman, R., Scognamiglio, S., and Wüthrich, M. V. (2025). The credibility transformer.
European Actuarial Journal, in press.
[189] Richman, R., Wüthrich, M.V. (2020). Nagging predictors. Risks 8/3, article 83.
[190] Richman, R., Wüthrich, M.V. (2023). LocalGLMnet: interpretable deep learning for tabular
data. Scandinavian Actuarial Journal 2023/1, 71-95.
[191] Richman, R., Wüthrich, M.V. (2023). LASSO regularization within the LocalGLMnet ar-
chitecture. Advances in Data Analysis and Classification 17/4, 951-981.
[192] Richman, R., Wüthrich, M.V. (2024). High-cardinality categorical covariates in network
regressions. Japanese Journal of Statistics and Data Science 7/2, 921-965.
[193] Richman, R., Wüthrich, M.V. (2024). Smoothness and monotonicity constraints for neural networks using ICEnet. Annals of Actuarial Science 18/3, 712-739.
[194] Ridgeway, G. (2024). Generalized boosted models: a guide to the gbm package. https://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf
[195] Ronneberger, O., Fischer, P., Brox, T. (2015). U-Net: Convolutional networks for biomed-
ical image segmentation. Medical Image Computing and Computer-Assisted Intervention
(MICCAI), 234-241.
[196] Rumelhart, D.E., Hinton, G.E., Williams, R.J. (1986). Learning representations by back-
propagating errors. Nature 323/6088, 533-536.
[197] Saerens, M. (2000). Building cost functions minimizing to some summary statistics. IEEE
Transactions on Neural Networks 11, 1263-1271.
[198] Savage, L.J. (1971). Elicitation of personal probabilities and expectations. Journal of the
American Statistical Association 66/336, 783-810.
[199] Schervish, M.J. (1989). A general method of comparing probability assessors. The Annals
of Statistics 17/4, 1856-1879.
[200] Schölkopf, B., Smola, A., Müller, K.R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10/5, 1299-1319.
[201] Schubert, E., Rousseeuw, P.J. (2019). Faster k-medoids clustering: improving the PAM,
CLARA, and CLARANS algorithms. arXiv:1810.05691v3.
[202] Schulman, J., Levine, S., Moritz, P., Jordan, M.I., Abbeel, P. (2015). Trust region policy
optimization. arXiv:1502.05477.
[203] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal policy
optimization algorithms. arXiv:1707.06347.
[204] Schwarz, G.E. (1978). Estimating the dimension of a model. Annals of Statistics 6/2, 461-
464.
[205] Scognamiglio, S. (2022). Calibrating the Lee–Carter and the Poisson Lee–Carter models
via neural networks. ASTIN Bulletin - The Journal of the IAA 52/2, 519-561.
[206] Semenovich, D., Dolman, C. (2020). What makes a good forecast? Lessons from meteorol-
ogy. 20/20 All-Actuaries Virtual Summit, The Institute of Actuaries, Australia.
[207] Shmueli, G. (2010). To explain or to predict? Statistical Science 25/3, 289-310.
[208] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S. (2015). Deep unsupervised
learning using nonequilibrium thermodynamics. Proceedings of the 32nd International Con-
ference on Machine Learning, 2256-2265.
[209] Song, Y., Ermon, S. (2019). Generative modeling by estimating gradients of the data dis-
tribution. Advances in Neural Information Processing Systems 32.
[210] Srivastava, N., Hinton, G., Krizhevsky, A. Sutskever, I., Salakhutdinov, R. (2014). Dropout:
a simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research 15/56, 1929-1958.
[211] Su, J., Lu, R., Huang, G., Liang, Y., Xia, F. (2021). RoFormer: Enhanced transformer
with rotary position embedding. arXiv:2104.09864.
[212] Sutton, R.S., Barto, A.G. (2018). Reinforcement Learning: An Introduction. MIT Press.
[213] Tasche, D. (2006). Validation of internal rating systems and PD estimates. arXiv:0606071.
[214] Tasche, D. (2021). Calibrating sufficiently. Statistics: A Journal of Theoretical and Applied
Statistics 55/6, 1356-1386.
[215] Therneau, T.M., Atkinson, E.J. (2015). An introduction to recursive partitioning using the
RPART routines. R Vignettes, version of June 29, 2015. Mayo Foundation, Rochester.
[216] Thomson, W. (1979). Eliciting production possibilities from a well-informed manager. Jour-
nal of Economic Theory 20, 360-380.
[217] Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the
Royal Statistical Society, Series B 58/1, 267-288.
[218] Tibshirani, R., Saunders, M., Rosset, S., Knight, K. (2005). Sparsity and smoothness via
the fused LASSO. Journal of the Royal Statistical Society, Series B 67/1, 91-108.
[219] Tikhonov, A.N. (1943). On the stability of inverse problems. Doklady Akademii Nauk SSSR
39/5, 195-198.
[221] Tsyplakov, A. (2013). Evaluation of probabilistic forecasts: proper scoring rules and mo-
ments. SSRN Manuscript ID 2236605.
[222] Tweedie, M.C.K. (1984). An index which distinguishes between some important exponential
families. In: Statistics: Applications and New Directions. Ghosh, J.K., Roy, J. (Eds.). Pro-
ceeding of the Indian Statistical Golden Jubilee International Conference, Indian Statistical
Institute, Calcutta, 579-604.
[223] van der Maaten, L.J.P., Hinton, G.E. (2008). Visualizing data using t-SNE. Journal of
Machine Learning Research 9, 2579-2605.
[224] van der Merwe, M., Richman, R. (2024). Responsible AI: The role of actuaries in bridging
the trust deficit. Presented at the ASSA 2024 Convention in Cape Town.
[225] Vapnik, V.N. (1997). The support vector method. In: Artificial Neural Networks –
ICANN’97, Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (eds.). Lecture Notes in
Computer Science. Vol. 1327. Springer.
[226] Vapnik, V.N., Chervonenkis, A.Y. (1964). On a class of perceptrons. Avtomatika i Tele-
mekhanika 25/1.
[227] Vapnik, V.N., Chervonenkis, A.Y. (1964). On a class of algorithms of learning pattern
recognition. Avtomatika i Telemekhanika 25/6.
[228] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
Polosukhin, I. (2017). Attention is all you need. arXiv:1706.03762v5.
[229] Wager, S., Wang, S., Liang, P.S. (2013). Dropout training as adaptive regularization. In:
Advances in Neural Information Processing Systems 26. Burges, C., Bottou, L., Welling,
M., Ghahramani, Z., Weinberger, K. (Eds.). Curran Associates, 351-359.
[230] Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of
Mathematical Statistics 20/4, 595-601.
[231] Wang, C.W., Zhang, J., Zhu, W. (2021). Neighbouring prediction for mortality. ASTIN
Bulletin: The Journal of the IAA 51/3, 689-718.
[232] Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD Thesis, University of Cam-
bridge.
[233] Watkins, C.J.C.H., Dayan, P. (1992). Q-learning. Machine Learning 8/3-4, 279-292.
[234] Wei J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q.,
Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models.
arXiv:2201.11903.
[235] Werbos, P.J. (1988). Generalization of backpropagation with application to a recurrent gas
market model. Neural Networks 1/4, 339-356.
[236] Williams, R.J., Zipser, D. (1989). A learning algorithm for continually running fully recur-
rent neural networks. Neural Computation 1/2, 270-280.
[237] Wu, C.F.J. (1983). On the convergence properties of the EM algorithm. The Annals of
Statistics 11/1, 95-103.
[238] Wüthrich, M.V. (2020). Bias regularization in neural network models for general insurance
pricing. European Actuarial Journal 10/1, 179-202.
[239] Wüthrich, M.V. (2023). Model selection with Gini indices under auto-calibration. European
Actuarial Journal 13/1, 71-95.
[240] Wüthrich, M.V. (2025). Auto-calibration tests for discrete finite regression functions. Eu-
ropean Actuarial Journal, in press.
[241] Wüthrich, M.V., Buser, C. (2016). Data Analytics for Non-Life Insurance Pricing. SSRN
Manuscript ID 2870308, Version of June 19, 2023.
[242] Wüthrich, M.V., Merz, M. (2019). Editorial: Yes, we CANN! ASTIN Bulletin - The Journal
of the IAA 49/1, 1-3.
[243] Wüthrich, M.V., Merz, M. (2023). Statistical Foundations of Actuarial Learning
and its Applications. Springer Actuarial. https://link.springer.com/book/10.1007/
978-3-031-12409-9
[244] Wüthrich, M.V., Ziegel, J. (2024). Isotonic recalibration under a low signal-to-noise ratio.
Scandinavian Actuarial Journal 2024/3, 279-299.
[245] Xie, S.M., Yala, L., Liu, Q. (2021). An explanation of in-context learning as implicit
Bayesian inference. arXiv:2110.08387.
[246] Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society, Series B 68/1, 49-67.
[247] Zakrisson, H., Lindholm, M. (2025). A tree-based varying coefficient model. Computational
Statistics, in press.
[248] Zhang, T., Yu, B. (2005). Boosting with early stopping: Convergence and consistency. The
Annals of Statistics 33/4, 1538-1579.
[249] Zhou, Y., Hooker, G. (2022). Decision tree boosted varying coefficient models. Data Mining
and Knowledge Discovery 36/6, 2237-2271.
[250] Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society, Series B 67/2, 301-320.