
AI Tools for Actuaries

– Course Material –

Authors:
Mario V. Wüthrich (ETH Zurich)
Ronald Richman (InsureAI)
Benjamin Avanzi (The University of Melbourne)
Mathias Lindholm (Stockholm University)
Michael Mayer (la Mobilière)
Jürg Schelldorfer (Swiss Re)
Salvatore Scognamiglio (University of Naples Parthenope)

Version March 3, 2025




Preface and Terms of Use

About these Lecture Notes


There are many different initiatives around the globe that aim at upskilling the actuarial
profession in data science and AI related subjects. The aim of these notes is to provide a
comprehensive set of lecture notes and teaching material to this effect. To help define
their scope, we have taken the proposal of the Actuarial Association of Europe (AAE)
Education Committee, called ‘CPD in Data Science’,¹ which is a strategy paper that
proposes a syllabus for actuaries in data science and AI related topics. Based on this
strategy paper, we have developed our own view on what the actuarial profession should
be familiar with in this field, and these lecture notes are the result of this process. Since
data science, machine learning and AI are rapidly evolving fields, we expect that the
emphasis given to the different subjects will change over time, and these lecture notes
should evolve accordingly.
These lecture notes only contain a limited number of (explicit) examples. In parallel we
develop comprehensive notebooks that provide explicit data examples and code to the
theory presented in these notes; for the notebooks see
notebook-insert-link

Terms of Use
These lecture notes are an ongoing project which is continuously revised, updated and
extended. We highly appreciate any comments that readers may have to improve these
notes. The use of these lecture notes is subject to the following rules:
• These notes are provided to reusers to distribute, remix, adapt, and build upon the
material in any medium or format for noncommercial purposes only, and only so
long as attribution and credit are given to the original authors and source, and you
indicate whether changes were made. This aligns with the Creative Commons
Attribution-NonCommercial 4.0 International License (CC BY-NC).

• The authors may update the manuscript or withdraw it at any time. There is
no right of availability of any (old) version of these notes. The authors may also
change these terms of use at any time.

• The authors disclaim all warranties, including but not limited to the use of the
contents of these notes and the related notebooks and statistical code. When using
these notes, notebooks and statistical code, you fully agree to this.
¹ https://actuary.eu/about-the-aae/continuous-professional-development/



Contents

1 Introduction and prerequisites 11


1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Notation in probability theory . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Probability distribution functions . . . . . . . . . . . . . . . . . . . 13
1.2.2 Expected values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.3 Covariates and regression functions . . . . . . . . . . . . . . . . . . 14
1.3 Generalized linear models - in short . . . . . . . . . . . . . . . . . . . . . 15
1.4 Strictly consistent loss functions . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Strictly consistent loss functions for mean estimation . . . . . . . . 18
1.4.3 Regression fitting on finite samples . . . . . . . . . . . . . . . . . . 19
1.5 Model validation and model selection . . . . . . . . . . . . . . . . . . . . . 22
1.5.1 Out-of-sample (generalization) loss . . . . . . . . . . . . . . . . . . 23
1.5.2 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.5.3 Akaike’s information criterion for model selection . . . . . . . . . . 25
1.5.4 Bootstrap and out-of-bag validation . . . . . . . . . . . . . . . . . 26
1.5.5 Summary on model validation and model selection . . . . . . . . . 28

2 Regression models 31
2.1 Exponential dispersion family . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.1.1 Introduction of the exponential dispersion family . . . . . . . . . . 32
2.1.2 Cumulant function, mean and variance function . . . . . . . . . . . 32
2.1.3 Maximum likelihood estimation and deviance loss . . . . . . . . . 34
2.2 Regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3 Covariate pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.2 Categorical covariates . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.3 Continuous covariates . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4 Regularization and sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.1 Introduction and overview . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.3 Ridge regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.4 LASSO regularization . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.5 Best-subset selection regularization . . . . . . . . . . . . . . . . . . 45
2.4.6 Elastic net regularization . . . . . . . . . . . . . . . . . . . . . . . 45


2.4.7 Group and fused LASSO regularizations . . . . . . . . . . . . . . . 46


2.5 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3 Generalized linear models 49


3.1 Generalized linear model regressions . . . . . . . . . . . . . . . . . . . . . 49
3.1.1 Generalized linear model regression functions . . . . . . . . . . . . 49
3.1.2 Canonical link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Generalized linear model fitting . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.1 MLE via log-likelihoods . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.2 MLE via deviance loss functions . . . . . . . . . . . . . . . . . . . 53
3.2.3 Numerical solution of the GLM problem . . . . . . . . . . . . . . . 54
3.3 Likelihood-ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Lab: a real data example . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.3 Likelihood ratio test and Wald test . . . . . . . . . . . . . . . . . . 56

4 Interlude 63
4.1 Unbiasedness and calibration . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 Statistical biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.2 The balance property . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1.3 Auto-calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 General purpose non-parametric regressions . . . . . . . . . . . . . . . . . 68
4.2.1 Local regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 Isotonic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Model analysis tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Gini score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.2 Murphy’s score decomposition . . . . . . . . . . . . . . . . . . . . 75

5 Feed-forward neural networks 77


5.1 Feed-forward neural network architecture . . . . . . . . . . . . . . . . . . 78
5.1.1 Feature extractor and readout . . . . . . . . . . . . . . . . . . . . . 78
5.1.2 Activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.3 Feed-forward neural network layer . . . . . . . . . . . . . . . . . . 80
5.1.4 Feed-forward neural network architecture . . . . . . . . . . . . . . 81
5.2 Universality theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 Network fitting with gradient descent . . . . . . . . . . . . . . . . . . . . . 83
5.3.1 The standard gradient descent method . . . . . . . . . . . . . . . . 84
5.3.2 Covariate pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.3 Gradient calculation via back-propagation . . . . . . . . . . . . . . 86
5.3.4 Learning rate and higher order Taylor approximations . . . . . . . 86
5.3.5 Early stopping rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.6 Regularization and drop-out . . . . . . . . . . . . . . . . . . . . . . 90
5.3.7 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . . . . 91
5.3.8 Summarizing feed-forward neural networks and their training . . . 92
5.4 Nagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93


5.4.1 Aggregating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.2 Network ensembling . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Summary on feed-forward neural networks . . . . . . . . . . . . . . . . . . 96
5.6 Combining a GLM and a neural network . . . . . . . . . . . . . . . . . . . 97
5.7 LocalGLMnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.8 Outlook: Kolmogorov–Arnold networks . . . . . . . . . . . . . . . . . . . 101

6 Regression trees and random forests 103


6.1 Regression trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

7 Gradient boosting machines 111


7.1 (Generalized additive) boosting . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 Gradient boosting machines . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.1 Functional gradient boosting . . . . . . . . . . . . . . . . . . . . . 114
7.2.2 Tree-based gradient boosting machines . . . . . . . . . . . . . . . . 117
7.3 Interpretability measures and variable importance . . . . . . . . . . . . . 119
7.4 State-of-the-art gradient boosting machines . . . . . . . . . . . . . . . . . 120

8 Deep learning for tensor and unstructured data 123


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.2 Tokenization and the art of entity embedding . . . . . . . . . . . . . . . . 125
8.2.1 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.2.2 (Supervised) entity embedding . . . . . . . . . . . . . . . . . . . . 125
8.2.3 (Unsupervised) word embedding . . . . . . . . . . . . . . . . . . . 127
8.2.4 Summary of Section 8.2 . . . . . . . . . . . . . . . . . . . . . . . . 135
8.3 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . 135
8.3.1 1D convolutional neural networks . . . . . . . . . . . . . . . . . . . 136
8.3.2 2D convolutional neural networks . . . . . . . . . . . . . . . . . . . 139
8.3.3 Deep convolutional neural networks . . . . . . . . . . . . . . . . . 141
8.3.4 Locally-connected networks . . . . . . . . . . . . . . . . . . . . . . 142
8.3.5 Pooling layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.3.6 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.3.7 Conclusions and flatten layers . . . . . . . . . . . . . . . . . . . . . 145
8.4 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.4.1 Plain-vanilla recurrent neural networks . . . . . . . . . . . . . . . . 146
8.4.2 Long short-term memory networks . . . . . . . . . . . . . . . . . . 147
8.4.3 Gated recurrent unit networks . . . . . . . . . . . . . . . . . . . . 149
8.4.4 Deep recurrent neural networks and conclusion . . . . . . . . . . . 150
8.5 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.5.1 Basic components of transformers . . . . . . . . . . . . . . . . . . 151
8.5.2 Transformers for sequential data . . . . . . . . . . . . . . . . . . . 154
8.5.3 Transformers for tabular data . . . . . . . . . . . . . . . . . . . . . 157
8.5.4 The credibility transformer . . . . . . . . . . . . . . . . . . . . . . 158


9 Unsupervised learning 165


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.2 Dimension reduction methods . . . . . . . . . . . . . . . . . . . . . . . . . 166
9.2.1 Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
9.2.2 Auto-encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
9.2.3 Principal component analysis . . . . . . . . . . . . . . . . . . . . . 170
9.2.4 Bottleneck neural network . . . . . . . . . . . . . . . . . . . . . . . 173
9.2.5 Kernel principal component analysis . . . . . . . . . . . . . . . . . 175
9.3 Clustering methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
9.3.1 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.3.2 K-means and K-medoids clusterings . . . . . . . . . . . . . . . . . 184
9.3.3 Clustering using Gaussian mixture models . . . . . . . . . . . . . . 187
9.3.4 Density-based clustering . . . . . . . . . . . . . . . . . . . . . . . . 188
9.4 Low-dimensional visualization methods . . . . . . . . . . . . . . . . . . . . 189

10 Generative modeling 193


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.2 Variational auto-encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10.2.1 Introduction and motivation . . . . . . . . . . . . . . . . . . . . . 196
10.2.2 Variational auto-encoder model architecture . . . . . . . . . . . . . 196
10.2.3 Variational objective: the evidence lower bound . . . . . . . . . . . 197
10.2.4 Reparameterization trick and training . . . . . . . . . . . . . . . . 198
10.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10.3 Other approaches related to latent factor models . . . . . . . . . . . . . . 199
10.3.1 Generative adversarial networks . . . . . . . . . . . . . . . . . . . 200
10.3.2 Diffusion models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
10.4 Decoder transformer models . . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.4.1 Introduction and motivation . . . . . . . . . . . . . . . . . . . . . 203
10.4.2 Architecture overview . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.4.3 Self-attention with causal masking . . . . . . . . . . . . . . . . . . 204
10.4.4 Softmax outputs and probability calibration . . . . . . . . . . . . . 205
10.4.5 Training and inference . . . . . . . . . . . . . . . . . . . . . . . . . 206
10.5 Conclusion on generative models . . . . . . . . . . . . . . . . . . . . . . . 207
10.6 Large language models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
10.6.1 From auto-regressive transformers to LLMs . . . . . . . . . . . . . 209
10.6.2 Foundation models and pretraining . . . . . . . . . . . . . . . . . . 213
10.6.3 Use cases of large language models . . . . . . . . . . . . . . . . . . 213
10.6.4 Fine-tuning with reinforcement learning from human feedback . . 213
10.6.5 Parameter-efficient fine-tuning . . . . . . . . . . . . . . . . . . . . 215
10.6.6 Prompting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
10.6.7 Building and refining reasoning models for LLMs . . . . . . . . . . 219
10.6.8 LLMs as a judge and self-consistency mechanisms . . . . . . . . . 221
10.6.9 Responsible use of large language models . . . . . . . . . . . . . . 223
10.6.10 Sparse auto-encoders and mechanistic interpretability . . . . . . . 225
10.6.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228


10.7 Conclusion: From empirical scaling to broad AI . . . . . . . . . . . . . . . 228

11 Reinforcement learning 229


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
11.2 Multi-armed bandit problem . . . . . . . . . . . . . . . . . . . . . . . . . 230
11.3 Incremental learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
11.4 Tabular learning problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
11.5 Known environment’s dynamics . . . . . . . . . . . . . . . . . . . . . . . . 237
11.5.1 Policy iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
11.5.2 Value iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
11.6 Unknown environment’s dynamics: Monte Carlo . . . . . . . . . . . . . . 239
11.7 Temporal difference learning . . . . . . . . . . . . . . . . . . . . . . . . . . 243
11.7.1 SARSA on-policy control . . . . . . . . . . . . . . . . . . . . . . . 244
11.7.2 Q-learning off-policy control . . . . . . . . . . . . . . . . . . . . . . 245
11.8 Beyond tabular learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
11.9 Actor-critic reinforcement learning algorithms . . . . . . . . . . . . . . . . 251
11.9.1 Policy gradient control . . . . . . . . . . . . . . . . . . . . . . . . . 251
11.9.2 Actor-critic reinforcement learning . . . . . . . . . . . . . . . . . . 252

12 Outlook 257



Chapter 1

Introduction and prerequisites

1.1 Introduction
The actuarial profession is rapidly evolving, with machine learning (ML) and artificial in-
telligence (AI) tools becoming increasingly incorporated into practical actuarial method-
ology. This book explains the evolution of AI tools from more familiar regression models
all the way up to generative AI systems, to equip actuaries with technical knowledge
about these tools and to enable actuaries to apply these within their work.
Why do we believe that these AI tools represent such an important advance that they
are worthy of a book length analysis? As we explain next, the methodology underlying
modern AI tools is surprisingly similar to the methodology of actuarial science, and the
advances in AI tools therefore can be applied easily and with great effect within the work
that actuaries do.
Actuaries, through their education within and study of actuarial science, are equipped
with tools from many disciplines to solve the challenges they encounter in their work,
which often focuses on managing risk within financial institutions. These tools are,
on the one hand, drawn from a variety of other disciplines such as statistics, finance,
demography and economics, and, on the other hand, also include specialized techniques
developed within the field of actuarial science. Examples of the former are the valuation
of options and guarantees using risk-neutral techniques and the projection of mortality
using the Lee–Carter model [132]; an example of the latter is the chain-ladder technique,
used to predict outstanding claims liabilities.
the particular industries in which they work allows actuaries to build models to provide
insight and analysis of a variety of important problems within these industries.
Actuarial modeling often takes the approach of approximating and predicting empirically
observed phenomena without spending too much time building full theories explaining
these observations in detail; in this sense, actuarial science is different from economics
or physics which use empirical phenomena to build theories. Of course, actuarial sci-
ence is grounded in the study of rigorous mathematics, probability theory and statistics
and, moreover, has developed deep theoretical frameworks for topics such as credibility
theory. Nonetheless, the practical application of actuarial modeling is to make predic-
tions and not to develop theories explaining the observations. Take, for example, an
actuary who is tasked with estimating the required reserve for the outstanding liabilities
in a non-life insurer. Using datasets comprising observations of past delays in claims
emergence, he/she will use tools to estimate factors which predict the amount of future
claim emergence and apply these to recent cohorts of reported claims. While the actuary
will understand what produces claims delays and how these underlying causes relate to
the claims development factors, the analysis is nonetheless performed from an empirical
standpoint.
Remarkably, exactly this approach of building models to approximate and predict
empirical phenomena underlies the highly successful field of ML and AI, which has, in
recent years, built models capable of solving tasks in computer vision, natural language
processing and generative modeling, and which, more recently, is beginning to tackle the
problem of building more generally intelligent AI systems. At the risk of oversimplifying,
many approaches within ML and AI work by compiling a dataset relevant to the problem
at hand and then specifying a class of models that is well-suited to the type of data. After
calibrating the parameters of the model appropriately, predictions are then made and, if
needed, can be interpreted. At the largest scale, so-called foundation models are trained
on significant proportions of the text available on the Internet; this training process
involves making predictions of the next expected word in a sentence, given some previous
context about what the sentence is discussing. Despite the seemingly simple task that
these models are given, by learning to approximate the empirical (conditional)
distribution of words within natural language, these foundation models are now
becoming capable of providing useful output across a wide range of applications, from
writing computer code to editing documents.
Of course, the scale of data involved in building these large-scale AI models, and the
complexity of the models applied, can be different from typical actuarial applications.
Nonetheless, borrowing, modifying and integrating these new tools within actuarial sci-
ence represents - in our view - a remarkable opportunity to enhance the practice of our
discipline; making this task much easier is the similar approach to building models that
is common within both disciplines.
In this book, we aim to equip actuaries with both a practical and a theoretically sound un-
derstanding of these powerful AI tools that is specifically tailored to the unique challenges
and demands of the actuarial profession. The book will journey through the landscape of
AI tools, starting with familiar regression models as a foundation and then introducing
more advanced techniques, including neural networks, deep learning architectures, and
the emerging field of generative AI. We will emphasize statistical rigor, model valida-
tion, and careful consideration of data limitations that are already core competencies
of actuarial science and show how these apply equally to AI tools, allowing these to be
integrated easily into the existing actuarial toolkit.

1.2 Notation in probability theory


An introduction to probability theory and statistics is part of the core syllabus of actuarial
science, and its contents are taught in every introductory course on probability theory. In
this section, we summarize some of the probability concepts that are necessary to
formalize the following chapters and the methods presented in these notes.


1.2.1 Probability distribution functions


We generally work on a sufficiently rich probability space (Ω, F, P) with sample space Ω,
σ-field F, and probability function P : F → [0, 1] that assigns the probabilities P(A) to
the events A ∈ F. This probability space (Ω, F, P) supports random variables Y which
are real-valued measurable functions on this probability space. Random variables can be
characterized by probability distribution functions (distributions in short) F : R → [0, 1]
given by
F (y) := P [Y ≤ y] for y ∈ R.

This describes the probability that the random variable Y takes a value less than or equal
to y ∈ R. We write Y ∼ F for a random variable Y having distribution F . In most
practical applications, these distributions are unknown, and the general aim in statistics
and data science is to infer the correct distribution from observed realizations of the
random variables.
There are two main types of distributions: discrete distributions and absolutely
continuous ones. Discrete distributions are step functions having countably many steps
(y_j)_{j≥1} ⊂ R, allowing for positive probability masses (p_j)_{j≥1} in these steps. That is,
\[
p_j = P[Y = y_j] > 0, \qquad \text{with} \qquad \sum_{j \ge 1} p_j = 1.
\]
Figure 1.1 (lhs) shows a discrete distribution with finitely many steps in (y_j)_{j=1}^J, J = 8.
Typical examples of discrete distributions are count random variables Y taking values
in the integers (y_j)_{j≥1} ⊆ N_0. In insurance, count random variables are used to model the
numbers of claims; examples are the binomial distribution, the Poisson distribution and
the negative binomial distribution. We discuss these distributions below.

Figure 1.1: (lhs) discrete distribution (step function) F(y) with finitely many steps in (y_j)_{j=1}^J, J = 8; (rhs) density f(y) of an absolutely continuous distribution.

Absolutely continuous distributions have a (probability) density representation f ≥ 0
w.r.t. the Lebesgue measure, providing us with
\[
F(y) = \int_{-\infty}^{y} f(z) \, dz,
\]
for all y ∈ R. Typical examples in actuarial science are the gamma or the log-normal
distributions, which are used to model positive claim sizes. Figure 1.1 (rhs) shows a
(gamma) density y ↦ f(y), with the total area under the density being equal to one.

1.2.2 Expected values


Random variables are used to describe (future) outcomes of experiments (insurance
claims) that cannot be perfectly predicted. E.g., will a car driver have an accident over
the next calendar year and, if yes, how big will his insurance claim be? Such questions are
modeled by random variables Y , which are used to describe the quantitative outcomes
stemming from uncertain events. For forecasting, one typically uses the expected value
(mean) of Y ∼ F given by the Riemann–Stieltjes integral
\[
E[Y] = \int_{\mathbb{R}} y \, dF(y), \qquad (1.1)
\]
subject to existence. In the discrete distribution case, this expected value is equal to
\[
E[Y] = \sum_{j \ge 1} y_j \, p_j,
\]
and, in the absolutely continuous case, we compute the integral
\[
E[Y] = \int_{\mathbb{R}} y \, f(y) \, dy.
\]
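As a simple numerical illustration (ours, not taken from a real portfolio): suppose a claim Y takes the value 0 with probability 0.9, the value 1,000 with probability 0.09, and the value 10,000 with probability 0.01. The discrete formula above then gives
\[
E[Y] = 0 \cdot 0.9 + 1{,}000 \cdot 0.09 + 10{,}000 \cdot 0.01 = 190,
\]
even though the most likely outcome is 0; see also Remark 1.1, below.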

Remark 1.1. The expected value (1.1) is generally not the most likely outcome; e.g., if
we have skewed densities, as in Figure 1.1 (rhs), the expected value, the mode and the
median will generally differ. The expected value corresponds to the average outcome of
the uncertain (claim) event. In insurance, one typically assumes that one has a large
insurance portfolio of independent and identically distributed (i.i.d.) claims (Y_i)_{i≥1}. The
selection of the mean E[Y_1] is then justified by the fact that the law of large numbers
implies that the average claim n^{-1} \sum_{i=1}^{n} Y_i converges to the expected value E[Y_1], a.s.,
as the sample size increases, n → ∞. One can also view the mean E[Y_1] as the value
that minimizes the expected quadratic difference between our forecast and the actual
outcome of the random event; thus, on average, it provides the most accurate prediction
w.r.t. the expected quadratic difference. This and similar statements are going to be
discussed further in Section 1.4, below; especially, we refer to Remarks 1.4. ■

1.2.3 Covariates and regression functions


For forecasting a random variable Y , there is often additional information available in the
form of covariates X (also called features, independent variables or explanatory variables).
We assume that covariates X = (X1 , . . . , Xq )⊤ are q-dimensional real-valued random
vectors (or realizations of these random vectors); pre-processing of categorical covariates
is going to be discussed in Section 2.3, below.


When we write (Y, X) ∼ P in an actuarial context, we describe the random selection of


an insurance policy with covariates X from a population (portfolio) distribution P, and
the resulting insurance claim (response) Y should be understood conditionally, given the
covariates (insurance policy features) X, that is,

Y |X ∼ F (·|X),

for F(·|X) being a distribution depending on (varying with) the covariates X, e.g., X may
describe a translation of the expected value relative to a base case.

Having covariate information X about the insurance claim Y specifies the following
regression consideration. Denote by X ⊆ Rq the support of the covariates X; the set X
is called covariate space or feature space. The general aim in regression modeling is to
find the (true) regression function µ∗ : X → R that describes the conditional expectation
of the response variable Y having covariate X, that is,

µ∗ (X) := E [ Y | X] . (1.2)

This regression function can be determined (estimated) non-parametrically, e.g., we are


going to discuss regression trees, gradient boosting machines (GBMs) and isotonic re-
gressions, below, or we are going to use parametric regression models, such as generalized
linear models (GLMs) or neural network regression architectures to find (approximate)
this true regression function µ∗ .
Generally speaking, µ∗ (X) is called best-estimate because it is the most accurate pre-
diction of Y , given X, w.r.t. the mean squared error (MSE). It is also called pure risk
premium or actuarial premium because it does not include any safety loadings, nor does
it rely on any commercial considerations.

1.3 Generalized linear models - in short


To be able to better illustrate the following sections, we briefly introduce a specific GLM
in this section. We have decided on a log-link GLM because actuaries are very familiar
with this GLM; the general theory of GLMs is going to be discussed in Chapter 3, below.

The main problem in most applications is that the true regression function µ∗ in (1.2) is
unknown, and we have only observed a finite sample L = (Yi , X i )ni=1 from that model.

Goal: Infer the (true) regression function µ∗ from this finite sample L.

One way to estimate the unknown (true) regression function µ∗ is to assume that it takes
a specific functional form. Assuming GLMs with log-link implies that we consider the
regression functions µ_ϑ : X → R of type
\[
X \mapsto \log(\mu_\vartheta(X)) = \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_j, \qquad (1.3)
\]
for regression parameters ϑ ∈ R^{q+1}.¹ If we bring the log-link to the other side, we receive
the equivalent formulation
\[
X \mapsto \mu_\vartheta(X) = \exp\left( \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_j \right). \qquad (1.4)
\]
This gives us a parametrized family
\[
\mathcal{M} = \{\mu_\vartheta\}_\vartheta,
\]
of candidates to approximate the true regression function µ∗.

To find the best candidate µϑ ∈ M for µ∗ , one typically chooses an objective function
to select the best parameter (candidate) ϑ ∈ Rq+1 for the given observed sample L. For
this, select a loss function
\[
L : \mathbb{R} \times \mathbb{R} \to \mathbb{R}, \qquad (y, m) \mapsto L(y, m); \qquad (1.5)
\]
in this loss function, y plays the role of the outcome of Y, and m plays the role of the
mean of Y. The loss function L is then used to assess the difference between y and m;
we are going to discuss the required properties of this loss function in Section 1.4, below,
to make this a sensible model selection tool. This motivates solving the following
optimization problem
\[
\widehat{\vartheta} = \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min} \; \sum_{i=1}^{n} L(Y_i, \mu_\vartheta(X_i)), \qquad (1.6)
\]
provided there exists a unique solution. Thus, we try to minimize the loss between Y_i
and (its prediction) µ_ϑ(X_i) for a suitable loss function L.

The solution µ_{ϑ̂} from (1.6) is the best candidate in M w.r.t. the selected loss
function L and for the given observations L generated by µ∗. In this sense, we do not
compare the candidates µ_ϑ to the unknown µ∗, but rather to the data (Y_i, X_i)
generated by µ∗.

Example 1.2 (Gaussian log-link GLM). A common example for the loss function is the
square loss function L(y, m) = (y − m)². The previous optimization problem (1.6) then
reads as
\[
\widehat{\vartheta}
= \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min} \; \sum_{i=1}^{n} \left( Y_i - \mu_\vartheta(X_i) \right)^2
= \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min} \; \sum_{i=1}^{n} \left( Y_i - \exp\left( \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_j \right) \right)^2, \qquad (1.7)
\]
and we obtain the estimated regression function µ_{ϑ̂} to approximate µ∗. We call the
square loss function approach (1.7) the Gaussian log-link GLM, and the reason for this
name will become clear in (1.11), below. ■
¹ We generally do not use boldface notation for ϑ, even if ϑ is a real-valued vector. The reason for
this is that we would like to understand ϑ generically as the model parameter, which can be a real-valued
vector, but which could also be a different object that parametrizes a model.
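To make the estimation procedure (1.7) concrete, the following minimal Python sketch (illustrative only, not part of the accompanying notebooks; the simulated data and all variable names are our own) fits the Gaussian log-link GLM by direct square-loss minimization with a generic optimizer.

```python
# Minimal sketch of the Gaussian log-link GLM (1.7): fit theta by direct
# square-loss minimization; data and variable names are purely illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, q = 5000, 3
X = rng.normal(size=(n, q))                      # covariates X_i
theta_true = np.array([0.5, 0.3, -0.2, 0.1])     # (theta_0, theta_1, ..., theta_q)
mu_true = np.exp(theta_true[0] + X @ theta_true[1:])
Y = mu_true + rng.normal(scale=0.1 * mu_true)    # noisy responses around mu*(X)

def mu(theta, X):
    """Log-link GLM regression function (1.4)."""
    return np.exp(theta[0] + X @ theta[1:])

def square_loss(theta):
    """In-sample square loss of (1.7)."""
    return np.sum((Y - mu(theta, X)) ** 2)

res = minimize(square_loss, x0=np.zeros(q + 1), method="BFGS")
print("estimated theta:", res.x)   # should be close to theta_true
```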


The main question is whether this is a sensible model estimation technique.

• First, clearly, the true model µ∗ should be ‘close’ to the selected model class
M, otherwise it is impossible to identify a model in M that is similar to µ∗.
E.g., the GLM has a linear structure (1.3) which may not be suitable in some
problems, and this requires selecting other regression function classes.

• Second, is the selected loss function L sensible for finding a good candidate
model µ_{ϑ̂} ∈ M by solving (1.6)? We discuss this in the next Section 1.4,
and we also argue why the square loss function (1.7) is not always the best
choice on finite samples.

1.4 Strictly consistent loss functions


1.4.1 Introduction
Finding good predictive models and regression functions for (1.2) crucially depends on
model fitting, model selection and forecast evaluation. These are core topics in mathe-
matical statistics, and there is a broad literature and distinguished theory behind finding
good predictive models. Some key references in mathematical statistics on these topics
are Gneiting–Raftery [78] and Gneiting [77]. We give a brief introduction and present
the main tools that are relevant for our purposes.

Broadly speaking, model fitting in an algorithmic world is understood as minimizing


some loss function (objective function) on given observations; for an example see (1.6).
The result is a predictive model, but is this predictive model fit for its purpose? One
may argue that the specific choices of the potential models and the objective function
do not matter too much because, with an almost infinite amount of data, we will find
the ground truth anyway. This argumentation is often used in the machine learning
community. (1) It is based on the assumption that one has almost infinite resources of
observations. (2) Implicitly, it is often considered in classification problems. We argue
why actuarial problems are generally different.

(1) In many actuarial problems, one does not have unlimited data resources. E.g., for a
certain insurance product, we may only have very few claims.

(2) In classification problems, one estimates probabilities in the unit interval (0, 1),
which is a nice and bounded space. Insurance claims can be heavy-tailed, and
very few large claims can heavily distort the fitting procedure on finite samples.
Therefore, the selection of the loss function needs special care.

For these reasons, one has to carefully analyze the choices of the potential models, the
loss function and the fitting algorithm; otherwise, one may end up with an inappropriate
predictive model. The purpose of this section is to introduce the theory behind model
fitting and model evaluation in a broader sense, and we are going to be more specific in
Chapter 2 by relating these statistical concepts to actuarial problems.


1.4.2 Strictly consistent loss functions for mean estimation


Denote by M the class of (sufficiently nice) regression functions µ : X → R under
consideration as candidates for (1.2). Select a loss function L according to (1.5). The
best candidate within the class M for the true regression function (1.2) w.r.t. the selected
loss function L is the solution to, subject to existence,
\[
\widehat{\mu} \in \underset{\mu \in \mathcal{M}}{\arg\min} \; E\left[ L(Y, \mu(X)) \right]. \qquad (1.8)
\]
In words, we try to find the regression function µ̂ ∈ M that provides us with the smallest
expected loss (the expectation is over the population distribution P). To make this a
sensible model selection tool, we require some properties on the chosen loss function L.
This is introduced in the next definition.²
Definition 1.3. Denote the true regression function by µ∗ , see (1.2). The loss function
L is consistent for mean estimation w.r.t. M if

E [L(Y, µ∗ (X))] ≤ E [L(Y, µ(X))] , (1.9)

for all µ ∈ M, and where we assume that the left-hand side of (1.9) takes a finite value.
The loss function is strictly consistent for mean estimation if we have an equality in (1.9)
if and only if µ(X) = µ∗ (X), a.s.
Definition 1.3 tells us that we should only select strictly consistent loss functions for mean
estimation for regression model fitting, otherwise the expected loss minimization (1.8) is
not a sensible model fitting strategy as we may not discover the true model µ∗ by this
minimization (assuming it belongs to M).
Mathematical result. The strictly consistent loss functions for mean estimation are the
Bregman divergences; see Savage [198] and Gneiting [77]. Under certain assumptions, this
statement is an “if and only if”-statement. This implies that we should always consider
a Bregman divergence for regression model fitting of conditional mean type (1.2).
Bregman divergences take the following form

L(y, m) = ψ(y) − ψ(m) − ψ ′ (m) (y − m) ≥ 0,

for a strictly convex function ψ with (sub-)gradient ψ ′ .

Generally, Kullback–Leibler (KL) divergences and deviance loss functions are Breg-
man divergences; examples include the square loss, the Poisson deviance loss, the
gamma deviance loss or the categorical loss; see Table 2.2, below.
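The following minimal simulation sketch (ours; simulated data only, not from the accompanying notebooks) illustrates strict consistency numerically: minimizing the average of a Bregman divergence over a constant forecast m recovers the sample mean, whereas the absolute error, which is not a Bregman divergence, recovers the median. The Poisson and gamma deviance losses used here are given in (1.12)-(1.13), below.

```python
# Minimal numerical illustration (with simulated data) that minimizing different
# Bregman divergences over a constant forecast m recovers the sample mean,
# while the absolute error (not a Bregman divergence) recovers the median.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=500.0, size=10_000)   # right-skewed 'claims'

def argmin_loss(loss):
    res = minimize_scalar(lambda m: np.mean(loss(y, m)),
                          bounds=(1.0, 1e5), method="bounded")
    return res.x

square   = lambda y, m: (y - m) ** 2
poisson  = lambda y, m: 2 * (m - y - y * np.log(m / y))       # (1.12)
gamma_d  = lambda y, m: 2 * ((y - m) / m + np.log(m / y))     # (1.13)
absolute = lambda y, m: np.abs(y - m)

print("sample mean  :", y.mean())
print("square loss  :", argmin_loss(square))     # approx. the mean
print("Poisson dev. :", argmin_loss(poisson))    # approx. the mean
print("gamma dev.   :", argmin_loss(gamma_d))    # approx. the mean
print("abs. error   :", argmin_loss(absolute), " vs median:", np.median(y))
```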

Remarks 1.4. We have mentioned in Remark 1.1 that, typically, for insurance pricing
we use (conditional) means. This is justified by the law of large numbers that ensures
that we charge the correct price level. These (conditional) means are estimated based on
strictly consistent loss functions for mean estimation.
² It is probably in the genes of actuaries to try to minimize losses. By a sign switch −L, we obtain a
score, and economists would probably rather want to maximize scores.


In contrast, we could also be interested in medians and quantiles, e.g., for risk management
purposes. Strictly consistent loss functions for median and quantile estimation
are the mean absolute error (MAE) and, more generally, the pinball losses; see Thomson
[216] and Saerens [197]. These losses are not Bregman divergences, and they should not
be used for regression model fitting if we are interested in (conditional) expectations. The
same applies to model validation and model selection, i.e., if one fits regression models
on conditional means, one should not use MAE figures to validate them. ■

1.4.3 Regression fitting on finite samples


The difficulty in practice is that we cannot compute (1.8) because, typically, the true
data generating model is unknown. The general solution to this problem is to replace
the true model by an empirical one. For this, we assume to have a sample of i.i.d. data
(Y_i, X_i)_{i=1}^n that follows the same law as (Y, X). The empirical version of (1.8) is given
by
\[
\widehat{\mu} \in \underset{\mu \in \mathcal{M}}{\arg\min} \; \frac{1}{n} \sum_{i=1}^{n} L(Y_i, \mu(X_i)). \qquad (1.10)
\]

We give a couple of remarks on (1.10).


• The law of large numbers provides convergence of the empirical loss in (1.10) to
the true expected one, a.s. This uses the i.i.d. property of the sample (Yi , X i )ni=1 .

• Compared to (1.6), we scale by 1/n. This does not change the solution on finite
samples n < ∞, but it is necessary for the law of large numbers argument to hold.

• Generally, we only work with strictly consistent loss functions L for mean estima-
tion, this also applies to the empirical version (1.10).

• The solution µ̂ of (1.10) depends on the realization of the sample (Y_i, X_i)_{i=1}^n, and
repeating this experiment typically gives a different solution µ̂. In this sense, the
solution µ̂ of (1.10) is itself a random variable, a function of the sample (Y_i, X_i)_{i=1}^n,
and atypical observations may give atypical solutions.

• The difference between the solutions of (1.8) and (1.10) is coined estimation error.
Typically, for increasing sample size, the estimation error decreases on average. In
many problems, estimation errors decay at rate 1/√n for i.i.d. data (Y_i, X_i)_{i=1}^n,
which usually can be attributed to a central limit theorem (strict consistency and
asymptotic normality); see the simulation sketch below.
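The following minimal simulation sketch (ours; illustrative only) shows this rate empirically for the simplest estimation problem, the sample mean.

```python
# Minimal simulation sketch (illustrative only) of the claim that estimation
# errors decay at rate 1/sqrt(n): we estimate a mean from i.i.d. samples of
# increasing size and track the root-mean-squared estimation error.
import numpy as np

rng = np.random.default_rng(2)
true_mean, n_repeats = 1000.0, 2000

for n in [100, 400, 1600, 6400]:
    samples = rng.gamma(shape=2.0, scale=true_mean / 2.0, size=(n_repeats, n))
    estimates = samples.mean(axis=1)                       # empirical solutions of (1.10)
    rmse = np.sqrt(np.mean((estimates - true_mean) ** 2))  # estimation error
    print(f"n = {n:5d}: RMSE = {rmse:7.2f}, RMSE * sqrt(n) = {rmse * np.sqrt(n):7.1f}")
# RMSE * sqrt(n) stays roughly constant, i.e., the error decays at rate 1/sqrt(n).
```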
If we study the model fitting problem (1.8) and if the true model µ∗ ∈ M belongs to the
selected model class, then any strictly consistent loss function L finds the true model µ∗ ,
and there are infinitely many strictly consistent loss functions for mean estimation. This
statement is an asymptotic (infinite sample size) statement, in contrast to its empirical
counterpart (1.10).

The specific selection of the loss function L becomes important on finite sample
sizes n < ∞.


More specifically, Gourieroux et al. [87, Theorem 4] proved (in a GLM context) that
optimal regression models are found if the chosen strictly consistent loss function L for
mean estimation reflects the correct variance behavior of the response Y . In that case,
the model fitting procedure results in a so-called best asymptotically normal estimation.
We illustrate the Gourieroux et al. [87] result by an example. Consider the Gaussian,
the Poisson, the gamma and the inverse Gaussian distributions for Y , given X, with
corresponding conditional mean µ∗ (X) (being the same in all four cases). The conditional
variances in these four models are given by

Gaussian model: V(Y |X) = φ,
Poisson model: V(Y |X) = µ∗(X),
gamma model: V(Y |X) = φ (µ∗(X))²,
inverse Gaussian model: V(Y |X) = φ (µ∗(X))³,

where φ > 0 is a given dispersion parameter. We observe that in these four models the
conditional variances are power functions of the mean functional µ∗ (X) with different
power variance parameters p ∈ {0, 1, 2, 3}. These different variance functions can be
translated to deviance loss functions by selecting the corresponding distribution within the
exponential dispersion family (EDF); for details see (2.6) and the subsequent discussion.
In particular, the Gaussian case translates to the square loss
\[
L(y, m) = (y - m)^2, \qquad (1.11)
\]
the Poisson case to the Poisson deviance loss
\[
L(y, m) = 2\left( m - y - y \log\!\left(\frac{m}{y}\right) \right), \qquad (1.12)
\]
the gamma case to the gamma deviance loss
\[
L(y, m) = 2\left( \frac{y - m}{m} + \log\!\left(\frac{m}{y}\right) \right), \qquad (1.13)
\]
and the inverse Gaussian case to the inverse Gaussian deviance loss
\[
L(y, m) = \frac{(y - m)^2}{m^2\, y}. \qquad (1.14)
\]

All of these deviance loss functions are strictly consistent for mean estimation, but
the one with the correct conditional variance behavior for the response Y , given
X, has the best finite sample properties (on average).

More examples are provided in Table 2.2, below, and in Section 2.1.3, below, we give
more mathematical justification to this statement.
Figure 1.2 shows the four deviance loss functions (1.11)-(1.14) with power variance
parameters p ∈ {0, 1, 2, 3}. For this figure, we select a fixed mean parameter m = 1000,
and the dispersion parameters φ > 0 are chosen such that all four models have the same
variance of 1000. Figure 1.2 shows the resulting deviance losses y ↦ L(y, m)/φ, scaled by
φ⁻¹ so that they all live on the same scale; the colored circles are at y = 800, 1200
(symmetric around m = 1000) for better orientation. The square loss of the Gaussian
model is symmetric around m = 1000, and all other deviance losses are asymmetric
around this value (compare the colored circles). This asymmetry is the property that
should match the response distribution so that we receive (on average) optimal finite
sample properties in model estimation according to Gourieroux et al. [87]. That is, if the
response distribution is very right-skewed, the selected deviance loss should account
for this to receive a best asymptotically normal estimation of the expected values; we
come back to this in Section 2.1, below.

Figure 1.2: Deviance loss functions (1.11)-(1.14), y ↦ L(y, m)/φ for fixed mean m = 1000; the circles are at y = 800, 1200 (symmetric around m = 1000) for better orientation.
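The following minimal Python sketch (ours, not from the accompanying notebooks) reproduces the numbers behind this comparison by evaluating the scaled deviance losses at y = 800 and y = 1200 for m = 1000.

```python
# Minimal sketch (illustrative) evaluating the scaled deviance losses
# y -> L(y, m)/phi of (1.11)-(1.14) for m = 1000, with dispersions chosen
# such that all four models have variance 1000, as in the discussion of Figure 1.2.
import numpy as np

m = 1000.0
losses = {
    "Gaussian (p=0)":   (lambda y: (y - m) ** 2,                      1000.0),  # phi = 1000
    "Poisson (p=1)":    (lambda y: 2 * (m - y - y * np.log(m / y)),   1.0),     # phi = 1
    "gamma (p=2)":      (lambda y: 2 * ((y - m) / m + np.log(m / y)), 1e-3),    # phi * m^2 = 1000
    "inv. Gauss (p=3)": (lambda y: (y - m) ** 2 / (m ** 2 * y),       1e-6),    # phi * m^3 = 1000
}
for name, (L, phi) in losses.items():
    left, right = L(800.0) / phi, L(1200.0) / phi
    print(f"{name:16s}: L(800,m)/phi = {left:6.2f}, L(1200,m)/phi = {right:6.2f}")
# Only the Gaussian loss is symmetric around m; the other losses penalize
# y < m and y > m differently, reflecting their skewness.
```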

Remarks 1.5. • All losses in (1.11)-(1.14) satisfy L(y, m) ≥ 0, and these losses are
zero if and only if y = m. The square loss (1.11) is defined on R × R, and the other
three losses on the positive real line R+ × R+ , i.e., they need positive inputs.

• There is a deep connection between the distributions of the EDF, maximum likeli-
hood estimation (MLE) and deviance loss minimization which we did not explain
here. First, minimizing the above deviance loss functions is equivalent to MLE
in the corresponding distributional model. That is, e.g., minimizing the Poisson
deviance loss results in the MLE µ̂^MLE for the Poisson model. We will come back
to this, below. Second, all considered models (1.11)-(1.14) have in common that
they belong to the EDF, and each member of the EDF is characterized by a certain
functional form of its conditional variance function V(Y |X); we refer to Jørgensen
[112, Theorem 2.11] and Bar-Lev–Kokonendji [13, Section 2.4]. In fact, the condi-
tional variance function determines the specific distribution within the EDF, and
this, in turn, gives us the optimal deviance loss function choice for model fitting on
finite samples.


• Distributions within the EDF with conditional variance functions of the form

V(Y |X) = φ (µ∗ (X))p , (1.15)

are called Tweedie’s distributions with power variance parameter p ∈ R\(0, 1). This
class of distributions has simultaneously been introduced by Tweedie [222] and Bar-
Lev–Enis [12], and for p ∈ (0, 1) there do not exist any Tweedie’s distributions; see
Jørgensen [111]. For p ∈ (1, 2) we have Tweedie’s compound Poisson model which
is absolutely continuous on R+ and which has a point mass in zero.

• In cases where the conditional variance function is unknown or rather unclear, one
can exploit an iterative estimation procedure by alternating mean and variance
estimation to get optimal regression models. Under isotonicity of the conditional
variance in the conditional mean behavior, this has recently been studied in Delong–
Wüthrich [51], and it is verified in this study that this iterative estimation procedure
can be very beneficial for improving model accuracy, i.e., getting closer to best
asymptotically normal in the sense of Gourieroux et al. [87].

We conclude this section with the log-link GLM example (1.3)-(1.4). We call Example
1.2 a Gaussian log-link GLM because the square loss minimization emphasizes that the
responses Y are conditionally Gaussian, given X. Similarly, we can define a Poisson and
a gamma log-link GLM.

Example 1.6 (Poisson log-link GLM). Select the Poisson deviance loss (1.12) for L.
This gives the optimization in the log-link GLM case
\[
\widehat{\vartheta} = \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min} \; \sum_{i=1}^{n} 2\left( \mu_\vartheta(X_i) - Y_i - Y_i \log\!\left( \frac{\mu_\vartheta(X_i)}{Y_i} \right) \right),
\]
for the log-link GLM (1.3)-(1.4). This is a Poisson log-link GLM, and it emphasizes that the
responses Y_i have conditional Poisson distributions, given X_i. ■

Example 1.7 (gamma log-link GLM). Select the gamma deviance loss (1.13) for L. This
gives the optimization in the log-link GLM case
\[
\widehat{\vartheta} = \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min} \; \sum_{i=1}^{n} 2\left( \frac{Y_i - \mu_\vartheta(X_i)}{\mu_\vartheta(X_i)} + \log\!\left( \frac{\mu_\vartheta(X_i)}{Y_i} \right) \right),
\]
for the log-link GLM (1.3)-(1.4). This is a gamma log-link GLM, and it emphasizes that the
responses Y_i have conditional gamma distributions, given X_i. ■

Table 2.2, below, gives more examples and it describes how certain distributions are
linked to deviance losses.
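As an illustration of Examples 1.2, 1.6 and 1.7 in practice, the following minimal sketch (ours; it assumes the statsmodels package is available and uses simulated data) fits a Poisson and a gamma log-link GLM; the MLE fit is equivalent to minimizing the corresponding deviance loss.

```python
# Minimal sketch (hypothetical data) of fitting the Poisson and gamma log-link
# GLMs of Examples 1.6-1.7 with statsmodels; the GLM fit by MLE is equivalent
# to minimizing the corresponding deviance loss.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, q = 10_000, 3
X = rng.normal(size=(n, q))
X_design = sm.add_constant(X)                        # adds the intercept theta_0
mu_true = np.exp(0.2 + X @ np.array([0.4, -0.3, 0.1]))

# Poisson log-link GLM (claim counts); the log link is the Poisson default.
N = rng.poisson(mu_true)
poisson_fit = sm.GLM(N, X_design, family=sm.families.Poisson()).fit()

# Gamma log-link GLM (claim sizes), selecting the log link explicitly.
Y = rng.gamma(shape=2.0, scale=mu_true / 2.0)
gamma_fit = sm.GLM(Y, X_design,
                   family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print(poisson_fit.params)   # estimates of (theta_0, ..., theta_q)
print(gamma_fit.params)
```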

1.5 Model validation and model selection


In general, one should always start to analyze data and fitted models by using
different visualizations and graphical tools, such as QQ-plots, lift plots, T-reliability
diagrams, etc. We will come back to graphical illustrations, below. In this section we
describe model validation and model selection.
Section 1.4 has been focusing on model fitting, and the crucial message was that this
should be done under strictly consistent loss functions for mean estimation. The same
methodology can be used for model validation and model selection, that is, by a strictly
consistent scoring of the fitted models. The objective of expected loss minimization is
turned into the new objective of generalization loss minimization. In other words, one
would like to know which of several models has the best forecast accuracy, and this is
evaluated by studying their generalization loss, meaning that one wants to know “which
of the fitted models generalizes best to new unseen data” in the sense of giving the most
accurate forecasts for new data.
This section focuses on the core (model-agnostic) methods for model validation and
model selection that are in the intersection between machine learning and statistics, such
as cross-validation, Akaike’s information criterion or out-of-bag validation. Out-of-bag
validation requires introducing the bootstrap, which is also done in this section. More
sophisticated tools for model validation and model selection (model-agnostic and model-
specific ones) will be described in later chapters, but for this we first need to introduce
the relevant techniques.

1.5.1 Out-of-sample (generalization) loss

Definition 1.3 motivates selecting the model with the smallest expected loss for a given
strictly consistent loss function L, see (1.9). As highlighted in (1.10), this selection needs
to be done empirically because the true data generating mechanism is unknown. There is
a specific point that needs special attention in this model validation and model selection
procedure.

Model estimation and model validation should not be done on the identical sample.

This is most easily understood by realizing that if one used the identical data, a more
complex model would always outperform a nested simpler model. This is related
to the in-sample bias that may judge the more complex model too optimistically, because
it solves the same minimization problem (1.10) under more degrees of freedom.

The standard way of analyzing the generalization loss (forecast performance) is to
partition the entire sample into two data sets: a learning sample L for model fitting and a
test sample (hold-out sample) T for model testing (generalization loss analysis). These
two samples should be mutually independent and contain i.i.d. data L = (Y_i, X_i)_{i=1}^n
and T = (Y_t, X_t)_{t=1}^m, respectively, following the same law as (Y, X); the two samples are
distinguished here by the different lower indices 1 ≤ i ≤ n and 1 ≤ t ≤ m, respectively.
Model fitting (model learning) is then performed solely on the learning sample L by
minimizing the in-sample loss
\[
\widehat{\mu}_\mathcal{L} \in \underset{\mu \in \mathcal{M}}{\arg\min} \; \frac{1}{n} \sum_{i=1}^{n} L(Y_i, \mu(X_i)).
\]

The selected model(s) are evaluated (compared to each other) on the hold-out sample T
by analyzing their out-of-sample loss (empirical generalization loss (GL))
\[
\widehat{\mathrm{GL}}(\mathcal{T}, \widehat{\mu}_\mathcal{L}) := \frac{1}{m} \sum_{t=1}^{m} L\left( Y_t, \widehat{\mu}_\mathcal{L}(X_t) \right). \qquad (1.16)
\]
Thus, the learning sample L is only used for the estimation of the regression function
µ̂_L(·), which is then evaluated on the independent hold-out sample T. This out-of-sample
loss (1.16) is the main workhorse for model selection among machine learning models (e.g.,
between a neural network and a gradient boosting model). A main reason for this is that
computationally it is not very demanding.
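A minimal sketch (ours; simulated data, scikit-learn assumed available) of this learn/test workflow with the Poisson deviance loss as generalization loss:

```python
# Minimal sketch (hypothetical data) of the learn/test split and the
# out-of-sample Poisson deviance loss (1.16).
import numpy as np
from scipy.special import xlogy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor

def poisson_deviance(y, mu):
    """Average Poisson deviance loss (1.12); xlogy handles y = 0 gracefully."""
    return np.mean(2 * (mu - y + xlogy(y, y) - xlogy(y, mu)))

rng = np.random.default_rng(4)
X = rng.normal(size=(20_000, 3))
mu_true = np.exp(0.1 + X @ np.array([0.4, -0.3, 0.1]))
N = rng.poisson(mu_true)

# learning sample L and hold-out (test) sample T
X_L, X_T, N_L, N_T = train_test_split(X, N, test_size=0.2, random_state=0)

model = PoissonRegressor(alpha=0.0).fit(X_L, N_L)   # log-link Poisson GLM
print("in-sample loss     :", poisson_deviance(N_L, model.predict(X_L)))
print("out-of-sample loss :", poisson_deviance(N_T, model.predict(X_T)))
```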

1.5.2 Cross-validation
Out-of-sample loss validation (1.16) may not be the most economic way of dealing with
small data, i.e., with a statistical problem where no big data is available. By this we
mean that, unlike in many machine learning problems, in actuarial problems we do not
have unlimited data resources, but the available data is determined by the size of the
insurance portfolio. In this case, K-fold cross-validation (CV) is an alternative.
In a first step, K-fold cross-validation fits the model K times to different sub-samples of
the data to derive the generalization loss estimate (1.17), below, and in a second, final
step, all data is used to obtain the optimal predictive model µ̂. For the first step, we
partition (at random) the index set I = {1, . . . , n} into K roughly equally sized folds
(I_k)_{k=1}^K, see Figure 1.3. K is a hyper-parameter that is usually selected as K = 10, but
for small sample sizes n we may also select a smaller K to receive reliable results; in
Figure 1.3 it is set to K = 5. We then learn the model on the data with indices I \ I_k,
and we perform an out-of-sample validation on the indices I_k. That is, for 1 ≤ k ≤ K,
we compute the estimated models
\[
\widehat{\mu}^{(\setminus k)} \in \underset{\mu \in \mathcal{M}}{\arg\min} \; \sum_{i \in \mathcal{I} \setminus \mathcal{I}_k} L(Y_i, \mu(X_i)).
\]

These estimated models are cross-validated (mutually out-of-sample) by
\[
\widehat{\mathrm{GL}}^{\mathrm{CV}} := \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\mathcal{I}_k|} \sum_{i \in \mathcal{I}_k} L\left( Y_i, \widehat{\mu}^{(\setminus k)}(X_i) \right). \qquad (1.17)
\]
Note that each term under the k-summation uses a (disjoint) partition of the entire data
into a learning sample with indices I \ I_k and a test sample with indices I_k. Thus, we
perform a proper (mutual) out-of-sample validation for each 1 ≤ k ≤ K.
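A minimal sketch (ours; simulated data, scikit-learn assumed available) of the K-fold cross-validation loss (1.17):

```python
# Minimal sketch (hypothetical data) of the K-fold cross-validation loss (1.17)
# for the Poisson log-link GLM.
import numpy as np
from scipy.special import xlogy
from sklearn.model_selection import KFold
from sklearn.linear_model import PoissonRegressor

def poisson_deviance(y, mu):
    return np.mean(2 * (mu - y + xlogy(y, y) - xlogy(y, mu)))

rng = np.random.default_rng(5)
X = rng.normal(size=(10_000, 3))
N = rng.poisson(np.exp(0.1 + X @ np.array([0.4, -0.3, 0.1])))

K = 10
fold_losses = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = PoissonRegressor(alpha=0.0).fit(X[train_idx], N[train_idx])          # fit on I \ I_k
    fold_losses.append(poisson_deviance(N[val_idx], model.predict(X[val_idx])))  # validate on I_k

print("CV loss (1.17)  :", np.mean(fold_losses))
print("std across folds:", np.std(fold_losses))   # uncertainty, see Remarks 1.8
```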

Remarks 1.8. • Both the out-of-sample loss (1.16) and the K-fold cross-validation
loss (1.17) give estimates for the true (expected) generalization loss. The cross-
validation loss has the advantage that it not only estimates this generalization
loss, but also allows us to quantify uncertainty by computing the empirical standard
deviation of the K folds under the k-summation in (1.17).



Figure 1.3: Partitions of K-fold cross-validation with K = 5 folds.

• K-fold cross-validation can be demanding because it requires fitting the model K + 1
times: once to get the optimal regression function µ̂, and K times to compute the
K-fold cross-validation loss (1.17).

• Above, we have partitioned the index set I completely at random into the folds
(I_k)_{k=1}^K. Stratified K-fold cross-validation does not partition I completely at random.
For instance, one may order the instances 1 ≤ i ≤ n w.r.t. the sizes of the
responses (Y_i)_{i=1}^n, and then allocate the K largest claims at random to the different
folds (I_k)_{k=1}^K of the partition, then the next K largest claims likewise, etc. This
provides more similarity between the folds, which can be an advantage in model
selection, especially under right-skewed or heavy-tailed loss size distributions; a
sketch of such a fold assignment is given below.

1.5.3 Akaike’s information criterion for model selection


In Section 1.5.1, we have emphasized that model validation and generalization loss anal-
ysis should be done on a disjoint hold-out sample T that has not been used for model
fitting to avoid an in-sample bias in model selection. Akaike [2], in Akaike’s information
criterion (AIC), and Schwarz [204], in the Bayesian information criterion (BIC), tried to
(asymptotically) quantify this in-sample bias (under different assumptions). Naturally,
AIC and BIC only apply under these assumptions. We start from a parametrized density
$Y|X \sim f_\vartheta(\cdot|X)$ with an unknown r-dimensional model parameter ϑ ∈ R^r. Application of AIC and BIC requires fitting this density by MLE. Denote the log-likelihood function based on the i.i.d. sample $\mathcal{L} = (Y_i, X_i)_{i=1}^n$ by
$$\vartheta \mapsto \ell_{\mathcal{L}}(\vartheta) = \sum_{i=1}^{n} \log\left(f_\vartheta(Y_i|X_i)\right).$$

We compute the MLE of ϑ, subject to existence and uniqueness, by solving

$$\hat{\vartheta}^{\mathrm{MLE}} = \underset{\vartheta}{\arg\max}\; \ell_{\mathcal{L}}(\vartheta). \qquad (1.18)$$

Since $\hat{\vartheta}^{\mathrm{MLE}}$ maximizes (1.18), the quantity $\ell_{\mathcal{L}}(\hat{\vartheta}^{\mathrm{MLE}})$ gives an in-sample biased view of this model. AIC and BIC determine asymptotically an in-sample bias correction (in different settings), based on the fact that model fitting was done by MLE. The AIC value of this MLE fitted model is defined by
$$\mathrm{AIC} = -2\,\ell_{\mathcal{L}}(\hat{\vartheta}^{\mathrm{MLE}}) + 2r, \qquad (1.19)$$
where r ∈ N is the dimension of the model parameter ϑ. If we have different MLE fitted models, preference should be given to the model that has the smallest AIC value. Likewise we can proceed with the BIC value defined by
$$\mathrm{BIC} = -2\,\ell_{\mathcal{L}}(\hat{\vartheta}^{\mathrm{MLE}}) + \log(n)\, r, \qquad (1.20)$$
where n is the sample size. Note that these model selection criteria may not be valid in machine learning models, such as neural networks, as these models do not use the MLE for model fitting. Moreover, in many machine learning models, the dimension of the model parameter is unclear because such models are often over-parametrized, resulting in redundancy; we refer to Abbas et al. [1] who discuss the effective dimension of neural networks. Therefore, AIC and BIC are mainly useful tools for model selection among GLMs, provided they were fitted with MLE.
We summarize the important points about AIC and BIC.

• First, we need MLE estimated models for applying AIC and BIC.

• Second, (1.19) and (1.20) require considering all terms of the log-likelihood function $\ell_{\mathcal{L}}(\vartheta)$. This also applies to models where some of the terms cannot be computed analytically, like, e.g., in Tweedie's compound Poisson model.

• Third, model selection can be done for any two models and these models do not
need to be nested and they do not need to be Gaussian. This makes AIC and BIC
very attractive and widely applicable.

• Fourth, the responses need to be on the identical scale in all compared models.

• Fifth, AIC and BIC only give preference to a model, but they do not confirm that
the selected model is suitable, i.e., it could simply be the best option of a class of
inappropriate models.
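As a small numerical illustration of (1.19) and (1.20), the following sketch compares two nested Gaussian linear models fitted by MLE; the simulated data and the use of statsmodels' OLS are assumptions for demonstration only (the full Gaussian log-likelihood, including all normalizing terms, is used, and the variance parameter is counted in r).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# illustrative data (assumption): only the first covariate is informative
n = 500
X = rng.normal(size=(n, 3))
Y = 1.0 + 0.5 * X[:, 0] + rng.normal(size=n)

for q in (1, 3):                                    # small model vs. full model
    design = sm.add_constant(X[:, :q])              # bias column plus q covariates
    fit = sm.OLS(Y, design).fit()                   # Gaussian MLE
    ll = fit.llf                                    # log-likelihood at the MLE
    r = design.shape[1] + 1                         # regression parameters + variance parameter
    print(f"q = {q}: AIC = {-2 * ll + 2 * r:.1f}, BIC = {-2 * ll + np.log(n) * r:.1f}")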

1.5.4 Bootstrap and out-of-bag validation


Starting from an i.i.d. sample (Yi , X i )ni=1 , one often would like to have more observations
of the same nature, but because (typically) the data generating mechanism is unknown,
this cannot be achieved. Bootstrap is a method that aims at generating more data that is
similar to (Yi , X i )ni=1 . Bootstrap simulation goes back to Efron [61] and Efron–Tibshirani
[62]. There are parametric versions and non-parametric ones.

Parametric bootstrap

For a parametric bootstrap version, we assume that the independent observations follow
a given model Yi |X i ∼ Fϑ (·|X i ), for 1 ≤ i ≤ n, being parametrized by an unknown


model parameter ϑ ∈ Rr . This allows one to estimate the model parameter ϑ from
the i.i.d. sample (Yi , X i )ni=1 , giving an estimated model Fϑb(·|X) for given covariates
X. A bootstrap sample (Yi⋆ , X i )ni=1 is then obtained by simulating new conditionally
independent observations Yi⋆ |X i ∼ Fϑb(·|X i ) from the estimated model, for 1 ≤ i ≤ n. If
the estimated model is sufficiently accurate, the bootstrap sample (Yi⋆ , X i )ni=1 has similar
distributional properties as the original sample (Yi , X i )ni=1 . Based on this bootstrap
sample (Yi⋆ , X i )ni=1 , we can re-estimate the model parameter ϑ providing us with a
bootstrap estimate ϑb⋆ ∈ Rr . Repeating this procedure m times gives us an empirical
distribution of the estimated model parameter (called empirical bootstrap distribution)
$$\hat{G}(\theta) := \frac{1}{m}\sum_{j=1}^{m} \mathbf{1}_{\{\theta \le \hat{\vartheta}^{(\star j)}\}} \qquad \text{for } \theta \in \mathbb{R}^r,$$
with $\hat{\vartheta}^{(\star j)} \in \mathbb{R}^r$ denoting the estimated model parameter from the j-th bootstrap sample $(Y_i^{(\star j)}, X_i)_{i=1}^n$, 1 ≤ j ≤ m.
This is one version of a parametric bootstrap. It keeps the covariates (X i )ni=1 fixed,
and it only re-simulates the responses from the estimated model. There are many differ-
ent variants of the parametric bootstrap, e.g., also re-simulating the covariates or only
re-simulating the residuals, called residual bootstrap; we refer to Wüthrich–Merz [243,
Section 4.3] and the references therein for more discussion.
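A minimal sketch of this parametric bootstrap, keeping the covariates fixed and re-simulating the responses from the fitted model; the Poisson log-link GLM fitted with scikit-learn and the simulated data are illustrative assumptions.

import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(2)

# illustrative data and model (assumptions)
n = 2000
X = rng.normal(size=(n, 2))
Y = rng.poisson(np.exp(-1.0 + 0.4 * X[:, 0]))
base = PoissonRegressor(alpha=0.0).fit(X, Y)        # estimated model

m = 200                                             # number of bootstrap samples
boot_params = np.empty((m, 3))
for j in range(m):
    Y_star = rng.poisson(base.predict(X))           # re-simulate Y*_i | X_i from the estimated model
    refit = PoissonRegressor(alpha=0.0).fit(X, Y_star)
    boot_params[j] = np.r_[refit.intercept_, refit.coef_]

# empirical bootstrap distribution of the estimated parameters, e.g., bootstrap standard errors
print("bootstrap standard errors:", boot_params.std(axis=0, ddof=1).round(4))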

Non-parametric bootstrap

The non-parametric bootstrap is even simpler than its parametric counterpart. For a non-parametric bootstrap we directly start from the observed sample $(Y_i, X_i)_{i=1}^n$. A non-parametric bootstrap sample $(Y_j^\star, X_j^\star)_{j=1}^n$ is generated by drawing with replacement from the original data $(Y_i, X_i)_{i=1}^n$. This naturally also re-simulates the covariates, and some instances appear multiple times in the bootstrap sample $(Y_j^\star, X_j^\star)_{j=1}^n$, while others are not selected by this drawing with replacement. Denote by $I^\star \subseteq I = \{1, \ldots, n\}$ the set of indices that have been selected from the original data $(Y_i, X_i)_{i=1}^n$ for the bootstrap sample $(Y_j^\star, X_j^\star)_{j=1}^n$. We estimate a new model from this bootstrap sample
$$\hat{\mu}^\star \in \underset{\mu \in \mathcal{M}}{\arg\min} \sum_{j=1}^{n} L\left(Y_j^\star, \mu(X_j^\star)\right), \qquad (1.21)$$

and we can study the empirical bootstrap distribution as above by repeating this boot-
strap procedure many times.

Out-of-bag cross-validation

The interesting point about the non-parametric bootstrap is that the instances $i \in I \setminus I^\star$ have not been used in this estimation procedure to obtain the bootstrap estimate $\hat{\mu}^\star$ in (1.21). We call the set of observations $(Y_i, X_i)_{i \in I \setminus I^\star}$ an out-of-bag (OoB) sample. This presents a valid (disjoint) sample for cross-validation (i.e., an empirical generalization loss)
$$\widehat{\mathrm{GL}}^{\,\mathrm{OoB}} := \frac{1}{|I \setminus I^\star|}\sum_{i \in I \setminus I^\star} L\left(Y_i, \hat{\mu}^\star(X_i)\right). \qquad (1.22)$$


This gives an alternative to K-fold cross-validation (1.17). This out-of-bag validation is natural in random forest regression modeling, because random forest model fitting is based on bootstrapping. However, it can equally be used with any other regression technique for model validation.
We determine the average size of the out-of-bag sample. Not selecting a sample in a single
draw has a probability of (n − 1)/n, and repeating this n times raises this probability to
the power n. For large sample sizes we obtain convergence

$$\lim_{n \to \infty}\left(\frac{n-1}{n}\right)^{n} = \lim_{n \to \infty}\left(1 - n^{-1}\right)^{n} = e^{-1} \approx 36.8\%.$$

Thus, on average, the out-of-bag sample has a reasonably big size; see Breiman [28].
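The following sketch draws one non-parametric bootstrap sample, fits the model on it, and evaluates the out-of-bag loss (1.22); the Poisson data and model are the same kind of illustrative assumptions as above, and the printed out-of-bag fraction is close to e^{-1} ≈ 36.8%.

import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(3)

# illustrative data (assumption)
n = 2000
X = rng.normal(size=(n, 2))
Y = rng.poisson(np.exp(-1.0 + 0.4 * X[:, 0]))

def poisson_deviance(y, m):
    dev = 2.0 * (m - y)
    pos = y > 0
    dev[pos] += 2.0 * y[pos] * np.log(y[pos] / m[pos])
    return dev

idx = rng.integers(0, n, size=n)                    # drawing with replacement
oob = np.setdiff1d(np.arange(n), idx)               # out-of-bag indices I \ I*
print("out-of-bag fraction:", len(oob) / n)

model = PoissonRegressor(alpha=0.0).fit(X[idx], Y[idx])   # bootstrap estimate (1.21)
print("out-of-bag loss (1.22):", poisson_deviance(Y[oob], model.predict(X[oob])).mean())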

1.5.5 Summary on model validation and model selection

In practice, in machine learning tasks, an out-of-sample validation based on a disjoint learning sample L and test sample T is predominant. First, in many applications one has
a vast amount of data, so scarcity is not an issue. Second, in complex algorithmic models
one can have very sophisticated and time-consuming model fitting procedures which
makes it impossible to perform, e.g., K-fold cross-validation. K-fold cross-validation
is only feasible in smaller problems. It has the advantage that we can not only give
preference to a model, but we can also quantify the uncertainty in this decision. This
is, e.g., done in regression tree modeling for the tree size selection, that is, this is a way
of controlling the randomness in model selection. AIC and BIC only apply for MLE
estimated models, which clearly limits their scope to simple models. Finally, out-of-bag
validation is not used very often except for random forests. Computationally, it is similar
to K-fold cross-validation and we see quite some potential of out-of-bag validation for
model evaluation.
To keep the discussion as simple as possible in this chapter, we have avoided talking about weights or exposures. Regression modeling is typically done within the exponential dispersion family (EDF) which is going to be introduced in Section 2.1. The EDF considers volume-scaled quantities, and this requires that generally all losses L are replaced by scaled losses
$$L(Y_i, \mu(X_i)) \;\to\; \frac{v_i}{\varphi}\, L(Y_i, \mu(X_i)),$$

where vi > 0 is an instance specific weight and φ > 0 is a general parameter for dispersion
that is not instance specific; for details see Section 2.1. This instance specific factor vi /φ
is going to be added in front of all loss functions L in all subsequent considerations.


Summary of model fitting and model selection procedure

• Assume we have two independent data sets, a learning sample $\mathcal{L} = (Y_i, X_i, v_i)_{i=1}^n$ and a test sample $\mathcal{T} = (Y_t, X_t, v_t)_{t=1}^m$, both containing i.i.d. data following the same law as (Y, X, v).

• Assume we want to consider two model classes $\mathcal{M}_1$ and $\mathcal{M}_2$ that are significantly different, e.g., GLMs and gradient boosting models, and we would like to select the best model from these two classes to predict a new observation Y, given X.

• Select a suitable strictly consistent loss function L for mean estimation, e.g., if the responses Y are gamma like, given X, we select the gamma deviance loss for L.

• Based on the learning sample $\mathcal{L}$, select the best models w.r.t. the selected loss function L from both model classes $\mathcal{M}_k$, k = 1, 2,
$$\hat{\mu}_k \in \underset{\mu \in \mathcal{M}_k}{\arg\min} \sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu(X_i)\right).$$
Note, this only uses the learning sample $\mathcal{L}$.

• For predicting Y, given X, select the model from $\hat{\mu}_1$ and $\hat{\mu}_2$ that has the smaller out-of-sample loss on the test sample $\mathcal{T}$
$$\widehat{\mathrm{GL}}(\mathcal{T}, \hat{\mu}_k) = \frac{1}{m}\sum_{t=1}^{m} \frac{v_t}{\varphi}\, L\left(Y_t, \hat{\mu}_k(X_t)\right).$$
That is, select $\hat{\mu}_1 \in \mathcal{M}_1$ if $\widehat{\mathrm{GL}}(\mathcal{T}, \hat{\mu}_1) < \widehat{\mathrm{GL}}(\mathcal{T}, \hat{\mu}_2)$, otherwise select $\hat{\mu}_2 \in \mathcal{M}_2$.
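The following sketch runs through this procedure for Poisson-type claims frequencies with exposures; the data generating process, the two candidate model classes (a scikit-learn Poisson GLM and a Poisson gradient boosting machine) and the choice φ = 1 are illustrative assumptions.

import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(4)

def simulate(n):                                    # illustrative portfolio (assumption)
    X = rng.uniform(-1, 1, size=(n, 3))
    v = rng.uniform(0.5, 1.0, size=n)               # exposures
    lam = np.exp(-2.0 + 0.6 * X[:, 0] + 0.8 * X[:, 1] * X[:, 2])
    return X, v, rng.poisson(v * lam) / v           # frequencies Y = N / v

X_L, v_L, Y_L = simulate(20000)                     # learning sample L
X_T, v_T, Y_T = simulate(10000)                     # test sample T

def weighted_poisson_deviance(y, m, v):             # out-of-sample loss with weights v_t / phi, phi = 1
    dev = 2.0 * (m - y)
    pos = y > 0
    dev[pos] += 2.0 * y[pos] * np.log(y[pos] / m[pos])
    return np.mean(v * dev)

mu_1 = PoissonRegressor(alpha=0.0).fit(X_L, Y_L, sample_weight=v_L)                    # class M_1: GLM
mu_2 = HistGradientBoostingRegressor(loss="poisson").fit(X_L, Y_L, sample_weight=v_L)  # class M_2: boosting

for name, mu in [("GLM", mu_1), ("boosting", mu_2)]:
    print(name, "out-of-sample loss:", weighted_poisson_deviance(Y_T, mu.predict(X_T), v_T))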



Chapter 2

Regression models

Before diving into predictive modeling, we should ask ourselves about the desirable characteristics a predictive model should possess. Typically, we cannot comply with all of them, and the best model will be a compromise between these desirable characteristics.
Let us mention some points in a slightly unstructured manner.

(a) Clearly, the model should have a good predictive performance giving us very accu-
rate forecasts.

(b) The model should have a certain smoothness so that the forecasts do not dramati-
cally change, if one slightly perturbs the inputs.

(c) The model should have a certain sparsity and simplicity, which means that we aim for a model that is as small as possible, but still predicts sufficiently accurately; i.e., we aim for a parsimonious model.

(d) Towards stakeholders, we should be able to explain the inner functioning of the model, and the results should intuitively make sense and be explainable.

(e) It should have good finite sample properties in estimation, so that all parts of the
model can be determined with credibility.

(f) We should be able to quantify prediction uncertainty.

(g) We should be able to manually change parts of the model to integrate expert
knowledge, if available and necessary.

(h) It should comply with regulation, and we should be able to verify this.

The starting point of a machine learner would probably be to just run the available data
through a gradient boosting machine (GBM) or a neural network, and then study its
outputs. As already discussed at the beginning of Section 1.4, this may not be the best
way of dealing with the problem, especially, in scarce data settings that may have a
large variability in their responses (class imbalance is a related buzzword). This is one
of the key differences between machine learning and actuarial data science, namely, the


actuary first tries to understand the (raw) data, and then designs an optimal architecture
according to the insights that she/he has gained from this initial data analysis. These insights can significantly improve predictive models; already choosing a more suitable strictly consistent loss function for mean estimation than the square loss can make a huge difference. This is the main motivation for us to first study the most important family
of distributions, the exponential dispersion family (EDF). This knowledge can then be
translated into optimal choices of the objective function for a certain type of data and
problem; in fact, this will justify the deviance loss choices (1.11)-(1.14). In this sense,
the statistical theory matters beyond algorithmic forecasting.

2.1 Exponential dispersion family


Distribution functions F (Y |X) play a surprisingly marginal role in regression model-
ing because most attention is paid to the estimation of the regression function. This
regression function estimation is almost solely based on the first moments and strictly
consistent loss functions, see Definition 1.3. Higher moments and distributions only
become relevant for uncertainty quantification. But even for this, one often does not
rely on explicit distributions, but one uses asymptotic normality results which typically
only require i.i.d. data with finite second moments. Only once one studies finite sample properties do the underlying distributions become more important; see the discussion in Section 1.4.3.

2.1.1 Introduction of the exponential dispersion family


We briefly introduce the EDF in this section and we give its most important properties.
This allows us to more specifically discuss the selection of the strictly consistent loss
function L for regression fitting. The EDF was introduced by Sir Fisher [68], and the
most relevant references are Jørgensen [110, 111, 112] and Barndorff-Nielsen [16]; this
short outline follows Wüthrich–Merz [243] and we also use the notation of that reference.
We say that Y ∼ EDF(θ, φ/v; κ) belongs to the EDF if it has a density of the form
$$Y \sim f_\theta(y)\, d\nu(y) = \exp\left(\frac{y\theta - \kappa(\theta)}{\varphi/v} + c(y, \varphi/v)\right) d\nu(y), \qquad (2.1)$$

for a σ-finite measure ν on R (determining the support of Y), with canonical parameter θ ∈ Θ in the effective domain Θ ⊆ R, for cumulant function
$$\kappa: \Theta \to \mathbb{R},$$
with dispersion parameter φ > 0, weight/volume v > 0, and normalizing function c(y, φ/v) such that the density in (2.1) integrates to one.

2.1.2 Cumulant function, mean and variance function


In any non-trivial setting, the effective domain Θ ⊆ R is a (possibly infinite) interval
with a non-empty interior Θ̊, and the cumulant function κ is infinitely often differentiable
(smooth) and strictly convex on Θ̊.


Furthermore, we have the first two moments


$$\mu_0 = \mathbb{E}[Y] = \kappa'(\theta) \qquad \text{and} \qquad \mathrm{Var}(Y) = \frac{\varphi}{v}\,\kappa''(\theta) = \frac{\varphi}{v}\,\kappa''\!\left((\kappa')^{-1}(\mu_0)\right), \qquad (2.2)$$

for the canonical parameters θ ∈ Θ̊.

This indicates the crucial role played by the cumulant function κ in the EDF.

The inverse function h := (κ′ )−1 is called the canonical link of the chosen EDF, and it
provides us with the identity

µ0 = E [Y ] = κ′ (θ) ⇐⇒ h(µ0 ) = h (E [Y ]) = θ. (2.3)

This motivates the definition of the mean parameter space
$$\kappa'(\mathring{\Theta}) = \left\{\kappa'(\theta);\; \theta \in \mathring{\Theta}\right\},$$
which is one-to-one with the interior of the effective domain Θ̊; thus, the EDF can either be parametrized by the canonical parameter θ or by its mean parameter µ0.
Finally, we introduce the variance function $\mu_0 \mapsto V(\mu_0) = (\kappa'' \circ h)(\mu_0)$, which has the property
$$\mathrm{Var}(Y) = \frac{\varphi}{v}\, V(\mu_0). \qquad (2.4)$$
All the models discussed in (1.11)-(1.14) and (1.15) are of this EDF type, with power variance function $V(\mu_0) = \mu_0^p$ for p ∈ R \ (0, 1); for simplicity we have set the weight v = 1 in these previous examples. This variance function V fully characterizes the cumulant function κ, provided it exists on the selected mean parameter space; see Jørgensen [112, Theorem 2.11] and Bar-Lev–Kokonendji [13, Section 2.4].

The EDF is attractive because it contains many popular statistical models used to
solve actuarial problems such as the Bernoulli, Gaussian, gamma, inverse Gaussian,
Poisson or negative binomial models. These examples are distinguished by the
choice of the cumulant function κ; Table 2.1 gives the most relevant examples.

EDF distribution                cumulant function κ(θ)                         mean µ0 = κ'(θ)
Gaussian                        θ²/2                                           θ
gamma                           −log(−θ)                                       −1/θ
inverse Gaussian                −√(−2θ)                                        1/√(−2θ)
Poisson                         e^θ                                            e^θ
negative binomial               −log(1 − e^θ)                                  e^θ/(1 − e^θ)
Tweedie's compound Poisson      ((1−p)θ)^{(2−p)/(1−p)}/(2−p),  p ∈ (1, 2)      ((1−p)θ)^{1/(1−p)}
Bernoulli                       log(1 + e^θ)                                   e^θ/(1 + e^θ)

Table 2.1: Commonly used examples of the EDF with corresponding means.
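As a quick numerical check of (2.2) for the Poisson member of Table 2.1 (the parameter choice θ = log 0.1 and v = 1, φ = 1 are illustrative assumptions): both the empirical mean and the empirical variance agree with κ'(θ) = κ''(θ) = e^θ.

import numpy as np

rng = np.random.default_rng(5)

theta = np.log(0.1)                     # illustrative canonical parameter (assumption)
kappa_prime = np.exp(theta)             # mean mu_0 = kappa'(theta)
kappa_second = np.exp(theta)            # variance (phi / v) * kappa''(theta) with phi = v = 1

Y = rng.poisson(np.exp(theta), size=1_000_000)
print("empirical mean    :", Y.mean(), "  kappa'(theta) :", kappa_prime)
print("empirical variance:", Y.var(),  "  kappa''(theta):", kappa_second)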


2.1.3 Maximum likelihood estimation and deviance loss


The special parametrization of the EDF through the canonical parameter θ makes this
family of distributions very attractive for MLE because maximization of the density (2.1)
in θ ∈ Θ is very simple. For a single observation Y (i.e., for sample size n = 1), we have
$$\hat{\theta}^{\mathrm{MLE}} = h(Y) \qquad \Longleftrightarrow \qquad \hat{\mu}_0^{\mathrm{MLE}} = Y; \qquad (2.5)$$

this statement only holds up to the possibly degenerate behavior at the boundary of Θ.1
This allows us to define the deviance loss function within the EDF.
Definition 2.1. Select Y ∼ EDF(θ, φ/v; κ) with steep cumulant function κ, see footnote
1 on page 34. The deviance loss function of the selected EDF is defined by
$$L(y, m) = 2\,\frac{\varphi}{v}\left(\log f_{h(y)}(y) - \log f_{h(m)}(y)\right) \;\ge\; 0, \qquad (2.6)$$

with m ∈ κ′ (Θ̊) and y in the convex closure of the support of the response Y .

Deviance losses for building actuarial models

The deviance losses provide an important way of building actuarial models.

• Every member of the EDF is characterized by its cumulant function κ.

• Using (2.6), the EDF density fθ (y) with cumulant function κ, given in (2.1),
is transformed into a deviance loss function L(y, m); this uses the canonical
link relation h(m) = θ, see (2.3).

• Maximizing the likelihood of the EDF density $f_\theta$ in the canonical parameter θ is equivalent to minimizing the deviance loss L(y, m) in the mean parameter m.

• The latter is precisely the property that motivated the choices of the deviance
losses (1.11)-(1.14), e.g., if the responses Y are Poisson distributed, we can
either perform MLE in the Poisson model or we can minimize the Poisson
deviance loss (1.12) providing us with the same model.

Table 2.2 presents the most popular EDF models in actuarial science and their deviance loss functions (2.6). Tweedie's CP refers to Tweedie's compound Poisson model that has a power variance function (1.15) with p ∈ (1, 2). This can be extended to p ∈ {0} ∪ [1, ∞), where p = 1, 2 has to be understood in the limiting sense for the cumulant function κ and the corresponding deviance loss L.² The power variance parameter p = 0 gives the Gaussian model, p = 1 the Poisson model, p = 2 the gamma model and p = 3 the inverse Gaussian model. These are the only models of the EDF with a power variance function for which the normalizing term c(y; φ/v) has a closed form; see Blæsild–Jensen [23].
¹ In fact, we need to be slightly more careful with statement (2.5). Generally, we request that the cumulant function κ is steep at the boundary of the effective domain Θ; see Barndorff-Nielsen [16, Theorem 9.2]. This aligns the mean parameter space with the convex closure of the support of the response Y. Then, (2.5) is correct up to the boundary of Θ. At the boundary we may get a degenerate model having a finite mean estimate $\hat{\mu}_0^{\mathrm{MLE}}$, but an undefined canonical parameter estimate.
² The cases p < 0 do not have steep cumulant functions κ and are therefore disregarded here.


EDF distribution        cumulant function κ(θ)                       deviance loss L(y, m)
Gaussian                θ²/2                                         (y − m)²
gamma                   −log(−θ)                                     2((y − m)/m + log(m/y))
inverse Gaussian        −√(−2θ)                                      (y − m)²/(m²y)
Poisson                 e^θ                                          2(m − y − y log(m/y))
negative binomial       −log(1 − e^θ)                                2(y log(y/m) − (y + 1) log((y + 1)/(m + 1)))
Tweedie's CP            ((1−p)θ)^{(2−p)/(1−p)}/(2−p),  p ∈ (1, 2)    2(y (y^{1−p} − m^{1−p})/(1 − p) − (y^{2−p} − m^{2−p})/(2 − p))
Bernoulli               log(1 + e^θ)                                 2(−y log(m) − (1 − y) log(1 − m))

Table 2.2: Commonly used examples of the EDF with corresponding deviance losses; the corresponding canonical links h are provided in Table 3.1, below.

This closed form becomes relevant for AIC and BIC, see (1.19) and (1.20), but also for the MLE of the dispersion parameter φ.

p distribution support of Y Θ κ′ (Θ̊)


p=0 Gaussian distribution R R R
p=1 Poisson distribution N0 R (0, ∞)
1<p<2 Tweedie’s CP distribution [0, ∞) (−∞, 0) (0, ∞)
p=2 gamma distribution (0, ∞) (−∞, 0) (0, ∞)
p>2 generated by positive stable distributions (0, ∞) (−∞, 0] (0, ∞)
p=3 inverse Gaussian distribution (0, ∞) (−∞, 0] (0, ∞)

Table 2.3: Tweedie's models for power variance parameters p ∈ {0} ∪ [1, ∞); this table is taken from Jørgensen [112].

We close this section with a technical remark. Table 2.3 gives the supports of the re-
sponses Y , the effective domains Θ and the mean parameter spaces κ′ (Θ̊) of Tweedie’s
distributions, having power variance function (1.15). In all these examples, the convex
closure of the support of Y is equal to the closure of the mean parameter space κ′ (Θ̊).
This is a characterization of steep cumulant functions κ; see Barndorff-Nielsen [16, The-
orem 9.2]. If Y is in the boundary of the mean parameter space, the deviance loss (2.6)
needs to be understood in the limiting sense; see Wüthrich–Merz [243, formula (4.8)].

2.2 Regression models


The general regression problem has already been introduced in Section 1.4. Namely, we
start from a candidate class M = {µ} of sufficiently nice regression functions µ : X → R.
Based on an i.i.d. learning sample L = (Yi , X i , vi )ni=1 , we aim at minimizing the in-sample
loss
$$\hat{\mu} \in \underset{\mu \in \mathcal{M}}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu(X_i)\right), \qquad (2.7)$$


for a strictly consistent loss function L for mean estimation. There is a (minor) change to (1.10), namely, we add factors $v_i/\varphi$ to obtain weighted losses in (2.7). The motivation for these weightings is that we assume we have selected a deviance loss function for L, implied by a distributional model coming from the EDF (2.1). In view of the variance behavior (2.2), this requires a suitable weighting of all the individual instances 1 ≤ i ≤ n. The weighting proposed in (2.7) is precisely the one that selects the MLE $\hat{\mu}^{\mathrm{MLE}}$ w.r.t. the log-likelihood function of the i.i.d. sample $(Y_i, X_i, v_i)_{i=1}^n$, with responses $Y_i$ following the corresponding (conditional) EDF (2.1), given means $\mu(X_i)$ and weights $v_i$, for 1 ≤ i ≤ n.³ In summary, as already mentioned before, deviance loss minimization in (2.7) is equivalent to MLE in the corresponding EDF.
This is the core of regression modeling with the recurrent goal of finding the true re-
gression function µ∗ , see (1.2). The different statistical and machine learning methods
mainly differ in selecting different classes M of candidate regression functions µ : X → R,
e.g., we can select a class of GLMs, of deep neural networks of a certain architecture, or
of regression trees. Some of these classes are non-parametric and others are parametric
families. Optimization (2.7) looks more like a non-parametric version; for a parametrized class of regression functions $\mathcal{M} = \{\mu_\vartheta\}_\vartheta$, we rather solve
$$\hat{\vartheta} \in \underset{\vartheta}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right). \qquad (2.8)$$

Example 2.2 (Poisson log-link GLM, revisited). We revisit Example 1.6. Assume $Y_i$, given $X_i$ and $v_i$, is Poisson distributed with log-link GLM regression function
$$X \mapsto \log(\mu_\vartheta(X)) = \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_j,$$
for regression parameter ϑ ∈ R^{q+1}, see (1.3). Assume the instances i are independent, and select the Poisson deviance loss for L. This gives the optimal solution
$$\hat{\vartheta}^{\mathrm{MLE}} = \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} 2 v_i \left(\mu_\vartheta(X_i) - Y_i - Y_i \log\left(\frac{\mu_\vartheta(X_i)}{Y_i}\right)\right).$$
Here, $Y_i = N_i/v_i$ are the observed claims frequencies, where $N_i \in \mathbb{N}_0$ denote the observed claims counts; see Remark 2.3. In the Poisson model we have φ = 1. Based on (2.6), we conclude that $\hat{\vartheta}^{\mathrm{MLE}}$ is the MLE of a Poisson log-link GLM that could also be obtained by maximizing the corresponding log-likelihood function. ■

³ There is a subtle point that should be highlighted in the notation of the i.i.d. sample $(Y_i, X_i, v_i)_{i=1}^n$.
The i.i.d. property concerns the random sampling of the entire triple (Yi , X i , vi ) including the weight
vi . In this sense, there is a slight abuse of notation here because one should use a capital letter for the
(random) weight. We have decided to use a small letter to have the identical notation as in the (standard)
definition of the EDF (2.1). Having the weight vi as a random variable moreover implies that for the
response distribution of Yi one always needs to condition on X i and vi ; this is the typical situation
because, generally, the joint distribution of (Yi , X i , vi ) is not of interest. For notational convenience (and
to align with the standard EDF definition) we have not done so in the text, see, e.g., formula (3.3). With
this said, the (implicit) randomness in vi may play a role, but this will then be clear from the context.
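A minimal sketch of Example 2.2: the weighted Poisson deviance loss is minimized directly, and the result agrees with the Poisson GLM fit of scikit-learn (which performs the equivalent MLE); the simulated exposures, counts and covariates are assumptions for illustration.

import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(6)

# illustrative data (assumptions): exposures v, counts N, frequencies Y = N / v
n, q = 5000, 2
X = rng.normal(size=(n, q))
v = rng.uniform(0.5, 1.5, size=n)
N = rng.poisson(v * np.exp(-2.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1]))
Y = N / v

def objective(theta):
    # average weighted Poisson deviance loss of Example 2.2 (phi = 1)
    mu = np.exp(theta[0] + X @ theta[1:])
    dev = 2.0 * (mu - Y + xlogy(Y, Y / mu))         # xlogy(0, .) = 0 handles Y = 0
    return np.mean(v * dev)

theta_hat = minimize(objective, x0=np.zeros(q + 1), method="BFGS").x
glm = PoissonRegressor(alpha=0.0, max_iter=1000).fit(X, Y, sample_weight=v)

print("deviance minimization:", theta_hat.round(3))
print("Poisson GLM (MLE)    :", np.r_[glm.intercept_, glm.coef_].round(3))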


Remark 2.3. From (2.2) we observe that the expected value of Y is invariant under changes of the volume/weight v > 0, and the variance is inversely proportional to this volume parameter. This indicates that the responses Y within the EDF consider normalized quantities, see Example 2.2. In Jørgensen [112], the normalized quantities correspond to the so-called reproductive form. ■

These are the basics of regression modeling. The selected class M of candidate models will
also depend on the purpose of its use. In some cases, we want to use regression techniques
to explain (or understand) the specific impact of the covariates X on the responses Y . For
instance, does a new medication (reflected in X) have a positive impact on the state of health (reflected in Y)?⁴ In other cases, we are just aiming for optimal predictive accuracy, being less interested in the specific impact of X on Y. An interesting discussion on
explain vs. predict is given in Shmueli [207]. For actuarial pricing, we are somehow
in between the two requirements. We would like to have very accurate (risk-adjusted)
pricing schemes, however, stakeholders like customers, management and regulators want
to know the risk factors and how they impact the prices. That is why actuaries typically
have to compromise between explain vs. predict, and the extent to which this is necessary
depends on societies, countries, different regulations and companies’ business strategies.

2.3 Covariate pre-processing


In this section, we discuss covariate pre-processing in the case of tabular input data. In
the previous sections, we have always been speaking about q-dimensional real-valued co-
variates $X = (X_1, \ldots, X_q)^\top$, being supported in a covariate space $\mathcal{X} \subseteq \mathbb{R}^q$. Categorical covariate pre-processing makes up a major part of actuarial modeling because of the abundance of this kind of information within traditional actuarial domains; for example, we are thinking of vehicle brands, vehicle models, vehicle details, etc. Such information needs to be pre-processed to real-valued vectors so that it can be used in regression models. Not only categorical variables need pre-processing; it can also be that real-valued/continuous input variables need pre-processing. This is going to be discussed in this section.

2.3.1 Notation
We start with a few words on the notation. The q-dimensional real-valued covariate information is denoted by $X = (X_1, \ldots, X_q)^\top$, and it takes values in the covariate space $\mathcal{X} \subseteq \mathbb{R}^q$. If we have a sample $(X_i)_{i=1}^n \subset \mathcal{X}$ of such covariates, we can stack them into a design matrix. For this we typically extend the covariates $X_i$ by an initial component (bias component, zero component) being equal to one, to obtain the new (extended) covariates
$$X_i = (1, X_{i,1}, \ldots, X_{i,q})^\top \in \mathbb{R}^{q+1}; \qquad (2.9)$$
this uses a slight abuse of notation because we do not indicate in X i whether it includes
the bias component or not. However, this will always be clear from the context.
⁴ We have purposefully avoided a causal language here, since causal statements can only be made from statistical models in certain circumstances.


The design matrix is defined by
$$\mathfrak{X} = [X_1, \ldots, X_n]^\top = \begin{pmatrix} 1 & X_{1,1} & \cdots & X_{1,q} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \cdots & X_{n,q} \end{pmatrix} \in \mathbb{R}^{n \times (q+1)}. \qquad (2.10)$$

This collects all covariates X i of all instances 1 ≤ i ≤ n on the rows, and it adds an
initial bias column being identically equal to one. This is the input information in tabular
form; it has the shape of a matrix, which is equal to a tensor of order 2 (also called 2D
tensor). It is important to highlight that the different fonts used for the covariate vector X, the covariate space $\mathcal{X}$, the design matrix $\mathfrak{X}$ and the scalar covariate components have different meanings.

2.3.2 Categorical covariates


Categorical covariates (nominal or ordinal) need pre-processing to bring them into a
numerical form. This is done by an entity embedding which we are going to discuss in
this section.
Consider a categorical covariate X that takes values in a finite set A = {a1 , . . . , aK }
having K levels. The running example in this section will have K = 6 levels

A = {accountant, actuary, economist, quant, statistician, underwriter} . (2.11)

Ordinal encoding

If we have ordinal (ordered) levels $(a_k)_{k=1}^K$, we can use a one-dimensional ordinal entity embedding
$$X \in \mathcal{A} \mapsto \sum_{k=1}^{K} k\, \mathbf{1}_{\{X = a_k\}}. \qquad (2.12)$$

This assigns to each level ak the corresponding integer k ∈ N. In our running example
(2.11) we have an alphabetical ordering which can be regarded as ordinal. This provides
the one-dimensional ordinal entity embedding given in Table 2.4.

accountant 1
actuary 2
economist 3
quant 4
statistician 5
underwriter 6

Table 2.4: One-dimensional ordinal entity embedding.

One-hot encoding

One may argue that this ordinal (alphabetic) order does not make much sense for risk
classification, and one should treat these variables rather as nominal variables. The first


solution for a numerical embedding of nominal variables is one-hot encoding which maps
each level ak to a basis vector in RK resulting in a K-dimensional entity embedding
$$X \in \mathcal{A} \mapsto \left(\mathbf{1}_{\{X=a_1\}}, \ldots, \mathbf{1}_{\{X=a_K\}}\right)^\top \in \mathbb{R}^K. \qquad (2.13)$$

Table 2.5 illustrates one-hot encoding (as row vectors).

accountant 1 0 0 0 0 0
actuary 0 1 0 0 0 0
economist 0 0 1 0 0 0
quant 0 0 0 1 0 0
statistician 0 0 0 0 1 0
underwriter 0 0 0 0 0 1

Table 2.5: One-hot encoding (as row vectors).

Dummy coding

One-hot encoding does not lead to full rank design matrices because there is a redundancy. If we know that X does not take any of the first K − 1 levels $(a_k)_{k=1}^{K-1}$, it is immediately clear that it has to be of the last level $a_K$. If full rank design matrices are an important prerequisite, one should therefore change to dummy coding (note that for GLMs full rank design matrices are important, but not for neural networks, as is going to be explained below). For dummy coding one selects a reference level, e.g., $a_2$ = actuary. Based on this selection, all other levels are measured relative to this reference level
$$X \in \mathcal{A} \mapsto \left(\mathbf{1}_{\{X=a_1\}}, \mathbf{1}_{\{X=a_3\}}, \mathbf{1}_{\{X=a_4\}}, \ldots, \mathbf{1}_{\{X=a_K\}}\right)^\top \in \mathbb{R}^{K-1}. \qquad (2.14)$$

Table 2.6 illustrates dummy coding (as row vectors).

accountant 1 0 0 0 0
actuary 0 0 0 0 0
economist 0 1 0 0 0
quant 0 0 1 0 0
statistician 0 0 0 1 0
underwriter 0 0 0 0 1

Table 2.6: Dummy coding (as row vectors).

In actuarial practice, usually the level with the biggest exposure is chosen as reference
level.

Entity embedding

One-hot encoding and dummy coding lead to so-called sparse design matrices X, meaning
that most of the entries will be zero by these encoding schemes if we have many categorical
covariates with many levels. Such sparse design matrices can lead to issues in statistical


modeling and model estimation, e.g., resulting estimated model parameters may not be
credible, and matrices may not be well-conditioned because of too sparse levels which can
cause numerical issues when inverting these matrices. Therefore, less sparse encodings
are considered. One should note that in one-hot encoding and dummy coding there is no
notion of adjacency and similarity. However, it might be that some job profiles have a
more similar risk behavior than others; another popular actuarial example are car brands,
certainly sports car brands have a more similar claims behavior than car brands that
typically produce family cars. Borrowing ideas from natural language processing (NLP),
one should therefore consider low(er)-dimensional entity embeddings where proximity is
related to similarity; see by Brébisson et al. [27], Guo–Berkhahn [88], Richman [186, 187],
Delong–Kozak [49] and Richman–Wüthrich [192].
Select an embedding dimension b ∈ N; this is a hyper-parameter that needs to be selected by the modeler, typically b ≪ K. We define an entity embedding (EE) as follows

eEE : A → Rb , X 7→ eEE (X). (2.15)

This assigns to each level ak ∈ A an embedding vector eEE (ak ) ∈ Rb . In total this
entity embedding involves b · K parameters (called embedding weights). These need to
be determined either by the modeler (manually) or during the model fitting procedure
(algorithmically), and proximity in embedding should reflect similarity in (risk) behavior.

finance maths statistics liabilities


accountant 0.5 0 0 0
actuary 0.5 0.3 0.5 0.5
economist 0.5 0.2 0.5 0
quant 0.7 0.3 0.3 0
statistician 0 0.5 0.8 0
underwriter 0 0.1 0.1 0.8

Table 2.7: Entity embedding with embedding dimension b = 4.

Table 2.7 illustrates an entity embedding with embedding dimension b = 4 < K = 6.


This is a manually chosen example with b · K = 24 parameters. Typically, in machine
learning, these embedding weights are part of model fitting (learning); we will come back
to entity embedding in Section 8.2, below.
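The following sketch implements the ordinal encoding (2.12), one-hot encoding (2.13), dummy coding (2.14) and an entity embedding (2.15) for the running example (2.11); the embedding weights are drawn at random here purely as a placeholder (in practice they are chosen manually, as in Table 2.7, or learned during model fitting).

import numpy as np

levels = ["accountant", "actuary", "economist", "quant", "statistician", "underwriter"]
K = len(levels)

def ordinal_encode(x):                              # (2.12)
    return levels.index(x) + 1

def one_hot_encode(x):                              # (2.13)
    return np.eye(K)[levels.index(x)]

def dummy_encode(x, reference="actuary"):           # (2.14), reference level a_2 = actuary
    return np.delete(one_hot_encode(x), levels.index(reference))

rng = np.random.default_rng(7)
embedding = {a: rng.normal(size=4) for a in levels} # placeholder b = 4 dimensional embedding weights

x = "actuary"
print("ordinal  :", ordinal_encode(x))
print("one-hot  :", one_hot_encode(x))
print("dummy    :", dummy_encode(x))
print("embedding:", embedding[x].round(2))          # entity embedding e^EE(x), see (2.15)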

Target encoding

Especially for regression trees, one sometimes uses target encoding, meaning that one
does not only consider the categorical covariate, but also the corresponding response.
We assume we have a sample $(Y_i, X_i, v_i)_{i=1}^n$ with categorical covariates $X_i \in \mathcal{A}$, real-valued responses $Y_i$ and weights $v_i > 0$. We compute the weighted sample means on all levels $a_k \in \mathcal{A}$ by
$$\bar{y}_k = \frac{\sum_{i=1}^{n} v_i Y_i \mathbf{1}_{\{X_i = a_k\}}}{\sum_{i=1}^{n} v_i \mathbf{1}_{\{X_i = a_k\}}}.$$


These weighted sample means $(\bar{y}_k)_{k=1}^K$ are used like ordinal levels, replacing the nominal ones, and we obtain, similarly to (2.12), the one-dimensional target encoding embedding
$$X \in \mathcal{A} \mapsto \sum_{k=1}^{K} \bar{y}_k\, \mathbf{1}_{\{X = a_k\}}. \qquad (2.16)$$

Though convincing at first sight, one has to be aware of the fact that this does not consider any interactions within the covariates, e.g., for scarce levels it may happen that a high or low value is mainly implied by another covariate, and the resulting target encoding value (marginal value) $\bar{y}_k$ is misleading. This is especially an issue in regression tree constructions if some of the leaves of the regression tree only contain very few instances (under high-cardinality categorical covariates). A method to deal with scarce levels is to combine this target encoding scheme with Bühlmann credibility [34]; see also Micci-Barreca [155]. For this, we try to assess how credible the individual estimates $\bar{y}_k$ are, and we improve unreliable ones by mixing them with the global weighted empirical mean $\bar{y} = \sum_{i=1}^{n} v_i Y_i / \sum_{i=1}^{n} v_i$, providing a convex credibility combination
$$\bar{y}_k^{\,\mathrm{cred}} = \omega_k\, \bar{y}_k + (1 - \omega_k)\, \bar{y}, \qquad (2.17)$$

with credibility weights, for 1 ≤ k ≤ K,
$$\omega_k = \frac{\sum_{i=1}^{n} v_i \mathbf{1}_{\{X_i = a_k\}}}{\sum_{i=1}^{n} v_i \mathbf{1}_{\{X_i = a_k\}} + \tau} \in [0, 1],$$
and shrinkage parameter τ ≥ 0. This shrinkage parameter is a hyper-parameter, also called credibility coefficient in Bühlmann credibility. The bigger the credibility coefficient, the closer the shrunk value $\bar{y}_k^{\,\mathrm{cred}}$ is to the global average $\bar{y}$.
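A minimal sketch of the credibility-based target encoding (2.17), on simulated data; the response distribution, the exposure weights and the choice τ = 50 are illustrative assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(8)

levels = ["accountant", "actuary", "economist", "quant", "statistician", "underwriter"]
n = 1000
X = rng.choice(levels, size=n, p=[0.3, 0.3, 0.2, 0.1, 0.05, 0.05])
v = rng.uniform(0.5, 1.5, size=n)
Y = rng.gamma(shape=2.0, scale=1.0, size=n)

df = pd.DataFrame({"X": X, "v": v, "vY": v * Y})
agg = df.groupby("X")[["v", "vY"]].sum()

y_bar_k = agg["vY"] / agg["v"]                      # weighted level means, used in (2.16)
y_bar = df["vY"].sum() / df["v"].sum()              # global weighted mean

tau = 50.0                                          # shrinkage / credibility coefficient (hyper-parameter)
omega_k = agg["v"] / (agg["v"] + tau)               # credibility weights
y_cred_k = omega_k * y_bar_k + (1 - omega_k) * y_bar    # credibility target encoding (2.17)

print(pd.DataFrame({"y_bar_k": y_bar_k, "omega_k": omega_k, "y_cred_k": y_cred_k}).round(3))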

2.3.3 Continuous covariates


In theory, continuous covariates do not need any pre-processing. However, in practice, it
might be that the continuous covariates do not provide the right functional form, or they
may live on the wrong scale. For example, we may replace a positive covariate X > 0 by
a 4-dimensional pre-processed covariate
$$X \mapsto \left(X, \log(X), \exp(X), (X - 10)^2\right)^\top.$$

This has a linear, a logarithmic, an exponential and a non-monotone quadratic term.

Standardization

Often, one standardizes covariates. Assume that we have n instances with a continuous covariate $(X_i)_{i=1}^n$. Standardization considers the transformation
$$X \mapsto \frac{X - \hat{m}}{\hat{s}}, \qquad (2.18)$$
where $\hat{m} \in \mathbb{R}$ is the empirical mean and $\hat{s} > 0$ the empirical standard deviation of $(X_i)_{i=1}^n$.


MinMaxScaler

The MinMaxScaler is given by the transformation

$$X \mapsto 2\, \frac{X - \min_{1 \le i \le n} X_i}{\max_{1 \le i \le n} X_i - \min_{1 \le i \le n} X_i} - 1. \qquad (2.19)$$

Categorization

Finally, especially in GLMs, one often discretizes continuous covariates by binning them
into categorical classes, e.g., one builds age classes. This is often done to provide more
robust functional forms to particular covariates within a GLM framework; of course, this
could also be achieved by splines. Select a finite partition $(I_k)_{k=1}^K$ of the support of the continuous covariate X. Then, we can assign a categorical value $a_k \in \mathcal{A}$ to X if it falls into $I_k$, that is,
$$X \mapsto \sum_{k=1}^{K} a_k\, \mathbf{1}_{\{X \in I_k\}}. \qquad (2.20)$$

This categorization then allows one to treat the continuous covariate X as a categorical
one, e.g., using dummy encoding in regression modeling; this is very frequently used in
actuarial GLMs.
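The three transformations (2.18)-(2.20) are easily coded; the age variable and the bin boundaries below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(9)
age = rng.uniform(18, 90, size=10)                  # illustrative continuous covariate (assumption)

standardized = (age - age.mean()) / age.std(ddof=1)                # standardization (2.18)
minmax = 2 * (age - age.min()) / (age.max() - age.min()) - 1       # MinMaxScaler (2.19)

bins = [18, 25, 35, 50, 65, 91]                     # partition (I_k) of the support (assumption)
age_class = np.digitize(age, bins)                  # categorization (2.20): index k with age in I_k

print(standardized.round(2))
print(minmax.round(2))
print(age_class)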

2.4 Regularization and sparsity


2.4.1 Introduction and overview
Generally speaking, a regression model X 7→ µ(X) is correctly specified if it integrates
all relevant covariates in the correct form, and if there is no redundant, missing or wrong
covariate in the regression function. Redundant means that we include multiple covariates
that essentially present the same information (e.g., being collinear), missing means that
we have forgotten important covariate information that would be available, and wrong
means that the covariate does not impact the response. In real world problems, with
unknown regression functions, this is part of covariate pre-processing, covariate selection
and model selection.
One might be tempted to include any available information into the regression model,
to ensure that nothing gets forgotten. However, fitting on finite samples, this is usually
not a good strategy, because one may easily run into over-fitting issues, difficulties with
collinearity in covariates (also resulting in in-sample over-fitting), in identifiability issues,
and, generally, in a poor predictive model because model fitting has not been successful.
Model complexity control is crucial in a successful model fitting and predictive modeling
procedure, and, typically, one aims for a sparse model, meaning that one aims for a
parsimonious model having the fewest necessary variables but still providing accurate
predictions.
In this sense, sparse regression is an umbrella term for searching for small (parsimonious)
models by penalizing large models and enforcing the fitting procedure to perform variable
selection by only including the (in some sense) most significant information into the
regression function. Typically, only having the most significant variables in the model


provides the lowest generalization error under finite sample fitting, it avoids in-sample
noise (over-)fitting, and it is very beneficial in explaining a model. We have already
touched upon this topic in the AIC model selection (1.19), that quantifies a penalization
for model complexity under MLE.
In practice, neither knowing the true model nor the (causal) regression structure and the
factors that impact the responses, one often starts with slightly too large models, and one
tries to shrink them by regularization. Regularization penalizes model complexity and/or
extreme regression coefficients; typically, a zero coefficient means that the corresponding
term is dropped from the regression function, making a model more sparse.
The most popular regularization techniques include ridge regularization (also known as
Tikhonov regularization [219] or L2 -regularization), the LASSO regularization of Tibshi-
rani [217] (also known as L1 -regularization) and the elastic net regularization of Zou–
Hastie [250]. Furthermore, there are more specialized techniques like the fused LASSO
of Tibshirani et al. [218] for ordered features, or the group LASSO by Yuan–Lin [246],
which we are also going to present in this section. There are more methods, e.g., smoothly
clipped absolute deviation (SCAD) regularization of Fan–Li [66], which are less relevant
for our purposes. An excellent reference on sparse regression is the monograph of Hastie
et al. [94].

2.4.2 Regularization
We come back to the parametric regression estimation problem (2.8). It involves selecting the optimal regression function $\mu_{\hat{\vartheta}}$ from a class of candidate models $\mathcal{M} = \{\mu_\vartheta\}_\vartheta$ that are parametrized by ϑ. Let us assume that the parameter ϑ is an (r + 1)-dimensional vector
$$\vartheta = (\vartheta_0, \vartheta_1, \ldots, \vartheta_r)^\top \in \mathbb{R}^{r+1},$$
where we typically assume that $(\vartheta_j)_{j=1}^r$ parametrizes the terms in $\mu_\vartheta(X)$ that involve the covariates X, and ϑ0 is a parameter for the covariate-free part of the regression function that determines the overall level of the regression; ϑ0 is referred to as the bias term, and this is best understood from the log-link GLM structure (1.3). For reasons explained
in Section 4.1, below, this bias term ϑ0 should always be excluded from regularization.
In other words, if we regularize the entire regression parameter ϑ, we drop out of the
framework of strictly consistent loss functions, even if the selected loss function L is
strictly consistent. Denote by ϑ\0 = (ϑ1 , . . . , ϑr )⊤ ∈ Rr the parameter vector excluding
the bias term ϑ0 .
A regularized parameter estimation is achieved by considering the optimization problem
$$\hat{\vartheta} \in \underset{\vartheta}{\arg\min}\left(\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right) + \eta\, R(\vartheta_{\setminus 0})\right), \qquad (2.21)$$

where $R: \mathbb{R}^r \to \mathbb{R}_+$ is a penalty function and η ≥ 0 is the regularization parameter. We will give the most common examples below. There is one noteworthy change from (2.8)
to (2.21). The former objective function is an empirical version of the strictly consistent
scoring problem given in (1.8), and this empirical version naturally scales with 1/n. This
scaling does not affect the optimal solution in (2.8). For the regularized version (2.21)
we prefer to drop this factor in order to make the regularization parameter η scale-free.


Note that for larger sample sizes n, regularization should be weaker which is naturally
achieved by dropping any scaling 1/n in (2.21).

2.4.3 Ridge regularization


Ridge regularization or ridge regression selects a squared L2 -norm penalty function for R
$$\hat{\vartheta}^{\,\mathrm{ridge}} \in \underset{\vartheta}{\arg\min}\left(\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right) + \eta\, \|\vartheta_{\setminus 0}\|_2^2\right). \qquad (2.22)$$

Ridge regularization generally punishes large values in $(\hat{\vartheta}_j^{\,\mathrm{ridge}})_{j=1}^r$; this is called shrinkage. The level of shrinkage is determined by the regularization parameter η ≥ 0, and an optimal choice can be determined, e.g., by cross-validation. This intuition of shrinkage also indicates why we exclude the bias term ϑ0 from regularization.
Example 2.4 (Ridge regularized Poisson log-link GLM). We revisit Example 2.2 which
considers a Poisson log-link GLM with GLM regression function
$$X \mapsto \log(\mu_\vartheta(X)) = \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_j.$$
The ridge regularized model is found by solving
$$\hat{\vartheta}^{\,\mathrm{ridge}} = \underset{\vartheta \in \mathbb{R}^{q+1}}{\arg\min}\left(\sum_{i=1}^{n} 2 v_i \left(\mu_\vartheta(X_i) - Y_i - Y_i \log\left(\frac{\mu_\vartheta(X_i)}{Y_i}\right)\right) + \eta \sum_{j=1}^{q} \vartheta_j^2\right).$$

In this case we have parameter dimension r = q. ■

2.4.4 LASSO regularization


LASSO regularization (least absolute shrinkage and selection operator regularization) selects an L1-norm penalty function for R
$$\hat{\vartheta}^{\,\mathrm{LASSO}} \in \underset{\vartheta}{\arg\min}\left(\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right) + \eta\, \|\vartheta_{\setminus 0}\|_1\right). \qquad (2.23)$$
The behavior of LASSO regularization is fundamentally different from ridge regularization. This difference results from the fact that the squared L2-norm is differentiable in the origin and the L1-norm is not, see Figure 2.1 (lhs). The consequence of this degeneracy of the L1-norm in the origin is that some of the components of $(\hat{\vartheta}_j^{\,\mathrm{LASSO}})_{j=1}^r$ may be optimal in (2.23) if they are exactly equal to zero, and the more we increase the regularization parameter η, the more components of $(\hat{\vartheta}_j^{\,\mathrm{LASSO}})_{j=1}^r$ are optimally set to exactly zero. Thus, increasing the regularization parameter η leads to sparsity of non-zero values in $(\hat{\vartheta}_j^{\,\mathrm{LASSO}})_{j=1}^r$, and, consequently, there results a sparse regression model for large regularization parameters η. For this reason, LASSO regularization is a very popular method for variable selection, because only the most significant terms for prediction will remain in the model for large regularization parameters η, and we may interpret this as a kind of variable importance. On the other hand, ridge regression just generally shrinks parameters, but it does not set them exactly to zero. Note that LASSO regularization (2.23) is often used for variable selection only, and after this selection, a non-regularized regression model is fitted on the selected variables to obtain a higher level of discrimination on the selected variables.
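The contrast between shrinkage (2.22) and sparsity (2.23) is easy to see numerically; the sketch below uses a Gaussian linear regression with the square loss and scikit-learn's Ridge and Lasso (whose regularization parameters are scaled somewhat differently from η above), with simulated data in which only two of eight covariates are informative - all of this is an illustrative assumption.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(10)

n, r = 500, 8
X = rng.normal(size=(n, r))
Y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

for eta in (0.01, 0.1, 0.5):
    lasso = Lasso(alpha=eta).fit(X, Y)              # L1 penalty; the intercept (bias) is not penalized
    ridge = Ridge(alpha=eta * n).fit(X, Y)          # L2 penalty, roughly rescaled for comparability
    print(f"eta = {eta}")
    print("  LASSO:", lasso.coef_.round(2))         # exact zeros appear for larger eta
    print("  ridge:", ridge.coef_.round(2))         # only shrunk towards zero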


Figure 2.1: Ridge vs. LASSO penalization (lhs: penalty functions; rhs: feasible sets of solutions in the coordinates (ϑ1, ϑ2)).

2.4.5 Best-subset selection regularization


Best-subset selection regularization selects an L0-norm penalty function for R
$$\hat{\vartheta}^{\,\mathrm{BSS}} \in \underset{\vartheta}{\arg\min}\left(\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right) + \eta \sum_{j=1}^{r} \mathbf{1}_{\{\vartheta_j \ne 0\}}\right). \qquad (2.24)$$
This version is not used very often in applications, mainly because optimizing (2.24) is difficult; in fact, LASSO regularization (2.23) is considered a tractable alternative.

2.4.6 Elastic net regularization


Elastic net regularization combines LASSO and ridge regularization by setting
$$\hat{\vartheta}^{\,\mathrm{ElastNet}} \in \underset{\vartheta}{\arg\min}\left(\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right) + \eta\left(\alpha \|\vartheta_{\setminus 0}\|_1 + (1 - \alpha) \|\vartheta_{\setminus 0}\|_2^2\right)\right), \qquad (2.25)$$

with α ∈ [0, 1]. The elastic net regularization overcomes some issues of LASSO, e.g.,
LASSO does not necessarily provide a unique solution for a linear regression problem
with square loss function L. It also has the tendency to group effects by assigning similar
weights to correlated covariate components.
We give some further remarks to these standard regularization methods.
• In case of a linear regression with the square loss function, there is always a unique
(closed-form) solution to the ridge regression problem, because we minimize a con-
vex (quadratic) objective function. The LASSO regression is more complicated,
uniqueness is not guaranteed, and it is typically solved by the method of Karush–
Kuhn–Tucker (KKT) [114, 129], and using the so-called soft-thresholding operator;
see Hastie et al. [94].

• The best-subset selection regression does not lead to a convex minimization prob-
lem, nor the SCAD regression.


• Generally, these regularized regression problems can be interpreted geometrically using the method of Lagrange. The feasible set of solutions in the ridge regression case is an L2-ball, and in the LASSO regression case an L1-cube; see Figure 2.1 (rhs). This geometric difference (having corners or not) precisely distinguishes the sparsity of LASSO from the shrinkage of ridge regression, i.e., it is related to the difference of these convex geometric sets having differentiable boundaries or not.

• The above regularization methods can also be understood in a Bayesian context. E.g., focusing on ridge regression (2.22): if we replace the strictly consistent loss function by a negative log-likelihood function of the responses, given the covariates, then the penalty term in (2.22) can be interpreted as a Gaussian prior distribution on the parameter $\vartheta_{\setminus 0}$. The resulting estimate $\hat{\vartheta}^{\mathrm{MAP}}$ is then called the maximum-a-posteriori (MAP) estimator, which results from maximizing the posterior distribution of ϑ, given the sample $(Y_i, X_i, v_i)_{i=1}^n$.

• Following up on the previous item, if we interpret these regularization problems in a Bayesian context, we can use methods different from MAP estimation for model fitting. One could compute the posterior distribution of the parameter given the observations, e.g., using Markov chain Monte Carlo (MCMC) methods. This has the advantage that one can also (easily) quantify parameter estimation uncertainty. For computational reasons, in most machine learning applications this full posterior analysis is not feasible, and the MCMC methods do not converge on large parameter sets as the MCMC chains do not mix well across the entire parameter space in high-dimensional problems. Therefore, the MAP estimator or variational Bayes (VB) methods are studied as approximations.

2.4.7 Group and fused LASSO regularizations


The previous regularization proposals have been focusing on shrinking the regression parameters $(\vartheta_j)_{j=1}^r$ and/or on making regression models sparse by setting some of these parameters $\vartheta_j$ exactly to zero. In these regularization methods all terms and parameters $(\vartheta_j)_{j=1}^r$ are considered individually. However, it could be that we want to treat some of them jointly in a group, e.g., if we have a dummy coded categorical covariate (2.14), regularization should act simultaneously on all parameters that belong to that categorical covariate. For this, we group the (categorical) covariates. Assume we have G groups
$$\vartheta_{\setminus 0} = (\vartheta^{(1)}, \ldots, \vartheta^{(G)})^\top \in \mathbb{R}^{d_1} \times \cdots \times \mathbb{R}^{d_G} = \mathbb{R}^r,$$
where each group $\vartheta^{(k)} \in \mathbb{R}^{d_k}$ contains $d_k$ components of the regression parameter $\vartheta_{\setminus 0}$.
Group LASSO regularization is obtained by solving
$$\hat{\vartheta} \in \underset{\vartheta}{\arg\min}\left(\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right) + \sum_{k=1}^{G} \eta_k\, \|\vartheta^{(k)}\|_2\right), \qquad (2.26)$$

for regularization parameters $\eta_k \ge 0$. For increasing regularization parameters $\eta_k$, group LASSO regularization leads to sparsity in the sense of setting the entire parameter block $\vartheta^{(k)} \in \mathbb{R}^{d_k}$ simultaneously equal to zero.


The next regularization version we discuss is related to graduation in mortality table smoothing. Assume that we have an adjacency relation on the covariate components, i.e., the covariate component $X_j$ is naturally embedded between the components $X_{j-1}$ and $X_{j+1}$, and the lower index j is not set in an arbitrary order. In that case, we may
want the corresponding regression parameters ϑj to be similar for adjacent variables,
assuming for the moment that the parameter ϑj corresponds to covariate component Xj .
This motivates the fused LASSO regularization.
Fused LASSO regularization is obtained by solving
 
$$\hat{\vartheta} \in \underset{\vartheta}{\arg\min}\left(\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right) + \sum_{j=2}^{r} \eta_j\, |\vartheta_j - \vartheta_{j-1}|\right), \qquad (2.27)$$

for regularization parameters ηj ≥ 0. The fused LASSO proposal enforces sparsity in the
non-zero components, but also sparsity in the different regression parameter values for
adjacent variables. It considers first order differences, which are related to derivatives of
functions, but one could also consider second (or higher) order differences.
Finally, one may want to enforce that parameters are positive or that first differences are positive. This suggests considering, respectively, for positivity
$$\hat{\vartheta} \in \underset{\vartheta}{\arg\min}\left(\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right) + \sum_{j=1}^{r} \eta_j\, (0 - \vartheta_j)_+\right),$$
and to enforce a monotone increasing property
$$\hat{\vartheta} \in \underset{\vartheta}{\arg\min}\left(\sum_{i=1}^{n} \frac{v_i}{\varphi}\, L\left(Y_i, \mu_\vartheta(X_i)\right) + \sum_{j=1}^{r} \eta_j\, (\vartheta_{j-1} - \vartheta_j)_+\right),$$
where the function
$$u \mapsto (u)_+ = \max\{u, 0\} = u\, \mathbf{1}_{\{u > 0\}}, \qquad (2.28)$$
considers only the positive part. This function is called rectified linear unit (ReLU) func-
tion in machine learning, but it is also known as the ramp function or the hinge function,
and it is the crucial object in European call and put pricing in financial mathematics.
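The penalty terms of (2.26)-(2.28) are simple to implement; the sketch below only evaluates them for an illustrative parameter vector (an assumption) - in practice they would be added to the weighted deviance loss as in (2.21).

import numpy as np

def relu(u):                                        # (2.28), the positive part (u)_+
    return np.maximum(u, 0.0)

def group_lasso_penalty(theta, groups, eta):        # penalty term of (2.26)
    return sum(e * np.linalg.norm(theta[g]) for g, e in zip(groups, eta))

def fused_lasso_penalty(theta, eta):                # penalty term of (2.27)
    return np.sum(eta * np.abs(np.diff(theta)))

def positivity_penalty(theta, eta):                 # penalizes negative parameters
    return np.sum(eta * relu(-theta))

def monotonicity_penalty(theta, eta):               # penalizes decreasing adjacent parameters
    return np.sum(eta * relu(-np.diff(theta)))

theta = np.array([0.2, 0.1, -0.3, 0.4, 0.4])        # illustrative parameter vector (assumption)
groups = [np.array([0, 1]), np.array([2, 3, 4])]

print("group LASSO :", group_lasso_penalty(theta, groups, eta=[1.0, 1.0]))
print("fused LASSO :", fused_lasso_penalty(theta, eta=1.0))
print("positivity  :", positivity_penalty(theta, eta=1.0))
print("monotonicity:", monotonicity_penalty(theta, eta=1.0))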

2.5 Outlook
We have introduced all the technical and mathematical tools to now dive into predictive modeling. Roughly speaking, there are three main types of regression model classes that are used for actuarial modeling. (1) There are parametric GLM-type models; these include feed-forward neural networks, recurrent and convolutional neural networks as well as transformers. These models have in common that the architecture is fixed before model fitting, and this gives a fixed number of parameters ϑ to be determined. (2) There are non-parametric regression tree-type models; these include random forests and gradient boosting machines. These models have in common that we do not start from a fixed complexity, but we let the models grow during fitting by searching for more structure in the data. (3) There are nearest neighbor and kernel based models. These models are


based on topologies and adjacency relations, under the assumption that locally we have
similar predictions, and these predictions can be obtained by locally smoothing the noisy
responses. For example, local regression and isotonic regression belong to this class.



Chapter 3

Generalized linear models

Generalized linear models (GLMs) are the core models in predictive modeling, and, still
today, they are the state-of-the-art in practice for solving actuarial and financial problems
because they have many advantages over more advanced (and more complicated) machine
learning models; we refer to the introduction to Chapter 2. GLMs were introduced in
1972 by Nelder–Wedderburn [164] and the standard monograph on GLMs is the book of
McCullagh–Nelder [150]. This chapter on GLMs will set the stage for later chapters on
machine learning methods and AI tools. In these later chapters, we will see that neural
networks can be seen as a generalization of GLMs.
GLMs are based on the EDF (2.1). Model fitting is done by MLE which can either
be achieved by maximizing the log-likelihood function of the selected EDF (2.1) or by
minimizing the corresponding deviance loss function, see (2.6) and Table 2.2. Numerical
optimization for parameter fitting is usually done by Fisher’s scoring method and the
iteratively re-weighted least squares (IRLS) algorithm; this is one of the key contributions
in the original work of Nelder–Wedderburn [164].

3.1 Generalized linear model regressions


This section is slightly more technical because it lays the foundation for any other regres-
sion approach, and once we have provided the full details within a GLM framework, later
chapters will be more straightforward.

3.1.1 Generalized linear model regression functions


Consider q-dimensional real-valued covariates X = (X1 , . . . , Xq )⊤ ; for covariate pre-
processing we refer to Section 2.3. For a GLM regression function one selects a smooth
and strictly increasing link function g, and defines the class of (parametrized) regression
functions M = {µϑ }ϑ by
$$X \mapsto g(\mu_\vartheta(X)) = \vartheta_0 + \sum_{j=1}^{q} \vartheta_j X_j =: \langle \vartheta, X \rangle, \qquad (3.1)$$

for a given regression parameter ϑ ∈ Rq+1 . Thus, after applying the link function g,
one postulates a linear functional form in the components of X, expressed by the scalar

49
50 Chapter 3. Generalized linear models

(dot) product ⟨ϑ, X⟩ between ϑ and X; we implicitly extend the covariate X by a bias
component X0 ≡ 1 in this GLM chapter, see (2.9). The parameter ϑ0 is called intercept
or bias. In this chapter, ϑ ∈ Rq+1 is generally a (q + 1)-dimensional vector; for the chosen
notation we also refer to the footnote on page 16.

The GLM assumption (3.1) provides the following GLM structure for the conditional
mean of the response Y , given the covariates X,

µϑ (X) = E [ Y | X] = g −1 ⟨ϑ, X⟩ . (3.2)

Example 3.1 (log-link GLM, revisited). The most popular link function in actuarial
pricing is the log-link g(·) = log(·). This log-link GLM has already been introduced in
Section 1.3, in particular, we refer to the regression function defined in (1.3) and (1.4),
respectively. We can rewrite this conditional mean functional as
X ↦ µϑ(X) = E[Y | X] = exp⟨ϑ, X⟩ = e^{ϑ0} ∏_{j=1}^q e^{ϑj Xj}.

This is a multiplicative best-estimate pricing structure with price factors (price relativities)
e^{ϑj Xj}. This multiplicative pricing structure is transparent and interpretable, e.g., if
ϑj > 0 we can easily read off the increase in the best-estimate implied by an increase in Xj.
The bias term e^{ϑ0} gives the base premium and the calibration of the GLM (note that
the base premium is not the same as the average premium; the latter averages over the
covariate distribution X ∼ P and considers the entire regression parameter ϑ). Shifting
the bias term ϑ0 shifts the average price level.

Figure 3.1: GLM regression function X ↦ µϑ(X) with log-link; the x-axis shows the age X1, with separate curves for female and male policyholders.

Figure 3.1 shows a log-link GLM regression function X 7→ µϑ (X) = exp⟨ϑ, X⟩, with
a two-dimensional covariate X ∈ R2 and a regression parameter ϑ ∈ R3 . The first
component X1 ∈ [18, 90] models the age of the policyholder, given on the x-axis of
the graph, and the second component X2 ∈ {0, 1} is a binary categorical component


with X2 = 0 for male and X2 = 1 for female (blue and red colors). That is, it uses
dummy coding with male being the reference level, see (2.14). The choice of the log-link
gives the easy interpretation that females differ from males (at the same age X1) by a
multiplicative factor of e^{ϑ2}; the selected regression parameter ϑ2 < 0 is negative in the
figure. Increasing the age X1 by one year increases the best estimate by a multiplicative
factor of e^{ϑ1}, since the selected regression parameter ϑ1 > 0 is positive here. Finally,
the bias term e^{ϑ0} globally shifts (calibrates) the overall level. ■
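The following small R sketch (on synthetic data, with illustrative variable names) shows how such a multiplicative structure can be read off from a fitted log-link Poisson GLM; the exponentiated coefficients are precisely the price relativities e^{ϑj} discussed above.

# Minimal sketch (synthetic data): reading off multiplicative price relativities
# exp(theta_j) from a log-link Poisson GLM; all names are illustrative.
set.seed(1)
n <- 10000
dat <- data.frame(
  age    = sample(18:90, n, replace = TRUE),
  gender = factor(sample(c("male", "female"), n, replace = TRUE),
                  levels = c("male", "female")),      # dummy coding, male = reference level
  expo   = runif(n, 0.5, 1)                           # exposures v_i
)
true_mu <- exp(-2 + 0.01 * dat$age - 0.2 * (dat$gender == "female"))
dat$claims <- rpois(n, dat$expo * true_mu)            # simulated claim counts

fit <- glm(claims ~ age + gender, family = poisson(link = "log"),
           data = dat, offset = log(expo))

exp(coef(fit))   # e^{theta_0}: base level; e^{theta_j}: multiplicative price relativities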

Figure 3.1 gives a nice picture of a transparent and interpretable regression function based
on the log-link, and it shows the main advantages of using the log-link. In practice, this
regression function is not known and needs to be estimated from noisy data, i.e., we
need to find the (true) regression parameter ϑ ∈ R^{q+1} from a (noisy) learning sample
(Yi, Xi, vi)_{i=1}^n, after having chosen a suitable link function g. We could just select any
strictly consistent loss function L, see Section 1.4, and then perform model fitting, or
rather loss minimization¹ (2.8), to find a regression parameter estimate ϑ̂ for ϑ. We
have seen three examples of that type for the log-link GLM, see Examples 1.2, 1.6 and
1.7. However, we have argued in Section 1.4.3 that on finite samples L = (Yi, Xi, vi)_{i=1}^n
one can find the most accurate models (on average) if the selected strictly consistent
loss function L reflects the properties of the responses (Yi)_{i=1}^n; this is grounded in the
theoretical work of Gourieroux et al. [87].
Select the member of the EDF that is a reasonable distributional choice for the responses,
given the covariates, and set the assumption

Yi |X i ∼ EDF (θ(X i ), φ/vi , κ) . (3.3)

That is, we make the canonical parameter θ = θ(X i ) ∈ Θ dependent on the covariates.
Assuming the GLM regression structure (3.1), applying the EDF mean property (2.2),
and using the one-to-one correspondence between the canonical parameter and the mean
gives us
g(κ′(θ)) = ⟨ϑ, X⟩,

or equivalently, using the canonical link h = (κ′)^{-1} of the selected EDF, we can solve
this for the canonical parameter

θ = θ(X) = θϑ(X) = h(g^{-1}⟨ϑ, X⟩).    (3.4)

This explains the relationship between the canonical parameter θ = θ(X) = θϑ (X) ∈ Θ
of the selected EDF and the selected GLM regression function. The unknown regression
parameter ϑ ∈ Rq+1 enters this canonical parameter; pay attention to the distinguished
use of θ ∈ Θ for the canonical parameter and ϑ ∈ R^{q+1} for the regression parameter.
¹ Minimization of a loss function is often called ‘model fitting’; however, at the current stage we do
not really have a ‘model’, but only a regression function assumption, see page 32. For having a ‘model’, we
also need a distributional assumption. This is often disregarded within the machine learning community,
but it has played a key role in advancing actuarial modeling, see, e.g., the claims reserving uncertainty
results developed by Mack [146].


3.1.2 Canonical link

From (3.4) we observe that there is a distinguished link choice for a GLM, called the
canonical link choice. Namely, select g = h. In the case of this canonical link choice
g = h, the canonical parameter coincides with the linear predictor

θϑ (X) = ⟨ϑ, X⟩ . (3.5)

Mathematically speaking, the canonical link choice has some real advantages, e.g., MLE
is always a concave maximization problem and, thus, a solution is unique (provided we
have a full-rank design matrix X). However, practical needs often overrule mathematical
properties, and for many other reasons the modeler typically prefers the log-link for g.
The log-link is the canonical link if and only if we select the Poisson model within the
EDF. Therefore, in many regression problems, we do not work with the canonical link
for g.

distribution                      canonical link h(µ)       Θ            κ′(Θ̊)

Gaussian distribution             µ                         R            R
Poisson distribution              log(µ)                    R            (0, ∞)
Tweedie’s CP distribution         µ^{1−p}/(1 − p)           (−∞, 0)      (0, ∞)
gamma distribution                −1/µ                      (−∞, 0)      (0, ∞)
inverse Gaussian distribution     −1/(2µ²)                  (−∞, 0]      (0, ∞)
Bernoulli distribution            log(µ/(1 − µ))            R            (0, 1)

Table 3.1: Canonical link h(µ) = (κ′)^{−1}(µ) of selected EDF distributions.

Table 3.1 shows the canonical links h of the most popular members of the EDF. Usually,
the canonical link is chosen for g in case of the Gaussian, the Poisson and the Bernoulli
models. In these cases we have an effective domain Θ = R, and, as a result, there is no
domain conflict resulting from the linear predictor (3.5) for any possible choices of the
covariates X ∈ Rq . This does not hold in the other cases, e.g., in the gamma case we
have a one-side bounded effective domain Θ = (−∞, 0). This gives constraints on the
possible choices of ϑ and X in (3.5) to have a well-defined model. This difficulty can be
circumvented in the gamma case by selecting the log-link for g, i.e.,

θϑ(X) = h(g^{-1}⟨ϑ, X⟩) = −1/exp⟨ϑ, X⟩ = −exp⟨−ϑ, X⟩ < 0,

being a well-defined canonical parameter for any ϑ ∈ Rq+1 and X ∈ Rq . This is a main
reason in practice to choose a link function g different from the canonical link h of the
selected EDF distribution.
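As a small illustration of this link choice, the following R sketch (synthetic data, illustrative names) fits a gamma GLM with log-link; R's built-in 'inverse' link corresponds to the canonical link of Table 3.1 up to the sign.

# Sketch (synthetic data): gamma GLM with log-link; the log-link avoids the domain
# restriction Theta = (-infty, 0) induced by the canonical link -1/mu.
set.seed(2)
n <- 5000
x <- runif(n)
mu <- exp(1 + 0.5 * x)                      # strictly positive mean
y <- rgamma(n, shape = 2, rate = 2 / mu)    # gamma responses with mean mu

fit_log <- glm(y ~ x, family = Gamma(link = "log"))      # log-link, well-defined for any theta
fit_inv <- glm(y ~ x, family = Gamma(link = "inverse"))  # R's inverse link 1/mu (canonical up to sign)
summary(fit_log)$coefficients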
We also remark that the balance property that is going to be introduced in Section 4.1,
below, is important in insurance pricing. This balance property is fulfilled for MLE fitted
GLMs if and only if we work with the canonical link g = h. Otherwise, a balance
correction will be necessary; for more discussion we also refer to Lindholm–Wüthrich [140].


3.2 Generalized linear model fitting

It remains to fit the regression function (3.1) based on a learning sample L =
(Yi, Xi, vi)_{i=1}^n. As outlined in Section 2.1.3, there are two different ways to obtain the
MLE of ϑ. We can either maximize the log-likelihood function or we can minimize the
corresponding deviance loss function.

3.2.1 MLE via log-likelihoods

For the given learning sample L = (Yi, Xi, vi)_{i=1}^n we receive the log-likelihood function
under the previous choices (and assuming independence between the instances), see (2.1),

ϑ ↦ ℓ(ϑ) = Σ_{i=1}^n (vi/φ) [Yi θϑ(Xi) − κ(θϑ(Xi))] + c(Yi, φ/vi),    (3.6)

where θϑ(Xi) contains the regression parameter ϑ, see (3.4). Solving the resulting
maximization problem gives the MLE, subject to existence,

ϑ̂^MLE ∈ arg max_ϑ ℓ(ϑ).    (3.7)

One may argue that working with the log-likelihood function (3.6) looks too complicated
because the formula seems quite involved. In all future derivations we will translate the
MLE problem (3.7) to the corresponding deviance loss minimization problem.
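For concreteness, the following R sketch (synthetic data, illustrative names) computes the MLE (3.7) of a Poisson log-link GLM by directly maximizing the log-likelihood with a generic optimizer; in practice, glm() performs the same fit via the IRLS algorithm discussed below.

# Sketch (synthetic data): the MLE (3.7) via direct maximization of the Poisson
# log-likelihood of a log-link GLM.
set.seed(3)
n <- 2000
X <- cbind(1, runif(n), rnorm(n))                 # design matrix with bias component X_0 = 1
v <- runif(n, 0.5, 1)                             # exposures
theta_true <- c(-2, 0.8, -0.3)
N <- rpois(n, v * exp(X %*% theta_true))          # observed claim counts

neg_loglik <- function(theta) {
  lambda <- v * exp(X %*% theta)                  # expected counts v_i * mu_theta(X_i)
  -sum(N * log(lambda) - lambda)                  # Poisson log-likelihood up to a constant in N
}
mle <- optim(c(0, 0, 0), neg_loglik, method = "BFGS")
mle$par                                           # should be close to theta_true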

3.2.2 MLE via deviance loss functions

We translate the selected EDF to the corresponding deviance loss function L, see (2.6)
and Table 2.2. This gives us the (same) MLE as in (3.7) by solving the minimization
problem

ϑ̂^MLE ∈ arg min_ϑ Σ_{i=1}^n (vi/φ) L(Yi, µϑ(Xi)),    (3.8)

for the GLM regression function

X ↦ µϑ(X) = g^{-1}⟨ϑ, X⟩.    (3.9)


Regression fitting problem

The regression fitting problem (3.8) is now in an attractive form that is going to be
used throughout and in more generality below according to the following recipe:

(1) Specify the distributional model within the EDF for modeling the responses
Y , given X; see (3.3). This gives us the choice of the cumulant function κ.

(2) The choice of the cumulant function κ is translated to the corresponding


deviance loss function L by (2.6), see, e.g., Table 2.2.

(3) Then, one can choose any family M = {µϑ }ϑ of regression functions µϑ :
X → R, and the optimal one is found by solving (3.8) for this family M.

• Outlook: This procedure is fairly general, as it does not require a GLM
family for M. In fact, it can be a general class of regression models, not
necessarily a parametric one. Below, we are going to use neural networks
or regression trees for M, and the foundation of their fitting/learning
always lies in solving (3.8). Thus, all that we need to do is to select the
‘right’ strictly consistent loss function L for mean estimation, and then we
solve (3.8) for the selected regression model class M.

3.2.3 Numerical solution of the GLM problem


Coming back to the GLM model class (3.9) for M in (3.8), this can computationally be
solved either by Fisher’s scoring method or by the IRLS algorithm.² Basically, this fully
solves the GLM fitting problem and we obtain the MLE fitted GLM regression function
X ↦ µ_{ϑ̂^MLE}(X). We give some remarks:

• For GLM fitting we always require the design matrix X to have full rank q + 1 ≤ n.
For categorical inputs this requires dummy coding, see Section 2.3.2.

• Under the canonical link choice g = h, the deviance loss minimization (3.8) is
convex. Thus, a solution is always unique.

• For non-canonical link choices, the objective function in (3.8) is not necessarily
convex, and this needs to be checked case by case; compare Examples 5.5 and 5.6
of Wüthrich–Merz [243] which show that the gamma log-link GLM is a convex
fitting problem, whereas the inverse Gaussian log-link GLM is not.

We can then apply the model validation and model selection tools of Chapter 1 such
as cross-validation. One could also compute a regularized MLE, e.g., adding a LASSO
penalization to (3.8) to receive a sparse GLM, see Section 2.4. For a more detailed outline
² One could also use the gradient descent methods described in Section 5.3, below. Gradient descent
has the advantage of being able to deal with big data and big models, whereas Fisher’s scoring method and
the IRLS algorithm usually converge faster, because they do not only consider the gradient, but also the
second derivatives (Hessians). The size of the data and the model will decide whether these Hessians can
be computed efficiently.


about model fitting and validation we refer to Wüthrich–Merz [243, Chapter 5], and we
discuss some GLM related techniques in Section 3.3, below.

Remark 3.2. In general, the GLM fitted model does not fulfill the balance property
discussed below in (4.4). Therefore, one often adjusts the estimated bias term ϑ̂_0^MLE to
rectify this balance property. An exception is the choice of the canonical link g = h
under which the balance property always holds, i.e., under the canonical link choice, the
GLM estimated model is an (in-sample) re-allocation of the total observed claim; this
is explained below in Section 4.1, and we also refer to Lindholm–Wüthrich [140]. ■

3.3 Likelihood-ratio test


3.3.1 Introduction
First model validation and model selection tools have been introduced in Chapter 1, and
we are going to discuss more tools in Chapter 4, below. All these tools are model-agnostic,
meaning that they can be applied to any class of regression problems and models. A little
bit less general are AIC and BIC, discussed in Section 1.5.3, because they require MLE
fitted models. Apart from that AIC and BIC are still general, e.g., we can compare a
MLE fitted Poisson GLM to a MLE fitted negative binomial GLM, even if they have
completely different GLM functions. In this section we discuss the likelihood ratio test
(LRT) and the Wald test which require nested GLMs to be applicable.
Under fairly general assumptions, MLE is consistent and asymptotically normal, meaning
that for increasing sample sizes n → ∞, the MLE ϑ̂^MLE converges to the true parameter
ϑ*, provided that the true model is a GLM with the same parametrization. The speed
of convergence is determined by the inverse of Fisher’s information matrix. This links
to Gourieroux et al. [87, Theorem 4] who prove that this Fisher’s information matrix is
minimal if we select the deviance loss function of the correct EDF under which the data
has been generated. In Gourieroux et al. [87], this is called best asymptotically normal,
and it means that this deviance loss function choice provides on average the best finite
sample behavior in regression parameter estimation. These asymptotic normality results
are at the heart of the LRT and the Wald test, which we present next.

3.3.2 Lab: a real data example


The analysis in the next section is based on an explicit real data example. We use
the well-known French motor third party liability (MTPL) claims frequency data set
freMTPL2freq of Dutang et al. [60]. We apply the same data cleaning as described in
Wüthrich–Merz [243, Appendix B], and we apply the identical partition into learning
sample L and test sample T as in that reference;3 see Wüthrich–Merz [243, Listing 5.2
and Table 5.2]. Also further chapters of these notes will be based on this data set and
on this partition.
³ The cleaned French MTPL claims frequency data can be downloaded from https://people.math.ethz.ch/~wueth/Lecture/freMTPL2freqClean.rda. The label LearnTest with levels ‘L’ and ‘T’ indicates whether the instances belong to the learning sample L or the test sample T.


We then pre-process the covariates, and we fit a Poisson log-link GLM to the learning
sample L = (Yi, Xi, vi)_{i=1}^n, i.e., using the Poisson deviance loss for L, we solve

ϑ̂^MLE = arg min_{ϑ∈R^{q+1}} (1/n) Σ_{i=1}^n vi L(Yi, µϑ(Xi))
       = arg min_{ϑ∈R^{q+1}} (1/n) Σ_{i=1}^n 2 vi [ µϑ(Xi) − Yi − Yi log(µϑ(Xi)/Yi) ]    (3.10)
       = arg min_{ϑ∈R^{q+1}} (1/n) Σ_{i=1}^n 2 [ vi µϑ(Xi) − Ni − Ni log(vi µϑ(Xi)/Ni) ].

This is precisely Example 2.2. Recall that Yi = Ni/vi are the observed claims frequencies,
where Ni ∈ N0 denote the observed claims counts.
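The Poisson deviance loss appearing in (3.10) can be coded in a few lines; the following sketch is one possible implementation (the function name is ours) that can be used to evaluate any claims frequency model, in-sample or out-of-sample.

# Sketch: the (unit dispersion) Poisson deviance loss of (3.10) as an R function.
poisson_deviance <- function(N, v, mu) {
  # N: observed claim counts, v: exposures, mu: predicted frequencies mu(X_i)
  2 * mean(v * mu - N + ifelse(N > 0, N * log(N / (v * mu)), 0))
}
poisson_deviance(N = c(0, 1, 3), v = c(1, 0.8, 1.2), mu = c(0.1, 0.2, 0.4))   # example call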
The results of the GLM fitting procedure are presented in Listing 3.1, below. We discuss
this example in the next section, and a more detailed description and analysis of this
GLM example is contained in the accompanying notebook on GLMs; see
notebook-insert-link

3.3.3 Likelihood ratio test and Wald test


The LRT and the Wald test are two MLE-based tools that allow one to perform
(in-sample) model selection. A necessary requirement to be able to apply these
two tests is that we compare two nested GLMs. Two GLMs for Y , given X, are nested
if they are based on the same EDF distribution (3.3), and if their only difference is that
one model is bigger including the other one as a special case (via the regression function).
Call the bigger model the ‘full model’ and assume it has a regression function
g(κ′(θ_{ϑ^full}(X))) = ⟨ϑ^full, X⟩ = ϑ_0^full + Σ_{j=1}^q ϑ_j^full Xj,    (3.11)

with (q + 1)-dimensional regression parameter ϑ^full ∈ R^{q+1}. A smaller nested model is
obtained by setting some of these regression parameter components equal to zero, i.e.,
dropping the corresponding covariate components from the model. Since the ordering of
the covariate components in X is arbitrary, we can assume w.l.o.g. that we want to set
the last components to zero. Thus, we consider a nested (null-hypothesis) model

g(κ′(θ_{ϑ^{H0}}(X))) = ϑ_0^{H0} + Σ_{j=1}^{q′} ϑ_j^{H0} Xj,    (3.12)

with q′ < q. This nested model has a (q′ + 1)-dimensional regression parameter ϑ^{H0} ∈
R^{q′+1}. We can now set up a statistical test for the null-hypothesis that the data has been
generated by the smaller nested model (3.12) against the alternative of the full model
(3.11). Based on the learning sample L = (Yi, Xi, vi)_{i=1}^n, we estimate in both models
the regression parameter with MLE, providing us with the MLEs ϑ̂^full in the full model
and ϑ̂^{H0} in the nested model, respectively. The resulting likelihood ratio of the two fitted
models gives us, refer to (2.1) for the EDF densities,

∏_{i=1}^n f_{θ_{ϑ̂^{H0}}(Xi)}(Yi) / ∏_{i=1}^n f_{θ_{ϑ̂^full}(Xi)}(Yi) ≤ 1.    (3.13)


This likelihood ratio is upper bounded by one because we apply MLE to two nested
models, the one in the denominator having more degrees of freedom in the MLE. The rationale
behind the LRT now is as follows. If this ratio is fairly close to one, the null-hypothesis
model is as good as the full model, and one cannot reject the null-hypothesis that the
data has been generated by the smaller model.
The test statistic (3.13) is not in a convenient form and, instead, one considers a logged
version thereof

T = −2 Σ_{i=1}^n [ log f_{θ_{ϑ̂^{H0}}(Xi)}(Yi) − log f_{θ_{ϑ̂^full}(Xi)}(Yi) ] ≥ 0.    (3.14)

If this test statistic T is large, the null-hypothesis should be rejected. Using asymptotic
MLE theory, the distribution of this test statistic T under the null-hypothesis can be
approximated by a χ²-distribution with q − q′ degrees of freedom; see Fahrmeir–Tutz [65,
Section 2.2.2].
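In R, the LRT of a nested against a full Poisson GLM can be sketched as follows, assuming the learning data frame learn of Listing 3.1 is available; here the nested model drops the variable Area, i.e., q − q′ = 1.

# Sketch of the LRT (3.14) in R; 'learn' is the learning sample of Listing 3.1.
fit_full <- glm(ClaimNb ~ DrivAge + VehBrand + VehGas + Density + Area,
                family = poisson(), data = learn, offset = log(Exposure))
fit_H0   <- glm(ClaimNb ~ DrivAge + VehBrand + VehGas + Density,
                family = poisson(), data = learn, offset = log(Exposure))

T_stat <- 2 * as.numeric(logLik(fit_full) - logLik(fit_H0))   # LRT statistic (3.14)
p_val  <- pchisq(T_stat, df = 1, lower.tail = FALSE)          # chi-square approximation
anova(fit_H0, fit_full, test = "Chisq")                       # the same test via anova()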

Figure 3.2: χ²-density rejection area (in orange) for a significance level of 5% with q − q′ = 3 degrees of freedom; the x-axis shows the test statistic T.

Figure 3.2 shows in orange color the rejection region for the test statistic T defined in
(3.14) on a significance level of 5% for dropping q − q′ = 3 covariate components from X.
If the learning sample L = (Yi, Xi, vi)_{i=1}^n provides an observed test statistic T that has
a p-value less than 5% (lies in the orange region), we reject the null-hypothesis on this
significance level and we go for the bigger model; otherwise, this LRT does not support
working with the bigger model.

We give some remarks.

• The LRT test statistics T in (3.14) can also be expressed as a difference of deviance
losses. It then receives the interpretation of measuring the distance of the two
models to the empirical sample in terms of a KL divergence; see Wüthrich–Merz
[243, Section 5.1.7].


• The LRT can only be applied to nested models, i.e., with responses having the
same distributions under both models, and one having a nested regression function
within the other one. For non-nested models, only AIC (1.19) and BIC (1.20) apply.
Alternatively, we could employ (group) LASSO regularization during model fitting,
see Section 2.4. This already leads to sparsity and parsimony during model fitting,
but it requires fine-tuning of the regularization parameter η. Often, LASSO is only
used for parameter (covariate) selection, and once the covariates are selected, in an
additional step, a non-regularized GLM is fitted.

• The LRT requires that we fit two models, the full model and the nested model. The
Wald test [230] is rather similar but with a smaller computational effort. For the
Wald test, one only needs to fit the full model, and the nested one is then approx-
imated by asymptotic MLE arguments. This is computationally more attractive,
but it is less accurate because it involves more approximations; see Wüthrich–Merz
[243, Section 5.3.2].

• The LRT test statistic (3.14) requires the true dispersion parameter φ. Usually,
this is not available. However, if we estimate this dispersion parameter by
MLE (consistently in a statistical sense) in the bigger (full) model, the asymp-
totic normality results carry over to the case with consistently estimated dispersion
parameter.

We illustrate the LRT and the Wald test on the French MTPL claims frequency data in-
troduced in Section 3.3.2. Listing 3.1 shows the results of the fitted Poisson log-link GLM
(3.10). We consider 5 covariates: a continuous covariate DrivAge which was discretized
into age classes and which was then implemented by dummy coding, the categorical
VehBrand variable implemented by dummy coding, a binary VehGas variable also imple-
mented by dummy coding, and two continuous variables Density and Area. This results
in a regression parameter ϑ ∈ Rq+1 of dimension q + 1 = 20.
The last column p value of Listing 3.1 shows the p-values of the Wald tests. As explained
above, these Wald tests are essentially the same as the LRTs, but with less computational
efforts because only the full model needs to be fitted, and we do not need to refit a null-
hypothesis model for all q + 1 = 20 Wald tests in Listing 3.1. These Wald tests consider
dropping one component of ϑ at a time against the full model. As a result, we cannot
just drop all the variables that have a small p-value because simultaneous consideration
of multiple individual variable droppings is not a nested model consideration. That
is, a multiple comparison does not work if these multiple models are not nested. In
other words, variable dropping needs to be done recursively, i.e., once we have dropped
the least significant variable, we have a new full model, and we need to re-perform
the whole exercise based on this new full model. This precisely results in the variable
selection process that is typically known as backward selection, because every Wald or
LRT has to be done recursively in a full vs. null-hypothesis model context. If this model
reduction is combined with increasing a model by adding (new) variables, it is called
backward-forward-stepwise selection, e.g., we may decide to drop recursively variable A,
then variable B, and then variable C, and after these three reductions it may turn out that
adding again variable A is beneficial. This allows one to stepwise (recursively) decrease


Listing 3.1: Poisson log-link GLM example on the French MTPL claims frequency data.
1 glm ( formula = ClaimNb ~ DrivAge + VehBrand + VehGas + Density +
2 Area , family = poisson () , data = learn , offset = log ( Exposure ))
3
4 Deviance Residuals :
5 Min 1Q Median 3Q Max
6 -0.8890 -0.3393 -0.2535 -0.1387 7.6569
7
8 Coefficients :
9 Estimate Std . Error z value p value
10 ( Intercept ) -3.258957 0.034102 -95.564 < 2e -16 ***
11 DrivAge18 -20 1.275057 0.044964 28.358 < 2e -16 ***
12 DrivAge21 -25 0.641668 0.028659 22.390 < 2e -16 ***
13 DrivAge26 -30 0.153978 0.025703 5.991 2.09 e -09 ***
14 DrivAge41 -50 0.121999 0.018925 6.447 1.14 e -10 ***
15 DrivAge51 -70 -0.017036 0.018525 -0.920 0.357776
16 DrivAge71 + -0.047132 0.029964 -1.573 0.115726
17 VehBrandB2 0.007238 0.018084 0.400 0.688958
18 VehBrandB3 0.085213 0.025049 3.402 0.000669 ***
19 VehBrandB4 0.034577 0.034523 1.002 0.316553
20 VehBrandB5 0.122826 0.028792 4.266 1.99 e -05 ***
21 VehBrandB6 0.080310 0.032325 2.484 0.012976 *
22 VehBrandB10 0.067790 0.040607 1.669 0.095032 .
23 VehBrandB11 0.221375 0.043348 5.107 3.27 e -07 ***
24 VehBrandB12 -0.152185 0.020866 -7.294 3.02 e -13 ***
25 VehBrandB13 0.101940 0.047062 2.166 0.030306 *
26 VehBrandB14 -0.201833 0.093754 -2.153 0.031336 *
27 VehGasRegular -0.198766 0.013323 -14.920 < 2e -16 ***
28 Density 0.094453 0.014623 6.459 1.05 e -10 ***
29 Area 0.028487 0.019909 1.431 0.152471
30 ---
31 Signif . codes : 0 ’*** ’ 0.001 ’** ’ 0.01 ’* ’ 0.05 ’. ’ 0.1 ’ ’ 1
32
33 ( Dispersion parameter for poisson family taken to be 1)
34
35 Null deviance : 153852 on 610205 degrees of freedom
36 Residual deviance : 151375 on 610186 degrees of freedom
37 AIC : 197067
38
39 Number of Fisher Scoring iterations : 6

and increase models.


Another issue with the Wald tests in Listing 3.1 is that these tests are performed on the
individual components of ϑ. Considering DrivAge and VehBrand, it seems more sensible
to consider all components of a dummy coded categorical variable simultaneously, because
we want to keep, e.g., VehBrand in the model or not. This requires to consider the whole
group at once exactly as in the group LASSO regularization, see (2.26). In the case of
VehBrand we can perform the LRT with q − q ′ = 10 degrees of freedom.
Listing 3.2 shows an LRT analysis that performs LRTs for dropping one variable at a time
from the full model, i.e., the resulting five reduced models are always compared to the
full model; in R this is called a drop1 analysis. In case of DrivAge we drop 6 regression
parameters, in case of VehBrand 10 regression parameters, and in the remaining cases 1
parameter each. This gives the resulting degrees of freedom q − q′ for the LRTs (and the
Wald tests). The p-values (in the last column) only support dropping Area. This is also


Listing 3.2: GLM example: drop1 analysis.


1 Single term deletions
2 Model :
3 ClaimNb ~ DrivAge + VehBrand + VehGas + Density + Area
4
5 Df Deviance AIC LRT Pr ( > Chi )
6 < none > 151375 197067
7 DrivAge 6 152483 198163 1107.98 < 2.2 e -16 ***
8 VehBrand 10 151550 197222 175.15 < 2.2 e -16 ***
9 VehGas 1 151598 197287 222.58 < 2.2 e -16 ***
10 Density 1 151417 197106 41.77 1.029 e -10 ***
11 Area 1 151377 197067 2.05 0.1524
12 ---
13 Signif . codes : 0 ’*** ’ 0.001 ’** ’ 0.01 ’* ’ 0.05 ’. ’ 0.1 ’ ’ 1

justified by AIC (1.19) because dropping Area has roughly the same AIC value as the full
model, and since we aim for parsimony, we should go for the smaller model in such
a situation. In all other cases, the AIC value increases by dropping the corresponding
variables, see Listing 3.2.

Listing 3.3: GLM example: anova analysis.


1 Analysis of Deviance Table
2 Model : poisson , link : log
3 Response : ClaimNb
4 Terms added sequentially ( first to last )
5
6 Df Deviance Resid . Df Resid . Dev Pr ( > Chi )
7 NULL 610205 153852
8 DrivAge 6 1174.40 610199 152678 <2e -16 ***
9 VehBrand 10 156.40 610189 152522 <2e -16 ***
10 VehGas 1 112.24 610188 152409 <2e -16 ***
11 Density 1 1032.18 610187 151377 <2e -16 ***
12 Area 1 2.05 610186 151375 0.1524
13 ---
14 Signif . codes : 0 ’*** ’ 0.001 ’** ’ 0.01 ’* ’ 0.05 ’. ’ 0.1 ’ ’ 1

Finally, we show an analysis in Listing 3.3, where we sequentially add one variable after
the other, providing the corresponding reduction in the deviance loss (LRT test statistic).
The conclusions are essentially the same as above. A critical point that needs to be
emphasized for Listing 3.3 is that the order of inclusion matters in this stepwise forward
selection. Changing the order may provide different conclusions. For instance, if two
covariate components are highly collinear, then they essentially explain the
same phenomenon, and once we have included the first one, we do not need the second
one any more in the model, even though the second one may have a marginally better
explanatory power (or slightly different interactions with other variables). That is, the
order of inclusion matters in these stepwise forward selections, and the p-values depend
on this order; in the example of Listing 3.3, the variables Density and Area are highly
collinear, and exchanging the order of these two variables gives a high
p-value to Density. The accompanying notebook on GLMs gives more discussion on
this example, see


notebook-insert-link

We conclude that variable selection is as much art as science. Increasing the size of the
model quickly leads to a combinatorial complexity that does not allow one to explore all
possible sub-models (and combinations). This section has been focusing on LRTs and
Wald tests because there is a well-founded and understood statistical theory behind the
LRT and the Wald test, but, in essence, they are not any different from other model
selection procedures. For instance, we can replace all LRTs by LASSO regressions, but then
the computational effort becomes bigger as LASSO regression needs hyper-parameter
(regularization parameter) tuning. This increased computational complexity is common
to any machine learning method, because in complex algorithmic models we will no longer
be able to rely on an asymptotic likelihood theory. Therefore, first trying to understand
the (raw) data is always very beneficial. This allows the modeler to already make a
reasonable model choice in the first place, which can then be further refined. Otherwise, we
have to go through the tedious backward-forward LRT model selection process because
we can only test nested GLMs, and we will likely end up with a sort of non-uniqueness,
having several equally good non-nested models. This is quite similar to the regression tree
constructions below, where a technique called pruning is used to select the
optimal tree.

Summary

This chapter has set the ground for machine learning methods by giving the basic
tools and intuition from classical statistical and actuarial methods; we especially
refer to the boxes on pages 29 and 54, as well as to the Poisson Example 2.2. We
are now ready to dive into the theory of machine learning tools.



Chapter 4

Interlude

Chapters 1 to 3 introduced statistical methods that are part of the core syllabus
of actuarial science, and these chapters have laid a solid foundation for diving into
the machine learning tools. We are going to introduce these tools from Chapter 5
onwards. The present chapter discusses some general techniques that are useful in
various situations, but which are not strictly necessary to understand the machine
learning tools. For this reason, the fast reader may skip this chapter and come
back to it at a later stage.

This chapter discusses a collection of different topics:


• We start this chapter by discussing unbiasedness and the balance property. These
are important properties in insurance pricing to ensure that the overall price level
is not misspecified. Next, we discuss auto-calibration which is another important
property that insurance pricing schemes should fulfill.

• In the second part of this chapter, we discuss two general purpose non-parametric
regression methods, local regression and isotonic regression. These are useful in
various situations.

• In the final part of this chapter, we discuss further model selection tools, such as
the Gini score and Murphy’s score decomposition. These are model-agnostic tools,
i.e., they can be used for any regression method.

4.1 Unbiasedness and calibration


4.1.1 Statistical biases
Generally, unbiasedness is an important property in actuarial pricing. This applies
regardless of the specific meaning we attach to unbiasedness:
• In Section 1.5.1, we discussed the in-sample bias that we need to avoid in model
selection, otherwise the predictor may generalize poorly to new data.

• In regression modeling, we want the estimated model to be free of a statistical bias,
so that the average price level is not misspecified.


• Most regression models include an intercept term which is the part of the regression
function that is not influenced by the covariates, see GLM (3.1). In deep learning,
this intercept term is called the bias term, and we need to ensure that it is correctly
specified to avoid a statistical bias.

• There is some concern about unfair discrimination in insurance pricing, and algo-
rithmic decision making more generally. Any kind of unfair treatment of individuals
or groups with similar features is related to a bias in the model construction and/or
the decision making process. This is called an unfair discrimination bias.

In the present section, we focus on the statistical bias which studies the average price
level over the entire insurance portfolio. We assume in all considerations that the first
moments exist.

Definition 4.1. The regression function X ↦ µ(X) is (globally, statistically) unbiased
for (Y, X, v), if

E[v µ(X)] = E[v Y].

Global unbiasedness means that the average price level E[vµ(X)] provided by the selected
regression function µ is sufficient to cover the portfolio claim vY on average. This global
unbiasedness is stated w.r.t. the population distribution (Y, X, v) ∼ P.
If we work with an estimated model µ̂_L, which has been fitted on a learning sample
L = (Yi, Xi, vi)_{i=1}^n, we typically also average over the learning sample L to state the
global unbiasedness

E[v µ̂_L(X)] = E[v Y],    (4.1)

assuming that (Y, X, v) is independent of L. This reflects an out-of-sample global unbi-


asedness verification, see also Section 1.5.1.
We give some remarks.

• The left-hand side of (4.1) reflects that we re-sample both L and (X, v) to verify this
global unbiasedness. In insurance, this may be questionable, as we cannot repeat
an experiment, i.e., we only have one past claims history reflected in the learning
sample L. Bootstrap may be a way of generating different past histories; however,
this in itself may be problematic because, if we have a biased model, this bias remains
in the bootstrap samples as we re-simulate from the very same observations. In
other words, the bias remains latent and is not discovered by bootstrapping.

• Global unbiasedness (4.1) also re-samples over the covariates X, and an insurer
may argue that the company (only) wants to be unbiased for its specific portfolio
T = (Yt, Xt, vt)_{t=1}^m. This would then result in verifying the conditional global
unbiasedness

Σ_{t=1}^m vt E[µ̂_L(Xt) | Xt] = Σ_{t=1}^m vt E[Yt | Xt, vt].    (4.2)

This averages over the learning sample L on the left-hand side, and over the claims
(Yt)_{t=1}^m on the right-hand side, but it keeps the (forecast) portfolio (Xt, vt)_{t=1}^m fixed.


Note that (4.2) will also require an assumption about the dependence between L
and T. In fact, one can even go one step further and require that the learning
sample L and the test sample T have the identical covariates, i.e., working on a
fixed portfolio (Xi, vi)_{i=1}^n for learning and forecasting. In that case, we require
instead

Σ_{i=1}^n vi E[µ̂_L(Xi) | (Xi, vi)_{i=1}^n] = Σ_{i=1}^n vi E[Yi | Xi, vi].    (4.3)

On both sides, this only averages over the responses. On the left-hand side, these
responses reflect past responses in L used for fitting, and on the right-hand side
these are the future claims to be forecast.

• These considerations bear the difficulty that the average claims level on the right-
hand sides of (4.1)-(4.3) needs to be available, which is typically not the case. That
is why we discuss the balance property next.

4.1.2 The balance property


Related to the last item of the previous remarks, we introduce the balance property which
is easier to verify. The notion of the balance property has been introduced in Bühlmann–
Gisler [35, Theorem 4.5] and it is studied in detail in Lindholm–Wüthrich [140]. The
balance property is formulated for an estimation procedure L ↦ µ̂_L, and it refers to
a point-wise property in the learning sample L, i.e., for any possible realization of the
learning sample.

Definition 4.2. A regression model fitting procedure L ↦ µ̂_L satisfies the balance
property if for a.e. realization of the learning sample L = (Yi, Xi, vi)_{i=1}^n the following
identity holds

Σ_{i=1}^n vi µ̂_L(Xi) = Σ_{i=1}^n vi Yi.    (4.4)

We give some remarks.

• The balance property is an in-sample property that holds for almost every (a.e.)
realization of the learning sample L. A crucial difference to the unbiasedness defi-
nitions above is that for verifying the balance property we do not need to know the
right price level, but it is fully empirical.

• The correct interpretation of the balance property (4.4) is that we re-allocate the
total (aggregate) portfolio claim Σ_{i=1}^n vi Yi to all insurance policyholders 1 ≤ i ≤ n,
such that this collective bears the entire aggregate claim. This view is at the core
of insurance, namely, the collective shares all risks (and claims) within the risk
community (solidarity).

• Generally, in actuarial science, model fitting procedures that have this balance
property should be preferred. MLE estimated GLMs using the canonical link comply
with the balance property; see Nelder–Wedderburn [164]. That is, for a MLE
fitted GLM we have estimated means

µ_{ϑ̂^MLE}(Xi) = Ê[Yi | Xi] = g^{-1}⟨ϑ̂^MLE, Xi⟩.


These estimated means fulfill the balance property if the selected link g = h was
the canonical link of the chosen EDF. Otherwise, the balance property fails to hold.
It can be rectified by modifying the intercept estimate

ϑ̂_0^MLE ↦ ϑ̂_0^corrected = ϑ̂_0^MLE + δ,    (4.5)

for a suitable (data dependent) δ ∈ R; this is studied in Lindholm–Wüthrich [140].

• We emphasize that the balance property is an in-sample property. Generally, we
expect the mean of the total claim Σ_i vi Yi to vary in the future, too, compared to
the learning and test samples, e.g., due to inflation or other external factors. In this
case, we may want to re-calibrate our model such that the left-hand side of (4.4)
matches our global portfolio forecast, integrating any extrapolation consideration
and expert knowledge about the future claims level. This can easily be done by
shifting the intercept (bias term) similar to (4.5).
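The following R sketch (synthetic data, illustrative names) shows how the balance property (4.4) can be checked empirically and how, for a log-link model, the intercept shift (4.5) with δ = log(Σ vi Yi / Σ vi µ̂(Xi)) restores it.

# Sketch (synthetic data): checking the balance property (4.4) and restoring it by
# shifting the intercept as in (4.5); the gamma log-link GLM is a non-canonical link
# choice, so the balance property may fail.
set.seed(4)
n <- 5000
x <- runif(n)
v <- rep(1, n)
Y <- rgamma(n, shape = 2, rate = 2 / exp(1 + x^2))   # positive claims, true mean exp(1 + x^2)

fit    <- glm(Y ~ x, family = Gamma(link = "log"), weights = v)
mu_hat <- predict(fit, type = "response")             # estimated means mu_hat(X_i)
sum(v * mu_hat) / sum(v * Y)                          # != 1 indicates a violation of (4.4)

delta        <- log(sum(v * Y) / sum(v * mu_hat))     # intercept shift (4.5) for the log-link
mu_corrected <- exp(delta) * mu_hat                   # re-balanced estimates
sum(v * mu_corrected) / sum(v * Y)                    # = 1 after the correction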

4.1.3 Auto-calibration
For actuarial pricing, auto-calibration is an important property, which we discuss in this
section. Auto-calibration has been introduced by Schervish [199] in the context of
meteorology, and it is discussed in the statistical literature by Tsyplakov [221], Menon et
al. [153], Gneiting–Ranjan [79], Pohle [177], Gneiting–Resin [80] and Tasche [214]. Actuarial
literature discussing auto-calibration includes Krüger–Ziegel [127], Denuit et al. [53],
Fissler et al. [69], Lindholm et al. [138], Lindholm–Wüthrich [140], Wüthrich [240] and
Wüthrich–Ziegel [244].
Definition 4.3. A regression function µ : X → R is auto-calibrated for (Y, X) if, a.s.,

µ(X) = E [ Y | µ(X)] . (4.6)


Assume µ(X) are the actuarial prices that are charged to the insurance policyholders
with covariates X to cover their financial risks Y . Auto-calibration (4.6) of the pricing
functional µ : X → R provides the property that every price cohort µ(X) is on average
self-financing for its corresponding claims Y . Thus, there is no systematic cross-financing
(subsidy) between the different price cohorts. This is a property that actuarial pricing
schemes should generally fulfill. Figure 4.1 gives an example where price cohort 1 is
systematically under-estimated and cross-financed by price cohort 3.
The following statements are proved in Wüthrich–Buser [241, Section 2.5].

• The true regression function µ∗ , given in (1.2), is auto-calibrated. This immediately


implies that any non-auto-calibrated regression function cannot be the true one.

• The global mean µ0 := E[Y ], not considering any covariates, is auto-calibrated.

• Typically, there are infinitely many auto-calibrated regression models µ : X → R.

• Consider any regression function µ : X → R. The following re-calibration step gives
an auto-calibrated regression function

µ^rc(X) := E[Y | µ(X)].    (4.7)


Figure 4.1: Violation of auto-calibration: price cohort 3 cross-subsidizes price cohort 1; the bars compare the price cohort level µ(X) with the claim E[Y|µ(X)] for three price cohorts.

This re-calibration step can be performed empirically by an isotonic regression; we
come back to this in Section 4.2.2, below, and we also refer to Wüthrich–Ziegel [244].

• From a statistical point of view, we should test any fitted regression model for
auto-calibration. Developing powerful statistical tests for auto-calibration is still
an open field of research; see Wüthrich [240].

Figure 4.2: Lift plot using decile binning; x-axis: estimated regression function (log-scale), y-axis: bin averages (log-scale).

Figure 4.2 gives a graphical example of an auto-calibration analysis; it is based on the


French MTPL Poisson log-link GLM example introduced in Section 3.3.2.
To analyze auto-calibration in regression fitting one performs a two-step fitting procedure:

(1) In the first step, one fits a (GLM) regression function X ↦ µ(X) by regressing
the responses Yi on the covariates Xi, i.e., by considering the regression problem
Yi ∼ Xi for independent instances 1 ≤ i ≤ n. This first step is performed on the
learning sample L. The resulting estimated regression values (µ̂(Xt))_{t=1}^m are plotted
on the x-axis in Figure 4.2; these are shown out-of-sample on the test sample T.


(2) To assess the auto-calibration property of this fitted regression function µ̂(·), one
applies a second regression step. This second fitting step is a “regression on the
regression”, i.e., one regresses the responses Yt, but this time on the real-valued
estimates µ̂(Xt), i.e., by considering the regression problem Yt ∼ µ̂(Xt) for the
independent instances 1 ≤ t ≤ m. This second step is performed on the test
sample T. These second fitted regression values are plotted on the y-axis in Figure
4.2, and they provide us with the blue dots in Figure 4.2. If these blue dots are on
the orange diagonal line, we have a perfectly auto-calibrated model.
We need suitable regression techniques to perform this second regression step Yt ∼ µ̂(Xt).
In Sections 4.2.1 and 4.2.2, below, we meet two non-parametric regression methods that
can be used to perform this second regression step. The method used in Figure 4.2 is a
simpler one. It uses a discrete binning w.r.t. the empirical deciles of (µ̂(Xt))_{t=1}^m, and
then it computes the empirical means of the responses belonging to the corresponding
bins (and on the x-axis we select the barycenters of the bins); a sketch of this decile
binning is given below.
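The following R sketch (on synthetic data, illustrative names) illustrates this decile binning; on real data, µ̂ would be the first-step regression fitted on L and evaluated on the test sample T.

# Sketch (synthetic data): lift plot via decile binning, similar to Figure 4.2.
set.seed(5)
m <- 20000
x <- runif(m)
Y <- rpois(m, exp(-2 + x))                   # responses on the test sample
mu_hat <- exp(-2.1 + 1.1 * x)                # first-step regression values (illustrative)

bins <- cut(mu_hat, breaks = quantile(mu_hat, probs = 0:10 / 10),
            include.lowest = TRUE)           # decile bins of the predictions
bin_pred <- tapply(mu_hat, bins, mean)       # barycenters of the bins (x-axis)
bin_obs  <- tapply(Y,      bins, mean)       # empirical bin means of the responses (y-axis)

plot(log(bin_pred), log(bin_obs), pch = 19, col = "blue",
     xlab = "estimated regression function (log-scale)",
     ylab = "bin averages (log-scale)")
abline(0, 1, col = "orange")                 # diagonal: perfect auto-calibration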
The resulting two-step fitting procedure is illustrated by the blue dots in Figure 4.2, and
they are compared to the orange diagonal line. If the blue dots lie on this orange diagonal
line, the second regression step does not learn any new estimated means compared to the
first estimates µ̂. This indicates that auto-calibration holds for µ̂. From Figure 4.2, there
is some indication that on some values µ̂(Xt) the average responses are misspecified,
µ̂(Xt) ≠ E[Yt | µ̂(Xt)], and, thus, these insurance policies are likely cross-subsidized by
other policies to cover the entire portfolio claim. It is also striking that there seems
to be non-monotonicity in the blue dots, which indicates that the first regression function
(µ̂(Xt))_{t=1}^m does not provide the correct ranking.

Note that the type of plot in Figure 4.2 goes by different names: sometimes these plots are called lift
plots, auto-calibration plots or T-reliability diagrams; see Gneiting–Resin [80]. There is
no unique terminology for these kinds of plots, and sometimes it is even contradictory.
In some cases, plots that show both regression steps on the x-axis are called lift plots,
and in some literature the cumulative accuracy profile, studied in Section 4.3.1, below,
is called a lift plot.

4.2 General purpose non-parametric regressions


In the previous section on auto-calibration, we have seen that it is sometimes useful to
have a general purpose non-parametric regression tool. In this section, we introduce two
such tools, local regression and isotonic regression, and we are going to revisit Figure 4.2
with these new tools.

4.2.1 Local regression


We start by discussing a non-parametric local regression method. This non-parametric
local regression method is based on splines and it gives us a general purpose tool to smooth


graphs and results. The standard reference for local regression is Loader [142], who is
also the author of the R package locfit; the present section is taken from Loader [142,
Chapter 2]. The goal is to locally fit a non-parametric polynomial regression function to
a sample (Yi, Xi, vi)_{i=1}^n that has one-dimensional real-valued covariates Xi ∈ R.
Assume we want to fit a regression value µ̂^loc(X) in a fixed covariate value X ∈ R by
only considering the instances (Yi, Xi, vi) with covariates Xi in the neighborhood of X.
First, we select a bandwidth δ(X) > 0 to define the smoothing window

∆(X) = (X − δ(X), X + δ(X)).

This determines the neighborhood around X. Only instances with Xi ∈ ∆(X) are
considered for estimating µ̂^loc(X). Second, we introduce a weighting function. Often
the tricube weighting function w(u) = (1 − |u|³)³ is used, u ∈ [−1, 1]. This weights
the instances i within the smoothing window w.r.t. their relative distances ui = (Xi −
X)/δ(X) to X. The goal then is to fit a spline to the weighted observations in this
smoothing window. For illustration, let us select a quadratic spline

x ↦ µϑ(x; X) = ϑ0 + ϑ1 (x − X) + ϑ2 (x − X)²,

with regression parameter ϑ = (ϑj)_{j=0}^2 ∈ R³. This motivates us to consider the following
weighted local regression problem around X

ϑ̂ = arg min_ϑ Σ_{i=1}^n vi 1_{Xi ∈ ∆(X)} w((Xi − X)/δ(X)) (Yi − µϑ(Xi; X))².

The fitted local regression value in X is then obtained by setting

µ̂^loc(X) := µ_ϑ̂(X; X) = ϑ̂_0.

We revisit Figure 4.2 which studies the auto-calibration property of a fitted GLM µ̂, and
we repeat the two-step fitting procedure discussed in Section 4.1.3, but this time we apply
a local regression for the second fitting step Yt ∼ µ̂(Xt), based on the independent
instances 1 ≤ t ≤ m. For this, we select quadratic splines, and the bandwidth δ(X) is
chosen such that the smoothing window ∆(X) contains a nearest neighbor fraction of
α = 10% of the data T. The results are presented in Figure 4.3. The conclusion of this
plot is that a priori (without further investigation) we cannot reject the null-hypothesis
of having an auto-calibrated GLM µ̂, because the resulting blue dots fluctuate around
the orange diagonal line. The main question is whether these fluctuations are too large,
and a second question is why we do not receive a monotone picture. These two issues
may indicate that the ordering received by (µ̂(Xt))_{t=1}^m is not correct. But to come to a
firm conclusion, further analysis is necessary.
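A possible implementation of this second fitting step is sketched below (synthetic data, illustrative names); it assumes the locfit package is installed and uses a nearest neighbor fraction of 10% with local quadratic fitting, similar to the settings behind Figure 4.3 (base R's loess() would be an alternative).

# Sketch (synthetic data): local quadratic regression with ~10% nearest neighbor fraction.
library(locfit)
set.seed(6)
m <- 5000
mu_hat <- sort(runif(m, 0.02, 0.35))                   # first-step regression values (illustrative)
Y <- rpois(m, mu_hat)                                  # responses on the test sample

fit_loc <- locfit(Y ~ lp(mu_hat, nn = 0.1, deg = 2))   # tricube weights, quadratic local fit
grid   <- seq(min(mu_hat), max(mu_hat), length.out = 100)
mu_loc <- predict(fit_loc, newdata = grid)             # locally smoothed regression values

plot(grid, mu_loc, type = "l", col = "blue",
     xlab = "estimated regression function", ylab = "local regression")
abline(0, 1, col = "orange")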
Naturally, the hyper-parameters of the nearest neighbor fraction α ∈ (0, 1] and the degree
of the splines have a crucial influence on the results, and changing these parameters can
provide rather different results. Especially, the choice of the nearest neighbor fraction α
can be very sensitive: for a small value of α we do not receive a credible estimate µ̂^loc(X),
because the (random) noise dominates the systematic effects, and for α close to one, we
consider instances i remote from X which may have a completely different systematic
response behavior. Thus, there is a critical trade-off between small and large values of
the nearest neighbor fraction α. This also impacts the conclusions gained from Figure 4.3.


Figure 4.3: Lift plot using a local regression fit; x-axis: estimated regression function (log-scale), y-axis: local regression (log-scale).

4.2.2 Isotonic regression


A disadvantage of the above local regression is that it is very sensitive to the hyper-parameter
selection; often, more robust methods are preferred. Isotonic regression is a non-parametric
regression method that does not involve any hyper-parameter tuning, but it is based on
the assumption of isotonicity between the one-dimensional real-valued covariates Xi and
the regression values µ∗ (Xi ) = E[Yi |Xi ] explaining the responses Yi . That is, for an
isotonic regression we assume for all 1 ≤ i, i′ ≤ n the following property to hold

Xi ≤ Xi′ ⇐⇒ µ∗ (Xi ) ≤ µ∗ (Xi′ ),

or, in other words, an isotonic regression is rank-preserving.


Isotonic regression goes back to Ayer et al. [8], Brunk et al. [33], Miles [158], Barlow
et al. [14], Barlow–Brunk [15], Kruskal [128], and the pool adjacent violators (PAV)
algorithm gives a fast implementation for solving an isotonic regression numerically; see
Leeuw et al. [133].
To perform an isotonic regression on a sample (Yi, Xi, vi)_{i=1}^n with one-dimensional real-
valued covariates Xi ∈ R, we assume w.l.o.g. that there are no ties in the covariates,
and that the instances are labeled such that X1 < X2 < . . . < Xn. Excluding ties is
w.l.o.g. because we can replace instances with ties by merging them and we obtain the
same isotonic regression solution. Merging means the following: assume that Xi = . . . =
Xi+k take the same value. Then, we merge these instances by assigning them to one single
instance with covariate Xi ← Xi+l, 0 ≤ l ≤ k, with weight and response, respectively,

vi ← vi + . . . + vi+k    and    Yi ← (vi Yi + . . . + vi+k Yi+k) / (vi + . . . + vi+k),    (4.8)

this reduces the sample size, but accounts for this smaller sample size by increasing the
corresponding weight. Basically, this merging means that we build sufficient statistics on
instances with identical covariates.


The isotonic regression solution µ̂^iso ∈ R^n under the assumption X1 < X2 < . . . < Xn is
obtained by solving the following constrained optimization problem

µ̂^iso = arg min_{µ∈R^n} Σ_{i=1}^n vi (Yi − µi)²,    subject to µ1 ≤ . . . ≤ µn.    (4.9)

We give some remarks.

Remarks 4.4. • The isotonic regression (4.9) is based on the (strictly consistent)
square loss function for mean estimation. However, every strictly consistent loss
function for mean estimation gives the identical solution; see Barlow et al. [14,
Theorem 1.10].

• The PAV algorithm due to Ayer et al. [8], Miles [158] and Kruskal [128] is used to
solve the constrained optimization problem (4.9). Essentially, the PAV algorithm is
based on merging neighboring classes by applying (4.8) if the isotonic assumption
is violated by the corresponding sample means of adjacent bins (indices); for details
see Wüthrich–Ziegel [244, Appendix]. Thus, the PAV algorithm constructs the
isotonic estimate µ̂^iso by optimally binning the instances, optimal w.r.t. the square
loss objective function and w.r.t. the ranking of the covariates. This precisely
replaces any hyper-parameter choice that would otherwise need to be set, e.g., the
ones in local regression.

• The isotonic regression only gives regression values in the discrete covariate values,
µ̂^iso(Xi) := µ̂_i^iso, 1 ≤ i ≤ n, and typically a step-function interpolation is used.

A major argument for the isotonic regression is that it provides an empirically auto-
calibrated regression solution. That is, through (optimal) binning and empirical mean
computations, we obtain

µ̂^iso(Xi) = µ̂_i^iso = ( Σ_{j=1}^n vj Yj 1_{{µ̂_j^iso = µ̂_i^iso}} ) / ( Σ_{j=1}^n vj 1_{{µ̂_j^iso = µ̂_i^iso}} ) = Ê[ Yi | µ̂^iso(Xi) ],    (4.10)

the latter denoting the empirical mean of the instances having regression estimate µ̂^iso(Xi).
Empirical auto-calibration (4.10) expresses that we perform binning in the PAV algorithm,
and the bin labels are precisely the empirical means of the bins; see also target
encoding (2.16). This verifies empirical auto-calibration.
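The following R sketch shows a simple weighted PAV implementation of the isotonic regression (4.9); the function name is ours, and base R's isoreg() provides an unweighted alternative.

# Sketch: weighted pool adjacent violators (PAV) algorithm for the isotonic regression (4.9).
pav <- function(y, v = rep(1, length(y))) {
  # y: responses ordered by increasing covariate values; v: case weights
  mu <- y; w <- v; idx <- as.list(seq_along(y))
  i <- 1
  while (i < length(mu)) {
    if (mu[i] > mu[i + 1]) {                       # adjacent violator: merge the two bins as in (4.8)
      mu[i]  <- (w[i] * mu[i] + w[i + 1] * mu[i + 1]) / (w[i] + w[i + 1])
      w[i]   <- w[i] + w[i + 1]
      idx[[i]] <- c(idx[[i]], idx[[i + 1]])
      mu <- mu[-(i + 1)]; w <- w[-(i + 1)]; idx <- idx[-(i + 1)]
      i <- max(i - 1, 1)                           # step back: the merge may create a new violation
    } else i <- i + 1
  }
  fitted <- numeric(sum(lengths(idx)))
  for (k in seq_along(idx)) fitted[idx[[k]]] <- mu[k]   # bin means as fitted values
  fitted
}

pav(c(1, 3, 2, 4, 3, 5))   # returns the isotonic fit 1, 2.5, 2.5, 3.5, 3.5, 5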
We revisit the auto-calibration analysis of Figures 4.2 and 4.3, but instead of using decile
binning or a local regression, we perform an isotonic re-calibration (regression). For this,
we fit a first regression function X ↦ µ̂(X) to the learning sample L = (Yi, Xi, vi)_{i=1}^n.
As discussed in Section 4.1.3, this first regression function µ̂ does not necessarily fulfill
the auto-calibration property (4.6). The idea of isotonic re-calibration is to apply the
re-calibration step (4.7), under the assumption that the first fitted regression function µ̂
provides the right risk ranking for the true regression function µ*, that is,

µ̂(Xt) ≤ µ̂(Xt′) ⟺ µ*(Xt) ≤ µ*(Xt′),


for all 1 ≤ t, t′ ≤ m. Under this assumption it is sensible to apply an isotonic regression


to the sample (Yt , µb(X t ), vt )m
t=1 with new real-valued covariates Xt := µ
b(X t ) used as
rankings to perform the re-calibration step (4.7). Isotonic regression (4.9) provides us
with the isotonically re-calibrated estimates

b(X t ) 7→ µ
µ biso
brc (X t ) := µ t

These new isotonically re-calibrated estimates (µ brc (X t ))m


t=1 have the same ranking as the
initial estimates (µ m
b(X t ))t=1 , modulo ties, but these new isotonically re-calibrated esti-
mates (µ m
brc (X t ))t=1 fulfill (empirical) auto-calibration, see (4.10), and this re-calibrated
model also fulfills the balance property (4.4). These are pretty strong results!

Figure 4.4: Lift plot with isotonic re-calibration; this continues from Figures 4.2-4.3 (x-axis: estimated regression function, y-axis: isotonic regression).

We come back to the lift plots of Figures 4.2-4.3, but this time we use an isotonic re-calibration
step to receive the blue dots in Figure 4.4. The crucial differences to the previous two
plots are: (1) the isotonically re-calibrated lift plot is rank preserving, giving a monotone
regression in Figure 4.4; (2) the binning is optimal w.r.t. any strictly consistent loss
function for mean estimation and subject to the initial ranking; (3) the solution µ̂^iso is
auto-calibrated and the balance property holds; (4) the Gini score, introduced in Section
4.3.1, below, gives a valid model selection criterion because of auto-calibration.
We conclude with the following result, which is mentioned in Wüthrich–Ziegel [244].

Corollary 4.5. Assume the estimated regression function µ̂ : X → R gets the ranking
correct, i.e., µ̂(X) and µ*(X) are strictly comonotonic. The true regression function µ*
is found by the re-calibration step

µ*(X) = E[Y | µ̂(X)],    a.s.

If an auto-calibrated predictor is needed, isotonic regression is suggested. A clear advantage
is that the order of the prices is not changed, and the levels are only locally
perturbed. The first of three disadvantages we see is that the final regression model is a two-
step procedure with the second model dropping out of the initial model class M, i.e., this


is a non-parametric solution that is no longer, e.g., a GLM, even if the first regression
model is a GLM. As a consequence, it is no longer as easily explainable as a GLM,
and it is also more difficult to manually change the model. The second disadvantage is
that the resulting regression function has discontinuities, which is generally not appreciated in
insurance pricing. This disadvantage can be removed by replacing the step function
by linearly interpolating functions between the observations. The third problem is that
the isotonic re-calibration step needs special attention at the lower and upper boundaries
of the support, as it tends to over-fit in this part of the covariate space, which may require
manually merging the largest and the smallest bins, respectively. Coming back to Figure
4.4, the lowest three bins have been merged to ensure that all predicted values are strictly
positive, and Figure 4.4 suggests also merging the two or three highest bins because the
largest prediction seems to be an outlier, over-fitting to a single observation.
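The isotonic re-calibration step can also be illustrated with a short sketch. The following is a minimal Python sketch using scikit-learn's IsotonicRegression, assuming first-stage predictions, observations and volumes are available as arrays; the simulated data below are placeholders only, and this is not the implementation behind Figure 4.4.

import numpy as np
from sklearn.isotonic import IsotonicRegression

# simulated placeholder data: first-stage predictions mu_hat, responses y, volumes v
rng = np.random.default_rng(0)
mu_hat = rng.lognormal(mean=-2.5, sigma=0.5, size=1000)
y = rng.poisson(lam=mu_hat).astype(float)
v = np.ones_like(mu_hat)

# isotonic regression of y on the ranks induced by mu_hat; increasing=True
# preserves the initial risk ranking (modulo ties)
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
mu_rc = iso.fit_transform(mu_hat, y, sample_weight=v)

# empirical balance property: the weighted averages of mu_rc and y coincide
print(np.average(mu_rc, weights=v), np.average(y, weights=v))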

4.3 Model analysis tools


We close this interlude by discussing two model validation tools that are used in ma-
chine learning and statistics. These two tools require some care because they need
auto-calibrated regression functions and/or isotonic regression. We highlight this in the
following discussion.

4.3.1 Gini score


The Gini score is often used in practice for model validation and model selection. The
Gini score goes back to an economic concept introduced in 1912 by Gini [74, 75], called the
Gini index. The Gini index is based on the Lorenz [144] curve. It is a measure for the
disparity of the wealth distribution across a given population; see Lorentzen et al. [143]
for a broader discussion. This concept of disparity quantification has been adapted to
a machine learning context for model validation and model selection, meaning that,
typically, a higher Gini score indicates a better discrimination of the regression function µ(X) for
the claims prediction of Y.
Assume we have an observed sample $(Y_i, X_i, v_i = 1)_{i=1}^n$ and we let its indices be ordered
such that $\mu(X_1) < \mu(X_2) < \ldots < \mu(X_n)$ for the given regression function $\mu: \mathcal{X} \to \mathbb{R}$.
This may require relabeling the instances $1 \le i \le n$, and for simplicity we
assume that there are no ties in this ordering. The empirical version of the (mirrored)
Lorenz [144] curve is defined by¹
$$\alpha \in (0,1) \;\mapsto\; \hat{L}_\mu(\alpha) = \frac{1}{\frac{1}{n}\sum_{i=1}^n \mu(X_i)}\; \frac{1}{n}\sum_{i=\lceil(1-\alpha)n\rceil+1}^{n} \mu(X_i). \tag{4.11}$$

¹We call the curve in (4.11) an empirical Lorenz curve, because we take the sample average over the covariates $(X_i)_{i=1}^n$, and the non-empirical version would consider the population distribution for $X \sim \mathbb{P}$, instead. There is a second ingredient which is the regression function µ. On purpose, we did not put hats on µ because the Gini score (4.13) should be evaluated out-of-sample (this also applies to the subsequent cumulative accuracy profile (4.12)). That is, if the regression function $\hat{\mu}$ is estimated from a learning sample $\mathcal{L}$, then the cumulative accuracy profile $\hat{C}_{\hat{\mu}}(\alpha)$ should be computed on an independent test sample $\mathcal{T}$, and there should be two hats in the notation $\hat{C}$. In this section, we use a generic regression function µ to explain the theory.


This empirical mirrored Lorenz curve measures the contribution of the largest regression
values $(\mu(X_i))_{i=\lceil(1-\alpha)n\rceil+1}^n$ to the portfolio average.² It is precisely this property that
allows Gini [74, 75] to describe discrimination or disparity in wealth distribution by
computing the resulting area under the curve (AUC). The bigger this area, the more
dispersed are the regression values $(\mu(X_i))_{i=1}^n$.³
For statistical modeling, we replace the empirical Lorenz curve (4.11) by the cumula-
tive accuracy profile (CAP); see Ferrario–Hämmerli [67, Section 6.3.7] and Tasche [213];
Denuit–Trufin [54] call the same construction concentration curve. The (empirical) cu-
mulative accuracy profile of regression function µ(·) is given by
$$\alpha \in (0,1) \;\mapsto\; \hat{C}_\mu(\alpha) = \frac{1}{\frac{1}{n}\sum_{i=1}^n Y_i}\; \frac{1}{n}\sum_{i=\lceil(1-\alpha)n\rceil+1}^{n} Y_i; \tag{4.12}$$

compared to (4.11), we replace the predictions µ(X i ) by the observations Yi , but, impor-
tantly, we keep the order of the regression values µ(X 1 ) < µ(X 2 ) < . . . < µ(X n ) in the
indices 1 ≤ i ≤ n. Similarly to the empirical Lorenz curve, a better discrimination results
in a bigger AUC, and the maximal AUC is obtained if the claim sizes (Yi )ni=1 provide the
same ordering as the regression values (µ(X i ))ni=1 . This is precisely the motivation to
use this concept for model selection. We illustrate this in Figure 4.5.

[Figure 4.5: Cumulative accuracy profile (CAP) given by $\alpha \mapsto \hat{C}_\mu(\alpha)$ used to define the Gini score (4.13); the plot compares the fitted model (areas A and B) to the null model and the perfect model as a function of alpha; this plot is taken from [241].]

The orange dotted line in Figure 4.5 shows the cumulative accuracy profile of perfectly
aligned responses (Yi )ni=1 and regression values, and the red line the cumulative accuracy
²These considerations are also a well-known method in extreme value theory and risk management. E.g., one speaks about the 20-80 Pareto rule, which means that the 20% largest claims make up 80% of the total claim amount; see Embrechts et al. [63, Section 8.2.3].
³Remark that in an economic context the Lorenz curve is usually below the 45° line because the summation in the upper tail in (4.11) is switched to the lower tail, compare Figure 4.5 and Goldburd et al. [81, Figure 25]. In machine learning, one typically considers the curve mirrored at the diagonal, giving a sign switch in all inequalities.


profile w.r.t. the selected regression function $(\mu(X_i))_{i=1}^n$. The dotted blue line corresponds
to the null model $\hat{\mu}_0 = \frac{1}{n}\sum_{i=1}^n Y_i$, not considering any covariates, but the global
empirical mean $\hat{\mu}_0$ instead. The Gini score is defined by, see Figure 4.5,
$$\mathrm{Gini}(\mu) = \frac{\mathrm{area}(A)}{\mathrm{area}(A+B)} \;\le\; 1, \tag{4.13}$$

where the areas under the curves, area(A) and area(A + B), have precisely the meaning
illustrated in Figure 4.5. Generally, a bigger Gini score is interpreted as a better discrimination
of the selected regression model µ(·) for the responses Y. We remark that this Gini
score (defined by the AUC) is equivalent to the so-called receiver operating characteristic (ROC)
curve method for binary responses Y; see Tasche [213, formula (5.6a)].
There is one critical issue with this model selection technique. Namely, the Gini score
is not a strictly consistent model selection tool. The Gini score is fully rank based,
but it does not consider whether the model lives on the right scale; this is precisely
the point raised in Wüthrich [239]. However, if we can additionally ascertain that the
regression functions µ(·) under consideration are auto-calibrated for (Y, X), the Gini
score is a sensible model selection tool; see Wüthrich [239, Theorem 4.3]. Basically, auto-
calibration lifts the models to the right level (the level of the responses), and the Gini
score then verifies whether the ordering of the policies w.r.t. their responses is optimal.
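To make the construction concrete, the following minimal Python sketch computes the empirical CAP (4.12) and the resulting Gini score (4.13) for unit weights; the simulated data and the function name gini_score are placeholders for illustration only, and in practice the score should be evaluated out-of-sample.

import numpy as np

def gini_score(y, mu):
    # order the sample by increasing regression values mu(X_i)
    y_sorted = y[np.argsort(mu)]
    total = np.sum(y)
    # CAP of the model: cumulative share of responses of the largest predictions
    cap_model = np.cumsum(y_sorted[::-1]) / total
    cap_null = np.arange(1, len(y) + 1) / len(y)            # null model (diagonal)
    cap_perfect = np.cumsum(np.sort(y)[::-1]) / total       # perfect model
    area_A = np.sum(cap_model - cap_null)                   # area between model and null
    area_AB = np.sum(cap_perfect - cap_null)                # area between perfect and null
    return area_A / area_AB

rng = np.random.default_rng(1)
mu = rng.lognormal(-2.5, 0.5, size=10_000)                  # out-of-sample predictions
y = rng.poisson(lam=mu).astype(float)                       # observed responses
print(gini_score(y, mu))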

4.3.2 Murphy’s score decomposition


We may want to analyze how well a given regression model µ discriminates the responses
Y w.r.t. the covariates X. One possible method is Murphy’s [161] score decomposition.
For related literature we refer to Pohle [177], Gneiting–Resin [80], Fissler et al. [69] and
Semenovich–Dolman [206].
Select a strictly consistent loss function L for mean estimation that is lower-bounded by
zero, and such that all the following terms exist. Murphy's score decomposition is given
by
$$E\left[L(Y, \mu(X))\right] = \mathrm{UNC}_L - \mathrm{DSC}_L + \mathrm{MSC}_L,$$
with uncertainty, discrimination (resolution) and miscalibration, respectively,
$$\begin{aligned}
\mathrm{UNC}_L &= E\left[L(Y, \mu_0)\right] \;\ge\; 0,\\
\mathrm{DSC}_L &= E\left[L(Y, \mu_0)\right] - E\left[L(Y, \mu_{\rm rc}(X))\right] \;\ge\; 0,\\
\mathrm{MSC}_L &= E\left[L(Y, \mu(X))\right] - E\left[L(Y, \mu_{\rm rc}(X))\right] \;\ge\; 0.
\end{aligned}$$

The uncertainty term $\mathrm{UNC}_L$ quantifies the total prediction uncertainty when not using any
covariates, i.e., when using the global mean prediction $\mu_0 = E[Y]$. The discrimination
(resolution) term $\mathrm{DSC}_L$ quantifies the reduction in prediction uncertainty if we use the
auto-calibrated regression function $\mu_{\rm rc}(X)$ based on covariate information X, see (4.7).
Finally, the miscalibration term $\mathrm{MSC}_L$ vanishes if the regression function µ(X) is auto-calibrated;
otherwise it quantifies the auto-calibration misspecification.
In applications, we need to compute these quantities empirically, out-of-sample, similarly
to the previous sections. The auto-calibrated model µrc can be determined with an
isotonic re-calibration step (4.9).
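As a sketch of such an empirical computation, the following Python snippet evaluates Murphy's score decomposition with the square loss, using an isotonic re-calibration step for µ_rc; the simulated data and the helper name murphy_decomposition are placeholders, and in practice the decomposition should be evaluated out-of-sample.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def murphy_decomposition(y, mu):
    loss = lambda y, m: np.mean((y - m) ** 2)                          # square loss as an example
    mu0 = y.mean()                                                     # null model (global mean)
    mu_rc = IsotonicRegression(increasing=True).fit_transform(mu, y)   # auto-calibrated version
    unc = loss(y, mu0)
    dsc = unc - loss(y, mu_rc)
    msc = loss(y, mu) - loss(y, mu_rc)
    return unc, dsc, msc                                               # loss(y, mu) = UNC - DSC + MSC

rng = np.random.default_rng(2)
mu = rng.lognormal(-2.5, 0.5, size=10_000)
y = rng.poisson(lam=mu).astype(float)
unc, dsc, msc = murphy_decomposition(y, mu)
print(unc, dsc, msc, unc - dsc + msc, np.mean((y - mu) ** 2))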


There are other decompositions which, e.g., compare a regression model µ to the true
one µ∗ w.r.t. the available covariate information; see Fissler et al. [69, formula (17)],

$$E\left[L(Y, \mu(X))\right] = \mathrm{UNC}_L - \mathrm{DSC}_L^* + \mathrm{MSC}_L^*, \tag{4.14}$$

with maximal discrimination (resolution) and conditional miscalibration, respectively,

$$\begin{aligned}
\mathrm{DSC}_L^* &= E\left[L(Y, \mu_0)\right] - E\left[L(Y, \mu^*(X))\right] \;\ge\; 0,\\
\mathrm{MSC}_L^* &= E\left[L(Y, \mu(X))\right] - E\left[L(Y, \mu^*(X))\right] \;\ge\; 0.
\end{aligned}$$

Because the true regression model µ∗ is unknown, this decomposition is of less practical
value. These considerations are not restricted to the true regression model µ∗ , but
one can also compare two regression functions µ1 and µ2 . Positivity of the conditional
miscalibration is then related to convex orders of regression functions. In particular, if
we choose the square loss function for L, such a convex order can easily be obtained by
martingale arguments on filtrations (which reflect increasing sets of information, i.e., if
we increase the information set σ(X) by including more covariate information σ(X + ) ⊃
σ(X), we receive a higher resolution in the resulting true regression function based on
X + ); see Wüthrich–Buser [241, Section 2.5].4

⁴Basically, this reflects a martingale construction from an integrable random variable and a filtration.



Chapter 5

Feed-forward neural networks

In this chapter, we introduce feed-forward neural networks (FNNs) for tabular
data, and in this introduction we put special emphasis on FNN architectures that
are particularly suitable for solving actuarial problems. Chapters 1 to 3 have
introduced the statistical foundations needed for considering FNNs.

FNNs are also called artificial neural networks (ANNs) or, if they have multiple layers,
multi-layer perceptrons (MLPs), and, more generally, deep learning (DL) architectures. In this
chapter, we study plain-vanilla (standard) FNNs.

Figure 5.1: Feed-forward neural network architecture.

In a nutshell, networks perform representation learning, meaning that a feature extractor


learns a new representation of the covariates, such that this new representation has a


suitable structure to enter a (generalized) linear model; this is illustrated in Figure 5.1
by the blue and green boxes. Having done all the preparatory work in the GLM Chapter
3, this FNN extension is a natural one:

Generalized linear models (GLMs) vs. feed-forward neural networks (FNNs)

Recall the GLM regression structure (3.1)

X 7→ g(µ(X)) = ⟨ϑ, X⟩ .

Inserting a feature extractor $z^{(d:1)}$ modifies this GLM structure to the FNN equation
$$X \mapsto g(\mu(X)) = \left\langle \vartheta, z^{(d:1)}(X) \right\rangle. \tag{5.1}$$
This is precisely the (natural) FNN extension to be discussed in this chapter.
It allows for non-linear structures and interactions of the covariate components.

5.1 Feed-forward neural network architecture


This section formalizes the FNN equation (5.1) by introducing all the relevant terms. The
fitting (learning) procedure of FNNs is discussed in Section 5.3, below. In this chapter
we work with tabular data, meaning that the covariates X are q-dimensional real-valued
vectors that can be stacked into design matrices (tables) X; see (2.10).

5.1.1 Feature extractor and readout


Deep FNNs consist of multiple hidden layers $z^{(m)}$, called FNN layers. An FNN layer
$$z^{(m)}: \mathbb{R}^{q_{m-1}} \to \mathbb{R}^{q_m}, \tag{5.2}$$
performs a non-linear transformation of the covariates; this is formalized in (5.5), below.


The main ingredients of such an FNN layer $z^{(m)}$ are:

(a) the number qm ∈ N of neurons, also called units;

(b) the non-linear activation function ϕ : R → R; and

(c) the network weights representing part of the model parameter ϑ.

Items (a) and (b) are hyper-parameters that are selected by the modeler, and the network
weights of item (c) are parameters that are learned during network training (model
fitting).
Having $d \in \mathbb{N}$ of these FNN layers $(z^{(m)})_{m=1}^d$, with matching input and output dimensions,
we compose them to a feature extractor of depth d
$$X \mapsto z^{(d:1)}(X) := \left(z^{(d)} \circ \cdots \circ z^{(1)}\right)(X) \in \mathbb{R}^{q_d}, \tag{5.3}$$

where the input dimension of the first FNN layer z (1) is the dimension of the covariates
X, given by q0 := q. This feature extractor is illustrated in the blue box of Figure 5.1


for an architecture of depth d = 2, with input dimension q0 = q = 16, and units q1 = 8


and q2 = 8 in the two FNN layers.
The final step of the FNN architecture is to use the learned representation $z^{(d:1)}(X)$ as
input to a GLM; this is called the readout. This provides us with an FNN architecture of
depth d
$$\mu(X) = E\left[\left. Y \,\right| X\right] = g^{-1}\left\langle w^{(d+1)}, z^{(d:1)}(X) \right\rangle, \tag{5.4}$$
where $w^{(d+1)} \in \mathbb{R}^{q_d+1}$ is the output/readout parameter, including a bias term $w_0^{(d+1)}$,
and $g^{-1}$ is the inverse of the chosen link function. Compared to the GLM in (3.2),
the (only) difference is that we replace the original covariates $X \in \mathbb{R}^{q+1}$ by the newly
learned representation $z^{(d:1)}(X) \in \mathbb{R}^{q_d+1}$ from the feature extractor (5.3). For notational
convenience, we change the notation of the readout parameter in (5.1) to $w^{(d+1)}$, compare
to (5.4).

In the next sections, we discuss the construction of the FNN layers z (m) in careful detail.

5.1.2 Activation function


Because feature extractors z (d:1) should be able to learn non-linear feature transfor-
mations of the original covariates, we need to select non-linear activation functions
ϕ : R → R. Table 5.1 gives popular choices.

activation function                    ϕ(x)                           derivative ϕ′(x)
sigmoid (logistic)                     σ(x) = (1 + e^{−x})^{−1}       ϕ(1 − ϕ)
hyperbolic tangent                     tanh(x) = 2σ(2x) − 1           1 − ϕ²
rectified linear unit (ReLU)           x 1_{x≥0}                      1_{x>0}, x ≠ 0
sigmoid linear unit (SiLU)             x σ(x)                         σ(x)(1 − ϕ(x)) + ϕ(x)
Gaussian error linear unit (GELU)      x Φ(x)                         Φ(x) + x Φ′(x)

Table 5.1: Popular choices of non-linear activation functions ϕ and their derivatives ϕ′;
Φ denotes the standard Gaussian distribution function, and Φ′(x) = e^{−x²/2}/√(2π) its density.

The activation functions and their derivatives in Table 5.1 are illustrated in Figure 5.2.
They have different properties, e.g., the first two activation functions are bounded, which
can be an advantage or a disadvantage, depending on the problem to be solved (do we
want bounded or unbounded functions?). The hyperbolic tangent is symmetric around zero,
which can be an advantage over the sigmoid in deep neural network fitting (because
it is naturally calibrated to zero and does not require adjusting biases). The ReLU is
an activation function that is very popular in the machine learning community; it leads
to sparsity in networks, and it is not differentiable at zero but it has a sub-gradient
because it is a convex function. The SiLU is a smooth version of the ReLU, but it is
neither monotone nor convex, see Figure 5.2. The GELU has recently gained popularity
in transformer architectures, and it has some similarity with the SiLU, see Figure 5.2.
Generally, it is difficult to give good advice for a specific selection of the 'best' activation
function, and this should rather be part of the hyper-parameter tuning for the specific
actuarial problem to be solved.
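For readers who want to experiment with these choices, the following minimal Python sketch implements the activation functions of Table 5.1 with NumPy and SciPy; it is only meant for illustration.

import numpy as np
from scipy.stats import norm

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

activations = {
    "sigmoid": sigmoid,
    "tanh": np.tanh,
    "ReLU": lambda x: np.maximum(x, 0.0),
    "SiLU": lambda x: x * sigmoid(x),
    "GELU": lambda x: x * norm.cdf(x),
}

x = np.linspace(-3.0, 3.0, 7)
for name, phi in activations.items():
    print(name, np.round(phi(x), 3))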


[Figure 5.2: Activation functions and their derivatives of Table 5.1; left panel: (non-linear) activation functions ϕ(x), right panel: derivatives ϕ′(x), both plotted for x ∈ [−3, 3].]

5.1.3 Feed-forward neural network layer

In this section, we formalize the FNN layer z (m) : Rqm−1 → Rqm introduced in (5.2). This
step is a bit technical, but basically it relates to defining all the connections between the
neurons (units) illustrated by the lines between the circles in Figure 5.1.
Select an activation function ϕ. The FNN layer $z^{(m)}$ is, for $x \in \mathbb{R}^{q_{m-1}}$, defined by
$$z^{(m)}(x) = \left(z_1^{(m)}(x), \ldots, z_{q_m}^{(m)}(x)\right)^\top, \tag{5.5}$$
with neurons (units), for $1 \le j \le q_m$, given by
$$z_j^{(m)}(x) = \phi\left(w_{j,0}^{(m)} + \sum_{l=1}^{q_{m-1}} w_{j,l}^{(m)} x_l\right) =: \phi\left\langle w_j^{(m)}, x\right\rangle, \tag{5.6}$$
with network weights $w_j^{(m)} = (w_{j,0}^{(m)}, \ldots, w_{j,q_{m-1}}^{(m)})^\top \in \mathbb{R}^{q_{m-1}+1}$, including the bias $w_{j,0}^{(m)}$.

We interpret this as follows. Every neuron $z_j^{(m)}(\cdot)$ corresponds to a circle in Figure
5.1. Each of these neurons performs a GLM operation with inverse link ϕ, see (5.6).
That is, every neuron performs a data compression (projection) from $x \in \mathbb{R}^{q_{m-1}}$ to the
real line, $z_j^{(m)}(x) \in \mathbb{R}$, $1 \le j \le q_m$. This inevitably results in a loss of information.
To compensate for this loss of information, each of the $q_m$ neurons $z_j^{(m)}(\cdot)$ performs a
different compression, reflected by different network weights $w_j^{(m)}$, so that (hopefully)
the relevant information for prediction is extracted by the feature extractor $z^{(d:1)}$.
As mentioned above, the selections of the activation function ϕ and of the number of
neurons $q_m$ are hyper-parameters selected by the modeler, whereas the (optimal) network
weights $w_j^{(m)}$ are learned during network training, see Section 5.3, below.
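The following minimal NumPy sketch illustrates the forward pass of one FNN layer (5.6) and of the composed feature extractor (5.3) for the dimensions of Figure 5.1; the random weights are placeholders only, in practice they are learned by SGD (see Section 5.3).

import numpy as np

def fnn_layer(x, W, b, phi=np.tanh):
    # one FNN layer z^(m): R^{q_{m-1}} -> R^{q_m}, see (5.5)-(5.6);
    # W has shape (q_m, q_{m-1}) and b collects the biases w_{j,0}^{(m)}
    return phi(b + W @ x)

rng = np.random.default_rng(0)
q = (16, 8, 8)                                    # (q_0, q_1, q_2) as in Figure 5.1
weights = [(rng.normal(scale=0.1, size=(q[m + 1], q[m])), np.zeros(q[m + 1]))
           for m in range(len(q) - 1)]

def feature_extractor(x):
    # composition z^(d:1) = z^(d) o ... o z^(1), see (5.3)
    for W, b in weights:
        x = fnn_layer(x, W, b)
    return x

x = rng.normal(size=q[0])                         # one (pre-processed) covariate vector
print(feature_extractor(x).shape)                 # learned representation in R^{q_d}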


5.1.4 Feed-forward neural network architecture

We can now put everything together by composing the FNN layers (assuming matching
input and output dimensions) to the feature extractor (5.3), and then applying the readout
to this feature-extracted covariate information. From (5.5)-(5.6), it follows that each FNN
layer $z^{(m)}$ has network weights $(w_1^{(m)}, \ldots, w_{q_m}^{(m)})$ of dimension $q_m(q_{m-1}+1)$. Collecting
all network weights of all layers, including the output parameter of (5.4), gives network
weights (for the notation, see also the footnote on page 16)
$$\vartheta = \left(w_1^{(1)}, \ldots, w_{q_d}^{(d)}, w^{(d+1)}\right) \in \mathbb{R}^r, \tag{5.7}$$
of total dimension
$$r = \sum_{m=1}^{d} q_m(q_{m-1}+1) + (q_d+1). \tag{5.8}$$

Indicating the network parameter in the notation motivates us to replace (5.4) by the
slightly adapted notation
$$\mu_\vartheta(X) = E\left[\left. Y \,\right| X\right] = g^{-1}\left\langle w^{(d+1)}, z^{(d:1)}(X) \right\rangle. \tag{5.9}$$

In this sense, an FNN architecture provides a class of parametric regression functions
$\mathcal{M} = \{\mu_\vartheta\}_\vartheta$, parametrized through the network weights $\vartheta \in \mathbb{R}^r$ given in
(5.7). For network regression fitting, we can (essentially) apply the same deviance
loss minimization principle as stated in (3.8) (we also refer to the box on page 54),
or regularized versions thereof as presented in Section 2.4. We will come back to
this below because it turns out that FNN fitting is more difficult than expected.

Example 5.1. We discuss the FNN example of depth d = 2 given in Figure 5.1. It has
a 16-dimensional covariate vector X providing an input dimension of q0 = q = 16. The
first hidden layer $z^{(1)}: \mathbb{R}^{q_0} \to \mathbb{R}^{q_1}$ has $q_1 = 8$ neurons providing 8 · 17 = 136 network
weights. The second hidden layer $z^{(2)}: \mathbb{R}^{q_1} \to \mathbb{R}^{q_2}$ has $q_2 = 8$ neurons providing 8 · 9 = 72
network weights. Finally, the output parameter has dimension 9; thus, altogether the
FNN architecture of Figure 5.1 has network weights ϑ of dimension r = 217. These
network weights ϑ need to be fitted from the learning sample $\mathcal{L}$.
The hyper-parameters selected by the modeler are the depth d = 2, the number of hidden
neurons q1 = 8 and q2 = 8, as well as the activation function ϕ and the inverse link
g −1 . For model fitting, the modeler needs to additionally select the (strictly consistent)
loss function L for mean estimation, as well as the optimization algorithm to solve the
optimization problem. This optimization algorithm will have a significant impact on the
selected network, i.e., this is different from GLMs where the solution is fully determined
by the model architecture and the loss function. This might be surprising at first,
and we are going to discuss this in detail below. ■
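The dimension count of Example 5.1 can be verified with a few lines of Python; the helper function network_dimension below is hypothetical and simply mirrors formula (5.8).

def network_dimension(q):
    # q = (q_0, q_1, ..., q_d); returns r of (5.8), including the readout parameter
    d = len(q) - 1
    r = sum(q[m] * (q[m - 1] + 1) for m in range(1, d + 1))
    return r + (q[d] + 1)

print(network_dimension((16, 8, 8)))   # 136 + 72 + 9 = 217, as in Example 5.1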


There is a recurrent discussion about the optimal FNN architecture. Generally,
this question is difficult to answer, and we are going to give more explicit instructions
in Section 5.3.8, below. The choice of a good network architecture does not
only depend on the predictive problem to be solved, but also on the selected model
fitting algorithm. Therefore, we first discuss model fitting.

We close this section with some remarks before discussing the practical issues related to
network fitting.

Remarks 5.2. • The above FNN architecture is also called a fully-connected FNN
because every neuron $z_k^{(m-1)}$ is connected to every neuron $z_j^{(m)}$ by the network weight
$w_{j,k}^{(m)}$.

• Sometimes, one makes a distinction between shallow FNNs of depth d = 1 and


deep FNNs with d ≥ 2. In practical applications, one should always work with
deep FNNs, because composing layers facilitates interaction modeling. For non-life
actuarial pricing problems, a value d ∈ {3, . . . , 6} is often a suitable choice;
we generally use d = 3 for our French MTPL claims frequency example because
this has proved to be suitable for this kind of (small) problem with 9 covariate
components.

• The FNN defined in (5.9) has a one-dimensional output. We can also have a multi-
output network for multi-task learning, e.g., if we want to predict claims frequency
and claims severity (simultaneously), we may select a two-dimensional output
$$X \mapsto \left(g_1^{-1}\left\langle w_1^{(d+1)}, z^{(d:1)}(X) \right\rangle,\; g_2^{-1}\left\langle w_2^{(d+1)}, z^{(d:1)}(X) \right\rangle\right)^\top.$$

In this case, the feature extractor z (d:1) (X) should learn the relevant information
for both claims frequency and claims severity prediction.

• We call FNNs parametric models because once the architecture is fixed, the size
of the network parameter ϑ is determined. This is different from non-parametric
models where the dimension of the parameter is not given a priori. For instance,
in regression trees, discussed below, every iteration of the fitting algorithm will
add a new parameter to the model. Sometimes, people call FNNs semi-parametric
models. One reason for this is that the dimension of the network parameter ϑ does
not determine the complexity of the FNN regression function. FNN regression
functions are not parsimonious, i.e., they usually have much redundancy, and there
is research on exploring the ‘effective dimension’ of FNNs; we refer, e.g., to Abbas
et al. [1].

5.2 Universality theorems


The universality theorems are at the core of the great success of FNNs. The main uni-
versality theorem statement says that any compactly supported continuous (regression)


function can be approximated arbitrarily well by a suitable (and sufficiently large) FNN.
This approximation can be w.r.t. different norms, and the assumptions for such a statement
to hold are comparatively weak, e.g., the sigmoid activation function leads to a class
of FNNs that are universal in the above sense. For precise mathematical statements and
proofs about these denseness results, we refer to Cybenko [47], Hornik et al. [103], Hornik
[102], Leshno et al. [134] and Isenbeck–Rüschendorf [108]; and there is a vast literature
with similar statements and proofs.
These universality statements imply that basically any (continuous and compactly supported)
regression function can be approximated arbitrarily well within the class of FNNs.
This sounds very promising:
• First, it means that the class of FNNs is very rich and flexible.

• Second, no matter what the specific true data generating model looks like, there is
an FNN that is similar to this data generating mechanism, and our aim is to find it
using the learning sample L generated by this mechanism.
Unfortunately, there is a flip side of the coin of these exciting properties:
• There is no hope of finding a (best) parsimonious FNN (on finite samples). In other
words, within the class of FNNs there are infinitely many (almost equally) good
candidate models. Based on a finite sample, there is no best selection; we can only
distinguish clearly better from clearly worse models. This can almost be stated as
a paradigm in FNN predictive modeling.

• The model selection/fitting problem is very high-dimensional and non-convex (for


any reasonable choice of an objective function), and usually, at best, the found
'solutions' are (close to) local optima.

• Model selection within the class of FNNs has several elements of randomness, e.g.,
a fitting algorithm needs to be (randomly) initialized and this impacts the selected
solution. To be able to replicate results, the fitting procedure has to be designed
very carefully and seeds of random number generators need to be stored to be able
to track and replicate the specific solutions.
Some of the previous items will only become clear once we have introduced stochastic
gradient descent fitting, and the reader should keep these (critical) items in mind for the
discussions below.

5.3 Network fitting with gradient descent

Based on the fact that we cannot find a ‘best’ FNN approximation to the true
model on finite samples (see discussion above), we try to find a ‘reasonably good’
FNN approximation to the true data generating mechanism. Reasonably good
means that it usually outperforms a classical GLM, but at the same time there
are infinitely many other FNNs that have a similarly good predictive performance
(generalization to new data).


Due to the non-convexity and the complexity of the problem, computational aspects are
crucial in designing a good FNN learning algorithm. The main tool is stochastic gradient
descent (SGD) which stepwise adaptively improves the network weights ϑ w.r.t. a given
objective function. We are going to derive the SGD algorithm step by step, arriving at the
full algorithm in Section 5.3.7, below. On the way there, we need to discuss several items
and issues, which is done in the next sections. The next section starts by introducing the
standard gradient descent method.

5.3.1 The standard gradient descent method


Similarly to (3.8), we try to find network weights ϑb that provide a small empirical loss
on the learning sample $\mathcal{L} = (Y_i, X_i, v_i)_{i=1}^n$; the empirical loss in network parameter ϑ on
$\mathcal{L}$ is defined by
$$L(\vartheta; \mathcal{L}) := \sum_{i=1}^n \frac{v_i}{\varphi}\, L\big(Y_i, \mu_\vartheta(X_i)\big), \tag{5.10}$$
where µϑ is the FNN (5.9) with network weights ϑ ∈ Rr and L is a strictly consistent
loss function for mean estimation. We add the learning sample L to the loss notation
L(ϑ; L) because for SGD training we are going to vary over different learning samples.
The standard gradient descent method works as follows. Assume that at algorithmic time
t ∈ N0 , we have network weights ϑ[t] ∈ Rr providing the empirical loss L(ϑ[t] ; L). We
would like to stepwise adaptively improve these network weights ϑ[t] 7→ ϑ[t+1] such that
the empirical loss decreases in each step t 7→ t + 1. We try to achieve this by a small
perturbation of ϑ[t] that leads to a local improvement of the network weights. Local
(small) changes can be described by a first order Taylor expansion¹
$$L\left(\vartheta^{[t+1]}; \mathcal{L}\right) \approx L\left(\vartheta^{[t]}; \mathcal{L}\right) + \nabla_\vartheta L\left(\vartheta^{[t]}; \mathcal{L}\right)^\top \left(\vartheta^{[t+1]} - \vartheta^{[t]}\right), \tag{5.11}$$

for ϑ[t+1] close to ϑ[t] ; and ∇ϑ denotes the gradient (derivative) w.r.t. the network weights.
The right-hand side of the above approximation (5.11) becomes minimal if the second
term is as negative as possible. Therefore, the update in the network weights should
point into the opposite direction of the gradient.
This motivates the standard gradient descent update
$$\vartheta^{[t]} \;\mapsto\; \vartheta^{[t+1]} = \vartheta^{[t]} - \varrho_{t+1}\, \nabla_\vartheta L\left(\vartheta^{[t]}; \mathcal{L}\right), \tag{5.12}$$

where ϱt+1 > 0 is a (small) learning rate, also called step size.
The learning rate needs to be small because the first order Taylor expansion is only a
valid approximation in the neighborhood of ϑ[t] . On the other hand, the learning rate
should not be too small, otherwise we need to run too many of these standard gradient
descent steps.
An important point is that the initial value ϑ[0] of the gradient descent algorithm should
be selected at random to avoid that the gradient descent algorithm starts in a saddlepoint
¹We assume differentiability in all gradient descent considerations. This is typically the case in the selected network architectures, except for the ReLU activation function, which is not differentiable at the origin.


of the loss surface $\vartheta \mapsto L(\vartheta; \mathcal{L})$. For instance, if one sets $\vartheta^{[0]} = 0$, there is no pre-determined
initial direction for the first gradient descent step because the FNN has
symmetries around this initial value, i.e., we have a saddlepoint of the loss surface and
the algorithm will not start to explore the parameter space. A popular initializer is the
glorot_uniform initializer of Glorot–Bengio [76, formula (16)]. It adjusts the volatility
in the random uniform initialization to the sizes of the network layers.
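For illustration, the following minimal Python sketch runs the standard gradient descent update (5.12) on a toy quadratic loss with a random initialization; in FNN training the gradient would be computed by back-propagation (Section 5.3.3) instead of the closed form used here.

import numpy as np

def grad_loss(theta):
    # gradient of the toy loss L(theta) = ||theta - 1||^2 / 2
    return theta - 1.0

rng = np.random.default_rng(0)
theta = rng.uniform(-1.0, 1.0, size=5)      # random initialization (cf. glorot_uniform)
rho = 0.1                                   # learning rate (step size)

for t in range(200):
    theta = theta - rho * grad_loss(theta)  # standard gradient descent update (5.12)

print(theta)                                # close to the minimizer (1, ..., 1)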

The following points need to be addressed by the modeler for a successful gradient
descent fitting:

• Covariate pre-processing; see Section 5.3.2.

• Gradient calculation by back-propagation for an efficient calculation


of the gradients ∇ϑ L(ϑ; L) in (5.12); see Section 5.3.3.

• Learning rate and higher order Taylor approximations to accelerate


the algorithm; see Section 5.3.4.

• Early stopping rule for the algorithm; see Section 5.3.5.

• Regularization and drop-out; see Section 5.3.6.

• Stochastic gradient descent to deal with big data, i.e., big learning sam-
ples L; see Section 5.3.7.

5.3.2 Covariate pre-processing


An important point in gradient descent methods is covariate pre-processing, also for
continuous covariates; see Section 2.3 for covariate pre-processing. For the gradient
descent algorithm to work well, it is important that all covariate components live on the same
scale; otherwise, some covariate components will dominate the gradient, and the algorithm
may not be able to adapt to the right scales (even for the momentum-based versions
discussed below). Therefore, continuous covariates should be standardized (2.18) or the
MinMaxScaler (2.19) should be applied. Under skewed covariates, often standardization
works better because the MinMaxScaler tends to cluster at the boundary for skewed
covariates. If the skewness is too large, i.e., if a given covariate lives on different scales
(magnitudes), one should first apply a log-transformation. For example, the population
density studied in Listing 3.1 may live on different magnitudes, and one should rather
consider the log-population density instead in FNNs.
For categorical covariates, usually one-hot encoding is used in the first place. Note that
in FNNs a full-rank design matrix is not an essential assumption/property. Since identifiability
and parsimony are not given in FNNs in general, the choice between one-hot
encoding and dummy coding does not matter, and both choices are equally suitable.
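A minimal pre-processing sketch in Python, with hypothetical column names, could look as follows: a log-transform for the heavily skewed population density, standardization of the continuous covariates, and one-hot encoding of a categorical covariate.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "driver_age": [25, 40, 63, 33],
    "population_density": [50, 1200, 26000, 300],
    "vehicle_brand": ["B1", "B2", "B1", "B3"],
})

df["log_density"] = np.log(df["population_density"])      # different magnitudes -> log scale
X_cont = StandardScaler().fit_transform(df[["driver_age", "log_density"]])  # standardization
X_cat = pd.get_dummies(df["vehicle_brand"]).to_numpy(dtype=float)           # one-hot encoding
X = np.hstack([X_cont, X_cat])                             # network input matrix
print(X.shape)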
High-cardinality categorical covariates lead to large input dimensions q0 to the feature
extractor. This is generally problematic in network fitting as it gives a high potential
for over-fitting. As a general recommendation in such situations, we propose to use en-
tity embedding with a low-dimensional embedding dimension b, see (2.15). The entity


embedded variables are concatenated with the continuous ones, which then jointly enter
the feature extractor of the FNN architecture. This adds b · K embedding weights to the
network parameter ϑ, if K is the number of levels of the categorical covariate. These
embedding weights are also learned during gradient descent training, i.e., they also enter
the gradient computations. In many cases, this gives a superior performance over one-hot
encoding and dummy coding, respectively. An example is given in the notebook:

notebook-insert-link

If the categorical covariates are high-cardinality and/or if there is a natural hierarchical
structure in the categorical covariates (vehicle brand - vehicle model - vehicle details),
one should regularize the entity embedding to obtain a successful network training that
prevents over-fitting. We come back to this in later chapters in relation to variational
Bayes' methods; and we also refer to Richman–Wüthrich [192].
prevents from over-fitting. We come back to this in later chapters in relation to variational
Bayes’ methods; and we also refer to Richman–Wüthrich [192].

5.3.3 Gradient calculation via back-propagation


Generally the gradient computation ∇ϑ L(ϑ; L) is high-dimensional and computationally
intensive. The network weights ϑ enter the readout, see (5.9), and they enter the different
layers (z (m) )dm=1 in the feature extractor z (d:1) . Theoretically, the gradient can be worked
out using standard calculus, however, through the iterated application of the chain-rule
the computations become very tedious.
The workhorse to compute these gradients efficiently, and to avoid tedious formulas, is the
back-propagation method of Rumelhart et al. [196]. Mathematically speaking, the back-
propagation method is a clever re-parametrization of the problem which allows one to
efficiently compute these gradients recursively by matrix multiplications, also benefiting
from the fact that the most popular activation functions ϕ have simple derivatives, see
Table 5.1.
At this stage, we do not believe that it is very useful to give more technical details about
the back-propagation method. The main takeaway is that the gradients can be computed
efficiently, and that the (ready-to-use) standard FNN fitting software has implemented
versions of this back-propagation method. The interested reader is referred to Wüthrich–
Merz [243, Section 7.2.3] for more technical details.

5.3.4 Learning rate and higher order Taylor approximations


The first order Taylor expansion (5.11) computes the slopes and, hence, the direction of
the optimal local updates. The optimal (directional) learning rates ϱt+1 > 0 in (5.12)
can be determined by the curvature of the loss surface described by the second order
derivatives (Hessian) of the empirical loss L(ϑ; L). This is what the Newton–Raphson method teaches
us to do. Unfortunately, it is computationally infeasible to compute the Hessian in
(bigger) FNNs; therefore, we cannot determine the optimal learning rates by second
order derivatives. In physics, first order derivatives are related to speed and second order
derivatives to acceleration. Since we cannot compute second order derivatives, inspired
by physics, we try to mimic how speed builds up by accumulating a momentum
from past velocities (gradient steps). This is a way of mimicking second order derivatives. This motivates


to replace the standard gradient descent update (5.12) by a momentum-based gradient


descent update
$$\begin{aligned}
v^{[t]} &\;\mapsto\; v^{[t+1]} = \nu\, v^{[t]} - \varrho_{t+1}\, \nabla_\vartheta L\left(\vartheta^{[t]}; \mathcal{L}\right),\\
\vartheta^{[t]} &\;\mapsto\; \vartheta^{[t+1]} = \vartheta^{[t]} + v^{[t+1]},
\end{aligned}$$

with learning rate ϱt+1 > 0 and momentum parameter ν > 0, and where we initialize
v[0] = 0. The learning rate and the momentum parameter are hyper-parameters that
need to be fine-tuned by the modeler. This and slightly modified versions thereof are
implemented in standard software, and this software often comes with suitable standard
values for these hyper-parameters, i.e., they are ready-to-use. Therefore, we do not
describe these points in more detail. Standard momentum-based algorithms are rmsprop
or adam; see Hinton et al. [98], Kingma–Ba [118] and Goodfellow et al. [83, Sections 8.3
and 8.5].
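The momentum-based update can be sketched with a few lines of Python on the same toy gradient as in the plain gradient descent sketch above; the hyper-parameter values are arbitrary placeholders.

import numpy as np

def grad_loss(theta):
    return theta - 1.0                      # toy gradient as before

rng = np.random.default_rng(0)
theta = rng.uniform(-1.0, 1.0, size=5)
v = np.zeros_like(theta)                    # momentum/velocity, initialized at v^[0] = 0
rho, nu = 0.1, 0.9                          # learning rate and momentum parameter

for t in range(200):
    v = nu * v - rho * grad_loss(theta)     # momentum update
    theta = theta + v                       # weight update

print(theta)                                # close to the minimizer (1, ..., 1)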
Another noteworthy improvement is the Nesterov acceleration [165]. Nesterov noticed
that such algorithms often have a zig-zag behavior, meaning that they overshoot and
then correct by moving back and forth, which does not seem to be very effective. The
improvement suggested by Nesterov is to already anticipate the next gradient descent step
when determining the optimal learning rates and momenta. This way, overshooting
can be reduced. This is implemented, for example, in the nadam version of adam.
The described gradient descent algorithms are usually used in standard network architec-
tures such as FNNs. If one works with more specific architectures, e.g., with transformers,
there are more specialized gradient descent methods. For instance, for transformers, there
is an adamW version of Loshchilov–Hutter [145] which better adapts to problems where
the variables live on different scales.

5.3.5 Early stopping rule


We come back to the universality statements of Section 5.2. Having a reasonably large
FNN architecture, this architecture is very flexible because it is capable of approximating
a fairly large function class. This implies that computing the MLE
$$\hat{\vartheta}^{\rm MLE} \in \underset{\vartheta}{\arg\min}\; L(\vartheta; \mathcal{L}) = \underset{\vartheta}{\arg\min} \sum_{i=1}^n \frac{v_i}{\varphi}\, L\big(Y_i, \mu_\vartheta(X_i)\big),$$

is not a sensible problem that we should try to solve. This MLE fitted FNN will
not only extract the structural part (systematic effects) from the learning sample L =
(Yi , X i , vi )ni=1 , but it will also largely adapt to the noisy part (pure randomness) in this
learning sample. Obviously, such an FNN will generalize badly and it will have a poor
predictive performance on out-of-sample test data $\mathcal{T}$.
Figure 5.3 gives an example of in-sample over-fitting. The black dots are the observed
responses Yi (in the learning sample L), and the true regression function µ∗ is shown in
green color. The red graph shows a fitted regression model that over-fits to the learning
sample L. It follows the black dots quite closely, significantly deviating from the true
green regression function. Out-of-sample (repeating this experiment), the black dots may
likely also lie on the other side of the green line and, thus, the red estimated model will


[Figure 5.3: An example of in-sample over-fitting; the plot shows the observations (in-sample), an under-fitting model, the systematic effects (true regression function) and an over-fitting model; this figure is taken from [243, Figure 7.6].]

generally not perform well in predictions (perform worse than an estimated model that
is close to the green line).
Consequently, within a highly flexible model class, we need to try to find a model that
only extracts the systematic part from a noisy sample. The key to this problem is early
stopping. Some scholars call early stopping a regularization method; however, technically
it is different because it has an essential temporal component related to algorithmic time.

At this fitting step, FNN regression modeling significantly differs from GLM. In
GLMs, there often is little over-fitting potential and one tries to minimize the
empirical loss L(ϑ; L) to find the optimal GLM parameter. In contrast, reasonably
large FNNs have a high over-fitting potential and, therefore, one only tries to
get the empirical loss L(ϑ; L) reasonably small to find a good network parameter.
Practically, this is achieved by exercising an early stopping rule during gradient
descent training.

Let us first explain why early stopping works before discussing its implementation. Com-
ing back to the empirical loss (5.10), we compute its gradient
$$\nabla_\vartheta L(\vartheta; \mathcal{L}) = \sum_{i=1}^n \frac{v_i}{\varphi}\, \nabla_\vartheta L\big(Y_i, \mu_\vartheta(X_i)\big).$$

We observe that this gradient consists of a sum of many individual gradients of each
instance 1 ≤ i ≤ n. In each gradient descent step we try to find the most effective/signif-
icant update. Systematic effects will impact many individual instances (otherwise these
effects would not be systematic). At the beginning of the gradient descent algorithm,
before having found these systematic effects, they will therefore dominate the gradient
descent steps. Once these systematic effects acting on many instances 1 ≤ i ≤ n have
been found, the relative importance of instance-individual factors (noise) starts to in-


crease. This is precisely the time-point to early stop the gradient descent algorithm; we
call this early stopping because the algorithm has not yet reached a local minimum of
the loss function and, as explained above, this is not our intention.
The previous outline also explains why all components of the covariates should live on
a comparable scale. If one covariate component lives on a bigger scale than the other
ones, it dominates the gradients. Thus, the gradient descent algorithm will find the
systematic effects of that dominant covariate component and then it starts to exploit its
noisy part (because this noise is still of a bigger magnitude than the systematic part of
the remaining covariates). At this stage, we early stop because learning the noise does
not generalize to new data, and, hence, we have not found the systematic part
of the other covariate components.

Implementation of early stopping requires a careful treatment of the available learning


sample L. For this, we partition the learning data L at random into a training sample
U and a validation sample V. The training sample U is used for computing the gradient
descent steps, and the validation sample V is used to track over-fitting by an instantaneous
(out-of-sample) validation analysis; this partition is illustrated in Figure 5.4.

[Figure 5.4: Partition of the entire data (lhs) into learning sample L and test sample T (middle), and into training sample U, validation sample V and test sample T (rhs); this figure is taken from [243, Figure 7.7].]

Thus, we perform the gradient descent algorithm only on the training sample $\mathcal{U}$,
$$\nabla_\vartheta L(\vartheta; \mathcal{U}) = \sum_{i \in \mathcal{U}} \frac{v_i}{\varphi}\, \nabla_\vartheta L\big(Y_i, \mu_\vartheta(X_i)\big), \tag{5.13}$$

and we perform an instantaneous out-of-sample validation on V giving us the validation


loss L(ϑ[t] ; V) at algorithmic time t for network weights estimate ϑ[t] . Naturally, the
training loss L(ϑ[t] ; U) should decrease for t → ∞. The validation loss L(ϑ[t] ; V) decreases
as long as we learn systematic effects, and it starts to increase (deteriorate) once we learn
the noise part. This change of behavior gives us the early stopping point $t^\star$ (and, more
generally, the early stopping rule), and we estimate the network weights by $\hat{\vartheta} = \vartheta^{[t^\star]}$.
This is illustrated in Figure 5.5 where we stop at $t^\star = 43$. Of course, the validation
sample V should be sufficiently large to obtain a credible stopping rule (that itself is not


[Figure 5.5: An example of early stopping at time t⋆ = 43 (orange line); the plot shows the (modified) deviance training loss and validation loss over the training epochs, with the minimal validation loss marked.]

dominated by the noise in V). Often one takes 20% or 10% of the learning data L as
validation sample V, depending on the sample size n.
Technically, for gradient descent training, one installs a so-called callback. This just
means that one saves every weight $\vartheta^{[t]}$, $t \ge 0$, that decreases the validation loss $L(\vartheta^{[t]}; \mathcal{V})$,
and after running the algorithm one 'calls back' the weight $\vartheta^{[t^\star]}$ with the minimal validation
loss, which then provides the estimated network weights $\hat{\vartheta} = \vartheta^{[t^\star]}$ from the early
stopped gradient descent algorithm.
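The callback mechanism can be sketched in plain Python as follows: we run gradient descent on a training sample U, track the validation loss on V after every step, and store the weights with the smallest validation loss; the toy linear model below only illustrates the mechanism, it is not itself prone to much over-fitting.

import numpy as np

rng = np.random.default_rng(0)
theta_true = np.ones(5)
X_train, X_val = rng.normal(size=(800, 5)), rng.normal(size=(200, 5))
y_train = X_train @ theta_true + rng.normal(scale=1.0, size=800)
y_val = X_val @ theta_true + rng.normal(scale=1.0, size=200)

def mse(theta, X, y):
    return np.mean((y - X @ theta) ** 2)

theta = rng.uniform(-1.0, 1.0, size=5)
best_theta, best_val, rho = theta.copy(), np.inf, 0.01

for t in range(500):
    grad = -2.0 * X_train.T @ (y_train - X_train @ theta) / len(y_train)
    theta = theta - rho * grad                 # training step on U
    val_loss = mse(theta, X_val, y_val)        # instantaneous validation on V
    if val_loss < best_val:                    # 'callback': store the best weights
        best_val, best_theta = val_loss, theta.copy()

print(best_val, mse(best_theta, X_val, y_val))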

5.3.6 Regularization and drop-out


Of course, there is no difficulty in using a regularized loss version in the gradient descent
algorithm; see Section 2.4 for regularization. However, an attempt to use LASSO regularization
to obtain sparse network weights does not work. The problem is that the sparsity
in LASSO regularization stems from the non-differentiability of the L1-regularization at
the origin, but the gradient descent algorithm (as its name says) cannot deal with this
non-differentiability at the origin (in fact, it uses a sub-gradient at the origin). Thus, gradient
descent methods can only get the regularized network weights small, but then these
small weights need to be set manually to (exactly) zero; of course, this can be achieved
by an integrated deterministic rule; see Richman–Wüthrich [191] for an example.
Another popular method to prevent (in-sample) over-fitting is drop-out by Srivastava
et al. [210] and Wager et al. [229]. Drop-out is an additional network layer between two
FNN layers that removes neurons $z_j^{(m)}$ at random from the network, only during gradient
descent training; at random means by an i.i.d. Bernoulli random variable with a given
drop-out rate (which is a hyper-parameter selected by the modeler) in each gradient
descent step. This random removal during network training prevents individual neurons
from specializing to a certain task, because it is always the composite of non-dropped neurons
that has to fulfill the forecasting task. This leads to better balanced networks, and


naturally to non-sparsity because every neuron needs to be able to comply with different
tasks. We will come back to drop-out in Section 8.5.1 on page 153.

5.3.7 Stochastic gradient descent

The gradient calculation in (5.13) involves large matrix multiplications if the dimension
of the network weights ϑ and the size of the training sample U are large, which is usually
the case. Matrix multiplications of large matrices can be very slow which hinders fast
network fitting. For this reason, one typically considers a stochastic gradient descent
(SGD) algorithm.

For the SGD method one chooses a fixed batch size s ∈ N, and one randomly partitions
the training sample U = (Yi , X i , vi )ni=1 into (mini-)batches U1 , . . . , U⌊n/s⌋ of roughly the
same size s.

One then considers the SGD updates
$$\vartheta^{[t]} \;\mapsto\; \vartheta^{[t+1]} = \vartheta^{[t]} - \varrho_{t+1}\, \nabla_\vartheta L\left(\vartheta^{[t]}; \mathcal{U}_k\right), \tag{5.14}$$
where one cyclically visits the batches $(\mathcal{U}_k)_{k=1}^{\lfloor n/s \rfloor}$ for the gradient descent step t → t + 1.

The batch size $s \in \mathbb{N}$ of the batches $(\mathcal{U}_k)_{k=1}^{\lfloor n/s \rfloor}$ should not be too small. Assuming that the
observations (Yi , X i , vi )si=1 are i.i.d., the law of large numbers will provide us with the
(locally) optimal gradient descent step if we let the batch size s → ∞. This suggests that
we should choose very large batch sizes s. As explained above, computational reasons
force us to choose small(er) batch sizes, which may provide certain ‘erratic’ gradient
descent updates in view of the optimal next step. However, this is not the full story, and
some erratic steps can even be beneficial for finding better network weights, as long as
these erratic steps are not too numerous (and not too large). An infinite sample only
gives the next optimal step, which is a one-step ahead consideration. This may guide
us into a bottleneck, saddlepoint or local minimum that is far from optimal, because
the next optimal step is only a locally optimal consideration. Having some erratic steps
from time to time may help us to escape from trapped situations like a bottleneck by
slightly shifting in the parameter space, so that we have the opportunity to explore
different environments (generally, such erratic steps are not too big for small step sizes,
and usually the loss surface is smooth and not too steep, so that an erratic step does not
dramatically change the situation). In this sense, finding a good trade-off between next
best steps and erratic steps leads to the best predictive FNNs.


We summarize SGD training:

• We use SGD training for FNN fitting. For insurance pricing problems, typi-
cally, reasonable mini-batch sizes s are in the range of 1000 to 5000.

• To speed up and improve the SGD fitting procedures, we use momentum-


based versions of SGD, e.g., the adam version; often its standard parametriza-
tion works well. This can be complemented by the Nesterov acceleration, as
implemented in the nadam version of SGD.

• For early stopping, we implement a call back which tracks the validation loss
on the validation sample V during SGD training. The validation sample is
typically 10% to 20% of the entire learning sample L.

• Finally, SGD training can be improved by regularization and drop-out; this
is part of hyper-parameter tuning and it needs to be checked case by case.
Typically, more regularization and drop-out is needed with increasing sizes
of the FNN architectures.
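As a minimal sketch of this recipe in Keras (assuming TensorFlow is installed), the following code fits a depth-3 FNN with a log-link readout, the nadam optimizer, a mini-batch size in the recommended range, a 20% validation split, a drop-out layer and an early stopping callback; the simulated data and all hyper-parameter values are placeholders only and would need tuning on a real portfolio.

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 16)).astype("float32")        # pre-processed covariates
y = rng.poisson(lam=np.exp(-2.0 + 0.3 * X[:, 0])).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dropout(0.05),                          # drop-out between FNN layers
    tf.keras.layers.Dense(15, activation="tanh"),
    tf.keras.layers.Dense(10, activation="tanh"),
    tf.keras.layers.Dense(1, activation="exponential"),     # readout with log-link
])
model.compile(optimizer="nadam", loss="poisson")            # Nesterov-accelerated adam

callback = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=20, restore_best_weights=True)

model.fit(X, y, batch_size=2000, epochs=100,
          validation_split=0.2, callbacks=[callback], verbose=0)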

5.3.8 Summarizing feed-forward neural networks and their training


We are now in the exciting position of being able to apply FNN regression models! But,
the first attempts on real data will likely be a disappointment because choosing a good
network architecture and a good training algorithm needs some practical experience.

There is the recurrent question of how to select a good network architecture. A general
principle is that the selected network architecture should not be too small, so that it is sufficiently
flexible to approximate all potentially suitable regression functions. Generally, it is bad
guidance to aim for a minimal and parsimonious FNN. Usually, there are many
different, roughly equally good approximations to the real data generating mechanism,
and the SGD algorithm can only find (some of) those if it has sufficiently many degrees
of freedom to explore the environment (this contradicts parsimony); otherwise, the fitting
will likely not result in the best possible predictive model. Of course, this is a bit against
actuarial thinking. Optimizing neural network architectures (e.g., the hyper-parameters
like the depth d and the number of neurons $q_m$) is not a target that one should try
to achieve; rather, one has to accept the fact that the selected architectures
should exceed a minimal complexity bound above parsimony for SGD training to be
successful. Typically, this results in a lot of redundancy, which cannot be reduced by
existing techniques. Specifically, the FNN architecture should have a certain depth d
that is not too small (depth promotes interactions), and each hidden layer should also
not be chosen too small.

We attempt to give a clearer instruction about the choice of the FNN architecture:
In our examples, which are comparatively small insurance pricing examples of roughly
500,000 insurance policies equipped with roughly 10 covariate components, it has turned
out that FNN architectures of depth d ∈ {3, . . . , 6} with approximately 15 to 30 neurons
in each hidden layer work well. For the French MTPL claims frequency example used in


these notes, see Section 3.3.2, we designed a standard FNN architecture of depth d = 3
with (q1 , q2 , q3 ) = (20, 15, 10) neurons in the three FNN layers. This has proved to work
well in this example; see our notebook

notebook-insert-link

Another critical point is that network fitting involves several elements of randomness.
Even if we fix the architecture and the fitting procedure, we typically end up with multiple
equally good predictive models if we run the same fitting algorithm and strategy multiple
times (with different seeds).

The elements of randomness involve:

(1) the initialization of the network weights ϑ[0] for SGD;

(2) the random partition into learning sample L and test sample T ;

(3) the random partition into training sample U and validation sample V;
⌊n/s⌋
(4) the partition into the batches (Uk )k=1 ; and

(5) there are further random items like drop-outs, if used during SGD training,
etc.

All this makes the early stopped SGD solution (highly) non-unique.

This non-uniqueness is a fact that one has to live with in machine learning models. In
Section 5.4 we present an ensemble predictor that reduces this randomness by averaging.

Remark 5.3 (balance correction). The early stopped SGD fitted FNN will not fulfill
the balance property (4.4), even if we use the canonical link of the selected deviance
loss function for g in the readout (5.4). A reason for this failure is early stopping, which
stops the algorithm before having found a critical point of the loss surface. The balance
property can be rectified by adjusting the estimate of the bias term $\hat{w}_0^{(d+1)}$ of the readout
correspondingly. If we work with the canonical link for g, one can alternatively exercise
another GLM step, using the feature-extracted covariates $(\hat{z}^{(d:1)}(X_i))_{i=1}^n$ as new covariates
for this GLM step; $\hat{z}^{(d:1)}$ denotes the SGD fitted feature extractor. Thus, we fit a
GLM on the new learning sample $(Y_i, \hat{z}^{(d:1)}(X_i), v_i)_{i=1}^n$ under the canonical link choice,
see (5.1). This is the proposal of Wüthrich [238]. ■

5.4 Nagging
In Chapter 6 we will meet a method called bagging which was introduced by Breiman [29].
Bagging combines bootstrap and aggregating. Bootstrap is the re-sampling technique
discussed in Section 1.5.4, and this is combined with aggregation which has an averaging
effect, reducing the randomness.
In this section we do not bootstrap, but, because network fitting has several items of
randomness as discussed in the previous section, we replace the bootstrap samples by


different SGD solutions. An ensembling of network predictors was first considered
by Dietterich [56, 57], and subsequently it has been studied in Richman–Wüthrich [189],
where the name nagging for network aggregating was introduced.

5.4.1 Aggregating
Aggregating can most easily be explained by considering an i.i.d. sequence of square-integrable
predictors $(\hat{\mu}_j)_{j \ge 1}$, which are assumed to be unbiased for the true predictor
$\mu^*$, that is, $E[\hat{\mu}_j] = \mu^*$, for $j \ge 1$. For a fixed predictor $\hat{\mu}_j$, we have an approximation
error, called estimation error (or, more broadly, model error),
$$\hat{\mu}_j - \mu^*,$$
which on average is zero due to the unbiasedness assumption. For unknown true predictor
$\mu^*$, one estimates this estimation error by the variance (or the standard deviation,
respectively) of the predictor, i.e., the average approximation error, called estimation uncertainty,
is given by $\mathrm{V}(\hat{\mu}_j)$ or $\sqrt{\mathrm{V}(\hat{\mu}_j)}$, respectively, which can be determined empirically
from the predictors $(\hat{\mu}_j)_{j \ge 1}$.

On the other hand, having multiple independent unbiased predictors $(\hat{\mu}_j)_{j=1}^M$, one can
build the ensemble predictor
$$\hat{\mu}^{(M)} = \frac{1}{M}\sum_{j=1}^M \hat{\mu}_j. \tag{5.15}$$
This ensemble predictor has an estimation error $\hat{\mu}^{(M)} - \mu^*$, and the estimation uncertainty
is given by
$$\mathrm{V}\left(\hat{\mu}^{(M)}\right) = \frac{1}{M}\,\mathrm{V}(\hat{\mu}_1) \;\to\; 0 \qquad \text{for } M \to \infty. \tag{5.16}$$
Of course, all this is well-known in statistics, but the important takeaway is that ensem-
bling over unbiased i.i.d. predictors substantially reduces estimation errors and uncer-
tainty (through the law of large numbers).
There are two caveats:

(1) Where do we get the i.i.d. predictors from?

(2) How can we ensure that they are unbiased?

5.4.2 Network ensembling


We come back to the previous two questions. Since we do not know the true model
of (Y, X, v), neither of the two questions (1) and (2) can be answered, i.e., we cannot
resample and we do not know the right mean level.
We can only work with conditional quantities, given the learning sample $\mathcal{L} =
(Y_i, X_i, v_i)_{i=1}^n$, and then we can still discuss whether the analysis we obtain is of any
interest. We come back to the elements of randomness that are involved in SGD fitting,
see discussion on page 93 in Section 5.3.8. If we select all these random items in an
i.i.d. manner (with an i.i.d. random seed), we receive conditional i.i.d. FNN models


$\mu_{\hat{\vartheta}_j}(X)$, where $\hat{\vartheta}_j$ denotes the SGD fitted network weights from the j-th conditionally
independent SGD run, always using the identical fitting strategy; only the initialization
and partitioning are done with a different random seed, and the 'conditional' stems from
the fact that it is conditional on the learning sample $\mathcal{L}$. Iterating this SGD fitting M
times gives us M conditionally independent FNNs $(\mu_{\hat{\vartheta}_j})_{j=1}^M$, given $\mathcal{L}$.

A first question is: how robust are the predictions $\mu_{\hat{\vartheta}_1}(X), \ldots, \mu_{\hat{\vartheta}_M}(X)$ for a given
covariate value X? This question has been analyzed in Richman–Wüthrich [189] and
Wüthrich–Merz [243, Figure 7.18] on a motor insurance claims frequency data set of sample
size roughly n = 500,000. The average fluctuations of the different fits $(\mu_{\hat{\vartheta}_j}(X))_{j=1}^M$
were of magnitude 10%; this concerns the main body of the covariate distribution $X \sim \mathbb{P}$.
In this part of the covariate space we got reliable and quite robust models. However,
there are less frequent covariate combinations where these fluctuations were up to 40%,
i.e., the different initializations of SGD gave fluctuations in the best-estimates of up to
40%. Thus, there is clearly a credibility issue in the estimated FNNs in this (scarce) part
of the covariate space. Aggregating helps to reduce these fluctuations.
This motivated the nagging predictor
M
1 X
bnagg
µ M (X) = µ (X). (5.17)
M j=1 ϑbj

This ensembling reduces the average fluctuations by a factor $\sqrt{M}$ (on the standard deviation scale). That is, this determines the rate of convergence, and we obtain the law of large numbers, a.s.,
\[
\lim_{M \to \infty} \hat{\mu}^{\mathrm{nagg}}_M(X) = \mathbb{E}\Big[ \mu_{\hat{\vartheta}_1}(X) \,\Big|\, \mathcal{L}, X \Big], \tag{5.18}
\]
where $\mathbb{E}[\,\cdot\,|\mathcal{L}, X]$ is the conditional expectation operator describing the selected SGD fitting procedure, and for a fixed covariate value $X$, this is also the measure the 'a.s.' is referring to. This law of large numbers limit states precisely what kind of (conditional) unbiasedness we can achieve with the nagging predictor. A difference of the limit (5.18) compared to the true mean $\mu^*(X)$ can originate from the following items: the particular learning sample $\mathcal{L}$ that is at our disposal, the FNN architecture that we choose, the specific version of the SGD algorithm with early stopping that we apply, but also the particular distribution we use for initializing the SGD algorithm.

A first conclusion is that the nagging predictor robustifies the best-estimate prediction. A
second conclusion is that the nagging predictor significantly improves prediction accuracy,
by reducing the estimation error. This is verified in many examples in the literature, see,
e.g., Wüthrich–Merz [243, Table 7.9]. Figure 5.6 shows two different examples, the left-hand side gives a claims frequency example and the right-hand side a claims severity example. Both plots show the decrease of the out-of-sample loss as a function of the number $M$ of ensembled network predictors $(\mu_{\hat{\vartheta}_j})_{j=1}^M$. From these examples we conclude that a good value for $M$ is either 10 or 20, because afterwards the out-of-sample loss of the nagging predictor $\hat{\mu}^{\mathrm{nagg}}_M$ stabilizes.


[Figure: two panels titled 'nagging predictors for M>=1', plotting losses against the index M together with a one standard deviation band.]

Figure 5.6: Decrease of the out-of-sample loss of the nagging predictors $\hat{\mu}^{\mathrm{nagg}}_M$ as a function of $M \ge 1$: (lhs) Poisson frequency example, (rhs) gamma severity example; these figures are taken from [243, Figures 7.19 and 7.24].

Recommendation. In network predictions, always consider the nagging predictor $\hat{\mu}^{\mathrm{nagg}}_M$ with $M = 10$ or $M = 20$; this will significantly improve the predictive model.

However, these improvements are always conditional on the learning sample $\mathcal{L}$, and they rely on the selected model class and the model fitting procedure not being flawed, i.e., the true model should be close to the selected model class in the sense of the universality statements, and suitable models should be found by the selected SGD procedure. Moreover, computing the nagging predictor can be demanding because one needs to fit the network architecture $M$ times. There is a recent proposal that performs multiple predictions within the same model and learning procedure; see Gorishniy et al. [85].
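As a minimal illustration of (5.17), the following base R sketch assumes that a function fit_fnn(data, seed) is available (hypothetical; e.g., a wrapper around one SGD run with early stopping, driven by a random seed) and that it returns a fitted prediction function. The nagging predictor is then simply the average over M such fits.

    ## minimal sketch of the nagging predictor (5.17); fit_fnn() is a hypothetical
    ## wrapper around one SGD run (initialization and partitioning driven by 'seed')
    nagging_predictor <- function(data, newdata, M = 10, fit_fnn) {
      preds <- sapply(seq_len(M), function(j) {
        model_j <- fit_fnn(data, seed = j)   # j-th conditionally independent SGD fit
        model_j(newdata)                     # predictions mu_{theta_j}(X) on new covariates
      })
      rowMeans(preds)                        # average over the M network predictors
    }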

5.5 Summary on feed-forward neural networks

The previous sections introduced FNNs and their SGD training. This builds the core
of deep learning on tabular data. The following sections of this chapter present FNN
architectures that are particularly useful for solving actuarial problems, and Chapter 8,
below, presents deep learning architectures for tensor data and unstructured data.

We summarize the standard procedure of deep learning as follows.


Summary of FNN regression modeling

• Perform covariate pre-processing; see Section 5.3.2.

• Select a deviance loss function that reflects the properties of the responses.

• Select a suitable FNN architecture, a momentum-based SGD algorithm, and a suitable batch size; see Sections 5.3.7 and 5.3.8.

• Run SGD fitting with early stopping using a callback; see Section 5.3.5. This can be complemented with regularization and drop-out; see Section 5.3.6.

• Apply a balance correction to comply with the balance property (4.4); see Remark 5.3.

• Repeat this M times to compute the nagging predictor (5.17).

The next two sections present two FNN architectures that are attractive for solving actuarial problems. The first one combines a GLM with FNN features, and the second one locally looks like a GLM.
We close this short section by highlighting the reference Richman–Wüthrich [193] on the ICEnet architecture. Generally, it is non-trivial to enforce smoothness and monotonicity in FNN architectures. The ICEnet is a regularized FNN method that achieves these properties. It requires the evaluation of first differences, and to do so, the ICEnet uses a multi-output network that simultaneously obtains predictions for adjacent inputs; see Richman–Wüthrich [193].

5.6 Combining a GLM and a neural network


This section discusses boosting of classical actuarial regression models such as a GLM
with FNN features. The main idea is to apply a two step fitting procedure. In a first step,
we fit a classical statistical model, and in a second boosting step, we let a machine learning
model analyze the residuals of the first model. If the machine learning model does not
find any structure in these residuals, the first model seems to perform well, otherwise, if
it finds structure, it boosts the first model by adding this structure. This proposal has
been discussed in an editorial of the ASTIN Bulletin called ‘Yes, we CANN!’ to promote
machine learning research within the actuarial community; see Wüthrich–Merz [242].
Similar ideas have been developed under the name ResNet (residual network) in He et al. [96]. Nowadays, these are standard modules in advanced network architectures such as the transformers of Vaswani et al. [228].
We start by first fitting a GLM to our predictive problem. This provides us with an MLE fitted GLM
\[
\hat{\mu}^{\mathrm{GLM}}(X) := \mu^{\mathrm{GLM}}_{\hat{\vartheta}^{\mathrm{MLE}}}(X) = g^{-1}\big\langle \hat{\vartheta}^{\mathrm{MLE}}, X \big\rangle,
\]
with MLE $\hat{\vartheta}^{\mathrm{MLE}} \in \mathbb{R}^{q+1}$ and link function $g$. This first fitting has been performed on a learning sample $\mathcal{L} = (Y_i, X_i, v_i)_{i=1}^n$, and it provides us with the estimated residuals,


$1 \le i \le n$,
\[
\hat{\varepsilon}_i = Y_i - \hat{\mu}^{\mathrm{GLM}}(X_i). \tag{5.19}
\]
If this fitted GLM is suitable for the collected data, these residuals should not show any systematic structure. That is, these residuals should be (roughly) independent and centered (on average), and there should not be any systematic effects in these residuals as a function of the covariates $X$. This motivates us to study a new, second regression step, namely, we regress these residuals on the covariates using the new learning sample $\mathcal{L}_\varepsilon = (\hat{\varepsilon}_i, X_i, v_i)_{i=1}^n$. This is precisely the basic idea behind boosting, namely, one stage-wise adaptively tries to improve the model by specifically focusing on finding the weaknesses of the previous model(s). This is quite different from ensembling, which (only) averages over the models, but does not let the individual models compete to improve them; see Section 5.4 for ensembling.
Remarks 5.4. The residuals defined in (5.19) need some care. First, they are not independent, even if the instances $1 \le i \le n$ are independent. Note that the learning sample $\mathcal{L}$ enters the estimated regression function $\hat{\mu}^{\mathrm{GLM}} = \mu^{\mathrm{GLM}}_{\hat{\vartheta}^{\mathrm{MLE}}}$. This typically implies a negative correlation between the estimated residuals. This is also the reason why empirical variance estimators are normalized by $1/(n-1)$ and not by $1/n$ to receive unbiased variance estimators. Second, the residuals in (5.19) may have different variances, even under the true regression function $\mu^*$, because they are not standardized. If we know that the learning data has been generated by a member of the EDF with cumulant function $\kappa$, the standardization is straightforward from (2.4), i.e., using the variance function $V(\cdot) = (\kappa'' \circ h)(\cdot)$ with canonical link $h$. That is, based on this variance function, we obtain Pearson's residuals
\[
\hat{\varepsilon}^{\,P}_i = \frac{Y_i - \hat{\mu}^{\mathrm{GLM}}(X_i)}{\sqrt{V\!\big(\hat{\mu}^{\mathrm{GLM}}(X_i)\big)/v_i}}. \tag{5.20}
\]
These Pearson's residuals should roughly look like an independent sequence of centered variables having the same dispersion; in particular, they should not show any structure in the covariates nor as a function of the estimated regression function. Otherwise, either the estimated regression function is not correctly specified, or the selected cumulant function $\kappa$ does not provide the correct variance behavior; see also Delong–Wüthrich [51] for variance back-testing using isotonic regression. The residuals (5.20) can also be used to receive Pearson's dispersion estimate defined by
\[
\hat{\phi}^{\,P} = \frac{1}{n - (q+1)} \sum_{i=1}^n \big(\hat{\varepsilon}^{\,P}_i\big)^2, \tag{5.21}
\]
if the estimated GLM function $\hat{\mu}^{\mathrm{GLM}}$ involves $q+1$ estimated regression parameters. Pearson's dispersion estimate is consistent, i.e., it converges to the true value, a.s., for increasing sample size. ■
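As an illustration of (5.20)-(5.21), the following base R sketch computes Pearson's residuals and the dispersion estimate for a Poisson GLM, where the variance function is $V(\mu) = \mu$; the data frame dat with columns claims (counts), expo (weights $v_i$) and covariates x1, x2 is an assumption for illustration.

    ## sketch: Pearson residuals (5.20) and dispersion estimate (5.21) for a Poisson GLM
    ## 'dat' with columns claims, expo, x1, x2 is assumed
    fit <- glm(claims ~ x1 + x2, family = poisson(), data = dat, offset = log(expo))
    mu  <- fitted(fit) / dat$expo                 # estimated frequencies mu_hat(X_i)
    y   <- dat$claims / dat$expo                  # observed frequencies Y_i
    eps <- (y - mu) / sqrt(mu / dat$expo)         # Pearson residuals, V(mu) = mu for Poisson
    phi <- sum(eps^2) / (nrow(dat) - length(coef(fit)))   # Pearson dispersion estimate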

The CANN proposal of Wüthrich–Merz [242] does not explicitly compute the residuals, but it directly adds a regression function to the first estimated model, i.e., it acts on the (linear) predictor scale. Based on a given regression function $\hat{\mu}^{\mathrm{GLM}}$ with link $g$, it makes the Ansatz
\[
\mu^{\mathrm{CANN}}(X) = g^{-1}\Big( g\big(\hat{\mu}^{\mathrm{GLM}}(X)\big) + \big\langle w^{(d+1)}, z^{(d:1)}(X) \big\rangle \Big),
\]


and since the first model is a GLM with link $g$, we can equivalently write
\[
\mu^{\mathrm{CANN}}(X) = g^{-1}\Big( \big\langle \hat{\vartheta}^{\mathrm{MLE}}, X \big\rangle + \big\langle w^{(d+1)}, z^{(d:1)}(X) \big\rangle \Big), \tag{5.22}
\]

where we assume that $\hat{\vartheta}^{\mathrm{MLE}}$ has been fitted in the first GLM step and is kept fixed (frozen) in the second estimation step, and where the second part of (5.22) describes an FNN architecture. In this second CANN step, only this second (FNN) part is fitted to detect more systematic structure that has not been found by the initial GLM. If the fitted FNN part in this second CANN step is equal to zero, then we are back in the GLM. This expresses that the FNN could not find any additional systematic structure that is not already integrated into the GLM.
Remark that keeping the first part $\langle \hat{\vartheta}^{\mathrm{MLE}}, X \rangle$ frozen during the second boosting step (fitting the FNN) can also be interpreted as having an offset playing the role of known prior differences, and one wants to see whether one finds additional differences beyond this offset.
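On the linear predictor scale, the frozen GLM part thus acts exactly like an offset. A minimal sketch of this two-step idea under the log-link is given below; fit_fnn_with_offset() is a hypothetical wrapper that trains the FNN part of (5.22) with the frozen GLM linear predictor entering as an offset, and the data frame dat is assumed as in the previous sketch.

    ## sketch of the two-step CANN idea on the linear predictor scale (log-link)
    glm_fit    <- glm(claims ~ x1 + x2, family = poisson(), data = dat, offset = log(expo))
    glm_offset <- predict(glm_fit, type = "link")   # frozen GLM part, incl. exposure offset
    cann_fit   <- fit_fnn_with_offset(data = dat, offset = glm_offset)   # hypothetical FNN step
    ## if the fitted FNN part is (close to) zero, the CANN reduces to the initial GLM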
For fitting the FNN, we apply an SGD algorithm on training and validation data as explained above. This two-step estimation concept has been applied successfully in many studies, see, e.g., Brauer [26] and Havrylenko–Heger [95], the latter reference using this CANN approach to detect interactions.
In view of (5.22) the similarity to ResNet is apparent. The second term in (5.22) describes
a classic FNN architecture, the first term can be interpreted as a residual connection or a
skip connection because it connects the input X directly to the output; for an illustration
see Wüthrich–Merz [243, Figure 7.14]. We can also interpret (5.22) as having a linear
(GLM) term and we build a non-linear FNN architecture around this linear term to
capture interactions and non-linearities not present in the GLM-term.

5.7 LocalGLMnet
Essentially, there are two different ways of explaining results of algorithmic models. Either one uses post-hoc methods, which we discuss in Chapter [insert], below, or one tries to integrate explainable features into the algorithmic models. The LocalGLMnet is an FNN architecture that attempts to provide such integrated explainable features; we refer to Richman–Wüthrich [190, 191]. The LocalGLMnet can be seen as
a varying coefficient model; for varying coefficient models we refer to Hastie–Tibshirani
[92], and in a tree-based context to Zhou–Hooker [249] and Zakrisson–Lindholm [247],
the latter discussing parameter identifiability issues, which we are also going to face in
this section.
The starting point is a GLM whose sensitivities can easily be understood, especially,
under the log-link choice, see Example 3.1. Recall the GLM regression function
\[
\mu_\vartheta(X) = g^{-1}\langle \vartheta, X \rangle = g^{-1}\Big( \vartheta_0 + \sum_{j=1}^q \vartheta_j X_j \Big), \tag{5.23}
\]

with regression parameter ϑ ∈ Rq+1 . This regression parameter takes a fixed value that
is estimated with MLE. The LocalGLMnet architecture replaces this fixed parameter ϑ


by a multi-output FNN architecture (function) $z^{(d:1)} = (z^{(d:1)}_1, \ldots, z^{(d:1)}_q)^\top$
\[
\mu_\vartheta(X) = g^{-1}\Big( \vartheta_0 + \sum_{j=1}^q z^{(d:1)}_j(X)\, X_j \Big), \tag{5.24}
\]
with a real-valued bias parameter $\vartheta_0 \in \mathbb{R}$ and with a multi-output FNN architecture
\[
z^{(d:1)}: \mathbb{R}^q \to \mathbb{R}^q.
\]
Thus, the GLM parameters $(\vartheta_j)_{j=1}^q$ are replaced by network outputs $(z^{(d:1)}_j(X))_{j=1}^q$ of the same dimension $q$. Locally, these network outputs look like constants and, therefore, the LocalGLMnet behaves locally as a GLM. This is precisely the main motivation for studying the architecture in (5.24). While we estimate the parameter $\vartheta \in \mathbb{R}^{q+1}$ with MLE in GLMs, we now learn these attention weights $(z^{(d:1)}_j(X))_{j=1}^q$ with SGD fitting within an FNN framework.
The learned attention weights $(z^{(d:1)}_j(X))_{j=1}^q$ allow for nice interpretations:

(1) We focus on the individual terms under the $j$-summation in (5.24). If for the $j$-th component $X_j$ the learned network output is constant, $z^{(d:1)}_j(X) \equiv \vartheta_j \ne 0$, we effectively have a GLM component in this term. Therefore, we aim at understanding whether the multi-output network provides sensitivities in the inputs or not. If there are no sensitivities, we should go for a GLM.

(2) The property $z^{(d:1)}_j(X) \equiv 0$ for some $j$ proposes to completely drop this term from the regression function. This is a way of selecting or dropping terms from the LocalGLMnet regression function. In fact, Richman–Wüthrich [190] propose an empirical statistical test to check for this kind of model sparsity.

(3) If we obtain an attention weight $z^{(d:1)}_j(X) = z^{(d:1)}_j(X_j)$ that only depends on the covariate component $X_j$, we know that this term does not interact with any other covariate components. More generally, we can test for interactions by considering, for a fixed component $j$, the gradient w.r.t. $X$
\[
\nabla z^{(d:1)}_j(X) = \Big( \frac{\partial}{\partial X_1} z^{(d:1)}_j(X), \ldots, \frac{\partial}{\partial X_q} z^{(d:1)}_j(X) \Big)^\top \in \mathbb{R}^q. \tag{5.25}
\]
This allows us to understand the local interactions of the $j$-th term in the neighborhood of $X$.
This all looks very nice and convincing, however, there is a caveat that needs careful
consideration. Namely, the LocalGLMnet regression function lacks identifiability. We
briefly discuss this in the following items.

(4) Due to the flexibility of large FNNs, we may find a term that gives us the function
\[
z^{(d:1)}_j(X)\, X_j = X_{j'}, \tag{5.26}
\]
by learning an attention weight $z^{(d:1)}_j(X) = X_{j'}/X_j$, for $j' \ne j$. This is the reason why we speak about dropping a 'term' and not a 'covariate component' in the previous items (1)-(3), because even if we drop the term for $X_j$, this covariate component may still play an important role in the attention weights of other terms.


(5) Related to item (4), for SGD training of the LocalGLMnet we need to initialize the gradient descent algorithm. We recommend initializing the network weights such that we precisely start in the MLE fitted GLM (5.23). In our examples, this has pre-determined the role of all $j$-terms such that we did not encounter any situation where an issue similar to (5.26) occurred.

Assume that all covariate components are standardized to be centered with unit variance, i.e., we have standardized columns in the design matrix. This makes the attention weights $z^{(d:1)}_j(X)$ directly comparable across the different components $1 \le j \le q$. It motivates a measure of variable importance by defining the sample averages
\[
I_j = \frac{1}{n} \sum_{i=1}^n \big| z^{(d:1)}_j(X_i) \big|. \tag{5.27}
\]
If this value is very small, the empirical test of Richman–Wüthrich [190] gives support to the null-hypothesis of dropping this term and, obviously, if $I_j$ is large, the term has a big impact on the regression function (because the centering of the covariates implies that the regression function is calibrated to zero); this is similar to the LRTs and the p-values in GLMs, see Listing 3.1.
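Given a matrix of fitted attention weights, the variable importance measure (5.27) is a simple column-wise average of absolute values. A minimal base R sketch, assuming a matrix attention of dimension $n \times q$ holding the fitted $z^{(d:1)}_j(X_i)$:

    ## sketch of the variable importance measure (5.27);
    ## 'attention' is an assumed n x q matrix with entries z_j^(d:1)(X_i)
    variable_importance <- colMeans(abs(attention))
    sort(variable_importance, decreasing = TRUE)   # rank the terms by importance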
Since the LocalGLMnet locally behaves like a GLM, if the attention weights do not take extreme values, there is also some similarity to the post-hoc interpretability tool called local interpretable model-agnostic explanations (LIME) by Ribeiro et al. [185]. LIME fits a LASSO regularized GLM locally to individual covariate values $X$ to describe the most important variables that explain the regression value $\hat{\mu}(X)$ of a fitted regression model. Using the attention weights $(z^{(d:1)}_j(X))_{j=1}^q$ we have similar, but more precise, information about this local behavior in $X$; we come back to LIME in Section [insert], below.

5.8 Outlook: Kolmogorov–Arnold networks


At the core of the great interest and success of FNNs (multi-layer perceptrons, MLPs)
are the universality theorems that build the mathematical foundation of describing the
approximation capacity of these function classes, see Section 5.2. More generally, this
approximation question can be related to David Hilbert’s 13th problem which looks for al-
gebraic function solutions (of two arguments) to a certain class of problems. Kolmogorov–
Arnold [125, 5] solved a version of Hilbert’s 13th problem, also known as the Kolmogorov–
Arnold representation theorem. This approach structurally resembles FNNs, with one
essential difference, and this difference is precisely the basis of the so-called Kolmogorov–
Arnold networks (KANs) introduced by Liu et al. [141]. In FNNs we have fixed activation
functions ϕ in the neurons (units), and the weights are the parts that are learnable from
the data. If we illustrate this graphically, see Figure 5.1, we can think of nodes (units)
(m)
being connected by edges. The nodes correspond to the neurons zj , and they have
a fixed functional structure given by the selected activation function ϕ. The edges de-
(m)
scribed by the network weights wj,l then connect these nodes, and these edges are the
part of the network architecture that are trainable from the learning sample L.


For KANs we exchange these roles by putting learnable activation functions on the edges in the form of learnable splines, and we set all weights (in the nodes) equal to one. Let $(B_s)_s$ be a family of B-splines.² For a KAN we build the splines
\[
x \in \mathbb{R} \;\mapsto\; S(x) = \sum_s w_s B_s(x),
\]
with learnable weights $(w_s)_s \subset \mathbb{R}$. That is, every spline $S$ incorporates weights $(w_s)_s$ that can be trained with SGD methods. In the KAN proposal of Liu et al. [141], these splines are used as residual connections around the SiLU function, defining the KAN activation functions $\phi: \mathbb{R} \to \mathbb{R}$ by
\[
x \;\mapsto\; \phi(x) = w\big( \mathrm{SiLU}(x) + S(x) \big) = w\Big( \mathrm{SiLU}(x) + \sum_s w_s B_s(x) \Big), \tag{5.28}
\]
with another weight $w \in \mathbb{R}$ and the SiLU function given in Table 5.1. These are highly flexible activation functions that can be trained on learning data.
flexible activation functions that can be trained on learning data.
We are now ready to define a KAN layer, which is the analogue to the FNN layer
introduced in (5.5)-(5.6).

KAN layer. Based on the selected class of KAN activation functions (5.28), we define the KAN layer $z^{(m)}: \mathbb{R}^{q_{m-1}} \to \mathbb{R}^{q_m}$ as follows. For $x = (x_1, \ldots, x_{q_{m-1}})^\top \in \mathbb{R}^{q_{m-1}}$, we set
\[
z^{(m)}(x) = \Big( z^{(m)}_1(x), \ldots, z^{(m)}_{q_m}(x) \Big)^\top,
\]
with units, for $1 \le j \le q_m$,
\[
z^{(m)}_j(x) = \sum_{l=0}^{q_{m-1}} \phi^{(m)}_{j,l}(x_l), \tag{5.29}
\]
for KAN activation functions $\phi^{(m)}_{j,l}$, $1 \le j \le q_m$ and $0 \le l \le q_{m-1}$, of structure (5.28).
Now, we are on familiar grounds because the only difference to FNNs is that we replace the FNN neurons (5.6) by the KAN units (5.29). The KAN layers can then be composed as in (5.3), and from this we construct the KAN output (5.4).
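A single KAN activation function (5.28) can be sketched in R using the B-spline basis from the splines package (shipped with the standard R distribution); in a KAN, the weights $w$ and $(w_s)_s$ would be learned by SGD, here they are just assumed random numbers for illustration.

    ## sketch of one KAN activation function (5.28); w and ws are trainable in a KAN
    library(splines)
    silu <- function(x) x / (1 + exp(-x))                        # SiLU activation
    kan_activation <- function(x, w, ws, degree = 3, df = 8) {
      B <- bs(x, df = df, degree = degree)                       # B-spline basis at x
      w * (silu(x) + as.vector(B %*% ws))                        # w * (SiLU(x) + sum_s ws_s B_s(x))
    }
    set.seed(1)
    x   <- seq(-3, 3, length.out = 200)
    phi <- kan_activation(x, w = 0.8, ws = rnorm(8, sd = 0.3))   # random weights for illustration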
Liu et al. [141] give various examples of the superior behavior of KANs over FNNs; this comes at the price of higher computational efforts and more hyper-parameter tuning. E.g., it requires the selection of the number of knots in the selected B-splines. For this, we need to set a grid across the real line; a finer grid gives higher local accuracy but also higher computational efforts and higher model storage costs. Most of the analysis
presented in Liu et al. [141] focuses on function approximation in deterministic settings,
which is a rather different kind of problem compared to training a predictive model
on noisy data. When it comes to statistical modeling problems, it is unclear whether
these higher computational and hyper-parameter tuning efforts are justified, in-sample
over-fitting being the most serious concern (that is difficult to control in highly flexible
models). Certainly, flexible KANs need sophisticated regularization techniques for a
successful training which are not fully understood at the moment. With this we close
this short section on KANs.
² Selecting a B-spline means that we need to select the polynomial degree of the spline, e.g., cubic, and we need to select the knots where the piecewise polynomial functions are concatenated.



Chapter 6

Regression trees and random forests

Regression trees (decision trees) have been introduced in the seminal monograph of
Breiman et al. [31] called classification and regression trees (CARTs), published in 1984.
Regression trees are based on recursively partitioning the covariate space, therefore, this
technique is also known as rpart in R; see Therneau–Atkinson [215]. Nowadays, regres-
sion trees are not used any more in their pure form because they are not fully competitive
with more advanced regression methods. However, they are the main building blocks of
gradient boosting machines (GBMs) and random forests. For this reason, we give a short
introduction to regression trees in this chapter. In GBMs many small regression trees
(called weak learners) are combined to a powerful predictor, and in random forests many
large and noisy regression trees are combined with bagging to a more powerful predictor.
Random forests have been introduced by Breiman [30], and they will be discussed in this chapter; GBMs will be discussed in Chapter 7, below.

6.1 Regression trees


The main idea behind regression trees is to partition the covariate space X ⊂ Rq into
homogeneous subsets w.r.t. the prediction task at hand. This precisely reflects the core
of actuarial pricing in trying to build homogeneous risk classes; see, e.g., Bailey–Simon
[11]. A finite partition of the covariate space $\mathcal{X}$ is given by a finite index set $\mathbb{T}$ and a collection $(\mathcal{X}_t)_{t\in\mathbb{T}}$ of non-empty subsets of $\mathcal{X}$ such that
\[
\mathcal{X} = \bigcup_{t\in\mathbb{T}} \mathcal{X}_t \qquad \text{and} \qquad \mathcal{X}_t \cap \mathcal{X}_s = \emptyset \;\text{ for all } t \ne s. \tag{6.1}
\]

This finite partition is constructed by a binary tree, and we call $(\mathcal{X}_t)_{t\in\mathbb{T}}$ the leaves of this binary tree, as these sets are the nodes of the binary tree that do not have any descendants; see Figure 6.1 for an example with six leaves.
Assuming that all insurance policyholders $X \in \mathcal{X}_t$, who belong to the same leaf $\mathcal{X}_t$, have the same risk behavior motivates defining the regression function as
\[
X \;\mapsto\; \mu(X) = \sum_{t\in\mathbb{T}} \mu_t\, \mathbf{1}_{\{X \in \mathcal{X}_t\}}, \tag{6.2}
\]


[Figure: a binary regression tree on a motor claims frequency portfolio; the root splits on DrivAge >= 25, followed by splits on Density < 644, DrivAge >= 21, VehGas = Regular and VehBrand = B12; each node displays the estimated frequency and the share of the portfolio.]

Figure 6.1: Binary regression tree with five binary splits resulting in six leaves.

with conditional mean parameters $\mu_t \in \mathbb{R}$, for $t \in \mathbb{T}$. We call $\mu_t$ a conditional mean parameter because it reflects the conditionally expected response $\mathbb{E}[Y|X]$ on the leaf $X \in \mathcal{X}_t$. Figure 6.2 (lhs) illustrates a partition of a two-dimensional rectangular covariate space $\mathcal{X} \subset \mathbb{R}^2$ into ten leaves $(\mathcal{X}_t)_{t\in\mathbb{T}}$, and the different colors reflect the different conditional means $(\mu_t)_{t\in\mathbb{T}}$ on the corresponding leaves. The right-hand side of the figure shows a GLM regression function with a multiplicative structure.

Figure 6.2: (lhs) Partition (Xt )t∈T of a rectangular covariate space X ⊂ R2 with dif-
ferently colored conditional means (µt )t∈T on the corresponding leaves; (rhs) GLM with
multiplicative structure.

There are two main items to be selected to design the regression tree function (6.2):

(1) The partitioning of the covariate space X into the leaves (Xt )t∈T ; and

(2) the selection of the conditional means (µt )t∈T on the leaves.

We discuss the recursive partitioning algorithm to solve these two tasks.


Recursive partitioning algorithm. We briefly give a non-technical description, and


the fast reader can skip the following paragraphs which give a mathematical foundation
to this brief description. Broadly speaking, the recursive partitioning algorithm analyzes
in each loop of the algorithm which of the leaves (Xt )t∈T is the least homogeneous one
(measured in some loss metric). The least homogeneous leaf, say Xt , is then partitioned
into two parts Xt0 and Xt1 to increase the homogeneity. This is how the binary tree is
grown, i.e., it replaces the old leaf Xt by its two descendants Xt0 and Xt1 , which results
in an increased partition; see Figure 6.1 for an example. This is recursively iterated until
all the leaves are sufficiently homogeneous.

Covariate pre-processing. Continuous covariate components do not need any pre-


processing. To control the computational complexity, target encoding (2.16) is applied
to the categorical covariate components, and to improve accuracy, target encoding is
performed individually on each leaf Xt , t ∈ T, before applying the next recursive parti-
tioning step. For large trees and many high-cardinality categorical covariates this can be
prohibitive.

Initialization. The recursive partitioning algorithm is initialized by setting $\mathbb{T} = \{0\}$, selecting the so-called root tree $\mathcal{X}_0 = \mathcal{X}$, and setting as mean estimate for $\mu_0$ the weighted empirical mean $\hat{\mu}_0 = \sum_{i=1}^n v_i Y_i / \sum_{i=1}^n v_i$. We now generally put hats $\hat{\mu}$ on top of all $\mu$-values to indicate that these are estimated from the learning sample $\mathcal{L} = (Y_i, X_i, v_i)_{i=1}^n$.

(Binary) recursive partitioning iteration. Assume we have a partition $(\mathcal{X}_t, \hat{\mu}_t)_{t\in\mathbb{T}}$ of the covariate space $\mathcal{X}$ with corresponding estimated conditional means. Figure 6.1 gives an example where we have applied 5 binary splits resulting in 6 leaves, i.e., $|\mathbb{T}| = 6$. For the next iteration of the algorithm, the following needs to be done:

(a) Assume we want to partition leaf Xt . To control the combinatorial complexity,


we only analyze standardized binary splits (SBS) meaning that the split criterion
is based on one single covariate component of X only, say Xk , and we allocate
all instances i with Xi,k ≤ c to one descendant Xt0 , and otherwise, if Xi,k > c,
we allocate instance i to the other descendant Xt1 of the selected leaf Xt to be
partitioned. This gives so-called ‘rectangular binary splits’ as illustrated in Figure
6.2 (lhs). Remark that this can also be applied to the target encoded categorical
covariate components (2.16).

(b) Select the least homogeneous leaf from (Xt )t∈T , i.e., the leaf for which we can find
the most efficient SBS w.r.t. some objective function.

(c) Estimate the conditional means on these two new leaves Xt0 and Xt1 .

Items (a)-(c) require choosing a leaf $\mathcal{X}_t$ of the current tree $(\mathcal{X}_t)_{t\in\mathbb{T}}$, a covariate component $X_k$, $1 \le k \le q$, that serves for the next SBS, and a split level $c \in \mathbb{R}$ that partitions w.r.t. the selected covariate component $X_k$. To decide about these three choices, we need an objective function. Since we estimate conditional means, it is natural to take a strictly consistent loss function $L$ for mean estimation.


This then translates items (a)-(c) into the following optimization problem
\[
(\hat{t}, \hat{k}, \hat{c}) = \underset{t\in\mathbb{T},\, 1\le k \le q,\, c\in\mathbb{R}}{\arg\max} \; \sum_{i:\, X_i \in \mathcal{X}_t} \frac{v_i}{\varphi} \Big[ L(Y_i, \hat{\mu}_t) - \Big( L(Y_i, \hat{\mu}_{t0})\, \mathbf{1}_{\{X_{i,k} \le c\}} + L(Y_i, \hat{\mu}_{t1})\, \mathbf{1}_{\{X_{i,k} > c\}} \Big) \Big], \tag{6.3}
\]
with weighted empirical means
\[
\hat{\mu}_{t0} = \frac{\sum_{i:\, X_i \in \mathcal{X}_t} v_i Y_i\, \mathbf{1}_{\{X_{i,k} \le c\}}}{\sum_{i:\, X_i \in \mathcal{X}_t} v_i\, \mathbf{1}_{\{X_{i,k} \le c\}}}
\qquad \text{and} \qquad
\hat{\mu}_{t1} = \frac{\sum_{i:\, X_i \in \mathcal{X}_t} v_i Y_i\, \mathbf{1}_{\{X_{i,k} > c\}}}{\sum_{i:\, X_i \in \mathcal{X}_t} v_i\, \mathbf{1}_{\{X_{i,k} > c\}}}. \tag{6.4}
\]
This may look complicated, but it (only) says that we try to find the leaf $\mathcal{X}_{\hat{t}}$, the covariate component $X_{\hat{k}}$ and the split level $\hat{c}$ that provide the biggest decrease in loss (6.3) by this additional SBS. Moreover, the weighted empirical means (6.4) are the optimal predictors on both parts of the partitioned leaf $\mathcal{X}_t$ w.r.t. the selected strictly consistent loss function $L$. Therefore, the objective function in (6.3) is lower bounded by zero.
The solution of (6.3) gives us the new leaves, i.e., the descendants of $\mathcal{X}_{\hat{t}}$ defined by
\[
\mathcal{X}_{\hat{t}0} = \big\{ x \in \mathcal{X}_{\hat{t}};\, x_{\hat{k}} \le \hat{c} \big\}
\qquad \text{and} \qquad
\mathcal{X}_{\hat{t}1} = \big\{ x \in \mathcal{X}_{\hat{t}};\, x_{\hat{k}} > \hat{c} \big\},
\]
with the new (updated) index set
\[
\mathbb{T} \;\leftarrow\; \big( \mathbb{T} \setminus \{\hat{t}\} \big) \cup \big\{ \hat{t}0, \hat{t}1 \big\},
\]
and the empirical weighted means $\hat{\mu}_{\hat{t}0}$ and $\hat{\mu}_{\hat{t}1}$ on the new leaves $\mathcal{X}_{\hat{t}0}$ and $\mathcal{X}_{\hat{t}1}$, respectively. This fully explains the SBS recursive partitioning algorithm, and this is the standard regression tree algorithm usually used.
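The core of one recursive partitioning step, the search over $(k, c)$ in (6.3)-(6.4) on a given leaf, can be sketched in base R as follows; the Poisson deviance is used as an example of a strictly consistent loss, and a numeric covariate matrix X, responses y and weights v on the current leaf are assumed.

    ## sketch of the SBS search (6.3)-(6.4) on one leaf; X, y, v on that leaf are assumed
    pois_dev <- function(y, mu) 2 * (mu - y + ifelse(y > 0, y * log(y / mu), 0))
    best_split <- function(X, y, v, min_leaf = 100) {
      parent_mu   <- sum(v * y) / sum(v)
      parent_loss <- sum(v * pois_dev(y, parent_mu))
      best <- list(gain = 0, k = NA, c = NA)
      for (k in seq_len(ncol(X))) {
        for (cval in unique(sort(X[, k]))[-1]) {        # candidate split levels
          left <- X[, k] < cval                         # left descendant (X_{i,k} below cval)
          if (sum(left) < min_leaf || sum(!left) < min_leaf) next
          mu0  <- sum(v[left] * y[left]) / sum(v[left])       # weighted means (6.4)
          mu1  <- sum(v[!left] * y[!left]) / sum(v[!left])
          loss <- sum(v[left] * pois_dev(y[left], mu0)) +
                  sum(v[!left] * pois_dev(y[!left], mu1))
          if (parent_loss - loss > best$gain)
            best <- list(gain = parent_loss - loss, k = k, c = cval)
        }
      }
      best
    }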
We close this section with some remarks.
• The selection of the optimal split level ĉ in (6.3) is not unique because we work
with a finite sample on every leaf Xt . Typically, the split levels c ∈ R are chosen
precisely in the middle of adjacent observed covariate values to make them unique.

• In practice, one only considers potential SBS that exceed a minimal number of
instances in both descendants Xt0 and Xt1 of leaf Xt , otherwise one cannot reli-
ably estimate the weighted empirical means (6.4). In implementations, there is a
hyper-parameter related to the minimal leaf size which precisely accounts for this
constraint. The choice of the minimal leaf size depends on the problem to be solved,
e.g., for car insurance frequencies one typically requires 1000 insurance policies to
receive reliable frequency estimates.

• It may happen that in one leaf there are only instances with zero claims, which (of course) is a very homogeneous leaf. However, the mean estimate (6.4) typically leads to a degenerate model in that case. Therefore, often Bühlmann credibility is used with a credibility coefficient (shrinkage parameter) being selected as a hyper-parameter; technically, this is done precisely as in (2.17); see Therneau–Atkinson [215].


• We did not discuss the stopping rule of the recursive partitioning algorithm. Of course, we should prevent over-fitting. However, designing a good stopping rule is usually not feasible, also because the optimization (6.3) only focuses on the next best split (greedy search), but a poor next split may enable an excellent split thereafter. To account for such flexibility, a large binary regression tree is constructed in a first step. In a second step, all parts of the large tree are pruned if they do not sufficiently contribute to the required homogeneity in relation to the number of parameters involved; this is measured by analyzing how much a certain split contributes to the decrease in loss (including all its descendants and accounting for their complexity). This pruning step uses best-subset selection regularization (2.24), and, importantly, it can be performed efficiently by another recursive algorithm. The details are rather technical, they were proved in Breiman et al. [31], and, aligned to our notation, we refer to Wüthrich–Buser [241, Section 6.2]. We do not discuss this any further because we are not going to use regression trees in their pure form.

• Besides the insufficient predictive performance, robustness of regression trees is also an issue. It can happen that one of the first SBS looks completely different if one slightly perturbs a few instances in the learning sample $\mathcal{L}$, resulting in a very different predictive model. This missing robustness limits or even prevents the use of regression trees in insurance pricing. A way to control robustness is bagging and random forests; these will be discussed in Sections 6.2-6.3, below.

6.2 Bagging
In view of the missing robustness of the plain-vanilla regression tree construction dis-
cussed in the previous section, there were many attempts to robustify regression tree
predictors. For this, the non-parametric bootstrap, discussed in Section 1.5.4, is com-
bined with aggregating, discussed in Section 5.4.1, resulting in Breiman’s [29] bagging
proposal.
We start by revisiting aggregation. As mentioned, the estimated regression tree related
to (6.2) lacks robustness. Assume we have M independent learning samples L(j) , 1 ≤
j ≤ M , that follow the same data generating mechanism, and which have sample size
n. This allows us to construct M independent regression tree predictors (using the same
methodology, but different independent learning data)
\[
\hat{\mu}^{(j)}(X) = \sum_{t\in\mathbb{T}^{(j)}} \hat{\mu}^{(j)}_t\, \mathbf{1}_{\{X \in \mathcal{X}^{(j)}_t\}},
\]
where the upper index $1 \le j \le M$ denotes the different estimated regression trees. Since, by assumption, the underlying learning samples $\mathcal{L}^{(j)}$ and the resulting regression trees are i.i.d., the law of large numbers applies, a.s.,
\[
\lim_{M\to\infty} \frac{1}{M} \sum_{j=1}^M \hat{\mu}^{(j)} = \mathbb{E}\big[\hat{\mu}^{(1)}\big], \tag{6.5}
\]
and the randomness asymptotically vanishes, see (5.16). This highlights the advantages of aggregating, namely, the randomness (from the finite samples) asymptotically


vanishes. On the other hand, (6.5) also indicates a main issue of this technique. Namely, we have convergence to a deterministic limit $\mathbb{E}[\hat{\mu}^{(1)}]$, but there is no guarantee that this limit is close to the true regression function $\mu^*$, i.e., if the individual regression tree constructions $\hat{\mu}^{(j)}$ are biased (in some systematic way), so will be the limit. Therefore, aggregation is only a method to diminish uncertainty through randomness in the learning samples, but not a method for mitigating a (systematic) bias in the construction.

Example 6.1. An easy example of a biased estimation procedure is the following. If we set $\hat{\mu}^{(j)} = \max_{1\le i \le n} Y^{(j)}_i$, we certainly get a positive bias in this mean estimation procedure since the limit in (6.5) is equal to $\mathbb{E}[\max_{1\le i\le n} Y^{(1)}_i] > \mathbb{E}[Y^{(1)}_1] = \mu^*$ in any non-deterministic situation with non-comonotonic responses and $n > 1$. ■

For aggregation, we need multiple independent learning samples $\mathcal{L}^{(j)}$ and predictors $\hat{\mu}^{(j)}$, respectively. Similarly to Section 5.4, it is not immediately clear where we can get these independent samples from. Breiman's [29] b in bagging refers to bootstrap simulation, or more precisely to the non-parametric bootstrap discussed in Section 1.5.4. Starting from the observed learning sample $\mathcal{L} = (Y_i, X_i, v_i)_{i=1}^n$, we draw with replacements independent bootstrap samples $\mathcal{L}^{(\star j)} = (Y^{(\star j)}_i, X^{(\star j)}_i, v^{(\star j)}_i)_{i=1}^n$, where 'independent' applies to the drawing with replacements. The resulting bootstrapped learning samples $(\mathcal{L}^{(\star j)})_{j=1}^M$ are conditionally i.i.d., given the learning sample $\mathcal{L}$. From these, we can construct conditionally i.i.d. regression tree predictors $\hat{\mu}^{(\star j)}$ to which the law of large numbers applies, a.s.,
\[
\lim_{M\to\infty} \frac{1}{M} \sum_{j=1}^M \hat{\mu}^{(\star j)} = \mathbb{E}\Big[ \hat{\mu}^{(\star 1)} \,\Big|\, \mathcal{L} \Big].
\]

The same remark about the bias applies as above, but this time the bias additionally
depends on the specific observations in the learning sample L, e.g., having a small sample
with an outlier will likely result in a largely biased predictor, if the outlier is not properly
controlled. On the other hand, the out-of-bag method (unique to non-parametric boot-
strapping) gives one an easy (and integrated) cross-validation technique that may allow
to detect such biases; for out-of-bag validation, see (1.22).
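A minimal bagging sketch with out-of-bag predictions is given below; it uses the rpart package (with its Poisson splitting rule) for the individual regression trees, and a data frame dat with columns claims, expo and covariates x1, x2 is assumed.

    ## sketch of bagging with out-of-bag (OOB) evaluation; 'dat' is an assumed claims data set
    library(rpart)
    set.seed(1)
    M <- 50
    n <- nrow(dat)
    oob_pred <- matrix(NA, nrow = n, ncol = M)
    for (j in seq_len(M)) {
      idx  <- sample.int(n, n, replace = TRUE)                 # non-parametric bootstrap sample
      tree <- rpart(cbind(expo, claims) ~ x1 + x2, data = dat[idx, ],
                    method = "poisson", control = rpart.control(cp = 0.0001))
      out  <- setdiff(seq_len(n), unique(idx))                 # out-of-bag instances
      oob_pred[out, j] <- predict(tree, newdata = dat[out, ])  # predicted frequencies
    }
    bagged_oob <- rowMeans(oob_pred, na.rm = TRUE)             # bagged OOB prediction per instance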
A general issue with bagging is that the individual bootstrapped regression trees µ b(⋆j)
are highly correlated because the identical observations are recycled many times. This
mutual dependence makes this whole modeling approach not very efficient, and hence
these regression trees are not used in actuarial science. Random forests discussed in the
next section precisely tries to improve on this point.

6.3 Random forests


The careful reader will have noticed that in the previous section we have not been men-
tioning the sizes of the constructed regression trees, i.e., the number of resulting leaves
|T|. For Breiman’s [30] random forests, we construct very large trees |T| ≫ 1 to introduce
more randomness into the procedure, and to decorrelate these large trees, we employ a
procedure that adds additional randomness to the regression tree construction by fre-
quently missing the optimal partition in (6.3). Thus, in contrast to bagging, we aim at
having more randomness in the regression tree construction. In particular, constructing


large regression trees frequently missing the optimal split provides some over-fitting but
also a more random tree construction. This precisely has a decorrelating effect, resulting
in the random forests predictor discussed next.
As for bagging, we generate i.i.d. bootstrap samples $\mathcal{L}^{(\star j)} = (Y^{(\star j)}_i, X^{(\star j)}_i, v^{(\star j)}_i)_{i=1}^n$ from the learning sample $\mathcal{L} = (Y_i, X_i, v_i)_{i=1}^n$ by drawing with replacements. For each bootstrap sample $\mathcal{L}^{(\star j)}$, we construct a large and noisy regression tree estimator $\hat{\mu}^{(\star j)}$ as follows. Consider the $j$-th bootstrap sample $\mathcal{L}^{(\star j)}$, and assume we have constructed a binary tree $(\mathcal{X}^{(\star j)}_t, \hat{\mu}^{(\star j)}_t)_{t\in\mathbb{T}}$ on that bootstrap sample that we want to further partition similarly to (6.3). To add randomness, we select in each loop of the SBS recursive partitioning algorithm a non-empty random subset $\mathcal{Q} \subset \{1, \ldots, q\}$ of the covariate components $X = (X_1, \ldots, X_q)^\top$, and we only consider the components in $\mathcal{Q}$ for the next SBS, that is, we replace (6.3) by
\[
(\hat{t}, \hat{k}, \hat{c}) = \underset{t\in\mathbb{T},\, k\in\mathcal{Q},\, c\in\mathbb{R}}{\arg\max} \; \sum_{i:\, X^{(\star j)}_i \in \mathcal{X}^{(\star j)}_t} \frac{v^{(\star j)}_i}{\varphi} \Big[ L\big(Y^{(\star j)}_i, \hat{\mu}^{(\star j)}_t\big) - \Big( L\big(Y^{(\star j)}_i, \hat{\mu}^{(\star j)}_{t0}\big)\, \mathbf{1}_{\{X^{(\star j)}_{i,k} \le c\}} + L\big(Y^{(\star j)}_i, \hat{\mu}^{(\star j)}_{t1}\big)\, \mathbf{1}_{\{X^{(\star j)}_{i,k} > c\}} \Big) \Big], \tag{6.6}
\]
the main difference to (6.3) being the random set $\mathcal{Q}$ in (6.6). This algorithm gives us a randomized and bootstrapped regression tree predictor $\hat{\mu}^{(\star j)}(X)$ for each bootstrap sample $\mathcal{L}^{(\star j)}$, $1 \le j \le M$.
Aggregating over these regression trees allows us to define the random forest predictor
\[
\hat{\mu}^{\mathrm{RF}}(X) = \frac{1}{M} \sum_{j=1}^M \hat{\mu}^{(\star j)}(X). \tag{6.7}
\]

We give some remarks:

• By sampling a proper subset $\mathcal{Q} \subsetneq \{1, \ldots, q\}$, we may miss the optimal SBS. This introduces more randomness and decorrelation for i.i.d. sets $\mathcal{Q}$ in each iteration of the recursive partitioning algorithm.

• If $\mathcal{Q} \equiv \{1, \ldots, q\}$, there is no difference between bagging and random forests.

• Often, $\mathcal{Q}$ is set to have a fixed size in all recursive partitioning steps; popular choices are $\lfloor\sqrt{q}\rfloor$ or $\lfloor q/3 \rfloor$ for the size of $\mathcal{Q}$.

• Generally, random forest predictors are not as competitive as networks and GBMs, which is why these techniques are not used very frequently. Moreover, random forests can be computationally intensive, i.e., constructing large trees on potentially many high-cardinality categorical covariates can severely impact the fitting time.

• Standard random forest packages often work under the Gaussian loss assumption
which is not appropriate in many actuarial problems, and this loss cannot easily be
replaced in these implementations.


• Nevertheless, random forests are still considered to be useful in actuarial applications. They are comparably simple to apply and they do not need much hyper-parameter tuning.

– First, they may help to detect interactions, and if there are interactions a more
sophisticated method can be used.
– Second, they are used as a surrogate model for explainability, because through
the splitting mechanism they provide a simple variable importance measure.
This is explained next.

There is, though, one nice application of random forests that is being used. Assume we have fitted a regression model $\hat{\mu}: \mathcal{X} \to \mathbb{R}$ to the learning data $\mathcal{L} = (Y_i, X_i, v_i)_{i=1}^n$, and we would like to have a measure of variable importance, meaning that we would like to measure which of the components $X_k$ of the covariates $X$ have a big impact on the regression function. A possible solution to this question is to fit a random forest surrogate model $\hat{\mu}^{\mathrm{RF}}$ to the regression function $\hat{\mu}$. That is, we fit a random forest regression function $\hat{\mu}^{\mathrm{RF}}$ to the learning data $\hat{\mathcal{L}} = (\hat{\mu}(X_i), X_i)_{i=1}^n$ by minimizing the square loss
\[
\frac{1}{n} \sum_{i=1}^n \Big( \hat{\mu}(X_i) - \hat{\mu}^{\mathrm{RF}}(X_i) \Big)^2.
\]
If we find an accurate random forest regression model $\hat{\mu}^{\mathrm{RF}} \approx \hat{\mu}$, we can use this random forest as a surrogate model for analyzing variable importance. This random forest is an ensemble over multiple regression trees $(\hat{\mu}^{(\star j)})_{j=1}^M$, see (6.7), and we can analyze all the SBS that lead to these regression trees $(\hat{\mu}^{(\star j)})_{j=1}^M$. Each SBS can be allocated to a covariate component $X_k$, $1 \le k \le q$, and aggregating the decreases of losses (6.6) for each component $1 \le k \le q$ gives us a measure of variable importance.
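A surrogate-model sketch is given below; it assumes that the randomForest package is available, that X is a data frame of covariates, and that mu_hat contains the fitted values $\hat{\mu}(X_i)$ of the model to be explained. The split-based importance of the surrogate forest then serves as the variable importance measure described above.

    ## sketch: random forest surrogate model for variable importance; X and mu_hat are assumed
    library(randomForest)
    set.seed(1)
    rf_surrogate <- randomForest(x = X, y = mu_hat, ntree = 500)      # square loss regression forest
    head(importance(rf_surrogate))                                    # split-based variable importance
    plot(mu_hat, predict(rf_surrogate), main = "surrogate fit check") # check mu_RF is close to mu_hat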

In summary, due to computational costs and missing competitiveness, regression trees and random forests do not belong to the regression techniques that are frequently used. An important point is that regression trees, applied differently as weak learners in GBMs, will equip us with some of the most competitive predictive models on tabular data; see Chapter 7, below.



Chapter 7

Gradient boosting machines

This chapter presents boosting and, in particular, gradient boosting machines (GBMs).
GBMs belong to the most powerful machine learning methods on tabular data, often
outperforming the FNN architectures studied in Chapter 5. The concept of boosting has
already been mentioned in Section 5.6, where we boosted a GLM with FNN features.
Before going into the theory of GBMs, we take a look at the idea of (additive) iterative updating, i.e., boosting; we then study GBMs, and in the last part of this chapter we discuss XGBoost and LightGBM, which are the state-of-the-art GBM implementations these days.

7.1 (Generalized additive) boosting


The idea of boosting is to use an iterative updating scheme to approximate the unknown true regression function $X \mapsto \mu^*(X)$ introduced in (1.2). For this iterative scheme of deriving an approximation $\hat{\mu}$ to $\mu^*$, one typically uses simple functions or so-called base learners. This makes boosting very closely connected to additive modeling. In agreement with Chapter 3, we adapt our boosting schemes to GLMs by trying to learn a regression function $\hat{\mu}(X)$ on its transformed scale $g(\hat{\mu}(X))$, for a given link function $g$, see (3.1). The notation will become a bit cumbersome, but it will be more directly related to GLMs and FNNs.
Let $\mathcal{B} = \{X \mapsto b(X; \vartheta)\}_\vartheta$ define a class of base learners being parametrised by $\vartheta$; concerning the parameter notation we refer to the footnote on page 16. Based on the learning sample $\mathcal{L} = (Y_i, X_i, v_i)_{i=1}^n$, the boosting update in iteration $j \ge 1$ is described according to
\[
\hat{\vartheta}^{(j)} = \underset{\vartheta}{\arg\min} \; \sum_{i=1}^n \frac{v_i}{\varphi}\, L\Big(Y_i,\, g^{-1}\Big( g\big(\hat{\mu}^{(j-1)}(X_i)\big) + b(X_i; \vartheta) \Big)\Big), \tag{7.1}
\]
subject to existence and uniqueness. This gives us the updated regression function estimate
\[
\hat{\mu}^{(j)}(X) = g^{-1}\Big( g\big(\hat{\mu}^{(j-1)}(X)\big) + b(X; \hat{\vartheta}^{(j)}) \Big) = g^{-1}\Big( \sum_{s=0}^j b(X; \hat{\vartheta}^{(s)}) \Big), \tag{7.2}
\]


where we set for the initialization $b(X; \hat{\vartheta}^{(0)})$ the homogeneous MLE given by $b(X; \hat{\vartheta}^{(0)}) \equiv g(\hat{\mu}_0) = g\big(\sum_{i=1}^n v_i Y_i / \sum_{i=1}^n v_i\big)$.

Note that (7.1) is a natural generalization of the one-step boosting described in Section 5.6, see formula (5.22). The only difference is that there we used a very specific base learner. Moreover, (7.2) stresses the close connection to additive modeling, where the update in iteration $j$ is based on trying to capture the remaining signal after having adjusted the intercept according to
\[
g\big(\hat{\mu}^{(j-1)}(X)\big) = \sum_{s=0}^{j-1} b(X; \hat{\vartheta}^{(s)}).
\]

This can be reinterpreted by noting that the base learner $b(X; \hat{\vartheta}^{(j)})$ tries to find the weaknesses of the previous regression model $\hat{\mu}^{(j-1)}(X)$. This is rather different from ensembling as described in Section 5.4.
The pseudo algorithm for (generalized) additive boosting is given in Algorithm 1, and
is sometimes also referred to as stagewise additive boosting, see, e.g., Hastie et al. [93,
Algorithm 10.2].

Algorithm 1 Generalized additive boosting

Initialize.
– Set the initial mean estimate to the global empirical mean $\hat{\mu}^{(0)}(X) = \hat{\mu}_0$.
– Select the maximum number of boosting iterations $j_{\max} \ge 1$.

Iterate.
while $1 \le j \le j_{\max}$ do
  Update
  \[
  \hat{\mu}^{(j)}(X) = g^{-1}\Big( g\big(\hat{\mu}^{(j-1)}(X)\big) + b(X; \hat{\vartheta}^{(j)}) \Big),
  \]
  where $\hat{\vartheta}^{(j)}$ is given by (7.1), and increase $j$.
end.

Return.
$\hat{\mu}^{\mathrm{boost}}(X) := \hat{\mu}^{(j_{\max})}(X)$.

Remarks 7.1. • Note that for each boosting step $j$
\[
\sum_{i=1}^n \frac{v_i}{\varphi}\, L\big(Y_i, \hat{\mu}^{(j)}(X_i)\big) \;\le\; \sum_{i=1}^n \frac{v_i}{\varphi}\, L\big(Y_i, \hat{\mu}^{(j-1)}(X_i)\big).
\]
This is an in-sample loss improvement (on the learning sample $\mathcal{L}$).

• Apart from estimating the base learners' parameters $\hat{\vartheta}^{(j)}$, one needs to decide on the maximum number of boosting iterations $j_{\max}$. In practice, one picks a large $j_{\max}$ and uses early stopping based on cross-validation to decide on an effective number of boosting steps to receive a suitable out-of-sample performance, see Section 5.3.5.

• In boosting it is common to include a learning rate or shrinkage factor $\eta \in (0, 1]$, or to select a decreasing sequence of positive factors $\eta^{(j)} > 0$. One then replaces the updating step in Algorithm 1 by
\[
\hat{\mu}^{(j)}(X) = g^{-1}\Big( g\big(\hat{\mu}^{(j-1)}(X)\big) + \eta^{(j)}\, b(X; \hat{\vartheta}^{(j)}) \Big).
\]
The intuition behind this step is to avoid taking too large steps in each iteration of the algorithm.

Example 7.2. Assume that the true data generating mechanism is given by
\[
Y = \mu^*(X) + \varepsilon, \qquad \text{with } \varepsilon \sim \mathcal{N}(0, \sigma^2), \tag{7.3}
\]
where $\mu^*(X) \in \mathbb{R}$ is an unknown mean function, and $\sigma^2 > 0$ is an unknown variance parameter. For simplicity, we set $v_i \equiv 1$. Assume furthermore that we have an i.i.d. learning sample $\mathcal{L} = (Y_i, X_i)_{i=1}^n$ generated from (7.3).
Our goal is to approximate $\mu^*(X)$ with base learners $b(X; \vartheta)$ using Algorithm 1 under the square loss and with the identity link choice for $g$. In view of (7.1), this gives us for the loss in iteration $j$, dropping $\varphi = \sigma^2$ in the following identities,
\[
\sum_{i=1}^n L\Big(Y_i,\, g^{-1}\big( g(\hat{\mu}^{(j-1)}(X_i)) + b(X_i; \vartheta) \big)\Big)
= \sum_{i=1}^n \Big( Y_i - \hat{\mu}^{(j-1)}(X_i) - b(X_i; \vartheta) \Big)^2
= \sum_{i=1}^n \Big( \hat{\varepsilon}^{(j)}_i - b(X_i; \vartheta) \Big)^2,
\]
where we set
\[
\hat{\varepsilon}^{(j)}_i = Y_i - \hat{\mu}^{(j-1)}(X_i).
\]
Thus, the base learners $b(X_i; \vartheta)$ try to exploit the current residuals $\hat{\varepsilon}^{(j)}_i$ to find the next optimal parameter $\hat{\vartheta}^{(j)}$. This is analogous to (5.19). ■
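A minimal base R sketch of Example 7.2, boosting regression tree stumps on the residuals under the square loss, is given below; the rpart package is used for the stumps, and a data frame dat with response y and covariates x1, x2 is assumed. Early stopping is omitted for brevity.

    ## sketch of additive boosting on residuals (square loss, identity link); 'dat' is assumed
    library(rpart)
    j_max <- 100; eta <- 0.1                         # number of boosting steps and learning rate
    pred  <- rep(mean(dat$y), nrow(dat))             # initialization with the global mean
    for (j in seq_len(j_max)) {
      dat$res <- dat$y - pred                        # current residuals eps_i^(j)
      stump   <- rpart(res ~ x1 + x2, data = dat,
                       control = rpart.control(maxdepth = 1, cp = 0))  # weak learner (tree stump)
      pred    <- pred + eta * predict(stump, newdata = dat)            # boosting update
    }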

The boosting procedure described in Example 7.2 is also known as matching pursuit;
see, e.g., Mallat–Zhang [148]. Matching pursuit was introduced in the signal processing
literature. Typically, it focuses on the square loss which is most suitable for Gaussian
responses, see Table 2.2. The situation in Example 7.2 is discussed in detail in Bühlmann
[37], when using low depth and low interaction regression trees based on the square loss in
(6.3), and this reference also includes results on convergence. The Gaussian assumption
used in Example 7.2 is a special case of Tweedie’s family, see Table 2.3, and it is possible
to obtain similar iterative boosting schemes for Tweedie’s family under the log-link choice.
This is known as response boosting; see Hainaut et al. [90]. It is, however, important to
note that when using a general Tweedie’s deviance loss with trees as base learners, one
should, of course, also optimize the regression trees in (6.3) w.r.t. that Tweedie’s deviance
loss.


7.2 Gradient boosting machines


In Section 7.1, boosting was introduced as an iterative updating procedure, successively
improving the fit of (and the complexity in) the approximation of µ(X) using base
learners {b(X; ϑ)}ϑ . This was done by using successive updates in the mean model
under a link transformation, see (7.2). A different route forward is to consider learning
an unknown function using functional gradient descent. We discuss this in the present
section.

7.2.1 Functional gradient boosting


We take this introduction to functional gradient descent in different steps. We start with
a single instance i, and we fit a one-dimensional real-valued parameter ϑ ∈ R to the
observation Yi , providing a mean estimate µ = g −1 (ϑ) for Yi . That is, for the moment,
we let the (generic) parameter ϑ play the role of the link transformed mean g(µ) of Yi .
In order to ease the notation of this exposition, we define
\[
v_i\, L^g(Y_i, \vartheta) = \frac{v_i}{\varphi}\, L\big(Y_i, g^{-1}(\vartheta)\big), \tag{7.4}
\]
for the selected link function $g$ and a strictly consistent loss function $L$ for mean estimation. Moreover, assume that the derivative of $v_i L^g(Y_i, \vartheta)$ w.r.t. $\vartheta$ exists, and denote it by $v_i \nabla_\vartheta L^g(Y_i, \vartheta)$.
Using the standard gradient descent method, see (5.12), we aim at finding an optimal parameter $\hat{\vartheta}$. The standard gradient descent step at algorithmic time $j$ reads as
\[
\hat{\vartheta}^{(j)} = \hat{\vartheta}^{(j-1)} - \eta^{(j)}\, v_i\, \nabla_\vartheta L^g\big(Y_i, \hat{\vartheta}^{(j-1)}\big), \tag{7.5}
\]
for learning rate $\eta^{(j)} > 0$. By choosing a sufficiently small learning rate $\eta^{(j)} > 0$, it is possible to ascertain a loss improvement (unless the gradient in (7.5) is zero)
\[
L^g\big(Y_i, \hat{\vartheta}^{(j)}\big) \;\le\; L^g\big(Y_i, \hat{\vartheta}^{(j-1)}\big).
\]
One option to select the learning rate $\eta^{(j)} > 0$ is to use so-called full relaxation, which corresponds to doing a line search according to
\[
\hat{\eta}^{(j)} = \underset{\eta > 0}{\arg\min}\; L^g\Big(Y_i,\, \hat{\vartheta}^{(j-1)} - \eta\, v_i\, \nabla_\vartheta L^g\big(Y_i, \hat{\vartheta}^{(j-1)}\big)\Big), \tag{7.6}
\]
compare with (7.1). For more on this, including conditions for convergence and convergence rates, see Nesterov [166].
gence rates; see Nesterov [166].
The procedure for learning an unknown function can be approached similarly to learning an unknown parameter, which is the intuition behind GBMs. Again consider a single instance $i$ with observation $(Y_i, X_i, v_i)$. By using abbreviation (7.4), we obtain
\[
\frac{v_i}{\varphi}\, L\big(Y_i, \mu(X_i)\big) = v_i\, L^g\big(Y_i, g(\mu(X_i))\big).
\]
Differentiating the right-hand side w.r.t. $\vartheta = g(\mu(X_i))$ allows us to rewrite the standard gradient descent step (7.5) according to
\[
g\big(\hat{\mu}^{(j)}(X_i)\big) = g\big(\hat{\mu}^{(j-1)}(X_i)\big) - \eta^{(j)}\, v_i\, \nabla_\vartheta L^g\Big(Y_i, g\big(\hat{\mu}^{(j-1)}(X_i)\big)\Big),
\]
or, equivalently, by bringing the link $g$ to the other side,
\[
\hat{\mu}^{(j)}(X_i) = g^{-1}\Big( g\big(\hat{\mu}^{(j-1)}(X_i)\big) - \eta^{(j)}\, v_i\, \nabla_\vartheta L^g\Big(Y_i, g\big(\hat{\mu}^{(j-1)}(X_i)\big)\Big) \Big). \tag{7.7}
\]

If we apply this iteration (7.7) to each instance $1 \le i \le n$, it will converge (under suitable learning rates) to a saturated model. Naturally, this provides an in-sample over-fitted model, as each instance $i$ receives its individual mean parameter estimate.
Instead, the crucial step is to approximate the gradients $v_i \nabla_\vartheta L^g(Y_i, g(\hat{\mu}^{(j-1)}(X_i)))$ by base learners that are simple functions (low-dimensional objects), such as trees. This regularizes the problem and reduces its dimension.
Define, in iteration $j \ge 1$, the working responses, for $1 \le i \le n$,
\[
r^{(j)}_i = -\,v_i\, \nabla_\vartheta L^g\Big(Y_i, g\big(\hat{\mu}^{(j-1)}(X_i)\big)\Big). \tag{7.8}
\]
For GBMs, one fits a base learner from the parametrized class $\mathcal{B} = \{X \mapsto b(X; \vartheta)\}_\vartheta$ to the new learning sample $\mathcal{L}^{(j)} = (r^{(j)}_i, X_i)_{i=1}^n$. Since this is a regularization step, and since the idea is that the base learners should iteratively learn a gradient approximation, this suggests using the square loss for this approximation step
\[
\hat{\vartheta}^{(j)} = \underset{\vartheta}{\arg\min}\; \sum_{i=1}^n \Big( r^{(j)}_i - b(X_i; \vartheta) \Big)^2. \tag{7.9}
\]
This implies that the (saturated) standard gradient descent iteration (7.7) is replaced by its regularized, gradient approximated version, called GBM step,
\[
\hat{\mu}^{(j)}(X_i) = g^{-1}\Big( g\big(\hat{\mu}^{(j-1)}(X_i)\big) + \eta^{(j)}\, b(X_i; \hat{\vartheta}^{(j)}) \Big). \tag{7.10}
\]

Here, one can note the close resemblance between (7.10) and the updating step in the generalized additive boosting of Algorithm 1. In order to make the connection to generalized additive boosting even stronger, note that full relaxation corresponds to doing a full line search w.r.t. $\eta^{(j)}$ in analogy to (7.6). That is, solve
\[
\hat{\eta}^{(j)} = \underset{\eta > 0}{\arg\min}\; \sum_{i=1}^n v_i\, L\Big(Y_i,\, g^{-1}\Big( g\big(\hat{\mu}^{(j-1)}(X_i)\big) + \eta\, b(X_i; \hat{\vartheta}^{(j)}) \Big)\Big). \tag{7.11}
\]
By iterating over (7.8)-(7.11), in analogy with the generalized additive boosting algorithm, see Algorithm 1, we obtain the general GBM procedure of Friedman [70]. This algorithm is summarized in Algorithm 2.

Remarks 7.3. • It is common to add yet another learning rate to Algorithm 2, the
intuition again being that small steps are less harmful than taking too large ones;
by taking too small steps one can always “catch up” in the coming iterations by
taking a number of smaller similar steps.

• Note that, depending on which software implementation one is using, the loss function may or may not include a pre-defined link function.


Algorithm 2 General gradient boosting machine

Initialize.
– Set the initial mean estimate to the global empirical mean $\hat{\mu}^{(0)}(X) = \hat{\mu}_0$.
– Select the maximum number of boosting iterations $j_{\max} \ge 1$.

Iterate.
while $1 \le j \le j_{\max}$ do
  1. Calculate the working responses $r^{(j)}_i$ from (7.8).
  2. Fit a base learner from $\{b(X; \vartheta)\}_\vartheta$ to the working responses according to (7.9).
  3. Calculate the optimal step length $\hat{\eta}^{(j)}$ according to (7.11).
  4. Update
     \[
     \hat{\mu}^{(j)}(X) = g^{-1}\Big( g\big(\hat{\mu}^{(j-1)}(X)\big) + \hat{\eta}^{(j)}\, b(X; \hat{\vartheta}^{(j)}) \Big),
     \]
     and increase $j$.
end.

Return.
$\hat{\mu}^{\mathrm{GBM}}(X) := \hat{\mu}^{(j_{\max})}(X)$.

• In Algorithm 2, the weights $v_i > 0$ are included, and depending on the software implementation, one needs to ensure proper use of the weights, intercepts and offsets, respectively.

• The general GBM described in Algorithm 2 shares the same problems as general-
ized additive boosting w.r.t. potential over-fitting, and one should again use early
stopping based on cross-validation, or a similar technique, to prevent over-fitting.

• The general GBM of Algorithm 2 is closely connected to additive gradient based


boosting, such as GAM-boost, being based on generalized additive models (GAMs),
see, e.g., the gamboost class in the R package mboost [101].

• Consistency and convergence properties for GBMs tend to become technical. Such
type of results can be found in Zhang–Yu [248], where the GBM from Algorithm 2
appears as a special case of a more general greedy boosting algorithm.

The close connections between additive boosting and general GBMs have already been
established above, and to stress this further we continue Example 7.2:

Example 7.4. We revisit the set-up of Example 7.2. Calculating the working responses according to (7.8) for the square loss, we get
\[
r^{(j)}_i = -\,\nabla_\vartheta L\big(Y_i, \hat{\mu}^{(j-1)}(X_i)\big) = 2\big(Y_i - \hat{\mu}^{(j-1)}(X_i)\big) \;\propto\; Y_i - \hat{\mu}^{(j-1)}(X_i),
\]

where constants not depending on $i$ have been dropped; in this example we assumed $v_i \equiv 1$. Fitting a base learner $b(X; \vartheta)$ to the new learning sample $\mathcal{L}^{(j)} = (r^{(j)}_i, X_i)_{i=1}^n$ using the square loss is equivalent to minimizing the following expression in $\vartheta$, see (7.9),
\[
\sum_{i=1}^n \Big( r^{(j)}_i - b(X_i; \vartheta) \Big)^2
= \sum_{i=1}^n \Big( Y_i - \hat{\mu}^{(j-1)}(X_i) - b(X_i; \vartheta) \Big)^2
= \sum_{i=1}^n \Big( Y_i - \big( \hat{\mu}^{(j-1)}(X_i) + b(X_i; \vartheta) \big) \Big)^2.
\]
The latter expression is equivalent to additive boosting (7.1) under the square loss function choice and for the identity link for $g$. ■

7.2.2 Tree-based gradient boosting machines


Up until now, we have not been discussing any particular choices of the base learners
B = {X 7→ b(X; ϑ)}ϑ . The most popular class B of base learners are low depth and low
interaction regression trees of a fixed cardinality. They are simple to implement, easy
and fast to compute and have an excellent predictive performance in GBMs.
Recall the notation from Chapter 6. That chapter introduced a finite partition of the covariate space $\mathcal{X} = \bigcup_{t\in\mathbb{T}} \mathcal{X}_t$, with a finite index set $\mathbb{T}$ of cardinality $|\mathbb{T}|$. Low depth and low interaction regression trees have a small cardinality $|\mathbb{T}|$, i.e., they only consider a few binary splits. Using a fixed small cardinality $|\mathbb{T}|$, we select the class $\mathcal{B}$ of regression tree base learners by, see (6.2),
\[
b(X; \vartheta) = \sum_{t\in\mathbb{T}} m_t\, \mathbf{1}_{\{X \in \mathcal{X}_t\}}, \tag{7.12}
\]
where the parameter $\vartheta$ collects the full description of the binary regression tree (7.12). We insert this class $\mathcal{B}$ of fixed small cardinality $|\mathbb{T}|$ regression trees as base learners in Algorithm 2.
Reconsider the line search (7.11) for the optimal selection of the learning rate η^(j) > 0 in the j-th iteration of Algorithm 2. This line search (7.11) can be replaced by directly updating the piecewise constant estimates (m_t)_{t∈T} on every leaf. That is, for all t ∈ T

    m̂_t^(j) = arg min_{m∈R} ∑_{i: X_i ∈ X̂_t^(j)} v_i L^g( Y_i, g(µ̂^(j−1)(X_i)) + m ).     (7.13)

This minimization relies on first having fitted a regression tree to the working responses to obtain the covariate space partition (X̂_t^(j))_{t∈T} of the covariate space X, see (7.9). This results in the tree-based gradient boosting predictor in the j-th iteration

    µ̂^(j)(X) = g^{-1}( g(µ̂^(j−1)(X)) + ∑_{t∈T} m̂_t^(j) 1_{{X ∈ X̂_t^(j)}} ).     (7.14)

As mentioned above, tree-based GBMs have become very popular and can be found in
off-the-shelf software, such as the gbm package in R. When using one of these packages,
one always needs to carefully check what link functions g are implemented and how


the algorithm deals with the weights vi ; we also refer to Remark 2.3. The tree-based
GBM procedure is summarized in Algorithm 3, which corresponds to the regression tree
boosting described in Friedman [70].

Algorithm 3 Tree-based gradient boosting machine

Initialize.
– Set the initial mean estimate to the global empirical mean: µ̂^(0)(X) = µ̂_0.
– Select the maximum number of boosting iterations j_max ≥ 1.
– Fix the cardinality |T| of the regression tree base learners.

Iterate. while 1 ≤ j ≤ j_max do
1. Calculate the working responses r_i^(j) from (7.8).
2. Fit a regression tree (7.12) of cardinality |T| to the working responses using the greedy minimization of (7.9).
3. Perform optimal leaf adjustments according to (7.13).
4. Update

    µ̂^(j)(X) = g^{-1}( g(µ̂^(j−1)(X)) + ∑_{t∈T} m̂_t^(j) 1_{{X ∈ X̂_t^(j)}} ),

and increase j.

end.

Return.
µ̂^tree-GBM(X) := µ̂^(j_max)(X).

Remarks 7.5. • In (7.14) and in Algorithm 3 it is common to add yet another


learning rate (shrinkage factor) in the update step.

• Note that compared with tree-based additive boosting, a tree-based GBM only
makes use of square loss fitted trees, irrespective of the underlying data generating
process motivating the use of the loss L(Y, µ). This makes the tree-based GBMs
easy to apply to custom distributions. Furthermore, it is no problem to use count data responses or binary responses (classification). For general binary tree-
fitting, see, e.g., the discussion concerning (6.3) above, and Hainaut et al. [90] on
response boosting.

• In many applications, so-called tree stumps are used as base learners; tree stumps
consider one single split and have |T| = 2 leaves. We suggest considering bigger
trees, because bigger trees promote interaction modeling. With tree stumps and


the log-link choice for g, the GBM results in a multiplicative model; see Wüthrich–
Buser [241, Example 7.5].
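
To make Algorithm 3 concrete, a hypothetical fitting call with the gbm package mentioned above could look as follows; the data set, variable names, the Poisson deviance loss and all tuning values are placeholders, and the exact handling of weights and offsets should always be verified against the package documentation, see also Remark 2.3.

    # hypothetical claim-frequency example with the R package gbm;
    # the learning sample 'learn', variable names and tuning parameters are illustrative only
    library(gbm)

    fit <- gbm(
      claims ~ age + vehicle_power + region + offset(log(exposure)),
      data              = learn,          # assumed learning sample
      distribution      = "poisson",      # Poisson deviance loss with log-link
      n.trees           = 500,            # maximal number of boosting iterations j_max
      interaction.depth = 3,              # tree size: bigger trees promote interaction modeling
      shrinkage         = 0.1,            # additional shrinkage factor
      bag.fraction      = 0.75,           # sub-sampling (bagging) per iteration
      cv.folds          = 5)              # cross-validation for early stopping

    best_j <- gbm.perf(fit, method = "cv")                     # early stopped number of trees
    pred   <- predict(fit, newdata = test, n.trees = best_j, type = "response")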

7.3 Interpretability measures and variable importance


The question of interpreting model output is an important and active field of research
to which Chapter [insert] is devoted. That chapter presents model-agnostic tools that do not depend on a certain type of regression model; in this section, we present a tree-based
interpretability tool. As we have seen in Chapter 6, a single regression (or classification)
tree is in itself explainable and interpretable. As we know, however, large trees tend
to become unstable, whereas we have argued that (generalized additive) boosted tree
models, making use of low-cardinality trees as base learners, tend to have an excellent
predictive performance. But, when constructing a predictor by adding many small trees,
the interpretability of the resulting predictor gets lost.
A simple and very popular interpretability measure based on trees was introduced in
Breiman et al. [31]. It also generalizes to (generalized) additive tree models. It is called
variable importance score (VI-score). The idea is simple and can be formalized as follows.
Let S(k; T) denote which of the q covariate components (Xl )ql=1 is used in split k in the
tree T, thus, S(k; T) ∈ {1, . . . , q}. That is, if we have a tree of cardinality |T|, this means
that there are in total |T| − 1 splits in the tree T. The VI-score for covariate component
1 ≤ l ≤ q w.r.t. the tree T is given by
    VI_l(T) = ∑_{k=1}^{|T|−1} î_k(T)^2 1_{{S(k;T)=l}},     (7.15)

where î_k(T)^2 corresponds to the loss improvement in the k-th split in T, i.e., this is
obtained by evaluating the objective function of (6.3) in the selected split. Consequently,
if we are using a method which is based on jmax additively boosted trees, Tj , 1 ≤ j ≤ jmax ,
this suggests to consider the average VI-score for covariate component 1 ≤ l ≤ q given
by
    VI_l = (1/j_max) ∑_{j=1}^{j_max} VI_l(T_j).     (7.16)

For more on VI-scores and other tree-based methods, see, e.g., Hastie et al. [93, Chapter
10.13], and the references therein.
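
Most software implementations report these (normalized) VI-scores directly; a small sketch, assuming previously fitted boosting models gbm_fit and xgb_fit from the packages discussed in this chapter, could read as follows.

    # VI-scores as reported by standard software; gbm_fit and xgb_fit are
    # assumed previously fitted boosting models (hypothetical objects)
    library(gbm)
    library(xgboost)

    summary(gbm_fit, n.trees = 500)     # relative influence, normalized to sum to 100
    xgb.importance(model = xgb_fit)     # 'Gain' column: loss-improvement based VI-score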

Remarks 7.6. • Note that the VI-scores from (7.15) and (7.16) naturally balance covariate components that occur rarely in splits but have large loss reductions against covariate components that occur often in splits but with low loss reductions.

• In practice these VI-scores are often normalized to sum to one.

• When using these VI-scores in GBMs it is typically the case that the importance
is measured on the tree scale, hence, focusing on the loss improvement w.r.t. the


gradient approximations, not w.r.t. the (empirical) loss given by a deviance loss
function.

7.4 State-of-the-art gradient boosting machines

In the previous sections, the basic ideas underpinning (generalized additive) boosting and gradient boosting were presented, with an extra focus on the situation of low-cardinality trees as base learners. These base learners are the most commonly used ones in practice.

When it comes to practical considerations, we have already discussed adding additional


shrinkage factors, in order to avoid taking too large boosting steps, see, e.g., Remark 7.5.
Another consideration, both improving speed and robustness, is to limit the number of
data points used in each iteration by using sub-sampling techniques, i.e., bagging.

Another consideration concerns speed: as already commented on in Section 6.1, trees are theoretically able to handle high-cardinality nominal categorical covariates, but in practice, when greedily searching for optimal binary splits, this may become
problematic w.r.t. computational time. In the standard tree-based GBM implemented in
the R package gbm [194], this is partly alleviated by pre-sorting the covariate components
when searching for (greedy) optimal splits.

LightGBM. A popular high-speed version of GBMs targeting the issue of the high-
cardinality nominal categorical covariates is LightGBM by Ke et al. [116]. The idea
behind LightGBM is to (i) focus on data instances with large gradients, and (ii) to ex-
ploit sparsity in the covariate space. Step (i) is what is referred to as gradient-based one
sided sampling (GOSS) and is an alternative to uniform sampling in bagging, and step
(ii) is what is called exclusive feature bundling (EFB), which is a type of histogram pro-
cedure combining both covariate (feature) selection and merging of covariates (features).
The intuition behind the merging of covariates is that covariates whose one-hot encoded
categories are not jointly active can be merged, which is likely to happen if the covariate
space is large. For more details, see Ke et al. [116]. The LightGBM method has been
proved to be both fast and accurate.
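
A minimal sketch of fitting such a model with the R package lightgbm is given below; the design matrix, the labels, the Poisson objective and all parameter values are illustrative assumptions.

    # minimal LightGBM sketch; X_learn, y_learn and all parameters are illustrative
    library(lightgbm)

    dtrain <- lgb.Dataset(
      data  = as.matrix(X_learn),              # assumed numeric design matrix
      label = y_learn,
      categorical_feature = c("brand_id"))     # high-cardinality categoricals handled natively

    fit <- lgb.train(
      params  = list(objective = "poisson", learning_rate = 0.05,
                     num_leaves = 15, feature_fraction = 0.8),
      data    = dtrain,
      nrounds = 500)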

XGBoost. Another popular fast gradient based boosting procedure is XGBoost, see
Chen–Guestrin [41]. It is based on a functional second order approximation of the loss
function. In terms of the previously used notation for a single instance i, this can be


expressed as the following second order Taylor expansion


   
    L^g( Y_i, g(µ̂^(j−1)(X_i)) + b(X_i; ϑ) )
      ≈ L^g( Y_i, g(µ̂^(j−1)(X_i)) ) + ∇_ϑ L^g( Y_i, g(µ̂^(j−1)(X_i)) ) b(X_i; ϑ) + (1/2) ∇²_ϑ L^g( Y_i, g(µ̂^(j−1)(X_i)) ) b(X_i; ϑ)^2
      ∝ ∇_ϑ L^g( Y_i, g(µ̂^(j−1)(X_i)) ) b(X_i; ϑ) + (1/2) ∇²_ϑ L^g( Y_i, g(µ̂^(j−1)(X_i)) ) b(X_i; ϑ)^2
      =: L^g_XGB( Y_i, g(µ̂^(j−1)(X_i)); b(X_i; ϑ) ),     (7.17)

where ∇²_ϑ denotes the second derivative (Hessian) w.r.t. ϑ. Consequently, even though derivatives appear in L^g_XGB(·) from (7.17), they do not appear in the same way as in the GBMs described by Algorithms 2 and 3. In fact, this second order Taylor expansion (7.17) is related to a Newton step, and the working residuals are suitably scaled by their Hessians before being approximated by the base learners. Thus, effectively, (7.9) is replaced by a Hessian scaled and weighted version. Furthermore, when using low-cardinality trees as base learners applied to L^g_XGB(·) from (7.17), the leaf values for a given part of the partition are given explicitly, and the criterion for finding the greedy optimal split point is given explicitly. Thus, using the approximate loss L^g_XGB(·) from (7.17) combined with trees makes it possible to skip (costly) line searches. Note that this also results in a different type of trees than standard trees that are grown recursively leaf-wise; for more details, see Chen–Guestrin [41]. XGBoost also allows for regularization by using
a penalized deviance loss (log-likelihood), and it can be equipped with histogram-based
techniques handling high-cardinality nominal categorical covariates. For more details,
see Chen-Guestrin [41].
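
A corresponding XGBoost sketch, again with illustrative data and tuning values, could read as follows.

    # minimal XGBoost sketch; X_learn, y_learn and all tuning values are illustrative
    library(xgboost)

    dtrain <- xgb.DMatrix(data = as.matrix(X_learn), label = y_learn)

    xgb_fit <- xgb.train(
      params = list(objective = "count:poisson",    # Poisson deviance loss
                    eta       = 0.05,               # shrinkage (learning rate)
                    max_depth = 3,                  # low-depth trees as base learners
                    lambda    = 1),                 # ridge-type regularization of leaf values
      data    = dtrain,
      nrounds = 500)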

Multi-parametric losses and further extensions. The loss functions considered above have all been effectively one-dimensional, trying to learn an unknown regression function X ↦ µ(X) ∈ R, in the presence of a nuisance parameter φ ∈ R_+ referred to as a dispersion parameter. The above discussed boosting techniques can be naturally
extended to the situation with a functional dispersion φ(X), or more generally, when
the loss function is expressed in terms of a p-dimensional real-valued argument ϑ ∈ Rp ,
and we want to learn an unknown p-dimensional function X 7→ ϑ(X); we use boldface
notation in ϑ to emphasize that this is a multi-dimensional object. This situation is
similar to considering a multi-task (and multi-output) FNN, see Remarks 5.2. Exam-
ples of methods addressing this situation are, e.g., gamboostLSS, see Mayr et al. [149],
NGBoost, see Duan et al. [59], Cyclic GBMs (CGBMs), see Delong et al. [50]. Both
gamboostLSS and CGBMs use cyclic updating over the p parameter dimensions of ϑ. In
addition gamboostLSS allows for component-wise optimization, which means that one
may specify covariate specific base learners. NGBoost uses so-called natural gradients,
which aims for improving speed and stability.



Chapter 8

Deep learning for tensor and


unstructured data

8.1 Introduction
This chapter builds on the feed-forward neural network (FNN) architecture that was
introduced in Chapter 5. The FNN architecture can be seen as a prototype of more
sophisticated deep learning architectures, with the recurrent goal of an optimal feature
extraction for predictive modeling. The FNNs discussed in Chapter 5 act on so-called
tabular input data, which means that one has a q-dimensional cross-section X i,t ∈ Rq of
(structured) real-valued input data over all instances 1 ≤ i ≤ n at a given time point t.
It is useful to interpret t as discrete time. More generally, it is just a positional index.
This structured data has the format of q-dimensional real-valued vectors, and it is called
tabular because we can collect the cross-sectional input data (X i,t )ni=1 at a given time
point t in a table Xt , resulting in the design matrix at time t
 
    X_t = [X_{1,t}, . . . , X_{n,t}]^⊤ = ( X_{i,t,l} )_{1≤i≤n, 1≤l≤q} ∈ R^{n×q}.

Compared to (2.10), we drop the intercept (bias) component. This describes the covariate
information at time t for predicting the responses (Yi,t )ni=1 .

Naturally, this allows for a time-series extension, called panel data or longitudinal data.
If one has only one instance, one typically drops the instance index i, and in that case
one speaks about time-series data. The time-series data of a given instance comprises
responses, covariates and volumes, respectively, given by

Y1:t = (Y1 , . . . , Yt )⊤ ∈ Rt ,
X 1:t = (X 1 , . . . , X t )⊤ ∈ Rt×q ,
v1:t = (v1 , . . . , vt )⊤ ∈ Rt .

We do not use boldface notation in Y1:t and v1:t to highlight that these are time-series of
one-dimensional variables.


We can illustrate this data by mapping the time to the vertical axis
     
    Y_{1:t} = (Y_1, . . . , Y_t)^⊤,   X_{1:t} = ( X_{u,l} )_{1≤u≤t, 1≤l≤q}   and   v_{1:t} = (v_1, . . . , v_t)^⊤,

where the row index u = 1, . . . , t runs over time (the vertical axis).

A major change is that the input vector of a given instance at time t, X t ∈ Rq , is


replaced by an input tensor of order 2 (called 2D tensor), given by X 1:t ∈ Rt×q and
describing the entire historical input data. 2D tensors are matrices, but, more generally,
tensors can have any integer order. The 2D tensor X 1:t has a time-causal structure,
i.e., the adjacency in time index t has a specific meaning, and the main question is how
to design network architectures that respect (benefit from) this adjacency. Naturally,
classical FNN architectures that are designed for input vectors, and not time-series data,
do not respect time-causality.
This tensor approach can be made fully general, i.e., it can be extended from 2D tensors
for time-series data to tensors of any order. We give an example of a 3D tensor having
a spatial structure.
Example 8.1 (RGB color image). An example for a 3D tensor is a color image

X 1:t,1:s = (Xu,v,j )1≤u≤t, 1≤v≤s, 1≤j≤3 ∈ Rt×s×3 .

This color image has a spatial structure described by the first two indices (u, v), in fact,
in this example we have a rectangle with t pixels on the x-axis and s pixels on the y-axis.
The last index j then labels the three color channels red-green-blue (RGB). This is the
typical way of encoding color pictures into a 3D tensor. An example is given in Figure
8.1 for a 30 × 30 color picture. ■

Figure 8.1: RGB channels for a 30 × 30 color picture: R channel, G channel, B channel,
and overlap of the three channels.
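
In R, such tensors are plain multi-dimensional arrays; the following sketch builds a random 30 × 30 RGB image and stacks n of them into a 4D design tensor, purely for illustration.

    # a random 30 x 30 RGB image as a 3D array, and a 4D design tensor of n such images
    t <- 30; s <- 30; n <- 100
    img <- array(runif(t * s * 3), dim = c(t, s, 3))         # one image, channels last
    X   <- array(runif(n * t * s * 3), dim = c(n, t, s, 3))  # design tensor over all instances
    dim(X)    # 100 30 30 3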

The common approach of using unstructured data such as texts, speech and images in
predictive models, is to map such unstructured data to tensors. For images, this is solved
as in Example 8.1. For texts and speech, this is achieved by an entity embedding which
uses tokenization. That is, we tokenize speech by assigning integers to all words, and
then we apply an entity embedding (2.15) to these integers and words, respectively. This
turns sentences (and speech) into 2D tensors of shape Rt×b , with t being the length of
the sentence and b the dimension of the entity embedding. We discuss this in careful
detail in Section 8.2, below.


Typical image recognition problems

Image recognition problems are very different from actuarial problems. A classical
image recognition problem is to determine whether there is a cat, Y = 0, or a
dog, Y = 1, on an image, thus, one tries to determine Y from an RGB image
X_{1:t,1:s} ∈ R^{t×s×3}. We purposefully wrote ‘determine’ for this image recognition task, because this is not a forecasting problem. There is no randomness in terms of
irreducible risk involved in this cat vs. dog image recognition problem. Therefore,
we will not elaborate any further on such image recognition problems, but we
will focus on actuarial forecasting problems, below, where the irreducible risk is a
significant factor in the prediction task. The techniques though are the same, but
we put different emphasis on the different features.

8.2 Tokenization and the art of entity embedding


Any (input) data for predictive modeling needs to be in tensor form. This section discusses tokenization of unstructured (text) data, which is the process of mapping unstructured text data to tensors. Note that we focus on texts and speech, because images can be treated as in Example 8.1.

8.2.1 Tensors
The input data to networks (and regression functions) is usually in tensor form. For
single instances, the input information is either a single vector X ∈ Rq (1D tensor), a
time-series X 1:t ∈ Rt×q (2D tensor) or a spatial image X 1:t,1:s ∈ Rt×s×q (3D tensor),
where we generally assume that a color image has three color channels expressed by
setting q = 3, see Example 8.1 and Figure 8.1. If we have a black-and-white picture, we
typically want to preserve this spatial structure and, therefore, use a 3D tensor with a
single gray color channel X 1:t,1:s ∈ Rt×s×1 , i.e., we set q = 1. In this notation, typically,
the first indices of the tensor describe a time-series or a spatial structure, and the last
index (referring to q) are the channels.
Having multiple instances 1 ≤ i ≤ n increases the tensors by one order, e.g., for images
as covariates, we have an input, 4D design tensor, over all instances

X = [X 1,1:t,1:s , . . . , X n,1:t,1:s ] ∈ Rn×t×s×3 .

Assuming independence between the instances then applies to the first index 1 ≤ i ≤ n
of this 4D design tensor X.

8.2.2 (Supervised) entity embedding


Before turning our attention to unsupervised word embedding, we revisit supervised entity
embedding, which links our discussion to Section 2.3.2. We discussed that it is very common in actuarial problems to have many categorical covariates, and quite a few of those may have many levels, i.e., may be high-cardinality categorical; examples are car
brands, vehicle models, provinces, job profiles, etc. Dummy or one-hot encoding then


results in high-dimensional input tensors. A similar situation occurs for unstructured text
data. The Oxford English Dictionary estimates that there are roughly 170,000 English
words in regular use, which results in an input dimension of 170,000 if one uses one-hot
encoding for these words. Aiming at making the input data smaller, we revisit entity
embedding discussed in (2.15). Select a (small) embedding dimension b ∈ N, and consider
the entity embedding
eEE : A → Rb , X1 7→ eEE (X1 ), (8.1)
where A = {a1 , . . . , aK } is the set of all levels of a categorical covariate component
X1 contained in X (for the moment we do not assume that all components of X are
real-valued).
This results in K embedding weights e^EE_k := e^EE(a_k) ∈ R^b, 1 ≤ k ≤ K. In FNN fitting, these embedding weights are part of the network parameter ϑ, and they are learned with SGD that aims at making a strictly consistent loss function (generally) small

    L(ϑ; L) = ∑_{i=1}^{n} (v_i/φ) L(Y_i, µ_ϑ(X_i)) \overset{!}{=} \min,

see Section 5.3. Hence, the embedding weights involved in (8.1) are learned in a su-
pervised learning manner using the targets (responses) (Yi )ni=1 . Changing the targets
will lead to different embeddings. E.g., job profiles impact accident or liability insur-
ance claims differently, and using these two different responses will result in different
embeddings of the job profiles. In contrast, we could also try to use an unsupervised
learning embedding, which requires that we can put the categorical covariates into some
context. This unsupervised learning embedding will be discussed in Section 8.2.3, and it
also relates to the clustering methods studied in Section 9, below.
Often, gradient descent fitting does not work well if one has many high-cardinality cat-
egorical covariates. High-cardinality categorical covariates give a significant potential
for over-fitting, and, as a result, usually gradient descent methods exercise a very early
stopping time. In such cases, it is beneficial to regularize the embedding, similarly to
Section 2.4. Assume that we have one categorical covariate Xi,1 ∈ A in X i with K levels.
This gives the embedding weights (e^EE_k)_{k=1}^K ⊂ R^b; these embedding weights are part of
the network weights ϑ. Using ridge regularization (2.22) on these embedding weights,
motivates to consider the regularized loss
    ∑_{i=1}^{n} (v_i/φ) L(Y_i, µ_ϑ(X_i)) + (η/φ) ∑_{k=1}^{K} ‖e^EE_k‖_2^2,     (8.2)

for regularization parameter η > 0. There is one point that we want to highlight; this has
been discussed in Richman–Wüthrich [192]. The regularized loss (8.2) balances between
the sample size n and the number of occurrences of a given categorical level ak ∈ A.
The issue in SGD training is that one does not consider the loss simultaneously over the
entire learning sample L, but only over the random (mini-)batch used for the next SGD
step, see (5.14). As a result, (8.2) cannot be evaluated in SGD, because we only see one
mini-batch at a time. To account for this issue, we change the regularized loss (8.2) into
a different form. Taking the notation from Avanzi et al. [7], we define

k[i] = {k ∈ {1, . . . , K}; Xi,1 = ak } ,


that is, k[i] indicates to which level ak[i] the categorical variable Xi,1 of instance 1 ≤ i ≤ n
belongs to. We define the totally observed exposure on each level 1 ≤ k ′ ≤ K by
n
X n
X
vk+′ = vi 1{k[i]=k′ } = vi 1{Xi,1 =ak′ } .
i=1 i=1

This allows us to rewrite the regularized loss (8.2) as follows


 
    (1/φ) ∑_{i=1}^{n} v_i ( L(Y_i, µ_ϑ(X_i)) + (η / v^+_{k[i]}) ‖e^EE_{k[i]}‖_2^2 ).     (8.3)

The crucial difference of this latter expression to (8.2) is that the regularization is inte-
grated into the round brackets under the i-summation. This allows one to apply SGD on
partitions (mini-batches) of {1, . . . , n}; this requires that one changes the loss function
correspondingly in SGD implementations, and that one equips all instances 1 ≤ i ≤ n with the volume information v^+_{k[i]}, so that on every instance i (and batch) one can evaluate both terms under the i-summation in (8.3).
We observe from (8.3) that the regularization on individual instances i is inversely proportional to the volumes v^+_{k[i]}, and more frequent levels a_k receive a less strong regularization
towards zero compared to scarce ones. This scaling is quite common, and it is a natural
consequence of a Bayesian interpretation of this regularization approach; see Richman–
Wüthrich [192]. This reference also discusses regularized entity embedding in case of
hierarchical categorical covariates, e.g., vehicle brand - vehicle model - vehicle details
build a natural hierarchy, and a certain Toyota make cannot appear under a Volkswagen
brand. This may give further regularization restrictions, e.g., similar to fused regulariza-
tion, see (2.27), we can bring hierarchies into a context.
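
In R keras, a supervised entity embedding with a plain ridge penalty as in (8.2) can be sketched as follows; the exposure-weighted penalty (8.3) would additionally require a custom loss. The number of levels, the embedding dimension, the regularization parameter and the Poisson architecture are assumptions for illustration only.

    # sketch of a supervised entity embedding with ridge penalty (8.2) in R keras;
    # K_levels, b, eta and the architecture are illustrative choices
    library(keras)

    K_levels <- 100; b <- 2; eta <- 0.001

    cat_in <- layer_input(shape = 1, name = "categorical_level")   # integer-coded level
    num_in <- layer_input(shape = 5, name = "numeric_covariates")

    embed <- cat_in %>%
      layer_embedding(input_dim = K_levels, output_dim = b,
                      embeddings_regularizer = regularizer_l2(eta)) %>%
      layer_flatten()

    output <- layer_concatenate(list(embed, num_in)) %>%
      layer_dense(units = 20, activation = "tanh") %>%
      layer_dense(units = 1, activation = "exponential")   # log-link mean

    model <- keras_model(inputs = list(cat_in, num_in), outputs = output)
    model %>% compile(loss = "poisson", optimizer = "adam")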

8.2.3 (Unsupervised) word embedding


Fitting a model (8.3) gives a supervised (and regularized) entity embedding. When it
comes to large unstructured inputs such as texts, this regularized approach may not work.
Therefore, in an initial step, one first tries to get an entity embedding in an unsupervised
learning manner, which can then be further processed. To arrive at an unsupervised
embedding, we need to be able to bring categorical covariates into a context. This is most
easily understood in text recognition problems where, e.g., certain verbs only appear in
the context of certain nouns and activities. We discuss this in this section and we use
texts (sentences) as our basic example.
Compared to the previous section, we change the notation from k ∈ {1, . . . , K} to w ∈
{1, . . . , W }. The index k has been labeling the levels ak ∈ A = {a1 , . . . , aK } of a
categorical covariate, in the present section it is more natural to associate w with words
in a text recognition context. Assume there are W ∈ N words (aw )W w=1 in the entire
considered vocabulary A. To make the notation simpler, we tokenize these words aw by
their integer indices 1 ≤ w ≤ W , that is, the token w corresponds to the word aw in
vocabulary A; the ordering is completely irrelevant for our purposes. Thus, the entire
vocabulary A is tokenized by the integers

W = {1, . . . , W } ⊂ N and W0 = W ∪ {0},


where we add a token zero for an empty word. This is going to be useful if we need
to bring different sentences to equal length. In machine learning jargon, this is called
padding shorter sentences with zeros to equal length.
A sentence consists of different words aw and their tokens w, respectively, and the order
of the words and tokens matters for the meaning of the sentence. Therefore, we use the
positional index t ∈ N to indicate the position of a word in a sentence.
A sentence text of length T is given by

text = (w1 , . . . , wT ) ∈ W0T .


Remark. We speak about words and sentences or texts. This determines the units of the
tokenization, i.e., turning words into tokens. However, there is no restriction in selecting
words as units. One can also tokenize parts of speech (bigger units) or syllable and
morphemes that make up words (smaller units). The methodology will be completely
the same. We present it on the level of words because this is the most intuitive and
natural unit to discuss the technology.

Bag-of-words

The method of bag-of-words is the most crude one to make a text = (w1 , . . . , wT ) numer-
ical for predictive modeling. It drops the positional index and it defines the bag-of-word
mapping
    ψ : W_0^T → N_0^W,   text ↦ ψ(text) = ( ∑_{t=1}^{T} 1_{{w_t=w}} )_{w∈W}.     (8.4)
This is called bag-of-words because one places all words into the same bag. As a result,
one loses the order and the positional index, and one only counts how often a certain
word appears in text. This is very crude because, e.g., the following two sentences
provide the same bag-of-words: ‘the car is red’ and ‘is the car red’, but their meaning
is rather different. That is, the semantics of the sentence gets lost by the bag-of-words
embedding (by dropping the positional index). Moreover, the range NW 0 of ψ is very
high-dimensional, and ψ(text) is likely scarce if the text is small and the vocabulary
large. In this approach, one often removes so-called stop words such as ‘at’, ‘to’, ‘the’,
etc., to put more emphasis on the more important parts of the sentences and to reduce
the dimension of the vocabulary.
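
The bag-of-words mapping (8.4) is straightforward to code in base R; the small claims texts below are invented for illustration, and the first two rows of the output coincide, illustrating the loss of word order.

    # bag-of-words encoding (8.4) in base R; example texts are invented
    texts <- c("the car is red", "is the car red", "water damage in the house")

    tokens <- strsplit(tolower(texts), "\\s+")     # split sentences into words
    vocab  <- sort(unique(unlist(tokens)))         # vocabulary A

    bow <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
    bow   # each row counts how often each word of the vocabulary appears in a text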

Word embeddings: unsupervised learning

Referring to Bengio et al. [20, 21, 22], we start with an entity embedding of the vocabulary
A and its tokenization W0 , respectively. Select an embedding dimension b ≪ W and
consider the word embedding (WE)

eWE : W0 → Rb , w 7→ eWE (w), (8.5)

that assigns to each token w an embedding vector eWE (w). In an unsupervised learning
manner, one tries to learn the embedding vectors from their contexts. E.g., ‘I’m driving
by car to the city’ and ‘I’m driving my vehicle to the town center’ uses similar words in
a similar context. Therefore, their embedding vectors should be close because they are


almost interchangeable. The goal is to learn such similarity in the meanings from the
context in which these words are used. For this, consider a sentence

text = (w1 , . . . , wt−1 , wt , wt+1 , . . . , wT ) ,

where the positional indices t ∈ N become important now.

Question: Can we predict the word wt by knowing that it is sandwiched between


wt−1 and wt+1 ?

Assume we have a collection of different sentences

C = {text = (w1 , . . . , wT )} ,

to which we can assign positive probabilities

p(text) = p(w1 , . . . , wT ) > 0. (8.6)

These probabilities should reflect the frequencies of the sentences text = (w1 , . . . , wT )
in speeches and texts (in the domain we are interested in).
Applying Bayes’ rule, we can determine how likely a certain word wt ∈ W0 occurs in a
given sentence text = (w1 , . . . , wT ) at position t

    p( w_t | w_1, . . . , w_{t−1}, w_{t+1}, . . . , w_T ) = p(w_1, . . . , w_T) / p(w_1, . . . , w_{t−1}, w_{t+1}, . . . , w_T).     (8.7)

In general, these probabilities (8.6)-(8.7) are unknown, and they need to be estimated
(learned) from a learning sample L. Learning these probabilities will be based on em-
bedding the tokens into low dimensional spaces, and this is precisely the step where the
word embedding (8.5) is learned. There are two classic approaches for this: word-to-
vector (word2vec) by Mikolov et al. [156, 157] and global vectors (GloVe) by Pennington
et al. [175] and Chaubard et al. [40]. We describe these two methods next, this description
is taken from Wüthrich–Merz [243, Chapter 10].

There are two opposite ways of learning the (conditional) probabilities p:

(i) One can try to predict the center word wt from its context w1 , . . . , wt−1 , wt+1 , . . . , wT
as described in (8.7). In this approach, to reduce complexity, one often neglects the
positional indices of the context words, and one considers the bag-of-words {ws }s̸=t
instead. This method is called continuous bag-of-words (CBOW).

(ii) One can revert the problem and try to predict the context from the center word wt

p ( w1 , . . . , wt−1 , wt+1 , . . . , wT | wt ) . (8.8)

This is obtained by Bayes’s rule from (8.7). This gives the skip-gram approach.


We present word2vec of Mikolov et al. [156, 157] in detail. For this we discuss the
two different approaches:

(1) Skip-gram approach of predicting the context from the center word;

(2) CBOW approach of predicting the center word from the context.

Both of these two approaches are computationally intensive:

(3) Negative sampling is a method of coping with the computational complexity


in the skip-gram and the CBOW approach.

As a result of these approaches, we receive the word embeddings eWE (w), be-
cause they enter the probabilities (8.7) and (8.8) through a cosine similarity and
a softmax implementation.

Word2vec: skip-gram approach

For the skip-gram approach one tries to determine the probabilities (8.8) from a learning
sample L = (texti )ni=1 of different sentences. Since this problem is too complex in its
full generality, one solves a simpler problem.

(1) First, one restricts to a fixed small context (window) size c ∈ N, and one tries to
find the probabilities in this context window of wt , given by

p ( wt−c , . . . , wt−1 , wt+1 , . . . , wt+c | wt ) .

(2) Second, one assumes conditional independence of the context words, given the
center word wt .

Of course, the second assumption is generally not satisfied by real texts, but it signifi-
cantly simplifies the estimation problem. In particular, this crude (wrong) version is still
sufficient to receive a good word embedding (8.5), which is our main incentive to look
at this method. Under this conditional independence assumption, we have log-likelihood
for learning sample L and for given context size c ∈ N
    ℓ_L = ∑_{i=1}^{n} ∑_t ∑_{−c≤j≤c, j≠0} log p( w_{i,t+j} | w_{i,t} ).     (8.9)

Our goal is to maximize this log-likelihood ℓL in the conditional probabilities p(·|·) to


learn the most common context words of a given center word wt . If we embed all tokens
w ∈ W0 , using a word embedding (8.5), we can learn the embeddings eWE (w) ∈ Rb
by letting them enter the conditional probabilities p(·|·). There is one point though
that needs to be considered. We need two different word embeddings e(1) (w) ∈ Rb and
e(2) (w) ∈ Rb for center and context words, respectively, as these two different usages play
different roles in the conditional probabilities in (8.9).


Assume that the conditional probabilities in (8.9) can be modeled by the softmax function
    p( w_s | w_t ) = exp⟨e^(1)(w_t), e^(2)(w_s)⟩ / ∑_{w=1}^{W} exp⟨e^(1)(w_t), e^(2)(w)⟩ ∈ (0, 1).     (8.10)

Thus, if the scalar (dot) product between e^(1)(w_t) and e^(2)(w_s) is large, we get a high probability that w_s is in the context of the center word w_t. (This scalar product is related to the cosine similarity between two vectors a, b ∈ R^b, defined by ⟨a, b⟩/(∥a∥_2 ∥b∥_2); this is a popular similarity measure in machine learning, and we also refer to Definition 9.2 for the definition of a dissimilarity function.)
Inserting (8.10) into (8.9), we receive a log-likelihood function ℓL in the two word em-
bedding mappings

e(1) : W0 → Rb and e(2) : W0 → Rb . (8.11)

Maximizing this log-likelihood ℓL for the given learning sample L gives us the two (dif-
ferent) word embeddings. The optimization is done by variants of the SGD algorithm,
the only difficulty is that this high-dimensional problem can result in very expensive
computations, and negative sampling is a method that can circumvent this problem. We
discuss this in the next subsection.


Figure 8.2: Word2vec skip-gram embedding with embedding dimension b = 2, obtained by negative sampling; this figure is taken from Wüthrich–Merz [243, Figure 10.2].

Figure 8.2 gives a low-dimensional example with embedding dimension b = 2. It shows


frequent words in claims texts, and the embeddings correspond to the center words
e(1) (w) obtained by a word2vec skip-gram approach. The red words show the insured
hazards, and we observe that the remaining words cluster around these insured hazards,


e.g., ‘house’ is next to ‘water’ or ‘pole’ is next to ‘vehicle’. We have chosen an embedding
dimension of b = 2 to nicely illustrate the results; typically, word embedding dimensions
range from 50 to 300.

Negative sampling

For computational reasons, it can be difficult to solve the word2vec skip-gram approach (8.9)-(8.10): the categorical distribution in (8.10) has W different levels, and likewise the input has this cardinality. Negative sampling turns this learning problem into a supervised learning problem of a lower complexity; see Mikolov et al. [157].
For this, we consider the pairs (w, w̃) ∈ W × W of center words w and context words w̃. To each of these pairs we add a binary response variable Y ∈ {0, 1}, resulting in observation (Y, w, w̃). There will be two types of center-context pairs, real ones that are obtained from the learning sample L and fake ones that are generated purely randomly.
We construct these two types of pairs as follows:

(1) We extract all center-context pairs (w, w̃) from the learning sample L and we assign a response Y = 1 to these pairs, indicating that these are true pairs. This gives us the first part of the learning data, denoted by L_1 = (Y_i = 1, w_i, w̃_i)_{i=1}^n.

(2) We take all real pairs (w_i, w̃_i)_{i=1}^n, and we randomly permute the index of the context word, indicated by a permutation π. This gives us a second (fake) learning data set; we shift the index by n for later purposes: L_2 = (Y_{n+i} = 0, w_{n+i}, w̃_{n+π(i)})_{i=1}^n, with Y = 0 as response.

Merging real and fake learning data gives us a learning sample L = L1 ∪ L2 of sample size
2n. This now allows us to turn the unsupervised learning problem into a supervised
logistic regression problem by studying the new log-likelihood

    ℓ_L = ∑_{i=1}^{2n} log P[ Y = Y_i | w_i, w̃_i ]
        = ∑_{i=1}^{n} log( 1 / (1 + exp⟨−e^(1)(w_i), e^(2)(w̃_i)⟩) ) + ∑_{k=n+1}^{2n} log( 1 / (1 + exp⟨e^(1)(w_k), e^(2)(w̃_k)⟩) ).

The first n instances 1 ≤ i ≤ n come from the real data L1 , and the second n instances
n + 1 ≤ k ≤ 2n from the fake data L2 with the π-permuted context words. The two
parts of the log-likelihood then correspond to the logistic probabilities for the responses
Yi = 1 and Yk = 0 being real or fake, respectively. Maximizing this log-likelihood ℓL , we
can learn the two embeddings (8.11). The example in Figure 8.2 has been obtained in
this way.
For SGD training to work properly in this negative sampling learning, one should ran-
domly permute the instances in L = L1 ∪ L2 , to ensure that all (mini-)batches contain
instances of both types.
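
The negative sampling log-likelihood above is a logistic regression in the two embeddings; a minimal R keras sketch of this architecture (center and context embedding layers, dot product, sigmoid output) is given below, where the vocabulary size W, the embedding dimension b and the training arrays (tokenized center/context pairs and 0/1 responses) are assumed inputs.

    # sketch of word2vec negative sampling as a logistic regression in keras;
    # W, b and the training arrays (center, context, y) are assumed inputs
    library(keras)

    W <- 10000; b <- 50

    center_in  <- layer_input(shape = 1, name = "center_token")
    context_in <- layer_input(shape = 1, name = "context_token")

    e1 <- center_in  %>% layer_embedding(input_dim = W + 1, output_dim = b) %>% layer_flatten()
    e2 <- context_in %>% layer_embedding(input_dim = W + 1, output_dim = b) %>% layer_flatten()
    # input_dim = W + 1 accounts for the padding token 0 in W_0

    score <- layer_dot(list(e1, e2), axes = 1)          # <e1(w), e2(w~)>
    prob  <- score %>% layer_activation("sigmoid")      # logistic probability of a 'real' pair

    model <- keras_model(inputs = list(center_in, context_in), outputs = prob)
    model %>% compile(loss = "binary_crossentropy", optimizer = "adam")

    # model %>% fit(list(center, context), y, batch_size = 256, epochs = 5, shuffle = TRUE)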


Word2vec: continuous bag-of-words

For the CBOW method we aim at predicting the center word wt from its context, see
(8.7). As in the skip-gram approach, we select a fixed context (window) size c ∈ N. For
given learning sample L, this provides us with the log-likelihood
    ∑_{i=1}^{n} ∑_t log p( w_{i,t} | w_{i,t−c}, . . . , w_{i,t−1}, w_{i,t+1}, . . . , w_{i,t+c} ).

To solve this problem, we need again to reduce the complexity. As in the bag-of-words
approach (8.4), we drop the positional index t. Moreover, for the (continuous) CBOW
version, we average over the bag-of-words to receive the average embedding of the context
words
    ē^(2)_{i,t} = (1/(2c)) ∑_{−c≤j≤c, j≠0} e^(2)(w_{i,t+j}),

for the context word embedding e(2) , see (8.11). This averaging can be done because
the word embedding gives a numerical representation to the context words. The CBOW
approach then considers the following log-likelihood
    ℓ_L = ∑_{i=1}^{n} ∑_t log p( w_{i,t} | ē^(2)_{i,t} )
        = ∑_{i=1}^{n} ∑_t log( exp⟨e^(1)(w_{i,t}), ē^(2)_{i,t}⟩ / ∑_{w=1}^{W} exp⟨e^(1)(w), ē^(2)_{i,t}⟩ )     (8.12)
        = ∑_{i=1}^{n} ∑_t ( ⟨e^(1)(w_{i,t}), ē^(2)_{i,t}⟩ − log ∑_{w=1}^{W} exp⟨e^(1)(w), ē^(2)_{i,t}⟩ ).

Thus, we measure the similarity between the center word embedding e^(1)(w) and its average context word embedding ē^(2)_{i,t}. From this we can again learn the two embeddings
(8.11) using a version of the SGD algorithm to minimize the negative log-likelihood.
Compared to skip-gram, CBOW is usually faster in fitting, but skip-gram performs better
on less frequent words. Naturally, we can apply a negative sampling version to CBOW,
(2)
by randomly permuting the average context words ēi,t , and then designing a logistic
regression that tries to identify the true and the fake pairs.

Global vectors algorithm

Whereas word2vec is based on solid statistical methods, using well-defined and explain-
able log-likelihoods, GloVe is a word embedding approach that is more of an engineering
type. GloVe was developed by Pennington et al. [175] and Chaubard et al. [40].
GloVe is more in the sense of clustering; clustering is going to be presented in Section
9.3, below. Select a fixed context size c ∈ N and count the different context words w e
in the context window of the given center word w ∈ W. This defines the matrix of
co-occurrences
 
    C = ( C(w, w̃) )_{w,w̃∈W} ∈ N_0^{W×W}.


Matrix C is a symmetric matrix, and typically it is sparse as many words do not appear
in the context of other words (on finitely many texts). Empirical analysis and intuitive
arguments lead to an approach of approximating this co-occurrence matrix by
    log C(w, w̃) ≈ ⟨e^(1)(w), e^(2)(w̃)⟩ + α_w + β_{w̃},

with intercepts α_w, β_{w̃} ∈ R; see Pennington et al. [175]. To ensure that everything is well-defined, Pennington et al. [175] come up with the following objective function to be minimized

    ∑_{w,w̃∈W} χ( C(w, w̃) ) ( log C(w, w̃) − ⟨e^(1)(w), e^(2)(w̃)⟩ − α_w − β_{w̃} )^2,

with a weighting function that takes care of zero co-occurrences



    x ≥ 0 ↦ χ(x) = ( (x ∧ x_max) / x_max )^γ,

for hyper-parameters xmax > 0 and γ > 0. From this we can again learn the two
embeddings (8.11) using the available learning data L.
Clearly, GloVe is more difficult to implement and to fine-tune than the word2vec methods;
some small scale examples in an insurance context are given in Wüthrich–Merz [243,
Chapter 10]. This short introduction was not meant to explain GloVe to the level of
explicit implementation and reasoning why it is sensible, but for GloVe (as well as for
other methods) there is a large pre-trained version available that can be downloaded;2
other pre-trained open-source models that can be downloaded include, e.g., spaCy3 and
FastText.4
These pre-trained word embeddings are ready to use, and they can be downloaded in
different scales and embedding dimensions. A point that needs careful attention is that
these word embeddings have been trained on a large corpus of texts from internet. These
texts consider any sorts of topics. When it comes to a specific use of such pre-trained
libraries, this needs some care because certain words have different meanings in different
contexts. Wüthrich–Merz [243, Section 10] computed a non-life insurance example that
considered insurance coverage of public institutions. In this example, the word Lincoln
appears in several claims texts. Lincoln is a former US president, there are Lincoln
memorials, there are towns called Lincoln, there is a Lincoln car brand, there are restau-
rants named Lincoln, but in the claims texts Lincoln is the school insured. Therefore, a
pre-trained embedding may not be fully suitable for the purpose needed, because specific
insurance related terminology may not have been used while training the embedding.
This will require additional training of the pre-trained libraries to the specific purpose,
while fitting the entire predictive model. Nevertheless, having a pre-trained data basis is
often an excellent starting point for an actuarial application, and, in the sense of transfer
learning, a pre-trained library can be refined for the specific task to be solved.
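
Such pre-trained embeddings typically come as plain text files with one word and its embedding vector per line; a small loading sketch, with the file name assumed, could read as follows.

    # loading pre-trained GloVe word vectors from a plain text file (file name assumed)
    library(data.table)

    glove <- fread("glove.6B.50d.txt", header = FALSE, quote = "", data.table = FALSE)
    emb   <- as.matrix(glove[, -1])       # embedding vectors, here b = 50
    rownames(emb) <- glove[[1]]           # the words themselves

    emb["car", 1:5]                       # first five embedding components of 'car'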
2 https://nlp.stanford.edu/projects/glove/
3 https://spacy.io/models/en#en_core_web_md
4 https://fasttext.cc


8.2.4 Summary of Section 8.2


After this section, all input data has been tokenized and embedded, and we are equipped
with input tensors, that can be of different form. They can be of

• vector form (1D tensor) from tabular data;

• matrix form (2D tensor) from time-series and text data;

• tensor form of order 3 (3D tensor) from image data.

Naturally, we can simultaneously use all these different input formats by designing suit-
able network architectures. E.g., we can use a FNN on the tabular input data providing
us with a first intermediate output, we can use a recurrent neural network (RNN) on the time-series and text data providing us with a second intermediate output, and we can use a convolutional neural network (CNN) on the image data providing us with a third
intermediate output. These intermediate outputs are concatenated and then further pro-
cessed through a FNN providing a predictor variable. Such an example is illustrated in
Figure 8.3.

Figure 8.3: Network architecture to process different types of input data.

FNNs have already been introduced in Chapter 5, and the following sections are going
to be devoted to RNNs, CNNs as well as transformers, which is another popular way of
dealing either with time-series data or with tabular data.

8.3 Convolutional neural networks


When data is represented as 2D or 3D tensors, as discussed in the previous sections, and
characterized by large size, applying traditional FNN layers can be inappropriate. This
relates to three issues.

(1) FNNs ignore spatial and/or time-series structure in the data, treating input ele-
ments equally regardless of their relative proximity in the positional index.

(2) FNNs demand a large number of parameters, leading to significant computational costs.

(3) FNNs cannot deal with time-series observations whose length increases over time.


To solve these issues, specialized network architectures like convolutional neural networks
(CNNs) and recurrent neural networks (RNNs) have been introduced. In the present
section, we study CNNs, introduced by LeCun–Bengio [130]. CNNs derive their name
from the use of the convolution operator to extract features from the input tensors.

CNNs differ from FNNs in two key aspects:

• local connectivity; and

• parameter sharing.

First, regarding local connectivity, each unit (neuron) in a CNN layer is connected only
to a localized region (window) of the input, known as the receptive field. The resulting
weight matrices, called filters, present a smaller size than the entire input data because
they focus on the receptive field (window) only, and the features extracted depend only
on that portion of the input data.
Second, CNN layers employ parameter sharing, wherein the same filter is applied across
all different regions of the input. That is, by sliding the filter across the input surface,
we compute features in each receptive field using the identical filter (single set of weight
matrix). One can imagine a rolling window similar to time-series applications.

=⇒ This design of local connectivity and parameter sharing significantly reduces the
number of parameters to be learned compared to FNNs layers.

Different CNN layers are available; they differ in the number of dimensions over which the convolutional operation is applied. The choice of dimension depends on the characteristics of the input data and the specific prediction task. In the following sections, we formally describe the mechanisms of 1D and 2D CNN layers. However, it should be noted that these principles can be generalized to any higher dimension.

8.3.1 1D convolutional neural networks

1D CNNs are specifically designed to handle data organized in a one-dimensional


grid-like structure, such as sequences or time-series (panel) data.

1D CNN architectures apply a convolution operation along a single axis (position/time


axis) making them effective for capturing sequential patterns and temporal dependencies
in the data. In this sense, the filters act as rolling windows over the input data.
When applying a 1D CNN layer the modeler needs to specify some hyper-parameters.
Among them, the kernel size K ∈ N and the stride δ ∈ N play a key role since they define
the size and the number of receptive fields.

• The kernel size defines the length of the filters (window size) used in the convolu-
tional operation. This parameter determines the size of the receptive field, or the
range of input values each filter can ‘see’ at a time. A larger kernel size allows the
model to detect features that span longer sequences. However, this also increases
the number of parameters and the computational complexity of the model.


• The stride specifies the step size with which the kernel moves along the input
sequence during the convolution process. A smaller stride (e.g., δ = 1) results in
overlapping receptive fields, which can provide a more detailed representation of
the input but at the cost of higher computational costs. On the other hand, a
larger stride (e.g., δ ≥ 2) reduces the overlap, leading to faster computation at the
risk of losing information.

Figure 8.4: CNN filter of kernel size K = 3 and using (lhs) stride δ = 1 and (rhs) stride
δ = 3.

A 1D CNN layer can be thought of as operating like a rolling window procedure. The
kernel directly corresponds to the size of the rolling window, which slides across the
input sequence, examining a fixed number of consecutive elements at each step. The
stride, meanwhile, defines the step size, determining how far the rolling window advances
along the input after each computation. Figure 8.4 shows two examples with kernel size
K = 3, the left-hand side has stride δ = 1 giving overlapping windows on the time axis
t, whereas the right-hand side has stride δ = 3 giving non-overlapping windows. These
input parameters K and δ need to be selected by the modeler and their choices depend on
the trade-off between computational efficiency and the desired level of feature resolution.

1D CNN filter. We begin by illustrating the mechanism of a 1D CNN layer by con-


sidering a single filter for simplicity. Let X_{1:t} ∈ R^{t×q} be the input data matrix. A 1D CNN layer with a single filter of size K and stride δ is a mapping

    z^(1)_1 : R^{t×q} → R^{t′},   X_{1:t} ↦ z^(1)_1(X_{1:t}) = ( z^(1)_{u,1}(X_{1:t}) )_{1≤u≤t′},     (8.13)
where t′ = ⌊(t − K)/δ + 1⌋ ∈ N represents the number of receptive fields. Unit z^(1)_{u,1}(X_{1:t}) ∈ R
is extracted by convolving the filter with the u-th receptive field
    z^(1)_{u,1}(X_{1:t}) = ϕ( w^(1)_{0,1} + ∑_{k=1}^{K} ⟨ w^(1)_{k,1}, X_{(u−1)δ+k} ⟩ ),     (8.14)

with bias term w^(1)_{0,1} ∈ R, filter weights w^(1)_{k,1} ∈ R^q, 1 ≤ k ≤ K, and activation function ϕ : R → R.
The total number of parameters to be learned in a 1D CNN layer with 1 filter is 1 + Kq.
The size of the output of the filter z^(1)_1, defined as the number of receptive fields t′,
depends on K and δ. Remark that (8.14) (almost) takes the form of a mathematical
convolution (up to a sign switch in one of the two indices).


1D CNN layer. In practice, a single filter may not be sufficient to capture the com-
plexity of the features in the input data; this is similar to the number of neurons in a
FNN layer. To address this issue, multiple filters can be concatenated, each learning a
different set of features, i.e., using different filter weights.
A 1D CNN layer with q1 ∈ N filters is a mapping

 
    z^(1) : R^{t×q} → R^{t′×q_1},   X_{1:t} ↦ z^(1)(X_{1:t}) = ( z^(1)_{u,j}(X_{1:t}) )_{1≤u≤t′, 1≤j≤q_1}.     (8.15)

In particular, each element of the matrix, denoted as z^(1)_{u,j}(X_{1:t}), is a unit obtained by convolving the j-th filter with the u-th receptive field

    z^(1)_{u,j}(X_{1:t}) = ϕ( w^(1)_{0,j} + ∑_{k=1}^{K} ⟨ w^(1)_{k,j}, X_{(u−1)δ+k} ⟩ ),     (8.16)

with bias term w^(1)_{0,j} ∈ R and filter weights w^(1)_{k,j} ∈ R^q, 1 ≤ k ≤ K. In this case, the total
number of parameters to be learned is (1 + Kq)q1 .

In this context we have:


• The j-th column of the matrix z (1) ∈ Rt ×q1 given in (8.15), containing the elements
(1) (1) (1)
z1,j , z2,j , . . . , zt′ ,j , represents a set of features extracted by applying the same fil-
ter to the different receptive fields (sliding along the time axis). This provides a
representation of all receptive fields in a common feature space.


• The u-th row of the matrix z^(1) ∈ R^{t′×q_1} given in (8.15), containing the elements z^(1)_{u,1}, z^(1)_{u,2}, . . . , z^(1)_{u,q_1}, represents a set of features obtained by applying q1 different
filters to the u-th receptive field. These features provide multiple representations
of the same receptive field, or in other words, a slice across the different filters.
This enables the model to extract different sets of features from each receptive
field, capturing different aspects of the data and resulting in multi-dimensional
representations; this is analogous to having multiple neurons in a FNN layer.

Figure 8.5 provides a graphical illustration of how a 1D CNN layer with q1 filters works.
It emphasizes that these layers perform computations based on a convolutional operator,
where the filters are convolved across different regions of the input, similar to how the two
functions interact in a mathematical convolution. In the figure, the blue matrix represents
the input data X 1:t . Each filter is applied to the input data, generating a corresponding
set of features that are graphically represented by a colored rectangular block; e.g., the
(1) (1) (1) (1) t′
yellow filter weights (w0,1 , (wk,1 )K
k=1 ) map to the first block z 1 = (zu,1 )u=1 (in yellow).
(1) (1)
The feature vectors z 1 , . . . , z q1 obtained from the q1 filters are then concatenated along
the appropriate dimension, resulting in the matrix (1D tensor) z (1) , see (8.15). This
output matrix can be interpreted as a learned multi-dimensional representation of the
time-series input data (respecting time adjacency), ready to be passed to subsequent
layers for further processing.


Figure 8.5: A graphical illustration of a 1D CNN layer.
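
In R keras, a 1D CNN layer as in (8.15) corresponds to layer_conv_1d; the following sketch stacks it on a time-series input, with all dimensions and tuning values (input length t, channels q, number of filters q1, kernel size K, stride δ) chosen for illustration only, and with an illustrative Poisson output head.

    # 1D CNN sketch in R keras; t, q, q1, K and delta are illustrative choices
    library(keras)

    t <- 24; q <- 3; q1 <- 16; K <- 3; delta <- 1

    input  <- layer_input(shape = c(t, q))                 # 2D tensor input X_{1:t}
    output <- input %>%
      layer_conv_1d(filters = q1, kernel_size = K, strides = delta,
                    activation = "relu") %>%               # features (8.16), of shape t' x q1
      layer_global_average_pooling_1d() %>%                # aggregate over the time axis
      layer_dense(units = 1, activation = "exponential")   # mean prediction with log-link

    model <- keras_model(inputs = input, outputs = output)
    model %>% compile(loss = "poisson", optimizer = "adam")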

8.3.2 2D convolutional neural networks

2D CNNs are specifically designed to process data structured in a two-dimensional


grid, such as images or spatially structured data sets.

The input data consists of three dimensions and is represented as 3D tensors in Rt×s×q ,
where q represents the number of channels (e.g., q = 3 for the RGB color channels in
images, see Example 8.1). The convolution operation involves sliding a small window
across the two axes t and s, while simultaneously processing all q channels at each
position. This enables the model to perform convolution operations along both the
vertical t and horizontal s axes, making it highly effective at detecting spatial patterns
and local structure in the input data X_{1:t,1:s} ∈ R^{t×s×q}. Popular applications of 2D CNNs in the actuarial literature are pattern recognition problems, e.g., for telematics
data, see Gao et al. [72], or mortality modeling, see Wang et al. [231].

In the case of a 2D CNN, both the kernel size and stride are represented as pairs of
elements that specify the size and movement of the filter in two dimensions.

• The kernel size is denoted by K = (Kt , Ks ) ∈ N2 , where Kt represents the height of


the kernel (number of rows), and Ks represents the width of the kernel (number of
columns). This pair defines the filter’s spatial extent, determining the window size
of the regions over which the convolution operation slides and applies the kernel to
the input data.

• The stride in a 2D CNN is represented by δ = (δt , δs ) ∈ N2 , where δt denotes the


vertical stride (the number of rows the filter moves after each step), and δs denotes
the horizontal stride (the number of columns the filter moves after each step).

Also in the case of a 2D CNN layer, the kernel size (Kt , Ks ) and stride δ = (δt , δs ) directly
affect both the size of the output feature map and the computational complexity of the
convolution operation. The output of a 2D CNN layer with a single filter is a feature


matrix with t′ = ⌊(t − K_t)/δ_t + 1⌋ rows and s′ = ⌊(s − K_s)/δ_s + 1⌋ columns. In this case, the total number of receptive fields is given by t′s′.

2D CNN filter. Let X 1:t,1:s ∈ Rt×s×q be the input tensor with q channels. A 2D CNN
layer with a single filter of size Kt × Ks and stride δt × δs is a mapping
    z^(1)_1 : R^{t×s×q} → R^{t′×s′},   X_{1:t,1:s} ↦ z^(1)_1(X_{1:t,1:s}) = ( z^(1)_{u,v,1}(X_{1:t,1:s}) )_{1≤u≤t′, 1≤v≤s′}.     (8.17)

Unit z^(1)_{u,v,1}(X_{1:t,1:s}) is extracted by convolving the filter with the receptive field (u, v)

    z^(1)_{u,v,1}(X_{1:t,1:s}) = ϕ( w^(1)_{0,1} + ∑_{k_t=1}^{K_t} ∑_{k_s=1}^{K_s} ⟨ w^(1)_{k_t,k_s,1}, X_{(u−1)δ_t+k_t, (v−1)δ_s+k_s} ⟩ ),

with bias term w^(1)_{0,1} ∈ R and filter weights w^(1)_{k_t,k_s,1} ∈ R^q, for 1 ≤ k_t ≤ K_t and 1 ≤ k_s ≤ K_s.
The total number of parameters to be optimized is $1 + K_t K_s q$. The entries within a row of the
matrix $z^{(1)}_1(X_{1:t,1:s})$ are obtained by sliding the filter (window) horizontally across the input
data, i.e., the filter moves from left to right, and each column index $v$ corresponds to a different
receptive field along this horizontal pass. Moving from one row to the next corresponds to sliding
the filter vertically over the input data, i.e., from top to bottom, so that each row index $u$
corresponds to a different receptive field along this vertical pass.

2D CNN layer. Also in the case of a 2D CNN layer, multiple filters can be applied.
A 2D CNN layer with $q_1$ filters is a mapping
$$
z^{(1)} : \mathbb{R}^{t\times s\times q} \to \mathbb{R}^{t'\times s'\times q_1}, \qquad
X_{1:t,1:s} \mapsto z^{(1)}(X_{1:t,1:s}) = \Big( z^{(1)}_{u,v,j}(X_{1:t,1:s}) \Big)_{1\le u\le t',\,1\le v\le s',\,1\le j\le q_1},
\tag{8.18}
$$
with $z^{(1)}_j(X_{1:t,1:s}) \in \mathbb{R}^{t'\times s'}$, $1 \le j \le q_1$, being 2D CNN filters (8.17) with different filter
weights. Each element of the $j$-th feature map is computed as the double summation
(convolution)
$$
z^{(1)}_{u,v,j}(X_{1:t,1:s}) = \phi\left( w^{(1)}_{0,j} + \sum_{k_t=1}^{K_t} \sum_{k_s=1}^{K_s}
\Big\langle w^{(1)}_{k_t,k_s,j},\, X_{(u-1)\delta_t+k_t,\,(v-1)\delta_s+k_s} \Big\rangle \right),
\tag{8.19}
$$
where $w^{(1)}_{0,j} \in \mathbb{R}$ is the bias term of the $j$-th filter, and $w^{(1)}_{k_t,k_s,j} \in \mathbb{R}^q$, $1 \le k_t \le K_t$ and
$1 \le k_s \le K_s$, represents the filter weights for that $j$-th filter. The total number of
coefficients to learn in a 2D CNN layer with $q_1$ filters of size $K_t \times K_s$ is $(1 + K_t K_s q)\,q_1$.
The graphical representation of how a 2D CNN layer works is shown in Figure 8.6. The
3D object in blue represents the input data, while the 3D yellow object represents the
first filter. By sliding this filter horizontally and vertically over the input tensor a feature
map is produced, resulting in the yellow 2D CNN filter output $z^{(1)}_1(X_{1:t,1:s}) \in \mathbb{R}^{t'\times s'}$.
Concatenating the feature maps obtained from the q1 different filters generates the output
of the 2D CNN layer.
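To make the indexing in (8.17) and (8.19) concrete, the following numpy sketch computes the feature map of a single 2D CNN filter by looping over all receptive fields; the array names, the random toy data and the choice of the ReLU activation are illustrative assumptions and not part of the formal definition.

```python
import numpy as np

def conv2d_single_filter(X, W, w0, stride=(1, 1), phi=lambda x: np.maximum(x, 0.0)):
    """Single 2D CNN filter as in (8.19): X has shape (t, s, q), W has shape (Kt, Ks, q)."""
    t, s, q = X.shape
    Kt, Ks, _ = W.shape
    dt, ds = stride
    t_out = (t - Kt) // dt + 1          # number of vertical receptive fields
    s_out = (s - Ks) // ds + 1          # number of horizontal receptive fields
    Z = np.empty((t_out, s_out))
    for u in range(t_out):
        for v in range(s_out):
            patch = X[u * dt:u * dt + Kt, v * ds:v * ds + Ks, :]   # receptive field (u, v)
            Z[u, v] = phi(w0 + np.sum(W * patch))                  # scalar products over all channels
    return Z

# toy example: 6x6 input with 3 channels, 3x3 kernel, stride (1, 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 6, 3))
W = rng.normal(size=(3, 3, 3))
print(conv2d_single_filter(X, W, w0=0.1).shape)   # (4, 4), i.e., t' = s' = 4
```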


Figure 8.6: A graphical illustration of a 2D CNN layer.

8.3.3 Deep convolutional neural networks


Multiple CNN layers can be composed to capture increasingly complex and hierarchical
features from the input data. This allows for learning low-level (local) patterns in the ini-
tial layers, which are subsequently combined into higher-level (more global) abstractions
in deeper layers. As is standard in FNN architectures, the output of one layer serves
as input to the next layer, allowing information to propagate through the network. For
instance, in a two-layer 2D CNN architecture, the output of the first 2D CNN layer,
denoted as z (1) (X 1:t,1:s ), becomes the input to the second 2D CNN layer. Each CNN
layer is characterized by a specific set of input parameters such as the kernel size and the
stride, which can vary across layers to optimally extract feature information.
To distinguish the layer-specific parameters, an additional upper index is introduced. We
denote the kernel sizes of the first and second layers by K (1) and K (2) , respectively, and
correspondingly we denote their strides by δ (1) and δ (2) .
For illustrative purposes, we consider the 2D CNN case. Assume that the output dimen-
sion (t′ , s′ , q1 ) of a first 2D CNN layer z (1) , see (8.18), coincides with the input dimension
of a second 2D CNN layer z (2) , having the same structure as (8.18), that is,
$$
z^{(2)} : \mathbb{R}^{t'\times s'\times q_1} \to \mathbb{R}^{t''\times s''\times q_2},
$$
where
$$
t'' = \left\lfloor \frac{t'-K^{(2)}_t}{\delta^{(2)}_t} + 1 \right\rfloor \in \mathbb{N}
\qquad \text{and} \qquad
s'' = \left\lfloor \frac{s'-K^{(2)}_s}{\delta^{(2)}_s} + 1 \right\rfloor \in \mathbb{N},
$$
and where we select q2 ∈ N filters for this second 2D CNN layer.

We can then compose these two 2D CNN layers to a deep 2D CNN architecture
$$
z^{(2:1)} : \mathbb{R}^{t\times s\times q} \to \mathbb{R}^{t''\times s''\times q_2}
\qquad \text{with} \qquad z^{(2:1)} = z^{(2)} \circ z^{(1)}.
\tag{8.20}
$$

Naturally, this mechanism also works for 1D CNN layers, and we can generalize it to any
depth d, resulting in deep CNN architectures $z^{(d:1)} = z^{(d)} \circ \cdots \circ z^{(1)}$. Importantly,
the output dimension of the CNN layer $z^{(m-1)}$ always has to match the input dimension
of the next CNN layer $z^{(m)}$. This is completely analogous to deep FNN architectures
(5.3).
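As a hedged illustration of the composition (8.20), the following Keras sketch stacks two 2D CNN layers; the filter numbers, kernel sizes, strides and activations are arbitrary example choices, and the parameter counts follow $(1+K_tK_sq)q_1$. Depending on the installation, the import may instead be `from tensorflow import keras`.

```python
import keras  # or: from tensorflow import keras

# deep 2D CNN z^(2:1) = z^(2) o z^(1) applied to an input tensor of shape (t, s, q) = (28, 28, 3)
model = keras.Sequential([
    keras.Input(shape=(28, 28, 3)),
    keras.layers.Conv2D(filters=8, kernel_size=(3, 3), strides=(1, 1), activation="relu"),
    # output shape (26, 26, 8), with (1 + 3*3*3) * 8 = 224 parameters
    keras.layers.Conv2D(filters=16, kernel_size=(3, 3), strides=(2, 2), activation="relu"),
    # output shape (12, 12, 16), with (1 + 3*3*8) * 16 = 1168 parameters
])
model.summary()
```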


This almost fully describes (deep) CNN architectures. There are some more points that
we would like to discuss:

• Locally-connected networks which are a special case of CNNs.

• Pooling layers which are useful to reduce tensor dimensions.

• Padding which increases tensors.

• Flatten layer which restructures tensors.

This is discussed in the next sections. When it comes to CNN fitting, we apply a version
of SGD, see Section 5.3.

8.3.4 Locally-connected networks


A notable variant of CNNs is the locally-connected network (LCN). LCNs preserve the
concept of local connectivity while removing the parameter sharing mechanism that char-
acterizes CNNs. As a result, LCNs can be seen as a middle ground between CNNs and
FNNs. In a LCN, each receptive field is assigned a unique set of weights, enhancing
the model’s flexibility and allowing it to capture location-specific patterns that CNNs
might overlook. However, this extra flexibility comes at the cost of significantly more
parameters than for CNNs. LCN layers have been introduced to the actuarial literature
in Scognamiglio [205] for 1D LCNs and in Perla et al. [176] for 2D LCNs.

For simplicity, we use the 1D case for explaining a LCN layer, and the 2D case is com-
pletely analogous. Formally, a 1D LCN layer with q1 ∈ N filters is a mapping
′ (1)
z (1) : Rt×q → Rt ×q1 , X 1:t 7→ z (1) (X 1:t ) = zu,j (X 1:t )

1≤u≤t′ ,1≤j≤q1
,

with element of the matrix is a features obtained by

K D
!
E
(1) (1) X (1)
zu,j (X 1:t ) =ϕ w0,u,j + wk,u,j , X (u−1)δ+k ,
k=1

(1) (1)
with bias terms w0,u,j ∈ R and filter weights wk,u,j ∈ Rq , 1 ≤ k ≤ K, related to the u-th
receptive field. In this case, the total number of parameters to be learned is (1 + Kq)q1 t′ .
Compared to (8.15) and (8.16) there is only one difference, namely, the bias terms and
filter weights have a lower index u that corresponds to the u-th receptive field considered.
This increases the size of the parameters by a factor t′ compared to the 1D CNN case.
Figure 8.7 provides a graphical representation of the FNN, LCN and CNN layers; the
number of parameters excludes the bias terms. For an input tensor X 1:6 ∈ R6×1 we have
18 parameters by mapping this to a FNN layer with q1 = 3 neurons, see Figure 8.7 (lhs).
The right-hand side shows an example of a 1D CNN layer with kernel size K = 2 and
stride δ = 2, resulting in 2 parameters. The LCN layer (with the same kernel size and
stride) in between has 6 parameters.
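The following numpy sketch contrasts a 1D CNN filter, which reuses one weight vector across all receptive fields, with a 1D LCN filter, which holds a separate weight vector per receptive field; the toy dimensions mirror Figure 8.7 (t = 6, K = 2, δ = 2), and the linear activation and the concrete weights are illustrative assumptions.

```python
import numpy as np

t, q, K, delta = 6, 1, 2, 2
t_out = (t - K) // delta + 1                      # 3 receptive fields
X = np.arange(t * q, dtype=float).reshape(t, q)

# 1D CNN filter: one shared weight vector of shape (K, q) -> K*q = 2 weights (bias excluded)
w_cnn = np.array([[0.5], [-0.25]])
z_cnn = np.array([np.sum(w_cnn * X[u * delta:u * delta + K, :]) for u in range(t_out)])

# 1D LCN filter: one weight vector per receptive field -> t_out*K*q = 6 weights (bias excluded)
w_lcn = np.stack([w_cnn + 0.1 * u for u in range(t_out)])   # shape (t_out, K, q)
z_lcn = np.array([np.sum(w_lcn[u] * X[u * delta:u * delta + K, :]) for u in range(t_out)])

print(z_cnn, z_lcn)   # same receptive fields, shared vs. location-specific weights
```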


Figure 8.7: A graphical comparison of the FNN (fully-connected, 3 x 6 = 18 parameters), LCN
(locally-connected, 3 x 2 = 6 parameters) and CNN (convolutional, 2 parameters) layers; the
number of parameters excludes the bias terms.

8.3.5 Pooling layers


Pooling layers are designed to summarize local information and to reduce the dimensions
of tensors. They are commonly used in architectures involving CNN layers. Typically
applied after a CNN layer, pooling reduces the tensor’s dimension while preserving its
most significant information. One of the earliest papers to make use of this kind of layers
is LeCun et al. [131]. A pooling function replaces the network’s output at a specific
location with a summary statistic derived from nearby values, such as the maximum or
the average value. In a sense, pooling layers are similar to CNN layers, as both operate
on local regions (windows) of the inputs. However, instead of performing a convolution
(which involves multiplying the input by a set of weights and summing the results), a
pooling layer performs a down-sampling operation. Pooling layers require specifying the
pooling size, which is the analogue to the kernel size in CNN layers, denoted as K, and
the stride δ. In most cases, the stride is set to its default value δ = K.
There are several types of pooling layers, and the choice of the specific pooling technique
depends on the task to be performed. Among the most popular pooling layers, there are:

• MaxPooling: It selects the maximum value from a set of input values in the se-
lected pooling windows. This operation is commonly preferred when the goal is to
emphasize the most relevant features while ignoring the less significant ones.

• AveragePooling: It computes the average of the values within the given pooling
windows. This operation results in smoother feature maps, as it considers the
overall characteristics of a local neighborhood, rather than focusing on extreme
values. AveragePooling is especially appropriate when the objective is to capture
the overall structure or distribution.

Guidance on selecting the appropriate types of pooling for various scenarios is provided
in Boureau et al. [25]. Again, as with several other items, finding the right pooling and
its parameterization is part of the hyper-parameter tuning that needs to be performed by
the modeler. Notably, we remark that pooling layers can be applied along multiple
dimensions, depending on the structure of the input data.

For illustrative purposes, we consider the 1D pooling case, and the 2D case is completely
similar. For this we keep the 1D CNN layer in mind, see (8.15). The only difference is


that we replace the convolutional operation (8.16) by pooling operations. For this we
select a pooling size K ∈ N and a stride δ ∈ N giving the mapping
$$
z^{\text{pool}} : \mathbb{R}^{t\times q} \to \mathbb{R}^{t'\times q}, \qquad
X_{1:t} \mapsto z^{\text{pool}}(X_{1:t}) = \Big( z^{\text{pool}}_{u,j}(X_{1:t}) \Big)_{1\le u\le t',\,1\le j\le q}.
$$
Here, we note that the 1D pooling layer reduces the first dimension of the data from $t$ to
$t' = \lfloor \frac{t-K}{\delta} + 1 \rfloor$ while preserving the original covariate dimension $q$. The elements of the
output depend on the type of pooling used:

• In the case of MaxPooling, we consider
$$
z^{\text{pool}}_{u,j}(X_{1:t}) = \max_{(u-1)\delta+1 \,\le\, k \,\le\, u\delta} X_{k,j}.
$$

• In the case of AveragePooling, we consider
$$
z^{\text{pool}}_{u,j}(X_{1:t}) = \frac{1}{K} \sum_{k=(u-1)\delta+1}^{u\delta} X_{k,j}.
$$

The case of 2D pooling is completely analogous. For the default stride δ = K, we consider
a partition of the tensor, and in each of these subsets of the partition we either extract
the maximum or the average, depending on the type of pooling; this also relates to Figure
8.4 (rhs).
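A minimal numpy sketch of the two 1D pooling operations, using the default stride δ = K; the function and variable names are illustrative.

```python
import numpy as np

def pool1d(X, K, mode="max"):
    """1D pooling with pooling size K and default stride delta = K; X has shape (t, q)."""
    t, q = X.shape
    t_out = (t - K) // K + 1
    windows = X[:t_out * K].reshape(t_out, K, q)           # partition into non-overlapping windows
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

X = np.array([[1.0], [4.0], [2.0], [8.0], [5.0], [3.0]])    # (t, q) = (6, 1)
print(pool1d(X, K=2, mode="max").ravel())                   # [4. 8. 5.]
print(pool1d(X, K=2, mode="average").ravel())               # [2.5 5.  4. ]
```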

8.3.6 Padding

Padding refers to the process of adding extra values, generally zeros, to the edges of a
tensor before applying the convolution operation. This technique is used to control the
dimension of the output feature map and to ensure that the network effectively captures
information located at the edges. When a CNN layer is applied to the input data tensor,
it typically reduces the dimension from (t, s) to (t′ , s′ ) because the filter does not fully
operate on the edges of the input even if we have a stride δ = 1. A widely used solution
to this problem, known as padding, involves adding zeros to the edges of the input data
to ensure that the output feature map maintains the same spatial dimensions as the
original input tensor. This allows for the filter to operate effectively at the borders
without reducing the size of the output. Padding can be applied to various types of
input tensors. For example, in 1D CNNs (used for time-series data), padding may be
applied at the start and end of the sequence. Similarly, in 2D CNNs, padding is applied
along both dimensions and so on for higher-dimensional CNNs. The amount of padding
is typically determined based on the filter size K; we assume stride δ = 1 for the moment.
For instance, in the case of a 2D CNN layer with a filter of size 3 × 3, padding of 1 pixel
is commonly added to each side of the input to preserve its original dimensions.
Figure 8.8 graphically illustrates the process of applying padding to data in both one-
dimensional and two-dimensional cases.
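A short numpy illustration of zero-padding in the 1D and 2D cases; the pad width of one element per edge corresponds to the 3 × 3 kernel example mentioned above.

```python
import numpy as np

x1d = np.array([1.0, 2.0, 3.0, 4.0])
print(np.pad(x1d, pad_width=1))              # [0. 1. 2. 3. 4. 0.]: zeros added at start and end

x2d = np.ones((3, 3))
print(np.pad(x2d, pad_width=1).shape)        # (5, 5): a 3x3 kernel with stride 1 now preserves (3, 3)
```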



Figure 8.8: 1D and 2D padding.

8.3.7 Conclusions and flatten layers

We have now met all the modules that are necessary to use CNNs as feature extrac-
tors. Generally, the output of a deep CNN architecture is a tensor that has the same
order as the input tensor, e.g., in a time-series example, the output tensor is an element
$z^{(d:1)}(X_{1:t}) \in \mathbb{R}^{t'\times q_d}$. If we want to use this new data representation $z^{(d:1)}(X_{1:t})$ for pre-
dictive modeling, say, to forecast an insurance claim in the next period, it needs further
processing, e.g., through a FNN layer; this is illustrated in Figure 8.3. In particular,
this requires transforming the tensor $z^{(d:1)}(X_{1:t})$ into a vector of length $t' q_d$. In machine
learning lingo, this shape transformation is called flatten, and it is implemented by a
so-called flatten layer (which only performs that shape transformation of a tensor to a
vector). After this flatten layer, the extracted information can be concatenated with
other vector information, as illustrated in Figure 8.3.

8.4 Recurrent neural networks

Recurrent neural networks (RNNs) are a class of neural networks specifically de-
signed to process sequential data.

RNNs generate predictions at a given time step, by taking into account the preceding
elements of the sequence. This mechanism is implemented through the introduction of
cyclic/recurrent connections within the network which feed the output of a network layer
back into the network as input enabling information to persist across several time steps.
This design makes RNNs promising methods for tasks where the order of data is relevant
(time-causality), such as text processing or time-series forecasting.
The two most popular RNN architectures are the long short-term memory (LSTM) archi-
tecture of Hochreiter–Schmidhuber [100] and the gated recurrent unit (GRU) architecture
of Cho et al. [42]. These two architectures are presented in Sections 8.4.2 and 8.4.3, be-
low. An early actuarial application of RNNs is the mortality example of Nigri et al. [167],
and Lindholm–Palmborg [139] discuss the efficient use of data in time-series mortality
problems.


8.4.1 Plain-vanilla recurrent neural networks


We begin by introducing the simplest form of a RNN architecture, termed plain-vanilla
RNN. This architecture represents the starting point for building more complex RNN
models, and it is useful to understand the recurrent mechanism. We explain the mecha-
nism behind the RNN with a focus on time-series data.
Consider a multivariate time-series X 1:t ∈ Rt×q with components X u ∈ Rq at times 1 ≤
u ≤ t. RNNs generate predictions at specific time steps u by processing the past observed
values of the time-series. Specifically, past observations $X_{1:u-1}$ are compressed into a
statistic $z^{(1)}_{u-1}$, which is concatenated with the new covariate $X_u$. The concatenation of
these two objects is then used to predict the next response Yu . Note that the covariate
X u may also include the past observation Yu−1 , which corresponds to a look-back of size
one. Of course, the look-back can have any length.
At each time step 1 ≤ u ≤ t, the RNN receives two inputs:

(1) the value $X_u$ of the sequence $X_{1:t}$ at the current time step $u$, and

(2) the output of the RNN cell $z^{(1)}_{u-1}$ from the previous time step $u-1$.

The output $z^{(1)}_{u-1}$ of the previous RNN step can be seen as a compressed summary (learned
representation) of the past observations $X_{1:u-1}$ prior to $u$. These two components are
concatenated to generate the next prediction.

RNN layer. A RNN layer with $q_1 \in \mathbb{N}$ units is given by the mapping
$$
z^{(1)} : \mathbb{R}^{q} \times \mathbb{R}^{q_1} \to \mathbb{R}^{q_1}, \qquad
\left( X_u, z^{(1)}_{u-1} \right) \mapsto z^{(1)}_u := z^{(1)}\!\left( X_u, z^{(1)}_{u-1} \right),
\tag{8.21}
$$
where $X_u$ is the covariate at time $u \ge 1$ and $z^{(1)}_{u-1}$ is the output from the previous RNN
loop; we initialize $z^{(1)}_0 = 0$. The vector $z^{(1)}_u = (z^{(1)}_{u,1}, \ldots, z^{(1)}_{u,q_1})^\top$ is interpreted as a learned
representation considering all information $X_{1:u}$. The $j$-th unit is computed as
$$
z^{(1)}_{u,j} = z^{(1)}_{u,j}(X_{1:u}) = \phi\left( w^{(1)}_{0,j} + \Big\langle w^{(1)}_j, X_u \Big\rangle + \Big\langle v^{(1)}_j, z^{(1)}_{u-1} \Big\rangle \right),
\tag{8.22}
$$
with bias $w^{(1)}_{0,j} \in \mathbb{R}$, network weights $w^{(1)}_j \in \mathbb{R}^q$ and $v^{(1)}_j \in \mathbb{R}^{q_1}$, and activation function
$\phi : \mathbb{R} \to \mathbb{R}$.
Comparing (8.21)-(8.22) to the FNN layer (5.5)-(5.6), we observe that we have precisely
the same structure, that is, we can consider the concatenated input $(X_u, z^{(1)}_{u-1}) \in \mathbb{R}^{q+q_1}$
which is processed through a FNN layer that has input dimension $q+q_1$ and output
dimension $q_1$, see (5.5). The change is that we have a recurrent loop, reflected by the
lower index in the compressed past information (RNN cell) $z^{(1)}_{u-1}$.
The parameters of an RNN layer are time-independent, since they remain constant across
all time steps 1 ≤ u ≤ t. This mechanism is similar to the parameter sharing mechanism
adopted in the 1D CNNs. The main difference is that, in CNNs, each output is calculated
using a small window of neighboring inputs, resulting in a shallow structure. In contrast,
RNNs rely on a recurrent mechanism where each output depends on the previous outputs,


with the same update rule applied recursively. This design creates a deep computational
graph allowing the network to learn long-range time dependencies in the data. The total number of
parameters to optimize in an RNN layer with $q_1$ cells is given by $q_1(1 + q + q_1)$. A visual
representation of the recurrent mechanism of an RNN unit is provided in Figure 8.9. At
each time step, the unit produces an output, with the final output of the sequence high-
lighted in red.

Figure 8.9: The recurrent mechanism of a RNN cell.

In most applications of neural networks, particularly in tasks involving sequential data,
only the last RNN cell state $z^{(1)}_t = z^{(1)}_t(X_{1:t})$ is used to generate the prediction of the
response $Y_t$. However, in some cases, it can be beneficial for the model to consider the
entire sequence (called return sequence) of RNN cell states across the different time
steps, represented as $(z^{(1)}_u)_{u=1}^t$. This provides a more comprehensive/granular represen-
tation of the input data. In such scenarios, one may apply flattening to further process the
data, see Section 8.3.7 and Figure 8.3.
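The recursion (8.21)-(8.22) can be written as a short numpy loop; the tanh activation, the random weights and the zero biases are illustrative assumptions. The loop returns the full return sequence, of which typically only the last state is used.

```python
import numpy as np

def rnn_layer(X, W, V, w0, phi=np.tanh):
    """Plain-vanilla RNN layer: X has shape (t, q), W shape (q1, q), V shape (q1, q1), w0 shape (q1,)."""
    t, _ = X.shape
    q1 = w0.shape[0]
    z = np.zeros(q1)                      # initialization z_0^(1) = 0
    states = []
    for u in range(t):
        z = phi(w0 + W @ X[u] + V @ z)    # update (8.22), same parameters at every time step
        states.append(z)
    return np.stack(states)               # return sequence (z_1^(1), ..., z_t^(1))

rng = np.random.default_rng(1)
t, q, q1 = 5, 3, 4                        # q1 * (1 + q + q1) = 32 parameters in total
X = rng.normal(size=(t, q))
states = rnn_layer(X, rng.normal(size=(q1, q)), rng.normal(size=(q1, q1)), np.zeros(q1))
print(states.shape, states[-1])           # (5, 4); last state z_t^(1) typically used for prediction
```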

For training RNNs, one commonly uses the back-propagation through time (BPTT)
algorithm, see Werbos [235]. BPTT unrolls the RNN across the entire sequence during
optimization. While effective, BPTT often suffers from issues such as slow convergence,
vanishing gradients, and difficulties in learning long-term dependencies. To overcome
these challenges, more advanced RNN architectures such as LSTM and GRU networks
were introduced. These are discussed next.

8.4.2 Long short-term memory networks


LSTM networks, introduced in Hochreiter–Schmidhuber [100], extend the basic structure
of traditional RNNs by incorporating a specialized mechanism to handle long-term de-
pendencies more effectively. The core of this more advanced RNN model is the memory
cell, which is designed to store and regulate the flow of long-term information through-
out the network. The memory cell in LSTMs operates using a system of gates (namely
the input, output, and forget gates) that control how information is stored, updated, and


released. This gating mechanism allows the network to select important information that
has to be kept in the memory and discard irrelevant or outdated data, making it more
effective in capturing long-term dependencies. By combining this long-term memory with
the short-term memory provided by the recurrent connections, LSTMs are able to learn
both long and short-term dependencies simultaneously.
To calculate the output of the LSTM layer, three activation functions are applied: the
sigmoid function σ, with values in (0, 1), the hyperbolic tangent tanh, with values in (−1, 1),
and a general activation function ϕ; we refer to Table 5.1 for their definitions.

LSTM layer. We formally introduce an LSTM layer.


An LSTM layer with $q_1$ units is obtained by the mapping, for $u \ge 1$,
$$
z^{(1)} : \mathbb{R}^{q} \times \mathbb{R}^{q_1} \times \mathbb{R}^{q_1} \to \mathbb{R}^{q_1} \times \mathbb{R}^{q_1}, \qquad
\left( X_u, z^{(1)}_{u-1}, c^{(1)}_{u-1} \right) \mapsto \left( z^{(1)}_u, c^{(1)}_u \right),
$$
where $z^{(1)}_u$ is called hidden state and $c^{(1)}_u$ is the memory cell.

At each time step, the LSTM layer takes the current input $X_u$, the previous state $z^{(1)}_{u-1}$,
and the previous memory cell $c^{(1)}_{u-1}$. These inputs are used to compute the two outputs
of the LSTM layer:

• The updated hidden state
$$
z^{(1)}_u = o^{(1)}_u \odot \phi\big( c^{(1)}_u \big) \in \mathbb{R}^{q_1},
$$

where $\odot$ is the Hadamard product, which applies elementwise multiplication, and
the activation function $\phi$ is applied elementwise. The output gate $o^{(1)}_u$ dynamically
regulates how much of the transformed memory cell $\phi(c^{(1)}_u)$ is outputted.

• The updated memory cell
$$
c^{(1)}_u = f^{(1)}_u \odot c^{(1)}_{u-1} + i^{(1)}_u \odot \tilde{c}^{(1)}_u \in \mathbb{R}^{q_1},
$$
with forget gate $f^{(1)}_u$, input gate $i^{(1)}_u$, and candidate new memory cell $\tilde{c}^{(1)}_u$.

The three gates and candidate new memory cell are defined by
$$
\begin{aligned}
f^{(1)}_u &= \sigma\big( w^{(1)}_{0,f} + W^{(1)}_f X_u + V^{(1)}_f z^{(1)}_{u-1} \big) \in (0,1)^{q_1}, \\
i^{(1)}_u &= \sigma\big( w^{(1)}_{0,i} + W^{(1)}_i X_u + V^{(1)}_i z^{(1)}_{u-1} \big) \in (0,1)^{q_1}, \\
o^{(1)}_u &= \sigma\big( w^{(1)}_{0,o} + W^{(1)}_o X_u + V^{(1)}_o z^{(1)}_{u-1} \big) \in (0,1)^{q_1}, \\
\tilde{c}^{(1)}_u &= \tanh\big( w^{(1)}_{0,c} + W^{(1)}_c X_u + V^{(1)}_c z^{(1)}_{u-1} \big) \in (-1,1)^{q_1},
\end{aligned}
$$
with biases $w^{(1)}_{0,f}, w^{(1)}_{0,i}, w^{(1)}_{0,o}, w^{(1)}_{0,c} \in \mathbb{R}^{q_1}$, network weights $W^{(1)}_f, W^{(1)}_i, W^{(1)}_o, W^{(1)}_c \in \mathbb{R}^{q_1\times q}$
and $V^{(1)}_f, V^{(1)}_i, V^{(1)}_o, V^{(1)}_c \in \mathbb{R}^{q_1\times q_1}$. These gates work together to dynamically regulate
the flow of information into, out of, and within the cell. During training, the parameters
of the LSTM layer are adjusted to optimize this gating mechanism. The total number of
weights in an LSTM layer is given by 4q1 (q + q1 + 1).
A graphical illustration of the LSTM layer is shown in Figure 8.10. This diagram high-
lights the flow of data through the gates and the interaction between the memory cell
and the hidden state over time.
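As a quick plausibility check of the parameter count 4q1(q + q1 + 1), the following hedged Keras sketch builds a single LSTM layer; with q = 3 and q1 = 4 it should report 4 · 4 · (3 + 4 + 1) = 128 trainable weights. Depending on the installation, the import may instead be `from tensorflow import keras`.

```python
import keras  # or: from tensorflow import keras

t, q, q1 = 5, 3, 4
model = keras.Sequential([
    keras.Input(shape=(t, q)),
    keras.layers.LSTM(q1),   # returns only the last hidden state z_t^(1)
])
model.summary()              # expected: 4 * q1 * (q + q1 + 1) = 128 trainable parameters
```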


Figure 8.10: Graphical representation of a LSTM unit.

8.4.3 Gated recurrent unit networks

The GRU networks of Cho et al. [42] have emerged as a promising alternative to LSTM
networks. Although GRUs were developed more recently, they offer a simplified archi-
tecture compared to LSTMs, making them computationally more efficient while still
addressing the issue of vanishing gradients that traditional RNNs face. Unlike LSTMs,
GRUs do not have an explicit memory cell. Instead, they rely on two gates: the update
gate and the reset gate. The first determines how much of the previous hidden state
should be carried forward, essentially controlling the flow of information. The second
one, on the other hand, decides how much of the previous information should be dis-
carded when generating the current state. This design simplifies the architecture while
still allowing it to capture long-term dependencies. In essence, GRUs can be seen as a
more compact version of LSTMs, offering similar benefits in terms of handling long-range
dependencies but with fewer parameters.

GRU layer. Choose weights $W^{(1)}_o, W^{(1)}_r, W^{(1)}_z \in \mathbb{R}^{q_1\times q}$ and $V^{(1)}_o, V^{(1)}_r, V^{(1)}_z \in \mathbb{R}^{q_1\times q_1}$,
and biases $w^{(1)}_{0,o}, w^{(1)}_{0,r}, w^{(1)}_{0,z} \in \mathbb{R}^{q_1}$. The output of the $q_1$-dimensional GRU layer is deter-
mined as follows
$$
z^{(1)}_u = \big(1 - o^{(1)}_u\big) \odot z^{(1)}_{u-1} + o^{(1)}_u \odot \tilde{z}^{(1)}_u.
$$

This equation defines the hidden state $z^{(1)}_u$ at time step $u$, which is a weighted combination
of the previous hidden state $z^{(1)}_{u-1}$ and the candidate state $\tilde{z}^{(1)}_u$. The balance between
these two components is controlled by the state of the update gate $o^{(1)}_u$, which determines
how much of the previous state should be retained and how much of the new information
should be incorporated. The states $o^{(1)}_u$ and $\tilde{z}^{(1)}_u$, with $\tilde{z}^{(1)}_u$ further influenced by the
reset gate $r^{(1)}_u$, are expressed as follows
$$
\begin{aligned}
o^{(1)}_u &= \sigma\big( w^{(1)}_{0,o} + W^{(1)}_o X_u + V^{(1)}_o z^{(1)}_{u-1} \big) \in (0,1)^{q_1}, \\
r^{(1)}_u &= \sigma\big( w^{(1)}_{0,r} + W^{(1)}_r X_u + V^{(1)}_r z^{(1)}_{u-1} \big) \in (0,1)^{q_1}, \\
\tilde{z}^{(1)}_u &= \tanh\big( w^{(1)}_{0,z} + W^{(1)}_z X_u + V^{(1)}_z (r^{(1)}_u \odot z^{(1)}_{u-1}) \big) \in (-1,1)^{q_1}.
\end{aligned}
$$

The number of weights in a GRU layer with $q_1$ units, equal to $3q_1(q + q_1 + 1)$, is lower
than the number of parameters in an LSTM layer with the same number of units.
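A single GRU update step, written directly from the three equations above in numpy; the weights are random placeholders and the sigmoid is implemented inline.

```python
import numpy as np

def gru_step(x, z_prev, Wo, Wr, Wz, Vo, Vr, Vz, bo, br, bz):
    """One GRU time step: x has shape (q,), z_prev has shape (q1,)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    o = sigmoid(bo + Wo @ x + Vo @ z_prev)                 # update gate
    r = sigmoid(br + Wr @ x + Vr @ z_prev)                 # reset gate
    z_tilde = np.tanh(bz + Wz @ x + Vz @ (r * z_prev))     # candidate state
    return (1.0 - o) * z_prev + o * z_tilde                # new hidden state

rng = np.random.default_rng(2)
q, q1 = 3, 4
pars = [rng.normal(size=s) for s in [(q1, q)] * 3 + [(q1, q1)] * 3 + [(q1,)] * 3]
z = gru_step(rng.normal(size=q), np.zeros(q1), *pars)
print(z.shape)   # (4,); 3 * q1 * (q + q1 + 1) = 96 trainable weights in total
```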

8.4.4 Deep recurrent neural networks and conclusion


RNN layers, including LSTM and GRU layers, can be stacked to form deep RNNs. Note
that we have carefully used upper indices (1) in all notation above to indicate a first RNN
layer. Such layers can now be composed. The way of designing deep RNN structures is
not unique, and different approaches can be considered. The following gives a proposal
for a RNN of depth d = 2. At each time step 1 ≤ u ≤ t, consider the recurrent structure

$$
\begin{aligned}
z^{(1)}_u &= z^{(1)}\big( X_u, z^{(1)}_{u-1} \big), \\
z^{(2)}_u &= z^{(2)}\big( z^{(1)}_u, z^{(2)}_{u-1} \big),
\end{aligned}
$$
where $z^{(2)}_u$ represents the state of the second RNN layer at time $u$. Of course, this can
be generalized to any depth d ≥ 2.
In the deep LSTM case, we consider for the $m$-th RNN layer, $m \ge 1$,
$$
z^{(m)} : \mathbb{R}^{q_{m-1}} \times \mathbb{R}^{q_m} \times \mathbb{R}^{q_m} \to \mathbb{R}^{q_m} \times \mathbb{R}^{q_m}, \qquad
\left( z^{(m-1)}_u, z^{(m)}_{u-1}, c^{(m)}_{u-1} \right) \mapsto \left( z^{(m)}_u, c^{(m)}_u \right),
\tag{8.23}
$$
where we initialize the input for $m=1$ by $z^{(0)}_u = X_u$ and $q_0 = q$.

We conclude this RNN section with some remarks.

• Deep RNN architectures, like the LSTM layers (8.23), consider the entire return
sequence $(z^{(m-1)}_u)_{u=1}^t$ of the previous RNN layer $m-1$. For predicting a response
$Y_t$, in the last time period $t$, one typically only extracts the last state $z^{(d)}_t \in \mathbb{R}^{q_d}$
in the last RNN layer $m = d$. This can then be concatenated with other covariate
information, see Figure 8.3.

• Following up on the previous item, if one wants to process the entire return sequence
$(z^{(d)}_u)_{u=1}^t$ of the last RNN layer $m = d$, one typically needs to flatten this return
sequence to further process it, see Section 8.3.7 for flatten layers.

• The previous examples have all been dealing with information $X_{1:t}$ of equal length
$t$. However, RNNs can process information of any length, if only the last state
$z^{(d)}_t \in \mathbb{R}^{q_d}$ is extracted. For example, we can have insurance policyholders $1 \le i \le n$
with claims histories of different lengths $X_{\tau_i:t}$, where $\tau_i \in \{1, \ldots, t-1\}$ is the
starting point of the history of policyholder $i$. This can easily be handled by RNNs.


8.5 Transformers
Transformers are a class of deep learning architectures that have revolutionized natural
language processing (NLP) and they are at the core of the great success of large language
models (LLMs). Introduced by Vaswani et al. [228], these models replace traditional re-
current and convolutional structures with attention mechanisms. These attention mecha-
nisms use weighting schemes to identify and prioritize the most relevant information and
their interactions. While originally developed for tasks involving data with a sequential
structure, transformers have been adapted to tabular input data, increasing the potential
of these models in actuarial science. First applications of transformers for tabular data
were considered by Huang et al. [106] and Brauer [26]. In their work, continuous and
categorical information is tokenized and embedded so that this pre-processed input infor-
mation has a 2D tensor structure, making them suitable to enter a transformer; see also
the feature tokenizer transformer (FTT) of Gorishniy et al. [86]. More recently, Rich-
man et al. [188] advanced transformer-based models for tabular data by incorporating
a novel weighting scheme inspired by the credibility mechanism of Bühlmann’s seminal
work [34, 36].

8.5.1 Basic components of transformers


Transformers are sophisticated architectures that combine multiple layers. Before illus-
trating specific transformer architectures, we describe the key layers that serve as the
building blocks for these advanced models.

Attention layer

The core of transformers is the attention layer, which is designed to identify the most
relevant information in the input data. The central idea is to learn a weighting scheme
that prioritizes the most important parts of the input, thereby enhancing the model’s
ability to perform a given predictive task.
Different attention mechanisms are available in the literature. Our focus is on the most
commonly used variant, the scaled dot-product attention of Vaswani et al. [228]. To
illustrate this attention mechanism, we consider three matrices Q, K, V ∈ Rt×q of the
same dimensions. These three matrices represent the query, the key, and the value,
respectively, of the attention mechanism.
The scaled dot-product attention mechanism is given by the mapping, called attention
head,
H : R(t×q)×(t×q)×(t×q) → Rt×q , (Q, K, V ) 7→ H = H(Q, K, V ).

The attention mechanism is applied to the value matrix V , with the output H calculated
as a weighted sum of its elements. The (attention) weights, dependent on the query ma-
trix Q and the key matrix K, are computed through a scalar/dot-product multiplication,
followed by the softmax function to normalize the scores
$$
H = A\,V = \operatorname{softmax}\!\left( \frac{QK^\top}{\sqrt{q}} \right) V \in \mathbb{R}^{t\times q}.
\tag{8.24}
$$


Here, $\sqrt{q}$ is a scaling factor which tries to make the matrices free of the input dimension
$q$, while the matrix of scores $A \in \mathbb{R}^{t\times t}$ is derived from the matrix $A' = QK^\top/\sqrt{q}$ by
applying the softmax operator to the rows of $A'$
$$
A = \operatorname{softmax}(A'), \qquad \text{where} \quad a_{u,s} = \frac{\exp(a'_{u,s})}{\sum_{k=1}^t \exp(a'_{u,k})} \in (0,1),
\tag{8.25}
$$

for 1 ≤ u, s ≤ t. This transformation ensures that the elements of each row of the
matrix A sum to one. To provide some intuition: the learned attention scores in A are
multiplied by the value vectors in V . Each row of the resulting matrix AV is a weighted
average of the vectors in V , where the weights, which sum to one, determine the (relative)
importance of each row vector in V . It is important to note that the scaled dot-product
attention mechanism is highly efficient computationally, as it performs matrix operations,
such as dot-products and softmax, in parallel across all queries, keys and values. This
eliminates the need for recursive or sequential computation, making it particularly well-
suited for implementation on Graphics Processing Units (GPUs).
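The scaled dot-product attention (8.24)-(8.25) in a few lines of numpy; the query, key and value matrices are random placeholders, and the max-subtraction inside the softmax is only added for numerical stability.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention head H = softmax(Q K^T / sqrt(q)) V, with Q, K, V of shape (t, q)."""
    q = Q.shape[1]
    A_raw = Q @ K.T / np.sqrt(q)                            # raw scores A'
    A = np.exp(A_raw - A_raw.max(axis=1, keepdims=True))    # numerically stable row-wise softmax (8.25)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V                                            # weighted averages of the rows of V

rng = np.random.default_rng(3)
t, q = 5, 4
H = scaled_dot_product_attention(rng.normal(size=(t, q)), rng.normal(size=(t, q)), rng.normal(size=(t, q)))
print(H.shape)   # (5, 4)
```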
Let us briefly illustrate the attention mechanism implied by the query $Q$ and the key $K$.
These two matrices can be expressed by
$$
Q = [q_1, \ldots, q_t]^\top \in \mathbb{R}^{t\times q}
\qquad \text{and} \qquad
K = [k_1, \ldots, k_t]^\top \in \mathbb{R}^{t\times q},
$$
with row vectors $q_s, k_s \in \mathbb{R}^q$, $1 \le s \le t$. The elements $a'_{u,s}$ of matrix $A'$ are given by
$$
a'_{u,s} = \frac{1}{\sqrt{q}}\, q_u^\top k_s = \frac{1}{\sqrt{q}}\, \langle q_u, k_s \rangle.
\tag{8.26}
$$
From this we conclude that if the query $q_u$ points in the same direction as the key $k_s$,
we receive a large attention weight $a_{u,s}$ (provided all queries and keys have roughly the
same absolute values). This then implies that the corresponding entries on the $s$-th row
of the value matrix $V$ get a large attention (weight).

Figure 8.11: Construction of attention matrix A using transposed query matrix Q⊤ (in
blue) and key matrix K ⊤ (in yellow).

Figure 8.11 illustrates the scalar products (8.26). Basically, every query q u tries to
find the keys ks that provide a large scalar product (8.26), which is mapped to a large


attention weight au,s ; this is related to the cosine similarity mentioned in the footnote
on page 131.

Time-distributed layer

Transformer architectures generally include time-distributed layers, which process sequen-


tial data by applying the same transformation to each time step. The transformation can
be any type of neural network layer, such as a FNN or a CNN layer; however, most
transformer models utilize time-distributed FNN layers. Consider a time-distributed
FNN layer with q1 units applied to the input tensor X 1:t ∈ Rt×q . This performs the
following mapping
$$
z^{\text{t-FNN}} : \mathbb{R}^{t\times q} \to \mathbb{R}^{t\times q_1}, \qquad
X_{1:t} \mapsto z^{\text{t-FNN}}(X_{1:t}) = \left[ z^{\text{FNN}}(X_1), \ldots, z^{\text{FNN}}(X_t) \right]^\top,
\tag{8.27}
$$

where z FNN : Rq → Rq1 is a FNN layer. This transformation leaves the time dimension
t unchanged, as the FNN layer z FNN is applied separately to each time step 1 ≤ u ≤ t.
Importantly, the same parameters (network weights and biases) are shared across all time
steps. This invariance across time steps is what makes the layer time-distributed.

Drop-out layer

Drop-out is a widely used regularization technique in neural networks introduced in Sri-


vastava et al. [210] and Wager et al. [229]; we have already met drop-out in Section 5.3.6.
It works by randomly ignoring a subset of neurons (units) during the training process, to
enhance the model’s generalization capabilities. This is typically implemented by multi-
plying the output of a specific layer by i.i.d. realizations of Bernoulli random variables
with a fixed drop-out rate α ∈ (0, 1) in each step of SGD training.
A drop-out layer can be expressed as

z drop : Rq → Rq , X 7→ z drop (X) = Z ⊙ X,

where Z = (Z1 , Z2 , . . . , Zq )⊤ ∈ {0, 1}q is a vector of i.i.d. Bernoulli random variables


that are resampled in each SGD step, and ⊙ is the elementwise Hadamard product.

Layer normalization

Layer normalization, introduced by Ba et al. [9], is a technique used to stabilize the


learning process, to accelerate convergence, and to improve the model’s predictive per-
formance. For this kind of layer, normalization is applied to the inputs within a single
layer, in contrast to batch normalization of Ioffe–Szegedy [107].
Mathematically, layer normalization is a mapping
$$
z^{\text{norm}} : \mathbb{R}^q \to \mathbb{R}^q, \qquad
X \mapsto z^{\text{norm}}(X) = \left( \gamma_j\, \frac{X_j - \bar{X}}{\sqrt{\sigma^2 + \epsilon}} + \delta_j \right)_{1\le j\le q},
$$
where $\epsilon > 0$ is a small constant added for numerical stability, $\bar{X} \in \mathbb{R}$ and $\sigma^2 \ge 0$ are
calculated as follows
$$
\bar{X} = \frac{1}{q} \sum_{j=1}^q X_j
\qquad \text{and} \qquad
\sigma^2 = \frac{1}{q} \sum_{j=1}^q (X_j - \bar{X})^2,
$$
and $\gamma = (\gamma_1, \ldots, \gamma_q)^\top \in \mathbb{R}^q$ and $\delta = (\delta_1, \ldots, \delta_q)^\top \in \mathbb{R}^q$ are vectors of trainable parame-
ters.
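Layer normalization in numpy, following the formula above; initializing γ at one and δ at zero is a common convention and an assumption here, as the initialization is not specified in the text.

```python
import numpy as np

def layer_norm(x, gamma, delta, eps=1e-5):
    """Layer normalization of a single input vector x of shape (q,)."""
    x_bar = x.mean()
    sigma2 = ((x - x_bar) ** 2).mean()
    return gamma * (x - x_bar) / np.sqrt(sigma2 + eps) + delta

q = 4
x = np.array([1.0, 2.0, 3.0, 10.0])
print(layer_norm(x, gamma=np.ones(q), delta=np.zeros(q)))   # roughly zero mean and unit variance
```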

8.5.2 Transformers for sequential data

We are now ready to introduce the attention head and the transformer layer in
detail. In this section, we discuss transformers designed for sequential data.

Transformer layer

Consider input data with a sequential structure (time-series) represented as a tensor


X 1:t ∈ Rt×q . The first foundational component of the transformer model is the attention
layer5 . To implement the attention mechanism on the input data X 1:t , we first need to
derive the query, key and value matrices Q, K and V , respectively. For this, we select
three time-distributed q-dimensional FNN layers

$$
z^{\text{t-FNN}}_j : \mathbb{R}^{t\times q} \to \mathbb{R}^{t\times q}, \qquad X_{1:t} \mapsto z^{\text{t-FNN}}_j(X_{1:t}),
$$

for $j \in \{Q, K, V\}$. These provide us with the time-slices for fixed time points $1 \le u \le t$,
see (8.27),
$$
\begin{aligned}
q_u &= z^{\text{FNN}}_Q(X_u) = \phi_Q\big( w^{(Q)}_0 + W_Q X_u \big) \in \mathbb{R}^q, \\
k_u &= z^{\text{FNN}}_K(X_u) = \phi_K\big( w^{(K)}_0 + W_K X_u \big) \in \mathbb{R}^q, \\
v_u &= z^{\text{FNN}}_V(X_u) = \phi_V\big( w^{(V)}_0 + W_V X_u \big) \in \mathbb{R}^q,
\end{aligned}
$$
with corresponding network weights, biases and activation functions. Writing this in
matrix notation gives us the query, key and value matrices
$$
\begin{aligned}
Q &= z^{\text{t-FNN}}_Q(X_{1:t}) = [q_1, \ldots, q_t]^\top \in \mathbb{R}^{t\times q}, \\
K &= z^{\text{t-FNN}}_K(X_{1:t}) = [k_1, \ldots, k_t]^\top \in \mathbb{R}^{t\times q}, \\
V &= z^{\text{t-FNN}}_V(X_{1:t}) = [v_1, \ldots, v_t]^\top \in \mathbb{R}^{t\times q},
\end{aligned}
$$

see also Figure 8.11. This allows us to define the attention head implied by input X 1:t
as follows, see (8.24),
$$
H = H(X_{1:t}) = \operatorname{softmax}\!\left( \frac{QK^\top}{\sqrt{q}} \right) V \in \mathbb{R}^{t\times q}.
\tag{8.28}
$$

5
Some authors suggest pre-processing the input by applying layer normalization; however, we omit
this step in our notation to keep it as simple as possible.


A transformer layer is constructed by combining the attention head with the augmented
input tensor through a skip-connection mechanism
X 1:t 7→ z skip (X 1:t ) = X 1:t + H(X 1:t ) ∈ Rt×q . (8.29)
After the attention mechanism, the transformed input is typically processed through a
series of additional layers. Generally, a normalization layer z norm is applied first, followed
by a time-distributed FNN layer z t−FNN having output dimension q.
In this setting, the output of the transformer layer can be expressed as
$$
z^{\text{trans}}(X_{1:t}) = z^{\text{skip}}(X_{1:t}) + \left( z^{\text{t-FNN}} \circ z^{\text{norm}} \right)\!\left( z^{\text{skip}}(X_{1:t}) \right),
\tag{8.30}
$$
based on skip-connection (8.29) and on attention head (8.28).


More layers can be employed at this stage to further enhance the model’s flexibility.
Moreover, drop-out layers can be included to regularize the network and mitigate the
risk of overfitting.

The layers described in (8.30) operate on the original input X 1:t , and this can be
formalized to the transformer layer

z trans : Rt×q → Rt×q , X 1:t 7→ z trans (X 1:t ). (8.31)

Figure 8.12: Graphical representation of a transformer architecture.

Figure 8.12 illustrates the transformer layer (8.30)-(8.31). The input X 1:t corresponds
to the blue box and the output z trans (X 1:t ) to the yellow box, and the feature extraction
by the transformer layer is sketched in the magenta box.
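A hedged Keras sketch of the transformer layer (8.30), using the built-in MultiHeadAttention and LayerNormalization layers in place of the hand-written components above; in particular, the internal query, key and value projections of MultiHeadAttention play the role of the time-distributed FNN layers z_Q, z_K and z_V, and the number of heads and layer sizes are arbitrary example choices.

```python
import keras  # or: from tensorflow import keras

t, q = 10, 8

inputs = keras.Input(shape=(t, q))                                               # input tensor X_1:t
attn = keras.layers.MultiHeadAttention(num_heads=1, key_dim=q)(inputs, inputs)   # attention head H(X_1:t)
skip = keras.layers.Add()([inputs, attn])                                        # skip-connection (8.29)
norm = keras.layers.LayerNormalization()(skip)                                   # normalization layer z^norm
ffn = keras.layers.Dense(q)(norm)                                                # FNN applied per time step (time-distributed)
outputs = keras.layers.Add()([skip, ffn])                                        # transformer layer output (8.30)

transformer_layer = keras.Model(inputs, outputs)
transformer_layer.summary()
```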

Multi-head transformers

A transformer layer can also have multiple attention heads, allowing the model to focus
on different parts of the input sequence simultaneously. Rather than computing a single


attention output, multi-head attention applies the attention mechanism multiple times in
parallel, with each attention head using different weights and parameters. That is, each
attention head operates on a separate set of query, key and value matrices, producing
multiple output tensors. These output tensors are then concatenated and projected once
more to generate the final attention result.
Formally, we can define the multi-head attention mechanism with nh attention heads as
follows. For each head j, we apply the attention mechanism to the input tensor X 1:t to
obtain the matrices Qj , Kj , Vj , which are derived from separate projections of the input
tensor. This gives attention heads for 1 ≤ j ≤ nh
Qj Kj⊤
!
Hj (X 1:t ) = softmax √ Vj ∈ Rt×q .
q
These attention head outputs are concatenated along the feature dimension, yielding the
multi-head (MH) attention output
$$
H^{\text{MH}}(X_{1:t}) = \operatorname{Concat}\big( H_1(X_{1:t}), H_2(X_{1:t}), \ldots, H_{n_h}(X_{1:t}) \big)\, W^O \in \mathbb{R}^{t\times q},
\tag{8.32}
$$
where $W^O \in \mathbb{R}^{n_h q\times q}$ is the output weight matrix. This multi-head attention output is
then incorporated into the subsequent layers, as in the original architecture. Specifically,
after computing the multi-head attention output, it is added to the input tensor X 1:t
using a skip-connection (8.29), followed by normalization and FNN layers (8.30).

Working with tensors and unstructured data for predictive modeling

• Bring all data in tensor form using feature tokenization, see Section 8.2.

• The tensors are processed by FNN layers, CNN layers, RNN layers and/or
(multi-)head transformer layers, see also Figure 8.3.

• Flatten all outputs of the previous layers to be able to concatenate them to


a vector, see Section 8.3.7.

• This vector is further processed through FNN layers to form the output, see
Figure 8.3.

• Based on the training, validation and test split, (U, V, T ), we train this
architecture using a strictly consistent loss function L, see Section 5.3.

Remark 8.2 (RNN vs. transformer). We close this section with a remark highlighting
a key difference between RNN layers and transformer layers. RNN layers have a natural
notion of time/position because RNN layers move sequentially across the time-series data
X 1:t . In contrast, attention layers do not respect time-causality, in the sense that any
query q u can communicate with any key ks through (8.26). To make transformers aware
of time, one typically adds a positional encoding to the input tensor, meaning that, e.g.,
the last column q of X 1:t ∈ Rt×q contains the (normalized) entries u/t ∈ [0, 1] on the
u-th row of X 1:t . This adds a notion of time to the algorithm, though it does not imply
that the algorithm will respect time causality. Time causality would only hold if a query
q u could only communicate with keys ks that occurred at or before time u, i.e., s ≤ u. ■


8.5.3 Transformers for tabular data


The transformer architecture discussed above cannot be directly applied to tabular data,
which limits its applicability to actuarial problems. Several extensions of transformers
for tabular data have been proposed in the literature. We focus on the model proposed
by Gorishniy et al. [86]; this has also been explored by Brauer [26] in an actuarial context.
The key innovation in the architecture of Gorishniy et al. [86] lies in the so-called feature
tokenization transformation, which converts the original covariates, both categorical and
continuous ones, into a format suitable for being processed by transformer layers.
We consider the raw tabular (vector) input $X$, which comprises both categorical and
continuous covariates. Specifically, it includes $q_c$ categorical covariates $(X_j)_{j=1}^{q_c}$ and $q_n =
q - q_c$ continuous (numerical) covariates $(X_j)_{j=q_c+1}^{q}$. We embed both types of covariates
into a b-dimensional Euclidean space Rb , where b is an integer (hyper-parameter) that
needs to be selected by the modeler; in fact, this is precisely the embedding dimension
as introduced in (2.15) and (8.1). We apply a b-dimensional entity embedding to each
categorical covariate $X_j$, $1 \le j \le q_c$, given by
$$
e^{\text{EE}}_j : A_j \to \mathbb{R}^b, \qquad X_j \mapsto e^{\text{EE}}_j := e^{\text{EE}}_j(X_j),
$$
where $A_j$ denotes the levels of categorical covariate $X_j$. Note that for every categorical
covariate we use the same embedding dimension $b$. This results in $b \sum_{j=1}^{q_c} |A_j|$ embedding
weights that need to be learned.
For the continuous covariates $X_j$, $q_c + 1 \le j \le q$, we also perform a $b$-dimensional
embedding. For this we select FNNs, for $q_c + 1 \le j \le q$, providing
$$
z_j : \mathbb{R} \to \mathbb{R}^b, \qquad X_j \mapsto z_j := z_j(X_j).
$$
This uses $(b+1)(q-q_c)$ network weights that need to be learned.


After processing both categorical and continuous covariates to $b$-dimensional embeddings,
we collect all these embeddings in a tensor
$$
X^\circ_{1:q} = \left[ e^{\text{EE}}_1, \ldots, e^{\text{EE}}_{q_c}, z_{q_c+1}, \ldots, z_q \right]^\top \in \mathbb{R}^{q\times b}.
$$
This tensor $X^\circ_{1:q}$ contains the transformed covariate information, and it serves as input
to subsequent layers in our model. It is further augmented by introducing an additional
component, the so-called classification (CLS) token. This additional component is
inspired by the bidirectional encoder representations from transformers (BERT) archi-
tecture discussed in Devlin et al. [55]. The purpose of the CLS token is to encode every
column $1 \le k \le b$ of the input tensor $X^\circ_{1:q} \in \mathbb{R}^{q\times b}$ into a single variable.
This results in the augmented input tensor
$$
X_{1:q+1} = \begin{bmatrix} X^\circ_{1:q} \\ c^\top \end{bmatrix}
= \left[ e^{\text{EE}}_1, \ldots, e^{\text{EE}}_{q_c}, z_{q_c+1}, \ldots, z_q, c \right]^\top \in \mathbb{R}^{(q+1)\times b},
\tag{8.33}
$$
where $c = (c_1, c_2, \ldots, c_b)^\top \in \mathbb{R}^b$ denotes the CLS token. Each of the scalars $c_k \in \mathbb{R}$
comprising the CLS token, $1 \le k \le b$, will encode one column of the input tensor
$X^\circ_{1:q} \in \mathbb{R}^{q\times b}$, i.e., it will provide a one-dimensional projection of the corresponding $k$-th


$q$-dimensional vector to a scalar $c_k \in \mathbb{R}$; this is illustrated in Figure 8.13 by the yellow
CLS token $c$.

Figure 8.13: Transposed augmented input tensor $X^\top_{1:q+1}$.
For further processing, only the information contained in the CLS token c will be for-
warded to make predictions, as it reflects a compressed (encoded) version of the entire
tensor information (after training of course). The augmented tensor X 1:q+1 serves as the
input to a transformer layer (8.31), that is,

z trans : R(q+1)×b → R(q+1)×b , X 1:q+1 7→ z trans (X 1:q+1 ).

This mapping can also be implemented using the multi-head attention mechanism. Pre-
dictions are derived considering only the final row of the output tensor $z^{\text{trans}}(X_{1:q+1})$,
which corresponds to the position of the CLS token before the transformer layer is ap-
plied. Let $c^{\text{trans}}(X) := z^{\text{trans}}_{q+1}(X_{1:q+1}) \in \mathbb{R}^b$ denote the CLS token after being processed
by the transformer layer. It encodes the tokenized information of the input covariates.
Through the attention mechanism within the transformer layer, interactions between
all covariates are captured and integrated into the CLS token. As a result, $c^{\text{trans}}(X)$
becomes an optimized representation of the raw tabular input data $X$

X 7→ ctrans (X).

The final step involves decoding this tokenized variable $c^{\text{trans}}(X)$ into a set of covariates
suitable for predicting the response variable $Y$. This decoding process is problem-specific.
For instance, Gorishniy et al. [86] used layer normalization, followed by a ReLU activation
and a one-dimensional FNN layer with linear activation, such that
$$
X \mapsto \mu^{\text{trans}}(X) = z^{\text{FNN}}\Big( \operatorname{ReLU}\big( z^{\text{norm}}(c^{\text{trans}}(X)) \big) \Big).
\tag{8.34}
$$
Of course, for claims counts and claims size modeling we would rather consider a different
architecture, using the log-link for $g^{-1}$ to ensure positivity, see Section 5.1. Figure 8.14
graphically illustrates all the blocks constituting the transformer architecture described
above.
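A compact numpy sketch of the feature tokenization step: each categorical covariate is mapped by an embedding lookup, each continuous covariate by its own (here linear, for simplicity; the text uses a one-layer FNN) map into R^b, and a CLS token row is appended as in (8.33); all weights are random placeholders that would normally be learned.

```python
import numpy as np

rng = np.random.default_rng(4)
b = 5                                                      # embedding dimension
levels = [3, 4]                                            # numbers of levels of the q_c = 2 categorical covariates
E = [rng.normal(size=(n, b)) for n in levels]              # entity embedding matrices
W = rng.normal(size=(2, b))                                # slopes of the q - q_c = 2 continuous embeddings
w0 = rng.normal(size=(2, b))                               # intercepts of the continuous embeddings
cls = rng.normal(size=(1, b))                              # CLS token c

def feature_tokenizer(x_cat, x_num):
    """Map raw covariates to the augmented input tensor X_{1:q+1} of shape (q + 1, b)."""
    rows_cat = [E[j][x_cat[j]] for j in range(len(x_cat))]            # entity embeddings e_j^EE
    rows_num = [w0[j] + W[j] * x_num[j] for j in range(len(x_num))]   # embeddings z_j of numeric covariates
    return np.vstack(rows_cat + rows_num + [cls])

X = feature_tokenizer(x_cat=[2, 0], x_num=[0.7, -1.3])
print(X.shape)   # (5, 5): q + 1 = 5 rows and b = 5 columns
```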

8.5.4 The credibility transformer


Recently, Richman et al. [188] proposed an extension of the standard transformer model
for tabular data, designed to align more closely with the actuarial concept of credibil-
ity. Specifically, it introduces a credibility mechanism into the transformer architecture,



Figure 8.14: Graphical representation of the transformer architecture for tabular data.

which is inspired by the seminal work of Bühlmann [34] and Bühlmann–Straub [36]; this
new architecture was named the credibility transformer (CT). The resulting architecture
presents some modifications to enhance model flexibility and to fully leverage the benefits
of the credibility mechanism.

Positional encodings

The first modification concerns the input tensor (8.33). Additional positional encodings
were added; this is a common modification for transformers to capture the notion of time
and/or position. However, unlike the sequential data typically processed by traditional
transformers, tabular data lacks a natural ordering. Indeed, in this context, positional
encodings are adapted to encode information specific to the covariates, ensuring the
model can receive additional information about the structure of the tabular data.
While sophisticated positional encoding mechanisms exist, such as the sine-cosine encod-
ing scheme proposed by Vaswani et al. [228], the CT architecture adopts a simpler approach
based on embedding layers. More formally, the embedding layer maps the position of
each covariate j ∈ {1, . . . , q} into a b-dimensional representation inducing the mapping
$$
e^{\text{pos}} : \{1, \ldots, q\} \to \mathbb{R}^b, \qquad j \mapsto e^{\text{pos}}_j := e^{\text{pos}}(j).
$$
This positional encoding scheme introduces qb additional parameters. These learned
representations are incorporated to augment the input tensor (8.33) obtained from the
feature tokenization transformation of the original covariates and the CLS token. In this
context, the augmented input tensor is represented as

$$
X^+_{1:q+1} =
\begin{bmatrix}
X^\circ_{1:q} & \begin{matrix} (e^{\text{pos}}_1)^\top \\ \vdots \\ (e^{\text{pos}}_q)^\top \end{matrix} \\[4pt]
c_1^\top & c_2^\top
\end{bmatrix}
\in \mathbb{R}^{(q+1)\times 2b},
$$


where the enlarged CLS token is now defined as $c = (c_1^\top, c_2^\top)^\top = (c_1, \ldots, c_{2b})^\top \in \mathbb{R}^{2b}$.
It is worth noting that the use of positional encoding increases the size of both the
augmented input tensor and the CLS token. A detailed discussion on the benefits of
incorporating positional encoding schemes can be found in Huang et al. [106]. This
augmented input tensor extended by the positional encoding is shown in Figure 8.15,
and it should be compared to Figure 8.13.

Figure 8.15: Transposed augmented input tensor with positional encoding $(X^+_{1:q+1})^\top$.

Transformer layer

The augmented tensor X + 1:q+1 represents the input to the standard transformer architec-
ture. Considering a single transformer layer, we have a mapping
$$
z^{\text{trans}} : \mathbb{R}^{(q+1)\times 2b} \to \mathbb{R}^{(q+1)\times 2b}, \qquad
X^+_{1:q+1} \mapsto z^{\text{trans}}(X^+_{1:q+1}).
$$

Within the transformer layer, the CT architecture of Richman et al. [188] introduces some
differences compared to the standard layer (8.30). Specifically, it incorporates additional
time-distributed FNN and normalization layers to increase the model’s flexibility. Start-
ing from $z^{\text{skip}}(X^+_{1:q+1})$ obtained as in (8.29), the output of the transformer layer used in
the CT architecture is then defined as
$$
z^{\text{trans}}(X^+_{1:q+1}) = z^{\text{skip}}(X^+_{1:q+1})
+ \left( z^{\text{norm}_2} \circ z^{\text{drop}_2} \circ z^{\text{t-FNN}_2} \circ z^{\text{drop}_1} \circ z^{\text{t-FNN}_1} \circ z^{\text{norm}_1} \right)\!\left( z^{\text{skip}}(X^+_{1:q+1}) \right).
\tag{8.35}
$$

In this process, the tensor is first normalized, resulting in z norm1 , and then processed
through a time-distributed FNN layer, denoted as z t-FNN1 , combined with drop-out layer
z drop1 . This is followed by a second time-distributed FNN layer, z t-FNN2 , with another
drop-out z drop2 . Finally, the process concludes with a second normalization step, z norm2 .
The result of these transformations is combined through a second skip-connection, pro-
ducing an output tensor with shape R(q+1)×2b .

Prior and transformer-based tokens

The next component of the CT architecture stands out from other models as it focuses on
implementing the core credibility mechanism. Unlike Gorishniy et al. [86], which relies


solely on the transformer-tokenized CLS token for predictions, the CT combines two
distinct versions of the CLS token through a credibility-based weighting scheme. More
precisely, the first version of the CLS token, referred to as the prior, is extracted before the
covariates undergo interactions through the attention mechanism. As a result, it reflects
the initial representation of the input covariates without any interactions between them.
The prior version is extracted from the value matrix V = [v 1 , . . . , v q+1 ]⊤ . To ensure
that this token is represented exactly in the same embedding space as the outputs of the
transformer, it is processed through the same transformations as the transformer layer
in equation (8.35), using the same weights.
This gives us the following representation for the prior token
$$
c^{\text{prior}} = \left( z^{\text{norm}_2} \circ z^{\text{drop}_2} \circ z^{\text{FNN}_2} \circ z^{\text{drop}_1} \circ z^{\text{FNN}_1} \circ z^{\text{norm}_1} \right)\!\left( v_{q+1} \right) \in \mathbb{R}^{2b},
$$

that is, we use the layers from (8.35), but we do not need its time-distributed versions
because we only process a single vector v q+1 . The second version of the CLS token is
derived after processing everything through the transformer layer. This version incor-
porates the effects of the attention mechanism, reflecting the interactions among the
covariates. The second version of the CLS token holds the tokenized information of the
covariates X as well as their positional embeddings.
This transformer-based token is given by
$$
c^{\text{trans}} = z^{\text{trans}}_{q+1}(X^+_{1:q+1}) \in \mathbb{R}^{2b},
$$
which corresponds to the $(q+1)$-st row of $z^{\text{trans}}(X^+_{1:q+1})$ being calculated according to
(8.35).
The two tokens, cprior and ctrans , provide two different representations of the data and
both can be used for predicting the response variable. The prior token cprior captures
only the initial covariate information, while the transformer-based token ctrans augments
this information by incorporating the interactions among the covariates.

Credibility mechanism

Both tokens cprior and ctrans are used for making predictions in the CT architecture,
with weights assigned to each representation. This process involves selecting a fixed
probability weight, α ∈ (0, 1), and sampling independent Bernoulli random variables
Z ∼ Bernoulli(α) during SGD training. These random variables determine which CLS
token is passed forward through the network to make predictions.
Specifically, the two tokens are combined as follows

ccred = Z ctrans + (1 − Z) cprior ∈ R2b . (8.36)

Thus, in α·100% of the gradient descent steps, the transformer-based token ctrans is used,
which has been augmented by covariate interactions. In the remaining (1−α)·100% of the
steps, the prior token cprior is selected. This mechanism effectively assigns a credibility
of α to the transformer-based token ctrans and a complementary credibility of 1 − α to
the prior token cprior in SGD, guiding the network to learn reasonable parameters during


training. This credibility token ccred then enters a decoder for prediction, similar to
(8.34).
The probability α is treated as a hyper-parameter and can be optimized via grid search.
One could select α > 1/2 to give greater weight to the tokenized covariate information,
reflecting its increased importance in the prediction process; in the examples of Richman
et al. [188], the best results have been obtain by a choice of roughly α = 95%.
The credibility mechanism in equation (8.36) is applied only during the SGD fitting. For
out-of-sample predictions, one sets Z ≡ 1, and uses the transformer-based token.
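A few numpy lines illustrating the Bernoulli mixing (8.36) during training and the deterministic choice at prediction time; the value of α, the token dimension and the token vectors are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, two_b = 0.95, 6
c_prior, c_trans = rng.normal(size=two_b), rng.normal(size=two_b)

# training: in each SGD step a Bernoulli(alpha) draw selects which token is forwarded
Z = rng.binomial(1, alpha)
c_cred_train = Z * c_trans + (1 - Z) * c_prior     # credibility mixing (8.36)

# out-of-sample prediction: Z is fixed to 1, i.e., the transformer-based token is used
c_cred_predict = c_trans
print(c_cred_train.shape, c_cred_predict.shape)
```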

The hidden credibility mechanism

The CT architecture introduces another and less obvious credibility mechanism which is
realized during the dot-product attention. To understand how it works, consider that,
according to equation (8.28), the last row of the attention head, denoted as $H_{q+1}$, is
obtained by multiplying the elements of $A_{q+1} = (a_{q+1,1}, \ldots, a_{q+1,q+1})$ with the corre-
sponding rows of the value matrix $V \in \mathbb{R}^{(q+1)\times 2b}$:
$$
H_{q+1} = A_{q+1} V.
\tag{8.37}
$$

Furthermore, since A results from applying the softmax operator (see (8.25)), the ele-
ments of the vector Aq+1 are strictly positive and satisfy the normalization condition
$$
\sum_{j=1}^{q+1} a_{q+1,j} = 1, \qquad \text{with } a_{q+1,j} > 0.
$$

In this context, the $k$-th element of the attention head $H_{q+1}$ can be expressed as a
convex combination of the elements in the $k$-th column of the value matrix $V$, with
coefficients given by $A_{q+1}$. Additionally, decomposing the value matrix $V$ into two parts,
the first part, $v^{\text{covariate}} \in \mathbb{R}^{q\times 2b}$, contains the first $q$ rows of $V$, while the second part,
$v_{q+1} \in \mathbb{R}^{2b}$, corresponds to the row associated with the CLS token, equation (8.37) can
be reformulated as:
$$
H_{q+1} = P\, v_{q+1} + (1-P)\, v^{\text{covariate}},
\tag{8.38}
$$
where $P = a_{q+1,q+1} \in (0,1)$ is the last element of the attention row $A_{q+1}$ and represents
the weight assigned to the CLS token. The remaining weight, $\sum_{j=1}^{q} a_{q+1,j} = 1 - P$, is
distributed across the covariate information. This formulation reveals that the attention
mechanism for the CLS token can be interpreted as a credibility-weighted average. The
CLS token’s own information (representing collective experience) is combined with the
covariates’ information (representing individual experience) according to their respective
credibility weights. In essence, this is a Bühlmann [34] type linear credibility formula, or a
dynamic version of it, with the credibility weights learned during training and depending
on the input.

Decoder

The final block of the CT architecture is the decoder that derives the predictions from the
representation $c^{\text{cred}}(X)$ given in (8.36). The proposal of Richman et al. [188] performs


this task considering an additional FNN layer z FNN and a suitable link-function g, so
that we obtain the regression function
$$
X \mapsto \mu^{\text{cred}}(X) = g^{-1}\!\left( \Big\langle w,\, z^{\text{FNN}}\big( c^{\text{cred}}(X) \big) \Big\rangle \right),
\tag{8.39}
$$

with readout parameter w. This is illustrated in Figure 8.16.


Figure 8.16: Graphical representation of the credibility transformer.

We conclude that transformers for tabular data (using the feature tokenization trans-
formation) and their credibility transformer extension present interesting network archi-
tectures for solving actuarial problems. The credibility transformer is inspired by a
Bühlmann [34] credibility mechanism, which is useful to stabilize and improve network
training. In fact, in examples, the credibility transformer has shown excellent perfor-
mance, though at higher computational costs than a classical FNN architecture. More-
over, the hidden credibility mechanism (8.38) allows for an integrated variable importance
measure; for further details, see Richman et al. [188].



Chapter 9

Unsupervised learning

9.1 Introduction
Unsupervised learning covers the topic of learning the structure in the covariates X
without considering any response variable Y . This means that unsupervised learning
aims at understanding the population distribution of the covariates X ∼ P. This can be
achieved by learning the inherent pattern in X from an i.i.d. learning sample L = (X i )ni=1 .
This learning sample is called unlabelled because it does not include any responses.
Broadly speaking, there are the following main tasks that can be studied and solved with
unsupervised learning.

(1) Dimension reduction. Starting from q-dimensional real-valued covariates X ∈ Rq ,


we may ask the question whether there is a lower dimensional representation of
X without a significant loss of information (i.e., only with a small reconstruction
error). E.g., if we have the three-dimensional covariates X = (X1 , X2 , X1 − X2 )⊤ ,
it is clear that we can drop one coordinate without any loss of information, because
two components are sufficient to fully describe X in this case. The most commonly
used technique for dimension reduction is the principal component analysis (PCA)
which is based on the singular value decomposition (SVD). The PCA provides a
linear approximation by projecting the covariates to a new orthonormal coordinate
system. A non-linear extension of the PCA is obtained by a bottleneck neural
network (BNN) that transforms the data non-linearly to receive a lower dimensional
representation of the original data at a minimal reconstruction error.

(2) Clustering. Clustering techniques are methods that are based on classifying (or bin-
ning), meaning that they group similar covariates X into (homogeneous) classes
(clusters, bins). This leads to a segmentation of a heterogeneous population into
homogeneous classes. Popular methods are hierarchical clustering methods, K-
means clustering, K-medoids clustering, distribution-based Gaussian mixture mod-
els (GMMs) clustering or density-based spatial clustering of applications with noise
(DBSCAN).

(3) Low-dimensional visualization. Low-dimensional visualization of high-dimensional


data has some similarity with the dimension reduction problem from the first item,


but instead of trying to minimize a loss of information (reconstruction error), these


methods rather aim at preserving the local structure (topology), so that a certain
adjacency relation in a neighborhood of a given instance is kept. Popular methods
are t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approxi-
mation and projection (UMAP) or self-organizing maps (SOM) also called Kohonen
map.

The previous methods are mainly used to understand, simplify and illustrate the covari-
ate distribution X ∼ P. In financial and insurance applications, unsupervised learning
methods can also be used for anomaly detection. Based on similarity measures, unsuper-
vised learning methods may be used for outlier detection, e.g., indicating fraud or other
abnormal structure or behavior in the data.
We give a selected overview of some of these unsupervised learning techniques, and for
more methods and a more in-depth discussion we refer the interested reader to the unsu-
pervised learning literature. A general difficulty in most unsupervised learning methods
is that they work well for real-valued vectors X ∈ Rq , but they struggle to deal with
categorical variables. Naturally, actuarial problems heavily rely on categorical covariates, and there is little guidance on how to deal with these categorical variables in an unsupervised learning context, e.g., how can we reasonably quantify the dissimilarity between different colors of cars? Section 2.3.2 discussed the pre-processing of categorical covariates, such as one-hot encoding, entity embedding and target encoding. This has been extended by the contextual entity embedding of Section 8.2.2. These
are possible ways of pre-processing categorical covariates before considering unsupervised
learning methods.

9.2 Dimension reduction methods


We start from a learning sample L = (X i )ni=1 having q-dimensional real-valued features
X i ∈ Rq . The goal is to understand whether this learning sample is contained in a lower
dimensional manifold (object or surface).
Figure 9.1 gives an example of a learning sample L = (X i )ni=1 in R2 . The features
(X i )ni=1 form a noisy unit circle. Therefore, if we know that the features roughly live on
the unit circle B1 = {x ∈ R^2 ; x_1^2 + x_2^2 = 1}, it is sufficient to record the angle in the polar
coordinate system. Based on this angle, we can almost perfectly reconstruct the original
feature X i .
Of course, this is a very simple example because it can nicely be plotted in R2 and
the underlying functional form can easily be identified. However, in general, we work
in high-dimensional spaces Rq , q ≫ 1, where such a graphical solution does not work.
First, finding such manifolds in high-dimensional spaces is difficult; typically, we cannot even determine the dimension of the embedded object. In some sense, this relates to the discussion on the effective dimension of FNNs, see the last item of the discussion on page 82. Second, though it looks simple, even Figure 9.1 may pose some challenges. A
problem is that the unit circle is a non-linear object in the Euclidean coordinate system,
and the most popular dimension reduction method, the PCA, is (only) designed to deal
with linear problems.


[Figure: scatter plot of the two-dimensional features X, forming a noisy unit circle.]

Figure 9.1: Learning sample L = (X_i)_{i=1}^n in R^2.

9.2.1 Standardization
Throughout this chapter, we assume to work with standardized data, meaning the fol-
lowing. Based on the learning sample L = (X_i)_{i=1}^n ⊂ R^q, we construct the design matrix

X = [X_1, . . . , X_n]^⊤ = \begin{pmatrix} X_{1,1} & \cdots & X_{1,q} \\ \vdots & & \vdots \\ X_{n,1} & \cdots & X_{n,q} \end{pmatrix} ∈ R^{n×q};    (9.1)

compared to (2.10), we drop the intercept column from the design matrix in this chapter.
The columns of this design matrix X describe different quantities, e.g., the first column
may describe the weight of the vehicle, the second one the age of the policyholder, etc.
Thus, these columns live on different scales (and units). Since we would like to apply
a unit-free dimension reduction technique to X, we need to standardize the columns
of X beforehand. For this standardization, we apply (2.18) to all covariate components, such that the columns of the resulting design matrix X are centered and have the same empirical variance. The empirical variance is either (n − 1)/n or 1, depending on which empirical standard deviation estimator is applied. Standardization ensures that all columns of the design matrix X live on the same scale and are unit-free before applying the PCA to this design matrix X.
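As a minimal illustration (with simulated data), this column-wise standardization can be done in R with the scale command; note that scale uses the empirical standard deviation with denominator n − 1.

X <- matrix(rnorm(100 * 3), nrow = 100, ncol = 3)    # simulated design matrix, n = 100, q = 3
X_std <- scale(X, center = TRUE, scale = TRUE)       # centered columns with unit empirical variance
colMeans(X_std)                                      # approximately zero
apply(X_std, 2, sd)                                  # equal to one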
A general difficulty in most of the unsupervised learning methods is the treatment of
categorical variables. They can be embedded using one-hot encoding, but if we have
many levels, the resulting design matrix X will be sparse, i.e., with lots of zero entries
(before standardization), and most unsupervised learning methods struggle with this
sparsity; intuitively, the higher the dimension q the more collinear sparse vectors become
and the less well-conditioned the design matrix X will be. Therefore, it is advantageous if
one can use a low-dimensional entity embedding (2.15) for categorical variables, though,
it is not always clear where one can get it from. E.g., if one uses a supervised learning
method for this entity embedding, the utilized target may not be the right label that
reflects the clustering that one wants to get. For instance, if one has different provinces,


there are many different targets (population density, average salary, area of the province,
average rainfall, etc.), and each target may lead to a different entity embedding.

9.2.2 Auto-encoders
The general principle of finding a lower dimensional object that encodes the learning
sample L = (X i )ni=1 can be described by an auto-encoder. An auto-encoder maps a
high-dimensional object X ∈ Rq to a low-dimensional representation, say, in Rp , p < q,
so that this dimension reduction still allows one to (almost perfectly) reconstruct the original
data X. To measure the loss of information, we introduce a dissimilarity function.
Definition 9.1 (dissimilarity function). A dissimilarity function

L(·, ·) : Rq × Rq → R+ , (X, X ′ ) 7→ L(X, X ′ ) ≥ 0, (9.2)

has the property that L(X, X ′ ) = 0 if and only if X = X ′ .

Note that we use the same notation L for the dissimilarity function as for the loss function
(1.5) because, essentially, they play the same role, only the input dimensions differ. In
general, a dissimilarity function does not need to be a proper distance function. That is,
a dissimilarity function does not need to be symmetric in its arguments, nor does it need
to satisfy the triangle inequality, e.g., the KL divergence is a non-symmetric example
that does not satisfy the triangle inequality.

Definition 9.2 (auto-encoder). Select dimensions p < q. An auto-encoder is a pair


(Φ, Ψ) of mappings, called encoder and decoder, respectively,

Φ : Rq → Rp and Ψ : Rp → Rq ,

such that their composition Ψ ◦ Φ has a small reconstruction error w.r.t. the chosen
dissimilarity function L(·, ·), that is,

X 7→ L (X, Ψ ◦ Φ(X)) is small for all instances X of interest. (9.3)

Strictly speaking, Definition 9.2 is not a proper mathematical definition, because the
clause in (9.3) is not a well-defined mathematical term. We allow ourselves to be a bit
sloppy at this point; we are just going to explain it, and it will not harm any mathematical
arguments below.
Naturally, a small reconstruction error in (9.3) cannot be achieved for all X ∈ Rq , because
the dimension reduction p < q always leads to a loss of information on the entire space
Rq . However, if all X “of interest” live in a lower dimensional object B1 of dimension
p, then it is possible to find an encoder (coordinate mapping) Φ : Rq → Rp so that one
can perfectly reconstruct the original object B1 by the decoder Ψ : Rp → Rq . We give
an example.

Example 9.3. We come back to Figure 9.1. If the covariates live on the unit circle, X ∈ B1 ⊂ R^2, the encoder (coordinate mapping) is given by Φ : R^2 → [0, 2π),
describing the angle in polar coordinates, and the decoder Ψ : [0, 2π) → B1 ⊂ R2
maps these angles to the unit circle. This auto-encoder preserves the information of the


polar angle and it loses the information about the size of the radius. However, if all
covariates X ∈ B1 “of interest” lie in the unit circle, the information about the radius
is not necessary, and, in fact, the object B1 is one-dimensional p = 1 in this case. In
other words, the auto-encoder Ψ ◦ Φ is the identity map on B1 , and generally it is an
(orthogonal) projection from R2 to B1 . We conclude that in this case, the interval [0, 2π)
is the low-dimensional representation of the unit circle B1 . ■
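Example 9.3 can be implemented in a few lines of R; the following sketch simulates a noisy unit circle and evaluates the reconstruction error of the polar-angle auto-encoder (the data and the noise level are illustrative only).

set.seed(1)
phi <- runif(500, 0, 2 * pi)                                    # true polar angles
X   <- cbind(cos(phi), sin(phi)) + matrix(rnorm(1000, sd = 0.05), ncol = 2)
Phi <- function(x) atan2(x[2], x[1]) %% (2 * pi)                # encoder: polar angle in [0, 2*pi)
Psi <- function(z) c(cos(z), sin(z))                            # decoder: map the angle back to B1
recon <- sapply(1:nrow(X), function(i) sum((X[i, ] - Psi(Phi(X[i, ])))^2))
mean(recon)                                                     # small average reconstruction error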

Working in Euclidean spaces Rq , one can take the classical Lk -norms, k ∈ (0, ∞], as
dissimilarity measures

L(X, X′) = ∥X − X′∥_k = \Big( \sum_{j=1}^{q} |X_j − X_j′|^k \Big)^{1/k}.    (9.4)

For k = 2, we have the Euclidean distance, k = 1 corresponds to the L1-distance (also known as the Manhattan distance), and the limiting case k = ∞ gives the supremum norm. The Euclidean distance is a special case of the Mahalanobis distance defined by

L(X, X′) = \sqrt{(X − X′)^⊤ A (X − X′)},    (9.5)

for a positive definite matrix A ∈ Rq×q .
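These dissimilarity functions are straightforward to implement; the following sketch (with hypothetical inputs) evaluates the L_k-norm (9.4) and the Mahalanobis-type dissimilarity (9.5) in base R.

Lk_dist  <- function(x, y, k) sum(abs(x - y)^k)^(1 / k)                     # L_k-norm dissimilarity (9.4)
mah_dist <- function(x, y, A) sqrt(as.numeric(t(x - y) %*% A %*% (x - y)))  # Mahalanobis distance (9.5)
x <- c(1, 2); y <- c(3, 0)
Lk_dist(x, y, k = 1)             # Manhattan distance
Lk_dist(x, y, k = 2)             # Euclidean distance
mah_dist(x, y, A = diag(2))      # equals the Euclidean distance for A = Id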

When it comes to categorical covariates, things start to become more difficult. Assume
that X is categorical taking K levels. Using one-hot encoding we can embed X 7→ X ∈
{0, 1}K into the K-dimensional Euclidean space RK . The Hamming distance counts the
number of positions where two one-hot encoded vectors X and X ′ differ. The Hamming
distance is equal to the Manhattan distance for binary vectors. Up to a factor of two,
the one-hot encoded Hamming distance is equal to

L(X, X′) = 1_{\{X ≠ X′\}}.    (9.6)

A disadvantage of this approach, which is equivalent to one-hot encoding, is that all mutual distances between the levels are equal, and there is no notion of similarity between
different levels. This is precisely the step where a low-dimensional entity embedding may
be useful, see Sections 2.3.2 and 8.2.2.
For multiple categorical covariates, one can scale them to bring them to the same units.
To account for the number of levels K, the scaled (probability) version is given by

L(X, X′) = \frac{1}{K^2} \, 1_{\{X ≠ X′\}}.
This is inspired by the fact that if X and X ′ are independent and uniform on {a1 , . . . , aK },
we receive the contingency Table 9.1 (lhs). For non-uniform i.i.d. categorical variables it
may be adapted to a weighted version as sketched on the right-hand side of Table 9.1

L(X, X′) = \frac{1}{p_X \, p_{X′}} \, 1_{\{X ≠ X′\}},

with (positive) categorical probabilities (p_{a_k})_{k=1}^K summing up to one.


X \ X′    a_1      · · ·   a_K        |    X \ X′    a_1                   · · ·   a_K
a_1       1/K^2    · · ·   1/K^2      |    a_1       1/p_{a_1}^2           · · ·   1/(p_{a_1} p_{a_K})
 ⋮          ⋮                ⋮         |     ⋮          ⋮                             ⋮
a_K       1/K^2    · · ·   1/K^2      |    a_K       1/(p_{a_K} p_{a_1})   · · ·   1/p_{a_K}^2

Table 9.1: Contingency tables of two i.i.d. categorical random variables.

9.2.3 Principal component analysis


Linear auto-encoder

The PCA is a linear dimension reduction technique that has different interpretations,
one of them is a linear auto-encoder interpretation, according to Definition 9.2. We
have decided to give the technical explanation behind the PCA that describes an explicit
construction that will result in the auto-encoder interpretation. Basically, this only uses
linear algebra.
Consider the design matrix X = [X 1 , . . . , X n ]⊤ given by (9.1). The rows of this design
matrix contain the transposed covariates X i ∈ Rq , 1 ≤ i ≤ n. Select the standard unit
basis (ej )qj=1 of the Euclidean space Rq . This allows us to write the covariates X i as
X_i = X_{i,1} e_1 + . . . + X_{i,q} e_q = \sum_{j=1}^{q} X_{i,j} e_j ∈ R^q.    (9.7)

The PCA represents these covariates X i in a different orthonormal basis1 (v j )qj=1 , i.e.,
by a linear transformation we can rewrite these covariates X i as
X_i = a_{i,1} v_1 + . . . + a_{i,q} v_q = \sum_{j=1}^{q} ⟨X_i, v_j⟩ v_j ∈ R^q,    (9.8)

for the new coefficients ai,j = ⟨X i , v j ⟩ ∈ R.


The two representations (9.7) and (9.8) are equivalent, however, in different parametriza-
tions (bases) of the (same) Euclidean space Rq .

The PCA performs the following steps that provide the dimension reduction at a minimal
loss of information:

(1) Select the encoding dimension p < q and define the encoder Φp : Rq → Rp by

X 7→ Φp (X) = (⟨X, v 1 ⟩, . . . , ⟨X, v p ⟩)⊤ ∈ Rp . (9.9)

That is, we only keep the first p principal components in the alternative orthonormal
basis (v j )qj=1 representation (9.8), and we truncate the rest.

(2) For the decoder Ψp : Rp → Rq we use a simple embedding


Z 7→ Ψ_p(Z) = \sum_{j=1}^{p} Z_j v_j = \sum_{j=1}^{p} Z_j v_j + \sum_{j=p+1}^{q} 0 · v_j ∈ R^q.
Footnote 1: An orthonormal basis of R^q is a set of normalized and orthogonal vectors (v_j)_{j=1}^q spanning the whole space R^q. In particular, we have the scalar (dot) product ⟨v_j, v_k⟩ = 1_{\{j=k\}}.


That is, we pad the vector Z ∈ Rp with zeros to get the right length q > p; in
network modeling this is called padding with zeros to length q, see Section 8.3.6.

(3) Composing the encoder Φp and the decoder Ψp gives us the auto-encoder
X 7→ Ψ_p ◦ Φ_p(X) = \sum_{j=1}^{p} ⟨X, v_j⟩ v_j ∈ R^q.    (9.10)

This is nothing else than an orthogonal projection of X to the sub-space spanned


by (v j )pj=1 . This auto-encoder has a reconstruction error of
X − Ψ_p ◦ Φ_p(X) = \sum_{j=p+1}^{q} ⟨X, v_j⟩ v_j,    (9.11)

that is, the reconstruction error is precisely determined by the components that
were truncated by the encoder Φp . The main idea behind the PCA is to select the
orthonormal basis (v j )qj=1 of Rq such that this reconstruction error (9.11) becomes
minimal (in some dissimilarity measure) over the learning sample L = (X i )ni=1 .
This is achieved by selecting the directions of the biggest variabilities in the design
matrix X.
We want to minimize the reconstruction error (9.11) aggregated over the entire learning sample L = (X_i)_{i=1}^n. For this, we select the squared L2-distance dissimilarity measure

L(X, X′) = ∥X − X′∥_2^2,

and for aggregating over the instances 1 ≤ i ≤ n, we simply take the sum of the dissimilarity terms. In view of (9.11), a straightforward computation gives us the total dissimilarity on the learning sample L, expressed through the design matrix X,

\sum_{i=1}^{n} L(X_i, Ψ_p ◦ Φ_p(X_i)) = \sum_{i=1}^{n} ∥X_i − Ψ_p ◦ Φ_p(X_i)∥_2^2 = \sum_{j=p+1}^{q} ∥X v_j∥_2^2.    (9.12)

The latter term tells us how we should select the orthonormal basis (v_j)_{j=1}^q: the terms ∥X v_j∥_2^2 should decrease as fast as possible in j, so that the truncated terms with j > p are as small as possible. This then implies that every linear auto-encoder Ψ_p ◦ Φ_p (i.e., for any 1 ≤ p ≤ q) has minimal total dissimilarity across the learning sample L. This requirement can be solved by recursive convex Lagrange problems.
A first orthonormal basis vector v 1 ∈ Rq is given by a solution of

v_1 ∈ arg max_{∥v∥_2 = 1} ∥X v∥_2^2,    (9.13)

and the j-th orthonormal basis vector v_j ∈ R^q, 2 ≤ j ≤ q, is computed recursively by

v_j ∈ arg max_{∥v∥_2 = 1} ∥X v∥_2^2   subject to ⟨v_k, v⟩ = 0 for all 1 ≤ k ≤ j − 1.    (9.14)

This solves the total reconstruction error (dissimilarity) minimization (9.12) for the linear
auto-encoders (9.10) simultaneously for all 1 ≤ p ≤ q, and the single terms ⟨X, v j ⟩ v j
are the principal components in the lower dimensional representations.


Singular value decomposition

At this stage, we could close the chapter on the PCA, because (9.10), (9.13) and (9.14)
fully solve the problem. However, there is a more efficient way of computing the PCA
than recursively solving the convex Lagrange problems (9.13)-(9.14). This alternative
way of computing the orthonormal basis (v j )qj=1 uses a singular value decomposition
(SVD) and the algorithm of Golub–Van Loan [82]; see Hastie et al. [93, Section 14.5.1].
The SVD is based on the following mathematical result:
There exist orthogonal matrices U ∈ Rn×q and V ∈ Rq×q , with U ⊤ U = V ⊤ V = Idq , and
a diagonal matrix Λ = diag(λ1 , . . . , λq ) ∈ Rq×q with singular values λ1 ≥ . . . ≥ λq ≥ 0
such that we have the SVD of X
X = U ΛV ⊤ . (9.15)
The matrix U is called left-singular matrix of X, and the matrix V is called right-singular
matrix of X. The crucial property of the SVD is that the column vectors of the right-
singular matrix V = [v 1 , . . . , v q ] ∈ Rq×q precisely give the orthonormal basis (v j )qj=1 that
we are looking for. This is justified by the computation

∥X v_j∥_2^2 = v_j^⊤ X^⊤ X v_j = v_j^⊤ V Λ^2 V^⊤ v_j = λ_j^2.    (9.16)

Crucially, the singular values are ordered λ_1 ≥ . . . ≥ λ_q ≥ 0, and the orthonormal basis vectors (v_j)_{j=1}^q are the eigenvectors of X^⊤X corresponding to the squared singular values λ_j^2. Thus, (9.16) shows that the first principal component captures the largest variability (i.e., truncating it would cause the biggest reconstruction error), this contribution is decreasing in j, and this minimizes the reconstruction error (9.11) for any p.

Example

We provide a two-dimensional example, and we use the R command svd which is included
in the base package of R [179].
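A minimal sketch of this computation (with simulated data replacing the learning samples of Figure 9.2) reads as follows.

X  <- scale(matrix(rnorm(200 * 2), ncol = 2))   # standardized (simulated) design matrix
s  <- svd(X)                                    # SVD X = U diag(lambda) V^T
s$d                                             # singular values lambda_1 >= lambda_2 >= 0
V  <- s$v                                       # right-singular matrix [v_1, v_2]
pc <- X %*% V                                   # principal components <X_i, v_j>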

[Figure: two panels, 'PCA: example 1' and 'PCA: example 2', each showing the observations X together with the first and second principal component directions.]

Figure 9.2: Two PCAs for two different learning samples L = (X_i)_{i=1}^n in R^2.

Figure 9.2 shows two PCAs for two different learning samples L = (X i )ni=1 in R2 , i.e., in
two dimensions q = 2. The plot on the left-hand side has a very dominant first principal


component, and the first basis vector v 1 can fairly well describe the data. The plot on the
right-hand side shows much more dispersion, and the singular values λ1 ≈ λ2 . Therefore,
we need both basis vectors v 1 and v 2 to accurately describe the second learning sample.
The principal components are obtained by computing ⟨X_i, v_j⟩, i.e., by orthogonally projecting the black dots X_i in Figure 9.2 onto the red (j = 1) and orange (j = 2) lines, respectively.

The right-singular matrix V = [v 1 , . . . , v q ] of the SVD of X is computed with the


algorithm of Golub–Van Loan [82], from which we can calculate the representa-
tion (9.8) and the corresponding principal components ⟨X, v j ⟩ v j of the PCA. The
magnitudes of the singular values (λj )qj=1 determine how many principal compo-
nents we need to select for a good description of the learning sample L.

9.2.4 Bottleneck neural network


The PCA presented in the last section is a linear auto-encoder. A natural attempt to
receive a non-linear auto-encoder is to set-up a FNN architecture that aims at compress-
ing the input data at a low reconstruction error; this has been introduced and studied
by Kramer [126] and Hinton–Salakhutdinov [97].
Select a deep FNN architecture z (d:1) = z (d) ◦ · · · ◦ z (1) , see (5.3). This deep FNN
architecture needs to have two special features:

(i) The output dimension is equal to the input dimension, qd = q0 = q, where q is the
dimension of the covariates X ∈ Rq .

(ii) One FNN layer z (m) , 1 ≤ m < d, should have a very small dimension (number
of units) qm = p < q. This dimension corresponds to the number of principal
components we consider in (9.9)-(9.10) for the PCA, and z (m) is called the bottleneck
of the FNN. This bottleneck has size p, being a hyper-parameter selected by the
modeler.

These two features give us the BNN encoder


 
X 7→ Φp (X) := z (m:1) (X) = z (m) ◦ · · · ◦ z (1) (X) ∈ Rp ,

and the BNN decoder, for Z ∈ Rp ,


 
Z 7→ Ψp (Z) := z (d) ◦ · · · ◦ z (m+1) (Z) ∈ Rq .

Composing the BNN encoder and BNN decoder, provides us with the BNN auto-encoder
Ψp ◦ Φp : Rq → Rq given by
 
X 7→ Ψp ◦ Φp (X) = z (d) ◦ · · · ◦ z (1) (X).

Figure 9.3 illustrates a BNN auto-encoder of depth d = 4, input dimension q0 = q4 = 5,


and with a bottleneck of size q2 = p = 2.
This BNN auto-encoder can be trained on the learning sample L = (X i )ni=1 using the
SGD algorithm as described in Section 5.3.1. As loss function we select a dissimilarity


[Figure: a fully-connected BNN auto-encoder mapping the five inputs X1, . . . , X5 through a two-dimensional bottleneck back to the five outputs x1, . . . , x5.]

Figure 9.3: BNN auto-encoder with bottleneck dimension qm = p = 2, m = 2; this graph was plotted with [71].

function, so that the SGD algorithm tries to minimize the reconstruction error w.r.t. the
selected dissimilarity function. The low-dimensional representation of the data is then
received by evaluating the bottleneck of the trained FNN
{Φ_p(X_i)}_{i=1}^n = {z^{(m:1)}(X_i)}_{i=1}^n ⊂ R^p.

The advantage of the BNN auto-encoder over the PCA is that the BNN auto-encoder can deal with non-linear structures in the data, provided we select a non-linear activation function ϕ. The disadvantage clearly is that the BNN auto-encoder treats the bottleneck dimension p as a hyper-parameter that needs to be selected before training the BNN. Changing this hyper-parameter requires refitting a new BNN architecture, i.e., in contrast to the PCA, we do not simultaneously get the results for all dimensions 1 ≤ p ≤ q. Moreover, there is also no notion of singular values (λ_j)_{j=1}^q that quantifies the significance of the principal components; one has to evaluate the reconstruction error for every bottleneck dimension p to find a suitable size of the bottleneck, i.e., one with an acceptable reconstruction error.
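A BNN auto-encoder can, e.g., be fitted with the R interface to Keras. The following is only a hedged sketch: it assumes the keras package is installed, and the layer sizes, activation function and training settings are illustrative choices rather than a recommended architecture.

library(keras)
q <- 5; p <- 2
model <- keras_model_sequential() %>%
  layer_dense(units = 4, activation = "tanh", input_shape = q) %>%
  layer_dense(units = p, activation = "tanh", name = "bottleneck") %>%   # bottleneck layer
  layer_dense(units = 4, activation = "tanh") %>%
  layer_dense(units = q)                                    # output dimension equals input dimension
model %>% compile(loss = "mse", optimizer = "adam")         # squared Euclidean dissimilarity
X <- scale(matrix(rnorm(1000 * q), ncol = q))               # hypothetical standardized data
model %>% fit(X, X, epochs = 50, batch_size = 64, verbose = 0)   # the covariates are their own response
encoder <- keras_model(inputs = model$input,
                       outputs = get_layer(model, "bottleneck")$output)  # extract the bottleneck
Z <- predict(encoder, X)                                    # n x p low-dimensional representation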
We close with remarks:

• Hinton–Salakhutdinov [97] have noticed that the gradient descent training of BNN
architectures can be difficult, and it can likely result in poorly trained BNNs. There-
fore, Hinton–Salakhutdinov [97] propose to use a BNN architecture that is sym-
metric in the bottleneck layer w.r.t. the number of neurons in all FNN layers. That
is, select an architecture with qm−k = qm+k for all 1 ≤ k ≤ m and d = 2m; Figure
9.3 gives such an example with d = 4. Training of BNNs can then successively
be done by recursively collapsing layers (keeping symmetric FNNs); we refer to
Wüthrich–Merz [243, Section 7.5.5] for more details.

• If one chooses the linear activation function for ϕ, the bottleneck will represent a linear space of dimension p (provided that all other FNN layers have more units),


and we receive a linear data compression. The result is basically the same as the
one from the PCA, with the difference that the BNN auto-encoder does not give
a representation in the orthonormal basis (v j )pj=1 , but it gives the (same) results
in the same linear subspace in a different parametrization (that is not explicitly
specified). The reason for this is that the BNN auto-encoder does not have any
notion of orthonormality, but this would need to be enforced by regularization
during training.

9.2.5 Kernel principal component analysis


The PCA is a linear dimension reduction technique, and one might look for a non-linear
variant that is in the spirit of the PCA. To arrive at a non-linear PCA, the idea is to
consider a feature map
F : Rq → Rd , X 7→ F (X), (9.17)

where we typically are thinking of a higher dimensional feature embedding, d > q; in fact,
typically, this higher dimensional space will be infinite-dimensional, but for the moment
it is sufficient to think about a finite large d. We give an example that is related to
support vector machines (SVMs).

[Figure: two panels, 'feature map: NO' showing the four instances on the real line, and 'feature map: YES' showing the same instances after the two-dimensional embedding, separated by a horizontal line.]

Figure 9.4: Feature map F : R → R^2 with x 7→ F(x) = (x, x^2/4)^⊤.

Figure 9.4 (lhs) shows four real-valued instances (Xi )4i=1 ⊂ R. There are two of red type
and two of blue type. On the real line R, the red dots cannot be separated from the
blue ones by one partition of R. Figure 9.4 (rhs) shows the situation after applying the
feature map Xi 7→ F (Xi ) = (Xi , Xi2 /4)⊤ ∈ R2 . After this two-dimensional embedding
we can separate the red from the blue dots by a single partition illustrated by the orange
horizontal line. Such a feature map F is the first part of the idea behind the non-linear
kernel PCA, i.e., this higher dimensional embedding gives the necessary flexibility.

Main problem: How can we find a suitable feature map F ?

Part of the answer to this question is:


It is not necessary to explicitly select the feature map F , but for our problem it suffices
to know the (implied) kernel

K : X × X → R, (X, X′) 7→ K(X, X′) = ⟨F(X), F(X′)⟩.    (9.18)

This is called the kernel trick, which means that for such types of problems, it is sufficient
to know the kernel K. In many cases, it is simpler to directly select this kernel K, instead
of trying to find a suitable feature map F .

We are going to explain why knowing the kernel K is sufficient, but before that we
start by assuming that the feature map F : Rq → Rd is known. This gives us the new
(embedded) features (F (X i ))ni=1 ⊂ Rd , and we can construct the new design matrix in
this bigger dimensional space
Z = [F (X 1 ), . . . , F (X n )]⊤ ∈ Rn×d .
Based on this embedding, we can perform a PCA on these embedded new features.
Following the recipe of the PCA, see Section 9.2.3, we can find the principal components
using a SVD of Z, providing us with the right-singular matrix with column vectors (v_j^F)_{j=1}^d. This is then used to define the encoder Φ_p, see (9.9), which gives us the PCA dimension reduction for p ≤ d

X 7→ F(X) 7→ Φ_p(F(X)) = (⟨F(X), v_1^F⟩, . . . , ⟨F(X), v_p^F⟩)^⊤ ∈ R^p.    (9.19)

Up to this stage, it seems that we need to know the feature map F : Rq → Rd to compute
the individual principal components in (9.19). The crucial (next) step of the kernel trick
is that the kernel K is sufficient to compute ⟨F (X), v Fj ⟩, and the explicit knowledge of
the feature map F is not necessary. This precisely provides the (non-linear) kernel PCA
of Schölkopf et al. [200].
There are two crucial points that make the kernel trick work. These two points need
some mathematical arguments which we are going to skip here, the interested reader is
referred to Schölkopf et al. [200]. Before discussing these two points (a) and (b), we need
to slightly generalize the feature map F introduced in (9.17). Typically, this feature map
F : Rq → H maps to an infinite-dimensional Hilbert space H. This Hilbert space H
allows for the kernel K construction (9.18) because any Hilbert space is equipped with
a scalar product. Of course, the finite-dimensional space Rd , selected in (9.17), is one
example of a Hilbert space H. Now we are ready to discuss the two points (a) and (b)
that make it possible to compute (9.19).
(a) First, there exist vectors αj = (αj,1 , . . . , αj,n )⊤ ∈ Rn , j ≥ 1, that allow one to write
the (eigen-)vectors (v Fj )j≥1 as follows
v_j^F = \sum_{i=1}^{n} α_{j,i} F(X_i),    (9.20)

i.e., the vectors (v Fj )j≥1 are in the span of the new features (F (X i ))ni=1 . Inserting
this gives us for j ≥ 1
⟨F(X), v_j^F⟩ = \sum_{i=1}^{n} α_{j,i} K(X, X_i).    (9.21)


Note: this only requires the (implied) kernel K, but not the feature map F itself.

(b) Second, we need to compute (αj )j≥1 . Define the Gram matrix (kernel matrix)

K = [K(X i , X i′ )]1≤i,i′ ≤n ∈ Rn×n .

Solve the eigenvalue problem


K a = λ^K a,

and normalize the eigenvectors a_j of the non-zero eigenvalues λ_j^K > 0 as follows

α_j = \frac{1}{\sqrt{λ_j^K \, a_j^⊤ a_j}} \, a_j.    (9.22)

These are precisely the vectors αj ∈ Rn needed in (9.21).

In summary, we can compute (9.21)-(9.22) directly from the kernel K, without the explicit
use of the feature map F . Thus, the kernel PCA dimension reduction (9.19) is fully
determined by the kernel K, which also justifies the name kernel PCA.
Popular kernels in practice are polynomial kernels of order k (with b ∈ R) or radial
Gaussian kernels (with γ > 0) given by, respectively,

K(X, X′) = (b + ⟨X, X′⟩)^k,    (9.23)

K(X, X′) = exp{ −γ ∥X − X′∥_2^2 }.    (9.24)

For k = 1 (and b = 0) we have the linear kernel used in the standard PCA. Based on
these kernel selections we can directly solve the kernel PCA by using (9.21)-(9.22), and
we receive the kernel PCA dimension reduction (9.19) without an explicit choice of the
feature map F .
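The kernel PCA can be implemented in a few lines of base R; the following sketch (with simulated covariates and an illustrative choice of γ) follows (9.21)-(9.22) for the Gaussian kernel (9.24) and omits the normalization of the Gram matrix discussed below.

X     <- scale(matrix(rnorm(300 * 2), ncol = 2))        # hypothetical standardized covariates
gamma <- 1
K     <- exp(-gamma * as.matrix(dist(X))^2)             # Gram matrix with Gaussian kernel (9.24)
eig   <- eigen(K, symmetric = TRUE)                     # eigenvectors a_j and eigenvalues lambda_j^K
keep  <- eig$values > 1e-8                              # keep the non-zero eigenvalues only
alpha <- sweep(eig$vectors[, keep], 2, sqrt(eig$values[keep]), "/")   # normalization (9.22)
pc    <- K %*% alpha                                    # kernel principal components (9.21)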

Remark 9.4 (Mercer kernel). There is one question though. Namely, are the selected
kernels in (9.23)-(9.24) “valid” kernels, i.e., are these (artificial) kernel choices implied
by a feature map F : Rq → H according to (9.18)? If there does not exist such a feature
map F for a selected kernel K, then the presented theory may not hold for that kernel
K. The Moore–Aronszajn [6] theorem gives sufficient conditions to solve this question.
For this, we first need to introduce the Mercer kernel [154]. Assume X is a metric space.
A mapping K : X × X → R is a Mercer kernel if it is continuous in both arguments,
symmetric and positive semi-definite.2 The theorem of Moore–Aronszajn tells us that
for every Mercer kernel there is a so-called reproducing kernel Hilbert space (RKHS) for
which we can select a feature map F that has this Mercer kernel K as implied kernel.
Thus, any Mercer kernel K is a valid selection as it can be generated by a feature map
F ; see, e.g., Andrès et al. [3] for more details. ■

There are a couple of points that one should consider. The (kernel) PCA is usually
performed on standardized matrices because this provides better results, i.e., essentially
we should focus on the correlation/dependence structure, see Section 9.2.1. For the
Footnote 2: Positive semi-definite means that for all finite sequences of instances (X_i)_{i=1}^n and for all sequences a ∈ R^n we have \sum_{i,i′=1}^{n} a_i a_{i′} K(X_i, X_{i′}) ≥ 0.


kernelized version this means that one often replaces the Gram matrix K by a normalized
version
e = K − 1n K − K1n + 1n K1n ,
K (9.25)

where the matrix 1n ∈ Rn×n has all elements equal to 1/n.


Compared to the classical PCA, there are two disadvantages of the kernel PCA. First,
the kernel PCA is computationally very demanding for high sample sizes n, because it
involves eigenvalue decompositions of matrices of size n × n. Therefore, often the data
is first compressed by a clustering algorithm, such as K-means clustering discussed in
the next section. Second, since the feature map F is not explicitly given, we cannot
compute the eigenvectors (9.20), and, as a result, the reconstruction error is not easily
tractable. Additionally, the eigenvalues are not directly helpful to determine the number
of necessary principal components p to have a small reconstruction error, because these
eigenvalues are computed in a different space.

Example 9.5. We give an example of a kernel PCA. Assume that the covariates L =
(X i )ni=1 ⊂ R2 are two-dimensional, and they are related to three circles with different
radii.

[Figure: scatter plot of the two-dimensional features X, forming three concentric circles.]

Figure 9.5: Learning sample L = (X_i)_{i=1}^n in R^2 with three circles of different radii.

Figure 9.5 shows the learning sample L. This learning sample cannot be partitioned
into the different colors by a simple splitting with hyperplanes (straight lines), and both
coordinates are necessary to distinguish the differently colored dots. This implies that
the first principal component of a (linear) PCA is not sufficient to describe the different
instances. This is verified in Figure 9.6 (lhs).
Figure 9.6 (middle and rhs) show a polynomial kernel PCA, with k = 2 and b = 1, and a
Gaussian kernel PCA, with γ = 1, respectively. We now observe that the first principal
component (on the x-axis) can separate the different colors fairly well, and in these two
cases this first principal component (from the kernel PCA) might be sufficient to describe
the data. This analysis does not use the standardized version of K. ■


[Figure: three panels plotting the first against the second principal component, with panel titles '(linear) PCA', 'polynomial kernel PCA' and 'Gaussian kernel PCA'.]

Figure 9.6: Kernel PCAs: (lhs) linear PCA, (middle) polynomial kernel PCA with k = 2, (rhs) Gaussian kernel PCA.

9.3 Clustering methods


Clustering methods aim at grouping similar covariates X into (homogeneous) clusters
(also called groups, classes or bins), leading to a segmentation of the covariate space
X ⊂ Rq ; this is similar to a regression tree partition of the covariate space, we refer to
Figure 6.2. Assume that we aim at partitioning the covariate space X into K disjoint
clusters. We introduce a classifier

CK : X → K := {1, . . . , K}, X 7→ CK (X). (9.26)

This gives us a partition (Xk )k∈K of the covariate space X by defining for all k ∈ K the
clusters
Xk = {X ∈ X ; CK (X) = k} ; (9.27)

for an illustration see Figure 9.7. This is essentially equivalent to the regression tree partition (6.1); the main difference lies in its specific construction. For the regression tree
construction in Chapter 6 we use the responses Y to construct the partition, and in the
clustering methods we use the covariates X themselves to define the clustering through a
dissimilarity function. That is, we aim at choosing the classifier CK such that the result-
ing dissimilarities within all clusters (Xk )k∈K are minimal. Sometimes this is also called
quantization, meaning that all covariates X i ∈ Xk can be represented sufficiently accu-
rately by a so-called model point ck ∈ Xk , and actuarial modeling is then only performed
on these model points (ck )k∈K . This is a quite common approach in life insurance port-
folio valuation, that helps to simplify the complexity of valuation of large life insurance
portfolios.

One distinguishes different types of clustering methods. There is (1) hierarchical cluster-
ing, (2) centroid-based clustering, (3) distribution-based clustering and (4) density-based
clustering.

(1) Hierarchical clustering. Hierarchical clustering methods are based on recursive


algorithms. The final number K of clusters is not a-priori given as a hyper-parameter,
but it is determined by a stopping rule. This hierarchical clustering is performed in


[Figure: two scatter plots illustrating steps (1a) and (1b) of the K-means algorithm for K = 4, with panel titles 'Step (1a) of K-means for K=4' and 'Step (1b) of K-means for K=4', showing the initial and updated cluster centers.]

Figure 9.7: Illustration of K-means clustering for CK : R^2 → K with K = 4: the black dots illustrate the cluster centers of step (1a) of the K-means clustering algorithm, and the orange dots show step (1b); this figure is taken from [183].

a tree construction-like manner, and there are two opposite ways of deriving such a
segmentation: (i) divisive clustering and (ii) agglomerative clustering.
(i) Divisive clustering is a top-down approach of constructing clusters. Analogously to
the recursive partitioning algorithm for regression trees, see Chapter 6, a divisive clus-
tering method partitions non-homogeneous clusters into more homogeneous sub-groups
w.r.t. the chosen dissimilarity measure, and a stopping rule decides on the resulting
number K of clusters. The most basic divisive clustering algorithm is called DIvisive
ANAlysis (DIANA) clustering; see Kaufman–Rousseeuw [115, Chapter 6].
(ii) Agglomerative clustering is a bottom-up method that dynamically grows bigger clus-
ters by merging sets that are similar until the algorithm is stopped. A basic agglomer-
ative algorithm is called AGglomerative NESting (AGNES) clustering; see Kaufman–
Rousseeuw [115, Chapter 5].
Typically, the resulting clusterings are illustrated in so-called dendrograms, which look the
same as regression trees, see Figure 6.1. Both algorithms are usually run in a greedy manner, meaning that one tries to find the next optimal step in each loop of the recursive algorithm, i.e., the next optimal partition or fusion for the divisive and agglomerative algorithm, respectively. Usually, this results in time-consuming algorithms, and, if they converge, they likely converge to local optima and not to the global one; the argument for this is similar to the one for the gradient descent algorithm, see Section 5.3. Note that if there are n instances that we want to partition into two non-empty sets, there are 2^{n−1} − 1 possibilities. Therefore, finding the next best split can be computationally exhausting, and this search is therefore often combined with other methods such as K-means clustering.

(2) Centroid-based clustering. For centroid-based clustering methods, one specifies


a fixed number K of clusters, and, depending on the method, different optimal cluster


centers are selected for these K clusters. The partitioning is then obtained by allocating
all instances to these cluster centers w.r.t. some distance or similarity measure. We
will present K-means and K-medoids clustering; these methods mainly differ in how the
cluster centers are selected.

(3) Distribution-based clustering. Centroid-based methods implicitly assume that


the dissimilarity is a hard boundary decision. Distribution-based clusterings provide
more flexibility in the sense that they allow for soft boundaries by adding some noise to
the instances, accounting for the fact that some of the instances may, by coincidence,
sit on the wrong side of the boundary. The most commonly used method is Gaussian
mixture model (GMM) clustering, for which the K cluster centers of the centroid-based
methods are replaced by K Gaussian distributions.

(4) Density-based clustering. Density-based clustering does not assume a fixed


number K of clusters, but similarly to agglomerative clustering, these clusters are de-
termined dynamically. The idea behind density-based clustering is that every core of a
cluster should be surrounded by a sufficiently dense set of instances to qualify to be a
core of a cluster. The DBSCAN algorithm discussed below is of this type.
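The distribution-based and density-based methods just described are available in contributed R packages; the following is only a hedged sketch assuming the mclust and dbscan packages are installed, with purely illustrative arguments.

library(mclust)
library(dbscan)
X   <- scale(matrix(rnorm(400 * 2), ncol = 2))   # hypothetical standardized covariates
gmm <- Mclust(X, G = 3)                          # GMM clustering with K = 3 Gaussian components
db  <- dbscan(X, eps = 0.3, minPts = 5)          # density-based clustering, K not fixed a-priori
table(gmm$classification)                        # GMM cluster allocation
table(db$cluster)                                # DBSCAN clusters; cluster 0 collects the noise points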

9.3.1 Hierarchical clustering


We present the most basic divisive and agglomerative hierarchical clustering algorithms
in this section.

Divisive clustering algorithm

A divisive clustering algorithm is the DIANA algorithm which we briefly discuss in this
section. Select a dissimilarity function L : X × X → R+ on the covariate space X ⊂ Rq .
We are going to recursively partition the learning sample L = (X i )ni=1 ⊂ X .

A remark on the notation. The classifier (9.26) gives a partition (Xk )k∈K of the
covariate space X ⊂ Rq , see (9.27). Naturally, we can only construct an exact partition
on the (finite) learning sample L = (X i )ni=1 ⊂ X . Thus, in fact, we will (only) construct
the finite clusters
Xk ∩ L = {X ∈ L; X ∈ Xk } . (9.28)

To keep the notation simple in this outline, we will not distinguish the notation of the
clusters Xk , given in (9.27), and their finite sample counterparts Xk ∩ L, given in (9.28).
Moreover, we will also use the same notation for the indices of the instances in the clusters

Xk = {i ∈ {1, . . . , n}; X i ∈ Xk } . (9.29)

Thus, Xk has the three different meanings (9.27), (9.28) and (9.29), but from the context
it will always be clear which version we use.

We now discuss the DIANA algorithm of Kaufman–Rousseeuw [115, Chapter 6].


Initialization. We initialize X1 = L. Similar to the regression tree this is the root


cluster, and we initialize the index set K = {1}, giving us one big cluster with K = 1.

Recursive partitioning iteration. Assume we have the current partition (clusters)


(Xk )k∈K of the learning sample L. This partition has K clusters, and in the next recursive
step K → K +1, we are going to expand by one cluster by partitioning one of the existing
clusters. This recursive partitioning is going to be done in two steps: (i) find the cluster
k ∗ ∈ K that is going to be partitioned, and (ii) construct the explicit partitioning of this
cluster Xk∗ (which is again done recursively). We explain these two steps.

(i) For the given dissimilarity function L, compute the diameters of all clusters, k ∈ K,

δ_k = max_{i, i′ ∈ X_k} L(X_i, X_{i′}).    (9.30)

This diameter δk ≥ 0 gives the maximal dissimilarity between the instances within
the same cluster Xk . If maxk∈K δk = 0, we stop the algorithm because all clusters
only contain completely similar instances. Otherwise, select the cluster with the
biggest diameter
k^* = arg max_{k ∈ K} δ_k,
with a deterministic rule if there is more than one argument that maximizes this
problem. This provides us with the cluster Xk∗ that we would like to split in the
recursive step K → K + 1. Naturally, |Xk∗ | > 1, because δk = 0 for all clusters
that only contain one instance.

(ii) We construct the explicit partitioning of Xk∗ by an inner recursive algorithm.


Search for the instance i ∈ Xk∗ that is the least similar one to all other instances
in this cluster
i^* = arg max_{i ∈ X_{k^*}} \sum_{i′ ∈ X_{k^*} \setminus \{i\}} L(X_i, X_{i′}),

with a deterministic rule if there is more than one such instance. This defines the
initialization of the inner loop by setting up a new cluster XK+1 = {X i∗ } and
reducing the existing cluster by setting Xk′ ∗ = Xk∗ \ XK+1 .
If |Xk′ ∗ | > 1, we may migrate more instances from Xk′ ∗ to XK+1 by recursively
computing for all instances i ∈ Xk′ ∗ the differences of the average dissimilarities on
these two clusters Xk′ ∗ and XK+1 , that is,
∆(i) = \frac{1}{|X′_{k^*}| − 1} \sum_{i′ ∈ X′_{k^*} \setminus \{i\}} L(X_i, X_{i′}) − \frac{1}{|X_{K+1}|} \sum_{i′ ∈ X_{K+1}} L(X_i, X_{i′}).

If max_{i ∈ X′_{k^*}} ∆(i) ≤ 0, we stop the inner migration loop because migrating does not decrease the average dissimilarity. Otherwise, we select

i^* = arg max_{i ∈ X′_{k^*}} ∆(i),

with a deterministic rule if there is more than one maximizer. We migrate this
instance i∗ to the new cluster by setting XK+1 ← XK+1 ∪ {X i∗ } and we reduce the


existing cluster to Xk′ ∗ = Xk∗ \ XK+1 . This inner loop is iterated until the stopping
rule is met or until |Xk′ ∗ | = 1. We then update the number of clusters K ← K + 1,
the reduced cluster Xk∗ ← Xk′ ∗ = Xk∗ \ XK+1 and we add the new cluster XK+1 ,
giving us the new partition (X_k)_{k=1}^{K+1} of X.
These two steps (i)-(ii) are iterated until a stopping rule is met, or until there is no
dissimilarity left on the existing clusters maxk∈K δk = 0. If we do not install a stopping
rule, this algorithm will naturally terminate once all instances on all clusters are fully
similar, and we end up with at most K ≤ n = |L| clusters.
To define the diameter in (9.30), we consider the biggest dissimilarity between two in-
stances in the same cluster. Of course, we could use many other definitions, e.g., we
could consider the average dissimilarity as an alternative
δ_k = \frac{1}{|X_k|^2} \sum_{i, i′ ∈ X_k} L(X_i, X_{i′}).

Agglomerative clustering algorithm

We briefly describe the most basic agglomerative clustering algorithm, also known as
AGNES; see Kaufman–Rousseeuw [115, Chapter 5]. Agglomerative means that we let
the clusters grow starting from the individual instances.

Initialization. We initialize Xi = {X i } with index set i ∈ K = {1, . . . , n}, that is,


each instance 1 ≤ i ≤ n forms its own cluster.

Recursive fusion iteration. We recursively merge the clusters that have the smallest mutual dissimilarity. Therefore, we define the average dissimilarity between two clusters X_k and X_l, for k, l ∈ K, as follows

δ_{k,l} = \frac{1}{|X_k| |X_l|} \sum_{i ∈ X_k} \sum_{i′ ∈ X_l} L(X_i, X_{i′}),    (9.31)

this is called the unweighted pair-group method with arithmetic mean (UPGMA). This UPGMA allows us to select the two most similar clusters

(k^*, l^*) = arg min_{k, l ∈ K, k ≠ l} δ_{k,l},

with a deterministic rule if there is more than one minimizer. We merge these two clusters
Xk∗ ← Xk∗ ∪ Xl∗ , reduce the index set K ← K \ {l∗ }, and relabel the indices such that
we obtain the new decreased index set K = {1, . . . , K}.
The UPGMA (9.31) is sometimes replaced by other dissimilarity measures. For example,
the complete linkage considers
δ_{k,l} = max_{i ∈ X_k, i′ ∈ X_l} L(X_i, X_{i′}),    (9.32)

or the single linkage considers


δ_{k,l} = min_{i ∈ X_k, i′ ∈ X_l} L(X_i, X_{i′}).    (9.33)

The complete linkage considers the two most distinct instances, and the single linkage
the two most similar instances in the two clusters.
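In R, agglomerative clustering is directly available through hclust (where method = "average" corresponds to the UPGMA linkage), and the AGNES and DIANA algorithms of Kaufman–Rousseeuw [115] are implemented in the cluster package; the following sketch assumes this package is installed and uses simulated covariates.

X  <- scale(matrix(rnorm(200 * 2), ncol = 2))    # hypothetical standardized covariates
hc <- hclust(dist(X), method = "average")        # agglomerative clustering with UPGMA linkage (9.31)
plot(hc)                                         # dendrogram
cutree(hc, k = 4)                                # cut the dendrogram into K = 4 clusters
library(cluster)
ag <- agnes(X, method = "average")               # AGNES
di <- diana(X)                                   # DIANA (divisive)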


9.3.2 K-means and K-medoids clusterings


For K-means and K-medoids clustering, respectively, we first need to select the number
K of clusters that we want to have. This K is a hyper-parameter selected by the modeler.
We also recall the three different definitions (9.27), (9.28) and (9.29) of Xk ; this is
going to be used throughout this section.
The idea behind these two clustering methods is to achieve a minimal dissimilarity within
the K clusters
(X_k^*)_{k∈K} ∈ arg min_{(X_k)_{k∈K}} \sum_{k=1}^{K} \sum_{i ∈ X_k} L(c_k, X_i),    (9.34)

where (ck )k∈K are the corresponding cluster centers (also called cores); these cluster
centers can be part of the minimization in (9.34); this is not indicated in the notation,
but we briefly explain this next. K-means clustering and K-medoids clustering mainly
(but not only) differ in how these cluster centers (ck )k∈K are selected. For K-means
clustering we select the empirical cluster means (thus, the selection of the cluster centers
is not part of the minimization in (9.34)) by setting for k ∈ K
c_k = \frac{1}{|X_k|} \sum_{i ∈ X_k} X_i ∈ R^q.    (9.35)

K-medoids considers a double minimization in (9.34), namely,


arg min_{(X_k)_{k∈K}; c_k ∈ L for all k ∈ K} \sum_{k=1}^{K} \sum_{i ∈ X_k} L(c_k, X_i).    (9.36)

That is, in K-medoids clustering, the cluster medoids ck ∈ L correspond to observed


covariates (X i )ni=1 , which puts an extra computational effort on the algorithm to find
them compared to K-means clustering. The advantage is that ck ∈ L is always a valid
observation.
On the other hand, K-means clustering is computationally much more attractive because
the cluster centers (9.35) can be computed at a minimal effort; we are going to justify
this selection below. A disadvantage of this approach is that if X is not convex (we refer
to the continuous version (9.27)), it may happen that the empirical cluster mean ck is
not inside the area of valid instances X , e.g., if this covariate space forms a circumference
like in Figure 9.1, the center of the circle does not belong to this circumference and, thus,
is not a valid covariate value.
Another disadvantage of K-means clustering is that it is specific to the squared Euclidean distance (we discuss this next), whereas the K-medoids clustering works for any
dissimilarity function L.
A final introductory remark is that in (9.34) we do not scale with the cluster sizes,
thus, we simultaneously balance the cluster size and the total within-cluster dissimilarity.
Alternatively, we could scale with the cluster sizes |Xk |.

K-means clustering

For K-means clustering we select the squared Euclidean distance as dissimilarity function

L(X i , X i′ ) = ∥X i − X i′ ∥22 .


The consequence of this choice is that the empirical cluster means (9.35) minimize the
within-cluster dissimilarities on the clusters Xk . That is, we have for all k ∈ K
c_k = arg min_{c ∈ R^q} \sum_{i ∈ X_k} ∥c − X_i∥_2^2 = \frac{1}{|X_k|} \sum_{i ∈ X_k} X_i.    (9.37)

Precisely this property is the reason for dropping the cluster center optimization in (9.34)
for K-means clustering, and this motivates the name K-means clustering for this method.
Thus, we aim at solving for the cluster centers (9.37)
(X_k^*)_{k∈K} = arg min_{(X_k)_{k∈K}} \sum_{k=1}^{K} \sum_{i ∈ X_k} ∥c_k − X_i∥_2^2.

Initialization. We initialize the K-means algorithm by randomly allocating all in-


stances (X_i)_{i=1}^n to the K clusters (e.g., i.i.d. uniform). This defines the initial partition (X_k^{(0)})_{k∈K} as well as the initial cluster centers (c_k^{(0)})_{k∈K} by the corresponding empirical cluster means (9.37) on these initial clusters (X_k^{(0)})_{k∈K}.

Recursive K-means iteration. We repeat for t ≥ 1 until no more changes are ob-
served:
(1a) Given the present empirical cluster means (c^{(t-1)}_k)_{k \in \mathcal{K}}, we update the partition (\mathcal{X}^{(t)}_k)_{k \in \mathcal{K}} by computing for each instance 1 ≤ i ≤ n the optimal allocation

k^*_t(i) = \arg\min_{k \in \mathcal{K}} \big\| c^{(t-1)}_k - X_i \big\|_2^2,

with a deterministic rule if there is more than one minimizer. This gives us the new clusters at algorithmic time t

\mathcal{X}^{(t)}_k = \big\{ i \in \{1, \ldots, n\};\; k^*_t(i) = k \big\}.

(1b) Given the new clusters (\mathcal{X}^{(t)}_k)_{k \in \mathcal{K}}, we update the empirical cluster means (c^{(t)}_k)_{k \in \mathcal{K}} according to (9.37); these two steps (1a) and (1b) are illustrated in Figure 9.7.

Observe that at algorithmic step t the total within-cluster dissimilarity is given by

D(t) = \sum_{k=1}^{K} \sum_{i \in \mathcal{X}^{(t)}_k} \big\| c^{(t)}_k - X_i \big\|_2^2 \;\ge\; 0.    (9.38)

The crucial point in the above algorithm is that neither step (1a) nor step (1b) increases the total within-cluster dissimilarity D(t) as t increases. Having the lower bound zero and the fact that a finite sample can only be allocated in finitely many different ways to K clusters implies that we have convergence

\lim_{t \to \infty} D(t) = D(t^*) \in [0, \infty),

in finitely many steps t^* \in \mathbb{N}_0.
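To make the iteration concrete, the following minimal Python sketch implements steps (1a) and (1b) with numpy; the data matrix X (one row per instance X_i) and the number of clusters K are assumed to be given, all function and variable names are illustrative, and empty clusters are not handled.

import numpy as np

def k_means(X, K, seed=0, max_iter=100):
    """Minimal K-means sketch (steps (1a)-(1b)) for a data matrix X of shape (n, q)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(K, size=n)                      # random initial allocation
    for _ in range(max_iter):
        # empirical cluster means of the current partition as cluster centers, see (9.35)
        centers = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
        # step (1a): re-allocate each instance to its closest center
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if np.array_equal(new_labels, labels):            # convergence in finitely many steps
            break
        labels = new_labels                               # step (1b) recomputes the centers in the next pass
    D = dist2[np.arange(n), labels].sum()                 # total within-cluster dissimilarity, see (9.38)
    return labels, centers, D

Running this sketch for a range of values of K and plotting the returned total within-cluster dissimilarity against K yields a decreasing graph as in Figure 9.8 below.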


[Figure 9.8: Decrease of total within-cluster dissimilarity (y-axis) as a function of the number of clusters K (x-axis, hyperparameter K = 2, ..., 10); taken from [183].]

The drawback is that (\mathcal{X}^{(t^*)}_k)_{k \in \mathcal{K}} typically is a local minimum of the total within-cluster dissimilarity, and different initial configurations may converge to different (local) minima.
An open question is the selection of the number of clusters K. This could be determined recursively as follows. Construct an optimal partition (\mathcal{X}^{(t^*)}_k)_{k=1}^{K} for a given K. For the increased number of clusters K + 1, initialize the (K + 1)-means algorithm by the optimal clusters for parameter K, and randomly partition one of these clusters into two clusters. This implies that the total within-cluster dissimilarity decreases when going from (\mathcal{X}^{(t^*)}_k)_{k=1}^{K} to (\mathcal{X}^{(0)}_k)_{k=1}^{K+1}. Then, run the algorithm in this increased setting with this initialization, and monotonicity implies that this new solution for K + 1 clusters has a smaller total within-cluster dissimilarity. This results in a graph as in Figure 9.8 that is decreasing in K. An elbow criterion selects K^* where this graph has a kink; in Figure 9.8 this might be at K^* = 4.

K-medoids clustering

The K-means clustering algorithm requires that we consider the squared Euclidean dis-
tance as the dissimilarity function L. Of course, this is not always suitable, e.g., if we
have a network where we need to travel from X i ∈ R2 to X i′ ∈ R2 , the traveling costs
are rather related to the Euclidean distance, but not to the squared Euclidean distance.
However, K-means clustering cannot deal with this Euclidean distance problem. The
K-medoids algorithm is more flexible and it can deal with any dissimilarity function
L. This comes at the price of higher computational costs, because we cannot simply
select the empirical cluster means as the cores (c_k)_{k \in \mathcal{K}}. Instead, we need to compute the medoids (c_k)_{k \in \mathcal{K}} to obtain a monotonically decreasing algorithm. The (optimal) medoids


are given by

(c_k)_{k \in \mathcal{K}} = \arg\min_{(c_k)_{k \in \mathcal{K}} \subset \mathcal{L}} \; \sum_{k=1}^{K} \sum_{i \in \mathcal{X}_k} L(c_k, X_i),    (9.39)

i.e., the medoids c_k \in \mathcal{L} = (X_i)_{i=1}^n belong to the observed sample; again, we install a deterministic rule if the argument in the minimum is not unique.

Similar to K-means clustering, the global minimum can generally not be found, there-
fore, we try to find a local minimum. Usually, the partitioning around medoids (PAM)
algorithm of Kaufman–Rousseeuw [115] is exploited to solve the K-medoids clustering.

Initialization. Choose at random K different initial medoids c_1, \ldots, c_K \in \mathcal{L}. Allocate each instance X_i \in \mathcal{L} to its closest medoid w.r.t. the selected dissimilarity measure L. This provides the initial total within-cluster dissimilarity

D = \sum_{k=1}^{K} \sum_{i \in \mathcal{X}_k} L(c_k, X_i) \;\ge\; 0.    (9.40)

Recursive K-medoids iteration. We repeat until no more changes are observed:

(1a) Select a present medoid ck and a non-medoid X i ∈ L \ (ck )k∈K , swap the role
of ck and X i , and allocate each instance in L to the closest medoid in this new
configuration;

(1b) compute the new total within-cluster dissimilarity (9.40) under this new configura-
tion;

(1c) if the total within-cluster dissimilarity increases under this new configuration reject
the swap, otherwise keep it.

Remark that there are many variants on how the swap in step (1a) can be selected (in a
systematic way). Kaufman–Rousseeuw [115] provide one version which is also described
in Algorithm 2 of Schubert–Rousseeuw [201], but there are many other possibilities that
may provide computational improvements.
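The following minimal Python sketch illustrates the swap-based idea for an arbitrary dissimilarity; it assumes a precomputed dissimilarity matrix D with entries L(X_i, X_j), tries all possible swaps in a simple (not computationally optimized) way, and is not the exact Kaufman–Rousseeuw implementation; all names are illustrative.

import numpy as np

def k_medoids(D, K, seed=0, max_iter=50):
    """Minimal PAM-style K-medoids for an (n, n) dissimilarity matrix D with entries L(X_i, X_j)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=K, replace=False)        # initial medoids taken from the sample

    def total_cost(meds):
        # allocate every instance to its closest medoid and sum the dissimilarities, see (9.40)
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for k in range(K):                                # step (1a): try swapping medoid k ...
            for i in np.setdiff1d(np.arange(n), medoids): # ... with a non-medoid instance i
                candidate = medoids.copy()
                candidate[k] = i
                new_cost = total_cost(candidate)          # step (1b): total within-cluster dissimilarity
                if new_cost < cost:                       # step (1c): keep only improving swaps
                    medoids, cost, improved = candidate, new_cost, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, cost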

9.3.3 Clustering using Gaussian mixture models


The K-means and K-medoids algorithms are based on the implicit assumption that
the dissimilarity has a hard decision boundary. GMMs are distribution-based clustering
methods that allow for soft decision boundaries accounting for some instances lying in
the ‘wrong’ cluster. The basic assumption is that we have a mixture of K multivari-
ate Gaussian distributions that are assumed to have generated the instances (X i )ni=1 .
Thus, it is assumed that these instances were generated i.i.d. from a Gaussian mixture
distribution having density

f(x) = \sum_{k=1}^{K} p_k \, \frac{1}{(2\pi)^{q/2} |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (x - c_k)^\top \Sigma_k^{-1} (x - c_k) \right),    (9.41)


with mean vectors (c_k)_{k \in \mathcal{K}} \subset \mathbb{R}^q, positive definite covariance matrices \Sigma_k \in \mathbb{R}^{q \times q}, 1 \le k \le K, and mixture probabilities (p_k)_{k \in \mathcal{K}} \subset (0, 1) aggregating to one, \sum_{k=1}^{K} p_k = 1. This density gives a multivariate GMM with model parameter

\vartheta = (c_k, \Sigma_k, p_k)_{k \in \mathcal{K}}.

If we want to estimate this parameter with MLE, we need to consider the log-likelihood function for the given learning sample \mathcal{L} = (X_i)_{i=1}^n

\vartheta \mapsto \ell_{\mathcal{L}}(\vartheta) = \sum_{i=1}^{n} \log\!\left( \sum_{k=1}^{K} \frac{p_k}{(2\pi)^{q/2} |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (X_i - c_k)^\top \Sigma_k^{-1} (X_i - c_k) \right) \right).    (9.42)

It is well-known that this MLE problem cannot be solved directly. In fact, it is not even clear whether an MLE for ϑ exists. There are many examples where this log-likelihood function is unbounded, hence, there is no MLE for ϑ in such cases.
For these reasons, one is less ambitious, and one just tries to find an estimator for ϑ that explains the learning sample reasonably well (is not a spurious solution, in jargon).
State-of-the-art uses variants of the expectation-maximization (EM) algorithm to find
such solutions. We will not describe the EM algorithm here, but we refer to Dempster et
al. [52], Wu [237] and McLachlan–Krishnan [152]. There are many different implemen-
tations of the EM algorithm, and for GMM clustering there are many variants relating
to different choices of the covariance matrices Σk . E.g., if we decouple this covariance
matrix according to Σk = λk Dk Ak Dk⊤ with a scalar λk > 0, an orthogonal matrix Dk
containing the eigenvectors, and a diagonal matrix Ak that is proportional to the eigen-
values of Σk , then one can fix some of these choices and exclude them from the MLE
optimization. One choice is identical orientations Dk = Id (identity matrices), equal
volumes λk = λ > 0, and Ak can then provide different ellipsoids. Finally, the GMM
clustering is obtained by allocating X i to the estimated GMM component that provides
the biggest log-likelihood.
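A minimal sketch of GMM clustering in Python, using the EM implementation of scikit-learn instead of coding the EM algorithm from scratch; the toy data and all names are illustrative, and the covariance_type argument corresponds to constraining the covariance matrices Σ_k as discussed above.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),           # toy sample with two point clouds
               rng.normal(4, 1, size=(100, 2))])
K = 2

# covariance_type in {"full", "tied", "diag", "spherical"} constrains the matrices Sigma_k
gmm = GaussianMixture(n_components=K, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)        # hard allocation: component with the largest posterior probability
resp = gmm.predict_proba(X)    # soft decision boundaries: posterior component probabilities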

9.3.4 Density-based clustering


Density-based clustering focuses on constructing clusters that are sufficiently dense. We
assume that the dissimilarity function L is symmetric in its arguments. We select a
neighborhood radius ε > 0 and a critical density value M ∈ N. This value M is the
minimal number of instances X i′ in the ε-neighborhood of a considered instance X i , to
declare this instance X i to be a core instance. Thus, we need to count the number of
instances in the ε-neighborhood of each instance X_i, 1 \le i \le n,

m_i = \sum_{i' \ne i} 1_{\{ L(X_i, X_{i'}) \le \varepsilon \}}.

This allows us to define the set of core instances

\mathcal{C} = \{ i \in \{1, \ldots, n\};\; m_i \ge M \}.

The DBSCAN method of Ester et al. [64] is obtained by constructing a graph of vertices
and edges from these core instances C:


(1) The vertices are given by all core instances X i ∈ C, and we add an edge between
two core instances X i , X i′ ∈ C if they are in the ε-neighborhood of each other, i.e.,
if L(X i , X i′ ) ≤ ε. This gives a graph with vertices and edges, and we define the
clusters to be the connected components of this graph.

(2) There are still the instances X l ∈ L \ C that are not core instances. If such a
non-core instance X l is in the ε-neighborhood of at least one core instance X i ,
i.e., if L(X i , X l ) ≤ ε for at least one core instance X i ∈ C, then we assign it (at
random) to one of these close core instances by adding an edge from X l to X i .
This increases the corresponding connected component of that core instance X i ,
but because the graph ends in X l (there is no further edge in X l ), this non-core
instance is an isolated (satellite) point that is only connected to X i .
Finally, there are the so-called outliers X l with L(X i , X l ) > ε for all core instances
X i ∈ C. These are treated as noise, and they are not assigned to any cluster.
Advantages of DBSCAN are that the number of clusters is flexible, and the resulting
structures of the clusters can have any shape. Such a structure may be useful if one tries
to describe how things spread (in a graph-like manner by nearest neighbor infections),
but also for disaster modeling, e.g., one may use such a graph to model the spread of a fire.
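For illustration, a minimal Python sketch of DBSCAN using scikit-learn; the toy data, ε and M are illustrative. Note that scikit-learn's min_samples counts the instance itself, whereas m_i above counts only the neighbors; an arbitrary dissimilarity L can be supplied via a precomputed dissimilarity matrix (metric="precomputed").

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),          # two dense point clouds ...
               rng.normal(3, 0.3, size=(100, 2)),
               rng.uniform(-2, 5, size=(10, 2))])           # ... plus some scattered instances

eps, M = 0.5, 5
# scikit-learn's min_samples counts the instance itself, hence M + 1 for the convention used above
labels = DBSCAN(eps=eps, min_samples=M + 1, metric="euclidean").fit_predict(X)
# labels[i] == -1 marks the outliers (noise); the other labels enumerate the connected components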

9.4 Low-dimensional visualization methods


Visualization methods are more of a topological nature trying to preserve neighborhoods.
Popular methods are the t-SNE method of van der Maaten–Hinton [223], UMAP of
McInnes et al. [151] or SOM by Kohonen [122, 123, 124]. We only present the t-SNE
method here and for the other methods we refer to the relevant literature.
These visualization methods are based on finding instances (Z_i)_{i=1}^n in a lower-dimensional space, say, \mathbb{R}^2, such that there is (some) similarity between the distance functions generated by the learning sample \mathcal{L} = (X_i)_{i=1}^n \subset \mathbb{R}^q and by this sequence (Z_i)_{i=1}^n \subset \mathbb{R}^2. This motivates solving minimization problems of the type

\arg\min_{(Z_i)_{i=1}^n \subset \mathbb{R}^2} \; \sum_{1 \le i, i' \le n} \big( L(X_i, X_{i'}) - \| Z_i - Z_{i'} \|_2 \big)^2.

In other words, we try to find a sample (Z_i)_{i=1}^n in the two-dimensional Euclidean space such that the original adjacency matrix

\big( L(X_i, X_{i'}) \big)_{1 \le i, i' \le n} \in \mathbb{R}_+^{n \times n},    (9.43)

is preserved as well as possible. This idea is common to most visualization methods.
The t-SNE method of van der Maaten–Hinton [223] considers an embedding that slightly modifies the above idea. Namely, we are going to map the dissimilarities L(X_i, X_{i'}) to a categorical distribution q = (q_j)_{j=1}^J. Equivalently, we are going to find a t-distribution with instances (Z_i)_{i=1}^n whose dissimilarities can be mapped to a second categorical distribution p = (p_j)_{j=1}^J. Instead of minimizing (9.43), we try to make the KL divergence from p to q small

D_{\mathrm{KL}}(q \,\|\, p) = \sum_{j=1}^{J} q_j \log\!\left( \frac{q_j}{p_j} \right).    (9.44)


The KL divergence is zero if and only if q = p; this is proved by Jensen’s inequality.

Original sample. For two instances X i , X i′ ∈ L one defines the conditional probabil-
ity weight
 
q_{i'|i} = \frac{\exp\big( -\frac{1}{2\sigma_i^2} \| X_{i'} - X_i \|_2^2 \big)}{\sum_{k \ne i} \exp\big( -\frac{1}{2\sigma_i^2} \| X_k - X_i \|_2^2 \big)} \in (0, 1),    for i \ne i'.    (9.45)

The choice of the bandwidth σi > 0 is discussed below. The explanation of (9.45) is that
qi′ |i gives the probability of selecting X i′ as the neighbor of X i from all instances, under
a Gaussian kernel similarity measure, i.e., X i is the center (core) of these conditional
probabilities.
Since qi′ |i is non-symmetric, a symmetrized version is defined by

q_{i,i'} = \frac{1}{2n} \left( q_{i'|i} + q_{i|i'} \right) \in (0, 1),    for i \ne i'.    (9.46)

Note, we exclude the diagonal from these definitions. Observe that \sum_{i' \ne i} q_{i'|i} = 1 for all 1 \le i \le n. This implies that \sum_{i=1}^{n} \sum_{i' \ne i} q_{i,i'} = 1 and, hence, q = (q_{i,i'})_{i \ne i'} is a categorical distribution with J = n^2 - n components.

Visualization sample. Select a fixed dimension p < q. The goal is to find a visualization sample (Z_i)_{i=1}^n \subset \mathbb{R}^p such that its Student-t probabilities p = (p_{i,i'})_{i \ne i'} (with one degree of freedom, which is the Cauchy distribution), defined by

p_{i,i'} = \frac{\big( 1 + \| Z_i - Z_{i'} \|_2^2 \big)^{-1}}{\sum_{k \ne l} \big( 1 + \| Z_k - Z_l \|_2^2 \big)^{-1}} \in (0, 1),    for i \ne i',    (9.47)

are close to the categorical distribution q = (q_{i,i'})_{i \ne i'} of the original sample (X_i)_{i=1}^n.

This motivates the following minimization problem

(Z^*_i)_{i=1}^n \in \arg\min_{(Z_i)_{i=1}^n} D_{\mathrm{KL}}(q \,\|\, p) = \arg\min_{(Z_i)_{i=1}^n} \sum_{i \ne i'} q_{i,i'} \log\!\left( \frac{q_{i,i'}}{p_{i,i'}} \right).    (9.48)

This optimization problem is usually solved with the gradient descent algorithm.
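For illustration, a small numpy sketch that evaluates the objective (9.48) for a given candidate embedding Z and given bandwidths σ_i; in an actual t-SNE fit, Z is then updated by gradient descent and the σ_i are calibrated to a target perplexity (see Remarks 9.6 below). All names are illustrative.

import numpy as np

def tsne_objective(X, Z, sigma):
    """Evaluate the t-SNE objective (9.48) for a candidate embedding Z; sigma holds the bandwidths sigma_i."""
    n = X.shape[0]
    # high-dimensional conditional probabilities q_{i'|i}, see (9.45)
    d2_X = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2_X / (2.0 * sigma[:, None] ** 2))
    np.fill_diagonal(W, 0.0)                                # exclude the diagonal
    q_cond = W / W.sum(axis=1, keepdims=True)               # row i contains q_{.|i}
    q = (q_cond + q_cond.T) / (2.0 * n)                     # symmetrized version, see (9.46)
    # low-dimensional Student-t (Cauchy) probabilities p_{i,i'}, see (9.47)
    d2_Z = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    kernel = 1.0 / (1.0 + d2_Z)
    np.fill_diagonal(kernel, 0.0)
    p = kernel / kernel.sum()
    # KL divergence D_KL(q || p) over all off-diagonal pairs
    off = ~np.eye(n, dtype=bool)
    return np.sum(q[off] * np.log(q[off] / p[off]))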

Remarks 9.6. • There is some discrepancy in the definition of q and p. For the high-dimensional case, we define q via the conditional probabilities (9.46). This approach has turned out to be more robust towards outliers. In the low-dimensional case we can directly define p by (9.47).

• The Student-t distribution is heavy-tailed (regularly varying), and for one degree of freedom (Cauchy case) we have a quadratic asymptotic decay p_{i,i'} \approx \| Z_i - Z_{i'} \|_2^{-2} for \| Z_i - Z_{i'} \|_2 \to \infty.


• There remains the choice of the bandwidths σ_i > 0. Typically, a smaller value of σ_i > 0 gives a denser clustering. For q_{\bullet|i} = (q_{i'|i})_{i' \ne i}, one can define the perplexity

\mathrm{Perp}(q_{\bullet|i}) = 2^{H(q_{\bullet|i})} = 2^{-\sum_{i' \ne i} q_{i'|i} \log_2(q_{i'|i})},

with H(q_{\bullet|i}) being the Shannon entropy. Following van der Maaten–Hinton [223], a good choice of the bandwidths σ_i is obtained by requiring a constant perplexity \mathrm{Perp}(q_{\bullet|i}) in i.

[Figure 9.9: t-SNE visualizations for perplexity 10, 30 and 50, mapping a five-dimensional sample (X_i)_{i=1}^n \subset \mathbb{R}^5 to a two-dimensional illustration (Z_i)_{i=1}^n \subset \mathbb{R}^2; each panel plots component 1 against component 2.]

Figure 9.9 gives an example where we map a five-dimensional learning sample (X_i)_{i=1}^n \subset \mathbb{R}^5 to a two-dimensional illustration (Z_i)_{i=1}^n \subset \mathbb{R}^2. We use the R package tsne [58], which has a hyper-parameter called perplexity. Figure 9.9 shows the results for different values of this perplexity parameter. The colors of the instances are identical in all plots, and their specific meaning is related to a sports car evaluation, with red color indicating a sports car; for details see Rentzmann–Wüthrich [183].



Chapter 10

Generative modeling

10.1 Introduction
Most of the applications considered in these notes follow the classical paradigm of super-
vised learning, introduced in Chapters 1 and 2, which is to infer the best approximation
to an unknown regression function from a finite sample of data. Once this regression
function has been learned, this function can be used to compute point estimates of in-
terest, including, e.g., best estimates and estimates of quantiles of distributions. Within
actuarial science, this paradigm of learning a regression function and a best estimate,
respectively, encompasses a large proportion of tasks focused on prediction. Nonetheless,
some tasks performed quite often by actuaries stand outside this paradigm, where, in-
stead of producing point estimates, it is desired or necessary to produce a full predictive
distribution. For example, in non-life reserving, besides a point estimate of reserves, it
is usually important to produce a distribution of the outstanding loss liabilities, both to
communicate the uncertainty of the point estimate (reserves) and to quantify the capi-
tal and margins needed to back the reserves. We briefly also mention capital modeling
applications, whether parameterizing (simple) statistical distributions to model a full
distribution of prospective insurance risks, or using dimensionality reduction tools, such
as those introduced in the previous chapter, to reduce complex observations of market
variables or mortality rates down to tractable quantities to enable modeling; a classical
example here is interest rate modeling. These tasks can be characterized as generative modeling, whose goal is to learn the underlying probability distribution X ∼ P of the instances themselves, or a joint distribution of responses and covariates (Y, X) ∼ P; we refer to Section 1.2.3. (For simplicity, we drop the volume/weight in this chapter by assuming v ≡ 1.)
Once we have learned a good approximation to this distribution of the data, we can:

(1) Generate new samples: Draw new instances X ∼ P that resemble the data the
model was trained on. This is useful for data augmentation, simulating scenarios,
and creating synthetic data.

(2) Estimate probabilities: Evaluate the likelihood p(X) of a given data point X, which is useful, e.g., for anomaly detection.

(3) Infer latent representations: Learn a compressed, lower-dimensional representation


Z of the data X (similar to the auto-encoder examples in Section 9.2.2), which can
reveal underlying structure, e.g., through visualizations of Z.

(4) Perform conditional data generation: Generate new samples from a conditional
distribution X|Y ∼ P(·|Y ), allowing for targeted data synthesis.

What distinguishes generative modeling from the unsupervised learning applications discussed in Chapter 9 - some of which also focus on producing latent factors Z - is exactly this goal of producing a learned probability distribution X ∼ P of the data X.
Notation. We remark that in this field of the literature, one typically uses the notation p(X) for the density and the likelihood of X ∼ P; we adopt this convention (which is in slight conflict with our previous notation).

Goal: Infer (or approximate) the true probability distribution X ∼ P


(or density p(X)) from a finite sample L of (past) observations.

Major recent advances in generative modeling have been driven by the use of deep neural
networks to learn probability distributions over complex, high-dimensional data; these
so-called deep generative models (DGMs) have been extraordinarily successful in appli-
cations in natural language processing (NLP) and image generation. We parameterize
this distribution with parameter ϑ, writing pϑ (X), where ϑ ∈ Rr represents the network
parameter that needs to be learned from observed data. Various approaches have been
proposed in the literature for the task of deep generative modeling; we refer to Tomczak
[220] for an overview.
Here, we mainly discuss two of these approaches: the latent factor and implicit probability
distribution approaches.
(1) The latent factor approach is to assume a parametric model

p_\vartheta(X) = \int p_\vartheta(X \mid z)\, \pi(z)\, dz,

where z denotes latent variables capturing unobserved factors that create variability in the data samples, and π(z) is a prior distribution over these latent variables; this prior distribution can also be set as a very weak or flat one in some cases.
The deep neural network components can parameterize both

• the conditional likelihood pϑ (X | z), and

• the latent distribution π(z) or, more commonly, the conditional distribution Z|X ∼
q_ϑ(z | X) in variational inference frameworks, where ϑ contains both the parameters of the conditional likelihood p_ϑ and those of the variational posterior q_ϑ.

In practice, we choose flexible neural-network-based functions to map between X and Z


(encoder) and from Z to (new) data X (decoder).


Variational auto-encoders (VAEs) are a typical example of this latent factor approach,
where the encoder network approximates the posterior qϑ (z | X), and the decoder net-
work approximates the likelihood pϑ (X | z); in what follows, we will extend the auto-
encoders introduced in Section 9.2.2 into this variational approach. We also briefly men-
tion the Generative Adversarial Networks (GANs) of Goodfellow et al. [84], which work
by sampling directly from an assumed latent factor distribution Z ∼ π(z), and the de-
noising diffusion models due to Ho et al. [99], which are another way of acting on samples
from a latent factor distribution π.

(2) The implicit probability distribution is a different approach that does not rely on
explicit latent factors Z. Instead, a neural network can directly learn a conditional
probability distribution over X = (X1 , . . . , XT ) = X1:T . Concretely, for sequential data
X1:T (such as text or time-series data), this probability distribution is represented by
factorizing the joint distribution into a product of conditional terms, i.e., as an auto-
regressive factorization that takes the form
p_\vartheta(X_{1:T}) = \prod_{t=1}^{T} p_\vartheta\big( X_t \mid X_{1:t-1} \big),

where each term p_\vartheta\big( X_t \mid X_{1:t-1} \big) is modeled by a neural network that conditions on the preceding elements of the sequence X_{1:t-1}.
In many cases - especially when each Xt is categorical (e.g., a word in a vocabulary
W) - this network outputs a probability distribution via a softmax function, which is
the natural parameterization for multinomial outcomes, see (8.10). Specifically, if the
possible values for X_t belong to some discrete set \mathcal{W} \subset \mathbb{R}, then for each w \in \mathcal{W}, the network produces outputs

\mathrm{logits}(w) = \mathrm{logits}_\vartheta(X_{1:t-1}; w),

which are mapped to probabilities by the softmax transformation

p_\vartheta\big( X_t = w \mid X_{1:t-1} \big) = \frac{e^{\mathrm{logits}(w)}}{\sum_{u \in \mathcal{W}} e^{\mathrm{logits}(u)}},

where logits(·) refers to the unnormalized log-probabilities (scores) before the transformation to the probability scale. This implicit approach obviates the need for latent variables by directly
specifying how the next outcome depends on the past. In language modeling, for example,
the model learns pϑ (Xt | X1:t−1 ) over a vocabulary of possible tokens (words or subwords);
we have already introduced these concepts in Section 8.2. In time-series forecasting, the
same principle applies, although the data may be continuous or mixed-type, in which
case alternative output layers can be used. In all such scenarios, the core idea is to define
the probabilities over W of the next outcome - completely determined by the network -
rather than decomposing the distribution through auxiliary latent factors.
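For illustration, a small Python sketch that evaluates the joint log-likelihood log p_ϑ(X_{1:T}) of a token sequence via this auto-regressive softmax parameterization; the network is replaced by a trivial stand-in returning constant logits, and all names are illustrative.

import numpy as np

def softmax(logits):
    z = logits - logits.max()                     # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

def sequence_log_likelihood(tokens, logits_fn):
    """log p(X_{1:T}) = sum_t log p(X_t | X_{1:t-1}) under the auto-regressive factorization."""
    ll = 0.0
    for t in range(len(tokens)):
        probs = softmax(logits_fn(tokens[:t]))    # network output for the prefix X_{1:t-1}
        ll += np.log(probs[tokens[t]])
    return ll

# toy stand-in for a trained network: constant logits over a vocabulary of 10 tokens
toy_logits_fn = lambda prefix: np.zeros(10)
print(sequence_log_likelihood([1, 4, 2, 7], toy_logits_fn))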
In modern NLP, transformer-based models have emerged as a powerful way to implement
these auto-regressive approaches. In Section 8.5, we have introduced encoder transform-
ers, which process a known sequence of tokens to produce an output. For NLP and other
generative modeling purposes, decoder transformers are used, where the next-token in an


observed sequence is predicted; this prediction is then appended to the sequence and the
following token is predicted, and so on. We will discuss decoder transformers and large
language models in more detail below.

10.2 Variational auto-encoders


10.2.1 Introduction and motivation
Variational auto-encoders (VAEs), introduced by Kingma and Welling [119, 120] and
Rezende et al. [184], combine ideas from variational inference and the auto-encoder
framework (see Section 9.2.2) to provide a probabilistic approach for both learning latent
representations and generating new samples from an unknown distribution. In contrast
to a standard auto-encoder - which learns a deterministic mapping from inputs X to a
compressed latent representation Z and back to a reconstruction X ′ - a VAE incorporates
stochasticity into both the encoding and decoding processes. This allows the model to
approximate the full distribution p(X) of the data, rather than merely learning a point-
wise reconstruction. Moreover, the latent factor distribution Z ∼ π(z) is usually assumed
to follow a multivariate isotropic normal distribution; once the encoding network has been
trained, samples from π(z) can be drawn in a principled manner.
Remark 10.1. We briefly mention that, within actuarial modeling utilizing PCA, al-
though a multivariate isotropic normal distribution is not imposed during the PCA fitting,
nonetheless, it is often assumed that the (orthogonal) principal components follow this
distribution and samples are drawn from π(z) on this basis. Of course, the independence
of the principal components by design means that at least part of this assumption is justi-
fied; orthogonal components are independent under a multivariate Gaussian assumption.
A nice extension of this approach is to use a density estimation on the factors Z; see
Ghosh et al. [73].

10.2.2 Variational auto-encoder model architecture


At a high level, a VAE is rather similar to the bottleneck neural network (BNN) presented
in Section 9.2.4; the main difference is that we introduce distributional assumptions
into these networks, as we now explain. A VAE consists of two core components, each
parameterized by usually distinct sets of neural network parameters, which we collect
into ϑ ∈ Rr (in a slight abuse of notation):

(1) Encoder (inference/recognition model): The encoder network takes an input data
point X and outputs the parameters of a latent distribution, typically an isotropic
multivariate Gaussian distribution
q_\vartheta(z \mid X) \overset{(d)}{=} \mathcal{N}\big( \mu_\vartheta(X), \Sigma_\vartheta(X) \big).    (10.1)

Here, µϑ (X) and Σϑ (X) are given by the encoder’s outputs, with Σϑ (X) often
constrained to be diagonal for simplicity. The encoder is thus learning an approx-
imate posterior distribution over latent variables Z, conditioned on X. We will
explain how this assumption is enforced using a regularization term in the next
section.


(2) Decoder (generative model): The decoder network takes a latent sample Z and
outputs parameters of a distribution over the data space

p_\vartheta(x \mid Z) \overset{(d)}{=} \mathcal{N}\big( m_\vartheta(Z), S_\vartheta(Z) \big),

for an absolutely continuous X. If X is binary or discrete, it estimates the parameters of a Bernoulli or a discrete distribution. In practice, m_\vartheta(Z) and S_\vartheta(Z) are
also outputs of a (separate) neural network.

In summary, the decoder is the generative component: once trained, it can take random
samples Z ∼ π(z) from the latent space and produce synthetic data points by generating
new data X ′ ∼ pϑ (x | Z) that resembles the original dataset.

10.2.3 Variational objective: the evidence lower bound


As we mentioned above, deep generative modeling often seeks to approximate the (in-
tractable) marginal likelihood pϑ (x). For VAEs, we write
p_\vartheta(x) = \int p_\vartheta(x \mid z)\, \pi(z)\, dz,

where π(z) is a prior over latent variables, typically chosen as a standard Gaussian N(0, I). Directly optimizing log p_ϑ(x) is generally intractable, but we can employ variational inference to maximize a lower bound, the evidence lower bound (ELBO); for a full derivation and explanation, see Odaibo [169],

\log p_\vartheta(x) = \log \int p_\vartheta(x \mid z)\, \pi(z)\, dz
= \log \int \frac{p_\vartheta(x \mid z)\, \pi(z)}{q_\vartheta(z \mid x)}\, q_\vartheta(z \mid x)\, dz
= \log \mathbb{E}_{q_\vartheta(z \mid x)}\!\left[ \frac{p_\vartheta(x \mid Z)\, \pi(Z)}{q_\vartheta(Z \mid x)} \right]
\ge \mathbb{E}_{q_\vartheta(z \mid x)}\!\left[ \log \frac{p_\vartheta(x \mid Z)\, \pi(Z)}{q_\vartheta(Z \mid x)} \right]
= \mathbb{E}_{q_\vartheta(z \mid x)}\big[ \log p_\vartheta(x \mid Z) \big] - D_{\mathrm{KL}}\big( q_\vartheta(\cdot \mid x) \,\|\, \pi \big) =: \mathcal{E}(\vartheta; x),

where \mathbb{E}_{q_\vartheta(z \mid x)}[\cdot] is the expectation operator of Z \sim q_\vartheta(z \mid x), the inequality follows from Jensen's inequality, and D_{\mathrm{KL}}(q_\vartheta(\cdot \mid x) \,\|\, \pi) is the KL divergence from π to q_ϑ(· | x); for the finite discrete case, see (9.44).

The term E(ϑ; x) is the ELBO and consists of two terms

• Reconstruction term: Eqϑ (z|x) [log pϑ (x | Z)], which encourages the decoder to re-
construct the original x from the latent code Z.

• Regularization term: -D_{\mathrm{KL}}(q_\vartheta(\cdot \mid x) \,\|\, \pi), which aligns the encoder's approximate posterior with the prior π(z). As we have mentioned already, typically π is the standard Gaussian N(0, I).


Combining these two terms yields a balance between, on the one hand, a faithful reconstruction and, on the other, a latent space constrained to follow the prior assumptions. For a learning sample \mathcal{L} = (X_i)_{i=1}^n, training maximizes the average ELBO over the learning sample

\arg\max_{\vartheta} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{E}\big( \vartheta; X_i \big).    (10.2)

Intuitive understanding of the ELBO

The ELBO may appear mathematically imposing, but it can be understood


through an intuitive lens. Imagine you are trying to compress data (like an image)
into a compact code that captures its essential features, and then reconstruct it
from this code. The ELBO represents a balance between two competing objectives:

(1) Reconstruction quality: How accurately can we reconstruct the original data
from our compressed representation? This corresponds to the first term in
the ELBO, Eqϑ (z|x) [log pϑ (x|Z)]. Higher values mean better reconstruction.

(2) Compression regularity: How ’well-behaved’ is our compression scheme?


Does it distribute information evenly, avoid redundancy, and make efficient use of the available space? This corresponds to the KL divergence
term, −DKL (qϑ (·|x)∥π). Smaller divergence means the compression follows
our desired prior distribution π.

The ELBO elegantly combines these objectives into a single value to be maximized.
When maximizing the ELBO, we are finding the best trade-off between accurate
reconstructions and well-structured compressions. This is why VAEs often learn
meaningful, disentangled representations - they are simultaneously optimizing for
fidelity (reconstruction) and simplicity (regularization).
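For the common choice of a diagonal Gaussian encoder q_ϑ(·|x) = N(μ, diag(σ²)) and the standard Gaussian prior π = N(0, I), the regularization term is available in closed form, D_{KL} = ½ Σ_j (σ_j² + μ_j² − 1 − log σ_j²), which the following small sketch makes explicit (all names are illustrative).

import numpy as np

def kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# example: a two-dimensional latent code
print(kl_to_standard_normal(np.array([0.5, -1.0]), np.array([0.0, -0.7])))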

10.2.4 Reparameterization trick and training


As we have mentioned in Section 5.3.3, neural network training relies on calculating gra-
dients from outputs of the network to the parameters. In the case of VAEs, one cannot
calculate a direct gradient of Eqϑ (z|x) [·] w.r.t. ϑ, because this expression is too complex
as the unknown parameter ϑ enters the density for the expected value computation. However, under a Gaussian assumption there is a nice way out, called the reparameterization trick. Assume a Gaussian encoder (10.1). This allows us to rewrite the Gaussian random variable Z as follows

Z \overset{(d)}{=} \mu_\vartheta(x) + \Sigma_\vartheta^{1/2}(x)\, \varepsilon,    \varepsilon \sim \mathcal{N}(0, I).

Here, ε is independent of ϑ, so we can backpropagate through the neural network outputs μ_ϑ(x) and Σ_ϑ(x) without any issues. That is, we can rewrite

\mathbb{E}_{q_\vartheta(z \mid x)}\big[ \log p_\vartheta(x \mid Z) \big] = \mathbb{E}\Big[ \log p_\vartheta\big( x \mid \mu_\vartheta(x) + \Sigma_\vartheta^{1/2}(x)\, \varepsilon \big) \Big],
where the parameter ϑ is no longer part of the expectation operator E[·]. The process
then employs a Monte Carlo version by, first, sampling ε ∼ N (0, I) in the forward pass,


and second, during the backward pass, taking the gradients w.r.t. the mean and variance
parameters. A single Monte Carlo sample often suffices to approximate the ELBO
 
\mathcal{E}(\vartheta; x) \approx \widetilde{\mathcal{E}}(\vartheta; x) := \log p_\vartheta\big( x \mid \mu_\vartheta(x) + \Sigma_\vartheta^{1/2}(x)\, \varepsilon \big) - D_{\mathrm{KL}}\big( q_\vartheta(\cdot \mid x) \,\big\|\, \pi \big).

By performing this Monte Carlo sampling during the fitting procedure, VAEs learn both
the inference (encoder) and generative (decoder) networks by maximizing this Monte
Carlo ELBO instead of (10.2). Once trained, we can generate new data by sampling
Z ∼ π(z) and then drawing X ′ ∼ pϑ (x | Z).

The reparameterization trick: a visual analogy

The reparameterization trick can be understood through a simple analogy. Imagine


you are running a factory that produces custom-sized widgets. There are two ways
to adjust the production:

(1) Direct approach (without reparameterization). You might randomly select a


size for each widget directly from your target distribution. But if you need
to adjust this distribution (e.g., make widgets larger on average), you would
have to change your entire random selection process. This is difficult to
optimize.

(2) Indirect approach (with reparameterization). Instead, you could start with
standard-sized blanks (from a fixed, standard distribution) and then apply
a consistent transformation to each blank (scaling and shifting). Now, if
you need to adjust your output distribution, you only need to modify the
transformation parameters, not the random selection process.

The reparameterization trick follows this second approach.
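A minimal PyTorch sketch of a VAE forward pass combining the reparameterization trick with a single-sample Monte Carlo ELBO; the architecture, dimensions and all names are illustrative, and a Gaussian decoder with unit variance is assumed, so the reconstruction term is a squared error up to an additive constant.

import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE sketch: diagonal Gaussian encoder, Gaussian decoder with unit variance."""
    def __init__(self, q_dim, latent_dim, hidden=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(q_dim, hidden), nn.Tanh())
        self.enc_mu = nn.Linear(hidden, latent_dim)
        self.enc_logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(), nn.Linear(hidden, q_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                         # reparameterization trick
        z = mu + torch.exp(0.5 * log_var) * eps
        x_rec = self.dec(z)
        # single-sample Monte Carlo ELBO (reconstruction term up to an additive constant)
        recon = -0.5 * ((x - x_rec) ** 2).sum(dim=1)
        kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(dim=1)
        return -(recon - kl).mean()                        # negative ELBO as the loss to minimize

# usage sketch: one gradient computation on a toy batch
model = VAE(q_dim=5, latent_dim=2)
loss = model(torch.randn(64, 5))
loss.backward()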

10.2.5 Discussion
VAEs illustrate the latent factor approach described earlier: the hidden variables Z
capture underlying structure, and the encoder-decoder networks map between X and
Z. This ensures that VAEs can both reconstruct existing data and sample novel data
points, all while maintaining a tractable training objective (the ELBO). In practice, many
variants of VAEs exist, e.g., β-VAEs or conditional VAEs, each modifying the objective
or architecture to emphasize different aspects, such as disentangled latent representations
or conditional generation.
Overall, VAEs remain one of the most popular DGMs due to their relative conceptual
simplicity, stable training procedure, and ability to produce both probabilistic encodings
and realistic sample generations.

10.3 Other approaches related to latent factor models


In addition to VAEs, several other approaches related - more or less - to latent factor
approaches have been proposed for deep generative modeling. In this section, we highlight


two popular methods - generative adversarial networks (GANs) and giffusion models -
and then we briefly compare how these approaches relate to each other and to VAEs.

10.3.1 Generative adversarial networks


Introduction

GANs, introduced by Goodfellow et al. [84], represent a different strategy for generative
modeling. Unlike VAEs, which explicitly approximate a conditional probability distri-
bution over the latent variables Z ∼ qϑ (z | X) via a latent variable formulation, GANs
implicitly learn to generate data directly from random latent factors Z sampled from
conventional distributions by pitting two neural networks against each other in an adver-
sarial game. In other words, the encoder part is missing from GANs. These two networks,
the generator and the discriminator, evolve through a competitive process that can yield
remarkably realistic samples in many domains - especially images - but unfortunately
suffers from training difficulties.

The GAN framework

A GAN comprises two main components:

• Generator (G): A neural network generator, parameterized by \vartheta_1 \in \mathbb{R}^{r_1}, takes as input a random noise vector Z ∼ π(z) (commonly a standard multivariate Gaussian distribution) and outputs a synthetic sample

X' = G(Z; \vartheta_1).

The generator’s objective is to produce data that is indistinguishable from real data
by the discriminator.

• Discriminator (D): A neural network discriminator, parameterized by \vartheta_2 \in \mathbb{R}^{r_2}, receives either a real sample X (from the true dataset) or a synthetic sample X' (from the generator), and outputs a scalar D(X; \vartheta_2) \in [0, 1]. This scalar is interpreted as the probability that input X is ‘real’. The discriminator aims to correctly distinguish real data from generated data.

The generator is designed to take a random noise vector and transform it into a generated
output by processing the noise through several neural network layers. This process allows
the generator to learn how to create realistic images from random inputs. On the other
hand, the discriminator receives a sample as input and then processes it through a series
of layers. It outputs a probability via a sigmoid activation function that indicates whether
the sample is real, i.e., sampled from the dataset used to train the GAN, or generated
by the generator network.
As in the earlier sections, we use ϑ generically to denote the model parameters, though
in practice one typically maintains separate sets of parameters for G and D, i.e., ϑ =
(ϑ1 , ϑ2 ). These networks are trained simultaneously in a mini-max game.


The GAN objective function

GANs frame training as a zero-sum game between the generator and the discriminator.
The value function is given by
\min_G \max_D V(D, G) = \mathbb{E}_{X \sim p_{\mathrm{data}}(x)}\big[ \log D(X) \big] + \mathbb{E}_{Z \sim \pi(z)}\big[ \log\big( 1 - D(G(Z)) \big) \big].

Here:

• X ∼ pdata (x) denotes the true (unknown) data distribution, i.e., X ∼ P.

• Z ∼ π(z) is the prior for the noise vector (often a Gaussian or uniform distribution).

The discriminator maximizes V (D, G) to try to distinguish real versus fake samples. The
generator minimizes V (D, G), trying to “fool” the discriminator D so that generated
samples are classified as real. In practice, optimization is performed via alternating
gradient-based updates.
The discriminator is trained using the binary cross-entropy loss function, which is suitable
for binary classification tasks (real vs. fake); in fact, the log-likelihood of the Bernoulli
distribution is given by Y log p + (1 − Y ) log(1 − p) which is the structure of the above
minimax game.
The discriminator's weights are frozen during the training of the generator, that is, only the generator's parameters are updated in this step. Freezing the discriminator ensures that the updates act on the generator network alone, allowing it to learn how to produce samples that can optimally fool the discriminator; the discriminator's weights remain fixed and serve only to provide a training signal (the discriminator's classification score), so the generator progressively learns to create samples that fool the current state of the discriminator.
Training GANs is known to be difficult and can suffer from issues such as mode collapse
(where the generator learns to produce only a limited variety of samples) or vanishing
gradients. Despite these challenges, with proper techniques (e.g., careful network design,
hyperparameter tuning, and objective variants like Wasserstein GAN [4]), GANs can
generate highly detailed and convincing samples.
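A minimal PyTorch sketch of one alternating training step, assuming toy fully-connected networks for G and D; all architectures and names are illustrative. The generator is held fixed (via detach) in the discriminator update, and only the generator's optimizer steps in the generator update, so the discriminator merely provides the training signal there.

import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))        # generator
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

def training_step(x_real):
    n = x_real.shape[0]
    # (1) discriminator update: binary cross-entropy for real (label 1) vs. generated (label 0) samples
    x_fake = G(torch.randn(n, latent_dim)).detach()        # generator held fixed here
    loss_D = bce(D(x_real), torch.ones(n, 1)) + bce(D(x_fake), torch.zeros(n, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # (2) generator update: only the generator's parameters are stepped, D provides the training signal
    loss_G = bce(D(G(torch.randn(n, latent_dim))), torch.ones(n, 1))   # "fool" D into labeling fakes as real
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# usage sketch on a toy batch of "real" data
print(training_step(torch.randn(64, data_dim)))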

10.3.2 Diffusion models


Introduction and motivation

Diffusion models, also referred to as score-based generative models, see Song–Ermon [209],
have recently gained significant attention as a state-of-the-art approach for image and
audio generation, see Ho et al. [99]. Unlike VAEs and GANs, which rely on (possibly)
low-dimensional latent factors or adversarial training objectives, diffusion models employ
a forward noising process paired with a reverse denoising process. The forward process
systematically corrupts data into noise, and the reverse process - learned by a neural
network - seeks to recover clean data from noisy samples. Thus, at generation time,
one simply starts from random noise and iteratively applies the learned reverse process


to obtain a final synthetic sample. Similar to GANs, there is no need for an encoder
model within the diffusion modeling paradigm and, moreover, we do not approximate
conditional latent factors, but we rather learn implicitly directly from random samples.

Forward and reverse processes

A typical forward noising process (following [208, 99]) is defined as a Markov chain of
length T . Starting with a real data sample X 0 (e.g., an image), we produce a sequence
of increasingly noisy samples

X_1, X_2, \ldots, X_T,

where each step X_t is obtained by adding a small amount of Gaussian noise to X_{t-1}. Concretely, one common choice is

q(X_t \mid X_{t-1}) \overset{(d)}{=} \mathcal{N}\big( \sqrt{1-\beta_t}\, X_{t-1}, \, \beta_t I \big),

with a variance schedule \{\beta_t\}_{t=1}^T \subset (0, 1). After iterating this for several steps, the sample is nearly indistinguishable from pure Gaussian noise (provided that the variance schedule adds sufficient noise).

Training objective

Training aims to learn the reverse distribution pϑ (X t−1 | X t ) so as to maximize the


likelihood of clean samples under the implied marginal distribution. One way to view this is through a variational perspective, see Sohl-Dickstein et al. [208], which yields a variational (ELBO-type) bound on the negative log-likelihood of X_0. However, Ho et al. [99] propose a simplified
(yet empirically effective) noise-prediction objective, in which a neural network is trained
to predict the noise ε that was added at each forward step. Concretely:

• Let X 0 be a real sample and ε ∼ N (0, I) be drawn independently.

• Define the noised version of X_0 at step t via

X_t = \sqrt{\alpha_t}\, X_0 + \sqrt{1 - \alpha_t}\, \varepsilon,

where \alpha_t = \prod_{s=1}^{t} (1 - \beta_s).

• The neural network ϵϑ (often a U-Net, see Ronneberger et al. [195], which is a
type of encoder-decoder CNN framework useful for working with images or similar
architectures) is trained to predict ε from (X t , t). The training loss commonly used
is
\mathcal{L}_{\mathrm{simple}}(\vartheta) = \mathbb{E}_{X_0,\, \varepsilon,\, t}\Big[ \big\| \varepsilon - \epsilon_\vartheta(X_t, t) \big\|_2^2 \Big].

By minimizing this loss, the model learns to “denoise” X_t at each time step, effectively
approximating the score function (the gradient of the log-density w.r.t. the data) and
providing a route to reverse the forward chain.
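A minimal PyTorch sketch of the forward noising step and the simplified noise-prediction loss; the variance schedule, the toy stand-in for the noise predictor ε_ϑ and all names are illustrative.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                  # variance schedule beta_t
alphas = torch.cumprod(1.0 - betas, dim=0)             # alpha_t = prod_{s <= t} (1 - beta_s)

def noised_sample(x0, t):
    """Draw X_t = sqrt(alpha_t) X_0 + sqrt(1 - alpha_t) * eps for a batch of clean samples x0."""
    eps = torch.randn_like(x0)
    a = alphas[t].view(-1, *([1] * (x0.dim() - 1)))    # broadcast alpha_t over the data dimensions
    return torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps, eps

def simple_loss(eps_model, x0):
    """Noise-prediction objective: predict eps from (X_t, t)."""
    t = torch.randint(0, T, (x0.shape[0],))            # random time step for each sample
    x_t, eps = noised_sample(x0, t)
    return ((eps - eps_model(x_t, t)) ** 2).mean()

# usage sketch with a trivial stand-in for the (usually U-Net based) noise predictor
toy_eps_model = lambda x_t, t: torch.zeros_like(x_t)
print(simple_loss(toy_eps_model, torch.randn(16, 3)))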


Sampling and generation

Once trained, sampling proceeds by starting with X T ∼ N (0, I) and recursively applying
X_{t-1} \sim p_\vartheta(X_{t-1} \mid X_t),    t = T, \ldots, 1,
to obtain a final sample X 0 that resembles the data distribution. In practice, the neural
network predicts either the noise ε or the clean image X 0 from (X t , t), and one uses
these predictions to sample from the approximate reverse Gaussian.

Discussion

Diffusion models present a compelling alternative to VAEs and GANs for high-fidelity
generation, particularly for large-scale image and audio data. In contrast to the single-
step generation of GANs (where latent noise is transformed into a final image in one
forward pass), diffusion models gradually refine pure noise into structured data through a
learned denoising sequence. Although this multi-step sampling can be slower, the gradual
nature often leads to stable training dynamics and high-quality samples. Moreover, Song–
Ermon [209] and subsequent works show that diffusion and score-based approaches can be
unified through a differential equation perspective, with many interesting recent advances
about the theoretical foundations of these models. Empirically, modern diffusion models
have achieved state-of-the-art results on various generation tasks.

10.4 Decoder transformer models


10.4.1 Introduction and motivation
Decoder transformers extend the transformer framework (introduced in Section 8.5) to
the task of auto-regressive generation. In contrast to an encoder-only transformer (as in
Section 8.5), which processes an entire input sequence to extract a meaningful represen-
tation (recall from Section 8.5.3 that we captured this within the CLS token which is
used for downstream modeling), decoder-only architectures focus on predicting each sub-
sequent element of an output sequence given the elements generated so far. This makes
them naturally suited to generative modeling, particularly of text, but also of other se-
quential modalities. Well-known examples of decoder-based transformer models include
GPT, see Radford et al. [180] and Brown et al. [32], BART, see Lewis et al. [135], and
T5 (in its decoder-only mode), see Raffel et al. [181].

10.4.2 Architecture overview


A decoder transformer follows a structure similar to a standard transformer but is spe-
cialized for sequence generation. Let

X1:T = X1 , X2 , . . . , XT ,
denote a sequence of tokens (e.g., words, subwords, or characters). The model defines
the joint distribution over all tokens via the factorization
T
Y 
pϑ (X1:T ) = pϑ Xt X1:t−1 .
t=1


At each time step t, the decoder uses self-attention over the previously generated tokens
X1:t−1 , together with positional embeddings, to form a representation from which to
predict the next Xt . Notably, a causal mask is applied in the self-attention mechanism to
ensure the model can only attend to past tokens, preserving the auto-regressive property
and preventing future leakage; this is not the case in classical transformers as discussed in
Section 8.5, because the queries and keys can freely interact in the attention mechanism
(8.24).
Importantly, and in contrast to the encoding transformers presented earlier, the positional
embeddings in decoding transformers are not usually learned, but are a static (previsible)
function; we refer to Vaswani et al. [228] for the original approach using trigonometric
functions of the position of each token and to Su et al. [211] for the highly successful
rotary position embedding (ROPE) approach, which has been widely adopted by modern
large language models.

10.4.3 Self-attention with causal masking

Recall from (8.24) that the scaled dot-product attention mechanism computes, for a query
Q, key K, and corresponding values V , the attention head

H = \mathrm{softmax}\!\left( \frac{Q K^\top}{\sqrt{q}} \right) V,

where q denotes the (embedding) dimension of the query vectors q_u, the key vectors k_u and usually also the value vectors v_u, 1 \le u \le t-1, i.e., we have Q, K, V \in \mathbb{R}^{(t-1) \times q}, see (8.24).
In a decoder block, to preserve time-causality, each token’s query vector q u is restricted
to only attend to keys from the previous positions, (ks )us=1 . This is implemented via a
causal mask in the softmax step, setting the attention scores to zero whenever the key
position exceeds the query position. More precisely, this is done by setting q_u^\top k_s to -\infty for s > u. Formally, for each token index u and each position s, we set the mask

\mathrm{mask}_{s,u} = \begin{cases} 0, & \text{if } s \le u, \\ -\infty, & \text{if } s > u, \end{cases}    (10.3)

where the mask is applied in an additive manner to q_u^\top k_s. This results in an attention score of 0 whenever s > u, see (8.25). This implies that the model can only “see”
previously generated tokens, enabling true auto-regressive generation and time-causality.
Stacking multiple layers of masked self-attention and feed-forward transformations yields
the typical decoder transformer.
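A minimal PyTorch sketch of a single masked attention head, using a boolean upper-triangular mask to implement (10.3); dimensions and names are illustrative.

import torch

def causal_self_attention(Q, K, V):
    """Masked scaled dot-product attention for one head; Q, K, V have shape (t, q)."""
    t, q = Q.shape
    scores = Q @ K.T / q ** 0.5                                   # raw scores q_u^T k_s / sqrt(q)
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))              # additive mask (10.3): -inf where s > u
    weights = torch.softmax(scores, dim=-1)                       # attention weight 0 whenever s > u
    return weights @ V

# usage sketch: a sequence of 4 tokens with embedding dimension 8
Q = K = V = torch.randn(4, 8)
print(causal_self_attention(Q, K, V).shape)                        # torch.Size([4, 8])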


Analogy for causal masking in self-attention

Causal masking in decoder transformers can be understood through the lens of an


actuarial reserving exercise, where only past information is available for predicting
future development. In a claims development triangle, we observe

\begin{pmatrix} C_{1,1} & C_{1,2} & \cdots & C_{1,n} \\ C_{2,1} & C_{2,2} & \cdots & ? \\ \vdots & \vdots & \ddots & \vdots \\ C_{n,1} & ? & \cdots & ? \end{pmatrix},

where the first index is the accident year and the second is the development year.
When projecting future claims (Ci,j , where i + j > n + 1, i.e., beyond the cur-
rent calendar year), we restrict ourselves to using only observed claims-entries in
the upper-left triangle. This is directly analogous to causal masking, where for
predicting token Xt , the model can only attend to previous tokens X1:t−1 .
Formally, in self-attention with causal masking, we use (10.3) to ensure that at-
tention scores for future positions become invisible after the softmax application,
preventing information leakage from tokens not yet generated - just as actuaries
cannot use future claims development factors that have not occurred yet.
The sequential nature of both processes emphasizes the auto-regressive property:
each new prediction builds upon previous predictions, compounding both capabil-
ities and potential errors. Just as errors in early development factors propagate
through the entire claims triangle, errors in early token predictions can influence
all subsequent generations in a decoder transformer model.

10.4.4 Softmax outputs and probability calibration


At the final step of a decoder transformer layer, the hidden representation is typically
projected onto the vocabulary space W, yielding another set of so-called logits for each
token, {logits(w)}w∈W , and applying the softmax function gives the categorical distribu-
tion
p_\vartheta\big( X_t = w \mid X_{1:t-1} \big) = \frac{\exp\big( \mathrm{logits}(w) / T \big)}{\sum_{u \in \mathcal{W}} \exp\big( \mathrm{logits}(u) / T \big)},
where T is often referred to as the temperature (see the next paragraph). In this way, the
model’s outputs define an implicit probability distribution over all possible next tokens.
Therefore, conceptually, decoder-only transformers align with the implicit probability dis-
tribution approach discussed above: the next token is fully parameterized by the model,
conditioned solely on the past. In practice, this allows for straightforward sampling -
token by token - and yields a model that scales gracefully as the dataset grows.
Importantly we must also note that large neural models are frequently miscalibrated,
meaning their predicted probabilities may not align well with true outcome frequencies,
see Guo et al. [89], in other words, the balance property of Section 4.1.2 is frequently not
fulfilled with decoder transformers. Consequently, temperature scaling is commonly used
to adjust the sharpness of the distribution: increasing T > 1 flattens the distribution
to reflect higher uncertainty, whereas decreasing T < 1 sharpens the distribution to


make more confident predictions. Such post-hoc calibration not only affects sampling
diversity when generating text (by controlling how quickly the distribution’s mass is
concentrated), but also helps ensure that probability estimates from the softmax layer
align more faithfully with actual uncertainties.

10.4.5 Training and inference


Training (teacher forcing). Decoder transformers are typically trained via teacher
forcing, wherein the ground-truth token Xt at time step t is predicted from all preceding
tokens X1:t−1 . Unlike standard machine learning loss functions that treat each sample
independently, teacher forcing uses the ground-truth token Xt at each time step to con-
dition the prediction of the next token, thereby capturing the sequential dependencies
inherent in language data. Moreover, this training scheme results in a very efficient use
of sequence data. However, while this approach accelerates training by providing a clear
context, it may introduce exposure bias during inference when the model must rely on
its own generated tokens; for more discussion of this and a solution we refer to Bengio et
al. [19]. Specifically, we minimize a negative log-likelihood objective:

\ell(\vartheta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\vartheta\big( X_t \mid X_{1:t-1} \big),

over a dataset of sequences. By training in this manner, the model learns to predict
the next token based on the prior context. Although this training task of the next
token prediction appears to be simple, nonetheless, it is sufficient for decoder transformer
models to learn highly useful representations that can be adapted for various NLP tasks.
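A minimal PyTorch sketch of the teacher forcing objective, assuming a (hypothetical) causal model that maps a token sequence to next-token logits; all names are illustrative.

import torch
import torch.nn.functional as F

def teacher_forcing_loss(model, tokens):
    """Average negative log-likelihood -(1/T) sum_t log p(X_t | X_{1:t-1}) with teacher forcing.

    `model` maps a token sequence to next-token logits of shape (len(sequence), vocab_size),
    where position t-1 holds the prediction for token X_t (causal masking)."""
    inputs, targets = tokens[:-1], tokens[1:]      # shift the ground-truth sequence by one position
    logits = model(inputs)
    return F.cross_entropy(logits, targets)        # averages -log p(X_t | X_{1:t-1}) over t

# usage sketch with a trivial stand-in model over a vocabulary of 10 tokens
toy_model = lambda x: torch.zeros(len(x), 10)
print(teacher_forcing_loss(toy_model, torch.tensor([1, 4, 2, 7, 3])))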

Inference (auto-regressive generation). Once trained, decoding proceeds token-


by-token. We begin with a first token X1 , compute pϑ (X2 | X1 ), sample or select the
most probable token, append it to the partial sequence, and continue until we reach a
predefined end-of-sequence token or a desired length. This iterative procedure naturally
yields samples that reflect the learned distribution over sequences. Later, we will mention
some more advanced sampling schemes.

Applications and variants. Decoder transformer models have been successfully de-
ployed in a wide range of generative tasks, including:

• Language modeling and text generation: GPT models, see Radford et al. [180]
and Brown et al. [32], achieve state-of-the-art results on various NLP benchmarks,
generating coherent text and facilitating tasks such as summarization, translation,
and open-domain dialogue.

• Conditional text generation: Using additional conditioning signals (e.g., prompts,


context paragraphs), decoder transformers can produce targeted text in specific
domains. BART of Lewis et al. [135] and T5 of Raffel et al. [181] exemplify encoder-
decoder or decoder-focused architectures that excel at summarization, question-
answering, and more.


• Code generation: Extensions of the decoder transformer architecture have also


found success in generating programming code, assisting software development
workflows.

Ongoing research explores scaling up decoder transformers to billions (and even trillions)
of parameters, yielding large language models capable of zero-shot or few-shot learning,
robust transfer, and coherent long-range generation. Although computationally expen-
sive, these models adapt to diverse downstream tasks with minimal additional training.

10.5 Conclusion on generative models


Generative modeling now represents a cornerstone of modern machine learning, enabling
us to capture intricate distributions over high-dimensional data and opening avenues
for rich, creative applications. We have surveyed a variety of deep generative modeling
approaches - from latent factor methods such as VAEs, GANs and diffusion models to
implicit techniques such as auto-regressive decoder transformers. Each framework brings
distinct strengths: VAEs offer tractable likelihood-based training and interpretable la-
tent spaces, GANs facilitate visually compelling sample generation through adversarial
training, and diffusion models provide stable multi-step data refinement. Meanwhile,
auto-regressive transformers address sequential dependencies in language and beyond,
forming the basis of cutting-edge large-scale language models. We summarize the discus-
sion to this point in Table 10.1.
Despite these differences, the unifying goal is to learn pϑ (X1:T ) in a way that captures
the underlying structure of the data, allowing us to sample, infer probabilities, or encode
latent representations. Remarkably, the simple idea of modeling the distribution of ob-
served data has led to new modeling capabilities which are ground-breaking for tasks such
as image synthesis, text generation, uncertainty quantification, and other areas critical to
science, finance, and technology. As model scale continues to grow, and as new training
paradigms emerge (e.g., incorporating human feedback or advanced prompt engineering),
generative models will likely expand their role even further.
The generative modeling approaches we have explored thus far - VAEs, GANs, diffusion
models, and auto-regressive transformers - each represent important milestones in our
ability to learn and sample from complex distributions. Yet, it is the decoder-based
transformer architecture, specifically when scaled to massive parameters and trained on
diverse textual corpora, that has yielded perhaps the most remarkable breakthrough in
recent AI tools: the emergence of Large Language Models (LLMs).
LLMs can be viewed as the natural evolution of auto-regressive generative modeling,
where the scaling of data, parameters, and computation, as well as the application of
many new and novel techniques, has led to qualitatively different capabilities from previ-
ous types of generative models. Unlike specialized generative models designed for narrow
domains (e.g., image generation with GANs or diffusion models), LLMs demonstrate a
surprising degree of generality - they can generate not only coherent text across domains
but also exhibit capabilities in reasoning, planning, and task adaptation that were not
explicitly engineered.


Table 10.1: Comparison of key generative modeling approaches

VAEs
• Core principle: learn latent representations via variational inference
• Strengths: explicit probabilistic formulation; stable training dynamics; interpretable latent space
• Limitations: often yields blurry samples; limited sample diversity; complex reconstruction-regularization balance
• Typical applications: dimensionality reduction; anomaly detection; data augmentation

GANs
• Core principle: adversarial training between generator and discriminator
• Strengths: high-fidelity samples; sharp, realistic outputs; excels in visual domains
• Limitations: training instability; mode collapse; no explicit density estimation
• Typical applications: image synthesis; style transfer; data augmentation

Diffusion models
• Core principle: gradual denoising process, reversing forward diffusion
• Strengths: state-of-the-art quality; stable training; flexible conditioning
• Limitations: slow sequential sampling; computationally intensive; complex implementation
• Typical applications: image generation; audio synthesis; 3D content creation

Auto-regressive transformers
• Core principle: factorize joint distribution into conditional predictions
• Strengths: scales to massive datasets; captures long dependencies; ideal for sequential data
• Limitations: exposure bias; limited context window; sequential generation
• Typical applications: text generation; code synthesis; language modeling


The connection between traditional generative models and LLMs is not merely archi-
tectural. The fundamental principles we have discussed, which are learning probability
distributions, leveraging self-attention mechanisms, and employing auto-regressive fac-
torization, remain at the core of LLM design. However, at the extreme scales of modern
LLMs (with billions or trillions of parameters), these principles yield models that tran-
scend simple next-token prediction to capture deeper patterns of language, knowledge,
and problem-solving.
In the following sections, we will explore how LLMs have extended the generative paradigm
from simple distribution learning to complex systems capable of contextual understand-
ing, few-shot learning, and even rudimentary reasoning. This trajectory from specialized
generative models to increasingly general AI systems reflects both the power of scale and
the richness of the frameworks we have developed to this point for modeling complex
distributions.

10.6 Large language models


LLMs are perhaps the crowning achievement of the field of deep generative modeling to
date, having shown dramatic success on a variety of machine learning tasks, capturing
both public attention and massive capital investments. In this section, we will narrow
our focus specifically to LLMs, which are massive pre-trained decoder-based transformers
that have recently dominated the landscape of NLP and multimodal tasks. Building on
the foundational ideas introduced in this chapter, we will examine how LLMs leverage
in-context learning, alignment techniques, and prompt-based interfaces to deliver aston-
ishing generative performance for real-world applications. We will then cover emerging
topics like reasoning models and discuss methods for working with LLMs in a safe man-
ner. It is important to emphasize to readers that, firstly, many of these areas are still
emerging and are sometimes highly heuristic, and, secondly, due to the financial impor-
tance of LLMs, most research is now no longer published, so we need to rely on less
formal sources to explain some of the concepts.

10.6.1 From auto-regressive transformers to LLMs


The conceptual roots of LLMs stem from the auto-regressive decoder transformer archi-
tectures that we have discussed in Section 10.4. As discussed there, an auto-regressive
model, in essence, factorizes the probability of a token sequence X1:T = (X1, . . . , XT) as

pϑ(X1:T) = ∏_{t=1}^{T} pϑ(Xt | X1:t−1).

By combining this factorization with the self-attention mechanism of transformers, Vaswani


et al. [228], each token’s prediction can condition on the entire preceding context in a
computationally efficient manner. The resulting architecture naturally lends itself to
large-scale training, which, in turn, uncovers extensive patterns in language data.
Early work on transformers, Brown et al. [32] and Kaplan et al. [113], revealed scaling
laws indicating that larger models (in terms of parameters) trained on correspondingly
larger datasets achieve systematically lower training and validation losses. This discovery


motivated the construction of models with billions (and later trillions) of parameters,
coining the term Large Language Models (LLMs). As model capacity increases, LLMs
often display emergent capabilities not present in smaller counterparts, such as better
zero-shot generalization and few-shot in-context learning (described below).
Alongside model scaling, researchers recognized the importance of diverse, high-quality
training corpora. Auto-regressive transformers trained on multi-domain data (web text,
scientific articles, code, etc.) acquire flexible linguistic competence that can be harnessed
for many downstream tasks, simply by changing the prompt or performing minimal fine-
tuning. The development of the principles which underlie modern LLMs can be traced
through the advances contained in the Generative Pretrained Transformer (GPT) series
of papers.

GPT-1. The first GPT, Radford et al. [180], demonstrated that a unidirectional (de-
coder) transformer trained on large unsupervised corpora could achieve strong per-
formance on downstream tasks with minimal fine-tuning. This pretrain-then-finetune
paradigm became a blueprint for subsequent GPT-style models.

GPT-2. Scaling up both model size (up to 1.5B parameters) and data quantity revealed
that bigger models not only improved perplexity (in simple terms, the loss of the model)
but could also generate impressively coherent texts, Radford et al. [181]. GPT-2 sparked
discussions about responsible model release due to concerns over disinformation and
misuse, thus, highlighting ethical and security considerations.

GPT-3 and in-context learning. GPT-3, Brown et al. [32], introduced a much larger
model (up to 175B parameters) and ushered in the era of in-context learning. Surpris-
ingly, GPT-3 could perform new tasks simply by reading a handful of examples within
the prompt - few-shot prompting - without any gradient updates to model parameters.
This phenomenon occurs because of the model’s internal representation of language: it
implicitly “learns” from the in-prompt examples and generalizes these patterns to pre-
dict the next tokens. This emergent capability defied earlier assumptions that explicit
fine-tuning was always necessary.

GPT-3.5, GPT-4, and further refinements. Subsequent iterations like GPT-3.5


and GPT-4 refined the architecture, improved instruction-following, and integrated align-
ment techniques (see the discussion on RLHF in Section 10.6.4). These models further
demonstrated emergent competencies in reasoning, math problem-solving, and creative
writing, driven by the combination of massive parameter counts, diverse training data,
and sophisticated alignment protocols.

A growing body of work attempts to explain how LLMs implement in-context learning
from a theoretical standpoint:

• Meta-learning perspective: LLMs may store “internal optimizers” or representations


that mirror gradient-based learning, thus, enabling them to adapt swiftly to new
tasks within the forward pass alone; see Kirsch et al. [121].


• Implicit Bayesian inference: Some interpret in-context learning as performing ap-


proximate Bayesian updates over prompts, where examples shape the posterior
distribution over the next-token predictions; see Xie et al. [245].

• Emergent structures: Empirical analyses suggest that deeper, wider transform-


ers spontaneously encode schema-like knowledge, capturing linguistic phenomena
(syntax, semantics) and domain rules (logic, mathematics); see our discussion on
mechanistic interpretability later in Section 10.6.10.

While no single unifying theory fully accounts for in-context learning, these angles under-
score its complexity and partially illuminate the remarkable “learning without parameter
updates” phenomenon.

Summary and outlook

The journey from auto-regressive transformers to modern LLMs reflects a combination


of:

• Model scaling in parameter count and data size.

• Emergent phenomena, such as in-context learning and few-shot prompting.

• Alignment methods, including human feedback and chain-of-thought prompting.

These elements, taken together, have propelled LLMs forward, creating significant capa-
bilities in language understanding and generation. Research continues to expand context
windows, refine architectural insights, and develop more robust theoretical frameworks.
Given the significant achievements of LLMs, their theoretical underpinnings and practical
implications will likely remain a central focus in research in generative modeling.
From a research perspective, a significant practical constraint of auto-regressive trans-
formers is the context window, typically limited by computational considerations (e.g., a
few thousand tokens). Research on extending this context to tens or hundreds of thou-
sands of tokens (via efficient attention mechanisms or hierarchical memory) is ongoing;
see Beltagy et al. [17] and Chowdhery et al. [43]. These longer context windows allow
LLMs to handle extensive documents, multi-step narratives, or long code bases, further
extending their utility.


In-context learning: a credibility analogy without parameter updates

Analogy overview. In-context learning enables large language models (LLMs)


to adapt to new tasks simply by seeing examples in the prompt, without modifying
their underlying weights. This mechanism has a meaningful parallel in credibility
theory, where an actuary working in specialist lines of business refines a broad
“base rate” with policyholder-specific information - again, without rebuilding the
entire rating model.
1. Base rate model. An insurer’s rate manual provides a baseline premium
derived from population-level statistics. This is akin to the LLM’s pretrained
parameters, which capture general linguistic or domain knowledge.
2. Individual risk experience. When pricing a single policy, the insurer reviews
the insured’s personal claims history to tailor the final premium. In a LLM,
analogous “in-context examples” in the prompt illustrate task-specific patterns for
the immediate query.
3. Credibility-weighted output. Classical Bühlmann credibility [34] can be
summarized as

Premium = ω × (Individual Experience) + (1 − ω) × (Class Rate),

where ω ∈ [0, 1] is a credibility factor that determines how much to rely on
individual risk data versus the global model.

• No retraining: The underlying rating tables remain intact, reflecting prior


knowledge.

• Context-sensitive adjustment: The final estimate updates automatically


based on recent experience.

Link to in-context learning. Similarly, a LLM conditions its output on both


the query Q and provided examples (E1 , E2 , . . . , En ):

p(y | Q, E1 , . . . , En ).

This process can be viewed as an implicit “credibility weighting” (or Bayes’ update)
for the new prompt examples - just as an actuary balances a policy’s past claims
with broader class results.
No parameter updates required. In both cases, the global model (the insurer’s
rating manual or the LLM’s pretrained weights) remains unchanged. New, context-
specific outputs are produced without the computational overhead of a full re-
training process or the risk of forgetting established knowledge.
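As a small numerical illustration of this analogy, the following Python sketch uses the standard Bühlmann credibility factor ω = n/(n + k), where n is the number of observation years and k a credibility constant; the figures (individual mean, class rate, n and k) are purely hypothetical. The rating manual plays the role of the pretrained weights, and the individual experience plays the role of the in-context examples.

# Credibility weighting as an analogy for in-context learning (illustrative numbers only).

def credibility_premium(individual_mean, class_rate, n, k):
    """Buhlmann-style credibility estimate with omega = n / (n + k)."""
    omega = n / (n + k)                       # weight on the individual experience
    return omega * individual_mean + (1.0 - omega) * class_rate, omega

# Hypothetical example: 5 years of own claims experience, class rate from the manual.
individual_mean = 1200.0    # observed average annual claims cost of this policyholder
class_rate = 900.0          # "base rate" from the rating manual (the global model)
n, k = 5, 20                # years of experience and credibility constant (assumed)

premium, omega = credibility_premium(individual_mean, class_rate, n, k)
print(f"credibility factor omega = {omega:.2f}, premium = {premium:.0f}")
# The manual (global model) is untouched; only the context (own experience) changes the output.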


10.6.2 Foundation models and pretraining


Modern LLMs typically emerge from large-scale foundation models, see Bommasani et
al. [24], which are neural networks (often transformers) trained on massive corpora en-
compassing various domains and modalities. These foundation models are learned via
self-supervised or unsupervised objectives, such as masked language modeling or, more
often, next-token prediction. The goal is to capture rich statistical patterns over text,
enabling the model to generalize across tasks with limited or no additional fine-tuning.
A typical pretraining process involves:

• Large-scale data: Billions of tokens from diverse sources, including web pages,
books, and domain-specific datasets.

• Transformers at scale: Decoder architectures, see Section 10.4, or encoder-decoder


transformers, often with billions of parameters, trained on large compute clusters.

• Unsupervised objectives: Masked language modeling (e.g., BERT-style) or auto-


regressive next-token prediction (e.g., GPT-style).

This pretraining yields a foundation that can be adapted to a variety of downstream


tasks, simply by adjusting prompts or performing light fine-tuning.

10.6.3 Use cases of large language models


The proliferation of LLMs has yielded significant impact across diverse sectors:

• Customer support and chatbots: Companies employ LLM-driven assistants for query
handling, knowledge base lookups, and interactive customer service.

• Content generation: Automated writing, summarization, and content ideation work-


flows leverage the generative capacity of LLMs.

• Coding assistance: Tools like GitHub Copilot, based on transformer architectures,


suggest code completions, refactorings, and bug fixes.

• Scientific and legal research: LLMs facilitate initial drafts for technical documents,
case analyses, or literature reviews, speeding up research processes.

While LLMs show promising capabilities, caution must be exercised regarding halluci-
nations, biases, and misinterpretations. Techniques like RLHF, see Section 10.6.4, and
carefully designed prompts, see Section 10.6.6, can mitigate these issues to some extent.

10.6.4 Fine-tuning with reinforcement learning from human feedback


Although pretrained LLMs demonstrate impressive linguistic capabilities, they can gen-
erate outputs that are misleading, biased, or misaligned with human preferences. Rein-
forcement learning from human feedback (RLHF) aims to address this by incorporating
human judgements into the fine-tuning process, see Christiano et al. [44] and Ouyang et
al. [172].


An RLHF procedure might be applied as follows:

(1) Initial supervised fine-tuning: Start from the pretrained model and fine-tune it on
labeled examples to adapt it toward desired behaviors (e.g., polite conversation).

(2) Reward model (RM) training: Collect human preference data (e.g., given two model
outputs, which is more helpful or correct?) and train a reward model to predict
these preference scores.

(3) Reinforcement learning-based optimization: Using the reward model as a proxy


for human preferences, update the LLM weights via reinforcement learning (e.g.,
proximal policy optimization, see Section 11.9.2, below). The model is thus guided
to produce outputs favorably judged by humans.

RLHF has been instrumental in aligning LLMs’ outputs with more human-like values,
improving their helpfulness and reducing harmful or factually incorrect content.
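To make step (2) more tangible, the following minimal Python sketch computes a pairwise (Bradley–Terry type) preference loss, which is a common choice for reward model training; the reward scores and preference pairs below are purely illustrative and not taken from any particular implementation.

import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward model to score the human-preferred answer higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Hypothetical reward-model scores for pairs of candidate answers to the same prompt,
# where human annotators preferred the first answer of each pair.
pairs = [(1.8, 0.3), (0.2, 0.9), (2.5, 2.4)]
losses = [preference_loss(rc, rr) for rc, rr in pairs]
print([round(l, 3) for l in losses])   # small loss when the preferred answer already scores higher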


RLHF analogy: actuarial preferences for smoothness and monotonicity

In RLHF, a LLM is adjusted to align with human preferences as judged by a


reward model. One can draw a parallel with how actuaries incorporate business
or regulatory shape constraints - e.g. requiring that premium rates be smooth in
age or monotonically increasing in coverage - without redesigning the entire rating
structure.
1. Initial model setup. When pricing using GLMs, actuaries will often begin
with a baseline set of relativities, just as LLMs begin with pretrained weights.
These starting points embed broad empirical domain knowledge.
2. Defining the “preference” or constraint. While RLHF uses human anno-
tations to build a reward model that scores outputs, actuaries may impose formal
constraints such as:

• Smoothness: Ensuring rates underlying premiums do not change abruptly


between adjacent age bands.

• Monotonicity: For example, requiring that premium rates increase with cov-
erage amounts or risk classifications.

This functions analogously to a “preference function” specifying which model be-


haviors are desired or penalized.
3. Credibility-like balancing act. Actuaries incorporate these shape con-
straints without discarding prior data or ignoring established rates; rather, they
might credibility-weight the new constraints with the existing model (for example,
by balancing penalties with an empirical loss function, such as a deviance loss). In
RLHF terms, the constraints act as a reward/penalty that the model must learn
to satisfy, while a KL divergence (or similar) prevents overly large deviations from
the baseline model.
4. Controlled model updates. Just as the RLHF step fine-tunes a LLM to bet-
ter match human preferences (while preserving core language skills), the actuarial
process iteratively refines rates to respect shape constraints. The baseline struc-
ture stays in place - ensuring model continuity - while incremental updates push
the model toward smooth or monotonic relationships in the variables of interest.
Outcome. The result is a stable, constraint-aware model that remains consistent
with previous assumptions, much like a LLM that remains coherent to its pre-
training but is shaped by new feedback. By introducing shape constraints akin to
RLHF’s preference signals, actuaries achieve alignment with business rules, regu-
latory standards, and risk theory without requiring a full model rebuild.
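A stylized numerical sketch of this balancing act is given below: an empirical squared-error loss (a stand-in for a deviance loss) is combined with a penalty on violations of monotonically increasing rates across age bands, and the penalized objective is minimized by plain gradient descent. The observed relativities, the penalty weight lam and the step size lr are assumed for illustration only; this is not a production rating method.

import numpy as np

observed = np.array([1.00, 1.08, 1.05, 1.20, 1.18, 1.35])  # raw relativities by age band (toy data)
lam, lr = 5.0, 0.05                                         # penalty weight and step size (assumed)

def objective(r):
    fit = np.sum((r - observed) ** 2)                       # empirical loss (stand-in for a deviance)
    viol = np.maximum(r[:-1] - r[1:], 0.0)                  # positive wherever the rates decrease
    return fit + lam * np.sum(viol ** 2)                    # penalized objective

rates = observed.copy()                                     # start from the empirical estimates
for _ in range(500):                                        # plain gradient descent on the penalized loss
    viol = np.maximum(rates[:-1] - rates[1:], 0.0)
    grad = 2.0 * (rates - observed)                         # gradient of the fit term
    grad[:-1] += 2.0 * lam * viol                           # a rate exceeding the next one is pushed down ...
    grad[1:] -= 2.0 * lam * viol                            # ... and the following band is pushed up
    rates -= lr * grad

print("objective:", round(objective(observed), 4), "->", round(objective(rates), 4))
print(np.round(rates, 3))                                   # close to the data, (almost) monotone in age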

10.6.5 Parameter-efficient fine-tuning

In LLM workflows, full fine-tuning of all model parameters can be prohibitively expen-
sive in terms of computation and memory, especially for models with tens or hundreds


of billions of parameters. Consequently, researchers have proposed parameter-efficient


adaptation methods that offer strong performance with only a small fraction of parame-
ters being updated or introduced.

Overview of fine-tuning methods

Adapters [104]. Insert lightweight adapter layers within or between the existing trans-
former layers. During fine-tuning, only these adapter parameters are updated, while the
original model weights remain fixed. Adapters can be trained for each downstream task,
enabling modular reusability.

Prefix tuning [136]. Prepends a small set of learnable “prefix” tokens to each atten-
tion block. The main LLM parameters are frozen, and only the prefix embeddings are
optimized to steer model behavior.

Low-rank adaptation (LoRA) [105]. Instead of storing full dense weight updates,
LoRA factors the update matrix into low-rank components. During forward/backward
passes, these low-rank matrices are injected into attention or FNN layers. This greatly
reduces the memory overhead needed to adapt the model.
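The following small numpy sketch illustrates the low-rank idea: the pretrained weight matrix W is frozen, and only the factors A and B (together with a scaling α/r) are trainable. The layer dimensions, the rank r and the scaling α are chosen arbitrarily for illustration and are not tied to any specific library.

import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4                      # toy layer size and low rank (assumed)

W = rng.normal(size=(d_out, d_in))              # pretrained weight matrix (frozen)
A = rng.normal(size=(r, d_in)) * 0.01           # trainable low-rank factors
B = np.zeros((d_out, r))                        # initialized to zero so the update starts at 0
alpha = 8.0                                     # LoRA scaling hyperparameter (assumed)

def lora_forward(x):
    """Forward pass: frozen path W x plus the low-rank update (alpha / r) * B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(lora_forward(x).shape)                    # (64,)
print("trainable parameters:", A.size + B.size, "vs full fine-tuning:", W.size)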

Practical considerations

• Substantial memory savings: By freezing most layers, these approaches reduce


GPU/TPU memory usage and speed up training iterations.

• Modularity: Adapters and prefix modules can be swapped in or out for different
tasks, making it easier to maintain multiple domain-specific fine-tunings.

• Performance-trade-offs: While often competitive with full fine-tuning, performance


might slightly lag if the downstream domain deviates significantly from the pretrain-
ing distribution.

Overall, parameter-efficient methods have become a standard practice for adapting large
decoder transformers (e.g., GPT-3.5, Llama) to specialized tasks without incurring the
enormous cost of training all parameters end to end.

10.6.6 Prompting
Prompting has emerged as a powerful way to harness decoding auto-regressive trans-
former models. In essence, prompting leverages the fact that a LLM generates tokens
conditionally on previously observed (or generated) tokens. By carefully selecting an ini-
tial sequence of tokens (the prompt), users can steer the model to perform specific tasks
or produce more detailed and context-aligned outputs. This framework of “prompt-
and-generate” has dramatically extended the applicability of LLMs to tasks including
question answering, summarization, and domain-specific reasoning; see Brown et al. [32].
Although we attempt to provide a mathematical description here, it is important to note
that prompting can be very heuristic in practice and that many ideas in prompting LLMs
have been discovered in a purely empirical manner.


Mathematical basis of prompting

Let X1:T = (X1, X2, . . . , XT) denote a (partially) observed sequence of tokens, which we can split into a prompt portion

p = X1:k = (X1, . . . , Xk),

and a continuation portion

c = Xk+1:T = (Xk+1, . . . , XT),

for some 1 < k < T. A LLM defines the probability distribution

pϑ(X1:T) = ∏_{t=1}^{T} pϑ(Xt | X1:t−1).

In the prompting paradigm, we treat p as given and consider the conditional distribution of the continuation

pϑ(c | p) = ∏_{t=k+1}^{T} pϑ(Xt | X1:k, Xk+1, . . . , Xt−1).

Thus, the prompt p is a partial sequence or context that conditions the generation of the
continuation c. By altering the structure, style, or content of the prompt, we can influence
the model’s output distribution without modifying any model parameters! Of course, we
need a very large model to be able to produce these highly conditional distributions over
outputs, but with the large foundation models we have been discussing, the required
conditions are in place for this to be successful.
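The toy Python sketch below illustrates this conditioning mechanism with a hand-made bigram "language model": its transition table (the analogue of the parameters ϑ) is invented purely for illustration and never changes, yet different prompts p lead to different continuations c. A real LLM conditions on the entire context X1:t−1 via a transformer rather than only on the previous token.

import random

# Toy next-token distribution p(x_t | x_{t-1}); the table is invented for illustration.
next_token = {
    "the": {"premium": 0.6, "claim": 0.4},
    "premium": {"increases": 0.7, "is": 0.3},
    "claim": {"is": 0.8, "increases": 0.2},
    "increases": {"<eos>": 1.0},
    "is": {"reported": 1.0},
    "reported": {"<eos>": 1.0},
}

def generate(prompt, max_len=6, seed=0):
    """Auto-regressively extend the prompt; the 'model' (the table) is never modified."""
    random.seed(seed)
    tokens = list(prompt)
    while len(tokens) < max_len and tokens[-1] in next_token:
        probs = next_token[tokens[-1]]
        tokens.append(random.choices(list(probs), weights=list(probs.values()))[0])
        if tokens[-1] == "<eos>":
            break
    return tokens

print(generate(["the", "premium"]))   # the prompt steers the continuation ...
print(generate(["the", "claim"]))     # ... while the model parameters stay fixed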

Prompting in practice

Instruction-based and zero/few-shot prompting. One strategy is to provide an


instruction prompt describing the desired task (e.g., “Translate the following sentence to
French:”). The model then generates a sequence that completes the task. In few-shot
prompting, one appends one or more input-output examples (mini “demonstrations”) to
guide the model’s behavior:

Example 1: [Input] → [Output] (demonstration), . . . , Example n: [Input] → [Output] (demonstration), New Input: (query)

This sequence forms the prompt p. The model then predicts the continuation c, expected
to mirror the pattern shown by the demonstrations.

Task-oriented prompts. Prompts can also be task-specific, containing relevant do-


main context (e.g., table schemas for SQL queries, or examples of correct predictions for
classification tasks). Crucially, the model remains in its pretrained state - no additional
fine-tuning is performed.
During evaluation, we compare the predicted continuation against ground truth refer-
ences if available, or assess the model’s response quality via metrics such as BLEU (for
translation) or more qualitative judgements; see also the section on LLM as a judge be-
low in Section 10.6.8 and a quantitative approach for assessing LLM outputs in Section
10.6.9.


Chain-of-thought prompting

A notable advancement in prompting is chain-of-thought prompting , see Wei et al. [234],


where the prompt is carefully engineered to induce the model to explain or reason through
intermediate steps before producing a final answer. Rather than prompting the model
for a single final solution, we include a sequence of reasoning tokens - similar to showing
one’s “work” in a math problem. For instance:

• Prompt:

“Q: If a car travels at 60 km/h for 2 hours, how far does it go? Let’s break it down
step by step.”

• Chain-of-thought (model’s intermediate reasoning):

“We know speed = distance / time. If speed is 60 km/h and time is 2 hours, distance
= 60 · 2 = 120 km.”

• Final answer:

“120 km.”

By exposing intermediate steps, chain-of-thought prompts often elicit more accurate and
interpretable responses from the model, especially for multi-step reasoning tasks. This
technique has been shown to improve performance on mathematical reasoning, logical
deduction, and other complex question-answering domains.
Moreover, this approach underlies the next generation of LLMs, which are called reason-
ing models, see Section 10.6.7 below.

Recent research on prompting

Prompting has become a rapidly expanding research area with efforts focusing on:

• Prompt engineering: Systematic methods for designing prompts, including prompt


tuning and auto-prompting, where prompts (or prompt tokens) are learned or
searched automatically rather than hand-crafted. A very nice approach and soft-
ware package for this is DSPy, see Khattab et al. [117]; remarkably, optimal prompts
can greatly increase the accuracy of LLM on certain tasks!

• Interpretability and reliability: Investigating how prompts can reveal model reason-
ing, help identify hallucinations, or mitigate undesired behaviors. Chain-of-thought
prompting is one approach that aims to surface model reasoning steps.

• Large-scale benchmarks: Evaluations on broad sets of tasks (e.g., MMLU, Big-


Bench) to measure how well prompting generalizes without further training.

Importantly, prompting interacts strongly with model size: large-scale LLMs often ex-
hibit emergent few-shot and reasoning capabilities that smaller models lack. As a result,
prompting-based methods have become the de facto approach for eliciting complex be-
haviors from LLMs with minimal overhead.


10.6.7 Building and refining reasoning models for LLMs


Very recent advances in LLMs extend these models not only to perform generative mod-
eling of text, but exploit the structure of decoder transformers to mimic the reasoning
process that a person might follow when solving a highly complex problem. Among
these models, the O1 and O3 series of models from OpenAI and the recent DeepSeek R1
model [48] have created a significant amount of interest publicly and within the machine
learning community. In this section, we follow very closely the key ideas presented in a
recent blog post by Raschka [182], who provides an excellent overview of methods and
strategies for enhancing LLMs with reasoning capabilities. Unfortunately, not all of the
research underlying reasoning models is in the public domain, so some “detective” work
needs to be done to understand what methodologies have been applied to create these
models. The main exception to this is the DeepSeek paper [48].

Overview and motivation

Reasoning-centric LLMs are specialized models designed to handle complex, multi-step


tasks - such as challenging math problems, puzzles, or coding scenarios - by explicitly
incorporating intermediate thought processes into their responses. Although many stan-
dard LLMs can solve simple factual queries (e.g., “What is the capital of France?”),
reasoning models aim to excel at questions like “A train travels at 60 mph for 3 hours;
how far does it go?”, where multiple steps (e.g., relating speed, time, and distance) are
required. While existing large models are already somewhat capable in this domain, spe-
cialized reasoning models often include an intermediate chain-of-thought, enabling them
to systematically break down and solve intricate problems. Whereas chain-of-thought
prompting was covered in Section 10.6.6, the main idea here is not to induce
reasoning into LLMs through prompts, but rather to use reinforcement learning or other
approaches to encourage LLMs automatically to produce intermediate chain-of-thought
outputs.
Reasoning models are particularly useful for tasks requiring elaborate, multi-step logic or
complex structure (e.g., advanced mathematics, riddles, or step-by-step code generation).
In contrast, simpler tasks - like summarization, translation, or basic knowledge-based
question answering - can often be performed by standard decoder transformer LLMs
without additional “reasoning” overhead. Indeed, using a specialized reasoning model
for every request can be inefficient and may introduce unnecessary verbosity or cost.
The choice of model should therefore be guided by the complexity of the query at hand.
An illustrative example of reasoning-focused LLM development is provided by the project
DeepSeek R1 [48], which modifies the usual LLM training approach as follows:

• DeepSeek-R1-Zero: A base pre-trained model (DeepSeek-V3) is further trained


with pure reinforcement learning (RL) on tasks requiring multi-step solutions.
Surprisingly, the model spontaneously begins to produce chain-of-thought-like re-
sponses, showcasing that reasoning can emerge purely from reinforcement learning
under the right reward design.

• DeepSeek-R1: Next, this “cold-start” reinforcement learning model is refined


using a combination of supervised fine-tuning (SFT) and reinforcement learning,


akin to RLHF approaches. Additional data is collected from the model itself,
enabling multi-stage improvement (SFT → RL → more SFT → more RL). This
yields a more robust reasoning LLM.

• DeepSeek-R1-Distill: Finally, the project employs a distillation-based approach


to transfer reasoning capabilities into smaller models (e.g., Llama 8B or Qwen 30B).
Though these distilled models do not match the full power of DeepSeek-R1, they
exhibit surprisingly strong reasoning for their size.

This pipeline highlights a broader theme: many state-of-the-art reasoning LLMs rely on
a combination of reinforcement learning, supervised instruction tuning (especially with
chain-of-thought data), and occasional distillation to smaller architectures.

Four main approaches to reasoning LLMs

Four general strategies for building or improving reasoning models have emerged:

(1) Inference-time scaling. Inference-time scaling involves no additional training but


expends more computational resources or tokens at inference to achieve better outputs.
Examples include:

• Chain-of-thought (CoT) prompting: Encouraging step-by-step generation simply by


appending instructions like “Think step by step.” This can lead to more complete
intermediate solutions for complex problems.

• Beam search, voting, or search-based methods: Generating multiple candidate so-


lutions and then selecting the best via majority voting or a learned “judge.” These
methods often improve accuracy on difficult queries.

While effective, inference-time scaling can be costly if used indiscriminately.

(2) Pure Reinforcement Learning. DeepSeek-R1-Zero is a prime example of pure


reinforcement learning producing emergent reasoning behavior. In that setup, the base
model is optimized via accuracy and format rewards, with no intermediate supervised
fine-tuning. The former reinforcement learning reward is used when the generated outputs
can be verified empirically, for example, by running code examples produced by the
LLM and assessing the outputs against known answers. The latter
reward is derived using a LLM as a judge to ensure that the reasoning tokens are cor-
rectly formatted. Despite the lack of explicit chain-of-thought demonstrations, the model
gradually learns to present multi-step reasoning in its answers, suggesting that certain
reward signals can implicitly nudge the model toward generating structured, step-by-step
solutions.

(3) SFT + RL (typical RLHF). Most top-performing reasoning LLMs (e.g., fi-
nal DeepSeek-R1 or rumored pipelines for the O1 and O3 models from OpenAI) blend
supervised fine-tuning (SFT) with reinforcement learning (RL):


• SFT stage: Collect chain-of-thought or instruction data from either humans or the
model itself (“cold-start” generation). Train the LLM to follow these instructions
or produce step-by-step solutions.

• RL stage: Further refine the model with preference-based or rule-based rewards


(e.g., verified answers, consistent format).

• Iterate: Additional SFT data can be created using the latest model checkpoint,
forming a virtuous cycle of improvement.

This approach generally outperforms pure reinforcement learning, especially for large-
scale deployments, and is favored in contemporary reasoning LLM research.

(4) Pure SFT and distillation. Finally, distillation from a larger reasoning model
can be an easier method to produce smaller reasoning LLMs:
• SFT data generation: A larger teacher model (e.g., DeepSeek-R1) generates high-
quality chain-of-thought or instruction examples.

• Distillation: A smaller student model is fine-tuned on this curated dataset. The


student often inherits some reasoning abilities from the teacher, though, typically
at lower performance ceilings.
Distillation can drastically reduce inference costs and hardware requirements, making
advanced reasoning accessible to researchers without multi-million-dollar budgets.

Practical considerations and low-budget extensions

Advanced reasoning LLMs like DeepSeek-R1 or OpenAI’s “O1” models demand substan-
tial compute resources. However, projects like TinyZero [174] and Sky-T1 [168] show
that interesting progress is possible on smaller scales. For instance:

• TinyZero (3B params) attempts a pure reinforcement learning approach anal-


ogous to DeepSeek-R1-Zero, revealing emergent self-verification steps even at a
modest parameter count.

• Sky-T1 (32B params) employs a low-budget distillation or SFT strategy, re-


portedly achieving performance near certain proprietary models at a fraction of
the cost.

Such efforts underline the practicality of targeted or lower-scale fine-tuning for specialized
tasks and domain constraints.

10.6.8 LLMs as a judge and self-consistency mechanisms


Beyond generating text, LLMs can be employed in a referee or judge capacity, evalu-
ating their own outputs or the outputs of other models; see Bai et al. [10]. In this
paradigm, the LLM critiques and scores a response - often employing chain-of-thought-
like reasoning internally - to indicate correctness, clarity, or other quality measures. Such
“self-evaluation” or “self-consistency” strategies may reduce error rates and help identify
potential reasoning flaws.


Mathematical formulation of self-consistency

Consider a query (or prompt) Q. A single LLM can generate multiple candidate responses

R = {R1 , R2 , . . . , RK }.

At generation time, we obtain each Rk, for 1 ≤ k ≤ K, auto-regressively from the model’s next-token distribution:

pϑ(Rk | Q) = ∏_{t=1}^{Tk} pϑ(rk,t | Q, rk,1, . . . , rk,t−1),

where rk,t denotes the t-th token in candidate Rk , and Tk is the length of that candidate.
Next, the judge component (which may be the same LLM configured in “critique” mode,
or a separate model) provides a quality score or utility Jϕ (Rk , Q) for each candidate,
1 ≤ k ≤ K, reflecting the likelihood of correctness, alignment, or other criteria; in
this case ϕ represents the parameters of the judge LLM, which can be the same as the
parameter set ϑ if the same LLM is being used as the one used for generation. A simple
self-consistency mechanism then selects the final response R̂ as

R̂ = arg max_{R∈R} Jϕ(R, Q).

This max-sampling approach chooses the highest scoring solution.
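A minimal Python sketch of this selection mechanism is given below; the functions generate_candidate and judge_score are hypothetical placeholders that would, in practice, call a LLM in generation mode and in critique mode, respectively.

# Sketch of self-consistency / LLM-as-a-judge selection (placeholder functions, not a real API).

def generate_candidate(query, k):
    # Stand-in for auto-regressive sampling; each call would normally return a different draft.
    return f"draft {k}: " + "reasoning step; " * (k + 1) + f"answer to '{query}'"

def judge_score(response, query):
    # Stand-in for J_phi(R, Q); a real judge would be a LLM scoring correctness or alignment.
    return float(len(response) % 10)

def self_consistent_answer(query, K=5):
    candidates = [generate_candidate(query, k) for k in range(K)]
    scores = [judge_score(r, query) for r in candidates]
    best = max(range(K), key=lambda k: scores[k])    # arg max over the K candidates
    return candidates[best], scores

answer, scores = self_consistent_answer("How is the claims reserve computed?")
print(scores)
print(answer)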

Practical implementation and research directions

In practice, a single LLM instance might be used to:

• Propose multiple candidate responses R.

• Judge or rank these responses based on correctness or alignment criteria, effectively
instantiating R ∈ R ↦ Jϕ(R, Q).

• Select the best response (or revise weaker ones).

This dual role (generator + judge) has inspired research on iterated refinement, self-
consistency decoding, and constitutional AI, whereby models use rules or guidelines to
critique their outputs. For instance, a chain-of-thought can be included in the judging
mechanism to facilitate more nuanced evaluations, e.g., verifying steps in a math proof.
By iterating this process - sampling multiple answers, critiquing them, and selecting or
refining the best - it is often possible to reduce error rates and highlight reasoning flaws
that might otherwise go unnoticed in a single pass. This constitutes a “self-consistency”
or “self-evaluation” loop, potentially improving the reliability of LLM-generated content
without additional external supervision.
Overall, using LLMs in a judge capacity exemplifies how generative models can be ex-
tended beyond pure text generation to include meta-level reasoning about their own out-
puts. Such techniques dovetail with alignment strategies (Section 10.6.4) and advanced
prompting methodologies (Section 10.6.6), contributing to a growing toolbox for building
and refining LLMs. In Section 10.6.7 we discussed similar ideas that now underlie state
of the art LLMs.


10.6.9 Responsible use of large language models


Here we provide an overview of how LLMs can be used responsibly, basing this section
mainly on van der Merwe–Richman [224]. The main idea is to use LLMs to output not
only text data, but also quantitative metrics; for example, a LLM can evaluate how well
a candidate’s curriculum vitae matches a job specification and output a score between 0
and 10. Once a quantitative output is available, the usual actuarial and statistical
processes used for model validation can be applied.

Overview and governance

Working with LLMs requires a well-defined governance framework to ensure ethical, trans-
parent, and compliant usage; see van der Merwe–Richman [224]. Such a framework typ-
ically includes:

• Accountability and ethics: Establishing clear responsibility for model outcomes


and aligning generated content with relevant industry regulations or guidelines.

• Monitoring and controls: Continuously reviewing data pipelines and model


performance, including adherence to data-privacy standards and detection of drift
in model outputs.

• Escalation processes: Providing clear pathways for human intervention (human-


in-the-loop) when unusual or high-stakes scenarios arise.

These governance pillars lay the foundation for subsequent steps in model selection,
performance evaluation, and long-term monitoring.

Model selection and quantitative scoring

When using a LLM for a task, where relevant, actuaries and data scientists should
strongly consider requiring the LLM to output a numerical score for its predictions.
Let X ∈ 𝒳 be an input (such as a set of claims documents), and let the LLM’s decision
function

f : 𝒳 → ℝ,
produce a numerical output f (X). Interpreting f (X) as a probability, confidence level,
or rating on a defined scale, we might write

f (X) = P(claim is fraudulent | X) or f (X) = rating ∈ [0, 10].

By capturing the LLM’s beliefs in a single numeric measure, quantitative validations


become directly applicable:

• Statistical analysis of scores: Means, variances, hypothesis testing, and time-series


control charts can reveal shifts or unusual patterns.

• Machine learning metrics: Standard metrics like strictly consistent loss functions,
e.g., mean squared error, or Gini or F1-scores enable direct benchmarking across
different prompts or data conditions.


Moreover, assigning a numerical score at each step helps mitigate hallucinations: once
the model is required to quantify its certainty, stakeholders can explicitly detect outlier
predictions (e.g., extremely high scores for dubious responses) and investigate them using
structured review or escalation processes. Having a set of quantitative scores and analysis
for several different LLMs can aid in the process of selecting the most relevant model for
a task.
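As a simple illustration, the sketch below applies a basic statistical screen to a set of simulated LLM scores on an assumed 0 to 10 scale; the data and the 3-sigma rule are illustrative choices, and flagged cases would be escalated to human review under the governance framework described above.

import numpy as np

rng = np.random.default_rng(1)
scores = np.clip(rng.normal(6.0, 1.2, size=200), 0, 10)   # simulated LLM scores on a 0-10 scale
scores[:3] = [0.1, 9.9, 9.8]                               # a few suspicious outliers (illustrative)

mean, std = scores.mean(), scores.std()
flag = np.abs(scores - mean) > 3 * std                     # simple 3-sigma screen; other rules possible
print(f"mean = {mean:.2f}, std = {std:.2f}, flagged for review: {flag.sum()} of {len(scores)}")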

Performance metrics

Once the model produces numeric scores, a variety of performance measures become
straightforward to implement:

• NLP/LLM metrics: Evaluate textual quality via ROUGE or BLEU, while exam-
ining numeric agreement with human-labeled data (e.g., classification accuracy).

• Alignment measures: Compare the LLM’s proposed chain-of-thought and final nu-
meric scores against expert judgements or known standards.

• Stability and Sensitivity: Assess the model’s outputs across varying prompt word-
ings or reordered inputs, checking for robustness in both text and numerical pre-
dictions.

Robustness, bias and interpretability

Robustness. Testing the model under adversarial prompts or demographic shifts en-
sures its numeric output remains stable. For instance, if f (X) changes dramatically under
minor prompt modifications, further prompt engineering or data augmentation may be
warranted.

Bias and fairness. Let

G ∈ {g1, g2, . . .}

represent demographic group membership, and

Y ∈ {0, 1}

be the (binary) label of interest. The model’s estimated probability

p̂(Y = 1 | X)

is tracked across these demographic subgroups to detect systematic disparities. Because


f (X) is explicitly numeric, standard fairness metrics (e.g., demographic parity, equal
opportunity) can easily be computed.
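For example, a demographic parity check compares the average predicted probability across groups; the sketch below uses simulated scores and group labels purely for illustration.

import numpy as np

rng = np.random.default_rng(2)
groups = rng.choice(["g1", "g2"], size=1000)                 # simulated group membership G
p_hat = np.clip(rng.beta(2, 5, size=1000)
                + 0.05 * (groups == "g2"), 0, 1)             # simulated probabilities p_hat(Y=1|X)

rates = {g: p_hat[groups == g].mean() for g in ["g1", "g2"]}
print(rates)
print("demographic parity difference:", round(abs(rates["g1"] - rates["g2"]), 3))
# A large difference would trigger a deeper fairness analysis (e.g., equal opportunity on labeled data).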

Interpretability. Requiring the LLM to provide the chain-of-thought text alongside


the numeric score can assist in diagnosing errors and identifying the sources of risk or
ambiguity. Embeddings and dimensionality reduction can further reveal latent clusters
in the data, creating a rich environment for deeper model explanations.


Monitoring and ongoing controls

After deployment, continuous monitoring should track:

• Data inputs: Ensuring that new data remain consistent with original training or
fine-tuning assumptions.

• Performance metrics: Watching for significant shifts in numeric predictions or text


outputs, indicative of concept drift.

Regular review of numeric scoring trends - particularly areas with unexpectedly high
or low scores - can reveal potential hallucinations or systematic biases early, prompt-
ing timely interventions. Governance bodies can decide on revised thresholds, updated
prompts, or further training if required.

Conclusion

Imposing a numeric score on every LLM decision links the model’s text generation to
rigorous statistical validation, a hallmark of actuarial and data science practice. By in-
tegrating such quantitative outputs with robust governance, performance tracking, bias
assessments, and human oversight, actuaries can more confidently deploy LLM-based
solutions in high-stakes or regulated environments. These protocols do not merely en-
hance transparency - they offer a meaningful safeguard against hallucinations by flagging
uncertain or outlier score values for deeper review.

10.6.10 Sparse auto-encoders and mechanistic interpretability


LLMs predominantly employ decoder transformer architectures (Section 10.4); however,
these models are quite difficult to interpret, thus, it is hard to understand how the mod-
els have produced outputs - which may be used in sensitive downstream processes. Re-
cently, there has been renewed interest in sparse auto-encoders and other dimensionality-
reduction techniques for understanding or probing the internal representations learned
by these massive networks. Building on the discussion of auto-encoders in Section 10.2.1,
sparse variants of auto-encoders impose additional constraints to reveal interpretable,
low-dimensional structures. In parallel, mechanistic interpretability research aims to
reverse-engineer the computations of large models by inspecting how individual param-
eters, neurons, or attention heads contribute to emergent language capabilities.

Sparse auto-encoders in the context of LLMs

Recall that an auto-encoder seeks to learn a bottleneck representation of input data,


see Section 9.2.2. Sparse auto-encoders add a sparsity constraint on the hidden layer
activations. Formally, if Z = Φ(X) is an m-dimensional hidden representation of input
X, sparsity can be encouraged by adding a regularization term such as
R(Z) = η ∑_{j=1}^{m} ϕ(Zj),


where ϕ is a sparsity-inducing function, e.g., the L1 -norm or the KL divergence from


a low target activation; see Section 2.4. By pushing many Zj values towards zero, the
hidden layer tends to learn disentangled or factorized features.
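The following numpy sketch spells out this penalized objective for a single (untrained) encoder-decoder pair with an L1 sparsity term; in an actual application, the encoder and decoder weights would be trained on, e.g., transformer activations, and all sizes and the weight η below are assumed purely for illustration.

import numpy as np

rng = np.random.default_rng(3)
n, d, m, eta = 256, 50, 200, 0.1             # samples, input dim, (wide) hidden dim, sparsity weight

X = rng.normal(size=(n, d))                  # stand-in for LLM activations
W_enc = rng.normal(size=(d, m)) * 0.1        # encoder and decoder weights (untrained, illustrative)
W_dec = rng.normal(size=(m, d)) * 0.1

Z = np.maximum(X @ W_enc, 0.0)               # hidden representation Z = Phi(X), ReLU encoder
X_rec = Z @ W_dec                            # reconstruction

recon_loss = np.mean((X - X_rec) ** 2)       # reconstruction error
sparsity_pen = eta * np.mean(np.abs(Z))      # R(Z) with phi = L1-norm, averaged over samples
print(round(recon_loss, 3), round(sparsity_pen, 3), "-> total:", round(recon_loss + sparsity_pen, 3))
# Training would minimize this total loss; the L1 term pushes many Z_j towards zero.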
In the context of LLM embeddings (e.g., word embeddings, hidden states of transformers),
sparse auto-encoders can serve as a diagnostic or exploratory tool:

• Representation analysis: By fitting a sparse auto-encoder to hidden activations


from a LLM, we may uncover latent components that correspond to semantic or
syntactic features of language. This can shed light on how the network distributes
information across its many dimensions. Often, the link between the learned sparse
representations and the natural language concepts they represent might itself be
made with LLMs; see Cunningham et al. [46].

• Dimensionality reduction for visualization: Sparse auto-encoders provide a con-


trollable way to project high-dimensional LLM embeddings (often thousands of
dimensions) down to a smaller but still interpretable sub-space, potentially aiding
in model debugging or interpretability tasks.

• Model compression: In some cases, the sparse representation learned via auto-
encoding can hint at parameter-efficient strategies to prune or quantize model
weights.

Although LLMs themselves typically rely on dense transformer layers, the application
of sparse auto-encoders to LLM-generated activations or embeddings is increasingly ex-
plored in post-hoc interpretability settings, aiming to identify stable, low-dimensional
factors that underlie the rich behaviors observed in large-scale generation.

Mechanistic interpretability

Mechanistic interpretability seeks to uncover how LLMs perform computations at a fine-


grained, algorithmic level. Rather than treating a LLM as a monolithic black box, re-
searchers strive to identify and interpret the roles of individual components, such as
attention heads, FNN layers and neurons, in linguistic processing; see Olah et al. [170].
This line of inquiry has gained momentum as models grow larger, prompting deeper
examinations into how these intricate architectures handle tasks ranging from syntactic
parsing to semantic understanding.
Recent advances in mechanistic interpretability have employed a variety of techniques.
One approach, sometimes referred to as activation patching or ablation, involves system-
atically intervening in model activations while the network processes specific inputs; see
Nanda et al. [163]. By replacing or removing particular activation vectors - often in se-
lective layers - researchers can observe how these modifications alter the model’s outputs.
If performance on a certain task degrades significantly after a specific intervention, that
suggests the intervened neurons or layers play a causal role. In parallel, tracing infor-
mation flow across a transformer stack has illuminated how attention heads and FNN
layers pass signals that reflect long-range context, anaphora resolution, and even forms
of logical inference; see Olah et al. [171]. Another promising direction involves neuro-
symbolic probing, where tasks derived from symbolic or logic-based paradigms are used to


test whether sub-circuits within the network encode these structures internally; see Cao
et al. [39]. For example, if a network reliably identifies symbolic patterns in its hidden
representations, it indicates emergent, structured computations within the parameters.

Synergies with sparse auto-encoders

The synergy between mechanistic interpretability and sparse auto-encoders arises from
their shared focus on identifying meaningful structure within high-dimensional repre-
sentations. Sparse auto-encoders, by design, encourage models to compress data into
a limited set of activation units, thereby shedding light on which dimensions are most
critical for a given task; see Makhzani–Frey [147]. When examining LLM activations, if a
sparse auto-encoder robustly encodes certain linguistic features in a small subset of neu-
rons, this subset may correspond to functionally relevant circuits in the original model.
Interventions such as ablation or activation patching can then be more narrowly targeted,
allowing researchers to focus on the most influential dimensions within the network.
Enforcing sparsity is particularly helpful when trying to localize features to individual
neurons or small neuron clusters. For instance, if a sparse auto-encoder bottleneck consis-
tently highlights dimensions tied to tense or sentiment, these dimensions become natural
entry points for deeper mechanistic analysis. Researchers can then ablate or patch only
those critical neurons to measure how the LLM’s behavior changes, shedding light on
where and how key linguistic functions are implemented.

Discussion and future directions

In summary, combining auto-encoder-based methods with mechanistic interpretability


provides a powerful toolkit for unearthing how LLMs encode and transform informa-
tion; Hinton–Salakhutdinov [97]. By extracting compact, interpretable representations
of activations, sparse auto-encoders reveal hidden structures that might otherwise re-
main concealed in high-dimensional parameter spaces. Mechanistic interpretability goes
a step further by attempting to reverse-engineer the computations behind these struc-
tures. Through selective analysis of sub-circuits, neurons, or attention heads, it becomes
possible to illuminate how tasks such as coreference resolution, numeric reasoning, and
code generation are managed internally.
Looking ahead, deeper integration of these research directions may inspire new architec-
tural designs or training paradigms that prioritize intrinsic interpretability. For instance,
networks could incorporate sparsity constraints during training, encouraging representa-
tional clarity from the outset. Circuit-level regularization might also become a standard
practice, with the goal of engineering LLMs whose inner operations are more transparent
and less prone to failure modes that are difficult to detect. As models scale toward tril-
lions of parameters, understanding these internal processes will likely become increasingly
critical for ensuring robustness, safety, and alignment with human values; Bender–Koller
[18].


10.6.11 Summary
LLMs represent a paradigm shift in generative modeling, wherein a single pretrained
neural network can address myriad tasks with minimal additional training. By incorpo-
rating human feedback mechanisms (RLHF), employing specialized prompts, or enabling
self-consistency checks, LLMs can produce high-quality, context-aware, and interpretable
outputs. Nonetheless, significant challenges remain, including controlling unwanted bi-
ases, ensuring factual accuracy, and addressing potential misuse. Ongoing research in
foundation models, fine-tuning protocols, and advanced prompting strategies continues
to shape the evolving landscape of LLMs.

10.7 Conclusion: From empirical scaling to broad AI


In this chapter, we have seen how LLMs - trained at unprecedented scales on diverse,
high-dimensional data - build upon core ideas in empirical modeling that are intimately
familiar to the actuarial profession, and that we have elucidated in these notes. The core
principle of collecting data, specifying a flexible model architecture, and adjusting param-
eters to fit observations is the same principle that underlies many actuarial techniques,
although these techniques are typically applied at a (much) smaller scale.
Yet, as we increase the amount of data and computational resources, we find that LLMs,
propelled by auto-regressive transformers and alignment protocols such as RLHF, can
begin to exhibit emergent capabilities. No longer do these models merely perform spe-
cialized tasks like text classification or simple regressions; instead, they start to display
a more generalized form of intelligence - able to write code, solve multi-step reason-
ing problems, and even critique or refine their own outputs. Powering these models to
the next stage of usefulness are techniques to cause these models to “think” more and
approximate some level of reasoning.
This scaling-up process, from small datasets to massive corpora, parallels the journey of
actuarial science itself. In actuarial work, we often rely on large empirical datasets - claims
histories, market data, demographic projections - and calibrate models to approximate
real-world behavior. As data volumes and computational budgets have expanded, so too
has the potential for sophisticated machine learning methods to deliver richer insights
with less human-crafted structure.
Indeed, this evolution suggests that the classical actuarial approach - balancing rigorous
statistical methods applied to empirical data with real-world pragmatism - that has now
been adopted more widely in machine learning and artificial intelligence may serve as a
foundation for creating even more advanced AI systems. Although true general artificial
intelligence remains an open challenge, LLMs have brought us closer to models capable
of multi-domain reasoning and adaptive problem-solving. Building on these advances
and refining the models explored above on expanding datasets may lead, step by step,
toward increasingly general and powerful AI capabilities.



Chapter 11

Reinforcement learning

11.1 Introduction
Reinforcement learning is a fast evolving and exciting field on optimal decision making in
a dynamic environment. In particular, reinforcement learning is one of the key techniques
behind reasoning which is a crucial feature of modern generative AI models, e.g., used in
solving mathematical problems.
To give full consideration to reinforcement learning, we would need to write an entire
book. Our aim here is to give a short introduction to reinforcement learning and discuss
what kind of problems can be studied by this technology. The reader should be aware
of the fact that we only present the simplest problems and their solutions, but the
latest technology is by far more developed; classical references on reinforcement learning
are Sutton–Barto [212] and Murphy [162], the material presented in this section is largely
taken from these two references. For an explicit actuarial example considering an optimal
pricing problem, see Palmborg–Lindskog [173]. This paper presents a non-life insurance
premium control problem that seeks an optimal premium rule, aiming to maximize
profits while complying with solvency regulation and taking customers’ price sensitivities into account.
Such a multi-objective premium control problem can be solved dynamically by learning
how specific actions contribute to a total reward with the aim of maximizing this total
reward. This is the general philosophy in reinforcement learning.
Before starting, we would like to mention that the reinforcement learning community
is somewhat disjoint from the classical machine learning community and also from the
statistical community. We emphasize this because terminology can be quite different in
these communities, e.g., bootstrapping can mean rather different things depending on the
specific community; as a consequence, there may be some inconsistencies in this section
compared to earlier chapters.
Generally speaking, in predictive modeling a decision maker tries to make accurate fore-
casts, and to evaluate the accuracy of her/his forecasts, the decision maker receives the
correct answer at a later stage. As an example, we forecast an insurance claim at the
beginning of the period, and by the end of the period we know the true claim incurred.
In contrast, in reinforcement learning a decision maker takes actions which are rewarded.
However, there is no right or wrong answer that can be revealed to the decision maker,
she/he only gets feedback in terms of a bigger or smaller reward, and at the same time,


she/he does not have the possibility to exploit all potential actions and their resulting
rewards. For example, an insurer (decision maker) can either increase or lower the in-
surance premium, and by the end of the period the insurer gets a reward in terms of the
total premium earned (based on the assumption that the customers will only sign new
contracts up to their price tolerance levels), but there is no possibility for the insurer
to simultaneously test different pricing strategies before exercising one of them. Using
reinforcement learning, the insurer can continuously (online) learn to improve its decision
making strategy by learning from the feedback received.

11.2 Multi-armed bandit problem

The classical multi-armed bandit problem gives a fairly good introduction and overview
of the field of reinforcement learning. That is why we start with this classical example
(which is not directly related to insurance).
Assume that a gambler has the option to play on k ≥ 2 different slot machines (one-armed
bandits), where each of the slot machines has a different random payout. Naturally, the
gambler’s goal is to maximize her/his gain, and she/he selects the slot machine that
she/he believes will hit the jackpot in the next round.

In this game the gambler can exploit the slot machine that she/he believes has the
biggest payout, but at the same time it may also be worthwhile to explore the other
k − 1 slot machines because one of them could even be better.

We formalize this. We work in discrete time t ∈ N0 . In each round t ≥ 0 of the game,


the gambler can select an action At ∈ A from the action space A. In the multi-armed
bandit problem the action space refers to the k available slot machines, A = {1, . . . , k}.
The taken action At provides a real-valued (random) reward Rt+1 from which we aim at
learning the optimal action, called control. The (true) action-value is defined by

q(a) = E [ Rt+1 | At = a] , (11.1)

for a ∈ A. This is called the true action-value because it uses the true reward mechanism;
this is similar to the true regression function in (1.2).
If we knew the true action-value function a 7→ q(a), we would simply maximize this
function over a ∈ A, to obtain the maximal (expected) reward. Typically, the true action-
value function is unknown to us because we do not know the precise reward mechanisms
of the different slot machines. A general way to solve this problem is to try to learn this
action-value function by exploring and exploiting the slot machines over several rounds
of the game. This gives us estimates (qbt (a))a∈A in every round t ≥ 0 for the true action-
value function q. We use these estimates (qbt (a))a∈A for the next round of the game. They
are then updated according to the received reward Rt+1 in this next round. Thus, the
reward Rt+1 is a feedback on how our specific action At has performed.


For given estimates (qbt (a))a∈A at time t ≥ 0, the greedy action is the one with the
(immediate) highest action-value estimate

arg max_{a∈A} qbt (a).

If we select the greedy action, we exploit our current knowledge around the maximum of
the estimated action-value function a 7→ qbt (a). But we can also select a non-greedy action
by exploring whether we can improve our estimates (qbt (a))a∈A by selecting a different
slot machine. Exploiting is the (estimated) optimal one-step ahead strategy but it is not
necessarily optimal for multiple steps ahead (in the long run). This is precisely the trade-
off between exploiting and exploring which may have a sophisticated interrelationship.
To better understand this reinforcement learning mechanism, it is useful to give an ex-
plicit numerical example. We present the example of Sutton–Barto [212, Section 2.3].

Example 11.1. We choose k = 10 one-armed bandits and we select their true action-
values (q(a))_{a=1}^{10} by simulating them from independent standard Gaussian distributions.
These true action-values (q(a))_{a=1}^{10} are kept fixed, and they are unknown to
the gambler. Figure 11.1 illustrates these true action-values; the most promising slot
machine is number a = 4, closely followed by slot machine number a = 8.

Figure 11.1: k-armed bandit true action-values (q(a))_{a=1}^{k} , k = 10.

The reward Rt+1 on slot machine a ∈ A in period t ≥ 0 is determined by

Rt+1 |At =a ∼ N (q(a), 1) ,

and we assume a Markov property, meaning that this reward Rt+1 only depends on the
last selected action At , and not on any information prior to time period t.
The most natural and simple action-value estimate is given by computing the empirical
average rewards on each slot machine a ∈ A

qbt (a) = ( Σ_{s=0}^{t−1} Rs+1 1{As =a} ) / ( Σ_{s=0}^{t−1} 1{As =a} ),


and we set it to a default value for actions a ∈ A without any observations.
The law of large numbers tells us that qbt (a) → q(a), a.s., for t → ∞, provided that a is
selected infinitely often in these trials. At time t ≥ 0, the next greedy action is given by
At = arg max qbt (a),
a∈A
with a deterministic rule if there is more than one solution to this maximization problem.
This greedy step exploits the estimated optimal slot machine at time t ≥ 0. To also
explore the other slot machines, we insert random non-greedy steps, by using a so-called
ε-greedy strategy. Select ε ∈ (0, 1) and sample i.i.d. Bernoulli random variables Bt , t ≥ 0,
being independent of everything else and taking the value one with probability ε.
The ε-greedy action at time t ≥ 0 is given by

At = { arg max_{a∈A} qbt (a),   if Bt = 0,
     { Ut ,                     if Bt = 1,          (11.2)

where Ut is uniform on A and independent of everything else.


Thus, with probability 1 − (ε − ε/k) we exploit and with probability ε − ε/k we explore.
Since this also implies that, a.s., every action a ∈ A is selected infinitely often as t → ∞,
we receive the law of large numbers for all actions a ∈ A, saying that the estimated
action-values qbt (a) converge to the true ones q(a), a.s., as t → ∞.
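
The following minimal code sketch illustrates the ε-greedy strategy (11.2) with empirical average action-value estimates; it is only an illustration, and all names and parameter values (k, epsilon, n_rounds, the seed) are our own hypothetical choices.

import numpy as np

rng = np.random.default_rng(seed=1)
k, epsilon, n_rounds = 10, 0.05, 1000          # number of arms, exploration probability, rounds

q_true = rng.normal(size=k)                    # true action-values q(a), unknown to the gambler
q_hat = np.zeros(k)                            # action-value estimates
counts = np.zeros(k)                           # number of times each arm has been played
rewards = []

for t in range(n_rounds):
    if rng.uniform() < epsilon:                # explore: choose an arm uniformly at random
        a = int(rng.integers(k))
    else:                                      # exploit: choose the greedy arm
        a = int(np.argmax(q_hat))
    r = rng.normal(loc=q_true[a], scale=1.0)   # reward R_{t+1} ~ N(q(a), 1)
    counts[a] += 1
    q_hat[a] += (r - q_hat[a]) / counts[a]     # empirical average, updated incrementally
    rewards.append(r)

print("estimated action-values:", np.round(q_hat, 2))
print("average reward:", round(float(np.mean(rewards)), 3))

Repeating this simulation for different ε values and averaging over many runs produces graphs of the type shown in Figure 11.2.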

Figure 11.2: (lhs) Development of average rewards R̄t , and (rhs) ratio of selection of the
optimal slot machine for iterations 1 ≤ t ≤ 1000, for ε ∈ {0.01, 0.02, 0.05}.

We implement the k-armed bandit reinforcement learning algorithm (11.2) for different
ε-greedy strategies with ε ∈ {0.01, 0.02, 0.05}. The results are shown in Figure 11.2. The
left-hand side gives the average rewards, under the gambler’s strategy (11.2), defined by

R̄t = (1/t) Σ_{s=0}^{t−1} Rs+1 .
The right-hand side of the figure shows the proportion of the selection of the optimal
slot machine a = 4, i.e., the slot machine a ∈ A that has the highest expected reward


q(a); this optimal value is illustrated by the black horizontal line in Figure 11.2 (lhs). We
observe that for all ε choices the average rewards R̄t approach this optimal value q(a),
a = 4, and there seems to be little difference in their speeds of convergence. On the other
hand, due to the ε-greedy strategy, the average reward will not converge to q(a), but to
a smaller value, because the ε-greedy sampling gives a fixed proportion of non-optimal
slot machine selections. For convergence to q(a) one would also need to let the Bernoulli
probability decay, εt ↓ 0, as t → ∞.
Figure 11.2 (rhs) shows the proportions of the selection of the optimal slot machine a = 4,

(1/t) Σ_{s=0}^{t−1} 1{As =4} .

These quantities converge to 1 − (ε − ε/k), illustrated by the horizontal lines in Figure 11.2
(rhs). In two cases, we have a rather smooth increase of these proportions to their limits,
only the green graph looks a bit surprising. Note that all these graphs depend on the
initialization of the algorithm. We have randomly initialized (qb0 (a))_{a=1}^{k} , and different
seeds provide different results. A random initialization avoids the difficulty of having
multiple maxima in (11.2); if one sets (qb0 (a))_{a=1}^{k} to very large values, this promotes
exploring in the beginning of the algorithm, because every initial reward is likely going
to be a disappointment, moving on to the next slot machine. Let us try to understand
the green graph of Figure 11.2 (rhs).

Figure 11.3: Explicit selections At ∈ A, t = 0, . . . , 1000, of the different slot machines in
the three examples of Figure 11.2 (rhs).

Figure 11.3 shows the selected slot machines (At )_{t=0}^{1000} of the three examples of Figure 11.2
(rhs). We focus on the green graph on the right-hand side. The optimal slot machine is
a = 4, and we observe that in the green case, the algorithm is rather undecided about
the two machines a = 4 and a = 8 in the first iterations of the algorithm. These two
slot machines have quite close long-term rewards q(a), see Figure 11.1. In this example,
only after roughly t = 400 iterations, we have found the best slot machine a = 4. This
explains the green curve in Figure 11.2 (rhs). ■

Conclusion. Example 11.1 gives the basic understanding of reinforcement learning.


Subsequent extensions study more sophisticated dynamic decision making problems, e.g.,
having infinite (continuous) state spaces, having more sophisticated dynamic learning


problems, having side constraints and multiple targets, etc. This will require more so-
phisticated reinforcement learning algorithms, potentially based on approximations where
some of the quantities cannot be computed explicitly, etc. In the remainder of this sec-
tion we will discuss some of these extensions. The main purpose of this discussion is to
make the reader familiar with some of the reinforcement learning concepts and intuition.
Clearly, this discussion should be understood as an introduction, and for more advanced
reinforcement learning technologies, the reader is referred to the specialized literature on
reinforcement learning.

11.3 Incremental learning


Example 11.1 has been implemented brute force, by simply increasing the information
set at each period t ≥ 0 by one unit. To keep the size of the memory constant, one should
use a so-called incremental learning implementation, see Sutton–Barto [212, Section 2.4],
meaning that one should update the action-value estimates recursively
qbt+1 (a) = (1/Nt (a)) Σ_{s=0}^{t} Rs+1 1{As =a} = (1/Nt (a)) ( Rt+1 1{At =a} + Nt−1 (a) qbt (a) )
          = qbt (a) + (1/Nt (a)) (Rt+1 − qbt (a)) 1{At =a} ,

with the counts on each slot machine a ∈ A


Nt (a) = Σ_{s=0}^{t} 1{As =a} .          (11.3)

These updates can be performed on a constant memory size and they essentially use the
Markov property. Beyond that they have a rather interesting structure that is common
to many reinforcement learning algorithms. We can interpret 1/Nt (a) as a learning rate
or a step size parameter.
For suitable learning rates ϱt (a) > 0, we obtain updates

qbt+1 (a) = qbt (a) + ϱt (a) (Rt+1 − qbt (a)) 1{At =a} .

This looks very innocent, but actually this structure is the key to many reinforcement
learning algorithms, namely, it proposes temporal difference learning by incrementally
improving the estimate over time by the new experience

new estimate ← old estimate + ϱ (new experience − old estimate) . (11.4)

It can be interpreted as trying to predict the new experience Rt+1 by the old estimate
qbt (a), and (11.4) can then be seen as the corresponding updating step, trying to improve
the prediction based on the new experience.
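
As a minimal sketch with hypothetical variable names, the following lines verify that the incremental update (11.4) with learning rate 1/n reproduces the running sample mean, without storing the past rewards.

import numpy as np

rng = np.random.default_rng(0)
rewards = rng.normal(loc=0.3, scale=1.0, size=1000)   # a stream of observed rewards

q_hat, n = 0.0, 0
for r in rewards:
    n += 1
    q_hat += (r - q_hat) / n    # new estimate = old estimate + (1/n) * (new experience - old estimate)

assert np.isclose(q_hat, rewards.mean())   # identical to storing all rewards and averaging them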
Summary. The key to reinforcement learning often has its grounds in a temporal differ-
ence learning structure of type (11.4). The multi-armed bandit problem of Example 11.1
has all these features. There are some extensions/changes that we are going to introduce,
below, to make the framework more practical.


• We extend the actions (At )t≥0 and the rewards (Rt )t≥1 by a third sequence, the
states (St )t≥0 . This will allow us to solve more interesting problems.

• We will not directly focus on the rewards (Rt )t≥1 , but rather on the future ex-
pected discounted rewards, called expected gains, for given (initial) state-action
pairs (St , At ). Under an unknown reward mechanism, we need to estimate these
expected gains along the way.

11.4 Tabular learning problems


The above multi-armed bandit problem introduced the essentials of reinforcement learn-
ing. The following sections formalize this process w.r.t. the classical literature in rein-
forcement learning, in particular, we are going to extend the model by a state space; the
following outline is taken from Sutton–Barto [212, Chapter 3].
A Markov decision process (MDP) considers three different sequences, states (St )t≥0
living on a state space S, actions (At )t≥0 living on an action space A, and rewards (Rt )t≥0
living on a reward space R ⊂ R. We initialize R0 = 0; this initial reward is usually not
needed, see Example 11.1, but it allows us to study the triples (Rt , St , At ), for all t ≥ 0.
A MDP then describes the time-series of these triples

R0 , S0 , A0 ; R1 , S1 , A1 ; R2 , S2 , A2 ; . . . , (11.5)

which will additionally obey a Markov property.

Figure 11.4: Environment-agent interaction process (11.5) for t ≥ 0.

A finite MDP has three finite spaces S, A and R. If the state space S and the action
space A are finite, we speak about tabular learning because we can store all potential
state-action pairs (s, a) ∈ S × A in a finite table. This outline mainly focuses on tabular
learning, and possible extensions are only briefly considered in Section 11.8, below.
In this dynamic learning setting, one distinguishes two different cases. One can either
have a continuing task, where (St )t≥0 randomly evolves over the state space S forever, or
an episodic task, which is assumed to terminate and can be restarted in a green field
again. In the latter case, one adds a terminal (absorbing) state to the state space, S † = S ∪{†},
and the game is terminated when St+1 enters the terminal state † (for its first time). This
motivates the definition of the stopping time
T = inf{ t ≥ 0; St+1 ∈ S † \ S } ∈ [0, ∞];          (11.6)


if there is no terminal state or if the state space process does not reach the terminal
state, we set T = ∞.
Figure 11.4 shows that a MDP involves two different (Markovian) transitions: (a) there
is the environment’s dynamics p : S ×R×S ×A → [0, 1] in blue color, and (b) there is the
agent’s policy π : A × S → [0, 1] in orange color. Markovian means that these dynamics
are fully determined by just considering the realization in the previous iteration. We
discuss this.
(a) Environment’s dynamics. The environment’s dynamics is given by nature, and it is
either known or unknown to the decision maker.
Specifically, in the finite spaces case, we assume the transition probabilities

p( s′ , r | s, a) := P[ St+1 = s′ , Rt+1 = r | St = s, At = a ]          (11.7)
                   = P[ St+1 = s′ , Rt+1 = r | St = s, At = a, (Su )_{u=0}^{t−1} , (Au )_{u=0}^{t−1} , (Ru )_{u=0}^{t} ],

for t ≥ 0.
Thus, the pair (St+1 , Rt+1 ) only depends on the previous state-action pair (St , At ), this is
the Markov property we use for the environment’s dynamics. These transition probabil-
ities p fully determine the environment’s dynamics of the MDP. We give some remarks:

• In the case of a terminal state † and state space S † , one constrains (11.7) to remain
in the terminal state with probability one (and all subsequent rewards and actions
are discarded because the process is terminated).

• To run this dynamics we still need to define the agent’s policy π(a|s), and we need
to specify the initial state S0 ; we have set initial reward R0 = 0.

Using the stopping time T introduced in (11.6), one defines the total discounted reward,
called gain, after time t by
Gt = Σ_{u=t}^{T} γ^{u−t} Ru+1 ,          (11.8)

for γ ∈ (0, 1], and where an empty sum is set equal to zero.
The gain Gt is not generally finite for γ = 1: there are models without stopping, i.e.,
T = ∞, or models with a slow (heavy-tailed) stopping time which may make the gain infinite
on average for γ = 1. For γ < 1, this sum is always finite (also on average), because the
rewards (Rt )t≥1 are uniformly bounded on a finite reward space R, providing a uniform
upper bound (and a finite mean) from the corresponding geometric series.
(b) Agent’s policy. There remains the agent’s policy; note that the decision maker is
commonly called agent.
The agent’s policy is assumed to be of the form

π(a|s) := P[ At = a | St = s ]          (11.9)
         = P[ At = a | St = s, (Su )_{u=0}^{t−1} , (Au )_{u=0}^{t−1} , (Ru )_{u=0}^{t} ],

for t ≥ 0.


This policy π describes the decision making of the agent, see Figure 11.4. This decision
making can be deterministic, in which case the agent’s policy π(·|s) is a single point mea-
sure in some action a ∈ A, but it can also be random with π(·|s) describing a distribution
over the action space A. In case of a deterministic policy, it is more convenient to use
the notation π : S → A, s 7→ a = π(s) ∈ A.
We assume that the policy π(·|s) is not influenced by the rewards, see (11.9) and the
dotted blue line in Figure 11.4. The goal is to select the optimal policy by maximizing
the future expected discounted rewards, called value function, see next section.

The environment’s transition probabilities p, given by (11.7), and the agent’s poli-
cies π, given by (11.9), describe the finite MDP, as illustrated in Figure 11.4. The
goal of this dynamic decision making problem is to find an optimal policy π ∗ for a
given environment’s dynamics p that maximizes the expected gain. This problem
is solved by reinforcement learning, and there are two rather different situations:
either the environment’s dynamics p is known to the agent or it is unknown to
the agent. In the latter case, we can perform model-based reinforcement learning
by trying to learn the model, or we can perform model-free reinforcement learning
where the environment’s dynamics is not needed to solve the task.
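
To fix ideas, the following sketch encodes a small finite MDP by a transition tensor p[s_next, r, s, a] as in (11.7) and simulates the discounted gain (11.8) of one trajectory under a uniform random policy; the dimensions, the reward values and all variable names are hypothetical choices for illustration only.

import numpy as np

rng = np.random.default_rng(42)
n_s, n_r, n_a = 5, 3, 4                          # |S|, |R|, |A|
reward_values = np.array([0.0, 1.0, 2.0])        # the reward space R

# random environment dynamics p(s', r | s, a), normalized over (s', r) for every state-action pair
p = rng.uniform(size=(n_s, n_r, n_s, n_a))
p /= p.sum(axis=(0, 1), keepdims=True)

def step(s, a):
    """Sample the next state S_{t+1} and reward R_{t+1} from p(., . | s, a)."""
    probs = p[:, :, s, a].ravel()
    idx = rng.choice(n_s * n_r, p=probs)
    s_next, r_idx = divmod(idx, n_r)
    return int(s_next), reward_values[r_idx]

# one trajectory under a uniform random policy, with discounted gain G_0 as in (11.8)
gamma, T = 0.9, 50
s, gain = 0, 0.0
for t in range(T + 1):
    a = int(rng.integers(n_a))                   # pi(a|s): uniform policy, for illustration only
    s, r = step(s, a)
    gain += gamma**t * r
print("simulated discounted gain G_0:", round(gain, 3))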

11.5 Known environment’s dynamics


We start by assuming a known environment’s dynamics p; in most cases this is not
realistic, but it helps us to shape the ideas. In this setting, there is no model uncertainty
because the dynamics of the MDP is known and used to solve the task. This case is
known as model-based reinforcement learning, which applies to an either known or an
estimated environment’s dynamics for solving the task.
We add the selected policy π to the notation Pπ and Eπ , respectively.
The value function (state-value function or expected gain) of a given policy π is defined
by
vπ (s) = Eπ [ Gt | St = s] = Eπ[ Σ_{u=t}^{T} γ^{u−t} Ru+1 | St = s ],

for states s ∈ S.
Because of stationarity we can drop the time index t on the left-hand side of the previous
identity. Under a known environment’s dynamics p, the value function vπ can be com-
puted for every policy π, and the aim is to find the optimal policy π ∗ that maximizes the
value function. This then gives the optimal value

v∗ (s) := vπ∗ (s) = sup_π vπ (s),   for s ∈ S.          (11.10)

In this setting, the main question is whether there exists an optimal policy π ∗ that solves
(11.10), and, if yes, how can it be found. In the finite MDP case and under deterministic
policies, there exists an optimal policy π ∗ ; see Puterman [178, Corollary 6.2.8]. Moreover,
there are many other settings where such an existence result can be proved. In the finite


tabular case and under a known environment’s dynamics p, the optimal policy problem
is then solved by dynamic programming. There are two different versions that are useful:
(A) policy iteration and (B) value iteration. We briefly describe these. Let A(s) ⊂ A be
the admissible actions a in state s ∈ S.

11.5.1 Policy iteration


Assume a known environment’s dynamics p. For policy iteration one alternates the
following two steps: (a) policy evaluation and (b) policy improvement.

(a) Policy evaluation (also called prediction problem) aims at computing the value func-
tion vπ for a fixed policy π. Deconvoluting the Markov property by one step gives us
the Bellman equations

vπ (s) = Σ_{a∈A(s)} π(a|s) Σ_{s′∈S} Σ_{r∈R} p(s′ , r|s, a) ( r + γ vπ (s′ ) ),          (11.11)

for s ∈ S. These Bellman equations (11.11) have a unique solution if γ < 1. Remark, for
given π, (11.11) gives us a system of |S| linear equations for (vπ (s))s∈S . Thus, for known
environment’s dynamics p, this can fully be solved for the given policy π.

(b) Policy improvement aims at improving the policy for a given value function (vπ (s))s∈S .
This is done by a greedy step for deterministic policy improvements.

Algorithm 4 Policy iteration algorithm

(0) k = 0: select an initial deterministic policy πk : S → A. Fix γ < 1.

(1) Iterate for k ≥ 0 until the policy is stable:

(a) Apply policy evaluation to the deterministic policy πk to find the unique so-
lution of the value function (vπk (s))s∈S to the linear system

vπk (s) = Σ_{s′∈S} Σ_{r∈R} p(s′ , r|s, πk (s)) ( r + γ vπk (s′ ) ),          (11.12)

this inserts the deterministic policy πk into (11.11).


(b) Apply policy improvement by finding (πk+1 (s))s∈S in a greedy step

πk+1 (s) = arg max_{a∈A(s)} Σ_{s′∈S} Σ_{r∈R} p(s′ , r|s, a) ( r + γ vπk (s′ ) ),          (11.13)

and increase k → k + 1.

(2) Return (πk∗ (s))s∈S and (vπk∗ (s))s∈S for the stopping time k ∗ .

Algorithm 4 gives the resulting policy iteration algorithm. We comment on this algo-
rithm. The greedy step (11.13) implies that each policy πk+1 is uniformly better than
the previous one πk , and this algorithm will converge. As mentioned above, the system


(11.12) describes |S| linear equations that need to be solved. That is, for a suitable vector
bπk ∈ R|S| and matrix Bπk ∈ R|S|×|S| , (11.12) can be rewritten in vector notation

v πk = bπk + Bπk v πk , (11.14)

where we set v πk = (vπk (s))s∈S ∈ R|S| . This can be solved by a matrix inversion, giving
us the solution v πk = (Id − Bπk )−1 bπk for the given policy πk . In practice, this is solved
differently. Namely, the value function can be seen as the fix point of the system (11.12)
and (11.14), respectively. This fix point can be found by Banach’s fix point iteration,
under γ ∈ (0, 1). This observation is precisely the idea for the next algorithm, namely,
it may not be necessary to run this fix point iteration until convergence, and one could
alternate single fix point iteration step(s) and policy improvement steps more frequently.
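
Under a known environment’s dynamics p, Algorithm 4 can be implemented in a few lines. The following sketch works with the expected one-step rewards r(s, a) and the state transition kernel P(s′|s, a), which is all that (11.12)-(11.13) require; the randomly generated environment and all names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(7)
n_s, n_a, gamma = 10, 10, 0.5

# known environment, summarized by expected rewards r(s,a) and the transition kernel P(s'|s,a)
r_sa = rng.uniform(size=(n_s, n_a))                    # E[R_{t+1} | S_t=s, A_t=a]
P = rng.uniform(size=(n_s, n_s, n_a))
P /= P.sum(axis=0, keepdims=True)                      # P[s', s, a] = P(S_{t+1}=s' | S_t=s, A_t=a)

def policy_evaluation(pi):
    """Solve the linear system v = b_pi + B_pi v of (11.14) for a deterministic policy pi."""
    idx = np.arange(n_s)
    b = r_sa[idx, pi]                                  # b_pi(s) = r(s, pi(s))
    B = gamma * P[:, idx, pi].T                        # B_pi[s, s'] = gamma * P(s' | s, pi(s))
    return np.linalg.solve(np.eye(n_s) - B, b)

def policy_improvement(v):
    """Greedy step (11.13) based on the current value function v."""
    q = r_sa + gamma * np.einsum('jsa,j->sa', P, v)    # r(s,a) + gamma * sum_{s'} P(s'|s,a) v(s')
    return q.argmax(axis=1)

pi = np.zeros(n_s, dtype=int)                          # initial deterministic policy
while True:
    v = policy_evaluation(pi)
    pi_new = policy_improvement(v)
    if np.array_equal(pi_new, pi):                     # policy is stable: stop
        break
    pi = pi_new
print("optimal policy:", pi)
print("optimal values:", np.round(v, 3))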

11.5.2 Value iteration


The difficulty with the policy iteration algorithm is that the policy evaluation may be
quite prohibitive in finding the value function (vπk (s))s∈S in each step k ≥ 0 of the
algorithm. Value iteration is an alternative solution. For this we remark that the policy
evaluation (11.12) can be solved by Banach’s fix point iteration for γ ∈ (0, 1), thus, we
iterate for ℓ ≥ 0
vℓ+1 (s) = Σ_{s′∈S} Σ_{r∈R} p(s′ , r|s, πk (s)) ( r + γ vℓ (s′ ) ),
to find vℓ → vπk for ℓ → ∞ for a fixed policy πk ; this is called iterative policy evaluation.
The idea for value iteration is to truncate this iterative policy evaluation after one step.
This leads to the simpler Algorithm 5.

Algorithm 5 Value iteration algorithm

(0) For k = 0, select an initial value function (vk (s))s∈S , and γ ∈ (0, 1).

(1) Iterate for k ≥ 0 until stopping is exercised (resulting in stopping index k ∗ )

vk+1 (s) = max_{a∈A(s)} Σ_{s′∈S} Σ_{r∈R} p(s′ , r|s, a) ( r + γ vk (s′ ) ),   for s ∈ S.

(2) Return (vk∗ (s))s∈S and the resulting deterministic policy (πk∗ (s))s∈S obtained by

πk∗ (s) = arg max_{a∈A(s)} Σ_{s′∈S} Σ_{r∈R} p(s′ , r|s, a) ( r + γ vk∗ (s′ ) ).

Algorithm 5 directly iterates the value optimization, i.e., it focuses on vk (s) instead of
the value vπk (s) of an actual policy πk .
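
A corresponding sketch of the value iteration Algorithm 5, on the same type of randomly generated environment as in the policy iteration sketch above (again, all names and parameters are illustrative assumptions), reads:

import numpy as np

rng = np.random.default_rng(7)
n_s, n_a, gamma = 10, 10, 0.5
r_sa = rng.uniform(size=(n_s, n_a))                    # expected rewards E[R_{t+1}|s,a]
P = rng.uniform(size=(n_s, n_s, n_a))
P /= P.sum(axis=0, keepdims=True)                      # transition kernel P[s',s,a]

v = np.zeros(n_s)                                      # initial value function
for k in range(1000):
    q = r_sa + gamma * np.einsum('jsa,j->sa', P, v)    # r(s,a) + gamma * sum_{s'} P(s'|s,a) v(s')
    v_new = q.max(axis=1)                              # single fix point / maximization step
    if np.max(np.abs(v_new - v)) < 1e-10:              # stop once (numerically) at the fix point
        v = v_new
        break
    v = v_new
pi_star = q.argmax(axis=1)                             # resulting deterministic policy
print("optimal values:", np.round(v, 3))
print("optimal policy:", pi_star)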

11.6 Unknown environment’s dynamics: Monte Carlo


The policy iteration and the value iteration algorithms presented in the previous Section
11.5 both require the knowledge of the environment’s dynamics p. In most cases this


knowledge is not available, see, e.g., the multi-armed bandit problem studied in Example
11.1, where the reward distribution is not available, but needs to be learned from experi-
ence. Thus, the typical case in practice is the one with unknown environment’s dynamics
p. This section mainly presents a preparation for the next Section 11.7 which explains
how to deal with the case of unknown environment’s dynamics p. The main technique
used will be temporal difference learning, and in this section we prepare for this.
In the case of an unknown environment’s dynamics p, one tries to either learn from actual
experience or one tries to learn from simulated experience.
• Learning from actual experience does not require any knowledge about the environ-
ment’s dynamics p. In fact, in a model-free manner, one directly tries to estimate
the value function v(s) from actual experience (this is also called prediction), from
which one then derives the optimal policy.

• Learning from simulated experience requires a model to generate the (Markov)


transitions. This can either be the true environment’s dynamics p (if known) or
it can be an estimated one. This case refers to model-based reinforcement learn-
ing. Noteworthy, in many situations the entire environment’s dynamics p is not
necessary, in contrast to the dynamic programming solutions of policy or value
iteration, see Algorithms 4 and 5. That is, dynamic programming requires the
probability function p in a sufficiently analytical way, whereas the methods pre-
sented below only require (explicit) transitions (i.e., samples) of next states St+1
and rewards Rt+1 . Whenever one works with explicit samples, the process is called
Monte Carlo learning.

We expand the value function (vπ (s))s∈S to the action-value function (also called Q-
function) which additionally accounts for the taken action
" T #
X
u−t
qπ (s, a) = Eπ [ Gt | St = s, At = a] = Eπ γ Ru+1 St = s, At = a ,
u=t

for s ∈ S and a ∈ A(s) ⊂ A, where A(s) are the admissible actions in state s.
This is similar to the multi-armed bandit example (11.1), but expanded by the state
St = s and accounting for all future (discounted) rewards under policy π. We have for
any t ≥ 0 the following two crucial relationships between the value and the action-value
functions
vπ (s) = Eπ [ qπ (St , At )| St = s] ,
qπ (s, a) = Eπ [ Rt+1 + γ vπ (St+1 )| St = s, At = a] ,
this uses the tower property for conditional expectations, and the latter particularly uses
that the next action At only depends on St , and not on the entire history, see (11.9), and
on the stationarity of the MDP. The latter identity shows that the action-value function
naturally enters the policy improvement (11.13) because we have
πk+1 (s) = arg max_{a∈A(s)} Σ_{s′∈S} Σ_{r∈R} p(s′ , r|s, a) ( r + γ vπk (s′ ) )
          = arg max_{a∈A(s)} qπk (s, a).          (11.15)


Therefore, we could have introduced the action-value function qπ (s, a) already at an


earlier stage. A main difference to the previous policy and value iteration algorithms
with known environment’s dynamics p is that we need to replace the policy improvement
(11.13) by a tractable quantity not involving p, and for this we directly work with the
action-value function qπ (s, a) instead of the value function vπ (s), see (11.15). Assume
that for any given policy π, we can observe an episode following π, and given by the
sequence

R0 , S0 , A0 ; R1 , S1 , A1 ; R2 , S2 , A2 ; . . . ; RT , ST , AT ; RT +1 , ST +1 . (11.16)

Denote by Ts,a the first visit of the observed sequence (11.16) to the state-action pair
(s, a) ∈ S × A. This gives us an empirical estimate of the action-value qπ (s, a) of policy
π in (s, a)
qbπ (s, a) = Σ_{u=Ts,a}^{T} γ^{u−Ts,a} Ru+1 .          (11.17)

This is the simplest version of Monte Carlo estimation of the action-value function, and
there are many similar variants; see Sutton–Barto [212, Section 5.1]. For the following
algorithm, this estimation is performed for all state-action pairs (s, a) independently, i.e.,
the estimates do not build on each other. Therefore, according to reinforcement learning
terminology, this is not a bootstrapping estimate.1 Inserting the empirical estimate
(11.17) into (11.15) motivates the following policy improvement step for k → k + 1

πk+1 (s) = arg max_{a∈A(s)} qbπk (s, a).          (11.18)

This leads us to the following algorithm; in the sequel it is more convenient to write the
updates πk → πk+1 as π ← π. I.e., instead of labeling the iterations by k ≥ 0, we use a
generic left-arrow ‘←’ to indicate the updates in the loops of the following algorithms.
The resulting Monte Carlo exploring starts algorithm is presented in Algorithm 6. It
considers the first visits Ts,a to every state-action pair (s, a) by recursively checking
whether there is no earlier visit in the observed episode (11.16); the name ‘exploring
starts’ refers to the random choice of the initial state-action pair (S0 , A0 ), which gives
every state-action pair a positive probability of being a starting point. The resulting gain G is then appended to the observed
values Gains(s, a) of that state-action pair (s, a), and the current optimal policy estimate
π is re-evaluated/updated in the observed state s = St . However, this is not for a fixed
policy, but rather over all past experienced policies because Gains(s, a) collects the gains
over all past episodes (11.16). This algorithm cannot converge to a suboptimal solution
because of monotonicity. This is intuitively clear, however, as stated in Sutton–Barto
[212, Section 5.3], there is no formal proof of this intuition. There is another difficulty in
this algorithm, namely, there needs to be a way of observing an episode (11.16) for each
policy π under consideration. Typically, this requires simulated experience, but it will
not easily be possible to generate actual experience for each policy π of interest. E.g.,
in the multi-armed bandit problem this would require a huge investment to generate an
episode for every policy π of interest.
1 Note that bootstrapping in reinforcement learning means that the parameter estimation depends recursively on previous estimates. This is different from the statistical bootstrap of Section 1.5.4.


Algorithm 6 Monte Carlo exploring starts algorithm

(0) Select an initial deterministic policy π : S → A, an initial action-value function


estimate (qb(s, a))s,a , and set Gains(s, a) to the empty list for all state-action pairs
(s, a). Select γ ∈ (0, 1).

(1) Iterate until a stopping rule is exercised:

– Choose an initial state S0 ∈ S and an initial action A0 ∈ A(S0 ) at random so


that all potential state-action pairs have a positive probability.
– Observe an episode (11.16) for the present policy π with finite termination
time T , and initialize the gain G = 0.
– Loop for t = T, . . . , 0:
∗ G ← γG + Rt+1 .
∗ If the state-action pair (St , At ) does not appear in (Su , Au )_{u=0}^{t−1} :
· Append the gain G to Gains(St , At );
· Action-value update: qb(St , At ) ← average(Gains(St , At ));
· Policy update: π(St ) ← arg max_{a∈A(St )} qb(St , a).

The greedy update (11.18) exploits the optimal action in state St . In all algorithms below
we insert ε-greedy updates for a given ε ∈ (0, 1) to also explore.
An ε-greedy strategy is obtained by replacing (11.18) by the following two steps

a+_{k+1} = arg max_{a∈A(St )} qbπk (St , a),          (11.19)

and update the new random policy to

πk+1 (a|St ) = { 1 − ε (1 − 1/|A(St )|),   for a = a+_{k+1} ,
              { ε/|A(St )|,               otherwise.          (11.20)

This ε-greedy strategy is equivalent to (11.2). It also mitigates the problem that there
are state-action pairs that are not sufficiently often visited. In fact, it allows us to drop
the inconvenient assumption in Algorithm 6 that each potential pair (s, a) must appear
as a starting point with a positive probability. This ε-greedy strategy is called an on-policy
method because it is used on the policy πk itself, whereas off-policy methods work on the
transition probabilities to generate the episodes, e.g., by using a version of importance
sampling.
There is one cool thing that we did not mention, namely, the above action-value updates
can again be done by incremental learning (because we consider simple averages), see
Section 11.3 and (11.4),

qb(St , At ) ← qb(St , At ) + ϱt (G(St , At ) − qb(St , At )) , (11.21)

where G(St , At ) is the gain appended to Gains(St , At ) in the current iteration, and with


learning rates ϱt = ϱt (St , At ) > 0. This is the basis for all upcoming more practical
proposals under unknown environment’s dynamics p.

11.7 Temporal difference learning


All previously considered algorithms are not practical because they either assume that
the environment’s dynamics p is known or that we can generate an episode (11.16) for
any selected policy π. The former does not work because we typically do not know the
mechanism of the environment, see multi-armed bandit example. The latter may be too
costly, as we may not be able to generate episodic experiences for any policy of interest, in
the multi-armed bandit example this would require a huge investment of the gambler. The
following temporal difference learning algorithms allow for actual experience learning,
which can be interpreted as online learning as they are performed step-by-step as actual
experience becomes available. We will present two different algorithms: the SARSA on-
policy control and the Q-learning off-policy control.
We first explain the meaning of temporal difference learning. The previous example of
Algorithm 6 was based on the incremental updating rule (11.21) stating

qb(St , At ) ← qb(St , At ) + ϱt (G(St , At ) − qb(St , At )) ,

and the gain generally satisfies the recursion


Gt = Σ_{u=t}^{T} γ^{u−t} Ru+1 = Rt+1 + γ Gt+1 .

Assume that the gain Gt = G(St , At ) belongs to the state-action pair (St , At ) at time t,
and that it has been generated by an episode following policy π. Then, Gt is an empirical
estimate for qπ (St , At ) = Eπ [Gt |St , At ], see (11.17). If we revert this consideration, we
can also use the action-value qπ (St , At ) to predict the gain Gt , i.e., the gain is predicted
by its expected value (which minimizes the mean squared error). If we perform this
prediction at time t + 1, i.e., if we use the next action-value qπ (St+1 , At+1 ) to predict the
gain Gt+1 , we receive the following approximation

Gt ≈ Rt+1 + γ qπ (St+1 , At+1 ). (11.22)

Inserting this approximation into the previous incremental update gives us what is
known as the one-step temporal difference (TD(0)) update

qb(St , At ) ← qb(St , At ) + ϱt (Rt+1 + γ qb(St+1 , At+1 ) − qb(St , At )) , (11.23)

with learning rates ϱt = ϱt (St , At ) > 0.


This considers a temporal difference component in t which tries to minimize the prediction
error of the gain by the estimated action-value function, in fact (11.23) can be interpreted
as a gradient descent step; for more technical rigour justifying (11.23) we refer to Sutton–
Barto [212, Sections 6.1-6.3]. The major difference to the Monte Carlo version of the
previous section is that we do not need to compute the gains G for the entire policy


π under consideration, but only the next policy actions π(At |St ) matter for this step-
by-step update. Therefore, it can be performed online on actual experience. Rewriting
(11.23) gives us

qb(St , At ) ← ϱt (Rt+1 + γ qb(St+1 , At+1 )) + (1 − ϱt ) qb(St , At ).

Thus, we obtain an update in a recursive credibility manner. This method is therefore


also called a bootstrapping method because it improves on the existing estimates; this
meaning of bootstrapping is different from Section 1.5.4. Moreover, this estimate is also
called a sample estimate because it anticipates the next state-action pair (St+1 , At+1 ) in
its estimation. In many cases it can be proved that under some assumptions one obtains
convergence to the true value under a given policy π.

11.7.1 SARSA on-policy control


The first temporal difference method that we present is called SARSA on-policy control.
It is called SARSA because from state St , we generate the action At from the current
policy π; this gives us the reward Rt+1 and the new state St+1 , from which we can generate
the next action At+1 to use the bootstrap update (11.23). This update is applied to every
non-terminal state St . If St+1 = † is terminal, we set qb(St+1 , At+1 ) = 0.

Algorithm 7 SARSA temporal difference algorithm (for estimating the Q-function)

(0) Select an initial action-value function (qb(s, a))s,a with qb(†, a) ≡ 0; and γ ∈ (0, 1),
and small ε > 0.

(1) Iterate until a stopping rule is exercised:

– Choose S0 ∈ S at random and sample action A0 from an ε-greedy policy of


structure (11.19)-(11.20).
– Loop for t ≥ 0 until the terminal value † for St :
∗ Given state-action pair (St , At ), observe reward Rt+1 and next state St+1 .
∗ Given state St+1 , sample action At+1 from an ε-greedy policy of structure
(11.19)-(11.20).
∗ Update the action-value function with temporal difference (11.23) and use
the state-action pair (St+1 , At+1 ) as input to the next step t + 1.

Algorithm 7 gives the SARSA temporal difference algorithm for estimating the action-
value function. SARSA on-policy learning is fully practical as we can online (real time)
observe the next reward Rt+1 and the next state St+1 , given the state-action pair (St , At )
on the selected policy. For an example, we refer to the multi-armed bandit problem that
returns the next reward Rt+1 after we have taken the action At . That is, this can be
performed with actual experience, and it does not require any (prior) knowledge about
the environment’s dynamics p. On-policy refers to the fact that we use the selected policy
π once more to anticipate the next action At+1 , given state St+1 . This is precisely the
difference to the off-policy algorithm presented in the next subsection.
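
A minimal tabular SARSA sketch for a continuing task might look as follows; the randomly generated environment, which the agent only accesses through sampled transitions, the ε-greedy sampling and all parameter choices are purely illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, gamma, eps = 10, 10, 0.5, 0.01

# illustrative random environment; the agent never uses P and r_sa directly, only sampled transitions
P = rng.uniform(size=(n_s, n_s, n_a))
P /= P.sum(axis=0, keepdims=True)
r_sa = rng.uniform(size=(n_s, n_a))

def env_step(s, a):
    """The environment returns (R_{t+1}, S_{t+1}), given the state-action pair (s, a)."""
    s_next = int(rng.choice(n_s, p=P[:, s, a]))
    return r_sa[s, a] + rng.normal(scale=0.1), s_next

def eps_greedy(q_row):
    return int(rng.integers(n_a)) if rng.uniform() < eps else int(np.argmax(q_row))

q_hat = np.zeros((n_s, n_a))
counts = np.zeros((n_s, n_a))
s = int(rng.integers(n_s))
a = eps_greedy(q_hat[s])
for t in range(100_000):
    r, s_next = env_step(s, a)
    a_next = eps_greedy(q_hat[s_next])                 # SARSA: the next action is sampled on-policy
    counts[s, a] += 1
    rho = 1.0 / counts[s, a]                           # learning rate
    q_hat[s, a] += rho * (r + gamma * q_hat[s_next, a_next] - q_hat[s, a])   # update (11.23)
    s, a = s_next, a_next

print("estimated optimal policy:", q_hat.argmax(axis=1))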


A variant of SARSA is expected SARSA temporal difference learning which considers the
update

qb(St , At ) ← qb(St , At ) + ϱt ( Rt+1 + γ Σ_{a∈A(St+1 )} π(a|St+1 ) qb(St+1 , a) − qb(St , At ) ).          (11.24)

That is, we do not simulate the next action At+1 for the update, but we replace it by
an expected value. The following off-policy algorithm is similar in that it replaces the
expected value (reflected by the sum) by an off-policy maximum operation, see (11.25),
below.

11.7.2 Q-learning off-policy control


Watkins [232] and Watkins–Dayan [233] introduced a significant simplification which
allowed for first rigorous proofs. Namely, they considered the temporal difference update
qb(St , At ) ← qb(St , At ) + ϱt ( Rt+1 + γ max_{a∈A(St+1 )} qb(St+1 , a) − qb(St , At ) ).          (11.25)

This is called off-policy learning because it does not anticipate the next action At+1
w.r.t. selected policy π, given state St+1 .

Algorithm 8 Q-Learning temporal difference algorithm

(0) Select an initial action-value function (qb(s, a))s,a with qb(†, a) ≡ 0; and γ ∈ (0, 1)
and small ε > 0.

(1) Iterate until a stopping rule is exercised:

– Choose initial state S0 ∈ S at random.


– Loop for t ≥ 0 until the terminal value † for St :
∗ For state St , sample At from an ε-greedy policy of structure (11.19)-
(11.20).
∗ Given state-action pair (St , At ), observe reward Rt+1 and next state St+1 .
∗ Update the action-value function with temporal difference (11.25) and use
the state St+1 as input to the next step t + 1.

The maximizations in Algorithm 8 can be critical as they may lead to biases. To mitigate
such biases there are more advanced methods like double-Q-learning; for details see
Sutton–Barto [212, Section 6.7].
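
A Q-learning sketch parallels the SARSA sketch of the previous subsection; the only substantive change is the bootstrap target. As before, the randomly generated environment and all names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, gamma, eps = 10, 10, 0.5, 0.01
P = rng.uniform(size=(n_s, n_s, n_a))
P /= P.sum(axis=0, keepdims=True)                      # illustrative transition kernel
r_sa = rng.uniform(size=(n_s, n_a))                    # illustrative expected rewards

q_hat = np.zeros((n_s, n_a))
counts = np.zeros((n_s, n_a))
s = int(rng.integers(n_s))
for t in range(100_000):
    # behavior policy: eps-greedy on the current estimates
    a = int(rng.integers(n_a)) if rng.uniform() < eps else int(np.argmax(q_hat[s]))
    s_next = int(rng.choice(n_s, p=P[:, s, a]))
    r = r_sa[s, a] + rng.normal(scale=0.1)
    counts[s, a] += 1
    rho = 1.0 / counts[s, a]
    # off-policy target (11.25): bootstrap with the maximal estimated action-value in s_next
    q_hat[s, a] += rho * (r + gamma * q_hat[s_next].max() - q_hat[s, a])
    s = s_next

print("estimated optimal policy:", q_hat.argmax(axis=1))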

Example 11.2. For our illustration, we select a small scale example with finite spaces
R = S = A = {1, . . . , 10} for the reward space, the state space and the action space,
respectively. We select a continuing task MDP by specifying an environment’s dynamics
p : S × R × S × A → (0, 1), see (11.7). This probability tensor p has 10^4 = 10,000
entries that we select at random such that p(·, ·|s, a) sums to one for all state-action pairs
(s, a) ∈ S ×A, and such that it does not have any symmetries. For the gain computation,


we select a discount factor of γ = 1/2. Our goal is to find the optimal policy π ∗ that
maximizes the values vπ (s) for all initial states s ∈ S.
There exists a unique solution to this optimal control problem and we try to find it with
reinforcement learning. We present the four methods: (1) policy iteration presented in
Algorithm 4, (2) value iteration presented in Algorithm 5, (3) SARSA temporal difference
learning of Algorithm 7, and (4) Q-learning temporal difference of Algorithm 8. The first
two methods (1)-(2) are based on the knowledge of the true probability tensor p, methods
(3)-(4) do not use the knowledge of this probability tensor p, but only observed actual
experience, thus, the latter two methods present realistic learning problems. The SARSA
algorithm (3) performs on-policy learning, and Q-learning (4) off-policy learning.

Figure 11.5: Algorithm convergence analysis: (lhs) policy iteration Algorithm 4 and (rhs)
value iteration Algorithm 5; the x-axis shows the iterations k ≥ 0 of the algorithms, the
y-axis the value functions v(s) for the initial states s = 1, . . . , 10.

Figure 11.5 shows the developments of the value functions vπk (s), s ∈ S, in the policy
iteration algorithm, and the value functions vk (s), s ∈ S, in the value iteration algorithm
for iterations k ≥ 0. The policy iteration Algorithm 4 converges in two iterations to the
optimal policy π ∗ , and the value iteration Algorithm 5 converges in roughly 20 iterations,
see Figure 11.5. For the policy iteration algorithm we solve the linear system (11.14) in
every step k ≥ 0 of the algorithm, which can easily be done here because Bπk is a small
matrix of size 10×10. For the value iteration algorithm we instead use a (single) Banach’s
fix point step in each iteration k ≥ 0. For this small scale example, this results in a less
efficient algorithm, because the matrix inversion needed in the policy iteration is very cheap here.
Table 11.1 shows the resulting optimal policy π ∗ which is the same in both algorithms.

state s 1 2 3 4 5 6 7 8 9 10
policy iteration 7 5 10 6 4 3 6 3 3 7
value iteration 7 5 10 6 4 3 6 3 3 7

Table 11.1: Optimal policy π ∗ (s) ∈ A for states s ∈ S.

Under a known environment’s dynamics p, we can easily find the optimal policy π ∗ , as


illustrated in Table 11.1. We now turn our attention to the more realistic situation of not
knowing the environment’s dynamics p. We therefore implement SARSA and Q-learning
temporal difference to determine an optimal policy. Recall, this is done as follows. Based
on state St , we exercise action At according to our actual policy π in place. Nature then
gives us the reward Rt+1 and the next state St+1 , based on the current state-action pair
(St , At ). This allows us to perform actual experience learning.
We implement SARSA temporal difference learning as follows. We select an ε-greedy
policy with ε = 1%, and we run the (continuing) task for 3000 iterations t ∈ {0, . . . , 3000},
that is, on average 30 times we explore, instead of exploiting the currently estimated
optimal policy. This seems a low value. We apply this procedure 2000 times, which
determines the outer loop in Algorithm 7. Finally, the learning rate ϱt = ϱt (St , At )
is chosen inversely proportional to the number of occurrences of the state-action pair
(St , At ) up to and including time t; this corresponds to (11.3) extended to the state
observation.

Figure 11.6: Algorithm convergence analysis: (lhs) SARSA temporal difference Algo-
rithm 7 and (rhs) Q-learning temporal difference Algorithm 8; the x-axis shows the time
t and the y-axis the action-value function estimates.

Figure 11.6 (lhs) shows the convergence behavior of the on-policy SARSA temporal dif-
ference algorithm. The x-axis shows the time scale t ∈ {0, . . . , 3000}, and the y-axis the
action-value functions qb(s, a) for selected state-action pairs (s, a). Each graph is averaged
over the 2000 runs of the outer loop of Algorithm 7. We do not observe full convergence
yet, which indicates that we should run the continuing task for more time steps t. Based
on these action-value estimates qb(s, a), we determine the estimated optimal policy πb∗ (s),
s ∈ S. The result is given in Table 11.2.
From Table 11.2, we observe that SARSA finds the true optimal policy π ∗ almost
perfectly; only in state s = 9 do we estimate the optimal action to be a = πb∗ (9) = 10,
instead of the true optimal action π ∗ (9) = 3.
For the off-policy learning with Q-learning temporal difference we apply the same strategy
and the same parameters as for SARSA. The convergence behavior is shown in Figure
11.6 (rhs) and the estimated optimal policy πb∗ is given on the last line of Table 11.2.
There is one misspecification with Q-learning which concerns state s = 7, where we

exercise an action a = 2 instead of the action a = 6. This closes this example. ■

state s                          1  2  3   4  5  6  7  8  9   10
policy iteration                 7  5  10  6  4  3  6  3  3   7
value iteration                  7  5  10  6  4  3  6  3  3   7
SARSA temporal difference        7  5  10  6  4  3  6  3  10  7
Q-learning temporal difference   7  5  10  6  4  3  2  3  3   7

Table 11.2: Estimated optimal policies πb∗ (s) ∈ A for states s ∈ S.

Remarks 11.3. We close this tabular learning exposition with some further methods.

• There are many variants that aim at improving both accuracy and speed of con-
vergence. E.g., the temporal difference step (11.23) considers a one-step ahead
prediction which can easily be replaced by an n-step ahead prediction

Gt = Rt+1 + γ Rt+2 + · · · + γ^{n−1} Rt+n + γ^n Gt+n
   ≈ Σ_{s=1}^{n} γ^{s−1} Rt+s + γ^n qbt+n−1 (St+n , At+n ) =: Gbt:t+n ,

if qbk denotes the estimate of the action-value function at time k. This motivates
the n-step temporal difference update

qbt+n (St , At ) ← qbt+n−1 (St , At ) + ϱt ( Gbt:t+n − qbt+n−1 (St , At ) );

see Sutton–Barto [212, formula (7.5)]. Integrating this into the SARSA tempo-
ral difference algorithm gives the n-step SARSA which approaches Monte Carlo
estimation for n → ∞.
• A second modification is to average over different returns Gbt:t+n . Selecting λ ∈
[0, 1), we can also approximate the gain Gt by

Gb^λ_t = (1 − λ) Σ_{n≥1} λ^{n−1} Gbt:t+n ,

note that the weights aggregate to one. This gives the general temporal difference
methods called TD(λ); see Sutton–Barto [212, Chapter 12]. For λ = 1, this again
reduces to Monte Carlo estimation.

11.8 Beyond tabular learning


The previous methods of reinforcement learning have been considering tabular problems,
meaning that both the state space S and the action space A were finite. This allows one
to store everything in a (finite) table. Going over to continuous state spaces S makes the
problem more complicated. In that case, tabular estimates can be approximated
by networks. The method of deep Q-network (DQN), introduced by Mnih et al. [159, 160],


is considered to be a breakthrough in reinforcement learning, see Sutton–Barto [212,


Chapter 16].
Assume a general state space S and a finite action space A = {a1 , . . . , am } with m possible
actions (aj )_{j=1}^{m} . First, approximate the action-value function qπ (s, a) of a given policy π
with a multi-output network (qϑ (s, aj ))_{j=1}^{m} , having one input s ∈ S and m outputs, and
with ϑ denoting the network parameter. That is, select a multi-output network

qϑ : S → R^m ,  s 7→ (qϑ (s, a1 ), . . . , qϑ (s, am ))⊤ .

If one only operated on this multi-output network qϑ , one would end up in an
unstable situation because in the optimizations the unknown network parameter ϑ ap-
pears on both sides of the equation. To improve the stability of the algorithm, Mnih et
al. [159, 160] duplicated the network qϑ by a second network qϑτ , called target network,
that has the same architecture and only differs in the network parameter ϑτ . Both net-
works are initialized by the same network parameter ϑ = ϑτ , but the network parameter
of the first network will be updated more frequently, resulting in a second network qϑτ
that is more inert. This bigger inertia stabilizes the updates of the action-value esti-
mates qϑ because otherwise these estimates would use themselves
(in a self-circular way), see (11.25). Therefore, we need this more inert second network
to receive meaningful results and to prevent over-fitting.
Second, to not waste any (costly) observations, every quadruple (St , At , Rt+1 , St+1 ) is
stored in a memory denoted by M; this idea was introduced by Lin [137]. For gradient
descent learning, we will (re-)sample random (mini-)batches from this memory M to
learn the network parameter ϑ. As a side effect, such random mini-batches break the
temporal correlation in the experience which is an advantage in gradient descent learning.
Note that this makes the following Algorithm 9 an off-policy algorithm.

The first part of the deep Q-network algorithm is similar to the Q-learning temporal
difference algorithm, see Algorithm 8, and the two algorithms start to differ in the step
where we use the memory M. Using this memory, we sample a mini-batch of size K, and
each of these samples is used to construct an approximative gain Gbk based on the second,
more inert network qϑτ ; this is motivated by (11.22). The approximative gains are then used
to improve the network parameter ϑ of the first network in the next step, by optimally
predicting the approximative gains (Gbk )_{k=1}^{K} by this first network qϑ . Note that the multi-
output network qϑ (Sk , Ak ) has input Sk , and we select the output channel that coincides
with the value of Ak . For the loss function L one can use any strictly consistent loss
function for mean estimation, however, in reinforcement learning practice, also robust
versions of loss functions have been chosen. Finally, every τ ≫ 1 iterations, the inert
network qϑτ is updated, with either a soft update α ∈ (0, 1) or a hard update α = 1.
The above algorithm is again prone to provide a biased estimate by taking the maximum
in the Gbk estimate. The deep double-Q-network proposed by Hasselt et al. [91] tries to
compensate for this by considering an estimated gain instead

Gbk = { Rk+1 ,                                                            if Sk+1 is terminal,
      { Rk+1 + γ qϑτ ( Sk+1 , arg max_{a∈A(Sk+1 )} qϑ (Sk+1 , a) ),       otherwise.


Algorithm 9 Deep Q-network algorithm

(0) Select a random initial network parameter ϑ and initialize ϑτ = ϑ. Choose α ∈ (0, 1]
and small ε > 0. Initialize the memory M, set the (mini-)batch size K ∈ N and
select the step size τ ∈ N.

(1) Iterate until a stopping rule is exercised:

– Choose S0 ∈ S at random.
– Loop for t ≥ 0 until terminal value † for St :
∗ For state St , sample At from an ε-greedy policy of structure (11.19)-
(11.20) under the actual network parameter ϑ for qϑ .
∗ Given state-action pair (St , At ), observe reward Rt+1 and next state St+1 .
∗ Store (St , At , Rt+1 , St+1 ) to the memory M.
∗ Sample a mini-batch of size K from the memory M.
∗ Set training labels, for 1 ≤ k ≤ K,
Gbk = { Rk+1 ,                                           if Sk+1 is terminal,
      { Rk+1 + γ max_{a∈A(Sk+1 )} qϑτ (Sk+1 , a),        otherwise.

∗ Update the network parameter


ϑ ← arg min_ϑ Σ_{k=1}^{K} L( Gbk , qϑ (Sk , Ak ) ).

∗ Every τ steps, update the second network parameter

ϑτ ← αϑ + (1 − α)ϑτ .

∗ Use state St+1 for the next step t + 1.


That is, we replace a maximum by an estimated expected value.
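
The following sketch computes, for one mini-batch, the training labels of Algorithm 9 and their double-Q variant; the linear ‘networks’ and all names are placeholder assumptions, in practice qϑ and qϑτ would be neural networks built with a deep learning library.

import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions, gamma, K = 4, 3, 0.9, 32

# placeholder 'networks': linear maps from a state to one value per action
theta = rng.normal(size=(state_dim, n_actions))        # online network parameter
theta_tau = theta.copy()                               # target network parameter (more inert copy)
def q(w, s):
    return s @ w                                       # q_w(s): returns the m action-values

# a mini-batch, as if sampled from the replay memory M: (S_k, A_k, R_{k+1}, S_{k+1}, terminal flag)
S = rng.normal(size=(K, state_dim))
A = rng.integers(n_actions, size=K)
R = rng.normal(size=K)
S_next = rng.normal(size=(K, state_dim))
terminal = rng.uniform(size=K) < 0.1

# DQN labels of Algorithm 9: maximum over the target network's action-values
G_dqn = R + gamma * np.where(terminal, 0.0, q(theta_tau, S_next).max(axis=1))

# double-DQN labels: the online network selects the action, the target network evaluates it
a_star = q(theta, S_next).argmax(axis=1)
G_ddqn = R + gamma * np.where(terminal, 0.0, q(theta_tau, S_next)[np.arange(K), a_star])

# squared-error loss between labels and the online network's predictions for the taken actions
loss = np.mean((G_dqn - q(theta, S)[np.arange(K), A]) ** 2)
print("mini-batch loss:", round(float(loss), 4))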

11.9 Actor-critic reinforcement learning algorithms


State-of-the-art reinforcement learning uses so-called actor-critic reinforcement learning
algorithms. These work a bit differently compared to the algorithms presented above:
Essentially, the above algorithms estimate the action-value function qπ (s, a) from which
the optimal policy and action is directly selected by an ε-greedy step (11.19)-(11.20).
Actor-critic reinforcement learning tries to directly estimate an optimal policy πϑ∗ (a|s)
from a parametrized family of policies {πϑ (a|s)}ϑ . In the case of a finite action space
A, we can select the parametrized family as categorical distributions πϑ (·|s) on A, for
all states s ∈ S. For example, we can select a multi-output FNN z ϑ : S → R|A| , and
applying the softmax function, we receive the corresponding probabilities

πϑ (a|s) = exp{z ϑ (s)a } / Σ_{a′∈A} exp{z ϑ (s)a′ },

if z ϑ (s)a′ denotes the component of z ϑ that corresponds to action a′ ∈ A.


An actor-critic algorithm has two different modules:

• Actor. The actor is responsible for learning an optimal policy πϑ (a|s) by improving
the parameter ϑ based on the feedback signal the actor receives.

• Critic. The critic evaluates the taken action and gives a feedback signal to the
actor. The evaluation is often done with a value and/or an action-value function
estimate. This value function is estimated/improved by the critic analyzing the
resulting rewards of the taken actions.

11.9.1 Policy gradient control


Before we can discuss actor-critic algorithms, we need to introduce the policy gradient,
assuming that ϑ is a real-valued vector and that all following terms are differentiable
in ϑ. Following Sutton–Barto [212, Section 13.2], we focus for the moment on episodic
tasks which allow us to set γ = 1, and we assume finite action and state spaces (the
following results can be generalized). Under these assumptions we define the so-called
performance, which is going to be the objective function, given by
" T #
X
J(ϑ) = vπϑ (s0 ) = Eπϑ Ru+1 S0 = s0 ,
u=0

for a given policy πϑ and starting in state s0 ∈ S. Using some algebra, we can reformulate
the gradient of the performance J(ϑ) w.r.t. ϑ as follows
" #
X
∇ϑ J(ϑ) ∝ Eπϑ qπϑ (St , a)∇ϑ πϑ (a| St ) S0 = s0
a∈A
= Eπϑ [ qπϑ (St , At )∇ϑ log πϑ ( At | St )| S0 = s0 ] (11.26)
= Eπϑ [ Gt ∇ϑ log πϑ ( At | St )| S0 = s0 ] ;


see Sutton–Barto [212, Section 13.3]. There is a subtle issue hidden in this identity
(11.26), namely, the time index t seems undefined. In fact, in this identity t should be
a random variable such that St has the same distribution as the relative frequency of
the visits of the state-space sequence to the states in S under policy πϑ and starting
in S0 = s0 . Thus, St should have the long-term equilibrium distribution of the relative
numbers of visits to the states under Pπϑ [·|S0 = s0 ].

Algorithm 10 Monte Carlo policy gradient control

(0) Initialize ϑ, and select γ ∈ (0, 1] and step size ϱ > 0.

(1) Iterate until a stopping rule is exercised:

– Choose an initial state S0 ∈ S at random and observe an episode (11.16) for


the present policy πϑ with finite termination time T .
– Loop for t = 0, . . . , T − 1:
∗ Gt ← Σ_{u=t}^{T} γ^{u−t} Ru+1 .
∗ Policy update: ϑ ← ϑ + ϱ γ^t Gt ∇ϑ log πϑ ( At | St ).

The policy update in Algorithm 10 is also called reinforce; see Sutton–Barto [212, Section
13.3]. This reinforce uses the so-called eligibility vector
∇ϑ log πϑ ( At | St ) = (1 / πϑ ( At | St )) ∇ϑ πϑ ( At | St ) ,

which shows that the gradient ascent steps consist of the policy gradient reweighted by the
inverse occurrence probabilities of the actions At in the states St, so that more frequent
actions are not favored. Having γ < 1 also allows us to use Algorithm 10 for continuing tasks.
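
To make the policy update of Algorithm 10 concrete, the following sketch (Python/NumPy) performs one Monte Carlo policy gradient sweep for a softmax policy that is linear in the state features; the toy environment, the linear parametrization and the reward-indexing conventions are our own simplifying assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N_ACTIONS, STATE_DIM = 2, 3      # illustrative sizes
gamma, rho = 0.99, 0.05          # discount factor and step size

# Softmax policy that is linear in the state features (our simplifying choice):
# pi_theta(a|s) = softmax(theta @ s)_a, with theta of shape (|A|, state dimension).
theta = np.zeros((N_ACTIONS, STATE_DIM))

def policy(theta, s):
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_policy(theta, s, a):
    """Eligibility vector of the linear softmax policy:
    d/dtheta log pi_theta(a|s) = (one_hot(a) - pi_theta(.|s)) outer s."""
    p = policy(theta, s)
    one_hot = np.zeros(N_ACTIONS)
    one_hot[a] = 1.0
    return np.outer(one_hot - p, s)

def sample_episode(theta, T=20):
    """Toy stand-in for the environment: random states, reward 1 if the action
    matches the sign of the first state component (purely illustrative)."""
    states, actions, rewards = [], [], []
    for _ in range(T):
        s = rng.normal(size=STATE_DIM)
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        r = 1.0 if (a == 1) == (s[0] > 0) else 0.0
        states.append(s); actions.append(a); rewards.append(r)
    return states, actions, rewards

# One sweep of the Monte Carlo policy gradient (reinforce) update of Algorithm 10.
states, actions, rewards = sample_episode(theta)
T = len(rewards)
for t in range(T):
    # Discounted sum of the rewards observed from time t onward (the return G_t;
    # the indexing is simplified relative to Algorithm 10).
    G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    theta = theta + rho * gamma ** t * G_t * grad_log_policy(theta, states[t], actions[t])
```

For this linear softmax parametrization, the eligibility vector has the explicit form ∇_ϑ log π_ϑ(a|s) = (1_a − π_ϑ(·|s)) s^⊤, which is what grad_log_policy computes above.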
The policy update in Algorithm 10 can be generalized by including a state-dependent
baseline b(St) that is independent of At. This baseline cancels (in expectation) in the
gradient computations, and it leads to the reinforce-with-baseline policy update, see
Sutton–Barto [212, Section 13.4],
\[
\vartheta \;\leftarrow\; \vartheta + \varrho\, \gamma^t \left( G_t - b(S_t) \right) \nabla_\vartheta \log \pi_\vartheta(A_t \,|\, S_t).
\]

The advantage of this baseline, if chosen smartly, is that it can significantly reduce the
variance in the reinforcement learning algorithm. A typical good choice is an estimate of
the value function, b(St) = \hat{v}_{\pi_\vartheta}(St). Such a choice can significantly improve the speed of
convergence of the reinforcement learning algorithm; see, e.g., Sutton–Barto [212, Figure
13.2].
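
As a minimal sketch, assuming the eligibility vector and the value-function estimate are available as callables (all names below are our own and purely illustrative), the reinforce-with-baseline policy update reads as follows.

```python
def reinforce_with_baseline_step(theta, grad_log_pi, baseline, s_t, a_t, G_t, t,
                                 rho=0.05, gamma=0.99):
    """One reinforce-with-baseline policy update,
    theta <- theta + rho * gamma^t * (G_t - b(S_t)) * grad log pi(A_t | S_t),
    where `grad_log_pi(theta, s, a)` and `baseline(s)` are user-supplied callables,
    e.g. the eligibility vector from the sketch above and b(s) = v_hat(s)."""
    advantage = G_t - baseline(s_t)          # centring the return reduces variance
    return theta + rho * gamma ** t * advantage * grad_log_pi(theta, s_t, a_t)
```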

11.9.2 Actor-critic reinforcement learning


The reinforce-with-baseline policy update from above already comes quite close to an
actor-critic algorithm; the missing piece is that the critic is not yet involved. The
critic will not only estimate the value function vπϑ (·), but, based on this estimate, it will also give a


feedback signal to the actor. This introduces a dependence along the time line, which allows
one to improve the quality of the estimates (by bootstrapping). For illustration, we consider
the one-step actor-critic algorithm of Sutton–Barto [212, Section 13.5]. It is the analogue
of the one-step temporal difference algorithms from above, such as the SARSA Algorithm
7.
First, we select a second parametrized function class {v_w(s)}_w based on a real-valued
vector w, and we assume that all considered terms are differentiable in w. We then
modify the reinforce-with-baseline policy update by a one-step temporal difference
update, using the gain approximation (11.22) based on its expected version (11.24), and
the baseline b(St) = v_w(St)
\[
\vartheta \;\leftarrow\; \vartheta + \varrho\, \gamma^t \left( R_{t+1} + \gamma\, v_w(S_{t+1}) - v_w(S_t) \right) \nabla_\vartheta \log \pi_\vartheta(A_t \,|\, S_t). \qquad (11.27)
\]

Since this is (only) a one-step incremental update it can be done online, in contrast to
the Monte Carlo policy gradient control Algorithm 10.
Secondly, we also need to describe the value function update by temporal difference. This
can be achieved by a second gradient ascent step
\[
w \;\leftarrow\; w + \varrho' \left( R_{t+1} + \gamma\, v_w(S_{t+1}) - v_w(S_t) \right) \nabla_w v_w(S_t), \qquad (11.28)
\]

for a second learning rate ϱ′ > 0.

Algorithm 11 One-step actor-critic temporal difference algorithm

(0) Initialize ϑ and w, and select γ ∈ (0, 1] and step sizes ϱ > 0 and ϱ′ > 0.

(1) Iterate until a stopping rule is exercised:

– Choose an initial state S0 ∈ S at random.


– Loop for t ≥ 0 until terminal value † for St :
∗ For state St , sample At from πϑ (·|St ).
∗ Given state-action pair (St , At ), observe reward Rt+1 and next state St+1 .
∗ Update the value function with temporal difference (11.28).
∗ Update the policy with temporal difference (11.27), and use state St+1 as
input to the next step t + 1.

In both temporal difference steps (11.27) and (11.28), we consider a so-called
advantage function, which can take different forms depending on the specific algorithm
used. In our case it is
\[
\delta_t \;=\; R_{t+1} + \gamma\, v_w(S_{t+1}) - v_w(S_t).
\]
This compares the result of action At , given by Rt+1 + γvw (St+1 ), to the (average) value
that we would expect, given by vw (St ). Thus, the direction of the improvements is
multiplied by a step-size that adjusts for the advantage achieved by the corresponding
action.
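
To illustrate Algorithm 11, the following sketch (Python/NumPy) implements the two temporal difference updates (11.27) and (11.28) with a linear softmax policy as actor and a linear value function as critic; the toy environment, the termination rule and all dimensions are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N_ACTIONS, STATE_DIM = 2, 3                 # illustrative sizes
gamma, rho, rho_prime = 0.99, 0.02, 0.05    # discount factor and the two step sizes

theta = np.zeros((N_ACTIONS, STATE_DIM))    # actor parameters (softmax policy)
w = np.zeros(STATE_DIM)                     # critic parameters (linear value function)

def policy(theta, s):
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_policy(theta, s, a):
    p = policy(theta, s)
    one_hot = np.zeros(N_ACTIONS)
    one_hot[a] = 1.0
    return np.outer(one_hot - p, s)

def v(w, s):
    """Linear value function v_w(s) = w's, so that grad_w v_w(s) = s."""
    return w @ s

def env_step(s, a):
    """Toy environment (illustrative only): reward 1 for a 'matching' action,
    next state drawn at random, episode terminates with probability 0.05."""
    r = 1.0 if (a == 1) == (s[0] > 0) else 0.0
    s_next = rng.normal(size=STATE_DIM)
    done = rng.random() < 0.05
    return r, s_next, done

for episode in range(200):                  # outer loop of Algorithm 11
    s, t, done = rng.normal(size=STATE_DIM), 0, False
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(theta, s))    # sample A_t from pi_theta(.|S_t)
        r, s_next, done = env_step(s, a)                  # observe R_{t+1} and S_{t+1}
        v_next = 0.0 if done else v(w, s_next)
        delta = r + gamma * v_next - v(w, s)              # advantage (temporal difference error)
        w = w + rho_prime * delta * s                     # critic update (11.28)
        theta = theta + rho * gamma ** t * delta * grad_log_policy(theta, s, a)  # actor update (11.27)
        s, t = s_next, t + 1
```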
There are many notable variants of the actor-critic Algorithm 11; we refer to the vastly
growing literature in this field. We would like to close this section with the proximal policy


optimization (PPO) introduced by Schulman et al. [203]. For this, we first discuss the trust
region policy optimization (TRPO) of Schulman et al. [202]. Many policy gradient
methods lack sufficient robustness, and TRPO is a method that tries to improve on this
point. Let us come back to the temporal difference step (11.27)

\[
\vartheta \;\leftarrow\; \vartheta + \varrho\, \gamma^t\, \delta_t\, \nabla_\vartheta \log \pi_\vartheta(A_t \,|\, S_t),
\]

with advantage function δt. This gradient ascent step can stem from a maximization
problem
\[
\arg\max_\vartheta \;\; \gamma^t\, \delta_t \log \pi_\vartheta(A_t \,|\, S_t).
\]

TRPO now argues that, given an old estimate ϑold, the updated estimate ϑ should not
be too different from this previous estimate. This motivates regularization, see Section
2.4, and as penalty function we select the KL divergence of the resulting categorical dis-
tributions (we have assumed a finite action space A); for the KL divergence see (9.44).
Moreover, we replace log πϑ by a ratio of new and old policy, providing a different nor-
malization in the gradient ∇ϑ . This motivates the KL regularized optimization
\[
\arg\max_\vartheta \;\left( \gamma^t\, \delta_t\, \frac{\pi_\vartheta(A_t \,|\, S_t)}{\pi_{\vartheta_{\rm old}}(A_t \,|\, S_t)} \;-\; \eta\, D_{\rm KL}\!\left( \pi_\vartheta(\cdot \,|\, S_t)\, \big\|\, \pi_{\vartheta_{\rm old}}(\cdot \,|\, S_t) \right) \right),
\]

with regularization parameter η > 0. Since TRPO is comparably complex and does not
allow, e.g., for drop-outs during fitting, PPO proposes a simpler method with a
comparable performance, see Schulman et al. [203]. The objective that is solved is given
by
\[
\arg\max_\vartheta \; \min\left\{ \gamma^t\, \delta_t\, \frac{\pi_\vartheta(A_t \,|\, S_t)}{\pi_{\vartheta_{\rm old}}(A_t \,|\, S_t)},\;\; \gamma^t\, \delta_t\, \min\!\left( \max\!\left( \frac{\pi_\vartheta(A_t \,|\, S_t)}{\pi_{\vartheta_{\rm old}}(A_t \,|\, S_t)},\, 1-\varepsilon \right),\, 1+\varepsilon \right) \right\},
\]

for a clipping hyper-parameter ε ∈ (0, 1). Thus, the probability ratio is censored (clipped)
at 1 − ε and 1 + ε. This removes the incentive to move the probability ratio out of the
interval [1 − ε, 1 + ε]. This objective function is plotted in Figure 11.7 as a function of
the probability ratio r = r(ϑ) = πϑ(At | St)/πϑold(At | St) > 0; the shape of the resulting
function depends on the sign of the advantage function δt ∈ R.
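
Both surrogate objectives are easy to evaluate for a finite action space. The following sketch (Python/NumPy, with purely illustrative numbers of our own choosing) computes the KL-regularized TRPO surrogate for two categorical policies and the clipped PPO surrogate as a function of the probability ratio r; the latter reproduces the qualitative shape shown in Figure 11.7.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) of two categorical distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def trpo_surrogate(pi_new, pi_old, a_t, delta_t, t=0, gamma=1.0, eta=0.1):
    """KL-regularized TRPO objective for one state-action pair:
    gamma^t * delta_t * pi_new(a_t)/pi_old(a_t) - eta * D_KL(pi_new || pi_old)."""
    ratio = pi_new[a_t] / pi_old[a_t]
    return gamma ** t * delta_t * ratio - eta * kl_divergence(pi_new, pi_old)

def ppo_surrogate(ratio, delta_t, t=0, gamma=1.0, eps=0.5):
    """Clipped PPO objective for one state-action pair, as a function of the
    probability ratio r = pi_new(a_t)/pi_old(a_t):
    min( gamma^t * delta_t * r, gamma^t * delta_t * clip(r, 1 - eps, 1 + eps) )."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(gamma ** t * delta_t * ratio, gamma ** t * delta_t * clipped)

# Illustrative numbers (our own choices, not taken from the text):
pi_old = np.array([0.5, 0.3, 0.2])        # pi_{theta_old}(.|S_t)
pi_new = np.array([0.6, 0.25, 0.15])      # candidate pi_theta(.|S_t)
print(trpo_surrogate(pi_new, pi_old, a_t=0, delta_t=1.2))

r = np.linspace(0.0, 2.0, 201)            # probability ratios as on the x-axis of Figure 11.7
obj_pos = ppo_surrogate(r, delta_t=+1.0)  # capped at (1 + eps) * delta_t for large r
obj_neg = ppo_surrogate(r, delta_t=-1.0)  # decreases without bound for large r
```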


[Plot of the clipped PPO objective function against the probability ratio r ∈ [0, 2], shown separately for δt > 0 and δt < 0; see the caption below.]

Figure 11.7: Objective function of PPO for ε = 1/2 and with the probability ratios
r = r(ϑ) = πϑ ( At | St )/πϑold ( At | St ) > 0 on the x-axis.



Chapter 12

Outlook

The present version of these notes covers a large part of the AI tools that actuaries should
be familiar with, but the reader may also have noticed that some topics are still missing,
e.g., methods for interpretability, data visualization, variable importance, or a discussion
of fairness and discrimination. We are going to provide further chapters that cover
these topics, and for an overview of possible further topics, we also refer to:

https://actuary.eu/about-the-aae/continuous-professional-development/



Bibliography

[1] Abbas, A., Sutter, D., Zoufal, C., Lucchi, A., Figalli, A., Woerner, S. (2021). The power of
quantum neural networks. Nature Computational Science 1, June 2021, 403-409.
[2] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions
on Automatic Control 19/6, 716-723.
[3] Andrès, H., Boumezoued, A., Jourdain, B. (2024). Signature-based validation of real-world
economic scenarios. ASTIN Bulletin - The Journal of the IAA 54/2, 410-440.
[4] Arjovsky, M., Chintala, S., Bottou, L. (2017). Wasserstein GAN. Proceedings of the 34th
International Conference on Machine Learning (ICML), 214-223.
[5] Arnold, V.I. (1957). On functions of three variables. Doklady Akademii Nauk SSSR 114/4,
679-681.
[6] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Math-
ematical Society 68/3, 337-404.
[7] Avanzi, B., Taylor, G., Wang, M., Wong, B. (2024). Machine learning with high-cardinality
categorical features in actuarial applications. ASTIN Bulletin - The Journal of the IAA
54/2, 213-238.
[8] Ayer, M., Brunk, H.D., Ewing, G.M., Reid, W.T., Silverman, E. (1955). An empirical
distribution function for sampling with incomplete information. Annals of Mathematical
Statistics 26, 641-647.
[9] Ba, J.L., Kiros, J.R., Hinton, G.E. (2016). Layer normalization. arXiv:1607.06450.
[10] Bai, Y., Jones, A., Ndousse, K., Askell, A., Leike, J., Amodei, D. (2022). Constitutional
AI: Harmlessness from AI feedback. arXiv:2212.08073.
[11] Bailey, R.A., Simon, L.J. (1960). Two studies on automobile insurance ratemaking. ASTIN
Bulletin - The Journal of the IAA 1, 192-217.
[12] Bar-Lev, S. K., Enis, P. (1986). Reproducibility and natural exponential families with power
variance functions. The Annals of Statistics 14, 1507-1522.
[13] Bar-Lev, S.K., Kokonendji, C.C. (2017). On the mean value parametrization of the natural
exponential family - a revisited review. Mathematical Methods of Statistics 26/3, 159-175.
[14] Barlow, R.E., Bartholomew, D.J., Bremmer, J.M., Brunk, H.D. (1972). Statistical Inference
under Order Restrictions. John Wiley & Sons.
[15] Barlow, R.E., Brunk, H.D. (1972). The isotonic regression problem and its dual. Journal
of the American Statistical Association 67/337, 140-147.
[16] Barndorff-Nielsen, O. (2014). Information and Exponential Families: In Statistical Theory.
John Wiley & Sons.
[17] Beltagy, I., Peters, M.E., Cohan, A. (2020). Longformer: The long-document transformer.
arXiv:2004.05150.


[18] Bender, E.M., Koller, A. (2020). Climbing towards NLU: On meaning, form, and under-
standing in the age of data. Proceedings of the Annual Meeting of the ACL.

[19] Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N. (2015). Scheduled sampling for sequence pre-
diction with recurrent neural networks. Advances in Neural Information Processing Systems
28, 1171-1179.

[20] Bengio Y., Courville A., Vincent P. (2013). Representation learning: a review and new
perspectives. IEEE Transactions on Pattern Analysis and Machine Learning Intelligence
35/8, 1798-1828.

[21] Bengio Y., Ducharme R., Vincent P., Jauvin C. (2003). A neural probabilistic language
model. Journal of Machine Learning Research 3/Feb, 1137-1155.

[22] Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., Gauvain, J.-L. (2006). Neural proba-
bilistic language models. In: Innovations in Machine Learning. Holmes, D.E., Jain, L.C.
(Eds.). Springer, Studies in Fuzziness and Soft Computing 194, 137-186.

[23] Blæsild, P., Jensen, J.L. (1985). Saddlepoint formulas for reproductive exponential models.
Scandinavian Journal of Statistics 12/3, 193-202.

[24] Bommasani, R., Hudson, D.A., Adeli, E., et al. (2021). On the opportunities and risks of
foundation models. arXiv:2108.07258.

[25] Boureau, Y.L., Ponce, J., LeCun, Y. (2010). A theoretical analysis of feature pooling in
vision recognition. In: Proceedings of the International Conference on Machine Learning
ICML 65.

[26] Brauer, A. (2024). Enhancing actuarial non-life pricing models via transformers. European
Actuarial Journal 14/3, 991-1012.

[27] Brébisson, de A., Simon, É., Auvolat, A., Vincent, P., Bengio, Y. (2015). Artificial neural
networks applied to taxi destination prediction. arXiv:1508.00021.

[28] Breiman, L. (1996). Out-of-bag estimation. https://www.stat.berkeley.edu/~breiman/OOBestimation.pdf

[29] Breiman, L. (1996). Bagging predictors. Machine Learning 24/2, 123-140.

[30] Breiman, L. (2001). Random forests. Machine Learning 45/1, 5-32.

[31] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression
Trees. Wadsworth Statistics/Probability Series.

[32] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan,
A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners.
Advances in Neural Information Processing Systems 33, 1877-1901.

[33] Brunk, H.D., Ewing, G.M., Utz, W.R. (1957). Minimizing integrals in certain classes of
monotone functions. Pacific Journal of Mathematics 7, 833-847.

[34] Bühlmann, H. (1967). Experience rating and credibility. ASTIN Bulletin - The Journal of
the IAA 4/3, 199-207.

[35] Bühlmann, H., Gisler, A. (2005). A Course in Credibility Theory and its Applications.
Springer Universitext.

[36] Bühlmann, H., Straub, E. (1970). Glaubwürdigkeit für Schadensätze. Mitteilungen der
Schweizerischen Vereinigung der Versicherungsmathematiker 1970, 111-131.


[37] Bühlmann, P. (2002). Consistency for L2 boosting and matching pursuit with trees and tree-
type basis functions. In: Research Report/Seminar für Statistik, Eidgenössische Technische
Hochschule (ETH), Vol. 109. Seminar für Statistik, Eidgenössische Technische Hochschule
(ETH).

[38] Burges, C.J.C. (1998). Tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery 2, 121-167.

[39] Cao, Q., Sutton, C., Liska, A., Titov, I. (2021). Neuro-symbolic probing in neural language
models. Proceedings of the Annual Meeting of the ACL.

[40] Chaubard, F., Mundra, R., Socher, R. (2016). Deep Learning for Natural Language Pro-
cessing. Lecture Notes, Stanford University.

[41] Chen, T., Guestrin, C. (2016). XGBoost: a scalable tree boosting system. In: KDD ’16:
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 785-794.

[42] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Ben-
gio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv:1406.1078.

[43] Chowdhery, A., Narang, A., Devlin, J., et al. (2022). PaLM: Scaling language modeling
with pathways. arXiv:2204.02311.

[44] Christiano, P.F., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D. (2017). Deep
reinforcement learning from human preferences. Advances in Neural Information Processing
Systems 30, 4299-4307.

[45] Cortes, C., Vapnik, V. (1995). Support-vector networks. Machine Learning 20/3, 273-297.

[46] Cunningham, H., Ewart, A., Riggs, L., Huben, R., Sharkey, L. (2023). Sparse autoencoders
find highly interpretable features in language models. arXiv:2309.08600.

[47] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics


of Control, Signals and Systems 2, 303-314.

[48] DeepSeek Research Team (2025). DeepSeek R1: A family of open-source reasoning LLMs.
arXiv:2501.12948.

[49] Delong, Ł, Kozak, A. (2023). The use of autoencoders for training neural networks with
mixed categorical and numerical features. ASTIN Bulletin - The Journal of the IAA 53/2,
213-232.

[50] Delong, Ł., Lindholm, M. and Zakrisson, H., (2023). On cyclic gradient boosting machines.
SSRN Manuscript ID 4352505.

[51] Delong, Ł, Wüthrich, M.V. (2024). Isotonic regression for variance estimation and its role
in mean estimation and model validation. North American Actuarial Journal, in press.

[52] Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood for incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, Series B 39/1, 1-22.

[53] Denuit, M., Charpentier, A., Trufin, J. (2021). Autocalibration and Tweedie-dominance for
insurance pricing in machine learning. Insurance: Mathematics and Economics 101/B,
485-497.

[54] Denuit, M., Trufin, J. (2021). Lorenz curve, Gini coefficient, and Tweedie dominance for
autocalibrated predictors. LIDAM Discussion Paper ISBA 2021/36.


[55] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of deep
bidirectional Transformers for language understanding. arXiv:1810.04805.
[56] Dietterich, T.G. (2000). An experimental comparison of three methods for constructing
ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40,
139-157.
[57] Dietterich, T.G. (2000). Ensemble methods in machine learning. In: Multiple Classifier
Systems, Kittler, J., Roli, F. (eds.). Lecture Notes in Computer Science 1857. Springer,
1-15.
[58] Donaldson, J. (2016). t-distributed stochastic neighbor embedding for R (t-SNE). R package
tsne.
[59] Duan, T., Anand, A., Ding, D.Y., Thai, K.K., Basu, S., Ng, A., Schuler, A. (2020). NGBoost:
Natural gradient boosting for probabilistic prediction. In: International Conference
on Machine Learning, Proceedings of Machine Learning Research, 2690-2700.
[60] Dutang, C., Charpentier, A., Gallic, E. (2024). Insurance dataset. Recherche Data Gouv.
https://doi.org/10.57745/P0KHAG
[61] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics
7/1, 1-26.
[62] Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall.
[63] Embrechts, P., Klüppelberg, C., Mikosch, T. (2003). Modelling Extremal Events for Insur-
ance and Finance. 4th printing. Springer.
[64] Ester, M., Kriegel, J.P., Sander, J., Xu, X. (1996). A density-based algorithm for discovering
clusters in large spatial databases with noise. In: Proceedings of the Second International
Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. 226-231.
[65] Fahrmeir, L., Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized
Linear Models. Springer.
[66] Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association 96/456 1348-1360.
[67] Ferrario, A., Hämmerli, R. (2019). On boosting: theory and applications. SSRN Manuscript
ID 3402687.
[68] Fisher, R.A. (1934). Two new properties of mathematical likelihood. Proceedings of the Royal
Society A 144/852, 285-307.
[69] Fissler, T., Lorentzen, C., Mayer, M. (2022). Model comparison and calibration assess-
ment: user guide for consistent scoring functions in machine learning and actuarial practice.
arXiv:2202.12780.
[70] Friedman, J., H. (2001). Greedy function approximation: a gradient boosting machine.
Annals of Statistics 25/5, 1189-1232.
[71] Fritschi, S., Guenther, F., Wright, M.N., Suling, M., Mueller, S.M. (2019). Training of
neural networks. R package neuralnet.
[72] Gao, G., Wang, H., Wüthrich, M.V. (2022). Boosting Poisson regression models with telem-
atics car driving data. Machine Learning 111/1, 243-272.
[73] Ghosh, P., Sajjadi, M.S.M., Vergari, A., Black, M., Schölkopf, B., (2020). From varia-
tional to deterministic autoencoders. International Conference on Learning Representations
(ICLR).


[74] Gini, C. (1912). Variabilità e Mutabilità. Contributo allo Studio delle Distribuzioni e delle
Relazioni Statistiche. C. Cuppini, Bologna.
[75] Gini, C. (1936). On the measure of concentration with special reference to income and
statistics. Colorado College Publication, General Series No. 208, 73-79.
[76] Glorot, X., Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In: Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, Proceedings of Machine Learning Research 9, 249-256.
[77] Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Sta-
tistical Association 106/494, 746-762.
[78] Gneiting, T., Raftery, A.E. (2007). Strictly proper scoring rules, prediction, and estimation.
Journal of the American Statistical Association 102/477, 359-378.
[79] Gneiting, T., Ranjan, R. (2013). Combining predictive distributions. Electronic Journal of
Statistics 7, 1747-1782.
[80] Gneiting, T., Resin, J. (2023). Regression diagnostics meets forecast evaluation: conditional
calibration, reliability diagrams, and coefficient of determination. Electronic Journal of
Statistics 17, 3226-3286.
[81] Goldburd, M., Khare, A., Tevet, D., Guller, D. (2020). Generalized Linear Models for
Insurance Rating. 2nd edition. CAS Monograph Series, 5.
[82] Golub, G., Van Loan, C. (1983). Matrix Computations. John Hopkins University Press.
[83] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press, http://www.deeplearningbook.org
[84] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Pro-
cessing Systems 27.
[85] Gorishniy, Y., Kotelnikov, A., Babenko, A. (2024). TabM: advancing tabular deep learning
with parameter-efficient ensembling. arXiv:2410.24210.
[86] Gorishniy, Y., Rubachev, I., Khrulkov, V., Babenko, A. (2021). Revisiting deep learning
models for tabular data. In: Beygelzimer, A., Dauphin, Y., Liang, P., Wortman Vaughan,
J. (eds). Advances in Neural Information Processing Systems, 34. Curran Associates, Inc.,
New York, 18932-18943.
[87] Gourieroux, C., Montfort, A., Trognon, A. (1984). Pseudo maximum likelihood methods:
theory. Econometrica 52/3, 681-700.
[88] Guo, C., Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv:1604.06737.
[89] Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q. (2017). On calibration of modern neural networks.
Proceedings of the 34th International Conference on Machine Learning (ICML), 1321-1330.
[90] Hainaut, D., Trufin, J., Denuit, M. (2022). Response versus gradient boosting trees, GLMs
and neural networks under Tweedie loss and log-link. Scandinavian Actuarial Journal
2022/10, 841-866.
[91] Hasselt, van H., Guez, A., Silver, D. (2015). Deep reinforcement learning with double Q-
learning. arXiv:1509.06461.
[92] Hastie, T., Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical
Society Series B: Statistical Methodology 55/4, 757-779.

Version March 3, 2025, @AI Tools for Actuaries


264 Bibliography

[93] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. 2nd edition. Springer Series in Statistics.
[94] Hastie, T., Tibshirani, R., Wainwright, M. (2015). Statistical Learning with Sparsity: The
Lasso and Generalizations. CRC Press.
[95] Havrylenko, Y., Heger, J. (2024) Detection of interacting variables for generalized linear
models via neural networks. European Actuarial Journal 14/2, 551-580.
[96] He, K., Zhang, X., Ren, S., Sun, J. (2015). Deep residual learning for image recognition.
arXiv:1512.03385.
[97] Hinton, G.E., Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural
networks. Science 313/5786, 504-507.
[98] Hinton, G., Srivastava, N., Swersky, K. (2012). Neural Networks for Machine Learning.
Lecture Slides. University of Toronto.
[99] Ho, J., Jain, A., Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in
Neural Information Processing Systems 33, 6840-6851.
[100] Hochreiter, S., Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9/8,
1735-1780.
[101] Hofner, B., Mayr, A., Robinzonov, N., Schmid, M. (2014). Model-based boosting in R: A
hands-on tutorial using the R package mboost. Computational Statistics 29, 3-35.
[102] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural
Networks 4/2, 251-257.
[103] Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are uni-
versal approximators. Neural Networks 2/5, 359-366.
[104] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A.,
Attariyan, M., Gelly, S. (2019). Parameter-efficient transfer learning for NLP. Proceedings
of the 36th International Conference on Machine Learning (ICML), 2790-2799.
[105] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.
(2022). LoRA: Low-rank adaptation of large language models. International Conference
on Learning Representations (ICLR).
[106] Huang, X., Khetan, A., Cvitkovic, M., Karnin, Z. (2020). TabTransformer: Tabular data
modeling using contextual embeddings. arXiv:2012.06678.
[107] Ioffe, S., Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In: Proceedings of the 32nd International Conference on
Machine Learning 37, 448-456.
[108] Isenbeck, M., Rüschendorf, L. (1992). Completeness in location families. Probability and
Mathematical Statistics 13/2, 321-343.
[109] James, G., Witten, D., Hastie, T., Tibshirani, R. (2015). An Introduction to Statistical
Learning. With Applications in R. Corrected 6th printing. Springer.
[110] Jørgensen, B. (1986). Some properties of exponential dispersion models. Scandinavian Jour-
nal of Statistics 13/3, 187-197.
[111] Jørgensen, B. (1987). Exponential dispersion models. Journal of the Royal Statistical Soci-
ety, Series B 49/2, 127-145.
[112] Jørgensen, B. (1997). The Theory of Dispersion Models. Chapman & Hall.

Version March 3, 2025, @AI Tools for Actuaries


Bibliography 265

[113] Kaplan, J., McCandlish, S., Henighanm, T., et al. (2020). Scaling laws for neural language
models. arXiv:2001.08361.
[114] Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. MSc Thesis. Department of Mathematics, University of Chicago.
[115] Kaufman, L., Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster
Analysis. John Wiley & Sons.
[116] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y. (2017).
LightGBM: a highly efficient gradient boosting decision tree. Advances in Neural Information
Processing Systems 30, 3146-3154.
[117] Khattab, O., Singhvi, A., Maheshwari, P., et al. (2023). Dspy: Compiling declarative lan-
guage model calls into self-improving pipelines. arXiv:2310.03714.
[118] Kingma, D., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
[119] Kingma, D.P., Welling, M. (2013). Auto-encoding variational Bayes. arXiv:1312.6114.
[120] Kingma, D.P., Welling, M. (2019). An introduction to variational autoencoders. Founda-
tions and Trends in Machine Learning 12/4, 307-392.
[121] Kirsch, L., Wang, Y., Zhao, Y., Pickett, M. (2022). Meta-learning in-context transformers.
arXiv:2209.07680.
[122] Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biolog-
ical Cybernetics 43, 59-69.
[123] Kohonen, T. (2001). Self-Organizing Maps. 3rd edition. Springer.
[124] Kohonen, T. (2013). Essentials of the self-organizing map. Neural Networks 37, 52-65.
[125] Kolmogorov, A. (1957). On the representation of continuous functions of many variables
by superposition of continuous functions of one variable and addition. Doklady Akademii
Nauk SSSR 114/5, 953-956.
[126] Kramer, M.A. (1991). Nonlinear principal component analysis using autoassociative neural
networks. AIChE Journal 37/2, 233-243.
[127] Krüger, F., Ziegel, J.F. (2021). Generic conditions for forecast dominance. Journal of Busi-
ness and Economics Statistics 39/4, 972-983.
[128] Kruskal, J.B. (1964). Nonmetric multidimensional scaling. Psychometrika 29, 115-129.
[129] Kuhn, H.W., Tucker, A.W. (1951). Nonlinear programming. In: Proceedings of 2nd Berkeley
Symposium. University of California Press, 481-492.
[130] LeCun, Y., Bengio, Y. (1995). Convolutional networks for images, speech, and time series.
The Handbook of Brain Theory and Neural Networks 3361/10.
[131] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE 86/11, 2278-2324.
[132] Lee, R.D., Carter, L.R. (1992). Modeling and forecasting U.S. mortality. Journal of the
American Statistical Association 87/419, 659-671.
[133] Leeuw, de J., Hornik, K., Mair, P. (2009). Isotone optimization in R: pool-adjacent-violators
algorithm (PAVA) and active set methods. Journal of Statistical Software 32/5, 1-24.
[134] Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S. (1993). Multilayer feedforward networks
with a nonpolynomial activation function can approximate any function. Neural Networks
6/6, 861-867.

Version March 3, 2025, @AI Tools for Actuaries


266 Bibliography

[135] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V.,
Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural lan-
guage generation, translation, and comprehension. Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, 7871-7880.
[136] Li, X., Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics,
4582-4597.
[137] Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning
and teaching. Machine Learning 8/3-4, 293-321.
[138] Lindholm, M., Lindskog, F., Palmquist, J. (2023). Local bias adjustment, duration-weighted
probabilities, and automatic construction of tariff cells. Scandinavian Actuarial Journal
2023/10, 946-973.
[139] Lindholm, M., Palmborg, L. (2022). Efficient use of data for LSTM mortality forecasting.
European Actuarial Journal 12/2, 749-778.
[140] Lindholm, M., Wüthrich, M.V. (2024). The balance property in insurance pricing. SSRN
Manuscript ID 4925165.
[141] Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T.Y., Tegmark,
M. (2024). KAN: Kolmogorov–Arnold networks. arXiv:2404.19756.
[142] Loader, C. (1999). Local Regression and Likelihood. Springer.
[143] Lorentzen, C., Mayer, M., Wüthrich, M.V. (2022). Gini index and friends. SSRN Manuscript
ID 4248143.
[144] Lorenz, M.O. (1905). Methods of measuring the concentration of wealth. Publications of
the American Statistical Association 9/70, 209-219.
[145] Loshchilov, I., Hutter, F. (2017). Decoupled weight decay regularization. International Con-
ference on Learning Representations (ICLR).
[146] Mack, T. (1993). Distribution-free calculation of the standard error of chain ladder reserve
estimates. ASTIN Bulletin - The Journal of the IAA 23/2, 213-225.
[147] Makhzani, A., Frey, B. (2014). K-sparse autoencoders. International Conference on Learn-
ing Representations (ICLR).
[148] Mallat, S., Zhang, Z. (1993). Matching pursuits with time frequency dictionaries. IEEE
Transactions on Signal Processing 41, 3397-3415.
[149] Mayr, A., Fenske, N., Hofner, B., Kneib, T., Schmid, M., (2012). Generalized additive
models for location, scale and shape for high dimensional data – a flexible approach based
on boosting. Journal of the Royal Statistical Society Series C: Applied Statistics 61/3,
403-427.
[150] McCullagh, P., Nelder, J.A. (1983). Generalized Linear Models. Chapman & Hall.
[151] McInnes, L., Healy, J., Melville, J. (2018). UMAP: uniform manifold approximation and
projection for dimension reduction. arXiv:1802.03426v2.
[152] McLachlan, G.J., Krishnan, T. (2008). The EM Algorithm and Extensions. 2nd edition.
John Wiley & Sons.
[153] Menon, A.K., Jiang, X., Vembu, S., Elkan, C., Ohno-Machado, L. (2012). Predicting accu-
rate probabilities with ranking loss. ICML’12: Proceedings of the 29th International Con-
ference on Machine Learning, 659-666.

Version March 3, 2025, @AI Tools for Actuaries


Bibliography 267

[154] Mercer, J. (1909). Functions of positive and negative type and their connection with the
theory of integral equations. Philosophical Transactions of the Royal Society A 209/441-
458, 415-446.
[155] Micci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes
in classification and prediction problems. ACM SIGKDD Explorations Newsletter 3/1, 27-
32.
[156] Mikolov, T., Chen, K., Corrado, G.S., Dean, J. (2013). Efficient estimation of word repre-
sentations in vector space. arXiv:1301.3781.
[157] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J. (2013). Distributed represen-
tations of words and phrases and their compositionality. Advances in Neural Information
Processing Systems 26, 3111-3119.
[158] Miles, R.E. (1959). The complete amalgamation into blocks, by weighted means, of a finite
set of real numbers. Biometrika 46, 317-327.
[159] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller,
M. (2013). Playing Atari with deep reinforcement learning. arXiv:1312.5602.
[160] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves,
A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Antonoglou,
I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D. (2015). Human-level control
through deep reinforcement learning. Nature 518/7540, 529-533.
[161] Murphy, A.H. (1973). A new vector partition of the probability score. Journal of Applied
Meteorology 12/4, 595-600.
[162] Murphy, K.P. (2024). Reinforcement learning: an overview. arXiv:2412.05265.
[163] Nanda, N., Lindner, J., Belrose, C., Olsson, C. (2023). Activation patching for mechanistic
interpretability. Proceedings of the Mechanistic Interpretability Workshop.
[164] Nelder, J.A., Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the Royal
Statistical Society, Series A 135/3, 370-384.
[165] Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. Tech-
nical Report 76. Center for Operations Research and Econometrics (CORE), Catholic Uni-
versity of Louvain.
[166] Nesterov, Y. (2018). Lectures on Convex Optimization. Springer.
[167] Nigri, A., Levantesi, S., Marino, M., Scognamiglio, S., Perla, F. (2019). A deep learning
integrated Lee–Carter model. Risks 7/1, 33.
[168] NovaSky Team (2025). Sky-T1: Train your own O1 preview model within $450. https://novasky-ai.github.io/posts/sky-t1, Accessed: 2025-01-09.
[169] Odaibo, S. (2019). Tutorial: deriving the standard variational autoencoder (VAE) loss
function. arXiv:1907.08956.
[170] Olah, C., Mordvintsev, A., Schubert, L. (2018). Feature visualization. Distill. https://distill.pub/2017/feature-visualization
[171] Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K., Mordvintsev, A.
(2020). An overview of early vision in InceptionV1. Distill. https://distill.pub/2020/circuits/early-vision
[172] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Christiano, P., Leike, J., Amodei, D. (2022).
Training language models to follow instructions with human feedback. arXiv:2203.02155.

Version March 3, 2025, @AI Tools for Actuaries


268 Bibliography

[173] Palmborg, L, Lindskog, F. (2023). Premium control with reinforcement learning. ASTIN
Bulletin - The Journal of the IAA 53/2, 233-257.
[174] Pan, J., Zhang, J., Wang, X., Yuan, L., Peng, H., Suhr, A. (2025). TinyZero. https://github.com/Jiayi-Pan/TinyZero, accessed: 2025-01-24.
[175] Pennington, J., Socher, R., Manning, C.D. (2014). GloVe: global vectors for word repre-
sentation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), 1532-1543.
[176] Perla, F., Richman, R., Scognamiglio, S., Wüthrich, M.V. (2024). Accurate and explainable
mortality forecasting with the LocalGLMnet. Scandinavian Actuarial Journal 2024/7, 1-
23.
[177] Pohle, M.-O. (2020). The Murphy decomposition and the calibration-resolution principle:
A new perspective on forecast evaluation. arXiv:2005.01835.
[178] Puterman, M.L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Program-
ming. John Wiley & Sons.
[179] R Core Team (2021). R: A language and environment for statistical computing. R Founda-
tion for Statistical Computing, Vienna, Austria. http://www.R-project.org/.
[180] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. (2018). Improving language un-
derstanding by generative pre-training. OpenAI Technical Report.
[181] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu,
P.J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer.
Journal of Machine Learning Research 21/140, 1-67.
[182] Raschka, S. (2025). Understanding reasoning LLMs: Methods and strategies for building
and refining reasoning models. Blog Post, February 5, 2025.
[183] Rentzmann, S., Wüthrich, M.V. (2019). Unsupervised learning: What is a sports car?
SSRN Manuscript ID 3439358.
[184] Rezende, D.J., Mohamed, S., Wierstra, D. (2014). Stochastic backpropagation and approx-
imate inference in deep generative models. Proceedings of the 31st International Conference
on Machine Learning, 1278-1286.
[185] Ribeiro, M.T., Singh, S., Guestrin, C. (2016). “Why should I trust you?”: explaining
the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’16. New York: Association
for Computing Machinery, 1135-1144.
[186] Richman, R. (2021). AI in actuarial science - a review of recent advances - part 1. Annals
of Actuarial Science 15/2, 207-229.
[187] Richman, R. (2021). AI in actuarial science - a review of recent advances - part 2. Annals
of Actuarial Science 15/2, 230-258.
[188] Richman, R., Scognamiglio, S., and Wüthrich, M. V. (2025). The credibility transformer.
European Actuarial Journal, in press.
[189] Richman, R., Wüthrich, M.V. (2020). Nagging predictors. Risks 8/3, article 83.
[190] Richman, R., Wüthrich, M.V. (2023). LocalGLMnet: interpretable deep learning for tabular
data. Scandinavian Actuarial Journal 2023/1, 71-95.
[191] Richman, R., Wüthrich, M.V. (2023). LASSO regularization within the LocalGLMnet ar-
chitecture. Advances in Data Analysis and Classification 17/4, 951-981.

Version March 3, 2025, @AI Tools for Actuaries


Bibliography 269

[192] Richman, R., Wüthrich, M.V. (2024). High-cardinality categorical covariates in network
regressions. Japanese Journal of Statistics and Data Science 7/2, 921-965.
[193] Richman, R., Wüthrich, M.V. (2024). Smoothness and monotonicity constraints for neural
networks using ICEnet. Annals of Actuarial Science 18/3, 712-739.
[194] Ridgeway, G. (2024). Generalized boosted models: a guide to the gbm package. https://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf
[195] Ronneberger, O., Fischer, P., Brox, T. (2015). U-Net: Convolutional networks for biomed-
ical image segmentation. Medical Image Computing and Computer-Assisted Intervention
(MICCAI), 234-241.
[196] Rumelhart, D.E., Hinton, G.E., Williams, R.J. (1986). Learning representations by back-
propagating errors. Nature 323/6088, 533-536.
[197] Saerens, M. (2000). Building cost functions minimizing to some summary statistics. IEEE
Transactions on Neural Networks 11, 1263-1271.
[198] Savage, L.J. (1971). Elicitation of personal probabilities and expectations. Journal of the
American Statistical Association 66/336, 783-810.
[199] Schervish, M.J. (1989). A general method of comparing probability assessors. The Annals
of Statistics 17/4, 1856-1879.
[200] Schölkopf, B., Smola, A., Müller, K.R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10/5, 1299-1319.
[201] Schubert, E., Rousseeuw, P.J. (2019). Faster k-medoids clustering: improving the PAM,
CLARA, and CLARANS algorithms. arXiv:1810.05691v3.
[202] Schulman, J., Levine, S., Moritz, P., Jordan, M.I., Abbeel, P. (2015). Trust region policy
optimization. arXiv:1502.05477.
[203] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal policy
optimization algorithms. arXiv:1707.06347.
[204] Schwarz, G.E. (1978). Estimating the dimension of a model. Annals of Statistics 6/2, 461-
464.
[205] Scognamiglio, S. (2022). Calibrating the Lee–Carter and the Poisson Lee–Carter models
via neural networks. ASTIN Bulletin - The Journal of the IAA 52/2, 519-561.
[206] Semenovich, D., Dolman, C. (2020). What makes a good forecast? Lessons from meteorol-
ogy. 20/20 All-Actuaries Virtual Summit, The Institute of Actuaries, Australia.
[207] Shmueli, G. (2010). To explain or to predict? Statistical Science 25/3, 289-310.
[208] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S. (2015). Deep unsupervised
learning using nonequilibrium thermodynamics. Proceedings of the 32nd International Con-
ference on Machine Learning, 2256-2265.
[209] Song, Y., Ermon, S. (2019). Generative modeling by estimating gradients of the data dis-
tribution. Advances in Neural Information Processing Systems 32.
[210] Srivastava, N., Hinton, G., Krizhevsky, A. Sutskever, I., Salakhutdinov, R. (2014). Dropout:
a simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research 15/56, 1929-1958.
[211] Su, J., Lu, R., Huang, G., Liang, Y., Xia, F. (2021). RoFormer: Enhanced transformer
with rotary position embedding. arXiv:2104.09864.

Version March 3, 2025, @AI Tools for Actuaries


270 Bibliography

[212] Sutton, R.S., Barto, A.G. (2018). Reinforcement Learning: An Introduction. MIT Press.

[213] Tasche, D. (2006). Validation of internal rating systems and PD estimates. arXiv:0606071.

[214] Tasche, D. (2021). Calibrating sufficiently. Statistics: A Journal of Theoretical and Applied
Statistics 55/6, 1356-1386.

[215] Therneau, T.M., Atkinson, E.J. (2015). An introduction to recursive partitioning using the
RPART routines. R Vignettes, version of June 29, 2015. Mayo Foundation, Rochester.

[216] Thomson, W. (1979). Eliciting production possibilities from a well-informed manager. Jour-
nal of Economic Theory 20, 360-380.

[217] Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the
Royal Statistical Society, Series B 58/1, 267-288.

[218] Tibshirani, R., Saunders, M., Rosset, S., Knight, K. (2005). Sparsity and smoothness via
the fused LASSO. Journal of the Royal Statistical Society, Series B 67/1, 91-108.

[219] Tikhonov, A.N. (1943). On the stability of inverse problems. Doklady Akademii Nauk SSSR
39/5, 195-198.

[220] Tomczak, J.M. (2024). Deep Generative Modeling. Springer.

[221] Tsyplakov, A. (2013). Evaluation of probabilistic forecasts: proper scoring rules and mo-
ments. SSRN Manuscript ID 2236605.

[222] Tweedie, M.C.K. (1984). An index which distinguishes between some important exponential
families. In: Statistics: Applications and New Directions. Ghosh, J.K., Roy, J. (Eds.). Pro-
ceeding of the Indian Statistical Golden Jubilee International Conference, Indian Statistical
Institute, Calcutta, 579-604.

[223] van der Maaten, L.J.P., Hinton, G.E. (2008). Visualizing data using t-SNE. Journal of
Machine Learning Research 9, 2579-2605.

[224] van der Merwe, M., Richman, R. (2024). Responsible AI: The role of actuaries in bridging
the trust deficit. Presented at the ASSA 2024 Convention in Cape Town.

[225] Vapnik, V.N. (1997). The support vector method. In: Artificial Neural Networks –
ICANN’97, Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (eds.). Lecture Notes in
Computer Science. Vol. 1327. Springer.

[226] Vapnik, V.N., Chervonenkis, A.Y. (1964). On a class of perceptrons. Avtomatika i Tele-
mekhanika 25/1.

[227] Vapnik, V.N., Chervonenkis, A.Y. (1964). On a class of algorithms of learning pattern
recognition. Avtomatika i Telemekhanika 25/6.

[228] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
Polosukhin, I. (2017). Attention is all you need. arXiv:1706.03762v5.

[229] Wager, S., Wang, S., Liang, P.S. (2013). Dropout training as adaptive regularization. In:
Advances in Neural Information Processing Systems 26. Burges, C., Bottou, L., Welling,
M., Ghahramani, Z., Weinberger, K. (Eds.). Curran Associates, 351-359.

[230] Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of
Mathematical Statistics 20/4, 595-601.

[231] Wang, C.W., Zhang, J., Zhu, W. (2021). Neighbouring prediction for mortality. ASTIN
Bulletin: The Journal of the IAA 51/3, 689-718.


[232] Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD Thesis, University of Cam-
bridge.
[233] Watkins, C.J.C.H., Dayan, P. (1992). Q-learning. Machine Learning 8/3-4, 279-292.
[234] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q.,
Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models.
arXiv:2201.11903.
[235] Werbos, P.J. (1988). Generalization of backpropagation with application to a recurrent gas
market model. Neural Networks 1/4, 339-356.
[236] Williams, R.J., Zipser, D. (1989). A learning algorithm for continually running fully recur-
rent neural networks. Neural Computation 1/2, 270-280.
[237] Wu, C.F.J. (1983). On the convergence properties of the EM algorithm. The Annals of
Statistics 11/1, 95-103.
[238] Wüthrich, M.V. (2020). Bias regularization in neural network models for general insurance
pricing. European Actuarial Journal 10/1, 179-202.
[239] Wüthrich, M.V. (2023). Model selection with Gini indices under auto-calibration. European
Actuarial Journal 13/1, 71-95.
[240] Wüthrich, M.V. (2025). Auto-calibration tests for discrete finite regression functions. Eu-
ropean Actuarial Journal, in press.
[241] Wüthrich, M.V., Buser, C. (2016). Data Analytics for Non-Life Insurance Pricing. SSRN
Manuscript ID 2870308, Version of June 19, 2023.
[242] Wüthrich, M.V., Merz, M. (2019). Editorial: Yes, we CANN! ASTIN Bulletin - The Journal
of the IAA 49/1, 1-3.
[243] Wüthrich, M.V., Merz, M. (2023). Statistical Foundations of Actuarial Learning
and its Applications. Springer Actuarial. https://link.springer.com/book/10.1007/978-3-031-12409-9
[244] Wüthrich, M.V., Ziegel, J. (2024). Isotonic recalibration under a low signal-to-noise ratio.
Scandinavian Actuarial Journal 2024/3, 279-299.
[245] Xie, S.M., Yala, L., Liu, Q. (2021). An explanation of in-context learning as implicit
Bayesian inference. arXiv:2110.08387.
[246] Yuan, X.T., Lin, Y. (2007). Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society, Series B 68/1, 49-67.
[247] Zakrisson, H., Lindholm, M. (2025). A tree-based varying coefficient model. Computational
Statistics, in press.
[248] Zhang, T., Yu, B. (2005). Boosting with early stopping: Convergence and consistency. The
Annals of Statistics 33/4, 1538-1579.
[249] Zhou, Y., Hooker, G. (2022). Decision tree boosted varying coefficient models. Data Mining
and Knowledge Discovery 36/6, 2237-2271.
[250] Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society, Series B 67/2, 301-320.
