Statistical Foundations of Actuarial Learning and its Applications

Mario V. Wüthrich
Michael Merz
Springer Actuarial
Editors-in-Chief
Hansjoerg Albrecher, University of Lausanne, Lausanne, Switzerland
Michael Sherris, UNSW, Sydney, NSW, Australia
Series Editors
Daniel Bauer, University of Wisconsin-Madison, Madison, WI, USA
Stéphane Loisel, ISFA, Université Lyon 1, Lyon, France
Alexander J. McNeil, University of York, York, UK
Antoon Pelsser, Maastricht University, Maastricht, The Netherlands
Ermanno Pitacco, Università di Trieste, Trieste, Italy
Gordon Willmot, University of Waterloo, Waterloo, ON, Canada
Hailiang Yang, The University of Hong Kong, Hong Kong, Hong Kong
This is a series on actuarial topics in a broad and interdisciplinary sense, aimed at
students, academics and practitioners in the fields of insurance and finance.
Springer Actuarial provides timely information on theoretical and practical aspects of topics like risk management, internal models, solvency, asset-liability management, market-consistent valuation, the actuarial control cycle, insurance and financial mathematics, and other related interdisciplinary areas.
The series aims to serve as a primary scientific reference for education, research,
development and model validation.
The type of material considered for publication includes lecture notes, monographs and textbooks. All submissions will be peer-reviewed.
Mario V. Wüthrich
Department of Mathematics, RiskLab
ETH Zürich
Zürich, Switzerland

Michael Merz
Faculty of Business Administration
University of Hamburg
Hamburg, Germany
This work was supported by Schweizerische Aktuarvereinigung SAV and Swiss Re.
Mathematics Subject Classification: C13, C21/31, C24/34, G22, 62F10, 62F12, 62J07, 62J12, 62M45,
62P05, 68T01, 68T50
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Acknowledgments
We kindly thank our very generous sponsors, the Swiss Association of Actuaries
(SAA) and Swiss Re, for financing the open access option of the electronic version
of this book. Our special thanks go to Sabine Betz (President of SAA), Adrian
Kolly (Swiss Re), and Holger Walz (SAA) who were very positive and interested in
this book project from the very beginning, and who made this open access funding
possible within their institutions.
A very special thank you goes to Hans Bühlmann who has been supporting us
over the last 30 years. We have had so many inspiring discussions over these years,
and we have greatly benefited and learned from Hans’ incredible knowledge and
intuition.
Jointly with Christoph Buser, we started teaching the lecture “Data Analytics for Non-Life Insurance Pricing” at ETH Zurich in 2018. Our data analytics lecture
focuses (only) on the Poisson claim counts case, but its lecture notes have provided
a first draft for this book project. This draft has been developed and extended to
the general case of the exponential family. Since our first lecture, we have greatly
benefited from interactions with many colleagues and students. In particular, we
would like to mention the data science initiative “Actuarial Data Science” of the
Swiss Association of Actuaries (chaired by Jürg Schelldorfer), whose tutorials
provided a great stimulus for this book. Moreover, we mention the annual Insurance
Data Science Conference (chaired by Markus Gesmann and Andreas Tsanakas) and
the ASTIN Reading Club (chaired by Ronald Richman and Dimitri Semenovich).
Furthermore, we would like to kindly thank Ronald Richman who has always been
a driving force behind learning and adapting new machine learning techniques, and
we also kindly thank Simon Rentzmann for many interesting discussions on how to apply these techniques to real insurance problems.
We also thank, in alphabetical order, the colleagues with whom we collaborated and had inspiring discussions in the field of statistical learning: Johannes Abegglen, Hansjörg Albrecher, Davide Apolloni, Peter Bühlmann, Christoph Buser, Patrick Cheridito, Łukasz Delong, Paul
Embrechts, Andrea Ferrario, Tobias Fissler, Luca Fontana, Daisuke Frei, Tsz Chai
Fung, Guangyuan Gao, Yan-Xing Lan, Gee Lee, Mathias Lindholm, Christian
Lorentzen, Friedrich Loser, Michael Mayer, Daniel Meier, Alexander Noll, Gareth
Peters, Jan Rabenseifner, Peter Reinhard, Simon Rentzmann, Ronald Richman,
Ludger Rüschendorf, Robert Salzmann, Marc Sarbach, Jürg Schelldorfer, Pavel
Shevchenko, Joël Thomann, Andreas Tsanakas, George Tzougas, Emiliano Valdez,
Tim Verdonck, and Patrick Zöchbauer.
Contents

1 Introduction  1
1.1 The Statistical Modeling Cycle  1
1.2 Preliminaries on Probability Theory  3
1.3 Lab: Exploratory Data Analysis  7
1.4 Outline of This Book  9
2 Exponential Dispersion Family  13
2.1 Exponential Family  13
2.1.1 Definition and Properties  13
2.1.2 Single-Parameter Linear EF: Count Variable Examples  18
2.1.3 Vector-Valued Parameter EF: Absolutely Continuous Examples  20
2.1.4 Vector-Valued Parameter EF: Count Variable Example  27
2.2 Exponential Dispersion Family  28
2.2.1 Definition and Properties  28
2.2.2 Exponential Dispersion Family Examples  31
2.2.3 Tweedie’s Distributions  34
2.2.4 Steepness of the Cumulant Function  37
2.2.5 Lab: Large Claims Modeling  38
2.3 Information Geometry in Exponential Families  40
2.3.1 Kullback–Leibler Divergence  40
2.3.2 Unit Deviance and Bregman Divergence  42
3 Estimation Theory  49
3.1 Introduction to Decision Theory  49
3.2 Parameter Estimation  51
3.3 Unbiased Estimators  56
3.3.1 Cramér–Rao Information Bound  56
3.3.2 Information Bound in the Exponential Family Case  62
Bibliography  577
Index  595
Chapter 1
Introduction
and we present these tools along with actuarial examples. In actuarial practice one
often distinguishes between life and general insurance. This distinction is made for good reasons: there are legislative reasons that require a legal separation of life and general insurance business, but there are also modeling reasons, because insurance products in life and general insurance can have rather different features. In this book, we do not make this distinction because the statistical methods presented here can be useful in both branches of insurance, and we consider both life and general insurance examples, e.g., mortality forecasting for the former and insurance claims prediction for pricing in the latter.
F(y) = P[Y ≤ y],

being the probability of the event that Y has a realization less than or equal to y. We write Y ∼ F for Y having distribution function F. Similarly, random vectors Y ∼ F are characterized by (cumulative) distribution functions F : R^q → [0, 1] with

F(y) = P[Y_1 ≤ y_1, …, Y_q ≤ y_q]   for y = (y_1, …, y_q)^⊤ ∈ R^q.
satisfying ∑_{k∈N} f(k) = 1. If N ⊆ N_0, the integer-valued random variable Y is called a count random variable. Count random variables are used to model the number of claims in insurance. A similar situation occurs if Y models nominal outcomes, for instance, if Y models gender with female being encoded by 0 and male being encoded by 1, then f(0) is the probability weight of having a female and f(1) = 1 − f(0) the probability weight of having a male; in this case we identify the finite set N = {0, 1} = {female, male}.
• A random variable Y ∼ F is said to be absolutely continuous² if there exists a non-negative (measurable) function f, called density of Y, such that

F(y) = ∫_{−∞}^{y} f(x) dx   for all y ∈ R.
(d^k/dr^k) M_Y(r)|_{r=0} = E[Y^k]   for all k ∈ N_0.   (1.1)
In particular, in this case we immediately know that all moments of Y exist, and
these moments completely determine the moment generating function MY of Y .
Another consequence is that for a random variable Y , whose moment generating
function MY has a strictly positive radius of convergence around the origin, the
distribution function F is fully determined by this moment generating function.
That is, if we have two such random variables Y_1 and Y_2 with M_{Y_1}(r) = M_{Y_2}(r) for all r ∈ (−r_0, r_0), for some r_0 > 0, then Y_1 (d)= Y_2.³ Thus, these two
random variables have the same distribution function. This statement carries over
to the limit, i.e., if we have a sequence of random variables (Yn )n whose moment
generating functions converge on a common interval (−r0 , r0 ), for some r0 > 0,
to the moment generating function of Y , also being finite on (−r0 , r0 ), then (Yn )n
converges in distribution to Y ; such an argument is used to prove the central limit
theorem (CLT).
³ The notation Y_1 (d)= Y_2 is generally used for equality in distribution, meaning that Y_1 and Y_2 have the same distribution function.
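The statement of (1.1), that derivatives of the moment generating function at the origin produce the moments, can be illustrated numerically. The following sketch (in Python; the function names and the finite-difference scheme are our own choices, not from the text) differentiates the Poisson moment generating function M_Y(r) = exp{λ(e^r − 1)} at r = 0:

```python
import math

# Moment generating function of a Poisson(lam) random variable:
# M_Y(r) = exp(lam * (e^r - 1)), finite for all r in R.
def mgf_poisson(r, lam):
    return math.exp(lam * (math.exp(r) - 1.0))

def kth_derivative_at_zero(f, k, h=1e-2):
    # Central finite-difference approximation of the k-th derivative f^(k)(0).
    return sum(
        (-1) ** (k - j) * math.comb(k, j) * f((j - k / 2.0) * h)
        for j in range(k + 1)
    ) / h ** k

lam = 2.0
m1 = kth_derivative_at_zero(lambda r: mgf_poisson(r, lam), 1)  # E[Y]   = lam         = 2
m2 = kth_derivative_at_zero(lambda r: mgf_poisson(r, lam), 2)  # E[Y^2] = lam + lam^2 = 6
```

The numerical derivatives recover the first two Poisson moments up to the finite-difference error of order h².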
The latter tells us that the survival function 1 − F (y) = P[Y > y] decays
exponentially for y → ∞. Heavy-tailed distribution functions do not have this
property, but the survival function decays slower than exponentially as y → ∞.
This slower decay of the survival function is the case for so-called subexponential distribution functions (an example is the log-normal distribution; we refer to Rolski et al. [320]) and for regularly varying survival functions (an example is the Pareto distribution). Regularly varying survival functions 1 − F have the property
distribution). Regularly varying survival functions 1 − F have the property
lim_{y→∞} (1 − F(ty))/(1 − F(y)) = t^{−β}   for all t > 0 and some β > 0.   (1.3)
These distribution functions have a polynomial tail (power tail) with tail index β >
0. In particular, if a positively supported distribution function F has a regularly
varying survival function with tail index β > 0, then this distribution function is
also subexponential, see Theorem 2.5.5 in Rolski et al. [320].
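The defining ratio (1.3) can be checked numerically. The following sketch (Python; all function names are our own) contrasts a Pareto survival function, which satisfies (1.3) exactly with β = 2, against a log-normal survival function, which is subexponential but not regularly varying, so its ratio drifts towards 0 instead of stabilizing at t^{−β}:

```python
import math

beta = 2.0

def pareto_survival(y):
    # 1 - F(y) = y^(-beta) for y >= 1: a regularly varying survival function.
    return y ** (-beta)

def lognormal_survival(y, mu=0.0, sigma=1.0):
    # 1 - F(y) of a log-normal: subexponential, but not regularly varying.
    return 0.5 * math.erfc((math.log(y) - mu) / (sigma * math.sqrt(2.0)))

t = 2.0
for y in (1e1, 1e2, 1e3):
    ratio_pareto = pareto_survival(t * y) / pareto_survival(y)           # t^(-beta) = 0.25
    ratio_lognormal = lognormal_survival(t * y) / lognormal_survival(y)  # drifts towards 0
    print(y, ratio_pareto, ratio_lognormal)
```

The Pareto ratio equals t^{−β} = 0.25 for every threshold y, while the log-normal ratio keeps shrinking as y grows.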
We are not going to specifically focus on heavy-tailed distribution functions here, but we will explain how light-tailed random variables can be transformed to enjoy heavy-tailed properties. In these notes, we are mainly interested in studying
different aspects of regression modeling. Regression modeling requires numerous
observations to be able to successfully fit these models to the data. By definition,
large claims are scarce, as they live in the tail of the distribution function and, thus,
correspond to rare events. Therefore, it is often not possible to employ a regression
model for scarce tail events. For this reason, extreme value analysis only plays
a marginal role in these notes, though, it has a significant impact on insurance
prices. For more on extreme value theory we refer to the relevant literature, see,
e.g., Embrechts et al. [121], Rolski et al. [320], Mikosch [277] and Albrecher et
al. [7].
1.3 Lab: Exploratory Data Analysis
Our theory is going to be supported by several data examples. These examples are
mostly based on publicly available data. The different data sets are described in
detail in Chap. 13. We highly recommend that the reader use these data sets to gain her/his own modeling experience.
We describe some tools here that allow for a descriptive and exploratory analysis
of the available data; exploratory data analysis has been introduced and promoted by
Tukey [357]. We consider the observed claim sizes of the Swedish motorcycle data
set described in Sect. 13.2. This data set consists of 656 (positive) claim amounts yi ,
1 ≤ i ≤ n = 656. These claim amounts are illustrated in the boxplots of Fig. 1.1.
Typically in insurance, there are large claims that dominate the picture, see
Fig. 1.1 (lhs). This results in right-skewed distribution functions, and such data is
better illustrated on the log scale, see Fig. 1.1 (rhs). The latter, of course, assumes
that all claims are strictly positive.
Figure 1.2 (lhs) shows the empirical distribution function of the observations yi ,
1 ≤ i ≤ n, which is obtained by
F̂_n(y) = (1/n) ∑_{i=1}^{n} 1{y_i ≤ y}   for y ∈ R.
If this data set has been generated by i.i.d. random variables, then the Glivenko–Cantelli theorem [64, 159] tells us that this empirical distribution function F̂_n converges uniformly to the (true) data generating distribution function, a.s., as the number n of observations converges to infinity, see Theorem 20.6 in Billingsley [34].
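As an illustration, a minimal sketch of the empirical distribution function F̂_n (in Python, on synthetic log-normal claims standing in for the Swedish motorcycle data, which we do not reproduce here):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
# Synthetic right-skewed "claim amounts" (log-normal), standing in for the
# n = 656 Swedish motorcycle claims; the real data set is described in Sect. 13.2.
claims = rng.lognormal(mean=9.0, sigma=1.0, size=656)

def ecdf(sample):
    """Return the sorted sample and the empirical distribution F_n on it."""
    x = np.sort(np.asarray(sample, dtype=float))
    f = np.arange(1, len(x) + 1) / len(x)
    return x, f

x, f = ecdf(claims)
# F_n jumps by 1/n at each observation, increasing from 1/n up to 1.
```

Plotting f against x (or against log(x) for right-skewed data) reproduces the left-hand side of Fig. 1.2.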
Figure 1.2 (rhs) shows the empirical density of the observations yi , 1 ≤ i ≤
n. This empirical density is obtained by considering a kernel smoother of a given
Fig. 1.1 Boxplot of the claim amounts of the Swedish motorcycle data set: (lhs) on the original
scale and (rhs) on the log scale
8 1 Introduction
Fig. 1.2 (lhs) Empirical distribution and (rhs) empirical density of the observed claim amounts yi ,
1≤i≤n
bandwidth around each observation yi . The standard choice is the Gaussian kernel,
with the bandwidth determining the variance parameter σ 2 > 0 of the Gaussian
density,

y ↦ f̂_n(y) = (1/n) ∑_{i=1}^{n} (1/√(2πσ²)) exp{−(y − y_i)²/(2σ²)}.
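The kernel smoother above can be sketched in a few lines (Python; the bandwidth and observations are illustrative choices, not the motorcycle data):

```python
import numpy as np

def gaussian_kernel_density(sample, sigma):
    """The kernel smoother from the text: an average of Gaussian densities
    with standard deviation sigma centred at the observations y_i."""
    sample = np.asarray(sample, dtype=float)

    def f_hat(y):
        z = (np.asarray(y, dtype=float) - sample[:, None]) / sigma
        return (np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))).mean(axis=0)

    return f_hat

# Illustration on four artificial observations: the estimate is a proper
# density, i.e., it integrates to ~1 on a sufficiently wide grid.
f_hat = gaussian_kernel_density([4.0, 5.5, 6.0, 9.0], sigma=0.5)
grid = np.linspace(0.0, 15.0, 2001)
area = (f_hat(grid) * (grid[1] - grid[0])).sum()
```

Evaluating f_hat on a grid and plotting it reproduces a curve like the right-hand side of Fig. 1.2; larger σ gives a smoother but more biased estimate.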
From the graph in Fig. 1.2 (rhs) we observe that the main body of the claim sizes is below an amount of 50’000, but the biggest claim exceeds 200’000. The latter motivates the study of heavy-tailedness of the claim size data. Therefore, one usually
benchmarks with a distribution function F that has a regularly varying survival
function with a tail index β > 0, see (1.3). Asymptotically a regularly varying
survival function behaves as y −β ; for this reason the log-log plot is a popular tool
to identify regularly varying tails. The log-log plot of a distribution function F is
obtained by considering the graph

y > 0 ↦ (log(y), log(1 − F(y))).

Figure 1.3 gives the log-log plot of the empirical distribution function F̂_n. If this
plot looks asymptotically (for y → ∞) like a straight line with a negative slope
−β, then the data shows heavy-tailedness in the sense of regular variation. Such
data cannot be modeled by a distribution function for which the moment generating
function MY (r) exists for some positive r > 0, see (1.2). Figure 1.3 does not suggest
a regularly varying tail as we do not see an obvious asymptotic straight line for
increasing claim sizes.
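A log-log plot can be produced and checked directly: for data with an exact Pareto tail, the points fall on a straight line with slope −β. A sketch (Python; the sample size, seed and least-squares slope check are our own choices):

```python
import numpy as np

def loglog_points(sample):
    """Points (log y_(i), log(1 - F_n(y_(i)))) of the empirical log-log plot;
    the largest observation is dropped since its empirical survival is 0."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    survival = 1.0 - np.arange(1, n + 1) / n
    return np.log(x[:-1]), np.log(survival[:-1])

# For an exact Pareto tail 1 - F(y) = y^(-beta), y >= 1, the points lie on a
# straight line with slope -beta; inverse-transform sampling from U ~ Unif(0,1).
rng = np.random.default_rng(seed=7)
beta = 2.0
pareto_sample = rng.uniform(size=10_000) ** (-1.0 / beta)
log_x, log_s = loglog_points(pareto_sample)
slope = np.polyfit(log_x, log_s, deg=1)[0]  # close to -beta
```

For the motorcycle claims no such straight-line behavior emerges, which is exactly the conclusion drawn from Fig. 1.3.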
These graphs give us a first indication of what the claim size data is about. Later on we are going to introduce explanatory variables that describe the insurance
1.4 Outline of This Book 9
0
function F
−1
logged survival function
−2
−3
−4
−5
−6
4 6 8 10 12
logged claim amounts
This book has eleven chapters (including the present one), and it has two appendices.
We briefly describe the contents of these chapters and appendices.
In Chap. 2 we introduce and discuss the exponential family (EF) and the
exponential dispersion family (EDF). The EF and the EDF are by far the most
important classes of distribution functions for regression modeling. They include,
among others, the Gaussian, the binomial, the Poisson, the gamma, the inverse
Gaussian and Tweedie’s models. We introduce these families of distribution func-
tions, discuss their properties and provide several examples. Moreover, we introduce
the Kullback–Leibler (KL) divergence and the Bregman divergence, which are
important tools in model evaluation.
Chapter 3 is on classical statistical decision theory. This chapter is important for
historical reasons, but it also provides the right mathematical grounding and intu-
ition for more modern tools from data science and machine learning. In particular,
we discuss maximum likelihood estimation (MLE), unbiasedness, consistency and
asymptotic normality of MLEs in this chapter.
Chapter 4 is the core theoretical chapter on predictive modeling and forecast
evaluation. The main problem in actuarial modeling is to forecast and price future
claims. For this, we build predictive models, and this chapter deals with assessing
and ranking these predictive models. We therefore introduce the mean squared
error of prediction (MSEP) and, more generally, the generalization loss (GL) to assess predictive models. This chapter is complemented by a more decision-theoretic approach to forecast evaluation: it discusses deviance losses, proper scoring, elicitability, forecast dominance, cross-validation, Akaike’s information criterion (AIC), and it gives an introduction to the bootstrap simulation method.
Chapter 5 discusses the state-of-the-art statistical modeling approach in insurance, the generalized linear model (GLM). We discuss GLMs in the light of claim count and claim size modeling; we present feature engineering, model fitting, model selection, over-dispersion, zero-inflated claim count problems, double GLMs, and insurance-specific issues such as the balance property for having unbiasedness.
Chapter 6 summarizes some techniques that use Bayes’ theorem. These are
classical Bayesian statistical models, e.g., using the Markov chain Monte Carlo
(MCMC) method for model fitting. This chapter discusses regularization of regression models such as ridge and LASSO regularization, which has a Bayesian interpretation, and it covers the Expectation-Maximization (EM) algorithm. The EM algorithm is a general purpose tool that can handle incomplete data settings. We illustrate this with different examples coming from mixture distributions, censored and truncated claims data.
At the core of this book are deep learning methods and neural networks. Chapter 7 considers deep feed-forward neural (FN) networks. We introduce the generic
architecture of deep FN networks, and we discuss universality theorems of FN
networks. We present network fitting, back-propagation, embedding layers for
categorical variables and insurance-specific issues such as the balance property in
network fitting and network ensembling to reduce model uncertainty. This chapter
is complemented by many examples on non-life insurance pricing, but also on
mortality modeling, as well as tools that help to explain deep FN network regression
results.
Chapters 8 and 9 consider recurrent neural (RN) networks and convolutional
neural (CN) networks. These are special network architectures that are useful for
time-series and spatial data modeling, e.g., applied to image recognition problems.
Time-series and images have a natural topology, and RN and CN networks try to
benefit from this additional structure (over tabular data). We introduce these network
architectures and provide insurance-relevant examples.
Chapter 10 discusses natural language processing (NLP) which deals with
regression modeling of non-tabular or unstructured text data. We explain how words can be embedded into low-dimensional spaces that serve as numerical word encodings. These can then be used for text recognition, either using RN networks or
attention layers. We give an example where we aim at predicting claim perils from
claim descriptions.
Chapter 11 is a selection of different topics. We mention forecasting under model uncertainty, deep quantile regression, deep composite regression and the LocalGLMnet, which is an interpretable FN network architecture. Moreover, we provide a bootstrap example to assess prediction uncertainty, and we discuss mixture density networks.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 2
Exponential Dispersion Family
We introduce the exponential family (EF) and the exponential dispersion family
(EDF) in this chapter. The single-parameter EF was introduced in 1934 by the British statistician Sir Ronald Fisher [128], and it was extended to vector-valued parameters by Darmois [88], Koopman [223] and Pitman [306] between
1935 and 1936. It is the most commonly used family of distribution functions
in statistical modeling; among others, it contains the Gaussian distribution, the
gamma distribution, the binomial distribution and the Poisson distribution. Its
parametrization is taken in a special form that is convenient for statistical modeling.
The EF can be introduced in a constructive way providing the main properties of
this family of distribution functions. In this chapter we follow Jørgensen [201–203]
and Barndorff-Nielsen [23], and we state the most important results based on this
constructive introduction. This gives us a unified notation which is going to be useful
for our purposes.
We define the EF w.r.t. a σ -finite measure ν on R. The results in this section can be
generalized to σ -finite measures on Rm , but such an extension is not necessary for
our purposes. Select an integer k ∈ N, and choose measurable functions a : R →
R and T : R → Rk .1 Consider for a canonical parameter θ ∈ Rk the Laplace
1We could also use boldface notation for T because T (y) ∈ Rk is vector-valued, but we prefer to
not use boldface notation for (vector-valued) functions.
transform
L(θ ) = exp θ T (y) + a(y) dν(y).
R
Assume that this Laplace transform is not identically equal to +∞. The effective domain is defined by

Θ = {θ ∈ R^k ; L(θ) < ∞} ⊆ R^k.   (2.1)

The cumulant function is given by

κ : Θ → R,   θ ↦ κ(θ) = log L(θ).
Remarks 2.3
• The definition of the EF (2.2) assumes that the effective domain Θ ⊆ R^k has been constructed from the choices a : R → R and T : R → R^k as described in (2.1). This is not explicitly stated in the surrounding text of (2.2).
• The support of any random variable Y ∼ F(·; θ) of this EF does not depend on the explicit choice of the canonical parameter θ ∈ Θ, but solely on the choice of the σ-finite measure ν on R, and the distribution functions F(·; θ) are mutually absolutely continuous (equivalent) w.r.t. ν.
• In statistics, the main object of interest is the canonical parameter θ . Importantly
for parameter estimation, the function a(·) does not involve the canonical
parameter. Therefore, it is irrelevant for parameter estimation and (only) serves
as a normalization so that F in (2.2) is a proper distribution function. In fact, this is the way the EF is often introduced in the statistical and actuarial literature, but with that introduction we lose the deeper interpretation of the cumulant function κ, and it is not immediately clear which properties it possesses.
• The case k ≥ 2 gives a vector-valued canonical parameter θ. The case k = 1
gives a single-parameter EF, and, if additionally T (y) = y, it is called a single-
parameter linear EF.
Theorem 2.4 Assume the effective domain Θ has a non-empty interior Θ̊. Choose Y ∼ F(·; θ) for fixed θ ∈ Θ̊. The moment generating function of T(Y) for sufficiently small r ∈ R^k is given by

M_{T(Y)}(r) = E_θ[exp{r^⊤ T(Y)}] = exp{κ(θ + r) − κ(θ)},

where the last identity follows from the fact that the support of the EF does not depend on the explicit choice of the canonical parameter.
Theorem 2.4 has a couple of immediate implications. First, in any interior point θ ∈ Θ̊ both the moment generating function r ↦ M_{T(Y)}(r) (in the neighborhood of the origin) and the cumulant function θ ↦ κ(θ) have derivatives of all orders, and, similarly to Sect. 1.2, moments of all orders of T(Y) exist, see also (1.1). Existence of moments of all orders implies that the distribution function of T(Y) cannot have regularly varying tails.
Cov_θ(T_j(Y), T_l(Y)) = (∂²/∂θ_j ∂θ_l) κ(θ).

The convexity of κ follows because ∇²_θ κ(θ) is the positive semi-definite covariance matrix of T(Y), for all θ ∈ Θ̊. This finishes the proof.
Remarks 2.7
• Throughout these notes we will work under Assumption 2.6 without making explicit reference. This assumption strengthens the properties of the cumulant function κ from being convex, see Corollary 2.5, to being strictly convex. This strengthening implies that the mean function θ ↦ μ = μ(θ) = ∇_θ κ(θ) can be inverted; this is needed for the canonical link, see Definition 2.8, below.
• The strict convexity of κ means that the covariance matrix ∇²_θ κ(θ) of T(Y) is positive definite and has full rank k for all θ ∈ Θ̊, see Corollary 2.5. This property is important, otherwise we do not have identifiability in the canonical parameter θ because we have a linear dependence between the components of T(Y).
• Mathematically, this strict convexity is not a restriction because it can be obtained
by working under a so-called minimal representation. If the covariance matrix
∇θ2 κ(θ) does not have full rank k, the choice k is “non-optimal” because the
problem lives in a smaller dimension. Thus, w.l.o.g., we may and will assume to
work in this smaller dimension, called minimal representation; for a rigorous
derivation of a minimal representation we refer to Section 8.1 in Barndorff-
Nielsen [23].
for the mean μ = E_θ[T(Y)] of Y ∼ F(·; θ) with θ ∈ Θ̊.

Remarks 2.9 (Dual Parameter Space) Assumption 2.6 provides that the canonical link h is well-defined, and we can either work with the canonical parameter representation θ ∈ Θ̊ ⊆ R^k or with its dual (mean) parameter representation μ = E_θ[T(Y)] ∈ M, with dual parameter space M = ∇_θ κ(Θ̊).
In Sect. 2.2.4, below, we introduce one more property called steepness that the
cumulant function κ should satisfy. This additional property gives a relationship
between the support T of the random variables T (Y ) of the given EF and the
boundary of the dual parameter space M. This steepness property is important for
parameter estimation.
For the Bernoulli distribution with parameter p ∈ (0, 1) we choose as ν the counting measure on {0, 1}. We make the following choices: T(y) = y,

a(y) = 0,   κ(θ) = log(1 + e^θ),   p = κ'(θ) = e^θ/(1 + e^θ),   θ = h(p) = log(p/(1 − p)),

for effective domain Θ = R; the canonical link h is the logit function. Mean and variance are given by

μ = E_θ[Y] = κ'(θ) = p   and   Var_θ(Y) = κ''(θ) = e^θ/(1 + e^θ)² = p(1 − p),

and the probability weights satisfy for y ∈ T = {0, 1}

P_θ[Y = y] = p^y (1 − p)^{1−y}.
For the binomial distribution with n ∈ N trials, the analogous choices lead to mean and variance

μ = E_θ[Y] = κ'(θ) = np   and   Var_θ(Y) = κ''(θ) = n e^θ/(1 + e^θ)² = np(1 − p).
For the Poisson distribution with parameter λ > 0 we choose as ν the counting
measure on N0 . We make the following choices: T (y) = y,
a(y) = log(1/y!),   κ(θ) = e^θ,   μ = κ'(θ) = e^θ,   θ = h(μ) = log(μ),

dF(y; θ) = exp{θy − e^θ} (1/y!) dν(y) = e^{−μ} (μ^y/y!) dν(y).   (2.4)
The canonical link μ → h(μ) is the log-link. Mean and variance are given by
μ = E_θ[Y] = κ'(θ) = λ   and   Var_θ(Y) = κ''(θ) = λ = μ = E_θ[Y],
where we set λ = eθ . The probability weights in the Poisson case satisfy for y ∈
T = N0
P_θ[Y = y] = e^{−λ} λ^y / y!.
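The Poisson relations above are easy to verify numerically: differentiating the cumulant function κ(θ) = e^θ at θ = log(λ) recovers both mean and variance λ. A small sketch (Python; the finite-difference scheme is our own choice):

```python
import math

# Poisson as a single-parameter linear EF: cumulant function kappa(theta) = e^theta,
# canonical parameter theta = h(mu) = log(mu), here with mu = lambda.
lam = 1.7
theta = math.log(lam)
kappa = math.exp

h = 1e-4
mean = (kappa(theta + h) - kappa(theta - h)) / (2.0 * h)                   # kappa'(theta)  = lam
var = (kappa(theta + h) - 2.0 * kappa(theta) + kappa(theta - h)) / h ** 2  # kappa''(theta) = lam

# Cross-check the mean against the probability weights e^(-lam) lam^y / y!:
direct_mean = sum(y * math.exp(-lam) * lam ** y / math.factorial(y) for y in range(60))
```

Both numerical derivatives and the direct sum over the probability weights agree with λ, illustrating the identity mean = variance of the Poisson model.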
μ = κ'(θ) = α e^θ/(1 − e^θ),   θ = h(μ) = log(μ/(μ + α)),
for effective domain Θ = (−∞, 0), dual parameter space M = (0, ∞) and support T = N_0 of Y = T(Y). With these choices we have
dF(y; θ) = C(y + α − 1, y) exp{θy + α log(1 − e^θ)} dν(y) = C(y + α − 1, y) p^y (1 − p)^α dν(y),

where we set p = e^θ ∈ (0, 1), and C(y + α − 1, y) denotes the binomial coefficient.
For α ∈ N this model can also be interpreted as the waiting time until we observe
α successful trials among i.i.d. trials, for instance, for α = 1 we have the geometric
distribution (with a small reparametrization).
The probability weights of the negative-binomial model satisfy for y ∈ T = N_0

P_θ[Y = y] = C(y + α − 1, y) p^y (1 − p)^α.   (2.5)
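For integer α, the probability weights (2.5) can be checked directly with binomial coefficients; summing the weights also recovers the mean μ = α e^θ/(1 − e^θ) = αp/(1 − p). A sketch (Python; the truncation point 200 is an illustrative choice):

```python
import math

# Negative-binomial weights from (2.5) with integer dispersion parameter alpha,
# so that math.comb applies; p = e^theta in (0, 1) for theta < 0.
alpha = 3
p = 0.4

def nb_weight(y):
    return math.comb(y + alpha - 1, y) * p ** y * (1.0 - p) ** alpha

weights = [nb_weight(y) for y in range(200)]
total = sum(weights)                               # ~ 1 (truncation error is tiny)
mean = sum(y * w for y, w in enumerate(weights))   # alpha * p / (1 - p) = 2
```

For non-integer α the binomial coefficient has to be replaced by the corresponding ratio of gamma functions.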
a(y) = −(1/2) log(2π),   κ(θ) = −θ₁²/(4θ₂) − (1/2) log(−2θ₂),

(μ, σ² + μ²)^⊤ = ∇_θ κ(θ) = ( θ₁/(−2θ₂), (−2θ₂)^{−1} + θ₁²/(4θ₂²) )^⊤,
dF(y; θ) = (1/(√(2π) σ)) exp{θy/σ − y²/(2σ²) − θ²/2} dν(y) = (1/(√(2π) σ)) exp{−(y − σθ)²/(2σ²)} dν(y),
and, in particular, the canonical link is the identity link μ → θ = h(μ) = μ in this
single-parameter EF example.
For the gamma distribution with parameters α, β > 0 we choose as ν the Lebesgue measure on R_+. Then we make the following choices: T(y) = (y, log y)^⊤,
for effective domain Θ = (−∞, 0) × (0, ∞), and setting β = −θ₁ > 0 and α = θ₂ > 0. The dual parameter space is M = (0, ∞) × R, and we have support T = (0, ∞) × R of T(Y) = (Y, log Y)^⊤. With these choices we obtain
dF(y; θ) = exp{θ^⊤ T(y) − log Γ(θ₂) + θ₂ log(−θ₁) − log y} dν(y)
         = ((−θ₁)^{θ₂}/Γ(θ₂)) y^{θ₂−1} exp{−(−θ₁)y} dν(y)
         = (β^α/Γ(α)) y^{α−1} exp{−βy} dν(y).
E_θ[(Y, log Y)^⊤] = ∇_θ κ(θ) = ( α/β, Γ'(α)/Γ(α) − log(β) )^⊤.
for effective domain Θ = (−∞, 0), dual parameter space M = (0, ∞) and support T = (0, ∞). With these choices we have for β = −θ > 0
dF(y; θ) = ((−θ)^α/Γ(α)) y^{α−1} exp{−(−θ)y} dν(y).   (2.6)
μ = E_θ[Y] = α/β   and   σ² = Var_θ(Y) = α/β² = (1/α) μ².
For parameter estimation one often needs to invert these identities, which gives us

α = μ²/σ²   and   β = μ/σ².
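This inversion is a simple method-of-moments step; a minimal sketch (Python, with illustrative numbers):

```python
def gamma_params_from_moments(mu, sigma2):
    """Invert mu = alpha/beta and sigma^2 = alpha/beta^2 for (alpha, beta)."""
    alpha = mu ** 2 / sigma2
    beta = mu / sigma2
    return alpha, beta

# Round trip: (alpha, beta) = (2.5, 0.5) has moments mu = alpha/beta = 5.0
# and sigma^2 = alpha/beta^2 = 10.0; inverting the moments recovers (2.5, 0.5).
alpha_hat, beta_hat = gamma_params_from_moments(5.0, 10.0)
```

In practice μ and σ² would be replaced by the sample mean and sample variance of the observed claims.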
Remarks 2.10
• The gamma distribution contains as special cases the exponential distribution for α = θ₂ = 1 and β = −θ₁ > 0, and the χ²_r-distribution with r degrees of freedom for α = θ₂ = r/2 and β = −θ₁ = 1/2.
• The distributions of the EF are all light-tailed in the sense that all moments
of T (Y ) exist. Therefore, the EF does not allow for regularly varying survival
functions, see (1.3). If Y is gamma distributed, then Z = exp{Y } is log-gamma
distributed (with the special case of the Pareto distribution for the exponential
case α = θ2 = 1). For an example we refer to Sect. 2.2.5. However, this log-
transformation is not always recommended because it may provide accurate
models on the transformed log-scale, but back-transformation to the original
scale may not necessarily provide a good predictive model on that original scale.
• The gamma density (2.6) may be a bit tricky in applications because the effective
domain Θ = (−∞, 0) is one-sided bounded (we come back to this below). For
this reason, in practice, one often uses links different from the canonical link
h(μ) = −α/μ. For instance, a parametrization θ = −exp{−ϑ} for ϑ ∈ R, see
Ohlsson–Johansson [290], leads to the following model

dF(y; ϑ) = (y^{α−1} / Γ(α)) exp{ −e^{−ϑ} y − αϑ } dν(y).    (2.7)

We will study the gamma model in more depth below, and parametrization (2.7)
will correspond to the log-link choice, see Example 5.5, below.
Figure 2.1 gives examples of gamma densities for shape parameters α ∈
{1/2, 1, 3/2, 2} and scale parameters β ∈ {1/2, 1, 3/2, 2} with α = β all providing
the same mean μ = Eθ[Y] = α/β = 1. The crucial observation is that these gamma
densities can have two different shapes: for α ≤ 1 we have a strictly decreasing
density, and for α > 1 we have a unimodal density with mode at (α − 1)/β.
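The shape dichotomy can be verified numerically; the sketch below (illustrative parameter values) maximizes the unnormalized gamma log-density (α − 1) log y − βy over a fine grid.

```python
import math

def log_density_unnorm(y, alpha, beta):
    # Unnormalized gamma log-density: (alpha - 1) log y - beta y.
    return (alpha - 1.0) * math.log(y) - beta * y

grid = [i / 10_000 for i in range(1, 100_000)]  # grid on (0, 10)

# alpha > 1: unimodal with mode at (alpha - 1)/beta.
alpha, beta = 2.0, 2.0
mode_num = max(grid, key=lambda y: log_density_unnorm(y, alpha, beta))
print(mode_num, (alpha - 1) / beta)  # both approximately 0.5

# alpha <= 1: strictly decreasing density; the grid maximum sits at the left end.
mode_dec = max(grid, key=lambda y: log_density_unnorm(y, 0.5, 0.5))
print(mode_dec == grid[0])  # True
```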
For the inverse Gaussian distribution with parameters α, β > 0 we choose as ν the
Lebesgue measure on R+. Then we make the following choices: T(y) = (y, 1/y)′,

a(y) = −(1/2) log(2πy³),   κ(θ) = −2(θ1θ2)^{1/2} − (1/2) log(−2θ2),

(α/β, β/α + 1/α²)′ = ∇θκ(θ) = ( ((−2θ2)/(−2θ1))^{1/2} , ((−2θ1)/(−2θ2))^{1/2} + 1/(−2θ2) )′,
Fig. 2.1 Gamma densities for shape parameters α ∈ {1/2, 1, 3/2, 2} and scale parameters β ∈ {1/2, 1, 3/2, 2} with α = β, all providing the same mean μ = α/β = 1
To receive (2.8) we have chosen canonical parameter θ = (θ1, θ2)′ ∈ (−∞, 0)².
Interestingly, we can close this parameter space for θ1 = 0, i.e., the effective domain
is not open in this example. The choice θ1 = 0 gives us cumulant function κ(θ) =
−(1/2) log(−2θ2) and boundary case

dF(y; θ) = exp{ θ′T(y) + (1/2) log(−2θ2) − (1/2) log(2πy³) } dν(y)
         = ((−2θ2)^{1/2} / (2πy³)^{1/2}) exp{ −(−2θ2)/(2y) } dν(y)
         = (α / (2πy³)^{1/2}) exp{ −α²/(2y) } dν(y).    (2.9)
for θ ∈ (−∞, 0), dual parameter space M = (0, ∞) and support T = (0, ∞). With
these choices we have the inverse Gaussian model for β = (−2θ)^{1/2} > 0

dF(y; θ) = exp{a(y)} exp{ −(1/(2y)) ((−2θ)y² − 2α(−2θ)^{1/2} y) } dν(y)
         = (α / (2πy³)^{1/2}) exp{ −(α²/(2y)) (1 − (β/α) y)² } dν(y).

For parameter estimation one inverts the identities μ = α/β and σ² = α/β³ = μ³/α², which gives us

α = μ^{3/2}/σ   and   β = μ^{1/2}/σ.
Figure 2.2 gives examples of inverse Gaussian densities for parameter choices
α = β ∈ {1/2, 1, 3/2, 2} all providing the same mean μ = Eθ [Y ] = α/β = 1.
For the generalized inverse Gaussian distribution with parameters α, β > 0 and
γ ∈ R we choose as ν the Lebesgue measure on R+ . We combine the terms of
the gamma and the inverse Gaussian models to the vector-valued choice: T (y) =
(y, logy, 1/y) with k = 3. Moreover, we choose a(y) = −logy and cumulant
function
κ(θ) = log( 2 K_θ2( 2 (θ1θ3)^{1/2} ) ) − (θ2/2) log(θ1/θ3),
Fig. 2.2 Inverse Gaussian densities for α = β ∈ {1/2, 1, 3/2, 2}, all providing the same mean μ = α/β = 1
for θ = (θ1, θ2, θ3)′ ∈ (−∞, 0) × R × (−∞, 0), and where K_θ2 denotes the
modified Bessel function of the second kind with index γ = θ2 ∈ R. With these
choices we obtain the generalized inverse Gaussian density

dF(y; θ) = exp{ θ′T(y) − log(2 K_θ2(2 (θ1θ3)^{1/2})) + (θ2/2) log(θ1/θ3) − log y } dν(y)
         = ((α/β)^{γ/2} / (2 K_γ((αβ)^{1/2}))) y^{γ−1} exp{ −(1/2)(αy + βy^{−1}) } dν(y),    (2.10)

where we identify α = −2θ1, β = −2θ3 and γ = θ2.
The effective domain is a bit complicated because the possible choices of (θ1 , θ3 )
depend on θ2 ∈ R, namely, for θ2 < 0 the negative half-line (−∞, 0] can be closed
at the origin for θ1 , and for θ2 > 0 it can be closed at the origin for θ3 . The inverse
Gaussian model is obtained for θ2 = −1/2 and the gamma model is obtained for
θ3 = 0. For further properties of the generalized inverse Gaussian distribution we
refer to the textbook of Jørgensen [200].
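As a sanity check on (2.10), the sketch below integrates the generalized inverse Gaussian density numerically. To stay self-contained, K_ν is computed from its integral representation K_ν(z) = ∫₀^∞ exp{−z cosh t} cosh(νt) dt; parameter values and quadrature settings are illustrative.

```python
import math

def bessel_k(nu, z, n=20_000, t_max=20.0):
    # Modified Bessel function of the second kind via the integral
    # representation K_nu(z) = int_0^infty exp(-z cosh t) cosh(nu t) dt
    # (trapezoidal rule; the integrand is negligible beyond t_max here).
    h = t_max / n
    total = 0.5 * (math.exp(-z) + math.exp(-z * math.cosh(t_max)) * math.cosh(nu * t_max))
    for i in range(1, n):
        t = i * h
        total += math.exp(-z * math.cosh(t)) * math.cosh(nu * t)
    return total * h

def gig_density(alpha, beta, gamma):
    # Generalized inverse Gaussian density (2.10); normalization computed once.
    norm = (alpha / beta) ** (gamma / 2.0) / (2.0 * bessel_k(gamma, math.sqrt(alpha * beta)))
    return lambda y: norm * y ** (gamma - 1.0) * math.exp(-0.5 * (alpha * y + beta / y))

f = gig_density(2.0, 3.0, 0.7)  # illustrative parameters
h = 0.001
mass = sum(f(i * h) for i in range(1, 50_000)) * h  # integrate over (0, 50)
print(mass)  # approximately 1
```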
for effective domain Θ = R^k, dual parameter space M = (0, 1)^k, and the support
T of T(Y) is the set of the k + 1 corners of the unit simplex in R^k. This representation is
minimal, see Assumption 2.6. With these choices we have (set θ_{k+1} = 0)

dF(y; θ) = exp{ θ′T(y) − log(1 + Σ_{i=1}^k e^{θ_i}) } dν(y) = Π_{j=1}^{k+1} ( e^{θ_j} / Σ_{i=1}^{k+1} e^{θ_i} )^{1{y=j}} dν(y).
Remarks 2.11
• There are many more examples that belong to the EF. From Theorem 2.4, we
know that all examples of the EF are light-tailed in the sense that all moments of
T (Y ) exist. If we want to model heavy-tailed distributions within the EF, we first
need to apply a suitable transformation. We could model the Pareto distribution
using transformation T (y) = logy, and assuming that the transformed random
variable has an exponential distribution. Different light-tailed examples are
obtained by, e.g., using transformation T (y) = y τ for the Weibull distribution
or T (y) = (logy, log(1 − y)) for the beta distribution. We refrain from giving
explicit formulas for these or other examples.
• Observe that in all examples above the support T of T(Y) is contained in the
closure of the dual parameter space M; we come back to this observation in
Sect. 2.2.4, below.
In the previous section we have introduced the EF, and we have explicitly studied the
vector-valued parameter EF examples of the Gaussian, the gamma and the inverse
Gaussian models. We have highlighted that these three vector-valued parameter
EFs can be turned into single-parameter EFs by declaring one parameter to be
a nuisance parameter that is not modeled (and acts as a hyper-parameter). These
three single-parameter EFs with nuisance parameter can also be interpreted as EDF
models. In this section we discuss the single-parameter EDF; this is sufficient for
our purposes, and vector-valued parameter extensions can be obtained in a canonical
way.
The EFs of Sect. 2.1 can be extended to EDFs. In the single-parameter case this
is achieved by a transformation Y = X/ω, where ω > 0 is a scaling and where X
belongs to a single-parameter linear EF, i.e., with T (x) = x. We restrict ourselves to
the single-parameter case k = 1 throughout this section. Choose a σ -finite measure
ν1 on R and a measurable function a1 : R → R. These choices give a single-
parameter linear EF, directly modeling a real-valued random variable T (X) = X.
By (2.2) we have the distribution for the single-parameter linear EF random variable X

dF(x; θ, 1) = f(x; θ, 1) dν1(x) = exp{ θx − κ(θ) + a1(x) } dν1(x),

with effective domain defined by (2.11), i.e., for ω = 1. This allows us to consider
the distribution functions

dF(x; θ, ω) = f(x; θ, ω) dνω(x) = exp{ θx − ωκ(θ) + aω(x) } dνω(x)
            = exp{ ω (θy − κ(θ)) + aω(ωy) } dνω(ωy),    (2.13)
with
Remarks 2.13
• Exposure v > 0 and dispersion parameter ϕ > 0 provide the parametrization
usually used for ω = v/ϕ ∈ W. Their meaning and interpretation will become
clear below, and they will always appear as a ratio ω = v/ϕ.
• The support of these EDF distributions does not depend on the explicit choice of
the canonical parameter θ ∈ Θ, but it may depend on ω = v/ϕ ∈ W through
the choices of the σ-finite measures νω, for ω ∈ W. Consequently, a(y; ω) is
a normalization such that f(y; θ, ω) integrates to 1 w.r.t. the chosen σ-finite
measure νω to receive a proper distributional model.
• The transformation x → y = x/ω in (2.13) is called duality transformation, see
Section 3.1 in Jørgensen [203]. It provides the duality between the additive form
(in variable x in (2.13)) and the reproductive form (in variable y in (2.13)) of the
EDF; Definition 2.12 is the reproductive form.
• Lemma 2.1 tells us that Θ is convex, thus, it is a possibly infinite interval in R.
To exclude trivial cases we will always assume that the σ-finite measure ν1 is not
concentrated in one single point (this relates to the minimal representation for
k = 1 in the linear EF case, see Assumption 2.6), and that the interior Θ̊ of the
effective domain is non-empty.
Proof This follows analogously to Theorem 2.4. The linear case T (y) = y with ν1
not being concentrated in one single point guarantees that the minimal dimension is
k = 1, providing a minimal representation in this dimension, see Assumption 2.6.
Before giving explicit examples we state the so-called convolution formula.
Y+ = (1/v+) Σ_{i=1}^n vi Yi ∼ F(·; θ, v+/ϕ).
Proof The proof immediately follows from calculating the moment generating
function MY+ (r) and from using the independence between the Yi ’s.
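The convolution formula can be illustrated by simulation for the Poisson member of the EDF (anticipating the Poisson example below). The sketch (plain Python; intensity and exposures are illustrative, ϕ = 1) checks that Y+ has mean λ and variance λ/v+.

```python
import math
import random
import statistics

random.seed(7)

def rpois(mu):
    # Knuth's Poisson sampler (adequate for small means).
    L, k, p = math.exp(-mu), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

# Poisson EDF variables Y_i with common parameter lam and exposures v_i:
# v_i * Y_i ~ Poi(v_i * lam).
lam, v = 2.0, [1.0, 2.0, 0.5]
v_plus = sum(v)

draws = []
for _ in range(100_000):
    # Y_+ = sum_i v_i Y_i / v_+ = sum_i N_i / v_+ with N_i ~ Poi(v_i * lam).
    draws.append(sum(rpois(vi * lam) for vi in v) / v_plus)

m, s2 = statistics.fmean(draws), statistics.variance(draws)
print(m, s2)  # close to lam = 2 and lam/v_plus = 4/7
```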
for effective domain Θ = R and dual parameter space M = (0, 1). With these
choices we have

f(y; θ, n) = C(n, ny) exp{ n (θy − log(1 + e^θ)) } = C(n, ny) (e^θ/(1 + e^θ))^{ny} (1/(1 + e^θ))^{n−ny},

where C(n, ny) denotes the binomial coefficient. This is a single-parameter EDF.
The canonical link p → h(p) gives the logit function. Mean and variance are given by

p = Eθ[Y] = κ′(θ) = e^θ/(1 + e^θ)   and   Varθ(Y) = (1/n) κ′′(θ) = (1/n) e^θ/(1 + e^θ)² = (1/n) p(1 − p),

and the variance function is given by V(μ) = μ(1 − μ). The binomial random
variable is obtained by setting X = nY ∼ Binom(n, p).
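A short numerical sketch of these identities (the value of θ is illustrative): the logit inverts the mean map p = κ′(θ), and a finite difference of κ reproduces the variance function p(1 − p).

```python
import math

def kappa(theta):
    # Bernoulli cumulant function kappa(theta) = log(1 + e^theta).
    return math.log1p(math.exp(theta))

def logit(p):
    # Canonical link h(p) = log(p/(1-p)).
    return math.log(p / (1.0 - p))

theta = 0.3
p = math.exp(theta) / (1.0 + math.exp(theta))  # mean p = kappa'(theta)

eps = 1e-4
kpp = (kappa(theta + eps) - 2.0 * kappa(theta) + kappa(theta - eps)) / eps ** 2
print(logit(p), theta)     # the logit recovers the canonical parameter
print(kpp, p * (1.0 - p))  # kappa''(theta) equals the variance function p(1-p)
```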
For the Poisson distribution with parameters λ > 0 and v > 0 we choose the
counting measure on N0 /v for exposure ω = v. Then we make the following choices
a(y) = log( v^{vy} / (vy)! ),   κ(θ) = e^θ,   λ = κ′(θ) = e^θ,   θ = h(λ) = log(λ),

for effective domain Θ = R and dual parameter space M = (0, ∞). With these
choices we have

f(y; θ, v) = (v^{vy} / (vy)!) exp{ v (θy − e^θ) } = e^{−vλ} (vλ)^{vy} / (vy)!.    (2.15)
This is a single-parameter EDF. The canonical link λ → h(λ) is the log-link. Mean
and variance are given by

λ = Eθ[Y] = κ′(θ) = e^θ   and   Varθ(Y) = (1/v) κ′′(θ) = (1/v) e^θ = (1/v) λ,
and the variance function is given by V (λ) = λ, that is, the variance function is
linear in the mean parameter λ. The Poisson random variable is obtained by setting
X = vY ∼ Poi(vλ). We choose ϕ = 1, here, meaning that we have neither under-
nor over-dispersion. Thus, the choices v and ϕ in ω = v/ϕ have the interpretation
of an exposure and a dispersion parameter, respectively. This interpretation is going
to be important in claim counts modeling, below.
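The identity (2.15) and the stated moments can be confirmed numerically; the sketch below (illustrative λ and v) sums the EDF density over the support N0/v.

```python
import math

lam, v = 1.7, 4.0
theta = math.log(lam)  # canonical parameter under the log-link

def f(x):
    # EDF density (2.15) at y = x/v for x in N_0:
    # f(y; theta, v) = v^{vy}/(vy)! * exp{v (theta y - e^theta)}.
    y = x / v
    return v ** x / math.factorial(x) * math.exp(v * (theta * y - math.exp(theta)))

xs = range(61)  # x = vy; the Poi(v*lam) mass beyond 60 is negligible here
total = sum(f(x) for x in xs)
mean = sum((x / v) * f(x) for x in xs)
var = sum((x / v - mean) ** 2 * f(x) for x in xs)
print(total, mean, var)  # approximately 1, lam = 1.7 and lam/v = 0.425
```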
For the gamma distribution with parameters α, β > 0 we choose the Lebesgue
measure on R+ and shape parameter ω = v/ϕ = α. We make the following choices
for effective domain Θ = (−∞, 0) and dual parameter space M = (0, ∞). With
these choices we have

f(y; θ, α) = (α^α / Γ(α)) y^{α−1} exp{ α (yθ + log(−θ)) } = ((−θα)^α / Γ(α)) y^{α−1} exp{−(−θα)y}.
This is analogous to (2.6) with shape parameter α > 0 and scale parameter β =
−θ > 0. Mean and variance are given by

μ = Eθ[Y] = κ′(θ) = −θ^{−1}   and   Varθ(Y) = (1/α) κ′′(θ) = (1/α) θ^{−2},

and the variance function is given by V(μ) = μ², that is, the variance function
is quadratic in the mean parameter μ. The gamma random variable is obtained by
setting X = αY ∼ Γ(α, β). This gives us for the first two moments of X

μX = Eθ[X] = α/β   and   Varθ(X) = α/β² = μX²/α.
For the inverse Gaussian distribution with parameters α, β > 0 we choose the
Lebesgue measure on R+ and we set ω = v/ϕ = α. We make the following choices
a(y) = log( α^{1/2} / (2πy³)^{1/2} ) − α/(2y),   κ(θ) = −(−2θ)^{1/2},

μ = κ′(θ) = 1/(−2θ)^{1/2},   θ = h(μ) = −1/(2μ²),
for θ ∈ (−∞, 0) and dual parameter space M = (0, ∞). With these choices we
have
f(y; θ, α) dy = (α^{1/2} / (2πy³)^{1/2}) exp{ α (θy + (−2θ)^{1/2}) − α/(2y) } dy
             = (α^{1/2} / (2πy³)^{1/2}) exp{ −(α/(2y)) (1 − (−2θ)^{1/2} y)² } dy
             = (α / (2πx³)^{1/2}) exp{ −(α²/(2x)) (1 − ((−2θ)^{1/2}/α) x)² } dx,
where in the last step we did a change of variable y → x = αy. This is exactly (2.8).
Mean and variance are given by
1 1
μ = Eθ [Y ] = κ (θ ) = (−2θ )−1/2 and Varθ (Y ) = κ (θ ) = (−2θ )−3/2,
α α
and the variance function is given by V (μ) = μ3 , that is, the variance function is
cubic in the mean parameter μ. The inverse Gaussian random variable is obtained by
setting X = αY . The mean and variance of X are given by, set β = (−2θ )1/2 > 0,
μX = Eθ[X] = α/β   and   Varθ(X) = α/β³ = (1/α²) μX³.
Tweedie’s compound Poisson (CP) model was introduced in 1984 by Tweedie [358],
and it has been studied in detail in Jørgensen [202], Jørgensen–de Souza [204],
Smyth–Jørgensen [342] and in the review paper of Delong et al. [94]. Tweedie’s CP
model belongs to the EDF. We spend more time on explaining Tweedie’s CP model
because it plays an important role in actuarial modeling.
Tweedie’s CP model is received by choosing as σ -finite measure ν1 a mixture of
the Lebesgue measure on (0, ∞) and a point measure in 0. Furthermore, we choose
power variance parameter p ∈ (1, 2) and cumulant function

κ(θ) = κp(θ) = (1/(2 − p)) ((1 − p)θ)^{(2−p)/(1−p)},    (2.17)
with exposure v > 0 and dispersion parameter ϕ > 0; the normalizing function
a(·; v/ϕ) does not have any simple closed form, we refer to Section 2.1 in
Jørgensen–de Souza [204] and Section 4.2 in Jørgensen [203].
Some readers will notice that this is the moment generating function of a CP
distribution having i.i.d. gamma claim sizes. This is exactly the statement of the
next proposition which is found, e.g., in Smyth–Jørgensen [342].
Proposition 2.17 Assume S = Σ_{i=1}^N Zi is CP distributed with Poisson claim
counts N ∼ Poi(λv) and i.i.d. gamma claim sizes Zi ∼ Γ(α, β) being independent
of N. We have S = vY/ϕ in distribution by identifying the parameters as follows

p = (α + 2)/(α + 1) ∈ (1, 2),   β = −θ > 0   and   λ = κp(θ)/ϕ > 0.
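Proposition 2.17's compound Poisson structure can be illustrated by simulation. The sketch below checks the first two moments of S against the standard compound Poisson formulas E[S] = λv α/β and Var(S) = λv α(α + 1)/β² (these moment formulas are standard facts restated here; parameter values are illustrative).

```python
import math
import random
import statistics

random.seed(3)

lam, v, alpha, beta = 1.5, 2.0, 3.0, 0.5  # illustrative parameters

def rpois(mu):
    # Knuth's Poisson sampler (adequate for small means).
    L, k, p = math.exp(-mu), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

draws = []
for _ in range(100_000):
    n = rpois(lam * v)  # claim count N ~ Poi(lam * v)
    # i.i.d. gamma claim sizes with shape alpha and rate beta (scale 1/beta).
    draws.append(sum(random.gammavariate(alpha, 1.0 / beta) for _ in range(n)))

m, s2 = statistics.fmean(draws), statistics.variance(draws)
print(m, s2)  # close to lam*v*alpha/beta = 18 and lam*v*alpha*(alpha+1)/beta^2 = 144
```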
Table 2.1 Power variance function models V(μ) = μ^p within the EDF (taken from Table 4.1 in Jørgensen [203])

p          Distribution                                 Support of Y   Θ           M
p < 0      Generated by extreme stable distributions    R              [0, ∞)      (0, ∞)
p = 0      Gaussian distribution                        R              R           R
p = 1      Poisson distribution                         N0             R           (0, ∞)
1 < p < 2  Tweedie's CP distribution                    [0, ∞)         (−∞, 0)     (0, ∞)
p = 2      Gamma distribution                           (0, ∞)         (−∞, 0)     (0, ∞)
p > 2      Generated by positive stable distributions   (0, ∞)         (−∞, 0]     (0, ∞)
p = 3      Inverse Gaussian distribution                (0, ∞)         (−∞, 0]     (0, ∞)
we refer to Formula (20) in Section 8.1 of Barndorff-Nielsen [23]. Define the convex
closure of the support T by C = conv(T).

Theorem 2.19 (Theorem 9.2 in Barndorff-Nielsen [23], Without Proof) Assume
we have a fixed EF satisfying Assumption 2.6. The cumulant function κ is steep if
and only if C̊ = M = ∇θκ(Θ̊).

Theorem 2.19 tells us that for a steep cumulant function C equals the closure of
M = ∇θκ(Θ̊). In this case parameter estimation can be extended to observations T(Y) in
the closure of M such that we may obtain a degenerate model at the boundary of M. Coming
back to our Poisson example from above, in this case we set μ = 0, which gives a
degenerate Poisson model.
Throughout this book we will work under the assumption that κ is steep.
The classical examples satisfy this assumption: the examples with power variance
parameter p in {0} ∪ [1, ∞) satisfy Theorem 2.19; this includes the Gaussian, the
Poisson, the gamma, the inverse Gaussian and Tweedie’s CP models, see Table 2.1.
Moreover, the examples we have met in Sect. 2.1 fulfill this assumption; these
are the single-parameter linear EF models of the Bernoulli, the binomial and the
negative binomial distributions, as well as the vector-valued parameter examples of
the Gaussian, the gamma and the inverse Gaussian models and of the categorical
distribution. The only models we have seen that do not have a steep cumulant
function are the power variance models with p < 0, see Table 2.1.
Remark 2.20 Working within the EDF needs some additional thoughts because the
support T = Tω of the single-parameter linear EDF random variable Y = T (Y ) may
depend on the specific choice of the dispersion parameter ω ∈ W ⊃ {1} through the
σ -finite measure dνω (ω ·), see (2.13). For instance, in the binomial case the support
of Y is given by Tω = {0, 1/n, . . . , 1} with ω = n, see Sect. 2.2.2.
Assume that the cumulant function κ is steep for the single-parameter linear
EF that corresponds to the single-parameter EDF with ω = 1. Theorem 2.19
then implies that for this choice we have C̊ω=1 = ∇θκ(Θ̊) with convex closure
Cω=1 = conv(Tω=1).
Consider ω ∈ W \ {1} which corresponds to the choice νω of the σ-finite measure
on R. This choice belongs to the cumulant function θ → ωκ(θ) in the additive form
(x-parametrization in (2.13)). Since steepness (2.20) holds for any ω > 0 we receive
that the convex closure of the support of this distribution in the x-parametrization
in (2.13) is given by the closure of ∇θ(ωκ)(Θ̊) = ω∇θκ(Θ̊). The duality transformation x → y =
x/ω leads to the change of measure dνω(x) → dνω(ωy) and to the corresponding
change of support, see (2.13). The latter implies that in the reproductive form (y-
parametrization) the convex closure of the support does not depend on the specific
choice of ω ∈ W. Since the EDF representation given in (2.14) corresponds to the
y-parametrization (reproductive form), we can use Theorem 2.19 without limitation
also for the single-parameter linear EDF given by (2.14), and C does not depend on
ω ∈ W.
From Corollary 2.14 we know that the moment generating function exists around the
origin for all examples belonging to the EDF. This implies that the moments of all
orders exist, and that we have an exponentially decaying survival function Pθ[Y >
y] = 1 − F(y; θ, ω) ∼ exp{−cy} for some c > 0 as y → ∞, see (1.2). In many
applied situations the data is more heavy-tailed and, thus, cannot be modeled by
such an exponentially decaying survival function. In such cases one often chooses
a distribution function with a regularly varying survival function; regular variation
with tail index β > 0 has been introduced in (1.3). A popular choice is a log-gamma
distribution which can be obtained from the gamma distribution (belonging to the
EDF). We briefly explain how this is done and how it relates to the Pareto and the
Lomax [256] distributions.
We start from the gamma density (2.6). The random variable Z has a log-gamma
distribution with shape parameter α > 0 and scale parameter β = −θ > 0 if
log(Z) = Y has a gamma distribution with these parameters. Thus, the gamma
density of Y = log(Z) is given by
f(y; β, α) dy = (β^α / Γ(α)) y^{α−1} exp{−βy} dy   for y > 0.
The change of variable y → z = e^y gives the log-gamma density

f(z; β, α) dz = (β^α / Γ(α)) (log z)^{α−1} z^{−(β+1)} dz   for z > 1.
This log-gamma density has support (1, ∞). The distribution function of this log-
gamma distributed random variable needs to be calculated numerically, and its
survival function is regularly varying with tail index β > 0.
A special case of the log-gamma distribution is the Pareto distribution. The Pareto
distribution is more tractable and it is obtained by setting shape parameter α = 1 in
the log-gamma density. This gives us the Pareto density f(z; β) = β z^{−(β+1)} for
z ≥ 1, with distribution function

F(z; β) = 1 − z^{−β}   for z ≥ 1.
Obviously, this provides a regularly varying survival function with tail index β > 0;
in fact, in this case we do not need to go over to the limit in (1.3) because we
have an exact identity. The Pareto distribution has the nice property that it is closed
under thresholding (lower-truncation) with M, that is, we remain within the family
of Pareto distributions with the same tail index β by considering lower-truncated
claims: for 1 ≤ M ≤ z we have
F(z; β, M) = P[Z ≤ z | Z > M] = P[M < Z ≤ z] / P[Z > M] = 1 − (z/M)^{−β}.
This is the classical definition of the Pareto distribution, and it allows to preserve
full flexibility in the choice of the threshold M > 0.
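Closure under lower-truncation is easy to verify by simulation (inversion sampling; the threshold and evaluation point are illustrative):

```python
import random

random.seed(0)

beta, M = 2.0, 1.5

# Pareto with tail index beta and threshold 1 via inversion: Z = U^{-1/beta}
# has distribution function P[Z <= z] = 1 - z^{-beta} for z >= 1.
zs = [random.random() ** (-1.0 / beta) for _ in range(1_000_000)]

# Lower-truncate at M and compare with the Pareto distribution function with threshold M.
trunc = [z for z in zs if z > M]
z0 = 3.0
empirical = sum(z <= z0 for z in trunc) / len(trunc)
theoretical = 1.0 - (z0 / M) ** -beta
print(empirical, theoretical)  # both close to 0.75
```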
The disadvantage of the Pareto distribution is that it does not provide a
continuous density on R+ as there is a discontinuity in threshold M. For this reason,
one sometimes explores another change of variable Z → X = Z − M for a Pareto
distributed random variable Z ∼ F (·; β, M). This provides the Lomax distribution,
also called Pareto Type II distribution. X has the following distribution function on
(0, ∞)

P[X ≤ x] = 1 − ((x + M)/M)^{−β}   for x ≥ 0.
This distribution has again a regularly varying survival function with tail index β >
0. Moreover, we have
lim_{x→∞} ((x + M)/M)^{−β} / (x/M)^{−β} = lim_{x→∞} (1 + M/x)^{−β} = 1.
Fig. 2.3 Logged survival functions of the Pareto and the Lomax distributions with tail index β = 2 and threshold M = 1 000 000 (log-log plot)
This says that we should choose the same threshold M > 0 for both the Pareto and
the Lomax distribution to receive the same asymptotic tail behavior, and this also
quantifies the rate of convergence between the two survival functions. Figure 2.3
illustrates this convergence in a log-log plot choosing tail index β = 2 and threshold
M = 1 000 000.
For completeness we provide the density of the Pareto distribution

f(z; β, M) = (β/M) (z/M)^{−(β+1)}   for z ≥ M,

and of the Lomax distribution

f(x; β, M) = (β/M) ((x + M)/M)^{−(β+1)}   for x ≥ 0.
We refer to Ay et al. [16] and Nielsen [285] for an extended treatment of these
mathematical concepts.
Choose a fixed EF (2.2) with cumulant function κ on the effective domain
Θ ⊆ R^k and with σ-finite measure ν on R. We define the Kullback–Leibler (KL)
divergence (relative entropy) from model θ1 ∈ Θ to model θ0 ∈ Θ within this EF
by

DKL(f(·; θ0) || f(·; θ1)) = ∫_R f(y; θ0) log( f(y; θ0) / f(y; θ1) ) dν(y) ≥ 0.
Recall that the support of the EF does not depend on the specific choice of the
canonical parameter θ in Θ, see Remarks 2.3; this implies that the KL divergence
is well-defined, here. The positivity of the KL divergence is obtained from Jensen’s
inequality; this is proved in Lemma 2.21, below.
The KL divergence has the interpretation of having a data model that is
characterized by the distribution f (·; θ 0 ), and we would like to measure how close
another model f (·; θ 1 ) is to the data model. Note that the KL divergence is not
a distance function because it is neither symmetric nor does it satisfy the triangle
inequality.
We calculate the KL divergence within the chosen EF

DKL(f(·; θ0) || f(·; θ1)) = ∫_R f(y; θ0) ( (θ0 − θ1)′T(y) − κ(θ0) + κ(θ1) ) dν(y)
                         = (θ0 − θ1)′ ∇θκ(θ0) − κ(θ0) + κ(θ1) ≥ 0,    (2.21)
where we have used Corollary 2.5, and the positivity of the KL divergence can be
seen from the convexity of κ. This allows us to consider the following (Taylor)
expansion
This illustrates that the KL divergence corresponds to second and higher order
differences between the cumulant value κ(θ 0 ) and another cumulant value κ(θ 1 ).
The gradients of the KL divergence w.r.t. θ1 in θ1 = θ0 and w.r.t. θ0 in θ0 = θ1 are
given by

∇θ1 DKL(f(·; θ0) || f(·; θ1)) |_{θ1=θ0} = ∇θ0 DKL(f(·; θ0) || f(·; θ1)) |_{θ0=θ1} = 0.    (2.23)
This emphasizes that the KL divergence reflects second and higher-order terms in
cumulant function κ; and that the data model θ 0 forms the minimum of this KL
divergence (as a function of θ1) as we will just see. We calculate the Hessian (second
order term) w.r.t. θ1 in θ1 = θ0

∇²θ1 DKL(f(·; θ0) || f(·; θ1)) |_{θ1=θ0} = ∇²θ κ(θ) |_{θ=θ0} =: I(θ0).
The positive definite matrix I(θ 0 ) (in a minimal representation) is called Fisher’s
information. Fisher’s information is an important tool in statistics that we will
meet in Theorem 3.13 of Sect. 3.3, below. A function satisfying (2.21) (being
zero if and only if θ0 = θ1), fulfilling (2.23) and having a positive definite
Fisher's information is called a divergence, see Definition 5 in Nielsen [285]. Fisher's
information I(θ0) measures the curvature of the KL divergence in θ0 and we have
the second order Taylor approximation
κ(θ1) ≈ κ(θ0) + ∇θκ(θ0)′ (θ1 − θ0) + (1/2) (θ1 − θ0)′ I(θ0) (θ1 − θ0).
Next-order terms are obtained from the so-called Amari–Chentsov tensor, see Amari
[10] and Section 4.2 in Ay et al. [16]. In information geometry one studies the
(possibly degenerate) Riemannian metric on the effective domain Θ induced by
Fisher's information; we refer to Section 3.7 in Nielsen [285].
Lemma 2.21 Consider two densities p and q w.r.t. a given σ -finite measure ν. We
have DKL (p||q) ≥ 0, and DKL (p||q) = 0 if and only if p = q, ν-a.s.
Proof Assume Y ∼ p dν, then we can rewrite the KL divergence, using Jensen's
inequality,

DKL(p||q) = ∫ p(y) log( p(y)/q(y) ) dν(y) = −Ep[ log( q(Y)/p(Y) ) ]
          ≥ −log Ep[ q(Y)/p(Y) ] = −log ∫ q(y) dν(y) ≥ 0.    (2.24)

Equality holds if and only if p = q, ν-a.s. The last inequality of (2.24) considers
that q does not necessarily need to be a density w.r.t. ν, i.e., we can also have
∫ q(y) dν(y) < 1.
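As a concrete check of the EF form of the KL divergence, the sketch below compares the closed form (θ0 − θ1)κ′(θ0) − κ(θ0) + κ(θ1) with a direct evaluation of the defining sum for the Poisson family (κ(θ) = e^θ, T(y) = y; the intensities are illustrative).

```python
import math

lam0, lam1 = 2.0, 3.5
theta0, theta1 = math.log(lam0), math.log(lam1)

# Closed form: (theta0 - theta1) kappa'(theta0) - kappa(theta0) + kappa(theta1),
# with kappa(theta) = e^theta, so kappa'(theta0) = lam0.
kl_closed = (theta0 - theta1) * lam0 - lam0 + lam1

def pmf(lam, x):
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Direct evaluation of the defining expression (a sum over N_0, truncated).
kl_direct = sum(pmf(lam0, x) * math.log(pmf(lam0, x) / pmf(lam1, x)) for x in range(100))
print(kl_closed, kl_direct)  # equal up to truncation error
```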
In the next chapter we are going to introduce maximum likelihood estimation for
parameters, see Definition 3.4, below. Maximum likelihood estimators are obtained
by maximizing likelihood functions (evaluated in the observations). Maximizing
likelihood functions within the EDF is equivalent to minimizing deviance loss
functions. Deviance loss functions are based on unit deviances, which, in turn,
correspond to KL divergences. The purpose of this small section is to discuss this
relation. This should be viewed as a preparation for Chap. 4.
Assume we work within a single-parameter linear EDF, i.e., T (y) = y. Using
the canonical link h we obtain the canonical parameter θ = h(μ) ∈ Θ ⊆ R
from the mean parameter μ ∈ M. If we replace the (typically unknown) mean
parameter μ by an observation Y, supposed Y ∈ M, we get the specific model
that is exactly calibrated to this observation. This provides us with the canonical
parameter estimate θ̂Y = h(Y) for θ. We can now measure the KL divergence from
any model represented by θ to the observation calibrated model θ̂Y = h(Y). This
KL divergence is given by (we use (2.21) and we set ω = v/ϕ = 1)

DKL( f(·; θ̂Y, 1) || f(·; θ, 1) ) = ∫_R f(y; θ̂Y, 1) log( f(y; θ̂Y, 1) / f(y; θ, 1) ) dν(y)
                                = (h(Y) − θ) Y − κ(h(Y)) + κ(θ) ≥ 0.
This latter object is the unit deviance (up to factor 2) of the chosen EDF. It plays a
crucial role in predictive modeling.
We define the unit deviance under the assumption that κ is steep as follows:

d : C̊ × M → R+,   (y, μ) → d(y, μ) = 2 ( y h(y) − κ(h(y)) − y h(μ) + κ(h(μ)) ) ≥ 0,    (2.25)

where C is the convex closure of the support T of Y and M is the dual parameter
space of the chosen EDF. Steepness of κ implies C̊ = M, see Theorem 2.19.
This unit deviance d is received from the KL divergence, and it is (twice) the
difference of two log-likelihood functions, one using canonical parameter h(y) and the
other one having any canonical parameter θ = h(μ) ∈ Θ̊. That is, for μ = κ′(θ),
This looks like a generalization of the Gaussian distribution, where the square
difference (y − μ)2 in the exponent is replaced by the unit deviance d(y, μ) with
μ = κ (θ ). This interpretation gets further support by the following lemma.
Lemma 2.22 Under Assumption 2.6 and the assumption that the cumulant function
κ is steep, the unit deviance d (y, μ) ≥ 0 of the chosen EDF is zero if and only if
y = μ. Moreover, the unit deviance d (y, μ) is twice continuously differentiable
w.r.t. (y, μ) in C̊ × M, and
∂²d(y, μ)/∂μ² |_{y=μ} = ∂²d(y, μ)/∂y² |_{y=μ} = −∂²d(y, μ)/∂μ∂y |_{y=μ} = 2/V(μ) > 0.
Proof The positivity and the if and only if statement follows from Lemma 2.21 and
the strict convexity of κ. Continuous differentiability follows from the smoothness
of κ in the interior of Θ. Moreover we have

∂²d(y, μ)/∂μ² |_{y=μ} = 2 ∂/∂μ ( −y h′(μ) + μ h′(μ) ) |_{y=μ} = 2 h′(μ) = 2/κ′′(h(μ)) = 2/V(μ) > 0,
where V (μ) is the variance function of the chosen EDF introduced in Corol-
lary 2.14. The remaining second derivatives are received by similar (straightfor-
ward) calculations.
Remarks 2.23
• Lemma 2.22 shows that the unit deviance definition of d(y, μ) provides a so-
called regular unit deviance according to Definition 1.1 in Jørgensen [203].
Moreover, any model that can be brought into the form (2.27) for a (regular) unit
deviance is called (regular) reproductive dispersion model, see Definition 1.2 of
Jørgensen [203].
• In general the unit deviance d(y, μ) is not symmetric in its two arguments y and
μ, we come back to this in Fig. 11.1, below.
More generally, the KL divergence and the unit deviance can be embedded into
the framework of Bregman loss functions [50]. We restrict to the single-parameter
EDF case. Assume that ψ : C̊ → R is a strictly convex function. The Bregman
divergence w.r.t. ψ between y and μ is defined by Dψ(y, μ) = ψ(y) − ψ(μ) −
ψ′(μ)(y − μ). For the strictly convex choice ψ(y) = y h(y) − κ(h(y)) this gives

Dψ(y, μ) = y h(y) − κ(h(y)) + κ(h(μ)) − h(μ) y = (1/2) d(y, μ).    (2.29)
with μ = κ′(θ) = exp{θ}. This Poisson unit deviance will commonly be used for
model fitting and forecast evaluation, see, e.g., (5.28).
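The Poisson unit deviance referred to here follows from (2.25) with h(μ) = log μ and κ(θ) = e^θ, namely d(y, μ) = 2(y log(y/μ) − y + μ); since the display itself is not reproduced above, this formula is restated here as a standard fact. A small sketch checking that it equals twice a log-likelihood difference:

```python
import math

def poisson_unit_deviance(y, mu):
    # d(y, mu) = 2 (y log(y/mu) - y + mu), with the convention y log y = 0 at y = 0;
    # obtained from (2.25) with h(mu) = log(mu) and kappa(theta) = e^theta.
    return 2.0 * ((y * math.log(y / mu) if y > 0 else 0.0) - y + mu)

def loglik(y, mu):
    # Poisson log-likelihood in the mean parametrization, dropping terms free of mu.
    return y * math.log(mu) - mu

y, mu = 3.0, 2.2
d = poisson_unit_deviance(y, mu)
print(d, 2.0 * (loglik(y, y) - loglik(y, mu)))  # identical
```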
The off-diagonal terms in Fisher's information matrix I(θ) are non-zero which
means that the two components of the canonical parameter θ interact. Choosing
a different parametrization μ = θ2/(−θ1) (dual mean parametrization) and α = θ2
we receive a diagonal Fisher's information in (μ, α)

I(μ, α) = diag( α/μ² , (Γ′′(α)Γ(α) − Γ′(α)²)/Γ(α)² − 1/α ) = diag( α/μ² , Ψ′(α) − 1/α ),    (2.30)
where Ψ is the digamma function, see Footnote 2 on page 22. This transformation
is obtained by using the corresponding Jacobian matrix for the variable transformation;
more details are provided in (3.16) below. In this new representation, the parameters
μ and α are orthogonal; the term Ψ′(α) − 1/α is further discussed in Remarks 5.26
and Remarks 5.28, below.
Using this second parametrization based on mean μ and dispersion 1/α, we
arrive at the EDF representation of the gamma model. This allows us to calculate the
corresponding unit deviance (within the EDF), which in the gamma case is given by
d(Y, μ) = 2 ( Y/μ − 1 + log(μ/Y) ) ≥ 0.
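A sketch checking the two defining properties of this unit deviance against Lemma 2.22 (the value of μ is illustrative): it vanishes at Y = μ, and its curvature in Y at Y = μ is 2/V(μ) = 2/μ².

```python
import math

def gamma_unit_deviance(y, mu):
    # Gamma unit deviance d(y, mu) = 2 (y/mu - 1 + log(mu/y)).
    return 2.0 * (y / mu - 1.0 + math.log(mu / y))

mu = 1.7
eps = 1e-4
# Central second difference in y at y = mu.
curv = (gamma_unit_deviance(mu + eps, mu)
        - 2.0 * gamma_unit_deviance(mu, mu)
        + gamma_unit_deviance(mu - eps, mu)) / eps ** 2
print(gamma_unit_deviance(mu, mu), curv, 2.0 / mu ** 2)
```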
Example 2.26 (Inverse Gaussian Model) Our final example considers the inverse
Gaussian vector-valued parameter EF case. We consider the cumulant function
κ(θ) = −2(θ1θ2)^{1/2} − (1/2) log(−2θ2) for θ = (θ1, θ2)′ ∈ Θ = (−∞, 0] × (−∞, 0),
see Sect. 2.1.3. For the KL divergence from model θ1 to model θ0 we receive

DKL(f(·; θ0) || f(·; θ1)) = −θ1,1 ((−θ0,2)/(−θ0,1))^{1/2} − θ1,2 ((−θ0,1)/(−θ0,2))^{1/2} − 2 (θ1,1 θ1,2)^{1/2}
                         + (θ0,2 − θ1,2)/(−2θ0,2) + (1/2) log((−θ0,2)/(−θ1,2)) ≥ 0.
Again the off-diagonal terms in Fisher’s information matrix I(θ ) are non-zero in
the canonical parametrization. We switch to the mean parametrization by setting
d(Y, μ) = (Y − μ)² / (μ² Y) ≥ 0.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 3
Estimation Theory
This chapter gives an introduction to decision and estimation theory. This intro-
duction is based on the books of Lehmann [243, 244], the lecture notes of Künsch
[229] and the book of Van der Vaart [363]. This chapter presents classical statistical
estimation theory; it embeds estimation into a historical context, and it provides
important aspects and intuition for modern data science and predictive modeling.
For further reading we recommend the books of Barndorff-Nielsen [23], Berger
[31], Bickel–Doksum [33] and Efron–Hastie [117].
A : Y → A,   Y ↦ A(Y),    (3.1)
where Eθ is the expectation w.r.t. the probability distribution P (·; θ ). Risk func-
tion (3.3) describes the long-term average loss of using decision rule A. As an
example we may think of estimating γ (θ ) for unknown (true) parameter θ by a
decision rule Y → A(Y ). Then, the loss function L(θ, A(Y )) should describe the
estimation loss if we consider the discrepancy between γ (θ ) and its estimate A(Y ),
and the risk function R(θ, A) is the average estimation loss in that case.
Good decision rules A should provide a small risk R(θ, A). Unfortunately, this
statement is of rather theoretical nature because, in general, the true data generating
parameter θ is not known and the goodness of a decision rule for the true parameter
cannot be evaluated explicitly, but the risk can only be estimated (for instance, using
a bootstrap approach). Moreover, typically, there does not exist a uniformly best decision rule $A$ over all $\theta \in \Theta$. For these reasons we may (just) try to eliminate decision rules that are obviously not good. We give two introductory examples.
Example 3.2 (Minimax Decision Rule) Decision rule $A$ is called minimax if for all alternative decision rules $\widetilde{A} : \mathcal{Y} \to \mathbb{A}$ we have
$$\sup_{\theta \in \Theta} R(\theta, A) \;\le\; \sup_{\theta \in \Theta} R(\theta, \widetilde{A}).$$
3.2 Parameter Estimation 51
A minimax decision rule is the best choice in the worst case of the true θ , i.e., it
minimizes the worst case risk.
The above examples give two possible choices of decision rules. The first one
tries to minimize the worst case risk, whereas the second one uses additional knowledge in terms of a prior distribution $\pi$ on $\Theta$. This means that we impose stronger
assumptions in the second case to get stronger conclusions. The difficult part in
practice is to justify these stronger assumptions in order to validate the stronger
conclusions. Below, we are going to introduce other criteria that should be satisfied
by good decision rules, an important one in estimation will be unbiasedness.
for $1 \le i \le n$. Note that these random variables are not i.i.d. because they may differ in exposures $v_i > 0$. Throughout, we assume that Assumption 2.6 is fulfilled and that the cumulant function $\kappa$ is steep, see Theorem 2.19. For the latter we also refer to Remark 2.20: the supports $\mathfrak{T}_{v_i/\varphi}$ of $Y_i$ may differ; however, these supports share the same convex closure.
Independence between the Yi ’s implies that the joint probability P (·; θ ) is the
product distribution of the individual distributions F (·; θ, vi /ϕ), 1 ≤ i ≤ n.
Therefore, the MLE of $\theta$ in the EDF is found by solving
$$\widehat{\theta}^{\rm MLE} = \underset{\widetilde{\theta} \in \Theta}{\arg\max}\; \ell_{\boldsymbol{Y}}(\widetilde{\theta}) = \underset{\widetilde{\theta} \in \Theta}{\arg\max}\; \sum_{i=1}^n \frac{Y_i \widetilde{\theta} - \kappa(\widetilde{\theta})}{\varphi / v_i}.$$
Since the cumulant function $\kappa$ is strictly convex we receive the MLE (subject to existence)
$$\widehat{\theta}^{\rm MLE} = \widehat{\theta}^{\rm MLE}(\boldsymbol{Y}) = (\kappa')^{-1}\left(\frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i}\right) = h\left(\frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i}\right).$$
Thus, the MLE is received by applying the canonical link $h = (\kappa')^{-1}$, see Definition 2.8, and strict convexity of $\kappa$ implies that the MLE is unique. However, existence needs to be analyzed more carefully! It may happen that the MLE $\widehat{\theta}^{\rm MLE}$ is a boundary point of the effective domain $\Theta$, which may not exist (if $\Theta$ is open). We give an example. Assume we work in the Poisson model presented in Sect. 2.1.2. The canonical link in the Poisson model is the log-link $\mu \mapsto h(\mu) = \log(\mu)$, for $\mu > 0$. With positive probability we have in the Poisson case $\sum_{i=1}^n v_i Y_i = 0$.
Therefore, with positive probability the MLE $\widehat{\theta}^{\rm MLE}$ does not exist (we have a degenerate Poisson model in that case).
Since the canonical link is strictly increasing we can also perform MLE in the dual (mean) parametrization. The dual parameter space is given by $\mathcal{M} = \kappa'(\mathring{\Theta})$, see Remarks 2.9, with mean parameters $\mu = \kappa'(\theta) \in \mathcal{M}$. This motivates
$$\widehat{\mu}^{\rm MLE} = \underset{\widetilde{\mu} \in \mathcal{M}}{\arg\max}\; \ell_{\boldsymbol{Y}}(h(\widetilde{\mu})) = \underset{\widetilde{\mu} \in \mathcal{M}}{\arg\max}\; \sum_{i=1}^n \frac{Y_i h(\widetilde{\mu}) - \kappa(h(\widetilde{\mu}))}{\varphi / v_i}. \tag{3.4}$$
Also this dual MLE does not need to exist (in the dual parameter space $\mathcal{M}$). Under the assumption that the cumulant function $\kappa$ is steep, we know that the closure $\overline{\mathcal{M}}$ of the dual parameter space contains the supports $\mathfrak{T}_{v_i/\varphi}$ of $Y_i$, see Theorem 2.19 and Remark 2.20. Thus, in that case we can close the dual parameter space and receive the MLE $\widehat{\mu}^{\rm MLE} \in \overline{\mathcal{M}}$ (in a possibly degenerate model). In the aforementioned degenerate Poisson situation we receive $\widehat{\mu}^{\rm MLE} = 0$, which lies in the boundary $\partial \mathcal{M}$ of the dual parameter space.
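To make the exposure-weighted MLE concrete, the following minimal Python sketch (with made-up claim counts and exposures) computes the Poisson MLE on the dual and canonical scales; the all-zero portfolio produces the degenerate boundary solution $\widehat{\mu}^{\rm MLE} = 0$, for which no canonical-scale MLE exists.

```python
import math

def poisson_mle(Y, v):
    """MLE of the mean parameter: exposure-weighted average of the observations."""
    mu_hat = sum(vi * yi for vi, yi in zip(v, Y)) / sum(v)
    # canonical scale via the canonical (log-)link; undefined for mu_hat = 0
    theta_hat = math.log(mu_hat) if mu_hat > 0 else None
    return mu_hat, theta_hat

# illustrative data (assumed): claim counts Y_i with exposures v_i
Y = [0, 2, 1, 0, 3]
v = [1.0, 2.0, 1.5, 0.5, 2.0]
mu_hat, theta_hat = poisson_mle(Y, v)

# degenerate case: all-zero observations push the MLE to the boundary of M
mu0, theta0 = poisson_mle([0, 0, 0], [1.0, 1.0, 1.0])
```

The second call illustrates the text's point: $\widehat{\mu}^{\rm MLE} = 0 \in \partial\mathcal{M}$ exists in the closed dual parameter space, while $\widehat{\theta}^{\rm MLE} = \log \widehat{\mu}^{\rm MLE}$ does not.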
$$\widehat{\theta}^{\rm Bayes} = \widehat{\theta}^{\rm Bayes}(\boldsymbol{Y}) = \mathbb{E}_\pi\left[\theta \,|\, \boldsymbol{Y}\right],$$
where the conditional expectation on the right-hand side is calculated under the posterior distribution $\pi(\theta | y) \propto p(y; \theta)\, \pi(\theta)$ for a given observation $\boldsymbol{Y} = y$.
Example 3.7 (Bayesian Estimator) Assume that $\mathbb{A} = \Theta = \mathbb{R}$ and choose the square loss function $L(\theta, a) = (\theta - a)^2$. Assume that for $\nu$-a.e. $y \in \mathcal{Y}$ the following decision rule $A : \mathcal{Y} \to \mathbb{A}$ exists
$$A(y) = \underset{a \in \mathbb{A}}{\arg\min}\; \mathbb{E}_\pi\left[(\theta - a)^2 \,\big|\, \boldsymbol{Y} = y\right], \tag{3.6}$$
where the expectation is calculated w.r.t. the posterior distribution $\pi(\theta | y)$. In this case, $A$ is a Bayesian decision rule w.r.t. $\pi$ and $L(\theta, a) = (\theta - a)^2$: by assumption (3.6) we have for any other decision rule $\widetilde{A} : \mathcal{Y} \to \mathbb{A}$, $\nu$-a.s.,
$$\mathbb{E}_\pi\left[(\theta - A(\boldsymbol{Y}))^2 \,\big|\, \boldsymbol{Y} = y\right] \;\le\; \mathbb{E}_\pi\left[(\theta - \widetilde{A}(\boldsymbol{Y}))^2 \,\big|\, \boldsymbol{Y} = y\right].$$
Applying the tower property we receive for any other decision rule $\widetilde{A}$
$$\int R(\theta, A)\, d\pi(\theta) = \mathbb{E}\left[(\theta - A(\boldsymbol{Y}))^2\right] \;\le\; \mathbb{E}\left[(\theta - \widetilde{A}(\boldsymbol{Y}))^2\right] = \int R(\theta, \widetilde{A})\, d\pi(\theta),$$
where the expectation E is calculated over the joint distribution of Y and θ . This
proves that A is a Bayesian decision rule w.r.t. π and L(θ, a) = (θ − a)2 , see
Example 3.3. Finally, note that the conditional expectation given in Definition 3.6 is
the minimizer of (3.6). This justifies the name Bayesian estimator in Definition 3.6
(for the square loss function). The case of the Bayesian estimator for a general loss
function L is considered in Theorem 4.1.1 of Lehmann [244].
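The Bayesian estimator of Definition 3.6 can be made concrete in a conjugate setting. The sketch below assumes i.i.d. Poisson observations with a Gamma$(\alpha, \beta)$ prior (a standard conjugate pair, chosen here purely for illustration); the posterior mean is then available in closed form and can be rewritten as a credibility-weighted average of the prior mean and the sample mean.

```python
def poisson_gamma_posterior_mean(Y, alpha, beta):
    """Bayesian estimator E[lambda | Y] for i.i.d. Poisson(lambda) data under a
    Gamma(alpha, beta) prior (shape/rate): the posterior is
    Gamma(alpha + sum(Y), beta + n), so the posterior mean is in closed form."""
    n = len(Y)
    return (alpha + sum(Y)) / (beta + n)

Y = [1, 0, 2, 1]                      # hypothetical claim counts
bayes = poisson_gamma_posterior_mean(Y, alpha=2.0, beta=1.0)

# credibility form: weighted average of the prior mean and the sample mean
n = len(Y)
cred = (1.0 / (1.0 + n)) * (2.0 / 1.0) + (n / (1.0 + n)) * (sum(Y) / n)
```

The agreement of the two computations reflects the linear (credibility) structure of conjugate Bayesian estimators.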
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n Y_i^l = \mathbb{E}_\theta\left[Y_1^l\right].$$
Assume that the following map is invertible (on suitable range definitions for (3.7)–
(3.8))
The MLE, the Bayesian estimator and the method of moments estimator are the
most commonly used parameter estimators. They may have additional properties
(under certain assumptions) that we are going to explore below. In the remainder of
this section we give an additional view on estimators which is based on the empirical
distribution of the observation Y .
Assume that the components Yi of Y are real-valued and i.i.d. F distributed. The
empirical distribution induced by the observation Y = (Y1 , . . . , Yn ) is given by
$$\widehat{F}_n(y) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{Y_i \le y\}} \qquad \text{for } y \in \mathbb{R}, \tag{3.9}$$
we also refer to Fig. 1.2 (lhs). The Glivenko–Cantelli theorem [64, 159] tells us that the empirical distribution $\widehat{F}_n$ converges uniformly to $F$, a.s., for $n \to \infty$.
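The empirical distribution (3.9) and the Glivenko–Cantelli statement can be illustrated numerically; the sketch below simulates Exponential(1) data (an arbitrary choice, not taken from the text) and evaluates the sup-distance between $\widehat{F}_n$ and $F$ on a grid.

```python
import bisect
import math
import random

def empirical_cdf(sample):
    """Return the empirical distribution function F_n of (3.9)."""
    ys = sorted(sample)
    n = len(ys)
    return lambda y: bisect.bisect_right(ys, y) / n   # #{Y_i <= y} / n

random.seed(1)
n = 20000
sample = [random.expovariate(1.0) for _ in range(n)]  # i.i.d. Exponential(1)
F_n = empirical_cdf(sample)
F = lambda y: 1.0 - math.exp(-y)                      # true distribution function

# Glivenko-Cantelli: the sup-distance (approximated on a grid) is small for large n
sup_dist = max(abs(F_n(y) - F(y)) for y in [0.1 * j for j in range(1, 60)])
```

For this sample size the grid-approximated sup-distance is of the order $1/\sqrt{n}$, i.e., well below $0.05$.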
assuming that $f(\cdot\,; \widetilde{\theta})$ are densities w.r.t. a $\sigma$-finite measure on $\mathbb{R}$. Assume that $F$ has density $f$ w.r.t. the $\sigma$-finite measure $\nu$ on $\mathbb{R}$. Then, we can rewrite the above as
$$Q(F) = \underset{\widetilde{\theta}}{\arg\min}\; \int \log\left(\frac{f(y)}{f(y; \widetilde{\theta})}\right) f(y)\, d\nu(y) = \underset{\widetilde{\theta}}{\arg\min}\; D_{\rm KL}\left(f \,\big\|\, f(\cdot\,; \widetilde{\theta})\right).$$
The latter is the Kullback–Leibler (KL) divergence which we have met in Sect. 2.3. Lemma 2.21 states that the KL divergence is non-negative, and it is zero if and only if the two densities $f$ and $f(\cdot\,; \widetilde{\theta})$ are identical, $\nu$-a.s. This implies that $Q(F(\cdot\,; \theta)) = \theta$. Thus, $Q$ is Fisher-consistent for $\theta \in \Theta$, assuming identifiability, see Remarks 3.1.
Next, we use this Fisher-consistent functional (KL divergence) to receive the MLE. Replace the unknown distribution $F$ by the empirical one to receive
$$Q(\widehat{F}_n) = \underset{\widetilde{\theta}}{\arg\max}\; \frac{1}{n} \sum_{i=1}^n \log f(Y_i; \widetilde{\theta}) = \widehat{\theta}^{\rm MLE},$$
where we have used that the empirical density $\widehat{f}_n$ allocates point masses of size $1/n$ to the i.i.d. observations $Y_1, \ldots, Y_n$. Thus, the MLE $\widehat{\theta}^{\rm MLE}$ of $\theta$ can be obtained by choosing the model $f(\cdot\,; \widetilde{\theta})$, $\widetilde{\theta} \in \Theta$, that is closest in KL divergence to the empirical distribution $\widehat{F}_n$ of the i.i.d. observations $Y_i \sim F$. Note that in this construction we do not assume that the true distribution $F$ is in $\mathcal{F}$, see Definition 3.9.
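The identity "MLE = model closest in KL divergence to $\widehat{F}_n$" can be checked directly on a finite candidate grid. The sketch below (Poisson candidates, invented counts) selects the same parameter by maximum likelihood and by minimum KL divergence, since the two objectives differ only by the entropy of $\widehat{F}_n$, which does not depend on the candidate model.

```python
import math
from collections import Counter

# observations (assumed integer counts) and a finite Poisson candidate grid
Y = [2, 1, 3, 0, 2, 2, 1]
grid = [0.5 + 0.1 * j for j in range(30)]   # candidate mean parameters

def pois_logpmf(y, mu):
    return y * math.log(mu) - mu - math.lgamma(y + 1)

# (i) maximum likelihood over the grid
mle = max(grid, key=lambda mu: sum(pois_logpmf(y, mu) for y in Y))

# (ii) minimum KL divergence between the empirical distribution and the model;
# the entropy term of the empirical distribution is the same for every mu
n = len(Y)
freq = Counter(Y)
def kl_to_model(mu):
    return sum((c / n) * (math.log(c / n) - pois_logpmf(y, mu))
               for y, c in freq.items())
kl_min = min(grid, key=kl_to_model)
```

Both selections agree, and they sit next to the exact (continuous) Poisson MLE, the sample mean.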
Remarks 3.11
• Many properties of estimators of $\theta$ are based on properties of Fisher-consistent functionals $Q$ (in cases where they exist). For instance, asymptotic properties as $n \to \infty$ are obtained from smoothness properties of Fisher-consistent functionals $Q$, or using the influence function we can analyze the impact of individual observations $Y_i$ on decision rules $\widehat{\theta} = \widehat{\theta}(\boldsymbol{Y}) = Q(\widehat{F}_n)$. The latter is the basis of robust statistics, see Huber [194] and Hampel et al. [180]. Since Fisher-consistent functionals do not require that the true distribution belongs to $\mathcal{F}$, a careful consideration of the quantity to be estimated is required.
• The discussion on parameter estimation has implicitly assumed that the true data generating model belongs to the family $\mathcal{P} = \{P(\cdot\,; \theta);\, \theta \in \Theta\}$, and the only problem was to find the true parameter in $\Theta$. More generally, one should also consider model uncertainty w.r.t. the chosen family $\mathcal{P}$, i.e., the data generating model may not belong to this family. Of course, this problem is by far more difficult. We explore this in more detail in Sect. 11.1.4, below.
Above we have stated some quality criteria for decision rules like the minimax
property. A crucial property in financial applications is the so-called unbiasedness
(for mean estimates) because this guarantees that the overall (price) levels are
correctly specified.
3.3 Unbiased Estimators 57
$$\mathbb{E}_\theta\left[A(\boldsymbol{Y})\right] = \gamma(\theta). \tag{3.10}$$
$$\mathrm{Var}_\theta\left(A(\boldsymbol{Y})\right) \;\ge\; \frac{\left(\gamma'(\theta)\right)^2}{\mathcal{I}(\theta)}.$$
where we have used that the support of the random variables does not depend on θ
and that the domain of θ is open.
Secondly, we set $U = A(\boldsymbol{Y})$ in (3.12). Similarly to above, using unbiasedness w.r.t. $\gamma$, we have
$$\mathrm{Cov}_\theta\left(A(\boldsymbol{Y}),\, \frac{p(\boldsymbol{Y}; \theta + \epsilon) - p(\boldsymbol{Y}; \theta)}{p(\boldsymbol{Y}; \theta)}\right) = \int_{\mathcal{Y}} A(y)\, \frac{p(y; \theta + \epsilon) - p(y; \theta)}{p(y; \theta)}\, p(y; \theta)\, d\nu(y) = \gamma(\theta + \epsilon) - \gamma(\theta).$$
$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial \theta} \log p(\boldsymbol{Y}; \theta)\right)^2\right] = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial \theta^2} \log p(\boldsymbol{Y}; \theta)\right]. \tag{3.14}$$
Fisher’s information I(θ ) expresses the variance of the score s(θ, Y ). Iden-
tity (3.14) justifies the notion Fisher’s information in Sect. 2.3 for the EF.
• In order to determine the Cramér–Rao information bound for unknown $\theta$ we need to estimate Fisher's information $\mathcal{I}(\theta)$ from the available data. There are two different ways to do so, either we choose
$$\widehat{\mathcal{I}}(\widehat{\theta}\,) = \mathbb{E}_{\widehat{\theta}}\left[\left(\frac{\partial}{\partial \theta} \log p(\boldsymbol{Y}; \theta)\right)^2\right],$$
Proposition 3.15 The unbiased decision rule $A$ for $\gamma$ attains the Cramér–Rao information bound if and only if the density is of the form $p(y; \theta) = \exp\left\{\delta(\theta) T(y) - \beta(\theta) + a(y)\right\}$ with $T = A$. In that case we have $\gamma'(\theta) = \beta'(\theta)/\delta'(\theta)$.
for 1 ≤ l, j ≤ k.
Remarks 3.17
• Whenever an unbiased decision rule $A(\boldsymbol{Y})$ for $\gamma(\theta)$ meets the Cramér–Rao information bound it is UMVU. Thus, it minimizes the risk function $R(\theta, A)$ based on the square loss $L(\theta, a) = (\gamma(\theta) - a)^2$ among all unbiased decision rules, because unbiasedness for $\gamma(\theta)$ gives $R(\theta, A) = \mathrm{Var}_\theta(A(\boldsymbol{Y}))$.
• The regularity conditions in Theorem 3.16 include that Fisher’s information
matrix I(θ ) is positive definite.
• Under additional regularity conditions we have the following identity for Fisher's information matrix
$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[\left(\nabla_\theta \log p(\boldsymbol{Y}; \theta)\right)\left(\nabla_\theta \log p(\boldsymbol{Y}; \theta)\right)^\top\right] = -\mathbb{E}_\theta\left[\nabla_\theta^2 \log p(\boldsymbol{Y}; \theta)\right] \;\in\; \mathbb{R}^{k \times k}.$$
Consider a change of parametrization
$$\zeta \in \mathbb{R}^r \;\mapsto\; \theta = \theta(\zeta) \in \mathbb{R}^k,$$
such that all derivatives $\partial \theta_l(\zeta)/\partial \zeta_j$ exist for $1 \le l \le k$ and $1 \le j \le r$. The Jacobian matrix is given by
$$J(\zeta) = \left(\frac{\partial}{\partial \zeta_j}\, \theta_l(\zeta)\right)_{1 \le l \le k,\, 1 \le j \le r} \;\in\; \mathbb{R}^{k \times r}.$$
Fisher's information matrix under the new parametrization $\zeta$ then takes the form
$$\mathcal{I}^*(\zeta) = J(\zeta)^\top\, \mathcal{I}(\theta(\zeta))\, J(\zeta). \tag{3.16}$$
This formula is used quite frequently, e.g., in generalized linear models when
changing the parametrization of the models.
An immediate consequence of Corollary 2.5 is that the expected value of the score is zero for any $\theta \in \mathring{\Theta}$. This then reads as
$$\mathbb{E}_\theta\left[s(\theta, \boldsymbol{Y})\right] = \mathbb{E}_\theta\left[S(\boldsymbol{Y})\right] - n\, \nabla_\theta \kappa(\theta) = 0.$$
Thus, the statistics $S(\boldsymbol{Y})/n$ is an unbiased decision rule for the mean $\mu = \nabla_\theta \kappa(\theta)$, and we can study its Cramér–Rao information bound. Fisher's information matrix is given by the positive definite matrix
$$\mathcal{I}(\theta) = \mathcal{I}_n(\theta) = \mathbb{E}_\theta\left[s(\theta, \boldsymbol{Y})\, s(\theta, \boldsymbol{Y})^\top\right] = -\mathbb{E}_\theta\left[\nabla_\theta^2 \log p(\boldsymbol{Y}; \theta)\right] = n\, \nabla_\theta^2 \kappa(\theta) \;\in\; \mathbb{R}^{k \times k}.$$
Recall that $\mathcal{I}(\theta)^{-1}$ scales as $n^{-1}$, see (3.15). This provides us with the following corollary.
The UMVU property stated in Corollary 3.18 is, in general, not related to MLE, but within the EF there is the following link. We have (subject to existence)
$$\widehat{\theta}^{\rm MLE} = \underset{\widetilde{\theta} \in \Theta}{\arg\max}\; p(\boldsymbol{Y}; \widetilde{\theta}) = \underset{\widetilde{\theta} \in \Theta}{\arg\max}\; \left\{\widetilde{\theta}^\top S(\boldsymbol{Y}) - n\, \kappa(\widetilde{\theta})\right\} = h\left(\frac{1}{n} S(\boldsymbol{Y})\right), \tag{3.20}$$
where $h = (\nabla_\theta \kappa)^{-1}$ is the canonical link of this EF, see Definition 2.8; and where we need to ensure that a solution to (3.20) exists; e.g., the solution to (3.20) might be at the boundary of $\Theta$ which may cause problems, see Example 3.5.¹ Because the cumulant function $\kappa$ is strictly convex (in a minimal representation), we receive the uniqueness of the MLE.
¹ Another example where there does not exist a proper solution to the MLE problem (3.20) is, for instance, obtained within the 2-dimensional Gaussian EF if we have only one single observation $Y_1$. Intuitively this is clear because we cannot estimate two parameters from one observation $T(Y_1) = (Y_1, Y_1^2)^\top$.
The dual parameter space $\mathcal{M} = \nabla_\theta \kappa(\Theta) \subseteq \mathbb{R}^k$ has been introduced in Remarks 2.9. If $S(\boldsymbol{Y})/n$ is contained in $\mathcal{M}$, then this MLE is a proper solution; otherwise, because we have assumed that the cumulant function $\kappa$ is steep, the MLE exists in the closure $\overline{\mathcal{M}}$, see Theorem 2.19, and it is UMVU for $\mu$, see Corollary 3.18.
$$\sum_{i=1}^n \mathbb{E}_{\widehat{\mu}^{\rm MLE}}\left[T(Y_i)\right] = n\, \widehat{\mu}^{\rm MLE} = S(\boldsymbol{Y}).$$
Remarks 3.20
• The balance property is a very important property in insurance pricing because it implies that the portfolio is priced on the right level: we have unbiasedness
$$\mathbb{E}_\theta\left[\sum_{i=1}^n \mathbb{E}_{\widehat{\mu}^{\rm MLE}}\left[T(Y_i)\right]\right] = \mathbb{E}_\theta\left[S(\boldsymbol{Y})\right] = n \mu. \tag{3.21}$$
• The statistic $S(\boldsymbol{Y})$ is a sufficient statistic of $\boldsymbol{Y}$; this follows from the factorization criterion, see Theorem 1.5.2 of Lehmann [244].
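A minimal numerical check of the balance property, here in the single-parameter Poisson case with exposures (data invented for illustration): under the MLE, the fitted aggregate equals the observed aggregate.

```python
# balance property: with the MLE, the fitted aggregate equals the observed one
Y = [0, 2, 1, 0, 3]              # claim counts (assumed data)
v = [1.0, 2.0, 1.5, 0.5, 2.0]    # exposures

mu_hat = sum(vi * yi for vi, yi in zip(v, Y)) / sum(v)   # exposure-weighted MLE
fitted_total = sum(vi * mu_hat for vi in v)              # sum_i E_{mu_hat}[v_i Y_i]
observed_total = sum(vi * yi for vi, yi in zip(v, Y))    # sum_i v_i Y_i
```

The equality of the two totals holds by construction of the MLE, not only in expectation; this is exactly the "priced on the right level" statement.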
The single-parameter linear EDF case is very similar to the above vector-valued
parameter EF case. We briefly summarize the main results in the EDF case.
Recall Example 3.5: assume that $Y_1, \ldots, Y_n$ are independent having densities w.r.t. $\sigma$-finite measures on $\mathbb{R}$ (not being concentrated in a single point) given by, see (2.14),
$$Y_i \sim f(y_i; \theta, v_i/\varphi) = \exp\left\{\frac{y_i \theta - \kappa(\theta)}{\varphi/v_i} + a(y_i; v_i/\varphi)\right\}, \tag{3.23}$$
for $1 \le i \le n$. Note that these random variables are not i.i.d. because they may differ in the exposures $v_i > 0$. The MLE of $\mu = \kappa'(\theta)$, $\theta \in \mathring{\Theta}$, is found by, see (3.5),
$$\widehat{\mu}^{\rm MLE} = \underset{\widetilde{\mu} \in \mathcal{M}}{\arg\max}\; \sum_{i=1}^n \frac{Y_i h(\widetilde{\mu}) - \kappa(h(\widetilde{\mu}))}{\varphi/v_i} = \frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i}, \tag{3.24}$$
and we again have the balance property
$$\sum_{i=1}^n \mathbb{E}_{\widehat{\mu}^{\rm MLE}}\left[v_i Y_i\right] = \sum_{i=1}^n v_i\, \widehat{\mu}^{\rm MLE} = \sum_{i=1}^n v_i Y_i.$$
The score is given by
$$s(\theta, \boldsymbol{Y}) = \frac{\partial}{\partial \theta} \log p(\boldsymbol{Y}; \theta) = \frac{\partial}{\partial \theta} \sum_{i=1}^n \frac{v_i}{\varphi}\left(\theta Y_i - \kappa(\theta)\right) = \sum_{i=1}^n \frac{v_i}{\varphi}\left(Y_i - \kappa'(\theta)\right).$$
Of course, we have $\mathbb{E}_\theta[s(\theta, \boldsymbol{Y})] = 0$, and we receive Fisher's information for $\theta \in \mathring{\Theta}$
$$\mathcal{I}(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial \theta^2} \log p(\boldsymbol{Y}; \theta)\right] = \sum_{i=1}^n \frac{v_i}{\varphi}\, \kappa''(\theta) \;>\; 0. \tag{3.25}$$
Moreover, from Corollary 3.21 we know that this estimator is UMVU for $\lambda$, which can easily be seen, and uses Fisher's information (3.25) with dispersion parameter $\varphi = 1$
$$\mathcal{I}(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial \theta^2} \log p(\boldsymbol{Y}; \theta)\right] = \sum_{i=1}^n v_i\, \kappa''(\theta) = \lambda \sum_{i=1}^n v_i.$$
One could study many other properties of decision rules (and corresponding
estimators), for instance, admissibility or uniformly minimum risk equivariance
(UMRE), and we could also study other families of distribution functions such as
group families. We refrain from doing so because we will not need this for our
purposes.
3.4 Asymptotic Behavior of Estimators
All results above have been based on a finite sample $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top$; we add a lower index $n$ to $\boldsymbol{Y}_n$ to indicate the finite sample size $n \in \mathbb{N}$. The aim of this section is to analyze properties of decision rules when the sample size $n$ tends to infinity.
3.4.1 Consistency
$$\widehat{\mu}_n^{\rm MLE} = \frac{1}{n}\, S(\boldsymbol{Y}_n) = \frac{1}{n} \sum_{i=1}^n \left(T_1(Y_i), \ldots, T_k(Y_i)\right)^\top \;\in\; \overline{\mathcal{M}}.$$
We add a lower index $n$ to the MLE to indicate the sample size. The i.i.d. property of $Y_i$, $i \ge 1$, implies that we can apply the strong law of large numbers, which tells us that $\lim_{n\to\infty} \widehat{\mu}_n^{\rm MLE} = \mathbb{E}_\theta[T(Y_1)] = \nabla_\theta \kappa(\theta) = \mu$, a.s., for all $\theta \in \Theta$. This implies strong consistency of the sequence of MLEs $\widehat{\mu}_n^{\rm MLE}$, $n \ge 1$, for $\mu$.
We have seen that these MLEs are also UMVU for $\mu$, but if we transform them to the canonical scale $\widehat{\theta}_n^{\rm MLE}$ they are, in general, biased for $\theta$, see (3.22). However, since the cumulant function $\kappa$ is strictly convex (under a minimal representation) we receive $\lim_{n\to\infty} \widehat{\theta}_n^{\rm MLE} = \theta$, a.s., which provides strong consistency also on the canonical scale.
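A quick simulation of this consistency statement; for simplicity we take the Gaussian EF with unit variance (so the MLE of the mean is the sample mean) and watch the estimation error shrink as $n$ grows. This is a sketch under assumed parameter values, not part of the formal argument.

```python
import random

random.seed(7)
mu = 2.5   # true mean; Gaussian EF with unit variance, so the MLE is the sample mean

def mle_mean(n):
    return sum(random.gauss(mu, 1.0) for _ in range(n)) / n

# estimation errors for increasing sample sizes
errors = {n: abs(mle_mean(n) - mu) for n in (100, 10_000, 200_000)}
```

With high probability the errors are of order $1/\sqrt{n}$, in line with the law of large numbers (and anticipating the central limit theorem below).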
in the second step we assumed that F has a density f w.r.t. a σ -finite measure ν on
R.
Assume we have i.i.d. data $Y_i \sim f(\cdot\,; \theta)$, $i \ge 1$. Thus, the true data generating distribution is described by the parameter $\theta \in \Theta$. MLE requires the study of the log-likelihood function (we scale with the sample size $n$)
$$\widetilde{\theta} \;\mapsto\; \frac{1}{n}\, \ell_{\boldsymbol{Y}_n}(\widetilde{\theta}) = \frac{1}{n} \sum_{i=1}^n \log f(Y_i; \widetilde{\theta}).$$
$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \log f(Y_i; \widetilde{\theta}) = \mathbb{E}_\theta\left[\log f(Y; \widetilde{\theta})\right]. \tag{3.26}$$
Thus, if we are allowed to exchange the $\arg\max$ operation and the limit in $n \to \infty$ we receive, a.s.,
$$\lim_{n\to\infty} \widehat{\theta}_n^{\rm MLE} = \lim_{n\to\infty}\, \underset{\widetilde{\theta}}{\arg\max}\; \frac{1}{n} \sum_{i=1}^n \log f(Y_i; \widetilde{\theta}) \;\stackrel{?}{=}\; \underset{\widetilde{\theta}}{\arg\max}\; \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \log f(Y_i; \widetilde{\theta}) = \underset{\widetilde{\theta}}{\arg\max}\; \mathbb{E}_\theta\left[\log f(Y; \widetilde{\theta})\right] = Q(F(\cdot\,; \theta)) = \theta. \tag{3.27}$$
That is, we receive consistency of the MLE for $\theta$ if we are allowed to exchange the $\arg\max$ operation and the limit in $n \to \infty$. This requires regularity conditions on the considered family of distributions $\mathcal{F} = \{F(\cdot\,; \theta);\, \theta \in \Theta\}$. The case of a finite parameter space $\Theta = \{\theta_1, \ldots, \theta_J\}$ is easy; this is a simplified version of Wald's [374] consistency proof:
$$\mathbb{P}_{\theta_j}\left[\theta_j \notin \underset{\theta_k}{\arg\max}\; \frac{1}{n}\sum_{i=1}^n \log f(Y_i; \theta_k)\right] \;\le\; \sum_{k \ne j} \mathbb{P}_{\theta_j}\left[\frac{1}{n}\sum_{i=1}^n \log f(Y_i; \theta_k) > \frac{1}{n}\sum_{i=1}^n \log f(Y_i; \theta_j)\right].$$
for all $\widetilde{\theta} \in \Theta$. Strict consistency of loss and scoring functions is going to be defined formally in Sect. 4.1.3, below, and we have just seen that this plays an important role for the consistency of M-estimators in the sense of Definition 3.23.
• Consistency (3.27) assumes that the data generating model $Y \sim F$ belongs to the specified family $\mathcal{F} = \{F(\cdot\,; \theta);\, \theta \in \Theta\}$. Model uncertainty may imply that the data generating model does not belong to $\mathcal{F}$. In this situation, and if we are allowed to exchange the $\arg\max$ operation and the limit in $n$ in (3.27), the MLE will provide the model in $\mathcal{F}$ that is closest in KL divergence to the true model $F$. We come back to this in Sect. 11.1.4, below.
As mentioned above, typically, we would like to have stronger results than just
consistency. We give an introductory example based on the EF.
Example 3.27 (Asymptotic Normality of the MLE in the EF) We work under the same EF as in Example 3.24. This example has provided consistency of the sequence of MLEs $\widehat{\mu}_n^{\rm MLE}$, $n \ge 1$, for $\mu$. Note that the i.i.d. property together with the finite second moment of $T(Y_1)$ allows for an application of the multivariate central limit theorem (CLT), giving
$$\sqrt{n}\left(\widehat{\mu}_n^{\rm MLE} - \mu\right) \;\Rightarrow\; \mathcal{N}\left(0,\, \mathcal{I}_1(\theta)\right) \qquad \text{as } n \to \infty,$$
where $\theta = \theta(\mu) = (\nabla_\theta \kappa)^{-1}(\mu) \in \Theta$ for $\mu \in \mathcal{M}$, and $\mathcal{N}$ denotes the Gaussian distribution. This is the multivariate version of the CLT, and it tells us that the rate of convergence is $1/\sqrt{n}$.
This asymptotic result is stated in terms of Fisher's information matrix under the parametrization $\theta$. We transform this to the dual mean parametrization and call Fisher's information matrix under the dual mean parametrization $\mathcal{I}_1^*(\mu)$. This involves the change of variable $\mu \mapsto \theta = \theta(\mu) = (\nabla_\theta \kappa)^{-1}(\mu)$. The Jacobian matrix of this change of variable is given by $J(\mu) = \mathcal{I}_1(\theta(\mu))^{-1}$ and, thus, the transformation of Fisher's information matrix gives, see also (3.16),
$$\mathcal{I}_1^*(\mu) = J(\mu)^\top\, \mathcal{I}_1(\theta(\mu))\, J(\mu) = \mathcal{I}_1(\theta(\mu))^{-1}.$$
This allows us to express the above CLT w.r.t. Fisher's information matrix corresponding to $\mu$ and it gives us
$$\sqrt{n}\left(\widehat{\mu}_n^{\rm MLE} - \mu\right) \;\Rightarrow\; \mathcal{N}\left(0,\, \mathcal{I}_1^*(\mu)^{-1}\right) \qquad \text{as } n \to \infty. \tag{3.28}$$
On the canonical scale, the delta method provides
$$\sqrt{n}\left(\widehat{\theta}_n^{\rm MLE} - \theta\right) = \sqrt{n}\left((\nabla_\theta \kappa)^{-1}\left(\widehat{\mu}_n^{\rm MLE}\right) - (\nabla_\theta \kappa)^{-1}(\mu)\right) \tag{3.29}$$
$$\stackrel{(d)}{\Rightarrow}\; \mathcal{N}\left(0,\, J(\mu)\, \mathcal{I}_1^*(\mu)^{-1} J(\mu)^\top\right) = \mathcal{N}\left(0,\, \mathcal{I}_1(\theta)^{-1}\right) \qquad \text{as } n \to \infty.$$
We have exactly the same structural form in the two asymptotic results (3.28) and (3.29). There is a main difference: $\widehat{\mu}_n^{\rm MLE}$ is unbiased for $\mu$ whereas, in general, $\widehat{\theta}_n^{\rm MLE}$ is not unbiased for $\theta$, but we receive the same asymptotic behavior.
$$\sum_{i=1}^n \frac{\partial}{\partial \widetilde{\theta}} \log f(Y_i; \widetilde{\theta}) = \frac{\partial}{\partial \widetilde{\theta}}\, \ell_{\boldsymbol{Y}_n}(\widetilde{\theta}) = 0,$$
Sketch of Proof Fix $\theta \in \Theta$ and consider a Taylor expansion of the score $\ell_{\boldsymbol{Y}_n}'(\cdot)$ in $\theta$ for $\widehat{\theta}_n$. It is given by
$$\ell_{\boldsymbol{Y}_n}'(\widehat{\theta}_n) = \ell_{\boldsymbol{Y}_n}'(\theta) + \ell_{\boldsymbol{Y}_n}''(\theta)\left(\widehat{\theta}_n - \theta\right) + \frac{1}{2}\, \ell_{\boldsymbol{Y}_n}'''(\bar{\theta}_n)\left(\widehat{\theta}_n - \theta\right)^2,$$
for $\bar{\theta}_n \in [\theta, \widehat{\theta}_n]$. Since $\widehat{\theta}_n$ is a root of the score, the left-hand side is equal to zero. This allows us to re-arrange the above Taylor expansion as follows
$$\sqrt{n}\left(\widehat{\theta}_n - \theta\right) = \frac{\frac{1}{\sqrt{n}}\, \ell_{\boldsymbol{Y}_n}'(\theta)}{-\frac{1}{n}\, \ell_{\boldsymbol{Y}_n}''(\theta) - \frac{1}{2n}\, \ell_{\boldsymbol{Y}_n}'''(\bar{\theta}_n)\left(\widehat{\theta}_n - \theta\right)}.$$
Remarks 3.29
• A sequence $(\widehat{\theta}_n)_{n \ge 1}$ satisfying Theorem 3.28 is called efficient likelihood estimator (ELE) of $\theta$. Typically, the sequence of MLEs $\widehat{\theta}_n^{\rm MLE}$ gives such an ELE sequence, but there are counterexamples where this is not the case, see Example 3.1 in Section 6.2 of Lehmann [244]. In that example $\widehat{\theta}_n^{\rm MLE}$ exists for all $n \ge 1$, but it converges in probability to $\infty$, regardless of the value of the true parameter $\theta$.
• Any sequence of estimators that fulfills (3.30) is called asymptotically efficient, because, similarly to the Cramér–Rao information bound of Theorem 3.13, it attains $\mathcal{I}_1(\theta)^{-1}$ (which under certain assumptions is a lower variance bound except on Lebesgue measure zero, see Theorem 1.1 in Section 6.1 of Lehmann [244]). However, there are two important differences here: (1) the Cramér–Rao information bound statement needs unbiasedness of the decision rule, whereas (3.30) only requires consistency (but neither unbiasedness nor asymptotically vanishing bias); and (2) the lower bound in the Cramér–Rao statement is an effective variance (on a finite sample), whereas the quantity in (3.30) is only an asymptotic variance. Moreover, any other sequence that differs in probability from an asymptotically efficient one by less than $o(1/\sqrt{n})$ is asymptotically efficient, too.
• If we consider a differentiable function $\theta \mapsto \gamma(\theta)$, then Theorem 3.28 implies
$$\sqrt{n}\left(\gamma(\widehat{\theta}_n) - \gamma(\theta)\right) \;\Rightarrow\; \mathcal{N}\left(0,\, \frac{\left(\gamma'(\theta)\right)^2}{\mathcal{I}_1(\theta)}\right) \qquad \text{as } n \to \infty. \tag{3.31}$$
$$\widehat{\theta}_n^{\rm MLE} = \underset{\widetilde{\theta}}{\arg\max}\; \frac{1}{n} \sum_{i=1}^n \log f(Y_i; \widetilde{\theta}). \tag{3.32}$$
$$\frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial \widetilde{\theta}} \log f(Y_i; \widetilde{\theta}) = 0. \tag{3.33}$$
$$\frac{1}{n} \sum_{i=1}^n \psi(Y_i; \widetilde{\theta}) = 0, \tag{3.34}$$
for i.i.d. data $Y_i \sim F(\cdot\,; \theta)$. Suppose that the first moment of $\psi(Y_i; \widetilde{\theta})$ exists. The law of large numbers gives us, a.s., see also (3.26),
$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \psi(Y_i; \widetilde{\theta}) = \mathbb{E}_\theta\left[\psi(Y; \widetilde{\theta})\right]. \tag{3.35}$$
For rigorous statements we refer to Theorems 5.21 and 5.41 in Van der Vaart
[363]. A modification to the regression case is given in Theorem 11.6 below.
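In practice, the estimating equation (3.34) is solved numerically. A sketch in the Poisson EF, where the score gives $\psi(y; \theta) = y - e^\theta$: the averaged score $\bar{\psi}$ is strictly decreasing in $\theta$, so plain bisection recovers the closed-form root $\widehat{\theta} = \log \bar{Y}$ (the counts below are invented for illustration).

```python
import math

# Z-estimator: solve (1/n) sum_i psi(Y_i, theta) = 0 by bisection;
# psi(y, theta) = y - exp(theta) is the Poisson score, so the root must
# coincide with the closed-form MLE theta = log(mean(Y)).
Y = [2, 1, 3, 0, 2, 2, 1]

def psi_bar(theta):
    return sum(y - math.exp(theta) for y in Y) / len(Y)

lo, hi = -10.0, 10.0          # psi_bar is strictly decreasing on this bracket
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if psi_bar(mid) > 0:
        lo = mid
    else:
        hi = mid
theta_hat = 0.5 * (lo + hi)
```

For non-score choices of $\psi$ (robust M-estimation), the same bisection or a Newton step applies as long as $\bar{\psi}$ is monotone in $\theta$.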
Example 3.30 We consider the single-parameter linear EDF for given strictly convex and steep cumulant function $\kappa$ and w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$. The score equation gives the requirement
$$\frac{1}{n}\, S(\boldsymbol{Y}_n) \;\stackrel{!}{=}\; \kappa'(\theta) = \mathbb{E}_\theta[Y_1]. \tag{3.37}$$
Strict convexity implies that the right-hand side strictly increases in $\theta$. Therefore, we have at most one solution of the score equation here. We assume that the
Chapter 4
Predictive Modeling and Forecast
Evaluation
adapted versions are mainly based on the strictly consistent scoring framework of
Gneiting–Raftery [163] and Gneiting [162]. In particular, we will discuss deviance
losses in Sect. 4.1.2 that are strictly consistent scoring functions for mean estimation
and, hence, provide proper scoring rules.
where on the second last line we use the independence between Y n and Y . This
finishes the proof.
and similarly for $Y \sim f(y; \theta, v/\varphi)$. Note that all random variables share the same canonical parameter $\theta \in \mathring{\Theta}$. The MLE of $\mu \in \mathcal{M}$ based on $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top$ is found by solving, see (3.4)–(3.5),
$$\widehat{\mu}^{\rm MLE} = \widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n) = \underset{\widetilde{\mu} \in \mathcal{M}}{\arg\max}\; \ell_{\boldsymbol{Y}_n}(\widetilde{\mu}) = \underset{\widetilde{\mu} \in \mathcal{M}}{\arg\max}\; \sum_{i=1}^n \frac{Y_i h(\widetilde{\mu}) - \kappa(h(\widetilde{\mu}))}{\varphi/v_i}, \tag{4.5}$$
with canonical link $h = (\kappa')^{-1}$. Since the cumulant function $\kappa$ is strictly convex and assumed to be steep, there exists a unique solution $\widehat{\mu}^{\rm MLE} \in \overline{\mathcal{M}}$. If $\widehat{\mu}^{\rm MLE} \in \mathcal{M}$ we have a proper solution providing $\widehat{\theta}^{\rm MLE} = h(\widehat{\mu}^{\rm MLE}) \in \Theta$, otherwise $\widehat{\mu}^{\rm MLE}$ provides a degenerate model. This decision rule $\boldsymbol{Y}_n \mapsto \widehat{\mu}^{\rm MLE} = \widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n)$ is now used to predict the (independent) new random variable $Y$ and to estimate the unknown parameters $\theta$ and $\mu$, respectively. That is, we use the following predictor for $Y$
$$\boldsymbol{Y}_n \;\mapsto\; \widehat{Y} = \widehat{\mathbb{E}_\theta[Y]} = \mathbb{E}_{\widehat{\theta}^{\rm MLE}}[Y] = \widehat{\mu}^{\rm MLE} = \widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n).$$
4.1 Generalization Loss 79
Note that this predictor $\widehat{Y}$ is used to predict an unobserved (new) random variable $Y$, and it is itself a random variable as a function of (independent) past observations $\boldsymbol{Y}_n$. We calculate the MSEP in this model. Using Theorem 4.1 we obtain
$$\mathbb{E}_\theta\left[\left(Y - \widehat{\mu}^{\rm MLE}\right)^2\right] = \left(\mathbb{E}_\theta[Y] - \mathbb{E}_\theta\left[\widehat{\mu}^{\rm MLE}\right]\right)^2 + \mathrm{Var}_\theta\left(\widehat{\mu}^{\rm MLE}\right) + \mathrm{Var}_\theta(Y)$$
$$= \left(\kappa'(\theta) - \kappa'(\theta)\right)^2 + \frac{\varphi\, \kappa''(\theta)}{\sum_{i=1}^n v_i} + \frac{\varphi\, \kappa''(\theta)}{v} \tag{4.6}$$
$$= \frac{\left(\kappa''(\theta)\right)^2}{\mathcal{I}(\theta)} + \frac{\varphi\, \kappa''(\theta)}{v},$$
see (3.25) for Fisher's information $\mathcal{I}(\theta)$. In this calculation we have used that the MLE $\widehat{\mu}^{\rm MLE}$ is UMVU for $\mu = \kappa'(\theta)$ and that $\boldsymbol{Y}_n$ and $Y$ come from the same EDF with the same canonical parameter $\theta \in \mathring{\Theta}$. As a result, we are only left with estimation variance and process variance; moreover, the estimation variance asymptotically vanishes as $\sum_{i=1}^n v_i \to \infty$.
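The decomposition (4.6) into estimation variance and process variance can be verified by Monte Carlo. The sketch below uses the Poisson case with $\varphi = 1$ and unit exposures, so the theoretical MSEP is $\mu/n + \mu/v$; the seed and parameter values are arbitrary choices for illustration.

```python
import math
import random

def rpois(mu, rng):
    """Poisson sampler via Knuth's multiplication method (fine for small mu)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(3)
mu, n, v = 2.0, 50, 1.0   # true mean, past sample size, exposure of the new Y
# theoretical MSEP (4.6) in the Poisson case (phi = 1, kappa''(theta) = mu,
# unit exposures): estimation variance mu/n plus process variance mu/v
msep_formula = mu / n + mu / v

reps, se = 10_000, 0.0
for _ in range(reps):
    mu_hat = sum(rpois(mu, rng) for _ in range(n)) / n   # MLE from past data
    y_new = rpois(mu, rng)                               # independent new observation
    se += (y_new - mu_hat) ** 2
msep_mc = se / reps
```

Up to Monte Carlo error the simulated MSEP matches the formula, with the process variance $\mu/v$ clearly dominating the estimation variance $\mu/n$.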
The main estimation technique used in these notes is MLE, introduced in Definition 3.4. At this stage, MLE is unrelated to any specific scoring function $L$ because it has been received by maximizing the log-likelihood function. In this section we discuss the deviance loss function (as a scoring function) and we highlight its connection to the Bregman divergence introduced in Sect. 2.3. Based on the deviance loss function choice we rephrase Theorem 4.1 in terms of this scoring function. A theoretical foundation to these considerations will be given in Sect. 4.1.3, below.
For the derivations in this section we rely on the same single-parameter linear EDF as in Example 4.3, having a steep cumulant function $\kappa$. The MLE of $\mu = \kappa'(\theta)$ is found by solving, see (4.5),
$$\widehat{\mu}^{\rm MLE} = \widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n) = \underset{\widetilde{\mu} \in \mathcal{M}}{\arg\max}\; \sum_{i=1}^n \frac{Y_i h(\widetilde{\mu}) - \kappa(h(\widetilde{\mu}))}{\varphi/v_i} \;\in\; \overline{\mathcal{M}},$$
For the saturated model the common canonical parameter $\theta$ of the independent random variables $Y_1, \ldots, Y_n$ in (4.4) is replaced by individual canonical parameters $\theta_i$, $1 \le i \le n$. These individual canonical parameters are estimated with individual MLEs. The individual MLEs are given by, respectively,
$$\widehat{\theta}_i^{\rm MLE} = (\kappa')^{-1}(Y_i) = h(Y_i) \qquad \text{and} \qquad \widehat{\mu}_i^{\rm MLE} = Y_i \in \overline{\mathcal{M}},$$
the latter always exists because of strict convexity and steepness of $\kappa$. Since the MLE $\widehat{\mu}_i^{\rm MLE} = Y_i$ maximizes the log-likelihood, we receive for any $\mu \in \mathcal{M}$ the inequality
$$0 \;\le\; 2\left(\log f(Y_i; h(Y_i), v_i/\varphi) - \log f(Y_i; h(\mu), v_i/\varphi)\right)$$
$$= 2\, \frac{v_i}{\varphi}\left(Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h(\mu) + \kappa(h(\mu))\right) \tag{4.7}$$
$$= \frac{v_i}{\varphi}\, d(Y_i, \mu).$$
This applies, e.g., to the Poisson or Bernoulli cases for observation $Y_i = 0$; in these cases we obtain unit deviances $2\mu$ and $-2\log(1 - \mu)$, respectively.
The previous considerations (4.7)–(4.8) have been studying one single obser-
vation Yi of Y n . Aggregating over all observations in Y n (and additionally using
independence between the individual components of Y n ) we arrive at the so-called
deviance loss function
$$D(\boldsymbol{Y}_n, \mu) \;\stackrel{\rm def.}{=}\; \frac{1}{n} \sum_{i=1}^n \frac{v_i}{\varphi}\, d(Y_i, \mu) \tag{4.9}$$
$$= \frac{2}{n} \sum_{i=1}^n \frac{v_i}{\varphi}\left(Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h(\mu) + \kappa(h(\mu))\right) \;\ge\; 0.$$
The deviance loss function $D(\boldsymbol{Y}_n, \mu)$ subtracts twice the log-likelihood $\ell_{\boldsymbol{Y}_n}(\mu)$ from the one of the saturated model. Thus, it introduces a sign flip compared to (4.5). This immediately gives us the following corollary.
Corollary 4.5 (Deviance Loss Function) The MLE problem (4.5) is equivalent to solving
$$\widehat{\mu}^{\rm MLE} = \underset{\widetilde{\mu} \in \mathcal{M}}{\arg\max}\; \ell_{\boldsymbol{Y}_n}(\widetilde{\mu}) = \underset{\widetilde{\mu} \in \mathcal{M}}{\arg\min}\; D(\boldsymbol{Y}_n, \widetilde{\mu}). \tag{4.10}$$
Remarks 4.6
• Formula (4.10) replaces a maximization problem by a minimization problem
with objective function D(Y n , μ) being bounded below by zero. We can use
this deviance loss function as a loss function not only for parameter estimation,
but also as a scoring function for analyzing GLs within the EDF (similarly to
Theorem 4.1).
• We draw the link to the KL divergence discussed in Sect. 2.3. In formula (2.26) we have shown that the unit deviance is equal to the KL divergence (up to scaling with factor 2); thus, equivalently, MLE aims at minimizing the average KL divergence over all observations $\boldsymbol{Y}_n$
$$\widehat{\theta}^{\rm MLE} = \underset{\widetilde{\theta} \in \Theta}{\arg\min}\; \frac{1}{n} \sum_{i=1}^n D_{\rm KL}\left(f(\cdot\,; h(Y_i), v_i/\varphi)\,\big\|\, f(\cdot\,; \widetilde{\theta}, v_i/\varphi)\right),$$
where $\propto$ highlights that we drop all terms that do not involve $\widetilde{\theta}$. This describes the change in joint likelihood by varying the canonical parameter $\widetilde{\theta}$ over its domain $\Theta$. The first line of (4.11) is in the spirit of minimizing a weighted square loss, but the Gaussian square is replaced by the unit deviance $d$. The second line of (4.11) is in the spirit of information geometry considered in Sect. 2.3, where we try to find a canonical parameter $\widetilde{\theta}$ that has a small KL divergence to the $n$ individual models being parametrized by $h(Y_1), \ldots, h(Y_n)$; thus, the MLE $\widehat{\theta}^{\rm MLE}$ provides an optimal balance over the entire set of (independent) observations $Y_1, \ldots, Y_n$ w.r.t. the KL divergence.
• In contrast to the square loss function, the deviance loss function $D(\boldsymbol{Y}_n, \mu)$ respects the distributional properties of $\boldsymbol{Y}_n$, see (4.11). That is, if the underlying distribution allows for larger or smaller claims, this fact is appropriately valued in the deviance loss function (provided that we have chosen the right family of distributions; model uncertainty will be studied in Sect. 11.1, below).
• Assume we work in the Gaussian model. In this model we have $\kappa(\theta) = \theta^2/2$ and canonical link $h(\mu) = \mu$, see Sect. 2.1.3. This provides unit deviance in the Gaussian case $d(y, \mu) = (y - \mu)^2$, which is exactly the square loss function for action space $\mathbb{A} = \mathcal{M}$. Thus, the square loss function is most appropriate in the Gaussian case.
• As explained above, we use unit deviances $d(y, \mu)$ as a measure of discrepancy. Alternatively, as in the introduction to this section, see (4.6), we can consider Pearson's $\chi^2$-statistic which corresponds to the weighted square loss function
$$X^2(y, \mu) = \frac{(y - \mu)^2}{V(\mu)}, \tag{4.12}$$
below will have low claim frequencies which implies that μ will be small. The
appearance of a small μ in the denominator of (4.12) will imply that Pearson’s
χ 2 -statistic is not very robust in small frequency applications, in particular, if we
need to estimate this μ from Y n . Therefore, we refrain from using (4.12).
Naturally, in analogy to Theorem 4.1 and derivation (4.6), the above considerations motivate us to consider expected GLs under unit deviances within the EDF. We use the decision rule $\widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n) \in \mathbb{A} = \overline{\mathcal{M}}$ to predict a new observation $Y$.
the last identity uses independence between $\boldsymbol{Y}_n$ and $Y$, and with estimation risk function
$$\mathcal{E}\left(\mu, \widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n)\right) = \mathbb{E}_\theta\left[d\left(\mu, \widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n)\right)\right] \;>\; 0, \tag{4.14}$$
Remarks 4.8
• $\mathbb{E}_\theta[d(Y, \mu)]$ plays the role of the pure process variance (irreducible risk) of Theorem 4.1. This term does not involve any parameter estimation bias and uncertainty because it is based on the true parameter $\theta$ and $\mu = \kappa'(\theta)$, respectively. In Sect. 4.1.3, below, we are going to justify the appropriateness of this object as a tool for forecast evaluation. In particular, because the unit deviance is strictly consistent for the mean functional, the true mean $\mu = \mu(\theta)$ minimizes $\mathbb{E}_\theta[d(Y, \mu)]$, see (4.28), below.
• The second term $\mathcal{E}(\mu, A(\boldsymbol{Y}_n))$ measures parameter estimation bias and uncertainty of decision rule $A(\boldsymbol{Y}_n)$ versus the true parameter $\mu = \kappa'(\theta)$. The first remark is that we can do this for any decision rule $A$, i.e., we do not necessarily need to consider the MLE. The second remark is that we can no longer get a clear cut differentiation between a bias term and a parameter estimation uncertainty term for deviance loss functions not coming from the Gaussian distribution. We come back to this in Remarks 7.17, below, where we give more characterization to the individual terms of the expected deviance GL.
• An issue in applying Theorem 4.7 to the MLE decision rule $A(\boldsymbol{Y}_n) = \widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n)$ is that, in general, it does not lead to a finite estimation risk function. For instance, in the Poisson case we have with positive probability $\widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n) = 0$, which results in an infinite estimation risk. In order to avoid this, we need to bound the decision rule away from the boundary of $\mathcal{M}$ and $\Theta$, respectively. In the Poisson case this can be achieved by considering a decision rule $A(\boldsymbol{Y}_n) = \max\{\widehat{\mu}^{\rm MLE}(\boldsymbol{Y}_n), \epsilon\}$ for a fixed given $\epsilon \in (0, \mu = \kappa'(\theta))$. This decision rule has a bias which asymptotically vanishes as $n \to \infty$. Moreover, consistency and asymptotic normality tell us that this lower bound does not affect prediction for large sample sizes $n$ (with large probability).
• Similar to (4.3), we can also consider the deviance GL, given Y n . Under
independence of Y n and Y we have deviance GL
Example 4.9 (Estimation Risk Function in the Gaussian Case) We consider the
Gaussian case with cumulant function κ(θ ) = θ 2 /2 and canonical link h(μ) = μ.
4.1 Generalization Loss 85
The estimation risk function in the Gaussian case is, for a square integrable predictor
A(Y_n), given by

E( μ, A(Y_n) ) = ( E_θ[A(Y_n)] − μ )² + Var_θ( A(Y_n) ).

These are exactly the squared bias and the estimation variance, see (4.1). Thus, in the
Gaussian case, the MSEP and the expected deviance GL coincide. Moreover, adding
a deterministic bias c ≠ 0 to A(Y_n) increases the estimation risk function, provided
that A(Y_n) is unbiased for μ. We emphasize the latter as this is an important
property to have, and we refer to the next Example 4.10 for an example where this
property fails to hold.
Example 4.10 (Estimation Risk Function in the Poisson Case) We consider the
Poisson case with cumulant function κ(θ) = e^θ and canonical link h(μ) = log(μ).
The estimation risk function is given by (subject to existence)

E( μ, A(Y_n) ) = 2( μ log(μ) − μ − μ E_θ[log(A(Y_n))] + E_θ[A(Y_n)] ).   (4.16)

Assume that the decision rule A(Y_n) is non-deterministic and unbiased for μ. Using
Jensen's inequality, these assumptions imply for the estimation risk function

E( μ, A(Y_n) ) = 2μ( log(μ) − E_θ[log(A(Y_n))] ) > 0.
We now add a small deterministic bias c ∈ R to the unbiased estimator A(Y_n) for
μ. This gives us the estimation risk function, see (4.16) and subject to existence,

E( μ, A(Y_n) + c ) = 2( μ log(μ) − μ E_θ[log(A(Y_n) + c)] + c ).

Consider the derivative w.r.t. the bias c in c = 0; we use Jensen's inequality on the last line,

(∂/∂c) E( μ, A(Y_n) + c ) |_{c=0} = 2( −μ E_θ[ 1/(A(Y_n) + c) ] |_{c=0} + 1 )
                                  = −2μ E_θ[ 1/A(Y_n) ] + 2
                                  < −2μ · 1/E_θ[A(Y_n)] + 2 = 0.   (4.17)
86 4 Predictive Modeling and Forecast Evaluation
Thus, the estimation risk becomes smaller if we add a small bias to the (non-
deterministic) unbiased predictor A(Y n ). This issue has been raised in Denuit et
al. [97]. Of course, this is a very unfavorable property, and it is rather different from
the Gaussian case in Example 4.9. It is essentially driven by the fact that parameter
estimation is based on a finite sample, which implies a strict inequality in (4.17)
for the finite sample estimate A(Y n ). A conclusion of this example is that if we use
expected deviance GLs for forecast evaluation we need to insist on having unbiased
predictors. This will become especially important for more complex regression
models, see Sect. 7.4.2, below.
More generally, one can prove this result of a smaller estimation risk function for
a small positive bias for any EDF member with power variance function V(μ) = μ^p
with p ≥ 1, see also (4.18) below. The proof uses the Fortuin–Kasteleyn–Ginibre
(FKG) inequality [133], which provides E_θ[A(Y_n)^{1−p}] < E_θ[A(Y_n)] E_θ[A(Y_n)^{−p}] =
μ E_θ[A(Y_n)^{−p}], giving (4.17) for power variance parameters p ≥ 1.
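This phenomenon is easy to reproduce numerically. The following Monte Carlo sketch is not from the book; the gamma setting (power variance p = 2), the sample size and all names are our own choices. It uses that, for a predictor A(Y_n) independent of Y, integrating out Y reduces the gamma estimation risk to an expectation over the predictor alone, and it shows that a small positive bias c lowers the estimation risk of the unbiased sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 1.0, 10, 200_000

# unbiased predictor A(Y_n): sample mean of n exponential (gamma) observations
A = rng.exponential(mu, size=(reps, n)).mean(axis=1)

def estimation_risk(c):
    # gamma unit deviance d(y, m) = 2*((y - m)/m + log(m/y)); for m independent
    # of Y one gets E[d(Y, m)] = 2*(mu/m - 1 + log(m) - E[log Y]), so the
    # estimation risk E(mu, A + c) only involves expectations over A
    m = A + c
    return 2 * (mu * np.mean(1.0 / m) - 1.0 + np.mean(np.log(m)) - np.log(mu))

risk_unbiased = estimation_risk(0.0)
risk_biased = estimation_risk(0.05)   # small positive bias c
print(risk_unbiased, risk_biased)     # the biased predictor has smaller risk
```

The effect is exactly the one discussed above: the risk is driven by E_θ[1/A(Y_n)] > 1/μ, which a small upward shift of the predictor mitigates.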
Remarks 4.11 (Conclusion from Examples 4.9 and 4.10 and a Further Remark)
• Working with expected deviance GLs for evaluating forecasts requires some care
because a bigger bias in the (finite sample) estimate A(Y n ) may provide a smaller
estimation risk function E(μ, A(Y n )). For this reason, we typically insist on
having unbiased predictors/forecasts. The latter is also an important requirement
in financial applications to guarantee that the overall price is set to the right level,
we refer to the balance property in Corollary 3.19 and to Sect. 7.4.2, below.
• In Theorems 4.1 and 4.7 we use independence between the predictor A(Y n )
and the random variable Y to receive the split of the expected deviance GL
into irreducible risk and estimation risk function. In regression models, this
independence between the predictor A(Y n ) and the random variable Y may
no longer hold. In that case we will still work with the expected deviance GL
Eθ [d(Y, A(Y n ))], but a clear split between estimation and forecasting will no
longer be possible, see Sect. 4.2, below.
The next example gives the most important unit deviances in actuarial modeling.
Example 4.12 (Unit Deviances) We give the most prominent examples of unit
deviances within the single-parameter linear EDF. We recall unit deviance (2.25)
d(y, μ) = 2( y h(y) − κ(h(y)) − y h(μ) + κ(h(μ)) ) ≥ 0.
Table 4.1 Unit deviances of selected distributions commonly used in actuarial science
Distribution        Cumulant function κ(θ)   Unit deviance d(y, μ)
Gaussian            θ²/2                     (y − μ)²
Gamma               −log(−θ)                 2((y − μ)/μ + log(μ/y))
Inverse Gaussian    −√(−2θ)                  (y − μ)²/(μ²y)
Poisson             e^θ                      2(μ − y − y log(μ/y))
Negative-binomial   −log(1 − e^θ)            2(y log(y/μ) − (y + 1) log((y + 1)/(μ + 1)))
For the remaining power variance cases we have: p = 1 corresponds to the Poisson
case, p = 2 gives the gamma case, the cases p < 0 do not have a steep cumulant
function, and, moreover, there are no EDF models for p ∈ (0, 1), see Theorem 2.18.
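The unit deviances of Table 4.1 are straightforward to implement; the following sketch (our own, with hypothetical function names) checks numerically that each deviance vanishes on the diagonal y = μ and is nonnegative, in line with (2.25).

```python
import numpy as np

# unit deviances d(y, mu) from Table 4.1, for y, mu > 0
deviances = {
    "gaussian": lambda y, mu: (y - mu) ** 2,
    "gamma": lambda y, mu: 2 * ((y - mu) / mu + np.log(mu / y)),
    "inverse_gaussian": lambda y, mu: (y - mu) ** 2 / (mu ** 2 * y),
    "poisson": lambda y, mu: 2 * (mu - y - y * np.log(mu / y)),
}

y = np.linspace(0.1, 5.0, 50)
for name, d in deviances.items():
    assert np.allclose(d(y, y), 0.0), name   # d(y, y) = 0
    assert np.all(d(y, 2.0) >= 0.0), name    # d(y, mu) >= 0
print("all unit deviances vanish at y = mu and are nonnegative")
```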
The unit deviance in the Bernoulli case is also called binary cross-entropy.
This binary cross-entropy has a categorical generalization, called multi-class cross-
entropy. Assume we have a categorical EF with levels {1, . . . , k + 1} and corre-
sponding probabilities p1 , . . . , pk+1 ∈ (0, 1) summing up to 1, see Sect. 2.1.4.
We denote by Y = (1{Y =1} , . . . , 1{Y =k+1} ) ∈ Rk+1 the indicator variable that
shows which level the categorical random variable Y takes; Y is called one-hot
encoding of the categorical random variable Y . Assume y is a realization of Y and
set μ = p = (p1 , . . . , pk+1 ) . The categorical (multi-class) cross-entropy loss
function is given by
d(y, μ) = d(y, p) = −2 Σ_{j=1}^{k+1} y_j log(p_j) ≥ 0.   (4.19)
D_KL(q‖p) = Σ_{j=1}^{k+1} q_j log( q_j / p_j ) = Σ_{j=1}^{k+1} q_j log(q_j) − Σ_{j=1}^{k+1} q_j log(p_j).
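In code, the multi-class cross-entropy (4.19) and the KL divergence read as follows; this small sketch (our own, with hypothetical names) verifies on a toy example with k + 1 = 3 levels that the KL divergence vanishes at q = p and is positive otherwise.

```python
import numpy as np

def cross_entropy(y_onehot, p):
    # multi-class cross-entropy (4.19): d(y, p) = -2 * sum_j y_j log(p_j)
    return -2.0 * np.sum(y_onehot * np.log(p))

def kl_divergence(q, p):
    # D_KL(q||p) = sum_j q_j log(q_j / p_j)
    return np.sum(q * np.log(q / p))

p = np.array([0.2, 0.5, 0.3])         # k + 1 = 3 class probabilities
y = np.array([0.0, 1.0, 0.0])         # one-hot encoding: level 2 observed
print(cross_entropy(y, p))            # = -2 * log(0.5)

q = np.array([0.3, 0.4, 0.3])
print(kl_divergence(q, p), kl_divergence(q, q))   # positive, and exactly 0
```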
Outlook 4.13 In the regression modeling, below, each response Yi will have its own
mean parameter μi = μ(β, x i ) which will be a function of its covariate information
x i , and β denotes a regression parameter to be estimated with MLE. In that case,
we modify the deviance loss function (4.9) to
β ↦ D(Y_n, β) = (1/n) Σ_{i=1}^n (v_i/ϕ) d(Y_i, μ_i) = (1/n) Σ_{i=1}^n (v_i/ϕ) d(Y_i, μ(β, x_i)),   (4.20)

with MLE

β̂^MLE = arg min_β D(Y_n, β).   (4.21)
If Y is a new response with covariate information x and following the same EDF as
Y n , we will evaluate the corresponding expected scaled deviance GL given by
E_β[ (v/ϕ) d(Y, μ(β̂^MLE, x)) ],   (4.22)
where Eβ is the expectation under the true regression parameter β for Y n and Y .
This will be discussed in Sect. 5.1.7, below. If we interpret (Y, x, v) as a random
vector describing a randomly selected insurance policy from our portfolio, being
independent of Y_n (and the corresponding covariate information x_i, 1 ≤ i ≤ n),
then β̂^MLE will be independent of (Y, x, v). Nevertheless, the predictor μ(β̂^MLE, x)
will introduce dependence between the chosen decision rule and Y through x, and
we no longer receive the split of the expected deviance GL as stated in Theorem 4.7;
for a related discussion we also refer to Remarks 7.17, below.
If we interpret (Y, x, v) as a randomly selected insurance policy, then the
expected GL (4.22) is evaluated under the joint (portfolio) distribution of (Y, x, v),
and the deviance loss D(Y_n, β̂^MLE) is an (in-sample) empirical version of (4.22).
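To make (4.20)–(4.21) concrete, here is a minimal simulation sketch (not from the book; the design, parameters and the fitting routine are our own choices) for a Poisson regression with log-link, where minimizing the deviance loss is equivalent to MLE:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # covariates x_i with intercept
beta_true = np.array([0.5, 0.3])
Y = rng.poisson(np.exp(X @ beta_true)).astype(float)
v, phi = np.ones(n), 1.0                               # unit exposures and dispersion

def deviance_loss(beta):
    # (4.20) with Poisson unit deviance and log-link mu(beta, x) = exp(x'beta);
    # convention y*log(y) = 0 for y = 0
    mu = np.exp(X @ beta)
    ylogy = np.where(Y > 0, Y * np.log(np.where(Y > 0, Y, 1.0)), 0.0)
    return np.mean((v / phi) * 2 * (mu - Y + ylogy - Y * np.log(mu)))

# (4.21): minimize the deviance loss, here via iteratively re-weighted least squares
beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)
    z = X @ beta + (Y - mu) / mu                       # working responses
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))

print(beta, deviance_loss(beta))  # beta is close to beta_true
```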
Remark 4.14 In (4.23) we assume that the loss function L is bounded below by
zero. This can be an advantage in applications because it gives a calibration to the
loss function. In general, this lower bound is not a necessary condition for forecast
evaluation. If we drop this lower bound property, we rather call L (only) a scoring
function. For instance, the log-likelihood log(f (y, a)) in (3.27) plays the role of a
scoring function.
The forecaster can take the position of minimizing the expected loss to choose
her/his action rule. That is, subject to existence, an optimal action w.r.t. L is received
by

â = â(F) = arg min_{a ∈ A} E_F[L(Y, a)] = arg min_{a ∈ A} ∫_C L(y, a) dF(y).   (4.24)
In this setup the scoring function L(y, a) describes the loss that the forecaster suffers
if she/he uses action a ∈ A and observation y ∈ C materializes. Since we do not
want to insist on uniqueness in (4.24) we rather think of set-valued functionals in
this section, which may provide solutions to problems like (4.24).1
We now reverse the line of arguments, and we start from a general set-valued
functional. Denote by F the family of distribution functions of interest supported
on C. Consider the set-valued functional
1 In fact, also for the MLE in Definition 3.4 we should consider a set-valued functional. We have
decided to skip this distinction to avoid any kind of complication and to not disturb the flow of
reading.
E_F[ L(Y, â) ] ≤ E_F[ L(Y, a) ],   (4.26)

for all F ∈ F, â ∈ A(F) and a ∈ A. It is strictly consistent if it is consistent and
equality in (4.26) implies that a ∈ A(F).
As stated in Theorem 1 of Gneiting [162], a loss function L is consistent for the
functional A relative to the class F if and only if, given any F ∈ F, every â ∈ A(F)
is an optimal action under L in the sense of (4.24).
We give an example. Assume we start from the functional F → A(F ) = EF [Y ]
that maps each distribution F to its expected value. In this case we do not need
to consider a set-valued functional because the expected value is a singleton (we
assume that F only contains distributions with a finite first moment). The question
then is whether we can find a loss function L such that this mean can be received by
a minimization (4.24). This question is answered in Theorem 4.19, below.
Next we relate a consistent loss function L to a proper scoring rule. A proper
scoring rule is a function R : C × F → R such that

E_F[ R(Y, F) ] ≤ E_F[ R(Y, G) ]   for all F, G ∈ F,   (4.27)

supposed that the expectations are well-defined. A scoring rule R describes the
penalty R(y, G) if the forecaster works with a distribution G and an observation y
of Y ∼ F materializes. Proper scoring rules have been promoted in Gneiting–Raftery
[163] and Gneiting [162]. They are important because they encourage the forecaster
to make honest forecasts, i.e., the forecaster has the incentive to follow her/his true
belief about the true distribution, because only this minimizes the expected penalty
in (4.27).
Theorem 4.16 (Gneiting [162, Theorem 3]) Assume that L is a consistent loss
function for the functional A relative to the class F. For each F ∈ F, let â_F ∈ A(F).
Then R(y, F) = L(y, â_F) is a proper scoring rule; it is strictly proper if L is strictly
consistent for A.
Example 4.17 Consider the unit deviance d(·, ·) : C × M → R_+ for a given EDF
F = {F(·; θ, v/ϕ); θ ∈ Θ̊} with cumulant function κ. Lemma 2.22 says that under
suitable assumptions this unit deviance d(y, μ) is zero if and only if y = μ. We
consider the mean functional on F,

A : F_θ ↦ A(F_θ) = μ(θ),

where μ = μ(θ) = κ'(θ) is the mean of the chosen EDF. Choosing the unit deviance
as loss function, we receive for any action a ∈ A the expected loss E_θ[d(Y, a)], see
(4.13). This is minimized for a = μ, and it proves that the unit deviance is strictly
consistent for the mean functional A : F_θ ↦ A(F_θ) = μ(θ) relative to the chosen
EDF F = {F(·; θ, v/ϕ); θ ∈ Θ̊}. Using Theorem 4.16, the scoring rule R(y, F_θ) =
d(y, μ(θ)) is strictly proper, i.e.,

E_θ[ d(Y, μ(θ)) ] < E_θ[ d(Y, μ(θ̃)) ]   for any θ̃ ≠ θ.

We conclude from this small example that the unit deviance is a strictly consistent
loss function for the mean functional on the chosen EDF, and this provides us with
a strictly proper scoring rule.
The convergence statement follows from the strong law of large numbers applied
to the i.i.d. random variables (Y_i, v_i), i ≥ 1, provided that the right-hand side
of (4.29) exists. Thus, the deviance loss function (4.9) is an empirical version of the
expected deviance loss function, and this approach is successful if we can exchange
the 'argmin' operator of (4.28) and the limit n → ∞ in (4.29). This closes the circle
and brings us back to the M-estimators considered in Remarks 3.26 and 3.29, which
also links forecast evaluation and M-estimation.
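Strict consistency can be visualized with a short Monte Carlo experiment (our own sketch, not from the book): minimizing the empirical expected Poisson deviance over a grid of actions recovers the mean of the observations.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.poisson(lam=2.5, size=100_000)   # true mean mu = 2.5

def empirical_expected_deviance(a):
    # empirical version of the expected Poisson deviance loss (unit weights);
    # the terms y*log(y) not depending on the action a are dropped since they
    # do not change the argmin
    return np.mean(2.0 * (a - Y * np.log(a)))

grid = np.linspace(1.0, 4.0, 301)
losses = [empirical_expected_deviance(a) for a in grid]
a_star = grid[int(np.argmin(losses))]
print(a_star)   # close to the sample mean, hence close to mu = 2.5
```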
Forecast Dominance
A consequence of Theorem 4.19 is that there are infinitely many strictly consistent
loss functions for the mean functional, and, in principle, we could choose any
of these for forecast evaluation. Choosing the unit deviance d that matches the
distribution F_θ of the observations Y_n and Y, respectively, gives us the MLE μ̂^MLE,
and we have seen that the MLE μ̂^MLE is not only unbiased for μ = κ'(θ), but it
also meets the Cramér–Rao information bound. That is, it is UMVU within the data
generating model reflected by the true unit deviance d. This provides us (in the finite
sample case) with a natural candidate for d in (4.29) and, thus, a canonical proper
scoring rule for (out-of-sample) forecast evaluation.
The previous statements have all been done under the assumption that there is
no uncertainty about the underlying family of distribution functions that generates
Y and Y n , respectively. Uncertainty was limited to the true canonical parameter θ
and the true mean μ(θ ). This situation changes under model uncertainty. Krüger–
Ziegel [227] study the question of having multiple strictly consistent loss functions
in the situation where there is no natural candidate choice. Different choices may
give different rankings to different (finite sample) predictors. Assume we have
two predictors μ̂_1 and μ̂_2 for a random variable Y. Similarly to the definition of
the expected deviance GL, we understand these predictors μ̂_1 and μ̂_2 as random
variables, and we assume that all considered random variables have a finite first
moment. Importantly, we do not assume independence between μ̂_1, μ̂_2 and Y,
and in regression models we typically receive dependence between predictors μ̂
and random variables Y through the features (covariates) x, see also Outlook 4.13.
Following Krüger–Ziegel [227] and Ehm et al. [119] we define forecast dominance
as follows.
Predictor μ̂_1 dominates predictor μ̂_2 if E[D_ψ(Y, μ̂_1)] ≤ E[D_ψ(Y, μ̂_2)]
for all Bregman divergences D_ψ with (convex) ψ supported on C, the latter being
the convex closure of the supports of Y, μ̂_1 and μ̂_2.
If we work with a fixed member of the EDF, e.g., the gamma distribution, then
we typically study the corresponding expected deviance GL for forecast evaluation
in one single model, see Theorem 4.7 and (4.29). This evaluation may involve
model risk in the decision making process, and forecast dominance provides a robust
selection criterion.
Krüger–Ziegel [227] build on Theorem 1b and Corollary 1b of Ehm et al. [119] to
prove the following theorem (which avoids having to check all convex functions ψ).
Theorem 4.21 (Theorem 2.1 of Krüger–Ziegel [227]) Predictor μ̂_1 dominates
predictor μ̂_2 if and only if for all τ ∈ C

E[ (Y − τ) 1{μ̂_1 > τ} ] ≥ E[ (Y − τ) 1{μ̂_2 > τ} ].   (4.30)
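Criterion (4.30) can be checked empirically on a constructed example (ours, not from the book): if both predictors are independent of Y, the constant predictor equal to the true mean dominates a noisy competitor, and the elementary scores in (4.30) reflect this over a grid of thresholds τ.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 400_000
mu = 1.0
Y = rng.exponential(mu, size=N)

mu1 = np.full(N, mu)                          # predictor 1: the true mean
mu2 = mu * rng.lognormal(sigma=0.5, size=N)   # predictor 2: noisy, independent of Y

def elementary_score(pred, tau):
    # the quantity appearing on both sides of (4.30): E[(Y - tau) 1{pred > tau}]
    return np.mean((Y - tau) * (pred > tau))

taus = np.linspace(0.15, 3.0, 20)
dominates = all(elementary_score(mu1, t) >= elementary_score(mu2, t) - 2e-3
                for t in taus)
print(dominates)   # True: mu1 forecast-dominates mu2 on this grid
```

The small slack in the comparison only absorbs Monte Carlo noise; the population inequalities are strict for thresholds away from μ.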
Denuit et al. [97] argue that in insurance one typically works with Tweedie’s
distributions having power variances V (μ) = μp with power variance parameters
p ≥ 1. This motivates the following weaker form of forecast dominance.
Definition 4.22 (Tweedie’s Forecast Dominance) Predictor
μ1 Tweedie-
dominates predictor
μ2 if
E dp (Y,
μ1 ) ≤ E dp (Y,
μ2 ) ,
and
μ1 >τ } ≥ E Y 1{
E Y 1{ μ2 >τ } for all τ ∈ C.
Theorem 4.21 gives necessary and sufficient conditions to have forecast dom-
inance, Proposition 4.23 gives sufficient conditions to have the weaker Tweedie’s
forecast dominance. In Theorem 7.15, below, we give another characterization of
forecast dominance in terms of convex orders, under the additional assumption that
the predictors are so-called auto-calibrated.
4.2 Cross-Validation
This section focuses on estimating the expected deviance GL (4.13) in cases where
the canonical parameter θ is not known. Of course, the same concepts apply to the
MSEP. In the remainder of this section we scale the unit deviances with v/ϕ, to
bring them in line with the deviance loss (4.9).
Here, we no longer assume that Y and A(Y n ) are independent, and in the dependent
case Theorem 4.7 does not apply. The reason for dropping the independence
assumption is that below we consider regression models of a similar type as in
Outlook 4.13. The expected deviance GL (4.31) as such is not directly useful
because it cannot be calculated if the true canonical parameter θ is not known.
Therefore, we are going to explain how it can be estimated empirically.
We start from the expected deviance GL in the EDF applied to the MLE decision
rule μ̂^MLE(Y_n). It can be rewritten as

E_θ[ (v/ϕ) d(Y, μ̂^MLE(Y_n)) ] = ∫ E_θ[ (v/ϕ) d(Y, μ̂^MLE(Y_n)) | Y_n = y_n ] dP(y_n; θ),   (4.32)
where we use the tower property for conditional expectations. In view of (4.32),
there are two things to be done:
(1) For given observations Y n = y n , we need to estimate the deviance GL, see
also (4.15),
E_θ[ (v/ϕ) d(Y, μ̂^MLE(Y_n)) | Y_n = y_n ] = E_θ[ (v/ϕ) d(Y, μ̂^MLE(y_n)) | Y_n = y_n ].   (4.33)
This is the part that we are going to solve empirically in this section.
Typically, we assume that Y and Y_n are independent; nevertheless, Y and
its MLE predictor may still be dependent because we may have a predictor
μ̂^MLE(Y_n) = μ̂^MLE(Y_n, x). That is, this predictor often depends on covariate
information x that describes Y; an example is provided in (4.22) of Outlook 4.13,
and this is different from (4.15). In that case, the decision rule A : Y × X → A
is extended by an additional covariate component x ∈ X, we refer to Sect. 5.1.1,
where X is introduced and discussed.
(2) We have to find a way to generate more observations Y n from P (y n ; θ ) in
order to evaluate the outer integral in (4.32) empirically. One way to do so is
the bootstrap method that is going to be discussed in Sect. 4.3, below.
We address the first problem of estimating the deviance GL given in (4.33).
We do this under the assumption that Y n and Y are independent. In order to
estimate (4.33) we need observations for Y . However, typically, there are no
observations available for this random variable because it is only going to be
observed in the future. For this reason, one uses past observations for both model
fitting and the GL analysis. In order to perform this analysis in a proper way, the
general paradigm is to partition the entire data into two disjoint data sets, a so-
called learning data set L = {Y_1, ..., Y_n} and a test data set T = {Y_1†, ..., Y_T†}.
If we assume that all observations in L ∪ T are independent, then we receive a
suitable observation Y n from the learning data set L that can be used for model
fitting. The test sample T can then play the role of the unobserved random variable
Y (by assumption being independent of Y n ). Note that L is only used for model
fitting and T is only used for the deviance GL evaluation, see Fig. 4.1.
This setup motivates to estimate the mean parameter μ with the MLE μ̂_L^MLE =
μ̂^MLE(Y_n) from the learning data L and Y_n, respectively, by minimizing the
deviance loss function μ ↦ D(Y_n, μ) on the learning data L, according to Corol-
lary 4.5. Then we use this predictor μ̂_L^MLE to empirically evaluate the conditional
expectation in (4.33) on T. The perception used is that we (in-sample) learn a
model on L and we out-of-sample test this model on T to see how it generalizes
to unobserved variables Y_t†, 1 ≤ t ≤ T, that are of a similar nature as Y.
D(L, μ̂_L^MLE) = (2/n) Σ_{i=1}^n (v_i/ϕ) [ Y_i h(Y_i) − κ(h(Y_i)) − Y_i h(μ̂_L^MLE) + κ(h(μ̂_L^MLE)) ],

and

D(T, μ̂_L^MLE) = (2/T) Σ_{t=1}^T (v_t†/ϕ) [ Y_t† h(Y_t†) − κ(h(Y_t†)) − Y_t† h(μ̂_L^MLE) + κ(h(μ̂_L^MLE)) ],

where the sum runs over the test sample T having exposures v_1†, ..., v_T† > 0.
For MLE we minimize the objective function (4.9); therefore, the in-sample
deviance loss D(L, μ̂_L^MLE) = D(Y_n, μ̂^MLE(Y_n)) exactly corresponds to the
minimal deviance loss (4.9) achieved on the learning data L, i.e., when using the
MLE μ̂_L^MLE = μ̂^MLE(Y_n). We call this in-sample because the same data L is
used for parameter estimation and deviance loss calculation. Typically, this loss is
biased because it uses the optimal (in-sample) parameter estimate; we also refer to
Sect. 4.2.3, below.
The out-of-sample loss D(T, μ̂_L^MLE) then empirically estimates the inner expec-
tation in (4.32). This is a proper out-of-sample analysis because the test data T
is disjoint from the learning data L on which the decision rule μ̂_L^MLE has been
trained. Note that this out-of-sample figure reflects (4.33) in the following sense.
We have a portfolio of risks (Yt† , vt† ), 1 ≤ t ≤ T , and (4.33) does not only reflect
the calculation of the deviance GL of a given risk, but also the random selection of
a risk from the portfolio. In this sense, (4.33) is an average over a given portfolio
whose description is also included in the probability Pθ .
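The learning/test paradigm can be sketched in a few lines; this toy example (ours, not from the book) uses a homogeneous Poisson model, fits the MLE on the learning data L and evaluates the deviance loss in-sample on L and out-of-sample on T:

```python
import numpy as np

rng = np.random.default_rng(4)
learn = rng.poisson(lam=2.0, size=800)   # learning data L
test = rng.poisson(lam=2.0, size=200)    # test data T, independent of L

def poisson_deviance_loss(y, mu):
    # average Poisson deviance loss with unit weights v/phi = 1;
    # convention y*log(y) = 0 for y = 0
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0)), 0.0)
    return np.mean(2.0 * (mu - y + ylogy - y * np.log(mu)))

mu_L = learn.mean()                                  # MLE fitted on L only
in_sample = poisson_deviance_loss(learn, mu_L)       # D(L, mu_L)
out_of_sample = poisson_deviance_loss(test, mu_L)    # D(T, mu_L), estimates (4.33)
print(in_sample, out_of_sample)
```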
is (only) done for estimating the deviance GL of the model learned on all
data. I.e., for prediction we work with the MLE μ̂_{L=D}^MLE, but the out-of-sample
deviance loss is estimated using this data in a different way.
The three most commonly used methods are leave-one-out, K-fold and stratified
K-fold cross-validation. We briefly describe these three cross-validation methods.
Leave-One-Out Cross-Validation
μ̂_(−i) := μ̂_{L(−i)}^MLE,

which is based on all data except observation Y_i. This observation is now used to
do an out-of-sample analysis, and averaging this over all 1 ≤ i ≤ n we receive the
leave-one-out cross-validation loss

D̂_loo = (1/n) Σ_{i=1}^n (v_i/ϕ) d(Y_i, μ̂_(−i)) = (1/n) Σ_{i=1}^n D(T_i, μ̂_(−i))   (4.34)
      = (2/n) Σ_{i=1}^n (v_i/ϕ) [ Y_i h(Y_i) − κ(h(Y_i)) − Y_i h(μ̂_(−i)) + κ(h(μ̂_(−i))) ],
K-Fold Cross-Validation
Choose a fixed integer K ≥ 2 and partition the entire data D at random into K
disjoint subsets (called folds) L1 , . . . , LK of approximately the same size. The
learning data for fixed 1 ≤ k ≤ K is then defined by L[−k] = D \ Lk and the
test data by Tk = Lk , see Fig. 4.2. Based on learning data L[−k] we calculate the
MLE

μ̂_[−k] := μ̂_{L[−k]}^MLE,

and the resulting K-fold cross-validation loss

D̂_CV = (1/K) Σ_{k=1}^K D(T_k, μ̂_[−k])
     = (1/K) Σ_{k=1}^K (1/|T_k|) Σ_{Y_i ∈ T_k} (v_i/ϕ) d(Y_i, μ̂_[−k])   (4.35)
     ≈ (1/n) Σ_{k=1}^K Σ_{Y_i ∈ T_k} (v_i/ϕ) d(Y_i, μ̂_[−k]).
The last step is an approximation because not all Tk may have exactly the same
sample size if n is not a multiple of K. We can understand (4.35) not only as a
conditional out-of-sample loss estimate in the spirit of Definition 4.24. The outer
empirical average in (4.35) also makes it suitable for an expected deviance GL
estimate according to (4.32). The variance of this empirical deviance GL is given by
(subject to existence)
Var_θ( D̂_CV ) ≈ (1/n²) Σ_{k,l=1}^K Σ_{Y_i ∈ T_k} Σ_{Y_j ∈ T_l} Cov_θ( (v_i/ϕ) d(Y_i, μ̂_[−k]), (v_j/ϕ) d(Y_j, μ̂_[−l]) ).
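A direct implementation of K-fold cross-validation (4.35), again as an illustrative sketch of our own in the homogeneous Poisson model, together with the empirical standard deviation across folds used later in (4.36):

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.poisson(lam=2.0, size=1000)

def poisson_deviance_loss(y, mu):
    # average Poisson deviance loss, convention y*log(y) = 0 for y = 0
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0)), 0.0)
    return np.mean(2.0 * (mu - y + ylogy - y * np.log(mu)))

K = 10
folds = np.array_split(rng.permutation(len(Y)), K)    # random partition L_1,...,L_K

fold_losses = []
for k in range(K):
    learn_idx = np.concatenate([folds[l] for l in range(K) if l != k])
    mu_k = Y[learn_idx].mean()                        # MLE on L_[-k]
    fold_losses.append(poisson_deviance_loss(Y[folds[k]], mu_k))

cv_loss = np.mean(fold_losses)                        # K-fold CV loss (4.35)
cv_se = np.std(fold_losses, ddof=1)                   # empirical std across folds
print(cv_loss, cv_se)
```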
A disadvantage of the above K-fold cross-validation is that it may happen that there
are two outliers in the data, and there is a positive probability that these two outliers
belong to the same subset Lk . This may substantially distort K-fold cross-validation
because in that case the subsets Lk , 1 ≤ k ≤ K, are of different quality. Stratified K-
fold cross-validation aims at distributing outliers more equally across the partition.
Order the observations Yi , 1 ≤ i ≤ n, as follows
for learning data set L = {Y_1, ..., Y_n}. This provides us with an in-sample Poisson
deviance loss of D(Y_n, μ̂_L^MLE) = D(L, μ̂_L^MLE) = 25.213 · 10^{−2}.
Since we do not have test data T, we explore tenfold cross-validation. We
therefore partition the entire data at random into K = 10 disjoint sets L_1, ..., L_10,
and compute the tenfold cross-validation loss as described in (4.35). This gives us
D̂_CV = 25.213 · 10^{−2}; thus, we receive the same value as for the in-sample loss,
which says that we do not have in-sample over-fitting, here. This is not surprising
in the homogeneous model λ = E_θ[Y_i]. We can also quantify the uncertainty in this
estimate by the corresponding empirical standard deviation, for T_k = L_k,

( (1/(K − 1)) Σ_{k=1}^K ( D(T_k, μ̂_[−k]) − D̂_CV )² )^{1/2} = 0.234 · 10^{−2}.   (4.36)
This says that there is quite some fluctuation in the data because the uncertainty in
the estimate D̂_CV = 25.213 · 10^{−2} is roughly 1%. This finishes this example, and we
will come back to it in Sect. 5.2.4, below.
−2 Σ_{i=1}^n log h_{θ̂^MLE}(Y_i) + 2 dim(θ) < −2 Σ_{i=1}^n log g_{ϑ̂^MLE}(Y_i) + 2 dim(ϑ),   (4.37)
Remarks 4.28
• AIC is neither an in-sample loss nor an out-of-sample loss to measure gen-
eralization accuracy, but it considers penalized log-likelihoods. Under certain
assumptions one can prove that asymptotically minimizing AICs is equivalent
to minimizing leave-one-out cross-validation mean squared errors.
• The two penalized log-likelihoods have to be evaluated on the same data Y_n,
and they need to consider the MLEs θ̂^MLE and ϑ̂^MLE because the justification
of AIC is based on the asymptotic normality of MLEs; otherwise there is no
mathematical justification why (4.37) should be a reasonable model selection
tool.
• AIC does not require (but allows for) nested models h_θ and g_ϑ, nor need they be
Gaussian; it is only based on asymptotic normality. We give a heuristic argument
below.
• Evaluation of (4.37) involves all terms of the log-likelihoods, also those that do
not depend on the parameters θ and ϑ.
• Both models should consider the data Y_n in the same units, i.e., AIC does not
apply if h_θ is a density for Y_i and g_ϑ is a density for cY_i. In that case, one has
to perform a transformation of variables to ensure that both densities consider
the data in the same units. We briefly highlight this by considering a Gaussian
example. We choose i.i.d. observations Y_i ∼ N(θ, σ²) for known variance σ² > 0.
Choose c > 0; we have cY_i ∼ N(ϑ = cθ, c²σ²). We obtain the MLE θ̂^MLE =
Σ_{i=1}^n Y_i/n and log-likelihood in the MLE θ̂^MLE

Σ_{i=1}^n log h_{θ̂^MLE}(Y_i) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n ( Y_i − θ̂^MLE )².

On the transformed scale we have the MLE ϑ̂^MLE = Σ_{i=1}^n cY_i/n = c θ̂^MLE and
log-likelihood in the MLE ϑ̂^MLE

Σ_{i=1}^n log g_{ϑ̂^MLE}(cY_i) = −(n/2) log(2πc²σ²) − (1/(2c²σ²)) Σ_{i=1}^n ( cY_i − c θ̂^MLE )².

Thus, we find that the two log-likelihoods differ by −n log(c), but we consider the
same model only under different measurement units of the data. The same applies
when we work, e.g., with a log-normal model or logged data in a Gaussian model.
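The −n log(c) shift is easy to verify numerically; the following sketch (ours, not from the book) evaluates the two Gaussian log-likelihoods at their MLEs and confirms that they differ exactly by −n log(c):

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma, c = 50, 1.0, 10.0
Y = rng.normal(loc=0.7, scale=sigma, size=n)

def gaussian_loglik(x, mean, var):
    # Gaussian log-likelihood with known variance
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var))

ll_orig = gaussian_loglik(Y, Y.mean(), sigma ** 2)                     # data Y_i
ll_scaled = gaussian_loglik(c * Y, c * Y.mean(), c ** 2 * sigma ** 2)  # data c*Y_i

print(ll_scaled - ll_orig, -n * np.log(c))   # the two values agree
```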
the MLE estimated models to the true model, i.e., we consider the difference
(supposed the densities are defined on the same domain)

D_KL( f ‖ h_{θ̂^MLE} ) − D_KL( f ‖ g_{ϑ̂^MLE} )
  = ∫ log( f(y)/h_{θ̂^MLE}(y) ) f(y) dν(y) − ∫ log( f(y)/g_{ϑ̂^MLE}(y) ) f(y) dν(y)
  = ∫ log( g_{ϑ̂^MLE}(y) ) f(y) dν(y) − ∫ log( h_{θ̂^MLE}(y) ) f(y) dν(y).   (4.38)
If this difference is negative, model h_{θ̂^MLE} should be preferred over model g_{ϑ̂^MLE}
because it is closer to the true model f w.r.t. the KL divergence. Thus, we need to
calculate the two integrals in (4.38). Since the true density f is not known, these
two integrals need to be estimated.
As a first idea we estimate the integrals on the right-hand side empirically using
the observations Y_n; say, the first integral is estimated by

(1/n) Σ_{i=1}^n log g_{ϑ̂^MLE}(Y_i).
However, this will lead to a biased estimate because the MLE ϑ̂^MLE exactly
maximizes this empirical estimate (as a function of ϑ). The integrals in (4.38),
on the other hand, can be interpreted as an out-of-sample calculation between
independent random variables Y_n (used for the MLE) and Y ∼ f dν used in the integral.
The bias results from the fact that in the empirical estimate the independence
gets lost. Therefore, we need to correct this estimate for the bias in order to
obtain a reasonable estimate for the difference of the KL divergences. Under the
following assumptions this bias correction is asymptotically given by −dim(ϑ)/n:
(1) √n( ϑ̂^MLE(Y_n) − ϑ_0 ) is asymptotically normally distributed N(0, Λ(ϑ_0)^{−1}) as
n → ∞, where ϑ_0 is the parameter that minimizes the KL divergence from g_ϑ to
f; we also refer to Remarks 3.26. (2) The true f is sufficiently close to g_{ϑ_0} such
that the E_f-covariance matrix of the score ∇_ϑ log g_{ϑ_0} is close to the negative E_f-
expected Hessian ∇²_ϑ log g_{ϑ_0}; see also (3.36) and Sect. 11.1.4, below. In that case,
Λ(ϑ_0) approximately corresponds to Fisher's information matrix I_1(ϑ_0) and AIC is
justified.
This shows that AIC applies if both models are evaluated under the same
observations Y n , the models need to use the MLEs, and asymptotic normality needs
to hold with limits such that the true model is close to a member of the selected
model classes {hθ ; θ } and {gϑ ; ϑ}. We remark that this is not the only set-up under
which AIC can be justified, but other set-ups do not essentially differ.
The Bayesian information criterion (BIC) is similar to AIC but in a Bayesian
context. The BIC says that model h_{θ̂^MLE} should be preferred over model g_{ϑ̂^MLE} if

−2 Σ_{i=1}^n log h_{θ̂^MLE}(Y_i) + log(n) dim(θ) < −2 Σ_{i=1}^n log g_{ϑ̂^MLE}(Y_i) + log(n) dim(ϑ),
where n is the sample size of Y n used for model fitting. The BIC has been derived
by Schwarz [331]. Therefore, it is also called Schwarz’ information criterion (SIC).
4.3 Bootstrap
The bootstrap method has been invented by Efron [115] and Efron–Tibshirani [118].
The bootstrap is used to simulate new data from either the empirical distribution F̂_n
or from an estimated model F(·; θ̂). This allows, for instance, to evaluate the outer
expectation in the expected deviance GL (4.32), which requires a data model for Y_n.
The presentation in this section is based on the lecture notes of Bühlmann–Mächler
[59, Chapter 5].
Y ↦ θ̂ = A(Y).   (4.39)
Typically, the decision rule A(·) is a known function and we would like to determine
the distributional properties of parameter estimator (4.39) as a function of the
(random) observations Y . E.g., for any measurable set C, we might want to compute
P_θ[ θ̂ ∈ C ] = P_θ[ A(Y) ∈ C ] = ∫ 1{A(y) ∈ C} dP(y; θ).   (4.40)
Since, typically, the true data generating distribution Y_i ∼ F(·; θ) is not known, the
distributional properties of θ̂ cannot be determined, also not by Monte Carlo simula-
tion. The idea behind the bootstrap is to approximate F(·; θ). Choose as approximation
to F(·; θ) the empirical distribution of the i.i.d. observations Y given by, see (3.9),
F̂_n(y) = (1/n) Σ_{i=1}^n 1{Y_i ≤ y}   for y ∈ R.
The Glivenko–Cantelli theorem [64, 159] tells us that the empirical distribution
F̂_n converges uniformly to F(·; θ), a.s., for n → ∞, so it should be a good
approximation to F(·; θ) for large n. The idea now is to simulate from the empirical
distribution F̂_n.
F̂_M^*(ϑ) = (1/M) Σ_{m=1}^M 1{θ̂^(m∗) ≤ ϑ},
P_θ[ θ̂ ∈ C ] ≈ P̂_θ[ θ̂ ∈ C ] := P*_Y[ θ̂* ∈ C ] ≈ (1/M) Σ_{m=1}^M 1{θ̂^(m∗) ∈ C},   (4.41)
where P*_Y corresponds to the bootstrap distribution of Step (1a) of the above
algorithm, and where we set θ̂* = A(Y*). This bootstrap distribution P*_Y is
empirically approximated by the empirical bootstrap distribution F̂_M^* for studying
θ̂*.
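The resampling Step (1a) and the empirical bootstrap approximations can be sketched as follows (our own minimal example; the estimator A is the sample mean):

```python
import numpy as np

rng = np.random.default_rng(7)
Y = rng.exponential(scale=2.0, size=100)   # observed data, realization of Y

def A(sample):
    # decision rule / parameter estimator (4.39)
    return sample.mean()

M = 5000
boot = np.empty(M)
for m in range(M):
    # Step (1a): resample n points with replacement from the empirical distribution
    Y_star = rng.choice(Y, size=len(Y), replace=True)
    boot[m] = A(Y_star)                    # theta^(m*)

# empirical bootstrap approximations of the mean and standard error of A(Y)
print(boot.mean(), boot.std(ddof=1))
# for comparison: the theoretical standard error of the mean is 2.0/sqrt(100) = 0.2
```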
Remarks 4.29
• The quality of the approximations in (4.41) depends on the richness of the
observation Y = (Y_1, ..., Y_n), because the bootstrap distribution

P*_Y[ θ̂* ∈ C ] = P*_{Y=y}[ θ̂* ∈ C ]

depends on the realization y of the data Y from which we generate the bootstrap
sample Y*. It also depends on M and the explicit random drawings Y_i* providing
the empirical bootstrap distribution F̂_M^*. The latter uncertainty can be controlled
since the bootstrap distribution P*_Y corresponds to a multinomial distribution, and
the Glivenko–Cantelli theorem [64, 159] applies to F̂_M^* and P*_Y for M → ∞. The
former uncertainty inherited from the realization Y = y cannot be diminished
because we cannot enrich the observation Y.
Ê_θ[ θ̂ ] := E*_Y[ θ̂* ] ≈ (1/M) Σ_{m=1}^M θ̂^(m∗),
• The previous item discusses the approximation of the bootstrap mean and
variance, respectively. Bootstrap intervals for coverage ratios need some care,
and there are different versions. The naive way of just calculating quantiles from
F̂_M^* often does not work well, and methods like a double bootstrap may need to
be considered.
• In (4.39) we have assumed that the quantity of interest is the parameter θ , but
similar considerations also apply to general decision rules estimating γ (θ ).
• The bootstrap as defined above directly acts on the observations Y_1, ..., Y_n, and the basic assumption is that these observations are i.i.d. If this is not the case, one may first need to transform the observations; for instance, one can calculate residuals and assume that these residuals are i.i.d. In more complicated cases, one even drops the i.i.d. assumption and replaces it by an identical mean and variance assumption, that is, all residuals are assumed to be independent, centered and with unit variance. This is sometimes also called residual bootstrap and it may be suitable in regression models as will be introduced below. Thus, in this latter case we estimate for each observation Y_i its mean \(\widehat{\mu}_i\) and its standard deviation \(\widehat{\sigma}_i\), for instance, using the variance function of the chosen EDF. This then allows for calculating the residuals \(\widehat{\varepsilon}_i = (Y_i - \widehat{\mu}_i)/\widehat{\sigma}_i\). For the residual bootstrap we resample the residuals \(\varepsilon_i^*\) from \(\widehat{\varepsilon}_1, \ldots, \widehat{\varepsilon}_n\). This provides bootstrap observations
\[
Y_i^* = \widehat{\mu}_i + \widehat{\sigma}_i \, \varepsilon_i^*.
\]
In a variant, the resampled residuals are additionally multiplied with independent random variables \(V_i\), giving
\[
Y_i^* = \widehat{\mu}_i + \widehat{\sigma}_i \, \varepsilon_i^* V_i.
\]
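A minimal sketch of the residual bootstrap; the fitted means and standard deviations are made-up placeholders here (in practice they come from the fitted regression model, e.g., via the variance function of the chosen EDF):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# hypothetical fitted means mu_i and standard deviations sigma_i per observation
mu_hat = rng.uniform(1.0, 3.0, size=n)
sigma_hat = 0.2 * mu_hat
Y = mu_hat + sigma_hat * rng.normal(size=n)   # illustrative observations

# residuals eps_i = (Y_i - mu_i) / sigma_i, assumed centered with unit variance
eps = (Y - mu_hat) / sigma_hat

# resample the residuals with replacement and rebuild bootstrap observations
eps_star = rng.choice(eps, size=n, replace=True)
Y_star = mu_hat + sigma_hat * eps_star
```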
\[
\mathbb{P}_{\theta}\Big[\sqrt{n}\,\big(\widehat{\theta} - \theta\big) \le z\Big] - \mathbb{P}^{*}_{Y}\Big[\sqrt{n}\,\big(\widehat{\theta}^{*} - \widehat{\theta}\,\big) \le z\Big] \;\stackrel{\text{prob.}}{\longrightarrow}\; 0,
\]
\[
\sqrt{n}\,\big(\widehat{\theta}^{*} - \widehat{\theta}\,\big) \;\stackrel{\mathbb{P}^{*}_{Y}}{\Longrightarrow}\; \mathcal{N}\big(0, I_1(\theta)^{-1}\big) \qquad \text{in probability as } n \to \infty.
\]
(2) Return \(\widehat{\theta}^{(1*)}, \ldots, \widehat{\theta}^{(M*)}\) and the resulting empirical bootstrap distribution
\[
\widehat{F}^{*}_{M}(\vartheta) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}_{\{\widehat{\theta}^{(m*)} \le \vartheta\}}.
\]
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons licence and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons licence, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons licence and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 5
Generalized Linear Models
Most of the theory in the previous chapters has been based on the assumption of
having similarity (or homogeneity) between the different observations. This was
expressed by making an i.i.d. assumption on the observations, see, e.g., Sect. 3.3.2.
In many practical applications such a homogeneity assumption is not reasonable; one may, for example, think of car insurance pricing where different car drivers have different driving experience and drive different cars, or of health insurance
where policyholders may have different genders and ages. Figure 5.1 shows a
health insurance example where the claim sizes depend on the gender and the
age of the policyholders. The most popular statistical models that are able to
cope with such heterogeneous data are the generalized linear models (GLMs). The
notion of GLMs has been introduced in the seminal work of Nelder–Wedderburn
[283] in 1972. Their work provides a unified procedure for modeling and
fitting distributions within the EDF to data having systematic differences (effects)
that can be described by explanatory variables. Today, GLMs are the state-of-the-
art statistical models in many applied fields including statistics, actuarial science
and economics. However, the specific use of GLMs in the different fields may
substantially differ. In fields like actuarial science these models are mainly used for
predictive modeling, in other fields like economics or social sciences GLMs have
become the main tool in exploring and explaining (hopefully) causal relations. For
a discussion on “predicting” versus “explaining” we refer to Shmueli [338].
It is difficult to give a good list of references for GLMs, since GLMs and their
offspring are present in almost every statistical modeling publication and in every
lecture on statistics. Classical statistical references are the books of McCullagh–
Nelder [265], Fahrmeir–Tutz [123] and Dobson [107], in the actuarial literature we
mention the textbooks (in alphabetical order) of Charpentier [67], De Jong–Heller
[89], Denuit et al. [99–101], Frees [134] and Ohlsson–Johansson [290], but this list
is far from being complete.
[Fig. 5.1: Claim sizes in health insurance as a function of the age of the policyholder, and split by gender (female vs. male); claim sizes range roughly from 9000 to 12000 over ages 20 to 100.]
with canonical parameters θ_i ∈ Θ̊, exposures v_i > 0 and dispersion parameter ϕ > 0. Throughout, we assume that the effective domain Θ has a non-empty interior.
There is a fundamental difference between (5.1) and Example 3.5. We now allow
every random variable Y_i to have its own canonical parameter θ_i ∈ Θ̊. We call
this a heterogeneous situation because the observations are allowed to differ in a
systematic way expressed by different canonical parameters. This is highlighted by
the lines in the health insurance example of Fig. 5.1 where (expected) claim sizes
differ by gender and age of policyholder.
In Sect. 4.1.2 we have introduced the saturated model where every observation Yi
has its own parameter θi . In general, if we have n observations Y = (Y1 , . . . , Yn )
we can estimate at most n parameters. The other extreme case is the homogeneous
one, meaning that θ_i = θ ∈ Θ̊ for all 1 ≤ i ≤ n. In this latter case we have exactly one parameter to estimate, and we call this model the null model, intercept model
or homogeneous model, because all components of Y are assumed to follow the
same law expressed in a single common parameter θ . Both the saturated model and
the null model may behave very poorly in predicting new observations. Typically,
the saturated model fully reflects the data Y including the noisy part (random
component, irreducible risk, see Remarks 4.2) and, therefore, it is not useful for
prediction. We also say that this model (in-sample) over-fits to the data Y and
does not generalize (out-of-sample) to new data. The null model often has a poor
predictive performance because if the data has systematic effects these cannot be
captured by a null model. GLMs try to find a good balance between these two
extreme cases, by trying to extract (only) the systematic effects from noisy data
Y . We therefore model the canonical parameters θi as a low-dimensional function
of explanatory variables which capture the systematic effects in the data. In Fig. 5.1
gender and age of policyholder play the role of such explanatory variables.
Assume that each observation Yi is equipped with a feature (explanatory variable,
covariate) x i that belongs to a fixed given feature space X . These features x i
are assumed to describe the systematic effects in the observations Yi , i.e., these
features are assumed to be appropriate descriptions of the heterogeneity between the
observations. In a nutshell, we then assume having a suitable regression function
\[
\theta : \mathcal{X} \to \mathring{\Theta}, \qquad x \mapsto \theta(x),
\]
such that θ_i = θ(x_i) for 1 ≤ i ≤ n. As a result we receive for the first moment of Y_i, see Corollary 2.14,
\[
\mu(x_i) = \mathbb{E}_{\theta(x_i)}[Y_i] = \kappa'(\theta(x_i)).
\]
We start with the discussion of the features x ∈ X . Features are also called
explanatory variables, covariates, independent variables or regressors. Throughout,
we assume that the features x = (x0 , x1 , . . . , xq ) include a first component x0 = 1,
and we choose feature space X ⊂ {1} × Rq . The inclusion of this first component
x0 = 1 is useful in what follows. We call this first component intercept or bias
component because it will be modeling an intercept of a regression model. The
null model (homogeneous model) has features that only consist of this intercept
component. For later purposes it will be useful to introduce the design matrix X
which collects the features x 1 , . . . , x n ∈ X of all responses Y1 , . . . , Yn . The design
matrix is defined by
\[
X = (x_1, \ldots, x_n)^{\top} =
\begin{pmatrix}
1 & x_{1,1} & \cdots & x_{1,q} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & \cdots & x_{n,q}
\end{pmatrix}
\in \mathbb{R}^{n \times (q+1)}. \tag{5.4}
\]

\[
x \mapsto g(\mu(x)) = g\big(\mathbb{E}_{\theta(x)}[Y]\big) = \eta(x) = \langle \beta, x \rangle = \beta_0 + \sum_{j=1}^{q} \beta_j x_j. \tag{5.5}
\]
Here, ⟨·, ·⟩ describes the scalar product in the Euclidean space R^{q+1}, θ(x) = h(μ(x)) is the resulting canonical parameter (using the canonical link \(h = (\kappa')^{-1}\)),
and η(x) is the so-called linear predictor. After applying a suitable link function g,
the systematic effects of the random variable Y with features x can be described by
a linear predictor η(x) = ⟨β, x⟩, linear in the components of x ∈ X. This gives
a particular functional form to (5.3), and the random variables Y1 , . . . , Yn share
a common regression parameter β ∈ R^{q+1}. Note that the link function g used
in (5.5) can be different from the canonical link h used to calculate θ (x) = h(μ(x)).
We come back to this distinction below.
Summary of (5.5)
1. The independent random variables Y_i follow a fixed member of the EDF (5.1) with individual canonical parameters \(\theta_i \in \mathring{\Theta}\), for all 1 ≤ i ≤ n.
2. The canonical parameters θ_i and the corresponding mean parameters μ_i are related by the canonical link \(h = (\kappa')^{-1}\) as follows: h(μ_i) = θ_i, where κ is the cumulant function of the chosen EDF, see Corollary 2.14.
3. We assume that the systematic effects in the random variables Y_i can be described by linear predictors η_i = η(x_i) = ⟨β, x_i⟩ and a strictly monotone and smooth link function g such that we have g(μ_i) = η_i = ⟨β, x_i⟩, for all 1 ≤ i ≤ n, with common regression parameter β ∈ R^{q+1}.
We can either express this GLM regression structure in the dual (mean) parameter space M or in the effective domain Θ̊, see Remarks 2.9,
where (h ◦ g −1 ) is the composition of the inverse link g −1 and the canonical link h.
For the moment, the link function g is quite general. In practice, the explicit choice
needs some care. The right-hand side of (5.5) is defined on the whole real line if at
least one component of x is two-sided unbounded. On the other hand, M and Θ̊
may be bounded sets. Therefore, the link function g may require some restrictions
such that the domain and the range fulfill the necessary constraints. The dimension
of β should satisfy 1 ≤ 1 + q ≤ n, the lower bound will provide a null model and
the upper bound a saturated model.
\[
\mu(x) = \mathbb{E}_{\theta(x)}[Y] = \langle \beta, x \rangle = \beta_0 + \sum_{j=1}^{q} \beta_j x_j,
\]
and choosing the log-link g(m) = log(m) we receive a model with multiplicative effects
\[
\mu(x) = \mathbb{E}_{\theta(x)}[Y] = \exp\langle \beta, x \rangle = e^{\beta_0} \prod_{j=1}^{q} e^{\beta_j x_j}.
\]
The latter is probably the most commonly used GLM in insurance pricing because
it leads to explainable tariffs where feature values directly relate to price decreases and increases in percentages of a base premium exp{β_0}.
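A small numeric illustration of this multiplicative structure, with made-up coefficients (the base premium of 500 and the two effects are purely hypothetical):

```python
import numpy as np

# hypothetical regression parameter; beta_0 is the log of the base premium
beta = np.array([np.log(500.0), 0.10, -0.25])
x = np.array([1.0, 1.0, 1.0])        # intercept plus two feature values

mu = np.exp(beta @ x)                # mu(x) = exp<beta, x>
factors = np.exp(beta * x)           # one multiplicative factor per component

# the price is the base premium times the individual relativities, e.g.
# beta_1 = 0.10 acts as a surcharge of exp(0.10) - 1, i.e. about +10.5%
assert np.isclose(mu, np.prod(factors))
```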
Another very popular choice is the canonical (natural) link, i.e., \(g = h = (\kappa')^{-1}\).
The canonical link substantially simplifies the analysis and it has very favorable
statistical properties (as we will see below). However, in some applications practical
needs overrule good statistical properties. Under the canonical link g = h we have
in the dual mean parameter space M and in the effective domain Θ̊, respectively,
Thus, the linear predictor η and the canonical parameter θ coincide under the
canonical link choice \(g = h = (\kappa')^{-1}\).
After having a fully specified GLM within the EDF, there remains the estimation of the regression parameter β ∈ R^{q+1}. This is done within the framework of MLE.
\[
\beta \mapsto \ell_{Y}(\beta) = \sum_{i=1}^{n} \left( \frac{v_i}{\varphi} \Big[ Y_i \, h(\mu(x_i)) - \kappa\big(h(\mu(x_i))\big) \Big] + a(Y_i; v_i/\varphi) \right), \tag{5.7}
\]
\[
\beta \mapsto \ell_{Y}(\beta) = \sum_{i=1}^{n} \left( \frac{v_i}{\varphi} \Big[ Y_i \langle \beta, x_i \rangle - \kappa \langle \beta, x_i \rangle \Big] + a(Y_i; v_i/\varphi) \right). \tag{5.8}
\]
\[
s(\beta, Y) = \nabla_{\beta} \ell_{Y}(\beta) = \sum_{i=1}^{n} \frac{v_i}{\varphi} \big[ Y_i - \mu_i \big] \nabla_{\beta} h(\mu(x_i))
\]
\[
= \sum_{i=1}^{n} \frac{v_i}{\varphi} \big[ Y_i - \mu_i \big] \frac{\partial h(\mu_i)}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i} \nabla_{\beta} \eta(x_i) \tag{5.9}
\]
\[
= \sum_{i=1}^{n} \frac{v_i}{\varphi} \frac{Y_i - \mu_i}{V(\mu_i)} \left( \frac{\partial g(\mu_i)}{\partial \mu_i} \right)^{-1} x_i,
\]
where we use the definition of the variance function \(V(\mu) = (\kappa'' \circ h)(\mu)\), see Corollary 2.14. We define the diagonal working weight matrix, which in general depends on β through the means \(\mu_i = g^{-1}\langle \beta, x_i \rangle\),
\[
W(\beta) = \operatorname{diag}\left( \left( \frac{\partial g(\mu_i)}{\partial \mu_i} \right)^{-2} \frac{v_i}{\varphi} \frac{1}{V(\mu_i)} \right)_{1 \le i \le n} \in \mathbb{R}^{n \times n},
\]
This allows us to write the score equations in a compact form, which provides the
following proposition.
Proposition 5.1 The MLE for β is found by solving the score equations
\[
s(\beta, Y) = X^{\top} W(\beta) \, R(Y, \beta) = 0,
\]
with working residuals \(R(Y, \beta) = \big( \frac{\partial g(\mu_i)}{\partial \mu_i} (Y_i - \mu_i) \big)_{1 \le i \le n}^{\top}\).
Remarks 5.2
• In general, the MLE of β is not calculated by maximizing the log-likelihood function ℓ_Y(β), but rather by solving the score equations s(β, Y) = 0; we also refer to Remarks 3.29 on M- and Z-estimators. The score equations provide the critical points for β, from which the global maximum of the log-likelihood function can be determined, provided it exists.
• Existence of a MLE of β is not always given; similarly to Example 3.5, we may face the problem that the solution lies at the boundary of the parameter space (which itself may be an open set).
• If the log-likelihood function β ↦ ℓ_Y(β) is strictly concave, then the critical point of the score equations s(β, Y) = 0 is unique, provided it exists, and, henceforth, we have a unique MLE \(\widehat{\beta}^{\mathrm{MLE}}\) for β. Below, we give cases where the strict concavity of the log-likelihood holds.
• In general, there is no closed-form solution for the MLE of β, except in the Gaussian case with canonical link; thus, we need to solve the score equations numerically.
Similarly to Remarks 3.17 we can calculate Fisher’s information matrix w.r.t. β
through the negative expected Hessian of ℓ_Y(β).
\[
\frac{\partial}{\partial \varphi} \ell_{Y}(\beta, \varphi) = \sum_{i=1}^{n} \left[ -\frac{v_i}{\varphi^2} \Big( Y_i \, h(\mu(x_i)) - \kappa\big(h(\mu(x_i))\big) \Big) + \frac{\partial}{\partial \varphi} a(Y_i; v_i/\varphi) \right] = 0, \tag{5.11}
\]
and we can plug in the MLE of β (which can be estimated independently of ϕ).
Fisher's information matrix is in this extended framework given by
\[
\mathcal{I}(\beta, \varphi) = -\mathbb{E}_{\beta}\Big[ \nabla^2_{(\beta, \varphi)} \ell_{Y}(\beta, \varphi) \Big] =
\begin{pmatrix}
X^{\top} W(\beta) X & 0 \\
0 & -\mathbb{E}_{\beta}\big[ \partial^2 \ell_{Y}(\beta, \varphi) / \partial \varphi^2 \big]
\end{pmatrix},
\]
In view of Proposition 5.1 we need a root search algorithm to obtain the MLE
of β. Typically, one uses Fisher’s scoring method or the iterative re-weighted
least squares (IRLS) algorithm to solve this root search problem. This is a main
result derived in the seminal work of Nelder–Wedderburn [283] and it explains the
popularity of GLMs, namely, GLMs can be solved efficiently by this algorithm.
Fisher’s scoring method/IRLS algorithm explore the updates for t ≥ 0 until
convergence
\[
\widehat{\beta}^{(t)} \;\to\; \widehat{\beta}^{(t+1)} = \Big( X^{\top} W\big(\widehat{\beta}^{(t)}\big) X \Big)^{-1} X^{\top} W\big(\widehat{\beta}^{(t)}\big) \Big[ X \widehat{\beta}^{(t)} + R\big(Y, \widehat{\beta}^{(t)}\big) \Big], \tag{5.12}
\]
where all terms on the right-hand side are evaluated at algorithmic time t. If we
have n observations Y = (Y1 , . . . , Yn ) we can estimate at most n parameters.
Therefore, in our GLM we assume to have a regression parameter β ∈ Rq+1 of
dimension q + 1 ≤ n. Moreover, we require that the design matrix X has full rank
q + 1 ≤ n. Otherwise the regression parameter is not uniquely identifiable since
linear dependence in the columns of X allows us to reduce the dimension of the
parameter space to a smaller representation. This is also needed to calculate the
inverse matrix in (5.12). This motivates the following assumption.
\[
\widehat{\beta}^{(t)} \;\to\; \widehat{\beta}^{(t+1)} = \widehat{\beta}^{(t)} + \widehat{\mathcal{I}}\big(\widehat{\beta}^{(t)}\big)^{-1} s\big(\widehat{\beta}^{(t)}, Y\big),
\]
where \(\widehat{\mathcal{I}}(\beta) = -\nabla^2_{\beta} \ell_{Y}(\beta)\) denotes the observed information matrix in β ∈ R^{q+1}.
The calculation of the inverse of the observed information matrix \(\widehat{\mathcal{I}}(\widehat{\beta}^{(t)})^{-1}\) can be time consuming and unstable because we need to calculate second derivatives and the eigenvalues of the observed information matrix can be close to zero. A stable scheme is obtained by replacing the observed information matrix \(\widehat{\mathcal{I}}(\beta)\) by Fisher's information matrix \(\mathcal{I}(\beta) = \mathbb{E}_{\beta}[\widehat{\mathcal{I}}(\beta)]\), being positive definite under Assumption 5.3; this provides a quasi-Newton method. Thus, for Fisher's scoring method we iterate for t ≥ 0
\[
\widehat{\beta}^{(t)} \;\to\; \widehat{\beta}^{(t+1)} = \widehat{\beta}^{(t)} + \mathcal{I}\big(\widehat{\beta}^{(t)}\big)^{-1} s\big(\widehat{\beta}^{(t)}, Y\big), \tag{5.13}
\]
and rewriting this provides us exactly with (5.12). The latter can also be interpreted as an IRLS scheme where the response g(Y_i) is replaced by an adjusted linearized version \(Z_i = g(\mu_i) + \frac{\partial g(\mu_i)}{\partial \mu_i}(Y_i - \mu_i)\). This corresponds to the last bracket in (5.12), and with corresponding weights.
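The update (5.12) is easy to implement; a sketch for the Poisson GLM with canonical log-link, where the working weights reduce to W(β) = diag(v_i μ_i/ϕ) and the adjusted response is Z_i = η_i + (Y_i − μ_i)/μ_i (data simulated for illustration, with v_i = 1 and ϕ = 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 500, 2

# simulated design matrix (intercept plus q features) and Poisson responses
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
beta_true = np.array([0.5, 0.3, -0.2])
Y = rng.poisson(np.exp(X @ beta_true)).astype(float)

beta = np.zeros(q + 1)
for _ in range(25):                   # Fisher's scoring / IRLS iterations (5.12)
    eta = X @ beta
    mu = np.exp(eta)
    W = mu                            # working weights v_i mu_i / phi
    Z = eta + (Y - mu) / mu           # adjusted linearized response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * Z))

score = X.T @ (Y - np.exp(X @ beta))  # should vanish at the MLE
```

At convergence the score equations are satisfied up to numerical precision.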
• Under the canonical link choice, Fisher’s information matrix and the observed
information matrix coincide, i.e. \(\widehat{\mathcal{I}}(\beta) = \mathcal{I}(\beta)\), and the Newton–Raphson
algorithm, Fisher’s scoring method and the IRLS algorithm are identical. This
can easily be seen from Proposition 5.1. We receive under the canonical link
choice
\[
\nabla^2_{\beta} \ell_{Y}(\beta) = -\mathcal{I}(\beta) = -X^{\top} \operatorname{diag}\left( \frac{v_i}{\varphi} V(\mu_i) \right)_{1 \le i \le n} X. \tag{5.14}
\]
Example 5.5 (Gamma Model with Log-Link) We study the gamma distribution as a
single-parameter EDF model, choosing the shape parameter α = 1/ϕ as the inverse
of the dispersion parameter, see Sect. 2.2.2. Cumulant function κ(θ ) = − log(−θ )
gives us the canonical link θ = h(μ) = −1/μ. Moreover, we choose the log-link
η = g(μ) = log(μ) for the GLM. This gives a canonical parameter θ = − exp{−η}.
We receive the score
\[
s(\beta, Y) = \nabla_{\beta} \ell_{Y}(\beta) = \sum_{i=1}^{n} \frac{v_i}{\varphi} \left( \frac{Y_i}{\mu_i} - 1 \right) x_i = X^{\top} \operatorname{diag}\left( \frac{v_i}{\varphi} \right)_{1 \le i \le n} R(Y, \beta).
\]
In the gamma model all observations Yi are strictly positive, a.s., and under the
full rank assumption q + 1 ≤ n, the observed information matrix I(β) is positive
definite, thus, we have a strictly concave log-likelihood function in the gamma case
with log-link.
Example 5.6 (Tweedie’s Models with Log-Link) We study Tweedie’s models for
power variance parameters p > 1 as a single-parameter EDF model, see Sect. 2.2.3.
The cumulant function κp is given in Table 4.1. This gives us the canonical link θ =
hp (μ) = μ1−p /(1 − p) < 0 for μ > 0 and p > 1. Moreover, we choose the log-
link η = g(μ) = log(μ) for the GLM. This implies θ = exp{(1 − p)η}/(1 − p) < 0
for p > 1. We receive the score
\[
s(\beta, Y) = \nabla_{\beta} \ell_{Y}(\beta) = \sum_{i=1}^{n} \frac{v_i}{\varphi} \frac{Y_i - \mu_i}{\mu_i^{p-1}} x_i = X^{\top} \operatorname{diag}\left( \frac{v_i}{\varphi \, \mu_i^{p-2}} \right)_{1 \le i \le n} R(Y, \beta).
\]
This matrix is positive definite for p ∈ [1, 2], and for p > 2 it is not positive definite
because (p−1)Yi −(p−2)μi may have positive or negative values if we vary μi > 0
over its domain M. Thus, we do not have concavity of the optimization problem
under the log-link choice in Tweedie’s GLMs for power variance parameters p > 2.
This in particular applies to the inverse Gaussian GLM with log-link.
Throughout this section we work under the canonical link choice \(g = h = (\kappa')^{-1}\).
This choice has very favorable statistical properties. We have already seen in
Remarks 5.4 that the derivation of the MLE of β becomes particularly easy under
the canonical link choice and the observed information matrix I(β) coincides with
Fisher’s information matrix I(β) in this case, see (5.14).
For insurance pricing, canonical links have another very remarkable property,
namely, that the estimated model automatically fulfills the balance property and,
henceforth, is unbiased. This is particularly important in insurance pricing because
it tells us that the insurance prices (over the entire portfolio) are on the right level.
We have already met the balance property in Corollary 3.19.
\[
\sum_{i=1}^{n} \mathbb{E}_{\widehat{\beta}^{\mathrm{MLE}}} \big[ v_i Y_i \big] = \sum_{i=1}^{n} v_i \, \kappa' \big\langle \widehat{\beta}^{\mathrm{MLE}}, x_i \big\rangle = \sum_{i=1}^{n} v_i Y_i.
\]
Proof The first column of the design matrix X is identically equal to 1, representing the intercept, see (5.4). The second part of Proposition 5.1 then provides for this first column of X, after canceling the (constant) dispersion ϕ,
\[
(1, \ldots, 1) \operatorname{diag}(v_1, \ldots, v_n) \, \kappa' \big( X \widehat{\beta}^{\mathrm{MLE}} \big) = (1, \ldots, 1) \operatorname{diag}(v_1, \ldots, v_n) \, Y.
\]
Remark 5.8 We mention once more that this balance property is very strong and
useful, see also Remarks 3.20. In particular, the balance property holds, even though
the chosen GLM might be completely misspecified. Misspecification may include an incorrect distributional model, a wrong choice of link function, or inappropriately pre-processed features, etc. Such misspecification will imply that
we have a poor model on an insurance policy level (observation level). However,
the total premium charged over the entire portfolio will be on the right level
(supposed that the structure of the portfolio does not change) because it matches
the observations, and henceforth, we have unbiasedness for the portfolio mean.
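The balance property can be checked numerically even under deliberate misspecification; the following sketch fits a Poisson GLM (canonical log-link) to data that are actually gamma distributed (everything simulated for illustration, with v_i = 1):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# misspecified: the data are gamma distributed, but we fit a Poisson GLM
Y = rng.gamma(shape=2.0, scale=np.exp(0.2 * X[:, 1]))

beta = np.zeros(2)
for _ in range(50):                   # IRLS under the canonical log-link
    mu = np.exp(X @ beta)
    Z = X @ beta + (Y - mu) / mu
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * Z))

# balance property: the total fitted premium matches the total observation
total_fitted = np.exp(X @ beta).sum()
total_observed = Y.sum()
```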
From the log-likelihood function (5.8) we see that under the canonical link choice we consider the statistic \(S(Y) = X^{\top} \operatorname{diag}(v_i/\varphi)_{1 \le i \le n} Y \in \mathbb{R}^{q+1}\), and to prove the balance property we have used the first component of this statistic. Considering all components, S(Y) is an unbiased estimator (decision rule) for
\[
\mathbb{E}_{\beta}[S(Y)] = X^{\top} \operatorname{diag}(v_i/\varphi)_{1 \le i \le n} \, \kappa'(X\beta) = \left( \sum_{i=1}^{n} \frac{v_i}{\varphi} \, \kappa' \langle \beta, x_i \rangle \, x_{i,j} \right)_{0 \le j \le q}. \tag{5.15}
\]
This unbiased estimator S(Y ) meets the Cramér–Rao information bound, hence
it is UMVU: taking the partial derivatives of the previous expression gives
∇β Eβ [S(Y )] = I(β), the latter also being the multivariate Cramér–Rao
information bound for the unbiased decision rule S(Y ) for (5.15). Focusing on
the first component we have
\[
\operatorname{Var}_{\beta}\left( \sum_{i=1}^{n} \mathbb{E}_{\widehat{\beta}^{\mathrm{MLE}}}[v_i Y_i] \right) = \operatorname{Var}_{\beta}\left( \sum_{i=1}^{n} v_i Y_i \right) = \sum_{i=1}^{n} \varphi v_i V(\mu_i) = \varphi^2 \big( \mathcal{I}(\beta) \big)_{0,0}, \tag{5.16}
\]
where the component (0, 0) in the last expression is the top-left entry of Fisher’s
information matrix I(β) under the canonical link choice.
Formula (5.16) quantifies the uncertainty in the premium calculation of the insur-
ance policies if we use the MLE estimated model (under the canonical link
choice). That is, this quantifies the uncertainty in the dual mean parametrization
in terms of the resulting variance. We could also focus on the MLE \(\widehat{\beta}^{\mathrm{MLE}}\) itself (for general link function g). In general, this MLE is not unbiased but we have the asymptotic normality¹
\[
\widehat{\beta}^{\mathrm{MLE}}_{n} \overset{(d)}{\approx} \mathcal{N}\big( \beta, \mathcal{I}_{n}(\beta)^{-1} \big), \tag{5.17}
\]
where \(\widehat{\beta}^{\mathrm{MLE}}_{n}\) is the MLE based on the observations Y_n = (Y_1, ..., Y_n)^⊤, and \(\mathcal{I}_n(\beta)\) is Fisher's information matrix of Y_n, which scales linearly in n in the homogeneous EF case, see Remarks 3.14, and in the homogeneous EDF case it scales as \(\sum_{i=1}^{n} v_i\), see (3.25).
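The asymptotic normality (5.17) yields approximate standard errors from the inverse Fisher information; a sketch for the Poisson GLM with canonical link, where \(\mathcal{I}_n(\beta) = X^{\top} \operatorname{diag}(v_i \mu_i/\varphi) X\) (data simulated for illustration, with v_i = ϕ = 1):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.0, 0.4])
Y = rng.poisson(np.exp(X @ beta_true)).astype(float)

# MLE via Newton-Raphson (identical to Fisher's scoring under the canonical link)
beta = np.zeros(2)
for _ in range(30):
    mu = np.exp(X @ beta)
    beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (Y - mu))

# Fisher's information matrix and Wald standard errors
mu = np.exp(X @ beta)
I_n = X.T @ (mu[:, None] * X)
se = np.sqrt(np.diag(np.linalg.inv(I_n)))

# approximate 95% confidence intervals beta_j +/- 1.96 * se_j
ci = np.stack([beta - 1.96 * se, beta + 1.96 * se])
```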
\[
\widehat{\beta}^{\mathrm{MLE}} = \arg\max_{\beta} \ell_{Y}(\beta) = \arg\max_{\beta} \sum_{i=1}^{n} \frac{v_i}{\varphi} \Big[ Y_i \, h(\mu(x_i)) - \kappa\big(h(\mu(x_i))\big) \Big],
\]
\[
\widehat{\beta}^{\mathrm{MLE}} = \arg\max_{\beta} \ell_{Y}(\beta) = \arg\min_{\beta} \sum_{i=1}^{n} \frac{v_i}{\varphi} \, d(Y_i, \mu_i), \tag{5.18}
\]
the latter satisfying d(Yi , μi ) ≥ 0 for all 1 ≤ i ≤ n, and being zero if and
only if Yi = μi , see Lemma 2.22. Thus, using the unit deviances we have a loss
function that is bounded below by zero, and we determine the regression parameter
β such that this loss is (in-sample) minimized. This can also be interpreted in a more
geometric way. Consider the (q + 1)-dimensional manifold M ⊂ R^n spanned by the GLM function
\[
\beta \in \mathbb{R}^{q+1} \;\mapsto\; \mu(\beta) = \big( g^{-1}\langle \beta, x_1 \rangle, \ldots, g^{-1}\langle \beta, x_n \rangle \big)^{\top}. \tag{5.19}
\]
1 The regularity conditions for asymptotic normality results will depend on the particular
regression problem studied, we refer to pages 43–44 in Fahrmeir–Tutz [123].
Minimization (5.18) then tries to find the point μ(β) in this manifold M ⊂ Rn
that minimizes simultaneously all unit deviances d(Yi , ·) w.r.t. the observation Y =
(Y1 , . . . , Yn ) ∈ Rn . Or in other words, the optimal parameter β is obtained by
“projecting” observation Y onto this manifold M, where “projection” is understood as a simultaneous minimization of the loss function \(\sum_{i=1}^{n} \frac{v_i}{\varphi} d(Y_i, \mu_i)\), see Fig. 5.2. In
the un-weighted Gaussian case, this corresponds to the usual orthogonal projection
as the next example shows, and in the non-Gaussian case it is understood in the KL
divergence minimization sense as displayed in formula (4.11).
Example 5.9 (Gaussian Case) Assume we have the Gaussian EDF case κ(θ ) =
θ 2 /2 with canonical link g(μ) = h(μ) = μ. In this case, the manifold (5.19) is the
linear space spanned by the columns of the design matrix X
\[
\widehat{\beta}^{\mathrm{MLE}} = \arg\min_{\beta} \sum_{i=1}^{n} \frac{v_i}{\varphi} \, d(Y_i, \mu_i) = \arg\min_{\beta} \| Y - X\beta \|_2^2,
\]
where we have used that the unit deviances in the Gaussian case are given by the square loss function, see Example 4.12. As a consequence, the MLE \(\widehat{\beta}^{\mathrm{MLE}}\) is found by orthogonally projecting Y onto M = {Xβ | β ∈ R^{q+1}} ⊂ R^n, and this orthogonal projection is given by \(X \widehat{\beta}^{\mathrm{MLE}} \in M\).
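This orthogonal projection is easy to verify numerically (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(5)
n, q = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
Y = rng.normal(size=n)

# least squares solution; X @ beta_hat is the projection of Y onto M = {X beta}
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_fit = X @ beta_hat

# orthogonality: the residual Y - X beta_hat is perpendicular to the columns of X
orth = X.T @ (Y - Y_fit)
```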
The purpose of this section is to illustrate how the concept of GLMs is used in
actuarial modeling. We therefore explore the typical actuarial examples of claim
counts and claim size modeling.
The selection of a predictive model within GLMs for solving an applied actuarial
problem requires the following choices.
Choice of the Member of the EDF Select a member of the EDF that fits the
modeling problem. In a first step, we should try to understand the properties of
the data Y before doing this selection, for instance, do we have count data, do we
have a classification problem, do we have continuous observations?
All members of the EDF are light-tailed because the moment generating function
exists around the origin, see Corollary 2.14, and the EDF is not suited to model
heavy-tailed data, for instance, having a regularly varying tail. Therefore, a datum
Y is sometimes first transformed before being modeled by a member of the EDF.
A popular transformation is the logarithm for positive observations. After this
transformation a member of the EDF can be chosen to model log(Y ). For instance,
if we choose the Gaussian distribution for log(Y ), then Y will be log-normally
distributed, or if we choose the exponential distribution for log(Y ), then Y will
be Pareto distributed, see Sect. 2.2.5. One can then model the transformed datum
with a GLM. Often this provides very accurate models, say, on the log scale for the
log-transformed data. There is one issue with this approach, namely, if a model
is unbiased on the transformed scale then it is typically biased on the original
observation scale; if the transformation is concave this easily follows from Jensen’s
inequality. The problematic part now is that the bias correction itself often has
systematic effects which means that the transformation (or the involved nuisance
parameters) should be modeled with a regression model, too, see Sect. 5.3.9. In
many cases this will not easily work, unfortunately. Therefore, if possible, clear
preference should be given to modeling the data on the original observation scale (if
unbiasedness is a central requirement).
Choice of Link Function From a statistical point of view we should choose the
canonical link g = h to connect the mean μ of the model to the linear predictor
η because this implies many favorable mathematical properties. However, as seen,
sometimes we have different needs. Practical reasons may require that we have a
model with additive or multiplicative effects, which favors the identity or the log-
link, respectively. Another requirement is that the resulting canonical parameter θ =
(h ◦ g −1 )(η) needs to be within the effective domain . If this effective domain is
bounded, for instance, if it covers the negative real line as for the gamma model,
a (transformation of the) log-link might be more suitable than the canonical link
because g −1 (·) = − exp(·) has a strictly negative range, see Example 5.5.
Thus, the features x ∈ X ⊂ Rq+1 need to be in the right functional form so that
they can appropriately describe the systematic effect via the function (5.20). We
distinguish the following feature types:
• Continuous real-valued feature components, examples are age of policyholder,
weight of car, body mass index, etc.
• Ordinal categorical feature components, examples are ratings like good-
medium-bad or A-B-C-D-E.
• Nominal categorical feature components, examples are vehicle brands, occupa-
tion of policyholders, provinces of living places of policyholders, etc. The values
that the categorical feature components can take are called levels.
• Binary feature components are special categorical features that only have two
levels, e.g. female-male, open-closed. Because binary variables often play a
distinguished role in modeling they are separated from categorical variables
which are typically assumed to have more than two levels.
All these components need to be brought into a suitable form so that they can be
used in a linear predictor η(x) = β, x , see (5.20). This requires the consideration
of the following points (1) transformation of continuous components so that they can
describe the systematic effects in a linear form, (2) transformation of categorical
components to real-valued components, (3) interaction of components beyond an
additive structure in the linear predictor, and (4) the resulting design matrix X should
have full rank q + 1 ≤ n. We are going to describe these points (1)–(4) in the next
section.
there are also other codings like effects coding or Helmert’s contrast coding.2 The
choice of the coding will not influence the predictive model (if we work with
a full rank design matrix), but it may influence parameter selection, parameter
reduction and model interpretation. For instance, the choice of the coding is (more)
important in medical studies where one tries to understand the effects between
certain therapies.
Assume that the raw feature component \(\tilde{x}_j\) is a categorical variable taking K different levels {a_1, ..., a_K}. For dummy coding we declare one level, say a_K, to be the reference level and all other levels are described relative to that reference level. Formally, this can be described by an embedding map
\[
\tilde{x}_j \;\mapsto\; x_j = \big( \mathbb{1}_{\{\tilde{x}_j = a_1\}}, \ldots, \mathbb{1}_{\{\tilde{x}_j = a_{K-1}\}} \big)^{\top} \in \mathbb{R}^{K-1}. \tag{5.21}
\]
\[
\tilde{x}_j \;\mapsto\; \exp\langle \beta, x_j \rangle = \exp\{\beta_0\} \prod_{k=1}^{K-1} \exp\big\{ \beta_k \mathbb{1}_{\{\tilde{x}_j = a_k\}} \big\}, \tag{5.22}
\]
2 There is an example of Helmert’s contrast coding in Remarks 2.7 of lecture notes [392], and for
more examples we refer to the UCLA statistical consulting website: https://stats.idre.ucla.edu/r/
library/r-library-contrast-coding-systems-for-categorical-variables/.
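Dummy coding (5.21) can be sketched directly in numpy; the level names are purely illustrative:

```python
import numpy as np

levels = ["A", "B", "C", "D"]   # K = 4 levels; the last one is the reference level

def dummy_code(x_raw, levels):
    """Map a raw categorical value to its (K-1)-dimensional dummy vector (5.21)."""
    return np.array([1.0 if x_raw == a else 0.0 for a in levels[:-1]])

# level "B" activates the second dummy; the reference level "D" maps to zeros
assert np.array_equal(dummy_code("B", levels), np.array([0.0, 1.0, 0.0]))
assert np.array_equal(dummy_code("D", levels), np.zeros(3))
```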
Remarks 5.11
• Importantly, dummy coding leads to full rank design matrices X and, henceforth,
Assumption 5.3 is fulfilled.
• Dummy coding is different from one-hot encoding which is going to be
introduced in Sect. 7.3.1, below.
• Dummy coding needs some care if we have categorical feature components with
many levels, for instance, considering car brands and car models we can get
hundreds of levels. In that case we will have sparsity in the resulting design
matrix. This may cause computational issues, and, as the following example
will show, it may lead to high uncertainty in parameter estimation. In particular,
the columns of the design matrix X of very rare levels will be almost collinear
which implies that we do not receive very well-conditioned matrices in Fisher’s
scoring method (5.12). For this reason, it is recommended to merge levels
to bigger classes. In Sect. 7.3.1, below, we are going to present a different
treatment. Categorical variables are embedded into low-dimensional spaces, so
that proximity in these spaces has a reasonable meaning for the regression task
at hand.
Example 5.12 (Balance Property and Dummy Coding) A main argument for the
use of the canonical link function has been the fulfillment of the balance property,
see Corollary 5.7. If we have categorical feature components and if we apply dummy
coding to those, then the balance property is projected down to the individual levels
of that categorical variable. Assume that columns 2 to K of design matrix X are
used to model a raw categorical feature \(\tilde{x}_1\) with K levels according to (5.21). In that case, columns 2 ≤ k ≤ K will indicate all observations Y_i which belong to level a_{k−1}. Analogously to the proof of Corollary 5.7, we receive (summation i runs over
the different instances/policies)
\[
\sum_{i:\, \tilde{x}_{i,1} = a_{k-1}} \mathbb{E}_{\widehat{\beta}^{\mathrm{MLE}}}[v_i Y_i] = \sum_{i=1}^{n} x_{i,k} \, \mathbb{E}_{\widehat{\beta}^{\mathrm{MLE}}}[v_i Y_i] = \sum_{i=1}^{n} x_{i,k} \, v_i Y_i = \sum_{i:\, \tilde{x}_{i,1} = a_{k-1}} v_i Y_i. \tag{5.23}
\]
Thus, we receive the balance property for all policies 1 ≤ i ≤ n that belong to level
ak−1 .
If we have many levels, then it will happen that some levels have only very few
observations, and the above summation (5.23) only runs over very few insurance
policies with \(\tilde{x}_{i,1} = a_{k-1}\). Suppose additionally the volumes v_i are small. This can
lead to considerable estimation uncertainty, because the estimated prices on the left-
hand side of (5.23) will be based too much on individual observations Yi having the
corresponding level, and we are not in the regime of a law of large numbers that
balances these observations.
Thus, this balance property from dummy coding is a natural property under the
canonical link choice. Actuarial pricing is very familiar with such a property. Early
130 5 Generalized Linear Models
Binary feature components do not need a treatment different from the categorical
ones; they are Bernoulli variables which can be encoded as 0 or 1. This is exactly
dummy coding for K = 2 levels.
Continuous feature components are already real-valued. Therefore, from the viewpoint
of 'variable types', the continuous feature components do not need any
pre-processing because they are already in the right format to be included in scalar
products. Nevertheless, in many cases continuous feature components also need
feature engineering because only in rare cases do they directly fit the functional
form (5.20).
We give an example. Consider car drivers that have different driving experience and
different driving skills. To explain experience and skills we typically choose the age
of driver as explanatory variable. Modeling the claim frequency as a function of the
age of driver, we often observe a U-shaped function, thus, a function that is non-
monotone in the age of driver variable. Since the link function g needs to be strictly
monotone, this regression problem cannot be modeled by (5.20) with the age of
driver directly included as a feature, because this would lead to monotonicity of the
regression function in the age of driver variable.
Typically, in such situations, the continuous variable is discretized to categorical
classes. In the driver’s age example, we build age classes. These age classes
are then treated as categorical variables using dummy coding (5.21). We will
give examples below. These age classes should fulfill the requirement of being
sufficiently homogeneous in the sense that insurance policies that fall into the
same class should have a similar propensity to claims. This implies that we would
like to have many small homogeneous classes. However, the classes should be
sufficiently large, otherwise parameter estimation involves high uncertainty, see
also Example 5.12. Thus, there is a trade-off between these two requirements.
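A minimal sketch of this discretization step (the class boundaries and helper names are illustrative assumptions, not the book's choices): the continuous driver's age is first binned into age classes, which are then dummy coded with the first class as reference level:

```python
# Sketch: discretize a continuous driver's age into categorical classes and
# apply dummy coding; boundaries below are invented for illustration.
def age_class(age, bounds=(25, 35, 45, 55, 65)):
    """Map a driver's age to a class index 0..len(bounds); class 0 is the reference."""
    k = 0
    for b in bounds:
        if age >= b:
            k += 1
    return k

def dummy_code(k, n_levels=6):
    """Dummy coding as in (5.21): reference level -> all zeros, level k -> unit vector."""
    return [1 if j == k else 0 for j in range(1, n_levels)]

for a in [19, 28, 41, 70]:
    print(a, age_class(a), dummy_code(age_class(a)))
# e.g. age 19 falls into the reference class and is encoded as [0, 0, 0, 0, 0]
```

The choice of boundaries encodes the homogeneity/size trade-off described above: narrower classes are more homogeneous but contain fewer policies.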
A disadvantage of this discretization approach is that neighboring age classes
will not be recognized by the regression function because, per se, dummy coding
is based on nominal variables not having any topology. This is also illustrated by
the fact that all categorical levels (excluding the reference level) have, in view
5.2 Actuarial Applications of Generalized Linear Models 131
$$\tilde{x}_l \;\mapsto\; \beta_1 \tilde{x}_l + \beta_2 \tilde{x}_l^2 + \beta_3 \tilde{x}_l^3 + \beta_4 \log(\tilde{x}_l), \qquad (5.24)$$
$$x \;\mapsto\; \mu(x) = \exp\langle \beta, x\rangle = \exp\{\beta_0\} \prod_{j=1}^q \exp\{\beta_j x_j\}.$$
That is, all feature components xj enter the regression function in an exponential
form. In general insurance, one may have specific variables for which it is explicitly
known that they should enter the regression function as a power function. Having a
raw feature x̃_l we can pre-process it as x̃_l ↦ x_l = log(x̃_l). This implies
$$\mu(x) = \exp\langle \beta, x\rangle = \exp\{\beta_0\}\, \tilde{x}_l^{\,\beta_l} \prod_{j=1,\, j\neq l}^q \exp\{\beta_j x_j\},$$
which gives a power term of order β_l. The GLM estimates in this case the power
parameter that should be used for x̃_l. If the power parameter is known, then one
can even include this component as an offset; offsets are discussed in Sect. 5.2.3,
below.
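The effect of this log pre-processing can be verified numerically; the sketch below (with an arbitrary illustrative parameter β_l and feature value) only restates the identity exp{β_l log(x̃_l)} = x̃_l^{β_l}:

```python
# Sketch: log pre-processing of a raw feature turns the exponential GLM term
# into a power term; beta_l and x_raw are illustrative numbers.
import math

beta_l = 0.7              # illustrative regression parameter
x_raw = 3.5               # raw feature value x~_l
x_l = math.log(x_raw)     # pre-processed feature entering the linear predictor

# exp{beta_l * log(x~_l)} equals the power term x~_l ** beta_l
assert abs(math.exp(beta_l * x_l) - x_raw ** beta_l) < 1e-12
print(round(x_raw ** beta_l, 4))
```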
Interactions
Naturally, GLMs only allow for an additive structure in the linear predictor. Similar
to continuous feature components, such an additive structure may not always be
suitable and one wants to model more complex interaction terms. Such interactions
need to be added manually by the modeler, for instance, if we have two raw feature
components x̃_l and x̃_k, we may want to consider a functional form

$$(\tilde{x}_l, \tilde{x}_k) \;\mapsto\; \beta_1 \tilde{x}_l + \beta_2 \tilde{x}_k + \beta_3 \tilde{x}_l \tilde{x}_k + \beta_4 \tilde{x}_l^2 \tilde{x}_k,$$
5.2.3 Offsets
An offset is a known constant o_i ∈ ℝ that is added to the linear predictor,
$$g(\mu_i) = \langle \beta, x_i\rangle + o_i, \qquad (5.25)$$
for all 1 ≤ i ≤ n. An offset o_i does not change anything from a structural viewpoint;
in fact, it could be integrated into the feature x_i with a regression parameter that is
identically equal to 1.
Offsets are frequently used in Poisson models with the (canonical) log-link
choice to model multiplicative time exposures in claim frequency modeling. Under
the log-link choice we receive from (5.25) the mean function
$$\mu_i = \exp\{o_i\}\, \exp\langle \beta, x_i\rangle.$$
In this version, the offset o_i provides us with an exposure exp{o_i} that acts
multiplicatively on the regression function. If w_i = exp{o_i} measures time, then
w_i is a so-called pro-rata temporis (proportional in time) exposure.
Remark 5.14 (Boosting) A popular machine learning technique in statistical modeling
is boosting. Boosting tries to step-wise adaptively improve a regression
model. Offsets (5.25) are a simple way of constructing boosted models. Assume
we have constructed a predictive model using any statistical model, and denote the
resulting estimated means of Y_i by μ̂_i^{(0)}. The idea of boosting is that we select
another statistical model and try to see whether this second model can still find
systematic structure in the data which has not been found by the first model. In view
of (5.25), we include the first model into the offset and we build a second model
around this offset, that is, we may explore a GLM
$$\hat{\mu}_i^{(1)} = g^{-1}\Big( g\big(\hat{\mu}_i^{(0)}\big) + \langle \beta, x_i\rangle \Big).$$
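A toy sketch of this boosting construction (all data and names below are invented): stage 1 is an intercept-only Poisson model, its predictions are frozen into the offset, and a stage-2 Poisson log-link model with one binary feature is fitted around that offset. With a single binary feature the stage-2 MLE is available in closed form, so no iterative fitting is needed:

```python
# Sketch: boosting via an offset.  Stage 1 is an intercept-only Poisson
# model; its fitted means enter the offset o_i = log(v_i * mu0) of a stage-2
# Poisson log-link model with one binary feature z (invented toy data).
import math
import random

random.seed(7)

def poisson(lam):
    # simple Poisson sampler, adequate for small lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# data: exposure v_i, binary feature z_i, count N_i; the true frequency
# depends on z, which stage 1 ignores
data = []
for _ in range(4000):
    v, z = random.uniform(0.5, 1.0), random.randint(0, 1)
    data.append((v, z, poisson(v * (0.05 if z == 0 else 0.15))))

# stage 1: intercept-only Poisson model, overall frequency mu0
mu0 = sum(n for _, _, n in data) / sum(v for v, _, _ in data)

# stage 2: Poisson GLM around the offset; with one binary feature the MLE
# of (b0, b1) is a multiplicative correction per group of z
def correction(zval):
    num = sum(n for _, z, n in data if z == zval)
    den = sum(v * mu0 for v, z, _ in data if z == zval)
    return num / den

b0 = math.log(correction(0))
b1 = math.log(correction(1)) - b0

# boosted means exp{o_i + b0 + b1 z_i} recover the structure missed by stage 1
print(round(mu0, 4), round(math.exp(b0) * mu0, 4), round(math.exp(b0 + b1) * mu0, 4))
```

The stage-2 model leaves stage 1 untouched and only corrects it multiplicatively, which is exactly the role of the offset in the display above.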
We present a first GLM example. This example is based on French motor third
party liability (MTPL) insurance claim counts data. The data is described in detail
in Chap. 13.1; an excerpt of the available MTPL data is given in Listing 13.2. For the
moment we only consider claim frequency modeling. We use the following data: Ni
describes the number of claims, vi ∈ (0, 1] describes the duration of the insurance
policy, and x̃_i describes the available raw feature information of insurance policy i,
see Listing 13.2.
We are going to model the claim counts Ni with a Poisson GLM using the
canonical link function of the Poisson model. In the Poisson approach there are two
different ways to account for the duration of the insurance policy. Either we model
Yi = Ni /vi with the Poisson model of the EDF, see Sect. 2.2.2 and Remarks 2.13
(reproductive form), or we directly model Ni with the Poisson distribution from the
EF and treat the log-duration as an offset variable oi = log vi . In the first approach
we have for the log-link choice g(·) = h(·) = log(·) and dispersion ϕ = 1
$$Y_i = N_i/v_i \sim f(y_i; \theta_i, v_i) = \exp\left\{ \frac{y_i \langle \beta, x_i\rangle - e^{\langle \beta, x_i\rangle}}{1/v_i} + a(y_i; v_i) \right\}, \qquad (5.26)$$
In the second approach we directly model N_i with the Poisson distribution from the EF. Using notation (2.2) this gives us
$$N_i \sim f(n_i; \theta_i) = \exp\Big\{ n_i \big(\log v_i + \langle \beta, x_i\rangle\big) - e^{\log v_i + \langle \beta, x_i\rangle} + a(n_i) \Big\} \qquad (5.27)$$
$$= \exp\left\{ \frac{(n_i/v_i)\, \langle \beta, x_i\rangle - e^{\langle \beta, x_i\rangle}}{1/v_i} + a(n_i) + n_i \log v_i \right\},$$
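The equivalence of the two parametrizations (5.26) and (5.27) can be checked numerically: their β-dependent log-likelihood parts differ only by the constant Σ_i N_i log v_i, so both lead to the same MLE. A small sketch with toy numbers (the a(·) terms, which do not depend on β, are dropped):

```python
# Sketch: the reproductive form (5.26) with weights v_i and the EF form
# (5.27) with offset log v_i have log-likelihoods differing only by the
# beta-independent constant sum_i N_i * log v_i (toy observations).
import math

# toy observations: (exposure v_i, count N_i, linear predictor eta_i = <beta, x_i>)
data = [(0.5, 1, -2.0), (1.0, 0, -2.5), (0.8, 2, -1.5)]

def ll_reproductive(obs):
    # (5.26): sum_i v_i * (y_i * eta_i - exp(eta_i)) with y_i = N_i / v_i
    return sum(v * ((n / v) * eta - math.exp(eta)) for v, n, eta in obs)

def ll_offset(obs):
    # (5.27): sum_i N_i * (log v_i + eta_i) - exp(log v_i + eta_i)
    return sum(n * (math.log(v) + eta) - v * math.exp(eta) for v, n, eta in obs)

const = sum(n * math.log(v) for v, n, _ in data)    # beta-independent shift
for shift in (0.0, 0.3, -1.0):                      # vary beta via eta
    obs = [(v, n, eta + shift) for v, n, eta in data]
    assert abs(ll_offset(obs) - ll_reproductive(obs) - const) < 1e-12
```

Since the difference is constant in β, maximizing either log-likelihood gives the same β̂; only AIC-type quantities computed on the two scales differ.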
Feature Engineering
[Figure: three bar plots titled 'observed frequency per car brand groups', 'observed frequency per regional groups' and 'observed frequency per fuel type'; y-axis: frequency, 0.00–0.20]
Fig. 5.3 Empirical marginal frequencies on each level of the categorical variables (lhs)
VehBrand, (middle) Region, and (rhs) VehGas
Continuous Variables
$$\mathcal{X} \subset \{1\} \times \mathbb{R} \times \{0,1\}^5 \times \{0,1\}^2 \times \{0,1\}^6 \times \mathbb{R} \times \{0,1\}^{10} \times \{0,1\} \times \mathbb{R} \times \{0,1\}^{21},$$
[Figure: six bar plots of observed frequencies per area code, vehicle power, vehicle age, driver's age, bonus-malus level and density (log-scale) groups]
Fig. 5.4 Empirical marginal frequencies of the continuous variables: top row (lhs) Area, (middle) VehPower, (rhs) VehAge, and bottom row (lhs)
DrivAge, (middle) BonusMalus, (rhs) log-Density, i.e., Density on the log scale; note that DrivAge and BonusMalus have a different y-scale
in these plots
Table 5.2 shows the summary of the chosen partition into learning and test
samples
$$\mathcal{L} = \big\{ (Y_i = N_i/v_i,\, x_i,\, v_i) : i = 1, \ldots, n = 610\,206 \big\},$$
and
$$\mathcal{T} = \big\{ (Y_t^{\dagger} = N_t^{\dagger}/v_t^{\dagger},\, x_t^{\dagger},\, v_t^{\dagger}) : t = 1, \ldots, T = 67\,801 \big\}.$$
In contrast to Sect. 4.2 we also include feature information and exposure information
to L and T .
Listing 5.2 Partition of the data to learning sample L and test sample T
RNGversion("3.5.0") # we use R version 3.5.0 for this partition
set.seed(500)
ll <- sample(c(1:nrow(dat)), round(0.9*nrow(dat)), replace = FALSE)
learn <- dat[ll,]
test <- dat[-ll,]
Table 5.2 Choice of learning data set L and test data set T ; the empirical frequency on both
data sets is similar (last column), and the split of the policies w.r.t. the numbers of claims is also
rather similar
                     Numbers of observed claims                          Empirical
                     0       1      2      3      4        5             frequency
Learning sample L    96.32%  3.47%  0.19%  0.01%  0.0006%  0.0002%       7.36%
Test sample T        96.31%  3.50%  0.18%  0.01%  0.0015%  0.0015%       7.35%
where the terms under the summation are set equal to vi μ(x i ) for Yi = 0, see (4.8),
and we have the GLM regression function
$$x \mapsto \log \mu(x) = \langle \beta, x\rangle.$$
That is, we work under the canonical link with the canonical parameter being equal
to the linear predictor. The MLE of β is found by minimizing (5.28). This is done
with Fisher's scoring method. In order to receive a non-degenerate solution we need
to ensure that we have sufficiently many claims Y_i > 0, otherwise it might happen
that the MLE provides a (degenerate) solution at the boundary of the effective
domain. We denote the MLE by β̂_L^MLE = β̂^MLE, because it has been estimated
on the learning data L, only. This gives us the estimated regression function
$$x \mapsto \log \hat{\mu}(x) = \langle \hat{\beta}_{\mathcal{L}}^{\rm MLE}, x\rangle.$$
We emphasize that we only use the learning data L for this model fitting. In view of
Definition 4.24 we receive the in-sample and out-of-sample Poisson deviance losses
$$D(\mathcal{L}, \hat{\beta}_{\mathcal{L}}^{\rm MLE}) = \frac{2}{n} \sum_{i=1}^n v_i \left[ \hat{\mu}(x_i) - Y_i - Y_i \log\frac{\hat{\mu}(x_i)}{Y_i} \right] \;\ge\; 0,$$
$$D(\mathcal{T}, \hat{\beta}_{\mathcal{L}}^{\rm MLE}) = \frac{2}{T} \sum_{t=1}^T v_t^{\dagger} \left[ \hat{\mu}(x_t^{\dagger}) - Y_t^{\dagger} - Y_t^{\dagger} \log\frac{\hat{\mu}(x_t^{\dagger})}{Y_t^{\dagger}} \right] \;\ge\; 0.$$
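A sketch of the Poisson deviance loss computation (illustrative helper and numbers, not the book's implementation); the terms with Y_i = 0 reduce to 2v_iμ̂_i, the limit of the unit deviance at zero:

```python
# Sketch: scaled Poisson deviance loss for observed frequencies y_i = N_i/v_i,
# fitted means mu_i and exposures v_i (toy numbers).
import math

def poisson_deviance_loss(y, mu, v):
    """Scaled Poisson deviance loss; y_i are observed frequencies N_i/v_i."""
    total = 0.0
    for yi, mi, vi in zip(y, mu, v):
        if yi == 0:
            total += 2.0 * vi * mi          # limit of the unit deviance at y = 0
        else:
            total += 2.0 * vi * (mi - yi - yi * math.log(mi / yi))
    return total / len(y)

# illustrative numbers (not the book's data)
y, mu, v = [0.0, 2.0, 1.25], [0.1, 1.5, 1.0], [1.0, 0.5, 0.8]
loss = poisson_deviance_loss(y, mu, v)
assert loss >= 0.0                                # each unit deviance is non-negative
assert poisson_deviance_loss(mu, mu, v) < 1e-12   # perfect fit gives zero loss
print(round(loss, 4))
```

The same helper applies to both the in-sample and the out-of-sample versions; only the data fed in differs.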
We implement this GLM on the data of Listing 5.1 (and including the categorical
features) in R using the function glm [307], a short overview of the results is
presented in Listing 5.3. This overview presents the regression model implemented,
an excerpt of the parameter estimates β̂_L^MLE, and standard errors which are received
from the square-rooted diagonal entries of the inverse of the estimated Fisher's
information matrix I_n(β̂_L^MLE), see (5.17); the remaining columns will be described
in Sect. 5.3.2 on the Wald test (5.33). The bottom line of the output says that Fisher's
scoring algorithm has converged in 6 iterations, it gives the in-sample deviance loss
nD(L, β̂_L^MLE) called Residual deviance (not being scaled by the number of
Listing 5.3 Results in model Poisson GLM1 using the R command glm
1 Call:
2 glm(formula = ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAgeGLM +
3 BonusMalusGLM + VehBrand + VehGas + DensityGLM + Region +
4 AreaGLM, family = poisson(), data = learn, offset = log(Exposure))
5
6 Deviance Residuals:
7 Min 1Q Median 3Q Max
8 -1.4728 -0.3256 -0.2456 -0.1383 7.7971
9
10 Coefficients:
11 Estimate Std. Error z value Pr(>|z|)
12 (Intercept) -4.8175439 0.0579296 -83.162 < 2e-16 ***
13 VehPowerGLM5 0.0604293 0.0229841 2.629 0.008559 **
14 VehPowerGLM6 0.0868252 0.0225509 3.850 0.000118 ***
15 . . .
16 . . .
17 RegionR93 0.1388160 0.0294901 4.707 2.51e-06 ***
18 RegionR94 0.1918538 0.0938250 2.045 0.040874 *
19 AreaGLM 0.0407973 0.0200818 2.032 0.042199 *
20 ---
21 Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
22
23 (Dispersion parameter for poisson family taken to be 1)
24
25 Null deviance: 153852 on 610205 degrees of freedom
26 Residual deviance: 147069 on 610157 degrees of freedom
27 AIC: 192818
28
29 Number of Fisher Scoring iterations: 6
Table 5.3 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses,
tenfold cross-validation losses with empirical standard deviation in brackets, see also (4.36), (units
are in 10−2 ) and the in-sample average frequency of the null model (Poisson intercept model, see
Example 4.27) and of model Poisson GLM1
               Run time  # Param.  AIC      In-sample   Out-of-sample  Tenfold CV      Aver.
                                            loss on L   loss on T      loss D̂^CV       freq.
Poisson null   –         1         199'506  25.213      25.445         25.213 (0.234)  7.36%
Poisson GLM1   16 s      49        192'818  24.101      24.146         24.121 (0.245)  7.36%
observations), as well as Akaike’s Information Criterion (AIC), see Sect. 4.2.3 for
AIC. Note that we have implemented Poisson version (5.27) with the exposures
entering the offset, see lines 2–4 of Listing 5.3; this is important for understanding
AIC being calculated on the (unscaled) claim counts Ni .
Table 5.3 summarizes the results of model Poisson GLM1 and it compares the
figures to the null model (only having an intercept β0 ); the null model has already
been introduced in Example 4.27. We present the run time needed to fit the model,3
the number of regression parameters q + 1 in β ∈ Rq+1 , AIC, in-sample and
out-of-sample deviance losses, as well as tenfold cross-validation losses on the
3 All run times are measured on a personal laptop Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz
1.99 GHz with 16 GB RAM, and they only correspond to fitting the model (or the corresponding
step) once, i.e., they do not account for multiple runs, for instance, for K-fold cross-validation.
learning data L. For tenfold cross-validation we always use the same (non-stratified)
partition of L (in all examples in this monograph), and in brackets we show the
empirical standard deviation received by (4.36). Tenfold cross-validation would not
be necessary in this case because we have test data T on which we can evaluate the
out-of-sample deviance GL. We present both figures to back-test whether tenfold
cross-validation works properly in our example. We observe that the out-of-sample
deviance losses D(T, β̂_L^MLE) are within one empirical standard deviation of the
tenfold cross-validation losses D̂^CV, which supports this methodology of model
comparison.
From Table 5.3 we conclude that we should prefer model Poisson GLM1 over
the null model; this decision is supported by a smaller AIC, a smaller out-of-sample
deviance loss D(T, β̂_L^MLE) as well as a smaller cross-validation loss D̂^CV. The last
column of Table 5.3 confirms that the estimated model meets the balance property
(we work with the canonical link here). Note that this balance property should be
fulfilled for two reasons. Firstly, we would like to have the overall portfolio price on
the right level, and secondly, deviance losses should only be compared on the same
overall frequency, see Example 4.10.
Before we continue to introduce more models to challenge model Poisson
GLM1, we are going to discuss statistical tools for model evaluation. Of course,
we would like to know whether model Poisson GLM1 is a good model for this data
or whether it is just the better model of two bad options.
Remark 5.15 (Prior and Posterior Information) Pricing literature distinguishes
between prior feature information and posterior feature information, see Verschuren
[372]. Prior feature information is available at the inception of the (new) insurance
contract before having any claims history. This includes, for instance, age of driver,
vehicle brand, etc. For policy renewals, past claims history is available and prices
of policy renewals can also be based on such posterior information. Past claims
history has led to the development of so-called bonus-malus systems (BMS) which
often are in the form of multiplicative factors to the base premium to reward and
punish good and bad past experience, respectively. One stream of literature studies
optimal designs of BMS, we refer to Loimaranta [255], De Pril [91], Lemaire [245],
Denuit et al. [102], Brouhns et al. [57], Pinquet [304], Pinquet et al. [305], Tzougas
et al. [360] or Ágoston–Gyetvai [4]. Another stream of literature studies how one
can optimally extract predictive information from an existing BMS, see Boucher–
Inoussa [46], Boucher–Pigeon [47] and Verschuren [372].
The latter is basically what we also do in the above example: note that we include
the variable BonusMalus into the feature information and, thus, we use past
claims information to predict future claims. For new policies, the bonus-malus level
is at 100%, and our information does not allow us to clearly distinguish between new
5.3 Model Validation 141
policies and policy renewals for drivers that have posterior information reflected by
a bonus-malus level of 100%. Since young drivers are more likely new customers we
expect interactions between the driver’s age variable and the bonus-malus level, this
intuition is supported by Fig. 13.12 (lhs). In order to improve our model, we would
require more detailed information about past claims history. Note that we do
not strictly distinguish between prior and posterior information here. If we go over
to a time-series consideration, where more and more claims experience becomes
available of an individual driver, we should clearly distinguish the different sets of
information, because otherwise it may happen that in prior and posterior pricing
factors we correct twice for the same factor; an interesting paper is Corradin et
al. [82].
We also mention that a new source of posterior information is emerging through
the collection of telematics car driving data. Telematics car driving data leads to a
completely new way of posterior information rate making (experience rating), we
refer to Ayuso et al. [17–19], Boucher et al. [42], Lemaire et al. [246] and Denuit
et al. [98]. We mention the papers of Gao et al. [152, 154] and Meng et al. [271]
who directly extract posterior feature information from telematics car driving data
in order to improve rate making. This approach combines a Poisson GLM with a
network extractor for the telematics car driving data.
One of the purposes of Chap. 4 has been to describe measures to analyze how well
a fitted model generalizes to unseen data. A proper generalization analysis
requires learning data L for in-sample model fitting and a test sample T for an
out-of-sample generalization analysis. In many cases, one is not in the comfortable
situation of having a test sample. In such situations one can use AIC that tries to
correct the in-sample figure for model complexity or, alternatively, K-fold cross-
validation as used in Table 5.3.
The purpose of this section is to introduce diagnostic tools for fitted models; these
are often based on unit deviances d(Yi , μi ), which play the role of squared residuals
in classical linear regression. Moreover, we discuss parameter and model selection,
for instance, by step-wise backward elimination or forward selection using the
analysis of variance (ANOVA) or the likelihood ratio test (LRT).
Within the EDF we distinguish two different types of residuals. The first type of
residuals is based on the unit deviances d(Y_i, μ_i) studied in (4.7). The deviance
In the Gaussian case the two residuals coincide. This indicates that Pearson’s
residuals are most appropriate in the Gaussian case because they respect the
distributional properties in that case. For other distributions, Pearson’s residuals
can be markedly skewed, as stated in Section 2.4.2 of McCullagh–Nelder [265],
and therefore may fail to have properties similar to Gaussian residuals. An other
issue occurs in Pearson’s
√ residuals when the denominator involves an estimated
standard deviation V ( μi ), for instance, if we work in a small frequency Poisson
problem. Estimation uncertainty in small denominators of Pearson’s residuals may
substantially distort the estimated residuals. For this reason, we typically work with
(the more robust) deviance residuals; this is related to the discussion in Chap. 4 on
MSEPs versus expected deviance GLs, see Remarks 4.6.
The squared residuals provide the unit deviance and the weighted square loss,
respectively,
$$(r_i^D)^2 = \frac{v_i}{\varphi}\, d(Y_i, \mu_i) \qquad\text{and}\qquad (r_i^P)^2 = \frac{v_i}{\varphi}\, \frac{(Y_i-\mu_i)^2}{V(\mu_i)};$$
in the Poisson case with v_i = φ = 1 and V(μ) = μ, the latter reduces to
$$(r_i^P)^2 = \frac{(Y_i-\mu_i)^2}{\mu_i}.$$
This emphasizes the different behaviors around the observation Y_i of the two types
of residuals in the Poisson case. The scale μ_i^{1/3} has been motivated in McCullagh–
[Figure: three panels titled 'log-likelihood of Poisson model', 'log-likelihood of gamma model' and 'log-likelihood of inverse Gaussian model'; y-axes: log-likelihood for mu; x-axes: mu^(1/3), mu^(-1/3), mu^(-1)]
Fig. 5.5 Log-likelihoods ℓ_Y(μ) in Y = 1 as a function of μ plotted against (lhs) μ^{1/3} in the
Poisson case, (middle) μ^{−1/3} in the gamma case with shape parameter α = 1, and (rhs) μ^{−1} in the
inverse Gaussian case with α = 1
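The different behavior of the two residual types in a low-frequency Poisson setting can be illustrated numerically (toy values; φ = 1):

```python
# Sketch: deviance vs. Pearson residuals in the Poisson case (toy numbers).
import math

def pearson_residual(y, mu, v=1.0, phi=1.0):
    # r^P = (y - mu) / sqrt(phi * V(mu) / v), with V(mu) = mu for Poisson
    return (y - mu) / math.sqrt(phi * mu / v)

def deviance_residual(y, mu, v=1.0, phi=1.0):
    # r^D = sign(y - mu) * sqrt((v / phi) * d(y, mu))
    d = 2.0 * (mu - y - y * math.log(mu / y)) if y > 0 else 2.0 * mu
    return math.copysign(math.sqrt((v / phi) * d), y - mu)

# in a low-frequency setting the two residuals differ markedly
for y, mu in [(0.0, 0.05), (1.0, 0.05), (1.0, 1.0)]:
    print(y, mu, round(pearson_residual(y, mu), 3), round(deviance_residual(y, mu), 3))
```

For a single claim at a small fitted mean, Pearson's residual is roughly twice the deviance residual, reflecting the skewness and small-denominator issues discussed above.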
$$\hat{\varphi}^P = \frac{1}{n-(q+1)} \sum_{i=1}^n \frac{(Y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)/v_i} \qquad\text{and}\qquad \hat{\varphi}^D = \frac{1}{n-(q+1)} \sum_{i=1}^n v_i\, d(Y_i, \hat{\mu}_i), \qquad (5.30)$$
[Figure: two panels titled 'expected Poisson unit deviance'; x-axis E[N] on the ranges 0–5 and 0.00–0.10]
Fig. 5.6 Expected unit deviance vE_μ[d(Y, μ)] in the Poisson case as a function of E[N] =
E[vY] = vμ; the two plots only differ in the scale on the x-axis
This statistic is under certain assumptions asymptotically ϕχ²_{n−(q+1)}-distributed,
where χ²_{n−(q+1)} denotes a χ²-distribution with n−(q+1) degrees of freedom. Thus,
this approximation gives us an expected value of ϕ(n−(q+1)). This exactly justifies
the deviance dispersion estimate (5.30) in these cases. However, as stated in the last
paragraph of Section 2.3 of McCullagh–Nelder [265], often a χ 2 -approximation is
not suitable even as n → ∞. We give an example.
Example 5.17 (Poisson Unit Deviances) The deviance statistics in the Poisson
model with means μ_n = (μ_1, …, μ_n) is given by
$$D(Y_n, \mu_n) = \frac{1}{n} \sum_{i=1}^n v_i\, d(Y_i, \mu_i) = \frac{1}{n} \sum_{i=1}^n 2 v_i \left[ \mu_i - Y_i - Y_i \log\frac{\mu_i}{Y_i} \right],$$
note that in the Poisson model we have (by definition) ϕ = 1. We evaluate the
expected value of this deviance statistics. It is given by
$$\mathbb{E}_{\mu_n}\big[D(Y_n, \mu_n)\big] = \frac{1}{n} \sum_{i=1}^n 2 v_i\, \mathbb{E}_{\mu_i}\!\left[ \mu_i - Y_i - Y_i \log\frac{\mu_i}{Y_i} \right] = \frac{1}{n} \sum_{i=1}^n 2\, \mathbb{E}_{\mu_i}\!\left[ N_i \log\frac{N_i}{v_i \mu_i} \right],$$
with independent N_i ∼ Poi(v_i μ_i).
In Fig. 5.6 we plot the expected unit deviance vμ ↦ vE_μ[d(Y, μ)] in the Poisson
model. In our example of Table 5.3, we have E_μ[vY] = vμ ≈ 3.89%, which results
in an expected unit deviance of vE_μ[d(Y, μ)] ≈ 25.52·10^{−2} < 1. This is in line with
the losses in Table 5.3. Thus, the expected deviance nE_{μ_n}[D(Y_n, μ_n)] ≈ n/4 < n,
i.e., it is substantially smaller than n. But this implies that nD(Y_n, μ_n) cannot
be asymptotically χ²_{n−(q+1)}-distributed because the latter has an expected value of
n−(q+1) ≈ n for n → ∞. In fact, the deviance dispersion estimate is not consistent
in this example, and for a consistent estimate one should rely on Pearson's dispersion
estimate.
In order to have an asymptotic χ 2 -distribution we need to have large volumes
v because then a saddlepoint approximation holds that allows us to approximate the
(scaled) unit deviances by χ 2 -distributions, see Sect. 5.5.2, below.
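The expected unit deviance quoted above can be reproduced exactly by summing over the Poisson probability weights (a sketch; the series is truncated where the remaining mass is negligible):

```python
# Sketch: exact expected Poisson unit deviance v*E_mu[d(Y, mu)] at
# lambda = v*mu = 0.0389, as in the example above.
import math

def expected_unit_deviance(lam, n_max=60):
    # v * E_mu[d(Y, mu)] = 2 * E[N * log(N / lam)] for N ~ Poi(lam),
    # with the convention 0 * log(0) = 0 (the N = 0 term vanishes)
    total = 0.0
    for n in range(1, n_max):
        p = math.exp(-lam) * lam ** n / math.factorial(n)
        total += p * 2.0 * n * math.log(n / lam)
    return total

val = expected_unit_deviance(0.0389)   # v*mu as in Table 5.3
print(round(val, 4))                   # roughly 0.255, i.e., far below 1
```

Since the expected unit deviance is far below 1 for such small means, the χ² approximation (which would require an expected value near 1 per degree of freedom) cannot hold here.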
The MLE β̂^MLE is determined by the point in M that minimizes the distance to Y,
where the distance between Y and M is measured component-wise by (v_i/ϕ) d(Y_i, μ_i) with
μ ∈ M, i.e., w.r.t. the KL divergence.
Assume, now, that we want to drop the components β r in β, i.e., we want to drop
these columns from the design matrix resulting in a smaller design matrix Xr ∈
Rn×(q+1−r) . This generates a (q + 1 − r)-dimensional nested manifold Mr ⊂ M
described by
$$\mathcal{M}_r = \big\{ \mu = g^{-1}(X_r \beta) \in \mathbb{R}^n : \beta \in \mathbb{R}^{q+1-r} \big\} \subset \mathcal{M}.$$
Likelihood Ratio Test (LRT) We consider the testing problem of the null hypoth-
esis H0 against the alternative hypothesis H1
H0 : β_r = 0 against H1 : β_r ≠ 0. (5.31)
The inequality holds because the null hypothesis model is nested in the full model,
hence, the latter needs to have a bigger log-likelihood value in the MLE. If
the LRT statistics is large, the null hypothesis should be rejected because the
reduced model is not competitive compared to the full model. More mathematically,
under similar conditions as for the asymptotic normality results of the MLE of
β in (5.17), we have that under the null hypothesis H0 the LRT statistics is
asymptotically χ 2 -distributed with r degrees of freedom. Therefore, we should
reject the null hypothesis in favor of the full model if the resulting p-value
under the χ²_r-distribution is too small. These results remain true if the unknown
dispersion parameter ϕ is replaced by a consistent estimator ϕ̂, e.g., Pearson's
dispersion estimate ϕ̂^P (from the bigger model).
The LRT statistics may not be properly defined in over-dispersed situations
where the distributional assumptions are not fully specified, for instance, in an over-
dispersed Poisson model. In such situations, one usually divides the log-likelihood
(of the Poisson model) by the estimated over-dispersion and then uses the resulting
scaled LRT statistics as an approximation to the unspecified model.
Wald Test Alternatively, we can use the Wald statistics. The Wald statistics uses
a second order approximation to the log-likelihood and, therefore, is only based
on the first two moments (and not on the entire distribution). Define the matrix
Ir ∈ Rr×(q+1) such that β r = Ir β, i.e., matrix Ir selects exactly the components of
β that are included in β r (and which are set to 0 under the null hypothesis H0 given
in (5.31)).
Asymptotic normality (5.17) motivates consideration of the Wald statistics
$$W = \big( \mathrm{I}_r \hat{\beta}^{\rm MLE} - 0 \big)^{\!\top} \Big( \mathrm{I}_r\, \mathcal{I}(\hat{\beta}^{\rm MLE})^{-1}\, \mathrm{I}_r^{\top} \Big)^{-1} \big( \mathrm{I}_r \hat{\beta}^{\rm MLE} - 0 \big). \qquad (5.32)$$
The Wald statistics measures the distance between the MLE in the full model
restricted to the components of β_r, i.e., I_r β̂^MLE, and the null hypothesis H0 (being
β_r = 0). The estimated Fisher's information matrix I(β̂^MLE) is used to bring
all components onto the same unit scale (and to account for collinearity). The
Wald statistics W is asymptotically χr2 -distributed under the same assumptions as
for (5.17) to hold. Thus, the null hypothesis H0 should be rejected if the resulting p-
value of W under the χr2 -distribution is too small. Note that this test does not require
calculation of the MLE in the null hypothesis model, i.e., this test is computationally
more attractive than the LRT because we only need to fit one model. Again, an
unknown dispersion parameter ϕ in Fisher's information matrix I(β) is replaced by
a consistent estimator ϕ̂ (from the bigger model).
In the special case of considering only one component of β, i.e., if β_r = β_k with
r = 1 for one selected component 0 ≤ k ≤ q, the Wald statistics reduces to
$$W_k = \frac{\big(\hat{\beta}_k^{\rm MLE}\big)^2}{\hat{\sigma}_k^2} \qquad\text{or}\qquad T_k = W_k^{1/2} = \frac{\hat{\beta}_k^{\rm MLE}}{\hat{\sigma}_k}, \qquad (5.33)$$
with the diagonal entries of the inverse of the estimated Fisher's information matrix
given by σ̂_k² = (I(β̂^MLE)^{−1})_{k,k}, 0 ≤ k ≤ q. The square-roots of these estimates are
provided in column Std. Error of the R output in Listing 5.3.
In this case the Wald statistics W_k is equal to the square of the t-statistics T_k;
this t-statistics is provided in column z value of the R output of Listing 5.3.
Remark that Fisher’s information matrix involves the dispersion parameter ϕ. If
this dispersion parameter is estimated with a consistent estimator ϕ we have a t-
statistics. For known dispersion parameter the t-statistics reduces to a z-statistics,
i.e., the corresponding p-values can be calculated from a normal distribution instead
of a t-distribution. In the Poisson case, the dispersion ϕ = 1 is known, and for this
reason, we perform a z-test (and not a t-test) in the last column of Listing 5.3; and
we call Tk a z-statistics in that case.
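As a sketch, the z value and Pr(>|z|) columns of Listing 5.3 can be reproduced from the Estimate and Std. Error columns using the standard normal distribution (ϕ = 1 in the Poisson case, so a z-test applies):

```python
# Sketch: reproduce the z value and two-sided p-value of Listing 5.3 from
# the Estimate and Std. Error columns.
import math

def z_test(estimate, std_error):
    # two-sided z-test: p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# rows copied from Listing 5.3
z, p = z_test(0.0604293, 0.0229841)      # VehPowerGLM5
z0, p0 = z_test(-4.8175439, 0.0579296)   # (Intercept)
print(round(z, 3), round(p, 6))          # 2.629 and approximately 0.008559
print(round(z0, 3), p0 < 2e-16)          # -83.162, p-value below 2e-16
```

These match the z value and Pr(>|z|) columns printed by R's glm summary.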
In the previous section, we have presented tests that allow for model selection in
the case of nested models. More generally, if we have a full model, say, based
on regression parameter β ∈ ℝ^{q+1}, we would like to select the "best" sub-model
according to some selection criterion. In most cases, it is computationally
not feasible to fit all sub-models if q is large; therefore, this is not a practical
solution. For large models and data sets, step-wise procedures are a feasible tool.
Backward elimination starts from the full model, and then recursively drops feature
components which have high p-values in the corresponding Wald statistics (5.32)
and (5.33). Performing this recursively will provide us with a hierarchy of nested
models. Forward selection works just in the opposite direction, that is, we start with
the null model and we include feature components one after the other that have a
low p-value in the corresponding Wald statistics.
Remarks 5.18
• The order of the inclusion/exclusion of the feature components matters in these
selection algorithms because we do not have additivity in this selection process.
For this reason, backward elimination and forward selection are often combined
in an alternating way.
• This process as well as the tests from Sect. 5.3.2 are based on a fixed pre-
processing of features. If the feature pre-processing is done differently, all
analysis needs to be repeated for this new model. Moreover, between two dif-
ferent models we need to apply different tools for model selection (if they are not
nested), for instance, AIC, cross-validation or an out-of-sample generalization
analysis.
• For categorical variables with dummy coding we should apply the forward
selection or the backward elimination simultaneously on the entire dummy coded
vector of a categorical variable. This will include or exclude this variable; if we
only apply the Wald test to one component of the dummy vector, then we test
whether this level should be merged with the reference level.
Typically, in practice, a so-called analysis of variance (ANOVA) table is studied.
The ANOVA table is mainly motivated by the Gaussian model with orthogonal
data. The Gaussian assumption implies that the deviance loss is equal to the
square loss and the orthogonality implies that the square loss decouples in an
additive way w.r.t. the feature components. This implies that one can explicitly
study the contribution of each feature component to the decrease in square loss;
an example is given in Section 2.3.2 of McCullagh–Nelder [265]. In non-Gaussian
and non-orthogonal situations one loses this additivity property and, as mentioned
in Remarks 5.18, the order of inclusion matters. Therefore, for the ANOVA table
we pre-specify the order in which the components are included and then we analyze
the decrease of deviance loss by the inclusion of additional components.
Example 5.19 (Poisson GLM1, Revisited) We revisit the MTPL claim frequency
example of Sect. 5.2.4 to illustrate the variable selection procedures. Based on the
model presented in Listing 5.3 we run an ANOVA analysis using the R command
anova, the results are presented in Listing 5.4.
Listing 5.4 shows the hierarchy of models starting from the null model by
sequentially including feature components one by one. The column Df gives the
number of regression parameters involved and the column Deviance the decrease
of deviance loss by the inclusion of this feature component. The biggest model
improvements are provided by the bonus-malus level and the driver's age; this is not
surprising in view of the empirical analysis in Figs. 5.3 and 5.4, and in Chap. 13.1.
At the other end we have the Area code which only seems to improve the model
marginally. However, this does not imply, yet, that this variable should be dropped.
There are two points that need to be considered: (1) maybe feature pre-processing
of Area has not been done in an appropriate way and the variable is not in the
right functional form for the chosen link function; and (2) Area is the last variable
included in the model in Listing 5.4 and, maybe, there are already other variables
that take over the role of Area in smaller models which is possible if we have
correlations between the feature components. In our data, Area and Density are
highly correlated. For this reason, we exchange the order of these two components
and run the same analysis again; we call this model Poisson GLM1B (which of
course provides the same predictive model as Poisson GLM1).
Listing 5.5 shows the ANOVA table if we exchange the order of these two
variables. We observe that the magnitudes of the decrease of the deviance loss
has switched between the two variables. Overall, Density seems slightly more
predictive, and we may consider dropping Area from the model, also because the
correlation between Density and Area is very high.
If we want to perform backward elimination (sequentially drop one variable after
the other) we can use the R command drop1. For small models this is doable, for
larger models it is computationally demanding.
In Listing 5.6 we present the results of this drop1 analysis. Both according to
AIC and according to the LRT, we should keep all variables in the model. Again,
Area and Density provide the smallest LRT statistics which illustrates the
high collinearity between these two variables (note that the values in Listing 5.6 are
identical to the ones in Listings 5.4 and 5.5, respectively).
We conclude that in model Poisson GLM1 we should keep all feature com-
ponents, and a model improvement can only be obtained by a different feature
pre-processing, by a different regression function or by a different distributional
model.
We revisit model Poisson GLM1 studied in Sect. 5.2.4 for MTPL claim frequency
modeling, and we consider additional competing models by using different feature
pre-processing. From Example 5.19, above, we conclude that we should keep all
variables in the model if we work with model Poisson GLM1.
Table 5.4 Contingency table of observed number of policies against predicted number of
policies with given claim counts ClaimNb

Numbers of claims ClaimNb          0        1      2    3    4     5
Observed number of policies     587'772  21'198  1'174   57    4     1
Predicted number of policies    587'325  22'064    779   34    3   0.3
DrivAge → β_l DrivAge + β_{l+1} log(DrivAge) + Σ_{j=2}^{4} β_{l+j} (DrivAge)^j.    (5.34)
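In R, (5.34) corresponds to regression terms DrivAge + log(DrivAge) + I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4). A minimal sketch of the resulting design-matrix columns (the ages below are made up for illustration):

```python
import numpy as np

# hypothetical driver's ages
driv_age = np.array([19.0, 25.0, 40.0, 62.0, 78.0])

# columns multiplying beta_l, ..., beta_{l+4} in (5.34):
# DrivAge, log(DrivAge), DrivAge^2, DrivAge^3, DrivAge^4
X_age = np.column_stack([driv_age,
                         np.log(driv_age),
                         driv_age ** 2,
                         driv_age ** 3,
                         driv_age ** 4])
```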
Table 5.5 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses,
tenfold cross-validation losses (units are in 10^−2) and in-sample average frequency of the null
model (intercept model) and of different Poisson GLMs

Model           Run time   # Param.   AIC       In-sample    Out-of-sample   Tenfold CV     Aver.
                                                loss on L    loss on T       loss D^CV      freq.
Poisson null       –           1      199'506   25.213       25.445          25.213         7.36%
Poisson GLM1      16s         49      192'818   24.101       24.146          24.121         7.36%
Poisson GLM2      15s         48      192'753   24.091       24.113          24.110         7.36%
Poisson GLM3      15s         50      192'716   24.084       24.102          24.104         7.36%
From Table 5.5 we observe that this leads to a further small model improvement.
We mention that this model improvement can also be observed in a decrease of
Pearson's dispersion estimate to ϕ̂^P = 1.6644. Notably, all model selection
criteria AIC, out-of-sample generalization loss and cross-validation come to the
same conclusion in this example.
The tedious task of the modeler now is to find all these systematic effects and
bring them in an appropriate form into the model. Here, this is still possible because
we have a comparably small model. However, if we have hundreds of feature
components, such a manual analysis becomes intractable. Other regression models
such as network regression models should be preferred, or at least should be used
to find systematic effects. But, one should also keep in mind that the (final) chosen
model should be as simple as possible (parsimonious).
Remarks 5.20
• An advantage of GLMs is that these regression models can deal with collinearity
in feature components. Nevertheless, the results should be carefully checked if
the collinearity in feature components is very high. If we have a high collinearity
between two feature components then we may observe large values with opposite
signs in the corresponding regression parameters compensating each other. The
resulting GLM will not be very robust, and a slight change in the observations
may change these regression parameters completely. In this case one should drop
one of the two highly collinear feature components. This problem may also occur
if we include too many terms in functional forms like in (5.34).
• A tool for finding a suitable functional form of the regression function in a continuous
feature component is the partial residual plot of Cook–Croos-Dabrera [80]. If
we want to analyze the first feature component x1 of x, we can fit a GLM to the
data using the entire feature vector x. The partial residuals for component x1 are
defined by, see formula (8) in Cook–Croos-Dabrera [80],

r_i^partial = (Y_i − μ(x_i)) g′(μ(x_i)) + β_1 x_{i,1},

where g is the chosen link function and g(μ(x_i)) = ⟨β, x_i⟩. These partial
residuals offset the effect of feature component x_1. The partial residual plot shows
r_i^partial against x_{i,1}. If this plot shows a linear structure, then including x_1 linearly
is justified, and any other functional form may be detected from that plot.
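A sketch of these partial residuals under the log-link (simulated data; for brevity we plug in the true β in place of the MLE, and check the linear structure by a fitted slope instead of a plot):

```python
import numpy as np

rng = np.random.default_rng(7)
m = 20000
x1 = rng.uniform(-1.0, 1.0, m)
x2 = rng.normal(size=m)
beta = np.array([0.1, 0.8, -0.5])              # (intercept, beta_1, beta_2)
mu = np.exp(beta[0] + beta[1] * x1 + beta[2] * x2)
y = rng.gamma(5.0, mu / 5.0)                   # positive responses with mean mu

# partial residuals for x1 under the log-link, g'(mu) = 1/mu:
#   r_i = (y_i - mu_i) * g'(mu_i) + beta_1 * x_{i,1}
r_partial = (y - mu) / mu + beta[1] * x1

# a linear structure of r_partial against x_{i,1} justifies a linear
# inclusion of x1; the fitted slope should recover beta_1
slope = float(np.polyfit(x1, r_partial, 1)[0])
```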
Often run times are an issue in model fitting, in particular if we want to experiment
with different models, different feature codings, etc. Under-sampling is an
interesting approach that can be applied in imbalanced situations (like in our claim
frequency data) to speed up calculations while still receiving accurate
approximations. We briefly describe under-sampling in this subsection.
Thus, we stretch the exposures of the policies without claims in L∗ ; for our data this
factor is 26.17. This then provides us with an empirical frequency on L∗ of 7.36%
which is identical to the observed frequency on the entire learning data L.
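A numerical sketch of this exposure stretching (simulated data; the 4% keep-fraction below is arbitrary, for the book's data the resulting factor is 26.17):

```python
import numpy as np

rng = np.random.default_rng(42)
m = 200000
v = rng.uniform(0.1, 1.0, m)                    # exposures
n = rng.poisson(0.07 * v)                       # rare claim counts

# under-sampling: keep ALL policies with claims, but only a small
# random fraction of the (many) policies without claims
has_claim = n > 0
keep_zero = rng.random(m) < 0.04                # ~4% of zero-claim rows
keep = has_claim | (~has_claim & keep_zero)

# stretch the exposures of the kept zero-claim policies so that their
# total exposure matches the zero-claim exposure in the full data
factor = v[~has_claim].sum() / v[~has_claim & keep_zero].sum()
v_star = np.where(has_claim, v, v * factor)[keep]
n_star = n[keep]

freq_full = n.sum() / v.sum()
freq_star = n_star.sum() / v_star.sum()         # identical by construction
```

By construction the empirical frequency on the reduced data equals the one on the full data, exactly as for the 7.36% in the text.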
We fit model Poisson GLM3 on this reduced (and exposure adjusted) learning
data L∗ , the results are presented on the last line of Table 5.6. This model can be
fitted in 1s, and by construction it fulfills the balance property. The resulting in-
sample and out-of-sample losses (evaluated on the entire data L and T ) are very
close to model Poisson GLM3 which verifies that the model fitted only on the
learning data L∗ gives a good approximation. We do not provide AIC because the
data used is not identical to the data used to fit the other models. The tenfold
cross-validation loss leads to the same conclusion, see Table 5.6.
Table 5.6 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses,
tenfold cross-validation losses (units are in 10^−2) and in-sample average frequency of the null
model (intercept model) and of different Poisson GLMs; the last row uses under-sampling in model
Poisson GLM3

Model            Run time   # Param.   AIC       In-sample    Out-of-sample   Tenfold CV     Aver.
                                                 loss on L    loss on T       loss D^CV      freq.
Poisson null        –           1      199'506   25.213       25.445          25.213         7.36%
Poisson GLM1      16 s         49      192'818   24.101       24.146          24.121         7.36%
Poisson GLM2      15 s         48      192'753   24.091       24.113          24.110         7.36%
Poisson GLM3      15 s         50      192'716   24.084       24.102          24.104         7.36%
under-sampling     1 s         50         –      24.098       24.108          24.120         7.36%
In the previous example we have seen that the considered Poisson GLMs do not fully
fit our data, at least not with the chosen feature engineering, because there is over-
dispersion in the data (relative to the chosen models). This may give rise to considering
models that allow for over-dispersion. Typically, such over-dispersed models are
constructed starting from the Poisson model, because the Poisson model enjoys
many nice properties as we have seen above. A natural extension is to introduce the
family of mixed Poisson models, where the frequency is not modeled with a single
parameter but rather with a whole family of parameters described by an underlying
mixing distribution.
In the dual mean parametrization the Poisson distribution for Y = N/v reads as

Y ∼ f(y; λ, v) = e^{−vλ} (vλ)^{vy} / (vy)!    for y ∈ N0/v,
supposed that the moments exist, we refer to Lemma 2.18 in Wüthrich [387]. Hence,
mixing over different frequency parameters allows us to obtain over-dispersion. Of
course, this concept can also be applied to mixing over the canonical parameter θ in
the EF (instead of the mean parameter).
This leads to the framework of Bayesian credibility models which are widely
used and studied in actuarial science, we refer to the textbook of Bühlmann–Gisler
[58]. We have already met this idea in the Bayesian decision rule of Example 3.3
which has led to the Bayesian estimator in Definition 3.6.
Negative-Binomial Model
In the case of the Poisson model, the gamma distribution is a particularly attractive
mixing distribution for λ because it allows for a closed-form solution in (5.36),
and f_π(y; v) will be a negative-binomial distribution.⁴ One can choose different
parametrizations of this mixing distribution, and they will provide different
scalings in the resulting negative-binomial distribution. We choose the
parametrization π (d)= Γ(vα, vα/μ) for mean parameter μ > 0 and shape
parameter vα > 0. This implies, see (5.36),
f_NB(y; μ, v, α) = ∫_{R+} e^{−vλ} (vλ)^{vy}/(vy)! · (vα/μ)^{vα}/Γ(vα) · λ^{vα−1} e^{−vαλ/μ} dλ

               = Γ(vy + vα)/((vy)! Γ(vα)) · v^{vy} (vα/μ)^{vα} / (v + vα/μ)^{vy + vα}

               = binom(vy + vα − 1, vy) (e^θ)^{vy} (1 − e^θ)^{vα},
⁴ The gamma distribution is the conjugate prior to the Poisson distribution. As a result, the posterior
distribution, given observations, will again be a gamma distribution with posterior parameters, see
Section 8.1 of Wüthrich [387]. This Bayesian model has been introduced to the actuarial literature
by Bichsel [32].
setting for the canonical parameter θ = log(μ/(μ + α)) < 0. This is the negative-
binomial distribution we have already met in (2.5). A single-parameter linear EDF
representation is given by (we set unit dispersion parameter ϕ = 1)

Y ∼ f_NB(y; θ, v, α) = exp{ (yθ + α log(1 − e^θ)) / (1/v) + log binom(vy + vα − 1, vy) },    (5.38)
where this is a density w.r.t. the counting measure on N0/v. The cumulant function
and the canonical link, respectively, are given by

κ(θ) = −α log(1 − e^θ)    and    θ = h(μ) = log( μ/(μ + α) ) ∈ Θ = (−∞, 0).
Note that α > 0 is treated as a nuisance parameter (which is a fixed part of the
cumulant function, here). The first two moments of the claim count N = vY are
given by

vμ = E_θ[N] = vα e^θ/(1 − e^θ),    (5.39)

Var_θ(N) = E_θ[N] ( 1 + e^θ/(1 − e^θ) ) = E_θ[N] ( 1 + μ/α ) > E_θ[N].    (5.40)
This shows that we receive a fixed over-dispersion of size μ/α, which (in this
parametrization) does not depend on the exposure v; this is the reason for choosing
the mixing distribution π (d)= Γ(vα, vα/μ). This parametrization is called the NB2
parametrization.
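The gamma-mixed Poisson construction and the closed-form NB pmf above can be cross-checked numerically. A small Python sketch (made-up parameter values; numerical trapezoidal integration over the mixing density):

```python
import math
import numpy as np

mu, alpha, v = 0.1, 1.8, 0.75          # mean, nuisance shape, exposure

def nb2_pmf(k):
    """Closed-form negative-binomial pmf of N = vY at count k,
    as derived above, with canonical parameter theta."""
    theta = math.log(mu / (mu + alpha))
    log_binom = (math.lgamma(k + v * alpha) - math.lgamma(k + 1)
                 - math.lgamma(v * alpha))
    return math.exp(log_binom + k * theta
                    + v * alpha * math.log1p(-math.exp(theta)))

def mixed_pmf(k):
    """Poisson pmf integrated against the Gamma(v*alpha, v*alpha/mu)
    mixing density for lambda (trapezoidal rule)."""
    lam = np.linspace(1e-9, 5.0, 200001)
    a, b = v * alpha, v * alpha / mu
    log_dens = (a * math.log(b) - math.lgamma(a)
                + (a - 1) * np.log(lam) - b * lam            # mixing density
                - v * lam + k * np.log(v * lam) - math.lgamma(k + 1))
    f = np.exp(log_dens)
    return float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(lam)))

max_err = max(abs(nb2_pmf(k) - mixed_pmf(k)) for k in range(5))
mean_n = sum(k * nb2_pmf(k) for k in range(80))  # should equal v * mu, see (5.39)
```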
Remarks 5.21
• We emphasize that the effective domain Θ = (−∞, 0) is one-sided bounded.
Therefore, the canonical link will not work for the linear predictor in general,
because the linear predictor x → η(x) can be unbounded on both sides in a GLM
setting. Instead, we use the log-link for g(·) in our example below, with the
downside that one loses the balance property.
• The unit deviance in this negative-binomial EDF model is given by

(y, μ) → d(y, μ) = 2 [ y log(y/μ) − (y + α) log( (y + α)/(μ + α) ) ],
we also refer to Table 4.1 for α = 1. We emphasize that this is the unit deviance
in a single-parameter linear EDF, and we only aim at estimating the canonical
parameter θ ∈ Θ and the mean parameter μ ∈ M, respectively, whereas α > 0 is
treated as a given nuisance parameter. This is important because the unit deviance
relies on the saturated model which, in general, estimates a one-dimensional
The cumulant function and the canonical link, respectively, are now given by

κ(θ) = −log(1 − e^θ)    and    θ = h(μ̃) = log( μ̃/(μ̃ + 1) ) ∈ Θ = (−∞, 0).

The first two moments of Ỹ are

μ̃ = E_θ[Ỹ] = e^θ/(1 − e^θ),

Var_θ(Ỹ) = (ϕ̃/v) κ″(θ) = (1/(vα)) μ̃ (1 + μ̃).

Thus, we receive the reproductive EDF representation with dispersion parameter
ϕ̃ = 1/α and variance function V(μ̃) = μ̃(1 + μ̃). Moreover, N = vY = vα Ỹ.
• The negative-binomial model with the NB1 parametrization uses the mixing
distribution π (d)= Γ(μv/α, v/α). This leads to mean E_θ[N] = vμ and
variance Var_θ(N) = E_θ[N](1 + α). In this parametrization, μ enters the gamma
function as Γ(μv/α) in the gamma density, which does not allow for an EDF
representation. This parametrization has been called NB1 by Cameron–Trivedi
[63] because both terms in the variance Var_θ(N) = vμ + vμα are linear in μ. In
contrast, in the NB2 parametrization the second term vμ²/α is quadratic in μ,
see (5.40). Further discussion is provided in Greene [171].
All previous statements have been based on the assumption that α > 0 is a
given nuisance parameter. If α needs to be estimated, too, then we drop out
of the EF. In this case, an iterative estimation procedure is applied to the EDF
representation (5.38). One starts with a fixed nuisance parameter α^{(0)} and fits the
family=negative.binomial(theta, link="log").
where theta > 0 denotes the nuisance parameter. The tricky part now is that we
have to bring in the different exposures vi of all policies 1 ≤ i ≤ n. That is, we
would like to have for claim counts n_i = v_i y_i, see (5.38),

f_NB(y_i; μ_i, v_i, α) = binom(v_i y_i + v_i α − 1, v_i y_i) ( v_i μ_i/(v_i μ_i + v_i α) )^{v_i y_i} ( 1 − v_i μ_i/(v_i μ_i + v_i α) )^{v_i α}

                      = binom(v_i y_i + v_i α − 1, v_i y_i) [ ( μ_i/(μ_i + α) )^{y_i} ( 1 − μ_i/(μ_i + α) )^{α} ]^{v_i}.
The square bracket can be implemented in glm as a scaled and weighted regression
problem, see Listing 5.8 with theta = α. This approach provides the correct GLM
parameter estimates β̂^MLE for given α; however, the outputted AIC values cannot
be compared to the Poisson case. Note that the Poisson case of Table 5.5 considers
observations N_i whereas Listing 5.8 uses Y_i = N_i/v_i. For this reason we calculate
the log-likelihood and AIC with our own implementation.
The same remark applies to glm.nb, and also nuisance parameter estimation
cannot be performed by that routine under different exposures vi . Therefore, we
have implemented an iterative estimation algorithm ourselves, alternating glm of
Listing 5.8 for given α and a maximization routine optimize to find the optimal
α for given β using (5.38). We have applied this iteration in Example 5.23, below,
and it has converged in 5 iterations.
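The alternating estimation just described can be sketched outside of R as well. The following Python sketch uses simulated data, our own Fisher-scoring weights derived from (5.38), and a grid search in place of optimize; the log-likelihood is non-decreasing over the alternation:

```python
import math
import numpy as np

lgamma = np.vectorize(math.lgamma)

def nb_loglik(y, v, mu, alpha):
    """Log-likelihood of the negative-binomial EDF (5.38) with exposures v."""
    theta = np.log(mu / (mu + alpha))
    lb = lgamma(v * y + v * alpha) - lgamma(v * y + 1) - lgamma(v * alpha)
    return float(np.sum(v * (y * theta + alpha * np.log1p(-np.exp(theta))) + lb))

def fit_beta(X, y, v, alpha, beta, n_iter=30):
    """Fisher scoring for the NB GLM with log-link, alpha held fixed."""
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        w = v * alpha * mu / (mu + alpha)      # Fisher information weights
        z = X @ beta + (y - mu) / mu           # working responses
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# simulated gamma-mixed Poisson data with exposures
rng = np.random.default_rng(3)
m = 1500
v = rng.integers(1, 5, m).astype(float)
x = rng.normal(size=m)
X = np.column_stack([np.ones(m), x])
mu_true = np.exp(-1.0 + 0.4 * x)
lam = rng.gamma(v * 2.0, mu_true / (v * 2.0))   # mixing Gamma(v*alpha, v*alpha/mu)
y = rng.poisson(v * lam) / v                    # observations Y = N / v

# alternate: beta given alpha (scoring), then alpha given beta (grid search)
beta, alpha, lls = np.zeros(2), 1.0, []
for _ in range(4):
    beta = fit_beta(X, y, v, alpha, beta)
    mu = np.exp(X @ beta)
    grid = np.linspace(0.5, 6.0, 111)
    alpha = float(grid[int(np.argmax([nb_loglik(y, v, mu, a) for a in grid]))])
    lls.append(nb_loglik(y, v, mu, alpha))
```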
Table 5.7 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses
(units are in 10^−2) and in-sample average frequency of the null models (Poisson and negative-
binomial) and the Poisson and negative-binomial GLMs. The optimal model is highlighted in
boldface

Model                           Run time   # Param.   AIC       In-sample    Out-of-sample   Aver.
                                                                loss on L    loss on T       freq.
Poisson null                       –           1      199'506   25.213       25.445          7.36%
Poisson GLM3                      15 s        50      192'716   24.084       24.102          7.36%
NB null (α̂_null^MLE = 1.059)       –           2      198'466   20.357       20.489          7.36%
NB null (α̂_NB^MLE = 1.810)         –           1      198'564   21.796       21.948          7.36%
NB GLM3 (α̂_NB^MLE = 1.810)        85 s        51      192'113   20.722       20.674          7.38%
null model. The NB null model has two parameters, the homogeneous (overall)
frequency and the nuisance parameter. The MLE of the homogeneous overall frequency
is identical to the one in the Poisson null model, and the MLE of the nuisance parameter
is α̂_null^MLE = 1.059. This is substantially smaller than infinity and suggests
over-dispersion. The results are presented on the third line of Table 5.7. We observe
a smaller AIC of the NB null model compared to the Poisson null model, which says that
we should allow for over-dispersion.
We now focus on the NB GLM. The feature pre-processing is done exactly as
in model Poisson GLM3, and we choose the log-link for g. We call this model
NB GLM3. The iterative estimation procedure outlined above provides a nuisance
parameter estimate α̂_NB^MLE = 1.810. This is bigger than in the NB null model because
the regression structure explains some part of the over-dispersion; however, it is
still substantially smaller than infinity, which justifies the inclusion of this over-
dispersion parameter.
The last line of Table 5.7 gives the result of model NB GLM3. From AIC we
conclude that we favor the negative-binomial GLM over the Poisson GLM since
AIC decreases from 192'716 to 192'113. The in-sample and out-of-sample deviance
losses can only be compared within the same models, i.e., the models that have the
same cumulant function. This also applies to the negative-binomial models which
have cumulant function κ(θ) = −α log(1 − e^θ). Thus, to compare the NB null
model and model NB GLM3, we need to choose the same nuisance parameter α.
For this reason we added this second NB null model to Table 5.7. This second NB
null model no longer uses the MLE α̂_null^MLE; therefore, the corresponding AIC only
includes one estimated parameter.
Table 5.8 Out-of-sample deviance losses: forecast dominance. The optimal model is highlighted
in boldface

Model                         Poisson     NB deviance            NB deviance
                              deviance    α̂_null^MLE = 1.059     α̂_NB^MLE = 1.810
Null model                    25.445      20.489                 21.948
Poisson GLM3                  24.102      19.266                 20.678
NB GLM3 (α̂_NB^MLE = 1.810)    24.100      19.262                 20.674
As mentioned above, deviance losses can only be compared under exactly the
same cumulant function (including the same nuisance parameters). If we want
a more robust model selection, we can consider forecast dominance according
to Definition 4.20. Being less ambitious, here, we consider forecast dominance
only for the three considered cumulant functions: Poisson, negative-binomial with
α̂_null^MLE = 1.059 and negative-binomial with α̂_NB^MLE = 1.810. The out-of-sample
deviance losses are given in the different columns of Table 5.8. According to this
forecast dominance analysis we also give preference to model NB GLM3, but model
Poisson GLM3 is pretty close.
Figure 5.7 compares the logged predictors log(μ̂_i), 1 ≤ i ≤ n, of the models
Poisson GLM3 and NB GLM3. We see a strong similarity in these predictors; only
high frequency policies are judged slightly differently by the NB model compared
to the Poisson model.
Table 5.9 gives the predicted numbers of claims against the observed ones. We
observe that model NB GLM3 predicts the number of policies with two or fewer claims
more accurately, but it over-estimates the number of policies with more than two claims.
This may also be related to the fact that the estimated in-sample frequency has a
Table 5.9 Contingency table of observed number of policies against predicted number of policies
with given claim counts ClaimNb

Numbers of claims ClaimNb                  0        1      2     3    4     5
Observed number of policies             587'772  21'198  1'174    57    4     1
Poisson predicted number of policies    587'325  22'064    779    34    3   0.3
NB predicted number of policies         587'902  20'982  1'200   100   15     4
positive bias in model NB GLM3, see Table 5.7. That is, since we do not work with
the canonical link, we do not have the balance property.
We close this example by providing the drop1 analysis in Listing 5.9. From
this analysis we conclude that the feature component Area should be dropped.
Of course, this confirms the high collinearity between Density and Area which
implies that we do not need both variables in the model. We remark that the AIC
values in Listing 5.9 are not on our scale, as stated in Remark 5.22.
In many applications it is the case that the Poisson distribution does not fully fit
the claim counts data because there are too many policies with zero claims, i.e.,
the data is zero-inflated relative to the Poisson model. The zero-inflated Poisson
(ZIP) model adds an extra point mass in zero to the Poisson distribution,

f_ZIP(y; θ, π0) = π0 1{y=0} + (1 − π0) e^{−μ} μ^y / y!    for y ∈ N0,

for π0 ∈ (0, 1), μ = e^θ > 0, and for the Poisson probability weights we refer
to (2.4). For π0 > 0 the weight of a zero claim Y = 0 is increased (inflated)
compared to the original Poisson distribution.
Remarks 5.24
• The ZIP distribution has different interpretations. It can be interpreted as a
hierarchical model where we have a latent variable Z which indicates with
probability π0 that we have an excess zero, and with probability 1 − π0 we have
an ordinary Poisson distribution, i.e., for y ∈ N0

P_θ[Y = y | Z = z] = { 1{y=0}            for z = 0,
                     { e^{−μ} μ^y / y!   for z = 1,        (5.41)
for π0 ∈ (0, 1) and μ > 0.
• Alternatively, one can fix the probability of a zero claim directly at π0 ∈ (0, 1)
and distribute the remaining probability mass 1 − π0 over the strictly positive
claim counts of the Poisson distribution. For π0 > e^{−μ} the weight of a zero claim
is increased and for π0 < e^{−μ} it is decreased. This distribution is called a hurdle
distribution, because we first need to overcome the hurdle at zero to come to the Poisson
model. Lower-truncated distributions are studied in Sect. 6.4, below, and mixture
distributions are discussed in Sect. 6.3.1. In general, fitting lower-truncated
distributions is challenging because the density and the distribution function
should both have tractable forms to perform MLE for truncated distributions.
The Expectation-Maximization (EM) algorithm is a useful tool to perform
model fitting under truncation. We come back to the hurdle Poisson model in
Example 6.19, below, and it is also closely related to the zero-truncated Poisson
(ZTP) model discussed in Remarks 6.20.
The first two moments of a ZIP random variable Y ∼ f_ZIP(·; θ, π0) are given by

E_{θ,π0}[Y] = (1 − π0) μ,

Var_{θ,π0}(Y) = (1 − π0)μ + (π0 − π0²)μ² = E_{θ,π0}[Y] (1 + π0 μ),
These calculations easily follow with the latent variable Z interpretation from above.
As a consequence, we receive an over-dispersed model with over-dispersion π0 μ
(the latter also follows from the fact that we consider a mixed Poisson distribution
with a Bernoulli mixing distribution having weights π0 in 0 and 1 − π0 in μ > 0,
see (5.37)).
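These two moment formulas can be checked numerically; a small Python sketch with made-up parameter values:

```python
import math

pi0, mu = 0.2, 1.3          # zero-inflation probability and Poisson mean

def f_zip(y):
    """ZIP probability weights: extra point mass in zero plus the
    down-weighted Poisson probability weights."""
    pois = math.exp(-mu) * mu ** y / math.factorial(y)
    return pi0 * (y == 0) + (1 - pi0) * pois

mass = sum(f_zip(y) for y in range(120))
mean = sum(y * f_zip(y) for y in range(120))
var = sum(y * y * f_zip(y) for y in range(120)) - mean ** 2

# the two moment formulas from the text
mean_formula = (1 - pi0) * mu
var_formula = mean_formula * (1 + pi0 * mu)
```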
Unfortunately, MLE does not allow for explicit solutions in this model. The score
equations of Y_i i.i.d.∼ f_ZIP(·; θ, π0) are given by

∇_{(π0,μ)} ℓ_Y(π0, μ) = Σ_{i=1}^n ∇_{(π0,μ)} log( π0 + (1 − π0) e^{−μ} ) 1{Y_i=0}

                        + Σ_{i=1}^n ∇_{(π0,μ)} log( (1 − π0) e^{−μ} μ^{Y_i}/Y_i! ) 1{Y_i>0} = 0.
The R package pscl [401] has a function called zeroinfl which uses the general
purpose optimizer optim to find the MLEs in the ZIP model. Alternatively, we
could explore the EM algorithm for mixture distributions presented in Sect. 6.3,
below.
In insurance applications, the ZIP model can be problematic if we have
different exposures v_i > 0 for different insurance policies i. In the Poisson GLM
case with canonical link choice we typically integrate the different exposures into
the offset, see (5.27). However, it is not clear whether and how we should integrate
the different exposures into the zero-inflation probability π0 . It seems natural to
believe that shorter exposures should increase π0 , but the explicit functional form of
this increase can be debated, some options are discussed in Section 5 of Lee [239].
Table 5.10 Run times, number of parameters, AICs, in-sample and out-of-sample deviance
losses (units are in 10^−2) and in-sample average frequency of the null models (Poisson, negative-
binomial and ZIP) and the Poisson, negative-binomial and ZIP GLMs. The optimal model is
highlighted in boldface

Model                           Run time   # Param.   AIC       In-sample    Out-of-sample   Aver.
                                                                loss on L    loss on T       freq.
Poisson null                       –           1      199'506   25.213       25.445          7.36%
Poisson GLM3                      15 s        50      192'716   24.084       24.102          7.36%
NB null (α̂_null^MLE = 1.059)       –           2      198'466   20.357       20.489          7.36%
NB null (α̂_NB^MLE = 1.810)         –           1      198'564   21.796       21.948          7.36%
NB GLM3 (α̂_NB^MLE = 1.810)        85 s        51      192'113   20.722       20.674          7.38%
ZIP null                          20 s         2      198'638   –            –               7.43%
ZIP GLM3 (null π0)               270 s        51      192'393   –            –               7.37%
Table 5.11 Out-of-sample deviance losses: forecast dominance. The optimal model is highlighted
in boldface

Model                         Poisson     NB deviance            NB deviance
                              deviance    α̂_null^MLE = 1.059     α̂_NB^MLE = 1.810
Null model                    25.445      20.489                 21.948
Poisson GLM3                  24.102      19.266                 20.678
NB GLM3 (α̂_NB^MLE = 1.810)    24.100      19.262                 20.674
ZIP null model                25.446      20.490                 21.949
ZIP GLM3                      24.103      19.267                 20.679
Table 5.12 Contingency table of observed numbers of policies against predicted numbers of
policies with given claim counts ClaimNb

Numbers of claims ClaimNb                  0        1      2     3    4     5
Observed number of policies             587'772  21'198  1'174    57    4     1
Poisson predicted number of policies    587'325  22'064    779    34    3   0.3
NB predicted number of policies         587'902  20'982  1'200   100   15     4
ZIP predicted number of policies        587'829  21'094  1'191    79    9     4
As a second example we consider claim size modeling within GLMs. For this
example we do not use the French MTPL claims data because the empirical
density plot in Fig. 13.15 indicates that a GLM will not fit to that data. The French
MTPL data seems to have three distinct modes, which suggests using a mixture
distribution. Moreover, the log-log plot indicates a regularly varying tail, which
cannot be captured by the EDF on the original observation scale; we are going
to study this data in Example 6.14, below. Here, we use the Swedish motorcycle
data, previously used in the textbook of Ohlsson–Johansson [290] and described in
Chap. 13.2. From Fig. 5.9 we see that the empirical density has one mode, and the
log-log plot supports light tails, i.e., the gamma model might be a suitable choice for
this data. Therefore, we choose a gamma GLM with log-link g. As described above,
the log-link is not the canonical link for the gamma EDF distribution but it ensures
the right sign w.r.t. the linear predictor ηi = β, x i . Working with the log-link in
the gamma model will imply that the balance property is not fulfilled.
Fig. 5.9 (lhs) Empirical density, (middle) empirical distribution and (rhs) log-log plot of claim
amounts of the Swedish motorcycle data presented in Chap. 13.2
Feature Engineering
The Swedish motorcycle claim amount data poses the special difficulty that we
do not have individual claim observations Z_{i,j}, but we only know the total claim
amounts S_i = Σ_{j=1}^{N_i} Z_{i,j} and the number of claims N_i on each insurance policy;
Fig. 5.9 shows average claims S_i/N_i of insurance policies i with N_i > 0. In general,
this poses a problem in statistical modeling, but in the gamma model this problem
can be handled because the gamma distribution is closed under aggregation of
i.i.d. gamma claims Z_{i,j}. In all that follows in this section, we only study insurance
policies with N_i > 0, and we label these insurance policies i accordingly.
Assume that the Z_{i,j} are i.i.d. gamma distributed with shape parameter α_i and scale
parameter c_i, we refer to (2.6). The mean, the variance and the moment generating
function of Z_{i,j} are given by

E[Z_{i,j}] = α_i/c_i,    Var(Z_{i,j}) = α_i/c_i²    and    M_{Z_{i,j}}(r) = ( c_i/(c_i − r) )^{α_i},    (5.43)

where the moment generating function requires r < c_i to be finite. Assuming that
the number of claims N_i is a known positive integer n_i ∈ N, we see from the
moment generating function that S_i = Σ_{j=1}^{n_i} Z_{i,j} is again gamma distributed with
shape parameter n_i α_i and scale parameter c_i. We change the notation from N_i to
n_i to emphasize that the number of claims is treated as a known constant (and
also to avoid using the notation of conditional probabilities, here). Finally, we scale
Y_i = S_i/(n_i α_i) ∼ Γ(n_i α_i, n_i α_i c_i). This random variable Y_i has a single-parameter
EDF gamma distribution with weight v_i = n_i, dispersion ϕ_i = 1/α_i and cumulant
function κ(θ_i) = −log(−θ_i), for θ_i ∈ Θ = (−∞, 0),

Y_i ∼ f(y; θ_i, v_i/ϕ_i) = exp{ (yθ_i − κ(θ_i))/(ϕ_i/v_i) + a(y; v_i/ϕ_i) }    (5.44)

                        = ( (−θ_i α_i v_i)^{v_i α_i} / Γ(v_i α_i) ) y^{v_i α_i − 1} exp{ −(−θ_i α_i v_i) y },
and the canonical parameter is θ_i = −c_i. For our GLM analysis we treat the shape
parameter α_i ≡ α > 0 as a nuisance parameter that does not depend on the specific
policy i, i.e., we set constant dispersion ϕ = 1/α, and only the scale parameter c_i is
chosen policy dependent through θ_i = −c_i.

The random variable Y_i = S_i/(n_i α) ∼ Γ(n_i α, n_i α c_i) gives the reproductive form
of the gamma EDF, see Remarks 2.13. In applications, this form is not directly
useful because under unknown shape parameter α, we cannot calculate the observations
Y_i = S_i/(n_i α). For this reason, we parametrize the model differently, here. We
consider instead

Y_i = S_i/n_i ∼ Γ(n_i α, n_i c_i).    (5.45)

This (new) random variable has the same gamma EDF (5.44), we only need to
reinterpret the canonical parameter as θ_i = −c_i/α. Then, we choose the log-link
for g which implies

μ_i = E_{θ_i}[Y_i] = κ′(θ_i) = −1/θ_i = exp{η_i} = exp⟨β, x_i⟩.
Because we only have few claims in this Swedish motorcycle example (only
m = 656 insurance policies suffer claims), we do not perform a generalization
analysis with learning and test samples. In this situation we need all data for
model fitting, and model performance is analyzed with AIC and with tenfold cross-
validation.
The in-sample deviance loss in the gamma GLM is given by

D(L, μ̂(·)) = (2/m) Σ_{i=1}^m (n_i/ϕ) [ (Y_i − μ̂(x_i))/μ̂(x_i) − log( Y_i/μ̂(x_i) ) ],    (5.46)

where i runs over the policies i = 1, ..., m with positive claims Y_i = S_i/n_i > 0,
and μ̂(x_i) = exp⟨β̂^MLE, x_i⟩ is the MLE estimated regression function. Similar to
the Poisson case (5.29), McCullagh–Nelder [265] derive the following behavior
Fig. 5.10 (lhs) Empirical density of Y_i and (rhs) empirical density of Y_i^{1/3}
for the gamma unit deviance around its mode, see Section 7.2 and Figure 7.2 in
McCullagh–Nelder [265],

d(Y_i, μ_i) ≈ 9 Y_i^{2/3} ( Y_i^{−1/3} − μ_i^{−1/3} )²,    (5.47)

this uses that the log-likelihood is symmetric around its mode on the scale μ_i^{−1/3}, see
Fig. 5.5 (middle). This shows that the gamma deviance scales differently around Y_i
compared to the square loss function. From this we receive an approximation to the
deviance residuals (for v/ϕ = 1)

r_i^D = sign(Y_i − μ_i) √(d(Y_i, μ_i)) ≈ 3 [ (Y_i/μ_i)^{1/3} − 1 ] = 3 (Y_i^{1/3} − μ_i^{1/3}) / μ_i^{1/3}.    (5.48)
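The quality of the cube-root approximations (5.47)–(5.48) is easy to check numerically; a small sketch on a grid of observations around μ = 1:

```python
import math
import numpy as np

def gamma_unit_deviance(y, mu):
    """Exact gamma unit deviance, cf. the summands in (5.46)."""
    return 2.0 * ((y - mu) / mu - math.log(y / mu))

mu = 1.0
ys = np.linspace(0.5, 2.0, 31)
exact = np.array([gamma_unit_deviance(y, mu) for y in ys])

# cube-root approximation (5.47) and the residual approximation (5.48)
approx = 9.0 * ys ** (2 / 3) * (ys ** (-1 / 3) - mu ** (-1 / 3)) ** 2
res_exact = np.sign(ys - mu) * np.sqrt(exact)
res_approx = 3.0 * (ys ** (1 / 3) - mu ** (1 / 3)) / mu ** (1 / 3)

max_dev_err = float(np.max(np.abs(exact - approx)))
max_res_err = float(np.max(np.abs(res_exact - res_approx)))
```

On this range both approximation errors stay well below the size of the deviances themselves.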
Listing 5.11 Results in model Gamma GLM1 using the R command glm
1 Call:
2 glm(formula = ClaimAmount/ClaimNb ~ OwnerAge + I(OwnerAge^2) +
3 AreaGLM + RiskClass + VehAge + I(VehAge^2) + Gender + BonusClass,
4 family = Gamma(link = "log"), data = mcdata0, weights = ClaimNb)
5
6 Deviance Residuals:
7 Min 1Q Median 3Q Max
8 -3.3683 -1.4585 -0.5979 0.4354 3.4763
9
10 Coefficients:
11 Estimate Std. Error t value Pr(>|t|)
12 (Intercept) 8.9737854 0.5532821 16.219 < 2e-16 ***
13 OwnerAge 0.1072781 0.0280862 3.820 0.000147 ***
14 I(OwnerAge^2) -0.0014508 0.0003489 -4.158 3.65e-05 ***
15 AreaGLM -0.0768512 0.0368284 -2.087 0.037303 *
16 RiskClass 0.0615575 0.0327553 1.879 0.060651 .
17 VehAge -0.2051148 0.0296184 -6.925 1.05e-11 ***
18 I(VehAge^2) 0.0062649 0.0015946 3.929 9.45e-05 ***
19 GenderMale 0.1085538 0.1673443 0.649 0.516772
20 BonusClass 0.0089004 0.0225371 0.395 0.693029
21 ---
22 Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
23
24 (Dispersion parameter for Gamma family taken to be 1.536577)
25
26 Null deviance: 1368.0 on 655 degrees of freedom
27 Residual deviance: 1126.5 on 647 degrees of freedom
28 AIC: 14922
29
30 Number of Fisher Scoring iterations: 11
Table 5.13 Run times, number of parameters, AICs, Pearson's dispersion estimate, in-sample
losses, tenfold cross-validation losses and the in-sample average claim amounts of the null model
(gamma intercept model) and the gamma GLMs

Model          Run time   # Param.   AIC      Dispersion    In-sample    Tenfold CV    Average
                                              est. ϕ̂^P      loss on L    loss D^CV     amount
Gamma null        –         1+1      14'416   2.057         2.085        2.091         24'641
Gamma GLM1       1 s        9+1      14'277   1.537         1.717        1.752         25'105
Gamma GLM2       1 s        7+1      14'274   1.544         1.719        1.747         25'130
The results of models Gamma GLM1 and Gamma GLM2 are presented in
Table 5.13. We show AICs, Pearson’s dispersion estimate, the in-sample deviance
losses on all available data, the corresponding tenfold cross-validation losses, and
the average claim amounts.
Firstly, we observe that the GLMs do not meet the balance property. This is
implied by the fact that we do not use the canonical link, to avoid any sort of difficulty
of dealing with the one-sided bounded effective domain Θ = (−∞, 0). For pricing,
the intercept parameter β̂_0^MLE should be shifted to eliminate this bias, i.e., we need to
shift this parameter under the log-link by −log(25'130/24'641) for model Gamma
GLM2.
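This intercept correction can be sketched numerically (simulated fitted means and observations; the numbers 25'130/24'641 in the text play the role of the two averages below):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_hat = np.exp(rng.normal(10.0, 0.8, 656))    # fitted means under log-link
y_obs = mu_hat * rng.gamma(2.0, 0.5, 656)      # observations (biased on average)

# without the canonical link the balance property fails, so we re-shift
# the intercept on the log scale to remove the bias of the average:
shift = -np.log(mu_hat.mean() / y_obs.mean())
mu_corrected = np.exp(np.log(mu_hat) + shift)
```

After the shift, the average of the corrected means matches the average observation exactly, i.e., the balance property is restored for the portfolio total.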
Secondly, the in-sample and tenfold cross-validation losses are not directly
comparable to AIC. Observe that we need to know the dispersion parameter ϕ in
order to calculate both of these statistics. For the in-sample and cross-validation
172 5 Generalized Linear Models
losses we have set ϕ = 1, thus, all these figures are directly comparable. For AIC
we have estimated the dispersion parameter ϕ with MLE. This is the reason for
increasing the number of parameters in Table 5.13 by +1. Moreover, the resulting
AICs differ from the ones received from the R command glm, see, for instance,
Listing 5.11. The AIC value in Listing 5.11 does not consider all terms appropriately
due to the inclusion of weights (this is similar to Remark 5.22): it uses the
deviance dispersion estimate ϕ̂^D, i.e., not the MLE, and it (still) increases the number
of parameters by 1 because the dispersion is estimated. For these reasons, we have
implemented our own code for calculating AIC. Both AIC and the tenfold cross-
validation losses say that we should give preference to model Gamma GLM2.
The dispersion estimate in Listing 5.11 corresponds to Pearson's estimate

    ϕ̂^P = (1 / (m − (q+1))) Σ_{i=1}^m n_i (Y_i − μ̂_i)² / μ̂_i².        (5.49)
We observe that the dispersion estimate is roughly 1.5 which gives an estimate of
the shape parameter α = 1/ϕ of 2/3. A shape parameter less than 1 implies that the
density of the gamma distribution is strictly decreasing, see Fig. 2.1. Often this is a
sign that the model does not fully fit the data, and if we use this model for simulation
we may receive too many observations close to zero compared to the true data.
A shape parameter less than 1 may be implied by more heterogeneity in the data
compared to what the chosen gamma GLM allows for or by large claims that cannot
be explained by the present gamma density structure. Thus, there is some sign here
that the data is more heavy-tailed than our model choice suggests. Alternatively,
there might be some need to also model the shape parameter with a regression
model; this could be done using the vector-valued parameter EF representation of
the gamma model, see Sect. 2.1.3. In view of Fig. 5.10 (rhs) it may also be that
the feature information is not sufficient to describe the second mode, thus, we
probably need more explanatory information to reduce dispersion.
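Pearson's estimate (5.49) is straightforward to compute by hand. The following minimal Python sketch (with hypothetical fitted means and weights, not the Swedish motorcycle data) illustrates the computation:

```python
def pearson_dispersion(y, mu, n_weights, q):
    # Pearson's dispersion estimate (5.49) for a gamma GLM:
    # phi^P = sum_i n_i (Y_i - mu_i)^2 / mu_i^2, divided by m - (q+1)
    m = len(y)
    chi2 = sum(n * (yi - mi) ** 2 / mi ** 2
               for n, yi, mi in zip(n_weights, y, mu))
    return chi2 / (m - (q + 1))

# toy example with hypothetical fitted means mu_i and weights n_i
y = [1.2, 0.8, 2.5, 1.9]
mu = [1.0, 1.0, 2.0, 2.0]
n_weights = [1, 1, 1, 1]
print(pearson_dispersion(y, mu, n_weights, q=1))  # approx. 0.0725
```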
In Fig. 5.11 we give the Tukey–Anscombe plot and a QQ plot. Note that the
observations for n_i = 1 follow a gamma distribution with shape parameter α
and scale parameter c_i = α/μ_i = −αθ_i. Thus, if we scale Y_i/μ_i, we receive
i.i.d. gamma random variables with shape and scale parameters equal to α. This
then allows us for n_i = 1 to plot the empirical distribution of Y_i/μ̂_i against
Γ(α̂, α̂) in a QQ plot, where we estimate 1/α by Pearson's dispersion estimate.
Anscombe plot looks reasonable, but the QQ plot shows that the gamma model
does not entirely fit the data. From this plot we cannot conclude whether the gamma
distribution is causing the problem or whether it is a missing term in the regression
structure. We only see that the data is over-dispersed, resulting in more heavy-tailed
observations than the theoretical gamma model can explain, and a compensation
by too many small observations (which is induced by over-dispersion, i.e., a shape
parameter smaller than one). In the network chapter we will refine the regression
function, keeping the gamma assumption, to understand which modeling part is
causing the difficulty.
Remark 5.26 For the calculation of AIC in Table 5.13 we have used the MLE of the
dispersion parameter ϕ. This is obtained by solving the score equation (5.11) for the
5.3 Model Validation 173
Fig. 5.11 (lhs) Tukey–Anscombe plot of the fitted model Gamma GLM2, and (rhs) QQ plot of
the fitted model Gamma GLM2
gamma case. It is given by (we set α = 1/ϕ and calculate the MLE of α instead)

    ∂/∂α ℓ_Y(β, α) = Σ_{i=1}^n v_i [ Y_i h(μ(x_i)) − κ(h(μ(x_i))) + log Y_i + log(αv_i) + 1 − ψ(αv_i) ] = 0,

where ψ(α) = Γ′(α)/Γ(α) is the digamma function. We calculate the second
derivative w.r.t. α, see also (2.30),

    ∂²/∂α² ℓ_Y(β, α) = Σ_{i=1}^n v_i [ 1/α − v_i ψ′(αv_i) ] = Σ_{i=1}^n v_i² [ 1/(αv_i) − ψ′(αv_i) ] < 0   for α > 0,

where the negativity follows from Theorem 1 in Alzer [9]. In fact, the function
log α − ψ(α) is strictly completely monotonic for α > 0. This says that the log-likelihood
ℓ_Y(β, α) is a concave function in α > 0 and the solution to the score equation is
unique, giving the MLE of α and ϕ, respectively.
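Since the score is strictly decreasing in α, its root is easy to find numerically. The following stdlib-only Python sketch (a numerical digamma via a central difference of `math.lgamma`, simulated placeholder data with known means; not the book's R implementation) solves the score equation by bisection:

```python
import math, random

def digamma(x, h=1e-5):
    # numerical digamma psi(x) via central difference of log-gamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def shape_score(alpha, y, mu, v):
    # score d/d alpha of the gamma log-likelihood (canonical link h(mu) = -1/mu)
    return sum(
        vi * (-yi / mi - math.log(mi) + math.log(yi)
              + math.log(alpha * vi) + 1.0 - digamma(alpha * vi))
        for yi, mi, vi in zip(y, mu, v)
    )

def shape_mle(y, mu, v, lo=1e-3, hi=1e3):
    # the score is strictly decreasing in alpha (concavity), so bisect
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if shape_score(mid, y, mu, v) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# simulated placeholder data: true shape alpha = 2, means treated as known
random.seed(1)
mu = [1.0, 2.0, 0.5] * 400
y = [random.gammavariate(2.0, m / 2.0) for m in mu]
alpha_hat = shape_mle(y, mu, [1.0] * len(y))
```

On this simulated sample `alpha_hat` lands close to the true shape parameter 2.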
We present the inverse Gaussian GLM in this section as a competing model to the
gamma GLM studied in the previous section.
Infinite Divisibility
In the gamma model above we have used that the total claim amount S = Σ_{j=1}^n Z_j
has a gamma distribution for given claim counts N = n > 0 and i.i.d. gamma
claim sizes Z_j. This property is closely related to divisibility. A random variable S
is called divisible by n ∈ ℕ if there exist i.i.d. random variables Z_1, …, Z_n such
that

    S =(d) Σ_{j=1}^n Z_j.
Alternatively to the gamma GLM one often explores an inverse Gaussian GLM
which has a cubic variance function V (μ) = μ3 . We bring this inverse Gaussian
model into the same form as the gamma model of Sect. 5.3.7, so that we can
aggregate claims within insurance policies. The mean, the variance and the moment
generating function of an inverse Gaussian random variable Zi,j with parameters
α_i, c_i > 0 are given by

    E[Z_{i,j}] = α_i/c_i,    Var(Z_{i,j}) = α_i/c_i³    and    M_{Z_{i,j}}(r) = exp{ α_i ( c_i − √(c_i² − 2r) ) },

where the moment generating function requires r < c_i²/2 to be finite. From the
moment generating function we see that S_i = Σ_{j=1}^{n_i} Z_{i,j} is inverse Gaussian
distributed with parameters n_i α_i and c_i. Finally, we scale Y_i = S_i/(n_i α_i) which
provides us with an inverse Gaussian distribution with parameters n_i^{1/2} α_i^{1/2}
and n_i^{1/2} α_i^{1/2} c_i. This random variable Y_i has a single-parameter EDF inverse
Gaussian distribution in its reproductive form, namely,
    Y_i ∼ f(y; θ_i, v_i/ϕ_i) = exp{ (yθ_i − κ(θ_i)) / (ϕ_i/v_i) + a(y; v_i/ϕ_i) }        (5.50)

                             = √( v_i / (2π ϕ_i y³) ) exp{ − (1 − √(−2θ_i) y)² / (2 ϕ_i y / v_i) },
5.3 Model Validation 175
with cumulant function κ(θ) = −√(−2θ) for θ ∈ Θ = (−∞, 0], weight v_i = n_i,
dispersion parameter ϕ_i = 1/α_i and canonical parameter θ_i = −c_i²/2.
Similarly to the gamma case, this representation is not directly useful if the
parameter αi is not known. Therefore, we parametrize this model differently.
Namely, we consider
    Y_i = S_i/n_i ∼ InvGauss( n_i^{1/2} α_i, n_i^{1/2} c_i ).        (5.51)
This re-scaled random variable has that same inverse Gaussian EDF (5.50), but
we need to re-interpret the parameters. We have dispersion parameter ϕi = 1/αi2
and canonical parameter θi = −ci2 /(2αi2 ). For our GLM analysis we will treat
the parameter αi ≡ α > 0 as a nuisance parameter that does not depend on the
specific policy i. Thus, we have constant dispersion ϕ = 1/α 2 and only the scale
parameter ci is assumed to be policy dependent through the canonical parameter
θi = −ci2 /(2α 2 ).
We are now in the same situation as in the gamma case in Sect. 5.3.7. We choose
the log-link for g which implies
    μ_i = E_{θ_i}[Y_i] = κ′(θ_i) = 1/√(−2θ_i) = exp{η_i} = exp⟨β, x_i⟩,

and deviance loss

    D(L, μ̂(·)) = (1/m) Σ_{i=1}^m (n_i/ϕ) (Y_i − μ̂(x_i))² / (μ̂(x_i)² Y_i),        (5.52)

where i runs over the policies i = 1, …, m with positive claims Y_i = S_i/n_i > 0,
and μ̂(x_i) = exp⟨β̂^MLE, x_i⟩ is the MLE estimated regression function. The unit
deviances behave as

    d(Y_i, μ_i) = Y_i ( Y_i^{−1} − μ_i^{−1} )²,        (5.53)
Table 5.14 Run times, number of parameters, AICs, in-sample losses, tenfold cross-validation
losses and the in-sample average claim amounts of the null gamma model, model Gamma GLM2,
the null inverse Gaussian model, and model inverse Gaussian GLM2; the deviance losses use unit
dispersion ϕ = 1
              Run time   # Param.   AIC      In-sample loss on L   Tenfold CV loss D^CV   Average amount
 Gamma null   –          1+1        14'416   2.085                 2.091                  24'641
 Gamma GLM2   1s         7+1        14'274   1.719                 1.747                  25'130
 IG null      –          1+1        14'715   5.012 · 10⁻⁴          5.016 · 10⁻⁴           24'641
 IG GLM2      1s         7+1        14'686   4.793 · 10⁻⁴          4.820 · 10⁻⁴           32'268
note that the log-likelihood is symmetric around its mode for scale μ_i^{−1}, see Fig. 5.5
(rhs). From this we receive deviance residuals (for v/ϕ = 1)

    r_i^D = sign(Y_i − μ̂_i) d(Y_i, μ̂_i)^{1/2} = Y_i^{1/2} ( μ̂_i^{−1} − Y_i^{−1} ).

The coefficient of variation of this inverse Gaussian model is Vco(Y_i) = (ϕ μ_i / v_i)^{1/2},
thus, it monotonically increases in the expected claim size μ_i. It seems that this
structure is not fully suitable for this data set, i.e., there is no indication that the
coefficient of variation increases in the expected claim size. We come back to a
comparison of the gamma and the inverse Gaussian model in Sect. 11.1, below.
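The unit deviance (5.53) is just an algebraic rewrite of the more familiar form (y − μ)²/(μ²y) appearing in (5.52); a short Python check on arbitrary test points confirms the identity:

```python
def ig_unit_deviance(y, mu):
    # inverse Gaussian unit deviance in the form (5.53)
    return y * (1.0 / y - 1.0 / mu) ** 2

def ig_unit_deviance_alt(y, mu):
    # the equivalent form (y - mu)^2 / (mu^2 y) used in (5.52)
    return (y - mu) ** 2 / (mu ** 2 * y)

for yv, muv in [(0.5, 1.0), (2.0, 1.5), (3.7, 3.7)]:
    assert abs(ig_unit_deviance(yv, muv) - ig_unit_deviance_alt(yv, muv)) < 1e-12
```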
Another way to improve the gamma model of Sect. 5.3.7 could be to use a log-
normal distribution instead. In the above situation this does not work because the
observations are not in the right format. If the claim observations Zi,j are log-
5.3 Model Validation 177
    β̂^MLE = (X⊤X)^{−1} X⊤ Y,        (5.54)
for full rank q + 1 ≤ n design matrix X. Note that in this case we have a closed-
form solution for the MLE of β. This is called the homoskedastic case because
all observations Yi are assumed to have the same variance σ 2 , otherwise, in the
heteroskedastic case, we would still have to include the covariance matrix.
Since we work with the canonical link on the log-scale we have the balance
property on the log-scale, see Corollary 5.7. Thus, we receive unbiasedness

    E_β[ Σ_{i=1}^n E_{β̂^MLE}[Y_i] ] = E_β[ Σ_{i=1}^n ⟨β̂^MLE, x_i⟩ ] = Σ_{i=1}^n E_β[Y_i] = Σ_{i=1}^n μ(x_i).        (5.55)
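In the homoskedastic Gaussian case this balance property is easy to verify numerically: with an intercept in the design matrix, the fitted values of (5.54) sum to the observations. A small numpy sketch on simulated (purely hypothetical) log-claims:

```python
import numpy as np

# hypothetical log-claim observations with a single covariate
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 9.0 + 0.3 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])        # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)    # closed-form MLE (5.54)
fitted = X @ beta

# balance property on the log-scale: fitted values and observations
# have the same total, i.e. the residuals sum to (numerically) zero
print(abs(fitted.sum() - y.sum()))
```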
Fig. 5.12 (lhs) Residuals against the fitted log-means E[Y] = E[log Z], (rhs) fitted means Ê[Z]
against the fitted log-means E[Y] = E[log Z]
If we move back to the original scale of the observations Z_i we receive from the
log-normal assumption

    E_{(β̂^MLE, σ²)}[Z_i] = exp{ ⟨β̂^MLE, x_i⟩ + σ²/2 }.
Therefore, we need to adjust with the nuisance parameter σ 2 for the back-
transformation to the original observation scale. At this point, typically, the dif-
ficulties start. Often, a good back-transformation involves a feature dependent
variance parameter σ 2 (x i ), thus, in many practical applications the homoskedas-
ticity assumption is not fulfilled, and a constant variance parameter choice leads to
a poor model on the original observation scale.
A suitable estimation of σ 2 (x i ) may turn out to be rather difficult. This is
illustrated in Fig. 5.12. The left-hand side of this figure shows the Tukey–Anscombe
plot of the homoskedastic case providing unscaled (σ 2 ≡ 1) (Pearson’s) residuals
on the log-scale
    r_i^P = log(Z_i) − μ̂(x_i) = Y_i − μ̂(x_i).
The right-hand side of Fig. 5.12 plots the estimated means μ̂_{Z_i} = Ê[Z_i] against the
estimated means μ̂(x_i) on the log-scale. We observe a graph that is non-monotone,
implied by the non-monotonicity of the standard deviation estimate σ̂(x_i) as a
function of μ̂(x_i). This non-monotonicity is not bad per se, as we still have a
proper statistical model, however, it might be rather counter-intuitive and difficult to
explain. For this reason it is advisable to directly model the expected value by one
single function, and not to decompose it into different regression functions.
Another important point to be considered is that for model selection using AIC
we have to work on the same scale for all models. Thus, if we use a gamma model to
model Zi , then for an AIC selection we need to evaluate also the log-normal model
on that scale. This can be seen from the justification in Sect. 4.2.3.
Finally, we focus on unbiasedness. Note that on the log-scale we have unbiased-
ness (5.55) through the balance property. Unfortunately, this does not carry over to
the original scale. We give a small example, where we assume that there is neither
any uncertainty about the distributional model nor about the nuisance parameter.
That is, we assume that Zi are i.i.d. log-normally distributed with parameters μ and
σ², where only μ is unknown. The MLE of μ is given by

    μ̂^MLE = (1/n) Σ_{i=1}^n log(Z_i) ∼ N(μ, σ²/n).

This implies

    (1/n) Σ_{i=1}^n E_{(μ,σ²)}[ E_{(μ̂^MLE, σ²)}[Z_i] ] = E_{(μ,σ²)}[ exp{μ̂^MLE} ] exp{σ²/2}
                                                      = exp{ μ + (1 + n^{−1}) σ²/2 }
                                                      > exp{ μ + σ²/2 } = (1/n) Σ_{i=1}^n E_{(μ,σ²)}[Z_i].
If we additionally plug in an estimate σ̃² for the nuisance parameter σ² (treating σ̃²
as fixed), we receive

    (1/n) Σ_{i=1}^n E_{(μ,σ²)}[ E_{(μ̂^MLE, σ̃²)}[Z_i] ] = E_{(μ,σ²)}[ exp{μ̂^MLE} ] exp{σ̃²/2}
                                                      = exp{ μ + σ²/(2n) + σ̃²/2 }
                                                      < exp{ μ + σ²/2 } = (1/n) Σ_{i=1}^n E_{(μ,σ²)}[Z_i],

where the last inequality holds for σ̃² < (1 − n^{−1}) σ².
This shows that working on the log-scale is rather difficult because the back-
transformation is far from being trivial, and for unknown nuisance parameter not
even the sign of the bias is clear. Similar considerations apply to the frequently used
Box–Cox transformation [48] for χ ≠ 0

    Z_i ↦ Y_i = (Z_i^χ − 1)/χ.
5.4 Quasi-Likelihoods
    ∇_μ ℓ_Y(μ) = (1/ϕ) V(μ)^{−1} (Y − μ).
In case of a diagonal variance function V (μ) this relates to the score (5.9). The
remaining step is to model the mean parameter μ = μ(β) ∈ Rn as a function of a
lower dimensional regression parameter β ∈ Rq+1 , we also refer to Fig. 5.2. For
this last step we assume that the Jacobian B ∈ Rn×(q+1) of dμ/dβ has full rank
q + 1. The score equations for β and given observations Y then read as
    (1/ϕ) B⊤ V(μ(β))^{−1} (Y − μ(β)) = 0.
This is of exactly the same structure as the score equations in Proposition 5.1, and
the roots are found by using the IRLS algorithm for t ≥ 0, see (5.12),
    β̂^(t) → β̂^(t+1) = ( B⊤ V(μ̂^(t))^{−1} B )^{−1} B⊤ V(μ̂^(t))^{−1} ( B β̂^(t) + Y − μ̂^(t) ),

where μ̂^(t) = μ(β̂^(t)).
We conclude with the following points about quasi-likelihoods:
• For regression parameter estimation within the quasi-likelihood framework it
is sufficient to know the structure of the first two moments μ(β) ∈ Rn and
V (μ) ∈ Rn×n as well as the score equations. Thus, we do not need to explicitly
specify a distributional family for the observations Y . This structure of the first
two moments is then sufficient for their estimation using the IRLS algorithm, i.e.,
we receive the predictors within this framework.
• Since we do not specify the full distribution of Y we can neither simulate from
this model nor can we calculate quantities where the full log-likelihood of the
model needs to be known. For example, we cannot calculate AIC in a quasi-
likelihood model.
• The quasi-likelihood model is characterized by the functional forms of μ(β) and
V (μ). The former plays the role of the link function and the linear predictor in the
GLM, and the latter plays the role of the variance function within the EDF which
is characterized through the cumulant function κ. For instance, if we assume to
have a diagonal matrix V(μ) = diag(V(μ_1), …, V(μ_n)), then the choice of the
variance function μ ↦ V(μ) describes the explicit selection of the quasi-likelihood
model. If we choose the power variance function V(μ) = μ^p, p ∈ (0, 1), we have a
quasi-Tweedie model.
• For prediction uncertainty evaluation we also need an estimate of the dispersion
parameter ϕ > 0. Since we do not know the full likelihood in this approach,
Pearson’s estimate ϕ P is the only option we have to estimate ϕ.
• For asymptotic normality results and hypothesis testing within the quasi-
likelihood framework we refer to Section 8.4 of McCullagh–Nelder [265].
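To make the quasi-likelihood mechanics concrete, the following numpy sketch runs the IRLS recursion for a log-link model with power variance function V(μ) = μ^p; everything here (the data, the choice p = 1.5, the sample size) is an illustrative assumption, not a model from the text:

```python
import numpy as np

def quasi_irls(X, y, p=1.5, iters=50):
    # IRLS for a quasi-likelihood model with log link mu = exp(X beta)
    # and power variance function V(mu) = mu^p (a quasi-Tweedie model);
    # only the first two moments enter the fitting procedure
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())           # start at the null model
    for _ in range(iters):
        eta = X @ beta
        mu = np.exp(eta)
        w = mu ** (2.0 - p)              # (d mu/d eta)^2 / V(mu)
        z = eta + (y - mu) / mu          # working responses
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ z)
    return beta

# hypothetical positive responses with a log-linear mean
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
mu_true = np.exp(0.5 + 0.8 * x)
y = rng.gamma(shape=5.0, scale=mu_true / 5.0)
beta_hat = quasi_irls(X, y, p=1.5)
```

The fixed point of this recursion solves the quasi-score equation B⊤V(μ)^{−1}(Y − μ) = 0, regardless of whether an EDF with this variance function exists; on the simulated data `beta_hat` lands close to the true values (0.5, 0.8).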
In the derivations above we have treated the dispersion parameter ϕ in the GLM as
a nuisance parameter. In the case of a homogeneous dispersion parameter it can be
canceled in the score equations for MLE, see (5.9). Therefore, it does not influence
MLE, and in a subsequent step this nuisance parameter can still be estimated
using, e.g., Pearson’s or deviance residuals, see Sect. 5.3.1 and Remark 5.26. In
some examples we may have systematic effects in the dispersion parameter, too.
In this case the above approach will not work because a heterogeneous dispersion
parameter no longer cancels in the score equations. This has been considered in
Smyth [341] and Smyth–Verbyla [343]. The heterogeneous dispersion situation is
of general interest for GLMs, and it is of particular interest for Tweedie’s CP GLM
if we interpret Tweedie’s distribution [358] as a CP model with i.i.d. gamma claim
sizes, see Proposition 2.17; we also refer to Jørgensen–de Souza [204], Smyth–
Jørgensen [342] and Delong et al. [94].
We extend model assumption (5.1) by assuming that also the dispersion parameter
ϕi is policy i dependent. Assume that all random variables Yi are independent and
have densities w.r.t. a σ -finite measure ν on R given by
    Y_i ∼ f(y_i; θ_i, v_i/ϕ_i) = exp{ (y_i θ_i − κ(θ_i)) / (ϕ_i/v_i) + a(y_i; v_i/ϕ_i) },

giving log-likelihood

    β ↦ ℓ_Y(β) = Σ_{i=1}^n (v_i/ϕ_i) [ Y_i h(μ(x_i)) − κ(h(μ(x_i))) ] + a(Y_i; v_i/ϕ_i),
with canonical link h = (κ′)^{−1}. The difference to (5.7) is that the dispersion
parameter ϕi now depends on the insurance policy which requires additional
modeling. We choose a second strictly monotone and smooth link function g_ϕ, and
for a regression parameter γ we assume the dispersion model

    g_ϕ(ϕ_i) = ⟨γ, z_i⟩,        (5.57)

5.5 Double Generalized Linear Model 183
where zi is the feature of policy i, which may potentially differ from x i . The
rationale behind this different feature is that different information might be relevant
for modeling the dispersion parameter, or feature information might be differently
pre-processed compared to the response Yi . We now need to estimate two regression
parameters β and γ in this approach on possibly differently pre-processed feature
information x i and zi of policy i. In general, this is not easily doable because the
term a(Yi ; vi /ϕi ) of the log-likelihood of Yi may have a complicated structure (or
may not be available in closed form like in Tweedie’s CP model).
We reformulate the EDF density using the unit deviance d(Y, μ) defined in (2.25);
˚ for the canonical link
we drop the lower index i for the moment. Set θ = h(μ) ∈
h, then
v
f (y; θ, v/ϕ) = exp [yh(μ) − κ(h(μ))] + a(y; v/ϕ)
ϕ
v 1
= exp [yh(y) − κ(h(y))] + a(y; v/ϕ) exp − d(y, μ)
ϕ 2ϕ/v
ω
def. ∗
= a (y; ω) exp − d(y, μ) , (5.58)
2
with ω = v/ϕ ∈ W. This corresponds to (2.27), and it brings the EDF density into
a Gaussian-looking form. A general difficulty is that the term a ∗ (y; ω) may have a
complicated structure or may not be given in closed form. Therefore, we consider
its saddlepoint approximation; this is based on Section 3.5 of Jørgensen [203].
Suppose that we are in the absolutely continuous EDF case and that κ is steep.
In that case Y ∈ M, a.s., and the variance function y → V (y) is well-defined for
all observations Y = y, a.s. Based on Daniels [87], Barndorff-Nielsen–Cox [24]
proved the following statement, see Theorem 3.10 in Jørgensen [203]: assume there
exists ω0 ∈ W such that for all ω > ω0 the density (5.58) is bounded. Then, the
following saddlepoint approximation is uniform on compact subsets of the support
T of Y
−1/2
2πϕ 1
f (y; θ, v/ϕ) = V (y) exp − d(y, μ) (1 + O(ϕ/v)) ,
v 2ϕ/v
(5.59)
for sufficiently large volumes v; at the same time, the approximation does not affect the
unit deviance d(y, μ), preserving the estimation properties of μ. The discrete counterpart
is given in Theorem 3.11 of Jørgensen [203].
Using saddlepoint approximation (5.59) we receive an approximate log-
likelihood function

    ℓ_Y(μ, ϕ) ≈ − (1/2) [ ϕ^{−1} v d(Y, μ) + log(ϕ) + log( (2π/v) V(Y) ) ].

This approximation has an attractive form for dispersion estimation because it gives
an approximate EDF for observation d := v d(Y, μ), for given μ. Namely, for
canonical parameter φ = −ϕ^{−1} < 0 we have approximation

    ℓ_Y(μ, φ) ≈ ( dφ − (− log(−φ)) ) / 2 − (1/2) log( (2π/v) V(Y) ).        (5.60)
The right-hand side has the structure of a gamma EDF for observation d with
canonical parameter φ < 0, cumulant function κϕ (φ) = − log(−φ) and dispersion
parameter 2. Thus, we have the structure of an approximate gamma model on the
right-hand side of (5.60) with, for given μ,

    E_φ[d | μ] ≈ κ_ϕ′(φ) = −1/φ = ϕ,        (5.61)

    Var_φ(d | μ) ≈ 2 κ_ϕ″(φ) = 2/φ² = 2ϕ².        (5.62)
These statements say that for given μ and assuming that the saddlepoint approx-
imation is sufficiently accurate, d is approximately gamma distributed with shape
parameter 1/2 and canonical parameter φ (which relates to the dispersion ϕ in the
mean parametrization). Thus, we can estimate φ and ϕ, respectively, with a (second)
GLM from (5.60), for given mean parameter μ.
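Statements (5.61)–(5.62) can be checked by simulation: for a gamma response with mean μ and dispersion ϕ/v, the scaled deviance d = v d(Y, μ) should have mean ≈ ϕ and variance ≈ 2ϕ². A stdlib-only Python sketch, with all parameter values arbitrary illustrative choices:

```python
import math, random

def gamma_unit_deviance(y, mu):
    # gamma unit deviance d(y, mu) = 2((y - mu)/mu - log(y/mu))
    return 2.0 * ((y - mu) / mu - math.log(y / mu))

random.seed(42)
phi, v, mu = 0.1, 1.0, 2.0
shape = v / phi                          # omega = v/phi
sample = [v * gamma_unit_deviance(random.gammavariate(shape, mu / shape), mu)
          for _ in range(200_000)]
mean_d = sum(sample) / len(sample)
var_d = sum((s - mean_d) ** 2 for s in sample) / len(sample)
# mean_d should be close to phi = 0.1, and var_d close to 2 phi^2 = 0.02
```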
Remarks 5.28
• The accuracy of the saddlepoint approximation is discussed in Section 3.2 of
Smyth–Verbyla [343]. The saddlepoint approximation is exact in the Gaussian
and the inverse Gaussian case. In the Gaussian case, we have the exact log-likelihood

    ℓ_Y(μ, φ) = ( dφ − (− log(−φ)) ) / 2 − (1/2) log( 2π/v ),

since V(Y) = 1, and in the inverse Gaussian case the same exact form holds
with variance function V(Y) = Y³. Thus, in the Gaussian case and in the inverse
Gaussian case we have a gamma model for d with mean ϕ and shape parameter
1/2, for given μ; for a related result we also refer to Theorem 3 of Blæsild–Jensen
[38]. For Tweedie's models with p ≥ 1, one can show that the relative error of the
saddlepoint approximation is a non-increasing function of the squared coefficient
of variation τ = (ϕ/v) V(y)/y² = (ϕ/v) y^{p−2}, leading to small approximation errors
if ϕ/v is sufficiently small; typically one requires τ < 1/3, see Section 3.2 of
Smyth–Verbyla [343].
• The saddlepoint approximation itself does not provide a density because in gen-
eral the term O(ϕ/v) in (5.59) is non-zero. Nelder–Pregibon [282] renormalized
the saddlepoint approximation to a proper density and studied its properties.
• In the gamma EDF case, the saddlepoint approximation would not be necessary
because this case can still be solved in closed form. In fact, in the gamma EDF
case we have log-likelihood, set φ = −v/ϕ < 0,

    ℓ_Y(μ, φ) = ( φ d(Y, μ) − χ(φ) ) / 2 − log Y,        (5.63)

with χ(φ) = 2( log Γ(−φ) + φ log(−φ) − φ ). For given μ, this is an EDF
for d(Y, μ) with cumulant function χ on the effective domain (−∞, 0). This
provides us with expected value and variance

    E_φ[d(Y, μ) | μ] = χ′(φ) = 2( −ψ(−φ) + log(−φ) ) ≈ −1/φ,

    Var_φ(d(Y, μ) | μ) = 2χ″(φ) = 4( ψ′(−φ) − 1/(−φ) ),

with digamma function ψ, and the approximation exactly refers to the saddlepoint
approximation; for the variance statement we also refer to Fisher's
information (2.30). For receiving more accurate mean approximations one can
consider higher order terms, e.g., the second order approximation is χ′(φ) ≈
−1/φ + 1/(6φ²). In fact, from the saddlepoint approximation (5.60) and from
the exact formula (5.63) we receive in the gamma case Stirling's formula

    Γ(γ) ≈ √(2π) γ^{γ−1/2} e^{−γ}.
In the subsequent examples we will just use the saddlepoint approximation also
in the gamma EDF case.
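The quality of the first order approximation χ′(φ) ≈ −1/φ and of its second order refinement can be checked numerically (stdlib only, digamma again via a central difference of `math.lgamma`; the test point φ = −10 is an arbitrary choice):

```python
import math

def digamma(x, h=1e-6):
    # numerical digamma via central difference of math.lgamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def chi_prime(phi):
    # exact mean of the deviance in the gamma case: 2(log(-phi) - psi(-phi))
    return 2.0 * (math.log(-phi) - digamma(-phi))

phi = -10.0                                   # arbitrary test point, phi = -v/dispersion
exact = chi_prime(phi)
first = -1.0 / phi                            # saddlepoint (first order) approximation
second = -1.0 / phi + 1.0 / (6.0 * phi ** 2)  # second order refinement
print(abs(exact - first) > abs(exact - second))   # second order is closer
```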
The saddlepoint approximation (5.60) proposes to alternate MLE of β for the mean
model (5.56) and of γ for the dispersion model (5.57). Fisher’s information matrix
of the saddlepoint approximation (5.60) w.r.t. the canonical parameters θ and φ is
given by
    I(θ, φ) = −E_{θ,φ} [ φ v κ″(θ)          −v (Y − κ′(θ))
                          −v (Y − κ′(θ))    −(1/2) φ^{−2} ]

            = [ (v/ϕ(φ)) V(μ(θ))    0
                0                   (1/2) V_ϕ(ϕ(φ)) ],
with variance function Vϕ (ϕ) = ϕ 2 , and emphasizing that we work in the canonical
parametrization (θ, φ). This is a positive definite diagonal matrix which suggests
that the algorithm alternating the β and γ estimations will have a fast convergence.
For fixed estimate γ̂ we calculate the estimated dispersion parameters ϕ̂_i = g_ϕ^{−1}⟨γ̂, z_i⟩
of policies 1 ≤ i ≤ n, see (5.57). These then allow us to calculate diagonal working
weight matrix

    W(β) = diag( ( ∂g(μ_i)/∂μ_i )^{−2} (v_i/ϕ̂_i) (1/V(μ_i)) )_{1≤i≤n} ∈ R^{n×n},
Fisher's scoring method (5.12) iterates for s ≥ 0 the following recursion to receive γ̂

    γ̂^(s) → γ̂^(s+1) = ( Z⊤ W_ϕ(γ̂^(s)) Z )^{−1} Z⊤ W_ϕ(γ̂^(s)) ( Z γ̂^(s) + R_ϕ(d, γ̂^(s)) ),        (5.64)

where Z = (z_1, …, z_n)⊤ is the design matrix used to estimate the dispersion
parameters.
We revisit the Swedish motorcycle claim size data studied in Sect. 5.3.7. We expand
the gamma claim size GLM to a double GLM also modeling the systematic effects
in the dispersion parameter. In a first step we need to change the parametrization of
the gamma model of Sect. 5.3.7. In the former section we have modeled the average
claim size S_i/n_i ∼ Γ(n_i α_i, n_i c_i), but for applying the saddlepoint approximation
we should use the reproductive form (5.44) of the gamma model. We therefore set

    Y_i = S_i/(n_i α_i).        (5.65)

The reason for the different parametrization in Sect. 5.3.7 has been that (5.65) is not
directly useful if α_i is unknown because in that case the observations Y_i cannot be
calculated. In this section we estimate ϕi = 1/αi which allows us to model (5.65);
a different treatment within Tweedie’s family is presented in Sect. 11.1.3. The only
difficulty is to initialize the double GLM algorithm. We proceed as follows.
(0) In an initial step we assume constant dispersion ϕ_i = 1/α_i ≡ 1/α = 1. This
    gives us exactly the mean estimates of Sect. 5.3.7 for S_i/n_i ∼ Γ(n_i α, n_i c_i);
    note that for constant shape parameter α the mean of S_i/n_i can be estimated
    without explicit knowledge of α (because it cancels in the score equations).
    Using these mean estimates we calculate the MLE α̂^(0) of the (constant) shape
    parameter α, see Remark 5.26. This then allows us to determine the (scaled)
    observations Y_i^(1) = S_i/(n_i α̂^(0)) and we initialize ϕ̂_i^(0) = 1/α̂^(0).
(1) Iterate for t ≥ 1:
    – estimate the mean μ_i of Y_i using the mean GLM (5.56) based on the
      observations Y_i^(t) and the dispersion estimates ϕ̂_i^(t−1). This provides us
      with μ̂_i^(t);
    – based on the deviances d_i^(t) = v_i d(Y_i^(t), μ̂_i^(t)), calculate the updated
      dispersion estimates ϕ̂_i^(t) using the dispersion GLM (5.57) and the residual
      MLE iteration (5.64) with the saddlepoint approximation. Set for the updated
      observations Y_i^(t+1) = S_i ϕ̂_i^(t)/n_i.
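The alternating scheme above can be sketched compactly in numpy. The sketch below is simplified (it models Y_i = S_i/n_i directly with weights v_i = n_i and skips the rescaling of the observations by ϕ̂), and the data, link choices and parameter values are illustrative assumptions, not the book's model:

```python
import numpy as np

def irls_log_link(X, y, w, beta, iters=20):
    # gamma-type IRLS with log link and weights w
    # (fixed point solves sum_i w_i x_i (y_i - mu_i)/mu_i = 0)
    for _ in range(iters):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu          # working responses
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ z)
    return beta

def double_glm(X, Z, y, v, iters=10):
    # alternate the mean GLM (5.56) and the dispersion GLM (5.57);
    # by (5.61) the deviances d_i have approximate mean phi_i, so they
    # serve as (approximate) gamma responses of the dispersion GLM
    beta = np.zeros(X.shape[1]); beta[0] = np.log(y.mean())
    gamma = np.zeros(Z.shape[1])         # start from phi_i = 1
    for _ in range(iters):
        phi = np.exp(Z @ gamma)
        beta = irls_log_link(X, y, v / phi, beta)       # weights v_i/phi_i
        mu = np.exp(X @ beta)
        d = v * 2.0 * ((y - mu) / mu - np.log(y / mu))  # deviances d_i
        gamma = irls_log_link(Z, d, np.full(len(d), 0.5), gamma)
    return beta, gamma

# simulated gamma data with heterogeneous dispersion (all values hypothetical)
rng = np.random.default_rng(7)
n = 3000
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x]); Z = X
mu_true = np.exp(1.0 + 0.5 * x)
phi_true = np.exp(np.log(0.2) + 0.6 * x)
y = rng.gamma(shape=1.0 / phi_true, scale=mu_true * phi_true)
beta_hat, gamma_hat = double_glm(X, Z, y, np.ones(n))
```

On this simulated data `beta_hat` and `gamma_hat` land close to the true values (1.0, 0.5) and (log 0.2, 0.6), up to the small saddlepoint bias discussed above.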
Table 5.15 Number of parameters, AICs, Pearson’s dispersion estimate, in-sample losses, tenfold
cross-validation losses and the in-sample average claim amounts of the null model (gamma
intercept model) and the (double) gamma GLM
                    # Param.   AIC      Dispersion est. ϕ̂^P   In-sample loss on L   Tenfold CV loss D^CV   Average amount
 Gamma null         1+1        14'416   2.057                  2.085                 2.091                  24'641
 Gamma GLM2         7+1        14'274   1.544                  1.719                 1.747                  25'130
 Double gamma GLM   7+6        14'258   –                      (1.721)               –                      26'413
In an initial double GLM analysis we use the feature information zi = x i for the
dispersion ϕi modeling (5.57). We choose for both GLMs the log-link which leads to
concave maximization problems, see Example 5.5. Running the above double GLM
algorithm converges in 4 iterations, and analyzing the resulting model we observe
that we should drop the variable RiskClass from the feature zi . We then run the
same double GLM algorithm with the feature information x i and the new zi again,
and the results are presented in Table 5.15.
The considered double GLM has parameter dimensions β ∈ R7 and γ ∈ R6 . To
have comparability with AIC of Sect. 5.3.7, we evaluate AIC of the double GLM
in the observations Si /ni (and not in Yi ; i.e., similar to the gamma GLM). We
observe that it has an improved AIC value compared to model Gamma GLM2.
Thus, indeed, dispersion modeling seems necessary in this example (under the
GLM2 regression structure). We do not calculate in-sample and cross-validation
losses in the double GLM because in the other two models of Table 5.15 we have
set ϕ = 1 in these statistics. However, the in-sample loss of model Gamma GLM2
with ϕ = 1 corresponds to the (homogeneous) deviance dispersion estimate (up to
scaling n/(n − (q + 1))), and this in-sample loss of 1.719 can directly be compared
to the average estimated dispersion m^{−1} Σ_{i=1}^m ϕ̂_i = 1.721 (in round brackets in
Table 5.15). On the downside, the double GLM has a bigger bias which needs an
adjustment.
In Fig. 5.13 (lhs) we give the normal plots of model Gamma GLM2 and the
double gamma GLM model. This plot is received by transforming the observations
to normal quantiles using the corresponding estimated gamma models. We see
quite some similarity between the two estimated gamma models. Both models
seem to have similar deficiencies, i.e., dispersion modeling improves explanation
of observations, however, either the regression function or the gamma distributional
assumption does not fully fit the data, especially for small claims. Finally, in
Fig. 5.13 (rhs) we plot the estimated dispersion parameters ϕi against the logged
estimated means log( μi ) (linear predictors). We observe that the estimated disper-
sion has a (weak) U-shape as a function of the expected claim sizes which indicates
that the tails cannot fully be captured by our model. This closes this example.
Remark 5.29 For the dispersion estimation ϕi we use as observations the deviances
di = vi d (Yi ,
μi ), 1 ≤ i ≤ n. On a finite sample, these deviances are typically
biased due to the use of the estimated means
μi . Smyth–Verbyla [343] propose the
5.5 Double Generalized Linear Model 189
Fig. 5.13 (lhs) Normal plot of the fitted models Gamma GLM2 and double GLM, (rhs) estimated
dispersion parameters ϕ̂_i against the logged estimated means log(μ̂_i) (the orange line gives the
in-sample loss in model Gamma GLM2)
This implies for the two working weight matrices of the double GLM

    W(β) = diag( (v_i/ϕ_i) (1/V(μ_i)) μ_i² )_{1≤i≤n} = diag( (v_i/ϕ_i) μ_i^{2−p} )_{1≤i≤n},

    W_ϕ(γ) = diag( (1/2) (1/V_ϕ(ϕ_i)) ϕ_i² )_{1≤i≤n} = diag(1/2, …, 1/2).
and these deviances could still be de-biased, see Remark 5.29. The working
responses for the two GLMs are
The drawback of this approach is that it only considers the (scaled) total claim
amounts Yi = Si ϕi /vi as observations, see Proposition 2.17. These total claim
amounts consist of the number of claims N_i and i.i.d. individual claim sizes
Z_{i,j} ∼ Γ(α, c_i), supposed N_i ≥ 1. Having observations of both claim amounts
Si and claim counts Ni allows one to build a Poisson GLM for claim counts and
a gamma GLM for claim sizes which can be estimated separately. This has also
been the reason of Smyth–Jørgensen [342] to enhance Tweedie’s model estimation
for known claim counts in their Section 4. Moreover, in Theorem 4 of Delong et
al. [94] it is proved that the two GLM approaches can be identified under log-link
choices.
In our examples we have studied several figures like AIC, cross-validation losses,
etc., for model and parameter selection. Moreover, we have plotted the results, for
instance, using the Tukey–Anscombe plot or the QQ plot. Of course, there are
numerous other plots and tools that can help us to analyze the results and to improve
the resulting models. We present some of these in this section.
The MLE β̂^MLE satisfies at convergence of the IRLS algorithm, see (5.12),

    β̂^MLE = ( X⊤ W(β̂^MLE) X )^{−1} X⊤ W(β̂^MLE) [ X β̂^MLE + R(Y, β̂^MLE) ],
5.6 Diagnostic Tools 191
Following Section 4.2.2 of Fahrmeir–Tutz [123], this allows us to define the so-
called hat matrix, see also Remark 5.29,
    H = H(β̂^MLE) = W(β̂^MLE)^{1/2} X ( X⊤ W(β̂^MLE) X )^{−1} X⊤ W(β̂^MLE)^{1/2} ∈ R^{n×n},        (5.66)
recall that the working weight matrix W (β) is diagonal. The hat matrix H is
symmetric and idempotent, i.e. H 2 = H , with trace(H ) = rank(H ) = q + 1.
Therefore, H acts as a projection, mapping the working observations

    Ỹ := W(β̂^MLE)^{1/2} [ X β̂^MLE + R(Y, β̂^MLE) ]

to the fitted values

    Ỹ ↦ H Ỹ = W(β̂^MLE)^{1/2} X β̂^MLE = W(β̂^MLE)^{1/2} η̂,
the latter being the fitted linear predictors. The diagonal elements h_{i,i} of this hat
matrix H satisfy 0 ≤ h_{i,i} ≤ 1, and values close to 1 correspond to extreme data
points i; in particular, for h_{i,i} = 1 only observation Ỹ_i influences η̂_i, whereas for
h_{i,i} = 0 observation Ỹ_i has no influence on η̂_i.
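The stated properties of H are easy to verify numerically for a random (purely hypothetical) design matrix and diagonal working weight matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, q1 = 50, 4                              # q + 1 = 4 columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, q1 - 1))])
w = rng.uniform(0.5, 2.0, size=n)          # hypothetical diagonal working weights

Xw = np.sqrt(w)[:, None] * X               # W^{1/2} X
H = Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T)  # hat matrix (5.66)

assert np.allclose(H @ H, H)               # idempotent
assert abs(np.trace(H) - q1) < 1e-10       # trace(H) = rank(H) = q + 1
assert np.all(np.diag(H) >= -1e-12) and np.all(np.diag(H) <= 1 + 1e-12)
```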
Figure 5.14 gives the resulting hat matrices of the double gamma GLM of
Sect. 5.5.4. On the left-hand side we show the diagonal entries hi,i of the claim
Fig. 5.14 Diagonal entries hi,i of the two hat matrices of the example in Sect. 5.5.4: (lhs) for
means μi and responses Yi , and (rhs) for dispersions
ϕi and responses di
amount responses Yi (for the estimation of μi ), and on the right-hand side the
corresponding plots for the deviance responses di (for the estimation of ϕi ). These
diagonal elements hi,i are ordered on the x-axis w.r.t. the linear predictors
ηi . From
this figure we conclude that the diagonal entries of the hat matrices are bigger for
very small responses in our example, and the dispersion plot has a couple of more
special observations that may require further analysis.
where all lower indices $(-i)$ indicate that we drop the corresponding row and/or column from the matrices and vectors, and where $\widetilde{Y}$ has been defined in the previous subsection. This allows us to compare $\widehat{\beta}^{\rm MLE}$ and $\widehat{\beta}^{(1)}_{(-i)}$ to analyze the influence of observation $Y_i$.
To reformulate this approximation, we come back to the hat matrix $H = H\big(\widehat{\beta}^{\rm MLE}\big) = (h_{i,j})_{1\le i,j\le n}$ defined in (5.66). It fulfills
$$W\big(\widehat{\beta}^{\rm MLE}\big)^{1/2} X\widehat{\beta}^{\rm MLE} = H\widetilde{Y} = \left(\sum_{j=1}^n h_{1,j}\widetilde{Y}_j,\ \dots,\ \sum_{j=1}^n h_{n,j}\widetilde{Y}_j\right)^\top \in \mathbb{R}^n.$$
Thus, for predicting $Y_i$ we can consider the linear predictor (for the chosen link $g$)
$$\widehat{\eta}_i = g(\widehat{\mu}_i) = \big\langle \widehat{\beta}^{\rm MLE}, x_i\big\rangle = \big(X\widehat{\beta}^{\rm MLE}\big)_i = W_{i,i}\big(\widehat{\beta}^{\rm MLE}\big)^{-1/2}\sum_{j=1}^n h_{i,j}\widetilde{Y}_j.$$
This allows one to efficiently calculate a leave-one-out prediction using the hat matrix $H$. This also motivates to study the generalized cross-validation (GCV) loss which is an approximation to leave-one-out cross-validation, see Sect. 4.2.2,
$$D^{\rm GCV} = \frac{1}{n}\sum_{i=1}^n \frac{v_i}{\varphi}\, d\Big(Y_i,\ g^{-1}\big(\widehat{\eta}_i^{(-i,1)}\big)\Big) \qquad (5.67)$$
$$\phantom{D^{\rm GCV}} = \frac{2}{n}\sum_{i=1}^n \frac{v_i}{\varphi}\Big[Y_i h(Y_i) - \kappa\big(h(Y_i)\big) - Y_i\, h\big(g^{-1}\big(\widehat{\eta}_i^{(-i,1)}\big)\big) + \kappa\Big(h\big(g^{-1}\big(\widehat{\eta}_i^{(-i,1)}\big)\big)\Big)\Big].$$
In the Gaussian case this reads as
$$D^{\rm GCV} = \frac{1}{n}\sum_{i=1}^n \frac{1}{\sigma^2}\Big(Y_i - \widehat{\eta}_i^{(-i,1)}\Big)^2,$$
with leave-one-out predictor
$$\widehat{\eta}_i^{(-i,1)} = \big\langle \widehat{\beta}^{(1)}_{(-i)}, x_i\big\rangle = \sum_{j=1,\,j\neq i}^n \frac{h_{i,j}}{1-h_{i,i}}\,Y_j = \frac{1}{1-h_{i,i}}\,\widehat{\eta}_i - \frac{h_{i,i}}{1-h_{i,i}}\,Y_i.$$
This gives us
$$D^{\rm GCV} = \frac{1}{n}\sum_{i=1}^n \frac{1}{\sigma^2}\left(\frac{Y_i - \widehat{\eta}_i}{1-h_{i,i}}\right)^2.$$
The generalized cross-validation loss is used, for instance, for generalized additive model (GAM) fitting where an efficient and fast cross-validation method is
required to select regularization parameters. Generalized cross-validation has been introduced by Craven–Wahba [84], but these authors replaced $h_{i,i}$ by $\sum_{j=1}^n h_{j,j}/n$. It holds that $\sum_{j=1}^n h_{j,j} = \mathrm{trace}(H) = q+1$, thus, using this approximation we receive
$$D^{\rm GCV} \approx \frac{1}{n}\sum_{i=1}^n \frac{1}{\sigma^2}\left(\frac{Y_i - \widehat{\eta}_i}{1 - \sum_{j=1}^n h_{j,j}/n}\right)^2 = \frac{n}{\big(n-(q+1)\big)^2\,\sigma^2}\sum_{i=1}^n \big(Y_i - \widehat{\eta}_i\big)^2 = \frac{n}{n-(q+1)}\,\frac{\widehat{\varphi}^{P}}{\sigma^2},$$
with $\widehat{\varphi}^{P}$ being Pearson's dispersion estimate in the Gaussian model, see (5.30).
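The leave-one-out shortcut via the hat matrix can be verified numerically. The following sketch (simulated Gaussian data, illustrative only) compares the shortcut residuals $(Y_i-\widehat{\eta}_i)/(1-h_{i,i})$ with explicit refits that drop one observation at a time:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.5, size=n)

beta = np.linalg.lstsq(X, Y, rcond=None)[0]
eta = X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # hat matrix diagonal

# shortcut leave-one-out residuals via the hat matrix
loo_shortcut = (Y - eta) / (1 - h)

# explicit leave-one-out refits
loo_exact = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], Y[mask], rcond=None)[0]
    loo_exact[i] = Y[i] - X[i] @ b_i

print(np.allclose(loo_shortcut, loo_exact))  # → True
```

In the Gaussian case this identity is exact; for other EDF members the hat-matrix shortcut is only an approximation, which is precisely what (5.67) exploits.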
We give a numerical example based on the gamma GLM for the claim sizes
studied in Sect. 5.3.7.
Example 5.31 (Leave-One-Out Cross-Validation) The aim of this example is to compare the generalized cross-validation loss $D^{\rm GCV}$ to the leave-one-out cross-validation loss $D^{\rm loo}$, see (4.34), the former being an approximation to the latter. We do this for the gamma claim size model studied in Sect. 5.3.7. In this example it is feasible to exactly calculate the leave-one-out cross-validation loss because we have only 656 claims.
The results are presented in Table 5.16. Firstly, the different cross-validation
losses confirm that the model slightly (in-sample) over-fits to the data, which is
not a surprise when estimating 7 regression parameters based on 656 observations.
Secondly, the cross-validation losses provide similar numbers, with leave-one-out being slightly bigger than tenfold cross-validation, here. Thirdly, the generalized cross-validation loss $D^{\rm GCV}$ manages to approximate the leave-one-out cross-validation loss $D^{\rm loo}$ very well in this example.
Table 5.17 gives the corresponding results for model Poisson GLM1 of
Sect. 5.2.4. Firstly, in this example with 610’206 observations it is not feasible
to calculate the leave-one-out cross-validation loss (for computational reasons).
Therefore, we rely on the generalized cross-validation loss as an approximation.
From the results of Table 5.17 it seems that this approximation (rather) underestimates the loss (compared to tenfold cross-validation). Indeed, we have made this observation also in other examples.
The reader will have noticed that the discussion of GLMs in this chapter has
been focusing on the single-parameter linear EDF case (5.1). In many actuarial
applications we also want to study examples of the vector-valued parameter
EF (2.2). We briefly discuss the categorical case since this case is frequently used.
thus, $k+1$ has been chosen as reference level. For the canonical parameter $\theta = (\theta_1,\dots,\theta_k)^\top \in \Theta = \mathbb{R}^k$ we have cumulant function and mean functional, respectively,
$$\kappa(\theta) = \log\left(1+\sum_{j=1}^k e^{\theta_j}\right), \qquad p = \mathbb{E}_\theta[T(Y)] = \nabla_\theta \kappa(\theta) = \frac{e^{\theta}}{1+\sum_{j=1}^k e^{\theta_j}}.$$
Choosing linear predictors $\theta_l = \langle \beta_l, x\rangle$ for the levels $1\le l\le k$ gives the logistic regression probabilities
$$x \mapsto p_l = p_l(x) = \mathbb{P}_\beta[Y=l] = \frac{\exp\langle \beta_l, x\rangle}{1+\sum_{j=1}^k \exp\langle \beta_j, x\rangle}. \qquad (5.69)$$
Note that this naturally gives us the canonical link h which we have already derived
in Sect. 2.1.4. Define the matrix for feature x ∈ X ⊂ {1} × Rq
$$X = \begin{pmatrix} x^\top & 0 & \cdots & 0\\ 0 & x^\top & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & x^\top \end{pmatrix} \in \mathbb{R}^{k\times k(q+1)}. \qquad (5.71)$$
This gives linear predictor and canonical parameter, respectively, under the canonical link $h$
$$\theta = h(p(x)) = \eta(x) = X\beta = \big(\langle \beta_1, x\rangle,\ \dots,\ \langle \beta_k, x\rangle\big)^\top \in \Theta = \mathbb{R}^k. \qquad (5.72)$$
The log-likelihood of independent observations $Y_1,\dots,Y_n$ with corresponding matrices $X_1,\dots,X_n$ reads as
$$\beta \mapsto \ell_Y(\beta) = \sum_{i=1}^n \big\langle X_i\beta,\ T(Y_i)\big\rangle - \kappa(X_i\beta).$$
The score equations are given by
$$s(\beta, Y) = \nabla_\beta \ell_Y(\beta) = \sum_{i=1}^n X_i^\top\big[T(Y_i) - \nabla_\theta \kappa(X_i\beta)\big] = \sum_{i=1}^n X_i^\top\big[T(Y_i) - p(x_i)\big] = 0,$$
with logistic regression function (5.69) for p(x). For the score equations with
canonical link we also refer to the second case in Proposition 5.1. Next, we calculate
Fisher’s information matrix, we also refer to (3.16),
$$\mathcal{I}_n(\beta) = -\mathbb{E}_\beta\big[\nabla_\beta^2 \ell_Y(\beta)\big] = \sum_{i=1}^n X_i^\top\, \Sigma_i(\beta)\, X_i,$$
where $\Sigma_i(\beta)$ denotes the covariance matrix of $T(Y_i)$ under canonical parameter $X_i\beta$.
We rewrite the score in a similar way as in Sect. 5.1.4. This requires for general link
g(p) = η and inverse link p = g −1 (η), respectively, the following block diagonal
matrix
$$W(\beta) = \mathrm{diag}\Big(\nabla_\eta g^{-1}(\eta)\big|_{\eta=X_i\beta}\;\Sigma_i(\beta)^{-1}\;\nabla_\eta g^{-1}(\eta)\big|_{\eta=X_i\beta}\Big)_{1\le i\le n}$$
$$\phantom{W(\beta)} = \mathrm{diag}\Big(\nabla_p g(p)\big|_{p=g^{-1}(X_i\beta)}\;\Sigma_i(\beta)\;\nabla_p g(p)\big|_{p=g^{-1}(X_i\beta)}\Big)^{-1}_{1\le i\le n}, \qquad (5.73)$$
Because we work with the canonical link g = h and g −1 = ∇θ κ, we can use the
simplified block diagonal matrix
This is now exactly in the same form as in Proposition 5.1. Fisher’s scoring
method/IRLS algorithm then allows us to recursively calculate the MLE of β ∈
Rk(q+1) by
$$\beta^{(t)} \mapsto \beta^{(t+1)} = \Big(X^\top W\big(\beta^{(t)}\big)X\Big)^{-1} X^\top W\big(\beta^{(t)}\big)\Big[X\beta^{(t)} + R\big(Y, \beta^{(t)}\big)\Big].$$
At convergence, the MLE has the asymptotic approximation
$$\widehat{\beta}^{\rm MLE}_n \overset{(d)}{\approx} \mathcal{N}\big(\beta,\ \mathcal{I}_n(\beta)^{-1}\big),$$
for large sample sizes n. This allows us to apply the Wald test (5.32) for back-
ward parameter elimination. Moreover, in-sample and out-of-sample losses can
be analyzed with unit deviances coming from the categorical cross-entropy loss
function (4.19).
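The score equations of the categorical GLM can be checked numerically. The following minimal sketch (simulated data, plain gradient ascent instead of Fisher's scoring, purely illustrative) fits a categorical model under the canonical link and verifies that the score vanishes at the optimum:

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, k = 500, 2, 2                          # k + 1 = 3 levels, level k + 1 is the reference
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
labels = rng.integers(0, k + 1, size=n)      # labels 0, ..., k; label k plays the reference
T = np.eye(k + 1)[:, :k][labels]             # T(Y_i) in R^k (reference level dropped)

def probs(B):                                # B in R^{k x (q+1)}, rows are the beta_l
    E = np.exp(X @ B.T)                      # exp<beta_l, x_i>
    return E / (1.0 + E.sum(axis=1, keepdims=True))   # logistic probabilities (5.69)

B = np.zeros((k, q + 1))
for _ in range(3000):                        # plain gradient ascent on the log-likelihood
    B += 0.3 * ((T - probs(B)).T @ X) / n

score = (T - probs(B)).T @ X                 # score equations, should vanish at the MLE
print(np.abs(score).max() < 1e-5)
```

At the MLE the fitted probabilities balance the observed class indicators in every feature direction, which is the categorical analog of the balance property.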
Remarks 5.32 The above derivations have been done for the categorical distribution
under the canonical link choice. However, these considerations hold true for more
general links g within the vector-valued parameter EF. That is, the block diagonal
matrix W (β) in (5.73) and the working residuals R(Y , β) in (5.74) provide score
equations (5.75) for general vector-valued parameter EF examples, and where we
replace the categorical probability p by the mean μ = Eβ [T (Y )].
There are several special topics and tools in regression modeling that we have not
discussed, yet. Some of them will be considered in selected chapters below, and
some points are mentioned here, without going into detail.
The GLMs studied above have been considering cross-sectional data, meaning that
we have fixed one time period t and studied this time period in an isolated fashion.
Time-dependent extensions are called longitudinal or panel data. Consider a time
series of data (Yi,t , x i,t ) for policies 1 ≤ i ≤ n and time points t ≥ 1. For the
prediction of response variable Yi,t we may then regress on the individual past
for canonical parameter $\theta\in\Theta$ and $F(\cdot|\mathcal{D}_{i,t};\theta)$ being a member of the EDF. For a
GLM we choose a link function g and make the assumption
$$g\big(\mathbb{E}_\beta[Y_{i,t}\,|\,\mathcal{D}_{i,t}]\big) = \langle \beta, z_{i,t}\rangle, \qquad (5.76)$$
Such a model may be restricted by a Markov-type assumption
$$F(\cdot|\mathcal{D}_{i,t};\theta) \overset{(d)}{=} F(\cdot|Y_{i,t-1}, x_{i,t};\theta) \qquad \text{for all } t\ge 2 \text{ and } \theta\in\Theta.$$
Random effects $B_i$ extend (5.76) to
$$g\big(\mathbb{E}_\beta[Y_{i,t}\,|\,B_i, \mathcal{D}_{i,t}]\big) = \langle \beta, z_{i,t}\rangle + \langle B_i, w_{i,t}\rangle,$$
with $\sigma(\mathcal{D}_{i,t})$-measurable feature vector $w_{i,t}$. Regression parameter $\beta$ then describes the fixed systematic effects that are common over the entire portfolio $1\le i\le n$, and $B_i$ describes the policy dependent random effects (assumed to be normalized, $\mathbb{E}[B_i]=0$). Typically one assumes that $B_1,\dots,B_n$ are centered and i.i.d. Such
effects are called static random effects because they are not time-dependent, and
they may also be interpreted in a Bayesian sense.
Finally, extending these static random effects to dynamic random effects B i,t ,
t ≥ 1, leads to so-called state-space models, the linear state-space model being the
most popular example and being fitted using the Kalman filter [207].
There are several ways in which the GLM framework can be modified.
The most common modification of GLMs concerns the regression structure, namely, the scalar product in the linear predictor
$$x \mapsto g(\mu) = \eta = \langle \beta, x\rangle$$
is replaced by more general regression functions. Generalized additive models (GAMs) use the additive structure
$$x \mapsto g(\mu) = \beta_0 + \sum_{j=1}^{q} s_j(x_j), \qquad (5.78)$$
where $s_j:\mathbb{R}\to\mathbb{R}$ are natural cubic splines. Natural cubic splines $s_j$ are obtained
by concatenating cubic functions in so-called nodes. A GAM can have as many
nodes in each cubic spline sj as there are different levels xi,j in the data 1 ≤ i ≤ n.
In general, this leads to very flexible regression models, and to control in-sample
over-fitting regularization is applied, for regularization we also refer to Sect. 6.2.
Regularization requires setting a tuning parameter, and an efficient determination of
this tuning parameter uses generalized cross-validation, see Sect. 5.6. Nevertheless,
fitting GAMs can be computationally demanding: already for portfolios with 1 million policies and 20 feature components the calibration can be very slow.
Moreover, regression function (5.78) does not (directly) allow for a data driven
method of finding interactions between feature components. For these reasons, we
do not further study GAMs in this monograph.
A modification in the regression function that is able to consider interactions
between feature components is the framework of classification and regression trees
(CARTs). CARTs have been introduced by Breiman et al. [54] in 1984, and they
are still used in its original form today. Regression trees aim to partition the feature
space X into a finite number of disjoint subsets Xt , 1 ≤ t ≤ T , such that all policies
(Yi , x i ) in the same subset x i ∈ Xt satisfy a certain homogeneity property w.r.t. the
regression task (and the chosen loss function). The CART regression function is
then defined by
$$x \mapsto \mu(x) = \sum_{t=1}^{T} \widehat{\mu}_t\, 1_{\{x\in\mathcal{X}_t\}},$$
Freund [139] and Freund–Schapire [140]. Today, boosting is among the most powerful predictive regression methods; we mention the XGBoost algorithm of Chen–Guestrin [71] that has won many competitions. We will not further study CARTs and boosting in these notes because these methods also have some drawbacks. For instance, the resulting regression functions are not continuous, nor do they easily allow one to extrapolate beyond the (observed) feature space, e.g., if we have a time component. Moreover, they are more difficult to use with unstructured data such as text data. For more on CARTs and boosting in actuarial science we refer to Denuit et al. [100] and Ferrario–Hämmerli [125].
The theory above has been relying on the EDF, but, of course, we could also study
any other family of distribution functions. A clear drawback of the EDF is that
it only considers light-tailed distribution functions, i.e., distribution functions for
which the moment generating function exists around the origin. If the data is more
heavy-tailed, one may need to transform this data and then use the EDF on the
transformed data (with the drawback that one loses the balance property) or one
chooses another family of distribution functions. Transformations have already been
discussed in Remarks 2.11 and Sect. 5.3.9. Another two families of distributions that
have been studied in the actuarial literature are the generalized beta of the second
kind (GB2) distribution, see Venter [369], Frees et al. [137] and Chan et al. [66], and
inhomogeneous phase type (IHP) distributions, see Albrecher et al. [8] and Bladt
[37]. The GB2 family is a 4-parameter family, and it nests several examples such
as the gamma, the Weibull, the Pareto and the Lomax distributions, see Table B1 in
Chan et al. [66]. The density of the GB2 distribution is for $y>0$ given by
$$f(y; a, b, \alpha_1, \alpha_2) = \frac{|a|\,\big(\tfrac{y}{b}\big)^{a\alpha_1-1}}{b\, B(\alpha_1,\alpha_2)\,\Big(1+\big(\tfrac{y}{b}\big)^{a}\Big)^{\alpha_1+\alpha_2}} \qquad (5.79)$$
$$\phantom{f(y; a, b, \alpha_1, \alpha_2)} = \frac{|a|}{y\, B(\alpha_1,\alpha_2)}\left(\frac{\big(\tfrac{y}{b}\big)^{a}}{1+\big(\tfrac{y}{b}\big)^{a}}\right)^{\alpha_1}\left(\frac{1}{1+\big(\tfrac{y}{b}\big)^{a}}\right)^{\alpha_2},$$
with scale parameter $b>0$, shape parameters $a\in\mathbb{R}$ and $\alpha_1,\alpha_2>0$, and beta function
$$B(\alpha_1,\alpha_2) = \frac{\Gamma(\alpha_1)\,\Gamma(\alpha_2)}{\Gamma(\alpha_1+\alpha_2)}.$$
The beta function is the normalizing constant of the beta density on $(0,1)$,
$$f(z;\alpha_1,\alpha_2) = \frac{z^{\alpha_1-1}(1-z)^{\alpha_2-1}}{B(\alpha_1,\alpha_2)}.$$
The first moment is given by $\mathbb{E}_{a,b,\alpha_1,\alpha_2}[Y] = b\, B(\alpha_1+1/a,\ \alpha_2-1/a)/B(\alpha_1,\alpha_2)$ for $-\alpha_1 a < 1 < \alpha_2 a$. Observe that for $a>0$ the survival function of $Y$ is regularly varying with tail index $\alpha_2 a > 0$. Thus, we can model Pareto-like tails with the GB2 family; for regular variation we refer to (1.3).
As proposed in Frees et al. [137], one can introduce a regression structure for
b > 0 by choosing a log-link and setting
$$\log \mathbb{E}_{a,b,\alpha_1,\alpha_2}[Y] = \log \frac{B(\alpha_1+1/a,\ \alpha_2-1/a)}{B(\alpha_1,\alpha_2)} + \langle \beta, x\rangle.$$
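The GB2 density and the moment condition above can be validated numerically; the following sketch uses illustrative parameter values (not fitted to any data) with tail index $\alpha_2 a = 4$:

```python
import numpy as np
from scipy import integrate, special

def gb2_pdf(y, a, b, a1, a2):
    """GB2 density (5.79) with scale b > 0 and shapes a, alpha1, alpha2."""
    u = (y / b) ** a
    return abs(a) / (y * special.beta(a1, a2)) * (u / (1 + u)) ** a1 * (1 / (1 + u)) ** a2

# illustrative parameters: -alpha1*a < 1 < alpha2*a, so the mean exists
a, b, a1, a2 = 2.0, 1.0, 1.2, 2.0
total, _ = integrate.quad(lambda y: gb2_pdf(y, a, b, a1, a2), 0, np.inf)
mean_num, _ = integrate.quad(lambda y: y * gb2_pdf(y, a, b, a1, a2), 0, np.inf)
mean_cf = b * special.beta(a1 + 1 / a, a2 - 1 / a) / special.beta(a1, a2)
print(round(total, 6), np.isclose(mean_num, mean_cf, rtol=1e-6))
```

The first printed value confirms the density integrates to one; the second confirms the closed-form first moment against numerical integration.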
The GLMs introduced above aim at estimating the means μ(x) = Eθ(x) [Y ] of
random variables Y being explained by features x. Since mean estimation can
be rather sensitive in situations where we have large claims, the more robust
quantile regression has attracted some attention, recently. Quantile regression has
been introduced by Koenker–Bassett [220]. The idea is that instead of estimating
the mean μ of a random variable Y , we rather try to estimate its τ -quantile for
given $\tau\in(0,1)$. The $\tau$-quantile is given by the generalized inverse $F^{-1}(\tau)$ of the distribution function $F$ of $Y$, that is,
$$F^{-1}(\tau) = \inf\big\{y\in\mathbb{R};\ F(y)\ge \tau\big\}. \qquad (5.80)$$
Consider the pinball loss function for y ∈ C (convex closure of the support of Y )
and actions a ∈ A = R
$$(y,a) \mapsto L_\tau(y,a) = (y-a)\big(\tau - 1_{\{y-a<0\}}\big) \ge 0. \qquad (5.81)$$
$$a(F) \in \mathfrak{A}(F) = \underset{a\in\mathcal{A}}{\arg\min}\ \mathbb{E}_F\big[L_\tau(Y,a)\big].$$
Note that for the time being we do not know whether the solution to this
minimization problem is a singleton. For this reason, we state the solution (subject
to existence) as a set-valued functional A, see (4.25).
We calculate the score equation of the expected loss using the Leibniz rule
$$\frac{\partial}{\partial a}\,\mathbb{E}_F[L_\tau(Y,a)] = -(\tau-1)\int_{-\infty}^{a} dF(y) - \tau\int_{a}^{\infty} dF(y) = -(\tau-1)F(a) - \tau\big(1-F(a)\big) = F(a) - \tau \overset{!}{=} 0.$$
In fact, using the pinball loss, we have just seen that the τ -quantile is elicitable
within the class of continuous distributions, see Definition 4.18.
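That the pinball loss is minimized by the $\tau$-quantile can also be illustrated empirically; a minimal sketch with simulated gamma claims (illustrative only):

```python
import numpy as np

def pinball(y, a, tau):
    """Pinball loss L_tau(y, a) = (y - a)(tau - 1_{y - a < 0}), see (5.81)."""
    return (y - a) * (tau - (y - a < 0))

rng = np.random.default_rng(3)
Y = rng.gamma(shape=2.0, scale=1.0, size=10_000)
tau = 0.9

# minimize the empirical pinball loss over a grid of actions a
grid = np.linspace(0.0, 10.0, 2001)
risks = np.array([pinball(Y, a, tau).mean() for a in grid])
a_star = grid[risks.argmin()]

print(np.isclose(a_star, np.quantile(Y, tau), atol=0.05))  # minimizer = empirical tau-quantile
```

The empirical pinball-loss minimizer coincides (up to grid resolution) with the empirical $\tau$-quantile, which is the strict consistency property used above.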
For a more general result we need a more general definition of a (set-valued)
τ -quantile
$$Q_\tau(F) = \Big\{y\in\mathbb{R};\ \lim_{z\uparrow y} F(z) \le \tau \le F(y)\Big\}. \qquad (5.82)$$
This defines a closed interval and its lower endpoint corresponds to the generalized
inverse F −1 (τ ) given in (5.80). In complete analogy to Theorem 4.19 on the
elicitability of the mean functional, we have the following statement for the τ-quantile; this result goes back to Thomson [351] and Saerens [326].
Theorem 5.33 (Gneiting [162, Theorem 9], Without Proof) Let F be the class of
distribution functions on an interval C ⊆ R and choose quantile level τ ∈ (0, 1).
• The τ -quantile (5.82) is elicitable relative to F .
Quantile Regression
The idea behind quantile regression is that we build a regression model for the τ-quantile. Assume we have a datum (Y, x) whose conditional τ-quantile, given x ∈
{1} × Rq , can be described by the regression function
$$x \mapsto g\Big(F^{-1}_{Y|x}(\tau)\Big) = \langle \beta_\tau, x\rangle,$$
for a strictly monotone and smooth link function g : C → R, and for a regression
parameter β τ ∈ Rq+1 . The aim now is to estimate this regression parameter from
independent data (Yi , x i ), 1 ≤ i ≤ n. The pinball loss Lτ , given in (5.81), provides
us with the following optimization problem
$$\widehat{\beta}_\tau = \underset{\beta\in\mathbb{R}^{q+1}}{\arg\min}\ \sum_{i=1}^n L_\tau\Big(Y_i,\ g^{-1}\langle \beta, x_i\rangle\Big).$$
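A minimal implementation of this optimization problem (identity link $g$, simulated data, and a generic optimizer rather than the specialized linear-programming algorithms usually employed for quantile regression):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, tau = 2000, 0.75
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
Y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true tau-quantile: 1 + 0.5 z_tau + 2 x

def pinball_risk(beta):                              # empirical pinball loss, identity link
    r = Y - X @ beta
    return np.mean(r * (tau - (r < 0)))

x0 = np.linalg.lstsq(X, Y, rcond=None)[0]            # start from the least-squares fit
beta_tau = minimize(pinball_risk, x0=x0, method="Nelder-Mead").x

z = 0.6744897501960817                               # 0.75-quantile of N(0, 1)
print(np.allclose(beta_tau, [1.0 + 0.5 * z, 2.0], atol=0.15))
```

Note how only the intercept shifts relative to the mean regression line in this homoskedastic example; under heteroskedasticity the slope of the quantile line would change as well.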
We conclude from this short section that we can regress any quantity a(F ) that is
elicitable, i.e., for which a loss function exists that is strictly consistent for a(F )
on F ∈ F . For more on quantile regression we refer to the monograph of Uribe–
Guillén [361], and an interesting paper is Dimitriades et al. [106]. We will study
quantile regression within deep networks in Chap. 11.2, below.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons licence and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons licence, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons licence and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 6
Bayesian Methods, Regularization
and Expectation-Maximization
The previous chapter has been focusing on MLE of regression parameters within
GLMs. Alternatively, we could address the parameter estimation problem within a
Bayesian setting. The purpose of this chapter is to discuss the Bayesian estimation
approach. This leads us to the notion of regularization within GLMs. Bayesian
methods are also used in the Expectation-Maximization (EM) algorithm for MLE
in the case of incomplete data. For literature on Bayesian theory we recommend
Gelman et al. [157], Congdon [79], Robert [319], Bühlmann–Gisler [58] and Gilks
et al. [158]. A nice historical (non-mathematical) review of Bayesian methods is
presented in McGrayne [266]. Regularization is discussed in the book of Hastie et
al. [184], and a good reference for the EM algorithm is McLachlan–Krishnan [267].
The Bayesian estimator has been introduced in Definition 3.6. Assume that the
observation Y has independent components Yi that can be described by a GLM
with link function g and regression parameter β ∈ Rq+1 , i.e., the random variables
Yi have densities
$$Y_i \overset{\text{ind.}}{\sim} f(y;\beta,x_i,v_i/\varphi) = \exp\left\{\frac{y\,(h\circ g^{-1})\langle \beta, x_i\rangle - (\kappa\circ h\circ g^{-1})\langle \beta, x_i\rangle}{\varphi/v_i} + a(y; v_i/\varphi)\right\},$$
with canonical link h = (κ )−1 . In a Bayesian approach one models the regression
parameter β with a prior distribution1 π(β) on the parameter space Rq+1 , and the
independence assumption between the components of Y needs to be understood
1 Often, in Bayesian reasoning, distribution and density are used in an interchangeable (and not fully precise) way, and it is left to the reader to give the right meaning to π.
For the given observation Y , this allows us to calculate the posterior density of β
using Bayes’ rule
$$\pi(\beta|Y) = \frac{p(Y,\beta)}{\int p(Y,\widetilde{\beta})\, d\widetilde{\beta}} \;\propto\; \prod_{i=1}^n f(Y_i;\beta,x_i,v_i/\varphi)\ \pi(\beta), \qquad (6.2)$$
where the proportionality sign ∝ indicates that we have dropped the terms that do
not depend on β. Thus, the functional form in β of the posterior density π(β|Y )
is fully determined by the joint density p(Y , β), and the remaining term is a
normalization to obtain a proper probability distribution. In many situations, the
knowledge of the functional form of the posterior density in β is sufficient to
perform Bayesian parameter estimation, at least, numerically. We will give some
references, below.
The Bayesian estimator for $\beta$ is given by the posterior mean (supposed it exists)
$$\widehat{\beta}^{\rm Bayes} = \mathbb{E}_\pi[\beta\,|\,Y] = \int \beta\, \pi(\beta|Y)\, d\nu(\beta).$$
A new response $Y_{n+1}$ is then predicted by the posterior mean
$$\mathbb{E}\big[Y_{n+1}\,\big|\,Y\big] = \mathbb{E}_\pi\Big[g^{-1}\langle \beta, x_{n+1}\rangle\,\Big|\,Y\Big],$$
supposed that this first moment exists and that $x_{n+1}$ is the feature of $Y_{n+1}$. We see
that it all boils down to having sufficiently explicit knowledge of the posterior density $\pi(\beta|Y)$ given in (6.2).
Remark 6.1 (Conditional MSEP) Based on the assumption that the posterior distri-
bution π(β|Y ) can be determined, we can analyze the GL. In a Bayesian setup one
usually does not calculate the MSEP as described in Theorem 4.1, but one rather
studies the conditional MSEP, conditioned exactly on the collected information Y .
That is,
$$\mathbb{E}_\pi\Big[\big(Y_{n+1} - \mathbb{E}_\pi[Y_{n+1}\,|\,Y]\big)^2\,\Big|\,Y\Big] = \mathrm{Var}_\pi\big(Y_{n+1}\,\big|\,Y\big).$$
6.2 Regularization
The maximum a posteriori (MAP) estimator is defined by
$$\widehat{\beta}^{\rm MAP} = \underset{\beta\in\mathbb{R}^{q+1}}{\arg\max}\ \log\pi(\beta|Y) = \underset{\beta\in\mathbb{R}^{q+1}}{\arg\max}\ \ell_Y(\beta) + \log\pi(\beta). \qquad (6.3)$$
For suitable prior choices $\pi$, this takes the form
$$\widehat{\beta}^{\rm MAP} = \underset{\beta\in\mathbb{R}^{q+1}}{\arg\max}\ \ell_Y(\beta) - \lambda\|\beta\|_p^p,$$
In the sequel we consider the slightly modified version
$$\widehat{\beta}^{\rm MAP} = \widehat{\beta}^{\rm MAP}(\lambda) = \underset{\beta\in\mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n}\,\ell_Y(\beta) - \lambda\|\beta_-\|_p^p, \qquad (6.4)$$
where $\beta_-\in\mathbb{R}^q$ excludes the intercept $\beta_0$ from regularization; we also scale with the sample size $n$ to make the units of the tuning parameter $\lambda$ independent of the sample size $n$.
Remarks 6.2
• The regularization term $\lambda\|\beta_-\|_p^p$ keeps the components of the regression parameter $\beta_-$ close to zero; thus, it prevents over-fitting by letting parameters only take moderate values. The magnitudes of the parameter values are controlled by the regularization parameter $\lambda>0$ which acts as a hyper-parameter. Optimal hyper-parameters are determined by cross-validation.
• In (6.4) all components of $\beta_-$ are treated equally. This may not be appropriate if the feature components of $x$ live on different scales. This problem of different scales can be solved by either scaling the components of $x$ to a unit scale, or by introducing a diagonal importance matrix $T = \mathrm{diag}(t_1,\dots,t_q)$ with $t_j>0$ that describes the scales of the components of $x$. This allows us to regularize $\|T^{-1}\beta_-\|_p^p$ instead of $\|\beta_-\|_p^p$. Thus, in this latter case we replace (6.4) by the weighted version
$$\widehat{\beta}^{\rm MAP} = \underset{\beta}{\arg\max}\ \frac{1}{n}\,\ell_Y(\beta) - \lambda\sum_{j=1}^{q} t_j^{-p}\,|\beta_j|^p.$$
• If the regression parameter has a natural group structure $\beta = (\beta_0, \beta_1^\top, \dots, \beta_K^\top)^\top$, e.g., coming from dummy coding of categorical feature components, group LASSO regularization considers
$$\widehat{\beta}^{\rm MAP} = \underset{\beta}{\arg\max}\ \frac{1}{n}\,\ell_Y(\beta) - \lambda\sum_{k=1}^{K}\|\beta_k\|_2. \qquad (6.5)$$
This proposal leads to sparsity, i.e., for large regularization parameters λ the
entire β k may be shrunk (exactly) to zero; this is discussed in Sect. 6.2.5, below.
We also refer to Section 4.3 in Hastie et al. [184], and Devriendt et al. [104]
proposed this approach in the actuarial literature.
• There are more versions of regularization, e.g., in the fused LASSO approach we
ensure that the first differences βj − βj −1 remain small.
$$\underset{\beta\in\mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n}\,\ell_Y(\beta) \qquad \text{subject to } \|\beta_-\|_p^p \le c. \qquad (6.6)$$
This optimization problem can be tackled by the method of Karush, Kuhn and
Tucker (KKT) [208, 228]. Optimization problem (6.4) corresponds by Lagrangian
duality to the constraint optimization problem (6.6). For every $c$ for which the budget constraint in (6.6) is binding, $\|\beta_-\|_p^p = c$, there is a corresponding regularization parameter $\lambda = \lambda(c)$, and, conversely, the solution of (6.4) solves (6.6) with $c = \|\widehat{\beta}^{\rm MAP}(\lambda)_-\|_p^p$.
We compare the two special cases of p = 1, 2 in this section, and in the subsequent
Sects. 6.2.3 and 6.2.4 we discuss how these two cases can be solved numerically.
Ridge Regularization p = 2 For p = 2, the prior distribution π in (6.4) is a
centered Gaussian distribution. This L2 -regularization is called ridge regularization
or Tikhonov regularization [353], and we have
$$\widehat{\beta}^{\rm ridge} = \widehat{\beta}^{\rm ridge}(\lambda) = \underset{\beta\in\mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n}\,\ell_Y(\beta) - \lambda\sum_{j=1}^{q}\beta_j^2. \qquad (6.7)$$
LASSO Regularization p = 1 For $p=1$, the prior distribution $\pi$ in (6.4) is a centered Laplace distribution. This $L^1$-regularization is called LASSO regularization (least absolute shrinkage and selection operator), and we have
$$\widehat{\beta}^{\rm LASSO} = \widehat{\beta}^{\rm LASSO}(\lambda) = \underset{\beta\in\mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n}\,\ell_Y(\beta) - \lambda\sum_{j=1}^{q}|\beta_j|. \qquad (6.8)$$
Fig. 6.1 Illustration of optimization problem (6.6) under a budget constraint (lhs) for p = 2 (Euclidean norm) and (rhs) p = 1 (Manhattan norm)
ridge regularization this is not the case, except for special situations concerning the
position of the red MLE. Thus, ridge regression makes components of parameter
estimates generally smaller, whereas LASSO shrinks some of these components
exactly to zero (this also explains the name LASSO).
Remark 6.3 (Elastic Net) LASSO regularization faces difficulties with collinearity
in feature components. In particular, if we have a group of highly correlated feature
components, LASSO fails to do a grouped selection, but it selects one component
and ignores the other ones. On the other hand, ridge regularization can deal with
this issue. For this reason, Zou–Hastie [409] proposed the elastic net regularization,
which uses a combined regularization term
$$\widehat{\beta}^{\text{elastic net}} = \underset{\beta\in\mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n}\,\ell_Y(\beta) - \lambda\Big((1-\alpha)\|\beta\|_2^2 + \alpha\|\beta\|_1\Big),$$
for some α ∈ (0, 1). The L1 -term gives sparsity and the quadratic term removes
the limitation on the number of selected variables, providing a grouped selection.
In Fig. 6.2 we compare the elastic net regularization (orange color) to ridge and
LASSO regularization (black and blue color). Ridge regularization provides a
smooth strictly convex boundary (black), whereas LASSO provides a boundary that
is non-differentiable in the corners (blue). The elastic net is still non-differentiable
in the corners, this is needed for variable selection, and at the same time it is strictly
convex between the corners which is needed for grouping.
note that we exclude the intercept $\beta_0$ from regularization (we use a slight abuse of notation, here), and we also refer to Proposition 5.1. The negative expected Hessian of this optimization problem is given by
$$\mathcal{J}(\beta) = -\mathbb{E}_\beta\Big[\nabla_\beta^2\Big(\ell_Y(\beta) - n\lambda\|\beta_-\|_2^2\Big)\Big] = \mathcal{I}(\beta) + 2n\lambda\,\mathrm{diag}(0,1,\dots,1) \in \mathbb{R}^{(q+1)\times(q+1)},$$
and Fisher's scoring method gives the updates
$$\beta^{(t)} \mapsto \beta^{(t+1)} = \beta^{(t)} + \mathcal{J}\big(\beta^{(t)}\big)^{-1}\,\widetilde{s}\big(\beta^{(t)}, Y\big), \qquad (6.10)$$
with regularized score $\widetilde{s}(\beta, Y) = s(\beta, Y) - 2n\lambda\,\beta_-$. This update can be rewritten as
$$\beta^{(t+1)} = \beta^{(t)} + \mathcal{J}\big(\beta^{(t)}\big)^{-1}\,\widetilde{s}\big(\beta^{(t)}, Y\big)$$
$$= \mathcal{J}\big(\beta^{(t)}\big)^{-1}\Big[\mathcal{J}\big(\beta^{(t)}\big)\beta^{(t)} + X^\top W\big(\beta^{(t)}\big)R\big(Y,\beta^{(t)}\big) - 2n\lambda\,\beta^{(t)}_-\Big]$$
$$= \mathcal{J}\big(\beta^{(t)}\big)^{-1}\Big[\mathcal{I}\big(\beta^{(t)}\big)\beta^{(t)} + X^\top W\big(\beta^{(t)}\big)R\big(Y,\beta^{(t)}\big)\Big]$$
$$= \mathcal{J}\big(\beta^{(t)}\big)^{-1}\,X^\top W\big(\beta^{(t)}\big)\Big[X\beta^{(t)} + R\big(Y,\beta^{(t)}\big)\Big].$$
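The ridge-regularized Fisher scoring updates can be sketched as follows. For simplicity we use a Poisson GLM with canonical log-link (not the gamma model of the example below) on simulated data, maximizing $\ell_Y(\beta) - n\lambda\|\beta_-\|_2^2$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, q, lam = 1000, 3, 0.01
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
Y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.2, 0.1]))).astype(float)
D = np.diag([0.0] + [1.0] * q)                 # the intercept is not regularized

beta = np.zeros(q + 1)
for _ in range(50):                            # ridge-regularized Fisher scoring, see (6.10)
    mu = np.exp(X @ beta)                      # canonical log-link: W = diag(mu)
    R = (Y - mu) / mu                          # working residuals
    J = X.T @ (mu[:, None] * X) + 2 * n * lam * D
    beta = np.linalg.solve(J, X.T @ (mu * (X @ beta + R)))

# at convergence the regularized score vanishes
mu = np.exp(X @ beta)
grad = X.T @ (Y - mu) - 2 * n * lam * D @ beta
print(np.abs(grad).max() < 1e-6)
```

At the fixed point of the update one has $2n\lambda\,\beta_- = X^\top(Y-\mu)$, i.e., the regularized score equations hold exactly, mirroring the derivation above.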
Fig. 6.3 Ridge regularized MLEs in model Gamma GLM1: (lhs) in-sample deviance losses as a function of the regularization parameter $\lambda>0$, (rhs) resulting $\widehat{\beta}_j^{\rm ridge}(\lambda)$ for $1\le j\le q=8$
apply Fisher’s scoring updates (6.10).3 For this analysis we center and normalize
(to unit variance) the columns of the design matrix (except for the initial column of
X encoding the intercept).
Figure 6.3 (lhs) shows the resulting in-sample deviance losses as a function of
λ > 0. Regularization parameter λ allows us to continuously connect the in-sample
deviance losses of the null model (2.085) and model Gamma GLM1 (1.717), see
Table 5.13. Figure 6.3 (rhs) shows the regression parameter estimates $\widehat{\beta}_j^{\rm ridge}(\lambda)$, $1\le j\le q=8$, as a function of $\lambda>0$. Overall they decrease because the budget constraint gets tighter for increasing $\lambda$; however, the individual parameters do not need to be monotone, since one parameter may (better) compensate a decrease of another (through correlations in feature components).
Finally, we need to choose the optimal regularization parameter λ > 0.
This is done by cross-validation. We exploit the generalized cross-validation loss,
see (5.67), and the hat matrix in this ridge regularized case is given by
$$H_\lambda = W\big(\widehat{\beta}^{\rm ridge}\big)^{1/2}\, X\, \mathcal{J}\big(\widehat{\beta}^{\rm ridge}\big)^{-1} X^\top W\big(\widehat{\beta}^{\rm ridge}\big)^{1/2}.$$
In contrast to (5.66), this hat matrix Hλ is not a projection but we would need to
work in an augmented model to receive the projection property (accounting for the
regularization part).
Figure 6.4 plots the generalized cross-validation loss as a function of λ > 0.
We observe the minimum in parameter λ = e−9.4 . The resulting generalized cross-
validation loss is 1.76742. This is bigger than the one received in model Gamma
3The R command glmnet [142] allows for regularized MLE, however, the current version does
not include the gamma distribution. Therefore, we have implemented our own routine.
Fig. 6.4 Generalized cross-validation loss $D^{\rm GCV}$ as a function of $\log(\lambda)$
GLM2, see Table 5.16, thus, we still prefer model Gamma GLM2 over the optimally
ridge regularized model GLM1. Note that for model Gamma GLM2 we did variable
selection, whereas ridge regression just generally shrinks regression parameters.
For more interpretation we refer to Example 6.8, below, which considers LASSO
regularization.
Gaussian Case
We first consider LASSO regularization in the Gaussian model with a single feature component,
$$\widehat{\beta}^{\rm LASSO} = \underset{\beta\in\mathbb{R}^2}{\arg\max}\ -\frac{1}{2n}\sum_{i=1}^n\big(Y_i - \beta_0 - \beta_1 x_i\big)^2 - \lambda|\beta_1|.$$
We standardize the observations and features $(Y_i, x_i)_{1\le i\le n}$ such that we have $\sum_{i=1}^n Y_i = 0$, $\sum_{i=1}^n x_i = 0$ and $n^{-1}\sum_{i=1}^n x_i^2 = 1$. This implies that we can omit the intercept parameter $\beta_0$, as the optimal intercept satisfies for this standardized data (and any $\beta_1\in\mathbb{R}$)
$$\widehat{\beta}_0 = \frac{1}{n}\sum_{i=1}^n\big(Y_i - \beta_1 x_i\big) = 0. \qquad (6.11)$$
Thus, w.l.o.g., we assume to work with standardized data in this section, this gives
us the optimization problem (we drop the lower index in β1 because we only have
one component)
$$\widehat{\beta}^{\rm LASSO} = \widehat{\beta}^{\rm LASSO}(\lambda) = \underset{\beta\in\mathbb{R}}{\arg\max}\ -\frac{1}{2n}\sum_{i=1}^n\big(Y_i - \beta x_i\big)^2 - \lambda|\beta|. \qquad (6.12)$$
The difficulty is that the regularization term is not differentiable in zero. Since this
term is convex we can express its derivative in terms of a sub-gradient s. This
provides score
$$\frac{\partial}{\partial\beta}\left(-\frac{1}{2n}\sum_{i=1}^n\big(Y_i - \beta x_i\big)^2 - \lambda|\beta|\right) = \frac{1}{n}\sum_{i=1}^n\big(Y_i - \beta x_i\big)x_i - \lambda s = \frac{1}{n}\langle Y, x\rangle - \beta - \lambda s,$$
where we use the standardization $n^{-1}\sum_{i=1}^n x_i^2 = 1$ in the second step, $\langle Y, x\rangle$ is the scalar product of $Y$ and $x = (x_1,\dots,x_n)^\top\in\mathbb{R}^n$, and where we consider the sub-gradient
$$s = s(\beta) = \begin{cases} +1 & \text{if } \beta > 0,\\ -1 & \text{if } \beta < 0,\\ \in[-1,1] & \text{otherwise}. \end{cases}$$
The score equation for $\beta\neq 0$ then reads as
$$n^{-1}\langle Y, x\rangle - \beta - \lambda s = n^{-1}\langle Y, x\rangle - \beta - \mathrm{sign}(\beta)\,\lambda \overset{!}{=} 0,$$
and its solution is obtained by soft-thresholding, $\widehat{\beta}^{\rm LASSO}(\lambda) = \mathrm{sign}\big(n^{-1}\langle Y, x\rangle\big)\big(|n^{-1}\langle Y, x\rangle| - \lambda\big)_+$.
Fig. 6.5 Soft-thresholding operator for λ = 4 (red dotted lines)
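The soft-thresholding solution can be checked against a brute-force minimization of the objective in (6.12); a minimal sketch with simulated, standardized data:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator z -> sign(z) * (|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
x = (x - x.mean()) / np.sqrt(np.mean((x - x.mean()) ** 2))  # n^{-1} sum_i x_i^2 = 1
Y = 0.8 * x + rng.normal(size=n)
Y = Y - Y.mean()                                             # centered responses

lam = 0.3
beta_lasso = soft_threshold(np.mean(Y * x), lam)             # closed-form solution of (6.12)

# brute-force check: minimize (2n)^{-1} sum (Y_i - b x_i)^2 + lam |b| over a fine grid,
# using the expansion 0.5 (mean(Y^2) - 2 b mean(Yx) + b^2) for standardized x
grid = np.linspace(-2.0, 2.0, 40_001)
c1, c2 = np.mean(Y ** 2), np.mean(Y * x)
obj = 0.5 * (c1 - 2.0 * grid * c2 + grid ** 2) + lam * np.abs(grid)
print(np.isclose(beta_lasso, grid[np.argmin(obj)], atol=1e-3))
```

The closed-form soft-threshold of $n^{-1}\langle Y,x\rangle$ coincides with the grid minimizer up to grid resolution.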
In the multivariate case we standardize analogously, $n^{-1}\sum_{i=1}^n x_{i,j}^2 = 1$ for all $1\le j\le q$. This allows us again to drop the intercept parameter.
In Sect. 7.2.3 we will discuss gradient descent methods for network fitting. In this
section we provide preliminary considerations on gradient descent methods because
these are also useful to fit LASSO regularized parameters within GLMs (different
from Gaussian GLMs). Note that we perform a sign switch in what follows: we now aim at minimizing an objective function $g$.
Choose a convex and differentiable function g : Rq+1 → R. Assuming that
the global minimum of g is achieved, a necessary and sufficient condition for the
optimality of β ∗ ∈ Rq+1 in this convex setting is ∇β g(β)|β=β ∗ = 0. Gradient
descent algorithms find this optimal point by iterating for t ≥ 0
for tempered learning rates t +1 > 0. This algorithm is motivated by a first order
Taylor expansion that determines the direction of the maximal local decrease of the
objective function g supposed we are in position β, i.e.,
$$g(\widetilde{\beta}) = g(\beta) + \nabla_\beta g(\beta)^\top\big(\widetilde{\beta} - \beta\big) + o\big(\|\widetilde{\beta} - \beta\|_2\big) \qquad \text{as } \|\widetilde{\beta} - \beta\|_2 \to 0.$$
The gradient descent algorithm (6.14) leads to the (unconstraint) minimum of the
objective function g at convergence. A budget constraint like (6.6) leads to a convex
constraint β ∈ C ⊂ Rq+1 . Consideration of such a convex constraint requires
that we reformulate the gradient descent algorithm (6.14). The gradient descent
step (6.14) can also be found, for given learning rate $\varrho_{t+1}$, by solving the following linearized problem for $g$ with the Euclidean square distance penalty term (ridge regularization) for too big gradient descent steps
$$\underset{\beta\in\mathbb{R}^{q+1}}{\arg\min}\ g\big(\beta^{(t)}\big) + \nabla_\beta g\big(\beta^{(t)}\big)^\top\big(\beta - \beta^{(t)}\big) + \frac{1}{2\varrho_{t+1}}\big\|\beta - \beta^{(t)}\big\|_2^2. \qquad (6.15)$$

Fig. 6.6 Constraint gradient descent step obtained by, first, computing the unconstraint solution $\beta^{(t)} - \varrho_{t+1}\nabla_\beta g(\beta^{(t)})$ of (6.15) and, second, projecting this unconstraint solution back to the convex set $C$, giving $\beta^{(t+1)}$; see also Figure 5.5 in Hastie et al. [184]
The solution to this optimization problem exactly gives the gradient descent
step (6.14). This is now adapted to a constraint gradient descent update for convex
constraint C:
$$\beta^{(t+1)} = \underset{\beta\in C}{\arg\min}\ g\big(\beta^{(t)}\big) + \nabla_\beta g\big(\beta^{(t)}\big)^\top\big(\beta - \beta^{(t)}\big) + \frac{1}{2\varrho_{t+1}}\big\|\beta - \beta^{(t)}\big\|_2^2. \qquad (6.16)$$
The solution to this constraint convex optimization problem is obtained by, first, taking an unconstraint gradient descent step $\beta^{(t)} \mapsto \beta^{(t)} - \varrho_{t+1}\nabla_\beta g(\beta^{(t)})$, and, second, if this step is not within the convex set $C$, projecting it back to $C$; this is illustrated in Fig. 6.6, and it is called a projected gradient descent step (the justification is given in Lemma 6.6 below). Thus, the only difficulty in applying this projected gradient descent step is to find an efficient method of projecting the unconstraint solution (6.14)–(6.15) back to the convex constraint set $C$.
Assume that the convex constraint set $C$ is expressed by a convex function $h$ (not necessarily being differentiable). To solve (6.16) and to motivate the projected gradient descent step, we use the proximal gradient method discussed in Section 5.3.3 of Hastie et al. [184]. The proximal gradient method helps us to do the projection in the projected gradient descent step. We introduce the generalized projection operator, for $z\in\mathbb{R}^{q+1}$,
$$\mathrm{prox}_h(z) = \underset{\beta\in\mathbb{R}^{q+1}}{\arg\min}\ \frac{1}{2}\|z-\beta\|_2^2 + h(\beta). \qquad (6.17)$$
Proof of Lemma 6.6 We compute

$$\frac{1}{2}\Big\|\big(\beta^{(t)} - \varrho_{t+1}\nabla_\beta g(\beta^{(t)})\big) - \beta\Big\|_2^2 + \varrho_{t+1}\, h(\beta)$$

$$= \frac{\varrho_{t+1}^2}{2}\big\|\nabla_\beta g(\beta^{(t)})\big\|_2^2 - \varrho_{t+1}\big\langle \nabla_\beta g(\beta^{(t)}),\ \beta^{(t)} - \beta\big\rangle + \frac{1}{2}\big\|\beta^{(t)} - \beta\big\|_2^2 + \varrho_{t+1}\, h(\beta)$$

$$= \varrho_{t+1}\left(\frac{\varrho_{t+1}}{2}\big\|\nabla_\beta g(\beta^{(t)})\big\|_2^2 + \nabla_\beta g(\beta^{(t)})^\top\big(\beta - \beta^{(t)}\big) + \frac{1}{2\varrho_{t+1}}\big\|\beta - \beta^{(t)}\big\|_2^2 + h(\beta)\right).$$

This is exactly the right objective function (in the round brackets) if we ignore all
terms that are independent of β. This proves the lemma.
Thus, to solve the constrained optimization problem (6.16) we bring it into its dual
Lagrangian form (6.18). Then we apply the generalized projection operator to the
unconstrained solution to find the constrained solution, see Lemma 6.6. This approach
is successful if we can explicitly compute the generalized projection operator
$\operatorname{prox}_h(\cdot)$.
Lemma 6.7 The generalized projection operator (6.17) satisfies for the LASSO
constraint $h(\beta) = \lambda \|\beta_-\|_1$, where $\beta_- = (\beta_1, \ldots, \beta_q)^\top$ excludes the intercept,

$$\operatorname{prox}_h(z) \stackrel{\text{def.}}{=} S_\lambda^{\text{LASSO}}(z) = \Big(z_0,\ \operatorname{sign}(z_1)(|z_1|-\lambda)_+,\ \ldots,\ \operatorname{sign}(z_q)(|z_q|-\lambda)_+\Big)^\top,$$

for $z \in \mathbb{R}^{q+1}$.
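Lemma 6.7 translates directly into code; the following is a minimal NumPy sketch of the soft-thresholding operator (the function name is ours, not from the text):

```python
import numpy as np

def soft_threshold_lasso(z, lam):
    """Soft-thresholding operator S_lambda^LASSO of Lemma 6.7 for the penalty
    h(beta) = lam * (|beta_1| + ... + |beta_q|); the intercept z[0] is not
    regularized and passes through unchanged."""
    z = np.asarray(z, dtype=float)
    out = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    out[0] = z[0]   # intercept component is not shrunk
    return out

# components with |z_j| <= lam are set exactly to zero; the intercept is kept
print(soft_threshold_lasso([1.5, 3.0, -0.5, -4.0], lam=2.0))
```

This exact-zero behavior is what makes LASSO perform variable selection, in contrast to the ridge penalty which only shrinks.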
6.2 Regularization 223
The proximal gradient descent algorithm iterates the following two steps:

1. Make the gradient descent step for a suitable learning rate $\varrho_{t+1} > 0$

$$\beta^{(t)} \ \to\ \widetilde{\beta}^{(t+1)} = \beta^{(t)} - \varrho_{t+1} \nabla_\beta g(\beta^{(t)}).$$

2. Apply the generalized projection operator to obtain $\beta^{(t+1)} = \operatorname{prox}_{\varrho_{t+1} h}\big(\widetilde{\beta}^{(t+1)}\big)$.

If the gradient $\nabla_\beta g(\cdot)$ is Lipschitz continuous with Lipschitz constant L > 0, the
proximal gradient descent algorithm converges at rate O(1/t) for a fixed step
size $0 < \varrho_{t+1} \equiv \varrho \le 1/L$; see Section 4.2 in Parikh–Boyd [292].
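The two steps can be illustrated on a LASSO-regularized least-squares objective $g(\beta) = \tfrac{1}{2}\|X\beta - y\|_2^2$ (a hypothetical toy problem, not the gamma GLM of the following example), with the fixed step size 1/L where L is the largest eigenvalue of $X^\top X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 200, 5
X = rng.normal(size=(n, q))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])     # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 10.0                                 # LASSO penalty weight
L = np.linalg.eigvalsh(X.T @ X).max()      # Lipschitz constant of the gradient
step = 1.0 / L                             # fixed learning rate 0 < rho <= 1/L

beta = np.zeros(q)
for _ in range(500):
    z = beta - step * X.T @ (X @ beta - y)                       # gradient descent step
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding (prox) step

# the true-zero components are recovered as exact zeros
print(np.round(beta, 3))
```

Note that the proximal step uses the threshold $\varrho\,\lambda$, matching $\operatorname{prox}_{\varrho h}$ for $h(\beta) = \lambda\|\beta\|_1$ (no intercept in this toy setup).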
Example 6.8 (LASSO Regression) We revisit Example 6.5 which considers claim
size modeling using model Gamma GLM1. In order to apply the proximal gradient
descent algorithm for LASSO regularization we need to calculate the gradient of
the negative log-likelihood. In the gamma case with log-link, it is given in
Example 5.5.
Fig. 6.7 LASSO regularized MLEs in model Gamma GLM1: (lhs) in-sample losses as a function of the regularization parameter λ > 0, (rhs) resulting $\widehat{\beta}_j^{\text{LASSO}}(\lambda)$ for 1 ≤ j ≤ q
Oracle Property

$$\lim_{n\to\infty} \mathbb{P}\big[\widehat{\mathcal{A}}_n = \mathcal{A}^*\big] = 1, \tag{6.19}$$

$$\sqrt{n}\,\big(\widehat{\beta}_{n,\mathcal{A}^*}(\lambda_n) - \beta^*_{\mathcal{A}^*}\big) \ \Rightarrow\ \mathcal{N}\big(0,\ \mathcal{I}_{\mathcal{A}^*}^{-1}\big) \qquad \text{as } n \to \infty. \tag{6.20}$$
$$J_\lambda(\beta) = \sum_{j=1}^{q}\left( \lambda |\beta_j|\, \mathbb{1}_{\{|\beta_j|\le\lambda\}} - \frac{|\beta_j|^2 - 2a\lambda|\beta_j| + \lambda^2}{2(a-1)}\, \mathbb{1}_{\{\lambda<|\beta_j|\le a\lambda\}} + \frac{(a+1)\lambda^2}{2}\, \mathbb{1}_{\{|\beta_j|>a\lambda\}} \right),$$
Fig. 6.8 (lhs) LASSO soft-thresholding operator $x \mapsto S_\lambda(x)$ for λ = 4 (red dotted lines), (rhs) SCAD thresholding operator $x \mapsto S_\lambda^{\text{SCAD}}(x)$ for λ = 4 and a = 3
Thus, we have a constant LASSO-like slope λ > 0 for $0 < |\beta_j| \le \lambda$, shrinking some
components exactly to zero. For $|\beta_j| > a\lambda$ the slope is 0, removing regularization, and
the penalty is interpolated between these two regimes. The thresholding operator for SCAD
regularization is given by, see Fan–Li [124],

$$S_\lambda^{\text{SCAD}}(x) = \begin{cases} \operatorname{sign}(x)\,(|x|-\lambda)_+ & \text{for } |x| \le 2\lambda, \\[4pt] \dfrac{(a-1)x - \operatorname{sign}(x)\,a\lambda}{a-2} & \text{for } 2\lambda < |x| \le a\lambda, \\[4pt] x & \text{for } |x| > a\lambda. \end{cases}$$
Figure 6.8 compares the two thresholding operators of LASSO and SCAD.
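For illustration, the SCAD thresholding operator can be sketched as a direct NumPy translation of the case distinction above (the function name is ours); note the continuity at |x| = 2λ and |x| = aλ:

```python
import numpy as np

def scad_threshold(x, lam=4.0, a=3.0):
    """SCAD thresholding operator of Fan-Li: soft-thresholding for |x| <= 2*lam,
    a linear interpolation for 2*lam < |x| <= a*lam, and the identity (no
    shrinkage at all) for |x| > a*lam."""
    x = np.asarray(x, dtype=float)
    soft = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
    mid = ((a - 1.0) * x - np.sign(x) * a * lam) / (a - 2.0)
    return np.where(np.abs(x) <= 2.0 * lam, soft,
                    np.where(np.abs(x) <= a * lam, mid, x))

# with lam = 4, a = 3 (as in Fig. 6.8): small inputs are shrunk to zero,
# moderate ones partially, and large ones are kept unchanged
print(scad_threshold(np.array([1.0, 6.0, 10.0, 20.0])))
```

This makes the key difference to LASSO visible: large coefficients are left unbiased instead of being shrunk by λ.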
Alternatively, we propose to do variable selection with LASSO regularization in
a first step. Since the resulting LASSO regularized estimator may not be consistent,
one should explore a second regression step that uses an un-penalized
regression model on the LASSO selected components; we also refer to Lee et al.
[237].
In Example 6.8 we have seen that if there are natural groups within the feature
components, they should be treated simultaneously. Assume we have a group
structure $x = (1, x_1, \ldots, x_K)^\top$ with feature groups $x_k \in \mathbb{R}^{q_k}$, and a correspondingly
grouped regression parameter $\beta = (\beta_0, \beta_1, \ldots, \beta_K)^\top$. The group LASSO estimator is given by

$$\widehat{\beta}^{\text{group}} = \widehat{\beta}^{\text{group}}(\lambda) = \underset{\beta=(\beta_0,\beta_1,\ldots,\beta_K)}{\arg\max}\ \frac{1}{n}\,\ell_Y(\beta) - \lambda \sum_{k=1}^{K} \|\beta_k\|_2, \tag{6.21}$$

for regularization parameter λ ≥ 0, and with linear predictor

$$x \mapsto \eta(x) = \langle \beta, x\rangle = \beta_0 + \sum_{k=1}^{K} \langle \beta_k, x_k\rangle.$$
k=1
The latter highlights that the problem decouples into K independent problems. Thus,
we need to solve for all 1 ≤ k ≤ K the optimization problems
$$\underset{\beta_k \in \mathbb{R}^{q_k}}{\arg\min}\ \frac{1}{2}\|z_k - \beta_k\|_2^2 + \lambda \|\beta_k\|_2.$$
and for the generalized projection operator with $h(\beta) = \lambda \sum_{k=1}^{K} \|\beta_k\|_2$ we
have

$$\operatorname{prox}_h(z) \stackrel{\text{def.}}{=} S_\lambda^{\text{group}}(z) = \Big(z_0,\ S_\lambda^{q_1}(z_1),\ \ldots,\ S_\lambda^{q_K}(z_K)\Big),$$
this follows because the squared distance $\|z_k - \beta_k\|_2^2 = \|z_k\|_2^2 - 2\langle z_k, \beta_k\rangle + \|\beta_k\|_2^2$
is minimized if $z_k$ and $\beta_k$ point in the same direction. Thus, there remains the
minimization of the objective function in $\rho = \|\beta_k\|_2 \ge 0$. The first derivative is given by

$$\frac{\partial}{\partial \rho}\left(\frac{1}{2}\,\|z_k\|_2^2\left(1 - \frac{\rho}{\|z_k\|_2}\right)^2 + \lambda\rho\right) = -\|z_k\|_2\left(1 - \frac{\rho}{\|z_k\|_2}\right) + \lambda = \lambda - \|z_k\|_2 + \rho.$$

If $\|z_k\|_2 > \lambda$ we have $\widehat{\rho} = \|z_k\|_2 - \lambda > 0$, and otherwise we need to set $\widehat{\rho} = 0$. This
implies

$$S_\lambda^{q_k}(z_k) = \big(\|z_k\|_2 - \lambda\big)_+ \,\frac{z_k}{\|z_k\|_2}.$$
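A short sketch of the group soft-thresholding operator (names are ours): each group vector is shrunk along its own direction, and a whole group collapses to zero exactly when its Euclidean norm does not exceed λ:

```python
import numpy as np

def group_soft_threshold(z_groups, lam):
    """Group soft-thresholding S_lambda^{q_k}: shrink each group vector z_k
    along its own direction, (||z_k||_2 - lam)_+ * z_k / ||z_k||_2, so a
    whole group is set to zero when its norm does not exceed lam."""
    out = []
    for z in z_groups:
        z = np.asarray(z, dtype=float)
        norm = np.linalg.norm(z)
        out.append(np.zeros_like(z) if norm <= lam else (norm - lam) / norm * z)
    return out

groups = [np.array([3.0, 4.0]), np.array([0.3, -0.4])]
# first group (norm 5) is rescaled by 4/5; second group (norm 0.5) collapses to zero
print(group_soft_threshold(groups, lam=1.0))
```

This is the mechanism by which group LASSO removes whole feature groups (e.g., a variable together with its quadratic term) at once.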
Fig. 6.9 Group LASSO regularized MLEs in model Gamma GLM1: (lhs) in-sample losses as a function of the regularization parameter λ > 0, (rhs) resulting $\widehat{\beta}_j^{\text{group}}(\lambda)$ for 1 ≤ j ≤ q
The group LASSO version of the proximal gradient descent algorithm iterates:

1. Make the gradient descent step for a suitable learning rate $\varrho_{t+1} > 0$

$$\beta^{(t)} \ \to\ \widetilde{\beta}^{(t+1)} = \beta^{(t)} - \varrho_{t+1} \nabla_\beta g(\beta^{(t)}).$$

2. Apply the group LASSO projection operator to obtain $\beta^{(t+1)} = S^{\text{group}}_{\varrho_{t+1}\lambda}\big(\widetilde{\beta}^{(t+1)}\big)$.
Example 6.10 (Group LASSO Regression) We revisit Example 6.8 which considers
claim size modeling using model Gamma GLM1. This time we group the variables
OwnerAge and OwnerAge2 (β1 , β2 ) as well as VehAge and VehAge2 (β5 , β6 ).
The results are shown in Fig. 6.9.
The order in which the parameters are regularized to zero is: β4 (RiskClass),
β8 (BonusClass), β7 (GenderMale), (β1, β2) (OwnerAge, OwnerAge2), β3
(AreaGLM) and (β5, β6) (VehAge, VehAge2). This order more closely reflects the
variable importance obtained from the Wald statistics of Listing 5.11, and it
shows that grouped features should be regularized jointly in order to determine their
importance.
In many applied problems there does not exist a simple off-the-shelf distribution
that is suitable to model the whole range of observations. We think of claim size
modeling which may range from small to very large claims; the main body of the
data may look, say, gamma distributed, but the tail of the data may be regularly
varying. Another related problem is that claims may come from different insurance
policy modules. For instance, in property insurance, one can insure water damage,
fire, glass and theft claims on the same insurance policy, and feature information
about the claim type may not always be available. In such cases, it looks attractive
to choose a mixture or a composition of different distributions. In this section we
focus on mixtures.
Choose a fixed integer K ≥ 2 and define the (K − 1)-unit simplex
excluding the edges by

$$\Delta_K = \Big\{ p \in (0,1)^K;\ \sum_{k=1}^{K} p_k = 1 \Big\}. \tag{6.22}$$

$\Delta_K$ defines the family of categorical distributions with K levels (all levels having
a strictly positive probability). These distributions belong to the vector-valued
parameter EF which we have met in Sects. 2.1.4 and 5.7.
The idea behind mixture distributions is to mix K different distributions with a
mixture probability $p \in \Delta_K$. For instance, we can mix K different EDF densities
$f_k$ by considering

$$Y \sim \sum_{k=1}^{K} p_k\, f_k(y;\, \theta_k, v/\varphi_k) = \sum_{k=1}^{K} p_k \exp\left\{\frac{y\theta_k - \kappa_k(\theta_k)}{\varphi_k/v} + a_k(y;\, v/\varphi_k)\right\}, \tag{6.23}$$
Fig. 6.10 (lhs) Mixture distribution mixing three gamma densities, and (rhs) mixture distribution mixing two gamma components and a Pareto component with mixture probabilities p = (0.7, 0.1, 0.2) for the orange, green and blue components (the density components are already multiplied with p)
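For illustration, sampling from such a mixture proceeds by first drawing the (latent) component label and then sampling from the selected component; the gamma and Lomax parameters below are made up for this sketch and are not the ones underlying Fig. 6.10:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.7, 0.1, 0.2])   # mixture probabilities, as for Fig. 6.10 (rhs)

# draw the latent labels Z ~ Categorical(p), then sample each component
n = 100_000
z = rng.choice(3, size=n, p=p)
y = np.empty(n)
y[z == 0] = rng.gamma(shape=5.0, scale=100.0, size=(z == 0).sum())    # gamma component 1
y[z == 1] = rng.gamma(shape=20.0, scale=100.0, size=(z == 1).sum())   # gamma component 2
u = rng.uniform(size=(z == 2).sum())
y[z == 2] = 2000.0 * (u ** (-1.0 / 1.5) - 1.0)   # Lomax(beta = 1.5, M = 2000) by inversion

# empirical weight of the heavy-tailed component is close to p[2] = 0.2
print((z == 2).mean())
```

The Lomax draw uses inverse-transform sampling of the survival function $(1 + y/M)^{-\beta}$; the latent labels z are exactly the quantities the EM algorithm below tries to recover when they are unobserved.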
these are the K corners of the closure of the (K − 1)-unit simplex $\Delta_K$. One-hot encoding differs
from dummy coding (5.21). One-hot encoding does not lead to a full-rank design
matrix because there is a redundancy, that is, we can drop one component of Z
and still have the same information. The one-hot encoding Z of Z allows us to extend
the incomplete (data) log-likelihood $\ell_Y(\theta, p)$, see (6.23)–(6.24), under complete
information (Y, Z) as follows
$$\ell_{(Y,Z)}(\theta, p) = \log \prod_{k=1}^{K} \big(p_k\, f_k(Y;\, \theta_k, v/\varphi_k)\big)^{Z_k}$$

$$= \log \prod_{k=1}^{K} \left( p_k \exp\left\{\frac{Y\theta_k - \kappa_k(\theta_k)}{\varphi_k/v} + a_k(Y;\, v/\varphi_k)\right\} \right)^{Z_k} \tag{6.26}$$

$$= \sum_{k=1}^{K} Z_k \left( \log(p_k) + \frac{Y\theta_k - \kappa_k(\theta_k)}{\varphi_k/v} + a_k(Y;\, v/\varphi_k) \right).$$
6.3 Expectation-Maximization Algorithm 233
• E-step. Given the parameters (θ, p), Bayes' rule provides the posterior
distribution of the latent variable Z,

$$\mathbb{P}_{\theta,p}[Z_k = 1 \mid Y] = \frac{p_k\, f_k(Y;\, \theta_k, v/\varphi_k)}{\sum_{l=1}^{K} p_l\, f_l(Y;\, \theta_l, v/\varphi_l)},$$

and we set

$$\widehat{Z}_k(\theta, p \mid Y) \stackrel{\text{def.}}{=} \mathbb{E}_{\theta,p}[Z_k \mid Y] = \mathbb{P}_{\theta,p}[Z_k = 1 \mid Y] \qquad \text{for } 1 \le k \le K. \tag{6.27}$$

This posterior mean $\widehat{Z}(\theta, p \mid Y) = (\widehat{Z}_1(\theta, p \mid Y), \ldots, \widehat{Z}_K(\theta, p \mid Y))^\top \in \Delta_K$
is used as an estimate for the (unobserved) latent variable Z; note that
this posterior mean depends on the unknown parameters (θ, p).

• M-step. Based on Y and $\widehat{Z}$, the parameters θ and p are estimated with
MLE.
The M-step leads to the score equations, see (6.26),

$$\nabla_\theta \sum_{i=1}^{n} \sum_{k=1}^{K} \widehat{Z}^{(t)}_{i,k}\, \frac{Y_i \theta_k - \kappa_k(\theta_k)}{\varphi_k / v_i} = 0, \tag{6.29}$$

$$\nabla_{p^-} \sum_{i=1}^{n} \sum_{k=1}^{K} \widehat{Z}^{(t)}_{i,k} \log(p_k) = 0, \tag{6.30}$$

where $p^- = (p_1, \ldots, p_{K-1})^\top$ and setting $p_K = 1 - \sum_{k=1}^{K-1} p_k \in (0,1)$.
Remarks 6.11
• The E-step uses Bayes' rule. This motivates considering the EM algorithm in this
Bayesian chapter; alternatively, it also fits the MLE chapters.
• We have formulated the M-step in (6.29)–(6.30) in a general way because the
canonical parameter θ and the mixture probability p could be modeled by
GLMs, and, hence, they may be feature $x_i$ dependent. Moreover, (6.29) is
formulated for a mixture of single-parameter EDF distributions, but, of course,
this holds in much more generality.
• Equations (6.29)–(6.30) are the score equations obtained from (6.26). There is
a subtle point here, namely, $Z_k \in \{0,1\}$ in (6.26) are observations, whereas
$\widehat{Z}^{(t)}_{i,k} \in (0,1)$ in (6.29)–(6.30) are their estimates. Thus, in the EM algorithm
the unknown latent variables are replaced by their estimates which, in our setup,
results in two different types of variables with disjoint ranges. This may matter
in software implementations; for instance, a categorical GLM may ask for a
categorical random variable Z ∈ {1, …, K} (of factor type), whereas $\widehat{Z}$ lies
in the interior of the unit simplex $\Delta_K$.
• For mixture distributions one can replace the latent variables $Z_i$ by their
conditionally expected values $\widehat{Z}_i$, see (6.29)–(6.30). In general, this does not hold
true in EM algorithm applications: in our case we benefit from the fact that $Z_k$
enters the complete log-likelihood linearly, see (6.26). In the general (non-linear)
case of the EM algorithm, different from mixture distribution problems,
one needs to calculate the conditional expectation of the log-likelihood
function.
$$\frac{\partial}{\partial \theta_k} \sum_{i=1}^{n} \frac{Y_i \theta_k - \kappa_k(\theta_k)}{\varphi_k / \big(v_i\, \widehat{Z}^{(t)}_{i,k}\big)} = 0,$$

$$\frac{\partial}{\partial p_k} \sum_{i=1}^{n} \Big(\widehat{Z}^{(t)}_{i,k} \log(p_k) + \widehat{Z}^{(t)}_{i,K} \log(p_K)\Big) = 0,$$

recalling the normalization $p_K = 1 - \sum_{k=1}^{K-1} p_k \in (0,1)$.
From the first score equation we see that we obtain the classical MLE/GLM
framework, and all tools introduced above for parameter estimation can directly
be used. The only part that changes are the weights $v_i \mapsto v_i\, \widehat{Z}^{(t)}_{i,k}$. In the
homogeneous case, i.e., in the null model, we have the MLEs after the t-th iteration of
the EM algorithm
$$\widehat{\theta}_k^{(t)} = h_k\!\left(\frac{\sum_{i=1}^{n} v_i\, \widehat{Z}^{(t)}_{i,k}\, Y_i}{\sum_{i=1}^{n} v_i\, \widehat{Z}^{(t)}_{i,k}}\right), \qquad \widehat{p}_k^{(t)} = \frac{1}{n} \sum_{i=1}^{n} \widehat{Z}^{(t)}_{i,k} \qquad \text{for } 1 \le k \le K. \tag{6.31}$$
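The E-step (6.27) and the null-model M-step (6.31) can be sketched in a small, self-contained toy example; to keep the M-step in closed form we take exponential components (gamma densities with unit shape and weights $v_i \equiv 1$), and all numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

# simulate from a two-component exponential mixture: 70% mean 1, 30% mean 10
n = 5000
z_true = rng.uniform(size=n) < 0.3
y = np.where(z_true, rng.exponential(10.0, size=n), rng.exponential(1.0, size=n))

def em_exponential_mixture(y, iters=100):
    """EM for a mixture of exponential densities f_k(y) = exp(-y/mu_k) / mu_k,
    i.e., gamma components with unit shape, so the M-step is in closed form."""
    mu = np.quantile(y, [0.25, 0.9])       # crude starting configuration
    p = np.array([0.5, 0.5])
    ll_trace = []
    for _ in range(iters):
        # E-step: posterior probabilities of the latent labels, cf. (6.27)
        dens = p * np.exp(-y[:, None] / mu) / mu                  # n x 2 component densities
        z_hat = dens / dens.sum(axis=1, keepdims=True)
        ll_trace.append(float(np.log(dens.sum(axis=1)).sum()))    # incomplete log-likelihood
        # M-step: weighted MLEs in the null model, cf. (6.31)
        p = z_hat.mean(axis=0)
        mu = (z_hat * y[:, None]).sum(axis=0) / z_hat.sum(axis=0)
    return p, mu, ll_trace

p_hat, mu_hat, ll = em_exponential_mixture(y)
# monotonicity (6.38): the incomplete log-likelihood never decreases
assert all(a <= b + 1e-6 for a, b in zip(ll, ll[1:]))
print(p_hat, mu_hat)
```

The assertion at the end checks exactly the monotonicity property (6.38) derived below; with this initialization the estimates land close to the simulation values (0.7, 0.3) and (1, 10).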
In Sect. 6.3.4, below, we will present an example that uses the null model for
the mixture probabilities p, and we present another example that uses a logistic
categorical GLM for these mixture probabilities.
Let $f(y;\, \theta, p) = \exp\{\ell_y(\theta, p)\}$ be the marginal density of Y. This allows us to rewrite the incomplete
log-likelihood as follows, for any value of z,

$$\ell_Y(\theta, p) = \log f(Y;\, \theta, p) = \log \frac{f(Y, z;\, \theta, p)}{f(z \mid Y;\, \theta, p)}.$$

For any density π with the same support as Z we obtain the lower bound

$$\ell_Y(\theta, p) \ \ge\ \sum_z \pi(z) \log \frac{f(Y, z;\, \theta, p)}{\pi(z)} \tag{6.33}$$

$$= \mathbb{E}_{Z\sim\pi}\big[\ell_{(Y,Z)}(\theta, p)\ \big|\ Y\big] - \sum_z \pi(z)\log \pi(z) \ \stackrel{\text{def.}}{=}\ Q(\theta, p;\, \pi).$$
The general idea of the EM algorithm is to make this lower bound $Q(\theta, p;\, \pi)$ as
large as possible in θ, p and π by iterating the following two alternating steps for
t ≥ 1:

$$\widehat{\pi}^{(t)} = \underset{\pi}{\arg\max}\ Q\big(\widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)};\ \pi\big), \tag{6.34}$$

$$\big(\widehat{\theta}^{(t)}, \widehat{p}^{(t)}\big) = \underset{\theta,p}{\arg\max}\ Q\big(\theta, p;\ \widehat{\pi}^{(t)}\big). \tag{6.35}$$
The first step (6.34) can be solved explicitly and it results in the E-step. Namely,
from (6.32) we see that maximizing $Q(\widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)};\, \pi)$ in π is equivalent to
minimizing the KL divergence $D_{\mathrm{KL}}(\pi \| f(\cdot \mid Y;\, \widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)}))$ in π, because the
left-hand side of (6.32) is independent of π. Thus, we have to solve

$$\widehat{\pi}^{(t)} = \underset{\pi}{\arg\max}\ Q\big(\widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)};\ \pi\big) = \underset{\pi}{\arg\min}\ D_{\mathrm{KL}}\Big(\pi\ \Big\|\ f\big(\cdot \mid Y;\ \widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)}\big)\Big).$$

This optimization is solved by choosing the density $\widehat{\pi}^{(t)} = f(\cdot \mid Y;\ \widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)})$,
see Lemma 2.21, and this gives us exactly (6.28) if we calculate the corresponding
conditional expectation of the latent variable Z. Moreover, importantly, this step
provides us with an identity in (6.33):

$$\ell_Y\big(\widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)}\big) = Q\big(\widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)};\ \widehat{\pi}^{(t)}\big). \tag{6.36}$$
The second step (6.35) then increases the right-hand side of (6.36). This second
step is equivalent to

$$\big(\widehat{\theta}^{(t)}, \widehat{p}^{(t)}\big) = \underset{\theta,p}{\arg\max}\ Q\big(\theta, p;\ \widehat{\pi}^{(t)}\big) = \underset{\theta,p}{\arg\max}\ \mathbb{E}_{Z\sim\widehat{\pi}^{(t)}}\big[\ell_{(Y,Z)}(\theta, p)\ \big|\ Y\big], \tag{6.37}$$

and this maximization is solved by the solution of the score equations (6.29)–(6.30)
of the M-step. In this step we explicitly use the linearity in Z of the log-likelihood
$\ell_{(Y,Z)}$, which allows us to calculate the objective function in (6.37) explicitly,
resulting in replacing Z by $\widehat{Z}^{(t)}$. For other incomplete data problems, where we
do not have this linearity, this step will be more complicated.
Summarizing, alternating the optimizations (6.34) and (6.35) gives us a sequence of
parameters $(\widehat{\theta}^{(t)}, \widehat{p}^{(t)})_{t \ge 0}$ with monotonically increasing incomplete log-likelihoods

$$\cdots \le \ell_Y\big(\widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)}\big) \le \ell_Y\big(\widehat{\theta}^{(t)}, \widehat{p}^{(t)}\big) \le \ell_Y\big(\widehat{\theta}^{(t+1)}, \widehat{p}^{(t+1)}\big) \le \cdots. \tag{6.38}$$
Remarks 6.12
• In general, the log-likelihood function (θ , p) → Y (θ , p) does not need to be
bounded. In that case the EM algorithm may not converge (unless it converges
to a local maximum). An illustrative example is given in Example 6.13, below,
which shows what can go wrong in MLE of mixture distributions.
• Even if the log-likelihood function $(\theta, p) \mapsto \ell_Y(\theta, p)$ is bounded, one may
not expect a unique solution of the parameter estimation problem with the EM
algorithm. Firstly, a monotonically increasing sequence (6.38) only guarantees
that we have convergence of that sequence. But the sequence may not converge
to the global maximum, and different starting points of the algorithm need to
be explored. Secondly, convergence of the sequence (6.38) does not necessarily
imply that the parameters $(\widehat{\theta}^{(t)}, \widehat{p}^{(t)})$ converge for t → ∞. On the one hand,
we may have an identifiability issue because the components $f_k$ of the mixture
distribution may be exchangeable; on the other hand, one needs stronger conditions
to ensure that not only the log-likelihoods converge but also their arguments
(parameters) $(\widehat{\theta}^{(t)}, \widehat{p}^{(t)})$. This point is studied in Wu [385].
• Even in very simple examples of mixture distributions we can have multiple local
maxima. In this case the starting point plays a crucial role. It is
advantageous that in the starting configuration every component k shares roughly
the same number of observations for the initial estimates $(\widehat{\theta}^{(0)}, \widehat{p}^{(0)})$ and $\widehat{Z}^{(1)}$,
otherwise one may start in a so-called spurious configuration where only a few
observations almost fully determine a component k of the mixture distribution.
This may result in similar singularities as in Example 6.13, below. Therefore,
there are three common ways to determine a starting configuration of the EM
algorithm, see Miljkovic–Grün [278]: (a) Euclidean distance-based initialization:
cluster centers are selected at random, and all observations are allocated to these
centers according to the shortest Euclidean distance; (b) K-means clustering
allocation; or (c) completely random allocation to K bins. Using one of these
three options, $f_k$ and p are initialized.
• We have formulated the EM algorithm in the homogeneous situation. However,
we can easily expand it to GLMs by, for instance, assuming that the canonical
parameters $\theta_k$ are modeled by linear predictors $\langle \beta_k, x\rangle$, and/or likewise for
the mixture probabilities p. The E-step does not change in this setup. For
the M-step, we solve a different maximization problem; however, this
maximization problem respects monotonicity (6.38), and therefore a modified
version of the above EM algorithm applies. We emphasize that the crucial point
is monotonicity (6.38), which makes the EM algorithm a valid procedure.
In this section we are going to present different mixture distribution examples that
use the EM algorithm for parameter estimation. On the one hand this illustrates the
functioning of the EM algorithm, and on the other hand it also highlights pitfalls
that need to be avoided.
Example 6.13 (Gaussian Mixture) We directly fit a mixture model to the observations
$Y = (Y_1, \ldots, Y_n)^\top$. Assume that the log-likelihood of Y is given by a mixture
of two Gaussian distributions

$$\ell_Y(\theta, \sigma, p) = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{2} p_k\, \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left\{-\frac{1}{2\sigma_k^2}\big(Y_i - \theta_k\big)^2\right\}\right).$$

If we choose any $\widehat{\theta}_2 \in \mathbb{R}$, $p \in \Delta_2$ and $\sigma_2 > 0$, we receive for $\widehat{\theta}_1 = Y_1$

$$\lim_{\sigma_1 \to 0} \ell_Y(\widehat{\theta}, \sigma, p) = \lim_{\sigma_1 \to 0} \log\left(\sum_{k=1}^{2} p_k\, \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left\{-\frac{1}{2\sigma_k^2}\big(Y_1 - \widehat{\theta}_k\big)^2\right\}\right) + \sum_{i=2}^{n} \left(\log \frac{p_2}{\sqrt{2\pi}\,\sigma_2} - \frac{1}{2\sigma_2^2}\big(Y_i - \widehat{\theta}_2\big)^2\right) = \infty.$$
Thus, we can make the log-likelihood of this mixture Gaussian model arbitrarily
large by fitting a degenerate Gaussian model to one observation in one mixture
component, and letting the remaining observations be described by the other mixture
component. This shows that the MLE problem may not be well-posed for mixture
distributions because the log-likelihood can be unbounded.
If the data has well-separated clusters, the log-likelihood of a mixture Gaussian
distribution will have multiple local maxima. One can construct for any given
number B ∈ ℕ a data set Y such that the number of local maxima exceeds this
number B, see Theorem 3 in Améndola et al. [11].
Example 6.14 (Gamma Claim Size Modeling) In this example we consider claim
size modeling for the French MTPL example given in Chap. 13.1. In view of
Fig. 13.15 this seems quite difficult because we have three modes and heavy-tailedness.
We choose a mixture of K = 5 distribution functions, namely, four gamma
distributions and the Lomax distribution

$$Y \sim \sum_{k=1}^{4} p_k\, \frac{\beta_k^{\alpha_k}}{\Gamma(\alpha_k)}\, y^{\alpha_k - 1} \exp\{-\beta_k y\} \ +\ p_5\, \frac{\beta_5}{M} \left(\frac{y + M}{M}\right)^{-(\beta_5 + 1)}, \tag{6.39}$$
Figure 6.11 shows the resulting estimated mixture distribution. It gives the
individual mixture components (top-lhs), the resulting mixture density (top-rhs),
the QQ plot (bottom-lhs) and the log-log plot (bottom-rhs). Overall we find a
rather good fit; maybe the first mode is a bit too spiky. However, this plot may
also be misleading because the empirical density plot relies on kernel smoothing
with a given bandwidth; thus, the true observations may be more spiky than the
plot indicates. The third mode suggests that there are two different values in the
observations around 1'100; this is also visible in the QQ plot. Nevertheless, the
overall result seems satisfactory. These results (based on 13 estimated parameters)
are also summarized in Table 6.2.
We mention a couple of limitations of these results. Firstly, the log-likelihood
of this mixture model is unbounded: similarly to Example 6.13, we can precisely fit
one degenerate gamma mixture component to an individual observation $Y_i$, which
results in an infinite log-likelihood value. Thus, the found solution corresponds
to a local maximum of the log-likelihood function and we should not state AIC
values in Table 6.2, see also Remarks 4.28. Secondly, it is crucial to initialize three
components to the three modes; if we randomly allocate all claims to 5 bins as initial
configuration, the EM algorithm only finds mode Z = 3 but not necessarily the first
two modes, at least, in our specifically chosen random initialization this was the
case. In fact, the likelihood value of this latter solution was worse than in the first
calibration, which shows that we ended up in a worse local maximum.
We may be tempted to also estimate the Lomax threshold M with MLE. In
Fig. 6.12 we plot the maximal log-likelihood as a function of M (always starting the EM
algorithm in the same configuration given in Table 6.1). From this figure a
threshold of M = 1'600 seems optimal. Choosing this threshold of M = 1'600
leads to a slightly bigger log-likelihood of −199'304 and a slightly smaller tail
parameter of $\widehat{\beta}_5^{(100)} = 1.318$. However, overall the model is very similar to the one
with M = 2'000. In general, we do not recommend estimating M with MLE; it
should rather be treated as a hyper-parameter selected by the modeler. The reason for
this recommendation is that this threshold is crucial for large claims
modeling, and its estimation from data is, typically, not very robust; we also refer to
Remarks 6.15, below.
Fig. 6.11 Mixture null model: (top-lhs) individual estimated gamma components $f_k(\cdot;\ \widehat{\alpha}_k^{(100)}, \widehat{\beta}_k^{(100)})$, 1 ≤ k ≤ K, and Lomax component $f_5(\cdot;\ \widehat{\beta}_5^{(100)})$, (top-rhs) estimated mixture density $\sum_{k=1}^{4} \widehat{p}_k^{(100)} f_k(\cdot;\ \widehat{\alpha}_k^{(100)}, \widehat{\beta}_k^{(100)}) + \widehat{p}_5^{(100)} f_5(\cdot;\ \widehat{\beta}_5^{(100)})$, (bottom-lhs) QQ plot of the estimated model, (bottom-rhs) log-log plot of the estimated model
Table 6.2 Mixture models for French MTPL claim size modeling

                               # Param.   ℓ_Y(θ̂, p̂)      AIC       μ̂ = E_{θ̂,p̂}[Y]
  Empirical                                                           2'266
  Null model (M = 2'000)           13     −199'306     398'637       2'381
  Logistic GLM (M = 2'000)        193     −198'404     397'193       2'176
Fig. 6.12 Maximal log-likelihood as a function of the threshold M
we do not consider Density because of the high co-linearity with Area, see
Fig. 13.12 (rhs). Thus, we are left with the features Area, VehAge, DrivAge,
BonusMalus, VehBrand and Region. Pre-processing of these features is done
as in Listing 5.1, except that we keep Area categorical. Using these features
$x \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$ we choose a logistic categorical GLM for the mixture
probabilities

$$x \mapsto \big(p_1(x), \ldots, p_{K-1}(x)\big)^\top = \left(\frac{\exp\langle \gamma_1, x\rangle}{1 + \sum_{l=1}^{4}\exp\langle \gamma_l, x\rangle},\ \ldots,\ \frac{\exp\langle \gamma_4, x\rangle}{1 + \sum_{l=1}^{4}\exp\langle \gamma_l, x\rangle}\right)^\top, \tag{6.40}$$

with regression parameter $\gamma = (\gamma_1^\top, \ldots, \gamma_4^\top)^\top \in \mathbb{R}^{(K-1)(q+1)}$; this regression parameter γ should not be confused with the shape
parameters β1, …, β4 of the gamma components and the tail parameter β5 of the
Lomax component, see (6.39). Note that the notation in this section slightly differs
from Sect. 5.7 on the logistic categorical GLM. In this section we consider mixture
probabilities $p(x) \in \Delta_{K=5}$ (which corresponds to one-hot encoding), whereas
in Sect. 5.7 we model $(p_1(x), \ldots, p_{K-1}(x))^\top$ with a categorical GLM (which
corresponds to dummy coding), and normalization provides us with $p_K(x) = 1 - \sum_{l=1}^{K-1} p_l(x) \in (0,1)$.
This logistic categorical GLM requires that we replace in the M-step
the probability estimation (6.31) by Fisher's scoring method for GLMs as
outlined in Sect. 5.7.2, but there is a small difference to that section. In the
working residuals (5.74) we use the dummy coding $T(Z) \in \{0,1\}^{K-1}$ of a
categorical variable Z; this now needs to be replaced by the estimated vector
$(\widehat{Z}_1(\theta, p \mid Y), \ldots, \widehat{Z}_{K-1}(\theta, p \mid Y))^\top \in (0,1)^{K-1}$, which is used as an estimate
for the latent variable T(Z). Apart from that, everything is done as described in
Sect. 5.7.2; in R this can be done with the procedure multinom from the package
nnet [368]. We start the EM algorithm exactly in the final configuration of the
Table 6.3 Parameter choices in the mixture models: upper part null model, lower part GLM for estimated mixture probabilities p̂(x_i)

                                             k=1       k=2        k=3       k=4      k=5
  Null: p̂_k^(100)                           0.04      0.03       0.42      0.25     0.26
  Null: α̂_k^(100)                          93.05    650.94   1'040.37      1.34     –
  Null: β̂_k^(100)                          1.207     1.108      0.888     0.001    1.416
  Null: μ̂_k^(100) = α̂_k^(100)/β̂_k^(100)     77       588      1'172     1'304     –
  GLM: average mixture probabilities         0.04      0.03       0.42      0.25     0.26
  GLM: α̂_k^(100)                           94.03    597.20   1'043.38      1.28     –
  GLM: β̂_k^(100)                           1.223     1.019      0.891     0.001    1.365
  GLM: μ̂_k^(100) = α̂_k^(100)/β̂_k^(100)      77       586      1'172     1'268     –
estimated mixture null model, and we run this algorithm for 20 iterations (which
provides convergence).
The resulting parameters are given in the lower part of Table 6.3. We observe that
the resulting parameters remain essentially the same, the second mode Z = 2 is a
bit less spiky, and the tail parameter is slightly smaller. The summary of this model
is given on the last line of Table 6.2. Regression modeling adds another 4 · 45 = 180
parameters to the model because we have q = 45 feature components in x (not counting
the intercept component). In view of AIC we give preference to the logistic
mixture probability case (though AIC has to be interpreted with care, here, because
we do not consider the MLE but rather a local maximum).
Figure 6.13 plots the individual estimated mixture probabilities $x_i \mapsto \widehat{p}(x_i) \in \Delta_5$
over the insurance policies 1 ≤ i ≤ n; these plots are inspired by the thesis of
Frei [138]. The upper plots consider these probabilities against the estimated claim
sizes $\widehat{\mu}(x_i) = \sum_{k=1}^{5} \widehat{p}_k(x_i)\, \widehat{\mu}_k$, and the lower plots against the ranks of $\widehat{\mu}(x_i)$; the
latter gives a different scaling on the x-axis because of the heavy-tailedness of the
claims. The plots on the left-hand side show all individual policies 1 ≤ i ≤ n, and
the plots on the right-hand side show a quadratic spline fit to these observations. Not
surprisingly, we observe that the claim size estimate $\widehat{\mu}(x_i)$ is mainly driven by the
large claims probability $\widehat{p}_5(x_i)$ describing the Lomax contribution.
In Fig. 6.14 we compare the QQ plots of the mixture null model and the one
where we model the mixture probabilities with the logistic categorical GLM. We
see that the latter (more complex) model clearly outperforms the simpler one;
in fact, this QQ plot looks quite convincing for the French MTPL claim size data.
Finally, we perform a Wald test (5.32). We simultaneously treat all parameters that
belong to the same feature variable (similar to the ANOVA analysis); for instance,
for the 22 Regions the corresponding part of the regression parameter γ contains
4 · 21 = 84 components. The resulting p-values of dropping such components are
all close to 0, which says that we should not eliminate any of the feature variables.
This closes the example.
Fig. 6.13 Individual estimated mixture probabilities $\widehat{p}(x_i) = (\widehat{p}_1(x_i), \ldots, \widehat{p}_5(x_i))$: (top) against the estimated means $\widehat{\mu}(x_i)$, (bottom) against the ranks of the estimated means; (lhs) individual insurance policies, (rhs) spline fit
Remarks 6.15
• In Example 6.14 we have chosen a mixture distribution with four gamma
components and one Lomax component. The reason for choosing the Lomax
component has been two-fold. Firstly, we need a regularly varying tail to
model the heavy-tailed property of the data. Secondly, we have preferred the
Lomax distribution over the Pareto distribution because this provides us with a
continuous density in (6.39). The results in Example 6.14 have been satisfactory.
In many practical applications, however, this approach will not work, even when
fixing the threshold M of the Lomax component. Often, the nature of the data
is such that the chosen gamma mixture distribution is not able to fully explain
the small claims in the body of the distribution, and in that situation the Lomax tail
will assist in fitting the small claims. The typical result is that the Lomax part
Fig. 6.14 QQ plots of the mixture models: (lhs) null model and (rhs) logistic categorical GLM for
mixture probabilities
then pays more attention to small claims (through the log-likelihood function of
numerous small claims) and the fitting of the tail turns out to be poor (because
a few large claims do not sufficiently contribute to the log-likelihood). There are
two ways to solve this dilemma. Either one works with composite distributions,
see (6.56) below, and one drops the continuity property of the density; this is the
approach taken in Fung et al. [148]. Or one fits the Lomax distribution solely
to large observations in a first step, and then fixes the parameters of the Lomax
distribution during the second step when fitting the full model to all data, this
is the approach taken in Frei [138]. Both of these two approaches have been
providing good results on real insurance data.
• There is an asymptotic theory for the optimal selection of the number of
mixture components; we refer to Khalili–Chen [214] and Khalili [213]. Fung et
al. [148] combine this asymptotic theory of mixture component selection with
feature selection within these mixture components using LASSO and SCAD
regularization.
• In Example 6.14 we have only modeled the mixture probabilities in a feature-dependent
way, but not the parameters of the gamma mixture components. Introducing
regressions for the gamma mixture components needs some care in fitting. For
policy-independent shape parameters α1, …, α4, we can estimate the regression
functions for the means of the mixture components without explicitly specifying
$\alpha_k$, because these shape parameters cancel in the score equations. However, these
shape parameters are needed in the E-step, which also requires MLE of $\alpha_k$.
For more discussion on shape parameter estimation we refer to Sect. 5.3.7 (GLM
with constant shape parameter) and Sect. 5.5.4 (double GLM).
For a lower truncation point τ, the lower-truncated density is given by

$$f_{(\tau,\infty)}(y;\, \theta) = \frac{f(y;\, \theta)\, \mathbb{1}_{\{y > \tau\}}}{1 - F(\tau;\, \theta)}, \tag{6.41}$$
Fig. 6.15 (lhs) Lower-truncated gamma density with τ = 2'000, and (rhs) lower- and upper-truncated gamma density with truncation points 2'000 and 6'000
6.4 Truncated and Censored Data 249
Fig. 6.16 (lhs) Right-censored gamma distribution with M = 6'000, and (rhs) left- and right-censored gamma distribution with censoring points 2'000 and 6'000
that is, we have a point mass in the censoring point M. We can define left-censoring
analogously by considering the claim Y ∨ M = max{Y, M}. Figure 6.16 (lhs) shows
a right-censored gamma distribution with censoring point M = 6 000, and Fig. 6.16
(rhs) shows a left- and right-censored example with censoring points 2 000 and
6 000.
Often in re-insurance, deductibles (also called retention levels) and maximal
covers are combined; for instance, an excess-of-loss (XL) insurance cover of size
u > 0 above the retention level d > 0 covers the claim

$$\min\big\{(Y - d)_+,\ u\big\}.$$
We interpret this as an incomplete data problem because the claim sizes $Y_i$ above
the censoring point M are not known. The complete log-likelihood is given by

$$\ell_Y(\theta) = \sum_{i=1}^{n} \log f(Y_i;\, \theta_i, v_i/\varphi).$$

On the event {Y < M} we have

$$\ell_{Y \wedge M}(\theta) = \ell_Y(\theta) = \frac{Y\theta - \kappa(\theta)}{\varphi/v} + a(Y;\, v/\varphi); \tag{6.43}$$
the latter follows because Y ∧M = M has the corresponding point mass in censoring
point M (we work with an absolutely continuous EDF here). Choose an arbitrary
density π having the same support as Y |{Y ≥M} , and consider a random variable
Z ∼ π. Using (6.44) and the EDF structure on the last line, we have for Y ≥ M
ℓ_{Y∧M}(θ) = ∫ π(z) ℓ_{Y∧M}(θ) dν(z)
           = ∫ π(z) log[ (f(z; θ, v/ϕ)/π(z)) / (f(z|Y ≥ M; θ, v/ϕ)/π(z)) ] dν(z)
           = ∫ π(z) log( f(z; θ, v/ϕ)/π(z) ) dν(z) + D_KL( π ‖ f(·|Y ≥ M; θ, v/ϕ) )
           ≥ ∫ π(z) log( f(z; θ, v/ϕ)/π(z) ) dν(z)
           = (E_π[Z] θ − κ(θ))/(ϕ/v) + E_π[a(Z; v/ϕ)] − E_π[log π(Z)]  =:  Q(θ; π).
This allows us to explore the E-step and the M-step similarly to (6.34) and (6.35).
The E-step in the case Y ≥ M for given canonical parameter estimate θ (t −1)
reads as
π̂^{(t)} = arg max_π Q(θ̂^{(t−1)}; π) = arg min_π D_KL( π ‖ f(·|Y ≥ M; θ̂^{(t−1)}, v/ϕ) )
        = f(·|Y ≥ M; θ̂^{(t−1)}, v/ϕ).
This allows us to calculate the estimate of the claim size above M, i.e., under π̂^{(t)}

Ŷ^{(t)} = E_{π̂^{(t)}}[Z] = ∫ z f(z|Y ≥ M; θ̂^{(t−1)}, v/ϕ) dν(z).    (6.45)
Note that this is an estimate of the censored claim Y |{Y ≥M} . This completes the
E-step.
The M-step considers in the EDF case for censored claim sizes Y ≥ M
θ̂^{(t)} = arg max_θ Q(θ; π̂^{(t)}) = arg max_θ (E_{π̂^{(t)}}[Z] θ − κ(θ))/(ϕ/v)
        = arg max_θ ℓ_{Ŷ^{(t)}}(θ),    (6.46)
the latter uses that the normalizing term a(·; v/ϕ) is not relevant for the MLE of
θ . That is, (6.46) describes the regular MLE step under the observation Y (t ) in the
case of a censored observation Y ≥ M; and if Y < M we simply use the log-
likelihood (6.43).
θ̂^{(t)} = arg max_θ ℓ_{Y^{(t)}}(θ).
Note that the above EM algorithm uses that the log-likelihood Y (θ) of the EDF
is linear in the observations that interact with parameter θ . We revisit the gamma
claim size example of Sect. 5.3.7.
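To make the E-step (6.45) and M-step (6.46) concrete, consider the special case of an exponential null model (unit shape, no regression structure): by memorylessness, E[Y | Y ≥ M] = M + μ. The following sketch iterates this recursion on hypothetical toy data; all parameter values and data are illustrative assumptions, not taken from the text.

```python
# EM fixed-point iteration for right-censored exponential claims (null model).
# E-step (6.45): by memorylessness, E[Y | Y >= M] = M + mu.
# M-step (6.46): plain MLE of the mean on the completed observations.

def em_right_censored_exponential(uncensored, n_censored, M, n_iter=200):
    n = len(uncensored) + n_censored
    mu = (sum(uncensored) + n_censored * M) / n   # crude init: treat M as exact
    for _ in range(n_iter):
        y_hat = M + mu                                    # E-step
        mu = (sum(uncensored) + n_censored * y_hat) / n   # M-step
    return mu

mu_hat = em_right_censored_exponential([1.0, 2.0, 3.0], n_censored=2, M=4.0)
```

The fixed point satisfies n·μ = Σ uncensored + c·(M + μ), i.e., μ̂ = (Σ uncensored + c·M)/(n − c), which is the well-known closed-form MLE for right-censored exponential data.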
Example 6.16 (Right-Censored Gamma Claim Sizes) We revisit the gamma claim
size GLM introduced in Sect. 5.3.7. The claim sizes are illustrated in Fig. 13.22. In
total we have n = 656 observations Yi , and they range from 16 SEK to 211’254
SEK. We right-censor this data at M = 50 000, which results in 545 uncensored
observations and 111 censored observations equal to M. Thus, for the 17% largest
claims we assume to not have any knowledge about the exact claim sizes. We use
the EM algorithm for right-censored data to fit a GLM to this problem.
In order to calculate the E-step we need to evaluate the conditional expecta-
tion (6.45) under the gamma model
Ŷ^{(t)} = ∫ z f(z|Y ≥ M; θ̂^{(t−1)}, v/ϕ) dν(z)    (6.47)
        = ∫_M^∞ z (β^α / Γ(α)) z^{α−1} exp{−βz} / (1 − G(α, βM)) dz
        = (α/β) · (1 − G(α+1, βM)) / (1 − G(α, βM)),
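The truncated first moment in (6.47) only requires the regularized lower incomplete gamma function G (in R this is pgamma). A minimal Python sketch evaluates G via its power series and plugs it into (6.47); parameter values are hypothetical, and the series form assumes a moderate argument βM.

```python
import math

def reg_lower_gamma(a, x, terms=300):
    """Regularized lower incomplete gamma function G(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def censored_e_step(alpha, beta, M):
    """E[Z | Z >= M] for a gamma(alpha, beta) claim, see (6.47)."""
    num = 1.0 - reg_lower_gamma(alpha + 1.0, beta * M)
    den = 1.0 - reg_lower_gamma(alpha, beta * M)
    return (alpha / beta) * num / den
```

For shape α = 1 (the exponential case) the formula collapses to M + 1/β, which provides a simple sanity check.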
Table 6.4 Comparison of the complete log-likelihood and the incomplete log-likelihood (right-
censoring M = 50 000) results

                                 # Param.  ℓ_Y(θ̂^MLE, ϕ̂^MLE)  Dispersion est. ϕ̂^MLE  Average amount  Rel. change
Gamma GLM2 (complete data)       7+1       −7 129               1.427                   25’130
Crude GLM2 (right-censored)      7+1       −7 158                                       18’068          −28%
EM est. GLM2 (right-censored)    7+1       −7 132                                       26’687          +6%
The dispersion parameter ϕ̂^MLE = 1.427 is taken from model Gamma GLM2 and is
kept fixed in all models studied in this example. In a first step we simply fit a
gamma GLM to the right-censored data
Yi ∧ M. We call this model ‘crude GLM2’, and it underestimates the empirical
claim sizes by 28% because it ignores the fact of having right-censored data.
To initialize the EM algorithm for right-censored data we use the model crude
GLM2. We then iterate the algorithm for 15 steps which provides convergence. The
results are presented in Table 6.4. We observe that the resulting log-likelihood of
the model fitted on the censored data and evaluated on the complete data Y (which
is available here) is almost the same as for model Gamma GLM2, which has been
estimated on the complete data. Moreover, this right-censored EM algorithm fitted
model slightly over-estimates the average claim sizes.
Figure 6.17 shows the estimated means μi on an individual claims level. The
x-axis always gives the estimates from the complete log-likelihood model Gamma
GLM2. The y-axis on the left-hand side shows the estimates from the crude GLM
and the right-hand side the estimates from the EM algorithm fitted counterpart (fitted
on the right-censored data). We observe that the crude model underestimates the
claims (being below the diagonal), and the largest estimate lies below M = 50 000
Fig. 6.17 Comparison of the estimated means μ̂_i in model Gamma GLM2 against (lhs) the crude
GLM and (rhs) the EM fitted right-censored model; both axes are on the log-scale, the dotted line
shows the censoring point log(M)
in our example (horizontal dotted line). The EM algorithm fitted model, considering
the fact that we have right-censored data, corrects for the censoring, and the resulting
estimates resemble the ones from the complete log-likelihood model quite well.
In fact, we probably slightly over-estimate under right-censoring, here. Note that
all these considerations have been done under an identical dispersion parameter
estimate ϕ MLE . For the complete log-likelihood case, this is not really needed for
mean estimation because it cancels in the score equations for mean estimation.
However, a reasonable dispersion parameter estimate is crucial for the incomplete
case as it enters Ŷ^{(t)} in the E-step, see (6.47). Thus, the caveat here is that we need
a reasonable dispersion estimate from the right-censored data (which we have not
discussed here, and which requires further research).
Compared to censoring we have less information under truncation because not only
the claim sizes below the lower-truncation point are unknown, but we also do not
know how many claims there are below that truncation point τ . Assume we work
with responses belonging to the EDF. The incomplete log-likelihood is given by
ℓ_{Y>τ}(θ) = Σ_{i=1}^n [ log f(Y_i; θ_i, v_i/ϕ) − log(1 − F(τ; θ_i, v_i/ϕ)) ],
assuming that Y = (Yi )1≤i≤n > τ collects all claims above the truncation point
Yi > τ , see (6.41). We proceed as in Fung et al. [147] to construct a complete
log-likelihood; there are different ways to do so, but this proposal is convenient
for parameter estimation. Firstly, we equip each observed claim Yi > τ with an
independent count random variable Ki ∼ p(·; θi , vi /ϕ) that determines the number
of claims below the truncation point that correspond to claim i above the truncation
point. Secondly, we assume that these claims are given by independent observations
Zi,1 , . . . , Zi,Ki ≤ τ , a.s., with a distribution obtained from an un-truncated version
of Yi , i.e., we consider the upper-truncated version of f (·; θi , vi /ϕ) for Zi,j . This
gives us the complete log-likelihood
ℓ_{(Y,K,Z)}(θ) = Σ_{i=1}^n [ log( f(Y_i; θ_i, v_i/ϕ) / (1 − F(τ; θ_i, v_i/ϕ)) )    (6.49)
               + log p(K_i; θ_i, v_i/ϕ) + Σ_{j=1}^{K_i} log( f(Z_{i,j}; θ_i, v_i/ϕ) / F(τ; θ_i, v_i/ϕ) ) ],
with K = (K_i)_{1≤i≤n}, and Z collects all (latent) claims Z_{i,j} ≤ τ; an empty sum is
set equal to zero. Next, we assume that K_i follows the geometric distribution
Within the EDF this allows us to do the same EM algorithm considerations as above;
note that this expression no longer involves the distribution function. We consider
one observation Yi > τ and we drop the lower index i. This gives us complete
observation (Y, K, Z = (Zj )1≤j ≤K ) and conditional density
⋯ =: Q(θ; π),
where the second last identity uses that the log-likelihood (6.51) has a simple form
under the geometric distribution chosen for K; this is exactly the step where we
benefit from this specific choice of the probability extension below the truncation
point. There is a subtle point here. Namely, Y >τ (θ ) is the log-likelihood of the
lower-truncated datum Y > τ , whereas log f (Y ; θ, v/ϕ) is the log-likelihood not
using any lower-truncation.
The E-step for given canonical parameter estimate θ (t −1) reads as
π̂^{(t)} = arg max_π Q(θ̂^{(t−1)}; π) = arg min_π D_KL( π ‖ f(·|Y; θ̂^{(t−1)}, v/ϕ) )
        = f(·|Y; θ̂^{(t−1)}, v/ϕ),

that is, for (k, z_1, …, z_k),

π̂^{(t)}(k, z_1, …, z_k) = p(k; θ̂^{(t−1)}, v/ϕ) ∏_{j=1}^{k} [ f(z_j; θ̂^{(t−1)}, v/ϕ) / F(τ; θ̂^{(t−1)}, v/ϕ) ].
The latter describes a compound distribution for Σ_{j=1}^K Z_j with a geometric count
random variable K and i.i.d. random variables Z_1, Z_2, …, having upper-truncated
densities f_{(−∞,τ]}(·; θ̂^{(t−1)}, v/ϕ). This allows us to calculate the expected compound
claim below the truncation point

Ŷ^{(t)}_{≤τ} = E_{π̂^{(t)}}[ Σ_{j=1}^K Z_j ] = E_{π̂^{(t)}}[K] · E_{π̂^{(t)}}[Z_1]
            = ( F(τ; θ̂^{(t−1)}, v/ϕ) / (1 − F(τ; θ̂^{(t−1)}, v/ϕ)) ) ∫ z f_{(−∞,τ]}(z; θ̂^{(t−1)}, v/ϕ) dν(z).
That is, the M-step applies the classical MLE step, we only need to change weights
and observations
v → v^{(t)} = v (1 + E_{π̂^{(t)}}[K]) = v / (1 − F(τ; θ̂^{(t−1)}, v/ϕ)),

Y → Y^{(t)} = (Y + Ŷ^{(t)}_{≤τ}) / (1 + E_{π̂^{(t)}}[K]) = (Y + E_{π̂^{(t)}}[K] E_{π̂^{(t)}}[Z_1]) / (1 + E_{π̂^{(t)}}[K]).
Note that this uses the specific structure of the EDF, in particular, we benefit from
linearity here which allows for closed-form solutions.
v_i^{(t)} = v_i (1 + K̂_i^{(t)})    and    Y_i^{(t)} = (Y_i + K̂_i^{(t)} Ẑ_{i,1}^{(t)}) / (1 + K̂_i^{(t)}),

with K̂_i^{(t)} = E_{π̂^{(t)}}[K_i] and Ẑ_{i,1}^{(t)} = E_{π̂^{(t)}}[Z_{i,1}].
θ̂^{(t)} = arg max_θ ℓ_{Y^{(t)}}(θ; v^{(t)}/ϕ) = arg max_θ Σ_{i=1}^n log f(Y_i^{(t)}; θ_i, v_i^{(t)}/ϕ).
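For an exponential null model the E-step quantities have closed forms (E[K] = F(τ)/(1 − F(τ)) and the mean of the upper-truncated distribution), so the weight-and-observation recursion above can be sketched directly; the data below are hypothetical toy values, not taken from the text.

```python
import math

# EM iteration for lower-truncated exponential claims Y_i > tau (null model).
# E-step: K_hat = F(tau)/(1-F(tau)), Z_hat = E[Z | Z <= tau].
# M-step: weighted exponential MLE with v^(t) = v(1+K_hat) and
#         Y^(t) = (Y + K_hat*Z_hat)/(1+K_hat).

def em_lower_truncated_exponential(y, tau, n_iter=500):
    mu = sum(y) / len(y)                         # crude init ignoring truncation
    for _ in range(n_iter):
        F = 1.0 - math.exp(-tau / mu)            # truncation probability
        k_hat = F / (1.0 - F)                    # expected claims below tau
        z_hat = mu - tau * math.exp(-tau / mu) / F   # mean of upper-truncated part
        num = sum(yi + k_hat * z_hat for yi in y)    # sum_i v_i^(t) * Y_i^(t)
        den = len(y) * (1.0 + k_hat)                 # sum_i v_i^(t)
        mu = num / den                               # weighted MLE (M-step)
    return mu

mu_hat = em_lower_truncated_exponential([2.0, 3.0, 7.0], tau=1.0)
```

By memorylessness, the truncated-data MLE of the exponential mean is mean(Y) − τ, and the iteration converges to exactly this value.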
Remarks 6.17 Essentially, the above algorithm uses that the MLE in the EDF is
based on a sufficient statistic of the observations, and in our case this sufficient
statistic is Y_i^{(t)}.
Example 6.18 (Lower-Truncated Claim Sizes) We revisit the gamma claim size
GLM introduced in Sect. 5.3.7, see also Example 6.16 on right-censored claims. We
choose as lower-truncation point τ = 1 000, i.e., we get rid of the very small claims
that mainly generate administrative expenses at a rather small claim compensation.
We have 70 claims below this truncation point, and there remain n = 586 claims
above the truncation point that can be used for model fitting in the lower-truncated
case. We use the EM algorithm for lower-truncated data to fit a GLM to this problem.
In order to calculate the E-step we need to evaluate the conditional expecta-
tion (6.52) under the gamma model for truncation probability
F(τ; θ̂^{(t−1)}, v/ϕ) = ∫_0^τ (β^α / Γ(α)) z^{α−1} exp{−βz} dz = G(α, βτ),
For the modeling we choose again the features as used for model Gamma GLM2,
this gives q +1 = 7 regression parameter components and additionally we set for the
dispersion parameter ϕ MLE = 1.427. This dispersion parameter we keep fixed in all
the models studied in this example. In a first step we simply fit a gamma GLM to the
lower-truncated data Yi > τ . We call this model ‘crude GLM2’, and it overestimates
the true claim sizes because it ignores the fact of having lower-truncated data.
To initialize the EM algorithm for lower-truncated data we use the model crude
GLM2. We then iterate the algorithm for 10 steps which provides convergence.
The results are presented in Table 6.5. We observe that the resulting log-likelihood
fitted on the lower-truncated data and evaluated on the complete data Y (which is
available here) is the same as for model Gamma GLM2 which has been estimated
on the complete data. Moreover, this lower-truncated EM algorithm fitted model
slightly under-estimates the average claim sizes.
Figure 6.18 shows the estimated means μi on an individual claims level. The
x-axis always gives the estimates from the complete log-likelihood model Gamma
GLM2. The y-axis on the left-hand side shows the estimates from the crude GLM
and the right-hand side the estimates from the EM algorithm fitted counterpart
(fitted on the lower-truncated data). We observe that the crude model overestimates
Table 6.5 Comparison of the complete log-likelihood and the incomplete log-likelihood (lower-
truncation τ = 1 000) results

                                 # Param.  ℓ_Y(θ̂^MLE, ϕ̂^MLE)  Dispersion est. ϕ̂^MLE  Average amount  Rel. change
Gamma GLM2 (complete data)       7+1       −7 129               1.427                   25’130
Crude GLM2 (lower-truncated)     7+1       −7 133                                       26’879          +7%
EM est. GLM2 (lower-truncated)   7+1       −7 129                                       24’900          −1%
Fig. 6.18 Comparison of the estimated means μ̂_i in model Gamma GLM2 against (lhs) the crude
GLM and (rhs) the EM fitted lower-truncated model; both axes are on the log-scale (claims above
and below the truncation point are marked separately)
the claims (being above the orange diagonal), in particular, this applies to claims
with lower expected claim amounts. The EM algorithm fitted model, considering
the fact that we have lower-truncated data, corrects for the truncation, and the
resulting estimates almost completely coincide with the ones from the complete log-
likelihood model. Again we remark that we use an identical dispersion parameter
estimate ϕ MLE , and it is an open problem to select a reasonable value from lower-
truncated data.
Example 6.19 (Zero-Truncated Claim Counts and the Hurdle Poisson Model) In
Sect. 5.3.6, we have been studying the ZIP model that has assigned an additional
probability weight to the event {N = 0} of having zero claims. This model can
be understood as a hierarchical model with a latent variable Z indicating whether
we have an excess zero claim or not, see (5.41). In that situation we have a
mixture distribution of a Poisson distribution and a degenerate distribution. Fitting
in Example 5.25 has been done brute force by using a general purpose optimizer,
but we could also use the EM algorithm for mixture distributions.
An alternative way of modeling excess zeros is the hurdle approach which
combines a lower-truncated count distribution with a point mass in zero. For the
Poisson case this reads as, see (5.42),
f_hurdle Poisson(k; λ, v, π_0) = { π_0                                              for k = 0,    (6.53)
                                 { (1 − π_0) e^{−vλ} (vλ)^k / ( k! (1 − e^{−vλ}) )  for k ∈ N,
the ZTP model N > 0 is given by (we consider one single component only and drop
the lower index in the notation)
In view of (6.51), this gives us complete log-likelihood (note that Zj = 0 for all j )
ℓ_{(N,K,Z)}(θ) = Nθ − ve^θ − log(N!) + N log(v) + Σ_{j=1}^K [ Z_j θ − ve^θ − log(Z_j!) + Z_j log(v) ].
We can now directly apply a simplified version of the EM algorithm for lower-
truncated data. For the E-step we have, given parameter
θ̂^{(t−1)},

K̂^{(t)} = P_{θ̂^{(t−1)}}[N = 0] / (1 − P_{θ̂^{(t−1)}}[N = 0]) = e^{−v e^{θ̂^{(t−1)}}} / (1 − e^{−v e^{θ̂^{(t−1)}}})    and    Ẑ^{(t)} = 0.

This provides us with the estimated weights and observations (set Y = N/v)

v^{(t)} = v (1 + K̂^{(t)}) = v / (1 − e^{−v e^{θ̂^{(t−1)}}})    and    Ŷ^{(t)} = Y / (1 + K̂^{(t)}) = N / v^{(t)}.    (6.55)
Thus, the EM algorithm iterates Poisson MLEs, and the E-Step modifies the weights
v (t ) in each step of the loop correspondingly. We remark that the ZTP model
has an EF representation which allows one to directly estimate the corresponding
parameters without using the EM algorithm, see Remark 6.20, below.
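For a Poisson null model with unit exposures, the weight adjustment (6.55) reduces to a one-dimensional fixed-point iteration; the sketch below uses hypothetical claim counts N_i > 0 for illustration.

```python
import math

# EM weight iteration (6.55) for a zero-truncated Poisson (ZTP) null model
# with unit exposures v_i = 1: the M-step is a weighted Poisson MLE, hence
# lambda^(t) = sum(N_i) / sum(v^(t)) with v^(t) = 1/(1 - exp(-lambda^(t-1))).

def em_ztp_null(counts, n_iter=200):
    lam = sum(counts) / len(counts)              # init: plain Poisson MLE
    for _ in range(n_iter):
        v_t = 1.0 / (1.0 - math.exp(-lam))       # adjusted weight, see (6.55)
        lam = sum(counts) / (len(counts) * v_t)  # weighted Poisson MLE
    return lam

lam_hat = em_ztp_null([1, 1, 1, 2, 3])
```

At the fixed point, λ̂/(1 − e^{−λ̂}) equals the sample mean, i.e., the fitted ZTP mean matches the data, which is precisely the ZTP maximum likelihood equation.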
We revisit the French MTPL claim frequency data, and, in particular, we use
model Poisson GLM3 as a benchmark, we refer to Tables 5.5 and 5.10. The feature
engineering is done exactly as in model Poisson GLM3. We then select only the
insurance policies from the learning data L that have suffered at least one claim, i.e.,
Ni > 0. These are m = 22 434 out of n = 610 206 insurance policies. Thus, we
only consider m/n = 3.68% of all insurance policies, and we fit the lower-truncated
log-likelihood (ZTP model) to this data
ℓ_{N>0}(β) = Σ_{i=1}^m [ N_i θ_i − v_i e^{θ_i} − log(N_i!) + N_i log(v_i) − log(1 − e^{−v_i e^{θ_i}}) ],
Fig. 6.19 (lhs) Convergence of the EM algorithm for the lower-truncated data in the Poisson
hurdle case; (rhs) canonical parameters of the Poisson GLMs fitted on all data L vs. fitted only
on policies with N_i > 0
Table 6.6 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses
(units are in 10^{−2}) and in-sample average frequency of the Poisson null model and the Poisson,
negative-binomial, ZIP and hurdle Poisson GLMs

                             Run time  # Param.  AIC      In-sample loss on L  Out-of-sample loss on T  Aver. freq.
Poisson null                 –         1         199’506  25.213               25.445                   7.36%
Poisson GLM3                 15 s      50        192’716  24.084               24.102                   7.36%
NB GLM3 (α̂_NB^MLE = 1.810)  85 s      51        192’113  20.722               20.674                   7.38%
ZIP GLM3 (null π_0)          270 s     51        192’393  –                    –                        7.37%
Hurdle Poisson GLM3          300 s     100       191’851  –                    –                        7.39%
where 1 ≤ i ≤ m runs over all insurance policies with at least one claim and where
the canonical parameter θ_i is given by the linear predictor θ_i = ⟨β, x_i⟩. We fit this
model using the EM algorithm for lower-truncated data. In each loop this requires
that the offset o_i^{(t)} = log(v_i^{(t)}) is adjusted according to (6.55); for the discussion of
offsets we refer to Sect. 5.2.3. Convergence of the EM algorithm is achieved after
roughly 75 iterations, see Fig. 6.19 (lhs).
In our first analysis we do not consider the Poisson hurdle model, but we simply
consider model Poisson GLM3. However, this Poisson model with regression
parameter β is fitted only on the data Ni > 0 (exactly using the results of the
EM algorithm for lower-truncated data Ni > 0). The resulting predictive model is
presented in Table 6.7. We observe that model Poisson GLM3 that is only fitted on
the data Ni > 0 is clearly not competitive, i.e., we cannot simply extrapolate this
estimated model to {Ni = 0}. This extrapolation results in a Poisson GLM that has
a much too large average frequency of 15.11%, see last column of Table 6.7; this
bias can clearly be seen in Fig. 6.19 (rhs) where we compare the two fits. From
this we conclude that either the Poisson model assumption in general does not
Table 6.7 Number of parameters, in-sample and out-of-sample deviance losses on all data
(units are in 10^{−2}), out-of-sample lower-truncated log-likelihood ℓ_{N>0} and in-sample average
frequency of the Poisson null model and model Poisson GLM3 fitted on all data L and fitted on
the data N_i > 0 only

                                  # Param.  In-sample loss on L  Out-of-sample loss on T  ℓ_{N>0}   Aver. freq.
Poisson null                      1         25.213               25.445                   –         7.36%
Poisson GLM3 fitted on all data   50        24.084               24.102                   −0.2278   7.36%
Poisson GLM3 fitted on N_i > 0    50        28.064               28.211                   −0.2195   15.11%
match the data, or that we have excess zeros (which do not influence the estimation
procedure if we only consider the policies with at least one claim). Let us compare
the lower-truncated log-likelihood N>0 out-of-sample only on the policies with at
least one claim (ZTP model). We observe that the EM fitted model provides a better
description of the data, as we have a bigger log-likelihood than the model fitted on
all data L (i.e. −0.2195 vs. −0.2278 for the ZTP log-likelihood). Thus, the lower-
truncated fitting procedure finds a better model on {Ni > 0} when only fitted on
these lower-truncated claim counts.
This analysis concludes that we need to fit the full hurdle Poisson model (6.53).
That is, we cannot simply extrapolate the model fitted on the ZTP log-likelihood
ℓ_{N>0} because, typically, π_0(x_i) ≠ exp{−v_i e^{⟨β,x_i⟩}}, the latter coming from the
Poisson GLM with regression parameter β. We model the zero claim probability
π0 (x i ) by the logistic Bernoulli GLM indicating whether we have claims or not.
We set up the logistic GLM for p(x i ) = 1 − π0 (x i ) of describing the indicator
Yi = 1{Ni >0} of having claims. The difficulty compared to the Poisson model is that
we cannot easily integrate the time exposure vi as a pro rata temporis variable like
in the Poisson case. We therefore make the following considerations. The canonical
link in the logistic Bernoulli GLM is the logit function p → logit(p) = log(p/(1 −
p)) = log(p) − log(1 − p) for p ∈ (0, 1). Typically, in our application, p ≪ 1 is
fairly small because claims are rare events. This implies log(p/(1 − p)) ≈ log(p),
i.e., the logit link behaves similarly to the log-link for small default probabilities p.
This motivates to integrate the logged exposures log vi as offsets into the logistic
probabilities. That is, we make the following model assumption
Table 6.8 Contingency table of the observed numbers of policies against predicted numbers of
policies with given claim counts ClaimNb (in-sample)

                                             Numbers of claims ClaimNb
                                             0        1       2      3    4   5
Observed number of policies                  587’772  21’198  1’174  57   4   1
Poisson predicted number of policies         587’325  22’064  779    34   3   0.3
NB predicted number of policies              587’902  20’982  1’200  100  15  4
ZIP predicted number of policies             587’829  21’094  1’191  79   9   4
Hurdle Poisson predicted number of policies  587’772  21’119  1’233  76   6   1
where β̂* ∈ R^{q+1} is the regression parameter of the logistic Bernoulli GLM,
and where μ̂(x_i, v_i) = v_i exp⟨β̂, x_i⟩ is the Poisson GLM estimated with the
EM algorithm on the lower-truncated data Ni > 0 (ZTP model). The results are
presented in Table 6.6.
Table 6.6 compares the hurdle Poisson model to the approaches studied in
Table 5.10. Firstly, fitting the hurdle Poisson model is more time intensive: the EM
algorithm takes some time, and we need to fit the Bernoulli logistic GLM, which
is of a similar complexity as fitting model Poisson GLM3. The results in terms of
AIC look convincing. The hurdle Poisson model provides an excellent model for the
indicator of having a claim (here it outperforms model ZIP GLM3). It also tries to
optimally fit a ZTP model to all insurance policies having at least one claim. This
can also be seen from Table 6.8 which determines the expected number of policies
that suffer the different numbers of claims.
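The predicted policy counts in Table 6.8 are obtained by summing the hurdle Poisson class probabilities (6.53) over all policies. A minimal sketch of these class probabilities (the parameter values below are hypothetical):

```python
import math

def hurdle_poisson_pmf(k, lam, v, pi0):
    """P[N = k] under the hurdle Poisson model (6.53) with exposure v."""
    if k == 0:
        return pi0
    mu = v * lam
    ztp = math.exp(-mu) * mu**k / (math.factorial(k) * (1.0 - math.exp(-mu)))
    return (1.0 - pi0) * ztp

# hypothetical frequency parameter and zero probability
probs = [hurdle_poisson_pmf(k, lam=0.1, v=1.0, pi0=0.93) for k in range(50)]
```

Note that the probabilities sum to one by construction, since the ZTP part distributes the remaining mass 1 − π_0 over k ≥ 1.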
We close this example by concluding that the hurdle Poisson model provides the
best description, at the price of using more parameters. The ZIP model could be
lifted to a similar level, however, we consider fitting the hurdle approach to be more
convenient, see also Remark 6.20, below. In particular, feature engineering seems
simpler in the hurdle approach because the different effects are clearly separated,
whereas in the ZIP approach it is more difficult to suitably model the excess zeros,
see also Listing 5.10. This closes this example.
Remark 6.20 In (6.54) we have been considering the ZTP model for different
exposures v > 0. If we set these exposures to v = 1, we obtain the ZTP log-
likelihood
ℓ_{N>0}(θ) = Nθ − e^θ − log(1 − e^{−e^θ}) − log(N!).

μ = E_θ[N] = κ'(θ) = e^θ / (1 − e^{−e^θ}) = λ / (1 − e^{−λ}),
Note that the term in brackets is positive but less than one. The latter implies that
the ZTP model has under-dispersion. Alternatively to the EM algorithm, we can
also directly fit a GLM to this ZTP model. The only difficulty is that we need to
appropriately integrate the time exposures. The original Poisson model suggests
that if we choose the canonical parameter being equal to the linear predictor, we
should integrate the logged exposures as offsets into the linear predictors. Along
these lines, if we choose the canonical link h = (κ )−1 of the ZTP model, we
receive that the canonical parameter θ is equal to the linear predictor β, x , and we
can directly integrate the logged exposures as offsets into the canonical parameters,
see (5.25). This then allows us to directly fit this ZTP model with exposures using
Fisher’s scoring method. In this case of a concave log-likelihood function, the result
will be identical to the solution of the EM algorithm found in Example 6.19, and, in
fact, this direct approach is more straightforward and more time-efficient. Similar
considerations can be done for other hurdle models.
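The under-dispersion of the ZTP model can be verified numerically: with μ = λ/(1 − e^{−λ}) one obtains Var(N) = μ(1 + λ − μ) < μ. A small check with a hypothetical λ:

```python
import math

# Mean and variance of the zero-truncated Poisson model by direct summation.

def ztp_moments(lam, k_max=100):
    norm = 1.0 - math.exp(-lam)
    p = math.exp(-lam) / norm       # running value e^{-lam} lam^k / (k! norm)
    mean = 0.0
    ex2 = 0.0
    for k in range(1, k_max + 1):
        p *= lam / k                # avoids overflow of lam**k / k!
        mean += k * p
        ex2 += k * k * p
    return mean, ex2 - mean**2

ztp_mean, ztp_var = ztp_moments(2.0)
```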
In Sect. 6.3.1 we have promoted mixing distributions in cases where the data cannot
be modeled by a single EDF distribution. Alternatively, one can also consider
composing densities, which leads to so-called composite models (also called splicing
models). This idea has been introduced to the actuarial literature by Cooray–Ananda
[81] and Scollnik [332]. Assume we have two absolutely continuous densities
f (i) (·; θi ) with corresponding distribution functions F (i) (·; θi ), i = 1, 2. These two
densities can easily be composed at a splicing value τ and with weight p ∈ (0, 1)
by considering the following composite density
provided that both denominators are non-zero. In this notation we treat the splicing
value τ as a hyper-parameter that is chosen by the modeler, and is not estimated
from data. In view of (6.41) we can rewrite this in terms of lower- and upper-
truncated densities

f(y; p, θ_1, θ_2) = p f^{(1)}_{(−∞,τ]}(y; θ_1) + (1 − p) f^{(2)}_{(τ,∞)}(y; θ_2).

In this notation, we see that a composite model can also be interpreted as a mixture
model with mixture probability p ∈ (0, 1) and mixing densities f^{(1)}_{(−∞,τ]} and f^{(2)}_{(τ,∞)}
having disjoint supports (−∞, τ] and (τ, ∞), respectively.
These disjoint supports allow for simpler MLE, i.e., we do not need to rely on
the ‘EM algorithm for mixture distributions’ to fit this model. The log-likelihood of
Y ∼ f (y; p, θ1 , θ2 ) is given by
ℓ_Y(p, θ_1, θ_2) = [ log(p) + log f^{(1)}_{(−∞,τ]}(Y; θ_1) ] 1_{{Y ≤ τ}}
                 + [ log(1 − p) + log f^{(2)}_{(τ,∞)}(Y; θ_2) ] 1_{{Y > τ}}.
This shows that the log-likelihood nicely decouples in the composite case and all
parameters can directly be estimated with MLE: parameter θ1 uses all observations
smaller or equal to τ , parameter θ2 uses all observations bigger than τ , and p is
estimated by the proportions of claims below and above the splicing point τ . This
holds for a null model as well as for a GLM approach for θ1 , θ2 and p.
Nevertheless, the EM algorithm may still be used for parameter estimation,
namely, truncation may ask for the ‘EM algorithm for truncated data’. Alternatively,
we could also use the ‘EM algorithm for censored data’ to estimate the truncated
densities, because we have knowledge of the number of claims above and below the
splicing point τ , thus, we could right- or left-censor these claims. The latter may
lead to more stability in the estimation procedure since we use more information
in parameter estimation, i.e., the two truncated densities will not be independent
because they simultaneously consider all claim counts (but not identical claim sizes
due to censoring).
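Because the two truncated components have disjoint supports, a composite density always places total mass p below τ and 1 − p above. A sketch with two exponential components (all parameter values are hypothetical):

```python
import math

def composite_density(y, p, beta1, beta2, tau):
    """Composite density p*f1 on (-inf, tau] plus (1-p)*f2 on (tau, inf),
    built from two exponential densities."""
    if y <= tau:
        f1 = beta1 * math.exp(-beta1 * y)
        return p * f1 / (1.0 - math.exp(-beta1 * tau))   # upper-truncated part
    f2 = beta2 * math.exp(-beta2 * y)
    return (1.0 - p) * f2 / math.exp(-beta2 * tau)       # lower-truncated part

# probability mass below tau via the trapezoidal rule (should equal p)
tau, p = 2.0, 0.7
n = 100_000
h = tau / n
vals = [composite_density(i * h, p, 1.0, 0.5, tau) for i in range(n + 1)]
mass_below = (0.5 * (vals[0] + vals[-1]) + sum(vals[1:-1])) * h
```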
For composite models one sometimes requires more regularity in the densities,
we may, e.g., require continuity in the density in the splicing point which provides
mixture probability
p = f^{(2)}(τ; θ_2) F^{(1)}(τ; θ_1) / [ f^{(1)}(τ; θ_1) (1 − F^{(2)}(τ; θ_2)) + f^{(2)}(τ; θ_2) F^{(1)}(τ; θ_1) ].
This reduces the number of parameters to be estimated but complicates the score
equations. If we require a differentiability condition in τ we receive a further
requirement, where f^{(i)}_y(y; θ_i) denotes the first derivative w.r.t. y. Together with
the continuity this provides the requirement for having differentiability in τ

f^{(2)}(τ; θ_2) / f^{(1)}(τ; θ_1) = f^{(2)}_y(τ; θ_2) / f^{(1)}_y(τ; θ_1).
Again this reduces the degrees of freedom in parameter estimation but complicates
the score equations. We refrain from giving an example and close this section; we
will consider a deep composite regression model in Sect. 11.3.2, below, where we
replace the fixed splicing point by a quantile for a fixed quantile level.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons licence and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons licence, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons licence and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 7
Deep Learning
In the sequel, we introduce deep learning models. In this chapter these deep
learning models will be based on fully-connected feed-forward neural networks. We
present these networks as an extension of GLMs. These networks perform feature
engineering themselves. We discuss how networks achieve this, and we explain how
networks are used for predictive modeling. There is a vastly growing literature on
deep learning with networks, the classical reference is the book of Goodfellow et
al. [166], but also the numerous tutorials around the open-source deep learning
libraries TensorFlow [2], Keras [77] or PyTorch [296] give an excellent overview
of the state-of-the-art in this field.
In Chap. 5 on GLMs, we have been modeling the mean structure of the responses
Y , given features x, by the following regression function, see (5.6),
The crucial assumption has been that the regression function (7.1) provides a
reasonable functional description of the expected value Eθ(x) [Y ] of datum (Y, x).
As described in Sect. 5.2.2, this typically requires manual feature engineering of x,
bringing feature information into the right structural form.
In contrast to manual feature engineering, deep learning aims at performing an
automated feature engineering within the statistical model by massaging information
through different transformations. Deep learning uses a finite sequence of
functions (z(m) )1≤m≤d , called layers,
z^{(m:1)}(x) := z^{(m)} ∘ ⋯ ∘ z^{(1)}(x) ∈ {1} × R^{q_m}.    (7.2)
Note that the first component is always identically equal to 1. For this reason we
call the representation z(m:1) (x) ∈ {1} × Rqm of x to be qm -dimensional.
Deep learning now assumes that we have d ∈ N appropriate transformations
(layers) z(m) , 1 ≤ m ≤ d, such that z(d:1) (x) provides a suitable qd -dimensional
representation of the raw feature x ∈ X , that then enters a GLM
In many regression problems it can be shown that one can equivalently work
with the design matrix Z = (z(d:1)(x 1 ), . . . , z(d:1)(x n )) ∈ Rn×(qd +1) or with
7.2 Generic Feed-Forward Neural Networks 269
Mercer’s kernel K ∈ Rn×n . Mercer’s kernel does not require the full knowledge
of the learned representations z(d:1)(x i ), but it suffices to know the discrepancies
between z(d:1)(x i ) and z(d:1) (x j ) measured by the scalar products K(x i , x j ). This
is also closely related to the cosine similarity in word embeddings, see (10.11). This
approach then results in replacing the search for an optimal representation learning
by a search of the optimal Mercer’s kernel for the given data; this is called the kernel
trick in machine learning.
Feed-forward neural (FN) networks use special layers z(m) in (7.2)–(7.3), whose
components are called neurons. This is discussed and studied in detail in this section.
FN networks are regression functions of type (7.3) where each neuron z_j^{(m)}, 1 ≤
j ≤ q_m, of the layers z^{(m)} = (1, z_1^{(m)}, …, z_{q_m}^{(m)})⊤, 1 ≤ m ≤ d, has the structure of
a GLM; the first component z_0^{(m)} = 1 always plays the role of the intercept and does
not need any modeling.
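Each layer can thus be coded as a vector of GLM-type neurons with a common activation, prepending the intercept component. The following sketch (tanh activation and the weights are illustrative assumptions) maps an input in {1} × R² to a representation in {1} × R²:

```python
import math

# One FN layer: each neuron is z -> phi(<w_j, z>), and the layer prepends the
# intercept component z_0^(m) = 1. Activation and weights are assumptions.

def fn_layer(z, weights, phi=math.tanh):
    out = [1.0]                       # intercept component, not modeled
    for w in weights:
        out.append(phi(sum(wi * zi for wi, zi in zip(w, z))))
    return out

z_in = [1.0, 0.5, -0.2]                      # element of {1} x R^2
W = [[0.1, 0.4, -0.3], [0.0, 1.0, 1.0]]      # q_m = 2 neurons, weights in R^3
z_out = fn_layer(z_in, W)
```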
A first important choice is the activation function φ : R → R which plays the
role of the inverse link function g −1 . To perform non-linear representation learning,
this activation function should be non-linear, too. The most popular choices of
activation functions are listed in Table 7.1.
The first three examples in Table 7.1 are smooth functions with simple deriva-
tives, see the last column of Table 7.1. Having simple derivatives is an advantage in
gradient descent algorithms for model fitting. The derivative of the ReLU activation
function for x = 0 is given by the step function activation, and in 0 one typically
considers a sub-gradient. We briefly comment on these activation functions.
Table 7.1 Popular choices of non-linear activation functions and their derivatives; the last two
examples are not strictly monotone

Activation function                        φ(x)                       Derivative
Sigmoid (logistic) activation              φ(x) = (1 + e^{−x})^{−1}   φ' = φ(1 − φ)
Hyperbolic tangent activation              φ(x) = tanh(x)             φ' = 1 − φ²
Exponential activation                     φ(x) = exp(x)              φ' = φ
Step function activation                   φ(x) = 1_{{x≥0}}
Rectified linear unit (ReLU) activation    φ(x) = x 1_{{x≥0}}
270 7 Deep Learning
[Figure: x ↦ tanh(wx) ∈ (−1, 1) for (fixed) weights w ∈ {1/5, 1, 5} and x ∈ (−10, 10)]

The hyperbolic tangent activation satisfies

    x ↦ tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}) = 2 (1 + e^{−2x})^{−1} − 1 ∈ (−1, 1).
Interpretation Every neuron z ↦ z_j^{(m)}(z) describes a GLM regression function
with link function φ^{−1} and regression parameter w_j^{(m)} ∈ R^{q_{m−1}+1} for the features z
of the previous layer.
Fig. 7.2 FN network of depth d = 3, with numbers of neurons (q_1, q_2, q_3) = (20, 15, 10) and
input dimension q_0 = 40. This gives us a network parameter ϑ ∈ R^r of dimension r = 1'306
    r = ∑_{m=1}^{d} q_m (q_{m−1} + 1) + (q_d + 1).
¹ Figures 7.2 and 7.9 are similar to Figure 1 in [122], and all FN network plots have been created
with modified versions of the plot functions of the R package neuralnet [144].
For a shallow FN network we can study the question of the maximal complexity
of the resulting partition of the feature space X ⊂ {1} × R^{q_0} when considering q_1
neurons (7.9) in the single FN layer z^{(1)}. Zaslavsky [400] proved that q_1 hyperplanes
can partition the Euclidean space R^{q_0} into at most

    ∑_{j=0}^{min{q_0, q_1}} (q_1 choose j)   disjoint sets.   (7.10)

This number (7.10) can be seen as a maximal upper complexity bound for shallow
FN networks with step function activation. It grows exponentially for q_1 ≤ q_0, and
it slows down to a polynomial growth for q_1 > q_0. Thus, the complexity of shallow
FN networks grows comparatively slowly once the width q_1 of the network exceeds q_0,
and therefore we often need a huge network to obtain a good approximation.
This result (7.10) should be contrasted with Theorem 4 in Montúfar et al. [280] who
give a lower bound on the complexity of regression functions of deep FN networks
(under the ReLU activation function). Assume q_m ≥ q_0 for all 1 ≤ m ≤ d. The
maximal complexity is bounded below by

    ( ∏_{m=1}^{d−1} ⌊q_m/q_0⌋^{q_0} ) ∑_{j=0}^{q_0} (q_d choose j)   disjoint linear regions.   (7.11)
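The two counting bounds (7.10) and (7.11) are easy to evaluate numerically; the following is a minimal Python sketch (the layer sizes below are hypothetical choices for illustration):

```python
from math import comb, floor

def shallow_bound(q0: int, q1: int) -> int:
    """Zaslavsky's upper bound (7.10): maximal number of disjoint sets
    that q1 hyperplanes can generate in R^q0."""
    return sum(comb(q1, j) for j in range(min(q0, q1) + 1))

def deep_bound(q0: int, qs: list) -> int:
    """Lower bound (7.11) of Montufar et al. on the number of disjoint
    linear regions of a deep ReLU network with widths qs = [q1, ..., qd],
    assuming all q_m >= q0."""
    prod = 1
    for qm in qs[:-1]:
        prod *= floor(qm / q0) ** q0
    return prod * sum(comb(qs[-1], j) for j in range(q0 + 1))

# hypothetical example with feature dimension q0 = 2
print(shallow_bound(2, 20))          # 211 regions for a shallow network
print(deep_bound(2, [20, 15, 10]))   # 274400 regions for a deep network
```

This illustrates the statement in the text: for fixed input dimension, depth increases the attainable complexity much faster than width.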
μ : R² → R,  x ↦ μ(x).
We choose the step function activation for φ and a first FN layer with q_1 = 4
neurons

    x ↦ z^{(1)}(x) = ( 1, z_1^{(1)}(x), …, z_4^{(1)}(x) )^⊤
                   = ( 1, 1_{x_1≥−1/2}, 1_{x_2≥−1/2}, 1_{x_1≥1/2}, 1_{x_2≥1/2} )^⊤ ∈ {1} × {0, 1}^4,

with network weights (7.12) of dimension q_1(q_0 + 1) = 12.
For the second FN layer with q_2 = 4 neurons
we choose the step function activation and

    z ↦ z^{(2)}(z) = ( 1, z_1^{(2)}(z), …, z_4^{(2)}(z) )^⊤
                   = ( 1, 1_{z_1+z_2≥3/2}, 1_{z_2+z_3≥3/2}, 1_{z_1+z_4≥3/2}, 1_{z_3+z_4≥3/2} )^⊤,

with weights of dimension q_2(q_1 + 1) = 20. For the output layer we choose the identity link
g(x) = x, and the regression parameter β = (0, 1, −1, −1, 1)^⊤ ∈ R^5. As a result,
we obtain

    χ_A(x) = ⟨β, z^{(2:1)}(x)⟩.   (7.13)
That is, this network of depth d = 2, with numbers of neurons (q_1, q_2) = (4, 4), step
function activation and identity link can perfectly replicate the indicator function of
the square A = [−1/2, 1/2) × [−1/2, 1/2), see Fig. 7.3. This network has r = 37
parameters.
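This construction can be checked directly; a minimal Python sketch of the depth d = 2 step-activation network, with the weight choices taken from the two displays above:

```python
def step(x):
    """Step function activation phi(x) = 1_{x >= 0}."""
    return 1.0 if x >= 0 else 0.0

def chi_A(x1, x2):
    """Depth d = 2 step-activation network replicating the indicator
    of the square A = [-1/2, 1/2) x [-1/2, 1/2), see (7.13)."""
    # first FN layer, q1 = 4 neurons: half-plane indicators
    z1 = [step(x1 + 0.5), step(x2 + 0.5), step(x1 - 0.5), step(x2 - 0.5)]
    # second FN layer, q2 = 4 neurons: pairwise "and" via thresholds at 3/2
    z2 = [step(z1[0] + z1[1] - 1.5), step(z1[1] + z1[2] - 1.5),
          step(z1[0] + z1[3] - 1.5), step(z1[2] + z1[3] - 1.5)]
    # output layer: identity link with beta = (0, 1, -1, -1, 1)
    return z2[0] - z2[1] - z2[2] + z2[3]

print(chi_A(0.0, 0.0))    # 1.0, inside A
print(chi_A(1.0, 1.0))    # 0.0, outside A
```

Evaluating the function on a grid reproduces the indicator of the half-open square exactly, which a shallow step-activation network cannot do, as discussed next.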
We now consider a shallow FN network with q_1 neurons. The resulting regression
function with identity link is given by

    x ↦ ⟨β, z^{(1:1)}(x)⟩ = ⟨β, (1, z_1^{(1)}(x), …, z_{q_1}^{(1)}(x))^⊤⟩
                          = ⟨β, (1, 1_{⟨w_1^{(1)}, x⟩ ≥ 0}, …, 1_{⟨w_{q_1}^{(1)}, x⟩ ≥ 0})^⊤⟩,
where we have used the step function activation φ(x) = 1{x≥0} . As in (7.9),
each of these neurons leads to a partition of the space R2 with a straight line.
Importantly these straight lines go across the entire feature space, and, there-
fore, we cannot exactly construct the indicator function of Fig. 7.3 with a shal-
low FN network. This can nicely be seen in Fig. 7.4 (lhs), where we con-
sider a shallow FN network with q1 = 4 neurons, weights (7.12), and β =
(0, 1/2, 1/2, −1/2, −1/2).
However, from the universality theorems we know that shallow FN networks
can approximate any compactly supported (continuous) function arbitrarily well
for sufficiently large q1 . In this example we can introduce additional neurons and
let the resulting hyperplanes rotate around the origin. In Fig. 7.4 (middle, rhs) we
show this for q1 = 8 and q1 = 64 neurons. We observe that this allows us to
approximate a circle, see Fig. 7.4 (rhs), and having circles of different sizes at
We describe gradient descent methods in this section. These are used to fit FN
networks. Gradient descent algorithms have already been used in Sect. 6.2.4 for
fitting LASSO regularized regression models. We will give the full methodological
part here, without relying on Sect. 6.2.4.
for a strictly monotone and smooth link function g, and a FN network z(d:1) with
network parameter ϑ ∈ Rr . We assume that the chosen activation function φ is
differentiable. We highlight in the notation that the mean functional μϑ (·) depends
on the network parameter ϑ. The canonical parameter of the response Y_i is given
by θ(x_i) = h(μ_ϑ(x_i)) ∈ Θ, where h = (κ′)^{−1} is the canonical link and κ the
cumulant function of the chosen member of the EDF. This gives us (under constant
dispersion ϕ) the log-likelihood function, for given data Y = (Y_1, …, Y_n)^⊤,

    ϑ ↦ ℓ_Y(ϑ) = ∑_{i=1}^{n} [ (v_i/ϕ) ( Y_i h(μ_ϑ(x_i)) − κ(h(μ_ϑ(x_i))) ) + a(Y_i; v_i/ϕ) ].
The deviance loss function in this model is given by, see (4.9) and (4.8),

    D(Y, ϑ) = (2/n) ∑_{i=1}^{n} (v_i/ϕ) [ Y_i h(Y_i) − κ(h(Y_i)) − Y_i h(μ_ϑ(x_i)) + κ(h(μ_ϑ(x_i))) ] ≥ 0.   (7.14)
This shows that the locally optimal change ϑ → ϑ̃ points in the opposite direction
of the gradient of the deviance loss function. This motivates the following gradient
descent step.
This gradient descent update gives us the new (smaller) deviance loss at
algorithmic time t + 1

    D(Y, ϑ^{(t+1)}) = D(Y, ϑ^{(t)}) − ϱ_{t+1} ‖∇_ϑ D(Y, ϑ^{(t)})‖²₂ + o(ϱ_{t+1})   for ϱ_{t+1} ↓ 0.

Under suitably tempered learning rates (ϱ_t)_{t≥1}, this algorithm converges to a
local minimum of the deviance loss function as t → ∞ (provided that we do not
get trapped in a saddlepoint).
Remarks 7.4 We give a couple of (preliminary) remarks on the gradient descent
algorithm (7.15); more explanations, further derivations, and variants of the gradient
descent algorithm will be discussed below.
• In the applications we will early stop the gradient descent algorithm before
reaching a local minimum (to prevent over-fitting). This is going to be
discussed in the next paragraphs.
• Fine-tuning the learning rates (ϱ_t)_t is important; in particular, there is a trade-off
between smaller and bigger learning rates: they need to be sufficiently small so
that the first order Taylor expansion is still a valid approximation, and they should
be sufficiently big because otherwise the convergence of the algorithm will be very
slow, requiring many iterations.
• The gradient descent algorithm is a first order algorithm, and one is tempted to
study higher order approximations, e.g., leading to the Newton–Raphson algo-
rithm. Unfortunately, higher order derivatives are computationally not feasible if
the size n of the data Y = (Y1 , . . . , Yn ) and the dimension r of the network
parameter ϑ are large. In fact, even the calculation of the first order derivatives
may be challenging and, therefore, stochastic gradient descent methods are
considered below. Nevertheless, it is beneficial to have a notion of a second order
term. Momentum-based methods originate from approximating the second order
terms; these will be studied in (7.19)–(7.20), below.
• The gradient descent step (7.15) solves an unconstrained local optimization.
Similarly to (6.15)–(6.16) we could change the gradient descent algorithm to
a constrained optimization problem, e.g., involving a LASSO constraint that can
be solved with the generalized projection operator (6.17).
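To make the gradient descent step concrete, the following is a minimal Python sketch on simulated Poisson data, assuming (for simplicity) a log-link linear predictor in the place of the full FN network; the deviance loss and its gradient follow (7.14) and (7.16), the data and learning rate are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated toy data: Poisson responses with a log-link linear predictor
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -0.5])
Y = rng.poisson(np.exp(X @ beta_true))

def poisson_deviance(beta):
    """Poisson deviance loss (7.14) with unit weights v_i = phi = 1;
    the term Y log Y is read as 0 for Y = 0."""
    log_mu = X @ beta
    ylogy = np.where(Y > 0, Y * np.log(np.maximum(Y, 1)), 0.0)
    return 2 / n * np.sum(ylogy - Y - Y * log_mu + np.exp(log_mu))

def gradient(beta):
    # for the canonical log-link, (7.16) reduces to 2/n * X^T (mu - Y)
    return 2 / n * X.T @ (np.exp(X @ beta) - Y)

beta = np.zeros(2)
rho = 0.1                       # constant learning rate varrho
for t in range(1000):           # plain vanilla gradient descent updates
    beta = beta - rho * gradient(beta)

print(beta)                     # approaches the MLE
```

For this convex toy problem the iterates converge to the MLE; for FN networks the same update is applied to the full network parameter ϑ and is stopped early, as discussed below.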
Fast gradient descent algorithms essentially rely on fast gradient calculations of the
deviance loss function. Under the EDF setup we have the gradient w.r.t. ϑ

    ∇_ϑ D(Y, ϑ) = (2/n) ∑_{i=1}^{n} (v_i/ϕ) [ μ_ϑ(x_i) − Y_i ] h′(μ_ϑ(x_i)) ∇_ϑ μ_ϑ(x_i)   (7.16)
                = (2/n) ∑_{i=1}^{n} (v_i/ϕ) ( (μ_ϑ(x_i) − Y_i) / ( V(μ_ϑ(x_i)) g′(μ_ϑ(x_i)) ) ) ∇_ϑ ⟨β, z^{(d:1)}(x_i)⟩,
where the last step uses the variance function V(·) of the chosen EDF; we also refer
to (5.9). The main difficulty is the calculation of the gradient

    ∇_ϑ ⟨β, z^{(d:1)}(x)⟩ = ∇_ϑ ⟨β, z^{(d)} ∘ ⋯ ∘ z^{(1)}(x)⟩,
The key idea is a reparametrization of the problem so that the gradients can be calculated more easily.
We therefore modify the weight matrices W^{(m)} by dropping the first row containing
the intercept parameters w_{0,j}^{(m)}, 1 ≤ j ≤ q_m. Define for 1 ≤ m ≤ d + 1

    W_{(−0)}^{(m)} = ( w_{j_{m−1}, j_m}^{(m)} )_{1 ≤ j_{m−1} ≤ q_{m−1}; 1 ≤ j_m ≤ q_m} ∈ R^{q_{m−1} × q_m},

where w_{j_{m−1}, j_m}^{(m)} denotes component j_{m−1} of w_{j_m}^{(m)}, and where we set q_{d+1} = 1
(output dimension) and w_{j_d, 1}^{(d+1)} = β_{j_d} for 0 ≤ j_d ≤ q_d.
Proposition 7.5 (Back-Propagation for the Hyperbolic Tangent Activation)
Choose a FN network of depth d ∈ N and with hyperbolic tangent activation
function φ(x) = tanh(x).
• Define recursively:
  – initialize q_{d+1} = 1 and δ^{(d+1)}(x) = 1 ∈ R^{q_{d+1}};
  – iterate for d ≥ m ≥ 1

        δ^{(m)}(x) = diag( 1 − (z_{j_m}^{(m:1)}(x))² )_{1 ≤ j_m ≤ q_m} W_{(−0)}^{(m+1)} δ^{(m+1)}(x) ∈ R^{q_m}.

• We obtain for 0 ≤ m ≤ d

    ( ∂⟨β, z^{(d:1)}(x)⟩ / ∂w_{j_m, j_{m+1}}^{(m+1)} )_{0 ≤ j_m ≤ q_m; 1 ≤ j_{m+1} ≤ q_{m+1}} = z^{(m:1)}(x) δ^{(m+1)}(x)^⊤ ∈ R^{(q_m+1) × q_{m+1}}.

Proof (sketch). One introduces new variables ζ_j^{(m)}(x), where in particular

    ζ_1^{(d+1)}(x) = ⟨β, z^{(d:1)}(x)⟩.
The main idea is to calculate the derivatives of ⟨β, z^{(d:1)}(x)⟩ w.r.t. these new
variables ζ_j^{(m)}(x).

Initialization for m = d + 1 This provides for m = d + 1 and 1 ≤ j_{d+1} ≤ q_{d+1} = 1

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ_{j_d}^{(d)}(x) = ( ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ_1^{(d+1)}(x) ) ( ∂ζ_1^{(d+1)}(x) / ∂ζ_{j_d}^{(d)}(x) ),

where we have used w_{j_d,1}^{(d+1)} = β_{j_d} and, for the hyperbolic tangent activation function,
φ′ = 1 − φ². Continuing recursively for d > m ≥ 1 and 1 ≤ j_m ≤ q_m we obtain

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ_{j_m}^{(m)}(x) = ∑_{j_{m+1}=1}^{q_{m+1}} δ_{j_{m+1}}^{(m+1)}(x) w_{j_m, j_{m+1}}^{(m+1)} ( 1 − (z_{j_m}^{(m:1)}(x))² ) = δ_{j_m}^{(m)}(x).
Thus, the vectors δ^{(m)}(x) = (δ_1^{(m)}(x), …, δ_{q_m}^{(m)}(x))^⊤ are calculated recursively in
d ≥ m ≥ 1 with initialization δ^{(d+1)}(x) = 1 and the recursion

    δ^{(m)}(x) = diag( 1 − (z_{j_m}^{(m:1)}(x))² )_{1 ≤ j_m ≤ q_m} W_{(−0)}^{(m+1)} δ^{(m+1)}(x) ∈ R^{q_m}.
Finally, we need to show how these derivatives are related to the original
derivatives in the gradient descent method. We have for 0 ≤ j_d ≤ q_d and j_{d+1} = 1

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂β_{j_d} = ( ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ_1^{(d+1)}(x) ) ( ∂ζ_1^{(d+1)}(x) / ∂β_{j_d} ) = δ_{j_{d+1}}^{(d+1)}(x) z_{j_d}^{(d:1)}(x).
Similarly, for the weights in the FN layers, for 0 ≤ j_m ≤ q_m and 1 ≤ j_{m+1} ≤ q_{m+1},

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂w_{j_m, j_{m+1}}^{(m+1)} = ( ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ_{j_{m+1}}^{(m+1)}(x) ) ( ∂ζ_{j_{m+1}}^{(m+1)}(x) / ∂w_{j_m, j_{m+1}}^{(m+1)} ) = δ_{j_{m+1}}^{(m+1)}(x) z_{j_m}^{(m:1)}(x),

and in the first FN layer

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂w_{l, j_1}^{(1)} = ( ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ_{j_1}^{(1)}(x) ) ( ∂ζ_{j_1}^{(1)}(x) / ∂w_{l, j_1}^{(1)} ) = δ_{j_1}^{(1)}(x) x_l.
Remark 7.6 Proposition 7.5 gives the back-propagation method for the hyperbolic
tangent activation function which has derivative φ′ = 1 − φ². This becomes visible
in the definition of δ^{(m)}(x) where we consider the diagonal matrix

    diag( 1 − (z_{j_m}^{(m:1)}(x))² )_{1 ≤ j_m ≤ q_m}.

In the case of the sigmoid activation function this gives us, see also Table 7.1,

    diag( z_{j_m}^{(m:1)}(x) ( 1 − z_{j_m}^{(m:1)}(x) ) )_{1 ≤ j_m ≤ q_m}.
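The recursion of Proposition 7.5 can be verified numerically; the following is a minimal numpy sketch for a small tanh network (with hypothetical sizes q_0 = 3, q_1 = 5, q_2 = 4), checked against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)

# small tanh network: q0 = 3 -> q1 = 5 -> q2 = 4 -> scalar output (d = 2);
# W[m] has shape ((q_m + 1) x q_{m+1}), its first row holds the intercepts
q = [3, 5, 4]
W = [0.5 * rng.normal(size=(q[m] + 1, q[m + 1])) for m in range(2)]
beta = 0.5 * rng.normal(size=q[-1] + 1)

def forward(x):
    """Forward pass storing the learned representations z^{(m:1)}(x)."""
    zs, z = [], x
    for Wm in W:
        z = np.tanh(np.concatenate(([1.0], z)) @ Wm)
        zs.append(z)
    return beta @ np.concatenate(([1.0], zs[-1])), zs

def backprop(x):
    """Gradients of <beta, z^{(d:1)}(x)> via the recursion of Prop. 7.5."""
    _, zs = forward(x)
    # weight matrices without intercept rows, W^{(m+1)}_{(-0)}, incl. beta
    W_no0 = [Wm[1:, :] for Wm in W] + [beta[1:].reshape(-1, 1)]
    deltas = [None] * 3
    delta = np.ones(1)                       # delta^{(d+1)}(x) = 1
    for m in (2, 1):                         # m = d, ..., 1
        delta = (1 - zs[m - 1] ** 2) * (W_no0[m] @ delta)
        deltas[m] = delta
    grad_beta = np.concatenate(([1.0], zs[-1]))
    grad_W = [np.outer(np.concatenate(([1.0], x)), deltas[1]),
              np.outer(np.concatenate(([1.0], zs[0])), deltas[2])]
    return grad_beta, grad_W

# verify one weight derivative by a central finite difference
x = rng.normal(size=3)
_, gW = backprop(x)
eps = 1e-6
W[0][2, 1] += eps; f_plus, _ = forward(x)
W[0][2, 1] -= 2 * eps; f_minus, _ = forward(x)
W[0][2, 1] += eps
print(abs(gW[0][2, 1] - (f_plus - f_minus) / (2 * eps)))  # numerically zero
```

The analytic back-propagated derivative agrees with the finite difference up to numerical precision, confirming the recursion.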
    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ^{(t)}).
Remark 7.7 The initialization ϑ^{(0)} ∈ R^r of the gradient descent algorithm needs
some care. A FN network has many symmetries; for instance, we can permute
the neurons within a FN layer and we receive the same predictive model. For this
reason, the initial network weights W^{(m)} = (w_1^{(m)}, …, w_{q_m}^{(m)}) ∈ R^{(q_{m−1}+1)×q_m}
should be initialized randomly to break these symmetries.
The gradient of the deviance loss function is obtained by the matrix multiplication

    ∇_ϑ D(Y, ϑ) = (2/n) M v(Y).

Matrix multiplication can be very slow in numerical implementations if the
sample size n is large. For this reason, one typically uses the stochastic gradient
descent (SGD) method that does not consider the entire data Y = (Y_1, …, Y_n)^⊤
simultaneously.
² For our examples we use the R library keras [77] which is an API to TensorFlow [2].
For the SGD method one chooses a fixed batch size b ∈ N, and one randomly
partitions the entire data Y into (mini-)batches Y_1, …, Y_{⌈n/b⌉} of approximately the
same size b (up to cardinality). Each gradient descent update

    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ_{t+1} ∇_ϑ D(Y_s, ϑ^{(t)}),

then only considers one batch Y_s at a time.
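The random batching can be sketched as follows (with hypothetical sizes n = 10 and b = 4); each batch Y_s then drives one update of the form above:

```python
import numpy as np

def minibatch_indices(n, b, rng):
    """Randomly partition {0, ..., n-1} into ceil(n/b) (mini-)batches of
    approximately the same size b (up to cardinality)."""
    perm = rng.permutation(n)
    return [perm[s:s + b] for s in range(0, n, b)]

rng = np.random.default_rng(0)
batches = minibatch_indices(n=10, b=4, rng=rng)
# each batch drives one update theta <- theta - rho * grad D(Y_s, theta)
print([len(batch) for batch in batches])   # [4, 4, 2]
```

One pass through all batches is called an epoch; the partition is usually redrawn for every epoch.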
The gradient descent method only considers a first order Taylor expansion, and one is
tempted to consider higher order terms to improve the approximation. For instance,
Newton's method uses a second order Taylor term by updating

    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ( ∇²_ϑ D(Y, ϑ^{(t)}) )^{−1} ∇_ϑ D(Y, ϑ^{(t)}).   (7.18)
For ν = 0 we have the plain vanilla gradient descent method, for ν > 0 we also
memorize the previous gradients (with exponentially decaying weights). Typically
this leads to better convergence properties.
Nesterov [284] noticed that for convex functions the gradient descent updates
may have a zig-zag behavior. Therefore, he proposed the so-called Nesterov-accelerated
version
Thus, the calculation of the momentum v^{(t+1)} uses a look-ahead ϑ^{(t)} + νv^{(t)} in
the gradient calculation (anticipating part of the next step). This provides for the
update (7.21) the following equivalent versions, under the reparametrization ϑ̃^{(t)} =
ϑ^{(t)} + νv^{(t)},

    ϑ^{(t+1)} = ϑ^{(t)} + νv^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ^{(t)} + νv^{(t)})
              = ϑ^{(t)} + νv^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ̃^{(t)})   (7.22)
              = ϑ̃^{(t)} + νv^{(t+1)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ̃^{(t)}) − νv^{(t+1)}.

For the Nesterov-accelerated update we can also study, using the last line of (7.22),

    ϑ^{(t+1)} = ϑ̃^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ̃^{(t)}),
    ϑ̃^{(t+1)} = ϑ^{(t+1)} + ν ( ϑ^{(t+1)} − ϑ^{(t)} ).   (7.24)
for given weights ν, α ∈ (0, 1). Similar to Bayesian credibility theory, v^{(t)}
and r^{(t)} are biased because these two processes have been initialized in zero.
Therefore, they are rescaled by 1/(1 − ν^t) and 1/(1 − α^t), respectively. This
gives us the gradient descent update

    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ ( v^{(t+1)} / (1 − ν^t) ) / ( ε + √( r^{(t+1)} / (1 − α^t) ) ),
where the square-root is taken component-wise, for a global decay rate ϱ > 0,
and for a small positive constant ε > 0 to ensure that everything is well-defined.
• nadam is the Nesterov-accelerated [284] version of adam. Similarly as when
going from (7.19)–(7.20) to (7.23), the acceleration is obtained by a shift of 1 in
the velocity parameter; thus, consider the Nesterov-accelerated adam update
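As an illustration of these adam-type updates, the following is a minimal Python sketch on a toy quadratic loss; it is a hypothetical sketch of the rescaled update above, not the keras implementation used in the examples:

```python
import numpy as np

def adam_step(theta, grad, state, rho=0.01, nu=0.9, alpha=0.999, eps=1e-8):
    """One adam-type update: exponentially weighted momentum v and squared
    gradient average r, with bias corrections 1/(1 - nu^t) and 1/(1 - alpha^t)."""
    state["t"] += 1
    state["v"] = nu * state["v"] + (1 - nu) * grad
    state["r"] = alpha * state["r"] + (1 - alpha) * grad ** 2
    v_hat = state["v"] / (1 - nu ** state["t"])
    r_hat = state["r"] / (1 - alpha ** state["t"])
    return theta - rho * v_hat / (eps + np.sqrt(r_hat))

# toy example: minimize the quadratic loss ||theta||^2 with gradient 2 theta
theta = np.array([1.0, -2.0])
state = {"t": 0, "v": np.zeros(2), "r": np.zeros(2)}
for _ in range(5000):
    theta = adam_step(theta, 2 * theta, state)
print(theta)   # close to the minimum at the origin
```

The component-wise rescaling by the square-root of r makes the step sizes roughly scale-free across the components of ϑ, which is the main appeal of adam-type methods for high-dimensional network parameters.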
As explained above, we model the mean of the datum (Y, x) by a deep FN network

    x ↦ μ(x) = μ_ϑ(x) = E_{θ(x)}[Y] = g^{−1}⟨β, z^{(d:1)}(x)⟩,

and the maximum likelihood estimator of the network parameter is given by

    ϑ̂^{MLE} = arg min_ϑ D(Y, ϑ).
In Fig. 7.5 we give a schematic figure of a loss surface ϑ ↦ D(Y, ϑ) for a (low-dimensional)
example ϑ ∈ R². The two plots show the same loss surface from two
different angles. This loss surface has three (local) minima (red color), and the
smallest one (global minimum) gives the MLE ϑ̂^{MLE}.
In general, this global minimum cannot be found for more complex network
architectures because the loss surface typically has a complicated structure for high-dimensional
parameter spaces. Is this a problem in FN network fitting? Not really!
We are going to explain why. The universality theorems in Sect. 7.2.2 state that more
complex FN networks have an excellent approximation capacity. If we translate
this to our statistical modeling problem it means that the observations Y can be
approximated arbitrarily well by sufficiently complex FN networks. In particular,
for a given complex network architecture, the MLE ϑ̂^{MLE} will provide the optimal
fit of this architecture to the data Y, and, as a result, this network not only
reflects the systematic effects in the data but also the noisy part. This behavior is
called (in-sample) over-fitting to the learning data L. It implies that such statistical
models typically have a poor generalization to unseen (out-of-sample) test data T;
this is illustrated by the red color in Fig. 7.6. For this reason, in general, we are
not interested in finding the MLE ϑ̂^{MLE} of ϑ in FN network regression modeling,
but we would like to find a parameter estimate ϑ̂ that (only) extracts the systematic
effects from the learning data L. This is illustrated by the different colors in Figs. 7.5
Fig. 7.5 Schematic figure of a loss surface ϑ → D(Y , ϑ) from two different angles for a two-
dimensional parameter ϑ ∈ R2
[Fig. 7.6: regression functions μ(x) that are under-fitting, extracting the systematic effects (green), and over-fitting]
and 7.6, where we assume: (a) red color provides models with a poor generalization
power due to over-fitting, (b) blue color provides models with a poor generalization
power, too, because these parametrizations do not explain the systematic effects in
the data at all (called under-fitting), and (c) green color gives good parametrizations
that explain the systematic effects in the data and generalize well to unseen data.
Thus, the aim is to find parametrizations that are in the green area of Fig. 7.5.
This green area emphasizes that we lose the notion of uniqueness because there
are infinitely many models in the green area that have a comparable generalization
power. Next we explain how we can exploit the gradient descent algorithm to make
it useful for finding parametrizations in the green area.
Remark 7.8 The loss surface considerations in Fig. 7.5 are based on a fixed network
architecture. Recent research promotes the so-called Graph HyperNetwork (GHN)
that is a (hyper-)network which tries to find the optimal network architecture and
its parametrization by an additional network, we refer to Zhang et al. [402] and
Knyazev et al. [219].
As stated above, if we run the gradient descent algorithm with properly tempered
learning rates it will converge to a local minimum of the loss function, which means
that the resulting FN network over-fits to the learning data. For this reason we need
to early stop the gradient descent algorithm beforehand. Coming back to Fig. 7.5,
typically, we start the gradient descent algorithm somewhere in the blue area of
the loss surface (provided that the red area is a sparse set on the loss surface).
Visually speaking, the gradient descent algorithm then walks down the valley (green,
yellow and red areas) by exploiting locally optimal steps. Since at the early stage of
the algorithm the systematic effects play a dominant role over the noisy part, the
gradient descent algorithm learns these systematic effects at this first stage (blue
area in Fig. 7.5). When the algorithm arrives at the green area the noisy part in the
data starts to increasingly influence the model calibration (gradient descent steps),
and, hence, at this stage the algorithm should be stopped, and the learned
parameter should be selected for predictive modeling. This early stopping is an
implicit way of regularization, because it implies that we stop the parameter fitting
before the parameters start to learn very individual features of the (noisy) data (and
take extreme values).
This early stopping point is determined by doing an out-of-sample analysis. This
requires the learning data L to be further split into training data U and validation
data V. The training data U is used for gradient descent parameter learning, and
the validation data V is used for tracking the over-fitting by an instantaneous (out-
of-sample) validation analysis. This partition is illustrated in Fig. 7.7, which also
highlights that the validation data V is disjoint from the test data T , the latter only
being used in the final step for comparing different statistical models (e.g., a GLM
vs. a FN network). That is, model comparison is done in a proper out-of-sample
manner on T , and each of these models is only fit on U and V. Thus, for FN network
fitting with early stopping we need a reasonable amount of data that can be split into
three sufficiently large data sets so that each is suitable for its purpose.
For early stopping we partition the learning data L into training data U and
validation data V. The plain vanilla gradient descent algorithm can then be changed
as follows.
Fig. 7.7 Partition of entire data D (lhs) into learning data L and test data T (middle), and into
training data U , validation data V and test data T (rhs)
    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ_{t+1} ∇_ϑ D(U, ϑ^{(t)}).
In applications we use the SGD algorithm that can also have erratic steps because
not all random (mini-)batches are necessarily typical representations of the data.
In such cases we should use more sophisticated stopping criteria than (7.27), for
instance, early stop if the validation loss increases five times in a row.
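Such a stopping rule can be sketched as follows; this is a hypothetical callback-style loop driven by a synthetic validation-loss curve (the curve and the patience of five are illustrative):

```python
def train_with_early_stopping(step, val_loss, max_epochs=200, patience=5):
    """Run training steps on U, track the validation loss on V, and stop
    once the validation loss has increased `patience` times in a row; the
    epoch with the smallest validation loss is retained (a 'callback')."""
    best, best_epoch, fails, prev = float("inf"), 0, 0, float("inf")
    for t in range(1, max_epochs + 1):
        step()                    # one update theta <- theta - rho * grad D(U, theta)
        loss = val_loss()         # instantaneous out-of-sample validation analysis
        if loss < best:
            best, best_epoch = loss, t
        fails = fails + 1 if loss > prev else 0
        prev = loss
        if fails >= patience:
            break
    return best_epoch, best

# toy validation-loss curve: decreases, then over-fitting sets in at epoch 60
losses = [(t - 60) ** 2 / 1000 + 1 for t in range(1, 201)]
it = iter(losses)
cur = {"loss": None}
def step(): cur["loss"] = next(it)
def val_loss(): return cur["loss"]
result = train_with_early_stopping(step, val_loss)
print(result)   # (60, 1.0): best epoch and its validation loss
```

In the keras examples below, this role is played by a callback that stores the network weights with the minimal validation loss.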
[Figure: training loss and validation loss D(V, ϑ^{(t)}) over different iterations t ≥ 0 of the SGD algorithm, with the minimal validation loss indicated]
The categorical features have been treated by dummy coding within GLMs. Dummy
coding provides full rank design matrices. For FN network regression modeling the
full rank property is not important because, anyway, we neither have a single (local)
minimum in the objective function, nor do we want to calculate the MLE of the
network parameter. Typically, in FN network regression modeling one uses one-
hot encoding for the categorical variables that encodes every level by a unit vector.
Assume the raw feature component x̃_j is a categorical variable taking K different
levels {a_1, …, a_K}. One-hot encoding is obtained by the embedding map

    x̃_j ↦ x_j = ( 1_{x̃_j = a_1}, …, 1_{x̃_j = a_K} )^⊤ ∈ {0, 1}^K.   (7.28)
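A minimal Python sketch of the embedding map (7.28), with hypothetical levels:

```python
def one_hot(levels, x):
    """One-hot encoding (7.28): map level x to a unit vector in {0,1}^K."""
    return [1.0 if x == a else 0.0 for a in levels]

levels = ["a1", "a2", "a3"]        # hypothetical levels a_1, ..., a_K
print(one_hot(levels, "a2"))       # [0.0, 1.0, 0.0]
```

Each observed level is mapped to a unit vector, so all K levels have the same mutual Euclidean distance; this is revisited when embedding layers are introduced below.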
An explicit example is given in Table 7.2 which should be compared to Table 5.1.
The continuous feature components do not need any pre-processing but they can
directly enter the FN network which will take care of representation learning.
However, an efficient use of gradient descent methods typically requires that all
feature components live on a similar scale and that they are roughly uniformly
spread across their domains. This makes gradient descent steps more efficient in
exploiting the relevant directions.
One possibility is to use the MinMaxScaler. Let x_j^− and x_j^+ be the minimal and
maximal possible feature values of the continuous feature component x_j, i.e., x_j ∈
[x_j^−, x_j^+]. We transform this continuous feature component to unit scale for all data
1 ≤ i ≤ n by

    x_{i,j} ↦ x_{i,j}^{MM} = 2 (x_{i,j} − x_j^−) / (x_j^+ − x_j^−) − 1 ∈ [−1, 1].   (7.29)
The resulting feature values (x_{i,j}^{MM})_{1≤i≤n} should roughly be uniformly spread
across the interval [−1, 1]. If this is not the case, for instance, because we have
outliers in the feature values, we may first transform them non-linearly to get
more uniformly spread values. For example, we consider the Density of the car
frequency example on the log scale.
An alternative to the MinMaxScaler is to consider normalization with the
empirical mean x̄_j and the empirical standard deviation σ̂_j over all data x_{i,j}. That
is,

    x_{i,j} ↦ x_{i,j}^{sd} = (x_{i,j} − x̄_j) / σ̂_j.   (7.30)
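Both scalings (7.29)–(7.30) are one-liners; a minimal Python sketch, including a log transform for a right-skewed feature such as Density (the numbers are hypothetical):

```python
import numpy as np

def minmax_scaler(x):
    """MinMaxScaler (7.29): map feature values to the interval [-1, 1]."""
    return 2 * (x - x.min()) / (x.max() - x.min()) - 1

def standardize(x):
    """Normalization (7.30): subtract the empirical mean, divide by the
    empirical standard deviation."""
    return (x - x.mean()) / x.std()

# a right-skewed feature such as Density is first put on the log scale
density = np.array([30.0, 100.0, 2500.0, 27000.0])   # hypothetical values
scaled = minmax_scaler(np.log(density))
print(scaled)                                 # values in [-1, 1]
print(standardize(np.log(density)).mean())    # essentially 0
```

Note that both transformations are monotone, so they change the geometry seen by the gradient descent steps but not the ordering of the feature values.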
We present a first FN network example applied to the French MTPL claim frequency
data studied in Sect. 5.2.4. We assume that the claim counts N_i are independent and
Poisson distributed with claim count density (5.26), where we replace the GLM
regression function x ↦ exp⟨β, x⟩ by a FN network regression function
Table 7.3 Run times, numbers of parameters, in-sample and out-of-sample deviance losses (units
are in 10^{−2}) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5 and the FN network model (with one-hot encoding of the categorical variables)

                                             Run time   # param.   In-sample loss on L   Out-of-sample loss on T   Aver. freq.
  Poisson null                               –          1          25.213                25.445                    7.36%
  Poisson GLM3                               15 s       50         24.084                24.102                    7.36%
  One-hot FN (q_1, q_2, q_3) = (20, 15, 10)  51 s       1'306      23.757                23.885                    6.96%
SGD algorithm and we retrieve the network with the lowest validation loss using
a callback; this is illustrated in Listing 7.3. The fitting performance on the
training and validation data is illustrated in Fig. 7.8, and we retrieve the network
calibration after the 52nd epoch because it has the lowest validation loss. The results
are presented in Table 7.3.
From the results of Table 7.3 we conclude that the FN network outperforms
model Poisson GLM3 (out-of-sample) since it has a (clearly) lower out-of-sample
deviance loss on the test data T. This may indicate that there is an interaction
between the feature components that has not been captured in the GLM. The run
time of 51 s corresponds to the run time until the minimal validation loss is reached;
of course, in practice we need to continue beyond this minimal validation loss to
ensure that we have really found the minimum. Finally, and importantly, we observe
that this early stopped FN network calibration does not meet the balance property
because the resulting average frequency of this fitted model of 6.96% is below the
empirical frequency of 7.36%. This is a major deficiency of this FN network fitting
approach, and it is going to be discussed further in Sect. 7.4.2, below.
We can perform a detailed analysis of different batch sizes, variants of SGD
methods, run times, etc. We briefly summarize our findings; this summary is also
based on the findings in Ferrario et al. [127]. We have fitted this model on batches
of sizes 2'000, 5'000, 10'000 and 20'000, and it seems that a batch size around
5'000 has the best performance, both concerning out-of-sample performance and
run time to reach the minimal validation loss. Comparing the different optimizers
rmsprop, adam and nadam, a clear preference can be given to nadam: the
resulting prediction accuracy is similar for all three optimizers (they all reach the
green area in Fig. 7.5), but nadam reaches this optimal point in half of the time
compared to rmsprop and adam.
We conclude by highlighting that different initial points ϑ^{(0)} of the SGD
algorithm will give different network calibrations, and the differences can be
considerable. This is discussed in Sect. 7.4.4, below. Moreover, we could explore different
network architectures, simpler ones, more complex ones, different activation
functions, etc. The results of these different architectures will not be essentially
different from our results, as long as the networks are above a minimal complexity
bound. This closes our first example on FN networks; this example is the
benchmark for the refined versions that are presented in the subsequent sections.
The categorical feature components have been treated either by dummy coding or
by one-hot encoding, and this has resulted in numerous network parameters in the
first FN layer, see Fig. 7.2. Natural language processing (NLP) treats categorical
feature components differently, namely, it embeds categorical feature components
(or words in NLP) into a Euclidean space Rb of a small dimension b. This small
dimension b is a hyper-parameter that has to be selected by the modeler, and which,
typically, is selected much smaller than the total number of levels of the categorical
feature. This embedding technique is quite common in NLP, see Bengio et al. [27–
29], but it goes beyond NLP applications, see Guo–Berkhahn [176], and it has been
introduced to the actuarial community by Richman [312, 313] and the tutorial of
Schelldorfer–Wüthrich [329].
We assume the same set-up as in dummy coding (5.21) and in one-hot encoding (7.28),
namely, that we have a raw categorical feature component x̃_j taking K
different levels {a_1, …, a_K}. In one-hot encoding these K levels are mapped to the
K unit vectors of the Euclidean space R^K, and consequently all levels have the same
mutual Euclidean distance. This does not seem to be the best way of comparing the
different levels because in our regression analysis we would like to identify the
levels that are more similar w.r.t. the regression task and, thus, these should cluster.
For an embedding layer one chooses a Euclidean space R^b of a dimension b < K,
typically being (much) smaller than K. One then considers the embedding map

    e : {a_1, …, a_K} → R^b,   a_k ↦ e(a_k) =: e^{(k)}.   (7.31)
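A minimal untrained Python sketch of the embedding map (7.31); in the fitted models the table of vectors e^{(k)} is a trainable weight matrix (in keras, an Embedding layer), and the region labels below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

class Embedding:
    """Embedding map (7.31): each level a_k is mapped to a vector
    e^{(k)} in R^b with b < K; a minimal untrained sketch."""
    def __init__(self, levels, b):
        self.index = {a: k for k, a in enumerate(levels)}
        self.E = rng.normal(size=(len(levels), b))   # K x b weight table

    def __call__(self, a):
        return self.E[self.index[a]]

levels = [f"R{k}" for k in range(1, 23)]   # 22 hypothetical region labels
emb = Embedding(levels, b=2)               # 22 * 2 = 44 parameters
print(emb("R11"))                          # a point in R^2
```

The parameter count K · b explains the 22 and 44 embedding parameters of the two categorical variables in Listing 7.5 below.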
Fig. 7.9 (lhs) One-hot encoding with q0 = 40, and (rhs) embedding layers for VehBrand and
Region with embedding dimension b = 2 and q0 = 11; the remaining network architecture is
identical with (q1 , q2 , q3 ) = (20, 15, 10) for depth d = 3
Table 7.4 Run times, numbers of parameters, in-sample and out-of-sample deviance losses (units
are in 10^{−2}) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5 and the FN network models (with one-hot encoding and embedding layers of dimension
b = 2, respectively)

                                             Run time   # param.   In-sample loss on L   Out-of-sample loss on T   Aver. freq.
  Poisson null                               –          1          25.213                25.445                    7.36%
  Poisson GLM3                               15 s       50         24.084                24.102                    7.36%
  One-hot FN (q_1, q_2, q_3) = (20, 15, 10)  51 s       1'306      23.757                23.885                    6.96%
  Embed FN (q_1, q_2, q_3) = (20, 15, 10)    120 s      792        23.694                23.820                    7.24%
A first remark is that the model calibration takes longer using embedding layers
compared to one-hot encoding. The main reason for this is that an embedding
layer increases the depth of the network by one layer, as can be seen from Fig. 7.9.
Therefore, the back-propagation takes more time, and the convergence is slower,
requiring more gradient descent steps. We have less over-fitting, as can be seen from
Fig. 7.10. The final fitted model has a slightly better out-of-sample performance
compared to the one-hot encoded model. However, this slight improvement in the
performance should not be overstated because, as explained in Remarks 7.9, there
are a couple of elements of randomness involved in SGD fitting, and choosing
a different seed may change the results. We remark that the balance property is
not fulfilled because the average frequency of the fitted model does not meet the
empirical frequency, see the last column of Table 7.4; we come back to this in
Sect. 7.4.2, below.
7.4 Special Features in Networks 301
Listing 7.5 Summary of FN network of Fig. 7.9 (rhs) using embedding layers of dimension b = 2
Layer (type)              Output Shape     Param #   Connected to
==============================================================================
VehBrand (InputLayer)     (None, 1)        0
______________________________________________________________________________
Region (InputLayer)       (None, 1)        0
______________________________________________________________________________
BrandEmb (Embedding)      (None, 1, 2)     22        VehBrand[0][0]
______________________________________________________________________________
RegionEmb (Embedding)     (None, 1, 2)     44        Region[0][0]
______________________________________________________________________________
Design (InputLayer)       (None, 7)        0
______________________________________________________________________________
Brand_flat (Flatten)      (None, 2)        0         BrandEmb[0][0]
______________________________________________________________________________
Region_flat (Flatten)     (None, 2)        0         RegionEmb[0][0]
______________________________________________________________________________
concate (Concatenate)     (None, 11)       0         Design[0][0]
                                                     Brand_flat[0][0]
                                                     Region_flat[0][0]
______________________________________________________________________________
FNLayer1 (Dense)          (None, 20)       240       concate[0][0]
______________________________________________________________________________
FNLayer2 (Dense)          (None, 15)       315       FNLayer1[0][0]
______________________________________________________________________________
FNLayer3 (Dense)          (None, 10)       160       FNLayer2[0][0]
______________________________________________________________________________
Network (Dense)           (None, 1)        11        FNLayer3[0][0]
______________________________________________________________________________
Vol (InputLayer)          (None, 1)        0
______________________________________________________________________________
Multiply (Multiply)       (None, 1)        0         Network[0][0]
                                                     Vol[0][0]
==============================================================================
Total params: 792
Trainable params: 792
Non-trainable params: 0
[Fig. 7.10: training loss and validation loss D(V, ϑ^(t)) over different iterations t ≥ 0 of the SGD algorithm]
[Fig. 7.11: two scatter plots (dimension 1 vs. dimension 2) of the 2-dimensional embedding weights of VehBrand and Region; legend: B1/B2 Renault, Nissan, Citroen; B3 Volkswagen, Audi, Skoda, Seat; B10/B11 Mercedes, Chrysler, BMW; B12 Japanese cars (except Nissan)]
Fig. 7.11 Embedding weights e^VehBrand ∈ R² and e^Region ∈ R² of the categorical variables
VehBrand and Region for embedding dimension b = 2
A major advantage of using embedding layers for the categorical variables is that
we receive a continuous representation of nominal variables, where proximity can be
interpreted as similarity for the regression task at hand. This is nicely illustrated in
Fig. 7.11, which shows the resulting 2-dimensional embeddings e^VehBrand ∈ R² and
e^Region ∈ R² of the categorical variables VehBrand and Region. The Region
embedding e^Region ∈ R² shows surprising similarities with the French map, for
instance, Paris region R11 is adjacent to R23, R22, R21, R26, R24 (which is also
the case in the French map), the Isle of Corsica R94 and the South of France R93,
R91 and R73 are well separated from the other regions, etc. Similar observations can
be made for the embedding of VehBrand: Japanese cars B12 are far apart from the
other cars; cars B1, B2, B3 and B6 (Renault, Nissan, Citroen, Volkswagen, Audi,
Skoda, Seat and Fiat) cluster; etc.
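Conceptually, an embedding layer is nothing more than a trainable lookup table from the levels of a categorical variable to vectors in R^b. The following sketch (in Python rather than the R code of our listings; the brand labels and randomly initialized weights are purely illustrative) shows this lookup; in a real network the table entries are fitted jointly with all other network parameters by gradient descent.

```python
import numpy as np

# An embedding layer for VehBrand with embedding dimension b = 2:
# a trainable lookup table, one row per level of the categorical variable.
rng = np.random.default_rng(0)
brands = ["B1", "B2", "B3", "B6", "B12"]               # subset of the VehBrand levels
emb = {brand: rng.normal(size=2) for brand in brands}  # randomly initialized weights

def embed(brand):
    # the lookup performed by the embedding layer; during training these
    # vectors are updated by gradient descent like any other weight
    return emb[brand]

# proximity in the embedding space can be interpreted as similarity
# for the regression task at hand
dist = np.linalg.norm(embed("B1") - embed("B12"))
```

After training, plotting these 2-dimensional vectors gives exactly the kind of map shown in Fig. 7.11.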
Above, over-fitting to the learning data has been taken care of by early stopping. In
view of Sect. 6.2 one could also use regularization. This can easily be obtained by
replacing (7.14), for instance, by the following Lp-regularized counterpart

   ϑ ↦ (2/n) Σ_{i=1}^n (v_i/φ) [ Y_i h(Y_i) − κ(h(Y_i)) − Y_i h(μ_ϑ(x_i)) + κ(h(μ_ϑ(x_i))) ] + λ ‖ϑ^−‖_p^p,
for some p ≥ 1, regularization parameter λ > 0 and where the reduced network
parameter ϑ − ∈ Rr−1 excludes the intercept parameter β0 of the output layer,
we also refer to (6.4) in the context of GLMs. For grouped penalty terms we
refer to (6.21). The difficulty with this approach is the tuning of the regularization
parameter(s) λ: run time is one issue, suitable grouping is another, and the
non-uniqueness of the optimal network a further one that can substantially distort
the selection of reasonable regularization parameters.
A more popular method to prevent individual neurons in a FN layer from
over-fitting to a certain task are so-called drop-out layers. A drop-out layer is an
additional layer between FN layers that randomly removes neurons from the
network during gradient descent training, i.e., in each gradient descent step, each of
the earmarked neurons is dropped independently of the others with a fixed
probability δ ∈ (0, 1). This random removal implies that the composite of the
remaining neurons needs to be sufficiently well balanced to take over the role of the
dropped-out neurons. Therefore, a single neuron cannot be over-trained to a certain
task because it needs to be able to play several different roles. Drop-out has been
introduced by Srivastava et al. [345] and Wager et al. [373].
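The mechanics of a drop-out layer can be sketched in a few lines. This is an illustration of (inverted) drop-out as described above, not the internal Keras implementation: during training each neuron is dropped independently with probability δ and the survivors are rescaled by 1/(1 − δ) so that the expected activation is unchanged, while at prediction time the layer is simply the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(z, delta, training):
    """Inverted drop-out: during training each neuron is set to zero independently
    with probability delta, and the survivors are rescaled by 1/(1 - delta) so that
    the layer output is unbiased; at prediction time the layer is the identity."""
    if not training:
        return z
    mask = rng.uniform(size=z.shape) >= delta   # keep each neuron with probability 1 - delta
    return z * mask / (1.0 - delta)

z = np.ones(100_000)
out = dropout(z, delta=0.2, training=True)
print(out.mean())                                   # close to 1: expectation is preserved
print(dropout(z, delta=0.2, training=False) is z)   # True: disabled at prediction
```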
Listing 7.6 FN network of depth d = 3 using a drop-out layer, ridge regularization and a
normalization layer
1 Network = list(Design, BrandEmb, RegionEmb) %>%
2   layer_concatenate(name='concate') %>%
3   layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
4   layer_dropout(rate=0.01) %>%
5   layer_dense(units=15, kernel_regularizer=regularizer_l2(0.0001),
6               activation='tanh', name='FNLayer2') %>%
7   layer_batch_normalization() %>%
8   layer_dense(units=10, activation='tanh', name='Network',
9   layer_dense(units=1, activation='exponential', name='Network',
10              weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))
Listing 7.6 gives an example, where we add a drop-out layer with a drop-out
probability of δ = 0.01 after the first FN layer, and in the second FN layer we apply
ridge regularization to the weights (w_{1,1}^{(2)}, ..., w_{q1,q2}^{(2)}), i.e., excluding the intercepts
w_{0,j}^{(2)}, 1 ≤ j ≤ q2. Both the drop-out layer and the regularization are only used during
the gradient descent fitting, and these network features are disabled during the
prediction.
Drop-out is closely related to ridge regularization as the following linear
Gaussian regression example shows; this consideration is taken from Section 18.6
of Efron–Hastie [117]. Assume we have a linear regression problem with square
loss function

   D(Y, β) = (1/2) Σ_{i=1}^n (Y_i − ⟨β, x_i⟩)²,

and assume that the columns of the design matrix X ∈ R^{n×q} are normalized (we
drop the intercept column), i.e.,

   n^{−1} Σ_{i=1}^n x_{i,j}² = 1,   for all 1 ≤ j ≤ q.

Drop-out replaces the components x_{i,j} by I_{i,j} x_{i,j}/(1 − δ), where the I_{i,j} are
i.i.d. Bernoulli random variables with P[I_{i,j} = 1] = 1 − δ, i.e., every individual
component x_{i,j} can drop out independently of the others; denote the resulting
square loss by D_I(Y, β). Gaussian MLE requires setting the gradient of D_I(Y, β)
w.r.t. β ∈ R^q equal to zero. The average score equation is given by (we average
over the drop-out random variables I_{i,j})

   E_δ[∇_β D_I(Y, β) | Y] = −X⊤Y + X⊤Xβ + (δ/(1−δ)) diag( Σ_{i=1}^n x_{i,1}², ..., Σ_{i=1}^n x_{i,q}² ) β
                          = −X⊤Y + X⊤Xβ + (δn/(1−δ)) β,

and this is set equal to zero, where we have used the normalization of the columns
of the design matrix X ∈ R^{n×q}. This is ridge regression in the linear Gaussian
case with a regularization parameter λ = δ/(2(1 − δ)) > 0 for δ ∈ (0, 1), see (6.9).
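This equivalence can be checked numerically. The sketch below (Python, with simulated data; all names are illustrative) computes the exact expectation of the drop-out gradient by enumerating all drop-out masks row by row, and compares it to the ridge-type score above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, q, delta = 50, 3, 0.1
X = rng.normal(size=(n, q))
X = X / np.sqrt((X**2).mean(axis=0))   # normalize columns: n^{-1} sum_i x_{ij}^2 = 1
Y = rng.normal(size=n)
beta = rng.normal(size=q)

def grad(Xt):
    # gradient of 0.5 * ||Y - Xt beta||^2 w.r.t. beta
    return -Xt.T @ Y + Xt.T @ Xt @ beta

# exact expectation of the drop-out gradient: enumerate all masks row by row
expected = np.zeros(q)
for i in range(n):
    for mask in itertools.product([0, 1], repeat=q):
        m = np.array(mask)
        p = np.prod(np.where(m == 1, 1 - delta, delta))  # probability of this mask
        xt = (m * X[i]) / (1 - delta)                    # inverted drop-out row
        expected += p * (-(xt * (Y[i] - xt @ beta)))

# ridge-type score: -X'Y + X'X beta + delta/(1-delta) diag(sum_i x_ij^2) beta
ridge = grad(X) + delta / (1 - delta) * (X**2).sum(axis=0) * beta
print(np.allclose(expected, ridge))  # True
```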
Normalization Layers
In (7.29) and (7.30) we have discussed that the continuous feature components
should be pre-processed so that all components live on the same scale, otherwise the
gradient descent fitting may not be efficient. A similar phenomenon may occur with
the learned representations z^(m:1)(x_i) in the FN layers 1 ≤ m ≤ d. In particular, this
is the case if we choose an unbounded activation function φ. For this reason, it can
be advantageous to rescale the components z_j^(m:1)(x_i), 1 ≤ j ≤ q_m, in a given FN
layer back to the same scale. To achieve this, a normalization step (7.30) is applied
to every neuron z_j^(m:1)(x_i) over the given cases i in the considered (mini-)batch. This
involves two more parameters (for the empirical mean and the empirical standard
deviation) in each neuron of the corresponding FN layer. Note, however, that all
these operations are of a linear nature. Therefore, they do not affect the predictive
model (i.e., these operations cancel in the scalar products in (7.6)), but they may
improve the performance of the gradient descent algorithm.
The code in Listing 7.6 uses a normalization layer on line 7. In our applications,
it has not been necessary to use these normalization layers, as they have not led to better
run times in SGD algorithms; note that our networks are not very deep and they use
the symmetric and bounded hyperbolic tangent activation function.
We have seen in Table 7.4 that our FN network outperforms the GLM for claim
frequency prediction in terms of a lower out-of-sample loss. We interpret this as
follows. Feature engineering has not been done optimally for the GLM, because the
FN network finds modeling structure that is not present in the
selected GLM. As a consequence, the FN network provides a better generalization
to unseen data, i.e., we can better predict new data on a granular level with the FN
network. However, having a more precise model on an individual policy level does
not necessarily imply that the model also performs better on a global portfolio level.
In our example we see that we may have smaller errors on an individual policy level,
but these smaller errors do not aggregate to a more precise model in the average
portfolio frequency. In our case, we have a misspecification of the average portfolio
frequency, see the last column of Table 7.4. This is a major deficiency in insurance
pricing because it may result in a misspecification of the overall price level, and this
requires a correction. We call this correction bias regularization.
(early stopped) SGD algorithm. The output of this fitted model reads as

   x_i ↦ μ̂_ϑ̂(x_i) = g^{−1}⟨β̂, ẑ^(d:1)(x_i)⟩ = g^{−1}( β̂_0 + Σ_{j=1}^{q_d} β̂_j ẑ_j^(d:1)(x_i) ),
306 7 Deep Learning
where the hat in ẑ^(d:1) indicates that we use the estimated weights ŵ_l^(m), 1 ≤ l ≤ q_m,
1 ≤ m ≤ d, in the FN layers. The balance property can be rectified by replacing β̂_0
by the solution β̄_0 of the following identity

   Σ_{i=1}^n v_i Y_i = Σ_{i=1}^n v_i g^{−1}( β̄_0 + Σ_{j=1}^{q_d} β̂_j ẑ_j^(d:1)(x_i) ).
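For the Poisson case with log-link g^{−1} = exp, this identity can be solved for the intercept in closed form (for a general link a one-dimensional root search does the job). The following Python sketch uses simulated stand-ins for the linear predictors of the fitted network; all names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
v = rng.uniform(0.5, 1.0, size=n)          # exposures
s = rng.normal(-2.4, 0.3, size=n)          # stand-in for sum_j beta_j z_j(x_i), intercept excluded
Y = rng.poisson(v * np.exp(s + 0.05)) / v  # observed frequencies; model slightly biased

# solve sum_i v_i Y_i = sum_i v_i exp(beta0 + s_i) for beta0 (log-link case)
beta0 = np.log((v * Y).sum()) - np.log((v * np.exp(s)).sum())
mu = np.exp(beta0 + s)
print(np.isclose((v * mu).sum(), (v * Y).sum()))  # True: balance property restored
```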
If we work with the canonical link g = h = (κ')^{−1}, we can do better because the
MLE of such a GLM automatically provides the balance property, see Corollary 5.7.
Choose the SGD learned network parameter ϑ̂ = (ŵ_1^(1), ..., ŵ_{q_d}^(d), β̂) ∈ R^r.
Denote by ẑ^(d:1) the fitted network architecture that is based on the estimated
weights ŵ_1^(1), ..., ŵ_{q_d}^(d). This allows us to study the learned representations of the
raw features x_1, ..., x_n in the last FN layer. We denote these learned representations
by

   ẑ_1 = ẑ^(d:1)(x_1), ..., ẑ_n = ẑ^(d:1)(x_n) ∈ {1} × R^{q_d}.   (7.32)

These learned representations can be used as new features to explain the response
Y. We define the feature engineered design matrix by

   X̂ = (ẑ_1, ..., ẑ_n)⊤ ∈ R^{n×(q_d+1)}.

Based on this new design matrix X̂ we can run a classical GLM receiving a unique
MLE β̂^MLE ∈ R^{q_d+1}, supposed that this design matrix has a full rank q_d + 1 ≤ n,
see Proposition 5.1. Since we work with the canonical link, this re-calibrated FN
network will automatically satisfy the balance property, and the resulting regression
function reads as

   x ↦ μ̂(x) = h^{−1}⟨β̂^MLE, ẑ^(d:1)(x)⟩.   (7.33)
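The essence of this additional GLM step can be sketched as follows (a Python illustration with simulated stand-ins for the learned representations ẑ_i; the book's actual implementation is the R code of Listing 7.7). Fitting a Poisson GLM with canonical log-link by Newton's method, the score equation of the intercept column forces the fitted total to match the observed total, which is exactly the balance property of Corollary 5.7.

```python
import numpy as np

rng = np.random.default_rng(1)
n, qd = 500, 3
# stand-ins for the learned last-layer representations zhat_i in {1} x R^qd
Z = np.hstack([np.ones((n, 1)), rng.normal(size=(n, qd))])
Y = rng.poisson(np.exp(Z @ np.array([0.2, 0.3, -0.1, 0.5])))

# Poisson GLM with canonical log-link, fitted by Newton's method (IRLS)
beta = np.zeros(qd + 1)
for _ in range(50):
    mu = np.exp(Z @ beta)
    score = Z.T @ (Y - mu)               # score equation of the GLM
    fisher = Z.T @ (mu[:, None] * Z)     # Fisher information
    beta += np.linalg.solve(fisher, score)

mu = np.exp(Z @ beta)
print(abs(mu.sum() - Y.sum()) < 1e-6)    # True: balance property holds at the MLE
```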
Example 7.12 (Balance Property in Networks) We apply this additional MLE step
to the two FN networks of Table 7.4. Note that in these two examples we consider
a Poisson model using the canonical link for g, thus, the resulting adjusted
network (7.33) will automatically satisfy the balance property, see Corollary 5.7.
In Listing 7.7 we illustrate the necessary code that has to be added to
Listings 7.1–7.3. On lines 7–8 of Listing 7.7 we retrieve the learned representations
(7.32), which are used as the new features in the Poisson GLM on lines 13–14.
The resulting MLE β̂^MLE ∈ R^{q_d+1} is imputed into the network parameter ϑ̂ on
lines 17–20. Table 7.5 shows the performance of the resulting bias regularized FN
networks.
Firstly, we observe from the last column of Table 7.5 that, indeed, the bias
regularization step (7.33) provides the balance property. In general, in-sample losses
(have to) decrease because β̂^MLE is (in-sample) at least as good as the early stopped
SGD solution β̂. Out-of-sample this leads to a small improvement in the one-hot
encoded variant and a small worsening in the embedding variant, i.e., the
latter slightly over-fits in this additional MLE step. However, these differences are
comparably small so that we do not worry further about the over-fitting here. This
closes this example.

Table 7.5 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in 10−2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5 and the FN network models (with one-hot encoding and embedding layers of dimension
b = 2, respectively), and their bias regularized counterparts

                                         Run     #        In-sample   Out-of-sample  Aver.
                                         time    param.   loss on L   loss on T      freq.
Poisson null                             –       1        25.213      25.445         7.36%
Poisson GLM3                             15 s    50       24.084      24.102         7.36%
One-hot FN (q1, q2, q3) = (20, 15, 10)   51 s    1'306    23.757      23.885         6.96%
Embed FN (q1, q2, q3) = (20, 15, 10)     120 s   792      23.694      23.820         7.24%
One-hot FN bias regularized              +4 s    1'306    23.742      23.878         7.36%
Embed FN bias regularized                +4 s    792      23.690      23.824         7.36%
We present another approach of correcting for the potential failure of the balance
property. This method does not depend on a particular type of regression model,
i.e., it can be applied to any regression model. This proposal goes back to Denuit et
al. [97], and it is based on the notion of auto-calibration introduced by Patton [297]
and Krüger–Ziegel [227]. We first describe auto-calibration and its implications.
Definition 7.13 The random variable Z is an auto-calibrated forecast of random
variable Y if E[Y |Z] = Z, a.s.
If the response Y is described by the features X = x, we consider the conditional
mean of Y , given X,
μ(X) = E [Y |X] .
This conditional mean μ(X) is an auto-calibrated forecast for the response Y. Use
the tower property and note that σ(μ(X)) ⊂ σ(X) to receive, a.s.,

   E[Y | μ(X)] = E[ E[Y | X] | μ(X) ] = E[ μ(X) | μ(X) ] = μ(X).
for all Bregman divergences D_ψ. Strassen's theorem tells us that μ_1 is more volatile
than μ_2 (both being auto-calibrated and unbiased for E[Y]), and this additional
volatility implies that the former auto-calibrated predictor can better follow Y. This
provides the superior forecast dominance of μ_1 over μ_2. This relation is most easily
understood by the following example. Consider (Y, X) as above. Assume that the
feature X̃ is a sub-variable of the feature X obtained by dropping some of the
components of X. Naturally, we have σ(X̃) ⊂ σ(X), and both sets of information
provide auto-calibrated forecasts

   μ(X) = E[Y | X]   and   μ̃(X̃) = E[Y | X̃].

The tower property and Jensen's inequality give for any convex function ψ (subject
to existence)

   E[ψ(μ(X))] = E[ψ(E[Y | X])] = E[ E[ ψ(E[Y | X]) | X̃ ] ]
              ≥ E[ ψ( E[ E[Y | X] | X̃ ] ) ] = E[ ψ( E[Y | X̃] ) ] = E[ψ(μ̃(X̃))].

Thus, we have μ(X) ≽cx μ̃(X̃), which implies forecast dominance of μ(X) over
μ̃(X̃). This makes perfect sense in view of σ(X̃) ⊂ σ(X). Basically, this describes
the construction of an F-martingale using an integrable random variable Y and a
filtration F on the underlying probability space (Ω, A, P). This martingale sequence
provides forecast dominance with increasing information sets described by the
filtration F.
We now turn our attention to the balance property and the unbiasedness of
predictors; this follows Denuit et al. [97]. Assume we have any predictor μ̂(x) of
Y; for instance, this can be any FN network predictor μ̂_ϑ̂(x) coming from an early
stopped SGD algorithm. We define its balance-corrected version by

   μ̂_BC(x) = E[Y | μ̂(x)].   (7.34)
   E_θ[d(Y, μ̂(X))] = E_θ[d(Y, μ)] − ( E_θ[d(Y, μ)] − E_θ[d(Y, μ(X))] )
                    + ( E_θ[d(Y, μ̂(X))] − E_θ[d(Y, μ(X))] ).
This expresses the expected deviance GL of the predictor μ(X) as the entropy (first
term), the conditional resolution (second term) and the conditional calibration (third
term). The conditional resolution describes the information gain in terms of forecast
dominance knowing the feature X, and the conditional calibration describes how
well we estimate μ(X). The conditional resolution is positive because μ(X) ≽cx μ
and the unit deviance d(Y, ·) is a convex function, see Lemma 2.22. The conditional
calibration is also positive, this can be seen by considering the deviance GL,
conditional on X.
We can reformulate this expected deviance GL in terms of the auto-calibration
property
   E_θ[d(Y, μ̂(X))] = E_θ[d(Y, μ)] − ( E_θ[d(Y, μ)] − E_θ[d(Y, μ̂_BC(X))] )
                    + ( E_θ[d(Y, μ̂(X))] − E_θ[d(Y, μ̂_BC(X))] ).
The first term is the entropy, the second term is called the auto-resolution and the
third term describes the auto-calibration. If we have an auto-calibrated forecast
μ̂(X), then the last term vanishes because μ̂(X) is equal to its balance-corrected
version μ̂_BC(X). Again, these two latter terms are positive; for the auto-calibration
this can be seen by considering the deviance GL, conditioned on μ̂(X).
To rectify the balance property we directly focus on (7.34), and we estimate
this conditional expectation. That is, the balance correction can be achieved by an
additional regression step directly estimating the balance-corrected version μ̂_BC(x)
in (7.34). This additional regression step differs from (7.33) as it does not use the
learned representations ẑ^(d:1)(x) in the last FN layer (7.32), but it uses the learned
representations in the output layer. That is, consider the learned features

   ẑ_1^# = (1, μ̂_ϑ̂(x_1))⊤, ..., ẑ_n^# = (1, μ̂_ϑ̂(x_n))⊤ ∈ {1} × R,

and perform an additional linear regression step for the response Y using the design
matrix

   X̂^# = (ẑ_1^#, ..., ẑ_n^#)⊤ ∈ R^{n×2},

with diagonal weight matrix V = diag(v_i)_{1≤i≤n}, i.e., we compute the weighted
least squares estimator

   β̂ = (β̂_0, β̂_1)⊤ = (X̂^{#⊤} V X̂^#)^{−1} X̂^{#⊤} V Y ∈ R².   (7.35)

The balance property is then restored by estimating the balance-corrected means
μ̂_BC(x_i) by

   μ̂_BC(x_i) = β̂_0 + β̂_1 μ̂_ϑ̂(x_i),   (7.36)
for 1 ≤ i ≤ n. Note that this can be done for any regression model since we do not
rely on the network architecture in this step.
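The steps (7.35)–(7.36) can be sketched as follows (Python, with simulated exposures and hypothetical predictions standing in for μ̂_ϑ̂(x_i)). Since the weighted least squares normal equations contain the intercept column, the v-weighted fitted total automatically matches the v-weighted observed total.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
v = rng.uniform(0.5, 1.0, size=n)          # exposures
mu_hat = rng.uniform(0.05, 0.15, size=n)   # stand-in for the network predictions
Y = rng.poisson(v * mu_hat * 1.1) / v      # observed frequencies; model slightly biased

# weighted least squares of Y on (1, mu_hat) with weights v, see (7.35)
X = np.column_stack([np.ones(n), mu_hat])
V = np.diag(v)
beta = np.linalg.solve(X.T @ V @ X, X.T @ V @ Y)
mu_bc = X @ beta                           # balance-corrected means (7.36)

print(np.isclose((v * mu_bc).sum(), (v * Y).sum()))  # True: balance property restored
```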
Remarks 7.18
• Balance correction (7.36) may lead to some conflict in range if the dual (mean)
parameter space M is (one-sided) bounded. Moreover, it does not consider the
deviance loss of the response Y, but rather underlies a Gaussian model by
using the weighted square loss function for finding (the Gaussian MLE) β̂ ∈ R².
Alternatively, we could consider the canonical link h that belongs to the chosen
EDF. This then allows us to study the regression problem on the canonical scale
by setting for the learned representations

   ẑ_1^θ = (1, h(μ̂_ϑ̂(x_1)))⊤, ..., ẑ_n^θ = (1, h(μ̂_ϑ̂(x_n)))⊤ ∈ {1} × Θ.   (7.37)
The latter motivates the consideration of a GLM under the chosen EDF

   x_i ↦ h(μ̂_BC(x_i)) = ⟨β, ẑ_i^θ⟩ = β_0 + β_1 h(μ̂_ϑ̂(x_i)),   (7.38)

for regression parameter β ∈ R². The choice of the canonical link and the
inclusion of an intercept will provide the balance property when estimating β
with MLE, see Corollary 5.7. If the mean estimates μ̂_ϑ̂(x_i) involve the canonical
link h, (7.38) reads as

   x_i ↦ h(μ̂_BC(x_i)) = ⟨β, ẑ_i^θ⟩ = β_0 + β_1 ⟨β̂, ẑ^(d:1)(x_i)⟩,

the latter scalar product being the output activation received from the FN
network. From this we see that the estimated balance-corrected calibration on the
canonical scale will give us a non-optimal (in-sample) estimation step compared
to (7.33), if we work with the canonical link h.
• Denuit et al. [97] give a proposal to break down the global balance to a local
version using a suitable kernel function; this will be further discussed in the next
Example 7.19.
This empirical version is obtained from the R library locfit [254] that allows us
to consider a local polynomial regression fit of degree deg=2, and we use a nearest
neighbor fraction of alpha=0.05; the code is provided in Listing 7.8. We use the
exposure v scaled version in (7.39) since the balance property should hold on that
scale, see Corollary 5.7. The claim counts are given by N = vY, and the exposure
Figure 7.12 (lhs) shows the empirical auto-calibration of (7.39) using the R
code of Listing 7.8. If the auto-calibration held exactly, then the black
dots would lie on the red diagonal line. We observe a very good match, which
indicates that the auto-calibration property holds quite accurately for our network
predictor (v, x) ↦ v μ̂_ϑ̂(x). For very small expectations E_θ(x)[N] we slightly
underestimate, and for bigger expectations we slightly overestimate. The blue line
shows the empirical density of the predictors v_i μ̂_ϑ̂(x_i), 1 ≤ i ≤ n, highlighting
heavy-tailedness and that the underestimation in the right tail will not substantially
contribute to the balance property as these are only very few insurance policies.
We explore the Gaussian balance correction (7.35) considering a linear regression
model with weighted square loss function. We receive the estimate β̂ = (9 · 10^{−4}, 1.005)⊤,
thus, μ̂_ϑ̂(x) only gets very gently distorted, see (7.36). The results of
this balance-corrected version μ̂_BC(x) are given on line 'Embed FN Gauss balance-corrected'
in Table 7.6. We observe that this approach is rather competitive, leading
to a slightly better model (out-of-sample). Figure 7.12 (rhs) shows the resulting
(empirical) auto-calibration plot which is still not fully in line with Proposition 7.16;
this empirical plot may be distorted by the exposures, by the fact that it is an
Fig. 7.12 (lhs) Empirical auto-calibration (7.39), the blue line shows the empirical density of the
predictors v_i μ̂_ϑ̂(x_i), 1 ≤ i ≤ n; (rhs) balance-corrected version using the weighted Gaussian
correction (7.35)
Table 7.6 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in 10−2 ) and in-sample average frequency of the Poisson null model, model Poisson GLM3
of Table 5.5, the FN network model (with embedding layers of dimension b = 2), and their bias
regularized and balance-corrected counterparts, the local correction uses a GAM with 2.6 degrees
of freedom in the cubic spline part
Run # In-sample Out-of-sample Aver.
time param. loss on L loss on T freq.
Poisson null – 1 25.213 25.445 7.36%
Poisson GLM3 15 s 50 24.084 24.102 7.36%
Embed FN (q1 , q2 , q3 ) = (20, 15, 10) 120 s 792 23.694 23.820 7.24%
Embed FN bias regularized +4 s 792 23.690 23.824 7.36%
Embed FN Gauss balance-corrected – 792 + 2 23.692 23.819 7.36%
Embed FN locally balance-corrected – 792 + 3.6 23.692 23.818 7.36%
empirical plot fitted with locfit, and by the fact that a linear Gaussian correction
estimate may not be fully suitable.
Denuit et al. [97] propose a local balance correction that is very much in the
spirit of the local polynomial regression fit with locfit. However, when using
locfit we did not pay any attention to the balance property. Therefore, we
proceed slightly differently here. In formula (7.37) we give the network predictors
on the canonical scale. This equips us with the data (Y_i, v_i, ẑ_i^θ)_{1≤i≤n}. To perform
a local balance correction we fit a generalized additive model (GAM) to this data,
using the canonical link, the Poisson deviance loss function, the observations Y_i,
the exposures v_i and the feature information ẑ_i^θ; for GAMs we refer to Hastie–
Tibshirani [181, 182], Wood [384] and Chapter 3 in Wüthrich–Buser [392]; in
particular, we proceed as in Example 3.4 of the latter reference.
The GAM regression fit on the canonical scale is illustrated in Fig. 7.13 (lhs).
We essentially receive a straight line, which says that the auto-calibration property is
already well satisfied by the FN network predictor μ̂_ϑ̂. In fact, it is not completely
a straight line, but GCV provides an optimal model with 2.6 effective degrees of
freedom in the natural cubic spline part. This local (GAM) balance correction leads
to another small model improvement (out-of-sample), see the last line of Table 7.6.
Conclusion The balance property adjustment and the bias regularization are crucial
in ensuring that the predictive model is on the right (price) level. We have presented
three sophisticated methods of balance property adjustments: the additional
GLM step under the canonical link choice (7.33), the model-free global Gaussian
correction (7.35)–(7.36), and the local balance correction using a GAM under the
canonical link choice. In our example, the results of the three different approaches
are rather similar. In the sequel, we use the additional GLM step solution (7.33), the
reason being that under this approach we can rely on one single regression model
that directly predicts the claims. The other two approaches need two steps to get the
predictions, which requires the storage of two models.
Fig. 7.13 (lhs) GAM fit on the canonical scale having 2.6 effective degrees of freedom (red shows
the estimated confidence bounds); (rhs) balance-corrected version using the local GAM correction
From Table 7.5 we conclude that the FN networks find systematic structure in the
data that is not present in model Poisson GLM3, thus, the feature engineering for
the GLM can be improved. Unfortunately, FN networks neither directly build on
GLMs nor do they highlight the weaknesses of GLMs. In this section we discuss
a proposal presented in Wüthrich–Merz [394] and Schelldorfer–Wüthrich [329]
of combining two regression approaches. We are going to boost a GLM with FN
network features. Typically, boosting is applied within the framework of regression
trees. It goes back to the work of Valiant [362], Kearns–Valiant [209, 210], Schapire
[328], Freund [139] and Freund–Schapire [140]. The idea behind boosting is to
analyze the residuals of a given regression model with a second regression model
to see whether this second regression model can still find systematic effects in the
residuals which have not been discovered by the first one.
We start from the GLM studied in Chap. 5, and we boost this GLM with a FN
network. Assume that both regression models act on the same feature space X ⊂
{1} × R^{q_0}. The GLM provides a regression function, for link function g and GLM
parameter β^GLM ∈ R^{q_0+1},

   x ↦ μ^GLM(x) = g^{−1}⟨β^GLM, x⟩.
[Fig. 7.14: illustration of the combined regression function (7.40) using a GLM (in a skip connection) and a FN network; inputs Area, Power, VehAge, DrivAge, Bonus, VehBrEmb, VehGas, Density, RegEmb; response Y]
(coming from the GLM) with a non-linear one (coming from the FN network).
This has the flavor of a Taylor expansion to combine terms of different orders.
Secondly, skip connections can also be beneficial for gradient descent fitting
because the inputs have a more direct link to the outputs, and the network only
builds the functional form around the function in the skip connection.
• There are numerous variants of (7.40). A straightforward one is to choose a
weight α ∈ (0, 1) and consider the regression function

   x ↦ μ(x) = g^{−1}( α ⟨β^GLM, x⟩ + (1 − α) ⟨β^FN, z^(d:1)(x)⟩ ).   (7.41)
The indicator 1_{χ=j} chooses the GLM that belongs to the corresponding
insurance portfolio χ ∈ {1, 2, 3} with the (individual) GLM parameter β^GLM_χ.
The FN network term makes them related, i.e., the GLMs of the different
insurance portfolios interact (jointly learn) via the FN network module. This is
the approach used in Gabrielli et al. [149] to improve the chain-ladder reserving
method by learning across different claims reserving triangles.
The regression function (7.40) gives the structural form of the combined
regression model, but there is a second important ingredient proposed by Wüthrich–
Merz [394]. Namely, the gradient descent algorithm (7.15) for model fitting can be
started in an initial network parameter ϑ^(0) ∈ R^{q_0+1+r} that corresponds to the MLE
of the GLM. Denote by β̂^GLM the MLE of the GLM part, only.
Choose the initial value of the gradient descent algorithm for the fitting of the
combined regression model (7.40)

   ϑ^(0) = ( β̂^GLM, w_1^(1), ..., w_{q_d}^(d), β^FN ≡ 0 ) ∈ R^{q_0+1+r},   (7.42)

that is, initially, no signals traverse the FN network part because we set β^FN ≡ 0.
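The effect of the initialization (7.42) can be sketched as follows (a Python stand-in with random weights; all names and the simple tanh layer are illustrative): with β^FN ≡ 0 the FN part contributes nothing to the linear predictor, so the gradient descent starts exactly in the GLM.

```python
import numpy as np

rng = np.random.default_rng(4)
n, q0, qd = 10, 5, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, q0))])  # features incl. intercept
beta_glm = rng.normal(size=q0 + 1)                          # stand-in for the fitted GLM parameter
W = rng.normal(size=(qd, q0 + 1))                           # stand-in weights of the FN layers

def z_net(x):
    # stand-in for the learned representation z^(d:1)(x) in {1} x R^qd
    return np.concatenate([[1.0], np.tanh(W @ x)])

beta_fn = np.zeros(qd + 1)                                  # initialization (7.42): beta^FN = 0

def mu_glm(x):
    return np.exp(beta_glm @ x)                             # GLM with log-link

def mu_combined(x):
    # combined regression function (7.40) with log-link g^{-1} = exp
    return np.exp(beta_glm @ x + beta_fn @ z_net(x))

# at the initialization (7.42) the combined model coincides with the GLM
print(all(np.isclose(mu_combined(x), mu_glm(x)) for x in X))  # True
```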
Remarks 7.21
• Using the initialization (7.42), the gradient descent algorithm starts exactly in
the optimal GLM. The algorithm then tries to improve this GLM w.r.t. the given
loss function using the additional FN network features. If the loss substantially
reduces during the gradient descent training, the GLM misses systematic
structure and can be improved; otherwise the GLM is already good (enough).
• We can declare the MLE β̂^GLM to be non-trainable. In that case the original
GLM always remains in the combined regression model and it acts as an offset.
If we declare the MLE β̂^GLM to be non-trainable, we could choose a trainable
credibility weight α ∈ (0, 1), see (7.41), which gradually reduces the influence
of the GLM (if necessary).
Implementation of the general combined regression model (7.40) can be a bit
cumbersome, see Listing 4 in Gabrielli et al. [149], but things can be substantially
simplified by declaring the GLM part in (7.40) as being non-trainable, i.e.,
estimating β^GLM by β̂^GLM in the GLM, and then freezing this parameter. In view
of (7.40) this simply means that we add an offset o_i = ⟨β̂^GLM, x_i⟩ to the FN
network that is treated as a prior difference between the different data points; we
refer to Sect. 5.2.3.
Example 7.22 (Combined GLM and FN Network) We revisit the French MTPL
claim frequency GLM of Sect. 5.3.4, and we boost model Poisson GLM3 with FN
network features. For the FN architecture we use the structure depicted in Fig. 7.14,
i.e., a FN network of depth d = 3 having (q1 , q2 , q3 ) = (20, 15, 10) neurons, and
using embedding layers of dimension b = 2 for the categorical feature components.
Moreover, we declare the GLM part to be non-trainable which allows us to use the
GLM as an offset in the FN network. In addition, we apply bias regularization (7.33)
to receive the balance property.
The results are presented in Table 7.7. A first observation is that using model
Poisson GLM3 as an offset reduces the run time of gradient descent fitting because
we start the algorithm already in a reasonable model. Secondly, as expected, the
Table 7.7 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in 10−2 ) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5, the FN network model (with embedding layers of dimension b = 2), and the combined
regression model GLM3+FN, see (7.40)
Run # In-sample Out-of-sample Aver.
time param. loss on L loss on T freq.
Poisson null – 1 25.213 25.445 7.36%
Poisson GLM3 15 s 50 24.084 24.102 7.36%
Embed FN (q1 , q2 , q3 ) = (20, 15, 10) 120 s 792 23.694 23.820 7.24%
Embed FN bias regularized +4 s 792 23.690 23.824 7.36%
Combined GLM+FN (20, 15, 10) +53 s 50 + 792 23.772 23.834 7.24%
Combined GLM+FN bias regularized +4 s 50 + 792 23.765 23.830 7.36%
FN features decrease the loss of model Poisson GLM3, this indicates that there
are systematic effects that are not captured by the GLM. The final combined and
regularized model has roughly the same out-of-sample loss as the corresponding
FN network, showing that this approach can be beneficial in run times, and the
predictive power is similar to a pure FN network.
Example 7.23 (Improving Model Poisson GLM3) In this example we would like to
explore the deficiencies of model Poisson GLM3 by boosting it with FN network
features. We do this in a systematic way by only considering two (continuous)
feature components at a time in the FN network. That is, we consider the combined
approach (7.40) with initialization (7.42), but as feature information for the network
part, we only consider two components at a time. For instance, we start with the
features (1, Area, VehPower) ∈ {1} × R2 for the network part, and the remaining
feature information is ignored in this step. This way we can test whether the
marginal modeling of Area and VehPower is suitable in model Poisson GLM3,
and whether a pairwise interaction in these two components is missing. We train
this FN network starting from model Poisson GLM3 (and keeping this GLM part
frozen). The decrease in the out-of-sample loss during the gradient descent training
is shown in Fig. 7.15 (top-left). We observe that the loss remains rather constant over
100 training epochs. This tells us that the pair (Area, VehPower) is appropriately
considered in model Poisson GLM3.
Figure 7.15 gives all pairwise plots of the continuous feature components Area,
VehPower, VehAge, DrivAge, BonusMalus, Density, the scale on the y-
axis is identical in all plots. We observe that only the plots including the variable
BonusMalus provide a bigger decrease in loss (in blue color in the colored
version). This indicates that mainly this feature component is not modeled optimally
in model Poisson GLM3, because boosting with a FN network finds systematic
structure here that improves the loss of model Poisson GLM3. In model Poisson
GLM3, the variable BonusMalus has been modeled log-linearly with an interac-
tion term with DrivAge and (DrivAge)2 , see (5.35). Table 7.8 shows the result
if we add a FN network feature (7.40) for the pair (DrivAge, BonusMalus)
to model Poisson GLM3. Indeed, we see that the resulting combined GLM-FN
network model has the same GL as the full FN network approach. Thus, we
conclude that model Poisson GLM3 performs fairly well and only the modeling
of the pair (DrivAge, BonusMalus) should be improved.
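The combined construction of this example can be made concrete with a small simulation. The following Python sketch (simulated data and a toy architecture; it is an illustration of the idea of (7.40)–(7.42), not the book's R code or the MTPL data) fits a log-linear "GLM part", freezes it as an offset, and then trains a small network adjustment on a feature pair by gradient descent on the Poisson deviance; since the network output weights are initialized at zero, training starts exactly at the GLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: Poisson counts with a pairwise interaction the GLM misses
# (all names and numbers here are illustrative).
n = 5000
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
v = np.ones(n)                                  # unit exposures
mu_true = np.exp(-2.0 + 0.3 * x1 + 0.5 * x2 + 0.8 * x1 * x2)
N = rng.poisson(v * mu_true)

# Step 1: "GLM part", log-linear in x1, x2 only (no interaction).
X = np.column_stack([np.ones(n), x1, x2])
beta = np.zeros(3)
for _ in range(200):                            # plain GD on the Poisson deviance
    mu = np.exp(X @ beta)
    beta -= 0.1 * X.T @ (v * mu - N) / n
glm_offset = X @ beta                           # kept frozen from now on

# Step 2: boost with a tiny one-hidden-layer net on the pair (x1, x2);
# combined log-mean = frozen GLM offset + network adjustment, cf. (7.40).
q = 10
W1 = rng.normal(0.0, 0.5, (2, q)); b1 = np.zeros(q)
w2 = np.zeros(q); b2 = 0.0                      # zero output init = start at the GLM, cf. (7.42)

def net(Xp):
    z = np.tanh(Xp @ W1 + b1)
    return z @ w2 + b2, z

def deviance(N, m):
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(N > 0, N * np.log(N / m), 0.0)
    return 2 * np.mean(m - N + t)

Xp = np.column_stack([x1, x2])
lr = 0.05
for _ in range(500):
    adj, z = net(Xp)
    mu = np.exp(glm_offset + adj)
    g = (v * mu - N) / n                        # gradient of the deviance w.r.t. log-mean (up to factor 2)
    gz = np.outer(g, w2) * (1.0 - z**2)         # backprop through tanh (uses current w2)
    w2 -= lr * z.T @ g
    b2 -= lr * g.sum()
    W1 -= lr * Xp.T @ gz
    b1 -= lr * gz.sum(axis=0)

loss_glm = deviance(N, np.exp(glm_offset))
loss_boost = deviance(N, np.exp(glm_offset + net(Xp)[0]))
print(round(loss_glm, 4), round(loss_boost, 4))  # boosting lowers the in-sample loss
```

In a deep learning framework the same effect is obtained by feeding the frozen GLM linear predictor as a non-trainable offset into the output layer; the in-sample loss decrease then signals a missing interaction in the GLM, exactly as in Fig. 7.15.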
Ensemble learning is a popular way of expressing that one takes an average over
different predictors. There are many established methods that belong to the family of
ensemble learning, e.g., there is bootstrap aggregating (called bagging) introduced
by Breiman [51], there are random forests, and there is boosting. Random forests
320 7 Deep Learning
Fig. 7.15 Exploring all pairwise interactions: out-of-sample losses over 100 gradient descent
epochs for all pairs of the continuous feature components Area, VehPower, VehAge,
DrivAge, BonusMalus, Density (the scale on the y-axis is identical in all plots)
and boosting are mainly based on classification and regression trees (CARTs) and
they belong to the most powerful machine learning methods for tabular data. These
methods combine a family of predictors to a more powerful predictor. The present
section is inspired by the bagging method of Breiman [51], and we perform network
aggregating (called nagging).
Table 7.8 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in 10−2 ) and in-sample average frequency of the Poisson null model, model Poisson GLM3
of Table 5.5, model Poisson GLM3 with additional FN features for (DrivAge, BonusMalus),
the FN network model (with embedding layers of dimension b = 2), and the combined regression
model GLM3+FN, see (7.40)
Run time  # param.  In-sample loss on L  Out-of-sample loss on T  Aver. freq.
Poisson null – 1 25.213 25.445 7.36%
Poisson GLM3 15 s 50 24.084 24.102 7.36%
GLM3 +FN(DrivAge, BonusMalus) – 50 + 792 23.804 23.805 7.36%
Embed FN bias regularized 124 s 792 23.690 23.824 7.36%
Combined GLM+FN bias regularized 72 s 50 + 792 23.765 23.830 7.36%
Fig. 7.16 Boxplots over 1 600 network calibrations only differing in the seeds for the SGD
algorithm and the partitioning of the learning data: (lhs) in-sample losses on L and (rhs) out-
of-sample losses on T ; the horizontal lines show the calibration chosen in Table 7.5; units are in
10−2
Before doing so, we would like to understand whether there is some dependence
between the in-sample and the out-of-sample losses over the M = 1 600 runs of
the SGD algorithm with different seeds. In Fig. 7.17 we provide a scatter plot of
the out-of-sample losses vs. the in-sample losses. This plot is complemented by
a cubic spline regression (in orange color). From this plot we conclude that the
models with very small in-sample losses tend to over-fit, and the models with large
in-sample losses tend to under-fit (always using the same early stopping rule). In
view of these results we conclude that the chosen early stopping rule is sensible
because on average it tends to provide the model with the smallest out-of-sample
loss on T . Recall that we do not use T during the SGD fitting, but only the learning
data L that is split into the training data U and the validation data V for exercising
the early stopping, see Fig. 7.7.
Fig. 7.17 Scatter plot of the out-of-sample losses vs. the in-sample losses for the M = 1 600 SGD
calibrations; the legend distinguishes the single calibrations, the cubic spline and the empirical
mean
For each run of the SGD algorithm we receive a different (early stopped) network
parameter estimate \(\widehat{\vartheta}^{\,m} \in \mathbb{R}^{r}\), \(1 \le m \le M = 1\,600\). Using these parameter
estimates we receive the estimated network regression functions, for \(1 \le m \le M\),
\[
x \;\mapsto\; \widehat{\mu}^{\,m}(x) = \mu_{\widehat{\vartheta}^{\,m}}(x).
\]
Since we choose the seeds of the SGD runs at random we may (and will) assume
that we have independence between the prices \((\widehat{\mu}^{\,m}_t)_{t \in \mathcal{T}}\) of the different runs \(1 \le m \le M\) of the SGD algorithm. This allows us to estimate the average price and the
coefficient of variation of these prices of a fixed insurance policy t over the different
SGD runs
\[
\bar{\mu}^{(1:M)}_t = \frac{1}{M}\sum_{m=1}^{M} \widehat{\mu}^{\,m}_t
\qquad \text{and} \qquad
\widehat{\mathrm{Vco}}_t = \frac{1}{\bar{\mu}^{(1:M)}_t}
\sqrt{\frac{1}{M-1}\sum_{m=1}^{M}\Big(\widehat{\mu}^{\,m}_t - \bar{\mu}^{(1:M)}_t\Big)^{2}}.
\tag{7.43}
\]
These (out-of-sample) coefficients of variation are illustrated in Fig. 7.18. We
observe a considerable variation on some policies. The average coefficient of
variation is roughly 10% (orange horizontal line, lhs). The maximal coefficient of
variation is about 40%, thus, for this policy the individual prices \(\widehat{\mu}^{\,m}_t\) of the different
SGD runs \(1 \le m \le M\) fluctuate considerably around \(\bar{\mu}^{(1:M)}_t\). This now explains
why we choose M = 1 600 SGD runs, namely, the averaging in (7.43) reduces the
coefficient of variation on this policy to \(40\%/\sqrt{M} = 40\%/40 = 1\%\); note that we
have independence between the different SGD runs. Thus, by averaging we receive
an acceptable influence of the variation of the individual SGD fittings.
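The \(\sqrt{M}\) reduction of the coefficient of variation in (7.43) can be checked numerically. The following Python sketch simulates M = 1 600 independent "prices" for a single policy with a 40% coefficient of variation, mimicking the worst policy above (all numbers are illustrative, not the actual MTPL fits).

```python
import numpy as np

rng = np.random.default_rng(1)

# M independent SGD "prices" for one policy t: mean 0.10, Vco = 40%
# (a gamma distribution is used purely for illustration).
M = 1600
prices = rng.gamma(shape=1 / 0.4**2, scale=0.10 * 0.4**2, size=M)

mu_bar = prices.mean()                        # nagging price, cf. (7.43)
vco_single = prices.std(ddof=1) / mu_bar      # ~0.40 for a single run
vco_nagging = vco_single / np.sqrt(M)         # ~0.40/40 = 1% for the average

print(round(vco_single, 3), round(vco_nagging, 4))
```

The averaging thus brings the price uncertainty of even the worst policy down to roughly 1%, which is the reason for the choice M = 1 600 in the text.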
Listing 7.9 shows the 10 policies (out-of-sample) with the largest coefficients
of variation \(\widehat{\mathrm{Vco}}_t\). These policies have in common that they belong to the lowest
BonusMalus level, the drivers are very young, the cars are comparably old and
they have a rather high vehicle power. From a practical point of view we should doubt
these policies, since the information provided may not be correct. New drivers (at
the age of 18) typically enter a bonus-malus scheme at level 100, and only after
several accident-free years can these drivers reach a bonus-malus level of 50. Thus,
policies as in Listing 7.9 should not exist, and our pricing framework has difficulties
to (correctly) handle them. In practice, this needs further investigation because,
obviously, there is a data issue here.
Fig. 7.18 Coefficients of variation \(\widehat{\mathrm{Vco}}_t\) over the 1 600 SGD runs: (lhs) scatter plot against the
average estimated frequency, with average, 1 std.dev. and cubic spline; (rhs) histogram of the
coefficients of variation
Listing 7.9 The 10 policies (out-of-sample) with the largest coefficients of variation
1 Area VehPower VehAge DrivAge BonusMalus VehBrand VehGas Region vco
2 D 8 16 18 50 B11 Regular R53 0.4089006
3 D 9 17 20 50 B11 Regular R24 0.3827665
4 C 8 11 18 50 B5 Regular R24 0.3762306
5 C 9 18 18 50 B5 Regular R24 0.3697370
6 C 7 17 18 50 B1 Regular R24 0.3579979
7 C 9 19 19 50 B5 Regular R24 0.3554879
8 C 6 15 20 50 B1 Regular R93 0.3528679
9 C 7 14 19 50 B1 Regular R53 0.3518279
10 A 11 20 50 50 B13 Regular R74 0.3442184
11 D 5 14 18 50 B3 Diesel R24 0.3403783
Nagging Predictor
The previously observed variations of the prices motivate averaging over the
different models (network calibrations). This brings us to bagging introduced by
Breiman [51]. Bagging is based on averaging/aggregating over several ‘indepen-
dent’ predictions; this is done in three steps. In a first step, a model is fitted to the
data L. In a second step, independent bootstrap samples L∗(m) are generated from
this fitted model; the independence has to be understood in a conditional sense,
namely, the different bootstrap samples L∗(m) are independent in m, given the data
L. In the third step, for every bootstrap sample L∗(m) one estimates a model μm ,
and averaging (7.43) provides the bagging predictor. Bagging is mainly a variance
reduction technique. Note that if the fitted model of the first step has a bias, then
likely the bootstrap samples L∗(m) are biased, and so is the bagging predictor.
Therefore, bagging does not help to reduce a potential bias. All these results have to
be understood conditionally on the data L. If this data is atypical for the problem,
so will the bootstrap samples be.
We can perform a similar analysis for the fitted networks, but we do not need to
bootstrap, here, because the various elements of randomness in SGD fitting allow us
to generate independent predictors μm , conditional on the data L. Averaging (7.43)
over these predictors then provides us with the network aggregating (nagging)
predictor μ̄(1:M) ; we also refer to Dietterich [105] and Richman–Wüthrich [315]
for this aggregation. Thus, we replace the bootstrap step by the different runs of
the SGD algorithm. Both options provide independent predictors μm , conditional
on the data L. However, there is a fundamental difference between bagging and
nagging. Bagging generates new (bootstrap) samples L∗(m) and, thus, bagging also
involves randomness coming from sampling the new observations. Nagging always
acts on the same sample L, and it only refits the model multiple times. Therefore,
the latter will typically introduce less variation. Of course, bagging and nagging can
be combined, and then the full expected GL can be estimated; we come back to this
in Sect. 11.4 below. We do not sample new observations here, because we would
like to understand the variations implied by the SGD algorithm with early stopping
on the given (fixed) data.
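The contrast between the two sources of randomness can be made explicit in a deliberately simple toy example (the "model fit" is just a mean estimate; everything here is illustrative, not the book's setting): bagging refits on bootstrap resamples of L, whereas nagging keeps L fixed and only randomizes the fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed learning data L (simulated Poisson claim counts).
n, M = 200, 100
N = rng.poisson(0.1, n)

def sgd_like_fit(N, seed):
    """Nagging flavor: same data, randomness only in the fitting procedure
    (here a random half-sample pass mimicking SGD with early stopping)."""
    r = np.random.default_rng(seed)
    idx = r.permutation(len(N))[: len(N) // 2]
    return N[idx].mean()

def bagged_fit(N, seed):
    """Bagging flavor: refit on a bootstrap resample of L."""
    r = np.random.default_rng(seed)
    return N[r.integers(0, len(N), len(N))].mean()

nag = np.array([sgd_like_fit(N, s) for s in range(M)])
bag = np.array([bagged_fit(N, s) for s in range(M)])

# Conditionally on L, both families of fits are independent across seeds,
# and both aggregate back to (nearly) the empirical mean of L.
print(round(nag.mean(), 4), round(bag.mean(), 4), round(N.mean(), 4))
```

The point of the sketch is structural: in both cases the predictors are conditionally independent given L, so the averaging argument (7.43) applies to either; they differ only in where the randomness is injected.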
In Fig. 7.18 we have seen that we need nagging over 1 600 network calibrations
so that the maximal coefficient of variation on an individual policy level is below
1% in our MTPL example. In this section we would like to understand the minimal
out-of-sample loss that can be achieved by nagging on the (entire) test data set, and
we would like to analyze its rate of convergence.
For this purpose, we consider the nagging predictor
\[
\bar{\mu}^{(1:M)}(x) = \frac{1}{M}\sum_{m=1}^{M} \widehat{\mu}^{\,m}(x) \qquad \text{for } M \ge 1. \tag{7.44}
\]
This allows us to study the out-of-sample losses on T in the Poisson model for
\(M \ge 1\)
\[
D\big(\mathcal{T}, \bar{\mu}^{(1:M)}\big) = \frac{2}{T}\sum_{t=1}^{T} v_t^{\dagger}
\left[\bar{\mu}^{(1:M)}(x_t^{\dagger}) - Y_t^{\dagger} - Y_t^{\dagger}\log\frac{\bar{\mu}^{(1:M)}(x_t^{\dagger})}{Y_t^{\dagger}}\right].
\]
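A minimal Python sketch of this computation (simulated exposures, observations and network predictions; not the MTPL data) evaluates the out-of-sample Poisson deviance of the nagging predictor (7.44) for a growing number of aggregated fits.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative test set: exposures v, observed frequencies Y = N/v, and
# M noisy model fits per policy (all simulated).
T, M = 1000, 50
v = rng.uniform(0.1, 1.0, T)
mu_true = rng.gamma(2.0, 0.05, T)                   # true frequencies
N = rng.poisson(v * mu_true)
Y = N / v
mu_hat = mu_true[None, :] * rng.lognormal(0.0, 0.3, (M, T))  # M noisy fits

def poisson_deviance(Y, v, mu):
    """Out-of-sample Poisson deviance D(T, mu), cf. the display above."""
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(Y > 0, Y * np.log(mu / Y), 0.0)
    return (2 / len(Y)) * np.sum(v * (mu - Y - t))

# Nagging predictor (7.44) for growing M: averaging reduces the loss.
losses = [poisson_deviance(Y, v, mu_hat[:m].mean(axis=0)) for m in (1, 10, 50)]
print([round(l, 4) for l in losses])
```

A single noisy fit (m = 1) carries a clearly larger out-of-sample deviance than the aggregated predictors, which previews the monotonicity result discussed next.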
Remark 7.24 From Remarks 7.17 we know that the expected deviance GL of
the estimated model is lower bounded by the expected deviance GL of the true
data generating model; the difference is the conditional calibration. Within the
family of Tweedie’s CP models Richman–Wüthrich [315] proved that, indeed,
aggregating decreases monotonically the expected deviance GL of the estimated
model (Proposition 2 of [315]), convergence is established (Proposition 3 of [315]),
Proof of Proposition 7.25 The lower bound on the right-hand side immediately
follows from Theorem 4.19. For an estimate \(\widehat{\mu} > 0\) we define the function (we
also refer to (4.18)), setting for the canonical link \(h_p = (\kappa_p')^{-1}\),
\[
\widehat{\mu} \;\mapsto\; \psi_p(\widehat{\mu}) = \mu\, h_p(\widehat{\mu}) - \kappa_p\big(h_p(\widehat{\mu})\big) =
\begin{cases}
\mu \log(\widehat{\mu}) - \widehat{\mu} & \text{for } p = 1,\\[4pt]
\mu\,\dfrac{\widehat{\mu}^{\,1-p}}{1-p} - \dfrac{\widehat{\mu}^{\,2-p}}{2-p} & \text{for } p \in (1,2),\\[4pt]
-\mu/\widehat{\mu} - \log(\widehat{\mu}) & \text{for } p = 2.
\end{cases}
\]
This is the part of the log-likelihood (and deviance loss) that depends on the
canonical parameter \(\theta = h_p(\widehat{\mu})\), with the observation Y replaced by the mean μ.
Calculating the second derivative w.r.t. \(\widehat{\mu}\) provides for \(p \in [1, 2]\)
\[
\frac{\partial^{2}}{\partial \widehat{\mu}^{2}}\,\psi_p(\widehat{\mu})
= -p\,\mu\,\widehat{\mu}^{\,-p-1} - (1-p)\,\widehat{\mu}^{\,-p}
= \widehat{\mu}^{\,-(1+p)}\big[-p\mu - (1-p)\,\widehat{\mu}\big] \le 0,
\]
the last inequality uses that the square bracket is non-positive, a.s., under our
assumptions on \(\widehat{\mu}\). Thus, \(\psi_p\) is concave on the interval \((0,\, p/(p-1)\,\mu)\). We now
focus on the inequalities for \(M \ge 1\). Consider the decomposition of the nagging
predictor for \(M+1\)
\[
\bar{\mu}^{(1:M+1)} = \frac{1}{M+1}\sum_{j=1}^{M+1} \bar{\mu}^{(-j)},
\qquad \text{where} \qquad
\bar{\mu}^{(-j)} = \frac{1}{M}\sum_{m=1}^{M+1} \widehat{\mu}^{\,m}\,\mathbb{1}_{\{m \ne j\}}.
\]
The predictors \(\bar{\mu}^{(-j)}\), \(j \ge 1\), are copies of \(\bar{\mu}^{(1:M)}\), though not independent ones.
Using the function \(\psi_p\), the second term on the right-hand side has the same structure
as the estimation risk function (4.14),
\[
\begin{aligned}
\mathbb{E}_{\theta}\big[d\big(Y, \bar{\mu}^{(1:M)}\big)\big]
&= \mathbb{E}_{\theta}\big[d\big(Y, \bar{\mu}^{(1:M+1)}\big)\big]
 + 2\,\mathbb{E}_{\theta}\Big[Y h_p\big(\bar{\mu}^{(1:M+1)}\big) - \kappa_p\big(h_p\big(\bar{\mu}^{(1:M+1)}\big)\big)\Big] \\
&\qquad - 2\,\mathbb{E}_{\theta}\Big[Y h_p\big(\bar{\mu}^{(1:M)}\big) - \kappa_p\big(h_p\big(\bar{\mu}^{(1:M)}\big)\big)\Big] \\
&= \mathbb{E}_{\theta}\big[d\big(Y, \bar{\mu}^{(1:M+1)}\big)\big]
 + 2\Big(\mathbb{E}\big[\psi_p\big(\bar{\mu}^{(1:M+1)}\big)\big] - \mathbb{E}\big[\psi_p\big(\bar{\mu}^{(1:M)}\big)\big]\Big) \\
&= \mathbb{E}_{\theta}\big[d\big(Y, \bar{\mu}^{(1:M+1)}\big)\big]
 + 2\left(\mathbb{E}\bigg[\psi_p\bigg(\frac{1}{M+1}\sum_{j=1}^{M+1}\bar{\mu}^{(-j)}\bigg)\bigg] - \mathbb{E}\big[\psi_p\big(\bar{\mu}^{(1:M)}\big)\big]\right) \\
&\ge \mathbb{E}_{\theta}\big[d\big(Y, \bar{\mu}^{(1:M+1)}\big)\big]
 + 2\left(\mathbb{E}\bigg[\frac{1}{M+1}\sum_{j=1}^{M+1}\psi_p\big(\bar{\mu}^{(-j)}\big)\bigg] - \mathbb{E}\big[\psi_p\big(\bar{\mu}^{(1:M)}\big)\big]\right) \\
&= \mathbb{E}_{\theta}\big[d\big(Y, \bar{\mu}^{(1:M+1)}\big)\big],
\end{aligned}
\]
the second last step applies Jensen's inequality to the concave function \(\psi_p\), and the
last step follows from the fact that \(\bar{\mu}^{(-j)}\), \(j \ge 1\), are copies of \(\bar{\mu}^{(1:M)}\). \(\square\)
Remarks 7.26
• Proposition 7.25 says that aggregation works, i.e., aggregating i.i.d. predictors
leads to monotonically decreasing expected deviance GLs. In fact, if \(\widehat{\mu} \le 2\mu\),
a.s., we receive Tweedie's forecast dominance by aggregating, restricted to the
power variance parameters \(p \in [1, 2]\), see Definition 4.22.
• The i.i.d. assumption can be relaxed; indeed, it is sufficient that every \(\bar{\mu}^{(-j)}\)
in the above proof has the same distribution as \(\bar{\mu}^{(1:M)}\). This does not require
independence between the predictors \(\widehat{\mu}^{\,m}\), \(m \ge 1\), but exchangeability is
sufficient.
• We need the condition \(0 < \widehat{\mu} \le p/(p-1)\,\mu\), a.s., to ensure the monotonicity
within Tweedie's CP models. For the Poisson model p = 1 we can drop the
upper bound, and we only need the lower bound to ensure the existence of the
expected deviance GL. For \(p \in (1, 2]\) the upper bound is increasingly binding,
in the gamma case p = 2 requiring \(\widehat{\mu} \le 2\mu\), a.s.
• Note that we do not require unbiasedness of \(\widehat{\mu}\) for μ in Proposition 7.25. Thus,
at this stage, aggregating is a variance reduction technique.
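The monotonicity statement of Proposition 7.25 can be verified by Monte Carlo simulation in the Poisson case p = 1, where only positivity of the predictors is needed. The following Python sketch aggregates i.i.d. predictors over nested subsets (the predictor distribution is an illustrative choice, not derived from the book).

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo check of Proposition 7.25 for the Poisson case (p = 1):
# E[d(Y, mu_bar^(1:M))] decreases monotonically in M for i.i.d. predictors.
R, mu = 200_000, 1.0
Y = rng.poisson(mu, R)
# i.i.d. positive predictors mu_hat^m, independent of Y (illustrative law
# with mean mu and coefficient of variation 50%).
preds = rng.gamma(4.0, mu / 4.0, (8, R))

def mean_poisson_dev(Y, m):
    """Empirical mean of the Poisson unit deviance d(Y, m)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(Y > 0, Y * np.log(Y / m), 0.0)
    return np.mean(2 * (m - Y + t))

devs = [mean_poisson_dev(Y, preds[:M].mean(axis=0)) for M in (1, 2, 4, 8)]
print([round(d, 4) for d in devs])  # monotonically decreasing in M
```

Note that the predictors are deliberately not unbiased in any conditional sense; as stated in the last bullet point, the decrease is purely a variance reduction effect.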
Fig. 7.19 Out-of-sample losses \(D(\mathcal{T}, \bar{\mu}^{(1:M)})\) of the nagging predictors \((\bar{\mu}^{(1:M)}(x_t^{\dagger}))_{1\le t\le T}\)
for \(1 \le M \le 40\), with 1 standard deviation bounds; losses are in 10−2
The uniformly integrable upper bound is only needed in the Poisson case p = 1,
because the other cases are covered by \(0 < \widehat{\mu} \le p/(p-1)\,\mu\), a.s. Moreover,
asymptotic normality can be established; we refer to Proposition 4 in Richman–
Wüthrich [315].
We come back to our MTPL Poisson claim frequency example and its 1 600
network calibrations illustrated in Fig. 7.17. Figure 7.19 provides the out-of-sample
portfolio losses D(T , μ̄(1:M) ) of the resulting nagging predictors (μ̄(1:M) (x †t ))1≤t ≤T
for 1 ≤ M ≤ 40 in red color, and the corresponding 1 standard deviation confidence
bounds in orange color. The blue horizontal dotted line shows the case M = 1
which exactly refers to the (first) bias regularized FN network μm=1 with embedding
layers given in Table 7.5. Indeed, averaging over multiple networks improves the
predictive model and the out-of-sample loss decreases over the first 2 ≤ M ≤ 10
nagging steps. After the first 10 steps the picture starts to stabilize which indicates
that for this size of portfolio (and this type of problem) we need to average over
roughly 10–20 FN networks to receive optimal predictive models on the portfolio
level. For M → ∞ the out-of-sample loss converges to the green horizontal dotted
line in Fig. 7.19 of 23.783 · 10−2 . These numbers are also reported on the last line
of Table 7.9.
Figure 7.20 provides the empirical auto-calibration property (7.39) of the
nagging predictor μ̄(1:1600); this is obtained completely analogously to Fig. 7.12.
Table 7.9 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in 10−2 ) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5, the FN network models (with embedding layers of dimension b = 2), and the nagging
predictor for M = 1 600
Run time  # param.  In-sample loss on L  Out-of-sample loss on T  Aver. freq.
Poisson null – 1 25.213 25.445 7.36%
Poisson GLM3 15 s 50 24.084 24.102 7.36%
Embed FN bias regularized \(\widehat{\mu}^{\,m=1}\) +4 s 792 23.690 23.824 7.36%
Average over 1 600 SGDs (Fig. 7.16) – 792 23.728 23.819 7.36%
Nagging FN \(\bar{\mu}^{(1:M)}\), M = 1 600 ∞ ‘792’ 23.691 23.783 7.36%
Fig. 7.20 Empirical auto-calibration (7.39) of the Poisson nagging predictor; the blue line shows
the empirical density of \(v_i\,\bar{\mu}^{(1:1600)}(x_i)\), \(1 \le i \le n\) (right axis)
The nagging predictors are (already) bias regularized, and Fig. 7.20 supports that
the auto-calibration property holds rather accurately.
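The empirical auto-calibration check behind a figure like Fig. 7.20 can be sketched as follows: bin the policies by the value of the predictor and compare, within each bin, the average observation with the average prediction. The Python sketch below uses a simulated portfolio that is auto-calibrated by construction (all numbers illustrative, not the MTPL data), so it only demonstrates the mechanics of the check.

```python
import numpy as np

rng = np.random.default_rng(5)

# Empirical check of the auto-calibration property (7.39): within bins of
# the predictor, average observations should match average predictions.
n = 200_000
mu_hat = rng.gamma(2.0, 0.05, n)          # predicted frequencies
v = np.ones(n)
N = rng.poisson(v * mu_hat)               # auto-calibrated by construction

# Decile bins of the predictor.
q = np.quantile(mu_hat, np.linspace(0, 1, 11))
bins = np.clip(np.digitize(mu_hat, q[1:-1]), 0, 9)

gaps = []
for b in range(10):
    sel = bins == b
    obs, pred = N[sel].mean(), (v * mu_hat)[sel].mean()
    gaps.append(abs(obs - pred))
    print(b, round(obs, 4), round(pred, 4))  # close in every bin
```

For a real predictor the per-bin gaps quantify how far the model is from auto-calibration; the bias regularization step is what pulls these gaps towards zero.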
At this stage, we have fully arrived at Breiman’s [53] two modeling cultures
dilemma, see also Sect. 1.1. We have started from a parametric data model, and
in order to boost its predictive performance we have combined such models in
an algorithmic way. Working with many blended networks is not really practical,
therefore, in such situations, a meta model can be fitted to the resulting nagging
predictor.
Meta Model
Since working with M = 1 600 different FN networks is not practical, we fit a meta
model to the nagging predictors \(\bar{\mu}^{(1:M)}(\cdot)\). This can easily be done by selecting
an additional FN network and fitting it to the working data
\[
\mathcal{D}^{*} = \Big\{ \big(\bar{\mu}^{(1:M)}(x_i),\, x_i,\, v_i\big) : i = 1, \ldots, n \Big\}
\cup \Big\{ \big(\bar{\mu}^{(1:M)}(x_t^{\dagger}),\, x_t^{\dagger},\, v_t^{\dagger}\big) : t = 1, \ldots, T \Big\}.
\]
Table 7.10 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in 10−2 ) and in-sample average frequency of the Poisson null model, model Poisson GLM3
of Table 5.5, the FN network model (with embedding layers of dimension b = 2), the nagging
predictor, and the meta network model
Run time  # param.  In-sample loss on L  Out-of-sample loss on T  Aver. freq.
Poisson null – 1 25.213 25.445 7.36%
Poisson GLM3 15 s 50 24.084 24.102 7.36%
Embed FN bias regularized \(\widehat{\mu}^{\,m=1}\) +4 s 792 23.690 23.824 7.36%
Nagging FN \(\bar{\mu}^{(1:M)}\) ∞ ‘792’ 23.691 23.783 7.36%
Meta FN network \(\widehat{\mu}^{\rm meta}\) – 792 23.714 23.777 7.36%
For this calibration step we can consider all data, since we would like to fit a
regression model as accurately as possible to the entire regression surface formed by
all nagging predictors from the learning and the test data sets L and T . Moreover,
this step should not over-fit since this regression surface of nagging predictors
does not include any noise, but it is on the level of expected values. As network
architecture we choose again the same FN network of depth d = 3. The only
change to the fitting procedure above is replacing the Poisson deviance loss by the
square loss function, since we do not work with the Poisson responses Ni but rather
with their mean estimates μ̄(1:M) (x i ) and μ̄(1:M) (x †t ) in this fitting step. Since the
resulting meta network model may still have a bias we apply the bias regularization
step of Listing 7.7 to the Poisson observations with the Poisson deviance loss on the
learning data L (only). The results are presented in Table 7.10.
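The key point of the meta-model step, namely fitting with the square loss to noise-free targets on the level of expected values, can be illustrated in a few lines. In the sketch below a polynomial least-squares surrogate stands in for the meta FN network, and the nagging surface is a made-up smooth function (both are illustrative simplifications of the book's procedure).

```python
import numpy as np

rng = np.random.default_rng(6)

# Sketch of the meta-model step: fit a surrogate with the SQUARE loss to
# the nagging predictions. The targets carry no observation noise, they
# are on the level of expected values, so over-fitting is not a concern.
n = 5000
x = rng.uniform(-1, 1, n)
mu_nagging = np.exp(-2.0 + 0.5 * x + 0.3 * x**2)  # illustrative nagging surface

# Least-squares surrogate on the log-scale (a stand-in for the meta FN
# network trained with the square loss on the working data D*).
X = np.column_stack([x**k for k in range(4)])
coef, *_ = np.linalg.lstsq(X, np.log(mu_nagging), rcond=None)
mu_meta = np.exp(X @ coef)

# The surrogate reproduces the nagging surface almost perfectly.
print(round(np.max(np.abs(mu_meta / mu_nagging - 1)), 6))
```

Because the regression surface of nagging predictors is smooth and noise-free, a flexible surrogate can match it essentially exactly; in the book's setting a final bias regularization step on L then restores the balance property.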
From these results we observe that in our case the meta network performs
similarly well to the nagging predictor, and it seems to be a very reasonable choice.
Finally, in Fig. 7.21 (lhs) we analyze the resulting frequencies on an individual
policy level on the test data set T . We plot the estimated frequencies μm=1 (x †t ) of
the first FN network (this corresponds to ‘embed FN bias regularized’ in Table 7.10
with an out-of-sample loss of 23.824) against the nagging predictor μ̄(1:M) (x †t )
which averages over M = 1 600 networks. From Fig. 7.21 (lhs) we conclude
that there are quite some differences between these two predictors, this exactly
reflects the variations obtained in Fig. 7.18 (lhs). The nagging predictor removes this
variation by averaging. Figure 7.21 (rhs) compares the nagging predictor μ̄(1:M) (x †t )
to the one of the meta model μmeta (x †t ). This scatter plot shows that the predictors
lie almost perfectly on the diagonal line which suggests that the meta model can be
used as a substitute for the nagging predictor. This completes this claim frequency
modeling example.
Remark 7.27 The meta model concept can also be useful in other situations. For
instance, we can fit a gradient boosting regression model to the observations.
Typically, this is much faster than calculating a nagging predictor (because it directly
focuses on the weaknesses of the existing model). If the gradient boosting model
is based on regression trees, it has the disadvantage that the resulting regression
Fig. 7.21 Scatter plots of the out-of-sample predictions \(\widehat{\mu}^{\,m=1}(x_t^{\dagger})\), \(\bar{\mu}^{(1:M)}(x_t^{\dagger})\) and
\(\widehat{\mu}^{\rm meta}(x_t^{\dagger})\) over all policies \(1 \le t \le T\) on the test data set T : (lhs) \(\widehat{\mu}^{\,m=1}(x_t^{\dagger})\) vs.
\(\bar{\mu}^{(1:M)}(x_t^{\dagger})\) and (rhs) \(\widehat{\mu}^{\rm meta}(x_t^{\dagger})\) vs. \(\bar{\mu}^{(1:M)}(x_t^{\dagger})\); the color scale shows the exposures
\(v_t^{\dagger} \in (0, 1]\)
Example 7.28 (Gamma Claim Size Modeling) We revisit the gamma claim size
example of Sect. 5.3.7. The data comprises Swedish motorcycle claim amounts. We
have seen that this claim size data is not heavy-tailed, thus, a gamma distribution
may be a reasonable choice for this data. For the modeling of this data we use the
same normalization as in (5.45); this parametrization does not require the explicit
knowledge of the (constant) shape parameter of the gamma distribution for mean
estimation.
The difficulty with this data is that only 656 insurance policies suffer a claim,
and likely a single FN network will not lead to stable results in this example.
As FN network architecture we again choose a network of depth d = 3 and
with (q1 , q2 , q3 ) = (20, 15, 10) neurons. Since the input layer has dimension
q0 = 1 + 6 = 7 we receive a network parameter of dimension r = 626. As loss
function we choose the gamma deviance loss, see Table 4.1. Moreover, we choose
the nadam optimizer, a batch size of 300, a training-validation split of 8:2, and we
retrieve the network calibration with the lowest validation loss with a callback.
Figure 7.22 shows the results of 1 000 different SGD runs (only differing in the
initial seeds and the splits of the training-validation sets as well as the batches).
We see a considerable variation between the different SGD runs, both in in-sample
deviance losses but also in the average estimated claims. Note that we did not bias-
regularize the resulting networks (we work with the log-link here which is not the
canonical one). This is why we receive fluctuating portfolio averages in Fig. 7.22
Fig. 7.22 Results of the 1 000 SGD calibrations (only differing in the seeds): (lhs) in-sample
deviance losses and (rhs) portfolio means over the 1 000 calibrations
Fig. 7.23 Coefficients of variations Vcoi on an individual claim level 1 ≤ i ≤ n over the 1 000
calibrations (lhs) scatter plot against the nagging predictor μ̄(1:M) (x i ) and (rhs) histogram
(rhs), the red line illustrates the empirical mean. Obviously, these FN networks are
(on average) positively biased, and they will need a bias correction for the final
prediction.
Figure 7.23 analyzes the variations on an individual claim level by studying
the in-sample version of the coefficient of variation given in (7.43). We see that
these coefficients of variation are bigger than in the claim frequency example, see
Fig. 7.18. Thus, to receive stable results the nagging predictors μ̄(1:M) (x i ) have to be
calculated over many networks. Figure 7.24 confirms that aggregating reduces (in-
sample) losses also in this case. From this figure we also see that the convergence is
slower compared to the MTPL frequency example of Fig. 7.19, of course, because
we have a much smaller claims portfolio.
Fig. 7.24 In-sample losses of the nagging predictors for \(1 \le M \le 40\) on the motorcycle claim
size data
Table 7.11 Number of parameters, Pearson’s dispersion estimate, MLE dispersion estimate, in-
sample losses and in-sample average claim amounts of the null model (gamma intercept model),
the gamma GLMs and the network nagging predictor; for the GLMs we refer to Table 5.13
# param.  Pearson dispersion \(\widehat{\varphi}^{P}\)  MLE dispersion \(\widehat{\varphi}^{\rm MLE}\)  In-sample loss on L  Average amount
Gamma null 1+1 2.057 1.690 2.085 24’641
Gamma GLM1 9+1 1.537 1.426 1.717 25’105
Gamma GLM2 7+1 1.544 1.427 1.719 25’130
Gamma FN network nagging 626 + 1 – – 1.478 26’387
Gamma FN network nagging (bias reg) 626 + 1 1.050 1.240 1.465 24’641
Table 7.11 presents the results if we take the nagging predictor over 1 000
different networks. The first observation is that we receive a much smaller in-sample
loss compared to the GLMs, thus, there seems to be much room for improvements in
the GLMs. Secondly, the nagging predictor has a substantial bias. For this reason we
shift the intercept parameter in the output layer so that the portfolio average of the
nagging predictor is equal to the empirical mean, see the last column of Table 7.11.
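For a log-link model this bias correction amounts to a constant shift of the output intercept, i.e., a multiplicative rescaling of all predictions. A minimal Python sketch (illustrative numbers, not the motorcycle data):

```python
import numpy as np

rng = np.random.default_rng(7)

# Bias regularization for a log-link model: shift the output intercept so
# that the portfolio average of the predictor matches the empirical mean.
claims = rng.gamma(2.0, 12_000.0, 656)                   # observed claim sizes
mu_hat = claims.mean() * rng.lognormal(0.08, 0.2, 656)   # biased predictions

shift = np.log(claims.mean() / mu_hat.mean())            # intercept correction
mu_corrected = np.exp(np.log(mu_hat) + shift)

print(round(mu_corrected.mean(), 2), round(claims.mean(), 2))  # now equal
```

Since the shift acts multiplicatively, the correction restores the portfolio balance exactly while leaving the relative ordering of the policies unchanged.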
A main difficulty in this model is the estimation of the dispersion parameter
ϕ > 0 and the shape parameter α = 1/ϕ of the gamma distribution, respectively.
Pearson’s dispersion estimate does not work because we do not know the degrees
of freedom of the nagging predictor, see also (5.49). In Table 7.11 we calculate
Pearson’s dispersion estimate by simply dividing by the number of observations;
this should be understood as a lower bound; this number is highlighted in italic.
Alternatively, we can calculate the MLE, however, this may be rather different from
Pearson's estimate, as indicated in Table 7.11. Figure 7.25 (lhs) shows the resulting
QQ plot of the nagging predictor if we use the MLE \(\widehat{\varphi}^{\rm MLE} = 1.240\), and the right-
hand side shows the same plot for \(\widehat{\varphi} = 1.050\). From these plots it seems that we
should rather go for a smaller dispersion parameter, the MLE probably being too
much dominated by the small claims. This observation should also be understood as
a red flag, as it tells us that the chosen gamma model is not fully suitable. This may
Fig. 7.25 QQ plots of the nagging predictors against the gamma density with (lhs)
\(\widehat{\varphi}^{\rm MLE} = 1.240\) and (rhs) \(\widehat{\varphi} = 1.050\)
Fig. 7.26 (lhs) Scatter plot of model Gamma GLM2 predictors against the nagging predictors
μ̄(1:M) (x i ) over all instances 1 ≤ i ≤ n, (rhs) scatter plot of two (independent) nagging predictors
be for various reasons: (1) the dispersion is not constant and should be modeled
policy dependent, (2) the features are not sufficient to explain the observations,
or (3) the gamma distribution is not suitable and should be replaced by another
distribution.
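Pearson's dispersion estimate for a gamma model uses the variance function \(V(\mu) = \mu^2\); since the degrees of freedom of the nagging predictor are unknown, the text divides by the number of observations, giving a lower-bound reading. A small Python sketch (illustrative means and dispersion, not the motorcycle fits):

```python
import numpy as np

rng = np.random.default_rng(8)

# Pearson's dispersion estimate for a gamma model, V(mu) = mu^2:
#   phi_P = (1/d.o.f.) * sum((Y - mu)^2 / mu^2).
# With unknown degrees of freedom we simply divide by n.
n, phi_true = 656, 0.5                       # shape alpha = 1/phi = 2
mu = rng.lognormal(10.0, 0.4, n)             # fitted means (illustrative)
Y = rng.gamma(1 / phi_true, phi_true * mu)   # gamma with mean mu, Vco^2 = phi

phi_pearson = np.mean((Y - mu) ** 2 / mu**2)
print(round(phi_pearson, 3))                 # close to phi_true = 0.5
```

Dividing by n instead of the (unknown, smaller) effective degrees of freedom biases the estimate downwards, which is why Table 7.11 flags this number as a lower bound.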
In Fig. 7.26 (lhs) we compare the predictions received from model Gamma
GLM2 against the nagging predictors μ̄(1:M) (x i ) over all instances 1 ≤ i ≤ n.
The scatter plot spreads quite wildly around the diagonal which seriously questions
at least one of the two models. To ensure that this variability between the two models
is not caused by the (complex) FN network architecture, we verify the nagging
Fig. 7.27 Empirical auto-calibration (7.39) of the Gamma FN network nagging predictor of
Table 7.11; the blue line shows the empirical density of the estimated claim sizes, \(1 \le i \le n\)
(right axis)
Zhou et al. [406] ask the question whether ensembling over ‘selected’ networks is
better than ensembling over all networks. In their proposal they introduce a weighted
averaging scheme over the different network predictors μm , 1 ≤ m ≤ M. We
perform a slightly different analysis here. We are re-using the M = 1 600 SGD
calibrations of the Poisson FN network illustrated in Fig. 7.17. We order these SGD
calibrations w.r.t. their in-sample losses \(D(\mathcal{L}, \widehat{\mu}^{\,m})\), \(1 \le m \le M\), and partition this
ordered sample into three equally sized sets: the first one containing the smallest
ordered sample into three equally sized sets: the first one containing the smallest
Fig. 7.28 Empirical density of the in-sample losses \(D(\mathcal{L}, \widehat{\mu}^{\,m})\), \(1 \le m \le M\), of Fig. 7.17
in-sample losses, the second one the middle sized in-sample losses, and the third
one the largest in-sample losses. Figure 7.28 shows the empirical density of these
in-sample losses, and the vertical lines give the partition into the three sets, we call
the resulting (disjoint) index sets I small , I middle , I large ⊂ {1, . . . , M}. Remark that
this partition is done fully in-sample, based on the learning data L, only.
We then consider the nagging predictors on each of these index sets separately,
i.e.,
\[
\bar{\mu}^{\rm small}(x) = \frac{1}{|I^{\rm small}|}\sum_{m \in I^{\rm small}} \widehat{\mu}^{\,m}(x),
\qquad
\bar{\mu}^{\rm middle}(x) = \frac{1}{|I^{\rm middle}|}\sum_{m \in I^{\rm middle}} \widehat{\mu}^{\,m}(x),
\qquad
\bar{\mu}^{\rm large}(x) = \frac{1}{|I^{\rm large}|}\sum_{m \in I^{\rm large}} \widehat{\mu}^{\,m}(x).
\tag{7.46}
\]
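The partitioning and per-subset aggregation of (7.46) is a purely in-sample operation on the ordered losses; it can be sketched as follows (simulated losses and predictions, not the actual 1 600 calibrations).

```python
import numpy as np

rng = np.random.default_rng(9)

# Sketch of (7.46): order M calibrations by their in-sample loss, split the
# order into three equal index sets, and aggregate each set separately.
M, T = 1500, 1000
in_losses = rng.normal(23.7, 0.03, M)              # in-sample losses D(L, mu^m)
preds = rng.lognormal(np.log(0.1), 0.3, (M, T))    # predictions on the test set

order = np.argsort(in_losses)                      # fully in-sample ordering
I_small, I_middle, I_large = np.array_split(order, 3)

mu_small = preds[I_small].mean(axis=0)             # subset nagging predictors
mu_middle = preds[I_middle].mean(axis=0)
mu_large = preds[I_large].mean(axis=0)

print(len(I_small), len(I_middle), len(I_large), mu_middle.shape)
```

Since the split uses only the learning data L, the out-of-sample comparison of the three subset predictors on T remains a proper testing strategy.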
If we believe in the orange cubic spline in Fig. 7.17, the middle nagging predictor
\(\bar{\mu}^{\rm middle}\) should out-perform the other two nagging predictors. Indeed, this is the case
here. We receive the out-of-sample losses (in 10−2 ) on the three subsets
This approach boosts by far any other approach considered, see Table 7.10; note that
this analysis relies on a fully proper in-sample and out-of-sample testing strategy.
Moreover, this also supports our early stopping strategy because, obviously, the
optimal networks are centered around our early stopping rule. How does this result
match Proposition 7.25 saying that the nagging predictor has a monotonically
That is, the expected claim count \(\mathbb{E}_{\theta_i}[N_i] = v_i\,\mu(x_i)\) is assumed to scale
proportionally in the exposure \(v_i > 0\). Figure 7.29 raises some doubts whether this
is really the case, or at least SGD fitting has some difficulties in assessing the expected
frequencies \(\mu(x_i)\) on the policies i with short exposures \(v_i\). We discuss this
further in the next subsection. Table 7.12 gives a summary of our results.
Analysis of Over-dispersion
Despite all the excitement about Fig. 7.29, the above models do not fit the observations
since the over-dispersion is too large, see the last column of Table 7.12. This has
motivated the study of the negative binomial model in Sect. 5.3.5, the ZIP model in
Sect. 5.3.6, and the hurdle Poisson model in Example 6.19. These models have led
to an improvement in terms of AIC, see Table 6.6. We could go down the same
Table 7.12 Number of parameters, in-sample and out-of-sample deviance losses (units are in
10−2 ), in-sample average frequency and (over-)dispersion of the Poisson null model, model Poisson
GLM3 of Table 5.5, the FN network model (with embedding layers of dimension b = 2), the
nagging predictor, the meta network model, and the middle nagging predictor
# param.  In-sample loss on L  Out-of-sample loss on T  Aver. freq.  Disp. \(\widehat{\varphi}^{P}\)
Poisson null 1 25.213 25.445 7.36% 1.7160
Poisson GLM3 50 24.084 24.102 7.36% 1.6644
Embed FN bias regularized μm=1 792 23.690 23.824 7.36% 1.6812
Nagging FN μ̄(1:M) ‘792’ 23.691 23.783 7.36% 1.6592
Meta FN network μmeta 792 23.714 23.777 7.36% 1.6737
Middle nagging FN μ̄middle ‘792’ 23.698 23.272 7.36% 1.6618
route here by substituting the Poisson model. We refrain from doing so, as we
want to further analyze the Poisson model. Suppose we calculate an AIC value for
the Poisson FN network using 792 as the number of parameters involved. In that
case, we receive a value of 191 790, thus, clearly lower than the one of the negative
binomial GLM, and also slightly lower than the one of the hurdle Poisson model,
see Table 6.6. Remark that AIC values within FN networks are not supported by
any theory as we neither use the MLE nor do we have a reasonable evaluation of the
number of parameters involved in networks. Thus, such a value may serve at best as
a rough rule of thumb.
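For the record, the rule-of-thumb computation behind the quoted number is just the AIC formula \( \mathrm{AIC} = 2k - 2\,\ell \); the reported value of 191 790 with k = 792 corresponds to an in-sample log-likelihood of −95 103 (derived by inverting the formula, not a separately reported figure):

```python
# Rough AIC-style comparison. As noted above, this is a rule of thumb only:
# for networks neither the MLE property nor the effective parameter count
# is well founded.
def aic(k, loglik):
    """AIC = 2k - 2 * loglik."""
    return 2 * k - 2 * loglik

print(aic(792, -95_103.0))  # the network value quoted in the text: 191790.0
```

Whatever number of "parameters" one plugs in for a network, the comparison across model classes should be read with the caveats stated above.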
This lower AIC value suggests that we should try to improve the modeling of the systematic effects by better regression functions. In particular, there may be more explanatory variables involved that have predictive power. If these explanatory variables are latent, we can rely on the negative binomial model, as it can be interpreted as a mixture model averaging over latent variables. In view of Fig. 7.29, the exposures v_i seem to have a predictive power different from proportional scaling, see (7.48); we also mention some peculiarities of the exposures on page 556. This motivates changing the FN network regression model such that the exposures are considered non-proportionally. We choose a FN network that directly models the mean of the claim counts
\[
(x, v) \in \mathcal{X} \times (0,1] \;\mapsto\; \mu(x,v) = \exp\big\langle \beta, z^{(d:1)}(x,v)\big\rangle > 0, \tag{7.49}
\]

modeling the mean E_ϑ[N] = μ(x, v) of the Poisson datum (N, x, v). The expected frequency is then given by E_ϑ[Y] = E_ϑ[N/v] = μ(x, v)/v.
Remark 7.29 At this stage we clearly have to distinguish between statistical
modeling and actuarial modeling. In statistical modeling it makes perfect sense
to choose the regression function (7.49), since including the exposure in a non-
proportional way may increase the predictive power of the model, at least this is
what our data suggests.
7.4 Special Features in Networks 339
From an actuarial point of view this approach should clearly be doubted. The typical exposure of car insurance policies is one calendar year, i.e., v = 1, if the renewals of insurance policies are accounted for correctly. Shorter exposures may have a specific (non-predictable) reason; for example, the policyholder or the insurance company may terminate an insurance contract after a claim. Thus, if this is possible, the exposure is a random variable, too, and it clearly has predictive power for claims prediction; in that case we lose the properties of the Poisson count process (having independent and stationary increments).
As a consequence, we should include the exposure proportionally from an actuarial modeling point of view. Nevertheless, we do the modeling exercise based on the regression function (7.49) here. This will indicate the predictive power of the exposure, which may be thought of as a proxy for another (non-available) explanatory variable. Moreover, if (7.49) allows for a good Poisson regression model, we have a simple way of bootstrapping from our data (conditionally on the given exposures v).
We would also like to emphasize that if one feature component dominates all
others in terms of the predictive power, then likely there is a leakage of information
through this component, and this needs a more careful analysis.
We implement the FN network regression model (7.49) using again a network architecture of depth d = 3 with (q_1, q_2, q_3) = (20, 15, 10) neurons. We use embedding layers for the two categorical variables VehBrand and Region, and we have 8 continuous/binary feature components. This is one more compared to Fig. 7.9 (rhs) because we also model the exposure v_i as a continuous input to the network. As a result, the dimension r of the network parameter ϑ ∈ R^r increases from 792 to 812 (because we have q_1 = 20 neurons in the first FN layer). We calculate the nagging predictor μ̄^(1:M) of this network by averaging over M = 500 individual (early stopped) FN network calibrations; the results are presented in Table 7.13.
Table 7.13 Number of parameters, in-sample and out-of-sample deviance losses (units are in 10^{-2}), in-sample average frequency and (over-)dispersion of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network models (with embedding layers of dimension b = 2), the nagging predictors, and the middle nagging predictors excluding and including exposures v_i as continuous network inputs

                                          # param.   In-sample    Out-of-sample   Aver.   Disp. φ^P
                                                     loss on L    loss on T       freq.
Poisson null                                   1       25.213        25.445       7.36%    1.7160
Poisson GLM3                                  50       24.084        24.102       7.36%    1.6644
Embed FN μ̂^(m=1)                             792       23.690        23.824       7.36%    1.6812
Nagging FN μ̄^(1:M)                         ‘792’       23.691        23.783       7.36%    1.6592
Middle nagging FN μ̄^middle                 ‘792’       23.698        23.272       7.36%    1.6618
Exposure v: FN μ̂^(m=1)                       812       23.358        23.496       7.36%    1.0650
Exposure v: nagging FN μ̄^(1:M)             ‘812’       23.299        23.382       7.36%    1.0416
Exposure v: middle nagging FN μ̄^middle     ‘812’       23.303        23.299       7.36%    1.0427
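The nagging predictors in these tables average M fitted networks on the mean scale. A minimal numpy sketch of this averaging on simulated toy data (the noisy "calibrations" below only mimic SGD randomness; they are not the MTPL fits, and all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_deviance(y, mu):
    # average Poisson deviance loss; the y*log(y/mu) term is 0 for y = 0
    dev = np.zeros_like(mu)
    mask = y > 0
    dev[mask] = y[mask] * np.log(y[mask] / mu[mask])
    return 2.0 * np.mean(dev - (y - mu))

n, M = 10_000, 20
true_mu = rng.uniform(0.03, 0.15, size=n)   # "true" expected frequencies
y = rng.poisson(true_mu)                    # observed claim counts

# M noisy calibrations of the same architecture (noise mimics SGD randomness)
preds = true_mu[None, :] * np.exp(rng.normal(0.0, 0.3, size=(M, n)))

# nagging predictor: average the M predictions on the mean scale
mu_nagging = preds.mean(axis=0)

ind_losses = [poisson_deviance(y, preds[m]) for m in range(M)]
print(np.mean(ind_losses), poisson_deviance(y, mu_nagging))
```

By convexity of the Poisson deviance in the mean, the averaged predictor cannot be worse than the average individual loss (Jensen's inequality), which is the mechanism behind the improved out-of-sample losses in the tables.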
[Figure: comparison of the model including exposures proportionally (blue), the model including exposures non-proportionally through the FN network (black), and the observed frequency (red); legend: FN w/o exposure, FN exposure, observed]
In the previous section we have studied ensembles of FN networks. One may also aim at directly comparing these networks to each other in terms of the fitted network parameters ϑ̂^(j) over the different calibrations 1 ≤ j ≤ M (of the same FN network architecture). Such a comparison may, e.g., be useful if one wants to choose a
\[
g(\mu(x)) = \beta_0 + \sum_{j=1}^{q_1} \beta_j z_j^{(1:1)}(x) = \beta_0 + \sum_{j=1}^{q_1} \beta_j\, \phi\big\langle w_j^{(1)}, x\big\rangle
\]
\[
= \beta_0 + \sum_{j \neq k} \beta_j\, \phi\big\langle w_j^{(1)}, x\big\rangle + (-\beta_k)\, \phi\big\langle -w_k^{(1)}, x\big\rangle. \tag{7.50}
\]
From this we see that the following two network parameters (we switch signs in all
the parameters that belong to index k)
\[
\vartheta = \big(w_1^{(1)}, \ldots, w_k^{(1)}, \ldots, w_{q_1}^{(1)}, \beta_0, \ldots, \beta_k, \ldots, \beta_{q_1}\big) \qquad\text{and}
\]
\[
\widetilde{\vartheta} = \big(w_1^{(1)}, \ldots, -w_k^{(1)}, \ldots, w_{q_1}^{(1)}, \beta_0, \ldots, -\beta_k, \ldots, \beta_{q_1}\big)
\]
give the same FN network predictions. Beside these sign switches, we can also
permute the enumeration of the neurons in a given FN layer, giving the same
predictions. We discuss Theorem 2 of Rüger–Ossen [323] to solve this identifiability
issue. First, we consider the network weights from the input x to the first FN layer
z^{(1)}(x). Apply the sign switch operation (7.50) to the neurons in the first FN layer so that all the resulting intercepts w_{0,1}^{(1)}, …, w_{0,q_1}^{(1)} are positive, while not changing the regression function x ↦ g(μ(x)). Next, apply a permutation to the indices 1 ≤ j ≤ q_1 so that we receive ordered intercepts

\[
w_{0,1}^{(1)} > \ldots > w_{0,q_1}^{(1)} > 0,
\]

and, proceeding analogously in every FN layer 1 ≤ m ≤ d,

\[
w_{0,1}^{(m)} > \ldots > w_{0,q_m}^{(m)} > 0,
\]
provided that all intercepts are different from zero and mutually different within the same FN layer. As stated in Section 2.2 of Rüger–Ossen [323], there may still exist different parameters in this fundamental domain that provide the same predictive model, but these are of zero Lebesgue measure. The same applies to intercepts w_{0,j}^{(m)} being zero or to equal intercepts for different neurons. Basically, this means that we are fine if we work with absolutely continuous prior distributions on the fundamental domain when we want to work within a Bayesian setup.
7.5 Auto-encoders
We typically center the data matrix Y, providing Σ_{i=1}^n y_{i,j} = 0 for all 1 ≤ j ≤ q; normalization w.r.t. the standard deviation can be done, but is not always necessary. Centering implies that we can interpret Y as a q-dimensional empirical distribution with each component (column) being centered. The covariance matrix of this (centered) empirical distribution is calculated as

\[
\widehat{\Sigma} = \left(\frac{1}{n}\sum_{i=1}^n y_{i,j}\, y_{i,k}\right)_{1 \le j,k \le q} = \frac{1}{n}\, Y^\top Y \in \mathbb{R}^{q \times q}. \tag{7.53}
\]

This is a covariance matrix, and if the columns of Y are normalized with the empirical standard deviations σ̂_j, 1 ≤ j ≤ q, it is a correlation matrix.
An auto-encoder consists of two mappings

\[
\Psi: \mathbb{R}^q \to \mathbb{R}^p \qquad\text{and}\qquad \Phi: \mathbb{R}^p \to \mathbb{R}^q, \tag{7.54}
\]

such that their composition Φ ∘ Ψ has a small reconstruction error w.r.t. the chosen dissimilarity function L(·, ·), that is,

\[
L\big(y, \Phi \circ \Psi(y)\big) \approx 0. \tag{7.55}
\]

Note that we want (7.55) for selected cases y, and if they are within a p-dimensional manifold the auto-encoding will be successful. The first mapping Ψ: R^q → R^p is called the encoder, and the second mapping Φ: R^p → R^q is called the decoder. The object Ψ(y) ∈ R^p is a p-dimensional encoding (representation) of y ∈ R^q which contains maximal information of y, up to the reconstruction error (7.55).
PCA gives us a linear auto-encoder (7.54). If the data matrix Y ∈ Rn×q has rank
q, there exist q linearly independent rows of Y that span Rq . PCA determines a
different, very specific basis of Rq . It looks for an orthonormal basis v 1 , . . . , v q ∈
Rq such that v 1 explains the direction of the biggest variability in Y , v 2 the direction
of the second biggest variability in Y orthogonal to v 1 , and so forth. Variability is
understood in the sense of maximal empirical variance under the assumption that
the columns of Y are centered, see (7.52)–(7.53). Such an orthonormal basis can
be found by determining q linearly independent eigenvectors of the symmetric and
positive definite matrix
\[
A = n\,\widehat{\Sigma} = Y^\top Y \in \mathbb{R}^{q \times q}.
\]

For this we can solve recursively the following convex Lagrange problems. The first basis vector v_1 ∈ R^q is determined by the solution of³

\[
v_1 = \underset{\|w\|_2 = 1}{\arg\max}\ \|Y w\|_2^2 = \underset{w^\top w = 1}{\arg\max}\ w^\top Y^\top Y\, w, \tag{7.56}
\]
³ If the q eigenvalues of A are distinct, the solution to (7.56) and (7.57) is unique up to the sign; otherwise this requires more care.
\[
Y = U \Lambda V^\top. \tag{7.58}
\]

The matrix U is called the left-singular matrix of Y, and the matrix V is called the right-singular matrix of Y. Observe, by using the SVD (7.58),

\[
V^\top A V = V^\top Y^\top Y V = V^\top V \Lambda\, U^\top U\, \Lambda V^\top V = \Lambda^2 = \mathrm{diag}\big(\lambda_1^2, \ldots, \lambda_q^2\big).
\]

That is, the squared singular values (λ_j²)_{1≤j≤q} are the eigenvalues of the matrix A, and the column vectors of the right-singular matrix V = (v_1, …, v_q) (eigenvectors of A) give an orthonormal basis v_1, …, v_q. This motivates defining the q principal components of Y as the column vectors of

\[
Y V = U \Lambda = U\, \mathrm{diag}(\lambda_1, \ldots, \lambda_q) = \big(\lambda_1 u_1, \ldots, \lambda_q u_q\big) \in \mathbb{R}^{n \times q}. \tag{7.59}
\]
The Eckart–Young–Mirsky theorem [114, 279]⁴ proves that this rank p matrix Y_p minimizes the Frobenius norm relative to Y among all rank p matrices, that is,

\[
Y_p \in \underset{B \in \mathbb{R}^{n \times q}:\ \mathrm{rank}(B) \le p}{\arg\min}\ \|Y - B\|_F, \tag{7.61}
\]

where the Frobenius norm is given by ‖C‖_F² = Σ_{i,j} c_{i,j}² for a matrix C = (c_{i,j})_{i,j}. The orthonormal basis v_1, …, v_q ∈ R^q gives the (linear) encoder (projection)

\[
\Psi: \mathbb{R}^q \to \mathbb{R}^p, \qquad y \mapsto \Psi(y) = \big(y^\top v_1, \ldots, y^\top v_p\big)^\top = (v_1, \ldots, v_p)^\top y.
\]
4 In fact, (7.61) holds for both the Frobenius norm and the spectral norm.
This gives the first p principal components in (7.59) if we insert the transposed data matrix Y^⊤ = (y_1, …, y_n) ∈ R^{q×n} for y ∈ R^q. The (linear) decoder is given by

\[
\Phi: \mathbb{R}^p \to \mathbb{R}^q, \qquad z \mapsto \Phi(z) = (v_1, \ldots, v_p)\, z.
\]

Thus, Φ ∘ Ψ(Y) minimizes the Frobenius reconstruction error (7.61) on the data matrix Y among all linear maps of rank p. In view of (7.55) we can express the squared Frobenius reconstruction error as

\[
\|Y - Y_p\|_F^2 = \sum_{i=1}^n \big\|y_i - \Phi \circ \Psi(y_i)\big\|_2^2 = \sum_{i=1}^n L\big(y_i, \Phi \circ \Psi(y_i)\big), \tag{7.62}
\]

thus, we choose the squared Euclidean distance as the dissimilarity measure here, which we minimize simultaneously on all cases y_i, 1 ≤ i ≤ n.
Remark 7.30 The PCA gives a linear approximation to the data matrix Y by
minimizing (7.61) and (7.62) for given rank p. This may not be appropriate if the
non-linear terms are dominant. Figure 7.31 (lhs) gives a situation where the PCA
works well; this data has been generated by i.i.d. multivariate Gaussian random vectors y_i ∼ N(0, Σ). Figure 7.31 (middle) gives a non-linear example where the PCA does not work well: the data matrix Y ∈ R^{n×2} is a column-centered matrix
that builds a circle around the origin.
Another nice example where the PCA fails is Fig. 7.31 (rhs). This figure is
inspired by Shlens [337] and Ruckstuhl [321]. It shows a situation where the level
sets are non-convex, and the principal components point in a completely wrong
direction to explain the structure of the data.
Fig. 7.31 Two-dimensional PCAs in different situations of the data matrix Y ∈ Rn×2
We use the SVD to fit the most popular stochastic mortality model, the Lee–Carter
(LC) model [238], to (raw) mortality data. The raw mortality data considers for each
calendar year t and each age x the number of people Dx,t who died (in that year t
at age x) divided by the corresponding population exposure ex,t . In practice this
requires some care. Due to migration, the exposures e_{x,t} are often non-observable figures and need to be estimated. Moreover, the death counts D_{x,t} in year t at age x can be defined differently; age cohorts are usually defined by the year of birth.
We denote the (observed) raw mortality rates by Mx,t = Dx,t /ex,t . The subsequent
derivations consider the raw log-mortality rates log(Mx,t ), for this reason we assume
that Mx,t > 0 for all calendar years t and ages x. The goal is to model these raw
log-mortality rates (for each country, region, risk group and gender separately).
The LC model defines the force of mortality as

\[
\log(\mu_{x,t}) = a_x + b_x k_t, \tag{7.63}
\]

and we consider the centered raw log-mortality rates

\[
Y_{x,t} = \log(M_{x,t}) - \widehat{a}_x = \log(M_{x,t}) - \frac{1}{|\mathcal{T}|}\sum_{s \in \mathcal{T}} \log(M_{x,s}), \tag{7.64}
\]
where the last identity defines the estimate ax . Strictly speaking we have a slight
difference to the centering in Sect. 7.5.1 because we center the rows and not the
columns of the data matrix, here, but the role of rows and columns is exchangeable in
the PCA. The optimal (parameter) values (b̂_x)_x and (k̂_t)_t are determined as follows, see (7.63),

\[
\underset{(b_x)_x,\,(k_t)_t}{\arg\min}\ \sum_{x,t} \big(Y_{x,t} - b_x k_t\big)^2,
\]
where the sum runs over the years t ∈ T and the ages x0 ≤ x ≤ x1 , with x0 and x1
being the lower and upper age boundaries. This can be rewritten as an optimization
problem (7.61)–(7.62). Consider the data matrix Y = (Y_{x,t})_{x_0 ≤ x ≤ x_1; t ∈ T} ∈ R^{n×q}, and set n = x_1 − x_0 + 1 and q = |T|. Assume Y has rank q. This allows us to consider the normalization

\[
\sum_{x=x_0}^{x_1} \widehat{b}_x = 1 \qquad\text{and}\qquad \sum_{t \in \mathcal{T}} \widehat{k}_t = 0, \tag{7.65}
\]

the latter being consistent with the centering of the rows of Y with â_x in (7.64).
We fit the LC model to the Swiss mortality data of females and males separately.
The raw log-mortality rates log(Mx,t ) for the years t ∈ T = {1950, . . . , 2016}
and the ages 0 ≤ x ≤ 99 are illustrated in Fig. 7.32; both plots use the same color
scale. This mortality data has been obtained from the Human Mortality Database
(HMD) [195]. In general, we observe a diagonal structure that indicates mortality
improvements over time.
Fig. 7.32 Raw log-mortality rates log(Mx,t ) for the calendar years 1950 ≤ t ≤ 2016 and the ages
x0 = 0 ≤ x ≤ x1 = 99 of Swiss females (lhs) and Swiss males (rhs); both plots use the same color
scale
[Fig. 7.33: Lee–Carter fitted log-mortality rates log(μ̂_{x,t}) = â_x + b̂_x k̂_t for x_0 ≤ x ≤ x_1 and t ∈ T, of Swiss females (lhs) and Swiss males (rhs); same color scale as Fig. 7.32]
Figure 7.33 shows the LC fitted log-mortality surface (log(μ̂_{x,t}))_{0≤x≤99; t∈T} separately for Swiss females and Swiss males; the color scale is the same as in Fig. 7.32. The plots show a strong similarity between the raw log-mortality data and the LC fitted log-mortality surface, which clearly supports the LC model for the Swiss data. In general, the LC surface is a smoothed version of the raw log-mortality surface. The main difference in our LC fit concerns the male population for ages
Fig. 7.34 (lhs) Singular values λ_j, 1 ≤ j ≤ |T|, of the SVD of the data matrix Y ∈ R^{n×|T|}, and (rhs) the reconstruction errors ‖Y − Y_p‖_F² for 0 ≤ p ≤ |T|
\[
\log(\mu_{x,t}) = a_x + \langle b_x, k_t \rangle, \qquad\text{for } b_x, k_t \in \mathbb{R}^p.
\]
We have (only) fitted a mortality surface to the raw log-mortality rates on the rectangle {x_0, …, x_1} × T. This does not allow us to forecast mortality into the future. Forecasting requires a two-step procedure which, after this first estimation step, extrapolates the time index (time-series) (k̂_t)_{t∈T} beyond the latest observation point in T. The simplest (meaningful) model for this second (extrapolation) step is a random walk with drift for the time index process (k̂_t)_{t≥0}. Figure 7.35 shows the estimated two-dimensional process (k̂_t)_{t∈T}, i.e., for p = 2, on the rectangle
Fig. 7.35 Estimated two-dimensional processes (k̂_t)_{t∈T} for Swiss females (lhs) and Swiss males (rhs); these are normalized such that they are centered and such that the components of b̂_x add up to 1
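The random-walk-with-drift extrapolation mentioned above can be sketched as follows; the fitted time index k̂_t below is simulated for illustration (it is not one of the Swiss components of Fig. 7.35):

```python
import numpy as np

rng = np.random.default_rng(3)

# a fitted (one-dimensional) time index for calendar years 1950..2016
k_hat = np.cumsum(rng.normal(-1.5, 2.0, size=67))

# random walk with drift: k_t = k_{t-1} + delta + eps_t
increments = np.diff(k_hat)
delta = increments.mean()                 # estimated drift
sigma = increments.std(ddof=1)            # volatility of the increments

horizon = 10
steps = np.arange(1, horizon + 1)
point_forecast = k_hat[-1] + delta * steps           # linear trend continuation
std_forecast = sigma * np.sqrt(steps)                # widening prediction bands
print(point_forecast[:3])
```

The square-root growth of `std_forecast` reflects the accumulating increment variance of the random walk, which is what produces the widening mortality-forecast fans seen in LC applications.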
The input and output neurons have blue color, and the bottleneck of dimension q2 =
2 is shown in red color in Fig. 7.36 (lhs).
Fig. 7.36 (lhs) BN network of depth d = 3 with (q0 , q1 , q2 , q3 , q4 ) = (20, 7, 2, 7, 20), (middle
and rhs) shallow BN networks with a bottleneck of dimensions 7 and 2, respectively
and having network weights w_j^{(m)} ∈ R^{q_{m−1}}, 1 ≤ j ≤ q_m. For the output we choose the identity function as activation function

\[
z^{(d+1)}: \mathbb{R}^{q_d} \to \mathbb{R}^{q_{d+1}}, \qquad z \mapsto z^{(d+1)}(z) = \Big(\big\langle w_1^{(d+1)}, z\big\rangle, \ldots, \big\langle w_{q_{d+1}}^{(d+1)}, z\big\rangle\Big)^\top,
\]

having network weights w_j^{(d+1)} ∈ R^{q_d}, 1 ≤ j ≤ q_{d+1}. The resulting network parameter ϑ is now fitted to the data matrix Y = (y_1, …, y_n)^⊤ ∈ R^{n×q} such that the reconstruction error is minimized over all instances

\[
\widehat{\vartheta} = \underset{\vartheta \in \mathbb{R}^r}{\arg\min}\ \sum_{i=1}^n L\big(y_i, \Phi \circ \Psi(y_i)\big) = \underset{\vartheta \in \mathbb{R}^r}{\arg\min}\ \sum_{i=1}^n L\big(y_i, z^{(d+1:1)}(y_i)\big).
\]
activations Ψ(y) = z^{(m:1)}(y) ∈ R^p in the linear activation case are not directly comparable to the principal components (y^⊤v_1, …, y^⊤v_p) of the PCA. Namely, the PCA uses an orthonormal basis v_1, …, v_p, whereas the linear BN network case uses any p-dimensional basis; i.e., to directly bring these two representations in line we still need a coordinate transformation of the bottleneck activations.
Hinton–Salakhutdinov [186] noticed that the gradient descent fitting of a BN
network needs some care, otherwise we may find a local minimum of the loss
function that has a poor reconstruction performance. In order to implement a more
sophisticated way of SGD fitting we require that the depth d of the network is an
odd number and that the network architecture is symmetric around the central FN
layer (d + 1)/2. This is the case in Fig. 7.36 (lhs). Fitting of this network of depth
d = 3 is now done in three steps:
1. The symmetry around the central FN layer m = 2 allows us to collapse this
central layer by merging layers 1 and 3 (because q1 = q3 ). Merging these two
layers provides us a shallow BN network with neurons (q0 , q1 = q3 , qd+1 =
q0 ) = (20, 7, 20). This shallow BN network is shown in Fig. 7.36 (middle).
In a first step we fit this simpler network to the data Y. This gives us the preliminary estimates for the network weights w_1^{(1)}, …, w_{q_1}^{(1)} and w_1^{(4)}, …, w_{q_4}^{(4)} of the full BN network. From this fitted shallow BN network we receive the learned representations z_i = z^{(1)}(y_i) ∈ R^{q_1}, 1 ≤ i ≤ n, in the central layer, using the preliminary estimates of the network weights.
2. In the second step we use the learned representations zi ∈ Rq1 , 1 ≤ i ≤ n, to
fit the inner part of the original network (using a suitable dissimilarity function).
This inner part is a shallow network with neurons (q1 , q2 , q3 = q1 ) = (7, 2, 7),
see Fig. 7.36 (rhs). This second step gives us the preliminary estimates for the network weights w_1^{(2)}, …, w_{q_2}^{(2)} and w_1^{(3)}, …, w_{q_3}^{(3)} of the full BN network.
3. In the final step we fit the full BN network on the data Y and use the preliminary
estimates of the weights (of the previous two steps) as initialization of the
gradient descent algorithm.
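The book fits BN networks in R with keras; as a minimal illustration of the three-step procedure above, the numpy sketch below treats the linear-activation case, where each shallow fit of steps 1–2 has a closed-form PCA solution (the gradient-descent fine-tune of step 3 is only indicated, and all dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, q0, q1, q2 = 300, 20, 7, 2

Y = rng.normal(size=(n, q0)) @ rng.normal(size=(q0, q0))
Y = Y - Y.mean(axis=0)      # centered data matrix

def pca_basis(X, p):
    # top-p right-singular vectors of the (centered) data matrix X
    return np.linalg.svd(X, full_matrices=False)[2][:p].T

# step 1: collapse the central layer; fit the shallow (q0, q1, q0) auto-encoder
V1 = pca_basis(Y, q1)       # with linear activations this shallow fit is a PCA
Z = Y @ V1                  # learned central-layer representations z_i

# step 2: fit the inner shallow (q1, q2, q1) auto-encoder on the representations
V2 = pca_basis(Z, q2)

# step 3 (omitted here): fine-tune the full network by gradient descent,
# initialized with the pre-trained weights V1 and V2

encode = lambda y: (y @ V1) @ V2            # bottleneck activations, dimension q2
decode = lambda z: (z @ V2.T) @ V1.T
Y_rec = decode(encode(Y))

err_pre = np.linalg.norm(Y - Y_rec, "fro") ** 2
V_pca = pca_basis(Y, q2)
err_pca = np.linalg.norm(Y - (Y @ V_pca) @ V_pca.T, "fro") ** 2
print(np.isclose(err_pre, err_pca))         # linear pre-training attains the PCA optimum
```

In this linear case the two pre-training steps already attain the rank-q2 PCA optimum, so step 3 cannot improve further; with non-linear activations the pre-trained weights only provide a good starting point for the SGD fine-tune.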
Example 7.31 (BN Network Mortality Model) We apply this BN network approach
to modify the LC model of Sect. 7.5.4. Hainaut [178] considered such a BN network
application. For computational reasons, Hainaut [178] proposed a calibration
strategy different from Hinton–Salakhutdinov [186]. We use this latter calibration
strategy as it has turned out to work well in our setting.
As BN network architecture we choose a FN network of depth d = 3. The input
and output dimensions are equal to q0 = q4 = 67, this exactly corresponds to
the number of available calendar years 1950 ≤ t ≤ 2016, see Fig. 7.32. Then, we
select a symmetric architecture around the central FN layer m = 2 with q1 = q3 =
20 neurons. That is, in a first step, the 67 calendar years are compressed to a 20-
dimensional representation. For the bottleneck we then explore different numbers
of neurons q2 = p ∈ {1, . . . , 20}. These BN networks are implemented and fitted in
R with the library keras [77]. We have fitted these models separately to the Swiss
female and male populations. The raw log-mortality rates are illustrated in Fig. 7.32,
and for comparability with the LC approach we have centered these log-mortality
rates according to (7.64), and we use the squared Euclidean distance as the objective
function.
Figure 7.37 compares the squared Frobenius reconstruction errors of the linear
LC approximations Y_p to their non-linear BN network counterparts with bottlenecks q_2 = p. We observe that the BN figures are clearly smaller, saying that a non-linear auto-encoding provides a better reconstruction; this is true, in particular, for 2 ≤ q_2 < 20. For q_2 ≥ 20 the learning with the BN networks seems saturated; note that the outer layers have q_1 = q_3 = 20 neurons, which limits the learning at the bottleneck for bigger q_2. In view of Fig. 7.37 there seems to be a kink at q_2 = 4,
[Fig. 7.37: squared Frobenius reconstruction errors of the linear LC approximations Y_p vs. the non-linear BN approach, as a function of the bottleneck size p]
Fig. 7.38 BN network (q_1, q_2, q_3) = (20, 2, 20) fitted log-mortality rates log(μ̂_{x,t}) for the calendar years 1950 ≤ t ≤ 2016 and the ages x_0 = 0 ≤ x ≤ x_1 = 99 of Swiss females (left) and Swiss males (right); the plots use the same color scale as Fig. 7.32
and an “elbow” criterion says that this is the critical bottleneck size that should not
be exceeded.
The resulting estimated log-mortality surfaces for the bottleneck q2 = 2 are
illustrated in Fig. 7.38. These strongly resemble the raw log-mortality rates in
Fig. 7.32, in particular, for the male population we get a better fit for ages 20 ≤
x ≤ 40 from 1980 to 2000 compared to the LC model. In a further analysis we
should check whether this BN network does not over-fit to the data. We could, e.g.,
explore drop-outs during calibration or smaller FN (compression) layers q1 = q3 .
Finally, we analyze the resulting activations at the bottleneck by considering the
BN encoder (7.66). Note that we assume y ∈ Rq in (7.66) with q = |T | being
the rank of the data matrix Y ∈ Rn×q . Thus, the encoder takes a fixed age 0 ≤
x ≤ 99 and encodes the corresponding time-series observation y x ∈ R|T | by the
bottleneck activations. This parametrization has been inspired by the PCA which
typically considers a data matrix that has more rows than columns. This results in at most q = rank(Y) singular values, provided n ≥ q. However, we can easily
exchange the role of rows and columns, e.g., by transposing all matrices involved.
For mortality forecasting it is advantageous to exchange these roles because we
would like to extrapolate a time-series beyond T . For this reason we set for the input
dimension q0 = q = 100, which provides us with |T | observations y t ∈ R100 . We
then fit the BN encoder (7.66) to receive the bottleneck activations
Figure 7.39 shows these figures for a bottleneck q_2 = 2. We observe that these bottleneck time-series (Ψ(y_t))_{t∈T} are much more difficult to understand than the LC/RH ones given in Fig. 7.35. Firstly, we see that we have quite some dependence
Fig. 7.39 BN network (q_1, q_2, q_3) = (20, 2, 20): bottleneck activations showing Ψ(y_t) ∈ R² for t ∈ T
between the components of the time-series. Secondly, in contrast to the LC/RH case
of Fig. 7.35, there is not one component that dominates. Note that this dominance has been obtained by scaling the components of (b̂_x)_x to add up to 1 (which, of course, reflects the magnitudes of the singular values). In the non-linear case, these scales are hidden in the decoder, and they are more difficult to extract. Thirdly,
the extrapolation may not work if the time-series has a trend and if we use the
hyperbolic tangent activation function that has a bounded range. In general, a trend
extrapolation has to be considered very carefully with FN networks with non-linear
activation functions, and often there is no good solution to this problem within
the FN network framework. We conclude that this approach improves in-sample
mortality surface modeling, but it leaves open the question about forecasting the
future mortality rates because an extrapolation seems more difficult.
Remark 7.32 The concept of BN networks has also been considered in the actuarial
literature to encode geographic information, see Blier-Wong et al. [39]. Since
geographic information has a natural spatial component, these authors propose
to use a convolutional neural network to encode the spatial information before
processing the learned features through a BN network. The proposed decoder may
have different forms, either it tries to reconstruct the whole (spatial) neighborhood
of a given location or it only tries to reconstruct the site of a given location.
7.6 Model-Agnostic Tools
We collect some model-agnostic tools in this section that help us to better understand
and analyze the networks, their calibrations and predictions. Model-agnostic tools
are techniques that are not specific to a certain model type and can be used for
any regression model. Most of the methods collected here are nicely presented in the tutorial of Lorentzen–Mayer [258]. There are several ways of getting a better
understanding of a regression model. First, we can analyze variable importance
which tries to answer similar questions to the GLM variable selection tools
of Sect. 5.3 on model validation. However, in general, we cannot rely on any
asymptotic likelihood theory for such an analysis. Second, we can try to understand
the predictive model. For a GLM with the log-link function this is quite simple
because the systematic effects are of a multiplicative nature. For networks this
is much more complicated because we allow for much more general regression
functions. We can either try to understand these functions on a global portfolio level
(by averaging the effects over many insurance policies) or we can try to understand
these functions locally for individual insurance policies. The latter refers to local
sensitivities around a chosen feature value x ∈ X , and the former to global model-
agnostics.
For GLMs we have studied the LRT and the Wald test that have been assisting us
in reducing the GLM by the feature components that do not contribute sufficiently
to the regression task at hand, see Sects. 5.3.2 and 5.3.3. These variable reduction
techniques rely on an asymptotic likelihood theory. Here, we need to proceed
differently, and we just aim at ranking the variables by their importance, similarly
to a drop1 analysis, see Listing 5.6.
For a given FN network regression model
[Fig. 7.40: variable permutation importance (VPI) of model GLM3 and the FN network for the feature components (among others) Region, Density, VehAge, Area, VehGas and VehPower]
We calculate the VPI on the MTPL claim frequency data of model Poisson
GLM3 of Table 5.5 and the FN network regression model μm=1 of Table 7.9; we
use this example throughout this section on model-agnostic tools. Figure 7.40 shows
the relative increases

\[
\mathrm{vpi}(j) = \frac{D\big(\mathcal{L}^{(j)}, \widehat{\mu}\big) - D\big(\mathcal{L}, \widehat{\mu}\big)}{D\big(\mathcal{L}, \widehat{\mu}\big)},
\]
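A sketch of this permutation approach on a toy Poisson model (the features, coefficients and the `predict` function below are hypothetical stand-ins for a fitted regression model; they are not the MTPL example):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

X = rng.normal(size=(n, 3))                   # 3 standardized toy features
beta = np.array([0.8, 0.3, 0.0])              # the third feature has no effect
mu_true = np.exp(-2.5 + X @ beta)
y = rng.poisson(mu_true)

def poisson_deviance(y, mu):
    dev = np.zeros_like(mu)
    mask = y > 0
    dev[mask] = y[mask] * np.log(y[mask] / mu[mask])
    return 2.0 * np.mean(dev - (y - mu))

predict = lambda X: np.exp(-2.5 + X @ beta)   # stands in for the fitted model
base = poisson_deviance(y, predict(X))

vpi = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])      # break the link between x_j and y
    vpi.append((poisson_deviance(y, predict(Xp)) - base) / base)

print(vpi)  # largest for the first feature, ~0 for the irrelevant third one
```

Permuting one column at a time destroys only that component's predictive information, so the relative deviance increase ranks the variables without refitting the model.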
function. Binary split CARTs have the advantage that this can be done in an additive
way.
More complex regression models like FN networks can then be analyzed by using
a binary split regression tree as a global surrogate model. That is, we can fit a CART
to the network regression function (as a surrogate model) and then analyze variable
importance in this surrogate regression tree model using the tools of regression trees.
We will not give an explicit example here because we have not formally introduced
regression trees in this manuscript, but this concept is fairly straightforward and
well-understood.
There are several graphical tools that study the individual behavior in the feature
components. Some of these tools select individual insurance policies and others
study global portfolio properties. They have in common that they are based on
marginal considerations, i.e., some sort of projection.
Fig. 7.41 ICE plots of 100 randomly selected insurance policies x_i of (lhs) model Poisson GLM3 and (rhs) FN network μ̂^(m=1), letting the variable DrivAge vary over its domain; the y-axis is on the canonical parameter scale
by exploiting the same plot only on insurance policies that have a BonusMalus
level of at least 100%. In that case the lines for small ages are non-decreasing when
approaching the age of 18, thus, providing a more reasonable interpretation. We
conclude that if we have strong dependence and/or interactions between the feature
components this method may not provide any reasonable interpretations.
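An ICE plot evaluates the regression function on individual policies while one component varies over its domain. A toy sketch on the canonical (log) scale, as in Fig. 7.41 (the model, features and interaction below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
n_policies, n_grid = 100, 30

# toy frequency model with an interaction between driver age and bonus-malus level
mu = lambda age, bm: np.exp(-2.0 + 0.008 * bm - 0.004 * age
                            + 0.0002 * age * (bm - 50))

bm = rng.choice([50, 68, 100, 125], size=n_policies)   # sampled policies
age_grid = np.linspace(18, 90, n_grid)

# one ICE curve per policy: vary DrivAge, keep the other features fixed
ice = np.array([np.log(mu(age_grid, b)) for b in bm])
print(ice.shape)  # (100, 30)
```

Plotting the rows of `ice` against `age_grid` reproduces the spaghetti-type curves of Fig. 7.41; non-parallel curves indicate interactions between DrivAge and the remaining feature components.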
Partial dependence plots (PDPs) have been introduced by Friedman [141], see also
Zhao–Hastie [405]. PDPs are closely related to the do-operator in causal inference
in statistics; we refer to Pearl [298] and Pearl et al. [299] for the do-operator. A
PDP and the do-operator, respectively, are obtained by breaking the dependence
structure between different feature components. Namely, we decompose the feature
x = (x_j, x_{∖j}) into two parts, with x_{∖j} denoting all feature components except component x_j; we will use a slight abuse of notation because the components need to be permuted correspondingly in the following regression function x ↦ μ(x) = μ(x_j, x_{∖j}). Since, typically, there is dependence between x_j and x_{∖j}, one can infer x_{∖j} from x_j, and vice versa. A PDP breaks this inference potential so that the sensitivity can be studied purely in x_j. In particular, the partial dependence profile is obtained by

\[
x_j \mapsto \bar{\mu}_j(x_j) = \int \mu(x_j, x_{\setminus j})\; dp(x_{\setminus j}), \tag{7.67}
\]
the latter allowing for inferring x_{∖j} from x_j through the conditional probability dp(x_{∖j} | x_j).
Remark 7.34 (Discrimination-Free Insurance Pricing) Recent actuarial literature
discusses discrimination-free insurance pricing which aims at developing a pricing
framework that is free of discrimination w.r.t. so-called protected characteristics
such as gender and ethnicity; we refer to Guillén [174], Chen et al. [69, 70],
Lindholm et al. [253] and Frees–Huang [136] for discussions on discrimination
in insurance. In general, part of the problem also lies in the fact that one can
often infer the protected characteristics from the non-protected feature information.
This is called indirect discrimination or proxy discrimination. The proposal of
Lindholm et al. [253] for achieving discrimination-free prices exactly follows the
construction (7.67), by breaking the link, which infers the protected characteristics
from the non-protected ones.
The partial dependence profile on our portfolio L with given features x 1 , . . . , x n
is now obtained by just using the portfolio distribution as an empirical distribution
for p in (7.67). That is, for a selected component xj of x, we consider the partial
dependence profile
\[
x_j \mapsto \bar{\mu}_j(x_j) = \frac{1}{n}\sum_{i=1}^n \mu\big(x_j, x_{i,\setminus j}\big) = \frac{1}{n}\sum_{i=1}^n \mu\big(x_{i,0}, x_{i,1}, \ldots, x_{i,j-1}, x_j, x_{i,j+1}, \ldots, x_{i,q}\big),
\]
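This empirical partial dependence profile can be sketched directly; the regression function μ and the two-feature portfolio below are toy constructions (loosely mimicking DrivAge and BonusMalus, including their dependence):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000

# toy portfolio with two dependent features (think DrivAge and BonusMalus)
age = rng.uniform(18, 90, size=n)
bm = 50 + 50 * (age < 28) + rng.normal(0, 5, size=n)   # bm depends on age

mu = lambda age, bm: np.exp(-2.0 + 0.01 * bm - 0.005 * age)

def pdp(grid, x_other):
    # for each grid value x_j, average mu over the empirical distribution of x_{\j},
    # deliberately ignoring the dependence between x_j and x_{\j}
    return np.array([np.mean(mu(g, x_other)) for g in grid])

grid = np.linspace(18, 90, 10)
profile = pdp(grid, bm)
print(profile.shape)  # (10,)
```

Note how the averaging keeps the empirical `bm` distribution fixed for every grid age; this is precisely the extrapolation into implausible feature combinations (an 18-year-old on bonus-malus level 50%) criticized in the following paragraphs.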
Fig. 7.42 PDPs of (lhs) BonusMalus level and (middle) DrivAge; the y-axis is on the
canonical parameter scale; (rhs) ratio of policies with a bonus-malus level of 50% per driver’s
age
The PDPs in Fig. 7.42 (lhs, middle) look reasonable at first sight. However, we are again facing the difficulty that these partial
dependence profiles consider feature configurations that should not appear in our
portfolio. Roughly 57% of all insurance policies have a bonus-malus level of 50%, which means that these drivers did not suffer any claims in the past couple of years. Obviously, a driver of age 18 cannot be on this bonus-malus level, simply because she/he cannot have accumulated multiple years of accident-free driving experience. However, the PDP does not respect this fact, and just
extrapolates the regression function into that part of the feature space. Therefore, the
PDP at driver’s age 18 is based on 57% of the insurance policies being on a bonus-
malus level of 50% because this corresponds to the empirical portfolio distribution
p(x \j ) excluding the driver’s age xj = DrivAge information. Figure 7.42 (rhs)
shows the ratio of insurance policies that have a bonus-malus level of 50%. We
observe that this ratio is roughly zero up to age 28 (orange vertical dotted line),
which indicates that a driver needs 10 successive accident-free years to reach the
lowest bonus-malus level (starting from 100%). We consider it to be a data error that this ratio is not identically equal to zero below age 28. We conclude that these PDPs
need to be interpreted very carefully because the insurance portfolio is not uniformly
distributed across the feature space. In some parts of the feature space the regression
function x → μ(x) may not even be well-defined because certain combinations of
feature values x may not exist (e.g., a driver of age 18 on bonus-malus level 50% or
a boy at a girl’s college).
PDPs have the problem that they do not respect the dependencies between the
feature components, as explained in the previous paragraphs. The accumulated local effects (ALE) profile tries to account for these dependencies by only studying a local feature perturbation; we refer to Apley–Zhu [13]. We present a smooth
(gradient-based) version of ALE because our regression functions are differentiable.
Consider the local effect in the individual feature x w.r.t. the component xj by
studying the partial derivative
μj(x) = ∂μ(x)/∂xj.   (7.68)
The average local effect of component j is obtained by

xj → Δj(xj; μ) = ∫ μj(xj, x\j) dp(x\j|xj).   (7.69)
The ALE integrates the average local effects Δj(·) over their domain, and the ALE profile is defined by

xj → ∫_{xj0}^{xj} Δj(zj; μ) dzj = ∫_{xj0}^{xj} ∫ μj(zj, x\j) dp(x\j|zj) dzj,   (7.70)
where xj0 is a given initialization point. The difference between PDPs and ALE
is that the latter correctly considers the dependence structure between xj and x \j ,
see (7.69).
Listing 7.10 Local effects through the gradients of FN networks in keras [77]
1 Input = layer_input(shape = c(11), dtype = ’float32’, name = ’Design’)
2 #
3 Output = Input %>%
4 layer_dense(units=20, activation=’tanh’, name=’FNLayer1’) %>%
5 layer_dense(units=15, activation=’tanh’, name=’FNLayer2’) %>%
6 layer_dense(units=10, activation=’tanh’, name=’FNLayer3’) %>%
7 layer_dense(units=1, activation=’linear’, name=’Network’)
8 #
9 model = keras_model(inputs = c(Input), outputs = c(Output))
10 #
11 grad = Output %>%
12 layer_lambda(function(x) k_gradients(model$outputs, model$inputs))
13 model.grad = keras_model(inputs = c(Input), outputs = c(grad))
14 theta.grad <- data.frame(model.grad %>% predict(XX))
Example We come back to our MTPL claim frequency FN network example. The
local effects (7.68) can directly be calculated in the R library keras [77] for a FN
network, see Listing 7.10. In order to do so we need to drop the embedding layers,
compared to Listing 7.4, and directly work on the learned embeddings. This gives
an input layer of dimension q = 7 + 2 + 2 = 11 because we have two categorical
features that have been embedded into 2-dimensional Euclidean spaces R2 . Then,
we can formally calculate the gradient of the FN network w.r.t. its inputs which is
done on lines 11–13 of Listing 7.10. Remark that we work on the canonical scale
because we use the linear activation function on line 7 of the listing.
There remain the averaging (7.69) and the integration (7.70) which can be done
empirically
xj → Δj(xj; μ) = (1/|E(xj)|) Σ_{i∈E(xj)} μj(x i),   (7.71)

where E(xj) collects the portfolio cases x i whose j-th component lies in a small (bandwidth) neighborhood of xj. For a GLM with linear canonical predictor θ(x) = ⟨β, x⟩, the local effects on the canonical scale are constant,

θj(x) = ∂θ(x)/∂xj = βj ≡ Δj(xj; θ).
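A minimal numpy sketch of this empirical ALE computation, replacing the gradient μj by a central finite difference and the conditional averaging (7.71) by a simple bandwidth window (all names, the bandwidth and the step size are our choices):

```python
import numpy as np

def ale_profile(mu, X, j, grid, h, eps=1e-4):
    """Empirical ALE profile: average finite-difference local effects mu_j over
    the cases whose j-th component lies within bandwidth h of each grid point,
    then integrate the averages over the grid (trapezoidal rule)."""
    effects = []
    for v in grid:
        window = np.abs(X[:, j] - v) <= h  # empirical conditioning on x_j close to v
        up, lo = X[window].copy(), X[window].copy()
        up[:, j] += eps
        lo[:, j] -= eps
        effects.append(((mu(up) - mu(lo)) / (2 * eps)).mean())
    effects = np.array(effects)
    incr = 0.5 * (effects[1:] + effects[:-1]) * np.diff(grid)
    return np.concatenate([[0.0], np.cumsum(incr)])
```

In contrast to the PDP sketch, only cases that actually lie near the grid point enter each local average.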
In the case of model Poisson GLM3 presented in Sect. 5.3.4 the situation is
more delicate as we model the interactions in the GLM as follows, see (5.34)
and (5.35),
(DrivAge, BonusMalus) → βl DrivAge + βl+1 log(DrivAge) + Σ_{j=2}^{4} βl+j (DrivAge)^j
In that case, though we work with a GLM, the resulting local effects are different
if we calculate the derivatives w.r.t. DrivAge and BonusMalus, respectively,
because we explicitly (manually) include non-linear effects into the GLM.
Figure 7.43 shows the ALE profiles of the variables BonusMalus and
DrivAge. The shapes of these profiles can directly be compared to the PDPs
of Fig. 7.42 (the scale on the y-axis should be ignored because this will depend
on the applied centering, however, we hold on to the canonical scale). The main
difference between these two plots can be observed for the variable DrivAge at
low ages. Namely, the ALE profiles have a different shape at low ages respecting
the dependencies in the feature components by only considering real local feature
configurations.
Fig. 7.43 ALE profiles of (lhs) BonusMalus level and (rhs) DrivAge; the y-axis is on the
log-scale
Two feature components xj and xk of x are said to interact in the regression function μ(x) if the cross-derivative

μj,k(x) = ∂²μ(x) / (∂xj ∂xk) ≠ 0.   (7.72)
This means that the magnitude of a change of the regression function μ(x) in xj
depends on the current value of xk . If there is no such interaction, we can additively
decompose the regression function μ(x) into two independent terms. This then
reads as μ(x) = μ\j (x \j ) + μ\k (x \k ). This motivation is now applied to the
PDP profiles given in (7.67). We define the centered versions xj → μ̆j (xj ) and
xk → μ̆k (xk ) of the PDP profiles by centering the PDP profiles xj → μ̄j (xj )
and xk → μ̄k (xk ) over the portfolio values x i , 1 ≤ i ≤ n. Next, we consider an
analogous two-dimensional version for (xj , xk ). Let (xj , xk ) → μ̆j,k (xj , xk ) be the
centered version of a two-dimensional PDP profile (xj , xk ) → μ̄j,k (xj , xk ).
Friedman's H-statistic measures the pairwise interaction strength between the components xj and xk, and it is defined by

H²j,k = Σ_{i=1}^{n} ( μ̆j,k(xi,j, xi,k) − μ̆j(xi,j) − μ̆k(xi,k) )² / Σ_{i=1}^{n} μ̆j,k(xi,j, xi,k)²,   (7.73)

which relates the interaction part to the joint effect Σ_{i=1}^{n} μ̆j,k(xi,j, xi,k)²; sometimes also the absolute measure is considered by taking the square root of the numerator in (7.73). Of course, this can be extended to interactions of three components, etc.; we refer to Friedman–Popescu [143].
We do not give a numerical example here, because calculating Friedman's H-statistic can be computationally demanding if one has many feature components with many levels in FN network modeling.
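For small portfolios and few feature components, (7.73) can nevertheless be sketched directly from centered one- and two-dimensional PDPs evaluated at the observed feature values (names ours; the nested loops cost O(n²) model evaluations, which illustrates the computational burden just mentioned):

```python
import numpy as np

def centered_pdp(mu, X, cols, values):
    """Centered empirical PDP evaluated at the observed values of `cols`."""
    n = X.shape[0]
    prof = np.empty(n)
    for i in range(n):
        Xv = X.copy()
        Xv[:, cols] = values[i]  # force the selected components to case i's values
        prof[i] = mu(Xv).mean()
    return prof - prof.mean()    # center over the portfolio

def h_statistic(mu, X, j, k):
    """Friedman's H^2 for the pair (j, k), see (7.73)."""
    pj = centered_pdp(mu, X, [j], X[:, [j]])
    pk = centered_pdp(mu, X, [k], X[:, [k]])
    pjk = centered_pdp(mu, X, [j, k], X[:, [j, k]])
    return ((pjk - pj - pk) ** 2).sum() / (pjk ** 2).sum()
```

An additive model yields H² ≈ 0, while a multiplicative interaction pushes H² towards 1.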
The above methods like the PDP and the ALE profile have been analyzing the global
behavior of the regression functions. We briefly mention some tools that describe the
local sensitivity and explanation of regression results.
Probably the most popular method is the locally interpretable model-agnostic
explanation (LIME) introduced by Ribeiro et al. [311]. This analyzes locally the
expected response of a given feature x by perturbing x. In a nutshell, the idea is to
select an environment E(x) ⊂ X of a chosen feature x and to study the regression
function x′ → μ(x′) in this environment x′ ∈ E(x). This is done by fitting a
(much) simpler surrogate model to μ on this environment E(x). If the environment
is small, often a linear regression model is chosen. This then allows one to interpret
the regression function μ(·) locally using the simpler surrogate model, and if we
have a high-dimensional feature space, this linear regression is complemented with
LASSO regularization to only select the most important feature components.
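A hedged sketch of this local surrogate idea, using Gaussian perturbations around x and plain least squares in place of the LASSO-regularized fit (the names, the perturbation scale and the sample size are our choices, not from the text):

```python
import numpy as np

def lime_local(mu, x, n_samples=500, scale=0.1, seed=0):
    """Fit a local linear surrogate to mu around x by sampling perturbations
    in an environment of x and solving a least-squares problem."""
    rng = np.random.default_rng(seed)
    Z = x + scale * rng.normal(size=(n_samples, x.shape[0]))  # samples in E(x)
    y = mu(Z)
    A = np.column_stack([np.ones(n_samples), Z - x])          # intercept + local coords
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]    # local level approx mu(x), local slopes
```

The returned slopes serve as the local explanation of μ around x; for high-dimensional features one would add an L1 penalty to select only the most important components.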
The second method considered in the literature is the Shapley additive explanation (SHAP). SHAP is based on Shapley values [335], a method of allocating rewards to players in cooperative games, where a team of individual players jointly contributes to a potential success. Shapley values solve this allocation
problem under the requirements of additivity and fairness. This concept can be
translated to analyzing how individual feature components of x contribute to the
total prediction μ(x) of a given case. Shapley values allow one to do such a
contribution analysis in the aforementioned additive and fair way, see Lundberg–Lee
[261]. The calculation of SHAP values is combinatorially demanding, and therefore several approximations have been proposed, many of them having their own caveats; we refer to Aas et al. [1]. We will not consider these further but refer to the relevant literature.
The above model-agnostic tools have mainly been studying the sensitivities of the
expected response μ(x) in the feature components of x. This becomes apparent
from considering the partial derivatives (7.68) to calculate the local effects. Alterna-
tively, we could try to understand how the feature components of x contribute to a
given response μ(x), see Ancona et al. [12]; this section follows Merz et al. [273].
The marginal attribution on an input component j of the response μ(x) can be
studied by the directional derivative
xj → xj μj(x) = xj ∂μ(x)/∂xj.   (7.74)
This was first proposed to the data science community by Shrikumar et al. [340].
Basically, it means that we replace the partial derivative μj (x) by the directional
derivative along the vector xj ej = (0, . . . , 0, xj, 0, . . . , 0) ∈ Rq+1.
We implicitly interpret μ(X) = E[Y |X] as the price of the response Y , here,
though we do not need the response distribution in this section. Assume μ(X)
has a continuous distribution function Fμ(X) ; and we drop the intercept component
X0 = x0 = 1 from these considerations (but we still keep it in the regression
model). This implies that Uμ(X) = Fμ(X) (μ(X)) is uniformly distributed on [0, 1].
Choosing a density ζ on [0, 1] gives us a probability distortion ζ(Uμ(X) ) as we have
the normalization

Ep[ ζ(Uμ(X)) ] = ∫_0^1 ζ(u) du = 1.
Sj(μ; ζ) = ∂/∂ε |_{ε=0} ϱ( μ(1, X1, . . . , Xj−1, (1 + ε)Xj, Xj+1, . . . , Xq); ζ );
the right-hand side exactly uses the marginal attribution (7.74). There remains the
freedom of the choice of the density ζ on [0, 1], which allows us to study the
sensitivities of different distortion risk measures. For the uniform distribution ζ ≡ 1
on [0, 1] we simply have the average (best-estimate) price and its average marginal
attributions
Remarks 7.36
• In the introduction to this section we have assumed that μ(X) has a continuous
distribution function. This emphasizes that this sensitivity analysis is most
suitable for continuous feature components. Categorical and discrete feature
components can be embedded into a Euclidean space, e.g., using embedding
layers, and then they can be treated as continuous variables.
• Sensitivities (7.75) respect the local portfolio structure as they are calculated
w.r.t. p.
• In applications, we will work with the empirical portfolio distribution for p
provided by (x i )1≤i≤n . This gives an empirical approximation to (7.75) and,
in particular, it will require a choice of a bandwidth for the evaluation of the
conditional probability, conditioned on the event {μ(X) = F^{-1}_{μ(X)}(α)}. This is done with a local smoother similarly to Listing 7.8.
In analogy to Merz et al. [273] we give a different interpretation to the
sensitivities (7.75), which allows us to further expand this formula. We have the 1st order Taylor expansion μ(0) = μ(X − X) ≈ μ(X) − Σ_{j=1}^{q} Xj μj(X). By bringing the gradient term to the other side, using (7.75) and conditionally averaging, we receive the 1st order marginal attributions

F^{-1}_{μ(X)}(α) = Ep[ μ(X) | μ(X) = F^{-1}_{μ(X)}(α) ] ≈ μ(0) + Σ_{j=1}^{q} Sj(μ; α).   (7.76)
Thus, the sensitivities Sj(μ; α) provide a 1st order description of the quantiles F^{-1}_{μ(X)}(α) of μ(X). We call this approach marginal attribution by conditioning on
quantiles (MACQ) because it shows how the components Xj of X contribute to a
given quantile level.
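Empirically, Sj(μ; α) can be approximated by averaging the directional derivatives Xj μj(X) over the cases whose predicted value ranks near the α-quantile; a sketch with a simple rank window in place of the local smoother of Listing 7.8 (names, bandwidth and step size are our choices):

```python
import numpy as np

def macq_sensitivities(mu, X, alpha, bandwidth=0.05, eps=1e-4):
    """1st order marginal attributions S_j(mu; alpha): average X_j * dmu/dx_j
    over the cases ranked near the alpha-quantile of mu(X)."""
    m = mu(X)
    u = np.argsort(np.argsort(m)) / (len(m) - 1)   # empirical ranks U in [0, 1]
    window = np.abs(u - alpha) <= bandwidth         # conditioning on the quantile event
    S = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        up, lo = X[window].copy(), X[window].copy()
        up[:, j] += eps
        lo[:, j] -= eps
        grad_j = (mu(up) - mu(lo)) / (2 * eps)      # finite-difference mu_j
        S[j] = (X[window, j] * grad_j).mean()
    return S
```

For a linear μ the sum μ(0) + Σj Sj(μ; α) then reproduces the empirical α-quantile up to the window-smoothing error.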
Example 7.37 (MACQ for Linear Regression) The simplest case is the linear
regression case because the 1st order marginal attributions (7.76) are exact in this
case. Consider a linear regression function with regression parameter β ∈ Rq+1
x → μ(x) = ⟨β, x⟩ = β0 + Σ_{j=1}^{q} βj xj.
The 1st order marginal attributions for fixed α ∈ (0, 1) are given by

F^{-1}_{μ(X)}(α) = μ(0) + Σ_{j=1}^{q} Sj(μ; α) = β0 + Σ_{j=1}^{q} βj Ep[ Xj | μ(X) = F^{-1}_{μ(X)}(α) ].   (7.77)
This can be compared with the ALE profile (7.70). Setting the initial value xj0 = 0, the ALE profile for the linear regression model is given by

xj → ∫_0^{xj} Δj(zj; μ) dzj = βj xj.
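The exactness of the 1st order attributions in the linear case can be checked numerically: the pointwise identity μ(x) = μ(0) + Σj xj μj(x) holds before any conditional averaging, so (7.76)–(7.77) hold without approximation error. A small sketch with made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta = 0.5, np.array([1.0, -2.0, 0.3])
X = rng.normal(size=(1000, 3))

mu = beta0 + X @ beta   # linear regression mu(x) = beta_0 + <beta, x>
attr = X * beta         # marginal attributions x_j mu_j(x) = x_j beta_j

# pointwise exact: mu(x) = mu(0) + sum_j x_j beta_j
assert np.allclose(mu, beta0 + attr.sum(axis=1))
```

Conditional averaging over the quantile events only redistributes an already exact decomposition.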
A natural next step is to expand the 1st order attributions to 2nd orders. This allows us to consider the interaction terms. Consider the 2nd order Taylor expansion

μ(x + ε) = μ(x) + (∇x μ(x))⊤ ε + (1/2) ε⊤ ∇x² μ(x) ε + o(‖ε‖₂²)   for ‖ε‖₂ → 0.

Similar to (7.76), setting ε = −x, this gives us the 2nd order marginal attributions
F^{-1}_{μ(X)}(α) ≈ μ(0) + Σ_{j=1}^{q} Sj(μ; α) − (1/2) Σ_{j,k=1}^{q} Tj,k(μ; α)   (7.78)

= μ(0) + Σ_{j=1}^{q} ( Sj(μ; α) − (1/2) Tj,j(μ; α) ) − Σ_{1≤j<k≤q} Tj,k(μ; α),
where for 1 ≤ j, k ≤ q we define μj,k(x) = ∂xj ∂xk μ(x), see (7.72), and

Tj,k(μ; α) = Ep[ Xj Xk μj,k(X) | μ(X) = F^{-1}_{μ(X)}(α) ].   (7.79)
Remarks 7.38
• The first line of (7.78) separates the 1st order attributions from the 2nd order attributions, the second line splits w.r.t. the individual component attributions j and the interaction attributions j ≠ k.
• The 1st order attributions (7.75) have been motivated by considering the directional derivatives of the VaR distortion risk measure. Unfortunately, the 2nd order consideration has no simple equivalent motivation, as the 2nd order directional derivatives are much more involved, even in the linear case; we refer to Property 1 in Gourieroux et al. [167].
− (1/2) Ep[ (a − X)⊤ ∇x² μ(X) (a − X) | μ(X) = F^{-1}_{μ(X)}(α) ].
Essentially, this means that we shift the feature distribution p by considering the shifted random vectors Xa = X − a while setting μa(·) = μ(a + ·); thus, this simply says that we pre-process the features differently. In view of
approximation (7.81) we can now select a reference point a ∈ Rq that makes the 2nd order marginal attributions as precise as possible. Define the events Al = {μ(X) = F^{-1}_{μ(X)}(αl)} for a discrete quantile grid 0 < α1 < . . . < αL < 1. We define the objective function

a → G(a; μ) = Σ_{l=1}^{L} ( F^{-1}_{μ(X)}(αl) − μ(a) + Ep[ (a − X)⊤ ∇x μ(X) | Al ] + (1/2) Ep[ (a − X)⊤ ∇x² μ(X) (a − X) | Al ] )².   (7.82)
Making this objective function G(a; μ) small in a will provide us with a good
reference point for the selected quantile levels (αl )1≤l≤L; this is exactly the MACQ
proposal of Merz et al. [273]. A local minimum can be found by applying a gradient
descent algorithm
∇a G(a; μ) = 2 Σ_{l=1}^{L} ( F^{-1}_{μ(X)}(αl) − μ(a) + Ep[ (a − X)⊤ ∇x μ(X) | Al ] + (1/2) Ep[ (a − X)⊤ ∇x² μ(X) (a − X) | Al ] )
× ( − ∇a μ(a) + Ep[ ∇x μ(X) | Al ] − Ep[ ∇x² μ(X) X | Al ] + Ep[ ∇x² μ(X) | Al ] a ).
All subsequent considerations and interpretations are done w.r.t. an optimal ref-
erence point a ∈ Rq by minimizing the objective function (7.82) on the chosen
quantile grid. Mathematically speaking, this optimal choice is w.l.o.g. because the
origin 0 of the coordinate system of the feature space X is arbitrary, and any
other origin can be chosen by a translation, see formula (7.81) and the subsequent
discussion. For interpretations, however, the choice of the reference point a matters
because the directional derivative Xj μj (X) can be small either because Xj is small
or because μj (X) is small. Having a small Xj means that this feature value is close
to the chosen reference point.
Example 7.39 (MACQ Analysis) We revisit the MTPL claim frequency example
using the FN network regression model of depth d = 3 having (q1 , q2 , q3 ) =
(20, 15, 10) neurons. Importantly, we use the hyperbolic tangent as the activation
function in the FN layers which provides smoothness of the regression function.
Figure 7.40 shows the VPI plot of this fitted model. Obviously, the variable
BonusMalus plays the most important role in this predictive model. Remark that
the VPI plot does not properly respect the dependence structure in the features as it
independently permutes each feature component at a time. The aim in this example
is to determine variable importance by doing the MACQ analysis (7.78).
Figure 7.44 (lhs) shows the empirical density of the fitted canonical parameter
θ (x i ), 1 ≤ i ≤ n; all plots in this example refer to the canonical scale. We then
minimize the objective function (7.82), which provides us with an optimal reference point a ∈ Rq; we choose the equidistant quantile grid 1% < 2% < . . . < 99%, and all conditional expectations in ∇aG(a; μ) are empirically approximated by a local smoother similar to Listing 7.8. Figure 7.44 (rhs) gives the resulting marginal
attributions w.r.t. this reference point. The orange line shows the 1st order marginal
Fig. 7.44 (lhs) Empirical density of the fitted canonical parameter θ(x i ), 1 ≤ i ≤ n, (rhs) 1st and
2nd order marginal attributions
Area
VehPower
0.4
VehAge
interaction attributions
DrivAge
BonusMalus
VehGas
0.2
0.5
Density
VehBrand
sensitivities
Region
0.0
0.0
BonusMalusX−AreaX
BonusMalusX−DrivAgeX
VehGasX−DrivAgeX
VehGasX−BonusMalusX
VehBrand2−BonusMalusX
Region2−BonusMalusX
−1.0
Region2−VehBrand2
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
quantile level alpha quantile level alpha
Fig. 7.45 (lhs) Second order marginal attributions Sj(μ; α) − (1/2)Tj,j(μ; α) excluding interaction terms, and (rhs) interaction terms −(1/2)Tj,k(μ; α), j ≠ k
attributions (7.76), and the red line the 2nd order marginal attributions (7.78). The cyan line drops the interaction terms Tj,k(μ; α), j ≠ k, from the 2nd order marginal attributions. From the shaded cyan area we see the importance of the interaction terms. We note that the 2nd order marginal attributions (red line) match the true empirical quantiles (black dots) quite well for the chosen reference point a.
Figure 7.45 gives the 2nd order marginal attributions Sj(μ; α) − (1/2)Tj,j(μ; α) of the individual components 1 ≤ j ≤ q on the left-hand side, and the interaction terms −(1/2)Tj,k(μ; α), j ≠ k, on the right-hand side. We identify the following components as being important: BonusMalus, DrivAge, VehGas, VehBrand and Region; these components show a behavior substantially different from being equal to 0, i.e.,
Fig. 7.46 (lhs) Second order marginal attributions Sj(μ; α) − (1/2) Σ_{k=1}^{q} Tj,k(μ; α) including interaction terms, and (rhs) slices at the quantile levels α ∈ {20%, 40%, 60%, 80%}
these components differentiate from the reference point a. These components also
have major interactions that contribute to the quantiles above the level 80%.
If we allocate the interaction terms to the corresponding components 1 ≤ j ≤ q, we receive the second order marginal attributions Sj(μ; α) − (1/2) Σ_{k=1}^{q} Tj,k(μ; α).
These are illustrated in Fig. 7.46 (lhs) and the quantile slices at the levels α ∈
{20%, 40%, 60%, 80%} are given in Fig. 7.46 (rhs). These graphs illustrate variable
importance on different quantile levels (and respecting the dependence within
the features). In particular, we identify the main variables that distinguish the
given quantile levels from the reference level θ (a), i.e., Fig. 7.46 (rhs) should be
understood as the relative differences to the chosen reference level. Once more we
see that BonusMalus is the main driver, but also other variables contribute to the
differentiation of the high quantile levels.
Figure 7.47 shows the individual attributions xi,j μj (x i ) of 1’000 randomly
selected cases x i for the feature components j = BonusMalus, DrivAge,
VehGas, VehBrand; the colors illustrate the corresponding feature values xi,j
of the individual car drivers i, and the black solid line corresponds to Sj(μ; α) − (1/2)Tj,j(μ; α) excluding the interaction terms (the black dotted line is one empirical standard deviation around the black solid line). Focusing on the variable
BonusMalus we observe that the lower quantiles are almost completely domi-
nated by insurance policies on the lowest bonus-malus level. The bonus-malus levels
70–80 provide little sensitivity (are concentrated around the zero line) because the
reference point a reflects these bonus-malus levels, and, finally, the large quantiles
are dominated by high bonus-malus levels (red dots).
The plot of the variable DrivAge is interpreted similarly. The reference point
a is close to the young drivers, therefore, young drivers are concentrated around
the zero line. At the low quantile levels, higher ages contribute positively to the
low expected frequencies, whereas these ages have an unfavorable impact at higher quantile levels.
Fig. 7.47 Individual attributions xi,j μj (x i ) of 1’000 randomly selected cases x i for j =
BonusMalus, DrivAge, VehGas, VehBrand; the plots have different y-scales
In the previous section we have studied some model-agnostic tools that can be used
for any (differentiable) regression model. In this section we give some network
specific plots. For simplicity we choose one specific example, namely, the FN network μ := μm=1 of Table 7.9. We start by analyzing the learned representations in the different FN layers; this links to our introduction in Sect. 7.1.
For any FN layer 1 ≤ m ≤ d we can study the learned representations
z(m:1) (x). For Fig. 7.48 we select at random 1’000 insurance policies x i , and the
dots show the activations of these insurance policies in neurons j = 4 (x-axis)
and j = 9 (y-axis) in the corresponding FN layers. These neuron activations are
in the interval (−1, 1) because we work with the hyperbolic tangent activation
function for φ. The color scale shows the resulting estimated frequencies μ(x i ) of
the selected policies. We observe that the layers are increasingly (in the depth of the
network) separating the low frequency policies (light blue-green colors) from the
high frequency policies (red color). This is a quite typical picture that we obtain here, though this sparsity in the 3rd FN layer is not present in every neuron 1 ≤ j ≤ qd.
In higher dimensional FN architectures it will be difficult to analyze the learned
representations on each individual neuron, but at least one can try to understand
the main effects learned. For this, on the one hand, we can focus on the important
feature components, see, e.g., Sect. 7.6.1, and, on the other hand, we can try to study
the main effects learned using a PCA in each FN layer, see Sect. 7.5.3. Figure 7.49
shows the singular values λ1 ≥ λ2 ≥ . . . ≥ λqm > 0 in each of the three FN layers
1 ≤ m ≤ d = 3; we center the neuron activations to mean zero before applying
the SVD. These plots support the previously made statement that the layers are
increasingly separating the high frequency from the low frequency policies. An
elbow criterion tells us that in the first FN layer we have 8 important principal
components (out of 20), in the second FN layer 3 (out of 15) and in the third FN
layer 1 (out of 10). This is also reflected in Fig. 7.48 where we see more and more
Fig. 7.49 Singular values λ1 ≥ λ2 ≥ . . . ≥ λqm of the centered neuron activations: (lhs) FN layer 1, (middle) FN layer 2, (rhs) FN layer 3
Fig. 7.50 (First row) Input variables (BonusMalus, DrivAge), (Second–fourth row) first two
principal components in FN layers m = 1, 2, 3; (lhs) gives the color scale of estimated frequency
μ, (middle) gives the color scale BonusMalus, and (rhs) gives the color scale DrivAge
7.7 Lab: Analysis of the Fitted Networks 379
Fig. 7.51 Second order marginal attributions: (lhs) w.r.t. the input layer x ∈ Rq0 , (middle)
w.r.t. the first FN layer z(1:1) (x) ∈ Rq1 , and (rhs) w.r.t. the second FN layer z(2:1) (x) ∈ Rq2
the plots more smooth and making the interactions between the neurons smaller.
Note that the learned representations z(3:1)(x i ) ∈ Rq3 in the last FN layer go into
a classical GLM for the output layer, which does not have any interactions in the
canonical predictor (because it is additive on the canonical scale), thus, being of
the same type as the linear regression of Example 7.37. In the Poisson model with
the log-link function, the interactions can only be of a multiplicative type in GLMs.
Therefore, the network feature-engineers the input x i (in an automated way) such
that the learned representation z(d:1)(x i ) in the last FN layer is exactly in this GLM
structure. This is verified by the small interaction part in Fig. 7.51 (rhs). This closes
this part on model-agnostic tools.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 8
Recurrent Neural Networks
We start from a deep FN network providing the regression function, see (7.2)–(7.3),
This network regression function can be chosen independent of T since the relevant
history x T −τ :T always has the same length τ + 1. The time variable T could be used
as a feature component in x T −τ :T . The disadvantage of this approach is that such
a FN network architecture does not respect the temporal causality. Observe that we
feed the past history into the first FN layer
This operation typically does not respect any topology in the time index of
x T −τ +1:T . Thus, the FN network does not recognize that the feature x t −1 has been
experienced just before the next feature x t . For this reason we are looking for a
network architecture that can handle the time-series information in a temporal causal
way.
1 More mathematically speaking, we assume to have a filtration (At )t≥0 on the probability space
(Ω, A, P). The basic assumption then is that both sequences (x t )t and (Yt )t are (At )t -adapted, and
we aim at predicting YT +1 , based on the information AT . In the above case this information AT is
generated by x 0 , x 1 , . . . , x T , where x t typically includes the observation Yt . We could also shift
the time index in x t by one time unit, and in that case we would assume that (x t )t is previsible
w.r.t. the filtration (At )t . We do not consider this shift in time index as it only makes the notation
unnecessarily more complicated, but the results remain the same by including the information
correspondingly into the features.
8.2 Plain-Vanilla Recurrent Neural Network 383
where the RN layer z(1) has the same structure as the FN layer given in (7.5), but
based on feature input (x t , zt −1 ) ∈ X × Rq1 ⊂ {1} × Rq0 × Rq1 , and not including
an intercept component {1} in the output.
having neurons, 1 ≤ j ≤ q1,

zj(1)(x, z) = φ( ⟨wj(1), x⟩ + ⟨uj(1), z⟩ ),   (8.4)
Thus, the FN layers (7.5)–(7.6) and the RN layers (8.3)–(8.4) are structurally
equivalent, only the input x ∈ X is adapted to the time-series structure (x t , zt −1 ) ∈
X × Rq1 . Before giving more interpretation and before explaining how this single
RN network structure can be extended to a deep RN network we illustrate this RN
layer.
Fig. 8.1 RN layer z(1) processing the input (x t, zt−1); the output zt is fed back (loop) into the RN layer to process the next input (x t+1, zt)
Fig. 8.2 Unfolded representation of RN layer z(1) processing the input (x t , zt−1 )
Figure 8.1 shows an RN layer z(1) processing the input (x t , zt −1 ), see (8.2). From
this graph, the recurrent structure becomes clear since we have a loop (cycle) feeding
the output zt back into the RN layer to process the next input (x t +1 , zt ).
Often one depicts the RN architecture in a so-called unfolded way. This is done
in Fig. 8.2. Instead of plotting the loop (cycle) as in Fig. 8.1 (orange arrow in the
colored version), we unfold this loop by plotting the RN layer multiple times. Note
that this RN layer in Fig. 8.2 always uses the same network weights wj(1) and uj(1), 1 ≤ j ≤ q1, for all t. Moreover, the use of the colors of the arrows (in the colored
version) in the two figures coincides.
Remarks 8.1
• The neurons of the RN layer (8.4) have the following structure

zj(1)(x, z) = φ( ⟨wj(1), x⟩ + ⟨uj(1), z⟩ ) = φ( w0,j(1) + Σ_{l=1}^{q0} wl,j(1) xl + Σ_{l=1}^{q1} ul,j(1) zl ).
The network weights W(1) = (wj(1))1≤j≤q1 ∈ R(q0+1)×q1 include an intercept component w0,j(1), and the network weights U(1) = (uj(1))1≤j≤q1 ∈ Rq1×q1 do not.
zt[1] = z(1)( x t, zt−1[1] ) ∈ Rq1,   (8.5)
zt[m] = z(m)( zt[m−1], zt−1[m] ) ∈ Rqm   for 2 ≤ m ≤ d.   (8.6)

Since zt[m] is a function of the entire history x 0:t, we also use the notation zt[m] = zt[m](x 0:t).
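The recursion (8.3)–(8.4) is straightforward to state in code; a minimal numpy sketch of a single RN layer with hyperbolic tangent activation (variable names ours, the initial state z−1 is set to zero):

```python
import numpy as np

def rn_layer(xs, W, U, b):
    """Plain-vanilla RN layer: z_t = tanh(W x_t + U z_{t-1} + b), z_{-1} = 0.
    The same weights (W, U, b) are reused at every time step t."""
    z = np.zeros(U.shape[0])
    zs = []
    for x in xs:  # temporal causal processing of x_0, x_1, ...
        z = np.tanh(W @ x + U @ z + b)
        zs.append(z)
    return np.array(zs)
```

Setting the recurrent weights U to zero removes the memory, and the layer collapses to a plain FN layer applied independently at each time step.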
or we can add a skip connection from the input variable x t to the second RN layer

zt[1] = z(1)( x t, zt−1[1] ),
zt[2] = z(2)( x t, zt[1], zt−1[2] ).
Fig. 8.4 Forecasting the response YT +1 using a RN network (8.8) based on a single RN layer
d = 1 and on a FN network of depth D
ϑ → D(Y T+1, ϑ) = (1/n) Σ_{i=1}^{n} (vi,T+1/ϕ) d( Yi,T+1, μϑ(x i,0:T) ),   (8.9)
i=1
for the observations Y T +1 = (Y1,T +1 , . . . , Yn,T +1 ) , and where ϑ collects all the
RN and FN network weights/parameters of the regression function (8.8). This model
can now be fitted using a variant of the gradient descent algorithm. The variant
uses back-propagation through time (BPTT) which is an adaption of the back-
propagation method to calculate the gradient w.r.t. the network parameter ϑ.
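For instance, for Poisson claim counts the unit deviance is d(y, μ) = 2(μ − y + y log(y/μ)), with the convention y log(y/μ) = 0 for y = 0, and the loss (8.9) can then be sketched as (function and variable names are ours):

```python
import numpy as np

def poisson_deviance_loss(y, mu, v, phi=1.0):
    """Average deviance loss (8.9) with the Poisson unit deviance
    d(y, mu) = 2 * (mu - y + y * log(y / mu)), using y log(y/mu) -> 0 for y = 0."""
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    d = 2.0 * (mu - y + ylogy)
    return np.mean(v / phi * d)
```

The unit deviance is non-negative and vanishes exactly at y = μ, so minimizing this loss in ϑ fits the network predictions to the observed responses.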
ϑ → D(Y, ϑ) = (1/n) Σ_{i=1}^{n} (1/(T+1)) Σ_{t=0}^{T} (vi,t+1/ϕ) d( Yi,t+1, μϑ(x i,0:t) ).   (8.11)
Fig. 8.5 Forecasting (Yt+1 )t using a RN network (8.10) based on a single RN layer d = 1 and
using a time-distributed FN layer for the outputs
Note that this can easily be adapted if the different cases 1 ≤ i ≤ n have different
lengths in their histories. An example is provided in Listing 10.8, below.
$$\phi_\sigma(x) = \frac{1}{1+e^{-x}} \in (0,1) \qquad \text{and} \qquad \phi_{\tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \in (-1,1),$$
with the network weights $W_f^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_f^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and with the sigmoid activation function $\phi_\sigma^f = \phi_\sigma$; we also refer to (8.7).
8.3 Special Recurrent Neural Networks 391
with the network weights $W_i^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_i^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and with the sigmoid activation function $\phi_\sigma^i = \phi_\sigma$.
• The output gate models the release of memory information rate

$$\boldsymbol{o}_t^{[m]} = \boldsymbol{o}^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi_\sigma^o\left(\left\langle W_o^{(m)}, z_t^{[m-1]}\right\rangle + \left\langle U_o^{(m)}, z_{t-1}^{[m]}\right\rangle\right) \in (0,1)^{q_m}, \qquad (8.12)$$

with the network weights $W_o^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_o^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and with the sigmoid activation function $\phi_\sigma^o = \phi_\sigma$.
These gates have outputs in $(0,1)$, and they determine the relative amount of memory that is updated and released in each step. The so-called cell state process $(\boldsymbol{c}_t^{[m]})_t$ is used to store the relevant memory. Given $z_t^{[m-1]}$, $z_{t-1}^{[m]}$ and $\boldsymbol{c}_{t-1}^{[m]}$, the updated cell state is defined by

$$\boldsymbol{c}_t^{[m]} = \boldsymbol{c}^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}, \boldsymbol{c}_{t-1}^{[m]}\right) \qquad (8.13)$$
$$= \boldsymbol{f}_t^{[m]} \odot \boldsymbol{c}_{t-1}^{[m]} + \boldsymbol{i}_t^{[m]} \odot \phi_{\tanh}\left(\left\langle W_c^{(m)}, z_t^{[m-1]}\right\rangle + \left\langle U_c^{(m)}, z_{t-1}^{[m]}\right\rangle\right) \in \mathbb{R}^{q_m},$$

with the network weights $W_c^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_c^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and where $\odot$ denotes the Hadamard product. This defines how the memory (cell state) is updated and passed forward using the forget and the input gates $\boldsymbol{f}_t^{[m]}$ and $\boldsymbol{i}_t^{[m]}$, respectively. The neuron activations $z_t^{[m]}$ are updated, given $z_t^{[m-1]}$, $z_{t-1}^{[m]}$ and $\boldsymbol{c}_t^{[m]}$, by

$$z_t^{[m]} = z^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}, \boldsymbol{c}_t^{[m]}\right) = \boldsymbol{o}_t^{[m]} \odot \phi\left(\boldsymbol{c}_t^{[m]}\right) \in \mathbb{R}^{q_m}, \qquad (8.14)$$
Fig. 8.6 LSTM cell $z^{(m)}$ with forget gate $\phi_\sigma^f$, input gate $\phi_\sigma^i$ and output gate $\phi_\sigma^o$
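The gates and updates (8.12)–(8.14) can be collected into one function. The following is a minimal numpy sketch of a single LSTM time step (not the book's R implementation; intercepts are absorbed into the weight matrices for brevity, all variable names are illustrative, and we choose $\phi = \phi_{\tanh}$ for the output activation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, z_prev, c_prev, Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc):
    """One LSTM time step: (0,1)-valued gates, cell state update (8.13),
    and neuron activations (8.14); intercepts omitted for brevity."""
    f = sigmoid(Wf.T @ x + Uf.T @ z_prev)                    # forget gate
    i = sigmoid(Wi.T @ x + Ui.T @ z_prev)                    # input gate
    o = sigmoid(Wo.T @ x + Uo.T @ z_prev)                    # output gate
    c = f * c_prev + i * np.tanh(Wc.T @ x + Uc.T @ z_prev)   # cell state (8.13)
    z = o * np.tanh(c)                                       # activations (8.14)
    return z, c

rng = np.random.default_rng(1)
q0, q1 = 3, 4
Ws = [rng.normal(size=(q0, q1)) for _ in range(4)]   # Wf, Wi, Wo, Wc
Us = [rng.normal(size=(q1, q1)) for _ in range(4)]   # Uf, Ui, Uo, Uc
z, c = np.zeros(q1), np.zeros(q1)
for t in range(5):                                   # unroll over a short time-series
    z, c = lstm_cell(rng.normal(size=q0), z, c,
                     Ws[0], Us[0], Ws[1], Us[1], Ws[2], Us[2], Ws[3], Us[3])
print(z.shape, c.shape)
```

The cell state $\boldsymbol{c}_t$ is the only quantity that can carry information forward without being squashed through an activation at every step, which is what mitigates the vanishing gradient problem of plain-vanilla RN layers.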
The LSTM architecture of the previous section seems quite complex and involves many parameters. Cho et al. [76] have introduced the GRU architecture that is simpler and uses fewer parameters, but has similar properties. The GRU architecture uses two gates that are defined as follows for $t \ge 1$, see also (8.7):
with the network weights $W_r^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_r^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and with the sigmoid activation function $\phi_\sigma^r = \phi_\sigma$.
• The update gate models the memory update rate
$$\boldsymbol{u}_t^{[m]} = \boldsymbol{u}^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi_\sigma^u\left(\left\langle W_u^{(m)}, z_t^{[m-1]}\right\rangle + \left\langle U_u^{(m)}, z_{t-1}^{[m]}\right\rangle\right) \in (0,1)^{q_m},$$
with the network weights $W_u^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_u^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and with the sigmoid activation function $\phi_\sigma^u = \phi_\sigma$.
The neuron activations $z_t^{[m]}$ are updated, given $z_t^{[m-1]}$ and $z_{t-1}^{[m]}$, by

$$z_t^{[m]} = z^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) \qquad (8.16)$$
$$= \boldsymbol{r}_t^{[m]} \odot z_{t-1}^{[m]} + \left(1 - \boldsymbol{r}_t^{[m]}\right) \odot \phi\left(\left\langle W^{(m)}, z_t^{[m-1]}\right\rangle + \boldsymbol{u}_t^{[m]} \odot \left\langle U^{(m)}, z_{t-1}^{[m]}\right\rangle\right) \in \mathbb{R}^{q_m},$$

with the network weights $W^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and for a general activation function $\phi$.
The GRU and the LSTM architectures are similar, the former using fewer parameters because we do not explicitly model the cell state process. For an illustration of a GRU cell we refer to Fig. 8.7. In the sequel we focus on the LSTM architecture;
Fig. 8.7 GRU cell $z^{(m)}$ with reset gate $\phi_\sigma^r$ and update gate $\phi_\sigma^u$
though the GRU architecture is simpler and has fewer parameters, it is less robust in fitting.
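A minimal numpy sketch of one GRU time step in the form of (8.16) may be helpful; note that in this parametrization the reset gate $\boldsymbol{r}$ mixes the old state with the new candidate, while the update gate $\boldsymbol{u}$ acts inside the candidate. Intercepts are omitted and all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, z_prev, Wr, Ur, Wu, Uu, W, U, phi=np.tanh):
    """One GRU time step in the form of (8.16): z_t is a gate-weighted
    mix of the old state z_{t-1} and a new candidate activation."""
    r = sigmoid(Wr.T @ x + Ur.T @ z_prev)   # reset gate, in (0,1)
    u = sigmoid(Wu.T @ x + Uu.T @ z_prev)   # update gate, in (0,1)
    z = r * z_prev + (1.0 - r) * phi(W.T @ x + u * (U.T @ z_prev))
    return z

rng = np.random.default_rng(2)
q0, q1 = 3, 4
mats = {n: rng.normal(size=(q0, q1)) for n in ("Wr", "Wu", "W")}
mats.update({n: rng.normal(size=(q1, q1)) for n in ("Ur", "Uu", "U")})
z = np.zeros(q1)
for t in range(5):                          # unroll over a short time-series
    z = gru_cell(rng.normal(size=q0), z, mats["Wr"], mats["Ur"],
                 mats["Wu"], mats["Uu"], mats["W"], mats["U"])
print(z.shape)
```

Compared to the LSTM cell, the GRU drops the separate cell state and one gate, which is where the saving in parameters comes from.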
The mortality data has a natural time-series structure, and for this reason mortality
forecasting is an obvious problem that can be studied within RN networks. For
instance, the LC mortality model (7.63) involves a stochastic process (kt )t that
needs to be extrapolated into the future. This extrapolation can be done in different ways. The original proposal of Lee and Carter [238] has been to analyze ARIMA time-series models; using standard statistical tools, Lee and Carter found that the random walk with drift gives a good stochastic description of the time index process $(k_t)_t$. Nigri et al. [286] proposed to fit a LSTM network to this stochastic process; this approach is also studied in Lindholm–Palmborg [252]
where an efficient use of the mortality data for network fitting is discussed. These
approaches still rely on the classical LC calibration using the SVD of Sect. 7.5.4,
and the LSTM network is (only) used to extrapolate the LC time index process (kt )t .
More generally, one can design a RN network architecture that directly processes
the raw mortality data Mx,t = Dx,t /ex,t , not specifically relying on the LC structure.
This has been done in Richman–Wüthrich [316] using a FN network architecture, in
Perla et al. [301] using a RN network and a convolutional neural (CN) network
architecture, and in Schürch–Korn [330] extending this analysis to the study of
prediction uncertainty using bootstrapping. A similar CN network approach has
been taken by Wang et al. [375] interpreting the raw mortality data of Fig. 7.32
as an image.
We revisit the LC mortality model [238] presented in Sect. 7.5.4. The LC log-mortality rate is assumed to have the structure $\log(\mu_{x,t}) = a_x + b_x k_t$, see (7.63),
for the ages $x_0 \le x \le x_1$ and for the calendar years $t \in \mathcal{T}$. We now add the upper indices $^{(p)}$ to consider different populations $p$. The SVD gives us the estimates $\widehat{a}_x^{(p)}$, $\widehat{k}_t^{(p)}$ and $\widehat{b}_x^{(p)}$ based on the observed centered raw log-mortality rates, see Sect. 7.5.4. The SVD is applied to each population $p$ separately, i.e., there is no interaction between the different populations. This approach allows us to fit a separate log-mortality surface estimate $(\log(\widehat{\mu}_{x,t}^{(p)}))_{x_0\le x\le x_1;\, t\in\mathcal{T}}$ to each population $p$. Figure 7.33
8.4 Lab: Mortality Forecasting with RN Networks 395
shows an example for two populations p, namely, for Swiss females and for Swiss
males.
Mortality forecasting requires extrapolating the time index processes $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$ beyond the latest observed calendar year $t_1 = \max\{\mathcal{T}\}$. As mentioned in Lee–Carter [238], a random walk with drift provides a suitable model for $(\widehat{k}_t^{(p)})_{t\ge 0}$ for many populations $p$, see Fig. 7.35 for the Swiss population. Assume that

$$\widehat{k}_{t+1}^{(p)} = \widehat{k}_t^{(p)} + \varepsilon_{t+1}^{(p)}, \qquad t \ge 0, \qquad (8.17)$$

with $\varepsilon_t^{(p)} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\delta_p, \sigma_p^2)$, $t \ge 1$, having drift $\delta_p \in \mathbb{R}$ and variance $\sigma_p^2 > 0$.
Model assumption (8.17) allows us to estimate the (constant) drift $\delta_p$ with MLE. For observations $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$ we receive the log-likelihood function

$$\delta_p \mapsto \ell_{(\widehat{k}^{(p)})}(\delta_p) = \sum_{t=t_0+1}^{t_1} \left( -\log\left(\sqrt{2\pi}\,\sigma_p\right) - \frac{1}{2\sigma_p^2}\left(\widehat{k}_t^{(p)} - \widehat{k}_{t-1}^{(p)} - \delta_p\right)^2 \right),$$

and the MLE of the drift is given by the telescoping sum

$$\widehat{\delta}_p^{\mathrm{MLE}} = \frac{\widehat{k}_{t_1}^{(p)} - \widehat{k}_{t_0}^{(p)}}{t_1 - t_0}. \qquad (8.18)$$

This provides the extrapolation beyond the latest observed calendar year $t_1$

$$\widehat{k}_t^{(p)} = \widehat{k}_{t_1}^{(p)} + (t - t_1)\, \widehat{\delta}_p^{\mathrm{MLE}}.$$
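The MLE (8.18) follows because the Gaussian log-likelihood is maximized by the empirical mean of the increments, and this mean telescopes, so only the first and the last observation matter. A small Python check on simulated data (illustrative only; the book works with the estimated HMD time indices, and all names here are our own):

```python
import numpy as np

rng = np.random.default_rng(3)
t0, t1, delta, sigma = 1950, 2003, -1.2, 0.8

# simulate a random walk with drift (8.17) on the calendar years t0..t1
eps = rng.normal(delta, sigma, size=t1 - t0)
k = np.concatenate(([0.0], np.cumsum(eps)))     # (k_t) for t = t0, ..., t1

# MLE of the drift: mean of the increments = telescoping sum (8.18)
delta_mle = np.mean(np.diff(k))
assert np.isclose(delta_mle, (k[-1] - k[0]) / (t1 - t0))

# extrapolation beyond t1, here for the calendar years 2004..2018
years_ahead = np.arange(1, 16)
k_forecast = k[-1] + years_ahead * delta_mle
print(round(delta_mle, 3), k_forecast.shape)
```

The linear-in-time extrapolation is what produces the straight forecast lines to the right of the dotted lines in Fig. 8.8.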
We explore this extrapolation for different Western European countries from the HMD [195]. We consider separately females and males of the countries {AUT, BE, CH, ESP, FRA, ITA, NL, POR}, thus, we choose $2 \cdot 8 = 16$ different populations $p$. For these countries we have observations for the ages $0 = x_0 \le x \le x_1 = 99$ and for the calendar years $1950 \le t \le 2018$.³ For the following analysis we choose $\mathcal{T} = \{t_0 \le t \le t_1\} = \{1950 \le t \le 2003\}$, thus, we fit the models on 54 years of mortality history. These fitted models are then extrapolated to the calendar years
$2004 \le t \le 2018$. These 15 calendar years from 2004 to 2018 allow us to perform an out-of-sample evaluation because we have the observations $M_{x,t}^{(p)} = D_{x,t}^{(p)}/e_{x,t}^{(p)}$ for these years from the HMD [195].
Figure 8.8 shows the estimated time index process $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$ to the left of the dotted lines, and to the right of the dotted lines we have the random walk with drift extrapolation $(\widehat{k}_t^{(p)})_{t>t_1}$. The general observation is that, indeed, the random walk with drift seems to be a suitable model for $(\widehat{k}_t^{(p)})_t$. Moreover, there is a huge
³ We exclude Germany from this consideration of (continental) Western European countries because the German mortality history is shorter due to the reunification in 1990.
Fig. 8.8 Random walk with drift extrapolation of the time index process $(\widehat{k}_t)_t$ for different countries and genders; the y-scale is the same in both plots
similarity between the different countries, with only the Netherlands (NL) being somewhat of an outlier.
Remarks 8.4
• For Fig. 8.8 we did not explore any fine-tuning; for instance, the estimation of the drift $\delta_p$ is very sensitive to the selection of the time span $\mathcal{T}$. ESP has the biggest negative drift estimate, but this is partially caused by the corresponding observations in the calendar years between 1950 and 1960, see Fig. 8.8, which may no longer be relevant for a decline in mortality in the new millennium.
• For all countries, the females have a bigger negative drift than the males (the y-scale in both plots is the same). Moreover, note that we use the normalization $\sum_{x=x_0}^{x_1} \widehat{b}_x^{(p)} = 1$ and $\sum_{t\in\mathcal{T}} \widehat{k}_t^{(p)} = 0$, see (7.65). This normalization is discussed and questioned in many publications as the extrapolation becomes dependent on these choices; see De Jong et al. [90] and the references therein, who propose different identification schemes.
• Another issue is age coherence in forecasting, meaning that for long-term forecasts the mortality rates across the different ages should not diverge, see Li et al. [250], Li–Lu [248] and Gao–Shi [153] and the references therein.
• There are many modifications and extensions of the LC model; we just mention a few of them. Brouhns et al. [56] embed the LC model into a Poisson modeling framework which provides a proper stochastic model for mortality modeling. Renshaw–Haberman [308] extend the one-factor LC model to a multifactor model, and in Renshaw–Haberman [309] a cohort effect is added. Hyndman–Ullah [197] and Hainaut–Denuit [179] explore a functional data method and a wavelet-based decomposition, respectively. The static PCA can be adapted to a dynamic PCA version, see Shang [333], and a long memory behavior in the time-series is studied in Yan et al. [395].
Our aim here is to replace the individual random walk with drift extrapolations (8.17) by a common extrapolation across all considered populations $p$. For this we design a LSTM architecture. A second observation is that the increments $\widehat{\varepsilon}_t^{(p)} = \widehat{k}_t^{(p)} - \widehat{k}_{t-1}^{(p)}$ have an average empirical auto-correlation (for lag 1) of $-0.33$. This clearly questions the Gaussian i.i.d. assumption in (8.17).
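The lag-1 empirical auto-correlation used in this diagnostic can be computed in a few lines; a small Python sketch (illustrative only; under the i.i.d. Gaussian assumption (8.17) this statistic should fluctuate around zero):

```python
import numpy as np

def lag1_autocorr(eps):
    """Empirical lag-1 auto-correlation of a sequence of increments."""
    e = eps - eps.mean()
    return np.sum(e[1:] * e[:-1]) / np.sum(e * e)

rng = np.random.default_rng(4)
k = np.cumsum(rng.normal(-1.0, 0.8, size=54))   # random walk with drift
rho = lag1_autocorr(np.diff(k))                  # increments of the walk
print(round(rho, 3))                             # close to 0 for i.i.d. increments
```

A markedly negative value, as the $-0.33$ observed on the HMD time indices, indicates that an upward increment tends to be followed by a downward one, which a random walk with i.i.d. Gaussian increments cannot reproduce.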
We first discuss the available data and we construct the input data. We have the time-series observations $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$, and the population index $p = (c, g)$ has two categorical labels, $c$ for country and $g$ for gender. We are going to use two-dimensional embedding layers for these two categorical variables, see (7.31) for embedding layers. The time-series observations $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$ will be pre-processed such that we do not simultaneously feed the entire time-series into the LSTM layer, but we divide them into shorter time-series. We will directly forecast the increments $\widehat{\varepsilon}_t^{(p)} = \widehat{k}_t^{(p)} - \widehat{k}_{t-1}^{(p)}$ and not the time index process $(\widehat{k}_t^{(p)})_{t\ge t_0}$; in extrapolations with drift it is easier to forecast the increments with the networks. We choose a lookback period of $\tau = 3$ calendar years, and we aim at predicting the response $Y_t^{(p)} = \widehat{\varepsilon}_t^{(p)}$ based on the time-series features $\boldsymbol{x}_{t-\tau:t-1} = (\widehat{\varepsilon}_{t-\tau}^{(p)}, \ldots, \widehat{\varepsilon}_{t-1}^{(p)}) \in \mathbb{R}^\tau$. This provides us with the following data structure for each population $p = (c, g)$:
Thus, each observation $Y_t^{(p)} = \widehat{\varepsilon}_t^{(p)}$ is equipped with the feature information $(t, c, g, \boldsymbol{x}_{t-\tau:t-1})$. As discussed in Lindholm–Palmborg [252], one should highlight that there is a dependence across $t$, since we have a diagonal cohort structure in the
In Fig. 8.9 we plot the LSTM architecture used to forecast $\widehat{\varepsilon}_t^{(p)}$ for $t > t_1$, and Listing 8.1 gives the corresponding R code. We process the time-series $\boldsymbol{x}_{t-\tau:t-1}$
through a LSTM cell, see lines 14–16 of Listing 8.1. We choose a shallow LSTM
network (d = 1) and therefore drop the upper index m = 1 in (8.15), but we add
an upper index [LSTM] to highlight the output of the LSTM cell. This gives us the
concatenation
Fig. 8.9 LSTM architecture used to forecast $\widehat{\varepsilon}_t^{(p)}$ for $t > t_1$
This LSTM recursion to process the time-series $\boldsymbol{x}_{t-\tau:t-1}$ gives us the LSTM output $z_{t-1}^{[\mathrm{LSTM}]} \in \mathbb{R}^{q_1}$, and it involves $4(q_0+1+q_1)q_1 = 4(2+15)\,15 = 1{,}020$ network parameters for the input dimension $q_0 = 1$ and the output dimension $q_1 = 15$.
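The count $4(q_0+1+q_1)q_1$ collects the four weight blocks of the LSTM cell (forget gate, input gate, output gate and cell candidate), each consisting of an input matrix in $\mathbb{R}^{(q_0+1)\times q_1}$ (including the intercept) and a recurrent matrix in $\mathbb{R}^{q_1\times q_1}$. A one-line arithmetic check in Python:

```python
def lstm_params(q0, q1):
    """Number of LSTM weights: four blocks (forget, input, output, cell),
    each with an (q0+1) x q1 input matrix (incl. intercept) and a
    q1 x q1 recurrent matrix; equals 4 (q0 + 1 + q1) q1."""
    return 4 * ((q0 + 1) * q1 + q1 * q1)

print(lstm_params(1, 15))    # 1020: time index model of this section
print(lstm_params(100, 20))  # 9680: raw mortality model of Listing 8.2
```

This is four times the parameter count of a plain-vanilla RN layer of the same dimensions, which is the price paid for the gating mechanism.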
For the categorical country code c and the binary gender g we choose two-
dimensional embedding layers, see (7.31),
Using the feature information $(t_1+1, c, g, \boldsymbol{x}_{t_1+1-\tau:t_1})$ allows us to forecast the next increment $Y_{t_1+1}^{(p)} = \widehat{\varepsilon}_{t_1+1}^{(p)}$ by $\widehat{Y}_{t_1+1}$, using the fitted LSTM architecture of Fig. 8.9. Thus, this LSTM network allows us to perform a one-period-ahead forecast to receive

$$\widehat{k}_{t_1+1}^{(p)} = \widehat{k}_{t_1}^{(p)} + \widehat{Y}_{t_1+1}. \qquad (8.21)$$
(Figure: two-dimensional embeddings for forecasting $(\widehat{k}_t)_t$, one point per country AUT, BE, CH, ESP, FRA, ITA, NL, POR.)
This update (8.21) needs to be iterated recursively. For the next period $t = t_1 + 2$ we set for the time-series feature

$$\boldsymbol{x}_{t_1+2-\tau:t_1+1} = \left(\widehat{\varepsilon}_{t_1+2-\tau}^{(p)}, \ldots, \widehat{\varepsilon}_{t_1}^{(p)}, \widehat{Y}_{t_1+1}\right) \in \mathbb{R}^\tau, \qquad (8.22)$$
the volatility in the red curve is smaller than in the blue curve, the former relating to expected values and the latter to observations of the random variables (which should be more volatile).
The light-blue color shows the random walk with drift extrapolation (which is just a horizontal straight line at level $\widehat{\delta}_p^{\mathrm{MLE}}$, see (8.18)). The orange color shows the LSTM
extrapolation using the recursive one-period-ahead updates (8.20)–(8.22), which has
a zig-zag behavior that vanishes over time. This vanishing behavior is critical and is
going to be discussed next.
There is one issue with this recursive one-period-ahead updating algorithm. This
updating algorithm is not fully consistent in how the data is being used. The original
LSTM architecture calibration is based on the feature components $\widehat{\varepsilon}_t^{(p)}$, see (8.20). Since these increments are not known for the later periods $t > t_1$, we replace their unknown values by the predictors, see (8.22). The subtle point here is that the predictors are on the level of expected values, and not on the level of random variables. Thus, $\widehat{Y}_t$ is typically less volatile than $\widehat{\varepsilon}_t^{(p)}$, but in (8.22) we pretend that we can use these predictors as a one-to-one replacement. A more consistent way would be to simulate/bootstrap $\widehat{\varepsilon}_t^{(p)}$ from $\mathcal{N}(\widehat{Y}_t, \widehat{\sigma}^2)$ so that the extrapolation
receives the same volatility as the original process. For simplicity we refrain from
doing so, but Fig. 8.11 indicates that this would be a necessary step because the
volatility in the orange curve is going to vanish after the calendar year 2003, i.e., the
zig-zag behavior vanishes, which is clearly not appropriate.
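The consistency issue just described can be sketched as follows: feeding conditional means back into the lookback window shrinks the volatility of the extrapolated path, whereas re-injecting Gaussian noise would preserve it. A hypothetical Python illustration where a dummy predictor stands in for the fitted LSTM (all names are our own):

```python
import numpy as np

def recursive_forecast(eps_hist, predict, horizon, sigma=0.0, rng=None):
    """One-period-ahead recursion (8.22): the lookback window is filled
    with predictors; for sigma > 0 we instead bootstrap eps ~ N(Y_hat,
    sigma^2), so the forecast keeps the volatility of the process."""
    window = list(eps_hist[-3:])            # lookback tau = 3
    path = []
    for _ in range(horizon):
        y_hat = predict(np.array(window))   # conditional mean forecast
        eps = y_hat if sigma == 0.0 else rng.normal(y_hat, sigma)
        window = window[1:] + [eps]         # slide the window, see (8.22)
        path.append(eps)
    return np.array(path)

rng = np.random.default_rng(5)
eps_hist = rng.normal(-1.0, 0.8, size=54)
predict = lambda w: w.mean()                # dummy stand-in for the LSTM
mean_path = recursive_forecast(eps_hist, predict, 15)
boot_path = recursive_forecast(eps_hist, predict, 15, sigma=0.8, rng=rng)
print(mean_path.shape, boot_path.shape)
```

In this sketch the mean-fed path converges and its zig-zag behavior dies out, while the bootstrapped path typically retains a volatility of the order of $\sigma$, mirroring the discussion above.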
The LSTM extrapolation of $(\widehat{k}_t)_t$ is shown in Fig. 8.12. We observe quite some
similarity to the random walk with drift extrapolation in Fig. 8.8, and, indeed, the
random walk with drift seems to work very well (though the auto-correlation has not
been specified correctly). Note that Fig. 8.8 is based on the individual extrapolations
in p, whereas in Fig. 8.12 we have a common model for all populations.
Table 8.1 shows how often one model outperforms the other one (out-of-sample
on calendar years 2004 ≤ t ≤ 2018 and per gender). On the male populations of
Table 8.1 Comparison of the out-of-sample mean squared error losses for the calendar years
2004 ≤ t ≤ 2018: the numbers show how often one approach outperforms the other one on
each gender
Female Male
Random walk with drift 5/8 4/8
LSTM architecture 3/8 4/8
the 8 European countries both models outperform the other one 4 times, whereas
for the female population the random walk with drift gives 5 times the better out-of-sample prediction. Of course, this seems disappointing for the LSTM approach. However, this observation is quite common: deep learning approaches outperform the classical methods on complex problems, whereas on simple problems, such as the one here, we should go for a classical (simpler) model like a random walk with drift or an ARIMA model.
The previous section has been relying on the LC mortality model and only the extrapolation of the time-series $(\widehat{k}_t)_t$ has been based on a RN network architecture.
In this section we aim at directly processing the raw mortality rates Mx,t =
Dx,t /ex,t through a network, thus, we perform the representation learning directly
on the raw data. We therefore use a simplified version of the network architecture
proposed in Perla et al. [301].
As input to the network we use the raw mortality rates Mx,t . We choose a
lookback period of τ = 5 years and we define the time-series feature information to
forecast the mortality in calendar year t by
$$\boldsymbol{x}_{t-\tau:t-1} = (\boldsymbol{x}_{t-\tau}, \ldots, \boldsymbol{x}_{t-1}) = \left(M_{x,s}\right)_{x_0\le x\le x_1,\; t-\tau\le s\le t-1} \in \mathbb{R}^{(x_1-x_0+1)\times\tau} = \mathbb{R}^{100\times 5}. \qquad (8.23)$$
Thus, we directly process the raw mortality rates (simultaneously for all ages x)
through the network architecture; in the corresponding R code we need to input the
transposed features $\boldsymbol{x}_{t-\tau:t-1}^\top \in \mathbb{R}^{5\times 100}$, see line 1 of Listing 8.2.
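Constructing the time-series features (8.23) amounts to slicing lookback windows out of the raw mortality matrix; a hypothetical Python sketch with dummy rates (the actual data preparation is done in R, and all names here are our own):

```python
import numpy as np

rng = np.random.default_rng(6)
ages, years = 100, 54                             # x = 0..99, t = 1950..2003
M = rng.uniform(1e-4, 1e-1, size=(ages, years))   # dummy raw rates M_{x,t}

def ts_feature(M, t, tau=5):
    """Feature (8.23) for one calendar year index t:
    x_{t-tau:t-1} = (M_{x,s})_{x, t-tau <= s <= t-1}, shape (100, tau)."""
    return M[:, t - tau:t]

x = ts_feature(M, t=10)
print(x.shape, x.T.shape)   # the transpose (5, 100) is what the R code expects
```

Sliding $t$ over the available calendar years then produces one such feature matrix per forecast year and population.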
This LSTM recursion to process the time-series $\boldsymbol{x}_{t-\tau:t-1}$ gives us the LSTM output $z_{t-1}^{[\mathrm{LSTM}]} \in \mathbb{R}^{q_1}$, see lines 14–15 of Listing 8.2. It involves $4(q_0+1+q_1)q_1 = 4(100+1+20)\,20 = 9{,}680$ network parameters for the input dimension $q_0 = 100$
Fig. 8.13 LSTM architecture used to process the raw mortality rates $(M_{x,t})_{x,t}$
and the output dimension q1 = 20. Many statisticians would probably stop at this
point with this approach, as it seems highly over-parametrized. Let’s see what we
get.
For the categorical country code c and the binary gender g we choose two one-
dimensional embeddings, see (7.31),
This decoding involves another $(1+22)\,100 = 2{,}300$ parameters $(\beta_x^0, \beta_x^G, \beta_x^C, \boldsymbol{\beta}_x)_{0\le x\le 99}$. Thus, altogether this LSTM network has $r = 11{,}990$ parameters.
Summarizing: the above architecture follows the philosophy of the auto-encoder of Sect. 7.5. A high-dimensional observation $(\boldsymbol{x}_{t-\tau:t-1}, c, g)$ is encoded to a low-dimensional bottleneck activation $z_{t-1} \in \mathbb{R}^{22}$, which is then decoded by (8.25) to give the forecast $(\widehat{Y}_{x,t})_{0\le x\le 99}$ for the log-mortality rates. It is not precisely an auto-encoder because the response is different from the input, as we forecast the log-mortality rates in the next calendar year $t$ based on the information $z_{t-1}$ that
Listing 8.2 LSTM architecture to directly process the raw mortality rates $(M_{x,t})_{x,t}$
1 TS = layer_input(shape=c(lookback,100), dtype='float32', name='TS')
2 Country = layer_input(shape=c(1), dtype='int32', name='Country')
3 Gender = layer_input(shape=c(1), dtype='int32', name='Gender')
4 Time = layer_input(shape=c(1), dtype='float32', name='Time')
5 #
6 CountryEmb = Country %>%
7 layer_embedding(input_dim=8,output_dim=1,input_length=1,name='CountryEmb') %>%
8 layer_flatten(name='Country_flat')
9 #
10 GenderEmb = Gender %>%
11 layer_embedding(input_dim=2,output_dim=1,input_length=1,name='GenderEmb') %>%
12 layer_flatten(name='Gender_flat')
13 #
14 LSTM = TS %>%
15 layer_lstm(units=20,activation='linear',recurrent_activation='sigmoid',
16 name='LSTM')
17 #
18 Output = list(LSTM,CountryEmb,GenderEmb) %>% layer_concatenate() %>%
19 layer_dense(units=100, activation='linear', name='scalarproduct') %>%
20 layer_reshape(c(1,100), name = 'Output')
21 #
22 model = keras_model(inputs = list(TS, Country, Gender),
23 outputs = c(Output))
Table 8.2 Comparison of the out-of-sample mean squared losses for the calendar years $2004 \le t \le 2018$; the figures are in $10^{-4}$
LC female LSTM female LC male LSTM male
Austria AUT 0.765 0.312 2.527 1.169
Belgium BE 0.371 0.311 2.835 0.960
Switzerland CH 0.654 0.478 1.609 1.134
Spain ESP 1.446 0.514 1.742 0.245
France FRA 0.175 1.684 0.333 0.363
Italy ITA 0.179 0.330 0.874 0.320
The Netherlands NL 0.426 0.315 1.978 0.601
Portugal POR 2.097 0.464 1.848 1.239
Sect. 7.4.4. The out-of-sample prediction results on the calendar years 2004 to 2018, i.e., $t > t_1 = 2003$, are presented in Table 8.2. These results verify the appropriateness of this LSTM approach. It outperforms the LC model on the female population in 6 out of 8 cases and on the male population in 7 out of 8 cases; only for the French population this LSTM approach seems to have some difficulties
(compared to the LC model). Note that these are out-of-sample figures because
the LSTM has only been fitted on the data prior to 2004. Moreover, we did not
pre-process the raw mortality rates $M_{x,t}$, $t \le 2003$, and the prediction is done recursively in a one-period-ahead prediction approach; we also refer to (8.22). A more detailed analysis of the results shows that the LC and the LSTM approaches have a rather similar behavior for females. For males the LSTM prediction clearly outperforms the LC model prediction; this out-performance holds across different ages $x$ and different calendar years $t \ge 2004$.
The advantage of this LSTM approach is that we can directly predict by
processing the raw data. The disadvantage compared to the LC approach is that the
LSTM network approach is more complex and more time-consuming. Moreover,
unlike in the LC approach, we cannot (easily) assess the prediction uncertainty.
In the LC approach the prediction uncertainty is obtained from assessing the
uncertainty in the extrapolation and the uncertainty in the parameter estimates, e.g.,
using a bootstrap. The LSTM approach is not sufficiently robust (at least not on our
data) to provide any reasonable uncertainty estimates.
We close this section and example by analyzing the functional form of the
decoder (8.25). We observe that this decoder has much similarity with the LC model
assumption (7.63)
$$\widehat{Y}_{x,t} = \beta_x^0 + \beta_x^C\, e^C(c) + \beta_x^G\, e^G(g) + \left\langle \boldsymbol{\beta}_x, z_{t-1}^{[\mathrm{LSTM}]} \right\rangle,$$
The LC model considers the average force of mortality $a_x^{(p)} \in \mathbb{R}$ for each population $p = (c, g)$ and each age $x$; the LSTM architecture has the same term $\beta_x^0 + \beta_x^C e^C(c)\,+$
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons licence and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons licence, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons licence and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 9
Convolutional Neural Networks
We start from an input tensor $\boldsymbol{z} \in \mathbb{R}^{q^{(1)}\times\cdots\times q^{(K)}}$ that has dimension $q^{(1)}\times\cdots\times q^{(K)}$. For instance, a black and white image can be represented by a matrix (a tensor of order 2) of dimension $q^{(1)}\times q^{(2)}$ with the matrix entries $z_{i_1,i_2} \in \mathbb{R}$ describing the intensities of the gray scale in the corresponding pixels $(i_1, i_2)$. A color image typically has the three color channels Red, Green and Blue (RGB), and such a RGB image can be represented by a tensor $\boldsymbol{z} \in \mathbb{R}^{q^{(1)}\times q^{(2)}\times q^{(3)}}$ of order 3 with $q^{(1)}\times q^{(2)}$ being the dimension of the image and $q^{(3)} = 3$ describing the three color channels, i.e., $(z_{i_1,i_2,1}, z_{i_1,i_2,2}, z_{i_1,i_2,3}) \in \mathbb{R}^3$ describes the intensities of the colors RGB in the pixel $(i_1, i_2)$.
Typically, the structure of black and white images and RGB images is unified by representing the black and white picture by a tensor $\boldsymbol{z} \in \mathbb{R}^{q^{(1)}\times q^{(2)}\times q^{(3)}}$ of order 3 with a single channel $q^{(3)} = 1$. This philosophy is going to be used throughout this chapter. Namely, if we consider a tensor $\boldsymbol{z} \in \mathbb{R}^{q^{(1)}\times\cdots\times q^{(K-1)}\times q^{(K)}}$ of order $K$, the first $K-1$ components $(i_1,\ldots,i_{K-1})$ will play the role of the spatial components that have a natural topology, and the last components $1 \le i_K \le q^{(K)}$ are called the channels reflecting, e.g., a gray scale (for $q^{(K)} = 1$) or the RGB intensities (for $q^{(K)} = 3$).
In Sect. 9.1.3, below, we will also study time-series data where we have 2nd order tensors (matrices). The first component reflects time $1 \le t \le q^{(1)}$, i.e., the spatial component is temporal for time-series data, and the second component (channels) describes the different elements $\boldsymbol{z}_t = (z_{t,1},\ldots,z_{t,q^{(2)}}) \in \mathbb{R}^{q^{(2)}}$ that are
$$q_m^{(k)} \stackrel{\text{def.}}{=} q_{m-1}^{(k)} - f_m^{(k)} + 1, \qquad (9.1)$$
for $1 \le k \le K$. Thus, the size of the image is reduced by the window size of the filter. In particular, the output dimension of the channels component $k = K$ is $q_m^{(K)} = 1$, i.e., all channels are compressed to a scalar output. The spatial components $1 \le k \le K-1$ retain their spatial structure but the dimension is reduced according to (9.1).
A CN operation is a mapping (note that the order of the tensor is reduced from $K$ to $K-1$ because the channels are compressed; index $j$ is going to be explained later)

$$\boldsymbol{z}_j^{(m)}: \mathbb{R}^{q_{m-1}^{(1)}\times\cdots\times q_{m-1}^{(K)}} \to \mathbb{R}^{q_m^{(1)}\times\cdots\times q_m^{(K-1)}}, \qquad (9.2)$$
$$\boldsymbol{z} \mapsto \boldsymbol{z}_j^{(m)}(\boldsymbol{z}) = \left(z_{i_1,\ldots,i_{K-1};j}^{(m)}(\boldsymbol{z})\right)_{1\le i_k\le q_m^{(k)};\; 1\le k\le K-1},$$
for given intercept $w_{0,j}^{(m)} \in \mathbb{R}$ and filter weights

$$W_j^{(m)} = \left(w_{l_1,\ldots,l_K;j}^{(m)}\right)_{1\le l_k\le f_m^{(k)};\; 1\le k\le K} \in \mathbb{R}^{f_m^{(1)}\times\cdots\times f_m^{(K)}}; \qquad (9.4)$$

the network parameter has dimension $r_m = 1 + \prod_{k=1}^{K} f_m^{(k)}$.
At first sight this CN operation looks quite complicated. Let us give some
remarks that allow for a better understanding and a more compact notation. The
operation in (9.3) chooses the corner (i1 , . . . , iK−1 , 1) as base point, and then it
reads the tensor elements in the (discrete) window
$$(i_1, \ldots, i_{K-1}, 1) + \left[0:f_m^{(1)}-1\right] \times \cdots \times \left[0:f_m^{(K-1)}-1\right] \times \left[0:f_m^{(K)}-1\right], \qquad (9.5)$$
with given filter weights $W_j^{(m)}$. This window is then moved across the entire tensor $\boldsymbol{z}$ by changing the base point $(i_1,\ldots,i_{K-1},1)$ accordingly, but with fixed filter weights $W_j^{(m)}$. This operation resembles a convolution; however, in (9.3) the indices in $z_{i_1+l_1-1,\ldots,i_{K-1}+l_{K-1}-1,\,l_K}$ run in reverse direction compared to a classical
$$\boldsymbol{z} \mapsto \boldsymbol{z}_j^{(m)}(\boldsymbol{z}) = \phi\left(w_{0,j}^{(m)} + W_j^{(m)} * \boldsymbol{z}\right), \qquad (9.6)$$

having the activations for $1 \le i_k \le q_m^{(k)}$, $1 \le k \le K-1$,

$$\phi\left(w_{0,j}^{(m)} + W_j^{(m)} * \boldsymbol{z}\right)_{i_1,\ldots,i_{K-1}} = z_{i_1,\ldots,i_{K-1};j}^{(m)}(\boldsymbol{z}),$$
Remarks 9.1
• The beauty of this notation is that we can now see the analogy to the FN layer. Namely, (9.6) exactly plays the role of a FN neuron (7.6), but the CN operation $w_{0,j}^{(m)} + W_j^{(m)} * \boldsymbol{z}$ replaces the inner product $\langle \boldsymbol{w}_j^{(m)}, \boldsymbol{z}\rangle$, correspondingly accounting for the intercept.
• A FN neuron (7.6) can be seen as a special case of the CN operation (9.6). Namely, if we have a tensor of order $K = 1$, the input tensor (vector) reads as $\boldsymbol{z} \in \mathbb{R}^{q_{m-1}^{(1)}}$. That is, we do not have a spatial component, but only $q_{m-1}^{(1)} = q_{m-1}$ channels. In that case we have $W_j^{(m)} * \boldsymbol{z} = \langle W_j^{(m)}, \boldsymbol{z}\rangle$ for the filter weights $W_j^{(m)} \in \mathbb{R}^{q_{m-1}^{(1)}}$, and where we assume that $\boldsymbol{z}$ does not include an intercept component. Thus, the CN operation boils down to a FN neuron in the case of a tensor of order 1.
• In the CN operation we take advantage of having a spatial structure in the tensor $\boldsymbol{z}$, which is not the case in the FN operation. The CN operation takes a spatial input of dimension $\prod_{k=1}^{K} q_{m-1}^{(k)}$ and it maps this input to a spatial object of dimension $\prod_{k=1}^{K-1} q_m^{(k)}$. For this it uses $r_m = 1 + \prod_{k=1}^{K} f_m^{(k)}$ filter weights. The FN operation takes an input of dimension $q_{m-1}$ and it maps it to a 1-dimensional neuron activation; for this it uses $1 + q_{m-1}$ parameters. If we identify the input dimensions $q_{m-1} = \prod_{k=1}^{K} q_{m-1}^{(k)}$, we can observe that $r_m \ll 1 + q_{m-1}$ because, typically, the filter sizes $f_m^{(k)} \ll q_{m-1}^{(k)}$, for $1 \le k \le K-1$. Thus, the CN operation uses much fewer parameters as the filters only act locally through the $*$-operation by translating the filter window (9.5).
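The CN operation (9.6) for an order-3 tensor (an image with channels) can be sketched with plain numpy. This is the cross-correlation form of (9.3) (window read in forward direction) with the channel dimension fully compressed; all names are illustrative:

```python
import numpy as np

def cn_operation(z, W, w0, phi=np.tanh):
    """CN operation (9.6) for an order-3 input z of shape (q1, q2, q3) and
    a filter W of shape (f1, f2, q3): slide the window over the two
    spatial dimensions, compress the channels, add the intercept w0 and
    apply phi. Output shape (q1 - f1 + 1, q2 - f2 + 1), see (9.1)."""
    q1, q2, q3 = z.shape
    f1, f2, f3 = W.shape
    assert f3 == q3, "the filter spans all channels"
    out = np.empty((q1 - f1 + 1, q2 - f2 + 1))
    for i1 in range(out.shape[0]):
        for i2 in range(out.shape[1]):
            window = z[i1:i1 + f1, i2:i2 + f2, :]   # window (9.5)
            out[i1, i2] = np.sum(W * window)        # W * z at base point (i1, i2)
    return phi(w0 + out)

rng = np.random.default_rng(7)
z = rng.normal(size=(10, 8, 3))    # small "RGB image"
W = rng.normal(size=(3, 3, 3))     # 3 x 3 filter window over 3 channels
out = cn_operation(z, W, w0=0.1)
print(out.shape)                   # (8, 6): spatial size reduced, channels compressed
```

The same filter weights are reused at every base point, which is exactly the local weight sharing that makes $r_m$ so much smaller than the FN parameter count.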
This understanding now allows us to define a CN layer. Note that the mappings (9.6) have a lower index $j$ which indicates that this is one single projection
9.1 Plain-Vanilla Convolutional Neural Network Layer 411
(filter extraction), called a filter. By choosing multiple different filters $(w_{0,j}^{(m)}, W_j^{(m)})$, we can define the CN layer as follows.
Choose $q_m^{(K)} \in \mathbb{N}$ filters, each having an $r_m$-dimensional filter weight $(w_{0,j}^{(m)}, W_j^{(m)})$, $1 \le j \le q_m^{(K)}$. A CN layer is a mapping

$$\boldsymbol{z}^{(m)}: \mathbb{R}^{q_{m-1}^{(1)}\times\cdots\times q_{m-1}^{(K)}} \to \mathbb{R}^{q_m^{(1)}\times\cdots\times q_m^{(K)}}, \qquad \boldsymbol{z} \mapsto \boldsymbol{z}^{(m)}(\boldsymbol{z}) = \left(\boldsymbol{z}_1^{(m)}(\boldsymbol{z}), \ldots, \boldsymbol{z}_{q_m^{(K)}}^{(m)}(\boldsymbol{z})\right), \qquad (9.7)$$

with filters $\boldsymbol{z}_j^{(m)}(\boldsymbol{z}) \in \mathbb{R}^{q_m^{(1)}\times\cdots\times q_m^{(K-1)}}$, $1 \le j \le q_m^{(K)}$, given by (9.6).
A CN layer (9.7) converts the $q_{m-1}^{(K)}$ input channels to $q_m^{(K)}$ output filters by preserving the spatial structure on the first $K-1$ components of the input tensor $\boldsymbol{z}$.
More mathematically, CN layers and networks have been studied, among others,
by Zhang et al. [403, 404], Mallat [263] and Wiatowski–Bölcskei [382]. These
authors prove that CN networks have certain translation invariance properties
and deformation stability. This exactly explains why these networks allow one to
recognize similar objects at different locations in the input tensor. Basically, by
translating the filter windows (9.5) across the tensor, we try to extract the local
structure from the tensor that provides similar signals in different locations of that
tensor. Thinking of an image where we try to recognize, say, a dog, such a dog can
be located at different sites in the image, and a filter (window) that moves across
that image tries to locate the dogs in the image.
A CN layer (9.7) defines one layer indexed by the upper index $^{(m)}$, and for deep representation learning we now have to compose multiple of these CN layers, but we can also compose CN layers with FN layers or RN layers. Before doing so, we need to introduce some special purpose layers and tools that are useful for CN network modeling; this is done in Sect. 9.2, below.
Most CN network examples are based on time-series data or images. The former has a 1-dimensional temporal component, and the latter has a 2-dimensional spatial component. Thus, these two examples give us tensors of orders $K = 2$ and $K = 3$, respectively. We briefly discuss such examples as specific applications of tensors of a general order $K \ge 2$.
For a time-series analysis we often have observations $\boldsymbol{x}_t \in \mathbb{R}^{q_0}$ for the time points $0 \le t \le T$. Bringing this time-series data into a tensor form gives us

$$\boldsymbol{x} = \boldsymbol{x}_{0:T} = (\boldsymbol{x}_0, \ldots, \boldsymbol{x}_T)^\top \in \mathbb{R}^{(T+1)\times q_0} = \mathbb{R}^{q^{(1)}\times q^{(2)}},$$
Image Recognition

Choose a window size of $f_1^{(1)} \times f_1^{(2)}$ and $q_1 \in \mathbb{N}$ filters to receive the CN layer

$$\boldsymbol{z}^{(1)}: \mathbb{R}^{I\times J\times 3} \to \mathbb{R}^{(I-f_1^{(1)}+1)\times(J-f_1^{(2)}+1)\times q_1}, \qquad (9.9)$$
$$(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3) \mapsto \boldsymbol{z}^{(1)}(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3) = \left(\boldsymbol{z}_1^{(1)}(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3), \ldots, \boldsymbol{z}_{q_1}^{(1)}(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3)\right),$$

with filters $\boldsymbol{z}_j^{(1)}(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3) \in \mathbb{R}^{(I-f_1^{(1)}+1)\times(J-f_1^{(2)}+1)}$, $1 \le j \le q_1$. Thus, we compress the 3 channels in each filter $j$, but we preserve the spatial structure of the image (by the convolution operation $*$).
For black and white pictures which only have one color channel, we preserve the
spatial structure of the picture, and we modify the input tensor to a tensor of order 3
and of the form
x = (x_1) ∈ R^{I×J×1}.
We have seen that the CN operation reduces the size of the output by the filter sizes,
see (9.1). Thus, if we start from an image of size 100 × 50 × 1, and if the filter sizes
are given by f_1^{(1)} = f_1^{(2)} = 9, then the output will be of dimension 92 × 42 × q_1,
see (9.9). Sometimes, this reduction in dimension is impractical, and padding helps
to keep the original shape. Padding a tensor z with p_m^{(k)} parameters, 1 ≤ k ≤ K − 1,
means that the tensor is extended in all K − 1 spatial directions by (typically) adding
zeros of that size, so that the padded tensor has dimension

(p_m^{(1)} + q_{m−1}^{(1)} + p_m^{(1)}) × · · · × (p_m^{(K−1)} + q_{m−1}^{(K−1)} + p_m^{(K−1)}) × q_{m−1}^{(K)}.
This implies that the output filters will have the dimensions

q_m^{(k)} = q_{m−1}^{(k)} + 2p_m^{(k)} − f_m^{(k)} + 1,    for 1 ≤ k ≤ K − 1.
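The effect of this size formula can be checked with a small Python sketch; the book's code examples use R keras, and the helper name here is ours, purely for illustration:

```python
# A small sanity check of the padding formula q_m = q_{m-1} + 2 p_m - f_m + 1;
# the helper name is ours, not from the book.
def cn_output_size(q_prev, f, p=0):
    """Spatial output size for input size q_prev, filter size f, zero-padding p."""
    return q_prev + 2 * p - f + 1

# The 100 x 50 x 1 image with 9 x 9 filters from the text:
assert cn_output_size(100, 9) == 92
assert cn_output_size(50, 9) == 42

# Padding with p = 4 zeros on each side keeps the original 100 x 50 shape:
assert cn_output_size(100, 9, p=4) == 100
assert cn_output_size(50, 9, p=4) == 50
```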
9.2.2 Stride
Strides are used to skip part of the input tensor z in order to reduce the size of the
output. This may be useful if the input tensor is a very high resolution image. Choose
the stride parameters s_m^{(k)}, 1 ≤ k ≤ K − 1. We can then replace the summation
in (9.3) by the following term

Σ_{l_1=1}^{f_m^{(1)}} · · · Σ_{l_K=1}^{f_m^{(K)}} w_{l_1,...,l_K;j}^{(m)} z_{s_m^{(1)}(i_1−1)+l_1, ..., s_m^{(K−1)}(i_{K−1}−1)+l_{K−1}, l_K}.
This only extracts the tensor entries on a discrete grid of the tensor by translating
the window by multiples of integers, see also (9.5),
(s_m^{(1)}(i_1 − 1), . . . , s_m^{(K−1)}(i_{K−1} − 1), 1) + [1 : f_m^{(1)}] × · · · × [1 : f_m^{(K−1)}] × [0 : f_m^{(K)} − 1],
and the size of the output is reduced correspondingly. If we choose strides s_m^{(k)} =
f_m^{(k)}, 1 ≤ k ≤ K − 1, we receive a partition of the spatial part of the input tensor z;
this is going to be used in the max-pooling layer (9.11).
9.2.3 Dilation
Dilation is similar to stride, though different in that it enlarges the filter sizes instead
of skipping certain positions in the input tensor. Choose the dilation parameters e_m^{(k)},
1 ≤ k ≤ K − 1. We can then replace the summation in (9.3) by the following term

Σ_{l_1=1}^{f_m^{(1)}} · · · Σ_{l_K=1}^{f_m^{(K)}} w_{l_1,...,l_K;j}^{(m)} z_{i_1+e_m^{(1)}(l_1−1), ..., i_{K−1}+e_m^{(K−1)}(l_{K−1}−1), l_K}.
This applies the filter weights to the tensor entries on the discrete grids

(i_1, . . . , i_{K−1}, 1) + e_m^{(1)}[0 : f_m^{(1)} − 1] × · · · × e_m^{(K−1)}[0 : f_m^{(K−1)} − 1] × [0 : f_m^{(K)} − 1],

where the intervals e_m^{(k)}[0 : f_m^{(k)} − 1] run over grids of span sizes e_m^{(k)}, 1 ≤ k ≤
K − 1. Thus, when smoothing images we do not read all the pixels but only
every e_m^{(k)}-th pixel in the window. This also reduces the size of the output tensor.
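Padding, stride and dilation jointly determine the output size. As a hedged sketch, the following Python helper uses the combined size formula found in most deep-learning frameworks; the text introduces each tool separately, so this combined helper and its name are our own assumptions:

```python
# A hedged sketch of the combined size formula used by most deep-learning
# frameworks: padding p, stride s and dilation e in one spatial direction.
# The text introduces each tool separately; this combined helper is ours.
def cn_spatial_size(q_prev, f, p=0, s=1, e=1):
    effective_f = e * (f - 1) + 1      # span of the dilated filter window
    return (q_prev + 2 * p - effective_f) // s + 1

assert cn_spatial_size(100, 9) == 92       # plain convolution: q - f + 1
assert cn_spatial_size(100, 9, s=3) == 31  # stride 3 reads every 3rd position
assert cn_spatial_size(100, 9, e=2) == 84  # dilation 2 reads every 2nd pixel
```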
9.2 Special Purpose Tools for Convolutional Neural Networks 415
As we have seen above, the dimension of the tensor is reduced by the filter
size in each spatial direction if we do not apply padding with zeros. In general,
deep representation learning follows the paradigm of auto-encoding by reducing a
high-dimensional input to a low-dimensional representation. In CN networks this
is usually (efficiently) done by so-called pooling layers. In spirit, pooling layers
work similarly to CN layers (having a fixed window size), but we do not apply a
convolution operation ∗, but rather a maximum operation to the window to extract
the dominant tensor elements.
We choose a fixed window size (f_m^{(1)}, . . . , f_m^{(K−1)}) ∈ N^{K−1} and strides s_m^{(k)} =
f_m^{(k)}, 1 ≤ k ≤ K − 1, for the spatial components of the tensor z of order K. A
max-pooling layer is given by
Alternatively, the floors in (9.11) could be replaced by ceilings and padding with
zeros to receive the right cardinality. This extracts the maximums from the (spatial)
windows
(f_m^{(1)}(i_1 − 1), . . . , f_m^{(K−1)}(i_{K−1} − 1), i_K) + [1 : f_m^{(1)}] × · · · × [1 : f_m^{(K−1)}] × [0]
= [f_m^{(1)}(i_1 − 1) + 1 : f_m^{(1)} i_1] × · · · × [f_m^{(K−1)}(i_{K−1} − 1) + 1 : f_m^{(K−1)} i_{K−1}] × [i_K],

for each channel 1 ≤ i_K ≤ q_{m−1}^{(K)} individually. Thus, the max-pooling operator is
chosen such that it extracts the maximum of each channel and each window, the
windows providing a partition of the spatial part of the tensor. This reduces the
dimension of the tensor according to (9.11), e.g., if we consider a tensor of order 3
of an RGB image of dimension I × J = 180 × 50 and apply a max-pooling layer
with window sizes fm(1) = 10 and fm(2) = 5, we receive a dimension reduction
180 × 50 × 3 → 18 × 10 × 3.
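This max-pooling partition can be sketched in a few lines of pure Python; this is toy code for one channel, not the book's R implementation:

```python
# A minimal pure-Python sketch (not the book's R code) of 2-d max-pooling on
# one channel: window sizes (f1, f2), strides equal to the window sizes, so
# that the windows partition the spatial part of the input.
def max_pool_2d(x, f1, f2):
    I, J = len(x), len(x[0])
    return [[max(x[i * f1 + a][j * f2 + b]
                 for a in range(f1) for b in range(f2))
             for j in range(J // f2)]
            for i in range(I // f1)]

# A 180 x 50 channel with windows 10 x 5 is reduced to 18 x 10, as in the text:
x = [[float(i + j) for j in range(50)] for i in range(180)]
pooled = max_pool_2d(x, 10, 5)
assert (len(pooled), len(pooled[0])) == (18, 10)
assert pooled[0][0] == max(x[i][j] for i in range(10) for j in range(5))
```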
with q_m = ∏_{k=1}^{K} q_{m−1}^{(k)}. We have already used flatten layers after embedding layers
on lines 8 and 11 of Listing 7.4.
We are now ready to patch everything together. Assume we have RGB images
described by tensors x^{(0)} ∈ R^{I×J×3} of order 3 modeling the three RGB channels
of images of a fixed size I × J. Moreover, we have the tabular feature information
x^{(1)} ∈ X ⊂ {1} × R^q that describes further properties of the data. That is, we have an
input variable (x^{(0)}, x^{(1)}), and we aim at predicting a response variable Y by using
a suitable regression function
(x^{(0)}, x^{(1)}) ↦ μ(x^{(0)}, x^{(1)}) = E[Y | x^{(0)}, x^{(1)}].   (9.13)
We choose two convolutional layers z(CN1) and z(CN2) , each followed by a max-
pooling layer z(Max1) and z(Max2) , respectively. Then we apply a flatten layer z(flatten)
to bring the learned representation into a vector form. These layers are chosen
according to (9.7), (9.10) and (9.12) with matching input and output dimensions
so that the following composition is well-defined
z^{(5:1)} = z^{(flatten)} ◦ z^{(Max2)} ◦ z^{(CN2)} ◦ z^{(Max1)} ◦ z^{(CN1)}: R^{I×J×3} → R^{q_5}.
Listing 9.2 gives the summary of this architecture providing the dimension reduction
mappings (encodings)
for given link function g . This last step can be done in complete analogy to Chap. 7,
and fitting of such a network architecture uses variants of the SGD algorithm.
x_{s,t} = (v_{s,t}, a_{s,t}, Δ_{s,t})^⊤ ∈ [2, 50] km/h × [−3, 3] m/s² × [0, 1/2] ⊂ R³,
9.3 Convolutional Neural Network Architectures 419
where 1 ≤ s ≤ S labels all individual trips of the considered drivers. This data has
been pre-processed by cutting-out the idling phase and the speeds above 50km/h
and concatenating the remaining pieces. We perform this pre-processing since
we do not want to identify the drivers because they have a special idling phase
picture or because they are more likely on the highway. Acceleration has been
censored at ±3m/s2 because we cannot exclude that more extreme observations are
caused by data quality issues (note that the acceleration is calculated from the GPS
coordinates and if the signals are not fully precise it can lead to extreme acceleration
observations). Finally, change in angle is measured in absolute values of sine per
second (censored at 1/2), i.e., we do not distinguish between left and right turns.
This then provides us with three time-series channels giving tensors of order 2
x_s = ((v_{s,1}, a_{s,1}, Δ_{s,1})^⊤, . . . , (v_{s,180}, a_{s,180}, Δ_{s,180})^⊤)^⊤ ∈ R^{180×3},
R^{T×3} → (0, 1)³.
The first CN and pooling layer z(Max1) ◦ z(CN1) maps the dimension 180 × 3 to a
tensor of dimension 58 × 12 using 12 filters; the max-pooling uses the floor (9.11).
The second CN and pooling layer z(Max2) ◦ z(CN2) maps to 18 × 10 using 10 filters,
and the third CN and pooling layer z(Max3) ◦ z(CN3) maps to 1 × 8 using 8 filters.
Actually, this last max-pooling layer is a global max-pooling layer extracting the
maximum in each of the 8 filters. Next, we apply a drop-out layer with a drop-out
Fig. 9.1 First 3 trips of driver A (top), driver B (middle) and driver C (bottom); each trip is 180
seconds, red color shows the acceleration pattern (a_t)_t, black color the change in angle pattern
(Δ_t)_t and blue color the speed pattern (v_t)_t
Table 9.1 Summary of the trips and the choice of learning and test data sets L and T
Driver A Driver B Driver C Total
Number of trips S 261 385 286 932
Learning data L 209 307 228 744
Test data T 52 78 58 188
Average speed v_t 24.8 30.4 30.2 km/h
Average acceleration/braking |a_t| 0.56 0.61 0.74 m/s²
Average change in angle Δ_t 0.065 0.054 0.076 |sin|/s
Listing 9.3 CN network architecture for the individual car trip allocation
For a summary of the network architecture see Listing 9.4. Altogether this involves
1’237 network parameters that need to be fitted.
Listing 9.4 Summary of CN network architecture for the individual car trip allocation
1 Layer (type) Output Shape Param #
2 ===============================================================================
3 conv1d_1 (Conv1D) (None, 176, 12) 192
4 -------------------------------------------------------------------------------
5 max_pooling1d_1 (MaxPooling1D) (None, 58, 12) 0
6 -------------------------------------------------------------------------------
7 conv1d_2 (Conv1D) (None, 54, 10) 610
8 -------------------------------------------------------------------------------
9 max_pooling1d_2 (MaxPooling1D) (None, 18, 10) 0
10 -------------------------------------------------------------------------------
11 conv1d_3 (Conv1D) (None, 14, 8) 408
12 -------------------------------------------------------------------------------
13 global_max_pooling1d_1 (GlobalMaxPool (None, 8) 0
14 -------------------------------------------------------------------------------
15 dropout_1 (Dropout) (None, 8) 0
16 -------------------------------------------------------------------------------
17 dense_1 (Dense) (None, 3) 27
18 ===============================================================================
19 Total params: 1,237
20 Trainable params: 1,237
21 Non-trainable params: 0
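The parameter counts in Listing 9.4 can be verified by hand: a Conv1D layer with kernel size f, q_in input channels and q_out filters has (f · q_in + 1) · q_out weights including the biases. A small Python sketch (the helper functions are ours, not from the book):

```python
# Reproducing the parameter counts of Listing 9.4; helper names are
# illustrative, not part of the book's R code.
def conv1d_params(f, q_in, q_out):
    """Weights of a Conv1D layer with kernel size f, incl. one bias per filter."""
    return (f * q_in + 1) * q_out

def dense_params(q_in, q_out):
    """Weights of a dense layer, incl. one bias per output unit."""
    return (q_in + 1) * q_out

counts = [
    conv1d_params(5, 3, 12),    # conv1d_1: 192
    conv1d_params(5, 12, 10),   # conv1d_2: 610
    conv1d_params(5, 10, 8),    # conv1d_3: 408
    dense_params(8, 3),         # dense_1:   27
]
assert counts == [192, 610, 408, 27]
assert sum(counts) == 1237      # total params in Listing 9.4
```

The max-pooling and drop-out layers contribute no trainable weights, which matches the zeros in the listing.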
We choose the 744 trips of the learning data L to train this network to the
classification task, see Table 9.1. We use the multi-class cross-entropy loss function,
see (4.19), with 80% of the learning data L as training data U and the remaining
20% as validation data V to track over-fitting. We retrieve the network with the
smallest validation loss using a callback; we refer to Listing 7.3 for an example.
Since the learning data is comparably small and to reduce randomness, we use the
nagging predictor averaging over 10 different network fits (using different seeds).
Table 9.2 shows the out-of-sample results on the test data T . On average more than
80% of all trips are correctly allocated; a purely random allocation would provide
a success rate of 33%. This shows that this allocation problem can be solved rather
successfully and, indeed, the CN network architecture is able to learn structure in
the telematics trip data x s that allows one to discriminate car drivers. This sounds
very promising. In fact, the telematics car driving data seems to be very transparent
which, of course, also raises privacy issues. On the downside we should mention
that from this approach we cannot really see what the network has learned and how
it manages to distinguish the different trips.
There are several approaches that try to visualize what the network has learned
in the different layers by extracting the filter activations in the CN layers; others
try to invert the network to backtrack which activations and weights contribute
most to a certain output, we mention, e.g., DeepLIFT of Shrikumar et al. [339].
For more analysis and references we refer to Sect. 4 of the tutorial Meier–Wüthrich
[269]. We do not further discuss this and close this example.
We revisit the mortality example of Sect. 8.4.2 where we used a LSTM architecture
to process the raw mortality data for forecasting, see Fig. 8.13. We are going to do
a (small) change to that architecture by simply replacing the LSTM encoder by a
CN network encoder. This approach has been promoted in the literature, e.g., by
Perla et al. [301], Schnürch–Korn [330] and Wang et al. [375]. A main difference
between these references is whether the mortality tensor is considered as a tensor
for a lookback period of τ = 5. The LSTM cell encodes this tensor/matrix into a 20-
dimensional vector which is then concatenated with the embeddings of the country
code and the gender code (8.24). We use the same architecture here, only the LSTM
part is replaced by a CN network in (8.25), the corresponding code is given on lines
14–17 of Listing 9.5.
Listing 9.5 CN network architecture to directly process the raw mortality rates (Mx,t )x,t
1 Tensor = layer_input(shape=c(lookback,100,1), dtype=’float32’, name=’Tensor’)
2 Country = layer_input(shape=c(1), dtype=’int32’, name=’Country’)
3 Gender = layer_input(shape=c(1), dtype=’int32’, name=’Gender’)
4 Time = layer_input(shape=c(1), dtype=’float32’, name=’Time’)
5 #
6 CountryEmb = Country %>%
7 layer_embedding(input_dim=8,output_dim=1,input_length=1,name=’CountryEmb’) %>%
8 layer_flatten(name=’Country_flat’)
9 #
10 GenderEmb = Gender %>%
11 layer_embedding(input_dim=2,output_dim=1,input_length=1,name=’GenderEmb’) %>%
12 layer_flatten(name=’Gender_flat’)
13 #
14 CN = Tensor %>%
15 layer_conv_2d(filter = 10, kernel_size = c(5,5), activation = ’linear’) %>%
16 layer_max_pooling_2d(pool_size = c(1,8)) %>%
17 layer_flatten()
18 #
19 Output = list(CN,CountryEmb,GenderEmb) %>% layer_concatenate() %>%
20 layer_dense(units=100, activation=’linear’, name=’scalarproduct’) %>%
21 layer_reshape(c(1,100), name = ’Output’)
22 #
23 model = keras_model(inputs = list(Tensor, Country, Gender),
24 outputs = c(Output))
Line 15 maps the input tensor 5 × 100 × 1 to a tensor 1 × 96 × 10 having 10 filters, the
max-pooling layer reduces this tensor to 1 × 12 × 10, and the flatten layer encodes
this tensor into a 120-dimensional vector. This vector is then concatenated with the
embedding vectors of the country and the gender codes, and this provides us with
r = 12'570 network parameters. Thus, the LSTM architecture and the CN network
architecture use roughly equally many network parameters that need to be fitted. We
then use the identical partition in training, validation and test data as in Sect. 8.4.2,
i.e., we use the data from 1950 to 2003 for fitting the network architecture, which is
then used to forecast the calendar years 2004 to 2018. The results are presented in
Table 9.3.
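The dimension chain and the parameter count of Listing 9.5 can be checked with a short Python sketch, assuming the standard 'valid' convolution arithmetic; the helper is illustrative and not part of the book's R code:

```python
# A sketch checking the dimension chain and parameter count of Listing 9.5;
# the helper is illustrative, not part of the book's R code.
def conv2d_valid_shape(h, w, f1, f2):
    """Spatial output shape of a 'valid' 2-d convolution with an f1 x f2 kernel."""
    return (h - f1 + 1, w - f2 + 1)

h, w = conv2d_valid_shape(5, 100, 5, 5)      # line 15: 5 x 100 x 1 -> 1 x 96 x 10
assert (h, w) == (1, 96)
h, w = h // 1, w // 8                        # line 16: max-pooling (1, 8) -> 1 x 12
assert (h, w) == (1, 12)
flattened = h * w * 10                       # line 17: flatten with 10 filters
assert flattened == 120

conv_params = (5 * 5 * 1 + 1) * 10           # Conv2D weights incl. biases
emb_params = 8 * 1 + 2 * 1                   # country and gender embeddings
dense_params = (flattened + 1 + 1 + 1) * 100 # 120 + 2 embeddings + bias, 100 units
assert conv_params + emb_params + dense_params == 12570
```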
Table 9.3 Comparison of the out-of-sample mean squared losses for the calendar years 2004 ≤
t ≤ 2018; the figures are in 10−4
Female Male
LC LSTM CN LC LSTM CN
Austria AUT 0.765 0.312 0.635 2.527 1.169 1.569
Belgium BE 0.371 0.311 0.290 2.835 0.960 1.100
Switzerland CH 0.654 0.478 0.772 1.609 1.134 2.035
Spain ESP 1.446 0.514 0.199 1.742 0.245 0.240
France FRA 0.175 1.684 0.309 0.333 0.363 0.770
Italy ITA 0.179 0.330 0.186 0.874 0.320 0.421
The Netherlands NL 0.426 0.315 0.266 1.978 0.601 0.606
Portugal POR 2.097 0.464 0.416 1.848 1.239 1.880
We observe that in our case the CN network architecture provides good results for
the female populations, whereas for the male populations we prefer the LSTM
architecture. At the current stage we rather see this as a proof of concept, because
we have not really fine-tuned the network architectures, nor has the SGD fitting
been perfected; e.g., often bigger architectures are used in combination with drop-
outs, etc. We refrain from doing so here, but refer to the relevant literature Perla
et al. [301], Schnürch–Korn [330] and Wang et al. [375] for a more sophisticated
fine-tuning.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 10
Natural Language Processing
Natural language processing (NLP) is a rapidly growing field that studies lan-
guage, communication and text recognition. The purpose of this chapter is to present
an introduction to NLP. Important milestones in the field of NLP are the work of
Bengio et al. [28, 29] who have introduced the idea of word embedding, the work
of Mikolov et al. [275, 276] who have developed word2vec which is an efficient
word embedding tool, and the work of Pennington et al. [300] and Chaubard et
al. [68] who provide the pre-trained word embedding model GloVe1 and detailed
educational material.2 An excellent overview of the NLP working pipeline is
provided by the tutorial of Ferrario–Nägelin [126]. This overview distinguishes
three approaches: (1) the classical approach using bag-of-words and bag-of-part-
of-speech models to classify text documents; (2) the modern approach using word
embeddings to receive a low-dimensional representation of the dictionary, which
is then further processed; (3) the contemporary approach using a minimal amount
of text pre-processing and directly feeding the raw data to a machine learning algorithm.
We discuss these different approaches and show how they can be used to extract
the relevant information from claim descriptions to predict the claim types and the
claim sizes; in the actuarial literature first papers on this topic have been published
by Lee et al. [236] and Manski et al. [264].
1 https://nlp.stanford.edu/projects/glove/.
2 https://nlp.stanford.edu/teaching/.
are still (tedious) manual work. Our goal here is to present the whole working
pipeline to process language, perform text recognition and text understanding. As
an example we use the claim data described in Sect. 13.3; this data has been made
available through the book project of Frees [135], and it comprises property claims
of governmental institutions in Wisconsin, US. An excerpt of the data is given in
Listing 10.1; our attention focuses on line 11, which provides a (very) short claim
description for every claim.
Listing 10.1 Excerpt of the Wisconsin Local Government Property Insurance Fund (LGPIF) data
set with short claim descriptions on line 11
In a first step we need to pre-process the texts to make them suitable for predictive
modeling. This first step is called tokenization. Essentially, tokenization labels the
words with integers, that is, the used vocabulary is encoded by integers. There are
several issues that one has to deal with in this first step such as upper and lower
case, punctuation, orthographic errors and differences, abbreviations, etc. Different
treatments of these issues will lead to different results, for more on this topic we
refer to Sect. 1 in Ferrario–Nägelin [126]. We simply use the standard routine
offered in R keras [77] called text_tokenizer() with its standard settings.
1 library(keras)
2
3 ## initialize tokenizer and fit
4 tokenizer <- text_tokenizer() %>% fit_text_tokenizer(dat$Description)
5
6 ## number of tokens/words
7 length(tokenizer$word_index)
8
9 ## frequency of word appearances in each text
10 freq.text <- texts_to_matrix(tokenizer, dat$Description, mode = "count")
The R code in Listing 10.2 shows the crucial steps in tokenization. Line 4 extracts
the relevant vocabulary from all available claim descriptions. In total the 5’424 claim
10.1 Feature Pre-processing and Bag-of-Words 427
descriptions of Listing 10.1 use W = 2'237 different words. This double counts
different spellings, e.g., ‘color’ vs. ‘colour’.
Figure 10.1 shows the most frequently used words in the claim descriptions of
Listing 10.1. These are (in this order): ‘at’, ‘damage’, ‘damaged’, ‘vandalism’,
‘lightning’, ‘to’, ‘water’, ‘glass’, ‘park’, ‘fire’, ‘hs’, ‘wind’, ‘light’, ‘door’, ‘es’,
‘and’, ‘of’, ‘vehicle’, ‘pole’ and ‘power’. We observe that many of these words
are directly related to insurance claims, such as ‘damage’ and ‘vandalism’, others
are frequent stopwords like ‘at’ and ‘to’, and then there are abbreviations like ‘hs’
and ‘es’ standing for high school and elementary school.
The next step is to assign the (integer) labels 1 ≤ w ≤ W from the tokenization
to the words in the texts. The maximal length over all texts/sentences is T = 11
words. This step and padding the sentences with zeros to equal length T is presented
on lines 1–7 of Listing 10.3. Lines 11 and 14 of this listing give two explicit text
examples
The label 0 is used for padding shorter texts to the common length T = 11. The
method of bag-of-words embeds text = (w_1, . . . , w_T)^⊤ into N_0^W:

ψ: W_0^T → N_0^W,   text ↦ ψ(text) = ( Σ_{t=1}^{T} 1_{{w_t = w}} )_{w∈W}.   (10.1)

The bag-of-words ψ(text) counts how often each word w ∈ W appears in a given
text = (w_1, . . . , w_T)^⊤; the corresponding code is given on line 10 of Listing 10.2.
The bag-of-words mapping ψ is not injective as the order of occurrence of the
words gets lost and, thus, the semantics of the sentence also gets lost. E.g., the
following two sentences have the same bag-of-words: ‘The claim is expensive.’
and ‘Is the claim expensive?’. This is the reason for calling it a “bag of words”
(which is unordered). This bag-of-words encoding resembles one-hot encoding,
namely, if every text consists of a single word (T = 1), then we receive the one-hot
encoding with W describing the number of different levels, see (7.28). The bag-of-
words ψ(text) ∈ N_0^W can directly be used as an input to a regression model. The
disadvantage of this approach is that the input typically is high-dimensional (and
likely sparse), and it is recommended that only the frequent words are considered.
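A minimal Python sketch of the bag-of-words map (10.1) on toy data; the vocabulary and the two example sentences are hypothetical, not the LGPIF data:

```python
# A minimal Python sketch of the bag-of-words map (10.1); toy vocabulary and
# toy texts, not the Wisconsin LGPIF data.
def bag_of_words(text, vocab_size):
    """Count how often each word label 1..W appears; label 0 (padding) is ignored."""
    counts = [0] * vocab_size
    for w in text:
        if w > 0:
            counts[w - 1] += 1
    return counts

W = 5
s1 = [3, 1, 2, 4, 0, 0]   # 'the claim is expensive' (padded to T = 6)
s2 = [2, 3, 1, 4, 0, 0]   # 'is the claim expensive' (same words, new order)
assert bag_of_words(s1, W) == bag_of_words(s2, W) == [1, 1, 1, 1, 0]
# The map is not injective: word order (and hence semantics) is lost.
```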
1 library(textstem)
2 library(tm)
3
4 text.clean <- removeWords(dat$Description, stopwords("english"))
5 text.clean <- lemmatize_strings(text.clean, dictionary = lexicon::hash_lemmas)
as nouns, adjectives, adverbs, etc., in the corresponding sentences. We then call the
resulting encoding bag-of-POS. We refrain from doing this because we will present
more sophisticated methods in the next sections.
e: W → R^b,   w ↦ e(w),   (10.2)

that maps each word w (or rather its tokenization) to a b-dimensional vector e(w),
for a given embedding dimension b ≪ W. The general idea now is that similarity in
the meaning of words can be learned from the context in which the words are used.
That is, when we consider a text
text = (w_1, . . . , w_{t−1}, w_t, w_{t+1}, . . . , w_T)^⊤,
p(w_t | w_1, . . . , w_{t−1}, w_{t+1}, . . . , w_T) = p(w_1, . . . , w_T) / p(w_1, . . . , w_{t−1}, w_{t+1}, . . . , w_T).   (10.3)
430 10 Natural Language Processing
There are two ways of estimating the probability p in (10.3). Either we can try to
predict the center word wt from its context as in (10.3) or we can try to predict the
context from the center word wt , which applies Bayes’s rule to (10.3). The latter
variant is called skip-gram and the former variant is called continuous bag-of-words
(CBOW), if we neglect the order of the words in the context. These two approaches
have been developed by Mikolov et al. [275, 276].
Skip-gram Approach
ℓ_W = Σ_{i=1}^{n} log p(w_{i,t−c}, . . . , w_{i,t−1}, w_{i,t+1}, . . . , w_{i,t+c} | w_{i,t})

    = Σ_{i=1}^{n} Σ_{−c≤j≤c, j≠0} log p(w_{i,t+j} | w_{i,t}),   (10.4)
categorical responses, see Sect. 5.7. We make the following ansatz for the context
word w_s and the center word w_t (for all j)

p(w_s | w_t) = exp⟨ẽ(w_s), e(w_t)⟩ / Σ_{w=1}^{W} exp⟨ẽ(w), e(w_t)⟩ ∈ (0, 1),   (10.5)
where e and ẽ are two (different) embedding maps (10.2) that have the same
embedding dimension b ∈ N. Thus, we construct two different embeddings e and ẽ
for the center words and for the context words, respectively, and these embeddings
(embedding weights) are chosen such that the log-likelihood (10.4) is maximized
for the given observations W. These assumptions give us a minimization problem
for the negative log-likelihood in the embedding mappings, i.e., we minimize over
the embeddings e and ẽ
−ℓ_W = − Σ_{i=1}^{n} Σ_{−c≤j≤c, j≠0} log [ exp⟨ẽ(w_{i,t+j}), e(w_{i,t})⟩ / Σ_{w=1}^{W} exp⟨ẽ(w), e(w_{i,t})⟩ ]   (10.6)

     = − Σ_{i=1}^{n} ( Σ_{−c≤j≤c, j≠0} ⟨ẽ(w_{i,t+j}), e(w_{i,t})⟩ − 2c log Σ_{w=1}^{W} exp⟨ẽ(w), e(w_{i,t})⟩ ).
These optimal embeddings are learned using a variant of the gradient descent
algorithm. This often results in a very high-dimensional optimization problem as
we have 2bW parameters to learn, and the calculation of the last (normalization)
term in (10.6) can be very expensive in gradient descent algorithms. For this reason
we present the method of negative sampling below.
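The softmax ansatz (10.5), whose normalization over all w ∈ W causes this cost, can be sketched with toy two-dimensional embeddings; all numbers below are illustrative, not fitted values:

```python
# A small sketch of the softmax ansatz (10.5) with toy 2-dimensional
# embeddings e and e_tilde; all numbers are illustrative.
import math

def p_context_given_center(e_tilde, e, s, t):
    """p(w_s | w_t) via a softmax over the scalar products <e_tilde(w), e(w_t)>."""
    scores = [sum(a * b for a, b in zip(ew, e[t])) for ew in e_tilde]
    z = sum(math.exp(x) for x in scores)   # the expensive normalization term
    return math.exp(scores[s]) / z

e       = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]   # center-word embeddings
e_tilde = [[0.2, 0.1], [-0.1, 0.3], [0.4, -0.2]]   # context-word embeddings

probs = [p_context_given_center(e_tilde, e, s, 0) for s in range(3)]
assert abs(sum(probs) - 1.0) < 1e-12   # a proper probability distribution
assert all(0 < p < 1 for p in probs)
```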
Continuous Bag-of-Words
For the CBOW method we start from the log-likelihood for a context size c ∈ N and
given the observations W

Σ_{i=1}^{n} log p(w_{i,t} | w_{i,t−c}, . . . , w_{i,t−1}, w_{i,t+1}, . . . , w_{i,t+c}).

The context words are summarized by their average embedding

ẽ_{i,t} = (1/2c) Σ_{−c≤j≤c, j≠0} ẽ(w_{i,t+j}).
432 10 Natural Language Processing
Again the gradient descent method is applied to the negative log-likelihood to learn
the optimal embedding maps e and ẽ.

Remark 10.1 In both cases, skip-gram and CBOW, we estimate two separate
embeddings e and ẽ for the center word and the context words. Typically, CBOW is
faster, but skip-gram is better on words that are less frequent.
Negative Sampling
There is a computational issue in (10.6) and (10.7) because the probability normal-
izations in (10.6) and (10.7) aggregate over all available words w ∈ W. This can
be computationally demanding because we need to perform this calculation in each
gradient descent step. For this reason, Mikolov et al. [276] turn the log-likelihood
optimization problem (10.6) into a binary classification problem. Consider a pair
(w, w̃) ∈ W × W of center word w and context word w̃. We introduce a binary
response variable Y ∈ {1, 0} that indicates whether an observation (W, W̃) =
(w, w̃) is coming from a true center-context pair (from our texts) or whether
we have a fake center-context pair (that has been generated randomly). Choosing
the canonical link of the Bernoulli EF (logistic/sigmoid function) we make the
following ansatz (in the skip-gram approach) to test for the authenticity of a center-
context pair (w, w̃)
P[Y = 1 | w, w̃] = 1 / (1 + exp{−⟨ẽ(w̃), e(w)⟩}).   (10.8)
The recipe now is as follows: (1) Consider for a given window size c all center-
context pairs (w_i, w̃_i) ∈ W × W of our texts, and equip them with a response Y_i = 1.
Assume we have N such observations. (2) Simulate N i.i.d. pairs (W_{N+k}, W̃_{N+k}),
1 ≤ k ≤ N, by randomly choosing W_{N+k} and W̃_{N+k}, independent from each
other (by performing independent re-sampling with or without replacement from
the data (w_i)_{1≤i≤N} and (w̃_i)_{1≤i≤N}, respectively). Equip these (false) pairs with the
response Y_{N+k} = 0. (3) Maximize the following log-likelihood as a function of the
10.2 Word Embeddings 433
ℓ_Y = Σ_{i=1}^{2N} log P[Y = Y_i | w_i, w̃_i]   (10.9)

    = Σ_{i=1}^{N} log( 1 / (1 + exp{−⟨ẽ(w̃_i), e(w_i)⟩}) ) + Σ_{k=N+1}^{2N} log( 1 / (1 + exp{⟨ẽ(w̃_k), e(w_k)⟩}) ).
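The structure of the log-likelihood (10.9), with the log-sigmoid of the scalar product for true pairs and the log-sigmoid of its negative for fake pairs, can be sketched as follows; the scores are toy values, not fitted embeddings:

```python
# A sketch of the negative-sampling log-likelihood (10.9): true center-context
# pairs contribute log(sigmoid(score)), fake pairs log(sigmoid(-score)),
# where score = <e_tilde(w_tilde), e(w)>. Toy scores, not fitted embeddings.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loglik(scores_true, scores_fake):
    return (sum(math.log(sigmoid(s)) for s in scores_true)
            + sum(math.log(sigmoid(-s)) for s in scores_fake))

good = neg_sampling_loglik([2.0], [-2.0])   # scores agree with the labels
bad  = neg_sampling_loglik([-2.0], [2.0])   # scores contradict the labels
assert good > bad   # embeddings separating true from fake pairs score higher
assert good < 0     # a log-likelihood of Bernoulli probabilities is never positive
```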
3 https://nlp.stanford.edu/projects/glove/.
(tedious) manual work, and we do this step to be able to compare our results to
pre-trained word2vec versions.
After this pre-processing we apply the tokenizer, see line 4 of Listing 10.2. This
gives us 1'829 different words. To construct our (illustrative) embedding we only
consider the words that appear at least 20 times over all texts; these are W = 142
words. Thus, the following analysis is only based on the W = 142 most frequent
words. Of course, we could increase our vocabulary by considering any text that can
be downloaded from the internet. Since we would like to perform an insurance claim
analysis, these texts should be related to an insurance context so that the learned
embeddings reflect an insurance experience; we come back to this in Remark 10.4,
below. We refrain here from doing so and embed these W = 142 words into the
Euclidean plane (b = 2).
Listing 10.5 shows the tokenization of the most frequent words, and on line 4 we
build the (shortened) texts w_1, w_2, . . . , only considering these most frequent words
w ∈ W = {1, . . . , W}. In total we receive 4'746 texts that contain at least two words
from W and, hence, can be used for the skip-gram building of center-context pairs
(w, w̃) ∈ W × W. Lines 7–8 give the code for building these pairs for a window of
size c = 2. In total we receive N = 23'952 center-context pairs (w_i, w̃_i) from our
texts. We equip these pairs with a response Y_i = 1. For the false pairs, we randomly
permute the second component of the true pairs, (W_{N+i}, W̃_{N+i}) = (w_i, w̃_{τ(i)}),
where τ is a random permutation of {1, . . . , N}. These false pairs are equipped
with a response Y_{N+i} = 0. Thus, altogether we have 2N = 47'904 observations
(Y_i, w_i, w̃_i), 1 ≤ i ≤ 2N, that can be used to learn the embeddings e and ẽ.
Listing 10.6 shows the R code to perform the embedding learning using the negative
sampling (10.9). This network has 2bW = 568 embedding weights that need to
be learned from the data. There are two more parameters involved on line 10 of
Listing 10.6. These two parameters shift the scalar products by an intercept β_0 and
scale them by a constant β_1. We could set (β_0, β_1) = (0, 1); however, keeping
these two parameters trainable has led to results that are better centered around the
origin. Of course, these two parameters do not harm the arguments as they only
P[Y = 1 | w, w̃] = 1 / (1 + exp{−β_0 − β_1⟨ẽ(w̃), e(w)⟩}) = e^{β_0} / (e^{β_0} + e^{−β_1⟨ẽ(w̃), e(w)⟩}),

and

P[Y = 0 | w, w̃] = 1 − e^{β_0} / (e^{β_0} + e^{−β_1⟨ẽ(w̃), e(w)⟩}) = e^{−β_0} / (e^{−β_0} + e^{β_1⟨ẽ(w̃), e(w)⟩}).
We fit this model using the nadam version of the gradient descent algorithm, and
the fitted embedding weights can be extracted with get_weights(model).
Figure 10.2 shows the learned embedding weights e(w) ∈ R² of all words w ∈ W.
We highlight the words that coincide with the insured hazards in red color, see line
10 of Listing 10.1. The word ‘vehicle’ is in the first quadrant and it is surrounded
by ‘pole’, ‘truck’, ‘garage’, ‘car’, ‘traffic’. The word ‘vandalism’ is in the third
quadrant surrounded by ‘graffito’, ‘window’, ‘pavilion’, names of cities and parks,
and ‘ms’ for middle school. Finally, the words ‘fire’, ‘wind’, ‘lightning’ and ‘hail’ are
in the first and fourth quadrant, close to ‘water’; these words are surrounded by
‘bldg’ (building), ‘smoke’, ‘equipment’, ‘alarm’, ‘safety’, ‘power’, ‘library’, etc. We
conclude that these embeddings make perfect sense in an insurance claim context.
Note that we have applied some pre-processing, and the embeddings could be
improved even further by additional pre-processing, since, e.g., both ‘vandalism’ and
‘vandalize’ or ‘hs’ and ‘high school’ are used.
Another nice observation is that the embeddings tend to build a circle around the
origin, see Fig. 10.2. This is enforced by embedding W = 142 different words into
a b = 2 dimensional space so that dissimilar words optimally repel each other.
Fig. 10.2 Two-dimensional skip-gram embedding using negative sampling; in red color are the
insured hazards ‘vehicle’, ‘fire’, ‘lightning’, ‘wind’, ‘hail’, ‘water’ and ‘vandalism’
Fig. 10.3 Matrix C of Example 10.2; the color scale gives the observed frequencies; the vertical
axis shows the center word (ordered), the horizontal axis the context word (ordered)
log C(w, w̃) ≈ ⟨ẽ(w̃), e(w)⟩ + β̃_{w̃} + β_w,
for xmax > 0 and α > 0. Pennington et al. [300] state that the model depends
weakly on the cutoff point xmax; they propose xmax = 100, and a sub-linear
behavior seems to outperform a linear one, suggesting, e.g., the choice α = 3/4.
Under these choices the embeddings e and * e are found by minimizing the objective
function (10.10) for the given data. Note that limx↓0 χ(x)(log x)2 = 0.
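The weighting function can be sketched in R; we assume the standard GloVe choice χ(x) = min{(x/xmax)^α, 1} of Pennington et al. [300], which matches the cutoff behavior and the limit just stated:

```r
# Weighting function of the GloVe objective (10.10); we assume the standard
# choice chi(x) = min{(x / xmax)^alpha, 1} of Pennington et al. [300] with
# cutoff xmax = 100 and sub-linear exponent alpha = 3/4.
chi <- function(x, xmax = 100, alpha = 3/4) pmin((x / xmax)^alpha, 1)

chi(100)    # counts at/above the cutoff receive full weight 1
chi(1)      # rare pairs are down-weighted: (1/100)^(3/4)

# numerical illustration of lim_{x -> 0} chi(x) * (log x)^2 = 0:
chi(1e-12) * (log(1e-12))^2
```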
Example 10.3 (GloVe Word Embedding) We provide an example using the GloVe
embedding model, and we revisit the data of Example 10.2; we also use exactly the
same pre-processing as in that example. We start from N = 23 952 center-context
pairs.
In a first step we count the number of co-occurrences C(w, w*). Only 4’972
pairs occur, i.e., have C(w, w*) > 0; this corresponds to the colors in Fig. 10.3.
With these 4’972 pairs we have to fit 568 embedding weights (for the embedding
dimension b = 2) and 284 intercepts β̃_{w*}, β_w, thus, 852 parameters in total. The
results of this fitting are shown in Fig. 10.4.
The general picture in Fig. 10.4 is similar to Fig. 10.2, e.g., ‘vandalism’ is
surrounded by ‘graffito’, ‘window’, ‘pavilion’, names of cites and parks, ‘ms’
and ‘es’; or ‘vehicle’ is surrounded by ‘pole’, ‘traffic’, ‘street’, ‘signal’. However,
the clustering of the words around the origin shows a crucial difference between
GloVe and the negative sampling of word2vec. The problem here is that we do
not have sufficiently many observations. We have 4’972 center-context pairs that
occur, i.e., C(w, w*) > 0. Of these, 2’396 pairs occur exactly once, C(w, w*) = 1,
which is almost half of the observations with C(w, w*) > 0. GloVe (10.10) considers
these observations on the log-scale, which provides log C(w, w*) = 0 for the pairs
that occur exactly once. The weighted square loss for these pairs is minimized by
either setting ẽ(w*) = 0 or e(w) = 0, provided that the intercepts are also set to 0.
This is exactly what we observe in Fig. 10.4 and, thus, successfully fitting GloVe
would require many more (frequent) observations.
4 https://nlp.stanford.edu/projects/glove/.
5 https://spacy.io/models/en#en_core_web_md.
Fig. 10.4 Two-dimensional GloVe embedding; in red color are the insured hazards ‘vehicle’,
‘fire’, ‘lightning’, ‘wind’, ‘hail’, ‘water’ and ‘vandalism’
Listing 10.7 R code for the hazard type prediction based on a word2vec embedding
The R code for the hazard type prediction is presented in Listing 10.7. The crucial
part is shown on line 5. Namely, the embedding map e(w) ∈ Rb , w ∈ W is
initialized with the embedding weights wordEmb received from Example 10.2, and
6 WaterW relates to weather-related water claims, and WaterNW relates to
non-weather-related water claims.
10.3 Lab: Predictive Modeling Using Word Embeddings
Fig. 10.5 Confusion matrices of the hazard type prediction using a word2vec embedding based on
negative sampling (lhs) b = 2 dimensional embedding and (rhs) b = 10 dimensional embedding;
columns show the observations and rows show the predictions
these embedding weights are declared to be non-trainable.7 These features are then
inputted into a FN network with two FN layers having (q1 , q2 ) = (20, 15) neurons,
and as output activation we choose the softmax function. This model has 286 non-
trainable embedding weights, and r = (9 · 2 + 1)20 + (20 + 1)15 + (15 + 1)9 = 839
trainable parameters.
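The parameter count r can be reproduced by a one-liner; the arguments mirror the architecture of Listing 10.7 (Tn = 9 embedded words per claim description, FN layers (q1, q2) = (20, 15), K = 9 hazard classes):

```r
# Trainable parameter count of the FN classification network of Listing 10.7:
# Tn = 9 embedded words per claim description are flattened into a Tn * b
# input, followed by FN layers (q1, q2) = (20, 15) and K = 9 output classes.
n_params <- function(b, Tn = 9, q1 = 20, q2 = 15, K = 9) {
  (Tn * b + 1) * q1 + (q1 + 1) * q2 + (q2 + 1) * K
}
n_params(b = 2)    # 839, as stated above
n_params(b = 10)   # 2279, used for the b = 10 analysis
```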
We fit this network using the nadam version of the gradient descent method, and
we exercise an early stopping on a 20% validation data set (of the entire data). This
network is fitted in a few seconds, and the results are presented in Fig. 10.5 (lhs).
This figure shows the confusion matrix of prediction vs. observed (row vs. column).
The general results look rather good; there are only difficulties in distinguishing
WaterW from WaterNW claims.
In a second analysis, we increase the embedding dimension to b = 10 and
we perform exactly the same procedure as above. A higher embedding dimension
allows the embedding map to better discriminate the words in their meanings.
However, we should not choose too high a b because we have only 142 different
words and 47’904 center-context pairs (w, w*) to learn these embeddings e(w) ∈ Rb.
A higher embedding dimension also increases the number of network weights in
the first FN layer on line 9 of Listing 10.7. This time, we need to train r =
(9 · 10 + 1)20 + (20 + 1)15 + (15 + 1)9 = 2 279 parameters. The results are
presented in Fig. 10.5 (rhs). We observe an overall improvement compared to the
2-dimensional embeddings. This is also confirmed by Table 10.1 which gives the
deviance losses and the misclassification rates.
Table 10.1 Hazard prediction results summarized in deviance losses and misclassification rates

                                      Number of parameters    Deviance  Misclassification
                                      Embedding   Network     loss      rate
word2vec negative sampling, b = 2        286         839      0.1442    19.9%
word2vec negative sampling, b = 10     1’430       2’279      0.0912    13.7%
FN GloVe using all words, b = 50      91’500       9’479      0.0802    11.7%
LSTM GloVe using all words, b = 50    91’500       3’369      0.0802    12.1%
Word similarity embedding, b = 7      12’810       1’739      0.1396    21.1%
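The two evaluation metrics of Table 10.1 can be computed from the estimated classification probabilities; a minimal sketch (the deviance normalization is one possible convention and need not match the table's scaling exactly):

```r
# P: n x K matrix of estimated class probabilities, y: observed labels 1..K.
eval_classifier <- function(P, y) {
  n <- nrow(P)
  p_obs <- P[cbind(1:n, y)]                      # fitted prob. of observed class
  dev   <- 2 * mean(-log(p_obs))                 # average multinomial deviance
  yhat  <- max.col(P, ties.method = "first")     # arg max_k p_k(x)
  c(deviance = dev, misclassification = mean(yhat != y))
}

P <- rbind(c(0.8, 0.1, 0.1), c(0.2, 0.7, 0.1), c(0.3, 0.3, 0.4))
y <- c(1, 2, 3)
eval_classifier(P, y)
```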
Fig. 10.6 Confusion matrices of the hazard type prediction using the pre-trained GloVe with b =
50 (lhs) FN network and (rhs) LSTM network; columns show the observations and rows show the
predictions
Listing 10.8 R code for the hazard type prediction using a LSTM architecture
We fit this LSTM architecture to the data using the pre-trained GloVe embed-
dings. The results are presented in Fig. 10.6 (rhs) and Table 10.1. We receive the
same deviance loss, and the misclassification rate is slightly worse than in the
FN network case (with the same pre-trained GloVe embeddings). Note that the
deviance loss is calculated on the estimated classification probabilities
p̂(x) = (p̂1(x), . . . , p̂9(x)), and the labels are obtained by

Ŷ = Ŷ(x) = arg max_{k=1,...,9} p̂k(x).
Thus, it may happen that the improvements on the estimated probabilities are not
fully reflected on the predicted labels.
444 10 Natural Language Processing
Word (Cosine) Similarity In our final analysis we work with the pre-trained GloVe
embeddings e(w) ∈ R50 but we first try to reduce the embedding dimension b. For
this we follow Lee et al. [236], and we consider a word similarity. We can define
the similarity of the words w and w′ ∈ W by considering the scalar product of their
embeddings

sim^{(u)}(w, w′) = ⟨e(w), e(w′)⟩   or   sim^{(n)}(w, w′) = ⟨e(w), e(w′)⟩ / (‖e(w)‖2 ‖e(w′)‖2).   (10.11)
The first one is an unweighted version and the second one is a normalized
version scaling with the corresponding Euclidean norms so that
the similarity measure is within [−1, 1]. In fact, the latter is also called
cosine similarity. To reduce the embedding dimension and because we
have a classification problem with hazard names, we can evaluate the
(cosine) similarity of all used words w ∈ W to the hazards h ∈ H =
{fire, lightning, hail, wind, water, vehicle, vandalism}. Observe
that water is further separated into weather related and non-weather related claims,
and there is a further hazard type called misc, which collects all the rest. We could
choose more words in H to more precisely describe these water and other claims. If
we just use H we obtain a b = |H| = 7 dimensional embedding mapping
w ∈ W0 → e^{(a)}(w) = (sim^{(a)}(w, fire), . . . , sim^{(a)}(w, vandalism)) ∈ R^{b=7},   (10.12)
for a ∈ {u, n}. This gives us for every text = (w1, . . . , wT) ∈ W0^T the pre-processed
features

text → (e^{(a)}(w1), . . . , e^{(a)}(wT)) ∈ R^{T×b}.   (10.13)
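The mapping (10.11)–(10.12) can be sketched in R; the embedding list emb stands in for the pre-trained GloVe vectors (random toy vectors here, purely for illustration):

```r
# Sketch of the (cosine) similarity embedding (10.11)-(10.12); 'emb' stands
# in for the pre-trained GloVe vectors (random toy vectors here).
sim_u <- function(u, v) sum(u * v)                              # unweighted
sim_n <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))  # cosine

embed_by_similarity <- function(word, hazards, emb, type = c("u", "n")) {
  f <- if (match.arg(type) == "u") sim_u else sim_n
  sapply(hazards, function(h) f(emb[[word]], emb[[h]]))
}

set.seed(1)
words <- c("fire", "lightning", "hail", "wind", "water", "vehicle",
           "vandalism", "smoke")
emb <- setNames(lapply(words, function(w) rnorm(50)), words)
e_smoke <- embed_by_similarity("smoke", words[1:7], emb, type = "n")
length(e_smoke)   # b = 7 similarities, one per hazard
range(e_smoke)    # cosine similarities lie in [-1, 1]
```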
Lee et al. [236] apply a max-pooling layer to these embeddings, which are then
inputted into a GAM classification model. We use a different approach here, and
directly use the unweighted (a = u) text representations (10.13) as an input to a
network, either of the FN network type of Listing 10.7 or of the LSTM type of Listing 10.8.
If we use the FN network type we receive the results on the last line of Table 10.1
and Fig. 10.7.
Comparing the results of the word similarity through the embeddings (10.12)
and (10.13) to the other prediction results, we conclude that this word similarity
approach is not fully competitive compared to working directly with the word2vec
or GloVe embeddings. It seems that the projection (10.12) does not discriminate
sufficiently for our classification task.
Fig. 10.7 Confusion matrix of the hazard type prediction using the word
similarity (10.12)–(10.13) for a = u; columns show the observations (Fire,
Lightning, Hail, Wind, WaterW, WaterNW, Vehicle, Vandalism, Misc) and rows
show the predictions

Fire      105    2    0    4    0    1   16   21    9
Light.      9  906    5    9    7    0    0    2    8
Hail        0    1   72    2    1    0    1    1    2
Wind        2    4   10  314   21    1    3    7   18
Wat.W       2    5    1   14  345  183   14   15   32
Wat.NW      1    0    0    1   25   39    5    4    8
Vehicle    34    6    2   15    5    7  871   75   84
Misc       20   18    2   27   34   21   74   40  186
10.4 Lab: Deep Word Representation Learning
All examples above have relied on embedding the words w ∈ W into
a Euclidean space, e(w) ∈ Rb, by performing a sort of unsupervised learning
that provides word similarity clusters. The advantage of this approach is that
the embedding is decoupled from the regression or classification task, which is
computationally attractive. Moreover, once a suitable embedding has been learned,
it can be used for several different tasks (in the spirit of transfer learning). The
disadvantage of pre-trained embeddings is that the embedding is not targeted to
the regression task at hand. This has already been discussed in Remark 10.4, where
we have highlighted that the meaning of some words (such as Lincoln) depends very
much on their context.
Recent NLP aims at pre-processing a text as little as necessary, and instead tries
to directly feed the raw sentences into RN networks such as LSTM or GRU
architectures. Computationally this is much more demanding because we have
to learn the embeddings and the network weights simultaneously; we refer to
Table 10.1 for an indication of the number of parameters involved. The purpose of this short
section is to give an example, though our NLP database is rather small; this latter
approach usually requires a huge database and the corresponding computational
power. Ferrario–Nägelin [126] provide a more comprehensive example on the
classification of movie reviews. For their analysis they evaluated approximately
50’000 movie reviews each using between 235 and 2’498 words. Their analysis
was implemented on the ETH High Performance Computing (HPC) infrastructure
Euler8, and their run times have been between 20 and 30 minutes, see Table 8 of
Ferrario–Nägelin [126].
8 https://scicomp.ethz.ch/wiki/Euler
446 10 Natural Language Processing
Since we neither have the computational power nor the big data to fit such
an NLP application, we start the gradient descent fitting from the initial embedding
weights e(w) ∈ Rb that either come from the word2vec or the GloVe embeddings.
During the gradient descent fitting, we allow these weights to change w.r.t. the
regression task at hand. In comparison to Sect. 10.3, this only requires minor
changes to the R code, namely, the only modification needed is to change from
FALSE to TRUE on line 5 of Listings 10.7 and 10.8. This change allows us to
learn adapted weights during the gradient descent fitting. The resulting classification
models are now very high-dimensional, and we need to carefully assess the
early stopping rule, otherwise the model will (in-sample) over-fit to the learning
data.
In Fig. 10.8 we provide the results that correspond to the self-trained word2vec
embeddings given in Fig. 10.5, and the corresponding numerical results are given
in Table 10.2. We observe an improvement in the prediction accuracy in both cases
by letting the embedding weights be learned during the network fitting, and we
receive a misclassification rate of 11.6% and 11.0% for the embedding dimensions
b = 2 and b = 10, respectively, see Table 10.2.
Figure 10.8 (rhs) illustrates how the embeddings have changed from the initial (pre-trained)
embeddings e^{(0)}(w) (coming from the word2vec negative sampling) to the
learned embeddings e(w). We measure these changes in terms of the unweighted
similarity measure defined in (10.11), given by

⟨e^{(0)}(w), e(w)⟩.   (10.14)
The upper horizontal line is a manually set threshold to identify the words w that
experience a major change in their embeddings. These are the words ‘vandalism’,
‘lightning’, ‘graffito’, ‘fence’, ‘hail’, ‘freeze’, ‘blow’ and ‘breakage’. Thus, these
words receive a different embedding location/meaning which is more favorable for
our classification task.
A similar analysis can be performed for the pre-trained GloVe embeddings. There
we expect bigger changes to the embeddings, since the GloVe embeddings have
not been learned in an insurance context, and the embeddings will be adapted to
the insurance prediction problem. We refrain from giving an explicit analysis here,
because a thorough analysis would need (much) more data.
We conclude this example with some remarks. We emphasize once more that
our available data is minimal, and we expect (even much) better results for longer
claim descriptions. In particular, our data is not sufficient to discriminate the weather
related from the non-weather related water claims, as the claim descriptions seem
to focus on the water claim itself and not on its cause. In a next step, one should use
claim descriptions in order to predict the claim sizes, or to improve their predictions
if these are based on classical tabular features only. Here we see some potential, in
particular w.r.t. medical claims, as medical reports may clearly indicate the severity
of the claim and may also give some insight into the recovery process.
Thus, our small example may only give some intuition of what is possible with
Fig. 10.8 Confusion matrices and the changes in the embeddings compared to the pre-trained
word2vec embeddings of Fig. 10.5 for the dimensions b = 2 and b = 10
Table 10.2 Hazard prediction results summarized in deviance losses and misclassification rates:
pre-trained embeddings vs. network learned embeddings

                                       Number of parameters        Deviance  Misclass.
                                       Non-trainable   Trainable   loss      rate
word2vec negative sampling, b = 2      286                 839     0.1442    19.9%
word2vec improved embedding, b = 2     –                 1’125     0.0814    11.7%
word2vec negative sampling, b = 10     1’430             2’279     0.0912    13.7%
word2vec improved embedding, b = 10    –                 3’709     0.0714    10.5%
448 10 Natural Language Processing
(unstructured) text data. Unfortunately, the LGPIF data of Listing 10.1 did not give
us any satisfactory results for the claim size prediction, for several reasons.
Firstly, the data is rather heterogeneous ranging from small to very large claims
and any member of the EDF struggles to model this data; we come back to a
different modeling proposal of heterogeneous data in Sect. 11.3.2. Secondly, the
claim descriptions are not very explanatory as they are too short for a more detailed
information. Thirdly, the data has only 5’424 claims which seems small compared
to the complexity of the problem that we try to solve.
10.5 Outlook: Creating Attention

In text recognition problems, obviously, not all the words in a sentence have the
same importance. In the examples above, we have removed the stopwords as they
may disturb the key understanding of our texts. Removing the stopwords means that
we pay more attention to the remaining words. RN networks often face difficulties
in giving the right weight to the different parts of a sentence. For this reason,
attention layers have gained popularity recently. Attention layers are special
modules in network architectures that allow the network to impose more weight
on certain parts of the information in the features to emphasize their importance.
The attention mechanism has been introduced in Bahdanau et al. [21]. There are
different ways of modeling attention, the most popular one is the so-called dot-
product attention, we refer to Vaswani et al. [366], and in the actuarial literature we
mention Kuo–Richman [231] and Troxler–Schelldorfer [354].
We start by describing a simple attention mechanism. Consider a sentence
text = (w1 , . . . , wT ) ∈ W0T that provides, under an embedding map e : W0 →
Rb , the embedded sentence (e(w1 ), . . . , e(wT )) ∈ RT ×b . We choose a weight
matrix UQ ∈ Rb×b and an intercept vector uQ ∈ Rb . Based on these choices we
consider for each word wt of our sentence the score, called query,
q_t = tanh(u_Q + U_Q e(w_t)) ∈ (−1, 1)^b.   (10.15)
Using these attention weights α = (α1, . . . , αT)⊤ ∈ (0, 1)^T we encode the sentence
text as

text = (w1, . . . , wT) → w∗ = Σ_{t=1}^{T} αt e(wt) = (e(w1), . . . , e(wT)) α ∈ R^b.   (10.17)
where the softmax function is applied column-wise. I.e., the attention weight matrix
A ∈ (0, 1)^{T×q} has columns αj = (α1,j, . . . , αT,j)⊤, 1 ≤ j ≤ q, which are
normalized to total weight 1; this is equivalent to (10.16). This is used to encode the
sentence text
Example 10.6 We revisit the hazard type prediction example of Sect. 10.3. We
select the b = 10 word2vec embedding (using negative sampling) and the
pre-trained GloVe embedding of Table 10.1. These embeddings are then further
processed by applying the attention mechanism (10.15)–(10.17) on the embeddings
using one single attention neuron. Listing 10.9 gives the corresponding implemen-
tation. On line 9 we have the query (10.15), on lines 10–13 the key and the attention
weights (10.16), and on line 15 the encodings (10.17). We then process these
encodings through a FN network of depth d = 2, and we use the softmax output
activation to receive the categorical probabilities. Note that we keep the learned
word embeddings e(w) as non-trainable on line 5 of Listing 10.9.
Table 10.3 gives the results, and Fig. 10.9 shows the confusion matrix. We conclude
that the results are rather similar; this attention mechanism seems to work quite well,
and with fewer parameters here.
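The attention encoding (10.15) and (10.17) can be sketched in R; since the key/attention-weight step (10.16) is not reproduced above, we assume scores ⟨q_t, k⟩ for an (assumed) key vector k, normalized by the softmax function — an assumption in the spirit of dot-product attention:

```r
# Sketch of the attention encoding with a single attention neuron (q = 1);
# the key/score step (10.16) is not reproduced here, so we assume scores
# <q_t, k> for an (assumed) key vector k, normalized by the softmax function.
attention_encode <- function(E, UQ, uQ, k) {
  Q <- tanh(sweep(E %*% t(UQ), 2, uQ, "+"))   # queries (10.15), T x b
  a <- exp(Q %*% k) / sum(exp(Q %*% k))       # attention weights in (0, 1)
  list(alpha = as.vector(a),
       wstar = as.vector(t(E) %*% a))         # encoding (10.17), in R^b
}

set.seed(1)
b <- 4; Tn <- 6
E  <- matrix(rnorm(Tn * b), Tn, b)            # embedded sentence, T x b
UQ <- matrix(rnorm(b * b), b, b); uQ <- rnorm(b); k <- rnorm(b)
out <- attention_encode(E, UQ, uQ, k)
sum(out$alpha)     # attention weights sum to 1
length(out$wstar)  # encoded sentence lives in R^b
```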
Listing 10.9 R code for the hazard type prediction using an attention layer with q = 1
Table 10.3 Hazard prediction results summarized in deviance losses and misclassification rates

                                      Number of parameters    Deviance  Misclassification
                                      Embedding   Network     loss      rate
word2vec negative sampling, b = 10     1’430       2’279      0.0912    13.7%
word2vec attention, b = 10             1’430         799      0.0784    12.0%
FN GloVe using all words, b = 50      91’500       9’479      0.0802    11.7%
GloVe attention, b = 50               91’500       4’079      0.0824    12.6%
Fig. 10.9 Confusion matrices of the hazard type prediction (lhs) using an attention layer on the
word2vec embeddings with b = 10, and (rhs) using an attention layer on the pre-trained GloVe
embeddings with b = 50; columns show the observations and rows show the predictions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 11
Selected Topics in Deep Learning
11.1 Deep Learning Under Model Uncertainty

We revisit claim size modeling in this section. Claim size modeling is challenging
because often there is no (simple) off-the-shelf distribution that allows one to
appropriately describe all claim size observations. E.g., the main body of the claim
size data may look gamma distributed, and, at the same time, large claims seem
to be more heavy-tailed (contradicting a gamma model assumption). Moreover,
different product and claim types may lead to multi-modality in the claim size
densities. In Sects. 5.3.7 and 5.3.8 we have explored a gamma and an inverse
Gaussian GLM to model a motorcycle claims data set. In that example, the results
have been satisfactory because this motorcycle data is neither multi-modal nor does
it have heavy tails. These two GLM approaches have been based on the EDF (2.14),
modeling the mean x → μ(x) with a regression function and assuming a constant
dispersion parameter ϕ > 0. There are two natural ways to extend this approach.
One considers a double GLM with a dispersion submodel x → ϕ(x), see Sect. 5.5,
the other explores multi-parameter extensions like the generalized inverse Gaussian
model, which is a k = 3 vector-valued EF, see (2.10), or the GB2 family that
involves 4 parameters, see (5.79). These extensions provide more complexity, also in
MLE. In this section, we are not going to consider multi-parameter extensions, but
in a first step we aim at robustifying (mean) parameter estimation within the EDF.
In a second step we are going to analyze the resulting dispersion ϕ(x). For these
steps, we perform representation learning and parameter estimation under model
uncertainty by simultaneously considering multiple models from Tweedie’s family.
These considerations are closely related to Tweedie’s forecast dominance given in
Definition 4.22.
The unit deviance takes the following form for p > 1 and p ≠ 2, see (4.18),

d_p(y, μ) = 2 ( y (y^{1−p} − μ^{1−p})/(1 − p) − (y^{2−p} − μ^{2−p})/(2 − p) ) ≥ 0,   (11.2)
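A direct R implementation of the unit deviance (11.2) can serve as a sanity check; the gamma case p = 2 uses the standard gamma unit deviance d2(y, μ) = 2((y − μ)/μ − log(y/μ)), which is the p → 2 limit of (11.2):

```r
# Unit deviance (11.2) for power variance parameter p > 1; the gamma case
# p = 2 uses the standard gamma unit deviance, the p -> 2 limit of (11.2).
dp <- function(y, mu, p) {
  if (p == 2) return(2 * ((y - mu) / mu - log(y / mu)))
  2 * (y * (y^(1 - p) - mu^(1 - p)) / (1 - p) -
         (y^(2 - p) - mu^(2 - p)) / (2 - p))
}

dp(2, 2, 3)                         # 0 at y = mu
dp(4, 2, 2.5)                       # > 0 away from the mean
c(dp(4, 2, 2.0001), dp(4, 2, 2))    # continuity in p at the gamma case
```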
Figure 11.1 (lhs) shows the unit deviances y → dp (y, μ) for fixed mean parameter
μ = 2 and power variance parameters p ∈ {0, 2, 2.5, 3, 3.5}; the case p = 0
corresponds to the symmetric Gaussian case d0 (y, μ) = (y − μ)2 . We observe
that with an increasing power variance parameter p large claims Y = y receive a
smaller loss punishment (if we interpret the unit deviance as a loss function). This
is the situation where we have a fixed mean μ and where we assess claim sizes
Fig. 11.1 (lhs) Unit deviances y → dp (y, μ) ≥ 0 for fixed mean μ = 2 and (rhs) unit
deviances μ → dp (y, μ) ≥ 0 for fixed observation y = 2 for power variance parameters
p ∈ {0, 2, 2.5, 3, 3.5}
d_p(y0 + εy, μ0 + εμ) = (ε^2/μ0^p) (y − μ)^2 + o(ε^2)   as ε → 0.
Thus, locally around the minimum the unit deviances behave symmetrically and like
Gaussian squares, but this is only a local approximation around a minimum μ0 = y0,
as can be seen from Fig. 11.1. I.e., in general, model fitting turns out to be rather
different from the Gaussian square loss if we have small and large claim sizes under
choices p > 1.
Remarks 11.1
• Since unit deviances are Bregman divergences, we know that every unit deviance
gives us a strictly consistent scoring function for the mean functional, see
Theorem 4.19. Therefore, the specific choice of the power variance parameter p
seems less relevant. However, strict consistency is an asymptotic statement, and
choosing a unit deviance that matches the property of the data has better finite
sample properties, i.e., a smaller variance in asymptotic normality; we come back
to this in Sect. 11.1.4, below.
• A function (y, μ) → ψ(y, μ) is called b-homogeneous if there exists b ∈ R
such that for all (y, μ) and all λ > 0 we have ψ(λy, λμ) = λb ψ(y, μ). Unit
deviances dp are b-homogeneous with b = 2 − p. This b-homogeneity has
the nice consequence that the decisions taken are independent of the scale, i.e.,
we have an invariance under changes of currencies. On the other hand, such a
scaling influences the estimation of the dispersion parameter, i.e., if we scale the
observation and the mean with λ > 0 we have unit deviance dp(λy, λμ) = λ^{2−p} dp(y, μ).
This influences the dispersion estimation for the cases different from the gamma
case p = 2, see, e.g., saddlepoint approximation (5.60)–(5.62). This also relates
to the different parametrizations in Sect. 5.3.8 where we study the inverse
Gaussian model p = 3, which has a dispersion ϕi = 1/αi in the reproductive
form and ϕi = 1/αi2 in parametrization (5.51).
• We only consider power variance parameters p > 1 in this section for non-
negative claim size modeling. Technically, this analysis could be extended to
p ∈ {0, 1}. We do not consider the Gaussian case p = 0 to exclude negative
claims, and we do not consider the Poisson case p = 1 because this is used for
claim counts modeling.
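The b-homogeneity with b = 2 − p can be verified numerically; a small R check (re-defining the unit deviance (11.2) locally so the snippet is self-contained):

```r
# Numerical check of the b-homogeneity dp(lambda * y, lambda * mu) =
# lambda^(2 - p) * dp(y, mu) of the unit deviance (11.2), for p > 1, p != 2.
dp <- function(y, mu, p)
  2 * (y * (y^(1 - p) - mu^(1 - p)) / (1 - p) -
         (y^(2 - p) - mu^(2 - p)) / (2 - p))

y <- 4; mu <- 2; lambda <- 10
for (p in c(2.5, 3, 3.5)) {
  stopifnot(abs(dp(lambda * y, lambda * mu, p) -
                  lambda^(2 - p) * dp(y, mu, p)) < 1e-12)
}
```

In particular, only the gamma case p = 2 (b = 0) leaves the unit deviance unchanged under a change of scale, which is why the dispersion estimation is affected for all other power variance parameters.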
We recall that unit deviances of the EDF are equal to twice the corresponding
KL divergences, which in turn are special cases of Bregman divergences. From
Theorem 4.19 we know that Bregman divergences Dψ are the only strictly
consistent loss/scoring functions for mean estimation.
Lemma 11.2 Choose p > 1. The scaled unit deviance dp(y, μ)/2 is a Bregman
divergence Dψp(y, μ) on R+ × R+ with strictly decreasing and strictly convex
function on R+

ψ_p(y) = y h_p(y) − κ_p(h_p(y)) = { y^{2−p}/((2 − p)(1 − p))   for p > 1 and p ≠ 2,
                                  { −1 − log(y)                for p = 2,

for canonical link h_p(y) = (κ_p′)^{−1}(y) = y^{1−p}/(1 − p).
Proof of Lemma 11.2 The Bregman divergence property follows from (2.29). For
p > 1 and y > 0 we have the strictly decreasing property ψ_p′(y) = h_p(y) =
y^{1−p}/(1 − p) < 0. The second derivative is ψ_p″(y) = h_p′(y) = y^{−p} = 1/V(y) > 0,
which provides the strict convexity.

In the Gaussian case we have ψ_0(y) = y^2/2, and ψ_0′(y) > 0 on R+ implies
that this is a strictly increasing convex function for positive claims y > 0. This is
different from Lemma 11.2.
Assume we have independent observations (Yi, x_i) following the same
Tweedie’s distribution, and with means given by μϑ(x_i) for some parameter ϑ.
The M-estimator of ϑ using this Bregman divergence is given by

ϑ̂ = arg max_ϑ ℓ_Y(ϑ) = arg min_ϑ Σ_{i=1}^{n} (v_i/ϕ) D_{ψp}(Y_i, μϑ(x_i)).
The corresponding Z-estimator solves the score equations

0 = −∇ϑ Σ_{i=1}^{n} (v_i/ϕ) D_{ψp}(Y_i, μϑ(x_i))
  = Σ_{i=1}^{n} (v_i/ϕ) ψ_p″(μϑ(x_i)) (Y_i − μϑ(x_i)) ∇ϑ μϑ(x_i)
  = Σ_{i=1}^{n} (v_i/ϕ) (Y_i − μϑ(x_i))/V(μϑ(x_i)) ∇ϑ μϑ(x_i)   (11.5)
  = Σ_{i=1}^{n} (v_i/ϕ) (Y_i − μϑ(x_i))/μϑ(x_i)^p ∇ϑ μϑ(x_i).
In the GLM case this exactly corresponds to (5.9). To determine the Z-estimator
from (11.5), we scale the residuals Y_i − μ_i inversely proportionally to the variances
V(μ_i) = μ_i^p of the chosen Tweedie’s distribution. It is a well-known result that
We present a proposal for deep learning under model uncertainty in this section. We
explain this on an explicit example within Tweedie’s distributions. We emphasize
that this methodology can be applied in more generality, but it is beneficial here to
have an explicit example in mind to illustrate the different phenomena.
We analyze a Swiss accident insurance claims data set. This data is illustrated in
Sect. 13.4, and an excerpt of the data is given in Listing 13.7. In total we have
339’500 claims with positive payments. We choose this data set because it ranges
from very small claims of 1 CHF to very large claims, the biggest one exceeding
1’300’000 CHF. These claims are supported by feature information such as the labor
sector, the injury type or the injured body part, see Listing 13.7 and Fig. 13.25. For
our analysis, we partition the data into a learning data set L and a test data set T .
This partition is stratified w.r.t. the claim sizes, with a ratio of 9 : 1. This
results in a learning data set L of size n = 305’550 and in a test data set T of
size 33’950.
We consider three Tweedie’s distributions with power variance parameters p ∈
{2, 2.5, 3}, the first one is the gamma model, the last one the inverse Gaussian model,
and the power variance parameter p = 2.5 gives a model in between. In a first step
we consider GLMs, this requires feature engineering. We have three categorical
features, one binary feature and two continuous ones. For the categorical and binary
features we use dummy coding, and the continuous features Age and AccQuart
are just included in their raw form. As link function g we choose the log-link, which
respects the positivity of the dual mean parameter space M, see Table 2.1, but
this is not the canonical link of the selected models. In the gamma GLM this
leads to a convex minimization problem, but in Tweedie’s GLM with p = 2.5
Table 11.1 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss
(in 10^{−2}) and inverse Gaussian (IG) loss (in 10^{−3})) and AIC values; the losses use unit dispersion
ϕ = 1, AIC relies on the MLE of ϕ

              In-sample loss on L          Out-of-sample loss on T      AIC
              d_{p=2}  d_{p=2.5}  d_{p=3}  d_{p=2}  d_{p=2.5}  d_{p=3}  value
Null model    3.0094   10.2208    4.6979   3.0240   10.2420    4.6931   4’707’115 (IG)
Gamma GLM     2.0695    7.7127    3.9582   2.1043    7.7852    3.9763   4’741’472
p = 2.5 GLM   2.0744    7.6971    3.9433   2.1079    7.7635    3.9580   4’648’698
IG GLM        2.0865    7.7069    3.9398   2.1191    7.7730    3.9541   4’653’501
and in the inverse Gaussian GLM we have non-convex minimization problems, see
Example 5.6. Therefore, we initialize Fisher’s scoring method (5.12) in the latter two
GLMs with the solution of the gamma GLM. The gamma and the inverse Gaussian
cases can directly be fitted with the R command glm [307]; for the power variance
parameter case p = 2.5 we have coded our own MLE routine using Fisher’s scoring
method.
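The gamma GLM fit with log-link can be sketched with the R command glm; the sketch below uses synthetic toy data (the Swiss accident data is not reproduced here), where the feature names Age and AccQuart mirror the text and everything else is made up for illustration:

```r
# Sketch: gamma GLM with log-link via the R command glm, on synthetic toy
# data (the Swiss accident data is not reproduced here); the feature names
# Age and AccQuart mirror the text, everything else is made up.
set.seed(1)
n   <- 2000
dat <- data.frame(Age = runif(n, 20, 60),
                  AccQuart = sample(1:4, n, replace = TRUE))
mu  <- exp(1 + 0.02 * dat$Age)                      # true log-linear means
dat$y <- rgamma(n, shape = 2, rate = 2 / mu)        # gamma claims, mean mu

fit <- glm(y ~ Age + AccQuart, data = dat, family = Gamma(link = "log"))
coef(fit)[["Age"]]   # close to the true 0.02
# Tweedie GLMs with p = 2.5 can be fitted analogously, e.g., with
# family = statmod::tweedie(var.power = 2.5, link.power = 0) (statmod
# package); the inverse Gaussian case uses inverse.gaussian(link = "log").
```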
Table 11.1 shows the in-sample losses on the learning data L and the corresponding
out-of-sample losses on the test data T . The fitted GLMs (gamma, power variance
parameter p = 2.5 and inverse Gaussian) are always evaluated on all three unit
deviances dp=2 (y, μ), dp=2.5 (y, μ) and dp=3 (y, μ), respectively. We give some
remarks. First, we observe that the in-sample loss is always minimized for the
GLM with the same power variance parameter p as the loss dp studied (2.0695,
7.6971 and 3.9398 in bold face). This result simply states that the parameter
estimates are obtained by minimizing the in-sample loss (or maximizing the
corresponding in-sample log-likelihood). Second, the minimal out-of-sample losses
are also highlighted in bold face. From these results we cannot give any preference
to a single model w.r.t. Tweedie’s forecast dominance, see Definition 4.20. Third,
we calculate the AIC values for all models. The gamma and the inverse Gaussian
cases have a closed-form solution for the normalizing term a(y; v/ϕ) in the EDF
density, and we can directly calculate AIC. The case p = 2.5 is more difficult
and we use the saddlepoint approximation of Sect. 5.5.2. Considering AIC we give
preference to Tweedie’s GLM with p = 2.5. Note that the AIC values use the
MLE for ϕ which is obtained from a general purpose optimizer, and which uses
the saddlepoint approximation in the power variance case p = 2.5. Fourth, under
a constant dispersion parameter ϕ, the mean estimation μi can be done without
explicitly specifying ϕ because it cancels in the score equations. In fact, we perform
this mean estimation in the additive form and not in the reproductive form, see (2.13)
and the discussions in Sects. 5.3.7–5.3.8.
460 11 Selected Topics in Deep Learning

Fig. 11.2 Tukey–Anscombe plots showing the deviance residuals against the logged GLM fitted means μ̂(x_i): (lhs) gamma GLM p = 2, (middle) power variance case p = 2.5, (rhs) inverse Gaussian GLM p = 3; the cyan lines show twice the estimated standard deviation of the deviance residuals as a function of the size of the logged estimated means μ̂

Figure 11.2 plots the deviance residuals (for unit dispersion) against the logged fitted means μ̂(x_i) for p ∈ {2, 2.5, 3} for 2'000 randomly selected claims; this is the Tukey–Anscombe plot. The green line has been obtained by a spline fit to the deviance residuals as a function of the fitted means μ̂(x_i), and the cyan lines give twice the estimated standard deviation of the deviance residuals as a function of the fitted means (also obtained from spline fits). This estimated
standard deviation corresponds to the square-rooted deviance dispersion estimate ϕ̂^D, see (5.30), however, in the additive form because we work with unscaled claim
size observations. A constant dispersion assumption is supported by cyan lines of
roughly constant size. In the gamma case the dispersion seems increasing in the
mean estimate, and in the inverse Gaussian case it is decreasing, thus, the power
variance parameters p = 2 and p = 3 do not support a constant dispersion in this
example. Only the choice p = 2.5 may support a constant dispersion assumption
(because it does not have an obvious trend). This says that the variance should scale
as V (μ) = μ2.5 as a function of the mean μ, see also (11.5).
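The deviance residuals behind such a Tukey–Anscombe plot can be sketched as follows (Python/numpy; a binned standard deviation stands in for the spline fits, and all names and the toy data are ours):

```python
import numpy as np

def tweedie_deviance(y, mu, p):
    # Tweedie unit deviance for p not in {0, 1, 2}
    return 2 * (y**(2-p)/((1-p)*(2-p)) - y*mu**(1-p)/(1-p) + mu**(2-p)/(2-p))

def deviance_residuals(y, mu, p):
    # signed square-rooted unit deviances (unit dispersion, v_i = 1);
    # clip tiny negative rounding errors before taking the square root
    return np.sign(y - mu) * np.sqrt(np.maximum(tweedie_deviance(y, mu, p), 0.0))

def binned_residual_sd(log_mu, res, n_bins=10):
    # crude substitute for the cyan spline fit: residual sd per bin of log(mu)
    edges = np.quantile(log_mu, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, log_mu, side="right") - 1, 0, n_bins - 1)
    return np.array([res[idx == b].std() for b in range(n_bins)])

rng = np.random.default_rng(2)
mu = np.exp(rng.normal(7.0, 0.5, size=10_000))   # toy fitted means
y = rng.gamma(shape=2.5, scale=mu / 2.5)         # toy observed claims
res = deviance_residuals(y, mu, p=2.5)
sd = binned_residual_sd(np.log(mu), res)
```

A constant dispersion assumption would then show up as a roughly flat profile of `sd` across the bins of the logged fitted means.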
Deep FN Networks
1 In the standard implementation of SGD with early stopping, the learning and validation data partition is done non-stratified. If necessary, this can be changed manually.
Table 11.2 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss (in 10⁻²) and inverse Gaussian (IG) loss (in 10⁻³)) and average claim amounts; the losses use unit dispersion ϕ = 1 and the network losses are averaged deviance losses over 20 runs with different seeds

                 In-sample loss on L            Out-of-sample loss on T        Average
                 d_{p=2}  d_{p=2.5}  d_{p=3}    d_{p=2}  d_{p=2.5}  d_{p=3}    claim
Null model       3.0094   10.2208    4.6979     3.0240   10.2420    4.6931     1'774
Gamma GLM        2.0695    7.7127    3.9582     2.1043    7.7852    3.9763     1'701
p = 2.5 GLM      2.0744    7.6971    3.9433     2.1079    7.7635    3.9580     1'652
IG GLM           2.0865    7.7069    3.9398     2.1191    7.7730    3.9541     1'614
Gamma network    1.9738    7.4556    3.8693     2.0543    7.6478    3.9211     1'748
p = 2.5 network  1.9712    7.4128    3.8458     2.0654    7.6551    3.9178     1'739
IG network       1.9977    7.4568    3.8525     2.0762    7.6682    3.9188     1'712
First, we observe that the networks outperform the GLMs, indicating that the feature engineering has not been done optimally for the GLMs. Second, in-sample we no longer receive the lowest deviance loss in the model with the same p. This comes from the fact that we exercise early stopping, and, for instance, the gamma in-sample loss of the gamma network (p = 2) of 1.9738 is bigger than the corresponding gamma loss of 1.9712 from the network with p = 2.5. Third, considering forecast dominance, preference is given either to the gamma network or to the power variance parameter p = 2.5 network. In general, it seems that fitting with higher power variance parameters leads to less stable results, but this statement needs more analysis. The disadvantage of this fitting approach is that we independently fit the models with the different power variance parameters to the observations, and, thus, the learned representations z^{(d:1)}(x_i) are rather different for different p's. This makes it difficult to compare these models. This is exactly the point that we address next.
\[
\mathcal{D}\big(\boldsymbol{Y}, (w, \beta_2, \beta_{2.5}, \beta_3)\big)
= \sum_{p \in \{2, 2.5, 3\}} \frac{\eta_p}{\varphi_p} \sum_{i=1}^{n} v_i\, d_p\big(Y_i, \mu_p(x_i)\big),
\tag{11.7}
\]
for the given observations (Yi , x i , vi ), 1 ≤ i ≤ n. Note that the unit deviances
dp live on different scales for different p’s. We use the (constant) weights ηp > 0
to balance these scales so that all power variance parameters p roughly equally
contribute to the total loss, while setting ϕp ≡ 1 (which can be done for a constant
dispersion). This approach is now fitted to the available learning data L. The
corresponding R code is given in Listing 11.1. Note that the fitting also requires that
we triplicate the observations (Yi , Yi , Yi ) so that we can simultaneously evaluate the
three chosen power variance deviance losses, see lines 18–21 of Listing 11.1. We
fit this model to the Swiss accident insurance data, and the results are presented in
Table 11.3 on the lines called ‘multi-out’.
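In pseudo-numpy terms, the simultaneous evaluation of the three deviance losses can be sketched as follows (a hedged illustration: the balancing weights η_p below are illustrative, mirroring the 10⁻² and 10⁻³ scalings of the tables, and the function names are ours, not those of Listing 11.1):

```python
import numpy as np

def unit_deviance(y, mu, p):
    if p == 2:    # gamma unit deviance
        return 2 * (np.log(mu / y) + y / mu - 1)
    if p == 3:    # inverse Gaussian unit deviance
        return (y - mu)**2 / (mu**2 * y)
    # general Tweedie case, p not in {0, 1, 2}
    return 2 * (y**(2-p)/((1-p)*(2-p)) - y*mu**(1-p)/(1-p) + mu**(2-p)/(2-p))

def combined_loss(y, mus, v, eta={2: 1.0, 2.5: 1e2, 3: 1e3}):
    # weighted total deviance over the three power variance parameters, cf. (11.7);
    # eta balances the scales so every p contributes roughly equally (phi_p = 1)
    return sum(eta[p] * np.sum(v * unit_deviance(y, mus[p], p))
               for p in (2, 2.5, 3))
```

In the actual Keras fit of Listing 11.1 this corresponds to attaching one deviance loss per output head with loss weights η_p, and triplicating the observations so that every head sees the same responses.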
Table 11.3 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss (in 10⁻²) and inverse Gaussian (IG) loss (in 10⁻³)) and average claim amounts; the losses use unit dispersion ϕ = 1 and the network losses are averaged deviance losses over 20 runs with different seeds

                              In-sample loss on L            Out-of-sample loss on T        Average
                              d_{p=2}  d_{p=2.5}  d_{p=3}    d_{p=2}  d_{p=2.5}  d_{p=3}    claim
Null model                    3.0094   10.2208    4.6979     3.0240   10.2420    4.6931     1'774
Gamma network                 1.9738    7.4556    3.8693     2.0543    7.6478    3.9211     1'748
p = 2.5 network               1.9712    7.4128    3.8458     2.0654    7.6551    3.9178     1'739
IG network                    1.9977    7.4568    3.8525     2.0762    7.6682    3.9188     1'712
Gamma multi-output (11.6)     1.9731    7.4275    3.8519     2.0581    7.6422    3.9146     1'745
p = 2.5 multi-output (11.6)   1.9736    7.4281    3.8522     2.0576    7.6407    3.9139     1'732
IG multi-output (11.6)        1.9745    7.4295    3.8525     2.0576    7.6401    3.9134     1'705
Multi-loss fitting (11.8)     1.9677    7.4118    3.8468     2.0580    7.6417    3.9144     1'744
[Fig. 11.3: comparison of the gamma, p = 2.5 and inverse Gaussian models; x-axes: logged observed claim sizes (lhs) and logged claim predictions (rhs)]
models with a smaller power variance parameter p over-fit more to large claims.
From Fig. 11.3 (lhs) we can observe that, indeed, this is the case (see gray and cyan
spline fits which bifurcate for large claims). That is, models with a smaller power
variance parameter react more sensitively to large observations Yi . The ratios in
Fig. 11.3 provide differences of up to 7% for large claims.
Remark 11.3 The loss function (11.7) can also be interpreted as regularization. For instance, if we choose η₂ = 1, and if we assume that this is our preferred model, then we can regularize this model with further models, and their weights η_p > 0 determine the degree of regularization. Thus, in contrast to the ridge and LASSO regularization of Sect. 6.2, regularization does not act directly on the model parameters here, but rather on what we learn in terms of the representation z_i = z^{(d:1)}(x_i).
\[
\mathcal{D}(\boldsymbol{Y}, \vartheta)
= \sum_{p \in \{2, 2.5, 3\}} \frac{\eta_p}{\varphi_p} \sum_{i=1}^{n} v_i\, d_p\big(Y_i, \mu(x_i)\big).
\tag{11.8}
\]
the observed data), otherwise we will receive bad convergence properties, see also
Sect. 11.1.4, below. For instance, we can robustify the Poisson claim counts model
by additionally considering the deviance loss of the negative binomial model that
also assesses over-dispersion.
Nagging Predictor
The loss figures in Table 11.3 are averaged deviance losses over 20 different runs of
the gradient descent algorithm with different seeds (to receive stable results). Rather
than averaging over the losses, we should improve the models by averaging over the
predictors and, then, calculate the losses on these averaged predictors; this is exactly
the proposal of the nagging predictor (7.44). We calculate the nagging predictor of
the models that are simultaneously fit to the different loss functions (lines ‘multi-
output’ and ‘multi-loss’ of Table 11.3). The resulting nagging predictors are reported
in Table 11.4. This table shows that we give a clear preference to the nagging
predictors. The simultaneous loss fitting (11.8) gives the best out-of-sample results
for the nagging predictor, see the last line of Table 11.4.
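The averaging effect can be sketched as follows (Python/numpy; toy predictors with multiplicative noise stand in for the 20 SGD runs, so the numbers are purely illustrative):

```python
import numpy as np

def gamma_deviance(y, mu):
    # average gamma deviance loss (unit dispersion, v_i = 1)
    return np.mean(2 * (np.log(mu / y) + y / mu - 1))

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.0, size=1000)   # toy claims

# 20 "runs" with different seeds: here, noisy toy predictors around the mean
preds = [y.mean() * np.exp(rng.normal(0.0, 0.2, size=y.size)) for _ in range(20)]

avg_of_losses = np.mean([gamma_deviance(y, mu) for mu in preds])
nagging = np.mean(preds, axis=0)                 # average the predictors, not the losses
loss_of_nagging = gamma_deviance(y, nagging)
```

On this toy example the loss of the averaged predictor comes out below the average of the individual losses, which is the pattern Table 11.4 shows for the real data.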
Figure 11.4 shows the Tukey–Anscombe plot of the multi-loss nagging predictor for
the different deviance losses (for unit dispersion). Again, the case p = 2.5 is closest
to having a constant dispersion, and the other cases will require dispersion modeling
ϕ(x).
Figure 11.5 shows the empirical auto-calibration property of the multi-loss nagging
predictor. This auto-calibration property is calculated as in Listing 7.8. We observe
that the auto-calibration property holds rather accurately. Only for claim predictors μ̂(x_i) above 10'000 CHF (vertical dotted line in Fig. 11.5) do the fitted means under-estimate the observed average claim sizes. This affects (only) 1.7% of all claims and it could be corrected as described in Example 7.19.
Table 11.4 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss (in 10⁻²) and inverse Gaussian (IG) loss (in 10⁻³)) and average claim amounts; the losses use unit dispersion ϕ = 1

                              In-sample loss on L            Out-of-sample loss on T        Average
                              d_{p=2}  d_{p=2.5}  d_{p=3}    d_{p=2}  d_{p=2.5}  d_{p=3}    claim
Null model                    3.0094   10.2208    4.6979     3.0240   10.2420    4.6931     1'774
Gamma multi-output (11.6)     1.9731    7.4275    3.8519     2.0581    7.6422    3.9146     1'745
p = 2.5 multi-output (11.6)   1.9736    7.4281    3.8522     2.0576    7.6407    3.9139     1'732
IG multi-output (11.6)        1.9745    7.4295    3.8525     2.0576    7.6401    3.9134     1'705
Multi-loss fitting (11.8)     1.9677    7.4118    3.8468     2.0580    7.6417    3.9144     1'744
Gamma multi-out & nagging     1.9486    7.3616    3.8202     2.0275    7.5575    3.8864     1'745
p = 2.5 multi-out & nagging   1.9496    7.3640    3.8311     2.0276    7.5578    3.8864     1'732
IG multi-out & nagging        1.9510    7.3666    3.8320     2.0281    7.5583    3.8865     1'705
Multi-loss with nagging       1.9407    7.3403    3.8236     2.0244    7.5490    3.8837     1'744
Fig. 11.4 Tukey–Anscombe plots giving the deviance residuals of the multi-loss nagging predic-
tor of Table 11.4 for different power variance parameters: (lhs) gamma deviances p = 2, (middle)
power variance deviances p = 2.5, (rhs) inverse Gaussian deviances p = 3; the cyan lines show
twice the estimated standard deviation of the deviance residuals as a function of the size of the
logged estimated means μ
From the Tukey–Anscombe plots in Fig. 11.4 we conclude that the dispersion requires regression modeling, too, as the dispersion does not seem to be constant over the whole range of the expected claim sizes. We therefore explore a double FN network model; in spirit this is similar to the double GLM of Sect. 5.5. We assume to work within Tweedie's family with power variance parameters p ≥ 2, and with unit deviances given by (11.2)–(11.3). The saddlepoint approximation (5.59) gives us
\[
f(y; \theta, v/\varphi) \approx \left( \frac{2\pi\varphi}{v}\, V(y) \right)^{-1/2}
\exp\left\{ - \frac{1}{2\varphi/v}\, d_p(y, \mu) \right\},
\]
[Fig. 11.5: auto-calibration of the claim size predictor; the blue curve shows the empirical density of the multi-loss nagging predictor μ̂(x_i); x-axis: fitted means (log scale)]
with mean μ_p = μ v/ϕ of X = Y v/ϕ. We set φ = −1/ϕ^{p−1} < 0. This gives us the approximation
\[
\ell_X(\mu_p, \phi) \approx \frac{v^{p-1}\, d_p(X, \mu_p)\, \phi - \big( -\log(-\phi) \big)}{2}
- \frac{1}{2} \log\!\left( \frac{2\pi}{v^{p-1}}\, V(X) \right).
\tag{11.9}
\]
For given mean μ_p we again have a gamma approximation on the right-hand side, but we scale the dispersion differently. This gives us the approximate first moment
\[
\mathbb{E}_\phi\Big[ v^{p-1} d_p(X, \mu_p) \,\Big|\, \mu_p \Big] \approx \kappa_2'(\phi) = -\frac{1}{\phi} = \varphi^{p-1} \overset{\text{def.}}{=} \varphi_p.
\]
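This first-moment identity can be checked by simulation in the gamma case p = 2 with v = 1 (a sketch with illustrative values; recall the approximation is accurate for small dispersion):

```python
import numpy as np

def gamma_deviance(y, mu):
    # gamma unit deviance d_2
    return 2 * (np.log(mu / y) + y / mu - 1)

rng = np.random.default_rng(11)
mu_true, phi_true = 2000.0, 0.2
alpha = 1.0 / phi_true                               # gamma shape parameter
y = rng.gamma(shape=alpha, scale=mu_true / alpha, size=200_000)

d = gamma_deviance(y, mu_true)   # unit deviances at the (true) mean
phi_hat = d.mean()               # their average estimates the dispersion phi
```

The average unit deviance `phi_hat` lands close to the true dispersion 0.2, which is exactly what the dispersion network exploits by fitting a gamma model to the unit deviances.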
\[
\mathcal{D}^*\big(d(\boldsymbol{X}, \widehat{\mu}^{(t)}), (w^*, \alpha_2, \alpha_{2.5}, \alpha_3)\big)
= \sum_{p \in \{2, 2.5, 3\}} \frac{\eta_p}{2} \sum_{i=1}^{n}
d_2\Big( v_i^{p-1}\, d_p\big(X_i, \widehat{\mu}_p^{(t)}(x_i)\big),\; \varphi_p(x_i) \Big),
\tag{11.11}
\]
\[
\mathcal{D}\big(\boldsymbol{X}, \widehat{\varphi}^{(t)}, (w, \beta_2, \beta_{2.5}, \beta_3)\big)
= \sum_{p \in \{2, 2.5, 3\}} \eta_p \sum_{i=1}^{n}
\frac{v_i^{p-1}}{\widehat{\varphi}_p^{(t)}(x_i)}\, d_p\big(X_i, \mu_p(x_i)\big).
\tag{11.12}
\]
We fit this model by iterating this approach for t ≥ 1: we start from the predictors of Sect. 11.1.2 providing us with the first mean estimates μ̂_p^{(1)}(x_i). Based on these mean estimates we iterate this robustified estimation of ϕ̂_p^{(t)}(x_i) and μ̂_p^{(t)}(x_i). We give some remarks:
1. We use the robustified versions (11.11) and (11.12), respectively, where we
simultaneously fit all power variance parameters p = 2, 2.5, 3 on the commonly
learned representations zi = z(d:1) (x i ) in the last FN layer of the mean and the
dispersion network, respectively.
2. For both FN networks of mean μ and dispersion ϕ modeling we use the same network architecture of depth d = 3 having (q₁, q₂, q₃) = (20, 15, 10) neurons in the FN layers, the hyperbolic tangent activation function, and the log-link for the output. These two networks only differ in their network parameters (w, β₂, β₂.₅, β₃) and (w*, α₂, α₂.₅, α₃), respectively.
3. For fitting we use the nadam version of SGD. For the early stopping we use a
training data U to validation data V split of 8 : 2.
4. To ensure consistency within the individual SGD runs across t ≥ 1, we use the
learned network parameter of loop t as initial value for loop t + 1. This ensures
monotonicity across the iterations in the log-likelihood and the loss function,
respectively, up to the fact that the random mini-batches in SGD may distort this
monotonicity.
5. To reduce the elements of randomness in SGD fitting we run this iteration procedure 20 times with different seeds, and we output the nagging predictors μ̂_p^{(t)}(x_i) and ϕ̂_p^{(t)}(x_i) averaged over the 20 runs for every t in Table 11.5.
We iterate this algorithm over two loops, and the results are presented in Table 11.5. We observe a decrease of −2ℓ_X(μ̂_p^{(t)}, ϕ̂_p^{(t)}) by iterating the fitting algorithm for t ≥ 1. For AIC, we would have to correct twice the negative log-likelihood by twice
This then allows us to give the Tukey–Anscombe plots for the three considered
power variance parameters.
The corresponding plots are given in Fig. 11.6; the difference to Fig. 11.4 is that the latter considers unit dispersion whereas the former scales the residuals with the square-rooted dispersion √ϕ̂_p(x_i); note that v_i ≡ 1 in this example. By scaling with the square-rooted dispersion the resulting deviance residuals r_i^D should roughly have unit standard deviation. From Fig. 11.6 we observe that indeed this is the case, the cyan
Fig. 11.6 Tukey–Anscombe plots giving the dispersion scaled deviance residuals r_i^D (11.13) of the models jointly fitting the mean parameters μ̂_p(x_i) and the dispersion parameters ϕ̂_p(x_i): (lhs) gamma model, (middle) power variance parameter p = 2.5 model, and (rhs) inverse Gaussian model; the cyan lines correspond to 2 standard deviations
Fig. 11.7 (lhs) Gamma model: observations vs. simulations on log-scale, (middle) gamma model: estimated shape parameters α̂_t† = 1/ϕ̂₂(x_t†) < 1, 1 ≤ t ≤ T, and (rhs) inverse Gaussian model: observations vs. simulations on log-scale
line shows a spline fit of twice the standard deviation of the deviance residuals riD .
These splines are of magnitude 2 which verifies the unit standard deviation property.
Moreover, the cyan lines are roughly horizontal which indicates that the dispersion
estimation and the scaling works across all expected claim sizes μp (x i ). The three
different power variance parameters p = 2, 2.5, 3 show different behaviors in the
lower and upper tails in the residuals (centering around the orange horizontal zero
line in Fig. 11.6) which corresponds to the different distributional properties of the
chosen models.
We further analyze the gamma and the inverse Gaussian models. Note that the analysis of the power variance models for general power variance parameters p ∉ {0, 1, 2, 3} is more difficult because neither the EDF density nor the EDF distribution function have a closed form. To analyze the gamma and the inverse Gaussian models
we simulate observations Xtsim , t = 1, . . . , T , from the estimated models (using the
out-of-sample features x †t of the test data T ), and we compare them against the
true out-of-sample observations Xt† . Figure 11.7 shows the results for the gamma
model (lhs) and the inverse Gaussian model (rhs) on the log-scale. A good fit has
been achieved if the black dots lie on the red diagonal line (in the colored version), because then the simulated data shares similar features with the observed data. The fit of the inverse Gaussian model seems reasonably good.
On the other hand, we see that the gamma model gives a poor fit, especially
in the lower tail. This supports the AIC values of Table 11.5. The problem with
the gamma model is that the data is more heavy-tailed than the gamma model can
accomplish. As a consequence, the dispersion parameter estimates ϕ̂₂(x_t†) in the gamma model are compensating for this by taking values bigger than 1. A dispersion parameter bigger than 1 implies a shape parameter in the gamma model of α̂_t† = 1/ϕ̂₂(x_t†) < 1, and the resulting gamma density is strictly decreasing, see Fig. 2.1. If
we simulate from this model we receive many observations Xtsim close to zero (from
the strictly decreasing density). This can be seen from the lower-left part of the graph
in Fig. 11.7 (lhs), suggesting that we have many observations with Xt† ∈ (0, 1), or on
the log-scale log(Xt† ) < 0. However, the graph shows that this is not the case in the
real data. Figure 11.7 (middle) shows the boxplot of the estimated shape parameters
αt† on the test data, 1 ≤ t ≤ T , verifying that most insurance policies of the test data
T receive a shape parameter αt† less than 1.
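This effect is easy to reproduce by simulation (a Python sketch with illustrative parameter values, not the ones fitted to the data):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, phi = 1500.0, 1.8            # a dispersion estimate above 1 (illustrative)
alpha = 1.0 / phi                # implied gamma shape parameter < 1

sim_small_shape = rng.gamma(shape=alpha, scale=mu / alpha, size=100_000)
sim_large_shape = rng.gamma(shape=2.0, scale=mu / 2.0, size=100_000)  # shape >= 1

# shape < 1 gives a strictly decreasing density, hence many claims close to zero
frac_tiny_small = np.mean(sim_small_shape < 1.0)
frac_tiny_large = np.mean(sim_large_shape < 1.0)
```

Both samples have the same mean, yet the shape < 1 sample puts a visible fraction of claims into (0, 1), exactly the lower-left cloud seen in Fig. 11.7 (lhs).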
We conclude that the inverse Gaussian double FN network model seems to work
well for this data, and we give preference to this model.
\[
Y_i = \mu_{\zeta_0}(x_i) + \varepsilon_i,
\tag{11.14}
\]
\[
\widehat{\zeta}_n^{\,\mathrm{PMLE}} = \underset{\zeta \in \Upsilon}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} d\big(Y_i, \mu_\zeta(x_i)\big),
\tag{11.15}
\]
where h = (κ′)^{−1} is the canonical link of the pre-chosen EDF, and with the change of variable ζ → θ = θ(ζ) = h(μ_ζ(x)) ∈ Θ, for given feature x, having Jacobian
\[
J(\zeta; x) = \left( \frac{\partial}{\partial \zeta_k}\, h(\mu_\zeta(x)) \right)_{1 \le k \le r}
= \frac{1}{\kappa''\big(h(\mu_\zeta(x))\big)}\, \nabla_\zeta\, \mu_\zeta(x)^\top \in \mathbb{R}^{1 \times r}.
\]
Proof We set τ²(x) = κ″(h(μ_ζ(x))). We have J(ζ; x) = ∇_ζ μ_ζ(x)^⊤ τ^{−2}(x). The following matrix is positive semi-definite and it satisfies
\[
\begin{aligned}
&\mathbb{E}_x\Big[ \big( I^*(\zeta)^{-1} J(\zeta;x)^\top - H(\zeta) J(\zeta;x)^\top \tau^2(x)\,\sigma^{-2}(x) \big)\, \sigma^2(x)\,
\big( I^*(\zeta)^{-1} J(\zeta;x)^\top - H(\zeta) J(\zeta;x)^\top \tau^2(x)\,\sigma^{-2}(x) \big)^{\!\top} \Big] \\
&\quad = I^*(\zeta)^{-1}\, \mathbb{E}_x\big[ J(\zeta;x)^\top \sigma^2(x)\, J(\zeta;x) \big]\, I^*(\zeta)^{-1}
- H(\zeta)\, I^*(\zeta)\, I^*(\zeta)^{-1} - I^*(\zeta)^{-1}\, I^*(\zeta)\, H(\zeta)^\top + H(\zeta)\, H(\zeta)^{-1}\, H(\zeta)^\top
\end{aligned}
\]
for a regression function x ↦ s²_{α₀}(x) involving the (true) regression parameter α₀ and exposures v_i > 0. If we choose a fixed EDF, we have the log-likelihood function
\[
(\mu, \varphi) \mapsto \ell_Y(\mu, \varphi; v) = \frac{v}{\varphi}\, \big[ Y h(\mu) - \kappa(h(\mu)) \big] + a(Y; v/\varphi).
\]
Equating the variance structure of the true data model with the variance in this pre-specified EDF, we obtain the feature-dependent dispersion parameter
\[
\varphi(x_i) = \frac{s^2_{\alpha_0}(x_i)}{V\big(\mu_{\zeta_0}(x_i)\big)},
\tag{11.16}
\]
This justifies the approach(es) in the previous chapters and sections, though not fully, because we neither work with the MLE in FN networks nor do we care about identifiability in parameters. Nevertheless, this short section suggests finding strongly consistent estimators ζ̂_n and α̂_n for ζ₀ and α₀. This gives us a first model calibration step that allows us to specify the dispersion structure x ↦ ϕ(x) via (11.16). Using this dispersion structure and the deviance loss function (4.9) for a variable dispersion parameter ϕ(x), the QPMLE is obtained in a second step, where we replace the likelihood maximization by the deviance loss minimization:
\[
\widehat{\zeta}_n^{\,\mathrm{QPMLE}} = \underset{\zeta \in \Upsilon}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n}
\frac{v_i}{s^2_{\widehat{\alpha}_n}(x_i) / V\big(\mu_{\widehat{\zeta}_n}(x_i)\big)}\; d\big(Y_i, \mu_\zeta(x_i)\big).
\]
This QPMLE is best asymptotically normal, thus, asymptotically optimal within the
EDF. There might still be better estimators for ζ0 , but these are outside the EDF.
\[
\frac{1}{n} \sum_{i=1}^{n} v_i\, \frac{V\big(\mu_{\widehat{\zeta}_n}(x_i)\big)}{s^2_{\widehat{\alpha}_n}(x_i)}\;
\frac{Y_i - \mu_\zeta(x_i)}{V\big(\mu_\zeta(x_i)\big)}\; \nabla_\zeta\, \mu_\zeta(x_i) \overset{!}{=} 0.
\]
Thus, it all boils down to finding the right variance structure to obtain the optimal asymptotic behavior.
The previous statements hold true under the following technical assumptions.
These are taken from Appendix 1 of Gourieroux et al. [167], and they are an adapted
version of the ones in Burguete et al. [61].
Assumption 11.9
(i) μ_ζ(x) and d(y, μ_ζ(x)) are continuous w.r.t. all variables and twice continuously differentiable in ζ;
(ii) Υ ⊂ R^r is a compact set and the true parameter ζ₀ is in the interior of Υ;
(iii) almost every realization of (ε_i, x_i) is a Cesàro sum generator w.r.t. the probability measure p_{ε,x}(ε, x) = p_ε(ε|x) p_x(x) and to a dominating function b(ε, x);
(iv) the sequence (x_i)_i is a Cesàro sum generator w.r.t. p_x and b̄(x) = ∫_R b(ε, x) dp_ε(ε|x);
(v) for each x ∈ {1} × R^q, there exists a neighborhood N_x ⊂ {1} × R^q such that ∫_R sup_{x′∈N_x} b(ε, x′) dp_ε(ε|x) < ∞;
(vi) the functions d(Y, μ_ζ(x)), ∂d(Y, μ_ζ(x))/∂ζ_k, ∂²d(Y, μ_ζ(x))/∂ζ_k∂ζ_l are dominated by b(ε, x).
In this section we present a way of assessing the irreducible risk which does not
require a sophisticated model evaluation of distributional assumptions. Quantile
regression is increasingly used in the machine learning community because it is
a robust way of quantifying the irreducible risk; we refer to Meinshausen [270], Takeuchi et al. [350] and Richman [314]. We recall that quantiles are elicitable
having the pinball loss as a strictly consistent loss function, see Theorem 5.33.
We define a FN network regression model that allows us to directly estimate the
quantiles based on the pinball loss. We therefore use an adapted version of the
R code of Listing 9 in Richman [314], this adapted version has been proposed in
Fissler et al. [130] to ensure that different quantiles respect monotonicity. For any two quantile levels 0 < τ₁ < τ₂ < 1 we have the monotonicity requirement
\[
F_{Y|x}^{-1}(\tau_1) \le F_{Y|x}^{-1}(\tau_2).
\tag{11.17}
\]
The deep FN network quantile regression model takes the form
\[
x \mapsto F_{Y|x}^{-1}(\tau) = g^{-1}\big\langle \beta_\tau, z^{(d:1)}(x) \big\rangle,
\tag{11.18}
\]
for a strictly monotone and smooth link function g, output parameter β_τ ∈ R^{q_d+1}, and where x ↦ z^{(d:1)}(x) is a deep network. We add a lower index Y|x to the generalized inverse F^{−1}_{Y|x} to highlight that we consider the conditional distribution of Y, given feature x ∈ X. In the case of a deep FN network, (11.18) involves a network parameter ϑ = (w₁^{(1)}, …, w_{q_d}^{(d)}, β_τ) that needs to be estimated. Of course, the deep network architecture x ↦ z^{(d:1)}(x) could also involve any other feature, such as CN or LSTM layers, embedding layers or a NLP text recognition
feature. This would change the network architecture, but it would not change
anything from a methodological viewpoint.
To estimate this regression parameter ϑ from independent data (Y_i, x_i), 1 ≤ i ≤ n, we consider the objective function
\[
\vartheta \mapsto \sum_{i=1}^{n} L_\tau\Big( Y_i,\; g^{-1}\big\langle \beta_\tau, z^{(d:1)}(x_i) \big\rangle \Big),
\]
with the strictly consistent pinball loss function Lτ for the τ -quantile. Alternatively,
we could choose any other loss function satisfying Theorem 5.33, and we may try
to find the asymptotically optimal one (similarly to Theorem 11.8). We refrain from
doing so, but we mention Komunjer–Vuong [222]. Fitting the network parameter
ϑ is then done in complete analogy to finding an optimal network parameter for
network mean modeling. The only change is that we replace the deviance loss
function by the pinball loss, e.g., in Listing 7.3 we have to exchange the loss function
on line 5 correspondingly.
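As a quick sanity check on this loss choice, the average pinball loss over constant predictions is minimized at the empirical τ-quantile (Python/numpy sketch with toy lognormal claims; all names are ours):

```python
import numpy as np

def pinball_loss(y, q, tau):
    # strictly consistent loss function for the tau-quantile
    return np.where(y >= q, tau * (y - q), (1 - tau) * (q - y))

rng = np.random.default_rng(3)
y = rng.lognormal(mean=7.0, sigma=1.0, size=10_000)   # toy positive claims

tau = 0.75
grid = np.quantile(y, np.linspace(0.5, 0.95, 91))     # candidate constant predictions
avg_losses = np.array([pinball_loss(y, q, tau).mean() for q in grid])
best_q = grid[int(np.argmin(avg_losses))]             # lands at the 75% quantile
```

In the network fit, this function simply replaces the deviance loss, with one pinball loss per quantile output.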
We now turn our attention to the multiple quantile case that should satisfy the
monotonicity requirement (11.17) for any quantile levels 0 < τ1 < τ2 < 1.
A separate deep quantile estimation for both quantile levels, as described in the
previous section, may violate the monotonicity property, at least, in some part of
the feature space X , especially if the two quantile levels are close. Therefore, we
enforce the monotonicity by a special choice of the network architecture.
For simplicity, in the remainder of this section, we assume that the response Y is positive, a.s. This implies for the quantiles τ ↦ F^{−1}_{Y|x}(τ) ≥ 0, and we should choose a link function with g^{−1} ≥ 0 in (11.18). To ensure the monotonicity (11.17) for the quantile levels 0 < τ₁ < τ₂ < 1, we choose a second positive link function with g₊^{−1} ≥ 0, and we set for multi-task forecasting
\[
x \mapsto \Big( F_{Y|x}^{-1}(\tau_1),\; F_{Y|x}^{-1}(\tau_2) \Big)
= \Big( g^{-1}\big\langle \beta_{\tau_1}, z^{(d:1)}(x) \big\rangle,\;
g^{-1}\big\langle \beta_{\tau_1}, z^{(d:1)}(x) \big\rangle + g_+^{-1}\big\langle \beta_{\tau_2}, z^{(d:1)}(x) \big\rangle \Big) \in \mathbb{R}_+^2,
\tag{11.19}
\]
for a regression parameter ϑ = (w₁^{(1)}, …, w_{q_d}^{(d)}, β_{τ₁}, β_{τ₂}). The positivity g₊^{−1} ≥ 0
enforces the monotonicity in the two quantiles. We call (11.19) an additive approach
as we start from a base level characterized by the smaller quantile FY−1 |x (τ1 ), and any
bigger quantile is modeled by an additive increment. To ensure monotonicity for
multiple quantiles we proceed recursively by choosing the lowest quantile as the
initial base level.
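The additive recursion can be sketched in numpy, with an exponential base level and softplus increments as hypothetical choices for g⁻¹ and g₊⁻¹ (this mimics, but is not, Listing 11.3):

```python
import numpy as np

def softplus(x):
    # numerically stable softplus log(1 + exp(x)) >= 0
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def monotone_quantiles(scores):
    """Additive approach: positive base quantile exp(s_1), positive increments on top.
    scores: array of shape (n, k) of raw network outputs for k quantile levels."""
    q = np.empty_like(scores)
    q[:, 0] = np.exp(scores[:, 0])          # lowest quantile as the base level
    for j in range(1, scores.shape[1]):
        q[:, j] = q[:, j - 1] + softplus(scores[:, j])
    return q

rng = np.random.default_rng(5)
q = monotone_quantiles(rng.normal(size=(100, 5)))   # e.g. 5 quantile levels
```

By construction every row of `q` is positive and non-decreasing, whatever the raw network outputs are, so the monotonicity (11.17) holds on the whole feature space.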
11.2 Deep Quantile Regression 479
We can also consider the upper quantile as the base level by multiplicatively lowering this upper quantile. Choose the (sigmoid) function g_σ^{−1} ∈ (0, 1) and set for the multiplicative approach
\[
x \mapsto \Big( F_{Y|x}^{-1}(\tau_1),\; F_{Y|x}^{-1}(\tau_2) \Big)
= \Big( g_\sigma^{-1}\big\langle \beta_{\tau_1}, z^{(d:1)}(x) \big\rangle\, g^{-1}\big\langle \beta_{\tau_2}, z^{(d:1)}(x) \big\rangle,\;
g^{-1}\big\langle \beta_{\tau_2}, z^{(d:1)}(x) \big\rangle \Big) \in \mathbb{R}_+^2.
\tag{11.20}
\]
If we just use a classical SGD fitting algorithm, we will likely end up in a situation where the monotonicity is violated in some part of the feature space. Kellner et al. [211] consider this problem. They add a penalization (regularization term) that punishes, during SGD training, network parameters that violate the monotonicity. Such a penalization can be constructed, e.g., with the ReLU function.
We revisit the Swiss accident insurance data of Sect. 11.1.2, and we provide an
example of a deep quantile regression using both the additive approach (11.19) and
the multiplicative approach (11.20).
We select 5 different quantile levels Q = (τ₁, τ₂, τ₃, τ₄, τ₅) = (10%, 25%, 50%, 75%, 90%). We start with the additive approach (11.19). It requires setting τ₁ = 10% as the base level, and the remaining quantile levels are modeled additively in a recursive way for τ_j < τ_{j+1}, 1 ≤ j ≤ 4. The corresponding R code is given on lines 8–20 of Listing 11.3, and this compiles to the 5-dimensional output on line 22.
For the multiplicative approach (11.20) we set τ5 = 90% as the base level, and the
remaining quantile levels are received multiplicatively in a recursive way for τj +1 >
τj , 4 ≥ j ≥ 1, see Listing 11.4. The additive and the multiplicative approaches take
the extreme quantiles as initialization. One may also be interested in initializing the
model in the median τ3 = 50%, the smaller quantiles can then be received by the
multiplicative approach and the bigger quantiles by the additive approach. We also
explore this case and we call it the mixed approach.
These network architectures are fitted to the data using the pinball loss (5.81) for the quantile levels of Q; note that the pinball loss requires the assumption of a finite first moment. Listing 11.5 shows the choice of the pinball loss functions. We then fit the three architectures (additive, multiplicative and mixed) to our learning data L, and we apply early stopping to prevent over-fitting. Moreover, we consider the nagging predictor over 20 runs with different seeds to reduce the randomness coming from SGD fitting.
In Table 11.6 we give the out-of-sample pinball losses on the test data T of the three
considered approaches, and illustrating the 5 quantile levels of Q. The losses of the
three approaches are rather close, giving a slight preference to the mixed approach,
but the other two approaches seem to be competitive, too. We further analyze these
quantile regression models by considering the empirical coverage ratios defined by
\[
\widehat{\tau}_j = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\Big\{ Y_t^\dagger \le \widehat{F}^{-1}_{Y|x_t^\dagger}(\tau_j) \Big\},
\tag{11.22}
\]
where F̂^{−1}_{Y|x_t†}(τ_j) is the estimated quantile for level τ_j and feature x_t†. Note that the coverage ratios (11.22) correspond to the identification functions that are essentially the derivatives of the pinball losses; we refer to Dimitriadis et al. [106]. Table 11.7
reports these out-of-sample coverage ratios on the test data T . From these results
we conclude that on the portfolio level the quantiles are matched rather well.
In Fig. 11.8 we illustrate the estimated out-of-sample quantiles F̂^{−1}_{Y|x_t†}(τ_j) for individual claims on the quantile levels τ_j ∈ {10%, 25%, 50%, 75%, 90%} (cyan, blue, black, blue, cyan colors) using the mixed approach. The x-axis considers the logged estimated medians F̂^{−1}_{Y|x_t†}(50%). We observe heteroskedasticity resulting in quantiles that are not ordered w.r.t. the median (black line). This supports the multiple deep quantile regression model because we cannot (simply) extrapolate the median to receive the other quantiles.
In the final step we compare the estimated quantiles F̂^{−1}_{Y|x}(τ_j) from the mixed deep quantile regression approach to the ones that can be calculated from the fitted inverse Gaussian model using the double FN network approach of Example 11.4. In the latter model we estimate the mean μ̂(x) and the dispersion ϕ̂(x) with two FN networks, which then allow us to calculate the quantiles using the inverse Gaussian distributional assumption. Note that we cannot calculate the quantiles in Tweedie's family with power variance parameter p = 2.5 because there is no
Table 11.6 Out-of-sample pinball losses of quantile regressions using the additive, the multiplicative and the mixed approaches; nagging predictors over 20 different seeds

                          Out-of-sample losses on T
                          10%      25%      50%      75%      90%
Additive approach         171.20   412.78   765.60   988.78   936.31
Multiplicative approach   171.18   412.87   766.04   988.59   936.57
Mixed approach            171.15   412.55   764.60   988.15   935.50
Fig. 11.8 Estimated out-of-sample quantiles F̂^{−1}_{Y|x_t†}(τ_j) of 2'000 randomly selected individual claims on the quantile levels τ_j ∈ {10%, 25%, 50%, 75%, 90%} (cyan, blue, black, blue, cyan colors) using the mixed approach; the red dots are the out-of-sample observations Y_t†; the x-axis gives the logged estimated median, corresponding to the black diagonal line
closed form of the distribution function. Figure 11.9 compares the two approaches
on the quantile levels of Q. Overall we observe a reasonably good match though it is
not perfect. The small quantiles for level τ1 = 10% seem slightly under-estimated
by the inverse Gaussian approach (see Fig. 11.9 (top-left)), whereas big quantiles
τ4 = 75% and τ5 = 90% seem more conservative in the inverse Gaussian approach
(see Fig. 11.9 (bottom)). This may indicate that the inverse Gaussian distribution
does not fully fit the data, i.e., that one cannot fully recover the true quantiles
from the mean $\mu(x)$, the dispersion $\varphi(x)$ and an inverse Gaussian assumption.
There are two ways to further explore these issues. One can either choose other
distributional assumptions that may better match the properties of the data, which
further explores the distributional approach. Alternatively, Theorem 5.33 allows us
to choose loss functions different from the pinball loss, i.e., one could consider
different increasing functions G in that theorem to further explore the distribution-
free approach. In general, any increasing choice of the function G leads to a strictly
consistent quantile estimation (this is an asymptotic statement), but these choices
may have different finite sample properties. Following Komunjer–Vuong [222], we
can determine asymptotically efficient choices for G. This would require feature
dependent choices Gx i (y) = FY |x i (y), where FY |x i is the (true) distribution of
Yi , conditionally given x i . This requires the knowledge of the true distribution,
and Komunjer–Vuong [222] derive asymptotic efficiency when replacing this true
11.3 Deep Composite Model Regression 483
Fig. 11.9 Inverse Gaussian quantiles (log-scale, y-axis) vs. deep quantile regression estimates (log-scale, x-axis) of 2'000 randomly selected claims on the quantile levels of Q = (10%, 25%, 50%, 75%, 90%)
In the previous examples we have seen that the distributional models may misesti-
mate the true tail of the data because model fitting often pays more attention to an
accurate model fit in the main body of the data. An idea is to directly estimate this
tail in a distribution-free way by considering the (upper) CTE

$$\mathrm{CTE}^+_\tau(Y|x) = E\left[ Y \,\middle|\, Y > F^{-1}_{Y|x}(\tau),\, x \right], \qquad (11.23)$$

for a given quantile level τ ∈ (0, 1). The problem with (11.23) is that this is not an
elicitable quantity, i.e., there is no loss/scoring function that is strictly consistent for
the CTE functional.
If the distribution function $F_{Y|x}$ is continuous, we can rewrite the upper CTE as
follows, see Lemma 2.16 in McNeil et al. [268] and (11.35) below,

$$\mathrm{CTE}^+_\tau(Y|x) = \mathrm{ES}^+_\tau(Y|x) = \frac{1}{1-\tau} \int_\tau^1 F^{-1}_{Y|x}(p)\, dp \;\ge\; F^{-1}_{Y|x}(\tau). \qquad (11.24)$$
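A quick numerical sanity check of (11.24) (a hedged Python sketch of ours, not from the book): for the unit exponential distribution the upper ES has the closed form $F^{-1}(\tau) + 1$, which a Riemann-sum average of the quantiles above level $\tau$ reproduces:

```python
import math

def quantile_exp(p):
    # inverse distribution function of the unit exponential
    return -math.log(1.0 - p)

tau = 0.9
# ES^+_tau = 1/(1-tau) * integral_tau^1 F^{-1}(p) dp, via a midpoint Riemann sum
n = 100000
es = sum(quantile_exp(tau + (1 - tau) * (k + 0.5) / n) for k in range(n)) / n

# for the unit exponential the upper ES equals F^{-1}(tau) + 1
assert abs(es - (quantile_exp(tau) + 1.0)) < 1e-3
assert es >= quantile_exp(tau)  # the ES dominates the quantile, as in (11.24)
```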
This second object $\mathrm{ES}^+_\tau(Y|x)$ is called the upper expected shortfall (ES) of Y, given
x, on the security level τ. Fissler–Ziegel [131] and Fissler et al. [132] have proved
that $\mathrm{ES}^+_\tau(Y|x)$ is jointly elicitable with the τ-quantile $F^{-1}_{Y|x}(\tau)$. That is, there is a
strictly consistent bivariate loss function that allows one to jointly estimate the τ-
quantile and the corresponding ES. In fact, Corollary 5.5 of Fissler–Ziegel [131]
gives the full characterization of the strictly consistent bivariate loss functions for
the joint elicitability of the τ-quantile and the ES; note that Fissler–Ziegel [131]
use a different sign convention. This result is used in Guillén et al. [175] for the
joint estimation of the quantile and the ES within a GLM. Guillén et al. [175] use a
two-step approach to fit the quantile and the ES.
Fissler et al. [130] extend the results of Fissler–Ziegel [131], allowing for the
joint estimation of the composite triplet consisting of the lower ES, the τ -quantile
and the upper ES. This gives us a composite model that has the τ -quantile as splicing
point. The beauty of this approach is that we can fit (in one step) a deep learning
model to the upper and the lower ES, and perform a (potentially different) regression
in both parts of the distribution. The lower CTE and the lower ES are defined by,
respectively,

$$\mathrm{CTE}^-_\tau(Y|x) = E\left[ Y \,\middle|\, Y \le F^{-1}_{Y|x}(\tau),\, x \right],$$

and

$$\mathrm{ES}^-_\tau(Y|x) = \frac{1}{\tau} \int_0^\tau F^{-1}_{Y|x}(p)\, dp \;\le\; F^{-1}_{Y|x}(\tau).$$
for y, a ∈ R and for τ ∈ (0, 1). These auxiliary functions consider only the part
of the pinball loss (5.81) that depends on the action a, and the pinball loss can be
recovered from them. Therefore, all three functions provide strictly consistent scoring
functions for the τ-quantile, but only the pinball loss satisfies the calibration property
(L0) on page 92.
For the following theorem we recall the general definition of the τ -quantile
Qτ (FY |x ) of a distribution function FY |x , see (5.82).
Theorem 11.11 (Theorem 2.8 of Fissler et al. [130], Without Proof) Choose τ ∈
(0, 1) and let F contain only distributions with a finite first moment, and being
supported in the interval C ⊆ R. The loss function $L : C \times C^3 \to \mathbb{R}_+$ of the form

$$L(y; e^-, q, e^+) = \big(G(y) - G(q)\big)\left(\tau - \mathbb{1}_{\{y \le q\}}\right) \qquad (11.26)$$
$$\qquad + \left\langle \nabla\varphi(e^-, e^+), \begin{pmatrix} e^- + \frac{1}{\tau}\, S^-_\tau(y, q) \\[4pt] e^+ - \frac{1}{1-\tau}\, S^+_\tau(y, q) \end{pmatrix} \right\rangle - \varphi(e^-, e^+) + \varphi(y, y),$$

is strictly consistent for the composite triplet of the lower ES, the τ-quantile and the
upper ES if $\varphi$ is strictly convex, if the function

$$q \mapsto G_{e^-, e^+}(q) = G(q) + \frac{1}{\tau}\, \frac{\partial \varphi}{\partial e^-}(e^-, e^+)\, q - \frac{1}{1-\tau}\, \frac{\partial \varphi}{\partial e^+}(e^-, e^+)\, q \qquad (11.27)$$

is strictly increasing, and if $E_F[|G(Y)|] < \infty$, $E_F[|\varphi(Y, Y)|] < \infty$ for all Y ∼
F ∈ F.
This opens the door for regression modeling of CTEs for continuous distribution
functions $F_{Y|x}$, x ∈ X. Namely, we can choose a regression function $\xi_\vartheta$ with a
three-dimensional output

$$x \in \mathcal{X} \mapsto \xi_\vartheta(x) \in C^3,$$

and estimate its parameter by

$$\widehat{\vartheta} = \underset{\vartheta}{\arg\min}\; \frac{1}{n} \sum_{i=1}^n L\left(Y_i;\, \xi_\vartheta(x_i)\right), \qquad (11.28)$$

with loss function L given by (11.26). This then provides us with the estimates for
the composite triplet

$$x \mapsto \xi_{\widehat{\vartheta}}(x) = \left( \widehat{\mathrm{ES}}{}^-_\tau(Y|x),\; \widehat{F}^{-1}_{Y|x}(\tau),\; \widehat{\mathrm{ES}}{}^+_\tau(Y|x) \right).$$
There remains the choice of the functions G and $\varphi$, such that $\varphi$ is strictly convex
and $G_{e^-, e^+}$, defined in (11.27), is strictly increasing. Section 2.3 in Fissler et
al. [130] discusses possible choices. A simple choice is to select the identity function
G(y) = y (which gives the pinball loss on the first line of (11.26)) and the additively
separable function

$$\varphi(e^-, e^+) = \psi_1(e^-) + \psi_2(e^+),$$

with $\psi_1$ and $\psi_2$ strictly convex and with (sub-)gradients $\psi_1' > 0$ and $\psi_2' < 0$.
Inserting this choice into (11.26) provides the loss function

$$L(y; e^-, q, e^+) = \left( 1 + \frac{\psi_1'(e^-)}{\tau} + \frac{-\psi_2'(e^+)}{1-\tau} \right) L_\tau(y, q) + D_{\psi_1}(y, e^-) + D_{\psi_2}(y, e^+), \qquad (11.29)$$

where $L_\tau(y, q)$ is the pinball loss (5.81) and $D_{\psi_1}$ and $D_{\psi_2}$ are Bregman diver-
gences (2.28). There remain the choices of $\psi_1$ and $\psi_2$, which should be strictly
convex, the first one being strictly increasing and the second one being strictly
decreasing.
We restrict ourselves to strictly convex functions ψ on the positive real line $\mathbb{R}_+$,
i.e., for positive claims Y > 0, a.s. For b ∈ R, we consider the following functions
on $\mathbb{R}_+$

$$\psi^{(b)}(y) = \begin{cases} \dfrac{1}{b(b-1)}\, y^b & \text{for } b \ne 0 \text{ and } b \ne 1, \\[6pt] -1 - \log(y) & \text{for } b = 0, \\[4pt] y \log(y) - y & \text{for } b = 1. \end{cases} \qquad (11.30)$$
We compute the first and second derivatives. These are for y > 0 given by

$$\frac{\partial}{\partial y}\psi^{(b)}(y) = \begin{cases} \dfrac{1}{b-1}\, y^{b-1} & \text{for } b \ne 1, \\[4pt] \log(y) & \text{for } b = 1, \end{cases} \qquad \text{and} \qquad \frac{\partial^2}{\partial y^2}\psi^{(b)}(y) = y^{b-2} > 0.$$
Thus, for any b ∈ R we have a convex function, and this convex function is
decreasing on R+ for b < 1 and increasing for b > 1. Therefore, we have to select
b > 1 for ψ1 and b < 1 for ψ2 to get suitable choices in (11.29). Interestingly,
these choices correspond to Lemma 11.2 with power variance parameters p =
2 − b, i.e., they provide us with Bregman divergences from Tweedie’s distributions.
However, (11.30) is more general, because it allows us to select any b ∈ R,
whereas for power variance parameters p ∈ (0, 1) there do not exist any Tweedie’s
distributions, see Theorem 2.18.
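To make the link between (11.30) and Bregman divergences concrete, here is a small Python sketch (ours, not the book's): the Bregman divergence $D_\psi(y, e) = \psi(y) - \psi(e) - \psi'(e)(y - e)$ of $\psi^{(2)}$ recovers half the squared (Gaussian) distance, and convexity guarantees non-negativity for any b, e.g., b = −1, which corresponds to the power variance parameter p = 2 − b = 3:

```python
import math

def psi(y, b):
    """Convex function psi^{(b)} on the positive real line, see (11.30)."""
    if b == 0:
        return -1.0 - math.log(y)
    if b == 1:
        return y * math.log(y) - y
    return y ** b / (b * (b - 1))

def psi_prime(y, b):
    # first derivative of psi^{(b)}
    return math.log(y) if b == 1 else y ** (b - 1) / (b - 1)

def bregman(y, e, b):
    """Bregman divergence D_psi(y, e) = psi(y) - psi(e) - psi'(e) (y - e)."""
    return psi(y, b) - psi(e, b) - psi_prime(e, b) * (y - e)

# b = 2 recovers half the squared (Gaussian) distance ...
assert abs(bregman(3.0, 1.0, 2) - 0.5 * (3.0 - 1.0) ** 2) < 1e-12
# ... and convexity makes the divergence non-negative, e.g. for b = -1 (p = 3)
assert bregman(3.0, 1.0, -1) >= 0.0
assert bregman(1.0, 1.0, -1) == 0.0
```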
In view of Lemma 11.2 and using the fact that unit deviances $d_p$ are Bregman
divergences, we select a power variance parameter p = 2 − b > 1 for $\psi_2$ and we
select the Gaussian model p = 2 − b = 0 for $\psi_1$. This gives us the special choice
for the loss function (11.29) for strictly positive claims Y > 0, a.s.,

$$L(y; e^-, q, e^+) = \left( 1 + \frac{\eta_1\, e^-}{\tau} + \frac{\eta_2\, (e^+)^{1-p}}{(1-\tau)(p-1)} \right) L_\tau(y, q) + \frac{\eta_1}{2}\, d_0(y, e^-) + \frac{\eta_2}{2}\, d_p(y, e^+), \qquad (11.31)$$

with the Gaussian unit deviance $d_0(y, e^-) = (y - e^-)^2$ and Tweedie's unit deviance
$d_p$ with power variance parameter p > 1, see Sect. 11.1.1. The additional constants
$\eta_1, \eta_2 > 0$ are used to balance the contributions of the individual terms to the total
loss. Typically, we choose p ≥ 2 for the upper ES reflecting claim size models.
This choice for $\psi_2$ implies that the residuals are weighted inversely proportionally
to the corresponding variances $\mu^p$ within Tweedie's family, see (11.5). Using
this loss function (11.31) in (11.28) allows us to estimate the composite triplet
$(\mathrm{ES}^-_\tau(Y|x),\, F^{-1}_{Y|x}(\tau),\, \mathrm{ES}^+_\tau(Y|x))$ with a strictly consistent loss function.
The estimated composite triplet should respect the ordering $\mathrm{ES}^-_\tau(Y|x) \le
F^{-1}_{Y|x}(\tau) \le \mathrm{ES}^+_\tau(Y|x)$. We set for the regression function in the additive approach
for multi-task learning

$$x \mapsto \left( \mathrm{ES}^-_\tau(Y|x),\; F^{-1}_{Y|x}(\tau),\; \mathrm{ES}^+_\tau(Y|x) \right)$$
$$= \Big( g^{-1}\big\langle \beta_1, z^{(d:1)}(x) \big\rangle,\; g^{-1}\big\langle \beta_1, z^{(d:1)}(x) \big\rangle + g_+^{-1}\big\langle \beta_2, z^{(d:1)}(x) \big\rangle, \qquad (11.32)$$
$$\qquad g^{-1}\big\langle \beta_1, z^{(d:1)}(x) \big\rangle + g_+^{-1}\big\langle \beta_2, z^{(d:1)}(x) \big\rangle + g_+^{-1}\big\langle \beta_3, z^{(d:1)}(x) \big\rangle \Big) \in \mathcal{A},$$

for link functions g and $g_+$ with $g_+^{-1} \ge 0$, deep FN network $z^{(d:1)}: \mathbb{R}^{q_0+1} \to
\mathbb{R}^{q_d+1}$, regression parameters $\beta_1, \beta_2, \beta_3 \in \mathbb{R}^{q_d+1}$, and with the action space
$\mathcal{A} = \{(e^-, q, e^+) \in \mathbb{R}^3_+;\; e^- \le q \le e^+\}$ for positive claims. We also refer to
Remark 11.10 for a different way of modeling the monotonicity.
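The additive structure (11.32) with non-negative inverse links enforces the required ordering by construction; a minimal Python sketch, where the arguments are hypothetical last-layer network outputs (not fitted values) and the inverse links are chosen as the exponential function:

```python
import math

def triplet_from_network_outputs(a1, a2, a3):
    """Additive structure (11.32): with exponential inverse links, the quantile
    and the upper ES are obtained by adding positive increments, so that
    ES^- <= quantile <= ES^+ holds automatically."""
    e_minus = math.exp(a1)          # lower ES
    q = e_minus + math.exp(a2)      # tau-quantile
    e_plus = q + math.exp(a3)       # upper ES
    return e_minus, q, e_plus

e_minus, q, e_plus = triplet_from_network_outputs(0.3, -1.2, 0.7)
assert 0.0 < e_minus <= q <= e_plus  # the action space A of (11.32)
```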
Fitting this model is similar to the multiple deep quantile regression presented
in Listings 11.3 and 11.5. There is one important difference though. Namely, we
do not have multiple outputs and multiple loss functions, but we have a three-
dimensional output with a single loss function (11.31) simultaneously evaluating all
three components of the output (11.32). Listing 11.6 gives this loss for the inverse
Gaussian case p = 3 in (11.31).
We revisit the Swiss accident insurance data of Sect. 11.2.3. We again use a FN
network of depth d = 3 with (q1 , q2 , q3 ) = (20, 15, 10) neurons, hyperbolic
tangent activation, two-dimensional embedding layers for the categorical features,
exponential output activations for $g^{-1}$ and $g_+^{-1}$, and the additive structure (11.32).
We implement the loss function (11.31) for quantile level τ = 90% and with power
variance parameter p = 3, see Listing 11.6. This implies that for the upper ES
estimation we scale residuals with V (μ) = μ3 , see (11.5). We then run an initial
calibration of this FN network. Based on this initial calibration we can calculate
the three loss contributions in (11.31) coming from the composite triplet. Based on
these figures we choose the constants η1 , η2 > 0 in (11.31) so that all three terms
of the composite triplet contribute equally to the total loss. For the remainder of our
calibration we hold on to these choices of η1 and η2 .
We calibrate this deep FN architecture to the learning data L, using the strictly
consistent loss function (11.31) for the composite triplet (ES− −1
90% (Y |x), FY |x (90%),
ES+90% (Y |x)), and to reduce the randomness in prediction we average over 20 early
stopped SGD calibrations with different seeds (nagging predictor).
Fig. 11.10 Estimated lower ES $\widehat{\mathrm{ES}}{}^-_{90\%}(Y|x_t^\dagger)$ and upper ES $\widehat{\mathrm{ES}}{}^+_{90\%}(Y|x_t^\dagger)$ of the deep composite regression against the estimated 90%-quantiles $\hat F^{-1}_{Y|x_t^\dagger}(90\%)$ (log-scale); cyan lines: spline fits, orange line: diagonal
Figure 11.10 shows the estimated lower and upper ES against the corresponding
90%-quantile estimates for 2'000 randomly selected insurance claims $x_t^\dagger$. The
diagonal orange line shows the estimated 90%-quantiles $\hat F^{-1}_{Y|x_t^\dagger}(90\%)$, and the cyan
lines give spline fits to the estimated lower and upper ES. It is clearly visible that
these respect the ordering

$$\widehat{\mathrm{ES}}{}^-_{90\%}(Y|x_t^\dagger) \;\le\; \hat F^{-1}_{Y|x_t^\dagger}(90\%) \;\le\; \widehat{\mathrm{ES}}{}^+_{90\%}(Y|x_t^\dagger),$$
and we can back-test the estimates with the empirical identifications

$$\widehat{v}^- = \frac{1}{T} \sum_{t=1}^T \left[ \widehat{\mathrm{ES}}{}^-_\tau(Y|x_t^\dagger) - \frac{ Y_t^\dagger\, \mathbb{1}_{\{Y_t^\dagger \le \hat F^{-1}_{Y|x_t^\dagger}(\tau)\}} + \hat F^{-1}_{Y|x_t^\dagger}(\tau) \left( \tau - \mathbb{1}_{\{Y_t^\dagger \le \hat F^{-1}_{Y|x_t^\dagger}(\tau)\}} \right) }{\tau} \right], \qquad (11.33)$$

and

$$\widehat{v}^+ = \frac{1}{T} \sum_{t=1}^T \left[ \widehat{\mathrm{ES}}{}^+_\tau(Y|x_t^\dagger) - \frac{ Y_t^\dagger\, \mathbb{1}_{\{Y_t^\dagger > \hat F^{-1}_{Y|x_t^\dagger}(\tau)\}} + \hat F^{-1}_{Y|x_t^\dagger}(\tau) \left( \mathbb{1}_{\{Y_t^\dagger \le \hat F^{-1}_{Y|x_t^\dagger}(\tau)\}} - \tau \right) }{1-\tau} \right]. \qquad (11.34)$$

These (empirical) identifications should be close to zero if the model fits the data.
Remark that the latter terms in (11.33)–(11.34) describe the lower and upper
ES also in the case of non-continuous distribution functions because we have the
identity

$$\mathrm{ES}^-_\tau(Y|x) = \frac{1}{\tau} \left( E\left[ Y\, \mathbb{1}_{\{Y \le F^{-1}_{Y|x}(\tau)\}} \,\middle|\, x \right] + F^{-1}_{Y|x}(\tau) \left( \tau - F_{Y|x}\big(F^{-1}_{Y|x}(\tau)\big) \right) \right), \qquad (11.35)$$

the second term being zero for a continuous distribution $F_{Y|x}$, but it is needed for
non-continuous distribution functions.
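The identification (11.34) can be checked numerically; the following Python sketch (ours, not from the book) uses a quasi-sample from the unit exponential distribution, for which the true quantile and upper ES are known in closed form, and verifies that $\hat v^+ \approx 0$ under the true model:

```python
import math

tau = 0.9
q = -math.log(1 - tau)        # true tau-quantile of the unit exponential
es_plus = q + 1.0             # true upper ES of the unit exponential

# quasi-sample from the unit exponential via the inverse transform
T = 10000
ys = [-math.log(1 - (t - 0.5) / T) for t in range(1, T + 1)]

# empirical counterpart of the upper ES, as in (11.34)
emp = sum(y * (1.0 if y > q else 0.0) + q * ((1.0 if y <= q else 0.0) - tau)
          for y in ys) / T / (1 - tau)

v_plus = es_plus - emp
assert abs(v_plus) < 1e-2  # close to zero: the (true) model fits the data
```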
We compare the deep composite regression results of this section to the deep
gamma and inverse Gaussian models using a double FN network for dispersion
modeling, see Sect. 11.1.3. This requires to calculate the ES in the gamma and the
inverse Gaussian models. This can be done within the EDF, see Landsman–Valdez
The upper ES in the gamma model Y ∼ Γ(α, β) is given by, see (6.47),

$$E\left[ Y \,\middle|\, Y > F_Y^{-1}(\tau) \right] = \frac{\alpha}{\beta} \left( \frac{1 - \mathcal{G}\big(\alpha + 1,\, \beta F_Y^{-1}(\tau)\big)}{1-\tau} \right),$$

where $\mathcal{G}$ is the scaled incomplete gamma function (6.48) and $F_Y^{-1}(\tau)$ is the τ-
quantile of Γ(α, β).
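This closed form can be verified numerically; a hedged Python sketch of ours, assuming $\mathcal{G}$ is the regularized lower incomplete gamma function, which for $\alpha + 1 = 3$ has the elementary form $1 - \mathcal{G}(3, x) = e^{-x}(1 + x + x^2/2)$:

```python
import math

alpha, beta, tau = 2.0, 1.0, 0.90

# tau-quantile of Gamma(alpha=2, rate beta=1): solve e^{-q}(1+q) = 1 - tau
lo, hi = 0.0, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    if math.exp(-mid) * (1 + mid) > 1 - tau:
        lo = mid
    else:
        hi = mid
q = (lo + hi) / 2

# closed-form upper ES: (alpha/beta) * (1 - G(alpha+1, beta*q)) / (1 - tau)
es_formula = (alpha / beta) * math.exp(-q) * (1 + q + q * q / 2) / (1 - tau)

# independent check: midpoint integration of E[Y | Y > q], density f(y) = y e^{-y}
n, upper = 200000, 60.0
h = (upper - q) / n
num = sum((q + (k + 0.5) * h) ** 2 * math.exp(-(q + (k + 0.5) * h))
          for k in range(n)) * h
es_numeric = num / (1 - tau)
assert abs(es_formula - es_numeric) < 1e-3
```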
Example 4.3 of Landsman–Valdez [233] gives the inverse Gaussian case (2.8)
with α, β > 0

$$E\left[ Y \,\middle|\, Y > F_Y^{-1}(\tau) \right] = \frac{\alpha}{\beta} \left( 1 + \sqrt{1/\alpha}\; \frac{F_Y^{-1}(\tau)\, \varphi\big(z_\tau^{(1)}\big)}{1-\tau} \right)$$
$$\qquad + \frac{\alpha}{\beta}\, \frac{\sqrt{1/\alpha}}{1-\tau}\; e^{2\alpha\beta} \left( 2\sqrt{\alpha}\, \Phi\big(-z_\tau^{(2)}\big) - F_Y^{-1}(\tau)\, \varphi\big(-z_\tau^{(2)}\big) \right),$$

where φ and Φ are the standard Gaussian density and distribution, respectively,
$F_Y^{-1}(\tau)$ is the τ-quantile of the inverse Gaussian distribution and

$$z_\tau^{(1)} = \frac{\alpha}{\sqrt{F_Y^{-1}(\tau)}} \left( \frac{F_Y^{-1}(\tau)}{\alpha/\beta} - 1 \right) \qquad \text{and} \qquad z_\tau^{(2)} = \frac{\alpha}{\sqrt{F_Y^{-1}(\tau)}} \left( \frac{F_Y^{-1}(\tau)}{\alpha/\beta} + 1 \right).$$
Fig. 11.11 Estimated means of the deep double inverse Gaussian model and the deep composite model

The deep composite model and the double network with inverse Gaussian claim sizes are
comparably accurate (out-of-sample) in determining the lower and upper 90% ES.
Finally, we paste the lower and upper ES from the deep composite regression
model according to (11.25). This gives us an estimated mean (under a continuous
distribution function)

$$\widehat{\mu}(x) = \widehat{E}[Y|x] = \tau\, \widehat{\mathrm{ES}}{}^-_\tau(Y|x) + (1-\tau)\, \widehat{\mathrm{ES}}{}^+_\tau(Y|x).$$
Figure 11.11 compares these estimates of the deep composite regression model
to the deep double inverse Gaussian model estimates. The black dots show 2’000
randomly selected claims $x_t^\dagger$, and the cyan line gives a spline fit to all out-of-sample
claims in T. The body of the estimates is rather similar in both approaches, but the
deep composite approach provides more large estimates; the dotted orange lines
show the maximum estimate from the deep double inverse Gaussian model.
We conclude that in the case where no member of the EDF reflects the properties
of the data in the tail, the deep composite regression approach presented in this
section provides an alternative method for mean estimation that allows for separate
models in the main body and the tail of the data. Fixing the quantile level allows
for a straightforward fitting in one step; this is in contrast to the composite models
where we fix the splicing point. The latter approaches are more difficult to fit,
e.g., requiring the EM algorithm.
Table 11.9 Out-of-sample losses (gamma loss, power variance case p = 2.5 loss (in 10⁻²) and inverse Gaussian (IG) loss (in 10⁻³)) and average claim amounts; the losses use unit dispersion ϕ = 1

                                        Out-of-sample loss on T           Average
                                        d_{p=2}   d_{p=2.5}   d_{p=3}     claim
  Null model                            4.6979    10.2420     4.6931      1'774
  Gamma multi-output of Table 11.3      2.0581     7.6422     3.9146      1'745
  p = 2.5 multi-output of Table 11.3    2.0576     7.6407     3.9139      1'732
  IG multi-output of Table 11.3         2.0576     7.6401     3.9134      1'705
  Gamma multi-output: nagging 100       2.0280     7.5582     3.8864      1'752
  p = 2.5 multi-output: nagging 100     2.0282     7.5586     3.8865      1'739
  IG multi-output: nagging 100          2.0286     7.5592     3.8865      1'711
  Gamma multi-output: bootstrap 100     2.0189     7.5301     3.8745      1'803
  p = 2.5 multi-output: bootstrap 100   2.0191     7.5305     3.8746      1'790
  IG multi-output: bootstrap 100        2.0194     7.5309     3.8746      1'756
bootstrap samples. We then also average over these 100 predictors obtained from
the different bootstrap samples. Table 11.9 provides the resulting out-of-sample
deviance losses on the test data T . We always hold on to the same test data T
which is disjoint/independent from the learning data L and the bootstrap samples
L∗ = L∗(s) , 1 ≤ s ≤ 100.
The nagging predictors over 100 seeds are roughly the same as over 20 seeds
(see Table 11.3), which indicates that 20 different network fits suffice, here.
Interestingly, the average bootstrapped version generally improves the nagging
predictors. Thus, here the average bootstrap predictor provides a better balance
among the observations, resulting in superior predictive power on the test data T;
compare the lines 'nagging 100' vs. 'bootstrap 100' of Table 11.9.
The main purpose of this analysis is to understand the volatility involved in nagging
and bootstrap predictors. We therefore consider the coefficients of variation $\widehat{\mathrm{CoV}}_t$
introduced in (7.43) on individual policies 1 ≤ t ≤ T. Figure 11.12 shows these
coefficients of variation on the individual predictors, i.e., for the individual claims
x †t and the individual network calibrations with different seeds. The left-hand side
gives the coefficients of variation based on 100 bootstrap samples, the right-hand
side gives the coefficients of variation of 100 predictors fitted on the same data L
but with different seeds for the SGD algorithm; the y-scale is identical in both plots.
We observe that the coefficients of variation are clearly higher under the bootstrap
approach compared to holding on to the same data L for SGD fitting with different
seeds. Thus, the nagging predictor averages over the randomness in different seeds
for network calibrations, whereas bootstrapping additionally considers possible
different samples L∗ for model learning. We analyze the difference in magnitudes
in more detail.
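The coefficient of variation of an individual predictor is simply the empirical standard deviation of the S predictions for policy t divided by their empirical mean, cf. (7.43). A minimal Python sketch with hypothetical numbers (not the data underlying Fig. 11.12):

```python
import statistics

def coefficient_of_variation(predictions):
    """CoV of the predictions mu_hat^{(1)}, ..., mu_hat^{(S)} for one policy:
    empirical standard deviation divided by the empirical mean, cf. (7.43)."""
    return statistics.stdev(predictions) / statistics.fmean(predictions)

# hypothetical predictions for one policy from S = 4 seeds vs. 4 bootstrap fits
seed_preds = [100.0, 104.0, 98.0, 102.0]
boot_preds = [100.0, 112.0, 90.0, 106.0]

# bootstrapping adds sampling variability on top of the SGD seed variability
assert coefficient_of_variation(boot_preds) > coefficient_of_variation(seed_preds)
```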
Figure 11.13 compares the two coefficients of variation for different claim sizes. The
average coefficient of variation for fixed observations L is 15.9% (cyan columns).
This average coefficient of variation is increased to 24.8% under bootstrapping
Fig. 11.12 Coefficients of variation in individual estimators (lhs) bootstrap 100, and (rhs) nagging 100; x-axis: estimated claim size; legend: 1 std.dev. and cubic spline; the y-scale is identical in both plots
Fig. 11.13 Coefficients of variation in individual predictors of the bootstrap and the nagging approaches (ordered w.r.t. estimated claim sizes); legend: bootstrap, nagging, relative increase (right axis)
(orange columns). The blue line shows the average relative increase for the different
claim sizes (right axis), and the blue dotted line is at a relative increase of 40%. From
Fig. 11.13 we observe that this spread (relative increase) is rather constant across all
claim predictions; we remark that 93.5% of all claim predictions are below 5’000.
Thus, most claims are at the left end of Fig. 11.13.
From this small analysis we conclude that there is substantial model and
estimation uncertainty involved; recall that we fit the deep network architecture to
305'550 individual claims having 7 feature components, which is a comparably large
portfolio. On average, we have a coefficient of variation of 15% implied by SGD
11.5 LocalGLMnet: An Interpretable Network Architecture 495
fitting with different seeds, and this coefficient of variation increases to roughly
25% when we additionally bootstrap the observations. This is considerable, and
it requires that we ensemble these predictors to receive more robust predictions.
The results of Table 11.9 support this re-sampling and ensembling approach as we
receive a better out-of-sample performance.
Network architectures are often criticized for not being (sufficiently) explainable.
Of course, this is not fully true as we have gained a lot of insight about the
data examples studied in this book. This criticism of non-explainability has led to
the development of the post-hoc model-agnostic tools studied in Sect. 7.6. This
approach has been questioned in many places, and it is not clear whether one
should try to explain black box models, or whether one should rather try to make
the models interpretable in the first place, see, e.g., Rudin [322]. In this section
we take this different approach by working with a network architecture that is
(more) interpretable. We present the LocalGLMnet proposal of Richman–Wüthrich
[317, 318]. This approach allows for interpreting the results, and it allows for
variable selection either using an empirical Wald test or LASSO regularization.
There are several other proposals that try to achieve similar explainability in
specific network architectures. There is the explainable neural network of Vaughan
et al. [367] and the neural additive model of Agarwal et al. [3]. These proposals
rely on parallel networks considering one single variable at a time. Of course,
this limits their performance because the potential for interactions is missing. This has
been improved in the Combined Actuarial eXplainable Neural Network (CAXNN)
approach of Richman [314], which requires a manual specification of parallel
networks for potential interactions. The LocalGLMnet, proposed in this section,
does not require any manual engineering, and it still possesses the universal
approximation property.
$$x \mapsto g(\mu(x)) = \beta_0 + \langle \beta, x \rangle = \beta_0 + \sum_{j=1}^q \beta_j\, x_j, \qquad (11.36)$$

The LocalGLMnet replaces the fixed regression parameter by a feature-dependent
regression attention

$$\beta: \mathbb{R}^q \to \mathbb{R}^q, \qquad x \mapsto \beta(x) \overset{\text{def.}}{=} z^{(d:1)}(x) = z^{(d)} \circ \cdots \circ z^{(1)}(x).$$

This gives the LocalGLMnet regression function

$$x \mapsto g(\mu(x)) = \beta_0 + \langle \beta(x), x \rangle = \beta_0 + \sum_{j=1}^q \beta_j(x)\, x_j,$$

whose j-th term reads

$$x \mapsto \beta_j(x)\, x_j. \qquad (11.37)$$
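A minimal Python sketch of the LocalGLMnet structure (11.36)–(11.37), with toy weights instead of a fitted model (the book's implementation uses R/keras):

```python
import math

def attention_network(x):
    """Toy stand-in for the FN network z^{(d:1)}: maps x in R^q to regression
    attentions beta(x) in R^q (hypothetical weights, only to show the shape)."""
    return [math.tanh(0.5 * xj) + 0.2 for xj in x]

def local_glm_net(x, beta0=0.1, g_inv=math.exp):
    # LocalGLMnet regression function: g(mu(x)) = beta0 + <beta(x), x>
    beta = attention_network(x)
    return g_inv(beta0 + sum(bj * xj for bj, xj in zip(beta, x)))

mu = local_glm_net([0.4, -1.0, 0.0])
assert mu > 0.0  # log-link: predictions are positive
# if beta(x) were constant in x, the model would collapse to a GLM
```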
(1) If βj (x) ≡ 0, we should drop the term βj (x)xj from the regression function.
(2) If βj (x) ≡ βj (= 0) is not feature dependent (and different from zero), we
receive a GLM term in xj with regression parameter βj .
11.5 LocalGLMnet: An Interpretable Network Architecture 497
(3) Property βj(x) = βj(xj) implies that we have a term βj(xj)xj that does not
interact with any other component x_{j'}, j' ≠ j.
(4) Sensitivities of βj(x) in the components of x can be obtained by the gradient

$$\nabla_x \beta_j(x) = \left( \frac{\partial}{\partial x_1} \beta_j(x), \ldots, \frac{\partial}{\partial x_q} \beta_j(x) \right)^\top \in \mathbb{R}^q. \qquad (11.38)$$
(5) There is a potential identifiability issue because a regression attention could
encode a different feature component, e.g.,

$$\beta_j(x)\, x_j = x_{j'}. \qquad (11.39)$$
Therefore, we talk about terms in items (1)–(4); e.g., item (1) means that the
term βj(x)xj can be dropped, however, the feature component xj may still
play a significant role in some of the regression attentions $\beta_{j'}(x)$, j' ≠ j.
In practical applications we have not experienced identifiability issue (11.39).
Since the LocalGLMnet regression structure already contains the linear terms
and the SGD fitting is initialized in the GLM, the regression functions are
quite pre-determined, and the LocalGLMnet is built around this initialization,
hardly falling into a completely different model (11.39).
(6) The LocalGLMnet architecture has the universal approximation property dis-
cussed in Sect. 7.2.2, because networks can approximate any continuous
function arbitrarily well on a compact support for sufficiently large networks.
We can then select one component, say, $x_1$, and let $\beta_1(x) = z_1^{(d:1)}(x)$
approximate a given continuous function $f(x)/x_1$, i.e., $f(x) \approx \beta_1(x)\, x_1$
arbitrarily well on the compact support.
The LocalGLMnet allows for variable selection through the regression attentions
βj(x). Roughly speaking, if the estimated regression attention $\hat\beta_j(x) \approx 0$, then the
term βj(x)xj can be dropped. We can also explore whether the entire variable xj
should be dropped (not only the corresponding term βj(x)xj). For this, we have to
refit the LocalGLMnet excluding the feature component xj. If the out-of-sample
performance on validation data does not change, then xj also does not play an
important role in any other regression attention $\beta_{j'}(x)$, j' ≠ j, and it should be
completely dropped from the model.
In GLMs we can either use the Wald test or the LRT to test a null hypothesis H0 :
βj = 0, see Sect. 5.3. We explore a similar idea in this section, however, empirically.
We therefore first need to ensure that all feature components live on the same scale.
We consider standardization with the empirical mean and the empirical standard
deviation, see (7.30), and from now on we assume that all feature components are
centered and have unit variance. Then, the main problem is to determine whether an
estimated regression attention $\hat\beta_j(x)$ is significantly different from 0 or not.
We therefore extend the features x + = (x1 , . . . , xq , xq+1 ) ∈ Rq+1 by an addi-
tional independent and purely random component xq+1 that is also standardized.
Since this additional component is independent of all other components it cannot
have any predictive power for the response under consideration, thus, fitting this
extended model should result in a regression attention $\hat\beta_{q+1}(x^+) \approx 0$. The estimate
will not be exactly zero, because there is noise involved, and the magnitude of this
fluctuation will determine the rejection/acceptance region of the null hypothesis of
not being significant.
We fit the LocalGLMnet to the learning data L with features $x_i^+ \in \mathbb{R}^{q+1}$.
In the resulting interval $I_\alpha$ of (11.41), $\Phi^{-1}(p)$ denotes the standard Gaussian
quantile for p ∈ (0, 1). H0 should be rejected if the coverage ratio of this centered
interval $I_\alpha$ is substantially smaller than 1 − α, i.e.,

$$\frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{\hat\beta_{q+1}(x_i^+) \in I_\alpha\}} < 1 - \alpha.$$
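The empirical coverage test can be sketched as follows (a Python sketch of ours with hypothetical attention values; the threshold 3.29 corresponds to α = 0.1%):

```python
def coverage_ratio(attentions, s_hat, z=3.29):
    """Fraction of estimated attentions inside the centered interval
    I_alpha = [-z * s_hat, z * s_hat], cf. (11.41); z = |Phi^{-1}(alpha/2)|."""
    lo, hi = -z * s_hat, z * s_hat
    return sum(1 for b in attentions if lo <= b <= hi) / len(attentions)

# hypothetical attentions of a purely random (noise) component: small, centered
noise_attn = [0.01, -0.03, 0.02, 0.00, -0.01]
# hypothetical attentions of an informative component: large in absolute value
signal_attn = [0.40, -0.35, 0.55, 0.30, -0.50]

s_hat = 0.05
assert coverage_ratio(noise_attn, s_hat) == 1.0   # do not reject H0: drop the term
assert coverage_ratio(signal_attn, s_hat) < 0.9   # reject H0: keep the term
```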
permute one of the feature components $x_{i,j}$ across the entire portfolio 1 ≤ i ≤ n.
Usually, the resulting empirical standard deviations $\hat s_{q+1}$ are rather similar.
We revisit the French MTPL data example. We compare the LocalGLMnet approach
to the deep FN network considered in Sect. 7.3.2, and we benchmark with the results
of Table 7.3; we benchmark with the crudest FN network from above because, at
the current stage, we need one-hot encoding for the LocalGLMnet approach. The
analysis in this section is the same as in Richman–Wüthrich [317].
The French MTPL data has 6 continuous feature components (we treat Area as
a continuous variable), 1 binary component and 2 categorical components. We pre-
process the continuous and binary variables to be centered with unit variance using
standardization (7.30). This will allow us to do variable selection as presented
in (11.41). The categorical variables with more than two levels are more difficult.
In a first attempt we use one-hot encoding for the categorical variables. We prefer
one-hot encoding over dummy coding because this ensures that for all levels there
is a component xj with xj = 0. This is important because the terms βj (x)xj are
equal to zero for the reference level in dummy coding (since xj = 0). This does
not allow us to study interactions with other variables for the term corresponding to
the reference level. Remark that one-hot encoding and dummy coding do not lead
to centering and unit variance.
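The difference between the two encodings can be sketched in a few lines of Python (hypothetical levels, not the MTPL categories):

```python
def one_hot(level, levels):
    """One-hot encoding: every level gets its own indicator (no reference level)."""
    return [1.0 if level == l else 0.0 for l in levels]

def dummy(level, levels):
    """Dummy coding: the first level is the reference and encodes to all zeros."""
    return [1.0 if level == l else 0.0 for l in levels[1:]]

levels = ["B1", "B2", "B3"]
assert one_hot("B1", levels) == [1.0, 0.0, 0.0]
# dummy coding: the reference level makes every term beta_j(x) x_j vanish
assert dummy("B1", levels) == [0.0, 0.0]
# one-hot encoding: exactly one component is non-zero for every level
assert sum(one_hot("B2", levels)) == 1.0
```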
This feature pre-processing gives us a feature vector x ∈ Rq of dimension
q = 40. For variable selection of the continuous and binary components we extend
the feature x by two additional independent components xq+1 and xq+2 . We select
two components to explore whether the particular distributional choice has some
influence on the choice of the acceptance/rejection interval Iα in (11.41). We choose
for policies 1 ≤ i ≤ n

$$x_{i,q+1} \overset{\text{i.i.d.}}{\sim} \text{Uniform}\left(-\sqrt{3}, \sqrt{3}\right) \qquad \text{and} \qquad x_{i,q+2} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1),$$

these two sets of variables being mutually independent, and being inde-
pendent from all other variables. We define the extended features $x_i^+ =
(x_{i,1}, \ldots, x_{i,q}, x_{i,q+1}, x_{i,q+2}) \in \mathbb{R}^{q_0}$ with $q_0 = q + 2$, and we consider the
LocalGLMnet regression function
LocalGLMnet regression function
$$x^+ \mapsto \log \mu(x^+) = \beta_0 + \sum_{j=1}^{q_0} \beta_j(x^+)\, x_j.$$
We choose the log-link for Poisson claim frequency modeling. The time exposure
v > 0 can either be integrated as a weight to the EDF or as an offset on the canonical
scale resulting in the same Poisson model, see Sect. 5.2.3.
$$x^+ \mapsto \log \mu(x^+) = \alpha_0 + \alpha_1 \sum_{j=1}^{q_0} \beta_j(x^+)\, x_j,$$
Table 11.10 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10⁻²) and in-sample average frequency of the Poisson regressions, see also Table 7.3

                                            Run    #        In-sample   Out-of-sample   Aver.
                                            time   param.   loss on L   loss on T       freq.
  Poisson null                              –      1        25.213      25.445          7.36%
  Poisson GLM3                              15s    50       24.084      24.102          7.36%
  One-hot FN (q1, q2, q3) = (20, 15, 10)    51s    1'306    23.757      23.885          6.96%
  LocalGLMnet on x⁺                         20s    1'799    23.728      23.945          7.46%
  LocalGLMnet on x⁺ bias regularized        –      –        23.727      23.943          7.36%
stopping, here. The same applies if we add too many purely random components
xq+l , l ≥ 1. Since the balance property will not hold, in general, we apply the bias
regularization step (7.33) to adjust α0 and α1 , the results are presented on the last
line of Table 11.10; in Remark 3.1 of Richman–Wüthrich [317] a more sophisticated
balance property correction is presented. Our goal now is to analyze this solution.
Listing 11.8 Extracting the regression attentions from the LocalGLMnet architecture

zz <- keras_model(inputs=model$input,
                  outputs=get_layer(model, 'Attention')$output)
beta <- data.frame(zz %>% predict(list(Xlearn, Vlearn)))
alpha1 <- as.numeric(get_weights(model)[[9]])
beta <- beta * alpha1
We start by analyzing the two additional components xi,q+1 and xi,q+2 being
uniformly and Gaussian distributed, respectively. Listing 11.8 shows how to extract
the estimated regression attentions $\hat\beta(x_i^+)$. We calculate the means and standard
deviations of the estimated regression attentions of the two additional components;
the empirical standard deviations are $\hat s_{q+1} = 0.0516$ and $\hat s_{q+2} = 0.0482$.
From these numbers we see that the regression attentions $\hat\beta_{q+2}(x_i)$ are slightly
biased, whereas $\hat\beta_{q+1}(x_i)$ are fairly centered, compared to the magnitudes of the
standard deviations. If we select a significance level of α = 0.1%, we receive a
two-sided standard normal quantile of $|\Phi^{-1}(\alpha/2)| = 3.29$. This provides us for
interval (11.41) with

$$I_\alpha = \left[ \Phi^{-1}(\alpha/2) \cdot \hat s_{q+1},\; \Phi^{-1}(1-\alpha/2) \cdot \hat s_{q+1} \right] = [-0.17, 0.17].$$
Fig. 11.14 Estimated regression attentions $\hat\beta_j(x_i^+)$ of the continuous and binary feature components Area, BonusMalus, log-Density, DrivAge, VehAge, VehGas, VehPower and the two random features $x_{i,q+1}$ and $x_{i,q+2}$ of 2'000 randomly selected policies $x_i^+$; the orange area shows the interval $I_\alpha$ for dropping term $\beta_j(x)x_j$ on significance level α = 0.1%
Table 11.11 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10⁻²) and in-sample average frequency of the Poisson regressions, see also Table 7.3

                                            Run    #        In-sample   Out-of-sample   Aver.
                                            time   param.   loss on L   loss on T       freq.
  Poisson null                              –      1        25.213      25.445          7.36%
  Poisson GLM3                              15s    50       24.084      24.102          7.36%
  One-hot FN (q1, q2, q3) = (20, 15, 10)    51s    1'306    23.757      23.885          6.96%
  LocalGLMnet on x⁺                         20s    1'799    23.728      23.945          7.46%
  LocalGLMnet on x⁺ bias regularized        –      –        23.727      23.943          7.36%
  LocalGLMnet on x⁻                         20s    1'675    23.715      23.912          7.30%
  LocalGLMnet on x⁻ bias regularized        –      –        23.714      23.911          7.36%
We remind the reader that dropping a term β_j(x)x_j does not necessarily imply that we have to completely drop x_j, because it may still play an important role in one of the other regression attentions β_{j'}(x), j' ≠ j. Therefore, we re-run the whole fitting procedure, but we drop the purely random feature components x_{i,q+1} and x_{i,q+2}, and we also drop VehPower and Area to see whether we receive a model with a similar predictive power. If so, we can drop these variables, in the sense of variable selection, similar to the LRT and the Wald test of Sect. 5.3. We denote the feature where we drop these components by x^- ∈ R^{q-2}.
We re-fit the LocalGLMnet on the reduced features x_i^-, and the results are presented
in Table 11.11. We observe that the loss figures decrease. Indeed, this supports the
null hypothesis of dropping VehPower and Area. The reason for being able to
drop VehPower is that it does not contribute (sufficiently) to explain the systematic
effects in the responses. The reason for being able to drop Area is slightly different:
we have seen that Area and log-Density are highly correlated, see Fig. 13.12
(rhs), and it turns out that it is sufficient to only keep the Density variable (on the
log-scale) in the model.
In a next step, we should analyze the robustness of these results by exploring the
nagging predictor and/or bootstrapping as described in Sect. 11.4. We refrain from
doing so, but we illustrate the LocalGLMnet solution of Table 11.11 in more detail.
Figure 11.15 shows the feature contributions β̂_j(x_i^-)x_{i,j} of 2'000 randomly selected policies on the significant continuous and binary feature components. The magenta
line gives a spline fit, and the more the black dots spread around these splines, the
more interactions we have; for instance, higher bonus-malus levels interact with the
age of driver which explains the scattering of the black dots. On average, frequencies
are increasing in bonus-malus levels and density, decreasing in vehicle age, and for
the driver’s age variable it is important to understand the interactions. We observe
that the spline fit for the log-Density is close to a linear function; this reflects that the regression attentions β̂_Density(x_i) in Fig. 11.14 (top-right) are more or less
constant. This is also confirmed by the marginal plot in Fig. 5.4 (bottom-rhs) which
has motivated the choice of a linear term for the log-Density in model Poisson
GLM1 of Table 5.3.
Fig. 11.15 Feature contributions β̂_j(x_i^-)x_{i,j} of the components BonusMalus, Density, DrivAge, VehAge and VehGas of 2'000 randomly selected policies x_i^-; the magenta line gives a spline fit
11.5 LocalGLMnet: An Interpretable Network Architecture 505
    IM_j = (1/n) Σ_{i=1}^n |β̂_j(x_i^+)| ,

[Figure residue removed: importance measures IM_j of the feature components; interaction strengths of the feature components Driver's Age and Vehicle Age; feature contributions of Vehicle Brand and the French Regions]
    arg min_{β_0, w} (1/n) Σ_{i=1}^n [ L(Y_i, μ(x_i)) + R(β(x_i)) ],    (11.42)

with a penalty term (regularizer) R(·) ≥ 0. For the penalty term R we can choose different forms, e.g., the elastic net regularizer of Zou–Hastie [409] is obtained by, see Remark 6.3,

    arg min_{β_0, w} (1/n) Σ_{i=1}^n [ L(Y_i, μ(x_i)) + η ( (1−α) ‖β(x_i)‖_2^2 + α ‖β(x_i)‖_1 ) ],    (11.43)
For variable selection of categorical feature components we should rather use the group LASSO penalization of Yuan–Lin [398], see also (6.5). Assume the features x have a natural group structure x = (x_1, …, x_K) ∈ R^q. We consider the optimization

    arg min_{β_0, w} (1/n) Σ_{i=1}^n [ L(Y_i, μ(x_i)) + Σ_{k=1}^K η_k ‖β_k(x_i)‖_2 ],    (11.44)
This motivates to study the optimization problem, for a fixed (small) ε > 0,

    arg min_{β_0, w} (1/n) Σ_{i=1}^n [ L(Y_i, μ(x_i)) + Σ_{k=1}^K η_k ‖β_k(x_i)‖_{2,ε} ].    (11.46)
For the last two choices there is no visible difference to the ℓ_1-norm.
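The ε-smoothed norm ‖β‖_{2,ε} = (β² + ε)^{1/2} can be inspected numerically; a small Python sketch (the grid is illustrative) showing that the maximal gap to |β| equals √ε and is attained in β = 0:

```python
import numpy as np

# epsilon-smoothed norm ||beta||_{2,eps} = sqrt(beta^2 + eps) for beta in R;
# contrary to |beta| it is differentiable in beta = 0
def norm_2_eps(beta, eps):
    return np.sqrt(beta ** 2 + eps)

beta = np.linspace(-1.0, 1.0, 2001)
gaps = {}
for eps in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]:
    # maximal gap to the absolute value; it equals sqrt(eps), in beta = 0
    gaps[eps] = np.max(norm_2_eps(beta, eps) - np.abs(beta))
    print(eps, gaps[eps])
```

For ε = 10⁻⁴ and 10⁻⁵ the maximal gap is 0.01 and roughly 0.003, which is why these curves are visually indistinguishable from the absolute value.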
Fig. 11.19 (lhs) Comparison of |β| and ‖β‖_{2,ε} = (β² + ε)^{1/2} for β ∈ R, and (rhs) unit balls B_ε for ε ∈ {10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}} compared to the Manhattan unit ball
The main disadvantage of the ε-approximation is that it does not shrink unimportant components β_j(x) exactly to zero. But it allows us to identify unimportant (small)
components, which can then be removed manually. As mentioned in Lee et al. [237],
LASSO regularization needs a second model calibration step only fitting the model
on the selected components (and without regularization) to receive an optimal
predictive power and a minimal bias. Thus, we need a second calibration step after
the removal of the unimportant components anyway.
We revisit the LocalGLMnet architecture applied to the French MTPL claim fre-
quency data, see Sect. 11.5.3. The goal is to perform a group LASSO regularization
so that we can also study the importance of the terms coming from the categorical
feature components VehBrand and Region. We first pre-process all feature
components as follows. We apply dummy coding to the categorical variables, and then we center all components and standardize them to unit variance; this includes the dummy coded components.
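This pre-processing step can be sketched as follows; the book's implementation is in R, while this toy Python/pandas version with hypothetical data illustrates that the dummy-coded columns are standardized as well:

```python
import pandas as pd

# hypothetical mini data set: one continuous and one categorical component
df = pd.DataFrame({
    "DrivAge": [25, 40, 33, 60, 47],
    "VehBrand": ["B1", "B2", "B1", "B3", "B2"],
})

# dummy coding of the categorical variable
X = pd.get_dummies(df, columns=["VehBrand"], dtype=float)

# center and scale ALL columns to unit variance, dummy columns included
X = (X - X.mean()) / X.std()
print(X.round(2))
```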
In a next step we need to define the natural groups x = (x_1, …, x_K) ∈ R^q. We construct a (sort of) regularization design matrix to encode the K groups and the weights √q_k for the q components of x. This is done in Listing 11.9, providing us with a matrix of size 38 × 9 and the weights √q_k. This regularization design matrix enters the penalty term on lines 13 and 16 of Listing 11.10 which weights the penalizations ‖·‖_{2,ε}.
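A minimal sketch of such a regularization design matrix, assuming the dummy coding drops a reference level so that 38 components split into 7 singleton groups plus groups of sizes 10 (VehBrand) and 21 (Region); the weighting η_k = η√q_k is the common group LASSO choice of Yuan–Lin:

```python
import numpy as np

# assumed group sizes q_k: 7 singleton groups (continuous/binary components)
# plus two dummy-coded categorical groups of sizes 10 and 21
group_sizes = [1] * 7 + [10, 21]
q = sum(group_sizes)   # total number of components of x
K = len(group_sizes)   # number of groups

# regularization design matrix G in {0,1}^{q x K}: G[j, k] = 1 iff
# component j belongs to group k
G = np.zeros((q, K))
j = 0
for k, qk in enumerate(group_sizes):
    G[j:j + qk, k] = 1.0
    j += qk

# group weights sqrt(q_k): larger groups are penalized more strongly
weights = np.sqrt(np.array(group_sizes, dtype=float))

def group_lasso_penalty(beta, eta=1.0):
    # ||beta_k||_2 via the design matrix: sum the squares within each group
    group_norms = np.sqrt(G.T @ (beta ** 2))
    return eta * np.sum(weights * group_norms)

print(G.shape, group_lasso_penalty(np.ones(q)))  # (38, 9) and penalty 38.0
```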
Finally, we need to code the loss function (11.42). This is done in Listing 11.11. We combine the Poisson deviance loss function with the group LASSO ε-approximation Σ_{k=1}^K η_k ‖β_k(x_i)‖_{2,ε}, the latter being outputted by Listing 11.10. We fit this network to the French MTPL data (as above) for regularization parameters η ∈ {0, 0.0025, 0.005}. Firstly, we note that the resulting networks are not fully competitive; this is probably due to the fact that the high-dimensional dummy coding leads to too much over-fitting potential, which leads to a very early stopping in the gradient descent fitting. Thus, this approach may not be useful to directly receive a good predictive model, but it may be helpful to select the right feature components to design a good predictive model.
Figure 11.20 gives the importance measures of the estimated regression attentions

    IM_j = (1/n) Σ_{i=1}^n |β̂_j(x_i)| ,
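Given the matrix of estimated attentions, the importance measure is a simple column-wise average of absolute values; a Python sketch with hypothetical attentions (one informative and several noise components):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 2000, 5  # number of policies and feature components

# hypothetical estimated regression attentions beta_j(x_i), one column per
# feature component; column 0 carries signal, the others are pure noise
beta_hat = rng.normal(0.0, 0.02, size=(n, q))
beta_hat[:, 0] += 0.3

# importance measure IM_j = (1/n) sum_i |beta_j(x_i)|
IM = np.mean(np.abs(beta_hat), axis=0)
print(IM.round(3))
```

The informative component receives an importance close to 0.3, while the noise components stay near zero; this is the pattern to look for in Fig. 11.20.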
Fig. 11.20 Importance measures IMj of the group LASSO regularized LocalGLMnet for variable
selection with different regularization parameters η ∈ {0, 0.0025, 0.005}: (lhs) original data, and
(rhs) randomly permuted Region labels; the x-scale is the same in both plots
In Sect. 6.3 we have introduced mixture distributions and we have presented the EM
algorithm for fitting these mixture distributions. The EM algorithm considers two
steps, an expectation step (E-step) and a maximization step (M-step). The E-step is
motivated by (6.34). In this step the posterior distribution of the latent variable Z
is determined, given the observation Y and the parameter estimates for the model
parameters θ and p. The M-step (6.35) determines the optimal model parameters
θ and p, based on the observation Y and the posterior distribution of Z. Typically,
we explore MLE in the M-step. However, for the EM algorithm to function it is not
important that we really work with the maximum in the M-step, but monotonicity
in (6.38) is sufficient. Thus, if at algorithmic time t − 1 we have a parameter estimate (θ̂^(t−1), p̂^(t−1)), it suffices that the next estimate (θ̂^(t), p̂^(t)) increases the log-likelihood, without necessarily being the MLE; this latter approach is called
generalized EM (GEM) algorithm. Exactly this point makes it feasible to also use
the EM algorithm in cases where we model the parameters through networks which
are fit using gradient descent (ascent) algorithms. These methods go under the name
of mixture density networks (MDNs).
MDNs have been introduced by Bishop [35], who explores MDNs on Gaussian mixtures, using SGD and quasi-Newton methods for model fitting. MDNs have
also started to gain more popularity within the actuarial community, recent papers
include Delong et al. [95], Kuo [230] and Al-Mudafer et al. [6], the latter two
considering MDNs for claims reserving.
We recall the mixture density for a selected member of the EDF. The incomplete
log-likelihood of the data (Yi , x i , vi )1≤i≤n is given by, see (6.24),
n
(θ , ϕ, p) → Y (θ, ϕ, p) = Yi (θ (x i ), ϕ(x i ), p(x i ))
i=1
n
K
vi
= log pk (x i )fk Yi ; θk (x i ), ,
ϕk (x i )
i=1 k=1
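The incomplete log-likelihood is just a sum of log mixture densities. A self-contained Python sketch for a hypothetical two-component gamma mixture with homogeneous parameters (the weights v_i/ϕ_k are suppressed here), using a log-sum-exp for numerical stability:

```python
import math

import numpy as np

def log_gamma_density(y, shape, rate):
    # log of the gamma density f(y; shape, rate)
    return (shape * math.log(rate) + (shape - 1.0) * math.log(y)
            - rate * y - math.lgamma(shape))

def incomplete_loglik_one(y, probs, shapes, rates):
    # log sum_k p_k f_k(y), computed stably via log-sum-exp
    logs = np.array([math.log(p) + log_gamma_density(y, a, b)
                     for p, a, b in zip(probs, shapes, rates)])
    m = logs.max()
    return m + math.log(np.exp(logs - m).sum())

# hypothetical two-component gamma mixture: a body and a heavier-tailed part
probs, shapes, rates = [0.7, 0.3], [2.0, 2.0], [2.0, 0.2]
ys = [0.5, 1.0, 10.0]
ll = sum(incomplete_loglik_one(y, probs, shapes, rates) for y in ys)
print(ll)
```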
representations are used to model the parameters. For the mixture probability p we
build a logistic categorical GLM, based on zi . For the (canonical) link h, we set
linear predictor, see (5.72),
p p
h(p(zi )) = h p z(d:1)(x i ) = β 1 , zi , . . . , β K , zi ∈ RK , (11.47)
p p
with regression parameter β p = ((β 1 ) , . . . , (β K ) ) ∈ RK(qd +1) . For the
canonical parameter θ , the mean parameter μ, respectively, and the dispersion
parameter ϕ we proceed analogously. Choose strictly monotone and smooth link
functions gμ and gϕ , and consider the double GLMs, for 1 ≤ k ≤ K, on the learned
representations zi
μ ϕ
gμ (μk (zi )) = β k , zi and gϕ (ϕk (zi )) = β k , zi , (11.48)
d
r= qm (qm−1 + 1) + 3K(qd + 1).
m=1
Remarks 11.14
• The regression functions (11.47)–(11.48) use a slight abuse of notation, because,
strictly speaking, these should be functions w.r.t. the features x i ∈ X , i.e.,
we should understand the learned representations zi as a short form for x i →
z(d:1)(x i ).
• It is not fully correct to say that (11.47) is the logistic categorical GLM
of formula (5.72), because (11.47) does not lead to identifiable regression
parameters. In fact, we should reduce the dimension of the categorical GLM to K − 1, by setting β_K^p = 0, see (5.70), because the probability of the last label
K is fully determined if we know the probabilities of all other labels; this would
also justify to say that h is the canonical link. Since in FN network modeling we
do not have identifiability anyway, we neglect this normalization (redundancy),
see line 16 of Listing 11.12, below.
• The above proposal (11.47)–(11.48) suggests to use the same network z(d:1)
for all mixture parameters involved. This requires that the chosen network is
11.6 Selected Applications 515
sufficiently large, so that it can comply simultaneously with these different tasks.
Alternatively, we could choose three separate (parallel) networks for p, μ and
ϕ, respectively. This second proposal does not (easily) allow for (non-trivial)
interactions between the parameters, and it may also suffer from less robustness
in fitting.
• Proposal (11.48) defines double GLMs for the mixture components fk , 1 ≤ k ≤
K. If we decide to not model the dispersion parameters feature dependent, i.e., if
we set ϕk (z) ≡ ϕk ∈ R+ , then the mixture components are modeled with GLMs
on the learned representations zi = z(d:1)(x i ). Nevertheless, this latter approach
still requires that the dispersion parameters ϕ_k are set to reasonable values, as they enter the score equations; this can be seen from (6.29) adapted to MDNs.
Thus, in MDNs, the dispersion parameters do not cancel in the score equations,
which is different from the single distribution case. The dispersion parameter can
either be estimated (updated) during the M-step of the EM algorithm (supposed
we use the EM algorithm), or it can be pre-specified as a given hyper-parameter.
• As mentioned in Sect. 6.3, mixture density fitting can be challenging because,
in general, mixture density log-likelihoods are unbounded. Therefore, a suitable
initialization of the EM algorithm is important for a successful model fitting.
This problem is less pronounced in MDNs as we use early stopping in SGD
fitting that prevents the fitted parameters from depending on a small set of observations.
For instance, Example 6.13 cannot occur because an individual observation Y1
enters at most one (mini-)batch of SGD, and the SGD algorithm will provide
a good balance across all batches. Moreover, early stopping will imply that the
selected parameters must also be good on the validation data being disjoint (and
independent) from the training data.
• Delong et al. [95] present two different ways of fitting such MDNs. The crucial
property in EM fitting is to preserve the monotonicity in the M-step. For MDNs
this can either be achieved by using the parameters as offsets for the next EM
iteration (this is called ‘EM network boosting’ in Delong et al. [95]) or to forward the network weights from one loop to the next (called ‘EM forward network’
in Delong et al. [95]). We are going to present the second option in the next
example.
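The monotonicity property that the (G)EM variants above must preserve can be illustrated on a toy mixture where only the mixture probability is estimated and the M-step is available in closed form; the incomplete log-likelihood then never decreases (Gaussian components and data below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy data: two-component Gaussian mixture with FIXED unit-variance
# components; only the mixture probability p is estimated, mimicking the
# decoupled M-step for the mixture probabilities
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 700)])

def density(y, mu):
    return np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2.0 * np.pi)

p = 0.5  # starting value for the probability of the first component
loglik = []
for _ in range(20):
    f1, f2 = density(y, -2.0), density(y, 2.0)
    loglik.append(np.log(p * f1 + (1.0 - p) * f2).sum())
    # E-step: posterior probability of the first component
    z = p * f1 / (p * f1 + (1.0 - p) * f2)
    # M-step (closed form): update the mixture probability
    p = z.mean()

# EM monotonicity: the incomplete log-likelihood never decreases
print(round(p, 3), loglik[-1] >= loglik[0])
```

In a GEM variant the M-step would be replaced by a gradient step on the network weights; monotonicity then has to be traced explicitly, as discussed below.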
Example 11.15 (Gamma Claim Size Modeling and MDNs) We revisit Exam-
ple 6.14 which models the claim sizes of the French MTPL data. For the modeling
of these claim sizes we choose the mixture distribution (6.39) which has four
gamma components f1 , . . . , f4 and one Lomax component f5 . In a first step we
again model these five mixture components independent of the feature information
x, and the feature information only enters the mixture probabilities p(x) ∈ Δ_5.
This modeling approach has been motivated by Fig. 13.17 which suggests that
the features mainly result in systematic effects on the mixture probabilities. We
choose the same model and feature information as in Example 6.14. We only
replace the logistic categorical GLM part (6.40) for modeling p(x) by a depth
d = 2 FN network with (q1 , q2 ) = (20, 10) neurons. Area, VehAge, DrivAge
and BonusMalus are modeled as continuous variables, and for the categorical
variables VehBrand and Region we choose two-dimensional embedding layers.
Listing 11.12 R code of the MDN for modeling the mixture probability p(x)
1 Design = layer_input(shape = c(4), dtype = ’float32’)
2 VehBrand = layer_input(shape = c(1), dtype = ’int32’)
3 Region = layer_input(shape = c(1), dtype = ’int32’)
4 Bias = layer_input(shape = c(1), dtype = ’float32’)
5 #
6 BrandEmb = VehBrand %>%
7 layer_embedding(input_dim = 11, output_dim = 2, input_length = 1) %>%
8 layer_flatten()
9 RegionEmb = Region %>%
10 layer_embedding(input_dim = 22, output_dim = 2, input_length = 1) %>%
11 layer_flatten()
12 #
13 pp = list(Design, BrandEmb, RegionEmb) %>% layer_concatenate() %>%
14 layer_dense(units=20, activation=’tanh’) %>%
15 layer_dense(units=10, activation=’tanh’) %>%
16 layer_dense(units=5, activation=’softmax’)
17 #
18 mu = Bias %>% layer_dense(units=4, activation=’exponential’,
19 use_bias=FALSE)
20 #
21 tail = Bias %>% layer_dense(units=1, activation=’sigmoid’,
22 use_bias=FALSE)
23 #
24 shape = Bias %>% layer_dense(units=4, activation=’exponential’,
25 use_bias=FALSE)
26 #
27 Response = list(pp, mu, tail, shape) %>% layer_concatenate()
28 #
29 keras_model(inputs = c(Design, VehBrand, Region, Bias), outputs = c(Response))
Listing 11.12 shows the chosen network. Lines 13–16 model the mixture probability
p(x). We also integrate the modeling of the (homogeneous) parameters of the
mixture densities f1 , . . . , f5 . Lines 18 and 24 of Listing 11.12 consider the mean
and shape parameter of the gamma components, and line 21 the tail parameter 1/β5
of the Lomax component. Note that we use the sigmoid activation for this Lomax
parameter. This implies 1/β5 ∈ (0, 1) and, thus, β5 > 1, which enforces a finite
mean model. The exponential activations on lines 18 and 24 ensure positivity of
these parameters. The input Bias to these variables is simply the constant 1, which
is the homogeneous case not differentiating w.r.t. the features.
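The constant-1 Bias input trick can be mimicked outside keras: a bias-free dense layer of width 1 on the input 1 outputs an activation of its raw weight, so the activation constrains the homogeneous parameter. A numpy sketch with hypothetical (unfitted) raw weights:

```python
import numpy as np

# a bias-free dense layer applied to the constant input 1 returns
# activation(w * 1); the activation constrains the homogeneous parameter
def exponential(w):
    return np.exp(w)

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

w_shape, w_tail = 0.7, -1.2           # hypothetical raw network weights
shape = exponential(w_shape * 1.0)    # exponential activation: shape > 0
inv_beta5 = sigmoid(w_tail * 1.0)     # sigmoid activation: 1/beta_5 in (0, 1)
beta5 = 1.0 / inv_beta5               # hence beta_5 > 1: finite-mean Lomax
print(shape, inv_beta5, beta5)
```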
Observe that in most of the networks so far, the output of the network was
equal to an expected response of a random variable that we try to predict. In
this MDN we output the parameters of a distribution function, see line 27 of
Listing 11.12. In our case this output has dimension 14, which then enters the score
in Listing 11.13. In a first attempt we fit this MDN brute-force by just implementing
the incomplete log-likelihood received from (6.39). Since the gamma function Γ(·) is not easily available in keras [77], we replace the gamma density by its
saddlepoint approximation, see Sect. 5.5.2. Listing 11.13 shows the negative log-
likelihood of the mixture density that is used to perform the brute-force SGD fitting.
Lines 2–9 give the saddlepoint approximations to the four gamma components, and
line 10 the Lomax component for the scale parameter M. Note that this brute-force
approach is based only on the incomplete observation Y encoded in true[,1],
see Listing 11.13.
We fit this logistic categorical FN network of Listing 11.12 under the score function
of Listing 11.13 using the nadam version of SGD. Moreover, we use a stratified
training-validation split, otherwise we did not obtain a competitive model. The
results are presented in Table 11.12 on line ‘logistic FN network: brute-force fitting’.
We observe a slightly worse performance (in-sample) than in the logistic GLM. This does not justify the use of the more complex network architecture. Or in other words, feature pre-processing seems to have been done suitably in Example 6.14.
In a next step, we fit this MDN with the (generalized) EM algorithm. The E-
step is exactly the same as in Example 6.14. For the M-step, having knowledge of
the (latent mixture component) variables Z i , 1 ≤ i ≤ n, implies that the mixture
probability estimation and the mixture density estimation completely decouples. As
a consequence, the parameters of the density components f1 , . . . , f5 can directly
be estimated using univariate MLEs, this is the same as in Example 6.14. The
only part that needs further explanation is the estimation of the logistic categorical
FN network for p(x). In each loop of the EM iteration we would like to find the
optimal network parameter for p(x), and at the same time we have to ensure the
monotonicity (6.38). Following the ‘EM forward network’ approach of Delong et
Table 11.12 Mixture models for French MTPL claim size modeling; we set M = 2'000

                                              # Param.   ℓ_Y(θ̂, p̂)    μ̂ = E_{θ̂, p̂}[Y]
  Empirical                                   –          –              2'266
  Null model                                  13         −199'306       2'381
  Logistic GLM, Example 6.14                  193        −198'404       2'176
  Logistic FN network: brute-force fitting    520        −198'623       2'003
  Logistic FN network: EM fitting             520        −198'449       2'119
  MDN: brute-force fitting                    825        −198'178       2'144
  MDN: EM fitting                             825        −198'085       2'240
al. [95], this is most easily achieved by just initializing the FN network in loop t of
the algorithm with the optimal network parameter of the previous loop t − 1. Thus,
the starting parameter of SGD reflects the optimal parameter from the previous
step, and since SGD generally decreases losses, the monotonicity (6.38) holds. The
latter statement is not strictly true: SGD introduces additional randomness through the building of (mini-)batches; therefore, monotonicity should be traced explicitly
(which also ensures that the early stopping rule is chosen suitably). We have
implemented such an EM-SGD algorithm; essentially, we just have to drop lines
17–28 of Listing 11.12 and lines 13–16 provide the entire response. As loss function
we choose the categorical (multi-class) cross-entropy loss, see (4.19). The results in
Table 11.12 on line ‘logistic FN network: EM fitting’ indicate a superior fitting
behavior compared to the brute-force fitting. Nevertheless, this network approach
is still not outperforming the GLM approach, saying that we should stay with the
simpler GLM.
In a final step, we also model the mean parameters μk (x), 1 ≤ k ≤ 4, of the
gamma components feature dependent, to see whether we can gain predictive power
from this additional flexibility or whether our initial model choice is sufficient. For
robustness reasons we neither model the shape parameters βk , 1 ≤ k ≤ 4, of
the gamma components feature dependent nor the tail parameter β5 of the Lomax
component. The implementation only requires small changes to Listing 11.12, see
Listing 11.14.
A brute-force fitting of the MDN architecture of Listing 11.14 can directly be based
on the score function (negative incomplete log-likelihood) of Listing 11.13. In the
case of the EM algorithm we need to change the score function to the complete
log-likelihood accounting for the variables Z_i ∈ Δ_5. This is done in Listing 11.15
where Z i is encoded in the variables true[,2] to true[,6].
We fit this MDN using the two different fitting approaches, and the results are given
on the last two lines of Table 11.12. Again the performance of the EM fitting is
slightly better than the brute-force fitting, and the bigger log-likelihoods indicate
that we can gain predictive power by also modeling the means of the gamma
components feature dependent.
Figure 11.21 compares the QQ plot of the resulting MDN with EM fitting to the
one received from the logistic categorical GLM of Example 6.14. These graphs are
very similar. We conclude that in this particular example it seems that the simpler
proposal of Example 6.14 is sufficient.
In a next step, we try to understand which feature components influence the mix-
ture probabilities p(x) = (p1 (x), . . . , pK (x)) most. Similarly to Examples 6.14
and 11.15, we therefore use an MDN where we only fit the mixture probability
p(x) with a network and the mixture components f1 , . . . , fK are assumed to be
homogeneous.
Example 11.16 (MDN with LocalGLMnet) We revisit Example 11.15. We choose
the mixture distribution (6.39) which has four gamma components f1 , . . . , f4 and
a Lomax component f5 . We select their parameters independent of the features.
The feature information x should only enter the mixture probability p(x) ∈ Δ_5,
similarly to the first part of Example 11.15. We replace the logistic FN network of
Listing 11.14 R code of the MDN for modeling the mixture probability p(x) and the gamma
means μk (x)
1 Design = layer_input(shape = c(4), dtype = ’float32’)
2 VehBrand = layer_input(shape = c(1), dtype = ’int32’)
3 Region = layer_input(shape = c(1), dtype = ’int32’)
4 Bias = layer_input(shape = c(1), dtype = ’float32’)
5 #
6 BrandEmb = VehBrand %>%
7 layer_embedding(input_dim = 11, output_dim = 2, input_length = 1) %>%
8 layer_flatten()
9 RegionEmb = Region %>%
10 layer_embedding(input_dim = 22, output_dim = 2, input_length = 1) %>%
11 layer_flatten()
12 #
13 Network = list(Design, BrandEmb, RegionEmb) %>% layer_concatenate() %>%
14 layer_dense(units=20, activation=’tanh’) %>%
15 layer_dense(units=15, activation=’tanh’) %>%
16 layer_dense(units=10, activation=’tanh’)
17 #
18 pp = Network %>% layer_dense(units=5, activation=’softmax’)
19 #
20 mu = Network %>% layer_dense(units=4, activation=’exponential’,
21 use_bias=FALSE)
22 #
23 tail = Bias %>% layer_dense(units=1, activation=’sigmoid’,
24 use_bias=FALSE)
25 #
26 shape = Bias %>% layer_dense(units=4, activation=’exponential’,
27 use_bias=FALSE)
28 #
29 Response = list(pp, mu, tail, shape) %>% layer_concatenate()
30 #
31 keras_model(inputs = c(Design, VehBrand, Region, Bias), outputs = c(Response))
Example 11.15 for modeling p(x) by a LocalGLMnet such that we can analyze the
importance of the variables, see Sect. 11.5.
For the feature information we choose the continuous variables Area,
VehPower, VehAge, DrivAge and BonusMalus, the binary variable VehGas
and the categorical variables VehBrand and Region, thus, we extend by
VehPower and VehGas compared to Example 11.15. These latter two variables
have not been included previously, because they did not seem to be important
Fig. 11.21 QQ plots of mixture models: (lhs) logistic categorical GLM for mixture probabilities
and (rhs) for MDN with EM fitting
w.r.t. Fig. 13.17. The continuous and binary variables are centered and normalized
to unit variance. For the categorical variables we use two-dimensional embedding
layers, and afterwards they are concatenated with the continuous variables with
a subsequent normalization layer (to ensure that all components live on the same
scale). This provides us with a 10-dimensional feature vector. This feature vector
is complemented with an i.i.d. standard Gaussian component, called Random,
to perform an empirical Wald type test. We call this pre-processed feature (after
embedding and normalization of the categorical variables) x ∈ Rq0 with q0 = 11.
We design a LocalGLMnet that acts on this feature x ∈ Rq0 for modeling
a categorical multi-class output with K = 5 levels. Therefore, we choose the
regression attentions

    z^{(d:1)} : R^{q_0} → R^{q_0 × K},   x ↦ β(x) = (β_1(x), …, β_K(x)) = z^{(d:1)}(x),

with intercepts β_{k,0} ∈ R, and where β_k(x) ∈ R^{q_0} is the k-th column of the regression attention β(x) = z^{(d:1)}(x) ∈ R^{q_0 × K}. We also refer to the second item of
Remarks 11.14 concerning a possible dimension reduction in (11.49), i.e., in fact we
apply the softmax activation function to the right-hand side of (11.49), neglecting
the identifiability issue. Moreover, as in the introduction of the LocalGLMnet, we
separate the intercept components from the remaining features in (11.49).
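Applying the softmax to the attention-based linear predictors (keeping the identifiability redundancy) can be sketched as follows; the dimensions match q_0 = 11 and K = 5, and all numbers are hypothetical stand-ins for fitted values:

```python
import numpy as np

rng = np.random.default_rng(3)
q0, K = 11, 5  # feature dimension and number of mixture components

# hypothetical regression attentions beta(x) in R^{q0 x K}, intercepts and x
x = rng.normal(size=q0)
beta = rng.normal(scale=0.2, size=(q0, K))
beta0 = rng.normal(scale=0.1, size=K)

# one linear predictor per mixture component: beta_{k,0} + <beta_k(x), x>
eta = beta0 + beta.T @ x

# softmax (keeping the identifiability redundancy): probabilities p(x)
p = np.exp(eta - eta.max())
p = p / p.sum()
print(p.round(3), p.sum())
```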
We fit this LocalGLMnet-MDN with the EM version presented in Example 11.15. We apply early stopping based on the same stratified training-validation split.
FN networks have also found their way into solving risk management problems.
We briefly introduce a valuation problem and then describe a way of solving
this problem. Assume we have a liability cash flow Y1:T = (Y1 , . . . , YT ) with
(random) payments Yt at time points t = 1, . . . , T . We assume that this liability
cash flow Y1:T is adapted to a filtration (At )1≤t ≤T on the underlying probability
space (, A, P). Moreover, we assume to have a pricing kernel (state price deflator)
ψ1:T = (ψ1 , . . . , ψT ) on that probability space which is an (At )1≤t ≤T -adapted
random vector with strictly positive components ψt > 0, a.s., for all 1 ≤ t ≤ T . A
no-arbitrage value of the outstanding liability cash flow at time 1 ≤ τ < T can be
defined by (we assume existence of all second moments)
    R_τ = (1/ψ_τ) E[ Σ_{s=τ+1}^T ψ_s Y_s | A_τ ].    (11.50)
For the mathematical background on no-arbitrage pricing using state price deflators
we refer to Wüthrich–Merz [393]. The A_τ-measurable quantity R_τ is called the
reserves of the outstanding liabilities at time τ . From a risk management and
solvency point of view we would like to understand the volatility in the reserves
Rτ seen from time 0, i.e., we try to model the random variable Rτ seen from time
0 (based on the trivial σ-algebra A_0 = {∅, Ω}). In applied problems, the difficulty
often is that the conditional expectations under the summation in (11.50) cannot be
computed in closed form. Therefore the law of Rτ cannot be determined explicitly.
We provide a numerical solution to the calculation of the conditional expectations
in (11.50). Assume that the information set Aτ can be described by a random vector
Xτ , i.e., Aτ = σ (Xτ ). In that case we rewrite (11.50) as follows
    R_τ = (1/ψ_τ) E[ Σ_{s=τ+1}^T ψ_s Y_s | X_τ ].    (11.51)
The latter now indicates that we can determine the conditional expectations
in (11.51) as regression functions in features X τ , and we try to understand for s > τ
    x_τ ↦ E[ (ψ_s/ψ_τ) Y_s | X_τ = x_τ ].    (11.52)
have been proposed to determine these basis and regression functions, see, e.g.,
Cheridito et al. [74] or Krah et al. [224].
In the following, we assume that all random variables considered are square-
integrable and, thus, we can work in a Hilbert space with the scalar product ⟨X, Z⟩ = E[XZ] for X, Z ∈ L²(Ω, A, P). Moreover, for simplicity, we drop the
time indices and we also drop the stochastic discounting in (11.52) by assuming
ψs /ψτ ≡ 1. These simplifications are not essential technically and simplify our
outline. The conditional expectation μ(X) = E[Y |X] can then be found by the
orthogonal projection of Y onto the sub-space generated by σ(X), i.e., generated by X, in the Hilbert space L²(Ω, A, P). That is, the conditional expectation is the measurable function
μ : R^q → R, X ↦ μ(X), that minimizes the mean squared error

    E[ (Y − μ(X))² ] = min,    (11.53)
among all measurable functions on X. In Example 3.7, we have seen that μ(·) is the
minimizer of this problem if and only if
    μ(x) = arg min_{m ∈ R} ∫_R (y − m)² dF_{Y|x}(y),    (11.54)
    ϑ̂ = arg min_{ϑ ∈ R^r} (1/n) Σ_{i=1}^n ( Y_i − ⟨β, z^{(d:1)}(X_i)⟩ )²,    (11.55)
where (Y_i, X_i), 1 ≤ i ≤ n, are i.i.d. copies of (Y, X). This provides us with the fitted FN network ẑ^{(d:1)}(·) and the fitted output parameter β̂. These can be used to receive an approximation to the conditional expectation, solution of (11.54),

    x ↦ μ̂(x) = ⟨β̂, ẑ^{(d:1)}(x)⟩ ≈ μ(x) = E[ Y | X = x ].    (11.56)
bounded for a fixed network parameter ϑ. This does not necessarily apply to
the conditional expectation E[Y |X = ·] and, thus, the approximation in the tail
may be poor. Second, we consider an approximation based on a finite sample
in (11.55). However, this error can be made arbitrarily small by letting n → ∞.
In-sample over-fitting should not be an issue as we may generate samples of
arbitrary large sample sizes. Third, having the approximation (11.56), we still
need to simulate i.i.d. samples Xk , k ≥ 1, having the same distribution as X to
empirically approximate the distribution of the random variable Rτ in (11.51).
Also in this step we benefit from the fact that we can simulate infinitely many
samples to mitigate this approximation error.
• To fit the network parameter ϑ in (11.55) we use i.i.d. copies (Yi , X i ), 1 ≤ i ≤ n,
that have the same distribution as (Y, X) under P. However, to receive a good approximation to the regression function x ↦ μ(x), we only need to simulate
Yi |{Xi =x i } from FY |x i (·) = P[·|Xi = x i ], and Xi can be simulated from an
arbitrary equivalent distribution to px , and we still get the right conditional
expectation in (11.54). This is worth mentioning because if we need a higher
precision in some part of the feature space of X, we can apply a sort of
importance sampling by choosing a distribution for X that generates more
samples in the corresponding part of the feature space compared to the original
(true) distribution px of X; this proposal has been emphasized in Cheridito et
al. [74].
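The last point can be illustrated with a small simulated sketch (the quadratic toy model and all numerical choices are ours, not part of the text): we regress Y on features X drawn either from an original distribution or from an equivalent, tilted one that over-samples the right tail; both fits recover the same conditional mean on the common support, the tilted one with better tail precision.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(x_sampler, n):
    # features from an arbitrary sampling distribution, response from the
    # TRUE conditional law Y | X = x ~ N(x^2, 0.5^2)
    X = x_sampler(n)
    Y = X**2 + rng.normal(0.0, 0.5, n)
    return X, Y

def fit(X, Y):
    # quadratic least squares, a stand-in for the FN network fit (11.55)
    return np.polyfit(X, Y, 2)

# original feature law: uniform on (0, 3)
coef_orig = fit(*simulate(lambda n: rng.uniform(0.0, 3.0, n), 100_000))
# equivalent tilted law on (0, 3) putting much more mass near the right tail
coef_tilt = fit(*simulate(lambda n: 3.0 * rng.beta(4.0, 1.0, n), 100_000))

# both recover mu(x) = x^2 in the tail region
print(np.polyval(coef_orig, 2.5), np.polyval(coef_tilt, 2.5))
```

Both fitted surfaces evaluate close to μ(2.5) = 6.25, which is the point of the remark: the feature sampling distribution may be tilted without changing the learned conditional expectation.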
We study the example presented in Ha–Bauer [177] and Cheridito et al. [74].
This example considers a variable annuity (VA) with a guaranteed minimum income
benefit (GMIB), and we revisit the network approach of Cheridito et al. [74].
Example 11.18 (Approximation of Conditional Expectations) We consider the VA
example with a GMIB introduced and studied in Ha–Bauer [177]. This example
involves a 3-dimensional stochastic process, for t ≥ 0,
X_t = (q_t, r_t, m_{x+t}),

with q_t being the log-value of the VA account at time t, r_t the short rate at time t, and m_{x+t} the force of mortality at time t of a person aged x at time 0. The payoff
at fixed maturity date T > 1 of this insurance contract is given by
S = S(X_T) = max{ e^{q_T}, b a_{x+T}(r_T, m_{x+T}) },

where e^{q_T} is the VA account value at time T, and b a_{x+T}(r_T, m_{x+T}) is the GMIB at time T consisting of a face value b > 0 and with a_{x+T}(r_T, m_{x+T}) being the value
of an immediate annuity at time T of a person aged x + T. Our goal is to model the conditional expectation

μ(X_τ) = E[ D(τ, T) S(X_T) | X_τ ],   (11.57)

for a fixed valuation time point 0 < τ < T, and where D(τ, T) = D(τ, T; X_τ)
is a σ (Xτ )-measurable discount factor. This requires the explicit specification of
the GMIB term as a function of (rT , mx+T ), the modeling of the stochastic process
(Xt )0≤t ≤T , and the specification of the discount factor D(τ, T ; Xτ ). In financial
and actuarial valuation the regression function μ(·) in (11.57) should reflect a no-
arbitrage price. Therefore, P in (11.57) should be an equivalent martingale measure
w.r.t. the selected numéraire. In our case, we choose a force of mortality (mx+t )t -
adjusted zero-coupon bond price as numéraire. This implies that P is a mortality-
adjusted forward measure; for details and its explicit derivation we refer to Sect. 5.1
of Ha–Bauer [177]. In particular, Ha–Bauer [177] introduce a three-dimensional
Brownian motion based model for (X t )t from which they deduce all relevant terms
explicitly. We skip these calculations here, because, once the GMIB term and the
discount factor are determined, everything boils down to knowing the distribution
of the random vector (X τ , XT ) under the corresponding probability measure P. We
choose initial age x = 55, maturity T = 15 and (solvency) time horizon τ = 1. Under the model and parametrization of Ha–Bauer [177] we receive a multivariate Gaussian distribution (11.58) for the vector (X_τ, X_T) under P.
Under the model specification of Ha–Bauer [177], one can furthermore work out the
discount factor and the annuity. Define for t ≥ 0 and k > 0 the affine term structure
B(t, t + k; α) = (1 − e^{−αk}) / α,

A(t, t + k) = γ̄ ( B(t, t + k; α) − k )
   + σ_r²/(2α²) · ( k − 2B(t, t + k; α) + B(t, t + k; 2α) )
   + ψ²/(2κ²) · ( k − 2B(t, t + k; −κ) + B(t, t + k; −2κ) )
   + ϱ_{2,3} σ_r ψ/(ακ) · ( B(t, t + k; −κ) − k + B(t, t + k; α) − B(t, t + k; α − κ) ),
with parameters for the short rate process α = 25%, σ_r = 1%, for the force of mortality κ = 7%, ψ = 0.12%, the correlation between the short rate and the force of mortality ϱ_{2,3} = −4%, and with market-price of risk adjusted mean reversion level γ̄ = 1.92% of the short rate process.

Fig. 11.23 Marginal densities (log scale) of the VA account value e^{q_T} and of the GMIB value b a_{x+T}(r_T, m_{x+T})

These formulas can be retrieved because
we work under an affine Gaussian structure. The discount factor is then given by D(τ, T; X_τ) = F(τ, T − τ; r_τ, m_{x+τ}), with F(t, k; ·, ·) denoting the resulting (mortality-adjusted) zero-coupon bond price function, and the value of the immediate annuity is

a_{x+T}(r_T, m_{x+T}) = Σ_{k=1}^{50} F(T, k; r_T, m_{x+T}).
Moreover, we set the face value to b = 10.79205. This parametrization implies that the VA account value e^{q_T} exceeds the GMIB b a_{x+T}(r_T, m_{x+T}) with a probability of roughly 40%, i.e., in roughly 60% of the cases the GMIB option is exercised. Figure 11.23 shows the marginal densities of these two variables; moreover, their correlation is close to 0.
The model is now fully specified so that we can estimate the conditional expectation
in (11.57) as a function of X_τ. We therefore simulate n = 3'000'000 i.i.d. Gaussian observations (X_τ^{(i)}, X_T^{(i)}), 1 ≤ i ≤ n, from (11.58). This provides us with the observations

Y_i = D(τ, T; X_τ^{(i)}) S(X_T^{(i)})
    = F(τ, T − τ; r_τ^{(i)}, m_{x+τ}^{(i)}) max{ e^{q_T^{(i)}}, b Σ_{k=1}^{50} F(T, k; r_T^{(i)}, m_{x+T}^{(i)}) }.
Fitting FN networks to these observations with 20 different seeds provides us with the estimates μ̂_k(·), 1 ≤ k ≤ 20, which are aggregated to the nagging predictor

μ̄(·) = (1/20) Σ_{k=1}^{20} μ̂_k(·).

This nagging predictor is evaluated on i.i.d. simulations of the feature,

μ̄(X_τ^{(l)})   for 1 ≤ l ≤ L,   (11.59)

which allows us to empirically estimate the value-at-risk VaR_p of μ̄(X_τ) on the level p, as well as the conditional tail expectation

CTE_p = E[ μ̄(X_τ) | μ̄(X_τ) > VaR_p ].
We also refer to Sect. 11.3. The VaR and the CTE are two commonly used risk
measures in insurance practice that determine the necessary risk bearing capital to
run the corresponding insurance business. Typically, the VaR is evaluated at p = 99.5%, i.e., we allow for a default probability of 0.5% of not being able to cover the changes in valuation over a τ = 1 year time horizon. Alternatively, the CTE is considered at p = 99%, which means that we need sufficient capital to cover, on average, the 1% worst changes in valuation over a 1-year time horizon.
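On simulated predictor values, these two risk measures reduce to an empirical quantile and a tail average. A minimal sketch, with a hypothetical Gaussian sample standing in for the L simulated values μ̄(X_τ^{(l)}) of (11.59):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical stand-in for the L simulated predictor values of (11.59)
sample = rng.normal(100.0, 15.0, size=1_000_000)

def var_cte(x, p):
    # empirical value-at-risk (quantile) and conditional tail expectation
    var = np.quantile(x, p)
    cte = x[x > var].mean()
    return var, cte

var995, _ = var_cte(sample, 0.995)   # analogue of the 99.5% VaR
_, cte99 = var_cte(sample, 0.99)     # analogue of the 99% CTE
```

In the text's example the tail average is taken over the nagging predictor values; the Gaussian sample here is purely illustrative.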
Figure 11.24 shows our FN network approximations. The boxplots show the individual results of the estimates μ̂_k(·) for the 20 different seeds, and the horizontal lines show the results of the nagging predictor (11.59). The red line at 140.97 gives the estimated VaR for p = 99.5%; this value is slightly bigger than the best estimate of 139.47 (orange line) in Ha–Bauer [177], which is based on a functional approximation involving 37 monomials and 40'000'000 simulated samples. The CTEs at p = 99.5% and p = 99% are given by 145.09 and 141.49, respectively. We conclude that in the present example VaR_{99.5%} (used in Europe) and CTE_{99%} (used in Switzerland) are approximately of the same size for this VA with a GMIB.
Fig. 11.24 FN network approximations: VaR_{99.5%} (green) and CTE_{99%} (blue); the orange line gives the result of Ha–Bauer [177] for the 99.5% VaR
This example shows how problems can be solved that require the computation
of a conditional expectation. Alternatively, we could explore the LocalGLMnet
architecture, which would allow us to explain the conditional expectation more
explicitly in terms of the information X_τ available at time τ. This may also be relevant in practice because it allows us to determine the main risk drivers of the underlying insurance business.
Figure 11.25 shows the marginal densities of the components of X_τ = (q_τ, r_τ, m_{x+τ}) in blue color. In red color we show the corresponding conditional densities of X_τ, conditioned on μ̄(X_τ) > VaR_{99.5%}; thus, these are the feature values X_τ that lead to a shortfall beyond the 99.5% VaR of μ̄(X_τ). From this
figure we conclude that the main driver of the VaR is the VA account variable q_τ, whereas the short rate r_τ and the force of mortality m_{x+τ} are slightly lower beyond the VaR compared to their unconditioned counterparts. The explanation for these smaller values is that they lead to less discounting and, hence, to bigger GMIB values. This is useful information for exploring importance sampling as mentioned in Remarks 11.17. This closes the example.
Fig. 11.25 Feature values X_τ triggering the VaR on the 99.5% level: (lhs) VA account log-value q_τ, (middle) short rate r_τ, and (rhs) force of mortality m_{x+τ}; blue color shows the full density and red color shows the conditional density, conditioned on being above the 99.5% VaR of μ̄(X_τ)
530 11 Selected Topics in Deep Learning
We now turn to variational Bayesian inference for network parameters: the network parameter ϑ is equipped with a prior density π(ϑ), and, given data (Y, x), its posterior density satisfies

π(ϑ | Y, x) ∝ f(Y, ϑ | x) = f(Y | ϑ, x) π(ϑ).   (11.60)
A new data point Y† with feature x† has conditional density, given observation (Y, x),

f(y† | x†; Y, x) = ∫ f(y† | ϑ, x†) π(ϑ | Y, x) dν(ϑ).
Since the posterior density π(·|Y, x) is typically intractable, one approximates it within a family of variational densities F = {q(·; θ); θ ∈ Θ}. The optimal approximation within F, for given data (Y, x), is found by solving

θ̂ = θ̂(Y, x) = arg min_{θ ∈ Θ} D_KL( q(·; θ) ‖ π(·|Y, x) );
for the moment we neglect existence and uniqueness questions. A main difficulty is
the computation of this KL divergence because it involves the intractable posterior
density of ϑ, given (Y, x). We modify the optimization problem such that we can
circumvent the explicit calculation of this KL divergence.
Lemma 11.19 We have the following identity

log f(Y|x) = E(θ|Y, x) + D_KL( q(·; θ) ‖ π(·|Y, x) ),

for the (unconditional) density f(y|x) = ∫ f(y|ϑ, x) π(ϑ) dν(ϑ) and the so-called evidence lower bound (ELBO)

E(θ|Y, x) = ∫ q(ϑ; θ) log( f(Y, ϑ|x) / q(ϑ; θ) ) dν(ϑ).
Observe that the left-hand side in the statement of Lemma 11.19 is independent of
θ ∈ . Therefore, minimizing the KL divergence in θ is equivalent to maximizing
the ELBO in θ . This follows exactly the same philosophy as the EM algorithm,
see (6.32), in fact, the ELBO E plays the role of functional Q defined in (6.33).
Proof of Lemma 11.19 We start from the left-hand side of the statement:

log f(Y|x) = ∫ q(ϑ; θ) log f(Y|x) dν(ϑ) = ∫ q(ϑ; θ) log( f(Y, ϑ|x) / π(ϑ|Y, x) ) dν(ϑ)
   = ∫ q(ϑ; θ) log( [ f(Y, ϑ|x)/q(ϑ; θ) ] / [ π(ϑ|Y, x)/q(ϑ; θ) ] ) dν(ϑ)
   = E(θ|Y, x) + D_KL( q(·; θ) ‖ π(·|Y, x) ).   □
Interestingly, the ELBO does not include the posterior density, but only the joint
density of Y and ϑ, given x, which is assumed to be known (available). It can be
rewritten as
E(θ|Y, x) = ∫ q(ϑ; θ) log f(Y, ϑ|x) dν(ϑ) − ∫ q(ϑ; θ) log q(ϑ; θ) dν(ϑ)
   = E_{q(·;θ)}[ log f(Y, ϑ|x) | Y, x ] − E_{q(·;θ)}[ log q(ϑ; θ) ],
the first term being the expected joint log-likelihood of (Y, ϑ ) under the variational
density ϑ ∼ q(·; θ ), and the second term being the entropy of the variational density.
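The decomposition of Lemma 11.19 can be checked numerically in a toy conjugate Gaussian model where every term is available in closed form; the model and the chosen variational density below are our own illustration:

```python
import math

# toy conjugate model: prior theta ~ N(0, 1), observation Y | theta ~ N(theta, 1)
Y = 0.7
log_f_Y = -0.5 * math.log(4 * math.pi) - Y**2 / 4   # marginal: Y ~ N(0, 2)
post_m, post_v = Y / 2, math.sqrt(0.5)              # posterior: N(Y/2, 1/2)

# an arbitrary variational density q = N(m, v^2)
m, v = 0.3, 0.8

# ELBO = E_q[log f(Y | theta)] + E_q[log pi(theta)] + entropy of q
e_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((Y - m)**2 + v**2)
e_logprior = -0.5 * math.log(2 * math.pi) - 0.5 * (m**2 + v**2)
entropy = 0.5 * math.log(2 * math.pi * math.e * v**2)
elbo = e_loglik + e_logprior + entropy

# closed-form KL( N(m, v^2) || N(post_m, post_v^2) )
kl = math.log(post_v / v) + (v**2 + (m - post_m)**2) / (2 * post_v**2) - 0.5
```

Up to floating-point error, log f(Y|x) = ELBO + KL holds for any choice of (m, v), illustrating that maximizing the ELBO is equivalent to minimizing the KL divergence to the posterior.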
The optimal approximation within F for given data (Y, x) is then found by solving

θ̂ = θ̂(Y, x) = arg max_{θ ∈ Θ} E(θ|Y, x).

For data D = {(Y_i, x_i); 1 ≤ i ≤ n}, with observations Y_i being conditionally independent given ϑ, this reads as

θ̂ = arg max_{θ ∈ Θ} E(θ|D)
   = arg max_{θ ∈ Θ} E_{q(·;θ)}[ log( π(ϑ) Π_{i=1}^n f(Y_i | ϑ, x_i) ) | D ] − E_{q(·;θ)}[ log q(ϑ; θ) ]
   = arg max_{θ ∈ Θ} Σ_{i=1}^n E_{q(·;θ)}[ log f(Y_i | ϑ, x_i) | Y_i, x_i ] − E_{q(·;θ)}[ log( q(ϑ; θ)/π(ϑ) ) ]
   = arg max_{θ ∈ Θ} Σ_{i=1}^n E_{q(·;θ)}[ log f(Y_i | ϑ, x_i) | Y_i, x_i ] − D_KL( q(·; θ) ‖ π ).
Typically, one solves this problem with gradient ascent methods, which requires the calculation of the gradient ∇_θ of the objective function on the right-hand side. This is more difficult than plain-vanilla gradient descent in network fitting because θ enters the expectation operator E_{q(·;θ)}.
Kingma–Welling [217] propose to use the following reparametrization trick. Assume that we can receive the random variable ϑ ∼ q(·; θ) by a reparametrization ϑ =(d) t(ε, θ) (equality in distribution) for some smooth function t, where ε ∼ p does not depend on θ. E.g., if ϑ is multivariate Gaussian with mean μ and covariance matrix AA^⊤, then ϑ =(d) μ + Aε for ε being standard multivariate Gaussian. Under the assumption that the reparametrization trick works for the family F = {q(·; θ); θ ∈ Θ} we arrive at, for ε ∼ p,
θ̂ = arg max_{θ ∈ Θ} E(θ|D)   (11.61)
   = arg max_{θ ∈ Θ} Σ_{i=1}^n ( E_p[ log f(Y_i | t(ε, θ), x_i) | Y_i, x_i ] − (1/n) E_p[ log( q(t(ε, θ); θ) / π(t(ε, θ)) ) ] )
   = arg max_{θ ∈ Θ} Σ_{i=1}^n E_p[ log( f(Y_i | t(ε, θ), x_i) π(t(ε, θ))^{1/n} / q(t(ε, θ); θ)^{1/n} ) | Y_i, x_i ].
The gradient of the ELBO is then given by (provided we can exchange E_p and ∇_θ)

∇_θ E(θ|D) = Σ_{i=1}^n E_p[ ∇_θ log( f(Y_i | t(ε, θ), x_i) π(t(ε, θ))^{1/n} / q(t(ε, θ); θ)^{1/n} ) | Y_i, x_i ].
These expected gradients are calculated empirically using Monte Carlo methods. Sample i.i.d. observations ε^{(i,j)} ∼ p, 1 ≤ i ≤ n and 1 ≤ j ≤ m, and consider the empirical approximation

∇_θ E(θ|D) ≈ Σ_{i=1}^n (1/m) Σ_{j=1}^m ∇_θ log( f(Y_i | t(ε^{(i,j)}, θ), x_i) π(t(ε^{(i,j)}, θ))^{1/n} / q(t(ε^{(i,j)}, θ); θ)^{1/n} ).   (11.62)
Using this empirical approximation we can use gradient ascent methods to estimate
θ , known as stochastic gradient variational Bayes (SGVB) estimator, see Sect. 2.4.3
of Kingma–Welling [217], or as Bayes by Backprop, see Blundell et al. [41] and
Jospin et al. [205].
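As a sketch of the SGVB estimator, the following toy implementation fits a mean field Gaussian variational density to a conjugate normal–normal model, using the reparametrization ϑ = m + vε with a single Monte Carlo sample per step (m = 1 in (11.62)); all numerical choices (learning rate, sample size, seed) are ours. Because the variational family contains the true posterior here, the fit should approach the exact posterior mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)

# conjugate toy model: prior theta ~ N(tau0, t0^2), Y_i | theta ~ N(theta, 1)
tau0, t0 = 0.0, 2.0
n = 20
Y = rng.normal(1.5, 1.0, size=n)

# exact posterior for comparison
post_var = 1.0 / (1.0 / t0**2 + n)
post_mean = post_var * (tau0 / t0**2 + Y.sum())

# variational density q = N(m, v^2), v = exp(rho); reparametrize theta = m + v*eps
m, rho, lr = 0.0, 0.0, 0.02
for _ in range(5000):
    v = np.exp(rho)
    eps = rng.standard_normal()
    theta = m + v * eps                      # reparametrization trick
    dll = np.sum(Y - theta)                  # d/dtheta of the log-likelihood
    g_m = dll - (m - tau0) / t0**2           # ELBO gradient w.r.t. m
    g_v = dll * eps - (v / t0**2 - 1.0 / v)  # ELBO gradient w.r.t. v
    m += lr * g_m / n
    rho += lr * g_v * v / n                  # chain rule for rho = log(v)
```

The KL term to the Gaussian prior is differentiated analytically, while the expected log-likelihood gradient is the single-sample reparametrization estimate; this is the Bayes-by-Backprop pattern in its simplest possible form.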
Example 11.20 We consider the gradient (11.62) for an example from the EDF. First, if n is sufficiently large, it often suffices to set m = 1, and we still receive an accurate estimate. In that case we drop the index j, giving ε^{(i)}. Assume that the (conditionally independent) observations Y_i belong to the same member of the EDF having cumulant function κ. Moreover, assume that the (conditional) mean of Y_i, given x_i, can be described by a FN network and a link function g such that, see (7.8),
μ_i = μ(x_i) = μ_ϑ(x_i) = g^{−1}( ⟨β, z_w^{(d:1)}(x_i)⟩ ),

with network parameter ϑ(ε^{(i)}; θ) = (β(ε^{(i)}; θ), w(ε^{(i)}; θ)) = t(ε^{(i)}, θ) ∈ R^r.
Maximizing the ELBO implies that we need to calculate the gradients w.r.t. θ; first, the gradient w.r.t. the network parameter ϑ of the data log-likelihood is computed, see (11.63). The prior distribution is often taken to be the multivariate Gaussian with prior mean τ ∈ R^r and (symmetric and positive definite) prior covariance matrix T ∈ R^{r×r}, thus,
π(ϑ) = ( (2π)^{r/2} |T|^{1/2} )^{−1} exp( −(1/2) (ϑ − τ)^⊤ T^{−1} (ϑ − τ) ).

For the variational family we consider the mean field Gaussian case, where the components of ϑ are independent under q(·; θ),

q(ϑ; θ) = Π_{j=1}^r (2πσ_j²)^{−1/2} exp( −(ϑ_j − μ_j)² / (2σ_j²) ),

with variational parameter θ = (μ_1, σ_1, . . . , μ_r, σ_r) ∈ Θ ⊂ R^K, K = 2r, and reparametrization ϑ =(d) t(ε, θ) = (μ_j + σ_j ε_j)_{1≤j≤r},
with r-dimensional standard Gaussian variable ε ∼ N(0, 1). The Jacobian matrix of θ ↦ t(ε, θ) is

J(θ; ε) = [ 1  ε_1  0  0   · · ·  0  0 ]
          [ 0  0    1  ε_2  · · ·  0  0 ]
          [ ⋮                          ⋮ ]
          [ 0  0    0  0   · · ·  1  ε_r ]   ∈ R^{r×K}.
The mean field Gaussian case provides the entropy of the variational distribution

−E_{q(·;θ)}[ log q(ϑ; θ) ] = Σ_{j=1}^r ( (1/2) log(2πσ_j²) + 1/2 ) = Σ_{j=1}^r log( √(2πe) σ_j ).
This mean field Gaussian variational inference can be implemented with the R
package tfprobability of Keydana et al. [212] and an explicit example is
given in Kuo [230].
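The entropy formula above can also be verified by simulation; a small sketch with arbitrary σ_j values of our choosing:

```python
import math
import random

random.seed(3)
sigmas = [0.5, 1.0, 2.0]

# closed form: -E_q[log q] = sum_j log( sqrt(2*pi*e) * sigma_j )
closed = sum(math.log(math.sqrt(2.0 * math.pi * math.e) * s) for s in sigmas)

# Monte Carlo estimate of the entropy of the mean field Gaussian
m = 200_000
acc = 0.0
for _ in range(m):
    for s in sigmas:
        x = random.gauss(0.0, s)
        # negative log-density of N(0, s^2) evaluated at the sample x
        acc -= -0.5 * math.log(2.0 * math.pi * s * s) - x * x / (2.0 * s * s)
mc = acc / m
```

The Monte Carlo estimate agrees with the closed form up to sampling error, which is why the entropy term of the ELBO is usually inserted analytically rather than simulated.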
Example 11.20, Revisited Working under the assumptions of Example 11.20 and additionally assuming that the family of variational distributions F is multivariate Gaussian, q(·; θ) = N(μ, Σ), leads us after some calculation to (the well-known formula)

D_KL( q(·; θ) ‖ π ) = (1/2) [ log( |T| / |Σ| ) − r + trace( T^{−1} Σ ) + (τ − μ)^⊤ T^{−1} (τ − μ) ].

This further simplifies if T and Σ are diagonal, the latter being the mean field Gaussian case. The remaining terms of the ELBO are treated empirically as in (11.63).
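The well-known Gaussian KL formula can be coded and sanity-checked directly; a sketch for full covariance matrices:

```python
import numpy as np

def kl_gauss(mu, Sigma, tau, T):
    # KL( N(mu, Sigma) || N(tau, T) ) for positive definite Sigma and T
    mu, tau = np.asarray(mu, float), np.asarray(tau, float)
    Sigma, T = np.asarray(Sigma, float), np.asarray(T, float)
    r = len(mu)
    d = tau - mu
    Tinv = np.linalg.inv(T)
    return 0.5 * (np.log(np.linalg.det(T) / np.linalg.det(Sigma)) - r
                  + np.trace(Tinv @ Sigma) + d @ Tinv @ d)
```

For identical arguments the KL divergence vanishes, and in the diagonal case it reduces to a sum of one-dimensional Gaussian KL terms, consistent with the mean field simplification mentioned above.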
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons licence and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons licence, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons licence and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 12
Appendix A: Technical Results on Networks
The reader may have noticed that for GLMs we have developed an asymptotic theory that allowed us to assess the quality of predictors and to validate the fitted models. For networks, such a theory does not yet exist, and the purpose of this appendix is to present more technical results on the asymptotic behavior of FN networks and their estimators that may lead to an asymptotic theory. This appendix hopefully stimulates further research in this field of statistical modeling.
Denote by A_{q0} the set of affine functions A : {1} × R^{q0} → R, x ↦ A(x) = ⟨w, x⟩, w ∈ R^{q0+1}, where we add a 0th component to the feature x = (x_0 = 1, x_1, . . . , x_{q0}) ∈ {1} × R^{q0} for the intercept. Choose a measurable (activation) function φ : R → R and define
Σ_{q0}(φ) = { f : {1} × R^{q0} → R; x ↦ f(x) = Σ_{j=0}^{q1} β_j φ(A_j(x)), A_j ∈ A_{q0}, β_j ∈ R, q1 ∈ N }.

This is the set of all shallow FN networks f(x) = ⟨β, z^{(1:1)}(x)⟩ with activation function φ and the linear output activation, see (7.8); the intercept component of the output is integrated into the 0th component j = 0. Moreover, we define the networks

ΣΠ_{q0}(φ) = { f : {1} × R^{q0} → R; x ↦ f(x) = Σ_{j=0}^{q1} β_j Π_{k=1}^{l_j} φ(A_{j,k}(x)), A_{j,k} ∈ A_{q0}, β_j ∈ R, l_j ∈ N, q1 ∈ N }.

The latter networks contain the former, Σ_{q0}(φ) ⊂ ΣΠ_{q0}(φ), by setting l_j = 1 for all 0 ≤ j ≤ q1. We are going to prove a universality theorem first for the networks ΣΠ_{q0}(φ), and afterwards for the shallow FN networks Σ_{q0}(φ).
Definition 12.1 The function φ : R → [0, 1] is called a squashing function if it is
non-decreasing with limx→−∞ φ(x) = 0 and limx→∞ φ(x) = 1.
Since squashing functions can have at most countably many discontinuities,
they are measurable; a continuous and a non-continuous example are given by the
sigmoid and by the step function activation, respectively, see Table 7.1.
Lemma 12.2 The sigmoid activation function is Lipschitz with constant 1/4.

Proof The derivative of the sigmoid function is given by φ′ = φ(1 − φ). This provides the second derivative φ″ = φ′ − 2φφ′ = φ′(1 − 2φ). The latter is zero for φ(x) = 1/2. This says that the maximal slope of φ is attained for x = 0, and it is φ′(0) = 1/4.   □
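Lemma 12.2 can be confirmed numerically by scanning the slope of the sigmoid over a grid:

```python
import math

def phi(x):
    # sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

# central differences of the sigmoid on a grid over [-10, 10];
# the maximal slope should be phi'(0) = 1/4
h = 1e-6
max_slope = max(
    (phi(x + h) - phi(x - h)) / (2.0 * h)
    for x in (i / 100.0 for i in range(-1000, 1001))
)
```

The maximal slope over the grid matches the Lipschitz constant 1/4 up to discretization error.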
We denote by C(R^{q0}) the set of all continuous functions from {1} × R^{q0} to R, and by M(R^{q0}) the set of all measurable functions from {1} × R^{q0} to R. If the measurable activation function φ is continuous, we have ΣΠ_{q0}(φ) ⊂ C(R^{q0}), otherwise ΣΠ_{q0}(φ) ⊂ M(R^{q0}).
Definition 12.3 A subset S ⊂ M(R^{q0}) is said to be uniformly dense on compacta in C(R^{q0}) if for every compact subset K ⊂ {1} × R^{q0} the set S is ρ_K-dense in C(R^{q0}), meaning that for all ε > 0 and all g ∈ C(R^{q0}) there exists f ∈ S such that ρ_K(f, g) = sup_{x∈K} |f(x) − g(x)| < ε.
For the proof we refer to Lemma A.2 of Hornik et al. [192]; it uses that ψ is a continuous squashing function, implying that for every δ ∈ (0, 1) there exists m > 0 such that ψ(−m) < δ and ψ(m) > 1 − δ. The approximation H ∈ Σ_1(φ) of ψ is then constructed on (−m, m) so that the error bound holds (for δ sufficiently small).
Secondly, choose ε > 0 and M > 0; there exists cos_{M,ε} ∈ Σ_1(φ) such that

sup_{x∈[−M,M]} | cos(x) − cos_{M,ε}(x) | < ε.   (12.2)
This is Lemma A.3 of Hornik et al. [192]; to prove this, we consider the cosine squasher of Gallant–White [150], for x ∈ R,

χ(x) = (1/2) ( 1 + cos( x + 3π/2 ) ) 1_{−π/2 ≤ x ≤ π/2} + 1_{x > π/2} ∈ [0, 1].
Universality Theorem 12.5 tells us that we can approximate any compactly sup-
ported continuous function arbitrarily well by a sufficiently large shallow FN
network, say, with sigmoid activation function φ. The next natural question is
whether we can learn these approximations from data (Yi , x i )i≥1 that follow the true
but unknown regression function x → μ0 (x), or in other words whether we have
consistency for a certain class of learning methods. This is the question addressed,
e.g., in White [379, 380], Barron [26], Chen–Shen [73], Döhler–Rüschendorf [109]
and Shen et al. [336]. This turns the algebraic universality question into a statistical
question about consistency.
12.2 Consistency and Asymptotic Normality 541
μ̂_n = arg min_{μ∈C(X)} (1/n) Σ_{i=1}^n L(Y_i, μ(x_i)) = arg min_{μ∈C(X)} (1/n) Σ_{i=1}^n (Y_i − μ(x_i))²,   (12.4)

where C(X) denotes the set of continuous functions on the compact set X ⊂ {1} × R^{q0}. The main question is whether the estimator μ̂_n approaches the true regression function μ_0 for increasing sample size n.
Typically, the family of continuous functions C(X ) is much too rich to be able to
solve optimization problem (12.4), and the solution may have undesired properties.
In particular, the solution to (12.4) will over-fit to the data for any sample size
n, and consistency will not hold, see, e.g., Section 2.2.1 in Chen [72]. Therefore,
the optimization needs to be done over (well-chosen) smaller sets Sn ⊂ C(X ).
For instance, Sn can be the set of shallow FN networks having a maximal width
q1 = q1 (n), depending on the sample size n of the data. Considering this regression
problem in a non-parametric sense, we let these sets S_n grow with the sample size n. This idea is attributed to Grenander [172], and it is called the method of sieve estimators of μ_0. We define for d ∈ N, ξ > 0, ξ̃ > 0 and activation function φ
S(d, ξ, ξ̃, φ) = { f ∈ Σ_{q0}(φ); q1 = d, Σ_{j=0}^{q1} |β_j| ≤ ξ, max_{1≤j≤q1} Σ_{l=0}^{q0} |w_{l,j}| ≤ ξ̃ }.
¹ The bound Σ_{j=0}^{q1} |β_j| ≤ ξ in S(d, ξ, ξ̃, φ) allows us to view this set of shallow FN networks as a symmetric convex hull of the family of functions S_0(φ) = {x ↦ φ(A(x)); A ∈ A_{q0}}, see Sect. 2.6.3 in Van der Vaart–Wellner [364]. If we choose an increasing activation function φ, this family of functions φ ∘ A is a composition of a fixed increasing function φ and a finite dimensional vector space A_{q0} of functions A. This implies that S_0(φ) is a VC-class, saying that it has a finite Vapnik–Chervonenkis (VC) dimension [365]; see also Condition A and Theorem 2.1 in Döhler–Rüschendorf [109]. The VC-class property is important in many proofs as it leads to a finite covering (metric entropy) of function spaces, and this allows one to apply limit theorems to point processes; we refer to Van der Vaart–Wellner [364].
. . . ⊆ S_n(φ) := S(d_n, ξ_n, ξ̃_n, φ) ⊆ S_{n+1}(φ) := S(d_{n+1}, ξ_{n+1}, ξ̃_{n+1}, φ) ⊆ . . . .
μ̂_n = arg min_{μ∈S_n(φ)} (1/n) Σ_{i=1}^n L(Y_i, μ(x_i)).   (12.5)
² A probability space (Ω, A, P) is complete if for any P-null set B ∈ A with P[B] = 0 and every subset A ⊂ B it follows that A ∈ A.
Remarks 12.9
• Such a consistency result for FN networks was first proved in Theorem 3.3 of White [380], however, on slightly different spaces and under slightly different assumptions. Similar consistency results have been obtained for related point process situations by Döhler–Rüschendorf [109], and for time series in White [380] and Chen–Shen [73].
• Item (3) of Assumption 12.7 gives upper complexity bounds on shallow FN
networks as a function of the sample size n of the data, so that asymptotically
they do not over-fit to the data. These bounds allow for much freedom in the
choice of the growth rates, and different choices may lead to different speeds of
convergence. The conditions of Assumption 12.7 are, e.g., satisfied for ξ_n = O(log n) and d_n = O(n^{1−δ}), for any small δ > 0. Under these choices, the complexity d_n of the shallow FN network grows rather quickly. Table 1 of White [380] gives some examples; for instance, if for n = 100 data points we have a shallow FN network with 5 neurons, then these magnitudes support 477 neurons for n = 10'000 and 45'600 neurons for n = 1'000'000 data points (for the specific choice δ = 0.01). Of course, these numbers do not provide any practical guidance on the selection of the (shallow) FN network size.
• Theorem 12.8 requires that we can explicitly calculate the sieve estimator μ̂_n, i.e., the global minimizer of the objective function in (12.5). In practical applications, relying on gradient descent algorithms, this is typically not the case. Therefore, Theorem 12.8 is mainly of theoretical value, saying that learning the true regression function μ_0 is possible within FN networks.
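The neuron counts cited from Table 1 of White [380] in the second remark follow from the power law d_n ∝ n^{1−δ}; a quick check, with the anchoring at 5 neurons for n = 100 taken from the text:

```python
delta = 0.01

def d(n, base_n=100, base_neurons=5):
    # d_n grows like n^(1 - delta), anchored at base_neurons for base_n points
    return base_neurons * (n / base_n) ** (1.0 - delta)

print(d(10_000), d(1_000_000))  # close to 477 and 45'600
```

This reproduces the magnitudes quoted in the remark for δ = 0.01.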
Sketch of Proof of Theorem 12.8 The proof of this theorem is based on a theorem in White–Woolridge [381], which states the following: if we have a sequence (S_n(φ))_{n≥1} of compact subsets of C(X), and if L_n : Ω × S_n(φ) → R is an A ⊗ B(S_n(φ))/B(R)-measurable sequence, n ≥ 1, with L_n(ω, ·) being lower-semicontinuous on S_n(φ) for all ω ∈ Ω, then there exists μ̂_n : Ω → S_n(φ) being A/B(S_n(φ))-measurable such that for each ω ∈ Ω, L_n(ω, μ̂_n(ω)) = min_{μ∈S_n(φ)} L_n(ω, μ). For the proof of the
compactness of S_n(φ) in C(X) we need that d_n and ξ_n are finite for any n. This then provides the existence of the sieve estimator; for details we refer to Lemma 2.1 and Corollary 2.1 in Shen et al. [336]. The proof of the consistency result then uses the growth rates on (d_n)_{n≥1} and (ξ_n)_{n≥1}; for the details of the proof we refer to Theorem 3.1 in Shen et al. [336].
The next step is to analyze the rates of convergence of the sieve estimator μ̂_n → μ_0, as n → ∞. These rates heavily depend on (additional) regularity assumptions on the true regression function μ_0 ∈ C(X); we refer to Remark 3 in Sect. 5 of Chen–Shen [73]. Here, we present some results of Shen et al. [336].
From the proof of Theorem 12.8 we know that S_n(φ) is a compact set in C(X). This motivates to consider the closest approximation π_n μ ∈ S_n(φ) to μ ∈ C(X). The uniform denseness of ∪_{n≥1} S_n(φ) in C(X) implies that π_n μ converges to μ. The aforementioned rates of convergence of the sieve estimators will depend on how fast π_n μ_0 ∈ S_n(φ) converges to the true regression function μ_0 ∈ C(X).
If one cannot determine the global minimum of (12.5), then often an accurate
approximation is sufficient. For this one introduces an approximate sieve estimator.
A sequence (μ̂_n)_{n≥1} is called an approximate sieve estimator if

(1/n) Σ_{i=1}^n (Y_i − μ̂_n(X_i))² ≤ inf_{μ∈S_n(φ)} (1/n) Σ_{i=1}^n (Y_i − μ(X_i))² + O_P(η_n).   (12.6)
Theorem 12.10 (Theorem 4.1 of Shen et al. [336], Without Proof) Set Assumption 12.7. If

η_n = O( min{ ‖π_n μ_0 − μ_0‖_n², d_n log(d_n ξ_n)/n, d_n log n / n } ),
Remarks 12.11
• Assumption 12.7 implies that d_n log(d_n ξ_n) = o(n) as n → ∞. Therefore, η_n → 0 as n → ∞.
• The statement in Theorem 4.1 of Shen et al. [336] is more involved because it is stated under slightly different assumptions. Our assumptions are sufficient for having consistency of the sieve estimator, see Theorem 12.8, and making these assumptions implies that the rate of convergence in Theorem 12.10 is determined by the rate of convergence of ‖π_n μ_0 − μ_0‖_n and (n^{−1} d_n log n)^{1/2}, see Remark 4.1 in Shen et al. [336].
• The rate of convergence in Theorem 12.10 crucially depends on the rate ‖π_n μ_0 − μ_0‖_n, as n → ∞. If μ_0 lies in the (sub-)space of functions with finite first absolute moments of the Fourier magnitude distributions, denoted by F(X) ⊂ C(X), Makovoz [262] has shown that ‖π_n μ_0 − μ_0‖_n decays at least as d_n^{−(q_0+1)/(2q_0)} = d_n^{−1/2−1/(2q_0)}; this has improved the rate of d_n^{−1/2} obtained by Barron [25]. This space F(X) allows for the choices d_n = (n/log n)^{q_0/(2+q_0)}, ξ_n ≡ ξ > 0 and ξ̃_n ≡ ξ̃ > 0 to receive consistency and the following rate of convergence, see Chen–Shen [73] and Remark 4.1 in Shen et al. [336],

‖μ̂_n − μ_0‖_n = O_P(r_n^{−1}),

for

r_n = ( n / log n )^{(q_0+1)/(4q_0+2)},   n ≥ 2.   (12.7)
Note that 1/4 ≤ (q_0 + 1)/(4q_0 + 2) ≤ 1/2. Thus, this is a slower rate than the square root rule of typical asymptotic normality; for instance, for q_0 = 1 we get the exponent 1/3. Interestingly, Barron [26] proposes the choice d_n ∼ (n/log n)^{1/2} to receive an approximation rate of (n/log n)^{−1/4}.
Also note that the space F(X) allows us to choose a finite ξ_n ≡ ξ > 0 in the sieves; thus, here we do not receive denseness of the sieves in the space of continuous functions C(X), but only in the space of functions with finite first absolute moments of the Fourier magnitude distributions F(X).
The last step is to establish the asymptotic normality. For this we have to define perturbations of shallow FN networks μ ∈ S_n(φ). Choose η_n ∈ (0, 1) and define the function

μ*_n(μ) = (1 − η_n^{1/2}) μ + η_n^{1/2} (μ_0 + 1).
(1/√n) Σ_{i=1}^n ( μ̂_n(X_i) − μ_0(X_i) ) ⇒ N(0, σ²).
In practice, the variance parameter σ² is unknown and is estimated by

σ̂_n² = (1/n) Σ_{i=1}^n (Y_i − μ̂_n(X_i))².
Theorem 5.2 in Shen et al. [336] proves that this estimator is consistent for
σ 2 , and the asymptotic normality result also holds true under this estimated
variance parameter (using Slutsky’s theorem), and under the same assumptions as
in Theorem 12.12.
12.3 Functional Limit Theorem

Horel–Giesecke [190] push the above asymptotic results even one step further. Note that the asymptotic normality of Theorem 12.12 is not directly useful for variable selection, since the asymptotic result integrates over the feature space X. Horel–Giesecke [190] prove a functional limit theorem which we briefly review in this section.
A q_0-tuple α = (α_1, . . . , α_{q_0}) ∈ N_0^{q_0} is called a multi-index, and we set |α| = Σ_{j=1}^{q_0} α_j. The corresponding partial derivative operator is

∇_α = ∂^{|α|} / ( ∂x_1^{α_1} · · · ∂x_{q_0}^{α_{q_0}} ).
Consider the compact feature space X = {1} × [0, 1]^{q_0} with q_0 ≥ 3. Choose a distribution ν on this feature space X and define the L²-space

L²(X, ν) = { μ : X → R measurable; E_ν[μ(X)²] = ∫_X μ(x)² dν(x) < ∞ }.
The normed Sobolev space (W^{k,2}(X, ν), ‖·‖_{k,2}) is a Hilbert space. Since we would like to consider gradient-based methods, we consider the following space

C¹_B(X, ν) = { μ : X → R continuously differentiable; ‖μ‖_{⌊q_0/2⌋+2,2} ≤ B },   (12.8)

for some positive constant B < ∞. We will assume that the true regression function μ_0 ∈ C¹_B(X, ν); thus, the true regression function has a bounded Sobolev norm ‖·‖_{⌊q_0/2⌋+2,2} of maximal size B. Assume that X̊ ⊂ R^{q_0} is the open interior of X (excluding the intercept component), and that ν is absolutely continuous w.r.t. the Lebesgue measure with a strictly positive and bounded density on X (excluding the intercept component). The Sobolev number of the space W^{⌊q_0/2⌋+2,2}(X̊, ν) is given by m = ⌊q_0/2⌋ + 2 − q_0/2 ≥ 1.5 > 1. The Sobolev embedding theorem then tells us that for any function μ ∈ W^{⌊q_0/2⌋+2,2}(X̊, ν) there exists a ⌊m⌋-times continuously differentiable function on X̊ that is equal to μ a.e.; thus, the class of equivalent functions μ ∈ W^{⌊q_0/2⌋+2,2}(X̊, ν) has a representative in C¹(X̊), since ⌊m⌋ ≥ 1. This motivates the consideration of the space in (12.8).
In practice, the bound B needs careful consideration because the true μ_0 is unknown. Therefore, B should be sufficiently large so that μ_0 is contained in the space C¹_B(X, ν) and, on the other hand, it should not be too large, as this will weaken the power of the tests below.
We choose the sigmoid activation function for φ and we consider the approximate sieve estimators (μ̂_n)_{n≥1} for given data (Y_i, X_i)_i obtained by a solution to

(1/n) Σ_{i=1}^n (Y_i − μ̂_n(X_i))² ≤ inf_{μ∈S_n(φ)} (1/n) Σ_{i=1}^n (Y_i − μ(X_i))² + o_P(1),   (12.9)

where we allow for an error term o_P(1) that converges in probability to zero as n → ∞. In contrast to (12.6), we do not specify the error rate here.
Assumption 12.13 Choose a complete probability space (Ω, A, P) and X = {1} × [0, 1]^{q_0}.
(1) Assume μ_0 ∈ C¹_B(X, ν) for some B > 0, and that (Y_i, X_i)_{i≥1} are i.i.d. on (Ω, A, P) following the regression structure (12.3), with ε_i being centered, having E[|ε_i|^{2+δ}] < ∞ for some δ > 0, being absolutely continuous w.r.t. the Lebesgue measure, and being independent of X_i; the features X_i ∼ ν are absolutely continuous w.r.t. the Lebesgue measure having a bounded and strictly positive density on X (excluding the intercept component). Set σ² = Var(ε_i) < ∞.
(2) The activation function φ is the sigmoid function.
(3) The sequence (d_n)_{n≥1} is increasing and going to infinity, satisfying d_n^{2+1/q_0} log(d_n) = O(n) as n → ∞, and ξ_n ≡ ξ > 0, ξ̃_n ≡ ξ̃ > 0 for n ≥ 1.
(4) Define L_μ(X, ε) = −2ε(μ(X) − μ_0(X)) + (μ(X) − μ_0(X))², and it holds for n ≥ 2

(1/√n) Σ_{i=1}^n ( L_{μ̂_n}(X_i, ε_i) − E_ν[ L_{μ̂_n}(X_1, ε_1) ] )
   ≤ inf_{h ∈ C¹_B(X,ν)} (1/√n) Σ_{i=1}^n ( L_{μ_0 + h/r_n}(X_i, ε_i) − E_ν[ L_{μ_0 + h/r_n}(X_1, ε_1) ] ) + o_P(r_n^{−1}).

Theorem 12.14 Set Assumption 12.13. Then we have the convergence in distribution

r_n ( μ̂_n − μ_0 ) ⇒ μ#   as n → ∞,
where μ# is the arg max of the Gaussian process {G_μ; μ ∈ C¹_B(X, ν)} with mean zero and covariance function Cov(G_μ, G_{μ′}) = 4σ² E_ν[μ(X) μ′(X)].
Remarks 12.15 We highlight the differences between Theorems 12.12 and 12.14.
• Theorem 12.12 provides a convergence in distribution to a Gaussian random variable, whereas the limit in Theorem 12.14 is a random function x ↦ μ#(x) = μ#_ω(x), ω ∈ Ω; thus, the former convergence result integrates over the (empirical) feature distribution, whereas the latter also allows for a point-wise consideration in the feature x.
• The former theorem does not allow for variable selection in X whereas the latter
does because the limiting function still discriminates different feature values.
• For the proof of Theorem 12.14 we refer to Horel–Giesecke [190]. It is based
on asymptotic results on empirical point processes; we refer to Van der Vaart–
Wellner [364]. The Gaussian process {Gμ ; μ ∈ CB1 (X , ν)} is parametrized by the
(totally bounded) space $C^1_B(\mathcal{X}, \nu)$, and it is continuous over this compact index space. This implies that it attains its maximum. Uniqueness of the maximum then gives us the random function $\mu^{\#}$ which exactly describes the limiting distribution of $r_n(\widehat{\mu}_n - \mu_0)$ as n → ∞.
Theorem 12.14 can be used to provide a significance test for feature component
selection, similarly to the LRT and the Wald test presented in Sect. 5.3.2 on GLMs.
We define gradient-based test statistics, for $1 \le j \le q_0$, and w.r.t. the approximate sieve estimator $\widehat{\mu}_n \in S_n(\phi)$ given in (12.9),
$$\widehat{\lambda}_j^{(n)} = \int_{\mathcal{X}} \left( \frac{\partial \widehat{\mu}_n(x)}{\partial x_j} \right)^2 d\nu(x) \qquad \text{and} \qquad \widetilde{\lambda}_j^{(n)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{\partial \widehat{\mu}_n(X_i)}{\partial x_j} \right)^2.$$
The test statistic $\widehat{\lambda}_j^{(n)}$ integrates the squared partial derivative of the sieve estimator $\widehat{\mu}_n$ w.r.t. the distribution ν, whereas $\widetilde{\lambda}_j^{(n)}$ can be considered as its empirical counterpart if $X \sim \nu$. Note that both test statistics depend on the data $(Y_i, X_i)_{1\le i \le n}$ determining the sieve estimator $\widehat{\mu}_n$, see (12.9). These test statistics are used to test the following null hypothesis $H_0$ against the alternative hypothesis $H_1$ for the true regression function $\mu_0 \in C^1_B(\mathcal{X}, \nu)$:
$$H_0:\ \lambda_j = \mathbb{E}_\nu\!\left[ \left( \frac{\partial \mu_0(X)}{\partial x_j} \right)^2 \right] = 0 \qquad \text{against} \qquad H_1:\ \lambda_j \neq 0. \qquad (12.10)$$
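The empirical statistic can be approximated for any fitted regression function by central finite differences; the following is a minimal R sketch, where `mu_hat` (a fitted regression function) and the feature matrix are hypothetical stand-ins rather than objects from the text:

```r
# Sketch: empirical gradient-based test statistic via central finite
# differences; `mu_hat` and the feature matrix `X` are hypothetical.
grad_test_stat <- function(mu_hat, X, j, eps = 1e-5) {
  Xp <- X; Xp[, j] <- Xp[, j] + eps
  Xm <- X; Xm[, j] <- Xm[, j] - eps
  d <- (mu_hat(Xp) - mu_hat(Xm)) / (2 * eps)  # approximate partial derivative in x_j
  mean(d^2)                                   # empirical squared-gradient statistic
}

# toy check: mu(x) = 2 x_1 does not depend on x_2
mu_hat <- function(X) 2 * X[, 1]
X <- matrix(runif(20), nrow = 10, ncol = 2)
grad_test_stat(mu_hat, X, 1)  # close to 4
grad_test_stat(mu_hat, X, 2)  # 0
```

A component whose statistic is close to zero is a candidate for removal under the null hypothesis (12.10).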
These random variables G(1), . . . , G(T ) play the role of discretized random
samples of the Gaussian process {Gμ ; μ ∈ CB1 (X , ν)}.
3. The empirical arg max of the sample $G^{(t)}$, $1 \le t \le T$, is obtained component-wise, where $G^{(t)}_{f_k}$ denotes the $k$-th component of $G^{(t)}$, i.e., under the null hypothesis $H_0$ we approximate the right-hand side of (12.11) by the empirical distribution of $(\widetilde{\lambda}_j^{(t)})_{1 \le t \le T}$.
Chapter 13
Appendix B: Data and Examples
We consider a French motor third party liability (MTPL) claims data set. This data
set is available through the R library CASdatasets1 being hosted by Dutang–
Charpentier [113]. The specific data sets chosen from CASdatasets are called
FreMTPL2freq and FreMTPL2sev, the former contains the insurance policy
and claim frequency information and the latter the corresponding claim severity
information.2
Before we can work with this data set we perform data cleaning. It has been
pointed out by Loser [259] that the claim counts on the insurance policies with
policy IDs ≤ 24500 in FreMTPL2freq do not seem to be correct because these
claims do not have claim severity counterparts in FreMTPL2sev. For this reason
we work with the claim counts extracted from the latter file. In Listing 13.1 we give
the code used for data cleaning.3 In this code we merge FreMTPL2freq with the aggregated severities on each insurance policy, where the corresponding claim counts are obtained from FreMTPL2sev; this is done on lines 2–11 of Listing 13.1. A further inspection of the data indicates that policies with more than 5 claims may be data errors because they all seem to belong to the same driver (and they have very short exposures).4 For this reason we drop these records on line 12. On line 13 we
censor exposures at one accounting year (since these policies are active within one
calendar year). Finally, on lines 15–16 we re-level the VehBrands.5 All subsequent
analysis is based on this cleaned data set.
Listing 13.1 Data cleaning applied to the French MTPL data set
1 #
2 data(freMTPL2freq)
3 dat <- freMTPL2freq[, -2]
4 dat$VehGas <- factor(dat$VehGas)
5 data(freMTPL2sev)
6 sev <- freMTPL2sev
7 sev$ClaimNb <- 1
8 dat0 <- aggregate(sev, by=list(IDpol=sev$IDpol), FUN = sum)[c(1,3:4)]
9 names(dat0)[2] <- "ClaimTotal"
10 dat <- merge(x=dat, y=dat0, by="IDpol", all.x=TRUE)
11 dat[is.na(dat)] <- 0
12 dat <- dat[which(dat$ClaimNb <=5),]
13 dat$Exposure <- pmin(dat$Exposure, 1)
14 sev <- sev[which(sev$IDpol %in% dat$IDpol), c(1,2)]
15 dat$VehBrand <- factor(dat$VehBrand, levels=c("B1","B2","B3","B4","B5","B6",
16 "B10","B11","B12","B13","B14"))
Listing 13.2 gives an excerpt of the cleaned French MTPL data set; lines 2–14 give the insurance policy and claim counts information, and lines 17–18
4Short exposure policies may also belong to a commercial car rental company.
5The data set FreMTPLfreq of CASdatasets is a subset of FreMTPL2freq with slightly
changed feature components, for instance, the former data set contains car brand names in a more
aggregated version than the latter, see Table 13.2, below.
13.1 French Motor Third Party Liability Data 555
[Figure: map of the French regions with region codes: Île-de-France R11, Champagne-Ardenne R21, Picardie R22, Haute-Normandie R23, Centre R24, Basse-Normandie R25, Bourgogne R26, Nord-Pas-de-Calais R31, Lorraine R41, Alsace R42, Franche-Comté R43, Pays de la Loire R52, Bretagne R53, Poitou-Charentes R54, Aquitaine R72, Midi-Pyrénées R73, Limousin R74, Rhône-Alpes R82, Auvergne R83, Languedoc-Roussillon R91, Provence-Alpes-Côte d'Azur R93, Corse R94]
556 13 Appendix B: Data and Examples
Fig. 13.2 (lhs) Histogram of Exposure, (middle) boxplot of Exposure, (rhs) number of
observed claims ClaimNb of the French MTPL data
unusual. A further inspection of the data indicates that policy renewals during the
year account for two separate records in the data set. Of course, such split policies
should be merged to one yearly policy. Unfortunately, we do not have the necessary
information to perform this merger, therefore, we need to work with the data as it is.
In Table 13.1 and Fig. 13.2 (rhs) we split the portfolio w.r.t. the number of claims.
On 653’069 insurance policies (amounting to a total exposure of 341’090 years-
at-risk) we do not have any claim, and on the remaining 24’938 policies (17’269
years-at-risk) we have at least one claim. The overall portfolio claim frequency
(w.r.t. Exposure) is λ = 7.35%.
We study the split of this overall frequency λ = 7.35% across the different
feature levels. This empirical analysis is crucial for the model choice in regression
modeling.6 For the empirical analysis we provide 3 different types of graphs for each
feature component (where applicable), these are given in Figs. 13.3, 13.4, 13.5, 13.6,
13.7, 13.8, 13.9, 13.10, and 13.11. The first graph (lhs) gives the split of the total
exposure to the different feature levels, the second graph (middle) gives the average
feature value in each French region (green meaning low and red meaning high),7
and the third graph (rhs) gives the observed average frequency per feature level. This
observed frequency is obtained by dividing the total number of claims by the total
exposure per feature level. The frequencies are complemented by confidence bounds of two estimated standard deviations (shaded area). The standard deviations are estimated under
6 The empirical analysis in these notes differs from Noll et al. [287] because data cleaning has been
done differently here, we refer to Listing 13.1.
7 We acknowledge the use of UNESCO (1987) database through UNEP/GRID-Geneva for the
French map.
Fig. 13.3 (lhs) Histogram of exposures per Area code, (middle) average Area code per
Region, we map (A, . . . , F ) → (1, . . . , 6), (rhs) observed frequency per Area code
Fig. 13.4 (lhs) Histogram of exposures per VehPower, (middle) average VehPower per
Region, (rhs) observed frequency per VehPower
Fig. 13.5 (lhs) Histogram of exposures per VehAge (censored at 20), (middle) average VehAge
per Region, (rhs) observed frequency per VehAge
Fig. 13.6 (lhs) Histogram of exposures per DrivAge (censored at 90), (middle) average
DrivAge per Region, (rhs) observed frequency per DrivAge (y-scale is different compared
to the other frequency plots)
Fig. 13.7 (lhs) Histogram of exposures per BonusMalus level (censored at 150), (middle)
average BonusMalus level per Region, (rhs) observed frequency per BonusMalus level (y-
scale is different compared to the other frequency plots)
a Poisson assumption; thus, they are obtained by $\pm 2\sqrt{\widehat{\lambda}_k / \mathrm{Exposure}_k}$, where $\widehat{\lambda}_k$ is the observed frequency and $\mathrm{Exposure}_k$ is the total exposure for a given
feature level k. We note that in all frequency plots the y-axis ranges from 0% to
20%, except in the BonusMalus plot where the maximum is set to 60%, and the
DrivAge plot where the maximum is set to 40%. From these plots we conclude
that some levels have only a small underlying Exposure; BonusMalus leads to
the highest variability in frequencies followed by DrivAge; and there is quite some
heterogeneity.
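The recipe above (observed frequency per level plus/minus two Poisson standard deviations) can be sketched in R as follows; the function name and toy inputs are ours, the intended inputs being the claim counts, exposures and one feature column of the cleaned data from Listing 13.1:

```r
# Observed frequency per feature level with +/- 2 standard deviation
# bounds under the Poisson assumption described in the text.
freq_by_level <- function(claims, exposure, level) {
  N <- tapply(claims, level, sum)      # total claim count per level
  E <- tapply(exposure, level, sum)    # total exposure per level
  lambda <- N / E                      # observed frequency per level
  se <- sqrt(lambda / E)               # estimated standard deviation
  data.frame(level = names(N),
             lambda = as.numeric(lambda),
             lower = as.numeric(lambda - 2 * se),
             upper = as.numeric(lambda + 2 * se))
}
```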
Table 13.2 gives the assignment of the different VehBrand levels to car
brands. This list has been compiled from the two data sets FreMTPLfreq
and FreMTPL2freq contained in the R package CASdatasets [113], see
Footnote 5.
Next, we analyze collinearity between the feature components. For this we calculate
Pearson’s correlation and Spearman’s Rho for the continuous feature components,
see Table 13.3. In general, these correlations are low, except for DrivAge
vs. BonusMalus. Of course, the latter is very sensible because a BonusMalus
Fig. 13.8 (lhs) Histogram of exposures per VehBrand, (rhs) observed frequency per
VehBrand; for VehBrand assignment we refer to Table 13.2
Fig. 13.9 (lhs) Histogram of exposures per VehGas, (middle) average VehGas per Region
(diesel is green and regular red), (rhs) observed frequency per VehGas
Fig. 13.10 (lhs) Histogram of exposures per population Density (on log-scale), (middle)
average population Density per Region, (rhs) observed frequency per population Density;
in general, we always consider Density on the log-scale
Fig. 13.11 (lhs) Histogram of exposures Exposure, and (middle, rhs) observed claim frequen-
cies per Region in France (prior to 2016)
Table 13.3 Correlations in feature components: top-right shows Pearson's correlation; bottom-left shows Spearman's Rho; Density is considered on the log-scale; significant correlations are boldface

              VehPower   VehAge   DrivAge   BonusMalus   Density
VehPower          –       −0.01     0.03      −0.08        0.01
VehAge           0.00       –      −0.06       0.08       −0.10
DrivAge          0.04     −0.08      –        −0.48       −0.05
BonusMalus      −0.07      0.08    −0.57        –          0.13
Density         −0.01     −0.10    −0.05       0.14         –
level below 100 needs a certain number of driving years without claims. We give the
corresponding boxplot in Fig. 13.12 (lhs) which confirms this negative correlation.
Figure 13.12 (rhs) gives the boxplot of log-Density vs. Area code. From this
plot we conclude that the area code has likely been set w.r.t. the log-Density.
For our regression models this means that we can drop the area code information,
and we should only work with Density. Nevertheless, we will use the area code
to show what happens in case of collinear feature components, i.e., if we replace
(A, . . . , F ) → (1, . . . , 6).
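The correlation analysis can be reproduced along these lines; a sketch under the assumption that `dat` carries the column names of Listing 13.1 (the helper name is ours):

```r
# Pearson and Spearman correlation matrices of the continuous features;
# Density enters on the log-scale and Area (A,...,F) is mapped to
# (1,...,6), as described in the text.
cont_correlations <- function(dat) {
  cont <- data.frame(
    VehPower   = dat$VehPower,
    VehAge     = dat$VehAge,
    DrivAge    = dat$DrivAge,
    BonusMalus = dat$BonusMalus,
    Density    = log(dat$Density),
    Area       = match(dat$Area, LETTERS[1:6])   # (A,...,F) -> (1,...,6)
  )
  list(pearson  = cor(cont, method = "pearson"),
       spearman = cor(cont, method = "spearman"))
}
```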
Figure 13.13 illustrates each continuous feature component w.r.t. the different
VehBrands. Vehicle brands B10 and B11 (Mercedes, Chrysler and BMW) have
more VehPower than other cars, B10 being more likely a diesel car, and vehicle
brand B12 (Japanese and Korean cars) has comparably new cars in more densely
populated French regions.
Fig. 13.12 Boxplots (lhs) BonusMalus vs. DrivAge, (rhs) log-Density vs. Area code;
these plots are inspired by Fig. 2 in Lorentzen–Mayer [258]
Fig. 13.13 Distribution of the variables VehPower, VehAge, DrivAge, BonusMalus, log-
Density, VehGas for each car brand VehBrand, individually
Table 13.4 Cramér's V for the categorical feature components vs. the categorized continuous components

            VehPower   VehAge   DrivAge   BonusMalus   log-Density   VehGas   Region
VehBrand      0.16      0.17     0.06       0.03          0.05        0.12     0.13
Region        0.04      0.09     0.05       0.04          0.24        0.09
Area                                                      0.87
[Fig. 13.14: relative frequency of the car brands VehBrand within each French Region]
We scale the test statistic to the interval [0, 1] by dividing it by the comonotonic (maximal dependent) case and by the sample size n. This motivates Cramér's V
$$V = \sqrt{\frac{\chi^2/n}{\min\{m_1 - 1,\, m_2 - 1\}}} \;\in\; [0, 1].$$
Section 7.2.3 of Cohen [78] gives a rule of thumb for small, medium and large dependence. Cohen [78] calls the association between $x_1$ and $x_2$ small if $V\sqrt{\min\{m_1-1,\, m_2-1\}}$ is less than 0.1, of medium strength if this value is of size 0.3, and a large effect if it is around 0.5. Our results are presented in Table 13.4. Clearly, there is some association between VehBrand and both VehPower and VehAge; this can also be seen from Fig. 13.13. For the remaining variables the dependence is somewhat weaker. Not surprisingly, Cramér's V shows the largest value between Region and log-Density.
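A minimal R sketch of Cramér's V as just defined; `chisq.test()` supplies the χ² statistic (with the continuity correction switched off):

```r
# Cramér's V for two categorical vectors, following the formula above.
cramers_v <- function(x1, x2) {
  tab <- table(x1, x2)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  m1 <- nrow(tab)
  m2 <- ncol(tab)
  as.numeric(sqrt((chi2 / sum(tab)) / min(m1 - 1, m2 - 1)))
}
```

For comonotonic inputs V equals one; for independent inputs it is close to zero.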
In Fig. 13.14 we show the VehBrands in the different French Regions; Cramér's V is 0.13 for these two categorical variables, and multiplying with $\sqrt{11-1}$ gives a value bigger than 0.4, which is a considerable association according to Cohen [78]. We note that in some regions the French car brands B1 and B2 are very dominant, whereas on the Isle of Corse (R94) 80% of the cars in our portfolio are Japanese
Fig. 13.15 Empirical density and log-log plots of the observed claim amounts
or Korean cars B12. Our portfolio has its biggest exposure in Region R24, see
Fig. 13.11, in this region French cars are predominant.
Next, we study the claim sizes of this French MTPL example. Figure 13.15 shows
the empirical density plot and the log-log plot. These two plots already illustrate the
main difficulty we often face in claim size modeling. From the empirical density
plot we observe that there are many payments of fixed size (red vertical lines) which
do not match any absolutely continuous distribution function assumption. The log-
log plot indicates heavy-tailedness: asymptotically we observe a straight line with negative slope on the log-log scale, which points to regularly varying tails; thus, the EDF is not a suitable model on the original observation scale.
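The log-log plot is simply the empirical survival function on doubly logarithmic axes; a small R sketch (helper name ours), where an asymptotically linear decay with negative slope indicates a regularly varying tail:

```r
# Coordinates of the log-log plot: log claim size against the log of the
# empirical survival probability.
loglog_points <- function(z) {
  z <- sort(z)
  n <- length(z)
  surv <- 1 - (1:n) / (n + 1)   # empirical survival probabilities
  data.frame(logz = log(z), logsurv = log(surv))
}
# plotting: plot(logsurv ~ logz, data = loglog_points(claims))
```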
Figure 13.16 gives the boxplots of the claim sizes per feature level (we omit the
claims outside the whiskers because heavy-tailedness would distort the picture). The
empirical mean in orange is much bigger than the median in red color, which also
expresses the heavy-tailedness. From these plots we conclude that the claim sizes
seem less sensitive in feature values which may question the use of a regression
model for claim sizes.
Figure 13.17 shows the density plots for different feature levels. Interestingly, it
seems that the features determine the sizes of the modes, for instance, if we focus
on Area, Fig. 13.17 (top-left), we see that the area codes mainly influence the sizes
of the modes. This may be interpreted by modes corresponding to different claim
types which occur at different frequencies among the area codes.
Our second example considers the Swedish motorcycle data which was originally used in Ohlsson–Johansson [290]. It is available through the R library
13.2 Swedish Motorcycle Data 565
Fig. 13.16 Boxplots of claim sizes per feature level: these plots omit the claims outside the
whiskers; red color shows the median and orange color the empirical mean
Fig. 13.17 Empirical claim size densities split w.r.t. the different levels of the feature components
Listing 13.3 Data cleaning applied to the Swedish motorcycle data set
library(CASdatasets)
data(swmotorcycle)
mcdata <- swmotorcycle
mcdata$Gender <- as.factor(mcdata$Gender)
mcdata$Area <- as.factor(mcdata$Area)
mcdata$Area <- factor(mcdata$Area, levels(mcdata$Area)[c(1,7,3,6,5,4,2)])
mcdata$Area <- c("Zone 1","Zone 2","Zone 3","Zone 4","Zone 5",
                 "Zone 6","Zone 7")[as.integer(mcdata$Area)]
mcdata$Area <- as.factor(mcdata$Area)
mcdata$RiskClass <- as.factor(mcdata$RiskClass)
mcdata$RiskClass <- factor(mcdata$RiskClass,
                           levels(mcdata$RiskClass)[c(1,6,7,3,4,5,2)])
mcdata$RiskClass <- as.integer(mcdata$RiskClass)
mcdata$BonusClass <- as.integer(as.factor(mcdata$BonusClass))
mcdata <- mcdata[which(mcdata$OwnerAge >= 18),]   # only minimal age 18
mcdata$OwnerAge <- pmin(70, mcdata$OwnerAge)      # set maximal age 70
mcdata$VehAge <- pmin(30, mcdata$VehAge)          # set maximal motorcycle age 30
mcdata <- mcdata[which(mcdata$Exposure > 0),]     # only positive exposures
7. Exposure: total exposure in yearly units; these exposures are aggregated for given feature combinations, resulting in total exposures in [0.0274, 31.3397], the shortest entry referring to 10 days and the longest one to more than 31 years;
8. ClaimNb: number of claims $N_i$ for a given feature;
9. ClaimAmount: total claim amount for a given feature (aggregated over all claims).
We start with a descriptive and exploratory analysis of the Swedish motorcycle data of Listing 13.4. We have n = 62'036 different feature combinations with positive Exposure. This Exposure is aggregated over individual policies with a fixed feature combination. We denote by $N_i$ the number of claims on feature i, this corresponds to ClaimNb, and the total claim amount ClaimAmount is denoted by $S_i = \sum_{j=1}^{N_i} Z_{i,j}$, where $Z_{i,j}$ are the individual claim sizes on feature i (in case of claims). The empirical claim frequency is $\bar{\lambda} = \sum_{i=1}^n N_i / \sum_{i=1}^n v_i = 1.05\%$, and the average claim size is $\bar{\mu} = \sum_{i=1}^n S_i / \sum_{i=1}^n N_i = 24'641$ Swedish crowns SEK.
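Both summary statistics follow directly from the aggregated columns; a sketch assuming the column names of Listing 13.3 (the function name is ours):

```r
# Empirical portfolio claim frequency and average claim size from the
# aggregated records (ClaimNb, Exposure, ClaimAmount).
portfolio_stats <- function(mcdata) {
  c(frequency = sum(mcdata$ClaimNb) / sum(mcdata$Exposure),
    severity  = sum(mcdata$ClaimAmount) / sum(mcdata$ClaimNb))
}
```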
Fig. 13.18 (lhs) Boxplot of Exposure on the log-scale (the horizontal line corresponds to 1
accounting year), (rhs) histogram of the number of observed claims ClaimNb per feature of the
Swedish motorcycle data
Figure 13.18 shows the boxplot over all Exposures and the claim counts on all
insurance policies. We note that insurance claims are rare events for this product,
because the empirical claim frequency is only λ̄ = 1.05%.
Figures 13.19 and 13.20 give the marginal total exposures (split by gender), the
marginal claim frequencies and the marginal average claim amounts for the covari-
ate components OwnerAge, Area, RiskClass, VehAge and BonusClass.
We observe a very imbalanced portfolio between genders: only 11% of the total exposure comes from females. The empirical claim frequency of females is 0.86% and that of males is 1.08%. We note that the female claim frequency comes from (only) 61 claims (based on a female exposure of 7'094 accounting years, versus 57'679 for males). Therefore, it is difficult to analyze
females separately, and all marginal claim frequencies and claim sizes in Figs. 13.19
and 13.20 (middle and rhs) are analyzed jointly for both genders. If we run a simple
Poisson GLM that only involves Gender as feature component, it turns out that
the female frequency is 20% lower than the male frequency (remember we have
the balance property on each dummy variable, see Example 5.12), but this variable
should not be kept in the model on a 5% significance level. The same holds for claim
amounts.
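Such a Gender-only Poisson GLM with log-Exposure offset can be sketched as follows; the data here are simulated for illustration (frequencies inflated relative to the quoted 0.86%/1.08% so that the toy fit is stable), not the actual portfolio:

```r
set.seed(1)
# simulated toy portfolio; the frequencies are illustrative only
toy <- data.frame(
  Gender   = factor(rep(c("Female", "Male"), each = 500)),
  Exposure = runif(1000, 0.1, 1)
)
toy$ClaimNb <- rpois(1000, ifelse(toy$Gender == "Male", 0.10, 0.08) * toy$Exposure)

# Poisson GLM with Gender as the only feature and log-Exposure as offset
fit <- glm(ClaimNb ~ Gender, offset = log(Exposure),
           family = poisson(), data = toy)

# balance property: fitted and observed frequencies agree per gender level
fitted_freq   <- tapply(fitted(fit), toy$Gender, sum) /
                 tapply(toy$Exposure, toy$Gender, sum)
observed_freq <- tapply(toy$ClaimNb, toy$Gender, sum) /
                 tapply(toy$Exposure, toy$Gender, sum)
```

summary(fit) then shows the Wald test for the Gender coefficient, which is the significance check referred to above.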
The empirical marginal frequencies in Figs. 13.19 and 13.20 (middle) are
complemented with confidence bands of ±2 standard deviations. From the plots
we conclude that we should keep the explanatory variables OwnerAge, Area,
RiskClass and VehAge, but the variable BonusClass does not seem to have
any predictive power. At first sight, this seems surprising because the bonus class
encodes the past claims history. The reason that the bonus class is not needed for our
claims is that we consider comprehensive insurance for motorcycles covering loss
or damage of motorcycles other than collision (for instance, caused by theft, fire or
vandalism), and the bonus class encodes collision claims.
Fig. 13.19 (Top, middle and bottom rows) OwnerAge, Area, RiskClass: (lhs) histogram of
exposures (split by gender), (middle) observed claim frequency, (rhs) boxplot of observed average
claim amounts μ̄i = Si /Ni of features with Ni > 0 (on log-scale)
Fig. 13.20 (Top and bottom rows) VehAge, BonusClass: (lhs) histogram of exposures (split
by gender), (middle) observed claim frequency, (rhs) boxplot of observed average claim amounts
μ̄i = Si /Ni of features with Ni > 0 (on log-scale)
The third example considers property insurance claims of the Wisconsin Local
Government Property Insurance Fund (LGPIF). This data8 has been made available
through the book project of Frees [135],9 and is also used in Lee et al. [236]. The
Wisconsin LGPIF is an insurance pool that is managed by the Wisconsin Office
of the Insurance Commissioner. This fund provides insurance protection to local
governmental institutions such as counties, schools, libraries, airports, etc. It insures
property claims for buildings and motor vehicles, and it excludes certain natural and man-made perils like flood, earthquakes or nuclear accidents. We give a description of the data (we have applied some data cleaning to the original data).
The special feature of this data is that we have a short claim description on line 11
of Listing 13.5. This description will allow us to better understand the claim type
beyond just knowing the hazard type that has been affected.
Figure 13.23 gives the empirical density (upper-truncated at 50’000) and the log-log
plot of the observed LGPIF claim amounts. Most claims are below 10’000, however,
the log-log plot shows clearly that the data is heavy-tailed, the largest claim being
8 https://github.com/OpenActTexts/Loss-Data-Analytics/tree/master/Data.
9 https://ewfrees.github.io/Loss-Data-Analytics/.
13.3 Wisconsin Local Government Property Insurance Fund 571
Fig. 13.21 (Top) Correlations: top-right shows Pearson’s correlation; bottom-left shows Spear-
man’s Rho; (bottom) boxplots of OwnerAge, RiskClass, VehAge versus Area (where Zones
5–7 have been merged)
Fig. 13.22 (lhs) Empirical density (middle) empirical distribution and (rhs) log-log plot of average
claim amounts μ̄i = Si /Ni of features with Ni > 0
12’922’218 and 13 claims being above 1 million. These claims are further described
by the features given in Listing 13.5.
In our example we will not focus on modeling the claim sizes, but we rather
aim at predicting the hazard types from the claim descriptions. There are 9 different
hazard types: Fire, Lightning, Hail, Wind, WaterW, WaterNW, Vehicle, Vandalism
and Misc. The last label contains all claims that cannot be allocated to one of
the previous hazard types, and WaterW refers to weather related water claims and
WaterNW to the non-weather related ones. If we only focus on this latter problem
we have more data available as there is a training data set and a validation data
572 13 Appendix B: Data and Examples
Fig. 13.23 (lhs) Empirical density (upper-truncated at 50’000), (rhs) log-log plot of the observed LGPIF claim amounts
set with hazard types and claim descriptions.10 In total we have 6’031 such claim
descriptions, see Listing 13.6, which are studied in our text recognition Chap. 10.
10 https://github.com/OpenActTexts/Loss-Data-Analytics/tree/master/Data.
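As a toy illustration of this classification task (the proper neural-network treatment follows in Chap. 10), hazard types can already be separated by simple word counts. The descriptions and labels below are invented for the sketch and are not taken from the LGPIF data:

```python
from collections import Counter

# hypothetical miniature stand-in for the labeled claim descriptions;
# the real data set has 6'031 of them
train = [
    ("lightning strike damaged transformer", "Lightning"),
    ("hail broke skylight panels", "Hail"),
    ("vehicle hit fence post", "Vehicle"),
    ("lightning hit pump station", "Lightning"),
    ("hail dented roof vents", "Hail"),
    ("vehicle backed into garage door", "Vehicle"),
]

# bag-of-words class profiles: word counts per hazard type
profiles = {}
for text, label in train:
    profiles.setdefault(label, Counter()).update(text.split())

def predict(text):
    """Score each hazard type by the summed training counts of the
    words it shares with the description (a crude centroid rule)."""
    words = text.split()
    return max(profiles, key=lambda lab: sum(profiles[lab][w] for w in words))

print(predict("lightning damaged the station"))  # → Lightning
```

Chapter 10 replaces these raw word counts with learned word embeddings and recurrent networks, but the supervised problem (description in, hazard type out) is the same.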
13.4 Swiss Accident Insurance Data 573
Our next example considers Swiss accident insurance data.11 This data set is not
publicly available. Swiss accident insurance is compulsory for employees, i.e., by
law each employer has to sign an insurance contract to protect the employees against
accidents. This insurance cover includes both work and leisure accidents, and it
covers medical expenses and daily allowance. Listing 13.7 gives an excerpt of the
data. Line BU indicates whether we have a workplace or a leisure accident, line
10 gives the medical expenses and line 12 shows the allowance expenses. In the
subsequent analysis we only consider medical expenses.
Sector indicates the labor sector of the insured company, AccQuart gives the
accident quarter since leisure claims have a seasonal component, RepDel gives the
reporting delay in yearly units, Age is the age of the injured (in 5 years buckets),
and InjType and InjPart denote the injury type and the injured body part.
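The bucketed features can be derived from raw accident records along the following lines; the helper names and the 365-day year convention are our own illustrative assumptions, since Listing 13.7 only shows the processed fields:

```python
from datetime import date

def age_bucket(age, width=5):
    """Map an exact age to the lower edge of its 5-year bucket,
    e.g. ages 20-24 -> 20 (the bucketing used for Age)."""
    return (age // width) * width

def reporting_delay_years(accident: date, reported: date):
    """Reporting delay in yearly units, floored to whole years
    (RepDel takes values 0, 1, 2, ... in the data); the 365-day
    year is an illustrative convention."""
    return max(0, (reported - accident).days // 365)

print(age_bucket(23))                                              # → 20
print(reporting_delay_years(date(2020, 3, 1), date(2021, 6, 15)))  # → 1
```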
Figure 13.24 gives the empirical density (upper-truncated at 10’000) and the log-
log plot of the observed Swiss accident insurance claim amounts. Most claims are
below 5’000; however, the log-log plot shows some heavy-tailedness, the largest
claim exceeding 1’300’000 CHF.
Figure 13.25 shows the average claim amounts split w.r.t. the different feature
components (top) Sector, AccQuart, RepDel, (bottom) Age, InjType,
InjPart, and moreover, split by work and leisure accidents (in cyan and gray
in the colored version). Typically, leisure accidents are more numerous and more
expensive on average than accidents at the work place. From Fig. 13.25 (top, left)
we observe considerable variability in average claim sizes between the different
labor sectors (cyan bars), whereas average leisure claim sizes (gray bars) are similar
11 https://www.unfallstatistik.ch/.
Fig. 13.24 (lhs) Empirical density (upper-truncated at 10’000), (rhs) log-log plot of the observed Swiss accident insurance claim amounts
Fig. 13.25 Average claim amounts split w.r.t. the different feature components (top) Sector, AccQuart, RepDel, (bottom) Age, InjType, InjPart, and split by work and leisure accidents (cyan/gray in the colored version)
across the different labor sectors. Average claim sizes considerably differ between
injury types and injured body parts (bottom, middle and right), but they do not differ
between work and leisure claims.
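The aggregation behind this comparison amounts to a group-by mean over the pairs (feature level, work/leisure flag); a minimal sketch with invented records:

```python
from collections import defaultdict

# hypothetical claim records: (feature level, work-or-leisure flag, amount);
# in the real data the flag comes from the BU field in Listing 13.7
claims = [
    ("sector_1", "work", 900), ("sector_1", "work", 1100),
    ("sector_1", "leisure", 2400), ("sector_2", "work", 700),
    ("sector_2", "leisure", 1800), ("sector_2", "leisure", 2200),
]

totals = defaultdict(lambda: [0.0, 0])  # (level, flag) -> [sum, count]
for level, flag, amount in claims:
    totals[(level, flag)][0] += amount
    totals[(level, flag)][1] += 1

# average claim amount per (feature level, work/leisure) cell
averages = {key: s / n for key, (s, n) in totals.items()}
print(averages[("sector_1", "work")])     # → 1000.0
print(averages[("sector_2", "leisure")])  # → 2000.0
```

Each bar pair in a plot like Fig. 13.25 is one such cell average, with the work and leisure cells of a feature level shown side by side.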
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons licence and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons licence, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons licence and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Bibliography
1. Aas, K., Jullum, M., & Løland, A. (2021). Explaining individual predictions when features
are dependent: More accurate approximations to Shapley values. Artificial Intelligence, 298.
Article 103502.
2. Abadi, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous
systems. https://www.tensorflow.org/
3. Agarwal, R., Melnick, L., Frosst, N., Zhang, X., Lengerich, B., Caruana, R., & Hinton,
G. E. (2021). Neural additive models: Interpretable machine learning with neural nets.
arXiv:2004.13912v2.
4. Ágoston, K. C., & Gyetvai, M. (2020). Joint optimization of transition rules and the premium
scale in a bonus-malus system. ASTIN Bulletin, 50/3, 743–776.
5. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19/6, 716–723.
6. Al-Mudafer, M. T., Avanzi, B., Taylor, G., & Wong, B. (2022). Stochastic loss reserving with
mixture density neural networks. Insurance: Mathematics & Economics, 105, 144–147.
7. Albrecher, H., Beirlant, J., & Teugels, J. L. (2017). Reinsurance: Actuarial and statistical
aspects. Hoboken: Wiley.
8. Albrecher, H., Bladt, M., & Yslas, J. (2022). Fitting inhomogeneous phase-type distributions
to data: The univariate and the multivariate case. Scandinavian Journal of Statistics, 49/1,
44–77.
9. Alzner, H. (1997). On some inequalities for the gamma and psi functions. Mathematics of
Computation, 66/217, 373–389.
10. Amari, S. (2016). Information geometry and its applications. New York: Springer.
11. Améndola, C., Drton, M., & Sturmfels, B. (2016). Maximum likelihood estimates for
Gaussian mixtures are transcendental. In I. S. Kotsireas, S. M. Rump, & C. K. Yap (Eds.), 6th
International Conference on Mathematical Aspects of Computer and Information Sciences.
Lecture notes in computer science (Vol. 9582, pp. 579–590). New York: Springer.
12. Ancona, M., Ceolini, E., Öztireli, C., & Gross, M. (2019). Gradient-based attribution methods.
In W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, & K.-R. Müller (Eds.), Explainable AI:
Interpreting, explaining and visualizing deep learning. Lecture notes in artificial intelligence
(Vol. 11700, pp. 169–191). New York: Springer.
13. Apley, D. W., & Zhu, J. (2020). Visualizing the effects of predictor variables in black box
supervised learning models. Journal of the Royal Statistical Society, Series B, 82/4, 1059–
1086.
14. Asmussen, S., Nerman, O., & Olsson, M. (1996). Fitting phase-type distributions via the EM
algorithm. Scandinavian Journal of Statistics, 23/4, 419–441.
15. Awad, Y., Bar-Lev, S. K., & Makov, U. (2022). A new class of counting distributions
embedded in the Lee–Carter model of mortality projections: A Bayesian approach. Risks,
10/6. Article 111.
16. Ay, N., Jost, J., Lê, H. V., & Schwachhöfer, L. (2017). Information geometry. New York:
Springer.
17. Ayuso, M., Guillén, M., & Nielsen, J. P. (2019). Improving automobile insurance ratemaking
using telematics: Incorporating mileage and driver behaviour data. Transportation, 46/3, 735–
752.
18. Ayuso, M., Guillén, M., & Pérez-Marín, A. M. (2016). Telematics and gender discrimination:
Some usage-based evidence on whether men’s risk of accidents differs from women’s. Risks,
4/2. Article 10.
19. Ayuso, M., Guillén, M., & Pérez-Marín, A. M. (2016). Using GPS data to analyse the distance
travelled to the first accident at fault in pay-as-you-drive insurance. Transportation Research
Part C: Emerging Technologies, 68, 160–167.
20. Bachelier, L. (1900). The theory of speculation. English translation by May, D. R. (2011).
Annales Scientifiques de l’École Normale Supérieure, 3/17, 21–89.
21. Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural machine translation by jointly learning
to align and translate. arXiv:1409.0473v7.
22. Bailey, R. A. (1963). Insurance rates with minimum bias. Proceedings of the Casualty
Actuarial Society, 50, 4–11.
23. Barndorff-Nielsen, O. (2014). Information and exponential families: In statistical theory.
New York: Wiley.
24. Barndorff-Nielsen, O., & Cox, D. R. (1979). Edgeworth and saddlepoint approximations with
statistical applications. Journal of the Royal Statistical Society, Series B, 41/3, 279–299.
25. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Transactions of Information Theory, 39/3, 930–945.
26. Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks.
Machine Learning, 14, 115–133.
27. Bengio Y., Courville A., & Vincent P. (2013). Representation learning: A review and new
perspectives. IEEE Transactions on Pattern Analysis and Machine Learning Intelligence,
35/8, 1798–1828.
28. Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language
model. Journal of Machine Learning Research, 3/Feb, 1137–1155.
29. Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., & Gauvain, J.-L. (2006). Neural
probabilistic language models. In D. E. Holmes & L. C. Jain (Eds.), Innovations in machine
learning. Studies in fuzziness and soft computing (Vol. 194, pp. 137–186). New York:
Springer.
30. Benhamou, E., & Melot, V. (2018). Seven proofs of the Pearson Chi-squared independence
test and its graphical interpretation. arXiv:1808.09171.
31. Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York:
Springer.
32. Bichsel, F. (1964). Erfahrungstarifierung in der Motorfahrzeug-Haftpflicht-Versicherung.
Bulletin of the Swiss Association of Actuaries, 1964, 119–130.
33. Bickel, P. J., & Doksum, K. A. (2001). Mathematical statistics: Basic ideas and selected
topics (Vol. I, 2nd ed.). Hoboken: Prentice Hall.
34. Billingsley, P. (1995). Probability and measure (3rd ed.). New York: Wiley.
35. Bishop, C. M. (1994). Mixture Density Networks. Technical Report. Aston University,
Birmingham.
36. Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
37. Bladt, M. (2022). Phase-type distributions for insurance pricing. ASTIN Bulletin, 52/2, 417–
448.
38. Blæsild, P., & Jensen, J. L. (1985). Saddlepoint formulas for reproductive exponential models.
Scandinavian Journal of Statistics, 12/3, 193–202.
Bibliography 579
39. Blier-Wong, C., Cossette, H., Lamontagne, L., & Marceau, E. (2022). Geographic ratemaking
with spatial embeddings. ASTIN Bulletin, 52/1, 1–31.
40. Blostein, M., & Miljkovic, T. (2019). On modeling left-truncated loss data using mixture
distributions. Insurance: Mathematics & Economics, 85, 35–46.
41. Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wiersta, D. (2015). Weight uncertainty in
neural network. Proceedings of Machine Learning Research, 37, 1613–1622.
42. Boucher, J. P., Côté, S., & Guillén, M. (2017). Exposure as duration and distance in telematics
motor insurance using generalized additive models. Risks, 5/4. Article 54.
43. Boucher, J. P., Denuit, M., & Guillén, M. (2007). Risk classification for claim counts:
A comparative analysis of various zeroinflated mixed Poisson and hurdle models. North
American Actuarial Journal, 11/4, 110–131.
44. Boucher, J. P., Denuit, M., & Guillén, M. (2008). Modelling of insurance claim count
with hurdle distribution for panel data. In B. C. Arnold, N. Balakrishnan, J. M. Sarabia, &
R. Mínguez (Eds.), Advances in mathematical and statistical modeling. Statistics for industry
and technology (pp. 45–59). Boston: Birkhäuser.
45. Boucher, J. P., Denuit, M., & Guillén, M. (2009). Number of accidents or number of claims?
An approach with zero-inflated Poisson models for panel data. Journal of Risk and Insurance,
76/4, 821–846.
46. Boucher, J. P., & Inoussa, R. (2014). A posteriori ratemaking with panel data. ASTIN Bulletin,
44/3, 587–612.
47. Boucher, J. P., & Pigeon, M. (2018). A claim score for dynamic claim counts modeling.
arXiv:1812.06157.
48. Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society, Series B, 26/2, 211–243.
49. Box, G. E. P., & Jenkins, G. M. (1976). Time series analysis: Forecasting and control. San
Francisco: Holden-Day.
50. Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets
and its application to the solution of problems in convex programming. USSR Computational
Mathematics and Mathematical Physics, 7/3, 200–217.
51. Breiman, L. (1996). Bagging predictors. Machine Learning, 24/2, 123–140.
52. Breiman, L. (2001). Random forests. Machine Learning, 45/1, 5–32.
53. Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16/3, 199–
215.
54. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and
regression trees. Wadsworth statistics/probability series. Monterey: Brooks/Cole Publishing.
55. Broadie, M., Du, Y., & Moallemi, C. (2011). Efficient risk estimation via nested sequential
estimation. Management Science, 57/6, 1171–1194.
56. Brouhns, N., Denuit, M., & Vermunt, J. K. (2002). A Poisson log-bilinear regression approach
to the construction of projected lifetables. Insurance: Mathematics & Economics, 31/3, 373–
393.
57. Brouhns, N., Guillén, M., Denuit, M., & Pinquet, J. (2003). Bonus-malus scales in segmented
tariffs with stochastic migration between segments. Journal of Risk and Insurance, 70/4, 577–
599.
58. Bühlmann, H., & Gisler, A. (2005). A course in credibility theory and its applications. New
York: Springer.
59. Bühlmann, P., & Mächler, M. (2014). Computational statistics. Lecture notes. ETH Zurich:
Department of Mathematics.
60. Bühlmann, P., & Yu, B. (2002). Analyzing bagging. Annals of Statistics, 30/4, 927–961.
61. Burguete, J., Gallant, R., & Souza, G. (1982). On unification of the asymptotic theory of
nonlinear econometric models. Economic Review, 1/2, 151–190.
62. Calderín-Ojeda, E., Gómez-Déniz, E., & Barranco-Chamorro, I. (2019). Modeling zero-
inflated count data with a special case of the generalised Poisson distribution. ASTIN Bulletin,
49/3, 689–708.
580 Bibliography
63. Cameron, A., & Trivedi, P. (1986). Econometric models based on count data: Comparisons
and applications of some estimators and tests. Journal of Applied Econometrics, 1, 29–54.
64. Cantelli, F. P. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale
Dell’Istituto Italiano Degli Attuari, 4, 421–424.
65. Carriere, J. F. (1996). Valuation of the early-exercise price for options using simulations and
nonparametric regression. Insurance: Mathematics & Economics, 19/1, 19–30.
66. Chan, J. S. K., Choy, S. T. B., Makov, U. E., & Landsman, Z. (2018). Modelling insurance
losses using contaminated generalised beta type-II distribution. ASTIN Bulletin, 48/2, 871–
904.
67. Charpentier, A. (2015). Computational actuarial science with R. Boca Raton: CRC Press.
68. Chaubard, F., Mundra, R., & Socher, R. (2016). Deep learning for natural language
processing. Lecture notes. Stanford: Stanford University.
69. Chen, A., Guillén, M., & Vigna, E. (2018). Solvency requirement in a unisex mortality model.
ASTIN Bulletin, 48/3, 1219–1243.
70. Chen, A., & Vigna, E. (2017). A unisex stochastic mortality model to comply with EU Gender
Directive. Insurance: Mathematics & Economics, 73, 124–136.
71. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system.
arXiv:1603.02754v3.
72. Chen, X. (2007). Large sample sieve estimation of semi-parametric models. In J. J. Heckman
& E. E. Leamer (Eds.), Handbook of econometrics (Vol. 6B, Chap. 76, pp. 5549–5632).
Amsterdam: Elsevier.
73. Chen, X., & Shen, X. (1998). Sieve extremum estimates for weakly dependent data.
Econometrica, 66/2, 289–314.
74. Cheridito, P., Ery, J., & Wüthrich, M. V. (2020). Assessing asset-liability risk with neural
networks. Risks, 8/1. Article 16.
75. Cheridito, P., Jentzen, A., & Rossmannek, F. (2022). Efficient approximation of high-
dimensional functions with neural networks. IEEE Transactions on Neural Networks and
Learning Systems, 33/7, 3079–3093.
76. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
& Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for
statistical machine translation. arXiv:1406.1078.
77. Chollet, F., Allaire, J. J., et al. (2017). R interface to Keras. https://github.com/rstudio/keras
78. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.) New York:
Lawrence Erlbaum Associates.
79. Congdon, P. (2014). Applied Bayesian modelling (2nd ed.). New York: Wiley.
80. Cook, D. R., & Croos-Dabrera, R. (1993). Partial residual plots in generalized linear models.
Journal of the American Statistical Association, 93/442, 730–739.
81. Cooray, K., & Ananda, M. M. A. (2005). Modeling actuarial data with composite lognormal-
Pareto model. Scandinavian Actuarial Journal, 2005/5, 321–334.
82. Corradin, A., Denuit, M., Detyniecki, M., Grari, V., Sammarco, M., & Trufin, J. (2022). Joint
modeling of claim frequencies and behavior signals in motor insurance. ASTIN Bulletin, 52/1,
33–54.
83. Cragg, J. G. (1971). Some statistical models for limited dependent variables with application
to the demand for durable good. Econometrica, 39/5, 829–844.
84. Craven, P., & Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische
Mathematik, 31, 377–403.
85. Creal, D. (2012). A survey of sequential Monte Carlo methods for economics and finance.
Econometric Reviews, 31/3, 245–296.
86. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics
of Control, Signals and Systems, 2, 303–314.
87. Daniels, H. E. (1954). Saddlepoint approximations in statistics. Annals of Mathematical
Statistics, 25, 631–650.
88. Darmois, G., (1935). Sur les lois de probabilité à estimation exhaustive. Comptes Rendus de
l’Académie des Sciences Paris, 260, 1265–1266.
Bibliography 581
89. De Jong, P., & Heller, G. Z. (2008). Generalized linear models for insurance data. Cambridge:
Cambridge University Press.
90. De Jong, P., Tickle, L., & Xu, J. (2020). A more meaningful parameterization of the Lee–
Carter model. Insurance: Mathematics & Economics, 94, 1–8.
91. De Pril, N. (1978). The efficiency of a bonus-malus system. ASTIN Bulletin, 10/1, 59–72.
92. Del Moral, P., Doucet, A., & Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of
the Royal Statistical Society, Series B, 68/3, 411–436.
93. Del Moral, P., Peters, G. W., & Vergé, C. (2012). An introduction to stochastic particle
integration methods: With applications to risk and insurance. In J. Dick, F. Y. Kuo, G. W.
Peters, & I. H. Sloan (Eds.), Monte Carlo and Quasi-Monte Carlo Methods 2012. Proceedings
in Mathematics & Statistics (Vol. 65, pp. 39–81). New York: Springer.
94. Delong, Ł., Lindholm, M., & Wüthrich, M. V. (2021). Making Tweedie’s compound Poisson
model more accessible. European Actuarial Journal, 11/1, 185–226.
95. Delong, Ł., Lindholm, M., & Wüthrich, M. V. (2021). Gamma mixture density networks
and their application to modeling insurance claim amounts. Insurance: Mathematics &
Economics, 101/B, 240–261.
96. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood for incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39/1, 1–22.
97. Denuit, M., Charpentier, A., & Trufin, J. (2021). Autocalibration and Tweedie-dominance for
insurance pricing in machine learning. Insurance: Mathematics & Economics, 101/B, 485–
497.
98. Denuit, M., Guillén, M., & Trufin, J. (2019). Multivariate credibility modelling for usage-
based motor insurance pricing with behavioural data. Annals of Actuarial Science, 13/2, 378–
399.
99. Denuit, M., Hainaut, D., & Trufin, J. (2019). Effective statistical learning methods for
actuaries I: GLMs and extensions. New York: Springer.
100. Denuit, M., Hainaut, D., & Trufin, J. (2020). Effective statistical learning methods for
actuaries II: Tree-based methods and extensions. New York: Springer.
101. Denuit, M., Hainaut, D., & Trufin, J. (2019). Effective statistical learning methods for
actuaries III: Neural networks and extensions. New York: Springer.
102. Denuit, M., Maréchal, X., Pitrebois, S., & Walhin, J.-F. (2007). Actuarial modelling of claim
counts: Risk classification, credibility and bonus-malus systems. New York: Wiley.
103. Denuit, M., & Trufin, J. (2021). Generalization error for Tweedie models: Decomposition and
error reduction with bagging. European Actuarial Journal, 11/1, 325–331.
104. Devriendt, S., Antonio, K., Reynkens, T., & Verbelen, R. (2021). Sparse regression with multi-
type regularized feature modeling. Insurance: Mathematics & Economics, 96, 248–261.
105. Dietterich, T. G. (2000). Ensemble methods in machine learning. In J. Kittel & F. Roli (Eds.),
Multiple classifier systems. Lecture notes in computer science (Vol. 1857, pp. 1–15). New
York: Springer.
106. Dimitriadis, T., Fissler, T., & Ziegel, J. F. (2020). The efficiency gap. arXiv:2010.14146.
107. Dobson, A. J. (2001). An introduction to generalized linear models. Boca Raton: Chapman &
Hall/CRC.
108. Döhler, S., & Rüschendorf, L. (2001). An approximation result for nets in functional
estimation. Statistics & Probability Letters, 52/4, 373–380.
109. Döhler, S., & Rüschendorf, L. (2003). Nonparametric estimation of regression functions in
point process models. Statistics Inference for Stochastic Processes, 6, 291–307.
110. Dong, Y., Huang, F., Yu, H., & Haberman, S. (2020). Multi-population mortality forecasting
using tensor decomposition. Scandinavian Actuarial Journal, 2020/8, 754–775.
111. Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen
years later. In D. Crisan & B. Rozovsky (Eds.), Handbook of nonlinear filtering (pp. 656–
670). Oxford: Oxford University Press.
112. Dunn, P. K., & Smyth, G. K. (2005). Series evaluation of Tweedie exponential dispersion
model densities. Statistics and Computing, 15, 267–280.
582 Bibliography
113. Dutang, C., & Charpentier, A. (2018). CASdatasets R package vignette. Reference manual.
Version 1.0-8, packaged 2018-05-20.
114. Eckart, G., & Young, G. (1936). The approximation of one matrix by another of lower rank.
Psychometrika, 1, 211–218.
115. Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7/1,
1–26.
116. Efron, B. (2020). Prediction, estimation, and attribution. Journal of the American Statistical
Association, 115/530, 636–655.
117. Efron, B., & Hastie, T. (2016). Computer age statistical inference: Algorithms, evidence, and
data science. Cambridge: Cambridge University Press.
118. Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman
& Hall.
119. Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles:
Consistent scoring functions, Choquet representations and forecast rankings. Journal of the
Royal Statistical Society, Series B, 78/3, 505–562.
120. Elbrächter, D., Perekrestenko, D., Grohs, P., & Bölcskei, H. (2021). Deep neural network
approximation theory. IEEE Transactions on Information Theory, 67/5, 2581–2623.
121. Embrechts, P., Klüppelberg, C., & Mikosch, T. (2003). Modelling extremal events for
insurance and finance (4th printing). New York: Springer.
122. Embrechts, P., & Wüthrich, M. V. (2022). Recent challenges in actuarial science. Annual
Review of Statistics and Its Applications, 9, 119–140.
123. Fahrmeir, L., & Tutz, G. (1994). Multivariate statistical modelling based on generalized
linear models. New York: Springer.
124. Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association, 96/456, 1348–1360.
125. Ferrario, A., & Hämmerli, R. (2019). On boosting: Theory and applications. SSRN
Manuscript ID 3402687. Version June 11, 2019.
126. Ferrario, A., & Nägelin, M. (2020). The art of natural language processing: Classical,
modern and contemporary approaches to text document classification. SSRN Manuscript ID
3547887. Version March 1, 2020.
127. Ferrario, A., Noll, A., & Wüthrich, M. V. (2018). Insights from inside neural networks. SSRN
Manuscript ID 3226852. Version April 23, 2020.
128. Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proceeding of the Royal
Society A, 144/852, 285–307.
129. Fissler, T., Lorentzen, C., & Mayer, M. (2022). Model comparison and calibration assess-
ment: User guide for consistent scoring functions in machine learning and actuarial practice.
arXiv:2202.12780.
130. Fissler, T., Merz, M., & Wüthrich, M. V. (2021). Deep quantile and deep composite model
regression. arXiv:2112.03075.
131. Fissler, T., & Ziegel, J. F. (2016). Higher order elicitability and Osband’s principle. The Annals
of Statistics, 4474, 1680–1707.
132. Fissler, T., Ziegel, J. F., & Gneiting, T. (2015). Expected shortfall is jointly elicitable with
value at risk - Implications for backtesting. arXiv:1507.00244v2.
133. Fortuin, C. M., Kasteleyn, P. W., & Ginibre, J. (1971). Correlation inequalities on some
partially ordered sets. Communication Mathematical Physics, 22/2, 89–103.
134. Frees, E. W. (2010). Regression modelling with actuarial and financial applications. Cam-
bridge: Cambridge University Press.
135. Frees, E. W. (2020). Loss data analytics. An open text authored by the Actuarial Community.
https://ewfrees.github.io/Loss-Data-Analytics/
136. Frees, E. W., & Huang, F. (2021). The discriminating (pricing) actuary. North American
Actuarial Journal (in press).
137. Frees, E. W., Lee, G., & Yang, L. (2016). Multivariate frequency-severity regression models
in insurance. Risks, 4/1. Article 4.
Bibliography 583
138. Frei, D. (2021). Insurance Claim Size Modelling with Mixture Distributions. MSc Thesis.
Department of Mathematics, ETH Zurich.
139. Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and
Computation, 121/2, 256–285.
140. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences, 55/1, 119–139.
141. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals
of Statistics, 29/5, 1189–1232.
142. Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33/1, 1–22.
143. Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. Annals of
Applied Statistics, 2/3, 916–954.
144. Fritsch, S., Günther, F., Wright, M. N., Suling, M., & Müller, S. M. (2019). neuralnet:
Training of neural networks. https://github.com/bips-hb/neuralnet
145. Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mech-
anism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36/4,
193–202.
146. Fung, T. C., Badescu, A. L., & Lin, X. S. (2019). A class of mixture of experts models for
general insurance: Application to correlated claim frequencies. ASTIN Bulletin, 49/3, 647–
688.
147. Fung, T. C., Badescu, A. L., & Lin, X. S. (2022). Fitting censored and truncated regression
data using the mixture of experts models. North American Actuarial Journal (in press).
148. Fung, T. C., Tzougas, G., & Wüthrich, M. V. (2022). Mixture composite regression models
with multi-type feature selection. North American Actuarial Journal (in press).
149. Gabrielli, A., Richman, R., & Wüthrich, M. V. (2020). Neural network embedding of the
over-dispersed Poisson reserving model. Scandinavian Actuarial Journal, 2020/1, 1–29.
150. Gallant, A. R., & White, H. (1988). There exists a neural network that does not make
avoidable mistakes. In IEEE 1988 International Conference on Neural Networks (pp. I657–
664).
151. Gao, G., Meng, S., & Wüthrich, M. V. (2019). Claims frequency modeling using telematics
car driving data. Scandinavian Actuarial Journal, 2019/2, 143–162.
152. Gao, G., Meng, S., & Wüthrich, M. V. (2022). What can we learn from telematics car driving
data: A survey. Insurance: Mathematics & Economics, 104, 185–199.
153. Gao, G., & Shi, Y. (2021). Age-coherent extensions of the Lee–Carter model. Scandinavian
Actuarial Journal, 2021/10, 998–1016.
154. Gao, G., Wang, H., & Wüthrich, M. V. (2022). Boosting Poisson regression models with
telematics car driving data. Machine Learning, 111/1, 243–272.
155. Gao, G., & Wüthrich, M. V. (2018). Feature extraction from telematics car driving heatmaps.
European Actuarial Journal, 8/2, 383–406.
156. Gao, G., & Wüthrich, M. V. (2019). Convolutional neural network classification of telematics
car driving data. Risks, 7/1. Article 6.
157. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).
Bayesian data analysis (3rd ed.). Boca Raton: Chapman & Hall/CRC.
158. Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1995). Markov chain Monte Carlo in
practice. Boca Raton: Chapman & Hall.
159. Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale
Dell’Istituto Italiano Degli Attuari, 4, 92–99.
160. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics. Proceedings of Machine Learning Research (Vol. 9, pp. 249–256).
161. Glynn, P., & Lee, S. H. (2003). Computing the distribution function of a conditional
expectation via Monte Carlo: Discrete conditioning spaces. ACM Transactions on Modeling
and Computer Simulation, 13/3, 238–258.
584 Bibliography
162. Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American
Statistical Association, 106/494, 746–762.
163. Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation.
Journal of the American Statistical Association, 102/477, 359–378.
164. Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking inside the black box:
Visualizing statistical learning with plots of individual conditional expectation. Journal of
Computational and Graphical Statistics, 24/1, 44–65.
165. Golub, G., & Van Loan, C. (1983). Matrix computations. Baltimore: John Hopkins University
Press.
166. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge: MIT Press.
http://www.deeplearningbook.org
167. Gourieroux, C., Laurent, J. P., & Scaillet, O. (2000). Sensitivity analysis of values at risk.
Journal of Empirical Finance, 7/3–4, 225–245.
168. Gourieroux, C., Montfort, A., & Trognon, A. (1984). Pseudo maximum likelihood methods:
Theory. Econometrica, 52/3, 681–700.
169. Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian
model determination. Biometrika, 82/4, 711–732.
170. Green, P. J. (2003). Trans-dimensional Markov chain Monte Carlo. In P. J. Green, N. L. Hjort,
& S. Richardson (Eds.), Highly structured stochastic systems. Oxford statistical science series
(pp. 179–206). Oxford: Oxford University Press.
171. Greene, W. (2008). Functional forms for the negative binomial model for count data.
Economics Letters, 99, 585–590.
172. Grenander, U. (1981). Abstract inference. New York: Wiley.
173. Grün, B., & Miljkovic, T. (2019). Extending composite loss models using a general
framework of advanced computational tools. Scandinavian Actuarial Journal, 2019/8, 642–
660.
174. Guillén, M. (2012). Sexless and beautiful data: From quantity to quality. Annals of Actuarial
Science, 6/2, 231–234.
175. Guillén, M., Bermúdez, L., & Pitarque, A. (2021). Joint generalized quantile and conditional
tail expectation for insurance risk analysis. Insurance: Mathematics & Economics, 99, 1–8.
176. Guo, C., & Berkhahn, F. (2016). Entity embeddings of categorical variables.
arXiv:1604.06737.
177. Ha, H., & Bauer, D. (2022). A least-squares Monte Carlo approach to the estimation of
enterprise risk. Finance and Stochastics, 26, 417–459.
178. Hainaut, D. (2018). A neural-network analyzer for mortality forecast. ASTIN Bulletin, 48/2,
481–508.
179. Hainaut, D., & Denuit, M. (2020). Wavelet-based feature extraction for mortality projection.
ASTIN Bulletin, 50/3, 675–707.
180. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics.
New York: Wiley.
181. Hastie, T., & Tibshirani, R. (1986). Generalized additive models (with discussion). Statistical
Science, 1, 297–318.
182. Hastie, T., & Tibshirani, R. (1990). Generalized additive models. New York: Chapman &
Hall.
183. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data
mining, inference, and prediction (2nd ed.). New York: Springer.
184. Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: The
Lasso and generalizations. Boca Raton: CRC Press.
185. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57/1, 97–109.
186. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313/5786, 504–507.
187. Hinton, G., Srivastava, N., & Swersky, K. (2012). Neural networks for machine learning.
Lecture slides. Toronto: University of Toronto.
188. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9/8, 1735–1780.
189. Hong, L. J. (2009). Estimating quantile sensitivities. Operations Research, 57/1, 118–130.
190. Horel, E., & Giesecke, K. (2020). Significance tests in neural networks. Journal of Machine
Learning Research, 21/227, 1–29.
191. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural
Networks, 4/2, 251–257.
192. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2/5, 359–366.
193. Huang, Y., & Meng, S. (2019). Automobile insurance classification ratemaking based on
telematics driving data. Decision Support Systems, 127. Article 113156.
194. Huber, P. J. (1981). Robust statistics. Hoboken: Wiley.
195. Human Mortality Database (2018). University of California, Berkeley (USA), and Max
Planck Institute for Demographic Research (Germany). www.mortality.org
196. Hyndman, R. J., Booth, H., & Yasmeen, F. (2013). Coherent mortality forecasting: The
product-ratio method with functional time series models. Demography, 50/1, 261–283.
197. Hyndman, R. J., & Ullah, M. S. (2007). Robust forecasting of mortality and fertility rates: A
functional data approach. Computational Statistics & Data Analysis, 51/10, 4942–4956.
198. Isenbeck, M., & Rüschendorf, L. (1992). Completeness in location families. Probability and
Mathematical Statistics, 13/2, 321–343.
199. Johansen, A. M., Evers, L., & Whiteley, N. (2010). Monte Carlo methods. Lecture notes.
Bristol: Department of Mathematics, University of Bristol.
200. Jørgensen, B. (1981). Statistical properties of the generalized inverse Gaussian distribution.
Lecture notes in statistics. New York: Springer.
201. Jørgensen, B. (1986). Some properties of exponential dispersion models. Scandinavian
Journal of Statistics, 13/3, 187–197.
202. Jørgensen, B. (1987). Exponential dispersion models. Journal of the Royal Statistical Society,
Series B, 49/2, 127–145.
203. Jørgensen, B. (1997). The theory of dispersion models. Boca Raton: Chapman & Hall.
204. Jørgensen, B., & de Souza, M. C. P. (1994). Fitting Tweedie’s compound Poisson model to
insurance claims data. Scandinavian Actuarial Journal, 1994/1, 69–93.
205. Jospin, L. V., Buntine, W., Boussaid, F., Laga, H., & Bennamoun, M. (2020). Hands-on
Bayesian neural networks - A tutorial for deep learning users. arXiv:2007.06823.
206. Jung, J. (1968). On automobile insurance ratemaking. ASTIN Bulletin, 5/1, 41–48.
207. Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of
Basic Engineering, 82/1, 35–45.
208. Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. MSc Thesis. Department of Mathematics, University of Chicago.
209. Kearns, M., & Valiant, L. G. (1988). Learning Boolean Formulae or Finite Automata is as
Hard as Factoring. Technical Report TR-14-88. Aiken Computation Laboratory, Harvard
University.
210. Kearns, M., & Valiant, L. G. (1994). Cryptographic limitations on learning Boolean formulae
and finite automata. Journal of the Association for Computing Machinery ACM, 41/1, 67–95.
211. Kellner, R., Nagl, M., & Rösch, D. (2022). Opening the black box - Quantile neural networks
for loss given default prediction. Journal of Banking & Finance, 134, 1–20.
212. Keydana, S., Falbel, D., & Kuo, K. (2021). R package ‘tfprobability’: Interface to ‘TensorFlow
Probability’. Version 0.12.0.0, May 20, 2021.
213. Khalili, A. (2010). New estimation and feature selection methods in mixture-of-experts
models. Canadian Journal of Statistics, 38/4, 519–539.
214. Khalili, A., & Chen, J. (2007). Variable selection in finite mixture of regression models.
Journal of the American Statistical Association, 102/479, 1025–1038.
215. Kidger, P., & Lyons, T. (2020). Universal approximation with deep narrow networks.
Proceedings of Machine Learning Research, 125, 2306–2327.
216. Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
217. Kingma, D. P., & Welling, M. (2019). An introduction to variational autoencoders. Founda-
tions and Trends in Machine Learning, 12/4, 307–392.
218. Kleinow, T. (2015). A common age effect model for the mortality of multiple populations.
Insurance: Mathematics & Economics, 63, 147–152.
219. Knyazev, B., Drozdzal, M., Taylor, G. W., & Romero-Soriano, A. (2021). Parameter
prediction of unseen deep architectures. arXiv:2110.13100.
220. Koenker, R., & Bassett, G., Jr. (1978). Regression quantiles. Econometrica, 46/1, 33–50.
221. Kolmogoroff, A. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. New York:
Springer.
222. Komunjer, I., & Vuong, Q. (2010). Efficient estimation in dynamic conditional quantile
models. Journal of Econometrics, 157, 272–285.
223. Koopman, B. O. (1936). On distributions admitting a sufficient statistic. Transactions of the
American Mathematical Society, 39, 399–409.
224. Krah, A.-S., Nikolić, Z., & Korn, R. (2020). Least-squares Monte Carlo for proxy modeling
in life insurance: neural networks. Risks, 8/4. Article 116.
225. Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural
networks. AIChE Journal, 37/2, 233–243.
226. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep
convolutional neural networks. Communications of the Association for Computing Machinery
ACM, 60/6, 84–90.
227. Krüger, F., & Ziegel, J. F. (2021). Generic conditions for forecast dominance. Journal of
Business & Economic Statistics, 39/4, 972–983.
228. Kuhn, H. W., & Tucker, A. W. (1951). Nonlinear programming. Proceedings of the 2nd Berkeley
Symposium (pp. 481–492). Berkeley: University of California Press.
229. Künsch, H. R. (2005). Mathematische Statistik. Lecture notes. ETH Zurich: Department of
Mathematics.
230. Kuo, K. (2020). Individual claims forecasting with Bayesian mixture density networks.
arXiv:2003.02453.
231. Kuo, K., & Richman, R. (2021). Embeddings and attention in predictive modeling.
arXiv:2104.03545v1.
232. Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in
manufacturing. Technometrics, 34/1, 1–14.
233. Landsman, Z., & Valdez, E. A. (2005). Tail conditional expectation for exponential dispersion
models. ASTIN Bulletin, 35/1, 189–209.
234. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., &
Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural
Computation, 1/4, 541–551.
235. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86/11, 2278–2324.
236. Lee, G. Y., Manski, S., & Maiti, T. (2020). Actuarial applications of word embedding models.
ASTIN Bulletin, 50/1, 1–24.
237. Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact post-selection inference, with
application to the lasso. Annals of Statistics, 44/3, 907–927.
238. Lee, R. D., & Carter, L. R. (1992). Modeling and forecasting U.S. mortality. Journal of the
American Statistical Association, 87/419, 659–671.
239. Lee, S. C. K. (2021). Addressing imbalanced insurance data through zero-inflated Poisson
regression boosting. ASTIN Bulletin, 51/1, 27–55.
240. Lee, S. C. K., & Lin, X. S. (2010). Modeling and evaluating insurance losses via mixtures of
Erlang distributions. North American Actuarial Journal, 14/1, 107–130.
241. Lee, S. C. K., & Lin, X. S. (2018). Delta boosting machine with application to general
insurance. North American Actuarial Journal, 22/3, 405–425.
242. Lee, S. H. (1998). Monte Carlo Computation of Conditional Expectation Quantiles. PhD
Thesis, Stanford University.
243. Lehmann, E. L. (1959). Testing statistical hypotheses. New York: Wiley.
270. Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research,
7, 983–999.
271. Meng, S., Wang, H., Shi, Y., & Gao, G. (2022). Improving automobile insurance claims
frequency prediction with telematics car driving data. ASTIN Bulletin, 52/2, 363–391.
272. Mercer, J. (1909). Functions of positive and negative type and their connection with the theory
of integral equations. Philosophical Transactions of the Royal Society A, 209/441–458, 415–
446.
273. Merz, M., Richman, R., Tsanakas, A., & Wüthrich, M. V. (2022). Interpreting deep learning
models with marginal attribution by conditioning on quantiles. Data Mining and Knowledge
Discovery, 36, 1335–1370.
274. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953).
Equation of state calculations by fast computing machines. Journal of Chemical Physics,
21/6, 1087–1092.
275. Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv:1301.3781.
276. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed repre-
sentations of words and phrases and their compositionality. Advances in Neural Information
Processing Systems, 26, 3111–3119.
277. Mikosch, T. (2006). Non-life insurance mathematics. New York: Springer.
278. Miljkovic, T., & Grün, B. (2016). Modeling loss data using mixtures of distributions.
Insurance: Mathematics & Economics, 70, 387–396.
279. Mirsky, L. (1960). Symmetric gauge functions and unitarily invariant norms. Quarterly
Journal of Mathematics, 11/1, 50–59.
280. Montúfar, G., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the number of linear regions of
deep neural networks. Neural Information Processing Systems Proceedings, 27, 2924–2932.
281. Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer.
282. Nelder, J. A., & Pregibon, D. (1987). An extended quasi-likelihood function. Biometrika,
74/2, 221–232.
283. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the
Royal Statistical Society, Series A, 135/3, 370–384.
284. Nesterov, Y. (2007). Gradient Methods for Minimizing Composite Objective Function.
Technical Report 76. Center for Operations Research and Econometrics (CORE), Catholic
University of Louvain.
285. Nielsen, F. (2020). An elementary introduction to information geometry. Entropy, 22/10,
1100.
286. Nigri, A., Levantesi, S., Marino, M., Scognamiglio, S., & Perla, F. (2019). A deep learning
integrated Lee–Carter model. Risks, 7/1. Article 33.
287. Noll, A., Salzmann, R., & Wüthrich, M. V. (2018). Case study: French motor third-party
liability claims. SSRN Manuscript ID 3164764. Version March 4, 2020.
288. Oelker, M.-R., & Tutz, G. (2017). A uniform framework for the combination of penalties in
generalized structured models. Advances in Data Analysis and Classification, 11, 97–120.
289. O’Hagan, A., Murphy, T. B., Scrucca, L., & Gormley, I. C. (2019). Investigation of parameter
uncertainty in clustering using a Gaussian mixture model via jackknife, bootstrap and
weighted likelihood bootstrap. Computational Statistics, 34/4, 1779–1813.
290. Ohlsson, E., & Johansson, B. (2010). Non-life insurance pricing with generalized linear
models. New York: Springer.
291. Paefgen, J., Staake, T., & Fleisch, E. (2014). Multivariate exposure modeling of accident risk:
Insights from pay-as-you-drive insurance data. Transportation Research Part A: Policy and
Practice, 61, 27–40.
292. Parikh, N., & Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization,
1/3, 123–231.
293. Park, J., & Sandberg, I. (1991). Universal approximation using radial-basis-function net-
works. Neural Computation, 3/2, 246–257.
294. Park, J., & Sandberg, I. (1993). Approximation and radial-basis-function networks. Neural
Computation, 5/2, 305–316.
295. Parodi, P. (2020). A generalised property exposure rating framework that incorporates scale-
independent losses and maximum possible loss uncertainty. ASTIN Bulletin, 50/2, 513–553.
296. Paszke, A., et al. (2019). PyTorch: An imperative style, high-performance deep learning
library. In Advances in Neural Information Processing Systems (Vol. 32, pp. 8024–8035).
297. Patton, A. J. (2020). Comparing possibly misspecified forecasts. Journal of Business &
Economic Statistics, 38/4, 796–809.
298. Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.
299. Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal inference in statistics: A primer.
Chichester: Wiley.
300. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word
representation. Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP) (pp. 1532–1543).
301. Perla, F., Richman, R., Scognamiglio, S., & Wüthrich, M. V. (2021). Time-series forecasting
of mortality rates using deep learning. Scandinavian Actuarial Journal, 2021/7, 572–598.
302. Petrushev, P. (1999). Approximation by ridge functions and neural networks. SIAM Journal
on Mathematical Analysis, 30/1, 155–189.
303. Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta
Numerica, 8, 143–195.
304. Pinquet, J. (1998). Designing optimal bonus-malus systems from different types of claims.
ASTIN Bulletin, 28/2, 205–220.
305. Pinquet, J., Guillén, M., & Bolance, C. (2001). Long-range contagion in automobile insurance
data: estimation and implications for experience rating. ASTIN Bulletin, 31/2, 337–348.
306. Pitman, E. J. G. (1936). Sufficient statistics and intrinsic accuracy. Proceedings of the
Cambridge Philosophical Society, 32/4, 567–579.
307. R Core Team (2021). R: A language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria. https://www.R-project.org/
308. Renshaw, A. E., & Haberman, S. (2003). Lee–Carter mortality forecasting with age-specific
enhancement. Insurance: Mathematics & Economics, 33/2, 255–272.
309. Renshaw, A. E., & Haberman, S. (2006). A cohort-based extension to the Lee–Carter model
for mortality reduction factors. Insurance: Mathematics & Economics, 38/3, 556–570.
310. Rentzmann, S., & Wüthrich, M. V. (2019). Unsupervised learning: What is a sports car?
SSRN Manuscript ID 3439358. Version October 14, 2019.
311. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining
the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’16 (pp. 1135–1144). New
York: Association for Computing Machinery.
312. Richman, R. (2021). AI in actuarial science - A review of recent advances - Part 1. Annals of
Actuarial Science, 15/2, 207–229.
313. Richman, R. (2021). AI in actuarial science - A review of recent advances - Part 2. Annals of
Actuarial Science, 15/2, 230–258.
314. Richman, R. (2021). Mind the gap - Safely incorporating deep learning models into the
actuarial toolkit. SSRN Manuscript ID 3857693. Version April 2, 2021.
315. Richman, R., & Wüthrich, M. V. (2020). Nagging predictors. Risks, 8/3. Article 83.
316. Richman, R., & Wüthrich, M. V. (2021). A neural network extension of the Lee-Carter model
to multiple populations. Annals of Actuarial Science, 15/2, 346–366.
317. Richman, R., & Wüthrich, M. V. (2022). LocalGLMnet: Interpretable deep learning for
tabular data. Scandinavian Actuarial Journal (in press).
318. Richman, R., & Wüthrich, M. V. (2021). LASSO regularization within the LocalGLMnet
architecture. SSRN Manuscript ID 3927187. Version June 1, 2022.
319. Robert, C. P. (2001). The Bayesian choice (2nd ed.). New York: Springer.
320. Rolski, T., Schmidli, H., Schmidt, V., & Teugels, J. (1999). Stochastic processes for insurance
and finance. New York: Wiley.
346. Strassen, V. (1965). The existence of probability measures with given marginals. Annals of
Mathematical Statistics, 36/2, 423–439.
347. Sun, S., Bi, J., Guillén, M., & Pérez-Marín, A. M. (2020). Assessing driving risk using internet
of vehicles data: An analysis based on generalized linear models. Sensors, 20/9. Article 2712.
348. Sundberg, R. (1974). Maximum likelihood theory for incomplete data from an exponential
family. Scandinavian Journal of Statistics, 1/2, 49–58.
349. Sundberg, R. (1976). An iterative method for solution of the likelihood equations for
incomplete data from exponential families. Communication in Statistics - Simulation and
Computation, 5/1, 55–64.
350. Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. J. (2006). Nonparametric quantile estimation.
Journal of Machine Learning Research, 7, 1231–1264.
351. Thomson, W. (1979). Eliciting production possibilities from a well-informed manager.
Journal of Economic Theory, 20, 360–380.
352. Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the
Royal Statistical Society, Series B, 58/1, 267–288.
353. Tikhonov, A. N. (1943). On the stability of inverse problems. Doklady Akademii Nauk SSSR,
39/5, 195–198.
354. Troxler, A., & Schelldorfer, J. (2022). Actuarial applications of natural language pro-
cessing using transformers: Case studies for using text features in an actuarial context.
arXiv:2206.02014.
355. Tsanakas, A., & Millossovich, P. (2016). Sensitivity analysis using risk measures. Risk
Analysis, 36/1, 30–48.
356. Tsitsiklis, J., & Van Roy, B. (2001). Regression methods for pricing complex American-style
options. IEEE Transactions on Neural Networks, 12/4, 694–703.
357. Tukey, J. W. (1977). Exploratory data analysis. Reading: Addison-Wesley.
358. Tweedie, M. C. K. (1984). An index which distinguishes between some important exponential
families. In J. K. Ghosh, & J. Roy (Eds.) Statistics: Applications and new directions.
Proceedings of the Indian Statistical Institute Golden Jubilee International Conference (pp. 579–604).
Calcutta: Indian Statistical Institute.
359. Tzougas, G., & Karlis, D. (2020). An EM algorithm for fitting a new class of mixed
exponential regression models with varying dispersion. ASTIN Bulletin, 50/2, 555–583.
360. Tzougas, G., Vrontos, S., & Frangos, N. (2014). Optimal bonus-malus systems using finite
mixture models. ASTIN Bulletin, 44/2, 417–444.
361. Uribe, J. M., & Guillén, M. (2019). Quantile regression for cross-sectional and time series
data applications in energy markets using R. New York: Springer.
362. Valiant, L. G. (1984). A theory of the learnable. Communications of the Association for
Computing Machinery ACM, 27/11, 1134–1142.
363. Van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge: Cambridge University Press.
364. Van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes:
With applications to statistics. New York: Springer.
365. Vapnik, V., & Chervonenkis, A. (1974). The theory of pattern recognition. Moscow: Nauka.
366. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &
Polosukhin, I. (2017). Attention is all you need. arXiv:1706.03762v5.
367. Vaughan, J., Sudjianto, A., Brahimi, E., Chen, J., & Nair, V. N. (2018). Explainable neural
networks based on additive index models. arXiv:1806.01933v1.
368. Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S. New York:
Springer.
369. Venter, G. G. (1983). Transformed beta and gamma distributions and aggregate losses. Proceedings of the
Casualty Actuarial Society, 71, 289–308.
370. Verbelen, R., Antonio, K., & Claeskens, G. (2018). Unraveling the predictive power of
telematics data in car insurance pricing. Journal of the Royal Statistical Society: Series C,
67/5, 1275–1304.
371. Verbelen, R., Gong, L., Antonio, K., Badescu, A., & Lin, S. (2015). Fitting mixtures of
Erlangs to censored and truncated data using the EM algorithm. ASTIN Bulletin, 45/3, 729–
758.
372. Verschuren, R. M. (2021). Predictive claim scores for dynamic multi-product risk classifica-
tion in insurance. ASTIN Bulletin, 51/1, 1–25.
373. Wager, S., Wang, S., & Liang, P. S. (2013). Dropout training as adaptive regularization. In C.
Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Weinberger (Eds.), Advances in neural
information processing systems (Vol. 26, pp. 351–359). Red Hook: Curran Associates.
374. Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of
Mathematical Statistics, 20/4, 595–601.
375. Wang, C.-W., Zhang, J., & Zhu, W. (2021). Neighbouring prediction for mortality. ASTIN
Bulletin, 51/3, 689–718.
376. Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models and the
Gauss–Newton method. Biometrika, 61/3, 439–447.
377. Weidner, W., Transchel, F. W. G., & Weidner, R. (2016). Classification of scale-sensitive
telematic observables for riskindividual pricing. European Actuarial Journal, 6/1, 3–24.
378. Weidner, W., Transchel, F. W. G., & Weidner, R. (2017). Telematic driving profile classifica-
tion in car insurance pricing. Annals of Actuarial Science, 11/2, 213–236.
379. White, H. (1989). Learning in artificial neural networks: a statistical perspective. Neural
Computation, 1/4, 425–464.
380. White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks
can learn arbitrary mappings. Neural Networks, 3/5, 535–549.
381. White, H., & Wooldridge, J. M. (1991). Some results on sieve estimation with dependent
observations. In W. Barnett, J. Powell, & G. Tauchen (Eds.), Nonparametric and
semiparametric methods in econometrics and statistics (pp. 459–493). Cambridge: Cambridge University
Press.
382. Wiatowski, T., & Bölcskei, H. (2018). A mathematical theory of deep convolutional neural
networks for feature extraction. IEEE Transactions on Information Theory, 64/3, 1845–1866.
383. Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square. Proceedings of
National Academy of Science, 17/12, 684–688.
384. Wood, S. N. (2017). Generalized additive models: An introduction with R (2nd ed.). Boca
Raton: CRC Press.
385. Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics,
11/1, 95–103.
386. Wu, C. F. J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis.
Annals of Statistics, 14/4, 1261–1295.
387. Wüthrich, M. V. (2013). Non-life insurance: Mathematics & statistics. SSRN Manuscript ID
2319328. Version February 7, 2022.
388. Wüthrich, M. V. (2017). Covariate selection from telematics car driving data. European
Actuarial Journal, 7/1, 89–108.
389. Wüthrich, M. V. (2017). Sequential Monte Carlo sampling for state space models. In V.
Kreinovich, S. Sriboonchitta, & V.-N. Huynh (Eds.), Robustness in econometrics. Studies
in computational intelligence (Vol. 592, pp. 25–50). New York: Springer.
390. Wüthrich, M. V. (2020). Bias regularization in neural network models for general insurance
pricing. European Actuarial Journal, 10/1, 179–202.
391. Wüthrich, M. V. (2022). Model selection with Gini indices under auto-calibration.
arXiv:2207.14372.
392. Wüthrich, M. V., & Buser, C. (2016). Data analytics for non-life insurance pricing. SSRN
Manuscript ID 2870308. Version of October 27, 2021.
393. Wüthrich, M. V., & Merz, M. (2013). Financial modeling, actuarial valuation and solvency
in insurance. New York: Springer.
394. Wüthrich, M. V., & Merz, M. (2019). Editorial: Yes, we CANN! ASTIN Bulletin, 49/1, 1–3.
395. Yan, H., Peters, G. W., & Chan, J. S. K. (2020). Multivariate long-memory cohort mortality
models. ASTIN Bulletin, 50/1, 223–263.
396. Yin, C., & Lin, X. S. (2016). Efficient estimation of Erlang mixtures using iSCAD penalty
with insurance application. ASTIN Bulletin, 46/3, 779–799.
397. Yu, B., & Barter, R. (2020). The data science process: One culture. International Statistical
Review, 88/S1, S83–S86.
398. Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society, Series B, 68/1, 49–67.
399. Yukich, J., Stinchcombe, M., & White, H. (1995). Sup-norm approximation bounds for
networks through probabilistic methods. IEEE Transactions on Information Theory, 41/4,
1021–1027.
400. Zaslavsky, T. (1975). Facing up to arrangements: Face-count formulas for partitions of space
by hyperplanes (Vol. 154). Providence: Memoirs of the American Mathematical Society.
401. Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression models for count data in R. Journal
of Statistical Software, 27/8, 1–25.
402. Zhang, C., Ren, M., & Urtasun, R. (2020). Graph hypernetworks for neural architecture
search. arXiv:1810.05749v3.
403. Zhang, W., Itoh, K., Tanida, J., & Ichioka, Y. (1990). Parallel distributed processing model
with local space-invariant interconnections and its optical architecture. Applied Optics, 29/32,
4790–4797.
404. Zhang, W., Tanida, J., Itoh, K., & Ichioka, Y. (1988). Shift invariant pattern recognition neural
network and its optical architecture. Proceedings of the Annual Conference of the Japan
Society of Applied Physics, 6p-M-14, 734.
405. Zhao, Q., & Hastie, T. (2021). Causal interpretations of black-box models. Journal of Business
& Economic Statistics, 39/1, 272–281.
406. Zhou, Z.-H., Wu, J., & Tang, W. (2002). Ensembling neural networks: Many could be better
than all. Artificial Intelligence, 137/1–2, 239–263.
407. Zhu, R., & Wüthrich, M. V. (2021). Clustering driving styles via image processing. Annals of
Actuarial Science, 15/2, 276–290.
408. Zou, H. (2006). The adaptive LASSO and its oracle properties. Journal of the American
Statistical Association, 101/476, 1418–1429.
409. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society, Series B, 67/2, 301–320.
Index
Best asymptotically normal, 475 Claim description, 425–428, 432, 440, 442,
b-homogeneous, 456 446, 570, 572
Bias, 77, 84–86, 105, 109, 113, 126, 162, 171, Claim sizes, 4, 7, 35, 36, 111, 112, 126,
179, 189, 306, 324, 509 167–180, 188–190, 248–254,
Bias regularization, 305–315, 318, 321, 257–259, 453–455, 458–466, 515
327–330, 337, 338, 501, 503, 506 Classification and regression trees (CART),
Binary cross-entropy, 87 200, 201, 270, 320, 358
Binary feature, 126, 130, 168, 339, 502, 503, Clustering, 239, 342, 436, 438
551 CN layer, 407, 408, 410, 411, 413, 415, 417,
Binomial distribution, 13, 18–19, 31, 38, 174 422
Block diagonal matrix, 197, 198 CN network, 273, 394, 407, 411, 412, 415–419,
BN encoder, 353, 355 421–424
Bonus-malus system (BMS), 140, 141, 199 CN network encoder, 422
Boosting, 133, 200, 201, 315–319, 330, 331, CN operation, 409, 410, 413
335, 515 Coefficient of variation, 176, 185, 323–325,
Boosting regression models, 315–319, 330 332, 492–494
Bootstrap, 106–110, 324, 492–495 Collinear, 129, 153, 560
Bootstrap aggregating, 319 Collinearity, 146, 150, 152, 162, 214, 359, 551,
Bootstrap algorithm, 107, 109 558
Bootstrap consistency, 108 Color image, 408
Bootstrap distribution, 106, 107, 109 Combined GLM and FN network, 318, 319,
Bottleneck neural (BN) network, 342, 351–356 321
Bregman divergence, 43–48, 79, 92, 94, 309, Complete information, 232, 233
456, 457, 464, 486, 487 Complete log-likelihood, 232–233, 235–237,
Budget constraint, 212, 213, 217, 220 250, 253–255, 258, 260, 518, 519
Complete probability space, 542, 548
Composite models, 202, 263–265, 454, 483,
484, 491
C Composite triplet, 484–488
Canonical link, 17–19, 21, 23, 27, 30, 31, 52, Compound Poisson (CP), 4, 34, 189–190
63, 78, 114–118, 120–123, 127, 129, Conditional calibration, 310
133, 196–198, 264, 279, 306–308, Conjugate priors, 156, 209
312, 457 Conjugation, 427
Canonical parameter, 13–17, 24, 30, 37, 41, Consistency, 67–69, 73, 89–90, 109, 124, 225,
43, 45, 46, 62, 65, 78–80, 82, 113, 473, 540–546, 548
114, 116, 127, 182, 184, 186, 195, Consistent, 67–69, 71, 74, 84, 88–92, 109,
196, 231 143–147, 204, 226, 473, 475, 478,
Case deletion, 192–195 482, 484, 546
Categorical cross-entropy, 87, 198, 518 Consistent loss function, 88, 90–93, 204, 464,
Categorical distribution, 27–28, 37, 87, 127, 473, 477, 487, 488
195, 198, 230, 237 Consistent scoring function, 76, 88, 92, 456,
Categorical feature, 127–130, 134, 139, 211, 457, 485
216, 297–306, 322, 325, 370, 375, Constraint optimization, 212, 222, 223, 280
459, 468, 470, 499, 517–520, 551, Context size, 431
561, 563 Context words, 431, 432
Categorical responses, 195–198, 419, 431 Contingency table, 151, 162, 166, 263, 561
Causal inference, 360 Continuous bag-of-words (CBOW), 430, 432
Cell state process, 391, 393 Continuous features, 130–131, 150
Censored data, 248–266 Contrast coding, 128
Center-context pairs, 432, 434, 436, 438, 441 Convex, 2, 14, 16, 17, 29, 30, 37, 38, 43, 44,
Center words, 430–432, 436, 437 92, 94, 213, 214, 219, 221–222, 286,
Central limit theorem (CLT), 5, 70 308, 309, 311, 344, 458, 486, 487
Channel, 408–413, 415, 416, 418, 419, 423 Convex order, 95, 308, 309
Claim counts data, 133 Convolution, 31, 65, 174, 408–410, 413, 415
Convolution formula, 31, 65
Convolution operator, 410
Convolution property, 174
Co-occurrence matrix, 436
Count random variable, 4, 254
Covariate, 88, 93, 96, 113
Coverage ratio, 108, 481, 482, 490, 491, 498
Cramér–Rao information bound, 56–66, 72, 75, 78, 93, 123
Cross-entropy, 87, 198, 421, 518
Cross-sectional data, 198
Cross-validation, 95–106, 139, 140, 190, 192–195, 211, 215–217
Cube-root transformation, 170
Cumulant function, 14–17, 24, 25, 29, 30, 34, 37–38, 41, 44–46
Curse of dimensionality, 209, 530
Cyclic coordinate descent, 220

D
Data collection, 1, 151
Data compression, 271
Data modeling culture, 2
Data pre-processing, 1, 2
Decision rule, 50–54, 56–58, 60–62, 67, 75–79, 83–85, 96, 97
Decision-theoretic approach, 88–95
Decision theory, 49–51
Declension, 428
Decoder, 344, 346, 353, 356, 404
Deductible, 248, 249
Deep composite model regression, 266, 483–491
Deep dispersion modeling, 466–472
Deep learning, 267–379, 453–535
Deep network, 204, 273–275, 383, 477, 494
Deep quantile regression, 10, 476–483, 488, 489
Deep RN network, 383, 385, 386
Deep word representation learning, 445–448
Deformation stability, 411
Dense, 538, 539, 542
Density, 4, 7, 8
Depth, 268, 272, 273, 275–277
Derivative operator, 546
Design matrix, 114, 118–122, 128, 145, 177, 306
Deviance estimate, 143
Deviance generalization loss, 79–88
Deviance GL, 82, 84–86, 310, 311
Deviance loss function, 43, 79–82, 84, 93, 279, 280, 284, 462, 468, 475, 478
Deviance residuals, 142, 158, 170, 176, 182, 459, 460, 466, 471
Diagnostic tools, 141, 190–195
Digamma function, 22, 46, 47, 173, 185
Dilation, 414
Dimension reduction, 342, 343, 415, 417, 520
Directional derivative, 367, 368, 372
Discount factor, 526, 527
Discrete, 3, 4, 18, 27
Discrete random variable, 3
Discrete window, 409
Discrimination-free insurance pricing, 361
Dispersion, 13–46, 155, 157, 158, 181–183, 186–189, 465–472, 475, 476
Dispersion parameter, 30
Dispersion submodel, 182–183, 453, 474
Dissimilarity function, 343, 352, 353
Distortion risk measure, 368, 370
Distribution function, 3, 5–9, 13–16, 29
Divergence, 40–47, 55, 92, 94, 308
Divisible, 174
Do-operator, 360
Dot-product attention, 448
Double FN network model, 466, 470, 472
Double generalized linear model (DGLM), 182–190, 247, 453, 466, 515
Drift extrapolation, 394–397, 401, 404
Drop-out layer, 298, 302–304, 377, 419
Duality transformation, 30, 38, 158
Dual parameter space, 17–22, 24, 25, 27, 28, 31–33, 37, 43, 53, 64
Dummy coding, 127–130, 195, 293, 298

E
Early stopping, 290–293, 299, 303
Educated guess, 50
Effective domain, 14–24, 27, 29, 30, 32, 34, 35
Efficient likelihood estimator (ELE), 72
Eigenvalues, 120, 344, 345
Eigenvectors, 344, 345
Elastic net regularization, 214
Elastic net regularizer, 507
Elicitable, 92, 203, 204, 477, 484, 489
EM algorithm for lower-truncated data, 248–249
EM algorithm for mixture distributions, 230–232
EM algorithm for right-censored data, 251–254
Embedding dimension, 299, 302, 429, 431, 438, 440–442, 444, 446
Embedding layer, 298–302, 429
Embedding map, 128, 294, 298, 399, 429–433, 437, 440, 441, 444, 448
Generalization loss (GL), 10, 75–95, 152, 310
Generalization power, 145, 289
Generalized additive decomposition, 496
Generalized additive models (GAMs), 130, 194, 200, 314, 315, 337, 444
Generalized beta of the second kind (GB2), 201, 202, 453
Generalized cross-validation (GCV), 100, 193, 195, 217, 314
Generalized EM (GEM) algorithm, 513
Generalized inverse, 202
Generalized inverse Gaussian distribution, 25–26
Generalized linear model (GLMs), 111
Generalized projection operator, 222, 223, 228, 280, 508
Gibbs sampling, 209
Glivenko–Cantelli theorem, 7, 55, 68, 106, 107
Global balance, 312
Global max-pooling layer, 419
Global properties, 407
Global surrogate model, 358, 359
Global vectors (GloVe), 425, 430, 433, 436, 438, 442–444, 446, 449, 451
Glorot uniform initializer, 284
GPS location data, 418
Gradient descent method, 220, 278–293
Gradient descent update, 221, 279, 285–287
Grouped penalties, 211, 227, 302
Group LASSO generalized projection operator, 228
Group LASSO regularization, 226–229, 508–512
GRU cell, 393
Guaranteed minimum income benefit (GMIB), 525–529

H
Hadamard product, 287, 391
Hamilton Monte Carlo (HMC) algorithm, 209, 530
Hat matrix, 189–193, 216
Heavy-tailed, 6, 8, 27, 38
Helmert's contrast coding, 128
Hessian, 16, 42, 61, 105, 118, 121, 122, 215, 285
Heterogeneous, 111, 112, 132, 182, 448, 469
Heterogeneous dispersion, 182, 469
Heteroskedastic, 177, 178, 481
Hilbert space, 524, 547
Homogeneity, 111, 200, 456, 467
Homogeneous model, 103, 112, 114
Homoskedastic, 178, 193, 217
Homoskedastic Gaussian case, 193, 217
Honest forecast, 89, 90
Human Mortality Database (HMD), 348, 395
Hurdle model, 163, 261, 264
Hyperbolic tangent activation, 269, 270
Hyper-parameter, 28, 211, 242, 264, 298, 433, 464, 492, 515
Hypothesis testing, 50, 145–147, 181, 549–551

I
Identifiability, 17, 49, 55, 239, 340–342, 348, 475, 497
Identification function, 481, 489
Identity link, 116, 279, 524
Image classification, 418
Image recognition, 273, 407, 412–413
Imbalanced, 153, 568
Importance measure, 505, 511
Incomplete gamma function, 252, 490
Incomplete information, 232, 233
Incomplete log-likelihood, 236–239, 250, 253, 254, 516, 518
Indirect discrimination, 361
Individual attribution, 374, 375
Individual conditional expectation (ICE), 359–360
Infinitely divisible, 174
Inflectional forms, 428
Information bound, 57, 62–64
Information geometry, 40–47, 81, 145
Initialization, 239, 241, 268, 282, 284, 293, 318, 354, 363, 479, 497, 515
Input gate, 391, 392
Input tensor, 408, 410, 411, 413, 414, 416, 423
In-sample loss, 98, 102, 103
In-sample over-fitting, 102, 279, 288, 525
Interactions, 131, 151, 200, 274, 297, 319, 360, 365, 373, 379, 495, 503, 505
Interaction strength, 365–366
Intercept model, 112, 139, 152, 154, 171, 188, 333
Interior, 15, 29, 44, 62, 112, 235, 476, 547
Inverse Gaussian distribution, 23, 26, 33, 174, 482, 490
Inverse Gaussian GLM, 122, 173–176, 453, 460
Inverse link function, 269
IRLS algorithm, 119, 120, 181, 186, 198
Irreducible risk, 77, 84, 86, 113, 310, 477, 492
Iterative re-weighted least squares algorithm, 119
Markov chain Monte Carlo (MCMC) methods, 10, 209, 210, 530
Martingale sequence forecast, 309
Maximal a posterior (MAP) estimator, 210–212, 225
Maximal cover, 248
Maximization step, 233, 513
Maximum likelihood, 51, 116–122
Maximum likelihood estimation/estimator (MLE), 51, 124–125, 169, 172–174, 181, 186–187, 196–198, 288, 293, 472–476
Max-pooling, 414, 415, 419, 423, 442, 444
Mean, 4
Mean field Gaussian variational family, 534
Mean functional, 84, 90–93, 195, 203, 278, 456
Mean parameter space, 17, 116, 458, 473
Mean squared error of prediction (MSEP), 10, 75–79, 83, 95, 142, 209
Memory rate, 390
Mercer's kernel, 132, 268, 269
M-estimation, 93, 476
M-estimator, 69, 73, 93, 457
Meta model, 329, 330
Method of moments, 54, 55
Method of moments estimator, 54, 55
Method of sieve estimators, 11, 56, 543–546, 549
Metropolis–Hastings (MH) algorithm, 209
Mini-batches, 285, 291, 293, 304, 321, 469, 515, 518
Minimal representation, 16, 17, 30, 31, 42, 63, 67
Minimax decision rule, 50–51
MinMaxScaler, 294, 295, 371
Mixed Poisson distribution, 20, 155, 164
Mixture density, 230, 233, 242, 243, 454
Mixture density networks (MDNs), 233, 453, 513, 515–520
Mixture distribution, 163, 164, 230–235, 238–247, 259, 513
Mixture probability, 230, 231, 235, 236, 241, 243, 245–247, 513
Model-agnostic tools, 357–376, 495
Model class, 2, 105, 454
Modeling cycle, 1–3
Model misspecification, 2, 305, 472
Model uncertainty, 56, 69, 82, 93, 453–476, 492–495, 530
Model validation, 2, 141–180, 357
Modified Bessel function, 26
Moment generating function, 5, 6, 9, 15, 16, 30, 31, 35, 38, 125, 168, 174, 201
Momentum-based gradient descent method, 280, 285–287
Momentum coefficient, 285, 287
Mortality, 3, 347–351, 354–356, 394–406, 422–424, 525, 526, 529
Mortality surface, 347, 349, 350, 355, 356, 394, 422–424
Motor third party liability (MTPL), 133
MSEP optimal predictor, 78
M-step, 233–235, 238, 239, 244, 251, 252, 256, 257, 513, 515, 517
Multi-class cross-entropy, 87, 421
Multi-dimensional array, 408
Multi-dimensional Cramér–Rao bound, 60, 62
Multi-index, 546
Multi-output network, 462, 463, 466, 479
Multiple outputs, 461, 462, 468, 488
Multiple quantiles, 478–479
Multiplicative approach, 479–482
Multiplicative effects, 116, 126
Multiplicative model, 128, 131

N
Nadam, 287, 288
Nagging, 320, 324–326
Nagging predictor, 324–329
Natural language processing (NLP), 10, 298, 425–451
NB1, 158
NB2, 157, 158
Negative-binomial distribution, 19, 156, 159
Negative-binomial GLM, 159, 160, 166
Negative-binomial model, 20, 156, 158–160, 163
Negative expected Hessian, 105, 118, 121, 215
Negative sampling, 431–436, 438, 440, 446, 450
Nested GLM, 145
Nested simulation, 522
Nesterov-accelerated version, 286
Network aggregating, 325
Network ensembling, 10, 492
Network output, 387–388
Network parameter, 272, 274
Network weight, 271, 284
Neurons, 269
New representation, 46, 132, 268
Newton–Raphson algorithm, 59, 120, 231
NLP pipeline, 425
Noisy part, 113, 288, 290
Nominal categorical feature, 127
Nominal outcome, 4, 127, 130, 195, 302, 364, 555
Pseudo maximum likelihood estimator (PMLE), 180, 473–475
Pseudo-norm, 542, 543
p-value, 146, 147, 151, 245

Q
QQ plot, 172, 242, 245, 333, 518
Quantile, 92, 203, 368, 476
Quantile level, 374, 375
Quantile regression, 10, 202–204, 483, 488
Quantile risk measure, 368
Quasi-generalized pseudo maximum likelihood estimator (QPMLE), 180, 475
Quasi-likelihood, 180–181
Quasi-Newton method, 120, 513
Quasi-Poisson model, 180
Quasi-Tweedie's model, 181
Query, 448, 449

R
Random component, 498, 501, 505, 521
Random effects, 198–199
Random forest, 200, 319
Random variable, 3–7
Random vector, 3, 109, 180, 255, 371, 523, 526
Random walk, 350, 394–397, 401, 402, 404
Rank, 17, 118–121, 127, 177, 232, 344, 348, 354
Raw mortality data, 347, 394, 406, 422
Reconstruction error, 342, 344, 346, 350, 352, 354
Rectified linear unit activation, 269, 270, 274, 275, 479
Recurrent neural network, 381–406
Red-green-blue (RGB), 289
Reference point, 371–374
Regression attention, 496–498, 502, 503, 505, 507, 511, 520, 521
Regression function, 113, 267, 269, 485, 488, 512
Regression modeling, 88, 112–113
Regression parameter, 88, 114, 116, 119, 122, 131–133, 180, 189, 192, 208, 210, 271, 486, 496, 514
Regression trees, 133, 200, 307, 330, 359
Regularization, 207–268, 306–308, 314, 464, 507–509
Regularization through early stopping, 290–293
Regularly varying, 6, 8, 16, 23, 38, 39, 126, 167, 202, 241, 246
ReLU activation, 269, 270, 274, 275
Reparametrization trick, 532–534
Representation learning, 267–269, 273, 274, 293, 402, 411, 415, 445–448, 453, 461, 462
Reproductive dispersion model, 44
Reproductive form, 30, 38, 133, 158, 169, 174, 187, 459, 467
Resampling distribution, 107
Reset gate, 393
Residual bootstrap, 108
Residual maximum likelihood estimation, 186–187
Residuals, 108, 117, 141–145, 153, 158, 176, 178, 182, 191, 197, 315
Retention level, 249
Ridge regression, 214–217, 304
Ridge regularization, 212–214, 217, 224, 303, 507
Right-censored gamma claim sizes, 250–252
Right-censoring, 250–254
Right-singular matrix, 345, 348
Risk function, 50, 52, 61, 77, 83, 84, 86
rmsprop, 287, 297
RN layer, 383–391, 411
RN network, 273, 381, 383, 385–390, 394–406, 412, 442, 445, 448
Robustified representation learning, 461–464, 468
Robust statistics, 56
Root mean square propagation, 287

S
Saddlepoint approximation, 36, 47, 145, 183–187, 456, 459, 466, 467, 517
Saturated model, 80, 81, 113, 115, 157, 158, 166, 354
Scalar product, 114, 130, 200, 218, 269, 304, 312, 386, 433, 434, 444, 524
Scaled deviance GL, 88
Scale parameter, 22–24, 33, 34, 38, 168, 172, 201, 241, 252, 258, 517
Schwarz' information criterion (SIC), 106
Score, 58, 61, 71, 90, 117, 121, 180, 198, 218, 234, 457
Score equations, 71, 73, 117–120, 180, 196–198, 203, 215, 218, 236, 238, 304, 459
Scoring function, 69, 76–79, 81, 88–90, 456, 484, 485, 517, 518
2nd order marginal attribution, 377
Self-attention mechanism, 449, 450
Sequence of sieves, 542
Sequential Monte Carlo sampling (SMC), 209
Set-valued functional, 89, 90, 203
Shallow FN network, 273–275, 277, 341, 399, 537–541, 543, 545, 546, 550, 551
Shallow network, 273–275, 353, 383
Shape parameter, 22–24, 33, 34, 38, 39, 121, 168, 247
Shapley additive explanation (SHAP), 366
Short rate, 526, 527, 529
Shrinkage, 212, 307
Sieve estimator, 541–546, 548, 549
Sigma-finite measure, 4, 13–15, 29, 30, 34
Sigmoid activation function, 270, 283, 390, 391, 393, 538, 540, 542, 548
Sigmoid function, 18, 432, 479, 538, 542, 548
Simple bias regularization, 305–306
Single-parameter exponential family, 13, 15, 28
Single-parameter linear exponential family, 15, 18–20, 22, 25, 28, 31, 36–38, 73, 159, 263
Singular value decomposition (SVD), 345, 347, 350, 376, 394, 397, 404
Singular values, 345, 350, 356, 377
Skip connection, 272, 316, 317, 387, 450
Skip-gram, 430–432, 434
Smoothly clipped absolute deviation (SCAD) regularization, 225, 247
Sobolev embedding theorem, 547
Sobolev norm, 547, 551
Sobolev space, 547
Soft-thresholding operator, 218, 219, 223, 226, 508
Spatial component, 356, 408–411, 415
Special purpose layers, 273, 298–305
Special purpose tools, 413–416
Spectral norm, 345
Speed-acceleration-change in angle pattern, 418
Splicing models, 264
Spurious configuration, 239
Square loss function, 54, 75, 77–79, 82, 92, 125, 170, 304, 312, 313, 330, 528
Squashing function, 538–540
Standardization of data matrix, 343
Standardized data, 218
State-space model, 199
Statistical error, 77
Statistical modeling cycle, 1–3
Steep, 37, 38, 87
Steepness, 17, 37–38, 43, 80, 83
Stemming, 428
Step function activation, 269, 274–277, 538, 539
Stochastic boundedness, 544, 545
Stochastic gradient descent (SGD), 283–285, 291–293
Stochastic gradient variational Bayes (SGVB), 533
Stochastic mortality, 347
Stone–Weierstrass theorem, 538, 539
Stone–Weierstrass type arguments, 274
Stopwords, 427, 428, 433, 448
Strassen's theorem, 309
Stratified K-fold cross-validation, 99, 101–103
Strict consistency, 69, 90, 456
Strictly consistent, 76, 84, 90–93, 204, 456, 473, 477, 478, 484, 485, 487, 488
Strictly convex, 16, 17, 30, 44, 52, 213, 456, 485, 486
Strictly proper scoring rule, 91
Stride, 414, 415
Strongly consistent, 67, 473, 475
Subexponential, 6
Sub-gradient, 44, 92, 218, 269, 485, 486
Suffixes, 428
Survival function, 6, 8, 23, 38–40, 202
Synthetic minority oversampling technique (SMOTE), 155
Systematic effects, 113, 114, 126, 127, 182, 211, 289, 290, 315, 319

T
Tail index, 6, 8, 38–40, 202
Taylor expansion, 71, 72, 142, 220, 279, 280, 285, 317, 369–371, 455
Telematics data, 418–422
Temporal causal, 382, 390
Tenfold cross-validation, 102, 139, 152, 154, 169, 171, 172, 176, 188, 194, 195
Tensor, 42, 397, 408–416, 419, 422, 423
Test data, 96–98, 100, 102, 137, 288, 290, 291, 295, 297
Test sample, 96, 97, 135, 137, 141, 169
Test statistics, 549, 563
Text recognition, 10, 425, 426, 448, 477, 572
Threshold, 39, 40, 241, 242, 246, 446
Tikhonov regularization, 212
Time-distributed layer, 388–390, 442
Time-series analysis, 273, 411–413
Time-series data, 381, 407, 408, 411, 412, 423
Tokenization, 426, 427, 434, 436, 438
Training data, 291, 292, 295
Training loss, 292, 301
Transformer model, 450
Translation invariance, 411
Truncated data, 265
t-statistics, 147
Tukey–Anscombe plot, 172, 178, 190, 459, 464–466, 471
Tweedie's CP model, 34–36, 182, 183, 190, 325, 327, 454
Tweedie's distribution, 34–36, 87, 182, 457, 458, 487
Tweedie's family, 187, 454–458, 481, 487
Tweedie's forecast dominance, 95, 327, 453, 459
Tweedie's model with log-link, 121
Two modeling culture, 329

U
Unbiased, 57, 60–64, 66, 70, 77, 78, 85, 86, 93, 122, 123, 126, 309, 458
Unbiased estimator, 56–66, 85, 123, 458
Under-dispersion, 33, 264
Under-sampling, 153–155
Unfolded representation, 384, 386
Uniformly dense on compacta, 538, 539
Uniformly minimum variance unbiased (UMVU), 56, 57, 61, 63–67, 79, 93, 123
Unit deviance, 42–47, 80–83, 86–88, 90–95, 124–125, 141, 142, 144, 145, 183, 184, 311, 319, 454–456
Unit simplex, 27, 230, 232, 235
Universality theorem, 10, 11, 274–278, 288, 537–540, 550
Unsupervised learning, 342, 436, 445
Update gate, 393

V
VA account, 525, 527, 529
Validation analysis, 290
Validation data, 291–293
Validation loss, 291, 292, 297, 301, 331, 421
Value-at-risk (VaR), 370, 528, 529
Vapnik–Chervonenkis (VC) dimension, 541
Variable annuity (VA), 525
Variable permutation importance (VPI), 357–359, 372, 377
Variable selection, 148, 214, 225, 357, 495, 497–499, 507–509, 549
Variance function, 31–36, 44, 82, 86, 87, 117, 142, 158, 174, 180, 183, 185, 186, 209, 280, 454, 467, 474
Variance reduction technique, 327
Variational distributions, 530, 535
Variational inference, 535
Variational lower bound, 531
VC-class, 541
Vector-valued canonical parameter, 15
Vocabulary, 426, 428, 433, 434, 438, 440
Volume, 29, 129, 134, 145, 168, 184, 234

W
Wald statistics, 146, 147, 229
Wald test, 139, 146, 148, 198, 224, 245, 357, 495, 503, 521, 549
Weak derivative, 547
Weight, 3, 4, 18–20, 29, 65, 66
Weighted square loss function, 82, 312, 313, 438
Weight matrix, 310, 449
Wild bootstrap, 108
Window size, 407–409, 412, 413, 415, 430, 431, 434
Word embedding, 269, 425, 429–439, 450
Word to vector (word2vec), 425, 430, 433, 438–440, 444, 446, 450
Word2vec algorithm, 430–436
Working residuals, 117, 186, 191, 197, 198, 244
Working weight matrix, 116, 121, 186, 191

X
XGBoost, 201

Z
Zero-inflated Poisson (ZIP), 163, 164, 166, 167, 259, 261, 263, 337
Zero-truncated claim counts, 259
Zero-truncated Poisson (ZTP), 164, 259
Z-estimator, 73, 118, 180, 457
Z-statistics, 147
ZTP log-likelihood, 262, 263
ZTP model, 155