
Springer Actuarial

Mario V. Wüthrich
Michael Merz

Statistical
Foundations
of Actuarial
Learning and its
Applications
Springer Actuarial

Editors-in-Chief
Hansjoerg Albrecher, University of Lausanne, Lausanne, Switzerland
Michael Sherris, UNSW, Sydney, NSW, Australia

Series Editors
Daniel Bauer, University of Wisconsin-Madison, Madison, WI, USA
Stéphane Loisel, ISFA, Université Lyon 1, Lyon, France
Alexander J. McNeil, University of York, York, UK
Antoon Pelsser, Maastricht University, Maastricht, The Netherlands
Ermanno Pitacco, Università di Trieste, Trieste, Italy
Gordon Willmot, University of Waterloo, Waterloo, ON, Canada
Hailiang Yang, The University of Hong Kong, Hong Kong, Hong Kong
This is a series on actuarial topics in a broad and interdisciplinary sense, aimed at
students, academics and practitioners in the fields of insurance and finance.
Springer Actuarial provides timely information on theoretical and practical aspects
of topics like risk management, internal models, solvency, asset-liability management,
market-consistent valuation, the actuarial control cycle, insurance and financial
mathematics, and other related interdisciplinary areas.
The series aims to serve as a primary scientific reference for education, research,
development and model validation.
The type of material considered for publication includes lecture notes, monographs
and textbooks. All submissions will be peer-reviewed.
Mario V. Wüthrich • Michael Merz

Statistical Foundations
of Actuarial Learning
and its Applications
Mario V. Wüthrich
Department of Mathematics, RiskLab Switzerland
ETH Zürich
Zürich, Switzerland

Michael Merz
Faculty of Business Administration
University of Hamburg
Hamburg, Germany

This work was supported by Schweizerische Aktuarvereinigung SAV and Swiss Re.

ISSN 2523-3262  ISSN 2523-3270 (electronic)
Springer Actuarial
ISBN 978-3-031-12408-2  ISBN 978-3-031-12409-9 (eBook)
https://doi.org/10.1007/978-3-031-12409-9

Mathematics Subject Classification: C13, C21/31, C24/34, G22, 62F10, 62F12, 62J07, 62J12, 62M45,
62P05, 68T01, 68T50

© The Authors 2023. This book is an open access publication.


Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 Inter-
national License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation,
distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and indicate if changes
were made.
The images or other third party material in this book are included in the book’s Creative Commons
license, unless indicated otherwise in a credit line to the material. If material is not included in the book’s
Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Acknowledgments

We kindly thank our very generous sponsors, the Swiss Association of Actuaries
(SAA) and Swiss Re, for financing the open access option of the electronic version
of this book. Our special thanks go to Sabine Betz (President of SAA), Adrian
Kolly (Swiss Re), and Holger Walz (SAA) who were very positive and interested in
this book project from the very beginning, and who made this open access funding
possible within their institutions.
A very special thank you goes to Hans Bühlmann who has been supporting us
over the last 30 years. We have had so many inspiring discussions over these years,
and we have greatly benefited and learned from Hans’ incredible knowledge and
intuition.
Jointly with Christoph Buser, we started teaching the lecture “Data Analytics
for Non-Life Insurance Pricing” at ETH Zurich in 2018. Our data analytics lecture
focuses (only) on the Poisson claim counts case, but its lecture notes provided
a first draft for this book project. This draft has been developed and extended to
the general case of the exponential family. Since our first lecture, we have greatly
benefited from interactions with many colleagues and students. In particular, we
would like to mention the data science initiative “Actuarial Data Science” of the
Swiss Association of Actuaries (chaired by Jürg Schelldorfer), whose tutorials
provided a great stimulus for this book. Moreover, we mention the annual Insurance
Data Science Conference (chaired by Markus Gesmann and Andreas Tsanakas) and
the ASTIN Reading Club (chaired by Ronald Richman and Dimitri Semenovich).
Furthermore, we would like to kindly thank Ronald Richman who has always been
a driving force behind learning and adapting new machine learning techniques, and
we also kindly thank Simon Rentzmann for many interesting discussions on how to
apply these techniques to real insurance problems.
We collaborated and had inspiring discussions in the field of statistical learning
with the following colleagues (in alphabetical order): Johannes Abegglen, Hansjörg
Albrecher, Davide Apolloni, Peter Bühlmann, Christoph Buser, Patrick Cheridito,
Łukasz Delong, Paul Embrechts, Andrea Ferrario, Tobias Fissler, Luca Fontana,
Daisuke Frei, Tsz Chai Fung, Guangyuan Gao, Yan-Xing Lan, Gee Lee, Mathias
Lindholm, Christian Lorentzen, Friedrich Loser, Michael Mayer, Daniel Meier,
Alexander Noll, Gareth Peters, Jan Rabenseifner, Peter Reinhard, Simon Rentzmann,
Ronald Richman, Ludger Rüschendorf, Robert Salzmann, Marc Sarbach, Jürg
Schelldorfer, Pavel Shevchenko, Joël Thomann, Andreas Tsanakas, George Tzougas,
Emiliano Valdez, Tim Verdonck, and Patrick Zöchbauer.
Contents

1 Introduction
   1.1 The Statistical Modeling Cycle
   1.2 Preliminaries on Probability Theory
   1.3 Lab: Exploratory Data Analysis
   1.4 Outline of This Book
2 Exponential Dispersion Family
   2.1 Exponential Family
      2.1.1 Definition and Properties
      2.1.2 Single-Parameter Linear EF: Count Variable Examples
      2.1.3 Vector-Valued Parameter EF: Absolutely Continuous Examples
      2.1.4 Vector-Valued Parameter EF: Count Variable Example
   2.2 Exponential Dispersion Family
      2.2.1 Definition and Properties
      2.2.2 Exponential Dispersion Family Examples
      2.2.3 Tweedie’s Distributions
      2.2.4 Steepness of the Cumulant Function
      2.2.5 Lab: Large Claims Modeling
   2.3 Information Geometry in Exponential Families
      2.3.1 Kullback–Leibler Divergence
      2.3.2 Unit Deviance and Bregman Divergence
3 Estimation Theory
   3.1 Introduction to Decision Theory
   3.2 Parameter Estimation
   3.3 Unbiased Estimators
      3.3.1 Cramér–Rao Information Bound
      3.3.2 Information Bound in the Exponential Family Case
   3.4 Asymptotic Behavior of Estimators
      3.4.1 Consistency
      3.4.2 Asymptotic Normality
4 Predictive Modeling and Forecast Evaluation
   4.1 Generalization Loss
      4.1.1 Mean Squared Error of Prediction
      4.1.2 Unit Deviances and Deviance Generalization Loss
      4.1.3 A Decision-Theoretic Approach to Forecast Evaluation
   4.2 Cross-Validation
      4.2.1 In-Sample and Out-of-Sample Losses
      4.2.2 Cross-Validation Techniques
      4.2.3 Akaike’s Information Criterion
   4.3 Bootstrap
      4.3.1 Non-parametric Bootstrap Simulation
      4.3.2 Parametric Bootstrap Simulation
5 Generalized Linear Models
   5.1 Generalized Linear Models and Log-Likelihoods
      5.1.1 Regression Modeling
      5.1.2 Definition of Generalized Linear Models
      5.1.3 Link Functions and Feature Engineering
      5.1.4 Log-Likelihood Function and Maximum Likelihood Estimation
      5.1.5 Balance Property Under the Canonical Link Choice
      5.1.6 Asymptotic Normality
      5.1.7 Maximum Likelihood Estimation and Unit Deviances
   5.2 Actuarial Applications of Generalized Linear Models
      5.2.1 Selection of a Generalized Linear Model
      5.2.2 Feature Engineering
      5.2.3 Offsets
      5.2.4 Lab: Poisson GLM for Car Insurance Frequencies
   5.3 Model Validation
      5.3.1 Residuals and Dispersion
      5.3.2 Hypothesis Testing
      5.3.3 Analysis of Variance
      5.3.4 Lab: Poisson GLM for Car Insurance Frequencies, Revisited
      5.3.5 Over-Dispersion in Claim Counts Modeling
      5.3.6 Zero-Inflated Poisson Model
      5.3.7 Lab: Gamma GLM for Claim Sizes
      5.3.8 Lab: Inverse Gaussian GLM for Claim Sizes
      5.3.9 Log-Normal Model for Claim Sizes: A Short Discussion
   5.4 Quasi-Likelihoods
   5.5 Double Generalized Linear Model
      5.5.1 The Dispersion Submodel
      5.5.2 Saddlepoint Approximation
      5.5.3 Residual Maximum Likelihood Estimation
      5.5.4 Lab: Double GLM Algorithm for Gamma Claim Sizes
      5.5.5 Tweedie’s Compound Poisson GLM
   5.6 Diagnostic Tools
      5.6.1 The Hat Matrix
      5.6.2 Case Deletion and Generalized Cross-Validation
   5.7 Generalized Linear Models with Categorical Responses
      5.7.1 Logistic Categorical Generalized Linear Model
      5.7.2 Maximum Likelihood Estimation in Categorical Models
   5.8 Further Topics of Regression Modeling
      5.8.1 Longitudinal Data and Random Effects
      5.8.2 Regression Models Beyond the GLM Framework
      5.8.3 Quantile Regression
6 Bayesian Methods, Regularization and Expectation-Maximization
   6.1 Bayesian Parameter Estimation
   6.2 Regularization
      6.2.1 Maximal a Posterior Estimator
      6.2.2 Ridge vs. LASSO Regularization
      6.2.3 Ridge Regression
      6.2.4 LASSO Regularization
      6.2.5 Group LASSO Regularization
   6.3 Expectation-Maximization Algorithm
      6.3.1 Mixture Distributions
      6.3.2 Incomplete and Complete Log-Likelihoods
      6.3.3 Expectation-Maximization Algorithm for Mixtures
      6.3.4 Lab: Mixture Distribution Applications
   6.4 Truncated and Censored Data
      6.4.1 Lower-Truncation and Right-Censoring
      6.4.2 Parameter Estimation Under Right-Censoring
      6.4.3 Parameter Estimation Under Lower-Truncation
      6.4.4 Composite Models
7 Deep Learning
   7.1 Deep Learning and Representation Learning
   7.2 Generic Feed-Forward Neural Networks
      7.2.1 Construction of Feed-Forward Neural Networks
      7.2.2 Universality Theorems
      7.2.3 Gradient Descent Methods
   7.3 Feed-Forward Neural Network Examples
      7.3.1 Feature Pre-processing
      7.3.2 Lab: Poisson FN Network for Car Insurance Frequencies
   7.4 Special Features in Networks
      7.4.1 Special Purpose Layers
      7.4.2 The Balance Property in Neural Networks
      7.4.3 Boosting Regression Models with Network Features
      7.4.4 Network Ensemble Learning
      7.4.5 Identifiability in Feed-Forward Neural Networks
   7.5 Auto-encoders
      7.5.1 Standardization of the Data Matrix
      7.5.2 Introduction to Auto-encoders
      7.5.3 Principal Components Analysis
      7.5.4 Lab: Lee–Carter Mortality Model
      7.5.5 Bottleneck Neural Network
   7.6 Model-Agnostic Tools
      7.6.1 Variable Permutation Importance
      7.6.2 Partial Dependence Plots
      7.6.3 Interaction Strength
      7.6.4 Local Model-Agnostic Methods
      7.6.5 Marginal Attribution by Conditioning on Quantiles
   7.7 Lab: Analysis of the Fitted Networks
8 Recurrent Neural Networks
   8.1 Motivation for Recurrent Neural Networks
   8.2 Plain-Vanilla Recurrent Neural Network
      8.2.1 Recurrent Neural Network Layer
      8.2.2 Deep Recurrent Neural Network Architectures
      8.2.3 Designing the Network Output
      8.2.4 Time-Distributed Layer
   8.3 Special Recurrent Neural Networks
      8.3.1 Long Short-Term Memory Network
      8.3.2 Gated Recurrent Unit Network
   8.4 Lab: Mortality Forecasting with RN Networks
      8.4.1 Lee–Carter Model, Revisited
      8.4.2 Direct LSTM Mortality Forecasting
9 Convolutional Neural Networks
   9.1 Plain-Vanilla Convolutional Neural Network Layer
      9.1.1 Input Tensors and Channels
      9.1.2 Generic Convolutional Neural Network Layer
      9.1.3 Example: Time-Series Analysis and Image Recognition
   9.2 Special Purpose Tools for Convolutional Neural Networks
      9.2.1 Padding with Zeros
      9.2.2 Stride
      9.2.3 Dilation
      9.2.4 Pooling Layer
      9.2.5 Flatten Layer
   9.3 Convolutional Neural Network Architectures
      9.3.1 Illustrative Example of a CN Network Architecture
      9.3.2 Lab: Telematics Data
      9.3.3 Lab: Mortality Surface Modeling
10 Natural Language Processing
   10.1 Feature Pre-processing and Bag-of-Words
   10.2 Word Embeddings
      10.2.1 Word to Vector Algorithms
      10.2.2 Global Vectors Algorithm
   10.3 Lab: Predictive Modeling Using Word Embeddings
   10.4 Lab: Deep Word Representation Learning
   10.5 Outlook: Creating Attention
11 Selected Topics in Deep Learning
   11.1 Deep Learning Under Model Uncertainty
      11.1.1 Recap: Tweedie’s Family
      11.1.2 Lab: Claim Size Modeling Under Model Uncertainty
      11.1.3 Lab: Deep Dispersion Modeling
      11.1.4 Pseudo Maximum Likelihood Estimator
   11.2 Deep Quantile Regression
      11.2.1 Deep Quantile Regression: Single Quantile
      11.2.2 Deep Quantile Regression: Multiple Quantiles
      11.2.3 Lab: Deep Quantile Regression
   11.3 Deep Composite Model Regression
      11.3.1 Joint Elicitability of Quantiles and Expected Shortfalls
      11.3.2 Lab: Deep Composite Model Regression
   11.4 Model Uncertainty: A Bootstrap Approach
   11.5 LocalGLMnet: An Interpretable Network Architecture
      11.5.1 Definition of the LocalGLMnet
      11.5.2 Variable Selection in LocalGLMnets
      11.5.3 Lab: LocalGLMnet for Claim Frequency Modeling
      11.5.4 Variable Selection Through Regularization of the LocalGLMnet
      11.5.5 Lab: LASSO Regularization of LocalGLMnet
   11.6 Selected Applications
      11.6.1 Mixture Density Networks
      11.6.2 Estimation of Conditional Expectations
      11.6.3 Bayesian Networks: An Outlook
12 Appendix A: Technical Results on Networks
   12.1 Universality Theorems
   12.2 Consistency and Asymptotic Normality
   12.3 Functional Limit Theorem
   12.4 Hypothesis Testing
13 Appendix B: Data and Examples
   13.1 French Motor Third Party Liability Data
   13.2 Swedish Motorcycle Data
   13.3 Wisconsin Local Government Property Insurance Fund
   13.4 Swiss Accident Insurance Data

Bibliography
Index
Chapter 1
Introduction

1.1 The Statistical Modeling Cycle

We consider statistical modeling of insurance problems. This comprises the process
of data collection, data analysis and statistical model building to forecast insured
events that (may) happen in the future. This problem is at the very heart of statistics
and statistical modeling. Our goal here is to present and provide the statistical tools
that are useful in daily actuarial practice, in particular, we aim at describing the
mathematical foundation behind these statistical concepts and how they can be
applied. Statistical modeling has a wide range of applications, and, depending on
the application, the theoretical aspects may be weighted differently. In insurance
pricing we are mainly interested in optimal predictions, whereas economists often
use statistical tools to explain observations, and in medical fields one is interested
in causal effects that medications have on patients. Therefore, statistical theory is
wide ranging, and one should always keep the corresponding application in mind.
Shmueli [338] nicely discusses the difference between prediction and explanation;
our focus here is mainly on prediction.
Box–Jenkins [49] and McCullagh–Nelder [265] distinguish three processes in
statistical modeling: (i) model identification/selection, (ii) estimation, and (iii)
prediction. In our statistical modeling cycle these three points are slightly modified
and extended:
(1) Data collection, cleaning and pre-processing:
This item takes at least 80% of the total time in statistical modeling. It includes
exploratory data analysis, data visualization and data pre-processing. This part
of the modeling cycle does not seem to be very scientific; however, it is a highly
important step because only extended data analysis allows the modeler to fully
understand the data. Based on this knowledge the modeler can formulate her/his
research question, her/his model, etc.


(2) Selection of a model class:
Based on the knowledge collected in the first item, the modeler has to select a
suitable model class that is able to answer her/his research question. This model
class can be in the sense of a data model (proper stochastic model), but it can
also be an algorithmic model; we refer to the discussion on the “two modeling
cultures” by Breiman [53].
(3) Choice of an objective function:
Once the modeler has specified a model class, she/he needs to define a decision
rule how a particular member of the model class is selected for the collected
data. Often this is in terms of an objective function, e.g., a scoring rule or a loss
function that quantifies misspecification.
(4) Solving a (non-convex) optimization problem:
Once the first three items are completed, one is left with an optimization
problem that tries to find the best model within the selected model class w.r.t. the
given objective function and the collected data. In simple cases this optimization
problem is a convex minimization problem for which numerical tools are in
place. In more complex cases the optimization problem is neither convex nor
concave, and the ‘best’ solution can often not be found explicitly. In that case,
also the meaning of solution needs to be discussed.
(5) Model validation:
In the final step, the selected and fitted model needs to be validated. That
is, does the model fit to the data, does it serve at predicting new data, does
it answer the research question adequately, is there any better model/process
choice, etc.?
(6) Possibly go back to (1):
If the answers in item (5) are not satisfactory, one typically goes back to (1).
For instance, data pre-processing needs to be done differently, etc.
In particular, the two modeling cultures discussion of Breiman [53], published
around the turn of the millennium, has shaken up the statistical community. With
predictive performance as the main criterion, the data modeling culture has gradually
shifted to the algorithmic culture, where the model itself plays a secondary role as
long as the prediction is accurate. The latter is often in the form of a point predictor
which can come from an algorithm. Lifting this discussion to a more scientific
level, providing prediction uncertainty will slowly merge the two modeling cultures.
There is another interesting discussion by Efron [116] on prediction, estimation
(of model parameters) and attribution (predictor selection) that is very much at
the core of statistical modeling. In these notes we especially want to emphasize
the one modeling culture view of Yu–Barter [397], who expect the two modeling
cultures of Breiman [53] to converge much more closely than one would anticipate.
Our goal is to demonstrate how all these different techniques and views can be seen
as a unified modeling framework.
Concluding, the purpose of these notes is to discuss and illustrate how the
different statistical techniques from the data modeling culture and the algorithmic
modeling culture can be combined to solve actuarial questions in the best possible
way. The main emphasis in this discussion lies on the statistical modeling tools,

and we present these tools along with actuarial examples. In actuarial practice one
often distinguishes between life and general insurance. This distinction is made for
good reasons. There are legislative reasons that require a legal separation of life from
general insurance business, but there are also modeling reasons, because insurance
products in life and general insurance can have rather different features. In this book,
we do not make this distinction because the statistical methods presented here can be
useful in both branches of insurance, and we are going to consider life and general
insurance examples, e.g., the former considering mortality forecasting and the latter
aiming at insurance claims prediction for pricing.

1.2 Preliminaries on Probability Theory

The modern axiomatic foundation of probability theory was introduced in 1933 by


the famous mathematician Kolmogoroff [221] in his book called “Grundbegriffe der
Wahrscheinlichkeitsrechnung”. We give a brief introduction to probability theory
and random variables; this introduction follows the lecture notes [387]. Throughout
we assume to work on a sufficiently rich probability space (Ω, A, P), meaning that
this probability space should be able to carry all objects that we study. We denote
(real-valued) random variables on this probability space by capital letters Y, Z, . . .,
and random vectors use boldface capital letters, e.g., we have a random vector
Y = (Y1, . . . , Yq)⊤ of dimension q ∈ N, where each component Yk, 1 ≤ k ≤ q, is a
random variable. Random variables Y are characterized by (cumulative) distribution
functions F : R → [0, 1]; these are right-continuous and non-decreasing with
lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. For y ∈ R,

F(y) = P[Y ≤ y],

being the probability of the event that Y has a realization less than or equal to y. We
write Y ∼ F for Y having distribution function F. Similarly, random vectors Y ∼ F
are characterized by (cumulative) distribution functions F : R^q → [0, 1] with

F(y) = P[Y1 ≤ y1, . . . , Yq ≤ yq]   for y = (y1, . . . , yq)⊤ ∈ R^q.

In insurance modeling, there are two important types of random variables, namely,
discrete random variables and absolutely continuous random variables:
• The distribution function F of a discrete random variable Y is a step function
with countably many steps in discrete points k ∈ N ⊂ R. A discrete random
variable has probability weights in these discrete points

f (k) = P [Y = k] > 0 for k ∈ N,

satisfying Σ_{k∈N} f(k) = 1. If N ⊆ N0, the integer-valued random variable Y
is called a count random variable. Count random variables are used to model the
is called count random variable. Count random variables are used to model the
number of claims in insurance. A similar situation occurs if Y models nominal
outcomes, for instance, if Y models gender with female being encoded by 0 and
male being encoded by 1, then f (0) is the probability weight of having a female
and f (1) = 1 − f (0) the probability weight of having a male; in this case we
identify the finite set N = {0, 1} = {female, male}.
• A random variable Y ∼ F is said to be absolutely continuous (a stronger property
than continuity) if there exists a non-negative (measurable) function f, called the
density of Y, such that

F(y) = ∫_{−∞}^{y} f(x) dx   for all y ∈ R.

In that case we equivalently write Y ∼ f and Y ∼ F. Absolutely continuous
random variables are often used to model claim sizes in insurance.
More generally speaking, discrete and absolutely continuous random variables
have densities f(·) w.r.t. a σ-finite measure ν on R. In the former case, this σ-finite
measure ν is the counting measure on N ⊂ R, and in the latter case it is
the Lebesgue measure on R. In actuarial science we also consider mixed cases;
for instance, Tweedie’s compound Poisson random variable is absolutely continuous
on (0, ∞) with an additional point mass in 0; this model will be studied in Sect. 2.2.3
below.
Choose a random variable Y ∼ F and a measurable function h : R → R. The
expected value of h(Y) is defined by (upon existence)

E[h(Y)] = ∫_R h(y) dF(y).

We mainly focus on the following important examples of the function h:

• expected value, mean or first moment of Y ∼ F: for h(y) = y

μ = E[Y] = ∫_R y dF(y);
• k-th moment of Y ∼ F for k ∈ N: for h(y) = y^k

E[Y^k] = ∫_R y^k dF(y);

• moment generating function of Y ∼ F in r ∈ R: for h(y) = e^{ry}

M_Y(r) = E[e^{rY}] = ∫_R e^{ry} dF(y);

always subject to existence.


The moment generating function MY (·) is sufficient for identifying distribution
functions of random variables Y . The following statements are elementary and their
proofs are based on Section 30 of Billingsley [34]; for more details we also refer to
Chapter 1 in the lecture notes [387]. Assume that the moment generating function
of Y ∼ F has a strictly positive radius of convergence ρ0 > 0 around the origin
implying that MY (r) < ∞ for all r ∈ (−ρ0 , ρ0 ). In this case we can write MY (r)
as a power series expansion
M_Y(r) = Σ_{k=0}^{∞} (r^k / k!) E[Y^k]   for all r ∈ (−ρ0, ρ0).

As a consequence we can differentiate M_Y(·) in the open interval (−ρ0, ρ0)
arbitrarily often, term by term under the sum. The derivatives in r = 0 provide
the k-th moments (which all exist and are finite)

(d^k/dr^k) M_Y(r)|_{r=0} = E[Y^k]   for all k ∈ N0.   (1.1)
In particular, in this case we immediately know that all moments of Y exist, and
these moments completely determine the moment generating function MY of Y .
Another consequence is that for a random variable Y , whose moment generating
function MY has a strictly positive radius of convergence around the origin, the
distribution function F is fully determined by this moment generating function.
That is, if we have two such random variables Y1 and Y2 with M_{Y1}(r) = M_{Y2}(r)
for all r ∈ (−r0, r0), for some r0 > 0, then Y1 =(d) Y2.³ Thus, these two
random variables have the same distribution function. This statement carries over
to the limit, i.e., if we have a sequence of random variables (Yn)n whose moment
generating functions converge on a common interval (−r0, r0), for some r0 > 0,
to the moment generating function of Y, also being finite on (−r0, r0), then (Yn)n
converges in distribution to Y; such an argument is used to prove the central limit
theorem (CLT).

³ The notation Y1 =(d) Y2 is generally used for equality in distribution, meaning that Y1 and Y2
have the same distribution function.
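As a quick numerical illustration of (1.1), consider the exponential distribution with
rate λ, for which M_Y(r) = λ/(λ − r) for r < λ and E[Y^k] = k!/λ^k. The following
minimal Python sketch (with a hand-picked rate and finite-difference step size)
checks that the second derivative of M_Y at the origin recovers the second moment.

    import math

    lam = 2.0                      # rate of the exponential distribution
    M = lambda r: lam / (lam - r)  # moment generating function, valid for r < lam

    h = 1e-3  # finite-difference step size
    d2 = (M(h) - 2.0 * M(0.0) + M(-h)) / h**2  # central second difference at r = 0
    print(d2)                          # approx. 0.5
    print(math.factorial(2) / lam**2)  # exact second moment 2!/lam^2 = 0.5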

In insurance, we often deal with so-called positive random variables Y, meaning
that Y ≥ 0, almost surely (a.s.). In that case, the statements about moment
generating functions and distributions hold true without the assumption of having a
positive radius of convergence around the origin, see Theorem 22.2 in Billingsley
[34]. Note that for positive random variables the moment generating function M_Y(r)
exists for all r ≤ 0.
Existence of the moment generating function MY (r) for some positive r > 0
can also be interpreted as having a light-tailed distribution function. Observe that
if MY (r) exists for some positive r > 0, then we can choose s ∈ (0, r) and
Chebychev’s inequality gives us (we assume Y ≥ 0, a.s., here)

P[Y > y] = P[exp{sY} > exp{sy}] ≤ exp{−sy} M_Y(s).   (1.2)

The latter tells us that the survival function 1 − F (y) = P[Y > y] decays
exponentially for y → ∞. Heavy-tailed distribution functions do not have this
property, but the survival function decays slower than exponentially as y → ∞.
This slower decay of the survival function is the case for so-called subexponential
distribution functions (an example is the log-normal distribution; we refer to Rolski
et al. [320]) and for regularly varying survival functions (an example is the Pareto
distribution). Regularly varying survival functions 1 − F have the property

lim_{y→∞} (1 − F(ty)) / (1 − F(y)) = t^{−β}   for all t > 0 and some β > 0.   (1.3)

These distribution functions have a polynomial tail (power tail) with tail index β >
0. In particular, if a positively supported distribution function F has a regularly
varying survival function with tail index β > 0, then this distribution function is
also subexponential, see Theorem 2.5.5 in Rolski et al. [320].
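A small numerical check makes the difference between (1.2) and (1.3) tangible. The
sketch below assumes a unit-scale Pareto survival function S(y) = y^{−β} for y ≥ 1
and, for contrast, a standard exponential survival function: the Pareto ratio
S(ty)/S(y) equals t^{−β} for every y, whereas the exponential ratio vanishes as y → ∞.

    import math

    beta, t = 2.0, 3.0
    pareto_surv = lambda y: y ** (-beta)  # regularly varying tail, tail index beta
    exp_surv = lambda y: math.exp(-y)     # light (exponential) tail

    for y in (10.0, 100.0, 1000.0):
        # Pareto ratio is exactly t**(-beta) = 1/9 for every y;
        # exponential ratio exp(-(t - 1) * y) tends to 0
        print(pareto_surv(t * y) / pareto_surv(y), exp_surv(t * y) / exp_surv(y))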
We are not going to specifically focus on heavy-tailed distribution functions here,
but we will explain how light-tailed random variables can be transformed to
enjoy heavy-tailed properties. In these notes, we are mainly interested in studying
different aspects of regression modeling. Regression modeling requires numerous
observations to be able to successfully fit these models to the data. By definition,
large claims are scarce, as they live in the tail of the distribution function and, thus,
correspond to rare events. Therefore, it is often not possible to employ a regression
model for scarce tail events. For this reason, extreme value analysis only plays
a marginal role in these notes, even though it has a significant impact on insurance
prices. For more on extreme value theory we refer to the relevant literature, see,
e.g., Embrechts et al. [121], Rolski et al. [320], Mikosch [277] and Albrecher et
al. [7].

1.3 Lab: Exploratory Data Analysis

Our theory is going to be supported by several data examples. These examples are
mostly based on publicly available data. The different data sets are described in
detail in Chap. 13. We highly recommend the reader to use these data sets to gain
her/his own modeling experience.
We describe some tools here that allow for a descriptive and exploratory analysis
of the available data; exploratory data analysis has been introduced and promoted by
Tukey [357]. We consider the observed claim sizes of the Swedish motorcycle data
set described in Sect. 13.2. This data set consists of 656 (positive) claim amounts yi ,
1 ≤ i ≤ n = 656. These claim amounts are illustrated in the boxplots of Fig. 1.1.
Typically in insurance, there are large claims that dominate the picture, see
Fig. 1.1 (lhs). This results in right-skewed distribution functions, and such data is
better illustrated on the log scale, see Fig. 1.1 (rhs). The latter, of course, assumes
that all claims are strictly positive.
Figure 1.2 (lhs) shows the empirical distribution function of the observations yi,
1 ≤ i ≤ n, which is obtained by

F̂_n(y) = (1/n) Σ_{i=1}^{n} 1{yi ≤ y}   for y ∈ R.

If this data set has been generated by i.i.d. random variables, then the Glivenko–
Cantelli theorem [64, 159] tells us that this empirical distribution function F̂_n
converges uniformly to the (true) data generating distribution function, a.s., as the
number n of observations converges to infinity, see Theorem 20.6 in Billingsley
[34].
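The empirical distribution function is straightforward to compute; the following
minimal Python sketch uses a small hypothetical claims sample in place of the
actual Swedish motorcycle data.

    import numpy as np

    claims = np.array([340.0, 780.0, 1200.0, 2300.0, 56000.0])  # hypothetical sample

    def ecdf(y, obs):
        """Empirical distribution function: fraction of observations <= y."""
        return np.mean(obs <= y)

    print(ecdf(1000.0, claims))  # 0.4: two of the five claims are <= 1000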
Fig. 1.1 Boxplot of the claim amounts of the Swedish motorcycle data set: (lhs) on the original
scale and (rhs) on the log scale

Fig. 1.2 (lhs) Empirical distribution and (rhs) empirical density of the observed claim amounts yi,
1 ≤ i ≤ n

Figure 1.2 (rhs) shows the empirical density of the observations yi, 1 ≤ i ≤ n.
This empirical density is obtained by considering a kernel smoother of a given
bandwidth around each observation yi. The standard choice is the Gaussian kernel,
with the bandwidth determining the variance parameter σ² > 0 of the Gaussian
density,

y ↦ f̂_n(y) = (1/n) Σ_{i=1}^{n} (2πσ²)^{−1/2} exp{−(y − yi)² / (2σ²)}.
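This Gaussian kernel smoother translates directly into code; a minimal sketch
(again on a hypothetical sample, with a hand-picked bandwidth σ, whereas in
practice one would use a data-driven bandwidth rule) reads as follows.

    import numpy as np

    def kde(y, obs, sigma):
        """Gaussian kernel density estimate at y with bandwidth sigma."""
        z = (y - obs) / sigma
        return np.mean(np.exp(-0.5 * z**2)) / (sigma * np.sqrt(2.0 * np.pi))

    obs = np.array([340.0, 780.0, 1200.0, 2300.0, 56000.0])
    print(kde(1000.0, obs, sigma=500.0))  # density estimate at y = 1000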

From the graph in Fig. 1.2 (rhs) we observe that the main body of the claim sizes
is below an amount of 50’000, but the biggest claim exceeds 200’000. The latter
motivates to study heavy-tailedness of the claim size data. Therefore, one usually
benchmarks with a distribution function F that has a regularly varying survival
function with a tail index β > 0, see (1.3). Asymptotically, a regularly varying
survival function behaves as y^{−β}; for this reason the log-log plot is a popular tool
to identify regularly varying tails. The log-log plot of a distribution function F is
obtained by considering

y > 0 ↦ (log y, log(1 − F(y))) ∈ R².

Figure 1.3 gives the log-log plot of the empirical distribution function F̂_n. If this
plot looks asymptotically (for y → ∞) like a straight line with a negative slope
−β, then the data shows heavy-tailedness in the sense of regular variation. Such
data cannot be modeled by a distribution function for which the moment generating
function M_Y(r) exists for some positive r > 0, see (1.2). Figure 1.3 does not suggest
a regularly varying tail as we do not see an obvious asymptotic straight line for
increasing claim sizes.
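A log-log plot of an empirical survival function can be produced along the following
lines; this is a minimal sketch on simulated Pareto data (where the points do line up
with slope −β), since the claims data itself is not reproduced here.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    obs = np.sort(rng.pareto(2.0, size=500) + 1.0)  # simulated Pareto sample, beta = 2

    n = len(obs)
    surv = 1.0 - np.arange(1, n + 1) / n  # 1 - F_n at the ordered observations

    # drop the largest observation, where the empirical survival function is 0
    plt.plot(np.log(obs[:-1]), np.log(surv[:-1]), ".")
    plt.xlabel("logged claim amounts")
    plt.ylabel("logged survival function")
    plt.show()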
Fig. 1.3 Log-log plot of the empirical distribution function F̂_n

These graphs give us a first indication of what the claim size data is about. Later
on we are going to introduce explanatory variables that describe the insurance
policyholders behind these claims. These explanatory variables characterize the
policyholder, and the general goal is to get a better description of the claim sizes
as a function of these explanatory variables, e.g., older policyholders may cause
larger claims than younger ones, etc. Such patterns are called systematic effects that
can be explained by explanatory variables.

1.4 Outline of This Book

This book has eleven chapters (including the present one), and it has two appendices.
We briefly describe the contents of these chapters and appendices.
In Chap. 2 we introduce and discuss the exponential family (EF) and the
exponential dispersion family (EDF). The EF and the EDF are by far the most
important classes of distribution functions for regression modeling. They include,
among others, the Gaussian, the binomial, the Poisson, the gamma, the inverse
Gaussian and Tweedie’s models. We introduce these families of distribution func-
tions, discuss their properties and provide several examples. Moreover, we introduce
the Kullback–Leibler (KL) divergence and the Bregman divergence, which are
important tools in model evaluation.
Chapter 3 is on classical statistical decision theory. This chapter is important for
historical reasons, but it also provides the right mathematical grounding and intu-
ition for more modern tools from data science and machine learning. In particular,
we discuss maximum likelihood estimation (MLE), unbiasedness, consistency and
asymptotic normality of MLEs in this chapter.
Chapter 4 is the core theoretical chapter on predictive modeling and forecast
evaluation. The main problem in actuarial modeling is to forecast and price future
claims. For this, we build predictive models, and this chapter deals with assessing
and ranking these predictive models. We therefore introduce the mean squared

error of prediction (MSEP) and, more generally, the generalization loss (GL)
to assess predictive models. This chapter is complemented by a more decision-
theoretic approach to forecast evaluation; it discusses deviance losses, proper
scoring, elicitability, forecast dominance, cross-validation and Akaike’s information
criterion (AIC), and we give an introduction to the bootstrap simulation method.
Chapter 5 discusses state-of-the-art statistical modeling in insurance which is the
generalized linear model (GLM). We discuss GLMs in the light of claim count and
claim size modeling, we present feature engineering, model fitting, model selection,
over-dispersion, zero-inflated claim counts problems, double GLMs, and insurance-
specific issues such as the balance property for having unbiasedness.
Chapter 6 summarizes some techniques that use Bayes’ theorem. These are
classical Bayesian statistical models, e.g., using the Markov chain Monte Carlo
(MCMC) method for model fitting. This chapter discusses regularization of regres-
sion models such as ridge and LASSO regularization, which has a Bayesian
interpretation, and it concerns the Expectation-Maximization (EM) algorithm. The
EM algorithm is a general purpose tool that can handle incomplete data settings. We
illustrate this for different examples coming from mixture distributions, censored
and truncated claims data.
The core of this book consists of deep learning methods and neural networks. Chapter 7
considers deep feed-forward neural (FN) networks. We introduce the generic
architecture of deep FN networks, and we discuss universality theorems of FN
networks. We present network fitting, back-propagation, embedding layers for
categorical variables and insurance-specific issues such as the balance property in
network fitting and network ensembling to reduce model uncertainty. This chapter
is complemented by many examples on non-life insurance pricing, but also on
mortality modeling, as well as tools that help to explain deep FN network regression
results.
Chapters 8 and 9 consider recurrent neural (RN) networks and convolutional
neural (CN) networks. These are special network architectures that are useful for
time-series and spatial data modeling, e.g., applied to image recognition problems.
Time-series and images have a natural topology, and RN and CN networks try to
benefit from this additional structure (over tabular data). We introduce these network
architectures and provide insurance-relevant examples.
Chapter 10 discusses natural language processing (NLP) which deals with
regression modeling of non-tabular or unstructured text data. We explain how
words can be embedded into low-dimensional spaces that serve as numerical word
encodings. These can then be used for text recognition, either using RN networks or
attention layers. We give an example where we aim at predicting claim perils from
claim descriptions.
Chapter 11 is a selection of different topics. We mention forecasting under
model uncertainty, deep quantile regression, deep composite regression and the
LocalGLMnet, which is an interpretable FN network architecture. Moreover, we
provide a bootstrap example to assess prediction uncertainty, and we discuss mixture
density networks.

Chapter 12 (Appendix A) is a technical chapter that discusses universality theorems
for networks and sieve estimators, which are useful for studying asymptotic
normality within a network framework. Chapter 13 (Appendix B) illustrates the data
used in this book.
Finally, we remark that the book is written in a typical mathematical style
using the structure of Lemmas, Theorems, etc. Results and statements which are
particularly important for applications are highlighted with gray boxes.

Chapter 2
Exponential Dispersion Family

We introduce the exponential family (EF) and the exponential dispersion family
(EDF) in this chapter. The single-parameter EF was introduced in 1934
by the British statistician Sir Ronald Fisher [128], and it was extended to vector-
valued parameters by Darmois [88], Koopman [223] and Pitman [306] between
1935 and 1936. It is the most commonly used family of distribution functions
in statistical modeling; among others, it contains the Gaussian distribution, the
gamma distribution, the binomial distribution and the Poisson distribution. Its
parametrization is taken in a special form that is convenient for statistical modeling.
The EF can be introduced in a constructive way providing the main properties of
this family of distribution functions. In this chapter we follow Jørgensen [201–203]
and Barndorff-Nielsen [23], and we state the most important results based on this
constructive introduction. This gives us a unified notation which is going to be useful
for our purposes.

2.1 Exponential Family

2.1.1 Definition and Properties

We define the EF w.r.t. a σ-finite measure ν on R. The results in this section can be
generalized to σ-finite measures on R^m, but such an extension is not necessary for
our purposes. Select an integer k ∈ N, and choose measurable functions a : R → R
and T : R → R^k; we could use boldface notation for T because T(y) ∈ R^k is
vector-valued, but we prefer not to use boldface notation for (vector-valued)
functions. Consider for a canonical parameter θ ∈ R^k the Laplace transform

L(θ) = ∫_R exp{θ⊤ T(y) + a(y)} dν(y).

Assume that this Laplace transform is not identically equal to +∞. The effective
domain is defined by

Θ = {θ ∈ R^k; L(θ) < ∞} ⊆ R^k.   (2.1)

Lemma 2.1 The effective domain Θ ⊆ R^k is a convex set.

The effective domain Θ is not necessarily an open set, but in many applications it
is open. Counterexamples are given in Problem 4.1 of Chapter 1 in Lehmann [244],
and in the inverse Gaussian example in Sect. 2.1.3 below.
Proof of Lemma 2.1 Choose θ_i ∈ R^k, i = 1, 2, with L(θ_i) < ∞. Set θ = cθ_1 +
(1 − c)θ_2 for c ∈ (0, 1). We use Hölder’s inequality, applied to the norms p = 1/c
and q = 1/(1 − c),

L(θ) = ∫_R exp{(cθ_1 + (1 − c)θ_2)⊤ T(y) + a(y)} dν(y)
     = ∫_R (exp{θ_1⊤ T(y) + a(y)})^c (exp{θ_2⊤ T(y) + a(y)})^{1−c} dν(y)
     ≤ L(θ_1)^c L(θ_2)^{1−c} < ∞.

This implies θ ∈ Θ and proves the claim. □



We define the cumulant function on the effective domain Θ

    κ : Θ → R,    θ ↦ κ(θ) = log L(θ).

Definition 2.2 The EF with σ-finite measure ν on R and cumulant function
κ : Θ → R is given by the distribution functions F on R with

    dF(y; θ) = f(y; θ) dν(y) = exp{θ⊤ T(y) − κ(θ) + a(y)} dν(y),            (2.2)

for canonical parameters θ ∈ Θ ⊆ R^k.

Remarks 2.3
• The definition of the EF (2.2) assumes that the effective domain Θ ⊆ R^k has
  been constructed from the choices a : R → R and T : R → R^k as described
  in (2.1); this is not explicitly stated in (2.2).
• The support of any random variable Y ∼ F(·; θ) of this EF does not depend on
  the explicit choice of the canonical parameter θ ∈ Θ, but solely on the choice of
  the σ-finite measure ν on R, and the distribution functions F(·; θ) are mutually
  absolutely continuous (equivalent) w.r.t. ν.
• In statistics, the main object of interest is the canonical parameter θ. Importantly
  for parameter estimation, the function a(·) does not involve the canonical
  parameter. Therefore, it is irrelevant for parameter estimation and (only) serves
  as a normalization so that F in (2.2) is a proper distribution function. In fact, this
  is the way the EF is often introduced in the statistical and actuarial literature, but
  with that introduction we lose the deeper interpretation of the cumulant function
  κ, and it is not immediately clear which properties κ possesses.
• The case k ≥ 2 gives a vector-valued canonical parameter θ. The case k = 1
  gives a single-parameter EF and, if additionally T(y) = y, it is called a single-
  parameter linear EF.

Theorem 2.4 Assume the effective domain Θ has a non-empty interior Θ̊. Choose
Y ∼ F(·; θ) for fixed θ ∈ Θ̊. The moment generating function of T(Y) for
sufficiently small r ∈ R^k is given by

    M_{T(Y)}(r) = E_θ[exp{r⊤ T(Y)}] = exp{κ(θ + r) − κ(θ)},

where the expectation operator E_θ indicates the selected canonical parameter θ
for Y.

Proof Choose θ ∈ Θ̊ and r ∈ R^k so small that θ + r ∈ Θ̊. We receive

    M_{T(Y)}(r) = ∫_R exp{(θ + r)⊤ T(y) − κ(θ) + a(y)} dν(y)
                = exp{κ(θ + r) − κ(θ)} ∫_R exp{(θ + r)⊤ T(y) − κ(θ + r) + a(y)} dν(y)
                = exp{κ(θ + r) − κ(θ)},

where the last identity follows from the fact that the support of the EF does not
depend on the explicit choice of the canonical parameter. □

Theorem 2.4 has a couple of immediate implications. First, in any interior point
θ ∈ Θ̊ both the moment generating function r ↦ M_{T(Y)}(r) (in the neighborhood of
the origin) and the cumulant function θ ↦ κ(θ) have derivatives of all orders and,
similarly to Sect. 1.2, moments of all orders of T(Y) exist, see also (1.1). Existence
of moments of all orders implies that the distribution function of T(Y) cannot have
regularly varying tails.

Corollary 2.5 Assume Θ̊ is non-empty. The cumulant function θ ↦ κ(θ) is
convex, and for Y ∼ F(·; θ) with θ ∈ Θ̊

    μ = E_θ[T(Y)] = ∇_θ κ(θ)    and    Var_θ(T(Y)) = ∇²_θ κ(θ),

where ∇_θ is the gradient and ∇²_θ the Hessian w.r.t. the vector θ.

Similarly to T : R → R^k, we will not use boldface notation for the (multi-
dimensional) mean because later on we will understand the mean μ = μ(θ) ∈ R^k
as a function of the canonical parameter θ; see Footnote 1 on boldface notation.
Proof Existence of the moment generating function for all sufficiently small r ∈ R^k
(around the origin) implies that we have first and second moments. For the first
moment we receive

    μ = E_θ[T(Y)] = ∇_r M_{T(Y)}(r)|_{r=0} = exp{κ(θ + r) − κ(θ)} ∇_r κ(θ + r)|_{r=0} = ∇_θ κ(θ).

Denote component j of T(Y) ∈ R^k by T_j(Y). We have for 1 ≤ j, l ≤ k

    E_θ[T_j(Y) T_l(Y)] = ∂²/∂r_j ∂r_l M_{T(Y)}(r)|_{r=0}
      = exp{κ(θ + r) − κ(θ)} ( ∂²/∂r_j ∂r_l κ(θ + r) + ∂/∂r_j κ(θ + r) · ∂/∂r_l κ(θ + r) )|_{r=0}
      = ∂²/∂θ_j ∂θ_l κ(θ) + ∂/∂θ_j κ(θ) · ∂/∂θ_l κ(θ).

This implies for the covariance

    Cov_θ(T_j(Y), T_l(Y)) = ∂²/∂θ_j ∂θ_l κ(θ).

The convexity of κ follows because ∇²_θ κ(θ) is the positive semi-definite covariance
matrix of T(Y), for all θ ∈ Θ̊. This finishes the proof. □

Assumption 2.6 (Minimal Representation) We assume that the interior Θ̊
of the effective domain Θ is non-empty and that the cumulant function κ is
strictly convex on this interior Θ̊.

Remarks 2.7
• Throughout these notes we will work under Assumption 2.6 without making
  explicit reference. This assumption strengthens the properties of the cumulant
  function κ from being convex, see Corollary 2.5, to being strictly convex. This
  strengthening implies that the mean function θ ↦ μ = μ(θ) = ∇_θ κ(θ) can be
  inverted; this is needed for the canonical link, see Definition 2.8, below.
• The strict convexity of κ means that the covariance matrix ∇²_θ κ(θ) of T(Y) is
  positive definite and has full rank k for all θ ∈ Θ̊, see Corollary 2.5. This property
  is important, otherwise we do not have identifiability in the canonical parameter
  θ because we have a linear dependence between the components of T(Y).
• Mathematically, this strict convexity is not a restriction because it can be obtained
  by working under a so-called minimal representation. If the covariance matrix
  ∇²_θ κ(θ) does not have full rank k, the choice of k is “non-optimal” because the
  problem lives in a smaller dimension. Thus, w.l.o.g., we may and will assume to
  work in this smaller dimension, called the minimal representation; for a rigorous
  derivation of a minimal representation we refer to Section 8.1 in Barndorff-
  Nielsen [23].

Definition 2.8 The canonical link is defined by h = (∇_θ κ)^{-1}.

The application of the canonical link h to the mean implies, under Assumption 2.6,

    h(μ) = h(E_θ[T(Y)]) = θ,

for the mean μ = E_θ[T(Y)] of Y ∼ F(·; θ) with θ ∈ Θ̊.

Remarks 2.9 (Dual Parameter Space) Assumption 2.6 provides that the
canonical link h is well-defined, and we can either work with the canonical
parameter representation θ ∈ Θ̊ ⊆ R^k or with its dual (mean) parameter
representation μ = E_θ[T(Y)] ∈ M with

    M := ∇_θ κ(Θ̊) = {∇_θ κ(θ); θ ∈ Θ̊} ⊆ R^k.                               (2.3)

Strict convexity of κ implies that there is a one-to-one correspondence
between these two parametrizations. Θ is called the effective domain and M
is called the dual parameter space or the mean parameter space.

In Sect. 2.2.4, below, we introduce one more property called steepness that the
cumulant function κ should satisfy. This additional property gives a relationship
between the support T of the random variables T (Y ) of the given EF and the
boundary of the dual parameter space M. This steepness property is important for
parameter estimation.

2.1.2 Single-Parameter Linear EF: Count Variable Examples

We start by giving single-parameter discrete linear EF examples based on counting
measures on N0. Since we work in one dimension k = 1, we replace boldface θ by
the scalar θ ∈ Θ ⊆ R in this section.

Bernoulli Distribution as a Single-Parameter Linear EF

For the Bernoulli distribution with parameter p ∈ (0, 1) we choose as ν the counting
measure on {0, 1}. We make the following choices: T(y) = y,

    a(y) = 0,   κ(θ) = log(1 + e^θ),   p = κ'(θ) = e^θ/(1 + e^θ),   θ = h(p) = log(p/(1 − p)),

for effective domain Θ = R, dual parameter space M = (0, 1) and support T =
{0, 1} of Y = T(Y). With these choices we have

    dF(y; θ) = exp{θy − log(1 + e^θ)} dν(y) = (e^θ/(1 + e^θ))^y (1/(1 + e^θ))^{1−y} dν(y).

θ ↦ κ'(θ) is the logistic or sigmoid function, and the canonical link p ↦ h(p) is
the logit function. Mean and variance are given by

    μ = E_θ[Y] = κ'(θ) = p   and   Var_θ(Y) = κ''(θ) = e^θ/(1 + e^θ)² = p(1 − p),

and the probability weights satisfy for y ∈ T = {0, 1}

    P_θ[Y = y] = p^y (1 − p)^{1−y}.
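These Bernoulli building blocks are easily implemented. The following Python
snippet is a minimal sketch of our own (the function names are chosen for
illustration only): it codes the cumulant function, the sigmoid mean map and the
logit canonical link, and verifies μ = κ'(θ) and Var_θ(Y) = κ''(θ) of Corollary 2.5
by finite differences.

```python
import numpy as np

def kappa(theta):
    # Bernoulli cumulant function kappa(theta) = log(1 + e^theta)
    return np.log1p(np.exp(theta))

def sigmoid(theta):
    # mean map p = kappa'(theta), the logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-theta))

def logit(p):
    # canonical link h(p) = log(p / (1 - p)), the inverse of the sigmoid
    return np.log(p / (1.0 - p))

theta, eps = 0.8, 1e-6
p = sigmoid(theta)
# kappa'(theta) ~ p and kappa''(theta) ~ p (1 - p), cf. Corollary 2.5
print((kappa(theta + eps) - kappa(theta - eps)) / (2 * eps), p)
print((kappa(theta + eps) - 2 * kappa(theta) + kappa(theta - eps)) / eps**2, p * (1 - p))
print(logit(p))  # recovers theta = 0.8
```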

Binomial Distribution as a Single-Parameter Linear EF

For the binomial distribution with parameters n ∈ N and p ∈ (0, 1) we choose as ν
the counting measure on {0, . . . , n}. We make the following choices: T(y) = y,

    a(y) = log \binom{n}{y},   κ(θ) = n log(1 + e^θ),   μ = κ'(θ) = n e^θ/(1 + e^θ),   θ = h(μ) = log(μ/(n − μ)),

for effective domain Θ = R, dual parameter space M = (0, n) and support T =
{0, . . . , n} of Y = T(Y). With these choices we have

    dF(y; θ) = \binom{n}{y} exp{θy − n log(1 + e^θ)} dν(y) = \binom{n}{y} (e^θ/(1 + e^θ))^y (1/(1 + e^θ))^{n−y} dν(y).

Mean and variance are given by

    μ = E_θ[Y] = κ'(θ) = np   and   Var_θ(Y) = κ''(θ) = n e^θ/(1 + e^θ)² = np(1 − p),

where we set p = e^θ/(1 + e^θ). The probability weights satisfy for y ∈ T =
{0, . . . , n}

    P_θ[Y = y] = \binom{n}{y} p^y (1 − p)^{n−y}.

Poisson Distribution as a Single-Parameter Linear EF

For the Poisson distribution with parameter λ > 0 we choose as ν the counting
measure on N0. We make the following choices: T(y) = y,

    a(y) = log(1/y!),   κ(θ) = e^θ,   μ = κ'(θ) = e^θ,   θ = h(μ) = log(μ),

for effective domain Θ = R, dual parameter space M = (0, ∞) and support T =
N0 of Y = T(Y). With these choices we have

    dF(y; θ) = (1/y!) exp{θy − e^θ} dν(y) = e^{−μ} (μ^y/y!) dν(y).          (2.4)

The canonical link μ ↦ h(μ) is the log-link. Mean and variance are given by

    μ = E_θ[Y] = κ'(θ) = λ   and   Var_θ(Y) = κ''(θ) = λ = μ = E_θ[Y],

where we set λ = e^θ. The probability weights in the Poisson case satisfy for y ∈
T = N0

    P_θ[Y = y] = e^{−λ} λ^y/y!.
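The Poisson relations can also be verified empirically. The following sketch (our
own illustration, assuming only the formulas above) checks the mean and variance
identities of Corollary 2.5 and the moment generating function of Theorem 2.4 by
Monte Carlo simulation.

```python
import numpy as np

# Poisson EF: kappa(theta) = e^theta, mean and variance both equal lambda.
lam = 2.5
theta = np.log(lam)

rng = np.random.default_rng(1)
y = rng.poisson(lam, size=10**6)
print(y.mean(), lam)   # ~ kappa'(theta) = e^theta
print(y.var(), lam)    # ~ kappa''(theta) = e^theta

# Theorem 2.4: E[e^{rY}] = exp{kappa(theta + r) - kappa(theta)}
r = 0.3
print(np.exp(r * y).mean(), np.exp(np.exp(theta + r) - np.exp(theta)))
```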

Negative-Binomial (Pólya) Distribution as a Single-Parameter Linear EF

For the negative-binomial distribution with α > 0 and p ∈ (0, 1) we choose as
ν the counting measure on N0; α plays the role of a nuisance parameter or hyper-
parameter. We make the following choices: T(y) = y,

    a(y) = log \binom{y+α−1}{y},   κ(θ) = −α log(1 − e^θ),
    μ = κ'(θ) = α e^θ/(1 − e^θ),   θ = h(μ) = log(μ/(μ + α)),

for effective domain Θ = (−∞, 0), dual parameter space M = (0, ∞) and support
T = N0 of Y = T(Y). With these choices we have

    dF(y; θ) = \binom{y+α−1}{y} exp{θy + α log(1 − e^θ)} dν(y)
             = \binom{y+α−1}{y} p^y (1 − p)^α dν(y),

with p = e^θ. The parameter α > 0 is treated as a nuisance parameter, otherwise we
drop out of the EF framework. For the first two moments we have

    μ = E_θ[Y] = α e^θ/(1 − e^θ) = α p/(1 − p)   and   Var_θ(Y) = E_θ[Y] (1 + e^θ/(1 − e^θ)) > E_θ[Y].

This model allows us to model over-dispersion, in contrast to the Poisson model.
In fact, the negative-binomial model is a mixed Poisson model with a gamma
mixing distribution, for details see Sect. 5.3.5, below; a simulation sketch follows
after this paragraph. Typically, one uses a different parametrization. Set e^θ =
λ/(α + λ), for λ > 0. This implies

    μ = E_θ[Y] = λ   and   Var_θ(Y) = λ (1 + λ/α) > λ.

For α ∈ N this model can also be interpreted as the waiting time until we observe
α successful trials among i.i.d. trials, for instance, for α = 1 we have the geometric
distribution (with a small reparametrization).
The probability weights of the negative-binomial model satisfy for y ∈ T = N0

    P_θ[Y = y] = \binom{y+α−1}{y} p^y (1 − p)^α.                            (2.5)
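The gamma mixing representation mentioned above is easily illustrated by
simulation. In the following sketch (our own illustration) we draw Λ from a gamma
distribution with mean λ and variance λ²/α, and then N | Λ ∼ Poi(Λ); the resulting
counts show exactly the over-dispersion Var(N) = λ(1 + λ/α) derived above.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, lam, n = 2.0, 1.5, 10**6

# gamma mixing distribution with mean lam and variance lam^2 / alpha
Lambda = rng.gamma(shape=alpha, scale=lam / alpha, size=n)
N = rng.poisson(Lambda)  # mixed Poisson, i.e., negative-binomial counts

print(N.mean(), lam)                     # ~ lam
print(N.var(), lam * (1 + lam / alpha))  # over-dispersed: > lam
```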

2.1.3 Vector-Valued Parameter EF: Absolutely Continuous Examples

In this section we give vector-valued parameter absolutely continuous EF examples
with k = 2, based on the Lebesgue measure on (subsets of) R.

Gaussian Distribution as a Vector-Valued Parameter EF

For the Gaussian distribution with parameters μ ∈ R and σ² > 0 we choose as ν
the Lebesgue measure on R, and we make the following choices: T(y) = (y, y²)⊤,

    a(y) = −(1/2) log(2π),   κ(θ) = −θ₁²/(4θ₂) − (1/2) log(−2θ₂),
    (μ, σ² + μ²)⊤ = ∇_θ κ(θ) = ( θ₁/(−2θ₂), (−2θ₂)^{−1} + θ₁²/(4θ₂²) )⊤,

for effective domain Θ = R × (−∞, 0), dual parameter space M = R × (0, ∞)
and support T = R × [0, ∞) of T(Y) = (Y, Y²)⊤. With these choices we have

    dF(y; θ) = (1/√(2π)) exp{θ⊤ T(y) + θ₁²/(4θ₂) + (1/2) log(−2θ₂)} dν(y)
             = (1/(√(2π) (−2θ₂)^{−1/2})) exp{ −(1/2) (y − θ₁/(−2θ₂))² / (−2θ₂)^{−1} } dν(y).

This is the Gaussian model with mean μ = θ₁/(−2θ₂) and variance σ² =
(−2θ₂)^{−1}.
If we treat σ > 0 as a nuisance parameter, we obtain the Gaussian model as a
single-parameter EF. This is the most common example of an EF. Set T(y) = y/σ
and

    a(y) = −(1/2) log(2πσ²) − y²/(2σ²),   κ(θ) = θ²/2,   μ = κ'(θ) = θ,   θ = h(μ) = μ,

for effective domain Θ = R, dual parameter space M = R and support T = R of
T(Y) = Y/σ. With these choices we have

    dF(y; θ) = (1/(√(2π)σ)) exp{θy/σ − y²/(2σ²) − θ²/2} dν(y)
             = (1/(√(2π)σ)) exp{−(1/(2σ²)) (y − σθ)²} dν(y),

and, in particular, the canonical link is the identity link μ ↦ θ = h(μ) = μ in this
single-parameter EF example.
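The change between the canonical parametrization θ = (θ₁, θ₂)⊤ and the familiar
mean parametrization (μ, σ²) is a simple reparametrization; the following sketch
(our own helper functions, for illustration) codes the two maps and checks that they
are mutually inverse.

```python
import numpy as np

def gaussian_theta(mu, sigma2):
    # canonical parameters of the Gaussian EF with T(y) = (y, y^2):
    # theta1 = mu / sigma^2, theta2 = -1 / (2 sigma^2)
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def gaussian_mu_sigma2(theta1, theta2):
    # inverse map: mu = theta1 / (-2 theta2), sigma^2 = (-2 theta2)^(-1)
    sigma2 = 1.0 / (-2.0 * theta2)
    return theta1 * sigma2, sigma2

theta = gaussian_theta(mu=1.5, sigma2=0.8)
print(theta, gaussian_mu_sigma2(*theta))  # round trip recovers (1.5, 0.8)
```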

Gamma Distribution as a Vector-Valued Parameter EF

For the gamma distribution with parameters α, β > 0 we choose as ν the Lebesgue
measure on R+. Then we make the following choices: T(y) = (y, log y)⊤,

    a(y) = −log y,   κ(θ) = log Γ(θ₂) − θ₂ log(−θ₁),
    (α/β, Γ'(α)/Γ(α) − log(β))⊤ = ∇_θ κ(θ) = ( θ₂/(−θ₁), Γ'(θ₂)/Γ(θ₂) − log(−θ₁) )⊤,

for effective domain Θ = (−∞, 0) × (0, ∞), and setting β = −θ₁ > 0 and
α = θ₂ > 0. The dual parameter space is M = (0, ∞) × R, and we have support
T = (0, ∞) × R of T(Y) = (Y, log Y)⊤. With these choices we obtain

    dF(y; θ) = exp{θ⊤ T(y) − log Γ(θ₂) + θ₂ log(−θ₁) − log y} dν(y)
             = ((−θ₁)^{θ₂}/Γ(θ₂)) y^{θ₂−1} exp{−(−θ₁)y} dν(y)
             = (β^α/Γ(α)) y^{α−1} exp{−βy} dν(y).

This is a vector-valued parameter EF with k = 2, and the first moment is given by

    E_θ[(Y, log Y)⊤] = ∇_θ κ(θ) = ( α/β, Γ'(α)/Γ(α) − log(β) )⊤.

The parameter α is called the shape parameter and β is called the scale parameter.²
If we treat the shape parameter α > 0 as a nuisance parameter we can turn the
gamma distribution into a single-parameter linear EF. Set T(y) = y and

    a(y) = (α − 1) log y − log Γ(α),   κ(θ) = −α log(−θ),   μ = κ'(θ) = α/(−θ),   θ = h(μ) = −α/μ,

for effective domain Θ = (−∞, 0), dual parameter space M = (0, ∞) and support
T = (0, ∞). With these choices we have for β = −θ > 0

    dF(y; θ) = ((−θ)^α/Γ(α)) y^{α−1} exp{−(−θ)y} dν(y).                     (2.6)

This provides us with mean and variance

    μ = E_θ[Y] = α/β   and   σ² = Var_θ(Y) = α/β² = μ²/α.

² The function ψ(x) = (d/dx) log Γ(x) = Γ'(x)/Γ(x) is called the digamma function.

For parameter estimation one often needs to invert these identities, which gives us

    α = μ²/σ²   and   β = μ/σ².
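This moment inversion is the standard method-of-moments initialization for the
gamma model; a small sketch (our own helper, for illustration):

```python
def gamma_from_moments(mu, sigma2):
    # method-of-moments inversion: alpha = mu^2 / sigma^2, beta = mu / sigma^2
    return mu**2 / sigma2, mu / sigma2

alpha, beta = gamma_from_moments(mu=2.0, sigma2=0.5)
print(alpha, beta)                    # (8.0, 4.0)
print(alpha / beta, alpha / beta**2)  # recovers (mu, sigma2) = (2.0, 0.5)
```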

Remarks 2.10
• The gamma distribution contains as special cases the exponential distribution for
  α = θ₂ = 1 and β = −θ₁ > 0, and the χ²_r-distribution with r degrees of freedom
  for α = θ₂ = r/2 and β = −θ₁ = 1/2.
• The distributions of the EF are all light-tailed in the sense that all moments
  of T(Y) exist. Therefore, the EF does not allow for regularly varying survival
  functions, see (1.3). If Y is gamma distributed, then Z = exp{Y} is log-gamma
  distributed (with the special case of the Pareto distribution for the exponential
  case α = θ₂ = 1). For an example we refer to Sect. 2.2.5. However, this log-
  transformation is not always recommended because it may provide accurate
  models on the transformed log-scale, but back-transformation to the original
  scale may not necessarily provide a good predictive model on that original scale.
• The gamma density (2.6) may be a bit tricky in applications because the effective
  domain Θ = (−∞, 0) is one-sided bounded (we come back to this below). For
  this reason, in practice, one often uses links different from the canonical link
  h(μ) = −α/μ. For instance, the parametrization θ = −exp{−ϑ} for ϑ ∈ R, see
  Ohlsson–Johansson [290], leads to the following model

      dF(y; ϑ) = (y^{α−1}/Γ(α)) exp{−e^{−ϑ} y − αϑ} dν(y).                  (2.7)

  We will study the gamma model in more depth below, and parametrization (2.7)
  will correspond to the log-link choice, see Example 5.5, below.
Figure 2.1 gives examples of gamma densities for shape parameters α ∈
{1/2, 1, 3/2, 2} and scale parameters β ∈ {1/2, 1, 3/2, 2} with α = β, all providing
the same mean μ = E_θ[Y] = α/β = 1. The crucial observation is that these gamma
densities can have two different shapes: for α ≤ 1 we have a strictly decreasing
shape and for α > 1 we have a unimodal density with mode at (α − 1)/β.

Inverse Gaussian Distribution as a Vector-Valued Parameter EF

For the inverse Gaussian distribution with parameters α, β > 0 we choose as ν the
Lebesgue measure on R+. Then we make the following choices: T(y) = (y, 1/y)⊤,

    a(y) = −(1/2) log(2πy³),   κ(θ) = −2(θ₁θ₂)^{1/2} − (1/2) log(−2θ₂),
    (α/β, β/α + 1/α²)⊤ = ∇_θ κ(θ) = ( (−2θ₂)^{1/2}/(−2θ₁)^{1/2}, (−2θ₁)^{1/2}/(−2θ₂)^{1/2} + 1/(−2θ₂) )⊤,
[Fig. 2.1 Gamma densities for shape parameters α ∈ {1/2, 1, 3/2, 2} and scale parameters
β ∈ {1/2, 1, 3/2, 2}, all providing the same mean μ = α/β = 1]

for θ = (θ₁, θ₂)⊤ ∈ (−∞, 0)², and setting β = (−2θ₁)^{1/2} and α = (−2θ₂)^{1/2}.
The dual parameter space is M = (0, ∞)², and we have support T = (0, ∞)² of
T(Y) = (Y, 1/Y)⊤. With these choices we obtain

    dF(y; θ) = exp{θ⊤ T(y) + 2(θ₁θ₂)^{1/2} + (1/2) log(−2θ₂) − (1/2) log(2πy³)} dν(y)
             = ((−2θ₂)^{1/2}/(2πy³)^{1/2}) exp{ −(1/(2y)) ((−2θ₁)y² + (−2θ₂) − 4(θ₁θ₂)^{1/2} y) } dν(y)
             = (α/(2πy³)^{1/2}) exp{ −(α²/(2y)) (1 − (β/α) y)² } dν(y).      (2.8)

This is a vector-valued parameter EF with k = 2 and with first moment

    E_θ[(Y, 1/Y)⊤] = ∇_θ κ(θ) = ( α/β, β/α + 1/α² )⊤.

To receive (2.8) we have chosen the canonical parameter θ = (θ₁, θ₂)⊤ ∈ (−∞, 0)².
Interestingly, we can close this parameter space for θ₁ = 0, i.e., the effective domain
Θ is not open in this example. The choice θ₁ = 0 gives us cumulant function κ(θ) =
−(1/2) log(−2θ₂) and boundary case

    dF(y; θ) = exp{θ⊤ T(y) + (1/2) log(−2θ₂) − (1/2) log(2πy³)} dν(y)
             = ((−2θ₂)^{1/2}/(2πy³)^{1/2}) exp{ −(−2θ₂)/(2y) } dν(y)
             = (α/(2πy³)^{1/2}) exp{ −α²/(2y) } dν(y).                       (2.9)

This is the distribution of the first-passage time of level α > 0 of a standard
Brownian motion, see Bachelier [20]; this distribution is also known as the Lévy
distribution.
If we treat α > 0 as a nuisance parameter, we can turn the inverse Gaussian
distribution into a single-parameter linear EF by setting T(y) = y,

    a(y) = log(α/(2πy³)^{1/2}) − α²/(2y),   κ(θ) = −α(−2θ)^{1/2},
    μ = κ'(θ) = α/(−2θ)^{1/2},   θ = h(μ) = −(1/2) α²/μ²,

for θ ∈ (−∞, 0), dual parameter space M = (0, ∞) and support T = (0, ∞). With
these choices we have the inverse Gaussian model for β = (−2θ)^{1/2} > 0

    dF(y; θ) = exp{a(y)} exp{ −(1/(2y)) ((−2θ)y² − 2α(−2θ)^{1/2} y) } dν(y)
             = (α/(2πy³)^{1/2}) exp{ −(α²/(2y)) (1 − (β/α) y)² } dν(y).

This provides us with mean and variance

    μ = E_θ[Y] = α/β   and   σ² = Var_θ(Y) = α/β³ = μ³/α².

For parameter estimation one often needs to invert these identities, which gives us

    α = μ^{3/2}/σ   and   β = μ^{1/2}/σ.

Figure 2.2 gives examples of inverse Gaussian densities for parameter choices
α = β ∈ {1/2, 1, 3/2, 2}, all providing the same mean μ = E_θ[Y] = α/β = 1.
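Analogously to the gamma case, this moment inversion can be coded directly; the
following sketch (our own helper, for illustration) recovers (α, β) from (μ, σ²) and
checks the round trip.

```python
def inverse_gaussian_from_moments(mu, sigma2):
    # inversion of mu = alpha / beta and sigma^2 = alpha / beta^3:
    # alpha = mu^(3/2) / sigma, beta = mu^(1/2) / sigma
    sigma = sigma2**0.5
    return mu**1.5 / sigma, mu**0.5 / sigma

alpha, beta = inverse_gaussian_from_moments(mu=1.0, sigma2=0.25)
print(alpha, beta)                    # (2.0, 2.0)
print(alpha / beta, alpha / beta**3)  # recovers (mu, sigma2) = (1.0, 0.25)
```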

Generalized Inverse Gaussian Distribution as a Vector-Valued Parameter EF

For the generalized inverse Gaussian distribution with parameters α, β > 0 and
γ ∈ R we choose as ν the Lebesgue measure on R+. We combine the terms of
the gamma and the inverse Gaussian models to the vector-valued choice: T(y) =
(y, log y, 1/y)⊤ with k = 3. Moreover, we choose a(y) = −log y and cumulant
function

    κ(θ) = log( 2K_{θ₂}(2(θ₁θ₃)^{1/2}) ) − (θ₂/2) log(θ₁/θ₃),
[Fig. 2.2 Inverse Gaussian densities for parameters α = β ∈ {1/2, 1, 3/2, 2}, all providing the
same mean μ = α/β = 1]

for θ = (θ₁, θ₂, θ₃)⊤ ∈ (−∞, 0) × R × (−∞, 0), and where K_{θ₂} denotes the
modified Bessel function of the second kind with index γ = θ₂ ∈ R. With these
choices we obtain the generalized inverse Gaussian density

    dF(y; θ) = exp{θ⊤ T(y) − log(2K_{θ₂}(2(θ₁θ₃)^{1/2})) + (θ₂/2) log(θ₁/θ₃) − log y} dν(y)
             = ((α/β)^{γ/2}/(2K_γ(√(αβ)))) y^{γ−1} exp{ −(1/2)(αy + βy^{−1}) } dν(y),   (2.10)

setting α = −2θ₁ and β = −2θ₃. This is a vector-valued parameter EF with k = 3,
and the first moment is given by

    E_θ[(Y, log Y, 1/Y)⊤] = ∇_θ κ(θ)
      = ( (K_{γ+1}(√(αβ))/K_γ(√(αβ))) √(β/α),
          log(√(β/α)) + (∂/∂γ) log K_γ(√(αβ)),
          (K_{γ+1}(√(αβ))/K_γ(√(αβ))) √(α/β) − 2γ/β )⊤.

The effective domain Θ is a bit complicated because the possible choices of (θ₁, θ₃)
depend on θ₂ ∈ R, namely, for θ₂ < 0 the negative half-line (−∞, 0] can be closed
at the origin for θ₁, and for θ₂ > 0 it can be closed at the origin for θ₃. The inverse
Gaussian model is obtained for θ₂ = −1/2 and the gamma model is obtained for
θ₃ = 0. For further properties of the generalized inverse Gaussian distribution we
refer to the textbook of Jørgensen [200].

2.1.4 Vector-Valued Parameter EF: Count Variable Example

We close our EF examples by giving a discrete example with a vector-valued
parameter.

Categorical Distribution as a Vector-Valued Parameter EF

For the categorical distribution with k ∈ N and p ∈ (0, 1)^k such that Σ_{i=1}^k p_i < 1,
we choose as ν the counting measure on the finite set {1, . . . , k + 1}. Then we make
the following choices: T(y) = (1_{y=1}, . . . , 1_{y=k})⊤ ∈ R^k, θ = (θ₁, . . . , θ_k)⊤,
e^θ = (e^{θ₁}, . . . , e^{θ_k})⊤ and

    a(y) = 0,   κ(θ) = log(1 + Σ_{i=1}^k e^{θ_i}),   p = ∇_θ κ(θ) = e^θ/(1 + Σ_{i=1}^k e^{θ_i}),

for effective domain Θ = R^k, dual parameter space M = (0, 1)^k, and the support
T of T(Y) are the k + 1 corners of the unit simplex in R^k. This representation is
minimal, see Assumption 2.6. With these choices we have (set θ_{k+1} = 0)

    dF(y; θ) = exp{θ⊤ T(y) − log(1 + Σ_{i=1}^k e^{θ_i})} dν(y) = Π_{j=1}^{k+1} ( e^{θ_j}/Σ_{i=1}^{k+1} e^{θ_i} )^{1_{y=j}} dν(y).

This is a vector-valued parameter EF with k ∈ N. The canonical link is slightly
more complicated. Set vectors v = exp{θ} ∈ R^k and w = (1, . . . , 1)⊤ ∈ R^k. This
provides p = ∇_θ κ(θ) = v/(1 + w⊤v) ∈ R^k. Set matrix A_p = 1 − pw⊤ ∈ R^{k×k};
the latter gives us p = A_p v, and since A_p has full rank k, we obtain the canonical
link

    p ↦ θ = h(p) = log( A_p^{-1} p ) = log( p/(1 − w⊤p) ).

The last identity can be verified by explicit calculation

    log( p/(1 − w⊤p) ) = log( (e^θ/(1 + Σ_{j=1}^k e^{θ_j})) / (1 − Σ_{i=1}^k e^{θ_i}/(1 + Σ_{j=1}^k e^{θ_j})) ) = log(e^θ) = θ.
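The mean map p = ∇_θ κ(θ) is the softmax function (with reference category k + 1),
and the canonical link inverts it. The following sketch (our own helper names, for
illustration) verifies this inversion numerically.

```python
import numpy as np

def softmax_mean(theta):
    # p = grad kappa(theta) = e^theta / (1 + sum_i e^theta_i), k-dimensional
    v = np.exp(theta)
    return v / (1.0 + v.sum())

def canonical_link(p):
    # h(p) = log(p / (1 - w'p)), the inverse of softmax_mean
    return np.log(p / (1.0 - p.sum()))

theta = np.array([0.5, -1.0, 2.0])
p = softmax_mean(theta)
print(p, 1.0 - p.sum())   # class probabilities 1..k and of reference class k+1
print(canonical_link(p))  # recovers theta
```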

Remarks 2.11
• There are many more examples that belong to the EF. From Theorem 2.4, we
  know that all examples of the EF are light-tailed in the sense that all moments of
  T(Y) exist. If we want to model heavy-tailed distributions within the EF, we first
  need to apply a suitable transformation. We could model the Pareto distribution
  using the transformation T(y) = log y, and assuming that the transformed random
  variable has an exponential distribution. Different light-tailed examples are
  obtained by, e.g., using the transformation T(y) = y^τ for the Weibull distribution
  or T(y) = (log y, log(1 − y))⊤ for the beta distribution. We refrain from giving
  explicit formulas for these or other examples.
• Observe that in all examples above we have T ⊂ M̄, i.e., the support of T(Y)
  is contained in the closure of the dual parameter space M; we come back to this
  observation in Sect. 2.2.4, below.

2.2 Exponential Dispersion Family

In the previous section we introduced the EF, and we explicitly studied the
vector-valued parameter EF examples of the Gaussian, the gamma and the inverse
Gaussian models. We have highlighted that these three vector-valued parameter
EFs can be turned into single-parameter EFs by declaring one parameter to be
a nuisance parameter that is not modeled (and acts as a hyper-parameter). These
three single-parameter EFs with nuisance parameter can also be interpreted as EDF
models. In this section we discuss the single-parameter EDF; this is sufficient for
our purposes, and vector-valued parameter extensions can be obtained in a canonical
way.

2.2.1 Definition and Properties

The EFs of Sect. 2.1 can be extended to EDFs. In the single-parameter case this
is achieved by a transformation Y = X/ω, where ω > 0 is a scaling and where X
belongs to a single-parameter linear EF, i.e., with T(x) = x. We restrict ourselves to
the single-parameter case k = 1 throughout this section. Choose a σ-finite measure
ν₁ on R and a measurable function a₁ : R → R. These choices give a single-
parameter linear EF, directly modeling a real-valued random variable T(X) = X.
By (2.2), the single-parameter linear EF random variable X has distribution

    dF(x; θ, 1) = f(x; θ, 1) dν₁(x) = exp{θx − κ(θ) + a₁(x)} dν₁(x),

on the effective domain

    Θ = { θ ∈ R; ∫_R exp{θx + a₁(x)} dν₁(x) < ∞ },                          (2.11)

and with cumulant function

    θ ∈ Θ ↦ κ(θ) = log( ∫_R exp{θx + a₁(x)} dν₁(x) ).                       (2.12)

Throughout, we assume that the effective domain Θ has a non-empty interior Θ̊.
Thus, since Θ is convex, we assume that Θ̊ is a non-empty (possibly infinite) open
interval in R.
Following Jørgensen [201, 202], we extend this linear EF to an EDF as follows.
Choose a family of σ-finite measures ν_ω on R and measurable functions a_ω : R →
R for a given index set W ∋ ω with {1} ⊂ W ⊂ R+. Assume that we have an
ω-independent scaled cumulant function κ on this index set W, that is,

    θ ∈ Θ ↦ κ(θ) = (1/ω) log( ∫_R exp{θx + a_ω(x)} dν_ω(x) )   for all ω ∈ W,

with effective domain Θ defined by (2.11), i.e., for ω = 1. This allows us to consider
the distribution functions

    dF(x; θ, ω) = f(x; θ, ω) dν_ω(x) = exp{θx − ωκ(θ) + a_ω(x)} dν_ω(x)
                = exp{ω(θy − κ(θ)) + a_ω(ωy)} dν_ω(ωy),                     (2.13)

where in the third identity we did the change of variable x ↦ y = x/ω. By re-
parametrizing the function a_ω(ω ·) and the σ-finite measures ν_ω(ω ·) slightly
differently, depending on the particular structure of the chosen σ-finite measures,
we arrive at the following single-parameter EDF.

Definition 2.12 The (single-parameter) EDF is given by densities of the form

    Y ∼ f(y; θ, v/ϕ) = exp{ (yθ − κ(θ))/(ϕ/v) + a(y; v/ϕ) },                (2.14)

with

    κ : Θ → R   the cumulant function (2.12),
    θ ∈ Θ       the canonical parameter in the effective domain (2.11),
    v > 0       a given weight (exposure, volume),
    ϕ > 0       the dispersion parameter,
    a(·; ·)     the normalization, not depending on the canonical parameter θ.

Remarks 2.13
• Exposure v > 0 and dispersion parameter ϕ > 0 provide the parametrization
  usually used for ω = v/ϕ ∈ W. Their meaning and interpretation will become
  clear below, and they will always appear as the ratio ω = v/ϕ.
• The support of these EDF distributions does not depend on the explicit choice of
  the canonical parameter θ ∈ Θ, but it may depend on ω = v/ϕ ∈ W through
  the choices of the σ-finite measures ν_ω, for ω ∈ W. Consequently, a(y; ω) is
  a normalization such that f(y; θ, ω) integrates to 1 w.r.t. the chosen σ-finite
  measure ν_ω to receive a proper distributional model.
• The transformation x ↦ y = x/ω in (2.13) is called the duality transformation,
  see Section 3.1 in Jørgensen [203]. It provides the duality between the additive
  form (in the variable x in (2.13)) and the reproductive form (in the variable y
  in (2.13)) of the EDF; Definition 2.12 is the reproductive form.
• Lemma 2.1 tells us that Θ is convex, thus, it is a possibly infinite interval in R.
  To exclude trivial cases we will always assume that the σ-finite measure ν₁ is not
  concentrated in one single point (this relates to the minimal representation for
  k = 1 in the linear EF case, see Assumption 2.6), and that the interior Θ̊ of the
  effective domain Θ is non-empty.

Corollary 2.14 Assume Θ̊ is non-empty and that ν₁ is not concentrated in
one single point. Choose Y ∼ F(·; θ, v/ϕ) for fixed θ ∈ Θ̊. The moment
generating function of Y for small r ∈ R satisfies

    M_Y(r) = E_θ[exp{rY}] = exp{ (v/ϕ) [κ(θ + rϕ/v) − κ(θ)] }.

The first two moments of Y are given by

    μ = E_θ[Y] = κ'(θ)   and   Var_θ(Y) = (ϕ/v) κ''(θ) > 0.

The cumulant function κ is smooth and strictly convex on Θ̊ with canonical
link h = (κ')^{-1}. The variance function is defined by μ ↦ V(μ) = (κ'' ∘ h)(μ)
and, consequently, for the variance of Y we have Var_μ(Y) = (ϕ/v) V(μ) for
μ ∈ M.

Proof This follows analogously to Theorem 2.4. The linear case T(y) = y with ν₁
not being concentrated in one single point guarantees that the minimal dimension is
k = 1, providing a minimal representation in this dimension, see Assumption 2.6. □

Before giving explicit examples we state the so-called convolution formula.

Corollary 2.15 (Convolution Formula) Assume Θ̊ is non-empty and that ν₁ is not
concentrated in one single point. Assume that Y_i ∼ F(·; θ, v_i/ϕ) are independent,
for 1 ≤ i ≤ n, with fixed θ ∈ Θ̊. Set v₊ = Σ_{i=1}^n v_i. Then

    Y₊ = (1/v₊) Σ_{i=1}^n v_i Y_i ∼ F(·; θ, v₊/ϕ).

Proof The proof immediately follows from calculating the moment generating
function M_{Y₊}(r) and from using the independence of the Y_i's. □
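The convolution formula can be checked by simulation. In the Poisson case of
Sect. 2.2.2, below, we have Y_i = X_i/v_i with X_i ∼ Poi(v_i λ) and ϕ = 1, and the
weighted average Y₊ again has mean λ = κ'(θ) and variance V(λ)/v₊ = λ/v₊. A
minimal sketch of our own:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 0.7                      # common mean, i.e., common canonical parameter
v = np.array([0.5, 1.0, 2.5])  # individual exposures, dispersion phi = 1
n = 10**6

# Y_i = X_i / v_i with X_i ~ Poi(v_i lam); weighted average Y+ of the Y_i
X = rng.poisson(lam * v[:, None], size=(len(v), n))
Yplus = X.sum(axis=0) / v.sum()

print(Yplus.mean(), lam)           # ~ lam
print(Yplus.var(), lam / v.sum())  # ~ V(lam) / v+ = lam / v+
```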

2.2.2 Exponential Dispersion Family Examples

The single-parameter linear EF examples introduced above can be reformulated as
EDF examples.

Binomial Distribution as a Single-Parameter EDF

For the binomial distribution with parameters p ∈ (0, 1) and n ∈ N we choose
the counting measure on {0, 1/n, . . . , 1} with ω = n. Then we make the following
choices

    a(y) = log \binom{n}{ny},   κ(θ) = log(1 + e^θ),   p = κ'(θ) = e^θ/(1 + e^θ),   θ = h(p) = log(p/(1 − p)),

for effective domain Θ = R and dual parameter space M = (0, 1). With these
choices we have

    f(y; θ, n) = \binom{n}{ny} exp{n(θy − log(1 + e^θ))} = \binom{n}{ny} (e^θ/(1 + e^θ))^{ny} (1/(1 + e^θ))^{n−ny}.

This is a single-parameter EDF. The canonical link p ↦ h(p) gives the logit
function. Mean and variance are given by

    p = E_θ[Y] = κ'(θ) = e^θ/(1 + e^θ)   and   Var_θ(Y) = (1/n) κ''(θ) = (1/n) e^θ/(1 + e^θ)² = (1/n) p(1 − p),

and the variance function is given by V(μ) = μ(1 − μ). The binomial random
variable is obtained by setting X = nY ∼ Binom(n, p).

Poisson Distribution as a Single-Parameter EDF

For the Poisson distribution with parameters λ > 0 and v > 0 we choose the
counting measure on N0/v for exposure ω = v. Then we make the following choices

    a(y) = log( v^{vy}/(vy)! ),   κ(θ) = e^θ,   λ = κ'(θ) = e^θ,   θ = h(λ) = log(λ),

for effective domain Θ = R and dual parameter space M = (0, ∞). With these
choices we have

    f(y; θ, v) = (v^{vy}/(vy)!) exp{v(θy − e^θ)} = e^{−vλ} (vλ)^{vy}/(vy)!.    (2.15)

This is a single-parameter EDF. The canonical link λ ↦ h(λ) is the log-link. Mean
and variance are given by

    λ = E_θ[Y] = κ'(θ) = e^θ   and   Var_θ(Y) = (1/v) κ''(θ) = (1/v) e^θ = λ/v,

and the variance function is given by V(λ) = λ, that is, the variance function is
linear in the mean parameter λ. The Poisson random variable is obtained by setting
X = vY ∼ Poi(vλ). We choose ϕ = 1, here, meaning that we have neither under-
nor over-dispersion. Thus, the choices v and ϕ in ω = v/ϕ have the interpretation
of an exposure and a dispersion parameter, respectively. This interpretation is going
to be important in claim counts modeling, below.

Gamma Distribution as a Single-Parameter EDF

For the gamma distribution with parameters α, β > 0 we choose the Lebesgue
measure on R+ and shape parameter ω = v/ϕ = α. We make the following choices

    a(y) = (α − 1) log y + α log α − log Γ(α),   κ(θ) = −log(−θ),
    μ = κ'(θ) = −1/θ,   θ = h(μ) = −1/μ,

for effective domain Θ = (−∞, 0) and dual parameter space M = (0, ∞). With
these choices we have

    f(y; θ, α) = (α^α/Γ(α)) y^{α−1} exp{α(yθ + log(−θ))} = ((−θα)^α/Γ(α)) y^{α−1} exp{−(−θα)y}.

This is analogous to (2.6) with shape parameter α > 0 and scale parameter β =
−θ > 0. Mean and variance are given by

    μ = E_θ[Y] = κ'(θ) = −θ^{−1}   and   Var_θ(Y) = (1/α) κ''(θ) = (1/α) θ^{−2},

and the variance function is given by V(μ) = μ², that is, the variance function
is quadratic in the mean parameter μ. The gamma random variable is obtained by
setting X = αY ∼ Γ(α, β). This gives us for the first two moments of X

    μ_X = E_θ[X] = α/β   and   Var_θ(X) = α/β² = μ_X²/α.

Suppose v = 1: for shape parameter α > 1, we have under-dispersion ϕ = 1/α < 1
and the gamma density is unimodal; for shape parameter α < 1, we have over-
dispersion ϕ = 1/α > 1 and the gamma density is strictly decreasing, we refer to
Fig. 2.1.

Inverse Gaussian Distribution as a Single-Parameter EDF

For the inverse Gaussian distribution with parameters α, β > 0 we choose the
Lebesgue measure on R+ and we set ω = v/ϕ = α. We make the following choices

    a(y) = log( α^{1/2}/(2πy³)^{1/2} ) − α/(2y),   κ(θ) = −(−2θ)^{1/2},
    μ = κ'(θ) = 1/(−2θ)^{1/2},   θ = h(μ) = −1/(2μ²),

for θ ∈ (−∞, 0) and dual parameter space M = (0, ∞). With these choices we
have

    f(y; θ, α) dy = (α^{1/2}/(2πy³)^{1/2}) exp{ α(θy + (−2θ)^{1/2}) − α/(2y) } dy
                  = (α^{1/2}/(2πy³)^{1/2}) exp{ −(α/(2y)) (1 − (−2θ)^{1/2} y)² } dy
                  = (α/(2πx³)^{1/2}) exp{ −(α²/(2x)) (1 − ((−2θ)^{1/2}/α) x)² } dx,

where in the last step we did the change of variable y ↦ x = αy. This is exactly (2.8).
Mean and variance are given by

    μ = E_θ[Y] = κ'(θ) = (−2θ)^{−1/2}   and   Var_θ(Y) = (1/α) κ''(θ) = (1/α) (−2θ)^{−3/2},

and the variance function is given by V(μ) = μ³, that is, the variance function is
cubic in the mean parameter μ. The inverse Gaussian random variable is obtained by
setting X = αY. The mean and variance of X are given by, setting β = (−2θ)^{1/2} > 0,

    μ_X = E_θ[X] = α/β   and   Var_θ(X) = α/β³ = μ_X³/α².

This inverse Gaussian density is illustrated in Fig. 2.2.


Similarly to (2.9), we can extend the inverse Gaussian model to the boundary
case θ = 0, i.e., the effective domain  = (−∞, 0] is not open. This provides us
with density


α α2
f (y; θ = 0, α)dy = exp − dx, (2.16)
(2πx 3)1/2 2x

using, as above, the change of variable y → x = αy. An additional transformation


x → 1/x gives a gamma distribution with shape parameter 1/2 and scale parameter
α 2 /2.
Remark 2.16 The inverse Gaussian case gives an example of a non-open effective
domain  = (−∞, 0]. It is worth noting that for the boundary parameter θ = 0,
the first moment does not exist, i.e., Corollary 2.14 only makes statements in the
interior ˚ of the effective domain . This also relates to Remarks 2.9 on the dual
parameter space M.

2.2.3 Tweedie's Distributions

Tweedie's compound Poisson (CP) model was introduced in 1984 by Tweedie [358],
and it has been studied in detail in Jørgensen [202], Jørgensen–de Souza [204],
Smyth–Jørgensen [342] and in the review paper of Delong et al. [94]. Tweedie's CP
model belongs to the EDF. We spend more time on explaining Tweedie's CP model
because it plays an important role in actuarial modeling.
Tweedie's CP model is received by choosing as σ-finite measure ν₁ a mixture of
the Lebesgue measure on (0, ∞) and a point measure in 0. Furthermore, we choose
the power variance parameter p ∈ (1, 2) and cumulant function

    κ(θ) = κ_p(θ) = (1/(2 − p)) ((1 − p)θ)^{(2−p)/(1−p)},                    (2.17)

on the effective domain θ ∈ Θ = (−∞, 0). This provides us with Tweedie's CP
model

    Y ∼ f(y; θ, v/ϕ) = exp{ (yθ − κ_p(θ))/(ϕ/v) + a(y; v/ϕ) },

with exposure v > 0 and dispersion parameter ϕ > 0; the normalizing function
a(·; v/ϕ) does not have any simple closed form, we refer to Section 2.1 in
Jørgensen–de Souza [204] and Section 4.2 in Jørgensen [203].

The first two moments of Tweedie's CP random variable Y are given by

    μ = E_θ[Y] = κ_p'(θ) = ((1 − p)θ)^{1/(1−p)} ∈ M = (0, ∞),                (2.18)
    Var_θ(Y) = (ϕ/v) κ_p''(θ) = (ϕ/v) ((1 − p)θ)^{p/(1−p)} = (ϕ/v) μ^p > 0.  (2.19)

The power variance parameter p ∈ (1, 2) determines the power variance function
V(μ) = μ^p, interpolating between the Poisson case p = 1 and the gamma case
p = 2, see Sect. 2.2.2.

The moment generating function of Tweedie's CP random variable X = vY/ϕ =
ωY in its additive form is given by, using Corollary 2.14,

    M_X(r) = M_{vY/ϕ}(r) = exp{ (v/ϕ) κ_p(θ) ( ((−θ)/(−θ − r))^{(2−p)/(p−1)} − 1 ) }   for r < −θ.

Some readers will notice that this is the moment generating function of a CP
distribution having i.i.d. gamma claim sizes. This is exactly the statement of the
next proposition which is found, e.g., in Smyth–Jørgensen [342].

Proposition 2.17 Assume S = Σ_{i=1}^N Z_i is CP distributed with Poisson claim
counts N ∼ Poi(λv) and i.i.d. gamma claim sizes Z_i ∼ Γ(α, β) being independent
of N. We have S = vY/ϕ in distribution by identifying the parameters as follows

    p = (α + 2)/(α + 1) ∈ (1, 2),   β = −θ > 0   and   λ = (1/ϕ) κ_p(θ) > 0.

Proof of Proposition 2.17 Assume S is CP distributed with i.i.d. gamma claim
sizes. From Proposition 2.11 and Section 3.2.1 in Wüthrich [387] we receive that
the moment generating function of S is given by

    M_S(r) = exp{ λv ( (β/(β − r))^α − 1 ) }   for r < β.

Using the proposed parameter identification, the claim immediately follows. □


Proposition 2.17 gives us a second interpretation of Tweedie's CP model, which
was introduced in an EDF fashion above. This second interpretation explains the
name of this EDF model, it explains the mixture of the Lebesgue measure and the
point measure in 0, and it also highlights why the Poisson model and the gamma
model are the boundary cases in terms of power variance functions.
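Proposition 2.17 also gives a direct simulation recipe for Tweedie's CP model:
choose the CP parameters by the stated identification and aggregate gamma claims
over a Poisson count. The following sketch (our own illustration; the variable names
are ours) checks the first two moments (2.18)–(2.19) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(4)
p, mu, phi, v, n = 1.5, 2.0, 1.2, 1.0, 10**6

# parameter identification of Proposition 2.17
theta = -mu**(1 - p) / (p - 1)       # canonical parameter, kappa_p'(theta) = mu
alpha = (2 - p) / (p - 1)            # gamma shape of the claim sizes
beta = -theta                        # gamma rate of the claim sizes
lam = mu**(2 - p) / (phi * (2 - p))  # Poisson frequency, = kappa_p(theta) / phi

N = rng.poisson(lam * v, size=n)
# sum of N i.i.d. Gamma(alpha, beta) claims is Gamma(N * alpha, beta); zero if N = 0
S = np.where(N > 0, rng.gamma(np.maximum(N, 1) * alpha, 1.0 / beta), 0.0)
Y = (phi / v) * S                    # reproductive form Y = (phi / v) S

print(Y.mean(), mu)                  # ~ mu, cf. (2.18)
print(Y.var(), (phi / v) * mu**p)    # ~ power variance, cf. (2.19)
```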

An interesting question is whether the EDF can be extended beyond the power
variance functions V(μ) = μ^p with p ∈ [1, 2]. The answer to this question is yes,
and the full answer is provided in Theorem 2 of Jørgensen [202]:

Theorem 2.18 (Jørgensen [202], Without Proof) Only power variance parame-
ters p ∈ (0, 1) do not allow for EDF models.

Table 2.1 gives the EDF distributions that have a power variance function. These
distributions are called Tweedie's distributions, with the special case of Tweedie's
CP distributions for p ∈ (1, 2). The densities for p ∈ {0, 1, 2, 3} have a closed form,
but the other Tweedie's distributions do not have a closed-form density. Thus, they
cannot explicitly be constructed as suggested in Sect. 2.2.1. Besides the constructive
approach presented above, there is a uniqueness theorem saying that the variance
function V(·) on the domain M characterizes the single-parameter linear EF, see
Theorem 2.11 in Jørgensen [203]. This uniqueness theorem is the basis of the proof
of Theorem 2.18. Tweedie's distributions for p ∉ [0, 1] ∪ {2, 3} involve infinite sums
for the normalization exp{a(·, ·)}, we refer to formulas (4.19), (4.20) and (4.31) in
Jørgensen [203]; this is the reason that one has to go via the uniqueness theorem
to prove Theorem 2.18. Dunn–Smyth [112] provide methods for the fast calculation
of some of these infinite sums; in Sect. 5.5.2, below, we present an approximation
(saddlepoint approximation). The uniqueness theorem is also useful to construct
new examples within the EF, see, e.g., Section 2 of Awad et al. [15].

Table 2.1 Power variance function models V(μ) = μ^p within the EDF (taken from Table 4.1 in
Jørgensen [203])

  p         | Distribution                               | Support of Y | Θ        | M
  ----------|--------------------------------------------|--------------|----------|--------
  p < 0     | Generated by extreme stable distributions  | R            | [0, ∞)   | (0, ∞)
  p = 0     | Gaussian distribution                      | R            | R        | R
  p = 1     | Poisson distribution                       | N0           | R        | (0, ∞)
  1 < p < 2 | Tweedie's CP distribution                  | [0, ∞)       | (−∞, 0)  | (0, ∞)
  p = 2     | Gamma distribution                         | (0, ∞)       | (−∞, 0)  | (0, ∞)
  p > 2     | Generated by positive stable distributions | (0, ∞)       | (−∞, 0]  | (0, ∞)
  p = 3     | Inverse Gaussian distribution              | (0, ∞)       | (−∞, 0]  | (0, ∞)

2.2.4 Steepness of the Cumulant Function

Assume we have a fixed EF satisfying Assumption 2.6. All random variables T(Y)
belonging to this EF have the same support, not depending on the particular choice
of the canonical parameter θ ∈ Θ. We denote this support of T(Y) by T.
Below, we are going to estimate the canonical parameter θ ∈ Θ from data using
maximum likelihood estimation. For this it is advantageous to have the property
T ⊂ M because, intuitively, this allows us to directly select μ̂ = T(Y) as the
parameter estimate in the dual parameter space M, for a given observation T(Y) ∈
T. This then translates to a canonical parameter θ̂ = h(μ̂) = h(T(Y)) ∈ Θ, using
the canonical link h; this estimation approach will be better motivated in Chap. 3,
below. Unfortunately, many examples of the EF do not satisfy this property T ⊂ M.
For instance, in the Poisson model the observation T(Y) = Y = 0 is not included
in M, see Table 2.1. This poses some challenges in parameter estimation, and the
purpose of this small discussion is to be prepared for these challenges.
A cumulant function κ is called steep if for all θ ∈ Θ̊ and all θ̃ in the boundary
of Θ

    (θ̃ − θ)⊤ ∇_θ κ(αθ + (1 − α)θ̃) → ∞   for α ↓ 0,                         (2.20)

we refer to Formula (20) in Section 8.1 of Barndorff-Nielsen [23]. Define the convex
closure of the support T by C = conv(T).

Theorem 2.19 (Theorem 9.2 in Barndorff-Nielsen [23], Without Proof) Assume
we have a fixed EF satisfying Assumption 2.6. The cumulant function κ is steep if
and only if C̊ = M = ∇_θ κ(Θ̊).

Theorem 2.19 tells us that for a steep cumulant function we have C = M̄, the
closure of M = ∇_θ κ(Θ̊). In this case parameter estimation can be extended to
observations T(Y) ∈ M̄ such that we may obtain a degenerate model at the boundary
of M. Coming back to our Poisson example from above, in this case we set μ̂ = 0,
which gives a degenerate Poisson model.
Throughout this book we will work under the assumption that κ is steep.
The classical examples satisfy this assumption: the examples with power variance
parameter p in {0} ∪ [1, ∞) satisfy Theorem 2.19; this includes the Gaussian, the
Poisson, the gamma, the inverse Gaussian and Tweedie’s CP models, see Table 2.1.
Moreover, the examples we have met in Sect. 2.1 fulfill this assumption; these
are the single-parameter linear EF models of the Bernoulli, the binomial and the
negative binomial distributions, as well as the vector-valued parameter examples of
the Gaussian, the gamma and the inverse Gaussian models and of the categorical
distribution. The only models we have seen that do not have a steep cumulant
function are the power variance models with p < 0, see Table 2.1.
Remark 2.20 Working within the EDF needs some additional thoughts because the
support T = T_ω of the single-parameter linear EDF random variable Y = T(Y) may
depend on the specific choice of the dispersion parameter ω ∈ W ⊃ {1} through the
σ-finite measure dν_ω(ω ·), see (2.13). For instance, in the binomial case the support
of Y is given by T_ω = {0, 1/n, . . . , 1} with ω = n, see Sect. 2.2.2.
Assume that the cumulant function κ is steep for the single-parameter linear
EF that corresponds to the single-parameter EDF with ω = 1. Theorem 2.19
then implies that for this choice we have C̊_{ω=1} = ∇_θ κ(Θ̊) with convex closure
C_{ω=1} = conv(T_{ω=1}).
Consider ω ∈ W \ {1} which corresponds to the choice ν_ω of the σ-finite measure
on R. This choice belongs to the cumulant function θ ↦ ωκ(θ) in the additive form
(x-parametrization in (2.13)). Since steepness (2.20) holds for any ω > 0 we receive
that the convex closure of the support of this distribution in the x-parametrization
in (2.13) is given by the closure of ∇_θ(ωκ)(Θ̊) = ω∇_θ κ(Θ̊). The duality
transformation x ↦ y = x/ω leads to the change of measure dν_ω(x) → dν_ω(ωy)
and to the corresponding change of support, see (2.13). The latter implies that in
the reproductive form (y-parametrization) the convex closure of the support does
not depend on the specific choice of ω ∈ W. Since the EDF representation given
in (2.14) corresponds to the y-parametrization (reproductive form), we can use
Theorem 2.19 without limitation also for the single-parameter linear EDF given
by (2.14), and C does not depend on ω ∈ W.

2.2.5 Lab: Large Claims Modeling

From Corollary 2.14 we know that the moment generating function exists around
the origin for all examples belonging to the EDF. This implies that the moments of
all orders exist, and that we have an exponentially decaying survival function
P_θ[Y > y] = 1 − F(y; θ, ω) ∼ exp{−εy} for some ε > 0 as y → ∞, see (1.2). In
many applied situations the data is more heavy-tailed and, thus, cannot be modeled
by such an exponentially decaying survival function. In such cases one often chooses
a distribution function with a regularly varying survival function; regular variation
with tail index β > 0 has been introduced in (1.3). A popular choice is the log-
gamma distribution, which can be obtained from the gamma distribution (belonging
to the EDF). We briefly explain how this is done and how it relates to the Pareto and
the Lomax [256] distributions.
We start from the gamma density (2.6). The random variable Z has a log-gamma
distribution with shape parameter α > 0 and scale parameter β = −θ > 0 if
log(Z) = Y has a gamma distribution with these parameters. Thus, the gamma
density of Y = log(Z) is given by

    f(y; β, α) dy = (β^α/Γ(α)) y^{α−1} exp{−βy} dy   for y > 0.

We do the change of variable y ↦ z = exp{y} to receive the density of the log-
gamma distributed random variable Z = exp{Y}

    f(z; β, α) dz = (β^α/Γ(α)) (log z)^{α−1} z^{−(β+1)} dz   for z > 1.

This log-gamma density has support (1, ∞). The distribution function of this log-
gamma distributed random variable needs to be calculated numerically, and its
survival function is regularly varying with tail index β > 0.
A special case of the log-gamma distribution is the Pareto distribution. The Pareto
distribution is more tractable, and it is obtained by setting the shape parameter α = 1
in the log-gamma density. This gives us the Pareto density

    f(z; β) dz = f(z; β, α = 1) dz = β z^{−(β+1)} dz   for z > 1.

The distribution function in this Pareto case is for z ≥ 1 given by

    F(z; β) = 1 − z^{−β}.

Obviously, this provides a regularly varying survival function with tail index β > 0;
in fact, in this case we do not need to go over to the limit in (1.3) because we
have an exact identity. The Pareto distribution has the nice property that it is closed
under thresholding (lower-truncation) with threshold M, that is, we remain within
the family of Pareto distributions with the same tail index β by considering lower-
truncated claims: for 1 ≤ M ≤ z we have

    F(z; β, M) = P[Z ≤ z | Z > M] = P[M < Z ≤ z]/P[Z > M] = 1 − (z/M)^{−β}.

This is the classical definition of the Pareto distribution, and it allows us to preserve
full flexibility in the choice of the threshold M > 0.
The disadvantage of the Pareto distribution is that it does not provide a continuous
density on R+ as there is a discontinuity at the threshold M. For this reason, one
sometimes explores another change of variable Z ↦ X = Z − M for a Pareto
distributed random variable Z ∼ F(·; β, M). This provides the Lomax distribution,
also called the Pareto Type II distribution. X has the following distribution function
on (0, ∞)

    P[X ≤ x] = 1 − ((x + M)/M)^{−β}   for x ≥ 0.

This distribution again has a regularly varying survival function with tail index
β > 0. Moreover, we have

    lim_{x→∞} ((x + M)/M)^{−β} / (x/M)^{−β} = lim_{x→∞} (1 + M/x)^{−β} = 1.
[Fig. 2.3 Log-log plot of a Pareto and a Lomax distribution with tail index β = 2 and threshold
M = 1 000 000: logged survival functions against log(x)]

This says that we should choose the same threshold M > 0 for both the Pareto and
the Lomax distribution to receive the same asymptotic tail behavior, and this also
quantifies the rate of convergence between the two survival functions. Figure 2.3
illustrates this convergence in a log-log plot choosing tail index β = 2 and threshold
M = 1 000 000.
For completeness we provide the density of the Pareto distribution

    f(z; β, M) = (β/M) (z/M)^{−(β+1)}   for z ≥ M,

and of the Lomax distribution

    f(x; β, M) = (β/M) ((x + M)/M)^{−(β+1)}   for x ≥ 0.
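The tail comparison of Fig. 2.3 is easily reproduced numerically; the following
sketch (our own illustration) evaluates the two logged survival functions and shows
their convergence for large x.

```python
import numpy as np

beta, M = 2.0, 1_000_000.0

def pareto_survival(x):
    # survival function of the Pareto distribution with threshold M
    return (x / M)**(-beta)

def lomax_survival(x):
    # survival function of the Lomax (Pareto Type II) distribution
    return ((x + M) / M)**(-beta)

# the two logged survival functions converge for large x (cf. Fig. 2.3)
for x in M * np.array([1.5, 5.0, 50.0, 500.0]):
    print(np.log(x), np.log(pareto_survival(x)), np.log(lomax_survival(x)))
```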

2.3 Information Geometry in Exponential Families

We make a short excursion into information geometry. This excursion may look a
bit disconnected from what we have done so far, but it provides us with important
background information for the chapter on forecast evaluation, see Chap. 4, below.

2.3.1 Kullback–Leibler Divergence

There is literature in information geometry which uses techniques from differential
geometry to study EFs as Riemannian manifolds, with points corresponding to EF
densities parametrized by their canonical parameters θ ∈ Θ; we refer to Amari [10],
Ay et al. [16] and Nielsen [285] for an extended treatment of these mathematical
concepts.
Choose a fixed EF (2.2) with cumulant function κ on the effective domain
Θ ⊆ R^k and with σ-finite measure ν on R. We define the Kullback–Leibler (KL)
divergence (relative entropy) from model θ₁ ∈ Θ to model θ₀ ∈ Θ within this EF
by

    D_KL(f(·; θ₀)||f(·; θ₁)) = ∫_R f(y; θ₀) log( f(y; θ₀)/f(y; θ₁) ) dν(y) ≥ 0.

Recall that the support of the EF does not depend on the specific choice of the
canonical parameter θ in Θ, see Remarks 2.3; this implies that the KL divergence
is well-defined here. The positivity of the KL divergence is obtained from Jensen's
inequality; this is proved in Lemma 2.21, below.
The KL divergence has the interpretation of having a data model that is
characterized by the distribution f(·; θ₀), and we would like to measure how close
another model f(·; θ₁) is to the data model. Note that the KL divergence is not
a distance function because it is neither symmetric nor does it satisfy the triangle
inequality.
We calculate the KL divergence within the chosen EF

    D_KL(f(·; θ₀)||f(·; θ₁)) = ∫_R f(y; θ₀) ( (θ₀ − θ₁)⊤ T(y) − κ(θ₀) + κ(θ₁) ) dν(y)
                             = (θ₀ − θ₁)⊤ ∇_θ κ(θ₀) − κ(θ₀) + κ(θ₁) ≥ 0,     (2.21)

where we have used Corollary 2.5, and the positivity of the KL divergence can be
seen from the convexity of κ. This allows us to consider the following (Taylor)
expansion

    κ(θ₁) = κ(θ₀) + ∇_θ κ(θ₀)⊤ (θ₁ − θ₀) + D_KL(f(·; θ₀)||f(·; θ₁)).         (2.22)

This illustrates that the KL divergence corresponds to the second and higher order
differences between the cumulant value κ(θ₀) and another cumulant value κ(θ₁).
The gradients of the KL divergence w.r.t. θ₁ in θ₁ = θ₀ and w.r.t. θ₀ in θ₀ = θ₁ are
given by

    ∇_{θ₁} D_KL(f(·; θ₀)||f(·; θ₁))|_{θ₁=θ₀} = ∇_{θ₀} D_KL(f(·; θ₀)||f(·; θ₁))|_{θ₀=θ₁} = 0.   (2.23)

This emphasizes that the KL divergence reflects second and higher-order terms in
the cumulant function κ, and that the data model θ₀ forms the minimum of this KL
divergence (as a function of θ₁), as we will just see. We calculate the Hessian (second
order term) w.r.t. θ₁ in θ₁ = θ₀

    ∇²_{θ₁} D_KL(f(·; θ₀)||f(·; θ₁))|_{θ₁=θ₀} = ∇²_θ κ(θ)|_{θ=θ₀} =: I(θ₀).

The positive definite matrix I(θ₀) (in a minimal representation) is called Fisher's
information. Fisher's information is an important tool in statistics that we will
meet in Theorem 3.13 of Sect. 3.3, below. A function satisfying (2.21) (with
equality to zero if and only if θ₀ = θ₁), fulfilling (2.23) and having positive definite
Fisher's information is called a divergence, see Definition 5 in Nielsen [285].
Fisher's information I(θ₀) measures the curvature of the KL divergence in θ₀ and
we have the second order Taylor approximation

    κ(θ₁) ≈ κ(θ₀) + ∇_θ κ(θ₀)⊤ (θ₁ − θ₀) + (1/2) (θ₁ − θ₀)⊤ I(θ₀) (θ₁ − θ₀).

Next-order terms are obtained from the so-called Amari–Chentsov tensor, see
Amari [10] and Section 4.2 in Ay et al. [16]. In information geometry one studies
the (possibly degenerate) Riemannian metric on the effective domain Θ induced by
Fisher's information; we refer to Section 3.7 in Nielsen [285].
Lemma 2.21 Consider two densities p and q w.r.t. a given σ-finite measure ν. We
have D_KL(p||q) ≥ 0, and D_KL(p||q) = 0 if and only if p = q, ν-a.s.

Proof Assume Y ∼ p dν, then we can rewrite the KL divergence, using Jensen's
inequality,

    D_KL(p||q) = ∫ p(y) log( p(y)/q(y) ) dν(y) = −E_p[log(q(Y)/p(Y))]
               ≥ −log E_p[q(Y)/p(Y)] = −log ∫ q(y) dν(y) ≥ 0.                (2.24)

Equality holds if and only if p = q, ν-a.s. The last inequality of (2.24) considers
that q does not necessarily need to be a density w.r.t. ν, i.e., we can also have
∫ q(y) dν(y) < 1. □

2.3.2 Unit Deviance and Bregman Divergence

In the next chapter we are going to introduce maximum likelihood estimation of
parameters, see Definition 3.4, below. Maximum likelihood estimators are obtained
by maximizing likelihood functions (evaluated in the observations). Maximizing
likelihood functions within the EDF is equivalent to minimizing deviance loss
functions. Deviance loss functions are based on unit deviances, which, in turn,
correspond to KL divergences. The purpose of this small section is to discuss this
relation. This should be viewed as a preparation for Chap. 4.
Assume we work within a single-parameter linear EDF, i.e., T(y) = y. Using
the canonical link h we obtain the canonical parameter θ = h(μ) ∈ Θ ⊆ R
from the mean parameter μ ∈ M. If we replace the (typically unknown) mean
parameter μ by an observation Y, supposing Y ∈ M, we get the specific model
that is exactly calibrated to this observation. This provides us with the canonical
parameter estimate θ̂_Y = h(Y) for θ. We can now measure the KL divergence from
any model represented by θ to the observation-calibrated model θ̂_Y = h(Y). This
KL divergence is given by (we use (2.21) and we set ω = v/ϕ = 1)

    D_KL(f(·; θ̂_Y, 1)||f(·; θ, 1)) = ∫_R f(y; θ̂_Y, 1) log( f(y; θ̂_Y, 1)/f(y; θ, 1) ) dν(y)
                                   = (h(Y) − θ) Y − κ(h(Y)) + κ(θ) ≥ 0.

This latter object is the unit deviance (up to a factor of 2) of the chosen EDF. It
plays a crucial role in predictive modeling.

We define the unit deviance under the assumption that κ is steep as follows:

    d : C̊ × M → R+                                                          (2.25)
    (y, μ) ↦ d(y, μ) = 2 ( y h(y) − κ(h(y)) − y h(μ) + κ(h(μ)) ) ≥ 0,

where C is the convex closure of the support T of Y and M is the dual parameter
space of the chosen EDF. Steepness of κ implies C̊ = M, see Theorem 2.19.
This unit deviance d is received from the KL divergence, and it is (twice) the
difference of two log-likelihood functions, one using the canonical parameter h(y)
and the other one having any canonical parameter θ = h(μ) ∈ Θ̊. That is, for
μ = κ'(θ),

    d(y, μ) = 2 D_KL(f(·; h(y), 1)||f(·; θ, 1))                              (2.26)
            = 2 (ϕ/v) ( log f(y; h(y), v/ϕ) − log f(y; θ, v/ϕ) ),

for general ω = v/ϕ ∈ W. The latter can be rewritten as

    f(y; θ, v/ϕ) = f(y; h(y), v/ϕ) exp{ −(1/(2ϕ/v)) d(y, κ'(θ)) }.           (2.27)

This looks like a generalization of the Gaussian distribution, where the square
difference (y − μ)² in the exponent is replaced by the unit deviance d(y, μ) with
μ = κ'(θ). This interpretation gets further support by the following lemma.
Lemma 2.22 Under Assumption 2.6 and the assumption that the cumulant function
κ is steep, the unit deviance d(y, μ) ≥ 0 of the chosen EDF is zero if and only if
y = μ. Moreover, the unit deviance d(y, μ) is twice continuously differentiable
w.r.t. (y, μ) in C̊ × M, and

    ∂²d(y, μ)/∂μ²|_{y=μ} = ∂²d(y, μ)/∂y²|_{y=μ} = −∂²d(y, μ)/∂μ∂y|_{y=μ} = 2/V(μ) > 0.

Proof The positivity and the if and only if statement follow from Lemma 2.21 and
the strict convexity of κ. Continuous differentiability follows from the smoothness
of κ in the interior of Θ. Moreover, we have

    ∂²d(y, μ)/∂μ²|_{y=μ} = 2 (∂/∂μ)( −y h'(μ) + μ h'(μ) )|_{y=μ} = 2 h'(μ) = 2/κ''(h(μ)) = 2/V(μ) > 0,

where V(μ) is the variance function of the chosen EDF introduced in Corol-
lary 2.14. The remaining second derivatives are received by similar (straightfor-
ward) calculations. □

Remarks 2.23
• Lemma 2.22 shows that the unit deviance definition of d(y, μ) provides a so-
  called regular unit deviance according to Definition 1.1 in Jørgensen [203].
  Moreover, any model that can be brought into the form (2.27) for a (regular) unit
  deviance is called a (regular) reproductive dispersion model, see Definition 1.2
  of Jørgensen [203].
• In general the unit deviance d(y, μ) is not symmetric in its two arguments y and
  μ; we come back to this in Fig. 11.1, below.

More generally, the KL divergence and the unit deviance can be embedded into
the framework of Bregman loss functions [50]. We restrict to the single-parameter
EDF case. Assume that ψ : C̊ → R is a strictly convex function. The Bregman
divergence w.r.t. ψ between y and μ is defined by

    D_ψ(y, μ) = ψ(y) − ψ(μ) − ψ'(μ) (y − μ) ≥ 0,                             (2.28)

where ψ' is a (sub-)gradient of ψ. The lower bound holds because of the convexity
of ψ. Consider the specific choice ψ(μ) = μh(μ) − κ(h(μ)) for the chosen EDF.
Similarly to Lemma 2.22 we have ψ''(μ) = h'(μ) = 1/V(μ) > 0, which says that
this choice is strictly convex. Using this choice for ψ gives us the unit deviance (up
to a factor of 1/2)

    D_ψ(y, μ) = y h(y) − κ(h(y)) + κ(h(μ)) − h(μ) y = (1/2) d(y, μ).         (2.29)
Thus, the unit deviance d can be understood as a difference of log-likelihoods
(2.26), as a KL divergence D_KL and as a Bregman divergence D_ψ.

Example 2.24 (Poisson Model) We start with a single-parameter EF example.


Consider cumulant function κ(θ ) = exp{θ } for canonical parameter θ ∈  = R,
this gives us the Poisson model. For the KL divergence from model θ1 to model θ0
we receive

DKL (f (·; θ0 )||f (·; θ1 )) = exp{θ1} − exp{θ0 } − (θ1 − θ0 ) exp{θ0 } ≥ 0,

which is zero if and only if θ0 = θ1 . Fisher’s information is given by

I(θ ) = κ  (θ ) = exp{θ } > 0.

If we have observation Y > 0 we receive a model described by canonical parameter θY = h(Y) = log(Y). This gives us the unit deviance, see (2.26),

    d(Y, μ) = 2 DKL( f(·; h(Y), 1) ‖ f(·; θ, 1) )
            = 2 ( e^θ − Y − (θ − log(Y)) Y )
            = 2 ( μ − Y − Y log(μ/Y) ) ≥ 0,

with μ = κ′(θ) = exp{θ}. This Poisson unit deviance will commonly be used for model fitting and forecast evaluation, see, e.g., (5.28). □
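The three representations of the unit deviance, namely twice a KL divergence (2.26), twice a log-likelihood difference, and a Bregman divergence (2.28)–(2.29), can be verified numerically. The following Python sketch (not part of the original text; numpy/scipy are assumed) does this for the Poisson case with v/ϕ = 1.

```python
import numpy as np
from scipy.special import gammaln

def log_f_poisson(y, theta):
    # log-density of the Poisson EF member with canonical parameter theta,
    # i.e. mean mu = exp(theta); here a(y) = -log(y!)
    return y * theta - np.exp(theta) - gammaln(y + 1.0)

y, mu = 3.0, 5.0                          # observation and candidate mean
theta, theta_y = np.log(mu), np.log(y)    # canonical parameters h(mu), h(y)

# (i) twice the log-likelihood difference, see (2.26) with v/phi = 1
d_loglik = 2.0 * (log_f_poisson(y, theta_y) - log_f_poisson(y, theta))

# (ii) closed-form Poisson unit deviance from Example 2.24
d_closed = 2.0 * (mu - y - y * np.log(mu / y))

# (iii) Bregman divergence (2.28) with psi(m) = m*h(m) - kappa(h(m)) = m*log(m) - m
psi  = lambda m: m * np.log(m) - m
dpsi = lambda m: np.log(m)                # psi'(m) = h(m)
d_bregman = 2.0 * (psi(y) - psi(mu) - dpsi(mu) * (y - mu))

print(d_loglik, d_closed, d_bregman)      # all three agree
```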

Example 2.25 (Gamma Model) The second example considers a vector-valued parameter EF. We consider the cumulant function κ(θ) = log Γ(θ2) − θ2 log(−θ1) for θ = (θ1, θ2)′ ∈ Θ = (−∞, 0) × (0, ∞); this gives us the gamma model, see Sect. 2.1.3. For the KL divergence from model θ1 to model θ0 we receive


    DKL( f(·; θ0) ‖ f(·; θ1) ) = ( θ0,2 − θ1,2 ) Γ′(θ0,2)/Γ(θ0,2) − log( Γ(θ0,2)/Γ(θ1,2) )
                                 + θ1,2 log( (−θ0,1)/(−θ1,1) ) + θ0,2 ( (−θ1,1)/(−θ0,1) − 1 ) ≥ 0.

Fisher’s information matrix is given by


 θ2 1

(−θ1 )2 −θ1
I(θ ) = ∇θ2 κ(θ ) = 1  (θ2 ) (θ2 )−  (θ2 )2 .
−θ1 (θ2 )2

The off-diagonal terms in Fisher's information matrix I(θ) are non-zero, which means that the two components of the canonical parameter θ interact. Choosing a different parametrization μ = θ2/(−θ1) (dual mean parametrization) and α = θ2, we receive a diagonal Fisher's information in (μ, α)

    I(μ, α) = ( α/μ²    0
                0       ( Γ″(α)Γ(α) − Γ′(α)² )/Γ(α)² − 1/α )
            = ( α/μ²    0
                0       Ψ′(α) − 1/α ),                                              (2.30)

where Ψ is the digamma function, see Footnote 2 on page 22. This transformation is obtained by using the corresponding Jacobian matrix for the variable transformation; more details are provided in (3.16) below. In this new representation, the parameters μ and α are orthogonal; the term Ψ′(α) − 1/α is further discussed in Remarks 5.26 and 5.28, below.
Using this second parametrization based on the mean μ and the dispersion 1/α, we arrive at the EDF representation of the gamma model. This allows us to calculate the corresponding unit deviance (within the EDF), which in the gamma case is given by

    d(Y, μ) = 2 ( Y/μ − 1 + log(μ/Y) ) ≥ 0.
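The orthogonalization (2.30) can be checked numerically via the change-of-variable formula (3.16), anticipated from the next chapter. The following Python sketch (not part of the original text; numpy/scipy are assumed, and the trigamma function Ψ′ is polygamma(1, ·)) builds I(θ) from the closed form above and transforms it with the Jacobian of (μ, α) → θ(μ, α) = (−α/μ, α)′.

```python
import numpy as np
from scipy.special import polygamma

mu, alpha = 2.5, 1.7                       # mean and shape parameters
theta1, theta2 = -alpha / mu, alpha        # canonical parameters

# Fisher's information in the canonical parametrization (closed form above);
# polygamma(1, x) is the trigamma function (Gamma''Gamma - Gamma'^2)/Gamma^2
I_theta = np.array([[theta2 / theta1**2, 1.0 / (-theta1)],
                    [1.0 / (-theta1),    polygamma(1, theta2)]])

# Jacobian of (mu, alpha) -> theta(mu, alpha) = (-alpha/mu, alpha)'
J = np.array([[alpha / mu**2, -1.0 / mu],
              [0.0,            1.0]])

I_mu_alpha = J.T @ I_theta @ J             # change of variable (3.16)
target = np.diag([alpha / mu**2, polygamma(1, alpha) - 1.0 / alpha])
print(np.allclose(I_mu_alpha, target))     # True: recovers (2.30)
```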

Example 2.26 (Inverse Gaussian Model) Our final example considers the inverse Gaussian vector-valued parameter EF case. We consider the cumulant function κ(θ) = −2 (θ1 θ2)^{1/2} − (1/2) log(−2θ2) for θ = (θ1, θ2)′ ∈ Θ = (−∞, 0] × (−∞, 0), see Sect. 2.1.3. For the KL divergence from model θ1 to model θ0 we receive
    DKL( f(·; θ0) ‖ f(·; θ1) ) = −θ1,1 √( (−θ0,2)/(−θ0,1) ) − θ1,2 √( (−θ0,1)/(−θ0,2) ) − 2 √( θ1,1 θ1,2 )
                                 + ( θ0,2 − θ1,2 ) / (−2θ0,2) + (1/2) log( (−θ0,2)/(−θ1,2) ) ≥ 0.

Fisher’s information matrix is given by


⎛ ⎞
(−2θ2 )1/2
− 2(θ θ1 )1/2
I(θ ) = ∇θ2 κ(θ) = ⎝ (−2θ1 )3/2 1/2
1 2 ⎠.
− 2(θ θ1 )1/2 (−2θ 1)
(−2θ2 ) 3/2 + 2
(−2θ2 ) 2
1 2

Again the off-diagonal terms in Fisher's information matrix I(θ) are non-zero in the canonical parametrization. We switch to the mean parametrization by setting μ = ( (−2θ2)/(−2θ1) )^{1/2} and α = −2θ2. This provides us with the diagonal Fisher's information

    I(μ, α) = ( α/μ³    0
                0       1/(2α²) ).                                                  (2.31)

This transformation is again obtained by using the corresponding Jacobian matrix for the variable transformation, see (3.16), below. We compare the lower-right entries of (2.30) and (2.31). Remark that we have the first order approximation of the digamma function

    Ψ(α) ≈ log(α) − 1/(2α),

and taking derivatives says that these entries of Fisher's information are first order equivalent, i.e., Ψ′(α) − 1/α ≈ 1/(2α²); this is also used in the saddlepoint approximation in Sect. 5.5.2, below.
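This first order equivalence is easy to inspect numerically; the following short Python sketch (not part of the original text; scipy assumed) compares the two entries for growing α.

```python
from scipy.special import polygamma

for alpha in [1.0, 5.0, 25.0, 125.0]:
    exact  = polygamma(1, alpha) - 1.0 / alpha   # gamma entry in (2.30)
    approx = 1.0 / (2.0 * alpha**2)              # inverse Gaussian entry in (2.31)
    print(f"alpha={alpha:7.1f}  exact={exact:.6e}  "
          f"approx={approx:.6e}  ratio={exact/approx:.4f}")  # ratio -> 1
```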
Using this second parametrization based on the mean μ and the dispersion 1/α, we arrive at the EDF representation of the inverse Gaussian model with unit deviance

    d(Y, μ) = (Y − μ)² / (μ² Y) ≥ 0.                                                □

More examples will be given in Chap. 4, below.

Chapter 3
Estimation Theory

This chapter gives an introduction to decision and estimation theory. This introduction is based on the books of Lehmann [243, 244], the lecture notes of Künsch [229] and the book of Van der Vaart [363]. This chapter presents classical statistical estimation theory, embeds estimation into a historical context, and provides important aspects and intuition for modern data science and predictive modeling. For further reading we recommend the books of Barndorff-Nielsen [23], Berger [31], Bickel–Doksum [33] and Efron–Hastie [117].

3.1 Introduction to Decision Theory

We start from an observation vector Y = (Y1, . . . , Yn)′ taking values in a measurable space Y ⊂ Rn, where n ∈ N denotes the number of components Yi, 1 ≤ i ≤ n, in Y. Assume that this observation vector Y has been generated by a distribution belonging to the family P = {P(·; θ); θ ∈ Θ} being parametrized by a parameter set Θ.
Remarks 3.1 There are some subtle points in the notation that we are going to use. We use P(·; θ) for the distribution of the observation vector Y, and if we consider a specific component Yi of Y we will use the notation Yi ∼ F(·; θ). We make this distinction as in estimation theory one often considers i.i.d. observations Yi ∼ F(·; θ), 1 ≤ i ≤ n, with (in this case) joint product distribution Y ∼ P(·; θ). This latter distribution is then used for purposes of maximum likelihood estimation, etc. The family P is parametrized by θ ∈ Θ, and if we want to emphasize that this parameter is a k-dimensional vector we use boldface notation θ; this is similar to the EFs introduced in Chap. 2, but in this chapter we do not restrict to EFs. Finally, we assume identifiability, meaning that different parameters θ give different distributions P(·; θ) ∈ P.


To fix ideas, assume we want to determine the value γ(θ) of a given functional γ(·) on Θ. Typically, the true value θ ∈ Θ is not known, and we are not able to determine γ(θ) explicitly. Therefore, we try to estimate γ(θ) from data Y ∼ P(·; θ) that belongs to the same θ ∈ Θ. As an example we may think of working in the EDF of Chap. 2, and we are interested in the mean μ = Eθ[Y] = κ′(θ) of Y. Thus, we aim at determining γ(θ) = κ′(θ). If the true θ is unknown, and if we have an observation Y from this model, we can try to estimate γ(θ) = κ′(θ) from Y. This motivation is based on estimation of γ(θ), but the following framework of decision making is more general; for instance, it may also be used for statistical hypothesis testing.
Denote the action space of possible decisions (actions) by A. In decision theory
we are looking for a decision rule (action rule)

A : Y → A, Y → A(Y ), (3.1)

which should be understood as an educated guess for γ (θ ) based on observation Y .


A decision rule is evaluated in terms of a (given) loss function

    L : Θ × A → R+, (θ, a) → L(θ, a) ≥ 0.    (3.2)

L(θ, a) describes the loss of an action a ∈ A w.r.t. a true parameter choice θ ∈ Θ. The risk function of decision rule A for data generated by Y ∼ P(·; θ) is defined by

    θ → R(θ, A) = Eθ[ L(θ, A(Y)) ] = ∫_Y L(θ, A(y)) dP(y; θ),    (3.3)

where Eθ is the expectation w.r.t. the probability distribution P (·; θ ). Risk func-
tion (3.3) describes the long-term average loss of using decision rule A. As an
example we may think of estimating γ (θ ) for unknown (true) parameter θ by a
decision rule Y → A(Y ). Then, the loss function L(θ, A(Y )) should describe the
estimation loss if we consider the discrepancy between γ (θ ) and its estimate A(Y ),
and the risk function R(θ, A) is the average estimation loss in that case.
Good decision rules A should provide a small risk R(θ, A). Unfortunately, this
statement is of rather theoretical nature because, in general, the true data generating
parameter θ is not known and the goodness of a decision rule for the true parameter
cannot be evaluated explicitly, but the risk can only be estimated (for instance, using
a bootstrap approach). Moreover, typically, there does not exist a uniformly best decision rule A over all θ ∈ Θ. For these reasons we may (just) try to eliminate decision rules that are obviously not good. We give two introductory examples.
Example 3.2 (Minimax Decision Rule) Decision rule A is called minimax if for all alternative decision rules Ã : Y → A we have

    sup_{θ∈Θ} R(θ, A) ≤ sup_{θ∈Θ} R(θ, Ã).

A minimax decision rule is the best choice in the worst case of the true θ , i.e., it
minimizes the worst case risk. 

Example 3.3 (Bayesian Decision Rule) Assume we are given a distribution π on Θ. Decision rule A is called Bayesian w.r.t. π if it satisfies

    A = argmin_{Ã} ∫_Θ R(θ, Ã) dπ(θ).

Distribution π is called the prior distribution on Θ. □

The above examples give two possible choices of decision rules. The first one
tries to minimize the worst case risk, whereas the second one uses additional knowl-
edge in terms of a prior distribution π on . This means that we impose stronger
assumptions in the second case to get stronger conclusions. The difficult part in
practice is to justify these stronger assumptions in order to validate the stronger
conclusions. Below, we are going to introduce other criteria that should be satisfied
by good decision rules, an important one in estimation will be unbiasedness.

3.2 Parameter Estimation

This section focuses on estimating the (unknown) parameter θ ∈ Θ from observation Y ∼ P(·; θ). For this we consider decision rules A : Y → A = Θ with A(Y) estimating θ. We assume there exist densities p(·; θ) w.r.t. a fixed σ-finite measure ν on Y ⊂ Rn,

    dP(y; θ) = p(y; θ) dν(y),

for all distributions P(·; θ) ∈ P, i.e., all θ ∈ Θ.

Definition 3.4 (Maximum Likelihood Estimator, MLE) The maximum likelihood estimator (MLE) of θ for a given observation Y ∈ Y is given by (subject to existence and uniqueness)

    θ̂^MLE = argmax_{θ̃∈Θ} p(Y; θ̃) = argmax_{θ̃∈Θ} ℓ_Y(θ̃),

where the log-likelihood function of p(Y; θ) is defined by θ → ℓ_Y(θ) = log p(Y; θ).

The MLE Y → θ̂^MLE = θ̂^MLE(Y) = A(Y) is nothing else than a specific decision rule with action space A = Θ for estimating θ. We can now start to explore the risk function R(θ, θ̂^MLE) of that decision rule for a given loss function L.
Example 3.5 (MLE within the EDF) We emphasize that this example is used throughout these notes. Assume that the (independent) components of Y = (Y1, . . . , Yn)′ ∼ P(·; θ) follow a given EDF distribution. That is, we assume that Y1, . . . , Yn are independent and have densities w.r.t. σ-finite measures on R given by, see (2.14),

    Yi ∼ f(yi; θ, vi/ϕ) = exp{ ( yi θ − κ(θ) ) / (ϕ/vi) + a(yi; vi/ϕ) },

for 1 ≤ i ≤ n. Note that these random variables are not i.i.d. because they may differ in the exposures vi > 0. Throughout, we assume that Assumption 2.6 is fulfilled and that the cumulant function κ is steep, see Theorem 2.19. For the latter we also refer to Remark 2.20: the supports Tvi/ϕ of Yi may differ; however, these supports share the same convex closure.
Independence between the Yi's implies that the joint probability P(·; θ) is the product distribution of the individual distributions F(·; θ, vi/ϕ), 1 ≤ i ≤ n. Therefore, the MLE of θ in the EDF is found by solving

    θ̂^MLE = argmax_{θ̃∈Θ} ℓ_Y(θ̃) = argmax_{θ̃∈Θ} Σ_{i=1}^n ( Yi θ̃ − κ(θ̃) ) / (ϕ/vi).

Since the cumulant function κ is strictly convex we receive the MLE (subject to existence)

    θ̂^MLE = θ̂^MLE(Y) = (κ′)^{-1}( Σ_{i=1}^n vi Yi / Σ_{i=1}^n vi ) = h( Σ_{i=1}^n vi Yi / Σ_{i=1}^n vi ).

Thus, the MLE is received by applying the canonical link h = (κ′)^{-1}, see Definition 2.8, and strict convexity of κ implies that the MLE is unique. However, existence needs to be analyzed more carefully! It may happen that the MLE θ̂^MLE is a boundary point of the effective domain Θ, and this boundary point may not exist (if Θ is open). We give an example. Assume we work in the Poisson model presented in Sect. 2.1.2. The canonical link in the Poisson model is the log-link μ → h(μ) = log(μ), for μ > 0. With positive probability we have in the Poisson case Σ_{i=1}^n vi Yi = 0.

Therefore, with positive probability the MLE θ̂^MLE does not exist (we have a degenerate Poisson model in that case).
Since the canonical link is strictly increasing we can also perform MLE in the dual (mean) parametrization. The dual parameter space is given by M = κ′(Θ̊), see Remarks 2.9, with mean parameters μ = κ′(θ) ∈ M. This motivates

    μ̂^MLE = argmax_{μ̃∈M} ℓ_Y(h(μ̃)) = argmax_{μ̃∈M} Σ_{i=1}^n ( Yi h(μ̃) − κ(h(μ̃)) ) / (ϕ/vi).    (3.4)

Subject to existence, this provides the unique MLE

    μ̂^MLE = μ̂^MLE(Y) = Σ_{i=1}^n vi Yi / Σ_{i=1}^n vi.    (3.5)

Also this dual MLE does not need to exist (in the dual parameter space M). Under the assumption that the cumulant function κ is steep, we know that the closure of the dual parameter space M contains the supports Tvi/ϕ of Yi, see Theorem 2.19 and Remark 2.20. Thus, in that case we can close the dual parameter space and receive the MLE μ̂^MLE ∈ M̄ (in a possibly degenerate model). In the aforementioned degenerate Poisson situation we receive μ̂^MLE = 0, which is in the boundary ∂M of the dual parameter space. □
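A minimal sketch of the dual MLE (3.5) in the Poisson case (not part of the original text; numpy assumed): the MLE is the exposure-weighted average of the observations, and if all claim counts are zero it lands on the boundary μ̂^MLE = 0, where the canonical parameter θ̂^MLE = log μ̂^MLE does not exist.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
lam, phi = 0.1, 1.0                      # true frequency; dispersion phi = 1
v = rng.uniform(0.5, 1.5, size=20)       # exposures v_i
N = rng.poisson(v * lam)                 # claim counts N_i = v_i * Y_i
Y = N / v                                # EDF observations Y_i

mu_mle = np.sum(v * Y) / np.sum(v)       # dual MLE (3.5), always in closure of M
print("mu_mle =", mu_mle)

if mu_mle > 0.0:
    theta_mle = np.log(mu_mle)           # canonical link h(mu) = log(mu)
    print("theta_mle =", theta_mle)
else:
    print("degenerate case: theta_mle does not exist (boundary of M)")
```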

Definition 3.6 (Bayesian Estimator) The Bayesian estimator of θ for a given observation Y ∈ Y and a given prior distribution π on Θ is given by (subject to existence)

    θ̂^Bayes = θ̂^Bayes(Y) = Eπ[ θ | Y ],

where the conditional expectation on the right-hand side is calculated under the posterior distribution π(θ|y) ∝ p(y; θ) π(θ) for a given observation Y = y.

Example 3.7 (Bayesian Estimator) Assume that A = Θ = R and choose the square loss function L(θ, a) = (θ − a)². Assume that for ν-a.e. y ∈ Y the following decision rule A : Y → A exists:

    A(y) = argmin_{a∈A} Eπ[ (θ − a)² | Y = y ],    (3.6)

where the expectation is calculated w.r.t. the posterior distribution π(θ|y). In this case, A is a Bayesian decision rule w.r.t. π and L(θ, a) = (θ − a)²: by assumption (3.6) we have for any other decision rule Ã : Y → A, ν-a.s.,

    Eπ[ (θ − A(Y))² | Y = y ] ≤ Eπ[ (θ − Ã(Y))² | Y = y ].

Applying the tower property we receive for any other decision rule Ã

    ∫_Θ R(θ, A) dπ(θ) = E[ (θ − A(Y))² ] ≤ E[ (θ − Ã(Y))² ] = ∫_Θ R(θ, Ã) dπ(θ),

where the expectation E is calculated over the joint distribution of Y and θ . This
proves that A is a Bayesian decision rule w.r.t. π and L(θ, a) = (θ − a)2 , see
Example 3.3. Finally, note that the conditional expectation given in Definition 3.6 is
the minimizer of (3.6). This justifies the name Bayesian estimator in Definition 3.6
(for the square loss function). The case of the Bayesian estimator for a general loss
function L is considered in Theorem 4.1.1 of Lehmann [244]. 
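A small numerical illustration of the Bayesian estimator under the square loss (a sketch, not part of the original text; the conjugate Gaussian–Gaussian setup and numpy are assumptions made here): the posterior mean is a credibility-weighted average of the observation and the prior mean, and a grid search confirms that it minimizes the posterior square loss (3.6).

```python
import numpy as np

# Gaussian-Gaussian model: Y | theta ~ N(theta, sigma2), theta ~ N(m, tau2)
m, tau2, sigma2 = 0.0, 4.0, 1.0
y = 2.5                                   # observed Y = y

# the posterior of theta | Y=y is Gaussian; its mean is the Bayesian estimator
w = tau2 / (tau2 + sigma2)                # credibility weight
theta_bayes = w * y + (1.0 - w) * m
print("posterior mean:", theta_bayes)

# check (3.6): the posterior mean minimizes a -> E_pi[(theta - a)^2 | Y=y]
post_var = tau2 * sigma2 / (tau2 + sigma2)
rng = np.random.default_rng(seed=0)
theta_sample = rng.normal(theta_bayes, np.sqrt(post_var), size=100_000)
grid = np.linspace(0.0, 3.0, 301)
risks = [np.mean((theta_sample - a) ** 2) for a in grid]
print("argmin over grid:", grid[int(np.argmin(risks))])  # approx theta_bayes
```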

Definition 3.8 (Method of Moments Estimator) Assume that Θ ⊆ Rk and that the components Yi of Y are i.i.d. F(·; θ) distributed with finite k-th moments for all θ ∈ Θ. The law of large numbers provides, a.s., for all 1 ≤ l ≤ k,

    lim_{n→∞} (1/n) Σ_{i=1}^n Yi^l = Eθ[ Y1^l ].

Assume that the following map is invertible (on suitable range definitions for (3.7)–(3.8))

    γ : Θ → Rk, θ → γ(θ) = ( Eθ[Y1], . . . , Eθ[Y1^k] )′.    (3.7)

The method of moments estimator of θ is defined by

    θ̂^MM = θ̂^MM(Y) = γ^{-1}( (1/n) Σ_{i=1}^n Yi, . . . , (1/n) Σ_{i=1}^n Yi^k ).    (3.8)
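As an illustration of (3.7)–(3.8) (a sketch, not part of the original text; numpy assumed): for i.i.d. gamma observations with shape α and rate β we have Eθ[Y1] = α/β and Eθ[Y1²] = α(α + 1)/β², and inverting the map γ yields closed-form moment estimators.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
alpha, beta = 2.0, 0.5                    # true shape and rate
Y = rng.gamma(shape=alpha, scale=1.0 / beta, size=10_000)

m1, m2 = np.mean(Y), np.mean(Y**2)        # empirical first two moments
var_hat = m2 - m1**2                      # equals alpha / beta^2 in the model

beta_mm  = m1 / var_hat                   # inverting the map gamma in (3.7)
alpha_mm = m1**2 / var_hat
print(alpha_mm, beta_mm)                  # close to (2.0, 0.5)
```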

The MLE, the Bayesian estimator and the method of moments estimator are the
most commonly used parameter estimators. They may have additional properties
(under certain assumptions) that we are going to explore below. In the remainder of
this section we give an additional view on estimators which is based on the empirical
distribution of the observation Y .

Assume that the components Yi of Y are real-valued and i.i.d. F distributed. The empirical distribution induced by the observation Y = (Y1, . . . , Yn)′ is given by

    F̂n(y) = (1/n) Σ_{i=1}^n 1{Yi ≤ y},  for y ∈ R;    (3.9)

we also refer to Fig. 1.2 (lhs). The Glivenko–Cantelli theorem [64, 159] tells us that the empirical distribution F̂n converges uniformly to F, a.s., for n → ∞.

Definition 3.9 (Fisher-Consistency) Denote by P the set of all distribution functions on the given probability space. Let Q : P → Θ be a functional with the property

    Q(F(·; θ)) = θ  for all F(·; θ) ∈ F = {F(·; θ); θ ∈ Θ} ⊂ P.

Such a functional is called Fisher-consistent for F and θ ∈ Θ, respectively.

A given Fisher-consistent functional Q motivates the estimator θ̂ = Q(F̂n) ∈ Θ. This is exactly what we have applied for the method of moments estimator (3.8) with the Fisher-consistent functional induced by the inverse of (3.7). The next example shows that this also works for the MLE.
Example 3.10 (MLE and Kullback–Leibler (KL) Divergence) The MLE can be received from a Fisher-consistent functional. Consider for F ∈ P the functional

    Q(F) = argmax_{θ̃} ∫ log f(y; θ̃) dF(y),

assuming that f(·; θ̃) are densities w.r.t. a σ-finite measure on R. Assume that F has density f w.r.t. the σ-finite measure ν on R. Then, we can rewrite the above as

    Q(F) = argmin_{θ̃} ∫ log( f(y)/f(y; θ̃) ) f(y) dν(y) = argmin_{θ̃} DKL( f ‖ f(·; θ̃) ).

The latter is the Kullback–Leibler (KL) divergence which we have met in Sect. 2.3. Lemma 2.21 states that the KL divergence is non-negative, and it is zero if and only if the two densities f and f(·; θ̃) are identical, ν-a.s. This implies that Q(F(·; θ)) = θ. Thus, Q is Fisher-consistent for θ ∈ Θ, assuming identifiability, see Remarks 3.1. Next, we use this Fisher-consistent functional (KL divergence) to receive the MLE. Replace the unknown distribution F by the empirical one to receive

    Q(F̂n) = argmin_{θ̃} DKL( f̂n ‖ f(·; θ̃) ) = argmax_{θ̃} (1/n) Σ_{i=1}^n log f(Yi; θ̃) = θ̂^MLE,

where we have used that the empirical density f̂n allocates point masses of size 1/n to the i.i.d. observations Y1, . . . , Yn. Thus, the MLE θ̂^MLE of θ can be obtained by choosing the model f(·; θ̃), θ̃ ∈ Θ, that is closest in KL divergence to the empirical distribution F̂n of the i.i.d. observations Yi ∼ F. Note that in this construction we do not assume that the true distribution F is in F, see Definition 3.9. □

Remarks 3.11
• Many properties of estimators of θ are based on properties of Fisher-consistent functionals Q (in cases where they exist). For instance, asymptotic properties as n → ∞ are obtained from smoothness properties of Fisher-consistent functionals Q, or using the influence function we can analyze the impact of individual observations Yi on decision rules θ̂ = θ̂(Y) = Q(F̂n). The latter is the basis of robust statistics, see Huber [194] and Hampel et al. [180]. Since Fisher-consistent functionals do not require that the true distribution belongs to F, a careful consideration of the quantity to be estimated is required.
• The discussion on parameter estimation has implicitly assumed that the true data
generating model belongs to the family P = {P (·; θ ); θ ∈ }, and the only
problem was to find the true parameter in . More generally, one should also
consider model uncertainty w.r.t. the chosen family P, i.e., the data generating
model may not belong to this family. Of course, this problem is by far more
difficult. We explore this in more detail in Sect. 11.1.4, below.

3.3 Unbiased Estimators

We introduce the property of uniformly minimum variance unbiased (UMVU) for


decision rules in this section. This is a very attractive property in insurance pricing
because it gives a quality statement to decision rules (and to the resulting prices). At
the current stage it is not clear how unbiasedness is related, e.g., to the MLE of θ .

3.3.1 Cramér–Rao Information Bound

Above we have stated some quality criteria for decision rules like the minimax
property. A crucial property in financial applications is the so-called unbiasedness
(for mean estimates) because this guarantees that the overall (price) levels are
correctly specified.

Definition 3.12 (Uniformly Minimum Variance Unbiased, UMVU) A decision rule A : Y → A = R is unbiased for γ : Θ → R if for all Y ∼ P(·; θ), θ ∈ Θ, we have

    Eθ[ A(Y) ] = γ(θ).    (3.10)

The decision rule A is called UMVU for γ if, additionally to the unbiasedness (3.10), we have

    Varθ( A(Y) ) ≤ Varθ( Ã(Y) ),

for all θ ∈ Θ and for any other decision rule Ã : Y → R that is unbiased for γ.

Note that unbiasedness is not invariant under transformations, i.e., if A(Y ) is


unbiased for γ (θ ), then, in general, b(A(Y )) is not unbiased for b(γ (θ )). For
instance, if b is strictly convex then we get a counterexample by simply applying
Jensen’s inequality.
Our first step is to derive a general lower bound for Varθ (A(Y )). If this general
lower bound is met for an unbiased decision rule A for γ , then we know that it
is UMVU for γ . We start with the one-dimensional case given in Section 2.6 of
Lehmann [244].

Theorem 3.13 (Cramér–Rao Information Bound) Assume that the distributions P(·; θ), θ ∈ Θ, have densities p(·; θ) for a given σ-finite measure ν on Y, and that Θ ⊂ R is an open interval such that the set {y; p(y; θ) > 0} does not depend on θ ∈ Θ. Let A(Y) be unbiased for γ : Θ → R having finite second moment. If the limit

    ∂/∂θ log p(y; θ) = lim_{ε→0} (1/ε) ( p(y; θ + ε) − p(y; θ) ) / p(y; θ)

exists in L²(P(·; θ)) and if

    I(θ) = Eθ[ ( ∂/∂θ log p(Y; θ) )² ] ∈ (0, ∞),

then the function θ → γ(θ) is differentiable, Eθ[ ∂/∂θ log p(Y; θ) ] = 0, and we have the information bound

    Varθ( A(Y) ) ≥ γ′(θ)² / I(θ).

Proof We start from an arbitrary function ψ : Θ × Y → R with finite variance Varθ(ψ(θ, Y)) ∈ (0, ∞) for all θ ∈ Θ. The Cauchy–Schwarz inequality implies

    Varθ( A(Y) ) ≥ Covθ( A(Y), ψ(θ, Y) )² / Varθ( ψ(θ, Y) ).    (3.11)

If we manage to make the right-hand side of (3.11) independent of the decision rule A(·) we have a general lower bound; we also refer to Theorem 2.6.1 in Lehmann [244].
The Cauchy–Schwarz inequality implies that for any U ∈ L²(P(·; θ)) the following limit exists and is equal to

    lim_{ε→0} Eθ[ (1/ε) ( p(Y; θ + ε) − p(Y; θ) ) / p(Y; θ) · U ] = Eθ[ ∂/∂θ log p(Y; θ) · U ].    (3.12)

Setting U ≡ 1 gives the average score Eθ[ ∂/∂θ log p(Y; θ) ] = 0 because for sufficiently small ε

    Eθ[ ( p(Y; θ + ε) − p(Y; θ) ) / p(Y; θ) ] = ∫_Y ( p(y; θ + ε) − p(y; θ) ) dν(y) = 0,

where we have used that the support of the random variables does not depend on θ and that the domain Θ of θ is open.
Secondly, we set U = A(Y) in (3.12). We have, similarly to above, using unbiasedness w.r.t. γ,

    Covθ( A(Y), ( p(Y; θ + ε) − p(Y; θ) ) / p(Y; θ) ) = ∫_Y A(y) ( p(y; θ + ε) − p(y; θ) ) dν(y) = γ(θ + ε) − γ(θ).

Existence of the limit (3.12) provides the differentiability of γ. Finally, from (3.11) we have

    Varθ( A(Y) ) ≥ lim_{ε→0} Covθ( A(Y), (p(Y; θ+ε) − p(Y; θ))/p(Y; θ) )² / Varθ( (p(Y; θ+ε) − p(Y; θ))/p(Y; θ) ) = γ′(θ)² / I(θ).    (3.13)

This completes the proof. □


Remarks 3.14 (Fisher’s Information and Score)


• I(θ ) is called Fisher’s information or Fisher metric.
• s(θ, Y ) = ∂θ∂
log p(Y ; θ ) is called score, and Eθ [s(Y ; θ )] = 0 in Theorem 3.13
expresses that the average score is zero under the assumptions of that theorem.
• Under the regularity conditions of Lemma 6.1 in Section 2.6 of Lehmann [244]

    I(θ) = Eθ[ ( ∂/∂θ log p(Y; θ) )² ] = −Eθ[ ∂²/∂θ² log p(Y; θ) ].    (3.14)

Fisher’s information I(θ ) expresses the variance of the score s(θ, Y ). Iden-
tity (3.14) justifies the notion Fisher’s information in Sect. 2.3 for the EF.
• In order to determine the Cramér–Rao information bound for unknown θ we
need to estimate Fisher’s information I(θ ) from the available data. There are
two different ways to do so, either we choose
 2 

I(
θ ) = E
θ log p(Y ; θ ) ,
∂θ

or we choose the observed Fisher’s information


 2 
∂ 
I(
θ) = log p(Y ; θ )  ,
∂θ 
θ=
θ

for given data Y and where θ = θ (Y ). Both estimated Fisher’s information I(


θ)

and I(θ ) play a central role in MLE of generalized linear models (GLMs). They
are used in Fisher’s scoring method, the iterated re-weighted least squares (IRLS)
algorithm and the Newton–Raphson algorithm to determine the MLE.
• The Cramér–Rao information bound in Theorem 3.13 is stated in terms of the
observation Y ∼ p(·; θ ). Assume that the components Yi of Y are i.i.d. f (·; θ )
distributed. In this case, Fisher’s information scales as

I(θ ) = In (θ ) = nI1 (θ ), (3.15)

with single risk’s Fisher’s information (contribution)


 2 

I1 (θ ) = Eθ log f (Y1 ; θ ) .
∂θ

In general, Fisher’s information is additive in independent random variables,


because the product of densities is additive after applying the logarithm, and
because the average score is zero.

Proposition 3.15 The unbiased decision rule A for γ attains the Cramér–Rao information bound if and only if the density is of the form p(y; θ) = exp{ δ(θ) T(y) − β(θ) + a(y) } with T = A. In that case we have γ(θ) = β′(θ)/δ′(θ).

Proof of Proposition 3.15 The Cauchy–Schwarz inequality provides equality in (3.13) if and only if ∂/∂θ log p(y; θ) = δ′(θ) A(y) − β′(θ), ν-a.s., for some functions δ′(θ) and β′(θ) on Θ. Integration and the fact that p(·; θ) is a density whose support does not depend on the explicit choice of θ ∈ Θ provide the implication "⇒". For the implication "⇐" we study for A = T

    0 = Eθ[ ∂/∂θ log p(Y; θ) ] = ∫_Y ( δ′(θ) A(y) − β′(θ) ) p(y; θ) dν(y) = δ′(θ) Eθ[A(Y)] − β′(θ).

In that case we have γ(θ) = Eθ[A(Y)] = β′(θ)/δ′(θ). Moreover, we have equality in the Cauchy–Schwarz inequality. This finishes the proof. □

The single-parameter EF fulfills the properties of Proposition 3.15 with δ(θ) = θ and β(θ) = κ(θ), and the decision rule A(y) = T(y) attains the Cramér–Rao information bound for γ(θ) = κ′(θ).
We give a multi-dimensional version of the Cramér–Rao information bound.

Theorem 3.16 (Multi-Dimensional Version of the Cramér–Rao Information Bound, Without Proof) Assume that the distributions P(·; θ), θ ∈ Θ, have densities p(·; θ) for a given σ-finite measure ν on Y, and that Θ ⊆ Rk is an open convex set such that the set {y; p(y; θ) > 0} does not depend on θ ∈ Θ. Let A(Y) be unbiased for γ : Θ → R having finite second moment. Under additional regularity conditions, see Theorem 7.3 in Section 2.7 of Lehmann [244], we have

    Varθ( A(Y) ) ≥ (∇θ γ(θ))′ I(θ)^{-1} ∇θ γ(θ),

with (positive definite) Fisher's information matrix I(θ) = ( Il,j(θ) )_{1≤l,j≤k} given by

    Il,j(θ) = Eθ[ ( ∂/∂θl log p(Y; θ) ) ( ∂/∂θj log p(Y; θ) ) ],

for 1 ≤ l, j ≤ k.

Remarks 3.17
• Whenever an unbiased decision rule A(Y) for γ(θ) meets the Cramér–Rao information bound it is UMVU. Thus, it minimizes the risk function R(θ, A) being based on the square loss L(θ, a) = (γ(θ) − a)² among all unbiased decision rules, because unbiasedness for γ(θ) gives R(θ, A) = Varθ(A(Y)).
• The regularity conditions in Theorem 3.16 include that Fisher's information matrix I(θ) is positive definite.
• Under additional regularity conditions we have the following identity for Fisher's information matrix

    I(θ) = Eθ[ (∇θ log p(Y; θ)) (∇θ log p(Y; θ))′ ] = −Eθ[ ∇²θ log p(Y; θ) ] ∈ Rk×k.

Thus, Fisher's information matrix can either be calculated from a quadratic form of the score s(θ, Y) = ∇θ log p(Y; θ) or from the Hessian ∇²θ of the log-likelihood ℓY(θ) = log p(Y; θ). Since the score has mean zero, Fisher's information matrix is equal to the covariance matrix of the score s(θ, Y).
In many situations we do not work under the canonical parametrization θ. Considerations then require a change of variable. Assume that

    ζ ∈ Rr → θ = θ(ζ) ∈ Rk,

such that all derivatives ∂θl(ζ)/∂ζj exist for 1 ≤ l ≤ k and 1 ≤ j ≤ r. The Jacobian matrix is given by

    J(ζ) = ( ∂θl(ζ)/∂ζj )_{1≤l≤k, 1≤j≤r} ∈ Rk×r.

Fisher's information matrix w.r.t. ζ is given by

    I*(ζ) = ( Eθ(ζ)[ ( ∂/∂ζl log p(Y; θ(ζ)) ) ( ∂/∂ζj log p(Y; θ(ζ)) ) ] )_{1≤l,j≤r} ∈ Rr×r,

and we have the identity

    I*(ζ) = J(ζ)′ I(θ(ζ)) J(ζ).    (3.16)

This formula is used quite frequently, e.g., in generalized linear models when
changing the parametrization of the models.

3.3.2 Information Bound in the Exponential Family Case

The purpose of this section is to summarize the Cramér–Rao information bound


results for the EF and the EDF, since these families play a distinguished role in
statistical and actuarial modeling.

Cramér–Rao Information Bound in the EF Case

We start with the EF case. Assume we have i.i.d. observations Y1, . . . , Yn having densities w.r.t. a σ-finite measure ν on R given by the EF, see (2.2),

    dF(y; θ) = f(y; θ) dν(y) = exp{ θ′ T(y) − κ(θ) + a(y) } dν(y),

for canonical parameter θ ∈ Θ ⊆ Rk. We assume to work under a minimal representation implying that the cumulant function κ is strictly convex on the interior Θ̊, see Assumption 2.6. Moreover, we assume that the cumulant function κ is steep in the sense of Theorem 2.19. Consider the (aggregated) statistics of the joint EF P = {P(·; θ); θ ∈ Θ}

    y → S(y) := ( Σ_{i=1}^n T1(yi), . . . , Σ_{i=1}^n Tk(yi) )′ ∈ Rk.    (3.17)

We calculate the score of this EF

    s(θ, Y) = ∇θ log p(Y; θ) = ∇θ ( θ′ Σ_{i=1}^n T(Yi) − n κ(θ) ) = S(Y) − n ∇θ κ(θ).

An immediate consequence of Corollary 2.5 is that the expected value of the score is zero for any θ ∈ Θ̊. This then reads as

    μ = Eθ[ T(Y1) ] = Eθ[ S(Y)/n ] = ∇θ κ(θ) ∈ Rk.    (3.18)

Thus, the statistics S(Y)/n is an unbiased decision rule for the mean μ = ∇θκ(θ), and we can study its Cramér–Rao information bound. Fisher's information matrix is given by the positive definite matrix

    I(θ) = In(θ) = Eθ[ s(θ, Y) s(θ, Y)′ ] = −Eθ[ ∇²θ log p(Y; θ) ] = n ∇²θ κ(θ) ∈ Rk×k.

Note that the multi-dimensionally extended Cramér–Rao information bound in Theorem 3.16 applies to the individual components of the vector μ = ∇θκ(θ) ∈ Rk. Assume we would like to estimate its j-th component; set γj(θ) = μj =

(∇θκ(θ))j = ∂κ(θ)/∂θj, for 1 ≤ j ≤ k. This corresponds to the j-th component Sj(Y) of the statistics S(Y). We have unbiasedness of Sj(Y)/n for γj(θ) = μj = (∇θκ(θ))j, and this unbiased statistics attains the Cramér–Rao information bound

    Varθ( Sj(Y)/n ) = (1/n) ( ∇²θ κ(θ) )_{j,j} = (∇θ γj(θ))′ I(θ)^{-1} (∇θ γj(θ)).    (3.19)

Recall that I(θ)^{-1} scales as n^{-1}, see (3.15). This provides us with the following corollary.

Corollary 3.18 Assume Y1, . . . , Yn are i.i.d. and follow an EF (under a minimal representation). The components of the statistics S(Y)/n are UMVU for γj(θ) = ∂κ(θ)/∂θj, 1 ≤ j ≤ k and θ ∈ Θ̊, with

    Varθ( Sj(Y)/n ) = (1/n) ∂²κ(θ)/∂θj².

The corresponding covariance terms are for 1 ≤ j, l ≤ k given by

    Covθ( Sj(Y)/n, Sl(Y)/n ) = (1/n) ∂²κ(θ)/(∂θj ∂θl).

The UMVU property stated in Corollary 3.18 is, in general, not related to MLE, but within the EF there is the following link. We have (subject to existence)

    θ̂^MLE = argmax_{θ̃∈Θ} p(Y; θ̃) = argmax_{θ̃∈Θ} ( θ̃′ S(Y) − n κ(θ̃) ) = h( S(Y)/n ),    (3.20)

where h = (∇θκ)^{-1} is the canonical link of this EF, see Definition 2.8; and where we need to ensure that a solution to (3.20) exists; e.g., the solution to (3.20) might be at the boundary of Θ which may cause problems, see Example 3.5.¹ Because the cumulant function κ is strictly convex (in a minimal representation), we receive the

¹ Another example where there does not exist a proper solution to the MLE problem (3.20) is, for instance, obtained within the 2-dimensional Gaussian EF if we have only one single observation Y1. Intuitively this is clear because we cannot estimate two parameters from one observation T(Y1) = (Y1, Y1²)′.

MLE for the mean parameter μ = Eθ[T(Y1)]

    μ̂^MLE = argmax_{μ̃∈M} ( h(μ̃)′ S(Y) − n κ(h(μ̃)) ) = S(Y)/n;

the dual parameter space M = ∇θκ(Θ̊) ⊆ Rk has been introduced in Remarks 2.9. If S(Y)/n is contained in M, then this MLE is a proper solution; otherwise, because we have assumed that the cumulant function κ is steep, the MLE exists in the closure M̄, see Theorem 2.19, and it is UMVU for μ, see Corollary 3.18.

Corollary 3.19 (Balance Property) Assume Y1, . . . , Yn are i.i.d. and follow an EF with θ ∈ Θ̊ and T(Yi) ∈ M̄, a.s. The MLE μ̂^MLE ∈ M̄ is UMVU for μ, and it fulfills the balance property on portfolio level, i.e.,

    Σ_{i=1}^n E_{μ̂^MLE}[ T(Yi) ] = n μ̂^MLE = S(Y).

Remarks 3.20
• The balance property is a very important property in insurance pricing because it implies that the portfolio is priced on the right level: we have unbiasedness

    Eθ[ Σ_{i=1}^n E_{μ̂^MLE}[ T(Yi) ] ] = Eθ[ S(Y) ] = n μ.    (3.21)

• We emphasize that the balance property is much stronger than unbiasedness (3.21); note that the balance property provides unbiasedness even if Y follows a completely different model, i.e., even if the chosen EF P is completely misspecified.
• In general, the MLE θ̂^MLE is not unbiased for θ. E.g., if the canonical link h = (∇θκ)^{-1} is strictly concave, we have from Jensen's inequality, subject to existence at the boundary of Θ,

    Eθ[ θ̂^MLE ] = Eθ[ h( S(Y)/n ) ] < h( Eθ[ S(Y)/n ] ) = h(μ) = θ.    (3.22)

• The statistics S(Y) is a sufficient statistics of Y; this follows from the factorization criterion, see Theorem 1.5.2 of Lehmann [244].

Cramér–Rao Information Bound in the EDF Case

The single-parameter linear EDF case is very similar to the above vector-valued parameter EF case. We briefly summarize the main results in the EDF case. Recall Example 3.5: assume that Y1, . . . , Yn are independent having densities w.r.t. σ-finite measures on R (not being concentrated in a single point) given by, see (2.14),

    Yi ∼ f(yi; θ, vi/ϕ) = exp{ ( yi θ − κ(θ) ) / (ϕ/vi) + a(yi; vi/ϕ) },    (3.23)

for 1 ≤ i ≤ n. Note that these random variables are not i.i.d. because they may differ in the exposures vi > 0. The MLE of μ = κ′(θ), θ ∈ Θ̊, is found by, see (3.5),

    μ̂^MLE = argmax_{μ̃∈M} Σ_{i=1}^n ( Yi h(μ̃) − κ(h(μ̃)) ) / (ϕ/vi) = Σ_{i=1}^n vi Yi / Σ_{i=1}^n vi,    (3.24)

where we assume that κ is steep to ensure μ̂^MLE ∈ M̄. The convolution formula of Corollary 2.15 says that the MLE μ̂^MLE = Y+ belongs to the same EDF with the same canonical parameter θ and the same dispersion ϕ, only the weight changes to v+ = Σ_{i=1}^n vi.

Corollary 3.21 (Balance Property) Assume Y1, . . . , Yn are independent with EDF distribution (3.23) for θ ∈ Θ̊ and Yi ∈ M̄, a.s. The MLE μ̂^MLE ∈ M̄ is UMVU for μ = κ′(θ), and it fulfills the balance property on portfolio level, i.e.,

    Σ_{i=1}^n E_{μ̂^MLE}[ vi Yi ] = Σ_{i=1}^n vi μ̂^MLE = Σ_{i=1}^n vi Yi.

The score in this EDF is given by

    s(θ, Y) = ∂/∂θ log p(Y; θ) = Σ_{i=1}^n ∂/∂θ (vi/ϕ) ( θ Yi − κ(θ) ) = Σ_{i=1}^n (vi/ϕ) ( Yi − κ′(θ) ).

Of course, we have Eθ[s(θ, Y)] = 0, and we receive Fisher's information for θ ∈ Θ̊

    I(θ) = −Eθ[ ∂²/∂θ² log p(Y; θ) ] = Σ_{i=1}^n (vi/ϕ) κ″(θ) > 0.    (3.25)
i=1

Corollary 2.15 gives for the variance of the MLE

    Varθ( μ̂^MLE ) = ( ϕ / Σ_{i=1}^n vi ) κ″(θ) = (κ″(θ))² / I(θ) = ( ∂μ(θ)/∂θ )² / I(θ).

This verifies that μ̂^MLE meets the Cramér–Rao information bound and is UMVU for the mean μ = κ′(θ).
Example 3.22 (Poisson Case) For this example, we consider independent Poisson random variables Ni ∼ Poi(vi λ). In Sect. 2.2.2 we have seen that Yi = Ni/vi can be modeled within the single-parameter linear EDF framework using as cumulant function the exponential function κ(θ) = e^θ, and setting ωi = vi and ϕ = 1. Thus, the probability weights of a single observation Yi are given by, see (2.15),

    f(yi; θ, vi) = exp{ vi ( θ yi − e^θ ) + a(yi; vi) },

with canonical parameter θ = log(λ) ∈ Θ = R. The MLE in the mean parametrization is given by, see (3.24),

    λ̂^MLE = Σ_{i=1}^n vi Yi / Σ_{i=1}^n vi = Σ_{i=1}^n Ni / Σ_{i=1}^n vi ∈ M̄ = [0, ∞).

This estimator is unbiased for λ. Having independent Poisson random variables we can calculate the variance of this estimator as

    Var( λ̂^MLE ) = λ / Σ_{i=1}^n vi.

Moreover, from Corollary 3.21 we know that this estimator is UMVU for λ, which can easily be seen, and uses Fisher's information (3.25) with dispersion parameter ϕ = 1

    I(θ) = −Eθ[ ∂²/∂θ² log p(Y; θ) ] = Σ_{i=1}^n vi κ″(θ) = λ Σ_{i=1}^n vi.
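A short simulation sketch of this example (not part of the original text; numpy assumed): the empirical variance of λ̂^MLE over many replications matches the Cramér–Rao bound λ/Σ vi = γ′(θ)²/I(θ), and the balance property of Corollary 3.21 holds exactly in every replication.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
lam = 0.2
v = rng.uniform(0.5, 2.0, size=50)               # fixed exposures v_i
n_sim = 20_000

N = rng.poisson(v * lam, size=(n_sim, len(v)))   # N_i ~ Poi(v_i * lam)
lam_mle = N.sum(axis=1) / v.sum()                # MLE, unbiased for lam

print("empirical mean:", lam_mle.mean())         # approx lam
print("empirical var :", lam_mle.var())          # approx Cramer-Rao bound
print("CR bound      :", lam / v.sum())

# balance property: fitted total equals observed total in every replication
fitted = np.outer(lam_mle, v)                    # fitted E[N_i] = v_i * lam_mle
print(np.allclose(fitted.sum(axis=1), N.sum(axis=1)))  # True
```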

One could study many other properties of decision rules (and corresponding
estimators), for instance, admissibility or uniformly minimum risk equivariance
(UMRE), and we could also study other families of distribution functions such as
group families. We refrain from doing so because we will not need this for our
purposes.

3.4 Asymptotic Behavior of Estimators

All results above have been based on a finite sample Y n = (Y1, . . . , Yn)′; we add a lower index n to Y n to indicate the finite sample size n ∈ N. The aim of this section is to analyze properties of decision rules when the sample size n tends to infinity.

3.4.1 Consistency

Assume we have an infinite sequence of observations Yi, i ≥ 1, which allows us to construct an infinite sequence of decision rules An = An(Y n), n ≥ 1, where An always considers the first n observations Y n = (Y1, . . . , Yn)′ ∼ Pn(·; θ), for θ ∈ Θ not depending on n. To fix ideas, one may think of i.i.d. random variables Yi.

Definition 3.23 (Consistency) The sequence An = An(Y n) ∈ Rr, n ≥ 1, is consistent for γ : Θ → Rr if for all θ ∈ Θ and for all ε > 0 we have

    lim_{n→∞} Pθ[ ‖ An(Y n) − γ(θ) ‖₂ > ε ] = 0.

Definition 3.23 says that An(Y n) converges in probability to γ(θ) as n → ∞. If we (even) have a.s. convergence, we call An, n ≥ 1, strongly consistent for γ : Θ → Rr. Consistency is a minimal property that decision rules should fulfill. Typically, in applications, this is not enough, and we are interested in (fast) rates of convergence, i.e., we would like to know the error rates between An(Y n) and γ(θ) for n → ∞.
Example 3.24 (Consistency of the MLE in the EF) We revisit Corollary 3.19 and consider an i.i.d. sequence of random variables Yi, i ≥ 1, belonging to an EF, and we assume to work under a minimal representation and to have a steep cumulant function κ. The MLE for μ is given by the statistics

    μ̂^MLE_n = S(Y n)/n = (1/n) Σ_{i=1}^n ( T1(Yi), . . . , Tk(Yi) )′ ∈ M̄.

We add a lower index n to the MLE to indicate the sample size. The i.i.d. property of Yi, i ≥ 1, implies that we can apply the strong law of large numbers which tells us that we have lim_{n→∞} μ̂^MLE_n = Eθ[T(Y1)] = ∇θκ(θ) = μ, a.s., for all θ ∈ Θ. This implies strong consistency of the sequence of MLEs μ̂^MLE_n, n ≥ 1, for μ.
We have seen that these MLEs are also UMVU for μ, but if we transform them to the canonical scale θ̂^MLE_n they are, in general, biased for θ, see (3.22). However, since the cumulant function κ is strictly convex (under a minimal representation) we receive lim_{n→∞} θ̂^MLE_n = θ, a.s., which provides strong consistency also on the canonical scale. □

Proposition 3.25 Assume the real-valued random variables Yi, i ≥ 1, are i.i.d. F(·; θ) distributed with fixed θ ∈ Θ. The resulting empirical distributions F̂n, n ≥ 1, are given by (3.9). Assume Q is a Fisher-consistent functional for γ(θ), i.e., Q(F(·; θ)) = γ(θ) for all θ ∈ Θ. Moreover, assume that Q is continuous in F(·; θ), for all θ ∈ Θ, w.r.t. the supremum norm. The functionals Q(F̂n), n ≥ 1, are consistent for γ(θ).

Sketch of Proof The Glivenko–Cantelli theorem [64, 159] says that the empirical distribution F̂n converges uniformly to F(·; θ), a.s., for n → ∞. Using the assumptions made, we are allowed to exchange the corresponding limits, which provides consistency. □

In view of Proposition 3.25, we discuss the case of the MLE of θ ∈ Θ. In Example 3.10 we have seen that the MLE of θ ∈ Θ is obtained from a Fisher-consistent functional Q for θ on the set of probability distributions P given by

    Q(F) = argmax_{θ̃} ∫ log f(y; θ̃) dF(y) = argmin_{θ̃} DKL( f ‖ f(·; θ̃) );

in the second step we assumed that F has a density f w.r.t. a σ-finite measure ν on R.
R.
Assume we have i.i.d. data Yi ∼ f(·; θ), i ≥ 1. Thus, the true data generating distribution is described by the parameter θ ∈ Θ. MLE requires the study of the log-likelihood function (we scale with the sample size n)

    θ̃ → (1/n) ℓ_{Y n}(θ̃) = (1/n) Σ_{i=1}^n log f(Yi; θ̃).

The law of large numbers gives us, a.s.,

    lim_{n→∞} (1/n) Σ_{i=1}^n log f(Yi; θ̃) = Eθ[ log f(Y; θ̃) ].    (3.26)

Thus, if we are allowed to exchange the arg max operation and the limit in n → ∞ we receive, a.s.,

    lim_{n→∞} θ̂^MLE_n = lim_{n→∞} argmax_{θ̃} (1/n) Σ_{i=1}^n log f(Yi; θ̃)
                      ≟ argmax_{θ̃} lim_{n→∞} (1/n) Σ_{i=1}^n log f(Yi; θ̃)
                      = argmax_{θ̃} Eθ[ log f(Y; θ̃) ] = Q(F(·; θ)) = θ.    (3.27)

That is, we receive consistency of the MLE for θ if we are allowed to exchange the arg max operation and the limit in n → ∞. This requires regularity conditions on the considered family of distributions F = {F(·; θ); θ ∈ Θ}. The case of a finite parameter space Θ = {θ1, . . . , θJ} is easy; this is a simplified version of Wald's [374] consistency proof:

    Pθj[ θj ∉ argmax_{θk} (1/n) Σ_{i=1}^n log f(Yi; θk) ] ≤ Σ_{k≠j} Pθj[ (1/n) Σ_{i=1}^n log f(Yi; θk) > (1/n) Σ_{i=1}^n log f(Yi; θj) ].

The right-hand side converges to 0 as n → ∞ for all θk ≠ θj, which gives consistency. For regularity conditions on more general parameter spaces we refer to Section 5.2 in Van der Vaart [363]. Basically, one needs that the arg max of the limiting function given on the right-hand side of (3.26) is well-separated from other large values of that function, see Theorem 5.7 in Van der Vaart [363].
Remarks 3.26
• The estimator from the arg max operation in (3.27) is also called an M-estimator, and (y, a) → log(f(y; a)) plays the role of a scoring function (similar to a loss function). The last line of (3.27) says that this scoring function is strictly consistent for the functional Q : F → Θ, and Fisher-consistency of this functional Q implies

    Eθ[ log f(Y; θ̃) ] ≤ Eθ[ log f(Y; Q(F(·; θ))) ] = Eθ[ log f(Y; θ) ],

for all θ̃ ∈ Θ. Strict consistency of loss and scoring functions is going to be defined formally in Sect. 4.1.3, below, and we have just seen that this plays an important role for the consistency of M-estimators in the sense of Definition 3.23.
• Consistency (3.27) assumes that the data generating model Y ∼ F belongs to the specified family F = {F(·; θ); θ ∈ Θ}. Model uncertainty may imply that the data generating model does not belong to this family. In this situation, and if we are allowed to exchange the arg max operation and the limit in n in (3.27), the MLE will provide the model in F that is closest in KL divergence to the true model F. We come back to this in Sect. 11.1.4, below.

3.4.2 Asymptotic Normality

As mentioned above, typically, we would like to have stronger results than just
consistency. We give an introductory example based on the EF.
Example 3.27 (Asymptotic Normality of the MLE in the EF) We work under the same EF as in Example 3.24. This example has provided consistency of the sequence of MLEs μ̂^MLE_n, n ≥ 1, for μ. Note that the i.i.d. property together with the finite

variance property immediately implies the following convergence in distribution

    √n ( μ̂^MLE_n − μ ) ⇒ N( 0, ∇²θκ(θ) ) = N( 0, I1(θ) ) as n → ∞,

where θ = θ(μ) = (∇θκ)^{-1}(μ) ∈ Θ for μ ∈ M, and N denotes the Gaussian distribution. This is the multivariate version of the central limit theorem (CLT), and it tells us that the rate of convergence is 1/√n. This asymptotic result is stated in terms of Fisher's information matrix under the parametrization θ. We transform this to the dual mean parametrization and call Fisher's information matrix under the dual mean parametrization I1*(μ). This involves the change of variable μ → θ = θ(μ) = (∇θκ)^{-1}(μ). The Jacobian matrix of this change of variable is given by J(μ) = I1(θ(μ))^{-1} and, thus, the transformation of Fisher's information matrix gives, see also (3.16),

    μ → I1*(μ) = J(μ)′ I1(θ(μ)) J(μ) = I1(θ(μ))^{-1}.

This allows us to express the above CLT w.r.t. Fisher's information matrix corresponding to μ, and it gives us

    √n ( μ̂^MLE_n − μ ) ⇒ N( 0, I1*(μ)^{-1} ) as n → ∞.    (3.28)

We conclude that the appropriately normalized MLE μ̂^MLE_n converges in distribution to the centered Gaussian distribution having as covariance matrix the inverse of Fisher's information matrix I1*(μ), and the rate of convergence is 1/√n.
Assume that the effective domain Θ is open, and that θ = θ(μ) ∈ Θ. This allows us to transform the asymptotic normality (3.28) to the canonical scale. Consider again the change of variable μ → θ = θ(μ) = (∇θκ)^{-1}(μ) with Jacobian matrix J(μ) = I1(θ(μ))^{-1} = I1*(μ). Theorem 1.9 in Section 5.2 of Lehmann [244] tells us how the CLT transforms under such a change of variable, namely,

    √n ( θ̂^MLE_n − θ ) = √n ( (∇θκ)^{-1}( μ̂^MLE_n ) − (∇θκ)^{-1}(μ) )    (3.29)
                        ⇒ N( 0, J(μ) I1*(μ)^{-1} J(μ)′ ) = N( 0, I1(θ)^{-1} ) as n → ∞.

We have exactly the same structural form in the two asymptotic results (3.28) and (3.29). There is a main difference: μ̂^MLE_n is unbiased for μ whereas, in general, θ̂^MLE_n is not unbiased for θ, but we receive the same asymptotic behavior. □
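The convergence (3.28) can be illustrated by simulation (a minimal sketch, not part of the original text; numpy assumed): for i.i.d. Poisson observations we have μ̂^MLE_n = Ȳn and I1*(μ)^{-1} = μ, so √n(μ̂^MLE_n − μ) should be approximately centered Gaussian with variance μ.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
mu, n, n_sim = 1.5, 200, 20_000          # Poisson mean, sample size, repetitions

Y = rng.poisson(mu, size=(n_sim, n))
z = np.sqrt(n) * (Y.mean(axis=1) - mu)   # sqrt(n) * (mu_hat_n - mu)

print("empirical variance:", z.var())    # approx mu = inverse Fisher information
print("target variance   :", mu)
z_std = (z - z.mean()) / z.std()
print("skewness:", np.mean(z_std**3))    # approx 0 (Gaussian limit)
print("kurtosis:", np.mean(z_std**4))    # approx 3 (Gaussian limit)
```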

There are many different versions of asymptotic normality results similar to (3.28) and (3.29), and the main difficulty often is to verify the assumptions made. For instance, one can prove asymptotic normality based on a Fisher-consistent functional Q. The assumptions made are, among others, that Q needs to be Fréchet differentiable in P(·; θ) which, unfortunately, is rather difficult to verify. We make a list of assumptions here that are easier to check and then we give a version of the asymptotic normality result which is stated in the book of Lehmann [244]. This list of assumptions in the one-dimensional case Θ ⊆ R reads as follows:
(i) Θ ⊆ R is an open interval (possibly infinite).
(ii) The real-valued random variables Yi ∼ F(·; θ), i ≥ 1, have common support T = {y ∈ R; f(y; θ) > 0} which is independent of θ ∈ Θ.
(iii) For every y ∈ T, the density f(y; θ) is three times continuously differentiable in θ.
(iv) The integral ∫ f(y; θ) dν(y) is twice differentiable under the integral sign.
(v) Fisher's information satisfies I1(θ) = Eθ[ ( ∂ log f(Y1; θ)/∂θ )² ] ∈ (0, ∞).
(vi) For every θ0 ∈ Θ there exist a positive constant c and a function M(y) (both may depend on θ0) such that Eθ0[M(Y1)] < ∞ and

    | ∂³/∂θ³ log f(y; θ) | ≤ M(y)  for all y ∈ T and θ ∈ (θ0 − c, θ0 + c).

Theorem 3.28 (Theorem 2.3 in Section 6.2 of Lehmann [244]) Assume Yi, i ≥ 1, are i.i.d. F(·; θ) distributed satisfying (i)–(vi) from above. Assume that θ̂n = θ̂n(Y n), n ≥ 1, is a sequence of roots that solves the score equations

    ∂/∂θ̃ Σ_{i=1}^n log f(Yi; θ̃) = ∂/∂θ̃ ℓ_{Y n}(θ̃) = 0,

and which is consistent for θ, i.e., this sequence of roots θ̂n(Y n) converges in probability to the true parameter θ. Then we have asymptotic normality

    √n ( θ̂n − θ ) ⇒ N( 0, I1(θ)^{-1} ) as n → ∞.    (3.30)

Sketch of Proof Fix θ ∈ Θ and consider a Taylor expansion of the score ℓ′_{Y n}(·) in θ for θ̂n. It is given by

    ℓ′_{Y n}(θ̂n) = ℓ′_{Y n}(θ) + ℓ″_{Y n}(θ) ( θ̂n − θ ) + (1/2) ℓ‴_{Y n}(θ̄n) ( θ̂n − θ )²,

for some θ̄n ∈ [θ, θ̂n]. Since θ̂n is a root of the score, the left-hand side is equal to zero. This allows us to re-arrange the above Taylor expansion as follows

    √n ( θ̂n − θ ) = ( (1/√n) ℓ′_{Y n}(θ) ) / ( −(1/n) ℓ″_{Y n}(θ) − (1/(2n)) ℓ‴_{Y n}(θ̄n) ( θ̂n − θ ) ).

The numerator on the right-hand side converges in distribution to N(0, I1(θ)), see (18) in Section 6.2 of [244], the first term in the denominator converges in probability to I1(θ), see (19) in Section 6.2 of [244], and in the second term of the denominator we have (1/(2n)) ℓ‴_{Y n}(θ̄n), which is bounded in probability, see (20) in Section 6.2 of [244]. The claim then follows from Slutsky's theorem. □


Remarks 3.29
• A sequence (θ̂n)n≥1 satisfying Theorem 3.28 is called an efficient likelihood estimator (ELE) of θ. Typically, the sequence of MLEs θ̂n^MLE gives such an ELE sequence, but there are counterexamples where this is not the case, see Example 3.1 in Section 6.2 of Lehmann [244]. In that example θ̂n^MLE exists for all n ≥ 1, but it converges in probability to ∞, regardless of the value of the true parameter θ.
• Any sequence of estimators that fulfills (3.30) is called asymptotically efficient, because, similarly to the Cramér–Rao information bound of Theorem 3.13, it attains I1(θ)^{-1} (which under certain assumptions is a lower variance bound except on Lebesgue measure zero, see Theorem 1.1 in Section 6.1 of Lehmann [244]). However, there are two important differences here: (1) the Cramér–Rao information bound statement needs unbiasedness of the decision rule, whereas (3.30) only requires consistency (but not unbiasedness nor an asymptotically vanishing bias); and (2) the lower bound in the Cramér–Rao statement is an effective variance (on a finite sample), whereas the quantity in (3.30) is only an asymptotic variance. Moreover, any other sequence that differs in probability from an asymptotically efficient one by less than o(1/√n) is asymptotically efficient, too.
• If we consider a differentiable function θ → γ(θ), then Theorem 3.28 implies

    √n ( γ(θ̂n) − γ(θ) ) ⇒ N( 0, γ′(θ)² / I1(θ) ) as n → ∞.    (3.31)

This follows from asymptotic normality, consistency and considering a Taylor expansion around θ.
• We were starting from the MLE problem

    θ̂n^MLE = argmax_{θ̃} (1/n) Σ_{i=1}^n log f(Yi; θ̃).    (3.32)

In statistical theory a parameter estimator that is obtained through a maximization operation is called M-estimator (for maximizing or minimizing), see also Remarks 3.26. If the log-likelihood is differentiable in θ̃ we can turn the above problem into a root search problem for θ̃

    (1/n) Σ_{i=1}^n ∂/∂θ̃ log f(Yi; θ̃) = 0.    (3.33)

If a parameter estimator is obtained through a root search problem it is called Z-estimator (for equating to zero). The Z-estimator (3.33) does not require a maximum of the original function, but only a critical point; this is exactly what we have been exploring in Theorem 3.28. More generally, for a sufficiently nice function ψ(·; θ) a Z-estimator θ̂n^Z for θ is obtained by solving the following equation for θ̃

    (1/n) Σ_{i=1}^n ψ(Yi; θ̃) = 0,    (3.34)

for i.i.d. data Yi ∼ F(·; θ). Suppose that the first moment of ψ(Yi; θ̃) exists. The law of large numbers gives us, a.s., see also (3.26),

    lim_{n→∞} (1/n) Σ_{i=1}^n ψ(Yi; θ̃) = Eθ[ ψ(Y; θ̃) ].    (3.35)

Consistency of the Z-estimators θ̂n^Z, n ≥ 1, for θ is related to the right-hand side of (3.35) being zero for θ̃ = θ. Under additional regularity conditions (and consistency) it then holds asymptotic normality

    √n ( θ̂n^Z − θ ) ⇒ N( 0, Eθ[ ψ(Y; θ)² ] / Eθ[ ∂/∂θ ψ(Y; θ) ]² ) as n → ∞.    (3.36)

For rigorous statements we refer to Theorems 5.21 and 5.41 in Van der Vaart [363]. A modification to the regression case is given in Theorem 11.6 below.

Example 3.30 We consider the single-parameter linear EF for a given strictly convex and steep cumulant function κ and w.r.t. a σ-finite measure ν on R. The score equation gives the requirement

    (1/n) S(Y n) =! κ′(θ) = Eθ[Y1].    (3.37)

Strict convexity implies that the right-hand side strictly increases in θ. Therefore, we have at most one solution of the score equation here. We assume that the effective domain Θ ⊆ R is open. It is easily verified that assumptions (ii)–(vi) hold; in particular, (vi) saying that the third derivative should have a uniformly bounded integrable bound holds because the third derivative is independent of y and continuous in θ. With probability converging to 1, (3.37) has a solution θ̂n which is unique, consistent, and Theorem 3.28 holds. Note that in Example 3.5 we have mentioned the Poisson case which can be degenerate. For the asymptotic normality result we use here that this degeneracy asymptotically vanishes with probability converging to one. □

Remark 3.31 (Multi-Dimensional Extension) For an extension of Theorem 3.28 to the multi-dimensional case Θ ⊆ Rk we refer to Section 6.4 in Lehmann [244]. The assumptions made in the multi-dimensional case do not essentially differ from the ones in the one-dimensional case.

Chapter 4
Predictive Modeling and Forecast
Evaluation

In the previous chapter, we have fully focused on the estimation of parameters θ ∈ Θ and of functions θ → γ(θ) by exploiting decision rules A, estimating Y n → θ̂ = A(Y n) or Y n → γ̂(θ) = A(Y n), respectively. The derivations in that chapter analyzed the quality of decision rules in terms of loss functions which compare, e.g., the action θ̂ = A(Y n) to the true parameter θ. The Cramér–Rao information bound considers this in terms of a square loss function. In actuarial modeling, parameter estimation is only part of the problem, and the second part is to predict new random variables Y. These new random variables should be thought of as claims in the future that we try to predict (and price) using decision rules being developed based on past information Y n = (Y1, . . . , Yn)′. In this case, we would like to study how a decision rule A(Y n) generalizes to new data Y, and we then call the decision rule rather a predictor for Y. This capability of suitable decision rules to generalize to new (unseen) data is analyzed in Sect. 4.1. Such an analysis often relies on (numerical) techniques such as cross-validation, which is examined in Sect. 4.2, or the bootstrap technique, being presented in Sect. 4.3, below. In this chapter, we denote past observations by Y n = (Y1, . . . , Yn)′ supported on Y, and the (real-valued) random variables to be predicted are denoted by Y with support Y ⊂ R. Often we have Y = Y × · · · × Y.

4.1 Generalization Loss

We start by considering the most commonly used expected generalization loss


(GL) which is the mean squared error of prediction (MSEP). The MSEP is based
on the square loss function, and it can be seen as a distribution-free approach to
measure expected GL. In subsequent sections we will study distribution-adapted
GL approaches. Expected GL measurement with MSEP is considered to be general
knowledge and we do not give a specific reference in this section. Distribution-


adapted versions are mainly based on the strictly consistent scoring framework of
Gneiting–Raftery [163] and Gneiting [162]. In particular, we will discuss deviance
losses in Sect. 4.1.2 that are strictly consistent scoring functions for mean estimation
and, hence, provide proper scoring rules.

4.1.1 Mean Squared Error of Prediction

We denote by Y n = (Y1, . . . , Yn)′ the (past) observations on which predictors and decision rules A : Y → A are based. The new observation that we would like to predict is denoted by Y, having support Y ⊂ R. In the previous chapter we have used the decision rule A(Y n) to estimate an unknown quantity γ(θ). In this section we will use this decision rule to directly predict the new (unseen) observation Y.

Theorem 4.1 (Mean Squared Error of Prediction, MSEP) Assume that Y n and Y are independent. Assume that the predictor A : Y → A ⊆ R, Y n → A(Y n), has finite second moment, and that the real-valued random variable Y has finite second moment, too. The MSEP of predictor A to predict Y is given by

    E[ (Y − A(Y n))² ] = ( E[Y] − E[A(Y n)] )² + Var( A(Y n) ) + Var(Y).    (4.1)

Proof of Theorem 4.1 We compute

E[(A(Y n) − Y)²] = E[(A(Y n) − E[Y] + E[Y] − Y)²]
                 = E[(A(Y n) − E[Y])²] + E[(E[Y] − Y)²] + 2 E[(A(Y n) − E[Y]) (E[Y] − Y)]
                 = E[(E[Y] − E[A(Y n)] + E[A(Y n)] − A(Y n))²] + Var(Y)
                 = (E[Y] − E[A(Y n)])² + Var(A(Y n)) + Var(Y),

where on the second last line we use the independence between Y n and Y. This finishes the proof. □

Remarks 4.2 (Expected Generalization Loss)
• The quantity E[(Y − A(Y n))²] is an expected GL because it measures how well the decision rule (predictor) A(Y n) generalizes to new (unseen) data Y. As loss function we use the square loss function

  L : Y × A → R+,   (y, a) → L(y, a) = (y − a)².    (4.2)

  Therefore, this expected GL is called MSEP.
• The MSEP (4.1) is called expected GL. If we condition on Y n, then we call it GL. For the square loss function the GL (conditional MSEP) is given by

  E[(Y − A(Y n))² | Y n] = (E[Y] − A(Y n))² + Var(Y),    (4.3)

  where we have used the independence between Y and Y n.
• We do not distinguish the terms ‘prediction’ and ‘forecast’. Sometimes the literature makes a subtle difference between the two, the latter involving a temporal component and the former not. In the context of prediction/forecasting, a loss function (4.2) is also called a scoring function; we also use these two terms interchangeably in the context of prediction/forecasting.
• The MSEP in Theorem 4.1 decouples into three terms:
  – The first term (E[Y] − E[A(Y n)])² is the (squared) bias. Obviously, good decision rules A(Y n) under the MSEP should be unbiased for E[Y]. If we compare this to the previous chapter, we note that now the bias is measured w.r.t. the mean of the new observation Y. Additionally, there might be a slight difference to the previous chapter if Y n and Y do not belong to the same parameter θ ∈ Θ (if we work in a parametrized family): the risk function in (3.3) considers R(θ, A) = Eθ[L(θ, A(Y n))] with both components of the loss function L belonging to the same parameter value θ. For the MSEP we replace θ in L(θ, A(Y n)) by the new observation Y that might originate from a different distribution (or from a randomized θ in a Bayesian case).
  – The second term Var(A(Y n)) is called estimation variance or statistical error.
  – The last term Var(Y) is called process variance or irreducible risk. It reflects the pure randomness coming from the fact that we try to predict random variables Y with deterministic means E[Y].
• All three terms on the right-hand side of (4.1) are non-negative. The MSEP optimal predictor for Y is its expected value E[Y]. For this choice, the first two terms (squared bias and estimation variance) vanish, and we are only left with the irreducible risk. Since this MSEP optimal predictor is typically unknown, it is replaced by a decision rule A(Y n) that is based on past experience Y n. This decision rule is used to predict Y, but it can also be seen as an estimator for E[Y]. A good decision rule A(Y n) is unbiased for E[Y], making the first term on the right-hand side of (4.1) equal to zero, and at the same time it tries to make the estimation variance small. Typically, this cannot be achieved simultaneously and, therefore, there is a trade-off between bias and estimation variance in most applied statistical problems.

• We emphasize that in financial applications we typically aim for unbiased estimators of E[Y]; we especially refer to Sect. 7.4.2, which studies the balance property in network regression models under a stationary portfolio assumption. Here, this stationarity may, e.g., translate into a (stronger) i.i.d. assumption on Y1, . . . , Yn, Y. Unbiasedness then implies that the predictor A(Y n) is optimal in (4.1) if it meets the Cramér–Rao information bound, see Theorem 3.13. The decomposition (4.1) is illustrated by the small simulation sketch following these remarks.
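The decomposition (4.1) can be verified numerically. The following is a minimal simulation sketch in Python (not part of the original text), assuming a Gaussian model with illustrative parameters and the sample mean as decision rule A(Y n):

import numpy as np

rng = np.random.default_rng(seed=1)
mu, sigma, n, n_sim = 2.0, 1.0, 25, 100_000    # illustrative assumptions

Yn = rng.normal(mu, sigma, size=(n_sim, n))    # simulated past observations Y_n
Y = rng.normal(mu, sigma, size=n_sim)          # new observations Y, independent of Y_n
A = Yn.mean(axis=1)                            # decision rule A(Y_n): the sample mean

msep = np.mean((Y - A) ** 2)                   # empirical left-hand side of (4.1)
rhs = (mu - A.mean()) ** 2 + A.var() + sigma ** 2   # bias^2 + estimation variance + process variance
print(msep, rhs)                               # both close to sigma^2 * (1/n + 1)

For the unbiased sample mean, the bias term is negligible and the MSEP is dominated by the estimation variance σ²/n plus the irreducible risk σ².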
Theorem 4.1 considers the MSEP which implicitly assumes that the square loss
function is the objective (scoring) function of interest. The square loss function may
be considered as being distribution-free, but it is motivated by a Gaussian model for
Y n and Y , respectively; this will be justified in Remarks 4.6, below. If we use the
square loss function for observations different from Gaussian ones it might under-
or over-weigh particular characteristics in these observations because they may not
look very Gaussian (e.g. more heavy-tailed). Therefore, we should always choose a
scoring function that fits the problem considered, for instance, a square loss function
is not appropriate if we model claim counts following a Poisson distribution. We
close this section with the example of the EDF.
Example 4.3 (MSEP Within the EDF) We choose a fixed single-parameter linear EDF satisfying Assumption 2.6 and having a steep cumulant function κ, see Theorem 2.19 and Remark 2.20. Assume we have independent random variables Y1, . . . , Yn, Y belonging to this EDF having densities, see Example 3.5,

Yi ∼ f(yi; θ, vi/ϕ) = exp{ (yi θ − κ(θ)) / (ϕ/vi) + a(yi; vi/ϕ) },    (4.4)

and similarly for Y ∼ f(y; θ, v/ϕ). Note that all random variables share the same canonical parameter θ ∈ Θ̊. The MLE of μ ∈ M based on Y n = (Y1, . . . , Yn)′ is found by solving, see (3.4)–(3.5),

μ̂MLE = μ̂MLE(Y n) = arg max_{μ̃∈M} ℓ_{Y n}(μ̃) = arg max_{μ̃∈M} Σ_{i=1}^n (Yi h(μ̃) − κ(h(μ̃))) / (ϕ/vi),    (4.5)

with canonical link h = (κ′)^{−1}. Since the cumulant function κ is strictly convex and assumed to be steep, there exists a unique solution μ̂MLE in the closure of M. If μ̂MLE ∈ M we have a proper solution providing θ̂MLE = h(μ̂MLE) ∈ Θ, otherwise μ̂MLE provides a degenerate model. This decision rule Y n → μ̂MLE = μ̂MLE(Y n) is now used to predict the (independent) new random variable Y and to estimate the unknown parameters θ and μ, respectively. That is, we use the following predictor for Y

Y n → Ŷ = Êθ[Y] = E_{θ̂MLE}[Y] = μ̂MLE = μ̂MLE(Y n).

Note that this predictor Ŷ is used to predict an unobserved (new) random variable Y, and it is itself a random variable as a function of the (independent) past observations Y n. We calculate the MSEP in this model. Using Theorem 4.1 we obtain

Eθ[(Y − μ̂MLE)²] = (Eθ[Y] − Eθ[μ̂MLE])² + Varθ(μ̂MLE) + Varθ(Y)
                = (κ′(θ) − κ′(θ))² + ϕκ″(θ) / Σ_{i=1}^n vi + ϕκ″(θ)/v    (4.6)
                = (κ″(θ))² / I(θ) + ϕκ″(θ)/v,

see (3.25) for Fisher's information I(θ). In this calculation we have used that the MLE μ̂MLE is UMVU for μ = κ′(θ) and that Y n and Y come from the same EDF with the same canonical parameter θ ∈ Θ̊. As a result, we are only left with the estimation variance and the process variance; moreover, the estimation variance asymptotically vanishes as Σ_{i=1}^n vi → ∞. ■
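As a small numerical check (a Python sketch under assumed illustrative parameters, with simulated data), the Poisson instance of (4.6) with ϕ = 1 and κ″(θ) = μ can be compared against a Monte Carlo estimate of the MSEP:

import numpy as np

rng = np.random.default_rng(seed=2)
mu, n, n_sim = 0.1, 200, 100_000           # expected frequency and portfolio size (illustrative)
v = rng.uniform(0.5, 1.0, size=n)          # exposures v_1, ..., v_n
v_new = 1.0                                # exposure of the new policy

N = rng.poisson(mu * v, size=(n_sim, n))   # claim counts N_i with E[N_i] = v_i * mu
mu_mle = N.sum(axis=1) / v.sum()           # MLE (3.24): sum_i N_i / sum_i v_i
Y = rng.poisson(mu * v_new, size=n_sim) / v_new   # new claim frequency to be predicted

msep_empirical = np.mean((Y - mu_mle) ** 2)
msep_theoretical = mu / v.sum() + mu / v_new      # estimation variance + process variance, see (4.6)
print(msep_empirical, msep_theoretical)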

4.1.2 Unit Deviances and Deviance Generalization Loss

The main estimation technique used in these notes is MLE, introduced in Definition 3.4. At this stage, MLE is unrelated to any specific scoring function L because it has been obtained by maximizing the log-likelihood function. In this section we discuss the deviance loss function (as a scoring function) and we highlight its connection to the Bregman divergence introduced in Sect. 2.3. Based on the deviance loss function choice we rephrase Theorem 4.1 in terms of this scoring function. A theoretical foundation of these considerations will be given in Sect. 4.1.3, below.

For the derivations in this section we rely on the same single-parameter linear EDF as in Example 4.3, having a steep cumulant function κ. The MLE of μ = κ′(θ) is found by solving, see (4.5),

μ̂MLE = μ̂MLE(Y n) = arg max_{μ̃∈M} Σ_{i=1}^n (Yi h(μ̃) − κ(h(μ̃))) / (ϕ/vi) ∈ M,

with canonical link h = (κ′)^{−1}. This decision rule Y n → μ̂MLE = μ̂MLE(Y n) is now used to predict the (new) random variable Y and to estimate the unknown parameters θ and μ, respectively. We aim at studying the expected GL under a distribution-adapted loss function choice, potentially different from the square loss function. Below we will justify this second choice more extensively.

For the saturated model the common canonical parameter θ of the independent random variables Y1, . . . , Yn in (4.4) is replaced by individual canonical parameters θi, 1 ≤ i ≤ n. These individual canonical parameters are estimated with individual MLEs. The individual MLEs are given by, respectively,

θ̂i MLE = (κ′)^{−1}(Yi) = h(Yi)   and   μ̂i MLE = Yi ∈ M,

the latter always exists because of the strict convexity and steepness of κ. Since the MLE μ̂i MLE = Yi maximizes the log-likelihood, we receive for any μ ∈ M the inequality

0 ≤ 2 ( log f(Yi; h(Yi), vi/ϕ) − log f(Yi; h(μ), vi/ϕ) )
  = 2 (vi/ϕ) ( Yi h(Yi) − κ(h(Yi)) − Yi h(μ) + κ(h(μ)) )    (4.7)
  = (vi/ϕ) d(Yi, μ).

The function (y, μ) → d(y, μ) ≥ 0 is the unit deviance introduced in (2.25), extended to C, and it is zero if and only if y = μ, see Lemma 2.22. The latter is also an immediate consequence of the fact that the MLE is unique within EDFs.
Remark 4.4 The unit deviance d(y, μ) has only been considered on C̊ × M in (2.25). Steepness of the cumulant function κ implies C̊ = M, see Theorem 2.19, and in the absolutely continuous EDF case we always have Yi ∈ M, a.s., which makes (4.7) well-defined for all observations Yi, a.s. In the discrete or the mixed EDF case, an observation Yi can be at the boundary of M. In that case (4.7) must be calculated from

d(Yi, μ) = 2 ( sup_{θ̃∈Θ} ( Yi θ̃ − κ(θ̃) ) − Yi h(μ) + κ(h(μ)) ).    (4.8)

This applies, e.g., to the Poisson or Bernoulli cases for observation Yi = 0; in these cases we obtain unit deviances 2μ and −2 log(1 − μ), respectively.
The previous considerations (4.7)–(4.8) have studied one single observation Yi of Y n. Aggregating over all observations in Y n (and additionally using the independence between the individual components of Y n) we arrive at the so-called deviance loss function

D(Y n, μ) := (1/n) Σ_{i=1}^n (vi/ϕ) d(Yi, μ)    (4.9)
           = (2/n) Σ_{i=1}^n (vi/ϕ) ( Yi h(Yi) − κ(h(Yi)) − Yi h(μ) + κ(h(μ)) ) ≥ 0.

The deviance loss function D(Y n, μ) subtracts twice the log-likelihood ℓ_{Y n}(μ) from that of the saturated model (scaled by 1/n). Thus, it introduces a sign flip compared to (4.5). This immediately gives us the following corollary.

Corollary 4.5 (Deviance Loss Function) The MLE problem (4.5) is equivalent to solving

μ̂MLE = arg max_{μ̃∈M} ℓ_{Y n}(μ̃) = arg min_{μ̃∈M} D(Y n, μ̃).    (4.10)

Remarks 4.6
• Formula (4.10) replaces a maximization problem by a minimization problem with an objective function D(Y n, μ) that is bounded below by zero. We can use this deviance loss function not only for parameter estimation, but also as a scoring function for analyzing GLs within the EDF (similarly to Theorem 4.1).
• We draw the link to the KL divergence discussed in Sect. 2.3. In formula (2.26) we have shown that the unit deviance is equal to the KL divergence (up to scaling with factor 2); thus, equivalently, MLE aims at minimizing the average KL divergence over all observations Y n

  θ̂MLE = arg min_{θ̃∈Θ} (1/n) Σ_{i=1}^n D_KL( f(·; h(Yi), vi/ϕ) || f(·; θ̃, vi/ϕ) ),

by finding an optimal parameter θ̂MLE somewhere ‘in the middle’ of the observation-specific MLEs θ̂1 MLE = h(Y1), . . . , θ̂n MLE = h(Yn). This then provides us with, see (2.27),

Π_{i=1}^n f(Yi; θ̃, vi/ϕ) = ( Π_{i=1}^n f(Yi; h(Yi), vi/ϕ) ) exp{ −(1/2) Σ_{i=1}^n (vi/ϕ) d(Yi, κ′(θ̃)) }    (4.11)
                          ∝ exp{ − Σ_{i=1}^n D_KL( f(·; h(Yi), vi/ϕ) || f(·; θ̃, vi/ϕ) ) },

where ∝ highlights that we drop all terms that do not involve θ̃. This describes the change in the joint likelihood when varying the canonical parameter θ̃ over its domain Θ. The first line of (4.11) is in the spirit of minimizing a weighted square loss, but the Gaussian square is replaced by the unit deviance d. The second line of (4.11) is in the spirit of the information geometry considered in Sect. 2.3, where we try to find a canonical parameter θ̃ that has a small KL divergence to the n individual models parametrized by h(Y1), . . . , h(Yn); thus, the MLE θ̂MLE provides an optimal balance over the entire set of (independent) observations Y1, . . . , Yn w.r.t. the KL divergence.
• In contrast to the square loss function, the deviance loss function D(Y n, μ) respects the distributional properties of Y n, see (4.11). That is, if the underlying distribution allows for larger or smaller claims, this fact is appropriately valued in the deviance loss function (provided that we have chosen the right family of distributions; model uncertainty will be studied in Sect. 11.1, below).
• Assume we work in the Gaussian model. In this model we have κ(θ) = θ²/2 and canonical link h(μ) = μ, see Sect. 2.1.3. This provides the unit deviance in the Gaussian case d(y, μ) = (y − μ)², which is exactly the square loss function for action space A = M. Thus, the square loss function is most appropriate in the Gaussian case.
• As explained above, we use unit deviances d(y, μ) as a measure of discrepancy. Alternatively, as in the introduction to this section, see (4.6), we can consider Pearson's χ²-statistic which corresponds to the weighted square loss function

  X²(y, μ) = (y − μ)² / V(μ),    (4.12)

  where μ → V(μ) is the variance function of the chosen EDF. Similarly to the deviance loss function (4.9), we can aggregate these Pearson's χ²-statistics X²(Yi, μ) over all observations Yi in Y n to receive a second overall measure of discrepancy. In the Gaussian case the deviance loss and Pearson's χ²-statistic coincide and have a χ²-distribution; for other distributions asymptotic results are available.
  In the non-Gaussian case, (4.12) is not always robust. For instance, if we work in the Poisson model, we have variance function V(μ) = μ. Our examples below will have low claim frequencies, which implies that μ will be small. The appearance of a small μ in the denominator of (4.12) implies that Pearson's χ²-statistic is not very robust in small frequency applications, in particular, if we need to estimate this μ from Y n. Therefore, we refrain from using (4.12).
Naturally, in analogy to Theorem 4.1 and derivation (4.6), the above considerations motivate us to consider expected GLs under unit deviances within the EDF. We use the decision rule μ̂MLE(Y n) ∈ A = M to predict a new observation Y. The expected deviance GL is defined and given by

Eθ[d(Y, μ̂MLE(Y n))]
  = Eθ[d(Y, μ)] + 2 Eθ[ Y h(μ) − κ(h(μ)) − Y h(μ̂MLE(Y n)) + κ(h(μ̂MLE(Y n))) ]
  = Eθ[d(Y, μ)] + E(μ, μ̂MLE(Y n)),    (4.13)

the last identity uses the independence between Y n and Y, and with the estimation risk function

E(μ, μ̂MLE(Y n)) = Eθ[d(μ, μ̂MLE(Y n))] > 0,    (4.14)

we use the steepness of the cumulant function, C = conv(T) = M, and Lemma 2.22 for the strict positivity of the estimation risk function. Thus, for the estimation risk function E we replace Y by μ in the unit deviance, and the expectation Eθ is only over the observations Y n. This looks like a very convincing generalization of the MSEP, however, one needs to ensure that all terms in (4.13) exist.

Theorem 4.7 (Expected Deviance Generalization Loss) Assume that Y n and Y are independent and belong to the same linear EDF having the same canonical parameter θ ∈ Θ̊ and having a strictly convex and steep cumulant function κ. Choose a predictor A : Y → A = M, Y n → A(Y n), and assume that all expectations in the following formula exist. The expected deviance GL of predictor A to predict Y is given by

Eθ[d(Y, A(Y n))] = Eθ[d(Y, μ)] + E(μ, A(Y n)) ≥ Eθ[d(Y, μ)].

Remarks 4.8
• Eθ[d(Y, μ)] plays the role of the pure process variance (irreducible risk) of Theorem 4.1. This term does not involve any parameter estimation bias and uncertainty because it is based on the true parameter θ and μ = κ′(θ), respectively. In Sect. 4.1.3, below, we are going to justify the appropriateness of this object as a tool for forecast evaluation. In particular, because the unit deviance is strictly consistent for the mean functional, the true mean μ = μ(θ) minimizes a → Eθ[d(Y, a)], see (4.28), below.
• The second term E(μ, A(Y n)) measures the parameter estimation bias and uncertainty of decision rule A(Y n) versus the true parameter μ = κ′(θ). The first remark is that we can do this for any decision rule A, i.e., we do not necessarily need to consider the MLE. The second remark is that we can no longer get a clear-cut differentiation between a bias term and a parameter estimation uncertainty term for deviance loss functions not coming from the Gaussian distribution. We come back to this in Remarks 7.17, below, where we give more characterization to the individual terms of the expected deviance GL.
• An issue in applying Theorem 4.7 to the MLE decision rule A(Y n) = μ̂MLE(Y n) is that, in general, it does not lead to a finite estimation risk function. For instance, in the Poisson case we have with positive probability μ̂MLE(Y n) = 0, which results in an infinite estimation risk. In order to avoid this, we need to bound the decision rule away from the boundary of M and Θ, respectively. In the Poisson case this can be achieved by considering a decision rule A(Y n) = max{μ̂MLE(Y n), ε} for a fixed given ε ∈ (0, μ = κ′(θ)). This decision rule has a bias which asymptotically vanishes as n → ∞. Moreover, consistency and asymptotic normality tell us that this lower bound does not affect prediction for large sample sizes n (with large probability).
• Similar to (4.3), we can also consider the deviance GL, given Y n. Under the independence of Y n and Y we have deviance GL

  Eθ[d(Y, A(Y n)) | Y n] = Eθ[d(Y, μ) | Y n] + d(μ, A(Y n))    (4.15)
                        ≥ Eθ[d(Y, μ)].

  Thus, here we directly compare A(Y n) to the true parameter μ.

Thus, here we directly compare A(Y n ) to the true parameter μ.

Example 4.9 (Estimation Risk Function in the Gaussian Case) We consider the Gaussian case with cumulant function κ(θ) = θ²/2 and canonical link h(μ) = μ. The estimation risk function in the Gaussian case is, for a square integrable predictor A(Y n), given by

E(μ, A(Y n)) = Eθ[d(μ, A(Y n))]
             = 2 ( μ h(μ) − κ(h(μ)) − μ Eθ[h(A(Y n))] + Eθ[κ(h(A(Y n)))] )
             = μ² − 2μ Eθ[A(Y n)] + Eθ[(A(Y n))²]
             = (μ − Eθ[A(Y n)])² + Varθ(A(Y n)).

These are exactly the squared bias and the estimation variance, see (4.1). Thus, in the Gaussian case, the MSEP and the expected deviance GL coincide. Moreover, adding a deterministic bias c ∈ R to A(Y n) increases the estimation risk function, provided that A(Y n) is unbiased for μ. We emphasize the latter as this is an important property to have, and we refer to the next Example 4.10 for an example where this property fails to hold. ■

Example 4.10 (Estimation Risk Function in the Poisson Case) We consider the Poisson case with cumulant function κ(θ) = e^θ and canonical link h(μ) = log μ. The estimation risk function is given by (subject to existence)

E(μ, A(Y n)) = 2 ( μ log(μ) − μ − μ Eθ[log(A(Y n))] + Eθ[A(Y n)] ).    (4.16)

Assume that the decision rule A(Y n) is non-deterministic and unbiased for μ. Using Jensen's inequality, these assumptions imply for the estimation risk function

E(μ, A(Y n)) = 2μ ( log(μ) − Eθ[log(A(Y n))] ) > 0.

We now add a small deterministic bias c ∈ R to the unbiased estimator A(Y n) for μ. This gives us the estimation risk function, see (4.16) and subject to existence,

E(μ, A(Y n) + c) = 2 ( μ log(μ) − μ Eθ[log(A(Y n) + c)] + c ).

Consider the derivative w.r.t. the bias c in 0; we use Jensen's inequality on the last line,

∂/∂c E(μ, A(Y n) + c) |_{c=0} = 2 ( −μ Eθ[1/(A(Y n) + c)] + 1 ) |_{c=0}
                              = −2μ Eθ[1/A(Y n)] + 2
                              < −2μ / Eθ[A(Y n)] + 2 = 0.    (4.17)

Thus, the estimation risk becomes smaller if we add a small positive bias to the (non-deterministic) unbiased predictor A(Y n); a small numerical sketch is given below. This issue has been raised in Denuit et al. [97]. Of course, this is a very unfavorable property, and it is rather different from the Gaussian case in Example 4.9. It is essentially driven by the fact that parameter estimation is based on a finite sample, which implies a strict inequality in (4.17) for the finite sample estimate A(Y n). A conclusion of this example is that if we use expected deviance GLs for forecast evaluation we need to insist on having unbiased predictors. This will become especially important for more complex regression models, see Sect. 7.4.2, below.

More generally, one can prove this result of a smaller estimation risk function for a small positive bias for any EDF member with power variance function V(μ) = μ^p with p ≥ 1, see also (4.18) below. The proof uses the Fortuin–Kasteleyn–Ginibre (FKG) inequality [133] providing Eθ[A(Y n)^{1−p}] < Eθ[A(Y n)] Eθ[A(Y n)^{−p}] = μ Eθ[A(Y n)^{−p}] to receive (4.17) for power variance parameters p ≥ 1. ■
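The effect derived in (4.17) can be illustrated numerically. The following Python sketch uses assumed illustrative parameters, with the (unbiased) Poisson MLE playing the role of A(Y n); the estimation risk E(μ, A(Y n) + c) is estimated by Monte Carlo for several small biases c:

import numpy as np

rng = np.random.default_rng(seed=3)
mu, n, n_sim = 0.5, 50, 200_000          # illustrative assumptions

N = rng.poisson(mu, size=(n_sim, n))
A = N.mean(axis=1)                        # unbiased MLE for mu
A = np.maximum(A, 1.0 / n)                # bound away from 0 so log(A) is finite, cf. Remarks 4.8

def estimation_risk(c):
    # E(mu, A + c) = 2 E[(A + c) - mu - mu * log((A + c)/mu)], Poisson unit deviance
    a = A + c
    return np.mean(2.0 * (a - mu - mu * np.log(a / mu)))

for c in [0.0, 0.005, 0.01, 0.02]:
    print(f"c = {c:5.3f}: estimation risk = {estimation_risk(c):.6f}")
# the risk initially decreases in c, illustrating the negative derivative (4.17)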

Remarks 4.11 (Conclusion from Examples 4.9 and 4.10 and a Further Remark)
• Working with expected deviance GLs for evaluating forecasts requires some care because a bigger bias in the (finite sample) estimate A(Y n) may provide a smaller estimation risk function E(μ, A(Y n)). For this reason, we typically insist on having unbiased predictors/forecasts. The latter is also an important requirement in financial applications to guarantee that the overall price is set to the right level; we refer to the balance property in Corollary 3.19 and to Sect. 7.4.2, below.
• In Theorems 4.1 and 4.7 we use the independence between the predictor A(Y n) and the random variable Y to receive the split of the expected deviance GL into the irreducible risk and the estimation risk function. In regression models, this independence between the predictor A(Y n) and the random variable Y may no longer hold. In that case we will still work with the expected deviance GL Eθ[d(Y, A(Y n))], but a clear split between estimation and forecasting will no longer be possible, see Sect. 4.2, below.
The next example gives the most important unit deviances in actuarial modeling.
Example 4.12 (Unit Deviances) We give the most prominent examples of unit deviances within the single-parameter linear EDF. We recall unit deviance (2.25)

d(y, μ) = 2 ( y h(y) − κ(h(y)) − y h(μ) + κ(h(μ)) ) ≥ 0.

In Sect. 2.2 we have met the examples given in Table 4.1.
Table 4.1 Unit deviances of selected distributions commonly used in actuarial science

Distribution        Cumulant function κ(θ)                            Unit deviance d(y, μ)
Gaussian            θ²/2                                              (y − μ)²
Gamma               −log(−θ)                                          2 ( (y − μ)/μ + log(μ/y) )
Inverse Gaussian    −√(−2θ)                                           (y − μ)² / (μ² y)
Poisson             e^θ                                               2 ( μ − y − y log(μ/y) )
Negative-binomial   −log(1 − e^θ)                                     2 ( y log(y/μ) − (y + 1) log((y + 1)/(μ + 1)) )
Tweedie's CP        ((1 − p)θ)^{(2−p)/(1−p)} / (2 − p), p ∈ (1, 2)    2 ( y (y^{1−p} − μ^{1−p})/(1 − p) − (y^{2−p} − μ^{2−p})/(2 − p) )
Bernoulli           log(1 + e^θ)                                      2 ( −y log μ − (1 − y) log(1 − μ) )
If we focus on Tweedie's distributions having power variance functions V(μ) = μ^p, see Table 2.1, we get a unified expression for the unit deviances for p ∈ {0} ∪ (1, 2) ∪ (2, ∞)

d(y, μ) = 2 ( y (y^{1−p} − μ^{1−p})/(1 − p) − (y^{2−p} − μ^{2−p})/(2 − p) )    (4.18)
        = 2 ( y^{2−p}/((1 − p)(2 − p)) − y μ^{1−p}/(1 − p) + μ^{2−p}/(2 − p) ).

For the remaining power variance cases we have: p = 1 corresponds to the Poisson case, p = 2 gives the gamma case, the cases p < 0 do not have a steep cumulant function, and, moreover, there are no EDF models for p ∈ (0, 1), see Theorem 2.18. A small implementation sketch of (4.18) is given below.
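The following is a minimal Python implementation sketch of (4.18) and the limiting cases of Table 4.1; the function name and the parameter choices are our own illustration, not from the text:

import numpy as np

def tweedie_unit_deviance(y, mu, p):
    # unit deviance d(y, mu) for power variance parameter p; Poisson (p=1) and
    # gamma (p=2) limits are treated separately, otherwise formula (4.18) applies
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    if p == 0:                                   # Gaussian
        return (y - mu) ** 2
    if p == 1:                                   # Poisson (y = 0 gives 2 * mu)
        return 2.0 * (mu - y + np.where(y > 0, y * np.log(y / mu), 0.0))
    if p == 2:                                   # gamma
        return 2.0 * ((y - mu) / mu + np.log(mu / y))
    return 2.0 * (y ** (2 - p) / ((1 - p) * (2 - p))
                  - y * mu ** (1 - p) / (1 - p)
                  + mu ** (2 - p) / (2 - p))     # general case (4.18)

y, mu = 1.3, 0.9
for p in [0, 1, 1.5, 2, 3]:                      # p = 3 reproduces the inverse Gaussian case
    print(p, tweedie_unit_deviance(y, mu, p))
# sanity check: d(mu, mu) = 0 for all p, see Lemma 2.22
assert all(abs(tweedie_unit_deviance(mu, mu, p)) < 1e-12 for p in [0, 1, 1.5, 2, 3])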
The unit deviance in the Bernoulli case is also called binary cross-entropy. This binary cross-entropy has a categorical generalization, called multi-class cross-entropy. Assume we have a categorical EF with levels {1, . . . , k + 1} and corresponding probabilities p1, . . . , pk+1 ∈ (0, 1) summing up to 1, see Sect. 2.1.4. We denote by Y = (1_{Y=1}, . . . , 1_{Y=k+1})′ ∈ R^{k+1} the indicator variable that shows which level the categorical random variable Y takes; Y is called the one-hot encoding of the categorical random variable Y. Assume y is a realization of Y and set μ = p = (p1, . . . , pk+1)′. The categorical (multi-class) cross-entropy loss function is given by

d(y, μ) = d(y, p) = −2 Σ_{j=1}^{k+1} yj log pj ≥ 0.    (4.19)

This cross-entropy is closely related to the KL divergence between two categorical distributions p and q on {1, . . . , k + 1}. The KL divergence from p to q is given by

D_KL(q||p) = Σ_{j=1}^{k+1} qj log(qj/pj) = Σ_{j=1}^{k+1} qj log qj − Σ_{j=1}^{k+1} qj log pj.

If we replace the true (but unknown) distribution q by the observation Y = y we receive unit deviance (4.19) (scaled by 2), and the MLE is obtained by minimizing this KL divergence, see also Example 3.10. ■
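The cross-entropy (4.19) and its relation to the KL divergence can be illustrated with a few lines of Python; the probabilities below are purely illustrative assumptions:

import numpy as np

def cross_entropy(y_onehot, p):
    # unit deviance d(y, p) = -2 * sum_j y_j log p_j, see (4.19)
    return -2.0 * np.sum(y_onehot * np.log(p))

def kl_divergence(q, p):
    # D_KL(q||p) = sum_j q_j log(q_j / p_j)
    return np.sum(q * np.log(q / p))

p = np.array([0.7, 0.2, 0.1])        # model probabilities for k + 1 = 3 levels
y = np.array([0.0, 1.0, 0.0])        # one-hot encoding of an observation Y = 2
print(cross_entropy(y, p))           # equals -2 log(0.2)

q = np.array([0.6, 0.3, 0.1])        # a 'true' categorical distribution
# decomposition D_KL(q||p) = sum_j q_j log q_j - sum_j q_j log p_j
print(kl_divergence(q, p), np.sum(q * np.log(q)) - np.sum(q * np.log(p)))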

Outlook 4.13 In the regression modeling, below, each response Yi will have its own mean parameter μi = μ(β, x i) which will be a function of its covariate information x i, and β denotes a regression parameter to be estimated with MLE. In that case, we modify the deviance loss function (4.9) to

β → D(Y n, β) = (1/n) Σ_{i=1}^n (vi/ϕ) d(Yi, μi) = (1/n) Σ_{i=1}^n (vi/ϕ) d(Yi, μ(β, x i)),    (4.20)

and the MLE of β can be found by solving

β̂MLE = arg min_β D(Y n, β).    (4.21)

If Y is a new response with covariate information x and following the same EDF as Y n, we will evaluate the corresponding expected scaled deviance GL given by

Eβ[ (v/ϕ) d(Y, μ(β̂MLE, x)) ],    (4.22)
where Eβ is the expectation under the true regression parameter β for Y n and Y. This will be discussed in Sect. 5.1.7, below. If we interpret (Y, x, v) as a random vector describing a randomly selected insurance policy from our portfolio, being independent of Y n (and of the corresponding covariate information x i, 1 ≤ i ≤ n), then β̂MLE will be independent of (Y, x, v). Nevertheless, the predictor μ(β̂MLE, x) will introduce dependence between the chosen decision rule and Y through x, and we no longer receive the split of the expected deviance GL as stated in Theorem 4.7; for a related discussion we also refer to Remarks 7.17, below.

If we interpret (Y, x, v) as a randomly selected insurance policy, then the expected GL (4.22) is evaluated under the joint (portfolio) distribution of (Y, x, v), and the deviance loss D(Y n, β̂MLE) is an (in-sample) empirical version of (4.22). □

4.1.3 A Decision-Theoretic Approach to Forecast Evaluation

We present an excursion to a decision-theoretic approach to forecast evaluation. This excursion gives the theoretical foundation to the unit deviance considerations from above. This section follows Gneiting [162], Krüger–Ziegel [227] and Denuit et al. [97], and we refrain from giving complete proofs in this section. Forecast evaluation should involve consistent loss/scoring functions and proper scoring rules to encourage the forecaster to make careful assessments and honest forecasts. Consistent loss functions are also a necessary tool to receive consistency of M-estimators; we refer to Remarks 3.26.

Consistency and Proper Scoring Rules

Denote by C ⊆ R the convex closure of the support of a real-valued random variable Y, and let the action space be A = C, see also (3.1). Predictions are evaluated in terms of a loss/scoring function

L : C × A → R+,   (y, a) → L(y, a) ≥ 0.    (4.23)

Remark 4.14 In (4.23) we assume that the loss function L is bounded below by
zero. This can be an advantage in applications because it gives a calibration to the
loss function. In general, this lower bound is not a necessary condition for forecast
evaluation. If we drop this lower bound property, we rather call L (only) a scoring
function. For instance, the log-likelihood log(f (y, a)) in (3.27) plays the role of a
scoring function.
The forecaster can take the position of minimizing the expected loss to choose her/his action rule. That is, subject to existence, an optimal action w.r.t. L is received by

â = â(F) = arg min_{a∈A} EF[L(Y, a)] = arg min_{a∈A} ∫_C L(y, a) dF(y).    (4.24)

In this setup the scoring function L(y, a) describes the loss that the forecaster suffers if she/he uses action a ∈ A and observation y ∈ C materializes. Since we do not want to insist on uniqueness in (4.24), we rather think of set-valued functionals in this section, which may provide solutions to problems like (4.24).¹

We now reverse the line of arguments, and we start from a general set-valued functional. Denote by F the family of distribution functions of interest supported on C. Consider the set-valued functional

A : F → P(A),   F → A(F) ⊂ A,    (4.25)

that maps each distribution F ∈ F to a subset A(F) of the action space A = C, that is, an element of the power set P(A). The main question that we want to study in this section is the following: can we find a loss function L so that the set-valued functional A is obtained by a loss minimization (4.24)? This motivates the following definition.

¹ In fact, also for the MLE in Definition 3.4 we should consider a set-valued functional. We have decided to skip this distinction to avoid any kind of complication and to not disturb the flow of reading.

definition.
Definition 4.15 (Strict Consistency) The loss function L : C × A → R+ is consistent for the functional A : F → P(A) relative to the class F if

EF[L(Y, â)] ≤ EF[L(Y, a)],    (4.26)

for all F ∈ F, â ∈ A(F) and a ∈ A. It is strictly consistent if it is consistent and equality in (4.26) implies that a ∈ A(F).

As stated in Theorem 1 of Gneiting [162], a loss function L is consistent for the functional A relative to the class F if and only if, given any F ∈ F, every â ∈ A(F) is an optimal action under L in the sense of (4.24).
We give an example. Assume we start from the functional F → A(F ) = EF [Y ]
that maps each distribution F to its expected value. In this case we do not need
to consider a set-valued functional because the expected value is a singleton (we
assume that F only contains distributions with a finite first moment). The question
then is whether we can find a loss function L such that this mean can be received by
a minimization (4.24). This question is answered in Theorem 4.19, below.
Next we relate a consistent loss function L to a proper scoring rule. A proper
scoring rule is a function R : C × F → R such that

EF [R(Y, F )] ≤ EF [R(Y, G)] , (4.27)

for all F, G ∈ F, supposed that the expectations are well-defined. A scoring rule R analyzes the penalty R(y, G) if the forecaster works with a distribution G and an observation y of Y ∼ F materializes. Proper scoring rules have been promoted in Gneiting–Raftery [163] and Gneiting [162]. They are important because they encourage the forecaster to make honest forecasts, i.e., they give the forecaster the incentive to minimize the expected score by following her/his true belief about the true distribution, because only this minimizes the expected penalty in (4.27).
Theorem 4.16 (Gneiting [162, Theorem 3]) Assume that L is a consistent loss
function for the functional A relative to the class F . For each F ∈ F , let aF ∈ A(F ).
The scoring rule

R : C × F → R, (y, F ) → R(y, F ) = L(y, aF ),

is a proper scoring rule.

Example 4.17 Consider the unit deviance d(·, ·) : C × M → R+ for a given EDF F = {F(·; θ, v/ϕ); θ ∈ Θ̊} with cumulant function κ. Lemma 2.22 says that under suitable assumptions this unit deviance d(y, μ) is zero if and only if y = μ. We consider the mean functional on F

A : F → A = M,   Fθ = F(·; θ, v/ϕ) → A(Fθ) = μ(θ),

where μ = μ(θ) = κ′(θ) is the mean of the chosen EDF. Choosing the unit deviance as loss function, we receive for any action a ∈ A, see (4.13),

Eθ[d(Y, a)] = Eθ[d(Y, μ)] + 2 Eθ[Y h(μ) − κ(h(μ)) − Y h(a) + κ(h(a))]
            = Eθ[d(Y, μ)] + 2 ( μ h(μ) − κ(h(μ)) − μ h(a) + κ(h(a)) )
            = Eθ[d(Y, μ)] + d(μ, a).

This is minimized for a = μ, and it proves that the unit deviance is strictly consistent for the mean functional A : Fθ → A(Fθ) = μ(θ) relative to the chosen EDF F = {F(·; θ, v/ϕ); θ ∈ Θ̊}. Using Theorem 4.16, the scoring rule

R : C × F → R,   (y, Fθ) → R(y, Fθ) = d(y, μ(θ)),

is a strictly proper scoring rule, that is,

Eθ[R(Y, Fθ)] = Eθ[d(Y, μ(θ))] < Eθ[d(Y, μ(θ̃))] = Eθ[R(Y, Fθ̃)],

for any θ̃ ≠ θ. We conclude from this small example that the unit deviance is a strictly consistent loss function for the mean functional on the chosen EDF, and this provides us with a strictly proper scoring rule. ■

In the above Example 4.17 we have chosen the mean functional

A : F → A = M,   Fθ = F(·; θ, v/ϕ) → A(Fθ) = μ(θ),

within a given EDF F = {F(·; θ, v/ϕ); θ ∈ Θ̊}. We have seen that
• the unit deviance d(·, ·) is a strictly consistent loss function for the mean functional A relative to the EDF F;
• the function (y, Fθ) → R(y, Fθ) = d(y, μ(θ)) is a strictly proper scoring rule for the EDF F, i.e.,

  Eθ[d(Y, μ(θ))] < Eθ[d(Y, μ(θ̃))],

  for any θ̃ ≠ θ.

The consideration of the mean functional F → A(F) = EF[Y] in Example 4.17 is motivated by the fact that we typically forecast random variables by their means. However, more generally, we may ask the question for which functionals A : F → P(A), relative to a given set of distributions F, there exists a loss function L that is strictly consistent.

Definition 4.18 (Elicitable) The functional A is elicitable relative to a given set of distributions F if there exists a loss function L that is strictly consistent for A and F.
Above we have seen that the mean functional is elicitable relative to the EDF
using the unit deviance loss; expected values relative to F with finite second
moments are also elicitable using the square loss function. Savage [327] more
generally identifies the Bregman divergences as being the only consistent scoring
functions for the mean functional; recall that the unit deviance is a special case of a
Bregman divergence, see (2.29). We are going to state the corresponding result.
For a general loss function L we make the following (standard) assumptions:
(L0) L(y, a) ≥ 0 and we have an equality if and only if y = a;
(L1) L(y, a) is measurable in y and continuous in a;
(L2) the partial derivative ∂L(y, a)/∂a exists and is continuous in a whenever
a = y.
This then allows us to cite the following theorem.
Theorem 4.19 (Gneiting [162, Theorem 7]) Let F be the class of distributions on an interval C ⊆ R having finite first moments.
• Assume the loss function L : C × A → R satisfies (L0)–(L2) for the interval C = A ⊆ R. L is consistent for the mean functional relative to the class F of compactly supported distributions on C if and only if the loss function L is of Bregman divergence form

  Dψ(y, a) = ψ(y) − ψ(a) − ψ′(a)(y − a),

  for a convex function ψ with (sub-)gradient ψ′ on C.
• If ψ is strictly convex on C, then the Bregman divergence Dψ is strictly consistent for the mean functional relative to the class F of distributions on C for which both EF[Y] and EF[ψ(Y)] exist and are finite.
Theorem 4.19 tells us that Bregman divergences are the only consistent loss functions for the mean functional (under some additional assumptions). Consider the specific choice ψ(a) = a²/2, which is a strictly convex function. For this choice, the Bregman divergence is the square loss function Dψ(y, a) = (y − a)²/2, which is strictly consistent for the mean functional relative to the class F ⊂ L²(P). We remark that also quantiles are elicitable; the corresponding result is going to be stated in Theorem 5.33, below.
The second bullet point of Theorem 4.19 immediately implies that the unit deviance d(·, ·) is a strictly consistent loss function for the mean functional within the chosen EDF, see also (2.29) and Example 4.17. In particular, for θ ∈ Θ̊

μ = μ(θ) = arg min_{a∈M} Eθ[d(Y, a)].    (4.28)

Explicit evaluation of (4.28) requires that the true distribution Fθ of Y is known. Since, typically, this is not the case, we need to evaluate it empirically. Assume that the random variables Yi are independent and Fθ distributed, with Fθ belonging to the fixed EDF providing the corresponding unit deviance d. Then, the objective function in (4.28) is approximated by, a.s.,

D(Y n, a) = (1/n) Σ_{i=1}^n (vi/ϕ) d(Yi, a) → Eθ[(v/ϕ) d(Y, a)]   as n → ∞.    (4.29)

The convergence statement follows from the strong law of large numbers applied to the i.i.d. random variables (Yi, vi), i ≥ 1, supposed that the right-hand side of (4.29) exists. Thus, the deviance loss function (4.9) is an empirical version of the expected deviance loss function, and this approach is successful if we can exchange the ‘arg min’ operator of (4.28) and the limit n → ∞ in (4.29). This closes the circle and brings us back to the M-estimator considered in Remarks 3.26 and 3.29, which also links forecast evaluation and M-estimation.
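The following Python sketch illustrates (4.28)–(4.29) in the Poisson case with simulated placeholder data: minimizing the empirical deviance loss over a grid (approximately) recovers the exposure-weighted sample mean, i.e., the MLE.

import numpy as np

rng = np.random.default_rng(seed=4)
mu_true, n = 0.2, 5_000                            # illustrative assumptions
v = rng.uniform(0.5, 1.5, size=n)                  # exposures
Y = rng.poisson(mu_true * v) / v                   # observed frequencies Y_i = N_i / v_i

def deviance_loss(a):
    # D(Y_n, a) = (1/n) sum_i v_i d(Y_i, a), Poisson unit deviance, phi = 1
    term = np.zeros_like(Y)
    pos = Y > 0
    term[pos] = Y[pos] * np.log(Y[pos] / a)
    return np.mean(v * 2.0 * (a - Y + term))

grid = np.linspace(0.05, 0.5, 2_000)
a_hat = grid[np.argmin([deviance_loss(a) for a in grid])]
print(a_hat, (Y * v).sum() / v.sum())              # argmin is close to the MLE sum_i N_i / sum_i v_i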

Forecast Dominance

A consequence of Theorem 4.19 is that there are infinitely many strictly consistent loss functions for the mean functional, and, in principle, we could choose any of these for forecast evaluation. Choosing the unit deviance d that matches the distribution Fθ of the observations Y n and Y, respectively, gives us the MLE μ̂MLE, and we have seen that this MLE is not only unbiased for μ = κ′(θ), but it also meets the Cramér–Rao information bound. That is, it is UMVU within the data generating model reflected by the true unit deviance d. This provides us (in the finite sample case) with a natural candidate for d in (4.29) and, thus, a canonical proper scoring rule for (out-of-sample) forecast evaluation.

The previous statements have all been made under the assumption that there is no uncertainty about the underlying family of distribution functions that generates Y and Y n, respectively. Uncertainty was limited to the true canonical parameter θ and the true mean μ(θ). This situation changes under model uncertainty. Krüger–Ziegel [227] study the question of having multiple strictly consistent loss functions in the situation where there is no natural candidate choice. Different choices may give different rankings to different (finite sample) predictors. Assume we have two predictors μ̂1 and μ̂2 for a random variable Y. Similarly to the definition of the expected deviance GL, we understand these predictors μ̂1 and μ̂2 as random variables, and we assume that all considered random variables have a finite first moment. Importantly, we do not assume independence between μ̂1, μ̂2 and Y; in regression models we typically receive dependence between predictors μ̂ and random variables Y through the features (covariates) x, see also Outlook 4.13. Following Krüger–Ziegel [227] and Ehm et al. [119] we define forecast dominance as follows.

Definition 4.20 (Forecast Dominance) Predictor μ̂1 dominates predictor μ̂2 if

E[Dψ(Y, μ̂1)] ≤ E[Dψ(Y, μ̂2)],

for all Bregman divergences Dψ with (convex) ψ supported on C, the latter being the convex closure of the supports of Y, μ̂1 and μ̂2.
If we work with a fixed member of the EDF, e.g., the gamma distribution, then we typically study the corresponding expected deviance GL for forecast evaluation in one single model, see Theorem 4.7 and (4.29). This evaluation may involve model risk in the decision making process, and forecast dominance provides a robust selection criterion. Krüger–Ziegel [227] build on Theorem 1b and Corollary 1b of Ehm et al. [119] to prove the following theorem (which saves us from having to consider all convex functions ψ).
Theorem 4.21 (Theorem 2.1 of Krüger–Ziegel [227]) Predictor μ̂1 dominates predictor μ̂2 if and only if for all τ ∈ C

E[(Y − τ) 1_{μ̂1 > τ}] ≥ E[(Y − τ) 1_{μ̂2 > τ}].    (4.30)

Denuit et al. [97] argue that in insurance one typically works with Tweedie's distributions having power variances V(μ) = μ^p with power variance parameters p ≥ 1. This motivates the following weaker form of forecast dominance.

Definition 4.22 (Tweedie's Forecast Dominance) Predictor μ̂1 Tweedie-dominates predictor μ̂2 if

E[dp(Y, μ̂1)] ≤ E[dp(Y, μ̂2)],

for all Tweedie's unit deviances dp with power variance parameters p ≥ 1; we refer to (4.18) for p ∈ (1, ∞) \ {2} and to Table 4.1 for the Poisson and gamma cases p ∈ {1, 2}.
Recall that Tweedie's unit deviances dp are a subclass of the Bregman divergences, see (2.29). Define the following function for power variance parameters p ≥ 1

Υp(μ) = log μ   for p = 2,   and   Υp(μ) = μ^{2−p}/(2 − p)   otherwise.

Denuit et al. [97] prove the following proposition.

Proposition 4.23 (Proposition 4.1 of Denuit et al. [97]) Predictor μ̂1 Tweedie-dominates predictor μ̂2 if

E[Υp(μ̂1)] ≤ E[Υp(μ̂2)]   for all p ≥ 1,

and

E[Y 1_{μ̂1 > τ}] ≥ E[Y 1_{μ̂2 > τ}]   for all τ ∈ C.

Theorem 4.21 gives necessary and sufficient conditions for forecast dominance; Proposition 4.23 gives sufficient conditions for the weaker Tweedie's forecast dominance. An empirical check of criterion (4.30) is sketched below. In Theorem 7.15, below, we give another characterization of forecast dominance in terms of convex orders, under the additional assumption that the predictors are so-called auto-calibrated.
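Criterion (4.30) can be checked empirically on a given portfolio. The following Python sketch (with simulated placeholder data and two illustrative predictors, not from the text) evaluates the left-hand side of (4.30) over a grid of thresholds τ; the small tolerance accounts for Monte Carlo noise:

import numpy as np

rng = np.random.default_rng(seed=5)
n = 100_000
mu = rng.gamma(shape=2.0, scale=0.5, size=n)       # 'true' individual means (assumed)
Y = rng.poisson(mu)                                 # observations
mu_hat1 = mu                                        # ideal predictor
mu_hat2 = np.full(n, mu.mean())                     # homogeneous (portfolio mean) predictor

def lhs(mu_hat, tau):
    # empirical version of E[(Y - tau) 1{mu_hat > tau}] in (4.30)
    return np.mean((Y - tau) * (mu_hat > tau))

taus = np.linspace(0.0, 5.0, 26)
dominates = all(lhs(mu_hat1, t) >= lhs(mu_hat2, t) - 1e-3 for t in taus)
print("mu_hat1 empirically dominates mu_hat2:", dominates)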

4.2 Cross-Validation

This section focuses on estimating the expected deviance GL (4.13) in cases where
the canonical parameter θ is not known. Of course, the same concepts apply to the
MSEP. In the remainder of this section we scale the unit deviances with v/ϕ, to
bring them in line with the deviance loss (4.9).

4.2.1 In-Sample and Out-of-Sample Losses

The general aim in predictive modeling is to predict an unobserved random variable Y as well as possible based on past information Y n. Within the EDF, the predictive performance is then evaluated under an empirical version of the expected deviance GL

Eθ[(v/ϕ) d(Y, A(Y n))] = 2 Eθ[ (v/ϕ) ( Y h(Y) − κ(h(Y)) − Y h(A(Y n)) + κ(h(A(Y n))) ) ].    (4.31)

Here, we no longer assume that Y and A(Y n) are independent, and in the dependent case Theorem 4.7 does not apply. The reason for dropping the independence assumption is that below we consider regression models of a similar type as in Outlook 4.13. The expected deviance GL (4.31) as such is not directly useful because it cannot be calculated if the true canonical parameter θ is not known. Therefore, we are going to explain how it can be estimated empirically.

We start from the expected deviance GL in the EDF applied to the MLE decision rule μ̂MLE(Y n). It can be rewritten as

Eθ[(v/ϕ) d(Y, μ̂MLE(Y n))] = ∫ Eθ[ (v/ϕ) d(Y, μ̂MLE(Y n)) | Y n = y n ] dP(y n; θ),    (4.32)

where we use the tower property of conditional expectations. In view of (4.32), there are two things to be done:
(1) For given observations Y n = y n, we need to estimate the deviance GL, see also (4.15),

  Eθ[ (v/ϕ) d(Y, μ̂MLE(Y n)) | Y n = y n ] = Eθ[ (v/ϕ) d(Y, μ̂MLE(y n)) | Y n = y n ].    (4.33)

  This is the part that we are going to solve empirically in this section. Typically, we assume that Y and Y n are independent; nevertheless, Y and its MLE predictor may still be dependent because we may have a predictor μ̂MLE(Y n) = μ̂MLE(Y n, x). That is, this predictor often depends on covariate information x that describes Y, an example is provided in (4.22) of Outlook 4.13, and this is different from (4.15). In that case, the decision rule A : Y × X → A is extended by an additional covariate component x ∈ X; we refer to Sect. 5.1.1, where X is introduced and discussed.
(2) We have to find a way to generate more observations Y n from P(y n; θ) in order to evaluate the outer integral in (4.32) empirically. One way to do so is the bootstrap method that is going to be discussed in Sect. 4.3, below.
We address the first problem of estimating the deviance GL given in (4.33). We do this under the assumption that Y n and Y are independent. In order to estimate (4.33) we need observations for Y. However, typically, there are no observations available for this random variable because it is only going to be observed in the future. For this reason, one uses past observations for both model fitting and the GL analysis. In order to perform this analysis in a proper way, the general paradigm is to partition the entire data into two disjoint data sets, a so-called learning data set L = {Y1, . . . , Yn} and a test data set T = {Y1†, . . . , YT†}. If we assume that all observations in L ∪ T are independent, then we receive a suitable observation Y n from the learning data set L that can be used for model fitting. The test sample T can then play the role of the unobserved random variable Y (by assumption being independent of Y n). Note that L is only used for model fitting and T is only used for the deviance GL evaluation, see Fig. 4.1.

This setup motivates to estimate the mean parameter μ with the MLE μ̂MLE_L = μ̂MLE(Y n) from the learning data L and Y n, respectively, by minimizing the deviance loss function μ → D(Y n, μ) on the learning data L, according to Corollary 4.5. Then we use this predictor μ̂MLE_L to empirically evaluate the conditional expectation in (4.33) on T. The perception used is that we (in-sample) learn a model on L and we (out-of-sample) test this model on T to see how it generalizes to unobserved variables Yt†, 1 ≤ t ≤ T, that are of a similar nature as Y.
Fig. 4.1 Partition of the entire data into the learning data set L and the test data set T

Definition 4.24 (In-Sample and Out-of-Sample Losses) The in-sample deviance loss on the learning data L = {Y1, . . . , Yn} is given by

D(L, μ̂MLE_L) = (2/n) Σ_{i=1}^n (vi/ϕ) ( Yi h(Yi) − κ(h(Yi)) − Yi h(μ̂MLE_L) + κ(h(μ̂MLE_L)) ),

with MLE μ̂MLE_L = μ̂MLE(Y n) on L.

The out-of-sample deviance loss on the test data T = {Y1†, . . . , YT†} of predictor μ̂MLE_L is

D(T, μ̂MLE_L) = (2/T) Σ_{t=1}^T (vt†/ϕ) ( Yt† h(Yt†) − κ(h(Yt†)) − Yt† h(μ̂MLE_L) + κ(h(μ̂MLE_L)) ),

where the sum runs over the test sample T having exposures v1†, . . . , vT† > 0.

For MLE we minimize the objective function (4.9); therefore, the in-sample deviance loss D(L, μ̂MLE_L) = D(Y n, μ̂MLE(Y n)) exactly corresponds to the minimal deviance loss (4.9) achieved on the learning data L, i.e., when using the MLE μ̂MLE_L = μ̂MLE(Y n). We call this in-sample because the same data L is used for parameter estimation and deviance loss calculation. Typically, this loss is biased because it uses the optimal (in-sample) parameter estimate; we also refer to Sect. 4.2.3, below.

The out-of-sample loss D(T, μ̂MLE_L) then empirically estimates the inner expectation in (4.32). This is a proper out-of-sample analysis because the test data T is disjoint from the learning data L on which the decision rule μ̂MLE_L has been trained. Note that this out-of-sample figure reflects (4.33) in the following sense. We have a portfolio of risks (Yt†, vt†), 1 ≤ t ≤ T, and (4.33) does not only reflect the calculation of the deviance GL of a given risk, but also the random selection of a risk from the portfolio. In this sense, (4.33) is an average over a given portfolio whose description is also included in the probability Pθ. A minimal numerical sketch of these two losses is given below.
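The following is a minimal Python sketch of Definition 4.24 in the homogeneous Poisson case, using simulated placeholder data (the parameters are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(seed=6)
mu_true, n, T = 0.1, 8_000, 2_000
v_L, v_T = rng.uniform(0.5, 1.0, n), rng.uniform(0.5, 1.0, T)
Y_L = rng.poisson(mu_true * v_L) / v_L             # learning data L
Y_T = rng.poisson(mu_true * v_T) / v_T             # test data T

def poisson_deviance_loss(Y, v, mu):
    # (1/m) sum_i v_i d(Y_i, mu), Poisson unit deviance, phi = 1
    term = np.zeros_like(Y)
    pos = Y > 0
    term[pos] = Y[pos] * np.log(Y[pos] / mu)
    return np.mean(v * 2.0 * (mu - Y + term))

mu_mle = (Y_L * v_L).sum() / v_L.sum()             # MLE on L
print("in-sample loss D(L, mu_MLE):     ", poisson_deviance_loss(Y_L, v_L, mu_mle))
print("out-of-sample loss D(T, mu_MLE): ", poisson_deviance_loss(Y_T, v_T, mu_mle))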

Summary 4.25 Definition 4.24 gives the general principle in predictive modeling according to which model learning and the generalization analysis are done. Namely, based on two disjoint and independent data sets L and T, we perform model calibration on L, and we analyze (conditional) GLs (using out-of-sample losses) on T, respectively. For this concept to be useful, the learning data L and the test data T have to be sufficiently similar, i.e., ideally coming from the same model.

This approach does not estimate the outer expectation in the expected deviance GL (4.32), i.e., it is only an estimate for the deviance GL, given Y n, see (4.33).

4.2.2 Cross-Validation Techniques

In many applications one is not in the comfortable situation of having two sufficiently large data sets L and T available to support model learning and an out-of-sample generalization analysis. That is, we are usually equipped with only one data set of average size, let us call it D. In order to calculate the objects in Definition 4.24 we could partition this data set (at random) into two data sets and then calculate the in-sample and out-of-sample deviance losses on this partition. The disadvantage of this approach is that it is an inefficient use of information if only little data is available: in that case we require (almost) all data for learning, but we still need a sufficiently large share of data for testing to receive reliable deviance GL estimates for (4.33). The classical approach in this situation is to use cross-validation for estimating out-of-sample losses. The concept works as follows:

1. Perform model learning and the in-sample loss calculation D(L, μ̂MLE_L) on all available data L = D, i.e., this part is not affected by selecting test data T and it is not touched by cross-validation.
2. For the out-of-sample deviance loss calculation use the data D iteratively in an efficient way such that part of the data is used for model learning and the other part for the out-of-sample generalization analysis. This second step is (only) done for estimating the deviance GL of the model learned on all data. That is, for prediction we work with the MLE μ̂MLE_{L=D}, but the out-of-sample deviance loss is estimated using this data in a different way.

The three most commonly used methods are leave-one-out, K-fold and stratified
K-fold cross-validation. We briefly describe these three cross-validation methods.

Leave-One-Out Cross-Validation

Denote all available data by D = {Y1, . . . , Yn}, and assume independence between the components. For leave-one-out (loo) cross-validation we select 1 ≤ i ≤ n and define the partition L(−i) = D \ {Yi} for the learning data and Ti = {Yi} for the test data. Based on the learning data L(−i) we calculate the MLE

μ̂(−i) := μ̂MLE_{L(−i)},

which is based on all data except observation Yi. This observation is now used to do an out-of-sample analysis, and averaging this over all 1 ≤ i ≤ n we receive the leave-one-out cross-validation loss

D̂loo = (1/n) Σ_{i=1}^n (vi/ϕ) d(Yi, μ̂(−i)) = (1/n) Σ_{i=1}^n D(Ti, μ̂(−i))    (4.34)
     = (2/n) Σ_{i=1}^n (vi/ϕ) ( Yi h(Yi) − κ(h(Yi)) − Yi h(μ̂(−i)) + κ(h(μ̂(−i))) ),

where D(Ti, μ̂(−i)) is the (out-of-sample) cross-validation loss on Ti = {Yi} using the predictor μ̂(−i). This leave-one-out cross-validation loss D̂loo is now used as an estimate for the out-of-sample deviance loss D(T, μ̂MLE_L). Leave-one-out cross-validation uses all data D for learning and testing, namely, the data D is partitioned into a learning set L(−i) for (partial) learning and a test set Ti = {Yi} for an out-of-sample generalization analysis. This is done for all instances 1 ≤ i ≤ n, and the out-of-sample loss is estimated by the resulting average cross-validation loss. This averaging allows us to not only understand (4.34) as a conditional out-of-sample loss in the spirit of Definition 4.24; the outer empirical average in (4.34) also makes it suitable for an expected deviance GL estimate according to (4.32).
The variance of this empirical deviance GL is given by (subject to existence)

Varθ(D̂loo) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n Covθ( (vi/ϕ) d(Yi, μ̂(−i)), (vj/ϕ) d(Yj, μ̂(−j)) ).

Fig. 4.2 Partitions of K-fold cross-validation for K = 5

These covariances use exactly the same observations on D \ {Yi, Yj}; therefore, there are strong correlations between the estimators μ̂(−i) and μ̂(−j). In addition, leave-one-out cross-validation is often computationally not feasible because it requires fitting the model n times, which in the situation of complex models and of large insurance portfolios can be too demanding. We come back to this in Sect. 5.6 where we provide the generalized cross-validation (GCV) loss approximation within generalized linear models (GLMs).

K-Fold Cross-Validation

Choose a fixed integer K ≥ 2 and partition the entire data D at random into K disjoint subsets (called folds) L1, . . . , LK of approximately the same size. The learning data for fixed 1 ≤ k ≤ K is then defined by L[−k] = D \ Lk and the test data by Tk = Lk, see Fig. 4.2. Based on the learning data L[−k] we calculate the MLE

μ̂[−k] := μ̂MLE_{L[−k]},

which is based on all data except Tk. These observations are now used to do an (out-of-sample) cross-validation analysis, and averaging this over all 1 ≤ k ≤ K we receive the K-fold cross-validation (CV) loss


D̂CV = (1/K) Σ_{k=1}^K D(Tk, μ̂[−k])
    = (1/K) Σ_{k=1}^K (1/|Tk|) Σ_{Yi∈Tk} (vi/ϕ) d(Yi, μ̂[−k])    (4.35)
    ≈ (1/n) Σ_{k=1}^K Σ_{Yi∈Tk} (vi/ϕ) d(Yi, μ̂[−k]).

The last step is an approximation because not all Tk may have exactly the same sample size if n is not a multiple of K. We can understand (4.35) not only as a conditional out-of-sample loss estimate in the spirit of Definition 4.24; the outer empirical average in (4.35) also makes it suitable for an expected deviance GL estimate according to (4.32). The variance of this empirical deviance GL is given by (subject to existence)

Varθ(D̂CV) ≈ (1/n²) Σ_{k,l=1}^K Σ_{Yi∈Tk} Σ_{Yj∈Tl} Covθ( (vi/ϕ) d(Yi, μ̂[−k]), (vj/ϕ) d(Yj, μ̂[−l]) ).

Typically, in applications, one uses K-fold cross-validation with K = 10.
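The following is a minimal Python sketch of the K-fold cross-validation loss (4.35) for the homogeneous Poisson model, with simulated placeholder data and K = 10:

import numpy as np

rng = np.random.default_rng(seed=7)
mu_true, n, K = 0.1, 10_000, 10                    # illustrative assumptions
v = rng.uniform(0.5, 1.0, n)
Y = rng.poisson(mu_true * v) / v

def poisson_deviance(Y, v, mu):
    # v_i d(Y_i, mu) per observation, Poisson unit deviance, phi = 1
    term = np.zeros_like(Y)
    pos = Y > 0
    term[pos] = Y[pos] * np.log(Y[pos] / mu)
    return v * 2.0 * (mu - Y + term)

folds = rng.permutation(n) % K                      # random partition into K folds
cv_losses = []
for k in range(K):
    train, test = folds != k, folds == k
    mu_k = (Y[train] * v[train]).sum() / v[train].sum()   # MLE on L_[-k]
    cv_losses.append(np.mean(poisson_deviance(Y[test], v[test], mu_k)))
D_cv = np.mean(cv_losses)
print("K-fold CV loss:", D_cv)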

Stratified K-Fold Cross-Validation

A disadvantage of the above K-fold cross-validation is that it may happen that there are two outliers in the data, and there is a positive probability that these two outliers belong to the same subset Lk. This may substantially distort K-fold cross-validation because in that case the subsets Lk, 1 ≤ k ≤ K, are of different quality. Stratified K-fold cross-validation aims at distributing the outliers more equally across the partition. Order the observations Yi, 1 ≤ i ≤ n, as follows

Y(1) ≥ Y(2) ≥ . . . ≥ Y(n).

For stratified K-fold cross-validation, we randomly distribute (partition) the K biggest claims Y(1), . . . , Y(K) to the subsets Lk, 1 ≤ k ≤ K, then we randomly partition the next K biggest claims Y(K+1), . . . , Y(2K) to the subsets Lk, 1 ≤ k ≤ K, and so forth. This implies, e.g., that the two biggest claims cannot fall into the same set Lk. This stratified partition Lk, 1 ≤ k ≤ K, is then used for K-fold cross-validation; a small sketch of this partition is given below.
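The stratified partition just described can be sketched as follows; the Python helper function stratified_folds is our own illustration, not from the text:

import numpy as np

def stratified_folds(Y, K, rng):
    # sort decreasingly and randomly distribute each consecutive block of K
    # claims across the K folds, so the largest claims cannot share a fold
    order = np.argsort(-Y)                 # indices of Y_(1) >= Y_(2) >= ... >= Y_(n)
    folds = np.empty(len(Y), dtype=int)
    for start in range(0, len(Y), K):
        block = order[start:start + K]     # next K largest claims
        folds[block] = rng.permutation(K)[:len(block)]
    return folds                           # folds[i] = k means Y_i belongs to fold k

rng = np.random.default_rng(seed=8)
Y = rng.gamma(2.0, 1.0, size=23)
folds = stratified_folds(Y, K=5, rng=rng)
print(np.bincount(folds))                  # fold sizes differ by at most one
# the two largest claims land in different folds by construction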

Summary 4.26 (Cross-Validation)
• A model is calibrated on the learning data set L by minimizing the in-sample deviance loss D(L, μ) in μ. This provides the MLE μ̂MLE_L.
• The quality of this model is assessed on test data T being disjoint of L, considering the corresponding out-of-sample deviance loss D(T, μ̂MLE_L).
• If there is no test data set T available we perform (stratified) K-fold cross-validation. This provides the (stratified) K-fold cross-validation loss D̂CV which is an estimate for the out-of-sample deviance loss and for the expected deviance GL (4.32).

Example 4.27 (Out-of-Sample Deviance Loss Estimation) We consider a claim counts example using the Poisson EDF model. The claim counts Ni and exposures vi > 0 used come from the French motor insurance data given in Listing 13.2 of Chap. 13.1. We model the claim frequencies Yi = Ni/vi with the Poisson EDF model having cumulant function κ(θ) = exp{θ} and dispersion parameter ϕ = 1 for all 1 ≤ i ≤ n. The expected frequency is given by μ = Eθ[Yi] = κ′(θ). Moreover, we assume that all claim counts Ni, 1 ≤ i ≤ n, are independent. This provides us with the Poisson deviance loss function for observations Y n = (Y1, . . . , Yn)′, see Example 4.12,

D(Y n, μ) = (1/n) Σ_{i=1}^n vi d(Yi, μ) = (1/n) Σ_{i=1}^n 2vi ( μ − Yi − Yi log(μ/Yi) )
          = (1/n) Σ_{i=1}^n 2 ( vi μ − Ni − Ni log(vi μ/Ni) ) ≥ 0,

where, for Yi = 0, we set d(Yi = 0, μ) = 2μ. Minimizing the Poisson deviance loss function D(Y n, μ) in μ gives us the MLE for μ and θ = h(μ), respectively. It is given by, see (3.24),

μ̂MLE = μ̂MLE_L = Σ_{i=1}^n Ni / Σ_{i=1}^n vi = 7.36%,

for the learning data set L = {Y1, . . . , Yn}. This provides us with an in-sample Poisson deviance loss of D(Y n, μ̂MLE_L) = D(L, μ̂MLE_L) = 25.213 · 10^{−2}.
Since we do not have test data T , we explore tenfold cross-validation. We
therefore partition the entire data at random into K = 10 disjoint sets L1 , . . . , L10 ,
and compute the tenfold cross-validation loss as described in (4.35). This gives us
CV = 25.213 · 10−2 , thus, we receive the same value as for the in-sample loss
D
which says that we do not have in-sample over-fitting, here. This is not surprising
4.2 Cross-Validation 103

in the homogeneous model λ = Eθ [Yi ]. We can also quantify the uncertainty in this
estimate by the corresponding empirical standard deviation for Tk = Lk
/
0
0 1  
K
 
1 CV 2 = 0.234 · 10−2 .
μ[−k] − D
D Tk , (4.36)
K −1
k=1

This says that there is quite some fluctuation in the data because uncertainty in
CV = 25.213 · 10−2 is roughly 1%. This finishes this example, and we
estimate D
will come back to it in Sect. 5.2.4, below. 
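A minimal Python sketch of this computation (illustrative names; counts and expo play the roles of the Ni and vi; for equally sized folds the average of the fold losses matches the overall average in (4.35)):

import numpy as np

def poisson_deviance(counts, expo, mu):
    # (2/n) * sum_i ( v_i*mu - N_i - N_i*log(v_i*mu/N_i) ), with d(Y_i = 0, mu) = 2*mu
    counts = np.asarray(counts, dtype=float)
    expo = np.asarray(expo, dtype=float)
    dev = expo * mu - counts
    pos = counts > 0
    dev[pos] -= counts[pos] * np.log(expo[pos] * mu / counts[pos])
    return 2 * dev.mean()

def tenfold_cv(counts, expo, K=10, seed=0):
    counts = np.asarray(counts, dtype=float)
    expo = np.asarray(expo, dtype=float)
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, K, size=len(counts))            # random partition into K sets
    losses = []
    for k in range(K):
        train, test = fold != k, fold == k
        mu_hat = counts[train].sum() / expo[train].sum()   # homogeneous MLE, see (3.24)
        losses.append(poisson_deviance(counts[test], expo[test], mu_hat))
    losses = np.array(losses)
    return losses.mean(), losses.std(ddof=1)               # CV loss and std. dev. (4.36)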

4.2.3 Akaike’s Information Criterion

The out-of-sample analysis in terms of GLs and cross-validation evaluates the predictive performance on unseen data. Another way of model selection is to study in-sample losses instead, but to penalize model complexity. Akaike's information criterion (AIC), see Akaike [5], is the most popular tool that follows such a model selection methodology. AIC is based on a set of assumptions which need to be fulfilled for it to apply; this is going to be discussed in this section, and we therefore follow the lecture notes of Künsch [229].
Assume we have independent random variables Yi from some (unknown) density
f . Assume we have two candidate models with densities hθ and gϑ from which we
would like to select the preferred one for the given data Y n = (Y1 , . . . , Yn ). The two
unknown parameters in these densities hθ and gϑ are called θ and ϑ, respectively.
We neither assume that one of the two models hθ and gϑ contains the true model f ,
nor that the two models are nested. That is, f , hθ and gϑ are quite general densities
w.r.t. a given σ -finite measure ν.
Assume that both models under consideration have a unique MLE θ̂MLE = θ̂MLE(Y n) and ϑ̂MLE = ϑ̂MLE(Y n), which are based on the same observations Y n. AIC [5] says that model hθ̂MLE should be preferred over model gϑ̂MLE if

−2 Σ_{i=1}^n log hθ̂MLE(Yi) + 2 dim(θ) < −2 Σ_{i=1}^n log gϑ̂MLE(Yi) + 2 dim(ϑ),   (4.37)

where dim(·) denotes the dimension of the corresponding parameter. Thus, we compute the log-likelihoods of the data Y n in the corresponding MLEs θ̂MLE and ϑ̂MLE, and we penalize the resulting values with the number of parameters to correct for model complexity. We give some remarks.

Remarks 4.28
• AIC is neither an in-sample loss nor an out-of-sample loss to measure gen-
eralization accuracy, but it considers penalized log-likelihoods. Under certain
assumptions one can prove that asymptotically minimizing AICs is equivalent
to minimizing leave-one-out cross-validation mean squared errors.
• The two penalized log-likelihoods have to be evaluated on the same data Y n and they need to consider the MLEs θ̂MLE and ϑ̂MLE because the justification of AIC is based on the asymptotic normality of MLEs; otherwise there is no mathematical justification why (4.37) should be a reasonable model selection tool.
• AIC does not require (but allows for) nested models hθ and gϑ, nor do they need to be Gaussian; AIC is only based on asymptotic normality. We give a heuristic argument below.
• Evaluation of (4.37) involves all terms of the log-likelihoods, also those that do
not depend on the parameters θ and ϑ.
• Both models should consider the data Y n in the same units, i.e., AIC does not apply if hθ is a density for Yi and gϑ is a density for cYi. In that case, one has to perform a transformation of variables to ensure that both densities consider the data in the same units. We briefly highlight this by considering a Gaussian example. We choose i.i.d. observations Yi ∼ N(θ, σ²) for known variance σ² > 0. For c > 0, we have cYi ∼ N(ϑ = cθ, c²σ²). We obtain the MLE θ̂MLE = Σ_{i=1}^n Yi/n and log-likelihood in the MLE θ̂MLE

Σ_{i=1}^n log hθ̂MLE(Yi) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (Yi − θ̂MLE)².

On the transformed scale we have the MLE ϑ̂MLE = Σ_{i=1}^n cYi/n = c θ̂MLE and log-likelihood in the MLE ϑ̂MLE

Σ_{i=1}^n log gϑ̂MLE(cYi) = −(n/2) log(2πc²σ²) − (1/(2c²σ²)) Σ_{i=1}^n (cYi − c θ̂MLE)².

Thus, we find that the two log-likelihoods differ by −n log(c), but we consider the same model only under different measurement units of the data. The same applies when we work, e.g., with a log-normal model or logged data in a Gaussian model.

We give a heuristic justification of AIC. In Example 3.10 we have seen that the MLE is obtained by minimizing the KL divergence from hθ to the empirical distribution f̂n of Y n. This motivates using the KL divergence also for comparing

the MLE estimated models to the true model, i.e., we consider the difference (supposed the densities are defined on the same domain)

DKL(f‖hθ̂MLE(·)) − DKL(f‖gϑ̂MLE(·))
= ∫ log( f(y)/hθ̂MLE(y) ) f(y) dν(y) − ∫ log( f(y)/gϑ̂MLE(y) ) f(y) dν(y)
= ∫ log( gϑ̂MLE(y) ) f(y) dν(y) − ∫ log( hθ̂MLE(y) ) f(y) dν(y).   (4.38)

If this difference is negative, model hθ̂MLE should be preferred over model gϑ̂MLE because it is closer to the true model f w.r.t. the KL divergence. Thus, we need to calculate the two integrals in (4.38). Since the true density f is not known, these two integrals need to be estimated.
As a first idea we estimate the integrals on the right-hand side empirically using the observations Y n, say, the first integral is estimated by

(1/n) Σ_{i=1}^n log gϑ̂MLE(Yi).

However, this will lead to a biased estimate because the MLE ϑ̂MLE exactly maximizes this empirical estimate (as a function of ϑ). The integrals in (4.38), on the other hand, can be interpreted as an out-of-sample calculation between independent random variables Y n (used for MLE) and Y ∼ f dν used in the integral. The bias results from the fact that in the empirical estimate the independence gets lost. Therefore, we need to correct this estimate for the bias in order to obtain a reasonable estimate for the difference of the KL divergences. Under the following assumptions this bias correction is asymptotically given by −dim(ϑ)/n: (1) √n (ϑ̂MLE(Y n) − ϑ0) is asymptotically normally distributed N(0, Λ(ϑ0)^{−1}) as n → ∞, where ϑ0 is the parameter that minimizes the KL divergence from gϑ to f; we also refer to Remarks 3.26. (2) The true f is sufficiently close to gϑ0 such that the Ef-covariance matrix of the score ∇ϑ log gϑ0 is close to the negative Ef-expected Hessian ∇²ϑ log gϑ0; see also (3.36) and Sect. 11.1.4, below. In that case, Λ(ϑ0) approximately corresponds to Fisher's information matrix I1(ϑ0) and AIC is justified.
This shows that AIC applies if both models are evaluated under the same
observations Y n , the models need to use the MLEs, and asymptotic normality needs
to hold with limits such that the true model is close to a member of the selected
model classes {hθ ; θ } and {gϑ ; ϑ}. We remark that this is not the only set-up under
which AIC can be justified, but other set-ups do not essentially differ.
The Bayesian information criterion (BIC) is similar to AIC but in a Bayesian context. The BIC says that model hθ̂MLE should be preferred over model gϑ̂MLE if

−2 Σ_{i=1}^n log hθ̂MLE(Yi) + log(n) dim(θ) < −2 Σ_{i=1}^n log gϑ̂MLE(Yi) + log(n) dim(ϑ),

where n is the sample size of Y n used for model fitting. The BIC has been derived
by Schwarz [331]. Therefore, it is also called Schwarz’ information criterion (SIC).
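As an illustration of (4.37) and the BIC, a minimal Python sketch using SciPy (the exponential and gamma candidates and the toy data are hypothetical; both log-likelihoods are evaluated on the same data and include all constant terms):

import numpy as np
from scipy import stats

def aic_bic(loglik, n_params, n):
    return -2 * loglik + 2 * n_params, -2 * loglik + np.log(n) * n_params

y = np.random.default_rng(1).gamma(shape=2.0, scale=500.0, size=1000)  # toy claim sizes

# candidate 1: exponential model (one free parameter, location fixed at 0)
expon_pars = stats.expon.fit(y, floc=0)
ll_expon = stats.expon.logpdf(y, *expon_pars).sum()

# candidate 2: gamma model (two free parameters, location fixed at 0)
gamma_pars = stats.gamma.fit(y, floc=0)
ll_gamma = stats.gamma.logpdf(y, *gamma_pars).sum()

print("exponential AIC/BIC:", aic_bic(ll_expon, 1, len(y)))
print("gamma       AIC/BIC:", aic_bic(ll_gamma, 2, len(y)))
# the model with the smaller AIC (BIC) value is preferred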

4.3 Bootstrap

The bootstrap method has been invented by Efron [115] and Efron–Tibshirani [118]. The bootstrap is used to simulate new data from either the empirical distribution F̂n or from an estimated model F(·; θ̂). This allows, for instance, to evaluate the outer expectation in the expected deviance GL (4.32) which requires a data model for Y n. The presentation in this section is based on the lecture notes of Bühlmann–Mächler [59, Chapter 5].

4.3.1 Non-parametric Bootstrap Simulation

Assume we have i.i.d. observations Y1, . . . , Yn from an unknown distribution function F(·; θ). Based on these observations Y = (Y1, . . . , Yn)⊤ we choose a decision rule A : Y → A = Θ ⊆ R which provides us with an estimator for θ

Y → θ̂ = A(Y).   (4.39)

Typically, the decision rule A(·) is a known function and we would like to determine the distributional properties of the parameter estimator (4.39) as a function of the (random) observations Y. E.g., for any measurable set C, we might want to compute

Pθ[θ̂ ∈ C] = Pθ[A(Y) ∈ C] = ∫ 1{A(y)∈C} dP(y; θ).   (4.40)

Since, typically, the true data generating distribution Yi ∼ F(·; θ) is not known, the distributional properties of θ̂ cannot be determined, also not by Monte Carlo simulation. The idea behind the bootstrap is to approximate F(·; θ). Choose as approximation to F(·; θ) the empirical distribution of the i.i.d. observations Y given by, see (3.9),

F̂n(y) = (1/n) Σ_{i=1}^n 1{Yi ≤ y}   for y ∈ R.

The Glivenko–Cantelli theorem [64, 159] tells us that the empirical distribution F̂n converges uniformly to F(·; θ), a.s., for n → ∞, so it should be a good approximation to F(·; θ) for large n. The idea now is to simulate from the empirical distribution F̂n.

(Non-parametric) bootstrap algorithm

(1) Repeat for m = 1, . . . , M:
(a) simulate i.i.d. observations Y1∗, . . . , Yn∗ from F̂n (these are obtained by random drawings with replacements from the observations Y1, . . . , Yn; we denote this resampling distribution of Y∗ = (Y1∗, . . . , Yn∗)⊤ by P∗ = P∗Y);
(b) calculate the estimator θ̂(m∗) = A(Y∗).
(2) Return θ̂(1∗), . . . , θ̂(M∗) and the resulting empirical bootstrap distribution

F̂∗M(ϑ) = (1/M) Σ_{m=1}^M 1{θ̂(m∗) ≤ ϑ},

for the estimated distribution of θ̂.

We can use the empirical bootstrap distribution F̂∗M as an estimate of the true distribution of θ̂, that is, we estimate and approximate

Pθ[θ̂ ∈ C] ≈ P̂θ[θ̂ ∈ C] def.= P∗Y[θ̂∗ ∈ C] ≈ (1/M) Σ_{m=1}^M 1{θ̂(m∗) ∈ C},   (4.41)

where P∗Y corresponds to the bootstrap distribution of Step (1a) of the above algorithm, and where we set θ̂∗ = A(Y∗). This bootstrap distribution P∗Y is empirically approximated by the empirical bootstrap distribution F̂∗M for studying θ̂∗.
Remarks 4.29
• The quality of the approximations in (4.41) depends on the richness of the observation Y = (Y1, . . . , Yn)⊤, because the bootstrap distribution

P∗Y[θ̂∗ ∈ C] = P∗Y=y[θ̂∗ ∈ C]

depends on the realization y of the data Y from which we generate the bootstrap sample Y∗. It also depends on M and the explicit random drawings Yi∗ providing the empirical bootstrap distribution F̂∗M. The latter uncertainty can be controlled since the bootstrap distribution P∗Y corresponds to a multinomial distribution, and the Glivenko–Cantelli theorem [64, 159] applies to F̂∗M and P∗Y for M → ∞. The former uncertainty inherited from the realization Y = y cannot be diminished because we cannot enrich the observation Y.

• The empirical bootstrap distribution F̂∗M can be used to estimate the mean of the estimator θ̂ given in (4.39),

E∗Y[θ̂∗] ≈ (1/M) Σ_{m=1}^M θ̂(m∗),

and its variance

Var(θ̂∗) = VarP∗Y(θ̂∗) ≈ (1/(M−1)) Σ_{m=1}^M ( θ̂(m∗) − (1/M) Σ_{k=1}^M θ̂(k∗) )².

• The previous item discusses the approximation of the bootstrap mean and variance, respectively. Bootstrap intervals for coverage ratios need some care, and there are different versions. The naive way of just calculating quantiles from F̂∗M often does not work well, and methods like a double bootstrap may need to be considered.
• In (4.39) we have assumed that the quantity of interest is the parameter θ, but similar considerations also apply to general decision rules estimating γ(θ).
• The bootstrap as defined above directly acts on the observations Y1, . . . , Yn, and the basic assumption is that these observations are i.i.d. If this is not the case, one may first need to transform the observations, for instance, one can calculate residuals and assume that these residuals are i.i.d. In more complicated cases, one even drops the i.i.d. assumption and replaces it by an identical mean and variance assumption, that is, all residuals are assumed to be independent, centered and with unit variance. This is sometimes also called residual bootstrap and it may be suitable in regression models as will be introduced below. Thus, in this latter case we estimate for each observation Yi its mean μ̂i and its standard deviation σ̂i, for instance, using the variance function of the chosen EDF. This then allows for calculating the residuals ε̂i = (Yi − μ̂i)/σ̂i. For the residual bootstrap we resample the residuals εi∗ from ε̂1, . . . , ε̂n. This provides bootstrap observations

Yi∗ = μ̂i + σ̂i εi∗.

The wild bootstrap proposed by Wu [386] additionally uses a centered and normalized i.i.d. random variable Vi (also being independent of εi∗) to modify the residual bootstrap observations to

Yi∗ = μ̂i + σ̂i Vi εi∗;

a minimal sketch of both resampling schemes is given after these remarks.
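A minimal sketch of the residual and the wild bootstrap in Python (assuming fitted means mu_hat and standard deviations sigma_hat are given; the Rademacher choice Vi ∈ {−1, +1} with equal probability is one common centered, unit-variance normalization):

import numpy as np

def residual_bootstrap(y, mu_hat, sigma_hat, wild=False, seed=0):
    # one residual (or wild) bootstrap sample Y* = mu_hat + sigma_hat * eps*
    rng = np.random.default_rng(seed)
    eps_hat = (y - mu_hat) / sigma_hat                      # fitted residuals
    eps_star = rng.choice(eps_hat, size=len(y), replace=True)
    if wild:
        v = rng.choice([-1.0, 1.0], size=len(y))            # centered, unit-variance V_i
        eps_star = v * eps_star
    return mu_hat + sigma_hat * eps_star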

The bootstrap is called consistent for θ̂ if we have for all z ∈ R the following convergence in probability as n → ∞

Pθ[ √n (θ̂ − θ) ≤ z ] − P∗Y[ √n (θ̂∗ − θ̂) ≤ z ] → 0   in probability;

the quantities θ̂ = θ̂n and θ̂∗ = θ̂n∗ depend on (the size n of) the observation Y = Y n; the convergence in probability is needed because Y = Y n are random vectors. Assume that θ̂MLE = θ̂ is the MLE of θ satisfying the assumptions of Theorem 3.28. Then we have asymptotic normality, see (3.30),

√n (θ̂ − θ) ⇒ N(0, I1(θ)^{−1})   as n → ∞,

with Fisher's information I1(θ). Bootstrap consistency then requires

√n (θ̂∗ − θ̂) ⇒ N(0, I1(θ)^{−1})   under P∗Y, in probability as n → ∞.

Bootstrap consistency typically holds if θ̂ is asymptotically normal (as n → ∞) and if the underlying data Yi is i.i.d. Moreover, bootstrap consistency usually implies consistent variance and bias estimation

VarP∗Y(θ̂∗) / Varθ(θ̂) → 1   and   ( E∗Y[θ̂∗] − θ̂ ) / ( Eθ[θ̂] − θ ) → 1   in probability as n → ∞.

For more information and bootstrap confidence intervals we refer to Chapter 5 in


the lecture notes of Bühlmann–Mächler [59].

4.3.2 Parametric Bootstrap Simulation

For the parametric bootstrap we assume to know the parametric family F = {F(·; θ); θ ∈ Θ} from which the i.i.d. observations Y1, . . . , Yn ∼ F(·; θ) have been generated, and only the explicit choice of the parameter θ ∈ Θ is not known. Based on these observations we construct an estimator θ̂ = A(Y) for the unknown parameter θ ∈ Θ.

(Parametric) bootstrap algorithm

(1) Repeat for m = 1, . . . , M:
(a) simulate i.i.d. observations Y1∗, . . . , Yn∗ from F(·; θ̂) (we denote the resampling distribution of Y∗ = (Y1∗, . . . , Yn∗)⊤ by P∗ = P∗Y);
(b) calculate the estimator θ̂(m∗) = A(Y∗).

(2) Return θ̂(1∗), . . . , θ̂(M∗) and the resulting empirical bootstrap distribution

F̂∗M(ϑ) = (1/M) Σ_{m=1}^M 1{θ̂(m∗) ≤ ϑ}.

We then estimate and approximate the distribution of θ̂ analogously to (4.41), and the same remarks apply as for the non-parametric bootstrap. The parametric bootstrap has the advantage that it can enrich the data by sampling new observations from the distribution F(·; θ̂). A shortfall of the parametric bootstrap occurs if the family F is misspecified: then the bootstrap sample Y∗ will only poorly describe the true data Y, e.g., if the data shows over-dispersion but the selected family F does not allow for modeling such over-dispersion.
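A minimal parametric bootstrap sketch in Python, assuming for illustration a Poisson family F whose rate MLE plays the role of θ̂:

import numpy as np

def parametric_bootstrap_poisson(y, M=1000, seed=0):
    # bootstrap replicates of the Poisson rate MLE by simulating from F(.; theta_hat)
    rng = np.random.default_rng(seed)
    lam_hat = np.mean(y)                                   # MLE of the Poisson rate
    y_star = rng.poisson(lam_hat, size=(M, len(y)))        # M parametric bootstrap samples
    return y_star.mean(axis=1)                             # theta_hat^(m*), m = 1, ..., M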

Chapter 5
Generalized Linear Models

Most of the theory in the previous chapters has been based on the assumption of
having similarity (or homogeneity) between the different observations. This was
expressed by making an i.i.d. assumption on the observations, see, e.g., Sect. 3.3.2.
In many practical applications such a homogeneity assumption is not reasonable,
one may for example think of car insurance pricing where different car drivers have
different driving experience and they drive different cars, or of health insurance
where policyholders may have different genders and ages. Figure 5.1 shows a
health insurance example where the claim sizes depend on the gender and the
age of the policyholders. The most popular statistical models that are able to
cope with such heterogeneous data are the generalized linear models (GLMs). The
notion of GLMs has been introduced in the seminal work of Nelder–Wedderburn
[283] in 1972. Their work has introduced a unified procedure for modeling and
fitting distributions within the EDF to data having systematic differences (effects)
that can be described by explanatory variables. Today, GLMs are the state-of-the-
art statistical models in many applied fields including statistics, actuarial science
and economics. However, the specific use of GLMs in the different fields may
substantially differ. In fields like actuarial science these models are mainly used for
predictive modeling, in other fields like economics or social sciences GLMs have
become the main tool in exploring and explaining (hopefully) causal relations. For
a discussion on “predicting” versus “explaining” we refer to Shmueli [338].
It is difficult to give a good list of references for GLMs, since GLMs and their
offsprings are present in almost every statistical modeling publication and in every
lecture on statistics. Classical statistical references are the books of McCullagh–
Nelder [265], Fahrmeir–Tutz [123] and Dobson [107], in the actuarial literature we
mention the textbooks (in alphabetical order) of Charpentier [67], De Jong–Heller
[89], Denuit et al. [99–101], Frees [134] and Ohlsson–Johansson [290], but this list
is far from being complete.


[Fig. 5.1 Claim sizes in health insurance as a function of the age of the policyholder, and split by gender (female/male); x-axis: age of policyholder, y-axis: claim size]

In this chapter we introduce and discuss GLMs in the context of actuarial


modeling. We do this in such a way that GLMs can be seen as a building block of
network regression models which will be the main topic of Chap. 7 on deep learning.

5.1 Generalized Linear Models and Log-Likelihoods

5.1.1 Regression Modeling

We start by assuming that we have independent random variables Y1, . . . , Yn which are described by a fixed member of the EDF. That is, we assume that all Yi are independent and have densities w.r.t. a σ-finite measure ν on R given by

Yi ∼ f(yi; θi, vi/ϕ) = exp{ (yi θi − κ(θi))/(ϕ/vi) + a(yi; vi/ϕ) }   for 1 ≤ i ≤ n,   (5.1)

with canonical parameters θi ∈ Θ̊, exposures vi > 0 and dispersion parameter ϕ > 0. Throughout, we assume that the effective domain Θ has a non-empty interior. There is a fundamental difference between (5.1) and Example 3.5. We now allow every random variable Yi to have its own canonical parameter θi ∈ Θ̊. We call this a heterogeneous situation because the observations are allowed to differ in a systematic way expressed by different canonical parameters. This is highlighted by the lines in the health insurance example of Fig. 5.1 where (expected) claim sizes differ by gender and age of policyholder.
In Sect. 4.1.2 we have introduced the saturated model where every observation Yi has its own parameter θi. In general, if we have n observations Y = (Y1, . . . , Yn)⊤ we can estimate at most n parameters. The other extreme case is the homogeneous one, meaning that θi = θ ∈ Θ̊ for all 1 ≤ i ≤ n. In this latter case we have exactly one parameter to estimate, and we call this model null model, intercept model or homogeneous model, because all components of Y are assumed to follow the

same law expressed in a single common parameter θ . Both the saturated model and
the null model may behave very poorly in predicting new observations. Typically,
the saturated model fully reflects the data Y including the noisy part (random
component, irreducible risk, see Remarks 4.2) and, therefore, it is not useful for
prediction. We also say that this model (in-sample) over-fits to the data Y and
does not generalize (out-of-sample) to new data. The null model often has a poor
predictive performance because if the data has systematic effects these cannot be
captured by a null model. GLMs try to find a good balance between these two
extreme cases, by trying to extract (only) the systematic effects from noisy data
Y . We therefore model the canonical parameters θi as a low-dimensional function
of explanatory variables which capture the systematic effects in the data. In Fig. 5.1
gender and age of policyholder play the role of such explanatory variables.
Assume that each observation Yi is equipped with a feature (explanatory variable,
covariate) x i that belongs to a fixed given feature space X . These features x i
are assumed to describe the systematic effects in the observations Yi , i.e., these
features are assumed to be appropriate descriptions of the heterogeneity between the
observations. In a nutshell, we then assume that there is a suitable regression function

θ : X → Θ̊,   x → θ(x),

such that we can appropriately describe the observations by

Yi ∼(ind.) f(yi; θi = θ(xi), vi/ϕ) = exp{ (yi θ(xi) − κ(θ(xi)))/(ϕ/vi) + a(yi; vi/ϕ) },   (5.2)

for 1 ≤ i ≤ n. As a result we receive for the first moment of Yi, see Corollary 2.14,

μi = μ(xi) = Eθ(xi)[Yi] = κ'(θ(xi)).   (5.3)

Thus, the regression function θ : X → Θ̊ is assumed to describe the systematic differences (effects) between the random variables Y1, . . . , Yn being expressed by the means μ(xi) for features x1, . . . , xn. In GLMs this regression function takes a linear form after a suitable transformation, which exactly motivates the terminology generalized linear model.

5.1.2 Definition of Generalized Linear Models

We start with the discussion of the features x ∈ X . Features are also called
explanatory variables, covariates, independent variables or regressors. Throughout,
we assume that the features x = (x0 , x1 , . . . , xq ) include a first component x0 = 1,
and we choose feature space X ⊂ {1} × Rq . The inclusion of this first component
x0 = 1 is useful in what follows. We call this first component intercept or bias
component because it will be modeling an intercept of a regression model. The

null model (homogeneous model) has features that only consist of this intercept
component. For later purposes it will be useful to introduce the design matrix X
which collects the features x1, . . . , xn ∈ X of all responses Y1, . . . , Yn. The design matrix is defined by

X = (x1, . . . , xn)⊤ =
⎛ 1  x1,1  · · ·  x1,q ⎞
⎜ ⋮    ⋮     ⋱     ⋮  ⎟ ∈ Rn×(q+1).   (5.4)
⎝ 1  xn,1  · · ·  xn,q ⎠

Based on these choices we assume existence of a regression parameter β ∈ Rq+1 and of a strictly monotone and smooth link function g : M → R such that we can express (5.3) by the following function (we drop index i)

x → g(μ(x)) = g( Eθ(x)[Y] ) = η(x) = ⟨β, x⟩ = β0 + Σ_{j=1}^q βj xj.   (5.5)

Here, ⟨·, ·⟩ describes the scalar product in the Euclidean space Rq+1, θ(x) = h(μ(x)) is the resulting canonical parameter (using the canonical link h = (κ')^{−1}), and η(x) is the so-called linear predictor. After applying a suitable link function g, the systematic effects of the random variable Y with features x can be described by a linear predictor η(x) = ⟨β, x⟩, linear in the components of x ∈ X. This gives a particular functional form to (5.3), and the random variables Y1, . . . , Yn share a common regression parameter β ∈ Rq+1. Remark that the link function g used in (5.5) can be different from the canonical link h used to calculate θ(x) = h(μ(x)). We come back to this distinction below.

Summary of (5.5)
1. The independent random variables Yi follow a fixed member of the EDF (5.1) with individual canonical parameters θi ∈ Θ̊, for all 1 ≤ i ≤ n.
2. The canonical parameters θi and the corresponding mean parameters μi are related by the canonical link h = (κ')^{−1} as follows h(μi) = θi, where κ is the cumulant function of the chosen EDF, see Corollary 2.14.
3. We assume that the systematic effects in the random variables Yi can be described by linear predictors ηi = η(xi) = ⟨β, xi⟩ and a strictly monotone and smooth link function g such that we have g(μi) = ηi = ⟨β, xi⟩, for all 1 ≤ i ≤ n, with common regression parameter β ∈ Rq+1.

We can either express this GLM regression structure in the dual (mean) parameter space M or in the effective domain Θ̊, see Remarks 2.9,

x → μ(x) = g^{−1}(η(x)) = g^{−1}⟨β, x⟩ ∈ M   or   x → θ(x) = (h ∘ g^{−1})(η(x)) = (h ∘ g^{−1})⟨β, x⟩ ∈ Θ̊,

where (h ◦ g −1 ) is the composition of the inverse link g −1 and the canonical link h.
For the moment, the link function g is quite general. In practice, the explicit choice
needs some care. The right-hand side of (5.5) is defined on the whole real line if at
least one component of x is both-sided unbounded. On the other hand, M and Θ̊ may be bounded sets. Therefore, the link function g may require some restrictions
such that the domain and the range fulfill the necessary constraints. The dimension
of β should satisfy 1 ≤ 1 + q ≤ n, the lower bound will provide a null model and
the upper bound a saturated model.

5.1.3 Link Functions and Feature Engineering

As link function we choose a strictly monotone and smooth function g : M → R such that we do not have any conflicts in domains and ranges. Beside these requirements, we may want further properties for the link function g and the features x. From (5.5) we have

μ(x) = Eθ(x)[Y] = g^{−1}⟨β, x⟩.   (5.6)

Of course, a basic requirement is that the selected features x can appropriately


describe the mean of Y by the function in (5.6), see also Fig. 5.1. This may
require so-called feature engineering of x, for instance, we may want to replace
the first component x1 of the raw features x by, say, x12 in the pre-processed
features. For example, if this first component describes the age of the insurance
policyholder, then, in some regression problems, it might be more appropriate to
consider age2 instead of age to bring the predictive problem into structure (5.6). It
may also be that we would like to enforce a certain type of interaction between the
components of the raw features. For instance, we may include in a pre-processed
feature a component x1 /x22 which might correspond to weight/height2 if the
policyholder has body weight x1 and body height x2 . In fact, this pre-processed
feature is exactly the body mass index of the policyholder. We will come back to
feature engineering in Sect. 5.2.2, below.

Another important requirement is the ability of model interpretation. In insurance pricing problems, one often prefers additive and multiplicative effects in feature components. Choosing the identity link g(m) = m we receive a model with additive effects

μ(x) = Eθ(x)[Y] = ⟨β, x⟩ = β0 + Σ_{j=1}^q βj xj,

and choosing the log-link g(m) = log(m) we receive a model with multiplicative effects

μ(x) = Eθ(x)[Y] = exp⟨β, x⟩ = e^{β0} Π_{j=1}^q e^{βj xj}.

The latter is probably the most commonly used GLM in insurance pricing because it leads to explainable tariffs where feature values directly relate to price de- and increases in percentages of a base premium exp{β0}.
Another very popular choice is the canonical (natural) link, i.e., g = h = (κ')^{−1}. The canonical link substantially simplifies the analysis and it has very favorable statistical properties (as we will see below). However, in some applications practical needs overrule good statistical properties. Under the canonical link g = h we have in the dual mean parameter space M and in the effective domain Θ, respectively,

x → μ(x) = κ'(η(x)) = κ'⟨β, x⟩   and   x → θ(x) = η(x) = ⟨β, x⟩.

Thus, the linear predictor η and the canonical parameter θ coincide under the canonical link choice g = h = (κ')^{−1}.
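A small Python sketch contrasting these link choices (illustrative names; for the Poisson EDF with κ(θ) = exp{θ} the canonical link h = (κ')^{−1} is exactly the log-link):

import numpy as np

def linear_predictor(beta, x):
    # eta(x) = <beta, x>, with intercept component x0 = 1
    return float(np.dot(beta, x))

def mu_identity(beta, x):
    # identity link: additive effects, mu(x) = eta(x)
    return linear_predictor(beta, x)

def mu_log(beta, x):
    # log-link: multiplicative effects, mu(x) = e^{beta0} * prod_j e^{beta_j x_j}
    return float(np.exp(linear_predictor(beta, x)))

# for the Poisson EDF, kappa(theta) = exp(theta), so the canonical link
# h = (kappa')^{-1} = log coincides with the log-link and theta(x) = eta(x)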

5.1.4 Log-Likelihood Function and Maximum Likelihood Estimation

After having a fully specified GLM within the EDF, there remains estimation of the regression parameter β ∈ Rq+1. This is done within the framework of MLE.

The log-likelihood function of Y = (Y1, . . . , Yn)⊤ for regression parameter β ∈ Rq+1 is given by, see (5.2) and using the independence between the Yi's,

β → ℓY(β) = Σ_{i=1}^n [ (vi/ϕ) ( Yi h(μ(xi)) − κ(h(μ(xi))) ) + a(Yi; vi/ϕ) ],   (5.7)

where we set μ(xi) = g^{−1}⟨β, xi⟩. For the canonical link g = h = (κ')^{−1} this simplifies to

β → ℓY(β) = Σ_{i=1}^n [ (vi/ϕ) ( Yi ⟨β, xi⟩ − κ⟨β, xi⟩ ) + a(Yi; vi/ϕ) ].   (5.8)

MLE of β needs maximization of the log-likelihoods (5.7) and (5.8), respectively; these are the GLM counterparts to the homogeneous case treated in Section 3.3.2. We calculate the score, where we set ηi = ⟨β, xi⟩ and μi = μ(xi) = g^{−1}⟨β, xi⟩,

s(β, Y) = ∇β ℓY(β) = Σ_{i=1}^n (vi/ϕ) (Yi − μi) ∇β h(μ(xi))
= Σ_{i=1}^n (vi/ϕ) (Yi − μi) (∂h(μi)/∂μi) (∂μi/∂ηi) ∇β η(xi)   (5.9)
= Σ_{i=1}^n (vi/ϕ) ( (Yi − μi)/V(μi) ) ( ∂g(μi)/∂μi )^{−1} xi,

where we use the definition of the variance function V(μ) = (κ'' ∘ h)(μ), see Corollary 2.14. We define the diagonal working weight matrix, which in general depends on β through the means μi = g^{−1}⟨β, xi⟩,

W(β) = diag( ( ∂g(μi)/∂μi )^{−2} (vi/ϕ) (1/V(μi)) )_{1≤i≤n} ∈ Rn×n,

and the working residuals

R = R(Y, β) = ( (Yi − μi) ∂g(μi)/∂μi )_{1≤i≤n} ∈ Rn.

This allows us to write the score equations in a compact form, which provides the
following proposition.

Proposition 5.1 The MLE for β is found by solving the score equations

s(β, Y) = ∇β ℓY(β) = X⊤ W(β) R(Y, β) = 0.

For the canonical link g = h = (κ')^{−1} the score equations simplify to

s(β, Y) = ∇β ℓY(β) = X⊤ diag( vi/ϕ )_{1≤i≤n} ( Y − κ'(Xβ) ) = 0,

where κ'(Xβ) ∈ Rn is understood element-wise.

Remarks 5.2
• In general, the MLE of β is not calculated by maximizing the log-likelihood function ℓY(β), but rather by solving the score equations s(β, Y) = 0; we also refer to Remarks 3.29 on M- and Z-estimators. The score equations provide the critical points for β, from which the global maximum of the log-likelihood function can be determined, supposed it exists.
• Existence of an MLE of β is not always given; similarly to Example 3.5, we may face the problem that the solution lies at the boundary of the parameter space (which itself may be an open set).
• If the log-likelihood function β → ℓY(β) is strictly concave, then the critical point of the score equations s(β, Y) = 0 is unique, supposed it exists, and, henceforth, we have a unique MLE β̂MLE for β. Below, we give cases where the strict concavity of the log-likelihood holds.
• In general, there is no closed-form solution for the MLE of β, except in the Gaussian case with canonical link; thus, we need to solve the score equations numerically.
Similarly to Remarks 3.17 we can calculate Fisher's information matrix w.r.t. β through the negative expected Hessian of ℓY(β).

We get Fisher's information matrix w.r.t. β

I(β) = Eβ[ (∇β ℓY(β)) (∇β ℓY(β))⊤ ] = −Eβ[ ∇β² ℓY(β) ] = X⊤ W(β) X.   (5.10)

If the design matrix X ∈ Rn×(q+1) has full rank q + 1 ≤ n, Fisher's information matrix I(β) is positive definite.

Dispersion parameter ϕ > 0 has been treated as a nuisance parameter above. Its explicit specification does not influence the MLE of β because it cancels in the score equations. If necessary, we can also estimate this dispersion parameter with MLE. This requires solving the additional score equation

∂ℓY(β, ϕ)/∂ϕ = Σ_{i=1}^n [ −(vi/ϕ²) ( Yi h(μ(xi)) − κ(h(μ(xi))) ) + ∂a(Yi; vi/ϕ)/∂ϕ ] = 0,   (5.11)

and we can plug in the MLE of β (which can be estimated independently of ϕ). Fisher's information matrix is in this extended framework given by

I(β, ϕ) = −Eβ[ ∇²(β,ϕ) ℓY(β, ϕ) ] =
⎛ X⊤W(β)X    0 ⎞
⎝ 0    −Eβ[ ∂²ℓY(β, ϕ)/∂ϕ² ] ⎠,

that is, the off-diagonal terms between β and ϕ are zero.

In view of Proposition 5.1 we need a root search algorithm to obtain the MLE of β. Typically, one uses Fisher's scoring method or the iterative re-weighted least squares (IRLS) algorithm to solve this root search problem. This is a main result derived in the seminal work of Nelder–Wedderburn [283] and it explains the popularity of GLMs, namely, GLMs can be solved efficiently by this algorithm. Fisher's scoring method/IRLS algorithm explores the updates for t ≥ 0 until convergence

β̂(t) → β̂(t+1) = ( X⊤ W(β̂(t)) X )^{−1} X⊤ W(β̂(t)) ( X β̂(t) + R(Y, β̂(t)) ),   (5.12)

where all terms on the right-hand side are evaluated at algorithmic time t. If we
have n observations Y = (Y1 , . . . , Yn ) we can estimate at most n parameters.
Therefore, in our GLM we assume to have a regression parameter β ∈ Rq+1 of
dimension q + 1 ≤ n. Moreover, we require that the design matrix X has full rank
q + 1 ≤ n. Otherwise the regression parameter is not uniquely identifiable since
linear dependence in the columns of X allows us to reduce the dimension of the
parameter space to a smaller representation. This is also needed to calculate the
inverse matrix in (5.12). This motivates the following assumption.

Assumption 5.3 Throughout, we assume that the design matrix X ∈ Rn×(q+1) has full rank q + 1 ≤ n.

Remarks 5.4 (Justification of Fisher's Scoring Method/IRLS Algorithm)

• We give a short justification of Fisher's scoring method/IRLS algorithm, for a more detailed treatment we refer to Section 2.5 in McCullagh–Nelder [265] and Section 2.2 in Fahrmeir–Tutz [123].
The Newton–Raphson algorithm provides a numerical scheme to find solutions to the score equations. It requires to iterate for t ≥ 0

β̂(t) → β̂(t+1) = β̂(t) + Î(β̂(t))^{−1} s(β̂(t), Y),

where Î(β) = −∇β² ℓY(β) denotes the observed information matrix in β ∈ Rq+1. The calculation of the inverse of the observed information matrix Î(β̂(t))^{−1} can be time consuming and unstable because we need to calculate second derivatives and the eigenvalues of the observed information matrix can be close to zero. A stable scheme is obtained by replacing the observed information matrix Î(β) by Fisher's information matrix I(β) = Eβ[Î(β)] being positive definite under Assumption 5.3; this provides a quasi-Newton method. Thus, for Fisher's scoring method we iterate for t ≥ 0

β̂(t) → β̂(t+1) = β̂(t) + I(β̂(t))^{−1} s(β̂(t), Y),   (5.13)

and rewriting this provides us exactly with (5.12). The latter can also be interpreted as an IRLS scheme where the response g(Yi) is replaced by an adjusted linearized version Zi = g(μi) + (∂g(μi)/∂μi)(Yi − μi). This corresponds to the last bracket in (5.12), and with corresponding weights.
• Under the canonical link choice, Fisher's information matrix and the observed information matrix coincide, i.e. Î(β) = I(β), and the Newton–Raphson algorithm, Fisher's scoring method and the IRLS algorithm are identical. This can easily be seen from Proposition 5.1. We receive under the canonical link choice

∇β² ℓY(β) = −Î(β) = −X⊤ diag( (vi/ϕ) V(μi) )_{1≤i≤n} X = −X⊤ W(β) X = −I(β).   (5.14)



The full rank assumption q + 1 ≤ n on the design matrix X implies that Fisher's information matrix I(β) is positive definite. This in turn implies that the log-likelihood function ℓY(β) is strictly concave, providing uniqueness of a critical point (supposed it exists). This indicates that the canonical link has very favorable properties for MLE. Examples 5.5 and 5.6 give two examples not using the canonical link; the first one is a concave maximization problem, the second one is not for p > 2.
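A minimal IRLS sketch of (5.12) in Python for a Poisson GLM with canonical log-link, where W(β) = diag(vi μi/ϕ) and the working response is Xβ + R(Y, β) with Ri = (Yi − μi)/μi; the dispersion ϕ cancels in the update (names are illustrative):

import numpy as np

def irls_poisson(X, y, v, n_iter=50, tol=1e-10):
    # X: design matrix (first column all ones), y: claim frequencies Y_i, v: exposures v_i
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)                   # inverse log-link
        w = v * mu                         # working weights W(beta), up to 1/phi
        z = eta + (y - mu) / mu            # working response X beta + R(Y, beta)
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta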

Example 5.5 (Gamma Model with Log-Link) We study the gamma distribution as a single-parameter EDF model, choosing the shape parameter α = 1/ϕ as the inverse of the dispersion parameter, see Sect. 2.2.2. The cumulant function κ(θ) = −log(−θ) gives us the canonical link θ = h(μ) = −1/μ. Moreover, we choose the log-link η = g(μ) = log(μ) for the GLM. This gives a canonical parameter θ = −exp{−η}. We receive the score

s(β, Y) = ∇β ℓY(β) = Σ_{i=1}^n (vi/ϕ) [ Yi/μi − 1 ] xi = X⊤ diag( vi/ϕ )_{1≤i≤n} R(Y, β).

Unlike in other examples with non-canonical links, we receive a favorable expression here because only one term in the square bracket depends on the regression parameter β, or equivalently, the working weight matrix W does not depend on β. We calculate the negative Hessian (observed information matrix)

Î(β) = −∇β² ℓY(β) = X⊤ diag( (vi/ϕ) (Yi/μi) )_{1≤i≤n} X.

In the gamma model all observations Yi are strictly positive, a.s., and under the full rank assumption q + 1 ≤ n, the observed information matrix Î(β) is positive definite; thus, we have a strictly concave log-likelihood function in the gamma case with log-link. 

Example 5.6 (Tweedie's Models with Log-Link) We study Tweedie's models for power variance parameters p > 1 as a single-parameter EDF model, see Sect. 2.2.3. The cumulant function κp is given in Table 4.1. This gives us the canonical link θ = hp(μ) = μ^{1−p}/(1 − p) < 0 for μ > 0 and p > 1. Moreover, we choose the log-link η = g(μ) = log(μ) for the GLM. This implies θ = exp{(1 − p)η}/(1 − p) < 0 for p > 1. We receive the score

s(β, Y) = ∇β ℓY(β) = Σ_{i=1}^n (vi/ϕ) ( (Yi − μi)/μi^{p−1} ) xi = X⊤ diag( (vi/ϕ) (1/μi^{p−2}) )_{1≤i≤n} R(Y, β).

We calculate the negative Hessian (observed information matrix) for μi > 0

Î(β) = −∇β² ℓY(β) = X⊤ diag( (vi/ϕ) ( ((p−1)Yi − (p−2)μi)/μi^{p−1} ) )_{1≤i≤n} X.

This matrix is positive definite for p ∈ [1, 2], and for p > 2 it is not positive definite because (p−1)Yi − (p−2)μi may have positive or negative values if we vary μi > 0 over its domain M. Thus, we do not have concavity of the optimization problem under the log-link choice in Tweedie's GLMs for power variance parameters p > 2. This in particular applies to the inverse Gaussian GLM with log-link. 

5.1.5 Balance Property Under the Canonical Link Choice

Throughout this section we work under the canonical link choice g = h = (κ')^{−1}. This choice has very favorable statistical properties. We have already seen in Remarks 5.4 that the derivation of the MLE of β becomes particularly easy under the canonical link choice and that the observed information matrix Î(β) coincides with Fisher's information matrix I(β) in this case, see (5.14).
For insurance pricing, canonical links have another very remarkable property, namely, that the estimated model automatically fulfills the balance property and, henceforth, is unbiased. This is particularly important in insurance pricing because it tells us that the insurance prices (over the entire portfolio) are on the right level. We have already met the balance property in Corollary 3.19.

Corollary 5.7 (Balance Property) Assume that Y has independent components being modeled by a GLM under the canonical link choice g = h = (κ')^{−1}. Assume that the MLE of regression parameter β ∈ Rq+1 exists and denote it by β̂MLE. We have the balance property on portfolio level (for constant dispersion ϕ)

Σ_{i=1}^n Eβ̂MLE[vi Yi] = Σ_{i=1}^n vi κ'⟨β̂MLE, xi⟩ = Σ_{i=1}^n vi Yi.

Proof The first column of the design matrix X is identically equal to 1, representing the intercept, see (5.4). The second part of Proposition 5.1 then provides for this first column of X, canceling the (constant) dispersion ϕ,

(1, . . . , 1) diag(v1, . . . , vn) κ'(X β̂MLE) = (1, . . . , 1) diag(v1, . . . , vn) Y.

This proves the claim. 
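A quick numerical illustration of the balance property, reusing the hypothetical irls_poisson sketch from above (simulated toy data; the log-link is canonical for the Poisson EDF):

import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # design matrix with intercept
v = rng.uniform(0.5, 1.0, size=n)                           # exposures
mu_true = np.exp(X @ np.array([-2.0, 0.3, -0.2, 0.1]))
y = rng.poisson(v * mu_true) / v                            # frequencies Y_i = N_i / v_i

beta_hat = irls_poisson(X, y, v)
mu_hat = np.exp(X @ beta_hat)
print(np.sum(v * mu_hat), np.sum(v * y))  # the two sums agree (balance property)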




Remark 5.8 We mention once more that this balance property is very strong and useful, see also Remarks 3.20. In particular, the balance property holds even though the chosen GLM might be completely misspecified. Misspecification may include an incorrect distributional model, a link function that is not the right choice, features that have not been pre-processed appropriately, etc. Such misspecification will imply that we have a poor model on an insurance policy level (observation level). However, the total premium charged over the entire portfolio will be on the right level (supposed that the structure of the portfolio does not change) because it matches the observations, and henceforth, we have unbiasedness for the portfolio mean.
From the log-likelihood function (5.8) we see that under the canonical link choice we consider the statistics S(Y) = X⊤ diag(vi/ϕ)_{1≤i≤n} Y ∈ Rq+1, and to prove the balance property we have used the first component of this statistics. Considering all components, S(Y) is an unbiased estimator (decision rule) for

Eβ[S(Y)] = X⊤ diag(vi/ϕ)_{1≤i≤n} κ'(Xβ) = ( Σ_{i=1}^n (vi/ϕ) κ'⟨β, xi⟩ xi,j )_{0≤j≤q}.   (5.15)

This unbiased estimator S(Y) meets the Cramér–Rao information bound, hence it is UMVU: taking the partial derivatives of the previous expression gives ∇β Eβ[S(Y)] = I(β), the latter also being the multivariate Cramér–Rao information bound for the unbiased decision rule S(Y) for (5.15). Focusing on the first component we have

Varβ( Σ_{i=1}^n Eβ̂MLE[vi Yi] ) = Varβ( Σ_{i=1}^n vi Yi ) = Σ_{i=1}^n ϕ vi V(μi) = ϕ² (I(β))_{0,0},   (5.16)

where the component (0, 0) in the last expression is the top-left entry of Fisher's information matrix I(β) under the canonical link choice.

5.1.6 Asymptotic Normality

Formula (5.16) quantifies the uncertainty in the premium calculation of the insurance policies if we use the MLE estimated model (under the canonical link choice). That is, this quantifies the uncertainty in the dual mean parametrization in terms of the resulting variance. We could also focus on the MLE β̂MLE itself (for general link function g). In general, this MLE is not unbiased but we have

consistency and asymptotic normality similar to Theorem 3.28. Under "certain regularity conditions"1 we have for n large

β̂MLE_n ≈(d) N( β, In(β)^{−1} ),   (5.17)

where β̂MLE_n is the MLE based on the observations Y n = (Y1, . . . , Yn)⊤, and In(β) is Fisher's information matrix of Y n, which scales linearly in n in the homogeneous EF case, see Remarks 3.14, and in the homogeneous EDF case it scales as Σ_{i=1}^n vi, see (3.25).

5.1.7 Maximum Likelihood Estimation and Unit Deviances

From formula (5.7) we conclude that the MLE β̂MLE of β ∈ Rq+1 is found by the solution of (subject to existence)

β̂MLE = arg max_β ℓY(β) = arg max_β Σ_{i=1}^n (vi/ϕ) ( Yi h(μ(xi)) − κ(h(μ(xi))) ),

with μi = μ(xi) = Eθ(xi)[Y] = g^{−1}⟨β, xi⟩ under the link choice g. If we prefer to work with an objective function that reflects the notion of a loss function, we can work with the unit deviances d(Yi, μi) studied in Sect. 4.1.2. The MLE is then obtained by, see (4.20)–(4.21),

β̂MLE = arg max_β ℓY(β) = arg min_β Σ_{i=1}^n (vi/ϕ) d(Yi, μi),   (5.18)

the latter satisfying d(Yi, μi) ≥ 0 for all 1 ≤ i ≤ n, and being zero if and only if Yi = μi, see Lemma 2.22. Thus, using the unit deviances we have a loss function that is bounded below by zero, and we determine the regression parameter β such that this loss is (in-sample) minimized. This can also be interpreted in a more geometric way. Consider the (q + 1)-dimensional manifold M ⊂ Rn spanned by the GLM function

β → μ(β) = g^{−1}(Xβ) = ( g^{−1}⟨β, x1⟩, . . . , g^{−1}⟨β, xn⟩ )⊤ ∈ Rn.   (5.19)

1 The regularity conditions for asymptotic normality results will depend on the particular regression problem studied; we refer to pages 43–44 in Fahrmeir–Tutz [123].

[Fig. 5.2 2-dimensional manifold M ⊂ R3 for observation Y = (Y1, Y2, Y3)⊤ ∈ R3; the straight line illustrates the projection (w.r.t. the unit deviance distances d) of Y onto M which gives the MLE β̂MLE satisfying μ(β̂MLE) ∈ M; axes i = 1, 2, 3]

Minimization (5.18) then tries to find the point μ(β) in this manifold M ⊂ Rn that minimizes simultaneously all unit deviances d(Yi, ·) w.r.t. the observation Y = (Y1, . . . , Yn)⊤ ∈ Rn. Or in other words, the optimal parameter β is obtained by "projecting" observation Y onto this manifold M, where "projection" is understood as a simultaneous minimization of the loss function Σ_{i=1}^n (vi/ϕ) d(Yi, μi), see Fig. 5.2. In the un-weighted Gaussian case, this corresponds to the usual orthogonal projection as the next example shows, and in the non-Gaussian case it is understood in the KL divergence minimization sense as displayed in formula (4.11).
Example 5.9 (Gaussian Case) Assume we have the Gaussian EDF case κ(θ) = θ²/2 with canonical link g(μ) = h(μ) = μ. In this case, the manifold (5.19) is the linear space spanned by the columns of the design matrix X

β → μ(β) = Xβ = ( ⟨β, x1⟩, . . . , ⟨β, xn⟩ )⊤ ∈ Rn.

If additionally we assume vi/ϕ = c > 0 for all 1 ≤ i ≤ n, the minimization problem (5.18) reads as

β̂MLE = arg min_β Σ_{i=1}^n (vi/ϕ) d(Yi, μi) = arg min_β ‖Y − Xβ‖²₂,

where we have used that the unit deviances in the Gaussian case are given by the square loss function, see Example 4.12. As a consequence, the MLE β̂MLE is found by orthogonally projecting Y onto M = {Xβ | β ∈ Rq+1} ⊂ Rn, and this orthogonal projection is given by X β̂MLE ∈ M. 

5.2 Actuarial Applications of Generalized Linear Models

The purpose of this section is to illustrate how the concept of GLMs is used in
actuarial modeling. We therefore explore the typical actuarial examples of claim
counts and claim size modeling.

5.2.1 Selection of a Generalized Linear Model

The selection of a predictive model within GLMs for solving an applied actuarial
problem requires the following choices.
Choice of the Member of the EDF Select a member of the EDF that fits the
modeling problem. In a first step, we should try to understand the properties of
the data Y before doing this selection, for instance, do we have count data, do we
have a classification problem, do we have continuous observations?
All members of the EDF are light-tailed because the moment generating function
exists around the origin, see Corollary 2.14, and the EDF is not suited to model
heavy-tailed data, for instance, having a regularly varying tail. Therefore, a datum
Y is sometimes first transformed before being modeled by a member of the EDF.
A popular transformation is the logarithm for positive observations. After this
transformation a member of the EDF can be chosen to model log(Y ). For instance,
if we choose the Gaussian distribution for log(Y ), then Y will be log-normally
distributed, or if we choose the exponential distribution for log(Y ), then Y will
be Pareto distributed, see Sect. 2.2.5. One can then model the transformed datum
with a GLM. Often this provides very accurate models, say, on the log scale for the
log-transformed data. There is one issue with this approach, namely, if a model
is unbiased on the transformed scale then it is typically biased on the original
observation scale; if the transformation is concave this easily follows from Jensen’s
inequality. The problematic part now is that the bias correction itself often has
systematic effects which means that the transformation (or the involved nuisance
parameters) should be modeled with a regression model, too, see Sect. 5.3.9. In
many cases this will not easily work, unfortunately. Therefore, if possible, clear
preference should be given to modeling the data on the original observation scale (if
unbiasedness is a central requirement).

Choice of Link Function From a statistical point of view we should choose the
canonical link g = h to connect the mean μ of the model to the linear predictor
η because this implies many favorable mathematical properties. However, as seen,
sometimes we have different needs. Practical reasons may require that we have a
model with additive or multiplicative effects, which favors the identity or the log-
link, respectively. Another requirement is that the resulting canonical parameter θ = (h ∘ g^{−1})(η) needs to be within the effective domain Θ. If this effective domain is bounded, for instance, if it covers the negative real line as for the gamma model, a (transformation of the) log-link might be more suitable than the canonical link because g^{−1}(·) = −exp(·) has a strictly negative range, see Example 5.5.

Choice of Features and Feature Engineering Assume we have selected the


member of the EDF and the link function g. This gives us the relationship between
the mean μ and the linear predictor η, see (5.5),

μ(x) = Eθ(x)[Y] = g^{−1}(η(x)) = g^{−1}⟨β, x⟩.   (5.20)

Thus, the features x ∈ X ⊂ Rq+1 need to be in the right functional form so that
they can appropriately describe the systematic effect via the function (5.20). We
distinguish the following feature types:
• Continuous real-valued feature components, examples are age of policyholder,
weight of car, body mass index, etc.
• Ordinal categorical feature components, examples are ratings like good-
medium-bad or A-B-C-D-E.
• Nominal categorical feature components, examples are vehicle brands, occupa-
tion of policyholders, provinces of living places of policyholders, etc. The values
that the categorical feature components can take are called levels.
• Binary feature components are special categorical features that only have two
levels, e.g. female-male, open-closed. Because binary variables often play a
distinguished role in modeling they are separated from categorical variables
which are typically assumed to have more than two levels.
All these components need to be brought into a suitable form so that they can be used in a linear predictor η(x) = ⟨β, x⟩, see (5.20). This requires the consideration of the following points: (1) transformation of continuous components so that they can
describe the systematic effects in a linear form, (2) transformation of categorical
components to real-valued components, (3) interaction of components beyond an
additive structure in the linear predictor, and (4) the resulting design matrix X should
have full rank q + 1 ≤ n. We are going to describe these points (1)–(4) in the next
section.

5.2.2 Feature Engineering

Categorical Feature Components: Dummy Coding

Categorical variables need to be embedded into a Euclidean space. This embedding


needs to be done such that the resulting design matrix X has full rank q + 1 ≤ n.
There are many different ways to do so, and the particular choice depends on
the modeling purpose. The most popular way is dummy coding. We only describe
dummy coding here because it is sufficient for our purposes, but we mention that

Table 5.1 Dummy coding example that maps the K = 11 levels (colors) to the unit vectors of the 10-dimensional Euclidean space R10, selecting the last level a11 (brown color) as reference level, and showing the resulting dummy vectors xj as row vectors:

a1 = white     1 0 0 0 0 0 0 0 0 0
a2 = yellow    0 1 0 0 0 0 0 0 0 0
a3 = orange    0 0 1 0 0 0 0 0 0 0
a4 = red       0 0 0 1 0 0 0 0 0 0
a5 = magenta   0 0 0 0 1 0 0 0 0 0
a6 = violet    0 0 0 0 0 1 0 0 0 0
a7 = blue      0 0 0 0 0 0 1 0 0 0
a8 = cyan      0 0 0 0 0 0 0 1 0 0
a9 = green     0 0 0 0 0 0 0 0 1 0
a10 = beige    0 0 0 0 0 0 0 0 0 1
a11 = brown    0 0 0 0 0 0 0 0 0 0

there are also other codings like effects coding or Helmert’s contrast coding.2 The
choice of the coding will not influence the predictive model (if we work with
a full rank design matrix), but it may influence parameter selection, parameter
reduction and model interpretation. For instance, the choice of the coding is (more)
important in medical studies where one tries to understand the effects between
certain therapies.
Assume that the raw feature component x̃j is a categorical variable taking K different levels {a1, . . . , aK}. For dummy coding we declare one level, say aK, to be the reference level and all other levels are described relative to that reference level. Formally, this can be described by an embedding map

x̃j → xj = ( 1{x̃j=a1}, . . . , 1{x̃j=aK−1} )⊤ ∈ RK−1.   (5.21)

This is closely related to the categorical distribution in Sect. 2.1.4. An explicit


example is given in Table 5.1.
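A minimal dummy coding sketch in Python with pandas (toy data; we drop the reference level a11 = brown explicitly to reproduce the coding of Table 5.1):

import pandas as pd

colors = pd.Series(["white", "yellow", "brown", "blue", "brown"], name="color")
levels = ["white", "yellow", "orange", "red", "magenta", "violet",
          "blue", "cyan", "green", "beige", "brown"]          # levels a1, ..., a11
cat = pd.Categorical(colors, categories=levels)
# dummy coding: K - 1 = 10 indicator columns, reference level a11 = brown
X_dummy = pd.get_dummies(cat).drop(columns="brown")
print(X_dummy.astype(int))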
Example 5.10 (Multiplicative Model) If we choose the log-link function η = g(μ) = log(μ), we receive the regression function for the categorical example of Table 5.1

x̃j → exp⟨β, xj⟩ = exp{β0} Π_{k=1}^{K−1} exp{ βk 1{x̃j=ak} },   (5.22)

including an intercept component. Thus, the base value exp{β0} is determined by the reference level a11 = brown, and any color different from brown has a deviation from the base value described by the multiplicative correction term exp{βk 1{x̃j=ak}}. 

2 There is an example of Helmert’s contrast coding in Remarks 2.7 of lecture notes [392], and for
more examples we refer to the UCLA statistical consulting website: https://stats.idre.ucla.edu/r/
library/r-library-contrast-coding-systems-for-categorical-variables/.

Remarks 5.11
• Importantly, dummy coding leads to full rank design matrices X and, henceforth,
Assumption 5.3 is fulfilled.
• Dummy coding is different from one-hot encoding which is going to be
introduced in Sect. 7.3.1, below.
• Dummy coding needs some care if we have categorical feature components with
many levels, for instance, considering car brands and car models we can get
hundreds of levels. In that case we will have sparsity in the resulting design
matrix. This may cause computational issues, and, as the following example
will show, it may lead to high uncertainty in parameter estimation. In particular,
the columns of the design matrix X of very rare levels will be almost collinear
which implies that we do not receive very well-conditioned matrices in Fisher’s
scoring method (5.12). For this reason, it is recommended to merge levels
to bigger classes. In Sect. 7.3.1, below, we are going to present a different
treatment. Categorical variables are embedded into low-dimensional spaces, so
that proximity in these spaces has a reasonable meaning for the regression task
at hand.

Example 5.12 (Balance Property and Dummy Coding) A main argument for the use of the canonical link function has been the fulfillment of the balance property, see Corollary 5.7. If we have categorical feature components and if we apply dummy coding to those, then the balance property is projected down to the individual levels of that categorical variable. Assume that columns 2 to K of design matrix X are used to model a raw categorical feature x̃1 with K levels according to (5.21). In that case, columns 2 ≤ k ≤ K will indicate all observations Yi which belong to level ak−1. Analogously to the proof of Corollary 5.7, we receive (summation i runs over the different instances/policies)

Σ_{i: x̃i,1=ak−1} Eβ̂MLE[vi Yi] = Σ_{i=1}^n xi,k Eβ̂MLE[vi Yi] = Σ_{i=1}^n xi,k vi Yi = Σ_{i: x̃i,1=ak−1} vi Yi.   (5.23)

Thus, we receive the balance property for all policies 1 ≤ i ≤ n that belong to level ak−1.
If we have many levels, then it will happen that some levels have only very few
observations, and the above summation (5.23) only runs over very few insurance
policies with *
xi,1 = ak−1 . Suppose additionally the volumes vi are small. This can
lead to considerable estimation uncertainty, because the estimated prices on the left-
hand side of (5.23) will be based too much on individual observations Yi having the
corresponding level, and we are not in the regime of a law of large numbers that
balances these observations.
Thus, this balance property from dummy coding is a natural property under the
canonical link choice. Actuarial pricing is very familiar with such a property. Early

distribution-free approaches have postulated this property resulting in the method of


the total marginal sums, see Bailey and Jung [22, 206], where the balance property
is enforced for marginal sums of all categorical levels in parameter estimation.
However, if we have scarce levels in categorical variables, this approach needs
careful consideration. 

Binary Feature Components

Binary feature components do not need a treatment different from the categorical ones; they are Bernoulli variables which can be encoded as 0 or 1. This is exactly dummy coding for K = 2 levels.

Continuous Feature Components

Continuous feature components are already real-valued. Therefore, from the viewpoint of 'variable types', continuous feature components do not need any pre-processing because they are already in the right format to be included in scalar products.
Nevertheless, in many cases continuous feature components also need feature engineering, because only in rare cases do they directly fit the functional form (5.20).
We give an example. Consider car drivers having different driving experience and different driving skills. To explain experience and skills we typically choose the age of driver as explanatory variable. Modeling the claim frequency as a function of the age of driver, we often observe a U-shaped function, i.e., a function that is non-monotone in the age of driver variable. Since the link function g needs to be strictly monotone, this regression problem cannot be modeled by (5.20) with the age of driver directly included as a feature, because this leads to monotonicity of the regression function in the age of driver variable.
Typically, in such situations, the continuous variable is discretized to categorical
classes. In the driver’s age example, we build age classes. These age classes
are then treated as categorical variables using dummy coding (5.21). We will
give examples below. These age classes should fulfill the requirement of being
sufficiently homogeneous in the sense that insurance policies that fall into the
same class should have a similar propensity to claims. This implies that we would
like to have many small homogeneous classes. However, the classes should be
sufficiently large, otherwise parameter estimation involves high uncertainty, see
also Example 5.12. Thus, there is a trade-off to sufficiently meet both of these two
requirements.
A disadvantage of this discretization approach is that neighboring age classes will not be recognized by the regression function because, per se, dummy coding is based on nominal variables not having any topology. This is also illustrated by the fact that all categorical levels (excluding the reference level) have, in view of the embedding (5.21), the same mutual Euclidean distance. Therefore, in some applications, one prefers a different approach by rather trying to find an appropriate functional form. For instance, we can pre-process a strictly positive raw feature component $\tilde{x}_l$ to a higher-dimensional functional form

\[
\tilde{x}_l \;\mapsto\; \beta_1 \tilde{x}_l + \beta_2 \tilde{x}_l^2 + \beta_3 \tilde{x}_l^3 + \beta_4 \log(\tilde{x}_l),
\tag{5.24}
\]

with regression parameter $(\beta_1, \ldots, \beta_4)^\top$, i.e., we have a polynomial function of degree 3 plus a logarithmic term in this choice. If one does not want to choose a specific functional form, one often chooses natural cubic splines. This, together with regularization, leads to the framework of generalized additive models (GAMs), which is a popular family of regression models besides GLMs; for literature on GAMs we refer to Hastie–Tibshirani [182], Wood [384], Ohlsson–Johansson [290], Denuit et al. [99] and Wüthrich–Buser [392]. In these notes we will not further pursue GAMs.
Example 5.13 (Multiplicative Model) If we choose the log-link function $\eta = g(\mu) = \log(\mu)$ we receive a multiplicative regression function
\[
x \;\mapsto\; \mu(x) = \exp\langle \beta, x \rangle = \exp\{\beta_0\} \prod_{j=1}^{q} \exp\{\beta_j x_j\}.
\]

That is, all feature components $x_j$ enter the regression function in an exponential form. In general insurance, one may have specific variables for which it is explicitly known that they should enter the regression function as a power function. Having a raw feature $\tilde{x}_l$ we can pre-process it as $\tilde{x}_l \mapsto x_l = \log(\tilde{x}_l)$. This implies
\[
\mu(x) = \exp\langle \beta, x \rangle = \exp\{\beta_0\}\, \tilde{x}_l^{\beta_l} \prod_{j=1,\, j \neq l}^{q} \exp\{\beta_j x_j\},
\]

which gives a power term of order $\beta_l$. In this case, the GLM estimates the power parameter that should be used for $\tilde{x}_l$. If the power parameter is known, then one can even include this component as an offset; offsets are discussed in Sect. 5.2.3, below.
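A minimal sketch in R, assuming a hypothetical data frame dat with claim counts N, exposures v and a strictly positive raw feature x_raw:

fit <- glm(N ~ log(x_raw) + offset(log(v)), family = poisson(), data = dat)
coef(fit)["log(x_raw)"]  # estimated power beta_l, i.e., mu is proportional to x_raw^beta_l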

Interactions

Naturally, GLMs only allow for an additive structure in the linear predictor. Similarly to continuous feature components, such an additive structure may not always be suitable and one wants to model more complex interaction terms. Such interactions need to be added manually by the modeler. For instance, if we have two raw feature components $\tilde{x}_l$ and $\tilde{x}_k$, we may want to consider a functional form
\[
(\tilde{x}_l, \tilde{x}_k) \;\mapsto\; \beta_1 \tilde{x}_l + \beta_2 \tilde{x}_k + \beta_3 \tilde{x}_l \tilde{x}_k + \beta_4 \tilde{x}_l^2 \tilde{x}_k,
\]

with regression parameter $(\beta_1, \ldots, \beta_4)^\top$.
More generally, this manual feature engineering of adding interactions and of specifying functional forms (5.24) can be understood as providing a new representation of the raw features. Representation learning in relation to deep learning is going to be discussed in Sect. 7.1, and this discussion is also related to Mercer's kernels.

5.2.3 Offsets

In many heterogeneous portfolio problems with observations $Y = (Y_1, \ldots, Y_n)^\top$, there are known prior differences between the individual risks $Y_i$; for instance, the time exposure varies between the different policies $i$. Such known prior differences can be integrated into the predictors, and this integration typically does not involve any additional model parameters. A simple way is to use an offset (constant) in the linear predictor of a GLM. Assume that each observation $Y_i$ is equipped with a feature $x_i \in \mathcal{X}$ and a known offset $o_i \in \mathbb{R}$ such that the linear predictor $\eta_i$ takes the form
\[
(x_i, o_i) \;\mapsto\; g(\mu_i) = \eta_i = \eta(x_i, o_i) = o_i + \langle \beta, x_i \rangle,
\tag{5.25}
\]

for all $1 \le i \le n$. An offset $o_i$ does not change anything from a structural viewpoint; in fact, it could be integrated into the feature $x_i$ with a regression parameter that is identically equal to 1.
Offsets are frequently used in Poisson models with the (canonical) log-link choice to model multiplicative time exposures in claim frequency modeling. Under the log-link choice we receive from (5.25) the following mean function
\[
(x_i, o_i) \;\mapsto\; \mu(x_i, o_i) = \exp\{\eta(x_i, o_i)\} = \exp\{o_i + \langle \beta, x_i \rangle\} = \exp\{o_i\} \exp\langle \beta, x_i \rangle.
\]
In this version, the offset $o_i$ provides us with an exposure $\exp\{o_i\}$ that acts multiplicatively on the regression function. If $w_i = \exp\{o_i\}$ measures time, then $w_i$ is a so-called pro-rata temporis (proportional in time) exposure.
Remark 5.14 (Boosting) A popular machine learning technique in statistical mod-
eling is boosting. Boosting tries to step-wise adaptively improve a regression
model. Offsets (5.25) are a simple way of constructing boosted models. Assume
we have constructed a predictive model using any statistical model, and denote the resulting estimated means of $Y_i$ by $\hat{\mu}_i^{(0)}$. The idea of boosting is that we select another statistical model and we try to see whether this second model can still find systematic structure in the data which has not been found by the first model. In view of (5.25), we include the first model into the offset and we build a second model around this offset, that is, we may explore a GLM
\[
\hat{\mu}_i^{(1)} = g^{-1}\Big( g\big(\hat{\mu}_i^{(0)}\big) + \langle \beta, x_i \rangle \Big).
\]

If the first model is perfect we come up with a regression parameter $\beta = 0$; otherwise the linear predictor $\langle \beta, x_i \rangle$ of the second model starts to compensate for weaknesses in $\hat{\mu}_i^{(0)}$. Of course, this boosting procedure can then be iterated, and one should stop boosting before the resulting model starts to over-fit to the data. Typically, this approach is applied to regression trees instead of GLMs, see Ferrario–Hämmerli [125], Section 7.4 in Wüthrich–Buser [392], Lee–Lin [241] and Denuit et al. [100].
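The following minimal sketch illustrates one such boosting step with Poisson GLMs in R; the data frame dat with claim counts N, exposures v and features x1, x2 is hypothetical:

fit0 <- glm(N ~ x1 + offset(log(v)), family = poisson(), data = dat)  # first model
dat$off0 <- predict(fit0, type = "link")  # g(mu^(0)), includes the exposure offset
# the second model boosts the first one around the offset g(mu^(0))
fit1 <- glm(N ~ x2 + offset(off0), family = poisson(), data = dat)
# regression parameters of fit1 close to 0 indicate that the first model has
# already captured the systematic structure in x2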

5.2.4 Lab: Poisson GLM for Car Insurance Frequencies

We present a first GLM example. This example is based on French motor third
party liability (MTPL) insurance claim counts data. The data is described in detail
in Chap. 13.1; an excerpt of the available MTPL data is given in Listing 13.2. For the
moment we only consider claim frequency modeling. We use the following data: $N_i$ describes the number of claims, $v_i \in (0, 1]$ describes the duration of the insurance policy, and $\tilde{x}_i$ describes the available raw feature information of insurance policy $i$, see Listing 13.2.
We are going to model the claim counts $N_i$ with a Poisson GLM using the canonical link function of the Poisson model. In the Poisson approach there are two different ways to account for the duration of the insurance policy. Either we model $Y_i = N_i / v_i$ with the Poisson model of the EDF, see Sect. 2.2.2 and Remarks 2.13 (reproductive form), or we directly model $N_i$ with the Poisson distribution from the EF and treat the log-duration as an offset variable $o_i = \log v_i$. In the first approach we have for the log-link choice $g(\cdot) = h(\cdot) = \log(\cdot)$ and dispersion $\varphi = 1$
\[
Y_i = N_i / v_i \;\sim\; f(y_i; \theta_i, v_i) = \exp\left\{ \frac{y_i \langle \beta, x_i \rangle - e^{\langle \beta, x_i \rangle}}{1/v_i} + a(y_i; v_i) \right\},
\tag{5.26}
\]

where $x_i \in \mathcal{X}$ is the suitably pre-processed feature information of insurance policy $i$, and with canonical parameter $\theta_i = \eta(x_i) = \langle \beta, x_i \rangle$. In the second approach we include the log-duration as offset into the regression function and model $N_i$ with the Poisson distribution from the EF. Using notation (2.2) this gives us
\[
N_i \;\sim\; f(n_i; \theta_i) = \exp\left\{ n_i \left( \log v_i + \langle \beta, x_i \rangle \right) - e^{\log v_i + \langle \beta, x_i \rangle} + a(n_i) \right\}
\tag{5.27}
\]
\[
\phantom{N_i \;\sim\; f(n_i; \theta_i)} = \exp\left\{ \frac{(n_i / v_i) \langle \beta, x_i \rangle - e^{\langle \beta, x_i \rangle}}{1/v_i} + a(n_i) + n_i \log v_i \right\},
\]

with canonical parameter $\theta_i = \eta(x_i, o_i) = o_i + \langle \beta, x_i \rangle = \log v_i + \langle \beta, x_i \rangle$ for observation $n_i = v_i y_i$. That is, we receive the same model in both cases (5.26) and (5.27) under the canonical log-link choice for the Poisson GLM.
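A minimal sketch of this equivalence in R, assuming a hypothetical data frame dat with claim counts N, durations v and a feature x; the first call issues a warning about non-integer responses, but the estimated regression parameters of both fits coincide:

fit_rate  <- glm(N/v ~ x, family = poisson(), weights = v, data = dat)    # version (5.26)
fit_count <- glm(N ~ x + offset(log(v)), family = poisson(), data = dat)  # version (5.27)
# coef(fit_rate) and coef(fit_count) agree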
Finally, we make the assumption that all observations $N_i$ are independent. There remains the pre-processing of the raw features $\tilde{x}_i$ to features $x_i$ so that they can be used in a sensible way in the linear predictors $\eta_i = \eta(x_i, o_i) = o_i + \langle \beta, x_i \rangle$.

Feature Engineering

Categorical and Binary Variables: Dummy Coding

For categorical and binary variables we use dummy coding as described in Sect. 5.2.2. We have two categorical variables VehBrand and Region, as well as a binary variable VehGas, see Listing 13.2. We choose the first level as reference level, and the remaining levels are characterized by $(K-1)$-dimensional embeddings (5.21). This provides us with $K - 1 = 10$ parameters for VehBrand, $K - 1 = 21$ parameters for Region and $K - 1 = 1$ parameter for VehGas.
Figure 5.3 shows the empirical marginal frequencies $\hat{\lambda} = \sum_i N_i / \sum_i v_i$ on all levels of the categorical feature components VehBrand, Region and VehGas. Moreover, the blue areas (in the colored version) give confidence bounds of $\pm 2 \sqrt{\hat{\lambda} / \sum_i v_i}$ (under a Poisson assumption), see Example 3.22. The more narrow these confidence bounds, the bigger the volumes $\sum_i v_i$ behind these empirical marginal estimates.

Fig. 5.3 Empirical marginal frequencies on each level of the categorical variables (lhs)
VehBrand, (middle) Region, and (rhs) VehGas

Continuous Variables

We consider feature engineering of the continuous variables Area, VehPower, VehAge, DrivAge, BonusMalus and log-Density (Density on the log scale); note that we map the Area codes $(A, \ldots, F) \mapsto (1, \ldots, 6)$. Some of these variables do not show any monotonicity nor log-linearity in the empirical marginal frequency plots, see Fig. 5.4.
This non-monotonicity and non-log-linearity suggests, in a first step, building homogeneous classes for these feature components and using dummy coding for the resulting classes. We make the following choices here (motivated by the marginal graphs of Fig. 5.4):
• Area: continuous log-linear feature component for {A, . . . , F} → {1, . . . , 6};
• VehPower: discretize into categorical classes where we merge vehicle power
groups bigger and equal to 9 (totally K = 6 levels);
• VehAge: we build categorical classes [0, 6), [6, 13), [13, ∞) (totally K = 3
levels);
• DrivAge: we build categorical classes [18, 21), [21, 26), [26, 31), [31, 41),
[41, 51), [51, 71), [71, ∞) (totally K = 7 levels);
• BonusMalus: continuous log-linear feature component (we censor at 150);
• Density: log-density is chosen as continuous log-linear feature component.
This encoding is slightly different from Noll et al. [287] because of different data
cleaning. The discretization has been chosen quite ad-hoc by just looking at the
empirical plots; as illustrated in Section 6.1.6 of Wüthrich–Buser [392] regression
trees may provide an algorithmic way of choosing homogeneous classes of sufficient
volume. This provides us with a feature space (the initial component stands for the intercept $x_{i,0} = 1$ and the order of the terms is the same as in Listing 13.2)
\[
\mathcal{X} \subset \{1\} \times \mathbb{R} \times \{0,1\}^5 \times \{0,1\}^2 \times \{0,1\}^6 \times \mathbb{R} \times \{0,1\}^{10} \times \{0,1\} \times \mathbb{R} \times \{0,1\}^{21},
\]
of dimension $q + 1 = 1 + 1 + 5 + 2 + 6 + 1 + 10 + 1 + 1 + 21 = 49$. The R code [307] for this pre-processing of the continuous variables is shown in Listing 5.1; categorical variables do not need any special treatment because variables of factor type are considered internally in R by dummy coding. We call this model Poisson GLM1.

Choice of Learning and Test Samples

To measure predictive performance we follow the generalization approach as


proposed in Chap. 4. This requires that we partition our entire data into learning
sample L and test sample T , see Fig. 4.1. Model selection and model fitting will
be done on the learning sample L, only, and the test sample T is used to analyze
the generalization of the fitted models to unseen data. We partition the data at
random (non-stratified) in a ratio of 9 : 1, and we are going to hold on to the same
partitioning throughout this monograph whenever we study this example. The R
code used is given in Listing 5.2.
Fig. 5.4 Empirical marginal frequencies of the continuous variables: top row (lhs) Area, (middle) VehPower, (rhs) VehAge, and bottom row (lhs) DrivAge, (middle) BonusMalus, (rhs) log-Density, i.e., Density on the log scale; note that DrivAge and BonusMalus have a different y-scale in these plots

Listing 5.1 Pre-processing of features for model Poisson GLM1 in R


1 dat$AreaGLM <- as.integer(dat$Area)
2 dat$VehPowerGLM <- as.factor(pmin(dat$VehPower, 9))
3 dat$VehAgeGLM <- as.factor(cut(dat$VehAge, c(0,5,12,101),
4 labels = c("0-5","6-12","12+"),
5 include.lowest = TRUE))
6 dat$DrivAgeGLM <- as.factor(cut(dat$DrivAge, c(18,20,25,30,40,50,70,101),
7 labels = c("18-20","21-25","26-30","31-40","41-50",
8 "51-70","71+"), include.lowest = TRUE))
9 dat$BonusMalusGLM <- pmin(dat$BonusMalus, 150)
10 dat$DensityGLM <- log(dat$Density)

Table 5.2 shows the summary of the chosen partition into learning and test samples
\[
\mathcal{L} = \left\{ (Y_i = N_i / v_i,\, x_i,\, v_i) :\; i = 1, \ldots, n = 610'206 \right\},
\]
and
\[
\mathcal{T} = \left\{ (Y_t^\dagger = N_t^\dagger / v_t^\dagger,\, x_t^\dagger,\, v_t^\dagger) :\; t = 1, \ldots, T = 67'801 \right\}.
\]
In contrast to Sect. 4.2 we also include feature information and exposure information in $\mathcal{L}$ and $\mathcal{T}$.

Listing 5.2 Partition of the data to learning sample L and test sample T
1 RNGversion("3.5.0") # we use R version 3.5.0 for this partition
2 set.seed(500)
3 ll <- sample(c(1:nrow(dat)), round(0.9*nrow(dat)), replace = FALSE)
4 learn <- dat[ll,]
5 test <- dat[-ll,]

Table 5.2 Choice of learning data set L and test data set T; the empirical frequency on both data sets is similar (last column), and the split of the policies w.r.t. the numbers of claims is also rather similar

                     Numbers of observed claims                           Empirical
                     0        1       2      3      4        5           frequency
Learning sample L    96.32%   3.47%   0.19%  0.01%  0.0006%  0.0002%     7.36%
Test sample T        96.31%   3.50%   0.18%  0.01%  0.0015%  0.0015%     7.35%

Maximum-Likelihood Estimation and Results

The remaining step is to perform MLE to estimate the regression parameter $\beta \in \mathbb{R}^{q+1}$. This can be done either by maximizing the Poisson log-likelihood function or by minimizing the Poisson deviance loss. In view of (4.9) and Example 4.27, the Poisson deviance loss on the learning data $\mathcal{L}$ is given by
\[
\beta \;\mapsto\; D(\mathcal{L}, \beta) = \frac{2}{n} \sum_{i=1}^{n} v_i \left( \mu(x_i) - Y_i - Y_i \log \frac{\mu(x_i)}{Y_i} \right) \;\ge\; 0,
\tag{5.28}
\]

where the terms under the summation are set equal to $v_i \mu(x_i)$ for $Y_i = 0$, see (4.8), and we have the GLM regression function
\[
x \;\mapsto\; \mu(x) = \mu_\beta(x) = \exp\langle \beta, x \rangle.
\]
That is, we work under the canonical link with the canonical parameter being equal to the linear predictor. The MLE of $\beta$ is found by minimizing (5.28). This is done with Fisher's scoring method. In order to receive a non-degenerate solution we need to ensure that we have sufficiently many claims $Y_i > 0$, otherwise it might happen that the MLE provides a (degenerate) solution at the boundary of the effective domain $\boldsymbol{\Theta}$. We denote the MLE by $\hat{\beta}_{\mathcal{L}}^{\rm MLE}$, because it has been estimated on the learning data $\mathcal{L}$, only. This gives us the estimated regression function
\[
x \;\mapsto\; \hat{\mu}(x) = \hat{\mu}_{\hat{\beta}_{\mathcal{L}}^{\rm MLE}}(x) = \exp\big\langle \hat{\beta}_{\mathcal{L}}^{\rm MLE}, x \big\rangle.
\]

We emphasize that we only use the learning data $\mathcal{L}$ for this model fitting. In view of Definition 4.24 we receive the in-sample and out-of-sample Poisson deviance losses
\[
D\big(\mathcal{L}, \hat{\beta}_{\mathcal{L}}^{\rm MLE}\big) = \frac{2}{n} \sum_{i=1}^{n} v_i \left( \hat{\mu}(x_i) - Y_i - Y_i \log \frac{\hat{\mu}(x_i)}{Y_i} \right) \;\ge\; 0,
\]
\[
D\big(\mathcal{T}, \hat{\beta}_{\mathcal{L}}^{\rm MLE}\big) = \frac{2}{T} \sum_{t=1}^{T} v_t^\dagger \left( \hat{\mu}(x_t^\dagger) - Y_t^\dagger - Y_t^\dagger \log \frac{\hat{\mu}(x_t^\dagger)}{Y_t^\dagger} \right) \;\ge\; 0.
\]

We implement this GLM on the data of Listing 5.1 (and including the categorical features) in R using the function glm [307]; a short overview of the results is presented in Listing 5.3. This overview presents the regression model implemented, an excerpt of the parameter estimates $\hat{\beta}_{\mathcal{L}}^{\rm MLE}$, and standard errors which are received from the square-rooted diagonal entries of the inverse of the estimated Fisher's information matrix $\mathcal{I}_n(\hat{\beta}_{\mathcal{L}}^{\rm MLE})$, see (5.17); the remaining columns will be described in Sect. 5.3.2 on the Wald test (5.33). The bottom line of the output says that Fisher's scoring algorithm has converged in 6 iterations; it gives the in-sample deviance loss $n D(\mathcal{L}, \hat{\beta}_{\mathcal{L}}^{\rm MLE})$ called Residual deviance (not being scaled by the number of

Listing 5.3 Results in model Poisson GLM1 using the R command glm
1 Call:
2 glm(formula = ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAgeGLM +
3 BonusMalusGLM + VehBrand + VehGas + DensityGLM + Region +
4 AreaGLM, family = poisson(), data = learn, offset = log(Exposure))
5
6 Deviance Residuals:
7 Min 1Q Median 3Q Max
8 -1.4728 -0.3256 -0.2456 -0.1383 7.7971
9
10 Coefficients:
11 Estimate Std. Error z value Pr(>|z|)
12 (Intercept) -4.8175439 0.0579296 -83.162 < 2e-16 ***
13 VehPowerGLM5 0.0604293 0.0229841 2.629 0.008559 **
14 VehPowerGLM6 0.0868252 0.0225509 3.850 0.000118 ***
15 . . .
16 . . .
17 RegionR93 0.1388160 0.0294901 4.707 2.51e-06 ***
18 RegionR94 0.1918538 0.0938250 2.045 0.040874 *
19 AreaGLM 0.0407973 0.0200818 2.032 0.042199 *
20 ---
21 Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
22
23 (Dispersion parameter for poisson family taken to be 1)
24
25 Null deviance: 153852 on 610205 degrees of freedom
26 Residual deviance: 147069 on 610157 degrees of freedom
27 AIC: 192818
28
29 Number of Fisher Scoring iterations: 6

Table 5.3 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses, tenfold cross-validation losses with empirical standard deviation in brackets, see also (4.36), (units are in $10^{-2}$), and the in-sample average frequency of the null model (Poisson intercept model, see Example 4.27) and of model Poisson GLM1

               Run time   # Param.   AIC       In-sample    Out-of-sample   Tenfold CV       Aver.
                                               loss on L    loss on T       loss D^CV        freq.
Poisson null   –          1          199'506   25.213       25.445          25.213 (0.234)   7.36%
Poisson GLM1   16 s       49         192'818   24.101       24.146          24.121 (0.245)   7.36%

observations), as well as Akaike's Information Criterion (AIC), see Sect. 4.2.3 for AIC. Note that we have implemented the Poisson version (5.27) with the exposures entering the offset, see lines 2–4 of Listing 5.3; this is important for understanding AIC being calculated on the (unscaled) claim counts $N_i$.
Table 5.3 summarizes the results of model Poisson GLM1 and it compares the
figures to the null model (only having an intercept β0 ); the null model has already
been introduced in Example 4.27. We present the run time needed to fit the model,3
the number of regression parameters q + 1 in β ∈ Rq+1 , AIC, in-sample and
out-of-sample deviance losses, as well as tenfold cross-validation losses on the

3 All run times are measured on a personal laptop Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz

1.99 GHz with 16 GB RAM, and they only correspond to fitting the model (or the corresponding
step) once, i.e., they do not account for multiple runs, for instance, for K-fold cross-validation.

learning data $\mathcal{L}$. For tenfold cross-validation we always use the same (non-stratified) partition of $\mathcal{L}$ (in all examples in this monograph), and in brackets we show the empirical standard deviation received by (4.36). Tenfold cross-validation would not be necessary in this case because we have test data $\mathcal{T}$ on which we can evaluate the out-of-sample deviance GL. We present both figures to back-test whether tenfold cross-validation works properly in our example. We observe that the out-of-sample deviance losses $D(\mathcal{T}, \hat{\beta}_{\mathcal{L}}^{\rm MLE})$ are within one empirical standard deviation of the tenfold cross-validation losses $\hat{D}^{\rm CV}$, which supports this methodology of model comparison.
From Table 5.3 we conclude that we should prefer model Poisson GLM1 over the null model; this decision is supported by a smaller AIC, a smaller out-of-sample deviance loss $D(\mathcal{T}, \hat{\beta}_{\mathcal{L}}^{\rm MLE})$ as well as a smaller cross-validation loss $\hat{D}^{\rm CV}$. The last column of Table 5.3 confirms that the estimated model meets the balance property (we work with the canonical link here). Note that this balance property should be fulfilled for two reasons. Firstly, we would like to have the overall portfolio price on the right level, and secondly, deviance losses should only be compared on the same overall frequency, see Example 4.10.
Before we continue to introduce more models to challenge model Poisson
GLM1, we are going to discuss statistical tools for model evaluation. Of course,
we would like to know whether model Poisson GLM1 is a good model for this data
or whether it is just the better model of two bad options.
Remark 5.15 (Prior and Posterior Information) Pricing literature distinguishes
between prior feature information and posterior feature information, see Verschuren
[372]. Prior feature information is available at the inception of the (new) insurance
contract before having any claims history. This includes, for instance, age of driver,
vehicle brand, etc. For policy renewals, past claims history is available and prices
of policy renewals can also be based on such posterior information. Past claims
history has led to the development of so-called bonus-malus systems (BMS) which
often are in the form of multiplicative factors to the base premium to reward and
punish good and bad past experience, respectively. One stream of literature studies
optimal designs of BMS, we refer to Loimaranta [255], De Pril [91], Lemaire [245],
Denuit et al. [102], Brouhns et al. [57], Pinquet [304], Pinquet et al. [305], Tzougas
et al. [360] or Ágoston–Gyetvai [4]. Another stream of literature studies how one
can optimally extract predictive information from an existing BMS, see Boucher–
Inoussa [46], Boucher–Pigeon [47] and Verschuren [372].
The latter is basically what we also do in the above example: note that we include the variable BonusMalus in the feature information and, thus, we use past claims information to predict future claims. For new policies, the bonus-malus level is at 100%, and our information does not allow us to clearly distinguish between new
policies and policy renewals for drivers that have posterior information reflected by
a bonus-malus level of 100%. Since young drivers are more likely new customers we
expect interactions between the driver’s age variable and the bonus-malus level, this
intuition is supported by Fig. 13.12 (lhs). In order to improve our model, we would
require more detailed information about past claims history. Note that we do not strictly distinguish between prior and posterior information here. If we go over
to a time-series consideration, where more and more claims experience becomes
available of an individual driver, we should clearly distinguish the different sets of
information, because otherwise it may happen that in prior and posterior pricing
factors we correct twice for the same factor; an interesting paper is Corradin et
al. [82].
We also mention that a new source of posterior information is emerging through
the collection of telematics car driving data. Telematics car driving data leads to a
completely new way of posterior information rate making (experience rating), we
refer to Ayuso et al. [17–19], Boucher et al. [42], Lemaire et al. [246] and Denuit
et al. [98]. We mention the papers of Gao et al. [152, 154] and Meng et al. [271]
who directly extract posterior feature information from telematics car driving data
in order to improve rate making. This approach combines a Poisson GLM with a
network extractor for the telematics car driving data.

5.3 Model Validation

One of the purposes of Chap. 4 has been to describe measures to analyze how well
a fitted model generalizes to unseen data. In a proper generalization analysis this
requires learning data L for in-sample model fitting and a test sample T for an
out-of-sample generalization analysis. In many cases, one is not in the comfortable
situation of having a test sample. In such situations one can use AIC that tries to
correct the in-sample figure for model complexity or, alternatively, K-fold cross-
validation as used in Table 5.3.
The purpose of this section is to introduce diagnostic tools for fitted models; these
are often based on unit deviances d(Yi , μi ), which play the role of squared residuals
in classical linear regression. Moreover, we discuss parameter and model selection,
for instance, by step-wise backward elimination or forward selection using the
analysis of variance (ANOVA) or the likelihood ratio test (LRT).

5.3.1 Residuals and Dispersion

Within the EDF we distinguish two different types of residuals. The first type of residuals is based on the unit deviances $d(Y_i, \mu_i)$ studied in (4.7). The deviance residuals are given by
\[
r_i^D = \mathrm{sign}(Y_i - \mu_i) \sqrt{\frac{v_i}{\varphi}\, d(Y_i, \mu_i)}.
\]
Secondly, Pearson's residuals are given by, see also (4.12),
\[
r_i^P = \sqrt{\frac{v_i}{\varphi}}\; \frac{Y_i - \mu_i}{\sqrt{V(\mu_i)}}.
\]

In the Gaussian case the two residuals coincide. This indicates that Pearson's residuals are most appropriate in the Gaussian case because they respect the distributional properties in that case. For other distributions, Pearson's residuals can be markedly skewed, as stated in Section 2.4.2 of McCullagh–Nelder [265], and therefore may fail to have properties similar to Gaussian residuals. Another issue occurs in Pearson's residuals when the denominator involves an estimated standard deviation $\sqrt{V(\hat{\mu}_i)}$, for instance, if we work in a small frequency Poisson problem. Estimation uncertainty in small denominators of Pearson's residuals may substantially distort the estimated residuals. For this reason, we typically work with (the more robust) deviance residuals; this is related to the discussion in Chap. 4 on MSEPs versus expected deviance GLs, see Remarks 4.6.
The squared residuals provide the unit deviance and the weighted square loss, respectively,
\[
\big(r_i^D\big)^2 = \frac{v_i}{\varphi}\, d(Y_i, \mu_i)
\qquad \text{and} \qquad
\big(r_i^P\big)^2 = \frac{v_i}{\varphi}\, \frac{(Y_i - \mu_i)^2}{V(\mu_i)},
\]
the latter corresponding to Pearson's $\chi^2$-statistic, see (4.12).


Example 5.16 (Residuals in the Poisson Case) In the Poisson case, Pearson's $\chi^2$-statistic is for $v_i = \varphi = 1$ given by
\[
\big(r_i^P\big)^2 = \frac{(Y_i - \mu_i)^2}{\mu_i},
\]
because we have variance function $V(\mu) = \mu$. A second order Taylor expansion around $Y_i$ on the scale $\mu_i^{1/3}$ (for $\mu_i$) provides an approximation to the unit deviances in the Poisson case, see formula (6.4) and Figure 6.2 in McCullagh–Nelder [265],
\[
d(Y_i, \mu_i) \;\approx\; 9\, Y_i^{1/3} \left( Y_i^{1/3} - \mu_i^{1/3} \right)^2 .
\tag{5.29}
\]

This emphasizes the different behaviors around the observation $Y_i$ of the two types of residuals in the Poisson case. The scale $\mu_i^{1/3}$ has been motivated in McCullagh–


Fig. 5.5 Log-likelihoods $\ell_Y(\mu)$ in $Y = 1$ as a function of $\mu$, plotted against (lhs) $\mu^{1/3}$ in the Poisson case, (middle) $\mu^{-1/3}$ in the gamma case with shape parameter $\alpha = 1$, and (rhs) $\mu^{-1}$ in the inverse Gaussian case with $\alpha = 1$

Nelder [265] by providing a symmetric behavior around the mode in $Y_i = 1$ of the resulting log-likelihood function, see Fig. 5.5 (lhs).
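A small numerical check of approximation (5.29) in R:

y <- 1
mu <- c(0.5, 0.75, 1, 1.25, 1.5)
d_exact  <- 2 * (mu - y - y * log(mu / y))        # Poisson unit deviance
d_approx <- 9 * y^(1/3) * (y^(1/3) - mu^(1/3))^2  # approximation (5.29)
round(cbind(mu, d_exact, d_approx), 4)            # close agreement around mu = y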


The explicit calculation of the residuals requires knowledge of the dispersion parameter $\varphi > 0$. In the Poisson Example 5.16 this dispersion parameter has been set equal to 1 because the Poisson model does neither allow for under- nor for over-dispersion. Typically, this is not the case for other models, and this requires determination of the dispersion parameter if we want to simulate from these other models. So far, this dispersion parameter has been treated as a nuisance parameter and, in fact, it canceled in MLE (because it was assumed to be constant), see Proposition 5.1.
If we need to estimate the dispersion parameter, we can either do this within MLE, see Remarks 5.2, or we can use Pearson's or the deviance estimate, respectively,
\[
\hat{\varphi}^P = \frac{1}{n - (q+1)} \sum_{i=1}^{n} \frac{(Y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)/v_i}
\qquad \text{and} \qquad
\hat{\varphi}^D = \frac{1}{n - (q+1)} \sum_{i=1}^{n} v_i\, d(Y_i, \hat{\mu}_i),
\tag{5.30}
\]

where $\hat{\mu}_i = \hat{\mu}(x_i)$ are the MLE estimated means involving the $q+1$ estimated parameters $\hat{\beta}^{\rm MLE} \in \mathbb{R}^{q+1}$. We briefly motivate these choices. Firstly, Pearson's estimate $\hat{\varphi}^P$ is consistent for $\varphi$; note that in the Gaussian case this is just the standard estimate for the variance parameter. Justification of the deviance dispersion estimate is more challenging. Consider the unscaled deviance with $\hat{\boldsymbol{\mu}}_n = (\hat{\mu}_1, \ldots, \hat{\mu}_n)^\top$, see (4.9),
\[
n \varphi D(\boldsymbol{Y}_n, \hat{\boldsymbol{\mu}}_n) = \sum_{i=1}^{n} v_i\, d(Y_i, \hat{\mu}_i).
\]


Fig. 5.6 Expected unit deviance $v\,\mathbb{E}_\mu[d(Y, \mu)]$ in the Poisson case as a function of $\mathbb{E}[N] = \mathbb{E}[vY] = v\mu$; the two plots only differ in the scale on the x-axis

This statistic is under certain assumptions asymptotically $\varphi \chi^2_{n-(q+1)}$-distributed, where $\chi^2_{n-(q+1)}$ denotes a $\chi^2$-distribution with $n-(q+1)$ degrees of freedom. Thus, this approximation gives us an expected value of $\varphi (n - (q+1))$. This exactly justifies the deviance dispersion estimate (5.30) in these cases. However, as stated in the last paragraph of Section 2.3 of McCullagh–Nelder [265], often a $\chi^2$-approximation is not suitable even as $n \to \infty$. We give an example.
Example 5.17 (Poisson Unit Deviances) The deviance statistics in the Poisson model with means $\boldsymbol{\mu}_n = (\mu_1, \ldots, \mu_n)^\top$ is given by
\[
D(\boldsymbol{Y}_n, \boldsymbol{\mu}_n) = \frac{1}{n} \sum_{i=1}^{n} v_i\, d(Y_i, \mu_i) = \frac{1}{n} \sum_{i=1}^{n} 2 v_i \left( \mu_i - Y_i - Y_i \log \frac{\mu_i}{Y_i} \right);
\]
note that in the Poisson model we have (by definition) $\varphi = 1$. We evaluate the expected value of this deviance statistics. It is given by
\[
\mathbb{E}_{\boldsymbol{\mu}_n}\!\left[ D(\boldsymbol{Y}_n, \boldsymbol{\mu}_n) \right]
= \frac{1}{n} \sum_{i=1}^{n} 2 v_i\, \mathbb{E}_{\mu_i}\!\left[ \mu_i - Y_i - Y_i \log \frac{\mu_i}{Y_i} \right]
= \frac{1}{n} \sum_{i=1}^{n} 2\, \mathbb{E}_{\mu_i}\!\left[ N_i \log \frac{N_i}{v_i \mu_i} \right],
\]
with $N_i \stackrel{\text{ind.}}{\sim} \mathrm{Poi}(v_i \mu_i)$.
In Fig. 5.6 we plot the expected unit deviance $v\mu \mapsto v\,\mathbb{E}_\mu[d(Y, \mu)]$ in the Poisson model. In our example of Table 5.3, we have $\mathbb{E}_\mu[vY] = v\mu \approx 3.89\%$, which results in an expected unit deviance of $v\,\mathbb{E}_\mu[d(Y, \mu)] \approx 25.52 \cdot 10^{-2} \ll 1$. This is in line with the losses in Table 5.3. Thus, the expected deviance $n\,\mathbb{E}_{\boldsymbol{\mu}_n}[D(\boldsymbol{Y}_n, \boldsymbol{\mu}_n)] \approx n/4$; therefore it is substantially smaller than $n$. But this implies that $n D(\boldsymbol{Y}_n, \boldsymbol{\mu}_n)$ cannot be asymptotically $\chi^2_{n-(q+1)}$-distributed because the latter has an expected value of $n - (q+1) \approx n$ for $n \to \infty$. In fact, the deviance dispersion estimate is not consistent in this example, and for a consistent estimate one should rely on Pearson's dispersion estimate.
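This expected unit deviance can be evaluated numerically; a small sketch in R:

lambda <- 0.0389                                 # expected claim count E[N] = v * mu
n <- 1:50                                        # the term for n = 0 vanishes
2 * sum(dpois(n, lambda) * n * log(n / lambda))  # approximately 0.255, as stated above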
In order to have an asymptotic $\chi^2$-distribution we need to have large volumes $v$, because then a saddlepoint approximation holds that allows us to approximate the (scaled) unit deviances by $\chi^2$-distributions, see Sect. 5.5.2, below.
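From a fitted glm object fit, both dispersion estimates (5.30) are obtained directly; a minimal sketch:

phi_P <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)  # Pearson's estimate
phi_D <- deviance(fit) / df.residual(fit)                            # deviance estimate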

5.3.2 Hypothesis Testing

Consider a sub-vector $\beta_r \in \mathbb{R}^r$ of the GLM parameter $\beta \in \mathbb{R}^{q+1}$, for $r < q+1$. We would like to understand whether we can set this sub-vector $\beta_r = 0$ and, at the same time, not lose any generalization power. Thus, we investigate whether there is a simpler nested GLM that provides a similar prediction accuracy. If this is the case, preference should be given to the simpler model because the bigger model seems over-parametrized (has redundancy, is not parsimonious). This section is based on Section 2.2.2 of Fahrmeir–Tutz [123].
Geometric Interpretation We begin by giving a geometric interpretation. We start from the full model being expressed by the design matrix $X \in \mathbb{R}^{n \times (q+1)}$. This design matrix together with the link function $g$ generates a $(q+1)$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^n$ given by, see (5.19) and Fig. 5.2,
\[
\mathcal{M} = \left\{ \mu = g^{-1}(X\beta) = \left( g^{-1}\langle \beta, x_1 \rangle, \ldots, g^{-1}\langle \beta, x_n \rangle \right)^\top \in \mathbb{R}^n \;\Big|\; \beta \in \mathbb{R}^{q+1} \right\} \subset \mathbb{R}^n .
\]
The MLE $\hat{\beta}^{\rm MLE}$ is determined by the point in $\mathcal{M}$ that minimizes the distance to $Y$, where the distance between $Y$ and $\mathcal{M}$ is measured component-wise by $\frac{v_i}{\varphi} d(Y_i, \mu_i)$ with $\mu \in \mathcal{M}$, i.e., w.r.t. the KL divergence.
Assume, now, that we want to drop the components $\beta_r$ in $\beta$, i.e., we want to drop these columns from the design matrix, resulting in a smaller design matrix $X_r \in \mathbb{R}^{n \times (q+1-r)}$. This generates a $(q+1-r)$-dimensional nested manifold $\mathcal{M}_r \subset \mathcal{M}$ described by
\[
\mathcal{M}_r = \left\{ \mu = g^{-1}(X_r \beta) \in \mathbb{R}^n \;\Big|\; \beta \in \mathbb{R}^{q+1-r} \right\} \subset \mathcal{M} .
\]

If the distance of $Y$ to $\mathcal{M}_r$ and to $\mathcal{M}$ is roughly the same, we should go for the smaller model. In the Gaussian case of Example 5.9 this can be explained by the Pythagorean theorem applied to successive orthogonal projections. In the general unit deviance case, this has to be studied in terms of information geometry considering the KL divergence, see Sect. 2.3.

Likelihood Ratio Test (LRT) We consider the testing problem of the null hypothesis $H_0$ against the alternative hypothesis $H_1$
\[
H_0 :\; \beta_r = 0 \qquad \text{against} \qquad H_1 :\; \beta_r \neq 0.
\tag{5.31}
\]
Denote by $\hat{\beta}^{\rm MLE}$ the MLE under the full model and by $\hat{\beta}^{\rm MLE}_{(-r)}$ the MLE under the null hypothesis model. Define the (log-)likelihood ratio test (LRT) statistics
\[
\Lambda = -2 \left( \ell_{Y}\big(\hat{\beta}^{\rm MLE}_{(-r)}\big) - \ell_{Y}\big(\hat{\beta}^{\rm MLE}\big) \right) \;\ge\; 0.
\]

The inequality holds because the null hypothesis model is nested in the full model; hence, the latter needs to have a bigger log-likelihood value in the MLE. If the LRT statistics $\Lambda$ is large, the null hypothesis should be rejected because the reduced model is not competitive compared to the full model. More mathematically, under similar conditions as for the asymptotic normality results of the MLE of $\beta$ in (5.17), we have that under the null hypothesis $H_0$ the LRT statistics $\Lambda$ is asymptotically $\chi^2$-distributed with $r$ degrees of freedom. Therefore, we should reject the null hypothesis in favor of the full model if the resulting p-value of $\Lambda$ under the $\chi^2_r$-distribution is too small. These results remain true if the unknown dispersion parameter $\varphi$ is replaced by a consistent estimator $\hat{\varphi}$, e.g., Pearson's dispersion estimate $\hat{\varphi}^P$ (from the bigger model).
The LRT statistics may not be properly defined in over-dispersed situations
where the distributional assumptions are not fully specified, for instance, in an over-
dispersed Poisson model. In such situations, one usually divides the log-likelihood
(of the Poisson model) by the estimated over-dispersion and then uses the resulting
scaled LRT statistics as an approximation to the unspecified model.
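A minimal sketch of the LRT in R, assuming a null hypothesis model fit0 nested in a full model fit1, both fitted with glm:

Lambda <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))   # LRT statistics
r <- attr(logLik(fit1), "df") - attr(logLik(fit0), "df")  # degrees of freedom
pchisq(Lambda, df = r, lower.tail = FALSE)                # p-value
# equivalently: anova(fit0, fit1, test = "LRT")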

Wald Test Alternatively, we can use the Wald statistics. The Wald statistics uses a second order approximation to the log-likelihood and, therefore, is only based on the first two moments (and not on the entire distribution). Define the matrix $I_r \in \mathbb{R}^{r \times (q+1)}$ such that $\beta_r = I_r \beta$, i.e., matrix $I_r$ selects exactly the components of $\beta$ that are included in $\beta_r$ (and which are set to 0 under the null hypothesis $H_0$ given in (5.31)).
Asymptotic normality (5.17) motivates consideration of the Wald statistics
\[
W = \big( I_r \hat{\beta}^{\rm MLE} - 0 \big)^\top \left( I_r\, \mathcal{I}\big(\hat{\beta}^{\rm MLE}\big)^{-1} I_r^\top \right)^{-1} \big( I_r \hat{\beta}^{\rm MLE} - 0 \big).
\tag{5.32}
\]

The Wald statistics measures the distance between the MLE of the full model restricted to the components of $\beta_r$, i.e., $I_r \hat{\beta}^{\rm MLE}$, and the null hypothesis $H_0$ (being $\beta_r = 0$). The estimated Fisher's information matrix $\mathcal{I}(\hat{\beta}^{\rm MLE})$ is used to bring all components onto the same unit scale (and to account for collinearity). The Wald statistics $W$ is asymptotically $\chi^2_r$-distributed under the same assumptions as for (5.17) to hold. Thus, the null hypothesis $H_0$ should be rejected if the resulting p-value of $W$ under the $\chi^2_r$-distribution is too small. Note that this test does not require calculation of the MLE in the null hypothesis model, i.e., this test is computationally more attractive than the LRT because we only need to fit one model. Again, an unknown dispersion parameter $\varphi$ in Fisher's information matrix $\mathcal{I}(\beta)$ is replaced by a consistent estimator $\hat{\varphi}$ (from the bigger model).
In the special case of considering only one component of $\beta$, i.e., if $\beta_r = \beta_k$ with $r = 1$ for one selected component $0 \le k \le q$, the Wald statistics reduces to
\[
W_k = \frac{\big(\hat{\beta}_k^{\rm MLE}\big)^2}{\hat{\sigma}_k^2}
\qquad \text{or} \qquad
T_k = W_k^{1/2} = \frac{\hat{\beta}_k^{\rm MLE}}{\hat{\sigma}_k},
\tag{5.33}
\]
with the diagonal entries of the inverse of the estimated Fisher's information matrix given by $\hat{\sigma}_k^2 = \big( \mathcal{I}(\hat{\beta}^{\rm MLE})^{-1} \big)_{k,k}$, $0 \le k \le q$. The square-roots of these estimates are provided in column Std. Error of the R output in Listing 5.3.
In this case the Wald statistics $W_k$ is equal to the square of the t-statistics $T_k$; this t-statistics is provided in column z value of the R output of Listing 5.3. Remark that Fisher's information matrix involves the dispersion parameter $\varphi$. If this dispersion parameter is estimated with a consistent estimator $\hat{\varphi}$ we have a t-statistics. For known dispersion parameter the t-statistics reduces to a z-statistics, i.e., the corresponding p-values can be calculated from a normal distribution instead of a t-distribution. In the Poisson case, the dispersion $\varphi = 1$ is known, and for this reason we perform a z-test (and not a t-test) in the last column of Listing 5.3; and we call $T_k$ a z-statistics in that case.
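A minimal sketch computing the statistics (5.33) in R from a fitted glm object fit:

se  <- sqrt(diag(vcov(fit)))  # sigma_k from the inverse of the estimated Fisher information
T_k <- coef(fit) / se         # the 'z value' column of summary(fit)
W_k <- T_k^2                  # Wald statistics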

5.3.3 Analysis of Variance

In the previous section, we have presented tests that allow for model selection in the case of nested models. More generally, if we have a full model, say, based on regression parameter $\beta \in \mathbb{R}^{q+1}$, we would like to select the "best" sub-model according to some selection criterion. In most cases, it is computationally not feasible to fit all sub-models if $q$ is large; therefore, this is not a practical solution. For large models and data sets, step-wise procedures are a feasible tool. Backward elimination starts from the full model, and then recursively drops feature components which have high p-values in the corresponding Wald statistics (5.32) and (5.33). Performing this recursively will provide us with a hierarchy of nested models. Forward selection works just in the opposite direction, that is, we start with the null model and we include feature components one after the other that have a low p-value in the corresponding Wald statistics.

Remarks 5.18
• The order of the inclusion/exclusion of the feature components matters in these selection algorithms because we do not have additivity in this selection process. For this reason, backward elimination and forward selection are often combined in an alternating way.
• This process as well as the tests from Sect. 5.3.2 are based on a fixed pre-processing of features. If the feature pre-processing is done differently, the whole analysis needs to be repeated for the new model. Moreover, between two different models (if they are not nested) we need to apply other tools for model selection, for instance, AIC, cross-validation or an out-of-sample generalization analysis.
• For categorical variables with dummy coding we should apply the forward
selection or the backward elimination simultaneously on the entire dummy coded
vector of a categorical variable. This will include or exclude this variable; if we
only apply the Wald test to one component of the dummy vector, then we test
whether this level should be merged with the reference level.
Typically, in practice, a so-called analysis of variance (ANOVA) table is studied.
The ANOVA table is mainly motivated by the Gaussian model with orthogonal
data. The Gaussian assumption implies that the deviance loss is equal to the
square loss and the orthogonality implies that the square loss decouples in an
additive way w.r.t. the feature components. This implies that one can explicitly
study the contribution of each feature component to the decrease in square loss;
an example is given in Section 2.3.2 of McCullagh–Nelder [265]. In non-Gaussian
and non-orthogonal situations one loses this additivity property and, as mentioned
in Remarks 5.18, the order of inclusion matters. Therefore, for the ANOVA table
we pre-specify the order in which the components are included and then we analyze
the decrease of deviance loss by the inclusion of additional components.
Example 5.19 (Poisson GLM1, Revisited) We revisit the MTPL claim frequency
example of Sect. 5.2.4 to illustrate the variable selection procedures. Based on the
model presented in Listing 5.3 we run an ANOVA analysis using the R command
anova, the results are presented in Listing 5.4.
Listing 5.4 shows the hierarchy of models starting from the null model by
sequentially including feature components one by one. The column Df gives the
number of regression parameters involved and the column Deviance the decrease
of deviance loss by the inclusion of this feature component. The biggest model
improvements are provided by the bonus-malus level and driver’s age, this is not
surprising in view of the empirical analysis in Figs. 5.3 and 5.4, and in Chap. 13.1.
At the other end we have the Area code which only seems to improve the model
marginally. However, this does not imply, yet, that this variable should be dropped.
There are two points that need to be considered: (1) maybe feature pre-processing
of Area has not been done in an appropriate way and the variable is not in the
right functional form for the chosen link function; and (2) Area is the last variable
included in the model in Listing 5.4 and, maybe, there are already other variables

Listing 5.4 ANOVA table of model Poisson GLM1


1 Analysis of Deviance Table
2
3 Model: poisson, link: log
4
5 Response: ClaimNb
6
7 Terms added sequentially (first to last)
8
9
10 Df Deviance Resid. Df Resid. Dev
11 NULL 610205 153852
12 VehPowerGLM 5 73.7 610200 153779
13 VehAgeGLM 2 179.7 610198 153599
14 DrivAgeGLM 6 1199.4 610192 152400
15 BonusMalusGLM 1 4300.6 610191 148099
16 VehBrand 10 240.3 610181 147859
17 VehGas 1 82.4 610180 147776
18 DensityGLM 1 512.1 610179 147264
19 Region 21 191.3 610158 147073
20 AreaGLM 1 4.1 610157 147069

that take over the role of Area in smaller models, which is possible if we have correlations between the feature components. In our data, Area and Density are highly correlated. For this reason, we exchange the order of these two components and run the same analysis again; we call this model Poisson GLM1B (which of course provides the same predictive model as Poisson GLM1).

Listing 5.5 ANOVA table of model Poisson GLM1B


1 Analysis of Deviance Table
2
3 Model: poisson, link: log
4
5 Response: ClaimNb
6
7 Terms added sequentially (first to last)
8
9
10 Df Deviance Resid. Df Resid. Dev
11 NULL 610205 153852
12 VehPowerGLM 5 73.7 610200 153779
13 VehAgeGLM 2 179.7 610198 153599
14 DrivAgeGLM 6 1199.4 610192 152400
15 BonusMalusGLM 1 4300.6 610191 148099
16 VehBrand 10 240.3 610181 147859
17 VehGas 1 82.4 610180 147776
18 AreaGLM 1 505.0 610179 147271
19 Region 21 192.4 610158 147079
20 DensityGLM 1 10.1 610157 147069

Listing 5.5 shows the ANOVA table if we exchange the order of these two variables. We observe that the magnitudes of the decrease of the deviance loss have switched between the two variables. Overall, Density seems slightly more predictive, and we may consider dropping Area from the model, also because the correlation between Density and Area is very high.
If we want to perform backward elimination (sequentially drop one variable after
the other) we can use the R command drop1. For small models this is doable, for
larger models it is computationally demanding.

Listing 5.6 drop1 analysis of model Poisson GLM1


1 Single term deletions
2
3 Model:
4 ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAgeGLM + BonusMalusGLM +
5 VehBrand + VehGas + DensityGLM + Region + AreaGLM
6 Df Deviance AIC LRT Pr(>Chi)
7 <none> 147069 192818
8 VehPowerGLM 5 147152 192892 83.4 < 2.2e-16 ***
9 VehAgeGLM 2 147283 193028 214.1 < 2.2e-16 ***
10 DrivAgeGLM 6 147603 193341 534.5 < 2.2e-16 ***
11 BonusMalusGLM 1 150970 196718 3901.5 < 2.2e-16 ***
12 VehBrand 10 147298 193027 228.9 < 2.2e-16 ***
13 VehGas 1 147213 192961 144.5 < 2.2e-16 ***
14 DensityGLM 1 147079 192826 10.1 0.001459 **
15 Region 21 147259 192967 190.7 < 2.2e-16 ***
16 AreaGLM 1 147073 192820 4.1 0.042180 *
17 ---
18 Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

In Listing 5.6 we present the results of this drop1 analysis. Both, according to
AIC and according to the LRT, we should keep all variables in the model. Again,
Area and Density provide the smallest LRT statistics which illustrates the
high collinearity between these two variables (note that the values in Listing 5.6 are
identical to the ones in Listings 5.4 and 5.5, respectively).
We conclude that in model Poisson GLM1 we should keep all feature com-
ponents, and a model improvement can only be obtained by a different feature
pre-processing, by a different regression function or by a different distributional
model. 

5.3.4 Lab: Poisson GLM for Car Insurance Frequencies, Revisited

Continuous Coding of Non-monotone Feature Components

We revisit model Poisson GLM1 studied in Sect. 5.2.4 for MTPL claim frequency
modeling, and we consider additional competing models by using different feature
pre-processing. From Example 5.19, above, we conclude that we should keep all
variables in the model if we work with model Poisson GLM1.

Table 5.4 Contingency table of the observed number of policies against the predicted number of policies with given claim counts ClaimNb

                               Numbers of claims ClaimNb
                               0         1        2      3   4  5
Observed number of policies    587'772   21'198   1'174  57  4  1
Predicted number of policies   587'325   22'064   779    34  3  0.3

We calculate Pearson's dispersion estimate which provides $\hat{\varphi}^P = 1.6697 > 1$. This indicates that the model is not fully suitable for our data because in a Poisson model the dispersion parameter should be equal to 1. There may be two reasons for this over-dispersion: (1) the Poisson assumption is not appropriate because, for instance, the tail of the observations is more heavy-tailed, or (2) the Poisson assumption is appropriate but the regression function has not been chosen in a fully suitable way (maybe also due to missing feature information).
We believe that in our example the observed over-dispersion is a mixture of
the two reasons (1) and (2). Surely, the regression structure can be improved since
our feature pre-processing is non-optimal and since the chosen regression function
only considers multiplicative interactions between the feature components (we have
chosen the log-link regression function without adding interaction terms to the
regression function).
Table 5.4 gives a contingency table. We observe that we have many more policies with more than 1 claim compared to what is predicted by the fitted model. As a result, a $\chi^2$-test rejects this Poisson model because the resulting p-value is close to 0.
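This test can be sketched in R from the figures of Table 5.4; the choice of the degrees of freedom is a simplification here, as it does not account for the estimated regression parameters:

obs  <- c(587772, 21198, 1174, 57, 4, 1)    # observed numbers of policies, Table 5.4
pred <- c(587325, 22064, 779, 34, 3, 0.3)   # predicted numbers of policies
chi2 <- sum((obs - pred)^2 / pred)          # Pearson's chi^2-statistic
pchisq(chi2, df = length(obs) - 1, lower.tail = FALSE)  # p-value close to 0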
In our data, we have a rather large number of policies with short exposures $v_i$,
and further analysis suggests that these short exposures are not suitably modeled.
We will not invest more time into improving the exposure modeling. As mentioned
in the appendix, there seem to be a couple of issues how the exposures are displayed
and how policy renewals are accounted for in this data. However, it is difficult
(almost impossible) to clean the data for better exposure measures without more
detailed information about the data collection process.
Our next aim is to model the continuous feature components differently if their raw form does not match the linear predictor assumption. In model Poisson GLM1 we have categorized such components and then used dummy coding for the resulting classes, see Sect. 5.2.4. Alternatively, we can use different functional forms; for instance, we can use for DrivAge the following pre-processing
\[
\texttt{DrivAge} \;\mapsto\; \beta_l\, \texttt{DrivAge} + \beta_{l+1} \log(\texttt{DrivAge}) + \sum_{j=2}^{4} \beta_{l+j}\, (\texttt{DrivAge})^j .
\tag{5.34}
\]

Table 5.5 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses, tenfold cross-validation losses (units are in $10^{-2}$) and in-sample average frequency of the null model (intercept model) and of different Poisson GLMs

               Run time   # Param.   AIC       In-sample    Out-of-sample   Tenfold CV   Aver.
                                               loss on L    loss on T       loss D^CV    freq.
Poisson null   –          1          199'506   25.213       25.445          25.213       7.36%
Poisson GLM1   16 s       49         192'818   24.101       24.146          24.121       7.36%
Poisson GLM2   15 s       48         192'753   24.091       24.113          24.110       7.36%
Poisson GLM3   15 s       50         192'716   24.084       24.102          24.104       7.36%

This replaces the K = 7 categorical age classes of model Poisson GLM1 by


5 continuous functions of the variable DrivAge, and the number of regression
parameters is reduced from K − 1 = 6 to 5. We call this model Poisson GLM2.
Besides improving the modeling of the feature components, we can also start to add interactions beyond the multiplicative ones. For instance, Fig. 13.12 in Chap. 13 may indicate that there is an interaction term between BonusMalus and DrivAge. New young drivers enter the bonus-malus system at level 100, and it takes some years free of accidents to reach the lowest bonus-malus level of 50. Whereas, for senior drivers, a bonus-malus level of 100 may indicate that they have had a bad claims experience, because otherwise they would be on the lowest bonus-malus level, see also Remark 5.15. We add the following interaction terms to model Poisson GLM2, and we call the resulting model Poisson GLM3,
\[
\beta_{l'}\, \texttt{BonusMalus} \cdot \texttt{DrivAge} \;+\; \beta_{l'+1}\, \texttt{BonusMalus} \cdot (\texttt{DrivAge})^2 .
\tag{5.35}
\]
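A sketch of the corresponding model formula in R, assuming the pre-processed data of Listing 5.1; the interaction terms (5.35) enter via the : operator:

d.glm3 <- glm(ClaimNb ~ VehPowerGLM + VehAgeGLM
              + DrivAge + log(DrivAge) + I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4)
              + BonusMalusGLM + VehBrand + VehGas + DensityGLM + Region + AreaGLM
              + BonusMalusGLM:DrivAge + BonusMalusGLM:I(DrivAge^2),
              data = learn, family = poisson(), offset = log(Exposure))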

From Table 5.5 we observe that this leads to a further small model improvement. We mention that this model improvement can also be observed in a decrease of Pearson's dispersion estimate to $\hat{\varphi}^P = 1.6644$. Noteworthy, all model selection criteria, AIC, out-of-sample generalization loss and cross-validation, come to the same conclusion in this example.
The tedious task of the modeler now is to find all these systematic effects and
bring them in an appropriate form into the model. Here, this is still possible because
we have a comparably small model. However, if we have hundreds of feature
components, such a manual analysis becomes intractable. Other regression models
such as network regression models should be preferred, or at least should be used
to find systematic effects. But, one should also keep in mind that the (final) chosen
model should be as simple as possible (parsimonious).
Remarks 5.20
• An advantage of GLMs is that these regression models can deal with collinearity
in feature components. Nevertheless, the results should be carefully checked if
the collinearity in feature components is very high. If we have a high collinearity
between two feature components then we may observe large values with opposite
signs in the corresponding regression parameters compensating each other. The

Listing 5.7 drop1 analysis of model Poisson GLM2


1 Single term deletions
2
3 Model:
4 ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAge + log(DrivAge) +
5 I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4) + BonusMalusGLM +
6 VehBrand + VehGas + DensityGLM + Region + AreaGLM
7 Df Deviance AIC LRT Pr(>Chi)
8 <none> 147005 192753
9 VehPowerGLM 5 147087 192825 82.4 2.671e-16 ***
10 VehAgeGLM 2 147225 192969 220.3 < 2.2e-16 ***
11 DrivAge 1 147157 192902 151.9 < 2.2e-16 ***
12 log(DrivAge) 1 147190 192935 184.8 < 2.2e-16 ***
13 I(DrivAge^2) 1 147123 192869 118.1 < 2.2e-16 ***
14 I(DrivAge^3) 1 147094 192840 89.0 < 2.2e-16 ***
15 I(DrivAge^4) 1 147071 192816 65.5 5.687e-16 ***
16 BonusMalusGLM 1 150907 196653 3902.0 < 2.2e-16 ***
17 VehBrand 10 147232 192959 226.5 < 2.2e-16 ***
18 VehGas 1 147148 192893 142.8 < 2.2e-16 ***
19 DensityGLM 1 147015 192761 10.1 0.001498 **
20 Region 21 147193 192899 188.0 < 2.2e-16 ***
21 AreaGLM 1 147009 192755 4.1 0.043123 *
22 ---
23 Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

resulting GLM will not be very robust, and a slight change in the observations
may change these regression parameters completely. In this case one should drop
one of the two highly collinear feature components. This problem may also occur
if we include too many terms in functional forms like in (5.34).
• A tool to find suitable functional forms of regression functions in continuous feature components are the partial residual plots of Cook–Croos-Dabrera [80]. If we want to analyze the first feature component $x_1$ of $x$, we can fit a GLM to the data using the entire feature vector $x$. The partial residuals for component $x_1$ are defined by, see formula (8) in Cook–Croos-Dabrera [80],
\[
r_i^{\rm partial} = \big( Y_i - \hat{\mu}(x_i) \big)\, g'\big(\hat{\mu}(x_i)\big) + \hat{\beta}_1 x_{i,1}
\qquad \text{for } 1 \le i \le n,
\]
where $g$ is the chosen link function and $g(\hat{\mu}(x_i)) = \langle \hat{\beta}, x_i \rangle$. These partial residuals offset the effect of feature component $x_1$. The partial residual plot shows $r_i^{\rm partial}$ against $x_{i,1}$. If this plot shows a linear structure then including $x_1$ linearly is justified, and any other functional form may be detected from that plot.
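In R, partial residuals are directly available from a fitted glm object; a minimal sketch, assuming a fitted model fit that contains the linear term DrivAge:

r_part <- residuals(fit, type = "partial")[, "DrivAge"]  # working residuals plus beta_1 * x_1
plot(learn$DrivAge, r_part)  # a linear shape justifies the linear inclusion of DrivAge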

Under-Sampling and Over-Sampling

Often run times are an issue in model fitting, in particular, if we want to experiment with different models, different feature codings, etc. Under-sampling is an interesting approach that can be applied in imbalanced situations (like in our claim frequency data situation) to speed up calculations while still receiving accurate approximations. We briefly describe under-sampling in this subsection.

Under-sampling is based on the idea that we do not need to consider all $n = 610'206$ insurance policies for model fitting and can still receive accurate results. For this we select all insurance policies that have at least 1 claim; in our data these are 22'434 insurance policies, and we call this data set $\mathcal{L}^*_{\ge 1}$. The motivation for selecting these insurance policies is that these are exactly the policies that carry the information about the drivers causing claims. These selected insurance policies need to be complemented with policies that do not cause any claims. We select at random (under-sample) 22'434 insurance policies of drivers without claims, and we call this data set $\mathcal{L}^*_0$. Merging the two sets we receive the data $\mathcal{L}^* = \mathcal{L}^*_0 \cup \mathcal{L}^*_{\ge 1}$ comprising 44'868 insurance policies. This data is balanced from the viewpoint of claim causing policies because exactly half of the policies in $\mathcal{L}^*$ suffers a claim and the other half does not. The idea now is to fit a GLM only on this learning data $\mathcal{L}^*$, and because we only consider 44'868 insurance policies the fitting should be fast.
There is still one point to be considered, namely, in the new learning data $\mathcal{L}^*$ policies with claims are over-represented (because we work in a low frequency problem). This motivates that we adjust the time exposures $v_i$ in $\mathcal{L}^*_0$ accordingly by multiplying as follows
\[
v_i \;\mapsto\; v_i^* = v_i\, \frac{\sum_{j=1}^{n} v_j \mathbb{1}_{\{N_j = 0\}}}{\sum_{j \in \mathcal{L}^*_0} v_j} .
\]

Thus, we stretch the exposures of the policies without claims in L∗ ; for our data this
factor is 26.17. This then provides us with an empirical frequency on L∗ of 7.36%
which is identical to the observed frequency on the entire learning data L.
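A sketch of this under-sampling construction in R reads as follows (the GLM
formula is abbreviated, in the example we use the full formula of model Poisson
GLM3; the seed is illustrative):

  set.seed(100)
  idx1 <- which(learn$ClaimNb >= 1)                 # all policies with claims
  idx0 <- sample(which(learn$ClaimNb == 0), length(idx1))   # under-sample
  learn.star <- learn[c(idx1, idx0), ]
  # stretch the exposures of the claim-free policies in learn.star such that the
  # empirical frequency on learn.star matches the one on the full data
  f <- sum(learn$Exposure[learn$ClaimNb == 0]) / sum(learn$Exposure[idx0])
  learn.star$Exposure[learn.star$ClaimNb == 0] <-
            f * learn.star$Exposure[learn.star$ClaimNb == 0]
  d.glm.star <- glm(ClaimNb ~ BonusMalusGLM + VehBrand + VehGas,  # abbreviated
                    data=learn.star, offset=log(Exposure), family=poisson())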
We fit model Poisson GLM3 on this reduced (and exposure adjusted) learning
data L∗ , the results are presented on the last line of Table 5.6. This model can be
fitted in 1s, and by construction it fulfills the balance property. The resulting in-
sample and out-of-sample losses (evaluated on the entire data L and T ) are very
close to model Poisson GLM3 which verifies that the model fitted only on the
learning data L∗ gives a good approximation. We do not provide AIC because the
data used is not identical to the data used to fit the other models. The tenfold cross-

Table 5.6 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses,
tenfold cross-validation losses (units are in 10−2) and in-sample average frequency of the null
model (intercept model) and of different Poisson GLMs, the last row uses under-sampling in model
Poisson GLM3

                   Run time   # param.   AIC       In-sample   Out-of-sample   Tenfold CV    Aver.
                                                   loss on L   loss on T       loss D^CV     freq.
   Poisson null    –          1          199’506   25.213      25.445          25.213        7.36%
   Poisson GLM1    16 s       49         192’818   24.101      24.146          24.121        7.36%
   Poisson GLM2    15 s       48         192’753   24.091      24.113          24.110        7.36%
   Poisson GLM3    15 s       50         192’716   24.084      24.102          24.104        7.36%
   under-sampling  1 s        50         –         24.098      24.108          24.120        7.36%

validation loss is a little bit bigger, which seems to be a consequence of applying
the non-stratified version to only 44’868 insurance policies, i.e., this higher cross-
validation loss reflects that we fit the model on less data, which implies higher
uncertainty in model fitting. This finishes this example.
The presented method is called under-sampling because we under-sample from
the insurance policies without claims to make both classes (policies with claims and
policies without claims) equally large. Alternatively, to achieve a class balance we
could also over-sample from the minority class by duplicating policies. This has a
similar effect, but it increases run times. Importantly, if we under- or over-sample, we
have to adjust the exposures correspondingly; otherwise we obtain a biased model
that is not useful for pricing. The same applies to methods such as the synthetic
minority oversampling technique (SMOTE) and similar techniques.
Alternatively to under-sampling, we could also fit a so-called zero-truncated
Poisson (ZTP) model to the data by only fitting a model on the insurance policies
that suffer at least one claim, and adjusting the distribution to the observations
Ni|{Ni≥1}. This is rather similar to a hurdle Poisson model and we come back to
this in Example 6.19, below.

5.3.5 Over-Dispersion in Claim Counts Modeling


Mixed Poisson Distribution

In the previous example we have seen that the considered Poisson GLMs do not fully
fit our data, at least not with the chosen feature engineering, because there is over-
dispersion in the data (relative to the chosen models). This may give rise to consider
models that allow for over-dispersion. Typically, such over-dispersed models are
constructed starting from the Poisson model, because the Poisson model enjoys
many nice properties as we have seen above. A natural extension is to introduce the
family of mixed Poisson models, where the frequency is not modeled with a single
parameter but rather with a whole family of parameters described by an underlying
mixing distribution.
In the dual mean parametrization the Poisson distribution for Y = N/v reads as

    Y ∼ f(y; λ, v) = e^{−vλ} (vλ)^{vy}/(vy)!    for y ∈ N0/v,

where the mean parameter is given by λ = κ'(θ) = exp{θ}, and θ denotes the
canonical parameter; on purpose we use the notation λ (instead of μ) for the mean
here, the reason will become clear below. This model satisfies for the first two moments
of N = vY

    Eλ[N] = vκ'(θ) = vλ    and    Varλ(N) = vκ''(θ) = vλ = Eλ[N],

with dispersion parameter ϕ = 1. A mixed Poisson distribution is obtained


by mixing/integrating over different frequency parameters λ > 0. We choose a

distribution π on R+ (strictly positively supported), and define the new distribution


 
    Y = N/v ∼ fπ(y; v) = ∫_{R+} f(y; λ, v) dπ(λ) = ∫_{R+} e^{−vλ} (vλ)^{vy}/(vy)! dπ(λ).    (5.36)

If π is not concentrated in a single point, the tower property immediately implies

Eπ [N] < Varπ (N) , (5.37)

provided that the moments exist, we refer to Lemma 2.18 in Wüthrich [387]; indeed, the
tower property gives Varπ(N) = Eπ[Var(N|λ)] + Varπ(E[N|λ]) = Eπ[N] + v²Varπ(λ) > Eπ[N]
for non-degenerate π. Hence,
mixing over different frequency parameters allows us to receive over-dispersion. Of
course, this concept can also be applied to mixing over the canonical parameter θ in
the EF (instead of the mean parameter).
This leads to the framework of Bayesian credibility models which are widely
used and studied in actuarial science, we refer to the textbook of Bühlmann–Gisler
[58]. We have already met this idea in the Bayesian decision rule of Example 3.3
which has led to the Bayesian estimator in Definition 3.6.

Negative-Binomial Model

In the case of the Poisson model, the gamma distribution is a particularly attractive
mixing distribution for λ because it allows for a closed-form solution in (5.36),
and fπ(y; v) will be a negative-binomial distribution.4 One can choose differ-
ent parametrizations of this mixing distribution, and they will provide different
scalings in the resulting negative-binomial distribution. We choose the following
parametrization π(λ) (d)= Γ(vα, vα/μ) for mean parameter μ > 0 and shape
parameter vα > 0. This implies, see (5.36),

    fNB(y; μ, v, α) = ∫_{R+} e^{−vλ} (vλ)^{vy}/(vy)! · (vα/μ)^{vα}/Γ(vα) · λ^{vα−1} e^{−vαλ/μ} dλ

                    = Γ(vy + vα)/( (vy)! Γ(vα) ) · v^{vy} (vα/μ)^{vα} / (v + vα/μ)^{vy+vα}

                    = \binom{vy + vα − 1}{vy} (e^θ)^{vy} (1 − e^θ)^{vα},

4The gamma distribution is the conjugate prior to the Poisson distribution. As a result, the posterior
distribution, given observations, will again be a gamma distribution with posterior parameters, see
Section 8.1 of Wüthrich [387]. This Bayesian model has been introduced to the actuarial literature
by Bichsel [32].

setting for canonical parameter θ = log(μ/(μ + α)) < 0. This is the negative-
binomial distribution we have already met in (2.5). A single-parameter linear EDF
representation is given by, we set unit dispersion parameter ϕ = 1,

 
    Y ∼ fNB(y; θ, v, α) = exp{ (yθ + α log(1 − e^θ)) / (1/v) + log \binom{vy + vα − 1}{vy} },    (5.38)

where this is a density w.r.t. the counting measure on N0 /v. The cumulant function
and the canonical link, respectively, are given by
 
    κ(θ) = −α log(1 − e^θ)    and    θ = h(μ) = log( μ/(μ + α) ) ∈ Θ = (−∞, 0).

Note that α > 0 is treated as nuisance parameter (which is a fixed part of the
cumulant function, here). The first two moments of the claim count N = vY are
given by


    vμ = Eθ[N] = vα e^θ/(1 − e^θ),    (5.39)

    Varθ(N) = Eθ[N] ( 1 + e^θ/(1 − e^θ) ) = Eθ[N] ( 1 + μ/α ) > Eθ[N].    (5.40)

This shows that we receive a fixed over-dispersion of size μ/α, which (in this
parametrization) does not depend on the exposure v; this is the reason for choosing
a mixing distribution π(λ) (d)= Γ(vα, vα/μ). This parametrization is called NB2
parametrization.
Remarks 5.21
• We emphasize that the effective domain Θ = (−∞, 0) is one-sided bounded.
Therefore, the canonical link for the linear predictor will not work in general
because the linear predictor x → η(x) can be two-sided unbounded in a GLM
setting. Instead, we use the log-link for g(·) in our example below, with the
downside that one loses the balance property.
• The unit deviance in this negative-binomial EDF model is given by
    (y, μ) → d(y, μ) = 2 [ y log(y/μ) − (y + α) log( (y + α)/(μ + α) ) ],

we also refer to Table 4.1 for α = 1. We emphasize that this is the unit deviance
in a single-parameter linear EDF, and we only aim at estimating canonical
parameter θ ∈ Θ and mean parameter μ ∈ M, respectively, whereas α > 0 is
treated as a given nuisance parameter. This is important because the unit deviance
relies on the saturated model which, in general, estimates a one-dimensional

parameter θ and μ, respectively, from the one-dimensional observation Y . The


nuisance parameter is not affected by the consideration of the saturated model,
and it is treated as a fixed part of the cumulant function, which is not estimated
at this stage. An important consequence of this is that model comparison using
deviance residuals only works for identical nuisance parameters.
• We mention that we receive over-dispersion in (5.40) though we have dispersion
parameter ϕ = 1 in (5.38). Alternatively, we could do the duality transformation
y → ỹ = y/α for nuisance parameter α > 0; this gives the reproductive form of
the negative-binomial model NB2, see also Remarks 2.13. This provides us with
a density on N0/(vα), set ϕ̃ = 1/α,

    Ỹ ∼ fNB(ỹ; θ, v/ϕ̃) = exp{ (ỹθ + log(1 − e^θ)) / (1/(vα)) + log \binom{vαỹ + vα − 1}{vαỹ} }.

The cumulant function and the canonical link, respectively, are now given by

    κ(θ) = −log(1 − e^θ)    and    θ = h(μ̃) = log( μ̃/(μ̃ + 1) ) ∈ Θ = (−∞, 0).

The first two moments are for θ ∈ Θ given by

    μ̃ = Eθ[Ỹ] = e^θ/(1 − e^θ),

    Varθ(Ỹ) = (ϕ̃/v) κ''(θ) = (1/(vα)) μ̃ (1 + μ̃).

Thus, we receive the reproductive EDF representation with dispersion parameter
ϕ̃ = 1/α and variance function V(μ̃) = μ̃(1 + μ̃). Moreover, N = vY = vαỸ.
• The negative-binomial model with the NB1 parametrization uses the mixing
distribution π(λ) (d)= Γ(μv/α, v/α). This leads to mean Eθ[N] = vμ and
variance Varθ(N) = Eθ[N](1 + α). In this parametrization, μ enters the gamma
function as Γ(μv/α) in the gamma density, which does not allow for an EDF
representation. This parametrization has been called NB1 by Cameron–Trivedi
[63] because both terms in the variance Varθ(N) = vμ + vμα are linear in μ. In
contrast, in the NB2 parametrization the second term vμ²/α has a square in μ,
see (5.40). Further discussion is provided in Greene [171].

Nuisance Parameter Estimation

All previous statements have been based on the assumption that α > 0 is a
given nuisance parameter. If α needs to be estimated too, then, we drop out
of the EF. In this case, an iterative estimation procedure is applied to the EDF
representation (5.38). One starts with a fixed nuisance parameter α^(0) and fits the
negative-binomial GLM with MLE, which provides a first set of MLEs β̂^(1) =
β̂^(1)(α^(0)). Based on this estimate the nuisance parameter is updated α^(0) → α^(1) by
maximizing the log-likelihood in α for given β̂^(1). Iteration of this procedure then
leads to a joint estimation of the regression parameter β and the nuisance parameter
α. Both MLE steps in this algorithm increase the joint log-likelihood.
Remark 5.22 (Implementation of the Negative-Binomial GLM in R) Implementa-
tion of the negative-binomial model needs some care. There are two R procedures
glm and glm.nb that can be used to fit negative-binomial GLMs, the latter being
built on the former. The procedure glm is just the classical R procedure [307] that
is usually used to fit GLMs within the EDF, it requires to set

family=negative.binomial(theta, link="log").

This parametrization considers the single-parameter linear EF on N (for mean μ ∈ M)

    fNB(n; μ, theta) = \binom{n + theta − 1}{n} ( μ/(μ + theta) )^n ( 1 − μ/(μ + theta) )^{theta},

where theta > 0 denotes the nuisance parameter. The tricky part now is that we
have to bring in the different exposures vi of all policies 1 ≤ i ≤ n. That is, we
would like to have for claim counts ni = vi yi, see (5.38),

    fNB(yi; μi, vi, α) = \binom{vi yi + vi α − 1}{vi yi} ( vi μi/(vi μi + vi α) )^{vi yi} ( 1 − vi μi/(vi μi + vi α) )^{vi α}

                      = \binom{vi yi + vi α − 1}{vi yi} [ ( μi/(μi + α) )^{yi} ( 1 − μi/(μi + α) )^{α} ]^{vi}.

The square bracket can be implemented in glm as a scaled and weighted regression
problem, see Listing 5.8 with theta = α. This approach provides the correct GLM
parameter estimates β̂^MLE for given α, however, the outputted AIC values cannot
be compared to the Poisson case. Note that the Poisson case of Table 5.5 considers
observations Ni whereas Listing 5.8 uses Yi = Ni /vi . For this reason we calculate
the log-likelihood and AIC by an own implementation.
The same remark applies to glm.nb, and also nuisance parameter estimation
cannot be performed by that routine under different exposures vi . Therefore, we
have implemented an iterative estimation algorithm ourselves, alternating glm of
Listing 5.8 for given α and a maximization routine optimize to find the optimal
α for given β using (5.38). We have applied this iteration in Example 5.23, below,
and it has converged in 5 iterations.
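A sketch of this alternating algorithm in R may look as follows, using the
log-likelihood (5.38) with e^θ = μ/(μ + α) (the GLM formula is abbreviated, see
Listing 5.8 for the full one):

  library(MASS)                  # provides the negative.binomial family
  ll.alpha <- function(alpha, y, mu, v) {       # log-likelihood (5.38) in alpha
    sum(v * (y * log(mu/(mu+alpha)) + alpha * log(alpha/(mu+alpha))) +
        lgamma(v*y + v*alpha) - lgamma(v*alpha) - lgamma(v*y + 1))
  }
  alpha <- 1                                    # initial nuisance parameter
  for (it in 1:5) {
    d.glmnb <- glm(ClaimNb/Exposure ~ BonusMalusGLM + VehBrand + VehGas,
                   data=learn, weights=Exposure,
                   family=negative.binomial(alpha, link="log"))
    alpha <- optimize(ll.alpha, c(0.01, 100), y=learn$ClaimNb/learn$Exposure,
                      mu=fitted(d.glmnb), v=learn$Exposure, maximum=TRUE)$maximum
  }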

Example 5.23 (Negative-Binomial Distribution for Claim Counts) We revisit the


MTPL claim frequency GLM example of Sect. 5.3.4, but we replace the Poisson
distribution by the negative-binomial one. We start with the negative-binomial (NB)

Listing 5.8 Implementation of model NB GLM3


1 d.glmnb <- glm(ClaimNb/Exposure ~ VehPowerGLM + VehAgeGLM
2 + log(DrivAge) + I(DrivAge^3) + I(DrivAge^4)
3 + BonusMalusGLM*DrivAge + BonusMalusGLM*I(DrivAge^2)
4 + VehBrand + VehGas + DensityGLM + Region + AreaGLM,
5 data=learn, weights=Exposure,
6 family=negative.binomial(alpha, link="log"))

Table 5.7 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses
(units are in 10−2) and in-sample average frequency of the null models (Poisson and negative-
binomial) and the Poisson and negative-binomial GLMs. The optimal model is highlighted in
boldface

                                 Run time   # Param.   AIC       In-sample   Out-of-sample   Aver.
                                                                 loss on L   loss on T       freq.
   Poisson null                  –          1          199’506   25.213      25.445          7.36%
   Poisson GLM3                  15 s       50         192’716   24.084      24.102          7.36%
   NB null α̂_null^MLE = 1.059    –          2          198’466   20.357      20.489          7.36%
   NB null α̂_NB^MLE = 1.810      –          1          198’564   21.796      21.948          7.36%
   NB GLM3 α̂_NB^MLE = 1.810      85 s       51         192’113   20.722      20.674          7.38%

null model. The NB null model has two parameters, the homogeneous (overall)
frequency and the nuisance parameter. MLE of the homogeneous overall frequency
is identical to the one in the Poisson null model, and MLE of the nuisance parameter
provides α̂_null^MLE = 1.059. This is substantially smaller than infinity and suggests
over-dispersion. The results are presented on the third line of Table 5.7. We observe
a smaller AIC of the NB null model against the Poisson null model which says that
we should allow for over-dispersion.
We now focus on the NB GLM. The feature pre-processing is done exactly as
in model Poisson GLM3, and we choose the log-link for g. We call this model
NB GLM3. The iterative estimation procedure outlined above provides a nuisance
parameter estimate α̂_NB^MLE = 1.810. This is bigger than in the NB null model because
the regression structure explains some part of the over-dispersion, however, it is
still substantially smaller than infinity which justifies the inclusion of this over-
dispersion parameter.
The last line of Table 5.7 gives the result of model NB GLM3. From AIC we
conclude that we favor the negative-binomial GLM over the Poisson GLM since
AIC decreases from 192’716 to 192’113. The in-sample and out-of-sample deviance
losses can only be compared within the same models, i.e., the models that have the
same cumulant function. This also applies to the negative-binomial models which
have cumulant function κ(θ ) = −α log(1 − eθ ). Thus, to compare the NB null
model and model NB GLM3, we need to choose the same nuisance parameter α.
For this reason we added this second NB null model to Table 5.7. This second NB
null model no longer uses the MLE α̂_null^MLE; therefore, the corresponding AIC only
includes one estimated parameter.

Fig. 5.7 Poisson logged predictors vs. negative-binomial logged predictors

Table 5.8 Out-of-sample deviance losses: forecast dominance. The optimal model is highlighted
in boldface

                               Poisson     NB deviance           NB deviance
   Model                       deviance    α̂_null^MLE = 1.059    α̂_NB^MLE = 1.810
   Null model                  25.445      20.489                21.948
   Poisson GLM3                24.102      19.266                20.678
   NB GLM3 α̂_NB^MLE = 1.810    24.100      19.262                20.674

As mentioned above, deviance losses can only be compared under exactly the
same cumulant function (including the same nuisance parameters). If we want to
have a more robust model selection, we can consider forecast dominance according
to Definition 4.20. Being less ambitious, here, we consider forecast dominance
only for the three considered cumulant functions: Poisson, negative-binomial with
α̂_null^MLE = 1.059 and negative-binomial with α̂_NB^MLE = 1.810. The out-of-sample
deviance losses are given in Table 5.8 in the different columns. According to this
forecast dominance analysis we also give preference to model NB GLM3, but model
Poisson GLM3 is pretty close.
Figure 5.7 compares the logged predictors log(μ̂i), 1 ≤ i ≤ n, of the models
Poisson GLM3 and NB GLM3. We see a huge similarity in these predictors, only
high frequency policies are judged slightly differently by the NB model compared
to the Poisson model.
Table 5.9 gives the predicted number of claims against the observed ones. We
observe that model NB GLM3 predicts more accurately the number of policies with
2 or less claims, but it over-estimates the number of policies with more than 2 claims.
This may also be related to the fact that the estimated in-sample frequency has a

Table 5.9 Contingency table of observed number of policies against predicted number of policies
with given claim counts ClaimNb

   Numbers of claims ClaimNb               0         1        2      3    4   5
   Observed number of policies             587’772   21’198   1’174  57   4   1
   Poisson predicted number of policies    587’325   22’064   779    34   3   0.3
   NB predicted number of policies         587’902   20’982   1’200  100  15  4

positive bias in model NB GLM3, see Table 5.7. That is, since we do not work with
the canonical link, we do not have the balance property.

Listing 5.9 drop1 analysis of model NB GLM3


1 Single term deletions
2
3 Model:
4 ClaimNb/Exposure ~ VehPowerGLM + VehAgeGLM + DrivAge + log(DrivAge) +
5 I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4) + BonusMalusGLM *
6 DrivAge + BonusMalusGLM * I(DrivAge^2) + BonusMalusGLM +
7 VehBrand + VehGas + DensityGLM + Region + AreaGLM
8 Df Deviance AIC scaled dev. Pr(>Chi)
9 <none> 126446 171064
10 VehPowerGLM 5 126524 171102 48.266 3.134e-09 ***
11 VehAgeGLM 2 126655 171190 130.070 < 2.2e-16 ***
12 log(DrivAge) 1 126592 171153 91.057 < 2.2e-16 ***
13 I(DrivAge^3) 1 126527 171112 50.483 1.202e-12 ***
14 I(DrivAge^4) 1 126508 171100 38.381 5.820e-10 ***
15 VehBrand 10 126658 171176 132.098 < 2.2e-16 ***
16 VehGas 1 126583 171147 85.232 < 2.2e-16 ***
17 DensityGLM 1 126456 171068 6.137 0.01324 *
18 Region 21 126622 171132 109.838 5.042e-14 ***
19 AreaGLM 1 126450 171064 2.411 0.12049
20 DrivAge:BonusMalusGLM 1 126484 171085 23.481 1.262e-06 ***
21 I(DrivAge^2):BonusMalusGLM 1 126490 171089 27.199 1.836e-07 ***
22 ---
23 Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

We close this example by providing the drop1 analysis in Listing 5.9. From
this analysis we conclude that the feature component Area should be dropped.
Of course, this confirms the high collinearity between Density and Area which
implies that we do not need both variables in the model. We remark that the AIC
values in Listing 5.9 are not on our scale, as stated in Remark 5.22. 

5.3.6 Zero-Inflated Poisson Model

In many applications it is the case that the Poisson distribution does not fully fit
the claim counts data because there are too many policies with zero claims, i.e.,
5.3 Model Validation 163

policies with Y = 0, compared to a Poisson assumption. This topic has attracted


some attention in the recent actuarial literature, see, e.g., Boucher et al. [43–45],
Frees et al. [137], Calderín-Ojeda et al. [62] and Lee [239]. An obvious solution to
this problem is to ‘artificially’ increase the probability of a zero claim compared to
a Poisson model, this is the proposal introduced by Lambert [232]. Y has a zero-
inflated Poisson (ZIP) distribution if the probability weights of Y are given by (set
v = 1)

    fZIP(y; θ, π0) = { π0 + (1 − π0) e^{−μ}       for y = 0,
                     { (1 − π0) e^{−μ} μ^y / y!   for y ∈ N,

for π0 ∈ (0, 1), μ = eθ > 0, and for the Poisson probability weights we refer
to (2.4). For π0 > 0 the weight of a zero claim Y = 0 is increased (inflated)
compared to the original Poisson distribution.
Remarks 5.24
• The ZIP distribution has different interpretations. It can be interpreted as a
hierarchical model where we have a latent variable Z which indicates with
probability π0 that we have an excess zero, and with probability 1 − π0 we have
an ordinary Poisson distribution, i.e. for y ∈ N0

    Pθ[Y = y | Z = z] = { 1{y=0}             for z = 0,      (5.41)
                        { e^{−μ} μ^y / y!    for z = 1,

with P[Z = 0] = 1 − P[Z = 1] = π0 .


The latter shows that we can also understand it as a mixture of two distribu-
tions, namely, of the Poisson distribution and of a single point measure in y = 0
with mixing probability π0 . Mixture distributions are going to be discussed in
Sect. 6.3.1, below. In this sense, we can also interpret the model as a mixed
Poisson model with mixing distribution π(λ) being a Bernoulli distribution
taking values 0 and μ with probability π0 and 1 − π0 , respectively, see (5.36),
and the former parameter λ = 0 leads to a degenerate Poisson model.
• We have introduced the ZIP model, but this approach is neither limited to the
Poisson model nor the zeros. For instance, we could also consider an inflated
negative-binomial model where both the zeros and the ones are inflated with
probabilities π0 , π1 > 0 such that π0 + π1 < 1.
• Hurdle models are an alternative way to model excess zeros. Hurdle models
have been introduced by Cragg [83], and they also allow for too few zeros.
A hurdle (Poisson) model mixes a lower-truncated (Poisson) count distribution
with a point mass in zero

    fhurdle Poisson(y; θ, π0) = { π0                                        for y = 0,      (5.42)
                                { (1 − π0) (e^{−μ} μ^y/y!) / (1 − e^{−μ})   for y ∈ N,

for π0 ∈ (0, 1) and μ > 0. For π0 > e−μ the weight of a zero claim is increased
and for π0 < e−μ it is decreased. This distribution is called a hurdle distribution,
because we first need to overcome the hurdle at zero to come to the Poisson
model. Lower-truncated distributions are studied in Sect. 6.4, below, and mixture
distributions are discussed in Sect. 6.3.1. In general, fitting lower-truncated
distributions is challenging because the density and the distribution function
should both have tractable forms to perform MLE for truncated distributions.
The Expectation-Maximization (EM) algorithm is a useful tool to perform
model fitting under truncation. We come back to the hurdle Poisson model in
Example 6.19, below, and it is also closely related to the zero-truncated Poisson
(ZTP) model discussed in Remarks 6.20.
The first two moments of a ZIP random variable Y ∼ fZIP (·; θ, π0 ) are given by

    E_{θ,π0}[Y] = (1 − π0) μ,
    Var_{θ,π0}(Y) = (1 − π0) μ + (π0 − π0²) μ² = E_{θ,π0}[Y] (1 + π0 μ),

these calculations easily follow with the latent variable Z interpretation from above.
As a consequence, we receive an over-dispersed model with over-dispersion π0 μ
(the latter also follows from the fact that we consider a mixed Poisson distribution
with a Bernoulli mixing distribution having weights π0 in 0 and 1 − π0 in μ > 0,
see (5.37)).
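Explicitly, the variance statement is verified via the tower property w.r.t. the latent
variable Z (a short sketch; both conditional moments follow from (5.41)):

    Var_{θ,π0}(Y) = E[Var(Y|Z)] + Var(E[Y|Z]) = (1 − π0)μ + π0(1 − π0)μ² = E_{θ,π0}[Y] (1 + π0 μ).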
Unfortunately, MLE does not allow for explicit solutions in this model. The score
equations of Yi i.i.d.∼ fZIP(·; θ, π0) are given by

    ∇_{(π0,μ)} ℓ_Y(π0, μ) = Σ_{i=1}^n ∇_{(π0,μ)} log( π0 + (1 − π0) e^{−μ} ) 1{Yi=0}

                           + Σ_{i=1}^n ∇_{(π0,μ)} log( (1 − π0) e^{−μ} μ^{Yi}/Yi! ) 1{Yi>0} = 0.

The R package pscl [401] has a function called zeroinfl which uses the general
purpose optimizer optim to find the MLEs in the ZIP model. Alternatively, we
could explore the EM algorithm for mixture distributions presented in Sect. 6.3,
below.
In insurance applications, the ZIP application can be problematic if we have
different exposures vi > 0 for different insurance policies i. In the Poisson GLM
case with canonical link choice we typically integrate the different exposures into
the offset, see (5.27). However, it is not clear whether and how we should integrate
the different exposures into the zero-inflation probability π0 . It seems natural to
believe that shorter exposures should increase π0 , but the explicit functional form of
this increase can be debated, some options are discussed in Section 5 of Lee [239].

Listing 5.10 Implementation of model ZIP GLM3


1 d.ZIP <- zeroinfl(ClaimNb ~ VehPowerGLM + VehAgeGLM
2 + log(DrivAge) + I(DrivAge^3) + I(DrivAge^4)
3 + BonusMalusGLM*DrivAge + BonusMalusGLM*I(DrivAge^2)
4 + VehBrand + VehGas + DensityGLM + Region
5 + AreaGLM | 1,
6 data=learn, offset=log(Exposure), dist=’poisson’, link=’logit’,
7 start=list(count=glm3$coefficients, zero=c(-0.4153)) )

Table 5.10 Run times, number of parameters, AICs, in-sample and out-of-sample deviance
losses (units are in 10−2) and in-sample average frequency of the null models (Poisson, negative-
binomial and ZIP) and the Poisson, negative-binomial and ZIP GLMs. The optimal model is
highlighted in boldface

                                 Run time   # Param.   AIC       In-sample   Out-of-sample   Aver.
                                                                 loss on L   loss on T       freq.
   Poisson null                  –          1          199’506   25.213      25.445          7.36%
   Poisson GLM3                  15 s       50         192’716   24.084      24.102          7.36%
   NB null α̂_null^MLE = 1.059    –          2          198’466   20.357      20.489          7.36%
   NB null α̂_NB^MLE = 1.810      –          1          198’564   21.796      21.948          7.36%
   NB GLM3 α̂_NB^MLE = 1.810      85 s       51         192’113   20.722      20.674          7.38%
   ZIP null                      20 s       2          198’638   –           –               7.43%
   ZIP GLM3 (null π0)            270 s      51         192’393   –           –               7.37%

In the following application, we simply choose π0 independent of the exposures, but


certainly this is not the best modeling choice.
Example 5.25 (ZIP Model for Claim Counts) We revisit the MTPL claim frequency
example of Sect. 5.3.4, but this time we fit a ZIP model. For the Poisson part we
use exactly the same GLM regression function as in model Poisson GLM3 and,
in particular, we use for the different exposures vi of the insurance policies the
offset term oi = log vi , see line 6 of Listing 5.10. This offset only acts on the
Poisson part of the ZIP GLM. The zero-inflating probability π0 is modeled with a
logistic Bernoulli model, see Sect. 2.1.2. For computational reasons, we choose the
null model for the Bernoulli part modeling the zero-inflation π0 . This is indicated
by the “1” on line 5 of Listing 5.10. This 1 should be expanded if we also want to
consider a regression model for the zero-inflating probability π0 and, in particular,
if we want to integrate an offset term for the exposure. We can set this term to
offset(f), where f is a suitable transformation of the exposure. Furthermore,
successful calibration requires meaningful starting values, otherwise zeroinfl
will not find the MLEs. We start the algorithm in the parameters of model Poisson
GLM3, see line 7 of Listing 5.10. The results are presented in Table 5.10.
Firstly, we see that the run times are not fully competitive in this implementation,
even if we choose the null model for the zero-inflating probability π0 , i.e., only

Table 5.11 Out-of-sample deviance losses: forecast dominance. The optimal model is highlighted
in boldface

                               Poisson     NB deviance           NB deviance
   Model                       deviance    α̂_null^MLE = 1.059    α̂_NB^MLE = 1.810
   Null model                  25.445      20.489                21.948
   Poisson GLM3                24.102      19.266                20.678
   NB GLM3 α̂_NB^MLE = 1.810    24.100      19.262                20.674
   ZIP null model              25.446      20.490                21.949
   ZIP GLM3                    24.103      19.267                20.679

Table 5.12 Contingency table of observed numbers of policies against predicted numbers of
policies with given claim counts ClaimNb

   Numbers of claims ClaimNb               0         1        2      3    4   5
   Observed number of policies             587’772   21’198   1’174  57   4   1
   Poisson predicted number of policies    587’325   22’064   779    34   3   0.3
   NB predicted number of policies         587’902   20’982   1’200  100  15  4
   ZIP predicted number of policies        587’829   21’094   1’191  79   9   4

one intercept parameter is involved for determining π0 . Secondly, in this model we


cannot calculate deviance losses because the saturated model has two parameters for
each observation. Thirdly, the model does not satisfy the balance property even though
we work with the canonical links for the Poisson part and the Bernoulli part; this
property gets lost under the combination of these two model parts.
Most interesting are the AIC values. We observe that the ZIP GLM improves the
Poisson GLM, but it has a bigger AIC value than the negative-binomial GLM. From
this we conclude that we give preference to the negative-binomial model in our case.
Considering forecast dominance according to Definition 4.20, but restricted to
the three deviance losses studied in Example 5.23, we receive Table 5.11. Also this
table gives preference to the negative-binomial GLM. However, if we consider the
table of the observed numbers of policies against the predicted numbers of claims,
see Table 5.12, we give preference to the ZIP GLM because it has the lowest χ²-
value, i.e., it reflects best (in-sample) our observations.
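For illustration, such a χ²-value can be computed directly from the rows of
Table 5.12 (a sketch for the ZIP row; the cell counts are copied from the table):

  obs  <- c(587772, 21198, 1174, 57, 4, 1)   # observed numbers of policies
  pred <- c(587829, 21094, 1191, 79, 9, 4)   # ZIP predicted numbers of policies
  sum((obs - pred)^2 / pred)                 # chi^2-statistic of the ZIP GLM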
Figure 5.8 compares the resulting predictors on the log-scale. From this plot we
conclude that in our example the predictors of the ZIP GLM are closer to the Poisson
ones than the NB GLM predictors. In a next step, one could refine the zero-inflating
probability π0 modeling by integrating the exposure and further feature information.
This would lead to a further model improvement. We refrain here from doing so and
close this example; in Example 6.19, below, we study the hurdle Poisson model. 

Fig. 5.8 Comparison of the linear predictors of the NB and ZIP GLMs against the ones of the
Poisson GLM

5.3.7 Lab: Gamma GLM for Claim Sizes

As a second example we consider claim size modeling within GLMs. For this
example we do not use the French MTPL claims data because the empirical
density plot in Fig. 13.15 indicates that a GLM will not fit to that data. The French
MTPL data seems to have three distinct modes, which suggests to use a mixture
distribution. Moreover, the log-log plot indicates a regularly varying tail, which
cannot be captured by the EDF on the original observation scale; we are going
to study this data in Example 6.14, below. Here, we use the Swedish motorcycle
data, previously used in the textbook of Ohlsson–Johansson [290] and described in
Chap. 13.2. From Fig. 5.9 we see that the empirical density has one mode, and the
log-log plot supports light tails, i.e., the gamma model might be a suitable choice for
this data. Therefore, we choose a gamma GLM with log-link g. As described above,
the log-link is not the canonical link for the gamma EDF distribution but it ensures
the right sign w.r.t. the linear predictor ηi = ⟨β, xi⟩. Working with the log-link in
the gamma model will imply that the balance property is not fulfilled.

Fig. 5.9 (lhs) Empirical density, (middle) empirical distribution and (rhs) log-log plot of claim
amounts of the Swedish motorcycle data presented in Chap. 13.2

Feature Engineering

We have 4 continuous feature components OwnerAge, RiskClass, VehAge and


BonusClass, one binary feature component Gender and a categorical compo-
nent Area, see Listing 13.4. We have decided for a minimal feature engineering; we
refer to Figs. 13.19 (rhs) and 13.20 (rhs) for descriptive plots. We use the continuous
variables directly in a log-linear fashion, we add quadratic terms for OwnerAge and
VehAge, we merge RiskClass 6 and 7, and we censor VehAge at 20. Area
is categorical, but we may interpret the Zone levels as ordinal categorical, and
mapping them to integers allows us to use them in a continuous fashion; Fig. 13.19
(middle row, rhs) shows that this is a reasonable choice. Moreover, we merge Zone
5, 6 and 7 due to small volumes and their similar behavior.

Gamma Generalized Linear Model

The Swedish motorcycle claim amount data poses the special difficulty that we
do not have individual claim observations Zi,j, but we only know the total claim
amounts Si = Σ_{j=1}^{Ni} Zi,j and the number of claims Ni on each insurance policy;
Fig. 5.9 shows average claims Si /Ni of insurance policies i with Ni > 0. In general,
this poses a problem in statistical modeling, but in the gamma model this problem
can be handled because the gamma distribution is closed under aggregation of
i.i.d. gamma claims Zi,j . In all what follows in this section, we only study insurance
policies with Ni > 0, and we label these insurance policies i accordingly.
Assume that Zi,j are i.i.d. gamma distributed with shape parameter αi and scale
parameter ci , we refer to (2.6). The mean, the variance and the moment generating
function of Zi,j are given by
    E[Zi,j] = αi/ci,    Var(Zi,j) = αi/ci²    and    M_{Zi,j}(r) = ( ci/(ci − r) )^{αi},    (5.43)
where the moment generating function requires r < ci to be finite. Assuming that
the number of claims Ni is a known positive integer ni ∈ N, we see from the
moment generating function that Si = Σ_{j=1}^{ni} Zi,j is again gamma distributed with
shape parameter ni αi and scale parameter ci, because M_{Si}(r) = M_{Zi,1}(r)^{ni} =
(ci/(ci − r))^{ni αi}. We change the notation from Ni to
ni to emphasize that the number of claims is treated as a known constant (and
also to avoid using the notation of conditional probabilities, here). Finally, we scale
Yi = Si/(ni αi) ∼ Γ(ni αi, ni αi ci). This random variable Yi has a single-parameter
EDF gamma distribution with weight vi = ni, dispersion ϕi = 1/αi and cumulant
function κ(θi) = −log(−θi), for θi ∈ Θ = (−∞, 0),


    Yi ∼ f(y; θi, vi/ϕi) = exp{ (yθi − κ(θi)) / (ϕi/vi) + a(y; vi/ϕi) }    (5.44)

                         = ( (−θi αi vi)^{vi αi} / Γ(vi αi) ) y^{vi αi − 1} exp{ −(−θi αi vi) y },

and the canonical parameter is θi = −ci . For our GLM analysis we treat the shape
parameter αi ≡ α > 0 as a nuisance parameter that does not depend on the specific
policy i, i.e., we set constant dispersion ϕ = 1/α, and only the scale parameter ci is
chosen policy dependent through θi = −ci .
Random variable Yi = Si/(ni α) ∼ Γ(ni α, ni α ci) gives the reproductive form
of the gamma EDF, see Remarks 2.13. In applications, this form is not directly
useful because under unknown shape parameter α, we cannot calculate observations
Yi = Si/(ni α). For this reason, we parametrize the model differently, here. We
consider instead

    Yi = Si/ni ∼ Γ(ni α, ni ci).    (5.45)

This (new) random variable has the same gamma EDF (5.44), we only need to
reinterpret the canonical parameter as θi = −ci /α. Then, we choose the log-link
for g which implies

    μi = E_θi[Yi] = κ'(θi) = −1/θi = exp{ηi} = exp⟨β, xi⟩,

if x i ∈ X ⊂ Rq+1 describes the pre-processed features of policy i. The gamma


GLM is now fully specified and can be fitted to the data; from Example 5.5 we
know that we have a concave maximization problem. We call this model Gamma
GLM1 (with the feature pre-processing as described above). Note that the (constant)
dispersion parameter ϕ cancels in the score equations, thus, we do not need to
explicitly specify the nuisance parameter α to estimate regression parameter β ∈
Rq+1 .

Maximum Likelihood Estimation and Model Selection

Because we have only few claims data in this Swedish motorcycle example (only
m = 656 insurance policies suffer claims), we do not perform a generalization
analysis with learning and test samples. In this situation we need all data for
model fitting, and model performance is analyzed with AIC and with tenfold cross-
validation.
The in-sample deviance loss in the gamma GLM is given by
  
    D(L, μ̂(·)) = (2/m) Σ_{i=1}^m (ni/ϕ) [ (Yi − μ̂(xi))/μ̂(xi) − log( Yi/μ̂(xi) ) ],    (5.46)

where i runs over the policies i = 1, . . . , m with positive claims Yi = Si/ni > 0,
and μ̂(xi) = exp⟨β̂^MLE, xi⟩ is the MLE estimated regression function. Similar to
the Poisson case (5.29), McCullagh–Nelder [265] derive the following behavior

Fig. 5.10 (lhs) Empirical density of Yi and (rhs) empirical density of Yi^{1/3}

for the gamma unit deviance around its mode, see Section 7.2 and Figure 7.2 in
McCullagh–Nelder [265],
 
    d(Yi, μi) ≈ 9 Yi^{2/3} ( Yi^{−1/3} − μi^{−1/3} )²,    (5.47)

this uses that the log-likelihood is symmetric around its mode for scale μi^{−1/3}, see
Fig. 5.5 (middle). This shows that the gamma deviance scales differently around Yi
compared to the square loss function. From this we receive an approximation to the
deviance residuals (for v/ϕ = 1)

    ri^D = sign(Yi − μi) √d(Yi, μi) ≈ 3 ( (Yi/μi)^{1/3} − 1 ) = 3 (Yi^{1/3} − μi^{1/3}) / μi^{1/3}.    (5.48)

This is the cube-root transformation derived by Wilson–Hilferty [383]. This sug-
gests that if the empirical distribution of Yi^{1/3} looks roughly Gaussian we can use a
gamma distribution. Figure 5.10 gives the empirical densities of Yi on the left-hand
side and of Yi^{1/3} on the right-hand side. The latter looks roughly Gaussian (except
for the second mode close to 4); this supports the use of a gamma model.
Listing 5.11 provides the summary statistics of the fitted model Gamma GLM1;
note that we integrate the number of claims ni through scaling into the weights.
We have q + 1 = 9 regression parameters, and from this summary statistics we
observe that not all variables should be kept in the model. If we perform backward
elimination using drop1 in each step, see Sect. 5.3.3, we first drop BonusClass
and then Gender, resulting in a reduced model with 7 parameters. We call this
reduced model Gamma GLM2.

Listing 5.11 Results in model Gamma GLM1 using the R command glm
1 Call:
2 glm(formula = ClaimAmount/ClaimNb ~ OwnerAge + I(OwnerAge^2) +
3 AreaGLM + RiskClass + VehAge + I(VehAge^2) + Gender + BonusClass,
4 family = Gamma(link = "log"), data = mcdata0, weights = ClaimNb)
5
6 Deviance Residuals:
7 Min 1Q Median 3Q Max
8 -3.3683 -1.4585 -0.5979 0.4354 3.4763
9
10 Coefficients:
11 Estimate Std. Error t value Pr(>|t|)
12 (Intercept) 8.9737854 0.5532821 16.219 < 2e-16 ***
13 OwnerAge 0.1072781 0.0280862 3.820 0.000147 ***
14 I(OwnerAge^2) -0.0014508 0.0003489 -4.158 3.65e-05 ***
15 AreaGLM -0.0768512 0.0368284 -2.087 0.037303 *
16 RiskClass 0.0615575 0.0327553 1.879 0.060651 .
17 VehAge -0.2051148 0.0296184 -6.925 1.05e-11 ***
18 I(VehAge^2) 0.0062649 0.0015946 3.929 9.45e-05 ***
19 GenderMale 0.1085538 0.1673443 0.649 0.516772
20 BonusClass 0.0089004 0.0225371 0.395 0.693029
21 ---
22 Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
23
24 (Dispersion parameter for Gamma family taken to be 1.536577)
25
26 Null deviance: 1368.0 on 655 degrees of freedom
27 Residual deviance: 1126.5 on 647 degrees of freedom
28 AIC: 14922
29
30 Number of Fisher Scoring iterations: 11

Table 5.13 Run times, number of parameters, AICs, Pearson's dispersion estimate, in-sample
losses, tenfold cross-validation losses and the in-sample average claim amounts of the null model
(gamma intercept model) and the gamma GLMs

                  Run time   # Param.   AIC      Dispersion   In-sample   Tenfold CV   Average
                                                 est. ϕ̂^P     loss on L   loss D^CV    amount
   Gamma null     –          1+1        14’416   2.057        2.085       2.091        24’641
   Gamma GLM1     1 s        9+1        14’277   1.537        1.717       1.752        25’105
   Gamma GLM2     1 s        7+1        14’274   1.544        1.719       1.747        25’130

The results of models Gamma GLM1 and Gamma GLM2 are presented in
Table 5.13. We show AICs, Pearson’s dispersion estimate, the in-sample deviance
losses on all available data, the corresponding tenfold cross-validation losses, and
the average claim amounts.
Firstly, we observe that the GLMs do not meet the balance property. This is
implied by the fact that we do not use the canonical link to avoid any sort of difficulty
of dealing with the one-sided bounded effective domain Θ = (−∞, 0). For pricing,
the intercept parameter β̂_0^MLE should be shifted to eliminate this bias, i.e., we need to
shift this parameter under the log-link by −log(25’130/24’641) for model Gamma
GLM2.
Secondly, the in-sample and tenfold cross-validation losses are not directly
comparable to AIC. Observe that we need to know the dispersion parameter ϕ in
order to calculate both of these statistics. For the in-sample and cross-validation

losses we have set ϕ = 1, thus, all these figures are directly comparable. For AIC
we have estimated the dispersion parameter ϕ with MLE. This is the reason for
increasing the number of parameters in Table 5.13 by +1. Moreover, the resulting
AICs differ from the ones received from the R command glm, see, for instance,
Listing 5.11. The AIC value in Listing 5.11 does not consider all terms appropriately
due to the inclusion of weights; this is similar to Remark 5.22. It uses the
deviance dispersion estimate ϕ̂^D, i.e., not the MLE, and (still) increases the number
of parameters by 1 because the dispersion is estimated. For these reasons, we have
implemented our own code for calculating AIC. Both AIC and the tenfold cross-
validation losses say that we should give preference to model Gamma GLM2.
The dispersion estimate in Listing 5.11 corresponds to Pearson’s estimate

    ϕ̂^P = 1/(m − (q + 1)) Σ_{i=1}^m ni (Yi − μ̂i)² / μ̂i².    (5.49)
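In R, this estimate is directly obtained from the Pearson residuals of the fitted
GLM (a sketch; d.glm2 denotes the fitted model Gamma GLM2 of Listing 5.11,
name illustrative, where the weights ni enter the Pearson residuals automatically):

  phi.P <- sum(residuals(d.glm2, type="pearson")^2) / df.residual(d.glm2)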

We observe that the dispersion estimate is roughly 1.5 which gives an estimate of
the shape parameter α = 1/ϕ of 2/3. A shape parameter less than 1 implies that the
density of the gamma distribution is strictly decreasing, see Fig. 2.1. Often this is a
sign that the model does not fully fit the data, and if we use this model for simulation
we may receive too many observations close to zero compared to the true data.
A shape parameter less than 1 may be implied by more heterogeneity in the data
compared to what the chosen gamma GLM allows for or by large claims that cannot
be explained by the present gamma density structure. Thus, there is some sign here
that the data is more heavy-tailed than our model choice suggests. Alternatively,
there might be some need to also model the shape parameter with a regression
model; this could be done using the vector-valued parameter EF representation of
the gamma model, see Sect. 2.1.3. In view of Fig. 5.10 (rhs) it may also be that
the feature information is not sufficient to describe the second mode in 4, thus, we
probably need more explanatory information to reduce dispersion.
In Fig. 5.11 we give the Tukey–Anscombe plot and a QQ plot. Note that the
observations for ni = 1 follow a gamma distribution with shape parameter α
and scale parameter ci = α/μi = −αθi . Thus, if we scale Yi /μi , we receive
i.i.d. gamma random variables with shape and scale parameters equal to α. This
then allows us for ni = 1 to plot the empirical distribution of Yi/μ̂i against Γ(α̂, α̂)
in a QQ plot where we estimate 1/α by Pearson’s dispersion estimate. The Tukey–
Anscombe plot looks reasonable, but the QQ plot shows that the gamma model
does not entirely fit the data. From this plot we cannot conclude whether the gamma
distribution is causing the problem or whether it is a missing term in the regression
structure. We only see that the data is over-dispersed, resulting in more heavy-tailed
observations than the theoretical gamma model can explain, and a compensation
by too many small observations (which is induced by over-dispersion, i.e., a shape
parameter smaller than one). In the network chapter we will refine the regression
function, keeping the gamma assumption, to understand which modeling part is
causing the difficulty.
Remark 5.26 For the calculation of AIC in Table 5.13 we have used the MLE of the
dispersion parameter ϕ. This is obtained by solving the score equation (5.11) for the

Fig. 5.11 (lhs) Tukey–Anscombe plot of the fitted model Gamma GLM2, and (rhs) QQ plot of
the fitted model Gamma GLM2

gamma case. It is given by, we set α = 1/ϕ and we calculate the MLE of α instead,

    ∂/∂α ℓ_Y(β, α) = Σ_{i=1}^n vi [ Yi h(μ(xi)) − κ(h(μ(xi))) + log Yi + log(αvi) + 1 − ψ(αvi) ] = 0,

where ψ(α) = Γ'(α)/Γ(α) is the digamma function. We calculate the second
derivative w.r.t. α, see also (2.30),

    ∂²/∂α² ℓ_Y(β, α) = Σ_{i=1}^n vi [ 1/α − vi ψ'(αvi) ] = Σ_{i=1}^n vi² [ 1/(αvi) − ψ'(αvi) ] < 0    for α > 0,

the negativity follows from Theorem 1 in Alzer [9]. In fact, the function log α −
ψ(α) is strictly completely monotonic for α > 0. This says that the log-likelihood
ℓ_Y(β, α) is a concave function in α > 0 and the solution to the score equation is
unique, giving the MLE of α and ϕ, respectively.
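A sketch of this dispersion MLE in R reads as follows (Y, mu and v denote the
observed ratios Yi, the fitted means μ̂(xi) and the weights ni, assumed to be
available from the fitted GLM; the search interval may need adjustment):

  # gamma EDF: h(mu) = -1/mu and kappa(h(mu)) = log(mu)
  score.alpha <- function(alpha, Y, mu, v) {
    sum(v * (-Y/mu - log(mu) + log(Y) + log(alpha*v) + 1 - digamma(alpha*v)))
  }
  alpha.MLE <- uniroot(score.alpha, c(0.01, 100), Y=Y, mu=mu, v=v)$root
  phi.MLE <- 1/alpha.MLE    # MLE of the dispersion parameter
  # concavity of the log-likelihood in alpha guarantees a unique root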

5.3.8 Lab: Inverse Gaussian GLM for Claim Sizes

We present the inverse Gaussian GLM in this section as a competing model to the
gamma GLM studied in the previous section.

Infinite Divisibility

In the gamma model above we have used that the total claim amount S = Σ_{j=1}^n Zj
has a gamma distribution for given claim counts N = n > 0 and i.i.d. gamma
claim sizes Zj. This property is closely related to divisibility. A random variable S
is called divisible by n ∈ N if there exist i.i.d. random variables Z1, . . . , Zn such
that

    S (d)= Σ_{j=1}^n Zj,

and S is called infinitely divisible if S is divisible by n for all n ∈ N. The EDF


is based on parameters (θ, ω) ∈ Θ × W. Jørgensen [203] gives the following
interesting result.
Theorem 5.27 (Theorem 3.7 in Jørgensen [203], Without Proof) Choose a
member of the EDF with parameter set Θ × W. Then
• the index set W is an additive semi-group and N ⊆ W ⊆ R+ , and
• the members of the chosen EDF are infinitely divisible if and only if W = R+ .
This theorem tells us how to aggregate and disaggregate within EDFs, e.g.,
the Poisson, gamma and inverse Gaussian models are infinitely divisible, and the
binomial distribution is divisible by n with the disaggregated random variables
belonging to the same EDF and the same canonical parameter, see Sect. 2.2.2. In
particular, we also refer to Corollary 2.15 on the convolution property.

Inverse Gaussian Generalized Linear Model

Alternatively to the gamma GLM one often explores an inverse Gaussian GLM
which has a cubic variance function V (μ) = μ3 . We bring this inverse Gaussian
model into the same form as the gamma model of Sect. 5.3.7, so that we can
aggregate claims within insurance policies. The mean, the variance and the moment
generating function of an inverse Gaussian random variable Zi,j with parameters
αi , ci > 0 are given by

+ 6 ,
αi αi
E[Zi,j ] = , Var(Zi,j ) = and MZi,j (r) = exp αi ci − ci − 2r ,
2
ci ci3

where the moment generating function requires r < ci²/2 to be finite. From the
moment generating function we see that Si = Σ_{j=1}^{ni} Zi,j is inverse Gaussian
distributed with parameters ni αi and ci. Finally, we scale Yi = Si/(ni αi) which
provides us with an inverse Gaussian distribution with parameters ni^{1/2} αi^{1/2}
and ni^{1/2} αi^{1/2} ci. This random variable Yi has a single-parameter EDF inverse
Gaussian distribution in its reproductive form, namely,


    Yi ∼ f(y; θi, vi/ϕi) = exp{ (yθi − κ(θi)) / (ϕi/vi) + a(y; vi/ϕi) }    (5.50)

                         = √( αi / (2π y³/vi) ) exp{ − ( αi / (2y/vi) ) ( 1 − √(−2θi) y )² },


with cumulant function κ(θ) = −√(−2θ) for θ ∈ Θ = (−∞, 0], weight vi = ni,
dispersion parameter ϕi = 1/αi and canonical parameter θi = −ci²/2.
Similarly to the gamma case, this representation is not directly useful if the
parameter αi is not known. Therefore, we parametrize this model differently.
Namely, we consider
 
    Yi = Si/ni ∼ InvGauss( ni^{1/2} αi, ni^{1/2} ci ).    (5.51)

This re-scaled random variable has the same inverse Gaussian EDF (5.50), but
we need to re-interpret the parameters. We have dispersion parameter ϕi = 1/αi2
and canonical parameter θi = −ci2 /(2αi2 ). For our GLM analysis we will treat
the parameter αi ≡ α > 0 as a nuisance parameter that does not depend on the
specific policy i. Thus, we have constant dispersion ϕ = 1/α 2 and only the scale
parameter ci is assumed to be policy dependent through the canonical parameter
θi = −ci2 /(2α 2 ).
We are now in the same situation as in the gamma case in Sect. 5.3.7. We choose
the log-link for g which implies

    μi = E_θi[Yi] = κ'(θi) = 1/√(−2θi) = exp{ηi} = exp⟨β, xi⟩,

for x i ∈ X ⊂ Rq+1 describing the pre-processed features of policy i. We use the


same feature pre-processing as in model Gamma GLM2, and we call this resulting
model IG GLM2. Again the constant dispersion parameter ϕ = 1/α 2 cancels in the
score equations, thus, we do not need to explicitly specify the nuisance parameter
α to estimate the regression parameter β ∈ Rq+1 . However, there is an important
difference to the gamma GLM, namely, as stated in Example 5.6, we do not have a
concave maximization problem and Fisher’s scoring method needs a suitable initial
value. We start the fitting algorithm in the parameters of model Gamma GLM2.
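A sketch of this fit in R reads as follows (d.gamma2 denotes the fitted model
Gamma GLM2 whose coefficients serve as starting values, name illustrative; the
formula is the one of Listing 5.11 without BonusClass and Gender):

  d.ig2 <- glm(ClaimAmount/ClaimNb ~ OwnerAge + I(OwnerAge^2) + AreaGLM
               + RiskClass + VehAge + I(VehAge^2),
               family=inverse.gaussian(link="log"), data=mcdata0,
               weights=ClaimNb, start=coef(d.gamma2))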
The in-sample deviance loss in the inverse Gaussian GLM is given by

    D(L, μ̂(·)) = (1/m) Σ_{i=1}^m (ni/ϕ) (Yi − μ̂(xi))² / ( μ̂(xi)² Yi ),    (5.52)

where i runs over the policies i = 1, . . . , m with positive claims Yi = Si/ni > 0,
and μ̂(xi) = exp⟨β̂^MLE, xi⟩ is the MLE estimated regression function. The unit
deviances behave as

    d(Yi, μi) = Yi ( Yi^{−1} − μi^{−1} )²,    (5.53)

Table 5.14 Run times, number of parameters, AICs, in-sample losses, tenfold cross-validation
losses and the in-sample average claim amounts of the null gamma model, model Gamma GLM2,
the null inverse Gaussian model, and model inverse Gaussian GLM2; the deviance losses use unit
dispersion ϕ = 1

                  Run time   # Param.   AIC      In-sample      Tenfold CV     Average
                                                 loss on L      loss D^CV      amount
   Gamma null     –          1+1        14’416   2.085          2.091          24’641
   Gamma GLM2     1 s        7+1        14’274   1.719          1.747          25’130
   IG null        –          1+1        14’715   5.012 · 10−4   5.016 · 10−4   24’641
   IG GLM2        1 s        7+1        14’686   4.793 · 10−4   4.820 · 10−4   32’268

note that the log-likelihood is symmetric around its mode for scale μi^{−1}, see Fig. 5.5
(rhs). From this we receive deviance residuals (for v/ϕ = 1)

    ri^D = sign(Yi − μi) √d(Yi, μi) = Yi^{1/2} ( μi^{−1} − Yi^{−1} ).

Thus, these residuals behave as Yi^{1/2} for Yi → ∞ (and fixed μi^{−1}), which is
more heavy-tailed than the cube-root behavior Yi^{1/3} in the gamma case, see (5.48).
Another difference to the gamma case is that the deviance loss (5.52) is not scale-
invariant, see also (11.4), below.
We revisit the example of Table 5.13, but we replace the gamma distribution
by the inverse Gaussian distribution. The results in Table 5.14 show that the inverse
Gaussian model is not fully competitive on this data set. In view of (5.43) we observe
that the coefficient of variation (standard deviation divided by mean) is in the gamma
model given by 1/√α, thus, in the gamma model this coefficient of variation is
independent of the expected claim size μi and only depends on the shape parameter
α. In the inverse Gaussian model the coefficient of variation is given by

    Vco(Zi,j) = √Var(Zi,j) / E[Zi,j] = √μi / α,

thus, it monotonically increases in the expected claim size μi . It seems that this
structure is not fully suitable for this data set, i.e., there is no indication that the
coefficient of variation increases in the expected claim size. We come back to a
comparison of the gamma and the inverse Gaussian model in Sect. 11.1, below.

5.3.9 Log-Normal Model for Claim Sizes: A Short Discussion

Another way to improve the gamma model of Sect. 5.3.7 could be to use a log-
normal distribution instead. In the above situation this does not work because the
observations are not in the right format. If the claim observations Zi,j are log-

normally distributed, then log(Zi,j ) are normally distributed. Unfortunately, in our


Swedish motorcycle data set we do not have individual claim observations Zi,j ,
but the provided information is aggregated over all claims per insurance policy, i.e.,
Ni
Si = j =1 Zi,j . Therefore, there is no possibility here to challenge the gamma
framework of Sect. 5.3.7 with a corresponding log-normal framework, because
the log-normal framework is not closed under summation of i.i.d. log-normally
distributed random variables.
We would like to give some remarks that concern calculations on the log-scale (or
any other strictly increasing and concave transformation of the original data). For the
log-normal distribution, as well as in similar cases like the log-gamma distribution,
one works with logged observations Yi = log(Zi ). This is a strictly monotone
transformation and the MLEs in the log-normal model based on observations Zi
and in the normal model based on observations Yi = log(Zi ) coincide. This can be
seen from the following calculation. We start from the log-normal density on R+ ,
and we do a transformation of variable z > 0 → y = log(z) ∈ R with dy = dz/z


1 1 1
fLN (z; μ, σ 2 )dz = √ exp − 2 (log(z) − μ)2 dz
2πσ 2 z 2σ


1 1
= √ exp − 2 (y − μ)2 dy = f! (y; μ, σ 2 )dy.
2πσ 2 2σ

From this we see that the MLEs will coincide.


In many situations, one assumes that σ 2 > 0 is a given nuisance parameter,
and one models x → μ(x) with a GLM within the single-parameter EDF. In the
log-normal/Gaussian case one typically chooses the canonical link on the log-scale
which is the identity function. This then allows one to perform a classical linear
regression for μ(x) = ⟨β, x⟩ using the logged observations Y = (Y1, . . . , Yn)⊤ =
(log(Z1), . . . , log(Zn))⊤, and the corresponding MLE is given by

    β̂^MLE = (X⊤X)^{−1} X⊤ Y,    (5.54)

for full rank q + 1 ≤ n design matrix X. Note that in this case we have a closed-
form solution for the MLE of β. This is called the homoskedastic case because
all observations Yi are assumed to have the same variance σ 2 , otherwise, in the
heteroskedastic case, we would still have to include the covariance matrix.
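A minimal sketch of this log-scale fit and of the naive back-transformation
discussed below reads as follows (dat is a hypothetical data frame with positive
claims Z and features x1, x2):

  fit <- lm(log(Z) ~ x1 + x2, data=dat)  # homoskedastic Gaussian fit, MLE (5.54)
  sigma2 <- summary(fit)$sigma^2         # nuisance parameter estimate (with
                                         # denominator n-(q+1) instead of n)
  mu.Z <- exp(fitted(fit) + sigma2/2)    # means on the original observation scale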
Since we work with the canonical link on the log-scale we have the balance
property on the log-scale, see Corollary 5.7. Thus, we receive unbiasedness


    Eβ[ Σ_{i=1}^n E_{β̂^MLE}[Yi] ] = Eβ[ Σ_{i=1}^n ⟨β̂^MLE, xi⟩ ] = Σ_{i=1}^n Eβ[Yi] = Σ_{i=1}^n μ(xi).    (5.55)

Fig. 5.12 (lhs) Tukey–Anscombe plot of the fitted Gaussian model μ̂(xi) on the logged claim
sizes Yi = log(Zi), and (rhs) estimated means μ̂_{Zi} as a function of μ̂(xi) considering
heteroskedasticity σ̂(xi)

If we move back to the original scale of the observations Zi we receive from the
log-normal assumption

    E_{(β̂^MLE, σ²)}[Zi] = exp{ ⟨β̂^MLE, xi⟩ + σ²/2 }.

Therefore, we need to adjust with the nuisance parameter σ 2 for the back-
transformation to the original observation scale. At this point, typically, the dif-
ficulties start. Often, a good back-transformation involves a feature dependent
variance parameter σ 2 (x i ), thus, in many practical applications the homoskedas-
ticity assumption is not fulfilled, and a constant variance parameter choice leads to
a poor model on the original observation scale.
A suitable estimation of σ 2 (x i ) may turn out to be rather difficult. This is
illustrated in Fig. 5.12. The left-hand side of this figure shows the Tukey–Anscombe
plot of the homoskedastic case providing unscaled (σ 2 ≡ 1) (Pearson’s) residuals
on the log-scale

    ri^P = log(Zi) − μ̂(xi) = Yi − μ̂(xi).

The light-blue color shows an insurance policy dependent standard deviation
estimate σ̂(xi). In our case this estimate is non-monotone in μ̂(xi) (which is quite
common on real data). Using this estimate we can estimate the means of the log-
normal random variables by

    μ̂_{Zi} = Ê[Zi] = exp{ μ̂(xi) + σ̂(xi)²/2 }.

The right-hand side of Fig. 5.12 plots these estimated means μ̂_{Zi} against the
estimated means μ̂(xi) on the log-scale. We observe a graph that is non-monotone,
implied by the non-monotonicity of the standard deviation estimate σ̂(xi) as a
function of μ̂(xi). This non-monotonicity is not bad per se, as we still have a
proper statistical model, however, it might be rather counter-intuitive and difficult to
explain. For this reason it is advisable to directly model the expected value by one
single function, and not to decompose it into different regression functions.
Another important point to be considered is that for model selection using AIC
we have to work on the same scale for all models. Thus, if we use a gamma model to
model Zi , then for an AIC selection we need to evaluate also the log-normal model
on that scale. This can be seen from the justification in Sect. 4.2.3.
Finally, we focus on unbiasedness. Note that on the log-scale we have unbiased-
ness (5.55) through the balance property. Unfortunately, this does not carry over to
the original scale. We give a small example, where we assume that there is neither
any uncertainty about the distributional model nor about the nuisance parameter.
That is, we assume that Zi are i.i.d. log-normally distributed with parameters μ and
σ², where only μ is unknown. The MLE of μ is given by

    μ̂^MLE = (1/n) Σ_{i=1}^n log(Z_i) ∼ N(μ, σ²/n).

In this case we have

    E_{(μ,σ²)}[ (1/n) Σ_{i=1}^n E_{(μ̂^MLE, σ²)}[Z_i] ] = E_{(μ,σ²)}[ exp{μ̂^MLE} exp{σ²/2} ]
                                                       = exp{ μ + (1 + n^{-1}) σ²/2 }
                                                       > exp{ μ + σ²/2 } = E_{(μ,σ²)}[ (1/n) Σ_{i=1}^n Z_i ].

Volatility in the parameter estimate μ̂^MLE leads to a positive bias in this case. Note
that we have assumed full knowledge of the distributional model (i.i.d. log-normal)
and the nuisance parameter σ² in this calculation. If, for instance, we do not know
the true nuisance parameter and we work with a (deterministic) σ̃² ≪ σ² and n > 1,
we can get a negative bias

    E_{(μ,σ²)}[ (1/n) Σ_{i=1}^n E_{(μ̂^MLE, σ̃²)}[Z_i] ] = E_{(μ,σ²)}[ exp{μ̂^MLE} exp{σ̃²/2} ]
                                                       = exp{ μ + σ²/(2n) + σ̃²/2 }
                                                       < exp{ μ + σ²/2 } = E_{(μ,σ²)}[ (1/n) Σ_{i=1}^n Z_i ].

This shows that working on the log-scale is rather difficult because the back-
transformation is far from being trivial, and for unknown nuisance parameter not
even the sign of the bias is clear. Similar considerations apply to the frequently used
Box–Cox transformation [48] for χ ≠ 1

    Z_i → Y_i = (Z_i^χ − 1) / χ.

For this reason, if unbiasedness is a central requirement (like in insurance pricing),
non-linear transformations should only be used with great care (and only if
necessary).
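To close this discussion, the positive-bias formula derived above is easy to confirm
by simulation. A small Monte Carlo sketch (hypothetical parameter values, assuming
the nuisance parameter σ² is known):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma2, n = 0.0, 1.0, 5
    Z = rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=(100_000, n))

    mu_mle = np.log(Z).mean(axis=1)              # MLE of mu on each simulated sample
    est = np.exp(mu_mle + sigma2 / 2)            # back-transformed plug-in estimator of E[Z]

    print(est.mean())                            # ~ 1.822 = exp(mu + (1 + 1/n) sigma2/2)
    print(np.exp(mu + sigma2 / 2))               # true mean 1.649: the estimator is biased upwards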

5.4 Quasi-Likelihoods

Above we have mentioned the notion of over-dispersed Poisson models.
This naturally leads to so-called quasi-Poisson models and quasi-likelihoods. The
framework of quasi-likelihoods has been introduced by Wedderburn [376]. In this
section we give the main idea behind quasi-likelihoods, and for a more detailed
treatment and mathematical results we refer to Chapter 8 of McCullagh–Nelder
[265].
In Sect. 5.1.4 we have discussed the estimation of GLMs. This has been based
on the explicit knowledge of the full log-likelihood function ℓ_Y(β) for given data
Y. This has allowed us to calculate the score equations s(β, Y) = ∇_β ℓ_Y(β) = 0
whose solutions (Z-estimators) contain the MLE for β. The solutions of the score
equations themselves, using Fisher's scoring method, no longer need the explicit
functional form of the log-likelihood, but they are only based on the first and
second moments, see (5.9) and Remarks 5.4. Thus, all models where these first
two moments coincide will provide the same MLE for the regression parameter
β; this is also the explanation behind the IRLS algorithm. Moreover, the first two
moments are sufficient for prediction and uncertainty quantification based on mean
squared errors, and they are also sufficient to quantify asymptotic normality. This is
exactly what motivates the quasi-likelihood considerations, and these considerations
are also related to the quasi-generalized pseudo maximum likelihood estimator
(QPMLE) that we are going to discuss in Theorem 11.8, below.
Assume that Y is a random vector having first moment μ ∈ R^n, positive
definite variance function V(μ) ∈ R^{n×n} and dispersion parameter ϕ. The quasi-
(log-)likelihood function ℓ_Y(μ) assumes that its gradient is given by

    ∇_μ ℓ_Y(μ) = (1/ϕ) V(μ)^{-1} (Y − μ).

In case of a diagonal variance function V(μ) this relates to the score (5.9). The
remaining step is to model the mean parameter μ = μ(β) ∈ R^n as a function of a
lower dimensional regression parameter β ∈ R^{q+1}, we also refer to Fig. 5.2. For
this last step we assume that the Jacobian B ∈ R^{n×(q+1)} of dμ/dβ has full rank
q + 1. The score equations for β and given observations Y then read as

    (1/ϕ) B⊤ V(μ(β))^{-1} (Y − μ(β)) = 0.

This is of exactly the same structure as the score equations in Proposition 5.1, and
the roots are found by using the IRLS algorithm for t ≥ 0, see (5.12),

    β̂^{(t)} → β̂^{(t+1)} = ( B⊤ V(μ̂^{(t)})^{-1} B )^{-1} B⊤ V(μ̂^{(t)})^{-1} ( B β̂^{(t)} + Y − μ̂^{(t)} ),

where μ̂^{(t)} = μ(β̂^{(t)}).
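For illustration, a minimal numpy sketch of this IRLS recursion for a quasi-model with
log-link μ(β) = exp(Xβ) and diagonal power variance function V(μ) = diag(μ_i^p); the
data and function names are hypothetical:

    import numpy as np

    def quasi_irls(Y, X, p=0.5, n_iter=50):
        """IRLS for a quasi-Tweedie model: log-link mu = exp(X beta),
        diagonal variance function V(mu) = diag(mu_i^p)."""
        beta = np.zeros(X.shape[1])
        beta[0] = np.log(Y.mean())             # assumes the first column of X is the intercept
        for _ in range(n_iter):
            mu = np.exp(X @ beta)
            B = mu[:, None] * X                # Jacobian d mu / d beta for the log-link
            v_inv = mu ** (-p)                 # diagonal of V(mu)^{-1}
            lhs = B.T @ (v_inv[:, None] * B)
            rhs = B.T @ (v_inv * (B @ beta + Y - mu))
            beta = np.linalg.solve(lhs, rhs)
        return beta

At convergence, Pearson's estimate ϕ̂^P = (n − q − 1)^{-1} Σ_i (Y_i − μ̂_i)²/μ̂_i^p then
delivers the dispersion needed for uncertainty quantification.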
We conclude with the following points about quasi-likelihoods:
• For regression parameter estimation within the quasi-likelihood framework it
is sufficient to know the structure of the first two moments μ(β) ∈ Rn and
V (μ) ∈ Rn×n as well as the score equations. Thus, we do not need to explicitly
specify a distributional family for the observations Y . This structure of the first
two moments is then sufficient for their estimation using the IRLS algorithm, i.e.,
we receive the predictors within this framework.
• Since we do not specify the full distribution of Y we can neither simulate from
this model nor can we calculate quantities where the full log-likelihood of the
model needs to be known. For example, we cannot calculate AIC in a quasi-
likelihood model.
• The quasi-likelihood model is characterized by the functional forms of μ(β) and
V(μ). The former plays the role of the link function and the linear predictor in the
GLM, and the latter plays the role of the variance function within the EDF which
is characterized through the cumulant function κ. For instance, if we assume to
have a diagonal matrix

    V(μ) = diag( V(μ_1), . . . , V(μ_n) ),

then, the choice of the variance function μ → V(μ) describes the explicit
selection of the quasi-likelihood model. If we choose the power variance function
V(μ) = μ^p, p ∈ (0, 1), we have a quasi-Tweedie model.
• For prediction uncertainty evaluation we also need an estimate of the dispersion
parameter ϕ > 0. Since we do not know the full likelihood in this approach,
Pearson's estimate ϕ̂^P is the only option we have to estimate ϕ.
• For asymptotic normality results and hypothesis testing within the quasi-
likelihood framework we refer to Section 8.4 of McCullagh–Nelder [265].

5.5 Double Generalized Linear Model

In the derivations above we have treated the dispersion parameter ϕ in the GLM as
a nuisance parameter. In the case of a homogeneous dispersion parameter it can be
canceled in the score equations for MLE, see (5.9). Therefore, it does not influence
MLE, and in a subsequent step this nuisance parameter can still be estimated
using, e.g., Pearson’s or deviance residuals, see Sect. 5.3.1 and Remark 5.26. In
some examples we may have systematic effects in the dispersion parameter, too.
In this case the above approach will not work because a heterogeneous dispersion
parameter no longer cancels in the score equations. This has been considered in
Smyth [341] and Smyth–Verbyla [343]. The heterogeneous dispersion situation is
of general interest for GLMs, and it is of particular interest for Tweedie’s CP GLM
if we interpret Tweedie’s distribution [358] as a CP model with i.i.d. gamma claim
sizes, see Proposition 2.17; we also refer to Jørgensen–de Souza [204], Smyth–
Jørgensen [342] and Delong et al. [94].

5.5.1 The Dispersion Submodel

We extend model assumption (5.1) by assuming that also the dispersion parameter
ϕ_i is policy i dependent. Assume that all random variables Y_i are independent and
have densities w.r.t. a σ-finite measure ν on R given by

    Y_i ∼ f(y_i; θ_i, v_i/ϕ_i) = exp{ ( y_i θ_i − κ(θ_i) ) / (ϕ_i/v_i) + a(y_i; v_i/ϕ_i) },

for 1 ≤ i ≤ n, with canonical parameters θ_i ∈ Θ̊, exposures v_i > 0 and dispersion
parameters ϕ_i > 0. As in (5.5) we assume that every policy i is equipped with
feature information x_i ∈ X such that for a given link function g : M → R we can
model its mean as

    x_i → g(μ_i) = g(μ(x_i)) = g( E_{θ(x_i)}[Y_i] ) = η_i = η(x_i) = ⟨β, x_i⟩.     (5.56)

This provides us with log-likelihood function for observation Y = (Y_1, . . . , Y_n)⊤

    β → ℓ_Y(β) = Σ_{i=1}^n (v_i/ϕ_i) [ Y_i h(μ(x_i)) − κ(h(μ(x_i))) ] + a(Y_i; v_i/ϕ_i),

with canonical link h = (κ′)^{-1}. The difference to (5.7) is that the dispersion
parameter ϕ_i now depends on the insurance policy which requires additional
modeling. We choose a second strictly monotone and smooth link function
g_ϕ : R₊ → R, and we express the dispersion of policy 1 ≤ i ≤ n by

    g_ϕ(ϕ_i) = g_ϕ(ϕ(z_i)) = ⟨γ, z_i⟩,                                          (5.57)

where zi is the feature of policy i, which may potentially differ from x i . The
rationale behind this different feature is that different information might be relevant
for modeling the dispersion parameter, or feature information might be differently
pre-processed compared to the response Yi . We now need to estimate two regression
parameters β and γ in this approach on possibly differently pre-processed feature
information x i and zi of policy i. In general, this is not easily doable because the
term a(Yi ; vi /ϕi ) of the log-likelihood of Yi may have a complicated structure (or
may not be available in closed form like in Tweedie’s CP model).

5.5.2 Saddlepoint Approximation

We reformulate the EDF density using the unit deviance d(Y, μ) defined in (2.25);
we drop the lower index i for the moment. Set θ = h(μ) ∈ Θ̊ for the canonical link
h, then

    f(y; θ, v/ϕ) = exp{ (v/ϕ) [ y h(μ) − κ(h(μ)) ] + a(y; v/ϕ) }
                 = exp{ (v/ϕ) [ y h(y) − κ(h(y)) ] + a(y; v/ϕ) } exp{ −(1/(2ϕ/v)) d(y, μ) }
                 =: a*(y; ω) exp{ −(ω/2) d(y, μ) },                              (5.58)
with ω = v/ϕ ∈ W. This corresponds to (2.27), and it brings the EDF density into
a Gaussian-looking form. A general difficulty is that the term a ∗ (y; ω) may have a
complicated structure or may not be given in closed form. Therefore, we consider
its saddlepoint approximation; this is based on Section 3.5 of Jørgensen [203].
Suppose that we are in the absolutely continuous EDF case and that κ is steep.
In that case Y ∈ M, a.s., and the variance function y → V (y) is well-defined for
all observations Y = y, a.s. Based on Daniels [87], Barndorff-Nielsen–Cox [24]
proved the following statement, see Theorem 3.10 in Jørgensen [203]: assume there
exists ω0 ∈ W such that for all ω > ω0 the density (5.58) is bounded. Then, the
following saddlepoint approximation is uniform on compact subsets of the support
T of Y

    f(y; θ, v/ϕ) = ( (2πϕ/v) V(y) )^{-1/2} exp{ −(1/(2ϕ/v)) d(y, μ) } (1 + O(ϕ/v)),     (5.59)

as ϕ/v → 0. What makes this saddlepoint approximation attractive is that we can
get rid of a complicated function a*(y; ω) by a neat approximation ((2πϕ/v) V(y))^{-1/2}
for sufficiently large volumes v, and at the same time, this does not affect the unit
deviance d(y, μ), preserving the estimation properties of μ. The discrete counterpart
is given in Theorem 3.11 of Jørgensen [203].
Using saddlepoint approximation (5.59) we receive an approximate log-
likelihood function

    ℓ_Y(μ, ϕ) ≈ −(1/2) [ ϕ^{-1} v d(Y, μ) + log(ϕ) ] − (1/2) log( (2π/v) V(Y) ).

This approximation has an attractive form for dispersion estimation because it gives
an approximate EDF for observation d := v d(Y, μ), for given μ. Namely, for
canonical parameter φ = −ϕ^{-1} < 0 we have approximation

    ℓ_Y(μ, φ) ≈ ( dφ − (−log(−φ)) ) / 2 − (1/2) log( (2π/v) V(Y) ).             (5.60)

The right-hand side has the structure of a gamma EDF for observation d with
canonical parameter φ < 0, cumulant function κ_ϕ(φ) = −log(−φ) and dispersion
parameter 2. Thus, we have the structure of an approximate gamma model on the
right-hand side of (5.60) with, for given μ,

    E_φ[d | μ] ≈ κ_ϕ′(φ) = −1/φ = ϕ,                                            (5.61)
    Var_φ(d | μ) ≈ 2 κ_ϕ″(φ) = 2/φ² = 2ϕ².                                      (5.62)

These statements say that for given μ and assuming that the saddlepoint approx-
imation is sufficiently accurate, d is approximately gamma distributed with shape
parameter 1/2 and canonical parameter φ (which relates to the dispersion ϕ in the
mean parametrization). Thus, we can estimate φ and ϕ, respectively, with a (second)
GLM from (5.60), for given mean parameter μ.
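The quality of this approximation is easy to inspect numerically. The following sketch
(hypothetical parameter values) compares the exact reproductive gamma density with mean μ
and weight ω = v/ϕ to the saddlepoint approximation (5.59); the relative error is roughly
1/(12ω), uniformly in y:

    import numpy as np
    from scipy.stats import gamma

    mu, omega = 2.0, 10.0                             # mean and weight omega = v/phi
    y = np.linspace(0.5, 6.0, 5)

    exact = gamma.pdf(y, a=omega, scale=mu / omega)   # gamma with mean mu, Var = mu^2/omega

    d = 2 * (y / mu - 1 - np.log(y / mu))             # gamma unit deviance d(y, mu)
    saddle = np.sqrt(omega / (2 * np.pi)) / y * np.exp(-omega * d / 2)  # (5.59), V(y) = y^2

    print(saddle / exact - 1)                         # ~ 1/(12*omega) = 0.83% for omega = 10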
Remarks 5.28
• The accuracy of the saddlepoint approximation is discussed in Section 3.2 of
Smyth–Verbyla [343]. The saddlepoint approximation is exact in the Gaussian
and the inverse Gaussian case. In the Gaussian case, we have log-likelihood

    ℓ_Y(μ, φ) = ( dφ − (−log(−φ)) ) / 2 − (1/2) log( 2π/v ),
with variance function V(Y) = 1. In the inverse Gaussian case, we have log-
likelihood

    ℓ_Y(μ, φ) = ( dφ − (−log(−φ)) ) / 2 − (1/2) log( (2π/v) Y³ ),

with variance function V(Y) = Y³. Thus, in the Gaussian case and in the inverse
Gaussian case we have a gamma model for d with mean ϕ and shape parameter
1/2, for given μ; for a related result we also refer to Theorem 3 of Blæsild–Jensen
[38]. For Tweedie's models with p ≥ 1, one can show that the relative error of the
saddlepoint approximation is a non-increasing function of the squared coefficient
of variation τ = (ϕ/v) V(y)/y² = (ϕ/v) y^{p−2}, leading to small approximation
errors if ϕ/v is sufficiently small; typically one requires τ < 1/3, see Section 3.2 of
Smyth–Verbyla [343].
• The saddlepoint approximation itself does not provide a density because in gen-
eral the term O(ϕ/v) in (5.59) is non-zero. Nelder–Pregibon [282] renormalized
the saddlepoint approximation to a proper density and studied its properties.
• In the gamma EDF case, the saddlepoint approximation would not be necessary
because this case can still be solved in closed form. In fact, in the gamma EDF
case we have log-likelihood, set φ = −v/ϕ < 0,

    ℓ_Y(μ, φ) = ( φ d(Y, μ) − χ(φ) ) / 2 − log Y,                               (5.63)

with χ(φ) = 2( log Γ(−φ) + φ log(−φ) − φ ). For given μ, this is an EDF
for d(Y, μ) with cumulant function χ on the effective domain (−∞, 0). This
provides us with expected value and variance

    E_φ[d(Y, μ) | μ] = χ′(φ) = 2( −Ψ(−φ) + log(−φ) ) ≈ −1/φ,
    Var_φ(d(Y, μ) | μ) = 2χ″(φ) = 4( Ψ′(−φ) − 1/(−φ) ),

with digamma function Ψ, and the approximation exactly refers to the sad-
dlepoint approximation; for the variance statement we also refer to Fisher's
information (2.30). For receiving more accurate mean approximations one can
consider higher order terms, e.g., the second order approximation is χ′(φ) ≈
−1/φ + 1/(6φ²). In fact, from the saddlepoint approximation (5.60) and from
the exact formula (5.63) we receive in the gamma case Stirling's formula

    Γ(γ) ≈ √(2π) γ^{γ−1/2} e^{−γ}.

In the subsequent examples we will just use the saddlepoint approximation also
in the gamma EDF case.

5.5.3 Residual Maximum Likelihood Estimation

The saddlepoint approximation (5.60) proposes to alternate MLE of β for the mean
model (5.56) and of γ for the dispersion model (5.57). Fisher's information matrix
of the saddlepoint approximation (5.60) w.r.t. the canonical parameters θ and φ is
given by

    I(θ, φ) = −E_{θ,φ}[ ( φvκ″(θ)          −v(Y − κ′(θ))
                          −v(Y − κ′(θ))    −1/(2φ²)      ) ]
            = ( (v/ϕ(φ)) V(μ(θ))   0
                0                  (1/2) V_ϕ(ϕ(φ)) ),

with variance function V_ϕ(ϕ) = ϕ², and emphasizing that we work in the canonical
parametrization (θ, φ). This is a positive definite diagonal matrix which suggests
that the algorithm alternating the β and γ estimations will have a fast convergence.
For fixed estimate γ̂ we calculate estimated dispersion parameters ϕ̂_i = g_ϕ^{-1}⟨γ̂, z_i⟩
of policies 1 ≤ i ≤ n, see (5.57). These then allow us to calculate the diagonal working
weight matrix

    W(β) = diag( (∂g(μ_i)/∂μ_i)^{-2} (v_i/ϕ̂_i) (1/V(μ_i)) ; 1 ≤ i ≤ n ) ∈ R^{n×n},

which is used in Fisher’s scoring method/IRLS algorithm (5.12) to receive MLE β,


given the estimates ( ϕi )i . These MLEs allow us to estimate the mean parameters
μi = g −1 
β, x i , and to calculate the deviances
 
di = vi d (Yi ,
μi ) = 2vi Yi h (Yi ) − κ (h (Yi )) − Yi h ( μi )) ≥ 0.
μi ) + κ (h (

Using (5.60) we know that these deviances can be approximated by gamma
distributions Γ(1/2, 1/(2ϕ_i)). This is a single-parameter EDF with dispersion
parameter 2 (as nuisance parameter) and mean parameter ϕ_i. This motivates the
definition of the working weight matrix (based on the gamma EDF model)

    W_ϕ(γ) = diag( (∂g_ϕ(ϕ_i)/∂ϕ_i)^{-2} (1/2) (1/V_ϕ(ϕ_i)) ; 1 ≤ i ≤ n ) ∈ R^{n×n},

and the working residuals

    R_ϕ(d, γ) = ( (d_i − ϕ_i) ∂g_ϕ(ϕ_i)/∂ϕ_i )_{1≤i≤n} ∈ R^n.

Fisher’s scoring method (5.12) iterates for s ≥ 0 the following recursion to receive

γ
 −1  
γ (s+1) = Z Wϕ (
γ (s) → γ (s))Z Z Wϕ (
γ (s) ) Z
γ (s) + R ϕ (d,
γ (s)) ,
(5.64)
where Z = (z1 , . . . , zn ) is the design matrix used to estimate the dispersion
parameters.

5.5.4 Lab: Double GLM Algorithm for Gamma Claim Sizes

We revisit the Swedish motorcycle claim size data studied in Sect. 5.3.7. We expand
the gamma claim size GLM to a double GLM also modeling the systematic effects
in the dispersion parameter. In a first step we need to change the parametrization of
the gamma model of Sect. 5.3.7. In the former section we have modeled the average
claim size S_i/n_i ∼ Γ(n_i α_i, n_i c_i), but for applying the saddlepoint approximation
we should use the reproductive form (5.44) of the gamma model. We therefore set

    Y_i = S_i/(n_i α_i) ∼ Γ(n_i α_i, n_i α_i c_i).                              (5.65)

The reason for the different parametrization in Sect. 5.3.7 has been that (5.65) is not
directly useful if α_i is unknown because in that case the observations Y_i cannot be
calculated. In this section we estimate ϕ_i = 1/α_i which allows us to model (5.65);
a different treatment within Tweedie's family is presented in Sect. 11.1.3. The only
difficulty is to initialize the double GLM algorithm. We proceed as follows.
(0) In an initial step we assume constant dispersion ϕ_i = 1/α_i ≡ 1/α = 1. This
    gives us exactly the mean estimates of Sect. 5.3.7 for S_i/n_i ∼ Γ(n_i α, n_i c_i);
    note that for constant shape parameter α the mean of S_i/n_i can be estimated
    without explicit knowledge of α (because it cancels in the score equations).
    Using these mean estimates we calculate the MLE α̂^{(0)} of the (constant) shape
    parameter α, see Remark 5.26. This then allows us to determine the (scaled)
    observations Y_i^{(1)} = S_i/(n_i α̂^{(0)}) and we initialize ϕ̂_i^{(0)} = 1/α̂^{(0)}.
(1) Iterate for t ≥ 1:
    – estimate the mean μ_i of Y_i^{(t)} using the mean GLM (5.56) based on the
      observations Y_i^{(t)} and the dispersion estimates ϕ̂_i^{(t−1)}. This provides
      us with μ̂_i^{(t)};
    – based on the deviances d_i^{(t)} = v_i d(Y_i^{(t)}, μ̂_i^{(t)}), calculate the updated
      dispersion estimates ϕ̂_i^{(t)} using the dispersion GLM (5.57) and the residual
      MLE iteration (5.64) with the saddlepoint approximation. Set for the updated
      observations Y_i^{(t+1)} = S_i ϕ̂_i^{(t)}/n_i.
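A compact sketch of this alternating scheme, using the generic double GLM formulation
(mean GLM for S_i/n_i with working weights n_i/ϕ̂_i, dispersion GLM on the deviances)
rather than the exact rescaling of steps (0)–(1); it assumes statsmodels is available,
that the deviances are strictly positive, and that S (claim amounts), m (claim counts n_i)
and design matrices X, Z are given (all names are hypothetical):

    import numpy as np
    import statsmodels.api as sm

    def double_glm_gamma(S, m, X, Z, n_iter=5):
        """Alternate the mean GLM (5.56) and the dispersion GLM (5.57)
        for gamma claim sizes; both GLMs use the log-link."""
        gamma_log = sm.families.Gamma(link=sm.families.links.Log())
        y = S / m                                    # average claim sizes
        phi = np.ones(len(S))                        # initialization: constant dispersion
        for _ in range(n_iter):
            fit_mu = sm.GLM(y, X, family=gamma_log, var_weights=m / phi).fit()
            mu = fit_mu.fittedvalues
            d = 2 * m * (y / mu - 1 - np.log(y / mu))      # deviances d_i = v_i d(Y_i, mu_i)
            fit_phi = sm.GLM(d, Z, family=gamma_log).fit()  # gamma dispersion GLM, dispersion 2
            phi = fit_phi.fittedvalues
        return fit_mu, fit_phi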

Table 5.15 Number of parameters, AICs, Pearson's dispersion estimate, in-sample losses, tenfold
cross-validation losses and the in-sample average claim amounts of the null model (gamma
intercept model) and the (double) gamma GLM

                       # Param.   AIC      Dispersion est. ϕ̂^P   In-sample loss   Tenfold CV loss   Average amount
    Gamma null         1+1        14'416   2.057                 2.085            2.091             24'641
    Gamma GLM2         7+1        14'274   1.544                 1.719            1.747             25'130
    Double gamma GLM   7+6        14'258   –                     (1.721)          –                 26'413

In an initial double GLM analysis we use the feature information z_i = x_i for the
dispersion ϕ_i modeling (5.57). We choose for both GLMs the log-link which leads to
concave maximization problems, see Example 5.5. Running the above double GLM
algorithm converges in 4 iterations, and analyzing the resulting model we observe
that we should drop the variable RiskClass from the feature z_i. We then run the
same double GLM algorithm with the feature information x_i and the new z_i again,
and the results are presented in Table 5.15.
The considered double GLM has parameter dimensions β ∈ R^7 and γ ∈ R^6. To
have comparability with AIC of Sect. 5.3.7, we evaluate AIC of the double GLM
in the observations S_i/n_i (and not in Y_i; i.e., similar to the gamma GLM). We
observe that it has an improved AIC value compared to model Gamma GLM2.
Thus, indeed, dispersion modeling seems necessary in this example (under the
GLM2 regression structure). We do not calculate in-sample and cross-validation
losses in the double GLM because in the other two models of Table 5.15 we have
set ϕ = 1 in these statistics. However, the in-sample loss of model Gamma GLM2
with ϕ = 1 corresponds to the (homogeneous) deviance dispersion estimate (up to
scaling n/(n − (q + 1))), and this in-sample loss of 1.719 can directly be compared
to the average estimated dispersion m^{-1} Σ_{i=1}^m ϕ̂_i = 1.721 (in round brackets in
Table 5.15). On the downside, the double GLM has a bigger bias which needs an
adjustment.
In Fig. 5.13 (lhs) we give the normal plots of model Gamma GLM2 and the
double gamma GLM model. This plot is received by transforming the observations
to normal quantiles using the corresponding estimated gamma models. We see
quite some similarity between the two estimated gamma models. Both models
seem to have similar deficiencies, i.e., dispersion modeling improves explanation
of observations, however, either the regression function or the gamma distributional
assumption does not fully fit the data, especially for small claims. Finally, in
Fig. 5.13 (rhs) we plot the estimated dispersion parameters ϕ̂_i against the logged
estimated means log(μ̂_i) (linear predictors). We observe that the estimated disper-
sion has a (weak) U-shape as a function of the expected claim sizes which indicates
that the tails cannot fully be captured by our model. This closes this example.
Fig. 5.13 (lhs) Normal plot of the fitted models Gamma GLM2 and double GLM, (rhs) estimated
dispersion parameters ϕ̂_i against the logged estimated means log(μ̂_i) (the orange line gives the
in-sample loss in model Gamma GLM2)

Remark 5.29 For the dispersion estimation ϕ̂_i we use as observations the deviances
d_i = v_i d(Y_i, μ̂_i), 1 ≤ i ≤ n. On a finite sample, these deviances are typically
biased due to the use of the estimated means μ̂_i. Smyth–Verbyla [343] propose the
following bias correction. Consider the estimated hat matrix defined by

    H = W(β̂, γ̂)^{1/2} X ( X⊤ W(β̂, γ̂) X )^{-1} X⊤ W(β̂, γ̂)^{1/2},

with the diagonal working weight matrix W(β̂, γ̂) depending on the estimated
regression parameters β̂ and γ̂ through μ̂ and ϕ̂. Denote the diagonal entries of
the hat matrix by (h_{i,i})_{1≤i≤n}. A bias corrected version of the deviances is received
by considering observations (1 − h_{i,i})^{-1} d_i = (1 − h_{i,i})^{-1} v_i d(Y_i, μ̂_i), 1 ≤ i ≤ n.
We will come back to the hat matrix H in Sect. 5.6.1, below.

5.5.5 Tweedie’s Compound Poisson GLM

A popular situation for applying the double GLM framework is Tweedie’s CP


model introduced in Sect. 2.2.3, in particular, we refer to Proposition 2.17 for the
corresponding parametrization. Having claim frequency and claim sizes involved,
such a model can hardly be calibrated with one single regression function and a
constant dispersion parameter. An obvious choice is a double GLM, this is the
proposal presented in Smyth–Jørgensen [342]. In most of the cases one chooses for
both link functions g and gϕ the log-links because positivity needs to be guaranteed.

This implies for the two working weight matrices of the double GLM

    W(β) = diag( (v_i/ϕ_i) (1/V(μ_i)) μ_i² )_{1≤i≤n} = diag( (v_i/ϕ_i) μ_i^{2−p} )_{1≤i≤n},
    W_ϕ(γ) = diag( (1/2) (1/V_ϕ(ϕ_i)) ϕ_i² )_{1≤i≤n} = diag(1/2, . . . , 1/2).

The deviances in Tweedie’s CP model are given by, see (4.18),


 1−p 2−p

Yi −μ
i 1−p Y −μ
i 2−p
di = vi d (Yi ,
μi ) = 2vi Yi − i ≥ 0,
1−p 2−p

and these deviances could still be de-biased, see Remark 5.29. The working
responses for the two GLMs are

R = (Yi /μi − 1)


1≤i≤n and R ϕ = (di /ϕi − 1)
1≤i≤n .

The drawback of this approach is that it only considers the (scaled) total claim
amounts Y_i = S_i ϕ_i/v_i as observations, see Proposition 2.17. These total claim
amounts consist of the number of claims N_i and i.i.d. individual claim sizes
Z_{i,j} ∼ Γ(α, c_i), supposed N_i ≥ 1. Having observations of both claim amounts
S_i and claim counts N_i allows one to build a Poisson GLM for claim counts and
a gamma GLM for claim sizes which can be estimated separately. This has also
been the reason of Smyth–Jørgensen [342] to enhance Tweedie's model estimation
for known claim counts in their Section 4. Moreover, in Theorem 4 of Delong et
al. [94] it is proved that the two GLM approaches can be identified under log-link
choices.

5.6 Diagnostic Tools

In our examples we have studied several figures like AIC, cross-validation losses,
etc., for model and parameter selection. Moreover, we have plotted the results, for
instance, using the Tukey–Anscombe plot or the QQ plot. Of course, there are
numerous other plots and tools that can help us to analyze the results and to improve
the resulting models. We present some of these in this section.

5.6.1 The Hat Matrix

The MLE β̂^MLE satisfies at convergence of the IRLS algorithm, see (5.12),

    β̂^MLE = ( X⊤ W(β̂^MLE) X )^{-1} X⊤ W(β̂^MLE) ( X β̂^MLE + R(Y, β̂^MLE) ),

with working residuals for β ∈ R^{q+1}

    R(Y, β) = ( (Y_i − μ_i(β)) ∂g(μ_i)/∂μ_i |_{μ_i = μ_i(β)} )_{1≤i≤n} ∈ R^n.

Following Section 4.2.2 of Fahrmeir–Tutz [123], this allows us to define the so-
called hat matrix, see also Remark 5.29,

    H = H(β̂^MLE) = W(β̂^MLE)^{1/2} X ( X⊤ W(β̂^MLE) X )^{-1} X⊤ W(β̂^MLE)^{1/2} ∈ R^{n×n},     (5.66)

recall that the working weight matrix W(β) is diagonal. The hat matrix H is
symmetric and idempotent, i.e. H² = H, with trace(H) = rank(H) = q + 1.
Therefore, H acts as a projection, mapping the observations Ỹ to the fitted values

    Ỹ := W(β̂^MLE)^{1/2} ( X β̂^MLE + R(Y, β̂^MLE) )  →  H Ỹ = W(β̂^MLE)^{1/2} X β̂^MLE = W(β̂^MLE)^{1/2} η̂,

the latter being the fitted linear predictors. The diagonal elements h_{i,i} of this hat
matrix H satisfy 0 ≤ h_{i,i} ≤ 1, and values close to 1 correspond to extreme data
points i, in particular, for h_{i,i} = 1 only observation Ỹ_i influences η̂_i, whereas for
h_{i,i} = 0 observation Ỹ_i has no influence on η̂_i.
Fig. 5.14 Diagonal entries h_{i,i} of the two hat matrices of the example in Sect. 5.5.4: (lhs) for
means μ̂_i and responses Y_i, and (rhs) for dispersions ϕ̂_i and responses d_i

Figure 5.14 gives the resulting hat matrices of the double gamma GLM of
Sect. 5.5.4. On the left-hand side we show the diagonal entries h_{i,i} of the claim

amount responses Y_i (for the estimation of μ̂_i), and on the right-hand side the
corresponding plots for the deviance responses d_i (for the estimation of ϕ̂_i). These
diagonal elements h_{i,i} are ordered on the x-axis w.r.t. the linear predictors η̂_i. From
this figure we conclude that the diagonal entries of the hat matrices are bigger for
very small responses in our example, and the dispersion plot has a couple of more
special observations that may require further analysis.
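The diagonal of the hat matrix (5.66) never requires forming the full n × n matrix H. A
small numpy helper (assuming the vector of diagonal IRLS working weights at the MLE is
available; names are hypothetical):

    import numpy as np

    def glm_hat_diagonal(X, w):
        """Diagonal h_ii of H = W^{1/2} X (X' W X)^{-1} X' W^{1/2},
        with w the diagonal of the working weight matrix W(beta^MLE)."""
        WX = np.sqrt(w)[:, None] * X                         # W^{1/2} X
        M = np.linalg.solve(X.T @ (w[:, None] * X), WX.T)    # (X' W X)^{-1} (W^{1/2} X)'
        h = np.einsum('ij,ji->i', WX, M)                     # row-wise quadratic forms
        assert np.isclose(h.sum(), X.shape[1])               # trace(H) = q + 1
        return h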

5.6.2 Case Deletion and Generalized Cross-Validation

As a continuation of the previous subsection we can analyze the influence of
an individual observation Y_i on the estimation of the regression parameter β. This
influence is naturally measured by fitting the regression parameter based on the
full data D and based only on the observations L_{(−i)} = D \ {Y_i}, we also refer
to leave-one-out cross-validation in Sect. 4.2.2. The influence of observation Y_i is
then obtained by comparing β̂^MLE and β̂^MLE_{(−i)}. Since fitting n different models by
individually leaving out each observation Y_i is too costly, one only explores a one-
step Fisher's scoring update starting from β̂^MLE that provides an approximation to
β̂^MLE_{(−i)}, that is,

    β̂^{(1)}_{(−i)} = ( X⊤_{(−i)} W_{(−i)}(β̂^MLE) X_{(−i)} )^{-1} X⊤_{(−i)} W_{(−i)}(β̂^MLE) ( X β̂^MLE + R(Y, β̂^MLE) )_{(−i)}
                   = ( X⊤_{(−i)} W_{(−i)}(β̂^MLE) X_{(−i)} )^{-1} X⊤_{(−i)} W_{(−i)}(β̂^MLE)^{1/2} Ỹ_{(−i)},

where all lower indices (−i) indicate that we drop the corresponding row or/and
column from the matrices and vectors, and where Ỹ has been defined in the previous
subsection. This allows us to compare β̂^MLE and β̂^{(1)}_{(−i)} to analyze the influence of
observation Y_i.
To reformulate this approximation, we come back to the hat matrix H =
H(β̂^MLE) = (h_{i,j})_{1≤i,j≤n} defined in (5.66). It fulfills

    W(β̂^MLE)^{1/2} X β̂^MLE = H Ỹ = ( Σ_{j=1}^n h_{1,j} Ỹ_j , . . . , Σ_{j=1}^n h_{n,j} Ỹ_j )⊤ ∈ R^n.

Thus, for predicting Y_i we can consider the linear predictor (for the chosen link g)

    η̂_i = g(μ̂_i) = ⟨β̂^MLE, x_i⟩ = (X β̂^MLE)_i = W_{i,i}(β̂^MLE)^{-1/2} Σ_{j=1}^n h_{i,j} Ỹ_j.

A computation of the linear predictor of Yi using the leave-one-out approximation


(1)
β (−i) gives

1 MLE −1/2 hi,i


=  ηi − Wi,i ( *i .
(−i,1) (1)

ηi β (−i) , x i = β ) Y
1 − hi,i 1 − hi,i

This allows one to efficiently calculate a leave-one-out prediction using the hat
matrix H. This also motivates to study the generalized cross-validation (GCV) loss
which is an approximation to leave-one-out cross-validation, see Sect. 4.2.2,

    D̂^GCV = (1/n) Σ_{i=1}^n (v_i/ϕ) d( Y_i, g^{-1}(η̂_i^{(−i,1)}) )                                    (5.67)
          = (2/n) Σ_{i=1}^n (v_i/ϕ) [ Y_i h(Y_i) − κ(h(Y_i)) − Y_i h(g^{-1}(η̂_i^{(−i,1)})) + κ(h(g^{-1}(η̂_i^{(−i,1)}))) ].

Example 5.30 (Generalized Cross-Validation Loss in the Gaussian Case) We study
the generalized cross-validation loss D̂^GCV in the homoskedastic Gaussian case
v_i/ϕ ≡ 1/σ² with cumulant function κ(θ) = θ²/2 and canonical link g(μ) =
h(μ) = μ. The generalized cross-validation loss in the Gaussian case is given by

    D̂^GCV = (1/n) Σ_{i=1}^n (1/σ²) ( Y_i − η̂_i^{(−i,1)} )²,

with (linear) leave-one-out predictor

    η̂_i^{(−i,1)} = ⟨β̂^{(1)}_{(−i)}, x_i⟩ = Σ_{j=1, j≠i}^n (h_{i,j}/(1 − h_{i,i})) Y_j = (1/(1 − h_{i,i})) η̂_i − (h_{i,i}/(1 − h_{i,i})) Y_i.

This gives us the generalized cross-validation loss in the Gaussian case

    D̂^GCV = (1/n) Σ_{i=1}^n (1/σ²) ( (Y_i − η̂_i)/(1 − h_{i,i}) )²,

with β-independent hat matrix

    H = X ( X⊤ X )^{-1} X⊤.

The generalized cross-validation loss is used, for instance, for generalized addi-
tive model (GAM) fitting where an efficient and fast cross-validation method is
required to select regularization parameters. Generalized cross-validation has been
introduced by Craven–Wahba [84] but these authors replaced h_{i,i} by Σ_{j=1}^n h_{j,j}/n.
It holds that Σ_{j=1}^n h_{j,j} = trace(H) = q + 1, thus, using this approximation we
receive

    D̂^GCV ≈ (1/n) Σ_{i=1}^n (1/σ²) ( (Y_i − η̂_i)/(1 − Σ_{j=1}^n h_{j,j}/n) )²
          = n/(n − (q + 1))² Σ_{i=1}^n (Y_i − η̂_i)²/σ²
          = n/(n − (q + 1)) · ϕ̂^P/σ²,

with ϕ̂^P being Pearson's dispersion estimate in the Gaussian model, see (5.30). ∎

We give a numerical example based on the gamma GLM for the claim sizes
studied in Sect. 5.3.7.
Example 5.31 (Leave-One-Out Cross-Validation) The aim of this example is to
compare the generalized cross-validation loss D̂^GCV to the leave-one-out cross-
validation loss D̂^loo, see (4.34), the former being an approximation to the latter.
We do this for the gamma claim size model studied in Sect. 5.3.7. In this example
it is feasible to exactly calculate the leave-one-out cross-validation loss because we
have only 656 claims.
The results are presented in Table 5.16. Firstly, the different cross-validation
losses confirm that the model slightly (in-sample) over-fits to the data, which is
not a surprise when estimating 7 regression parameters based on 656 observations.
Secondly, the cross-validation losses provide similar numbers with leave-one-out
being slightly bigger than tenfold cross-validation, here. Thirdly, the generalized
cross-validation loss D̂^GCV manages to approximate the leave-one-out cross-
validation loss D̂^loo very well in this example.
Table 5.17 gives the corresponding results for model Poisson GLM1 of
Sect. 5.2.4. Firstly, in this example with 610'206 observations it is not feasible
to calculate the leave-one-out cross-validation loss (for computational reasons).
Therefore, we rely on the generalized cross-validation loss as an approximation.
From the results of Table 5.17 it seems that this approximation (rather) under-
estimates the loss (compared to tenfold cross-validation). Indeed, this is an
observation that we have made also in other examples. ∎

Table 5.16 Comparison of different cross-validation losses for model Gamma GLM2

                                         Gamma GLM2
    In-sample loss D(L, μ̂^MLE_L)         1.719
    Tenfold CV loss D̂^CV                 1.747
    Leave-one-out CV loss D̂^loo          1.756
    Generalized CV loss D̂^GCV            1.758

Table 5.17 Comparison of different cross-validation losses for model Poisson GLM1

                                         Poisson GLM1
    In-sample loss D(L, μ̂^MLE_L)         24.101
    Tenfold CV loss D̂^CV                 24.121
    Leave-one-out CV loss D̂^loo          N/A
    Generalized CV loss D̂^GCV            24.105

5.7 Generalized Linear Models with Categorical Responses

The reader will have noticed that the discussion of GLMs in this chapter has
been focusing on the single-parameter linear EDF case (5.1). In many actuarial
applications we also want to study examples of the vector-valued parameter
EF (2.2). We briefly discuss the categorical case since this case is frequently used.

5.7.1 Logistic Categorical Generalized Linear Model

We recall the EF representation of the categorical distribution studied in Sect. 2.1.4.
We choose as ν the counting measure on the finite set Y = {1, . . . , k + 1}. A random
variable Y taking values in Y is called categorical, and the levels y ∈ Y can either
be ordinal or nominal. This motivates dummy coding of the categorical random
variable Y providing

    T(Y) = ( 1_{{Y=1}}, . . . , 1_{{Y=k}} )⊤ ∈ {0, 1}^k,                        (5.68)

thus, k + 1 has been chosen as reference level. For the canonical parameter
θ = (θ_1, . . . , θ_k)⊤ ∈ Θ = R^k we have cumulant function and mean functional,
respectively,

    κ(θ) = log( 1 + Σ_{j=1}^k e^{θ_j} ),
    p = E_θ[T(Y)] = ∇_θ κ(θ) = ( e^{θ_1}, . . . , e^{θ_k} )⊤ / ( 1 + Σ_{j=1}^k e^{θ_j} ).

With these choices we receive the EF representation of the categorical distribution
(set θ_{k+1} = 0)

    dF(y; θ) = exp{ ⟨θ, T(y)⟩ − log( 1 + Σ_{j=1}^k e^{θ_j} ) } dν(y)
             = Π_{l=1}^{k+1} ( e^{θ_l} / Σ_{j=1}^{k+1} e^{θ_j} )^{1_{{y=l}}} dν(y).

The covariance matrix of T(Y) is given by

    Σ(θ) = Var_θ(T(Y)) = ∇²_θ κ(θ) = diag(p) − p p⊤ ∈ R^{k×k}.

Assume that we have feature information x ∈ X ⊂ {1} × R^q for response variable
Y. This allows us to lift this categorical model to a GLM. The logistic GLM assumes
for p = (p_1, . . . , p_k)⊤ ∈ (0, 1)^k a regression function, 1 ≤ l ≤ k,

    x → p_l = p_l(x) = P_β[Y = l] = exp⟨β_l, x⟩ / ( 1 + Σ_{j=1}^k exp⟨β_j, x⟩ ),     (5.69)

for regression parameter β = (β_1⊤, . . . , β_k⊤)⊤ ∈ R^{k(q+1)}. Equivalently, we can
rewrite these regression probabilities relative to the reference level, that is, we
consider linear predictors for 1 ≤ l ≤ k

    η_l(x) = log( P_β[Y = l] / P_β[Y = k + 1] ) = ⟨β_l, x⟩.                       (5.70)

Note that this naturally gives us the canonical link h which we have already derived
in Sect. 2.1.4. Define the matrix for feature x ∈ X ⊂ {1} × R^q

    X = ( x⊤  0   0   · · ·  0
          0   x⊤  0   · · ·  0
          0   0   x⊤  · · ·  0
          ⋮   ⋮   ⋮   ⋱     ⋮
          0   0   0   · · ·  x⊤ ) ∈ R^{k×k(q+1)}.                               (5.71)

This gives linear predictor and canonical parameter, respectively, under the canonical
link h

    θ = h(p(x)) = η(x) = Xβ = ( ⟨β_1, x⟩, . . . , ⟨β_k, x⟩ )⊤ ∈ Θ = R^k.        (5.72)

5.7.2 Maximum Likelihood Estimation in Categorical Models

Assume we have n independent observations Y_i following the logistic categorical
GLM (5.69) with features x_i ∈ R^{q+1} and X_i ∈ R^{k×k(q+1)}, respectively, for 1 ≤
i ≤ n. The joint log-likelihood function is given by, we use (5.72),

    β → ℓ_Y(β) = Σ_{i=1}^n ⟨X_i β, T(Y_i)⟩ − κ(X_i β).

This provides us with score equations

    s(β, Y) = ∇_β ℓ_Y(β) = Σ_{i=1}^n X_i⊤ ( T(Y_i) − ∇_θ κ(X_i β) ) = Σ_{i=1}^n X_i⊤ ( T(Y_i) − p(x_i) ) = 0,

with logistic regression function (5.69) for p(x). For the score equations with
canonical link we also refer to the second case in Proposition 5.1. Next, we calculate
Fisher's information matrix, we also refer to (3.16),

    I_n(β) = −E_β[ ∇²_β ℓ_Y(β) ] = Σ_{i=1}^n X_i⊤ Σ_i(β) X_i,

with covariance matrix of T(Y_i)

    Σ_i(β) = ∇²_θ κ(X_i β) = diag( p(x_i) ) − p(x_i) p(x_i)⊤.

We rewrite the score in a similar way as in Sect. 5.1.4. This requires for general link
g(p) = η and inverse link p = g^{-1}(η), respectively, the following block diagonal
matrix

    W(β) = diag( ( ∇_η g^{-1}(η)|_{η=X_iβ} )⊤ Σ_i(β)^{-1} ( ∇_η g^{-1}(η)|_{η=X_iβ} ) )_{1≤i≤n}
         = diag( ( ∇_p g(p)|_{p=g^{-1}(X_iβ)} Σ_i(β) ( ∇_p g(p)|_{p=g^{-1}(X_iβ)} )⊤ )^{-1} )_{1≤i≤n},     (5.73)

and the working residuals

    R(Y, β) = ( ∇_p g(p)|_{p=g^{-1}(X_iβ)} ( T(Y_i) − p(x_i) ) )_{1≤i≤n}.                               (5.74)

Because we work with the canonical link g = h and g^{-1} = ∇_θ κ, we can use the
simplified block diagonal matrix

    W(β) = diag( Σ_1(β), . . . , Σ_n(β) ) ∈ R^{kn×kn},

and the working residuals

    R(Y, β) = ( Σ_i(β)^{-1} ( T(Y_i) − p(x_i) ) )_{1≤i≤n} ∈ R^{kn}.

Finally, we define the design matrix

    X = ( X_1⊤, X_2⊤, . . . , X_n⊤ )⊤ ∈ R^{kn×k(q+1)},

obtained by stacking the matrices X_1, . . . , X_n.

Putting everything together we receive the score equations

    s(β, Y) = ∇_β ℓ_Y(β) = X⊤ W(β) R(Y, β) = 0.                                  (5.75)

This is now exactly in the same form as in Proposition 5.1. Fisher's scoring
method/IRLS algorithm then allows us to recursively calculate the MLE of β ∈
R^{k(q+1)} by

    β̂^{(t)} → β̂^{(t+1)} = ( X⊤ W(β̂^{(t)}) X )^{-1} X⊤ W(β̂^{(t)}) ( X β̂^{(t)} + R(Y, β̂^{(t)}) ).

We have asymptotic normality of the MLE (under suitable regularity conditions)

    β̂_n^MLE ≈ N( β, I_n(β)^{-1} )    (in distribution),

for large sample sizes n. This allows us to apply the Wald test (5.32) for back-
ward parameter elimination. Moreover, in-sample and out-of-sample losses can
be analyzed with unit deviances coming from the categorical cross-entropy loss
function (4.19).
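A minimal numpy sketch of this categorical GLM under the canonical link: plain (slow)
gradient ascent on the concave log-likelihood, with the gradient given by the score
equations above; T denotes the dummy coding (5.68) with shape (n, k), and all names
are hypothetical:

    import numpy as np

    def softmax_probs(beta, x):
        """Probabilities p_l(x), 1 <= l <= k, of the logistic GLM (5.69);
        beta has shape (k, q+1), level k+1 is the reference level."""
        eta = x @ beta.T                                   # linear predictors, shape (n, k)
        num = np.exp(eta)
        return num / (1 + num.sum(axis=1, keepdims=True))

    def fit_categorical(T, x, steps=5000, lr=0.5):
        """Gradient ascent; the gradient is the score
        sum_i X_i'(T(Y_i) - p(x_i)), written level-wise."""
        beta = np.zeros((T.shape[1], x.shape[1]))
        for _ in range(steps):
            beta += lr * (T - softmax_probs(beta, x)).T @ x / len(x)
        return beta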
Remarks 5.32 The above derivations have been done for the categorical distribution
under the canonical link choice. However, these considerations hold true for more
general links g within the vector-valued parameter EF. That is, the block diagonal
matrix W (β) in (5.73) and the working residuals R(Y , β) in (5.74) provide score
equations (5.75) for general vector-valued parameter EF examples, and where we
replace the categorical probability p by the mean μ = Eβ [T (Y )].

5.8 Further Topics of Regression Modeling

There are several special topics and tools in regression modeling that we have not
discussed, yet. Some of them will be considered in selected chapters below, and
some points are mentioned here, without going into detail.

5.8.1 Longitudinal Data and Random Effects

The GLMs studied above have been considering cross-sectional data, meaning that
we have fixed one time period t and studied this time period in an isolated fashion.
Time-dependent extensions are called longitudinal or panel data. Consider a time
series of data (Yi,t , x i,t ) for policies 1 ≤ i ≤ n and time points t ≥ 1. For the
prediction of response variable Yi,t we may then regress on the individual past
history of policy i, given by the data

    D_{i,t} = { Y_{i,1}, . . . , Y_{i,t−1}, x_{i,1}, . . . , x_{i,t} }.

In particular, we may explore the distribution of Y_{i,t}, conditionally given D_{i,t},

    Y_{i,t} | D_{i,t} ∼ F(· | D_{i,t}; θ),

for canonical parameter θ ∈ Θ and F(· | D_{i,t}; θ) being a member of the EDF. For a
GLM we choose a link function g and make the assumption

    g( E_β[ Y_{i,t} | D_{i,t} ] ) = ⟨β, z_{i,t}⟩,                                (5.76)

where zi,t ∈ Rq+1 is a (q + 1)-dimensional and σ (Di,t )-measurable feature vector,


and regression parameter β ∈ Rq+1 describes the common systematic effects across
all policies 1 ≤ i ≤ n. This gives a generalized auto-regressive model, and if we
have the Markov property

(d)
F (·|Di,t ; θ ) = F (·|Yi,t −1 , x i,t ; θ ) for all t ≥ 2 and θ ∈ ,

we obtain a generalized auto-regressive model of order 1. These longitudinal models


allow one to model experience rating, for instance, in car insurance where the
past claims history directly influences the future insurance prices, we refer to
Remark 5.15 on bonus-malus systems (BMS).
The next level of complexity is obtained by extending regression structure (5.76)
by policy i specific random effects B i such that we may postulate
 
g Eβ [Yi,t |Di,t , B i ] = β, zi,t + B i , wi,t , (5.77)

with σ (Di,t )-measurable feature vector wi,t . Regression parameter β then describes
the fixed systematic effects that are common over the entire portfolio 1 ≤ i ≤ n
and B i describes the policy dependent random effects (assumed to be normalized
E[B i ] = 0). Typically one assumes that B 1 , . . . , B n are centered and i.i.d. Such
effects are called static random effects because they are not time-dependent, and
they may also be interpreted in a Bayesian sense.
Finally, extending these static random effects to dynamic random effects B i,t ,
t ≥ 1, leads to so-called state-space models, the linear state-space model being the
most popular example and being fitted using the Kalman filter [207].

5.8.2 Regression Models Beyond the GLM Framework

There are several ways in which the GLM framework can be modified.

Siblings of Generalized Linear Regression Functions

The most common modification of GLMs concerns the regression structure, namely,
that the scalar product in the linear predictor

    x → g(μ) = η = ⟨β, x⟩,

is replaced by another regression function. A popular alternative is the framework


of generalized additive models (GAMs). GAMs go back to Hastie–Tibshirani
[181, 182] and the standard reference is Wood [384]. GAMs consider the regression
functions

    x → g(μ) = η = β_0 + Σ_j β_j s_j(x_j),                                       (5.78)

where sj : R → R are natural cubic splines. Natural cubic splines sj are obtained
by concatenating cubic functions in so-called nodes. A GAM can have as many
nodes in each cubic spline sj as there are different levels xi,j in the data 1 ≤ i ≤ n.
In general, this leads to very flexible regression models, and to control in-sample
over-fitting regularization is applied, for regularization we also refer to Sect. 6.2.
Regularization requires setting a tuning parameter, and an efficient determination of
this tuning parameter uses generalized cross-validation, see Sect. 5.6. Nevertheless,
fitting GAMs can be very computational, already for portfolios with 1 million
policies and involving 20 feature components the calibration can be very slow.
Moreover, regression function (5.78) does not (directly) allow for a data driven
method of finding interactions between feature components. For these reasons, we
do not further study GAMs in this monograph.
A modification in the regression function that is able to consider interactions
between feature components is the framework of classification and regression trees
(CARTs). CARTs have been introduced by Breiman et al. [54] in 1984, and they
are still used in its original form today. Regression trees aim to partition the feature
space X into a finite number of disjoint subsets Xt , 1 ≤ t ≤ T , such that all policies
(Yi , x i ) in the same subset x i ∈ Xt satisfy a certain homogeneity property w.r.t. the
regression task (and the chosen loss function). The CART regression function is
then defined by


T
x → μ(x) =
μt 1{x∈Xt } ,
t =1

where μt is the homogeneous mean estimator on Xt . These CARTs are popular


building blocks for ensemble methods where different regression functions are
combined, we mention random forests and boosting algorithms that mainly rely
on CARTs. Random forests have been introduced by Breiman [52], and boosting
has been popularized by Valiant [362], Kearns–Valiant [209, 210], Schapire [328],

Freund [139] and Freund–Schapire [140]. Today boosting belongs to the most
powerful predictive regression methods, we mention the XGBoost algorithm of
Chen–Guestrin [71] that has won many competitions. We will not further study
CARTs and boosting in these notes because these methods also have some
drawbacks. For instance, resulting regression functions are not continuous nor do
they easily allow to extrapolate data beyond the (observed) feature space, e.g., if we
have a time component. Moreover, they are more difficult in the use of unstructured
data such as text data. For more on CARTs and boosting in actuarial science we
refer to Denuit et al. [100] and Ferrario–Hämmerli [125].

Other Distributional Models

The theory above has been relying on the EDF, but, of course, we could also study
any other family of distribution functions. A clear drawback of the EDF is that
it only considers light-tailed distribution functions, i.e., distribution functions for
which the moment generating function exists around the origin. If the data is more
heavy-tailed, one may need to transform this data and then use the EDF on the
transformed data (with the drawback that one loses the balance property) or one
chooses another family of distribution functions. Transformations have already been
discussed in Remarks 2.11 and Sect. 5.3.9. Another two families of distributions that
have been studied in the actuarial literature are the generalized beta of the second
kind (GB2) distribution, see Venter [369], Frees et al. [137] and Chan et al. [66], and
inhomogeneous phase type (IHP) distributions, see Albrecher et al. [8] and Bladt
[37]. The GB2 family is a 4-parameter family, and it nests several examples such
as the gamma, the Weibull, the Pareto and the Lomax distributions, see Table B1 in
Chan et al. [66]. The density of the GB2 distribution is for y > 0 given by

    f(y; a, b, α_1, α_2) = (|a|/b) (y/b)^{aα_1−1} / ( B(α_1, α_2) [1 + (y/b)^a]^{α_1+α_2} )     (5.79)
                         = |a| / ( y B(α_1, α_2) ) · ( (y/b)^a / (1 + (y/b)^a) )^{α_1} ( 1 / (1 + (y/b)^a) )^{α_2},

with scale parameter b > 0, shape parameters a ∈ R and α_1, α_2 > 0, and beta
function

    B(α_1, α_2) = Γ(α_1) Γ(α_2) / Γ(α_1 + α_2).

Consider a modified logistic transformation of variable y → z = (y/b)^a / (1 +
(y/b)^a) ∈ (0, 1). This gives us the beta density

    f(z; α_1, α_2) = z^{α_1−1} (1 − z)^{α_2−1} / B(α_1, α_2).

Thus, the GB2 distribution can be obtained by a transformation of the beta
distribution. The latter provides that a GB2 distributed random variable Y can be
simulated from Y = b (Z/(1 − Z))^{1/a} (in distribution) with Z ∼ Beta(α_1, α_2).
A GB2 distributed random variable Y has first moment

    E_{a,b,α_1,α_2}[Y] = b · B(α_1 + 1/a, α_2 − 1/a) / B(α_1, α_2),

for −α1 a < 1 < α2 a. Observe that for a > 0 we have that the survival function of
Y is regularly varying with tail index α2 a > 0. Thus, we can model Pareto-like tails
with the GB2 family; for regular variation we refer to (1.3).
As proposed in Frees et al. [137], one can introduce a regression structure for
b > 0 by choosing a log-link and setting

    log( E_{a,b,α_1,α_2}[Y] ) = log( B(α_1 + 1/a, α_2 − 1/a) / B(α_1, α_2) ) + ⟨β, x⟩.

MLE of β may pose some challenge because it depends on nuisance parameters


a, α1 , α2 . In a recent paper Li et al. [251], there is a proposal to extend this GB2
regression to a composite regression model; composite models are discussed in
Sect. 6.4.4, below. This closes this short section, and for more examples we refer
to the literature.

5.8.3 Quantile Regression

Pinball Loss Function

The GLMs introduced above aim at estimating the means μ(x) = Eθ(x) [Y ] of
random variables Y being explained by features x. Since mean estimation can
be rather sensitive in situations where we have large claims, the more robust
quantile regression has attracted some attention, recently. Quantile regression has
been introduced by Koenker–Bassett [220]. The idea is that instead of estimating
the mean μ of a random variable Y , we rather try to estimate its τ -quantile for
given τ ∈ (0, 1). The τ -quantile is given by the generalized inverse F −1 (τ ) of the
distribution function F of Y , that is,

F −1 (τ ) = inf {y ∈ R; F (y) ≥ τ } . (5.80)

Consider the pinball loss function for y ∈ C (convex closure of the support of Y)
and actions a ∈ A = R

    (y, a) → L_τ(y, a) = (y − a) ( τ − 1_{{y−a<0}} ) ≥ 0.                        (5.81)

This provides us with the expected loss for Y ∼ F and action a ∈ A

    E_F[L_τ(Y, a)] = E_F[ (Y − a) ( τ − 1_{{Y<a}} ) ]
                   = (τ − 1) E_F[ (Y − a) 1_{{Y<a}} ] + τ E_F[ (Y − a) 1_{{Y≥a}} ]
                   = (τ − 1) ∫_{−∞}^a (y − a) dF(y) + τ ∫_a^∞ (y − a) dF(y).
−∞ a

The aim is to find an optimal action â(F) that minimizes this expected loss,
see (4.24),

    â(F) ∈ A(F) = arg min_{a∈A} E_F[ L_τ(Y, a) ].

Note that for the time being we do not know whether the solution to this
minimization problem is a singleton. For this reason, we state the solution (subject
to existence) as a set-valued functional A, see (4.25).
We calculate the score equation of the expected loss using the Leibniz rule

    ∂/∂a E_F[ L_τ(Y, a) ] = −(τ − 1) ∫_{−∞}^a dF(y) − τ ∫_a^∞ dF(y)
                          = −(τ − 1) F(a) − τ (1 − F(a)) = F(a) − τ,

and setting this derivative equal to zero gives F(a) = τ. Assume the distribution F
is continuous. This implies F(F^{-1}(τ)) = τ, and we have

    F^{-1}(τ) ∈ A(F) = arg min_{a∈A} E_F[ L_τ(Y, a) ].

In fact, using the pinball loss, we have just seen that the τ -quantile is elicitable
within the class of continuous distributions, see Definition 4.18.
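This can be checked empirically: minimizing the average pinball loss over a grid of
actions recovers the empirical τ-quantile (a hypothetical minimal sketch):

    import numpy as np

    rng = np.random.default_rng(3)
    y, tau = rng.gamma(2.0, 1.5, size=10_000), 0.9

    def pinball(a):
        return np.mean((y - a) * (tau - (y - a < 0)))      # average loss (5.81)

    grid = np.linspace(y.min(), y.max(), 2001)
    a_star = grid[np.argmin([pinball(a) for a in grid])]
    print(a_star, np.quantile(y, tau))                     # the two (nearly) coincide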
For a more general result we need a more general definition of a (set-valued)
τ-quantile

    Q_τ(F) = { y ∈ R;  lim_{z↑y} F(z) ≤ τ ≤ F(y) }.                              (5.82)

This defines a closed interval and its lower endpoint corresponds to the generalized
inverse F −1 (τ ) given in (5.80). In complete analogy to Theorem 4.19 on the
elicitability of the mean functional, we have the following statement for the τ -
quantile; this result goes back to Thomson [351] and Saerens [326].
Theorem 5.33 (Gneiting [162, Theorem 9], Without Proof) Let F be the class of
distribution functions on an interval C ⊆ R and choose quantile level τ ∈ (0, 1).
• The τ -quantile (5.82) is elicitable relative to F .

• Assume the loss function L : C × A → R₊ satisfies (L0)–(L2) on page 92 for
interval C = A ⊆ R. L is consistent for the τ-quantile (5.82) relative to the class
F of compactly supported distributions on C if and only if L is of the form

    L(y, a) = ( G(y) − G(a) ) ( τ − 1_{{y−a<0}} ),

for a non-decreasing function G on C.


• If G is strictly increasing on C and if EF [G(Y )] exists and is finite for all F ∈
F, then the above loss function L is strictly consistent for the τ -quantile (5.82)
relative to the class F .
Theorem 5.33 characterizes the strictly consistent loss functions for quantile
estimation, the pinball loss being the special case G(y) = y.

Quantile Regression

The idea behind quantile regression is that we build a regression model for the τ-
quantile. Assume we have a datum (Y, x) whose conditional τ-quantile, given x ∈
{1} × R^q, can be described by the regression function

    x → g( F_{Y|x}^{-1}(τ) ) = ⟨β_τ, x⟩,

for a strictly monotone and smooth link function g : C → R, and for a regression
parameter β_τ ∈ R^{q+1}. The aim now is to estimate this regression parameter from
independent data (Y_i, x_i), 1 ≤ i ≤ n. The pinball loss L_τ, given in (5.81), provides
us with the following optimization problem

    β̂_τ = arg min_{β∈R^{q+1}} Σ_{i=1}^n L_τ( Y_i, g^{-1}⟨β, x_i⟩ ).

This then allows us to estimate the corresponding τ-quantile as a function of the
feature information x. For τ = 1/2 we estimate the median by

    F̂_{Y|x}^{-1}(1/2) = g^{-1}⟨β̂_{1/2}, x⟩.
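For the identity link, this optimization is readily available in standard software; a
sketch using statsmodels' QuantReg on simulated heteroskedastic data (all parameter
values hypothetical):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 2, size=2000)
    y = 1.0 + 2.0 * x + (0.5 + x) * rng.normal(size=2000)  # heteroskedastic noise
    X = sm.add_constant(x)

    beta_tau = sm.QuantReg(y, X).fit(q=0.9).params
    print(beta_tau)   # ~ (1 + 0.5*1.2816, 2 + 1.2816), the true 90% quantile line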

We conclude from this short section that we can regress any quantity a(F) that is
elicitable, i.e., for which a loss function exists that is strictly consistent for a(F)
on F ∈ F. For more on quantile regression we refer to the monograph of Uribe–
Guillén [361], and an interesting paper is Dimitriades et al. [106]. We will study
quantile regression within deep networks in Sect. 11.2, below.
Chapter 6
Bayesian Methods, Regularization
and Expectation-Maximization

The previous chapter has been focusing on MLE of regression parameters within
GLMs. Alternatively, we could address the parameter estimation problem within a
Bayesian setting. The purpose of this chapter is to discuss the Bayesian estimation
approach. This leads us to the notion of regularization within GLMs. Bayesian
methods are also used in the Expectation-Maximization (EM) algorithm for MLE
in the case of incomplete data. For literature on Bayesian theory we recommend
Gelman et al. [157], Congdon [79], Robert [319], Bühlmann–Gisler [58] and Gilks
et al. [158]. A nice historical (non-mathematical) review of Bayesian methods is
presented in McGrayne [266]. Regularization is discussed in the book of Hastie et
al. [184], and a good reference for the EM algorithm is McLachlan–Krishnan [267].

6.1 Bayesian Parameter Estimation

The Bayesian estimator has been introduced in Definition 3.6. Assume that the
observation Y has independent components Y_i that can be described by a GLM
with link function g and regression parameter β ∈ R^{q+1}, i.e., the random variables
Y_i have independent densities

    Y_i ∼ f(y; β, x_i, v_i/ϕ) = exp{ ( y (h∘g^{-1})⟨β, x_i⟩ − (κ∘h∘g^{-1})⟨β, x_i⟩ ) / (ϕ/v_i) + a(y; v_i/ϕ) },

with canonical link h = (κ′)^{-1}. In a Bayesian approach one models the regression
parameter β with a prior distribution¹ π(β) on the parameter space R^{q+1}, and the
independence assumption between the components of Y needs to be understood

¹ Often, in Bayesian arguing, distribution and density is used in an interchangeable (and not fully
precise) way, and it is left to the reader to give the right meaning to π.


conditionally, given the regression parameter β. In other words, all observations


Yi share the same regression parameter β, which itself is modeled by a prior
distribution π.
The joint density of Y and β is given by

    p(y, β) = ( Π_{i=1}^n f(y_i; β, x_i, v_i/ϕ) ) π(β) = exp{ ℓ_{Y=y}(β) + log π(β) }.     (6.1)

For the given observation Y, this allows us to calculate the posterior density of β
using Bayes' rule

    π(β|Y) = p(Y, β) / ∫ p(Y, β̃) dβ̃ ∝ ( Π_{i=1}^n f(Y_i; β, x_i, v_i/ϕ) ) π(β),     (6.2)

where the proportionality sign ∝ indicates that we have dropped the terms that do
not depend on β. Thus, the functional form in β of the posterior density π(β|Y )
is fully determined by the joint density p(Y , β), and the remaining term is a
normalization to obtain a proper probability distribution. In many situations, the
knowledge of the functional form of the posterior density in β is sufficient to
perform Bayesian parameter estimation, at least, numerically. We will give some
references, below.
The Bayesian estimator for β is given by the posterior mean (supposed it exists)

    β̂^Bayes = E_π[ β | Y ] = ∫ β π(β|Y) dν(β).

If we want to calculate the expectation of a new random variable Y_{n+1} that is
conditionally, given β, independent of Y and follows the same GLM as Y, we can
directly calculate, using the tower property and conditional independence,²

    E_π[ Y_{n+1} | Y ] = E_π[ E[ Y_{n+1} | β, Y ] | Y ] = E_π[ E[ Y_{n+1} | β ] | Y ]
                       = E_π[ g^{-1}⟨β, x_{n+1}⟩ | Y ] = ∫ g^{-1}⟨β, x_{n+1}⟩ π(β|Y) dν(β),

supposed that this first moment exists and that x_{n+1} is the feature of Y_{n+1}. We see
that it all boils down to have sufficiently explicit knowledge about the posterior
density π(β|Y) given in (6.2).

² Note that we identify probabilities P_β[·] = P[·|β] for given β.
Remark 6.1 (Conditional MSEP) Based on the assumption that the posterior distri-
bution π(β|Y) can be determined, we can analyze the GL. In a Bayesian setup one
usually does not calculate the MSEP as described in Theorem 4.1, but one rather
studies the conditional MSEP, conditioned exactly on the collected information Y.
That is,

    E_π[ ( Y_{n+1} − E_π[Y_{n+1}|Y] )² | Y ] = Var_π( Y_{n+1} | Y )
        = Var_π( E[Y_{n+1}|β, Y] | Y ) + E_π[ Var(Y_{n+1}|β, Y) | Y ]
        = Var_π( g^{-1}⟨β, x_{n+1}⟩ | Y ) + E_π[ (ϕ/v_{n+1}) (κ″∘h∘g^{-1})⟨β, x_{n+1}⟩ | Y ]
        = Var_π( g^{-1}⟨β, x_{n+1}⟩ | Y ) + E_π[ (ϕ/v_{n+1}) V( g^{-1}⟨β, x_{n+1}⟩ ) | Y ],

where we need to assume existence of second moments. Similar to Theorem 4.1,


the first term is the estimation variance (in a Bayesian setting) and the second term
is the average process variance (using the EDF variance function μ → V (μ)).
The remaining difficulty is the calculation of the posterior expectation of functions
of β, based on the posterior density (6.2). In very well-designed experiments the
posterior density π(β|Y) can be determined explicitly, for instance, in the
homogeneous EDF case with so-called conjugate priors, see Chapter 2 in
Bühlmann–Gisler [58]. But in most cases, there is no closed form solution for the
posterior distribution.
Major progress in Bayesian modeling has been made with the emergence of
computational methods like the Markov chain Monte Carlo (MCMC) method, Gibbs
sampling, the Metropolis–Hastings (MH) algorithm [185, 274], sequential Monte
Carlo (SMC) sampling, non-linear particle filters, and the Hamiltonian Monte
Carlo (HMC) algorithm. These methods help us to empirically approximate the posterior
density π(β|Y ) in different modeling setups. These methods have in common that
the explicit knowledge of the normalizing constant in (6.2) is not necessary, but it
suffices to know the functional form in β of the posterior density π(β|Y ).
For a detailed description of MCMC methods in general, which includes Gibbs
sampling and MH algorithms, we refer to Gilks et al. [158], Green [169, 170],
Johansen et al. [199]; SMC sampling and non-linear particle filters are explained
in Del Moral et al. [92, 93], Johansen–Evers [199], Doucet–Johansen [111], Creal
[85] and Wüthrich [389]; the HMC algorithm is described in Neal [281]. We do not
present these algorithms here, but for the description of the most popular algorithms
we refer to Section 4.4 in Wüthrich–Buser [392]. The reason for not presenting
these algorithms here is that they still face the curse of dimensionality, which makes
it difficult to use Bayesian methods for high-dimensional data sets in large models;
we provide another short discussion in Sect. 11.6.3, below.
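
For illustration, the following minimal R sketch implements a random-walk Metropolis–Hastings sampler for π(β|Y), here for a Poisson GLM with canonical log-link (setting ϕ = v_i = 1) and a centered Gaussian prior; only the functional form (6.2) of the posterior enters. The posterior sample is then used to estimate the predictive mean and the conditional MSEP decomposition of Remark 6.1 (for the Poisson EDF we have V(μ) = μ). The synthetic data, tuning choices and all names are ours and purely illustrative.

    # Minimal random-walk Metropolis-Hastings sketch for pi(beta|Y); Poisson GLM
    # with log-link, phi = v_i = 1, and a N(0, sigma0^2) prior on beta.
    set.seed(100)
    n <- 1000
    X <- cbind(1, rnorm(n))                           # design matrix with intercept
    Y <- rpois(n, lambda = as.vector(exp(X %*% c(-2, 0.5))))  # synthetic claim counts

    log.post <- function(beta, sigma0 = 10) {         # unnormalized log-posterior, see (6.2)
      eta <- as.vector(X %*% beta)
      sum(Y * eta - exp(eta)) - sum(beta^2) / (2 * sigma0^2)
    }

    T.mc <- 20000; tau <- 0.05                        # chain length and step size
    beta.sim <- matrix(0, T.mc, 2)
    lp <- log.post(beta.sim[1, ])
    for (t in 2:T.mc) {
      prop <- beta.sim[t - 1, ] + tau * rnorm(2)      # random-walk proposal
      lp.prop <- log.post(prop)
      if (log(runif(1)) < lp.prop - lp) {             # MH acceptance step
        beta.sim[t, ] <- prop; lp <- lp.prop
      } else beta.sim[t, ] <- beta.sim[t - 1, ]
    }
    beta.sim <- beta.sim[-(1:5000), ]                 # drop burn-in

    x.new <- c(1, 0.3)                                # new feature x_{n+1}
    mu.sim <- as.vector(exp(beta.sim %*% x.new))      # samples of g^{-1}<beta, x_{n+1}>
    mean(mu.sim)                                      # estimates E_pi[Y_{n+1}|Y]
    var(mu.sim) + mean(mu.sim)                        # conditional MSEP: estimation + process variance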

6.2 Regularization

6.2.1 Maximum a Posteriori Estimator

In the previous section we have proposed to approximate the posterior density


π(β|Y ) of the regression parameter β, given Y , using MCMC methods. The
posterior log-likelihood in the Bayesian GLM is given by, see (6.2),

    log π(β|Y) ∝ ℓ_Y(β) + log π(β)
               ∝ Σ_{i=1}^n [Y_i (h ∘ g^{−1})⟨β, x_i⟩ − (κ ∘ h ∘ g^{−1})⟨β, x_i⟩] / (ϕ/v_i) + log π(β).

Compared to the classical log-likelihood function ℓ_Y(β) for MLE, there is an
additional log-density term log π(β) that comes from the prior distribution of β.
Thus, the posterior log-likelihood is a balanced version of the log-likelihood ℓ_Y(β)
of the data Y and the prior log-density log π(β) of the regression parameter β. We
interpret this as regularization because the prior π smooths extremes in the
log-likelihood of the observation Y. This gives rise to estimating the regression
parameter β by the so-called maximum a posteriori (MAP) estimator

    β̂^MAP = arg max_{β∈R^{q+1}} log π(β|Y) = arg max_{β∈R^{q+1}} ℓ_Y(β) + log π(β).        (6.3)

This π-regularized (MAP) parameter estimation has gained much popularity
because it is a useful tool to prevent the model from over-fitting under suitable
prior choices. Moreover, under specific choices, it allows for parameter selection.
This is especially useful in high-dimensional problems; for a reference we refer to
Hastie et al. [184].
   Popular choices for π are prior densities coming from L^p-norms for some p ≥ 1,
that is, π(β) ∝ exp{−λ‖β‖_p^p} for λ > 0. Optimization problem (6.3) then becomes

    β̂^MAP = arg max_{β∈R^{q+1}} ℓ_Y(β) − λ‖β‖_p^p,

for a fixed regularization parameter λ > 0 (also called tuning parameter). In


practical applications we should exclude the intercept parameter β0 ∈ R from
regularization: if we work with the canonical link within the GLM framework
we have the balance property which implies unbiasedness, see Corollary 5.7. This
property gets lost if β_0 is included in the regularization term. For this reason, we set
β_− = (β_1, . . . , β_q)^⊤ ∈ R^q and we let regularization only act on these components

    β̂^MAP = β̂^MAP(λ) = arg max_{β∈R^{q+1}} (1/n) ℓ_Y(β) − λ‖β_−‖_p^p;        (6.4)

we also scale with the sample size n to make the units of the tuning parameter λ
independent of the sample size n.
Remarks 6.2

• The regularization term λ‖β_−‖_p^p keeps the components of the regression parameter
  β_− close to zero, thus, it prevents over-fitting by letting parameters only take
  moderate values. The magnitudes of the parameter values are controlled by the
  regularization parameter λ > 0 which acts as a hyper-parameter. Optimal
  hyper-parameters are determined by cross-validation.
• In (6.4) all components of β_− are treated equally. This may not be appropriate
  if the feature components of x live on different scales. This problem of different
  scales can be solved by either scaling the components of x to a unit scale, or
  by introducing a diagonal importance matrix T = diag(t_1, . . . , t_q) with t_j > 0
  that describes the scales of the components of x. This allows us to regularize
  ‖T^{−1}β_−‖_p^p instead of ‖β_−‖_p^p. Thus, in this latter case we replace (6.4) by the
  weighted version

      β̂^MAP = arg max_β (1/n) ℓ_Y(β) − λ Σ_{j=1}^q t_j^{−p} |β_j|^p.

• Often, the features have a natural group structure x = (x_0, x_1, . . . , x_K)^⊤, for
  instance, x_k ∈ {0, 1}^{q_k} may represent dummy coding of a categorical feature
  component with q_k + 1 levels. In that case regularization should equally act on
  all components of β_k ∈ R^{q_k} (that correspond to x_k) because these components
  describe the same systematic effect. Yuan–Lin [398] proposed for this problem
  grouped penalties of the form

      β̂^MAP = arg max_β (1/n) ℓ_Y(β) − λ Σ_{k=1}^K ‖β_k‖_2.        (6.5)

  This proposal leads to sparsity, i.e., for large regularization parameters λ the
  entire β_k may be shrunk (exactly) to zero; this is discussed in Sect. 6.2.5, below.
  We also refer to Section 4.3 in Hastie et al. [184]; Devriendt et al. [104] proposed
  this approach in the actuarial literature.
• There are more versions of regularization, e.g., in the fused LASSO approach one
  ensures that the first differences β_j − β_{j−1} remain small.

   Our motivation for considering regularization has been inspired by Bayesian
theory, but we can also come from a completely different angle, namely, we can
consider a constrained optimization problem with a given budget constraint c > 0.
That is, we can consider

    arg max_{β∈R^{q+1}} (1/n) ℓ_Y(β)    subject to    ‖β_−‖_p^p ≤ c.        (6.6)

This optimization problem can be tackled by the method of Karush, Kuhn and
Tucker (KKT) [208, 228]. Optimization problem (6.4) corresponds by Lagrangian
duality to the constrained optimization problem (6.6). For every c for which the
budget constraint in (6.6) is binding, ‖β_−‖_p^p = c, there is a corresponding
regularization parameter λ = λ(c), and, conversely, the solution of (6.4) solves (6.6)
with c = ‖β̂_−^MAP(λ)‖_p^p.

6.2.2 Ridge vs. LASSO Regularization

We compare the two special cases of p = 1, 2 in this section, and in the subsequent
Sects. 6.2.3 and 6.2.4 we discuss how these two cases can be solved numerically.
Ridge Regularization (p = 2) For p = 2, the prior distribution π in (6.4) is a
centered Gaussian distribution. This L²-regularization is called ridge regularization
or Tikhonov regularization [353], and we have

    β̂^ridge = β̂^ridge(λ) = arg max_{β∈R^{q+1}} (1/n) ℓ_Y(β) − λ Σ_{j=1}^q β_j².        (6.7)

LASSO Regularization (p = 1) For p = 1, the prior distribution π in (6.4) is a
Laplace distribution. This L¹-regularization is called LASSO regularization (least
absolute shrinkage and selection operator), see Tibshirani [352], and we have

    β̂^LASSO = β̂^LASSO(λ) = arg max_{β∈R^{q+1}} (1/n) ℓ_Y(β) − λ Σ_{j=1}^q |β_j|.        (6.8)

LASSO regularization has the advantage that it shrinks (unimportant) regression


components to exactly zero, i.e., LASSO regularization can be used for parameter
elimination and model reduction. This is discussed in the next paragraphs.

Ridge vs. LASSO Regularization Ridge (p = 2) and LASSO (p = 1)
regularization behave rather differently. This can be understood best by using the
budget constraint (6.6) interpretation which gives us a nice geometric illustration.
The crucial part is that the side constraint gives us either a budget constraint
‖β_−‖_2² = Σ_{j=1}^q β_j² ≤ c (squared Euclidean norm) or ‖β_−‖_1 = Σ_{j=1}^q |β_j| ≤ c
(Manhattan norm). In Fig. 6.1 we illustrate these two cases, the left-hand side shows
the Euclidean ball in blue color (in two dimensions) and the right-hand side shows
the corresponding Manhattan square in blue color; this figure is similar to Figure 2.2
in Hastie et al. [184].
   The (unconstrained) MLE β̂^MLE is illustrated by the red dot in Fig. 6.1. If the
red dot were to lie within the blue area, the budget constraint would not be binding.
In Fig. 6.1 the red dot (MLE) does not lie within the blue budget constraint,
and we need to compromise in the optimality of the MLE. Assume that the
log-likelihood β → ℓ_Y(β) is a concave function in β, then we receive convex level sets
{β; ℓ_Y(β) ≥ γ_0} around the MLE β̂^MLE. The critical constant γ_0 for which this level
set is tangential to the blue budget constraint exactly gives us the solution to (6.6);
this solution corresponds to the yellow dots in Fig. 6.1. The crucial difference
between ridge and LASSO regularization is that in the latter case the yellow dot
will eventually be in the corner of the Manhattan square if we shrink the budget
constraint c to zero. Or in other words, some of the components of β are set
exactly equal to zero for small c or large λ, respectively; in Fig. 6.1 (rhs) this
happens to the first component of β̂^LASSO (under the given budget constraint c).

Fig. 6.1 Illustration of optimization problem (6.6) under a budget constraint: (lhs) p = 2
(Euclidean norm) and (rhs) p = 1 (Manhattan norm)

Fig. 6.2 Elastic net regularization (comparison of the ridge, LASSO and elastic net constraint sets)

In
ridge regularization this is not the case, except for special situations concerning the
position of the red MLE. Thus, ridge regression makes components of parameter
estimates generally smaller, whereas LASSO shrinks some of these components
exactly to zero (this also explains the name LASSO).

Remark 6.3 (Elastic Net) LASSO regularization faces difficulties with collinearity
in feature components. In particular, if we have a group of highly correlated feature
components, LASSO fails to do a grouped selection, but it selects one component
and ignores the other ones. On the other hand, ridge regularization can deal with
this issue. For this reason, Zou–Hastie [409] proposed the elastic net regularization,
which uses a combined regularization term

    β̂^elastic net = arg max_{β∈R^{q+1}} (1/n) ℓ_Y(β) − λ [ (1 − α)‖β‖_2² + α‖β‖_1 ],

for some α ∈ (0, 1). The L¹-term gives sparsity and the quadratic term removes
the limitation on the number of selected variables, providing a grouped selection.
In Fig. 6.2 we compare the elastic net regularization (orange color) to ridge and
LASSO regularization (black and blue color). Ridge regularization provides a
smooth strictly convex boundary (black), whereas LASSO provides a boundary that
is non-differentiable in the corners (blue). The elastic net is still non-differentiable
in the corners, this is needed for variable selection, and at the same time it is strictly
convex between the corners which is needed for grouping.
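
For practical fitting, the R package glmnet [142] solves the elastic net problem along an entire path of λ's (in the Gaussian case and several other EDF cases). A minimal usage sketch, where X (design matrix without intercept column) and Y are assumed to be given; note that glmnet parametrizes the quadratic penalty with an additional factor 1/2:

    library(glmnet)
    fit <- glmnet(x = X, y = Y, family = "gaussian",
                  alpha = 0.5,           # alpha = 1: LASSO, alpha = 0: ridge, in between: elastic net
                  standardize = TRUE)    # bring feature components to a unit scale
    cv <- cv.glmnet(x = X, y = Y, alpha = 0.5)   # cross-validation along the lambda path
    coef(fit, s = cv$lambda.min)                 # fitted coefficients at the optimal lambda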

6.2.3 Ridge Regression

In this section we consider ridge regression (p = 2) in more detail and we provide
an example. The ridge estimator β̂^ridge in (6.7) is found by solving the score
equations

    s̃(β, Y) = ∇_β ( ℓ_Y(β) − nλ‖β_−‖_2² ) = X^⊤ W(β) R(Y, β) − 2nλ β_− = 0;        (6.9)

note that we exclude the intercept β_0 from regularization (we use a slight abuse of
notation, here), and we also refer to Proposition 5.1. The negative expected Hessian
of this optimization problem is given by

    J(β) = −E_β[ ∇²_β ( ℓ_Y(β) − nλ‖β_−‖_2² ) ] = I(β) + 2nλ diag(0, 1, . . . , 1) ∈ R^{(q+1)×(q+1)},

where I(β) = X^⊤ W(β) X is Fisher's information matrix of the unconstrained MLE
problem. This provides us with Fisher's scoring updates for t ≥ 0, see (5.13),

    β̂^{(t)} → β̂^{(t+1)} = β̂^{(t)} + J(β̂^{(t)})^{−1} s̃(β̂^{(t)}, Y).        (6.10)

Lemma 6.4 Fisher’s scoring update (6.10) can be rewritten as follows


 
(t +1)
→ = J (
β )−1 X W (
β ) X β + R(Y ,
(t ) (t ) (t ) (t ) (t )
β β β ) .

Proof A straightforward calculation shows

(t +1)
= β + J ( β )−1 * s(
(t ) (t ) (t )
β β ,Y)
 
= J ( β )−1 J ( β ) β + X W ( β )R(Y , β ) − 2nλ
(t ) (t ) (t ) (t ) (t ) (t )
β−
 
= J ( β )−1 I( β ) β + X W ( β )R(Y ,
(t ) (t ) (t ) (t ) (t )
β )
 
= J ( β )−1 X W ( β ) X β + R(Y ,
(t ) (t ) (t ) (t )
β ) .

This proves the claim. 



Lemma 6.4 allows us to fit a ridge regularized GLM. To determine an optimal
regularization parameter λ ≥ 0 one uses cross-validation; in particular, generalized
cross-validation is used to obtain an efficient cross-validation method, see (5.67).
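
For illustration, a minimal R sketch of the ridge regularized Fisher's scoring updates of Lemma 6.4 for a gamma GLM with log-link (where the working weight matrix is W(β) = diag(v_i/ϕ) and R(Y, β) has components Y_i/μ_i − 1, see Example 6.8, below); the design matrix X (first column encoding the intercept), responses Y and weights vv are assumed to be given, and all names are ours.

    ridge.gamma.glm <- function(X, Y, vv, lambda, phi = 1, iter = 50) {
      n <- nrow(X); q1 <- ncol(X)
      beta <- c(log(weighted.mean(Y, vv)), rep(0, q1 - 1))  # start in the null model
      W <- diag(vv / phi)                                   # gamma log-link working weights
      D <- 2 * n * lambda * diag(c(0, rep(1, q1 - 1)))      # intercept is not regularized
      for (t in 1:iter) {
        mu <- as.vector(exp(X %*% beta))
        R  <- Y / mu - 1                                    # working residuals R(Y, beta)
        J  <- t(X) %*% W %*% X + D                          # J(beta) = I(beta) + 2n lambda diag(0,1,...,1)
        beta <- as.vector(solve(J, t(X) %*% W %*% (X %*% beta + R)))  # Lemma 6.4 update
      }
      beta
    }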
Example 6.5 (Ridge Regression) We revisit the gamma claim size example of
Sect. 5.3.7, and we choose model Gamma GLM1, see Listing 5.11. This example
does not consider any categorical features, but only continuous ones. We directly
apply Fisher's scoring updates (6.10).3 For this analysis we center and normalize
(to unit variance) the columns of the design matrix (except for the initial column of
X encoding the intercept).

Fig. 6.3 Ridge regularized MLEs in model Gamma GLM1: (lhs) in-sample deviance losses as a
function of the regularization parameter λ > 0, (rhs) resulting β̂_j^ridge(λ) for 1 ≤ j ≤ q = 8
   Figure 6.3 (lhs) shows the resulting in-sample deviance losses as a function of
λ > 0. Regularization parameter λ allows us to continuously connect the in-sample
deviance losses of the null model (2.085) and model Gamma GLM1 (1.717), see
Table 5.13. Figure 6.3 (rhs) shows the regression parameter estimates β̂_j^ridge(λ),
1 ≤ j ≤ q = 8, as a function of λ > 0. Overall they decrease because the budget
constraint gets tighter for increasing λ, however, the individual parameters do
not need to be monotone, since one parameter may (better) compensate a decrease
of another (through correlations in feature components).
   Finally, we need to choose the optimal regularization parameter λ > 0.
This is done by cross-validation. We exploit the generalized cross-validation loss,
see (5.67), and the hat matrix in this ridge regularized case is given by

    H_λ = W(β̂^ridge)^{1/2} X J(β̂^ridge)^{−1} X^⊤ W(β̂^ridge)^{1/2}.

In contrast to (5.66), this hat matrix H_λ is not a projection, but we would need to
work in an augmented model to receive the projection property (accounting for the
regularization part).
3 The R command glmnet [142] allows for regularized MLE; however, the current version does
not include the gamma distribution. Therefore, we have implemented our own routine.

Fig. 6.4 Generalized cross-validation loss D̂^GCV(λ) as a function of λ > 0

   Figure 6.4 plots the generalized cross-validation loss as a function of λ > 0.
We observe the minimum in parameter λ = e^{−9.4}. The resulting generalized
cross-validation loss is 1.76742. This is bigger than the one received in model Gamma

GLM2, see Table 5.16, thus, we still prefer model Gamma GLM2 over the optimally
ridge regularized model GLM1. Note that for model Gamma GLM2 we did variable
selection, whereas ridge regression just generally shrinks regression parameters.
For more interpretation we refer to Example 6.8, below, which considers LASSO
regularization. 

6.2.4 LASSO Regularization

In this section we consider LASSO regularization (p = 1). This is more
challenging than ridge regularization because of the non-differentiability of the
budget constraint, see Fig. 6.1 (rhs). This section follows Chapters 2 and 5 of
Hastie et al. [184] and Parikh–Boyd [292].

Gaussian Case

We start with the homoskedastic Gaussian model having unit variance σ² = 1. In a
first step, the regression model only involves one feature component, q = 1. Thus,
we aim at solving the LASSO optimization

    β̂^LASSO = arg max_{β∈R²} −(1/2n) Σ_{i=1}^n (Y_i − β_0 − β_1 x_i)² − λ|β_1|.

We standardize the observations and features (Y_i, x_i)_{1≤i≤n} such that we have
Ȳ = n^{−1} Σ_{i=1}^n Y_i = 0, Σ_{i=1}^n x_i = 0 and n^{−1} Σ_{i=1}^n x_i² = 1. This implies that
we can omit the intercept parameter β_0, as the optimal intercept satisfies for this
standardized data (and any β_1 ∈ R)

    β̂_0 = (1/n) Σ_{i=1}^n (Y_i − β_1 x_i) = 0.        (6.11)

Thus, w.l.o.g., we assume to work with standardized data in this section; this gives
us the optimization problem (we drop the lower index in β_1 because we only have
one component)

    β̂^LASSO = β̂^LASSO(λ) = arg max_{β∈R} −(1/2n) Σ_{i=1}^n (Y_i − βx_i)² − λ|β|.        (6.12)

The difficulty is that the regularization term is not differentiable in zero. Since this
term is convex we can express its derivative in terms of a sub-gradient s. This
provides the score

    ∂/∂β ( −(1/2n) Σ_{i=1}^n (Y_i − βx_i)² − λ|β| ) = (1/n) Σ_{i=1}^n (Y_i − βx_i) x_i − λs
                                                   = n^{−1}⟨Y, x⟩ − β − λs,

where we use the standardization n^{−1} Σ_{i=1}^n x_i² = 1 in the second step, ⟨Y, x⟩ is the
scalar product of Y and x = (x_1, . . . , x_n)^⊤ ∈ R^n, and where we consider the
sub-gradient

    s = s(β) =  +1           if β > 0,
                −1           if β < 0,
                ∈ [−1, 1]    otherwise.

Henceforth, we receive the score equation for β ≠ 0

    n^{−1}⟨Y, x⟩ − β − λs = n^{−1}⟨Y, x⟩ − β − sign(β)λ = 0.

This score equation has a proper solution β̂ > 0 if n^{−1}⟨Y, x⟩ > λ, and it has a
proper solution β̂ < 0 if n^{−1}⟨Y, x⟩ < −λ. In any other case we have the boundary
solution β̂ = 0 for our maximization problem (6.12).
   This solution can be written in terms of the following soft-thresholding
operator for λ ≥ 0

    β̂^LASSO = S_λ( n^{−1}⟨Y, x⟩ )    with    S_λ(x) = sign(x)(|x| − λ)_+.        (6.13)
This soft-thresholding operator is illustrated in Fig. 6.5 for λ = 4.


This approach can be generalized to multiple n feature components
n x ∈ Rq .
We standardize the observations and features i=1 Yi = 0, i=1 xi,j = 0 and
6.2 Regularization 219

Fig. 6.5 Soft-thresholding soft−thresholding operator


operator x → Sλ (x) for

20
λ = 4 (red dotted lines)

10
0
−10
−20

soft−thresholding

−20 −10 0 10 20


n−1 ni=1 xi,j
2 = 1 for all 1 ≤ j ≤ q. This allows us again to drop the intercept

term and to directly consider


⎛ ⎞2
1 
n 
q

β
LASSO
=
β
LASSO
(λ) = arg max − ⎝Yi − βj xi,j ⎠ − λβ1 .
β∈R q 2n
i=1 j =1

Since this is a concave (quadratic) maximization problem with a separable (convex)
penalty term, we can apply a cyclic coordinate descent method that iterates the
coordinate-wise maximizations until convergence. Thus, if we want to maximize
in the t-th iteration the j-th coordinate of the regression parameter we consider
recursively

    β_j^{(t)} = arg max_{β_j∈R} −(1/2n) Σ_{i=1}^n ( Y_i − Σ_{l=1}^{j−1} β_l^{(t)} x_{i,l} − Σ_{l=j+1}^q β_l^{(t−1)} x_{i,l} − β_j x_{i,j} )² − λ|β_j|.

Using the soft-thresholding operator (6.13) we find the optimal solution

    β̂_j^{(t)} = S_λ( n^{−1} ⟨ Y − Σ_{l=1}^{j−1} β_l^{(t)} x_l − Σ_{l=j+1}^q β_l^{(t−1)} x_l , x_j ⟩ ),

with vectors x_l = (x_{1,l}, . . . , x_{n,l})^⊤ ∈ R^n for 1 ≤ l ≤ q. Iteration until convergence
provides the LASSO regularized estimator β̂^LASSO(λ) for the given regularization
parameter λ > 0.

   Typically, we want to explore β̂^LASSO(λ) for multiple λ's. For this, one runs
a pathwise cyclic coordinate descent method. We start with a large value for λ,
namely, we define

    λ_max = max_{1≤j≤q} | n^{−1}⟨Y, x_j⟩ |.

For λ ≥ λ_max, we have β̂^LASSO(λ) = 0, i.e., we have the null model. Pathwise cyclic
coordinate descent starts with this solution for λ_0 = λ_max. In a next step, one slightly
decreases λ_0 and runs the cyclic coordinate descent algorithm until convergence for
this slightly smaller λ_1 < λ_0, and with starting value β̂^LASSO(λ_0). This is then
iterated for λ_{t+1} < λ_t, t ≥ 0, which provides a sequence of LASSO regularized
estimators β̂^LASSO(λ_t) along the path (λ_t)_{t≥0}.
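
A minimal R sketch of this pathwise cyclic coordinate descent for the standardized Gaussian case (6.12)–(6.13); the λ-grid, the stopping tolerance and all names are our illustrative choices.

    soft <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)   # operator (6.13)

    lasso.path <- function(X, Y, n.lambda = 50) {
      n <- nrow(X); q <- ncol(X)
      lambda.max <- max(abs(crossprod(X, Y)) / n)
      lambdas <- exp(seq(log(lambda.max), log(1e-3 * lambda.max), length.out = n.lambda))
      beta <- rep(0, q)                          # null model solution at lambda.max
      path <- matrix(0, n.lambda, q)
      for (s in seq_along(lambdas)) {            # warm start at the previous solution
        repeat {
          beta.old <- beta
          for (j in 1:q) {                       # cyclic coordinate-wise maximization
            r <- Y - X[, -j, drop = FALSE] %*% beta[-j]       # partial residual
            beta[j] <- soft(as.numeric(crossprod(X[, j], r)) / n, lambdas[s])
          }
          if (max(abs(beta - beta.old)) < 1e-8) break
        }
        path[s, ] <- beta
      }
      list(lambda = lambdas, beta = path)
    }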
For further remarks we refer to Section 2.6 in Hastie et al. [184]. This concerns
statements about uniqueness for general design matrices, also in the set-up where
q > n, i.e., where we have more parameters than observations. Moreover, references
to convergence results are given in Section 2.7 of Hastie et al. [184]. This closes the
Gaussian case.

Gradient Descent Algorithm for LASSO Regularization

In Sect. 7.2.3 we will discuss gradient descent methods for network fitting. In this
section we provide preliminary considerations on gradient descent methods because
these are also useful to fit LASSO regularized parameters within GLMs (different
from Gaussian GLMs). Note that we do a sign switch in what follows, as we
aim at minimizing an objective function g.
   Choose a convex and differentiable function g : R^{q+1} → R. Assuming that
the global minimum of g is achieved, a necessary and sufficient condition for the
optimality of β^∗ ∈ R^{q+1} in this convex setting is ∇_β g(β)|_{β=β^∗} = 0. Gradient
descent algorithms find this optimal point by iterating for t ≥ 0

    β^{(t)} → β^{(t+1)} = β^{(t)} − ϱ_{t+1} ∇_β g(β^{(t)}),        (6.14)

for tempered learning rates ϱ_{t+1} > 0. This algorithm is motivated by a first order
Taylor expansion that determines the direction of the maximal local decrease of the
objective function g, supposing we are in position β, i.e.,

    g(β̃) = g(β) + ∇_β g(β)^⊤ (β̃ − β) + o(‖β̃ − β‖_2)    as ‖β̃ − β‖_2 → 0.

The gradient descent algorithm (6.14) leads to the (unconstrained) minimum of the
objective function g at convergence. A budget constraint like (6.6) leads to a convex
constraint β ∈ C ⊂ R^{q+1}. Consideration of such a convex constraint requires
that we reformulate the gradient descent algorithm (6.14).

Fig. 6.6 Projected gradient descent step: first, map β^{(t)} to the unconstrained solution
β^{(t)} − ϱ_{t+1} ∇_β g(β^{(t)}) of (6.15) and, second, project this unconstrained solution back to the
convex set C, giving β^{(t+1)}; see also Figure 5.5 in Hastie et al. [184]

   The gradient descent step (6.14) can also be found, for a given learning rate ϱ_{t+1},
by solving the following

linearized problem for g with the Euclidean square distance penalty term (ridge
regularization) penalizing too big gradient descent steps

    arg min_{β∈R^{q+1}} { g(β^{(t)}) + ∇_β g(β^{(t)})^⊤ (β − β^{(t)}) + (1/(2ϱ_{t+1})) ‖β − β^{(t)}‖_2² }.        (6.15)

The solution to this optimization problem exactly gives the gradient descent
step (6.14). This is now adapted to a constrained gradient descent update for a convex
constraint C:

    β^{(t+1)} = arg min_{β∈C} { g(β^{(t)}) + ∇_β g(β^{(t)})^⊤ (β − β^{(t)}) + (1/(2ϱ_{t+1})) ‖β − β^{(t)}‖_2² }.        (6.16)

The solution to this constrained convex optimization problem is obtained by, first,
taking an unconstrained gradient descent step β^{(t)} → β^{(t)} − ϱ_{t+1} ∇_β g(β^{(t)}), and,
second, if this step is not within the convex set C, it is projected back to C; this is
illustrated in Fig. 6.6, and it is called a projected gradient descent step (a justification
is given in Lemma 6.6 below). Thus, the only difficulty in applying this projected
gradient descent step is to find an efficient method of projecting the unconstrained
solution (6.14)–(6.15) back to the convex constraint set C.
   Assume that the convex constraint set C is expressed by a convex function
h (not necessarily being differentiable). To solve (6.16) and to motivate the
projected gradient descent step, we use the proximal gradient method discussed in
Section 5.3.3 of Hastie et al. [184]. The proximal gradient method helps us to do
the projection in the projected gradient descent step. We introduce the generalized

projection operator, for z ∈ R^{q+1},

    prox_h(z) = arg min_{β∈R^{q+1}} { (1/2) ‖z − β‖_2² + h(β) }.        (6.17)

This generalized projection operator should be interpreted as the square minimization
problem ‖z − β‖_2²/2 on a convex set C, where the set is expressed in its dual
Lagrangian formulation described by the regularization term h(β). The following
lemma shows that the generalized projection operator solves the Lagrangian form
of (6.16).
Lemma 6.6 Assume the convex constraint C is expressed by the convex function h.
The generalized projection operator solves

    β^{(t+1)} = prox_{ϱ_{t+1} h}( β^{(t)} − ϱ_{t+1} ∇_β g(β^{(t)}) )        (6.18)
             = arg min_{β∈R^{q+1}} { g(β^{(t)}) + ∇_β g(β^{(t)})^⊤ (β − β^{(t)}) + (1/(2ϱ_{t+1})) ‖β − β^{(t)}‖_2² + h(β) }.

Proof of Lemma 6.6 It suffices to consider the following calculation

    (1/2) ‖β^{(t)} − ϱ_{t+1} ∇_β g(β^{(t)}) − β‖_2² + ϱ_{t+1} h(β)
        = (ϱ_{t+1}²/2) ‖∇_β g(β^{(t)})‖_2² − ϱ_{t+1} ⟨∇_β g(β^{(t)}), β^{(t)} − β⟩ + (1/2) ‖β^{(t)} − β‖_2² + ϱ_{t+1} h(β)
        = (ϱ_{t+1}²/2) ‖∇_β g(β^{(t)})‖_2² + ϱ_{t+1} ( ∇_β g(β^{(t)})^⊤ (β − β^{(t)}) + (1/(2ϱ_{t+1})) ‖β^{(t)} − β‖_2² + h(β) ).

This is exactly the right objective function (in the round brackets) if we ignore all
terms that are independent of β. This proves the lemma. □

   Thus, to solve the constrained optimization problem (6.16) we bring it into its dual
Lagrangian form (6.18). Then we apply the generalized projection operator to the
unconstrained solution to find the constrained solution, see Lemma 6.6. This approach
will be successful if we can explicitly compute the generalized projection operator
prox_h(·).

Lemma 6.7 The generalized projection operator (6.17) satisfies for the LASSO
constraint h(β) = λ‖β_−‖_1

    prox_h(z) = S_λ^LASSO(z) := ( z_0, sign(z_1)(|z_1| − λ)_+, . . . , sign(z_q)(|z_q| − λ)_+ )^⊤,

for z ∈ R^{q+1}.

Proof of Lemma 6.7 We need to solve for the function β → h(β) = λ‖β_−‖_1

    prox_{λ‖(·)_−‖_1}(z) = arg min_{β∈R^{q+1}} { (1/2) ‖z − β‖_2² + λ‖β_−‖_1 }
                        = arg min_{β∈R^{q+1}} { (1/2) Σ_{j=0}^q (z_j − β_j)² + λ Σ_{j=1}^q |β_j| }.

This decouples into q + 1 independent optimization problems. The first one is
solved by β̂_0 = z_0 and the remaining ones are solved by the soft-thresholding
operator (6.13). This finishes the proof. □

   We conclude that the constrained optimization problem (6.16) for the (convex)
LASSO constraint C = {β; ‖β_−‖_1 ≤ c} is brought into its dual Lagrangian
form (6.18) of Lemma 6.6 with h(β) = λ‖β_−‖_1 for suitable λ = λ(c). The LASSO
regularized parameter estimation is then solved by first performing an unconstrained
gradient descent step β^{(t)} → β^{(t)} − ϱ_{t+1} ∇_β g(β^{(t)}), and this updated parameter is
projected back to C using the generalized projection operator of Lemma 6.7 with
h(β) = ϱ_{t+1} λ‖β_−‖_1.

Proximal gradient descent algorithm for LASSO

1. Make the gradient descent step for a suitable learning rate ϱ_{t+1} > 0

       β^{(t)} → β̃^{(t+1)} = β^{(t)} − ϱ_{t+1} ∇_β g(β^{(t)}).

2. Perform soft-thresholding of the gradient descent solution

       β̃^{(t+1)} → β^{(t+1)} = S^LASSO_{ϱ_{t+1} λ}( β̃^{(t+1)} ),

   where the latter soft-thresholding function is defined in Lemma 6.7.
3. Iterate these two steps until a stopping criterion is met.

If the gradient ∇_β g(·) is Lipschitz continuous with Lipschitz constant L > 0, the
proximal gradient descent algorithm converges at rate O(1/t) for a fixed step
size 0 < ϱ = ϱ_{t+1} ≤ 1/L, see Section 4.2 in Parikh–Boyd [292].
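
For illustration, a minimal R sketch of this proximal gradient descent algorithm, choosing g(β) = −ℓ_Y(β)/n for a Poisson GLM with log-link; the generalized projection operator of Lemma 6.7 leaves the intercept (first component) untouched. The fixed learning rate and all names are our illustrative choices.

    prox.lasso <- function(z, lambda) {              # operator of Lemma 6.7
      c(z[1], sign(z[-1]) * pmax(abs(z[-1]) - lambda, 0))
    }

    lasso.prox <- function(X, Y, lambda, rho = 0.1, iter = 5000) {
      beta <- c(log(mean(Y)), rep(0, ncol(X) - 1))   # start in the null model
      for (t in 1:iter) {
        mu   <- as.vector(exp(X %*% beta))
        grad <- as.vector(crossprod(X, mu - Y)) / nrow(X)    # gradient of g
        beta <- prox.lasso(beta - rho * grad, rho * lambda)  # gradient + soft-thresholding step
      }
      beta
    }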
Example 6.8 (LASSO Regression) We revisit Example 6.5 which considers claim
size modeling using model Gamma GLM1. In order to apply the proximal gradient
descent algorithm for LASSO regularization we need to calculate the gradient of
the negative log-likelihood. In the gamma case with log-link, it is given by, see
Example 5.5,

    −∇_β ℓ_Y(β) = −X^⊤ W(β) R(Y, β)
                = −X^⊤ diag( n_1/ϕ, . . . , n_m/ϕ ) ( Y_1/μ_1 − 1, . . . , Y_m/μ_m − 1 )^⊤,

where m ∈ N is the number of policies with claims, and μ_i = μ_i(β) = exp⟨β, x_i⟩.

Fig. 6.7 LASSO regularized MLEs in model Gamma GLM1: (lhs) in-sample losses as a function
of the regularization parameter λ > 0, (rhs) resulting β̂_j^LASSO(λ) for 1 ≤ j ≤ q


We set ϕ = 1 as this constant can be integrated into the learning rates ϱ_{t+1}.
We have implemented the proximal gradient descent algorithm ourselves using
an equidistant grid for the regularization parameter λ > 0, a fixed learning rate
ϱ_{t+1} = 0.05 and normalized features. Since this has been done rather brute force,
the results presented in Fig. 6.7 look a bit wiggly. These results should be compared
to Fig. 6.3. We see that, in contrast to ridge regularization, less important regression
parameters are shrunk exactly to zero in LASSO regularization. We give the order
in which the parameters are shrunk to zero: β1 (OwnerAge), β4 (RiskClass),
β6 (VehAge2), β8 (BonusClass), β7 (GenderMale), β2 (OwnerAge2), β3
(AreaGLM) and β5 (VehAge). In view of Listing 5.11 this order seems a bit
surprising. The reason for this surprising order is that we have grouped features
here, and, obviously, these should be considered jointly. In particular, we first drop
OwnerAge because this can also be partially explained by OwnerAge2 , therefore,
we should not treat these two variables individually. Having this weakness in mind
supports the conclusions drawn from the Wald tests in Listing 5.11, and we come
back to this in Example 6.10, below.


Oracle Property

An interesting question is whether the chosen regularization fulfills the so-called
oracle property. For simplicity, we assume to work in the normalized Gaussian
case that allows us to exclude the intercept β_0, see (6.11). Thus, we work with a
regression parameter β ∈ R^q. Assume that there is a true data model that can be
described by the (true) regression parameter β^∗ ∈ R^q. Denote by A^∗ = {j ∈
{1, . . . , q}; β_j^∗ ≠ 0} the set of feature components of x ∈ R^q that determine the
true regression function, and we assume |A^∗| < q. Denote by β̂_n(λ) the parameter
estimate that has been received by the regularized MAP estimation for a given
regularization parameter λ ≥ 0 and based on i.i.d. data of sample size n. We say
that (β̂_n(λ_n))_{n∈N} fulfills the oracle property if there exists a sequence (λ_n)_{n∈N} of
regularization parameters λ_n ≥ 0 such that

    lim_{n→∞} P[ Â_n = A^∗ ] = 1,        (6.19)

    √n ( β̂_{n,A^∗}(λ_n) − β^∗_{A^∗} ) ⇒ N( 0, I^{−1}_{A^∗} )    as n → ∞,        (6.20)

where Â_n = {j ∈ {1, . . . , q}; (β̂_n(λ_n))_j ≠ 0}, β_A only considers the components
in A ⊂ {1, . . . , q}, and I_{A^∗} is Fisher's information matrix on the true feature
components. The first oracle property (6.19) tells us that asymptotically we choose
the right feature components, and the second oracle property (6.20) tells us that
we have asymptotic normality and, in particular, consistency on the right feature
components.
   Zou [408] states that LASSO regularization, in general, does not satisfy the
oracle property. LASSO regularization can perform variable selection, however, as
Zou [408] argues, there are situations where consistency is violated and, therefore,
the oracle property cannot hold in general. Zou [408] therefore proposes an
adaptive LASSO regularization method. Alternatively, Fan–Li [124] introduced the
smoothly clipped absolute deviation (SCAD) regularization which is a non-convex
regularization that possesses the oracle property. SCAD regularization of β is
obtained by penalizing

    J_λ(β) = Σ_{j=1}^q [ λ|β_j| 1_{{|β_j|≤λ}} − ((|β_j|² − 2aλ|β_j| + λ²)/(2(a − 1))) 1_{{λ<|β_j|≤aλ}} + ((a + 1)λ²/2) 1_{{|β_j|>aλ}} ],

for a hyperparameter a > 2. This function is continuous and differentiable except
in β_j = 0, with partial derivatives for β > 0

    λ ( 1_{{β≤λ}} + ((aλ − β)_+ / (λ(a − 1))) 1_{{β>λ}} ).
Fig. 6.8 (lhs) LASSO soft-thresholding operator x → S_λ(x) for λ = 4 (red dotted lines), (rhs)
SCAD thresholding operator x → S_λ^SCAD(x) for λ = 4 and a = 3

Thus, we have a constant LASSO-like slope λ > 0 for 0 < β ≤ λ, shrinking some
components exactly to zero. For β > aλ the slope is 0, removing regularization, and
it is concatenated between the two scenarios. The thresholding operator for SCAD
regularization is given by, see Fan–Li [124],

    S_λ^SCAD(x) =  sign(x)(|x| − λ)_+                     for |x| ≤ 2λ,
                   ((a − 1)x − sign(x)aλ) / (a − 2)       for 2λ < |x| ≤ aλ,
                   x                                      for |x| > aλ.
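
In R, this SCAD thresholding operator can be coded in a few lines (vectorized in x); Fan–Li [124] suggest a = 3.7 as a default choice for the hyperparameter.

    scad.threshold <- function(x, lambda, a = 3.7) {
      ifelse(abs(x) <= 2 * lambda,
             sign(x) * pmax(abs(x) - lambda, 0),                     # soft-thresholding zone
             ifelse(abs(x) <= a * lambda,
                    ((a - 1) * x - sign(x) * a * lambda) / (a - 2),  # concatenation zone
                    x))                                              # unregularized zone
    }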

Figure 6.8 compares the two thresholding operators of LASSO and SCAD.
Alternatively, we propose to do variable selection with LASSO regularization in
a first step. Since the resulting LASSO regularized estimator may not be consistent,
one should explore a second regression step where one uses an un-penalized
regression model on the LASSO selected components; we also refer to Lee et al.
[237].

6.2.5 Group LASSO Regularization

In Example 6.8 we have seen that if there are natural groups within the feature
components they should be treated simultaneously. Assume we have a group

structure x = (x_0, x_1, . . . , x_K)^⊤ with groups x_k ∈ R^{q_k} that should be treated
simultaneously. This motivates the grouped penalties proposed by Yuan–Lin [398],
see (6.5),

    β̂^group = β̂^group(λ) = arg max_{β=(β_0,β_1,...,β_K)} (1/n) ℓ_Y(β) − λ Σ_{k=1}^K ‖β_k‖_2,        (6.21)

where we assume a group structure in the linear predictor providing

    x → η(x) = ⟨β, x⟩ = β_0 + Σ_{k=1}^K ⟨β_k, x_k⟩.

LASSO regularization is a special case of this grouped regularization, namely, if
all groups 1 ≤ k ≤ K only contain one single component, i.e., K = q, we have
β̂^group = β̂^LASSO.
   The side constraint in (6.21) is convex, and the optimization problem (6.21)
can again be solved by the proximal gradient descent algorithm. That is, in view
of Lemma 6.6, the only difficulty is the calculation of the generalized projection
operator for the regularization term h(β) = λ Σ_{k=1}^K ‖β_k‖_2. We therefore need to
solve for z = (z_0, z_1, . . . , z_K)^⊤, z_k ∈ R^{q_k},

    prox_h(z) = arg min_{β=(β_0,β_1,...,β_K)} { (1/2) ‖z − β‖_2² + λ Σ_{k=1}^K ‖β_k‖_2 }
              = ( z_0, ( arg min_{β_k∈R^{q_k}} (1/2) ‖z_k − β_k‖_2² + λ‖β_k‖_2 )_{1≤k≤K} ).

The latter highlights that the problem decouples into K independent problems. Thus,
we need to solve for all 1 ≤ k ≤ K the optimization problems

    arg min_{β_k∈R^{q_k}} (1/2) ‖z_k − β_k‖_2² + λ‖β_k‖_2.

Lemma 6.9 The group LASSO generalized soft-thresholding operator satisfies
for z_k ∈ R^{q_k}

    S_λ^{q_k}(z_k) = arg min_{β_k∈R^{q_k}} { (1/2) ‖z_k − β_k‖_2² + λ‖β_k‖_2 } = z_k ( 1 − λ/‖z_k‖_2 )_+ ∈ R^{q_k},

and for the generalized projection operator for h(β) = λ Σ_{k=1}^K ‖β_k‖_2 we have

    prox_h(z) = S_λ^group(z) := ( z_0, S_λ^{q_1}(z_1), . . . , S_λ^{q_K}(z_K) )^⊤,

for z = (z_0, z_1, . . . , z_K)^⊤ with z_k ∈ R^{q_k}.

Proof We prove this lemma. In a first step we have

    arg min_{β_k∈R^{q_k}} { (1/2) ‖z_k − β_k‖_2² + λ‖β_k‖_2 }
        = arg min_{β_k = ϱ z_k/‖z_k‖_2, ϱ≥0} { (1/2) ‖z_k‖_2² (1 − ϱ/‖z_k‖_2)² + λϱ };

this follows because the square distance ‖z_k − β_k‖_2² = ‖z_k‖_2² − 2⟨z_k, β_k⟩ + ‖β_k‖_2²
is minimized if z_k and β_k point into the same direction. Thus, there remains the
minimization of the objective function in ϱ ≥ 0. The first derivative is given by

    ∂/∂ϱ [ (1/2) ‖z_k‖_2² (1 − ϱ/‖z_k‖_2)² + λϱ ] = −‖z_k‖_2 (1 − ϱ/‖z_k‖_2) + λ = λ − ‖z_k‖_2 + ϱ.

If ‖z_k‖_2 > λ we have ϱ̂ = ‖z_k‖_2 − λ > 0, and otherwise we need to set ϱ̂ = 0. This
implies

    S_λ^{q_k}(z_k) = (‖z_k‖_2 − λ)_+ z_k/‖z_k‖_2.

This completes the proof. □



Fig. 6.9 Group LASSO regularized MLEs in model Gamma GLM1: (lhs) in-sample losses as a
function of the regularization parameter λ > 0, (rhs) resulting β̂_j^group(λ) for 1 ≤ j ≤ q

Proximal gradient descent algorithm for group LASSO

1. Make the gradient descent step for a suitable learning rate ϱ_{t+1} > 0

       β^{(t)} → β̃^{(t+1)} = β^{(t)} − ϱ_{t+1} ∇_β g(β^{(t)}).

2. Perform soft-thresholding of the gradient descent solution

       β̃^{(t+1)} → β^{(t+1)} = S^group_{ϱ_{t+1} λ}( β̃^{(t+1)} ),

   where the latter soft-thresholding function is defined in Lemma 6.9.
3. Iterate these two steps until a stopping criterion is met.
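
A minimal R sketch of the group soft-thresholding step 2, using the generalized projection operator of Lemma 6.9; 'groups' is a list of index vectors collecting the components of each β_k (the intercept, component 1, is left untouched), and all names are ours.

    prox.group <- function(z, lambda, groups) {
      for (idx in groups) {                    # loop over the groups k = 1,...,K
        nrm <- sqrt(sum(z[idx]^2))             # ||z_k||_2
        z[idx] <- if (nrm > lambda) z[idx] * (1 - lambda / nrm) else 0  # z_k (1 - lambda/||z_k||_2)_+
      }
      z
    }
    # one full iteration, given a gradient function grad.g of the objective g:
    #   beta <- prox.group(beta - rho * grad.g(beta), rho * lambda, groups)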

Example 6.10 (Group LASSO Regression) We revisit Example 6.8 which considers
claim size modeling using model Gamma GLM1. This time we group the variables
OwnerAge and OwnerAge2 (β1 , β2 ) as well as VehAge and VehAge2 (β5 , β6 ).
The results are shown in Fig. 6.9.
The order in which the parameters are regularized to zero is: β4 (RiskClass),
β8 (BonusClass), β7 (GenderMale), (β1 , β2 ) (OwnerAge, OwnerAge2), β3
(AreaGLM) and (β5 , β6 ) (VehAge, VehAge2). This order now reflects more the
variable importance as received from the Wald statistics of Listing 5.11, and it
shows that grouped features should be regularized jointly in order to determine their
importance. 

6.3 Expectation-Maximization Algorithm


6.3.1 Mixture Distributions

In many applied problems there does not exist a simple off-the-shelf distribution
that is suitable to model the whole range of observations. We think of claim size
modeling which may range from small to very large claims; the main body of the
data may look, say, gamma distributed, while the tail of the data may be regularly
varying. Another related problem is that claims may come from different insurance
policy modules. For instance, in property insurance, one can insure water damage,
fire, glass and theft claims on the same insurance policy, and feature information
about the claim type may not always be available. In such cases, it looks attractive
to choose a mixture or a composition of different distributions. In this section we
focus on mixtures.
Choose a fixed integer K bigger than 1 and define the (K − 1)-unit simplex
excluding the edges by

    Δ_K = { p ∈ (0, 1)^K ; Σ_{k=1}^K p_k = 1 }.        (6.22)

Δ_K defines the family of categorical distributions with K levels (all levels having
a strictly positive probability). These distributions belong to the vector-valued
parameter EF which we have met in Sects. 2.1.4 and 5.7.
   The idea behind mixture distributions is to mix K different distributions with a
mixture probability p ∈ Δ_K. For instance, we can mix K different EDF densities
f_k by considering

    Y ∼ Σ_{k=1}^K p_k f_k(y; θ_k, v/ϕ_k) = Σ_{k=1}^K p_k exp{ (yθ_k − κ_k(θ_k)) / (ϕ_k/v) + a_k(y; v/ϕ_k) },        (6.23)

with cumulant functions θ_k ∈ Θ_k → κ_k(θ_k), exposure v > 0 and dispersion
parameters ϕ_k > 0, for 1 ≤ k ≤ K.
   At first sight, this does not look very spectacular and parameter estimation
seems straightforward. If we consider the log-likelihood of n independent random
variables Y = (Y_1, . . . , Y_n)^⊤ following the mixture density (6.23) we receive the
log-likelihood function

    (θ, p) → ℓ_Y(θ, p) = Σ_{i=1}^n ℓ_{Y_i}(θ, p) = Σ_{i=1}^n log( Σ_{k=1}^K p_k f_k(Y_i; θ_k, v_i/ϕ_k) ),        (6.24)

for canonical parameter θ = (θ_1, . . . , θ_K)^⊤ ∈ Θ = Θ_1 × · · · × Θ_K and mixture
probability p ∈ Δ_K. Unfortunately, MLE of (θ, p) in (6.24) is not that simple.
Note, the summation over 1 ≤ k ≤ K is inside of the logarithmic function, and
the use of the Newton–Raphson algorithm may be cumbersome. The Expectation-
Maximization (EM) algorithm presented in Sect. 6.3.3, below, makes parameter
estimation feasible. In a nutshell, the EM algorithm leads to a sequence of parameter
estimates for (θ , p) that monotonically increases the log-likelihood in each iteration
of the algorithm. Thus, we can receive an approximation to the MLE of (θ , p).
Nevertheless, model fitting may still be difficult for the following reasons. Firstly,
the log-likelihood function of a mixture distribution does not need to be bounded,
we highlight this in Example 6.13, below. In that case, MLE is not a well-defined
problem. Secondly, even in very simple situations, the log-likelihood function (6.24)
can have multiple local maximums. This usually happens if the data is clustered
and the clusters are well separated. In that case of multiple local maximums,
convergence of the EM algorithm does not guarantee that we have found the global
maximum. Thirdly, convergence of the log-likelihood function through the EM
algorithm does not guarantee that also the sequence of parameter estimates of (θ , p)
converges. The latter needs additional examination and regularity conditions.
Figure 6.10 (lhs) shows a density of a mixture distribution mixing K = 3 gamma
densities with shape parameters αk = 1, 20, 40 (orange, green and blue) and mixture
probability p = (0.7, 0.1, 0.2); the mixture components are already multiplied
with p. The resulting mixture density in red color is continuous. Figure 6.10 (rhs)
replaces the blue gamma component of the plot on the left-hand side by a Pareto
component (in blue). As a result we observe that the resulting mixture density in
red is no longer continuous. This example is often used in practice, however, the
discontinuity may be a serious issue in applications and one may use a Lomax
(Pareto Type II) component instead, we refer to Sect. 2.2.5.

Fig. 6.10 (lhs) Mixture distribution mixing three gamma densities, and (rhs) mixture distribution
mixing two gamma components and a Pareto component, with mixture probabilities p =
(0.7, 0.1, 0.2)^⊤ for the orange, green and blue components (the density components are already
multiplied with p)

6.3.2 Incomplete and Complete Log-Likelihoods

A mixture distribution can be defined (brute force) by just defining a mixture


density as in (6.23). Alternatively, we could define a mixture distribution in a more
constructive way. In the following we discuss this constructive derivation which will
allow us to efficiently fit mixture distributions to data Y . For our outline we focus
on (6.23), but all results presented below hold true in much more generality.
   Choose a categorical random variable Z with K ≥ 2 levels having probabilities
P[Z = k] = p_k > 0 for 1 ≤ k ≤ K, that is, with p ∈ Δ_K. The main idea is to
sample in a first step the level Z = k ∈ {1, . . . , K}, and in a second step
Y|_{Z=k} ∼ f_k(y; θ_k, v/ϕ_k), based on the selected level Z = k. The random tuple
(Y, Z) has joint density

    (Y, Z) ∼ f_{θ,p}(y, k) = p_k f_k(y; θ_k, v/ϕ_k),

and the marginal density of Y is exactly given by (6.23). In this interpretation we


have a hierarchical model (Y, Z). If only Y is available for parameter estimation,
then we are in the situation of incomplete information because information about
the first hierarchy Z is missing. If both Y and Z are available we say that we have
complete information.
   For the subsequent derivations we use a different coding of the categorical
random variable Z, namely, Z can be represented in the following one-hot encoding
version

    Z = (Z_1, . . . , Z_K)^⊤ = (1_{{Z=1}}, . . . , 1_{{Z=K}})^⊤;        (6.25)

these are the K corners of the (K − 1)-unit simplex Δ_K. One-hot encoding differs
from dummy coding (5.21). One-hot encoding does not lead to a full rank design
matrix because there is a redundancy, that is, we can drop one component of Z
and still have the same information. One-hot encoding Z of Z allows us to extend
the incomplete (data) log-likelihood ℓ_Y(θ, p), see (6.23)–(6.24), under complete
information (Y, Z) as follows

    ℓ_{(Y,Z)}(θ, p) = log ∏_{k=1}^K ( p_k f_k(Y; θ_k, v/ϕ_k) )^{Z_k}
                   = Σ_{k=1}^K Z_k log( p_k exp{ (Yθ_k − κ_k(θ_k))/(ϕ_k/v) + a_k(Y; v/ϕ_k) } )        (6.26)
                   = Σ_{k=1}^K Z_k [ log(p_k) + (Yθ_k − κ_k(θ_k))/(ϕ_k/v) + a_k(Y; v/ϕ_k) ].

ℓ_{(Y,Z)}(θ, p) is called the complete (data) log-likelihood. As a consequence of this last
expression we observe that under complete information (Y_i, Z_i)_{1≤i≤n}, the MLEs
of θ and p can be determined completely analogously to above. Namely, θ_k is
estimated from all observations Y_i for which Z_i belongs to level k, and the level
indicators (Z_i)_{1≤i≤n} are used to estimate the mixture probability p. Thus, the
objective function nicely decouples under complete information into independent
parts for the θ_k and p estimation. There remains the question of how to fit this model
under incomplete information Y. The next section will discuss this problem.

6.3.3 Expectation-Maximization Algorithm for Mixtures

The EM algorithm is a general purpose tool for parameter estimation under


incomplete information. The EM algorithm has been introduced within the EF by
Sundberg [348, 349]. Sundberg’s developments have been based on the vector-
valued parameter EF with statistics S(Y) ∈ R^k, see (3.17), and he solved the
estimation problem under the assumption that S(Y ) is not fully known. These results
have been generalized to MLE under incomplete data in the celebrated work of
Dempster et al. [96] and Wu [385]. The monograph of McLachlan–Krishnan [267]
gives the theory behind the EM algorithm, and it also provides a historical review
in Section 1.8. In actuarial science the EM algorithm is increasingly used to solve
various kinds of problems of incomplete data. Mixture models of Erlang distribu-
tions are considered in Lee–Lin [240], Yin–Lin [396] and Fung et al. [146, 147];
general Erlang mixtures are universal approximators to positive distributions (in the
weak convergence sense), and regularized Erlang mixtures and mixtures of experts
models are determined using the EM algorithm to receive approximations to the
true underlying model. Miljkovic–Grün [278], Parodi [295] and Fung et al. [148]
consider the EM algorithm for mixtures of general distributions, in particular,
mixtures of small and large claims distributions. Verbelen et al. [371], Blostein–
Miljkovic [40], Grün–Miljkovic [173] and Fung et al. [147] use the EM algorithm
for censored and/or truncated observations, and dispersion modeling is performed
with the EM algorithm in Tzougas–Karlis [359]. (Inhomogeneous) phase-type and
matrix Mittag–Leffler distributions are fitted with the EM algorithm in Asmussen
et al. [14], Albrecher et al. [8] and Bladt [37], and the EM algorithm is used to
fit mixture density networks (MDNs) in Delong et al. [95]. Parameter uncertainty is
investigated in O’Hagan et al. [289] using the bootstrap method. The present section
is mainly based on McLachlan–Krishnan [267].
As mentioned above, the EM algorithm is a general purpose tool for parameter
estimation under incomplete data, and we describe the variant of the EM algorithm
which is useful for our mixture distribution setup given in (6.26). We give a
justification for its functioning below. The EM algorithm is an iterative algorithm
that performs a Bayesian expectation step (E-step) to infer the latent variable Z,
given the model parameters and Y . Next, it performs a maximization step (M-step)
for MLE of the parameters given the observation Y and the estimated latent variable

Ẑ. More specifically, the E-step and the M-step look as follows.
234 6 Bayesian Methods, Regularization and Expectation-Maximization

• E-step. Calculate the posterior probability of the event that a given
  observation Y has been generated from the k-th component of the mixture
  distribution. Bayes' rule allows us to infer this posterior probability (for
  given θ and p) from (6.26)

      P_{θ,p}[Z_k = 1|Y] = p_k f_k(Y; θ_k, v/ϕ_k) / Σ_{l=1}^K p_l f_l(Y; θ_l, v/ϕ_l).

  The posterior (Bayesian) estimate for Z_k after having observed Y is given
  by

      Ẑ_k(θ, p|Y) := E_{θ,p}[Z_k|Y] = P_{θ,p}[Z_k = 1|Y]    for 1 ≤ k ≤ K.        (6.27)

  This posterior mean Ẑ(θ, p|Y) = (Ẑ_1(θ, p|Y), . . . , Ẑ_K(θ, p|Y))^⊤ ∈ Δ_K
  is used as an estimate for the (unobserved) latent variable Z; note that
  this posterior mean depends on the unknown parameters (θ, p).
• M-step. Based on Y and Ẑ the parameters θ and p are estimated with
  MLE.

   Alternation of these two steps provides the following recursive algorithm. We
assume to have independent responses (Y_i, Z_i), 1 ≤ i ≤ n, following the mixture
distribution (6.26), where, for simplicity, we assume that only the volumes v_i > 0
depend on i.

EM algorithm for mixture distributions

(0) Choose an initial parameter (θ̂^{(0)}, p̂^{(0)}) ∈ Θ × Δ_K.
(1) Repeat for t ≥ 1 until a stopping criterion is met:

    • E-step. Given the parameter (θ̂^{(t−1)}, p̂^{(t−1)}) ∈ Θ × Δ_K, estimate the latent
      variables Z_i, 1 ≤ i ≤ n, by their conditional expectations, see (6.27),

          Ẑ_i^{(t)} = Ẑ( θ̂^{(t−1)}, p̂^{(t−1)} | Y_i ) = E_{θ̂^{(t−1)}, p̂^{(t−1)}}[Z_i|Y_i] ∈ Δ_K.        (6.28)

    • M-step. Calculate the MLE (θ̂^{(t)}, p̂^{(t)}) ∈ Θ × Δ_K based on the (complete)
      observations ((Y_1, Ẑ_1^{(t)}), . . . , (Y_n, Ẑ_n^{(t)})), i.e., solve the score equations,

      see (6.26),

          ∇_θ Σ_{i=1}^n Σ_{k=1}^K Ẑ_{i,k}^{(t)} (Y_i θ_k − κ_k(θ_k)) / (ϕ_k/v_i) = 0,        (6.29)

          ∇_{p_−} Σ_{i=1}^n Σ_{k=1}^K Ẑ_{i,k}^{(t)} log(p_k) = 0,        (6.30)

      where p_− = (p_1, . . . , p_{K−1})^⊤ and setting p_K = 1 − Σ_{k=1}^{K−1} p_k ∈ (0, 1).

Remarks 6.11

• The E-step uses Bayes' rule. This motivates considering the EM algorithm in this
  Bayesian chapter; alternatively, it would also fit to the MLE chapters.
• We have formulated the M-step in (6.29)–(6.30) in a general way because the
  canonical parameter θ and the mixture probability p could be modeled by
  GLMs, and, hence, they may be feature x_i dependent. Moreover, (6.29) is
  formulated for a mixture of single-parameter EDF distributions, but, of course,
  this holds in much more generality.
• Equations (6.29)–(6.30) are the score equations received from (6.26). There is
  a subtle point here, namely, Z_k ∈ {0, 1} in (6.26) are observations, whereas
  Ẑ_{i,k}^{(t)} ∈ (0, 1) in (6.29)–(6.30) are their estimates. Thus, in the EM algorithm
  the unknown latent variables are replaced by their estimates which, in our setup,
  results in two different types of variables with disjoint ranges. This may matter
  in software implementations, for instance, a categorical GLM may ask for a
  categorical random variable Z ∈ {1, . . . , K} (of factor type), whereas Ẑ lies
  in the interior of the unit simplex Δ_K.
• For mixture distributions one can replace the latent variables Z_i by their
  conditionally expected values Ẑ_i, see (6.29)–(6.30). In general, this does not hold
  true in EM algorithm applications: in our case we benefit from the fact that Z_k
  influences the complete log-likelihood linearly, see (6.26). In the general (non-
  linear) case of an EM algorithm application, different from mixture distribution
  problems, one needs to calculate the conditional expectation of the log-likelihood
  function.

• If we calculate the scores element-wise we receive

      ∂/∂θ_k Σ_{i=1}^n (Y_i θ_k − κ_k(θ_k)) / (ϕ_k/(v_i Ẑ_{i,k}^{(t)})) = 0,

      ∂/∂p_k Σ_{i=1}^n [ Ẑ_{i,k}^{(t)} log(p_k) + Ẑ_{i,K}^{(t)} log(p_K) ] = 0,

  recall the normalization p_K = 1 − Σ_{k=1}^{K−1} p_k ∈ (0, 1).
     From the first score equation we see that we receive the classical MLE/GLM
  framework, and all tools introduced above for parameter estimation can directly
  be used. The only part that changes are the weights v_i → v_i Ẑ_{i,k}^{(t)}. In the
  homogeneous case, i.e., in the null model, we have the MLE after the t-th iteration
  of the EM algorithm

      θ̂_k^{(t)} = h_k( Σ_{i=1}^n v_i Ẑ_{i,k}^{(t)} Y_i / Σ_{i=1}^n v_i Ẑ_{i,k}^{(t)} ),

  where h_k is the canonical link that corresponds to the cumulant function κ_k.


   If we choose the null model for the mixture probabilities we receive the MLEs

      p̂_k^{(t)} = (1/n) Σ_{i=1}^n Ẑ_{i,k}^{(t)}    for 1 ≤ k ≤ K.        (6.31)

In Sect. 6.3.4, below, we will present an example that uses the null model for
the mixture probabilities p, and we present another example that uses a logistic
categorical GLM for these mixture probabilities.
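
For illustration, a minimal R sketch of this EM algorithm for a mixture of two exponential distributions (unit-shape gammas) in the null model with v_i ≡ 1: the E-step is (6.28), the M-step consists of the Ẑ-weighted means and (6.31), and the monotone increase (6.38) of the incomplete log-likelihood (6.24) serves as stopping criterion. The synthetic data and all names are ours.

    set.seed(100)
    Y <- c(rexp(700, rate = 1 / 1000), rexp(300, rate = 1 / 10000))  # synthetic claim sizes
    K <- 2
    p  <- c(0.5, 0.5)                                # initial mixture probabilities
    mu <- as.numeric(quantile(Y, c(0.4, 0.9)))       # initial component means
    ll <- -Inf
    repeat {
      f <- sapply(1:K, function(k) p[k] * dexp(Y, rate = 1 / mu[k]))  # p_k f_k(Y_i)
      ll.new <- sum(log(rowSums(f)))                 # incomplete log-likelihood (6.24)
      Z.hat <- f / rowSums(f)                        # E-step (6.28)
      mu <- colSums(Z.hat * Y) / colSums(Z.hat)      # M-step: Z-hat weighted means
      p  <- colMeans(Z.hat)                          # M-step (6.31)
      if (ll.new - ll < 1e-10) break else ll <- ll.new   # monotone increase (6.38)
    }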

Justification of the EM Algorithm So far, we have neither given any argument
why the EM algorithm is reasonable for parameter estimation nor have we said
anything about convergence. The purpose of this paragraph is to justify the above
EM algorithm. We aim at solving the incomplete log-likelihood maximization
problem, see (6.24),

    (θ̂^MLE, p̂^MLE) = arg max_{(θ,p)} ℓ_Y(θ, p) = arg max_{(θ,p)} Σ_{i=1}^n log( Σ_{k=1}^K p_k f_k(Y_i; θ_k, v_i/ϕ_k) ),

subject to existence and uniqueness. We introduce some notation. Let f(y, z; θ, p)
= exp{ℓ_{(y,z)}(θ, p)} be the joint density of (Y, Z) and let f(y; θ, p) =
exp{ℓ_y(θ, p)} be the marginal density of Y. This allows us to rewrite the incomplete
log-likelihood as follows, for any value of z,

    ℓ_Y(θ, p) = log f(Y; θ, p) = log( f(Y, z; θ, p) / f(z|Y; θ, p) );

thus, we bring in the complete log-likelihood by using Bayes' rule. Choose an
arbitrary categorical distribution π ∈ Δ_K with K levels. We have, using the previous
step,

    ℓ_Y(θ, p) = log f(Y; θ, p) = Σ_z π(z) log f(Y; θ, p)
              = Σ_z π(z) log( [f(Y, z; θ, p)/π(z)] / [f(z|Y; θ, p)/π(z)] )
              = Σ_z π(z) log( f(Y, z; θ, p)/π(z) ) + Σ_z π(z) log( π(z)/f(z|Y; θ, p) )
              = Σ_z π(z) log( f(Y, z; θ, p)/π(z) ) + D_KL( π ‖ f(·|Y; θ, p) )        (6.32)
              ≥ Σ_z π(z) log( f(Y, z; θ, p)/π(z) );

the inequality follows because the KL divergence is always non-negative, see
Lemma 2.21. This provides us with a lower bound for the incomplete log-likelihood
ℓ_Y(θ, p) for any categorical distribution π ∈ Δ_K and any (θ, p) ∈ Θ × Δ_K:

    ℓ_Y(θ, p) ≥ Σ_z π(z) log( f(Y, z; θ, p)/π(z) )        (6.33)
              = E_{Z∼π}[ ℓ_{(Y,Z)}(θ, p) | Y ] − Σ_z π(z) log(π(z)) =: Q(θ, p; π).

Thus, we have a lower bound Q(θ, p; π) on the incomplete log-likelihood ℓ_Y(θ, p).
This lower bound is based on the conditionally expected complete log-likelihood
ℓ_{(Y,Z)}(θ, p), given Y, and under an arbitrary choice π for Z. The difference between
this arbitrary π and the true conditional posterior distribution is given by the KL
divergence D_KL(π ‖ f(·|Y; θ, p)), see (6.32).

   The general idea of the EM algorithm is to make this lower bound Q(θ, p; π) as
large as possible in θ, p and π by iterating the following two alternating steps for
t ≥ 1:

    π̂^{(t)} = arg max_π Q( θ̂^{(t−1)}, p̂^{(t−1)}; π ),        (6.34)

    ( θ̂^{(t)}, p̂^{(t)} ) = arg max_{θ,p} Q( θ, p; π̂^{(t)} ).        (6.35)

The first step (6.34) can be solved explicitly and it results in the E-step. Namely,
from (6.32) we see that maximizing Q(θ̂^{(t−1)}, p̂^{(t−1)}; π) in π is equivalent to
minimizing the KL divergence D_KL(π ‖ f(·|Y; θ̂^{(t−1)}, p̂^{(t−1)})) in π because the
left-hand side of (6.32) is independent of π. Thus, we have to solve

    π̂^{(t)} = arg max_π Q( θ̂^{(t−1)}, p̂^{(t−1)}; π ) = arg min_π D_KL( π ‖ f(·|Y; θ̂^{(t−1)}, p̂^{(t−1)}) ).

This optimization is solved by choosing the density π̂^{(t)} = f(·|Y; θ̂^{(t−1)}, p̂^{(t−1)}),
see Lemma 2.21, and this gives us exactly (6.28) if we calculate the corresponding
conditional expectation of the latent variable Z. Moreover, importantly, this step
provides us with an identity in (6.33):

    ℓ_Y( θ̂^{(t−1)}, p̂^{(t−1)} ) = Q( θ̂^{(t−1)}, p̂^{(t−1)}; π̂^{(t)} ).        (6.36)

   The second step (6.35) then increases the right-hand side of (6.36). This second
step is equivalent to

    ( θ̂^{(t)}, p̂^{(t)} ) = arg max_{θ,p} Q( θ, p; π̂^{(t)} ) = arg max_{θ,p} E_{Z∼π̂^{(t)}}[ ℓ_{(Y,Z)}(θ, p) | Y ],        (6.37)

and this maximization is solved by the solution of the score equations (6.29)–(6.30)
of the M-step. In this step we explicitly use the linearity in Z of the log-likelihood
ℓ_{(Y,Z)}, which allows us to calculate the objective function in (6.37) explicitly,
resulting in replacing Z by Ẑ^{(t)}. For other incomplete data problems, where we
do not have this linearity, this step will be more complicated.
Summarizing, alternating the optimizations (6.34) and (6.35) gives us a sequence of parameters $(\widehat{\theta}^{(t)},\widehat{p}^{(t)})_{t\ge0}$ with monotonically increasing incomplete log-likelihoods

$$\cdots \le \ell_Y\left(\widehat{\theta}^{(t-1)},\widehat{p}^{(t-1)}\right) \le \ell_Y\left(\widehat{\theta}^{(t)},\widehat{p}^{(t)}\right) \le \ell_Y\left(\widehat{\theta}^{(t+1)},\widehat{p}^{(t+1)}\right) \le \cdots. \qquad (6.38)$$

Therefore, the EM algorithm converges, provided that the incomplete log-likelihood $\ell_Y(\theta,p)$ is a bounded function.

Remarks 6.12
• In general, the log-likelihood function $(\theta,p)\mapsto\ell_Y(\theta,p)$ does not need to be bounded. In that case the EM algorithm may not converge (unless it converges to a local maximum). An illustrative example is given in Example 6.13, below, which shows what can go wrong in MLE of mixture distributions.
• Even if the log-likelihood function $(\theta,p)\mapsto\ell_Y(\theta,p)$ is bounded, one may not expect a unique solution to the parameter estimation problem with the EM algorithm. Firstly, a monotonically increasing sequence (6.38) only guarantees that we have convergence of that sequence. But the sequence may not converge to the global maximum, and different starting points of the algorithm need to be explored. Secondly, convergence of the sequence (6.38) does not necessarily imply that the parameters $(\widehat{\theta}^{(t)},\widehat{p}^{(t)})$ converge for $t\to\infty$. On the one hand, we may have an identifiability issue because the components $f_k$ of the mixture distribution may be exchangeable, and, on the other hand, one needs stronger conditions to ensure that not only the log-likelihoods converge but also their arguments (parameters) $(\widehat{\theta}^{(t)},\widehat{p}^{(t)})$. This is the point studied in Wu [385].
• Even in very simple examples of mixture distributions we can have multiple local maxima. In this case the choice of the starting point plays a crucial role. It is advantageous that in the starting configuration every component $k$ shares roughly the same number of observations for the initial estimates $(\widehat{\theta}^{(0)},\widehat{p}^{(0)})$ and $\widehat{Z}^{(1)}$, otherwise one may start in a so-called spurious configuration where only a few observations almost fully determine a component $k$ of the mixture distribution. This may result in similar singularities as in Example 6.13, below. Therefore, there are three common ways to determine a starting configuration of the EM algorithm, see Miljkovic–Grün [278]: (a) Euclidean distance-based initialization: cluster centers are selected at random, and all observations are allocated to these centers according to the shortest Euclidean distance; (b) K-means clustering allocation; or (c) completely random allocation to $K$ bins. Using one of these three options, $f_k$ and $p$ are initialized; a minimal sketch of option (b) is given after this list.
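As an illustration of option (b), the following minimal R sketch (with hypothetical variable names) produces a K-means starting configuration for a claim size vector y.

  K  <- 5
  km <- kmeans(y, centers = K)                 # allocate observations to K clusters
  z0 <- km$cluster                             # initial allocation, defining Z^(1)
  p0 <- as.numeric(table(z0)) / length(y)      # initial mixture probabilities p^(0)
  # the component parameters theta^(0) are then obtained by MLE within each bin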
• We have formulated the EM algorithm in the homogeneous situation. However, we can easily expand it to GLMs by, for instance, assuming that the canonical parameters $\theta_k$ are modeled by linear predictors $\langle\beta_k,x\rangle$ and/or likewise for the mixture probabilities $p$. The E-step will not change in this setup. For the M-step, we will solve a different maximization problem; however, this maximization problem respects the monotonicity (6.38), and therefore a modified version of the above EM algorithm applies. We emphasize that the crucial point is the monotonicity (6.38), which makes the EM algorithm a valid procedure.

6.3.4 Lab: Mixture Distribution Applications

In this section we are going to present different mixture distribution examples that
use the EM algorithm for parameter estimation. On the one hand this illustrates the
functioning of the EM algorithm, and on the other hand it also highlights pitfalls
that need to be avoided.
Example 6.13 (Gaussian Mixture) We directly fit a mixture model to the observation $Y=(Y_1,\ldots,Y_n)^\top$. Assume that the log-likelihood of $Y$ is given by a mixture of two Gaussian distributions

$$\ell_Y(\theta,\sigma,p) = \sum_{i=1}^n \log\left(\sum_{k=1}^2 p_k\,\frac{1}{\sqrt{2\pi}\sigma_k}\exp\left\{-\frac{1}{2\sigma_k^2}(Y_i-\theta_k)^2\right\}\right),$$

with $p\in\Delta_2$, mean vector $\theta=(\theta_1,\theta_2)^\top\in\mathbb{R}^2$ and standard deviations $\sigma=(\sigma_1,\sigma_2)^\top\in\mathbb{R}_+^2$. Choose the estimate $\widehat{\theta}_1=Y_1$, then we have

$$\lim_{\sigma_1\to0}\frac{1}{\sqrt{2\pi}\sigma_1}\exp\left\{-\frac{1}{2\sigma_1^2}(Y_1-\widehat{\theta}_1)^2\right\} = \lim_{\sigma_1\to0}\frac{1}{\sqrt{2\pi}\sigma_1} = \infty.$$

For any $i\ne1$ we have $Y_i\ne\widehat{\theta}_1$ (note that the Gaussian distribution is absolutely continuous and the observations are distinct, a.s.). Henceforth, for $i\ne1$,

$$\lim_{\sigma_1\to0}\frac{1}{\sqrt{2\pi}\sigma_1}\exp\left\{-\frac{1}{2\sigma_1^2}(Y_i-\widehat{\theta}_1)^2\right\} = \lim_{\sigma_1\to0}\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma_1^2}(Y_i-\widehat{\theta}_1)^2-\log\sigma_1\right\} = 0.$$

If we choose any $\widehat{\theta}_2\in\mathbb{R}$, $\widehat{p}\in\Delta_2$ and $\widehat{\sigma}_2>0$, we receive for $\widehat{\theta}_1=Y_1$

$$\lim_{\sigma_1\to0}\ell_Y(\widehat{\theta},\sigma,\widehat{p}) = \lim_{\sigma_1\to0}\log\left(\sum_{k=1}^2\widehat{p}_k\,\frac{1}{\sqrt{2\pi}\sigma_k}\exp\left\{-\frac{1}{2\sigma_k^2}(Y_1-\widehat{\theta}_k)^2\right\}\right) + \sum_{i=2}^n\log\left(\frac{\widehat{p}_2}{\sqrt{2\pi}\widehat{\sigma}_2}\exp\left\{-\frac{1}{2\widehat{\sigma}_2^2}(Y_i-\widehat{\theta}_2)^2\right\}\right) = \infty.$$

Thus, we can make the log-likelihood of this mixture Gaussian model arbitrarily large by fitting a degenerate Gaussian model to one observation in one mixture component, and letting the remaining observations be described by the other mixture component. This shows that the MLE problem may not be well-posed for mixture distributions because the log-likelihood can be unbounded. If the data has well-separated clusters, the log-likelihood of a mixture Gaussian distribution will have multiple local maxima. One can construct for any given number $B\in\mathbb{N}$ a data set $Y$ such that the number of local maxima exceeds this number $B$, see Theorem 3 in Améndola et al. [11]. ■
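The degeneracy in Example 6.13 is easily verified numerically. The following minimal R sketch (on a simulated toy sample) evaluates the mixture log-likelihood with $\widehat{\theta}_1=Y_1$ and a shrinking sequence $\sigma_1\to0$, and the resulting values grow without bound.

  set.seed(100)
  y <- rnorm(100)                                # toy observations
  loglik <- function(sig1) {
    sum(log(0.5 * dnorm(y, mean = y[1], sd = sig1) +  # degenerate component at Y_1
            0.5 * dnorm(y, mean = 0, sd = 1)))        # second mixture component
  }
  sapply(c(1e-1, 1e-3, 1e-5, 1e-7), loglik)      # log-likelihood blows up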

Example 6.14 (Gamma Claim Size Modeling) In this example we consider claim size modeling of the French MTPL example given in Chap. 13.1. In view of Fig. 13.15 this seems quite difficult because we have three modes and heavy-tailedness. We choose a mixture of 5 distribution functions, namely, four gamma distributions and the Lomax distribution

$$Y \sim \sum_{k=1}^4 p_k\,\frac{\beta_k^{\alpha_k}}{\Gamma(\alpha_k)}\,y^{\alpha_k-1}\exp\{-\beta_k y\} + p_5\,\frac{\beta_5}{M}\left(\frac{y+M}{M}\right)^{-(\beta_5+1)}, \qquad (6.39)$$

with shape parameters $\alpha_k$ and scale parameters $\beta_k$, $1\le k\le4$, for the gamma densities; scale parameter $M$ and tail parameter $\beta_5$ for the Lomax density; and with mixture probability $p\in\Delta_5$. The idea behind this choice is that three gamma distributions take care of the three modes of the empirical density, see Fig. 13.15, the fourth gamma distribution models the remaining claims in the body of the distribution, and the Lomax distribution takes care of the regularly varying tail of the data. For the gamma distribution, we refer to Sect. 2.1.3, and for the Lomax distribution, we refer to Sect. 2.2.5.

We choose the null model for both the mixture probabilities $p\in\Delta_5$ and the densities $f_k$, $1\le k\le5$. This model can directly be fitted with the EM algorithm as presented above; in particular, we can estimate the mixture probabilities by (6.31). The remaining shape, scale and tail parameters are directly estimated by MLE. To initialize the EM algorithm we use the interpretation of the components as explained above. We partition the entire data into $K=5$ bins according to their claim sizes $Y_i$ being in $(0,300]$, $(300,1'000]$, $(1'000,1'200]$, $(1'200,5'000]$ or $(5'000,\infty)$. The first three intervals will initialize the three modes of the empirical density, see Fig. 13.15 (lhs). This will correspond to the categorical variable taking values $Z=1,2,3$; the fourth interval will correspond to $Z=4$ and it will model the main body of the claims; and the last interval will correspond to $Z=5$, modeling the Lomax tail of the claims. These choices provide the initialization given in Table 6.1 with upper indices $^{(0)}$. We remark that we choose a fixed threshold of $M=2'000$ for the Lomax distribution; this choice will be further discussed below.
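For illustration, the E-step of this mixture can be written down explicitly. The following minimal R sketch (with hypothetical vectors y, p, alpha, beta and tail parameter beta5 holding the current parameter estimates) computes the posterior component probabilities $\widehat{Z}$ for the mixture (6.39).

  M <- 2000
  dens  <- sapply(1:4, function(k)
    p[k] * dgamma(y, shape = alpha[k], rate = beta[k]))       # gamma components
  lomax <- p[5] * (beta5 / M) * ((y + M) / M)^(-(beta5 + 1))  # Lomax component
  Z <- cbind(dens, lomax) / rowSums(cbind(dens, lomax))       # E-step responsibilities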
Based on these choices we run the EM algorithm for mixture distributions. We observe convergence after roughly 80 iterations, and the resulting parameters after 100 iterations are presented in Table 6.1. We observe rather large shape parameters $\widehat{\alpha}_k^{(100)}$ for the first three components $k=1,2,3$. This indicates that these three components model the three modes of the empirical density, and these three modes collect almost $\widehat{p}_1^{(100)}+\widehat{p}_2^{(100)}+\widehat{p}_3^{(100)}\approx50\%$ of all claims. The remaining claims are modeled by the gamma density $k=4$ having mean 1'304 and by the Lomax distribution having tail parameter $\widehat{\beta}_5^{(100)}=1.416$; thus, this tail has finite first moment $M/(\widehat{\beta}_5^{(100)}-1)=4'812$ and infinite second moment.

Table 6.1 Parameter choices in the mixture model (6.39)

                                          k=1      k=2      k=3        k=4      k=5
  p̂_k^(0)                                0.13     0.18     0.25       0.39     0.05
  α̂_k^(0)                                2.43     11.24    1'299.44   5.63     –
  β̂_k^(0)                                0.019    0.018    1.141      0.003    0.517
  μ̂_k^(0) = α̂_k^(0)/β̂_k^(0)            125      623      1'138      1'763    –
  p̂_k^(100)                              0.04     0.03     0.42       0.25     0.26
  α̂_k^(100)                              93.05    650.94   1'040.37   1.34     –
  β̂_k^(100)                              1.207    1.108    0.888      0.001    1.416
  μ̂_k^(100) = α̂_k^(100)/β̂_k^(100)      77       588      1'172      1'304    –

Figure 6.11 shows the resulting estimated mixture distribution. It gives the individual mixture components (top-lhs), the resulting mixture density (top-rhs), the QQ plot (bottom-lhs) and the log-log plot (bottom-rhs). Overall we find a rather good fit; maybe the first mode is a bit too spiky. However, this plot may also be misleading because the empirical density plot relies on kernel smoothing with a given bandwidth. Thus, the true observations may be more spiky than the plot indicates. The third mode suggests that there are two different values in the observations around 1'100; this is also visible in the QQ plot. Nevertheless, the overall result seems satisfactory. These results (based on 13 estimated parameters) are also summarized in Table 6.2.
We mention a couple of limitations of these results. Firstly, the log-likelihood of this mixture model is unbounded: similarly to Example 6.13, we can precisely fit one degenerate gamma mixture component to an individual observation $Y_i$, which results in an infinite log-likelihood value. Thus, the found solution corresponds to a local maximum of the log-likelihood function and we should not state AIC values in Table 6.2, see also Remarks 4.28. Secondly, it is crucial to initialize three components to the three modes; if we randomly allocate all claims to 5 bins as initial configuration, the EM algorithm only finds mode $Z=3$ but not necessarily the first two modes, at least this was the case in our specifically chosen random initialization. In fact, the likelihood value of this latter solution was worse than in the first calibration, which shows that we ended up in a worse local maximum.

We may be tempted to also estimate the Lomax threshold $M$ with MLE. In Fig. 6.12 we plot the maximal log-likelihood as a function of $M$ (if we start the EM algorithm always in the same configuration given in Table 6.1). From this figure a threshold of $M=1'600$ seems optimal. Choosing this threshold of $M=1'600$ leads to a slightly bigger log-likelihood of −199'304 and a slightly smaller tail parameter of $\widehat{\beta}_5^{(100)}=1.318$. However, overall the model is very similar to the one with $M=2'000$. In general, we do not recommend estimating $M$ with MLE; it should rather be treated as a hyper-parameter selected by the modeler. The reason for this recommendation is that this threshold is crucial in deciding for large claims modeling, and its estimation from data is, typically, not very robust; we also refer to Remarks 6.15, below.

Fig. 6.11 Mixture null model: (top-lhs) individual estimated gamma components $f_k(\cdot;\widehat{\alpha}_k^{(100)},\widehat{\beta}_k^{(100)})$, $1\le k\le4$, and Lomax component $f_5(\cdot;\widehat{\beta}_5^{(100)})$, (top-rhs) estimated mixture density $\sum_{k=1}^4\widehat{p}_k^{(100)}f_k(\cdot;\widehat{\alpha}_k^{(100)},\widehat{\beta}_k^{(100)})+\widehat{p}_5^{(100)}f_5(\cdot;\widehat{\beta}_5^{(100)})$, (bottom-lhs) QQ plot of the estimated model, (bottom-rhs) log-log plot of the estimated model

Table 6.2 Mixture models for French MTPL claim size modeling

                              # Param.   ℓ_Y(θ̂, p̂)   AIC        μ̂ = E_{θ̂,p̂}[Y]
  Empirical                                                       2'266
  Null model (M = 2'000)      13         −199'306     398'637     2'381
  Logistic GLM (M = 2'000)    193        −198'404     397'193     2'176

In a next step we enhance the mixture modeling by including feature information $x_i$ to explain the responses $Y_i$. In view of Fig. 13.17, we have decided to model only the mixture probabilities $p=p(x)$ feature dependent, because feature information seems to mainly influence the heights of the peaks. We do not consider the features VehPower and VehGas because these features do not seem to contribute, and

Fig. 6.12 Choice of the Lomax threshold $M$: maximal log-likelihood as a function of the threshold $M$

we do not consider Density because of the high co-linearity with Area, see Fig. 13.12 (rhs). Thus, we are left with the features Area, VehAge, DrivAge, BonusMalus, VehBrand and Region. Pre-processing of these features is done as in Listing 5.1, except that we keep Area categorical. Using these features $x\in\mathcal{X}\subset\{1\}\times\mathbb{R}^q$ we choose a logistic categorical GLM for the mixture probabilities

$$x \mapsto (p_1(x),\ldots,p_{K-1}(x))^\top = \frac{\exp\{X\gamma\}}{1+\sum_{l=1}^4\exp\langle\gamma_l,x\rangle}, \qquad (6.40)$$

that is, we choose $K=5$ as reference level, the feature matrix $X\in\mathbb{R}^{(K-1)\times(K-1)(q+1)}$ is defined in (5.71), and the regression parameter is $\gamma=(\gamma_1^\top,\ldots,\gamma_{K-1}^\top)^\top\in\mathbb{R}^{(K-1)(q+1)}$; this regression parameter $\gamma$ should not be confused with the scale parameters $\beta_1,\ldots,\beta_4$ of the gamma components and the tail parameter $\beta_5$ of the Lomax component, see (6.39). Note that the notation in this section slightly differs from Sect. 5.7 on the logistic categorical GLM. In this section we consider mixture probabilities $p(x)\in\Delta_{K=5}$ (which corresponds to one-hot encoding), whereas in Sect. 5.7 we model $(p_1(x),\ldots,p_{K-1}(x))^\top$ with a categorical GLM (which corresponds to dummy coding), and normalization provides us with $p_K(x)=1-\sum_{l=1}^{K-1}p_l(x)\in(0,1)$.
This logistic categorical GLM requires that we replace in the M-step the probability estimation (6.31) by Fisher's scoring method for GLMs as outlined in Sect. 5.7.2, but there is a small difference to that section. In the working residuals (5.74) we use dummy coding $T(Z)\in\{0,1\}^{K-1}$ of a categorical variable $Z$; this now needs to be replaced by the estimated vector $(\widehat{Z}_1(\theta,p|Y),\ldots,\widehat{Z}_{K-1}(\theta,p|Y))^\top\in(0,1)^{K-1}$, which is used as an estimate for the latent variable $T(Z)$. Apart from that, everything is done as described in Sect. 5.7.2; in R this can be done with the procedure multinom from the package nnet [368]. We start the EM algorithm exactly in the final configuration of the estimated mixture null model, and we run this algorithm for 20 iterations (which provides convergence).

Table 6.3 Parameter choices in the mixture models: upper part null model, lower part GLM for the estimated mixture probabilities p̂(x_i)

                                                 k=1      k=2      k=3        k=4      k=5
  Null: p̂_k^(100)                               0.04     0.03     0.42       0.25     0.26
  Null: α̂_k^(100)                               93.05    650.94   1'040.37   1.34     –
  Null: β̂_k^(100)                               1.207    1.108    0.888      0.001    1.416
  Null: μ̂_k^(100) = α̂_k^(100)/β̂_k^(100)       77       588      1'172      1'304    –
  GLM: average mixture probabilities             0.04     0.03     0.42       0.25     0.26
  GLM: α̂_k^(100)                                94.03    597.20   1'043.38   1.28     –
  GLM: β̂_k^(100)                                1.223    1.019    0.891      0.001    1.365
  GLM: μ̂_k^(100) = α̂_k^(100)/β̂_k^(100)        77       586      1'172      1'268    –
The resulting parameters are given in the lower part of Table 6.3. We observe that the resulting parameters remain essentially the same, the second mode $Z=2$ is a bit less spiky, and the tail parameter is slightly smaller. The summary of this model is given on the last line of Table 6.2. Regression modeling adds another $4\cdot45=180$ parameters to the model because we have $q=45$ feature components in $x$ (different from the intercept component). In view of AIC, we give preference to the logistic mixture probability case (though AIC has to be interpreted with care, here, because we do not consider the MLE but rather a local maximum).
Figure 6.13 plots the individual estimated mixture probabilities $x_i\mapsto\widehat{p}(x_i)\in\Delta_5$ over the insurance policies $1\le i\le n$; these plots are inspired by the thesis of Frei [138]. The upper plots consider these probabilities against the estimated claim sizes $\widehat{\mu}(x_i)=\sum_{k=1}^5\widehat{p}_k(x_i)\widehat{\mu}_k$, and the lower plots against the ranks of $\widehat{\mu}(x_i)$; the latter gives a different scaling on the x-axis because of the heavy-tailedness of the claims. The plots on the left-hand side show all individual policies $1\le i\le n$, and the plots on the right-hand side show a quadratic spline fit to these observations. Not surprisingly, we observe that the claim size estimate $\widehat{\mu}(x_i)$ is mainly driven by the large claims probability $\widehat{p}_5(x_i)$ describing the Lomax contribution.
In Fig. 6.14 we compare the QQ plots of the mixture null model and the one where we model the mixture probabilities with the logistic categorical GLM. We see that the latter (more complex) model clearly outperforms the simpler one; in fact, this QQ plot looks quite convincing for the French MTPL claim size data. Finally, we perform a Wald test (5.32). We simultaneously treat all parameters that belong to the same feature variable (similar to the ANOVA analysis); for instance, for the 22 Regions the corresponding part of the regression parameter $\gamma$ contains $4\cdot21=84$ components. The resulting p-values of dropping such components are all close to 0, which says that we should not eliminate any of the feature variables. This closes the example. ■

Fig. 6.13 Mixture probabilities $x_i\mapsto\widehat{p}(x_i)$ on individual policies $1\le i\le n$: (top) against the estimated means $\widehat{\mu}(x_i)$ and (bottom) against the ranks of the estimated means $\widehat{\mu}(x_i)$; (lhs) over policies $1\le i\le n$ and (rhs) quadratic spline fit

Remarks 6.15
• In Example 6.14 we have chosen a mixture distribution with four gamma components and one Lomax component. The reason for choosing the Lomax component has been two-fold. Firstly, we need a regularly varying tail to model the heavy-tailed property of the data. Secondly, we have preferred the Lomax distribution over the Pareto distribution because this provides us with a continuous density in (6.39). The results in Example 6.14 have been satisfactory. In many practical applications, however, this approach will not work, even when fixing the threshold $M$ of the Lomax component. Often, the nature of the data is such that the chosen gamma mixture distribution is not able to fully explain the small data in the body of the distribution, and in that situation the Lomax tail will assist in fitting the small claims. The typical result is that the Lomax part then pays more attention to small claims (through the log-likelihood contribution of numerous small claims) and the fitting of the tail turns out to be poor (because a few large claims do not sufficiently contribute to the log-likelihood). There are two ways to solve this dilemma. Either one works with composite distributions, see (6.56) below, and drops the continuity property of the density; this is the approach taken in Fung et al. [148]. Or one fits the Lomax distribution solely to large observations in a first step, and then fixes the parameters of the Lomax distribution during the second step when fitting the full model to all data; this is the approach taken in Frei [138]. Both of these approaches have provided good results on real insurance data.

Fig. 6.14 QQ plots of the mixture models: (lhs) null model and (rhs) logistic categorical GLM for mixture probabilities
• There is an asymptotic theory for the optimal selection of the number of mixture components; we refer to Khalili–Chen [214] and Khalili [213]. Fung et al. [148] combine this asymptotic theory of mixture component selection with feature selection within these mixture components using LASSO and SCAD regularization.
• In Example 6.14 we have only modeled the mixture probabilities feature dependent, but not the parameters of the gamma mixture components. Introducing regressions for the gamma mixture components needs some care in fitting. For policy-independent shape parameters $\alpha_1,\ldots,\alpha_4$, we can estimate the regression functions for the means of the mixture components without explicitly specifying $\alpha_k$ because these shape parameters cancel in the score equations. However, these shape parameters will be needed in the E-step, which also requires MLE of $\alpha_k$. For more discussion on shape parameter estimation we refer to Sect. 5.3.7 (GLM with constant shape parameter) and Sect. 5.5.4 (double GLM).

6.4 Truncated and Censored Data

6.4.1 Lower-Truncation and Right-Censoring

A common problem in insurance is that we often have truncated or censored observations. Truncation naturally occurs if we sell insurance products that have a deductible $d>0$ because in that case only the insurance claim $(Y-d)_+$ is compensated, and claims below the deductible $d$ are usually not reported to the insurance company. This case is called lower-truncation, because claims below the deductible are not observed. If we lower-truncate an original claim $Y\sim f(\cdot;\theta)$ with lower-truncation point $\tau\in\mathbb{R}$ we obtain the density

$$f_{(\tau,\infty)}(y;\theta) = \frac{f(y;\theta)\mathbb{1}_{\{y>\tau\}}}{1-F(\tau;\theta)}, \qquad (6.41)$$

if $F(\cdot;\theta)$ is the distribution function corresponding to the density $f(\cdot;\theta)$. The lower-truncated density $f_{(\tau,\infty)}(y;\theta)$ only considers claims that fall into the interval $(\tau,\infty)$. Obviously, we can define upper-truncation completely analogously by considering an interval $(-\infty,\tau]$ instead. Figure 6.15 (lhs) gives an example of a lower-truncated density, and Fig. 6.15 (rhs) gives an example of a lower- and upper-truncated density.
Censoring occurs by selling insurance products with a maximal cover $M>0$ because in that case only the insurance claim $Y\wedge M=\min\{Y,M\}$ is compensated, and the exact claim size above the maximal cover $M$ may not be available. This case is called right-censoring because the exact claim amount above $M$ is not known.

lower−truncated density lower− and upper−truncated density


gamma density gamma density
lower−truncated density lower− and upper−truncated density
0.00030

0.00030
0.00020

0.00020
density

density
0.00010

0.00010
0.00000

0.00000

0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
claim size y claim size y

Fig. 6.15 (lhs) Lower-truncated gamma density with τ = 2 000, and (rhs) lower- and upper-
truncated gamma density with truncation points 2 000 and 6 000

Fig. 6.16 (lhs) Right-censored gamma distribution with $M=6'000$, and (rhs) left- and right-censored gamma distribution with censoring points 2'000 and 6'000

Right-censoring of an original claim $Y\sim F(\cdot;\theta)$ with censoring point $M\in\mathbb{R}$ gives the distribution

$$F_{Y\wedge M}(y;\theta) = F(y;\theta)\mathbb{1}_{\{y<M\}} + \mathbb{1}_{\{y\ge M\}},$$

that is, we have a point mass in the censoring point $M$. We can define left-censoring analogously by considering the claim $Y\vee M=\max\{Y,M\}$. Figure 6.16 (lhs) shows a right-censored gamma distribution with censoring point $M=6'000$, and Fig. 6.16 (rhs) shows a left- and right-censored example with censoring points 2'000 and 6'000.
Often in re-insurance, deductibles (also called retention levels) and maximal covers are combined; for instance, an excess-of-loss (XL) insurance cover of size $u>0$ above the retention level $d>0$ covers the claim

$$(Y-d)_+\wedge u = (Y-d)\mathbb{1}_{\{d\le Y<d+u\}} + u\mathbb{1}_{\{Y\ge d+u\}} = (Y-d)_+ - (Y-(d+u))_+.$$
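As a minimal illustration (on a hypothetical loss vector), the XL claim of a cover $u$ xs $d$ can be computed in R as

  xl_claim <- function(Y, d, u) pmin(pmax(Y - d, 0), u)   # (Y - d)_+ ^ u
  xl_claim(Y = c(500, 2500, 12000), d = 2000, u = 5000)   # gives 0, 500, 5000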

Obviously, truncation and censoring pose some challenges in regression modeling because at the same time we need to consider the density $f(\cdot;\theta)$ and the distribution function $F(\cdot;\theta)$ to estimate a parameter $\theta$. Both cases can be understood as missing data problems, with censoring providing the number of claims but not necessarily the exact claim sizes, and with truncation leaving also the number of claims unknown. These two cases are studied in Fung et al. [147] within the mixture of experts models using a variant of the EM algorithm. We use their techniques within the EDF framework for right-censored or lower-truncated data. This is done in the next sections.

6.4.2 Parameter Estimation Under Right-Censoring

Assume we have a fixed censoring point $M>0$ that applies to independent observations $Y_i$ following EDF densities $f(\cdot;\theta_i,v_i/\varphi)$; for simplicity we assume to work with an absolutely continuous EDF in this section. The (incomplete) log-likelihood function of the canonical parameters $\theta=(\theta_i)_{1\le i\le n}$ for the observations $Y\wedge M$ is given by

$$\ell_{Y\wedge M}(\theta) = \sum_{i:\,Y_i<M}\log f(Y_i;\theta_i,v_i/\varphi) + \sum_{i:\,Y_i\wedge M=M}\log\left(1-F(M;\theta_i,v_i/\varphi)\right). \qquad (6.42)$$

We interpret this as an incomplete data problem because the claim sizes $Y_i$ above the censoring point $M$ are not known. The complete log-likelihood is given by

$$\ell_Y(\theta) = \sum_{i=1}^n \log f(Y_i;\theta_i,v_i/\varphi).$$

Similarly to (6.32) we calculate a lower bound to the incomplete log-likelihood. We focus on one component of $Y$ and drop the lower index $i$ in $Y_i$ for this consideration. Firstly, if $Y\wedge M<M$ we are in the situation of full claim size information and, obviously, in that case $Y<M$ we have the log-likelihood

$$\ell_{Y\wedge M}(\theta) = \ell_Y(\theta) = \frac{Y\theta-\kappa(\theta)}{\varphi/v} + a(Y;v/\varphi). \qquad (6.43)$$

In the second case $Y\wedge M=M$ we do not have precise claim size information. In that case the conditional density of the claim $Y|_{\{Y\wedge M=M\}}=Y|_{\{Y\ge M\}}$ above $M$ is

$$f(z|Y\ge M;\theta,v/\varphi) = \frac{f(z;\theta,v/\varphi)\mathbb{1}_{\{z\ge M\}}}{1-F(M;\theta,v/\varphi)} = \frac{f(z;\theta,v/\varphi)\mathbb{1}_{\{z\ge M\}}}{\exp\{\ell_{Y\wedge M}(\theta)\}}, \qquad (6.44)$$

the latter follows because $Y\wedge M=M$ has the corresponding point mass in the censoring point $M$ (we work with an absolutely continuous EDF here). Choose an arbitrary density $\pi$ having the same support as $Y|_{\{Y\ge M\}}$, and consider a random variable $Z\sim\pi$. Using (6.44) and the EDF structure on the last line, we have for $Y\ge M$

$$\ell_{Y\wedge M}(\theta) = \int \pi(z)\,\ell_{Y\wedge M}(\theta)\,d\nu(z)$$
$$= \int \pi(z)\,\log\left(\frac{f(z;\theta,v/\varphi)/\pi(z)}{f(z|Y\ge M;\theta,v/\varphi)/\pi(z)}\right)d\nu(z)$$
$$= \int \pi(z)\,\log\left(\frac{f(z;\theta,v/\varphi)}{\pi(z)}\right)d\nu(z) + D_{\rm KL}\big(\pi\,\|\,f(\cdot|Y\ge M;\theta,v/\varphi)\big)$$
$$\ge \int \pi(z)\,\log\left(\frac{f(z;\theta,v/\varphi)}{\pi(z)}\right)d\nu(z)$$
$$= \frac{\mathbb{E}_\pi[Z]\,\theta-\kappa(\theta)}{\varphi/v} + \mathbb{E}_\pi\big[a(Z;v/\varphi)\big] - \mathbb{E}_\pi\big[\log\pi(Z)\big] \;\stackrel{\rm def.}{=}\; Q(\theta;\pi).$$

This allows us to explore the E-step and the M-step similarly to (6.34) and (6.35). The E-step in the case $Y\ge M$ for a given canonical parameter estimate $\widehat{\theta}^{(t-1)}$ reads as

$$\widehat{\pi}^{(t)} = \underset{\pi}{\arg\max}\; Q\left(\widehat{\theta}^{(t-1)};\pi\right) = \underset{\pi}{\arg\min}\; D_{\rm KL}\left(\pi\,\big\|\,f(\cdot|Y\ge M;\widehat{\theta}^{(t-1)},v/\varphi)\right) = f(\cdot|Y\ge M;\widehat{\theta}^{(t-1)},v/\varphi).$$

This allows us to calculate the estimate of the claim size above $M$, i.e., under $\widehat{\pi}^{(t)}$,

$$\widehat{Y}^{(t)} = \mathbb{E}_{\widehat{\pi}^{(t)}}[Z] = \int z\,f(z|Y\ge M;\widehat{\theta}^{(t-1)},v/\varphi)\,d\nu(z). \qquad (6.45)$$

Note that this is an estimate of the censored claim $Y|_{\{Y\ge M\}}$. This completes the E-step.

The M-step considers in the EDF case for censored claim sizes $Y\ge M$

$$\widehat{\theta}^{(t)} = \underset{\theta}{\arg\max}\; Q\left(\theta;\widehat{\pi}^{(t)}\right) = \underset{\theta}{\arg\max}\;\frac{\mathbb{E}_{\widehat{\pi}^{(t)}}[Z]\,\theta-\kappa(\theta)}{\varphi/v} = \underset{\theta}{\arg\max}\;\ell_{\widehat{Y}^{(t)}}(\theta), \qquad (6.46)$$

the latter using that the normalizing term $a(\cdot;v/\varphi)$ is not relevant for the MLE of $\theta$. That is, (6.46) describes the regular MLE step under the observation $\widehat{Y}^{(t)}$ in the case of a censored observation $Y\ge M$; and if $Y<M$ we simply use the log-likelihood (6.43).

EM algorithm for right-censored data within the EDF

(0) Choose an initial parameter $\widehat{\theta}^{(0)}=(\widehat{\theta}_i^{(0)})_{1\le i\le n}$.
(1) Repeat for $t\ge1$:
    • E-step. Given the parameter $\widehat{\theta}^{(t-1)}=(\widehat{\theta}_i^{(t-1)})_{1\le i\le n}$, estimate for the right-censored claims $Y_i\ge M$ their sizes by, see (6.45),

      $$\widehat{Y}_i^{(t)} = \int z\,f\left(z\,\big|\,Y_i\ge M;\widehat{\theta}_i^{(t-1)},v_i/\varphi\right)d\nu(z).$$

      This provides us with the estimated observation

      $$\widehat{Y}^{(t)} = \left(Y_i\mathbb{1}_{\{Y_i<M\}} + \widehat{Y}_i^{(t)}\mathbb{1}_{\{Y_i\ge M\}}\right)_{1\le i\le n}.$$

    • M-step. Calculate the MLE $\widehat{\theta}^{(t)}=(\widehat{\theta}_i^{(t)})_{1\le i\le n}$ based on the observation $\widehat{Y}^{(t)}$, i.e., solve

      $$\widehat{\theta}^{(t)} = \underset{\theta}{\arg\max}\;\ell_{\widehat{Y}^{(t)}}(\theta).$$

Note that the above EM algorithm uses that the log-likelihood $\ell_Y(\theta)$ of the EDF is linear in the observations that interact with the parameter $\theta$. We revisit the gamma claim size example of Sect. 5.3.7.
Example 6.16 (Right-Censored Gamma Claim Sizes) We revisit the gamma claim size GLM introduced in Sect. 5.3.7. The claim sizes are illustrated in Fig. 13.22. In total we have $n=656$ observations $Y_i$, and they range from 16 SEK to 211'254 SEK. We right-censor this data at $M=50'000$; this results in 545 uncensored observations and 111 censored observations equal to $M$. Thus, for the 17% largest claims we assume to not have any knowledge about the exact claim sizes. We use the EM algorithm for right-censored data to fit a GLM to this problem.
In order to calculate the E-step we need to evaluate the conditional expectation (6.45) under the gamma model

$$\widehat{Y}^{(t)} = \int z\,f(z|Y\ge M;\widehat{\theta}^{(t-1)},v/\varphi)\,d\nu(z) = \int_M^\infty z\,\frac{\frac{\beta^\alpha}{\Gamma(\alpha)}z^{\alpha-1}\exp\{-\beta z\}}{1-G(\alpha,\beta M)}\,dz = \frac{\alpha}{\beta}\,\frac{1-G(\alpha+1,\beta M)}{1-G(\alpha,\beta M)}, \qquad (6.47)$$

with shape parameter $\alpha=v/\varphi$, scale parameter $\beta=-\widehat{\theta}^{(t-1)}v/\varphi$, see (5.45), and scaled incomplete gamma function

$$G(\alpha,y) = \frac{1}{\Gamma(\alpha)}\int_0^y z^{\alpha-1}\exp\{-z\}\,dz \;\in(0,1) \qquad\text{for } y\in(0,\infty). \qquad (6.48)$$

Thus, we receive a simple formula that allows us to efficiently calculate the E-step, and the M-step is exactly the gamma GLM explained in Sect. 5.3.7 for the (estimated) data $\widehat{Y}^{(t)}$.
Y .
For the modeling we choose exactly the features as used for model Gamma
GLM2, this gives q + 1 = 7 regression parameter components and additionally we
set for the dispersion parameter
ϕ MLE = 1.427, this is the MLE in model Gamma
6.4 Truncated and Censored Data 253

Table 6.4 Comparison of the complete log-likelihood and the incomplete log-likelihood (right-
censoring M = 50 000) results
# Log-likelihood Dispersion Average Rel.
Param. Y ( θ MLE ,
ϕ MLE ) est.
ϕ MLE amount change
Gamma GLM2 (complete data) 7+1 
−7 129 1.427 25’130
Crude GLM2 (right-censored) 7+1 −7 158 18’068 −28%
EM est. GLM2 (right-censored) 7+1 −7 132 26’687 +6%

This dispersion parameter we keep fixed in all our models studied in this example. In a first step we simply fit a gamma GLM to the right-censored data $Y_i\wedge M$. We call this model 'crude GLM2', and it underestimates the empirical claim sizes by 28% because it ignores the fact of having right-censored data.

To initialize the EM algorithm for right-censored data we use the model crude GLM2. We then iterate the algorithm for 15 steps, which provides convergence. The results are presented in Table 6.4. We observe that the resulting log-likelihood of the model fitted on the censored data and evaluated on the complete data $Y$ (which is available here) is almost the same as for model Gamma GLM2, which has been estimated on the complete data. Moreover, this right-censored EM algorithm fitted model slightly over-estimates the average claim sizes.
Figure 6.17 shows the estimated means $\widehat{\mu}_i$ on an individual claims level. The x-axis always gives the estimates from the complete log-likelihood model Gamma GLM2. The y-axis on the left-hand side shows the estimates from the crude GLM, and on the right-hand side the estimates from the EM algorithm fitted counterpart (fitted on the right-censored data). We observe that the crude model underestimates the claims (being below the diagonal), and the largest estimate lies below $M=50'000$
in our example (horizontal dotted line). The EM algorithm fitted model, considering the fact that we have right-censored data, corrects for the censoring, and the resulting estimates resemble the ones from the complete log-likelihood model quite well. In fact, we probably slightly over-estimate under right-censoring, here. Note that all these considerations have been done under an identical dispersion parameter estimate $\widehat{\varphi}^{\rm MLE}$. For the complete log-likelihood case, this is not really needed for mean estimation because it cancels in the score equations for mean estimation. However, a reasonable dispersion parameter estimate is crucial for the incomplete case as it enters $\widehat{Y}^{(t)}$ in the E-step, see (6.47); thus, the caveat here is that we need a reasonable dispersion estimate from the right-censored data (which we did not discuss here, and which requires further research). ■

Fig. 6.17 Comparison of the estimated means $\widehat{\mu}_i$ in model Gamma GLM2 against (lhs) the crude GLM and (rhs) the EM fitted right-censored model; both axes are on the log-scale, the dotted lines show the censoring point $\log(M)$

6.4.3 Parameter Estimation Under Lower-Truncation

Compared to censoring we have less information under truncation because not only are the claim sizes below the lower-truncation point unknown, but we also do not know how many claims there are below that truncation point $\tau$. Assume we work with responses belonging to the EDF. The incomplete log-likelihood is given by

$$\ell_{Y>\tau}(\theta) = \sum_{i=1}^n \log f(Y_i;\theta_i,v_i/\varphi) - \log\left(1-F(\tau;\theta_i,v_i/\varphi)\right),$$

assuming that $Y=(Y_i)_{1\le i\le n}>\tau$ collects all claims above the truncation point $Y_i>\tau$, see (6.41). We proceed as in Fung et al. [147] to construct a complete log-likelihood; there are different ways to do so, but this proposal is convenient for parameter estimation. Firstly, we equip each observed claim $Y_i>\tau$ with an independent count random variable $K_i\sim p(\cdot;\theta_i,v_i/\varphi)$ that determines the number of claims below the truncation point that correspond to claim $i$ above the truncation point. Secondly, we assume that these claims are given by independent observations $Z_{i,1},\ldots,Z_{i,K_i}\le\tau$, a.s., with a distribution obtained from an un-truncated version of $Y_i$, i.e., we consider the upper-truncated version of $f(\cdot;\theta_i,v_i/\varphi)$ for the $Z_{i,j}$. This gives us the complete log-likelihood

$$\ell_{(Y,K,Z)}(\theta) = \sum_{i=1}^n\left[\log\left(\frac{f(Y_i;\theta_i,v_i/\varphi)}{1-F(\tau;\theta_i,v_i/\varphi)}\right) + \log p(K_i;\theta_i,v_i/\varphi) + \sum_{j=1}^{K_i}\log\left(\frac{f(Z_{i,j};\theta_i,v_i/\varphi)}{F(\tau;\theta_i,v_i/\varphi)}\right)\right], \qquad (6.49)$$
j =1

with $K=(K_i)_{1\le i\le n}$, and $Z$ collects all (latent) claims $Z_{i,j}\le\tau$; an empty sum is set equal to zero. Next, we assume that $K_i$ follows the geometric distribution

$$\mathbb{P}_{\theta_i}[K_i=k] = p(k;\theta_i,v_i/\varphi) = F(\tau;\theta_i,v_i/\varphi)^k\left(1-F(\tau;\theta_i,v_i/\varphi)\right). \qquad (6.50)$$

As emphasized in Fung et al. [147], this complete log-likelihood is an artificial construct that supports parameter estimation of lower-truncated data. It does not claim that the true un-truncated data follow this model (6.49), but it provides a distributional extension below the truncation point $\tau>0$ that is convenient for parameter estimation. Namely, inserting this geometric distribution assumption into (6.49) gives us the complete log-likelihood

$$\ell_{(Y,K,Z)}(\theta) = \sum_{i=1}^n\left(\log f(Y_i;\theta_i,v_i/\varphi) + \sum_{j=1}^{K_i}\log f(Z_{i,j};\theta_i,v_i/\varphi)\right). \qquad (6.51)$$

Within the EDF this allows us to do the same EM algorithm considerations as above; note that this expression no longer involves the distribution function. We consider one observation $Y_i>\tau$ and drop the lower index $i$. This gives us the complete observation $(Y,K,Z=(Z_j)_{1\le j\le K})$ and the conditional density

$$f(k,z|y;\theta,v/\varphi) = \frac{f(y,k,z;\theta,v/\varphi)}{f_{(\tau,\infty)}(y;\theta,v/\varphi)} = \frac{f(y,k,z;\theta,v/\varphi)}{\exp\{\ell_{Y=y>\tau}(\theta)\}},$$

where $\ell_{Y>\tau}(\theta)$ is the log-likelihood of the lower-truncated datum $Y>\tau$. Choose an arbitrary density $\pi$ modeling the random vector $(K,Z)$ below the truncation point $\tau$. This gives us for the random vector $(K,Z)\sim\pi$

$$\ell_{Y>\tau}(\theta) = \int \pi(k,z)\,\ell_{Y>\tau}(\theta)\,d\nu(k,z)$$
$$= \int \pi(k,z)\,\log\left(\frac{f(Y,k,z;\theta,v/\varphi)/\pi(k,z)}{f(k,z|Y;\theta,v/\varphi)/\pi(k,z)}\right)d\nu(k,z)$$
$$= \int \pi(k,z)\,\log\left(\frac{f(Y,k,z;\theta,v/\varphi)}{\pi(k,z)}\right)d\nu(k,z) + D_{\rm KL}\big(\pi\,\|\,f(\cdot|Y;\theta,v/\varphi)\big)$$
$$\ge \int \pi(k,z)\,\log\left(\frac{f(Y,k,z;\theta,v/\varphi)}{\pi(k,z)}\right)d\nu(k,z)$$
$$= \mathbb{E}_\pi\left[\left.\ell_{(Y,K,Z)}(\theta)\,\right|Y\right] - \mathbb{E}_\pi\big[\log\pi(K,Z)\big]$$
$$= \log f(Y;\theta,v/\varphi) + \mathbb{E}_\pi\left[\sum_{j=1}^K\log f(Z_j;\theta,v/\varphi)\right] - \mathbb{E}_\pi\big[\log\pi(K,Z)\big] \;\stackrel{\rm def.}{=}\; Q(\theta;\pi),$$

where the second-to-last identity uses that the log-likelihood (6.51) has a simple form under the geometric distribution chosen for $K$; this is exactly the step where we benefit from this specific choice of the probability extension below the truncation point. There is a subtle point here. Namely, $\ell_{Y>\tau}(\theta)$ is the log-likelihood of the lower-truncated datum $Y>\tau$, whereas $\log f(Y;\theta,v/\varphi)$ is the log-likelihood not using any lower-truncation.
The E-step for a given canonical parameter estimate $\widehat{\theta}^{(t-1)}$ reads as

$$\widehat{\pi}^{(t)} = \underset{\pi}{\arg\max}\;Q\left(\widehat{\theta}^{(t-1)};\pi\right) = \underset{\pi}{\arg\min}\;D_{\rm KL}\left(\pi\,\big\|\,f(\cdot|Y;\widehat{\theta}^{(t-1)},v/\varphi)\right)$$
$$= f\left(\cdot\,\big|\,Y;\widehat{\theta}^{(t-1)},v/\varphi\right) = p\left(\cdot\,;\widehat{\theta}^{(t-1)},v/\varphi\right)\prod_{j=1}^{\cdot}\frac{f(\cdot_j;\widehat{\theta}^{(t-1)},v/\varphi)}{F(\tau;\widehat{\theta}^{(t-1)},v/\varphi)}.$$

The latter describes a compound distribution for $\sum_{j=1}^K Z_j$ with a geometric count random variable $K$ and i.i.d. random variables $Z_1,Z_2,\ldots$ having upper-truncated densities $f_{(-\infty,\tau]}(\cdot;\widehat{\theta}^{(t-1)},v/\varphi)$. This allows us to calculate the expected compound claim below the truncation point

$$\widehat{Y}^{(t)}_{\le\tau} = \mathbb{E}_{\widehat{\pi}^{(t)}}\left[\sum_{j=1}^K Z_j\right] = \mathbb{E}_{\widehat{\pi}^{(t)}}[K]\;\mathbb{E}_{\widehat{\pi}^{(t)}}[Z_1] = \frac{F(\tau;\widehat{\theta}^{(t-1)},v/\varphi)}{1-F(\tau;\widehat{\theta}^{(t-1)},v/\varphi)}\int z\,f_{(-\infty,\tau]}(z;\widehat{\theta}^{(t-1)},v/\varphi)\,d\nu(z).$$

This completes the E-step.


The M-step considers within the EDF

$$\widehat{\theta}^{(t)} = \underset{\theta}{\arg\max}\;Q\left(\theta;\widehat{\pi}^{(t)}\right) = \underset{\theta}{\arg\max}\;\frac{\left(Y+\mathbb{E}_{\widehat{\pi}^{(t)}}\left[\sum_{j=1}^K Z_j\right]\right)\theta - \left(1+\mathbb{E}_{\widehat{\pi}^{(t)}}[K]\right)\kappa(\theta)}{\varphi/v}$$
$$= \underset{\theta}{\arg\max}\;\frac{v\left(1+\mathbb{E}_{\widehat{\pi}^{(t)}}[K]\right)}{\varphi}\left(\frac{Y+\widehat{Y}^{(t)}_{\le\tau}}{1+\mathbb{E}_{\widehat{\pi}^{(t)}}[K]}\,\theta - \kappa(\theta)\right).$$

That is, the M-step applies the classical MLE step; we only need to change the weights and the observations:

$$v \mapsto \widehat{v}^{(t)} = v\left(1+\mathbb{E}_{\widehat{\pi}^{(t)}}[K]\right) = \frac{v}{1-F(\tau;\widehat{\theta}^{(t-1)},v/\varphi)},$$
$$Y \mapsto \widehat{Y}^{(t)} = \frac{Y+\mathbb{E}_{\widehat{\pi}^{(t)}}[K]\;\mathbb{E}_{\widehat{\pi}^{(t)}}[Z_1]}{1+\mathbb{E}_{\widehat{\pi}^{(t)}}[K]} = \frac{Y+\widehat{Y}^{(t)}_{\le\tau}}{1+\mathbb{E}_{\widehat{\pi}^{(t)}}[K]}.$$

Note that this uses the specific structure of the EDF, in particular, we benefit from
linearity here which allows for closed-form solutions.

EM algorithm for lower-truncated data within the EDF

(0) Choose an initial parameter $\widehat{\theta}^{(0)}=(\widehat{\theta}_i^{(0)})_{1\le i\le n}$.
(1) Repeat for $t\ge1$:
    • E-step. Given the parameter $\widehat{\theta}^{(t-1)}=(\widehat{\theta}_i^{(t-1)})_{1\le i\le n}$, estimate the number of claims $K_i$ and the corresponding claim sizes $Z_{i,j}$ by

      $$\widehat{K}_i^{(t)} = \frac{F(\tau;\widehat{\theta}_i^{(t-1)},v_i/\varphi)}{1-F(\tau;\widehat{\theta}_i^{(t-1)},v_i/\varphi)},$$
      $$\widehat{Z}_{i,1}^{(t)} = \int z\,f_{(-\infty,\tau]}(z;\widehat{\theta}_i^{(t-1)},v_i/\varphi)\,d\nu(z). \qquad (6.52)$$

      This provides us with the estimated weights and observations, for $1\le i\le n$,

      $$\widehat{v}_i^{(t)} = v_i\left(1+\widehat{K}_i^{(t)}\right) \qquad\text{and}\qquad \widehat{Y}_i^{(t)} = \frac{Y_i+\widehat{K}_i^{(t)}\widehat{Z}_{i,1}^{(t)}}{1+\widehat{K}_i^{(t)}}.$$

    • M-step. Calculate the MLE $\widehat{\theta}^{(t)}=(\widehat{\theta}_i^{(t)})_{1\le i\le n}$ based on the observations $\widehat{Y}^{(t)}=(\widehat{Y}_i^{(t)})_{1\le i\le n}$ and weights $\widehat{v}^{(t)}=(\widehat{v}_i^{(t)})_{1\le i\le n}$, i.e., solve

      $$\widehat{\theta}^{(t)} = \underset{\theta}{\arg\max}\;\ell_{\widehat{Y}^{(t)}}(\theta;\widehat{v}^{(t)}/\varphi) = \underset{\theta}{\arg\max}\;\sum_{i=1}^n\log f(\widehat{Y}_i^{(t)};\theta_i,\widehat{v}_i^{(t)}/\varphi).$$

Remarks 6.17 Essentially, the above algorithm uses that the MLE in the EDF is based on a sufficient statistic of the observations, and in our case this sufficient statistic is $\widehat{Y}_i^{(t)}$.

Example 6.18 (Lower-Truncated Claim Sizes) We revisit the gamma claim size GLM introduced in Sect. 5.3.7, see also Example 6.16 on right-censored claims. We choose as lower-truncation point $\tau=1'000$, i.e., we get rid of the very small claims that mainly generate administrative expenses at a rather small claim compensation. We have 70 claims below this truncation point, and there remain $n=586$ claims above the truncation point that can be used for model fitting in the lower-truncated case. We use the EM algorithm for lower-truncated data to fit a GLM to this problem.

In order to calculate the E-step we need to evaluate the conditional expectation (6.52) under the gamma model, with truncation probability

$$F(\tau;\widehat{\theta}^{(t-1)},v/\varphi) = \int_0^\tau\frac{\beta^\alpha}{\Gamma(\alpha)}z^{\alpha-1}\exp\{-\beta z\}\,dz = G(\alpha,\beta\tau),$$

with shape parameter $\alpha=v/\varphi$ and scale parameter $\beta=-\widehat{\theta}^{(t-1)}v/\varphi$. In complete analogy to (6.47) we have

$$\widehat{Z}_1^{(t)} = \int z\,f_{(-\infty,\tau]}(z;\widehat{\theta}^{(t-1)},v/\varphi)\,d\nu(z) = \frac{\alpha}{\beta}\,\frac{G(\alpha+1,\beta\tau)}{G(\alpha,\beta\tau)}.$$
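A minimal R sketch of this EM algorithm for lower-truncated gamma responses is given below, assuming a hypothetical data frame dat with claims Y > tau, features x1, x2 and unit weights; the dispersion is again kept fixed, and the E-step quantities follow the two displays above.

  tau   <- 1000
  phi   <- 1.427
  alpha <- 1 / phi                       # gamma shape parameter (weights v_i = 1)

  fit <- glm(Y ~ x1 + x2, data = dat, family = Gamma(link = "log"))  # crude GLM
  for (t in 1:10) {
    mu   <- fitted(fit)                  # current mean estimates mu_i
    beta <- alpha / mu                   # gamma rate parameters
    Ftau <- pgamma(tau, shape = alpha, rate = beta)      # G(alpha, beta * tau)
    Khat <- Ftau / (1 - Ftau)            # E-step: expected claim number below tau
    Zhat <- (alpha / beta) * pgamma(tau, shape = alpha + 1, rate = beta) / Ftau
    dat$Yem <- (dat$Y + Khat * Zhat) / (1 + Khat)        # adjusted observations
    dat$vem <- 1 + Khat                                  # adjusted weights
    # M-step: weighted gamma GLM on the adjusted observations
    fit <- glm(Yem ~ x1 + x2, data = dat, family = Gamma(link = "log"),
               weights = vem)
  }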

For the modeling we choose again the features as used for model Gamma GLM2; this gives $q+1=7$ regression parameter components, and additionally we set the dispersion parameter to $\widehat{\varphi}^{\rm MLE}=1.427$. This dispersion parameter we keep fixed in all the models studied in this example. In a first step we simply fit a gamma GLM to the lower-truncated data $Y_i>\tau$. We call this model 'crude GLM2', and it overestimates the true claim sizes because it ignores the fact of having lower-truncated data.

To initialize the EM algorithm for lower-truncated data we use the model crude GLM2. We then iterate the algorithm for 10 steps, which provides convergence. The results are presented in Table 6.5. We observe that the resulting log-likelihood fitted on the lower-truncated data and evaluated on the complete data $Y$ (which is available here) is the same as for model Gamma GLM2, which has been estimated on the complete data. Moreover, this lower-truncated EM algorithm fitted model slightly under-estimates the average claim sizes.
Table 6.5 Comparison of the complete log-likelihood and the incomplete log-likelihood (lower-truncation τ = 1'000) results

                                    # Param.   Log-likelihood ℓ_Y(θ̂^MLE, φ̂^MLE)   Dispersion est. φ̂^MLE   Average amount   Rel. change
  Gamma GLM2 (complete data)        7+1        −7'129                               1.427                    25'130
  Crude GLM2 (lower-truncated)      7+1        −7'133                                                        26'879           +7%
  EM est. GLM2 (lower-truncated)    7+1        −7'129                                                        24'900           −1%

Figure 6.18 shows the estimated means $\widehat{\mu}_i$ on an individual claims level. The x-axis always gives the estimates from the complete log-likelihood model Gamma GLM2. The y-axis on the left-hand side shows the estimates from the crude GLM, and on the right-hand side the estimates from the EM algorithm fitted counterpart (fitted on the lower-truncated data). We observe that the crude model overestimates the claims (being above the orange diagonal); in particular, this applies to claims with lower expected claim amounts. The EM algorithm fitted model, considering the fact that we have lower-truncated data, corrects for the truncation, and the resulting estimates almost completely coincide with the ones from the complete log-likelihood model. Again we remark that we use an identical dispersion parameter estimate $\widehat{\varphi}^{\rm MLE}$, and it is an open problem to select a reasonable value from lower-truncated data. ■

Fig. 6.18 Comparison of the estimated means $\widehat{\mu}_i$ in model Gamma GLM2 against (lhs) the crude GLM and (rhs) the EM fitted lower-truncated model; both axes are on the log-scale

Example 6.19 (Zero-Truncated Claim Counts and the Hurdle Poisson Model) In Sect. 5.3.6 we have been studying the ZIP model that assigns an additional probability weight to the event $\{N=0\}$ of having zero claims. This model can be understood as a hierarchical model with a latent variable $Z$ indicating whether we have an excess zero claim or not, see (5.41). In that situation we have a mixture distribution of a Poisson distribution and a degenerate distribution. Fitting in Example 5.25 has been done brute force by using a general purpose optimizer, but we could also use the EM algorithm for mixture distributions.

An alternative way of modeling excess zeros is the hurdle approach, which combines a lower-truncated count distribution with a point mass in zero. For the Poisson case this reads as, see (5.42),

$$f_{\text{hurdle Poisson}}(k;\lambda,v,\pi_0) = \begin{cases}\pi_0 & \text{for } k=0,\\[4pt] (1-\pi_0)\,\dfrac{e^{-v\lambda}(v\lambda)^k/k!}{1-e^{-v\lambda}} & \text{for } k\in\mathbb{N},\end{cases} \qquad (6.53)$$

for $\pi_0\in(0,1)$ and $\lambda,v>0$. If we ignore any observation $\{N=0\}$ we obtain a lower-truncated Poisson model, also called zero-truncated Poisson (ZTP) model. This ZTP model can be fitted with the EM algorithm for lower-truncated data. In the following we only consider insurance policies $i$ with $N_i>0$. The log-likelihood of the ZTP model $N>0$ is given by (we consider one single component only and drop the lower index in the notation)

$$\theta \mapsto \ell_{N>0}(\theta) = N\theta - ve^\theta - \log(N!) + N\log(v) - \log\left(1-e^{-ve^\theta}\right), \qquad (6.54)$$

with exposure $v>0$ and canonical parameter $\theta\in\boldsymbol{\Theta}=\mathbb{R}$ such that $\lambda=\exp\{\theta\}$. The ZTP model provides for the random variable $K$ the following geometric distribution (for the number of claims below the truncation point), see (6.50),

$$\mathbb{P}_\theta[K=k] = \mathbb{P}_\theta[N=0]^k\;\mathbb{P}_\theta[N>0] = e^{-kve^\theta}\left(1-e^{-ve^\theta}\right).$$

In view of (6.51), this gives us the complete log-likelihood (note that $Z_j=0$ for all $j$)

$$\ell_{(N,K,Z)}(\theta) = N\theta - ve^\theta - \log(N!) + N\log(v) + \sum_{j=1}^K\left(Z_j\theta - ve^\theta - \log(Z_j!) + Z_j\log(v)\right)$$
$$= N\theta - (1+K)\,ve^\theta - \log(N!) + N\log(v).$$

We can now directly apply a simplified version of the EM algorithm for lower-truncated data. For the E-step we have, given the parameter $\widehat{\theta}^{(t-1)}$,

$$\widehat{K}^{(t)} = \frac{\mathbb{P}_{\widehat{\theta}^{(t-1)}}[N=0]}{1-\mathbb{P}_{\widehat{\theta}^{(t-1)}}[N=0]} = \frac{e^{-ve^{\widehat{\theta}^{(t-1)}}}}{1-e^{-ve^{\widehat{\theta}^{(t-1)}}}} \qquad\text{and}\qquad \widehat{Z}_1^{(t)} = 0.$$

This provides us with the estimated weights and observations (set $Y=N/v$)

$$\widehat{v}^{(t)} = v\left(1+\widehat{K}^{(t)}\right) = \frac{v}{1-e^{-ve^{\widehat{\theta}^{(t-1)}}}} \qquad\text{and}\qquad \widehat{Y}^{(t)} = \frac{Y}{1+\widehat{K}^{(t)}} = \frac{N}{\widehat{v}^{(t)}}. \qquad (6.55)$$
Thus, the EM algorithm iterates Poisson MLEs, and the E-step modifies the weights $\widehat{v}^{(t)}$ in each step of the loop correspondingly. We remark that the ZTP model has an EF representation which allows one to directly estimate the corresponding parameters without using the EM algorithm, see Remark 6.20, below.
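A minimal R sketch of this iteration is given below, assuming a hypothetical data frame dat1 containing only policies with N > 0, with exposures v, claim counts N and features x1, x2; the E-step only adjusts the offsets.

  dat1$vem <- dat1$v                          # initial weights v^(0) = v
  for (t in 1:75) {
    # M-step: Poisson GLM with adjusted offsets log(v^(t)), see (6.55)
    fit  <- glm(N ~ x1 + x2, offset = log(vem), data = dat1, family = poisson())
    vlam <- dat1$v * fitted(fit) / dat1$vem   # v_i * exp(<beta, x_i>)
    # E-step: new weights v^(t+1) = v / (1 - exp(-v * lambda))
    dat1$vem <- dat1$v / (1 - exp(-vlam))
  }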
We revisit the French MTPL claim frequency data, and, in particular, we use model Poisson GLM3 as a benchmark; we refer to Tables 5.5 and 5.10. The feature engineering is done exactly as in model Poisson GLM3. We then select only the insurance policies from the learning data $\mathcal{L}$ that have suffered at least one claim, i.e., $N_i>0$. These are $m=22'434$ out of $n=610'206$ insurance policies. Thus, we only consider $m/n=3.68\%$ of all insurance policies, and we fit the lower-truncated log-likelihood (ZTP model) to this data

$$\ell_{N>0}(\beta) = \sum_{i=1}^m N_i\theta_i - v_ie^{\theta_i} - \log(N_i!) + N_i\log(v_i) - \log\left(1-e^{-v_ie^{\theta_i}}\right),$$

where $1\le i\le m$ runs over all insurance policies with at least one claim and where the canonical parameter $\theta_i$ is given by the linear predictor $\theta_i=\langle\beta,x_i\rangle$. We fit this model using the EM algorithm for lower-truncated data. In each loop this requires that the offset $o_i^{(t)}=\log(\widehat{v}_i^{(t)})$ is adjusted according to (6.55); for the discussion of offsets we refer to Sect. 5.2.3. Convergence of the EM algorithm is achieved after roughly 75 iterations, see Fig. 6.19 (lhs).

Fig. 6.19 (lhs) Convergence of the EM algorithm for the lower-truncated data in the Poisson hurdle case; (rhs) canonical parameters of the Poisson GLMs fitted on all data $\mathcal{L}$ vs. fitted only on policies with $N_i>0$

Table 6.6 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses (units are in 10⁻²) and in-sample average frequency of the Poisson null model and the Poisson, negative-binomial, ZIP and hurdle Poisson GLMs

                                 Run time   # Param.   AIC        In-sample loss on L   Out-of-sample loss on T   Aver. freq.
  Poisson null                   –          1          199'506    25.213                25.445                    7.36%
  Poisson GLM3                   15 s       50         192'716    24.084                24.102                    7.36%
  NB GLM3 (α̂_NB^MLE = 1.810)    85 s       51         192'113    20.722                20.674                    7.38%
  ZIP GLM3 (null π₀)             270 s      51         192'393    –                     –                         7.37%
  Hurdle Poisson GLM3            300 s      100        191'851    –                     –                         7.39%
In our first analysis we do not consider the Poisson hurdle model, but we simply consider model Poisson GLM3. However, this Poisson model with regression parameter $\beta$ is fitted only on the data $N_i>0$ (exactly using the results of the EM algorithm for lower-truncated data $N_i>0$). The resulting predictive model is presented in Table 6.7. We observe that model Poisson GLM3 that is only fitted on the data $N_i>0$ is clearly not competitive, i.e., we cannot simply extrapolate this estimated model to $\{N_i=0\}$. This extrapolation results in a Poisson GLM that has a much too large average frequency of 15.11%, see the last column of Table 6.7; this bias can clearly be seen in Fig. 6.19 (rhs) where we compare the two fits. From this we conclude that either the Poisson model assumption in general does not match the data, or that we have excess zeros (which do not influence the estimation procedure if we only consider the policies with at least one claim).

Table 6.7 Number of parameters, in-sample and out-of-sample deviance losses on all data (units are in 10⁻²), out-of-sample lower-truncated log-likelihood ℓ_{N>0} and in-sample average frequency of the Poisson null model and model Poisson GLM3 fitted on all data L and fitted on the data N_i > 0 only

                                     # Param.   In-sample loss on L   Out-of-sample loss on T   ℓ_{N>0}    Aver. freq.
  Poisson null                       1          25.213                25.445                    –          7.36%
  Poisson GLM3 fitted on all data    50         24.084                24.102                    −0.2278    7.36%
  Poisson GLM3 fitted on N_i > 0     50         28.064                28.211                    −0.2195    15.11%

Let us compare the lower-truncated log-likelihood $\ell_{N>0}$ out-of-sample only on the policies with at least one claim (ZTP model). We observe that the EM fitted model provides a better description of the data, as we have a bigger log-likelihood than the model fitted on all data $\mathcal{L}$ (i.e., −0.2195 vs. −0.2278 for the ZTP log-likelihood). Thus, the lower-truncated fitting procedure finds a better model on $\{N_i>0\}$ when only fitted on these lower-truncated claim counts.
This analysis concludes that we need to fit the full hurdle Poisson model (6.53). That is, we cannot simply extrapolate the model fitted on the ZTP log-likelihood $\ell_{N>0}$ because, typically, $\pi_0(x_i)\ne\exp\{-v_ie^{\langle\beta,x_i\rangle}\}$, the latter coming from the Poisson GLM with regression parameter $\beta$. We model the zero claim probability $\pi_0(x_i)$ by a logistic Bernoulli GLM indicating whether we have claims or not. We set up the logistic GLM for $p(x_i)=1-\pi_0(x_i)$ describing the indicator $Y_i=\mathbb{1}_{\{N_i>0\}}$ of having claims. The difficulty compared to the Poisson model is that we cannot easily integrate the time exposure $v_i$ as a pro rata temporis variable like in the Poisson case. We therefore make the following considerations. The canonical link in the logistic Bernoulli GLM is the logit function $p\mapsto\text{logit}(p)=\log(p/(1-p))=\log(p)-\log(1-p)$ for $p\in(0,1)$. Typically, in our application, $p\ll1$ is fairly small because claims are rare events. This implies $\log(p/(1-p))\approx\log(p)$, i.e., the logit link behaves similarly to the log-link for small claim probabilities $p$. This motivates integrating the logged exposures $\log v_i$ as offsets into the logistic probabilities. That is, we make the following model assumption

$$(x_i,v_i) \mapsto \text{logit}(p(x_i,v_i)) = \log(v_i) + \langle\widetilde{\beta},x_i\rangle,$$

with offset $o_i=\log(v_i)$ and regression parameter $\widetilde{\beta}\in\mathbb{R}^{q+1}$. We fit this model using the R command glm with family=binomial(). The results then allow us to define the estimated hurdle Poisson model by, recall $\widehat{p}(x_i,v_i)=1-\widehat{\pi}_0(x_i,v_i)$,

$$\widehat{f}_{\text{hurdle Poisson}}(k;x_i,v_i) = \begin{cases} 1-\widehat{p}(x_i,v_i) = \left(1+\exp\{\log(v_i)+\langle\widehat{\widetilde{\beta}},x_i\rangle\}\right)^{-1} & \text{for } k=0,\\[6pt] \widehat{p}(x_i,v_i)\,\dfrac{e^{-\widehat{\mu}(x_i,v_i)}\,\widehat{\mu}(x_i,v_i)^k/k!}{1-e^{-\widehat{\mu}(x_i,v_i)}} & \text{for } k\in\mathbb{N},\end{cases}$$
where $\widehat{\widetilde{\beta}}\in\mathbb{R}^{q+1}$ is the regression parameter estimate from the logistic Bernoulli GLM, and where $\widehat{\mu}(x_i,v_i)=v_i\exp\langle\widehat{\beta},x_i\rangle$ is the Poisson GLM estimated with the EM algorithm on the lower-truncated data $N_i>0$ (ZTP model). The results are presented in Table 6.6.

Table 6.8 Contingency table of the observed numbers of policies against predicted numbers of policies with given claim counts ClaimNb (in-sample)

                                                Numbers of claims ClaimNb
                                                0          1         2        3     4     5
  Observed number of policies                   587'772    21'198    1'174    57    4     1
  Poisson predicted number of policies          587'325    22'064    779      34    3     0.3
  NB predicted number of policies               587'902    20'982    1'200    100   15    4
  ZIP predicted number of policies              587'829    21'094    1'191    79    9     4
  Hurdle Poisson predicted number of policies   587'772    21'119    1'233    76    6     1
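A minimal R sketch of assembling the two hurdle components is given below; it assumes a hypothetical data frame dat (all policies) with counts N, exposures v and features x1, x2, together with a design matrix X and the coefficient vector beta_ztp from the ZTP fit above.

  # logistic Bernoulli GLM for the claim indicator, with offset log(v)
  fit_bin <- glm(I(N > 0) ~ x1 + x2, offset = log(v), data = dat,
                 family = binomial())
  p_claim <- predict(fit_bin, type = "response")     # estimated p(x_i, v_i)
  mu      <- dat$v * exp(as.numeric(X %*% beta_ztp)) # ZTP means mu(x_i, v_i)
  f0      <- 1 - p_claim                             # hurdle probability of k = 0
  fk      <- function(k) p_claim * dpois(k, mu) / (1 - exp(-mu))  # for k >= 1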
Table 6.6 compares the hurdle Poisson model to the approaches studied in Table 5.10. Firstly, fitting the hurdle Poisson model is more time intensive: the EM algorithm takes some time, and we need to fit the Bernoulli logistic GLM, which is of a similar complexity as fitting model Poisson GLM3. The results in terms of AIC look convincing. The hurdle Poisson model provides an excellent model for the indicator of having a claim (here it outperforms model ZIP GLM3). It also tries to optimally fit a ZTP model to all insurance policies having at least one claim. This can also be seen from Table 6.8, which determines the expected number of policies that suffer the different numbers of claims.

We close this example by concluding that the hurdle Poisson model provides the best description, at the price of using more parameters. The ZIP model could be lifted to a similar level; however, we consider fitting the hurdle approach to be more convenient, see also Remark 6.20, below. In particular, feature engineering seems simpler in the hurdle approach because the different effects are clearly separated, whereas in the ZIP approach it is more difficult to suitably model the excess zeros, see also Listing 5.10. This closes this example. ■

Remark 6.20 In (6.54) we have been considering the ZTP model for different exposures $v>0$. If we set these exposures to $v=1$, we obtain the ZTP log-likelihood

$$\ell_{N>0}(\theta) = N\theta - \left(e^\theta + \log\left(1-e^{-e^\theta}\right)\right) - \log(N!).$$

Note that this describes a single-parameter linear EF with cumulant function

$$\kappa(\theta) = e^\theta + \log\left(1-e^{-e^\theta}\right),$$

for canonical parameter in the effective domain $\theta\in\boldsymbol{\Theta}=\mathbb{R}$. The mean of this EF model is given by

$$\mu = \mathbb{E}_\theta[N] = \kappa'(\theta) = \frac{e^\theta}{1-e^{-e^\theta}} = \frac{\lambda}{1-e^{-\lambda}},$$

where we set $\lambda=e^\theta$. The variance is given by

$$\text{Var}_\theta(N) = \kappa''(\theta) = \mu\,\frac{e^\lambda-(1+\lambda)}{e^\lambda-1} = \mu\left(1-\mu e^{-\lambda}\right) > 0.$$

Note that the term in brackets is positive but less than one. The latter implies that the ZTP model has under-dispersion. Alternatively to the EM algorithm, we can also directly fit a GLM to this ZTP model. The only difficulty is that we need to appropriately integrate the time exposures. The original Poisson model suggests that if we choose the canonical parameter to be equal to the linear predictor, we should integrate the logged exposures as offsets into the linear predictors. Along these lines, if we choose the canonical link $h=(\kappa')^{-1}$ of the ZTP model, we receive that the canonical parameter $\theta$ is equal to the linear predictor $\langle\beta,x\rangle$, and we can directly integrate the logged exposures as offsets into the canonical parameters, see (5.25). This then allows us to directly fit this ZTP model with exposures using Fisher's scoring method. In this case of a concave log-likelihood function, the result will be identical to the solution of the EM algorithm found in Example 6.19, and, in fact, this direct approach is more straightforward and more time-efficient. Similar considerations can be done for other hurdle models.
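For completeness, a minimal R sketch of such a direct ZTP fit is given below, using a general purpose optimizer instead of Fisher's scoring; it assumes a hypothetical design matrix X (including the intercept column), counts N > 0 and exposures v.

  # negative ZTP log-likelihood with offsets log(v), see (6.54)
  nll_ztp <- function(beta) {
    vlam <- v * exp(as.numeric(X %*% beta))
    -sum(N * log(vlam) - vlam - lfactorial(N) - log(1 - exp(-vlam)))
  }
  fit <- optim(rep(0, ncol(X)), nll_ztp, method = "BFGS")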

6.4.4 Composite Models

In Sect. 6.3.1 we have promoted mixing distributions in cases where the data cannot
be modeled by a single EDF distribution. Alternatively, one can also consider
composing densities, which leads to so-called composite models (also called splicing
models). This idea has been introduced to the actuarial literature by Cooray–Ananda
[81] and Scollnik [332]. Assume we have two absolutely continuous densities
f^{(i)}(·; θ_i) with corresponding distribution functions F^{(i)}(·; θ_i), i = 1, 2. These two
densities can easily be composed at a splicing value τ and with weight p ∈ (0, 1)
by considering the following composite density

    f(y; p, θ_1, θ_2) = p f^{(1)}(y; θ_1) 1_{y≤τ} / F^{(1)}(τ; θ_1) + (1 − p) f^{(2)}(y; θ_2) 1_{y>τ} / (1 − F^{(2)}(τ; θ_2)),    (6.56)

provided that both denominators are non-zero. In this notation we treat the splicing
value τ as a hyper-parameter that is chosen by the modeler, and is not estimated
from data. In view of (6.41) we can rewrite this in terms of lower- and upper-
truncated densities

    f(y; p, θ_1, θ_2) = p f^{(1)}_{(−∞,τ]}(y; θ_1) + (1 − p) f^{(2)}_{(τ,∞)}(y; θ_2).

In this notation, we see that a composite model can also be interpreted as a mixture
model with mixture probability p ∈ (0, 1) and mixing densities f^{(1)}_{(−∞,τ]} and f^{(2)}_{(τ,∞)}
having disjoint supports (−∞, τ] and (τ, ∞), respectively.
These disjoint supports allow for simpler MLE, i.e., we do not need to rely on
the ‘EM algorithm for mixture distributions’ to fit this model. The log-likelihood of
Y ∼ f(y; p, θ_1, θ_2) is given by

    ℓ_Y(p, θ_1, θ_2) = [ log(p) + log f^{(1)}_{(−∞,τ]}(Y; θ_1) ] 1_{Y≤τ}
                     + [ log(1 − p) + log f^{(2)}_{(τ,∞)}(Y; θ_2) ] 1_{Y>τ}.

This shows that the log-likelihood nicely decouples in the composite case and all
parameters can directly be estimated with MLE: parameter θ_1 uses all observations
smaller or equal to τ, parameter θ_2 uses all observations bigger than τ, and p is
estimated by the observed proportion of claims at or below the splicing point τ. This
holds for a null model as well as for a GLM approach for θ_1, θ_2 and p.
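The decoupling can be illustrated with a small R sketch; this is our own minimal example (not a listing from this book), assuming a hypothetical claim size vector y, a fixed splicing value tau, an exponential density below tau and a Pareto density above tau:

# Minimal sketch of decoupled composite-model MLE; 'y' and 'tau' are
# hypothetical, and the chosen body/tail densities are for illustration only.
tau <- 2000
y1 <- y[y <= tau]                  # observations determining theta_1
y2 <- y[y > tau]                   # observations determining theta_2
p_hat <- length(y1) / length(y)    # proportion of claims at or below tau
# upper-truncated exponential on (0, tau]: f(y; rate) / F(tau; rate)
nll1 <- function(rate) {
  -sum(dexp(y1, rate, log = TRUE) - pexp(tau, rate, log.p = TRUE))
}
rate_hat <- optimize(nll1, interval = c(1e-6, 1))$minimum
# lower-truncated Pareto on (tau, Inf): alpha * tau^alpha / y^(alpha + 1);
# here the MLE is even available in closed form
alpha_hat <- length(y2) / sum(log(y2 / tau))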
Nevertheless, the EM algorithm may still be used for parameter estimation,
namely, truncation may ask for the ‘EM algorithm for truncated data’. Alternatively,
we could also use the ‘EM algorithm for censored data’ to estimate the truncated
densities, because we have knowledge of the number of claims above and below the
splicing point τ , thus, we could right- or left-censor these claims. The latter may
lead to more stability in the estimation procedure since we use more information
in parameter estimation, i.e., the two truncated densities will not be independent
because they simultaneously consider all claim counts (but not identical claim sizes
due to censoring).
For composite models one sometimes requires more regularity of the densities;
we may, e.g., require continuity of the density at the splicing point, which provides the
mixture probability

    p = f^{(2)}(τ; θ_2) F^{(1)}(τ; θ_1) / [ f^{(1)}(τ; θ_1) (1 − F^{(2)}(τ; θ_2)) + f^{(2)}(τ; θ_2) F^{(1)}(τ; θ_1) ].

This reduces the number of parameters to be estimated but complicates the score
equations. If we additionally require a differentiability condition in τ we receive the
requirement

    p = f^{(2)}_y(τ; θ_2) F^{(1)}(τ; θ_1) / [ f^{(1)}_y(τ; θ_1) (1 − F^{(2)}(τ; θ_2)) + f^{(2)}_y(τ; θ_2) F^{(1)}(τ; θ_1) ],

where f^{(i)}_y(y; θ_i) denotes the first derivative w.r.t. y. Together with the continuity
this provides the requirement for having differentiability in τ

    f^{(2)}(τ; θ_2) / f^{(1)}(τ; θ_1) = f^{(2)}_y(τ; θ_2) / f^{(1)}_y(τ; θ_1).

Again this reduces the degrees of freedom in parameter estimation but complicates
the score equations. We refrain from giving an example and close this section; we
will consider a deep composite regression model in Sect. 11.3.2, below, where we
replace the fixed splicing point by a quantile for a fixed quantile level.

Chapter 7
Deep Learning

In the sequel, we introduce deep learning models. In this chapter these deep
learning models will be based on fully-connected feed-forward neural networks. We
present these networks as an extension of GLMs. These networks perform feature
engineering themselves. We discuss how networks achieve this, and we explain how
networks are used for predictive modeling. There is a vastly growing literature on
deep learning with networks, the classical reference is the book of Goodfellow et
al. [166], but also the numerous tutorials around the open-source deep learning
libraries TensorFlow [2], Keras [77] or PyTorch [296] give an excellent overview
of the state-of-the-art in this field.

7.1 Deep Learning and Representation Learning

In Chap. 5 on GLMs, we have been modeling the mean structure of the responses
Y , given features x, by the following regression function, see (5.6),

    x ↦ μ(x) = E_{θ(x)}[Y] = g^{−1}⟨β, x⟩.    (7.1)

The crucial assumption has been that the regression function (7.1) provides a
reasonable functional description of the expected value Eθ(x) [Y ] of datum (Y, x).
As described in Sect. 5.2.2, this typically requires manual feature engineering of x,
bringing feature information into the right structural form.
In contrast to manual feature engineering, deep learning aims at performing an
automated feature engineering within the statistical model by massaging infor-
mation through different transformations. Deep learning uses a finite sequence of
functions (z(m) )1≤m≤d , called layers,

    z^{(m)} : {1} × R^{q_{m−1}} → {1} × R^{q_m},


of (fixed) dimensions q_m ∈ N, 1 ≤ m ≤ d, and initialization q_0 = q being the
dimension of the (raw) feature information x ∈ X ⊂ {1} × R^q. Each of these
layers presents a new representation of the features, that is, after layer m we have a
q_m-dimensional representation of the raw feature x ∈ X

    z^{(m:1)}(x) := (z^{(m)} ∘ · · · ∘ z^{(1)})(x) ∈ {1} × R^{q_m}.    (7.2)

Note that the first component is always identically equal to 1. For this reason we
call the representation z^{(m:1)}(x) ∈ {1} × R^{q_m} of x to be q_m-dimensional.
Deep learning now assumes that we have d ∈ N appropriate transformations
(layers) z(m) , 1 ≤ m ≤ d, such that z(d:1) (x) provides a suitable qd -dimensional
representation of the raw feature x ∈ X , that then enters a GLM

    μ(x) = E_{θ(x)}[Y] = g^{−1}⟨β, z^{(d:1)}(x)⟩,    (7.3)

with link function g : M → R and regression parameter β ∈ Rqd +1 . This


regression architecture is called a feed-forward network of depth d ∈ N because
information x is processed in a directed acyclic (feed-forward) path through the d
layers z(1) , . . . , z(d) before entering the final GLM.
Each layer z(m) involves parameters. Successful deep learning simultaneously
fits these parameters as well as the regression parameter β to the available learning
data L so that we obtain an optimal predictive model on the test data T . That is,
the learned model should optimally generalize to unseen data, we refer to Chap. 4
on predictive modeling. Thus, the process of optimal representation learning is also
part of the model fitting procedure. In contrast to GLMs, the resulting log-likelihood
functions are non-concave in their parameters because, typically, each layer involves
non-linear transformations. This makes model fitting a challenge. State-of-the-art
model fitting in deep learning uses variants of the gradient descent algorithm which
we have already met in Sect. 6.2.4.
Remark 7.1 Representation learning x → z(d:1)(x) is closely related to Mercer’s
kernel [272]. If we have a portfolio with features x 1 , . . . , x n , we obtain a Mercer’s
kernel by considering the matrix
    K = ( K(x_i, x_j) )_{1≤i,j≤n} = ( ⟨z^{(d:1)}(x_i), z^{(d:1)}(x_j)⟩ )_{1≤i,j≤n} ∈ R^{n×n}.    (7.4)

In many regression problems it can be shown that one can equivalently work
with the design matrix Z = (z(d:1)(x 1 ), . . . , z(d:1)(x n )) ∈ Rn×(qd +1) or with

Mercer’s kernel K ∈ Rn×n . Mercer’s kernel does not require the full knowledge
of the learned representations z(d:1)(x i ), but it suffices to know the discrepancies
between z(d:1)(x i ) and z(d:1) (x j ) measured by the scalar products K(x i , x j ). This
is also closely related to the cosine similarity in word embeddings, see (10.11). This
approach then results in replacing the search for an optimal representation learning
by a search of the optimal Mercer’s kernel for the given data; this is called the kernel
trick in machine learning.

7.2 Generic Feed-Forward Neural Networks

Feed-forward neural (FN) networks use special layers z(m) in (7.2)–(7.3), whose
components are called neurons. This is discussed and studied in detail in this section.

7.2.1 Construction of Feed-Forward Neural Networks

FN networks are regression functions of type (7.3) where each neuron z^{(m)}_j, 1 ≤
j ≤ q_m, of the layers z^{(m)} = (1, z^{(m)}_1, . . . , z^{(m)}_{q_m})^⊤, 1 ≤ m ≤ d, has the structure of
a GLM; the first component z^{(m)}_0 = 1 always plays the role of the intercept and does
not need any modeling.
A first important choice is the activation function φ : R → R which plays the
role of the inverse link function g −1 . To perform non-linear representation learning,
this activation function should be non-linear, too. The most popular choices of
activation functions are listed in Table 7.1.
The first three examples in Table 7.1 are smooth functions with simple deriva-
tives, see the last column of Table 7.1. Having simple derivatives is an advantage in
gradient descent algorithms for model fitting. The derivative of the ReLU activation
function for x ≠ 0 is given by the step function activation, and in 0 one typically
considers a sub-gradient. We briefly comment on these activation functions.

Table 7.1 Popular choices of non-linear activation functions and their derivatives; the last two
examples are not strictly monotone

    Activation function                                              Derivative
    Sigmoid (logistic) activation     φ(x) = (1 + e^{−x})^{−1}       φ′ = φ(1 − φ)
    Hyperbolic tangent activation     φ(x) = tanh(x)                 φ′ = 1 − φ^2
    Exponential activation            φ(x) = exp(x)                  φ′ = φ
    Step function activation          φ(x) = 1_{x≥0}
    Rectified linear unit (ReLU)      φ(x) = x 1_{x≥0}
Fig. 7.1 Hyperbolic tangent activation function x ↦ tanh(wx) ∈ (−1, 1) for (fixed) weights
w ∈ {1/5, 1, 5} and x ∈ (−10, 10)

• We are mainly going to use the hyperbolic tangent activation function

    x ↦ tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}) = 2 (1 + e^{−2x})^{−1} − 1 ∈ (−1, 1).

Figure 7.1 illustrates the hyperbolic tangent activation function.


The hyperbolic tangent activation function is anti-symmetric w.r.t. the origin
with range (−1, 1). This anti-symmetry and boundedness is an advantage in
fitting deep FN network architectures. For this reason we usually prefer the
hyperbolic tangent over other activation functions.
• The sigmoid activation function corresponds to the logistic function that was
used in the Bernoulli and the categorical EFs, see Sects. 2.1.2 and 5.7. The sig-
moid activation function can be obtained from the hyperbolic tangent activation
function by setting φ(x) = (tanh(x/2) + 1)/2.
• The step function activation is not really used in applications. However, it allows
for nice interpretations, and it links FN networks to the theory of regression and
classification trees (CARTs); see Breiman et al. [54] for CARTs.
• The exponential activation function is a nice differentiable choice whenever the
range should be one-sided bounded.
• The ReLU activation function is also called hinge function or ramp function. This
is the preferred choice in the machine learning community. However, typically,
we will not use it because in our experience it is less robust in fitting compared to
the hyperbolic tangent activation function. This may be for two reasons, firstly,
the ReLU activation is unbounded, and secondly, it is identically equal to zero
for x < 0, which implies that there is no sensitivity in negative choices of x.

A FN layer with activation function φ is a mapping

    z^{(m)} : {1} × R^{q_{m−1}} → {1} × R^{q_m}    (7.5)
              z ↦ z^{(m)}(z) = ( 1, z^{(m)}_1(z), . . . , z^{(m)}_{q_m}(z) )^⊤,

having neurons for 1 ≤ j ≤ q_m

    z^{(m)}_j(z) = φ⟨w^{(m)}_j, z⟩ = φ( Σ_{l=0}^{q_{m−1}} w^{(m)}_{l,j} z_l ),    (7.6)

with given network weights w^{(m)}_j = (w^{(m)}_{l,j})_{0≤l≤q_{m−1}} ∈ R^{q_{m−1}+1}.

Interpretation Every neuron z ↦ z^{(m)}_j(z) describes a GLM regression function
with link function φ^{−1} and regression parameter w^{(m)}_j ∈ R^{q_{m−1}+1} for features
z ∈ {1} × R^{q_{m−1}}. These GLM regression functions can be interpreted as data
compression, i.e., in each neuron the q_{m−1}-dimensional feature z is projected to
a real number ⟨w^{(m)}_j, z⟩ ∈ R which is then (non-linearly) activated by φ. Since
this leads to a substantial loss of information, we perform this procedure of data
compression q_m times in FN layer z^{(m)}, so that each neuron in (z^{(m)}_j(z))_{1≤j≤q_m}
represents a different projection of the input z. Choosing suitable weights w^{(m)}_j will
allow us to extract the crucial feature information from z to receive good explanatory
variables for the regression task at hand.
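As a minimal illustration (our own sketch, not a listing from this book), a single FN layer (7.5)–(7.6) can be implemented in one line of R; the weight matrix W is a hypothetical placeholder:

# Minimal sketch of a FN layer with tanh activation; the input z contains the
# intercept component 1, and column j of the hypothetical matrix W (dimension
# (q_{m-1}+1) x q_m) holds the weights w_j^{(m)}.
fn_layer <- function(z, W, phi = tanh) c(1, phi(as.vector(z %*% W)))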

A FN network of depth d ∈ N is obtained by composing d FN layers
z^{(1)}, . . . , z^{(d)} to receive the mapping

    z^{(d:1)} : {1} × R^{q_0=q} → {1} × R^{q_d}    (7.7)
                x ↦ z^{(d:1)}(x) = (z^{(d)} ∘ · · · ∘ z^{(1)})(x).

Choosing a strictly monotone and smooth link function g and a regression
parameter β ∈ R^{q_d+1} we receive the FN network regression function

    x ∈ X ↦ μ(x) = g^{−1}⟨β, z^{(d:1)}(x)⟩.    (7.8)



Fig. 7.2 FN network of depth d = 3, with number of neurons (q1 , q2 , q3 ) = (20, 15, 10) and
input dimension q0 = 40. This gives us a network parameter ϑ ∈ Rr of dimension r = 1 306

This FN network regression function (7.8) has a network parameter ϑ =
(w^{(1)}_1, . . . , w^{(d)}_{q_d}, β)^⊤ ∈ R^r of dimension

    r = Σ_{m=1}^{d} q_m (q_{m−1} + 1) + (q_d + 1).

In Fig. 7.2 we illustrate a FN network of depth d = 3, FN layers of dimensions
(q_1, q_2, q_3) = (20, 15, 10) and input dimension q_0 = 40.¹ This gives us a network
parameter ϑ ∈ R^r of dimension r = 1 306. On the left-hand side we have the raw
features x ∈ X ⊂ {1} × R^{q_0}, these are processed through the three FN layers, where
the black circles illustrate the neurons z^{(m)}_j. The third FN layer z^{(3)} has dimension
q_3 = 10, providing the learned representation z^{(3:1)}(x) ∈ {1} × R^{q_3} of x. This is
used in the final GLM step (7.8) in the green box of Fig. 7.2.

¹ Figures 7.2 and 7.9 are similar to Figure 1 in [122], and all FN network plots have been created
with modified versions of the plot functions of the R package neuralnet [144].
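The dimension r = 1 306 is easily verified from the formula above; a minimal R sketch (our own illustration):

# Minimal sketch: network parameter dimension r for layer sizes (q0, q1, q2, q3).
q <- c(40, 20, 15, 10)                                  # as in Fig. 7.2
r <- sum(q[-1] * (head(q, -1) + 1)) + (tail(q, 1) + 1)  # formula for r above
r                                                       # gives 1306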
Remarks 7.2
• One distinguishes between FN networks of depth d = 1, called shallow
networks, and FN networks of depth d > 1, called deep networks. In this
sense, deep learning means that we learn suitable feature representations through
multiple FN layers d > 1. We come back to this in Sect. 7.2.2, below. Remark
that some people would only call a network deep if d ≫ 1; here, d > 1 will be
chosen for the definition of deep (which is also a precise definition).
• There are two ways of receiving a GLM. If we have a (trivial) FN network of
depth d = 0, this naturally corresponds to a GLM, see Fig. 7.2. In that case, one
works with the original features x ∈ X in (7.8). The second way of receiving a
GLM is given by choosing the identity function as activation function φ(x) = x.
This implies that x → z(d:1)(x) = Ax is a linear function for some matrix
A ∈ R(qd +1)×(q+1) and, henceforth, we receive a GLM.
• Under the above interpretation of the representation learning structure (7.7), we
may also give a different intuition for the FN layers. Typically, we expect that
the first FN layers decompose feature information x into bits and pieces, which
are then recomposed in a suitable way for the prediction task. In this sense, we
typically choose a larger dimension for the early FN layers otherwise we may
lose too much information already from the very beginning.
• The neural network introduced in (7.7) is called FN network because the signals
propagate from one layer to the next (directed acyclic graph). If the network
has loops it is called a recurrent neural (RN) network. RN networks have been
applied very successfully in image and speech recognition, for instance, long
short-term memory (LSTM) networks are very useful for time-series analysis.
We study RN networks in Chap. 8, below. A third type of neural networks
are convolutional neural (CN) networks which are very successfully applied
to image recognition because they are capable to detect similar structures at
different places in images, i.e., CN networks learn local representations. We will
discuss CN network architectures in Chap. 9, below.
• The generic FN network architecture (7.8) can be complemented by drop-
out layers, normalization layers, skip connections, embedding layers, etc. Such
layers are special purpose layers, for instance, taking care of over-fitting. We
introduce and discuss these below.
• The regression function (7.8) has a one-dimensional output for regression mod-
eling. Of course, categorical classification can be done completely analogously
by choosing a link function g suitable for classification, see Sect. 5.7. A similar
approach also works if, for instance, we want to model simultaneously the mean
and the dispersion of the data with a two-dimensional output function g −1 .

7.2.2 Universality Theorems

The use of FN networks for representation learning is motivated by the so-called


universality theorems which say that any compactly supported continuous (regres-
sion) function can be approximated arbitrarily well by a suitably large FN network.
As such, we can understand the FN network framework as an approximation tool
which, of course, is useful far beyond statistical modeling. In Chapter 12 we give
some proofs of selected universality statements to illustrate the flavor of such results.
In particular, Cybenko [86], Hornik et al. [192], Hornik [191], Leshno et al. [247],
Park–Sandberg [293, 294], Petrushev [302] and Isenbeck–Rüschendorf [198] have
shown (under mild conditions on the activation function) that shallow FN networks
can approximate any compactly supported continuous function arbitrarily well (in
supremum norm or in L2 -norm), if we allow for an arbitrary number of neurons q1 ∈
N in the single FN layer. Roughly speaking, such a result for shallow FN networks
holds true if and only if the chosen activation function is non-polynomial, see
Leshno et al. [247]. Such results are proved either by algebraic methods of Stone–
Weierstrass type or by Wiener–Tauberian denseness type arguments. Moreover,
approximation results are studied in Barron [25, 26], Yukich et al. [399], Makovoz
[262], Pinkus [303] and Döhler–Rüschendorf [108].
The above stated universality theorems say that shallow FN networks are
sufficient from an approximation point of view. Nevertheless, we will mainly
use deep (multiple layers) FN networks, below. These have better convergence
properties to given function classes because they more easily promote interactions
in feature components compared to shallow ones. Such questions have been studied,
e.g., by Elbrächter et al. [120], Kidger–Lyons [215], Lu et al. [260] or Cheridito et
al. [75]. For instance, Elbrächter et al. [120] compare finite-depth wide networks
to finite-width deep networks (under the choice of the ReLU activation function),
and they conclude that for many function classes deep networks lead to exponential
approximation rates, whereas shallow networks only provide polynomial approxi-
mation rates at the same number of network parameters. This motivates considering
sufficiently deep FN networks for representation learning because these typically
have a better approximation capacity compared to shallow ones.
We motivate this by two simple examples. For this motivation we use the step
function activation φ(x) = 1_{x≥0} ∈ {0, 1}. If we have the step function activation,
each neuron partitions R^{q_{m−1}} along a hyperplane, i.e.,

    z ↦ z^{(m)}_j(z) = φ⟨w^{(m)}_j, z⟩ = 1_{ Σ_{l=1}^{q_{m−1}} w^{(m)}_{l,j} z_l ≥ −w^{(m)}_{0,j} } ∈ {0, 1}.    (7.9)

For a shallow FN network we can study the question of the maximal complexity
of the resulting partition of the feature space X ⊂ {1} × Rq0 when considering q1

neurons (7.9) in the single FN layer z(1). Zaslavsky [400] proved that q1 hyperplanes
can partition the Euclidean space R^{q_0} in at most

    Σ_{j=0}^{min{q_0, q_1}} C(q_1, j)    disjoint sets.    (7.10)

This number (7.10) can be seen as a maximal upper complexity bound for shallow
FN networks with step function activation. It grows exponentially for q1 ≤ q0 , and
it slows down to a polynomial growth for q_1 > q_0. Thus, the complexity of shallow
FN networks grows comparably slowly once the width q_1 of the network exceeds q_0, and
therefore we often need a huge network to receive a good approximation.
This result (7.10) should be contrasted to Theorem 4 in Montúfar et al. [280] who
give a lower bound on the complexity of regression functions of deep FN networks
(under the ReLU activation function). Assume qm ≥ q0 for all 1 ≤ m ≤ d. The
maximal complexity is bounded below by
    ( ∏_{m=1}^{d−1} ⌊q_m/q_0⌋^{q_0} ) Σ_{j=0}^{q_0} C(q_d, j)    disjoint linear regions.    (7.11)

If we choose as an example a FN network with fixed width q_m = 4 for all m ≥ 1
and an input of dimension q_0 = 2, we receive from (7.11) a lower bound of

    4^{d−1} ( C(4,0) + C(4,1) + C(4,2) ) = (11/4) exp{d log(4)}.

Thus, we have an exponential growth in depth d → ∞. This contrasts the


polynomial complexity growth (7.10) of shallow FN networks.
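The two complexity bounds (7.10) and (7.11) are easily evaluated numerically; a minimal R sketch (our own illustration):

# Minimal sketch evaluating the complexity bounds (7.10) and (7.11).
regions_shallow <- function(q0, q1) sum(choose(q1, 0:min(q0, q1)))   # (7.10)
regions_deep_lb <- function(q0, qm, d, qd) {                         # (7.11)
  floor(qm / q0)^(q0 * (d - 1)) * sum(choose(qd, 0:q0))
}
regions_shallow(2, 64)        # polynomial growth in the width q1
regions_deep_lb(2, 4, 5, 4)   # exponential growth in the depth d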
Example 7.3 (Shallow vs. Deep Networks: Partitions) We give a second more
explicit example that compares shallow and deep FN networks. Choose q0 = 2
and assume we want to describe a regression function

μ : R2 → R, x → μ(x).

If we think of a tool box of basis functions to build regression function μ we may


want to choose indicator functions x → χA (x) ∈ {0, 1} for arbitrary rectangles A =
[x1− , x1+ ) × [x2− , x2+ ) ⊂ R2 . We show that we can easily construct such indicator
functions χA (x) for given rectangles A ⊂ R2 with FN networks of depth d = 2, but
not with shallow FN networks.
For illustrative purposes, we fix a square A = [−1/2, 1/2) × [−1/2, 1/2) ⊂ R2 ,
and we want to construct χA (x) with a network of depth d = 2. This indicator
function χA is illustrated in Fig. 7.3.

Fig. 7.3 Indicator function χ_A(x) for the square A = [−1/2, 1/2) × [−1/2, 1/2) ⊂ R^2

We choose the step function activation for φ and a first FN layer with q1 = 4
neurons

    x ↦ z^{(1)}(x) = ( 1, z^{(1)}_1(x), . . . , z^{(1)}_4(x) )^⊤
                  = ( 1, 1_{x_1≥−1/2}, 1_{x_2≥−1/2}, 1_{x_1≥1/2}, 1_{x_2≥1/2} )^⊤ ∈ {1} × {0, 1}^4.

This FN layer has a network parameter, see also (7.9),


    ( w^{(1)}_1, . . . , w^{(1)}_4 ) = ( (1/2, 1, 0)^⊤, (1/2, 0, 1)^⊤, (−1/2, 1, 0)^⊤, (−1/2, 0, 1)^⊤ ),    (7.12)
having dimension q1 (q0 + 1) = 12. For the second FN layer with q2 = 4 neurons
we choose the step function activation and

    z ↦ z^{(2)}(z) = ( 1, z^{(2)}_1(z), . . . , z^{(2)}_4(z) )^⊤
                  = ( 1, 1_{z_1+z_2≥3/2}, 1_{z_2+z_3≥3/2}, 1_{z_1+z_4≥3/2}, 1_{z_3+z_4≥3/2} )^⊤.

This FN layer has a network parameter

    ( w^{(2)}_1, . . . , w^{(2)}_4 ) = ( (−3/2, 1, 1, 0, 0)^⊤, (−3/2, 0, 1, 1, 0)^⊤, (−3/2, 1, 0, 0, 1)^⊤, (−3/2, 0, 0, 1, 1)^⊤ ),

having dimension q2 (q1 + 1) = 20. For the output layer we choose the identity link
g(x) = x, and the regression parameter β = (0, 1, −1, −1, 1) ∈ R5 . As a result,
we obtain
    χ_A(x) = ⟨β, z^{(2:1)}(x)⟩.    (7.13)

That is, this network of depth d = 2, number of neurons (q1 , q2 ) = (4, 4), step
function activation and identity link can perfectly replicate the indicator function for
the square A = [−1/2, 1/2) × [−1/2, 1/2), see Fig. 7.3. This network has r = 37
parameters.
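This construction is easily verified numerically; the following minimal R sketch (our own illustration) implements the network (7.12)–(7.13):

# Minimal sketch of the depth d = 2 network (7.13) with step function
# activation, replicating the indicator chi_A of A = [-1/2, 1/2) x [-1/2, 1/2).
phi <- function(x) as.numeric(x >= 0)                      # step activation
chi_A <- function(x1, x2) {
  z1 <- phi(c(1/2 + x1, 1/2 + x2, -1/2 + x1, -1/2 + x2))   # first layer (7.12)
  z2 <- phi(c(z1[1] + z1[2] - 3/2, z1[2] + z1[3] - 3/2,    # second FN layer
              z1[1] + z1[4] - 3/2, z1[3] + z1[4] - 3/2))
  sum(c(1, -1, -1, 1) * z2)                                # beta = (0,1,-1,-1,1)
}
chi_A(0, 0)   # 1: inside the square A
chi_A(1, 1)   # 0: outside the square A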
We now consider a shallow FN network with q1 neurons. The resulting regression
function with identity link is given by
    x ↦ ⟨β, z^{(1:1)}(x)⟩ = ⟨β, ( 1, z^{(1)}_1(x), . . . , z^{(1)}_{q_1}(x) )^⊤⟩
                         = ⟨β, ( 1, 1_{⟨w^{(1)}_1, x⟩≥0}, . . . , 1_{⟨w^{(1)}_{q_1}, x⟩≥0} )^⊤⟩,

where we have used the step function activation φ(x) = 1{x≥0} . As in (7.9),
each of these neurons leads to a partition of the space R2 with a straight line.
Importantly these straight lines go across the entire feature space, and, there-
fore, we cannot exactly construct the indicator function of Fig. 7.3 with a shal-
low FN network. This can nicely be seen in Fig. 7.4 (lhs), where we con-
sider a shallow FN network with q1 = 4 neurons, weights (7.12), and β =
(0, 1/2, 1/2, −1/2, −1/2).
However, from the universality theorems we know that shallow FN networks
can approximate any compactly supported (continuous) function arbitrarily well
for sufficiently large q1 . In this example we can introduce additional neurons and
let the resulting hyperplanes rotate around the origin. In Fig. 7.4 (middle, rhs) we
show this for q1 = 8 and q1 = 64 neurons. We observe that this allows us to
approximate a circle, see Fig. 7.4 (rhs), and having circles of different sizes at
different locations will allow us to approximate the square A considered above.

Fig. 7.4 Shallow FN networks with q1 = 4 (lhs), q1 = 8 (middle) and q1 = 64 (rhs)


However, of course, this is a much less efficient way compared to the deep FN
network (7.13).
Intuitively speaking, shallow FN networks act like additions where we add more
and more separating hyperplanes for q1 → ∞ (superposition of basis functions).
In contrast to that, going deep allows us to not only use additions but to also use
multiplications (composition of basis functions). This is the reason, why we can
easily construct the indicator function χA in the deep case (where we multiply
zero’s along the boundary of A), but not in the shallow case. 

7.2.3 Gradient Descent Methods

We describe gradient descent methods in this section. These are used to fit FN
networks. Gradient descent algorithms have already been used in Sect. 6.2.4 for
fitting LASSO regularized regression models. We will give the full methodological
part here, without relying on Sect. 6.2.4.

Plain Vanilla Gradient Descent Algorithm

Assume we have independent instances (Yi , x i ), 1 ≤ i ≤ n, that follow the same


member of the EDF. We choose a regression function
    x_i ↦ μ(x_i) = μ_ϑ(x_i) = E_{θ(x_i)}[Y_i] = g^{−1}⟨β, z^{(d:1)}(x_i)⟩,

for a strictly monotone and smooth link function g, and a FN network z(d:1) with
network parameter ϑ ∈ Rr . We assume that the chosen activation function φ is
differentiable. We highlight in the notation that the mean functional μϑ (·) depends
on the network parameter ϑ. The canonical parameter of the response Y_i is given
by θ(x_i) = h(μ_ϑ(x_i)) ∈ Θ, where h = (κ′)^{−1} is the canonical link and κ the
cumulant function of the chosen member of the EDF. This gives us (under constant
dispersion ϕ) the log-likelihood function, for given data Y = (Y_1, . . . , Y_n)^⊤,

    ϑ ↦ ℓ_Y(ϑ) = Σ_{i=1}^{n} [ (v_i/ϕ) ( Y_i h(μ_ϑ(x_i)) − κ(h(μ_ϑ(x_i))) ) + a(Y_i; v_i/ϕ) ].

The deviance loss function in this model is given by, see (4.9) and (4.8),

    D(Y, ϑ) = (2/n) Σ_{i=1}^{n} (v_i/ϕ) [ Y_i h(Y_i) − κ(h(Y_i)) − Y_i h(μ_ϑ(x_i)) + κ(h(μ_ϑ(x_i))) ] ≥ 0.    (7.14)

The MLE of ϑ is found by either maximizing the log-likelihood function or by


minimizing the deviance loss function in ϑ. In general, this problem cannot be solved
exactly because of its complexity. Typically, the deviance loss function is non-convex
in ϑ and it may have many local minimums. This is one of the reasons why we
are less ambitious here, and why we just try to find a network parameter ϑ which
provides a "small" deviance loss D(Y, ϑ) for the given data Y. We discuss this
further below; in fact, this is a crucial point in FN network fitting that is related to
in-sample over-fitting and, therefore, this point will require a broader discussion.
For the moment, we just try to find a network parameter ϑ that provides a
small deviance loss D(Y , ϑ) for the given data Y . Gradient descent algorithms
suggest that we try to step-wise locally improve our current position by changing the
network parameter into the direction of the maximal local decrease of the deviance
loss function. By assumption, our deviance loss function is differentiable in ϑ. This
allows us to consider the following first order Taylor expansion in ϑ
   
    D(Y, ϑ̃) = D(Y, ϑ) + ∇_ϑ D(Y, ϑ)^⊤ (ϑ̃ − ϑ) + o(‖ϑ̃ − ϑ‖_2)    as ‖ϑ̃ − ϑ‖_2 → 0.

This shows that the locally optimal change ϑ → ϑ̃ points into the opposite direction
of the gradient of the deviance loss function. This motivates the following gradient
descent step.

Assume that at algorithmic time t ∈ N we have a network parameter ϑ^{(t)} ∈
R^r. Choose a suitable learning rate ϱ_{t+1} > 0, and consider the gradient
descent update

    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ^{(t)}).    (7.15)

This gradient descent update gives us the new (smaller) deviance loss at
algorithmic time t + 1

    D(Y, ϑ^{(t+1)}) = D(Y, ϑ^{(t)}) − ϱ_{t+1} ‖∇_ϑ D(Y, ϑ^{(t)})‖²_2 + o(ϱ_{t+1})    for ϱ_{t+1} ↓ 0.

Under suitably tempered learning rates (ϱ_t)_{t≥1}, this algorithm will converge to a
local minimum of the deviance loss function as t → ∞ (supposed that we do not
get trapped in a saddlepoint).
Remarks 7.4 We give a couple of (preliminary) remarks on the gradient descent
algorithm (7.15); more explanation, further derivations, and variants of the gradient
descent algorithm will be discussed below.

• In the applications we will early stop the gradient descent algorithm before
reaching a local minimum (to prevent from over-fitting). This is going to be
discussed in the next paragraphs.
• Fine-tuning the learning rate (ϱ_t)_t is important; in particular, there is a trade-off
between smaller and bigger learning rates: they need to be sufficiently small so
that the first order Taylor expansion is still a valid approximation, and they should
be sufficiently big otherwise the convergence of the algorithm will be very slow
because it needs many iterations.
• The gradient descent algorithm is a first order algorithm, and one is tempted to
study higher order approximations, e.g., leading to the Newton–Raphson algo-
rithm. Unfortunately, higher order derivatives are computationally not feasible if
the size n of the data Y = (Y1 , . . . , Yn ) and the dimension r of the network
parameter ϑ are large. In fact, even the calculation of the first order derivatives
may be challenging and, therefore, stochastic gradient descent methods are
considered below. Nevertheless, it is beneficial to have a notion of a second order
term. Momentum-based methods originate from approximating the second order
terms, these will be studied in (7.19)–(7.20), below.
• The gradient descent step (7.15) solves an unconstrained local optimization.
  Similarly to (6.15)–(6.16) we could change the gradient descent algorithm to
  a constrained optimization problem, e.g., involving a LASSO constraint that can
be solved with the generalized projection operator (6.17).

Gradient Calculation via Back-Propagation

Fast gradient descent algorithms essentially rely on fast gradient calculations of the
deviance loss function. Under the EDF setup we have gradient w.r.t. ϑ

    ∇_ϑ D(Y, ϑ) = (2/n) Σ_{i=1}^{n} (v_i/ϕ) ( μ_ϑ(x_i) − Y_i ) h′(μ_ϑ(x_i)) ∇_ϑ μ_ϑ(x_i)    (7.16)
               = (2/n) Σ_{i=1}^{n} (v_i/ϕ) · ( μ_ϑ(x_i) − Y_i ) / ( V(μ_ϑ(x_i)) g′(μ_ϑ(x_i)) ) · ∇_ϑ⟨β, z^{(d:1)}(x_i)⟩,

where the last step uses the variance function V (·) of the chosen EDF, we also refer
to (5.9). The main difficulty is the calculation of the gradient
    ∇_ϑ⟨β, z^{(d:1)}(x)⟩ = ∇_ϑ⟨β, (z^{(d)} ∘ · · · ∘ z^{(1)})(x)⟩,

w.r.t. the network parameter ϑ = (w^{(1)}_1, . . . , w^{(d)}_{q_d}, β)^⊤ ∈ R^r, and where each
FN layer z^{(m)} involves the weights W^{(m)} = (w^{(m)}_1, . . . , w^{(m)}_{q_m})^⊤ ∈ R^{(q_{m−1}+1)×q_m}.

The workhorse for these gradient calculations is the back-propagation method


of Rumelhart et al. [324]. Basically, the back-propagation method is a clever

reparametrization of the problem so that the gradients can be calculated more easily.
We therefore modify the weight matrices W^{(m)} by dropping the first row containing
the intercept parameters w^{(m)}_{0,j}, 1 ≤ j ≤ q_m. Define for 1 ≤ m ≤ d + 1

    W^{(m)}_{(−0)} = ( w^{(m)}_{j_{m−1},j_m} )_{1≤j_{m−1}≤q_{m−1}; 1≤j_m≤q_m} ∈ R^{q_{m−1}×q_m},

where w^{(m)}_{j_{m−1},j_m} denotes component j_{m−1} of w^{(m)}_{j_m}, and where we set q_{d+1} = 1
(output dimension) and w^{(d+1)}_{j_d,1} = β_{j_d} for 0 ≤ j_d ≤ q_d.
Proposition 7.5 (Back-Propagation for the Hyperbolic Tangent Activation)
Choose a FN network of depth d ∈ N and with hyperbolic tangent activation
function φ(x) = tanh(x).
• Define recursively:
  – initialize q_{d+1} = 1 and δ^{(d+1)}(x) = 1 ∈ R^{q_{d+1}};
  – iterate for d ≥ m ≥ 1

        δ^{(m)}(x) = diag( 1 − (z^{(m:1)}_{j_m}(x))^2 )_{1≤j_m≤q_m} W^{(m+1)}_{(−0)} δ^{(m+1)}(x) ∈ R^{q_m}.

• We obtain for 0 ≤ m ≤ d

        ( ∂⟨β, z^{(d:1)}(x)⟩ / ∂w^{(m+1)}_{j_m,j_{m+1}} )_{0≤j_m≤q_m; 1≤j_{m+1}≤q_{m+1}} = z^{(m:1)}(x) δ^{(m+1)}(x)^⊤ ∈ R^{(q_m+1)×q_{m+1}},

where z^{(0:1)}(x) = x ∈ R^{q_0+1} and w^{(d+1)}_1 = β ∈ R^{q_d+1}.

Proof of Proposition 7.5 Choose 1 ≤ m ≤ d and define for the neurons 1 ≤ j_m ≤
q_m the variables

    ζ^{(m)}_{j_m}(x) = ⟨w^{(m)}_{j_m}, z^{(m−1:1)}(x)⟩.

The learned representation in the m-th FN layer is obtained by activating these
variables

    z^{(m:1)}(x) = ( 1, φ(ζ^{(m)}_1(x)), . . . , φ(ζ^{(m)}_{q_m}(x)) )^⊤ ∈ R^{q_m+1}.

For the output we define

    ζ^{(d+1)}_1(x) = ⟨β, z^{(d:1)}(x)⟩.

The main idea is to calculate the derivatives of ⟨β, z^{(d:1)}(x)⟩ w.r.t. these new
variables ζ^{(m)}_j(x).
Initialization for m = d + 1 This provides for m = d + 1 and 1 ≤ j_{d+1} ≤ q_{d+1} = 1

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ^{(d+1)}_1(x) = 1 = δ^{(d+1)}_1(x).

Recursion for m < d + 1 Next, we calculate the derivatives w.r.t. ζ^{(d)}_{j_d}(x), for m = d
and 1 ≤ j_d ≤ q_d. They are given by (note q_{d+1} = 1)

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ^{(d)}_{j_d}(x) = [ ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ^{(d+1)}_1(x) ] · [ ∂ζ^{(d+1)}_1(x) / ∂ζ^{(d)}_{j_d}(x) ]
        = δ^{(d+1)}_1(x) β_{j_d} φ′(ζ^{(d)}_{j_d}(x))    (7.17)
        = δ^{(d+1)}_1(x) w^{(d+1)}_{j_d,1} ( 1 − (z^{(d:1)}_{j_d}(x))^2 ) = δ^{(d)}_{j_d}(x),

where we have used w^{(d+1)}_{j_d,1} = β_{j_d} and for the hyperbolic tangent activation function
φ′ = 1 − φ^2. Continuing recursively for d > m ≥ 1 and 1 ≤ j_m ≤ q_m we obtain

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ^{(m)}_{j_m}(x) = Σ_{j_{m+1}=1}^{q_{m+1}} [ ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ^{(m+1)}_{j_{m+1}}(x) ] · [ ∂ζ^{(m+1)}_{j_{m+1}}(x) / ∂ζ^{(m)}_{j_m}(x) ]
        = Σ_{j_{m+1}=1}^{q_{m+1}} δ^{(m+1)}_{j_{m+1}}(x) w^{(m+1)}_{j_m,j_{m+1}} ( 1 − (z^{(m:1)}_{j_m}(x))^2 ) = δ^{(m)}_{j_m}(x).

Thus, the vectors δ^{(m)}(x) = (δ^{(m)}_1(x), . . . , δ^{(m)}_{q_m}(x))^⊤ are calculated recursively in
d ≥ m ≥ 1 with initialization δ^{(d+1)}(x) = 1 and the recursion

    δ^{(m)}(x) = diag( 1 − (z^{(m:1)}_{j_m}(x))^2 )_{1≤j_m≤q_m} W^{(m+1)}_{(−0)} δ^{(m+1)}(x) ∈ R^{q_m}.

Finally, we need to show how these derivatives are related to the original
derivatives in the gradient descent method. We have for 0 ≤ j_d ≤ q_d and j_{d+1} = 1

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂β_{j_d} = [ ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ^{(d+1)}_1(x) ] · [ ∂ζ^{(d+1)}_1(x) / ∂β_{j_d} ] = δ^{(d+1)}_1(x) z^{(d:1)}_{j_d}(x).

For 1 ≤ m < d, and 0 ≤ j_m ≤ q_m and 1 ≤ j_{m+1} ≤ q_{m+1} we have

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂w^{(m+1)}_{j_m,j_{m+1}} = [ ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ^{(m+1)}_{j_{m+1}}(x) ] · [ ∂ζ^{(m+1)}_{j_{m+1}}(x) / ∂w^{(m+1)}_{j_m,j_{m+1}} ] = δ^{(m+1)}_{j_{m+1}}(x) z^{(m:1)}_{j_m}(x).

For m = 0, and 0 ≤ l ≤ q_0 and 1 ≤ j_1 ≤ q_1 we have

    ∂⟨β, z^{(d:1)}(x)⟩ / ∂w^{(1)}_{l,j_1} = [ ∂⟨β, z^{(d:1)}(x)⟩ / ∂ζ^{(1)}_{j_1}(x) ] · [ ∂ζ^{(1)}_{j_1}(x) / ∂w^{(1)}_{l,j_1} ] = δ^{(1)}_{j_1}(x) x_l.

This completes the proof of Proposition 7.5. ■


Remark 7.6 Proposition 7.5 gives the back-propagation method for the hyperbolic
tangent activation function which has derivative φ′ = 1 − φ^2. This becomes visible
in the definition of δ^{(m)}(x) where we consider the diagonal matrix

    diag( 1 − (z^{(m:1)}_{j_m}(x))^2 )_{1≤j_m≤q_m}.

For a general differentiable activation function φ this needs to be replaced by,
see (7.17),

    diag( φ′⟨w^{(m)}_{j_m}, z^{(m−1:1)}(x)⟩ )_{1≤j_m≤q_m}.

In the case of the sigmoid activation function this gives us, see also Table 7.1,

    diag( z^{(m:1)}_{j_m}(x) ( 1 − z^{(m:1)}_{j_m}(x) ) )_{1≤j_m≤q_m}.
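For illustration, the forward pass and the recursions of Proposition 7.5 can be coded in a few lines of R; this is our own minimal sketch for depth d = 2, with hypothetical (random) weight matrices W1, W2 and output parameter beta:

# Minimal sketch of back-propagation (Proposition 7.5) for a tanh network of
# depth d = 2; all weights are hypothetical, and x contains the intercept 1.
set.seed(1)
q0 <- 3; q1 <- 5; q2 <- 4
W1 <- matrix(rnorm((q0 + 1) * q1), q0 + 1, q1)
W2 <- matrix(rnorm((q1 + 1) * q2), q1 + 1, q2)
beta <- rnorm(q2 + 1)
x <- c(1, rnorm(q0))
# forward pass: representations z^{(1:1)}(x) and z^{(2:1)}(x)
z1 <- c(1, tanh(as.vector(x %*% W1)))
z2 <- c(1, tanh(as.vector(z1 %*% W2)))
# backward recursion for delta^{(m)}(x); W_{(-0)} drops the intercept row
delta3 <- 1
delta2 <- (1 - z2[-1]^2) * (beta[-1] * delta3)
delta1 <- (1 - z1[-1]^2) * as.vector(W2[-1, ] %*% delta2)
# gradients of <beta, z^{(2:1)}(x)>: outer products z^{(m:1)}(x) delta^{(m+1)}(x)'
grad_beta <- z2 * delta3
grad_W2   <- outer(z1, delta2)
grad_W1   <- outer(x, delta1)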

Plain vanilla gradient descent algorithm for FN networks

1. Choose an initial network parameter ϑ^{(0)} ∈ R^r.
2. Iterate for t ≥ 0 until a stopping criterion is met:
   (a) Calculate the gradient ∇_ϑ D(Y, ϑ) in network parameter ϑ = ϑ^{(t)}
       using (7.16) and the back-propagation method of Proposition 7.5 (for the
       hyperbolic tangent activation function).
   (b) Make the gradient descent step for a suitable learning rate ϱ_{t+1} > 0

       ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ^{(t)}).

Remark 7.7 The initialization ϑ (0) ∈ Rr of the gradient descent algorithm needs
some care. A FN network has many symmetries, for instance, we can permute
neurons within a FN layer and we receive the same predictive model. For this
reason, the initial network weights W^{(m)} = (w^{(m)}_1, . . . , w^{(m)}_{q_m})^⊤ ∈ R^{(q_{m−1}+1)×q_m},
1 ≤ m ≤ d, should not be chosen with identical components because this will


result in a saddlepoint of the corresponding objective function, and gradient descent
will not work. For this reason, these weights are initialized randomly either using a
uniform or a Gaussian distribution. The former is related to the glorot_uniform
initializer in keras,2 see (16) in Glorot–Bengio [160]. This initializer scales the
support of the uniform distribution with the sizes of the FN layers that are connected
by the corresponding weights w(m) j .
For the output parameter we usually set as initial value β^{(0)} = (β̂^{(0)}_0, 0, . . . , 0)^⊤ ∈
R^{q_d+1}, where β̂^{(0)}_0 is the MLE in the corresponding null model (not considering any
features), transformed to the chosen link g. This choice implies that the gradient
descent algorithm starts in the null model, and any decrease in deviance loss can be
seen as an in-sample improvement of the FN network regression structure over the
null model.

Stochastic Gradient Descent

The gradient in (7.16) has two parts. We have a vector

    v(Y) = ( (v_i/ϕ) ( μ_ϑ(x_i) − Y_i ) · 1/V(μ_ϑ(x_i)) · 1/g′(μ_ϑ(x_i)) )_{1≤i≤n} ∈ R^n,

and we have a matrix

    M = ( ∇_ϑ⟨β, z^{(d:1)}(x_1)⟩, . . . , ∇_ϑ⟨β, z^{(d:1)}(x_n)⟩ ) ∈ R^{r×n}.

The gradient of the deviance loss function is obtained by the matrix multiplication

    ∇_ϑ D(Y, ϑ) = (2/n) M v(Y).
Matrix multiplication can be very slow in numerical implementations if the
sample size n is large. For this reason, one typically uses the stochastic gradient
descent (SGD) method that does not consider the entire data Y = (Y1 , . . . , Yn )
simultaneously.

2 For our examples we use the R library keras [77] which is an API to TensorFlow [2].

For the SGD method one chooses a fixed batch size b ∈ N, and one randomly
partitions the entire data Y into (mini-)batches Y_1, . . . , Y_{⌈n/b⌉} of approximately the
same size b (the last batch may be smaller). Each gradient descent update

    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ_{t+1} ∇_ϑ D(Y_s, ϑ^{(t)}),

is then only based on the observations Y_s in the corresponding batch 1 ≤ s ≤ ⌈n/b⌉.
Typically, one sequentially visits all batches, and screening each batch once is called
an epoch. Thus, if we run the SGD algorithm over K epochs on batches of size
b ≤ n, then we perform K⌈n/b⌉ gradient descent steps.
Choosing batches of size b reduces the complexity of the matrix multiplication
from n to b, and, henceforth, leads to much faster run times in one gradient
descent step. On the other hand, batches should have a minimal size so that the
gradient descent updates are not too erratic, i.e., if the batches are too small, the
randomness in the data may point too often into a (completely) wrong direction for
the optimal gradient descent step. For this reason, optimal batch sizes should be
chosen carefully. For instance, if we study a low frequency claims count problem,
say, with an expected frequency of λ = 10%, we can determine confidence bounds
for parameter estimation. This will provide an estimate of a minimal batch size b
for a reliable parameter estimate.
Having a few erratic steps in SGD, however, can also be beneficial, as long
as there are not too many of those. Sometimes, the algorithm gets trapped in
saddlepoints or in flat areas of the objective function (vanishing gradient problem).
If this is the case, an erratic step may be beneficial because it may perturb the
algorithm out of its bottleneck. In fact, often SGD has a better performance than the
plain vanilla gradient descent algorithm that is based on the entire data Y because
of these noisy contributions.
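In R pseudo-code, one epoch of SGD reads as follows; this minimal sketch (our own illustration) uses hypothetical placeholders n, lr, theta and grad(theta, idx) for the sample size, the learning rate, the network parameter and the batch gradient:

# Minimal sketch of one SGD epoch over (mini-)batches of size b; 'theta',
# 'n', 'lr' and 'grad(theta, idx)' are hypothetical placeholders.
b <- 10000
batches <- split(sample(seq_len(n)), ceiling(seq_len(n) / b))
for (idx in batches) {
  theta <- theta - lr * grad(theta, idx)   # gradient on the batch Y_s only
}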

Momentum-Based Gradient Descent Methods

The gradient descent method only considers a first order Taylor expansion and one is
tempted to consider higher order terms to improve the approximation. For instance,
Newton’s method uses a second order Taylor term by updating
    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ( ∇²_ϑ D(Y, ϑ^{(t)}) )^{−1} ∇_ϑ D(Y, ϑ^{(t)}).    (7.18)

In many practical applications this calculation is not feasible as the Hessian


∇ϑ2 D(Y , ϑ (t ) ) cannot be calculated in a reasonable amount of time. Another
(simple) way of considering the changes in the gradients is the momentum-based
gradient descent method of Rumelhart et al. [324]. This is inspired by mechanics in
physics and it is achieved by considering the gradients over several iterations of the
algorithm (with exponentially decaying weights). Choose a momentum coefficient
ν ∈ [0, 1) and define the initial speed v(0) = 0 ∈ Rr .

Replace the gradient descent update (7.15) by

    v^{(t)} → v^{(t+1)} = ν v^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ^{(t)}),    (7.19)
    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} + v^{(t+1)}.    (7.20)

For ν = 0 we have the plain vanilla gradient descent method, for ν > 0 we also
memorize the previous gradients (with exponentially decaying weights). Typically
this leads to better convergence properties.
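In R pseudo-code, the momentum-based update (7.19)–(7.20) reads as follows; this minimal sketch (our own illustration) uses hypothetical placeholders theta, lr and grad(theta):

# Minimal sketch of momentum-based gradient descent (7.19)-(7.20); 'theta',
# 'lr' and 'grad(theta)' are hypothetical placeholders.
nu <- 0.9                      # momentum coefficient
v  <- rep(0, length(theta))    # initial speed v^{(0)} = 0
for (t in 1:100) {
  v     <- nu * v - lr * grad(theta)
  theta <- theta + v
}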
Nesterov [284] has noticed that for convex functions the gradient descent updates
may have a zig-zag behavior. Therefore, he proposed the so-called Nesterov-
accelerated version

    v^{(t)} → v^{(t+1)} = ν v^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ^{(t)} + ν v^{(t)}),
    ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} + v^{(t+1)}.    (7.21)

Thus, the calculation of the momentum v^{(t+1)} uses a look-ahead ϑ^{(t)} + ν v^{(t)} in
the gradient calculation (anticipating part of the next step). This provides for the
update (7.21) the following equivalent versions, under the reparametrization ϑ̃^{(t)} =
ϑ^{(t)} + ν v^{(t)},

    ϑ^{(t+1)} = ϑ^{(t)} + ν v^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ^{(t)} + ν v^{(t)})
             = ϑ^{(t)} + ν v^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ̃^{(t)})    (7.22)
             = ( ϑ̃^{(t)} + ν v^{(t+1)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ̃^{(t)}) ) − ν v^{(t+1)}.

For the Nesterov-accelerated update we can also study, using the last line of (7.22),

    v^{(t)} → v^{(t+1)} = ν v^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ̃^{(t)}),
    ϑ̃^{(t)} → ϑ̃^{(t+1)} = ϑ̃^{(t)} + ( ν v^{(t+1)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ̃^{(t)}) ).    (7.23)

Compared to (7.19)–(7.20), we just shift the index by 1 in the momentum v^{(t)} in
the round brackets of (7.23). The typical way the Nesterov-acceleration is
formulated is, yet, another equivalent formulation, namely, only in terms of ϑ^{(t)} and
ϑ̃^{(t)}. From the second line of (7.22) and (7.21) we have the updates

    ϑ^{(t+1)} = ϑ̃^{(t)} − ϱ_{t+1} ∇_ϑ D(Y, ϑ̃^{(t)}),
    ϑ̃^{(t+1)} = ϑ^{(t+1)} + ν ( ϑ^{(t+1)} − ϑ^{(t)} ).    (7.24)

Typically, one chooses the momentum coefficient ν in (7.24) time-dependent by


setting νt = t/(t + 3).
In our applications we will use the R interface to the keras library [77].
This library has a couple of standard momentum-based gradient descent methods
implemented which use pre-defined learning rates and momentum coefficients. In
our analysis we are mainly relying on the variants rmsprop and the Nesterov-
accelerated version of adam, called nadam. Therefore, we briefly describe these
three variants, and for more information we refer to Sections 8.3 and 8.5 in
Goodfellow et al. [166].
Predefined Gradient Descent Methods
• rmsprop stands for 'root mean square propagation', and its origin can be
  found in a lecture of Hinton et al. [187]. Denote by ⊙ the Hadamard product
  that computes the component-wise products of two matrices. Choose a weight
  α ∈ (0, 1) and calculate the accumulated squared gradients, set r^{(0)} = 0 ∈ R^r,

      r^{(t)} → r^{(t+1)} = α r^{(t)} + (1 − α) ∇_ϑ D(Y, ϑ^{(t)}) ⊙ ∇_ϑ D(Y, ϑ^{(t)}) ∈ R^r.

  The sequence (r^{(t)})_{t≥1} memorizes the (squared) magnitudes of the components
  of the gradients ∇_ϑ D(Y, ϑ^{(t)}), t ≥ 1. This is done individually for each
  component because we may have directional differences in magnitudes (and
  momentum). In contrast to (7.19), r^{(t)} does not model the speed, but rather an
  inverse weight. This then motivates the gradient descent update

      ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ/√(ε + r^{(t+1)}) ⊙ ∇_ϑ D(Y, ϑ^{(t)}),

  where the square-root is taken component-wise, for a global decay rate ϱ > 0,
  and for a small positive constant ε > 0 to ensure that everything is well-defined.
• adam stands for 'adaptive moment' estimation, and it has been proposed by
  Kingma–Ba [216]. The momentum is determined by the first two moments in
  adam, namely, we set v^{(0)} = r^{(0)} = 0 ∈ R^r and we consider

      v^{(t)} → v^{(t+1)} = ν v^{(t)} + (1 − ν) ∇_ϑ D(Y, ϑ^{(t)}),    (7.25)
      r^{(t)} → r^{(t+1)} = α r^{(t)} + (1 − α) ∇_ϑ D(Y, ϑ^{(t)}) ⊙ ∇_ϑ D(Y, ϑ^{(t)}),    (7.26)

  for given weights ν, α ∈ (0, 1). Similar to Bayesian credibility theory, v^{(t)}
  and r^{(t)} are biased because these two processes have been initialized in zero.
  Therefore, they are rescaled by 1/(1 − ν^t) and 1/(1 − α^t), respectively. This
  gives us the gradient descent update

      ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ/√(ε + r^{(t+1)}/(1 − α^t)) ⊙ v^{(t+1)}/(1 − ν^t),

where the square-root is taken component-wise, for a global decay rate ϱ > 0,
and for a small positive constant ε > 0 to ensure that everything is well-defined.
• nadam is the Nesterov-accelerated [284] version of adam. Similarly as when
going from (7.19)–(7.20) to (7.23), the acceleration is obtained by a shift of 1 in
the velocity parameter, thus, consider the Nesterov-accelerated adam update

      ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ/√(ε + r^{(t+1)}/(1 − α^t)) ⊙ ( ν v^{(t+1)} + (1 − ν) ∇_ϑ D(Y, ϑ^{(t)}) )/(1 − ν^t),

using (7.25) and (7.26).

Maximum Likelihood Estimation and Over-fitting

As explained above, we model the mean of the datum (Y, x) by a deep FN network

    x ↦ μ(x) = μ_ϑ(x) = E_{θ(x)}[Y] = g^{−1}⟨β, z^{(d:1)}(x)⟩,

for a network parameter ϑ ∈ R^r. MLE of this network parameter requires solving
for given data Y

    ϑ̂^{MLE} = arg min_ϑ D(Y, ϑ).

In Fig. 7.5 we give a schematic figure of a loss surface ϑ ↦ D(Y, ϑ) for a (low-
dimensional) example ϑ ∈ R^2. The two plots show the same loss surface from two
different angles. This loss surface has three (local) minimums (red color), and the
smallest one (global minimum) gives the MLE ϑ̂^{MLE}.
In general, this global minimum cannot be found for more complex network
architectures because the loss surface typically has a complicated structure for high-
dimensional parameter spaces. Is this a problem in FN network fitting? Not really!
We are going to explain why. The universality theorems in Sect. 7.2.2 state that more
complex FN networks have an excellent approximation capacity. If we translate
this to our statistical modeling problem it means that the observations Y can be
approximated arbitrarily well by sufficiently complex FN networks. In particular,
for a given complex network architecture, the MLE ϑ̂^{MLE} will provide the optimal
fit of this architecture to the data Y, and, as a result, this network does not only
reflect the systematic effects in the data but also the noisy part. This behavior is
called (in-sample) over-fitting to the learning data L. It implies that such statistical
models typically have a poor generalization to unseen (out-of-sample) test data T;
this is illustrated by the red color in Fig. 7.6. For this reason, in general, we are
not interested in finding the MLE ϑ̂^{MLE} of ϑ in FN network regression modeling,
but we would like to find a parameter estimate ϑ̂ that (only) extracts the systematic
effects from the learning data L. This is illustrated by the different colors in Figs. 7.5

Fig. 7.5 Schematic figure of a loss surface ϑ ↦ D(Y, ϑ) from two different angles for a two-
dimensional parameter ϑ ∈ R^2

Fig. 7.6 Schematic figure of in-sample over-fitting (red), under-fitting (blue) and extracting
systematic effects (green)

and 7.6, where we assume: (a) red color provides models with a poor generalization
power due to over-fitting, (b) blue color provides models with a poor generalization
power, too, because these parametrizations do not explain the systematic effects in
the data at all (called under-fitting), and (c) green color gives good parametrizations
that explain the systematic effects in the data and generalize well to unseen data.
Thus, the aim is to find parametrizations that are in the green area of Fig. 7.5.
This green area emphasizes that we lose the notion of uniqueness because there
are infinitely many models in the green area that have a comparable generalization

power. Next we explain how we can exploit the gradient descent algorithm to make
it useful for finding parametrizations in the green area.
Remark 7.8 The loss surface considerations in Fig. 7.5 are based on a fixed network
architecture. Recent research promotes the so-called Graph HyperNetwork (GHN)
that is a (hyper-)network which tries to find the optimal network architecture and
its parametrization by an additional network, we refer to Zhang et al. [402] and
Knyazev et al. [219].

Regularization Through Early Stopping

As stated above, if we run the gradient descent algorithm with properly tempered
learning rates it will converge to a local minimum of the loss function, which means
that the resulting FN network over-fits to the learning data. For this reason we need
to early stop the gradient descent algorithm beforehand. Coming back to Fig. 7.5,
typically, we start the gradient descent algorithm somewhere in the blue area of
the loss surface (supposed that the red area is a sparse set on the loss surface).
Visually speaking, the gradient descent algorithm then walks down the valley (green,
yellow and red area) by exploiting locally optimal steps. Since at the early stage of
the algorithm the systematic effects play a dominant role over the noisy part, the
gradient descent algorithm learns these systematic effects at this first stage (blue
area in Fig. 7.5). When the algorithm arrives at the green area the noisy part in the
data starts to increasingly influence the model calibration (gradient descent steps),
and, henceforth, at this stage the algorithm should be stopped, and the learned
parameter should be selected for predictive modeling. This early stopping is an
implicit way of regularization, because it implies that we stop the parameter fitting
before the parameters start to learn very individual features of the (noisy) data (and
take extreme values).
This early stopping point is determined by doing an out-of-sample analysis. This
requires the learning data L to be further split into training data U and validation
data V. The training data U is used for gradient descent parameter learning, and
the validation data V is used for tracking the over-fitting by an instantaneous (out-
of-sample) validation analysis. This partition is illustrated in Fig. 7.7, which also
highlights that the validation data V is disjoint from the test data T , the latter only
being used in the final step for comparing different statistical models (e.g., a GLM
vs. a FN network). That is, model comparison is done in a proper out-of-sample
manner on T , and each of these models is only fit on U and V. Thus, for FN network
fitting with early stopping we need a reasonable amount of data that can be split into
3 sufficiently large data sets so that each is suitable for its purpose.
For early stopping we partition the learning data L into training data U and
validation data V. The plain vanilla gradient descent algorithm can then be changed
as follows.

Fig. 7.7 Partition of entire data D (lhs) into learning data L and test data T (middle), and into
training data U, validation data V and test data T (rhs)

Plain vanilla gradient descent algorithm with early stopping

1. Choose an initial network parameter ϑ^{(0)} ∈ R^r.
2. Iterate for t ≥ 0 until the early stopping criterion is met:
   (a) Calculate the gradient ∇_ϑ D(U, ϑ) in network parameter ϑ = ϑ^{(t)} on the
       training data U using (7.16) and the back-propagation method of Proposi-
       tion 7.5 (for the hyperbolic tangent activation function).
   (b) Make the gradient descent step for a suitable learning rate ϱ_{t+1} > 0

       ϑ^{(t)} → ϑ^{(t+1)} = ϑ^{(t)} − ϱ_{t+1} ∇_ϑ D(U, ϑ^{(t)}).

   (c) Calculate the validation loss D(V, ϑ^{(t)}) on the validation data V.
   (d) Stop the algorithm if the validation loss increases, i.e., if

       D(V, ϑ^{(t)}) > D(V, ϑ^{(t−1)}),    (7.27)

       and return the learned parameter (estimate) ϑ̂ = ϑ^{(t−1)}.

In applications we use the SGD algorithm that can also have erratic steps because
not all random (mini-)batches are necessarily typical representations of the data.
In such cases we should use more sophisticated stopping criteria than (7.27), for
instance, early stop if the validation loss increases five times in a row.
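In the R keras library, such a patience-based early stopping rule is available as a callback. The following minimal sketch (our own illustration, with a hypothetical design matrix X and response vector Y, and ignoring exposures and offsets for brevity) shows the corresponding fitting set-up:

# Minimal sketch of early-stopped SGD fitting with keras; 'X' and 'Y' are
# hypothetical data, and the architecture is for illustration only.
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 20, activation = "tanh", input_shape = ncol(X)) %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 1, activation = "exponential")
model %>% compile(loss = "poisson", optimizer = optimizer_nadam())
history <- model %>% fit(X, Y, epochs = 1000, batch_size = 10000,
  validation_split = 0.1,
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 5,
                                           restore_best_weights = TRUE)))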

Fig. 7.8 Training loss D(U, ϑ^{(t)}) vs. validation loss D(V, ϑ^{(t)}) over different iterations t ≥ 0 of
the SGD algorithm

Figure 7.8 provides an example of the application of the SGD algorithm on


training data U and validation data V. The training loss is in blue color and the
validation loss in green color. We observe that the validation loss has its minimum
after 52 epochs (orange vertical line), and hence the fitting algorithm should be
stopped at this point. We give a couple of remarks concerning Fig. 7.8:
• The learning data L exactly corresponds to the claims frequency data of
  Sect. 5.2.4, see also Table 5.2. We take 10% as validation data which gives
  |U| = 549 185 and |V| = 61 021. For the SGD algorithm we use batches of size
  10 000 which implies that one epoch corresponds to ⌊549 185/10 000⌋ = 54
  gradient descent steps. For batches of size 10 000 we expect an approximate
  estimation precision on an average frequency of λ̄ = 7.36% in the Poisson model
  of

      [ λ̄ − 2√(λ̄/(10 000 v̄)), λ̄ + 2√(λ̄/(10 000 v̄)) ] = [6.62%, 8.11%],

  with an average exposure v̄ = 0.5283 on our learning data, we also refer to
  Example 3.22.
• The FN network architecture used in Fig. 7.8 is the one shown in Fig. 7.2
using one-hot encoding for categorical variables, see Sect. 7.3.1, below, and the
responses are modeled by a Poisson distribution.
• The training loss D(U, ϑ (t ) ), blue curve in Fig. 7.8, is a bit wiggly which comes
from the fact that we use a SGD where not every batch leads to the optimal
decrease in loss. Remark that the loss figures in the graph correspond to average
losses over an entire epoch, i.e., in our case an average over 54 SGD steps. Also
remark that the y-scale does not show the Poisson deviance loss: we use the loss
figures provided by keras [77] and these figures drop all terms of the deviance
loss that are not relevant for parameter estimation.

We close this section with remarks.


Remarks 7.9
• We perform early stopping because otherwise a complex FN network would
  in-sample over-fit to the learning data. At this stage, one could be tempted to
  choose a smaller network to prevent over-fitting. In general, this is not a
  sensible thing to do because the network needs sufficient flexibility to be
  fitted to the data. That is, we need some redundancy in the model to be able to
  successfully apply the SGD algorithm, otherwise the algorithm may get trapped
  in saddle points or bottlenecks. Thus, the chosen network architecture should be
  above the bound of a necessary minimal complexity, and different architectures
  above this bound will provide similar accuracy (without a clear winner).
• The chosen network will contain certain elements of randomness, and different
runs of the SGD algorithm will provide different solutions. Firstly, the initializa-
tion ϑ (0) ∈ Rr of the algorithm is chosen at random, and since we early stop
the algorithm and because we do not have a unique optimal point, the chosen
solution will depend on this random initialization. Secondly, the split between
training and validation data is done at random, and thirdly the partitioning of the
training data into mini-batches is done at random. All these random elements
make the early stopped SGD solution non-unique.
• Early stopping implies that the chosen network parameter estimate ϑ̂ does not
  correspond to a solution of the score equations and, hence, asymptotic
  results about MLEs do not apply, see Theorem 3.28.

7.3 Feed-Forward Neural Network Examples

7.3.1 Feature Pre-processing

Similarly to GLMs, we also need to pre-process the feature components in FN
network regression modeling. The former Sect. 5.2.2 for GLMs has been called
'feature engineering' because we need to bring the feature components into an
appropriate functional form w.r.t. the given regression task. The present section is
called 'feature pre-processing' because we do not need to engineer the features for
FN networks. We only need to bring them into a suitable (tabular) form to enter the
network, and the network will then do an automated feature engineering through
representation learning.

Categorical Feature Components: One-Hot Encoding

The categorical features have been treated by dummy coding within GLMs. Dummy
coding provides full rank design matrices. For FN network regression modeling the
full rank property is not important because, anyway, we neither have a single (local)
minimum in the objective function, nor do we want to calculate the MLE of the
network parameter. Typically, in FN network regression modeling one uses one-hot
encoding for the categorical variables, which encodes every level by a unit vector.
Assume the raw feature component x̃_j is a categorical variable taking K different
levels {a_1, ..., a_K}. One-hot encoding is obtained by the embedding map

    x̃_j ↦ x_j = (1_{x̃_j = a_1}, ..., 1_{x̃_j = a_K})^⊤ ∈ {0, 1}^K.    (7.28)

Table 7.2 One-hot encoding example mapping the K = 11 levels (colors) to the unit vectors of the 11-dimensional Euclidean space R^11, showing the resulting encoding vectors x_j as row vectors

    a1  = white     1 0 0 0 0 0 0 0 0 0 0
    a2  = yellow    0 1 0 0 0 0 0 0 0 0 0
    a3  = orange    0 0 1 0 0 0 0 0 0 0 0
    a4  = red       0 0 0 1 0 0 0 0 0 0 0
    a5  = magenta   0 0 0 0 1 0 0 0 0 0 0
    a6  = violet    0 0 0 0 0 1 0 0 0 0 0
    a7  = blue      0 0 0 0 0 0 1 0 0 0 0
    a8  = cyan      0 0 0 0 0 0 0 1 0 0 0
    a9  = green     0 0 0 0 0 0 0 0 1 0 0
    a10 = beige     0 0 0 0 0 0 0 0 0 1 0
    a11 = brown     0 0 0 0 0 0 0 0 0 0 1

An explicit example is given in Table 7.2 which should be compared to Table 5.1.
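A minimal sketch of the one-hot encoding map (7.28) in R (the function name is ours, and the factor levels are taken in alphabetical order):

# map a categorical vector to the unit vectors of {0,1}^K, cf. (7.28)
one_hot <- function(x) {
  lev <- levels(factor(x))
  X <- matrix(0L, nrow = length(x), ncol = length(lev),
              dimnames = list(NULL, lev))
  X[cbind(seq_along(x), match(x, lev))] <- 1L
  X
}
one_hot(c("white", "red", "white"))   # each row is a unit vector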

Continuous Feature Components

The continuous feature components do not need any pre-processing but they can
directly enter the FN network which will take care of representation learning.
However, an efficient use of gradient descent methods typically requires that all
feature components live on a similar scale and that they are roughly uniformly
spread across their domains. This makes gradient descent steps more efficient in
exploiting the relevant directions.
One possibility is to use the MinMaxScaler. Let x_j^− and x_j^+ be the minimal and
maximal possible feature values of the continuous feature component x_j, i.e., x_j ∈
[x_j^−, x_j^+]. We transform this continuous feature component to unit scale for all data
1 ≤ i ≤ n by

    x_{i,j} ↦ x_{i,j}^{MM} = 2 (x_{i,j} − x_j^−)/(x_j^+ − x_j^−) − 1 ∈ [−1, 1].    (7.29)

The resulting feature values (x_{i,j}^{MM})_{1≤i≤n} should roughly be uniformly spread
across the interval [−1, 1]. If this is not the case, for instance, because we have
outliers in the feature values, we may first transform them non-linearly to get
more uniformly spread values. For example, we consider the Density of the car
frequency example on the log scale.
An alternative to the MinMaxScaler is to consider normalization with the
empirical mean x̄_j and the empirical standard deviation σ̂_j over all data x_{i,j}. That
is,

    x_{i,j} ↦ x_{i,j}^{sd} = (x_{i,j} − x̄_j)/σ̂_j.    (7.30)

It depends on the application whether the MinMaxScaler or the normalization with
the empirical mean and standard deviation works better. Important in applications
is that we use exactly the same values for the normalization of training data U,
validation data V and test data T, to make the same network applicable to all
these data sets. For notational convenience we will drop the upper index in x_{i,j}^{MM}
or x_{i,j}^{sd}, respectively, and we throughout assume that all feature components are
appropriately pre-processed.
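A minimal sketch of the two scalers (7.29)–(7.30); the data frame dat and its columns are hypothetical placeholders, and the key point is that the scaling constants fitted on the training data must be reused unchanged on the validation and test data:

# MinMaxScaler (7.29) and empirical standardization (7.30)
minmax_scale <- function(x, x_min, x_max) 2 * (x - x_min) / (x_max - x_min) - 1
sd_scale     <- function(x, m, s) (x - m) / s

# fit the constants on the training data only ...
x_min <- min(dat$DrivAge); x_max <- max(dat$DrivAge)
# ... and apply the very same constants to all data sets
dat$DrivAgeMM <- minmax_scale(dat$DrivAge, x_min, x_max)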

7.3.2 Lab: Poisson FN Network for Car Insurance Frequencies

We present a first FN network example applied to the French MTPL claim frequency
data studied in Sect. 5.2.4. We assume that the claim counts N_i are independent and
Poisson distributed with claim count density (5.26), where we replace the GLM
regression function x ↦ exp⟨β, x⟩ by a FN network regression function

    x ∈ X ↦ μ(x) = exp⟨β, z^{(d:1)}(x)⟩.
We use a FN network of depth d = 3 having numbers of neurons (q_1, q_2, q_3) =
(20, 15, 10) and using the hyperbolic tangent activation function. We pre-process
the categorical variables VehBrand and Region by one-hot encoding, pro-
viding input dimensions 11 and 22, respectively. The binary variable VehGas
is encoded as 0–1. Because of scarcity of data we right-censor the continuous
variables VehAge at 20, DrivAge at 90 and BonusMalus at 150, and we
transform Density to the log scale. We then apply to each of these (modified)
continuous variables Area, VehPower, VehAge, DrivAge, BonusMalus and
log(Density) a MinMaxScaler. This provides us with an input dimension q_0 =
11 + 22 + 1 + 6 = 40. The resulting FN network is illustrated in Fig. 7.2, with
the one-hot encoded variables VehBrand in orange color and Region in magenta
color. It has a network parameter ϑ ∈ R^r of dimension r = 1'306.
This network is implemented in R using the library keras [77]. The code is
provided in Listing 7.1 and the resulting network architecture is summarized in
Listing 7.2. This network is now fitted to the data. We use a batch size of 10'000
and the nadam version of SGD, and we take 10% of the learning data L as validation
data V and the remaining 90% as training data U. We then run the corresponding

Listing 7.1 FN network of depth d = 3 using the R library keras [77]

1 library(keras)
2 #
3 Design = layer_input(shape = c(40), dtype = 'float32', name = 'Design')
4 Vol = layer_input(shape = c(1), dtype = 'float32', name = 'Vol')
5 #
6 Network = Design %>%
7 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
8 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
9 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
10 layer_dense(units=1, activation='exponential', name='Network',
11 weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))
12 #
13 Response = list(Network, Vol) %>% layer_multiply(name='Multiply')
14 #
15 model = keras_model(inputs = c(Design, Vol), outputs = c(Response))
16 #
17 summary(model)

Listing 7.2 FN network illustrated in Fig. 7.2


1 Layer (type) Output Shape Param # Connected to
2 ==================================================================
3 Design (InputLayer) (None, 40) 0
4 __________________________________________________________________
5 FNLayer1 (Dense) (None, 20) 820 Design[0][0]
6 __________________________________________________________________
7 FNLayer2 (Dense) (None, 15) 315 FNLayer1[0][0]
8 __________________________________________________________________
9 FNLayer3 (Dense) (None, 10) 160 FNLayer2[0][0]
10 __________________________________________________________________
11 Network (Dense) (None, 1) 11 FNLayer3[0][0]
12 __________________________________________________________________
13 Vol (InputLayer) (None, 1) 0
14 __________________________________________________________________
15 Multiply (Multiply) (None, 1) 0 Network[0][0]
16 Vol[0][0]
17 ==================================================================
18 Total params: 1,306
19 Trainable params: 1,306
20 Non-trainable params: 0

Listing 7.3 Fitting a FN network using the R library keras [77]

1 path0 <- "path_for_callback"
2 CBs <- callback_model_checkpoint(path0, monitor = "val_loss", verbose = 0,
3 save_best_only = TRUE, save_weights_only = TRUE)
4 #
5 model %>% compile(loss = 'poisson', optimizer = 'nadam')
6 fit <- model %>% fit(list(Xlearn, Vlearn), Ylearn, validation_split=0.1,
7 batch_size=10000, epochs=1000, verbose=0, callbacks=CBs)
8 #
9 load_model_weights_hdf5(model, path0)

Table 7.3 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10^-2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5 and the FN network model (with one-hot encoding of the categorical variables)

                                        Run time  # param.  In-sample loss on L  Out-of-sample loss on T  Aver. freq.
Poisson null                            –         1         25.213               25.445                   7.36%
Poisson GLM3                            15 s      50        24.084               24.102                   7.36%
One-hot FN (q1,q2,q3) = (20,15,10)      51 s      1'306     23.757               23.885                   6.96%

SGD algorithm and we retrieve the network with the lowest validation loss using
a callback. This is illustrated in Listing 7.3. The fitting performance on the
training and validation data is illustrated in Fig. 7.8, and we retrieve the network
calibration after the 52nd epoch because it has the lowest validation loss. The results
are presented in Table 7.3.
From the results of Table 7.3 we conclude that the FN network outperforms
model Poisson GLM3 (out-of-sample) since it has a (clearly) lower out-of-sample
deviance loss on the test data T. This may indicate that there is an interaction
between the feature components that has not been captured by the GLM. The run
time of 51 s corresponds to the run time until the minimal validation loss is reached;
of course, in practice we need to continue beyond this minimal validation loss to
ensure that we have really found the minimum. Finally, and importantly, we observe
that this early stopped FN network calibration does not meet the balance property
because the resulting average frequency of this fitted model of 6.96% is below the
empirical frequency of 7.36%. This is a major deficiency of this FN network fitting
approach, and it is going to be discussed further in Sect. 7.4.2, below.
We can perform a detailed analysis of different batch sizes, variants of SGD
methods, run times, etc. We briefly summarize our findings; this summary is also
based on the findings in Ferrario et al. [127]. We have fitted this model on batches
of sizes 2’000, 5’000, 10’000 and 20’000, and it seems that a batch size around
5’000 has the best performance, both concerning out-of-sample performance and
run time to reach the minimal validation loss. Comparing the different optimizers
rmsprop, adam and nadam, a clear preference can be given to nadam: the
resulting prediction accuracy is similar in all three optimizers (they all reach the
green area in Fig. 7.5), but nadam reaches this optimal point in half of the time
compared to rmsprop and adam.
We conclude by highlighting that different initial points ϑ (0) of the SGD
algorithm will give different network calibrations, and differences can be consid-
erable. This is discussed in Sect. 7.4.4, below. Moreover, we could explore different
network architectures, more simple ones, more complex ones, different activation
functions, etc. The results of these different architectures will not be essentially
different from our results, as long as the networks are above a minimal complexity
bound. This closes our first example on FN networks and this example is the
benchmark for refined versions that are presented in the subsequent sections.

7.4 Special Features in Networks

7.4.1 Special Purpose Layers

So far, our networks consist of stacked FN layers, and information is passed in a
directed acyclic feed-forward path from one FN layer to the next. In this section we
discuss special purpose layers that perform a specific task in a FN network. These
include embedding layers, drop-out layers and normalization layers. These modules
should be seen as add-ons to the FN layers. Besides these add-ons, there are also
recurrent layers and convolutional layers. These two types of layers are going to be
discussed in chapters of their own, below, because their importance goes beyond just
being add-ons to the FN layers.

Embedding Layers for Categorical Feature Components

The categorical feature components have been treated either by dummy coding or
by one-hot encoding, and this has resulted in numerous network parameters in the
first FN layer, see Fig. 7.2. Natural language processing (NLP) treats categorical
feature components differently, namely, it embeds categorical feature components
(or words in NLP) into a Euclidean space Rb of a small dimension b. This small
dimension b is a hyper-parameter that has to be selected by the modeler, and which,
typically, is selected much smaller than the total number of levels of the categorical
feature. This embedding technique is quite common in NLP, see Bengio et al. [27–
29], but it goes beyond NLP applications, see Guo–Berkhahn [176], and it has been
introduced to the actuarial community by Richman [312, 313] and the tutorial of
Schelldorfer–Wüthrich [329].
We assume the same set-up as in dummy coding (5.21) and in one-hot encoding (7.28),
namely, that we have a raw categorical feature component x̃_j taking K
different levels {a_1, ..., a_K}. In one-hot encoding these K levels are mapped to the
K unit vectors of the Euclidean space R^K, and consequently all levels have the same
mutual Euclidean distance. This does not seem to be the best way of comparing the
different levels because in our regression analysis we would like to identify the
levels that are more similar w.r.t. the regression task and, thus, these should cluster.
For an embedding layer one chooses a Euclidean space R^b of a dimension b < K,
typically being (much) smaller than K. One then considers the embedding map

    e: {a_1, ..., a_K} → R^b,   a_k ↦ e(a_k) =: e^(k).    (7.31)

That is, every level a_k receives a vector representation e^(k) ∈ R^b which is
lower dimensional than its one-hot encoding counterpart in R^K. Proximity of the
representations e^(k) and e^(k') in R^b, i.e., of two levels a_k and a_{k'}, should be related
to similarity w.r.t. the regression task at hand.

Fig. 7.9 (lhs) One-hot encoding with q0 = 40, and (rhs) embedding layers for VehBrand and
Region with embedding dimension b = 2 and q0 = 11; the remaining network architecture is
identical with (q1 , q2 , q3 ) = (20, 15, 10) for depth d = 3

Such an embedding involves K vectors e^(k) ∈ R^b of dimension b; thus, it involves
Kb parameters, called embedding weights.
In network modeling, these embedding weights e^(1), ..., e^(K) can also be learned
during gradient descent training. Basically, it just means that for the categorical
variables we add an additional embedding layer before the first FN layer z^(1), i.e.,
we increase the depth of the network by 1 for the categorical feature components
(by a layer that is not fully connected). This is illustrated in Fig. 7.9 (rhs) for
the French MTPL insurance example of Sect. 7.3.2. The graph on the left-hand
side shows the network if we apply one-hot encoding to the categorical variables
VehBrand and Region; this results in a network parameter of dimension r =
1'306. The graph on the right-hand side first embeds VehBrand and Region
into two 2-dimensional spaces, illustrated by the orange and magenta circles. These
embeddings are concatenated with the remaining feature components, which then
provides a new input dimension q_0 = 7 + 2 + 2 = 11 in that example. This results in a
network parameter of dimension r = 726 + 22 + 44 = 792, where 22 + 44 = 66
stands for the 2-dimensional embedding weights of the 11 VehBrands and the 22
French Regions, see Listing 7.5.
Example 7.10 (Embedding Layers for Categorical Features) We revisit the exam-
ple of Sect. 7.3.2, but we replace one-hot encoding of the categorical variables by
embedding layers of dimension b = 2. The corresponding R code is given in
Listing 7.4 and the resulting model is illustrated in Listing 7.5 and Fig. 7.9 (rhs).
Apart from replacing one-hot encoding by embedding layers, we use exactly
the same FN network architecture as in Sect. 7.3.2 and we apply the same fitting
strategy in terms of batch sizes, optimizer and early stopping strategy. The results
are presented in Table 7.4.

Listing 7.4 FN network of depth d = 3 using embedding layers

1 Design = layer_input(shape = c(7), dtype = 'float32', name = 'Design')
2 VehBrand = layer_input(shape = c(1), dtype = 'int32', name = 'VehBrand')
3 Region = layer_input(shape = c(1), dtype = 'int32', name = 'Region')
4 Vol = layer_input(shape = c(1), dtype = 'float32', name = 'Vol')
5 #
6 BrandEmb = VehBrand %>%
7 layer_embedding(input_dim=11,output_dim=2,input_length=1,name='BrandEmb') %>%
8 layer_flatten(name='Brand_flat')
9 RegionEmb = Region %>%
10 layer_embedding(input_dim=22,output_dim=2,input_length=1,name='RegionEmb') %>%
11 layer_flatten(name='Region_flat')
12 #
13 Network = list(Design,BrandEmb,RegionEmb) %>% layer_concatenate(name='concate') %>%
14 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
15 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
16 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
17 layer_dense(units=1, activation='exponential', name='Network',
18 weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))
19 #
20 Response = list(Network, Vol) %>% layer_multiply(name='Multiply')
21 #
22 model = keras_model(inputs = c(Design, VehBrand, Region, Vol),
23 outputs = c(Response))

Table 7.4 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10^-2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5 and the FN network models (with one-hot encoding and embedding layers of dimension b = 2, respectively)

                                        Run time  # param.  In-sample loss on L  Out-of-sample loss on T  Aver. freq.
Poisson null                            –         1         25.213               25.445                   7.36%
Poisson GLM3                            15 s      50        24.084               24.102                   7.36%
One-hot FN (q1,q2,q3) = (20,15,10)      51 s      1'306     23.757               23.885                   6.96%
Embed FN (q1,q2,q3) = (20,15,10)        120 s     792       23.694               23.820                   7.24%

A first remark is that the model calibration takes longer using embedding layers
compared to one-hot encoding. The main reason for this is that having an embedding
layer increases the depth of the network by one layer, as can be seen from Fig. 7.9.
Therefore, the back-propagation takes more time, and the convergence is slower
requiring more gradient descent steps. We have less over-fitting as can be seen from
Fig. 7.10. The final fitted model has a slightly better out-of-sample performance
compared to the one-hot encoding one. However, this slight improvement in the
performance should not be overstated because, as explained in Remarks 7.9, there
are a couple of elements of randomness involved in SGD fitting, and choosing
a different seed may change the results. We remark that the balance property is
not fulfilled because the average frequency of the fitted model does not meet the
empirical frequency, see the last column of Table 7.4; we come back to this in
Sect. 7.4.2, below.

Listing 7.5 Summary of FN network of Fig. 7.9 (rhs) using embedding layers of dimension b = 2
1 Layer (type) Output Shape Param # Connected to
2 ==============================================================================
3 VehBrand (InputLayer) (None, 1) 0
4 ______________________________________________________________________________
5 Region (InputLayer) (None, 1) 0
6 ______________________________________________________________________________
7 BrandEmb (Embedding) (None, 1, 2) 22 VehBrand[0][0]
8 ______________________________________________________________________________
9 RegionEmb (Embedding) (None, 1, 2) 44 Region[0][0]
10 ______________________________________________________________________________
11 Design (InputLayer) (None, 7) 0
12 ______________________________________________________________________________
13 Brand_flat (Flatten) (None, 2) 0 BrandEmb[0][0]
14 ______________________________________________________________________________
15 Region_flat (Flatten) (None, 2) 0 RegionEmb[0][0]
16 ______________________________________________________________________________
17 concate (Concatenate) (None, 11) 0 Design[0][0]
18 Brand_flat[0][0]
19 Region_flat[0][0]
20 ______________________________________________________________________________
21 FNLayer1 (Dense) (None, 20) 240 concate[0][0]
22 ______________________________________________________________________________
23 FNLayer2 (Dense) (None, 15) 315 FNLayer1[0][0]
24 ______________________________________________________________________________
25 FNLayer3 (Dense) (None, 10) 160 FNLayer2[0][0]
26 ______________________________________________________________________________
27 Network (Dense) (None, 1) 11 FNLayer3[0][0]
28 ______________________________________________________________________________
29 Vol (InputLayer) (None, 1) 0
30 ______________________________________________________________________________
31 Multiply (Multiply) (None, 1) 0 Network[0][0]
32 Vol[0][0]
33 ==============================================================================
34 Total params: 792
35 Trainable params: 792
36 Non-trainable params: 0

Fig. 7.10 Training loss D(U, ϑ^(t)) vs. validation loss D(V, ϑ^(t)) over different iterations t ≥ 0 of the SGD algorithm in the deep FN network with embedding layers for categorical variables (y-axis: (modified) deviance loss; x-axis: training epochs)

[Fig. 7.11 shows two scatter plots: the '2-dimensional embedding of VehBrand' (lhs) and the '2-dimensional embedding of Region' (rhs), with axes labeled dimension 1 and dimension 2. The legend groups the vehicle brands as: B1/B2 Renault, Nissan, Citroen; B3 Volkswagen, Audi, Skoda, Seat; B4/B5 Opel, General Motors, Ford; B6 Fiat; B10/B11 Mercedes, Chrysler, BMW; B12 Japanese cars (except Nissan); B13/B14 other cars.]

Fig. 7.11 Embedding weights e_VehBrand ∈ R^2 and e_Region ∈ R^2 of the categorical variables VehBrand and Region for embedding dimension b = 2

A major advantage of using embedding layers for the categorical variables is that
we receive a continuous representation of nominal variables, where proximity can be
interpreted as similarity for the regression task at hand. This is nicely illustrated in
Fig. 7.11 which shows the resulting 2-dimensional embeddings eVehBrand ∈ R2 and
eRegion ∈ R2 of the categorical variables VehBrand and Region. The Region
embedding eRegion ∈ R2 shows surprising similarities with the French map, for
instance, Paris region R11 is adjacent to R23, R22, R21, R26, R24 (which is also
the case in the French map), the Isle of Corsica R94 and the South of France R93,
R91 and R73 are well separated from other regions, etc. Similar observations can
be made for the embedding of VehBrand, Japanese cars B12 are far apart from the
other cars, cars B1, B2, B3 and B6 (Renault, Nissan, Citroen, Volkswagen, Audi,
Skoda, Seat and Fiat) cluster, etc. 
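The embedding weights shown in Fig. 7.11 can be read off the fitted model directly; a minimal sketch, assuming the layer names of Listing 7.4:

# extract the learned 2-dimensional embedding weights, cf. Fig. 7.11
emb.Brand  <- get_weights(get_layer(model, 'BrandEmb'))[[1]]    # 11 x 2 matrix
emb.Region <- get_weights(get_layer(model, 'RegionEmb'))[[1]]   # 22 x 2 matrix
plot(emb.Region[, 1], emb.Region[, 2], pch = 19,
     xlab = 'dimension 1', ylab = 'dimension 2')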

Drop-Out Layers and Regularization

Above, over-fitting to the learning data has been taken care of by early stopping. In
view of Sect. 6.2 one could also use regularization. This can easily be obtained by
replacing (7.14), for instance, by the following L^p-regularized counterpart

    ϑ ↦ (2/n) Σ_{i=1}^n (v_i/ϕ) [ Y_i h(Y_i) − κ(h(Y_i)) − Y_i h(μ_ϑ(x_i)) + κ(h(μ_ϑ(x_i))) ] + λ ‖ϑ_−‖_p^p,

for some p ≥ 1, regularization parameter λ > 0, and where the reduced network
parameter ϑ_− ∈ R^{r−1} excludes the intercept parameter β_0 of the output layer;
we also refer to (6.4) in the context of GLMs. For grouped penalty terms we
refer to (6.21).

The difficulty with this approach is the tuning of the regularization
parameter(s) λ: run time is one issue, suitable grouping is another, and the non-
uniqueness of the optimal network a further one that can substantially distort the
selection of reasonable regularization parameters.
A more popular method to prevent individual neurons in a FN layer from over-fitting
to a certain task is the so-called drop-out layer. A drop-out layer is an additional
layer between FN layers that removes neurons from the network at random during
gradient descent training, i.e., in each gradient descent step, any of the earmarked
neurons is offset independently from the others with a fixed probability δ ∈ (0, 1).
This random removal implies that the composite of the remaining neurons needs
to be sufficiently well balanced to take over the role of the dropped-out neurons.
Therefore, a single neuron cannot be over-trained to a certain task because it needs
to be able to play several different roles. Drop-out has been introduced by Srivastava
et al. [345] and Wager et al. [373].

Listing 7.6 FN network of depth d = 3 using a drop-out layer, ridge regularization and a normalization layer

1 Network = list(Design,BrandEmb,RegionEmb) %>%
2 layer_concatenate(name='concate') %>%
3 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
4 layer_dropout(rate = 0.01) %>%
5 layer_dense(units=15, kernel_regularizer=regularizer_l2(0.0001),
6 activation='tanh', name='FNLayer2') %>%
7 layer_batch_normalization() %>%
8 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
9 layer_dense(units=1, activation='exponential', name='Network',
10 weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))

Listing 7.6 gives an example, where we add a drop-out layer with a drop-out
probability of δ = 0.01 after the first FN layer, and in the second FN layer we apply
ridge regularization to the weights (w_{1,1}^{(2)}, ..., w_{q_1,q_2}^{(2)}), i.e., excluding the intercepts
w_{0,j}^{(2)}, 1 ≤ j ≤ q_2. Both the drop-out layer and the regularization are only used during
the gradient descent fitting, and these network features are disabled during
prediction.
Drop-out is closely related to ridge regularization, as the following linear
Gaussian regression example shows; this consideration is taken from Section 18.6
of Efron–Hastie [117]. Assume we have a linear regression problem with square
loss function

    D(Y, β) = (1/2) Σ_{i=1}^n ( Y_i − ⟨β, x_i⟩ )^2.

We assume in this Gaussian case that the observations and the features are
standardized, see Sect. 6.2.4. This means that Σ_{i=1}^n Y_i = 0, Σ_{i=1}^n x_{i,j} = 0 and
n^{-1} Σ_{i=1}^n x_{i,j}^2 = 1, for all 1 ≤ j ≤ q. This standardization implies that we can
omit the intercept parameter β_0 because its MLE is equal to 0.


We introduce i.i.d. drop-out random variables I_{i,j} for 1 ≤ i ≤ n and 1 ≤ j ≤ q
with (1 − δ) I_{i,j} being Bernoulli distributed with probability 1 − δ ∈ (0, 1). This
scaling implies E[I_{i,j}] = 1. Using these Bernoulli random variables we modify the
above square loss function to

    D_I(Y, β) = (1/2) Σ_{i=1}^n ( Y_i − Σ_{j=1}^q β_j I_{i,j} x_{i,j} )^2,

i.e., every individual component x_{i,j} can drop out independently of the others.
Gaussian MLE requires setting the gradient of D_I(Y, β) w.r.t. β ∈ R^q equal to
zero. The average score equation is given by (we average over the drop-out random
variables I_{i,j})

    E_δ[ ∇_β D_I(Y, β) | Y ] = −X^⊤ Y + X^⊤ X β + (δ/(1−δ)) diag( Σ_{i=1}^n x_{i,1}^2, ..., Σ_{i=1}^n x_{i,q}^2 ) β
                            = −X^⊤ Y + X^⊤ X β + (δ n/(1−δ)) β = 0,

where we have used the normalization of the columns of the design matrix X ∈
R^{n×q} (we drop the intercept column). This is ridge regression in the linear Gaussian
case with a regularization parameter λ = δ/(2(1 − δ)) > 0 for δ ∈ (0, 1), see (6.9).
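This identity can be verified numerically; the following toy sketch (all names and parameter values are ours) compares the Monte Carlo average of the drop-out score with the ridge score derived above:

set.seed(100)
n <- 500; q <- 2; delta <- 0.5; b <- c(1, -1)
# standardize the columns: mean zero and n^{-1} sum of squares equal to 1
X <- apply(matrix(rnorm(n * q), n, q), 2,
           function(x) (x - mean(x)) / sqrt(mean((x - mean(x))^2)))
Y <- as.vector(X %*% b) + rnorm(n); Y <- Y - mean(Y)
# Monte Carlo average of the drop-out gradient of D_I(Y, beta) at beta = b
grad <- replicate(2000, {
  I <- matrix(rbinom(n * q, 1, 1 - delta) / (1 - delta), n, q)
  Z <- I * X
  as.vector(-t(Z) %*% (Y - Z %*% b))
})
rowMeans(grad)                  # approximately equals the ridge score below
as.vector(-t(X) %*% Y + t(X) %*% X %*% b + delta * n / (1 - delta) * b)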

Normalization Layers

In (7.29) and (7.30) we have discussed that the continuous feature components
should be pre-processed so that all components live on the same scale, otherwise the
gradient descent fitting may not be efficient. A similar phenomenon may occur with
the learned representations z^{(m:1)}(x_i) in the FN layers 1 ≤ m ≤ d. In particular, this
is the case if we choose an unbounded activation function φ. For this reason, it can
be advantageous to rescale the components z_j^{(m:1)}(x_i), 1 ≤ j ≤ q_m, in a given FN
layer back to the same scale. To achieve this, a normalization step (7.30) is applied
to every neuron z_j^{(m:1)}(x_i) over the given cases i in the considered (mini-)batch. This
involves two more parameters (for the empirical mean and the empirical standard
deviation) in each neuron of the corresponding FN layer. Note, however, that all
these operations are of a linear nature. Therefore, they do not affect the predictive
model (i.e., these operations cancel in the scalar products in (7.6)), but they may
improve the performance of the gradient descent algorithm.
The code in Listing 7.6 uses a normalization layer on line 7. In our applications,
it has not been necessary to use these normalization layers, as they have not led to
better run times in the SGD algorithms; note that our networks are not very deep and
they use the symmetric and bounded hyperbolic tangent activation function.

7.4.2 The Balance Property in Neural Networks

We have seen in Table 7.4 that our FN network outperforms the GLM for claim
frequency prediction in terms of a lower out-of-sample loss. We interpret this as
follows. Feature engineering has not been done in the most optimal way for the
GLM because the FN network finds modeling structure that is not present in the
selected GLM. As a consequence, the FN network provides a better generalization
to unseen data, i.e., we can better predict new data on a granular level with the FN
network. However, having a more precise model on an individual policy level does
not necessarily imply that the model also performs better on a global portfolio level.
In our example we see that we may have smaller errors on an individual policy level,
but these smaller errors do not aggregate to a more precise model in the average
portfolio frequency. In our case, we have a misspecification of the average portfolio
frequency, see the last column of Table 7.4. This is a major deficiency in insurance
pricing because it may result in a misspecification of the overall price level, and this
requires a correction. We call this correction bias regularization.

Simple Bias Regularization

The straightforward correction is to adjust the intercept parameter β_0 ∈ R
accordingly. That is, compare the empirical mean

    μ̄ = Σ_{i=1}^n v_i Y_i / Σ_{i=1}^n v_i,

to the model average of the fitted FN network

    μ̂ = Σ_{i=1}^n v_i μ_ϑ̂(x_i) / Σ_{i=1}^n v_i,

where ϑ̂ = (ŵ_1^{(1)}, ..., ŵ_{q_d}^{(d)}, β̂) ∈ R^r is the learned network parameter from the
(early stopped) SGD algorithm. The output of this fitted model reads as

    x_i ↦ μ_ϑ̂(x_i) = g^{-1}⟨β̂, ẑ^{(d:1)}(x_i)⟩ = g^{-1}( β̂_0 + Σ_{j=1}^{q_d} β̂_j ẑ_j^{(d:1)}(x_i) ),

where the hat in ẑ^{(d:1)} indicates that we use the estimated weights ŵ_l^{(m)}, 1 ≤ l ≤ q_m,
1 ≤ m ≤ d, in the FN layers. The balance property can be rectified by replacing β̂_0
by the solution β̂_0* of the following identity

    Σ_{i=1}^n v_i Y_i = Σ_{i=1}^n v_i g^{-1}( β̂_0* + Σ_{j=1}^{q_d} β̂_j ẑ_j^{(d:1)}(x_i) ).

Since g^{-1} is continuous and strictly monotone, there is a unique solution to this
requirement, supposed that the range of g^{-1} covers the support of the Y_i's. If we
work with the log-link g(·) = log(·), this can easily be solved and we obtain

    β̂_0* = β̂_0 + log( μ̄ / μ̂ ).
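Under the log-link this correction is a one-line adjustment of the fitted keras model; a hedged sketch, assuming that learn$pred stores the fitted frequencies μ_ϑ̂(x_i) and that the output intercept is the last weight array of the model (as in Listings 7.1 and 7.4):

mu.bar <- sum(learn$N) / sum(learn$Exposure)                       # empirical mean
mu.hat <- sum(learn$Exposure * learn$pred) / sum(learn$Exposure)   # model average
w1 <- get_weights(model)
k  <- length(w1)
w1[[k]] <- w1[[k]] + log(mu.bar / mu.hat)    # shift the output intercept
set_weights(model, w1)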

Sophisticated Bias Regularization Under the Canonical Link Choice

If we work with the canonical link g = h = (κ')^{-1}, we can do better because the
MLE of such a GLM automatically provides the balance property, see Corollary 5.7.
Choose the SGD learned network parameter ϑ̂ = (ŵ_1^{(1)}, ..., ŵ_{q_d}^{(d)}, β̂) ∈ R^r.
Denote by ẑ^{(d:1)} the fitted network architecture that is based on the estimated
weights ŵ_1^{(1)}, ..., ŵ_{q_d}^{(d)}. This allows us to study the learned representations of the
raw features x_1, ..., x_n in the last FN layer. We denote these learned representations
by

    ẑ_1 = ẑ^{(d:1)}(x_1), ..., ẑ_n = ẑ^{(d:1)}(x_n) ∈ {1} × R^{q_d}.    (7.32)

These learned representations can be used as new features to explain the response
Y. We define the feature engineered design matrix by

    X̂ = (ẑ_1, ..., ẑ_n)^⊤ ∈ R^{n×(q_d+1)}.

Based on this new design matrix X̂ we can run a classical GLM receiving a unique
MLE β̂^MLE ∈ R^{q_d+1}, supposed that this design matrix has full rank q_d + 1 ≤ n,
see Proposition 5.1. Since we work with the canonical link, this re-calibrated FN
network will automatically satisfy the balance property, and the resulting regression
function reads as

    x ↦ μ̂(x) = h^{-1}⟨β̂^MLE, ẑ^{(d:1)}(x)⟩.    (7.33)

This is the proposal of Wüthrich [390]. We give some remarks.


Remarks 7.11
• This additional MLE step for the output parameter β ∈ R^{q_d+1} may lead to
  over-fitting. In that case one might choose a lower dimensional last FN layer.
  Alternatively, one might explore an earlier stopping rule in SGD.
• Wüthrich [390] also explores other bias correction methods like regularization
using shrinkage. In combination with regression trees one can achieve averages
on pre-defined sub-portfolios. We will not further explore these other approaches
because they are less robust and more difficult in the applications.

Example 7.12 (Balance Property in Networks) We apply this additional MLE step
to the two FN networks of Table 7.4. Note that in these two examples we consider
a Poisson model using the canonical link for g, thus, the resulting adjusted
network (7.33) will automatically satisfy the balance property, see Corollary 5.7.

Listing 7.7 Balance property adjustment (7.33)

1 glm.formula <- function(nn){
2 string <- "yy ~ X1"
3 if (nn>1){for (ll in 2:nn){ string <- paste(string, "+X",ll, sep="")}}
4 string
5 }
6 #
7 zz <- keras_model(inputs=model$input,
8 outputs=get_layer(model, 'FNLayer3')$output)
9 xx.learn <- data.frame(zz %>% predict(list(Xlearn, Vlearn)))
10 q3 <- ncol(xx.learn)
11 xx.learn$yy <- Ylearn
12 xx.learn$Exposure <- learn$Exposure
13 #
14 glm1 <- glm(as.formula(glm.formula(q3)),
15 data=xx.learn, offset=log(Exposure), family=poisson())
16 #
17 w1 <- get_weights(model)
18 w1[[7]] <- array(glm1$coefficients[2:(q3+1)], dim=c(q3,1))
19 w1[[8]] <- array(glm1$coefficients[1], dim=c(1))
20 set_weights(model, w1)

In Listing 7.7 we illustrate the necessary code that has to be added to Listings
7.1–7.3. On lines 7–8 of Listing 7.7 we retrieve the learned representations
(7.32), which are used as the new features in the Poisson GLM on lines 14–15.
The resulting MLE β̂^MLE ∈ R^{q_d+1} is imputed to the network parameter ϑ̂ on
lines 17–20. Table 7.5 shows the performance of the resulting bias regularized FN
networks.
Firstly, we observe from the last column of Table 7.5 that, indeed, the bias
regularization step (7.33) provides the balance property. In general, in-sample losses
(have to) decrease because β̂^MLE is (in-sample) more optimal than the early stopped
SGD solution β̂. Out-of-sample this leads to a small improvement in the one-hot
encoded variant and a small worsening in the embedding variant, i.e., the
latter slightly over-fits in this additional MLE step. However, these differences are
comparably small so that we do not further worry about the over-fitting, here. This
closes this example.

Table 7.5 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10^-2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5 and the FN network models (with one-hot encoding and embedding layers of dimension b = 2, respectively), and their bias regularized counterparts

                                        Run time  # param.  In-sample loss on L  Out-of-sample loss on T  Aver. freq.
Poisson null                            –         1         25.213               25.445                   7.36%
Poisson GLM3                            15 s      50        24.084               24.102                   7.36%
One-hot FN (q1,q2,q3) = (20,15,10)      51 s      1'306     23.757               23.885                   6.96%
Embed FN (q1,q2,q3) = (20,15,10)        120 s     792       23.694               23.820                   7.24%
One-hot FN bias regularized             +4 s      1'306     23.742               23.878                   7.36%
Embed FN bias regularized               +4 s      792       23.690               23.824                   7.36%

Auto-Calibration for Bias Regularization

We present another approach for correcting the potential failure of the balance
property. This method does not depend on a particular type of regression model,
i.e., it can be applied to any regression model. This proposal goes back to Denuit et
al. [97], and it is based on the notion of auto-calibration introduced by Patton [297]
and Krüger–Ziegel [227]. We first describe auto-calibration and its implications.

Definition 7.13 The random variable Z is an auto-calibrated forecast of the random
variable Y if E[Y | Z] = Z, a.s.
If the response Y is described by the features X = x, we consider the conditional
mean of Y , given X,

μ(X) = E [Y |X] .

This conditional mean μ(X) is an auto-calibrated forecast for the response Y . Use
the tower property and note that σ (μ(X)) ⊂ σ (X) to receive, a.s.,

E [Y | μ(X)] = E [E [Y | X]| μ(X)] = E [μ(X)| μ(X)] = μ(X).

For the further understanding of auto-calibration and forecast dominance, we


introduce the concept of convex order; forecast dominance has been introduced in
Definition 4.20.

Definition 7.14 (Convex Order) A random variable Z_1 is bigger in convex order
than a random variable Z_2, write Z_1 ≥_cx Z_2, if E[Ψ(Z_1)] ≥ E[Ψ(Z_2)], for all
convex functions Ψ for which the expectations exist.

By Strassen's theorem [346], Z_1 ≥_cx Z_2 if and only if there exist random variables
Z_1' and Z_2' with Z_1' =_(d) Z_1 and Z_2' =_(d) Z_2 and E[Z_1' | Z_2'] = Z_2', a.s. In particular,
the convex order Z_1 ≥_cx Z_2 implies that Var(Z_1) ≥ Var(Z_2) and E[Z_1] = E[Z_2].
The latter follows from Strassen's theorem and the tower property, and the former
follows from the latter and the convex order by using the explicit choice Ψ(x) = x^2.
Thus, the random variable Z_1 is more volatile than Z_2, both having the same mean.
The following theorem shows that this additional volatility is a favorable property
in terms of forecast dominance under auto-calibration.
Theorem 7.15 (Krüger–Ziegel [227, Theorem 3.1], Without Proof) Assume that
μ̂_1 and μ̂_2 are auto-calibrated forecasts for the random variable Y. Predictor μ̂_1
forecast dominates μ̂_2 if and only if μ̂_1 ≥_cx μ̂_2.
Recall that forecast dominance of μ̂_1 over μ̂_2 was defined as follows, see Definition 4.20,

    E[ D_ψ(Y, μ̂_1) ] ≤ E[ D_ψ(Y, μ̂_2) ],

for all Bregman divergences D_ψ. Strassen's theorem tells us that μ̂_1 is more volatile
than μ̂_2 (both being auto-calibrated and unbiased for E[Y]), and this additional
volatility implies that the former auto-calibrated predictor can better follow Y. This
provides the superior forecast dominance of μ̂_1 over μ̂_2. This relation is most easily
understood by the following example. Consider (Y, X) as above. Assume that the
feature X̃ is a sub-variable of the feature X, obtained by dropping some of the components
of X. Naturally, we have σ(X̃) ⊂ σ(X), and both sets of information provide auto-
calibrated forecasts

    μ(X) = E[Y | X]   and   μ̃(X̃) = E[Y | X̃].

The tower property and Jensen's inequality give, for any convex function Ψ (subject
to existence),

    E[Ψ(μ(X))] = E[ Ψ(E[Y | X]) ] = E[ E[ Ψ(E[Y | X]) | X̃ ] ]
               ≥ E[ Ψ( E[ E[Y | X] | X̃ ] ) ] = E[ Ψ(E[Y | X̃]) ] = E[Ψ(μ̃(X̃))].

Thus, we have μ(X) ≥_cx μ̃(X̃), which implies forecast dominance of μ(X) over
μ̃(X̃). This makes perfect sense in view of σ(X̃) ⊂ σ(X). Basically, this describes
the construction of an F-martingale using an integrable random variable Y and a
filtration F on the underlying probability space (Ω, A, P). This martingale sequence
provides forecast dominance with increasing information sets described by the
filtration F.

We now turn our attention to the balance property and the unbiasedness of
predictors; this follows Denuit et al. [97]. Assume we have any predictor μ̂(x) of
Y; for instance, this can be any FN network predictor μ_ϑ̂(x) coming from an early
stopped SGD algorithm. We define its balance-corrected version by

    μ̂_BC(x) = E[ Y | μ̂(x) ].    (7.34)

Proposition 7.16 (Wüthrich [391, Proposition 4.6], Without Proof) The
balance-corrected predictor μ̂_BC(X) is an auto-calibrated forecast for Y.

Remarks 7.17 (Expected Deviance Generalization Loss) We return to the decomposition
of the expected deviance GL given in Theorem 4.7, but we now add the features
X = x. The expected deviance GL of a predictor μ̂(X) under the unit deviance
d then reads as

    E_θ[d(Y, μ̂(X))] = E_θ[d(Y, μ)]
        + 2 ( μ h(μ) − κ(h(μ)) − E_θ[Y h(μ̂(X))] + E_θ[κ(h(μ̂(X)))] ),

where μ = E_θ[Y] is the unconditional mean of Y (averaging also over the feature
distribution of X). Note that this formula differs from (4.13) because Y and h(μ̂(X))
are no longer independent if we include the features X. The term E_θ[d(Y, μ)] is
called the entropy, which is driven by the stochastic nature of the random variable
Y. This is the irreducible risk if no feature information is available.
In statistical modeling one considers different decompositions of the expected
deviance GL; we refer to Fissler et al. [129]. Namely, introducing the features X
we can reduce the expected deviance GL compared to the unconditional mean μ in
terms of forecast dominance. This allows us to decouple as follows for the prediction
μ(X) = E_θ[Y | X]

    E_θ[d(Y, μ̂(X))] = E_θ[d(Y, μ)] − ( E_θ[d(Y, μ)] − E_θ[d(Y, μ(X))] )
        + ( E_θ[d(Y, μ̂(X))] − E_θ[d(Y, μ(X))] ).

This expresses the expected deviance GL of the predictor μ̂(X) as the entropy (first
term), the conditional resolution (second term) and the conditional calibration (third
term). The conditional resolution describes the information gain in terms of forecast
dominance knowing the feature X, and the conditional calibration describes how
well we estimate μ(X). The conditional resolution is positive because μ(X) ≥_cx μ
and the unit deviance d(Y, ·) is a convex function, see Lemma 2.22. The conditional
calibration is also positive; this can be seen by considering the deviance GL,
conditional on X.
We can reformulate this expected deviance GL in terms of the auto-calibration
property

    E_θ[d(Y, μ̂(X))] = E_θ[d(Y, μ)] − ( E_θ[d(Y, μ)] − E_θ[d(Y, μ̂_BC(X))] )
        + ( E_θ[d(Y, μ̂(X))] − E_θ[d(Y, μ̂_BC(X))] ).

The first term is the entropy, the second term is called the auto-resolution and the
third term describes the auto-calibration. If we have an auto-calibrated forecast
μ̂(X), then the last term vanishes because it is equal to its balance-corrected version
μ̂_BC(X). Again these two latter terms are positive; for the auto-calibration this can
be seen by considering the deviance GL, conditioned on μ̂(X).
To rectify the balance property we directly focus on (7.34), and we estimate
this conditional expectation. That is, the balance correction can be achieved by an
additional regression step directly estimating the balance-corrected version μ̂_BC(x)
in (7.34). This additional regression step differs from (7.33) as it does not use the
learned representations ẑ^{(d:1)}(x) in the last FN layer (7.32), but it uses the learned
representations in the output layer. That is, consider the learned features

    ẑ_1^# = (1, μ_ϑ̂(x_1))^⊤, ..., ẑ_n^# = (1, μ_ϑ̂(x_n))^⊤ ∈ {1} × R,

and perform an additional linear regression step for the response Y using the design
matrix

    X̂^# = (ẑ_1^#, ..., ẑ_n^#)^⊤ ∈ R^{n×2}.

This additional linear regression step gives us the estimate

    β̂ = ( (X̂^#)^⊤ V X̂^# )^{-1} (X̂^#)^⊤ V Y ∈ R^2,    (7.35)

with diagonal weight matrix V = diag(v_i)_{1≤i≤n}. The balance property is then
restored by estimating the balance-corrected means μ̂_BC(x_i) by

    μ̂_BC(x_i) = β̂_0 + β̂_1 μ_ϑ̂(x_i),    (7.36)

for 1 ≤ i ≤ n. Note that this can be done for any regression model since we do not
rely on the network architecture in this step.
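In R, (7.35)–(7.36) amount to a single weighted least squares fit; a minimal sketch with hypothetical column names (Y the responses, pred the network means μ_ϑ̂(x_i), v the exposures):

fit <- lm(Y ~ pred, weights = v, data = dat)   # solves (7.35)
dat$mu.BC <- fitted(fit)                       # balance-corrected means (7.36)
# balance check: the two exposure-weighted sums now agree
c(sum(dat$v * dat$Y), sum(dat$v * dat$mu.BC))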

Remarks 7.18
• Balance correction (7.36) may lead to some conflict in range if the dual (mean)
  parameter space M is (one-sided) bounded. Moreover, it does not consider the
  deviance loss of the response Y, but rather underlies a Gaussian model by
  using the weighted square loss function for finding (the Gaussian MLE) β̂ ∈ R^2.
  Alternatively, we could consider the canonical link h that belongs to the chosen
  EDF. This then allows us to study the regression problem on the canonical scale
  by setting for the learned representations

      ẑ_1^θ = (1, h(μ_ϑ̂(x_1)))^⊤, ..., ẑ_n^θ = (1, h(μ_ϑ̂(x_n)))^⊤ ∈ {1} × Θ.    (7.37)

  The latter motivates the consideration of a GLM under the chosen EDF

      x_i ↦ h(μ̂_BC(x_i)) = ⟨β, ẑ_i^θ⟩ = β_0 + β_1 h(μ_ϑ̂(x_i)),    (7.38)

  for regression parameter β ∈ R^2. The choice of the canonical link and the
  inclusion of an intercept will provide the balance property when estimating β
  with MLE, see Corollary 5.7. If the mean estimates μ_ϑ̂(x_i) involve the canonical
  link h, (7.38) reads as

      x_i ↦ h(μ̂_BC(x_i)) = ⟨β, ẑ_i^θ⟩ = β_0 + β_1 ⟨β̂, ẑ^{(d:1)}(x_i)⟩,

  the latter scalar product being the output activation received from the FN network.
  From this we see that the estimated balance-corrected calibration on the
  canonical scale will give us a non-optimal (in-sample) estimation step compared
  to (7.33), if we work with the canonical link h. A code sketch of this canonical-scale
  correction is given after these remarks.
• Denuit et al. [97] give a proposal to break down the global balance to a local
  version using a suitable kernel function; this will be further discussed in the next
  Example 7.19.
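For the Poisson model with log-link, the GLM (7.38) is a one-covariate Poisson regression; a hedged sketch, again with hypothetical column names:

# canonical-scale correction (7.38): regress N on log(pred), exposures as offset
fit <- glm(N ~ log(pred), offset = log(Exposure),
           family = poisson(), data = dat)
dat$mu.BC <- fitted(fit) / dat$Exposure   # balance-corrected frequencies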

Example 7.19 (Auto-calibration in Networks) We apply this additional auto-calibration
step (7.34) to the FN network with embedding layers that does not
satisfy the balance property, i.e., having an average frequency of 7.24% < 7.36%,
see Tables 7.4 and 7.5. We start by analyzing the auto-calibration property (7.34) of
this network predictor v μ_ϑ̂(x) by studying an empirical version of

    z ↦ v μ̂_BC(x) = E[ vY | v μ_ϑ̂(x) = z ].    (7.39)

This empirical version is obtained from the R library locfit [254], which allows us
to consider a local polynomial regression fit of degree deg=2, and we use a nearest
neighbor fraction of alpha=0.05; the code is provided in Listing 7.8. We use the
exposure v scaled version in (7.39) since the balance property should hold on that
scale, see Corollary 5.7. The claim counts are given by N = vY, and the exposure
v is integrated as an offset into the FN network regression function, see line 20 of
Listing 7.4.

Listing 7.8 Empirical auto-calibration using the R library locfit [254]


1 z <- learn$pred
2 mu.BC <- predict(locfit(learn$N ~ learn$pred, alpha=0.05, deg=2), newdata=z)

Figure 7.12 (lhs) shows the empirical auto-calibration of (7.39) using the R
code of Listing 7.8. If the auto-calibration property held exactly, then the black
dots would lie on the red diagonal line. We observe a very good match, which
indicates that the auto-calibration property holds quite accurately for our network
predictor (v, x) ↦ v μ_ϑ̂(x). For very small expectations E_θ(x)[N] we slightly
underestimate, and for bigger expectations we slightly overestimate. The blue line
shows the empirical density of the predictors v_i μ_ϑ̂(x_i), 1 ≤ i ≤ n, highlighting
heavy-tailedness and that the underestimation in the right tail will not substantially
contribute to the balance property as these are only very few insurance policies.
We explore the Gaussian balance correction (7.35) considering a linear regression
model with weighted square loss function. We receive the estimate β̂ = (9 ·
10^{-4}, 1.005)^⊤, thus μ_ϑ̂(x) only gets very gently distorted, see (7.36). The results of
this balance-corrected version μ̂_BC(x) are given on the line 'Embed FN Gauss balance-corrected'
in Table 7.6. We observe that this approach is rather competitive, leading
to a slightly better model (out-of-sample). Figure 7.12 (rhs) shows the resulting
(empirical) auto-calibration plot, which is still not fully in line with Proposition 7.16;
this empirical plot may be distorted by the exposures, by the fact that it is an
empirical plot fitted with locfit, and by the fact that a linear Gaussian correction
estimate may not be fully suitable.
Fig. 7.12 (lhs) Empirical auto-calibration (7.39), the blue line shows the empirical density of the predictors v_i μ_ϑ̂(x_i), 1 ≤ i ≤ n; (rhs) balance-corrected version using the weighted Gaussian correction (7.35)

Table 7.6 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10^-2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network model (with embedding layers of dimension b = 2), and their bias regularized and balance-corrected counterparts; the local correction uses a GAM with 2.6 degrees of freedom in the cubic spline part

                                        Run time  # param.   In-sample loss on L  Out-of-sample loss on T  Aver. freq.
Poisson null                            –         1          25.213               25.445                   7.36%
Poisson GLM3                            15 s      50         24.084               24.102                   7.36%
Embed FN (q1,q2,q3) = (20,15,10)        120 s     792        23.694               23.820                   7.24%
Embed FN bias regularized               +4 s      792        23.690               23.824                   7.36%
Embed FN Gauss balance-corrected        –         792 + 2    23.692               23.819                   7.36%
Embed FN locally balance-corrected      –         792 + 3.6  23.692               23.818                   7.36%

Denuit et al. [97] propose a local balance correction that is very much in the
spirit of the local polynomial regression fit with locfit. However, when using
locfit we did not pay any attention to the balance property. Therefore, we
proceed slightly differently here. In formula (7.37) we give the network predictors
on the canonical scale. This equips us with the data (Y_i, v_i, ẑ_i^θ)_{1≤i≤n}. To perform
a local balance correction we fit a generalized additive model (GAM) to this data,
using the canonical link, the Poisson deviance loss function, the observations Y_i,
the exposures v_i and the feature information ẑ_i^θ; for GAMs we refer to Hastie–
Tibshirani [181, 182], Wood [384] and Chapter 3 in Wüthrich–Buser [392]; in
particular, we proceed as in Example 3.4 of the latter reference.

The GAM regression fit on the canonical scale is illustrated in Fig. 7.13 (lhs).
We essentially receive a straight line, which says that the auto-calibration property is
already well satisfied by the FN network predictor μ_ϑ̂. In fact, it is not completely
a straight line, but GCV provides an optimal model with 2.6 effective degrees of
freedom in the natural cubic spline part. This local (GAM) balance correction leads
to another small model improvement (out-of-sample), see the last line of Table 7.6;
a code sketch of such a local correction follows below.
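A hedged sketch of such a local correction using the R library mgcv (not necessarily the exact code behind Table 7.6; column names are hypothetical):

library(mgcv)
# cubic regression spline on the canonical scale log(pred), Poisson family
fit <- gam(N ~ s(log(pred), bs = "cr") + offset(log(Exposure)),
           family = poisson(), data = dat)
dat$mu.BC <- fitted(fit) / dat$Exposure   # locally balance-corrected frequencies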
Conclusion The balance property adjustment and the bias regularization are crucial
in ensuring that the predictive model is on the right (price) level. We have pre-
sented three sophisticated methods of balance property adjustments: the additional
GLM step under the canonical link choice (7.33), the model-free global Gaussian
correction (7.35)–(7.36), and the local balance correction using a GAM under the
canonical link choice. In our example, the results of the three different approaches
are rather similar. In the sequel, we use the additional GLM step solution (7.33), the
reason being that under this approach we can rely on one single regression model
that directly predicts the claims. The other two approaches need two steps to get the
predictions, which requires the storage of two models. 


Fig. 7.13 (lhs) GAM fit on the canonical scale having 2.6 effective degrees of freedom (red shows
the estimated confidence bounds); (rhs) balance-corrected version using the local GAM correction

7.4.3 Boosting Regression Models with Network Features

From Table 7.5 we conclude that the FN networks find systematic structure in the
data that is not present in model Poisson GLM3, thus, the feature engineering for
the GLM can be improved. Unfortunately, FN networks neither directly build on
GLMs nor do they highlight the weaknesses of GLMs. In this section we discuss
a proposal presented in Wüthrich–Merz [394] and Schelldorfer–Wüthrich [329]
of combining two regression approaches. We are going to boost a GLM with FN
network features. Typically, boosting is applied within the framework of regression
trees. It goes back to the work of Valiant [362], Kearns–Valiant [209, 210], Schapire
[328], Freund [139] and Freund–Schapire [140]. The idea behind boosting is to
analyze the residuals of a given regression model with a second regression model
to see whether this second regression model can still find systematic effects in the
residuals which have not been discovered by the first one.
We start from the GLM studied in Chap. 5, and we boost this GLM with a FN
network. Assume that both regression models act on the same feature space X ⊂
{1} × R^{q_0}. The GLM provides a regression function, for link function g and GLM
parameter β^GLM ∈ R^{q_0+1},

    x ↦ μ^GLM(x) = g^{-1}⟨β^GLM, x⟩.

Recall that this GLM can be interpreted as a FN network of depth 0, see
Remarks 7.2. Next, we choose a FN network of depth d ≥ 1 with the same link

function g as the GLM

    x ↦ μ^FN(x) = g^{-1}⟨β^FN, z^{(d:1)}(x)⟩,

having a network parameter ϑ = (w_1^{(1)}, ..., w_{q_d}^{(d)}, β^FN) ∈ R^r. In particular, we
have the FN output parameter β^FN ∈ R^{q_d+1}; we refer to Fig. 7.2.

We blend these two regression models by combining their regression functions

    x ↦ μ(x) = g^{-1}( ⟨β^GLM, x⟩ + ⟨β^FN, z^{(d:1)}(x)⟩ ),    (7.40)

with combined parameter Ψ = (β^GLM, ϑ) = (β^GLM, w_1^{(1)}, ..., w_{q_d}^{(d)}, β^FN) ∈ R^{q_0+1+r}.

An example is provided in Fig. 7.14. It shows the FN network using embedding


layers for the categorical variables, see also Fig. 7.9 (rhs), and we add a GLM (in
green color) that directly links the input x to the response variable. In machine
learning this green connection is called a skip connection because it skips the FN
layers.
Remarks 7.20
• Skip connections are a popular tool in network modeling, and they can be applied
to any FN layers, i.e., a skip connection can, for instance, be added to skip the
first FN layer. There are two benefits from skip connections. Firstly, they allow
for more modeling flexibility, in (7.40) we directly combine a linear function

(coming from the GLM) with a non-linear one (coming from the FN network).
This has the flavor of a Taylor expansion to combine terms of different orders.
Secondly, skip connections can also be beneficial for gradient descent fitting
because the inputs have a more direct link to the outputs, and the network only
builds the functional form around the function in the skip connection.

Fig. 7.14 Illustration of the combined regression function (7.40) using a GLM (in a skip connection) and a FN network
• There are numerous variants of (7.40). A straightforward one is to choose a
weight α ∈ (0, 1) and consider the regression function
    x ↦ μ(x) = g^{-1}( α ⟨β^GLM, x⟩ + (1 − α) ⟨β^FN, z^{(d:1)}(x)⟩ ).    (7.41)

The weight α can be interpreted as the credibility assigned to the GLM.


• Regression function (7.40) considers two intercepts $\beta_0^{\rm GLM}$ and $\beta_0^{\rm FN}$. If we do not
consider the credibility version (7.41), one of the two intercepts is redundant.
• This approach also allows us to learn systematic effects across different insurance
portfolios. If we have three insurance portfolios living on the same feature space
and if χ ∈ {1, 2, 3} indicates which insurance portfolio we consider, we can
modify the regression function (7.40) to
$$(x, \chi) \mapsto \mu(x, \chi) = g^{-1}\left(\sum_{j=1}^{3}\left\langle \beta_j^{\rm GLM}, x\right\rangle \mathbb{1}_{\{\chi=j\}} + \left\langle \beta^{\rm FN}, z^{(d:1)}(x, \chi)\right\rangle\right).$$

The indicator $\mathbb{1}_{\{\chi=j\}}$ chooses the GLM that belongs to the corresponding insurance portfolio $\chi \in \{1, 2, 3\}$ with the (individual) GLM parameter $\beta_\chi^{\rm GLM}$.
The FN network term makes them related, i.e., the GLMs of the different
insurance portfolios interact (jointly learn) via the FN network module. This is
the approach used in Gabrielli et al. [149] to improve the chain-ladder reserving
method by learning across different claims reserving triangles.
The regression function (7.40) gives the structural form of the combined regression model, but there is a second important ingredient proposed by Wüthrich–Merz [394]. Namely, the gradient descent algorithm (7.15) for model fitting can be started in an initial network parameter $\Theta^{(0)} \in \mathbb{R}^{q_0+1+r}$ that corresponds to the MLE of the GLM. Denote by $\widehat{\beta}^{\rm GLM}$ the MLE of the GLM part, only. Choose the initial value of the gradient descent algorithm for the fitting of the combined regression model (7.40)

$$\Theta^{(0)} = \left(\widehat{\beta}^{\rm GLM}, w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta^{\rm FN} \equiv 0\right) \in \mathbb{R}^{q_0+1+r}, \qquad (7.42)$$

that is, initially, no signals traverse the FN network part because we set $\beta^{\rm FN} \equiv 0$.

Remarks 7.21
• Using the initialization (7.42), the gradient descent algorithm starts exactly in
the optimal GLM. The algorithm then tries to improve this GLM w.r.t. the given
loss function using the additional FN network features. If the loss substantially
reduces during the gradient descent training, the GLM misses systematic struc-
ture and it can be improved, otherwise the GLM is already good (enough).
• We can declare the MLE $\widehat{\beta}^{\rm GLM}$ to be non-trainable. In that case the original GLM always remains in the combined regression model and it acts as an offset. If we declare the MLE $\widehat{\beta}^{\rm GLM}$ to be non-trainable, we could choose a trainable credibility weight $\alpha \in (0, 1)$, see (7.41), which gradually reduces the influence of the GLM (if necessary).
Implementation of the general combined regression model (7.40) can be a bit cumbersome, see Listing 4 in Gabrielli et al. [149], but things can be substantially simplified by declaring the GLM part in (7.40) as non-trainable, i.e., estimating $\beta^{\rm GLM}$ by $\widehat{\beta}^{\rm GLM}$ in the GLM, and then freezing this parameter. In view of (7.40) this simply means that we add an offset $o_i = \langle \widehat{\beta}^{\rm GLM}, x_i\rangle$ to the FN network that is treated as a prior difference between the different data points, we refer to Sect. 5.2.3.
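This offset construction is straightforward to implement. The following minimal keras sketch is our own illustration (it is not one of the book's listings): the input log_glm is assumed to carry the pre-computed offsets o_i (plus log v_i for Poisson counts with exposures), and the zero initialization of the output layer realizes (7.42).

library(keras)

q0 <- 7   # assumed number of pre-processed continuous/binary feature components
features <- layer_input(shape = c(q0), dtype = "float32", name = "features")
log_glm  <- layer_input(shape = c(1),  dtype = "float32", name = "log_glm")

# FN network part of depth d = 3; the zero-initialized output weights beta^FN
# let the gradient descent algorithm start exactly in the GLM, see (7.42)
net <- features %>%
  layer_dense(units = 20, activation = "tanh") %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 1, activation = "linear",
              kernel_initializer = "zeros", bias_initializer = "zeros")

# skip connection: add the frozen GLM offset on the canonical scale, g^{-1} = exp
response <- list(net, log_glm) %>%
  layer_add() %>%
  layer_activation(activation = "exponential")

model <- keras_model(inputs = list(features, log_glm), outputs = response)
model %>% compile(loss = "poisson", optimizer = "nadam")

Since the offset enters the network as a plain input, the GLM part is frozen by construction.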
Example 7.22 (Combined GLM and FN Network) We revisit the French MTPL
claim frequency GLM of Sect. 5.3.4, and we boost model Poisson GLM3 with FN
network features. For the FN architecture we use the structure depicted in Fig. 7.14,
i.e., a FN network of depth d = 3 having (q1 , q2 , q3 ) = (20, 15, 10) neurons, and
using embedding layers of dimension b = 2 for the categorical feature components.
Moreover, we declare the GLM part to be non-trainable, which allows us to use the GLM as an offset in the FN network, and we apply bias regularization (7.33) to receive the balance property.
The results are presented in Table 7.7. A first observation is that using model Poisson GLM3 as an offset reduces the run time of gradient descent fitting because we start the algorithm already in a reasonable model. Secondly, as expected, the FN features decrease the loss of model Poisson GLM3; this indicates that there are systematic effects that are not captured by the GLM. The final combined and regularized model has roughly the same out-of-sample loss as the corresponding FN network, showing that this approach can be beneficial in run times, and the predictive power is similar to a pure FN network. □

Table 7.7 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10^{-2}) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network model (with embedding layers of dimension b = 2), and the combined regression model GLM3+FN, see (7.40)

                                       Run time   # param.   In-sample loss on L   Out-of-sample loss on T   Aver. freq.
Poisson null                           –          1          25.213                25.445                    7.36%
Poisson GLM3                           15 s       50         24.084                24.102                    7.36%
Embed FN (q1, q2, q3) = (20, 15, 10)   120 s      792        23.694                23.820                    7.24%
Embed FN bias regularized              +4 s       792        23.690                23.824                    7.36%
Combined GLM+FN (20, 15, 10)           +53 s      50 + 792   23.772                23.834                    7.24%
Combined GLM+FN bias regularized       +4 s       50 + 792   23.765                23.830                    7.36%

Example 7.23 (Improving Model Poisson GLM3) In this example we would like to
explore the deficiencies of model Poisson GLM3 by boosting it with FN network
features. We do this in a systematic way by only considering two (continuous) feature components at a time in the FN network. That is, we consider the combined
approach (7.40) with initialization (7.42), but as feature information for the network
part, we only consider two components at a time. For instance, we start with the
features (1, Area, VehPower) ∈ {1} × R2 for the network part, and the remaining
feature information is ignored in this step. This way we can test whether the
marginal modeling of Area and VehPower is suitable in model Poisson GLM3,
and whether a pairwise interaction in these two components is missing. We train
this FN network starting from model Poisson GLM3 (and keeping this GLM part
frozen). The decrease in the out-of-sample loss during the gradient descent training
is shown in Fig. 7.15 (top-left). We observe that the loss remains rather constant over
100 training epochs. This tells us that the pair (Area, VehPower) is appropriately
considered in model Poisson GLM3.
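Schematically, this probing can be organized in a single loop over all feature pairs; the sketch below assumes a hypothetical wrapper boost_glm3() that fits the frozen-GLM network of the previous section on the selected pair only and returns the final out-of-sample deviance loss.

# probe all pairs of continuous feature components for missing interactions;
# boost_glm3() is a hypothetical wrapper around the offset-based fit above
vars  <- c("Area", "VehPower", "VehAge", "DrivAge", "BonusMalus", "Density")
pairs <- t(combn(vars, 2))                       # the 15 pairs of components
oos_loss <- apply(pairs, 1, function(pair)
  boost_glm3(features = pair, epochs = 100))
data.frame(pairs, oos_loss)[order(oos_loss), ]   # big decreases flag weaknesses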
Figure 7.15 gives all pairwise plots of the continuous feature components Area,
VehPower, VehAge, DrivAge, BonusMalus, Density, the scale on the y-
axis is identical in all plots. We observe that only the plots including the variable
BonusMalus provide a bigger decrease in loss (in blue color in the colored
version). This indicates that mainly this feature component is not modeled optimally
in model Poisson GLM3, because boosting with a FN network finds systematic
structure here that improves the loss of model Poisson GLM3. In model Poisson
GLM3, the variable BonusMalus has been modeled log-linearly with an interaction term with DrivAge and (DrivAge)^2, see (5.35). Table 7.8 shows the result
if we add a FN network feature (7.40) for the pair (DrivAge, BonusMalus)
to model Poisson GLM3. Indeed, we see that the resulting combined GLM-FN
network model has the same GL as the full FN network approach. Thus, we
conclude that model Poisson GLM3 performs fairly well and only the modeling
of the pair (DrivAge, BonusMalus) should be improved. 

7.4.4 Network Ensemble Learning

Ensemble learning is a popular way of expressing that one takes an average over
different predictors. There are many established methods that belong to the family of
ensemble learning, e.g., there is boostrap aggregating (called bagging) introduced
by Breiman [51], there are random forests, and there is boosting. Random forests
320 7 Deep Learning

Fig. 7.15 Exploring all pairwise interactions: out-of-sample losses over 100 gradient descent
epochs for all pairs of the continuous feature components Area, VehPower, VehAge,
DrivAge, BonusMalus, Density (the scale on the y-axis is identical in all plots)

and boosting are mainly based on classification and regression trees (CARTs) and
they belong to the most powerful machine learning methods for tabular data. These
methods combine a family of predictors to a more powerful predictor. The present
section is inspired by the bagging method of Breiman [51], and we perform network
aggregating (called nagging).

Stochastic Gradient Descent Fitting of Networks

We have described that network calibration involves several elements of randomness. This in combination with early stopping leads to the non-uniqueness of reasonably good networks for prediction and pricing. We have discussed this based on Fig. 7.5, namely, for a given network architecture we have a continuum of comparably good models (w.r.t. the chosen objective function) that lie in the green
area of Fig. 7.5. One SGD calibration picks one specific model from this green area,
we also refer to Remarks 7.9. Of course, this is very unsatisfactory in insurance
pricing because it implies that the selection of a price for an insurance policy has
a substantial element of subjectivity (that cannot be explained to the customer).
Naturally, we would like to combine models in the green area of Fig. 7.5, for
instance, by performing some sort of integration over the models in the green area.
Intuitively, this should lead to a very powerful predictive model because it diversifies
the weaknesses of each individual model. This is exactly what we discuss in this
section. Before doing so, we would first like to understand the different single
calibrations of a given network architecture.
We consider the MTPL data of Example 7.12. We model this data with a Poisson
FN network using embedding layers for the categorical features and using bias
regularization (7.33) to guarantee the balance property to hold. For the FN network
architecture we choose depth d = 3 with (q1 , q2 , q3 ) = (20, 15, 10) FN neurons;
this setup gives us the results on the last line of Table 7.5. We now repeat this
procedure M = 1 600 times, using exactly the same FN network architecture, the
same early stopping strategy, the same SGD method and the same batch size. We
only change the seeds of the starting point ϑ (0) ∈ Rr of the SGD algorithm, the
partitioning of the learning data L into training data U and validation data V, see
Fig. 7.7, and the partitioning of the training data into the (mini-)batches.
The resulting 1 600 in-sample and out-of-sample deviance losses are presented
in Fig. 7.16. We observe a considerable variation in these figures. The in-sample
losses vary between 23.616 and 23.815 (mean 23.728), and the corresponding out-
of-sample loss between 23.766 and 23.899 (mean 23.819), units are in 10−2 ; note
that all network calibrations are bias regularized. The in-sample loss is an average
over n = 610 206 (individual) unit deviance losses, and the out-of-sample an
average over T = 67 801 unit deviance losses, see also Definition 4.24. Therefore,
we expect an even much bigger variation on individual insurance policies. We are
going to analyze this in more detail in this section.

Fig. 7.16 Boxplots over 1 600 network calibrations only differing in the seeds for the SGD algorithm and the partitioning of the learning data: (lhs) in-sample losses on L and (rhs) out-of-sample losses on T, the horizontal lines show the calibration chosen in Table 7.5; units are in 10^{-2}

Before doing so, we would like to understand whether there is some dependence
between the in-sample and the out-of-sample losses over the M = 1 600 runs of
the SGD algorithm with different seeds. In Fig. 7.17 we provide a scatter plot of
the out-of-sample losses vs. the in-sample losses. This plot is complemented by
a cubic spline regression (in orange color). From this plot we conclude that the
models with very small in-sample losses tend to over-fit, and the models with large
in-sample losses tend to under-fit (always using the same early stopping rule). In
view of these results we conclude that the chosen early stopping rule is sensible
because on average it tends to provide the model with the smallest out-of-sample
loss on T . Recall that we do not use T during the SGD fitting, but only the learning
data L that is split into the training data U and the validation data V for exercising
the early stopping, see Fig. 7.7.

Fig. 7.17 Scatter plot of out-of-sample losses vs. in-sample losses for different seeds, the orange line gives a fitted cubic spline, and the cyan lines show the empirical means; units are in 10^{-2}

Next, we study the estimated prices on the test data (out-of-sample)

$$\mathcal{T} = \left\{\left(Y_t^\dagger = N_t^\dagger/v_t^\dagger,\, x_t^\dagger,\, v_t^\dagger\right) :\; t = 1, \ldots, T = 67\,801\right\}.$$

For each run of the SGD algorithm we receive a different (early stopped) network parameter estimate $\widehat{\vartheta}^{\,m} \in \mathbb{R}^r$, $1 \le m \le M = 1\,600$. Using these parameter estimates we receive the estimated network regression functions, for $1 \le m \le M$,

$$x \mapsto \widehat{\mu}^m(x) = \mu_{\widehat{\vartheta}^{\,m}}(x),$$

using the FN network of Listing 7.4 with network parameter $\widehat{\vartheta}^{\,m}$. Thus, for the out-of-sample policies $1 \le t \le T$ we receive the expected frequencies

$$x_t^\dagger \mapsto \widehat{\mu}_t^m = \widehat{\mu}^m\left(x_t^\dagger\right) = \mu_{\widehat{\vartheta}^{\,m}}\left(x_t^\dagger\right).$$

Since we choose the seeds of the SGD runs at random we may (and will) assume that we have independence between the prices $(\widehat{\mu}_t^m)_{t\in\mathcal{T}}$ of the different runs $1 \le m \le M$ of the SGD algorithm. This allows us to estimate the average price and the coefficient of variation of these prices of a fixed insurance policy $t$ over the different SGD runs

$$\bar{\mu}_t^{(1:M)} = \frac{1}{M}\sum_{m=1}^{M}\widehat{\mu}_t^m \qquad\text{and}\qquad \widehat{\rm Vco}_t = \frac{1}{\bar{\mu}_t^{(1:M)}}\sqrt{\frac{1}{M-1}\sum_{m=1}^{M}\left(\widehat{\mu}_t^m - \bar{\mu}_t^{(1:M)}\right)^2}. \qquad (7.43)$$
These (out-of-sample) coefficients of variation are illustrated in Fig. 7.18. We observe a considerable variation on some policies. The average coefficient of variation is roughly 10% (orange horizontal line, lhs). The maximal coefficient of variation is about 40%, thus, for this policy the individual prices $\widehat{\mu}_t^m$ of the different SGD runs $1 \le m \le M$ fluctuate considerably around $\bar{\mu}_t^{(1:M)}$. This now explains why we choose $M = 1\,600$ SGD runs, namely, the averaging in (7.43) reduces the coefficient of variation on this policy to $40\%/\sqrt{M} = 40\%/40 = 1\%$; note that we have independence between the different SGD runs. Thus, by averaging we receive an acceptable influence of the variation of the individual SGD fittings.
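Given the stored predictions of the M runs, (7.43) reduces to a few matrix operations. This is a sketch; the T × M matrix mu_hat with entries $\widehat{\mu}_t^m$ is assumed to be pre-computed.

# mu_hat: T x M matrix of out-of-sample predictions, one column per SGD run
mu_bar <- rowMeans(mu_hat)                  # average prices, see (7.43)
vco    <- apply(mu_hat, 1, sd) / mu_bar     # coefficients of variation (7.43)
max(vco)                                    # roughly 0.40 on this data
max(vco) / sqrt(ncol(mu_hat))               # residual Vco after averaging: 1%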
Listing 7.9 shows the 10 policies (out-of-sample) with the largest coefficients of variation $\widehat{\rm Vco}_t$. These policies have in common that they belong to the lowest
BonusMalus level, the drivers are very young, the cars are comparably old and
they have a bigger vehicle power. From a practical point of view we should doubt
these policies, since the information provided may not be correct. New drivers (at
the age of 18) typically enter a bonus-malus scheme at level 100, and only after
several accident-free years these drivers can reach a bonus-malus level of 50. Thus,
policies as in Listing 7.9 should not exist, and our pricing framework has difficulties
to (correctly) handle them. In practice, this needs further investigation because,
obviously, there is a data issue, here.
Fig. 7.18 Out-of-sample coefficients of variation $\widehat{\rm Vco}_t$ on an individual policy level $1 \le t \le T$ over the 1 600 calibrations: (lhs) scatter plot against the average estimated frequencies $\bar{\mu}_t^{(1:M)}$ and (rhs) resulting histogram

Listing 7.9 The 10 policies (out-of-sample) with the largest coefficients of variation
1 Area VehPower VehAge DrivAge BonusMalus VehBrand VehGas Region vco
2 D 8 16 18 50 B11 Regular R53 0.4089006
3 D 9 17 20 50 B11 Regular R24 0.3827665
4 C 8 11 18 50 B5 Regular R24 0.3762306
5 C 9 18 18 50 B5 Regular R24 0.3697370
6 C 7 17 18 50 B1 Regular R24 0.3579979
7 C 9 19 19 50 B5 Regular R24 0.3554879
8 C 6 15 20 50 B1 Regular R93 0.3528679
9 C 7 14 19 50 B1 Regular R53 0.3518279
10 A 11 20 50 50 B13 Regular R74 0.3442184
11 D 5 14 18 50 B3 Diesel R24 0.3403783

Nagging Predictor

The previously observed variations of the prices motivate to average over the
different models (network calibrations). This brings us to bagging introduced by
Breiman [51]. Bagging is based on averaging/aggregating over several ‘indepen-
dent’ predictions; this is done in three steps. In a first step, a model is fitted to the
data L. In a second step, independent bootstrap samples L∗(m) are generated from
this fitted model; the independence has to be understood in a conditional sense,
namely, the different bootstrap samples L∗(m) are independent in m, given the data
L. In the third step, for every bootstrap sample L∗(m) one estimates a model μm ,
and averaging (7.43) provides the bagging predictor. Bagging is mainly a variance
reduction technique. Note that if the fitted model of the first step has a bias, then
likely the bootstrap samples L∗(m) are biased, and so is the bagging predictor.
Therefore, bagging does not help to reduce a potential bias. All these results have to

be understood conditionally on the data L. If this data is atypical for the problem,
so will the bootstrap samples be.
We can perform a similar analysis for the fitted networks, but we do not need to
bootstrap, here, because the various elements of randomness in SGD fitting allow us
to generate independent predictors μm , conditional on the data L. Averaging (7.43)
over these predictors then provides us with the network aggregating (nagging)
predictor μ̄(1:M) ; we also refer to Dietterich [105] and Richman–Wüthrich [315]
for this aggregation. Thus, we replace the bootstrap step by the different runs of
the SGD algorithm. Both options provide independent predictors μm , conditional
on the data L. However, there is a fundamental difference between bagging and
nagging. Bagging generates new (bootstrap) samples L∗(m) and, thus, bagging also
involves randomness coming from sampling the new observations. Nagging always
acts on the same sample L, and it only refits the model multiple times. Therefore,
the latter will typically introduce less variation. Of course, bagging and nagging can
be combined, and then the full expected GL can be estimated, we come back to this
in Sect. 11.4, below. We do not sample new observations, here, because we would
like to understand the variations implied by the SGD algorithm with early stopping
on the given (fixed) data.
In Fig. 7.18 we have seen that we need nagging over 1 600 network calibrations
so that the maximal coefficient of variation on an individual policy level is below
1% in our MTPL example. In this section we would like to understand the minimal
out-of-sample loss that can be achieved by nagging on the (entire) test data set, and
we would like to analyze its rate of convergence.

For this we define the sequence of nagging predictors

$$\bar{\mu}^{(1:M)}(x) = \frac{1}{M}\sum_{m=1}^{M}\widehat{\mu}^m(x) \qquad\text{for } M \ge 1. \qquad (7.44)$$

This allows us to study the out-of-sample losses on $\mathcal{T}$ in the Poisson model for $M \ge 1$

$$D\left(\mathcal{T}, \bar{\mu}^{(1:M)}\right) = \frac{2}{T}\sum_{t=1}^{T} v_t^\dagger\left(\bar{\mu}^{(1:M)}(x_t^\dagger) - Y_t^\dagger - Y_t^\dagger \log\frac{\bar{\mu}^{(1:M)}(x_t^\dagger)}{Y_t^\dagger}\right).$$
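The trajectory of these losses in M (see Fig. 7.19, below) follows by cumulatively averaging the columns of the prediction matrix. This is a sketch; mu_hat is as above, and Y and v denote the out-of-sample frequencies and exposures.

# out-of-sample Poisson deviance loss of the running nagging predictor (7.44)
poisson_dev <- function(mu, y, v) {
  2 * mean(v * (mu - y - ifelse(y > 0, y * log(mu / y), 0)))
}
losses <- sapply(1:40, function(M)
  poisson_dev(rowMeans(mu_hat[, 1:M, drop = FALSE]), Y, v))
plot(1:40, losses, type = "b", xlab = "index M", ylab = "out-of-sample losses")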

Remark 7.24 From Remarks 7.17 we know that the expected deviance GL of
the estimated model is lower bounded by the expected deviance GL of the true
data generating model; the difference is the conditional calibration. Within the
family of Tweedie’s CP models Richman–Wüthrich [315] proved that, indeed,
aggregating decreases monotonically the expected deviance GL of the estimated
model (Proposition 2 of [315]), convergence is established (Proposition 3 of [315]),

and the speed of convergence is provided using asymptotic normality (Proposition 4 of [315]). For the Gaussian square loss results we refer to Breiman [51] and
Bühlmann–Yu [60].
We revisit Proposition 2 of Richman–Wüthrich [315] which has also been proved
in Proposition 3.1 of Denuit–Trufin [103]. We only consider a single case in the next
proposition and we drop the feature information x (because we can condition on
X = x).
Proposition 7.25 Choose a response $Y \sim f(\cdot; \theta, v/\varphi)$ belonging to Tweedie's CP model having a power variance cumulant function $\kappa = \kappa_p$ with power variance parameter $p \in [1, 2]$, see (2.17). Assume $\widehat{\mu}$ is an estimator for the mean parameter $\mu = \kappa_p'(\theta) > 0$ satisfying $\epsilon < \widehat{\mu} \le \frac{p}{p-1}\mu$, a.s., for some $\epsilon \in (0, \frac{p}{p-1}\mu)$. Choose i.i.d. copies $\widehat{\mu}^m$, $m \ge 1$, of $\widehat{\mu}$ being all independent of $Y$. We have for all $M \ge 1$

$$E_\theta\left[d\left(Y, \widehat{\mu}^1\right)\right] \;\ge\; E_\theta\left[d\left(Y, \bar{\mu}^{(1:M)}\right)\right] \;\ge\; E_\theta\left[d\left(Y, \bar{\mu}^{(1:M+1)}\right)\right] \;\ge\; E_\theta\left[d(Y, \mu)\right].$$

Proof of Proposition 7.25 The lower bound on the right-hand side immediately follows from Theorem 4.19. For an estimate $\widehat{\mu} > 0$ we define the function, we also refer to (4.18), and we set for the canonical link $h_p = (\kappa_p')^{-1}$,

$$\widehat{\mu} \;\mapsto\; \psi_p(\widehat{\mu}) = \mu\, h_p(\widehat{\mu}) - \kappa_p\left(h_p(\widehat{\mu})\right) = \begin{cases} \mu\log(\widehat{\mu}) - \widehat{\mu} & \text{for } p = 1,\\[4pt] \mu\,\dfrac{\widehat{\mu}^{\,1-p}}{1-p} - \dfrac{\widehat{\mu}^{\,2-p}}{2-p} & \text{for } p \in (1, 2),\\[4pt] -\mu/\widehat{\mu} - \log(\widehat{\mu}) & \text{for } p = 2. \end{cases}$$

This is the part of the log-likelihood (and deviance loss) that depends on the canonical parameter $\theta = h_p(\widehat{\mu})$, replacing the observation $Y$ by $\mu$. Calculating the second derivative w.r.t. $\widehat{\mu}$ provides for $p \in [1, 2]$

$$\frac{\partial^2}{\partial \widehat{\mu}^2}\,\psi_p(\widehat{\mu}) = -p\mu\,\widehat{\mu}^{\,-p-1} - (1-p)\,\widehat{\mu}^{\,-p} = \widehat{\mu}^{\,-(1+p)}\left[-p\mu - (1-p)\widehat{\mu}\right] \;\le\; 0,$$

the last inequality uses that the square bracket is non-positive, a.s., under our assumptions on $\widehat{\mu}$. Thus, $\psi_p$ is concave on the interval $(0, \frac{p}{p-1}\mu)$. We now focus on the inequalities for $M \ge 1$. Consider the decomposition of the nagging predictor for $M + 1$

$$\bar{\mu}^{(1:M+1)} = \frac{1}{M+1}\sum_{j=1}^{M+1}\bar{\mu}^{(-j)}, \qquad\text{where } \bar{\mu}^{(-j)} = \frac{1}{M}\sum_{m=1}^{M+1}\widehat{\mu}^m\, \mathbb{1}_{\{m\neq j\}}.$$
The predictors $\bar{\mu}^{(-j)}$, $j \ge 1$, are copies of $\bar{\mu}^{(1:M)}$, though not independent ones. Using the function $\psi_p$, the second term on the right-hand side has the same structure as the estimation risk function (4.14),

$$\begin{aligned}
E_\theta\left[d(Y, \bar{\mu}^{(1:M)})\right] &= E_\theta\left[d(Y, \bar{\mu}^{(1:M+1)})\right] + 2\, E_\theta\left[Y h_p\left(\bar{\mu}^{(1:M+1)}\right) - \kappa_p\left(h_p\left(\bar{\mu}^{(1:M+1)}\right)\right)\right]\\
&\quad - 2\, E_\theta\left[Y h_p\left(\bar{\mu}^{(1:M)}\right) - \kappa_p\left(h_p\left(\bar{\mu}^{(1:M)}\right)\right)\right]\\
&= E_\theta\left[d(Y, \bar{\mu}^{(1:M+1)})\right] + 2\left(E\left[\psi_p\left(\bar{\mu}^{(1:M+1)}\right)\right] - E\left[\psi_p\left(\bar{\mu}^{(1:M)}\right)\right]\right)\\
&= E_\theta\left[d(Y, \bar{\mu}^{(1:M+1)})\right] + 2\left(E\left[\psi_p\left(\frac{1}{M+1}\sum_{j=1}^{M+1}\bar{\mu}^{(-j)}\right)\right] - E\left[\psi_p\left(\bar{\mu}^{(1:M)}\right)\right]\right)\\
&\ge E_\theta\left[d(Y, \bar{\mu}^{(1:M+1)})\right] + 2\left(E\left[\frac{1}{M+1}\sum_{j=1}^{M+1}\psi_p\left(\bar{\mu}^{(-j)}\right)\right] - E\left[\psi_p\left(\bar{\mu}^{(1:M)}\right)\right]\right)\\
&= E_\theta\left[d(Y, \bar{\mu}^{(1:M+1)})\right],
\end{aligned}$$

the second last step applies Jensen's inequality to the concave function $\psi_p$, and the last step follows from the fact that $\bar{\mu}^{(-j)}$, $j \ge 1$, are copies of $\bar{\mu}^{(1:M)}$. □

Remarks 7.26
• Proposition 7.25 says that aggregation works, i.e., aggregating i.i.d. predictors leads to monotonically decreasing expected deviance GLs. In fact, if $\widehat{\mu} \le 2\mu$, a.s., we receive Tweedie's forecast dominance by aggregating, restricted to the power variance parameters $p \in [1, 2]$, see Definition 4.22.
• The i.i.d. assumption can be relaxed, indeed, it is sufficient that every $\bar{\mu}^{(-j)}$ in the above proof has the same distribution as $\bar{\mu}^{(1:M)}$. This does not require independence between the predictors $\widehat{\mu}^m$, $m \ge 1$, but exchangeability is sufficient.
• We need the condition $\epsilon < \widehat{\mu} \le \frac{p}{p-1}\mu$, a.s., to ensure the monotonicity within Tweedie's CP models. For the Poisson model $p = 1$ we can drop the upper bound, and we only need the lower bound to ensure the existence of the expected deviance GL. For $p \in (1, 2]$ the upper bound is increasingly binding, in the gamma case $p = 2$ requiring $\widehat{\mu} \le 2\mu$, a.s.
• Note that we do not require unbiasedness of $\widehat{\mu}$ for $\mu$ in Proposition 7.25. Thus, at this stage, aggregating is a variance reduction technique.
Fig. 7.19 Out-of-sample losses $D(\mathcal{T}, \bar{\mu}^{(1:M)})$ of the nagging predictors $(\bar{\mu}^{(1:M)}(x_t^\dagger))_{1\le t\le T}$ for $1 \le M \le 40$; losses are in $10^{-2}$

• If additionally we have unbiasedness of $\widehat{\mu}$ for $\mu$ and a uniformly integrable upper bound on $\bar{\mu}^{(1:M)}$, we can use Lebesgue's dominated convergence theorem and the law of large numbers to prove

$$\lim_{M\to\infty} E_\theta\left[d\left(Y, \bar{\mu}^{(1:M)}\right)\right] = E_\theta\left[\lim_{M\to\infty} d\left(Y, \bar{\mu}^{(1:M)}\right)\right] = E_\theta\left[d(Y, \mu)\right]. \qquad (7.45)$$

The uniformly integrable upper bound is only needed in the Poisson case $p = 1$, because the other cases are covered by $\epsilon < \widehat{\mu} \le \frac{p}{p-1}\mu$, a.s. Moreover, asymptotic normality can be established, we refer to Proposition 4 in Richman–Wüthrich [315].
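A quick Monte Carlo illustration of Proposition 7.25 in the Poisson case p = 1 may be obtained as follows; this is our own sketch, and the true mean and the noise law of the estimators are chosen arbitrarily.

set.seed(100)
mu <- 0.1                     # true Poisson mean (an arbitrary choice)
n  <- 1e5                     # Monte Carlo sample size
Y  <- rpois(n, mu)
d_pois <- function(y, m) 2 * (m - y + ifelse(y > 0, y * log(y / m), 0))
for (M in c(1, 2, 5, 10, 50)) {           # expected deviance decreases in M
  mu_hat <- matrix(mu * exp(rnorm(n * M, 0, 0.3)), n, M)  # noisy estimators,
  cat("M =", M, ":", mean(d_pois(Y, rowMeans(mu_hat))), "\n")  # indep. of Y
}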
We come back to our MTPL Poisson claim frequency example and its 1 600
network calibrations illustrated in Fig. 7.17. Figure 7.19 provides the out-of-sample
portfolio losses D(T , μ̄(1:M) ) of the resulting nagging predictors (μ̄(1:M) (x †t ))1≤t ≤T
for 1 ≤ M ≤ 40 in red color, and the corresponding 1 standard deviation confidence
bounds in orange color. The blue horizontal dotted line shows the case M = 1
which exactly refers to the (first) bias regularized FN network μm=1 with embedding
layers given in Table 7.5. Indeed, averaging over multiple networks improves the
predictive model and the out-of-sample loss decreases over the first 2 ≤ M ≤ 10
nagging steps. After the first 10 steps the picture starts to stabilize which indicates
that for this size of portfolio (and this type of problem) we need to average over
roughly 10–20 FN networks to receive optimal predictive models on the portfolio
level. For M → ∞ the out-of-sample loss converges to the green horizontal dotted
line in Fig. 7.19 of 23.783 · 10−2 . These numbers are also reported on the last line
of Table 7.9.
Figure 7.20 provides the empirical auto-calibration property (7.39) of the
nagging predictor μ̄(1:1600); this is obtained completely analogously to Fig. 7.12.
Table 7.9 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10^{-2}) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network models (with embedding layers of dimension b = 2), and the nagging predictor for M = 1 600

                                      Run time   # param.   In-sample loss on L   Out-of-sample loss on T   Aver. freq.
Poisson null                          –          1          25.213                25.445                    7.36%
Poisson GLM3                          15 s       50         24.084                24.102                    7.36%
Embed FN bias regularized μm=1        +4 s       792        23.690                23.824                    7.36%
Average over 1 600 SGDs (Fig. 7.16)   –          792        23.728                23.819                    7.36%
Nagging FN μ̄(1:M), M = 1 600          ∞          ‘792’      23.691                23.783                    7.36%

Fig. 7.20 Empirical auto-calibration (7.39) of the Poisson nagging predictor, the blue line shows the empirical density of $v_i\,\bar{\mu}^{(1:1600)}(x_i)$, $1 \le i \le n$

The nagging predictors are (already) bias regularized, and Fig. 7.20 supports that
the auto-calibration property holds rather accurately.
At this stage, we have fully arrived at Breiman’s [53] two modeling cultures
dilemma, see also Sect. 1.1. We have started from a parametric data model, and
in order to boost its predictive performance we have combined such models in
an algorithmic way. Working with many blended networks is not really practical,
therefore, in such situations, a meta model can be fitted to the resulting nagging
predictor.

Meta Model

Since working with M = 1 600 different FN networks is not practical, we fit a meta model to the nagging predictors $\bar{\mu}^{(1:M)}(\cdot)$. This can easily be done by selecting an additional FN network and fitting this additional network to the working data

$$\mathcal{D}^* = \left\{\left(\bar{\mu}^{(1:M)}(x_i), x_i, v_i\right) : i = 1, \ldots, n\right\} \cup \left\{\left(\bar{\mu}^{(1:M)}(x_t^\dagger), x_t^\dagger, v_t^\dagger\right) : t = 1, \ldots, T\right\}.$$

Table 7.10 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10^{-2}) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network model (with embedding layers of dimension b = 2), the nagging predictor, and the meta network model

                                    Run time   # param.   In-sample loss on L   Out-of-sample loss on T   Aver. freq.
Poisson null                        –          1          25.213                25.445                    7.36%
Poisson GLM3                        15 s       50         24.084                24.102                    7.36%
Embed FN bias regularized μm=1      +4 s       792        23.690                23.824                    7.36%
Nagging FN μ̄(1:M)                   ∞          ‘792’      23.691                23.783                    7.36%
Meta FN network μmeta               –          792        23.714                23.777                    7.36%

For this calibration step we can consider all data, since we would like to fit a
regression model as accurately as possible to the entire regression surface formed by
all nagging predictors from the learning and the test data sets L and T . Moreover,
this step should not over-fit since this regression surface of nagging predictors
does not include any noise, but it is on the level of expected values. As network
architecture we choose again the same FN network of depth d = 3. The only
change to the fitting procedure above is replacing the Poisson deviance loss by the
square loss function, since we do not work with the Poisson responses Ni but rather
with their mean estimates μ̄(1:M) (x i ) and μ̄(1:M) (x †t ) in this fitting step. Since the
resulting meta network model may still have a bias we apply the bias regularization
step of Listing 7.7 to the Poisson observations with the Poisson deviance loss on the
learning data L (only). The results are presented in Table 7.10.
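In keras terms, the meta fit only swaps the response and the loss; a minimal sketch (with our own variable names model_meta, X_all and mu_bar_all for the meta network and the nagging predictions on L ∪ T) reads:

# fit the meta network to the (noise-free) nagging surface with the square loss
model_meta %>% compile(loss = "mse", optimizer = "nadam")
model_meta %>% fit(x = X_all, y = mu_bar_all,
                   epochs = 500, batch_size = 5000, verbose = 0)
# afterwards: bias regularization on the learning data L with the Poisson loss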
From these results we observe that in our case the meta network performs
similarly well to the nagging predictor, and it seems to be a very reasonable choice.
Finally, in Fig. 7.21 (lhs) we analyze the resulting frequencies on an individual
policy level on the test data set T . We plot the estimated frequencies μm=1 (x †t ) of
the first FN network (this corresponds to ‘embed FN bias regularized’ in Table 7.10
with an out-of-sample loss of 23.824) against the nagging predictor μ̄(1:M) (x †t )
which averages over M = 1 600 networks. From Fig. 7.21 (lhs) we conclude
that there are quite some differences between these two predictors, this exactly
reflects the variations obtained in Fig. 7.18 (lhs). The nagging predictor removes this
variation by averaging. Figure 7.21 (rhs) compares the nagging predictor μ̄(1:M) (x †t )
to the one of the meta model μmeta (x †t ). This scatter plot shows that the predictors
lie almost perfectly on the diagonal line which suggests that the meta model can be
used as a substitute for the nagging predictor. This completes this claim frequency
modeling example.
Remark 7.27 The meta model concept can also be useful in other situations. For
instance, we can fit a gradient boosting regression model to the observations.
Typically, this is much faster than calculating a nagging predictor (because it directly
focuses on the weaknesses of the existing model). If the gradient boosting model
is based on regression trees, it has the disadvantage that the resulting regression
function is not continuous, and a non-constant extrapolation might be an issue. In a second step we can fit a meta FN network model to the former regression model, lifting the boosting model to a smooth network that allows for a non-constant extrapolation.

Fig. 7.21 Scatter plot of the out-of-sample predictions $\widehat{\mu}^{m=1}(x_t^\dagger)$, $\bar{\mu}^{(1:M)}(x_t^\dagger)$ and $\widehat{\mu}_{\rm meta}(x_t^\dagger)$ over all policies $1 \le t \le T$ on the test data set $\mathcal{T}$: (lhs) $\widehat{\mu}^{m=1}(x_t^\dagger)$ vs. $\bar{\mu}^{(1:M)}(x_t^\dagger)$ and (rhs) $\widehat{\mu}_{\rm meta}(x_t^\dagger)$ vs. $\bar{\mu}^{(1:M)}(x_t^\dagger)$; the color scale shows the exposures $v_t^\dagger \in (0, 1]$

Example 7.28 (Gamma Claim Size Modeling) We revisit the gamma claim size
example of Sect. 5.3.7. The data comprises Swedish motorcycle claim amounts. We
have seen that this claim size data is not heavy-tailed, thus, a gamma distribution
may be a reasonable choice for this data. For the modeling of this data we use the
same normalization is in (5.45), this parametrization does not require the explicit
knowledge of the (constant) shape parameter of the gamma distribution for mean
estimation.
The difficulty with this data is that only 656 insurance policies suffer a claim,
and likely a single FN network will not lead to stable results in this example.
As FN network architecture we again choose a network of depth d = 3 and
with (q1 , q2 , q3 ) = (20, 15, 10) neurons. Since the input layer has dimension
q0 = 1 + 6 = 7 we receive a network parameter of dimension r = 626. As loss
function we choose the gamma deviance loss, see Table 4.1. Moreover, we choose
the nadam optimizer, a batch size of 300, a training-validation split of 8:2, and we
retrieve the network calibration with the lowest validation loss with a callback.
Figure 7.22 shows the results of 1 000 different SGD runs (only differing in the
initial seeds and the splits of the training-validation sets as well as the batches).
We see a considerable variation between the different SGD runs, both in in-sample
deviance losses but also in the average estimated claims. Note that we did not bias-
regularize the resulting networks (we work with the log-link here which is not the
canonical one). This is why we receive fluctuating portfolio averages in Fig. 7.22
(rhs), the red line illustrates the empirical mean. Obviously, these FN networks are (on average) positively biased, and they will need a bias correction for the final prediction.

Fig. 7.22 Boxplots over 1 000 network calibrations only differing in the seeds for the SGD algorithm and the partitioning of the learning-validation data: (lhs) in-sample losses on the (entire) data L and (rhs) average estimated claims

Fig. 7.23 Coefficients of variation $\widehat{\rm Vco}_i$ on an individual claim level $1 \le i \le n$ over the 1 000 calibrations: (lhs) scatter plot against the nagging predictor $\bar{\mu}^{(1:M)}(x_i)$ and (rhs) histogram
Figure 7.23 analyzes the variations on an individual claim level by studying
the in-sample version of the coefficient of variation given in (7.43). We see that
these coefficients of variation are bigger than in the claim frequency example, see
Fig. 7.18. Thus, to receive stable results the nagging predictors μ̄(1:M) (x i ) have to be
calculated over many networks. Figure 7.24 confirms that aggregating reduces (in-
sample) losses also in this case. From this figure we also see that the convergence is
slower compared to the MTPL frequency example of Fig. 7.19, of course, because
we have a much smaller claims portfolio.
Fig. 7.24 In-sample losses $D(\mathcal{L}, \bar{\mu}^{(1:M)})$ of the nagging predictors $(\bar{\mu}^{(1:M)}(x_i))_{1\le i\le n}$ for $1 \le M \le 40$ on the motorcycle claim size data

Table 7.11 Number of parameters, Pearson's dispersion estimate, MLE dispersion estimate, in-sample losses and in-sample average claim amounts of the null model (gamma intercept model), the gamma GLMs and the network nagging predictor; for the GLMs we refer to Table 5.13

                                       # param.   Dispersion ϕ^P   Dispersion ϕ^MLE   In-sample loss on L   Average amount
Gamma null                             1 + 1      2.057            1.690              2.085                 24’641
Gamma GLM1                             9 + 1      1.537            1.426              1.717                 25’105
Gamma GLM2                             7 + 1      1.544            1.427              1.719                 25’130
Gamma FN network nagging               626 + 1    –                –                  1.478                 26’387
Gamma FN network nagging (bias reg.)   626 + 1    1.050            1.240              1.465                 24’641

Table 7.11 presents the results if we take the nagging predictor over 1 000
different networks. The first observation is that we receive a much smaller in-sample
loss compared to the GLMs, thus, there seems to be much room for improvements in
the GLMs. Secondly, the nagging predictor has a substantial bias. For this reason we
shift the intercept parameter in the output layer so that the portfolio average of the
nagging predictor is equal to the empirical mean, see the last column of Table 7.11.
A main difficulty in this model is the estimation of the dispersion parameter
ϕ > 0 and the shape parameter α = 1/ϕ of the gamma distribution, respectively.
Pearson’s dispersion estimate does not work because we do not know the degrees
of freedom of the nagging predictor, see also (5.49). In Table 7.11 we calculate
Pearson’s dispersion estimate by simply dividing by the number of observations;
this should be understood as a lower bound; this number is highlighted in italic.
Alternatively, we can calculate the MLE, however, this may be rather different from
Pearson’s estimate, as indicated in Table 7.11. Figure 7.25 (lhs) shows the resulting
QQ plot of the nagging predictor if we use the MLE $\widehat{\varphi}^{\rm MLE} = 1.240$, and the right-hand side shows the same plot for $\widehat{\varphi} = 1.050$. From these plots it seems that we
should rather go for a smaller dispersion parameter, the MLE being probably too
much dominated by the small claims. This observation should also be understood as
a red flag, as it tells us that the chosen gamma model is not fully suitable. This may be for various reasons: (1) the dispersion is not constant and should be modeled policy dependent, (2) the features are not sufficient to explain the observations, or (3) the gamma distribution is not suitable and should be replaced by another distribution.

Fig. 7.25 QQ plots of the nagging predictors against the gamma density with (lhs) $\widehat{\varphi}^{\rm MLE} = 1.240$ and (rhs) $\widehat{\varphi} = 1.050$

Fig. 7.26 (lhs) Scatter plot of model Gamma GLM2 predictors against the nagging predictors $\bar{\mu}^{(1:M)}(x_i)$ over all instances $1 \le i \le n$, (rhs) scatter plot of two (independent) nagging predictors
In Fig. 7.26 (lhs) we compare the predictions received from model Gamma
GLM2 against the nagging predictors μ̄(1:M) (x i ) over all instances 1 ≤ i ≤ n.
The scatter plot spreads quite wildly around the diagonal which seriously questions
at least one of the two models. To ensure that this variability between the two models
is not caused by the (complex) FN network architecture, we verify the nagging
predictor $\bar{\mu}^{(1:M)}$, M = 1 000, by computing a second independent one. Indeed, Fig. 7.26 shows that these two independent nagging predictors come to the same conclusion on the individual instance level. Thus, the network finds/uses systematic effects that are not present in model Gamma GLM2. If we perform a pairwise interaction analysis for boosting the GLM as in Example 7.23, we find that we should add interactions to the GLM between (VehAge, RiskClass), (VehAge, BonusClass), (OwnerAge, Area), and (OwnerAge, VehAge); recall that model Gamma GLM2 includes neither BonusClass nor Gender, as supported by a drop1 backward elimination analysis from model Gamma GLM1. However, it turns out, here, that we should have BonusClass in the model by letting it interact with VehAge.

Fig. 7.27 Empirical auto-calibration (7.39) of the Gamma FN network nagging predictor of Table 7.11, the blue line shows the empirical density of $\bar{\mu}^{(1:M)}(x_i)$, $1 \le i \le n$
Finally, Fig. 7.27 shows the empirical auto-calibration behavior (7.39) of the
Gamma FN network nagging predictor of Table 7.11. The resulting black dots are
rather volatile which shows that we do not (fully) have the auto-calibration property,
here, but it also expresses that we fit a model on only 656 claims. The prediction
of these claims is highlighted by the blue empirical density given by μ̄(1:M) (x i ),
1 ≤ i ≤ n. On the positive side, the auto-calibration plot shows that we neither
systematically under- nor over-estimate because the black dots fluctuate around the
diagonal red line, only the upper tail seems to under-estimate the true claim size. 

Ensembling over Selected Networks vs. All Networks

Zhou et al. [406] ask the question whether ensembling over ‘selected’ networks is
better than ensembling over all networks. In their proposal they introduce a weighted
averaging scheme over the different network predictors μm , 1 ≤ m ≤ M. We
perform a slightly different analysis here. We are re-using the M = 1 600 SGD
calibrations of the Poisson FN network illustrated in Fig. 7.17. We order these SGD
calibrations w.r.t. their in-sample losses D(L,
μm ), 1 ≤ m ≤ M, and partition this
ordered sample into three equally sized sets: the first one containing the smallest
in-sample losses, the second one the middle sized in-sample losses, and the third one the largest in-sample losses. Figure 7.28 shows the empirical density of these in-sample losses, and the vertical lines give the partition into the three sets; we call the resulting (disjoint) index sets $\mathcal{I}^{\rm small}, \mathcal{I}^{\rm middle}, \mathcal{I}^{\rm large} \subset \{1, \ldots, M\}$. Remark that this partition is done fully in-sample, based on the learning data $\mathcal{L}$, only.

Fig. 7.28 Empirical density of the in-sample losses $D(\mathcal{L}, \widehat{\mu}^m)$, $1 \le m \le M$, of Fig. 7.17
We then consider the nagging predictors on each of these index sets separately, i.e.,

$$\bar{\mu}^{\rm small}(x) = \frac{1}{|\mathcal{I}^{\rm small}|}\sum_{m\in \mathcal{I}^{\rm small}}\widehat{\mu}^m(x),\qquad
\bar{\mu}^{\rm middle}(x) = \frac{1}{|\mathcal{I}^{\rm middle}|}\sum_{m\in \mathcal{I}^{\rm middle}}\widehat{\mu}^m(x),\qquad
\bar{\mu}^{\rm large}(x) = \frac{1}{|\mathcal{I}^{\rm large}|}\sum_{m\in \mathcal{I}^{\rm large}}\widehat{\mu}^m(x). \qquad (7.46)$$

If we believe in the orange cubic spline in Fig. 7.17, the middle nagging predictor $\bar{\mu}^{\rm middle}$ should out-perform the other two nagging predictors. Indeed, this is the case, here. We receive the out-of-sample losses (in $10^{-2}$) on the three subsets

$$D(\mathcal{T}, \bar{\mu}^{\rm small}) = 23.784, \qquad D(\mathcal{T}, \bar{\mu}^{\rm middle}) = 23.272, \qquad D(\mathcal{T}, \bar{\mu}^{\rm large}) = 23.782. \qquad (7.47)$$
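The partition and the per-set predictors (7.46) take only a few lines; a sketch, where loss_in denotes the vector of the M in-sample losses and mu_hat the prediction matrix as above:

M     <- length(loss_in)
ranks <- rank(loss_in, ties.method = "first")      # order by in-sample loss
idx_middle <- which(ranks > M / 3 & ranks <= 2 * M / 3)       # I^middle
mu_bar_middle <- rowMeans(mu_hat[, idx_middle, drop = FALSE]) # (7.46)
# idx_small and idx_large are obtained analogously from the other two thirds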

This approach outperforms by far any other approach considered, see Table 7.10; note that
this analysis relies on a fully proper in-sample and out-of-sample testing strategy.
Moreover, this also supports our early stopping strategy because, obviously, the
optimal networks are centered around our early stopping rule. How does this result
match Proposition 7.25, saying that the nagging predictor has a monotonically decreasing deviance loss? For the convergence (7.45) we need unbiasedness, and (7.47) indicates that averaging over all M network calibrations results in biases on an individual policy level; on the aggregate portfolio level, we have applied the bias regularization step (7.33), but this does not act on an individual policy level. The latter would require a local balance correction similar to the GAM approach presented in Example 7.19.

Fig. 7.29 Scatter plot of the nagging predictors $\bar{\mu}^{\rm middle}(x_t^\dagger)$ and $\bar{\mu}^{(1:M)}(x_t^\dagger)$ over all out-of-sample policies $1 \le t \le T$; the color scale shows the sizes of the exposures $v_t^\dagger \in (0, 1]$
Figure 7.29 is truly striking! It compares the nagging predictors μ̄(1:M) (x †t )
to the ones μ̄middle (x †t ) only using the calibrations m ∈ I middle , i.e., only using
the calibrations with middle sized in-sample losses. The different colors show the
exposures vt† ∈ (0, 1]. We observe that only portfolios with short exposures do not
lie on the diagonal line. Thus, there seems to be an issue with insurance policies
with short exposures. Recall that we model the Poisson claim counts Ni using the
assumption, see (5.27),

$$N_i \sim {\rm Poi}\left(v_i\,\mu(x_i)\right). \qquad (7.48)$$

That is, the expected claim count Eθi [Ni ] = vi μ(x i ) is assumed to scale
proportionally in the exposure vi > 0. Figure 7.29 raises some doubts whether this
is really the case, or at least SGD fitting has some difficulties to assess the expected
frequencies μ(x i ) on the policies i with short exposures vi > 0. We discuss this
further in the next subsection. Table 7.12 gives a summary of our results.

Analysis of Over-dispersion

Table 7.12 Number of parameters, in-sample and out-of-sample deviance losses (units are in 10^{-2}), in-sample average frequency and (over-)dispersion of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network model (with embedding layers of dimension b = 2), the nagging predictor, the meta network model, and the middle nagging predictor

                                 # param.   In-sample loss on L   Out-of-sample loss on T   Aver. freq.   Disp. ϕ^P
Poisson null                     1          25.213                25.445                    7.36%         1.7160
Poisson GLM3                     50         24.084                24.102                    7.36%         1.6644
Embed FN bias regularized μm=1   792        23.690                23.824                    7.36%         1.6812
Nagging FN μ̄(1:M)                ‘792’      23.691                23.783                    7.36%         1.6592
Meta FN network μmeta            792        23.714                23.777                    7.36%         1.6737
Middle nagging FN μ̄middle        ‘792’      23.698                23.272                    7.36%         1.6618

With all the excitement of Fig. 7.29, the above models do not fit the observations since the over-dispersion is too large, see the last column of Table 7.12. This has motivated the study of the negative binomial model in Sect. 5.3.5, the ZIP model in Sect. 5.3.6, and the hurdle Poisson model in Example 6.19. These models have led to an improvement in terms of AIC, see Table 6.6. We could go down the same route here by substituting the Poisson model. We refrain from doing so, as we
want to further analyze the Poisson model. Suppose we calculate an AIC value for
the Poisson FN network using 792 as the number of parameters involved. In that
case, we receive a value of 191 790, thus, clearly lower than the one of the negative
binomial GLM, and also slightly lower than the one of the hurdle Poisson model,
see Table 6.6. Remark that AIC values within FN networks are not supported by
any theory as we neither use the MLE nor do we have a reasonable evaluation of the
number of parameters involved in networks. Thus, such a value may serve at best as
a rough rule of thumb.
This lower AIC value suggests that we should try to improve the modeling of
the systematic effects by better regression functions. In particular, there may be
more explanatory variables involved that have predictive power. If these explanatory
variables are latent, we can rely on the negative binomial model, as it can be
interpreted as a mixture model averaging over latent variables. In view of Fig. 7.29,
the exposures vi seem to have a predictive power different from proportional scaling,
see (7.48); we also mention some peculiarities of the exposures on page 556. This
motivates to change the FN network regression model such that the exposures are
considered non-proportionally. We choose a FN network that directly models the
mean of the claim counts
$$(x, v) \in \mathcal{X} \times (0, 1] \;\mapsto\; \mu(x, v) = \exp\left\langle \beta, z^{(d:1)}(x, v)\right\rangle > 0, \qquad (7.49)$$

modeling the mean Eϑ [N] = μ(x, v) of the Poisson datum (N, x, v). The expected
frequency is then given by Eϑ [Y ] = Eϑ [N/v] = μ(x, v)/v.
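Schematically, (7.49) only requires feeding the exposure as an additional continuous input and fitting to the claim counts directly; a sketch with our own variable names, where the embedding layers for VehBrand and Region are omitted for brevity:

# exposure enters as a network input, so E[N] = mu(x, v) is non-proportional in v
xv <- layer_input(shape = c(8), dtype = "float32", name = "features_and_v")
mu <- xv %>%
  layer_dense(units = 20, activation = "tanh") %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 1, activation = "exponential")
model_v <- keras_model(inputs = xv, outputs = mu)
model_v %>% compile(loss = "poisson", optimizer = "nadam")  # responses: counts N_i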
Remark 7.29 At this stage we clearly have to distinguish between statistical
modeling and actuarial modeling. In statistical modeling it makes perfect sense
to choose the regression function (7.49), since including the exposure in a non-
proportional way may increase the predictive power of the model, at least this is
what our data suggests.

From an actuarial point of view this approach should clearly be doubted. The
typical exposure of car insurance policies is one calendar year, i.e., v = 1, if the
renewals of insurance policies are accounted correctly. Shorter exposures may have
a specific (non-predictable) reason, for example, the policyholder or the insurance
company may terminate an insurance contract after a claim. Thus, if this is possible,
the exposure is a random variable, too, and it clearly has a predictive power for
claims prediction; in that case we lose the properties of the Poisson count process
(having independent and stationary increments).
As a consequence, we should include the exposure proportionally from an
actuarial modeling point of view. Nevertheless we do the modeling exercise based
on the regression function (7.49), here. This will indicate the predictive power of the
exposure, which may be thought of a proxy for another (non-available) explanatory
variable. Moreover, if (7.49) allows for a good Poisson regression model, we have a
simple way of bootstrapping from our data (conditionally on given exposures v).
We would also like to emphasize that if one feature component dominates all
others in terms of the predictive power, then likely there is a leakage of information
through this component, and this needs a more careful analysis.
We implement the FN network regression model (7.49) using again a network
architecture of depth d = 3 with (q1 , q2 , q3 ) = (20, 15, 10) neurons. We use
embedding layers for the two categorical variables VehBrand and Region, and
we have 8 continuous/binary feature components. This is one more compared to
Fig. 7.9 (rhs) because we also model the exposure vi as a continuous input to the
network. As a result, the dimension r of the network parameter ϑ ∈ Rr increases
from 792 to 812 (because we have q1 = 20 neurons in the first FN layer). We
calculate the nagging predictor μ̄(1:M) of this network averaging over M = 500
individual (early stopped) FN network calibrations, the results are presented in
Table 7.13.

Table 7.13 Number of parameters, in-sample and out-of-sample deviance losses (units are in 10^{-2}), in-sample average frequency and (over-)dispersion of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network models (with embedding layers of dimension b = 2), the nagging predictors, and the middle nagging predictors excluding and including exposures v_i as continuous network inputs

                                         # param.   In-sample loss on L   Out-of-sample loss on T   Aver. freq.   Disp. ϕ^P
Poisson null                             1          25.213                25.445                    7.36%         1.7160
Poisson GLM3                             50         24.084                24.102                    7.36%         1.6644
Embed FN μm=1                            792        23.690                23.824                    7.36%         1.6812
Nagging FN μ̄(1:M)                        ‘792’      23.691                23.783                    7.36%         1.6592
Middle nagging FN μ̄middle                ‘792’      23.698                23.272                    7.36%         1.6618
Exposure v: FN μm=1                      812        23.358                23.496                    7.36%         1.0650
Exposure v: nagging FN μ̄(1:M)            ‘812’      23.299                23.382                    7.36%         1.0416
Exposure v: middle nagging FN μ̄middle    ‘812’      23.303                23.299                    7.36%         1.0427
Fig. 7.30 Average frequency as a function of the exposure v ∈ (0, 1]: nagging predictors considering the exposures proportionally (blue), the model including exposures non-proportionally through the FN network (black) and observed (red)

We observe a major improvement when including the exposure v as an input to the network, i.e., by including the exposure non-proportionally into the mean
estimate. This is true in-sample (we use early stopping here), and in terms of
Pearson’s dispersion estimate; we set r = 812 for the number of parameters in
Pearson’s dispersion estimate (5.30) which may be too big because we do not
perform proper MLE, here. In particular, we receive a dispersion estimate close
to one which, now, is in support of modeling the claim counts by Poisson random
variables (using this regression function). That is, this regression function explains
the systematic effects so that we no longer observe much over-dispersion in the data
relative to the chosen model. However, we would like to recall Remark 7.29, which calls for a careful consideration before using this regression model in insurance practice.
This is also supported by Fig. 7.30 which studies the average frequency as a
function of the exposure v ∈ (0, 1]. The red observed average frequency has a
clear decreasing slope which can be modeled by running the exposure v through the
FN network (black), but not by including it proportionally (blue). From an actuarial
modeling point of view this plot clearly questions the quality of the data, because
there seem to be effects in the exposures that certainly require more investigation.
Unfortunately, we cannot do this here because we do not have additional insight into
this data set. This closes the example.

7.4.5 Identifiability in Feed-Forward Neural Networks

In the previous section we have studied ensembles of FN networks. One may also aim at directly comparing these networks to each other in terms of the fitted network parameters $\widehat{\vartheta}^{\,j}$ over the different calibrations $1 \le j \le M$ (of the same FN network architecture). Such a comparison may, e.g., be useful if one wants to choose a prior parameter distribution $\pi$ for $\vartheta$ in a Bayesian setting. Comparing the different network calibrations $\widehat{\vartheta}^{\,j}$, $1 \le j \le M$, of an architecture needs some care because
networks have many symmetries that make the parameters non-identifiable. We
can, for instance, permute the neurons in a FN layer z(m) , with the corresponding
permutation of the weights that connect this layer to the previous layer z(m−1) and to
the succeeding layer z(m+1) . The resulting predictive model under this permutation
is the same as the original one. For this reason we need to introduce some order in a
FN network to make the parameters identifiable.
Rüger–Ossen [323] have introduced the notion of a fundamental domain for the
network parameter ϑ, and we briefly review this idea. We start with an explicit
example. Assume that the activation function fulfills the anti-symmetry property
−φ(x) = φ(−x) for all x ∈ R, this is the case for the hyperbolic tangent. This
implies several symmetries in the FN network parametrization. E.g., if we consider
the output of a shallow FN network d = 1 with link function g, we can do a sign
switch in a fixed neuron 1 ≤ k ≤ q1


$$\begin{aligned} g(\mu(x)) &= \beta_0 + \sum_{j=1}^{q_1}\beta_j\, z_j^{(1:1)}(x) = \beta_0 + \sum_{j=1}^{q_1}\beta_j\, \phi\left\langle w_j^{(1)}, x\right\rangle\\ &= \beta_0 + \sum_{j\neq k}\beta_j\, \phi\left\langle w_j^{(1)}, x\right\rangle + (-\beta_k)\, \phi\left\langle -w_k^{(1)}, x\right\rangle. \qquad (7.50)\end{aligned}$$

From this we see that the following two network parameters (we switch signs in all the parameters that belong to index $k$)

$$\vartheta = \left(w_1^{(1)}, \ldots, w_k^{(1)}, \ldots, w_{q_1}^{(1)}, \beta_0, \ldots, \beta_k, \ldots, \beta_{q_1}\right) \qquad\text{and}$$
$$\widetilde{\vartheta} = \left(w_1^{(1)}, \ldots, -w_k^{(1)}, \ldots, w_{q_1}^{(1)}, \beta_0, \ldots, -\beta_k, \ldots, \beta_{q_1}\right)$$


give the same FN network predictions. Beside these sign switches, we can also permute the enumeration of the neurons in a given FN layer, giving the same predictions. We discuss Theorem 2 of Rüger–Ossen [323] to solve this identifiability issue. First, we consider the network weights from the input $x$ to the first FN layer $z^{(1)}(x)$. Apply the sign switch operation (7.50) to the neurons in the first FN layer so that all the resulting intercepts $w_{0,1}^{(1)}, \ldots, w_{0,q_1}^{(1)}$ are positive while not changing the regression function $x \mapsto g(\mu(x))$. Next, apply a permutation to the indices $1 \le j \le q_1$ so that we receive ordered intercepts

$$w_{0,1}^{(1)} > \cdots > w_{0,q_1}^{(1)} > 0,$$

with an unchanged regression function $x \mapsto g(\mu(x))$. To make these transformations well-defined we need to assume that all intercepts are non-zero and mutually different (which we assume for the time being).

Then, we move recursively through the FN layers 2 ≤ m ≤ d applying the sign


switch operations and the permutations so that the regression function x → g(μ(x))
remains unchanged and such that for all 1 ≤ m ≤ d

$$w_{0,1}^{(m)} > \ldots > w_{0,q_m}^{(m)} > 0.$$

This provides us with a unique representation of every network parameter ϑ ∈ Rr


in the fundamental domain

$$\Big\{ \vartheta \in \mathbb{R}^r ;\; w_{0,1}^{(m)} > \ldots > w_{0,q_m}^{(m)} > 0 \text{ for all } 1 \le m \le d \Big\} \subset \mathbb{R}^r, \qquad (7.51)$$

provided that all intercepts are different from zero and mutually different in the
same FN layers. As stated in Section 2.2 of Rüger–Ossen [323], there may still exist
different parameters in this fundamental domain that provide the same predictive
model, but these are of zero Lebesgue measure. The same applies to the intercepts
$w_{0,j}^{(m)}$ being zero or to equal intercepts for different neurons. Basically, this
means that we are fine if we work with absolutely continuous prior distributions
on the fundamental domain when we want to work within a Bayesian setup.
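
These operations are easy to implement. The following R sketch (a minimal illustration, with a hypothetical weight matrix W1 whose first column holds the intercepts of the first FN layer, and hypothetical output weights beta) maps a fitted shallow network with hyperbolic tangent activation into the fundamental domain (7.51); the predictions remain unchanged because we only apply the sign switches (7.50) and neuron permutations.

# map a shallow tanh network into the fundamental domain (7.51);
# W1: (q1 x (q0+1))-matrix, first column = intercepts w_{0,j}^{(1)};
# beta: output weights (beta_1, ..., beta_{q1}); beta_0 is unaffected
to_fundamental_domain <- function(W1, beta) {
  sgn  <- ifelse(W1[, 1] >= 0, 1, -1)        # sign switches: -phi(x) = phi(-x) for tanh
  W1   <- W1 * sgn                           # flips all weights of neurons with negative intercept
  beta <- beta * sgn                         # compensating sign switch of the output weights
  ord  <- order(W1[, 1], decreasing = TRUE)  # permutation: decreasing intercepts
  list(W1 = W1[ord, , drop = FALSE], beta = beta[ord])
}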

7.5 Auto-encoders

Auto-encoders are tools that aim at reducing the dimension of high-dimensional


data such that the reconstruction error of the original data is small, i.e., such that
the loss of information by the dimension reduction is minimized. The most popular
auto-encoder is the principal components analysis (PCA) which we are going to
present here. The PCA is a linear dimension reduction technique. Bottleneck neural
(BN) networks can be viewed as a non-linear extension of the PCA. This is going
to be discussed in Sect. 7.5.5, below. Dimension reduction techniques belong to the
family of unsupervised learning methods because they do not consider a response
variable, but they aim at finding common structure in the features. Unsupervised
learning methods can roughly be categorized into three classes: dimension reduction
techniques (studied in this section), clustering methods and visualization methods.
For a discussion of clustering and visualization methods we refer to the tutorial of
Rentzmann–Wüthrich [310].

7.5.1 Standardization of the Data Matrix

Assume we have q-dimensional data points $y_i \in \mathbb{R}^q$, 1 ≤ i ≤ n. This provides us
with a data matrix

$$Y = (y_1, \ldots, y_n)^\top = \begin{pmatrix} y_{1,1} & \cdots & y_{1,q} \\ \vdots & \ddots & \vdots \\ y_{n,1} & \cdots & y_{n,q} \end{pmatrix} \in \mathbb{R}^{n \times q}.$$

We assume that each of the q columns of Y measures a quantity in a given unit.


The first column may, for instance, describe the age of a car driver in years, the
second column his body weight in kilograms, etc. That is, each column 1 ≤ j ≤ q
of Y describes a specific quantity, and each row y i of Y describes these quantities
for a given instance 1 ≤ i ≤ n. Since often the analysis should not depend on
the units of the columns of Y, one centers the columns with the empirical means
$\bar{y}_j = \sum_{i=1}^n y_{i,j}/n$, and one normalizes them with the empirical standard deviations
$\widehat{\sigma}_j = \big( \sum_{i=1}^n (y_{i,j} - \bar{y}_j)^2 / n \big)^{1/2}$, 1 ≤ j ≤ q. This gives the normalized data matrix

$$\begin{pmatrix} \frac{y_{1,1}-\bar{y}_1}{\widehat{\sigma}_1} & \cdots & \frac{y_{1,q}-\bar{y}_q}{\widehat{\sigma}_q} \\ \vdots & \ddots & \vdots \\ \frac{y_{n,1}-\bar{y}_1}{\widehat{\sigma}_1} & \cdots & \frac{y_{n,q}-\bar{y}_q}{\widehat{\sigma}_q} \end{pmatrix} \in \mathbb{R}^{n \times q}. \qquad (7.52)$$


We typically center the data matrix Y, providing $\sum_{i=1}^n y_{i,j} = 0$ for all 1 ≤ j ≤ q;
normalization w.r.t. the standard deviation can be done, but is not always necessary.
Centering implies that we can interpret Y as a q-dimensional empirical distribution
with each component (column) being centered. The covariance matrix of this
(centered) empirical distribution is calculated as

$$\widehat{\Sigma} = \left( \frac{1}{n} \sum_{i=1}^n y_{i,j} \, y_{i,k} \right)_{1 \le j,k \le q} = \frac{1}{n} Y^\top Y \in \mathbb{R}^{q \times q}. \qquad (7.53)$$

This is a covariance matrix, and if the columns of Y are normalized with the
empirical standard deviations $\widehat{\sigma}_j$, 1 ≤ j ≤ q, this is a correlation matrix.
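
A minimal R sketch of this standardization, assuming Y is the n × q data matrix; note that scale() uses the (n−1)-denominator for its standard deviations, which we rescale to the 1/n convention used above.

n  <- nrow(Y)
Y0 <- scale(Y, center = TRUE, scale = TRUE) * sqrt(n / (n - 1))  # normalization (7.52)
Sigma <- crossprod(Y0) / n       # correlation matrix (7.53) with unit diagonal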

7.5.2 Introduction to Auto-encoders

An auto-encoder encodes a high-dimensional vector y ∈ Rq to a low-dimensional
representation so that the dimension reduction leads to a minimal loss of infor-
mation. A function $L(\cdot, \cdot) : \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}_+$ is called dissimilarity function if
$L(y, y') = 0$ if and only if $y = y'$.
An auto-encoder is a pair (Φ, Ψ) of mappings, for given dimensions p < q,

$$\Phi : \mathbb{R}^q \to \mathbb{R}^p \qquad \text{and} \qquad \Psi : \mathbb{R}^p \to \mathbb{R}^q, \qquad (7.54)$$

such that their composition Ψ ◦ Φ has a small reconstruction error w.r.t. the chosen
dissimilarity function L(·, ·), that is,

$$y \mapsto L\big(y, \Psi \circ \Phi(y)\big) \quad \text{is small for all cases } y \text{ of interest.} \qquad (7.55)$$

Note that we want (7.55) for selected cases y, and if they are within a p-dimensional
manifold the auto-encoding will be successful. The first mapping Φ : Rq → Rp is
called encoder, and the second mapping Ψ : Rp → Rq is called decoder. The object
Φ(y) ∈ Rp is a p-dimensional encoding (representation) of y ∈ Rq which contains
maximal information of y up to the reconstruction error (7.55).

7.5.3 Principal Components Analysis

PCA gives us a linear auto-encoder (7.54). If the data matrix Y ∈ Rn×q has rank
q, there exist q linearly independent rows of Y that span Rq . PCA determines a
different, very specific basis of Rq . It looks for an orthonormal basis v 1 , . . . , v q ∈
Rq such that v 1 explains the direction of the biggest variability in Y , v 2 the direction
of the second biggest variability in Y orthogonal to v 1 , and so forth. Variability is
understood in the sense of maximal empirical variance under the assumption that
the columns of Y are centered, see (7.52)–(7.53). Such an orthonormal basis can
be found by determining q linearly independent eigenvectors of the symmetric and
positive definite matrix

$$A = n \widehat{\Sigma} = Y^\top Y \in \mathbb{R}^{q \times q}.$$

For this we can solve recursively the following convex Lagrange problems. The first
basis vector v1 ∈ Rq is determined by the solution of³

$$v_1 = \underset{\|w\|_2 = 1}{\arg\max} \; \|Y w\|_2^2 = \underset{w^\top w = 1}{\arg\max} \; w^\top Y^\top Y w, \qquad (7.56)$$

and the j-th basis vector vj ∈ Rq, 2 ≤ j ≤ q, is received recursively by the solution
of

$$v_j = \underset{\|w\|_2 = 1}{\arg\max} \; \|Y w\|_2^2 \qquad \text{subject to } \langle v_k, w \rangle = 0 \text{ for all } 1 \le k \le j-1. \qquad (7.57)$$

³ If the q eigenvalues of A are distinct, the solution to (7.56) and (7.57) is unique up to the sign,
otherwise this requires more care.

Singular value decomposition (SVD) gives an alternative way of computing this


orthonormal basis, we refer to Section 14.5.1 in Hastie et al. [183]. The algorithm
of Golub–Van Loan [165] gives an efficient way of performing a SVD. There exist
orthogonal matrices $U \in \mathbb{R}^{n \times q}$ and $V \in \mathbb{R}^{q \times q}$ (with $U^\top U = V^\top V = \mathbb{1}_q$), and
a diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_q) \in \mathbb{R}^{q \times q}$ with singular values $\lambda_1 \ge \ldots \ge \lambda_q > 0$ such that we have the SVD

$$Y = U \Lambda V^\top. \qquad (7.58)$$

The matrix U is called left-singular matrix of Y, and the matrix V is called right-
singular matrix of Y. Observe by using the SVD (7.58)

$$V^\top A V = V^\top Y^\top Y V = V^\top V \Lambda U^\top U \Lambda V^\top V = \Lambda^2 = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_q^2).$$

That is, the squared singular values $(\lambda_j^2)_{1 \le j \le q}$ are the eigenvalues of matrix A, and
the column vectors of the right-singular matrix $V = (v_1, \ldots, v_q)$ (eigenvectors of
A) give an orthonormal basis $v_1, \ldots, v_q$. This motivates defining the q principal
components of Y by the column vectors of

$$Y V = U \Lambda = U \, \mathrm{diag}(\lambda_1, \ldots, \lambda_q) = \big(\lambda_1 u_1, \ldots, \lambda_q u_q\big) \in \mathbb{R}^{n \times q}. \qquad (7.59)$$

E.g., the first principal component of the instances 1 ≤ i ≤ n is given by $Y v_1 = \lambda_1 u_1 \in \mathbb{R}^n$. Considering the first p ≤ q principal components gives the rank p
matrix

$$Y_p = U \, \mathrm{diag}(\lambda_1, \ldots, \lambda_p, 0, \ldots, 0) \, V^\top \in \mathbb{R}^{n \times q}. \qquad (7.60)$$

The Eckart–Young–Mirsky theorem [114, 279]⁴ proves that this rank p matrix Y_p
minimizes the Frobenius norm relative to Y among all rank p matrices, that is,

$$Y_p \in \underset{B \in \mathbb{R}^{n \times q}}{\arg\min} \; \|Y - B\|_F \qquad \text{subject to } \mathrm{rank}(B) \le p, \qquad (7.61)$$

where the Frobenius norm is given by $\|C\|_F^2 = \sum_{i,j} c_{i,j}^2$ for a matrix $C = (c_{i,j})_{i,j}$.
The orthonormal basis $v_1, \ldots, v_q \in \mathbb{R}^q$ gives the (linear) encoder (projection)

$$\Phi : \mathbb{R}^q \to \mathbb{R}^p, \qquad y \mapsto \Phi(y) = \big(y^\top v_1, \ldots, y^\top v_p\big)^\top = (v_1, \ldots, v_p)^\top y.$$

⁴ In fact, (7.61) holds for both the Frobenius norm and the spectral norm.

This gives the first p principal components in (7.59) if we insert the transposed
data matrix $Y^\top = (y_1, \ldots, y_n) \in \mathbb{R}^{q \times n}$ for $y \in \mathbb{R}^q$. The (linear) decoder Ψ is
given by

$$\Psi : \mathbb{R}^p \to \mathbb{R}^q, \qquad z \mapsto \Psi(z) = (v_1, \ldots, v_p) \, z.$$

The following is understood column-wise for the transposed data matrix $Y^\top$,

$$\Psi \circ \Phi(Y^\top) = \Psi\big( (v_1, \ldots, v_p)^\top Y^\top \big) = \big( Y (v_1, \ldots, v_p)(v_1, \ldots, v_p)^\top \big)^\top$$
$$= \big( Y (v_1, \ldots, v_p, 0, \ldots, 0)(v_1, \ldots, v_p, v_{p+1}, \ldots, v_q)^\top \big)^\top$$
$$= \big( U \, \mathrm{diag}(\lambda_1, \ldots, \lambda_p, 0, \ldots, 0) \, V^\top \big)^\top = Y_p^\top.$$

Thus, $\Psi \circ \Phi(Y^\top)$ minimizes the Frobenius reconstruction error (7.61) on the data
matrix $Y^\top$ among all linear maps of rank p. In view of (7.55) we can express the
squared Frobenius reconstruction error as

$$\|Y - Y_p\|_F^2 = \sum_{i=1}^n \big\| y_i - \Psi \circ \Phi(y_i) \big\|_2^2 = \sum_{i=1}^n L\big( y_i, \Psi \circ \Phi(y_i) \big), \qquad (7.62)$$

thus, we choose the squared Euclidean distance as the dissimilarity measure, here,
that we minimize simultaneously on all cases $y_i$, 1 ≤ i ≤ n.
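
A minimal R sketch of these operations on a centered data matrix Y: svd() provides the decomposition (7.58), from which the principal components (7.59), the rank p approximation (7.60) and the reconstruction error (7.62) follow directly.

sv <- svd(Y)          # SVD (7.58): sv$u = U, sv$d = (lambda_1, ..., lambda_q), sv$v = V
pc <- Y %*% sv$v      # principal components (7.59), equals U diag(lambda_1, ..., lambda_q)
p  <- 2               # chosen rank of the approximation
Yp <- sv$u[, 1:p, drop = FALSE] %*% diag(sv$d[1:p], p, p) %*% t(sv$v[, 1:p, drop = FALSE])
sum((Y - Yp)^2)       # squared Frobenius reconstruction error (7.62)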
Remark 7.30 The PCA gives a linear approximation to the data matrix Y by
minimizing (7.61) and (7.62) for given rank p. This may not be appropriate if the
non-linear terms are dominant. Figure 7.31 (lhs) gives a situation where the PCA
works well; this data has been generated by i.i.d. multivariate Gaussian random
vectors $y_i \sim \mathcal{N}(0, \Sigma)$. Figure 7.31 (middle) gives a non-linear example where the
PCA does not work well, the data matrix Y ∈ Rn×2 is a column-centered matrix
that builds a circle around the origin.
Another nice example where the PCA fails is Fig. 7.31 (rhs). This figure is
inspired by Shlens [337] and Ruckstuhl [321]. It shows a situation where the level
sets are non-convex, and the principal components point in a completely wrong
direction to explain the structure of the data.

Fig. 7.31 Two-dimensional PCAs in different situations of the data matrix Y ∈ Rn×2

7.5.4 Lab: Lee–Carter Mortality Model

We use the SVD to fit the most popular stochastic mortality model, the Lee–Carter
(LC) model [238], to (raw) mortality data. The raw mortality data considers for each
calendar year t and each age x the number of people Dx,t who died (in that year t
at age x) divided by the corresponding population exposure ex,t . In practice this
requires some care. Due to migration, often, the exposures ex,t are non-observable
figures and need to be estimated. Moreover, also the death counts Dx,t in year t at
age x can be defined differently, age cohorts are usually defined by the year of birth.
We denote the (observed) raw mortality rates by Mx,t = Dx,t /ex,t . The subsequent
derivations consider the raw log-mortality rates log(Mx,t ), for this reason we assume
that Mx,t > 0 for all calendar years t and ages x. The goal is to model these raw
log-mortality rates (for each country, region, risk group and gender separately).
The LC model defines the force of mortality as

log(μx,t ) = ax + bx kt , (7.63)

where log(μx,t ) is the (deterministic) log-mortality rate in calendar year t for a


person aged x (for a fixed country, region and gender). The individual terms in (7.63)
have the following meaning: ax is the average force of mortality at age x, bx is the
rate of change of the force of mortality broken down to the different ages x, and kt
is the time index describing the change of the force of mortality in calendar year t.
Strictly speaking, we do not have a stochastic model, here, that can explain the
observations Mx,t , but we try to fit a deterministic mortality surface (μx,t )x,t to
these noisy observations (Mx,t )x,t . For this we use the PCA and the Frobenius norm
as the measure of dissimilarity (on the log-scale).
In a first step, we center the raw log-mortality rates for all ages x, i.e., over the
calendar years t ∈ T under consideration. We define the centered raw log-mortality
rates $Y_{x,t}$ and the estimate $\widehat{a}_x$ of the average force of mortality at age x as follows

$$Y_{x,t} = \log(M_{x,t}) - \widehat{a}_x = \log(M_{x,t}) - \frac{1}{|T|} \sum_{s \in T} \log(M_{x,s}), \qquad (7.64)$$

where the last identity defines the estimate $\widehat{a}_x$. Strictly speaking we have a slight
difference to the centering in Sect. 7.5.1 because we center the rows and not the
columns of the data matrix, here, but the role of rows and columns is exchangeable in
the PCA. The optimal (parameter) values $(\widehat{b}_x)_x$ and $(\widehat{k}_t)_t$ are determined as follows,
see (7.63),

$$\underset{(b_x)_x, (k_t)_t}{\arg\min} \; \sum_{x,t} \big( Y_{x,t} - b_x k_t \big)^2,$$

where the sum runs over the years t ∈ T and the ages x0 ≤ x ≤ x1 , with x0 and x1
being the lower and upper age boundaries. This can be rewritten as an optimization
problem (7.61)–(7.62). Consider the data matrix Y = (Yx,t )x0 ≤x≤x1;t ∈T ∈ Rn×q ,
and set n = x1 − x0 + 1 and q = |T |. Assume Y has rank q. This allows us to
consider

$$Y_1 \in \underset{B \in \mathbb{R}^{n \times q}}{\arg\min} \; \|Y - B\|_F \qquad \text{subject to } \mathrm{rank}(B) \le 1.$$

A solution to this problem is given by, see (7.60),

$$Y_1 = U \, \mathrm{diag}(\lambda_1, 0, \ldots, 0) \, V^\top = (\lambda_1 u_1) \, v_1^\top = (Y v_1) \, v_1^\top \in \mathbb{R}^{n \times q},$$

with left-singular matrix $U = (u_1, \ldots, u_q) \in \mathbb{R}^{n \times q}$ and right-singular matrix $V = (v_1, \ldots, v_q) \in \mathbb{R}^{q \times q}$ of Y. This implies that the first principal component $\lambda_1 u_1 = Y v_1 \in \mathbb{R}^n$ gives an estimate for $(b_x)_{x_0 \le x \le x_1}$, and the first column vector $v_1 \in \mathbb{R}^q$
of V gives an estimate for the time index $(k_t)_{t \in T}$. For parameter identifiability we
normalize

$$\sum_{x=x_0}^{x_1} \widehat{b}_x = 1 \qquad \text{and} \qquad \sum_{t \in T} \widehat{k}_t = 0, \qquad (7.65)$$

the latter being consistent with the centering of the rows of Y with $\widehat{a}_x$ in (7.64).
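
A minimal R sketch of this LC fit, under the assumption that logM is the (x1 − x0 + 1) × |T| matrix of raw log-mortality rates log(Mx,t) with the ages in the rows and the calendar years in the columns:

ax <- rowMeans(logM)       # estimates of a_x, see (7.64)
Y  <- logM - ax            # centered raw log-mortality rates Y_{x,t} (row-wise centering)
sv <- svd(Y)               # rank 1 approximation Y_1, see (7.60)
bx <- sv$d[1] * sv$u[, 1]  # first principal component lambda_1 u_1 = Y v_1
kt <- sv$v[, 1]            # first right-singular vector; sums to 0 by the row-centering
c1 <- sum(bx)              # normalization (7.65): the b_x estimates add up to 1
bx <- bx / c1; kt <- kt * c1
fit <- ax + bx %o% kt      # fitted log-mortality surface log(mu_{x,t})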
We fit the LC model to the Swiss mortality data of females and males separately.
The raw log-mortality rates log(Mx,t ) for the years t ∈ T = {1950, . . . , 2016}
and the ages 0 ≤ x ≤ 99 are illustrated in Fig. 7.32; both plots use the same color
scale. This mortality data has been obtained from the Human Mortality Database
(HMD) [195]. In general, we observe a diagonal structure that indicates mortality
improvements over time.

Fig. 7.32 Raw log-mortality rates log(Mx,t) for the calendar years 1950 ≤ t ≤ 2016 and the ages
x0 = 0 ≤ x ≤ x1 = 99 of Swiss females (lhs) and Swiss males (rhs); both plots use the same color
scale

Fig. 7.33 LC fitted log-mortality rates $\log(\widehat{\mu}_{x,t})$ for the calendar years 1950 ≤ t ≤ 2016 and the
ages x0 = 0 ≤ x ≤ x1 = 99 of Swiss females (lhs) and Swiss males (rhs); the plots use the same
color scale as Fig. 7.32

Define the fitted log-mortality surface

$$\log(\widehat{\mu}_{x,t}) = \widehat{a}_x + \widehat{b}_x \widehat{k}_t \qquad \text{for } x_0 \le x \le x_1 \text{ and } t \in T.$$

Figure 7.33 shows the LC fitted log-mortality surface $(\log(\widehat{\mu}_{x,t}))_{0 \le x \le 99;\, t \in T}$ sepa-
rately for Swiss females and Swiss males, the color scale is the same as in Fig. 7.32.
The plots show a huge similarity between the raw log-mortality data and the LC
fitted log-mortality surface which clearly supports the LC model for the Swiss
data. In general, the LC surface is a smoothed version of the raw log-mortality
surface. The main difference in our LC fit concerns the male population for ages
20 ≤ x ≤ 40 from 1980 to 2000; one explanation of the special pattern in the
observed data during that time is the emergence of HIV.

Fig. 7.34 (lhs) Singular values λj, 1 ≤ j ≤ |T|, of the SVD of the data matrix Y ∈ Rn×|T|, and
(rhs) the reconstruction errors $\|Y - Y_p\|_F^2$ for 0 ≤ p ≤ |T|
Figure 7.34 (lhs) shows the singular values λ1 ≥ . . . ≥ λ|T | > 0 for
Swiss females and Swiss males. We observe that the first singular value λ1 by
far dominates the remaining singular values λj , j ≥ 2. Thus, the first principal
component indeed may already be sufficient, and the centered raw log-mortality
data Y can be described by a matrix Y 1 of rank p = 1. Figure 7.34 (rhs) gives
the squared Frobenius reconstruction errors of the approximations Y p of ranks
0 ≤ p ≤ |T |, where Y 0 corresponds to the zero matrix where we do not use any
approximation, but use just the average observed log-mortality rate. We observe that
the first singular value leads by far to the biggest decrease in the reconstruction error,
and the subsequent expansions λj , j ≥ 2, improve it only slightly in each step. This
supports the use of the LC model using a rank p = 1 approximation to the centered
raw log-mortality rates Y . The higher rank PCA within mortality modeling has
been studied in Renshaw–Haberman (RH) [308], and the RH(p) mortality model
considers the rank p approximation Yp to the raw log-mortality rates Y given by

$$\log(\mu_{x,t}) = a_x + \langle b_x, k_t \rangle, \qquad \text{for } b_x, k_t \in \mathbb{R}^p.$$
We have (only) fitted a mortality surface to the raw log-mortality rates on the
rectangle {x0, . . . , x1} × T . This does not allow us to forecast mortality into the
future. Forecasting requires a two step procedure, which, after this first estimation
step, extrapolates the time index (time-series) $(\widehat{k}_t)_{t \in T}$ beyond the latest observation
point in T . The simplest (meaningful) model for this second (extrapolation) step
is a random walk with drift for the time index process $(\widehat{k}_t)_{t \ge 0}$. Figure 7.35 shows
the estimated two-dimensional process $(\widehat{k}_t)_{t \in T}$, i.e., for p = 2, on the rectangle
{x0, . . . , x1} × T which needs to be extrapolated to predict within the RH (p = 2)
mortality model. We refrain from doing this step, but extrapolation will be studied
in Sect. 8.4, below.

Fig. 7.35 Estimated two-dimensional processes $(\widehat{k}_t)_{t \in T}$ for Swiss females (lhs) and Swiss males
(rhs); these are normalized such that they are centered and such that the components of $\widehat{b}_x$ add up
to 1

7.5.5 Bottleneck Neural Network

BN networks have become popular in studying non-linear generalizations of PCA,


we refer to Kramer [225] and Hinton–Salakhutdinov [186]. The BN network
architecture is such that (1) the input dimension q0 is equal to the output dimension
qd+1 of a FN network, and (2) in between there is a FN layer 1 ≤ m ≤ d that has a
very low dimension qm ≪ q0, called the bottleneck. Figure 7.36 (lhs) shows such a
BN network of depth d = 3 and neurons

(q0 , q1 , q2 , q3 , q4 ) = (20, 7, 2, 7, 20).

The input and output neurons have blue color, and the bottleneck of dimension q2 =
2 is shown in red color in Fig. 7.36 (lhs).

Fig. 7.36 (lhs) BN network of depth d = 3 with (q0, q1, q2, q3, q4) = (20, 7, 2, 7, 20), (middle
and rhs) shallow BN networks with a bottleneck of dimensions 7 and 2, respectively

The motivation is as follows. Assume we have a given dissimilarity function
$L(\cdot, \cdot) : \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}_+$ that measures the reconstruction error of an auto-encoder
Ψ ◦ Φ(y) ∈ Rq relative to the original input y ∈ Rq, see (7.55). We try to find a BN
network with input and output dimensions q0 = qd+1 = q (we drop the intercepts in
the entire construction) and a bottleneck in layer m having a low dimension qm, such
that the BN network provides a small reconstruction error. Choose a FN network

$$y \in \mathbb{R}^q \;\mapsto\; \Psi \circ \Phi(y) = z^{(d+1:1)}(y) = \big( z^{(d+1)} \circ z^{(d)} \circ \cdots \circ z^{(1)} \big)(y) \in \mathbb{R}^q,$$

with FN layers for 1 ≤ m ≤ d (excluding intercepts)

$$z^{(m)} : \mathbb{R}^{q_{m-1}} \to \mathbb{R}^{q_m}, \qquad z \mapsto z^{(m)}(z) = \Big( \phi\big\langle w_1^{(m)}, z \big\rangle, \ldots, \phi\big\langle w_{q_m}^{(m)}, z \big\rangle \Big)^\top,$$

and having network weights $w_j^{(m)} \in \mathbb{R}^{q_{m-1}}$, 1 ≤ j ≤ qm. For the output we choose
the identity function as activation function

$$z^{(d+1)} : \mathbb{R}^{q_d} \to \mathbb{R}^{q_{d+1}}, \qquad z \mapsto z^{(d+1)}(z) = \Big( \big\langle w_1^{(d+1)}, z \big\rangle, \ldots, \big\langle w_{q_{d+1}}^{(d+1)}, z \big\rangle \Big)^\top,$$

and having network weights $w_j^{(d+1)} \in \mathbb{R}^{q_d}$, 1 ≤ j ≤ qd+1. The resulting network
parameter ϑ is now fitted to the data matrix $Y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^{n \times q}$ such that
the reconstruction error is minimized over all instances

$$\widehat{\vartheta} = \underset{\vartheta \in \mathbb{R}^r}{\arg\min} \sum_{i=1}^n L\big( y_i, \Psi \circ \Phi(y_i) \big) = \underset{\vartheta \in \mathbb{R}^r}{\arg\min} \sum_{i=1}^n L\big( y_i, z^{(d+1:1)}(y_i) \big).$$

We use this fitted network parameter $\widehat{\vartheta}$ and denote the resulting FN layers by
$\widehat{z}^{(m)}$ for 1 ≤ m ≤ d + 1.

This allows us to define the BN encoder, set q = q0 and p = qm,

$$\widehat{\Phi} : \mathbb{R}^{q_0} \to \mathbb{R}^{q_m}, \qquad y \mapsto \widehat{\Phi}(y) = \widehat{z}^{(m:1)}(y) = \big( \widehat{z}^{(m)} \circ \cdots \circ \widehat{z}^{(1)} \big)(y), \qquad (7.66)$$

and the BN decoder is given by, set qm = p and qd+1 = q,

$$\widehat{\Psi} : \mathbb{R}^{q_m} \to \mathbb{R}^{q_{d+1}}, \qquad z \mapsto \widehat{\Psi}(z) = \widehat{z}^{(d+1:m+1)}(z) = \big( \widehat{z}^{(d+1)} \circ \cdots \circ \widehat{z}^{(m+1)} \big)(z).$$

The BN encoder (7.66) gives us a qm-dimensional representation of the data. A
linear rank p representation Yp of Y, see (7.61), can be found by a BN network
architecture that has a minimal FN layer width of dimension $p = \min_{1 \le j \le d} q_j$, and
with the identity activation function φ(x) = x. Such a BN network is a linear map
of maximal rank p. Using the Euclidean square distance as dissimilarity measure
provides us an optimal network parameter $\widehat{\vartheta}$ for this linear map such that we receive
$Y_p^\top = \widehat{z}^{(d+1:1)}(Y^\top)$. There is one point to be considered, here, why the bottleneck
activations $\widehat{\Phi}(y) = \widehat{z}^{(m:1)}(y) \in \mathbb{R}^p$ in the linear activation case are not directly
comparable to the principal components $(y^\top v_1, \ldots, y^\top v_p)^\top$ of the PCA. Namely,
the PCA uses an orthonormal basis $v_1, \ldots, v_p$ whereas the linear BN network case
uses any p-dimensional basis, i.e., to directly bring these two representations in line
we still need a coordinate transformation of the bottleneck activations.
Hinton–Salakhutdinov [186] noticed that the gradient descent fitting of a BN
network needs some care, otherwise we may find a local minimum of the loss
function that has a poor reconstruction performance. In order to implement a more
sophisticated way of SGD fitting we require that the depth d of the network is an
odd number and that the network architecture is symmetric around the central FN
layer (d + 1)/2. This is the case in Fig. 7.36 (lhs). Fitting of this network of depth
d = 3 is now done in three steps:
1. The symmetry around the central FN layer m = 2 allows us to collapse this
central layer by merging layers 1 and 3 (because q1 = q3 ). Merging these two
layers provides us a shallow BN network with neurons (q0 , q1 = q3 , qd+1 =
q0 ) = (20, 7, 20). This shallow BN network is shown in Fig. 7.36 (middle).
In a first step we fit this simpler network to the data Y. This gives us the
preliminary estimates for the network weights $w_1^{(1)}, \ldots, w_{q_1}^{(1)}$ and $w_1^{(4)}, \ldots, w_{q_4}^{(4)}$
of the full BN network. From this fitted shallow BN network we receive the
learned representations $z_i = \widehat{z}^{(1)}(y_i) \in \mathbb{R}^{q_1}$, 1 ≤ i ≤ n, in the central layer
using the preliminary estimates of the network weights.
2. In the second step we use the learned representations zi ∈ Rq1 , 1 ≤ i ≤ n, to
fit the inner part of the original network (using a suitable dissimilarity function).
This inner part is a shallow network with neurons (q1 , q2 , q3 = q1 ) = (7, 2, 7),

see Fig. 7.36 (rhs). This second step gives us the preliminary estimates for the
network weights $w_1^{(2)}, \ldots, w_{q_2}^{(2)}$ and $w_1^{(3)}, \ldots, w_{q_3}^{(3)}$ of the full BN network.
3. In the final step we fit the full BN network on the data Y and use the preliminary
estimates of the weights (of the previous two steps) as initialization of the
gradient descent algorithm.
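
For illustration, a minimal keras [77] sketch of the full BN network of Fig. 7.36 (lhs), dropping the intercepts as in the above construction; the two shallow networks of the pre-training steps 1 and 2 are built analogously, and their fitted weights can then be loaded into this full network, e.g., with set_weights(), before the final fitting step 3. The layer names are our own choice.

library(keras)
Input  <- layer_input(shape = c(20), dtype = 'float32', name = 'Input')
Output <- Input %>%
  layer_dense(units = 7,  activation = 'tanh',   use_bias = FALSE, name = 'Layer1') %>%
  layer_dense(units = 2,  activation = 'tanh',   use_bias = FALSE, name = 'Bottleneck') %>%
  layer_dense(units = 7,  activation = 'tanh',   use_bias = FALSE, name = 'Layer3') %>%
  layer_dense(units = 20, activation = 'linear', use_bias = FALSE, name = 'Output')
model <- keras_model(inputs = c(Input), outputs = c(Output))
model %>% compile(loss = 'mean_squared_error', optimizer = 'nadam')
# BN encoder (7.66): read off the bottleneck activations
encoder <- keras_model(inputs = c(Input),
                       outputs = get_layer(model, 'Bottleneck')$output)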

Example 7.31 (BN Network Mortality Model) We apply this BN network approach
to modify the LC model of Sect. 7.5.4. Hainaut [178] considered such a BN network
application. For computational reasons, Hainaut [178] proposed a calibration
strategy different from Hinton–Salakhutdinov [186]. We use this latter calibration
strategy as it has turned out to work well in our setting.
As BN network architecture we choose a FN network of depth d = 3. The input
and output dimensions are equal to q0 = q4 = 67, this exactly corresponds to
the number of available calendar years 1950 ≤ t ≤ 2016, see Fig. 7.32. Then, we
select a symmetric architecture around the central FN layer m = 2 with q1 = q3 =
20 neurons. That is, in a first step, the 67 calendar years are compressed to a 20-
dimensional representation. For the bottleneck we then explore different numbers
of neurons q2 = p ∈ {1, . . . , 20}. These BN networks are implemented and fitted in
R with the library keras [77]. We have fitted these models separately to the Swiss
female and male populations. The raw log-mortality rates are illustrated in Fig. 7.32,
and for comparability with the LC approach we have centered these log-mortality
rates according to (7.64), and we use the squared Euclidean distance as the objective
function.
Figure 7.37 compares the squared Frobenius reconstruction errors of the linear
LC approximations Yp to their non-linear BN network counterparts with bottle-
necks q2 = p. We observe that the BN reconstruction errors are clearly smaller,
saying that a non-linear auto-encoding provides a better reconstruction; this is true,
in particular, for 2 ≤ q2 < 20. For q2 ≥ 20 the learning with the BN networks
seems saturated; note that the outer layers have q1 = q3 = 20 neurons which limits
the learning at the bottleneck for bigger q2. In view of Fig. 7.37 there seems to be a
kink at q2 = 4,

and an "elbow" criterion says that this is the critical bottleneck size that should not
be exceeded.

Fig. 7.37 Frobenius reconstruction errors $\|Y - Y_p\|_F^2$ for 1 ≤ p = q2 ≤ 20 in the linear LC
approach and the non-linear BN approach

Fig. 7.38 BN network (q1, q2, q3) = (20, 2, 20) fitted log-mortality rates $\log(\widehat{\mu}_{x,t})$ for the
calendar years 1950 ≤ t ≤ 2016 and the ages x0 = 0 ≤ x ≤ x1 = 99 of Swiss females
(left) and Swiss males (right); the plots use the same color scale as Fig. 7.32
The resulting estimated log-mortality surfaces for the bottleneck q2 = 2 are
illustrated in Fig. 7.38. These strongly resemble the raw log-mortality rates in
Fig. 7.32, in particular, for the male population we get a better fit for ages 20 ≤
x ≤ 40 from 1980 to 2000 compared to the LC model. In a further analysis we
should check whether this BN network does not over-fit to the data. We could, e.g.,
explore drop-outs during calibration or smaller FN (compression) layers q1 = q3 .
Finally, we analyze the resulting activations at the bottleneck by considering the
BN encoder (7.66). Note that we assume y ∈ Rq in (7.66) with q = |T | being
the rank of the data matrix Y ∈ Rn×q . Thus, the encoder takes a fixed age 0 ≤
x ≤ 99 and encodes the corresponding time-series observation y x ∈ R|T | by the
bottleneck activations. This parametrization has been inspired by the PCA which
typically considers a data matrix that has more rows than columns. This results in
at most q = rank(Y) singular values, provided n ≥ q. However, we can easily
exchange the role of rows and columns, e.g., by transposing all matrices involved.
For mortality forecasting it is advantageous to exchange these roles because we
would like to extrapolate a time-series beyond T . For this reason we set for the input
dimension q0 = q = 100, which provides us with |T | observations y t ∈ R100 . We
then fit the BN encoder (7.66) to receive the bottleneck activations

$$Y^\top = (y_t)_{t \in T} \;\mapsto\; \widehat{\Phi}(Y^\top) = \big( \widehat{\Phi}(y_t) \big)_{t \in T} \in \mathbb{R}^{q_2 \times |T|}.$$

Fig. 7.39 BN network (q1, q2, q3) = (20, 2, 20): bottleneck activations showing $\widehat{\Phi}(y_t) \in \mathbb{R}^2$ for
t ∈ T

Figure 7.39 shows these bottleneck activations for q2 = 2. We observe that these
bottleneck time-series $(\widehat{\Phi}(y_t))_{t \in T}$ are much more difficult to understand than the
LC/RH ones given in Fig. 7.35. Firstly, we see that we have quite some dependence
between the components of the time-series. Secondly, in contrast to the LC/RH case
of Fig. 7.35, there is not one component that dominates. Note that this dominance
has been obtained by scaling the components of $(\widehat{b}_x)_x$ to add up to 1 (which,
of course, reflects the magnitudes of the singular values). In the non-linear case,
these scales are hidden in the decoder which is more difficult to extract. Thirdly,
the extrapolation may not work if the time-series has a trend and if we use the
hyperbolic tangent activation function that has a bounded range. In general, a trend
extrapolation has to be considered very carefully with FN networks with non-linear
activation functions, and often there is no good solution to this problem within
the FN network framework. We conclude that this approach improves in-sample
mortality surface modeling, but it leaves open the question about forecasting the
future mortality rates because an extrapolation seems more difficult. ■

Remark 7.32 The concept of BN networks has also been considered in the actuarial
literature to encode geographic information, see Blier-Wong et al. [39]. Since
geographic information has a natural spatial component, these authors propose
to use a convolutional neural network to encode the spatial information before
processing the learned features through a BN network. The proposed decoder may
have different forms, either it tries to reconstruct the whole (spatial) neighborhood
of a given location or it only tries to reconstruct the site of a given location.

7.6 Model-Agnostic Tools

We collect some model-agnostic tools in this section that help us to better understand
and analyze the networks, their calibrations and predictions. Model-agnostic tools
are techniques that are not specific to a certain model type and can be used for
any regression model. Most methods considered here are nicely presented in the
tutorial of Lorentzen–Mayer [258]. There are several ways of getting a better
understanding of a regression model. First, we can analyze variable importance
which tries to answer similar questions to the GLM variable selection tools
of Sect. 5.3 on model validation. However, in general, we cannot rely on any
asymptotic likelihood theory for such an analysis. Second, we can try to understand
the predictive model. For a GLM with the log-link function this is quite simple
because the systematic effects are of a multiplicative nature. For networks this
is much more complicated because we allow for much more general regression
functions. We can either try to understand these functions on a global portfolio level
(by averaging the effects over many insurance policies) or we can try to understand
these functions locally for individual insurance policies. The latter refers to local
sensitivities around a chosen feature value x ∈ X , and the former to global model-
agnostics.

7.6.1 Variable Permutation Importance

For GLMs we have studied the LRT and the Wald test that have been assisting us
in reducing the GLM by the feature components that do not contribute sufficiently
to the regression task at hand, see Sects. 5.3.2 and 5.3.3. These variable reduction
techniques rely on an asymptotic likelihood theory. Here, we need to proceed
differently, and we just aim at ranking the variables by their importance, similarly
to a drop1 analysis, see Listing 5.6.
For a given FN network regression model

$$x \in \mathcal{X} \;\mapsto\; \mu(x) = g^{-1} \big\langle \beta, z^{(d:1)}(x) \big\rangle,$$

we randomize one component of x = (x1 , . . . , xq ) at a time, and we study the


resulting change in the objective function. More precisely, for given (learning) data
L, with features x 1 , . . . , x n , we select one feature component 1 ≤ j ≤ q and
permute (xi,j )1≤i≤n randomly across the entire portfolio 1 ≤ i ≤ n. We denote by
L(j ) the resulting data with the j -th component being permuted. We then compare
the resulting deviance loss D(L(j ) , μ) to the one D(L, μ) on the original data L
using the same regression model μ. We call this approach variable permutation
importance (VPI). Note that such a permutation does not only act on the marginal
effects, but it also distorts the interaction effects of the different feature components.

Fig. 7.40 VPI measured by the relative change vpi(j), 1 ≤ j ≤ q, of model Poisson GLM3 of
Table 5.5 and the FN network regression model μm=1 of Table 7.9

We calculate the VPI on the MTPL claim frequency data of model Poisson
GLM3 of Table 5.5 and the FN network regression model μm=1 of Table 7.9; we
use this example throughout this section on model-agnostic tools. Figure 7.40 shows
the relative increases

$$\mathrm{vpi}(j) = \frac{D(L^{(j)}, \mu) - D(L, \mu)}{D(L, \mu)},$$

of the deviance losses by permuting one feature component 1 ≤ j ≤ q at a time.


Obviously, the BonusMalus level followed by DrivAge and VehBrand are
the most important variables according to this VPI method. This is in alignment for
both models. Thereafter, there are smaller disagreements between the two models.
These disagreements may (also) be caused by a non-optimal feature pre-processing
in the GLM where, for instance, we have to add the interaction effects manually,
see (5.35). Overall, these VPI results are in line with the findings of the classical
methods on GLMs, see for instance the drop1 table in Listing 5.6.
One point that is worth mentioning (and which makes the VPI results not fully
reliable) is the use of feature components that are highly correlated. In our case,
Density and Area are highly correlated, see Fig. 13.12. Therefore, it may not
make sense to randomly permute one component while keeping the other one
unchanged. This issue will also arise in other methods described below.
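
A schematic R version of the VPI, under the assumption of a fitted claim frequency model with a predict() method returning estimated frequencies, claim counts Y, exposures v and a feature data frame dat; poissonDeviance is a hypothetical helper for the Poisson deviance loss.

# Poisson deviance loss; the term log((y/yhat)^y) also covers y = 0 since 0^0 = 1
poissonDeviance <- function(y, yhat) { 2 * sum(yhat - y + log((y / yhat)^y)) }
D0  <- poissonDeviance(Y, v * predict(model, dat))  # deviance loss on the original data L
vpi <- numeric(ncol(dat))
for (j in 1:ncol(dat)) {
  dat.j      <- dat
  dat.j[, j] <- sample(dat.j[, j])                  # permute component j, giving L^(j)
  vpi[j]     <- poissonDeviance(Y, v * predict(model, dat.j)) / D0 - 1
}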
Remark 7.33 (Global Surrogate Model) There are other machine learning methods
that offer different measures of variable importance. For instance, (binary split)
classification and regression trees (CARTs) offer popular methods for measuring
variable importance; for binary split CARTs we refer to Breiman et al. [54]
and Denuit et al. [100]. These CARTs select individual feature components for
partitioning the feature space X , and variable importance is measured by analyzing
the contribution of each feature component to the total decrease of the objective

function. Binary split CARTs have the advantage that this can be done in an additive
way.
More complex regression models like FN networks can then be analyzed by using
a binary split regression tree as a global surrogate model. That is, we can fit a CART
to the network regression function (as a surrogate model) and then analyze variable
importance in this surrogate regression tree model using the tools of regression trees.
We will not give an explicit example here because we have not formally introduced
regression trees in this manuscript, but this concept is fairly straightforward and
well-understood.

7.6.2 Partial Dependence Plots

There are several graphical tools that study the individual behavior in the feature
components. Some of these tools select individual insurance policies and others
study global portfolio properties. They have in common that they are based on
marginal considerations, i.e., some sort of projection.

Individual Conditional Expectation

Individual conditional expectation (ICE) selects individual insurance policies


(Yi , x i , vi ) and varies the feature components of x i over their entire domain;
we refer to Goldstein et al. [164]. Similarly to the VPI of Sect. 7.6.1, ICE does
not respect collinearity in feature components, but it is rather an isolated view of
individual components.
In Fig. 7.41 we provide the ICE plots of model Poisson GLM3 of Table 5.5 and
the FN network regression model μm=1 of Table 7.9 of 100 randomly selected
insurance policies x i . For these randomly selected insurance policies we let the
variable DrivAge vary over its domain {18, . . . , 90}. Each color corresponds to
one insurance policy i, and the colors in the two plots coincide. In the GLM
we observe that the lines are roughly parallel which reflects that we have an
additive regression structure on the canonical scale (note that these plots are on the
canonical parameter scale). The lines are not perfectly parallel because we allow
for an interaction between DrivAge and BonusMalus in model Poisson GLM3,
see (5.35). The plot of the FN network is more difficult to interpret. Overall the
levels (colors) coincide in the two plots, but in the FN network plot the lines are not
increasing for ages approaching 18, the reason for this is that we have interactions
with other feature components that are important. In particular, for ages close to
18 we cannot have a BonusMalus level of 50% and, therefore, the FN network
cannot be trained on this part of the feature space. Nevertheless, the ICE plot allows
for such feature configurations (by just extrapolating the FN network regression
function beyond the set of available insurance policies). This difficulty is confirmed
by exploiting the same plot only on insurance policies that have a BonusMalus
level of at least 100%. In that case the lines for small ages are non-decreasing when
approaching the age of 18, thus providing a more reasonable interpretation. We
conclude that if we have strong dependence and/or interactions between the feature
components this method may not provide any reasonable interpretations.

Fig. 7.41 ICE plots of 100 randomly selected insurance policies $x_i$ of (lhs) model Poisson GLM3
and (rhs) FN network μm=1 letting the variable DrivAge vary over its domain; the y-axis is on
the canonical parameter scale

Partial Dependence Plot

Partial dependence plots (PDPs) have been introduced by Friedman [141], see also
Zhao–Hastie [405]. PDPs are closely related to the do-operator in causal inference
in statistics; we refer to Pearl [298] and Pearl et al. [299] for the do-operator. A
PDP and the do-operator, respectively, are obtained by breaking the dependence
structure between different feature components. Namely, we decompose the feature
x = (xj , x \j ) into two parts with x \j denoting all feature components except of
component xj ; we will use a slight abuse of notation because the components need
to be permuted correspondingly in the following regression function x → μ(x) =
μ(xj , x \j ). Since, typically, there is dependence between xj and x \j one can infer
x \j from xj , and vice versa. A PDP breaks this inference potential so that the
sensitivity can be studied purely in xj . In particular, the partial dependence profile
is obtained by

$$x_j \;\mapsto\; \bar{\mu}^j(x_j) = \int \mu(x_j, x_{\setminus j}) \, dp(x_{\setminus j}), \qquad (7.67)$$

where $p(x_{\setminus j})$ is the marginal (portfolio) distribution of the feature components $x_{\setminus j}$.
Observe that this differs from the conditional expectation which reads as

$$x_j \;\mapsto\; \mu(x_j) = \mathbb{E}_p\big[ \mu(x_j, X_{\setminus j}) \,\big|\, x_j \big] = \int \mu(x_j, x_{\setminus j}) \, dp(x_{\setminus j} | x_j),$$

the latter allowing for inferring $x_{\setminus j}$ from $x_j$ through the conditional probability
$dp(x_{\setminus j} | x_j)$.
Remark 7.34 (Discrimination-Free Insurance Pricing) Recent actuarial literature
discusses discrimination-free insurance pricing which aims at developing a pricing
framework that is free of discrimination w.r.t. so-called protected characteristics
such as gender and ethnicity; we refer to Guillén [174], Chen et al. [69, 70],
Lindholm et al. [253] and Frees–Huang [136] for discussions on discrimination
in insurance. In general, part of the problem also lies in the fact that one can
often infer the protected characteristics from the non-protected feature information.
This is called indirect discrimination or proxy discrimination. The proposal of
Lindholm et al. [253] for achieving discrimination-free prices exactly follows the
construction (7.67), by breaking the link, which infers the protected characteristics
from the non-protected ones.
The partial dependence profile on our portfolio L with given features $x_1, \ldots, x_n$
is now obtained by just using the portfolio distribution as an empirical distribution
for p in (7.67). That is, for a selected component xj of x, we consider the partial
dependence profile

$$x_j \;\mapsto\; \bar{\mu}^j(x_j) = \frac{1}{n} \sum_{i=1}^n \mu(x_j, x_{i, \setminus j}) = \frac{1}{n} \sum_{i=1}^n \mu\big( x_{i,0}, x_{i,1}, \ldots, x_{i,j-1}, x_j, x_{i,j+1}, \ldots, x_{i,q} \big),$$

thus, we average the ICE plots over $x_{i, \setminus j}$ of our portfolio 1 ≤ i ≤ n.
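
Empirically, this profile can be sketched in R as follows, assuming a fitted model with a predict() method and a feature data frame dat; pdp is our own (hypothetical) helper name.

# empirical partial dependence profile for feature component j on a chosen grid
pdp <- function(model, dat, j, grid) {
  sapply(grid, function(x) {
    dat[, j] <- x                 # break the dependence between x_j and the other components
    mean(predict(model, dat))     # average the ICE plots over the portfolio
  })
}
# e.g., profile <- pdp(model, dat, j = "DrivAge", grid = 18:90)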


Fig. 7.42 PDPs of (lhs) BonusMalus level and (middle) DrivAge; the y-axis is on the
canonical parameter scale; (rhs) ratio of policies with a bonus-malus level of 50% per driver's
age

Figure 7.42 (lhs, middle) gives the PDPs of the variables BonusMalus and
DrivAge of model Poisson GLM3 and the FN network μm=1. Overall they

look reasonable. However, we are again facing the difficulty that these partial
dependence profiles consider feature configurations that should not appear in our
portfolio. Roughly 57% of all insurance policies have a bonus-malus level of 50%,
which means that these drivers did not suffer any claims in the past couple of
years. Obviously a driver of age 18 cannot be on this bonus-malus level, simply
because she/he is not in a state where she/he can have multiple years of driving
experience without an accident. However, the PDP does not respect this fact, and just
extrapolates the regression function into that part of the feature space. Therefore, the
PDP at driver’s age 18 is based on 57% of the insurance policies being on a bonus-
malus level of 50% because this corresponds to the empirical portfolio distribution
p(x \j ) excluding the driver’s age xj = DrivAge information. Figure 7.42 (rhs)
shows the ratio of insurance policies that have a bonus-malus level of 50%. We
observe that this ratio is roughly zero up to age 28 (orange vertical dotted line),
which indicates that a driver needs 10 successive accident-free years to reach the
lowest bonus-malus level (starting from 100%). We consider it to be data error that
this ratio below age 28 is not identically equal to zero. We conclude that these PDPs
need to be interpreted very carefully because the insurance portfolio is not uniformly
distributed across the feature space. In some parts of the feature space the regression
function x → μ(x) may not even be well-defined because certain combinations of
feature values x may not exist (e.g., a driver of age 18 on bonus-malus level 50% or
a boy at a girl’s college).

Accumulated Local Effects Profile

PDPs have the problem that they do not respect the dependencies between the
feature components, as explained in the previous paragraphs. The accumulated local
effects (ALE) profile tries to take account of these dependencies by only studying
a local feature perturbation, we refer to Apley–Zhu [13]. We present a smooth
(gradient-based) version of ALE because our regression functions are differentiable.
Consider the local effect in the individual feature x w.r.t. the component xj by
studying the partial derivative

$$\mu_j(x) = \frac{\partial \mu(x)}{\partial x_j}. \qquad (7.68)$$

The average local effect of component j is received by

$$x_j \;\mapsto\; \Delta_j(x_j; \mu) = \int \mu_j(x_j, x_{\setminus j}) \, dp(x_{\setminus j} | x_j). \qquad (7.69)$$

The ALE approach integrates the average local effects $\Delta_j(\cdot)$ over their domain, and the ALE profile
is defined by

$$x_j \;\mapsto\; \int_{x_j^0}^{x_j} \Delta_j(z_j; \mu) \, dz_j = \int_{x_j^0}^{x_j} \int \mu_j(z_j, x_{\setminus j}) \, dp(x_{\setminus j} | z_j) \, dz_j, \qquad (7.70)$$

where $x_j^0$ is a given initialization point. The difference between PDPs and ALE
is that the latter correctly considers the dependence structure between xj and x \j ,
see (7.69).

Listing 7.10 Local effects through the gradients of FN networks in keras [77]
1 Input = layer_input(shape = c(11), dtype = 'float32', name = 'Design')
2 #
3 Output = Input %>%
4 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
5 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
6 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
7 layer_dense(units=1, activation='linear', name='Network')
8 #
9 model = keras_model(inputs = c(Input), outputs = c(Output))
10 #
11 grad = Output %>%
12 layer_lambda(function(x) k_gradients(model$outputs, model$inputs))
13 model.grad = keras_model(inputs = c(Input), outputs = c(grad))
14 theta.grad <- data.frame(model.grad %>% predict(XX))

Example We come back to our MTPL claim frequency FN network example. The
local effects (7.68) can directly be calculated in the R library keras [77] for a FN
network, see Listing 7.10. In order to do so we need to drop the embedding layers,
compared to Listing 7.4, and directly work on the learned embeddings. This gives
an input layer of dimension q = 7 + 2 + 2 = 11 because we have two categorical
features that have been embedded into 2-dimensional Euclidean spaces R2 . Then,
we can formally calculate the gradient of the FN network w.r.t. its inputs which is
done on lines 11–13 of Listing 7.10. Note that we work on the canonical scale
because we use the linear activation function on line 7 of the listing.
There remain the averaging (7.69) and the integration (7.70) which can be done
empirically by

$$x_j \;\mapsto\; \widehat{\Delta}_j(x_j; \mu) = \frac{1}{|E(x_j)|} \sum_{i \in E(x_j)} \mu_j(x_i), \qquad (7.71)$$

where E(xj) denotes the indices i of all cases $x_i$, 1 ≤ i ≤ n, with $x_{i,j} = x_j$,
assuming discrete feature data observations. Note that this empirical
averaging respects the dependence within x. The (uncentered) ALE profile is then
obtained by aggregating these local effects, that is,

$$x_j \;\mapsto\; \widetilde{\mu}^j(x_j) = \int_{x_j^0}^{x_j} \widehat{\Delta}_j(z_j; \mu) \, dz_j,$$

where this integration is typically understood in a discrete sense because the


observed feature components xi,j are discrete. Often, this uncentered ALE profile is
still translated (centered) by the portfolio average.
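
Assuming the gradients of Listing 7.10 have been evaluated on the portfolio, the empirical ALE profile can be sketched in R as follows; the integration uses a simple left-point rule on the observed grid, the initialization point being the smallest observed value, and the last line centers by the portfolio average.

# empirical ALE profile for component j; x: observed values x_{i,j},
# grad: local effects mu_j(x_i) from Listing 7.10
ale <- function(x, grad) {
  xx      <- sort(unique(x))
  eff     <- sapply(xx, function(z) mean(grad[x == z]))  # average local effects (7.71)
  profile <- cumsum(eff * diff(c(xx[1], xx)))            # discrete integration from x_j^0
  profile - mean(profile[match(x, xx)])                  # centering by the portfolio average
}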
Remarks 7.35
• We have only introduced ALE for continuous feature variables. For nominal
categorical feature components it is not immediately clear how to reasonably
integrate the average local effects $\Delta_j(x_j; \mu)$, and one typically directly analyzes
these average local effects.
• For GLMs the ALEs are rather simple if we work on the canonical scale and
under the canonical link, since

$$\theta_j(x) = \frac{\partial \theta(x)}{\partial x_j} = \beta_j \equiv \Delta_j(x_j; \theta).$$

In the case of model Poisson GLM3 presented in Sect. 5.3.4 the situation is
more delicate as we model the interactions in the GLM as follows, see (5.34)
and (5.35),

$$(\texttt{DrivAge}, \texttt{BonusMalus}) \;\mapsto\; \beta_l \, \texttt{DrivAge} + \beta_{l+1} \log(\texttt{DrivAge}) + \sum_{j=2}^4 \beta_{l+j} \, (\texttt{DrivAge})^j$$
$$\qquad + \beta_{l+5} \, \texttt{BonusMalus} + \beta_{l+6} \, \texttt{BonusMalus} \cdot \texttt{DrivAge} + \beta_{l+7} \, \texttt{BonusMalus} \cdot (\texttt{DrivAge})^2.$$

In that case, though we work with a GLM, the resulting local effects are different
if we calculate the derivatives w.r.t. DrivAge and BonusMalus, respectively,
because we explicitly (manually) include non-linear effects into the GLM.
Figure 7.43 shows the ALE profiles of the variables BonusMalus and
DrivAge. The shapes of these profiles can directly be compared to the PDPs
of Fig. 7.42 (the scale on the y-axis should be ignored because this will depend
on the applied centering, however, we hold on to the canonical scale). The main
difference between these two plots can be observed for the variable DrivAge at
low ages. Namely, the ALE profiles have a different shape at low ages respecting
the dependencies in the feature components by only considering real local feature
configurations.

Fig. 7.43 ALE profiles of (lhs) BonusMalus level and (rhs) DrivAge; the y-axis is on the
log-scale

7.6.3 Interaction Strength

Next we are going to discuss pairwise interaction strength. Friedman–Popescu [143]
made the following proposal. Roughly speaking, there is an interaction between the
two feature components xj and xk of x in the regression function x → μ(x) if

$$\mu_{j,k}(x) = \frac{\partial^2 \mu(x)}{\partial x_j \partial x_k} \neq 0. \qquad (7.72)$$

This means that the magnitude of a change of the regression function μ(x) in xj
depends on the current value of xk. If there is no such interaction, we can additively
decompose the regression function μ(x) into two independent terms. This then
reads as $\mu(x) = \mu_{\setminus j}(x_{\setminus j}) + \mu_{\setminus k}(x_{\setminus k})$. This motivation is now applied to the
PDP profiles given in (7.67). We define the centered versions xj → μ̆j (xj ) and
xk → μ̆k (xk ) of the PDP profiles by centering the PDP profiles xj → μ̄j (xj )
and xk → μ̄k (xk ) over the portfolio values x i , 1 ≤ i ≤ n. Next, we consider an
analogous two-dimensional version for (xj , xk ). Let (xj , xk ) → μ̆j,k (xj , xk ) be the
centered version of a two-dimensional PDP profile (xj , xk ) → μ̄j,k (xj , xk ).
Friedman’s H -statistics measures the pairwise interaction strength between the
components xj and xk , and it is defined by
n  j,k 2
i=1 μ̆ (xi,j , xi,k ) − μ̆j (xi,j ) − μ̆k (xi,k )
2
Hj,k = n j,k 2
, (7.73)
i=1 μ̆ (xi,j , xi,k )

we refer to formula (44) in Friedman–Popescu [143]. While Hj,k 2 measures the

proportion of the joint interaction effect, as we normalize by the variability of


366 7 Deep Learning

n j,k 2
the joint effect i=1 μ̆ (xi,j , xi,k ) , sometimes also the absolute measure is
considered by taking the square root of the enumerator in (7.73). Of course, this
can be extended to interactions of three components, etc., we refer to Friedman–
Popescu [143].
We do not give an example here, because calculating Friedman’s H -statistics
can be computationally demanding if one has many feature components with many
levels in FN network modeling.
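
Nevertheless, once the PDP profiles for a chosen pair (j, k) have been evaluated at the observed feature values, the statistic itself is cheap to compute; a schematic R sketch with hypothetical input vectors reads as follows.

# Friedman's H-statistic (7.73); mu.jk, mu.j, mu.k are the two- and one-dimensional
# PDP profiles evaluated at the observed feature values (x_{i,j}, x_{i,k})
H2 <- function(mu.jk, mu.j, mu.k) {
  mu.jk <- mu.jk - mean(mu.jk)   # centered PDP profiles
  mu.j  <- mu.j  - mean(mu.j)
  mu.k  <- mu.k  - mean(mu.k)
  sum((mu.jk - mu.j - mu.k)^2) / sum(mu.jk^2)
}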

7.6.4 Local Model-Agnostic Methods

The above methods like the PDP and the ALE profile have been analyzing the global
behavior of the regression functions. We briefly mention some tools that describe the
local sensitivity and explanation of regression results.
Probably the most popular method is the locally interpretable model-agnostic
explanation (LIME) introduced by Ribeiro et al. [311]. This analyzes locally the
expected response of a given feature x by perturbing x. In a nutshell, the idea is to
select an environment E(x) ⊂ X of a chosen feature x and to study the regression
function x  → μ(x  ) in this environment x  ∈ E(x). This is done by fitting a
(much) simpler surrogate model to μ on this environment E(x). If the environment
is small, often a linear regression model is chosen. This then allows one to interpret
the regression function μ(·) locally using the simpler surrogate model, and if we
have a high-dimensional feature space, this linear regression is complemented with
LASSO regularization to only select the most important feature components.
The second method considered in the literature is the Shapley additive expla-
nation (SHAP). The SHAP is based on Shapley values [335] which is a method
of allocating rewards to players in cooperative games, where a team of individual
players jointly contributes to a potential success. Shapley values solve this allocation
problem under the requirements of additivity and fairness. This concept can be
translated to analyzing how individual feature components of x contribute to the
total prediction μ(x) of a given case. Shapley values allow one to do such a
contribution analysis in the aforementioned additive and fair way, see Lundberg–Lee
[261]. The calculation of SHAP values is combinatorially demanding and therefore
several approximations have been proposed, many of them having their own caveats,
we refer to Aas et al. [1]. We will not further consider these but refer to the relevant
literature.

7.6.5 Marginal Attribution by Conditioning on Quantiles

The above model-agnostic tools have mainly been studying the sensitivities of the
expected response μ(x) in the feature components of x. This becomes apparent

from considering the partial derivatives (7.68) to calculate the local effects. Alterna-
tively, we could try to understand how the feature components of x contribute to a
given response μ(x), see Ancona et al. [12]; this section follows Merz et al. [273].
The marginal attribution on an input component j of the response μ(x) can be
studied by the directional derivative

$$x_j \;\mapsto\; x_j \mu_j(x) = x_j \frac{\partial \mu(x)}{\partial x_j}. \qquad (7.74)$$

This was first proposed to the data science community by Shrikumar et al. [340].
Basically, it means that we replace the partial derivative μj(x) by the directional
derivative along the vector $x_j e_j = (0, \ldots, 0, x_j, 0, \ldots, 0)^\top \in \mathbb{R}^{q+1}$

$$\lim_{\epsilon \to 0} \frac{\mu(x + \epsilon x_j e_j) - \mu(x)}{\epsilon} = \lim_{\epsilon \to 0} \frac{\mu\big( (1, x_1, \ldots, x_{j-1}, (1+\epsilon)x_j, x_{j+1}, \ldots, x_q)^\top \big) - \mu(x)}{\epsilon} = x_j \mu_j(x),$$

where ej is the (j + 1)-st basis vector in Rq+1 (index j = 0 corresponds to the
intercept component x0 = 1).
We start by recalling the sensitivity analysis of Hong [189] and Tsanakas–
Millossovich [355] in the context of risk measurement. Assume the features have
a portfolio distribution X ∼ p. This describes the random selection of an insurance
policy X = x from the portfolio described by p. The average price over the entire
portfolio is then given by

$$\bar{\mu} = \mathbb{E}_p[\mu(X)] = \int \mu(x) \, dp(x).$$

We implicitly interpret μ(X) = E[Y|X] as the price of the response Y, here,
though we do not need the response distribution in this section. Assume μ(X)
has a continuous distribution function Fμ(X); and we drop the intercept component
X0 = x0 = 1 from these considerations (but we still keep it in the regression
model). This implies that Uμ(X) = Fμ(X)(μ(X)) is uniformly distributed on [0, 1].
Choosing a density ζ on [0, 1] gives us a probability distortion ζ(Uμ(X)) as we have
the normalization

$$\mathbb{E}_p\big[ \zeta(U_{\mu(X)}) \big] = \int_0^1 \zeta(u) \, du = 1.$$

This allows us to define a distorted portfolio price in the sense of a Radon–Nikodým
derivative, namely, we set for the distorted portfolio price

$$\Pi\big( \mu(X); \zeta \big) = \mathbb{E}_p\big[ \mu(X) \, \zeta(U_{\mu(X)}) \big].$$

This functional Π(μ(X); ζ) is a so-called distortion risk measure. Our goal is to
study the sensitivities of this distortion risk measure in the components of X.
Assume existence of the following directional derivatives for all 1 ≤ j ≤ q

$$S_j(\mu; \zeta) = \frac{\partial}{\partial \epsilon}\bigg|_{\epsilon=0} \, \Pi\Big( \mu\big( (1, X_1, \ldots, X_{j-1}, (1+\epsilon)X_j, X_{j+1}, \ldots, X_q)^\top \big) ;\, \zeta \Big).$$

Sj(μ; ζ) can be used to describe the sensitivities of the regression function X →
μ(X) in the feature components Xj. Under different sets of assumptions, Hong
[189] and Tsanakas–Millossovich [355] have proved the following identity

$$S_j(\mu; \zeta) = \mathbb{E}_p\big[ X_j \mu_j(X) \, \zeta(U_{\mu(X)}) \big],$$

the right-hand side exactly uses the marginal attribution (7.74). There remains the
freedom of the choice of the density ζ on [0, 1], which allows us to study the
sensitivities of different distortion risk measures. For the uniform distribution ζ ≡ 1
on [0, 1] we simply have the average (best-estimate) price and its average marginal
attributions

$$\Pi\big( \mu(X); \zeta \equiv 1 \big) = \mathbb{E}_p[\mu(X)] = \bar{\mu} \qquad \text{and} \qquad S_j(\mu; \zeta \equiv 1) = \mathbb{E}_p[X_j \mu_j(X)].$$

If we want to consider a quantile risk measure, called value-at-risk (VaR), we choose
a Dirac measure for the density ζ. That is, choose a point measure of mass 1 in
α ∈ (0, 1), i.e., the density ζ is concentrated in the single point α. In that case, the
event {Fμ(X)(μ(X)) = Uμ(X) = α} receives probability one, and therefore we have
the α-quantile

$$\Pi\big( \mu(X); \alpha \big) = F_{\mu(X)}^{-1}(\alpha),$$

and the corresponding sensitivities for 1 ≤ j ≤ q

$$S_j(\mu; \alpha) = \mathbb{E}_p\Big[ X_j \mu_j(X) \,\Big|\, \mu(X) = F_{\mu(X)}^{-1}(\alpha) \Big]. \qquad (7.75)$$

Remarks 7.36
• In the introduction to this section we have assumed that μ(X) has a continuous
distribution function. This emphasizes that this sensitivity analysis is most
suitable for continuous feature components. Categorical and discrete feature
components can be embedded into a Euclidean space, e.g., using embedding
layers, and then they can be treated as continuous variables.
• Sensitivities (7.75) respect the local portfolio structure as they are calculated
w.r.t. p.
• In applications, we will work with the empirical portfolio distribution for p
provided by $(x_i)_{1 \le i \le n}$. This gives an empirical approximation to (7.75) and,
in particular, it will require a choice of a bandwidth for the evaluation of the
conditional probability, conditioned on the event $\{\mu(X) = F^{-1}_{\mu(X)}(\alpha)\}$. This is
done with a local smoother similarly to Listing 7.8; a small sketch of this
empirical approximation is given after these remarks.
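The following R lines give a minimal sketch of this empirical approximation of the
sensitivities (7.75); they are not from the book's own code. It is assumed that mu_hat
collects the fitted values $\mu(x_i)$, X the feature matrix (without intercept) and grad
the matrix of partial derivatives $\mu_j(x_i)$; the Gaussian kernel bandwidth h is a
tuning choice.

# empirical version of (7.75) with a Gaussian kernel as local smoother
empirical_Sj <- function(X, grad, mu_hat, alpha, h = 0.05) {
  U <- rank(mu_hat) / length(mu_hat)   # empirical U = F_{mu(X)}(mu(X))
  w <- dnorm((U - alpha) / h)          # kernel weights localizing at level alpha
  w <- w / sum(w)
  colSums(w * (X * grad))              # weighted means of x_{i,j} * mu_j(x_i)
}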
In analogy to Merz et al. [273] we give a different interpretation to the
sensitivities (7.75), which allows us to further expand this formula. We have the 1st
order Taylor expansion

$$\mu(x + \epsilon) = \mu(x) + (\nabla_x \mu(x))^\top \epsilon + o(\|\epsilon\|_2) \qquad \text{for } \|\epsilon\|_2 \to 0.$$

Obviously, this is a local approximation in x. Setting $\epsilon = -x$, we get the (possibly
crude) approximation

$$\mu(0) \approx \mu(x) - (\nabla_x \mu(x))^\top x.$$

By bringing the gradient term to the other side, using (7.75) and conditionally
averaging, we receive the 1st order marginal attributions

$$F^{-1}_{\mu(X)}(\alpha) = E_p\Big[\mu(X)\, \Big|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\Big] \approx \mu(0) + \sum_{j=1}^q S_j(\mu; \alpha). \qquad (7.76)$$

Thus, the sensitivities $S_j(\mu; \alpha)$ provide a 1st order description of the quantiles
$F^{-1}_{\mu(X)}(\alpha)$ of μ(X). We call this approach marginal attribution by conditioning on
quantiles (MACQ) because it shows how the components $X_j$ of X contribute to a
given quantile level.
Example 7.37 (MACQ for Linear Regression) The simplest case is the linear
regression case because the 1st order marginal attributions (7.76) are exact in this
case. Consider a linear regression function with regression parameter $\beta \in \mathbb{R}^{q+1}$

$$x \mapsto \mu(x) = \langle \beta, x \rangle = \beta_0 + \sum_{j=1}^q \beta_j x_j.$$

The 1st order marginal attributions for fixed α ∈ (0, 1) are given by

$$F^{-1}_{\mu(X)}(\alpha) = \mu(0) + \sum_{j=1}^q S_j(\mu; \alpha)
= \beta_0 + \sum_{j=1}^q \beta_j\, E_p\Big[X_j\, \Big|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\Big]. \qquad (7.77)$$

That is, we replace the feature components $X_j$ by their expected contributions on
a given quantile level $F^{-1}_{\mu(X)}(\alpha)$ in (7.77). We compare this explanation to the ALE
profile (7.70). Setting the initial value $x_j^0 = 0$, the ALE profile for the linear regression
model is given by

$$x_j \mapsto \int_0^{x_j} \Delta_j(z_j)\, dz_j = \beta_j x_j.$$

This is the sensitivity of the linear regression function in component $x_j$,
whereas (7.77) describes the contribution of each feature component to an expected
response level μ(x); in particular, $E_p[X_j\,|\,\mu(X) = F^{-1}_{\mu(X)}(\alpha)]$ describes the average
feature value in component j on a given quantile level. A small numerical sanity
check of (7.77) is sketched below. □
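The following toy simulation is a minimal numerical sanity check of (7.77); it is
not part of the book's own code, and the chosen dimensions and parameters are
arbitrary. The conditional expectation is approximated by averaging over the
policies whose rank of μ(X) falls into a small window around the α-quantile.

# sanity check of (7.77) in a Gaussian linear toy model
set.seed(1)
n <- 1e5; q <- 3
X <- matrix(rnorm(n * q), n, q)
beta <- c(0.5, 1, -2, 0.7)                   # (beta_0, beta_1, ..., beta_q)
mu <- as.vector(beta[1] + X %*% beta[-1])
alpha <- 0.9; h <- 0.01
idx <- abs(rank(mu) / n - alpha) < h         # policies near the alpha-quantile
quantile(mu, alpha)                          # lhs of (7.77)
beta[1] + sum(beta[-1] * colMeans(X[idx, , drop = FALSE]))   # rhs of (7.77)

The two printed numbers should agree up to the Monte Carlo and bandwidth error.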

A natural next step is to expand the 1st order attributions to 2nd orders. This
allows us to consider the interaction terms. Consider the 2nd order Taylor expansion

$$\mu(x + \epsilon) = \mu(x) + (\nabla_x \mu(x))^\top \epsilon + \frac{1}{2}\, \epsilon^\top \nabla_x^2 \mu(x)\, \epsilon + o(\|\epsilon\|_2^2) \qquad \text{for } \|\epsilon\|_2 \to 0.$$

Similar to (7.76), setting $\epsilon = -x$, this gives us the 2nd order marginal attributions

$$F^{-1}_{\mu(X)}(\alpha) \approx \mu(0) + \sum_{j=1}^q S_j(\mu; \alpha) - \frac{1}{2} \sum_{j,k=1}^q T_{j,k}(\mu; \alpha) \qquad (7.78)$$
$$= \mu(0) + \sum_{j=1}^q \Big( S_j(\mu; \alpha) - \frac{1}{2}\, T_{j,j}(\mu; \alpha) \Big) - \sum_{1 \le j < k \le q} T_{j,k}(\mu; \alpha),$$

where for $1 \le j, k \le q$ we define $\mu_{j,k}(x) = \partial_{x_j} \partial_{x_k} \mu(x)$, see (7.72), and

$$T_{j,k}(\mu; \alpha) = E_p\Big[X_j X_k\, \mu_{j,k}(X)\, \Big|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\Big]. \qquad (7.79)$$
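In analogy to the sketch after (7.75), the interaction attributions (7.79) can be
approximated empirically; the following R lines are a minimal sketch under the
assumption that hess is an n × q × q array holding the second derivatives
$\mu_{j,k}(x_i)$ (for networks these can be obtained, e.g., by automatic differentiation
or numerically).

# empirical analog of (7.79) with a Gaussian kernel as local smoother
empirical_Tjk <- function(X, hess, mu_hat, alpha, h = 0.05) {
  U <- rank(mu_hat) / length(mu_hat)
  w <- dnorm((U - alpha) / h)
  w <- w / sum(w)
  q <- ncol(X)
  T_mat <- matrix(0, q, q)
  for (i in seq_along(w)) {       # weighted average of (x_i x_i^T) * Hessian
    T_mat <- T_mat + w[i] * (X[i, ] %o% X[i, ]) * hess[i, , ]
  }
  T_mat
}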

Remarks 7.38
• The first line of (7.78) separates the 1st order attributions from the 2nd order
attributions, the second line splits w.r.t. the individual component j attributions
and the interaction attributions $j \neq k$.
• The 1st order attributions (7.75) have been motivated by considering the direc-
tional derivatives of the VaR distortion risk measure. Unfortunately, the 2nd order
consideration has no simple equivalent motivation, as the 2nd order directional
derivatives are much more involved, even in the linear case; we refer to Property
1 in Gourieroux et al. [167].
• Interestingly, we can precisely evaluate the accuracy of approximation (7.78) by
analyzing for a given regression function μ(·)

$$\sup_{\alpha \in (0,1)} \left| F^{-1}_{\mu(X)}(\alpha) - \mu(0) - \sum_{j=1}^q S_j(\mu; \alpha) + \frac{1}{2} \sum_{j,k=1}^q T_{j,k}(\mu; \alpha) \right|. \qquad (7.80)$$

Intuitively, in order to have a uniformly good approximation, the origin 0 should be
somehow centered in the feature distribution X ∼ p. This will be studied next.
Above we have implicitly assumed that 0 is a suitable reference point that makes
the approximation error (7.80) small. For FN network fitting we typically normalize
the features either using the MinMaxScaler (7.29) or we center and normalize the
components of (x i )1≤i≤n according to (7.30). That is, the reference point is chosen
such that the gradient descent fitting works efficiently. However, this may not be
an optimal reference point for studying the 2nd order attributions. Therefore, we
analyze this question in more detail, and the following reparametrization can still be
done after model fitting.
If we choose an arbitrary translation $a \in \mathbb{R}^q$, we can set $\epsilon = a - x$ in the
above 2nd order Taylor expansion to receive another 2nd order marginal attribution
representation

$$F^{-1}_{\mu(X)}(\alpha) \approx \mu(a) - E_p\Big[(a - X)^\top \nabla_x \mu(X)\, \Big|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\Big] \qquad (7.81)$$
$$\qquad\qquad - \frac{1}{2}\, E_p\Big[(a - X)^\top \nabla_x^2 \mu(X)\, (a - X)\, \Big|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\Big].$$

Essentially, this means that we shift the feature distribution p by considering the
shifted random vectors $X_a = X - a$ while setting $\mu_a(\cdot) = \mu(a + \cdot)$; thus, this
simply says that we pre-process the features differently. In view of
approximation (7.81) we can now select a reference point $a \in \mathbb{R}^q$ that makes the 2nd
order marginal attributions as precise as possible. Define the events $A_l = \{\mu(X) =
F^{-1}_{\mu(X)}(\alpha_l)\}$ for a discrete quantile grid $0 < \alpha_1 < \ldots < \alpha_L < 1$. We define the
objective function

$$a \mapsto G(a; \mu) = \sum_{l=1}^L \bigg( F^{-1}_{\mu(X)}(\alpha_l) - \mu(a) + E_p\Big[(a - X)^\top \nabla_x \mu(X)\, \Big|\, A_l\Big] \qquad (7.82)$$
$$\qquad\qquad + \frac{1}{2}\, E_p\Big[(a - X)^\top \nabla_x^2 \mu(X)\, (a - X)\, \Big|\, A_l\Big] \bigg)^2.$$

Making this objective function G(a; μ) small in a will provide us with a good
reference point for the selected quantile levels $(\alpha_l)_{1 \le l \le L}$; this is exactly the MACQ

proposal of Merz et al. [273]. A local minimum can be found by applying a gradient
descent algorithm

$$a^{(t)} \mapsto a^{(t+1)} = a^{(t)} - \delta_{t+1}\, \nabla_a G(a^{(t)}; \mu),$$

for tempered learning rates $\delta_{t+1} > 0$. The gradient of G w.r.t. a is given by

$$\nabla_a G(a; \mu) = 2 \sum_{l=1}^L \bigg( F^{-1}_{\mu(X)}(\alpha_l) - \mu(a) + E_p\Big[(a - X)^\top \nabla_x \mu(X)\, \Big|\, A_l\Big]
+ \frac{1}{2}\, E_p\Big[(a - X)^\top \nabla_x^2 \mu(X)\, (a - X)\, \Big|\, A_l\Big] \bigg)$$
$$\qquad \times \bigg( -\nabla_a \mu(a) + E_p\big[\nabla_x \mu(X)\, \big|\, A_l\big] - E_p\big[\nabla_x^2 \mu(X)\, X\, \big|\, A_l\big] + E_p\big[\nabla_x^2 \mu(X)\, \big|\, A_l\big]\, a \bigg).$$

All subsequent considerations and interpretations are done w.r.t. an optimal ref-
erence point a ∈ Rq by minimizing the objective function (7.82) on the chosen
quantile grid. Mathematically speaking, this optimal choice is w.l.o.g. because the
origin 0 of the coordinate system of the feature space X is arbitrary, and any
other origin can be chosen by a translation, see formula (7.81) and the subsequent
discussion. For interpretations, however, the choice of the reference point a matters
because the directional derivative Xj μj (X) can be small either because Xj is small
or because μj (X) is small. Having a small Xj means that this feature value is close
to the chosen reference point.
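A minimal R sketch of this search for the reference point a is given next; it is not
the book's own implementation. It assumes a user-supplied function grad_G(a)
evaluating $\nabla_a G(a; \mu)$ of the last display, e.g., with the conditional expectations
approximated by the local smoothers sketched after (7.75) and (7.79), and q denotes
the feature dimension.

# gradient descent for the reference point a with tempered learning rates
a <- rep(0, q)                  # start in the origin of the feature space
for (t in 1:200) {
  delta <- 0.1 / sqrt(t)        # tempered learning rate delta_{t+1}
  a <- a - delta * grad_G(a)
}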
Example 7.39 (MACQ Analysis) We revisit the MTPL claim frequency example
using the FN network regression model of depth d = 3 having (q1 , q2 , q3 ) =
(20, 15, 10) neurons. Importantly, we use the hyperbolic tangent as the activation
function in the FN layers which provides smoothness of the regression function.
Figure 7.40 shows the VPI plot of this fitted model. Obviously, the variable
BonusMalus plays the most important role in this predictive model. Remark that
the VPI plot does not properly respect the dependence structure in the features as it
independently permutes each feature component at a time. The aim in this example
is to determine variable importance by doing the MACQ analysis (7.78).
Figure 7.44 (lhs) shows the empirical density of the fitted canonical parameter
$\hat{\theta}(x_i)$, $1 \le i \le n$; all plots in this example refer to the canonical scale. We then
minimize the objective function (7.82) which provides us with an optimal reference
point $a \in \mathbb{R}^q$; we choose the equidistant quantile grid $1\% < 2\% < \ldots < 99\%$
and all conditional expectations in ∇a G(a; μ) are empirically approximated by a
local smoother similar to Listing 7.8. Figure 7.44 (rhs) gives the resulting marginal
attributions w.r.t. this reference point. The orange line shows the 1st order marginal
attributions (7.76), and the red line the 2nd order marginal attributions (7.78). The
cyan line drops the interaction terms $T_{j,k}(\mu; \alpha)$, $j \neq k$, from the 2nd order marginal
attributions. From the shaded cyan area we see the importance of the interaction
terms. We note that the 2nd order marginal attributions (red line) match the true
empirical quantiles (black dots) quite well for the chosen reference point a.

Fig. 7.44 (lhs) Empirical density of the fitted canonical parameter $\hat{\theta}(x_i)$, $1 \le i \le n$; (rhs) 1st
and 2nd order marginal attributions

Fig. 7.45 (lhs) Second order marginal attributions $S_j(\mu; \alpha) - \frac{1}{2} T_{j,j}(\mu; \alpha)$ excluding interaction
terms, and (rhs) interaction terms $-\frac{1}{2} T_{j,k}(\mu; \alpha)$, $j \neq k$
Figure 7.45 gives the 2nd order marginal attributions $S_j(\mu; \alpha) - \frac{1}{2} T_{j,j}(\mu; \alpha)$ of
the individual components $1 \le j \le q$ on the left-hand side, and the interaction terms
$-\frac{1}{2} T_{j,k}(\mu; \alpha)$, $j \neq k$, on the right-hand side. We identify the following components
as being important: BonusMalus, DrivAge, VehGas, VehBrand and Region;
these components show a behavior substantially different from being equal to 0, i.e.,
these components differentiate from the reference point a. These components also
have major interactions that contribute to the quantiles above the level 80%.
If we allocate the interaction terms to the corresponding components $1 \le j \le q$,
we receive the second order marginal attributions $S_j(\mu; \alpha) - \frac{1}{2} \sum_{k=1}^q T_{j,k}(\mu; \alpha)$.
These are illustrated in Fig. 7.46 (lhs) and the quantile slices at the levels α ∈
{20%, 40%, 60%, 80%} are given in Fig. 7.46 (rhs). These graphs illustrate variable
importance on different quantile levels (and respect the dependence within
the features). In particular, we identify the main variables that distinguish the
given quantile levels from the reference level θ(a), i.e., Fig. 7.46 (rhs) should be
understood as showing the relative differences to the chosen reference level. Once more we
see that BonusMalus is the main driver, but also other variables contribute to the
differentiation of the high quantile levels.

Fig. 7.46 (lhs) Second order marginal attributions $S_j(\mu; \alpha) - \frac{1}{2} \sum_{k=1}^q T_{j,k}(\mu; \alpha)$ including
interaction terms, and (rhs) slices at the quantile levels α ∈ {20%, 40%, 60%, 80%}
Figure 7.47 shows the individual attributions $x_{i,j} \mu_j(x_i)$ of 1'000 randomly
selected cases $x_i$ for the feature components j = BonusMalus, DrivAge,
VehGas, VehBrand; the colors illustrate the corresponding feature values $x_{i,j}$
of the individual car drivers i, and the black solid line corresponds to $S_j(\mu; \alpha) -
\frac{1}{2} T_{j,j}(\mu; \alpha)$ excluding the interaction terms (the black dotted line is one empir-
ical standard deviation around the black solid line). Focusing on the variable
BonusMalus we observe that the lower quantiles are almost completely domi-
nated by insurance policies on the lowest bonus-malus level. The bonus-malus levels
70–80 provide little sensitivity (are concentrated around the zero line) because the
reference point a reflects these bonus-malus levels, and, finally, the large quantiles
are dominated by high bonus-malus levels (red dots).
The plot of the variable DrivAge is interpreted similarly. The reference point
a is close to the young drivers, therefore, young drivers are concentrated around
the zero line. At the low quantile levels, higher ages contribute positively to the
low expected frequencies, whereas these ages have an unfavorable impact at higher
quantile levels (this should be considered in combination with their bonus-malus
levels). We also observe a few outliers in this plot, for instance, we can identify a
driver of age 20 at a quantile level of 20%. Further inspection of this driver raises
some doubts whether this data is correct, since this driver is at a bonus-malus level
of 68% (which should technically not be possible) and she/he has an exposure of 2
days. Surely, this insurance policy would need further investigation.

Fig. 7.47 Individual attributions $x_{i,j} \mu_j(x_i)$ of 1'000 randomly selected cases $x_i$ for j =
BonusMalus, DrivAge, VehGas, VehBrand; the plots have different y-scales
The plot of VehGas shows that the chosen reference level θ (a) is closer to
Diesel fuel cars as the red dots fluctuate less around the zero line; in different
runs of the gradient descent algorithm (with different seeds) this order has been
changing (as it depends on the reference point a). We skip a detailed analysis of the
variable VehBrand. 

7.7 Lab: Analysis of the Fitted Networks

In the previous section we have studied some model-agnostic tools that can be used
for any (differentiable) regression model. In this section we give some network
specific plots. For simplicity we choose one specific example, namely, the FN
network $\hat{\mu} \stackrel{\text{def.}}{=} \hat{\mu}_{m=1}$ of Table 7.9. We start by analyzing the learned representations
in the different FN layers; this links to our introduction in Sect. 7.1.
For any FN layer 1 ≤ m ≤ d we can study the learned representations
z(m:1) (x). For Fig. 7.48 we select at random 1’000 insurance policies x i , and the
dots show the activations of these insurance policies in neurons j = 4 (x-axis)
and j = 9 (y-axis) in the corresponding FN layers. These neuron activations are
in the interval (−1, 1) because we work with the hyperbolic tangent activation
function for φ. The color scale shows the resulting estimated frequencies μ(x i ) of
the selected policies. We observe that the layers are increasingly (in the depth of the
network) separating the low frequency policies (light blue-green colors) from the
high frequency policies (red color). This is a quite typical picture that we obtain
here, though, this sparsity in the 3rd FN layer is not the case for every neuron
1 ≤ j ≤ qd .
In higher dimensional FN architectures it will be difficult to analyze the learned
representations on each individual neuron, but at least one can try to understand
the main effects learned. For this, on the one hand, we can focus on the important
feature components, see, e.g., Sect. 7.6.1, and, on the other hand, we can try to study
the main effects learned using a PCA in each FN layer, see Sect. 7.5.3. Figure 7.49
shows the singular values λ1 ≥ λ2 ≥ . . . ≥ λqm > 0 in each of the three FN layers
1 ≤ m ≤ d = 3; we center the neuron activations to mean zero before applying
the SVD. These plots support the previously made statement that the layers are
increasingly separating the high frequency from the low frequency policies. An
elbow criterion tells us that in the first FN layer we have 8 important principal
components (out of 20), in the second FN layer 3 (out of 15) and in the third FN
layer 1 (out of 10). This is also reflected in Fig. 7.48 where we see more and more
concentration in the neuron activations.

Fig. 7.48 Observed activations in the three FN layers m = 1, 2, 3 (left–middle–right) in the
corresponding neurons j = 4, 9; the color key shows the estimated frequencies $\hat{\mu}(x_i)$

Fig. 7.49 Singular values $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_{q_m} > 0$ in the FN layers $1 \le m \le d = 3$

It is important to notice that the chosen


FN network calibration μ does not involve any drop-out layers during the gradient
descent fitting, see Sect. 7.4.1. Drop-out layers prevent individual neurons to over-
train to a specific task. Consequently, we will receive a network calibration that is
more equally balanced across all neurons under drop-outs, because if one neuron
drops out, the composite of the remaining neurons needs to be able to take over the
task of the dropped out neuron. This leads to less sparsity and to singular values that
are more similarly sized.
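The singular values of Fig. 7.49 can be extracted from the fitted keras model with
a few lines of R; the following is a minimal sketch, not the book's own code. It
assumes that the FN layers of the fitted model carry the (hypothetical) names
'FNLayer1', 'FNLayer2', 'FNLayer3' and that X holds the pre-processed design
matrix.

# read out the neuron activations z^(m:1)(x_i) and compute their singular values
library(keras)
for (m in 1:3) {
  zz <- keras_model(inputs  = model$input,
                    outputs = get_layer(model, paste0('FNLayer', m))$output)
  Z  <- predict(zz, X)                            # activations in FN layer m
  Z  <- scale(Z, center = TRUE, scale = FALSE)    # center before the SVD
  print(svd(Z)$d)                                 # singular values of FN layer m
}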
In Fig. 7.50 we analyze the first two principal components in each FN layer,
thus, these are the two principal components that correspond to the two biggest
singular values (λ1 , λ2 ) in each of the three FN layers. The first row shows the
input variables (BonusMalus, DrivAge) ∈ [50, 125] × [18, 90] of the 1’000
randomly selected policies x i ; these are the two most important feature components
according to the VPI analysis. All three columns show the same data, however, in
different color scales: (lhs) gives the color scale
μ, (middle) gives the color scale
BonusMalus, and (rhs) gives the color scale DrivAge. These color scales also
apply to the other rows. The 2nd row shows the first two principal components in
the 1st FN layer, the 3rd row in the 2nd FN layer, and the last row in the third
FN layer. Focusing on the first column we observe that the layers cluster the high
and the low frequency policies in the 1st principal component more and more
across the FN layers. Not surprisingly this leads to a quite clear-cut separation
w.r.t. the bonus-malus level which can be verified from the second column of
Fig. 7.50. For the driver’s age variable this sharp separation gets lost across the
layers, see third column of Fig. 7.50, which indicates that the variable DrivAge
does not influence the frequency monotonically and it interacts with the variable
BonusMalus.
Figure 7.51 shows the second order marginal attributions (7.78) for the different
inputs. The graph on the left-hand side shows the plot w.r.t. the original inputs
x i , the graph in the middle w.r.t. the learned representations z(1:1)(x i ) ∈ Rq1
in the first FN layer, and on the right-hand side w.r.t. the learned representations
z(2:1) (x i ) ∈ Rq2 in the second FN layer. We interpret these plots as follows: the
FN network disentangles the different effects through the FN layers by making
the plots more smooth and by making the interactions between the neurons smaller.
Note that the learned representations $z^{(3:1)}(x_i) \in \mathbb{R}^{q_3}$ in the last FN layer go into
a classical GLM for the output layer, which does not have any interactions in the
canonical predictor (because it is additive on the canonical scale), thus, being of
the same type as the linear regression of Example 7.37. In the Poisson model with
the log-link function, the interactions can only be of a multiplicative type in GLMs.
Therefore, the network feature-engineers the input $x_i$ (in an automated way) such
that the learned representation $z^{(d:1)}(x_i)$ in the last FN layer is exactly in this GLM
structure. This is verified by the small interaction part in Fig. 7.51 (rhs). This closes
this part on model-agnostic tools.

Fig. 7.50 (First row) Input variables (BonusMalus, DrivAge), (second to fourth row) first two
principal components in the FN layers m = 1, 2, 3; (lhs) gives the color scale of the estimated
frequency $\hat{\mu}$, (middle) gives the color scale BonusMalus, and (rhs) gives the color scale DrivAge

Fig. 7.51 Second order marginal attributions: (lhs) w.r.t. the input layer $x \in \mathbb{R}^{q_0}$, (middle)
w.r.t. the first FN layer $z^{(1:1)}(x) \in \mathbb{R}^{q_1}$, and (rhs) w.r.t. the second FN layer $z^{(2:1)}(x) \in \mathbb{R}^{q_2}$

Chapter 8
Recurrent Neural Networks

Chapter 7 has discussed fully-connected feed-forward neural (FN) networks. Feed-


forward means that information is passed in a directed acyclic path from the input
layer to the output layer. A natural extension is to allow these networks to have
cycles. In that case, we call the architecture a recurrent neural (RN) network. A RN
network architecture is particularly useful for time-series modeling. The discussion
on time-series data also links to Sect. 5.8.1 on longitudinal and panel data. RN
networks were introduced in the 1980s, and the two most popular RN network
architectures are the long short-term memory (LSTM) architecture proposed by
Hochreiter–Schmidhuber [188] and the gated recurrent unit (GRU) architecture
introduced by Cho et al. [76]. These two architectures will be described in detail
in this chapter.

8.1 Motivation for Recurrent Neural Networks

We start from a deep FN network providing the regression function, see (7.2)–(7.3),

$$x \mapsto \mu(x) = g^{-1}\big\langle \beta, z^{(d:1)}(x) \big\rangle, \qquad (8.1)$$

with a composition $z^{(d:1)}$ of d FN layers $z^{(m)}$, $1 \le m \le d$, link function g and with
output parameter $\beta \in \mathbb{R}^{q_d+1}$. In principle, we could directly use this FN network
architecture for time-series forecasting. We explain here why this is not the best
option to deal with time-series data.
Assume we want to predict a random variable YT +1 at time T ≥ 0 based on the
time-series information x 0 , x 1 , . . . , x T . This information is assumed to be available
at time T for predicting the response YT +1 . The past response information Yt , 1 ≤


t ≤ T , is typically included in x t .1 Using the above FN network architecture we


could directly try to predict YT +1 , based on this past information. Therefore, we
define the feature information $x_{0:T} = (x_0, \ldots, x_T)$ and we aim at designing a FN
network (8.1) for modeling

$$x_{0:T} \mapsto \mu_T(x_{0:T}) = E[Y_{T+1} | x_{0:T}] = E[Y_{T+1} | x_0, \ldots, x_T].$$

In principle we could work with such an approach, however, it has a couple


of severe drawbacks. Obviously, the length of the feature vector x 0:T depends
on time T , that is, it will grow with every time step. Therefore, the regression
function (network architecture) x 0:T → μT (x 0:T ) is time-dependent. Consequently,
with this approach we have to fit a network for every T . This deficiency can be
circumvented if we assume a Markov property that does not require carrying
forward the whole past history. Assume that it is sufficient to consider a history of
a certain length. Choose τ ≥ 0 fixed, then, for T ≥ τ , we can set for the feature
information x T −τ :T = (x T −τ , . . . , x T ), which has a fixed length τ + 1 ≥ 1, now.
In this situation we could try to design a FN network

$$x_{T-\tau:T} \mapsto \mu(x_{T-\tau:T}) = E[Y_{T+1} | x_{T-\tau:T}] = E[Y_{T+1} | x_{T-\tau}, \ldots, x_T].$$

This network regression function can be chosen independent of T since the relevant
history x T −τ :T always has the same length τ + 1. The time variable T could be used
as a feature component in x T −τ :T . The disadvantage of this approach is that such
a FN network architecture does not respect the temporal causality. Observe that we
feed the past history into the first FN layer

$$x_{T-\tau:T} \mapsto z^{(1)}(x_{T-\tau:T}) \in \{1\} \times \mathbb{R}^{q_1}.$$

This operation typically does not respect any topology in the time index of
x T −τ +1:T . Thus, the FN network does not recognize that the feature x t −1 has been
experienced just before the next feature x t . For this reason we are looking for a
network architecture that can handle the time-series information in a temporal causal
way.

1 More mathematically speaking, we assume to have a filtration $(\mathcal{A}_t)_{t \ge 0}$ on the probability space
$(\Omega, \mathcal{A}, P)$. The basic assumption then is that both sequences $(x_t)_t$ and $(Y_t)_t$ are $(\mathcal{A}_t)_t$-adapted, and
we aim at predicting $Y_{T+1}$, based on the information $\mathcal{A}_T$. In the above case this information $\mathcal{A}_T$ is
generated by $x_0, x_1, \ldots, x_T$, where $x_t$ typically includes the observation $Y_t$. We could also shift
the time index in $x_t$ by one time unit, and in that case we would assume that $(x_t)_t$ is previsible
w.r.t. the filtration $(\mathcal{A}_t)_t$. We do not consider this shift in the time index as it only makes the notation
unnecessarily more complicated, but the results remain the same by including the information
correspondingly into the features.

8.2 Plain-Vanilla Recurrent Neural Network

8.2.1 Recurrent Neural Network Layer

We explain the basic idea of RN networks in a shallow network architecture, and


deep network architectures will be discussed in Sect. 8.2.2, below. We start from the
time-series input variable x 0:T = (x 0 , . . . , x T ), all components having the same
structure x t ∈ X ⊂ {1} × Rq0 , 0 ≤ t ≤ T . The aim is to design a network
architecture that allows us to predict the random variable YT +1 , based on this time-
series information x 0:T .
The main idea is to feed one component x t of the time-series x 0:T at a time into
the network, and at the same time we use the output zt −1 of the previous loop as
an input for the next loop. This variable zt −1 carries forward a memory of the past
variables x 0:t −1 . We explain this with a single RN layer having q1 ∈ N neurons. A
RN layer is given (recursively) by a mapping, t ≥ 1,

$$z^{(1)} : \{1\} \times \mathbb{R}^{q_0} \times \mathbb{R}^{q_1} \to \mathbb{R}^{q_1}, \qquad
(x_t, z_{t-1}) \mapsto z_t = z^{(1)}(x_t, z_{t-1}), \qquad (8.2)$$

where the RN layer z(1) has the same structure as the FN layer given in (7.5), but
based on feature input (x t , zt −1 ) ∈ X × Rq1 ⊂ {1} × Rq0 × Rq1 , and not including
an intercept component {1} in the output.

More formally, a RN layer with activation function φ is a mapping

$$z^{(1)} : \{1\} \times \mathbb{R}^{q_0} \times \mathbb{R}^{q_1} \to \mathbb{R}^{q_1} \qquad (8.3)$$
$$(x, z) \mapsto z^{(1)}(x, z) = \Big( z_1^{(1)}(x, z), \ldots, z_{q_1}^{(1)}(x, z) \Big)^\top,$$

having neurons, $1 \le j \le q_1$,

$$z_j^{(1)}(x, z) = \phi\Big( \big\langle w_j^{(1)}, x \big\rangle + \big\langle u_j^{(1)}, z \big\rangle \Big), \qquad (8.4)$$

for given network weights $w_j^{(1)} \in \mathbb{R}^{q_0+1}$ and $u_j^{(1)} \in \mathbb{R}^{q_1}$.

Thus, the FN layers (7.5)–(7.6) and the RN layers (8.3)–(8.4) are structurally
equivalent, only the input x ∈ X is adapted to the time-series structure (x t , zt −1 ) ∈
X × Rq1 . Before giving more interpretation and before explaining how this single
RN network structure can be extended to a deep RN network we illustrate this RN
layer.

Fig. 8.1 RN layer $z^{(1)}$ processing the input $(x_t, z_{t-1})$

Fig. 8.2 Unfolded representation of the RN layer $z^{(1)}$ processing the input $(x_t, z_{t-1})$

Figure 8.1 shows an RN layer $z^{(1)}$ processing the input $(x_t, z_{t-1})$, see (8.2). From
this graph, the recurrent structure becomes clear since we have a loop (cycle) feeding
the output $z_t$ back into the RN layer to process the next input $(x_{t+1}, z_t)$.
Often one depicts the RN architecture in a so-called unfolded way. This is done
in Fig. 8.2. Instead of plotting the loop (cycle) as in Fig. 8.1 (orange arrow in the
colored version), we unfold this loop by plotting the RN layer multiple times. Note
that this RN layer in Fig. 8.2 always uses the same network weights $w_j^{(1)}$ and $u_j^{(1)}$,
$1 \le j \le q_1$, for all t. Moreover, the use of the colors of the arrows (in the colored
version) in the two figures coincides.
Remarks 8.1
• The neurons of the RN layer (8.4) have the following structure

$$z_j^{(1)}(x, z) = \phi\Big( \big\langle w_j^{(1)}, x \big\rangle + \big\langle u_j^{(1)}, z \big\rangle \Big) = \phi\bigg( w_{0,j}^{(1)} + \sum_{l=1}^{q_0} w_{l,j}^{(1)} x_l + \sum_{l=1}^{q_1} u_{l,j}^{(1)} z_l \bigg).$$

The network weights $W^{(1)} = (w_j^{(1)})_{1 \le j \le q_1} \in \mathbb{R}^{(q_0+1) \times q_1}$ include an intercept
component $w_{0,j}^{(1)}$ and the network weights $U^{(1)} = (u_j^{(1)})_{1 \le j \le q_1} \in \mathbb{R}^{q_1 \times q_1}$ do not
include an intercept component, otherwise we would have a redundancy.
• The RN network architecture generates a new process $(z_t)_t$. This process encodes
the part of the past history $(x_{0:t})_t$ which is relevant for forecasting the next step.
Thus, $(z_t)_t$ can be interpreted as a (latent) memory process, or as the process of
learned (relevant) time-series representations, giving us $z_t = z_t(x_{0:t})$.
• The same activation function φ and the same network weights $(w_j^{(1)})_{1 \le j \le q_1}$ and
$(u_j^{(1)})_{1 \le j \le q_1}$ are shared across all time periods t ≥ 0. This means that we assume
a stationary (stochastic) process.
• The upper index (1) indicates the fact that this is the first (and single) RN layer
in this example. In this sense, Figs. 8.1 and 8.2 show a shallow RN network. In
the next section we are going to discuss deep RN networks, and below we are
also going to discuss how the output is modeled, i.e., how the response $Y_{T+1}$ is
predicted based on the pre-processed features $(z_t)_{0 \le t \le T} \in \mathbb{R}^{q_1 \times (T+1)}$.

8.2.2 Deep Recurrent Neural Network Architectures

There are many different ways of extending a shallow RN network to a deep RN
network. Assume we want to model a RN network of depth d ≥ 2. A first (obvious)
way of receiving a deep RN network architecture is

$$z_t^{[1]} = z^{(1)}\big( x_t, z_{t-1}^{[1]} \big) \in \mathbb{R}^{q_1}, \qquad (8.5)$$
$$z_t^{[m]} = z^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]} \big) \in \mathbb{R}^{q_m} \qquad \text{for } 2 \le m \le d, \qquad (8.6)$$

where all RN layers $z^{(m)}$, $1 \le m \le d$, are of type (8.3)–(8.4), and additionally we
include an intercept component in the RN layers $z^{(m)}$, $2 \le m \le d$. We add the
upper indices (in square brackets [·]) to the time-series $(z_t^{[m]})_t$ to indicate which
RN layer outputs these learned representations (memory processes). In fact, we
could also write $z_t^{[m:1]}$ instead of $z_t^{[m]}$, because in $z_t^{[m:1]}$ the feature input $x_{0:t}$ has
been processed through m RN layers $z^{(1)}, \ldots, z^{(m)}$. For simplicity, we just use the
notation $z_t^{[m]} = z_t^{[m]}(x_{0:t})$.

We are going to use the following abbreviation for a RN layer m ≥ 1

$$z_t^{[m]} = z^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]} \big) = \phi\Big( \big\langle W^{(m)}, z_t^{[m-1]} \big\rangle + \big\langle U^{(m)}, z_{t-1}^{[m]} \big\rangle \Big) \in \mathbb{R}^{q_m}, \qquad (8.7)$$

where the weights $W^{(m)} = (w_1^{(m)}, \ldots, w_{q_m}^{(m)}) \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ include the
intercept components, and the weights $U^{(m)} = (u_1^{(m)}, \ldots, u_{q_m}^{(m)}) \in \mathbb{R}^{q_m \times q_m}$
do not include any intercept components. The scalar product is understood
column-wise in the weight matrices $W^{(m)}$ and $U^{(m)}$, and the activation φ is
understood component-wise. Moreover, we initialize with the input $z_t^{[0]} = x_t$.

Fig. 8.3 Unfolded representation of a RN network architecture of depth d = 2

Figure 8.3 shows the RN network architecture of depth d = 2 defined in (8.5)–(8.6).
The dimension of the input $z_t^{[0]} = x_t \in \mathcal{X} \subseteq \{1\} \times \mathbb{R}^{q_0}$ is $q_0 + 1$, the first RN layer
has $q_1$ neurons and the second RN layer $q_2$ neurons. From this graph it becomes
clear how a RN network architecture of any depth $d \in \mathbb{N}$ can be constructed
(recursively).
Remark 8.2 There are many alternative ways in building deep RN networks. E.g.,
we can add a loop that connects the output of the second RN layer back to the first
one

$$z_t^{[1]} = z^{(1)}\big( x_t, z_{t-1}^{[1]}, z_{t-1}^{[2]} \big),$$
$$z_t^{[2]} = z^{(2)}\big( z_t^{[1]}, z_{t-1}^{[2]} \big),$$

or we can add a skip connection from the input variable $x_t$ to the second RN layer

$$z_t^{[1]} = z^{(1)}\big( x_t, z_{t-1}^{[1]} \big),$$
$$z_t^{[2]} = z^{(2)}\big( x_t, z_t^{[1]}, z_{t-1}^{[2]} \big).$$

We refrain from explicitly studying such RN network variants any further.

8.2.3 Designing the Network Output

There remains to explain how to predict the response variable $Y_{T+1}$ based on
the pre-processed features (memory processes) $z_T^{[1]}, \ldots, z_T^{[d]}$, outputted by the RN
network of depth d ≥ 1. Typically, only the final output of the last RN layer
$z_T^{[d]} = z_T^{[d]}(x_{0:T}) \in \mathbb{R}^{q_d}$ is considered to predict the response $Y_{T+1}$. We take this
output and feed it into a FN network $\bar{z}^{(D:1)} : \{1\} \times \mathbb{R}^{q_d} \to \{1\} \times \mathbb{R}^{\bar{q}_D}$ of depth
$D \in \mathbb{N}$ and with FN layers $\bar{z}^{(m)}$, $1 \le m \le D$, given by (7.5). Moreover, we choose
a strictly monotone and smooth link function g.

This then provides us with the regression function, see (7.7)–(7.8),

$$x_{0:T} \mapsto E[Y_{T+1} | x_{0:T}] = \mu(x_{0:T}) = g^{-1}\Big\langle \beta, \bar{z}^{(D:1)}\big( z_T^{[d]}(x_{0:T}) \big) \Big\rangle. \qquad (8.8)$$

Thus, we first process the time-series features $x_{0:T}$ through a RN network
to receive the learned representation $z_T^{[d]}(x_{0:T}) \in \mathbb{R}^{q_d}$ at time T. This learned
representation is then used as a feature input to a FN network $\bar{z}^{(D:1)}$ that allows
us to predict the response $Y_{T+1}$. This is illustrated in Fig. 8.4 for depth d = 1.
Remarks 8.3
• From the graph in Fig. 8.4 it also becomes apparent that we can consider different
insurance policies $1 \le i \le n$ having different lengths of the corresponding his-
tories $x_{i,T-\tau_i:T} \in \mathbb{R}^{(q_0+1) \times (\tau_i+1)}$, $\tau_i \in \{0, \ldots, T\}$. The stationarity assumption
allows us to enter the network in Fig. 8.4 at any time $T - \tau_i$. The RN network
encodes this history into a learned feature $z_T^{[1]}(x_{i,T-\tau_i:T})$ which is then decoded
by the FN network $\bar{z}^{(D:1)}$ to forecast $Y_{i,T+1}$.
• If there is additional insurance policy dependent feature information $\tilde{x}_i$ that
is not of a time-series structure, we can concatenate the feature information
$(z_T^{[d]}(x_{i,0:T}), \tilde{x}_i)$ which then enters the FN network (8.8).

Fig. 8.4 Forecasting the response $Y_{T+1}$ using a RN network (8.8) based on a single RN layer
d = 1 and on a FN network of depth D

There remains to fit this network architecture having d RN layers and D FN
layers to the available data. The RN layers involve the network weights $W^{(m)} \in
\mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U^{(m)} \in \mathbb{R}^{q_m \times q_m}$, for $1 \le m \le d$, and the FN layers involve the
network weights $(\bar{w}_j^{(m)})_{1 \le j \le \bar{q}_m} \in \mathbb{R}^{(\bar{q}_{m-1}+1) \times \bar{q}_m}$, for $1 \le m \le D$, with $\bar{q}_0 = q_d$.
Moreover, we have an output parameter $\beta \in \mathbb{R}^{\bar{q}_D+1}$. The fitting is again done by a
gradient descent algorithm minimizing the corresponding objective function.
Assume we have independent (in i) data $(Y_{i,T+1}, x_{i,0:T}, v_{i,T+1})$ of the cases $1 \le
i \le n$. We then assume that the responses $Y_{i,T+1}$ can be modeled by a fixed member
of the EDF having unit deviance d. We consider the deviance loss function, see (4.9),

$$\vartheta \mapsto \mathfrak{D}(Y_{T+1}, \vartheta) = \frac{1}{n} \sum_{i=1}^n \frac{v_{i,T+1}}{\varphi}\, d\big( Y_{i,T+1}, \mu_\vartheta(x_{i,0:T}) \big), \qquad (8.9)$$

for the observations $Y_{T+1} = (Y_{1,T+1}, \ldots, Y_{n,T+1})^\top$, and where ϑ collects all the
RN and FN network weights/parameters of the regression function (8.8). This model
can now be fitted using a variant of the gradient descent algorithm. The variant
uses back-propagation through time (BPTT) which is an adaption of the back-
propagation method to calculate the gradient w.r.t. the network parameter ϑ.

8.2.4 Time-Distributed Layer

There is a special feature in RN network modeling which is called a time-distributed


layer. Observe from Fig. 8.4 that the deviance loss function (8.9) only focuses on the

final observation Yi,T +1 . However, the stationarity assumption allows us to output


and study any (previous) observation Yi,t +1 , 0 ≤ t ≤ T . A time-distributed layer
considers applying the deep FN network (8.8) simultaneously at all time points 0 ≤
t ≤ T ; simultaneously meaning that we use the same FN network weights for all t.
The latter is justified under the assumption of having stationarity.

This then provides us with the regressions

$$x_{0:t} \mapsto E[Y_{t+1} | x_{0:t}] = \mu(x_{0:t}) = g^{-1}\Big\langle \beta, \bar{z}^{(D:1)}\big( z_t^{[d]}(x_{0:t}) \big) \Big\rangle \qquad \text{for all } t \ge 0. \qquad (8.10)$$

Figure 8.5 illustrates a time-distributed output where we predict $(Y_{t+1})_t$ based on
the history $(x_{0:t})_t$, and we always apply the same FN network $\bar{z}^{(D:1)}$ to the memory
$z_t^{[1]} = z_t^{[1]}(x_{0:t})$.
A time-distributed layer changes the fitting procedure. Instead of considering
the objective function (8.9) for the final observation $Y_{i,T+1}$, we now include all
observations $Y = (Y_{i,t+1})_{0 \le t \le T, 1 \le i \le n}$ into the objective function. This results in
studying the deviance loss function

$$\vartheta \mapsto \mathfrak{D}(Y, \vartheta) = \frac{1}{n} \sum_{i=1}^n \frac{1}{T+1} \sum_{t=0}^T \frac{v_{i,t+1}}{\varphi}\, d\big( Y_{i,t+1}, \mu_\vartheta(x_{i,0:t}) \big). \qquad (8.11)$$

Fig. 8.5 Forecasting $(Y_{t+1})_t$ using a RN network (8.10) based on a single RN layer d = 1 and
using a time-distributed FN layer for the outputs

Note that this can easily be adapted if the different cases 1 ≤ i ≤ n have different
lengths in their histories. An example is provided in Listing 10.8, below.
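In keras a time-distributed output can be obtained by returning the full sequence
$(z_t^{[1]})_{0 \le t \le T}$ from the RN layer and wrapping the output layer with
time_distributed; the following lines are a minimal sketch (the dimensions and
the exponential output activation are assumptions for illustration; the book's own
example is in Listing 10.8).

# time-distributed output (8.10): one prediction per time point
library(keras)
model <- keras_model_sequential() %>%
  layer_simple_rnn(units = q1, activation = 'tanh',
                   return_sequences = TRUE,      # outputs z_t for all 0 <= t <= T
                   input_shape = c(timesteps, q0)) %>%
  time_distributed(layer_dense(units = 1, activation = 'exponential'))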

8.3 Special Recurrent Neural Networks

In the plain-vanilla RN networks introduced above we have defined the memory
processes $(z_t^{[m]})_{t \ge 0}$, $1 \le m \le d$, which encode the information history $(x_t)_{t \ge 0}$
through different RN layers in a temporal causal way. This is naturally done through
the use of a time-series structure as illustrated, e.g., in Fig. 8.5. There are more
specific RN network architectures that allow the memory processes to be of a long
memory or a short memory type. In this section, we present the two most popular
architectures that pay a special attention to the memory storage. This is the long
short-term memory (LSTM) architecture introduced by Hochreiter–Schmidhuber
[188] and the gated recurrent unit (GRU) architecture proposed by Cho et al. [76].

8.3.1 Long Short-Term Memory Network

The LSTM network of Hochreiter–Schmidhuber [188] is the most commonly used
RN network architecture. The LSTM network uses simultaneously three different
activation functions for different purposes, the sigmoid and hyperbolic tangent
activation functions, respectively,

$$\phi_\sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1) \qquad \text{and} \qquad \phi_{\tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \in (-1, 1),$$

and a general activation function φ : R → R, see also Table 7.1.


The LSTM network relies on several RN layers that are of the same structure
as the plain-vanilla RN layer given in (8.7). We start by defining three different so-
called gates that all have the RN layer structure (8.7). These three gates are used
to model the memory cell of the LSTM network. Choose a layer index m ≥ 1 and
assume that $z_t^{[m-1]}$ is modeled by the previous layer m − 1; for m = 1 we initialize
$z_t^{[0]} = x_t$. The three gates are then defined as follows, set t ≥ 1:
• The forget gate models the loss of memory rate

$$f_t^{[m]} = f^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]} \big) = \phi_\sigma^f\Big( \big\langle W_f^{(m)}, z_t^{[m-1]} \big\rangle + \big\langle U_f^{(m)}, z_{t-1}^{[m]} \big\rangle \Big) \in (0, 1)^{q_m},$$

with the network weights $W_f^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_f^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
the sigmoid activation function $\phi_\sigma^f = \phi_\sigma$, we also refer to (8.7).

• The input gate models the memory update rate

$$i_t^{[m]} = i^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]} \big) = \phi_\sigma^i\Big( \big\langle W_i^{(m)}, z_t^{[m-1]} \big\rangle + \big\langle U_i^{(m)}, z_{t-1}^{[m]} \big\rangle \Big) \in (0, 1)^{q_m},$$

with the network weights $W_i^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_i^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
the sigmoid activation function $\phi_\sigma^i = \phi_\sigma$.
• The output gate models the release of memory information rate

$$o_t^{[m]} = o^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]} \big) = \phi_\sigma^o\Big( \big\langle W_o^{(m)}, z_t^{[m-1]} \big\rangle + \big\langle U_o^{(m)}, z_{t-1}^{[m]} \big\rangle \Big) \in (0, 1)^{q_m}, \qquad (8.12)$$

with the network weights $W_o^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_o^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
the sigmoid activation function $\phi_\sigma^o = \phi_\sigma$.
These gates have outputs in (0, 1), and they determine the relative amount of
memory that is updated and released in each step. The so-called cell state process
$(c_t^{[m]})_t$ is used to store the relevant memory. Given $z_t^{[m-1]}$, $z_{t-1}^{[m]}$ and $c_{t-1}^{[m]}$, the
updated cell state is defined by

$$c_t^{[m]} = c^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]}, c_{t-1}^{[m]} \big) \qquad (8.13)$$
$$= f_t^{[m]} \odot c_{t-1}^{[m]} + i_t^{[m]} \odot \phi_{\tanh}\Big( \big\langle W_c^{(m)}, z_t^{[m-1]} \big\rangle + \big\langle U_c^{(m)}, z_{t-1}^{[m]} \big\rangle \Big) \in \mathbb{R}^{q_m},$$

with the network weights $W_c^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_c^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and ⊙
denotes the Hadamard product. This defines how the memory (cell state) is updated
and passed forward using the forget and the input gates $f_t^{[m]}$ and $i_t^{[m]}$, respectively.
The neuron activations $z_t^{[m]}$ are updated, given $z_t^{[m-1]}$, $z_{t-1}^{[m]}$ and $c_t^{[m]}$, by

$$z_t^{[m]} = z^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]}, c_t^{[m]} \big) = o_t^{[m]} \odot \phi\big( c_t^{[m]} \big) \in \mathbb{R}^{q_m}, \qquad (8.14)$$

with the cell state $c_t^{[m]}$ given in (8.13) and the output gate $o_t^{[m]}$ defined in (8.12).
Figure 8.6² shows a LSTM cell (8.13)–(8.14) which includes four RN layers (8.7)
for the forget gate $f^{(m)}$, the input gate $i^{(m)}$, the output gate $o^{(m)}$ and in the cell
state update (8.13). These RN layers are combined using the Hadamard product ⊙
resulting in the updated cell state $c_t^{[m]}$ and the learned representation $z_t^{[m]}$, both being
functions of the inputs $x_{0:t}$.

2 This figure is based on colah's blog explaining LSTMs https://colah.github.io/posts/2015-08-Understanding-LSTMs/.

Fig. 8.6 LSTM cell $z^{(m)}$ with forget gate $\phi_\sigma^f$, input gate $\phi_\sigma^i$ and output gate $\phi_\sigma^o$

Below, we are going to summarize the LSTM cell update (8.13)–(8.14) as
follows

$$\big( z_t^{[m-1]}, z_{t-1}^{[m]}, c_{t-1}^{[m]} \big) \mapsto \big( z_t^{[m]}, c_t^{[m]} \big) = z^{\mathrm{LSTM}(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]}, c_{t-1}^{[m]} \big). \qquad (8.15)$$

The update (8.15) involves the eight network weight matrices $W_f^{(m)}$, $W_i^{(m)}$, $W_o^{(m)}$,
$W_c^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_f^{(m)}$, $U_i^{(m)}$, $U_o^{(m)}$, $U_c^{(m)} \in \mathbb{R}^{q_m \times q_m}$. Altogether we have
$4(q_{m-1} + 1 + q_m)\, q_m$ network parameters in each LSTM cell $1 \le m \le d$. These
are learned with the gradient descent method. Moreover, we need to initialize the
LSTM cell update (8.15). From the previous layer m − 1 we have the input $z_t^{[m-1]}$
which we initialize as $z_t^{[0]} = x_t$ for m = 1 and t ≥ 0. The initial states $z_0^{[m]}$ and
$c_0^{[m]}$ are usually set to zero.
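The parameter count $4(q_{m-1} + 1 + q_m)\, q_m$ can be verified directly in keras; a
minimal sketch (the dimensions match the LSTM cell of Listing 8.1 below):

# LSTM cell with q_{m-1} = 1 input dimension and q_m = 15 units
library(keras)
lstm <- keras_model_sequential() %>%
  layer_lstm(units = 15, input_shape = c(3, 1))
count_params(lstm)      # gives 4 * (1 + 1 + 15) * 15 = 1020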

8.3.2 Gated Recurrent Unit Network

The LSTM architecture of the previous section seems quite complex and involves
many parameters. Cho et al. [76] have introduced the GRU architecture that is
simpler and uses fewer parameters, but has similar properties. The GRU architecture
uses two gates that are defined as follows for t ≥ 1, see also (8.7):

• The reset gate models the memory reset rate

$$r_t^{[m]} = r^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]} \big) = \phi_\sigma^r\Big( \big\langle W_r^{(m)}, z_t^{[m-1]} \big\rangle + \big\langle U_r^{(m)}, z_{t-1}^{[m]} \big\rangle \Big) \in (0, 1)^{q_m},$$

with the network weights $W_r^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_r^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
the sigmoid activation function $\phi_\sigma^r = \phi_\sigma$.
• The update gate models the memory update rate

$$u_t^{[m]} = u^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]} \big) = \phi_\sigma^u\Big( \big\langle W_u^{(m)}, z_t^{[m-1]} \big\rangle + \big\langle U_u^{(m)}, z_{t-1}^{[m]} \big\rangle \Big) \in (0, 1)^{q_m},$$

with the network weights $W_u^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_u^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
the sigmoid activation function $\phi_\sigma^u = \phi_\sigma$.

The neuron activations $z_t^{[m]}$ are updated, given $z_t^{[m-1]}$ and $z_{t-1}^{[m]}$, by

$$z_t^{[m]} = z^{(m)}\big( z_t^{[m-1]}, z_{t-1}^{[m]} \big) \qquad (8.16)$$
$$= r_t^{[m]} \odot z_{t-1}^{[m]} + \big( 1 - r_t^{[m]} \big) \odot \phi\Big( \big\langle W^{(m)}, z_t^{[m-1]} \big\rangle + u_t^{[m]} \odot \big\langle U^{(m)}, z_{t-1}^{[m]} \big\rangle \Big) \in \mathbb{R}^{q_m},$$

with the network weights $W^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and for a
general activation function φ.
The GRU and the LSTM architectures are similar, the former using fewer parameters
because we do not explicitly model the cell state process. For an illustration of a
GRU cell we refer to Fig. 8.7. In the sequel we focus on the LSTM architecture;
though the GRU architecture is simpler and has fewer parameters, it is less robust in
fitting.

Fig. 8.7 GRU cell $z^{(m)}$ with reset gate $\phi_\sigma^r$ and update gate $\phi_\sigma^u$
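In keras the GRU counterpart of Listing 8.1 below only requires replacing the
LSTM cell; a minimal sketch (dimensions as in Listing 8.1):

# GRU cell processing the same time-series input as Listing 8.1
library(keras)
gru <- keras_model_sequential() %>%
  layer_gru(units = 15, activation = 'tanh',
            recurrent_activation = 'sigmoid',
            input_shape = c(3, 1))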

8.4 Lab: Mortality Forecasting with RN Networks

8.4.1 Lee–Carter Model, Revisited

The mortality data has a natural time-series structure, and for this reason mortality
forecasting is an obvious problem that can be studied within RN networks. For
instance, the LC mortality model (7.63) involves a stochastic process (kt )t that
needs to be extrapolated into the future. This extrapolation problem can be done
in different ways. The original proposal of Lee and Carter [238] has been to analyze
ARIMA time-series models, and to use standard statistical tools, Lee and Carter
found that the random walk with drift gives a good stochastic description of the
time index process (kt )t . Nigri et al. [286] proposed to fit a LSTM network to
this stochastic process, this approach is also studied in Lindholm–Palmborg [252]
where an efficient use of the mortality data for network fitting is discussed. These
approaches still rely on the classical LC calibration using the SVD of Sect. 7.5.4,
and the LSTM network is (only) used to extrapolate the LC time index process (kt )t .
More generally, one can design a RN network architecture that directly processes
the raw mortality data Mx,t = Dx,t /ex,t , not specifically relying on the LC structure.
This has been done in Richman–Wüthrich [316] using a FN network architecture, in
Perla et al. [301] using a RN network and a convolutional neural (CN) network
architecture, and in Schürch–Korn [330] extending this analysis to the study of
prediction uncertainty using bootstrapping. A similar CN network approach has
been taken by Wang et al. [375] interpreting the raw mortality data of Fig. 7.32
as an image.

Lee–Carter Mortality Model: Random Walk with Drift Extrapolation

We revisit the LC mortality model [238] presented in Sect. 7.5.4. The LC log-
mortality rate is assumed to have the following structure, see (7.63),

$$\log\big( \mu_{x,t}^{(p)} \big) = a_x^{(p)} + b_x^{(p)} k_t^{(p)},$$

for the ages $x_0 \le x \le x_1$ and for the calendar years $t \in \mathcal{T}$. We now add the upper
indices (p) to consider different populations p. The SVD gives us the estimates $\hat{a}_x^{(p)}$,
$\hat{k}_t^{(p)}$ and $\hat{b}_x^{(p)}$ based on the observed centered raw log-mortality rates, see Sect. 7.5.4.
The SVD is applied to each population p separately, i.e., there is no interaction
between the different populations. This approach allows us to fit a separate log-
mortality surface estimate $(\log(\hat{\mu}_{x,t}^{(p)}))_{x_0 \le x \le x_1;\, t \in \mathcal{T}}$ to each population p. Figure 7.33

shows an example for two populations p, namely, for Swiss females and for Swiss
males.
The mortality forecasting requires to extrapolate the time index processes
$(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$ beyond the latest observed calendar year $t_1 = \max\{\mathcal{T}\}$. As mentioned in
Lee–Carter [238] a random walk with drift provides a suitable model for modeling
$(\hat{k}_t^{(p)})_{t \ge 0}$ for many populations p, see Fig. 7.35 for the Swiss population. Assume
that

$$\hat{k}_{t+1}^{(p)} = \hat{k}_t^{(p)} + \varepsilon_{t+1}^{(p)} \qquad t \ge 0, \qquad (8.17)$$

with $\varepsilon_t^{(p)} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\delta_p, \sigma_p^2)$, $t \ge 1$, having drift $\delta_p \in \mathbb{R}$ and variance $\sigma_p^2 > 0$.
Model assumption (8.17) allows us to estimate the (constant) drift $\delta_p$ with MLE.
For observations $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$ we receive the log-likelihood function

$$\delta_p \mapsto \ell_{(\hat{k}_t^{(p)})_{t \in \mathcal{T}}}(\delta_p) = \sum_{t=t_0+1}^{t_1} \Big( -\log\big( \sqrt{2\pi}\, \sigma_p \big) - \frac{1}{2\sigma_p^2} \big( \hat{k}_t^{(p)} - \hat{k}_{t-1}^{(p)} - \delta_p \big)^2 \Big),$$

with first observed calendar year $t_0 = \min\{\mathcal{T}\}$. The MLE is given by

$$\hat{\delta}_p^{\mathrm{MLE}} = \frac{\hat{k}_{t_1}^{(p)} - \hat{k}_{t_0}^{(p)}}{t_1 - t_0}. \qquad (8.18)$$

This allows us to forecast the time index process for $t > t_1$ by

$$\hat{k}_t^{(p)} = \hat{k}_{t_1}^{(p)} + (t - t_1)\, \hat{\delta}_p^{\mathrm{MLE}}.$$
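In R, the drift estimate (8.18) and the extrapolation take two lines; a minimal
sketch, assuming that kt is the numeric vector of the SVD estimates $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$
of one population:

# random walk with drift: MLE of the drift (8.18) and 15-year extrapolation
delta_MLE  <- (kt[length(kt)] - kt[1]) / (length(kt) - 1)
k_forecast <- kt[length(kt)] + (1:15) * delta_MLE   # years t1+1, ..., t1+15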

We explore this extrapolation for different Western European countries from the
HMD [195]. We consider separately females and males of the countries {AUT, BE,
CH, ESP, FRA, ITA, NL, POR}, thus, we choose 2 · 8 = 16 different populations
p. For these countries we have observations for the ages $0 = x_0 \le x \le x_1 = 99$
and for the calendar years $1950 \le t \le 2018$.³ For the following analysis we choose
$\mathcal{T} = \{t_0 \le t \le t_1\} = \{1950 \le t \le 2003\}$, thus, we fit the models on 54 years
of mortality history. These fitted models are then extrapolated to the calendar years
$2004 \le t \le 2018$. These 15 calendar years from 2004 to 2018 allow us to perform
an out-of-sample evaluation because we have the observations $M_{x,t}^{(p)} = D_{x,t}^{(p)}/e_{x,t}^{(p)}$
for these years from the HMD [195].
Figure 8.8 shows the estimated time index processes $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$ to the left of the
dotted lines, and to the right of the dotted lines we have the random walk with
drift extrapolations $(\hat{k}_t^{(p)})_{t > t_1}$. The general observation is that, indeed, the random
walk with drift seems to be a suitable model for $(\hat{k}_t^{(p)})_t$. Moreover, there is a huge
similarity between the different countries, only with the Netherlands (NL) being
somewhat of an outlier.

3 We exclude Germany from this consideration of (continental) Western European countries
because the German mortality history is shorter due to the reunification in 1990.

Fig. 8.8 Random walk with drift extrapolation of the time index process $(\hat{k}_t)_t$ for different
countries and genders; the y-scale is the same in both plots
Remarks 8.4
• For Fig. 8.8 we did not explore any fine-tuning, for instance, the estimation of
the drift $\delta_p$ is very sensitive to the selection of the time span $\mathcal{T}$. ESP has the
biggest negative drift estimate, but this is partially caused by the corresponding
observations in the calendar years between 1950 and 1960, see Fig. 8.8, which
may no longer be relevant for a decline in mortality in the new millennium.
• For all countries, the females have a bigger negative drift than the males (the
y-scale in both plots is the same). Moreover, note that we use the normalization
$\sum_{x=x_0}^{x_1} \hat{b}_x^{(p)} = 1$ and $\sum_{t \in \mathcal{T}} \hat{k}_t^{(p)} = 0$, see (7.65). This normalization is discussed
and questioned in many publications as the extrapolation becomes dependent on
these choices; see De Jong et al. [90] and the references therein, who propose
different identification schemes.
• Another issue is an age coherence in forecasting, meaning that for long term
forecasts the mortality rates across the different ages should not diverge, see Li
et al. [250], Li–Lu [248] and Gao–Shi [153] and the references therein.
• There are many modifications and extensions of the LC model, we just mention
a few of them. Brouhns et al. [56] embed the LC model into a Poisson modeling
framework which provides a proper stochastic model for mortality modeling.
Renshaw–Haberman [308] extend the one-factor LC model to a multifactor
model, and in Renshaw–Haberman [309] a cohort effect is added. Hyndman–
Ullah [197] and Hainaut–Denuit [179] explore a functional data method and a
wavelet-based decomposition, respectively. The static PCA can be adopted to
a dynamic PCA version, see Shang [333], and a long memory behavior in the
time-series is studied in Yan et al. [395].

• The LC model is fitted to each population p separately, without exploring


any common structure across the populations. There are many multi-population
extensions that try to learn common structure across different populations. We
mention the common age effect (CAE) model of Kleinow [218], the augmented
common factor (ACF) model of Li–Lee [249] and the functional time-series
models of Hyndman et al. [196] and Shang–Haberman [334]. A direct multi-
population extension of the SVD matrix decomposition of the LC model is
obtained by the tensor decomposition approaches of Russolillo et al. [325] and
Dong et al. [110].

Lee–Carter Mortality Model: LSTM Extrapolation

Our aim here is to replace the individual random walk with drift extrapola-
tions (8.17) by a common extrapolation across all considered populations p. For
this we design a LSTM architecture. A second observation is that the increments
$\varepsilon_t^{(p)} = \hat{k}_t^{(p)} - \hat{k}_{t-1}^{(p)}$ have an average empirical auto-correlation (for lag 1) of −0.33.
This clearly questions the Gaussian i.i.d. assumption in (8.17).
We first discuss the available data and we construct the input data. We have
the time-series observations $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$, and the population index p = (c, g) has
two categorical labels c for country and g for gender. We are going to use two-
dimensional embedding layers for these two categorical variables, see (7.31) for
embedding layers. The time-series observations $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$ will be pre-processed
such that we do not simultaneously feed the entire time-series into the LSTM layer,
but we divide them into shorter time-series. We will directly forecast the increments
$\varepsilon_t^{(p)} = \hat{k}_t^{(p)} - \hat{k}_{t-1}^{(p)}$ and not the time index process $(\hat{k}_t^{(p)})_{t \ge t_0}$; in extrapolations with
drift it is easier to forecast the increments with the networks. We choose a lookback
period of τ = 3 calendar years, and we aim at predicting the response $Y_t = \varepsilon_t^{(p)}$
based on the time-series features $x_{t-\tau:t-1} = (\varepsilon_{t-\tau}^{(p)}, \ldots, \varepsilon_{t-1}^{(p)})^\top \in \mathbb{R}^\tau$. This provides
us with the following data structure for each population p = (c, g):

year            country   gender   feature $x_{t-\tau:t-1}$                                                 $Y_t$
$t_0+\tau+1$    c         g        $\varepsilon_{t_0+1}^{(p)}, \ldots, \varepsilon_{t_0+\tau}^{(p)}$         $\varepsilon_{t_0+\tau+1}^{(p)}$
...             ...       ...      ...                                                                       ...
$t$             c         g        $\varepsilon_{t-\tau}^{(p)}, \ldots, \varepsilon_{t-1}^{(p)}$             $\varepsilon_t^{(p)}$            (8.19)
...             ...       ...      ...                                                                       ...
$t_1$           c         g        $\varepsilon_{t_1-\tau}^{(p)}, \ldots, \varepsilon_{t_1-1}^{(p)}$         $\varepsilon_{t_1}^{(p)}$

Thus, each observation $Y_t = \varepsilon_t^{(p)}$ is equipped with the feature information
$(t, c, g, x_{t-\tau:t-1})$. As discussed in Lindholm–Palmborg [252], one should highlight
that there is a dependence across t, since we have a diagonal cohort structure in the

features and the observations (x t −τ :t −1 , Yt ). Usually, this dependence is not harmful


in stochastic gradient descent fitting.
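A minimal R sketch of the data construction (8.19) for one population p is given
next (not the book's own code; kt is assumed to hold the vector $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$):

# build the lookback features and responses of (8.19) from the increments
eps <- diff(kt)           # increments epsilon_t = k_t - k_{t-1}
tau <- 3                  # lookback period
rows <- t(sapply((tau + 1):length(eps),
                 function(s) c(eps[(s - tau):(s - 1)], eps[s])))
colnames(rows) <- c(paste0('eps_lag', tau:1), 'Y')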

Listing 8.1 LSTM architecture example


1 TS = layer_input(shape=c(lookback,1), dtype=’float32’, name=’TS’)
2 Country = layer_input(shape=c(1), dtype=’int32’, name=’Country’)
3 Gender = layer_input(shape=c(1), dtype=’int32’, name=’Gender’)
4 Time = layer_input(shape=c(1), dtype=’float32’, name=’Time’)
5 #
6 CountryEmb = Country %>%
7 layer_embedding(input_dim=8,output_dim=2,input_length=1,name=’CountryEmb’) %>%
8 layer_flatten(name=’Country_flat’)
9 #
10 GenderEmb = Gender %>%
11 layer_embedding(input_dim=2,output_dim=2,input_length=1,name=’GenderEmb’) %>%
12 layer_flatten(name=’Gender_flat’)
13 #
14 LSTM = TS %>%
15 layer_lstm(units=15,activation=’tanh’,recurrent_activation=’sigmoid’,
16 name=’LSTM’)
17 #
18 Output = list(LSTM,CountryEmb,GenderEmb,Time) %>% layer_concatenate() %>%
19 layer_dense(units=10, activation=’tanh’, name=’FNLayer’) %>%
20 layer_dense(units=1, activation=’linear’, name=’Network’)
21 #
22 model = keras_model(inputs = list(TS, Country, Gender, Time),
23 outputs = c(Output))

In Fig. 8.9 we plot the LSTM architecture used to forecast ε_t^{(p)} for t > t_1, and Listing 8.1 gives the corresponding R code. We process the time-series x_{t−τ:t−1} through a LSTM cell, see lines 14–16 of Listing 8.1. We choose a shallow LSTM network (d = 1) and therefore drop the upper index m = 1 in (8.15), but we add an upper index [LSTM] to highlight the output of the LSTM cell. This gives us the LSTM cell updates for t − τ ≤ s ≤ t − 1

    (x_s, z_{s−1}^{[LSTM]}, c_{s−1}) ↦ (z_s^{[LSTM]}, c_s) = z^{LSTM}(x_s, z_{s−1}^{[LSTM]}, c_{s−1}).

Fig. 8.9 LSTM architecture used to forecast ε_t^{(p)} for t > t_1: the time-series x_{t−τ:t−1} is processed through a LSTM cell of depth d = 1, its output is concatenated with the country embedding, the gender embedding and the calendar year t, and fed into a shallow FN layer producing the output Ŷ_t

This LSTM recursion to process the time-series x_{t−τ:t−1} gives us the LSTM output z_{t−1}^{[LSTM]} ∈ R^{q_1}, and it involves 4(q_0 + 1 + q_1)q_1 = 4(2 + 15)15 = 1'020 network parameters for the input dimension q_0 = 1 and the output dimension q_1 = 15.
For the categorical country code c and the binary gender g we choose two-dimensional embedding layers, see (7.31),

    c ↦ e^C(c) ∈ R²    and    g ↦ e^G(g) ∈ R²;

these embedding maps give us 2(8 + 2) = 20 embedding weights. Finally, we concatenate the LSTM output z_{t−1}^{[LSTM]} ∈ R^{15}, the embeddings e^C(c), e^G(g) ∈ R² and the continuous calendar year variable t ∈ R, and process this vector through a shallow FN network with q_2 = 10 neurons, see lines 18–20 of Listing 8.1. This FN layer gives us (q_1 + 2 + 2 + 1 + 1)q_2 = (15 + 2 + 2 + 1 + 1)10 = 210 parameters. Together with the output parameter of dimension q_2 + 1 = 11, we receive 1'261 network parameters to be fitted, which seems quite a lot.
To fit this model we have 8 · 2 = 16 populations, and for each population we have the observations k̂_t^{(p)} for the calendar years 1950 ≤ t ≤ 2003. Considering the increments ε_t^{(p)} and a lookback period of τ = 3 calendar years gives us 2003 − 1950 − τ = 50 observations, i.e., rows in (8.19), per population p; thus, we have in total 800 observations. For the gradient descent fitting and the early stopping we choose a training to validation split of 8 : 2. As loss function we choose the squared error loss function. This implicitly assumes that the increments Y_t = ε_t^{(p)} are Gaussian distributed; in other words, minimizing the squared error loss function means maximizing the Gaussian log-likelihood function. We then fit this model to the data using early stopping as described in (7.27). We analyze this fitted model.
Figure 8.10 provides the learned embeddings for the country codes c. These learned
embeddings have some similarity with the European map.
The final step is the extrapolation of k̂_t, t > t_1. These updates need to be done recursively. We initialize for t = t_1 + 1 the time-series feature

    x_{t_1+1−τ:t_1} = (ε_{t_1+1−τ}^{(p)}, ..., ε_{t_1}^{(p)})^⊤ ∈ R^τ.    (8.20)

Using the feature information (t_1 + 1, c, g, x_{t_1+1−τ:t_1}) allows us to forecast the next increment Y_{t_1+1} = ε_{t_1+1}^{(p)} by Ŷ_{t_1+1}, using the fitted LSTM architecture of Fig. 8.9. Thus, this LSTM network allows us to perform a one-period-ahead forecast to receive

    k̂_{t_1+1} = k̂_{t_1} + Ŷ_{t_1+1}.    (8.21)

Fig. 8.10 Learned country embeddings for forecasting (k̂_t)_t: 2-dimensional embedding of the country codes AUT, BE, CH, ESP, FRA, ITA, NL, POR (axes: dimension 1 and dimension 2)

This update (8.21) needs to be iterated recursively. For the next period t = t_1 + 2 we set for the time-series feature

    x_{t_1+2−τ:t_1+1} = (ε_{t_1+2−τ}^{(p)}, ..., ε_{t_1}^{(p)}, Ŷ_{t_1+1})^⊤ ∈ R^τ,    (8.22)

which gives us the next predictions Ŷ_{t_1+2} and k̂_{t_1+2}, etc.
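The following R code gives a minimal sketch of this recursive one-period-ahead extrapolation (8.20)–(8.22). It assumes that the fitted model of Listing 8.1 is available as model and, for one fixed population p, the observed increments eps, the last estimated level k.last and the integer codes cc and gg of the country and gender embeddings; these object names are hypothetical and not part of Listing 8.1.

    tau <- 3; t1 <- 2003; horizon <- 15            # extrapolate the years 2004,...,2018
    x <- tail(eps, tau)                            # initial time-series feature, see (8.20)
    k.fc <- numeric(horizon); k.prev <- k.last
    for (h in 1:horizon){
      Y.hat <- as.numeric(model %>% predict(list(
                 array(x, dim=c(1,tau,1)),         # time-series input 'TS'
                 matrix(cc,1,1), matrix(gg,1,1),   # 'Country' and 'Gender' inputs
                 matrix(t1+h,1,1))))               # 'Time' input
      k.prev <- k.prev + Y.hat                     # one-period-ahead update (8.21)
      k.fc[h] <- k.prev
      x <- c(x[-1], Y.hat)                         # predictor replaces increment, see (8.22)
    }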


In Fig. 8.11 we present the extrapolation of (ε_t^{(p)})_t for Belgium females and males. The blue curve shows the observed increments (ε_t^{(p)})_{1951≤t≤2003}, and the LSTM fitted (in-sample) values (Ŷ_t)_{1954≤t≤2003} are in red color. Firstly, we observe a negative correlation (zig-zag behavior) both in the blue observations (ε_t^{(p)})_{1951≤t≤2003} and in their red estimated means (Ŷ_t)_{1954≤t≤2003}. Thus, the LSTM finds this negative correlation (and it does not propose i.i.d. residuals). Secondly, the volatility in the red curve is smaller than in the blue curve; the former relates to expected values and the latter to observations of the random variables (which should be more volatile). The light-blue color shows the random walk with drift extrapolation (which is just a horizontal straight line at level δ̂_p^{MLE}, see (8.18)). The orange color shows the LSTM extrapolation using the recursive one-period-ahead updates (8.20)–(8.22), which has a zig-zag behavior that vanishes over time. This vanishing behavior is critical and is going to be discussed next.

Fig. 8.11 LSTM network extrapolation (Ŷ_t)_{t>t_1} for Belgium (BE) females and males: the panels show the ε_t process of BE females and males (observed ε_t, random walk with drift, LSTM in-sample, LSTM forecast, average)
There is one issue with this recursive one-period-ahead updating algorithm: it is not fully consistent in how the data is being used. The original LSTM architecture calibration is based on the feature components ε_t^{(p)}, see (8.20). Since these increments are not known for the later periods t > t_1, we replace their unknown values by the predictors, see (8.22). The subtle point here is that the predictors are on the level of expected values, and not on the level of random variables. Thus, Ŷ_t is typically less volatile than ε_t^{(p)}, but in (8.22) we pretend that we can use these predictors as a one-to-one replacement. A more consistent way would be to simulate/bootstrap ε_t^{(p)} from N(Ŷ_t, σ²) so that the extrapolation receives the same volatility as the original process. For simplicity we refrain from doing so, but Fig. 8.11 indicates that this would be a necessary step because the volatility in the orange curve vanishes after the calendar year 2003, i.e., the zig-zag behavior vanishes, which is clearly not appropriate.
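Such a simulation-based variant would only change the last line of the recursion sketch above: instead of feeding the predictor Ŷ_t back into the feature window, we feed a Gaussian sample around it. Here, sigma is an assumed estimate of the residual standard deviation, e.g., obtained from the in-sample residuals; this is a sketch, not part of the fitted model above.

    eps.sim <- rnorm(1, mean=Y.hat, sd=sigma)  # simulate eps_t ~ N(Y.hat, sigma^2)
    x <- c(x[-1], eps.sim)                     # shift window with the simulated increment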
The LSTM extrapolation of (k̂_t)_t is shown in Fig. 8.12. We observe quite some similarity to the random walk with drift extrapolation in Fig. 8.8 and, indeed, the random walk with drift seems to work very well (though the auto-correlation has not been specified correctly). Note that Fig. 8.8 is based on the individual extrapolations in p, whereas in Fig. 8.12 we have a common model for all populations.
Fig. 8.12 LSTM network extrapolation of (k̂_t)_t for different countries and genders (female and male panels; countries AUT, BE, CH, ESP, FRA, ITA, NL, POR)

Table 8.1 shows how often one model outperforms the other one (out-of-sample on the calendar years 2004 ≤ t ≤ 2018 and per gender). On the male populations of the 8 European countries both models outperform the other one 4 times, whereas for the female populations the random walk with drift gives 5 times the better out-of-sample prediction. Of course, this seems disappointing for the LSTM approach. However, this observation is quite common: the deep learning approach outperforms the classical methods on complex problems, but on simple problems, as the one here, we should go for a classical (simpler) model like a random walk with drift or an ARIMA model.

Table 8.1 Comparison of the out-of-sample mean squared error losses for the calendar years 2004 ≤ t ≤ 2018: the numbers show how often one approach outperforms the other one on each gender

                              Female   Male
    Random walk with drift     5/8      4/8
    LSTM architecture          3/8      4/8

8.4.2 Direct LSTM Mortality Forecasting

The previous section has been relying on the LC mortality model, and only the extrapolation of the time-series (k̂_t)_t has been based on a RN network architecture. In this section we aim at directly processing the raw mortality rates M_{x,t} = D_{x,t}/e_{x,t} through a network; thus, we perform the representation learning directly on the raw data. We therefore use a simplified version of the network architecture proposed in Perla et al. [301].

As input to the network we use the raw mortality rates M_{x,t}. We choose a lookback period of τ = 5 years and we define the time-series feature information to forecast the mortality in calendar year t by

    x_{t−τ:t−1} = (x_{t−τ}, ..., x_{t−1}) = (M_{x,s})_{x_0≤x≤x_1, t−τ≤s≤t−1} ∈ R^{(x_1−x_0+1)×τ} = R^{100×5}.    (8.23)

Thus, we directly process the raw mortality rates (simultaneously for all ages x) through the network architecture; in the corresponding R code we need to input the transposed features x_{t−τ:t−1}^⊤ ∈ R^{5×100}, see line 1 of Listing 8.2.

We choose a shallow LSTM network (d = 1) and drop the upper index m = 1


in (8.15). This gives us the LSTM cell updates for t − τ ≤ s ≤ t − 1
     
x s , z[LSTM]
s−1 , cs−1 → z[LSTM]
s , cs = zLSTM x s , z[LSTM]
s−1 , cs−1 .

This LSTM recursion to process the time-series x t −τ :t −1 gives us the LSTM output
z[LSTM]
t −1 ∈ Rq1 , see lines 14–15 of Listing 8.2. It involves 4(q0 + 1 + q1 )q1 =
4(100 + 1 + 20)20 = 9 680 network parameters for the input dimension q0 = 100
8.4 Lab: Mortality Forecasting with RN Networks 403

concatenation output
input LSTM cell
xt−τ :t−1 of depth d = 1
into a shallow x,t )0≤x≤99
(Y
linear decoder

country embedding
c layer

gender embedding
g layer

Fig. 8.13 LSTM architecture used to process the raw mortality rates (Mx,t )x,t

and the output dimension q1 = 20. Many statisticians would probably stop at this
point with this approach, as it seems highly over-parametrized. Let’s see what we
get.
For the categorical country code c and the binary gender g we choose two one-dimensional embeddings, see (7.31),

    c ↦ e^C(c) ∈ R    and    g ↦ e^G(g) ∈ R.    (8.24)

These embeddings give us 8 + 2 = 10 embedding weights. Figure 8.13 shows the LSTM cell in orange color and the embeddings in yellow color (in the colored version).
The LSTM output and the two embeddings are then concatenated to a learned representation z_{t−1} = (z_{t−1}^{[LSTM]}, e^C(c), e^G(g))^⊤ ∈ R^{q_1+1+1} = R^{22}. This 22-dimensional learned representation z_{t−1} encodes the 500-dimensional input x_{t−τ:t−1} ∈ R^{100×5} and the two categorical variables c and g. The last step is to decode this representation z_{t−1} ∈ R^{22} to predict the log-mortality rates (Y_{x,t})_{0≤x≤99} = (log M_{x,t})_{0≤x≤99} ∈ R^{100}, simultaneously for all ages x. This decoding is obtained by the code on lines 18–20 of Listing 8.2; it reads as

    z_{t−1} ↦ (β_x^0 + β_x^C e^C(c) + β_x^G e^G(g) + ⟨β_x, z_{t−1}^{[LSTM]}⟩)_{0≤x≤99}.    (8.25)

This decoding involves another (1 + 22)100 = 2'300 parameters (β_x^0, β_x^C, β_x^G, β_x)_{0≤x≤99}. Thus, altogether this LSTM network has r = 11'990 parameters.

Summarizing, the above architecture follows the philosophy of the auto-encoder of Sect. 7.5. A high-dimensional observation (x_{t−τ:t−1}, c, g) is encoded to a low-dimensional bottleneck activation z_{t−1} ∈ R^{22}, which is then decoded by (8.25) to give the forecast (Ŷ_{x,t})_{0≤x≤99} for the log-mortality rates. It is not precisely an auto-encoder because the response is different from the input, as we forecast the log-mortality rates in the next calendar year t based on the information z_{t−1} that is available at the end of the previous calendar year t − 1. In contrast to the LC mortality model, we no longer rely on the two-step approach of first fitting the parameters with a SVD and then performing a random walk with drift extrapolation; this encoder-decoder network performs both steps simultaneously.

Listing 8.2 LSTM architecture to directly process the raw mortality rates (Mx,t )x,t
1 TS = layer_input(shape=c(lookback,100), dtype=’float32’, name=’TS’)
2 Country = layer_input(shape=c(1), dtype=’int32’, name=’Country’)
3 Gender = layer_input(shape=c(1), dtype=’int32’, name=’Gender’)
4 Time = layer_input(shape=c(1), dtype=’float32’, name=’Time’)
5 #
6 CountryEmb = Country %>%
7 layer_embedding(input_dim=8,output_dim=1,input_length=1,name=’CountryEmb’) %>%
8 layer_flatten(name=’Country_flat’)
9 #
10 GenderEmb = Gender %>%
11 layer_embedding(input_dim=2,output_dim=1,input_length=1,name=’GenderEmb’) %>%
12 layer_flatten(name=’Gender_flat’)
13 #
14 LSTM = TS %>%
15 layer_lstm(units=20,activation=’linear’,recurrent_activation=’sigmoid’,
16 name=’LSTM’)
17 #
18 Output = list(LSTM,CountryEmb,GenderEmb) %>% layer_concatenate() %>%
19 layer_dense(units=100, activation=’linear’, name=’scalarproduct’) %>%
20 layer_reshape(c(1,100), name = ’Output’)
21 #
22 model = keras_model(inputs = list(TS, Country, Gender),
23 outputs = c(Output))

We fit this network architecture to the available data. We have r = 11'990 network parameters. Based on a lookback period of τ = 5 years, we have 2003 − 1950 − τ + 1 = 49 observations per population p = (c, g). Thus, we have in total 784 observations (x_{t−τ:t−1}, c, g, (Y_{x,t})_{0≤x≤99}). We fit this network using the nadam version of the gradient descent algorithm. We choose a training to validation split of 8 : 2 and we explore 10'000 gradient descent epochs. A crucial observation is that the algorithm converges rather slowly and it does not show any signs of over-fitting, i.e., there is no strong need for early stopping. This seems surprising because we have 11'990 parameters and only 784 observations. There are a couple of important ingredients that make this work. The features and observations themselves are high-dimensional, and the low-dimensional encoding (compression) leads to a natural regularization. Moreover, this is combined with linear activation functions, see lines 15 and 19 of Listing 8.2. The gradient descent fitting has a certain inertness, and it seems that high-dimensional problems on comparably smooth high-dimensional data do not over-fit to individual components because the gradients are not very sensitive in the individual partial derivatives (in high dimensions). These high-dimensional approaches only work if we have sufficiently many populations across which we can learn; here we have 16 populations, and Perla et al. [301] even use 76 populations.
Since every gradient descent fit still involves several elements of randomness, we consider the nagging predictor (7.44), averaging over 10 fitted networks, see Sect. 7.4.4. The out-of-sample prediction results on the calendar years 2004 to 2018, i.e., t > t_1 = 2003, are presented in Table 8.2. These results verify the appropriateness of this LSTM approach. It outperforms the LC model on the female populations in 6 out of 8 cases and on the male populations in 7 out of 8 cases; only for the French population this LSTM approach seems to have some difficulties (compared to the LC model). Note that these are out-of-sample figures because the LSTM has only been fitted on the data prior to 2004. Moreover, we did not pre-process the raw mortality rates M_{x,t}, t ≤ 2003, and the prediction is done recursively in a one-period-ahead prediction approach; we also refer to (8.22). A more detailed analysis of the results shows that the LC and the LSTM approaches have a rather similar behavior for females. For males the LSTM prediction clearly outperforms the LC model prediction; this out-performance is across different ages x and different calendar years t ≥ 2004.

Table 8.2 Comparison of the out-of-sample mean squared losses for the calendar years 2004 ≤ t ≤ 2018; the figures are in 10^{−4}

                           LC female   LSTM female   LC male   LSTM male
    Austria AUT              0.765        0.312       2.527      1.169
    Belgium BE               0.371        0.311       2.835      0.960
    Switzerland CH           0.654        0.478       1.609      1.134
    Spain ESP                1.446        0.514       1.742      0.245
    France FRA               0.175        1.684       0.333      0.363
    Italy ITA                0.179        0.330       0.874      0.320
    The Netherlands NL       0.426        0.315       1.978      0.601
    Portugal POR             2.097        0.464       1.848      1.239
The advantage of this LSTM approach is that we can directly predict by
processing the raw data. The disadvantage compared to the LC approach is that the
LSTM network approach is more complex and more time-consuming. Moreover,
unlike in the LC approach, we cannot (easily) assess the prediction uncertainty.
In the LC approach the prediction uncertainty is obtained from assessing the
uncertainty in the extrapolation and the uncertainty in the parameter estimates, e.g.,
using a bootstrap. The LSTM approach is not sufficiently robust (at least not on our
data) to provide any reasonable uncertainty estimates.
We close this section and example by analyzing the functional form of the decoder (8.25). We observe that this decoder has much similarity with the LC model assumption (7.63):

    Ŷ_{x,t} = β_x^0 + β_x^C e^C(c) + β_x^G e^G(g) + ⟨β_x, z_{t−1}^{[LSTM]}⟩,
    log(μ_{x,t}^{(p)}) = a_x^{(p)} + b_x^{(p)} k_t^{(p)}.

The LC model considers the average force of mortality a_x^{(p)} ∈ R for each population p = (c, g) and each age x; the LSTM architecture has the same term β_x^0 + β_x^C e^C(c) + β_x^G e^G(g). In the LC model, the change of the force of mortality is considered by a population-dependent term b_x^{(p)} k_t^{(p)}, whereas the LSTM architecture has a term ⟨β_x, z_{t−1}^{[LSTM]}⟩. This latter term is also population-dependent because the LSTM cell directly processes the raw mortality data M_{x,t} coming from the different populations p. Note that this is the only time-t-dependent term in the LSTM architecture. We conclude that the main difference between these two forecast approaches is how the past mortality observations are processed; apart from that, the general structure is the same.

Chapter 9
Convolutional Neural Networks

The previous two chapters have been considering fully-connected feed-forward neural (FN) networks and recurrent neural (RN) networks. Fully-connected FN networks are the prototype of networks for deep representation learning on tabular data. This type of networks extracts global properties from the features x. RN networks are an adaption of FN networks to time-series data. Convolutional neural (CN) networks are a third type of networks, and their specialty is to extract local structure from the features. Originally, they have been introduced for speech and image recognition, aiming at finding similar structure in different parts of the feature x. For instance, if x is a picture consisting of pixels, and if we want to classify this picture according to its contents, then we try to find similar structure (objects) in different locations of this picture. CN networks are suitable for this task as they work with filters (kernels) that have a fixed window size. These filters then screen across the picture to detect similar local structure at different locations in the picture. CN networks were introduced in the 1980s by Fukushima [145] and LeCun et al. [234, 235], and they have celebrated great success in many applications. Our introduction to CN networks is based on the tutorial of Meier–Wüthrich [269]. For real data applications there are many pre-trained CN network libraries that can be downloaded and used for several different tasks; an example for image recognition is the AlexNet of Krizhevsky et al. [226].

9.1 Plain-Vanilla Convolutional Neural Network Layer

Structurally, the CN network architectures are similar to the FN network architectures, only they replace certain FN layers by CN layers. Therefore, we start by introducing the CN layer, and one should keep the structure of the FN layer (7.5) in mind. In a nutshell, FN layers consider non-linearly activated inner products ⟨w_j^{(m)}, z⟩, and CN layers replace these inner products by a type of convolution W_j^{(m)} ∗ z.

9.1.1 Input Tensors and Channels

We start from an input tensor z ∈ R^{q^{(1)}×···×q^{(K)}} that has dimension q^{(1)} × ··· × q^{(K)}. This input tensor z is a multi-dimensional array of order (length) K ∈ N with elements z_{i_1,...,i_K} ∈ R for 1 ≤ i_k ≤ q^{(k)} and 1 ≤ k ≤ K. The special case of order K = 2 is a matrix z ∈ R^{q^{(1)}×q^{(2)}}. This matrix can illustrate a black and white image of dimension q^{(1)} × q^{(2)}, with the matrix entries z_{i_1,i_2} ∈ R describing the intensities of the gray scale in the corresponding pixels (i_1, i_2). A color image typically has the three color channels Red, Green and Blue (RGB), and such a RGB image can be represented by a tensor z ∈ R^{q^{(1)}×q^{(2)}×q^{(3)}} of order 3, with q^{(1)} × q^{(2)} being the dimension of the image and q^{(3)} = 3 describing the three color channels, i.e., (z_{i_1,i_2,1}, z_{i_1,i_2,2}, z_{i_1,i_2,3}) ∈ R³ describes the intensities of the colors RGB in the pixel (i_1, i_2).

Typically, the structure of black and white images and RGB images is unified by representing the black and white picture by a tensor z ∈ R^{q^{(1)}×q^{(2)}×q^{(3)}} of order 3 with a single channel q^{(3)} = 1. This philosophy is going to be used throughout this chapter. Namely, if we consider a tensor z ∈ R^{q^{(1)}×···×q^{(K−1)}×q^{(K)}} of order K, the first K − 1 components (i_1, ..., i_{K−1}) will play the role of the spatial components that have a natural topology, and the last components 1 ≤ i_K ≤ q^{(K)} are called the channels, reflecting, e.g., a gray scale (for q^{(K)} = 1) or the RGB intensities (for q^{(K)} = 3).

In Sect. 9.1.3, below, we will also study time-series data where we have 2nd order tensors (matrices). The first component reflects time 1 ≤ t ≤ q^{(1)}, i.e., the spatial component is temporal for time-series data, and the second component (channels) describes the different elements z_t = (z_{t,1}, ..., z_{t,q^{(2)}})^⊤ ∈ R^{q^{(2)}} that are measured/observed at each time point t.
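In R, such tensors are plain multi-dimensional arrays; the following minimal sketch (with hypothetical dimensions) builds a single-channel and a three-channel image tensor of order 3.

    z.bw  <- array(runif(180*50*1), dim=c(180,50,1))  # gray scale image, q^(3)=1 channel
    z.rgb <- array(runif(180*50*3), dim=c(180,50,3))  # RGB image, q^(3)=3 channels
    z.rgb[10,20,]   # RGB intensities of pixel (i_1,i_2)=(10,20)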

9.1.2 Generic Convolutional Neural Network Layer


We start from an input tensor z ∈ R^{q_{m−1}^{(1)}×···×q_{m−1}^{(K)}} of order K. The first K − 1 components of this tensor have a spatial structure and the K-th component stands for the channels. A CN layer applies (local) convolution operations to this tensor. We choose a filter size, also called window size or kernel size, (f_m^{(1)}, ..., f_m^{(K)}) ∈ N^K with f_m^{(k)} ≤ q_{m−1}^{(k)}, for 1 ≤ k ≤ K − 1, and f_m^{(K)} = q_{m−1}^{(K)}. This filter size determines the output dimension of the CN operation by

    q_m^{(k)} := q_{m−1}^{(k)} − f_m^{(k)} + 1,    (9.1)

for 1 ≤ k ≤ K. Thus, the size of the image is reduced by the window size of the filter. In particular, the output dimension of the channels component k = K is q_m^{(K)} = 1, i.e., all channels are compressed to a scalar output. The spatial components 1 ≤ k ≤ K − 1 retain their spatial structure but the dimension is reduced according to (9.1).
A CN operation is a mapping (note that the order of the tensor is reduced from K to K − 1 because the channels are compressed; the index j is going to be explained later)

    z_j^{(m)} : R^{q_{m−1}^{(1)}×···×q_{m−1}^{(K)}} → R^{q_m^{(1)}×···×q_m^{(K−1)}}    (9.2)
    z ↦ z_j^{(m)}(z) = (z_{i_1,...,i_{K−1};j}^{(m)}(z))_{1≤i_k≤q_m^{(k)}; 1≤k≤K−1},

taking the values, for a fixed activation function φ : R → R,

    z_{i_1,...,i_{K−1};j}^{(m)}(z) = φ(w_{0,j}^{(m)} + ∑_{l_1=1}^{f_m^{(1)}} ··· ∑_{l_K=1}^{f_m^{(K)}} w_{l_1,...,l_K;j}^{(m)} z_{i_1+l_1−1,...,i_{K−1}+l_{K−1}−1,l_K}),    (9.3)

for given intercept w_{0,j}^{(m)} ∈ R and filter weights

    W_j^{(m)} = (w_{l_1,...,l_K;j}^{(m)})_{1≤l_k≤f_m^{(k)}; 1≤k≤K} ∈ R^{f_m^{(1)}×···×f_m^{(K)}};    (9.4)

the network parameter has dimension r_m = 1 + ∏_{k=1}^K f_m^{(k)}.
At first sight this CN operation looks quite complicated. Let us give some remarks that allow for a better understanding and a more compact notation. The operation in (9.3) chooses the corner (i_1, ..., i_{K−1}, 1) as base point, and then it reads the tensor elements in the (discrete) window

    (i_1, ..., i_{K−1}, 1) + [0 : f_m^{(1)} − 1] × ··· × [0 : f_m^{(K−1)} − 1] × [0 : f_m^{(K)} − 1],    (9.5)

with given filter weights W_j^{(m)}. This window is then moved across the entire tensor z by changing the base point (i_1, ..., i_{K−1}, 1) accordingly, but with fixed filter weights W_j^{(m)}. This operation resembles a convolution; however, in (9.3) the indices in z_{i_1+l_1−1,...,i_{K−1}+l_{K−1}−1,l_K} run in reverse direction compared to a classical (mathematical) convolution. By a slight abuse of notation, we nevertheless use the symbol of the convolution operator ∗ to abbreviate (9.2). This gives us the compact notation:

    z_j^{(m)} : R^{q_{m−1}^{(1)}×···×q_{m−1}^{(K)}} → R^{q_m^{(1)}×···×q_m^{(K−1)}}
    z ↦ z_j^{(m)}(z) = φ(w_{0,j}^{(m)} + W_j^{(m)} ∗ z),    (9.6)

having the activations, for 1 ≤ i_k ≤ q_m^{(k)}, 1 ≤ k ≤ K − 1,

    φ(w_{0,j}^{(m)} + W_j^{(m)} ∗ z)_{i_1,...,i_{K−1}} = z_{i_1,...,i_{K−1};j}^{(m)}(z),

where the latter is given by (9.3).
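To make the CN operation (9.3) concrete, the following R function gives a minimal from-scratch sketch for an input tensor of order K = 3 (two spatial components and one channels component) and a single filter j; the function name cn.operation and the toy dimensions are hypothetical.

    cn.operation <- function(z, w0, W, phi=tanh){
      q1 <- dim(z)[1] - dim(W)[1] + 1       # output dimensions, see (9.1)
      q2 <- dim(z)[2] - dim(W)[2] + 1
      out <- matrix(NA, q1, q2)
      for (i1 in 1:q1){ for (i2 in 1:q2){
        window <- z[i1:(i1+dim(W)[1]-1), i2:(i2+dim(W)[2]-1), , drop=FALSE]
        out[i1,i2] <- phi(w0 + sum(W * window))   # translated window, see (9.5)
      }}
      out
    }
    z <- array(rnorm(6*6*3), dim=c(6,6,3))  # 6x6 image with 3 channels
    W <- array(rnorm(2*2*3), dim=c(2,2,3))  # 2x2x3 filter weights
    cn.operation(z, w0=0, W=W)              # gives a 5x5 output, see (9.1)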

Remarks 9.1
• The beauty of this notation is that we can now see the analogy to the FN layer. Namely, (9.6) exactly plays the role of a FN neuron (7.6), but the CN operation w_{0,j}^{(m)} + W_j^{(m)} ∗ z replaces the inner product ⟨w_j^{(m)}, z⟩, correspondingly accounting for the intercept.
• A FN neuron (7.6) can be seen as a special case of the CN operation (9.6). Namely, if we have a tensor of order K = 1, the input tensor (vector) reads as z ∈ R^{q_{m−1}^{(1)}}. That is, we do not have a spatial component, but only q_{m−1}^{(1)} = q_{m−1} channels. In that case we have W_j^{(m)} ∗ z = ⟨W_j^{(m)}, z⟩ for the filter weights W_j^{(m)} ∈ R^{q_{m−1}^{(1)}}, where we assume that z does not include an intercept component. Thus, the CN operation boils down to a FN neuron in the case of a tensor of order 1.
• In the CN operation we take advantage of having a spatial structure in the tensor z, which is not the case in the FN operation. The CN operation takes a spatial input of dimension ∏_{k=1}^K q_{m−1}^{(k)} and it maps this input to a spatial object of dimension ∏_{k=1}^{K−1} q_m^{(k)}. For this it uses r_m = 1 + ∏_{k=1}^K f_m^{(k)} filter weights. The FN operation takes an input of dimension q_{m−1} and it maps it to a 1-dimensional neuron activation, for this it uses 1 + q_{m−1} parameters. If we identify the input dimensions q_{m−1} = ∏_{k=1}^K q_{m−1}^{(k)}, we can observe that r_m ≪ 1 + q_{m−1} because, typically, the filter sizes f_m^{(k)} ≪ q_{m−1}^{(k)}, for 1 ≤ k ≤ K − 1. Thus, the CN operation uses much fewer parameters, as the filters only act locally through the ∗-operation by translating the filter window (9.5).
This understanding now allows us to define a CN layer. Note that the mappings (9.6) have a lower index j, which indicates that this is one single projection (filter extraction), called a filter. By choosing multiple different filters (w_{0,j}^{(m)}, W_j^{(m)}), we can define the CN layer as follows.

Choose q_m^{(K)} ∈ N filters, each having a r_m-dimensional filter weight (w_{0,j}^{(m)}, W_j^{(m)}), 1 ≤ j ≤ q_m^{(K)}. A CN layer is a mapping

    z^{(m)} : R^{q_{m−1}^{(1)}×···×q_{m−1}^{(K)}} → R^{q_m^{(1)}×···×q_m^{(K)}}    (9.7)
    z ↦ z^{(m)}(z) = (z_1^{(m)}(z), ..., z_{q_m^{(K)}}^{(m)}(z)),

with filters z_j^{(m)}(z) ∈ R^{q_m^{(1)}×···×q_m^{(K−1)}}, 1 ≤ j ≤ q_m^{(K)}, given by (9.6).

A CN layer (9.7) converts the q_{m−1}^{(K)} input channels to q_m^{(K)} output filters by preserving the spatial structure on the first K − 1 components of the input tensor z. More mathematically, CN layers and networks have been studied, among others, by Zhang et al. [403, 404], Mallat [263] and Wiatowski–Bölcskei [382]. These authors prove that CN networks have certain translation invariance properties and deformation stability. This exactly explains why these networks allow one to recognize similar objects at different locations in the input tensor. Basically, by translating the filter windows (9.5) across the tensor, we try to extract the local structure from the tensor that provides similar signals in different locations of that tensor. Thinking of an image where we try to recognize, say, a dog, such a dog can be located at different sites in the image, and a filter (window) that moves across that image tries to locate the dogs in the image.

A CN layer (9.7) defines one layer, indexed by the upper index (m), and for deep representation learning we now have to compose multiple of these CN layers, but we can also compose CN layers with FN layers or RN layers. Before doing so, we need to introduce some special purpose layers and tools that are useful for CN network modeling; this is done in Sect. 9.2, below.

9.1.3 Example: Time-Series Analysis and Image Recognition

Most CN network examples are based on time-series data or images. The former has a 1-dimensional temporal component, and the latter has a 2-dimensional spatial component. Thus, these two examples are giving us tensors of orders K = 2 and K = 3, respectively. We briefly discuss such examples as specific applications of tensors of a general order K ≥ 2.

Time-Series Analysis with CN Networks

For a time-series analysis we often have observations x_t ∈ R^{q_0} for the time points 0 ≤ t ≤ T. Bringing this time-series data into a tensor form gives us

    x = x_{0:T}^⊤ = (x_0, ..., x_T)^⊤ ∈ R^{(T+1)×q_0} = R^{q_0^{(1)}×q_0^{(2)}},

with q_0^{(1)} = T + 1 and q_0^{(2)} = q_0. We have met such examples in Chap. 8 on RN networks. Thus, for time-series data the input to a CN network is a tensor of order K = 2 with a temporal component having the dimension T + 1, and at each time point t we have q_0 measurements (channels) x_t ∈ R^{q_0}. A CN network tries to find similar structure at different time points in this time-series data x_{0:T}. For a first CN layer m = 1 we therefore choose q_1 ∈ N filters and consider the mapping

    z^{(1)} : R^{(T+1)×q_0} → R^{(T−f_1+2)×q_1}    (9.8)
    x_{0:T}^⊤ ↦ z^{(1)}(x_{0:T}^⊤) = (z_1^{(1)}(x_{0:T}^⊤), ..., z_{q_1}^{(1)}(x_{0:T}^⊤)),

with filters z_j^{(1)}(x_{0:T}^⊤) ∈ R^{T−f_1+2}, 1 ≤ j ≤ q_1, given by (9.6) and for a fixed window size f_1 ∈ N. From (9.8) we observe that the length of the time-series is reduced from T + 1 to T − f_1 + 2, accounting for the window size f_1. In financial mathematics, a structure (9.8) is often called a rolling window that moves across the time-series x_{0:T} and extracts the corresponding information.

We have introduced two different architectures to process time-series information x_{0:T}, and these different architectures serve different purposes. A RN network architecture is most suitable if we try to forecast the next response of a time-series, i.e., we typically process the past observations through a recurrent structure to predict the next response; this is the motivation, e.g., behind Figs. 8.4 and 8.5. The motivation for the use of a CN network architecture is different, as we try to find similar structure at different times, e.g., in a financial time-series we may be interested in finding the downturns of more than 20%. The latter is a local analysis which is explored by local filters (of a finite window size).
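A minimal keras sketch of the rolling-window CN layer (9.8), with hypothetical dimensions T + 1 = 181 time points, q_0 = 3 channels, q_1 = 12 filters and window size f_1 = 5, reads as follows; the output length is T − f_1 + 2 = 177.

    library(keras)
    model <- keras_model_sequential() %>%
      layer_conv_1d(filters=12, kernel_size=5, activation='tanh',
                    input_shape=c(181,3))
    summary(model)   # output shape (None, 177, 12) with (5*3+1)*12 = 192 parameters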

Image Recognition

Image recognition extends (9.8) by one order to a tensor of order K = 3. Typically, we have images of dimensions (pixels) I × J, having three color channels RGB. These images then read as

    x = (x_1, x_2, x_3) ∈ R^{I×J×3} = R^{q_0^{(1)}×q_0^{(2)}×q_0^{(3)}},

where x_1 ∈ R^{I×J} is the intensity of red, x_2 ∈ R^{I×J} is the intensity of green, and x_3 ∈ R^{I×J} is the intensity of blue.

Choose a window size of f_1^{(1)} × f_1^{(2)} and q_1 ∈ N filters to receive the CN layer

    z^{(1)} : R^{I×J×3} → R^{(I−f_1^{(1)}+1)×(J−f_1^{(2)}+1)×q_1}    (9.9)
    (x_1, x_2, x_3) ↦ z^{(1)}(x_1, x_2, x_3) = (z_1^{(1)}(x_1, x_2, x_3), ..., z_{q_1}^{(1)}(x_1, x_2, x_3)),

with filters z_j^{(1)}(x_1, x_2, x_3) ∈ R^{(I−f_1^{(1)}+1)×(J−f_1^{(2)}+1)}, 1 ≤ j ≤ q_1. Thus, we compress the 3 channels in each filter j, but we preserve the spatial structure of the image (by the convolution operation ∗).

For black and white pictures, which only have one color channel, we preserve the spatial structure of the picture, and we modify the input tensor to a tensor of order 3 of the form

    x = (x_1) ∈ R^{I×J×1}.

9.2 Special Purpose Tools for Convolutional Neural Networks

9.2.1 Padding with Zeros

We have seen that the CN operation reduces the size of the output by the filter sizes, see (9.1). Thus, if we start from an image of size 100 × 50 × 1, and if the filter sizes are given by f_m^{(1)} = f_m^{(2)} = 9, then the output will be of dimension 92 × 42 × q_1, see (9.9). Sometimes, this reduction in dimension is impractical, and padding helps to keep the original shape. Padding a tensor z with p_m^{(k)} parameters, 1 ≤ k ≤ K − 1, means that the tensor is extended in all K − 1 spatial directions by (typically) adding zeros of that size, so that the padded tensor has dimension

    (p_m^{(1)} + q_{m−1}^{(1)} + p_m^{(1)}) × ··· × (p_m^{(K−1)} + q_{m−1}^{(K−1)} + p_m^{(K−1)}) × q_{m−1}^{(K)}.

This implies that the output filters will have the dimensions

    q_m^{(k)} = q_{m−1}^{(k)} + 2p_m^{(k)} − f_m^{(k)} + 1,

for 1 ≤ k ≤ K − 1. The spatial dimension of the original tensor size is preserved if 2p_m^{(k)} − f_m^{(k)} + 1 = 0. Padding does not add any additional parameters; it is only used to reshape the tensors.
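In keras, this corresponds to the padding argument of the convolutional layers; a minimal sketch with the dimensions of the example above:

    layer_conv_2d(filters=10, kernel_size=c(9,9), padding='same')   # keeps the 100 x 50 spatial shape
    layer_conv_2d(filters=10, kernel_size=c(9,9), padding='valid')  # default, gives 92 x 42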

9.2.2 Stride

Strides are used to skip part of the input tensor z in order to reduce the size of the output. This may be useful if the input tensor is a very high resolution image. Choose the stride parameters s_m^{(k)}, 1 ≤ k ≤ K − 1. We can then replace the summation in (9.3) by the following term

    ∑_{l_1=1}^{f_m^{(1)}} ··· ∑_{l_K=1}^{f_m^{(K)}} w_{l_1,...,l_K;j}^{(m)} z_{s_m^{(1)}(i_1−1)+l_1, ..., s_m^{(K−1)}(i_{K−1}−1)+l_{K−1}, l_K}.

This only extracts the tensor entries on a discrete grid of the tensor by translating the window by multiples of integers, see also (9.5),

    (s_m^{(1)}(i_1 − 1), ..., s_m^{(K−1)}(i_{K−1} − 1), 1) + [1 : f_m^{(1)}] × ··· × [1 : f_m^{(K−1)}] × [0 : f_m^{(K)} − 1],

and the size of the output is reduced correspondingly. If we choose strides s_m^{(k)} = f_m^{(k)}, 1 ≤ k ≤ K − 1, we receive a partition of the spatial part of the input tensor z; this is going to be used in the max-pooling layer (9.11).

9.2.3 Dilation

Dilation is similar to stride, though different in that it enlarges the filter sizes instead of skipping certain positions in the input tensor. Choose the dilation parameters e_m^{(k)}, 1 ≤ k ≤ K − 1. We can then replace the summation in (9.3) by the following term

    ∑_{l_1=1}^{f_m^{(1)}} ··· ∑_{l_K=1}^{f_m^{(K)}} w_{l_1,...,l_K;j}^{(m)} z_{i_1+e_m^{(1)}(l_1−1), ..., i_{K−1}+e_m^{(K−1)}(l_{K−1}−1), l_K}.

This applies the filter weights to the tensor entries on the discrete grids

    (i_1, ..., i_{K−1}, 1) + e_m^{(1)}[0 : f_m^{(1)} − 1] × ··· × e_m^{(K−1)}[0 : f_m^{(K−1)} − 1] × [0 : f_m^{(K)} − 1],

where the intervals e_m^{(k)}[0 : f_m^{(k)} − 1] run over the grids of span sizes e_m^{(k)}, 1 ≤ k ≤ K − 1. Thus, in comparably smooth images we do not read all the pixels but only every e_m^{(k)}-th pixel in the window. Also this reduces the size of the output tensor.
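In keras, strides and dilation are available through the strides and dilation_rate arguments of the convolutional layers; a minimal sketch with hypothetical values:

    layer_conv_2d(filters=10, kernel_size=c(3,3), strides=c(2,2))        # stride s_m = 2
    layer_conv_2d(filters=10, kernel_size=c(3,3), dilation_rate=c(2,2))  # dilation e_m = 2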

9.2.4 Pooling Layer

As we have seen above, the dimension of the tensor is reduced by the filter size in each spatial direction if we do not apply padding with zeros. In general, deep representation learning follows the paradigm of auto-encoding by reducing a high-dimensional input to a low-dimensional representation. In CN networks this is usually (efficiently) done by so-called pooling layers. In spirit, pooling layers work similarly to CN layers (having a fixed window size), but we do not apply a convolution operation ∗, but rather a maximum operation to the window to extract the dominant tensor elements.

We choose a fixed window size (f_m^{(1)}, ..., f_m^{(K−1)}) ∈ N^{K−1} and strides s_m^{(k)} = f_m^{(k)}, 1 ≤ k ≤ K − 1, for the spatial components of the tensor z of order K. A max-pooling layer is given by

    z^{(m)} : R^{q_{m−1}^{(1)}×···×q_{m−1}^{(K)}} → R^{q_m^{(1)}×···×q_m^{(K)}}
    z ↦ z^{(m)}(z) = MaxPool(z),    (9.10)

with dimensions q_m^{(K)} = q_{m−1}^{(K)} and, for 1 ≤ k ≤ K − 1,

    q_m^{(k)} = ⌊q_{m−1}^{(k)}/f_m^{(k)}⌋,    (9.11)

having the activations, for 1 ≤ i_k ≤ q_m^{(k)}, 1 ≤ k ≤ K,

    MaxPool(z)_{i_1,...,i_K} = max_{1≤l_k≤f_m^{(k)}, 1≤k≤K−1} z_{f_m^{(1)}(i_1−1)+l_1, ..., f_m^{(K−1)}(i_{K−1}−1)+l_{K−1}, i_K}.

Alternatively, the floors in (9.11) could be replaced by ceilings and padding with zeros to receive the right cardinality. This extracts the maximums from the (spatial) windows

    (f_m^{(1)}(i_1 − 1), ..., f_m^{(K−1)}(i_{K−1} − 1), i_K) + [1 : f_m^{(1)}] × ··· × [1 : f_m^{(K−1)}] × [0]
    = [f_m^{(1)}(i_1 − 1) + 1 : f_m^{(1)} i_1] × ··· × [f_m^{(K−1)}(i_{K−1} − 1) + 1 : f_m^{(K−1)} i_{K−1}] × [i_K],

for each channel 1 ≤ i_K ≤ q_{m−1}^{(K)} individually. Thus, the max-pooling operator is chosen such that it extracts the maximum of each channel and each window, the windows providing a partition of the spatial part of the tensor. This reduces the dimension of the tensor according to (9.11): e.g., if we consider a tensor of order 3 of an RGB image of dimension I × J = 180 × 50 and apply a max-pooling layer with window sizes f_m^{(1)} = 10 and f_m^{(2)} = 5, we receive a dimension reduction

    180 × 50 × 3 → 18 × 10 × 3.

Replacing the maximum operator in (9.10) by an averaging operator is sometimes also used; this is called an average-pooling layer.
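In keras, these two pooling operations read as follows; a sketch matching the numerical example above:

    layer_max_pooling_2d(pool_size=c(10,5))      # 180 x 50 x 3 -> 18 x 10 x 3
    layer_average_pooling_2d(pool_size=c(10,5))  # same dimension reduction, averages instead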

9.2.5 Flatten Layer

A flatten layer performs the transformation of rearranging a tensor to a vector, so that the output of a flatten layer can be used as an input to a FN layer. That is,

    z^{(m)} : R^{q_{m−1}^{(1)}×···×q_{m−1}^{(K)}} → R^{q_m}    (9.12)
    z ↦ z^{(m)}(z) = (z_{1,...,1}, ..., z_{q_{m−1}^{(1)},...,q_{m−1}^{(K)}})^⊤,

with q_m = ∏_{k=1}^K q_{m−1}^{(k)}. We have already used flatten layers after embedding layers on lines 8 and 11 of Listing 7.4.

9.3 Convolutional Neural Network Architectures

9.3.1 Illustrative Example of a CN Network Architecture

We are now ready to patch everything together. Assume we have RGB images described by tensors x^{(0)} ∈ R^{I×J×3} of order 3, modeling the three RGB channels of images of a fixed size I × J. Moreover, we have the tabular feature information x^{(1)} ∈ X ⊂ {1} × R^q that describes further properties of the data. That is, we have an input variable (x^{(0)}, x^{(1)}), and we aim at predicting a response variable Y by using a suitable regression function

    (x^{(0)}, x^{(1)}) ↦ μ(x^{(0)}, x^{(1)}) = E[Y | x^{(0)}, x^{(1)}].    (9.13)

We choose two convolutional layers z^{(CN1)} and z^{(CN2)}, each followed by a max-pooling layer z^{(Max1)} and z^{(Max2)}, respectively. Then we apply a flatten layer z^{(flatten)} to bring the learned representation into a vector form. These layers are chosen according to (9.7), (9.10) and (9.12) with matching input and output dimensions so that the following composition is well-defined

    z^{(5:1)} = z^{(flatten)} ◦ z^{(Max2)} ◦ z^{(CN2)} ◦ z^{(Max1)} ◦ z^{(CN1)} : R^{I×J×3} → R^{q_5}.

Listing 9.1 provides an example starting from a I × J × 3 = 180 × 50 × 3 input tensor x^{(0)} and receiving a q_5 = 60 dimensional learned representation z^{(5:1)}(x^{(0)}) ∈ R^{60}.

Listing 9.1 CN network architecture in keras

1 shape <- c(180,50,3)
2 #
3 model = keras_model_sequential()
4 model %>%
5 layer_conv_2d(filters = 10, kernel_size = c(11,6), activation=’tanh’,
6 input_shape = shape) %>%
7 layer_max_pooling_2d(pool_size = c(10,5)) %>%
8 layer_conv_2d(filters = 5, kernel_size = c(6,4), activation=’tanh’) %>%
9 layer_max_pooling_2d(pool_size = c(3,2)) %>%
10 layer_flatten()

Listing 9.2 Summary of CN network architecture


1 Layer (type) Output Shape Param #
2 =======================================================================
3 conv2d_1 (Conv2D) (None, 170, 45, 10) 1990
4 -----------------------------------------------------------------------
5 max_pooling2d_1 (MaxPooling2D) (None, 17, 9, 10) 0
6 -----------------------------------------------------------------------
7 conv2d_2 (Conv2D) (None, 12, 6, 5) 1205
8 -----------------------------------------------------------------------
9 max_pooling2d_2 (MaxPooling2D) (None, 4, 3, 5) 0
10 -----------------------------------------------------------------------
11 flatten_1 (Flatten) (None, 60) 0
12 =======================================================================
13 Total params: 3,195
14 Trainable params: 3,195
15 Non-trainable params: 0

Listing 9.2 gives the summary of this architecture, providing the dimension reduction mappings (encodings)

    180 × 50 × 3 →(CN1) 170 × 45 × 10 →(Max1) 17 × 9 × 10 →(CN2) 12 × 6 × 5 →(Max2) 4 × 3 × 5 →(flatten) 60.

The first CN layer (m = 1) involves q_1^{(3)} r_1 = 10 · (1 + 11 · 6 · 3) = 1'990 filter weights (w_{0,j}^{(1)}, W_j^{(1)})_{1≤j≤q_1^{(3)}} (including the intercepts), and the second CN layer (m = 3) involves q_3^{(3)} r_3 = 5 · (1 + 6 · 4 · 10) = 1'205 filter weights (w_{0,j}^{(3)}, W_j^{(3)})_{1≤j≤q_3^{(3)}}. Altogether we have a network parameter of dimension 3'195 to be fitted in this CN network architecture.
To perform the prediction task (9.13) we concatenate the learned representation z^{(5:1)}(x^{(0)}) ∈ R^{q_5} of the RGB image x^{(0)} with the tabular feature x^{(1)} ∈ X ⊂ {1} × R^q. This concatenated vector is processed through a FN network architecture z^{(d+5:6)} of depth d ≥ 1, providing the output

    (z^{(5:1)}(x^{(0)}), x^{(1)}) ↦ E[Y | x^{(0)}, x^{(1)}] = g^{−1}⟨β, z^{(d+5:6)}(z^{(5:1)}(x^{(0)}), x^{(1)})⟩,

for a given link function g. This last step can be done in complete analogy to Chap. 7, and fitting of such a network architecture uses variants of the SGD algorithm.
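A minimal keras sketch of this concatenation, extending Listing 9.1, could read as follows; the tabular feature dimension q = 4, the FN layer size and the log-link output are hypothetical choices, not prescribed by the text.

    library(keras)
    Image   <- layer_input(shape=c(180,50,3), name='Image')
    Tabular <- layer_input(shape=c(4), name='Tabular')          # tabular feature x^(1)
    zImage  <- Image %>%                                        # CN encoder of Listing 9.1
      layer_conv_2d(filters=10, kernel_size=c(11,6), activation='tanh') %>%
      layer_max_pooling_2d(pool_size=c(10,5)) %>%
      layer_conv_2d(filters=5, kernel_size=c(6,4), activation='tanh') %>%
      layer_max_pooling_2d(pool_size=c(3,2)) %>%
      layer_flatten()                                           # learned representation z^(5:1)
    Output  <- list(zImage, Tabular) %>% layer_concatenate() %>%
      layer_dense(units=20, activation='tanh') %>%              # FN layer
      layer_dense(units=1, activation='exponential')            # inverse log-link g^{-1}
    model   <- keras_model(inputs=list(Image, Tabular), outputs=Output)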

9.3.2 Lab: Telematics Data

We present a CN network example that studies time-series of telematics car driving data. Unfortunately, this data is not publicly available. Recently, telematics car
driving data has gained much popularity in actuarial science, because this data
provides information of car drivers that goes beyond the classical features (age of
driver, year of driving test, etc.), and it provides a better discrimination of good and
bad drivers as it is directly based on the driving habits and the driving styles.
The telematics data has many different aspects. Raw telematics data typically
consists of high-frequency GPS location data, say, second by second, from which
several different statistics such as speed, acceleration and change of direction can
be calculated. Besides the GPS location data, it often contains vehicle speeds
from the vehicle instrumental panel, and acceleration in all directions from an
accelerometer. Thus, often, there are 3 different sources from which the speed and
the acceleration can be extracted. In practice, the data quality is often an issue as
these 3 different sources may give substantially different numbers, Meng et al. [271]
give a broader discussion on these data quality issues. The telematics GPS data
is often complemented by further information such as engine revolutions, daytime
of trips, road and traffic conditions, weather conditions, traffic rule violations, etc.
This raw telematics data is then pre-processed, e.g., special maneuvers are extracted
(speeding, sudden acceleration, hard braking, extreme right- and left-turns), total
distances are calculated, driving distances at different daytimes and weekdays are
analyzed. For references analyzing such statistics for predictive modeling we refer to
Ayuso et al. [17–19], Boucher et al. [42], Huang–Meng [193], Lemaire et al. [246],
Paefgen et al. [291], So et al. [344], Sun et al. [347] and Verbelen et al. [370]. A
different approach has been taken by Wüthrich [388] and Gao et al. [151, 154, 155],
namely, these authors aggregate the telematics data of speed and acceleration to
so-called speed-acceleration v -a heatmaps. These v -a heatmaps are understood as
images which can be analyzed, e.g., by CN networks; such an analysis has been
performed in Zhu–Wüthrich [407] for image classification and in Gao et al. [154]
for claim frequency modeling. Finally, the work of Weidner et al. [377, 378] directly
acts on the time-series of the telematics GPS data by performing a Fourier analysis.
In this section, we aim at allocating individual car driving trips to the right drivers by directly analyzing the time-series of the telematics data of these trips using CN networks. We therefore replicate the analysis of Gao–Wüthrich [156] on slightly different data. For our illustrative example we select 3 car drivers and we call them driver A, driver B and driver C. For each of these 3 drivers we choose individual car driving trips of 180 seconds, and we analyze their speed-acceleration-change in angle (v-a-Δ) pattern every second. Thus, for t = 1, ..., T = 180, we study the three input channels

    x_{s,t} = (v_{s,t}, a_{s,t}, Δ_{s,t})^⊤ ∈ [2, 50] km/h × [−3, 3] m/s² × [0, 1/2] ⊂ R³,

where 1 ≤ s ≤ S labels all individual trips of the considered drivers. This data has been pre-processed by cutting out the idling phase and the speeds above 50 km/h and concatenating the remaining pieces. We perform this pre-processing since we do not want to identify the drivers because they have a special idling phase picture or because they are more likely on the highway. Acceleration has been censored at ±3 m/s² because we cannot exclude that more extreme observations are caused by data quality issues (note that the acceleration is calculated from the GPS coordinates, and if the signals are not fully precise it can lead to extreme acceleration observations). Finally, change in angle is measured in absolute values of sine per second (censored at 1/2), i.e., we do not distinguish between left and right turns. This then provides us with three time-series channels giving tensors of order 2

    x_s = ((v_{s,1}, a_{s,1}, Δ_{s,1})^⊤, ..., (v_{s,180}, a_{s,180}, Δ_{s,180})^⊤)^⊤ ∈ R^{180×3},

for 1 ≤ s ≤ S. Moreover, there is a categorical response Y_s ∈ {A, B, C} indicating which driver has been driving trip s.
Figure 9.1 illustrates the first three trips x_s of T = 180 seconds of each of these three drivers A (top), B (middle) and C (bottom); note that the 180 seconds have been chosen at a random location within each trip. The first lines in red color show the acceleration patterns (a_t)_{1≤t≤T}, the second lines in black color the change in angle patterns (Δ_t)_{1≤t≤T}, and the last lines in blue color the speed patterns (v_t)_{1≤t≤T}. Table 9.1 summarizes the available data. In total we have 932 individual trips, and we randomly split these trips into the learning data L consisting of 744 trips and the test data T collecting the remaining 188 trips. The goal is to train a classification model that correctly allocates the test data T to the right driver. As feature information, we use the telematics data x_s of length 180 seconds. We design a logistic categorical regression model with response set Y = {A, B, C}. Hence, we obtain a vector-valued parameter EF with a response having 3 levels, see Sect. 2.1.4.
To process the telematics data x_s, we design a CN network architecture having three convolutional layers z^{(CNj)}, 1 ≤ j ≤ 3, each followed by a max-pooling layer z^{(Maxj)}; then we apply a drop-out layer z^{(DO)} and finally a fully-connected FN layer z^{(FN)} providing the logistic response classification. This is the same network architecture as used in Gao–Wüthrich [156]. The code is given in Listing 9.3 and it describes the mapping

    z^{(8:1)} = z^{(FN)} ◦ z^{(DO)} ◦ z^{(Max3)} ◦ z^{(CN3)} ◦ z^{(Max2)} ◦ z^{(CN2)} ◦ z^{(Max1)} ◦ z^{(CN1)} : R^{T×3} → (0, 1)³.

The first CN and pooling layer z^{(Max1)} ◦ z^{(CN1)} maps the dimension 180 × 3 to a tensor of dimension 58 × 12 using 12 filters; the max-pooling uses the floor (9.11). The second CN and pooling layer z^{(Max2)} ◦ z^{(CN2)} maps to 18 × 10 using 10 filters, and the third CN and pooling layer z^{(Max3)} ◦ z^{(CN3)} maps to 1 × 8 using 8 filters. Actually, this last max-pooling layer is a global max-pooling layer extracting the maximum in each of the 8 filters. Next, we apply a drop-out layer with a drop-out rate of 30% to prevent over-fitting. Finally, we apply a fully-connected FN layer that maps the 8 neurons to the 3 categorical outputs using the softmax output activation function, which provides the canonical link of the logistic categorical EF.
Fig. 9.1 First 3 trips of driver A (top), driver B (middle) and driver C (bottom); each trip is 180 seconds, red color shows the acceleration pattern (a_t)_t, black color the change in angle pattern (Δ_t)_t and blue color the speed pattern (v_t)_t

Table 9.1 Summary of the trips and the choice of learning and test data sets L and T

                                          Driver A   Driver B   Driver C   Total
    Number of trips S                        261        385        286      932
    Learning data L                          209        307        228      744
    Test data T                               52         78         58      188
    Average speed v_t                        24.8       30.4       30.2     km/h
    Average acceleration/braking |a_t|       0.56       0.61       0.74     m/s²
    Average change in angle Δ_t             0.065      0.054      0.076     |sin|/s


Listing 9.3 CN network architecture for the individual car trip allocation

1 shape <- c(180,3)
2 #
3 model = keras_model_sequential()
4 model %>%
5 layer_conv_1d(filters = 12, kernel_size = 5, activation=’tanh’,
6 input_shape = shape) %>%
7 layer_max_pooling_1d(pool_size = 3) %>%
8 layer_conv_1d(filters = 10, kernel_size = 5, activation=’tanh’) %>%
9 layer_max_pooling_1d(pool_size = 3) %>%
10 layer_conv_1d(filters = 8, kernel_size = 5, activation=’tanh’) %>%
11 layer_global_max_pooling_1d() %>%
12 layer_dropout(rate = .3) %>%
13 layer_dense(units = 3, activation = ’softmax’)

For a summary of the network architecture see Listing 9.4. Altogether this involves
1’237 network parameters that need to be fitted.

Listing 9.4 Summary of CN network architecture for the individual car trip allocation
1 Layer (type) Output Shape Param #
2 ===============================================================================
3 conv1d_1 (Conv1D) (None, 176, 12) 192
4 -------------------------------------------------------------------------------
5 max_pooling1d_1 (MaxPooling1D) (None, 58, 12) 0
6 -------------------------------------------------------------------------------
7 conv1d_2 (Conv1D) (None, 54, 10) 610
8 -------------------------------------------------------------------------------
9 max_pooling1d_2 (MaxPooling1D) (None, 18, 10) 0
10 -------------------------------------------------------------------------------
11 conv1d_3 (Conv1D) (None, 14, 8) 408
12 -------------------------------------------------------------------------------
13 global_max_pooling1d_1 (GlobalMaxPool (None, 8) 0
14 -------------------------------------------------------------------------------
15 dropout_1 (Dropout) (None, 8) 0
16 -------------------------------------------------------------------------------
17 dense_1 (Dense) (None, 3) 27
18 ===============================================================================
19 Total params: 1,237
20 Trainable params: 1,237
21 Non-trainable params: 0

We choose the 744 trips of the learning data L to train this network for the classification task, see Table 9.1. We use the multi-class cross-entropy loss function, see (4.19), with 80% of the learning data L as training data U and the remaining 20% as validation data V to track over-fitting. We retrieve the network with the smallest validation loss using a callback; see Listing 7.3. Since the learning data is comparably small and to reduce randomness, we use the nagging predictor averaging over 10 different network fits (using different seeds).
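A minimal sketch of this fitting step could read as follows; the feature array XX and the one-hot encoded responses YY are assumed to have been prepared beforehand (hypothetical object names), the callback follows Listing 7.3, and the batch size and number of epochs are illustrative choices.

    model %>% compile(loss='categorical_crossentropy', optimizer='nadam')
    CBs <- callback_model_checkpoint('cnn_best', monitor='val_loss',
                                     verbose=0, save_best_only=TRUE)
    model %>% fit(x=XX, y=YY, validation_split=0.2,
                  batch_size=32, epochs=500, callbacks=CBs, verbose=0)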

Table 9.2 Out-of-sample confusion matrix

                                      True labels
                              Driver A   Driver B   Driver C
    Predicted label A              39         10          2
    Predicted label B               9         66          6
    Predicted label C               4          2         50
    % correctly allocated       75.0%      84.6%      86.2%
    # of trips in test data        52         78         58

These fitted networks then provide us with a mapping

    z^{(8:1)} : R^{T×3} → (0, 1)³,    x ↦ z^{(8:1)}(x) = (z_A^{(8:1)}(x), z_B^{(8:1)}(x), z_C^{(8:1)}(x))^⊤,

and for each trip x_s ∈ R^{T×3} we receive the classification

    Ŷ_s = arg max_{y∈{A,B,C}} z_y^{(8:1)}(x_s).

Table 9.2 shows the out-of-sample results on the test data T. On average more than 80% of all trips are correctly allocated; a purely random allocation would provide a success rate of 33%. This shows that this allocation problem can be solved rather successfully and, indeed, the CN network architecture is able to learn structure in the telematics trip data x_s that allows one to discriminate car drivers. This sounds very promising. In fact, the telematics car driving data seems to be very transparent, which, of course, also raises privacy issues. On the downside we should mention that from this approach we cannot really see what the network has learned and how it manages to distinguish the different trips.
There are several approaches that try to visualize what the network has learned in the different layers by extracting the filter activations in the CN layers; others try to invert the networks, backtracking which activations and weights contribute most to a certain output, see, e.g., DeepLIFT of Shrikumar et al. [339]. For more analysis and references we refer to Sect. 4 of the tutorial of Meier–Wüthrich [269]. We do not further discuss this and close this example.

9.3.3 Lab: Mortality Surface Modeling

We revisit the mortality example of Sect. 8.4.2 where we used a LSTM architecture to process the raw mortality data for forecasting, see Fig. 8.13. We are going to make a (small) change to that architecture by simply replacing the LSTM encoder by a CN network encoder. This approach has been promoted in the literature, e.g., by Perla et al. [301], Schnürch–Korn [330] and Wang et al. [375]. A main difference between these references is whether the mortality tensor is considered as a tensor of order 2 (reflecting time-series data) or of order 3 (reflecting the mortality surface as an image). In the present example we are going to interpret the mortality tensor as a monochrome image, and this requires that we extend (8.23) by an additional channels component

    x_{t−τ:t−1} = (x_{t−τ}, ..., x_{t−1}) = (M_{x,s})_{t−τ≤s≤t−1, x_0≤x≤x_1} ∈ R^{τ×(x_1−x_0+1)×1} = R^{5×100×1},

for a lookback period of τ = 5. The LSTM cell encodes this tensor/matrix into a 20-dimensional vector which is then concatenated with the embeddings of the country code and the gender code (8.24). We use the same architecture here, only the LSTM part is replaced by a CN network in (8.25); the corresponding code is given on lines 14–17 of Listing 9.5.

Listing 9.5 CN network architecture to directly process the raw mortality rates (Mx,t )x,t
1 Tensor = layer_input(shape=c(lookback,100,1), dtype=’float32’, name=’Tensor’)
2 Country = layer_input(shape=c(1), dtype=’int32’, name=’Country’)
3 Gender = layer_input(shape=c(1), dtype=’int32’, name=’Gender’)
4 Time = layer_input(shape=c(1), dtype=’float32’, name=’Time’)
5 #
6 CountryEmb = Country %>%
7 layer_embedding(input_dim=8,output_dim=1,input_length=1,name=’CountryEmb’) %>%
8 layer_flatten(name=’Country_flat’)
9 #
10 GenderEmb = Gender %>%
11 layer_embedding(input_dim=2,output_dim=1,input_length=1,name=’GenderEmb’) %>%
12 layer_flatten(name=’Gender_flat’)
13 #
14 CN = Tensor %>%
15 layer_conv_2d(filter = 10, kernel_size = c(5,5), activation = ’linear’) %>%
16 layer_max_pooling_2d(pool_size = c(1,8)) %>%
17 layer_flatten()
18 #
19 Output = list(CN,CountryEmb,GenderEmb) %>% layer_concatenate() %>%
20 layer_dense(units=100, activation=’linear’, name=’scalarproduct’) %>%
21 layer_reshape(c(1,100), name = ’Output’)
22 #
23 model = keras_model(inputs = list(Tensor, Country, Gender),
24 outputs = c(Output))

Line 15 maps the input tensor 5 × 100 × 1 to a tensor 1 × 96 × 10 having 10 filters, the
max-pooling layer reduces this tensor to 1 × 12 × 10, and the flatten layer encodes
this tensor into a 120-dimensional vector. This vector is then concatenated with the
embedding vectors of the country and the gender codes, and this provides us with
r = 12’570 network parameters; thus, the LSTM architecture and the CN network
architecture use roughly equally many network parameters that need to be fitted. We
then use the identical partition in training, validation and test data as in Sect. 8.4.2,
i.e., we use the data from 1950 to 2003 for fitting the network architecture, which is
then used to forecast the calendar years 2004 to 2018. The results are presented in
Table 9.3.
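The parameter count can be reproduced by hand; a quick sanity check in R (the decomposition into the individual layers is ours):

## sanity check of the parameter count r = 12'570 of Listing 9.5
conv  <- (5 * 5 * 1 + 1) * 10      # 10 filters of size 5x5 on 1 input channel: 260
embed <- 8 * 1 + 2 * 1             # country and gender embedding weights: 10
dense <- (120 + 1 + 1 + 1) * 100   # 120 CN outputs + 2 embeddings + bias: 12'300
conv + embed + dense               # 12'570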

Table 9.3 Comparison of the out-of-sample mean squared losses for the calendar years
2004 ≤ t ≤ 2018; the figures are in 10^{-4}

                                 Female                      Male
                          LC     LSTM      CN         LC     LSTM      CN
 Austria AUT           0.765    0.312   0.635      2.527    1.169   1.569
 Belgium BE            0.371    0.311   0.290      2.835    0.960   1.100
 Switzerland CH        0.654    0.478   0.772      1.609    1.134   2.035
 Spain ESP             1.446    0.514   0.199      1.742    0.245   0.240
 France FRA            0.175    1.684   0.309      0.333    0.363   0.770
 Italy ITA             0.179    0.330   0.186      0.874    0.320   0.421
 The Netherlands NL    0.426    0.315   0.266      1.978    0.601   0.606
 Portugal POR          2.097    0.464   0.416      1.848    1.239   1.880

We observe that in our case the CN network architecture provides good results for
the female populations, whereas for the male populations we rather prefer the LSTM
architecture. At the current stage we see this rather as a proof of concept, because
we have neither fine-tuned the network architectures nor perfected the SGD fitting;
e.g., often bigger architectures are used in combination with dropouts, etc. We
refrain from doing so here, but refer to the relevant literature Perla
et al. [301], Schnürch–Korn [330] and Wang et al. [375] for a more sophisticated
fine-tuning.

Chapter 10
Natural Language Processing

Natural language processing (NLP) is a rapidly growing field that studies language,
communication and text recognition. The purpose of this chapter is to present
an introduction to NLP. Important milestones in the field of NLP are the work of
Bengio et al. [28, 29] who have introduced the idea of word embedding, the work
of Mikolov et al. [275, 276] who have developed word2vec which is an efficient
word embedding tool, and the work of Pennington et al. [300] and Chaubard et
al. [68] who provide the pre-trained word embedding model GloVe1 and detailed
educational material.2 An excellent overview of the NLP working pipeline is
provided by the tutorial of Ferrario–Nägelin [126]. This overview distinguishes
three approaches: (1) the classical approach using bag-of-words and bag-of-part-
of-speech models to classify text documents; (2) the modern approach using word
embeddings to receive a low-dimensional representation of the dictionary, which
is then further processed; (3) the contemporary approach uses a minimal amount
of text pre-processing but directly feeds raw data to a machine learning algorithm.
We discuss these different approaches and show how they can be used to extract
the relevant information from claim descriptions to predict the claim types and the
claim sizes; in the actuarial literature first papers on this topic have been published
by Lee et al. [236] and Manski et al. [264].

10.1 Feature Pre-processing and Bag-of-Words

NLP requires an extensive feature pre-processing and engineering as different texts
can be rather diverse in language, grammar, abbreviations, typos, etc. Current
developments aim at automating this process; nevertheless, many of these steps

1 https://nlp.stanford.edu/projects/glove/.
2 https://nlp.stanford.edu/teaching/.


are still (tedious) manual work. Our goal here is to present the whole working
pipeline to process language, perform text recognition and text understanding. As
an example we use the claim data described in Sect. 13.3; this data has been made
available through the book project of Frees [135], and it comprises property claims
of governmental institutions in Wisconsin, US. An excerpt of the data is given in
Listing 10.1; our focus is on line 11, which provides a (very) short claim
description for every claim.

Listing 10.1 Excerpt of the Wisconsin Local Government Property Insurance Fund (LGPIF) data
set with short claim descriptions on line 11

1 ’data.frame’: 5424 obs. of 10 variables:


2 $ PolicyNum : int 120002 120003 120003 120003 120003 120003 120003 ...
3 $ Year : int 2010 2007 2008 2007 2009 2010 2007 2007 2009 2007 ...
4 $ Claim : num 6839 2085 8775 600 34610 ...
5 $ Deduct : int 1000 5000 5000 5000 5000 5000 5000 5000 5000 5000 ...
6 $ EntityType : Factor w/ 6 levels "City","County",..: 2 2 2 2 2 2 2 2 2 2 ...
7 $ CoverageCode: Factor w/ 13 levels "CE","CF","CS",..: 12 12 11 11 11 12 ...
8 $ Fire5 : int 4 0 0 0 0 0 0 0 0 0 ...
9 $ CountyCode : Factor w/ 72 levels "ADA","ASH","BAR",..: 2 3 3 3 3 3 3 3...
10 $ Hazard : Factor w/ 9 levels "Fire","Hail",..: 3 3 5 5 9 6 3 3 3 3 ...
11 $ Description : chr "lightning damage" "lightning damage at Comm. Center" ...

In a first step we need to pre-process the texts to make them suitable for predictive
modeling. This first step is called tokenization. Essentially, tokenization labels the
words with integers, that is, the used vocabulary is encoded by integers. There are
several issues that one has to deal with in this first step such as upper and lower
case, punctuation, orthographic errors and differences, abbreviations, etc. Different
treatments of these issues will lead to different results, for more on this topic we
refer to Sect. 1 in Ferrario–Nägelin [126]. We simply use the standard routine
offered in R keras [77] called text_tokenizer() with its standard settings.

Listing 10.2 Tokenization within R keras [77]

1 library(keras)
2
3 ## initialize tokenizer and fit
4 tokenizer <- text_tokenizer() %>% fit_text_tokenizer(dat$Description)
5
6 ## number of tokens/words
7 length(tokenizer$word_index)
8
9 ## frequency of word appearances in each text
10 freq.text <- texts_to_matrix(tokenizer, dat$Description, mode = "count")

The R code in Listing 10.2 shows the crucial steps in tokenization. Line 4 extracts
the relevant vocabulary from all available claim descriptions. In total the 5’424 claim
descriptions of Listing 10.1 use W = 2’237 different words. This double counts
different spellings, e.g., ‘color’ vs. ‘colour’.

Fig. 10.1 Most frequently used words in the claim descriptions of Listing 10.1
Figure 10.1 shows the most frequently used words in the claim descriptions of
Listing 10.1. These are (in this order): ‘at’, ‘damage’, ‘damaged’, ‘vandalism’,
‘lightning’, ‘to’, ‘water’, ‘glass’, ‘park’, ‘fire’, ‘hs’, ‘wind’, ‘light’, ‘door’, ‘es’,
‘and’, ‘of’, ‘vehicle’, ‘pole’ and ‘power’. We observe that many of these words
are directly related to insurance claims, such as ‘damage’ and ‘vandalism’, others
are frequent stopwords like ‘at’ and ‘to’, and then there are abbreviations like ‘hs’
and ‘es’ standing for high school and elementary school.

Listing 10.3 Word and text encoding

1 maxlen <- max(rowSums(freq.text))


2
3 ## encode the sentences
4 text.seq <- texts_to_sequences(tokenizer, dat$Description)
5
6 ## pad the sentences
7 text.seq.pad <- pad_sequences(text.seq, maxlen = maxlen, padding = "post")
8
9 ## examples
10 lightning/hail damage to equip at airport
11 5 48 2 6 196 1 40 0 0 0 0
12 ##
13 garage door damaged
14 36 14 3 0 0 0 0 0 0 0 0

The next step is to assign the (integer) labels 1 ≤ w ≤ W from the tokenization
to the words in the texts. The maximal length over all texts/sentences is T = 11
words. This step and padding the sentences with zeros to equal length T is presented
on lines 1–7 of Listing 10.3. Lines 11 and 14 of this listing give two explicit text
examples

$$\text{text} = (w_1, \ldots, w_T)^\top \in \mathcal{W}_0^T,$$

where we set for the used vocabulary $\mathcal{W}_0$

$$\mathcal{W} = \{1, \ldots, W\} \subset \mathbb{N} \qquad \text{and} \qquad \mathcal{W}_0 = \mathcal{W} \cup \{0\}.$$

The label 0 is used for padding shorter texts to the common length T = 11. The
method of bag-of-words embeds $\text{text} = (w_1, \ldots, w_T)^\top$ into $\mathbb{N}_0^W$

$$\psi: \mathcal{W}_0^T \to \mathbb{N}_0^W, \qquad
\text{text} \mapsto \psi(\text{text}) = \left( \sum_{t=1}^T \mathbb{1}_{\{w_t = w\}} \right)_{w \in \mathcal{W}}.
\tag{10.1}$$

The bag-of-words $\psi(\text{text})$ counts how often each word $w \in \mathcal{W}$ appears in a given
$\text{text} = (w_1, \ldots, w_T)^\top$; the corresponding code is given on line 10 of Listing 10.2.
The bag-of-words mapping $\psi$ is not injective as the order of occurrence of the
words gets lost and, thus, also the semantics of the sentence gets lost. E.g., the
bag-of-words of the following two sentences is the same: ‘The claim is expensive.’
and ‘Is the claim expensive?’. This is the reason for calling it a “bag of words”
(which is unordered). This bag-of-words encoding resembles one-hot encoding,
namely, if every text consists of a single word, T = 1, then we receive the one-hot
encoding with W describing the number of different levels, see (7.28). The bag-of-
words $\psi(\text{text}) \in \mathbb{N}_0^W$ can directly be used as an input to a regression model. The
disadvantage of this approach is that the input typically is high-dimensional (and
likely sparse), and it is recommended that only the frequent words are considered.
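The following small sketch illustrates this non-injectivity within R keras; here we refit a fresh tokenizer on the two example sentences, so the objects below are our own illustration.

## sketch illustrating the non-injectivity of the bag-of-words encoding
library(keras)
sentences <- c("The claim is expensive.", "Is the claim expensive?")
tok <- text_tokenizer() %>% fit_text_tokenizer(sentences)
bow <- texts_to_matrix(tok, sentences, mode = "count")
all(bow[1, ] == bow[2, ])   # TRUE: both sentences share the same bag-of-words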

Listing 10.4 Removal of stopwords and lemmatization

1 library(textstem)
2 library(tm)
3
4 text.clean <- removeWords(dat$Description, stopwords("english"))
5 text.clean <- lemmatize_strings(text.clean, dictionary = lexicon::hash_lemmas)

Additionally, stopwords can be removed. We perform this removal because frequent
stopwords like ‘and’ or ‘to’ contribute little to the understanding of the (short)
claim descriptions; the code for the stopword removal is provided on line 4 of
Listing 10.4. Moreover, stemming can be performed, which
means that inflectional forms are reduced to their stem by just truncating pre- and
suffixes, conjugations, declensions, etc. Lemmatization is a more sophisticated form
of reducing inflectional forms by using vocabularies and morphological analyses;
an example is provided on line 5 of Listing 10.4. If we apply these two steps
of stopword removal and lemmatization to our example, the number of different
words is reduced from 2’237 to 1’982.
Another step that can be performed is tagging words with part-of-speech (POS)
attributes. These POS attributes indicate whether the corresponding words are used

as nouns, adjectives, adverbs, etc., in the corresponding sentences. We then call the
resulting encoding bag-of-POS. We refrain from doing this because we will present
more sophisticated methods in the next sections.

10.2 Word Embeddings

The bag-of-words (10.1) can be interpreted as representing each word $w \in \mathcal{W} = \{1, \ldots, W\}$ by a one-hot encoding in $\{0, 1\}^W$, and then aggregating these one-hot
encodings over all words that appear in the given $\text{text} = (w_1, \ldots, w_T)^\top$. Bengio
et al. [28, 29] have introduced the technique of word embedding that maps words
to a lower dimensional Euclidean space $\mathbb{R}^b$, $b \ll W$, such that proximity in $\mathbb{R}^b$
is associated with similarity in the meaning of the word, e.g., ‘rain’, ‘water’ and
‘flood’ should be closer to each other in $\mathbb{R}^b$ than to ‘vandalism’ (in an insurance
context). This is exactly the idea promoted in the embedding mapping (7.31) using
the embedding layers. Thus, we are looking for an embedding mapping

$$e: \mathcal{W} \to \mathbb{R}^b, \qquad w \mapsto e(w),
\tag{10.2}$$

that maps each word $w$ (or rather its tokenization) to a $b$-dimensional vector $e(w)$,
for a given embedding dimension $b \ll W$. The general idea now is that similarity in
the meaning of words can be learned from the context in which the words are used
in. That is, when we consider a text

$$\text{text} = (w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_T)^\top,$$

then it might be possible to infer $w_t$ from its neighbors $w_{t-j}$ and $w_{t+j}$, $j \ge 1$. This
explains the context of a word $w_t$, and using suitable learning tools it should also be
possible to learn synonyms for $w_t$ as these synonyms will stand in similar contexts.
More mathematically speaking, we assume that there exists a probability
distribution $p$ over the set of all texts of length $T$ (using padding with zeros to
common length)

$$\mathcal{T} = \left\{ \text{text} = (w_1, \ldots, w_T)^\top \right\} \subseteq \mathcal{W}_0^T,$$

such that a randomly chosen $\text{text} \in \mathcal{T}$ appears with probability $p(w_1, \ldots, w_T) \in [0, 1)$. Inference of a word $w_t$ from its context can then be obtained by studying the
conditional probability of $w_t$, given its context, that is

$$p\left( w_t \mid w_1, \ldots, w_{t-1}, w_{t+1}, \ldots, w_T \right)
= \frac{p(w_1, \ldots, w_T)}{p(w_1, \ldots, w_{t-1}, w_{t+1}, \ldots, w_T)}.
\tag{10.3}$$

Since, typically, the probability distribution p is not known we aim at learning it


from the available data. This idea has been taken up by Mikolov et al. [275, 276]
who designed the word to vector (word2vec) algorithm. Pennington et al. [300]
designed an alternative algorithm called global vectors (GloVe); we also refer to
Chaubard et al. [68]. We describe these algorithms in the following sections.

10.2.1 Word to Vector Algorithms

There are two ways of estimating the probability p in (10.3). Either we can try to
predict the center word wt from its context as in (10.3) or we can try to predict the
context from the center word wt , which applies Bayes’s rule to (10.3). The latter
variant is called skip-gram and the former variant is called continuous bag-of-words
(CBOW), if we neglect the order of the words in the context. These two approaches
have been developed by Mikolov et al. [275, 276].

Skip-gram Approach

Typically, inferring a general probability distribution $p$ over $\mathcal{T}$ is too complex.
Therefore, we make a simplifying assumption. This simplifying assumption is not
reasonable from a practical linguistic point of view, but it is sufficient to receive a
reasonable word embedding map $e: \mathcal{W} \to \mathbb{R}^b$. We assume that the context words
are conditionally i.i.d., given the center word $w_t$. Choosing a fixed context (window)
size $c \in \mathbb{N}$, we try to maximize the log-likelihood over all probabilities $p$ satisfying
this conditional i.i.d. assumption

$$\ell_{\boldsymbol{W}} = \sum_{i=1}^n \log p\left( w_{i,t-c}, \ldots, w_{i,t-1}, w_{i,t+1}, \ldots, w_{i,t+c} \,\middle|\, w_{i,t} \right)
= \sum_{i=1}^n \sum_{-c \le j \le c,\, j \ne 0} \log p\left( w_{i,t+j} \,\middle|\, w_{i,t} \right),
\tag{10.4}$$

having $n$ independent rows in the observed data matrix $\boldsymbol{W} = (w_{i,t-c}, \ldots, w_{i,t+c})_{1 \le i \le n} \in \mathcal{W}^{n \times (2c+1)}$. Thus, under the conditional i.i.d. assumption for the context words,
given the center word, the probabilities (10.4) infer the occurrence of (individual)
context words of a given center word $w_{i,t}$ within a symmetric window of fixed size
$c$. In the sequel we directly work with the log-likelihood (10.4), supposing that a
context word $w_{i,t+j}$ exists for index $j$; otherwise the corresponding term is just
dropped from the sum in (10.4).
The remaining step is to estimate the conditional probabilities $p(w_{t+j} \mid w_t)$ from
the data matrix $\boldsymbol{W}$. This step will provide us with the embeddings (10.2). This
estimation step is obtained by considering an approach similar to a GLM for
categorical responses, see Sect. 5.7. We make the following ansatz for the context
word $w_s$ and the center word $w_t$ (for all $j$)

$$p\left( w_s \mid w_t \right) = \frac{\exp \left\langle \tilde{e}(w_s), e(w_t) \right\rangle}{\sum_{w=1}^W \exp \left\langle \tilde{e}(w), e(w_t) \right\rangle} \in (0, 1),
\tag{10.5}$$

where $e$ and $\tilde{e}$ are two (different) embedding maps (10.2) that have the same
embedding dimension $b \in \mathbb{N}$. Thus, we construct two different embeddings $e$ and $\tilde{e}$
for the center words and for the context words, respectively, and these embeddings
(embedding weights) are chosen such that the log-likelihood (10.4) is maximized
for the given observations $\boldsymbol{W}$. These assumptions give us a minimization problem
for the negative log-likelihood in the embedding mappings, i.e., we minimize over
the embeddings $e$ and $\tilde{e}$

$$-\ell_{\boldsymbol{W}} = -\sum_{i=1}^n \sum_{-c \le j \le c,\, j \ne 0}
\log \left( \frac{\exp \left\langle \tilde{e}(w_{i,t+j}), e(w_{i,t}) \right\rangle}{\sum_{w=1}^W \exp \left\langle \tilde{e}(w), e(w_{i,t}) \right\rangle} \right)
\tag{10.6}$$
$$= -\sum_{i=1}^n \left( \sum_{-c \le j \le c,\, j \ne 0} \left\langle \tilde{e}(w_{i,t+j}), e(w_{i,t}) \right\rangle
- 2c \log \sum_{w=1}^W \exp \left\langle \tilde{e}(w), e(w_{i,t}) \right\rangle \right).$$

These optimal embeddings are learned using a variant of the gradient descent
algorithm. This often results in a very high-dimensional optimization problem as
we have $2bW$ parameters to learn, and the calculation of the last (normalization)
term in (10.6) can be very expensive in gradient descent algorithms. For this reason
we present the method of negative sampling below.
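For illustration, the conditional probability (10.5) can be written as a small R function; the matrices E and E_tilde below are our own assumed objects, with rows holding the center and context embeddings.

## sketch of the softmax probability (10.5); E and E_tilde are assumed W x b
## matrices whose rows hold the center and context embeddings e(w) and e~(w)
p_context_given_center <- function(ws, wt, E, E_tilde) {
  scores <- as.vector(E_tilde %*% E[wt, ])   # scalar products <e~(w), e(w_t)> for all w
  exp(scores[ws]) / sum(exp(scores))         # normalization over the vocabulary
}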

Continuous Bag-of-Words

For the CBOW method we start from the log-likelihood for a context size $c \in \mathbb{N}$ and
given the observations $\boldsymbol{W}$

$$\sum_{i=1}^n \log p\left( w_{i,t} \,\middle|\, w_{i,t-c}, \ldots, w_{i,t-1}, w_{i,t+1}, \ldots, w_{i,t+c} \right).$$

Again we need to reduce the complexity, which requires an approximation to the
above. Assume that the embedding map of the context words is given by $\tilde{e}: \mathcal{W} \to \mathbb{R}^b$. We then average over the embeddings of the context words in order to predict
the center word. Define the average embedding of the context words of $w_{i,t}$ (with a
fixed window size $c$) by

$$\tilde{e}_{i,t} = \frac{1}{2c} \sum_{-c \le j \le c,\, j \ne 0} \tilde{e}(w_{i,t+j}).$$

Making an ansatz similar to (10.5), the full log-likelihood is approximated by

$$\sum_{i=1}^n \log p\left( w_{i,t} \,\middle|\, \tilde{e}_{i,t} \right)
= \sum_{i=1}^n \log \left( \frac{\exp \left\langle \tilde{e}_{i,t}, e(w_{i,t}) \right\rangle}{\sum_{w=1}^W \exp \left\langle \tilde{e}_{i,t}, e(w) \right\rangle} \right)
\tag{10.7}$$
$$= \sum_{i=1}^n \left( \left\langle \tilde{e}_{i,t}, e(w_{i,t}) \right\rangle
- \log \sum_{w=1}^W \exp \left\langle \tilde{e}_{i,t}, e(w) \right\rangle \right).$$

Again the gradient descent method is applied to the negative log-likelihood to learn
the optimal embedding maps $e$ and $\tilde{e}$.
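The context averaging itself is a one-liner; a minimal sketch, assuming E_tilde is a W × b matrix of context embeddings and context holds the 2c word indices around the center word:

## sketch of the CBOW average context embedding
avg_context_embedding <- function(context, E_tilde) {
  colMeans(E_tilde[context, , drop = FALSE])   # (1/2c) * sum of e~(w_{t+j})
}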
Remark 10.1 In both cases, skip-gram and CBOW, we estimate two separate
embeddings $e$ and $\tilde{e}$ for the center word and the context words. Typically, CBOW is
faster, but skip-gram is better on words that are less frequent.

Negative Sampling

There is a computational issue in (10.6) and (10.7) because the probability
normalizations in (10.6) and (10.7) aggregate over all available words $w \in \mathcal{W}$. This
can be computationally demanding because we need to perform this calculation in
each gradient descent step. For this reason, Mikolov et al. [276] turn the log-likelihood
optimization problem (10.6) into a binary classification problem. Consider a pair
$(w, \tilde{w}) \in \mathcal{W} \times \mathcal{W}$ of center word $w$ and context word $\tilde{w}$. We introduce a binary
response variable $Y \in \{1, 0\}$ that indicates whether an observation $(W, \tilde{W}) = (w, \tilde{w})$ is coming from a true center-context pair (from our texts) or whether
we have a fake center-context pair (that has been generated randomly). Choosing
the canonical link of the Bernoulli EF (logistic/sigmoid function) we make the
following ansatz (in the skip-gram approach) to test for the authenticity of a center-
context pair $(w, \tilde{w})$

$$\mathbb{P}\left[ Y = 1 \mid w, \tilde{w} \right]
= \frac{1}{1 + \exp\left\{ -\left\langle \tilde{e}(\tilde{w}), e(w) \right\rangle \right\}}.
\tag{10.8}$$

The recipe now is as follows: (1) Consider for a given window size $c$ all center-
context pairs $(w_i, \tilde{w}_i) \in \mathcal{W} \times \mathcal{W}$ of our texts, and equip them with a response $Y_i = 1$.
Assume we have $N$ such observations. (2) Simulate $N$ i.i.d. pairs $(W_{N+k}, \tilde{W}_{N+k})$,
$1 \le k \le N$, by randomly choosing $W_{N+k}$ and $\tilde{W}_{N+k}$, independent from each
other (by performing independent re-sampling with or without replacement from
the data $(w_i)_{1 \le i \le N}$ and $(\tilde{w}_i)_{1 \le i \le N}$, respectively). Equip these (false) pairs with the
response $Y_{N+k} = 0$. (3) Maximize the following log-likelihood as a function of the
embedding maps $e$ and $\tilde{e}$

$$\ell_{\boldsymbol{Y}} = \sum_{i=1}^{2N} \log \mathbb{P}\left[ Y = Y_i \mid w_i, \tilde{w}_i \right]
\tag{10.9}$$
$$= \sum_{i=1}^{N} \log \left( \frac{1}{1 + \exp\left\{ -\left\langle \tilde{e}(\tilde{w}_i), e(w_i) \right\rangle \right\}} \right)
+ \sum_{k=N+1}^{2N} \log \left( \frac{1}{1 + \exp\left\{ \left\langle \tilde{e}(\tilde{w}_k), e(w_k) \right\rangle \right\}} \right).$$

This approach is called negative sampling because we sample false or negative
pairs $(W_{N+k}, \tilde{W}_{N+k})$ that should not appear in our texts (as $W_{N+k}$ and $\tilde{W}_{N+k}$ have
been generated independently from each other). The binary classification (10.9)
aims at detecting the negative pairs by letting the scalar products $\langle \tilde{e}(\tilde{w}_i), e(w_i) \rangle$
be large for the true pairs and letting the scalar products $\langle \tilde{e}(\tilde{w}_k), e(w_k) \rangle$ be small
for the false pairs. The former means that $\tilde{e}(\tilde{w}_i)$ and $e(w_i)$ should point into the
same direction in the embedding space $\mathbb{R}^b$. The same should apply for a synonym
of $w_i$ and, thus, we receive the desired behavior that synonyms or words with similar
meanings tend to cluster.
Example 10.2 (word2vec with Negative Sampling) We provide an example by
constructing a word2vec embedding based on negative sampling. For this we aim
at maximizing the log-likelihood (10.9) by finding optimal embedding maps $e$ and
$\tilde{e}: \mathcal{W} \to \mathbb{R}^b$. To construct these embedding maps we use the Wisconsin LGPIF
data described in Sect. 13.3. The first decision (hyper-parameter) is the choice of the
embedding dimension $b$. The English language has millions of different words, and
these words should be (in some sense) densely embedded into a $b$-dimensional
Euclidean space. Typical choices of $b$ vary between 50 and 300. Our LGPIF data
vocabulary is much smaller, and for this example we choose $b = 2$ because this
allows us to nicely illustrate the learned embeddings. However, apart from
illustration, we should not choose such a small dimension as it does not allow for
sufficient flexibility in discriminating the words, as we will see.
We consider all available claim texts described in Sect. 13.3. These are 6’031
texts coming from the training and validation data sets (we include the validation
data here to have more texts for learning the embeddings; this is different from
Sect. 10.1). We extract the claim descriptions from these two data sets and we apply
some pre-processing to the texts. This involves transforming all letters to lower case,
removing the special characters like !”/&, and removing the stopwords. Moreover,
we remove the words ‘damage’ and ‘damaged’ as these two words are very common
in our insurance claim descriptions, see Fig. 10.1, but they do not further specify
the claim type. Then we apply lemmatization, see Listing 10.4, and we adjust the
vocabulary with the GloVe database,3 see also Remark 10.4. The latter step is

3 https://nlp.stanford.edu/projects/glove/.

(tedious) manual work, and we do this step to be able to compare our results to
pre-trained word2vec versions.
After this pre-processing we apply the tokenizer, see line 4 of Listing 10.2. This
gives us 1’829 different words. To construct our (illustrative) embedding we only
consider the words that appear at least 20 times over all texts; these are W = 142
words. Thus, the following analysis is only based on the W = 142 most frequent
words. Of course, we could increase our vocabulary by considering any text that can
be downloaded from the internet. Since we would like to perform an insurance claim
analysis, these texts should be related to an insurance context so that the learned
embeddings reflect an insurance experience; we come back to this in Remark 10.4,
below. We refrain here from doing so and embed these W = 142 words into the
Euclidean plane (b = 2).

Listing 10.5 Tokenization of the most frequent words

1 ## applying the tokenizer to the cleaned texts


2 tokenizer <- text_tokenizer(num_words=142+1) %>% fit_text_tokenizer(dat$clean)
3
4 seqs <- texts_to_sequences(tokenizer, dat$clean)
5
6 ## skip-gram of text 1 using a window of size 2
7 skipgrams(sequence=unlist(seqs[[1]]),
8 vocabulary_size=142, window_size=2, negative_samples=0)

Listing 10.5 shows the tokenization of the most frequent words, and on line 4 we
build the (shortened) texts $w_1, w_2, \ldots$, only considering these most frequent words
$w \in \mathcal{W} = \{1, \ldots, W\}$. In total we receive 4’746 texts that contain at least two words
from $\mathcal{W}$ and, hence, can be used for the skip-gram building of center-context pairs
$(w, \tilde{w}) \in \mathcal{W} \times \mathcal{W}$. Lines 7–8 give the code for building these pairs for a window of
size $c = 2$. In total we receive N = 23’952 center-context pairs $(w_i, \tilde{w}_i)$ from our
texts. We equip these pairs with a response $Y_i = 1$. For the false pairs, we randomly
permute the second component of the true pairs, $(W_{N+i}, \tilde{W}_{N+i}) = (w_i, \tilde{w}_{\tau(i)})$,
where $\tau$ is a random permutation of $\{1, \ldots, N\}$. These false pairs are equipped
with a response $Y_{N+i} = 0$. Thus, altogether we have 2N = 47’904 observations
$(Y_i, w_i, \tilde{w}_i)$, $1 \le i \le 2N$, that can be used to learn the embeddings $e$ and $\tilde{e}$.
Listing 10.6 shows the R code to perform the embedding learning using the negative
sampling (10.9). This network has 2bW = 568 embedding weights that need to
be learned from the data. There are two more parameters involved on line 10 of
Listing 10.6. These two parameters shift the scalar products by an intercept $\beta_0$ and
scale them by a constant $\beta_1$. We could set $(\beta_0, \beta_1) = (0, 1)$; however, keeping
these two parameters trainable has led to results that are better centered around the
origin. Of course, these two parameters do not harm the arguments as they only

Listing 10.6 R code for negative sampling

1 center = layer_input(shape = c(1), dtype = ’int32’)


2 context = layer_input(shape = c(1), dtype = ’int32’)
3 #
4 centerEmb = center %>%
5 layer_embedding(input_dim=142,output_dim=2,input_length=1) %>% layer_flatten()
6 contextEmb = context %>%
7 layer_embedding(input_dim=142,output_dim=2,input_length=1) %>% layer_flatten()
8 #
9 response = list(centerEmb, contextEmb) %>% layer_dot(axes = 1) %>%
10 layer_dense(units=1, activation=’sigmoid’, name=’response’)
11 #
12 model = keras_model(inputs = c(center, context), outputs = c(response))

replace (10.8) by a slightly different model

$$\mathbb{P}\left[ Y = 1 \mid w, \tilde{w} \right]
= \frac{1}{1 + \exp\left\{ -\beta_0 - \beta_1 \left\langle \tilde{e}(\tilde{w}), e(w) \right\rangle \right\}}
= \frac{e^{\beta_0}}{e^{\beta_0} + e^{-\beta_1 \left\langle \tilde{e}(\tilde{w}), e(w) \right\rangle}},$$

and

$$\mathbb{P}\left[ Y = 0 \mid w, \tilde{w} \right]
= 1 - \frac{e^{\beta_0}}{e^{\beta_0} + e^{-\beta_1 \left\langle \tilde{e}(\tilde{w}), e(w) \right\rangle}}
= \frac{e^{-\beta_0}}{e^{-\beta_0} + e^{\beta_1 \left\langle \tilde{e}(\tilde{w}), e(w) \right\rangle}}.$$

We fit this model using the nadam version of the gradient descent algorithm, and
the fitted embedding weights can be extracted with get_weights(model).
Figure 10.2 shows the learned embedding weights $e(w) \in \mathbb{R}^2$ of all words $w \in \mathcal{W}$.
We highlight the words that coincide with the insured hazards in red color, see line
10 of Listing 10.1. The word ‘vehicle’ is in the first quadrant and it is surrounded
by ‘pole’, ‘truck’, ‘garage’, ‘car’, ‘traffic’. The word ‘vandalism’ is in the third
quadrant surrounded by ‘graffito’, ‘window’, ‘pavilion’, names of cities and parks,
and ‘ms’ for middle school. Finally, the words ‘fire’, ‘wind’, ‘lightning’ and ‘hail’ are
in the first and fourth quadrant, close to ‘water’; these words are surrounded by
‘bldg’ (building), ‘smoke’, ‘equipment’, ‘alarm’, ‘safety’, ‘power’, ‘library’, etc. We
conclude that these embeddings make perfect sense in an insurance claim context.
Note that we have applied some pre-processing, and embeddings could even be
improved by further pre-processing, e.g., ‘vandalism’ and ‘vandalize’ or ‘hs’ and
‘high school’ are used.
Another nice observation is that the embeddings tend to build a circle around the
origin, see Fig. 10.2. This is enforced by embedding W = 142 different words into
a b = 2 dimensional space, so that dissimilar words optimally repel each other. ■

Fig. 10.2 Two-dimensional skip-gram embedding using negative sampling; in red color are the
insured hazards ‘vehicle’, ‘fire’, ‘lightning’, ‘wind’, ‘hail’, ‘water’ and ‘vandalism’

10.2.2 Global Vectors Algorithm

A second popular word embedding approach is global vectors (GloVe), developed
by Pennington et al. [300]; we also refer to Chaubard et al. [68]. GloVe is an
unsupervised learning method that performs a word-word clustering (center-context
pairs) over all available texts. Assume that the tokenization of all texts provides us
with the words $w \in \mathcal{W}$. Choose a fixed context window size $c \in \mathbb{N}$ and define the
matrix

$$C = \left( C(w, \tilde{w}) \right)_{w, \tilde{w} \in \mathcal{W}} \in \mathbb{N}_0^{W \times W},$$

with $C(w, \tilde{w})$ counting the number of co-occurrences of $w$ and $\tilde{w}$ over all available
texts where the word $\tilde{w}$ appears as a context word of the center word $w$ (for the
given window size $c$). We note that $C$ is a symmetric matrix that is typically sparse,
as many words do not appear in the context of other words (on finitely many
texts). Figure 10.3 shows the center-context pairs $(w, \tilde{w})$ co-occurrence matrix $C$
of Example 10.2, which is based on W = 142 words and 23’952 center-context
pairs. The color pixels indicate the pairs that occur in the data, $C(w, \tilde{w}) > 0$, and
the white space corresponds to the pairs that have not been observed in the texts,
$C(w, \tilde{w}) = 0$. This plot confirms the sparsity of the center-context pairs; the words
are ordered w.r.t. their frequencies in the texts.

Fig. 10.3 Center-context pairs $(w, \tilde{w})$ co-occurrence matrix $C$ of Example 10.2; the color scale
gives the observed frequencies
In an empirical analysis Pennington et al. [300] have observed that the crucial
quantities to be considered are the ratios for fixed context words. That is, for a
context word $\tilde{w}$ study a function of the center words $w$ and $v$ (subject to existence
of the right-hand side)

$$(w, v, \tilde{w}) \;\mapsto\; F(w, v, \tilde{w})
= \frac{C(w, \tilde{w}) / \sum_{\tilde{u} \in \mathcal{W}} C(w, \tilde{u})}{C(v, \tilde{w}) / \sum_{\tilde{u} \in \mathcal{W}} C(v, \tilde{u})}
= \frac{\widehat{p}(\tilde{w} \mid w)}{\widehat{p}(\tilde{w} \mid v)},$$

with $\widehat{p}$ denoting the empirical probabilities. An empirical analysis suggests that
such an approach seems to lead to a good discrimination of the meanings of the
words, see Sect. 3 in Pennington et al. [300]. Further simplifications and
assumptions provide the following ansatz, for details we refer to Pennington et
al. [300],

$$\log C(w, \tilde{w}) \approx \left\langle \tilde{e}(\tilde{w}), e(w) \right\rangle + \tilde{\beta}_{\tilde{w}} + \beta_w,$$

with intercepts $\tilde{\beta}_{\tilde{w}}, \beta_w \in \mathbb{R}$. There is still one issue, namely, that $\log C(w, \tilde{w})$ may
not be well-defined as certain pairs $(w, \tilde{w})$ are not observed. Therefore, Pennington
et al. [300] propose to solve a weighted squared error loss function problem to find
the embedding mappings $e, \tilde{e}$ and intercepts $\tilde{\beta}_{\tilde{w}}, \beta_w \in \mathbb{R}$. Their objective function
is given by

$$\sum_{w, \tilde{w} \in \mathcal{W}} \chi\left( C(w, \tilde{w}) \right)
\left( \log C(w, \tilde{w}) - \left\langle \tilde{e}(\tilde{w}), e(w) \right\rangle - \tilde{\beta}_{\tilde{w}} - \beta_w \right)^2,
\tag{10.10}$$

with weighting function

$$x \ge 0 \;\mapsto\; \chi(x) = \left( \frac{x \wedge x_{\max}}{x_{\max}} \right)^{\alpha},$$

for $x_{\max} > 0$ and $\alpha > 0$. Pennington et al. [300] state that the model depends
weakly on the cutoff point $x_{\max}$; they propose $x_{\max} = 100$, and a sub-linear
behavior seems to outperform a linear one, suggesting, e.g., a choice of $\alpha = 3/4$.
Under these choices the embeddings $e$ and $\tilde{e}$ are found by minimizing the objective
function (10.10) for the given data. Note that $\lim_{x \downarrow 0} \chi(x) (\log x)^2 = 0$.
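A small sketch of this weighting function and of the objective (10.10), restricted to the observed pairs; the vectors counts and score below are our own assumed inputs.

## sketch of the GloVe weighting function and objective (10.10)
chi <- function(x, x_max = 100, alpha = 3/4) (pmin(x, x_max) / x_max)^alpha

## weighted squared error over the observed pairs C(w,w~) > 0, where
## score = <e~(w~), e(w)> + beta~_{w~} + beta_w holds the fitted scores
glove_loss <- function(counts, score) {
  sum(chi(counts) * (log(counts) - score)^2)
}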
Example 10.3 (GloVe Word Embedding) We provide an example using the GloVe
embedding model, and we revisit the data of Example 10.2; we also use exactly the
same pre-processing as in that example. We start from N = 23’952 center-context
pairs.
In a first step we count the number of co-occurrences $C(w, \tilde{w})$. There are only
4’972 pairs that occur, $C(w, \tilde{w}) > 0$; this corresponds to the colors in Fig. 10.3.
With these 4’972 pairs we have to fit 568 embedding weights (for the embedding
dimension b = 2) and 284 intercepts $\tilde{\beta}_{\tilde{w}}, \beta_w$, thus, 852 parameters in total. The
results of this fitting are shown in Fig. 10.4.
The general picture in Fig. 10.4 is similar to Fig. 10.2, e.g., ‘vandalism’ is
surrounded by ‘graffito’, ‘window’, ‘pavilion’, names of cities and parks, ‘ms’
and ‘es’; or ‘vehicle’ is surrounded by ‘pole’, ‘traffic’, ‘street’, ‘signal’. However,
the clustering of the words around the origin shows a crucial difference between
GloVe and the negative sampling of word2vec. The problem here is that we do
not have sufficiently many observations. We have 4’972 center-context pairs that
occur, $C(w, \tilde{w}) > 0$; 2’396 of these pairs occur exactly once, $C(w, \tilde{w}) = 1$, which is
almost half of the observations with $C(w, \tilde{w}) > 0$. GloVe (10.10) considers these
observations on the log-scale, which provides $\log C(w, \tilde{w}) = 0$ for the pairs that
occur exactly once. The weighted square loss for these pairs is minimized by either
setting $\tilde{e}(\tilde{w}) = 0$ or $e(w) = 0$, supposing that the intercepts are also set to 0. This
is exactly what we observe in Fig. 10.4 and, thus, successfully fitting GloVe would
require many more (frequent) observations. ■

Remark 10.4 (Pre-trained Word Embeddings) In practical applications we rely on


pre-trained word embeddings. For GloVe there are pre-trained versions that can be
downloaded.4 These pre-trained versions comprise a vocabulary of 400K words,
and they exist for the embedding dimensions b = 50, 100, 200, 300. These GloVe’s
have been trained on Wikipedia 2014 and Gigaword 5 which provided roughly 6B
tokens. Another pre-trained open-source model that can be downloaded is spaCy.5

4 https://nlp.stanford.edu/projects/glove/.
5 https://spacy.io/models/en#en_core_web_md.

Fig. 10.4 Two-dimensional GloVe embedding; in red color are the insured hazards ‘vehicle’,
‘fire’, ‘lightning’, ‘wind’, ‘hail’, ‘water’ and ‘vandalism’

Pre-trained embeddings can be problematic if we work in very specific settings.


For instance, the Wisconsin LGPIF data contains the word ‘Lincoln’ in the claim
descriptions. Now, Lincoln is a county in Wisconsin, it is town in Kewaunee County
in Wisconsin, it is a former US president, there are Lincoln memorials, it is a
common street name, it is a car brand and there are restaurants with this name.
In our context, Lincoln is most commonly used w.r.t. the Lincoln Elementary and
Middle Schools. On the other hand, it is likely that in pre-trained embeddings a
different meaning of Lincoln is predominant, and therefore the embedding may not
be reasonable for our insurance problem.

10.3 Lab: Predictive Modeling Using Word Embeddings

This section gives an example of applying the word embedding technique in a
predictive modeling setting. This example is based on the Wisconsin LGPIF data
set illustrated in Listing 10.1. Our goal is to predict the hazard types on line 10
of Listing 10.1 from the claim descriptions on line 11. We perform the same data
cleaning process as in Example 10.2. This provides us with W = 1’829 different
words, and the resulting (short) claim descriptions have a maximal length of T = 9.
After padding with zeros we receive n = 6’031 claim descriptions given by texts
$(w_1, \ldots, w_T)^\top \in \mathcal{W}_0^T$; we apply the padding to the left end of the sentences.
Word2vec Using Negative Sampling We start with the word2vec embedding
technique using negative sampling. We follow Example 10.2, and to successfully
embed the available words $w \in \mathcal{W}$ we restrict the vocabulary to the words that are
used at least 20 times. This reduces the vocabulary from 1’892 different words to
142 different words. The number of claim descriptions is reduced to 5’883 because
148 claim descriptions do not contain any of these 142 different words and, thus,
cannot be classified as one of the hazard types (based on this reduced vocabulary).
In a first analysis we choose the embedding dimension b = 2, and this provides
us with the word2vec embedding map that is illustrated in Fig. 10.2. Based on these
embeddings we aim at predicting the hazard types from the claim descriptions. We
have 9 different hazard types: Fire, Lightning, Hail, Wind, WaterW, WaterNW,
Vehicle, Vandalism and Misc.6 Therefore, we design a categorical classification
model that has 9 different labels, we refer to Sect. 2.1.4.

Listing 10.7 R code for the hazard type prediction based on a word2vec embedding

1 input = layer_input(shape = list(T), name = "input")
2 #
3 word2vec = input %>%
4    layer_embedding(input_dim = W+1, output_dim = b, input_length = T,
5                    weights=list(wordEmb), trainable=FALSE) %>%
6    layer_flatten()
7 #
8 response = word2vec %>%
9    layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
10   layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
11   layer_dense(units=9, activation='softmax', name='output')
12 #
13 model = keras_model(inputs = c(input), outputs = c(response))

The R code for the hazard type prediction is presented in Listing 10.7. The crucial
part is shown on line 5. Namely, the embedding map e(w) ∈ Rb , w ∈ W is
initialized with the embedding weights wordEmb received from Example 10.2, and

6 WaterW relates to weather related water claims, and WaterNW relates to non-weather related
water claims.

Fig. 10.5 Confusion matrices of the hazard type prediction using a word2vec embedding based on
negative sampling: (lhs) b = 2 dimensional embedding and (rhs) b = 10 dimensional embedding;
columns show the observations and rows show the predictions

these embedding weights are declared to be non-trainable.7 These features are then
inputted into a FN network with two FN layers having (q1, q2) = (20, 15) neurons,
and as output activation we choose the softmax function. This model has 286 non-
trainable embedding weights, and r = (9 · 2 + 1)20 + (20 + 1)15 + (15 + 1)9 = 839
trainable parameters.
We fit this network using the nadam version of the gradient descent method, and
we apply early stopping on a 20% validation data set (of the entire data). This
network is fitted in a few seconds, and the results are presented in Fig. 10.5 (lhs).
This figure shows the confusion matrix of predictions vs. observations (rows vs.
columns). The general results look rather good; there are only difficulties in
distinguishing WaterW from WaterNW claims.
In a second analysis, we increase the embedding dimension to b = 10 and
we perform exactly the same procedure as above. A higher embedding dimension
allows the embedding map to better discriminate the words in their meanings.
However, we should not go for a too high b because we have only 142 different
words and 47’904 center-context pairs $(w, \tilde{w})$ to learn these embeddings $e(w) \in \mathbb{R}^b$.
A higher embedding dimension also increases the number of network weights in
the first FN layer on line 9 of Listing 10.7. This time, we need to train r =
(9 · 10 + 1)20 + (20 + 1)15 + (15 + 1)9 = 2’279 parameters. The results are
presented in Fig. 10.5 (rhs). We observe an overall improvement compared to the
2-dimensional embeddings. This is also confirmed by Table 10.1, which gives the
deviance losses and the misclassification rates.

7 The zeros from padding are mapped to the origin.



Table 10.1 Hazard prediction results summarized in deviance losses and misclassification rates

                                         Number of parameters     Deviance   Misclassification
                                         Embedding    Network     loss       rate
 word2vec negative sampling, b = 2            286        839      0.1442     19.9%
 word2vec negative sampling, b = 10         1’430      2’279      0.0912     13.7%
 FN GloVe using all words, b = 50          91’500      9’479      0.0802     11.7%
 LSTM GloVe using all words, b = 50        91’500      3’369      0.0802     12.1%
 Word similarity embedding, b = 7          12’810      1’739      0.1396     21.1%

Pre-trained GloVe Embedding In a next analysis we use the pre-trained GloVe
embeddings, see Remark 10.4. This allows us to use all W = 1’892 words that
appear in the n = 6’031 claim descriptions, and we can also classify all these
claims, i.e., we can classify more claims here compared to the 5’883 claims we
have classified based on the self-trained word2vec embeddings. Apart from that, all
modeling steps are chosen as above. Only the higher embedding dimension b = 50
from the pre-trained glove.6B.50d increases the size of the network parameter
to r = (9 · 50 + 1)20 + (20 + 1)15 + (15 + 1)9 = 9’479; remark that
the 91’500 embedding weights are not trained as they come from the pre-trained
GloVe embeddings. Using the nadam optimizer with early stopping provides us
with the results in Fig. 10.6 (lhs). Using this pre-trained GloVe embedding leads to
a further improvement; this is also verified by Table 10.1. The effect of using the
pre-trained GloVe is two-fold. On the one hand, it allows us to use all words of the
claim descriptions, which improves the prediction accuracy. On the other hand, the
embeddings are not adapted to insurance problems, as these have been trained on
Wikipedia and Gigaword texts. The former advantage overrules the latter
shortcoming in our example.
All the results above have been obtained with the FN network of Listing 10.7. We
made this choice because our texts have a maximal length of T = 9, which is very
short. In general, texts should be understood as time-series, and RN networks are
a canonical choice to analyze these time-series. Therefore, we study again the pre-
trained GloVe embeddings, but we process the texts with a LSTM architecture; we
refer to Sect. 8.3.1 for LSTM layers.
Listing 10.8 shows the LSTM architecture used. On line 9 we set the variable
return_sequences to true, which implies that all intermediate steps $z_t^{[1]}$, $1 \le t \le T$, are outputted to a time-distributed FN layer on line 10, see Sect. 8.2.4 for
time-distributed layers. This LSTM network has r = 4(50 + 1 + 10)10 + (10 +
1)10 + (90 + 1)9 = 3’369 parameters. The flatten layer on line 11 of Listing 10.8
turns the T = 9 outputs $z_t^{[2]} \in \mathbb{R}^{q_2}$, $1 \le t \le T$, of dimension $q_2 = 10$ into a vector
of size $T q_2 = 90$. This vector is then fed into the output layer on line 12. At this
stage, one could reduce the dimension of the parameter by setting a max-pooling
layer in between the flatten and the output layer, as sketched below.
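One possible variant is sketched below; this is our own modification, not the fitted model, and it replaces the flatten layer on line 11 of Listing 10.8 by a global max-pooling, which has the same dimension-reducing effect and leaves the output layer with only (10 + 1)9 = 99 parameters.

## sketch of a max-pooling variant of Listing 10.8 (an assumption, not the fitted model)
response = word2vec %>%
   layer_lstm(units=10, activation='tanh', return_sequences=TRUE, name='LSTM') %>%
   time_distributed(layer_dense(units=10, activation='tanh', name='FNLayer')) %>%
   layer_global_max_pooling_1d() %>%
   layer_dense(units=9, activation='softmax', name='output')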

Fig. 10.6 Confusion matrices of the hazard type prediction using the pre-trained GloVe with b =
50: (lhs) FN network and (rhs) LSTM network; columns show the observations and rows show the
predictions

Listing 10.8 R code for the hazard type prediction using a LSTM architecture

1 input = layer_input(shape = list(T), name = "input")
2 #
3 word2vec = input %>%
4    layer_embedding(input_dim = W+1, output_dim = b, input_length = T,
5                    weights=list(wordEmb), trainable=FALSE)
6 #
7 response = word2vec %>%
8    layer_lstm(units=10, activation='tanh',
9               return_sequences=TRUE, name='LSTM') %>%
10   time_distributed(layer_dense(units=10, activation='tanh', name='FNLayer')) %>%
11   layer_flatten() %>%
12   layer_dense(units=9, activation='softmax', name='output')
13 #
14 model = keras_model(inputs = c(input), outputs = c(response))

We fit this LSTM architecture to the data using the pre-trained GloVe embed-
dings. The results are presented in Fig. 10.6 (rhs) and Table 10.1. We receive the
same deviance loss, and the misclassification rate is slightly worse than in the
FN network case (with the same pre-trained GloVe embeddings). Note that the
deviance loss is calculated on the estimated classification probabilities
$\widehat{p}(x) = (\widehat{p}_1(x), \ldots, \widehat{p}_9(x))^\top$, and the labels are received by

$$\widehat{Y} = \widehat{Y}(x) = \underset{k=1,\ldots,9}{\arg\max}\; \widehat{p}_k(x).$$

Thus, it may happen that the improvements on the estimated probabilities are not
fully reflected in the predicted labels.
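A minimal sketch of this label extraction is given below; the test tensor x_test is our own assumed object.

## sketch: from estimated probabilities to predicted hazard labels
hazards <- c("Fire", "Lightning", "Hail", "Wind", "WaterW", "WaterNW",
             "Vehicle", "Vandalism", "Misc")
p_hat <- model %>% predict(x_test)    # n x 9 estimated classification probabilities
Y_hat <- hazards[max.col(p_hat)]      # arg-max label per claim description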

Word (Cosine) Similarity In our final analysis we work with the pre-trained GloVe
embeddings $e(w) \in \mathbb{R}^{50}$, but we first try to reduce the embedding dimension $b$. For
this we follow Lee et al. [236], and we consider a word similarity. We can define
the similarity of the words $w$ and $w' \in \mathcal{W}$ by considering the scalar product of their
embeddings

$$\mathrm{sim}^{(u)}(w, w') = \left\langle e(w), e(w') \right\rangle
\qquad \text{or} \qquad
\mathrm{sim}^{(n)}(w, w') = \frac{\left\langle e(w), e(w') \right\rangle}{\| e(w) \|_2 \, \| e(w') \|_2}.
\tag{10.11}$$

The first one is an unweighted version, and the second one is a normalized version
scaling with the corresponding Euclidean norms so that the similarity measure is
within [−1, 1]; in fact, the latter is also called cosine similarity. To reduce the
embedding dimension, and because we have a classification problem with hazard
names, we can evaluate the (cosine) similarity of all used words $w \in \mathcal{W}$ to the
hazards $h \in \mathcal{H} = \{\text{fire}, \text{lightning}, \text{hail}, \text{wind}, \text{water}, \text{vehicle}, \text{vandalism}\}$. Observe
that water is further separated into weather related and non-weather related claims,
and there is a further hazard type called misc, which collects all the rest. We could
choose more words in $\mathcal{H}$ to more precisely describe these water and other claims. If
we just use $\mathcal{H}$ we obtain a $b = |\mathcal{H}| = 7$ dimensional embedding mapping

$$w \in \mathcal{W}_0 \;\mapsto\; e^{(a)}(w)
= \left( \mathrm{sim}^{(a)}(w, \text{fire}), \ldots, \mathrm{sim}^{(a)}(w, \text{vandalism}) \right)^\top \in \mathbb{R}^{b=7},
\tag{10.12}$$

for $a \in \{u, n\}$. This gives us for every $\text{text} = (w_1, \ldots, w_T)^\top \in \mathcal{W}_0^T$ the
pre-processed features

$$\text{text} \;\mapsto\; \left( e^{(a)}(w_1), \ldots, e^{(a)}(w_T) \right)^\top \in \mathbb{R}^{T \times b}.
\tag{10.13}$$

Lee et al. [236] apply a max-pooling layer to these embeddings, which are then
inputted into a GAM classification model. We use a different approach here, and
directly use the unweighted (a = u) text representations (10.13) as an input to a
network, either of the FN network type of Listing 10.7 or of the LSTM type of
Listing 10.8. If we use the FN network type we receive the results on the last line of
Table 10.1 and in Fig. 10.7.
Comparing the results of the word similarity through the embeddings (10.12)
and (10.13) to the other prediction results, we conclude that this word similarity
approach is not fully competitive compared to working directly with the word2vec
or GloVe embeddings. It seems that the projection (10.12) does not discriminate
sufficiently for our classification task.
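A sketch of the similarity embedding (10.12) for the cosine version a = n is given below; the matrix glove (rownames = words, b = 50 columns) is our own assumed object holding the pre-trained embeddings, and the queried word ‘flood’ is only for illustration.

## sketch of the cosine similarity embedding (10.12)
hazard_words <- c("fire", "lightning", "hail", "wind", "water", "vehicle", "vandalism")

cosine_sim <- function(u, v) sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

sim_embed <- function(w, glove) {
  sapply(hazard_words, function(h) cosine_sim(glove[w, ], glove[h, ]))
}
sim_embed("flood", glove)   # 7-dimensional similarity embedding of the word 'flood'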

Fig. 10.7 Confusion matrix of the hazard type prediction using the word similarity (10.12)–(10.13)
for a = u; columns show the observations and rows show the predictions
10.4 Lab: Deep Word Representation Learning

All examples above have been relying on embedding the words w ∈ W into
a Euclidean space e(w) ∈ Rb by performing a sort of unsupervised learning
that provided word similarity clusters. The advantage of this approach is that
the embedding is decoupled from the regression or classification task, this is
computationally attractive. Moreover, once a suitable embedding has been learned,
it can be used for several different tasks (in the spirit of transfer learning). The
disadvantage of the pre-trained embeddings is that the embedding is not targeted to
the regression task at hand. This has already been discussed in Remark 10.4 where
we have highlighted that the meaning of some words (such as Lincoln) depends very
much on its context.
Recent NLP aims at pre-processing a text as little as necessary, but tries
to directly feed the raw sentences into RN networks such as LSTM or GRU
architectures. Computationally this is much more demanding because we have
to learn the embeddings and the network weights simultaneously, we refer to
Table 10.1 to indicate the number of parameters involved. The purpose of this short
section is to give an example, though our NLP database is rather small; this latter
approach usually requires a huge database and the corresponding computational
power. Ferrario–Nägelin [126] provide a more comprehensive example on the
classification of movie reviews. For their analysis they evaluated approximately
50’000 movie reviews each using between 235 and 2’498 words. Their analysis
was implemented on the ETH High Performance Computing (HPC) infrastructure
Euler8, and their run times have been between 20 and 30 minutes, see Table 8 of
Ferrario–Nägelin [126].

8 https://scicomp.ethz.ch/wiki/Euler

Since we neither have the computational power nor the big data to fit such
a NLP application, we start the gradient descent fitting from the initial embedding
weights $e(w) \in \mathbb{R}^b$ that either come from the word2vec or the GloVe embeddings.
During the gradient descent fitting, we allow these weights to change w.r.t. the
regression task at hand. In comparison to Sect. 10.3, this only requires minor
changes to the R code; namely, the only modification needed is to change from
FALSE to TRUE on line 5 of Listings 10.7 and 10.8. This change allows us to
learn adapted weights during the gradient descent fitting. The resulting classification
models are now very high-dimensional, and we need to carefully assess the
early stopping rule; otherwise the model will (in-sample) over-fit to the learning
data.
In Fig. 10.8 we provide the results that correspond to the self-trained word2vec
embeddings given in Fig. 10.5, and the corresponding numerical results are given
in Table 10.2. We observe an improvement in the prediction accuracy in both cases
by letting the embedding weights be learned during the network fitting, and we
receive a misclassification rate of 11.6% and 11.0% for the embedding dimensions
b = 2 and b = 10, respectively, see Table 10.2.
Figure 10.8 (rhs) illustrates how the embeddings have changed from the initial
(pre-trained) embeddings $e^{(0)}(w)$ (coming from the word2vec negative sampling)
to the learned embeddings $e(w)$. We measure these changes in terms of the
unweighted similarity measure defined in (10.11), given by

$$\left\langle e^{(0)}(w), e(w) \right\rangle.
\tag{10.14}$$

The upper horizontal line is a manually set threshold to identify the words $w$ that
experience a major change in their embeddings. These are the words ‘vandalism’,
‘lightning’, ‘graffito’, ‘fence’, ‘hail’, ‘freeze’, ‘blow’ and ‘breakage’. Thus, these
words receive a different embedding location/meaning which is more favorable for
our classification task.
A similar analysis can be performed for the pre-trained GloVe embeddings. There
we would expect bigger changes to the embeddings, since the GloVe embeddings
have not been learned in an insurance context, and the embeddings will be adapted
to the insurance prediction problem. We refrain from giving an explicit analysis
here, because a thorough analysis would require (much) more data.
We conclude this example with some remarks. We emphasize once more that
our available data is minimal, and we expect (even much) better results for longer
claim descriptions. In particular, our data is not sufficient to discriminate the weather
related from the non-weather related water claims, as the claim descriptions seem
to focus on the water claim itself and not on its cause. In a next step, one should use
claim descriptions in order to predict the claim sizes, or to improve their predictions
if they are based on classical tabular features only. Here, we see some potential, in
particular, w.r.t. medical claims, as medical reports may clearly indicate the severity
of the claim and may also give some insight into the recovery process.
Thus, our small example may only give some intuition of what is possible with

Fig. 10.8 Confusion matrices and the changes in the embeddings compared to the pre-trained
word2vec embeddings of Fig. 10.5 for the dimensions b = 2 and b = 10

Table 10.2 Hazard prediction results summarized in deviance losses and misclassification rates:
pre-trained embeddings vs. network learned embeddings

                                          Number of parameters       Deviance   Misclass.
                                          Non-trainable  Trainable   loss       rate
 word2vec negative sampling, b = 2              286           839    0.1442     19.9%
 word2vec improved embedding, b = 2               –         1’125    0.0814     11.7%
 word2vec negative sampling, b = 10           1’430         2’279    0.0912     13.7%
 word2vec improved embedding, b = 10              –         3’709    0.0714     10.5%

(unstructured) text data. Unfortunately, the LGPIF data of Listing 10.1 did not give
us any satisfactory results for the claim size prediction, for several reasons.
Firstly, the data is rather heterogeneous, ranging from small to very large claims,
and any member of the EDF struggles to model this data; we come back to a
different modeling proposal for heterogeneous data in Sect. 11.3.2. Secondly, the
claim descriptions are not very explanatory, as they are too short for more detailed
information. Thirdly, the data has only 5’424 claims, which seems small compared
to the complexity of the problem that we try to solve.

10.5 Outlook: Creating Attention

In text recognition problems, obviously, not all the words in a sentence have the
same importance. In the examples above, we have removed the stopwords as they
may disturb the key understanding of our texts. Removing the stopwords means that
we pay more attention to the remaining words. RN networks often face difficulties in giving the right recognition to the different parts of a sentence. For this reason, attention layers have recently gained popularity. Attention layers are special modules in network architectures that allow the network to put more weight on certain parts of the feature information to emphasize their importance. The attention mechanism has been introduced in Bahdanau et al. [21]. There are different ways of modeling attention; the most popular one is the so-called dot-product attention, we refer to Vaswani et al. [366], and in the actuarial literature we mention Kuo–Richman [231] and Troxler–Schelldorfer [354].
We start by describing a simple attention mechanism. Consider a sentence text = (w_1, ..., w_T) ∈ W_0^T that provides, under an embedding map e : W_0 → R^b, the embedded sentence (e(w_1), ..., e(w_T))^⊤ ∈ R^{T×b}. We choose a weight matrix U_Q ∈ R^{b×b} and an intercept vector u_Q ∈ R^b. Based on these choices we consider for each word w_t of our sentence the score, called query,

$$q_t = \tanh\left(u_Q + U_Q\, e(w_t)\right) \in (-1,1)^b. \tag{10.15}$$

Matrix Q = (q_1, ..., q_T)^⊤ ∈ R^{T×b} collects all queries. It is obtained by applying a time-distributed FN layer with b neurons to the embedded sentence (e(w_1), ..., e(w_T))^⊤.
These queries q_t are evaluated with a so-called key k ∈ R^b giving us the attention weights

$$\alpha_t = \frac{\exp \langle k, q_t\rangle}{\sum_{s=1}^T \exp \langle k, q_s\rangle} \in (0,1) \qquad \text{for } 1 \le t \le T. \tag{10.16}$$

Using these attention weights α = (α_1, ..., α_T)^⊤ ∈ (0,1)^T we encode the sentence text as

$$\text{text} = (w_1, \ldots, w_T) \;\mapsto\; w^* = \sum_{t=1}^T \alpha_t\, e(w_t) = \left(e(w_1), \ldots, e(w_T)\right) \alpha \;\in\; \mathbb{R}^b. \tag{10.17}$$

Thus, to every sentence text we assign a categorical probability vector α = α(text) ∈ Δ_T (the unit simplex), see Sect. 2.1.4, (6.22) and (5.69), which is encoding this sentence text to a b-dimensional vector w^* ∈ R^b. This vector is then further processed by the network. Such a construction is called a self-attention mechanism because the text (w_1, ..., w_T) ∈ W_0^T is used to formulate the queries in (10.15), but, of course, these queries could also come from a completely different source. In the above set-up we have to learn the parameters U_Q ∈ R^{b×b} and u_Q, k ∈ R^b, assuming that the embedding map e : W_0 → R^b has already been specified.
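The following minimal R sketch illustrates the computations (10.15)–(10.17) for one embedded toy sentence; all inputs are random placeholder values, whereas in an application the parameters U_Q, u_Q and k are learned by gradient descent.

# self-attention encoding (10.15)-(10.17) for one embedded toy sentence
set.seed(1)
T <- 5; b <- 3                         # sentence length and embedding dimension
E   <- matrix(rnorm(T * b), nrow = T)  # rows are the embeddings e(w_1),...,e(w_T)
U_Q <- matrix(rnorm(b * b), nrow = b)  # weight matrix U_Q
u_Q <- rnorm(b); k <- rnorm(b)         # intercept u_Q and key k
Q <- t(tanh(u_Q + U_Q %*% t(E)))       # queries q_t, see (10.15), T x b
alpha <- exp(Q %*% k)
alpha <- alpha / sum(alpha)            # attention weights, see (10.16)
w_star <- as.vector(t(E) %*% alpha)    # encoded sentence w*, see (10.17)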
There are several generalizations and modifications of this self-attention mechanism. The most common one is to expand the vector w^* ∈ R^b in (10.17) to a matrix W^* = (w^*_1, ..., w^*_q) ∈ R^{b×q}. This matrix W^* can be interpreted as having q neurons w^*_j ∈ R^b, 1 ≤ j ≤ q. For this, one replaces the key k ∈ R^b by a matrix-valued key K = (k_1, ..., k_q) ∈ R^{b×q}. This allows one to calculate the attention weight matrix

$$A = \left(\alpha_{t,j}\right)_{1 \le t \le T,\, 1 \le j \le q} = \left(\frac{\exp\langle k_j, q_t\rangle}{\sum_{s=1}^T \exp\langle k_j, q_s\rangle}\right)_{1 \le t \le T,\, 1 \le j \le q} = \operatorname{softmax}\left(Q K\right) \in (0,1)^{T \times q},$$

where the softmax function is applied column-wise. I.e., the attention weight matrix A ∈ (0,1)^{T×q} has columns α_j = (α_{1,j}, ..., α_{T,j})^⊤ ∈ Δ_T, 1 ≤ j ≤ q, which are normalized to total weight 1; this is equivalent to (10.16). This is used to encode the sentence text

$$(e(w_1), \ldots, e(w_T)) \in \mathbb{R}^{b \times T} \;\mapsto\; W^* = (e(w_1), \ldots, e(w_T))\, A = \left(\sum_{t=1}^T \alpha_{t,j}\, e(w_t)\right)_{1 \le j \le q} \in \mathbb{R}^{b \times q}. \tag{10.18}$$
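Continuing the toy sketch from above, the matrix-valued key only needs a few additional lines:

# attention layer (10.18) with a matrix-valued key, reusing E and Q above
q <- 4
K <- matrix(rnorm(b * q), nrow = b)    # keys k_1, ..., k_q
A <- exp(Q %*% K)                      # numerators exp<k_j, q_t>, T x q
A <- sweep(A, 2, colSums(A), "/")      # column-wise softmax
W_star <- t(E) %*% A                   # encoding W*, see (10.18), b x q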

Mapping (10.18) is called an attention layer. Let us give some remarks.


Remarks 10.5
• Encoding (10.18) gives a natural multi-dimensional extension of (10.17). The crucial parts are the attention weights α_j ∈ Δ_T which weigh the different words (w_t)_{1≤t≤T}. In the multi-dimensional case, we perform this weighting mechanism multiple times (in different directions), allowing us to extract different features from the sentences. In contrast, in (10.17) we only do this once. This is similar to going from one neuron to a layer of q neurons.
• The above structure uses a self-attention mechanism because the queries involve the words themselves, and the weight matrix U_Q ∈ R^{b×b} and the intercept vector u_Q ∈ R^b are learned with gradient descent. Concerning the key K ∈ R^{b×q} one often chooses another self-attention mechanism by choosing a (non-linear) function K = K(w_1, ..., w_T) to infer optimal keys.
• These attention layers are also the building blocks of transformer models. Transformer models use attention layers (10.18) of dimension W^* ∈ R^{b×T} and skip connections to transform the input

$$W = (e(w_1), \ldots, e(w_T)) \in \mathbb{R}^{b \times T} \;\mapsto\; \frac{W + W^*}{2} \in \mathbb{R}^{b \times T}. \tag{10.19}$$

Stacking multiple of these layers (10.19) transforms the original input W by weighing the important information in the feature W for the prediction task at hand. Compared to LSTM layers, this no longer screens the text sequentially, but it directly acts on the parts of the text that seem important; a small sketch of (10.19) follows after these remarks.
• The attention mechanism is applied to a matrix (e(w_1), ..., e(w_T))^⊤ ∈ R^{T×b} which presents a numerical encoding of the sentence (w_1, ..., w_T) ∈ W_0^T. Kuo–Richman [231] propose to apply this attention mechanism more generally to categorical feature components. Assume that we have T categorical feature components x_1, ..., x_T; after embedding them into b-dimensional Euclidean spaces we receive a representation (e(x_1), ..., e(x_T))^⊤ ∈ R^{T×b}, see (7.31). Naturally, this can be further processed by putting different attention on the components of this embedding, exactly using an attention layer (10.18); alternatively, we can use transformer layers (10.19).
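The following lines sketch one transformer-type update (10.19), continuing the toy example above with q = T keys; real transformer layers add further components (e.g., normalization and feed-forward sublayers) that are not shown here.

# transformer-type update (10.19): attention layer (10.18) with q = T keys
K_T <- matrix(rnorm(b * T), nrow = b)   # one key per word
A_T <- exp(Q %*% K_T)
A_T <- sweep(A_T, 2, colSums(A_T), "/") # column-wise softmax, T x T
W     <- t(E)                           # input W = (e(w_1),...,e(w_T)), b x T
W_new <- (W + W %*% A_T) / 2            # skip connection, see (10.19)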

Example 10.6 We revisit the hazard type prediction example of Sect. 10.3. We
select the b = 10 word2vec embedding (using negative sampling) and the
pre-trained GloVe embedding of Table 10.1. These embeddings are then further
processed by applying the attention mechanism (10.15)–(10.17) on the embeddings
using one single attention neuron. Listing 10.9 gives the corresponding implemen-
tation. On line 9 we have the query (10.15), on lines 10–13 the key and the attention
weights (10.16), and on line 15 the encodings (10.17). We then process these
encodings through a FN network of depth d = 2, and we use the softmax output
activation to receive the categorical probabilities. Note that we keep the learned
word embeddings e(w) as non-trainable on line 5 of Listing 10.9.
Table 10.3 gives the results, and Fig. 10.9 shows the confusion matrices. We conclude that the results are rather similar; this attention mechanism seems to work quite well, and with fewer parameters, here. □

Listing 10.9 R code for the hazard type prediction using an attention layer with q = 1

1 input = layer_input(shape = list(T), name = "input")
2 #
3 word2vec = input %>%
4 layer_embedding(input_dim = W+1, output_dim = b, input_length = T,
5 weights=list(wordEmb), trainable=FALSE) %>%
6 layer_flatten()
7 #
8 attention = word2vec %>%
9 time_distributed(layer_dense(units=b, activation='tanh')) %>%
10 time_distributed(layer_dense(units=1, activation='linear',
11 use_bias=FALSE)) %>%
12 layer_flatten() %>%
13 layer_dense(units=T, activation='softmax', weights=list(diag(T)),
14 use_bias=FALSE, trainable=FALSE)
15 #
16 response = list(attention, word2vec) %>% layer_dot(axes=1) %>%
17 layer_dense(units=20, activation='tanh') %>%
18 layer_dense(units=15, activation='tanh') %>%
19 layer_dense(units=9, activation='softmax')
20 #
21 model = keras_model(inputs = c(input), outputs = c(response))

Table 10.3 Hazard prediction results summarized in deviance losses and misclassification rates

                                      Number of parameters    Deviance   Misclassification
                                      Embedding    Network    loss       rate
word2vec negative sampling, b = 10       1'430       2'279    0.0912     13.7%
word2vec attention, b = 10               1'430         799    0.0784     12.0%
FN GloVe using all words, b = 50        91'500       9'479    0.0802     11.7%
GloVe attention, b = 50                 91'500       4'079    0.0824     12.6%

(Figure 10.9, reconstructed from the garbled plot panels.)

Confusion matrix with embeddings b = 10 (attention on word2vec; rows are the predictions, columns the observations Fire, Lightning, Hail, Wind, WaterW, WaterNW, Vehicle, Vandalism, Misc):

Fire      177    3    0    0    0    2    9    16    8
Light.      9  911    1    6    2    0    0     0    7
Hail        0    0   85    5    0    0    0     0    0
Wind        1    8    5  354   18    1    1     1   13
Wat.W       4    2    0    4  403  191    1     0   14
Wat.NW      2    0    0    0    9   50    8     1    8
Vehicle     5    9    2   14    5    3  973    29   39
Vand.       6    2    1    4    5    1   20  1960   52
Misc        7   19    0    9   18   19   41    37  263

Confusion matrix with embeddings b = 50 (attention on pre-trained GloVe; same row and column order):

Fire      185    3    0    0    1    2   13    14    4
Light.      9  912    1    6    2    0    0     0    4
Hail        0    0   85    5    1    0    0     0    0
Wind        1    0    4  342    5    0    2     1    8
Wat.W       3    7    1   22  405  199    6     3   15
Wat.NW      0    0    0    1   26   45    0     1    6
Vehicle     5   11    1   14    5    4  975    31   47
Vand.       7    1    1    5    4    4   46  2012   68
Misc        8   21    1    8   15   15   37    22  313

Fig. 10.9 Confusion matrices of the hazard type prediction (lhs) using an attention layer on the word2vec embeddings with b = 10, and (rhs) using an attention layer on the pre-trained GloVe embeddings with b = 50; columns show the observations and rows show the predictions

Chapter 11
Selected Topics in Deep Learning

11.1 Deep Learning Under Model Uncertainty

We revisit claim size modeling in this section. Claim size modeling is challenging
because often there is no (simple) off-the-shelf distribution that allows one to
appropriately describe all claim size observations. E.g., the main body of the claim
size data may look gamma distributed, and, at the same time, large claims seem
to be more heavy-tailed (contradicting a gamma model assumption). Moreover,
different product and claim types may lead to multi-modality in the claim size
densities. In Sects. 5.3.7 and 5.3.8 we have explored a gamma and an inverse
Gaussian GLM to model a motorcycle claims data set. In that example, the results
have been satisfactory because this motorcycle data is neither multi-modal nor does
it have heavy tails. These two GLM approaches have been based on the EDF (2.14),
modeling the mean x → μ(x) with a regression function and assuming a constant
dispersion parameter ϕ > 0. There are two natural ways to extend this approach.
One considers a double GLM with a dispersion submodel x → ϕ(x), see Sect. 5.5,
the other explores multi-parameter extensions like the generalized inverse Gaussian
model, which is a k = 3 vector-valued EF, see (2.10), or the GB2 family that
involves 4 parameters, see (5.79). These extensions add more complexity, also in terms of MLE. In this section, we are not going to consider multi-parameter extensions, but
in a first step we aim at robustifying (mean) parameter estimation within the EDF.
In a second step we are going to analyze the resulting dispersion ϕ(x). For these
steps, we perform representation learning and parameter estimation under model
uncertainty by simultaneously considering multiple models from Tweedie’s family.
These considerations are closely related to Tweedie’s forecast dominance given in
Definition 4.22.


We emphasize that we remain within a single distribution function choice in this section, i.e., we neither consider mixture distributions nor composite models here. Mixture density networks are going to be considered in Sect. 11.6, below, and a composite model approach is studied in Sect. 11.3, below. These mixture density networks and composite models allow us to model the body and the tail of the data with different distribution functions by either mixing or concatenating suitable distributions.

11.1.1 Recap: Tweedie’s Family

Tweedie’s family with power variance function V (μ) = μp , p ≥ 2, provides us


with a rich model class for claim size modeling if the claim sizes are strictly positive,
a.s., and extending to p ∈ (1, 2) allows us to model claims with a positive point mass
in 0. This class of distribution functions contains the gamma case (p = 2) and the
inverse Gaussian case (p = 3). In general, p > 2 provides us with positive stable
generated distributions and p ∈ (1, 2) gives Tweedie’s CP models, see Table 2.1.
Tweedie’s family has cumulant function for p > 1
 2−p
1
((1 − p)θ ) 1−p for p > 1 and p = 2,
κ(θ ) = κp (θ ) = 2−p (11.1)
−log(−θ ) for p = 2,

on the effective domain θ ∈  ∈ (−∞, 0) for p ∈ (1, 2], and θ ∈  ∈ (−∞, 0]


for p > 2. The mean and the power variance function are for p > 1 given by
1
θ → μ = μ(θ ) = ((1 − p)θ ) 1−p and μ → V (μ) = μp .

The unit deviance takes the following form for p > 1 and p ≠ 2, see (4.18),

$$d_p(y, \mu) = 2\left(y\, \frac{y^{1-p} - \mu^{1-p}}{1-p} - \frac{y^{2-p} - \mu^{2-p}}{2-p}\right) \ge 0, \tag{11.2}$$

and in the gamma case p = 2 we have, see Table 4.1,

$$d_2(y, \mu) = 2\left(\frac{y}{\mu} - 1 + \log\frac{\mu}{y}\right) \ge 0. \tag{11.3}$$
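These unit deviances are straightforward to code; the following minimal R helper (our own function, not from the text) is reused in later sketches of this chapter.

unit_deviance <- function(y, mu, p) {
  # Tweedie unit deviances (11.2)-(11.3) for p > 1
  if (p == 2) {                        # gamma case (11.3)
    2 * (y / mu - 1 + log(mu / y))
  } else {                             # p > 1 and p != 2, see (11.2)
    2 * (y * (y^(1 - p) - mu^(1 - p)) / (1 - p)
         - (y^(2 - p) - mu^(2 - p)) / (2 - p))
  }
}
unit_deviance(2, 2, p = 2.5)           # equals 0 if and only if y = mu
unit_deviance(6, 2, p = 3)             # inverse Gaussian unit deviance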

Fig. 11.1 (lhs) Unit deviances y → d_p(y, μ) ≥ 0 for fixed mean μ = 2 and (rhs) unit deviances μ → d_p(y, μ) ≥ 0 for fixed observation y = 2, for power variance parameters p ∈ {0, 2, 2.5, 3, 3.5}

Figure 11.1 (lhs) shows the unit deviances y → d_p(y, μ) for the fixed mean parameter μ = 2 and power variance parameters p ∈ {0, 2, 2.5, 3, 3.5}; the case p = 0 corresponds to the symmetric Gaussian case d_0(y, μ) = (y − μ)^2. We observe that with an increasing power variance parameter p large claims Y = y receive a smaller loss punishment (if we interpret the unit deviance as a loss function). This is the situation where we have a fixed mean μ and where we assess claim sizes

Y = y relative to this mean. For estimation purposes we have fixed observations Y = y and we study the sensitivities in μ. Note that, in general, the unit deviances d_p(y, μ) are not symmetric in y and μ. This second case is shown in Fig. 11.1 (rhs), and the general behavior in p is similar. As a result, by selecting different hyper-parameters p > 1, we can control the influence of large (and small) claims on parameter estimation, because the unit deviances d_p(y, ·) have different slopes for different p's. Basically, the choice of the loss function (unit deviance) determines the choice of the underlying distributional model, which then assesses the claim observations Y = y according to their sizes and how these sizes match the model assumptions made.

In Lemma 2.22 we have seen that the unit deviances d_p(y, μ) ≥ 0 are zero if and only if y = μ. The second derivatives given in Lemma 2.22 allow us to consider a second order Taylor expansion around a minimum μ_0 = y_0

$$d_p(y_0 + \epsilon y,\, \mu_0 + \epsilon \mu) = \frac{\epsilon^2}{\mu_0^p}\,(y - \mu)^2 + o(\epsilon^2) \qquad \text{as } \epsilon \to 0.$$

Thus, locally around the minimum the unit deviances behave symmetrically and like Gaussian squares, but this is only a local approximation around a minimum μ_0 = y_0, as can be seen from Fig. 11.1. I.e., in general, model fitting turns out to be rather different from the Gaussian square loss if we have small and large claim sizes under choices p > 1.

Remarks 11.1
• Since unit deviances are Bregman divergences, we know that every unit deviance gives us a strictly consistent scoring function for the mean functional, see Theorem 4.19. Therefore, the specific choice of the power variance parameter p seems less relevant. However, strict consistency is an asymptotic statement, and choosing a unit deviance that matches the properties of the data has better finite sample properties, i.e., a smaller variance in asymptotic normality; we come back to this in Sect. 11.1.4, below.
• A function (y, μ) → ψ(y, μ) is called b-homogeneous if there exists b ∈ R such that for all (y, μ) and all λ > 0 we have ψ(λy, λμ) = λ^b ψ(y, μ). Unit deviances d_p are b-homogeneous with b = 2 − p. This b-homogeneity has the nice consequence that the decisions taken are independent of the scale, i.e., we have an invariance under changes of currencies. On the other hand, such a scaling influences the estimation of the dispersion parameter, i.e., if we scale the observation and the mean with λ we have unit deviance

$$d_p(\lambda y, \lambda \mu) = \lambda^{2-p}\, d_p(y, \mu). \tag{11.4}$$

This influences the dispersion estimation for the cases different from the gamma case p = 2, see, e.g., saddlepoint approximation (5.60)–(5.62). This also relates to the different parametrizations in Sect. 5.3.8 where we study the inverse Gaussian model p = 3, which has a dispersion ϕ_i = 1/α_i in the reproductive form and ϕ_i = 1/α_i^2 in parametrization (5.51).
• We only consider power variance parameters p > 1 in this section for non-negative claim size modeling. Technically, this analysis could be extended to p ∈ {0, 1}. We do not consider the Gaussian case p = 0 to exclude negative claims, and we do not consider the Poisson case p = 1 because this is used for claim counts modeling.

We recall that unit deviances of the EDF are equal to twice the corresponding KL divergences, which in turn are special cases of Bregman divergences. From Theorem 4.19 we know that Bregman divergences D_ψ are the only strictly consistent loss/scoring functions for mean estimation.
Lemma 11.2 Choose p > 1. The scaled unit deviance d_p(y, μ)/2 is a Bregman divergence D_{ψ_p}(y, μ) on R_+ × R_+ with strictly decreasing and strictly convex function on R_+

$$\psi_p(y) = y\, h_p(y) - \kappa_p(h_p(y)) = \begin{cases} \frac{1}{(2-p)(1-p)}\, y^{2-p} & \text{for } p > 1 \text{ and } p \neq 2, \\ -1 - \log(y) & \text{for } p = 2, \end{cases}$$

for canonical link h_p(y) = (κ_p')^{-1}(y) = y^{1-p}/(1-p).

Proof of Lemma 11.2 The Bregman divergence property follows from (2.29). For p > 1 and y > 0 we have the strictly decreasing property

$$\psi_p'(y) = h_p(y) = y^{1-p}/(1-p) < 0.$$

The second derivative is ψ_p''(y) = h_p'(y) = y^{−p} = 1/V(y) > 0, which provides the strict convexity. □

In the Gaussian case we have ψ_0(y) = y^2/2, and ψ_0'(y) > 0 on R_+ implies that this is a strictly increasing convex function for positive claims y > 0. This is different to Lemma 11.2.

Assume we have independent observations (Y_i, x_i) following the same Tweedie's distribution, and with means given by μ_ϑ(x_i) for some parameter ϑ. The M-estimator of ϑ using this Bregman divergence is given by

$$\widehat{\vartheta} = \arg\max_{\vartheta}\, \ell_Y(\vartheta) = \arg\min_{\vartheta}\, \sum_{i=1}^n \frac{v_i}{\varphi}\, D_{\psi_p}\left(Y_i, \mu_\vartheta(x_i)\right).$$

If we turn this M-estimator into a Z-estimator (supposing we have differentiability), the parameter estimate ϑ̂ is found as a solution of the score equations

$$\begin{aligned} 0 &\stackrel{!}{=} -\nabla_\vartheta \sum_{i=1}^n \frac{v_i}{\varphi}\, D_{\psi_p}\left(Y_i, \mu_\vartheta(x_i)\right) \\ &= \sum_{i=1}^n \frac{v_i}{\varphi}\, \psi_p''\left(\mu_\vartheta(x_i)\right) \left(Y_i - \mu_\vartheta(x_i)\right) \nabla_\vartheta\, \mu_\vartheta(x_i) \\ &= \sum_{i=1}^n \frac{v_i}{\varphi}\, \frac{Y_i - \mu_\vartheta(x_i)}{V(\mu_\vartheta(x_i))}\, \nabla_\vartheta\, \mu_\vartheta(x_i) \qquad\qquad (11.5) \\ &= \sum_{i=1}^n \frac{v_i}{\varphi}\, \frac{Y_i - \mu_\vartheta(x_i)}{\mu_\vartheta(x_i)^p}\, \nabla_\vartheta\, \mu_\vartheta(x_i). \end{aligned}$$

In the GLM case this exactly corresponds to (5.9). To determine the Z-estimator from (11.5), we scale the residuals Y_i − μ_i inversely proportional to the variances V(μ_i) = μ_i^p of the chosen Tweedie's distribution. It is a well-known result that if we scale individual unbiased estimators inversely proportional to their variances, we receive the unbiased estimator with minimal variance; we come back to this in (11.16), below. This gives us the intuition behind a specific choice of the power variance parameter for mean estimation, as the sizes of the variances μ_i^p scale (weight) the observed residuals Y_i − μ_i, and balance potential outliers in the observations correspondingly.
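For completeness, we recall this classical fact (a standard result, not taken from the text): for independent unbiased estimators μ̂_1, ..., μ̂_n of μ with variances σ_1^2, ..., σ_n^2, the convex combination with minimal variance is obtained by inverse-variance weighting,

$$\min_{w:\, \sum_i w_i = 1}\, \mathrm{Var}\Big(\sum_{i=1}^n w_i\, \widehat{\mu}_i\Big) = \min_{w:\, \sum_i w_i = 1}\, \sum_{i=1}^n w_i^2\, \sigma_i^2 \qquad \text{is attained at } w_i \propto 1/\sigma_i^2,$$

as a Lagrange multiplier argument shows; the resulting estimator remains unbiased.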

11.1.2 Lab: Claim Size Modeling Under Model Uncertainty

We present a proposal for deep learning under model uncertainty in this section. We explain this with an explicit example within Tweedie's distributions. We emphasize that this methodology can be applied in more generality, but it is beneficial here to have an explicit example in mind to illustrate the different phenomena.

Generalized Linear Models

We analyze a Swiss accident insurance claims data set. This data is illustrated in Sect. 13.4, and an excerpt of the data is given in Listing 13.7. In total we have 339'500 claims with positive payments. We choose this data set because it ranges from very small claims of 1 CHF to very large claims, the biggest one exceeding 1'300'000 CHF. These claims are supported by feature information such as the labor sector, the injury type or the injured body part, see Listing 13.7 and Fig. 13.25. For our analysis, we partition the data into a learning data set L and a test data set T. We do this partition stratified w.r.t. the claim sizes and in a ratio of 9 : 1. This results in a learning data set L of size n = 305'550 and in a test data set T of size 33'950.

We consider three Tweedie's distributions with power variance parameters p ∈ {2, 2.5, 3}; the first one is the gamma model, the last one the inverse Gaussian model, and the power variance parameter p = 2.5 gives a model in between. In a first step we consider GLMs; this requires feature engineering. We have three categorical features, one binary feature and two continuous ones. For the categorical and binary features we use dummy coding, and the continuous features Age and AccQuart are included in their raw form. As link function g we choose the log-link, which respects the positivity of the dual mean parameter space M, see Table 2.1, but this is not the canonical link of the selected models. In the gamma GLM this leads to a convex minimization problem, but in Tweedie's GLM with p = 2.5

and in the inverse Gaussian GLM we have non-convex minimization problems, see Example 5.6. Therefore, we initialize Fisher's scoring method (5.12) in the latter two GLMs with the solution of the gamma GLM. The gamma and the inverse Gaussian cases can directly be fitted with the R command glm [307]; for the power variance parameter case p = 2.5 we have coded our own MLE routine using Fisher's scoring method.

Table 11.1 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss (in 10^−2) and inverse Gaussian (IG) loss (in 10^−3)) and AIC values; the losses use unit dispersion ϕ = 1, AIC relies on the MLE of ϕ

              In-sample loss on L            Out-of-sample loss on T        AIC
              d_{p=2}  d_{p=2.5}  d_{p=3}    d_{p=2}  d_{p=2.5}  d_{p=3}    value
Null model    3.0094   10.2208    4.6979     3.0240   10.2420    4.6931     4'707'115 (IG)
Gamma GLM     2.0695    7.7127    3.9582     2.1043    7.7852    3.9763     4'741'472
p = 2.5 GLM   2.0744    7.6971    3.9433     2.1079    7.7635    3.9580     4'648'698
IG GLM        2.0865    7.7069    3.9398     2.1191    7.7730    3.9541     4'653'501
Table 11.1 shows the in-sample losses on the learning data L and the corresponding out-of-sample losses on the test data T. The fitted GLMs (gamma, power variance parameter p = 2.5 and inverse Gaussian) are always evaluated on all three unit deviances d_{p=2}(y, μ), d_{p=2.5}(y, μ) and d_{p=3}(y, μ), respectively. We give some remarks. First, we observe that the in-sample loss is always minimized for the GLM with the same power variance parameter p as the loss d_p studied (2.0695, 7.6971 and 3.9398). This result simply states that the parameter estimates are obtained by minimizing the in-sample loss (or maximizing the corresponding in-sample log-likelihood). Second, considering the minimal out-of-sample losses, we cannot give any preference to a single model w.r.t. Tweedie's forecast dominance, see Definition 4.20. Third, we calculate the AIC values for all models. The gamma and the inverse Gaussian cases have a closed-form solution for the normalizing term a(y; v/ϕ) in the EDF density, and we can directly calculate AIC. The case p = 2.5 is more difficult and we use the saddlepoint approximation of Sect. 5.5.2. Considering AIC we give preference to Tweedie's GLM with p = 2.5. Note that the AIC values use the MLE for ϕ which is obtained from a general purpose optimizer, and which uses the saddlepoint approximation in the power variance case p = 2.5. Fourth, under a constant dispersion parameter ϕ, the mean estimates μ̂_i can be obtained without explicitly specifying ϕ because it cancels in the score equations. In fact, we perform this mean estimation in the additive form and not in the reproductive form, see (2.13) and the discussions in Sects. 5.3.7–5.3.8.
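For reference, the two GLMs that can be fitted directly with glm may be sketched as follows; this is a minimal sketch where the data frame learn, the response Claim and the feature names are illustrative placeholders, not the exact column names of the data set.

# gamma GLM with log-link, and inverse Gaussian GLM initialized with
# the gamma solution (both log-links are non-canonical)
glm_gamma <- glm(Claim ~ LaborSector + InjuryType + BodyPart + Age + AccQuart,
                 data = learn, family = Gamma(link = "log"))
glm_ig    <- glm(Claim ~ LaborSector + InjuryType + BodyPart + Age + AccQuart,
                 data = learn, family = inverse.gaussian(link = "log"),
                 start = coef(glm_gamma))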
Figure 11.2 plots the deviance residuals (for unit dispersion) against the logged fitted means μ̂(x_i) for p ∈ {2, 2.5, 3} for 2'000 randomly selected claims; this is the Tukey–Anscombe plot. The green line has been obtained by a spline fit to the deviance residuals as a function of the fitted means μ̂(x_i), and the cyan
lines give twice the estimated standard deviation of the deviance residuals as a function of the fitted means (also obtained from spline fits). This estimated standard deviation corresponds to the square-rooted deviance dispersion estimate ϕ̂^D, see (5.30), however, in the additive form because we work with unscaled claim size observations. A constant dispersion assumption is supported by cyan lines of roughly constant size. In the gamma case the dispersion seems increasing in the mean estimate, and in the inverse Gaussian case it is decreasing; thus, the power variance parameters p = 2 and p = 3 do not support a constant dispersion in this example. Only the choice p = 2.5 may support a constant dispersion assumption (because it does not have an obvious trend). This says that the variance should scale as V(μ) = μ^{2.5} as a function of the mean μ, see also (11.5).

Fig. 11.2 Tukey–Anscombe plots showing the deviance residuals against the logged GLM fitted means μ̂(x_i): (lhs) gamma GLM p = 2, (middle) power variance case p = 2.5, (rhs) inverse Gaussian GLM p = 3; the cyan lines show twice the estimated standard deviation of the deviance residuals as a function of the size of the logged estimated means μ̂

Deep FN Networks

We compare the above GLMs to FN networks of depth d = 3 with (q_1, q_2, q_3) = (20, 15, 10) neurons. The categorical features are modeled with embedding layers of dimension b = 2. We fit this network architecture with Tweedie's deviance losses having power variance parameters p ∈ {2, 2.5, 3}. Moreover, we use 20% of the learning data L as validation data V to explore the early stopping rule.1 To reduce the randomness coming from early stopping with different seeds, we average the deviance losses over 20 runs (this is not the nagging predictor: we only average the deviance losses to have stable conclusions concerning forecast dominance). The results are presented in Table 11.2.

1 In the standard implementation of SGD with early stopping, the learning and validation data partition is done non-stratified. If necessary, this can be changed manually.

Table 11.2 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss (in 10^−2) and inverse Gaussian (IG) loss (in 10^−3)) and average claim amounts; the losses use unit dispersion ϕ = 1 and the network losses are averaged deviance losses over 20 runs with different seeds

                  In-sample loss on L            Out-of-sample loss on T        Average
                  d_{p=2}  d_{p=2.5}  d_{p=3}    d_{p=2}  d_{p=2.5}  d_{p=3}    claim
Null model        3.0094   10.2208    4.6979     3.0240   10.2420    4.6931     1'774
Gamma GLM         2.0695    7.7127    3.9582     2.1043    7.7852    3.9763     1'701
p = 2.5 GLM       2.0744    7.6971    3.9433     2.1079    7.7635    3.9580     1'652
IG GLM            2.0865    7.7069    3.9398     2.1191    7.7730    3.9541     1'614
Gamma network     1.9738    7.4556    3.8693     2.0543    7.6478    3.9211     1'748
p = 2.5 network   1.9712    7.4128    3.8458     2.0654    7.6551    3.9178     1'739
IG network        1.9977    7.4568    3.8525     2.0762    7.6682    3.9188     1'712

First, we observe that the networks outperform the GLMs, saying that the feature engineering has not been done optimally for the GLMs. Second, in-sample we no longer receive the lowest deviance loss in the model with the same p. This comes from the fact that we exercise early stopping, and, for instance, the gamma in-sample loss 1.9738 of the gamma network (p = 2) is bigger than the corresponding gamma loss of 1.9712 from the network with p = 2.5. Third, considering forecast dominance, preference is given either to the gamma network or to the network with power variance parameter p = 2.5. In general, it seems that fitting with higher power variance parameters leads to less stable results, but this statement needs more analysis. The disadvantage of this fitting approach is that we independently fit the models with the different power variance parameters to the observations, and, thus, the learned representations z^{(d:1)}(x_i) are rather different for different p's. This makes it difficult to compare these models. This is exactly the point that we address next.

Robustified Representation Learning

To deal with the drawback of missing comparability of the network approaches with different power variance parameters, we can try to learn a representation that simultaneously fits different models. The implementation of this idea is rather straightforward in network modeling. We choose the above network of depth d = 3, which gives us the new (learned) representation z_i = z^{(d:1)}(x_i) in the last FN layer. The general idea now is that we design multiple outputs for this learned representation to fit the different distributional models. That is, in the case of three Tweedie's loss functions with power variance parameters p ∈ {2, 2.5, 3} we consider a three-dimensional output mapping

$$x \;\mapsto\; \left(\mu_{p=2}(x),\, \mu_{p=2.5}(x),\, \mu_{p=3}(x)\right) = \left(g^{-1}\langle \beta_2, z^{(d:1)}(x)\rangle,\; g^{-1}\langle \beta_{2.5}, z^{(d:1)}(x)\rangle,\; g^{-1}\langle \beta_3, z^{(d:1)}(x)\rangle\right) \in \mathbb{R}^3, \tag{11.6}$$

for different output parameters β_2, β_{2.5}, β_3 ∈ R^{q_d+1}. These three expected responses (11.6) share the network parameters w = (w_1^{(1)}, ..., w_{q_d}^{(d)}) in the FN layers, and the network fitting should learn these parameters such that z_i = z^{(d:1)}(x_i) gives a good representation for all considered loss functions. Choose positive weights η_p > 0, and define the combined deviance loss function

$$D\left(Y, (w, \beta_2, \beta_{2.5}, \beta_3)\right) = \sum_{p \in \{2, 2.5, 3\}} \frac{\eta_p}{\varphi_p} \sum_{i=1}^n v_i\, d_p\left(Y_i, \mu_p(x_i)\right), \tag{11.7}$$

for the given observations (Y_i, x_i, v_i), 1 ≤ i ≤ n. Note that the unit deviances d_p live on different scales for different p's. We use the (constant) weights η_p > 0 to balance these scales so that all power variance parameters p roughly equally contribute to the total loss, while setting ϕ_p ≡ 1 (which can be done for a constant dispersion). This approach is now fitted to the available learning data L. The corresponding R code is given in Listing 11.1. Note that the fitting also requires that we triplicate the observations (Y_i, Y_i, Y_i) so that we can simultaneously evaluate the three chosen power variance deviance losses, see lines 18–21 of Listing 11.1. We fit this model to the Swiss accident insurance data, and the results are presented in Table 11.3 on the lines called 'multi-output'.

Listing 11.1 FN network with multiple output

1 Design = layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
2 #
3 Network = Design %>%
4 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
5 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
6 layer_dense(units=10, activation='tanh', name='FNLayer3')
7 #
8 Output1 = Network %>%
9 layer_dense(units=1, activation='exponential', name='Output1')
10 #
11 Output2 = Network %>%
12 layer_dense(units=1, activation='exponential', name='Output2')
13 #
14 Output3 = Network %>%
15 layer_dense(units=1, activation='exponential', name='Output3')
16
17 #
18 model = keras_model(inputs = c(Design), outputs = c(Output1, Output2, Output3))
19 #
20 model %>% compile(loss = list(loss1, loss2, loss3),
21 loss_weights=list(eta1, eta2, eta3), optimizer = 'nadam')
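Listing 11.1 presupposes custom losses loss1, loss2, loss3 and weights eta1, eta2, eta3. The following is a minimal sketch of how these could be defined via the unit deviances (11.2)–(11.3), assuming library(keras) is loaded and unit case weights v_i ≡ 1; the function name tweedie_loss and the numerical values of the weights are our own illustrative choices (the η_p should be tuned such that the three deviance losses contribute on comparable scales, cf. Table 11.2).

tweedie_loss <- function(p) {
  # Tweedie unit deviance (11.2)-(11.3) as a custom Keras loss, v_i = 1
  if (p == 2) {
    function(y_true, y_pred) 2 * k_mean(y_true / y_pred - 1 + k_log(y_pred / y_true))
  } else {
    function(y_true, y_pred) 2 * k_mean(
      y_true * (y_true^(1 - p) - y_pred^(1 - p)) / (1 - p)
      - (y_true^(2 - p) - y_pred^(2 - p)) / (2 - p))
  }
}
loss1 <- tweedie_loss(p = 2)        # gamma deviance loss
loss2 <- tweedie_loss(p = 2.5)      # Tweedie p = 2.5 deviance loss
loss3 <- tweedie_loss(p = 3)        # inverse Gaussian deviance loss
eta1 <- 1; eta2 <- 30; eta3 <- 500  # illustrative scale-balancing weights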

This simultaneous representation learning across different loss functions leads to more stability in the results between the different loss function choices, i.e., there is less variability between the losses of the different outputs compared to fitting the three different models independently. The predictive performance seems slightly better in this robustified vs. the independent case (see the out-of-sample figures in Table 11.3). The similarity of the results across the different loss functions (using the jointly learned representation z_i) allows us to directly compare the corresponding predictors μ̂_p(x_i) for the different p's.

Table 11.3 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss (in 10^−2) and inverse Gaussian (IG) loss (in 10^−3)) and average claim amounts; the losses use unit dispersion ϕ = 1 and the network losses are averaged deviance losses over 20 runs with different seeds

                              In-sample loss on L            Out-of-sample loss on T        Average
                              d_{p=2}  d_{p=2.5}  d_{p=3}    d_{p=2}  d_{p=2.5}  d_{p=3}    claim
Null model                    3.0094   10.2208    4.6979     3.0240   10.2420    4.6931     1'774
Gamma network                 1.9738    7.4556    3.8693     2.0543    7.6478    3.9211     1'748
p = 2.5 network               1.9712    7.4128    3.8458     2.0654    7.6551    3.9178     1'739
IG network                    1.9977    7.4568    3.8525     2.0762    7.6682    3.9188     1'712
Gamma multi-output (11.6)     1.9731    7.4275    3.8519     2.0581    7.6422    3.9146     1'745
p = 2.5 multi-output (11.6)   1.9736    7.4281    3.8522     2.0576    7.6407    3.9139     1'732
IG multi-output (11.6)        1.9745    7.4295    3.8525     2.0576    7.6401    3.9134     1'705
Multi-loss fitting (11.8)     1.9677    7.4118    3.8468     2.0580    7.6417    3.9144     1'744

Fig. 11.3 Ratios μ̂_{p=2}(x_i)/μ̂_{p=2.5}(x_i) (black color) and μ̂_{p=3}(x_i)/μ̂_{p=2.5}(x_i) (blue color) of the three predictors: (lhs) in-sample figures ordered on the x-axis w.r.t. the logged observed claims Y_i, darkgray and cyan lines give spline fits, (rhs) out-of-sample figures ordered on the x-axis w.r.t. the logged average size of the three predictors

Figure 11.3 compares the three predictors by considering the ratios μ̂_{p=2}(x_i)/μ̂_{p=2.5}(x_i) in black color and μ̂_{p=3}(x_i)/μ̂_{p=2.5}(x_i) in blue color, i.e., we divide by the (middle) predictor with power variance parameter p = 2.5. The figure on the left-hand side shows these ratios in-sample and ordered on the x-axis w.r.t. the observed claim sizes Y_i, and the darkgray and cyan lines give spline fits to these ratios. The figure on the right-hand side shows these ratios out-of-sample and ordered on the x-axis w.r.t. the average predictors μ̄_i = (μ̂_{p=2}(x_i) + μ̂_{p=2.5}(x_i) + μ̂_{p=3}(x_i))/3. In view of (11.5) we expect that the
models with a smaller power variance parameter p over-fit more to large claims.
From Fig. 11.3 (lhs) we can observe that, indeed, this is the case (see gray and cyan
spline fits which bifurcate for large claims). That is, models with a smaller power
variance parameter react more sensitively to large observations Yi . The ratios in
Fig. 11.3 provide differences of up to 7% for large claims.
Remark 11.3 The loss function (11.7) can also be interpreted as regularization. For instance, if we choose η_2 = 1, and if we assume that this is our preferred model, then we can regularize this model with further models, and their weights η_p > 0 determine the degree of regularization. Thus, in contrast to the ridge and LASSO regularization of Sect. 6.2, regularization does not directly act on the model parameters here, but rather on what we learn in terms of the representation z_i = z^{(d:1)}(x_i).

Using Forecast Dominance to Deal with Model Uncertainty

In GLMs, the power variance parameter p typically acts as a hyper-parameter, i.e., one fits different GLMs for different choices of p. Model selection is then done, e.g., by analyzing the Tukey–Anscombe plot, AIC, cross-validation or by studying out-of-sample forecast dominance. In networks we should not use AIC as we neither have a parsimonious network parameter nor do we use the MLE. Here, we focus on forecast dominance for the network predictors (based on the different chosen power variance parameters). If we are mainly interested in receiving a model that provides optimal forecast dominance, we should not consider three different outputs as in (11.7), but rather fit the same output to different loss functions; the required changes are minimal, see Listing 11.2. Namely, consider one FN network with one output μ(x_i), but evaluate this output simultaneously on the different chosen loss functions

$$D(Y, \vartheta) = \sum_{p \in \{2, 2.5, 3\}} \frac{\eta_p}{\varphi_p} \sum_{i=1}^n v_i\, d_p\left(Y_i, \mu(x_i)\right). \tag{11.8}$$

In contrast to (11.7), we only have one FN network regression function x_i → μ(x_i) here.
We present the results on the last line of Table 11.3, called ‘multi-loss’. In our
case, this approach is slightly less competitive (out-of-sample), however, it is less
sensitive to outliers since we need to have a good regression function simultaneously
for multiple loss functions. Of course, this multiple loss fitting approach is not
restricted to different power variance parameters. As stated in Theorem 4.19,
Bregman divergences are the only consistent loss functions for mean estimation,
and the unit deviances are examples of Bregman divergences. Forecast dominance
now suggests that we may choose any Bregman divergence as a loss function in
Listing 11.2 as long as it reflects the expected properties of the model (and of

Listing 11.2 FN network with a single output for multiple losses

1 Design = layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
2 #
3 Network = Design %>%
4 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
5 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
6 layer_dense(units=10, activation='tanh', name='FNLayer3')
7 #
8 Output = Network %>%
9 layer_dense(units=1, activation='exponential', name='Output')
10 #
11 model = keras_model(inputs = c(Design), outputs = c(Output, Output, Output))
12 #
13 model %>% compile(loss = list(loss1, loss2, loss3),
14 loss_weights=list(eta1, eta2, eta3), optimizer = 'nadam')

the observed data), otherwise we will receive bad convergence properties, see also
Sect. 11.1.4, below. For instance, we can robustify the Poisson claim counts model
by additionally considering the deviance loss of the negative binomial model that
also assesses over-dispersion.

Nagging Predictor

The loss figures in Table 11.3 are averaged deviance losses over 20 different runs of the gradient descent algorithm with different seeds (to receive stable results). Rather than averaging over the losses, we should improve the models by averaging over the predictors and then calculate the losses of these averaged predictors; this is exactly the proposal of the nagging predictor (7.44). We calculate the nagging predictor of the models that are simultaneously fit to the different loss functions (lines 'multi-output' and 'multi-loss' of Table 11.3). The resulting nagging predictors are reported in Table 11.4. This table shows that we give a clear preference to the nagging predictors. The simultaneous loss fitting (11.8) gives the best out-of-sample results for the nagging predictor, see the last line of Table 11.4.
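A minimal sketch of this computation (fits denotes the list of the 20 fitted networks and X_test the test design matrix; both are hypothetical placeholders):

# nagging predictor: average the predictions over the 20 SGD runs
nagging <- rowMeans(sapply(fits, function(m) as.vector(predict(m, X_test))))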
Figure 11.4 shows the Tukey–Anscombe plots of the multi-loss nagging predictor for the different deviance losses (for unit dispersion). Again, the case p = 2.5 is closest to having a constant dispersion, and the other cases will require a dispersion modeling ϕ(x).

Figure 11.5 shows the empirical auto-calibration property of the multi-loss nagging predictor. This auto-calibration property is calculated as in Listing 7.8. We observe that the auto-calibration property holds rather accurately. Only for claim predictors μ̂(x_i) above 10'000 CHF (vertical dotted line in Fig. 11.5) do the fitted means under-estimate the observed average claim sizes. This affects (only) 1.7% of all claims and it could be corrected as described in Example 7.19.

Fig. 11.5 Empirical auto-calibration of the claim size predictor; the blue curve shows the empirical density of the multi-loss nagging predictor μ̂(x_i)

Table 11.4 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss (in 10^−2) and inverse Gaussian (IG) loss (in 10^−3)) and average claim amounts; the losses use unit dispersion ϕ = 1

                              In-sample loss on L            Out-of-sample loss on T        Average
                              d_{p=2}  d_{p=2.5}  d_{p=3}    d_{p=2}  d_{p=2.5}  d_{p=3}    claim
Null model                    3.0094   10.2208    4.6979     3.0240   10.2420    4.6931     1'774
Gamma multi-output (11.6)     1.9731    7.4275    3.8519     2.0581    7.6422    3.9146     1'745
p = 2.5 multi-output (11.6)   1.9736    7.4281    3.8522     2.0576    7.6407    3.9139     1'732
IG multi-output (11.6)        1.9745    7.4295    3.8525     2.0576    7.6401    3.9134     1'705
Multi-loss fitting (11.8)     1.9677    7.4118    3.8468     2.0580    7.6417    3.9144     1'744
Gamma multi-out & nagging     1.9486    7.3616    3.8202     2.0275    7.5575    3.8864     1'745
p = 2.5 multi-out & nagging   1.9496    7.3640    3.8311     2.0276    7.5578    3.8864     1'732
IG multi-out & nagging        1.9510    7.3666    3.8320     2.0281    7.5583    3.8865     1'705
Multi-loss with nagging       1.9407    7.3403    3.8236     2.0244    7.5490    3.8837     1'744

Fig. 11.4 Tukey–Anscombe plots giving the deviance residuals of the multi-loss nagging predictor of Table 11.4 for different power variance parameters: (lhs) gamma deviances p = 2, (middle) power variance deviances p = 2.5, (rhs) inverse Gaussian deviances p = 3; the cyan lines show twice the estimated standard deviation of the deviance residuals as a function of the size of the logged estimated means μ̂

11.1.3 Lab: Deep Dispersion Modeling

From the Tukey–Anscombe plots in Fig. 11.4 we conclude that the dispersion requires regression modeling, too, as the dispersion does not seem to be constant over the whole range of the expected claim sizes. We therefore explore a double FN network model; in spirit this is similar to the double GLM of Sect. 5.5. We assume to work within Tweedie's family with power variance parameters p ≥ 2, and with unit deviances given by (11.2)–(11.3). The saddlepoint approximation (5.59) gives us

$$f(y; \theta, v/\varphi) \approx \left(\frac{2\pi \varphi}{v}\, V(y)\right)^{-1/2} \exp\left\{-\frac{1}{2\varphi/v}\, d_p(y, \mu)\right\},$$
with power variance function V(y) = y^p. This saddlepoint approximation is formulated in the reproductive form for Y = X/ω = Xϕ/v. This requires scaling of the observations X with the unknown ϕ to receive Y. In Sect. 5.5.4 we have shown how this problem can be solved. In this section we give a different proposal which is more robust in network fitting, and which benefits from the b-homogeneity of d_p, see (11.4).

We consider the variable transformation y → x = yω = yv/ϕ. In the absolutely continuous case p ≥ 2 this gives us the approximation

$$\begin{aligned} f(x; \theta, v/\varphi) &\approx \frac{\varphi}{v}\left(\frac{2\pi \varphi^{1+p}}{v^{1+p}}\, V(x)\right)^{-1/2} \exp\left\{-\frac{1}{2\varphi/v}\, d_p\!\left(\frac{x\varphi}{v},\, \frac{\mu_p \varphi}{v}\right)\right\} \\ &= \left(\frac{2\pi \varphi^{p-1}}{v^{p-1}}\, V(x)\right)^{-1/2} \exp\left\{-\frac{1}{2\varphi^{p-1}/v^{p-1}}\, d_p\!\left(x, \mu_p\right)\right\}, \end{aligned}$$

with mean μ_p = μv/ϕ of X = Yv/ϕ; the second step uses the b-homogeneity (11.4) of d_p. We set φ = −1/ϕ^{p−1} < 0. This gives us the approximation

$$\ell_X(\mu_p, \phi) \approx \frac{v^{p-1}\, d_p(X, \mu_p)\, \phi - (-\log(-\phi))}{2} - \frac{1}{2} \log\left(\frac{2\pi}{v^{p-1}}\, V(X)\right). \tag{11.9}$$

For given mean μ_p we again have a gamma approximation on the right-hand side, but we scale the dispersion differently. This gives us the approximate first moment

$$\mathbb{E}_\phi\left[\left. v^{p-1}\, d_p(X, \mu_p)\, \right| \mu_p\right] \approx \kappa_2'(\phi) = -1/\phi = \varphi^{p-1} \stackrel{\text{def.}}{=} \varphi_p.$$

The remainder of this modeling is similar to the residual MLE approach in Sect. 5.5.3. Namely, we set up two FN network regression functions

$$x \;\mapsto\; \mu_p(x) \qquad \text{and} \qquad x \;\mapsto\; \varphi_p(x) = \kappa_2'(\phi(x)) = -1/\phi(x).$$


Parameter fitting is achieved by alternating the network parameter fitting of μ̂_p(x) and ϕ̂_p(x), see also Sect. 5.5.4. We start the iteration by setting the dispersion constant, ϕ̂_p^{(0)}(x) ≡ const. In this case, the dispersion cancels in the score equations and the mean μ̂_p^{(1)}(x) can be estimated without explicit knowledge of the (constant) dispersion parameter ϕ̂_p^{(0)}; this exactly provides the results of the previous Sect. 11.1.2. Then, we iterate this procedure for t ≥ 1. For a given mean estimate μ̂_p^{(t)}(x) we receive deviances v^{p−1} d_p(X, μ̂_p^{(t)}(x)), and this allows us to estimate ϕ̂_p^{(t)}(x) from the approximate gamma model (11.9); and for given dispersion parameters ϕ̂_p^{(t)}(x) we estimate μ̂_p^{(t+1)}(x) from the corresponding Tweedie's model for the observation X.
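In pseudo-R, the alternating scheme reads as follows; fit_mean() and fit_dispersion() are hypothetical placeholders for the two FN network fits of this section (they are not functions defined in the text), and unit_deviance() is the helper sketched in Sect. 11.1.1.

# alternating mean/dispersion estimation for one fixed p (sketch, v_i = 1)
mu_hat <- fit_mean(X, features, phi = rep(1, length(X)))   # constant start
for (t in 1:2) {
  dev     <- unit_deviance(X, mu_hat, p)                   # working responses
  phi_hat <- fit_dispersion(dev, features)                 # gamma model (11.9)
  mu_hat  <- fit_mean(X, features, phi = phi_hat)          # Tweedie model (11.12)
}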
Example 11.4 We revisit the Swiss accident insurance data example of Sect. 11.1.2, and we use the robustified representation learning approach (11.7) that simultaneously fits Tweedie's models for the power variance parameters p = 2, 2.5, 3. The initial calibration step is done for constant dispersions ϕ̂_p^{(0)}(x) ≡ const, and it provides us with the estimated means μ̂_p^{(1)}(x) as illustrated in Fig. 11.3. For stability reasons we choose the nagging predictor averaging over 20 SGD runs with 20 different seeds. These estimated means μ̂_p^{(1)}(x) give us the deviances v^{p−1} d_p(X, μ̂_p^{(1)}(x)).

Using these deviances allows us to alternate the dispersion and mean estimation for t ≥ 1. For given means μ̂_p^{(t)}(x), p = 2, 2.5, 3, we set up a deep FN network x → z^{(d:1)}(x) that allows for a robustified deep dispersion learning ϕ_p(x), for p = 2, 2.5, 3. Under the log-link choice we consider the regression function with multiple outputs

$$x \;\mapsto\; \left(\varphi_{p=2}(x),\, \varphi_{p=2.5}(x),\, \varphi_{p=3}(x)\right) = \left(\exp\langle \alpha_2, z^{(d:1)}(x)\rangle,\; \exp\langle \alpha_{2.5}, z^{(d:1)}(x)\rangle,\; \exp\langle \alpha_3, z^{(d:1)}(x)\rangle\right) \in \mathbb{R}_+^3, \tag{11.10}$$

for different output parameters α_2, α_{2.5}, α_3 ∈ R^{q_d+1}. These three dispersion responses (11.10) share the common network parameter w̃ = (w̃_1^{(1)}, ..., w̃_{q_d}^{(d)}) in the FN layers of z^{(d:1)}. The network fitting learns these parameters simultaneously for the different power variance parameters. Choose positive weights η̃_p > 0, and define the combined deviance loss function (based on the gamma model κ_2 and having dispersion parameter 2)

$$D\left(d(X, \widehat{\mu}^{(t)}), (\widetilde{w}, \alpha_2, \alpha_{2.5}, \alpha_3)\right) = \sum_{p \in \{2, 2.5, 3\}} \frac{\widetilde{\eta}_p}{2} \sum_{i=1}^n d_2\left(v_i^{p-1}\, d_p(X_i, \widehat{\mu}_p^{(t)}(x_i)),\; \varphi_p(x_i)\right), \tag{11.11}$$

where X = (X_1, ..., X_n) collects the unscaled observations X_i = Y_i v_i/ϕ_i. Thus, for all power variance parameters p = 2, 2.5, 3 we fit a gamma model d_2(·,·)/2 to the observed deviances (observations) v_i^{p−1} d_p(X_i, μ̂_p^{(t)}(x_i)), providing us with the estimated dispersions ϕ̂_p^{(t)}(x_i). This fitting step is received by the R code of Listing 11.1, where the losses on line 20 are all given by gamma deviance losses (11.11) and the deviances v_i^{p−1} d_p(X_i, μ̂_p^{(t)}(x_i)) play the role of the responses (observations).

In the next step we update the mean estimates μ̂_p^{(t+1)}(x_i), given the estimated dispersions ϕ̂_p^{(t)}(x_i) from the previous step. This requires that we optimize the expected responses (11.6) for given heterogeneous dispersion parameters. We therefore consider the loss function for positive weights η_p > 0, see (11.7),

$$D\left(X, \widehat{\varphi}^{(t)}, (w, \beta_2, \beta_{2.5}, \beta_3)\right) = \sum_{p \in \{2, 2.5, 3\}} \eta_p \sum_{i=1}^n \frac{v_i^{p-1}}{\widehat{\varphi}_p^{(t)}(x_i)}\, d_p\left(X_i, \mu_p(x_i)\right). \tag{11.12}$$

We fit this model by iterating this approach for t ≥ 1: we start from the predictors of Sect. 11.1.2, providing us with the first mean estimates μ̂_p^{(1)}(x_i). Based on these mean estimates we iterate this robustified estimation of ϕ̂_p^{(t)}(x_i) and μ̂_p^{(t)}(x_i). We give some remarks:
1. We use the robustified versions (11.11) and (11.12), respectively, where we simultaneously fit all power variance parameters p = 2, 2.5, 3 on the commonly learned representations z_i = z^{(d:1)}(x_i) in the last FN layer of the mean and the dispersion network, respectively.
2. For both FN networks of mean μ and dispersion ϕ modeling we use the same network architecture of depth d = 3 having (q_1, q_2, q_3) = (20, 15, 10) neurons in the FN layers, the hyperbolic tangent activation function, and the log-link for the output. These two networks only differ in their network parameters (w, β_2, β_{2.5}, β_3) and (w̃, α_2, α_{2.5}, α_3), respectively.
3. For fitting we use the nadam version of SGD. For the early stopping we use a training data U to validation data V split of 8 : 2.
4. To ensure consistency within the individual SGD runs across t ≥ 1, we use the learned network parameter of loop t as initial value for loop t + 1. This ensures monotonicity across the iterations in the log-likelihood and the loss function, respectively, up to the fact that the random mini-batches in SGD may distort this monotonicity.
5. To reduce the elements of randomness in SGD fitting we run this iteration procedure 20 times with different seeds, and we report the nagging predictors for μ̂_p^{(t)}(x_i) and ϕ̂_p^{(t)}(x_i), averaged over the 20 runs, for every t in Table 11.5.
We iterate this algorithm over two loops, and the results are presented in Table 11.5. We observe a decrease of −2ℓ_X(μ̂_p^{(t)}, ϕ̂_p^{(t)}) by iterating the fitting algorithm for t ≥ 1. For AIC, we would have to correct twice the negative log-likelihood by twice

the number of MLE estimated parameters. We also adjust correspondingly here, though the correction is not justified by any theory, because we neither work with the MLE nor do we have a parsimonious model for mean and dispersion estimation. Nevertheless, we receive smaller values than in Table 11.1, which supports the use of this more complex double FN network model.

Table 11.5 Iteration of mean μ̂_p^{(t)} and dispersion ϕ̂_p^{(t)} estimation for the gamma model p = 2, the power variance parameter p = 2.5 model and the inverse Gaussian model p = 3: the numbers correspond to −2ℓ_X(μ̂_p^{(t)}, ϕ̂_p^{(t)}); the last line corrects −2ℓ_X(μ̂_p^{(t)}, ϕ̂_p^{(t)}) by 2·2·812 = 3'248 (twice the number of parameters used in the mean and dispersion FN networks)

Iteration t           −2· log-likelihood
                      Gamma p = 2   Power variance p = 2.5   Inverse Gaussian p = 3
(μ̂^{(1)}, ϕ̂^{(0)})    4'722'961     4'635'038                4'644'869
(μ̂^{(1)}, ϕ̂^{(1)})    4'702'247     4'622'097                4'617'593
(μ̂^{(2)}, ϕ̂^{(1)})    4'701'234     4'621'123                4'616'869
(μ̂^{(2)}, ϕ̂^{(2)})    4'700'686     4'620'845                4'616'588
"AIC"                 4'703'978     4'624'137                4'619'880

Comparing the three power variance parameter models, we now give preference to the inverse Gaussian model, as it has the biggest log-likelihood. Note that we can directly compare all power variance models as the complexity is equal in all models (they only differ in the chosen power variance parameter) and the joint robustified fitting applies the same stopping rule to all power variance parameter models. The same result is obtained by comparing the out-of-sample log-likelihoods. Note that we do not compare the deviance losses here, because the unit deviances are not designed to estimate parameters in vector-valued parameter families; we model the dispersion as a second parameter.
Next, we study the estimated dispersions ϕ̂_p(x_i) as a function of the estimated means μ̂_p(x_i). We fit a spline to ϕ̂_p(x_i) as a function of μ̂_p(x_i), and we receive estimates that almost perfectly match the cyan lines in Fig. 11.4. This provides a proof of concept that the dispersion regression model finds the right level of dispersion as a function of the expected means.

Using the mean and dispersion estimates, we can calculate the dispersion scaled deviance residuals

$$r_i^D = \operatorname{sign}(X_i - \widehat{\mu}_p(x_i)) \sqrt{v_i^{p-1}\, d_p\left(X_i, \widehat{\mu}_p(x_i)\right)\Big/ \widehat{\varphi}_p(x_i)}. \tag{11.13}$$
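In R, these residuals are a one-liner (a sketch with v_i ≡ 1, reusing the unit_deviance() helper from above):

r_D <- sign(X - mu_hat) * sqrt(unit_deviance(X, mu_hat, p) / phi_hat)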

This then allows us to give the Tukey–Anscombe plots for the three considered power variance parameters. The corresponding plots are given in Fig. 11.6; the difference to Fig. 11.4 is that the latter considers unit dispersion whereas the former scales the residuals with the rooted dispersion √ϕ̂_p(x_i); note that v_i ≡ 1 in this example. By scaling with the rooted dispersion, the resulting deviance residuals r_i^D should roughly have unit standard deviation. From Fig. 11.6 we observe that indeed this is the case, the cyan
line shows a spline fit of twice the standard deviation of the deviance residuals r_i^D. These splines are of magnitude 2, which verifies the unit standard deviation property. Moreover, the cyan lines are roughly horizontal, which indicates that the dispersion estimation and the scaling work across all expected claim sizes μ̂_p(x_i). The three different power variance parameters p = 2, 2.5, 3 show different behaviors in the lower and upper tails of the residuals (centering around the orange horizontal zero line in Fig. 11.6), which corresponds to the different distributional properties of the chosen models.

Fig. 11.6 Tukey–Anscombe plots giving the dispersion scaled deviance residuals r_i^D (11.13) of the models jointly fitting the mean parameters μ̂_p(x_i) and the dispersion parameters ϕ̂_p(x_i): (lhs) gamma model, (middle) power variance parameter p = 2.5 model, and (rhs) inverse Gaussian model; the cyan lines correspond to 2 standard deviations
We further analyze the gamma and the inverse Gaussian models. Note that the analysis of the power variance models for general power variance parameters p ∉ {0, 1, 2, 3} is more difficult because neither the EDF density nor the EDF distribution function have a closed form. To analyze the gamma and the inverse Gaussian models we simulate observations X_t^sim, t = 1, ..., T, from the estimated models (using the out-of-sample features x_t^† of the test data T), and we compare them against the true out-of-sample observations X_t^†. Figure 11.7 shows the results for the gamma model (lhs) and the inverse Gaussian model (rhs) on the log-scale. A good fit has been achieved if the black dots lie on the red diagonal line (in the colored version), because then the simulated data shares similar features with the observed data. The fit of the inverse Gaussian model seems reasonably good.

Fig. 11.7 (lhs) Gamma model: observations vs. simulations on log-scale, (middle) gamma model: estimated shape parameters α_t^† = 1/ϕ̂_2(x_t^†) < 1, 1 ≤ t ≤ T, and (rhs) inverse Gaussian model: observations vs. simulations on log-scale
On the other hand, we see that the gamma model gives a poor fit, especially in the lower tail. This supports the AIC values of Table 11.5. The problem with the gamma model is that the data is more heavy-tailed than the gamma model can accommodate. As a consequence, the dispersion parameter estimates ϕ̂_2(x_t^†) in the gamma model compensate for this by taking values bigger than 1. A dispersion parameter bigger than 1 implies a shape parameter in the gamma model of α_t^† = 1/ϕ̂_2(x_t^†) < 1, and the resulting gamma density is strictly decreasing, see Fig. 2.1. If we simulate from this model we receive many observations X_t^sim close to zero (from the strictly decreasing density). This can be seen from the lower-left part of the graph in Fig. 11.7 (lhs), suggesting that we should have many observations with X_t^† ∈ (0, 1), or on the log-scale log(X_t^†) < 0. However, the graph shows that this is not the case in the real data. Figure 11.7 (middle) shows the boxplot of the estimated shape parameters α_t^† on the test data, 1 ≤ t ≤ T, verifying that most insurance policies of the test data T receive a shape parameter α_t^† less than 1.

We conclude that the inverse Gaussian double FN network model seems to work well for this data, and we give preference to this model. □

11.1.4 Pseudo Maximum Likelihood Estimator

This short section gives a mathematical foundation to parameter estimation under model uncertainty and model misspecification. We summarize the results of Gourieroux et al. [168], and we refrain from giving any proofs in this section. Assume that the real-valued observations Y_i, 1 ≤ i ≤ n, have been generated by the model

Y_i = μ_{ζ_0}(x_i) + ε_i,     (11.14)

with (true) parameter ζ_0 ∈ Υ ⊂ R^r, feature x_i ∈ X ⊆ {1} × R^q, and where the conditional distribution of the noise random variables (ε_i)_{1≤i≤n} satisfies the conditional independence property p_ε(ε_1, ..., ε_n | x_1, ..., x_n) = ∏_{i=1}^n p_ε(ε_i | x_i). Denote by p_x(x) the portfolio distribution of the features x. Thus, under (11.14), the claim Y of a randomly selected policy is generated by the joint probability measure p_{ε,x}(ε, x) = p_ε(ε|x) p_x(x). The technical assumptions under which the following statements hold are given in Assumption 11.9 at the end of this section.
Let F_0(·|x_i) denote the true conditional distribution of Y_i, given x_i. Typically, this (true) conditional distribution is unknown. It is assumed to provide the first two conditional moments

E_{ζ_0}[Y_i | x_i] = μ_{ζ_0}(x_i)   and   Var_{ζ_0}(Y_i | x_i) = σ_0²(x_i).

Thus, ε_i|x_i is assumed to be centered with conditional variance σ_0²(x_i), see (11.14).
Our goal is to estimate the (true) parameter ζ_0 ∈ Υ, given that the conditional distribution F_0(·|x) of the observations is unknown. Throughout we assume parameter identifiability, i.e., if μ_{ζ_1}(x) = μ_{ζ_2}(x), p_x-a.s., then ζ_1 = ζ_2.
The following estimator is called pseudo maximum likelihood estimator (PMLE)

ζ̂_n^PMLE = arg min_{ζ ∈ Υ} (1/n) ∑_{i=1}^n d(Y_i, μ_ζ(x_i)),     (11.15)

where d(y, μ) is the unit deviance of a (pre-chosen) single-parameter linear EDF being parametrized by the same parameter space Υ ⊂ R^r as the original random variables (11.14); note that Υ is not the effective domain Θ of the chosen EDF.
ζ̂_n^PMLE is called PMLE because it is an MLE for ζ_0 ∈ Υ, but not in the right model, because the pre-chosen EDF in (11.15) typically differs from the (unknown) true conditional distribution F_0(·|x). Nevertheless, we may hope to find the true parameter ζ_0, but possibly at a slower asymptotic rate. This is exactly what is going to be stated in the next theorems.
Theorem 11.5 (Theorem 1 of Gourieroux et al. [168]) Denote by M = κ'(Θ̊) the dual mean parameter space of the pre-chosen EDF (having cumulant function κ and effective domain Θ), and assume that μ_ζ(x) ∈ M for all x ∈ X and ζ ∈ Υ. Let Assumption 11.9, below, hold. The PMLE ζ̂_n^PMLE is strongly consistent for ζ_0, i.e., ζ̂_n^PMLE → ζ_0, a.s., as n → ∞.
This theorem tells us that we can perform MLE in a pre-chosen EDF (which may differ from the true data model), and asymptotically we find the true parameter ζ_0 of the data model F_0(·|x). Of course, this uses the fact that any unit deviance d is a strictly consistent loss function for mean estimation, see Theorem 4.19. We not only receive consistency, but the following theorem also gives us the rate of convergence.
Theorem 11.6 (Theorem 3 of Gourieroux et al. [168]) Set the same assumptions as in Theorem 11.5. The PMLE ζ̂_n^PMLE has the following asymptotic behavior

√n (ζ̂_n^PMLE − ζ_0) ⇒ N( 0, I*(ζ_0)^{-1} Σ(ζ_0) I*(ζ_0)^{-1} )   for n → ∞,

with the following matrices evaluated in ζ = ζ_0

I*(ζ) = E_x[ I*(ζ; x) ] = E_x[ J(ζ; x)^⊤ κ''(h(μ_ζ(x))) J(ζ; x) ] ∈ R^{r×r},
Σ(ζ) = E_x[ J(ζ; x)^⊤ σ_0²(x) J(ζ; x) ] ∈ R^{r×r},

where h = (κ')^{-1} is the canonical link of the pre-chosen EDF, and with the change of variable ζ → θ = θ(ζ) = h(μ_ζ(x)) ∈ Θ, for given feature x, having Jacobian

J(ζ; x) = ( ∂ h(μ_ζ(x)) / ∂ζ_k )_{1≤k≤r} = ( 1 / κ''(h(μ_ζ(x))) ) (∇_ζ μ_ζ(x))^⊤ ∈ R^{1×r}.

Remark that I*(ζ) averages Fisher's information I*(ζ; x) (of the chosen EDF) over the feature distribution p_x. This theorem can be seen as a modification of (3.36) to the regression case. Theorem 11.6 gives us the asymptotic normality of the PMLE, and the resulting asymptotic variance depends on how well the pre-chosen EDF matches the true data distribution F_0(·|x). The following lemma corresponds to Property 5 in Gourieroux et al. [168].
Lemma 11.7 The asymptotic variance in Theorem 11.6 has the lower bound, set ζ = ζ_0 and σ²(x) = σ_0²(x),

I*(ζ)^{-1} Σ(ζ) I*(ζ)^{-1} ≥ H(ζ) = ( E_x[ ∇_ζ μ_ζ(x) σ^{-2}(x) (∇_ζ μ_ζ(x))^⊤ ] )^{-1} ∈ R^{r×r}.

Proof We set τ²(x) = κ''(h(μ_ζ(x))). We have J(ζ; x) = (∇_ζ μ_ζ(x))^⊤ τ^{-2}(x). The following matrix is positive semi-definite and it satisfies

E_x[ ( I*(ζ)^{-1} J(ζ; x)^⊤ − H(ζ) J(ζ; x)^⊤ τ²(x) σ^{-2}(x) ) σ²(x) ( I*(ζ)^{-1} J(ζ; x)^⊤ − H(ζ) J(ζ; x)^⊤ τ²(x) σ^{-2}(x) )^⊤ ]
= I*(ζ)^{-1} Σ(ζ) I*(ζ)^{-1} − H(ζ) I*(ζ) I*(ζ)^{-1} − I*(ζ)^{-1} I*(ζ) H(ζ) + H(ζ) H(ζ)^{-1} H(ζ)
= I*(ζ)^{-1} Σ(ζ) I*(ζ)^{-1} − H(ζ).

This proves the claim. □



Theorem 11.6 and Lemma 11.7 tell us that if we estimate the parameter ζ_0 of the unknown model F_0(·|x) with PMLE based on a single-parameter linear EDF, we receive minimal asymptotic variance if we can match the variance V(μ_{ζ_0}(x)) = κ''(h(μ_{ζ_0}(x))) of the chosen EDF with the variance σ_0²(x) of the true data model. E.g., if we know that the variance in the true model behaves as σ_0²(x) = μ_{ζ_0}³(x), we should select the inverse Gaussian model with variance function V(μ) = μ³ for PMLE.
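The following small simulation sketch illustrates this variance matching; all numerical choices (true parameter, sample sizes, data generating gamma mechanism) are ours, and glm is used as the deviance minimizer (11.15). Both fits are strongly consistent for ζ_0 by Theorem 11.5, but the inverse Gaussian PMLE should show the smaller sampling variance because its variance function V(μ) = μ³ matches the true model.

set.seed(100)
n <- 10000; nsim <- 50                          # hypothetical sample sizes
est <- matrix(NA, nsim, 2)
for (s in 1:nsim){
  x  <- runif(n, -1, 1)
  mu <- exp(0.5 + 0.8 * x)                      # true means, zeta_0 = (0.5, 0.8)
  y  <- rgamma(n, shape = 1/mu, scale = mu^2)   # E[Y] = mu, Var(Y) = mu^3
  fit.ga <- glm(y ~ x, family = Gamma(link = "log"))             # gamma PMLE
  fit.ig <- glm(y ~ x, family = inverse.gaussian(link = "log"))  # IG PMLE
  est[s,] <- c(coef(fit.ga)[2], coef(fit.ig)[2])
  }
apply(est, 2, sd)   # the IG column should show the smaller standard deviation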
If the members of the single-parameter linear EDF do not fully match the variance structure of the true data, we can turn our attention to a dispersion submodel as in Sect. 5.5.1. Assume for the variance structure of the true data

Var_{ζ_0}(Y_i | x_i) = σ_0²(x_i) = s_{α_0}²(x_i) / v_i,

for a regression function x → s_{α_0}²(x) involving the (true) regression parameter α_0 and exposures v_i > 0. If we choose a fixed EDF, we have the log-likelihood function

(μ, ϕ) → ℓ_Y(μ, ϕ; v) = (v/ϕ) [ Y h(μ) − κ(h(μ)) ] + a(Y; v/ϕ).

Equating the variance structure of the true data model with the variance in this pre-specified EDF, we obtain the feature-dependent dispersion parameter

ϕ(x_i) = s_{α_0}²(x_i) / V(μ_{ζ_0}(x_i)),     (11.16)

with variance function V(μ) = (κ'' ∘ h)(μ). The following theorem proposes a two-step procedure for this estimation problem.
Theorem 11.8 (Theorem 4 of Gourieroux et al. [168]) Assume ζ̃_n and α̃_n are strongly consistent estimators for ζ_0 and α_0, as n → ∞, such that √n(ζ̃_n − ζ_0) and √n(α̃_n − α_0) are bounded in probability. The quasi-generalized pseudo maximum likelihood estimator (QPMLE) of ζ_0 is obtained by

ζ̂_n^QPMLE = arg max_{ζ ∈ Υ} ∑_{i=1}^n ℓ_{Y_i}( μ_ζ(x_i), s_{α̃_n}²(x_i) / V(μ_{ζ̃_n}(x_i)); v_i ).

Under Assumption 11.9, below, ζ̂_n^QPMLE is strongly consistent and best asymptotically normal, i.e.,

√n (ζ̂_n^QPMLE − ζ_0) ⇒ N(0, H(ζ_0))   for n → ∞.

This justifies the approach(es) in the previous chapters and sections, though not fully, because we neither work with the MLE in FN networks nor do we care about identifiability in parameters. Nevertheless, this short section suggests finding strongly consistent estimators ζ̃_n and α̃_n for ζ_0 and α_0. This gives us a first model calibration step that allows us to specify the dispersion structure x → ϕ(x) via (11.16). Using this dispersion structure and the deviance loss function (4.9) for a variable dispersion parameter ϕ(x), the QPMLE is obtained in the second step, where we replace the likelihood maximization by the deviance loss minimization,

ζ̂_n^QPMLE = arg min_{ζ ∈ Υ} (1/n) ∑_{i=1}^n [ v_i / ( s_{α̃_n}²(x_i) / V(μ_{ζ̃_n}(x_i)) ) ] d(Y_i, μ_ζ(x_i)).

This QPMLE is best asymptotically normal, thus, asymptotically optimal within the EDF. There might still be better estimators for ζ_0, but these are outside the EDF.
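In R, the second estimation step can be carried out with prior weights in glm. The following is a minimal sketch, assuming a response vector y and covariate x, using the gamma EDF (V(μ) = μ²), v_i ≡ 1, and a deliberately crude log-linear variance regression for s_α²(x); all these modeling choices are ours.

# step 1: consistent pilot estimates for zeta_0 and alpha_0
fit1 <- glm(y ~ x, family = Gamma(link = "log"))   # pilot mean estimate
mu1  <- fitted(fit1)
fits <- lm(log((y - mu1)^2) ~ x)                   # crude variance regression s_alpha^2(x)
s2   <- exp(fitted(fits))
# step 2: QPMLE = weighted deviance minimization with weights v/phi(x),
# where phi(x) = s_alpha^2(x)/V(mu(x)) and V(mu) = mu^2 for the gamma EDF
fit2 <- glm(y ~ x, family = Gamma(link = "log"), weights = mu1^2 / s2)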
If we turn M-estimation into Z-estimation we have the requirement for ζ (score equation set to zero), see also (11.5),

(1/n) ∑_{i=1}^n v_i ( V(μ_{ζ̃_n}(x_i)) / s_{α̃_n}²(x_i) ) ( (Y_i − μ_ζ(x_i)) / V(μ_ζ(x_i)) ) ∇_ζ μ_ζ(x_i) = 0.

Thus, it all boils down to finding the right variance structure to receive the optimal asymptotic behavior.
The previous statements hold true under the following technical assumptions. These are taken from Appendix 1 of Gourieroux et al. [167], and they are an adapted version of the ones in Burguete et al. [61].
Assumption 11.9
(i) μ_ζ(x) and d(y, μ_ζ(x)) are continuous w.r.t. all variables and twice continuously differentiable in ζ;
(ii) Υ ⊂ R^r is a compact set and the true parameter ζ_0 is in the interior of Υ;
(iii) almost every realization of (ε_i, x_i) is a Cesàro sum generator w.r.t. the probability measure p_{ε,x}(ε, x) = p_ε(ε|x) p_x(x) and to a dominating function b(ε, x);
(iv) the sequence (x_i)_i is a Cesàro sum generator w.r.t. p_x and b̄(x) = ∫_R b(ε, x) dp_ε(ε|x);
(v) for each x ∈ {1} × R^q, there exists a neighborhood N_x ⊂ {1} × R^q such that

∫_R sup_{x' ∈ N_x} b(ε, x') dp_ε(ε|x) < ∞;

(vi) the functions d(Y, μ_ζ(x)), ∂d(Y, μ_ζ(x))/∂ζ_k, ∂²d(Y, μ_ζ(x))/∂ζ_k ∂ζ_l are dominated by b(ε, x).

11.2 Deep Quantile Regression

So far, in network regression modeling, we have not addressed the question of prediction uncertainty. As mentioned in Remarks 4.2 on forecast evaluation, there are different sources that contribute to prediction uncertainty. There is the model and parameter estimation uncertainty, which may result in an inappropriate model choice, and there is the irreducible risk which comes from the fact that we forecast random variables which inherit a natural randomness that cannot be controlled. We have discussed methods of evaluating model and parameter estimation error, such as the asymptotic normality of MLEs within GLMs, and we have discussed forecast dominance, the bootstrap method and the nagging predictor that allow one to assess the different sources of prediction uncertainty. However, we have not explicitly quantified these sources of uncertainty within the class of network
regression models. We make an attempt in Sect. 11.4, below, by considering the fluctuations generated by bootstrap simulations. The irreducible risk can be assessed once we have a suitable statistical model; in Example 11.4 we have studied a gamma and an inverse Gaussian model on an explicit data set, and these models can be used, e.g., to calculate quantiles. In this section we consider a distribution-free approach that directly estimates these quantiles. Recall from Sect. 5.8.3 that quantiles are elicitable with the pinball loss as a strictly consistent loss function, see Theorem 5.33. This allows us to directly estimate the quantiles from the data.

11.2.1 Deep Quantile Regression: Single Quantile

In this section we present a way of assessing the irreducible risk which does not require a sophisticated model evaluation of distributional assumptions. Quantile regression is increasingly used in the machine learning community because it is a robust way of quantifying the irreducible risk, we refer to Meinshausen [270], Takeuchi et al. [350] and Richman [314]. We recall that quantiles are elicitable having the pinball loss as a strictly consistent loss function, see Theorem 5.33. We define a FN network regression model that allows us to directly estimate the quantiles based on the pinball loss. We therefore use an adapted version of the R code of Listing 9 in Richman [314]; this adapted version has been proposed in Fissler et al. [130] to ensure that different quantiles respect monotonicity. For any two quantile levels 0 < τ_1 < τ_2 < 1 we have

F^{-1}(τ_1) ≤ F^{-1}(τ_2),     (11.17)

where F^{-1} denotes the generalized inverse of the distribution function F, see (5.80). If we simultaneously learn these quantiles for different quantile levels τ_1 < τ_2, we need to enforce the network to respect this monotonicity (11.17). This can be achieved by exploring a special network architecture in the output layer, and this is going to be presented in the next section.
We start by considering a single deep τ-quantile regression for a quantile level τ ∈ (0, 1). For datum (Y, x) we consider the regression function

x → F_{Y|x}^{-1}(τ) = g^{-1}⟨β_τ, z^{(d:1)}(x)⟩,     (11.18)

for a strictly monotone and smooth link function g, output parameter β_τ ∈ R^{q_d+1}, and where x → z^{(d:1)}(x) is a deep network. We add a lower index Y|x to the generalized inverse F_{Y|x}^{-1} to highlight that we consider the conditional distribution of Y, given feature x ∈ X. In the case of a deep FN network, (11.18) involves a network parameter ϑ = (w_1^{(1)}, ..., w_{q_d}^{(d)}, β_τ)^⊤ that needs to be estimated. Of course, the deep network architecture x → z^{(d:1)}(x) could also involve any other feature, such as CN or LSTM layers, embedding layers or a NLP text recognition
feature. This would change the network architecture, but it would not change anything from a methodological viewpoint.
To estimate this regression parameter ϑ from independent data (Y_i, x_i), 1 ≤ i ≤ n, we consider the objective function

ϑ → ∑_{i=1}^n L_τ( Y_i, g^{-1}⟨β_τ, z^{(d:1)}(x_i)⟩ ),

with the strictly consistent pinball loss function L_τ for the τ-quantile. Alternatively, we could choose any other loss function satisfying Theorem 5.33, and we may try to find the asymptotically optimal one (similarly to Theorem 11.8). We refrain from doing so, but we mention Komunjer–Vuong [222]. Fitting the network parameter ϑ is then done in complete analogy to finding an optimal network parameter for network mean modeling. The only change is that we replace the deviance loss function by the pinball loss, e.g., in Listing 7.3 we have to exchange the loss function on line 5 correspondingly.
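For illustration, the pinball loss for a single quantile level, say τ = 90%, can be implemented in keras as follows; this mirrors the multi-quantile losses of Listing 11.5, below, and `model` denotes the single-output architecture (11.18):

Q_loss = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.9
                       + k_maximum(y_pred - y_true, 0) * (1 - 0.9))}
model %>% compile(loss = Q_loss, optimizer = 'nadam')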

11.2.2 Deep Quantile Regression: Multiple Quantiles

We now turn our attention to the multiple quantile case that should satisfy the monotonicity requirement (11.17) for any quantile levels 0 < τ_1 < τ_2 < 1. A separate deep quantile estimation for both quantile levels, as described in the previous section, may violate the monotonicity property, at least, in some part of the feature space X, especially if the two quantile levels are close. Therefore, we enforce the monotonicity by a special choice of the network architecture.
For simplicity, in the remainder of this section, we assume that the response Y is positive, a.s. This implies for the quantiles τ → F_{Y|x}^{-1}(τ) ≥ 0, and we should choose a link function with g^{-1} ≥ 0 in (11.18). To ensure the monotonicity (11.17) for the quantile levels 0 < τ_1 < τ_2 < 1, we choose a second positive link function with g_+^{-1} ≥ 0, and we set for multi-task forecasting

x → ( F_{Y|x}^{-1}(τ_1), F_{Y|x}^{-1}(τ_2) )     (11.19)
    = ( g^{-1}⟨β_{τ_1}, z^{(d:1)}(x)⟩, g^{-1}⟨β_{τ_1}, z^{(d:1)}(x)⟩ + g_+^{-1}⟨β_{τ_2}, z^{(d:1)}(x)⟩ ) ∈ R_+²,

for a regression parameter ϑ = (w_1^{(1)}, ..., w_{q_d}^{(d)}, β_{τ_1}, β_{τ_2})^⊤. The positivity g_+^{-1} ≥ 0 enforces the monotonicity in the two quantiles. We call (11.19) an additive approach as we start from a base level characterized by the smaller quantile F_{Y|x}^{-1}(τ_1), and any bigger quantile is modeled by an additive increment. To ensure monotonicity for multiple quantiles we proceed recursively by choosing the lowest quantile as the initial base level.
We can also consider the upper quantile as the base level by multiplicatively lowering this upper quantile. Choose the (sigmoid) function g_σ^{-1} ∈ (0, 1) and set for the multiplicative approach

x → ( F_{Y|x}^{-1}(τ_1), F_{Y|x}^{-1}(τ_2) )     (11.20)
    = ( g_σ^{-1}⟨β_{τ_1}, z^{(d:1)}(x)⟩ · g^{-1}⟨β_{τ_2}, z^{(d:1)}(x)⟩, g^{-1}⟨β_{τ_2}, z^{(d:1)}(x)⟩ ) ∈ R_+².

Remark 11.10 In (11.19) and (11.20) we directly enforce the monotonicity by a corresponding regression function choice. Alternatively, we can also design a (plain-vanilla) multi-output network

x → ( F_{Y|x}^{-1}(τ_1), F_{Y|x}^{-1}(τ_2) )     (11.21)
    = ( g^{-1}⟨β_{τ_1}, z^{(d:1)}(x)⟩, g^{-1}⟨β_{τ_2}, z^{(d:1)}(x)⟩ ) ∈ R_+².

If we just use a classical SGD fitting algorithm, we will likely end up in a situation where the monotonicity is violated in some part of the feature space. Kellner et al. [211] consider this problem. They add a penalization (regularization term) that punishes during SGD training network parameters that violate the monotonicity. Such a penalization can be constructed, e.g., with the ReLU function.
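A minimal sketch of such a ReLU penalization for two quantile levels τ_1 = 25% and τ_2 = 75% could look as follows; it assumes that the two outputs q1, q2 of (11.21) have been concatenated to a two-column output, and λ > 0 is a hypothetical tuning constant:

output = list(q1, q2) %>% layer_concatenate()       # q1, q2 as in (11.21)
model = keras_model(inputs = list(Design), outputs = output)
lambda = 1
joint_loss = function(y_true, y_pred){
  k_mean(k_maximum(y_true[,1] - y_pred[,1], 0) * 0.25 +
         k_maximum(y_pred[,1] - y_true[,1], 0) * 0.75) +      # pinball loss tau1
  k_mean(k_maximum(y_true[,1] - y_pred[,2], 0) * 0.75 +
         k_maximum(y_pred[,2] - y_true[,1], 0) * 0.25) +      # pinball loss tau2
  lambda * k_mean(k_relu(y_pred[,1] - y_pred[,2]))}           # > 0 iff quantiles cross
model %>% compile(loss = joint_loss, optimizer = 'nadam')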

11.2.3 Lab: Deep Quantile Regression

We revisit the Swiss accident insurance data of Sect. 11.1.2, and we provide an example of a deep quantile regression using both the additive approach (11.19) and the multiplicative approach (11.20).
We select 5 different quantile levels Q = (τ_1, τ_2, τ_3, τ_4, τ_5) = (10%, 25%, 50%, 75%, 90%). We start with the additive approach (11.19). It requires setting τ_1 = 10% as the base level, and the remaining quantile levels are modeled additively in a recursive way for τ_j < τ_{j+1}, 1 ≤ j ≤ 4. The corresponding R code is given on lines 8–20 of Listing 11.3, and this compiles to the 5-dimensional output on line 22. For the multiplicative approach (11.20) we set τ_5 = 90% as the base level, and the remaining quantile levels are received multiplicatively in a recursive way for τ_{j+1} > τ_j, 4 ≥ j ≥ 1, see Listing 11.4. The additive and the multiplicative approaches take the extreme quantiles as initialization. One may also be interested in initializing the model in the median τ_3 = 50%; the smaller quantiles can then be received by the multiplicative approach and the bigger quantiles by the additive approach. We also explore this case and we call it the mixed approach (a sketch of this mixed construction is given after Listing 11.4).
Listing 11.3 Multiple FN quantile regression: additive approach


1 Design = layer_input(shape = c(q0), dtype = ’float32’, name = ’Design’)
2 #
3 Network = Design %>%
4 layer_dense(units=20, activation=’tanh’, name=’FNLayer1’) %>%
5 layer_dense(units=15, activation=’tanh’, name=’FNLayer2’) %>%
6 layer_dense(units=10, activation=’tanh’, name=’FNLayer3’)
7 #
8 q1 = Network %>% layer_dense(units=1, activation=’exponential’)
9 #
10 q20 = Network %>% layer_dense(units=1, activation=’exponential’)
11 q2 = list(q1,q20) %>% layer_add()
12 #
13 q30 = Network %>% layer_dense(units=1, activation=’exponential’)
14 q3 = list(q2,q30) %>% layer_add()
15 #
16 q40 = Network %>% layer_dense(units=1, activation=’exponential’)
17 q4 = list(q3,q40) %>% layer_add()
18 #
19 q50 = Network %>% layer_dense(units=1, activation=’exponential’)
20 q5 = list(q4,q50) %>% layer_add()
21 #
22 model = keras_model(inputs = list(Design), outputs = c(q1,q2,q3,q4,q5))

Listing 11.4 Multiple FN quantile regression: multiplicative approach


1 q5 = Network %>% layer_dense(units=1, activation=’exponential’)
2 #
3 q40 = Network %>% layer_dense(units=1, activation=’sigmoid’)
4 q4 = list(q5,q40) %>% layer_multiply()
5 #
6 q30 = Network %>% layer_dense(units=1, activation=’sigmoid’)
7 q3 = list(q4,q30) %>% layer_multiply()
8 #
9 q20 = Network %>% layer_dense(units=1, activation=’sigmoid’)
10 q2 = list(q3,q20) %>% layer_multiply()
11 #
12 q10 = Network %>% layer_dense(units=1, activation=’sigmoid’)
13 q1 = list(q2,q10) %>% layer_multiply()
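The mixed approach mentioned above combines these two constructions; a minimal sketch (using the same `Network` as in Listing 11.3) could look as follows:

q3 = Network %>% layer_dense(units=1, activation='exponential')   # median base level
#
q20 = Network %>% layer_dense(units=1, activation='sigmoid')      # downwards: multiplicative
q2 = list(q3,q20) %>% layer_multiply()
q10 = Network %>% layer_dense(units=1, activation='sigmoid')
q1 = list(q2,q10) %>% layer_multiply()
#
q40 = Network %>% layer_dense(units=1, activation='exponential')  # upwards: additive
q4 = list(q3,q40) %>% layer_add()
q50 = Network %>% layer_dense(units=1, activation='exponential')
q5 = list(q4,q50) %>% layer_add()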

Listing 11.5 Fitting a multiple FN quantile regression


1 Q_loss1 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.1
2 + k_maximum(y_pred - y_true, 0) * (1 - 0.1))}
3 Q_loss2 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.25
4 + k_maximum(y_pred - y_true, 0) * (1 - 0.25))}
5 Q_loss3 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.5
6 + k_maximum(y_pred - y_true, 0) * (1 - 0.5))}
7 Q_loss4 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.75
8 + k_maximum(y_pred - y_true, 0) * (1 - 0.75))}
9 Q_loss5 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.9
10 + k_maximum(y_pred - y_true, 0) * (1 - 0.9))}
11 #
12 model %>% compile(loss = list(Q_loss1,Q_loss2,Q_loss3,Q_loss4,Q_loss5),
13 optimizer = ’nadam’)
These network architectures are fitted to the data using the pinball loss (5.81) for the quantile levels of Q; note that the pinball loss requires the assumption of having a finite first moment. Listing 11.5 shows the choice of the pinball loss functions. We then fit the three architectures (additive, multiplicative and mixed) to our learning data L, and we apply early stopping to prevent over-fitting. Moreover, we consider the nagging predictor over 20 runs with different seeds to reduce the randomness coming from SGD fitting.
In Table 11.6 we give the out-of-sample pinball losses on the test data T of the three considered approaches, illustrating the 5 quantile levels of Q. The losses of the three approaches are rather close, giving a slight preference to the mixed approach, but the other two approaches seem to be competitive, too. We further analyze these quantile regression models by considering the empirical coverage ratios defined by

τ̂_{τ_j} = (1/T) ∑_{t=1}^T 1{ Y_t† ≤ F̂_{Y|x_t†}^{-1}(τ_j) },     (11.22)

where F̂_{Y|x_t†}^{-1}(τ_j) is the estimated quantile for level τ_j and feature x_t†. Remark that the coverage ratios (11.22) correspond to the identification functions that are essentially the derivatives of the pinball losses, we refer to Dimitriadis et al. [106]. Table 11.7 reports these out-of-sample coverage ratios on the test data T. From these results we conclude that on the portfolio level the quantiles are matched rather well.
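For completeness, the coverage ratios (11.22) can be computed along the following lines, where `q_hat` denotes the T×5 matrix of estimated test quantiles and `Y_test` the out-of-sample observations (both names are placeholders for the objects produced by the fitted model):

q_hat    <- do.call(cbind, model %>% predict(X_test))   # 5 quantile estimates per claim
coverage <- colMeans(sweep(q_hat, 1, Y_test, FUN = ">="))
round(100 * coverage, 2)    # to be compared with (10, 25, 50, 75, 90)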
In Fig. 11.8 we illustrate the estimated out-of-sample quantiles F̂_{Y|x_t†}^{-1}(τ_j) for individual claims on the quantile levels τ_j ∈ {10%, 25%, 50%, 75%, 90%} (cyan, blue, black, blue, cyan colors) using the mixed approach. The x-axis considers the logged estimated medians F̂_{Y|x_t†}^{-1}(50%). We observe heteroskedasticity resulting in quantiles that are not ordered w.r.t. the median (black line), i.e., a larger median does not necessarily imply larger other quantiles. This supports the multiple deep quantile regression model because we cannot (simply) extrapolate the median to receive the other quantiles.
In the final step we compare the estimated quantiles F̂_{Y|x}^{-1}(τ_j) from the mixed deep quantile regression approach to the ones that can be calculated from the fitted inverse Gaussian model using the double FN network approach of Example 11.4. In the latter model we estimate the mean μ̂(x) and the dispersion ϕ̂(x) with two FN networks, which then allow us to calculate the quantiles using the inverse Gaussian distributional assumption. Note that we cannot calculate the quantiles in Tweedie's family with power variance parameter p = 2.5 because there is no

Table 11.6 Out-of-sample pinball losses of quantile regressions using the additive, the multiplicative and the mixed approaches; nagging predictors over 20 different seeds
Out-of-sample losses on T
10% 25% 50% 75% 90%
Additive approach 171.20 412.78 765.60 988.78 936.31
Multiplicative approach 171.18 412.87 766.04 988.59 936.57
Mixed approach 171.15 412.55 764.60 988.15 935.50
Table 11.7 Out-of-sample coverage ratios τ̂_{τ_j} below the estimated deep FN quantile estimates F̂_{Y|x_t†}^{-1}(τ_j)
Out-of-sample coverage ratios
                              10%      25%      50%      75%      90%
Additive approach 10.27% 25.30% 50.19% 75.08% 90.03%
Multiplicative approach 10.18% 25.15% 49.64% 75.14% 90.22%
Mixed approach 10.13% 25.03% 50.32% 75.20% 90.08%

Fig. 11.8 Estimated quantiles F̂_{Y|x_t†}^{-1}(τ_j) of 2'000 randomly selected individual claims on the quantile levels τ_j ∈ {10%, 25%, 50%, 75%, 90%} (cyan, blue, black, blue, cyan colors) using the mixed approach; the red dots are the out-of-sample observations Y_t†; the x-axis gives log F̂_{Y|x_t†}^{-1}(50%) (also corresponding to the black diagonal line), the y-axis the claims on log-scale

closed form of the distribution function. Figure 11.9 compares the two approaches on the quantile levels of Q. Overall we observe a reasonably good match, though it is not perfect. The small quantiles for level τ_1 = 10% seem slightly under-estimated by the inverse Gaussian approach (see Fig. 11.9 (top-left)), whereas the big quantiles τ_4 = 75% and τ_5 = 90% seem more conservative in the inverse Gaussian approach (see Fig. 11.9 (bottom)). This may indicate that the inverse Gaussian distribution does not fully fit the data, i.e., that one cannot fully recover the true quantiles from the mean μ̂(x), the dispersion ϕ̂(x) and an inverse Gaussian assumption. There are two ways to further explore these issues. One can either choose other distributional assumptions which may better match the properties of the data; this further explores the distributional approach. Alternatively, Theorem 5.33 allows us to choose loss functions different from the pinball loss, i.e., one could consider different increasing functions G in that theorem to further explore the distribution-free approach. In general, any increasing choice of the function G leads to a strictly consistent quantile estimation (this is an asymptotic statement), but these choices may have different finite sample properties. Following Komunjer–Vuong [222], we can determine asymptotically efficient choices for G. This would require feature-dependent choices G_{x_i}(y) = F_{Y|x_i}(y), where F_{Y|x_i} is the (true) distribution of Y_i, conditionally given x_i. This requires the knowledge of the true distribution, and Komunjer–Vuong [222] derive asymptotic efficiency when replacing this true
Fig. 11.9 Inverse Gaussian quantiles vs. deep quantile regression estimates of 2'000 randomly selected claims on the quantile levels of Q = (10%, 25%, 50%, 75%, 90%); the panels show the estimated 0.1-, 0.25-, 0.5-, 0.75- and 0.9-quantiles, inverse Gaussian (log-scale) against quantile regression (log-scale)

distribution by a non-parametric estimator; this is in spirit similar to Theorem 11.8. We refrain from giving more details but refer to the corresponding paper.

11.3 Deep Composite Model Regression

We have established a deep quantile regression in the previous section. Next we jointly estimate quantiles and conditional tail expectations (CTEs), leading to a composite regression model that has a splicing point determined by a quantile level; for composite models we refer to Sect. 6.4.4. This is exactly the proposal of Fissler et al. [130] which we are going to present in this section. Note that having a composite model allows us to have different distributions and regression structures below and above the splicing point, e.g., we can have a more heavy-tailed model in the upper tail using a different feature engineering from the main body of the data.

11.3.1 Joint Elicitability of Quantiles and Expected Shortfalls

In the previous examples we have seen that the distributional models may misestimate the true tail of the data because model fitting often pays more attention to an accurate model fit in the main body of the data. An idea is to directly estimate this tail in a distribution-free way by considering the (upper) CTE

CTE_τ⁺(Y|x) = E[ Y | Y > F_{Y|x}^{-1}(τ), x ],     (11.23)

for a given quantile level τ ∈ (0, 1). The problem with (11.23) is that this is not an elicitable quantity, i.e., there is no loss/scoring function that is strictly consistent for the CTE functional.
If the distribution function F_{Y|x} is continuous, we can rewrite the upper CTE as follows, see Lemma 2.16 in McNeil et al. [268] and (11.35) below,

CTE_τ⁺(Y|x) = ES_τ⁺(Y|x) = (1/(1−τ)) ∫_τ^1 F_{Y|x}^{-1}(p) dp ≥ F_{Y|x}^{-1}(τ).     (11.24)

This second object ES_τ⁺(Y|x) is called the upper expected shortfall (ES) of Y, given x, on the security level τ. Fissler–Ziegel [131] and Fissler et al. [132] have proved that ES_τ⁺(Y|x) is jointly elicitable with the τ-quantile F_{Y|x}^{-1}(τ). That is, there is a strictly consistent bivariate loss function that allows one to jointly estimate the τ-quantile and the corresponding ES. In fact, Corollary 5.5 of Fissler–Ziegel [131] gives the full characterization of the strictly consistent bivariate loss functions for the joint elicitability of the τ-quantile and the ES; note that Fissler–Ziegel [131] use a different sign convention. This result is used in Guillén et al. [175] for the joint estimation of the quantile and the ES within a GLM. Guillén et al. [175] use a two-step approach to fit the quantile and the ES.
Fissler et al. [130] extend the results of Fissler–Ziegel [131], allowing for the joint estimation of the composite triplet consisting of the lower ES, the τ-quantile and the upper ES. This gives us a composite model that has the τ-quantile as splicing point. The beauty of this approach is that we can fit (in one step) a deep learning model to the upper and the lower ES, and perform a (potentially different) regression in both parts of the distribution. The lower CTE and the lower ES are defined by, respectively,

CTE_τ⁻(Y|x) = E[ Y | Y ≤ F_{Y|x}^{-1}(τ), x ],

and

ES_τ⁻(Y|x) = (1/τ) ∫_0^τ F_{Y|x}^{-1}(p) dp ≤ F_{Y|x}^{-1}(τ).

Again, in case of a continuous distribution function F_{Y|x} we have the identity CTE_τ⁻(Y|x) = ES_τ⁻(Y|x). From the lower and upper CTEs we receive the mean of Y, given x, by

μ(x) = E[Y|x] = τ CTE_τ⁻(Y|x) + (1 − τ) CTE_τ⁺(Y|x).     (11.25)
We introduce the auxiliary scoring functions

S_τ⁻(y, a) = ( 1{y≤a} − τ ) a − 1{y≤a} y,
S_τ⁺(y, a) = ( 1 − τ − 1{y>a} ) a + 1{y>a} y = S_τ⁻(y, a) + y,

for y, a ∈ R and for τ ∈ (0, 1). These auxiliary functions consider only the part of the pinball loss (5.81) that depends on the action a, and we get the pinball loss as follows

L_τ(y, a) = S_τ⁻(y, a) + τ y = S_τ⁺(y, a) − (1 − τ) y.

Therefore, all three functions provide strictly consistent scoring functions for the τ-quantile, but only the pinball loss satisfies the calibration property (L0) on page 92.
For the following theorem we recall the general definition of the τ-quantile Q_τ(F_{Y|x}) of a distribution function F_{Y|x}, see (5.82).
Theorem 11.11 (Theorem 2.8 of Fissler et al. [130], Without Proof) Choose τ ∈ (0, 1) and let F contain only distributions with a finite first moment, and being supported in the interval C ⊆ R. The loss function L: C × C³ → R_+ of the form

L(y; e⁻, q, e⁺) = (G(y) − G(q)) ( τ − 1{y≤q} )     (11.26)
    + ⟨ ∇Φ(e⁻, e⁺), ( e⁻ + (1/τ) S_τ⁻(y, q), e⁺ − (1/(1−τ)) S_τ⁺(y, q) )^⊤ ⟩ − Φ(e⁻, e⁺) + Φ(y, y),

is strictly consistent for the composite triplet (ES_τ⁻, Q_τ, ES_τ⁺) relative to the class F, if Φ is strictly convex with (sub-)gradient ∇Φ such that for all (e⁻, e⁺) ∈ C² the function

q → G_{e⁻,e⁺}(q) = G(q) + (1/τ) (∂Φ(e⁻, e⁺)/∂e⁻) q − (1/(1−τ)) (∂Φ(e⁻, e⁺)/∂e⁺) q,     (11.27)

is strictly increasing, and if E_F[|G(Y)|] < ∞, E_F[|Φ(Y, Y)|] < ∞ for all Y ∼ F ∈ F.
This opens the door for regression modeling of CTEs for continuous distribution functions F_{Y|x}, x ∈ X. Namely, we can choose a regression function ξ_ϑ with a three-dimensional output

x ∈ X → ξ_ϑ(x) ∈ C³,
depending on a regression parameter ϑ. This regression function is now used to describe the composite triplet (ES_τ⁻(Y|x), F_{Y|x}^{-1}(τ), ES_τ⁺(Y|x)). Having i.i.d. data (Y_i, x_i), 1 ≤ i ≤ n, it can be fitted by solving

ϑ̂ = arg min_ϑ (1/n) ∑_{i=1}^n L( Y_i; ξ_ϑ(x_i) ),     (11.28)

with loss function L given by (11.26). This then provides us with the estimates for the composite triplet

x → ξ_ϑ̂(x) = ( ÊS_τ⁻(Y|x), F̂_{Y|x}^{-1}(τ), ÊS_τ⁺(Y|x) ).

There remains the choice of the functions G and Φ, such that Φ is strictly convex and G_{e⁻,e⁺}, defined in (11.27), is strictly increasing. Section 2.3 in Fissler et al. [130] discusses possible choices. A simple choice is to select the identity function G(y) = y (which gives the pinball loss on the first line of (11.26)) and

Φ(e⁻, e⁺) = ψ_1(e⁻) + ψ_2(e⁺),

with ψ_1 and ψ_2 strictly convex and with (sub-)gradients ψ_1' > 0 and ψ_2' < 0. Inserting this choice into (11.26) provides the loss function

L(y; e⁻, q, e⁺) = ( 1 + ψ_1'(e⁻)/τ + (−ψ_2'(e⁺))/(1−τ) ) L_τ(y, q) + D_{ψ_1}(y, e⁻) + D_{ψ_2}(y, e⁺),     (11.29)

where L_τ(y, q) is the pinball loss (5.81) and D_{ψ_1} and D_{ψ_2} are Bregman divergences (2.28). There remain the choices of ψ_1 and ψ_2 which should be strictly convex, the first one being strictly increasing and the second one being strictly decreasing.
We restrict ourselves to strictly convex functions ψ on the positive real line R_+, i.e., for positive claims Y > 0, a.s. For b ∈ R, we consider the following functions on R_+

ψ^(b)(y) = { y^b / (b(b−1))   for b ≠ 0 and b ≠ 1,
           { −1 − log(y)      for b = 0,                    (11.30)
           { y log(y) − y     for b = 1.
We compute the first and second derivatives. These are for y > 0 given by

∂ψ^(b)(y)/∂y = { y^{b−1}/(b−1)   for b ≠ 1,      and      ∂²ψ^(b)(y)/∂y² = y^{b−2} > 0.
              { log(y)          for b = 1,

Thus, for any b ∈ R we have a convex function, and this convex function is decreasing on R_+ for b < 1 and increasing for b > 1. Therefore, we have to select b > 1 for ψ_1 and b < 1 for ψ_2 to get suitable choices in (11.29). Interestingly, these choices correspond to Lemma 11.2 with power variance parameters p = 2 − b, i.e., they provide us with Bregman divergences from Tweedie's distributions. However, (11.30) is more general, because it allows us to select any b ∈ R, whereas for power variance parameters p ∈ (0, 1) there do not exist any Tweedie's distributions, see Theorem 2.18.
In view of Lemma 11.2 and using the fact that unit deviances d_p are Bregman divergences, we select a power variance parameter p = 2 − b > 1 for ψ_2 and we select the Gaussian model p = 2 − b = 0 for ψ_1. This gives us the special choice for the loss function (11.29) for strictly positive claims Y > 0, a.s.,

L(y; e⁻, q, e⁺) = ( 1 + η_1 e⁻/τ + η_2 (e⁺)^{1−p} / ((1−τ)(p−1)) ) L_τ(y, q) + (η_1/2) d_0(y, e⁻) + (η_2/2) d_p(y, e⁺),     (11.31)

with the Gaussian unit deviance d_0(y, e⁻) = (y − e⁻)² and Tweedie's unit deviance d_p with power variance parameter p > 1, see Sect. 11.1.1. The additional constants η_1, η_2 > 0 are used to balance the contributions of the individual terms to the total loss. Typically, we choose p ≥ 2 for the upper ES reflecting claim size models. This choice for ψ_2 implies that the residuals are weighted inversely proportionally to the corresponding variances μ^p within Tweedie's family, see (11.5). Using this loss function (11.31) in (11.28) allows us to estimate the composite triplet (ES_τ⁻(Y|x), F_{Y|x}^{-1}(τ), ES_τ⁺(Y|x)) with a strictly consistent loss function.

11.3.2 Lab: Deep Composite Model Regression

The joint elicitability of Theorem 11.11 allows us to directly estimate these functionals for a fixed quantile level τ ∈ (0, 1). In a similar way to quantile regression we set up a FN network that respects the monotonicity ES_τ⁻(Y|x) ≤
F_{Y|x}^{-1}(τ) ≤ ES_τ⁺(Y|x). We set for the regression function in the additive approach for multi-task learning

x → ( ES_τ⁻(Y|x), F_{Y|x}^{-1}(τ), ES_τ⁺(Y|x) )     (11.32)
    = ( g^{-1}⟨β_1, z^{(d:1)}(x)⟩,
        g^{-1}⟨β_1, z^{(d:1)}(x)⟩ + g_+^{-1}⟨β_2, z^{(d:1)}(x)⟩,
        g^{-1}⟨β_1, z^{(d:1)}(x)⟩ + g_+^{-1}⟨β_2, z^{(d:1)}(x)⟩ + g_+^{-1}⟨β_3, z^{(d:1)}(x)⟩ ) ∈ A,

for link functions g and g_+ with g_+^{-1} ≥ 0, deep FN network z^{(d:1)}: R^{q_0+1} → R^{q_d+1}, regression parameters β_1, β_2, β_3 ∈ R^{q_d+1}, and with the action space A = {(e⁻, q, e⁺) ∈ R_+³; e⁻ ≤ q ≤ e⁺} for positive claims. We also remind of Remark 11.10 for a different way of modeling the monotonicity.
Fitting this model is similar to the multiple deep quantile regression presented
in Listings 11.3 and 11.5. There is one important difference though. Namely, we
do not have multiple outputs and multiple loss functions, but we have a three-
dimensional output with a single loss function (11.31) simultaneously evaluating all
three components of the output (11.32). Listing 11.6 gives this loss for the inverse
Gaussian case p = 3 in (11.31).

Listing 11.6 Loss function (11.31) for p = 3


1 Bregman_IG = function(y_true, y_pred){
2 k_mean( (k_maximum(y_true[,1]-y_pred[,2],0)*tau0 +
3 k_maximum(y_pred[,2]-y_true[,1],0)*(1-tau0) ) *
4 ( 1 + eta1*y_pred[,1]/tau0 + eta2*y_pred[,3]^(-2)/(2*(1-tau0)) ) +
5 eta1*(y_true[,1]-y_pred[,1])^2/2 +
6 eta2*((y_true[,1]-y_pred[,3])^2/(y_pred[,3]^2*y_true[,1]))/2 )}
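The corresponding (simplified) network wiring can be sketched as follows: the three monotone outputs of (11.32) are concatenated into one tensor so that the single loss of Listing 11.6 can evaluate them jointly; `Network` and `Design` are as in Listing 11.3, and the constants tau0, eta1, eta2 are assumed to be fixed.

es_low = Network %>% layer_dense(units=1, activation='exponential')  # lower ES
dq = Network %>% layer_dense(units=1, activation='exponential')
q = list(es_low, dq) %>% layer_add()                                 # 90%-quantile
de = Network %>% layer_dense(units=1, activation='exponential')
es_up = list(q, de) %>% layer_add()                                  # upper ES
output = list(es_low, q, es_up) %>% layer_concatenate()
model = keras_model(inputs = list(Design), outputs = output)
model %>% compile(loss = Bregman_IG, optimizer = 'nadam')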

We revisit the Swiss accident insurance data of Sect. 11.2.3. We again use a FN network of depth d = 3 with (q_1, q_2, q_3) = (20, 15, 10) neurons, hyperbolic tangent activation, two-dimensional embedding layers for the categorical features, exponential output activations for g^{-1} and g_+^{-1}, and the additive structure (11.32). We implement the loss function (11.31) for quantile level τ = 90% and with power variance parameter p = 3, see Listing 11.6. This implies that for the upper ES estimation we scale residuals with V(μ) = μ³, see (11.5). We then run an initial calibration of this FN network. Based on this initial calibration we can calculate the three loss contributions in (11.31) coming from the composite triplet. Based on these figures we choose the constants η_1, η_2 > 0 in (11.31) so that all three terms of the composite triplet contribute equally to the total loss. For the remainder of our calibration we hold on to these choices of η_1 and η_2.
We calibrate this deep FN architecture to the learning data L, using the strictly consistent loss function (11.31) for the composite triplet (ES_{90%}⁻(Y|x), F_{Y|x}^{-1}(90%), ES_{90%}⁺(Y|x)), and to reduce the randomness in prediction we average over 20 early stopped SGD calibrations with different seeds (nagging predictor).
Fig. 11.10 Comparison of the estimated lower ÊS_{90%}⁻(Y|x_t†) and the estimated upper ÊS_{90%}⁺(Y|x_t†) against the estimated 90%-quantile F̂_{Y|x_t†}^{-1}(90%) in the deep composite regression; lower and upper ES (log-scale) against the 90%-quantile (log-scale)

Figure 11.10 shows the estimated lower and upper ES against the corresponding 90%-quantile estimates for 2'000 randomly selected insurance claims x_t†. The diagonal orange line shows the estimated 90%-quantiles F̂_{Y|x_t†}^{-1}(90%), and the cyan lines give spline fits to the estimated lower and upper ES. It is clearly visible that these respect the ordering

ÊS_{90%}⁻(Y|x_t†) ≤ F̂_{Y|x_t†}^{-1}(90%) ≤ ÊS_{90%}⁺(Y|x_t†),

for fixed features x_t† ∈ X.


The deep quantile regression has been back-tested using the coverage ratios (11.22). Back-testing the ES is more difficult: the standalone ES is not elicitable, and the ES can only be back-tested jointly with the corresponding quantile. The part of the joint identification function that corresponds to the ES is given by, see (4.2)–(4.3) in Fissler et al. [130],

v̂⁻ = (1/T) ∑_{t=1}^T [ ÊS_τ⁻(Y|x_t†) − ( Y_t† 1{Y_t† ≤ F̂_{Y|x_t†}^{-1}(τ)} + F̂_{Y|x_t†}^{-1}(τ) ( τ − 1{Y_t† ≤ F̂_{Y|x_t†}^{-1}(τ)} ) ) / τ ],     (11.33)

and

v̂⁺ = (1/T) ∑_{t=1}^T [ ÊS_τ⁺(Y|x_t†) − ( Y_t† 1{Y_t† > F̂_{Y|x_t†}^{-1}(τ)} + F̂_{Y|x_t†}^{-1}(τ) ( 1{Y_t† ≤ F̂_{Y|x_t†}^{-1}(τ)} − τ ) ) / (1 − τ) ].     (11.34)

These (empirical) identifications should be close to zero if the model fits the data.
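In R, these empirical identifications can be computed as follows, with `q_hat`, `es_low`, `es_up` denoting the fitted τ-quantile and the lower/upper ES on the test features, and `Y_test` the observations (all placeholder names):

tau <- 0.9
ind <- as.numeric(Y_test <= q_hat)    # indicator of (11.33)-(11.34)
v_minus <- mean(es_low - (Y_test * ind + q_hat * (tau - ind)) / tau)
v_plus  <- mean(es_up  - (Y_test * (1 - ind) + q_hat * (ind - tau)) / (1 - tau))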
Remark that the latter terms in (11.33)–(11.34) describe the lower and upper ES also in the case of non-continuous distribution functions because we have the identity

ES_τ⁻(Y|x) = (1/τ) ( E[ Y 1{Y ≤ F_{Y|x}^{-1}(τ)} | x ] + F_{Y|x}^{-1}(τ) ( τ − F_{Y|x}(F_{Y|x}^{-1}(τ)) ) ),     (11.35)

the second term being zero for a continuous distribution F_{Y|x}, but it is needed for non-continuous distribution functions.
We compare the deep composite regression results of this section to the deep gamma and inverse Gaussian models using a double FN network for dispersion modeling, see Sect. 11.1.3. This requires calculating the ES in the gamma and the inverse Gaussian models. This can be done within the EDF, see Landsman–Valdez [233]. The upper ES in the gamma model Y ∼ Γ(α, β) is given by, see (6.47),

E[ Y | Y > F_Y^{-1}(τ) ] = (α/β) ( 1 − G(α + 1, β F_Y^{-1}(τ)) ) / (1 − τ),

where G is the scaled incomplete gamma function (6.48) and F_Y^{-1}(τ) is the τ-quantile of Γ(α, β).
Example 4.3 of Landsman–Valdez [233] gives the inverse Gaussian case (2.8) with α, β > 0

E[ Y | Y > F_Y^{-1}(τ) ] = (α/β) [ 1 + (1/(1−τ)) √(F_Y^{-1}(τ)/α) φ(z_τ^{(1)}) ]
    + (α/β) (1/(1−τ)) e^{2αβ} [ 2 Φ(−z_τ^{(2)}) − √(F_Y^{-1}(τ)/α) φ(−z_τ^{(2)}) ],

where φ and Φ are the standard Gaussian density and distribution, respectively, F_Y^{-1}(τ) is the τ-quantile of the inverse Gaussian distribution and

z_τ^{(1)} = ( α / √(F_Y^{-1}(τ)) ) ( F_Y^{-1}(τ) / (α/β) − 1 )   and   z_τ^{(2)} = ( α / √(F_Y^{-1}(τ)) ) ( F_Y^{-1}(τ) / (α/β) + 1 ).
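Both formulas are straightforward to implement; the following is a minimal sketch, using the scaled incomplete gamma function via pgamma, and assuming that the τ-quantile q of the inverse Gaussian has been computed numerically (e.g., with statmod::qinvgauss):

CTE_gamma <- function(tau, alpha, beta){
  q <- qgamma(tau, shape = alpha, rate = beta)
  (alpha/beta) * (1 - pgamma(beta * q, shape = alpha + 1)) / (1 - tau)}
#
CTE_invgauss <- function(tau, alpha, beta, q){    # q = tau-quantile of the IG
  z1 <- alpha/sqrt(q) * (q/(alpha/beta) - 1)
  z2 <- alpha/sqrt(q) * (q/(alpha/beta) + 1)
  # note: exp(2*alpha*beta) may overflow for large alpha*beta
  (alpha/beta) * (1 + (sqrt(q/alpha) * dnorm(z1) +
     exp(2*alpha*beta) * (2*pnorm(-z2) - sqrt(q/alpha) * dnorm(-z2))) / (1 - tau))}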

This now allows us to calculate the identifications (11.33)–(11.34) in the fitted deep double networks using the gamma and the inverse Gaussian distributions of Sect. 11.1.3.
Table 11.8 shows the out-of-sample coverage ratios and the identifications of the deep composite regression and the two distributional approaches. These figures suggest that the gamma model is not competitive; the deep composite model has the most precise coverage ratio. In terms of the ES identification terms, the deep
Table 11.8 Out-of-sample coverage ratios τ̂_τ and identifications v̂⁻ and v̂⁺ of the deep composite regression model and the deep double networks in the gamma and inverse Gaussian cases

model                                  | coverage ratio (τ = 90%) | lower ES identification v̂⁻ | upper ES identification v̂⁺
Deep composite model                   | 90.12%                   | 32.9                        | −143.5
Deep double network gamma              | 93.51%                   | 356.6                       | −2'409.0
Deep double network inverse Gaussian   | 92.56%                   | −13.0                       | 115.1

Fig. 11.11 Comparison of the estimated means from the deep double inverse Gaussian model (x-axis) and the deep composite model (11.25) (y-axis)
composite model and the double network with inverse Gaussian claim sizes are comparably accurate (out-of-sample) in determining the lower and upper 90% ES.
Finally, we paste the lower and upper ES from the deep composite regression model according to (11.25). This gives us an estimated mean (under a continuous distribution function)

μ̂(x) = Ê[Y|x] = τ ÊS_τ⁻(Y|x) + (1 − τ) ÊS_τ⁺(Y|x).

Figure 11.11 compares these estimates of the deep composite regression model to the deep double inverse Gaussian model estimates. The black dots show 2'000 randomly selected claims x_t†, and the cyan line gives a spline fit to all out-of-sample claims in T. The body of the estimates is rather similar in both approaches, but the deep composite approach provides more large estimates; the dotted orange lines show the maximum estimate from the deep double inverse Gaussian model.
We conclude that in the case where no member of the EDF reflects the properties of the data in the tail, the deep composite regression approach presented in this section provides an alternative method for mean estimation that allows for separate models in the main body and the tail of the data. Fixing the quantile level allows for a straightforward fitting in one step; this is in contrast to the composite models where we fix the splicing point. The latter approaches are more difficult to fit, e.g., using the EM algorithm.
11.4 Model Uncertainty: A Bootstrap Approach

As described in Sect. 4, there are different sources of prediction uncertainty when forecasting random variables. There is the irreducible risk that comes from the fact
that we try to predict random variables. This source of uncertainty is always present,
even if we know the true data generating mechanism, i.e., it is irreducible. In most
applied situations we do not know the true data generating mechanism which results
in additional prediction uncertainty. Within GLMs this source of uncertainty has
mainly been allocated to parameter estimation uncertainty deriving from the fact that
we estimate the parameters from a finite sample, we refer to Sects. 3.4 and 11.1.4
on asymptotic results. In network modeling, the situation is more complicated.
Firstly, we have seen that there is no best network regression model even if the
architecture and the hyper-parameters are fully specified. In Fig. 7.18 we have seen
that in a claim frequency context the different solutions from an early stopped SGD
fitting can have a coefficient of variation of up to 40% on the individual policy
level; on average these coefficients of variation were around 10%. This has led to
the consideration of network ensembling and the nagging predictor in Sect. 7.4.4.
These considerations have been based on a fixed learning data set L. In this section,
we assume that also the learning data set L may look different by considering
different realizations of the (randomly generated) observations Yi . To reflect this
source of randomness in outcomes we bootstrap new data from L by exploring
a non-parametric bootstrap with random drawings with replacement from L, see
Sect. 4.3.1. This will allow us to study the volatility implied in estimation by
considering a different set of observations, i.e., a different sample.
Ideally we would like to generate new observations from the true data generating
mechanism, but, since this mechanism is not known, we can at best generate data
from an estimated model. If we rely on a distributional model, we may suffer from
model error, e.g., in Sect. 11.3 we have seen that it is rather difficult to specify a
distributional regression model that has the right tail behavior. Therefore, we may
give preference to a distribution-free approach. Non-parametric bootstrapping is
such a distribution-free approach, the disadvantage being that we cannot enrich the
existing observations by new observations, but we can only rearrange the available
observations.
We revisit the robust representation learning approach of Sect. 11.1.2 on the
same Swiss accident insurance data as explored in that section. In particular,
we reconsider the deep multi-output models introduced in (11.6) and studied in
Table 11.3 for power variance parameters p = 2, 2.5, 3 (and constant dispersion
parameter). We perform exactly the same analysis; here, however, we consider bootstrapped data L∗ for model fitting.
First, we fit 100 times the same deep FN network architecture as in (11.6)
with different seeds (on identical learning data L). From this we calculate the
nagging predictor. Second, we generate 100 different bootstrap samples L∗ =
L∗(s) , 1 ≤ s ≤ 100, from L (having an identical sample size) with random
drawings with replacement, and we fit the same network architecture to these 100
Table 11.9 Out-of-sample losses (gamma loss, power variance case p = 2.5 loss (in 10⁻²) and inverse Gaussian (IG) loss (in 10⁻³)) and average claim amounts; the losses use unit dispersion ϕ = 1

model | d_{p=2} | d_{p=2.5} | d_{p=3} | average claim
Null model 4.6979 10.2420 4.6931 1’774
Gamma multi-output of Table 11.3 2.0581 7.6422 3.9146 1’745
p = 2.5 multi-output of Table 11.3 2.0576 7.6407 3.9139 1’732
IG multi-output of Table 11.3 2.0576 7.6401 3.9134 1’705
Gamma multi-output: nagging 100 2.0280 7.5582 3.8864 1’752
p = 2.5 multi-output: nagging 100 2.0282 7.5586 3.8865 1’739
IG multi-output: nagging 100 2.0286 7.5592 3.8865 1’711
Gamma multi-output: bootstrap 100 2.0189 7.5301 3.8745 1’803
p = 2.5 multi-output: bootstrap 100 2.0191 7.5305 3.8746 1’790
IG multi-output: bootstrap 100 2.0194 7.5309 3.8746 1’756

bootstrap samples. We then also average over these 100 predictors obtained from
the different bootstrap samples. Table 11.9 provides the resulting out-of-sample
deviance losses on the test data T . We always hold on to the same test data T
which is disjoint/independent from the learning data L and the bootstrap samples
L∗ = L∗(s) , 1 ≤ s ≤ 100.
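The non-parametric bootstrap step can be sketched as follows; `learn` and `test` denote the learning and test data, and `fit_network()`/`predict_network()` are hypothetical wrappers around the network fitting and prediction of Sect. 11.1.2:

pred_boot <- matrix(NA, nrow(test), 100)
for (s in 1:100){
  idx <- sample(1:nrow(learn), nrow(learn), replace = TRUE)  # bootstrap sample L*(s)
  model_s <- fit_network(learn[idx, ], seed = s)             # refit on L*(s)
  pred_boot[, s] <- predict_network(model_s, test)
  }
pred_avg <- rowMeans(pred_boot)   # average bootstrap predictor of Table 11.9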
The nagging predictors over 100 seeds are roughly the same as over 20 seeds (see Table 11.3), which indicates that 20 different network fits suffice here. Interestingly, the average bootstrapped version generally improves the nagging predictors. Thus, here the average bootstrap predictor provides a better balance among the observations to receive superior predictive power on the test data T, compare lines 'nagging 100' vs. 'bootstrap 100' of Table 11.9.
The main purpose of this analysis is to understand the volatility involved in nagging and bootstrap predictors. We therefore consider the coefficients of variation Vco_t introduced in (7.43) on individual policies 1 ≤ t ≤ T. Figure 11.12 shows these coefficients of variation on the individual predictors, i.e., for the individual claims x_t† and the individual network calibrations with different seeds. The left-hand side gives the coefficients of variation based on 100 bootstrap samples, the right-hand side gives the coefficients of variation of 100 predictors fitted on the same data L but with different seeds for the SGD algorithm; the y-scale is identical in both plots. We observe that the coefficients of variation are clearly higher under the bootstrap approach compared to holding on to the same data L for SGD fitting with different seeds. Thus, the nagging predictor averages over the randomness of different seeds in the network calibrations, whereas bootstrapping additionally considers possible different samples L∗ for model learning. We analyze the difference in magnitudes in more detail.
Figure 11.13 compares the two coefficients of variation for different claim sizes. The
average coefficient of variation for fixed observations L is 15.9% (cyan columns).
This average coefficient of variation is increased to 24.8% under bootstrapping
Fig. 11.12 Coefficients of variation in the individual estimators: (lhs) bootstrap 100, and (rhs) nagging 100; the y-scale is identical in both plots (coefficients of variation against estimated claim size, showing individual policies, the average, 1 std.dev. and a cubic spline)

Fig. 11.13 Coefficients of variation in the individual predictors of the bootstrap and the nagging approaches (ordered w.r.t. estimated claim sizes), together with the average relative increase (right axis)

(orange columns). The blue line shows the average relative increase for the different
claim sizes (right axis), and the blue dotted line is at a relative increase of 40%. From
Fig. 11.13 we observe that this spread (relative increase) is rather constant across all
claim predictions; we remark that 93.5% of all claim predictions are below 5’000.
Thus, most claims are at the left end of Fig. 11.13.
From this small analysis we conclude that there is substantial model and estimation uncertainty involved; recall that we fit the deep network architecture to 305'550 individual claims having 7 feature components, which is a comparably large portfolio. On average, we have a coefficient of variation of 15% implied by SGD
11.5 LocalGLMnet: An Interpretable Network Architecture 495

fitting with different seeds, and this coefficient of variation increases to roughly 25% when additionally bootstrapping the observations. This is considerable, and
it requires that we ensemble these predictors to receive more robust predictions.
The results of Table 11.9 support this re-sampling and ensembling approach as we
receive a better out-of-sample performance.

11.5 LocalGLMnet: An Interpretable Network Architecture

Network architectures are often criticized for not being (sufficiently) explainable. Of course, this is not fully true as we have gained a lot of insight about the data examples studied in this book. This criticism of non-explainability has led to the development of the post-hoc model-agnostic tools studied in Sect. 7.6. This approach has been questioned in many places, and it is not clear whether one should try to explain black box models, or whether one should rather try to make the models interpretable in the first place, see, e.g., Rudin [322]. In this section we take this different approach by working with a network architecture that is (more) interpretable. We present the LocalGLMnet proposal of Richman–Wüthrich [317, 318]. This approach allows for interpreting the results, and it allows for variable selection either using an empirical Wald test or LASSO regularization.
There are several other proposals that try to achieve similar explainability in specific network architectures. There is the explainable neural network of Vaughan et al. [367] and the neural additive model of Agarwal et al. [3]. These proposals rely on parallel networks considering one single variable at a time. Of course, this limits their performance because of a missing interaction potential. This has been improved in the Combined Actuarial eXplainable Neural Network (CAXNN) approach of Richman [314], which requires a manual specification of parallel networks for potential interactions. The LocalGLMnet, proposed in this section, does not require any manual engineering, and it still possesses the universal approximation property.

11.5.1 Definition of the LocalGLMnet

Starting point of the LocalGLMnet is a classical GLM. Choose a strictly monotone and smooth link function g. A GLM is received by considering the regression function

x → g(μ(x)) = β_0 + ⟨β, x⟩ = β_0 + ∑_{j=1}^q β_j x_j,     (11.36)
for features x ∈ X ⊂ R^q, intercept β_0 ∈ R and regression parameter β ∈ R^q. Compared to (5.5) we change the notation in this section by excluding the intercept component from the feature x = (x_1, ..., x_q)^⊤, because this will be more convenient for the LocalGLMnet proposal. The beauty of this GLM regression function is that we obtain a linear function after applying the link function g. This linear function is considered to be explainable as we can precisely quantify how much the expected response will change by slightly changing one of the feature components x_j. In particular, this holds true for the log-link which leads to a multiplicative structure in the expected response.
The idea is to hold on to this additive structure (11.36) as far as possible, still trying to benefit from the universal approximation property of network architectures. Richman–Wüthrich [317] propose the following regression structure.

Definition 11.12 (LocalGLMnet) Choose a FN network architecture z^{(d:1)}: R^q → R^q of depth d ∈ N with equal input and output dimensions to model the regression attention

β: R^q → R^q,   x → β(x) = z^{(d:1)}(x) = (z^{(d)} ∘ · · · ∘ z^{(1)})(x).

The LocalGLMnet is defined by the generalized additive decomposition

x → g(μ(x)) = β_0 + ⟨β(x), x⟩ = β_0 + ∑_{j=1}^q β_j(x) x_j,

for a strictly monotone and smooth link function g.

This architecture is called LocalGLMnet because locally, around a given feature value x, it can be understood as a GLM, provided that β(x) does not change too much in a neighborhood of x. In the GLM context β is called regression parameter, and in the LocalGLMnet context β(x) is called regression attention because the components β_j(x) determine how much attention is given to a specific value x_j. We highlight this in the following discussion. Select one component 1 ≤ j ≤ q and study the individual term

$$x \mapsto \beta_j(x)\, x_j. \qquad (11.37)$$

(1) If β_j(x) ≡ 0, we should drop the term β_j(x)x_j from the regression function.
(2) If β_j(x) ≡ β_j (≠ 0) is not feature dependent (and different from zero), we receive a GLM term in x_j with regression parameter β_j.
(3) The property β_j(x) = β_j(x_j) implies that we have a term β_j(x_j)x_j that does not interact with any other term x_{j'}, j' ≠ j.
(4) Sensitivities of β_j(x) in the components of x can be obtained from the gradient

$$\nabla_x \beta_j(x) = \left( \frac{\partial}{\partial x_1}\beta_j(x), \ldots, \frac{\partial}{\partial x_q}\beta_j(x) \right)^\top \in \mathbb{R}^q. \qquad (11.38)$$

The j-th component of ∇_x β_j(x) determines the (non-)linearity in term x_j; the components different from j describe the interactions of term x_j with the other components.
(5) These interpretations need some care because we do not have identifiability. For the special regression attention β_j(x) = x_{j'}/x_j we have

$$\beta_j(x)\, x_j = x_{j'}. \qquad (11.39)$$

Therefore, we talk about terms in items (1)–(4); e.g., item (1) means that the term β_j(x)x_j can be dropped, however, the feature component x_j may still play a significant role in some of the regression attentions β_{j'}(x), j' ≠ j. In practical applications we have not experienced the identifiability issue (11.39). Since the LocalGLMnet regression structure already contains the linear terms, and since the SGD fitting is initialized in the GLM, the regression function is quite strongly pre-determined, and the LocalGLMnet is built around this initialization, hardly falling into a completely different model (11.39).
(6) The LocalGLMnet architecture has the universal approximation property discussed in Sect. 7.2.2, because networks can approximate any continuous function arbitrarily well on a compact support for sufficiently large networks. We can then select one component, say, x₁ and let β₁(x) = z₁^{(d:1)}(x) approximate a given continuous function f(x)/x₁, i.e., f(x) ≈ β₁(x)x₁ arbitrarily well on the compact support.
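To make this decomposition concrete, the following toy sketch (plain R; the hand-crafted attention function is purely illustrative and stands in for a fitted network z^{(d:1)} with q = 3) evaluates a LocalGLMnet predictor under the log-link and produces exactly the term types discussed in items (2)–(4) above.

# Toy LocalGLMnet forward pass (illustrative only, q = 3).
beta.attention <- function(x) {
  c(0.3,             # constant attention: a plain GLM term in x1, cf. item (2)
    0.1 * x[2],      # attention in x2 only: a non-linear term in x2, cf. item (3)
    0.2 * x[1])      # attention in x1: an interaction term x1 * x3, cf. item (4)
}
localglm.mu <- function(x, beta0 = -2) {
  exp(beta0 + sum(beta.attention(x) * x))    # log-link: g = log
}
localglm.mu(c(0.5, -1.2, 0.8))               # expected response of one policy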

11.5.2 Variable Selection in LocalGLMnets

The LocalGLMnet allows for variable selection through the regression attentions β_j(x). Roughly speaking, if the estimated regression attention β̂_j(x) ≈ 0, then the term β_j(x)x_j can be dropped. We can also explore whether the entire variable x_j should be dropped (not only the corresponding term β_j(x)x_j). For this, we have to refit the LocalGLMnet excluding the feature component x_j. If the out-of-sample performance on validation data does not change, then x_j also does not play an important role in any other regression attention β_{j'}(x), j' ≠ j, and it should be completely dropped from the model.
In GLMs we can either use the Wald test or the LRT to test a null hypothesis H₀ : β_j = 0, see Sect. 5.3. We explore a similar idea in this section, however, empirically.

We therefore first need to ensure that all feature components live on the same scale. We consider standardization with the empirical mean and the empirical standard deviation, see (7.30), and from now on we assume that all feature components are centered and have unit variance. Then, the main problem is to determine whether an estimated regression attention β̂_j(x) is significantly different from 0 or not.
We therefore extend the features x⁺ = (x₁, . . . , x_q, x_{q+1})ᵀ ∈ ℝ^{q+1} by an additional independent and purely random component x_{q+1} that is also standardized. Since this additional component is independent of all other components, it cannot have any predictive power for the response under consideration; thus, fitting this extended model should result in a regression attention β̂_{q+1}(x⁺) ≈ 0. The estimate will not be exactly zero, because there is noise involved, and the magnitude of this fluctuation will determine the rejection/acceptance region of the null hypothesis of not being significant.
We fit the LocalGLMnet to the learning data L with features x_i⁺ ∈ ℝ^{q+1} extended by the standardized i.i.d. component x_{i,q+1} being independent of (Y_i, x_i). This gives us the estimated regression attentions β̂₁(x_i⁺), . . . , β̂_q(x_i⁺), β̂_{q+1}(x_i⁺). We compute the empirical mean and standard deviation of the attention weights of the additional component x_{q+1}

$$\bar{b}_{q+1} = \frac{1}{n}\sum_{i=1}^{n} \widehat{\beta}_{q+1}(x_i^+) \qquad \text{and} \qquad \widehat{s}_{q+1} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(\widehat{\beta}_{q+1}(x_i^+) - \bar{b}_{q+1}\right)^2}. \qquad (11.40)$$

We expect approximate centering b̄_{q+1} ≈ 0 because this additional component x_{q+1} does not enter the true regression function, and the empirical standard deviation ŝ_{q+1} quantifies the expected fluctuation around zero of insignificant components. We can now test the null hypothesis H₀ : β_j(x) = 0 of component j on significance level α ∈ (0, 1/2). We define the centered interval

$$I_\alpha = \left[\Phi^{-1}(\alpha/2)\cdot \widehat{s}_{q+1},\; \Phi^{-1}(1-\alpha/2)\cdot \widehat{s}_{q+1}\right], \qquad (11.41)$$

where Φ⁻¹(p) denotes the standard Gaussian quantile for p ∈ (0, 1). H₀ should be rejected if the coverage ratio of this centered interval I_α is substantially smaller than 1 − α, i.e.,

$$\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{\{\widehat{\beta}_j(x_i^+) \in I_\alpha\}} < 1 - \alpha.$$

This proposal is designed for continuous feature components, and categorical variables are discussed in Sect. 11.5.4, below. For x_{q+1} we can choose a standard Gaussian distribution, a normalized uniform distribution, or we can randomly permute one of the feature components x_{i,j} across the entire portfolio 1 ≤ i ≤ n. Usually, the resulting empirical standard deviations ŝ_{q+1} are rather similar.
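A minimal sketch of this empirical Wald test in base R is as follows; here `beta` denotes the assumed n × (q+1) matrix of fitted attentions β̂_j(x_i⁺) extracted from the network (see Listing 11.8 in the next section), with the last column belonging to the random component, and `q` the number of original feature components.

# Empirical Wald test sketch following (11.40)-(11.41).
alpha <- 0.001
b.bar <- mean(beta[, q + 1])                    # empirical mean (11.40), ~ 0
s.hat <- sd(beta[, q + 1])                      # empirical standard deviation
I.alpha <- qnorm(c(alpha / 2, 1 - alpha / 2)) * s.hat   # interval (11.41)
# coverage ratios: reject H0 for component j if clearly below 1 - alpha
coverage <- sapply(1:q, function(j)
  mean(beta[, j] >= I.alpha[1] & beta[, j] <= I.alpha[2]))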

11.5.3 Lab: LocalGLMnet for Claim Frequency Modeling

We revisit the French MTPL data example. We compare the LocalGLMnet approach to the deep FN network considered in Sect. 7.3.2, and we benchmark with the results of Table 7.3; we benchmark with the crudest FN network from above because, at the current stage, we need one-hot encoding for the LocalGLMnet approach. The analysis in this section is the same as in Richman–Wüthrich [317].
The French MTPL data has 6 continuous feature components (we treat Area as a continuous variable), 1 binary component and 2 categorical components. We pre-process the continuous and binary variables to zero mean and unit variance using standardization (7.30). This will allow us to do variable selection as presented in (11.41). The categorical variables with more than two levels are more difficult. In a first attempt we use one-hot encoding for the categorical variables. We prefer one-hot encoding over dummy coding because this ensures that for all levels there is a component x_j with x_j ≠ 0. This is important because the terms β_j(x)x_j are equal to zero for the reference level in dummy coding (since x_j = 0). This does not allow us to study interactions with other variables for the term corresponding to the reference level. Remark that one-hot encoding and dummy coding do not lead to centered and unit-variance components.
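A possible pre-processing sketch under these conventions is given next (base R; `dat` is an assumed data frame carrying the French MTPL column names and types, so all names here are assumptions of this sketch).

# Feature pre-processing sketch: standardization (7.30) for the continuous
# and binary components, one-hot encoding for the two categorical ones.
Xcont <- scale(cbind(Area = as.integer(dat$Area),
                     dat[, c("VehPower", "VehAge", "DrivAge", "BonusMalus")],
                     logDensity = log(dat$Density)))
Xbin   <- scale(as.integer(dat$VehGas == "Regular"))
Xbrand <- model.matrix(~ VehBrand - 1, data = dat)   # one column per level
Xreg   <- model.matrix(~ Region - 1, data = dat)     # one column per level
Xlearn <- cbind(Xcont, Xbin, Xbrand, Xreg)           # dimension q = 40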
This feature pre-processing gives us a feature vector x ∈ ℝ^q of dimension q = 40. For variable selection of the continuous and binary components we extend the feature x by two additional independent components x_{q+1} and x_{q+2}. We select two components to explore whether the particular distributional choice has some influence on the choice of the acceptance/rejection interval I_α in (11.41). We choose for policies 1 ≤ i ≤ n

$$x_{i,q+1} \overset{\text{i.i.d.}}{\sim} \text{Uniform}\left(-\sqrt{3}, \sqrt{3}\right) \qquad \text{and} \qquad x_{i,q+2} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1),$$

these two sets of variables being mutually independent, and being independent from all other variables. We define the extended features x_i⁺ = (x_{i,1}, . . . , x_{i,q}, x_{i,q+1}, x_{i,q+2})ᵀ ∈ ℝ^{q₀} with q₀ = q + 2, and we consider the LocalGLMnet regression function

$$x^+ \mapsto \log\left(\mu(x^+)\right) = \beta_0 + \sum_{j=1}^{q_0} \beta_j(x^+)\, x_j.$$

We choose the log-link for Poisson claim frequency modeling. The time exposure v > 0 can either be integrated as a weight of the EDF or as an offset on the canonical scale, resulting in the same Poisson model, see Sect. 5.2.3.

Listing 11.7 LocalGLMnet architecture

1 Design = layer_input(shape = c(42), dtype = 'float32', name = 'Design')
2 Vol = layer_input(shape = c(1), dtype = 'float32', name = 'Vol')
3 #
4 Attention = Design %>%
5 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
6 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
7 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
8 layer_dense(units=42, activation='linear', name='Attention')
9 #
10 LocalGLM = list(Design, Attention) %>% layer_dot(name='LocalGLM', axes=1) %>%
11 layer_dense(units=1, activation='exponential', name='Balance')
12 #
13 Response = list(LocalGLM, Vol) %>% layer_multiply(name='Multiply')
14 #
15 keras_model(inputs = c(Design, Vol), outputs = c(Response))

We are now ready to define the LocalGLMnet architecture. We choose a network z^{(d:1)} : ℝ^{q₀} → ℝ^{q₀} of depth d = 4 with (q₁, q₂, q₃, q₄) = (20, 15, 10, 42) neurons. The R code is given in Listing 11.7. We note that this is not much more involved than a plain-vanilla FN network. Slightly special in this implementation is the integration of the intercept β₀ on line 11. Naturally, we would like to add this intercept, however, there is no simple code for doing this. For that reason, we model the additive decomposition by

$$x^+ \mapsto \log\left(\mu(x^+)\right) = \alpha_0 + \alpha_1 \sum_{j=1}^{q_0} \beta_j(x^+)\, x_j,$$

with real-valued parameters α₀ and α₁ being estimated on line 11 of Listing 11.7. Thus, in this implementation the regression attentions are obtained by α₁β_j(x⁺). Of course, there are also other ways of implementing this. This LocalGLMnet architecture has 1'799 network weights to be fitted.
We fit this LocalGLMnet using a training to validation data split of 8 : 2 and a batch size of 5'000. We initialize the gradient descent algorithm such that we exactly start in the GLM with β_j(x⁺) ≡ β̂_j^MLE. For this we set all weights in the last layer on line 8 of Listing 11.7 to zero, w_{l,j}^{(d)} = 0, and the corresponding intercepts to the MLEs of the GLM, i.e., w_{0,j}^{(d)} = β̂_j^MLE. This gives us the GLM initialization Σ_{j=1}^{q₀} β̂_j^MLE x_j on line 10 of Listing 11.7. Moreover, on line 11 of that listing, we initialize α₁ = 1 and α₀ = β̂₀^MLE. This implies that the gradient descent algorithm starts in the MLE estimated GLM; a code sketch of this initialization is given below. The SGD fitting turns out to be faster than in the plain-vanilla FN case, probably because we start in the GLM having already the reasonable linear terms x_j in the model, and we only need to find the regression attentions β_j(x⁺) around these linear terms. The results are presented on the second last line of Table 11.10. The out-of-sample results are slightly worse than in the plain-vanilla FN case. There are many reasons for that, for instance, many levels in one-hot encoding may lead to more potential for over-fitting, and hence to an earlier stopping here; the same applies if we add too many purely random components x_{q+l}, l ≥ 1.
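The initialization itself may be sketched as follows (keras R; `beta.MLE` and `beta0.MLE` are the assumed fitted GLM coefficients, and the layer names refer to Listing 11.7).

# Sketch of the GLM initialization of the LocalGLMnet of Listing 11.7.
q0 <- 42
set_weights(get_layer(model, 'Attention'),
            list(matrix(0, nrow = 10, ncol = q0),   # kernel w^(d) = 0
                 array(beta.MLE, dim = q0)))        # intercepts = GLM MLEs
set_weights(get_layer(model, 'Balance'),
            list(matrix(1, nrow = 1, ncol = 1),     # alpha_1 = 1
                 array(beta0.MLE, dim = 1)))        # alpha_0 = GLM intercept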

Table 11.10 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10⁻²) and in-sample average frequency of the Poisson regressions, see also Table 7.3

                                        Run    #       In-sample   Out-of-sample  Aver.
                                        time   param.  loss on L   loss on T      freq.
Poisson null                            –      1       25.213      25.445         7.36%
Poisson GLM3                            15s    50      24.084      24.102         7.36%
One-hot FN (q1,q2,q3) = (20,15,10)      51s    1'306   23.757      23.885         6.96%
LocalGLMnet on x⁺                       20s    1'799   23.728      23.945         7.46%
LocalGLMnet on x⁺ bias regularized      –      –       23.727      23.943         7.36%

Since the balance property will not hold, in general, we apply the bias regularization step (7.33) to adjust α₀ and α₁; the results are presented on the last line of Table 11.10. In Remark 3.1 of Richman–Wüthrich [317] a more sophisticated balance property correction is presented. Our goal now is to analyze this solution.

Listing 11.8 Extracting the regression attentions from the LocalGLMnet architecture
1 zz <- keras_model(inputs=model$input,
2 outputs=get_layer(model, 'Attention')$output)
3 beta <- data.frame(zz %>% predict(list(Xlearn, Vlearn)))
4 alpha1 <- as.numeric(get_weights(model)[[9]])
5 beta <- beta * alpha1

We start by analyzing the two additional components x_{i,q+1} and x_{i,q+2} being uniformly and Gaussian distributed, respectively. Listing 11.8 shows how to extract the estimated regression attentions β̂(x_i⁺). We calculate the means and standard deviations of the estimated regression attentions of the two additional components

$$\bar{b}_{q+1} = 0.0042 \quad \text{and} \quad \bar{b}_{q+2} = 0.0213,$$

and

$$\widehat{s}_{q+1} = 0.0516 \quad \text{and} \quad \widehat{s}_{q+2} = 0.0482.$$

From these numbers we see that the regression attentions β̂_{q+2}(x_i) are slightly biased, whereas β̂_{q+1}(x_i) are fairly centered compared to the magnitudes of the standard deviations. If we select a significance level of α = 0.1%, we receive a two-sided standard normal quantile of |Φ⁻¹(α/2)| = 3.29. This provides us for interval (11.41) with

$$I_\alpha = \left[\Phi^{-1}(\alpha/2)\cdot \widehat{s}_{q+1},\; \Phi^{-1}(1-\alpha/2)\cdot \widehat{s}_{q+1}\right] = [-0.17, 0.17].$$

Fig. 11.14 Estimated regression attentions β̂_j(x_i⁺) of the continuous and binary feature components Area, BonusMalus, log-Density, DrivAge, VehAge, VehGas, VehPower and the two random features x_{i,q+1} and x_{i,q+2} of 2'000 randomly selected policies x_i⁺; the orange area shows the interval I_α for dropping term β_j(x)x_j on significance level α = 0.1%

Figure 11.14 shows the estimated regression attentions β̂_j(x_i⁺) of the continuous and binary feature components for 2'000 randomly selected policies x_i⁺, and the orange area shows the acceptance region I_α on significance level α = 0.1%. Focusing on the figures of the two additional variables x_{i,q+1} and x_{i,q+2}, Fig. 11.14 (bottom, middle and right), we observe that the estimated regression attentions are mostly within the confidence bounds of I_α. This says that we should drop these two terms (of course, this is clear since we have set the bounds according to these regression attentions). Focusing on the other variables, we question the inclusion of the term VehPower as it seems concentrated within I_α, and hence we cannot reject the null hypothesis H₀ : β_VehPower(x) = 0. Moreover, the inclusion of the term Area needs further exploration.

Table 11.11 Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10⁻²) and in-sample average frequency of the Poisson regressions, see also Table 7.3

                                        Run    #       In-sample   Out-of-sample  Aver.
                                        time   param.  loss on L   loss on T      freq.
Poisson null                            –      1       25.213      25.445         7.36%
Poisson GLM3                            15s    50      24.084      24.102         7.36%
One-hot FN (q1,q2,q3) = (20,15,10)      51s    1'306   23.757      23.885         6.96%
LocalGLMnet on x⁺                       20s    1'799   23.728      23.945         7.46%
LocalGLMnet on x⁺ bias regularized      –      –       23.727      23.943         7.36%
LocalGLMnet on x⁻                       20s    1'675   23.715      23.912         7.30%
LocalGLMnet on x⁻ bias regularized      –      –       23.714      23.911         7.36%

We remind that dropping a term β_j(x)x_j does not necessarily imply that we have to completely drop x_j because it may still play an important role in one of the other regression attentions β_{j'}(x), j' ≠ j. Therefore, we re-run the whole fitting procedure, but we drop the purely random feature components x_{i,q+1} and x_{i,q+2}, and we also drop VehPower and Area to see whether we receive a model with a similar predictive power. This would then imply that we can drop these variables, in the sense of variable selection, similar to the LRT and the Wald test of Sect. 5.3. We denote the feature where we drop these components by x⁻ ∈ ℝ^{q−2}.
We re-fit the LocalGLMnet on the reduced features x_i⁻, and the results are presented in Table 11.11. We observe that the loss figures decrease. Indeed, this supports the null hypothesis of dropping VehPower and Area. The reason for being able to drop VehPower is that it does not contribute (sufficiently) to explaining the systematic effects in the responses. The reason for being able to drop Area is slightly different: we have seen that Area and log-Density are highly correlated, see Fig. 13.12 (rhs), and it turns out that it is sufficient to only keep the Density variable (on the log-scale) in the model.
In a next step, we should analyze the robustness of these results by exploring the nagging predictor and/or bootstrapping as described in Sect. 11.4. We refrain from doing so, but we illustrate the LocalGLMnet solution of Table 11.11 in more detail. Figure 11.15 shows the feature contributions β̂_j(x_i⁻)x_{i,j} of 2'000 randomly selected policies on the significant continuous and binary feature components. The magenta line gives a spline fit, and the more the black dots spread around these splines, the more interactions we have; for instance, higher bonus-malus levels interact with the age of the driver, which explains the scattering of the black dots. On average, frequencies are increasing in bonus-malus levels and density, decreasing in vehicle age, and for the driver's age variable it is important to understand the interactions. We observe that the spline fit for the log-Density is close to a linear function; this reflects that the regression attentions β̂_Density(x_i) in Fig. 11.14 (top-right) are more or less constant. This is also confirmed by the marginal plot in Fig. 5.4 (bottom-rhs) which has motivated the choice of a linear term for the log-Density in model Poisson GLM1 of Table 5.3.
Fig. 11.15 Estimated feature contributions β̂_j(x_i⁻)x_{i,j} of the significant continuous and binary components BonusMalus, log-Density, DrivAge, VehAge and VehGas of 2'000 randomly selected policies x_i⁻; the magenta line gives a spline fit

Fig. 11.16 Importance measures IM_j of the continuous and binary variables

Using the regression attentions we define an importance measure. We consider the extended features x⁺ in the following numerical analysis. We set

$$\mathrm{IM}_j = \frac{1}{n}\sum_{i=1}^{n} \left|\widehat{\beta}_j(x_i^+)\right|,$$

for 1 ≤ j ≤ q + 2, and where we aggregate over all policies 1 ≤ i ≤ n. Figure 11.16 shows the importance measures IM_j of the continuous and binary variables j. The bars are ordered w.r.t. these importance measures. The graph confirms our previous conclusion: the least important variables are the two additional purely random components x_{i,q+1} and x_{i,q+2}, followed by Area and VehPower. These are exactly the components that have been dropped going from the full model x⁺ to the reduced model x⁻.
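Given the attention matrix `beta` of Listing 11.8, this importance measure is a one-liner (base R sketch):

# Importance measures IM_j from the attention matrix 'beta' of Listing 11.8.
IM <- colMeans(abs(beta))
barplot(sort(IM), horiz = TRUE, las = 1)   # ordered bars as in Fig. 11.16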
Next, we analyze the interactions by studying the gradients (11.38). Figure 11.17 illustrates spline fits to the components ∂β̂_j(x_i⁻)/∂x_k w.r.t. x_j of the continuous variables BonusMalus, log-Density, DrivAge and VehAge over all policies i = 1, . . . , n. The components ∂β̂_j(x_i⁻)/∂x_j show the non-linearity in x_j. We conclude that BonusMalus, DrivAge and VehAge should be non-linear, and log-Density is linear because ∂β̂_j(x_i⁻)/∂x_j ≈ 0. The components ∂β̂_j(x_i⁻)/∂x_k, k ≠ j, determine the interactions. We have the strongest interactions between BonusMalus and DrivAge, and BonusMalus has interactions with all variables. On the other hand, the log-Density only interacts with BonusMalus.
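These gradients can also be approximated without automatic differentiation, e.g., by central finite differences on the attention sub-model `zz` of Listing 11.8 (a sketch; `Xlearn` and `Vlearn` as before, and the step size h is a tuning choice).

# Finite-difference sketch of the gradients (11.38) of the fitted attentions
# (up to the constant factor alpha1 of Listing 11.8, which does not affect
# the shapes of the spline fits).
grad.beta <- function(j, k, h = 1e-2) {
  Xp <- Xlearn; Xp[, k] <- Xp[, k] + h
  Xm <- Xlearn; Xm[, k] <- Xm[, k] - h
  bp <- predict(zz, list(Xp, Vlearn))[, j]
  bm <- predict(zz, list(Xm, Vlearn))[, j]
  (bp - bm) / (2 * h)        # d beta_j(x_i) / d x_k for all policies i
}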
The reader will have noticed that we have excluded the categorical components VehBrand and Region from all model discussions. Firstly, these components are not standardized to zero mean and unit variance, and, secondly, we cannot study one level in isolation to decide whether to keep or drop that variable. That is, similar to group LASSO, we need to study all levels of each categorical feature component simultaneously. We do this in the next section, and we conclude with the regression attentions β̂_j(x) of the categorical feature components in Fig. 11.18, which seem to be significantly different from zero (VehBrands B10, B11, and Regions R22, R43, R82, R93), but which do not allow for variable selection as just described.

Fig. 11.17 Spline fits to the derivatives ∂β̂_j(x_i⁻)/∂x_k w.r.t. x_j of the continuous variables BonusMalus, log-Density, DrivAge and VehAge over all policies i = 1, . . . , n

Remark 11.13 The bias regularization in Table 11.11 has simply been obtained by applying an additional MLE step to α₀ and α₁. Alternatively, we can also define the new features ẑ_i = (α̂₁β̂₁(x_i)x_{i,1}, . . . , α̂₁β̂_{q₀}(x_i)x_{i,q₀})ᵀ ∈ ℝ^{q₀}, and then apply a proper GLM step to these newly (learned) features ẑ₁, . . . , ẑ_n. Working with the canonical link will give us the balance property. This is discussed in more detail in Remark 3.1 of Richman–Wüthrich [317].
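A minimal sketch of this recalibration (base R; `beta` from Listing 11.8 already contains the factor α̂₁, and `Ylearn` and `Vlearn` denote the assumed claim counts and exposures):

# GLM recalibration sketch of Remark 11.13: Poisson GLM with the canonical
# log-link on the learned features z_i, which restores the balance property.
Z <- as.matrix(beta) * Xlearn   # componentwise products alpha1*beta_j(x_i)*x_ij
glm.recal <- glm(Ylearn ~ Z, family = poisson(), offset = log(Vlearn))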

Fig. 11.18 Boxplot of the regression attentions β̂_j(x) of the categorical feature components VehBrand and Region; the y-scale is the same as in Fig. 11.15

11.5.4 Variable Selection Through Regularization of the LocalGLMnet

A natural next step is to introduce regularization on the regression attentions β(x); this is the proposal suggested in Richman–Wüthrich [318]. We choose the LocalGLMnet architecture x ↦ μ(x) of Definition 11.12 having an intercept parameter β₀ ∈ ℝ and the network weights w. For fitting, we consider a loss function L and we add a regularization term to this loss function penalizing large regression attentions. That is, we aim at minimizing

$$\underset{\beta_0, w}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} L(Y_i, \mu(x_i)) + R(\beta(x_i)), \qquad (11.42)$$

with a penalty term (regularizer) R(·) ≥ 0. For the penalty term R we can choose different forms, e.g., the elastic net regularizer of Zou–Hastie [409] is obtained by, see Remark 6.3,

$$\underset{\beta_0, w}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} L(Y_i, \mu(x_i)) + \eta\left((1-\alpha)\|\beta(x_i)\|_2^2 + \alpha\|\beta(x_i)\|_1\right), \qquad (11.43)$$

for a regularization parameter η ≥ 0 and weight α ∈ [0, 1]. For α = 0 we receive ridge regularization, and for α = 1 we get LASSO regularization of β(·).

For variable selection of categorical feature components we should rather use the group LASSO penalization of Yuan–Lin [398], see also (6.5). Assume the features x have a natural group structure x = (x₁ᵀ, . . . , x_Kᵀ)ᵀ ∈ ℝ^q. We consider the optimization

$$\underset{\beta_0, w}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} L(Y_i, \mu(x_i)) + \sum_{k=1}^{K} \eta_k \|\beta_k(x_i)\|_2, \qquad (11.44)$$

for regularization parameters η_k ≥ 0, and where β_k(x) collects all components β_j(x) of β(x) that belong to the k-th group x_k of x. Yuan–Lin [398] propose to scale the regularization parameters as η_k = √q_k · η ≥ 0, where q_k is the size of group k. Remark that if every group has size one we exactly obtain LASSO regularization.
Solving the optimization problem (11.44) poses some challenges because the regularizer is not differentiable in zero. In Sect. 6.2.5 we have presented the generalized projection operator (using the soft-thresholding operator) to solve the group LASSO regularization within GLMs. However, this proposal will not work here: the generalized projection operator may help to project the regression attentions β(x_i) back to the constraint set C, but it does not tell us anything about how to choose the network parameters w. In a different setting, Oelker–Tutz [288] propose to use a differentiable ε-approximation to the terms in (11.44). Choose ε > 0 and define for β_k ∈ ℝ^{q_k}

$$\|\beta_k\|_{2,\varepsilon} = \sqrt{\|\beta_k\|_2^2 + \varepsilon} = \sqrt{\beta_k^\top \beta_k + \varepsilon} \;\to\; \|\beta_k\|_2 \quad \text{as } \varepsilon \downarrow 0. \qquad (11.45)$$

This motivates to study the optimization problem for a fixed (small) ε > 0

$$\underset{\beta_0, w}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} L(Y_i, \mu(x_i)) + \sum_{k=1}^{K} \eta_k \|\beta_k(x_i)\|_{2,\varepsilon}. \qquad (11.46)$$

In Fig. 11.19 we plot these ε-approximations for ε ∈ {10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵}. The plot on the left-hand side gives β ∈ ℝ ↦ ‖β‖_{2,ε} = √(β² + ε) → |β| for ε ↓ 0, and the plot on the right-hand side gives the unit ball

$$B = \left\{\beta = (\beta_1, \beta_2)^\top \in \mathbb{R}^2;\; \|\beta_1\|_{2,\varepsilon} + \|\beta_2\|_{2,\varepsilon} = 1\right\}.$$

For the last two ε choices there is no visible difference to the ℓ₁-norm.

Fig. 11.19 (lhs) Comparison of |β| and ‖β‖_{2,ε} = √(β² + ε) for β ∈ ℝ, and (rhs) unit balls B for ε ∈ {10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵} compared to the Manhattan unit ball
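Numerically, the convergence in (11.45) is immediate to verify (a base R sketch):

# Numeric check of the epsilon-approximation (11.45) of the Euclidean norm.
norm2.eps <- function(beta, eps) sqrt(sum(beta^2) + eps)
beta <- c(0.3, -0.4)                                   # ||beta||_2 = 0.5
sapply(10^(-(1:5)), function(eps) norm2.eps(beta, eps))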

The main disadvantage of the ε-approximation is that it does not shrink unimportant components β_j(x) exactly to zero. But it allows us to identify unimportant (small) components, which can then be removed manually. As mentioned in Lee et al. [237], LASSO regularization needs a second model calibration step, fitting the model only on the selected components (and without regularization), to receive an optimal predictive power and a minimal bias. Thus, we need a second calibration step after the removal of the unimportant components anyway.

11.5.5 Lab: LASSO Regularization of LocalGLMnet

We revisit the LocalGLMnet architecture applied to the French MTPL claim frequency data, see Sect. 11.5.3. The goal is to perform a group LASSO regularization so that we can also study the importance of the terms coming from the categorical feature components VehBrand and Region. We first pre-process all feature components as follows. We apply dummy coding to the categorical variables, and then we standardize all components to zero mean and unit variance; this includes the dummy coded components.
In a next step we need to define the natural groups x = (x₁ᵀ, . . . , x_Kᵀ)ᵀ ∈ ℝ^q. We have 7 continuous and binary components which give us dimensions q_k = 1 for 1 ≤ k ≤ 7. VehBrand provides us with a group of size q₈ = 10, and Region gives us a group of size q₉ = 21. We set K = 9 and q = Σ_{k=1}^{9} q_k = 38. We code a (sort of) regularization design matrix to encode the K groups and weights √q_k for the q components of x. This is done in Listing 11.9, providing us with a matrix of size 38 × 9 and the weights √q_k. This regularization design matrix enters the penalty term on lines 13 and 16 of Listing 11.10, which weights the penalizations ‖·‖_{2,ε}.

Listing 11.9 Group LASSO regularization design

1 group.lasso.grouping <- function(xx){
2 pp <- array(0, dim=c(length(xx),sum(xx)))
3 for (k in 1:length(xx)){
4 if (k==1){pp[k,1:xx[k]] <- 1
5 }else{
6 pp[k,(sum(xx[1:(k-1)])+1):sum(xx[1:k])] <- 1
7 }}
8 t(pp)
9 }
10 #
11 ww <- group.lasso.grouping(c(rep(1,7),10,21))
12 etaK <- eta * sqrt(c(rep(1,7),10,21))



Listing 11.10 LocalGLMnet with group LASSO regularization

1 Design = layer_input(shape = c(38), dtype = 'float32')
2 LogVol = layer_input(shape = c(1), dtype = 'float32')
3 Bias1 = layer_input(shape = c(1), dtype = 'float32')
4 #
5 Attention = Design %>%
6 layer_dense(units=15, activation='tanh') %>%
7 layer_dense(units=10, activation='tanh') %>%
8 layer_dense(units=38, activation='linear', name='Attention')
9 #
10 Penalty = Attention %>%
11 layer_lambda(function(x) k_square(x)) %>%
12 layer_dense(units=9, activation='linear',
13 weights=list(ww), use_bias=FALSE, trainable=FALSE) %>%
14 layer_lambda(function(x) k_sqrt(x+epsilon)) %>%
15 layer_dense(units=1, activation='linear',
16 weights=list(array(etaK, dim=c(9,1))), use_bias=FALSE, trainable=FALSE)
17 #
18 LocalGLM = list(Design, Attention) %>% layer_dot(axes=1)
19 #
20 Bias = Bias1 %>%
21 layer_dense(units=1, activation='linear', use_bias=FALSE)
22 #
23 Response = list(LocalGLM, Bias, LogVol) %>% layer_add() %>%
24 layer_lambda(function(x) k_exp(x))
25 #
26 Output = list(Response, Penalty) %>% layer_concatenate()
27 #
28 keras_model(inputs = c(Design, LogVol, Bias1), outputs = c(Output))

The entire group LASSO regularized LocalGLMnet is depicted in Listing 11.10, showing the regression attentions on lines 5–8, the regularization on lines 10–16, and the output on line 26, which returns the expected response v_i μ(x_i) and the regularizer Σ_{k=1}^{K} η_k ‖β_k(x_i)‖_{2,ε}; we choose ε = 10⁻⁵ for our example.

Listing 11.11 Group LASSO regularized Poisson deviance loss


1 Poisson.reg <- function(y_true, y_pred){k_mean(
2 y_pred[,1]-y_true[,1] + y_true[,1]*k_log((y_true[,1]/y_pred[,1]+.00000001))
3 + y_pred[,2] )}

Finally, we need to code the loss function (11.42). This is done in Listing 11.11. We combine the Poisson deviance loss function with the group LASSO ε-approximation Σ_{k=1}^{K} η_k ‖β_k(x_i)‖_{2,ε}, the latter being outputted by Listing 11.10. We fit this network to the French MTPL data (as above) for regularization parameters η ∈ {0, 0.0025, 0.005}. Firstly, we note that the resulting networks are not fully competitive; this is probably due to the fact that the high-dimensional dummy coding leads to too much over-fitting potential, which leads to a very early stopping in gradient descent fitting. Thus, this approach may not be useful to directly receive a good predictive model, but it may be helpful to select the right feature components to design a good predictive model.
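For completeness, compiling and fitting the model of Listing 11.10 with the custom loss of Listing 11.11 may be sketched as follows; the dummy second response column only matches the two-dimensional output shape (the loss reads y_true[,1] only), and the epochs and batch size are illustrative choices.

# Sketch of compiling/fitting the group LASSO regularized LocalGLMnet.
model %>% compile(loss = Poisson.reg, optimizer = 'nadam')
model %>% fit(x = list(Xlearn, log(Vlearn), rep(1, nrow(Xlearn))),
              y = cbind(Ylearn, 0),       # second column is a dummy
              epochs = 100, batch_size = 5000, validation_split = 0.2)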
Figure 11.20 gives the importance measures of the estimated regression attentions

$$\mathrm{IM}_j = \frac{1}{n}\sum_{i=1}^{n} \left|\widehat{\beta}_j(x_i)\right|,$$

of all components 1 ≤ j ≤ q = 38. The red color corresponds to regularization parameter η = 0.005, red + yellow colors to η = 0.0025, and red + yellow + green colors to η = 0 (no regularization). Figure 11.20 (lhs) shows the results on the original (standardized) features x. By far the smallest red + yellow column among the continuous features is observed for VehPower, which confirms the variable selection of Sect. 11.5.3. Among the categorical variables Region seems more important (on average) than VehBrand because the red and yellow columns are generally bigger for Region. All these red and yellow columns of VehBrand and Region are bigger than the ones of VehPower, which supports the inclusion of the two categorical variables.
Figure 11.20 (rhs) verifies this decision of keeping the categorical variables. For this latter graph we randomly permute Region across the entire portfolio, and we run the same group LASSO regularized fitting procedure again on this modified data; a code sketch is given below. The vertical black line shows the average importance of the permuted Region variable for η = 0.0025. We see that only VehPower has a smaller importance measure, and all other variables dominate the permuted Region variable. This confirms our conclusions above.
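The permutation itself is a one-line operation (base R sketch; refitting then proceeds exactly as before, and the seed is an arbitrary choice for reproducibility):

# Permutation check: destroy the predictive signal of Region by randomly
# permuting its labels across the entire portfolio, then refit as above.
set.seed(100)
dat$Region <- sample(dat$Region)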

Fig. 11.20 Importance measures IM_j of the group LASSO regularized LocalGLMnet for variable selection with different regularization parameters η ∈ {0, 0.0025, 0.005}: (lhs) original data, and (rhs) randomly permuted Region labels; the x-scale is the same in both plots

We conclude that the LocalGLMnet architecture with a group LASSO regularization is helpful for variable selection, and, more generally, the LocalGLMnet architecture is useful for model interpretation, finding interactions and functional forms of the features entering the regression function. In examples that have categorical variables with many levels, the LocalGLMnet approach may not lead to a regression model that is fully competitive. In this case, the LocalGLMnet can be used for variable selection, and another network architecture should then be fitted on the selected variables. Alternatively, we can embed the categorical variables in a preparatory network step, and then work with these embeddings of the categorical variables (kept fixed within the LocalGLMnet).

11.6 Selected Applications

11.6.1 Mixture Density Networks

In Sect. 6.3 we have introduced mixture distributions and we have presented the EM algorithm for fitting these mixture distributions. The EM algorithm considers two steps, an expectation step (E-step) and a maximization step (M-step). The E-step is motivated by (6.34). In this step the posterior distribution of the latent variable Z is determined, given the observation Y and the parameter estimates for the model parameters θ and p. The M-step (6.35) determines the optimal model parameters θ and p, based on the observation Y and the posterior distribution of Z. Typically, we explore MLE in the M-step. However, for the EM algorithm to function it is not important that we really work with the maximum in the M-step, but monotonicity in (6.38) is sufficient. Thus, if at algorithmic time t − 1 we have a parameter estimate (θ̂^{(t−1)}, p̂^{(t−1)}), it suffices that the next estimate (θ̂^{(t)}, p̂^{(t)}) increases the log-likelihood, without necessarily being the MLE; this latter approach is called generalized EM (GEM) algorithm. Exactly this point makes it feasible to also use the EM algorithm in cases where we model the parameters through networks which are fit using gradient descent (ascent) algorithms. These methods go under the name of mixture density networks (MDNs).
MDNs have been introduced by Bishop [35], who explores MDNs on Gaussian mixtures, using SGD and quasi-Newton methods for model fitting. MDNs have also started to gain more popularity within the actuarial community; recent papers include Delong et al. [95], Kuo [230] and Al-Mudafer et al. [6], the latter two considering MDNs for claims reserving.
We recall the mixture density for a selected member of the EDF. The incomplete log-likelihood of the data (Y_i, x_i, v_i)_{1≤i≤n} is given by, see (6.24),

$$(\theta, \varphi, p) \mapsto \ell_Y(\theta, \varphi, p) = \sum_{i=1}^{n} \ell_{Y_i}(\theta(x_i), \varphi(x_i), p(x_i)) = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} p_k(x_i)\, f_k\!\left(Y_i;\, \theta_k(x_i), \frac{v_i}{\varphi_k(x_i)}\right)\right),$$

for canonical parameter θ = (θ₁, . . . , θ_K)ᵀ ∈ Θ = Θ₁ × · · · × Θ_K, dispersion parameter φ = (φ₁, . . . , φ_K)ᵀ ∈ ℝ₊^K, mixture probability p ∈ Δ_K, and K denotes the number of mixture components. MDNs model these parameters with networks. Choose a FN network z^{(d:1)} : ℝ^{q+1} → {1} × ℝ^{q_d} of depth d, with input dimension q + 1 being equal to the dimension of the features x ∈ X ⊆ {1} × ℝ^q and output dimension q_d + 1. This gives us the learned representations z_i = z^{(d:1)}(x_i). These learned

representations are used to model the parameters. For the mixture probability p we build a logistic categorical GLM, based on z_i. For the (canonical) link h, we set the linear predictor, see (5.72),

$$h(p(z_i)) = h\left(p\left(z^{(d:1)}(x_i)\right)\right) = \left(\langle \beta_1^p, z_i\rangle, \ldots, \langle \beta_K^p, z_i\rangle\right)^\top \in \mathbb{R}^K, \qquad (11.47)$$

with regression parameter β^p = ((β₁^p)ᵀ, . . . , (β_K^p)ᵀ)ᵀ ∈ ℝ^{K(q_d+1)}. For the canonical parameter θ, the mean parameter μ, respectively, and the dispersion parameter φ we proceed analogously. Choose strictly monotone and smooth link functions g_μ and g_φ, and consider the double GLMs, for 1 ≤ k ≤ K, on the learned representations z_i

$$g_\mu(\mu_k(z_i)) = \langle \beta_k^\mu, z_i\rangle \qquad \text{and} \qquad g_\varphi(\varphi_k(z_i)) = \langle \beta_k^\varphi, z_i\rangle, \qquad (11.48)$$

with regression parameters β^μ = ((β₁^μ)ᵀ, . . . , (β_K^μ)ᵀ)ᵀ ∈ ℝ^{K(q_d+1)} for the mean parameters and β^φ = ((β₁^φ)ᵀ, . . . , (β_K^φ)ᵀ)ᵀ ∈ ℝ^{K(q_d+1)} for the dispersion parameters. Thus, altogether this gives us a network parameter of dimension, set q₀ = q,

$$r = \sum_{m=1}^{d} q_m(q_{m-1} + 1) + 3K(q_d + 1).$$

Remarks 11.14
• The regression functions (11.47)–(11.48) use a slight abuse of notation, because, strictly speaking, these should be functions w.r.t. the features x_i ∈ X, i.e., we should understand the learned representations z_i as a short form for x_i ↦ z^{(d:1)}(x_i).
• It is not fully correct to say that (11.47) is the logistic categorical GLM of formula (5.72), because (11.47) does not lead to identifiable regression parameters. In fact, we should reduce the dimension of the categorical GLM to K − 1, by setting β_K^p = 0, see (5.70), because the probability of the last label K is fully determined if we know the probabilities of all other labels; this would also justify to say that h is the canonical link. Since in FN network modeling we do not have identifiability anyway, we neglect this normalization (redundancy), see line 16 of Listing 11.12, below.
• The above proposal (11.47)–(11.48) suggests to use the same network z^{(d:1)} for all mixture parameters involved. This requires that the chosen network is sufficiently large, so that it can comply simultaneously with these different tasks. Alternatively, we could choose three separate (parallel) networks for p, μ and φ, respectively. This second proposal does not (easily) allow for (non-trivial) interactions between the parameters, and it may also suffer from less robustness in fitting.
• Proposal (11.48) defines double GLMs for the mixture components f_k, 1 ≤ k ≤ K. If we decide to not model the dispersion parameters feature dependent, i.e., if we set φ_k(z) ≡ φ_k ∈ ℝ₊, then the mixture components are modeled with GLMs on the learned representations z_i = z^{(d:1)}(x_i). Nevertheless, this latter approach still requires that the dispersion parameters φ_k are set to reasonable values, as they enter the score equations; this can be seen from (6.29) adapted to MDNs. Thus, in MDNs, the dispersion parameters do not cancel in the score equations, which is different from the single distribution case. The dispersion parameter can either be estimated (updated) during the M-step of the EM algorithm (supposed we use the EM algorithm), or it can be pre-specified as a given hyper-parameter.
• As mentioned in Sect. 6.3, mixture density fitting can be challenging because, in general, mixture density log-likelihoods are unbounded. Therefore, a suitable initialization of the EM algorithm is important for a successful model fitting. This problem is less pronounced in MDNs as we use early stopping in SGD fitting, which prevents the fitted parameters from depending on a small set of observations. For instance, Example 6.13 cannot occur because an individual observation Y₁ enters at most one (mini-)batch of SGD, and the SGD algorithm will provide a good balance across all batches. Moreover, early stopping will imply that the selected parameters must also be good on the validation data being disjoint (and independent) from the training data.
• Delong et al. [95] present two different ways of fitting such MDNs. The crucial property in EM fitting is to preserve the monotonicity in the M-step. For MDNs this can either be achieved by using the parameters as offsets for the next EM iteration (this is called 'EM network boosting' in Delong et al. [95]) or by forwarding the network weights from one loop to the next (called 'EM forward network' in Delong et al. [95]). We are going to present the second option in the next example.

Example 11.15 (Gamma Claim Size Modeling and MDNs) We revisit Example 6.14 which models the claim sizes of the French MTPL data. For the modeling of these claim sizes we choose the mixture distribution (6.39) which has four gamma components f₁, . . . , f₄ and one Lomax component f₅. In a first step we again model these five mixture components independently of the feature information x, and the feature information only enters the mixture probabilities p(x) ∈ Δ₅. This modeling approach has been motivated by Fig. 13.17 which suggests that the features mainly result in systematic effects on the mixture probabilities. We choose the same model and feature information as in Example 6.14. We only replace the logistic categorical GLM part (6.40) for modeling p(x) by a depth d = 2 FN network with (q₁, q₂) = (20, 10) neurons. Area, VehAge, DrivAge

and BonusMalus are modeled as continuous variables, and for the categorical
variables VehBrand and Region we choose two-dimensional embedding layers.

Listing 11.12 R code of the MDN for modeling the mixture probability p(x)
1 Design = layer_input(shape = c(4), dtype = 'float32')
2 VehBrand = layer_input(shape = c(1), dtype = 'int32')
3 Region = layer_input(shape = c(1), dtype = 'int32')
4 Bias = layer_input(shape = c(1), dtype = 'float32')
5 #
6 BrandEmb = VehBrand %>%
7 layer_embedding(input_dim = 11, output_dim = 2, input_length = 1) %>%
8 layer_flatten()
9 RegionEmb = Region %>%
10 layer_embedding(input_dim = 22, output_dim = 2, input_length = 1) %>%
11 layer_flatten()
12 #
13 pp = list(Design, BrandEmb, RegionEmb) %>% layer_concatenate() %>%
14 layer_dense(units=20, activation='tanh') %>%
15 layer_dense(units=10, activation='tanh') %>%
16 layer_dense(units=5, activation='softmax')
17 #
18 mu = Bias %>% layer_dense(units=4, activation='exponential',
19 use_bias=FALSE)
20 #
21 tail = Bias %>% layer_dense(units=1, activation='sigmoid',
22 use_bias=FALSE)
23 #
24 shape = Bias %>% layer_dense(units=4, activation='exponential',
25 use_bias=FALSE)
26 #
27 Response = list(pp, mu, tail, shape) %>% layer_concatenate()
28 #
29 keras_model(inputs = c(Design, VehBrand, Region, Bias), outputs = c(Response))

Listing 11.12 shows the chosen network. Lines 13–16 model the mixture probability p(x). We also integrate the modeling of the (homogeneous) parameters of the mixture densities f₁, . . . , f₅. Lines 18 and 24 of Listing 11.12 consider the mean and shape parameters of the gamma components, and line 21 the tail parameter 1/β₅ of the Lomax component. Note that we use the sigmoid activation for this Lomax parameter. This implies 1/β₅ ∈ (0, 1) and, thus, β₅ > 1, which enforces a finite mean model. The exponential activations on lines 18 and 24 ensure positivity of these parameters. The input Bias to these variables is simply the constant 1, which is the homogeneous case not differentiating w.r.t. the features.
Observe that in most of the networks so far, the output of the network was equal to an expected response of a random variable that we try to predict. In this MDN we output the parameters of a distribution function, see line 27 of Listing 11.12. In our case this output has dimension 14, which then enters the score in Listing 11.13. In a first attempt we fit this MDN brute-force by just implementing the incomplete log-likelihood received from (6.39). Since the gamma function Γ(·) is not easily available in keras [77], we replace the gamma density by its saddlepoint approximation, see Sect. 5.5.2. Listing 11.13 shows the negative log-likelihood of the mixture density that is used to perform the brute-force SGD fitting.

Listing 11.13 Mixture density negative incomplete log-likelihood


1 mixture_LogLikeli <- function(true, pred){ - k_mean(k_log(
2 pred[,1]*k_exp(-k_log(2*pi*true[,1]^2/pred[,11])/2 -
3 pred[,11]*(true[,1]/pred[,6]-1+k_log(pred[,6]/true[,1]))) +
4 pred[,2]*k_exp(-k_log(2*pi*true[,1]^2/pred[,12])/2 -
5 pred[,12]*(true[,1]/pred[,7]-1+k_log(pred[,7]/true[,1]))) +
6 pred[,3]*k_exp(-k_log(2*pi*true[,1]^2/pred[,13])/2 -
7 pred[,13]*(true[,1]/pred[,8]-1+k_log(pred[,8]/true[,1]))) +
8 pred[,4]*k_exp(-k_log(2*pi*true[,1]^2/pred[,14])/2 -
9 pred[,14]*(true[,1]/pred[,9]-1+k_log(pred[,9]/true[,1]))) +
10 pred[,5]*k_exp(k_log(1/(pred[,10]*M))-(1/pred[,10]+1)
11 *k_log(true[,1]/M+1))))
12 }

Lines 2–9 give the saddlepoint approximations to the four gamma components, and line 10 the Lomax component for the scale parameter M. Note that this brute-force approach is based only on the incomplete observation Y encoded in true[,1], see Listing 11.13.
We fit this logistic categorical FN network of Listing 11.12 under the score function of Listing 11.13 using the nadam version of SGD. Moreover, we use a stratified training-validation split, otherwise we did not obtain a competitive model. The results are presented in Table 11.12 on line 'logistic FN network: brute-force fitting'. We observe a slightly worse performance (in-sample) than in the logistic GLM. This does not justify the use of the more complex network architecture. In other words, feature pre-processing seems to have been done suitably in Example 6.14.
In a next step, we fit this MDN with the (generalized) EM algorithm. The E-step is exactly the same as in Example 6.14. For the M-step, having knowledge of the (latent mixture component) variables Z_i, 1 ≤ i ≤ n, implies that the mixture probability estimation and the mixture density estimation completely decouple. As a consequence, the parameters of the density components f₁, . . . , f₅ can directly be estimated using univariate MLEs; this is the same as in Example 6.14. The only part that needs further explanation is the estimation of the logistic categorical FN network for p(x). In each loop of the EM iteration we would like to find the optimal network parameter for p(x), and at the same time we have to ensure the monotonicity (6.38).
Table 11.12 Mixture models for French MTPL claim size modeling; we set M = 2'000

                                            # Param.   ℓ_Y(θ̂, p̂)    μ̂ = E_{θ̂,p̂}[Y]
Empirical                                                            2'266
Null model                                  13         −199'306     2'381
Logistic GLM, Example 6.14                  193        −198'404     2'176
Logistic FN network: brute-force fitting    520        −198'623     2'003
Logistic FN network: EM fitting             520        −198'449     2'119
MDN: brute-force fitting                    825        −198'178     2'144
MDN: EM fitting                             825        −198'085     2'240

Following the 'EM forward network' approach of Delong et al. [95], this is most easily achieved by just initializing the FN network in loop t of the algorithm with the optimal network parameter of the previous loop t − 1. Thus, the starting parameter of SGD reflects the optimal parameter from the previous step, and since SGD generally decreases losses, the monotonicity (6.38) holds. The latter statement is not strictly true, as SGD introduces additional randomness through the building of (mini-)batches; therefore, monotonicity should be traced explicitly (which also ensures that the early stopping rule is chosen suitably). We have implemented such an EM-SGD algorithm; essentially, we just have to drop lines 17–28 of Listing 11.12, and lines 13–16 provide the entire response. As loss function we choose the categorical (multi-class) cross-entropy loss, see (4.19); a structural sketch of this EM-SGD loop is given below. The results in Table 11.12 on line 'logistic FN network: EM fitting' indicate a superior fitting behavior compared to the brute-force fitting. Nevertheless, this network approach is still not outperforming the GLM approach, saying that we should stay with the simpler GLM.
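Structurally, such an EM-SGD loop may be sketched as follows (heavily simplified; `model.p` denotes the probability network obtained from Listing 11.12 by dropping lines 17–28, `Xlearn`, `Brlearn`, `Relearn` its assumed inputs, and `dens` an assumed n × 5 matrix of the fitted component densities f̂_k(Y_i); the MLE updates of the component parameters are indicated only by a comment).

# Structural sketch of the 'EM forward network' loop; warm starts preserve
# the monotonicity (6.38) up to the SGD (mini-)batch randomness.
model.p %>% compile(loss = 'categorical_crossentropy', optimizer = 'nadam')
p.hat <- predict(model.p, list(Xlearn, Brlearn, Relearn))
for (t in 1:50) {
  Z.post <- p.hat * dens                 # E-step: posterior probabilities
  Z.post <- Z.post / rowSums(Z.post)     #         of the latent Z_i
  # M-step (components): univariate MLE updates of f_1,...,f_5, then
  # recompute 'dens'; M-step (network): warm start in previous weights
  model.p %>% fit(list(Xlearn, Brlearn, Relearn), Z.post,
                  epochs = 10, batch_size = 5000, verbose = 0)
  p.hat <- predict(model.p, list(Xlearn, Brlearn, Relearn))
}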
In a final step, we also model the mean parameters μ_k(x), 1 ≤ k ≤ 4, of the gamma components feature dependent, to see whether we can gain predictive power from this additional flexibility or whether our initial model choice is sufficient. For robustness reasons we neither model the shape parameters β_k, 1 ≤ k ≤ 4, of the gamma components feature dependent nor the tail parameter β₅ of the Lomax component. The implementation only requires small changes to Listing 11.12, see Listing 11.14.
A brute-force fitting of the MDN architecture of Listing 11.14 can directly be based on the score function (negative incomplete log-likelihood) of Listing 11.13. In the case of the EM algorithm we need to change the score function to the complete log-likelihood accounting for the variables Z_i ∈ Δ₅. This is done in Listing 11.15 where Z_i is encoded in the variables true[,2] to true[,6].
We fit this MDN using the two different fitting approaches, and the results are given on the last two lines of Table 11.12. Again the performance of the EM fitting is slightly better than the brute-force fitting, and the bigger log-likelihoods indicate that we can gain predictive power by also modeling the means of the gamma components feature dependent.
Figure 11.21 compares the QQ plot of the resulting MDN with EM fitting to the one received from the logistic categorical GLM of Example 6.14. These graphs are very similar. We conclude that in this particular example the simpler proposal of Example 6.14 seems sufficient. ∎

In a next step, we try to understand which feature components influence the mixture probabilities p(x) = (p₁(x), . . . , p_K(x))ᵀ most. Similarly to Examples 6.14 and 11.15, we therefore use a MDN where we only fit the mixture probability p(x) with a network, and the mixture components f₁, . . . , f_K are assumed to be homogeneous.

Example 11.16 (MDN with LocalGLMnet) We revisit Example 11.15. We choose the mixture distribution (6.39) which has four gamma components f₁, . . . , f₄ and a Lomax component f₅. We select their parameters independent of the features. The feature information x should only enter the mixture probability p(x) ∈ Δ₅, similarly to the first part of Example 11.15. We replace the logistic FN network of Example 11.15 for modeling p(x) by a LocalGLMnet such that we can analyze the importance of the variables, see Sect. 11.5.

Listing 11.14 R code of the MDN for modeling the mixture probability p(x) and the gamma means μ_k(x)
1 Design = layer_input(shape = c(4), dtype = 'float32')
2 VehBrand = layer_input(shape = c(1), dtype = 'int32')
3 Region = layer_input(shape = c(1), dtype = 'int32')
4 Bias = layer_input(shape = c(1), dtype = 'float32')
5 #
6 BrandEmb = VehBrand %>%
7 layer_embedding(input_dim = 11, output_dim = 2, input_length = 1) %>%
8 layer_flatten()
9 RegionEmb = Region %>%
10 layer_embedding(input_dim = 22, output_dim = 2, input_length = 1) %>%
11 layer_flatten()
12 #
13 Network = list(Design, BrandEmb, RegionEmb) %>% layer_concatenate() %>%
14 layer_dense(units=20, activation='tanh') %>%
15 layer_dense(units=15, activation='tanh') %>%
16 layer_dense(units=10, activation='tanh')
17 #
18 pp = Network %>% layer_dense(units=5, activation='softmax')
19 #
20 mu = Network %>% layer_dense(units=4, activation='exponential',
21 use_bias=FALSE)
22 #
23 tail = Bias %>% layer_dense(units=1, activation='sigmoid',
24 use_bias=FALSE)
25 #
26 shape = Bias %>% layer_dense(units=4, activation='exponential',
27 use_bias=FALSE)
28 #
29 Response = list(pp, mu, tail, shape) %>% layer_concatenate()
30 #
31 keras_model(inputs = c(Design, VehBrand, Region, Bias), outputs = c(Response))

Listing 11.15 Mixture density negative complete log-likelihood


1 mixture_LogLikeli_Complete <- function(true, pred){ - k_mean(
2 true[,2]*(k_log(pred[,1])-k_log(2*pi*true[,1]^2/pred[,11])/2 -
3 pred[,11]*(true[,1]/pred[,6]-1+k_log(pred[,6]/true[,1]))) +
4 true[,3]*(k_log(pred[,2])-k_log(2*pi*true[,1]^2/pred[,12])/2 -
5 pred[,12]*(true[,1]/pred[,7]-1+k_log(pred[,7]/true[,1]))) +
6 true[,4]*(k_log(pred[,3])-k_log(2*pi*true[,1]^2/pred[,13])/2 -
7 pred[,13]*(true[,1]/pred[,8]-1+k_log(pred[,8]/true[,1]))) +
8 true[,5]*(k_log(pred[,4])-k_log(2*pi*true[,1]^2/pred[,14])/2 -
9 pred[,14]*(true[,1]/pred[,9]-1+k_log(pred[,9]/true[,1]))) +
10 true[,6]*(k_log(pred[,5])+k_log(1/(pred[,10]*M))-
11 (1/pred[,10]+1)*k_log(true[,1]/M+1)))
12 }

For the feature information we choose the continuous variables Area, VehPower, VehAge, DrivAge and BonusMalus, the binary variable VehGas and the categorical variables VehBrand and Region; thus, we extend by VehPower and VehGas compared to Example 11.15. These latter two variables have not been included previously, because they did not seem to be important w.r.t. Fig. 13.17.

Fig. 11.21 QQ plots of mixture models: (lhs) logistic categorical GLM for mixture probabilities and (rhs) for MDN with EM fitting

The continuous and binary variables are centered and normalized to unit variance. For the categorical variables we use two-dimensional embedding layers, and afterwards they are concatenated with the continuous variables with a subsequent normalization layer (to ensure that all components live on the same scale). This provides us with a 10-dimensional feature vector. This feature vector is complemented with an i.i.d. standard Gaussian component, called Random, to perform an empirical Wald type test. We call this pre-processed feature (after embedding and normalization of the categorical variables) x ∈ ℝ^{q₀} with q₀ = 11.
We design a LocalGLMnet that acts on this feature x ∈ ℝ^{q₀} for modeling a categorical multi-class output with K = 5 levels. Therefore, we choose the regression attentions

$$z^{(d:1)} : \mathbb{R}^{q_0} \to \mathbb{R}^{q_0 \times K}, \qquad x \mapsto \beta(x) = \left(\beta_1(x), \ldots, \beta_K(x)\right) = z^{(d:1)}(x),$$

where z^{(d:1)} is a network of depth d having a matrix-valued output of dimension q₀ × K. For the (canonical) link h, this gives us the predictor, see (5.72),

$$h(p(x)) = \left(\beta_{1,0} + \langle \beta_1(x), x\rangle, \ldots, \beta_{K,0} + \langle \beta_K(x), x\rangle\right)^\top \in \mathbb{R}^K, \qquad (11.49)$$

with intercepts β_{k,0} ∈ ℝ, and where β_k(x) ∈ ℝ^{q₀} is the k-th column of the regression attention β(x) = z^{(d:1)}(x) ∈ ℝ^{q₀×K}. We also refer to the second item of Remarks 11.14 concerning a possible dimension reduction in (11.49); i.e., in fact we apply the softmax activation function to the right-hand side of (11.49), neglecting the identifiability issue. Moreover, as in the introduction of the LocalGLMnet, we separate the intercept components from the remaining features in (11.49).
We fit this LocalGLMnet-MDN with the EM version presented in Example 11.15. We apply early stopping based on the same stratified training-validation

split as in the aforementioned example, and this provides us with a log-likelihood


of -198’290, thus, slightly bigger than the corresponding numbers in Table 11.12.
More interestingly, our goal is to understand the regression attentions given by
β(x_i) = (β_1(x_i), ..., β_5(x_i)) ∈ R^{11×5} over all claims 1 ≤ i ≤ n. Figure 11.22
shows the resulting boxplots, where each of the five graphs corresponds to one
mixture component 1 ≤ k ≤ 5, and the different colors illustrate the 11 feature
components providing the attention weights β_{k,j}(x_i), 1 ≤ j ≤ 11. The red boxplots
show the purely random component Random for 1 ≤ k ≤ 5, which provides
the acceptance region of an empirical Wald test for the null hypothesis that the
corresponding term should be dropped. This is highlighted by the orange shaded
area (at a significance level of 0.1%). Thus, whenever a boxplot lies within this
orange shaded area we may consider dropping this term, e.g., for k = 2 (top-right),
this is the case for Area, VehPower and Region2 (being the second component
of the two-dimensional region embedding). Note that this interpretation needs some
care because we do not have identifiability in the class probabilities.
The first observation is that, indeed, VehPower is mostly in the orange
confidence area and, thus, may be dropped. This does not apply to the other feature
components, and, thus, we should keep them in the model. The three gamma mixture
components f1 , f2 and f3 correspond to the three modes at 75, 600 and 1’175
in Fig. 13.17. Component f4 is a gamma component covering the whole range
of claims, and f5 is the Lomax component modeling the regular variation in the
tail. Interestingly, DrivAge and BonusMalus seem very important for mixture
components k = 1, k = 3 and k = 4 (with different signs); this is supported
by Fig. 13.17. The Lomax component seems mostly impacted by DrivAge,
VehBrand and Region. Only mixture component k = 2 is more difficult to
interpret. This component seems influenced by most of the feature components;
in particular, the combination of VehAge, VehGas and VehBrand seems important.
This could mean that mixture component k = 2 belongs to a certain type of vehicle.
In a next step we could study interactions and their impact on the mixture
components, and LASSO regularization would provide us with another method of
variable selection, see Sect. 11.5.4. We refrain from doing so and close the example.


11.6.2 Estimation of Conditional Expectations

FN networks have also found their way into solving risk management problems.
We briefly introduce a valuation problem and then describe a way of solving
this problem. Assume we have a liability cash flow Y1:T = (Y1 , . . . , YT ) with
(random) payments Yt at time points t = 1, . . . , T . We assume that this liability
cash flow Y1:T is adapted to a filtration (At )1≤t ≤T on the underlying probability
space (, A, P). Moreover, we assume to have a pricing kernel (state price deflator)
ψ1:T = (ψ1 , . . . , ψT ) on that probability space which is an (At )1≤t ≤T -adapted

Fig. 11.22 Boxplots of regression attentions β(x_i) = (β_1(x_i), ..., β_5(x_i)) ∈ R^{11×5} over all claims 1 ≤ i ≤ n for the different mixture components f_1, ..., f_5 (feature components: Random, Area, VehPower, VehAge, DrivAge, BonusM, VehGas, Vehicle1, Vehicle2, Region1, Region2)

random vector with strictly positive components ψt > 0, a.s., for all 1 ≤ t ≤ T . A
no-arbitrage value of the outstanding liability cash flow at time 1 ≤ τ < T can be
defined by (we assume existence of all second moments)


$$R_\tau = \frac{1}{\psi_\tau}\, E\left[\left.\sum_{s=\tau+1}^{T} \psi_s Y_s \,\right|\, \mathcal{A}_\tau\right]. \tag{11.50}$$

For the mathematical background on no-arbitrage pricing using state price deflators
we refer to Wüthrich–Merz [393]. The A_τ-measurable quantity R_τ is called the
reserves of the outstanding liabilities at time τ. From a risk management and
solvency point of view we would like to understand the volatility in the reserves
R_τ seen from time 0, i.e., we try to model the random variable R_τ seen from time
0 (based on the trivial σ-algebra A_0 = {∅, Ω}). In applied problems, the difficulty
often is that the conditional expectations under the summation in (11.50) cannot be
computed in closed form. Therefore the law of R_τ cannot be determined explicitly.
We provide a numerical solution to the calculation of the conditional expectations
in (11.50). Assume that the information set Aτ can be described by a random vector
Xτ , i.e., Aτ = σ (Xτ ). In that case we rewrite (11.50) as follows


$$R_\tau = \frac{1}{\psi_\tau}\, E\left[\left.\sum_{s=\tau+1}^{T} \psi_s Y_s \,\right|\, X_\tau\right]. \tag{11.51}$$

The latter now indicates that we can determine the conditional expectations
in (11.51) as regression functions in the features X_τ, and we try to understand for s > τ

$$x_\tau \mapsto E\left[\left.\frac{\psi_s}{\psi_\tau}\, Y_s \,\right|\, X_\tau = x_\tau\right]. \tag{11.52}$$

The random variable R_τ can then be determined empirically by simulation. This
requires two steps: (1) We have to be able to simulate ψ_s Y_s/ψ_τ, conditionally given
X_τ = x_τ. This allows us to estimate the conditional expectation (11.52) with a
regression function. (2) We need to be able to simulate X_τ. This provides us with
the empirical occurrence probabilities of specific choices X_τ = x_τ in (11.52), which
then gives an empirical version of R_τ.
In theory, this problem can be approached by nested simulations which is
a two-stage procedure that first performs step (2), and then calculates step (1)
empirically with Monte Carlo simulations for every realization of step (2), see,
e.g., Lee [242] and Glynn–Lee [161]. The disadvantage of this two-stage nested
simulation procedure is that it is computationally demanding. Building upon the
work on valuation of American options by Carriere [65], Tsitsiklis–Van Roy [356]
and Longstaff–Schwartz [257], the papers of Broadie et al. [55] and Ha–Bauer [177]
propose to regress future cash flows on finitely many basis functions depending on
the state variable Xτ . More recently, machine learning tools such as FN networks

have been proposed to determine these basis and regression functions, see, e.g.,
Cheridito et al. [74] or Krah et al. [224].
In the following, we assume that all random variables considered are square-integrable
and, thus, we can work in a Hilbert space with the scalar product
⟨X, Z⟩ = E[XZ] for X, Z ∈ L²(Ω, A, P). Moreover, for simplicity, we drop the
time indices and we also drop the stochastic discounting in (11.52) by assuming
ψ_s/ψ_τ ≡ 1. These simplifications are not technically essential and simplify our
outline. The conditional expectation μ(X) = E[Y|X] can then be found by the
orthogonal projection of Y onto the sub-space generated by σ(X) in the Hilbert
space L²(Ω, A, P). That is, the conditional expectation is the measurable function
μ: R^q → R, X ↦ μ(X), that minimizes the mean squared error
 
$$E\left[(Y - \mu(X))^2\right] \overset{!}{=} \min, \tag{11.53}$$

among all measurable functions on X. In Example 3.7, we have seen that μ(·) is the
minimizer of this problem if and only if

$$\mu(x) = \underset{m \in \mathbb{R}}{\arg\min} \int_{\mathbb{R}} (y - m)^2 \, dF_{Y|x}(y), \tag{11.54}$$

for p_x-a.e. x ∈ R^q, where p_x is the distribution of X, and where F_{Y|x} is the
conditional distribution of Y, given feature X = x; we also refer to (3.6).
Under the assumption that we can simulate observations (Y, X) under P, we can
solve (11.53)–(11.54) approximately by restricting to a sufficiently rich family of
regression functions. Choose a FN network z^{(d:1)}: R^q → R^{q_d} of depth d and the
identity link g(x) = x. An optimal network parameter ϑ̂ is found by minimizing

$$\hat\vartheta = \underset{\vartheta \in \mathbb{R}^r}{\arg\min}\; \frac{1}{n} \sum_{i=1}^n \left(Y_i - \left\langle \beta, z^{(d:1)}(X_i)\right\rangle\right)^2, \tag{11.55}$$

where (Y_i, X_i), 1 ≤ i ≤ n, are i.i.d. copies of (Y, X). This provides us with the
fitted FN network ẑ^{(d:1)}(·) and the fitted output parameter β̂. These can be used to
receive an approximation to the conditional expectation, solution of (11.54),

$$x \mapsto \hat\mu(x) = \left\langle \hat\beta, \hat z^{(d:1)}(x)\right\rangle \approx \mu(x) = E\left[\, Y \,|\, X = x\right]. \tag{11.56}$$

This then allows us to approximate the random variable in (11.51) empirically by
simulating features X and inserting them into the left-hand side of (11.56).
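The following self-contained R sketch illustrates (11.55)–(11.56) on simulated data; the true regression function μ0, the noise level and the network size are our own illustrative choices and not taken from the text:

library(keras)
set.seed(1)

# simulate i.i.d. copies (Y_i, X_i) of (Y, X); mu0 is an illustrative choice
n   <- 10000
X   <- matrix(runif(n, -2, 2), ncol = 1)
mu0 <- function(x) sin(2 * x) + x / 2          # toy regression function
Y   <- mu0(X) + rnorm(n, sd = 0.3)

# shallow FN network with identity output link, fitted with the square loss
model <- keras_model_sequential() %>%
  layer_dense(units = 20, activation = "tanh", input_shape = 1) %>%
  layer_dense(units = 1, activation = "linear")
model %>% compile(loss = "mse", optimizer = "adam")
model %>% fit(X, Y, epochs = 50, batch_size = 256, verbose = 0)

# fitted approximation of x -> E[Y | X = x] as in (11.56)
x_grid <- matrix(seq(-2, 2, length.out = 101), ncol = 1)
max(abs(predict(model, x_grid) - mu0(x_grid)))   # error on the grid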
Remarks 11.17
• There are different types of errors involved. First, there is an irreducible
approximation error if the chosen family of FN networks is not sufficiently
rich to approximate the conditional expectation well. For example, if we choose
the hyperbolic tangent activation function, then, naturally, z(d:1) (·) is uniformly

bounded for a fixed network parameter ϑ. This does not necessarily apply to
the conditional expectation E[Y |X = ·] and, thus, the approximation in the tail
may be poor. Second, we consider an approximation based on a finite sample
in (11.55). However, this error can be made arbitrarily small by letting n → ∞.
In-sample over-fitting should not be an issue as we may generate samples of
arbitrarily large sample sizes. Third, having the approximation (11.56), we still
need to simulate i.i.d. samples Xk , k ≥ 1, having the same distribution as X to
empirically approximate the distribution of the random variable Rτ in (11.51).
Also in this step we benefit from the fact that we can simulate infinitely many
samples to mitigate this approximation error.
• To fit the network parameter ϑ in (11.55) we use i.i.d. copies (Yi , X i ), 1 ≤ i ≤ n,
that have the same distribution as (Y, X) under P. However, to receive a good
approximation to the regression function x ↦ μ(x) we only need to simulate
Y_i|{X_i = x_i} from F_{Y|x_i}(·) = P[·|X_i = x_i], while X_i can be simulated from an
arbitrary distribution equivalent to p_x, and we still get the right conditional
expectation in (11.54). This is worth mentioning because if we need a higher
precision in some part of the feature space of X, we can apply a sort of
importance sampling by choosing a distribution for X that generates more
samples in the corresponding part of the feature space compared to the original
(true) distribution px of X; this proposal has been emphasized in Cheridito et
al. [74].
We study the example presented in Ha–Bauer [177] and Cheridito et al. [74].
This example considers a variable annuity (VA) with a guaranteed minimum income
benefit (GMIB), and we revisit the network approach of Cheridito et al. [74].
Example 11.18 (Approximation of Conditional Expectations) We consider the VA
example with a GMIB introduced and studied in Ha–Bauer [177]. This example
involves a 3-dimensional stochastic process, for t ≥ 0,

$$X_t = (q_t, r_t, m_{x+t}),$$

with q_t being the log-value of the VA account at time t, r_t the short rate at time t,
and m_{x+t} the force of mortality at time t of a person aged x at time 0. The payoff
at the fixed maturity date T > 1 of this insurance contract is given by

$$S = S(X_T) = \max\left\{e^{q_T},\; b\, a_{x+T}(r_T, m_{x+T})\right\},$$

where e^{q_T} is the VA account value at time T, and b a_{x+T}(r_T, m_{x+T}) is the GMIB at
time T, consisting of a face value b > 0 and with a_{x+T}(r_T, m_{x+T}) being the value
of an immediate annuity at time T of a person aged x + T. Our goal is to model the
conditional expectation

$$\mu(X_\tau) = D(\tau, T; X_\tau)\, E\left[\left. S(X_T) \,\right|\, X_\tau\right] = D(\tau, T; X_\tau)\, E\left[\left.\max\left\{e^{q_T},\; b\, a_{x+T}(r_T, m_{x+T})\right\} \,\right|\, X_\tau\right], \tag{11.57}$$

for a fixed valuation time point 0 < τ < T , and where D(τ, T ) = D(τ, T ; Xτ )
is a σ (Xτ )-measurable discount factor. This requires the explicit specification of
the GMIB term as a function of (rT , mx+T ), the modeling of the stochastic process
(Xt )0≤t ≤T , and the specification of the discount factor D(τ, T ; Xτ ). In financial
and actuarial valuation the regression function μ(·) in (11.57) should reflect a no-
arbitrage price. Therefore, P in (11.57) should be an equivalent martingale measure
w.r.t. the selected numéraire. In our case, we choose a force of mortality (mx+t )t -
adjusted zero-coupon bond price as numéraire. This implies that P is a mortality-
adjusted forward measure; for details and its explicit derivation we refer to Sect. 5.1
of Ha–Bauer [177]. In particular, Ha–Bauer [177] introduce a three-dimensional
Brownian motion based model for (X t )t from which they deduce all relevant terms
explicitly. We skip these calculations here, because, once the GMIB term and the
discount factor are determined, everything boils down to knowing the distribution
of the random vector (X τ , XT ) under the corresponding probability measure P. We
choose initial age x = 55, maturity T = 15 and (solvency) time horizon τ = 1.
Under the model and parametrization of Ha–Bauer [177] we receive a multivariate
Gaussian distribution under P given by

$$(X_\tau, X_T) = (q_\tau, r_\tau, m_{x+\tau}, q_T, r_T, m_{x+T}) \tag{11.58}$$

$$\sim \mathcal{N}\left(\begin{pmatrix} 4.64\\ 0.02\\ 0.01\\ 4.71\\ 0.02\\ 0.03 \end{pmatrix},\;
\begin{pmatrix}
3.2\cdot 10^{-2} & -4.8\cdot 10^{-4} & 1.3\cdot 10^{-5} & 3.1\cdot 10^{-2} & -1.4\cdot 10^{-5} & 3.6\cdot 10^{-5}\\
-4.8\cdot 10^{-4} & 7.9\cdot 10^{-5} & -4.4\cdot 10^{-7} & -1.7\cdot 10^{-4} & 2.4\cdot 10^{-6} & -1.2\cdot 10^{-6}\\
1.3\cdot 10^{-5} & -4.4\cdot 10^{-7} & 1.5\cdot 10^{-6} & 1.2\cdot 10^{-5} & -1.3\cdot 10^{-8} & 4.1\cdot 10^{-6}\\
3.1\cdot 10^{-2} & -1.7\cdot 10^{-4} & 1.2\cdot 10^{-5} & 4.5\cdot 10^{-1} & -1.3\cdot 10^{-3} & 3.0\cdot 10^{-4}\\
-1.4\cdot 10^{-5} & 2.4\cdot 10^{-6} & -1.3\cdot 10^{-8} & -1.3\cdot 10^{-3} & 2.0\cdot 10^{-4} & -2.5\cdot 10^{-6}\\
3.6\cdot 10^{-5} & -1.2\cdot 10^{-6} & 4.1\cdot 10^{-6} & 3.0\cdot 10^{-4} & -2.5\cdot 10^{-6} & 7.4\cdot 10^{-5}
\end{pmatrix}\right).$$

Under the model specification of Ha–Bauer [177], one can furthermore work out the
discount factor and the annuity. Define for t ≥ 0 and k > 0 the affine term structure

$$F(t, k; r_t, m_{x+t}) = \exp\left\{A(t, t+k) - B(t, t+k; \alpha)\, r_t - B(t, t+k; -\kappa)\, m_{x+t}\right\},$$

with deterministic functions

$$B(t, t+k; \alpha) = \frac{1 - e^{-\alpha k}}{\alpha},$$

$$\begin{aligned}
A(t, t+k) &= \bar\gamma\left(B(t, t+k; \alpha) - k\right) + \frac{\sigma_r^2}{2\alpha^2}\left(k - 2B(t, t+k; \alpha) + B(t, t+k; 2\alpha)\right)\\
&\quad + \frac{\psi^2}{2\kappa^2}\left(k - 2B(t, t+k; -\kappa) + B(t, t+k; -2\kappa)\right)\\
&\quad + \frac{\varrho_{2,3}\, \sigma_r \psi}{\alpha\kappa}\left(B(t, t+k; -\kappa) - k + B(t, t+k; \alpha) - B(t, t+k; \alpha - \kappa)\right),
\end{aligned}$$

with parameters for the short rate process α = 25%, σ_r = 1%, for the force of
mortality κ = 7%, ψ = 0.12%, the correlation between the short rate and the force
of mortality ϱ_{2,3} = −4%, and with market-price of risk-adjusted mean reversion

Fig. 11.23 Marginal densities of the VA account value e^{q_T} and the GMIB value b a_{x+T}(r_T, m_{x+T}) (density against log scale)

level γ̄ = 1.92% of the short rate process. These formulas can be retrieved because
we work under an affine Gaussian structure. The discount factor is then given by

$$D(\tau, T; X_\tau) = F(\tau, T - \tau; r_\tau, m_{x+\tau}),$$

and the annuity is determined by (we cap at age 55 + 50 = 105)

$$a_{x+T}(r_T, m_{x+T}) = \sum_{k=1}^{50} F(T, k; r_T, m_{x+T}).$$

Moreover, we set the face value to b = 10.79205. This parametrization implies that
the VA account value e^{q_T} exceeds the GMIB b a_{x+T}(r_T, m_{x+T}) with a probability
of roughly 40%, i.e., in roughly 60% of the cases the GMIB option is exercised.
Figure 11.23 shows the marginal densities of these two variables; moreover, their
correlation is close to 0.
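The affine term structure is straightforward to implement; the following R sketch codes B, A and F with the parameters stated above (the function names and the illustrative evaluation points r_τ = 0.02 and m_{x+τ} = 0.01, the time-τ means in (11.58), are our own choices):

# parameters as stated in the text
alpha <- 0.25; sigma_r <- 0.01        # short rate process
kappa <- 0.07; psi     <- 0.0012      # force of mortality process
rho23 <- -0.04                        # correlation short rate / mortality
gamma_bar <- 0.0192                   # risk-adjusted mean reversion level

# B and A only depend on the time lag k in this model
B <- function(k, a) (1 - exp(-a * k)) / a
A <- function(k) {
  gamma_bar * (B(k, alpha) - k) +
    sigma_r^2 / (2 * alpha^2) * (k - 2 * B(k, alpha) + B(k, 2 * alpha)) +
    psi^2 / (2 * kappa^2) * (k - 2 * B(k, -kappa) + B(k, -2 * kappa)) +
    rho23 * sigma_r * psi / (alpha * kappa) *
      (B(k, -kappa) - k + B(k, alpha) - B(k, alpha - kappa))
}
F_zc <- function(k, r, m) exp(A(k) - B(k, alpha) * r - B(k, -kappa) * m)

# discount factor D(tau, T; X_tau) and annuity value (capped at age 105)
tau <- 1; T_mat <- 15
F_zc(T_mat - tau, r = 0.02, m = 0.01)                  # D(1, 15) illustration
annuity <- function(r, m) sum(sapply(1:50, F_zc, r = r, m = m))
annuity(r = 0.02, m = 0.03)                            # a_{x+T}(r_T, m_{x+T})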
The model is now fully specified so that we can estimate the conditional expectation
in (11.57) as a function of X_τ. We therefore simulate n = 3'000'000 i.i.d. Gaussian
observations (X_τ^{(i)}, X_T^{(i)}), 1 ≤ i ≤ n, from (11.58). This provides us with the
observations

$$Y_i = D(\tau, T; X_\tau^{(i)})\, S(X_T^{(i)}) = F(\tau, T-\tau; r_\tau^{(i)}, m_{x+\tau}^{(i)})\, \max\left\{e^{q_T^{(i)}},\; b \sum_{k=1}^{50} F(T, k; r_T^{(i)}, m_{x+T}^{(i)})\right\}.$$

The resulting data (Y_i, X_τ^{(i)})_{1≤i≤n} is used for determining the regression function
μ(·) in (11.57). We choose n = 3'000'000 samples in line with the least squares
Monte Carlo approximation of Ha–Bauer [177].

We choose a FN network of depth d = 3 for approximating μ(·). For the three FN
layers we choose (q_1, q_2, q_3) = (20, 15, 10) neurons with the hyperbolic tangent
activation function, and as output activation we choose the identity function; we
choose a more complex network compared to Cheridito et al. [74] because it seems
that this gives us more accurate results. We fit this FN network using the square loss
function; the square loss is motivated by (11.55). Furthermore, we average over 20
runs with different seeds. Thus, we receive 20 fitted FN networks μ̂_k(·) for the 20
different seeds 1 ≤ k ≤ 20, and the nagging predictor is obtained by averaging

$$\bar\mu(\cdot) = \frac{1}{20} \sum_{k=1}^{20} \hat\mu_k(\cdot).$$

We then generate new i.i.d. samples X_τ^{(l)}, 1 ≤ l ≤ L, from the multivariate Gaussian
distribution (11.58), where this time we only need the first 3 components. This gives
us the empirical samples

$$\bar\mu(X_\tau^{(l)}) \qquad \text{for } 1 \le l \le L, \tag{11.59}$$

providing an empirical distribution F̂_{μ̄(X_τ)} that approximates the distribution of
μ(X_τ), given in (11.57). In risk management and solvency analysis, this empirical
distribution can be used to estimate the Value-at-Risk (VaR) and the (upper)
conditional tail expectation (CTE) of the valuation μ(X_τ), seen from time 0, on
different safety levels p ∈ (0, 1)

$$\widehat{\mathrm{VaR}}_p = \hat F^{-1}_{\bar\mu(X_\tau)}(p) = \inf\left\{y \in \mathbb{R};\; \hat F_{\bar\mu(X_\tau)}(y) \ge p\right\},$$

and

$$\widehat{\mathrm{CTE}}_p = E_{\hat F_{\bar\mu(X_\tau)}}\left[\left.\bar\mu(X_\tau)\,\right|\, \bar\mu(X_\tau) > \widehat{\mathrm{VaR}}_p\right].$$

We also refer to Sect. 11.3. The VaR and the CTE are two commonly used risk
measures in insurance practice that determine the necessary risk bearing capital to
run the corresponding insurance business. Typically, the VaR is evaluated at p =
99.5%, i.e., we allow for a default probability of 0.5% of not being able to cover
the changes in valuation over a τ = 1 year time horizon. Alternatively, the CTE is
considered at p = 99%, which means that we need sufficient capital to cover on
average the 1% worst changes in valuation over a 1 year time horizon.
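Given the matrix of network predictions on the newly simulated features, the nagging predictor (11.59) and the two empirical risk measures are obtained in a few lines of R; in this sketch, pred_matrix is an assumed L × 20 matrix holding the predictions μ̂_k(X_τ^{(l)}) of the 20 fitted networks:

# nagging predictor (11.59): average the 20 seeds row-wise
mu_bar <- rowMeans(pred_matrix)

# empirical VaR: left-continuous generalized inverse of the empirical cdf
VaR <- function(x, p) quantile(x, probs = p, type = 1, names = FALSE)
# empirical (upper) conditional tail expectation
CTE <- function(x, p) mean(x[x > VaR(x, p)])

VaR(mu_bar, 0.995)   # compare to 140.97 in Fig. 11.24
CTE(mu_bar, 0.995)   # compare to 145.09
CTE(mu_bar, 0.99)    # compare to 141.49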
Figure 11.24 shows our FN network approximations. The boxplots show the
individual results of the estimates μ̂_k(·) for the 20 different seeds, and the horizontal
lines show the results of the nagging predictor (11.59). The red line at 140.97
gives the estimated VaR for p = 99.5%; this value is slightly bigger than the best
estimate of 139.47 (orange line) in Ha–Bauer [177], which is based on a functional
approximation involving 37 monomials and 40'000'000 simulated samples. The CTEs
at p = 99.5% and p = 99% are given by 145.09 and 141.49, respectively. We conclude
that in the present example the estimated VaR_{99.5%} (used in Europe) and CTE_{99%}
(used in Switzerland) are approximately of the same size for this VA with a GMIB.

Fig. 11.24 Resulting estimated VaR_{99.5%} = 140.97 (red), CTE_{99.5%} = 145.09 (green) and CTE_{99%} = 141.49 (blue); the orange line at 139.47 gives the result of Ha–Bauer [177] for the 99.5% VaR

This example shows how problems can be solved that require the computation
of a conditional expectation. Alternatively, we could explore the LocalGLMnet
architecture, which would allow us to explain the conditional expectation more
explicitly in terms of the information X_τ available at time τ. This may also be
relevant in practice because it allows one to determine the main risk drivers of the
underlying insurance business.
Figure 11.25 shows the marginal densities of the components of X_τ =
(q_τ, r_τ, m_{x+τ}) in blue color. In red color we show the corresponding conditional
densities of X_τ, conditioned on μ̄(X_τ) exceeding the estimated 99.5% VaR; thus,
these are the feature values X_τ that lead to a shortfall beyond the 99.5% VaR of
μ̄(X_τ). From this figure we conclude that the main driver of the VaR is the VA
account variable q_τ, whereas the short rate r_τ and the force of mortality m_{x+τ} are
slightly lower beyond the VaR compared to their unconditional counterparts. The
explanation for these smaller values is that they lead to less discounting and, hence,
to bigger GMIB values. This is useful information for exploring importance sampling
as mentioned in Remarks 11.17. This closes the example. ∎

Fig. 11.25 Feature values X_τ triggering the VaR at the 99.5% level: (lhs) VA account log-value q_τ, (middle) short rate r_τ, and (rhs) force of mortality m_{x+τ}; blue color shows the full density and red color shows the conditional density conditioned on being above the 99.5% VaR of μ̄(X_τ)

11.6.3 Bayesian Networks: An Outlook

This section provides a short introduction to Bayesian networks and to variational
inference. We see this section as a motivation for doing more research in that
direction. In Sect. 11.4 we have assessed model uncertainty through bootstrapping.
Alternatively, we could take a Bayesian viewpoint. We start from a fixed network
architecture that involves a network parameter ϑ. The Bayesian approach considered
in Sect. 6.1 selects a prior density π(ϑ) on the space of network parameters
(w.r.t. a measure ν). For given data (Y, x) we can then calculate the posterior density
of ϑ by

$$\pi(\vartheta\,|\,Y, x) \propto f(Y, \vartheta\,|\,x) = f(Y\,|\,\vartheta, x)\, \pi(\vartheta). \tag{11.60}$$

A new data point Y† with feature x† has conditional density, given observation (Y, x),

$$f\left(y^\dagger \,\big|\, x^\dagger; Y, x\right) = \int_{\vartheta} f\left(y^\dagger \,\big|\, \vartheta, x^\dagger\right) \pi(\vartheta\,|\,Y, x)\, d\nu(\vartheta),$$

supposed that (Y, x) and (Y†, x†) are conditionally independent, given ϑ. Thus, it
only remains to determine the posterior density (11.60) of the network
parameter ϑ. Unfortunately, this is a rather challenging problem because of the
parameter ϑ. Unfortunately, this is a rather challenging problem because of the
curse of dimensionality, and even advanced MCMC methods, such as HMC, often
do not lead to satisfactory results (convergence), for MCMC we refer to Section 6.1.
For this reason one often explores approximate inference methods, see, e.g.,
Chapter 10 of Bishop [36] or the tutorial of Jospin et al. [205]. A scalable version
is to approximate the posterior density using the so-called method of variational
inference. This is presented in the following.
Choose a family F = {q(·; θ); θ ∈ Θ} of (more tractable) densities that have
the same support as the prior π(·), being parametrized by θ ∈ Θ ⊂ R^K. This
family F is called the set of variational distributions, and the goal is to find the
variational density q(·; θ) ∈ F that is closest to the posterior density (11.60).
To evaluate the similarity between two densities, we use the KL divergence which
analyzes the divergence from π(·|Y, x) to q(·; θ), given by

$$D_{\mathrm{KL}}\left(q(\cdot; \theta)\,\big\|\, \pi(\cdot\,|\,Y, x)\right) = \int_{\vartheta} q(\vartheta; \theta)\, \log\left(\frac{q(\vartheta; \theta)}{\pi(\vartheta\,|\,Y, x)}\right) d\nu(\vartheta).$$

The optimal approximation within F, for given data (Y, x), is found by solving

$$\hat\theta = \hat\theta(Y, x) = \underset{\theta \in \Theta}{\arg\min}\; D_{\mathrm{KL}}\left(q(\cdot; \theta)\,\big\|\, \pi(\cdot\,|\,Y, x)\right);$$
for the moment we neglect existence and uniqueness questions. A main difficulty is
the computation of this KL divergence because it involves the intractable posterior

density of ϑ, given (Y, x). We modify the optimization problem such that we can
circumvent the explicit calculation of this KL divergence.
Lemma 11.19 We have the following identity

$$\log f(Y|x) = \mathcal{E}(\theta\,|\,Y, x) + D_{\mathrm{KL}}\left(q(\cdot; \theta)\,\big\|\, \pi(\cdot\,|\,Y, x)\right),$$

for the (unconditional) density f(y|x) = ∫_ϑ f(y|ϑ, x) π(ϑ) dν(ϑ) and the so-called evidence lower bound (ELBO)

$$\mathcal{E}(\theta\,|\,Y, x) = \int_{\vartheta} q(\vartheta; \theta)\, \log\left(\frac{f(Y, \vartheta\,|\,x)}{q(\vartheta; \theta)}\right) d\nu(\vartheta).$$

Observe that the left-hand side in the statement of Lemma 11.19 is independent of
θ ∈ Θ. Therefore, minimizing the KL divergence in θ is equivalent to maximizing
the ELBO in θ. This follows exactly the same philosophy as the EM algorithm,
see (6.32); in fact, the ELBO E plays the role of the functional Q defined in (6.33).
Proof of Lemma 11.19 We start from the left-hand side of the statement

$$\begin{aligned}
\log f(Y|x) &= \int_{\vartheta} q(\vartheta; \theta)\, \log f(Y|x)\, d\nu(\vartheta) = \int_{\vartheta} q(\vartheta; \theta)\, \log\left(\frac{f(Y, \vartheta|x)}{\pi(\vartheta|Y, x)}\right) d\nu(\vartheta)\\
&= \int_{\vartheta} q(\vartheta; \theta)\, \log\left(\frac{f(Y, \vartheta|x)/q(\vartheta; \theta)}{\pi(\vartheta|Y, x)/q(\vartheta; \theta)}\right) d\nu(\vartheta)\\
&= \mathcal{E}(\theta|Y, x) + D_{\mathrm{KL}}\left(q(\cdot; \theta)\,\big\|\, \pi(\cdot\,|\,Y, x)\right).
\end{aligned}$$

This proves the claim. ∎



The ELBO provides the lower bound (also called variational lower bound)

$$\log f(Y|x) \ge \sup_{\theta \in \Theta} \mathcal{E}(\theta\,|\,Y, x).$$

Interestingly, the ELBO does not include the posterior density, but only the joint
density of Y and ϑ, given x, which is assumed to be known (available). It can be
rewritten as

$$\mathcal{E}(\theta|Y,x) = \int_{\vartheta} q(\vartheta;\theta)\, \log f(Y,\vartheta|x)\, d\nu(\vartheta) - \int_{\vartheta} q(\vartheta;\theta)\, \log q(\vartheta;\theta)\, d\nu(\vartheta) = E_{q(\cdot;\theta)}\left[\left.\log f(Y,\vartheta|x)\,\right|\,Y,x\right] - E_{q(\cdot;\theta)}\left[\log q(\vartheta;\theta)\right],$$

the first term being the expected joint log-likelihood of (Y, ϑ) under the variational
density ϑ ∼ q(·; θ), and the second term being the entropy of the variational density.

The optimal approximation within F for given data (Y, x) is then found by
solving

$$\hat\theta = \hat\theta(Y, x) = \underset{\theta\in\Theta}{\arg\max}\; \mathcal{E}(\theta\,|\,Y, x).$$

That is, we try to simultaneously maximize the expected joint log-likelihood of
(Y, ϑ) and the entropy over all variational densities q(·; θ) in F.
If we have multiple observations D = {(Y_i, x_i); 1 ≤ i ≤ n} that are
conditionally i.i.d., given ϑ, we have to solve (we use conditional independence)

$$\begin{aligned}
\hat\theta &= \underset{\theta\in\Theta}{\arg\max}\; \mathcal{E}(\theta\,|\,\mathcal{D})\\
&= \underset{\theta\in\Theta}{\arg\max}\; E_{q(\cdot;\theta)}\left[\left.\log\left(\pi(\vartheta)\prod_{i=1}^n f(Y_i\,|\,\vartheta, x_i)\right)\right|\,\mathcal{D}\right] - E_{q(\cdot;\theta)}\left[\log q(\vartheta;\theta)\right]\\
&= \underset{\theta\in\Theta}{\arg\max}\; \sum_{i=1}^n E_{q(\cdot;\theta)}\left[\left.\log f(Y_i\,|\,\vartheta, x_i)\,\right|\,Y_i, x_i\right] - E_{q(\cdot;\theta)}\left[\log\left(\frac{q(\vartheta;\theta)}{\pi(\vartheta)}\right)\right]\\
&= \underset{\theta\in\Theta}{\arg\max}\; \sum_{i=1}^n E_{q(\cdot;\theta)}\left[\left.\log f(Y_i\,|\,\vartheta, x_i)\,\right|\,Y_i, x_i\right] - D_{\mathrm{KL}}\left(q(\cdot;\theta)\,\big\|\,\pi\right).
\end{aligned}$$

Typically, one solves this problem with gradient ascent methods, which requires
the calculation of the gradient ∇_θ of the objective function on the right-hand side. This
is more difficult than plain vanilla gradient descent in network fitting because θ
enters the expectation operator E_{q(·;θ)}.
Kingma–Welling [217] propose to use the following reparametrization trick.
Assume that we can receive the random variable ϑ ∼ q(·; θ) by a reparametrization
ϑ = t(ε, θ) (equality in distribution) for some smooth function t, and where ε ∼ p
does not depend on θ. E.g., if ϑ is multivariate Gaussian with mean μ and covariance
matrix AA^⊤, then ϑ = μ + Aε in distribution for ε being standard multivariate
Gaussian. Under the assumption that the reparametrization trick works for the family
F = {q(·; θ); θ ∈ Θ} we arrive at, for ε ∼ p,


$$\begin{aligned}
\hat\theta &= \underset{\theta\in\Theta}{\arg\max}\; \mathcal{E}(\theta\,|\,\mathcal{D})\\
&= \underset{\theta\in\Theta}{\arg\max}\; \sum_{i=1}^n \left( E_{p}\!\left[\left.\log f\big(Y_i\,\big|\,t(\epsilon,\theta), x_i\big)\,\right|\,Y_i, x_i\right] - \frac{1}{n}\, E_{p}\!\left[\log\!\left(\frac{q(t(\epsilon,\theta);\theta)}{\pi(t(\epsilon,\theta))}\right)\right]\right)\\
&= \underset{\theta\in\Theta}{\arg\max}\; \sum_{i=1}^n E_{p}\!\left[\left.\log\!\left(\frac{f(Y_i\,|\,t(\epsilon,\theta), x_i)\, \pi(t(\epsilon,\theta))^{1/n}}{q(t(\epsilon,\theta);\theta)^{1/n}}\right)\right|\,Y_i, x_i\right].
\end{aligned} \tag{11.61}$$

The gradient of the ELBO is then given by (supposing we can exchange E_p and ∇_θ)

$$\nabla_\theta\, \mathcal{E}(\theta\,|\,\mathcal{D}) = \sum_{i=1}^n E_{p}\!\left[\left.\nabla_\theta \log\!\left(\frac{f(Y_i\,|\,t(\epsilon,\theta), x_i)\, \pi(t(\epsilon,\theta))^{1/n}}{q(t(\epsilon,\theta);\theta)^{1/n}}\right)\right|\,Y_i, x_i\right].$$

These expected gradients are calculated empirically using Monte Carlo methods.
Sample i.i.d. observations ε^{(i,j)} ∼ p, 1 ≤ i ≤ n and 1 ≤ j ≤ m, and consider the
empirical approximation

$$\nabla_\theta\, \mathcal{E}(\theta\,|\,\mathcal{D}) \approx \sum_{i=1}^n \frac{1}{m}\sum_{j=1}^m \nabla_\theta \log\!\left(\frac{f\big(Y_i\,\big|\,t(\epsilon^{(i,j)},\theta), x_i\big)\, \pi\big(t(\epsilon^{(i,j)},\theta)\big)^{1/n}}{q\big(t(\epsilon^{(i,j)},\theta);\theta\big)^{1/n}}\right). \tag{11.62}$$

Using this empirical approximation we can use gradient ascent methods to estimate
θ̂; this is known as the stochastic gradient variational Bayes (SGVB) estimator, see
Sect. 2.4.3 of Kingma–Welling [217], or as Bayes by Backprop, see Blundell et al. [41]
and Jospin et al. [205].
Example 11.20 We consider the gradient (11.62) for an example from the EDF.
First, if n is sufficiently large, it often suffices to set m = 1, and we still receive
an accurate estimate. In that case we drop the index j, giving ε^{(i)}. Assume that the
(conditionally independent) observations Y_i belong to the same member of the EDF
having cumulant function κ. Moreover, assume that the (conditional) mean of Y_i,
given x_i, can be described by a FN network and a link function g such that, see (7.8),

$$\mu_i = \mu(x_i) = \mu_\vartheta(x_i) = g^{-1}\left\langle \beta, z_w^{(d:1)}(x_i)\right\rangle,$$

for network parameter ϑ = (β, w) ∈ R^r. In a Bayesian FN network this network
parameter is not fixed but rather acts as a latent variable. In (11.62) this latent
variable is for realization i given by (using the reparametrization trick) ϑ =
t(ε^{(i)}; θ) ∈ R^r; note that θ is not the canonical parameter, here. Thus, we receive the
conditional mean of Y_i, given ε^{(i)} and x_i,

$$\mu_i = \mu_{t(\epsilon^{(i)};\theta)}(x_i) = g^{-1}\left\langle \beta(\epsilon^{(i)};\theta),\; z_{w(\epsilon^{(i)};\theta)}^{(d:1)}(x_i)\right\rangle,$$

with network parameter ϑ(ε^{(i)}; θ) = (β(ε^{(i)}; θ), w(ε^{(i)}; θ)) = t(ε^{(i)}, θ) ∈ R^r.
Maximizing the ELBO implies that we need to calculate the gradients w.r.t. θ. First,
we calculate the gradient w.r.t. the network parameter ϑ of the data log-likelihood

$$\nabla_\vartheta \log f(Y_i\,|\,\vartheta, x_i) = \nabla_\vartheta\, \ell_{Y_i}(\vartheta) \in \mathbb{R}^r.$$

This gradient is calculated with back-propagation; we refer to (7.16) and Proposition 7.5.
There remains the chain rule for evaluating the inner derivative coming

from the reparametrization trick θ ∈ Θ ⊂ R^K ↦ ϑ = t(ε^{(i)}; θ) ∈ R^r. Consider
the Jacobian matrix

$$J(\theta; \epsilon^{(i)}) = \left(\frac{\partial}{\partial \theta_k}\, t_j(\epsilon^{(i)}; \theta)\right)_{1\le j\le r,\; 1\le k\le K} \in \mathbb{R}^{r\times K}.$$

This gives us the gradient w.r.t. θ

$$\nabla_\theta \log f\big(Y_i\,\big|\,t(\epsilon^{(i)},\theta), x_i\big) = J(\theta; \epsilon^{(i)})^\top\, \nabla_\vartheta\, \ell_{Y_i}(\vartheta)\Big|_{\vartheta = t(\epsilon^{(i)},\theta)} \in \mathbb{R}^K. \tag{11.63}$$

The prior distribution is often taken to be multivariate Gaussian with prior mean
τ ∈ R^r and (symmetric and positive definite) prior covariance matrix T ∈ R^{r×r},
thus,

$$\pi(\vartheta) = \left((2\pi)^{r/2}\, |T|^{1/2}\right)^{-1} \exp\left\{-\frac{1}{2}(\vartheta - \tau)^\top T^{-1} (\vartheta - \tau)\right\}.$$

This implies for the gradient w.r.t. θ for the prior

$$\nabla_\theta \log \pi\big(t(\epsilon^{(i)}, \theta)\big) = -J(\theta; \epsilon^{(i)})^\top\, T^{-1}\left(t(\epsilon^{(i)}, \theta) - \tau\right) \in \mathbb{R}^K.$$

There remains the choice of the family F = {q(·; θ); θ ∈ Θ} of variational densities
such that the reparametrization trick works. This is discussed in the remainder. ∎
We briefly discuss the most popular and simplest family chosen for the variational
distributions F. This family is the so-called mean field Gaussian variational
family, meaning that all components of ϑ ∈ R^r are assumed to be independent
Gaussian, that is,

$$q(\vartheta; \theta) = \prod_{j=1}^r \frac{1}{\sqrt{2\pi\sigma_j^2}}\, \exp\left\{-\frac{1}{2\sigma_j^2}\left(\vartheta_j - \mu_j\right)^2\right\},$$

for θ = (μ_1, σ_1, ..., μ_r, σ_r) ∈ R^K with K = 2r and with σ_j > 0 for all 1 ≤ j ≤ r.
This allows us to apply the reparametrization trick

$$\vartheta \,\stackrel{(d)}{=}\, t(\epsilon, \theta) = \mu + \mathrm{diag}(\sigma_1, \ldots, \sigma_r)\, \epsilon = \begin{pmatrix} \mu_1 + \sigma_1 \epsilon_1\\ \vdots\\ \mu_r + \sigma_r \epsilon_r \end{pmatrix},$$

with r-dimensional standard Gaussian variable ε ∼ N(0, 1). The Jacobian matrix
is

$$J(\theta; \epsilon) = \begin{pmatrix}
1 & \epsilon_1 & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & 1 & \epsilon_2 & \cdots & 0 & 0\\
\vdots & & & & \ddots & & \vdots\\
0 & 0 & 0 & 0 & \cdots & 1 & \epsilon_r
\end{pmatrix} \in \mathbb{R}^{r\times K}.$$

The mean field Gaussian case provides the entropy of the variational distribution

$$-E_{q(\cdot;\theta)}\left[\log q(\vartheta; \theta)\right] = \sum_{j=1}^r \left(\frac{1}{2}\log(2\pi\sigma_j^2) + \frac{1}{2}\right) = \sum_{j=1}^r \log\left(\sqrt{2\pi e}\; \sigma_j\right).$$

This mean field Gaussian variational inference can be implemented with the R
package tfprobability of Keydana et al. [212] and an explicit example is
given in Kuo [230].
Example 11.20, Revisited Working under the assumptions of Example 11.20 and
additionally assuming that the family of variational distributions F is multivariate
Gaussian, q(·; θ) = N(μ, Σ), leads us after some calculation to (the well-known
formula)

$$D_{\mathrm{KL}}\left(q(\cdot;\theta)\,\big\|\,\pi\right) = \frac{1}{2}\left(\log\frac{|T|}{|\Sigma|} - r + \mathrm{trace}\left(T^{-1}\Sigma\right) + (\tau - \mu)^\top T^{-1} (\tau - \mu)\right).$$

This further simplifies if T and Σ are diagonal, the latter being the mean field
Gaussian case. The remaining terms of the ELBO are treated empirically as
in (11.63). ∎
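For diagonal T and Σ, the above KL divergence reduces to sums over the individual variances; a small R sketch (the function name kl_gauss_diag is ours):

# D_KL( N(mu, diag(sigma2)) || N(tau, diag(t2)) ); sigma2, t2 are variance vectors
kl_gauss_diag <- function(mu, sigma2, tau, t2) {
  0.5 * (sum(log(t2)) - sum(log(sigma2)) - length(mu) +
           sum(sigma2 / t2) + sum((tau - mu)^2 / t2))
}
kl_gauss_diag(mu = c(0.1, -0.2), sigma2 = c(1, 1), tau = c(0, 0), t2 = c(1, 1))
# [1] 0.025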

This section has provided a short introduction to uncertainty estimation in
networks using Bayesian methods. We believe that this gives a promising outlook,
but certainly more theoretical and practical work is needed before these methods
become useful in practical applications.

Chapter 12
Appendix A: Technical Results on Networks

The reader may have noticed that for GLMs we have developed an asymptotic
theory that allowed us to assess the quality of predictors as well as it allowed us to
validate the fitted models. For networks there does not exist such a theory, yet, and
the purpose of this appendix is to present more technical results on the asymptotic
behavior of FN networks and their estimators that may lead to an asymptotic
theory. This appendix hopefully stimulates further research in this field of statistical
modeling.

12.1 Universality Theorems

We present a specific version of the universality theorems for shallow FN networks;


we refer to the discussion in Sect. 7.2.2. This section follows Hornik et al. [192].
Choose an input dimension q_0 ∈ N and consider the set of all affine functions

$$\mathcal{A}^{q_0} = \left\{A: \{1\}\times\mathbb{R}^{q_0} \to \mathbb{R};\; x \mapsto A(x) = \langle w, x\rangle,\; w \in \mathbb{R}^{q_0+1}\right\},$$

we add a 0th component in feature x = (x_0 = 1, x_1, ..., x_{q_0}) ∈ {1} × R^{q_0} for the
intercept. Choose a measurable (activation) function φ: R → R and define

$$\Sigma^{q_0}(\phi) = \left\{f: \{1\}\times\mathbb{R}^{q_0} \to \mathbb{R};\; x \mapsto f(x) = \sum_{j=0}^{q_1} \beta_j\, \phi(A_j(x)),\; A_j \in \mathcal{A}^{q_0},\, \beta_j \in \mathbb{R},\, q_1 \in \mathbb{N}\right\}.$$


This is the set of all shallow FN networks f(x) = ⟨β, z^{(1:1)}(x)⟩ with activation
function φ and the linear output activation, see (7.8); the intercept component of
the output is integrated into the 0th component j = 0. Moreover, we define the
networks

$$\Sigma\Pi^{q_0}(\phi) = \Bigg\{f: \{1\}\times\mathbb{R}^{q_0} \to \mathbb{R};\; x \mapsto f(x) = \sum_{j=0}^{q_1} \beta_j \prod_{k=1}^{l_j} \phi(A_{j,k}(x)),\; A_{j,k} \in \mathcal{A}^{q_0},\, \beta_j \in \mathbb{R},\, l_j \in \mathbb{N},\, q_1 \in \mathbb{N}\Bigg\}.$$

The latter networks contain the former, Σ^{q_0}(φ) ⊂ ΣΠ^{q_0}(φ), by setting l_j = 1 for
all 0 ≤ j ≤ q_1. We are going to prove a universality theorem first for the networks
ΣΠ^{q_0}(φ), and afterwards for the shallow FN networks Σ^{q_0}(φ).
Definition 12.1 The function φ : R → [0, 1] is called a squashing function if it is
non-decreasing with limx→−∞ φ(x) = 0 and limx→∞ φ(x) = 1.
Since squashing functions can have at most countably many discontinuities,
they are measurable; a continuous and a non-continuous example are given by the
sigmoid and by the step function activation, respectively, see Table 7.1.
Lemma 12.2 The sigmoid activation function is Lipschitz with constant 1/4.
Proof The derivative of the sigmoid function is given by φ′ = φ(1 − φ). This
provides for the second derivative φ″ = φ′ − 2φφ′ = φ′(1 − 2φ). The latter is zero
for φ(x) = 1/2. This says that the maximal slope of φ is attained for x = 0 and it
is φ′(0) = 1/4. ∎
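A quick numerical check of Lemma 12.2 in R, using that plogis is the sigmoid function:

x <- seq(-10, 10, by = 0.001)
max(plogis(x) * (1 - plogis(x)))   # maximal slope phi'(x), attained at x = 0
# [1] 0.25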

We denote by C(R^{q_0}) the set of all continuous functions from {1} × R^{q_0} to
R, and by M(R^{q_0}) the set of all measurable functions from {1} × R^{q_0} to R. If
the measurable activation function φ is continuous, we have ΣΠ^{q_0}(φ) ⊂ C(R^{q_0}),
otherwise ΣΠ^{q_0}(φ) ⊂ M(R^{q_0}).
Definition 12.3 A subset S ⊂ M(R^{q_0}) is said to be uniformly dense on compacta
in C(R^{q_0}) if for every compact subset K ⊂ {1} × R^{q_0} the set S is ρ_K-dense in
C(R^{q_0}), meaning that for all ε > 0 and all g ∈ C(R^{q_0}) there exists f ∈ S such that

$$\rho_K(g, f) = \sup_{x \in K} |g(x) - f(x)| < \epsilon.$$

Theorem 12.4 (Theorem 2.1 in Hornik et al. [192]) Assume φ is a non-constant
and continuous activation function. Then ΣΠ^{q_0}(φ) ⊂ C(R^{q_0}) is uniformly dense on
compacta in C(R^{q_0}).
Proof The proof is based on the Stone–Weierstrass theorem, which we briefly recall.
Assume A is a family of real functions defined on a set E. A is called an algebra if
it is closed under addition, multiplication and scalar multiplication. A family A
separates points in E if for every x, z ∈ E with x ≠ z there exists a function A ∈ A
with A(x) ≠ A(z). The family A does not vanish at any point of E if for all x ∈ E
there exists a function A ∈ A such that A(x) ≠ 0. Let A be an algebra of continuous
real functions on a compact set K. The Stone–Weierstrass theorem says that if A
separates points in K and if it does not vanish at any point of K, then A is ρ_K-dense
in the space of all continuous real functions on K.
Choose any compact set K ⊂ {1} × R^{q_0}. For any activation function φ, ΣΠ^{q_0}(φ)
is obviously an algebra. So there remains to prove that this algebra separates points
and does not vanish at any point. Firstly, choose x, z ∈ K such that x ≠ z. Since
φ is non-constant we can choose a, b ∈ R such that φ(a) ≠ φ(b). Next choose
A ∈ A^{q_0} such that A(x) = a and A(z) = b. Then, φ(A(x)) ≠ φ(A(z)) and
ΣΠ^{q_0}(φ) separates points. Secondly, since φ is non-constant, we can choose a ∈ R
such that φ(a) ≠ 0. Moreover, choose the weight w = (a, 0, ..., 0) ∈ R^{q_0+1}. Then
for this A ∈ A^{q_0}, A(x) = ⟨w, x⟩ = a for any x ∈ K. Hence, φ(A(x)) ≠ 0,
therefore ΣΠ^{q_0}(φ) does not vanish at any point of K. The claim then follows from
the Stone–Weierstrass theorem and using that φ is continuous by assumption. ∎

For Theorem 12.4 to hold, the activation function φ can be any continuous and
non-constant function, i.e., it does not need to be a squashing function. This is
fairly general, but it rules out the step function activation as it is not continuous.
However, for squashing functions continuity is not needed and one still receives
the uniformly-dense-on-compacta property of ΣΠ^{q_0}(φ) in C(R^{q_0}); this has been
proved in Theorem 2.3 of Hornik et al. [192]. The following theorem also does not
need continuity, i.e., we do not require Σ^{q_0}(φ) ⊂ C(R^{q_0}) as φ only needs to be
measurable (and squashing).
Theorem 12.5 (Universality, Theorem 2.4 in Hornik et al. [192]) Assume φ is a
squashing activation function. Σ^{q_0}(φ) is uniformly dense on compacta in C(R^{q_0}).
Sketch of Proof For the (continuous) cosine activation function choice cos(·),
Theorem 12.4 applies to ΣΠ^{q_0}(cos). Repeatedly applying the trigonometric identity
cos(a)cos(b) = [cos(a + b) + cos(a − b)]/2 allows us to rewrite any trigonometric
polynomial ∏_{k=1}^{l_j} cos(A_{j,k}(x)) as Σ_{t=1}^{T} α_t cos(A_t(x)) for suitable A_t ∈ A^{q_0},
α_t ∈ R and T ∈ N. This allows us to identify Σ^{q_0}(cos) = ΣΠ^{q_0}(cos). As a
consequence of Theorem 12.4, shallow FN networks Σ^{q_0}(cos) are uniformly dense
on compacta in C(R^{q_0}).
The remaining part relies on approximating the cosine activation function.
Firstly, Lemma A.2 of Hornik et al. [192] says that for any continuous squashing
function ψ and any ε > 0 there exists H(x) = Σ_{j=1}^{q_1} β_j φ(w_0^{(j)} + w_1^{(j)} x) ∈ Σ¹(φ),
x ∈ R, such that

$$\sup_{x \in \mathbb{R}} |\psi(x) - H(x)| < \epsilon. \tag{12.1}$$

For the proof we refer to Lemma A.2 of Hornik et al. [192]; it uses that ψ is a
continuous squashing function, implying that for every δ ∈ (0, 1) there exists m > 0

such that ψ(−m) < δ and ψ(m) > 1 − δ. The approximation H ∈ Σ¹(φ) of ψ is then
constructed on (−m, m) so that the error bound holds (for δ sufficiently small).
Secondly, choose ε > 0 and M > 0; there exists cos_{M,ε} ∈ Σ¹(φ) such that

$$\sup_{x \in [-M, M]} \left|\cos(x) - \cos_{M,\epsilon}(x)\right| < \epsilon. \tag{12.2}$$

This is Lemma A.3 of Hornik et al. [192]; to prove this, we consider the cosine
squasher of Gallant–White [150], for x ∈ R,

$$\chi(x) = \frac{1}{2}\left(1 + \cos\left(x + \frac{3\pi}{2}\right)\right) \mathbb{1}_{\{-\pi/2 \le x \le \pi/2\}} + \mathbb{1}_{\{x > \pi/2\}} \in [0, 1].$$

This is a continuous squashing function. Adding, subtracting and scaling a finite
number of affinely shifted versions of the cosine squasher χ can exactly replicate
the cosine on [−M, M]. Claim (12.2) then follows from the fact that we need a
finite number of cosine squashers χ to replicate the cosine on [−M, M], the triangle
inequality, and the fact that the (continuous) cosine squasher can be approximated
arbitrarily well in Σ¹(φ) using (12.1).
The final step is to patch everything together. Consider Σ_{t=1}^{T} α_t cos(A_t(x))
which approximates on the compact set K ⊂ {1} × R^{q_0} a given continuous
function g ∈ C(R^{q_0}) with a given tolerance ε/2. Choose M > 0 such that
A_t(K) ⊂ [−M, M] for all 1 ≤ t ≤ T. Note that this M can be found because
K is compact, the A_t are continuous and T is finite. Define T′ = Σ_{t=1}^{T} |α_t| < ∞.
By (12.2) we can then choose cos_{M,ε/(2T′)} ∈ Σ¹(φ) such that

$$\sup_{x \in K}\left|\sum_{t=1}^T \alpha_t \cos(A_t(x)) - \sum_{t=1}^T \alpha_t \cos_{M,\epsilon/(2T')}(A_t(x))\right| < \epsilon/2.$$

This completes the proof. ∎
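As a small empirical illustration of Theorem 12.5 (our own toy experiment, not part of the proof), a shallow sigmoid network can be fitted to the cosine function on a compact interval [−M, M]:

library(keras)
set.seed(1)
M <- 5
x <- matrix(seq(-M, M, length.out = 2000), ncol = 1)
y <- cos(x)

# shallow FN network in Sigma^1(phi) with sigmoid activation
model <- keras_model_sequential() %>%
  layer_dense(units = 50, activation = "sigmoid", input_shape = 1) %>%
  layer_dense(units = 1, activation = "linear")
model %>% compile(loss = "mse", optimizer = "adam")
model %>% fit(x, y, epochs = 500, batch_size = 200, verbose = 0)

max(abs(predict(model, x) - y))    # approximate uniform error on [-M, M]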


12.2 Consistency and Asymptotic Normality

Universality Theorem 12.5 tells us that we can approximate any compactly sup-
ported continuous function arbitrarily well by a sufficiently large shallow FN
network, say, with sigmoid activation function φ. The next natural question is
whether we can learn these approximations from data (Yi , x i )i≥1 that follow the true
but unknown regression function x → μ0 (x), or in other words whether we have
consistency for a certain class of learning methods. This is the question addressed,
e.g., in White [379, 380], Barron [26], Chen–Shen [73], Döhler–Rüschendorf [109]
and Shen et al. [336]. This turns the algebraic universality question into a statistical
question about consistency.

Assume that the true data model satisfies

Y = μ0 (x) + ε = E[Y |x] + ε, (12.3)

for a continuous regression function μ0 : X → R on a compact set X ⊂ {1} × Rq0 ,


and with a centered error ε satisfying E[|ε|2+δ ] < ∞ for some δ > 0 and being
independent of x. The question now is whether we can learn this (true) regression
function μ0 from independent data (Yi , x i ), 1 ≤ i ≤ n, obeying (12.3). Throughout
this section we use the square error loss function L(y, a) = (y − a)2 . For given
data, this results in solving

1 1
n n
μn = arg min
* L (Yi , μ(x i )) = arg min (Yi − μ(x i ))2 , (12.4)
μ∈C (X ) n i=1 μ∈C (X ) n i=1

where C(X) denotes the set of continuous functions on the compact set X ⊂
{1} × R^{q_0}. The main question is whether the estimator μ̃_n approaches the true
regression function μ0 for increasing sample size n.
Typically, the family of continuous functions C(X) is much too rich to be able to
solve the optimization problem (12.4), and the solution may have undesired properties.
In particular, the solution to (12.4) will over-fit to the data for any sample size
n, and consistency will not hold, see, e.g., Section 2.2.1 in Chen [72]. Therefore,
the optimization needs to be done over (well-chosen) smaller sets S_n ⊂ C(X).
For instance, S_n can be the set of shallow FN networks having a maximal width
q_1 = q_1(n), depending on the sample size n of the data. Considering this regression
problem in a non-parametric sense, we let these sets S_n grow with the sample size
n. This idea is attributed to Grenander [172] and it is called the method of sieve
estimators of μ0. We define for d ∈ N, Δ > 0, Δ̃ > 0 and activation function φ
$$\mathcal{S}(d, \Delta, \tilde\Delta, \phi) = \Bigg\{f \in \Sigma^{q_0}(\phi);\; q_1 = d,\; \sum_{j=0}^{q_1} |\beta_j| \le \Delta,\; \max_{1\le j\le q_1} \sum_{l=0}^{q_0} |w_{l,j}| \le \tilde\Delta\Bigg\}.$$

These sets S(d, Δ, Δ̃, φ) are shallow FN networks of a given width q_1 = d and with
some restrictions on the network parameters.¹ We then choose increasing sequences

¹ The bound Σ_{j=0}^{q_1} |β_j| ≤ Δ in S(d, Δ, Δ̃, φ) allows us to view this set of shallow FN networks
as a symmetric convex hull of the family of functions S_0(φ) = {x ↦ φ(A(x)); A ∈ A^{q_0}}, see
Sect. 2.6.3 in Van der Vaart–Wellner [364]. If we choose an increasing activation function φ, this
family of functions φ ∘ A is a composition of a fixed increasing function φ and a finite-dimensional
vector space A^{q_0} of functions A. This implies that S_0(φ) is a VC-class, saying that it has a finite
Vapnik–Chervonenkis (VC) dimension [365]; see also Condition A and Theorem 2.1 in Döhler–
Rüschendorf [109]. This VC-class property is important in many proofs as it leads to a finite
covering (metric entropy) of function spaces, and this allows one to apply limit theorems to point
processes; we refer to Van der Vaart–Wellner [364].

*n )n≥1 which provides us with an increasing sequence of


(dn )n≥1 , (n )n≥1 and (
sieves (becoming finer as n increases)

def.
*n , φ) ⊆ Sn+1 (φ) def.
. . . ⊆ Sn (φ) = S (dn , n ,  *n+1 , φ) ⊆ . . . .
= S (dn+1 , n+1 , 

The following corollary is a simple consequence of Theorem 12.5.
Corollary 12.6 Assume φ is a squashing activation function, and let the increasing
sequences (d_n)_{n≥1}, (Δ_n)_{n≥1} and (Δ̃_n)_{n≥1} tend to infinity for n → ∞. Then
∪_{n≥1} S_n(φ) is uniformly dense in C(X).
This corollary says that for any regression function μ0 ∈ C(X) we can find n ∈ N
and μ_n ∈ S_n(φ) such that μ_n is arbitrarily close to μ0; remark that all functions are
continuous on the compact set X, and uniformly dense means ρ_X-dense in that case.
Corollary 12.6 does not hold true if Δ_n ≡ Δ > 0 for all n. In that case we can only
approximate the smaller function class ∪_{n≥1} S_n(φ) ⊂ C(X). This is going to be
used in one of the cases below.
For increasing sequences (d_n)_{n≥1}, (Δ_n)_{n≥1} and (Δ̃_n)_{n≥1} we define the sieve
estimator (μ̂_n)_{n≥1} by

$$\hat\mu_n = \underset{\mu \in \mathcal{S}_n(\phi)}{\arg\min}\; \frac{1}{n}\sum_{i=1}^n L\left(Y_i, \mu(x_i)\right). \tag{12.5}$$

Under the following assumptions one can prove a consistency theorem.


Assumption 12.7 Choose a complete probability space (Ω, A, P)² and X = {1} ×
[0, 1]^{q_0}.
(1) Assume μ0 ∈ C(X). Assume (Y_i, X_i)_{i≥1} are i.i.d. on (Ω, A, P) following the
regression structure (12.3) with ε_i being centered, having E[|ε_i|^{2+δ}] < ∞ for
some δ > 0 and being independent of X_i. Set σ² = Var(ε_i) < ∞.
(2) The activation function φ is the sigmoid function.
(3) The sequences (d_n)_{n≥1}, (Δ_n)_{n≥1} and (Δ̃_n)_{n≥1} are increasing and tending to
infinity as n → ∞ with d_n Δ_n² log(d_n Δ_n) = o(n).
Most results that we are going to present below hold for activation functions that
are Lipschitz. The sigmoid activation function is Lipschitz, see Lemma 12.2.
The following considerations are based on the pseudo-norm, given (X_i)_{1≤i≤n},

$$\|\mu\|_n = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(\mu(X_i)\right)^2} \qquad \text{for } \mu \in C(\mathcal{X}).$$

² A probability space (Ω, A, P) is complete if for any P-null set B ∈ A with P[B] = 0 and every
subset A ⊂ B it follows that A ∈ A.

This is a pseudo-norm because it is positive, ∥μ∥_n ≥ 0, absolutely homogeneous,
∥aμ∥_n = |a| ∥μ∥_n, and the triangle inequality holds, but it is not definite because
∥μ∥_n = 0 does not imply that μ is the zero function (i.e., it is not point-separating).
This pseudo-norm ∥·∥_n depends on the (random) features (X_i)_{1≤i≤n} and, therefore,
the subsequent statements involving this pseudo-norm hold in probability. The
following result provides consistency, showing that the true regression function μ0,
indeed, can be learned from i.i.d. data.
Theorem 12.8 (Consistency, Theorem 3.1 of Shen et al. [336]) Under Assumption
12.7, the sieve estimator (μ̂_n)_{n≥1} in (12.5) exists. We have consistency
∥μ̂_n − μ0∥_n → 0 in probability as n → ∞, i.e., for all ε > 0

$$\lim_{n\to\infty} P\left[\|\hat\mu_n - \mu_0\|_n > \epsilon\right] = 0.$$

Remarks 12.9
• Such a consistency result for FN networks has first been proved in Theorem 3.3
of White [380], however, on slightly different spaces and under slightly different
assumptions. Similar consistency results have been obtained for related point
process situations by Döhler–Rüschendorf [109] and for time-series in White
[380] and Chen–Shen [73].
• Item (3) of Assumption 12.7 gives upper complexity bounds on shallow FN
networks as a function of the sample size n of the data, so that asymptotically
they do not over-fit to the data. These bounds allow for much freedom in the
choice of the growth rates, and different choices may lead to different speeds of
convergence. The conditions of Assumption 12.7 are, e.g., satisfied for Δ_n =
O(log n) and d_n = O(n^{1−δ′}), for any small δ′ > 0. Under these choices, the
complexity d_n of the shallow FN network grows rather quickly. Table 1 of White
[380] gives some examples: for instance, if for n = 100 data points we have a
shallow FN network with 5 neurons, then these magnitudes support 477 neurons
for n = 10'000 and 45'600 neurons for n = 1'000'000 data points (for the
specific choice δ′ = 0.01); see the numerical check after these remarks. Of course,
these numbers do not provide any practical guidance on the selection of the
(shallow) FN network size.
• Theorem 12.8 requires that we can explicitly calculate the sieve estimator
μ̂_n, i.e., the global minimizer of the objective function in (12.5). In practical
applications, relying on gradient descent algorithms, this is typically not the case.
Therefore, Theorem 12.8 is mainly of theoretical value, saying that learning the
true regression function μ0 is possible within FN networks.
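The numerical check announced in the second remark reproduces the neuron counts of Table 1 of White [380] in R, using the growth rate d_n = O(n^{1−δ′}) with δ′ = 0.01 calibrated to 5 neurons at n = 100:

# d_n = 5 * (n / 100)^(1 - 0.01), calibrated to 5 neurons at n = 100
d_n <- function(n) 5 * (n / 100)^0.99
round(d_n(c(1e2, 1e4, 1e6)))
# [1]     5   477 45600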
Sketch of Proof of Theorem 12.8 The proof of this theorem is based on a theorem
in White–Woolridge [381] which states: if we have a sequence (S_n(φ))_{n≥1} of
compact subsets of C(X), and if L_n: Ω × S_n(φ) → R is a A ⊗ B(S_n(φ))/B(R)-
measurable sequence, n ≥ 1, with L_n(ω, ·) being lower-semicontinuous on S_n(φ)
for all ω ∈ Ω, then there exists μ̂_n: Ω → S_n(φ) being A/B(S_n(φ))-measurable
such that for each ω ∈ Ω, L_n(ω, μ̂_n(ω)) = min_{μ∈S_n(φ)} L_n(ω, μ). For the proof of the
compactness of S_n(φ) in C(X) we need that d_n and Δ_n are finite for any n. This
then provides the existence of the sieve estimator; for details we refer to Lemma 2.1
and Corollary 2.1 in Shen et al. [336]. The proof of the consistency result then uses
the growth rates on (d_n)_{n≥1} and (Δ_n)_{n≥1}; for the details of the proof we refer to
Theorem 3.1 in Shen et al. [336]. ∎

The next step is to analyze the rate of convergence of the sieve estimator
μ̂_n → μ0, as n → ∞. These rates heavily depend on (additional) regularity
assumptions on the true regression function μ0 ∈ C(X); we refer to Remark 3
in Sect. 5 of Chen–Shen [73]. Here, we present some results of Shen et al. [336].
From the proof of Theorem 12.8 we know that S_n(φ) is a compact set in C(X). This
motivates to consider the closest approximation π_n μ ∈ S_n(φ) to μ ∈ C(X). The
uniform denseness of ∪_{n≥1} S_n(φ) in C(X) implies that π_n μ converges to μ. The
aforementioned rates of convergence of the sieve estimators will depend on how fast
π_n μ0 ∈ S_n(φ) converges to the true regression function μ0 ∈ C(X).
If one cannot determine the global minimum of (12.5), then often an accurate
approximation is sufficient. For this, one introduces an approximate sieve estimator.
A sequence (μ̂_n)_{n≥1} is called an approximate sieve estimator if

$$\frac{1}{n}\sum_{i=1}^n (Y_i - \hat\mu_n(X_i))^2 \le \inf_{\mu \in \mathcal{S}_n(\phi)} \frac{1}{n}\sum_{i=1}^n (Y_i - \mu(X_i))^2 + O_P(\eta_n), \tag{12.6}$$

where (η_n)_{n≥1} is a positive sequence converging to 0 as n → ∞. The last term
O_P(η_n) denotes stochastic boundedness, meaning that for all ε > 0 there exists K >
0 such that for all n ≥ 1

$$P\left[\frac{1}{n}\sum_{i=1}^n (Y_i - \hat\mu_n(X_i))^2 - \inf_{\mu \in \mathcal{S}_n(\phi)} \frac{1}{n}\sum_{i=1}^n (Y_i - \mu(X_i))^2 > K\, \eta_n\right] < \epsilon.$$

Theorem 12.10 (Theorem 4.1 of Shen et al. [336], Without Proof) Set Assumption
12.7. If

$$\eta_n = O\left(\min\left\{\|\pi_n \mu_0 - \mu_0\|_n^2,\; \frac{d_n \log(d_n \Delta_n)}{n},\; \frac{d_n \log n}{n}\right\}\right),$$

the following stochastic boundedness holds for n ≥ 1

$$\|\hat\mu_n - \mu_0\|_n = O_P\left(\max\left\{\|\pi_n \mu_0 - \mu_0\|_n,\; \sqrt{\frac{d_n \log n}{n}}\right\}\right).$$

Remarks 12.11
• Assumption 12.7 implies that d_n log(d_n Δ_n) = o(n) as n → ∞. Therefore, η_n →
0 as n → ∞.
• The statement in Theorem 4.1 of Shen et al. [336] is more involved because it
is stated under slightly different assumptions. Our assumptions are sufficient for
having consistency of the sieve estimator, see Theorem 12.8, and making these
assumptions implies that the rate of convergence in Theorem 12.10 is determined
by the rates of convergence of ∥π_n μ0 − μ0∥_n and (n^{−1} d_n log n)^{1/2}, see Remark 4.1
in Shen et al. [336].
• The rate of convergence in Theorem 12.10 crucially depends on the rate
∥π_n μ0 − μ0∥_n, as n → ∞. If μ0 lies in the (sub-)space of functions with
finite first absolute moments of the Fourier magnitude distributions, denoted by
F(X) ⊂ C(X), Makovoz [262] has shown that ∥π_n μ0 − μ0∥_n decays at least as
d_n^{−(q_0+1)/(2q_0)} = d_n^{−1/2−1/(2q_0)}; this has improved the rate d_n^{−1/2} obtained by
Barron [25]. This space F(X) allows for the choices d_n = (n/log n)^{q_0/(2q_0+1)},
Δ_n ≡ Δ > 0 and Δ̃_n ≡ Δ̃ > 0 to receive consistency and the following rate of
convergence, see Chen–Shen [73] and Remark 4.1 in Shen et al. [336],

$$\|\hat\mu_n - \mu_0\|_n = O_P(r_n^{-1}),$$

for

$$r_n = \left(\frac{n}{\log n}\right)^{(q_0+1)/(4q_0+2)}, \qquad n \ge 2. \tag{12.7}$$

Note that 1/4 ≤ (q_0+1)/(4q_0+2) ≤ 1/2. Thus, this is a slower rate than the
square root rule of typical asymptotic normality; for instance, for q_0 = 1 we get
1/3. Interestingly, Barron [26] proposes the choice d_n ∼ (n/log n)^{1/2} to receive
an approximation rate of (n/log n)^{−1/4}.
Also note that the space F(X) allows us to choose a finite Δ_n ≡ Δ > 0
in the sieves; thus, here we do not receive denseness of the sieves in the space
of continuous functions C(X), but only in the space of functions with finite first
absolute moments of the Fourier magnitude distributions F(X).
The last step is to establish the asymptotic normality. For this we have to define
perturbations of shallow FN networks μ ∈ S_n(φ). Choose η_n ∈ (0, 1) and define
the function

$$\tilde\mu_n(\mu) = (1 - \eta_n^{1/2})\, \mu + \eta_n^{1/2}\, (\mu_0 + 1).$$

This allows us to state the following asymptotic normality result.


Theorem 12.12 (Theorem 5.1 of Shen et al. [336], Without Proof) Set Assumption
12.7. We make the following additional assumptions: suppose η_n = o(n^{−1}) and
choose (ζ_n)_{n≥1} such that we have the stochastic boundedness ζ_n ∥μ̂_n − μ0∥_n = O_P(1).
Let the following conditions hold:
(C1) d_n Δ_n log(d_n Δ_n) = o(n^{1/4});
(C2) n ζ_n^{−2}/δ_n = o(1);
(C3) sup_{μ∈S_n(φ): ∥μ−μ0∥_n ≤ ζ_n^{−1}} ∥π_n μ̃_n(μ) − μ̃_n(μ)∥_n = O_P(ζ_n η_n);
(C4) sup_{μ∈S_n(φ): ∥μ−μ0∥_n ≤ ζ_n^{−1}} (1/n) Σ_{i=1}^{n} ε_i (π_n μ̃_n(μ)(X_i) − μ̃_n(μ)(X_i)) = O_P(η_n).
We have the following asymptotic normality for n → ∞

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat\mu_n(X_i) - \mu_0(X_i)\right) \Rightarrow \mathcal{N}\left(0, \sigma^2\right).$$

The assumptions of Theorem 12.12 require a slower growth rate d_n of the
shallow FN network compared to the consistency results. Shen et al. [336] bring
forward the argument that for the asymptotic normality result to hold, the shallow
FN network should grow more slowly in order to get the Gaussian property, otherwise
the sieve estimator may skew towards the true function μ0. Conditions (C3)–(C4) on
the other side give lower growth rates on the networks such that the approximation
error decreases sufficiently fast.
If the variance parameter σ² = Var(ε_i) is not known, we can estimate it
empirically by

$$\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat\mu_n(X_i))^2.$$

Theorem 5.2 in Shen et al. [336] proves that this estimator is consistent for
σ 2 , and the asymptotic normality result also holds true under this estimated
variance parameter (using Slutsky’s theorem), and under the same assumptions as
in Theorem 12.12.

12.3 Functional Limit Theorem

Horel–Giesecke [190] push the above asymptotic results even one step further. Note
that the asymptotic normality of Theorem 12.12 is not directly useful for variable
selection, since the asymptotic result integrates over the feature space X. Horel–
Giesecke [190] prove a functional limit theorem which we briefly review in this
section.
A q_0-tuple α = (α_1, ..., α_{q_0}) ∈ N_0^{q_0} is called a multi-index, and we set |α| =
α_1 + ... + α_{q_0}. Define the derivative operator

$$\nabla^\alpha = \frac{\partial^{|\alpha|}}{\partial x_1^{\alpha_1} \cdots \partial x_{q_0}^{\alpha_{q_0}}}.$$

Consider the compact feature space X = {1} × [0, 1]^{q_0} with q_0 ≥ 3. Choose a
distribution ν on this feature space X and define the L²-space

$$L^2(\mathcal{X}, \nu) = \left\{\mu: \mathcal{X} \to \mathbb{R} \text{ measurable};\; E_\nu[\mu(X)^2] = \int_{\mathcal{X}} \mu(x)^2\, d\nu(x) < \infty\right\}.$$

Next, define the Sobolev space for k ∈ N

$$W^{k,2}(\mathcal{X}, \nu) = \left\{\mu \in L^2(\mathcal{X}, \nu);\; \nabla^\alpha \mu \in L^2(\mathcal{X}, \nu) \text{ for all } \alpha \in \mathbb{N}_0^{q_0} \text{ with } |\alpha| \le k\right\},$$

where ∇^α μ is the weak derivative of μ. The motivation for studying Sobolev
spaces is that for sufficiently large k and the existence of weak derivatives ∇^α μ ∈
L²(X, ν), |α| ≤ k, we eventually receive a classical derivative of μ, see below. We
define the Sobolev norm for μ ∈ W^{k,2}(X, ν) by

$$\|\mu\|_{k,2} = \left(\sum_{|\alpha| \le k} E_\nu\left[\left(\nabla^\alpha \mu(X)\right)^2\right]\right)^{1/2}.$$

The normed Sobolev space (W^{k,2}(X, ν), ∥·∥_{k,2}) is a Hilbert space. Since we would
like to consider gradient-based methods, we consider the following space

$$C_B^1(\mathcal{X}, \nu) = \left\{\mu: \mathcal{X} \to \mathbb{R} \text{ continuously differentiable};\; \|\mu\|_{\lfloor q_0/2\rfloor + 2,\, 2} \le B\right\}, \tag{12.8}$$

for some positive constant B < ∞. We will assume that the true regression function
μ0 ∈ C_B^1(X, ν); thus, the true regression function has a bounded Sobolev norm
∥·∥_{⌊q_0/2⌋+2,2} of maximal size B. Assume that X̊ ⊂ R^{q_0} is the open interior of X
(excluding the intercept component), and that ν is absolutely continuous w.r.t. the
Lebesgue measure with a strictly positive and bounded density on X (excluding
the intercept component). The Sobolev number of the space W^{⌊q_0/2⌋+2,2}(X̊, ν) is
given by m = ⌊q_0/2⌋ + 2 − q_0/2 ≥ 1.5 > 1. The Sobolev embedding theorem
then tells us that for any function μ ∈ W^{⌊q_0/2⌋+2,2}(X̊, ν), there exists a ⌊m⌋-times
continuously differentiable function on X̊ that is equal to μ a.e.; thus, the
class of equivalent functions μ ∈ W^{⌊q_0/2⌋+2,2}(X̊, ν) has a representative in C¹(X̊)
since ⌊m⌋ ≥ 1. This motivates the consideration of the space in (12.8).
In practice, the bound B needs a careful consideration because the true μ0 is
unknown. Therefore, B should be sufficiently large so that μ0 is contained in the
space C_B^1(X, ν) and, on the other hand, it should not be too large as this will weaken
the power of the tests below.
We choose the sigmoid activation function for φ and we consider the approximate sieve estimators (μ̂_n)_{n≥1} for given data (Y_i, X_i)_i obtained by a solution to

$$\frac{1}{n}\sum_{i=1}^n \big(Y_i - \widehat{\mu}_n(X_i)\big)^2 \;\le\; \inf_{\mu \in S_n(\phi)} \frac{1}{n}\sum_{i=1}^n \big(Y_i - \mu(X_i)\big)^2 + o_P(1), \qquad (12.9)$$

where we allow for an error term o_P(1) that converges in probability to zero as n → ∞. In contrast to (12.6) we do not specify the error rate here.
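To make the optimization (12.9) concrete, the following minimal R sketch computes an approximate least-squares sieve estimator, using the nnet package as one possible optimizer; the names X, Y, q1 and all tuning choices are our own assumptions, and a gradient-based fit only approximately attains the infimum:

# Sketch: approximate least-squares sieve estimator with a shallow FN
# network of q1 hidden sigmoid neurons; the design matrix X and the
# response vector Y are assumed given, and q1 plays the role of d_n
library(nnet)
q1 <- 20
fit <- nnet(x = X, y = Y, size = q1, linout = TRUE, maxit = 500, trace = FALSE)
mu_hat <- as.numeric(predict(fit, X))    # fitted values of the sieve estimator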
Assumption 12.13 Choose a complete probability space (Ω, A, P) and X = {1} × [0, 1]^{q_0}.

(1) Assume μ_0 ∈ C_B^1(X, ν) for some B > 0, and (Y_i, X_i)_{i≥1} are i.i.d. on (Ω, A, P) following the regression structure (12.3) with ε_i being centered, having E[|ε_i|^{2+δ}] < ∞ for some δ > 0, being absolutely continuous w.r.t. the Lebesgue measure, and being independent of X_i; the features X_i ∼ ν are absolutely continuous w.r.t. the Lebesgue measure having a bounded and strictly positive density on X (excluding the intercept component). Set σ² = Var(ε_i) < ∞.
(2) The activation function φ is the sigmoid function.
(3) The sequence (d_n)_{n≥1} is increasing and going to infinity satisfying d_n^{2+1/q_0} log(d_n) = O(n) as n → ∞, and ε_n ≡ ε > 0, ε*_n ≡ ε* > 0 for n ≥ 1.
(4) Define L_μ(X, ε) = −2ε(μ(X) − μ_0(X)) + (μ(X) − μ_0(X))², and it holds for n ≥ 2

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \Big( L_{\widehat{\mu}_n}(X_i, \varepsilon_i) - \mathbb{E}_\nu\big[L_{\widehat{\mu}_n}(X_1, \varepsilon_1)\big] \Big) \;\le\; \inf_{h \in C_B^1(\mathcal{X},\nu)} \frac{1}{\sqrt{n}}\sum_{i=1}^n \Big( L_{\mu_0 + h/r_n}(X_i, \varepsilon_i) - \mathbb{E}_\nu\big[L_{\mu_0 + h/r_n}(X_1, \varepsilon_1)\big] \Big) + o_P(r_n^{-1}),$$

for r_n being the rate defined in (12.7).


The first three items of this assumption are rather similar to Assumption 12.7, which provides the consistency in Theorem 12.8 and the rates of convergence in Theorem 12.10. Item (4) of Assumption 12.13 needs to be compared to (C3)–(C4) of Theorem 12.12, which are used for obtaining the asymptotic normality. (r_n)_n is the rate that provides convergence in probability of the sieve estimator to the true regression function, and this magnitude is used for the perturbation, see also (C3)–(C4) in Theorem 12.12.
Theorem 12.14 (Asymptotics, Theorem 1 of Horel–Giesecke [190], Without Proof) Under Assumption 12.13 the approximate sieve estimator (μ̂_n)_{n≥1} of (12.9) converges weakly in the metric space (C_B^1(X, ν), d_ν) with d_ν(μ, μ') = E_ν[(μ(X) − μ'(X))²]:

$$r_n(\widehat{\mu}_n - \mu_0) \;\Rightarrow\; \mu^\sharp \qquad \text{as } n \to \infty,$$

where μ^♯ is the arg max of the Gaussian process {G_μ; μ ∈ C_B^1(X, ν)} with mean zero and covariance function Cov(G_μ, G_{μ'}) = 4σ² E_ν[μ(X)μ'(X)].

Remarks 12.15 We highlight the differences between Theorems 12.12 and 12.14.
• Theorem 12.12 provides a convergence in distribution to a Gaussian random variable, whereas the limit in Theorem 12.14 is a random function x ↦ μ^♯(x) = μ^♯_ω(x), ω ∈ Ω. Thus, the former convergence result integrates over the (empirical) feature distribution, whereas the latter also allows for a point-wise consideration in the feature x.
• The former theorem does not allow for variable selection in X, whereas the latter does, because the limiting function still discriminates different feature values.
• For the proof of Theorem 12.14 we refer to Horel–Giesecke [190]. It is based on asymptotic results on empirical point processes; we refer to Van der Vaart–Wellner [364]. The Gaussian process {G_μ; μ ∈ C_B^1(X, ν)} is parametrized by the (totally bounded) space C_B^1(X, ν), and it is continuous over this compact index space. This implies that it attains its maximum. Uniqueness of the maximum then gives us the random function μ^♯, which exactly describes the limiting distribution of r_n(μ̂_n − μ_0) as n → ∞.

12.4 Hypothesis Testing

Theorem 12.14 can be used to provide a significance test for feature component
selection, similarly to the LRT and the Wald test presented in Sect. 5.3.2 on GLMs.
We define gradient-based test statistics, for 1 ≤ j ≤ q_0, w.r.t. the approximate sieve estimator μ̂_n ∈ S_n(φ) given in (12.9),

$$\lambda^{(n)}_j = \int_{\mathcal{X}} \left(\frac{\partial \widehat{\mu}_n(x)}{\partial x_j}\right)^{\!2} d\nu(x) \qquad\text{and}\qquad \widehat{\lambda}^{(n)}_j = \frac{1}{n}\sum_{i=1}^n \left(\frac{\partial \widehat{\mu}_n(X_i)}{\partial x_j}\right)^{\!2}.$$

The test statistic λ^{(n)}_j integrates the squared partial derivative of the sieve estimator μ̂_n w.r.t. the distribution ν, whereas λ̂^{(n)}_j can be considered as its empirical counterpart if X ∼ ν. Note that both test statistics depend on the data (Y_i, X_i)_{1≤i≤n} determining the sieve estimator μ̂_n, see (12.9). These test statistics are used to test the following null hypothesis H_0 against the alternative hypothesis H_1 for the true regression function μ_0 ∈ C_B^1(X, ν)

$$H_0:\ \lambda_j = \mathbb{E}_\nu\!\left[\left(\frac{\partial \mu_0(X)}{\partial x_j}\right)^{\!2}\right] = 0 \qquad \text{against} \qquad H_1:\ \lambda_j \neq 0. \qquad (12.10)$$
We emphasize that the expression λ_j in (12.10) is a deterministic number; for this reason we use the expected value notation E_ν[·]. This is in contrast to λ̂^{(n)}_j, which is only a conditional expectation, conditionally given the data (Y_i, X_i)_{1≤i≤n}.
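For illustration, the empirical statistic can be computed explicitly for a fitted shallow network μ̂(x) = w_0 + Σ_m w_m φ(b_m + ⟨v_m, x⟩) with sigmoid φ; a minimal R sketch, where the weight objects w, b, V and the design matrix X are our own assumed names:

# Empirical gradient-based test statistic for feature component j:
# w (q1 output weights), b (q1 biases), V (q1 x q0 inner weights) and
# X (n x q0 design matrix) are assumed given
sigmoid <- function(z) 1 / (1 + exp(-z))
lambda_hat_j <- function(j, w, b, V, X) {
  Z <- sigmoid(sweep(X %*% t(V), 2, b, "+"))   # n x q1 hidden activations
  partial <- (Z * (1 - Z)) %*% (w * V[, j])    # derivative of mu_hat w.r.t. x_j
  mean(partial^2)                              # empirical test statistic
}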
Proposition 12.16 (Theorem 2 and Proposition 3 of Horel–Giesecke [190], Without Proof) Under Assumption 12.13 and under the null hypothesis H_0 we have for n → ∞

$$r_n^2\, \lambda^{(n)}_j,\ r_n^2\, \widehat{\lambda}^{(n)}_j \;\Rightarrow\; \Lambda_j \stackrel{\text{def.}}{=} \int_{\mathcal{X}} \left(\frac{\partial \mu^\sharp(x)}{\partial x_j}\right)^{\!2} d\nu(x). \qquad (12.11)$$

In order to use this proposition we need to be able to calculate the limiting distribution characterized by the random variable Λ_j. The maximal argument μ^♯ of the Gaussian process {G_μ; μ ∈ C_B^1(X, ν)} is given by a random function such that, for all ω ∈ Ω, μ^♯_ω(·) fulfills

$$G_{\mu^\sharp_\omega(\cdot)}(\omega) \;\ge\; G_\mu(\omega) \qquad \text{for all } \mu \in C_B^1(\mathcal{X}, \nu).$$

A discretization and simulation approach can be explored to approximate this maximal argument μ^♯ for different ω ∈ Ω, see Section 5.7 in Horel–Giesecke [190].
1. Sample random functions f_k from C_B^1(X, ν), k ≥ 1. The universality theorems suggest that we sample these random functions f_k from the sieves (S_n ∩ C_B^1(X, ν))_{n≥1}. This requires sampling the dimension q_1 of the shallow FN network and the corresponding network weights. This provides us with candidate functions f_1, ..., f_K ∈ C_B^1(X, ν); these candidate functions can be understood as a random covering of the (totally bounded) index space C_B^1(X, ν).
2. Simulate K-dimensional multivariate Gaussian random variables G^{(t)} (i.i.d.) with mean zero and (empirical) covariance matrix

$$\widehat{\Sigma} = \left( \frac{1}{n}\sum_{i=1}^n f_k(X_i)\, f_l(X_i) \right)_{1\le k,l \le K}.$$

These random variables G^{(1)}, ..., G^{(T)} play the role of discretized random samples of the Gaussian process {G_μ; μ ∈ C_B^1(X, ν)}.
3. The empirical arg max of the sample G^{(t)}, 1 ≤ t ≤ T, is obtained by

$$\mu^\sharp_t = \underset{f_k:\ 1\le k\le K}{\arg\max}\ G^{(t)}_{f_k},$$

where G^{(t)}_{f_k} is the k-th component of G^{(t)}.
4. The empirical distribution of the following sample Λ̂^{(t)}_j, 1 ≤ t ≤ T, gives us an approximation to the limiting distribution in Proposition 12.16

$$\widehat{\Lambda}^{(t)}_j = \frac{1}{n}\sum_{i=1}^n \left(\frac{\partial \mu^\sharp_t(X_i)}{\partial x_j}\right)^{\!2},$$

i.e., under the null hypothesis H_0 we approximate the right-hand side of (12.11) by the empirical distribution of (Λ̂^{(t)}_j)_{1≤t≤T}; a code sketch of these four steps follows below.

We close this section with some remarks.


Remarks 12.17
• The quality of the empirical approximation (Λ̂^{(t)}_j)_{1≤t≤T} to the limiting distribution of Λ_j will depend on how well we cover the index set C_B^1(X, ν). We could try to use covering theorems to control the accuracy. However, this is often too challenging. The simulation approach presented above suffers from not giving us any control on the quality of this covering, nor is it clear how the Sobolev norm condition for B in (12.8) can efficiently be checked during the simulation approach. We highlight that this Sobolev norm bound ‖f_k‖_{⌊q_0/2⌋+2,2} ≤ B is crucial when we want to empirically estimate the distribution of Λ_j; under special assumptions Horel–Giesecke [190] prove in their Theorem 4 that Λ_j scales as B². Thus, if we do not have any control over the Sobolev norm of the sampled shallow FN networks f_k, the above simulation algorithm is not useful to approximate the limiting distribution in Proposition 12.16.
• The assumptions of Proposition 12.16 require that X ∼ ν has a strictly positive
density over the entire feature space X (excluding the intercept component). This
is necessary to be able to capture any non-zero partial derivative ∂μ0 (x)/∂xj over
the entire feature space X . In practical applications, where we rely on a finite
sample (Xi )1≤i≤n , this may be problematic and needs some care. For instance,
there may be the situation where the samples cluster in two disjoint regions, say
C1 ⊂ X and C2 ⊂ X , because we may have ν(C1 ∪ C2 ) ≈ 1. That is, in that
case we rarely have observations Xi not lying in one of these two clusters. If
∂μ0 (x)/∂xj = 0 on these two clusters x ∈ C1 ∪ C2 , but if μ0 has a very steep
slope between the two clusters (i.e., if they are really different in terms of μ0 ),
then the test on this finite sample will not find the significant slope.
• The distribution X ∼ ν of the features is assumed to be absolutely continuous on
the hypercube [0, 1]q0 , this is not fulfilled for binary and categorical features.
• Another question is how the test of Proposition 12.16 is affected by collinearity in the feature components. Note that we only test one component at a time. Moreover, we would like to highlight the j-dependency in the limiting random variable Λ_j. This dependency is induced by the properties of the feature distribution ν, which may not be exchangeable in the components of x.
Chapter 13
Appendix B: Data and Examples

This appendix presents and describes the data sets used.

13.1 French Motor Third Party Liability Data

We consider a French motor third party liability (MTPL) claims data set. This data
set is available through the R library CASdatasets1 being hosted by Dutang–
Charpentier [113]. The specific data sets chosen from CASdatasets are called
FreMTPL2freq and FreMTPL2sev, the former contains the insurance policy
and claim frequency information and the latter the corresponding claim severity
information.2
Before we can work with this data set we perform data cleaning. It has been
pointed out by Loser [259] that the claim counts on the insurance policies with
policy IDs ≤ 24500 in FreMTPL2freq do not seem to be correct because these
claims do not have claim severity counterparts in FreMTPL2sev. For this reason
we work with the claim counts extracted from the latter file. In Listing 13.1 we give
the code used for data cleaning.3 In this code we merge FreMTPL2freq with the
aggregated severities on each insurance policy and the corresponding claim counts
are received from FreMTPL2sev, this is done on lines 2–11 of Listing 13.1. A

1 CASdatasets website: http://cas.uqam.ca/.


2 We use CASdatasets version 1.0–8 which has been packaged on 2018-05-20. This version
uses for the 22 French regions the labels R11, . . . , R94. In later versions of CASdatasets these
labels have been replaced by the region names, in this transformation the labels R31 (Nord-Pas-
de-Calais) and R41 (Lorraine) have been merged to one region called Nord-Pas-de-Calais. We
believe that this is an error and therefore prefer to work with an older version of CASdatasets.
This older version can be downloaded in R with library(OpenML), library(farff),
freMTPL2freq <- getOMLDataSet(data.id = 41214)$data
3 The code in Listing 13.1 is a modified version of the R code provided by Loser [259].


further inspection of the data indicates that policies with more than 5 claims may be data errors because they all seem to belong to the same driver (and they have very short exposures).4 For this reason we drop these records on line 12. On line 13 we censor exposures at one accounting year (since these policies are active within one calendar year). Finally, on lines 15–16 we re-level the VehBrands.5 All subsequent analysis is based on this cleaned data set.

Listing 13.1 Data cleaning applied to the French MTPL data set
1 #
2 data(freMTPL2freq)
3 dat <- freMTPL2freq[, -2]
4 dat$VehGas <- factor(dat$VehGas)
5 data(freMTPL2sev)
6 sev <- freMTPL2sev
7 sev$ClaimNb <- 1
8 dat0 <- aggregate(sev, by=list(IDpol=sev$IDpol), FUN = sum)[c(1,3:4)]
9 names(dat0)[2] <- "ClaimTotal"
10 dat <- merge(x=dat, y=dat0, by="IDpol", all.x=TRUE)
11 dat[is.na(dat)] <- 0
12 dat <- dat[which(dat$ClaimNb <=5),]
13 dat$Exposure <- pmin(dat$Exposure, 1)
14 sev <- sev[which(sev$IDpol %in% dat$IDpol), c(1,2)]
15 dat$VehBrand <- factor(dat$VehBrand, levels=c("B1","B2","B3","B4","B5","B6",
16 "B10","B11","B12","B13","B14"))

Listing 13.2 Excerpt of the French MTPL data set


1 ’data.frame’: 678007 obs. of 13 variables:
2 $ IDpol : num 1 3 5 10 11 13 15 17 18 21 ...
3 $ Exposure : num 0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
4 $ Area : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
5 $ VehPower : int 5 5 6 7 7 6 6 7 7 7 ...
6 $ VehAge : int 0 0 2 0 0 2 2 0 0 0 ...
7 $ DrivAge : int 55 55 52 46 46 38 38 33 33 41 ...
8 $ BonusMalus: int 50 50 50 50 50 50 50 68 68 50 ...
9 $ VehBrand : Factor w/ 11 levels "B1","B2","B3",..: 9 9 9 9 9 9 9 9 9 9 ...
10 $ VehGas : Factor w/ 2 levels "Diesel","Regular": 2 2 1 1 1 2 2 1 1 1 ...
11 $ Density : int 1217 1217 54 76 76 3003 3003 137 137 60 ...
12 $ Region : Factor w/ 22 levels "R11","R21","R22",..: 18 18 3 15 15 8 8 ...
13 $ ClaimTotal: num 0 0 0 0 0 0 0 0 0 0 ...
14 $ ClaimNb : num 0 0 0 0 0 0 0 0 0 0 ...
15 ####
16 ’data.frame’: 26383 obs. of 2 variables:
17 $ IDpol : int 1552 1010996 4024277 4007252 4046424 4073956 4012173 ...
18 $ ClaimAmount: num 995 1128 1851 1204 1204 ...

Listing 13.2 gives an excerpt of the cleaned French MTPL data set, lines 2–
14 give the insurance policy and claim counts information, and lines 17–18

4Short exposure policies may also belong to a commercial car rental company.
5The data set FreMTPLfreq of CASdatasets is a subset of FreMTPL2freq with slightly
changed feature components, for instance, the former data set contains car brand names in a more
aggregated version than the latter, see Table 13.2, below.

display the individual claim amounts. We have 9 feature components on lines 4–12 (1 component is binary, 3 components are categorical, and 5 components are continuous), an exposure variable on line 3, and claim information on lines 13–14 and 18. In total we have 26'383 claims on 678'007 insurance policies.
We start by giving a descriptive analysis of the data, this closely follows Noll et
al. [287]. We have the following insurance policy information:
1. IDpol: policy number (unique identifier);
2. Exposure: total exposure in yearly units (years-at-risk) and within (0, 1];
3. Area: area code (categorical, ordinal with 6 levels);
4. VehPower: power of the car (continuous);
5. VehAge: age of the car in years;
6. DrivAge: age of the (most common) driver in years;
7. BonusMalus: bonus-malus level between 50 and 230 (with entrance level
100);
8. VehBrand: car brand (categorical, nominal with 11 levels), see also
Table 13.2;
9. VehGas: diesel or regular fuel car (binary);
10. Density: density of population per km2 at the location of the living place of
the driver;
11. Region: regions in France (prior to 2016), see also Fig. 13.1 (categorical).
We start by describing the Exposure. The Exposure measures the duration of
an insurance policy in yearly units; sometimes it is also called years-at-risk. The
shortest exposure in our data set is 0.0027 which corresponds to 1 day, and the
longest exposure is 1 which corresponds to 1 year. Figure 13.2 (lhs, middle) shows
a histogram and a boxplot of these exposures. In view of the histogram we conclude
that roughly 1/4 of all policies have a full exposure of 1 calendar year, and all
other policies are only partly exposed during the calendar year. From a practical
insurance point of view this high ratio of partly exposed policies seems rather unusual. A further inspection of the data indicates that policy renewals during the year account for two separate records in the data set. Of course, such split policies should be merged to one yearly policy. Unfortunately, we do not have the necessary information to perform this merger; therefore, we need to work with the data as it is.

Fig. 13.1 The 22 regions in France between 1982 and 2015: R11 Île-de-France, R21 Champagne-Ardenne, R22 Picardie, R23 Haute-Normandie, R24 Centre, R25 Basse-Normandie, R26 Bourgogne, R31 Nord-Pas-de-Calais, R41 Lorraine, R42 Alsace, R43 Franche-Comté, R52 Pays de la Loire, R53 Bretagne, R54 Poitou-Charentes, R72 Aquitaine, R73 Midi-Pyrénées, R74 Limousin, R82 Rhône-Alpes, R83 Auvergne, R91 Languedoc-Roussillon, R93 Provence-Alpes-Côte d'Azur, R94 Corse

Fig. 13.2 (lhs) Histogram of Exposure, (middle) boxplot of Exposure, (rhs) number of observed claims ClaimNb of the French MTPL data

Table 13.1 Split of the portfolio w.r.t. the number of claims

Number of claims         0         1        2      3    4    5
Number of policies    653'069   23'571   1'298    62    5    2
Total exposure        341'090   16'315     909    42    2    1
In Table 13.1 and Fig. 13.2 (rhs) we split the portfolio w.r.t. the number of claims.
On 653’069 insurance policies (amounting to a total exposure of 341’090 years-
at-risk) we do not have any claim, and on the remaining 24’938 policies (17’269
years-at-risk) we have at least one claim. The overall portfolio claim frequency
(w.r.t. Exposure) is λ̂ = 7.35%.
We study the split of this overall frequency λ̂ = 7.35% across the different
feature levels. This empirical analysis is crucial for the model choice in regression
modeling.6 For the empirical analysis we provide 3 different types of graphs for each
feature component (where applicable), these are given in Figs. 13.3, 13.4, 13.5, 13.6,
13.7, 13.8, 13.9, 13.10, and 13.11. The first graph (lhs) gives the split of the total
exposure to the different feature levels, the second graph (middle) gives the average
feature value in each French region (green meaning low and red meaning high),7
and the third graph (rhs) gives the observed average frequency per feature level. This
observed frequency is obtained by dividing the total number of claims by the total
exposure per feature level. The frequencies are complemented by confidence bounds
of two standard deviations (shaded area). These confidence bounds correspond to
twice the estimated standard deviations. The standard deviations are estimated under a Poisson assumption; thus, they are obtained by ±2√(λ̂_k/Exposure_k), where λ̂_k is the observed frequency and Exposure_k is the total exposure for a given feature level k. We note that in all frequency plots the y-axis ranges from 0% to 20%, except in the BonusMalus plot where the maximum is set to 60%, and the DrivAge plot where the maximum is set to 40%. From these plots we conclude that some levels have only a small underlying Exposure; BonusMalus leads to the highest variability in frequencies, followed by DrivAge; and there is quite some heterogeneity.

6 The empirical analysis in these notes differs from Noll et al. [287] because data cleaning has been done differently here, we refer to Listing 13.1.
7 We acknowledge the use of the UNESCO (1987) database through UNEP/GRID-Geneva for the French map.

Fig. 13.3 (lhs) Histogram of exposures per Area code, (middle) average Area code per Region, where we map (A, ..., F) → (1, ..., 6), (rhs) observed frequency per Area code

Fig. 13.4 (lhs) Histogram of exposures per VehPower, (middle) average VehPower per Region, (rhs) observed frequency per VehPower

Fig. 13.5 (lhs) Histogram of exposures per VehAge (censored at 20), (middle) average VehAge per Region, (rhs) observed frequency per VehAge

Fig. 13.6 (lhs) Histogram of exposures per DrivAge (censored at 90), (middle) average DrivAge per Region, (rhs) observed frequency per DrivAge (y-scale is different compared to the other frequency plots)

Fig. 13.7 (lhs) Histogram of exposures per BonusMalus level (censored at 150), (middle) average BonusMalus level per Region, (rhs) observed frequency per BonusMalus level (y-scale is different compared to the other frequency plots)
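For illustration, a minimal R sketch of these observed frequencies with the two-standard-deviation bounds, assuming the cleaned data frame dat of Listing 13.1 (the helper name freq_ci is ours):

# Observed claim frequency per level of a chosen feature, with +/- 2
# standard deviation bounds under the Poisson assumption
freq_ci <- function(dat, feature) {
  agg <- aggregate(dat[, c("ClaimNb", "Exposure")],
                   by = list(level = dat[, feature]), FUN = sum)
  agg$frequency <- agg$ClaimNb / agg$Exposure
  se <- sqrt(agg$frequency / agg$Exposure)   # estimated standard deviation
  agg$lower <- agg$frequency - 2 * se
  agg$upper <- agg$frequency + 2 * se
  agg
}
freq_ci(dat, "Area")                         # e.g., observed frequency per Area code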
Table 13.2 gives the assignment of the different VehBrand levels to car
brands. This list has been compiled from the two data sets FreMTPLfreq
and FreMTPL2freq contained in the R package CASdatasets [113], see
Footnote 5.
Table 13.2 VehBrand assignment

Renault, Nissan and Citroën                 B1 / B2
Volkswagen, Audi, Skoda and Seat            B3
Opel, General Motors and Ford               B4 / B5
Fiat                                        B6
Mercedes, Chrysler and BMW                  B10 / B11
Japanese (except Nissan) and Korean cars    B12
Other cars                                  B13 / B14

Fig. 13.8 (lhs) Histogram of exposures per VehBrand, (rhs) observed frequency per VehBrand; for the VehBrand assignment we refer to Table 13.2

Fig. 13.9 (lhs) Histogram of exposures per VehGas, (middle) average VehGas per Region (diesel is green and regular red), (rhs) observed frequency per VehGas

Fig. 13.10 (lhs) Histogram of exposures per population Density (on log-scale), (middle) average population Density per Region, (rhs) observed frequency per population Density; in general, we always consider Density on the log-scale

Fig. 13.11 (lhs) Histogram of exposures Exposure, and (middle, rhs) observed claim frequencies per Region in France (prior to 2016)

Next, we analyze the collinearity between the feature components. For this we calculate Pearson's correlation and Spearman's Rho for the continuous feature components, see Table 13.3. In general, these correlations are low, except for DrivAge vs. BonusMalus. Of course, the latter is very sensible because a BonusMalus level below 100 needs a certain number of driving years without claims. We give the corresponding boxplot in Fig. 13.12 (lhs) which confirms this negative correlation. Figure 13.12 (rhs) gives the boxplot of log-Density vs. Area code. From this plot we conclude that the Area code has likely been set w.r.t. the log-Density. For our regression models this means that we can drop the Area code information, and we should only work with Density. Nevertheless, we will use the Area code to show what happens in the case of collinear feature components, i.e., if we replace (A, ..., F) → (1, ..., 6).

Table 13.3 Correlations in feature components: top-right shows Pearson's correlation; bottom-left shows Spearman's Rho; Density is considered on the log-scale; significant correlations are boldface

              VehPower   VehAge   DrivAge   BonusMalus   Density
VehPower                  −0.01      0.03      −0.08        0.01
VehAge          0.00                −0.06       0.08       −0.10
DrivAge         0.04      −0.08                −0.48       −0.05
BonusMalus     −0.07       0.08     −0.57                   0.13
Density        −0.01      −0.10     −0.05       0.14
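These correlations can be reproduced along the following lines; a minimal R sketch, assuming the cleaned data frame dat of Listing 13.1:

# Pearson and Spearman correlations of the continuous feature components,
# with Density considered on the log-scale
cts <- data.frame(dat[, c("VehPower", "VehAge", "DrivAge", "BonusMalus")],
                  logDensity = log(dat$Density))
round(cor(cts, method = "pearson"), 2)       # top-right of Table 13.3
round(cor(cts, method = "spearman"), 2)      # bottom-left of Table 13.3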
Figure 13.13 illustrates each continuous feature component w.r.t. the different
VehBrands. Vehicle brands B10 and B11 (Mercedes, Chrysler and BMW) have
more VehPower than other cars, B10 being more likely a diesel car, and vehicle
brand B12 (Japanese and Korean cars) has comparably new cars in more densely
populated French regions.
Fig. 13.12 Boxplots (lhs) BonusMalus vs. DrivAge, (rhs) log-Density vs. Area code; these plots are inspired by Fig. 2 in Lorentzen–Mayer [258]

More formally, the strength of dependence between categorical variables can be measured by Cramér's V. Cramér's V is based on the χ²-test of independence on contingency tables. We briefly explain this. Assume we have two-dimensional categorical features x = (x_1, x_2) ∈ X having m_1 and m_2 levels, respectively. Let p_x describe the probability on X that a randomly chosen insurance policy takes feature x, and let p_{x_1} and p_{x_2} be the marginal distributions of p_x. If the two components of x are independent with these two marginals, then we have the special (independence) distribution

$$\pi_x = p_{x_1}\, p_{x_2} \qquad \text{for all } x = (x_1, x_2) \in \mathcal{X}.$$

The χ²-test for independence now analyzes p_x vs. π_x. Assume we have n observations. Denote by n_x = n_{x_1,x_2} the number of instances that have feature x = (x_1, x_2), and let n_{x_1,·} and n_{·,x_2} be the corresponding marginal observations. The χ²-test statistic is given by

$$\chi^2 = \sum_{x=(x_1,x_2)\in\mathcal{X}} \frac{\left(n_x - \frac{n_{x_1,\cdot}\, n_{\cdot,x_2}}{n}\right)^{\!2}}{\frac{n_{x_1,\cdot}\, n_{\cdot,x_2}}{n}}.$$

Under the null hypothesis of independence between the components of x, the test statistic χ² converges in distribution to a χ²-distribution with (m_1 − 1)(m_2 − 1) degrees of freedom if we let the number of independently drawn instances go to infinity. Seven different proofs of this statement are given in Benhamou–Melot [30].
Fig. 13.13 Distribution of the variables VehPower, VehAge, DrivAge, BonusMalus, log-Density, VehGas for each car brand VehBrand, individually
Table 13.4 Cramér's V for the categorical feature components vs. the categorized continuous components

            VehPower   VehAge   DrivAge   BonusMalus   log-Density   VehGas   Region
VehBrand      0.16      0.17      0.06       0.03          0.05       0.12     0.13
Region        0.04      0.09      0.05       0.04          0.24       0.09
Area                                                       0.87

Fig. 13.14 VehBrands in the different French Regions

We scale the test statistic to the interval [0, 1] by dividing it by the comonotonic (maximally dependent) case and by the sample size n. This motivates Cramér's V

$$V = \sqrt{\frac{\chi^2/n}{\min\{m_1-1,\, m_2-1\}}} \;\in\; [0, 1].$$
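A minimal R sketch of this quantity, assuming the cleaned data frame dat of Listing 13.1; chisq.test() supplies the χ² statistic, and the continuous components need to be categorized first:

# Cramer's V for two categorical vectors x1 and x2
cramers_v <- function(x1, x2) {
  tab <- table(x1, x2)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt((chi2 / sum(tab)) / (min(dim(tab)) - 1)))
}
cramers_v(dat$VehBrand, dat$Region)          # approx 0.13, see Table 13.4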

Section 7.2.3 of Cohen [78] gives a rule of thumb for small, medium
and large dependence.
√ Cohen [78] calls the association between x1 and x2
small
√ if V min{m 1 − 1, m2 − 1} is less 0.1, it is of medium strength for
V min{m1 − 1, m2 − 1} of size 0.3, and it is a large effect if this value is around
0.5. Our results are presented in Table 13.4. Clearly, there is some association
between VehBrand and both VehPower and VehAge, this can also be seen
from Fig. 13.13, for the remaining variables the dependence is somewhat weaker.
Not surprisingly, Cramér’s V shows the largest value between Region and log-
Density.
In Fig. 13.14 we show the VehBrands in the different French Regions,
√ Cramér’s
V is 0.13 for these two categorical variables, multiplying with 11 − 1 gives a
value bigger than 0.4 which is a considerable association according to Cohen [78].
We note that in some regions the French car brands B1 and B2 are very dominant,
whereas on the Isle of Corse (R94) 80% of the cars in our portfolio are Japanese
564 13 Appendix B: Data and Examples

Fig. 13.15 Empirical density and log-log plots of the observed claim amounts

or Korean cars B12. Our portfolio has its biggest exposure in Region R24, see
Fig. 13.11, in this region French cars are predominant.
Next, we study the claim sizes of this French MTPL example. Figure 13.15 shows
the empirical density plot and the log-log plot. These two plots already illustrate the
main difficulty we often face in claim size modeling. From the empirical density
plot we observe that there are many payments of fixed size (red vertical lines) which
do not match any absolutely continuous distribution function assumption. The log-
log plot shows heavy-tailedness because we observe asymptotically a straight line
with negative slope on the log-scale, this indicates regularly varying tails and, thus,
the EDF is not a suitable model on the original observation scale.
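A log-log plot as in Figure 13.15 (rhs) can be produced along the following lines; a minimal R sketch, assuming a vector of strictly positive individual claim amounts such as sev$ClaimAmount of Listing 13.1:

# Empirical log-log plot: logged survival probabilities vs. logged claims
z <- sort(sev$ClaimAmount)
surv <- 1 - (seq_along(z) - 0.5) / length(z)  # empirical survival probabilities
plot(log(z), log(surv), type = "l",
     xlab = "logged claim amounts", ylab = "logged survival probability")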
Figure 13.16 gives the boxplots of the claim sizes per feature level (we omit the claims outside the whiskers because the heavy-tailedness would distort the picture). The empirical mean in orange is much bigger than the median in red color, which also expresses the heavy-tailedness. From these plots we conclude that the claim sizes seem less sensitive to the feature values, which may question the use of a regression model for the claim sizes.
Figure 13.17 shows the density plots for different feature levels. Interestingly, it
seems that the features determine the sizes of the modes, for instance, if we focus
on Area, Fig. 13.17 (top-left), we see that the area codes mainly influence the sizes
of the modes. This may be interpreted by modes corresponding to different claim
types which occur at different frequencies among the area codes.

13.2 Swedish Motorcycle Data

Our second example considers the Swedish motorcycle data which originally
has been used in Ohlsson–Johansson [290]. It is available through the R library CASdatasets [113], and it is called swmotorcycle. Listing 13.3 shows the data cleaning that we have used, and Listing 13.4 gives an excerpt of the cleaned data.

Fig. 13.16 Boxplots of claim sizes per feature level: these plots omit the claims outside the whiskers; red color shows the median and orange color the empirical mean
We briefly describe the data. The data considers comprehensive insurance for
motorcycles. This covers loss or damage of motorcycles other than collision, e.g.,
caused by theft, fire or vandalism. The data considers aggregated claims on feature
levels for years 1994–1998. We have claims on 656 out of the 62’036 different
features, thus, only slightly more than 1% of all feature combinations suffer a claim
in the considered period.
Fig. 13.17 Empirical claim size densities split w.r.t. the different levels of the feature components

We start by describing the available variables on lines 2–10 of Listing 13.4:


1. OwnerAge: age of motorcycle owner in {18, . . . , 70} years (we censor at 70
because of scarcity of data above);
2. Gender: gender of motorcycle owner either being Female or Male;
3. Area: 7 geographical Swedish zones being (1) central parts of Sweden’s three
largest cities, (2) suburbs and middle-sized towns, (3) lesser towns except those
in zones (5)–(7), (4) small towns and countryside except those in zones (5)–(7),
(5) Northern towns, (6) Northern countryside, and (7) Gotland (Sweden’s largest
island);
4. RiskClass: 7 ordered motorcycle classes received from the so-called EV ratio
defined as (Engine power in kW × 100) / (Vehicle weight in kg + 75kg);
5. VehAge: age of motorcycle in {0, . . . , 30} years (we censor at 30);
6. BonusClass: ordered bonus-malus class from 1 to 7, entry level is 1;

Listing 13.3 Data cleaning applied to the Swedish motorcycle data set
1 library(CASdatasets)
2 data(swmotorcycle)
3 mcdata <- swmotorcycle
4 mcdata$Gender <- as.factor(mcdata$Gender)
5 mcdata$Area <- as.factor(mcdata$Area)
6 mcdata$Area <- factor(mcdata$Area,levels(mcdata$Area)[c(1,7,3,6,5,4,2)])
7 mcdata$Area <- c("Zone 1","Zone 2","Zone 3","Zone 4","Zone 5",
8 "Zone 6","Zone 7")[as.integer(mcdata$Area)]
9 mcdata$Area <- as.factor(mcdata$Area)
10 mcdata$RiskClass <- as.factor(mcdata$RiskClass)
11 mcdata$RiskClass <- factor(mcdata$RiskClass,
12 levels(mcdata$RiskClass)[c(1,6,7,3,4,5,2)])
13 mcdata$RiskClass <- as.integer(mcdata$RiskClass)
14 mcdata$BonusClass <- as.integer(as.factor(mcdata$BonusClass))
15 #
16 mcdata <- mcdata[which(mcdata$OwnerAge>=18),] # only minimal age 18
17 mcdata$OwnerAge <- pmin(70, mcdata$OwnerAge) # set maximal age 70
18 mcdata$VehAge <- pmin(30, mcdata$VehAge) # set maximal motorcycle age 30
19 mcdata <- mcdata[which(mcdata$Exposure>0),] # only positive exposures

Listing 13.4 Excerpt of the Swedish motorcycle data set


1 ’data.frame’: 62036 obs. of 9 variables:
2 $ OwnerAge : num 18 18 18 18 18 18 18 18 18 18 ...
3 $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
4 $ Area : Factor w/ 7 levels "Zone 1","Zone 2",..: 1 1 1 1 2 2 2 3 ...
5 $ RiskClass : int 1 2 3 3 1 1 3 1 1 1 ...
6 $ VehAge : num 8 11 9 9 11 12 24 4 6 6 ...
7 $ BonusClass : int 2 2 3 4 1 1 2 1 1 2 ...
8 $ Exposure : num 1 0.778 0.499 0.501 0.929 ...
9 $ ClaimNb : int 0 0 0 0 0 0 0 0 0 0 ...
10 $ ClaimAmount: int 0 0 0 0 0 0 0 0 0 0 ...

7. Exposure: total exposure in yearly units; these exposures are aggregated for given feature combinations, resulting in total exposures in [0.0274, 31.3397], the shortest entry referring to 10 days and the longest one to more than 31 years;
8. ClaimNb: number of claims N_i for a given feature;
9. ClaimAmount: total claim amount for a given feature (aggregated over all claims).

We start with a descriptive and exploratory analysis of the Swedish motorcycle data of Listing 13.4. We have n = 62'036 different feature combinations with positive Exposure. This Exposure is aggregated over individual policies with a fixed feature combination. We denote by N_i the number of claims on feature i, this corresponds to ClaimNb, and the total claim amount ClaimAmount is denoted by S_i = Σ_{j=1}^{N_i} Z_{i,j}, where Z_{i,j} are the individual claim sizes on feature i (in case of claims). The empirical claim frequency is λ̄ = Σ_{i=1}^n N_i / Σ_{i=1}^n v_i = 1.05%, with v_i the Exposure of feature i, and the average claim size is μ̄ = Σ_{i=1}^n S_i / Σ_{i=1}^n N_i = 24'641 Swedish crowns (SEK).
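These two portfolio summaries are directly obtained as follows; a minimal sketch, assuming the cleaned data frame mcdata of Listing 13.3:

# Empirical claim frequency and average claim size of the portfolio
lambda_bar <- sum(mcdata$ClaimNb) / sum(mcdata$Exposure)    # approx 1.05%
mu_bar <- sum(mcdata$ClaimAmount) / sum(mcdata$ClaimNb)     # approx 24'641 SEK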
Fig. 13.18 (lhs) Boxplot of Exposure on the log-scale (the horizontal line corresponds to 1 accounting year), (rhs) histogram of the number of observed claims ClaimNb per feature of the Swedish motorcycle data

Figure 13.18 shows the boxplot over all Exposures and the claim counts on all
insurance policies. We note that insurance claims are rare events for this product,
because the empirical claim frequency is only λ̄ = 1.05%.
Figures 13.19 and 13.20 give the marginal total exposures (split by gender), the
marginal claim frequencies and the marginal average claim amounts for the covari-
ate components OwnerAge, Area, RiskClass, VehAge and BonusClass.
We observe that we have a very imbalanced portfolio between genders, only 11%
of the total exposure is coming from females. The empirical claim frequency of
females is 0.86% and the one of males is 1.08%. We note that the female claim
frequency comes from (only) 61 claims (based on an exposure for female of 7’094
accounting years, versus 57’679 for male). Therefore, it is difficult to analyze
females separately, and all marginal claim frequencies and claim sizes in Figs. 13.19
and 13.20 (middle and rhs) are analyzed jointly for both genders. If we run a simple
Poisson GLM that only involves Gender as feature component, it turns out that
the female frequency is 20% lower than the male frequency (remember we have
the balance property on each dummy variable, see Example 5.12), but this variable
should not be kept in the model on a 5% significance level. The same holds for claim
amounts.
The empirical marginal frequencies in Figs. 13.19 and 13.20 (middle) are
complemented with confidence bands of ±2 standard deviations. From the plots
we conclude that we should keep the explanatory variables OwnerAge, Area,
RiskClass and VehAge, but the variable BonusClass does not seem to have
any predictive power. At first sight, this seems surprising because the bonus class
encodes the past claims history. The reason that the bonus class is not needed for our
claims is that we consider comprehensive insurance for motorcycles covering loss
or damage of motorcycles other than collision (for instance, caused by theft, fire or
vandalism), and the bonus class encodes collision claims.
Fig. 13.19 (Top, middle and bottom rows) OwnerAge, Area, RiskClass: (lhs) histogram of exposures (split by gender), (middle) observed claim frequency, (rhs) boxplot of observed average claim amounts μ̄_i = S_i/N_i of features with N_i > 0 (on log-scale)

For a regression analysis Zones 5 to 7 should be merged because of small


exposures and a similar behavior, the same applies to RiskClass 6 and 7, and
VehAge above 20.
Figure 13.21 shows the correlations between the features: (top) correlations between the continuous features, (bottom) dependence between the continuous features and the categorical Area feature. We observe some dependence; for instance, in Zone 1 (the three largest Swedish cities) the motorcycles are lighter (RiskClass) and newer. Older people drive lighter motorcycles that are older, and older motorcycles are less heavy.
Figure 13.22 gives the empirical density, empirical distribution and log-log plot of
average claim amounts μ̄i = Si /Ni . From the log-log plot we conclude that the
average claim amounts are not heavy-tailed for this motorcycle insurance product.
Fig. 13.20 (Top and bottom rows) VehAge, BonusClass: (lhs) histogram of exposures (split by gender), (middle) observed claim frequency, (rhs) boxplot of observed average claim amounts μ̄_i = S_i/N_i of features with N_i > 0 (on log-scale)

13.3 Wisconsin Local Government Property Insurance Fund

The third example considers property insurance claims of the Wisconsin Local
Government Property Insurance Fund (LGPIF). This data8 has been made available
through the book project of Frees [135],9 and is also used in Lee et al. [236]. The
Wisconsin LGPIF is an insurance pool that is managed by the Wisconsin Office
of the Insurance Commissioner. This fund provides insurance protection to local
governmental institutions such as counties, schools, libraries, airports, etc. It insures
property claims for buildings and motor vehicles, and it excludes certain natural and man-made perils like flood, earthquakes or nuclear accidents. We give a description of the data below (we have applied some data cleaning to the original data).
The special feature of this data is that we have a short claim description on line 11
of Listing 13.5. This description will allow us to better understand the claim type
beyond just knowing the hazard type that has been affected.
Figure 13.23 gives the empirical density (upper-truncated at 50'000) and the log-log plot of the observed LGPIF claim amounts. Most claims are below 10'000; however, the log-log plot clearly shows that the data is heavy-tailed, the largest claim being 12'922'218 and 13 claims being above 1 million. These claims are further described by the features given in Listing 13.5.

8 https://github.com/OpenActTexts/Loss-Data-Analytics/tree/master/Data.
9 https://ewfrees.github.io/Loss-Data-Analytics/.
Fig. 13.21 (Top) Correlations: top-right shows Pearson's correlation; bottom-left shows Spearman's Rho; (bottom) boxplots of OwnerAge, RiskClass, VehAge versus Area (where Zones 5–7 have been merged)

Fig. 13.22 (lhs) Empirical density, (middle) empirical distribution and (rhs) log-log plot of average claim amounts μ̄_i = S_i/N_i of features with N_i > 0

In our example we will not focus on modeling the claim sizes, but we rather
aim at predicting the hazard types from the claim descriptions. There are 9 different
hazard types: Fire, Lightning, Hail, Wind, WaterW, WaterNW, Vehicle, Vandalism
and Misc. The last label contains all claims that cannot be allocated to one of
the previous hazard types, and WaterW refers to weather related water claims and
WaterNW to the non-weather related ones. If we only focus on this latter problem
we have more data available as there is a training data set and a validation data set with hazard types and claim descriptions.10 In total we have 6'031 such claim descriptions, see Listing 13.6, which are studied in our text recognition Chap. 10.

Fig. 13.23 (lhs) Empirical density (upper-truncated at 50’000), (rhs) log-log plot of the observed
LGPIF claim amounts

Listing 13.5 Excerpt of the Wisconsin LGPIF data set


1 ’data.frame’: 5424 obs. of 10 variables:
2 $ PolicyNum : int 120002 120003 120003 120003 120003 120003 120003 ...
3 $ Year : int 2010 2007 2008 2007 2009 2010 2007 2007 2009 2007 ...
4 $ Claim : num 6839 2085 8775 600 34610 ...
5 $ Deduct : int 1000 5000 5000 5000 5000 5000 5000 5000 5000 5000 ...
6 $ EntityType : Factor w/ 6 levels "City","County",..: 2 2 2 2 2 2 2 2 2 2 ...
7 $ CoverageCode: Factor w/ 13 levels "CE","CF","CS",..: 12 12 11 11 11 12 ...
8 $ Fire5 : int 4 0 0 0 0 0 0 0 0 0 ...
9 $ CountyCode : Factor w/ 72 levels "ADA","ASH","BAR",..: 2 3 3 3 3 3 3 3...
10 $ Hazard : Factor w/ 9 levels "Fire","Hail",..: 3 3 5 5 9 6 3 3 3 3 ...
11 $ Description : chr "lightning damage" "lightning damage at Comm. Center" ...


Listing 13.6 Excerpt of the Wisconsin LGPIF claim descriptions


1 ’data.frame’: 6031 obs. of 2 variables:
2 Hazard : Factor w/ 9 levels "Fire","Hail",..: 1 3 3 5 5 9 3 6 ...
3 Description: chr "fire damage at Town Hall"
4 "lightning damage at water tower" ...

10 https://github.com/OpenActTexts/Loss-Data-Analytics/tree/master/Data.

13.4 Swiss Accident Insurance Data

Our next example considers Swiss accident insurance data.11 This data set is not publicly available. Swiss accident insurance is compulsory for employees, i.e., by law each employer has to sign an insurance contract to protect the employees against accidents. This insurance cover includes both work and leisure accidents, and it covers medical expenses and daily allowance. Listing 13.7 gives an excerpt of the data. The variable BU on line 3 indicates whether we have a workplace or a leisure accident, line 10 gives the medical expenses and line 12 shows the allowance expenses. In the subsequent analysis we only consider the medical expenses.

Listing 13.7 Excerpt of the Swiss accident insurance data set


1 ’data.frame’: 339500 obs. of 11 variables:
2 $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
3 $ BU : Factor w/ 2 levels "1","2": 1 1 2 2 2 1 2 2 2 1 ...
4 $ Sector : Factor w/ 24 levels "5","12","13",..: 5 10 13 7 12 13 4 21 1 ...
5 $ AccQuart : int 3 2 1 3 4 4 1 2 1 3 ...
6 $ RepDel : num 0 0 0 0 1 0 0 0 0 0 ...
7 $ Age : num 45 20 20 20 60 55 30 25 20 20 ...
8 $ InjType : Factor w/ 19 levels "1","2","3","4",..: 7 6 4 13 16 2 6 4 4 ...
9 $ InjPart : Factor w/ 35 levels "1","2","3","4",..: 20 28 28 20 14 23 2 ...
10 $ Claim : num 562 6675 700 57 2382 ...
11 $ NumbPaym : num 2 2 2 1 1 3 1 1 1 1 ...
12 $ Allowance: num 2345 5554 21 0 395 ...

Sector indicates the labor sector of the insured company, AccQuart gives the
accident quarter since leisure claims have a seasonal component, RepDel gives the
reporting delay in yearly units, Age is the age of the injured (in 5 years buckets),
and InjType and InjPart denote the injury type and the injured body part.
Figure 13.24 gives the empirical density (upper-truncated at 10’000) and the log-
log plot of the observed Swiss accident insurance claim amounts. Most claims are
below 5’000, however, the log-log plot shows some heavy-tailedness, the largest
claim exceeding 1’300’000 CHF.
Figure 13.25 shows the average claim amounts split w.r.t. the different feature
components (top) Sector, AccQuart, RepDel, (bottom) Age, InjType,
InjPart, and moreover, split by work and leisure accidents (in cyan and gray
in the colored version). Typically, leisure accidents are more numerous and more
expensive on average than accidents at the work place. From Fig. 13.25 (top, left)
we observe considerable variability in average claim sizes between the different
labor sectors (cyan bars), whereas average leisure claim sizes (gray bars) are similar across the different labor sectors. Average claim sizes differ considerably between injury types and injured body parts (bottom, middle and right), but they do not differ between work and leisure claims.

11 https://www.unfallstatistik.ch/.
Fig. 13.24 (lhs) Empirical density (upper-truncated at 10’000), (rhs) log-log plot of the observed
Swiss accident insurance claim amounts

Fig. 13.25 Average claim amounts split w.r.t. the different feature components (top) Sector,
AccQuart, RepDel, (bottom) Age, InjType, InjPart, and split by work and leisure
accidents (cyan/gray in the colored version)

Bibliography

1. Aas, K., Jullum, M., & Løland, A. (2021). Explaining individual predictions when features
are dependent: More accurate approximations to Shapley values. Artificial Intelligence, 298.
Article 103502.
2. Abadi, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous
systems. https://www.tensorflow.org/
3. Agarwal, R., Melnick, L., Frosst, N., Zhang, X., Lengerich, B., Caruana, R., & Hinton,
G. E. (2021). Neural additive models: Interpretable machine learning with neural nets.
arXiv:2004.13912v2.
4. Ágoston, K. C., & Gyetvai, M. (2020). Joint optimization of transition rules and the premium
scale in a bonus-malus system. ASTIN Bulletin, 50/3, 743–776.
5. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19/6, 716–723.
6. Al-Mudafer, M. T., Avanzi, B., Taylor, G., & Wong, B. (2022). Stochastic loss reserving with
mixture density neural networks. Insurance: Mathematics & Economics, 105, 144–147.
7. Albrecher, H., Beirlant, J., & Teugels, J. L. (2017). Reinsurance: Actuarial and statistical
aspects. Hoboken: Wiley.
8. Albrecher, H., Bladt, M., & Yslas, J. (2022). Fitting inhomogeneous phase-type distributions
to data: The univariate and the multivariate case. Scandinavian Journal of Statistics, 49/1,
44–77.
9. Alzner, H. (1997). On some inequalities for the gamma and psi functions. Mathematics of
Computation, 66/217, 373–389.
10. Amari, S. (2016). Information geometry and its applications. New York: Springer.
11. Améndola, C., Drton, M., & Sturmfels, B. (2016). Maximum likelihood estimates for
Gaussian mixtures are transcendental. In I. S. Kotsireas, S. M. Rump, & C. K. Yap (Eds.), 6th
International Conference on Mathematical Aspects of Computer and Information Sciences.
Lecture notes in computer science (Vol. 9582, pp. 579–590). New York: Springer.
12. Ancona, M., Ceolini, E., Öztireli, C., & Gross, M. (2019). Gradient-based attribution methods.
In W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, & K.-R. Müller (Eds.), Explainable AI:
Interpreting, explaining and visualizing deep learning. Lecture notes in artificial intelligence
(Vol. 11700, pp. 169–191). New York: Springer.
13. Apley, D. W., & Zhu, J. (2020). Visualizing the effects of predictor variables in black box
supervised learning models. Journal of the Royal Statistical Society, Series B, 82/4, 1059–
1086.
14. Asmussen, S., Nerman, O., & Olsson, M. (1996). Fitting phase-type distributions via the EM
algorithm. Scandinavian Journal of Statistics, 23/4, 419–441.

© The Author(s) 2023 577


M. V. Wüthrich, M. Merz, Statistical Foundations of Actuarial Learning and its
Applications, Springer Actuarial, https://doi.org/10.1007/978-3-031-12409-9
578 Bibliography

15. Awad, Y., Bar-Lev, S. K., & Makov, U. (2022). A new class of counting distributions
embedded in the Lee–Carter model of mortality projections: A Bayesian approach. Risks,
10/6. Article 111.
16. Ay, N., Jost, J., Lê, H. V., & Schwachhöfer, L. (2017). Information geometry. New York:
Springer.
17. Ayuso, M., Guillén, M., & Nielsen, J. P. (2019). Improving automobile insurance ratemaking
using telematics: Incorporating mileage and driver behaviour data. Transportation, 46/3, 735–
752.
18. Ayuso, M., Guillén, M., & Pérez-Marín, A. M. (2016). Telematics and gender discrimination:
Some usage-based evidence on whether men’s risk of accidents differs from women’s. Risks,
4/2. Article 10.
19. Ayuso, M., Guillén, M., & Pérez-Marín, A. M. (2016). Using GPS data to analyse the distance
travelled to the first accident at fault in pay-as-you-drive insurance. Transportation Research
Part C: Emerging Technologies, 68, 160–167.
20. Bachelier, L. (1900). The theory of speculation. English translation by May, D. R. (2011).
Annales Scientifiques de l’École Normale Supérieure, 3/17, 21–89.
21. Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural machine translation by jointly learning
to align and translate. arXiv:1409.0473v7.
22. Bailey, R. A. (1963). Insurance rates with minimum bias. Proceedings of the Casualty
Actuarial Society, 50, 4–11.
23. Barndorff-Nielsen, O. (2014). Information and exponential families: In statistical theory.
New York: Wiley.
24. Barndorff-Nielsen, O., & Cox, D. R. (1979). Edgeworth and saddlepoint approximations with
statistical applications. Journal of the Royal Statistical Society, Series B, 41/3, 279–299.
25. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Transactions of Information Theory, 39/3, 930–945.
26. Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks.
Machine Learning, 14, 115–133.
27. Bengio Y., Courville A., & Vincent P. (2013). Representation learning: A review and new
perspectives. IEEE Transactions on Pattern Analysis and Machine Learning Intelligence,
35/8, 1798–1828.
28. Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language
model. Journal of Machine Learning Research, 3/Feb, 1137–1155.
29. Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., & Gauvain, J.-L. (2006). Neural
probabilistic language models. In D. E. Holmes & L. C. Jain (Eds.), Innovations in machine
learning. Studies in fuzziness and soft computing (Vol. 194, pp. 137–186). New York:
Springer.
30. Benhamou, E., & Melot, V. (2018). Seven proofs of the Pearson Chi-squared independence
test and its graphical interpretation. arXiv:1808.09171.
31. Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York:
Springer.
32. Bichsel, F. (1964). Erfahrungstarifierung in der Motorfahrzeug-Haftpflicht-Versicherung.
Bulletin of the Swiss Association of Actuaries, 1964, 119–130.
33. Bickel, P. J., & Doksum, K. A. (2001). Mathematical statistics: Basic ideas and selected
topics (Vol. I, 2nd ed.). Hoboken: Prentice Hall.
34. Billingsley, P. (1995). Probability and measure (3rd ed.). New York: Wiley.
35. Bishop, C. M. (1994). Mixture Density Networks. Technical Report. Aston University,
Birmingham.
36. Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
37. Bladt, M. (2022). Phase-type distributions for insurance pricing. ASTIN Bulletin, 52/2, 417–
448.
38. Blæsild, P., & Jensen, J. L. (1985). Saddlepoint formulas for reproductive exponential models.
Scandinavian Journal of Statistics, 12/3, 193–202.
Bibliography 579

39. Blier-Wong, C., Cossette, H., Lamontagne, L., & Marceau, E. (2022). Geographic ratemaking
with spatial embeddings. ASTIN Bulletin, 52/1, 1–31.
40. Blostein, M., & Miljkovic, T. (2019). On modeling left-truncated loss data using mixture
distributions. Insurance: Mathematics & Economics, 85, 35–46.
41. Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wiersta, D. (2015). Weight uncertainty in
neural network. Proceedings of Machine Learning Research, 37, 1613–1622.
42. Boucher, J. P., Côté, S., & Guillén, M. (2017). Exposure as duration and distance in telematics
motor insurance using generalized additive models. Risks, 5/4. Article 54.
43. Boucher, J. P., Denuit, M., & Guillén, M. (2007). Risk classification for claim counts:
A comparative analysis of various zeroinflated mixed Poisson and hurdle models. North
American Actuarial Journal, 11/4, 110–131.
44. Boucher, J. P., Denuit, M., & Guillén, M. (2008). Modelling of insurance claim count
with hurdle distribution for panel data. In B. C. Arnold, N. Balakrishnan, J. M. Sarabia, &
R. Mínguez (Eds.), Advances in mathematical and statistical modeling. Statistics for industry
and technology (pp. 45–59). Boston: Birkhäuser.
45. Boucher, J. P., Denuit, M., & Guillén, M. (2009). Number of accidents or number of claims?
An approach with zero-inflated Poisson models for panel data. Journal of Risk and Insurance,
76/4, 821–846.
46. Boucher, J. P., & Inoussa, R. (2014). A posteriori ratemaking with panel data. ASTIN Bulletin,
44/3, 587–612.
47. Boucher, J. P., & Pigeon, M. (2018). A claim score for dynamic claim counts modeling.
arXiv:1812.06157.
48. Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society, Series B, 26/2, 211–243.
49. Box, G. E. P., & Jenkins, G. M. (1976). Time series analysis: Forecasting and control. San
Francisco: Holden-Day.
50. Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets
and its application to the solution of problems in convex programming. USSR Computational
Mathematics and Mathematical Physics, 7/3, 200–217.
51. Breiman, L. (1996). Bagging predictors. Machine Learning, 24/2, 123–140.
52. Breiman, L. (2001). Random forests. Machine Learning, 45/1, 5–32.
53. Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16/3, 199–
215.
54. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and
regression trees. Wadsworth statistics/probability series. Monterey: Brooks/Cole Publishing.
55. Broadie, M., Du, Y., & Moallemi, C. (2011). Efficient risk estimation via nested sequential
estimation. Management Science, 57/6, 1171–1194.
56. Brouhns, N., Denuit, M., & Vermunt, J. K. (2002). A Poisson log-bilinear regression approach
to the construction of projected lifetables. Insurance: Mathematics & Economics, 31/3, 373–
393.
57. Brouhns, N., Guillén, M., Denuit, M., & Pinquet, J. (2003). Bonus-malus scales in segmented
tariffs with stochastic migration between segments. Journal of Risk and Insurance, 70/4, 577–
599.
58. Bühlmann, H., & Gisler, A. (2005). A course in credibility theory and its applications. New
York: Springer.
59. Bühlmann, P., & Mächler, M. (2014). Computational statistics. Lecture notes. ETH Zurich:
Department of Mathematics.
60. Bühlmann, P., & Yu, B. (2002). Analyzing bagging. Annals of Statistics, 30/4, 927–961.
61. Burguete, J., Gallant, R., & Souza, G. (1982). On unification of the asymptotic theory of
nonlinear econometric models. Economic Review, 1/2, 151–190.
62. Calderín-Ojeda, E., Gómez-Déniz, E., & Barranco-Chamorro, I. (2019). Modeling zero-
inflated count data with a special case of the generalised Poisson distribution. ASTIN Bulletin,
49/3, 689–708.
580 Bibliography

63. Cameron, A., & Trivedi, P. (1986). Econometric models based on count data: Comparisons
and applications of some estimators and tests. Journal of Applied Econometrics, 1, 29–54.
64. Cantelli, F. P. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale
Dell’Istituto Italiano Degli Attuari, 4, 421–424.
65. Carriere, J. F. (1996). Valuation of the early-exercise price for options using simulations and
nonparametric regression. Insurance: Mathematics & Economics, 19/1, 19–30.
66. Chan, J. S. K., Choy, S. T. B., Makov, U. E., & Landsman, Z. (2018). Modelling insurance
losses using contaminated generalised beta type-II distribution. ASTIN Bulletin, 48/2, 871–
904.
67. Charpentier, A. (2015). Computational actuarial science with R. Boca Raton: CRC Press.
68. Chaubard, F., Mundra, R., & Socher, R. (2016). Deep learning for natural language
processing. Lecture notes. Stanford: Stanford University.
69. Chen, A., Guillén, M., & Vigna, E. (2018). Solvency requirement in a unisex mortality model.
ASTIN Bulletin, 48/3, 1219–1243.
70. Chen, A., & Vigna, E. (2017). A unisex stochastic mortality model to comply with EU Gender
Directive. Insurance: Mathematics & Economics, 73, 124–136.
71. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system.
arXiv:1603.02754v3.
72. Chen, X. (2007). Large sample sieve estimation of semi-parametric models. In J. J. Heckman
& E. E. Leamer (Eds.), Handbook of econometrics (Vol. 6B, Chap. 76, pp. 5549–5632).
Amsterdam: Elsevier.
73. Chen, X., & Shen, X. (1998). Sieve extremum estimates for weakly dependent data.
Econometrica, 66/2, 289–314.
74. Cheridito, P., Ery, J., & Wüthrich, M. V. (2020). Assessing asset-liability risk with neural
networks. Risks, 8/1. Article 16.
75. Cheridito, P., Jentzen, A., & Rossmannek, F. (2022). Efficient approximation of high-
dimensional functions with neural networks. IEEE Transactions on Neural Networks and
Learning Systems, 33/7, 3079–3093.
76. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
& Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for
statistical machine translation. arXiv:1406.1078.
77. Chollet, F., Allaire, J. J., et al. (2017). R interface to Keras. https://github.com/rstudio/keras
78. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.) New York:
Lawrence Erlbaum Associates.
79. Congdon, P. (2014). Applied Bayesian modelling (2nd ed.). New York: Wiley.
80. Cook, D. R., & Croos-Dabrera, R. (1993). Partial residual plots in generalized linear models.
Journal of the American Statistical Association, 93/442, 730–739.
81. Cooray, K., & Ananda, M. M. A. (2005). Modeling actuarial data with composite lognormal-
Pareto model. Scandinavian Actuarial Journal, 2005/5, 321–334.
82. Corradin, A., Denuit, M., Detyniecki, M., Grari, V., Sammarco, M., & Trufin, J. (2022). Joint
modeling of claim frequencies and behavior signals in motor insurance. ASTIN Bulletin, 52/1,
33–54.
83. Cragg, J. G. (1971). Some statistical models for limited dependent variables with application
to the demand for durable good. Econometrica, 39/5, 829–844.
84. Craven, P., & Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische
Mathematik, 31, 377–403.
85. Creal, D. (2012). A survey of sequential Monte Carlo methods for economics and finance.
Econometric Reviews, 31/3, 245–296.
86. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics
of Control, Signals and Systems, 2, 303–314.
87. Daniels, H. E. (1954). Saddlepoint approximations in statistics. Annals of Mathematical
Statistics, 25, 631–650.
88. Darmois, G., (1935). Sur les lois de probabilité à estimation exhaustive. Comptes Rendus de
l’Académie des Sciences Paris, 260, 1265–1266.
Bibliography 581

89. De Jong, P., & Heller, G. Z. (2008). Generalized linear models for insurance data. Cambridge:
Cambridge University Press.
90. De Jong, P., Tickle, L., & Xu, J. (2020). A more meaningful parameterization of the Lee–
Carter model. Insurance: Mathematics & Economics, 94, 1–8.
91. De Pril, N. (1978). The efficiency of a bonus-malus system. ASTIN Bulletin, 10/1, 59–72.
92. Del Moral, P., Doucet, A., & Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of
the Royal Statistical Society, Series B, 68/3, 411–436.
93. Del Moral, P., Peters, G. W., & Vergé, C. (2012). An introduction to stochastic particle
integration methods: With applications to risk and insurance. In J. Dick, F. Y. Kuo, G. W.
Peters, & I. H. Sloan (Eds.), Monte Carlo and Quasi-Monte Carlo Methods 2012. Proceedings
in Mathematics & Statistics (Vol. 65, pp. 39–81). New York: Springer.
94. Delong, Ł., Lindholm, M., & Wüthrich, M. V. (2021). Making Tweedie’s compound Poisson
model more accessible. European Actuarial Journal, 11/1, 185–226.
95. Delong, Ł., Lindholm, M., & Wüthrich, M. V. (2021). Gamma mixture density networks
and their application to modeling insurance claim amounts. Insurance: Mathematics &
Economics, 101/B, 240–261.
96. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood for incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39/1, 1–22.
97. Denuit, M., Charpentier, A., & Trufin, J. (2021). Autocalibration and Tweedie-dominance for
insurance pricing in machine learning. Insurance: Mathematics & Economics, 101/B, 485–
497.
98. Denuit, M., Guillén, M., & Trufin, J. (2019). Multivariate credibility modelling for usage-
based motor insurance pricing with behavioural data. Annals of Actuarial Science, 13/2, 378–
399.
99. Denuit, M., Hainaut, D., & Trufin, J. (2019). Effective statistical learning methods for
actuaries I: GLMs and extensions. New York: Springer.
100. Denuit, M., Hainaut, D., & Trufin, J. (2020). Effective statistical learning methods for
actuaries II: Tree-based methods and extensions. New York: Springer.
101. Denuit, M., Hainaut, D., & Trufin, J. (2019). Effective statistical learning methods for
actuaries III: Neural networks and extensions. New York: Springer.
102. Denuit, M., Maréchal, X., Pitrebois, S., & Walhin, J.-F. (2007). Actuarial modelling of claim
counts: Risk classification, credibility and bonus-malus systems. New York: Wiley.
103. Denuit, M., & Trufin, J. (2021). Generalization error for Tweedie models: Decomposition and
error reduction with bagging. European Actuarial Journal, 11/1, 325–331.
104. Devriendt, S., Antonio, K., Reynkens, T., & Verbelen, R. (2021). Sparse regression with multi-
type regularized feature modeling. Insurance: Mathematics & Economics, 96, 248–261.
105. Dietterich, T. G. (2000). Ensemble methods in machine learning. In J. Kittel & F. Roli (Eds.),
Multiple classifier systems. Lecture notes in computer science (Vol. 1857, pp. 1–15). New
York: Springer.
106. Dimitriadis, T., Fissler, T., & Ziegel, J. F. (2020). The efficiency gap. arXiv:2010.14146.
107. Dobson, A. J. (2001). An introduction to generalized linear models. Boca Raton: Chapman &
Hall/CRC.
108. Döhler, S., & Rüschendorf, L. (2001). An approximation result for nets in functional
estimation. Statistics & Probability Letters, 52/4, 373–380.
109. Döhler, S., & Rüschendorf, L. (2003). Nonparametric estimation of regression functions in
point process models. Statistics Inference for Stochastic Processes, 6, 291–307.
110. Dong, Y., Huang, F., Yu, H., & Haberman, S. (2020). Multi-population mortality forecasting
using tensor decomposition. Scandinavian Actuarial Journal, 2020/8, 754–775.
111. Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen
years later. In D. Crisan & B. Rozovsky (Eds.), Handbook of nonlinear filtering (pp. 656–
670). Oxford: Oxford University Press.
112. Dunn, P. K., & Smyth, G. K. (2005). Series evaluation of Tweedie exponential dispersion
model densities. Statistics and Computing, 15, 267–280.
582 Bibliography

113. Dutang, C., & Charpentier, A. (2018). CASdatasets R package vignette. Reference manual.
Version 1.0-8, packaged 2018-05-20.
114. Eckart, G., & Young, G. (1936). The approximation of one matrix by another of lower rank.
Psychometrika, 1, 211–218.
115. Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7/1,
1–26.
116. Efron, B. (2020). Prediction, estimation, and attribution. Journal of the American Statistical
Association, 115/530, 636–655.
117. Efron, B., & Hastie, T. (2016). Computer age statistical inference: Algorithms, evidence, and
data science. Cambridge: Cambridge University Press.
118. Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman
& Hall.
119. Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles:
Consistent scoring functions, Choquet representations and forecast rankings. Journal of the
Royal Statistical Society, Series B, 78/3, 505–562.
120. Elbrächter, D., Perekrestenko, D., Grohs, P., & Bölcskei, H. (2021). Deep neural network
approximation theory. IEEE Transactions on Information Theory, 67/5, 2581–2623.
121. Embrechts, P., Klüppelberg, C., & Mikosch, T. (2003). Modelling extremal events for
insurance and finance (4th printing). New York: Springer.
122. Embrechts, P., & Wüthrich, M. V. (2022). Recent challenges in actuarial science. Annual
Review of Statistics and Its Applications, 9, 119–140.
123. Fahrmeir, L., & Tutz, G. (1994). Multivariate statistical modelling based on generalized
linear models. New York: Springer.
124. Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association, 96/456, 1348–1360.
125. Ferrario, A., & Hämmerli, R. (2019). On boosting: Theory and applications. SSRN
Manuscript ID 3402687. Version June 11, 2019.
126. Ferrario, A., & Nägelin, M. (2020). The art of natural language processing: Classical,
modern and contemporary approaches to text document classification. SSRN Manuscript ID
3547887. Version March 1, 2020.
127. Ferrario, A., Noll, A., & Wüthrich, M. V. (2018). Insights from inside neural networks. SSRN
Manuscript ID 3226852. Version April 23, 2020.
128. Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proceeding of the Royal
Society A, 144/852, 285–307.
129. Fissler, T., Lorentzen, C., & Mayer, M. (2022). Model comparison and calibration assess-
ment: User guide for consistent scoring functions in machine learning and actuarial practice.
arXiv:2202.12780.
130. Fissler, T., Merz, M., & Wüthrich, M. V. (2021). Deep quantile and deep composite model
regression. arXiv:2112.03075.
131. Fissler, T., & Ziegel, J. F. (2016). Higher order elicitability and Osband’s principle. The Annals
of Statistics, 4474, 1680–1707.
132. Fissler, T., Ziegel, J. F., & Gneiting, T. (2015). Expected shortfall is jointly elicitable with
value at risk - Implications for backtesting. arXiv:1507.00244v2.
133. Fortuin, C. M., Kasteleyn, P. W., & Ginibre, J. (1971). Correlation inequalities on some
partially ordered sets. Communication Mathematical Physics, 22/2, 89–103.
134. Frees, E. W. (2010). Regression modelling with actuarial and financial applications. Cam-
bridge: Cambridge University Press.
135. Frees, E. W. (2020). Loss data analytics. An open text authored by the Actuarial Community.
https://ewfrees.github.io/Loss-Data-Analytics/
136. Frees, E. W., & Huang, F. (2021). The discriminating (pricing) actuary. North American
Actuarial Journal (in press).
137. Frees, E. W., Lee, G., & Yang, L. (2016). Multivariate frequency-severity regression models
in insurance. Risks, 4/1. Article 4.
Bibliography 583

138. Frei, D. (2021). Insurance Claim Size Modelling with Mixture Distributions. MSc Thesis.
Department of Mathematics, ETH Zurich.
139. Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and
Computation, 121/2, 256–285.
140. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences, 55/1, 119–139.
141. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals
of Statistics, 29/5, 1189–1232.
142. Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33/1, 1–22.
143. Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. Annals of
Applied Statistics, 2/3, 916–954.
144. Fritsch, S., Günther, F., Wright, M. N., Suling, M., & Müller, S. M. (2019). neuralnet:
Training of neural networks. https://github.com/bips-hb/neuralnet
145. Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mech-
anism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36/4,
193–202.
146. Fung, T. C., Badescu, A. L., & Lin, X. S. (2019). A class of mixture of experts models for
general insurance: Application to correlated claim frequencies. ASTIN Bulletin, 49/3, 647–
688.
147. Fung, T. C., Badescu, A. L., & Lin, X. S. (2022). Fitting censored and truncated regression
data using the mixture of experts models. North American Actuarial Journal (in press).
148. Fung, T. C., Tzougas, G., & Wüthrich, M. V. (2022). Mixture composite regression models
with multi-type feature selection. North American Actuarial Journal (in press).
149. Gabrielli, A., Richman, R., & Wüthrich, M. V. (2020). Neural network embedding of the
over-dispersed Poisson reserving model. Scandinavian Actuarial Journal, 2020/1, 1–29.
150. Gallant, A. R., & White, H. (1988). There exists a neural network that does not make
avoidable mistakes. In IEEE 1988 International Conference on Neural Networks (pp. I657–
664).
151. Gao, G., Meng, S., & Wüthrich, M. V. (2019). Claims frequency modeling using telematics
car driving data. Scandinavian Actuarial Journal, 2019/2, 143–162.
152. Gao, G., Meng, S., & Wüthrich, M. V. (2022). What can we learn from telematics car driving
data: A survey. Insurance: Mathematics & Economics, 104, 185–199.
153. Gao, G., & Shi, Y. (2021). Age-coherent extensions of the Lee–Carter model. Scandinavian
Actuarial Journal, 2021/10, 998–1016.
154. Gao, G., Wang, H., & Wüthrich, M. V. (2022). Boosting Poisson regression models with
telematics car driving data. Machine Learning, 111/1, 243–272.
155. Gao, G., & Wüthrich, M. V. (2018). Feature extraction from telematics car driving heatmaps.
European Actuarial Journal, 8/2, 383–406.
156. Gao, G., & Wüthrich, M. V. (2019). Convolutional neural network classification of telematics
car driving data. Risks, 7/1. Article 6.
157. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).
Bayesian data analysis (3rd ed.). Boca Raton: Chapman & Hall/CRC.
158. Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1995). Markov chain Monte Carlo in
practice. Boca Raton: Chapman & Hall.
159. Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale
Dell’Istituto Italiano Degli Attuari, 4, 92–99.
160. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics. Proceedings of Machine Learning Research (Vol. 9, pp. 249–256).
161. Glynn, P., & Lee, S. H. (2003). Computing the distribution function of a conditional
expectation via Monte Carlo: Discrete conditioning spaces. ACM Transactions on Modeling
and Computer Simulation, 13/3, 238–258.
584 Bibliography

162. Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American
Statistical Association, 106/494, 746–762.
163. Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation.
Journal of the American Statistical Association, 102/477, 359–378.
164. Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking inside the black box:
Visualizing statistical learning with plots of individual conditional expectation. Journal of
Computational and Graphical Statistics, 24/1, 44–65.
165. Golub, G., & Van Loan, C. (1983). Matrix computations. Baltimore: John Hopkins University
Press.
166. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge: MIT Press.
http://www.deeplearningbook.org
167. Gourieroux, C., Laurent, J. P., & Scaillet, O. (2000). Sensitivity analysis of values at risk.
Journal of Empirical Finance, 7/3–4, 225–245.
168. Gourieroux, C., Montfort, A., & Trognon, A. (1984). Pseudo maximum likelihood methods:
Theory. Econometrica, 52/3, 681–700.
169. Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian
model determination. Biometrika, 82/4, 711–732.
170. Green, P. J. (2003). Trans-dimensional Markov chain Monte Carlo. In P. J. Green, N. L. Hjort,
& S. Richardson (Eds.), Highly structured stochastic systems. Oxford statistical science series
(pp. 179–206). Oxford: Oxford University Press.
171. Greene, W. (2008). Functional forms for the negative binomial model for count data.
Economics Letters, 99, 585–590.
172. Grenander, U. (1981). Abstract inference. New York: Wiley.
173. Grün, B., & Miljkovic, T. (2019). Extending composite loss models using a general
framework of advanced computational tools. Scandinavian Actuarial Journal, 2019/8, 642–
660.
174. Guillén, M. (2012). Sexless and beautiful data: From quantity to quality. Annals of Actuarial
Science, 6/2, 231–234.
175. Guillén, M., Bermúdez, L., & Pitarque, A. (2021). Joint generalized quantile and conditional
tail expectation for insurance risk analysis. Insurance: Mathematics & Economics, 99, 1–8.
176. Guo, C., & Berkhahn, F. (2016). Entity embeddings of categorical variables.
arXiv:1604.06737.
177. Ha, H., & Bauer, D. (2022). A least-squares Monte Carlo approach to the estimation of
enterprise risk. Finance and Stochastics, 26, 417–459.
178. Hainaut, D. (2018). A neural-network analyzer for mortality forecast. ASTIN Bulletin, 48/2,
481–508.
179. Hainaut, D., & Denuit, M. (2020). Wavelet-based feature extraction for mortality projection.
ASTIN Bulletin, 50/3, 675–707.
180. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics.
New York: Wiley.
181. Hastie, T., & Tibshirani, R. (1986). Generalized additive models (with discussion). Statistical
Science, 1, 297–318.
182. Hastie, T., & Tibshirani, R. (1990). Generalized additive models. New York: Chapman &
Hall.
183. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data
mining, inference, and prediction (2nd ed.). New York: Springer.
184. Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: The
Lasso and generalizations. Boca Raton: CRC Press.
185. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57/1, 97–109.
186. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313/5786, 504–507.
187. Hinton, G., Srivastava, N., & Swersky, K. (2012). Neural networks for machine learning.
Lecture slides. Toronto: University of Toronto.
Bibliography 585

188. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9/8, 1735–1780.
189. Hong, L. J. (2009). Estimating quantile sensitivities. Operations Research, 57/1, 118–130.
190. Horel, E., & Giesecke, K. (2020). Significance tests in neural networks. Journal of Machine
Learning Research, 21/227, 1–29.
191. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural
Networks, 4/2, 251–257.
192. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2/5, 359–366.
193. Huang, Y., & Meng, S. (2019). Automobile insurance classification ratemaking based on
telematics driving data. Decision Support Systems, 127. Article 113156.
194. Huber, P. J. (1981). Robust statistics. Hoboken: Wiley.
195. Human Mortality Database (2018). University of California, Berkeley (USA), and Max
Planck Institute for Demographic Research (Germany). www.mortality.org
196. Hyndman, R. J., Booth, H., & Yasmeen, F. (2013). Coherent mortality forecasting: The
product-ratio method with functional time series models. Demography, 50/1, 261–283.
197. Hyndman, R. J., & Ullah, M. S. (2007). Robust forecasting of mortality and fertility rates: A
functional data approach. Computational Statistics & Data Analysis, 51/10, 4942–4956.
198. Isenbeck, M., & Rüschendorf, L. (1992). Completeness in location families. Probability and
Mathematical Statistics, 13/2, 321–343.
199. Johansen, A. M., Evers, L., & Whiteley, N. (2010). Monte Carlo methods. Lecture notes.
Bristol: Department of Mathematics, University of Bristol.
200. Jørgensen, B. (1981). Statistical properties of the generalized inverse Gaussian distribution.
Lecture notes in statistics. New York: Springer.
201. Jørgensen, B. (1986). Some properties of exponential dispersion models. Scandinavian
Journal of Statistics, 13/3, 187–197.
202. Jørgensen, B. (1987). Exponential dispersion models. Journal of the Royal Statistical Society,
Series B, 49/2, 127–145.
203. Jørgensen, B. (1997). The theory of dispersion models. Boca Raton: Chapman & Hall.
204. Jørgensen, B., & de Souza, M. C. P. (1994). Fitting Tweedie’s compound Poisson model to
insurance claims data. Scandinavian Actuarial Journal, 1994/1, 69–93.
205. Jospin, L. V., Buntine, W., Boussaid, F., Laga, H., & Bennamoun, M. (2020). Hands-on
Bayesian neural networks - A tutorial for deep learning users. arXiv: 2007.06823.
206. Jung, J. (1968). On automobile insurance ratemaking. ASTIN Bulletin, 5/1, 41–48.
207. Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of
Basic Engineering, 82/1, 35–45.
208. Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. MSc Thesis. Department of Mathematics, University of Chicago.
209. Kearns, M., & Valiant, L. G. (1988). Learning Boolean Formulae or Finite Automata is
Hard as Factoring. Technical Report TR-14–88. Aiken Computation Laboratory, Harvard
University.
210. Kearns, M., & Valiant, L. G. (1994). Cryptographic limitations on learning Boolean formulae
and finite automata. Journal of the Association for Computing Machinery ACM, 41/1, 67–95.
211. Kellner, R., Nagl, M., & Rösch, D. (2022). Opening the black box - Quantile neural networks
for loss given default prediction. Journal of Banking & Finance, 134, 1–20.
212. Keydana, S., Falbel, D., & Kuo, K. (2021). R package ‘tfprobability’: Interface to ‘Tensor-
Flow Probability’. Version 0.12.0.0, May 20, 2021.
213. Khalili, A. (2010). New estimation and feature selection methods in mixture-of-experts
models. Canadian Journal of Statistics, 38/4, 519–539.
214. Khalili, A., & Chen, J. (2007). Variable selection in finite mixture of regression models.
Journal of the American Statistical Association, 102/479, 1025–1038.
215. Kidger, P., & Lyons, T. (2020). Universal approximation with deep narrow networks.
Proceedings of Machine Learning Research, 125, 2306–2327.
216. Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
586 Bibliography

217. Kingma, D. P., & Welling, M. (2019). An introduction to variational autoencoders. Founda-
tions and Trends in Machine Learning, 12/4, 307–392.
218. Kleinow, T. (2015). A common age effect model for the mortality of multiple populations.
Insurance: Mathematics & Economics, 63, 147–152.
219. Knyazev, B., Drozdzal, M., Taylor, G. W., & Romero-Soriano, A. (2021). Parameter
prediction of unseen deep architectures. arXiv:2110.13100.
220. Koenker, R., & Bassett, G., Jr. (1978). Regression quantiles. Econometrica, 46/1, 33–50.
221. Kolmogoroff, A. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. New York:
Springer.
222. Komunjer, I., & Vuong, Q. (2010). Efficient estimation in dynamic conditional quantile
models. Journal of Econometrics, 157, 272–285.
223. Koopman, B. O. (1936). On distributions admitting a sufficient statistics. Transactions of the
American Mathematical Society, 39, 399–409.
224. Krah, A.-S., Nikolić, Z., & Korn, R. (2020). Least-squares Monte Carlo for proxy modeling
in life insurance: neural networks. Risks, 8/4. Article 116.
225. Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural
networks. AIChE Journal, 37/2, 233–243.
226. Krizhevsky, Al., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep
convolutional neural networks. Communications of the Association for Computing Machinery
ACM, 60/6, 84–90.
227. Krüger, F., & Ziegel, J. F. (2021). Generic conditions for forecast dominance. Journal of
Business & Economics Statistics, 39/4, 972–983.
228. Kuhn, H. W., & Tucker, A. W. (1951). Nonlinear programming. Proceedings of 2nd Berkeley
Symposium (pp. 481–492). Berkeley: University of California Press.
229. Künsch, H. R. (2005). Mathematische Statistik. Lecture notes. ETH Zurich: Department of
Mathematics.
230. Kuo, K. (2020). Individual claims forecasting with Bayesian mixture density networks.
arXiv:2003.02453.
231. Kuo, K., & Richman, R. (2021). Embeddings and attention in predictive modeling.
arXiv:2104.03545v1.
232. Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in
manufacturing. Technometrics, 34/1, 1–14.
233. Landsman, Z., & Valdez, E. A. (2005). Tail conditional expectation for exponential dispersion
models. ASTIN Bulletin, 35/1, 189–209.
234. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., &
Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural
Computation, 1/4, 541–551.
235. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86/11, 2278–2324.
236. Lee, G. Y., Manski, S., & Maiti, T. (2020). Actuarial applications of word embedding models.
ASTIN Bulletin, 50/1, 1–24.
237. Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact post-selection inference, with
application to the lasso. Annals of Statistics, 44/3, 907–927.
238. Lee, R. D., & Carter, L. R. (1992). Modeling and forecasting U.S. mortality. Journal of the
American Statistical Association, 87/419, 659–671.
239. Lee, S. C. K. (2021). Addressing imbalanced insurance data through zero-inflated Poisson
regression boosting. ASTIN Bulletin, 51/1, 27–55.
240. Lee, S. C. K., & Lin, X. S. (2010). Modeling and evaluating insurance losses via mixtures of
Erlang distributions. North American Actuarial Journal, 14/1, 107–130.
241. Lee, S. C. K., & Lin, X. S. (2018). Delta boosting machine with application to general
insurance. North American Actuarial Journal, 22/3, 405–425.
242. Lee, S. H. (1998). Monte Carlo Computation of Conditional Expectation Quantiles. PhD
Thesis, Stanford University.
243. Lehmann, E. L. (1959). Testing statistical hypotheses. New York: Wiley.
Bibliography 587

244. Lehmann, E. L. (1983). Theory of point estimation. New York: Wiley.


245. Lemaire, J. (1995). Bonus-malus systems in automobile insurance. Dordrecht: Kluwer
Academic Publisher.
246. Lemaire, J., Park, S. C., & Wang, K. (2016). The use of annual mileage as a rating variable.
ASTIN Bulletin, 46/1, 39–69.
247. Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks
with a nonpolynomial activation function can approximate any function. Neural Networks,
6/6, 861–867.
248. Li, H., & Lu, Y. (2017). Coherent forecasting of mortality rates: a sparse vector-autoregression
approach. ASTIN Bulletin, 47/2, 563–600.
249. Li, N., & Lee, R. (2005). Coherent mortality forecasts for a group of populations: An
extension of the Lee–Carter method. Demography, 42/3, 575–594.
250. Li, N., Lee, R., & Gerland, P. (2013). Extending the Lee–Carter method to model the rotation
of age patterns of mortality decline for long-term projections. Demography, 50/6, 2037–2051.
251. Li, Z., Wang, F., & Zhao, Z. (2022). A new class of composite GBII regression models with
varying threshold for modelling heavy-tailed data. arXiv:2203.11469v2.
252. Lindholm, M., & Palmborg, L. (2022). Efficient use of data from LSTM mortality forecasting.
European Actuarial Journal (in press).
253. Lindholm, M., Richman, R., Tsanakas, A., & Wüthrich, M. V. (2022). Discrimination-free
insurance pricing. ASTIN Bulletin, 52/1, 55–89.
254. Loader, C., Sun, J., Lucent Technologies, & Liaw, A. (2022). locfit: Local regression,
likelihood and density estimation. https://cran.r-project.org/web/packages/locfit/index.html
255. Loimaranta, K. (1972). Some asymptotic properties of bonus systems. ASTIN Bulletin, 6/3,
233–245.
256. Lomax, K. S. (1954). Business failures: Another example of the analysis of failure data.
Journal of the American Statistical Association, 49/268, 847–852.
257. Longstaff, F., & Schwartz, E. (2001). Valuing American options by simulation: A simple
least-squares approach. The Review of Financial Studies, 14/1, 113–147.
258. Lorentzen, C., & Mayer, M. (2020). Peeking into the black box: An actuarial case study for
interpretable machine learning. SSRN Manuscript ID 3595944. Version May 7, 2020.
259. Loser, F. (2020). Private communication.
260. Lu, J., Shen, Z., Yang, H., & Zhang, S. (2021). Deep network approximation for smooth
functions. SIAM Journal on Mathematical Analysis, 53/5, 5465–5506.
261. Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions.
In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett
(Eds.), Advances in neural information processing systems (Vol. 30, pp. 4765–4774). New
York: Curran Associates.
262. Makavoz, Y. (1996). Random approximants and neural networks. Journal of Approximation
Theory, 85/1, 98–109.
263. Mallat, S. (2012). Group invariant scattering. Communication in Pure and Applied Mathe-
matics, 65/10, 1331–1398.
264. Manski, S., Yang, K., Lee, G. Y., & Maiti, T. (2021). Extracting information from textual
descriptions for actuarial applications. Annals of Actuarial Science, 15/3, 605–622.
265. McCullagh, P., & Nelder, J. A. (1983). Generalized linear models. Boca Raton: Chapman &
Hall.
266. McGrayne, S. B. (2011). The theory that would not die. New Haven: Yale University Press.
267. McLachlan, G. J., & Krishnan, T. (2008). The EM algorithm and extensions (2nd ed.). New
York: Wiley.
268. McNeil, A. J., Frey, R., & Embrechts, P. (2015). Quantitative risk management: Concepts,
techniques and tools (revised edition). Princeton: Princeton University Press.
269. Meier, D., & Wüthrich, M. V. (2020). Convolutional neural network case studies: (1)
anomalies in mortality rates (2) image recognition. SSRN Manuscript ID 3656210. Version
July 19, 2020.
588 Bibliography

270. Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research,
7, 983–999.
271. Meng, S., Wang, H., Shi, Y., & Gao, G. (2022). Improving automobile insurance claims
frequency prediction with telematics car driving data. ASTIN Bulletin, 52/2, 363–391.
272. Mercer, J. (1909). Functions of positive and negative type and their connection with the theory
of integral equations. Philosophical Transactions of the Royal Society A, 209/441–458, 415–
446.
273. Merz, M., Richman, R., Tsanakas, A., & Wüthrich, M. V. (2022). Interpreting deep learning
models with marginal attribution by conditioning on quantiles. Data Mining and Knowledge
Discovery, 36, 1335–1370.
274. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953).
Equation of state calculations by fast computing machines. Journal of Chemical Physics,
21/6, 1087–1092.
275. Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv:1301.3781.
276. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed repre-
sentations of words and phrases and their compositionality. Advances in Neural Information
Processing Systems, 26, 3111–3119.
277. Mikosch, T. (2006). Non-life insurance mathematics. New York: Springer.
278. Miljkovic, T., & Grün, B. (2016). Modeling loss data using mixtures of distributions.
Insurance: Mathematics & Economics, 70, 387–396.
279. Mirsky, L. (1960). Symmetric gauge functions and unitarily invariant norms. Quarterly
Journal of Mathematics, 11/1, 50–59.
280. Montúfar, G., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the number of linear regions of
deep neural networks. Neural Information Processing Systems Proceedings, 27, 2924–2932.
281. Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer.
282. Nelder, J. A., & Pregibon, D. (1987). An extended quasi-likelihood function. Biometrika,
74/2, 221–232.
283. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the
Royal Statistical Society, Series A, 135/3, 370–384.
284. Nesterov, Y. (2007). Gradient Methods for Minimizing Composite Objective Function.
Technical Report 76. Center for Operations Research and Econometrics (CORE), Catholic
University of Louvain.
285. Nielsen, F. (2020). An elementary introduction to information geometry. Entropy, 22/10,
1100.
286. Nigri, A., Levantesi, S., Marino, M., Scognamiglio, S., & Perla, F. (2019). A deep learning
integrated Lee–Carter model. Risks, 7/1. Article 33.
287. Noll, A., Salzmann, R., & Wüthrich, M. V. (2018). Case study: French motor third-party
liability claims. SSRN Manuscript ID 3164764. Version March 4, 2020.
288. Oelker, M.-R., & Tutz, G. (2017). A uniform framework for the combination of penalties in
generalized structured models. Advances in Data Analysis and Classification, 11, 97–120.
289. O’Hagan, W., Murphy, B. T., Scrucca, L., & Gormley, I. C. (2019). Investigation of parameter
uncertainty in clustering using a Gaussian mixture model via jackknife, bootstrap and
weighted likelihood bootstrap. Computational Statistics, 34/4, 1779–1813.
290. Ohlsson, E., & Johansson, B. (2010). Non-life insurance pricing with generalized linear
models. New York: Springer.
291. Paefgen, J., Staake, T., & Fleisch, E. (2014). Multivariate exposure modeling of accident risk:
Insights from pay-as-you-drive insurance data. Transportation Research Part A: Policy and
Practice, 61, 27–40.
292. Parikh, N., & Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization,
1/3, 123–231.
293. Park, J., & Sandberg, I. (1991). Universal approximation using radial-basis-function net-
works. Neural Computation, 3/2, 246–257.
Bibliography 589

294. Park, J., & Sandberg, I. (1993). Approximation and radial-basis-function networks. Neural
Computation, 5/2, 305–316.
295. Parodi, P. (2020). A generalised property exposure rating framework that incorporates scale-
independent losses and maximum possible loss uncertainty. ASTIN Bulletin, 50/2, 513–553.
296. Paszke, A., et al. (2019). PyTorch: An imperative style, high-performance deep learning
library. In Advances in Neural Information Processing Systems (Vol. 32, pp. 8024–8035).
297. Patton, A. J. (2020). Comparing possibly misspecified forecasts. Journal of Business &
Economic Statistics, 38/4, 796–809.
298. Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.
299. Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal inference in statistics: A primer.
Chichester: Wiley.
300. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word
representation. Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP) (pp. 1532–1543).
301. Perla, F., Richman, R., Scognamiglio, S., & Wüthrich, M. V. (2021). Time-series forecasting
of mortality rates using deep learning. Scandinavian Actuarial Journal, 2021/7, 572–598.
302. Petrushev, P. (1999). Approximation by ridge functions and neural networks. SIAM Journal
on Mathematical Analysis, 30/1, 155–189.
303. Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta
Numerica, 8, 143–195.
304. Pinquet, J. (1998). Designing optimal bonus-malus systems from different types of claims.
ASTIN Bulletin, 28/2, 205–220.
305. Pinquet, J., Guillén, M., & Bolance, C. (2001). Long-range contagion in automobile insurance
data: estimation and implications for experience rating. ASTIN Bulletin, 31/2, 337–348.
306. Pitman, E. J. G. (1936). Sufficient statistics and intrinsic accuracy. Proceedings of the
Cambridge Philosophical Society, 32/4, 567–579.
307. R Core Team (2021). R: a language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria. https://www.R-project.org/
308. Renshaw, A. E., & Haberman, S. (2003). Lee–Carter mortality forecasting with age-specific
enhancement. Insurance: Mathematics & Economics, 33/2, 255–272.
309. Renshaw, A. E., & Haberman, S. (2006). A cohort-based extension to the Lee–Carter model
for mortality reduction factors. Insurance: Mathematics & Economics, 38/3, 556–570.
310. Rentzmann, S., & Wüthrich, M. V. (2019). Unsupervised learning: What is a sports car?
SSRN Manuscript ID 3439358. Version October 14, 2019.
311. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining
the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’16 (pp. 1135–1144). New
York: Association for Computing Machinery.
312. Richman, R. (2021). AI in actuarial science - A review of recent advances - Part 1. Annals of
Actuarial Science, 15/2, 207–229.
313. Richman, R. (2021). AI in actuarial science - A review of recent advances - Part 2. Annals of
Actuarial Science, 15/2, 230–258.
314. Richman, R. (2021). Mind the gap - Safely incorporating deep learning models into the
actuarial toolkit. SSRN Manuscript ID 3857693. Version April 2, 2021.
315. Richman, R., & Wüthrich, M. V. (2020). Nagging predictors. Risks, 8/3. Article 83.
316. Richman, R., & Wüthrich, M. V. (2021). A neural network extension of the Lee-Carter model
to multiple populations. Annals of Actuarial Science, 15/2, 346–366.
317. Richman, R., & Wüthrich, M. V. (2022). LocalGLMnet: Interpretable deep learning for
tabular data. Scandinavian Actuarial Journal (in press).
318. Richman, R., & Wüthrich, M. V. (2021). LASSO regularization within the LocalGLMnet
architecture. SSRN Manuscript ID 3927187. Version June 1, 2022.
319. Robert, C. P. (2001). The Bayesian choice (2nd ed.). New York: Springer.
320. Rolski, T., Schmidli, H., Schmidt, V., & Teugels, J. (1999). Stochastic processes for insurance
and finance. New York: Wiley.
590 Bibliography

321. Ruckstuhl, N. (2021). Multi-Population Mortality Modeling Using Tensor Decomposition.


MSc Thesis. Department of Mathematics, ETH Zurich.
322. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes
decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206–215.
323. Rüger, S. M., & Ossen, A. (1997). The metric structure of weight space. Neural Processing
Letters, 5/2, 1–9.
324. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-
propagating errors. Nature, 323/6088, 533–536.
325. Russolillo, M., Giordano, G., & Haberman, S. (2010). Extending the Lee-Carter model: A
three-way decomposition. Scandinavian Actuarial Journal, 2011/2, 96–117.
326. Saerens, M. (2000). Building cost functions minimizing to some summary statistics. IEEE
Transactions on Neural Networks, 11, 1263–1271.
327. Savage, L. J. (1971). Elicitable of personal probabilities and expectations. Journal of the
American Statistical Association, 66/336, 783–810.
328. Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5/2, 197–227.
329. Schelldorfer, J., & Wüthrich, M. V. (2019). Nesting classical actuarial models into neural
networks. SSRN Manuscript ID 3320525. Version January 25, 2019.
330. Schnürch, S., & Korn, R. (2022). Point and interval forecasts of death rates using neural
networks. ASTIN Bulletin, 52/1, 333–360.
331. Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6/2, 461–
464.
332. Scollnik, D. P. M. (2007). On composite lognormal-Pareto models. Scandinavian Actuarial
Journal, 2007/1, 20–33.
333. Shang, H. L. (2019). Dynamic principal component regression: Application to age-specific
mortality forecasting. ASTIN Bulletin, 49/3, 619–645.
334. Shang, H. L., & Haberman, S. (2020). Forecasting multiple functional time series in a group
structure: an application to mortality. ASTIN Bulletin, 50/2, 357–379.
335. Shapley, L. S. (1953). A value for n-person games. In H. W. Kuhn, & A. W. Tucker (Eds.),
Contributions to the theory of games (AM-28) (Vol. II, pp. 307–318). Princeton: Princeton
University Press.
336. Shen, X., Jiang, C., Sakhanenko, L., & Lu, Q. (2019). Asymptotic properties of neural network
sieve estimators. arXiv:1906.00875v2.
337. Shlens, J. (2014). A tutorial on principal component analysis. arXiv:1404.1100.
338. Shmueli, G. (2010). To explain or to predict? Statistical Science, 25/3, 289–310.
339. Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning important features through
propagating activation differences. In Proceedings of the 34th International Conference on
Machine Learning, Proceedings of Machine Learning Research, PMLR (Vol. 70, pp. 3145–
3153). Sydney: International Convention Centre.
340. Shrikumar, A., Greenside, P., Shcherbina, A., & Kundaje, A. (2016). Not just a black box:
Learning important features through propagating activation differences. arXiv:1605.01713.
341. Smyth, G. K. (1989). Generalized linear models with varying dispersion. Journal of the Royal
Statistical Society, Series B, 51/1, 47–60.
342. Smyth, G. K., & Jørgensen, B. (2002). Fitting Tweedie’s compound Poisson model to
insurance claims data: dispersion modeling. ASTIN Bulletin, 32/1, 143–157.
343. Smyth, G. K., & Verbyla, A. P. (1999). Double generalized linear models: Approximate
REML and diagnostics. In H. Friedl, A. Berghold, & G. Kauermann (Eds.), Proceedings of
the 14th International Workshop on Statistical Modelling (pp. 66–80). Technical University,
Graz.
344. So, B., Boucher, J.-P., & Valdez, E. A. (2021). Cost-sensitive multi-class AdaBoost for
understanding behavior based on telematics. ASTIN Bulletin, 51/3, 719–751.
345. Srivastava, N., Hinton, G., Krizhevsky, A. Sutskever, I., & Salakhutdinov, R. (2014). Dropout:
A simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15/56, 1929–1958.
Bibliography 591

346. Strassen, V. (1965). The existence of probability measures with given marginals. Annals of
Mathematical Statistics, 36/2, 423–439.
347. Sun, S., Bi, J., Guillén, M., & Pérez-Marín, A. M. (2020). Assessing driving risk using internet
of vehicles data: An analysis based on generalized linear models. Sensors, 20/9. Article 2712.
348. Sundberg, R. (1974). Maximum likelihood theory for incomplete data from an exponential
family. Scandinavian Journal of Statistics, 1/2, 49–58.
349. Sundberg, R. (1976). An iterative method for solution of the likelihood equations for
incomplete data from exponential families. Communication in Statistics - Simulation and
Computation, 5/1, 55–64.
350. Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. J. (2006). Nonparametric quantile estimation.
Journal of Machine Learning Research, 7, 1231–1264.
351. Thomson, W. (1979). Eliciting production possibilities from a well-informed manager.
Journal of Economic Theory, 20, 360–380.
352. Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the
Royal Statistical Society, Series B, 58/1, 267–288.
353. Tikhonov, A. N. (1943). On the stability of inverse problems. Doklady Akademii Nauk SSSR,
39/5, 195–198.
354. Troxler, A., & Schelldorfer, J. (2022). Actuarial applications of natural language pro-
cessing using transformers: Case studies for using text features in an actuarial context.
arXiv:2206.02014.
355. Tsanakas, A., & Millossovich, P. (2016). Sensitivity analysis using risk measures. Risk
Analysis, 36/1, 30–48.
356. Tsitsiklis, J., & Van Roy, B. (2001). Regression methods for pricing complex American-style
options. IEEE Transactions on Neural Networks, 12/4, 694–703.
357. Tukey, J. W. (1977). Exploratory data analysis. Reading: Addison-Wesley.
358. Tweedie, M. C. K. (1984). An index which distinguishes between some important exponential
families. In J. K. Ghosh, & J. Roy (Eds.) Statistics: Applications and new directions.
Proceeding of the Indian Statistical Golden Jubilee International Conference (pp. 579–604).
Calcutta: Indian Statistical Institute.
359. Tzougas, G., & Karlis, D. (2020). An EM algorithm for fitting a new class of mixed
exponential regression models with varying dispersion. ASTIN Bulletin, 50/2, 555–583.
360. Tzougas, G., Vrontos, S., & Frangos, N. (2014). Optimal bonus-malus systems using finite
mixture models. ASTIN Bulletin, 44/2, 417–444.
361. Uribe, J. M., & Guillén, M. (2019). Quantile regression for cross-sectional and time series
data applications in energy markets using R. New York: Springer.
362. Valiant, L. G. (1984). A theory of learnable. Communications of the Association for
Computing Machinery ACM, 27/11, 1134–1142.
363. Van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge: Cambridge University Press.
364. Van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes:
With applications to statistics. New York: Springer.
365. Vapnik, V., & Chervonenkis, A. (1974). The theory of pattern recognition. Moscow: Nauka.
366. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &
Polosukhin, I. (2017). Attention is all you need. arXiv:1706.03762v5.
367. Vaughan, J., Sudjianto, A., Brahimi, E., Chen, J., & Nair, V. N. (2018). Explainable neural
networks based on additive index models. arXiv:1806.01933v1.
368. Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S. New York:
Springer.
369. Venter, G. C. (1983). Transformed beta and gamma distributions and aggregate losses. Proceedings of the Casualty Actuarial Society, 71, 289–308.
370. Verbelen, R., Antonio, K., & Claeskens, G. (2018). Unraveling the predictive power of
telematics data in car insurance pricing. Journal of the Royal Statistical Society: Series C,
67/5, 1275–1304.
371. Verbelen, R., Gong, L., Antonio, K., Badescu, A., & Lin, S. (2015). Fitting mixtures of
Erlangs to censored and truncated data using the EM algorithm. ASTIN Bulletin, 45/3, 729–
758.
372. Verschuren, R. M. (2021). Predictive claim scores for dynamic multi-product risk classifica-
tion in insurance. ASTIN Bulletin, 51/1, 1–25.
373. Wager, S., Wang, S., & Liang, P. S. (2013). Dropout training as adaptive regularization. In C.
Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Weinberger (Eds.), Advances in neural
information processing systems (Vol. 26, pp. 351–359). Red Hook: Curran Associates.
374. Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of
Mathematical Statistics, 20/4, 595–601.
375. Wang, C.-W., Zhang, J., & Zhu, W. (2021). Neighbouring prediction for mortality. ASTIN
Bulletin, 51/3, 689–718.
376. Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models and the
Gauss–Newton method. Biometrika, 61/3, 439–447.
377. Weidner, W., Transchel, F. W. G., & Weidner, R. (2016). Classification of scale-sensitive
telematic observables for risk-individual pricing. European Actuarial Journal, 6/1, 3–24.
378. Weidner, W., Transchel, F. W. G., & Weidner, R. (2017). Telematic driving profile classifica-
tion in car insurance pricing. Annals of Actuarial Science, 11/2, 213–236.
379. White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural
Computation, 1/4, 425–464.
380. White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward networks
can learn arbitrary mappings. Neural Networks, 3/5, 535–549.
381. White, H., & Wooldridge, J. M. (1991). Some results on sieve estimation with dependent observations. In W. Barnett, J. Powell, & G. Tauchen (Eds.), Nonparametric and semiparametric methods in econometrics and statistics (pp. 459–493). Cambridge: Cambridge University
Press.
382. Wiatowski, T., & Bölcskei, H. (2018). A mathematical theory of deep convolutional neural
networks for feature extraction. IEEE Transactions on Information Theory, 64/3, 1845–1866.
383. Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square. Proceedings of the National Academy of Sciences, 17/12, 684–688.
384. Wood, S. N. (2017). Generalized additive models: An introduction with R (2nd ed.). Boca
Raton: CRC Press.
385. Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics,
11/1, 95–103.
386. Wu, C. F. J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis.
Annals of Statistics, 14/4, 1261–1295.
387. Wüthrich, M. V. (2013). Non-life insurance: Mathematics & statistics. SSRN Manuscript ID
2319328. Version of February 7, 2022.
388. Wüthrich, M. V. (2017). Covariate selection from telematics car driving data. European
Actuarial Journal, 7/1, 89–108.
389. Wüthrich, M. V. (2017). Sequential Monte Carlo sampling for state space models. In V.
Kreinovich, S. Sriboonchitta, & V.-N. Huynh (Eds.), Robustness in econometrics. Studies
in computational intelligence (Vol. 592, pp. 25–50). New York: Springer.
390. Wüthrich, M. V. (2020). Bias regularization in neural network models for general insurance
pricing. European Actuarial Journal, 10/1, 179–202.
391. Wüthrich, M. V. (2022). Model selection with Gini indices under auto-calibration.
arXiv:2207.14372.
392. Wüthrich, M. V., & Buser, C. (2016). Data analytics for non-life insurance pricing. SSRN
Manuscript ID 2870308. Version of October 27, 2021.
393. Wüthrich, M. V., & Merz, M. (2013). Financial modeling, actuarial valuation and solvency
in insurance. New York: Springer.
394. Wüthrich, M. V., & Merz, M. (2019). Editorial: Yes, we CANN! ASTIN Bulletin, 49/1, 1–3.
395. Yan, H., Peters, G. W., & Chan, J. S. K. (2020). Multivariate long-memory cohort mortality
models. ASTIN Bulletin, 50/1, 223–263.
396. Yin, C., & Lin, X. S. (2016). Efficient estimation of Erlang mixtures using iSCAD penalty
with insurance application. ASTIN Bulletin, 46/3, 779–799.
397. Yu, B., & Barter, R. (2020). The data science process: One culture. International Statistical
Review, 88/S1, S83–S86.
398. Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68/1, 49–67.
399. Yukich, J., Stinchcombe, M., & White, H. (1995). Sup-norm approximation bounds for
networks through probabilistic methods. IEEE Transactions on Information Theory, 41/4,
1021–1027.
400. Zaslavsky, T. (1975). Facing up to arrangements: Face-count formulas for partitions of space
by hyperplanes (Vol. 154). Providence: Memoirs of the American Mathematical Society.
401. Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression models for count data in R. Journal
of Statistical Software, 27/8, 1–25.
402. Zhang, C., Ren, M., & Urtasun, R. (2020). Graph hypernetworks for neural architecture
search. arXiv:1810.05749v3.
403. Zhang, W., Itoh, K., Tanida, J., & Ichioka, Y. (1990). Parallel distributed processing model
with local space-invariant interconnections and its optical architecture. Applied Optics, 29/32,
4790–4797.
404. Zhang, W., Tanida, J., Itoh, K., & Ichioka, Y. (1988). Shift invariant pattern recognition neural
network and its optical architecture. Proceedings of the Annual Conference of the Japan
Society of Applied Physics, 6p-M-14, 734.
405. Zhao, Q., & Hastie, T. (2021). Causal interpretations of black-box models. Journal of Business
& Economic Statistics, 39/1, 272–281.
406. Zhou, Z.-H., Wu, J., & Tang, W. (2002). Ensembling neural networks: Many could be better
than all. Artificial Intelligence, 137/1–2, 239–263.
407. Zhu, R., & Wüthrich, M. V. (2021). Clustering driving styles via image processing. Annals of
Actuarial Science, 15/2, 276–290.
408. Zou, H. (2006). The adaptive LASSO and its oracle properties. Journal of the American
Statistical Association, 101/476, 1418–1429.
409. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society, Series B, 67/2, 301–320.
Index
A
Absolutely continuous, 3, 4, 15, 20–26, 80, 183, 240, 342
Accumulated local effects (ALE) profile, 362–365, 369
Action rule, 50, 89
Action space, 50, 52, 82, 88, 89, 488
Activation function, 269–274, 278, 283, 295, 304, 341, 356, 383, 385, 390, 391, 393, 404, 409, 537–542, 548
Actuarial valuation, 525
Adam, 287, 288, 297
Adaptive moments, 287
Additive approach, 478–482, 488
Additive effects, 116, 126
Additive form, 30, 35, 38
Additive semi-group, 174
Aggregated statistics, 62
Akaike’s information criterion (AIC), 103–106
Algorithmic culture, 2
Amari–Chentsov tensor, 42
Analysis of variance (ANOVA), 141, 148, 149, 245
Approximate sieve estimator, 544, 548, 549
Approximation capacity, 274, 288
ARIMA, 394, 402
a.s., 6
Asymptotically efficient, 72
Asymptotic normality, 69–74, 104, 105, 109, 123–124, 146, 180, 181, 198, 225, 326, 328, 474, 476, 540–546, 548
Asymptotic variance, 72, 474
Attention, 447–451, 497, 498
Attention layer, 447–451
Attention weights, 448, 521
Auto-calibrated forecast, 308–311
Auto-calibration, 308–315, 328, 329, 335, 465, 466
Auto-encoder, 341–356
Automated feature engineering, 267, 293
Auto-resolution, 310
Average estimation loss, 50
Average local effect, 362, 364
Average-pooling, 416
B
Back-propagation, 280–284, 291, 300, 388, 533
Back-propagation through time (BPTT), 388
Backward elimination, 141, 147, 148, 150, 170, 335
Bagging, 319, 324, 327
Bag-of-part-of-speech, 425
Bag-of-words, 425–429
Balance-corrected, 310–315
Balance property, 64, 65, 122–124, 129, 154, 157, 162, 165, 167, 171, 179, 201, 210, 297, 300, 304–315, 501, 505
Base premium, 116, 140
Batch, 285, 291–293, 304, 321, 515, 517
Batch size, 285, 291, 295, 299, 320, 331
Bayes by Backprop, 533
Bayesian decision rule, 51, 54, 156
Bayesian GLM, 210
Bayesian information criterion (BIC), 105
Bayesian methods, 207–266, 535
Bayesian networks, 530–535
Bayesian parameter estimation, 207–209
Bayes’ rule, 208, 233, 235, 430
Bernoulli distribution, 18–19, 87, 163, 304
Best asymptotically normal, 475
b-homogeneous, 456
Bias, 77, 84–86, 105, 109, 113, 126, 162, 171, 179, 189, 306, 324, 509
Bias regularization, 305–315, 318, 321, 327–330, 337, 338, 501, 503, 506
Binary cross-entropy, 87
Binary feature, 126, 130, 168, 339, 502, 503, 551
Binomial distribution, 13, 18–19, 31, 38, 174
Block diagonal matrix, 197, 198
BN encoder, 353, 355
Bonus-malus system (BMS), 140, 141, 199
Boosting, 133, 200, 201, 315–319, 330, 331, 335, 515
Boosting regression models, 315–319, 330
Bootstrap, 106–110, 324, 492–495
Bootstrap aggregating, 319
Bootstrap algorithm, 107, 109
Bootstrap consistency, 108
Bootstrap distribution, 106, 107, 109
Bottleneck neural (BN) network, 342, 351–356
Bregman divergence, 43–48, 79, 92, 94, 309, 456, 457, 464, 486, 487
Budget constraint, 212, 213, 217, 220
C
Canonical link, 17–19, 21, 23, 27, 30, 31, 52, 63, 78, 114–118, 120–123, 127, 129, 133, 196–198, 264, 279, 306–308, 312, 457
Canonical parameter, 13–17, 24, 30, 37, 41, 43, 45, 46, 62, 65, 78–80, 82, 113, 114, 116, 127, 182, 184, 186, 195, 196, 231
Case deletion, 192–195
Categorical cross-entropy, 87, 198, 518
Categorical distribution, 27–28, 37, 87, 127, 195, 198, 230, 237
Categorical feature, 127–130, 134, 139, 211, 216, 297–306, 322, 325, 370, 375, 459, 468, 470, 499, 517–520, 551, 561, 563
Categorical responses, 195–198, 419, 431
Causal inference, 360
Cell state process, 391, 393
Censored data, 248–266
Center-context pairs, 432, 434, 436, 438, 441
Center words, 430–432, 436, 437
Central limit theorem (CLT), 5, 70
Channel, 408–413, 415, 416, 418, 419, 423
Claim description, 425–428, 432, 440, 442, 446, 570, 572
Claim sizes, 4, 7, 35, 36, 111, 112, 126, 167–180, 188–190, 248–254, 257–259, 453–455, 458–466, 515
Classification and regression trees (CART), 200, 201, 270, 320, 358
Clustering, 239, 342, 436, 438
CN layer, 407, 408, 410, 411, 413, 415, 417, 422
CN network, 273, 394, 407, 411, 412, 415–419, 421–424
CN network encoder, 422
CN operation, 409, 410, 413
Coefficient of variation, 176, 185, 323–325, 332, 492–494
Collinear, 129, 153, 560
Collinearity, 146, 150, 152, 162, 214, 359, 551, 558
Color image, 408
Combined GLM and FN network, 318, 319, 321
Complete information, 232, 233
Complete log-likelihood, 232–233, 235–237, 250, 253–255, 258, 260, 518, 519
Complete probability space, 542, 548
Composite models, 202, 263–265, 454, 483, 484, 491
Composite triplet, 484–488
Compound Poisson (CP), 4, 34, 189–190
Conditional calibration, 310
Conjugate priors, 156, 209
Conjugation, 427
Consistency, 67–69, 73, 89–90, 109, 124, 225, 473, 540–546, 548
Consistent, 67–69, 71, 74, 84, 88–92, 109, 143–147, 204, 226, 473, 475, 478, 482, 484, 546
Consistent loss function, 88, 90–93, 204, 464, 473, 477, 487, 488
Consistent scoring function, 76, 88, 92, 456, 457, 485
Constraint optimization, 212, 222, 223, 280
Context size, 431
Context words, 431, 432
Contingency table, 151, 162, 166, 263, 561
Continuous bag-of-words (CBOW), 430, 432
Continuous features, 130–131, 150
Contrast coding, 128
Convex, 2, 14, 16, 17, 29, 30, 37, 38, 43, 44, 92, 94, 213, 214, 219, 221–222, 286, 308, 309, 311, 344, 458, 486, 487
Convex order, 95, 308, 309
Convolution, 31, 65, 174, 408–410, 413, 415
Convolution formula, 31, 65
Convolution operator, 410
Convolution property, 174
Co-occurrence matrix, 436
Count random variable, 4, 254
Covariate, 88, 93, 96, 113
Coverage ratio, 108, 481, 482, 490, 491, 498
Cramér–Rao information bound, 56–66, 72, 75, 78, 93, 123
Cross-entropy, 87, 198, 421, 518
Cross-sectional data, 198
Cross-validation, 95–106, 139, 140, 190, 192–195, 211, 215–217
Cube-root transformation, 170
Cumulant function, 14–17, 24, 25, 29, 30, 34, 37–38, 41, 44–46
Curse of dimensionality, 209, 530
Cyclic coordinate descent, 220
D
Data collection, 1, 151
Data compression, 271
Data modeling culture, 2
Data pre-processing, 1, 2
Decision rule, 50–54, 56–58, 60–62, 67, 75–79, 83–85, 96, 97
Decision-theoretic approach, 88–95
Decision theory, 49–51
Declension, 428
Decoder, 344, 346, 353, 356, 404
Deductible, 248, 249
Deep composite model regression, 266, 483–491
Deep dispersion modeling, 466–472
Deep learning, 267–379, 453–535
Deep network, 204, 273–275, 383, 477, 494
Deep quantile regression, 10, 476–483, 488, 489
Deep RN network, 383, 385, 386
Deep word representation learning, 445–448
Deformation stability, 411
Dense, 538, 539, 542
Density, 4, 7, 8
Depth, 268, 272, 273, 275–277
Derivative operator, 546
Design matrix, 114, 118–122, 128, 145, 177, 306
Deviance estimate, 143
Deviance generalization loss, 79–88
Deviance GL, 82, 84–86, 310, 311
Deviance loss function, 43, 79–82, 84, 93, 279, 280, 284, 462, 468, 475, 478
Deviance residuals, 142, 158, 170, 176, 182, 459, 460, 466, 471
Diagnostic tools, 141, 190–195
Digamma function, 22, 46, 47, 173, 185
Dilation, 414
Dimension reduction, 342, 343, 415, 417, 520
Directional derivative, 367, 368, 372
Discount factor, 526, 527
Discrete, 3, 4, 18, 27
Discrete random variable, 3
Discrete window, 409
Discrimination-free insurance pricing, 361
Dispersion, 13–46, 155, 157, 158, 181–183, 186–189, 465–472, 475, 476
Dispersion parameter, 30
Dispersion submodel, 182–183, 453, 474
Dissimilarity function, 343, 352, 353
Distortion risk measure, 368, 370
Distribution function, 3, 5–9, 13–16, 29
Divergence, 40–47, 55, 92, 94, 308
Divisible, 174
Do-operator, 360
Dot-product attention, 448
Double FN network model, 466, 470, 472
Double generalized linear model (DGLM), 182–190, 247, 453, 466, 515
Drift extrapolation, 394–397, 401, 404
Drop-out layer, 298, 302–304, 377, 419
Duality transformation, 30, 38, 158
Dual parameter space, 17–22, 24, 25, 27, 28, 31–33, 37, 43, 53, 64
Dummy coding, 127–130, 195, 293, 298
E
Early stopping, 290–293, 299, 303
Educated guess, 50
Effective domain, 14–24, 27, 29, 30, 32, 34, 35
Efficient likelihood estimator (ELE), 72
Eigenvalues, 120, 344, 345
Eigenvectors, 344, 345
Elastic net regularization, 214
Elastic net regularizer, 507
Elicitable, 92, 203, 204, 477, 484, 489
EM algorithm for lower-truncated data, 248–249
EM algorithm for mixture distributions, 230–232
EM algorithm for right-censored data, 251–254
Embedding dimension, 299, 302, 429, 431, 438, 440–442, 444, 446
Embedding layer, 298–302, 429
Embedding map, 128, 294, 298, 399, 429–433, 437, 440, 441, 444, 448
Embedding theorem, 547
Embedding weights, 299, 302, 399, 403, 431, 434, 435, 438, 440, 446
EM forward network, 515, 517
EM network boosting, 515
Empirical bootstrap distribution, 107, 108, 110
Empirical density, 7, 8, 56
Empirical distribution, 7–9, 55, 68, 106
Empirical Wald test, 495, 520
Encoder, 344, 353, 355, 422
Ensembling over selected networks, 335–337
Entropy, 41, 87, 198, 310, 311, 421, 518, 531, 535, 541
Epoch, 285, 292, 297, 301, 319, 320
Estimate, 75–76
Estimation loss, 50
Estimation of conditional expectation, 521–529
Estimation risk function, 83, 84, 86, 327
Estimation theory, 49–74
Estimation variance, 77, 79, 209
Euclidean ball, 213
Euclidean norm, 213, 444
Evidence lower bound (ELBO), 531, 533, 535
Excess-of-loss (XL), 249
Expectation-Maximization (EM) algorithm, 231, 233–236, 238–242, 244, 249, 251, 253, 254, 257, 258, 261, 263–265
Expectation step (E-step), 233–235, 238, 247, 251, 252, 254, 257, 258, 513
Expected deviance generalization loss (GL), 83, 84, 86, 88, 93, 95, 310, 311
Expected generalization loss (GL), 75–79, 83, 88, 310, 325
Expected shortfall, 483–487
Expected value, 4, 62, 77
Experience rating, 141, 199
Explain, 1, 111
Explanatory variable, 8, 9, 111, 113, 130, 338, 568
Exponential activation, 269, 270, 516
Exponential dispersion family (EDF), 13–47
Exponential distribution, 23, 28, 126
Exponential family (EF), 13–47
Exponentially decaying survival function, 38
Exposure, 30, 112, 132
Extreme stable distribution, 36
F
Feature, 113–116
Feature component selection, 247
Feature engineering, 115–116, 127–132, 134, 135, 267, 293, 483
Feature pre-processing, 148, 293–295, 425–429
Feature space, 113–116
Feed-forward network, 268
Feed-forward neural network, 267, 269–298, 340–342
Filter, 199, 209, 407–415, 417, 419, 421–423
Filter weights, 409–411, 414, 417
Finite first absolute moments of Fourier magnitudes distributions, 545
First moment, 4
Fisher-consistent, 55, 56, 68, 71
Fisher metric, 58
Fisher’s contribution, 59
Fisher’s information, 42, 58–62, 70–71, 118–122
Fisher’s scoring, 192, 216
Fisher’s scoring method, 59, 119, 120, 138, 180, 187, 244, 264, 459
Flatten layer, 416, 423, 442
FN layer, 271–273
FN network, 269
Folds, 100, 246, 442
Force of mortality, 347, 405, 525, 526, 529
Forecast, 75–110
Forecast dominance, 93–95, 312–314, 459–461, 464–465
Forecast evaluation, 40, 45, 75–110, 476
Forget gate, 390–392
Forward selection, 141, 147, 148
Friedman’s H-statistics, 365
Frobenius norm, 345, 347
Full rank, 17, 27, 118–121, 127–129, 177, 181, 232, 294, 306
Functional limit theorem, 546–549
Fundamental domain, 341, 342
G
Gamma distribution, 22–23, 32, 34, 36, 121, 156, 168, 170
Gamma GLM, 167–176
Gamma model with log-link, 121
Gated recurrent unit (GRU) network, 381, 390, 392–394
Gaussian distribution, 13, 21–27, 36, 43, 70, 84, 126, 212, 240, 399, 498, 501, 526
Gaussian kernel, 8
Gaussian mixture, 240
Gaussian model, 21, 25, 26, 28
Gaussian process, 549, 550
Generalization, 43, 83, 87, 98, 99, 135, 141, 148, 169, 288, 305, 351, 449
Generalization loss (GL), 10, 75–95, 152, 310
Generalization power, 145, 289
Generalized additive decomposition, 496
Generalized additive models (GAMs), 130, 194, 200, 314, 315, 337, 444
Generalized beta of the second kind (GB2), 201, 202, 453
Generalized cross-validation (GCV), 100, 193, 195, 217, 314
Generalized EM (GEM) algorithm, 513
Generalized inverse, 202
Generalized inverse Gaussian distribution, 25–26
Generalized linear models (GLMs), 111
Generalized projection operator, 222, 223, 228, 280, 508
Gibbs sampling, 209
Glivenko–Cantelli theorem, 7, 55, 68, 106, 107
Global balance, 312
Global max-pooling layer, 419
Global properties, 407
Global surrogate model, 358, 359
Global vectors (GloVe), 425, 430, 433, 436, 438, 442–444, 446, 449, 451
Glorot uniform initializer, 284
GPS location data, 418
Gradient descent method, 220, 278–293
Gradient descent update, 221, 279, 285–287
Grouped penalties, 211, 227, 302
Group LASSO generalized projection operator, 228
Group LASSO regularization, 226–229, 508–512
GRU cell, 393
Guaranteed minimum income benefit (GMIB), 525–529
H
Hadamard product, 287, 391
Hamiltonian Monte Carlo (HMC) algorithm, 209, 530
Hat matrix, 189–193, 216
Heavy-tailed, 6, 8, 27, 38
Helmert’s contrast coding, 128
Hessian, 16, 42, 61, 105, 118, 121, 122, 215, 285
Heterogeneous, 111, 112, 132, 182, 448, 469
Heterogeneous dispersion, 182, 469
Heteroskedastic, 177, 178, 481
Hilbert space, 524, 547
Homogeneity, 111, 200, 456, 467
Homogeneous model, 103, 112, 114
Homoskedastic, 178, 193, 217
Homoskedastic Gaussian case, 193, 217
Honest forecast, 89, 90
Human Mortality Database (HMD), 348, 395
Hurdle model, 163, 261, 264
Hyperbolic tangent activation, 269, 270
Hyper-parameter, 28, 211, 242, 264, 298, 433, 464, 492, 515
Hypothesis testing, 50, 145–147, 181, 549–551
I
Identifiability, 17, 49, 55, 239, 340–342, 348, 475, 497
Identification function, 481, 489
Identity link, 116, 279, 524
Image classification, 418
Image recognition, 273, 407, 412–413
Imbalanced, 153, 568
Importance measure, 505, 511
Incomplete gamma function, 252, 490
Incomplete information, 232, 233
Incomplete log-likelihood, 236–239, 250, 253, 254, 516, 518
Indirect discrimination, 361
Individual attribution, 374, 375
Individual conditional expectation (ICE), 359–360
Infinitely divisible, 174
Inflectional forms, 428
Information bound, 57, 62–64
Information geometry, 40–47, 81, 145
Initialization, 239, 241, 268, 282, 284, 293, 318, 354, 363, 479, 497, 515
Input gate, 391, 392
Input tensor, 408, 410, 411, 413, 414, 416, 423
In-sample loss, 98, 102, 103
In-sample over-fitting, 102, 279, 288, 525
Interactions, 131, 151, 200, 274, 297, 319, 360, 365, 373, 379, 495, 503, 505
Interaction strength, 365–366
Intercept model, 112, 139, 152, 154, 171, 188, 333
Interior, 15, 29, 44, 62, 112, 235, 476, 547
Inverse Gaussian distribution, 23, 26, 33, 174, 482, 490
Inverse Gaussian GLM, 122, 173–176, 453, 460
Inverse link function, 269
IRLS algorithm, 119, 120, 181, 186, 198
Irreducible risk, 77, 84, 86, 113, 310, 477, 492
Iterative re-weighted least squares algorithm, 119
J
Jacobian, 46, 47, 61, 181, 474, 535
Jacobian matrix, 70, 535
Joint elicitability, 483–487
K
Kalman filter, 199
Karush–Kuhn–Tucker (KKT), 212
Kernel size, 408
Kernel smoother, 7
Key, 448–450
K-fold cross-validation, 99–101, 139, 141
k-th moment, 54
Kullback–Leibler (KL) divergence, 40–45, 55, 56, 69, 81, 82, 87, 104, 105, 145, 237, 238, 456, 530, 531
L
Lagrangian duality, 212
Lagrangian form, 222, 223
Laplace transform, 13–14
LASSO regression, 223, 229
LASSO regularization, 212–214, 217–229, 366, 464, 495, 507–512, 521
Layer, 267–269, 271–277
LC model, 347–351, 354, 396, 397, 405, 406
Learned representation, 26, 281, 304, 306, 311, 312, 353, 373, 377, 385, 387, 391, 403, 416, 417, 461, 463, 469, 513, 514
Learning data, 356
Learning rate, 220, 223, 229, 279
Learning sample, 135, 137
Least absolute shrinkage and selection operator, 212
Leave-one-out cross-validation, 99–100, 103, 191–193
Lee–Carter model, 347–351, 394, 395, 397–402
Left-singular matrix, 345, 348
Lemmatization, 428, 433
Light-tailed, 6, 23, 28, 126, 201
Likelihood ratio test (LRT), 141, 146, 147, 150, 357, 497, 503, 549
Linear exponential family, 15, 18–20, 22, 28–40, 73
Linear link, 21
Linear predictor, 114, 116, 127, 131–134, 138, 151, 196
Link function, 114–116, 123, 126, 129–131, 133, 189, 204, 268, 269, 271, 273
Lipschitz, 223, 538, 542
Local balance correction, 314, 315, 337
LocalGLMnet, 10, 495–512, 518, 520, 529
Locally interpretable model-agnostic explanation (LIME), 366
Local model-agnostic tools, 357–375
Local polynomial regression, 312, 314
Local structure, 407
Locfit, 312–314
Log-gamma distribution, 23, 38, 39, 177
Logistic activation, 269
Logistic categorical generalized linear model, 195–196
Logistic function, 270
Logistic GLM, 196, 243, 263, 517
Log-likelihood, 51, 61, 68, 73, 79–81, 89, 103, 104, 112–125, 181–183, 185, 196, 210, 223, 230–232
Log-link, 19, 23, 32, 116, 121, 122, 127, 128, 131–134
Log-log plot, 8, 9, 40, 167, 242, 243, 564, 569–574
Log-mortality rates, 347–350, 354, 355, 394, 403
Log-mortality surface, 348
Log-normal model for claim sizes, 176–180
Log-transformation, 23
Lomax distribution, 39, 40, 201, 241, 246, 247
Longitudinal data, 198–199
Long-short term memory network (LSTM)
Lookback period, 399, 402, 423
Loss function, 43, 44, 50, 52–54, 75, 77–82, 84, 88–93
Lower-truncated, 39, 163, 164, 248, 249, 255–263
Lower-truncation, 39, 248–249, 254–264
L1-regularization, 212, 213
L2-regularization, 212, 274
LRT statistics, 146
LSTM cell, 391, 392, 398, 399, 406, 423
LSTM encoder, 422
LSTM extrapolation, 397–402
LSTM network, 390, 394, 399–403, 405, 442, 443
M
MACQ, 369, 371, 372
Manhattan norm, 213
Manhattan square, 213
Manhattan unit ball, 509
Manual feature engineering, 132, 267
Marginal attribution, 366–375, 379
Marginal attribution by conditioning on quantiles, 366–375
Markov chain Monte Carlo (MCMC) methods, 10, 209, 210, 530
Martingale sequence forecast, 309
Maximal a posterior (MAP) estimator, 210–212, 225
Maximal cover, 248
Maximization step, 233, 513
Maximum likelihood, 51, 116–122
Maximum likelihood estimation/estimator (MLE), 51, 124–125, 169, 172–174, 181, 186–187, 196–198, 288, 293, 472–476
Max-pooling, 414, 415, 419, 423, 442, 444
Mean, 4
Mean field Gaussian variational family, 534
Mean functional, 84, 90–93, 195, 203, 278, 456
Mean parameter space, 17, 116, 458, 473
Mean squared error of prediction (MSEP), 10, 75–79, 83, 95, 142, 209
Memory rate, 390
Mercer’s kernel, 132, 268, 269
M-estimation, 93, 476
M-estimator, 69, 73, 93, 457
Meta model, 329, 330
Method of moments, 54, 55
Method of moments estimator, 54, 55
Method of sieve estimators, 11, 56, 543–546, 549
Metropolis–Hastings (MH) algorithm, 209
Mini-batches, 285, 291, 293, 304, 321, 469, 515, 518
Minimal representation, 16, 17, 30, 31, 42, 63, 67
Minimax decision rule, 50–51
MinMaxScaler, 294, 295, 371
Mixed Poisson distribution, 20, 155, 164
Mixture density, 230, 233, 242, 243, 454
Mixture density networks (MDNs), 233, 453, 513, 515–520
Mixture distribution, 163, 164, 230–235, 238–247, 259, 513
Mixture probability, 230, 231, 235, 236, 241, 243, 245–247, 513
Model-agnostic tools, 357–376, 495
Model class, 2, 105, 454
Modeling cycle, 1–3
Model misspecification, 2, 305, 472
Model uncertainty, 56, 69, 82, 93, 453–476, 492–495, 530
Model validation, 2, 141–180, 357
Modified Bessel function, 26
Moment generating function, 5, 6, 9, 15, 16, 30, 31, 35, 38, 125, 168, 174, 201
Momentum-based gradient descent method, 280, 285–287
Momentum coefficient, 285, 287
Mortality, 3, 347–351, 354–356, 394–406, 422–424, 525, 526, 529
Mortality surface, 347, 349, 350, 355, 356, 394, 422–424
Motor third party liability (MTPL), 133
MSEP optimal predictor, 78
M-step, 233–235, 238, 239, 244, 251, 252, 256, 257, 513, 515, 517
Multi-class cross-entropy, 87, 421
Multi-dimensional array, 408
Multi-dimensional Cramér–Rao bound, 60, 62
Multi-index, 546
Multi-output network, 462, 463, 466, 479
Multiple outputs, 461, 462, 468, 488
Multiple quantiles, 478–479
Multiplicative approach, 479–482
Multiplicative effects, 116, 126
Multiplicative model, 128, 131
N
Nadam, 287, 288
Nagging, 320, 324–326
Nagging predictor, 324–329
Natural language processing (NLP), 10, 298, 425–451
NB1, 158
NB2, 157, 158
Negative-binomial distribution, 19, 156, 159
Negative-binomial GLM, 159, 160, 166
Negative-binomial model, 20, 156, 158–160, 163
Negative expected Hessian, 105, 118, 121, 215
Negative sampling, 431–436, 438, 440, 446, 450
Nested GLM, 145
Nested simulation, 522
Nesterov-accelerated version, 286
Network aggregating, 325
Network ensembling, 10, 492
Network output, 387–388
Network parameter, 272, 274
Network weight, 271, 284
Neurons, 269
New representation, 46, 132, 268
Newton–Raphson algorithm, 59, 120, 231
NLP pipeline, 425
Noisy part, 113, 288, 290
Nominal categorical feature, 127
Nominal outcome, 4, 127, 130, 195, 302, 364, 555
Non-linear activation, 269
Non-linear generalizations of PCA, 351
Non-monotone feature, 150–153, 178
Non-parametric bootstrap, 106–109, 492
Non-trainable, 318, 441, 450
Normalization layer, 273, 298, 303, 304, 520
Nuisance parameter, 19–22, 28, 119, 143, 157–161, 169, 175, 177–180, 182, 186, 202
Nuisance parameter estimation, 158–162
Null model, 241
O
Objective function, 2, 81, 97, 124, 220
Observation scale, 126, 167, 178
Observed information matrix, 120–122
Offset, 131–133, 139, 153, 165, 262, 264, 303, 313, 318, 515
One-hot encoding, 128, 232, 244, 292–300, 308, 428, 499, 500
One-period ahead forecast, 399
Oracle property, 225–226
Ordinal categorical feature, 127, 168
Orthogonal projection, 125, 145, 523
Orthonormal basis, 344, 345, 353
Out-of-sample loss, 95–98
Output gate, 391, 392
Output mapping, 461
Over-dispersed Poisson model, 180
Over-dispersion, 20, 32, 143, 146, 151, 155–162, 173, 337–340
Over-fitting, 102, 113, 133, 194, 210, 273, 279, 288–290, 293, 303, 307
Over-parametrized, 145, 403
Over-sampling, 153–155
P
Padding, 413, 415, 427, 428, 440
Panel data, 198, 381
Parameter estimation, 15, 17, 37, 51–56
Parameter estimation under lower-truncation, 254–264
Parameter estimation under right-censoring, 248–249
Parameter set, 49, 174
Parametric bootstrap, 109–110
Pareto distribution, 23, 27, 39, 40, 246
Parsimonious, 145, 152, 464, 470
Partial dependence plot (PDP), 360–366
Partial dependence profile, 360, 361
Partial derivative, 92, 123, 225, 362, 367, 404, 549, 551
Partial residuals, 153
Particle filters, 209
Part-of-speech (POS), 428
Pathwise cyclic coordinate descent, 220
Pearson’s chi-square statistics, 82, 142
Pearson’s estimate, 172, 181, 333
Pearson’s residuals, 142
Phase type distribution, 201
Pinball loss, 477, 478, 481, 485
Pinball loss function, 202–204
Plain vanilla gradient descent, 278–280, 283, 285, 290, 532
Poisson distribution, 19, 32, 133, 155, 163, 259, 292
Poisson GLM, 133–134
Poisson unit deviances, 45, 144
Polya distribution, 19
Pooling layer, 415–416
Positive stable distribution, 36, 454
Posterior density, 208–210, 530, 531
Posterior distribution, 53, 54, 209, 237, 513
Posterior information, 140, 141
Posterior log-likelihood, 210
Power variance function, 35, 36, 86, 87, 181, 454, 467
Power variance parameter, 34, 36, 86, 94, 121, 122, 454–456, 459–462, 464, 466, 468–471, 487, 492
Predefined gradient descent methods, 287
Predict, 76, 78, 79
Predicting vs. explaining, 111
Prediction, 76–79, 89, 310, 476, 494
Predictive modeling, 43, 75–110
Predictor, 2, 75, 114, 309, 310, 320, 325
Prefixes, 428
Pre-processed features, 123, 169, 385, 387
Pre-trained word embeddings, 425, 430, 436, 438
Principal components analysis (PCA), 342, 344, 346–348
Prior density, 530
Prior distribution, 51, 53, 207, 210, 212, 342, 534
Prior information, 140, 141
Probability distortion, 367
Probability space, 3, 55, 309, 382, 521, 542, 548
Probability weight, 3, 4, 18–20, 66, 163, 259
Process variance, 77, 79, 84, 209
Projected gradient descent, 221
Proper scoring rule, 76, 88, 90, 91
Protected characteristics, 361
Proximal gradient descent algorithm, 223, 227
Proximity, 129, 298, 302, 429
Pseudo maximum likelihood estimator (PMLE), 180, 473–475
Pseudo-norm, 542, 543
p-value, 146, 147, 151, 245
Q
QQ plot, 172, 242, 245, 333, 518
Quantile, 92, 203, 368, 476
Quantile level, 374, 375
Quantile regression, 10, 202–204, 483, 488
Quantile risk measure, 368
Quasi-generalized pseudo maximum likelihood estimator (QPMLE), 180, 475
Quasi-likelihood, 180–181
Quasi-Newton method, 120, 513
Quasi-Poisson model, 180
Quasi-Tweedie’s model, 181
Query, 448, 449
R
Random component, 498, 501, 505, 521
Random effects, 198–199
Random forest, 200, 319
Random variable, 3–7
Random vector, 3, 109, 180, 255, 371, 523, 526
Random walk, 350, 394–397, 401, 402, 404
Rank, 17, 118–121, 127, 177, 232, 344, 348, 354
Raw mortality data, 347, 394, 406, 422
Reconstruction error, 342, 344, 346, 350, 352, 354
Rectified linear unit activation, 269, 270, 274, 275, 479
Recurrent neural network, 381–406
Red-green-blue (RGB), 289
Reference point, 371–374
Regression attention, 496–498, 502, 503, 505, 507, 511, 520, 521
Regression function, 113, 267, 269, 485, 488, 512
Regression modeling, 88, 112–113
Regression parameter, 88, 114, 116, 119, 122, 131–133, 180, 189, 192, 208, 210, 271, 486, 496, 514
Regression trees, 133, 200, 307, 330, 359
Regularization, 207–268, 306–308, 314, 464, 507–509
Regularization through early stopping, 290–293
Regularly varying, 6, 8, 16, 23, 38, 39, 126, 167, 202, 241, 246
ReLU activation, 269, 270, 274, 275
Reparametrization trick, 532–534
Representation learning, 267–269, 273, 274, 293, 402, 411, 415, 445–448, 453, 461, 462
Reproductive dispersion model, 44
Reproductive form, 30, 38, 133, 158, 169, 174, 187, 459, 467
Resampling distribution, 107
Reset gate, 393
Residual bootstrap, 108
Residual maximum likelihood estimation, 186–187
Residuals, 108, 117, 141–145, 153, 158, 176, 178, 182, 191, 197, 315
Retention level, 249
Ridge regression, 214–217, 304
Ridge regularization, 212–214, 217, 224, 303, 507
Right-censored gamma claim sizes, 250–252
Right-censoring, 250–254
Right-singular matrix, 345, 348
Risk function, 50, 52, 61, 77, 83, 84, 86
rmsprop, 287, 297
RN layer, 383–391, 411
RN network, 273, 381, 383, 385–390, 394–406, 412, 442, 445, 448
Robustified representation learning, 461–464, 468
Robust statistics, 56
Root mean square propagation, 287
S
Saddlepoint approximation, 36, 47, 145, 183–187, 456, 459, 466, 467, 517
Saturated model, 80, 81, 113, 115, 157, 158, 166, 354
Scalar product, 114, 130, 200, 218, 269, 304, 312, 386, 433, 434, 444, 524
Scaled deviance GL, 88
Scale parameter, 22–24, 33, 34, 38, 168, 172, 201, 241, 252, 258, 517
Schwarz’ information criterion (SIC), 106
Score, 58, 61, 71, 90, 117, 121, 180, 198, 218, 234, 457
Score equations, 71, 73, 117–120, 180, 196–198, 203, 215, 218, 236, 238, 304, 459
Scoring function, 69, 76–79, 81, 88–90, 456, 484, 485, 517, 518
2nd order marginal attribution, 377
Self-attention mechanism, 449, 450
Sequence of sieves, 542
Sequential Monte Carlo sampling (SMC), 209
Set-valued functional, 89, 90, 203
Shallow FN network, 273–275, 277, 341, 399, 537–541, 543, 545, 546, 550, 551
Shallow network, 273–275, 353, 383
Shape parameter, 22–24, 33, 34, 38, 39, 121, 168, 247
Shapley additive explanation (SHAP), 366
Short rate, 526, 527, 529
Shrinkage, 212, 307
Sieve estimator, 541–546, 548, 549
Sigma-finite measure, 4, 13–15, 29, 30, 34
Sigmoid activation function, 270, 283, 390, 391, 393, 538, 540, 542, 548
Sigmoid function, 18, 432, 479, 538, 542, 548
Simple bias regularization, 305–306
Single-parameter exponential family, 13, 15, 28
Single-parameter linear exponential family, 15, 18–20, 22, 25, 28, 31, 36–38, 73, 159, 263
Singular value decomposition (SVD), 345, 347, 350, 376, 394, 397, 404
Singular values, 345, 350, 356, 377
Skip connection, 272, 316, 317, 387, 450
Skip-gram, 430–432, 434
Smoothly clipped absolute deviation (SCAD) regularization, 225, 247
Sobolev embedding theorem, 547
Sobolev norm, 547, 551
Sobolev space, 547
Soft-thresholding operator, 218, 219, 223, 226, 508
Spatial component, 356, 408–411, 415
Special purpose layers, 273, 298–305
Special purpose tools, 413–416
Spectral norm, 345
Speed-acceleration-change in angle pattern, 418
Splicing models, 264
Spurious configuration, 239
Square loss function, 54, 75, 77–79, 82, 92, 125, 170, 304, 312, 313, 330, 528
Squashing function, 538–540
Standardization of data matrix, 343
Standardized data, 218
State-space model, 199
Statistical error, 77
Statistical modeling cycle, 1–3
Steep, 37, 38, 87
Steepness, 17, 37–38, 43, 80, 83
Stemming, 428
Step function activation, 269, 274–277, 538, 539
Stochastic boundedness, 544, 545
Stochastic gradient descent (SGD), 283–285, 291–293
Stochastic gradient variational Bayes (SGVB), 533
Stochastic mortality, 347
Stone–Weierstrass theorem, 538, 539
Stone–Weierstrass type arguments, 274
Stopwords, 427, 428, 433, 448
Strassen’s theorem, 309
Stratified K-fold cross-validation, 99, 101–103
Strict consistency, 69, 90, 456
Strictly consistent, 76, 84, 90–93, 204, 456, 473, 477, 478, 484, 485, 487, 488
Strictly convex, 16, 17, 30, 44, 52, 213, 456, 485, 486
Strictly proper scoring rule, 91
Stride, 414, 415
Strongly consistent, 67, 473, 475
Subexponential, 6
Sub-gradient, 44, 92, 218, 269, 485, 486
Suffixes, 428
Survival function, 6, 8, 23, 38–40, 202
Synthetic minority oversampling technique (SMOTE), 155
Systematic effects, 113, 114, 126, 127, 182, 211, 289, 290, 315, 319
T
Tail index, 6, 8, 38–40, 202
Taylor expansion, 71, 72, 142, 220, 279, 280, 285, 317, 369–371, 455
Telematics data, 418–422
Temporal causal, 382, 390
Tenfold cross-validation, 102, 139, 152, 154, 169, 171, 172, 176, 188, 194, 195
Tensor, 42, 397, 408–416, 419, 422, 423
Test data, 96–98, 100, 102, 137, 288, 290, 291, 295, 297
Test sample, 96, 97, 135, 137, 141, 169
Test statistics, 549, 563
Text recognition, 10, 425, 426, 448, 477, 572
Threshold, 39, 40, 241, 242, 246, 446
Tikhonov regularization, 212
Time-distributed layer, 388–390, 442
Time-series analysis, 273, 411–413
Time-series data, 381, 407, 408, 411, 412, 423
Tokenization, 426, 427, 434, 436, 438
Training data, 291, 292, 295
Training loss, 292, 301
Transformer model, 450
Translation invariance, 411
Truncated data, 265
t-statistics, 147
Tukey–Anscombe plot, 172, 178, 190, 459, 464–466, 471
Tweedie’s CP model, 34–36, 182, 183, 190, 325, 327, 454
Tweedie’s distribution, 34–36, 87, 182, 457, 458, 487
Tweedie’s family, 187, 454–458, 481, 487
Tweedie’s forecast dominance, 95, 327, 453, 459
Tweedie’s model with log-link, 121
Two modeling cultures, 329
U
Unbiased, 57, 60–64, 66, 70, 77, 78, 85, 86, 93, 122, 123, 126, 309, 458
Unbiased estimator, 56–66, 85, 123, 458
Under-dispersion, 33, 264
Under-sampling, 153–155
Unfolded representation, 384, 386
Uniformly dense on compacta, 538, 539
Uniformly minimum variance unbiased (UMVU), 56, 57, 61, 63–67, 79, 93, 123
Unit deviance, 42–47, 80–83, 86–88, 90–95, 124–125, 141, 142, 144, 145, 183, 184, 311, 319, 454–456
Unit simplex, 27, 230, 232, 235
Universality theorem, 10, 11, 274–278, 288, 537–540, 550
Unsupervised learning, 342, 436, 445
Update gate, 393
V
VA account, 525, 527, 529
Validation analysis, 290
Validation data, 291–293
Validation loss, 291, 292, 297, 301, 331, 421
Value-at-risk (VaR), 370, 528, 529
Vapnik–Chervonenkis (VC) dimension, 541
Variable annuity (VA), 525
Variable permutation importance (VPI), 357–359, 372, 377
Variable selection, 148, 214, 225, 357, 495, 497–499, 507–509, 549
Variance function, 31–36, 44, 82, 86, 87, 117, 142, 158, 174, 180, 183, 185, 186, 209, 280, 454, 467, 474
Variance reduction technique, 327
Variational distributions, 530, 535
Variational inference, 535
Variational lower bound, 531
VC-class, 541
Vector-valued canonical parameter, 15
Vocabulary, 426, 428, 433, 434, 438, 440
Volume, 29, 129, 134, 145, 168, 184, 234
W
Wald statistics, 146, 147, 229
Wald test, 139, 146, 148, 198, 224, 245, 357, 495, 503, 521, 549
Weak derivative, 547
Weight, 3, 4, 18–20, 29, 65, 66
Weighted square loss function, 82, 312, 313, 438
Weight matrix, 310, 449
Wild bootstrap, 108
Window size, 407–409, 412, 413, 415, 430, 431, 434
Word embedding, 269, 425, 429–439, 450
Word to vector (word2vec), 425, 430, 433, 438–440, 444, 446, 450
Word2vec algorithm, 430–436
Working residuals, 117, 186, 191, 197, 198, 244
Working weight matrix, 116, 121, 186, 191
X
XGBoost, 201
Z
Zero-inflated Poisson (ZIP), 163, 164, 166, 167, 259, 261, 263, 337
Zero-truncated claim counts, 259
Zero-truncated Poisson (ZTP), 164, 259
Z-estimator, 73, 118, 180, 457
Z-statistics, 147
ZTP log-likelihood, 262, 263
ZTP model, 155