
The Thirty-First AAAI Conference on Innovative Applications of Artificial Intelligence (IAAI-19)

Transforming Underwriting in the Life Insurance Industry

Marc Maier, Hayley Carlotto, Freddie Sanchez, Sherriff Balogun, Sears Merritt
MassMutual Data Science
59 E Pleasant St, Amherst, Massachusetts 01002
{mmaier, hcarlotto, freddiesanchez, sbalogun, smerritt}@massmutual.com

Abstract

Life insurance provides trillions of dollars of financial security for hundreds of millions of individuals and families worldwide. Life insurance companies must accurately assess individual-level mortality risk to simultaneously maintain financial strength and price their products competitively. The traditional underwriting process used to assess this risk is based on manually examining an applicant's health, behavioral, and financial profile. The existence of large historical data sets provides an unprecedented opportunity for artificial intelligence and machine learning to transform underwriting in the life insurance industry. We present an overview of how a rich application data set and survival modeling were combined to develop a life score that has been deployed in an algorithmic underwriting system at MassMutual, an American mutual life insurance company serving millions of clients. Through a novel evaluation framework, we show that the life score outperforms traditional underwriting by 6% on the basis of claims. We describe how engagement with actuaries, medical doctors, underwriters, and reinsurers was paramount to building an algorithmic underwriting system with a predictive model at its core. Finally, we provide details of the deployed system and highlight its value, which includes saving millions of dollars in operational efficiency while driving the decisions behind tens of billions of dollars of benefits.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

Life insurance is a critical protective financial tool for millions of households. In the United States, life insurance companies collectively manage trillions of dollars of benefits. While there are numerous types of insurance contracts, a common component is the estimation of individual-level mortality risk through the process of underwriting. Traditionally, this is performed manually using human judgment and point-based systems that consider risk factors independently. These methods are sufficient in industry but are coarse and subject to inconsistency. As a result, traditional underwriting limits the degree to which an insurer can estimate risk from data and offer efficiently priced products.

The availability of large historical data sets provides an opportunity for machine learning to transform underwriting for life insurance. MassMutual, a large insurance and financial services company, has curated a data set of nearly one million applicants spanning 15 years and containing health, behavioral, and financial attributes. To the best of our knowledge, this is the largest and most comprehensive application data set in the industry. Combining this data with advancements in machine learning and survival modeling enables accurate estimation of mortality risk. We develop a high-resolution model that generates a life score and underpins the MassMutual Mortality Score (M3S) and LifeScore360.¹ Collaborating with actuaries, we design a novel evaluation framework to compare historical underwriting decisions against simulated model decisions over a 15-year period. This empirical study demonstrates that the life score outperforms traditional underwriting, yielding a 6% reduction in claims in the healthiest pool of applicants. Based on these promising results, we engaged additional partners across MassMutual to implement an algorithmic underwriting system with this mortality model as its primary risk-driving engine. Over the past two years, this system has reduced time to issue by >25% and increased customer acceptance by >30% for offers made with light manual review, while saving millions of dollars in operational efficiency and driving the decisions behind tens of billions of dollars of benefits.

¹ M3S and LifeScore360 refer to branded versions of the mortality model described in this work.

The remainder of this paper: (1) provides background on life insurance and the mathematical frameworks used to quantify risk in insurance; (2) describes the data set and methodologies used to estimate mortality risk; (3) presents performance results and deployment details; and (4) discusses the future and implications of using predictive models as a core component of underwriting in life insurance.

2 Background

This section provides background on the traditional underwriting process, survival modeling, and actuarial science.

2.1 Life Insurance and Underwriting

A life insurance policy is an agreement between a policyholder and an insurer whereby the insurer agrees to pay beneficiaries a sum of money at the time of the policyholder's death. In return, the policyholder pays premiums over a predefined period of time (Atkinson and Dallas 2000). Life insurance provides security to the beneficiaries by reducing
the financial impact of an untimely death. Beneficiaries can use the proceeds to pay for future expenses (e.g., daily living expenses, college tuition, retirement) that would have otherwise been paid for by the earnings of the insured.

Most types of life insurance require an estimate of the expected lifetime of an individual at the time of application. This is referred to as mortality risk, and the process of collecting and analyzing data that describes such risk is known as underwriting (Black and Skipper 2000). Actuaries compute the cost of covering mortality risk over the lifetime of the policy and translate it into a set of premium payments (Jordan 1967). The financial risk and general approval of the underwriting process is agreed upon with reinsurance companies, institutions that assume a portion of the risk and diversify their holdings across insurance industries.

In contrast to other types of insurance, such as auto, home, and health, life insurance is typically purchased through a financial advisor who connects an individual to a carrier and helps clients identify the type and amount of insurance that suits their needs. Advisors provide estimates of the premiums, but the exact price is determined after underwriting.

Life insurance underwriting has primarily used point systems developed by doctors and underwriters. These systems calculate risk by mapping medical and behavioral attributes—such as cholesterol, build, driving record, and family and personal medical history—to point values that either debit or credit an overall score (Brackenridge, Croxson, and Mackenzie 2006). This approach resembles risk calculations employed in clinical medicine (e.g., Framingham risk scores (Wilson et al. 1998)). A life underwriter reviews an application to calculate the net number of points, determining one of several risk classes that drive premium and are priced according to aggregate mortality.²

² MassMutual uses the following risk classes: ultra-preferred (UPNT), select-preferred (SPNT), and standard (NT) non-tobacco, and select-preferred (SPT) and standard (T) tobacco, in order of increasing risk. Substandard non-tobacco and tobacco classes exist for specific medical impairments, and a small fraction may be declined for various financial and medical reasons.

Advancements in statistics and machine learning present an opportunity to update the traditional approach to underwriting, which predominantly considers factors independently. Leveraging AI to automate underwriting decisions is not novel in the industry (e.g., using fuzzy logic (Aggour et al. 2006)), but developing a machine learning model that outperforms human decisions and deploying it at scale is unprecedented.

2.2 Survival Modeling

The majority of predictive modeling tasks are based on classification or regression. In the context of survival analysis, however, the outcome of interest is the duration until a binary event may occur for a particular record. The objective of survival analysis is to approximate the survival function, S(t) = Pr(T > t), which describes the probability that an event, occurring at random variable time T, occurs later than some given time t. The hazard rate,

    λ(t) = lim_{dt→0} Pr(t ≤ T < t + dt) / (dt · S(t)),    (1)

is the rate of the event at time t conditioned on having survived until time t. In actuarial science, the hazard is often denoted as µ and describes the mortality rate for a given attained age. The cumulative hazard function, defined as

    Λ(t) = ∫₀ᵗ λ(u) du,    (2)

is related to the survival function as Λ(t) = −log S(t). Nonparametric estimators, namely the Kaplan-Meier (Kaplan and Meier 1958) and Nelson-Aalen estimators, compute these quantities directly from observed survival data.

The primary goal of predictive modeling in the survival context—termed survival modeling—is to develop estimates of the survival, hazard, or cumulative hazard functions with respect to a set of observed covariates. In the underwriting-for-mortality setting, the covariates are medical and behavioral attributes of life insurance applicants and the event is mortality. The techniques used to estimate these functions fundamentally require a different set of statistics, as the time-to-event of mortality is unknown for most individuals. This is referred to as right-censored data because the date of birth is known, but the date of death is unobserved for a large set of individuals. Missing survival information is a key characteristic of survival analysis, in which the data may be censored at the beginning, end, or even middle of study periods.

There is a well-established set of methods employed by academic and industrial practitioners of survival analysis. The Cox proportional hazards model is the most widely used statistical technique for estimating individual risk in studies of survival (Cox 1972). This is a semi-parametric regression model that assumes a linear functional form and proportional hazards for any two strata over time. In machine learning, random forests (Breiman 2001) have been adapted by Ishwaran (2008) to handle right-censored survival outcomes (called random survival forests, or RSF), and efficient implementations exist (Wright and Ziegler 2017). As a nonparametric, adaptive model, RSF captures interactions and non-linear dependencies that are more subtle and complex than can be reflected by a linear model. The extension to survival data includes setting the splitting criterion to maximize survival difference, as measured by a log-rank test; the terminal nodes directly estimate the cumulative hazard function via an ensemble of Nelson-Aalen estimators.

Survival models can be evaluated with concordance, a pairwise ranking statistic similar in interpretation to the area under the receiver operating characteristic curve (AUC) commonly used in classification. The next section provides background on a more relevant metric for an actuarial setting.

2.3 Actuarial Mathematics

Actuaries evaluate mortality risk and its financial impact when developing life insurance products. Pricing and cash flow simulations require assumptions about expected mortality rates. These are derived from a combination of observed mortality experience within a company and industry-wide life tables. The Society of Actuaries publishes a series of Valuation Basic Tables (VBTs) that aggregate mortality experience within the insured population across many carriers. The most recent tables, published in 2015, compile data
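The nonparametric estimators named in Section 2.2 can be sketched in a few lines of plain Python. The toy survival data below are hypothetical, and the functions are an illustrative sketch rather than the paper's implementation; they compute the Kaplan-Meier estimate of S(t) and the Nelson-Aalen estimate of Λ(t), which satisfy Λ(t) ≈ −log S(t) approximately between the two estimators.

```python
# Sketch: nonparametric estimates of S(t) and Λ(t) from right-censored data.
from collections import Counter

def km_survival(times, events, t):
    """Kaplan-Meier estimate of S(t) = Pr(T > t)."""
    deaths = Counter(ti for ti, e in zip(times, events) if e)
    s = 1.0
    for ti in sorted(set(times)):
        if ti > t:
            break
        at_risk = sum(1 for tj in times if tj >= ti)  # risk set at ti
        s *= 1.0 - deaths[ti] / at_risk
    return s

def na_cumhaz(times, events, t):
    """Nelson-Aalen estimate of the cumulative hazard Λ(t)."""
    deaths = Counter(ti for ti, e in zip(times, events) if e)
    lam = 0.0
    for ti in sorted(set(times)):
        if ti > t:
            break
        at_risk = sum(1 for tj in times if tj >= ti)
        lam += deaths[ti] / at_risk
    return lam

times  = [2, 3, 3, 5, 7, 8, 10, 12]   # years of exposure (toy data)
events = [1, 1, 0, 1, 0, 1, 0, 0]     # 1 = death observed, 0 = censored
print(km_survival(times, events, 6))  # S(6)
print(na_cumhaz(times, events, 6))    # Λ(6); roughly -log S(6)
```

Censored records still contribute to the risk set up to their exposure time, which is exactly the extra bookkeeping that distinguishes survival estimation from ordinary classification.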
from over 50 life insurers and facet mortality rates by standard factors: age, gender, duration, and smoking status. VBTs are often used as a standard baseline because they reflect a much larger population than that of a single carrier. Actuaries compare their observed mortality experience against the expected mortality rates in the VBTs using a metric referred to as the actual-to-expected (A/E) ratio. The A/E ratio is computed by summing all observed deaths divided by the accumulated hazard corresponding to each individual policy year on record:

    A/E = (Σ event indicator) / (Σ accumulated hazard).    (3)

When the A/E is less than 100%, this indicates that the actual mortality experience is better than expected. In this work, we rely on A/E ratios to compare model performance against underwriters using the 2015 VBT expected basis.

3 Modeling Mortality

With an understanding of traditional underwriting for life insurance, this section demonstrates how historical underwriting data can be leveraged to train models of mortality.

3.1 Data

Life insurance carriers track policies over a potentially long period of time to maintain records of financial exposure. Minimally, this requires demographics, policy details, and post-issue events (e.g., status changes). However, to build a model that predicts mortality risk, it is critical to retain the data used for underwriting. MassMutual has a consolidated, digital record of nearly one million applications for which a lab test was ordered during 1999–2014. After removing applications with a high degree of missing values, typically incomplete or withdrawn applications, this reduces to 908k records with 9.16M exposure years and 15.7k observed deaths.

To develop a general-purpose model for life underwriting, mortality outcomes on all applicants are crucial. Internal records are limited to death benefit claims, which excludes applicants who never became policyholders or terminated their policies prior to a claim. MassMutual obtained and periodically refreshes ground-truth mortality data from internal and third-party sources on historical applicants.

Aside from demographics, labs, and mortality, the data set covers attributes drawn from a lengthy health history questionnaire that accompanies the application process. Additional data sources are widely used in underwriting, such as prescription drug histories and motor vehicle records, but at present, we do not have adequate historical coverage to directly tie them to mortality. Below we review select statistics on the primary attributes contained in the overall data set.

Demographics Over a 15-year period, the data set provides broad coverage of demographics. Males are generally older than their female counterparts at time of application, as shown in Figure 1. Males account for more than twice the number of deaths, which is a function of an older age distribution, a higher proportion in earlier application years, and a higher mortality rate in general (see Figure 2). Additionally, the applicant data exhibit expected survival dependence on age and sex (e.g., females tend to outlive males (Kalben 2000) and survival probability decreases with age).

Figure 1: The age distribution of the application population stratified by sex.

Figure 2: Survival probabilities by age (≤ 50 vs. > 50) and sex are consistent with general population statistics.

Lab Tests Life insurance underwriting typically includes a de facto set of laboratory tests on blood and urine specimens. A vast medical and actuarial literature ties various tests directly with all-cause or specific causes of mortality, such as albumin (Goldwasser and Feldman 1997) and cholesterol (Kronmal et al. 1993). The lab test data provide broad exposure to a range of values and include biophysical measurements (e.g., build, blood pressure), lipids (e.g., cholesterol), liver function tests (e.g., gamma-glutamyltransferase), kidney function tests (e.g., creatinine), blood proteins (e.g., albumin, globulin), urine proteins (e.g., microalbumin), blood sugars (e.g., fructosamine, hemoglobin A1C), and several indicators (e.g., cocaine, HIV).

Health History Questionnaires Lab tests are a point-in-time view into an individual's health that yield substantial protective value for risk selection. The application process also solicits information related to personal and family health history, as well as behavioral risk, through an extensive questionnaire. Partnering with a vendor specializing in handwriting recognition, we digitized the vast majority of MassMutual's paper and imaged archive. This endeavor was challenging due to a manual element of standardizing questions phrased differently across time, states, and product offerings. Despite the acquisition costs, these data enable a consistent mapping with the current application. The training data include variables that align to major medical impairments (e.g., cardiovascular disease) derived from Boolean responses and keyword extraction on open-text fields.
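The A/E computation in Eq. (3) can be made concrete with a small sketch. The life-table rates and policies below are illustrative values, not actual 2015 VBT rates or MassMutual data: observed deaths are summed in the numerator, and the denominator accumulates the expected annual hazard for each attained age in each policy year on record.

```python
# Sketch of Eq. (3): actual deaths over hazard accumulated from an
# expected basis (e.g., a VBT life table). All numbers are illustrative.

def actual_to_expected(policies, table):
    """policies: list of (issue_age, years_exposed, died).
    table: dict mapping attained age -> expected annual mortality rate."""
    actual = sum(died for _, _, died in policies)
    expected = sum(
        table[age + yr]              # expected hazard for each policy year
        for age, yrs, _ in policies
        for yr in range(yrs)
    )
    return actual / expected

# Toy expected basis: mortality rising ~10% per year of attained age.
toy_table = {a: 0.001 * 1.1 ** (a - 40) for a in range(40, 80)}
toy_policies = [(45, 10, 0), (50, 10, 1), (60, 5, 0)]
print(f"A/E = {actual_to_expected(toy_policies, toy_table):.0%}")
```

A value under 100% means fewer deaths were observed than the expected basis implies, which is the direction an insurer wants its selected pool to move.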
Figure 3: Grouping by 4-year bands, the distribution of cholesterol trends lower over time.

Figure 4: Trends in aggregate mortality risk, measured by A/E, as a function of 5-point bands of BMI.

Health Trends across Time Given the 15-year time period of our data, we observe trends in the distribution of certain lab values. For example, recent applicants exhibit lower levels of cholesterol compared to those in earlier years, as shown in Figure 3. This is consistent with medical research reporting similar trends over the same time period (Rosinger et al. 2017). A variable that trends over time is referred to as exhibiting covariate shift or non-stationarity, which presents a modeling challenge due to the temporal association with predictive variables (Sugiyama and Kawanabe 2012). We apply a statistical adjustment that translates and controls for these temporal differences in distributions. With recent research discovering worsening mortality trends in specific subpopulations (Case and Deaton 2015), albeit stemming from uncertain factors, it will be imperative to capture the changing dependence between lab tests and mortality risk.

3.2 Modeling

This valuable data asset enables survival modeling. Below, we outline the strategy for selecting relevant features and refining the mortality model, and we demonstrate that the predictions appropriately stratify health factors.

Feature Selection Feature selection was heavily influenced by medical and actuarial experts and validated with standard machine learning techniques. A model intended to be used for an embedded process central to the business cannot solely be optimized for predictive accuracy. It is critical to consider the operational impact of each prediction, including reconciling with complementary underwriting data sources and transparent communication to customers.

Through close partnership with MassMutual's medical team, we constructed an intuitive and medically relevant mortality model. Given their recommendations, we reviewed the historical coverage of each variable, as procedures for filing and testing have changed across time and underwriting requirements vary by demographics and policy features. We also assessed the statistical dependence with mortality inherent to each variable. For example, Figure 4 shows how A/E ratios vary by 5-point bands of body mass index (BMI), exhibiting elevated mortality risk for low BMI and steadily increasing mortality risk for higher values of BMI.

The deployed mortality model relies on nearly sixty inputs, including internally generated features (e.g., BMI as a function of height and weight). The main inputs are captured in biophysical measurements, blood and urine specimens, and applicant health history questionnaires.

Experiments Research on survival methods has made advances over the past decade. The most widely studied machine learning models for survival data are tree-based methods, such as the random survival forest. Emerging research aims to apply advanced statistical models, such as gradient boosting and generalized additive models, to discrete-time survival analysis (Chen and Guestrin 2016; Wood 2006), as well as survival extensions of deep learning (Katzman et al. 2016; Ranganath et al. 2016). However, scalable implementations are limited, with the most comprehensively developed survival suite existing in the R environment. Thus, we focused our modeling on the Cox proportional hazards model (COX) and the random survival forest (RSF). Experiments iterated on findings drawn from our collaborative feature selection process, in addition to improvements through variable transformation, hyperparameter tuning, and sampling techniques. Each experiment performed 10-fold cross-validation, and held-out predictions were used to produce a suite of statistical, actuarial, and business-relevant evaluation metrics. The RSF model consistently yields a substantial improvement over traditional underwriting and COX.

Developing the Life Score The RSF mortality model directly estimates the cumulative hazard function, Λ(t), across the duration of exposure years in the training data. From this vector of cumulative hazards, we derive a single, standardized life score that can be used to rank individuals for underwriting. Specifically, we select Λ(10), the cumulative hazard at t = 10, corresponding to the median exposure of our data. The life score has a range of 0–100, from highest to lowest risk. The score reflects the relative risk among 5-year age band, sex, and smoker cohorts—primary factors in actuarial mortality studies. Conditioned on cohort, the life score is the integer-valued quantile of the empirical distribution of all 10-year cumulative hazard values. Figure 5(a) demonstrates that, as expected, the proportion of each cohort is represented consistently across the range of life scores.

Example: If Carlos is a 55-year-old non-smoking male with a life score of 87, he can be compared directly against and has lower mortality risk than Barry, another 55-year-old non-smoking male with a score of 53. However, if Amy is a 35-year-old non-smoking female with a score of 87, she does not necessarily present the same mortality risk as Carlos.
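The cohort-conditioned quantile normalization behind the life score can be sketched as follows. This is an illustrative reading of the description above, not the production implementation: within each (age band, sex, smoker) cohort, an applicant's 10-year cumulative hazard Λ(10) is converted to an integer percentile, with 100 for the lowest-risk member of the cohort.

```python
# Sketch: map 10-year cumulative hazards to 0-100 life scores within
# cohorts. Cohort keys, hazards, and the rounding rule are illustrative.
from collections import defaultdict

def life_scores(records):
    """records: list of (cohort_key, cum_hazard_10yr). Returns one
    integer score per record, conditioned on the record's cohort."""
    by_cohort = defaultdict(list)
    for key, haz in records:
        by_cohort[key].append(haz)
    scores = []
    for key, haz in records:
        pool = by_cohort[key]
        # fraction of the cohort at least as risky as this applicant
        rank = sum(1 for h in pool if h >= haz) / len(pool)
        scores.append(round(100 * rank))
    return scores

cohort = ("M", "50-54", "nonsmoker")
records = [(cohort, h) for h in (0.02, 0.05, 0.10, 0.40)]
print(life_scores(records))  # prints [100, 75, 50, 25]
```

Because the percentile is taken within a cohort, scores are only comparable between applicants who share an age band, sex, and smoking status, exactly as the Carlos/Barry/Amy example illustrates.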
Figure 5: (a) The proportion of individuals in each decile of the score is consistent across 5-year age and sex bands. (b) Incidence of heart condition as a function of life score. The proportion ranges from 14.4% in the first decile to 0.3% in the tenth decile, gradually decreasing in between. (c) Distribution of BMI as a function of life score. The highest scores have a greater proportion of healthy-range BMI. As the score decreases, the proportion of upper and lower BMI extremes gradually increases.

We can also demonstrate how medical impairments are stratified across the life score. Figures 5(b) and 5(c) display the proportion of heart condition incidence and BMI bands within each score decile. This highlights the effect that BMI and heart condition have on mortality risk. Each variable exhibits different stratification structures depending on its mortality dependence (e.g., U- or J-shaped mortality curves (Chokshi, El-Sayed, and Stine 2015; Cox et al. 2008)).

4 Validation

Analyses of correlations among the life score and health factors are useful, but business-related metrics are critical to understand the expected performance of a deployed system.

4.1 Simulation Method

In collaboration with actuaries, we designed a novel algorithm that generates a synthetic, model-assigned book of business to compare against historical underwriting risk class offers. The algorithm ensures that the number of simulated offers for each issue year, risk class, 5-year age band, sex, and smoking status cohort is identical to the number offered historically. This effectively controls for all actuarial factors and is consistent with how the life score is normalized. Without controlling for these factors, the algorithm would disproportionately assign, for example, young females to the best risk classes, as they present lower mortality risk.

The steps to equitably generate historical offers for a pool of applicants are shown in Algorithm 1. Using the historical data D, the algorithm first computes the number of offered policies by risk class within each cohort. Then, the mortality model M predicts a life score LS for each individual in D_cohort. For each risk class r in order, assign r to the next offer_counts[r] lowest-risk individuals that have yet to be assigned a model risk class, r_model. Assign a worse-than-standard rating to the remaining individuals.

Algorithm 1: AssignRiskClasses(D, M)
 1  D_assign ← ∅
 2  for Year y, Age a, Sex s, Smoking Status t do
 3      D_cohort ← D[Y = y, A = a, S = s, T = t][:]
 4      offer_counts ← count(D_cohort)
 5      D_cohort[:][LS] ← predict(D_cohort, M)
 6      D_cohort ← sort(D_cohort[:][LS])
 7      idx ← 1
 8      for ordered Risk Class r do
 9          D_cohort[idx : idx + offer_counts[r]][r_model] ← r
10          idx += offer_counts[r]
11      D_cohort[idx :][r_model] ← decline
12      D_assign ∪= D_cohort
13  return D_assign

Example: Consider a cohort of 35-year-old, non-smoking females in 2005. Assume 100 applications were submitted, and underwriters offered 50 UPNT, 15 SPNT, 30 NT, and declined coverage for 5 cases. Order the cases by life score and assign the 50 applicants with the highest life score to UPNT, and the next 15, 30, and 5 to SPNT, NT, and decline, respectively. Each 35-year-old, non-smoking female who applied in 2005 now has a model- and underwriter-assigned risk class.

Table 1: A/E confusion matrices for (a) non-tobacco classes relative to UPNT and (b) tobacco classes relative to SPT (rows: model; columns: underwriters).

(a)
            UPNT  SPNT    NT   <NT  Marginal
UPNT          84    85   109   177        93
SPNT         100   120   143   256       127
NT           126   143   174   247       163
<NT          226   306   340   653       432
Marginal     100   126   174   381

(b)
            SPT     T    <T  Marginal
SPT          68    79   156        80
T           107   148   149       130
<T          287   274   409       329
Marginal    100   142   253

4.2 Simulation Results

The model-assigned risk classes produced from Algorithm 1 enable the calculation of useful statistics, including the difference in deaths and A/E ratios compared to underwriters. We applied this simulation to historical life insurance
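For a single cohort, Algorithm 1 reduces to a rank-and-fill loop, which the following runnable sketch makes concrete. The applicant tuples, class counts, and function name are illustrative rather than the production code: historical offer counts per class are preserved, but each class is filled with the highest-scoring (lowest-risk) applicants still unassigned.

```python
# Runnable sketch of Algorithm 1 for one cohort: keep the historical
# number of offers per risk class, fill best classes with best scores.

def assign_risk_classes(applicants, historical_counts, class_order):
    """applicants: list of (id, life_score).
    historical_counts: risk class -> number offered historically.
    class_order: risk classes from best (UPNT) to worst."""
    ranked = sorted(applicants, key=lambda a: a[1], reverse=True)
    assignment, idx = {}, 0
    for rc in class_order:
        for app_id, _ in ranked[idx: idx + historical_counts[rc]]:
            assignment[app_id] = rc
        idx += historical_counts[rc]
    for app_id, _ in ranked[idx:]:      # remaining applicants
        assignment[app_id] = "decline"  # worse-than-standard rating
    return assignment

apps = [("a", 91), ("b", 87), ("c", 60), ("d", 44), ("e", 12)]
counts = {"UPNT": 2, "SPNT": 1, "NT": 1}
print(assign_risk_classes(apps, counts, ["UPNT", "SPNT", "NT"]))
```

Running Algorithm 1 over every (year, age band, sex, smoking status) cohort then yields, for each applicant, both a model-assigned and a historical underwriter-assigned class, which is what makes the death-count and A/E comparisons in Section 4.2 possible.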
applications submitted 2000–2014 and assume policies remain active until death, ignoring lapse. This amounts to roughly 650k applications and 7k deaths.

Recall from Section 2.1 that risk classes determine premiums based on expected mortality rates. The UPNT class corresponds to the lowest mortality rate and premium; thus, an effective model must assign those lowest-risk individuals to UPNT to maintain profitability. The model should also stratify high-risk individuals into the appropriate classes.

Using the output of Algorithm 1, we compute the difference in death counts for the RSF mortality model, as well as for COX and a random scoring process. Figure 6 displays the cumulative percent difference in UPNT deaths for the three methods compared to underwriters. Underwriters are experts at risk selection, yet the results show that after a 15-year duration, RSF would have formed an offer pool with 6% fewer deaths. COX and the random process produce 8% and 57% more deaths than underwriters, respectively. The results aggregated across all risk classes are qualitatively similar.

Figure 6: Cumulative percent difference in deaths in UPNT across policy duration, where 0 indicates equivalent counts.

To measure performance of the RSF model with an actuarial lens, we perform an A/E analysis. Tables 1a and 1b display confusion matrices of A/E ratios for the risk classes formed by RSF and underwriters. All A/Es are normalized by the marginal of the underwriter-assigned best risk class (UPNT and SPT, respectively) so that values can be interpreted relative to underwriter performance. The RSF model consistently produces lower mortality rates in each risk class and is substantially higher in the <NT and <T pools. The joint A/Es indicate that the model effectively disperses mortality risk in desired directions throughout the risk classes. Combined with underwriter decisions, there is potential for improved risk selection. For example, where they agree on UPNT, the mortality risk is 84% of the marginal.

The mortality model leverages fewer data sources than underwriters, who review additional requirements such as prescription drug histories, motor vehicle records, and financial data. As such, these results are conservative. An algorithmic underwriting system combining the mortality model, a comprehensive rules environment, and controlled manual oversight will generate even better mortality results.

5 Deployment

The simulation study illustrates the value of the mortality model, but it is a non-trivial undertaking to promote a model from a research environment to a real-time decision-making system. Below, we describe the approach to developing, releasing, and monitoring an algorithmic underwriting system.

5.1 The Algorithmic Underwriting System

A well-designed algorithmic underwriting system should capture digitally structured data and enable a simple interface and decision process for underwriter interaction. At MassMutual, a prospective life insurance customer completes a digital application and submits laboratory tests, generally through a paramedic visit. To predict a life score, the mortality model requires inputs from these lab test results and responses within the health questionnaire portion of the application. Additional information required for underwriting, such as motor vehicle records and prescription drug history, is obtained via vendor-supplied API calls. This information is not included in the model, as historical coverage of this data is currently limited. The same data are collected on applicants undergoing algorithmic and traditional underwriting, yet the overall processes are fundamentally different.

Some of the technical and business challenges include (1) generating discrete risk class recommendations from continuous life scores; (2) serving real-time scores in a robust environment; (3) integrating the model recommendations with medical and financial underwriting guidelines; and (4) empowering underwriters with explanations of the factors behind individual life scores to enable communication with advisors and customers.

Calibrating score thresholds. The mortality model supports a flexible framework that can recommend risk classes based on different objectives. For example, because the life score measures mortality risk, actuaries could adjust offers to achieve desired levels of mortality. However, the current approach sets thresholds that yield offer rates consistent with historical rates, as those form the basis of pricing assumptions. This aligns with the design of the simulation study from Section 4.1 and its corresponding metrics.

Predicting in real time. Real-time risk class recommendations are accessed via an internally developed REST API that hosts the mortality model. Once the full set of requirements is received for an application, the algorithmic underwriting system sends a formatted request to the API to receive the life score and suggested risk class. The API is highly scalable and responds within seconds, where the latency is driven by the complexity of the model prediction.

Integration with underwriting guidelines. Thousands of automated rules encompassing health, behavioral, and financial attributes serve as guardrails for the risk class recommendations generated by the model. The rules reflect a comprehensive set of medical and underwriting guidelines developed by experts in underwriting and insurance medicine. Each rule determines the best available risk class in the presence of certain values in the application. For example, a high BMI would preclude an applicant from receiving a preferred offer. When a rule is triggered, underwriters can focus on pertinent details of the application and use domain expertise to (1) override the rule, allowing the case to continue through the automated process, (2) decide if additional information is required for further review, or (3) confirm the
rule and proceed with the suggested risk class. Ultimately, reports aggregate statistics, and the medical team analyzes
the life score drives the final offer, but the rules may lead to individual model decisions before approval. Any change to
a worse rating. This approach to underwriting has led to new the expected distribution of offers requires further approval
analyst positions and revised workflows for underwriters. from an actuarial team. Final deployment of a new version
Interpreting model predictions. With a complex model driv- requires collaboration between data science and IT develop-
ing risk class decisions, it is imperative that analysts and ers, who maintain the production system. The cadence for
underwriters can effectively explain why an individual ap- new model versions occurs on an as-needed basis, roughly
plicant received a given offer. Model interpretability is an biannually, rather than a scheduled frequency.
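The integration with underwriting guidelines in Section 5, where each triggered rule caps the best available risk class and the final offer can only be demoted, can be sketched as follows. The rule predicates, thresholds, and the three-class ladder are hypothetical:

```python
CLASS_ORDER = ["UPNT", "SPNT", "NT"]             # ordered best to worst (illustrative)

def rule_caps(application, rules):
    """Best available class permitted by each triggered rule."""
    return [cap for predicate, cap in rules if predicate(application)]

def final_class(model_class, caps):
    """Demote the model's suggestion to the strictest triggered cap."""
    return max([model_class] + caps, key=CLASS_ORDER.index)

rules = [
    (lambda app: app["bmi"] >= 33.0, "SPNT"),          # high BMI: no preferred offer
    (lambda app: app["cholesterol"] >= 280.0, "NT"),   # hypothetical lab-value rule
]

caps = rule_caps({"bmi": 35.2, "cholesterol": 205.0}, rules)
offer = final_class("UPNT", caps)   # model suggests UPNT, the BMI rule demotes it
```

Note that a cap can never improve an offer: if the model already suggests a worse class than every triggered cap, the suggestion stands.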
active area of research as machine learning models be-
come increasingly opaque, despite evidence that even linear 5.4 Regulation
models can present a challenge in its interpretation (Lipton
The use of predictive modeling in life insurance under-
2016). We developed a model-agnostic approach to generat-
writing raises legal, regulatory, and ethical questions re-
ing interpretable, approximate factors that contribute to the
lated to transparency and fairness. In 2017, the New York
life score at an individual prediction level. The methodol-
Department of Financial Services requested that life insur-
ogy is similar to recent research, including Shapley values
ers provide details of their use of algorithmic underwrit-
and LIME (Lundberg and Lee 2017; Ribeiro, Singh, and
ing, including data sources, choice of model inputs, and the
Guestrin 2016). The contribution factors are returned with
available mechanisms for disputing model-based risk deci-
the life score and displayed to underwriters.
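The contribution factors described in Section 5 are generated by a methodology similar to Shapley values and LIME; as a much simpler stand-in, a one-at-a-time perturbation against reference values conveys the idea. The toy scoring function and reference profile below are invented for illustration:

```python
def contribution_factors(score_fn, applicant, reference):
    """Approximate per-feature contributions to one applicant's score by
    swapping each feature for a reference value and re-scoring."""
    base = score_fn(applicant)
    factors = {}
    for name, ref_value in reference.items():
        perturbed = dict(applicant, **{name: ref_value})
        factors[name] = base - score_fn(perturbed)   # > 0 means the feature raises the score
    return dict(sorted(factors.items(), key=lambda kv: -abs(kv[1])))

def toy_score(app):
    """Invented linear scoring model, used only to exercise the sketch."""
    return 100 - 0.8 * app["bmi"] - 0.1 * app["cholesterol"]

applicant = {"bmi": 36.0, "cholesterol": 250.0}
reference = {"bmi": 26.0, "cholesterol": 190.0}      # e.g., population averages
factors = contribution_factors(toy_score, applicant, reference)
```

Sorting by absolute magnitude surfaces the most influential factors first, which is the form most useful to display alongside the life score.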
sions (Scism 2017). Further, the National Association of In-
5.2 Rolling out the System surance Commissioner’s Model Rating Law requires under-
writing inputs to be actuarially justified (i.e., demonstrate
We systematically and gradually transitioned the exclusively correlation with risk). Increasingly, insurers are being chal-
human process of underwriting to an algorithmic frame- lenged to provide details around the inner workings of their
work. As a proof-of-concept, we conducted a pilot of the underwriting models to provide both applicants and regula-
system on 1,000 cases in parallel to traditional underwrit- tors a sense of which factors drive individual ratings.
ing. This enabled observation of risk class offer rates and A growing interest in consumer protection also manifests
agreement between the two systems. Following a successful through concerns around fairness and the impact predictive
pilot, algorithmic underwriting began issuing UPNT offers models have on protected classes. The use of a wide variety
on all life products up to $1M benefits for applicants aged of model inputs related to an applicant’s personhood (e.g.,
17–40. This was followed by an expansion to $3M and ap- age, gender, income) makes life underwriting models vul-
plicant ages up to 59, and finally for all standard-and-above nerable to persisting societal biases that exist without the
risk classes. At present for these parameters, algorithmic un- benefit of human manipulation to counteract its negative im-
derwriting is applied to 90% of applications. pacts on protected classes. In an effort to combat undesired
5.3 System Maintenance biases, models and risk ratings are conditioned on certain
protected classes, such as age and gender. In addition, pur-
Collaboration across several teams supports the monitoring, poseful omission of ethnicity and geography partially miti-
refreshing, and updating of the mortality model. A critical gates the risk related to fairness and disparate impact from
component to an algorithmic underwriting system, or any use of algorithms in life insurance underwriting.
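One minimal reading of the conditioning on protected classes discussed in Section 5.4, and of the cohort-relative comparison mentioned in the conclusion, is to express an applicant's raw score as a percentile within their own age-gender cohort. The records, cohort keys, and scores below are hypothetical:

```python
from bisect import bisect_right
from collections import defaultdict

def build_cohorts(records):
    """Sorted raw scores per (age_band, gender) cohort."""
    cohorts = defaultdict(list)
    for rec in records:
        cohorts[(rec["age_band"], rec["gender"])].append(rec["raw_score"])
    return {key: sorted(vals) for key, vals in cohorts.items()}

def cohort_percentile(cohorts, rec):
    """Fraction of the applicant's cohort scoring at or below them."""
    scores = cohorts[(rec["age_band"], rec["gender"])]
    return bisect_right(scores, rec["raw_score"]) / len(scores)

records = [
    {"age_band": "30-39", "gender": "F", "raw_score": s}
    for s in (10, 20, 30, 40, 50)
]
cohorts = build_cohorts(records)
pct = cohort_percentile(cohorts, records[2])   # middle of a five-person cohort
```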
machine learning system (Sculley et al. 2015), is to contin-
uously monitor the model inputs and outputs. Distributional
drift, such as deteriorating offer rates, or sudden outliers, 6 Business Value
such as a lab test changing units, could manifest in the sys- The implementation of predictive modeling in life insurance
tem, affecting the quality of the decisions. We implement a underwriting has favorable implications for a firm’s prof-
monitoring protocol that reports on daily batches of requests itability and its customer experience. At MassMutual, the
to the mortality model and use web-based dashboards to vi- use of the mortality model and algorithmic underwriting has
sualize and track trends across time. In the future, we plan resulted in greatly improved operational efficiency—time to
to establish automated monitoring to detect anomalies and policy issuance has decreased by >25% for certain appli-
structural changes to model inputs and risk class offer rates. cants. This improvement has had material impact on cus-
The model is retrained periodically to incorporate re- tomer experience as indicated by a >30% increase of ap-
freshed data and performance enhancements. Updates to plicants opting to purchase their policies when the decision
data include refreshed death information and additional was made by the model compared to traditional underwrit-
cases that have been underwritten. Collaborating with a team ing within the best class. Further, the automation of under-
of MDs, enhancements to the model address concerns iden- writing decisions at the company has amounted to labor and
tified upon individual case reviews. To date, new versions time savings of millions of dollars in 2 years on a growing
have focused on improving the accuracy of risk class recom- portfolio of policies that is valued in the tens of billions of
mendations for individual cases and specific medical impair- dollars. Despite these operational financial gains, there is yet
ments rather than aggregate performance. Prior to deploying more profitability to be derived from the increased accuracy
a new version, we conduct a retroactive pilot to ensure no un- of the decisions when driven by the mortality model. That
expected outcomes occur. The data science team generates is, the retrospective simulation study detailed in Section 4.1
new model outcomes for the past several months of cases, suggests a long-term benefit of reduced claims experience.

9379
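The monitoring protocol of Section 5.3 tracks distributional drift in model inputs and risk class offer rates. One common drift statistic (an assumed choice here; the paper does not name one) is the population stability index:

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between a baseline and a recent batch:
    sum over bins of (a - e) * ln(a / e), where a and e are bin fractions.
    A common rule of thumb treats values above roughly 0.2 as major drift."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total

baseline = [300, 300, 250, 150]   # offers per risk class in a baseline period
stable = [310, 295, 240, 155]     # a recent daily batch with a similar mix
shifted = [150, 250, 300, 300]    # e.g., after a lab test changed units

drift_ok = psi(baseline, stable)     # small: no action needed
drift_bad = psi(baseline, shifted)   # large: flag for investigation
```

The same statistic applies to individual model inputs (binned lab values, ages, and so on), which makes it a natural candidate for the automated monitoring the section describes as future work.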
7 Conclusion and Future Directions

Pairing machine learning capabilities with historical data provides an unprecedented opportunity in the life insurance industry to transform the underwriting status quo. Leveraging 15 years of applications at MassMutual, we developed a mortality model and life score that can consistently compare applicants relative to their demographic cohorts. We demonstrated that embedding such an approach has profound implications for profitability and customer experience.

Deploying a machine learning model and transforming a central business process has demonstrated the need to engage and collaborate with partners beyond a data science team. Medical and underwriting teams have been crucial to improving the mortality model and its integration with the algorithmic underwriting system; actuarial and reinsurance stakeholders have vetted and approved a business-relevant evaluation framework; and legal partners have ensured that the process remains equitable in its treatment of applicants.

There are many avenues for future work that span data, methods, and insurance innovation. The currently deployed mortality model does not consider all traditional underwriting data sources, such as prescription drugs or motor vehicle records, and there are non-traditional sources, such as financial data, public records, and wearable sensors, that may improve accuracy or enable alternative underwriting mechanisms. The general framework of producing high-resolution estimates of individual-level mortality risk can lead to actuarial and product innovation. Finally, trends in machine learning research on survival models may improve risk selection, and topics related to the fairness and transparency of complex models are equally crucial to study.

8 Acknowledgments

The authors are grateful for contributions made by Paul Shearer, Martha Miller, John Karlen, Debora Sujono, and Sara Saperstein. We also thank our many internal and external collaborators for their continued partnership.

References

Aggour, K. S.; Bonissone, P. P.; Cheetham, W. E.; and Messmer, R. P. 2006. Automating the underwriting of insurance applications. AI Magazine 27(3):36.

Atkinson, D. B., and Dallas, J. W. 2000. Life Insurance Products and Finance: Charting a Clear Course. Society of Actuaries.

Black, K., and Skipper, H. D. 2000. Life and Health Insurance. Prentice Hall.

Brackenridge, R. D. C.; Croxson, R.; and Mackenzie, R. 2006. Brackenridge's Medical Selection of Life Risks. Springer.

Breiman, L. 2001. Random forests. Machine Learning 45(1):5–32.

Case, A., and Deaton, A. 2015. Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century. Proceedings of the National Academy of Sciences 112(49):15078–15083.

Chen, T., and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the Twenty-Second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. ACM.

Chokshi, D. A.; El-Sayed, A. M.; and Stine, N. W. 2015. J-shaped curves and public health. JAMA 314(13):1339–1340.

Cox, H. J.; Bhandari, S.; Rigby, A. S.; and Kilpatrick, E. S. 2008. Mortality at low and high estimated glomerular filtration rate values: A U-shaped curve. Nephron Clinical Practice 110(2):c67–c72.

Cox, D. R. 1972. Regression models and life-tables. Journal of the Royal Statistical Society, Series B 34:187–220.

Goldwasser, P., and Feldman, J. 1997. Association of serum albumin and mortality risk. Journal of Clinical Epidemiology 50(6):693–703.

Ishwaran, H.; Kogalur, U. B.; Blackstone, E. H.; and Lauer, M. S. 2008. Random survival forests. The Annals of Applied Statistics 2(3):841–860.

Jordan, C. W. 1967. Society of Actuaries' Textbook on Life Contingencies. Society of Actuaries.

Kalben, B. B. 2000. Why men die younger: Causes of mortality differences by sex. North American Actuarial Journal 4(4):83–111.

Kaplan, E. L., and Meier, P. 1958. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53(282):457–481.

Katzman, J.; Shaham, U.; Bates, J.; Cloninger, A.; Jiang, T.; and Kluger, Y. 2016. Deep survival: A deep Cox proportional hazards network. arXiv preprint arXiv:1606.00931.

Kronmal, R. A.; Cain, K. C.; Ye, Z.; and Omenn, G. S. 1993. Total serum cholesterol levels and mortality risk as a function of age: A report based on the Framingham data. Archives of Internal Medicine 153(9):1065–1073.

Lipton, Z. C. 2016. The mythos of model interpretability. In Proceedings of the ICML Workshop on Human Interpretability in Machine Learning, 96–100.

Lundberg, S. M., and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 4765–4774.

Ranganath, R.; Perotte, A.; Elhadad, N.; and Blei, D. 2016. Deep survival analysis. arXiv preprint arXiv:1608.02158.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the Twenty-Second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.

Rosinger, A.; Carroll, M. D.; Lacher, D.; and Ogden, C. 2017. Trends in total cholesterol, triglycerides, and low-density lipoprotein in US adults, 1999–2014. JAMA Cardiology 2(3):339–341.

Scism, L. 2017. New York regulator seeks details from life insurers using algorithms to issue policies. The Wall Street Journal.

Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M.; Crespo, J.-F.; and Dennison, D. 2015. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems 28, 2503–2511.

Sugiyama, M., and Kawanabe, M. 2012. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press.

Wilson, P. W.; D'Agostino, R. B.; Levy, D.; Belanger, A. M.; Silbershatz, H.; and Kannel, W. B. 1998. Prediction of coronary heart disease using risk factor categories. Circulation 97(18):1837–1847.

Wood, S. 2006. Generalized Additive Models: An Introduction with R. CRC Press.

Wright, M. N., and Ziegler, A. 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77(1):1–17.
