A risk adjustment formula should be assessed in terms of the effects it will have on efficient sorting
of consumers across plan types (Einav and Finkelstein 2011) and on ameliorating the incentives of
plans to distort services to favor healthy (low-cost) enrollees over sick (high-cost) enrollees
(Glazer and McGuire 2000). While
some papers have assessed these incentives in the context of employer-spon-
sored health insurance (Einav, Finkelstein, and Cullen 2010), Medicare
(Brown et al. 2014), and Marketplaces (McGuire et al. 2014), by far the most
common metric for assessing risk adjustment alternatives is simply the R2 of
the risk adjustment model (Kautter et al. 2014; van Veen et al. 2015). While
acknowledging the limitations of a fit measure like R2, fit measures are a natural
first step in considering the potential of machine learning methods.
Meanwhile, the broader health statistics field is rapidly moving toward
newer techniques as the era of “big data” brings increased information on
patients, such as electronic health records. Given the size and complexity of
these new data structures, standard statistical methods will often not be suit-
able or feasible. For example, it has become commonplace for there to be hun-
dreds or thousands of covariates collected to explain an outcome of interest
(van der Laan and Rose 2011). Newer statistical methods can accommodate
this challenge. There is substantial potential to incorporate these advanced
methods in risk adjustment. Current methods, which remain limited to parametric regression, do
not fully exploit the information in the data. That said, embracing more sophisticated estimation
techniques with improved abilities to detect interactions, nonlinearities, and higher order effects
need not mean that a more complex risk adjustment estimator is necessarily warranted.
Machine learning frameworks also allow us to screen variables, reducing a
potential risk adjustment formula to, for example, just 10 variables.
The general applied statistical literature has begun to embrace these
automated machine learning techniques (Lee, Lessler, and Stuart 2009; Sudat
et al. 2010; Rose 2013), but this transition has yet to be made in other areas,
including health economics. Machine learning methods aim to smooth over
the data similarly to the way parametric regression procedures do, except that they
may make fewer assumptions, working within a nonparametric statistical model, and adapt
more flexibly to the data. The potential of these methods is considerable; they
can provide avenues for researchers to build the exact type of interactive pre-
diction methods they desire for use in practice. And, in the case of risk adjust-
ment, they could lead to more accurate spending predictions. Over 50 million
people in the United States are currently enrolled in an insurance program
that uses risk adjustment—over three times the number in Medicare Advantage
(Geruso and Layton 2015). The cost-saving implications of improved risk
adjustment formulas are immense.
The contributions of this paper include introducing cross-validation
(“hold-out” sampling), machine learning, and more parsimonious formulas
for plan payment risk adjustment. We illustrate the use of several machine
learning approaches for risk adjustment in the Truven MarketScan database
(Adamson, Chang, and Hansen 2008) and assess whether use of these proce-
dures improves risk adjustment with respect to cross-validated R2. The core
proposed framework is an ensembling machine learning technique that lever-
ages the use of cross-validation to take a weighted average of multiple algo-
rithms and form a single best predictor, as well as allowing for variable
selection procedures that produce a parsimonious formula with a reduced set
of variables. Our results demonstrated high accuracy for prediction using a
small set of variables, while protecting against overfitting.
METHODS
Data Source
We defined a population from the Truven MarketScan database with 2 years
of continuous coverage spanning 2011–2012, which yielded 10,976,994 peo-
ple. The Truven MarketScan database contains information on enrollment
and claims from private health plans and employers for between 17 and
51 million people each year, and it is one of the biggest databases of this type
(Adamson, Chang, and Hansen 2008). Variables available include enrollee
age, sex, region, insurance plan type, date and site of service, procedures,
expenditures, and inpatient diagnoses, as well as many others. We also created
Hierarchical Condition Category (HCC) variables using ICD-9-CM codes, as
these variables are the basis of the federal risk adjustment system (Kautter
et al. 2014). This database was the source of subjects for the current Market-
place risk-adjustment models devised by the federal government. For the pur-
poses of our work, we extracted a random sample of 250,000 people to
demonstrate the proposed risk adjustment procedures. The covariates we use
mimic those used in official risk-adjustment formulas, including age, sex, geo-
graphic area, five inpatient diagnosis categories, and 74 HCC variables, all
from 2011. The outcome is total annual expenditures in 2012. We refer to
other literature for further discussion of the database and variable construc-
tion, as well as additional approaches to sample construction (Layton, Ellis,
and McGuire 2015; Rose et al. 2015).
However, the lasso penalty will typically select a single variable from a set of
correlated variables, which may not be desirable (e.g., including only one of a
set of predictors that were dichotomized from a categorical variable). Addi-
tionally, when the number of predictors is small relative to the number of
subjects and the predictors are highly correlated, other penalties, notably the
ridge penalty, can outperform the lasso. The ridge penalty performs no variable
selection: it shrinks the regression coefficients toward zero, but they are
never exactly zero. Other penalties are available, and the elastic net provides
a balance between these two extremes by combining the lasso and ridge penalties
in varying proportions.
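
For illustration, all three penalties can be fit with the glmnet R package (Friedman, Hastie, and
Tibshirani 2010), where a single mixing parameter moves between the lasso and ridge extremes. This
is a minimal sketch; the object names and the choice to tune the penalty level by internal
cross-validation are assumptions for illustration, not the exact specification used in the analysis
reported here:

  library(glmnet)

  # x: numeric covariate matrix (age, sex, HCC indicators, etc.); y: total spending
  # alpha sets the penalty: 1 = lasso, 0 = ridge, 0.5 = a balanced elastic net
  lasso_fit <- cv.glmnet(x, y, alpha = 1,   family = "gaussian")
  ridge_fit <- cv.glmnet(x, y, alpha = 0,   family = "gaussian")
  enet_fit  <- cv.glmnet(x, y, alpha = 0.5, family = "gaussian")

  # Coefficients at the penalty level chosen by internal cross-validation;
  # lasso and elastic net coefficients can be exactly zero, ridge coefficients cannot
  coef(lasso_fit, s = "lambda.min")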
Decision tree-based methods are popular in many fields with high-
dimensional data, such as genomics, and they can be useful for reducing the
number of covariates and identifying more complex relationships between
variables. Decision trees are developed from a set of predictor variables, simi-
lar to standard regression methods. From these predictors, the algorithm cre-
ates rules to define splits in the tree, usually with a subset of the data. The goal
of the splits is to create divisions that have the most homogeneity in the out-
come Y. If Y is not sufficiently homogeneous after a split, the node will be split
again based on another predictor variable. Otherwise, it will be defined as a
terminal node and assigned an outcome value. Different decision tree meth-
ods employ varying techniques to grow the tree, and common procedures
grow very large trees (i.e., many nodes) with a backwards deletion step used at
the end to discover the optimal tree size and remove terminal nodes (Breiman
et al. 1984). Overfitting is typically an issue with decision trees, especially
when the tree has a large number of terminal nodes. This overfitting can lead
to poor performance when the decision tree is applied to the full data or
another dataset from a similar population. Random forests is an ensembling
method that grows many trees in an effort to protect against outlier trees and
overfitting, although overfitting can still be an issue. The algorithm uses a sub-
set of the data to define the splits in the tree, but unlike in single decision trees,
random forests takes a bootstrap sample for each tree. The unselected obser-
vations are used to validate the procedure. Once a large number of trees have
been produced (often 500 or more), final rules are developed based on the
modal or average value across the trees.
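
As a brief sketch, a single regression tree and a random forest could be fit with the rpart and
randomForest R packages (Therneau, Atkinson, and Ripley 2013; Liaw and Wiener 2002); the data frame
name and tuning values below are illustrative assumptions:

  library(rpart)
  library(randomForest)

  # dat: data frame with spending outcome y and the 2011 covariates
  # Grow a large regression tree, then prune it back using the complexity
  # parameter with the smallest cross-validated error from rpart's internal table
  tree_fit <- rpart(y ~ ., data = dat, method = "anova")
  best_cp  <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
  pruned   <- prune(tree_fit, cp = best_cp)

  # Random forests: many trees grown on bootstrap samples, with predictions averaged
  # across trees; predict() without newdata returns out-of-bag predictions
  rf_fit   <- randomForest(y ~ ., data = dat, ntree = 500)
  rf_preds <- predict(rf_fit)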
Artificial neural networks, also referred to as neural nets and recently
rebranded as deep learning, are algorithms that attempt to explain an outcome
given a set of input variables by postulating a series of interconnected nodes
within multiple layers (i.e., the network). The relationships between the inter-
connected nodes are defined by weights, calculated with respect to one of a
number of different rules. The algorithm starts with an initial guess for the
weights of the nodes, and then iterates, adjusting the weights given how well it
did in predicting the outcome. The algorithm was inspired by the complex
relationships of neurons within the brain. We guide the interested reader to
additional literature for further details of this complex technique (Venables
and Ripley 2002).
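
A comparable sketch of a single-hidden-layer network using the nnet R package that accompanies
Venables and Ripley (2002); the hidden layer size and weight decay shown are illustrative
assumptions rather than tuned values:

  library(nnet)

  # linout = TRUE requests a linear output unit, appropriate for a continuous
  # spending outcome; size and decay control the hidden layer and regularization
  nn_fit   <- nnet(y ~ ., data = dat, size = 5, decay = 0.1, linout = TRUE, maxit = 500)
  nn_preds <- predict(nn_fit, newdata = dat)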
There are many other possible algorithm choices available (Friedman,
Hastie, and Tibshirani 2001; James et al. 2013), but even considering only
parametric regression, lasso penalized regression, ridge penalized regression, a
balanced elastic net, a single tree, random forests, and a neural net, it is not
clear which procedure will yield the best performance. If we want to iden-
tify the single best algorithm from among these choices, we could employ
cross-validation. Cross-validation involves the use of rotating “hold-out”
samples from within our data to assess the performance of an algorithm. It
allows us to assign measures of performance to each algorithm that reflect
how the procedure would behave in practice. If an algorithm only produces
accurate estimates in the data used to fit it, it will not be useful in another
setting with new data, and prediction in new data is the goal of most
risk-adjustment problems.
The utility of cross-validation is easy to understand once the proce-
dure is illuminated. We discuss 10-fold cross-validation here, as it has many
desirable statistical properties and low computational burden compared to
other types of cross-validation (Dudoit and van der Laan 2005; van der
Laan, Polley, and Hubbard 2007). Our sample data are partitioned into 10
mutually exclusive blocks of equal size. In the first “fold,” we isolate the
first nine blocks to serve as the training set, so-called because we will fit
each of the algorithms discussed above on this set of data, with the last
block serving as a validation set. After each of our algorithms is “trained”
using the data in the training set in fold 1, this fit is used to generate pre-
dicted values for the observations in the validation set. Thus, after fold 1 is
completed, we only have predicted values for 1/10th of the data for each
algorithm, but these values were generated on data not used to fit the algo-
rithm, thus protecting against overfitting. The validation set rotates such
that each block serves as the validation set once. See Figure 1. At the end
of the complete 10-fold cross-validation procedure, we have predicted val-
ues for each observation using the held-out validation sample from each
fold. A cross-validated mean squared error for each algorithm can then be
calculated using these predicted values:

\[ \mathrm{CV\ MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2, \]

where \(\hat{Y}_i\) is the cross-validated predicted value for observation i. A cross-validated
R2 follows directly as \(1 - \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 / \sum_{i=1}^{n}(Y_i - \bar{Y})^2\),
where \(\bar{Y}\) is the mean of Y. The algorithm with the best cross-validated performance has
been referred to as the cross-validation selector or the discrete super learner (Dudoit and van
der Laan 2005; van der Laan, Polley, and Hubbard 2007).
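
The mechanics of the fold rotation and the cross-validated error can be written compactly. The
sketch below, in R, compares three candidate algorithms; the data frame name, the candidate set,
and the use of default tuning are all assumptions for illustration:

  set.seed(1)
  n       <- nrow(dat)                           # dat: data frame with spending outcome y
  folds   <- sample(rep(1:10, length.out = n))   # assign each observation to one of 10 blocks
  algs    <- c("ols", "lasso", "rf")
  cv_pred <- matrix(NA, n, length(algs), dimnames = list(NULL, algs))

  for (v in 1:10) {
    train <- dat[folds != v, ]   # nine blocks form the training set in this fold
    valid <- dat[folds == v, ]   # the remaining block is the validation set
    x_tr  <- model.matrix(y ~ . - 1, train)
    x_va  <- model.matrix(y ~ . - 1, valid)

    cv_pred[folds == v, "ols"]   <- predict(lm(y ~ ., data = train), newdata = valid)
    cv_pred[folds == v, "lasso"] <- predict(glmnet::cv.glmnet(x_tr, train$y, alpha = 1),
                                            newx = x_va, s = "lambda.min")
    cv_pred[folds == v, "rf"]    <- predict(randomForest::randomForest(y ~ ., data = train),
                                            newdata = valid)
  }

  cv_mse <- colMeans((dat$y - cv_pred)^2)                 # cross-validated MSE per algorithm
  cv_r2  <- 1 - cv_mse / mean((dat$y - mean(dat$y))^2)    # cross-validated R2 per algorithm
  names(which.min(cv_mse))   # the cross-validation selector (discrete super learner)

Because each predicted value comes from a fit that never saw that observation, the resulting MSE
and R2 reflect out-of-sample rather than in-sample performance.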
One might query whether it is possible to improve on this method for
selecting the optimal algorithm. Recall that a single decision tree is often
improved by using the ensembling random forests procedure. It would be nat-
ural to then consider a general ensembling framework that allows us to aver-
age across many different types of algorithms. Thus, if multiple algorithms
capture important but unique components of the prediction function, our final
prediction function will incorporate all of them. This is exactly the method we
propose: an ensembler that takes a weighted average of multiple algorithms to
produce a single combined algorithm with optimal mean squared error. This
machine learning approach is called super learning, and it has been developed
and applied in the statistics literature (van der Laan, Polley, and Hubbard
2007; van der Laan and Rose 2011). One can also conceptually view the
ensembling super learner as taking a weighted average of the predicted values
from each algorithm considered. The estimator requires only a few additional
steps beyond those described in the 10-fold cross-validation procedure.
Given that we have already performed 10-fold cross-validation and
obtained predicted values in the validation sets for each observation and algo-
rithm, we will now use these values to calculate the optimal weight coeffi-
cients. This takes the form of a regression of Y on these predicted values, with a
separate column of predicted values for each algorithm. The optimal weight
coefficients are the coefficients in front of each of these column variables in the
regression. One can show that these optimal weights minimize the cross-vali-
dated risk (van der Laan, Polley, and Hubbard 2007; van der Laan and Rose
2011). The penultimate step is to fit each algorithm with the full data and com-
bine these fits with the weights to generate the super learner prediction func-
tion. The super learner prediction function is thus now defined as a weighted
combination of algorithms. Some algorithms will typically receive a weight of
zero and are therefore ignored for the purposes of generating predicted values.
To produce final predicted values for the full data, one feeds the data through
the described super learner function. For example, if the lasso penalized
regression had a weight of 0.50 and the random forests algorithm also had a
weight of 0.50, with all other algorithms receiving a weight of zero, the super
learner predicted values would be a weighted combination of the lasso penal-
ized regression predicted values and the random forests predicted values.
Specifically, one would multiply the predicted values generated by the lasso
by 0.50 and add them to the predicted values generated by the random forests
procedure, also multiplied by 0.50, to produce the final predicted values. To
apply the super learner function to a new dataset, one would
use the fixed fits of the lasso and random forests procedure established by the
original dataset, as well as the weights, running the new data through this func-
tion. Additional specifics regarding the mechanics of the super learner can be
found in other literature (van der Laan, Polley, and Hubbard 2007; van der
Laan and Rose 2011).
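
In practice, the weighted ensemble can be fit directly with the SuperLearner R package (Polley and
van der Laan 2013). The following is a minimal sketch; the particular library of candidate wrappers
and the object names are assumptions for illustration, not the full specification used in the
analysis:

  library(SuperLearner)

  # Y: total 2012 spending; X: data frame of 2011 covariates (age, sex, HCCs, etc.)
  sl_fit <- SuperLearner(
    Y = Y, X = X, family = gaussian(),
    SL.library = c("SL.glm", "SL.glmnet", "SL.randomForest", "SL.nnet"),
    cvControl  = list(V = 10)    # 10-fold cross-validation within the ensemble
  )

  sl_fit$coef                                          # weights on each candidate; many may be zero
  pred_new <- predict(sl_fit, newdata = X_new)$pred    # apply the fixed fit to new data

The coef element corresponds to the weighted combination described above: candidates receiving a
weight of zero drop out of the final prediction function.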
There is another key property of the super learning framework that
will be particularly important in the risk adjustment setting. With high-
dimensional data, it can be useful to reduce the number of variables con-
sidered for adjustment, thus simplifying the final formula. In super learn-
ing, a screening step can be included within the overall algorithm and its
cross-validation. We use a random forests screening step that takes the
top 10 variables with the highest variable importance measures. These screened
variables are then the only covariates supplied to the algorithms that follow
the screening step.
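
Within the same package, a screening wrapper can be paired with each prediction algorithm so that
the variable selection is itself cross-validated; the pairing below is an illustrative assumption
rather than the exact configuration used here:

  # screen.randomForest ranks covariates by random forests variable importance and
  # retains only the top-ranked subset for the algorithm it is paired with
  sl_screened <- SuperLearner(
    Y = Y, X = X, family = gaussian(),
    SL.library = list(
      c("SL.glm",    "screen.randomForest"),
      c("SL.glmnet", "screen.randomForest"),
      c("SL.glm",    "All")      # the same algorithm with no screening, for comparison
    ),
    cvControl = list(V = 10)
  )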
RESULTS
Truven MarketScan Database
The variables in our dataset are summarized in Table 2. The 10 variables
retained by the random forests screening step were age category 21–34 years,
all five inpatient diagnoses categories, and four HCC codes: metastatic cancer,
multiple sclerosis, end-stage renal disease, and stem cell transplant status/com-
plications. The super learner performed better than all the single algorithms
included in the analysis of the Truven MarketScan data. Efficiency losses for
the single algorithms compared to super learner, with respect to cross-vali-
dated R2, ranged from 4 to 92 percent. Neural net using the top 10 variables
from the random forests screening step was the worst performing algorithm,
with a relative efficiency of 8 percent. The neural net with all 86 variables also
performed poorly, with 15 percent relative efficiency compared to super lear-
ner. The parametric regression, lasso, elastic net, ridge, and random forests
with all variables performed equivalently, capturing 96 percent of the effi-
ciency of the super learner with cross-validated R2 values of 0.25. Any of these
five algorithms could be chosen as the discrete super learner in practice given
the minor absolute differences in performance, although the ridge regression
had the best performance. The top 10 versions of these algorithms had a drop
in relative efficiency compared to their respective full versions, with 88 percent
DISCUSSION
We introduced cross-validated machine learning methods for prediction in
risk adjustment in the Truven MarketScan database, generating new predic-
tion functions, including parsimonious versions of each method. Applying a
machine learning framework can be a useful tool for risk adjustment, and it
provides researchers with alternatives to large parametric regressions with
ever increasing numbers of covariates, which may not provide the flexibility
necessary in the age of “big data.” When additional novel estimators for pre-
diction are developed, they can easily be added to the ensembling framework
described here, as potential candidate learners. Ensembling can augment our
learning from data and provide statistical guarantees that we are leveraging
the information collected in the strongest possible way. Researchers need not
spend energy guessing which algorithm might perform the best or which vari-
ables should be included; they can now use super learning to run many at
once. The super learner here had the best overall performance. One should
note that algorithms that performed well here will not necessarily perform
well in other settings, as has been seen in other literature (van der Laan, Polley,
and Hubbard 2007; van der Laan and Rose 2011).
Our results also provide preliminary evidence that using a smaller number of
variables could actually lead to better plan payment risk adjustment. Even
examining only the two parametric linear regressions considered within the super
learner, the regression with 10 variables had a cross-validated R2 of 0.23,
versus 0.25 for the regression with all 86 covariates. While there is an
efficiency loss with respect to cross-validated R2, it is relatively minor.
Moreover, a parsimonious formula cannot be gamed as aggressively as is possible
now with the large number of diagnostic condition codes included in current risk
adjustment formulas, and the resulting cost savings could leave this difference
negligible. Thus, even if a
full super learner is not performed, a discrete super learner selecting among a
number of parametric linear regressions can lead to nontrivial improvements.
A discrete super learner could also be designed such that use of the full set of
risk adjustment variables would only be warranted if, say, there was a 20 per-
cent loss of efficiency when using the reduced set.
More broadly, deciding if a risk adjustment formula is better requires
more extensive empirical evaluation. As noted earlier, a statistical fit measure
14 HSR: Health Services Research
is the common starting point for evaluation, with a natural next step being sim-
ulation-based measures also applied in the literature. It is common, for example,
to construct predictive ratios, which compare predicted payments to observed costs
for subgroups of the population regarded as vulnerable to underservice by plans
(Kautter et al. 2014). Parsimony may come at a cost in these terms—with a
stripped-down empirical model, some disease groups that merit higher payment in
conventional risk adjustment models may be underpaid. Any tradeoff here requires
empirical work in the context of a particular policy application. Simulation is
also needed to fully capture other plan payment features, such as consumer
premiums or reinsurance that also affect plan payments and plan incentives
(McGuire et al. 2014). Given the impressive fit of the parsimonious model
estimated here, consideration of its properties on additional policy-related cri-
teria is clearly merited.
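
As a small illustration of the predictive ratio calculation, total predicted spending for a
subgroup is divided by its total observed spending, with values near one indicating adequate
payment; the subgroup variable name below is an assumption:

  predictive_ratio <- function(predicted, observed, subgroup) {
    sum(predicted[subgroup]) / sum(observed[subgroup])
  }

  # pred: predicted spending from any fitted algorithm, aligned with the rows of dat
  # e.g., enrollees flagged with the metastatic cancer HCC
  predictive_ratio(pred, dat$y, dat$hcc_metastatic_cancer == 1)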
One may notice that our top 10 parametric linear regression does not
contain sex. If there are specific variables that must be included for important
policy reasons, the super learner framework also allows the user to prespecify
these variables such that they are included in all algorithms regardless of their
results in any screening step. For that matter, different subsets of covariates
need not be selected in an automated fashion via a screening step. Policy mak-
ers, clinicians, and actuaries can work collaboratively to define subsets based
on various considerations (e.g., regulations, sensitivity to upcoding, clinical
pathways) and then compare the cross-validated results across these different
subsets. One may also be interested in performing cross-validation among dif-
ferent plans to better understand the generalizability of the risk-adjustment
formula, or considering plan type as a covariate in the prediction function
given the role contracts may play in utilization.
It is important to note that implementing ensemble super learning requires more
computing time and memory than standard regression techniques. In our paper we
also used a number of algorithms that may not be implementable in some settings
given time and computing constraints. However, it may be feasible in those
settings to compare a minimal set of candidate regressions, such as a parametric
regression with the full set of risk adjustment variables and one with 10
variables. The two algorithms would be cross-validated, the optimal regression
selected based on cross-validated MSE or R2 or another predefined rule balancing
formula complexity and performance, and a final fixed regression formula fit on
the full data would then be returned for use in new data. The key point is that
our study provides supporting evidence that an exhaustive set of variables in a
plan payment risk adjustment formula may not be necessary for strong predictive
performance.
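
A sketch of this minimal comparison in base R, with the reduced formula written out by hand; the
covariate names are placeholders, and the decision rule shown is simply lowest cross-validated MSE:

  cv_mse <- function(formula, dat, folds) {
    pred <- rep(NA_real_, nrow(dat))
    for (v in sort(unique(folds))) {
      fit <- lm(formula, data = dat[folds != v, ])                     # fit on training blocks
      pred[folds == v] <- predict(fit, newdata = dat[folds == v, ])    # predict held-out block
    }
    mean((dat$y - pred)^2)
  }

  folds   <- sample(rep(1:10, length.out = nrow(dat)))
  full_f  <- y ~ .                                                     # all risk adjustment variables
  small_f <- y ~ age_21_34 + inpatient_dx_1 + hcc_metastatic_cancer    # placeholder reduced formula
  winner  <- if (cv_mse(full_f, dat, folds) <= cv_mse(small_f, dat, folds)) full_f else small_f
  final_fit <- lm(winner, data = dat)   # fixed formula refit on the full data for use in new data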
ACKNOWLEDGMENTS
Joint Acknowledgment/Disclosure Statement: This work was supported by the John
and Laura Arnold Foundation, NIMH R01-MH094290, and an NIH-NIA
pilot grant from the Program on the Global Demography of Aging at the Har-
vard T.H. Chan School of Public Health. The author thanks Thomas
McGuire, Timothy Layton, Randall Ellis, and the anonymous reviewers for
helpful comments on an earlier version of this manuscript.
Disclosures: None.
Disclaimers: None.
REFERENCES
Adamson, D. M., S. Chang, and L. G. Hansen. 2008. Health Research Data for the Real
World: The MarketScan Databases. New York: Thomson Healthcare.
Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen. 1984. Classification and Regres-
sion Trees. New York: CRC Press.
Breyer, F., K. Bundorf, and M. V. Pauly. 2012. “Health Care Spending Risk, Health
Insurance, and Payment to Health Plans.” In The Handbook of Health Economics,
Vol. 2, edited by M. Pauly, T. McGuire, and P. Barros, pp. 691–792. Amsterdam:
Elsevier.
Brown, J., M. Duggan, I. Kuziemko, and W. Woolston. 2014. “How Does Risk Selec-
tion Respond to Risk Adjustment? Evidence from the Medicare Advantage Pro-
gram.” American Economic Review 104 (10): 3335–64.
Dudoit, S., and M. J. van der Laan. 2005. “Asymptotics of Cross-Validated Risk Estima-
tion in Estimator Selection and Performance Assessment.” Statistical Methodology
2 (2): 131–54.
Einav, L., and A. Finkelstein. 2011. “Selection in Insurance Markets: Theory and
Empirics in Pictures.” Journal of Economic Perspectives 25 (1): 115–38.
Einav, L., A. Finkelstein, and M. R. Cullen. 2010. “Estimating Welfare in Insurance
Markets Using Variation in Prices.” Quarterly Journal of Economics 125 (3):
877–921.
Friedman, J., T. Hastie, and R. Tibshirani. 2001. The Elements of Statistical Learning. New
York: Springer.
Friedman, J., T. Hastie, and R. Tibshirani. 2010. “Regularization Paths for Generalized Linear
Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22.
Geruso, M., and T. Layton. 2015. “Upcoding: Evidence from Medicare on Squishy
Risk Adjustment.” NBER Working Paper 21222 [accessed on June 1, 2015].
Available at http://www.nber.org/papers/w21222
Glazer, J., and T. G. McGuire. 2000. “Optimal Risk Adjustment of Health Insurance
Premiums: An Application to Managed Care.” American Economic Review 90 (4):
1055–71.
Iezzoni, L. 2012. Risk Adjustment for Measuring Healthcare Outcomes, 4th Edition. Chicago,
IL: Health Administration Press.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical
Learning. New York: Springer.
Kautter, J., G. C. Pope, M. Ingber, S. Freeman, L. Patterson, M. Cohen, and P. Keenan.
2014. “The HHS-HCC Risk Adjustment Model for Individual and Small Group
Markets under the Affordable Care Act.” Medicare & Medicaid Research Review 4 (3).
Kronick, R., and W. P. Welch. 2014. “Measuring Coding Intensity in the Medicare
Advantage Program.” Medicare & Medicaid Research Review 4 (2).
van der Laan, M. J., E. C. Polley, and A. E. Hubbard. 2007. “Super Learner.” Statistical
Applications in Genetics and Molecular Biology 6: 25.
van der Laan, M. J., and S. Rose. 2011. Targeted Learning: Causal Inference for Observational
and Experimental Data. New York: Springer.
Layton, T., R. Ellis, and T. McGuire. 2015. “Assessing Incentives for Adverse Selection
in Health Plan Payment Systems.” NBER Working Paper 21531 [accessed on
October 1, 2015]. Available at http://www.nber.org/papers/w21531
Lee, B., J. Lessler, and E. A. Stuart. 2009. “Improving Propensity Score Weighting
Using Machine Learning.” Statistics in Medicine 29: 337–46.
Liaw, A., and M. Wiener. 2002. “Classification and Regression by RandomForest.” R
News 2 (3): 18–22.
McGuire, T. G., J. P. Newhouse, S.-L. Normand, J. Shi, and S. Zuvekas. 2014. “Assess-
ing Incentives for Service-Level Selection in Private Health Insurance
Exchanges.” Journal of Health Economics 35: 47–63.
Polley, E., and M. J. van der Laan. 2013. SuperLearner: Super Learner Prediction. R Package
Version 2.0-10.
R Core Team. 2013. R: A Language and Environment for Statistical Computing. Vienna,
Austria: R Foundation for Statistical Computing.
Rose, S. 2013. “Mortality Risk Score Prediction in an Elderly Population Using
Machine Learning.” American Journal of Epidemiology 177 (5): 443–52.
Rose, S., J. Shi, T. McGuire, and S. L. Normand. 2015. “Matching and Imputation
Methods for Risk Adjustment in the Health Insurance Marketplaces.” Statistics in
Biosciences, in press.
Sudat, S. E., E. J. Carlton, E. Y. Seto, R. C. Spear, and A. E. Hubbard. 2010. “Using
Variable Importance Measures from Causal Inference to Rank Risk Factors of
Schistosomiasis Infection in a Rural Setting in China.” Epidemiologic Perspectives
& Innovations 7: 3.
Therneau, T., B. Atkinson, and B. Ripley. 2013. rpart: Recursive Partitioning. R package
version 4.1-3.
van Veen, S. H. C. M., R. C. van Kleef, W. P. M. M. van de Ven, and R. C. J. A. van
Vliet. 2015. “Is There One Measure of Fit That Fits All? A Taxonomy and
Review of Measures of Fit for Risk Equalization Models.” Medical Care Research
and Review 72 (2): 220–43.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S, 4th Edition.
New York: Springer.