Semiparametric Regression With R: Jaroslaw Harezlak - David Ruppert - Matt P. Wand
Jaroslaw Harezlak
School of Public Health
Indiana University Bloomington
Bloomington, Indiana, USA

David Ruppert
Department of Statistical Science
Cornell University
Ithaca, New York, USA

Matt P. Wand
School of Mathematical and Physical Sciences
University of Technology Sydney
Ultimo, New South Wales, Australia
This Springer imprint is published by the registered company Springer Science+Business Media, LLC
part of Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Contents
1 Introduction
  1.1 Semiparametric Regression
  1.2 The R Language
  1.3 Some Examples
    1.3.1 Warsaw Apartments
    1.3.2 Boston Mortgage Applications
    1.3.3 Indiana Adolescent Growth Data
    1.3.4 Sydney Real Estate Data
    1.3.5 Michigan Panel Study of Income Dynamics Data
    1.3.6 All of the Datasets Used in This Book
  1.4 Aim of This Book
2 Penalized Splines
  2.1 Introduction
  2.2 Penalized Spline Basics
  2.3 Choosing the Smoothing Parameter
  2.4 Choosing the Basis Size
  2.5 Checking the Residuals
  2.6 Effective Degrees of Freedom
  2.7 Mixed Model-Based Penalized Splines
  2.8 Variability Bands
  2.9 Hypothesis Testing
  2.10 Bayesian Penalized Splines
    2.10.1 Multiple Chains Extension
  2.11 Choosing Between Different Penalized Spline Approaches
  2.12 Penalized Splines with Factor Effects
    2.12.1 A Simple Semiparametric Additive Model
    2.12.2 A Simple Semiparametric Interaction Model
    2.12.3 A Simple Factor-by-Curve Model
  2.13 Further Reading
  2.14 Exercises
References
Index
Chapter 1
Introduction
Fig. 1.1 Area/price ratio versus construction date for the Warsaw apartment data in the data frame
WarsawApts within the R package HRW. The curve is an estimate of the mean area/price ratio
given the construction date. The shaded region indicates approximately 95% pointwise confidence
intervals. (Vertical axis: area (square meters) per million zloty.)
tutorials and notes that are available on the Internet, including on the Comprehensive
R Archive Network.
The Warsaw apartments dataset is used throughout this book’s early chapters to
illustrate the most fundamental semiparametric regression models. It contains data
on several variables for 409 apartments sold in the city of Warsaw, Poland, during
2007–2009. The data are stored in the data frame WarsawApts within the R
package, HRW, that accompanies this book. This data frame is a subset of one named
apartments in the R package PBImisc (Biecek 2014). The full description of
apartments can be found in the PBImisc package’s help files.
A question of interest is how the ratio of floor area to price depends on the
construction date. The basic unit of currency in Poland is the złoty. Figure 1.1
contains a plot of area per million złoty versus construction date with a nonpara-
metric regression function estimate and variability bands which have approximately
95% pointwise confidence interval validity. “Pointwise” means that there is a 95%
coverage probability at each value of the predictor. We see from Fig. 1.1 that there is
an interesting nonlinear relationship between area/price ratio and construction date.
The first three turning points in the mean function correspond to major events in
Warsaw’s history: (1) the German invasion of 1939, (2) the end of World War II and
beginning of communist rule in 1945, and (3) the start of martial law in 1981. During
communist rule building quality declined. Hence buildings constructed in 1975
have a larger mean area/price ratio compared with those constructed before 1940.
Poland became a democracy in 1989 and around 2000 pre-war building quality was
restored. In Chap. 2, we use the WarsawApts dataset to illustrate the basic concepts
of semiparametric regression modeling.
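
To make the idea of a nonparametric regression estimate with approximate 95% pointwise variability bands concrete, the following is a minimal sketch rather than the code used to produce Fig. 1.1. It assumes the mgcv package and uses synthetic data as a stand-in for the Warsaw apartments; the variable names x and y, the mean function, and the basis size k = 20 are all illustrative assumptions.

```r
# Penalized spline fit with approximate 95% pointwise bands (synthetic data).
library(mgcv)

set.seed(1)
n <- 400
dat <- data.frame(x = runif(n, 1900, 2010))        # stand-in for construction date
dat$y <- 100 + 25 * sin((dat$x - 1900) / 15) +     # hypothetical mean function
  rnorm(n, sd = 15)

# One smooth term; the smoothing parameter is chosen by REML.
fit <- gam(y ~ s(x, bs = "cr", k = 20), data = dat, method = "REML")

# Evaluate the estimate and its standard error on a grid of predictor values.
grid <- data.frame(x = seq(min(dat$x), max(dat$x), length.out = 200))
pred <- predict(fit, newdata = grid, se.fit = TRUE)
grid$fit   <- as.numeric(pred$fit)
grid$lower <- grid$fit - qnorm(0.975) * as.numeric(pred$se.fit)
grid$upper <- grid$fit + qnorm(0.975) * as.numeric(pred$se.fit)

# Scatterplot, shaded band, and fitted curve.
plot(dat$x, dat$y, col = "grey60", xlab = "x", ylab = "y")
polygon(c(grid$x, rev(grid$x)), c(grid$lower, rev(grid$upper)),
        col = adjustcolor("steelblue", alpha.f = 0.3), border = NA)
lines(grid$x, grid$fit, lwd = 2)
```

The band is pointwise: each interval is computed separately at each grid value, so it should not be read as a simultaneous statement about the whole curve.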
Another question of possible interest is “Are there differences between districts of
Warsaw in terms of how construction date impacts the area/price ratio?” Figure 1.2
plots the data of Fig. 1.1 broken down according to district. This plot uses graphics
supported by the R package lattice (Sarkar 2017). The regression function
estimates and approximate pointwise confidence intervals are obtained individually
for each district. Some differences among the districts are apparent. For example,
the Mokotow curve is higher than that for Srodmiescie, the central business
district of Warsaw. This suggests that buyers get more floor space for their
money in Mokotow than in Srodmiescie for apartments built around the same time.

Fig. 1.2 The same data as shown in Fig. 1.1 but broken down according to the district in Warsaw
in which each apartment is located. The curve in each panel is an estimate of the mean area/price
ratio given the construction date for that district treated separately. The shaded regions indicate
approximately 95% pointwise confidence intervals. (Panels: Mokotow, Srodmiescie, Wola, Zoliborz.
Vertical axis: area (square meters) per million zloty.)
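
Fitting a separate curve for each district, as described above, amounts to splitting the data by a grouping factor and repeating the same fit within each group. The sketch below illustrates this under stated assumptions: the data are synthetic, the district labels A to D and the per-district shifts are hypothetical, and the loess smoother drawn by lattice is only a quick visual guide, not the estimator used for Fig. 1.2.

```r
# Separate smooth fits per group, plus a lattice panel plot (synthetic data).
library(mgcv)
library(lattice)

set.seed(2)
districts <- c("A", "B", "C", "D")                 # hypothetical district labels
dat <- data.frame(
  district = factor(rep(districts, each = 100), levels = districts),
  x = runif(400, 1900, 2010)                       # stand-in for construction date
)
shift <- c(A = 0, B = -20, C = 10, D = 5)          # hypothetical district offsets
dat$y <- 100 + unname(shift[as.character(dat$district)]) +
  20 * sin((dat$x - 1900) / 15) + rnorm(400, sd = 12)

# Fit each district on its own subset of the data.
fits <- lapply(split(dat, dat$district), function(d) {
  gam(y ~ s(x, k = 10), data = d, method = "REML")
})

# Compare the estimated mean response across districts at one predictor value.
at1975 <- sapply(fits, function(f) {
  as.numeric(predict(f, newdata = data.frame(x = 1975)))
})
print(round(at1975, 1))

# One panel per district: points plus a loess smoother.
print(xyplot(y ~ x | district, data = dat,
             type = c("p", "smooth"), layout = c(2, 2)))
```

Because each fit sees only its own district, no information is shared between curves; models that do share information across groups are a separate modeling choice.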
Generalized additive models (GAMs) are useful when there are several predictors
each having a nonlinear effect. In GAMs, the linear predictor is a sum of nonpara-
metrically modeled functions of univariate predictors. GAMs are covered in Chap. 3.
We illustrate GAMs using a dataset concerning mortgage applications in Boston,
USA, during the years 1997–1998. The data frame BostonMortgages in the
HRW package contains data on several variables concerning 2380 applications.
BostonMortgages is a subset of the Hdma data frame in the package Ecdat
(Croissant 2016). This name “Hdma” is an apparent typographic error and should be
Hmda, which stands for “Home Mortgage Disclosure Act.” We selected a subset of
the predictors and deleted cases with missing values to create this smaller dataset.
The response of interest is deny, the status of the mortgage application, which is
coded as “yes” when the mortgage application was denied and “no” otherwise. We
are interested in developing a regression model for the probability that a mortgage
application is denied.
Figure 1.3 is a visual display of the data in which the variable of primary interest,
indicator of mortgage application denied, is plotted against the 12 other variables
in BostonMortgages. The yes/no variables are coded as 0 = no and 1 = yes. To
aid visualization, jittering has been applied to the variables that take discrete values.
There are 12 possible predictors but, for now, we concentrate on the predictor
ratio of the debt payments to total income which is shortened to debt payments
to income ratio in Fig. 1.3. The curve in Fig. 1.4 shows that the probability
that a mortgage is denied is decreasing in the range from 0 to 0.3 of the debt
payments to income ratio and is increasing after 0.3. The shaded region has a
pointwise approximate 95% confidence interval interpretation. In Sect. 3.3.3, we
will incorporate additional predictors that feature in Fig. 1.3.
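
A probability curve of this kind can be sketched with a logistic model containing a single smooth term. The example below is illustrative only: it uses synthetic data, and the names dir and deny, along with the shape of the true probability curve, are assumptions made for the sketch. The band is built on the logit scale and then mapped to the probability scale, which keeps it inside [0, 1].

```r
# Logistic regression with one smooth term (synthetic data).
library(mgcv)

set.seed(3)
n <- 2000
dat <- data.frame(dir = runif(n, 0, 1))            # stand-in for debt/income ratio
p_true <- plogis(-2.5 + 12 * (dat$dir - 0.3)^2)    # hypothetical denial probability
dat$deny <- rbinom(n, 1, p_true)                   # 1 = denied, 0 = not denied

fit <- gam(deny ~ s(dir), family = binomial, data = dat, method = "REML")

# Predict on the link (logit) scale, then transform to probabilities.
grid <- data.frame(dir = seq(0, 1, length.out = 101))
pr <- predict(fit, newdata = grid, se.fit = TRUE)
eta <- as.numeric(pr$fit)
se  <- as.numeric(pr$se.fit)
grid$prob  <- plogis(eta)
grid$lower <- plogis(eta - qnorm(0.975) * se)
grid$upper <- plogis(eta + qnorm(0.975) * se)

# Jitter the 0/1 responses vertically so that overlapping points are visible.
plot(dat$dir, jitter(dat$deny, amount = 0.05), col = "grey60",
     xlab = "dir", ylab = "deny (jittered)")
lines(grid$dir, grid$prob, lwd = 2)
lines(grid$dir, grid$lower, lty = 2)
lines(grid$dir, grid$upper, lty = 2)
```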
Munnell et al. (1996) investigated whether race was a factor in the denial of
mortgage applications after adjustment for the other variables. The variable black
is the indicator of Black or Hispanic ethnicity. In Chap. 3 we investigate the effect of
black using semiparametric regression to adjust for possible confounding variables.
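
Adjusting a binary factor effect for continuous confounders can be sketched as a model with one parametric term and several smooth terms. The following example is a hedged illustration on synthetic data, not the analysis of Chap. 3: the predictor names dir and lvr, the simulated effect sizes, and the proportion with black = 1 are all assumptions.

```r
# Parametric indicator effect adjusted for smooth confounder effects (synthetic data).
library(mgcv)

set.seed(4)
n <- 3000
dat <- data.frame(
  black = rbinom(n, 1, 0.15),                      # hypothetical 0/1 indicator
  dir   = runif(n, 0, 1),                          # stand-in for debt/income ratio
  lvr   = runif(n, 0.2, 1.2)                       # stand-in for loan/value ratio
)
eta <- -3 + 0.8 * dat$black + 10 * (dat$dir - 0.3)^2 + 1.5 * dat$lvr^2
dat$deny <- rbinom(n, 1, plogis(eta))

# Linear predictor: intercept + black effect + f1(dir) + f2(lvr).
fit <- gam(deny ~ black + s(dir) + s(lvr),
           family = binomial, data = dat, method = "REML")

# Estimated log-odds effect of the indicator and its standard error.
black_row <- summary(fit)$p.table["black", ]
print(black_row)
```

The coefficient on black is interpreted on the log-odds scale, holding the smooth functions of the other predictors fixed.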
Fig. 1.3 Plots of indicator of a mortgage application denied against the other variables in the data
frame BostonMortgages within the R package HRW. The yes/no variables are coded: 0 = no and
1 = yes. To aid visualization, jittering has been applied to the discrete variables data. (Vertical axis
in each panel: indicator of mortgage application denied. Horizontal-axis labels that survive in this
copy: debt to income ratio, housing expen. to income ratio, loan to asses. prop. value ratio, credit
score (low good).)
Fig. 1.4 Estimated probability of mortgage denial as a function of the debt payments to income
ratio based on the data shown in the top-left panel of Fig. 1.3. The blue circles show the data
with jittering of the response values to aid visualization. The shaded region is an approximate
95% confidence band. This fit was obtained using the gam() function in the R package mgcv;
see Chap. 3. Of the 2380 mortgage applications, 5 have debt payments to income ratios between
1.16 and 1.42 and one has a debt payments to income ratio of 3. These cases were used during
estimation but, to focus attention on the majority of the cases, they are not shown in the plot.
The Indiana adolescent growth data were obtained from a study of the mechanisms
of human hypertension development conducted at the Indiana University School of
Medicine, Indianapolis, USA, that started in the 1980s and is still continuing. Pratt
et al. (1989) contains a full description of the study. The data are from a longitudinal
study and are a special case of grouped data, which is the topic of Chap. 4.
The Indiana adolescent growth dataset is stored in the data frame named
growthIndiana in the HRW package. Note that growthIndiana is restricted to
the subset of 216 adolescents in the original study who had at least nine height
measurements. Table 1.1 is a cross-tabulation of the adolescents by race and gender.
Figure 1.5 shows the entire dataset using lattice graphics in R. The panels in
Fig. 1.5 plot height against age for each of the 216 adolescents, with color-coding
according to gender/race status. Such data are often referred to as growth curves. It is
not easy to fit these data using common parametric models. An additional challenge
arises from proper accounting for dependencies between measurements on the same
adolescent.

Fig. 1.5 The Indiana adolescent growth data stored in the data frame growthIndiana in the R
package HRW. Each panel plots height (centimeters) against age (years) for each of 216 adolescents.
Color-coding is used to indicate combined gender/race status.
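
One simple way to account for dependence among repeated measurements on the same subject is a subject-specific random intercept added to a smooth mean curve. The sketch below assumes the mgcv package and synthetic growth-like data; the number of subjects, the logistic-shaped mean curve, and the variance values are illustrative assumptions, and the models of Chap. 4 may differ.

```r
# Smooth mean curve plus subject-specific random intercepts (synthetic data).
library(mgcv)

set.seed(5)
m   <- 30                                          # hypothetical number of subjects
n_i <- 9                                           # measurements per subject
dat <- data.frame(
  id  = factor(rep(seq_len(m), each = n_i)),
  age = runif(m * n_i, 5, 20)
)
u <- rnorm(m, sd = 6)                              # subject-level deviations
dat$height <- 100 + 75 * plogis((dat$age - 12) / 2.5) +
  u[as.integer(dat$id)] + rnorm(nrow(dat), sd = 2)

# s(id, bs = "re") treats the subject effects as Gaussian random intercepts.
fit <- gam(height ~ s(age) + s(id, bs = "re"), data = dat, method = "REML")

# Estimated standard deviations for the smooth, the random effect, and the noise.
vc <- gam.vcomp(fit)
print(vc)
```

Ignoring the id term would treat all 270 heights as independent, which typically understates the uncertainty in the mean curve.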
Comparison of growth between the gender and race categories is often of interest
and will be studied in Chap. 4. Figure 1.6 is a different lattice graphics plot
of the same data shown in Fig. 1.5 but with the panels corresponding to the four
gender/race combinations. This better enables cross-category comparisons. For
example, black males between 15 and 20 years of age tend to be taller than black
females in the same age bracket.

Fig. 1.6 The same data as shown in Fig. 1.5 but with the panels corresponding to the four
gender/race combinations. (Panels: black females, white females, black males, white males.
Axes: height (cm) versus age (years).)
To give a flavor of semiparametric regression analyses of interest for such data,
described in Chap. 4, Fig. 1.7 shows two estimated contrast functions in which males
and females are compared within their own race categories. The estimates and
variability bands are based on a Bayesian semiparametric regression model with
approximate inference achieved via Markov chain Monte Carlo sampling facilitated
by the R package rstan (Guo et al. 2017). This approach is introduced in Sect. 2.10.
From Fig. 1.7 we see that there is little difference, statistically, between males
and females up to the age of 12. After that males are significantly taller, with the
gap bigger for the black race than it is for the white race. There is more variability
in the black race contrast function since it is based on fewer observations—only
about a quarter of the subjects in the study are black.
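
A contrast function is the difference between two group-specific mean curves, evaluated over a grid of ages. The sketch below is a frequentist stand-in using mgcv, not the Bayesian rstan analysis behind Fig. 1.7, and it uses synthetic data with a single two-level factor. The contrast and its standard error are obtained from the difference of the two prediction matrices; the simulated height gap is a hypothetical choice.

```r
# Estimated contrast between two group-specific curves (synthetic data).
library(mgcv)

set.seed(6)
n <- 600
dat <- data.frame(
  age = runif(n, 6, 18),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)
gap <- ifelse(dat$sex == "male", 12 * plogis((dat$age - 13) * 1.5), 0)
dat$height <- 110 + 4 * (dat$age - 6) + gap + rnorm(n, sd = 5)

# A separate smooth of age for each level of sex, plus a main effect of sex.
fit <- gam(height ~ sex + s(age, by = sex), data = dat, method = "REML")

# Prediction matrices for each group on a common age grid.
ages <- seq(6, 18, length.out = 50)
new_m <- data.frame(age = ages, sex = factor("male",   levels = levels(dat$sex)))
new_f <- data.frame(age = ages, sex = factor("female", levels = levels(dat$sex)))
D <- predict(fit, newdata = new_m, type = "lpmatrix") -
     predict(fit, newdata = new_f, type = "lpmatrix")

# Contrast estimate and pointwise standard error: D b and sqrt(diag(D V D')).
est <- as.numeric(D %*% coef(fit))
se  <- sqrt(rowSums((D %*% vcov(fit)) * D))
lower <- est - qnorm(0.975) * se
upper <- est + qnorm(0.975) * se

plot(ages, est, type = "l", ylim = range(lower, upper),
     xlab = "age", ylab = "male minus female")
lines(ages, lower, lty = 2)
lines(ages, upper, lty = 2)
abline(h = 0, col = "grey50")
```

Ages at which the band excludes zero correspond, informally, to ages at which the two groups differ; the band is pointwise, so this reading is not adjusted for looking at many ages at once.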
Fig. 1.7 Estimated contrast functions and approximate pointwise 95% credible sets based on
a Bayesian semiparametric regression model fitted to the data shown in Figs. 1.5 and 1.6.
Approximate Bayesian inference, based on Markov chain Monte Carlo, was performed using the
R package rstan. (Panels: male vs. female for black adolescents, male vs. female for white
adolescents. Axes: mean difference in height (centimeters) versus age (years).)
The Sydney real estate data were collected as a part of an unpublished study by A.
Chernih and M. Sherris at the University of New South Wales, Australia. The data
consist of 39 variables on 37,676 houses sold in Sydney, Australia, during the year
2001 and are stored in the data frame SydneyRealEstate in the HRW package.
Of central interest is the nature of the dependence of house prices on the
other variables. Figure 1.8 depicts some of the individual dependencies through
scatterplots of the logarithm of sale price against 8 of the potential predictors.
For example, the top-left panel in Fig. 1.8 shows the intuitively obvious positive
correlation between price and lot size. Underneath that, distance to the coastline is
seen to have a negative impact on price.
Figure 1.9 shows the average log-prices on a 50 × 50 equal-sized geographical
mesh. A strong spatial effect is apparent. The higher-priced areas tend to be near
Sydney’s waterways and ocean front. Rather than estimating univariate regression
functions, a bivariate function of longitude and latitude seems to be appropriate to
model the behavior exhibited in Fig. 1.9. The bivariate extension of semiparametric
regression analysis is dealt with in Chap. 5.
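
The two ideas in the preceding paragraph, averaging a response over a 50 by 50 mesh and fitting a bivariate function of longitude and latitude, can be sketched as follows. This is an illustration on synthetic data under stated assumptions: the coordinate ranges, the single price "hot spot", and the basis size k = 40 are hypothetical, and the thin plate smooth used here is one of several possible bivariate smoothers.

```r
# Mesh averages and a bivariate smooth of location (synthetic data).
library(mgcv)

set.seed(7)
n <- 3000
dat <- data.frame(
  lon = runif(n, 150.5, 151.5),                    # hypothetical longitude range
  lat = runif(n, -34.2, -33.5)                     # hypothetical latitude range
)
dat$logprice <- 13 +
  0.8 * exp(-((dat$lon - 151.2)^2 + (dat$lat + 33.85)^2) / 0.02) +
  rnorm(n, sd = 0.3)

# Average the response within each cell of a 50 x 50 mesh.
nb <- 50
lon_bin <- cut(dat$lon, breaks = seq(150.5, 151.5, length.out = nb + 1),
               include.lowest = TRUE)
lat_bin <- cut(dat$lat, breaks = seq(-34.2, -33.5, length.out = nb + 1),
               include.lowest = TRUE)
mesh <- tapply(dat$logprice, list(lon_bin, lat_bin), mean)  # NA for empty cells
image(mesh, xlab = "longitude bin", ylab = "latitude bin")

# A single bivariate smooth of (lon, lat) in place of two univariate smooths.
fit <- gam(logprice ~ s(lon, lat, k = 40), data = dat, method = "REML")
center <- predict(fit, newdata = data.frame(lon = 151.2, lat = -33.85))
corner <- predict(fit, newdata = data.frame(lon = 150.6, lat = -34.1))
print(c(center = as.numeric(center), corner = as.numeric(corner)))
```

Empty mesh cells are left as NA, mirroring the blank pixels described for Fig. 1.9, whereas the fitted surface provides an estimate at any location within the range of the data.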
Fig. 1.8 Plots of logarithm of sale price (dollars) against some of the other variables in the data
frame SydneyRealEstate within the R package HRW. To aid visualization, a 10% random subset
of the data is used in the plots. (Vertical axis in each panel: log(sale price (dollars)).
Horizontal-axis labels that survive in this copy: lot size (square meters), average income of
suburb, distance to coastline (kilometers), distance to hospital (kilometers), crime rate, nitrogen
dioxide level.)
Fig. 1.9 The spatial variation in log price of the houses sold in Sydney, Australia, during 2001
based on the dataset SydneyRealEstate within the R package HRW. The averaging was done on
a 50 by 50 rectangular longitude by latitude pixel mesh. The pixels where no data were recorded
are left blank. Data are present in only 836 out of 2500 pixels.
The scatterplot on the left panel of Fig. 1.10 is household income excluding income
from the wife’s work versus the wife’s age for 3382 households in the year 1987.
These data are part of a much larger dataset from the Michigan Panel Study of
Income Dynamics (e.g. Lee 1995). The 1987 cross-section is in the data frame
Workinghours in the R package Ecdat.
A question of interest is the impact of wife’s age on other household income
but, unlike the situation in Fig. 1.1, the response variable here is highly skewed
and includes some strong positive outliers. The methodology used to fit the mean
response curve to the Fig. 1.1 scatterplot is not appropriate for the Fig. 1.10 scatter-
plot and the conditional mean function is not necessarily a good way of summarizing
the response/predictor relationship. Instead we use conditional quantile functions.
The right panel shows the 1, 5, 25, 50, 75, 95, and 99% estimated quantiles of other
household income conditional on the wife’s age. This plot allows appreciation for
the effect of the predictor on the response in a different way than Fig. 1.1 and is
more appropriate for such a skewed and outlier-ridden response variable.
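
The reason quantiles suit a skewed, outlier-prone response can be illustrated with the check loss that underlies quantile regression. The sketch below uses base R only and is not the rqss() fit shown in Fig. 1.10: it ignores the predictor and estimates unconditional quantiles of a synthetic lognormal sample by minimizing the check loss, then compares them with sample quantiles. The helper names check_loss and fit_quantile are hypothetical.

```r
# Quantiles as minimizers of the check loss (synthetic skewed data, base R only).
set.seed(8)
y <- rlnorm(2000, meanlog = 3, sdlog = 1)          # skewed, heavy right tail

# Check loss: tau * u for u >= 0 and (tau - 1) * u for u < 0.
check_loss <- function(u, tau) u * (tau - (u < 0))

# The tau-th quantile minimizes the summed check loss over constants q.
fit_quantile <- function(y, tau) {
  optimize(function(q) sum(check_loss(y - q, tau)), interval = range(y))$minimum
}

taus <- c(0.25, 0.50, 0.75)
est <- sapply(taus, function(tau) fit_quantile(y, tau))
print(cbind(tau = taus, check_loss_fit = est,
            sample_quantile = as.numeric(quantile(y, taus))))

# The mean is pulled toward the right tail; the median is not.
print(c(mean = mean(y), median = median(y)))
```

Quantile regression replaces the constant q with a function of the predictor, so that each quantile level gets its own curve.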
The Workinghours data frame has data on several other variables such as
education level of the wife, occupation of the husband, and number of children
in the household. In Chap. 6 we explore semiparametric quantile regression models
that incorporate multiple predictor effects.
Fig. 1.10 Left panel: Household income from sources other than the wife’s work (thousands
of U.S. dollars) versus wife’s age (years). Right panel: Zoomed view of left panel plot with
restriction to households for which other income does not exceed $250,000. The curves correspond
to nonparametric quantile function estimates with color-coding for the level of the quantile. The
estimates were obtained using the function rqss() in the R package quantreg. (Axes: other
household income ('000 $US) versus wife's age in years. Legend entries that survive in this copy:
99%, 95%, 75%, 25%, 5%, and 1% quantiles.)
Table 1.2 lists and briefly describes each of the datasets used in this book, and the
sections in which they are analyzed.
Table 1.2 Datasets used in this book and sections where they are analyzed.

R data frame (package)          Brief description                                              Sections
WarsawApts (HRW)                Apartments sold in Warsaw, Poland, during 2007–2009            1.3, 2.2–2.10, 2.12, 3.6
BostonMortgages (HRW)           Mortgage applications of residents of Boston, USA              1.3, 3.2, 3.3, 3.6
growthIndiana (HRW)             Longitudinal heights of adolescents in Indiana, USA            1.3, 4.3
SydneyRealEstate (HRW)          Real estate sold in Sydney, Australia, during 2001             1.3, 5.3
Workinghours (Ecdat)            Income and attributes of households in Michigan, USA           1.3, 6.2
OFP (Ecdat)                     Physician visits and attributes of elderly USA residents       3.2, 3.3
Caschool (Ecdat)                School test scores and attributes in California, USA           3.3, 3.4, 3.4.2, 3.5
femSBMD (HRW)                   Longitudinal spinal bone mineral density in USA adolescents    4.2, 5.7
protein (HRW)                   Longitudinal protein intake from a USA nutrition study         4.4
indonRespir (HRW)               Longitudinal respiratory infection status of children in       4.5
                                Indonesia
ozoneSub (HRW)                  Ozone concentrations in the midwest region of the USA          5.2
capm (HRW)                      Daily USA stock returns and indices during 1993–2003           5.4
gasoline (refund)               Near infrared spectra and octane numbers for gasoline samples  5.6
brainImage (HRW)                Brain image coronal slice                                      5.8
DTI (refund)                    Diffusion tensor imaging data                                  6.3
tecator (fda.usc)               Content of meat samples                                        6.3, 6.5
yields (HRW)                    Yield curves                                                   6.6
carAuction (HRW)                Attributes of auction-bought cars                              6.7
PimaIndiansDiabetes (mlbench)   Diabetes status and attributes of the USA study of Pima        6.8
                                Indians
BCR (HRW)                       Mental health scores from a drug/placebo clinical trial        6.8
CHD (HRW)                       Coronary heart status and attributes from a U.S. study         6.8
coral (HRW)                     Alive/death status of coral organisms in French Polynesia      6.9
Ozone (mlbench)                 Daily ozone levels and weather, Los Angeles area during 1976   6.9

Datasets used in exercises only are not listed here but can be found in the index.
(Goldsmith et al. 2016), rstan (Guo et al. 2017), and VGAM (Yee 2017). The index
has the full list of packages mentioned in the book. Our intention is to describe in a
straightforward way the relevant steps needed to conduct semiparametric regression
analyses using R packages such as these.
This book will be useful to anybody who has a basic knowledge of R and is
interested in exploring and modeling data where simple parametric assumptions are
not realistic. Biostatisticians, data analysts, econometricians, and social scientists
should find this book of special interest. We expect that the material presented
here will be accessible to any reader who has taken courses in linear regression
and generalized linear models. To fully appreciate Bayesian model fitting, an
introductory course in Bayesian inference will be helpful.
In Chap. 2 we give a detailed account of the main semiparametric regres-
sion building block: penalized splines. Chapter 3 covers the important family of
semiparametric regression models known as generalized additive models. Then
in Chap. 4 we deal with extensions to grouped data, which includes longitudinal,
multilevel, panel, and small area data as special cases. Chapter 5 is concerned
with bivariate extensions of penalized splines and spatial semiparametric regression
models. The last chapter, Chap. 6, is a collection of additional topics such as building
in robustness and accounting for missing observations in semiparametric regression
analysis.