Semiparametric Regression With R: Jaroslaw Harezlak - David Ruppert - Matt P. Wand


Jaroslaw Harezlak • David Ruppert • Matt P. Wand

Semiparametric Regression with R
Jaroslaw Harezlak
School of Public Health
Indiana University Bloomington
Bloomington, Indiana, USA

David Ruppert
Department of Statistical Science
Cornell University
Ithaca, New York, USA

Matt P. Wand
School of Mathematical and Physical Sciences
University of Technology Sydney
Ultimo, New South Wales, Australia

Use R!
ISSN 2197-5736    ISSN 2197-5744 (electronic)
ISBN 978-1-4939-8851-8    ISBN 978-1-4939-8853-2 (eBook)
https://doi.org/10.1007/978-1-4939-8853-2

Library of Congress Control Number: 2018953727

© Springer Science+Business Media, LLC, part of Springer Nature 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Science+Business Media, LLC
part of Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Contents

1 Introduction
1.1 Semiparametric Regression
1.2 The R Language
1.3 Some Examples
1.3.1 Warsaw Apartments
1.3.2 Boston Mortgage Applications
1.3.3 Indiana Adolescent Growth Data
1.3.4 Sydney Real Estate Data
1.3.5 Michigan Panel Study of Income Dynamics Data
1.3.6 All of the Datasets Used in This Book
1.4 Aim of This Book

2 Penalized Splines
2.1 Introduction
2.2 Penalized Spline Basics
2.3 Choosing the Smoothing Parameter
2.4 Choosing the Basis Size
2.5 Checking the Residuals
2.6 Effective Degrees of Freedom
2.7 Mixed Model-Based Penalized Splines
2.8 Variability Bands
2.9 Hypothesis Testing
2.10 Bayesian Penalized Splines
2.10.1 Multiple Chains Extension
2.11 Choosing Between Different Penalized Spline Approaches
2.12 Penalized Splines with Factor Effects
2.12.1 A Simple Semiparametric Additive Model
2.12.2 A Simple Semiparametric Interaction Model
2.12.3 A Simple Factor-by-Curve Model
2.13 Further Reading
2.14 Exercises

3 Generalized Additive Models
3.1 Introduction
3.2 Generalized Linear Models
3.2.1 Example: Mortgage Applications in Boston
3.2.2 Example: Physician Offices Visits
3.3 Generalized Additive Models
3.3.1 Example: Test Scores of Children in California School Districts
3.3.2 Example: Physician Office Visits
3.3.3 Example: Mortgage Applications in Boston
3.4 Model Selection
3.4.1 Stepwise Model Selection
3.4.2 Penalty-Based Model Selection
3.5 Extension to Vector Responses
3.6 Extension to Factor-by-Curve Interactions
3.6.1 Example: Mortgage Applications in Boston
3.7 Further Reading
3.8 Exercises

4 Semiparametric Regression Analysis of Grouped Data
4.1 Introduction
4.2 Additive Mixed Models
4.2.1 Bayesian Approach
4.2.2 Serial Correlation Extension
4.3 Models with Group-Specific Curves
4.4 Marginal Models
4.4.1 Marginal Nonparametric Regression
4.4.2 Additive Model Extension
4.4.3 Incorporation of Interactions
4.5 Extension to Non-Gaussian Response Variables
4.5.1 Penalized Quasi-Likelihood Analysis
4.5.2 Markov Chain Monte Carlo Analysis
4.6 Further Readings
4.7 Exercises

5 Bivariate Function Extensions
5.1 Introduction
5.2 Bivariate Nonparametric Regression
5.2.1 Example: Ozone Levels in Midwest USA
5.3 Geoadditive Models
5.3.1 Example: House Prices in Sydney, Australia
5.4 Varying-Coefficient Models
5.4.1 Example: Daily Stock Returns
5.5 Additional Semiparametric Regression Models
5.6 Covariance Function Estimation
5.6.1 Example: Gasoline Near-Infrared Spectra
5.7 Estimating a Covariance Function with Sparse Data
5.7.1 Example: Spinal Bone Mineral Density Data
5.8 The Sandwich Smoother
5.8.1 Example: Brain Imaging
5.9 Further Reading
5.10 Exercises

6 Selection of Additional Topics
6.1 Introduction
6.2 Robust and Quantile Semiparametric Regression
6.2.1 Robust and Resistant Scatterplot Smoothing
6.2.2 Robust Semiparametric Regression
6.2.3 Quantile Semiparametric Regression
6.3 Scalar-on-Function Linear Regression
6.3.1 Example: Diffusion Tensor Imaging Data
6.3.2 Example: Fat Content of Meat Samples
6.3.3 Example: Octane and Near Infrared Spectra
6.4 Scalar-on-Function Additive Models
6.4.1 Example: Fat Content of Meat Samples
6.5 Additive Models Using Principal Component Scores
6.5.1 Example: Fat Content of Meat Samples
6.6 Function-on-Function Linear Regression
6.6.1 Example: Yield Curves
6.7 Kernel Machines
6.7.1 Support Vector Machine Classification
6.8 Missing Data and Measurement Error
6.8.1 Graphical Models Approach to Bayesian Semiparametric Regression
6.8.2 Nonparametric Regression with a Partially Observed Gaussian Predictor
6.8.3 Example: Pima Indians Diabetes Study
6.8.4 Example: Mental Health Clinical Trial
6.8.5 Extension to Finite Mixture Models
6.9 Arbitrarily Complicated Bayesian Semiparametric Regression
6.9.1 Binary Response Group-Specific Curves Model
6.9.2 Heteroscedastic Additive Model with Missingness
6.9.3 Practical Aspects of Graphical Models Approach to Bayesian Semiparametric Regression
6.10 Further Reading
6.11 Exercises

References

Index

Chapter 1
Introduction

1.1 Semiparametric Regression

Regression is used to understand the relationships between predictor variables and
response variables and to predict the latter from the former. In parametric
regression, the effect of each predictor has a simple form, such as a linear or
exponential function, so that its overall shape is dictated by the model, not the data.
In contrast, with nonparametric regression the model is flexible enough to allow any
smooth trend in the data; see Fig. 1.1 for an example. Semiparametric regression
combines parametrically modeled effects for some predictors with nonparametric
modeling of the effects of the other variables.
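
The distinction can be made concrete with a minimal sketch on simulated data. The variable names, the simulated trend, and the choice of the gam() function from the mgcv package are assumptions made for illustration only; they are not taken from the analyses in this book. One predictor enters linearly and the other through a smooth function whose shape is estimated from the data.

```r
library(mgcv)   # recommended package shipped with standard R installations

# Simulated data (hypothetical): x1 has a linear effect, x2 a smooth nonlinear effect.
set.seed(1)
n  <- 300
x1 <- rbinom(n, 1, 0.5)
x2 <- runif(n)
y  <- 1 + 0.5 * x1 + sin(2 * pi * x2) + rnorm(n, sd = 0.3)

# Parametric fit: the shape of the x2 effect is fixed in advance as linear.
fitPar <- lm(y ~ x1 + x2)

# Semiparametric fit: x1 is modeled parametrically, x2 through a penalized spline.
fitSemi <- gam(y ~ x1 + s(x2), method = "REML")

# Compare the two fits and inspect the estimated linear coefficient of x1.
print(AIC(fitPar, fitSemi))
print(coef(fitSemi)["x1"])
```

Under these assumptions the semiparametric fit should have the smaller AIC, because a straight line cannot follow the sine-shaped trend in x2.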
Because of its flexibility, semiparametric regression has proven to be of great
value in many applications in fields as diverse as astronomy, biology, medicine,
economics, and finance. Using semiparametric regression models, one can extract
important information from often messy datasets. An introduction to the field can
be found in the book Semiparametric Regression by Ruppert et al. (2003) and its
follow-up survey article, Ruppert et al. (2009).

1.2 The R Language

R (R Core Team 2016) is a major programming language for statistical
methodology. The emergence of R around the start of mainstream Internet usage in
the mid-1990s led to a revolution of sorts and has allowed statistical methodologists
from around the world to share their code much more easily than ever before,
using so-called packages. The primary website for R is the Comprehensive
R Archive Network (cran.r-project.org), which contains the latest version of R
and thousands of packages. Familiarity with R is assumed throughout this book.
A reader with no such familiarity should first consult some of the numerous R
tutorials and notes that are available on the Internet, including on the Comprehensive
R Archive Network.

J. Harezlak et al., Semiparametric Regression with R, Use R!,
https://doi.org/10.1007/978-1-4939-8853-2_1

[Figure 1.1: scatterplot of area (square meters) per million zloty against construction date (year), with a fitted curve and shaded band.]

Fig. 1.1 Area/price ratio versus construction date for the Warsaw apartment data in the data frame
WarsawApts within the R package HRW. The curve is an estimate of the mean area/price ratio
given the construction date. The shaded region indicates approximately 95% pointwise confidence
intervals.

1.3 Some Examples

To illustrate features of semiparametric regression, in this chapter we discuss one
dataset taken from each of the subsequent chapters. Table 1.2 at the end of the
chapter describes all datasets used in this book.

1.3.1 Warsaw Apartments

The Warsaw apartments dataset is used throughout this book’s early chapters to
illustrate the most fundamental semiparametric regression models. It contains data
on several variables for 409 apartments sold in the city of Warsaw, Poland, during
2007–2009. The data are stored in the data frame WarsawApts within the R
package, HRW, that accompanies this book. This data frame is a subset of one named
apartments in the R package PBImisc (Biecek 2014). The full description of
apartments can be found in the PBImisc package’s help files.

A question of interest is how the ratio of floor area to price depends on the
construction date. The basic unit of currency in Poland is the złoty. Figure 1.1
contains a plot of area per million złoty versus construction date with a nonpara-
metric regression function estimate and variability bands which have approximately
95% pointwise confidence interval validity. “Pointwise” means that there is a 95%
coverage probability at each value of the predictor. We see from Fig. 1.1 that there is
an interesting nonlinear relationship between area/price ratio and construction date.
The first three turning points in the mean function correspond to major events in
Warsaw’s history: (1) the German invasion of 1939, (2) the end of World War II and
beginning of communist rule in 1945, and (3) the start of martial law in 1981. During
communist rule, building quality declined. Hence buildings constructed in 1975
have a larger mean area/price ratio than those constructed before 1940.
Poland became a democracy in 1989 and around 2000 pre-war building quality was
restored. In Chap. 2, we use the WarsawApts dataset to illustrate the basic concepts
of semiparametric regression modeling.
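
As a rough illustration of how pointwise variability bands of this kind can be computed, the following sketch fits a penalized spline to simulated data and adds and subtracts about two standard errors at each grid value. The simulated data, the variable names, and the use of gam() from the mgcv package are assumptions for illustration; this is not necessarily how Fig. 1.1 was produced.

```r
library(mgcv)

# Simulated stand-ins (hypothetical) for construction date and area/price ratio.
set.seed(2)
n <- 400
x <- sort(runif(n, 1935, 2010))
y <- 100 + 20 * sin((x - 1935) / 12) + rnorm(n, sd = 15)

# Penalized spline fit with the smoothing parameter chosen by REML.
fit <- gam(y ~ s(x, bs = "cr", k = 30), method = "REML")

# Estimate and standard error at each value of a grid of predictor values.
xg   <- seq(min(x), max(x), length.out = 201)
pred <- predict(fit, newdata = data.frame(x = xg), se.fit = TRUE)

# Approximate 95% pointwise limits: the interval is formed separately at each xg.
lower <- pred$fit - qnorm(0.975) * pred$se.fit
upper <- pred$fit + qnorm(0.975) * pred$se.fit

# Shaded band, data, and fitted curve.
plot(x, y, type = "n", xlab = "x (simulated)", ylab = "y (simulated)")
polygon(c(xg, rev(xg)), c(lower, rev(upper)), col = "grey80", border = NA)
points(x, y, col = "dodgerblue")
lines(xg, pred$fit, lwd = 2)
```

Because the limits are built one grid value at a time, the nominal 95% coverage applies at each predictor value separately and not to the whole curve at once.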
Another question of possible interest is “Are there differences between districts of
Warsaw in terms of how construction date impacts the area/price ratio?” Figure 1.2
plots the data of Fig. 1.1 broken down according to district. This plot uses graphics
supported by the R package lattice (Sarkar 2017). The regression function
estimates and approximate pointwise confidence intervals are obtained individually
for each district. Some differences among the districts are apparent. For example,
the Mokotow curve is higher than that for Srodmiescie, the latter being the central
business district of Warsaw. This suggests that buyers get more floor space for their
money in Mokotow than in Srodmiescie for apartments built around the same time.

[Figure 1.2: four panels (Mokotow, Srodmiescie, Wola, Zoliborz), each plotting area (square meters) per million zloty against construction date (year).]

Fig. 1.2 The same data as shown in Fig. 1.1 but broken down according to the district in Warsaw
in which each apartment is located. The curve in each panel is an estimate of the mean area/price
ratio given the construction date for that district treated separately. The shaded regions indicate
approximately 95% pointwise confidence intervals.
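
A multi-panel display of this type can be sketched with the lattice function xyplot(), conditioning on the district factor so that each panel is smoothed separately. The data below are simulated and the column names are hypothetical. The per-panel smoother is lattice's panel.loess(), a local regression smoother swapped in for simplicity rather than the penalized splines used in this book.

```r
library(lattice)   # recommended package shipped with standard R installations

# Simulated data (hypothetical): four districts whose curves differ by a vertical shift.
set.seed(6)
districts <- c("Mokotow", "Srodmiescie", "Wola", "Zoliborz")
dat <- data.frame(
  district = factor(rep(districts, each = 100)),
  date     = runif(400, 1935, 2010)
)
shift <- c(Mokotow = 15, Srodmiescie = -10, Wola = 0, Zoliborz = 5)
dat$ratio <- 100 + unname(shift[as.character(dat$district)]) +
  20 * sin((dat$date - 1935) / 12) + rnorm(400, sd = 15)

# One panel per district; the smoother sees only that district's data.
p <- xyplot(ratio ~ date | district, data = dat,
            xlab = "construction date (year)",
            ylab = "area/price ratio (simulated)",
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)                     # scatterplot for this district
              panel.loess(x, y, col = "black", lwd = 2)   # district-specific smooth
            })
print(p)
```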

1.3.2 Boston Mortgage Applications

Generalized additive models (GAMs) are useful when there are several predictors
each having a nonlinear effect. In GAMs, the linear predictor is a sum of nonpara-
metrically modeled functions of univariate predictors. GAMs are covered in Chap. 3.
We illustrate GAMs using a dataset concerning mortgage applications in Boston,
USA, during the years 1997–1998. The data frame BostonMortgages in the
HRW package contains data on several variables concerning 2380 applications.
BostonMortgages is a subset of the Hdma data frame in the package Ecdat
(Croissant 2016). The name “Hdma” is an apparent typographic error and should be
Hmda, which stands for “Home Mortgage Disclosure Act.” We selected a subset of
the predictors and deleted cases with missing values to create this smaller dataset.
The response of interest is deny, the status of the mortgage application which is
coded as “yes” when the mortgage application was denied and “no” otherwise. We
are interested in developing a regression model for the probability that a mortgage
application is denied.
Figure 1.3 is a visual display of the data in which the variable of primary interest,
indicator of mortgage application denied, is plotted against the 12 other variables
in BostonMortgages. The yes/no variables are coded as 0 = no and 1 = yes. To
aid visualization, jittering has been applied to the variables that take discrete values.
There are 12 possible predictors but, for now, we concentrate on the predictor
ratio of the debt payments to total income which is shortened to debt payments
to income ratio in Fig. 1.3. The curve in Fig. 1.4 shows that the probability
that a mortgage is denied is decreasing in the range from 0 to 0.3 of the debt
payments to income ratio and is increasing after 0.3. The shaded region has a
pointwise approximate 95% confidence interval interpretation. In Sect. 3.3.3, we
will incorporate additional predictors that feature in Fig. 1.3.
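
A fit of this general kind can be sketched as a generalized additive model with a binary response and a logit link. The data below are simulated, with a hypothetical U-shaped denial probability and hypothetical variable names; only the use of gam() from mgcv follows the caption of Fig. 1.4. The band is computed on the logit scale and then transformed, so that its limits stay between 0 and 1.

```r
library(mgcv)

# Simulated data (hypothetical): denial probability is lowest near a ratio of 0.3.
set.seed(3)
n         <- 2000
debtRatio <- runif(n, 0, 1)
pTrue     <- plogis(-2.5 + 12 * (debtRatio - 0.3)^2)
deny      <- rbinom(n, 1, pTrue)

# Logistic GAM: the logit of the denial probability is a smooth function of debtRatio.
fit <- gam(deny ~ s(debtRatio), family = binomial, method = "REML")

# Estimate and standard error on the logit (link) scale over a grid.
grid <- data.frame(debtRatio = seq(0, 1, length.out = 101))
pred <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)

# Back-transform to the probability scale.
pHat  <- plogis(pred$fit)
lower <- plogis(pred$fit - qnorm(0.975) * pred$se.fit)
upper <- plogis(pred$fit + qnorm(0.975) * pred$se.fit)

plot(grid$debtRatio, pHat, type = "l", ylim = c(0, 1),
     xlab = "debt payments to income ratio (simulated)",
     ylab = "estimated probability of denial")
lines(grid$debtRatio, lower, lty = 2)
lines(grid$debtRatio, upper, lty = 2)
```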
Munnell et al. (1996) investigated whether race was a factor in the denial of
mortgage applications after adjustment for the other variables. The variable black
is the indicator of Black or Hispanic ethnicity. In Chap. 3 we investigate the effect of
black using semiparametric regression to adjust for possible confounding variables.

[Figure 1.3: twelve panels, each plotting the indicator of mortgage application denied against one of: debt to income ratio; housing expen. to income ratio; loan to asses. prop. value ratio; credit score (low good); mortgage credit score (low good); public bad credit record?; mortgage insurance denied?; applicant self-employed?; applicant single?; unemploy. rate in applic.'s indust.; property a condominium?; applicant black?]

Fig. 1.3 Plots of indicator of a mortgage application denied against the other variables in the data
frame BostonMortgages within the R package HRW. The yes/no variables are coded: 0 = no and
1 = yes. To aid visualization, jittering has been applied to the discrete variables data.

[Figure 1.4: indicator of mortgage application denied plotted against debt payments to income ratio, with a fitted curve and shaded band.]

Fig. 1.4 Estimated probability of mortgage denial as a function of the debt payments to income
ratio based on the data shown in the top-left panel of Fig. 1.3. The blue circles show the data
with jittering of the response values to aid visualization. The shaded region is an approximate
95% confidence band. This fit was obtained using the gam() function in the R package mgcv;
see Chap. 3. Of the 2380 mortgage applications, 5 have debt payments to income ratios between
1.16 and 1.42 and one has a debt payments to income ratio of 3. These cases were used during
estimation but, to focus attention on the majority of the cases, they are not shown in the plot.

Table 1.1 Cross-tabulation of adolescents by gender and race in the Indiana adolescent
growth dataset.

          Black   White
Female       30      70
Male         28      88

1.3.3 Indiana Adolescent Growth Data

The Indiana adolescent growth data were obtained from a study of the mechanisms
of human hypertension development conducted at the Indiana University School of
Medicine, Indianapolis, USA, that started in the 1980s and is still continuing. Pratt
et al. (1989) contains a full description of the study. The data are from a longitudinal
study and are a special case of grouped data, which is the topic of Chap. 4.
The Indiana adolescent growth dataset is stored in the data frame named
growthIndiana in the HRW package. Note that growthIndiana is restricted to
the subset of 216 adolescents in the original study who had at least nine height
measurements. Table 1.1 is a cross-tabulation of the adolescents by race and gender.
Figure 1.5 shows the entire dataset using lattice graphics in R. The panels in
Fig. 1.5 plot height against age for each of the 216 adolescents, with color-coding
according to gender/race status. Such data are often referred to as growth curves. It is
not easy to fit these data using common parametric models. An additional challenge
arises from proper accounting for dependencies between measurements on the same
adolescent.

[Figure 1.5: one panel per adolescent, each plotting height (centimetres) against age (years), color-coded as white females, white males, black females, and black males.]

Fig. 1.5 The Indiana adolescent growth data stored in the data frame growthIndiana in the R
package HRW. Each panel plots height (centimeters) against age (years) for each of 216 adolescents.
Color-coding is used to indicate combined gender/race status.
Comparison of growth between the gender and race categories is often of interest
and will be studied in Chap. 4. Figure 1.6 is a different lattice graphics plot
of the same data shown in Fig. 1.5 but with the panels corresponding to the four
gender/race combinations. This better enables cross-category comparisons. For
example, black males between 15 and 20 years of age tend to be taller than black
females in the same age bracket.

[Figure 1.6: four panels (black females, white females, black males, white males), each plotting height (cm) against age (years).]

Fig. 1.6 The same data as shown in Fig. 1.5 but with the panels corresponding to the four
gender/race combinations.
To give a flavor of semiparametric regression analyses of interest for such data,
described in Chap. 4, Fig. 1.7 shows two estimated contrast functions in which males
and females are compared within their own race categories. The estimates and
variability bands are based on a Bayesian semiparametric regression model with
approximate inference achieved via Markov chain Monte Carlo sampling facilitated
by the R package rstan (Guo et al. 2017). This approach is introduced in Sect. 2.10.
From Fig. 1.7 we see that there is little difference, statistically, between males
and females up to the age of 12. After that males are significantly taller, with the
gap bigger for the black race than it is for the white race. There is more variability
in the black race contrast function since it is based on fewer observations—only
about a quarter of the subjects in the study are black.

[Figure 1.7: two panels (male vs. female for black adolescents; male vs. female for white adolescents), each plotting mean difference in height (centimeters) against age (years).]

Fig. 1.7 Estimated contrast functions and approximate pointwise 95% credible sets based on
a Bayesian semiparametric regression model fitted to the data shown in Figs. 1.5 and 1.6.
Approximate Bayesian inference, based on Markov chain Monte Carlo, was performed using the
R package rstan.

1.3.4 Sydney Real Estate Data

The Sydney real estate data were collected as a part of an unpublished study by A.
Chernih and M. Sherris at the University of New South Wales, Australia. The data
consist of 39 variables on 37,676 houses sold in Sydney, Australia, during the year
2001 and are stored in the data frame SydneyRealEstate in the HRW package.
Of central interest is the nature of the dependence of house prices on the
other variables. Figure 1.8 depicts some of the individual dependencies through
scatterplots of the logarithm of sale price against 8 of the potential predictors.
For example, the top-left panel in Fig. 1.8 shows the intuitively obvious positive
correlation between price and lot size. Underneath that, distance to the coastline is
seen to have a negative impact on price.
Figure 1.9 shows the average log-prices on a 50 × 50 equal-sized geographical
mesh. A strong spatial effect is apparent. The higher-priced areas tend to be near
Sydney’s waterways and ocean front. Rather than estimating univariate regression
functions, a bivariate function of longitude and latitude seems to be appropriate to
model the behavior exhibited in Fig. 1.9. The bivariate extension of semiparametric
regression analysis is dealt with in Chap. 5.
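
The mesh averaging behind a display such as Fig. 1.9 can be sketched in base R by binning each coordinate into 50 equal-width intervals and averaging the response within each pixel. The data and variable names below are simulated and hypothetical, not the SydneyRealEstate columns. Pixels containing no observations come out as NA and are left blank by image().

```r
# Simulated data (hypothetical): log price increases from west to east.
set.seed(4)
n         <- 5000
longitude <- runif(n, 150.6, 151.4)
latitude  <- runif(n, -34.2, -33.5)
logPrice  <- 13 + 1.5 * (longitude - 150.6) + rnorm(n, sd = 0.3)

# Break each coordinate range into 50 equal-width bins.
nMesh     <- 50
lonBreaks <- seq(150.6, 151.4, length.out = nMesh + 1)
latBreaks <- seq(-34.2, -33.5, length.out = nMesh + 1)
lonBin    <- cut(longitude, breaks = lonBreaks, include.lowest = TRUE)
latBin    <- cut(latitude,  breaks = latBreaks, include.lowest = TRUE)

# 50 x 50 matrix of pixel averages; NA marks pixels with no data.
meshMean <- tapply(logPrice, list(lonBin, latBin), mean)
print(dim(meshMean))
print(sum(!is.na(meshMean)))   # number of pixels that contain data

# Display using the pixel midpoints as plotting coordinates.
lonMid <- (lonBreaks[-1] + lonBreaks[-(nMesh + 1)]) / 2
latMid <- (latBreaks[-1] + latBreaks[-(nMesh + 1)]) / 2
image(lonMid, latMid, meshMean, xlab = "longitude", ylab = "latitude")
```

Pixel averages of this kind only describe the data; a bivariate smooth borrows strength across neighbouring pixels and also gives estimates where pixels are empty.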

[Figure 1.8: eight panels, each plotting log(sale price (dollars)) against one of: lot size (square meters); average income of suburb; distance to coastline (kilometers); distance to hospital (kilometers); distance to post office (kilometers); foreigner ratio; crime rate; nitrogen dioxide level.]

Fig. 1.8 Plots of logarithm of sale price (dollars) against some of the other variables in the data
frame SydneyRealEstate within the R package HRW. To aid visualization, a 10% random subset
of the data is used in the plots.

[Figure 1.9: map of average log price over longitude (horizontal axis) and latitude (vertical axis).]

Fig. 1.9 The spatial variation in log price of the houses sold in Sydney, Australia, during 2001
based on the dataset SydneyRealEstate within the R package HRW. The averaging was done on
a 50 by 50 rectangular longitude by latitude pixel mesh. The pixels where no data were recorded
are left blank. Data are present in only 836 out of 2500 pixels.

1.3.5 Michigan Panel Study of Income Dynamics Data

The scatterplot in the left panel of Fig. 1.10 shows household income excluding income
from the wife’s work versus the wife’s age for 3382 households in the year 1987.
These data are part of a much larger dataset from the Michigan Panel Study of
Income Dynamics (e.g. Lee 1995). The 1987 cross-section is in the data frame
Workinghours in the R package Ecdat.
A question of interest is the impact of wife’s age on other household income
but, unlike the situation in Fig. 1.1, the response variable here is highly skewed
and includes some strong positive outliers. The methodology used to fit the mean
response curve to the Fig. 1.1 scatterplot is not appropriate for the Fig. 1.10 scatter-
plot and the conditional mean function is not necessarily a good way of summarizing
the response/predictor relationship. Instead we use conditional quantile functions.
The right panel shows the 1, 5, 25, 50, 75, 95, and 99% estimated quantiles of other
household income conditional on the wife’s age. This plot allows appreciation for
the effect of the predictor on the response in a different way than Fig. 1.1 and is
more appropriate for such a skewed and outlier-ridden response variable.
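
The idea behind quantile-based summaries can be illustrated with a small base R sketch. It is not the rqss() fit used for Fig. 1.10. It shows only the unconditional case: the tau-th sample quantile minimizes the so-called check loss, and for a skewed response the mean and median can differ considerably. The function checkLoss() and the simulated incomes are assumptions made for this illustration.

```r
# Check (pinball) loss; its minimizer over q is the tau-th quantile of y.
checkLoss <- function(q, y, tau) {
  u <- y - q
  sum(u * (tau - (u < 0)))
}

# Simulated, strongly right-skewed "income" values (not the Workinghours data).
set.seed(1)
y <- rlnorm(1001, meanlog = 3, sdlog = 1)

# Minimize the check loss numerically for several quantile levels.
tauVec <- c(0.05, 0.25, 0.50, 0.75, 0.95)
qHat <- sapply(tauVec, function(tau)
  optimize(checkLoss, interval = range(y), y = y, tau = tau, tol = 1e-6)$minimum)

# The minimizers agree with the sample quantiles up to numerical tolerance.
print(rbind(checkLossMinimizer = qHat,
            sampleQuantile = quantile(y, tauVec, type = 1)))

# For skewed data the mean is pulled towards the long right tail.
print(c(mean = mean(y), median = median(y)))
```

Quantile regression replaces the constant q by a function of the predictor, so the same loss yields conditional quantile curves such as those in the right panel.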
The Workinghours data frame has data on several other variables such as
education level of the wife, occupation of the husband, and number of children
in the household. In Chap. 6 we explore semiparametric quantile regression models
that incorporate multiple predictor effects.

[Figure 1.10: two panels plotting other household income ('000 $US) against wife's age in
years (20 to 60). The right panel is overlaid with 1%, 5%, 25%, 50%, 75%, 95%, and 99%
quantile curves.]

Fig. 1.10 Left panel: Household income from sources other than the wife’s work (thousands
of U.S. dollars) versus wife’s age (years). Right panel: Zoomed view of left panel plot with
restriction to households for which other income does not exceed $250,000. The curves correspond
to nonparametric quantile function estimates with color-coding for the level of the quantile. The
estimates were obtained using the function rqss() in the R package quantreg.

1.3.6 All of the Datasets Used in This Book

Table 1.2 lists and briefly describes each of the datasets used in this book, and the
sections in which they are analyzed.

1.4 Aim of This Book

Semiparametric regression is a major area of methodological development and is
being used widely in applications. See, for example, Ruppert et al. (2009) for
a summary of the state of affairs in the late 2000s. Nevertheless, we believe
that semiparametric regression should be used even more widely by applied
researchers and that ongoing contributions to the R computing environment make
this increasingly easy. The aim of this book is to demonstrate how semiparametric
regression analyses can be carried out with only minimal knowledge of R. We do not
get into the intricacies of semiparametric regression methodology and its underlying
theory. Instead, our focus is implementation in R.
Relevant R packages include gamlss (Stasinopoulos and Rigby 2017), nlme
(Pinheiro et al. 2017), mgcv (Wood 2017), quantreg (Koenker 2017), refund
(Goldsmith et al. 2016), rstan (Guo et al. 2017), and VGAM (Yee 2017). The index
has the full list of packages mentioned in the book. Our intention is to describe in a
straightforward way the relevant steps needed to conduct semiparametric regression
analyses using R packages such as these.

Table 1.2 Datasets used in this book and sections where they are analyzed.

R data frame (package)         Brief description                                                   Sections
WarsawApts (HRW)               Apartments sold in Warsaw, Poland, during 2007–2009                 1.3, 2.2–2.10, 2.12, 3.6
BostonMortgages (HRW)          Mortgage applications of residents of Boston, USA                   1.3, 3.2, 3.3, 3.6
growthIndiana (HRW)            Longitudinal heights of adolescents in Indiana, USA                 1.3, 4.3
SydneyRealEstate (HRW)         Real estate sold in Sydney, Australia, during 2001                  1.3, 5.3
Workinghours (Ecdat)           Income and attributes of households in Michigan, USA                1.3, 6.2
OFP (Ecdat)                    Physician visits and attributes of elderly USA residents            3.2, 3.3
Caschool (Ecdat)               School test scores and attributes in California, USA                3.3, 3.4, 3.4.2, 3.5
femSBMD (HRW)                  Longitudinal spinal bone mineral density in USA adolescents         4.2, 5.7
protein (HRW)                  Longitudinal protein intake from a USA nutrition study              4.4
indonRespir (HRW)              Longitudinal respiratory infection status of children in Indonesia  4.5
ozoneSub (HRW)                 Ozone concentrations in the midwest region of the USA               5.2
capm (HRW)                     Daily USA stock returns and indices during 1993–2003                5.4
gasoline (refund)              Near infrared spectra and octane numbers for gasoline samples       5.6
brainImage (HRW)               Brain image coronal slice                                           5.8
DTI (refund)                   Diffusion tensor imaging data                                       6.3
tecator (fda.usc)              Content of meat samples                                             6.3, 6.5
yields (HRW)                   Yield curves                                                        6.6
carAuction (HRW)               Attributes of auction-bought cars                                   6.7
PimaIndiansDiabetes (mlbench)  Diabetes status and attributes of the USA study of Pima Indians     6.8
BCR (HRW)                      Mental health scores from a drug/placebo clinical trial             6.8
CHD (HRW)                      Coronary heart status and attributes from a U.S. study              6.8
coral (HRW)                    Alive/death status of coral organisms in French Polynesia           6.9
Ozone (mlbench)                Daily ozone levels and weather in the Los Angeles area during 1976  6.9

Datasets used in exercises only are not listed here but can be found in the index.

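
Before working through the examples, a reader may wish to check which of the packages named in this section are already installed. The following is one possible way to do so in base R; the object names are arbitrary and nothing is installed or loaded by this sketch.

```r
# Packages named in this section.
pkgs <- c("gamlss", "nlme", "mgcv", "quantreg", "refund", "rstan", "VGAM")

# system.file() returns "" for a package that is not installed,
# so this check does not load any package.
isInstalled <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
                      logical(1))
print(isInstalled)

# Report the packages that would still need to be installed.
missingPkgs <- pkgs[!isInstalled]
if (length(missingPkgs) > 0) {
  message("Not installed: ", paste(missingPkgs, collapse = ", "))
  # They could then be obtained with install.packages(missingPkgs).
}
```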
This book will be useful to anybody who has a basic knowledge of R and is
interested in exploring and modeling data where simple parametric assumptions are
not realistic. Biostatisticians, data analysts, econometricians, and social scientists
should find this book of special interest. We expect that the material presented
here will be accessible to any reader who has taken courses in linear regression
and generalized linear models. To fully appreciate Bayesian model fitting, an
introductory course in Bayesian inference will be helpful.
In Chap. 2 we give a detailed account of the main semiparametric regres-
sion building block: penalized splines. Chapter 3 covers the important family of
semiparametric regression models known as generalized additive models. Then
in Chap. 4 we deal with extensions to grouped data, which includes longitudinal,
multilevel, panel, and small area data as special cases. Chapter 5 is concerned
with bivariate extensions of penalized splines and spatial semiparametric regression
models. The last chapter, Chap. 6, is a collection of additional topics such as building
in robustness and accounting for missing observations in semiparametric regression
analysis.
